This post continues "To sample or not
to sample..."Part
1 and Part
2. In Part 1, we looked at the general
motivation for sampling. In Part 2, we looked at simple random
sampling.
In this post, we focus on stratified sampling. Recall that one of
the pitfalls of simple random sampling is that not all values for a
given column may be represented.
Stratified sampling overcomes this by ensuring each value for a
specified column (usually the target) is
represented.
The query below illustrates how to get a 60% stratified sample on the CUSTOMERS table with target column BUYER, using the original distribution of BUYER column values 0 and 1. This means that we are interested in roughly 60% of the 1s and 60% of the 0s. Note that we could have tried a simple random sample, and chances are we would achieve a similar distribution, but it is possible to obtain a sample that contains no 1s. However, stratified sampling ensures that we obtain a representative set of 1s and 0s.
WITH TARGET_COUNT AS (SELECT BUYER, count(*) CNT
FROM CUSTOMERS
WHERE BUYER IS NOT NULL
GROUP BY BUYER)
SELECT * -- or filter out partition_row_num
FROM
(SELECT row_number() over(partition by BUYER ORDER BY ORA_HASH(CUST_ID)) partition_row_num, t.*
FROM CUSTOMERS t
WHERE BUYER IS NOT NULL)
WHERE partition_row_num = 1
OR ( BUYER=1
AND ORA_HASH(partition_row_num,((SELECT CNT FROM TARGET_COUNT WHERE BUYER=1)-1),12345) <
(SELECT CNT FROM TARGET_COUNT WHERE BUYER=1) * 60 / 100)
OR ( CD_BUYER=0
AND ORA_HASH(partition_row_num,((SELECT CNT FROM TARGET_COUNT WHERE BUYER=0)-1),12345) <
(SELECT CNT FROM TARGET_COUNT WHERE BUYER=0) * 60 / 100)
In the WITH clause, we first count the number of each target value, excluding nulls. The next key step is to assign a row number to each record partitioned by the target column. In the subquery, all records with target value 1 have row numbers from 1..N and those with 0 have row numbers 1..M, where N and M are the counts of records with 1 and 0 values, respectively. Finally, we include the first row to ensure at least one record is returned, and allow records where the partition_row_num hashes to a value less than 60% of N for 1s or M for 0s.
This can be extended to support multi-class columns simply be adding additional "OR" clauses.
In some cases, we may need to alter the distribution of class values, say from 90% 0s and 10% 1s, to 50% of each, producing a balanced stratified sample. This is desirable in, for example, scenarios involving fraud detection, where incidences of fraud can be rare, and we need to ensure there are sufficient representative cases of each class. Since there are only 10% 1s, the largest sample we can have for balanced target classes is 20% (all of the 1s, and ~11% of the 0s). The following query produces a balanced sample.
WITH TARGET_COUNT AS (SELECT BUYER, count(*) CNT
FROM CUSTOMERS t
WHERE BUYER IS NOT NULL
GROUP BY BUYER)
SELECT * -- or filter out partition_row_num
FROM (SELECT row_number() over(partition by BUYER ORDER BY ORA_HASH(CUST_ID)) partition_row_num, t.*
FROM CUSTOMERS t
WHERE BUYER IS NOT NULL)
WHERE partition_row_num = 1
OR ( BUYER=1 AND ORA_HASH(partition_row_num,((SELECT CNT FROM TARGET_COUNT WHERE BUYER=1)-1),12345) <
.20 * (SELECT SUM(CNT) FROM TARGET_COUNT)) / (SELECT COUNT(*) FROM TARGET_COUNT)
OR ( BUYER=0 AND ORA_HASH(partition_row_num,((SELECT CNT FROM TARGET_COUNT WHERE BUYER=0)-1),12345) <
.20 * (SELECT SUM(CNT) FROM TARGET_COUNT)) / (SELECT COUNT(*) FROM TARGET_COUNT)
The key difference occurs in the top level WHERE clause. The selection criteria differs after the '<' sign, where we take the sample percentage times the number of records divided by the number of target values.
As before, this can be extended to support multi-class columns simply be adding additional "OR" clauses.
The query below illustrates how to get a 60% stratified sample on the CUSTOMERS table with target column BUYER, using the original distribution of BUYER column values 0 and 1. This means that we are interested in roughly 60% of the 1s and 60% of the 0s. Note that we could have tried a simple random sample, and chances are we would achieve a similar distribution, but it is possible to obtain a sample that contains no 1s. However, stratified sampling ensures that we obtain a representative set of 1s and 0s.
WITH TARGET_COUNT AS (SELECT BUYER, count(*) CNT
FROM CUSTOMERS
WHERE BUYER IS NOT NULL
GROUP BY BUYER)
SELECT * -- or filter out partition_row_num
FROM
(SELECT row_number() over(partition by BUYER ORDER BY ORA_HASH(CUST_ID)) partition_row_num, t.*
FROM CUSTOMERS t
WHERE BUYER IS NOT NULL)
WHERE partition_row_num = 1
OR ( BUYER=1
AND ORA_HASH(partition_row_num,((SELECT CNT FROM TARGET_COUNT WHERE BUYER=1)-1),12345) <
(SELECT CNT FROM TARGET_COUNT WHERE BUYER=1) * 60 / 100)
OR ( CD_BUYER=0
AND ORA_HASH(partition_row_num,((SELECT CNT FROM TARGET_COUNT WHERE BUYER=0)-1),12345) <
(SELECT CNT FROM TARGET_COUNT WHERE BUYER=0) * 60 / 100)
In the WITH clause, we first count the number of each target value, excluding nulls. The next key step is to assign a row number to each record partitioned by the target column. In the subquery, all records with target value 1 have row numbers from 1..N and those with 0 have row numbers 1..M, where N and M are the counts of records with 1 and 0 values, respectively. Finally, we include the first row to ensure at least one record is returned, and allow records where the partition_row_num hashes to a value less than 60% of N for 1s or M for 0s.
This can be extended to support multi-class columns simply be adding additional "OR" clauses.
In some cases, we may need to alter the distribution of class values, say from 90% 0s and 10% 1s, to 50% of each, producing a balanced stratified sample. This is desirable in, for example, scenarios involving fraud detection, where incidences of fraud can be rare, and we need to ensure there are sufficient representative cases of each class. Since there are only 10% 1s, the largest sample we can have for balanced target classes is 20% (all of the 1s, and ~11% of the 0s). The following query produces a balanced sample.
WITH TARGET_COUNT AS (SELECT BUYER, count(*) CNT
FROM CUSTOMERS t
WHERE BUYER IS NOT NULL
GROUP BY BUYER)
SELECT * -- or filter out partition_row_num
FROM (SELECT row_number() over(partition by BUYER ORDER BY ORA_HASH(CUST_ID)) partition_row_num, t.*
FROM CUSTOMERS t
WHERE BUYER IS NOT NULL)
WHERE partition_row_num = 1
OR ( BUYER=1 AND ORA_HASH(partition_row_num,((SELECT CNT FROM TARGET_COUNT WHERE BUYER=1)-1),12345) <
.20 * (SELECT SUM(CNT) FROM TARGET_COUNT)) / (SELECT COUNT(*) FROM TARGET_COUNT)
OR ( BUYER=0 AND ORA_HASH(partition_row_num,((SELECT CNT FROM TARGET_COUNT WHERE BUYER=0)-1),12345) <
.20 * (SELECT SUM(CNT) FROM TARGET_COUNT)) / (SELECT COUNT(*) FROM TARGET_COUNT)
The key difference occurs in the top level WHERE clause. The selection criteria differs after the '<' sign, where we take the sample percentage times the number of records divided by the number of target values.
As before, this can be extended to support multi-class columns simply be adding additional "OR" clauses.