To sample or not to sample... Part 3

This post continues "To sample or not to sample..."Part 1 and Part 2. In Part 1, we looked at the general motivation for sampling. In Part 2, we looked at simple random sampling. In this post, we focus on stratified sampling. Recall that one of the pitfalls of simple random sampling is that not all values for a given column may be represented. Stratified sampling overcomes this by ensuring each value for a specified column (usually the target) is represented.

The query below illustrates how to get a 60% stratified sample on the CUSTOMERS table with target column BUYER, using the original distribution of BUYER column values 0 and 1. This means that we are interested in roughly 60% of the 1s and 60% of the 0s. Note that we could have tried a simple random sample, and chances are we would achieve a similar distribution, but it is possible to obtain a sample that contains no 1s. However, stratified sampling ensures that we obtain a representative set of 1s and 0s.

WITH TARGET_COUNT AS (SELECT BUYER, count(*) CNT
                      FROM CUSTOMERS
                      WHERE BUYER IS NOT NULL
                      GROUP BY BUYER)
SELECT *    -- or filter out partition_row_num
FROM
(SELECT row_number() over(partition by BUYER ORDER BY ORA_HASH(CUST_ID)) partition_row_num, t.*
   FROM CUSTOMERS t
   WHERE BUYER IS NOT NULL)
WHERE partition_row_num = 1
OR ( BUYER=1
     AND ORA_HASH(partition_row_num,((SELECT CNT FROM TARGET_COUNT WHERE BUYER=1)-1),12345) <
         (SELECT CNT FROM TARGET_COUNT WHERE BUYER=1) * 60 / 100)
OR ( CD_BUYER=0
     AND ORA_HASH(partition_row_num,((SELECT CNT FROM TARGET_COUNT WHERE BUYER=0)-1),12345) <
         (SELECT CNT FROM TARGET_COUNT WHERE BUYER=0) * 60 / 100)

In the WITH clause, we first count the number of each target value, excluding nulls. The next key step is to assign a row number to each record partitioned by the target column. In the subquery, all records with target value 1 have row numbers from 1..N and those with 0 have row numbers 1..M, where N and M are the counts of records with 1 and 0 values, respectively. Finally, we include the first row to ensure at least one record is returned, and allow records where the partition_row_num hashes to a value less than 60% of N for 1s or M for 0s.

This can be extended to support multi-class columns simply be adding additional "OR" clauses.

In some cases, we may need to alter the distribution of class values, say from 90% 0s and 10% 1s, to 50% of each, producing a balanced stratified sample. This is desirable in, for example, scenarios involving fraud detection, where incidences of fraud can be rare, and we need to ensure there are sufficient representative cases of each class. Since there are only 10% 1s, the largest sample we can have for balanced target classes is 20% (all of the 1s, and ~11% of the 0s). The following query produces a balanced sample.

WITH TARGET_COUNT AS (SELECT BUYER, count(*) CNT
                      FROM CUSTOMERS t
                      WHERE BUYER IS NOT NULL
                      GROUP BY BUYER)
SELECT *    -- or filter out partition_row_num
FROM (SELECT row_number() over(partition by BUYER ORDER BY ORA_HASH(CUST_ID)) partition_row_num, t.*
      FROM CUSTOMERS t
      WHERE BUYER IS NOT NULL)
WHERE partition_row_num = 1
OR ( BUYER=1 AND ORA_HASH(partition_row_num,((SELECT CNT FROM TARGET_COUNT WHERE BUYER=1)-1),12345) <
     .20 * (SELECT SUM(CNT) FROM TARGET_COUNT)) / (SELECT COUNT(*) FROM TARGET_COUNT)
OR ( BUYER=0 AND ORA_HASH(partition_row_num,((SELECT CNT FROM TARGET_COUNT WHERE BUYER=0)-1),12345) <
     .20 * (SELECT SUM(CNT) FROM TARGET_COUNT)) / (SELECT COUNT(*) FROM TARGET_COUNT)

The key difference occurs in the top level WHERE clause. The selection criteria differs after the '<' sign, where we take the sample percentage times the number of records divided by the number of target values.

As before, this can be extended to support multi-class columns simply be adding additional "OR" clauses.

To sample or not to sample... Part 3

Trending Articles

Bath man appears in court charged with attempted murder of a man...

MACLEAN, Allan

Black Angus Grilled Artichokes

Practice Sheet of Right form of verbs for HSC Students

Police blotter for Jan. 12

99 God Status for Whatsapp, Facebook

Rajasthan Board 12th Science Result 2018 name wise- RBSE 12th commerce result...

Notorious Naushad of Ippa gang nabbed

Child Kidnapping: Amy McNeil was kidnapped on her way to school by 5 adults;...

Sonible Smartlimit v1.1.5-R2R

NCERT Solutions for Class 9th Sanskrit Chapter 3 पाथेयम्

मतलबी दोस्त स्टेट्स | Matlabi Dost Status in Hindi – Selfish Friends Status

Arrow Flash 2 – Sinhala Dubbed – Episode 23 – 20th March 2016

[GET] AI Traffic Goldmine

[E² Plugin] HDF-Radio

Universal Multi-Patch v1.3 By RADIXX11

IWAN – Thanks and Praise ( Throw Back Thursday )

RONALD P SONDERGAARD Arrested by Miami-Dade County Corrections on Mar 03, 2017

मुख मैथुन से उठाएं सेक्स का भरपूर मज़ा, जानें क्या है इसका सही तरीकामुख मैथुन...

HSSC Excise & Taxation Inspector Result 2017 Scorecard/ Category Wise Merit List