To sample or not to sample... Part 4

This post continues "To sample or not to sample..."Part 1, Part 2, and Part 3. In Part 1, we looked at the general motivation for sampling. In Part 2, we looked at simple random sampling without replacement, the most common sampling technique. In Part 3, we looked at stratified sampling, which helps to ensure a representative sample. In this post, we focus on simple random sampling with replacement, which has the following characteristics:

records are selected at random, each with equal probability
each selected record is immediately available to be selected again in the same sample
a record can be selected multiple times
the population of records remains constant, i.e., none are excluded

One data mining technique that uses simple random sampling with replacement is bootstrap aggregating, also known as bagging. The technique involves combining the predictions of multiple models usually from the same algorithm within either classification or regression, e.g., classification decision trees. Each model is built from a different sample of the original dataset using simple random sampling with replacement. These samples are called bootstrap samples.

A model composed of multiple other models is often referred to as an ensemble model. For regression models, the predictions of each component model are averaged. For classification models, voting is used to select the prediction agreed upon by the majority of models. Bagging often increases model accuracy and stability by reducing variance and avoiding overfitting. This, however, is a topic for another post.

There are many possible ways to produce a random sample with replacement. One approach is to identify each row by its row number and then to use a hash function to select the records for our sample. The following script samples 1200 records from the CUSTOMERS table. We first create a view appending the row_number column using the rownum pseudo column. Iterating for the number of records in our sample, we populate the table CUSTOMERS_MAP with the hash value, which is based on the iterator i and ranges from 1 to the number of records in CUSTOMERS. The final step is to create the sample view by joining the CUSTOMERS_MAP table to the CUSTOMERS_V table using the sample_number and row_number in the join condition.

CREATE VIEW CUSTOMERS_V AS select rownum row_number, o.* from CUSTOMERS o; CREATE TABLE CUSTOMERS_MAP (sample_number NUMBER); DECLARE v_row_count NUMBER; c_sample_size NUMBER := 1200; BEGIN SELECT count(*) INTO v_row_count FROM CUSTOMERS; FOR i IN 1..c_sample_size+1 LOOP INSERT INTO CUSTOMERS_MAP (sample_number) VALUES (ORA_HASH(i, v_row_count, 12345)); END LOOP; COMMIT; END; CREATE VIEW CUSTOMERS_SAMPLE_V AS SELECT v.* FROM CUSTOMERS_V v, CUSTOMERS_MAP m WHERE v.row_number = m.sample_number;

To obtain a different sample, change the rows returned by modifying the seed value provided to ORA_HASH. Note that the sample view is dependent on the contents of CUSTOMERS_MAP. Materializing the sample view ensures it will not change if CUSTOMERS_MAP is reused or if the order of rows in the original table is not maintained.

To conveniently produce multiple samples, we can extend the CUSTOMERS_MAP table as follows:

CREATE TABLE CUSTOMERS_MAP (sample_number_1 NUMBER, sample_number_2 NUMBER,sample_number_3 NUMBER);
The INSERT statement populates each of these sample numbers using a different seed value for each ORA_HASH invocation:

INSERT INTO CUSTOMERS_MAP (sample_number_1, sample_number_2,sample_number_3) VALUES (ORA_HASH(i, v_row_count, 12345),ORA_HASH(i, v_row_count, 23456),ORA_HASH(i, v_row_count, 34567));

Lastly, the sample views can be created each using a different column from the CUSTOMERS_MAP table. Here is an example for sample 2:

CREATE VIEW CUSTOMERS_SAMPLE_2_V AS SELECT v.* FROM CUSTOMERS_V v, CUSTOMERS_MAP m WHERE v.row_number = m.sample_number_2;

This extension avoids having to maintain multiple map tables or materializing each of the sample tables. Of course, as stated above, since the rows in the initial CUSTOMER table are not guaranteed to maintain the same order, either the CUSTOMERS_V view or each of the sample views must be materialized to guarantee reproducible results.

To sample or not to sample... Part 4

Trending Articles

Alok, Daecolm & Malou – Unforgettable – Single [iTunes Plus M4A]

Download: D boy ft Shenky & Chester – Nafola nafulwa”Prod By: Shenky”

Mp3 Download: Mdu - Mazola

GTA 5 PPSSPP Zip File Download For Android Mediafire 382 MB

Materials Around Us Class 6 Worksheet Science Chapter 6

Gulabi kallu Lyrics and translation | GAV / Govindhudu andhari vadele (2014)

Grimsby sex-swap teen Nicole beats the bullies

Rajasthan Board 10th Result 2016 Roll No wise & Name Wise

VMOU RSCIT Result 2017, RSCIT Result VMOU rkcl.vmou.ac.in Name Wise

Troubleshooting Connectivity #9 –ローカル接続でネットワークエラーとはこれいかに？

[MP3] Okpo Recordz Virus & Texzy –“Raba Raba” (Prod. by Exy Pro)

Karimnagar District Police Office Mobile Numbers List in Telangana State

Missing boy, Queens Quay West and Bathurst Street area, Javin Dillon, 15

usage of CSRF token in ABAP report for POST request

AVS4YOU Products Patcher v1.4 By RADIXX11

SAHARA FLASH LIVE IN WERAGOLLA 2018-04-20

Practice Sheet of Right form of verbs for HSC Students

99 God Status for Whatsapp, Facebook

Portable iSkysoft PDF Editor 5.6.0.1

Bureau of Internal Revenue: Regional Offices (Directory)