5.1.3. Sampling#

When working with large datasets, it is often impractical, or even impossible, to process the entire dataset at once, especially during initial exploration. Sampling is the process of selecting a smaller, representative subset of a much larger dataset.

This smaller subset helps:

  • Preview the dataset without overwhelming computational resources.

  • Run preliminary operations to understand trends, patterns, and issues before committing to full-scale analysis.

For example, if we want to analyze the entire population of a country with 1 billion people, loading and processing all the data at once would be infeasible. Instead, we could take a sample of, say, 50,000 individuals, as long as this sample is representative of the full population. The key principle of sampling is that the sample should preserve the important characteristics of the larger dataset.
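To make this concrete, here is a minimal sketch (using a synthetic population rather than real census data) showing that a sufficiently large random sample preserves summary statistics such as the mean:

```python
import numpy as np

# Synthetic "population": 1 million ages drawn from a normal distribution
rng = np.random.default_rng(42)
population = rng.normal(loc=35, scale=10, size=1_000_000)

# Draw a random sample of 50,000 individuals without replacement
sample = rng.choice(population, size=50_000, replace=False)

# A representative sample preserves key characteristics of the population
print(f"Population mean: {population.mean():.2f}")
print(f"Sample mean:     {sample.mean():.2f}")
```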

Example Dataset: We will continue working with the tracks table from the Chinook sample database.

import sqlite3
import pandas as pd

# Connect to Chinook sample database
conn = sqlite3.connect("../data/chinook.db")
# Load the 'tracks' table from Chinook database
df_tracks = pd.read_sql_query("SELECT * FROM tracks", conn)

# view first 5 rows of the dataset
df_tracks.head(5) 
|   | TrackId | Name | AlbumId | MediaTypeId | GenreId | Composer | Milliseconds | Bytes | UnitPrice |
|---|---------|------|---------|-------------|---------|----------|--------------|-------|-----------|
| 0 | 1 | For Those About To Rock (We Salute You) | 1 | 1 | 1 | Angus Young, Malcolm Young, Brian Johnson | 343719 | 11170334 | 0.99 |
| 1 | 2 | Balls to the Wall | 2 | 2 | 1 | None | 342562 | 5510424 | 0.99 |
| 2 | 3 | Fast As a Shark | 3 | 2 | 1 | F. Baltes, S. Kaufman, U. Dirkscneider & W. Ho... | 230619 | 3990994 | 0.99 |
| 3 | 4 | Restless and Wild | 3 | 2 | 1 | F. Baltes, R.A. Smith-Diesel, S. Kaufman, U. D... | 252051 | 4331779 | 0.99 |
| 4 | 5 | Princess of the Dawn | 3 | 2 | 1 | Deaffy & R.A. Smith-Diesel | 375418 | 6290521 | 0.99 |

5.1.3.1. Simple Random Sampling#

As the name suggests, in simple random sampling we select the required number of data points entirely at random. Each record in the dataset has an equal chance of being chosen.

Example: From a dataset of 1 million customer transactions, we randomly select 10,000 transactions for initial analysis.

Code Example:

# Simple Random Sampling
# Randomly select 100 samples (without replacement) from df_tracks
simple_random_sample = df_tracks.sample(n=100, random_state=42)
simple_random_sample.shape
(100, 9)

5.1.3.2. Stratified Sampling#

Stratified sampling ensures that certain key distributions are preserved in the sample. This is especially important when the dataset contains imbalanced classes.

Example: Suppose we have a medical dataset with 1 million records and a binary label, Has Cancer: True / False.

  • 95% of the records have Has Cancer = False

  • 5% have Has Cancer = True

If we randomly sample 10,000 records, the proportion of True cases may drift by chance, and the already rare class could become even rarer in the sample. In stratified sampling, we instead guarantee that the sample keeps the same 95:5 ratio as the original dataset. This preserves the underlying patterns and avoids skewing our analysis.
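This can be sketched with a synthetic stand-in for the medical dataset (the column name has_cancer is made up for illustration). Pandas' built-in GroupBy.sample draws the same fraction from each label group, so the 95:5 ratio carries over to the sample:

```python
import numpy as np
import pandas as pd

# Synthetic medical dataset: ~5% positive labels out of 1 million records
rng = np.random.default_rng(0)
df_medical = pd.DataFrame({"has_cancer": rng.random(1_000_000) < 0.05})

# Stratified 1% sample: the same fraction is drawn from each label group
stratified = df_medical.groupby("has_cancer").sample(frac=0.01, random_state=0)

# The positive rate in the sample matches the population almost exactly
print(f"Population positive rate: {df_medical['has_cancer'].mean():.4f}")
print(f"Sample positive rate:     {stratified['has_cancer'].mean():.4f}")
```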

Code Example:

Stratified Sampling by ‘GenreId’: we take a 10% sample from each genre group to preserve the genre distribution.

stratified_sampled_df = df_tracks.groupby('GenreId', group_keys=False).apply(
    lambda x: x.sample(frac=0.1, random_state=42), include_groups=False
)
stratified_sampled_df.shape
(350, 8)

Note

Here we first group the tracks by GenreId. Then, from each group, we sample the required fraction (10%) of records. This preserves the overall distribution of the dataset.
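We can verify this behavior on a small synthetic frame (the genre labels below are made up). Note that pandas also offers GroupBy.sample, which performs the same stratified draw directly:

```python
import pandas as pd

# Synthetic frame with an imbalanced categorical column
df = pd.DataFrame({"genre": ["rock"] * 800 + ["jazz"] * 150 + ["folk"] * 50})

# Stratified 10% sample using the built-in GroupBy.sample
sample = df.groupby("genre", group_keys=False).sample(frac=0.1, random_state=42)

# Group proportions are preserved: 80% rock, 15% jazz, 5% folk
print(df["genre"].value_counts(normalize=True).to_dict())
print(sample["genre"].value_counts(normalize=True).to_dict())
```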

5.1.3.3. Cluster Sampling#

In cluster sampling, we first divide the dataset into smaller groups or clusters. Clusters can be defined geographically, temporally, or by any logical grouping depending on the problem. Then, we randomly select one or more clusters and include all data points from those clusters in our sample.

Example: In a national survey, instead of sampling individuals directly, we might:

  1. Divide the country into districts (clusters).

  2. Randomly select 50 districts.

  3. Collect data from all households in those districts.
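Under these (hypothetical) survey numbers, the two-stage selection above can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(42)

# Step 1: 500 districts (clusters), each holding a list of household ids
districts = {d: [f"d{d}-h{h}" for h in range(int(rng.integers(50, 200)))]
             for d in range(500)}

# Step 2: randomly select 50 districts
chosen = rng.choice(list(districts), size=50, replace=False)

# Step 3: the sample contains every household in the chosen districts
cluster_sample = [hh for d in chosen for hh in districts[d]]
print(f"{len(chosen)} districts selected, {len(cluster_sample)} households sampled")
```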

Note

Stratified sampling can be seen as a variant of cluster sampling in which the clusters correspond to the classes, and we sample from every cluster rather than from a randomly chosen subset of them.

Code Example:

Cluster Sampling by ‘GenreId’. We randomly select 5 genres (clusters), then take all tracks from those genres.

num_clusters_to_sample = 5

# Randomly pick 5 genre ids (the clusters)
sampled_clusters = df_tracks['GenreId'].drop_duplicates().sample(n=num_clusters_to_sample, random_state=42)
print("Selected genres: ", sampled_clusters.to_list())

# Keep every track belonging to one of the selected genres
cluster_sample = df_tracks[df_tracks['GenreId'].isin(sampled_clusters)]

cluster_sample.head()
Selected genres:  [9, 17, 1, 24, 12]
|   | TrackId | Name | AlbumId | MediaTypeId | GenreId | Composer | Milliseconds | Bytes | UnitPrice |
|---|---------|------|---------|-------------|---------|----------|--------------|-------|-----------|
| 0 | 1 | For Those About To Rock (We Salute You) | 1 | 1 | 1 | Angus Young, Malcolm Young, Brian Johnson | 343719 | 11170334 | 0.99 |
| 1 | 2 | Balls to the Wall | 2 | 2 | 1 | None | 342562 | 5510424 | 0.99 |
| 2 | 3 | Fast As a Shark | 3 | 2 | 1 | F. Baltes, S. Kaufman, U. Dirkscneider & W. Ho... | 230619 | 3990994 | 0.99 |
| 3 | 4 | Restless and Wild | 3 | 2 | 1 | F. Baltes, R.A. Smith-Diesel, S. Kaufman, U. D... | 252051 | 4331779 | 0.99 |
| 4 | 5 | Princess of the Dawn | 3 | 2 | 1 | Deaffy & R.A. Smith-Diesel | 375418 | 6290521 | 0.99 |

5.1.3.4. Systematic Sampling#

In systematic sampling, we:

  1. Choose a random starting point in the dataset.

  2. Select records at fixed, pre-determined intervals.

Example: If we have a dataset of 100,000 entries and we want a sample of 1,000:

  • We pick a random starting index (say 57).

  • Then select every 100th record from that point (57, 157, 257, …).

This method is simple and efficient, and it adds little computational overhead.

Code Example:

Select every 10th record starting from a random start index

import numpy as np

k = 10                           # sampling interval: keep every 10th record
start = np.random.randint(0, k)  # random starting index in [0, k)
systematic_indices = list(range(start, len(df_tracks), k))
systematic_sample = df_tracks.iloc[systematic_indices]
systematic_sample.shape
systematic_sample.shape
(351, 9)