5.3.1. Data Distribution and Feature Behavior

Before analyzing relationships between features or comparing data points, it is essential to understand how individual features behave on their own. Data distributions provide a high-level view of how values are spread, how frequently they occur, and what patterns or irregularities may be present.

Many real-world datasets follow common and well-studied distributions. Recognizing these patterns helps us reason about the underlying data-generating process and informs decisions about preprocessing, modeling, and evaluation.

In this section, we examine distributions at the level of single features, consider both numeric and categorical data, and introduce key concepts that describe distribution shape.

5.3.1.1. Univariate Distributions

Univariate analysis focuses on one feature at a time. It helps identify skewed data, outliers, and the general shape of a feature’s distribution, such as normal, bimodal, or uniform.

For numeric features, histograms are a common visualization tool. They group values into bins and show how frequently values occur within each range.

import sqlite3
import pandas as pd
import matplotlib.pyplot as plt

# Connect to Chinook sample database
conn = sqlite3.connect("../data/chinook.db")

# Load the 'tracks' table into a DataFrame
df_tracks = pd.read_sql_query("SELECT * FROM tracks", conn)
df_tracks.head()

	TrackId	Name	AlbumId	MediaTypeId	GenreId	Composer	Milliseconds	Bytes	UnitPrice
0	1	For Those About To Rock (We Salute You)	1	1	1	Angus Young, Malcolm Young, Brian Johnson	343719	11170334	0.99
1	2	Balls to the Wall	2	2	1	None	342562	5510424	0.99
2	3	Fast As a Shark	3	2	1	F. Baltes, S. Kaufman, U. Dirkscneider & W. Ho...	230619	3990994	0.99
3	4	Restless and Wild	3	2	1	F. Baltes, R.A. Smith-Diesel, S. Kaufman, U. D...	252051	4331779	0.99
4	5	Princess of the Dawn	3	2	1	Deaffy & R.A. Smith-Diesel	375418	6290521	0.99

df_tracks.hist(
    figsize=(12, 10),
    bins=30,
    grid=False,
    edgecolor="black"
)

plt.tight_layout()
plt.show()

[Figure: histograms of the numeric columns of df_tracks]

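These plots give a visual summary of features such as Milliseconds, Bytes, and UnitPrice, making skewness, extreme values, and irregular patterns easy to spot. Because hist() draws every numeric column, it also plots identifier columns such as TrackId, AlbumId, MediaTypeId, and GenreId. Their histograms reflect how the identifiers were assigned rather than a measured quantity, so they should be read with care.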
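
To make the binning step concrete, the following is a minimal sketch of how an equal-width histogram can be counted by hand. The function name `equal_width_histogram` and the duration values are hypothetical and are not taken from the Chinook data. The last bin is treated as closed on the right so that the maximum value is counted, which is the convention NumPy documents for its histogram routine.

```python
def equal_width_histogram(values, bins):
    """Count how many values fall into each of `bins` equal-width intervals."""
    lo, hi = min(values), max(values)
    if lo == hi:
        # All values identical: everything lands in a single bin.
        return [(lo, hi, len(values))]
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value belongs to the last bin, so clamp its index.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return [(edges[i], edges[i + 1], counts[i]) for i in range(bins)]


# Track lengths in seconds; hypothetical values chosen to be right-skewed.
durations = [180, 195, 200, 210, 215, 220, 230, 240, 260, 900]
hist = equal_width_histogram(durations, bins=4)
for left, right, count in hist:
    print(f"[{left:6.1f}, {right:6.1f}]  {'#' * count}")
```

A single extreme value stretches the bin range so that almost all observations collapse into the first bin. A similar long right tail is suggested by the large positive skewness values reported for Milliseconds and Bytes later in this section.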

5.3.1.2. Categorical Feature Distributions

Not all features are numeric. Categorical features represent groups or labels rather than quantities. Examples include genre, category, country, or class labels.

For categorical features, distributions are described using frequencies instead of numeric ranges.

Common tools include:

Frequency tables using value_counts()
Bar charts or count plots
Mode as a summary statistic

df_tracks["GenreId"].value_counts()

GenreId
   1297
    579
    374
    332
    130
    93
     81
    74
    64
    61
     58
     48
    43
    40
    35
    30
    28
    28
    26
    24
    17
    15
    13
     12
     1
Name: count, dtype: int64

Understanding categorical distributions helps identify dominant classes, rare categories, and potential class imbalance, all of which can strongly affect downstream analysis and modeling.
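
The tools listed above can be illustrated without a database. The sketch below builds a frequency table, relative frequencies, the mode, and a simple imbalance ratio using only the standard library. The genre labels are hypothetical, and the imbalance ratio (largest class count divided by smallest) is one possible summary among several.

```python
from collections import Counter

# Hypothetical genre labels for a handful of tracks.
genres = ["Rock", "Rock", "Latin", "Rock", "Jazz", "Latin", "Rock", "Opera"]

counts = Counter(genres)
total = sum(counts.values())

# most_common() sorts by descending count, like a frequency table.
for label, count in counts.most_common():
    print(f"{label:<6} {count:>3}  {count / total:6.1%}")

# The mode is the most frequent label.
mode_label, mode_count = counts.most_common(1)[0]

# Imbalance ratio: how many times larger the biggest class is than the smallest.
imbalance_ratio = max(counts.values()) / min(counts.values())
print(mode_label, imbalance_ratio)
```

With pandas, the same relative frequencies are available through `value_counts(normalize=True)`.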

Note

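Categorical features should not be treated as numeric unless they have a meaningful ordering and spacing. GenreId, for example, is stored as an integer but acts as a label, so statistics such as its mean or skewness have no meaningful interpretation.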
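
One way to see why this matters is to relabel the categories. In the hedged sketch below, the integer codes and the `relabel` mapping are hypothetical. Changing the arbitrary codes changes the "average code" but leaves the frequency profile untouched, which suggests that only the frequencies carry information.

```python
from collections import Counter
from statistics import mean

# Hypothetical integer codes for a nominal feature (e.g. genre identifiers).
codes = [1, 1, 7, 3, 1, 7]

# An equally valid labeling: the same categories with different arbitrary codes.
relabel = {1: 20, 7: 5, 3: 9}
recoded = [relabel[c] for c in codes]

# The "average code" depends entirely on the arbitrary numbering...
print(mean(codes), mean(recoded))

# ...while the frequency profile (sorted counts) is unchanged.
print(sorted(Counter(codes).values()), sorted(Counter(recoded).values()))
```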

5.3.1.3. Empirical and Theoretical Distributions

The distributions we observe in data are known as empirical distributions, as they are derived directly from collected samples. In contrast, theoretical distributions are mathematical models such as the normal, uniform, or exponential distributions.

In practice, we often compare empirical data to theoretical distributions to assess whether modeling assumptions are reasonable.

Examples of common theoretical distributions include:

Normal distribution for natural variation and measurement noise
Uniform distribution for equally likely outcomes
Exponential distribution for waiting times or lifetimes

Note

Many statistical methods assume an underlying theoretical distribution. Verifying whether this assumption is reasonable begins with empirical distribution analysis.
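
As a rough illustration of comparing an empirical distribution with a theoretical one, the sketch below measures the largest vertical gap between the empirical CDF of a sample and a candidate CDF. This gap is the quantity used by the Kolmogorov-Smirnov statistic. The waiting times are simulated rather than real data, and the helper names `exponential_cdf` and `max_cdf_gap` are assumptions made for this example.

```python
import math
import random


def exponential_cdf(x, rate):
    """CDF of the exponential distribution: F(x) = 1 - exp(-rate * x), x >= 0."""
    return 1.0 - math.exp(-rate * x) if x >= 0 else 0.0


def max_cdf_gap(sample, cdf):
    """Largest vertical distance between the empirical CDF and `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    gap = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        gap = max(gap, abs((i + 1) / n - f), abs(f - i / n))
    return gap


random.seed(0)  # fixed seed so the sketch is repeatable
waits = [random.expovariate(0.5) for _ in range(2000)]

# Estimate the rate from the data: for an exponential, rate = 1 / mean.
rate_hat = len(waits) / sum(waits)

gap_exponential = max_cdf_gap(waits, lambda x: exponential_cdf(x, rate_hat))

# A uniform model on [0, max] is a deliberately poor fit for comparison.
upper = max(waits)
gap_uniform = max_cdf_gap(waits, lambda x: min(max(x / upper, 0.0), 1.0))

print(f"exponential gap: {gap_exponential:.3f}, uniform gap: {gap_uniform:.3f}")
```

A small gap indicates that the candidate distribution tracks the data closely, and a large gap signals that the assumption deserves a second look. Turning the gap into a formal test requires additional care, especially when parameters are estimated from the same sample.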

5.3.1.4. Distribution Shape: Skewness and Kurtosis

Beyond central tendency and spread, the shape of a distribution provides important information.

Skewness measures the asymmetry of a distribution:

Positive skew indicates a long right tail
Negative skew indicates a long left tail
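Skew near zero is consistent with symmetry (a symmetric distribution has zero skew, although zero skew alone does not guarantee symmetry)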

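Kurtosis measures the heaviness of the tails relative to a normal distribution. The kurt() method in pandas reports excess kurtosis, for which a normal distribution scores 0:

Positive values indicate heavier tails and more extreme values than a normal distribution
Negative values indicate lighter tails and fewer extreme values than a normal distribution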

These measures help explain why some features may violate modeling assumptions or require transformation before analysis.

df_tracks.skew(numeric_only=True)

TrackId         0.000000
AlbumId         0.089525
MediaTypeId     3.028186
GenreId         1.560547
Milliseconds    3.951430
Bytes           4.372829
UnitPrice       3.677272
dtype: float64

df_tracks.kurt(numeric_only=True)

TrackId         -1.200000
AlbumId         -1.033178
MediaTypeId      9.602763
GenreId          1.502033
Milliseconds    15.744308
Bytes           19.385956
UnitPrice       11.528913
dtype: float64
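
To show what these numbers measure, the sketch below computes the standard bias-corrected sample skewness and excess kurtosis from central moments. The pandas documentation describes its skew() and kurt() as unbiased estimators, so these formulas are assumed, not verified here, to match that convention. The function names and the small `skewed` sample are hypothetical.

```python
import math


def central_moment(values, k):
    """k-th central moment with the 1/n (population) normalisation."""
    n = len(values)
    mu = sum(values) / n
    return sum((v - mu) ** k for v in values) / n


def sample_skewness(values):
    """Bias-corrected skewness G1 (requires at least 3 values)."""
    n = len(values)
    g1 = central_moment(values, 3) / central_moment(values, 2) ** 1.5
    return math.sqrt(n * (n - 1)) / (n - 2) * g1


def sample_excess_kurtosis(values):
    """Bias-corrected excess kurtosis G2 (requires at least 4 values)."""
    n = len(values)
    g2 = central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0
    return (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6.0)


ids = list(range(1, 101))            # evenly spaced, like a running identifier
skewed = [1, 2, 3, 4, 100]           # hypothetical right-skewed values
logged = [math.log(v) for v in skewed]

print(sample_skewness(ids), sample_excess_kurtosis(ids))
print(sample_skewness(skewed), sample_skewness(logged))
```

An evenly spaced sequence has zero skewness and, under these formulas, an excess kurtosis of -1.2. This is consistent with the TrackId row above and reflects the identifier's uniform spacing, not any property of the music. The second line illustrates one common transformation: taking logarithms of a right-skewed sample reduces its skewness without removing it entirely.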

5.3.1.5. Why Distribution Analysis Matters

Understanding feature distributions is a prerequisite for meaningful relationship analysis. Distribution shape influences correlation, distance, and similarity measures, and poorly behaved features can dominate or distort results.
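
A small worked example may help show how a single extreme value can distort a relationship measure. The sketch below computes the Pearson correlation by hand for hypothetical data, first without and then with one point far out in the tail of both features. The helper `pearson_r` is written for illustration only.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data with a weak negative relationship.
x = [1, 2, 3, 4, 5]
y = [3, 5, 1, 4, 2]
r_before = pearson_r(x, y)

# Add one extreme point that is far out in the tail of both features.
r_after = pearson_r(x + [50], y + [50])

print(f"without outlier: {r_before:.2f}, with outlier: {r_after:.2f}")
```

The five original points are weakly negatively correlated, yet one extreme observation is enough to make the correlation strongly positive. This is why heavy-tailed features are usually inspected, and sometimes transformed, before relationship analysis.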

With a clear understanding of how individual features behave, we can now move on to examining how features interact with one another.