4.1.1. Handling Imperfections

4.1.1.1. Noise and Outlier Handling

As discussed earlier in the Data Imperfections section, noise and outliers can distort summary statistics and give a misleading picture of the data. Statistical plots such as histograms or box plots (see univariate-distribution) help identify them.

Below, we will sometimes use noise and outliers interchangeably for illustration, but they are not the same. Noise often requires correction, smoothing, or imputation rather than simply truncating values. In some cases, noise cannot be corrected because there is no reliable way to recover or predict the true value, or it may be difficult to even identify which records are noisy.
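
As a rough illustration of smoothing rather than truncating, the sketch below applies a centered rolling median to a short series. The readings are hypothetical values invented for this example, and the window size of 3 is an arbitrary choice, not a recommendation.

```python
import pandas as pd

# Hypothetical sensor readings; the spike at position 4 stands in for noise
readings = pd.Series([20.1, 20.3, 20.2, 20.4, 35.0, 20.5, 20.3, 20.6, 20.4])

# Centered rolling median over 3 points; min_periods=1 keeps the endpoints
smoothed = readings.rolling(window=3, center=True, min_periods=1).median()

print(pd.DataFrame({"raw": readings, "smoothed": smoothed}))
```

The median ignores a single spike inside each window, so the noisy reading is pulled back toward its neighbors while every record is kept.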

Removing records

One simple strategy is to drop extreme values. This works well when:

  • The number of outliers is small.

  • Dropping them does not significantly reduce dataset quality.

For smaller datasets, dropping rows can lead to poor modeling or analysis, so use this approach carefully.

Example: Using a salary dataset with an extreme outlier (CEO salary):

import pandas as pd
import numpy as np

# Example salary data
np.random.seed(0)
n = 20
salaries = np.random.randint(50_000, 200_000, size=n).tolist()
salaries.append(5_000_000)  # extreme outlier
df_salaries = pd.DataFrame({"Employee": list(range(1, n+2)), "Salary": salaries})
display(df_salaries)
Employee Salary
0 1 93567
1 2 167952
2 3 145939
3 4 147639
4 5 91993
5 6 172579
6 7 136293
7 8 162420
8 9 98600
9 10 102620
10 11 130186
11 12 67089
12 13 158631
13 14 151201
14 15 132457
15 16 187993
16 17 117699
17 18 120608
18 19 57877
19 20 133966
20 21 5000000
# Flag outliers with a hand-picked threshold (salaries > 250,000)
outliers = df_salaries["Salary"] > 250_000
df_salaries[outliers]
Employee Salary
20 21 5000000

Now, we can drop the outlier records by filtering the dataframe:

df_cleaned = df_salaries[df_salaries["Salary"] <= 250_000]
display(df_cleaned)
Employee Salary
0 1 93567
1 2 167952
2 3 145939
3 4 147639
4 5 91993
5 6 172579
6 7 136293
7 8 162420
8 9 98600
9 10 102620
10 11 130186
11 12 67089
12 13 158631
13 14 151201
14 15 132457
15 16 187993
16 17 117699
17 18 120608
18 19 57877
19 20 133966
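
The 250,000 cutoff used above was chosen by hand. One common data-driven alternative is the 1.5 × IQR rule that box plots use to draw their whiskers. The sketch below applies it to the same salary values, copied from the table above; the variable names are illustrative, and the 1.5 multiplier is a convention rather than a requirement.

```python
import pandas as pd

# Salary values copied from the example table above
salaries = pd.Series([
    93567, 167952, 145939, 147639, 91993, 172579, 136293,
    162420, 98600, 102620, 130186, 67089, 158631, 151201,
    132457, 187993, 117699, 120608, 57877, 133966, 5_000_000,
], name="Salary")

# Quartiles and interquartile range
q1, q3 = salaries.quantile([0.25, 0.75])
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
is_outlier = (salaries < lower_fence) | (salaries > upper_fence)

print(f"Fences: [{lower_fence:,.0f}, {upper_fence:,.0f}]")
print(f"Flagged {is_outlier.sum()} of {len(salaries)} rows ({is_outlier.mean():.1%})")

# Drop the flagged rows
cleaned = salaries[~is_outlier]
```

On this data the rule flags only the 5,000,000 record. Printing the share of flagged rows is a quick check on whether dropping them is affordable for the dataset at hand.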

Clipping

Instead of removing extreme values, we can cap them at a defined maximum (or raise them to a defined minimum). This reduces skew without losing records.

Example:

# Clip salaries to a maximum of 250,000
df_clipped = df_salaries.copy()
df_clipped["Salary"] = df_clipped["Salary"].clip(upper=250_000)
display(df_clipped)
Employee Salary
0 1 93567
1 2 167952
2 3 145939
3 4 147639
4 5 91993
5 6 172579
6 7 136293
7 8 162420
8 9 98600
9 10 102620
10 11 130186
11 12 67089
12 13 158631
13 14 151201
14 15 132457
15 16 187993
16 17 117699
17 18 120608
18 19 57877
19 20 133966
20 21 250000

Here, the CEO’s salary (5,000,000) is set to 250,000, keeping the record but reducing the impact on statistical measures.
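
To make the effect on statistical measures concrete, the following sketch compares the mean, median, and standard deviation before and after clipping. It reuses the salary values from the table above; the `summary` frame is just an illustrative way to lay out the comparison.

```python
import pandas as pd

# Salary values copied from the example table above
salaries = pd.Series([
    93567, 167952, 145939, 147639, 91993, 172579, 136293,
    162420, 98600, 102620, 130186, 67089, 158631, 151201,
    132457, 187993, 117699, 120608, 57877, 133966, 5_000_000,
], name="Salary")

# Same cap as in the clipping example
clipped = salaries.clip(upper=250_000)

# Side-by-side summary statistics
summary = pd.DataFrame({
    "original": salaries.agg(["mean", "median", "std"]),
    "clipped": clipped.agg(["mean", "median", "std"]),
})
print(summary.round(0))
```

The median is the same in both columns because it does not depend on how large the largest value is, while the mean and standard deviation drop sharply once the extreme value is capped.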

Winsorizing

Winsorizing replaces extreme values with less extreme values taken from the data itself, typically a chosen percentile at each end of the distribution.

  • For example, in a 90% Winsorization, the top 5% and bottom 5% of values are replaced with the corresponding 95th and 5th percentile values.

  • This preserves the bulk of the data while limiting the influence of outliers.
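
The 90% Winsorization described above can be sketched with plain pandas by clipping to the 5th and 95th percentiles. The helper name `winsorize_by_percentile` is hypothetical, introduced only for this illustration. Note that `Series.quantile` interpolates between data points by default, so the result may differ slightly from implementations that replace a fixed count of values at each end.

```python
import pandas as pd


def winsorize_by_percentile(values: pd.Series,
                            lower: float = 0.05,
                            upper: float = 0.95) -> pd.Series:
    """Replace values outside the [lower, upper] quantiles with those quantiles."""
    low, high = values.quantile([lower, upper])
    return values.clip(lower=low, upper=high)


# Salary values copied from the example table above
salaries = pd.Series([
    93567, 167952, 145939, 147639, 91993, 172579, 136293,
    162420, 98600, 102620, 130186, 67089, 158631, 151201,
    132457, 187993, 117699, 120608, 57877, 133966, 5_000_000,
], name="Salary")

winsorized = winsorize_by_percentile(salaries)
print(pd.DataFrame({"original": salaries, "winsorized": winsorized}))
```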

SciPy provides this as scipy.stats.mstats.winsorize, which can be applied directly to a pandas column:

Example:

from scipy.stats.mstats import winsorize


# Winsorize the lowest 5% and the highest 5% of values
df_salaries["Salary_winsorized"] = winsorize(df_salaries["Salary"], limits=[0.05, 0.05])
display(df_salaries)
Employee Salary Salary_winsorized
0 1 93567 93567
1 2 167952 167952
2 3 145939 145939
3 4 147639 147639
4 5 91993 91993
5 6 172579 172579
6 7 136293 136293
7 8 162420 162420
8 9 98600 98600
9 10 102620 102620
10 11 130186 130186
11 12 67089 67089
12 13 158631 158631
13 14 151201 151201
14 15 132457 132457
15 16 187993 187993
16 17 117699 117699
17 18 120608 120608
18 19 57877 67089
19 20 133966 133966
20 21 5000000 187993

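Here, 5% of 21 records rounds down to one value at each end, so the lowest salary (57,877) is replaced by the next lowest (67,089) and the CEO's salary (5,000,000) is replaced by the next highest (187,993), reducing the effect of the outlier.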

Visualization: A histogram comparing the original and Winsorized salaries:

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(14,5), sharey=True)

# Original salary distribution
axes[0].hist(df_salaries["Salary"], bins=5, color="skyblue", alpha=0.7)
axes[0].set_title("Original Salaries")
axes[0].set_xlabel("Salary")
axes[0].set_ylabel("Frequency")

# Winsorized salary distribution
axes[1].hist(df_salaries["Salary_winsorized"], bins=5, color="orange", alpha=0.7)
axes[1].set_title("Winsorized Salaries")
axes[1].set_xlabel("Salary")

plt.suptitle("Comparison of Original vs Winsorized Salaries", fontsize=16)
plt.tight_layout(rect=[0, 0, 1, 0.95])
plt.show()
[Figure: Comparison of Original vs Winsorized Salaries. Side-by-side histograms titled "Original Salaries" and "Winsorized Salaries"; x-axis: Salary, y-axis: Frequency.]

Tip

As with imputation, you can add an indicator column (e.g., is_winsorized or is_clipped) to record that a value has been modified. This gives a model extra context about which values are not original and may improve performance.
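
A minimal sketch of this pattern, assuming the same 250,000 cap as the clipping example and a small subset of the salary data:

```python
import pandas as pd

# Small subset of the salary example above
df = pd.DataFrame({
    "Employee": [1, 2, 3, 21],
    "Salary": [93567, 167952, 145939, 5_000_000],
})

cap = 250_000  # same cap as in the clipping example

# Record the flag before modifying the values, then clip
df["is_clipped"] = df["Salary"] > cap
df["Salary"] = df["Salary"].clip(upper=cap)

print(df)
```

The flag has to be computed before clipping. Afterwards, the capped value can no longer be told apart from a salary that was genuinely equal to the cap.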