4.1.2. Null Handling and Imputation#
Now that we have seen how to identify null values in our data, we need to handle them. Building models or performing analysis on data with missing values is difficult and generally not recommended; most modern machine learning frameworks expect your data to contain no null values.
Let’s look at the different ways we can handle null values.
4.1.2.1. Deleting Records with Nulls#
If only a small portion of your data contains missing values, you can remove those records entirely. Pandas provides the .dropna() method for this.
Example:
import pandas as pd
data = {
"Name": ["Alice", "Bob", "Charlie", "Diana"],
"Age": [25, None, 30, None],
"City": ["NY", "LA", "Chicago", "Houston"]
}
df = pd.DataFrame(data)
display(df)
| | Name | Age | City |
|---|---|---|---|
| 0 | Alice | 25.0 | NY |
| 1 | Bob | NaN | LA |
| 2 | Charlie | 30.0 | Chicago |
| 3 | Diana | NaN | Houston |
# Drop rows with any NULLs
df_cleaned = df.dropna()
df_cleaned
| | Name | Age | City |
|---|---|---|---|
| 0 | Alice | 25.0 | NY |
| 2 | Charlie | 30.0 | Chicago |
Warning
This approach is only advisable if removing those rows does not significantly reduce your dataset size.
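Beyond dropping every row with any null, `.dropna()` can be more selective. A minimal sketch, reusing a small DataFrame like the one above (the extra null in `City` is added here for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana"],
    "Age": [25, None, 30, None],
    "City": ["NY", None, "Chicago", "Houston"],
})

# Drop a row only when a specific column is null
by_subset = df.dropna(subset=["Age"])

# Keep rows that have at least 2 non-null values
by_thresh = df.dropna(thresh=2)

print(by_subset)
print(by_thresh)
```

Here `subset` keeps Diana (her `City` is known even though `Age` is not missing from the subset check), while `thresh` drops only Bob, whose row has just one non-null value.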
4.1.2.2. Keeping or Replacing Meaningful Nulls#
Sometimes, NULL values actually carry meaning. For example, if a user never provided an address, this could signal they prefer not to share it. Or if a product has no expiration date, it may mean “no expiry.”
In such cases, you can either leave the NULLs or replace them with a more intuitive placeholder, such as "Unknown" or 0.
We can use the .fillna() method provided by pandas for this.
Example:
# Replace NULLs in the Age column with 'Unknown'
df_filled = df.fillna({"Age": "Unknown"})
df_filled
| | Name | Age | City |
|---|---|---|---|
| 0 | Alice | 25.0 | NY |
| 1 | Bob | Unknown | LA |
| 2 | Charlie | 30.0 | Chicago |
| 3 | Diana | Unknown | Houston |
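Note that filling a numeric column with a string placeholder (as above) changes its dtype to `object`. If you want to keep dtypes clean, you can pass a dict to `.fillna()` that matches each placeholder to its column's type; a sketch, where the choice of `0` and `"Unknown"` as placeholders is just illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana"],
    "Age": [25, None, 30, None],
    "City": ["NY", None, "Chicago", None],
})

# Numeric placeholder for Age, string placeholder for City
df_placeholders = df.fillna({"Age": 0, "City": "Unknown"})
print(df_placeholders)
```

With this approach `Age` remains a numeric column, which matters if you later aggregate or model on it.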
4.1.2.3. Data Imputation#
Imputation means filling in missing values with estimated or derived values. There are several strategies for this:
Formula-Based Imputation#
Sometimes, you can compute missing values from other columns. For example, if you have DateOfBirth, you can compute Age.
Example:
from datetime import datetime
# Sample data
data = {
"Name": ["Alice", "Bob"],
"DateOfBirth": ["1990-05-15", "1985-08-22"],
"Age": [None, None]
}
df_formula = pd.DataFrame(data)
df_formula
| | Name | DateOfBirth | Age |
|---|---|---|---|
| 0 | Alice | 1990-05-15 | None |
| 1 | Bob | 1985-08-22 | None |
# Convert DateOfBirth to datetime
df_formula["DateOfBirth"] = pd.to_datetime(df_formula["DateOfBirth"])
# Compute approximate Age in whole years as of today
# (integer division by 365 ignores leap years)
today = pd.Timestamp(datetime.today())
df_formula["Age"] = df_formula["DateOfBirth"].apply(lambda dob: (today - dob).days // 365)
df_formula
| | Name | DateOfBirth | Age |
|---|---|---|---|
| 0 | Alice | 1990-05-15 | 35 |
| 1 | Bob | 1985-08-22 | 40 |
Distribution-Based Imputation#
You can replace NULLs with the mean or median of the column to maintain the dataset’s statistical properties.
Example:
# Fill NULLs in Age with the mean age
df_mean_imputed = df.copy()
df_mean_imputed
| | Name | Age | City |
|---|---|---|---|
| 0 | Alice | 25.0 | NY |
| 1 | Bob | NaN | LA |
| 2 | Charlie | 30.0 | Chicago |
| 3 | Diana | NaN | Houston |
df_mean_imputed["Age"] = df_mean_imputed["Age"].fillna(df_mean_imputed["Age"].mean())
df_mean_imputed
| | Name | Age | City |
|---|---|---|---|
| 0 | Alice | 25.0 | NY |
| 1 | Bob | 27.5 | LA |
| 2 | Charlie | 30.0 | Chicago |
| 3 | Diana | 27.5 | Houston |
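When the column is skewed or contains outliers, the median is often a safer fill value than the mean, because a single extreme value can pull the mean far from the typical observation. A sketch of the contrast (the fifth row with age 90 is made up here to act as the outlier):

```python
import pandas as pd

# An outlier (90) pulls the mean well above most values
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana", "Evan"],
    "Age": [25, None, 30, None, 90],
})

median_filled = df["Age"].fillna(df["Age"].median())  # fills with 30.0
mean_filled = df["Age"].fillna(df["Age"].mean())      # fills with ~48.3

print(median_filled)
print(mean_filled)
```

Here the mean-based fill assigns Bob and Diana an age higher than three of the five people, while the median-based fill stays near the bulk of the data.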
Grouped Imputation#
A more sophisticated strategy is to impute values within subgroups of your data. For example, if you are working with housing prices and some records have a missing price per square foot, you can fill those in with the average computed within the same ZIP code or neighborhood.
Example:
# Sample dataset with missing prices
housing_data = {
"ZipCode": [1001, 1001, 1002, 1002, 1002],
"SqFtPrice": [200, None, 250, None, 260]
}
df_housing = pd.DataFrame(housing_data)
df_housing
| | ZipCode | SqFtPrice |
|---|---|---|
| 0 | 1001 | 200.0 |
| 1 | 1001 | NaN |
| 2 | 1002 | 250.0 |
| 3 | 1002 | NaN |
| 4 | 1002 | 260.0 |
# Impute within each ZipCode group
df_housing["SqFtPrice"] = df_housing.groupby("ZipCode")["SqFtPrice"].transform(lambda x: x.fillna(x.mean()))
df_housing
| | ZipCode | SqFtPrice |
|---|---|---|
| 0 | 1001 | 200.0 |
| 1 | 1001 | 200.0 |
| 2 | 1002 | 250.0 |
| 3 | 1002 | 255.0 |
| 4 | 1002 | 260.0 |
Tip
It is good practice to track which records were imputed (for example, by adding an is_imputed column) so you don’t lose that information later. This flag can also be a useful feature for machine learning models, giving them a hint that these values might not be completely reliable.
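The tip above might look like this in code, using the earlier `Age` example (the column name `Age_is_imputed` is just one possible convention):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana"],
    "Age": [25, None, 30, None],
})

# Record which rows will be imputed *before* filling them
df["Age_is_imputed"] = df["Age"].isna()
df["Age"] = df["Age"].fillna(df["Age"].mean())
print(df)
```

The key detail is to capture the flag before calling `.fillna()`; afterwards, the information about which values were originally missing is gone.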