4.1.2. Null Handling and Imputation#
Now that we have seen how to identify null values in our data, we need to handle them. Building models or performing analysis on data with missing values is difficult and generally not recommended; most modern machine learning frameworks expect your data to contain no null values.
Let’s look at the different ways we can handle null values.
4.1.2.1. Deleting Records with Nulls#
If only a small portion of your data contains missing values, you can remove those records entirely. Pandas provides the .dropna() method for this.
Example:
import pandas as pd
data = {
"Name": ["Alice", "Bob", "Charlie", "Diana"],
"Age": [25, None, 30, None],
"City": ["NY", "LA", "Chicago", "Houston"]
}
df = pd.DataFrame(data)
display(df)
| | Name | Age | City |
|---|---|---|---|
| 0 | Alice | 25.0 | NY |
| 1 | Bob | NaN | LA |
| 2 | Charlie | 30.0 | Chicago |
| 3 | Diana | NaN | Houston |
# Drop rows with any NULLs
df_cleaned = df.dropna()
df_cleaned
| | Name | Age | City |
|---|---|---|---|
| 0 | Alice | 25.0 | NY |
| 2 | Charlie | 30.0 | Chicago |
Warning
This approach is only advisable if removing those rows does not significantly reduce your dataset size.
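Beyond dropping every row with any null, `.dropna()` can be more selective. A minimal sketch, reusing a small DataFrame like the one above (the extra null in `City` is added here for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana"],
    "Age": [25, None, 30, None],
    "City": ["NY", None, "Chicago", "Houston"],
})

# Drop a row only when a specific column is null
by_subset = df.dropna(subset=["Age"])

# Keep rows that have at least 2 non-null values
by_thresh = df.dropna(thresh=2)

print(by_subset)
print(by_thresh)
```

Here `subset` keeps Diana (her `City` is known even though `Age` is not missing from the subset check), while `thresh` drops only Bob, whose row has just one non-null value.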
4.1.2.2. Keeping or Replacing Meaningful Nulls#
Sometimes, NULL values actually carry meaning. For example, if a user never provided an address, this could signal they prefer not to share it. Or if a product has no expiration date, it may mean “no expiry.”
In such cases, you can either leave the NULLs or replace them with a more intuitive placeholder, such as "Unknown" or 0.
We can use the .fillna() method provided by pandas for this.
Example:
# Replace NULLs in the Age column with 'Unknown'
df_filled = df.fillna({"Age": "Unknown"})
df_filled
| | Name | Age | City |
|---|---|---|---|
| 0 | Alice | 25.0 | NY |
| 1 | Bob | Unknown | LA |
| 2 | Charlie | 30.0 | Chicago |
| 3 | Diana | Unknown | Houston |
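Note that filling a numeric column with a string placeholder (as above) changes its dtype to `object`. If you want to keep dtypes clean, you can pass a dict to `.fillna()` that matches each placeholder to its column's type; a sketch, where the choice of `0` and `"Unknown"` as placeholders is just illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana"],
    "Age": [25, None, 30, None],
    "City": ["NY", None, "Chicago", None],
})

# Numeric placeholder for Age, string placeholder for City
df_placeholders = df.fillna({"Age": 0, "City": "Unknown"})
print(df_placeholders)
```

With this approach `Age` remains a numeric column, which matters if you later aggregate or model on it.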
4.1.2.3. Data Imputation#
Imputation means filling in missing values with estimated or derived values. There are several strategies for this:
Formula-Based Imputation#
Sometimes, you can compute missing values from other columns. For example, if you have DateOfBirth, you can compute Age.
Example:
from datetime import datetime
# Sample data
data = {
"Name": ["Alice", "Bob"],
"DateOfBirth": ["1990-05-15", "1985-08-22"],
"Age": [None, None]
}
df_formula = pd.DataFrame(data)
df_formula
| | Name | DateOfBirth | Age |
|---|---|---|---|
| 0 | Alice | 1990-05-15 | None |
| 1 | Bob | 1985-08-22 | None |
# Convert DateOfBirth to datetime
df_formula["DateOfBirth"] = pd.to_datetime(df_formula["DateOfBirth"])
# Compute approximate Age in whole years as of today
# (integer division by 365 ignores leap years)
today = pd.Timestamp(datetime.today())
df_formula["Age"] = df_formula["DateOfBirth"].apply(lambda dob: (today - dob).days // 365)
df_formula
| | Name | DateOfBirth | Age |
|---|---|---|---|
| 0 | Alice | 1990-05-15 | 35 |
| 1 | Bob | 1985-08-22 | 40 |
Distribution-Based Imputation#
You can replace NULLs with the mean or median of the column to maintain the dataset’s statistical properties.
Example:
# Fill NULLs in Age with the mean age
df_mean_imputed = df.copy()
df_mean_imputed
| | Name | Age | City |
|---|---|---|---|
| 0 | Alice | 25.0 | NY |
| 1 | Bob | NaN | LA |
| 2 | Charlie | 30.0 | Chicago |
| 3 | Diana | NaN | Houston |
df_mean_imputed["Age"] = df_mean_imputed["Age"].fillna(df_mean_imputed["Age"].mean())
df_mean_imputed
| | Name | Age | City |
|---|---|---|---|
| 0 | Alice | 25.0 | NY |
| 1 | Bob | 27.5 | LA |
| 2 | Charlie | 30.0 | Chicago |
| 3 | Diana | 27.5 | Houston |
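When the column is skewed or contains outliers, the median is often a safer fill value than the mean, because a single extreme value can pull the mean far from the typical observation. A sketch of the contrast (the fifth row with age 90 is made up here to act as the outlier):

```python
import pandas as pd

# An outlier (90) pulls the mean well above most values
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana", "Evan"],
    "Age": [25, None, 30, None, 90],
})

median_filled = df["Age"].fillna(df["Age"].median())  # fills with 30.0
mean_filled = df["Age"].fillna(df["Age"].mean())      # fills with ~48.3

print(median_filled)
print(mean_filled)
```

Here the mean-based fill assigns Bob and Diana an age higher than three of the five people, while the median-based fill stays near the bulk of the data.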
Grouped Imputation#
A more sophisticated strategy is to impute values within subgroups of your data. For example, if you are working with housing prices and some records have a missing price per square foot, you can fill those in with the average computed within the same ZIP code or neighborhood.
Example:
# Sample dataset with missing prices
housing_data = {
"ZipCode": [1001, 1001, 1002, 1002, 1002],
"SqFtPrice": [200, None, 250, None, 260]
}
df_housing = pd.DataFrame(housing_data)
df_housing
| | ZipCode | SqFtPrice |
|---|---|---|
| 0 | 1001 | 200.0 |
| 1 | 1001 | NaN |
| 2 | 1002 | 250.0 |
| 3 | 1002 | NaN |
| 4 | 1002 | 260.0 |
# Impute within each ZipCode group
df_housing["SqFtPrice"] = df_housing.groupby("ZipCode")["SqFtPrice"].transform(lambda x: x.fillna(x.mean()))
df_housing
| | ZipCode | SqFtPrice |
|---|---|---|
| 0 | 1001 | 200.0 |
| 1 | 1001 | 200.0 |
| 2 | 1002 | 250.0 |
| 3 | 1002 | 255.0 |
| 4 | 1002 | 260.0 |
Tip
It is good practice to track which records were imputed (for example, by adding an is_imputed column) so you don’t lose that information later. This flag can also be a useful feature for machine learning models, giving them a hint that these values might not be completely reliable.
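The tip above might look like this in code, using the earlier `Age` example (the column name `Age_is_imputed` is just one possible convention):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana"],
    "Age": [25, None, 30, None],
})

# Record which rows will be imputed *before* filling them
df["Age_is_imputed"] = df["Age"].isna()
df["Age"] = df["Age"].fillna(df["Age"].mean())
print(df)
```

The key detail is to capture the flag before calling `.fillna()`; afterwards, the information about which values were originally missing is gone.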