5.2.1. Structural Data Issues#

Often, the datasets we work with are not perfect. During data collection, some values may be missing or recorded multiple times. We examine these issues in this section.

5.2.1.1. Duplicates#

Often, datasets are combined from multiple sources or collected repeatedly over time, which can lead to duplicate records. These duplicates are problematic for analysis because they distort the true distribution of your data. Therefore, we should aim to identify them first.

Identifying Unique Constraints#

The first step is to determine what makes a record unique.

  • Some datasets have a dedicated unique ID column (like CustomerID).

  • In other cases, you might need to define a combination of fields that together create a unique identity (for example, Name + DOB + City).
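To sanity-check a candidate composite key, one simple approach (a sketch with hypothetical data) is to count how many records share each key combination; any count above 1 means the combination does not uniquely identify a record:

```python
import pandas as pd

# Hypothetical records with no dedicated ID column
people = pd.DataFrame({
    "Name": ["Ana", "Ana", "Ben"],
    "DOB": ["1990-01-01", "1990-01-01", "1985-05-05"],
    "City": ["NY", "NY", "LA"],
})

# Count records per candidate key (Name + DOB + City)
key_counts = people.groupby(["Name", "DOB", "City"]).size()

# Any key shared by more than one record breaks uniqueness
print(key_counts[key_counts > 1])
```

Here the combination ("Ana", "1990-01-01", "NY") appears twice, so either those rows are duplicates or the candidate key needs another field.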

Identifying Records with the Same Unique Constraints#

Once we define the unique constraint, we can check whether multiple records share it. Records that share the same unique constraint are considered duplicates. We should keep only one (the original) record and remove the others.

In pandas, we can use the df.duplicated() method to identify duplicates. It takes two optional parameters:

  1. subset (default: None): Specifies the column(s) that make up the unique constraint. If not passed, all columns are considered.

  2. keep (default: "first"): Defines which duplicate to keep.

    • "first" → keeps the first occurrence, marks others as duplicates.

    • "last" → keeps the last occurrence, marks others as duplicates.

    • False → marks all occurrences as duplicates.

The returned result is a boolean Series that indicates which records are duplicates. We can use this to filter or remove them.

Example: Identifying Duplicates

```python
import pandas as pd

# Create a sample DataFrame with duplicates
data = {
    "TrackId": [1, 2, 2, 2, 3, 4, 4],
    "Name": ["Song A", "Song B", "Song B", "Song B", "Song C", "Song D", "Song D"],
    "AlbumId": [10, 20, 20, 20, 30, 40, 40],
    "Milliseconds": [210000, 180000, 180000, 190000, 200000, 230000, 230000]
}

df_tracks = pd.DataFrame(data)
df_tracks
```

|   | TrackId | Name   | AlbumId | Milliseconds |
|---|---------|--------|---------|--------------|
| 0 | 1       | Song A | 10      | 210000       |
| 1 | 2       | Song B | 20      | 180000       |
| 2 | 2       | Song B | 20      | 180000       |
| 3 | 2       | Song B | 20      | 190000       |
| 4 | 3       | Song C | 30      | 200000       |
| 5 | 4       | Song D | 40      | 230000       |
| 6 | 4       | Song D | 40      | 230000       |

```python
# Check for duplicates based on all columns
duplicates = df_tracks[df_tracks.duplicated()]
print("Duplicate rows:\n", duplicates)
```

```text
Duplicate rows:
    TrackId    Name  AlbumId  Milliseconds
2        2  Song B       20        180000
6        4  Song D       40        230000
```

```python
# Check for duplicates based on only TrackId
duplicates = df_tracks[df_tracks.duplicated(subset=['TrackId'])]
print("Duplicate rows:\n", duplicates)
```

```text
Duplicate rows:
    TrackId    Name  AlbumId  Milliseconds
2        2  Song B       20        180000
3        2  Song B       20        190000
6        4  Song D       40        230000
```
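To see how the keep parameter changes the result, here is a quick sketch on the same sample data: keep="last" marks earlier occurrences instead of later ones, and keep=False marks every member of each duplicate group, including the first occurrence.

```python
import pandas as pd

# Same sample data as above (only the columns needed here)
df_tracks = pd.DataFrame({
    "TrackId": [1, 2, 2, 2, 3, 4, 4],
    "Name": ["Song A", "Song B", "Song B", "Song B", "Song C", "Song D", "Song D"],
})

# keep="last": the last occurrence is kept, earlier ones are marked
print(df_tracks.duplicated(subset=["TrackId"], keep="last").tolist())
# → [False, True, True, False, False, True, False]

# keep=False: every member of a duplicate group is marked
print(df_tracks.duplicated(subset=["TrackId"], keep=False).tolist())
# → [False, True, True, True, False, True, True]
```

keep=False is handy when you want to inspect all copies side by side before deciding which one to treat as the original.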

We will look at how we can tackle duplicates in De-Duplication.

5.2.1.2. Null Values#

Ideally, every feature in every record would have a value. However, real-world datasets are rarely complete.

For many reasons, such as human error, fields missing from older records, or users simply not providing information, null values often appear in data. Handling them properly is critical for reliable analysis.

Identifying Null Values#

Pandas provides the .isna() method (with .isnull() as an alias) to check for missing values. It returns a boolean DataFrame indicating whether each value is NaN or not.

Note

Empty strings ('') are not considered null values.
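As a quick illustration of this note (a sketch with made-up values), .isna() flags None and NaN but not an empty string:

```python
import pandas as pd

# An empty string is a real value; None and NaN are missing
s = pd.Series(["", None, float("nan"), "text"])
print(s.isna().tolist())
# → [False, True, True, False]
```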

Example

```python
import pandas as pd

data = {
    "Name": ["Alice", "Bob", "Charlie", "Diana"],
    "Age": [25, None, 30, None],
    "City": ["NY", "LA", "Chicago", "Houston"]
}

df = pd.DataFrame(data)
df
```

|   | Name    | Age  | City    |
|---|---------|------|---------|
| 0 | Alice   | 25.0 | NY      |
| 1 | Bob     | NaN  | LA      |
| 2 | Charlie | 30.0 | Chicago |
| 3 | Diana   | NaN  | Houston |

```python
# Identify null values
df.isna()
```

|   | Name  | Age   | City  |
|---|-------|-------|-------|
| 0 | False | False | False |
| 1 | False | True  | False |
| 2 | False | False | False |
| 3 | False | True  | False |

If we want to know the number of null or non-null values in each column, we can use the following:

```python
# Count of null values per column
display(df.isna().sum())

# Count of non-null values per column
display(df.notna().sum())
```

```text
Name    0
Age     2
City    0
dtype: int64
```

```text
Name    4
Age     2
City    4
dtype: int64
```

Tip

A nice way to do this is by using df.info(), as discussed in Example: Using .info().

We will learn about handling these null values in the Null Handling and Imputation section.