5.2.1. Structural Data Issues#
Often, the datasets we work with are not perfect. During data collection, some values may be missing or recorded multiple times. We examine these issues in this section.
5.2.1.1. Duplicates#
Often, datasets are combined from multiple sources or collected repeatedly over time, which can lead to duplicate records. These duplicates are problematic for analysis because they distort the true distribution of your data. Therefore, we should aim to identify them first.
Identifying Unique Constraints#
The first step is to determine what makes a record unique.
- Some datasets have a dedicated unique ID column (like `CustomerID`).
- In other cases, you might need to define a combination of fields that together create a unique identity (for example, `Name + DOB + City`).
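Before treating a column (or combination of columns) as the unique constraint, it is worth verifying that it actually is unique. A minimal sketch, using illustrative data and pandas' `is_unique` property:

```python
import pandas as pd

# Illustrative data: CustomerID 102 is accidentally recorded twice
df = pd.DataFrame({
    "CustomerID": [101, 102, 102],
    "Name": ["Ann", "Bob", "Bob"],
})

# Does the candidate ID column uniquely identify each record?
print(df["CustomerID"].is_unique)  # False: 102 appears twice

# A composite candidate key can be checked the same way
print(df.set_index(["Name", "CustomerID"]).index.is_unique)
```

If the check returns `False`, the candidate key does not uniquely identify records, and we need to either pick a different key or investigate the duplicates.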
Identifying Records with the Same Unique Constraints#
Once we define the unique constraint, we can check if multiple records share it. All the records with the same unique constraint can be termed as duplicates. We should aim to keep only one (the original) record and remove the others.
In pandas, we can use the `df.duplicated()` method to identify duplicates. It takes two optional parameters:

- `subset` (default: `None`): Specifies the column(s) that make up the unique constraint. If not passed, all columns are considered.
- `keep` (default: `"first"`): Defines which duplicate to keep.
  - `"first"` → keeps the first occurrence, marks others as duplicates.
  - `"last"` → keeps the last occurrence, marks others as duplicates.
  - `False` → marks all occurrences as duplicates.
The returned result is a boolean Series that indicates which records are duplicates. We can use this to filter or remove them.
Example: Identifying Duplicates
```python
import pandas as pd

# Create a sample DataFrame with duplicates
data = {
    "TrackId": [1, 2, 2, 2, 3, 4, 4],
    "Name": ["Song A", "Song B", "Song B", "Song B", "Song C", "Song D", "Song D"],
    "AlbumId": [10, 20, 20, 20, 30, 40, 40],
    "Milliseconds": [210000, 180000, 180000, 190000, 200000, 230000, 230000]
}
df_tracks = pd.DataFrame(data)
df_tracks
```
| | TrackId | Name | AlbumId | Milliseconds |
|---|---|---|---|---|
| 0 | 1 | Song A | 10 | 210000 |
| 1 | 2 | Song B | 20 | 180000 |
| 2 | 2 | Song B | 20 | 180000 |
| 3 | 2 | Song B | 20 | 190000 |
| 4 | 3 | Song C | 30 | 200000 |
| 5 | 4 | Song D | 40 | 230000 |
| 6 | 4 | Song D | 40 | 230000 |
```python
# Check for duplicates based on all columns
duplicates = df_tracks[df_tracks.duplicated()]
print("Duplicate rows:\n", duplicates)
```

```text
Duplicate rows:
    TrackId    Name  AlbumId  Milliseconds
2        2  Song B       20        180000
6        4  Song D       40        230000
```
```python
# Check for duplicates based on only TrackId
duplicates = df_tracks[df_tracks.duplicated(subset=['TrackId'])]
print("Duplicate rows:\n", duplicates)
```

```text
Duplicate rows:
    TrackId    Name  AlbumId  Milliseconds
2        2  Song B       20        180000
3        2  Song B       20        190000
6        4  Song D       40        230000
```
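The `keep` parameter changes which occurrences get flagged. As a sketch on the same sample data: `keep=False` marks every row that shares a `TrackId`, which is handy for inspecting all copies side by side, while `keep="last"` flags everything except the final occurrence.

```python
import pandas as pd

# Same sample data as above
data = {
    "TrackId": [1, 2, 2, 2, 3, 4, 4],
    "Name": ["Song A", "Song B", "Song B", "Song B", "Song C", "Song D", "Song D"],
    "AlbumId": [10, 20, 20, 20, 30, 40, 40],
    "Milliseconds": [210000, 180000, 180000, 190000, 200000, 230000, 230000]
}
df_tracks = pd.DataFrame(data)

# keep=False marks ALL rows sharing a TrackId, including the first occurrence
all_copies = df_tracks[df_tracks.duplicated(subset=["TrackId"], keep=False)]
print(all_copies)

# keep="last" marks every occurrence except the last one
earlier_copies = df_tracks[df_tracks.duplicated(subset=["TrackId"], keep="last")]
print(earlier_copies)
```

Note that with `keep=False` five rows are flagged (all copies of TrackId 2 and 4), whereas the default `keep="first"` flags only three.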
We will look at how we can tackle duplicates in De-Duplication.
5.2.1.2. Null Values#
Ideally, every feature in every record would have a value. However, real-world datasets are rarely complete.
For many reasons, such as human error, missing fields in older records, or users simply not providing information, NULL values often appear in data. Handling them properly is critical for reliable analysis.
Identifying Null Values#
Pandas provides the .isna() method (with .isnull() as an alias) to check for missing values.
It returns a boolean DataFrame indicating whether each value is NaN or not.
Note
Empty strings ('') are not considered null values.
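A quick sketch illustrating this note: an empty string is treated as a valid (non-null) value by `.isna()`, while both `None` and `np.nan` are flagged as missing.

```python
import pandas as pd
import numpy as np

s = pd.Series(["Alice", "", None, np.nan])

# Only None and np.nan are flagged; the empty string is a valid value
print(s.isna())
```

If empty strings in your data really mean "missing", a common approach is to replace them first, e.g. `s.replace("", np.nan)`, before counting nulls.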
Example
```python
import pandas as pd

data = {
    "Name": ["Alice", "Bob", "Charlie", "Diana"],
    "Age": [25, None, 30, None],
    "City": ["NY", "LA", "Chicago", "Houston"]
}
df = pd.DataFrame(data)
df
```
| | Name | Age | City |
|---|---|---|---|
| 0 | Alice | 25.0 | NY |
| 1 | Bob | NaN | LA |
| 2 | Charlie | 30.0 | Chicago |
| 3 | Diana | NaN | Houston |
```python
# Identify null values
df.isna()
```
| | Name | Age | City |
|---|---|---|---|
| 0 | False | False | False |
| 1 | False | True | False |
| 2 | False | False | False |
| 3 | False | True | False |
If we want to know the number of rows that have null or non-null values, we can use the following:
```python
# Count of null values per column
display(df.isna().sum())

# Count of non-null values per column
display(df.notna().sum())
```

```text
Name    0
Age     2
City    0
dtype: int64
Name    4
Age     2
City    4
dtype: int64
```
Tip
A nice way to do this is by using df.info(), as discussed in Example: Using .info().
We will learn about handling these null values in the Null Handling and Imputation section.