4.1.4. De-Duplication#
Now that we have learned how to identify duplicates in the Duplicates section, we will look at ways to handle them.
Important
Why de-duplication comes after handling imperfections, imputation, and data type standardization:
Two records that appear different due to formatting issues (e.g., “NY” vs “New York”, “100” vs 100, extra whitespace) might actually be duplicates. After standardizing formats, correcting data types, and cleaning imperfections, duplicate detection becomes more accurate. This ensures we don’t miss true duplicates or incorrectly identify non-duplicates.
4.1.4.1. Dropping Duplicates#
Pandas provides the .drop_duplicates() method to easily remove duplicate rows.
By default, it keeps the first occurrence and removes subsequent duplicates. The
`keep` parameter controls this behavior (`'first'`, `'last'`, or `False`). The
`subset` parameter specifies the group of columns used to identify duplicates. By default (`None`), it considers all columns.
Example: Dropping Duplicates
import pandas as pd
# Create a sample DataFrame with duplicates
data = {
"TrackId": [1, 2, 2, 2, 3, 4, 4],
"Name": ["Song A", "Song B", "Song B", "Song B", "Song C", "Song D", "Song D"],
"AlbumId": [10, 20, 20, 20, 30, 40, 40],
"Milliseconds": [210000, 180000, 180000, 190000, 200000, 230000, 230000]
}
df_tracks = pd.DataFrame(data)
df_tracks
| TrackId | Name | AlbumId | Milliseconds | |
|---|---|---|---|---|
| 0 | 1 | Song A | 10 | 210000 |
| 1 | 2 | Song B | 20 | 180000 |
| 2 | 2 | Song B | 20 | 180000 |
| 3 | 2 | Song B | 20 | 190000 |
| 4 | 3 | Song C | 30 | 200000 |
| 5 | 4 | Song D | 40 | 230000 |
| 6 | 4 | Song D | 40 | 230000 |
# Drop duplicates (based on all columns, keeping first occurrence)
df_tracks_cleaned = df_tracks.drop_duplicates()
print("Shape before removing duplicates:", df_tracks.shape)
print("Shape after removing duplicates:", df_tracks_cleaned.shape)
df_tracks_cleaned
Shape before removing duplicates: (7, 4)
Shape after removing duplicates: (5, 4)
| TrackId | Name | AlbumId | Milliseconds | |
|---|---|---|---|---|
| 0 | 1 | Song A | 10 | 210000 |
| 1 | 2 | Song B | 20 | 180000 |
| 3 | 2 | Song B | 20 | 190000 |
| 4 | 3 | Song C | 30 | 200000 |
| 5 | 4 | Song D | 40 | 230000 |
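As a quick sketch of the `keep` and `subset` parameters described above, re-using the same sample data:

```python
import pandas as pd

# Same sample data as above
data = {
    "TrackId": [1, 2, 2, 2, 3, 4, 4],
    "Name": ["Song A", "Song B", "Song B", "Song B", "Song C", "Song D", "Song D"],
    "AlbumId": [10, 20, 20, 20, 30, 40, 40],
    "Milliseconds": [210000, 180000, 180000, 190000, 200000, 230000, 230000],
}
df_tracks = pd.DataFrame(data)

# subset: treat rows with the same TrackId as duplicates,
# even when Milliseconds differs; keep="last" retains the final occurrence
by_id = df_tracks.drop_duplicates(subset=["TrackId"], keep="last")
print(by_id.shape)  # (4, 4) -- one row per TrackId

# keep=False drops every row that has a full duplicate anywhere
unique_only = df_tracks.drop_duplicates(keep=False)
print(unique_only.shape)  # (3, 4) -- only rows with no exact twin survive
```

Note that with `subset=["TrackId"]` the two "Song B" rows with different `Milliseconds` collapse into one, which may or may not be what you want; choosing the subset is a business decision, not just a technical one.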
4.1.4.2. Merging Partial Duplicate Records#
Sometimes, you don’t want to simply drop duplicates because different records have complementary information.
For example, suppose you have two rows with the same customer ID:
One row has `date_of_birth` filled but `address` missing
Another row has `address` filled but `date_of_birth` missing
In this case, you can merge the records to create a more complete single row.
Example: Combining Records#
You can use `.groupby()` + `.agg()` to merge duplicates by taking the first non-null value in each column:
# Example dummy DataFrame with partial duplicates
import pandas as pd
data = {
"CustomerID": [1, 1, 2],
"Name": ["John Doe", "John Doe", "Jane Smith"],
"DOB": ["1980-01-01", None, "1990-02-02"],
"Address": [None, "123 Main St", "456 Oak St"]
}
df_customers = pd.DataFrame(data)
display(df_customers)
| CustomerID | Name | DOB | Address | |
|---|---|---|---|---|
| 0 | 1 | John Doe | 1980-01-01 | None |
| 1 | 1 | John Doe | None | 123 Main St |
| 2 | 2 | Jane Smith | 1990-02-02 | 456 Oak St |
# Combine duplicates by CustomerID ("first" returns the first non-null value per column)
df_merged = df_customers.groupby("CustomerID", as_index=False).agg({
"Name": "first",
"DOB": "first",
"Address": "first"
})
df_merged
| CustomerID | Name | DOB | Address | |
|---|---|---|---|---|
| 0 | 1 | John Doe | 1980-01-01 | 123 Main St |
| 1 | 2 | Jane Smith | 1990-02-02 | 456 Oak St |
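The merge works because pandas' GroupBy `"first"` aggregation returns the first *non-null* value in each column, not literally the first row's value. A minimal sketch confirming this on the same dummy data, using the equivalent `.first()` shorthand:

```python
import pandas as pd

data = {
    "CustomerID": [1, 1, 2],
    "Name": ["John Doe", "John Doe", "Jane Smith"],
    "DOB": ["1980-01-01", None, "1990-02-02"],
    "Address": [None, "123 Main St", "456 Oak St"],
}
df = pd.DataFrame(data)

# GroupBy.first() skips nulls, so the merged row for CustomerID 1
# combines DOB from row 0 with Address from row 1
merged = df.groupby("CustomerID", as_index=False).first()
print(merged)
```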
This process not only removes redundancy but also improves data completeness.
Tip
When cleaning duplicates, always:
Verify your unique keys carefully.
Decide whether to drop or merge records based on business logic.
Keep a backup of your original data.
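The checklist above can be sketched in code; the DataFrame and key column here are hypothetical stand-ins:

```python
import pandas as pd

df = pd.DataFrame({"CustomerID": [1, 1, 2], "Name": ["A", "A", "B"]})

# 1. Keep a backup of the original data before mutating
df_backup = df.copy()

# 2. Verify the candidate key: count rows flagged as duplicates
n_dups = df.duplicated(subset=["CustomerID"]).sum()
print(n_dups)  # 1

# 3. Drop (or merge) only once the count matches expectations
df_clean = df.drop_duplicates(subset=["CustomerID"])
print(len(df_clean) + n_dups == len(df_backup))  # True
```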