4.1.4. De-Duplication#

Now that we have learned to identify duplicates in the Duplicates section, we will look at ways to handle them.

Important

Why de-duplication comes after handling imperfections, imputation, and data type standardization:

Two records that appear different due to formatting issues (e.g., “NY” vs “New York”, “100” vs 100, extra whitespace) might actually be duplicates. After standardizing formats, correcting data types, and cleaning imperfections, duplicate detection becomes more accurate. This ensures we neither miss true duplicates nor incorrectly flag non-duplicates as duplicates.
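To make this concrete, here is a small sketch (the column names and the abbreviation mapping are illustrative, not from a specific dataset) showing how two rows that are really the same record only become detectable as duplicates after standardization:

```python
import pandas as pd

# Two rows describing the same record, hidden by formatting differences
df = pd.DataFrame({
    "City": ["NY ", "New York"],      # abbreviation + trailing whitespace
    "Amount": ["100", 100],           # string vs integer
})

# Before standardization, no duplicates are detected
print(df.duplicated().sum())  # 0

# Standardize: strip whitespace, expand abbreviations, unify the dtype
df["City"] = df["City"].str.strip().replace({"NY": "New York"})
df["Amount"] = df["Amount"].astype(int)

# Now the duplicate is visible
print(df.duplicated().sum())  # 1
```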

4.1.4.1. Dropping Duplicates#

Pandas provides the .drop_duplicates() method to easily remove duplicate rows.

  • By default, it keeps the first occurrence and removes subsequent duplicates. The keep parameter controls this behavior ('first', 'last', or False).

  • The subset parameter specifies the group of columns used to identify duplicates. By default (None), it considers all columns.

Example: Dropping Duplicates

import pandas as pd

# Create a sample DataFrame with duplicates
data = {
    "TrackId": [1, 2, 2, 2, 3, 4, 4],
    "Name": ["Song A", "Song B", "Song B", "Song B", "Song C", "Song D", "Song D"],
    "AlbumId": [10, 20, 20, 20, 30, 40, 40],
    "Milliseconds": [210000, 180000, 180000, 190000, 200000, 230000, 230000]
}

df_tracks = pd.DataFrame(data)

df_tracks
TrackId Name AlbumId Milliseconds
0 1 Song A 10 210000
1 2 Song B 20 180000
2 2 Song B 20 180000
3 2 Song B 20 190000
4 3 Song C 30 200000
5 4 Song D 40 230000
6 4 Song D 40 230000
# Drop duplicates (based on all columns, keeping first occurrence)
df_tracks_cleaned = df_tracks.drop_duplicates()

print("Shape before removing duplicates:", df_tracks.shape)
print("Shape after removing duplicates:", df_tracks_cleaned.shape)

df_tracks_cleaned
Shape before removing duplicates: (7, 4)
Shape after removing duplicates: (5, 4)
TrackId Name AlbumId Milliseconds
0 1 Song A 10 210000
1 2 Song B 20 180000
3 2 Song B 20 190000
4 3 Song C 30 200000
5 4 Song D 40 230000
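The example above uses the defaults. As a sketch of the `subset` and `keep` parameters (choosing `TrackId` as the deduplication key here is illustrative), we can instead treat any rows sharing a `TrackId` as duplicates and keep the last occurrence:

```python
import pandas as pd

# Same df_tracks as above (reconstructed so the snippet runs standalone)
df_tracks = pd.DataFrame({
    "TrackId": [1, 2, 2, 2, 3, 4, 4],
    "Name": ["Song A", "Song B", "Song B", "Song B", "Song C", "Song D", "Song D"],
    "AlbumId": [10, 20, 20, 20, 30, 40, 40],
    "Milliseconds": [210000, 180000, 180000, 190000, 200000, 230000, 230000],
})

# Rows with the same TrackId count as duplicates; keep the last one
# (this retains the 190000 ms version of Song B, unlike keep='first')
df_by_id = df_tracks.drop_duplicates(subset=["TrackId"], keep="last")

print(df_by_id.shape)  # (4, 4)
```

Note that with `subset=["TrackId"]` the two Song B rows that differ only in `Milliseconds` are also collapsed, so the choice of `subset` should reflect what counts as "the same record" in your data.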

4.1.4.2. Merging Partial Duplicate Records#

Sometimes you don’t want to simply drop duplicates, because different records may hold complementary information.

For example, suppose you have two rows with the same customer ID:

  • One row has date_of_birth filled but address missing

  • Another row has address filled but date_of_birth missing

In this case, you can merge the records to create a more complete single row.

Example: Combining Records#

You can use .groupby() + .agg() to merge duplicates. The "first" aggregation returns the first non-null value in each group, so complementary fields from different rows are combined automatically:

# Example dummy DataFrame with partial duplicates
import pandas as pd

data = {
    "CustomerID": [1, 1, 2],
    "Name": ["John Doe", "John Doe", "Jane Smith"],
    "DOB": ["1980-01-01", None, "1990-02-02"],
    "Address": [None, "123 Main St", "456 Oak St"]
}

df_customers = pd.DataFrame(data)

display(df_customers)
CustomerID Name DOB Address
0 1 John Doe 1980-01-01 None
1 1 John Doe None 123 Main St
2 2 Jane Smith 1990-02-02 456 Oak St
# Combine duplicates by CustomerID
df_merged = df_customers.groupby("CustomerID", as_index=False).agg({
    "Name": "first",
    "DOB": "first",
    "Address": "first"
})

df_merged
CustomerID Name DOB Address
0 1 John Doe 1980-01-01 123 Main St
1 2 Jane Smith 1990-02-02 456 Oak St

This process not only removes redundancy but also improves data completeness.

Tip

When cleaning duplicates, always:

  • Verify your unique keys carefully.

  • Decide whether to drop or merge records based on business logic.

  • Keep a backup of your original data.
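The checklist above can be sketched in code. This is a minimal illustration (the toy table and the candidate key `id` are made up for the example): back up first, inspect how many rows violate the candidate key, and only then drop or merge.

```python
import pandas as pd

# Toy table with a hypothetical candidate key "id"
df = pd.DataFrame({"id": [1, 1, 2], "val": ["a", "a", "b"]})

# Keep a backup before any destructive operation
df_backup = df.copy()

# Verify the candidate key: count rows that repeat an existing id
n_dupes = df.duplicated(subset=["id"]).sum()
print(n_dupes)  # 1

# After inspecting the duplicates, decide to drop (or merge) them
df_clean = df.drop_duplicates(subset=["id"])
assert df_clean["id"].is_unique      # the key now holds
assert df_backup.shape == (3, 2)     # original data preserved untouched
```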