4.1.3. Data Type Handling#

We saw in Data Standardization Issues that real-world datasets often contain imperfect or inconsistent values.
To ensure consistency and reliability, we need to clean both numeric and string data so that our analysis and models work on standardized inputs.

To resolve these issues, we generally follow these steps:

  1. Inspect the unique values in the column using methods like df["column"].unique()

  2. Decide how to map them to a clean, standardized format

  3. Apply transformations (e.g., .str.lower(), mapping dictionaries, or parsing functions)

Sometimes, grouping and cleaning can be automated (e.g., converting to lowercase). Other times, you may need to create a manual mapping to unify categories.
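The inspection step can be sketched as follows, using a hypothetical City column with inconsistent entries:

```python
import pandas as pd

# Hypothetical column where five raw values describe only two cities
df = pd.DataFrame({"City": ["Miami", "miami", "MIA", "Orlando", "orlando"]})

# Step 1: inspect the distinct values and how often each occurs
print(df["City"].unique())
print(df["City"].value_counts())
```

Here, a glance at the output of `unique()` is often enough to decide whether automated lowercasing suffices or a manual mapping (e.g., for "MIA") is needed.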

4.1.3.1. Numeric Issues#

Consider a numeric column like Age with values such as:

  • 12 (already numeric)

  • "32" (numeric string)

  • "one" (non-numeric string)

To standardize such a column, the goal is:

  1. If the value is already an integer → keep it as is.

  2. If it is a numeric string → parse it to an integer using pd.to_numeric.

  3. If it is a non-numeric string (like "one", "thirty-six"):

    • If such cases are rare → create a mapping, e.g. { "one": 1, "thirty-six": 36 }.

    • If frequent and diverse → write a custom parsing function (though this may fail for very unconventional strings).

Finally, if edge cases remain sparse and non-standard, manual cleaning may be the most efficient solution.

Example: Fixing Numeric Issues#

import pandas as pd

data = {"Age": [12, "32", "one", "thirty-six", "45"]}
df = pd.DataFrame(data)
display(df)
Age
0 12
1 32
2 one
3 thirty-six
4 45
# Convert numeric-looking values
df["Age_numeric"] = pd.to_numeric(df["Age"], errors="coerce")
display(df)
Age Age_numeric
0 12 12.0
1 32 32.0
2 one NaN
3 thirty-six NaN
4 45 45.0
# Handle special string cases with mapping
mapping = {"one": 1, "thirty-six": 36}
df["Age_numeric"] = df["Age_numeric"].fillna(df["Age"].map(mapping))

display(df)
Age Age_numeric
0 12 12.0
1 32 32.0
2 one 1.0
3 thirty-six 36.0
4 45 45.0
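When word-form numbers are too frequent for a manual mapping, the custom parsing function mentioned above is an option. Below is a minimal sketch; the WORDS table is a hypothetical hand-built lookup that would need to be extended to cover the words in your data:

```python
import pandas as pd

# Hypothetical word table -- extend it to cover the words in your data
WORDS = {"one": 1, "two": 2, "six": 6, "thirty": 30}

def parse_age(value):
    """Best-effort conversion of an Age entry to an integer, else None."""
    num = pd.to_numeric(value, errors="coerce")  # handles 12 and "32"
    if pd.notna(num):
        return int(num)
    # Handle simple hyphenated word forms like "thirty-six" by summing the parts
    parts = str(value).lower().split("-")
    if all(p in WORDS for p in parts):
        return sum(WORDS[p] for p in parts)
    return None  # unconventional strings still fail and need manual review

df = pd.DataFrame({"Age": [12, "32", "one", "thirty-six", "45"]})
df["Age_numeric"] = df["Age"].map(parse_age)
```

As the text notes, such a parser will still fail on very unconventional strings; returning None keeps those rows visible for manual cleaning.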

4.1.3.2. String Issues#

String columns can broadly be of two types:

  1. Free-form strings (e.g., names, comments): These are often noisy and may not add much to structured analysis.

  2. Categorical strings (e.g., city names, states, genres): These are very valuable for analysis but may be inconsistent.

For instance, consider a State column in U.S. data:

  • "Florida"

  • "florida"

  • "FL"

  • "FL, USA"

All of these should ideally map to the same category → "Florida".

To fix this, we can:

  • Normalize case (lowercase or uppercase)

  • Strip unwanted suffixes (e.g., the text after a comma)

  • Map known variants manually, if their frequency is low

Example: Fixing String Issues#

data = {
    "State": ["Florida", "florida", "FL", "FL, USA", "Texas", "tx", "California"]
}
df_states = pd.DataFrame(data)
display(df_states)
State
0 Florida
1 florida
2 FL
3 FL, USA
4 Texas
5 tx
6 California
# Normalize case
df_states["State_cleaned"] = df_states["State"].str.lower()

# Remove suffix after comma
df_states["State_cleaned"] = df_states["State_cleaned"].str.split(",").str[0]

# Manual mapping for abbreviations
state_mapping = {"fl": "florida", "tx": "texas"}
df_states["State_cleaned"] = df_states["State_cleaned"].replace(state_mapping)

# Capitalize for neatness
df_states["State_cleaned"] = df_states["State_cleaned"].str.title()

display(df_states)
State State_cleaned
0 Florida Florida
1 florida Florida
2 FL Florida
3 FL, USA Florida
4 Texas Texas
5 tx Texas
6 California California

Now all the variants collapse into neat, consistent categories like "Florida", "Texas", "California".

Tip

Often, you can use fuzzy matching to guess the most likely correct values. Fuzzy matching works by calculating similarity scores based on the overlap between your input and a set of expected values.


In Python, a popular library for this is fuzzywuzzy (now maintained under the name thefuzz).
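Python's standard library also offers difflib.get_close_matches, which is enough for a minimal sketch. The list of expected states below is a hypothetical reference set you would build from your own domain knowledge:

```python
from difflib import get_close_matches

# Hypothetical reference set of expected clean values
EXPECTED = ["florida", "texas", "california"]

def fuzzy_state(raw, cutoff=0.6):
    """Return the closest expected state, or the input unchanged if none is close."""
    matches = get_close_matches(raw.lower(), EXPECTED, n=1, cutoff=cutoff)
    return matches[0] if matches else raw

print(fuzzy_state("Flrida"))      # missing letter
print(fuzzy_state("califronia"))  # transposed letters
```

The cutoff parameter controls how strict the match must be; set it too low and distinct values may be merged incorrectly, so spot-check the results.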

4.1.3.3. Other Issues#

Sometimes, certain columns require special formats. Examples include dates, addresses, and other structured fields. In such cases, we often need to come up with creative solutions to standardize these values and salvage as much data as possible.

Example: Standardizing Date Formats#

Suppose we have a dataset where dates are stored inconsistently:

data = {
    "Event": ["A", "B", "C", "D"],
    "EventDate": ["2024-01-15", "15/02/2024", "March 3, 2024", "2024.04.01"]
}

df = pd.DataFrame(data)

df
Event EventDate
0 A 2024-01-15
1 B 15/02/2024
2 C March 3, 2024
3 D 2024.04.01

As we can see, there are four different formats:

  • ISO (YYYY-MM-DD)

  • European (DD/MM/YYYY)

  • Textual month

  • Dotted format

You can try to standardize all dates using pd.to_datetime(), which attempts to infer the format and parse the values — though, as the output below shows, a single pass may not handle every format:

df["EventDate_Parsed"] = pd.to_datetime(df["EventDate"], dayfirst=True, errors="coerce")

df
/tmp/ipykernel_2659/4161777420.py:1: UserWarning: Parsing dates in %Y-%m-%d format when dayfirst=True was specified. Pass `dayfirst=False` or specify a format to silence this warning.
  df["EventDate_Parsed"] = pd.to_datetime(df["EventDate"], dayfirst=True, errors="coerce")
Event EventDate EventDate_Parsed
0 A 2024-01-15 2024-01-15
1 B 15/02/2024 NaT
2 C March 3, 2024 NaT
3 D 2024.04.01 NaT

Note

  • dayfirst=True helps interpret ambiguous day-first formats like 15/02/2024, although, as the warning shows, it conflicts with the ISO-formatted entries here.

  • errors="coerce" will convert unparseable entries to NaT so you can catch them.

Once the remaining NaT entries are handled — for example with format-specific parsing or manual fixes — all dates end up in a consistent, usable datetime format for further analysis or modeling.
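One practical way to salvage the NaT rows is to try each format you observed in the column explicitly and fill the gaps pass by pass — a minimal sketch using the same data:

```python
import pandas as pd

df = pd.DataFrame({
    "Event": ["A", "B", "C", "D"],
    "EventDate": ["2024-01-15", "15/02/2024", "March 3, 2024", "2024.04.01"],
})

# The four formats observed while inspecting the column
formats = ["%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y", "%Y.%m.%d"]

parsed = pd.Series(pd.NaT, index=df.index)
for fmt in formats:
    # Each pass parses only the rows matching fmt; the rest stay NaT for later passes
    parsed = parsed.fillna(pd.to_datetime(df["EventDate"], format=fmt, errors="coerce"))

df["EventDate_Parsed"] = parsed
print(df)
```

After the loop, any rows that are still NaT use a format not yet accounted for and can be isolated with df[df["EventDate_Parsed"].isna()] for manual inspection.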

Note

Based on your data, you should come up with small heuristics or tricks that serve your purpose. These solutions may not always be elegant, but they are often fast, practical, and effective at improving the quality of your dataset.