4.1.3. Data Type Handling#

We saw in Data Standardization Issues that real-world datasets often contain imperfect or inconsistent values.
To ensure consistency and reliability, we need to clean both numeric and string data so that our analysis and models work on standardized inputs.

To resolve these issues, we generally follow these steps:

  1. Inspect the unique values in the column using methods like df["column"].unique()

  2. Decide how to map them to a clean, standardized format

  3. Apply transformations (e.g., .str.lower(), mapping dictionaries, or parsing functions)

Sometimes, grouping and cleaning can be automated (e.g., converting to lowercase). Other times, you may need to create a manual mapping to unify categories.
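The inspection step can be sketched as follows, using a hypothetical City column with inconsistent entries:

```python
import pandas as pd

# Hypothetical column where five raw values describe only two cities
df = pd.DataFrame({"City": ["Miami", "miami", "MIA", "Orlando", "orlando"]})

# Step 1: inspect the distinct values and how often each occurs
print(df["City"].unique())
print(df["City"].value_counts())
```

Here, a glance at the output of `unique()` is often enough to decide whether automated lowercasing suffices or a manual mapping (e.g., for "MIA") is needed.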

4.1.3.1. Numeric Issues#

Consider a numeric column like Age with values such as:

  • 12 (already numeric)

  • "32" (numeric string)

  • "one" (non-numeric string)

To standardize such a column, the goal is:

  1. If the value is already an integer → keep it as is.

  2. If it is a numeric string → parse it to an integer using pd.to_numeric.

  3. If it is a non-numeric string (like "one", "thirty-six"):

    • If such cases are rare → create a mapping, e.g. { "one": 1, "thirty-six": 36 }.

    • If frequent and diverse → write a custom parsing function (though this may fail for very unconventional strings).

Finally, if edge cases remain sparse and non-standard, manual cleaning may be the most efficient solution.

Example: Fixing Numeric Issues#

import pandas as pd

data = {"Age": [12, "32", "one", "thirty-six", "45"]}
df = pd.DataFrame(data)
display(df)
Age
0 12
1 32
2 one
3 thirty-six
4 45
# Convert numeric-looking values
df["Age_numeric"] = pd.to_numeric(df["Age"], errors="coerce")
display(df)
Age Age_numeric
0 12 12.0
1 32 32.0
2 one NaN
3 thirty-six NaN
4 45 45.0
# Handle special string cases with mapping
mapping = {"one": 1, "thirty-six": 36}
df["Age_numeric"] = df["Age_numeric"].fillna(df["Age"].map(mapping))

display(df)
Age Age_numeric
0 12 12.0
1 32 32.0
2 one 1.0
3 thirty-six 36.0
4 45 45.0
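When word-form numbers are too frequent for a manual mapping, the custom parsing function mentioned above is an option. Below is a minimal sketch; the WORDS table is a hypothetical hand-built lookup that would need to be extended to cover the words in your data:

```python
import pandas as pd

# Hypothetical word table -- extend it to cover the words in your data
WORDS = {"one": 1, "two": 2, "six": 6, "thirty": 30}

def parse_age(value):
    """Best-effort conversion of an Age entry to an integer, else None."""
    num = pd.to_numeric(value, errors="coerce")  # handles 12 and "32"
    if pd.notna(num):
        return int(num)
    # Handle simple hyphenated word forms like "thirty-six" by summing the parts
    parts = str(value).lower().split("-")
    if all(p in WORDS for p in parts):
        return sum(WORDS[p] for p in parts)
    return None  # unconventional strings still fail and need manual review

df = pd.DataFrame({"Age": [12, "32", "one", "thirty-six", "45"]})
df["Age_numeric"] = df["Age"].map(parse_age)
```

As the text notes, such a parser will still fail on very unconventional strings; returning None keeps those rows visible for manual cleaning.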

4.1.3.2. String Issues#

String columns can broadly be of two types:

  1. Free-form strings (e.g., names, comments): These are often noisy and may not add much to structured analysis.

  2. Categorical strings (e.g., city names, states, genres): These are very valuable for analysis but may be inconsistent.

For instance, consider a State column in U.S. data:

  • "Florida"

  • "florida"

  • "FL"

  • "FL, USA"

All of these should ideally map to the same category → "Florida".

To fix this, we can:

  • Normalize case (lowercase or uppercase)

  • Strip unwanted suffixes (e.g., the text after a comma)

  • Map known variants manually, if their frequency is low

Example: Fixing String Issues#

data = {
    "State": ["Florida", "florida", "FL", "FL, USA", "Texas", "tx", "California"]
}
df_states = pd.DataFrame(data)
display(df_states)
State
0 Florida
1 florida
2 FL
3 FL, USA
4 Texas
5 tx
6 California
# Normalize case
df_states["State_cleaned"] = df_states["State"].str.lower()

# Remove suffix after comma
df_states["State_cleaned"] = df_states["State_cleaned"].str.split(",").str[0]

# Manual mapping for abbreviations
state_mapping = {"fl": "florida", "tx": "texas"}
df_states["State_cleaned"] = df_states["State_cleaned"].replace(state_mapping)

# Capitalize for neatness
df_states["State_cleaned"] = df_states["State_cleaned"].str.title()

display(df_states)
State State_cleaned
0 Florida Florida
1 florida Florida
2 FL Florida
3 FL, USA Florida
4 Texas Texas
5 tx Texas
6 California California

Now all the variants collapse into neat, consistent categories like "Florida", "Texas", "California".

Tip

Often, you can use fuzzy matching to guess the most likely correct values. Fuzzy matching works by calculating similarity scores based on the overlap between your input and a set of expected values.


In Python, a popular library for this is fuzzywuzzy (now maintained under the name thefuzz).
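Python's standard library also offers difflib.get_close_matches, which is enough for a minimal sketch. The list of expected states below is a hypothetical reference set you would build from your own domain knowledge:

```python
from difflib import get_close_matches

# Hypothetical reference set of expected clean values
EXPECTED = ["florida", "texas", "california"]

def fuzzy_state(raw, cutoff=0.6):
    """Return the closest expected state, or the input unchanged if none is close."""
    matches = get_close_matches(raw.lower(), EXPECTED, n=1, cutoff=cutoff)
    return matches[0] if matches else raw

print(fuzzy_state("Flrida"))      # missing letter
print(fuzzy_state("califronia"))  # transposed letters
```

The cutoff parameter controls how strict the match must be; set it too low and distinct values may be merged incorrectly, so spot-check the results.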

4.1.3.3. Other Issues#

Sometimes, certain columns require special formats. Examples include dates, addresses, and other structured fields. In such cases, we often need to come up with creative solutions to standardize these values and salvage as much data as possible.

Example: Standardizing Date Formats#

Suppose we have a dataset where dates are stored inconsistently:

data = {
    "Event": ["A", "B", "C", "D"],
    "EventDate": ["2024-01-15", "15/02/2024", "March 3, 2024", "2024.04.01"]
}

df = pd.DataFrame(data)

df
Event EventDate
0 A 2024-01-15
1 B 15/02/2024
2 C March 3, 2024
3 D 2024.04.01

As we can see, there are four different formats:

  • ISO (YYYY-MM-DD)

  • European (DD/MM/YYYY)

  • Textual month

  • Dotted format

You can try to standardize all dates using pd.to_datetime(), which attempts to infer the format and parse the values — though, as the output below shows, a single pass may not handle every format:

df["EventDate_Parsed"] = pd.to_datetime(df["EventDate"], dayfirst=True, errors="coerce")

df
/tmp/ipykernel_2659/4161777420.py:1: UserWarning: Parsing dates in %Y-%m-%d format when dayfirst=True was specified. Pass `dayfirst=False` or specify a format to silence this warning.
  df["EventDate_Parsed"] = pd.to_datetime(df["EventDate"], dayfirst=True, errors="coerce")
Event EventDate EventDate_Parsed
0 A 2024-01-15 2024-01-15
1 B 15/02/2024 NaT
2 C March 3, 2024 NaT
3 D 2024.04.01 NaT

Note

  • dayfirst=True helps interpret ambiguous day-first formats like 15/02/2024, although, as the warning shows, it conflicts with the ISO-formatted entries here.

  • errors="coerce" will convert unparseable entries to NaT so you can catch them.

Once the remaining NaT entries are handled — for example with format-specific parsing or manual fixes — all dates end up in a consistent, usable datetime format for further analysis or modeling.
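One practical way to salvage the NaT rows is to try each format you observed in the column explicitly and fill the gaps pass by pass — a minimal sketch using the same data:

```python
import pandas as pd

df = pd.DataFrame({
    "Event": ["A", "B", "C", "D"],
    "EventDate": ["2024-01-15", "15/02/2024", "March 3, 2024", "2024.04.01"],
})

# The four formats observed while inspecting the column
formats = ["%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y", "%Y.%m.%d"]

parsed = pd.Series(pd.NaT, index=df.index)
for fmt in formats:
    # Each pass parses only the rows matching fmt; the rest stay NaT for later passes
    parsed = parsed.fillna(pd.to_datetime(df["EventDate"], format=fmt, errors="coerce"))

df["EventDate_Parsed"] = parsed
print(df)
```

After the loop, any rows that are still NaT use a format not yet accounted for and can be isolated with df[df["EventDate_Parsed"].isna()] for manual inspection.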

Note

Based on your data, you should come up with small heuristics or tricks that serve your purpose. These solutions may not always be elegant, but they are often fast, practical, and effective at improving the quality of your dataset.