4.1.5. Declutter Features

In real-world datasets, features are not always neatly organized. Sometimes, a single column may contain nested, combined, or complex information that needs to be split or transformed into multiple meaningful features. This process is often referred to as feature decluttering, and it is an essential step to make your dataset more structured and model-ready.

Here are some common scenarios and strategies:

4.1.5.1. Splitting Arrays or Nested Attributes

Sometimes, a single column may store multiple values in an array or a list-like structure. These can be split into separate features to make them usable.

Example: Suppose we have a column Genres containing multiple genres per track:

import pandas as pd

data = {
    "TrackId": [1, 2, 3],
    "Genres": [["Rock", "Blues"], ["Pop"], ["Jazz", "Soul", "Funk"]]
}
df = pd.DataFrame(data)
display(df)
   TrackId              Genres
0        1       [Rock, Blues]
1        2               [Pop]
2        3  [Jazz, Soul, Funk]

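We create one boolean column per genre (a multi-hot encoding, since a track can belong to several genres), so that a model can use each genre as a separate feature. Because the genres are collected in a set, the column order is not guaranteed.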

# Expand lists into separate boolean features
all_genres = set(g for sublist in df["Genres"] for g in sublist)
for genre in all_genres:
    df[genre] = df["Genres"].apply(lambda x: genre in x)

df.drop("Genres", axis=1, inplace=True)
display(df)
   TrackId   Rock   Jazz    Pop   Soul   Funk  Blues
0        1   True  False  False  False  False   True
1        2  False  False   True  False  False  False
2        3  False   True  False   True   True  False
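
As an alternative sketch (not the method used above), the same boolean flags can be built without an explicit loop by exploding the lists and pivoting with `pd.get_dummies`. The names `exploded`, `flags`, and `result` are illustrative. Columns come out in alphabetical order, which also makes the result reproducible across runs.

```python
import pandas as pd

df = pd.DataFrame({
    "TrackId": [1, 2, 3],
    "Genres": [["Rock", "Blues"], ["Pop"], ["Jazz", "Soul", "Funk"]],
})

# One row per (track, genre); the original row index is repeated.
exploded = df["Genres"].explode()

# One indicator column per genre, then collapse back to one row per track:
# a track has a genre if any of its exploded rows is flagged.
flags = pd.get_dummies(exploded).groupby(level=0).any()

result = df.drop(columns="Genres").join(flags)
print(result)
```

This keeps the original `df` intact and returns a new frame, which tends to be easier to reason about than dropping columns in place.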

4.1.5.2. Handling Comma-Separated Values

Sometimes multiple categories are stored as a comma-separated string. Similar to arrays, these can be split and converted into boolean flags.

Example:

data = {"TrackId": [1, 2, 3], "Tags": ["rock,blues", "pop", "jazz,soul,funk"]}
df = pd.DataFrame(data)
display(df)
   TrackId            Tags
0        1      rock,blues
1        2             pop
2        3  jazz,soul,funk

# Split and create boolean flags
unique_tags = set(tag.strip() for tags in df["Tags"] for tag in tags.split(","))
for tag in unique_tags:
    # Strip whitespace here too, so the test matches how unique_tags was built
    df[tag] = df["Tags"].apply(lambda x: tag in [t.strip() for t in x.split(",")])

df.drop("Tags", axis=1, inplace=True)
display(df)
   TrackId   rock  blues   jazz    pop   soul   funk
0        1   True   True  False  False  False  False
1        2  False  False  False   True  False  False
2        3  False  False   True  False   True   True

Now each tag has its own column, which can be used in modeling as binary indicators.
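
For delimiter-separated strings, pandas also offers `Series.str.get_dummies`, which does the split and the indicator columns in one call. The sketch below is an alternative to the loop, not the original approach. The sample data is made up for illustration and includes a stray space after a comma, which is normalized first because `str.get_dummies` does not strip whitespace.

```python
import pandas as pd

df = pd.DataFrame({
    "TrackId": [1, 2, 3],
    "Tags": ["rock, blues", "pop", "jazz,soul,funk"],  # note the stray space
})

# Normalize whitespace around separators so "rock, blues" becomes "rock,blues".
clean_tags = df["Tags"].str.strip().str.replace(r"\s*,\s*", ",", regex=True)

# str.get_dummies returns 0/1 integer columns; cast to bool to get True/False flags.
tag_flags = clean_tags.str.get_dummies(sep=",").astype(bool)

result = pd.concat([df.drop(columns="Tags"), tag_flags], axis=1)
print(result)
```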

4.1.5.3. Handling JSON or Nested Dictionaries

Many datasets contain JSON objects or dictionaries in a single column. We can flatten these into multiple columns, while also handling missing or null values.

Example:

import json

data = {
    "TrackId": [1, 2],
    "Metadata": [
        '{"Composer": "John Doe", "Length": 210}',
        '{"Composer": "Jane Smith"}'
    ]
}
df = pd.DataFrame(data)
display(df)
   TrackId                                 Metadata
0        1  {"Composer": "John Doe", "Length": 210}
1        2               {"Composer": "Jane Smith"}

# Parse JSON strings and expand into columns
df_json = df["Metadata"].apply(json.loads).apply(pd.Series)
df = pd.concat([df.drop("Metadata", axis=1), df_json], axis=1)

# Fill missing values (assign the result back; chained inplace fillna
# is deprecated and will stop working in pandas 3.0)
df["Length"] = df["Length"].fillna(df["Length"].mean())
display(df)
   TrackId    Composer  Length
0        1    John Doe   210.0
1        2  Jane Smith   210.0

Here, Length is missing for the second track, so we impute it with the column mean (210, because only one value is observed), while Composer is preserved as a string feature.
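
A variation worth considering is sketched below: `pd.json_normalize` flattens a list of parsed dictionaries in one call, and a separate indicator column records which rows were imputed so that information is not lost. The `Length_missing` column and the names `records`, `meta`, and `flat` are hypothetical additions for illustration, not part of the original example.

```python
import json
import pandas as pd

df = pd.DataFrame({
    "TrackId": [1, 2],
    "Metadata": [
        '{"Composer": "John Doe", "Length": 210}',
        '{"Composer": "Jane Smith"}',
    ],
})

# Parse each JSON string, then flatten the list of dicts into columns.
records = df["Metadata"].apply(json.loads).tolist()
meta = pd.json_normalize(records)

flat = pd.concat([df.drop(columns="Metadata"), meta], axis=1)

# Hypothetical indicator: remember which rows had no Length before imputing.
flat["Length_missing"] = flat["Length"].isna()
flat["Length"] = flat["Length"].fillna(flat["Length"].mean())
print(flat)
```

Whether mean imputation is appropriate depends on the data; the indicator column at least lets a model distinguish observed values from filled ones.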

4.1.5.4. Exploding Multi-Valued Features

When features contain lists of variable length, you may want to explode them to create a long-format table. This is useful for aggregations or one-hot encoding later.

Example:

data = {"TrackId": [1, 2], "Genres": [["Rock","Blues"], ["Pop"]]}
df = pd.DataFrame(data)

df_exploded = df.explode("Genres")
display(df_exploded)
   TrackId Genres
0        1   Rock
0        1  Blues
1        2    Pop

This is particularly useful when you want aggregations per category or need a row-per-category representation.
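
As a small illustration of such an aggregation, the sketch below counts distinct tracks per genre on the exploded table. It also passes `ignore_index=True`, because `explode` otherwise repeats the original row index (note the two rows labeled 0 in the output above). The names `long_df` and `tracks_per_genre` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"TrackId": [1, 2], "Genres": [["Rock", "Blues"], ["Pop"]]})

# ignore_index=True replaces the repeated index with a fresh 0..n-1 index.
long_df = df.explode("Genres", ignore_index=True)

# Number of distinct tracks that carry each genre.
tracks_per_genre = long_df.groupby("Genres")["TrackId"].nunique()
print(tracks_per_genre)
```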

4.1.5.5. Creating Count or Frequency Features

After splitting multi-value columns, you can create summary features to capture additional information:

  • Number of genres per track

  • Number of tags selected per record

Example:

df["NumGenres"] = df["Genres"].apply(len)
display(df)
   TrackId         Genres  NumGenres
0        1  [Rock, Blues]          2
1        2          [Pop]          1

A count like this can carry signal that the individual binary flags do not, for example when the number of genres assigned to a track is itself related to the target.
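
The same idea applies to comma-separated strings, and a dataset-level frequency can be derived alongside the per-record count. The sketch below uses made-up sample data; `NumTags` and `tag_freq` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "TrackId": [1, 2, 3],
    "Tags": ["rock,blues", "pop", "rock,soul,funk"],
})

# Split each string into a list of tags.
tag_lists = df["Tags"].str.split(",")

# Count feature: number of tags on each record.
df["NumTags"] = tag_lists.str.len()

# Frequency feature: share of records in which each tag appears.
tag_freq = tag_lists.explode().value_counts() / len(df)

print(df)
print(tag_freq)
```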