Declutter Features

4.1.5. Declutter Features

In real-world datasets, features are not always neatly organized. Sometimes, a single column may contain nested, combined, or complex information that needs to be split or transformed into multiple meaningful features. This process is often referred to as feature decluttering, and it is an essential step to make your dataset more structured and model-ready.

Here are some common scenarios and strategies:

4.1.5.1. Splitting Arrays or Nested Attributes

Sometimes, a single column may store multiple values in an array or a list-like structure. These can be split into separate features to make them usable.

Example: Suppose we have a column Genres containing multiple genres per track:

import pandas as pd

data = {
    "TrackId": [1, 2, 3],
    "Genres": [["Rock", "Blues"], ["Pop"], ["Jazz", "Soul", "Funk"]]
}
df = pd.DataFrame(data)
display(df)

	TrackId	Genres
0	1	[Rock, Blues]
1	2	[Pop]
2	3	[Jazz, Soul, Funk]

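We create one boolean column per genre. Because a track can belong to several genres, this is a multi-hot rather than a strict one-hot encoding, and it lets models consume each genre as an ordinary feature.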

# Expand lists into separate boolean features.
# Sorting makes the column order reproducible (set iteration order is not).
all_genres = sorted({g for sublist in df["Genres"] for g in sublist})
for genre in all_genres:
    df[genre] = df["Genres"].apply(lambda genres: genre in genres)

df = df.drop(columns="Genres")
display(df)

	TrackId	Blues	Funk	Jazz	Pop	Rock	Soul
0	1	True	False	False	False	True	False
1	2	False	False	False	True	False	False
2	3	False	True	True	False	False	True
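Real list columns often contain missing entries, which would make the `in` check above fail. The following is a minimal sketch of a reusable helper that treats missing entries as empty lists; the function name `expand_list_column`, its `prefix` parameter, and the sample data are hypothetical and not part of the original example.

```python
import pandas as pd

def expand_list_column(frame, column, prefix=None):
    """Return a copy of `frame` with `column` replaced by one boolean column per value."""
    # Treat anything that is not a list-like container (None, NaN) as an empty list.
    lists = frame[column].apply(lambda v: v if isinstance(v, (list, tuple, set)) else [])
    # Sort the distinct values so the resulting column order is reproducible.
    values = sorted({item for sub in lists for item in sub})
    out = frame.drop(columns=[column]).copy()
    for value in values:
        name = f"{prefix}_{value}" if prefix else value
        out[name] = lists.apply(lambda items, value=value: value in items)
    return out

# Hypothetical sample data with one missing entry.
tracks = pd.DataFrame({
    "TrackId": [1, 2, 3],
    "Genres": [["Rock", "Blues"], None, ["Jazz"]],
})
flags = expand_list_column(tracks, "Genres", prefix="Genre")
print(flags)
```

Prefixing the new columns (here with `Genre_`) is a design choice that helps avoid name clashes when several multi-value columns are expanded into the same table.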

4.1.5.2. Handling Comma-Separated Values

Sometimes multiple categories are stored as a single comma-separated string. As with arrays, these can be split and converted into boolean flags, taking care to trim any whitespace around the separators.

Example:

data = {"TrackId": [1, 2, 3], "Tags": ["rock,blues", "pop", "jazz,soul,funk"]}
df = pd.DataFrame(data)
display(df)

	TrackId	Tags
0	1	rock,blues
1	2	pop
2	3	jazz,soul,funk

# Split each string once, trimming whitespace around every tag
tag_lists = df["Tags"].apply(lambda s: [t.strip() for t in s.split(",")])

# Sorting makes the column order reproducible
unique_tags = sorted({tag for tags in tag_lists for tag in tags})
for tag in unique_tags:
    df[tag] = tag_lists.apply(lambda tags: tag in tags)

df = df.drop(columns="Tags")
display(df)

	TrackId	blues	funk	jazz	pop	rock	soul
0	1	True	False	False	False	True	False
1	2	False	False	False	True	False	False
2	3	False	True	True	False	False	True

Now each tag has its own column, which can be used in modeling as binary indicators.
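pandas also offers a vectorized route to the same result through `Series.str.get_dummies`. The sketch below assumes the tags are separated by commas with optional surrounding whitespace; the sample data and the variable names `tags_df`, `cleaned`, `tag_flags`, and `result` are illustrative only.

```python
import pandas as pd

# Illustrative data; note the space after the first comma.
tags_df = pd.DataFrame({
    "TrackId": [1, 2, 3],
    "Tags": ["rock, blues", "pop", "jazz,soul,funk"],
})

# Remove whitespace around the separators so "rock, blues" yields "blues", not " blues".
cleaned = tags_df["Tags"].str.replace(r"\s*,\s*", ",", regex=True).str.strip()

# One 0/1 indicator column per distinct tag, converted to booleans.
tag_flags = cleaned.str.get_dummies(sep=",").astype(bool)

result = pd.concat([tags_df[["TrackId"]], tag_flags], axis=1)
print(result)
```

Since `get_dummies` does not trim whitespace itself, the cleaning step matters: without it, `" blues"` and `"blues"` would end up as two different columns.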

4.1.5.3. Handling JSON or Nested Dictionaries

Many datasets contain JSON objects or dictionaries in a single column. We can flatten these into multiple columns, while also handling missing or null values.

Example:

import json

data = {
    "TrackId": [1, 2],
    "Metadata": [
        '{"Composer": "John Doe", "Length": 210}',
        '{"Composer": "Jane Smith"}'
    ]
}
df = pd.DataFrame(data)
display(df)

	TrackId	Metadata
0	1	{"Composer": "John Doe", "Length": 210}
1	2	{"Composer": "Jane Smith"}

# Parse JSON strings and expand into columns
parsed = df["Metadata"].apply(json.loads).tolist()
df_json = pd.DataFrame(parsed, index=df.index)
df = pd.concat([df.drop(columns="Metadata"), df_json], axis=1)

# Fill missing values by assigning the result back; calling fillna(..., inplace=True)
# on df["Length"] is chained assignment and raises a FutureWarning in recent pandas
df["Length"] = df["Length"].fillna(df["Length"].mean())
display(df)

	TrackId	Composer	Length
0	1	John Doe	210.0
1	2	Jane Smith	210.0

Here, Length is missing for the second track, so we impute it with the mean of the observed values (210.0, from the single track that has one), while Composer is preserved as a string feature.
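When the JSON objects are themselves nested, `pd.json_normalize` can flatten the inner dictionaries into separate columns. The following sketch assumes a hypothetical nested `Composer` object with `First` and `Last` keys, which is not part of the original data, and adds an illustrative `Length_missing` flag so the model can still tell which values were imputed.

```python
import json
import pandas as pd

# Hypothetical data: "Composer" is a nested object here.
raw = pd.DataFrame({
    "TrackId": [1, 2],
    "Metadata": [
        '{"Composer": {"First": "John", "Last": "Doe"}, "Length": 210}',
        '{"Composer": {"First": "Jane", "Last": "Smith"}}',
    ],
})

# Parse each JSON string, then flatten nested keys into "Parent_Child" columns.
parsed = raw["Metadata"].apply(json.loads).tolist()
flat = pd.json_normalize(parsed, sep="_")
flat.index = raw.index  # align with the original rows before concatenating

tracks = pd.concat([raw.drop(columns="Metadata"), flat], axis=1)

# Record which rows lacked a value before imputing (illustrative column name).
tracks["Length_missing"] = tracks["Length"].isna()
tracks["Length"] = tracks["Length"].fillna(tracks["Length"].median())
print(tracks)
```

Whether to impute with the mean, the median, or a sentinel value depends on the feature and the model; the median is used here only as an example.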

4.1.5.4. Exploding Multi-Valued Features

When features contain lists of variable length, you may want to explode them to create a long-format table. This is useful for aggregations or one-hot encoding later.

Example:

data = {"TrackId": [1, 2], "Genres": [["Rock","Blues"], ["Pop"]]}
df = pd.DataFrame(data)

df_exploded = df.explode("Genres")
display(df_exploded)

	TrackId	Genres
0	1	Rock
0	1	Blues
1	2	Pop

This is particularly useful when you want aggregations per category or need a row-per-category representation.
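As a rough illustration of both uses, the sketch below computes a per-genre aggregate from the long table and then pivots it back into one indicator column per genre with `pd.crosstab`. The three-track sample and the names `long`, `tracks_per_genre`, and `wide` are assumptions made for this example.

```python
import pandas as pd

# Illustrative data: "Rock" appears on two tracks.
tracks = pd.DataFrame({
    "TrackId": [1, 2, 3],
    "Genres": [["Rock", "Blues"], ["Pop"], ["Rock"]],
})

# Long format: one row per (track, genre) pair, with a fresh index.
long = tracks.explode("Genres").reset_index(drop=True)

# Aggregation per category: how many distinct tracks carry each genre.
tracks_per_genre = long.groupby("Genres")["TrackId"].nunique()
print(tracks_per_genre)

# Pivot back to a wide table: one row per track, one boolean column per genre.
wide = pd.crosstab(long["TrackId"], long["Genres"]).astype(bool)
print(wide)
```

Resetting the index after `explode` avoids the repeated index labels seen in the exploded output, which can otherwise cause surprises in later joins or label-based lookups.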

4.1.5.5. Creating Count or Frequency Features

Alongside the split columns, you can create summary features that capture additional information. Compute them while the original list or string column is still available, for example:

Number of genres per track
Number of tags selected per record

Example:

# Count the genres in each track's list (df is the un-exploded frame from the previous example)
df["NumGenres"] = df["Genres"].apply(len)
display(df)

	TrackId	Genres	NumGenres
0	1	[Rock, Blues]	2
1	2	[Pop]	1

A count like this can sometimes carry signal that the binary flags alone do not, such as how broadly a track is categorized.
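The same idea applies to comma-separated strings. The sketch below counts the distinct tags per record and treats missing or empty values as zero; the helper name `count_tags`, the `NumTags` column, and the sample data are hypothetical.

```python
import pandas as pd

# Illustrative data with irregular spacing and one missing value.
records = pd.DataFrame({
    "TrackId": [1, 2, 3, 4],
    "Tags": ["rock,blues", "pop", "jazz, soul ,funk", None],
})

def count_tags(value):
    """Count the distinct, non-empty tags in a comma-separated string."""
    # Missing values (None, NaN) are not strings and count as zero tags.
    if not isinstance(value, str):
        return 0
    return len({t.strip() for t in value.split(",") if t.strip()})

records["NumTags"] = records["Tags"].apply(count_tags)
print(records)
```

Counting distinct tags rather than separators keeps the feature robust to trailing commas and repeated entries.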
