Categorical Encoding

4.2.1. Categorical Encoding

Many machine learning algorithms require numeric input, so categorical features must be converted into a numeric format. This process is called categorical encoding.

There are two common strategies:

4.2.1.1. One-Hot Encoding (OHE)

One-hot encoding creates a separate binary column for each category. If a row belongs to a category, the corresponding column is 1 (shown as True in recent pandas versions); otherwise it is 0 (False).

Example: Suppose we have a Color column:

import pandas as pd

data = {"Item": ["Shirt", "Pants", "Hat"], "Color": ["Red", "Green", "Blue"]}
df = pd.DataFrame(data)
display(df)

	Item	Color
0	Shirt	Red
1	Pants	Green
2	Hat	Blue

We can use pd.get_dummies to one-hot encode the Color column:

df_ohe = pd.get_dummies(df, columns=["Color"])
display(df_ohe)

	Item	Color_Blue	Color_Green	Color_Red
0	Shirt	False	False	True
1	Pants	False	True	False
2	Hat	True	False	False

Each category now has its own column, which can be used directly in most ML models.
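The output above shows True/False because pandas 2.0 and later return boolean dummy columns by default. If a model or downstream step expects 0/1 integers, a minimal sketch is to pass the dtype argument of pd.get_dummies; the variable name df_ohe_int is only illustrative:

```python
import pandas as pd

data = {"Item": ["Shirt", "Pants", "Hat"], "Color": ["Red", "Green", "Blue"]}
df = pd.DataFrame(data)

# dtype=int requests 0/1 integer columns instead of the default booleans
df_ohe_int = pd.get_dummies(df, columns=["Color"], dtype=int)
print(df_ohe_int)
```

The columns and their meaning are unchanged; only the stored type differs.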

Optional: Drop First Column

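The dummy columns of a feature always sum to 1 in every row, which makes them perfectly collinear with the intercept term of a linear model. To avoid this multicollinearity, you can drop one of the dummy columns with drop_first=True: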

df_ohe = pd.get_dummies(df, columns=["Color"], drop_first=True)
display(df_ohe)

	Item	Color_Green	Color_Red
0	Shirt	False	True
1	Pants	True	False
2	Hat	False	False

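No information is lost: a row that is False in both remaining columns (here the Hat row) must belong to the dropped category, Blue. The removed column was fully determined by the others, so it is safe to drop.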
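As a quick check of this reasoning, the dropped column can be rebuilt from the two that remain. This is a small sketch on the same toy data; is_blue is a name introduced here for illustration:

```python
import pandas as pd

data = {"Item": ["Shirt", "Pants", "Hat"], "Color": ["Red", "Green", "Blue"]}
df = pd.DataFrame(data)
df_ohe = pd.get_dummies(df, columns=["Color"], drop_first=True)

# A row is Blue exactly when it is neither Green nor Red.
# astype(bool) keeps this working if the dummies are stored as integers.
is_green = df_ohe["Color_Green"].astype(bool)
is_red = df_ohe["Color_Red"].astype(bool)
is_blue = ~(is_green | is_red)
print(is_blue.tolist())
```

Only the Hat row comes out as True, matching the Color_Blue column of the full encoding.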

4.2.1.2. Label Encoding

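Label encoding assigns a unique integer to each category. This is particularly useful for ordinal features, where the order matters. The example below reuses the nominal Color column only to show the mechanics; cat.codes numbers the categories in sorted (here alphabetical) order, not in any meaningful order: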

df["Color_Label"] = df["Color"].astype("category").cat.codes
display(df)

	Item	Color	Color_Label
0	Shirt	Red	2
1	Pants	Green	1
2	Hat	Blue	0
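To see which integer stands for which category, or to turn codes back into labels, one option is to read the mapping from the categorical's categories attribute. The names code_to_color and decoded are introduced here for illustration:

```python
import pandas as pd

data = {"Item": ["Shirt", "Pants", "Hat"], "Color": ["Red", "Green", "Blue"]}
df = pd.DataFrame(data)

colors = df["Color"].astype("category")

# Each code is the position of the category in this sorted list
code_to_color = dict(enumerate(colors.cat.categories))
print(code_to_color)

# Map the integer codes back to the original labels
decoded = colors.cat.codes.map(code_to_color)
print(decoded.tolist())
```

Keeping this mapping around is useful when the same encoding has to be applied or reversed later.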

Warning

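Be careful: label encoding implies an order (here Blue < Green < Red) that does not exist for nominal features. Models that treat the codes as magnitudes, such as linear or distance-based models, can pick up spurious relationships from it, so label encoding should not be used for nominal features in most ML models.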
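For a genuinely ordinal feature, the integer codes should follow the real order rather than the alphabetical one. The sketch below assumes a hypothetical Size column (not part of the example above) and spells out the order explicitly with pd.Categorical:

```python
import pandas as pd

# Hypothetical ordinal feature, used only for illustration
sizes = pd.DataFrame(
    {"Item": ["Shirt", "Pants", "Hat"], "Size": ["Medium", "Large", "Small"]}
)

# State the intended order explicitly; alphabetical sorting would
# give Large < Medium < Small, which is not what we mean
size_order = ["Small", "Medium", "Large"]
size_cat = pd.Categorical(sizes["Size"], categories=size_order, ordered=True)

# Codes follow the position in size_order: Small=0, Medium=1, Large=2
sizes["Size_Label"] = size_cat.codes
print(sizes)
```

With the order given explicitly, larger codes correspond to larger sizes, which is the property an ordinal encoding is meant to preserve.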

4.2.1.3. When to Use Which

Encoding Type	When to Use
One-hot encoding	Nominal categorical features (no order)
Label encoding	Ordinal categorical features (order matters)