4.2.1. Categorical Encoding#

Many machine learning algorithms require numeric input, so categorical features must be converted into a numeric format. This process is called categorical encoding.

Two common strategies are one-hot encoding and label encoding:

4.2.1.1. One-Hot Encoding (OHE)#

One-hot encoding creates a separate binary column for each category. If a row belongs to a category, the corresponding column is 1, otherwise 0.

Example: Suppose we have a Color column:

import pandas as pd

data = {"Item": ["Shirt", "Pants", "Hat"], "Color": ["Red", "Green", "Blue"]}
df = pd.DataFrame(data)
display(df)
    Item  Color
0  Shirt    Red
1  Pants  Green
2    Hat   Blue

We can use pd.get_dummies to one-hot encode the Color column:

df_ohe = pd.get_dummies(df, columns=["Color"])
display(df_ohe)
    Item  Color_Blue  Color_Green  Color_Red
0  Shirt       False        False       True
1  Pants       False         True      False
2    Hat        True        False      False

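Each category now has its own column. In pandas 2.0 and later these columns are booleans (True/False), which behave as 1 and 0 in numeric operations, so they can be used directly in most ML models.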
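If a model or downstream tool expects explicit 0/1 integers rather than booleans, the dtype parameter of pd.get_dummies can request them. The following is a minimal sketch; it rebuilds the example frame so the snippet runs on its own, and the name df_ohe_int is only illustrative:

```python
import pandas as pd

# Same example data as above, rebuilt so this snippet is self-contained
df = pd.DataFrame(
    {"Item": ["Shirt", "Pants", "Hat"], "Color": ["Red", "Green", "Blue"]}
)

# dtype=int makes the dummy columns 0/1 integers instead of booleans
df_ohe_int = pd.get_dummies(df, columns=["Color"], dtype=int)
print(df_ohe_int)
```

The non-encoded Item column is left untouched; only the columns listed in columns are expanded.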

Optional: Drop First Column#

To avoid multicollinearity in linear models, you can drop one of the dummy columns:

df_ohe = pd.get_dummies(df, columns=["Color"], drop_first=True)
display(df_ohe)
    Item  Color_Green  Color_Red
0  Shirt        False       True
1  Pants         True      False
2    Hat        False      False

The dropped column (Color_Blue) is redundant: when Color_Green and Color_Red are both False, the row must be Blue. No information is lost by removing it.
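To make the redundancy concrete, the dropped column can be rebuilt from the remaining ones. This is a small sketch on the same example data; dtype=bool is passed explicitly so the logical operators behave the same across pandas versions, and the variable names are illustrative:

```python
import pandas as pd

df = pd.DataFrame(
    {"Item": ["Shirt", "Pants", "Hat"], "Color": ["Red", "Green", "Blue"]}
)

# Full encoding and the reduced encoding with the first dummy dropped
full = pd.get_dummies(df, columns=["Color"], dtype=bool)
reduced = pd.get_dummies(df, columns=["Color"], drop_first=True, dtype=bool)

# A row is Blue exactly when it is neither Green nor Red
recovered_blue = ~(reduced["Color_Green"] | reduced["Color_Red"])

# Compare the rebuilt column with the original Color_Blue column
matches = bool((recovered_blue == full["Color_Blue"]).all())
print(matches)
```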

4.2.1.2. Label Encoding#

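Label encoding assigns a unique integer to each category. It is best suited to ordinal features, where the integers can reflect a meaningful order. The example below uses pandas category codes, which number the categories in sorted order (alphabetical for strings):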

df["Color_Label"] = df["Color"].astype("category").cat.codes
display(df)
    Item  Color  Color_Label
0  Shirt    Red            2
1  Pants  Green            1
2    Hat   Blue            0
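For a genuinely ordinal feature, the default sorted order is usually not the order you want, so it can help to spell the order out. The sketch below assumes a hypothetical Size column (not part of the example above) with the order Small < Medium < Large, and uses pd.Categorical with an explicit category list:

```python
import pandas as pd

# Hypothetical ordinal feature, used only for illustration
sizes = pd.DataFrame(
    {"Item": ["Shirt", "Pants", "Hat"], "Size": ["Medium", "Large", "Small"]}
)

# Default codes follow sorted order: Large=0, Medium=1, Small=2
default_codes = sizes["Size"].astype("category").cat.codes.tolist()

# Explicit order: Small=0, Medium=1, Large=2
size_order = ["Small", "Medium", "Large"]
sizes["Size_Label"] = pd.Categorical(
    sizes["Size"], categories=size_order, ordered=True
).codes

print(default_codes)
print(sizes)
```

With the explicit order, larger integers correspond to larger sizes, which is the property an ordinal encoding is meant to preserve.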

Warning

Be careful: label encoding implies an order (here Blue < Green < Red), so in most ML models it should not be used for nominal features, which have no such order.
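One way to see the problem is to compare distances between encoded values. In the sketch below (variable names are illustrative), label codes place Red twice as far from Blue as Green is, even though the colors have no such relationship, while one-hot rows are all equally far apart:

```python
import numpy as np
import pandas as pd

colors = pd.Series(["Red", "Green", "Blue"])

# Label codes in sorted order: Blue=0, Green=1, Red=2
codes = colors.astype("category").cat.codes.to_numpy().astype(int)
gap_red_blue = abs(codes[0] - codes[2])
gap_green_blue = abs(codes[1] - codes[2])
print(gap_red_blue, gap_green_blue)

# One-hot rows: every pair of distinct colors is the same distance apart
one_hot = pd.get_dummies(colors, dtype=int).to_numpy()
dist_red_blue = np.linalg.norm(one_hot[0] - one_hot[2])
dist_green_blue = np.linalg.norm(one_hot[1] - one_hot[2])
print(dist_red_blue, dist_green_blue)
```

Models that rely on magnitudes or distances, such as linear models or nearest-neighbour methods, would pick up the artificial spacing in the label codes.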

4.2.1.3. When to Use Which#

| Encoding Type    | When to Use                                   |
|------------------|-----------------------------------------------|
| One-hot encoding | Nominal categorical features (no order)       |
| Label encoding   | Ordinal categorical features (order matters)  |