4.2.1. Categorical Encoding
Many machine learning algorithms require numeric input, so categorical features must be converted into a numeric format. This process is called categorical encoding.
There are two common strategies:
4.2.1.1. One-Hot Encoding (OHE)
One-hot encoding creates a separate binary column for each category. If a row belongs to a category, the corresponding column is 1, otherwise 0.
Example: Suppose we have a Color column:
```python
import pandas as pd
data = {"Item": ["Shirt", "Pants", "Hat"], "Color": ["Red", "Green", "Blue"]}
df = pd.DataFrame(data)
display(df)
```
| | Item | Color |
|---|---|---|
| 0 | Shirt | Red |
| 1 | Pants | Green |
| 2 | Hat | Blue |
We can use pd.get_dummies to one-hot encode the Color column:
```python
df_ohe = pd.get_dummies(df, columns=["Color"])
display(df_ohe)
```
| | Item | Color_Blue | Color_Green | Color_Red |
|---|---|---|---|---|
| 0 | Shirt | False | False | True |
| 1 | Pants | False | True | False |
| 2 | Hat | True | False | False |
Each category now has its own column, which can be used directly in most ML models.
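The description above speaks of 1 and 0, while the output shows True and False, which is what recent pandas versions return by default. If integer columns are preferred, a minimal sketch (assuming the same three-row `df`, and using the `dtype` parameter of `pd.get_dummies`) could look like this:

```python
import pandas as pd

# Same toy data as in the example above.
df = pd.DataFrame({"Item": ["Shirt", "Pants", "Hat"],
                   "Color": ["Red", "Green", "Blue"]})

# dtype=int asks for 0/1 integers instead of booleans.
df_ohe_int = pd.get_dummies(df, columns=["Color"], dtype=int)
print(df_ohe_int)
```

Both forms carry the same information; integer columns are mainly a convenience when a downstream step expects numeric dtypes.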
Optional: Drop First Column
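The full set of dummy columns sums to 1 in every row, which makes the columns perfectly collinear with a model's intercept. To avoid this multicollinearity in linear models, you can drop one of the dummy columns: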
```python
df_ohe = pd.get_dummies(df, columns=["Color"], drop_first=True)
display(df_ohe)
```
| | Item | Color_Green | Color_Red |
|---|---|---|---|
| 0 | Shirt | False | True |
| 1 | Pants | True | False |
| 2 | Hat | False | False |
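Here the first category, Blue, has been dropped. A row in which every remaining dummy column is False must belong to Blue, so the dropped column carries no extra information and can be removed.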
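To check this on the toy data, the following sketch (assuming the same three-row `df`; the variable names are illustrative) verifies that the full dummy columns sum to 1 in every row and that the dropped Color_Blue column can be rebuilt from the remaining two:

```python
import pandas as pd

df = pd.DataFrame({"Item": ["Shirt", "Pants", "Hat"],
                   "Color": ["Red", "Green", "Blue"]})

full = pd.get_dummies(df, columns=["Color"], dtype=int)
reduced = pd.get_dummies(df, columns=["Color"], drop_first=True, dtype=int)

# Every row belongs to exactly one category, so the full dummies sum to 1.
row_sums = full[["Color_Blue", "Color_Green", "Color_Red"]].sum(axis=1)

# A row is Blue exactly when all remaining dummy columns are 0.
remaining = reduced[["Color_Green", "Color_Red"]]
recovered_blue = (remaining.sum(axis=1) == 0).astype(int)

print(row_sums.tolist())
print(recovered_blue.tolist())
```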
4.2.1.2. Label Encoding
Label encoding assigns a unique integer to each category. This is particularly useful for ordinal features, where the order matters.
```python
# Codes follow the sorted category order: Blue=0, Green=1, Red=2
df["Color_Label"] = df["Color"].astype("category").cat.codes
display(df)
```
| | Item | Color | Color_Label |
|---|---|---|---|
| 0 | Shirt | Red | 2 |
| 1 | Pants | Green | 1 |
| 2 | Hat | Blue | 0 |
Warning
Be careful: Label encoding implies an order (here Blue < Green < Red, which is merely alphabetical), so it should not be used for nominal features in most ML models.
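For a genuinely ordinal feature, the integer codes should follow the meaningful order rather than the alphabetical one. A minimal sketch, assuming a hypothetical Size column with the order S < M < L (neither the column nor the order comes from the example above):

```python
import pandas as pd

# Hypothetical ordinal feature, for illustration only.
sizes = pd.DataFrame({"Item": ["Shirt", "Pants", "Hat"],
                      "Size": ["M", "L", "S"]})

# Default behaviour: categories are sorted alphabetically (L, M, S),
# so the codes do not reflect the real size order.
alphabetical = sizes["Size"].astype("category").cat.codes

# Passing the categories explicitly makes the codes follow S < M < L.
size_order = ["S", "M", "L"]
sizes["Size_Label"] = pd.Categorical(
    sizes["Size"], categories=size_order, ordered=True
).codes

print(sizes)
```

Spelling out the order keeps the encoding independent of how the category names happen to sort.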
4.2.1.3. When to Use Which
| Encoding Type | When to Use |
|---|---|
| One-hot encoding | Nominal categorical features (no order) |
| Label encoding | Ordinal categorical features (order matters) |