4.2.3. Derived Features
Sometimes, the features provided in a dataset are not enough to capture all the underlying patterns. Derived features are new variables created from existing features to provide additional insight for models or analysis.
4.2.3.1. Combining Features
Combining two or more existing features can often reveal new information. For example, in a retail dataset with Quantity and UnitPrice columns, multiplying them gives TotalSales, the revenue per product, which is usually more directly useful for analysis than either input alone.
import pandas as pd

df = pd.DataFrame({
    "Product": ["A", "B", "C"],
    "Quantity": [10, 5, 8],
    "UnitPrice": [20, 50, 30]
})
# Derived feature: total sales
df["TotalSales"] = df["Quantity"] * df["UnitPrice"]
display(df)
| | Product | Quantity | UnitPrice | TotalSales |
|---|---|---|---|---|
| 0 | A | 10 | 20 | 200 |
| 1 | B | 5 | 50 | 250 |
| 2 | C | 8 | 30 | 240 |
4.2.3.2. Mathematical Approaches
Mathematical transformations are commonly used to create features such as ratios, products, or areas. For instance:
- BMI from weight and height
- Area from length and width
- Density from mass and volume
df_people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Height_m": [1.65, 1.8, 1.75],
    "Weight_kg": [60, 80, 70]
})
# Derived feature: BMI = weight (kg) / height (m) squared
df_people["BMI"] = df_people["Weight_kg"] / (df_people["Height_m"] ** 2)
display(df_people)
| | Name | Height_m | Weight_kg | BMI |
|---|---|---|---|---|
| 0 | Alice | 1.65 | 60 | 22.038567 |
| 1 | Bob | 1.80 | 80 | 24.691358 |
| 2 | Charlie | 1.75 | 70 | 22.857143 |
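The same pattern extends to the other items listed above. The following is a minimal sketch for area and density, assuming a hypothetical table of rectangular blocks; the column names (Length_m, Width_m, Mass_kg, Volume_m3) and the values are invented for illustration. Ratio features can divide by zero, so the sketch masks zero denominators and leaves the result missing rather than infinite.

```python
import math

import pandas as pd

# Hypothetical data, for illustration only
df_blocks = pd.DataFrame({
    "Length_m": [2.0, 3.0, 1.5],
    "Width_m": [1.0, 2.0, 4.0],
    "Mass_kg": [100.0, 240.0, 90.0],
    "Volume_m3": [0.5, 2.0, 0.0],  # the zero shows the guard below
})

# Product feature: area = length * width
df_blocks["Area_m2"] = df_blocks["Length_m"] * df_blocks["Width_m"]

# Ratio feature: density = mass / volume.
# where() keeps non-zero volumes and turns zeros into NaN,
# so the ratio is missing instead of inf for that row.
safe_volume = df_blocks["Volume_m3"].where(df_blocks["Volume_m3"] != 0)
df_blocks["Density_kg_m3"] = df_blocks["Mass_kg"] / safe_volume

print(df_blocks)
```

Whether a missing value is the right outcome for a zero denominator depends on the dataset; dropping the row or imputing a value are alternatives.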
4.2.3.3. Using Domain Knowledge
Domain knowledge can help craft features that are more meaningful for the problem at hand. For example, if we have columns like gender and married, creating a combined category can help capture interactions:
df_customers = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Gender": ["F", "M", "M"],
    "Married": ["Yes", "No", "Yes"]
})
# Derived feature: combined category
df_customers["Gender_Married"] = df_customers["Gender"] + "_" + df_customers["Married"]
display(df_customers)
| | Name | Gender | Married | Gender_Married |
|---|---|---|---|---|
| 0 | Alice | F | Yes | F_Yes |
| 1 | Bob | M | No | M_No |
| 2 | Charlie | M | Yes | M_Yes |
This new feature can help models detect patterns that are specific to a subgroup of customers, which might be missed if gender and married were treated independently.
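Most models expect numeric input, so a combined category like this is usually encoded before training. The sketch below uses pd.get_dummies for one-hot encoding, which is one common option and not the only one; the "GM" prefix and the df_encoded name are arbitrary choices made here. The frame is rebuilt so the snippet runs on its own.

```python
import pandas as pd

df_customers = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Gender": ["F", "M", "M"],
    "Married": ["Yes", "No", "Yes"],
})
df_customers["Gender_Married"] = (
    df_customers["Gender"] + "_" + df_customers["Married"]
)

# One-hot encode the combined category: one 0/1 column per subgroup
encoded = pd.get_dummies(df_customers["Gender_Married"], prefix="GM", dtype=int)

# Keep the name next to the indicator columns for readability
df_encoded = pd.concat([df_customers[["Name"]], encoded], axis=1)
print(df_encoded)
```

Each row has exactly one indicator set, so a model can assign a separate weight to each gender and marital status combination.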
Tip
Derived features are a powerful way to inject human insight into the dataset. However, manually crafting features can be time-consuming and sometimes limiting. Modern approaches such as deep learning often handle feature learning automatically, detecting complex patterns without explicit human design.
We can also apply date transformations as derived features:
df_orders = pd.DataFrame({
    "OrderID": [1, 2, 3],
    "OrderDate": pd.to_datetime(["2025-01-05", "2025-02-10", "2025-03-15"])
})
# Derived features from dates
df_orders["Month"] = df_orders["OrderDate"].dt.month
# dayofweek counts from Monday=0 to Sunday=6
df_orders["DayOfWeek"] = df_orders["OrderDate"].dt.dayofweek
# ISO 8601 week number (weeks start on Monday)
df_orders["WeekOfYear"] = df_orders["OrderDate"].dt.isocalendar().week
display(df_orders)
| | OrderID | OrderDate | Month | DayOfWeek | WeekOfYear |
|---|---|---|---|---|---|
| 0 | 1 | 2025-01-05 | 1 | 6 | 1 |
| 1 | 2 | 2025-02-10 | 2 | 0 | 7 |
| 2 | 3 | 2025-03-15 | 3 | 5 | 11 |
These derived features can be particularly useful in time series analysis, sales forecasting, and seasonality detection.
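Calendar flags can be derived the same way. As a small sketch, the snippet below adds a weekend indicator and the calendar quarter to the same order dates; the column names IsWeekend and Quarter are choices made for this example.

```python
import pandas as pd

df_orders = pd.DataFrame({
    "OrderID": [1, 2, 3],
    "OrderDate": pd.to_datetime(["2025-01-05", "2025-02-10", "2025-03-15"]),
})

# dayofweek counts from Monday=0, so 5 and 6 are Saturday and Sunday
df_orders["IsWeekend"] = df_orders["OrderDate"].dt.dayofweek >= 5

# Calendar quarter (1 to 4)
df_orders["Quarter"] = df_orders["OrderDate"].dt.quarter

print(df_orders)
```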