4.2.3. Derived Features

Sometimes, the features provided in a dataset are not enough to capture all the underlying patterns. Derived features are new variables created from existing features to provide additional insight for models or analysis.

4.2.3.1. Combining Features

Often, combining two or more existing features can reveal new information. For example, in retail, if we have Quantity and UnitPrice columns, multiplying them gives TotalSales, the revenue for each row, which is often more directly useful for analysis than either input on its own.

import pandas as pd

df = pd.DataFrame({
    "Product": ["A", "B", "C"],
    "Quantity": [10, 5, 8],
    "UnitPrice": [20, 50, 30]
})

# Derived feature: total sales
df["TotalSales"] = df["Quantity"] * df["UnitPrice"]
display(df)  # available in Jupyter/IPython; use print(df) in a plain script

  Product  Quantity  UnitPrice  TotalSales
0       A        10         20         200
1       B         5         50         250
2       C         8         30         240

4.2.3.2. Mathematical Approaches

Mathematical transformations are commonly used to create features such as ratios, products, or areas. For instance:

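  • BMI from weight and height (weight in kg divided by height in m squared)

  • Area from length and width (length × width, for a rectangle)

  • Density from mass and volume (mass divided by volume)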

df_people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Height_m": [1.65, 1.8, 1.75],
    "Weight_kg": [60, 80, 70]
})

# Derived feature: BMI
df_people["BMI"] = df_people["Weight_kg"] / (df_people["Height_m"] ** 2)
display(df_people)

      Name  Height_m  Weight_kg        BMI
0    Alice      1.65         60  22.038567
1      Bob      1.80         80  24.691358
2  Charlie      1.75         70  22.857143
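
The other two transformations in the list above follow the same pattern. The sketch below is only illustrative: the DataFrame df_blocks, its column names, and its values are invented for this example and are not part of the original dataset.

```python
import pandas as pd

# Hypothetical measurements, invented for illustration only
df_blocks = pd.DataFrame({
    "Length_m": [2.0, 3.0, 4.0],
    "Width_m": [1.0, 2.0, 2.5],
    "Mass_kg": [10.0, 30.0, 45.0],
    "Volume_m3": [2.0, 5.0, 9.0],
})

# Derived feature: area of a rectangle = length * width
df_blocks["Area_m2"] = df_blocks["Length_m"] * df_blocks["Width_m"]

# Derived feature: density = mass / volume
df_blocks["Density_kg_m3"] = df_blocks["Mass_kg"] / df_blocks["Volume_m3"]

print(df_blocks)
```

For ratio features such as density, it is worth checking the denominator first: a zero volume would produce inf rather than raise an error, and that value would then flow silently into later steps.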

4.2.3.3. Using Domain Knowledge

Domain knowledge can help craft features that are more meaningful for the problem at hand. For example, if we have columns like Gender and Married, creating a combined category can help capture the interaction between them:

df_customers = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Gender": ["F", "M", "M"],
    "Married": ["Yes", "No", "Yes"]
})

# Derived feature: combined category
df_customers["Gender_Married"] = df_customers["Gender"] + "_" + df_customers["Married"]
display(df_customers)

      Name Gender Married Gender_Married
0    Alice      F     Yes          F_Yes
1      Bob      M      No           M_No
2  Charlie      M     Yes          M_Yes

This new feature can help models detect patterns that are specific to a subgroup of customers (for example, married men, M_Yes), which might be missed if Gender and Married were treated as separate, independent inputs.
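
Most models need the combined category in numeric form before they can use it. One possible way to do this, shown here as a sketch rather than a prescribed step, is one-hot encoding with pd.get_dummies. The "Group" prefix and the names subgroup_dummies and df_encoded are choices made for this example.

```python
import pandas as pd

df_customers = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Gender": ["F", "M", "M"],
    "Married": ["Yes", "No", "Yes"],
})
df_customers["Gender_Married"] = df_customers["Gender"] + "_" + df_customers["Married"]

# One 0/1 indicator column per subgroup that appears in the data
# (the "Group" prefix is an arbitrary choice for this example)
subgroup_dummies = pd.get_dummies(
    df_customers["Gender_Married"], prefix="Group", dtype=int
)

# Attach the indicator columns to the original table
df_encoded = pd.concat([df_customers, subgroup_dummies], axis=1)
print(df_encoded)
```

Only subgroups that actually occur in the data get a column, so F_No is absent here. Each indicator column isolates one subgroup, which is what lets a model assign it its own effect.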

Tip

Derived features are a powerful way to inject human insight into the dataset. However, manually crafting features can be time-consuming and sometimes limiting. Modern approaches such as deep learning often handle feature learning automatically, detecting complex patterns without explicit human design.

We can also apply date transformations as derived features:

df_orders = pd.DataFrame({
    "OrderID": [1, 2, 3],
    "OrderDate": pd.to_datetime(["2025-01-05", "2025-02-10", "2025-03-15"])
})

# Derived features from dates
df_orders["Month"] = df_orders["OrderDate"].dt.month  # 1 = January
df_orders["DayOfWeek"] = df_orders["OrderDate"].dt.dayofweek  # Monday = 0, Sunday = 6
df_orders["WeekOfYear"] = df_orders["OrderDate"].dt.isocalendar().week  # ISO 8601 week number
display(df_orders)

   OrderID  OrderDate  Month  DayOfWeek  WeekOfYear
0        1 2025-01-05      1          6           1
1        2 2025-02-10      2          0           7
2        3 2025-03-15      3          5          11

These derived features can be particularly useful in time series analysis, sales forecasting, and seasonality detection.
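
As a further sketch in the same spirit, a weekend flag and a calendar quarter can be derived from the same date column. The column names IsWeekend and Quarter are chosen for this example, and whether they help depends on the problem at hand.

```python
import pandas as pd

df_orders = pd.DataFrame({
    "OrderID": [1, 2, 3],
    "OrderDate": pd.to_datetime(["2025-01-05", "2025-02-10", "2025-03-15"]),
})

# dt.dayofweek counts from Monday = 0, so 5 and 6 are Saturday and Sunday
df_orders["IsWeekend"] = df_orders["OrderDate"].dt.dayofweek >= 5

# Calendar quarter (1 to 4), a coarser seasonal signal than the month
df_orders["Quarter"] = df_orders["OrderDate"].dt.quarter

print(df_orders)
```

Here the first and third orders fall on a Sunday and a Saturday, so they are flagged as weekend orders, and all three dates belong to the first quarter.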