4.2. Feature Engineering
Once the dataset has been cleaned, the next step is feature engineering: the process of transforming existing features and creating new ones to improve the effectiveness of analysis or modeling. The goal is to make the data more informative, consistent, and suitable for the algorithms we plan to use.
Feature engineering can include a variety of tasks, such as:
Encoding Categorical Variables
Many machine learning algorithms require numeric input, so categorical features need to be converted:
One-hot encoding: Creates a separate binary column for each category (e.g., “Red”, “Green”, “Blue”).
Label encoding: Assigns a unique integer to each category (useful for ordinal features).
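Both encodings can be sketched in a few lines of pandas. The colors and sizes below are made-up toy data; the explicit `size_order` mapping is one simple way to label-encode an ordinal feature so that the integers respect its natural order.

```python
import pandas as pd

# Toy dataset with a nominal feature (color) and an ordinal feature (size)
df = pd.DataFrame({
    "color": ["Red", "Green", "Blue", "Green"],
    "size":  ["S", "M", "L", "M"],
})

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df["color"], prefix="color")

# Label encoding for an ordinal feature: map categories to ordered integers
size_order = {"S": 0, "M": 1, "L": 2}
df["size_encoded"] = df["size"].map(size_order)

print(one_hot.columns.tolist())     # columns: color_Blue, color_Green, color_Red
print(df["size_encoded"].tolist())  # [0, 1, 2, 1]
```

Note that applying label encoding to a nominal feature like color would impose an arbitrary order the algorithm might mistakenly exploit, which is why one-hot encoding is preferred there.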
Scaling and Normalization of Numeric Features
Standardizing numeric features can improve the performance of many algorithms and help with convergence in optimization:
Min-max scaling: Rescales values to a fixed range, usually [0,1].
Standardization (Z-score normalization): Centers values around the mean and scales by the standard deviation.
Log transformation: Useful for skewed distributions to reduce the impact of extreme values.
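The three transformations above can each be written directly with NumPy. The sample values are made up; they span several orders of magnitude to make the effect of the log transformation visible.

```python
import numpy as np

values = np.array([1.0, 10.0, 100.0, 1000.0])

# Min-max scaling: rescale to the range [0, 1]
min_max = (values - values.min()) / (values.max() - values.min())

# Standardization (z-score): subtract the mean, divide by the standard deviation
z = (values - values.mean()) / values.std()

# Log transformation: compresses the long right tail of a skewed distribution
logged = np.log10(values)  # [0., 1., 2., 3.]
```

After standardization the feature has zero mean and unit standard deviation; after min-max scaling its minimum is exactly 0 and its maximum exactly 1.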
Creating Derived Features
New features can be computed from existing ones to capture more meaningful patterns:
Ratios or differences: E.g., profit margin = (revenue − cost) ÷ revenue, or the time elapsed between two events.
Aggregations: Summarizing data at different levels (e.g., total purchases per customer).
Date/time features: Extracting day, month, year, weekday, or season from timestamps.
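All three kinds of derived features can be built from one small table with pandas. The transactions below are hypothetical; the column names (`customer`, `revenue`, `cost`, `timestamp`) are just illustrative.

```python
import pandas as pd

# Hypothetical transactions table
tx = pd.DataFrame({
    "customer":  ["A", "A", "B"],
    "revenue":   [120.0, 80.0, 50.0],
    "cost":      [100.0, 60.0, 40.0],
    "timestamp": pd.to_datetime(["2023-01-15", "2023-03-02", "2023-01-20"]),
})

# Ratio feature: profit margin per transaction
tx["margin"] = (tx["revenue"] - tx["cost"]) / tx["revenue"]

# Aggregation: total revenue per customer
totals = tx.groupby("customer")["revenue"].sum()

# Date/time features extracted from the timestamp
tx["month"] = tx["timestamp"].dt.month
tx["weekday"] = tx["timestamp"].dt.day_name()
```

The per-customer totals can then be merged back onto the transaction table if the model needs both levels of granularity.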
Handling Text Features
Free-text fields may contain useful information if transformed appropriately:
Tokenization and bag-of-words for text analytics.
TF-IDF or embeddings to capture semantic meaning.
Categorical extraction: e.g., extracting titles from names (“Dr.”, “Mr.”, “Ms.”).
We will discuss these techniques in more depth in a later section on NLP (see Text Preprocessing).
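As a preview, tokenization, bag-of-words, and categorical extraction can all be done with the standard library alone. The documents and names below are invented examples; TF-IDF and embeddings are deferred to the NLP section.

```python
import re
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat on the log"]

# Tokenization: split each document into lowercase word tokens
tokens = [doc.lower().split() for doc in docs]

# Bag-of-words: term counts per document
bags = [Counter(t) for t in tokens]

# Shared vocabulary across the small corpus
vocab = sorted(set(word for t in tokens for word in t))

# Categorical extraction: pull a title ("Dr.", "Mr.", "Ms.") out of a name field
names = ["Dr. Alice Smith", "Mr. Bob Jones", "Ms. Carol Lee"]
titles = [re.match(r"([A-Za-z]+)\.", name).group(1) for name in names]
print(titles)  # ['Dr', 'Mr', 'Ms']
```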
Feature Selection / Dimensionality Reduction
Not all features are equally useful, and some may introduce noise or redundancy:
Variance thresholding: Remove features with near-zero variance, since they carry little information.
Correlation analysis: Remove highly correlated, redundant features.
PCA / t-SNE: Reduce dimensionality while preserving important structure (PCA preserves directions of maximal variance; t-SNE preserves local neighborhood structure, mainly for visualization).
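A minimal NumPy-only sketch of the first and third ideas, on synthetic data constructed so that one feature is constant and another is a near-copy of the first: variance thresholding drops the constant column, the correlation matrix flags the redundant pair, and PCA is computed via SVD of the centered data. (A library such as scikit-learn provides these as `VarianceThreshold` and `PCA`; the threshold value here is arbitrary.)

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
X = np.column_stack([
    x1,
    2 * x1 + rng.normal(scale=0.01, size=100),  # nearly redundant copy of x1
    np.full(100, 5.0),                          # zero-variance (constant) feature
])

# Variance thresholding: drop features whose variance falls below a cutoff
variances = X.var(axis=0)
X_kept = X[:, variances > 1e-6]   # the constant column is removed

# Correlation analysis: flag highly correlated pairs among the kept features
corr = np.corrcoef(X_kept, rowvar=False)

# PCA via SVD on centered data: project onto the top principal component
Xc = X_kept - X_kept.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X_pca = Xc @ Vt[:1].T             # shape (100, 1)
```

Here the two kept features have a correlation near 1, so in practice one of them would also be dropped before (or instead of) applying PCA.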
By performing thoughtful feature engineering, we enhance the quality and predictive power of our dataset. This step often has as much impact on model performance as the choice of algorithm itself.