4.2.2. Numeric Transformations#
Numeric features in a dataset often need scaling, normalization, or transformation before being fed into machine learning models. Proper transformation can improve model performance, stability, and convergence in optimization algorithms.
4.2.2.1. Min-Max Scaling#
Min-max scaling rescales values to a fixed range, usually [0, 1]:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

Example: We can use `MinMaxScaler` from `sklearn.preprocessing` to do this.
```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

data = {"Salary": [50000, 70000, 120000, 150000, 200000]}
df = pd.DataFrame(data)

scaler = MinMaxScaler()
df["Salary_minmax"] = scaler.fit_transform(df[["Salary"]])
display(df)
```
|   | Salary | Salary_minmax |
|---|---|---|
| 0 | 50000 | 0.000000 |
| 1 | 70000 | 0.133333 |
| 2 | 120000 | 0.466667 |
| 3 | 150000 | 0.666667 |
| 4 | 200000 | 1.000000 |
This is useful for algorithms like neural networks or distance-based models (KNN, K-Means) that are sensitive to magnitude.
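As a quick check of these properties, the sketch below (reusing the salary values above) verifies that `MinMaxScaler` maps the minimum to 0 and the maximum to 1, and that `inverse_transform` recovers the original values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[50000.0], [70000.0], [120000.0], [150000.0], [200000.0]])
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Each value is mapped via (x - 50000) / (200000 - 50000)
print(X_scaled.ravel())

# inverse_transform undoes the scaling exactly
X_back = scaler.inverse_transform(X_scaled)
print(np.allclose(X_back, X))
```

Because the transform is a simple linear rescaling, it preserves the ordering and relative spacing of the values.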
4.2.2.2. Standardization (Z-score)#
Standardization centers the data around the mean and scales by the standard deviation:

$$z = \frac{x - \mu}{\sigma}$$

Example: We can use `StandardScaler` from `sklearn.preprocessing` to do this.
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df["Salary_zscore"] = scaler.fit_transform(df[["Salary"]])
display(df)
```
|   | Salary | Salary_minmax | Salary_zscore |
|---|---|---|---|
| 0 | 50000 | 0.000000 | -1.254963 |
| 1 | 70000 | 0.133333 | -0.885856 |
| 2 | 120000 | 0.466667 | 0.036911 |
| 3 | 150000 | 0.666667 | 0.590571 |
| 4 | 200000 | 1.000000 | 1.513338 |
Standardization is preferred for models that assume zero-centered features, such as SVM, logistic regression, and PCA.
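To verify the result, the sketch below confirms that the standardized column has mean 0 and unit standard deviation. One subtlety: `StandardScaler` divides by the population standard deviation (`ddof=0`), not the sample standard deviation (`ddof=1`) that pandas' `.std()` reports by default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[50000.0], [70000.0], [120000.0], [150000.0], [200000.0]])
z = StandardScaler().fit_transform(X)

# After standardization: mean 0, standard deviation 1 (population std, ddof=0)
print(z.ravel())
print(z.mean(), z.std())
```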
4.2.2.3. Log Transformation#
Log transformation reduces the impact of skewed distributions and extreme values.
Example:
```python
import numpy as np

# log1p computes log(1 + x), which stays well defined when x is 0
df["Salary_log"] = np.log1p(df["Salary"])
display(df)
```
|   | Salary | Salary_minmax | Salary_zscore | Salary_log |
|---|---|---|---|---|
| 0 | 50000 | 0.000000 | -1.254963 | 10.819798 |
| 1 | 70000 | 0.133333 | -0.885856 | 11.156265 |
| 2 | 120000 | 0.466667 | 0.036911 | 11.695255 |
| 3 | 150000 | 0.666667 | 0.590571 | 11.918397 |
| 4 | 200000 | 1.000000 | 1.513338 | 12.206078 |
Log transformation is especially helpful for income, population, or any highly skewed features.
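To illustrate the effect, here is a small sketch (using synthetic log-normal draws as a stand-in for income data) that measures skewness before and after the transform with `scipy.stats.skew`:

```python
import numpy as np
from scipy.stats import skew

# A right-skewed synthetic sample; log-normal draws resemble income data
rng = np.random.default_rng(42)
incomes = rng.lognormal(mean=10.5, sigma=1.0, size=10000)

# Strongly positive skew before; close to symmetric after log1p
print(f"Skewness before log1p: {skew(incomes):.2f}")
print(f"Skewness after log1p:  {skew(np.log1p(incomes)):.2f}")
```

Since the log of a log-normal variable is normal, the transformed sample is nearly symmetric, which many models and tests handle better.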
4.2.2.4. Demo: How Normalization Improves Computation#
Algorithms like gradient descent converge faster when features are on similar scales. Let’s demonstrate with a simple linear regression.
```python
import time
from sklearn.linear_model import SGDRegressor

# Large synthetic dataset
np.random.seed(0)
X = np.random.randint(0, 1000, size=(10000, 1))
y = 3 * X.squeeze() + 500 + np.random.randn(10000) * 100

# Without normalization
start = time.time()
model = SGDRegressor(max_iter=1000, tol=1e-3)
model.fit(X, y)
time_unscaled = time.time() - start

# With Min-Max scaling
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
start = time.time()
model.fit(X_scaled, y)
time_scaled = time.time() - start

print(f"Training time without scaling: {time_unscaled:.4f} s")
print(f"Training time with scaling: {time_scaled:.4f} s")
print(f"Boost: {time_unscaled / time_scaled:.2f}x faster")
```
```
Training time without scaling: 0.1446 s
Training time with scaling: 0.0114 s
Boost: 12.68x faster
```
Scaling the features reduces the number of iterations required for gradient descent to converge, thus improving computation speed and stability.
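One practical caveat: in a real workflow the scaler should be fit on the training split only, so that test-set statistics do not leak into preprocessing. A minimal sketch of this pattern, using scikit-learn's `make_pipeline` on the same synthetic data:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split

np.random.seed(0)
X = np.random.randint(0, 1000, size=(10000, 1)).astype(float)
y = 3 * X.squeeze() + 500 + np.random.randn(10000) * 100

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The pipeline fits the scaler on the training split only, then applies
# the same learned transformation to the test split inside score()
pipe = make_pipeline(MinMaxScaler(), SGDRegressor(max_iter=1000, tol=1e-3))
pipe.fit(X_train, y_train)
print(f"Test R^2: {pipe.score(X_test, y_test):.3f}")
```

Bundling the scaler and model this way guarantees that any new data passed to `predict` or `score` is scaled with the training-set minimum and maximum.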