4.2.2. Numeric Transformations#
Numeric features in a dataset often need scaling, normalization, or transformation before being fed into machine learning models. Proper transformation can improve model performance, stability, and convergence in optimization algorithms.
4.2.2.1. Min-Max Scaling#
Min-max scaling rescales values to a fixed range, usually [0, 1]:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

Example: We can use `MinMaxScaler` from `sklearn.preprocessing` to do this.
```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

data = {"Salary": [50000, 70000, 120000, 150000, 200000]}
df = pd.DataFrame(data)

scaler = MinMaxScaler()
df["Salary_minmax"] = scaler.fit_transform(df[["Salary"]])
display(df)
```
|   | Salary | Salary_minmax |
|---|---|---|
| 0 | 50000 | 0.000000 |
| 1 | 70000 | 0.133333 |
| 2 | 120000 | 0.466667 |
| 3 | 150000 | 0.666667 |
| 4 | 200000 | 1.000000 |
This is useful for algorithms like neural networks or distance-based models (KNN, K-Means) that are sensitive to magnitude.
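As a quick check of these properties, the sketch below (reusing the salary values above) verifies that `MinMaxScaler` maps the minimum to 0 and the maximum to 1, and that `inverse_transform` recovers the original values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[50000.0], [70000.0], [120000.0], [150000.0], [200000.0]])
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Each value is mapped via (x - 50000) / (200000 - 50000)
print(X_scaled.ravel())

# inverse_transform undoes the scaling exactly
X_back = scaler.inverse_transform(X_scaled)
print(np.allclose(X_back, X))
```

Because the transform is a simple linear rescaling, it preserves the ordering and relative spacing of the values.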
4.2.2.2. Standardization (Z-score)#
Standardization centers the data around the mean and scales by the standard deviation:

$$z = \frac{x - \mu}{\sigma}$$

Example: We can use `StandardScaler` from `sklearn.preprocessing` to do this.
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df["Salary_zscore"] = scaler.fit_transform(df[["Salary"]])
display(df)
```
|   | Salary | Salary_minmax | Salary_zscore |
|---|---|---|---|
| 0 | 50000 | 0.000000 | -1.254963 |
| 1 | 70000 | 0.133333 | -0.885856 |
| 2 | 120000 | 0.466667 | 0.036911 |
| 3 | 150000 | 0.666667 | 0.590571 |
| 4 | 200000 | 1.000000 | 1.513338 |
Standardization is preferred for models that assume zero-centered features, such as SVM, logistic regression, and PCA.
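To verify the result, the sketch below confirms that the standardized column has mean 0 and unit standard deviation. One subtlety: `StandardScaler` divides by the population standard deviation (`ddof=0`), not the sample standard deviation (`ddof=1`) that pandas' `.std()` reports by default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[50000.0], [70000.0], [120000.0], [150000.0], [200000.0]])
z = StandardScaler().fit_transform(X)

# After standardization: mean 0, standard deviation 1 (population std, ddof=0)
print(z.ravel())
print(z.mean(), z.std())
```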
4.2.2.3. Log Transformation#
Log transformation reduces the impact of skewed distributions and extreme values.
Example:
```python
import numpy as np

# log1p computes log(1 + x), which stays well defined when x is 0
df["Salary_log"] = np.log1p(df["Salary"])
display(df)
```
|   | Salary | Salary_minmax | Salary_zscore | Salary_log |
|---|---|---|---|---|
| 0 | 50000 | 0.000000 | -1.254963 | 10.819798 |
| 1 | 70000 | 0.133333 | -0.885856 | 11.156265 |
| 2 | 120000 | 0.466667 | 0.036911 | 11.695255 |
| 3 | 150000 | 0.666667 | 0.590571 | 11.918397 |
| 4 | 200000 | 1.000000 | 1.513338 | 12.206078 |
Log transformation is especially helpful for income, population, or any highly skewed features.
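To illustrate the effect, here is a small sketch (using synthetic log-normal draws as a stand-in for income data) that measures skewness before and after the transform with `scipy.stats.skew`:

```python
import numpy as np
from scipy.stats import skew

# A right-skewed synthetic sample; log-normal draws resemble income data
rng = np.random.default_rng(42)
incomes = rng.lognormal(mean=10.5, sigma=1.0, size=10000)

# Strongly positive skew before; close to symmetric after log1p
print(f"Skewness before log1p: {skew(incomes):.2f}")
print(f"Skewness after log1p:  {skew(np.log1p(incomes)):.2f}")
```

Since the log of a log-normal variable is normal, the transformed sample is nearly symmetric, which many models and tests handle better.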
4.2.2.4. Demo: How Normalization Improves Computation#
Algorithms like gradient descent converge faster when features are on similar scales. Let’s demonstrate with a simple linear regression.
```python
import time
from sklearn.linear_model import SGDRegressor

# Large synthetic dataset
np.random.seed(0)
X = np.random.randint(0, 1000, size=(10000, 1))
y = 3 * X.squeeze() + 500 + np.random.randn(10000) * 100

# Without normalization
start = time.time()
model = SGDRegressor(max_iter=1000, tol=1e-3)
model.fit(X, y)
time_unscaled = time.time() - start

# With Min-Max scaling
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
start = time.time()
model.fit(X_scaled, y)
time_scaled = time.time() - start

print(f"Training time without scaling: {time_unscaled:.4f} s")
print(f"Training time with scaling: {time_scaled:.4f} s")
print(f"Boost: {time_unscaled / time_scaled:.2f}x faster")
```
```
Training time without scaling: 0.1446 s
Training time with scaling: 0.0114 s
Boost: 12.68x faster
```
Scaling the features reduces the number of iterations required for gradient descent to converge, thus improving computation speed and stability.
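One practical caveat: in a real workflow the scaler should be fit on the training split only, so that test-set statistics do not leak into preprocessing. A minimal sketch of this pattern, using scikit-learn's `make_pipeline` on the same synthetic data:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split

np.random.seed(0)
X = np.random.randint(0, 1000, size=(10000, 1)).astype(float)
y = 3 * X.squeeze() + 500 + np.random.randn(10000) * 100

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The pipeline fits the scaler on the training split only, then applies
# the same learned transformation to the test split inside score()
pipe = make_pipeline(MinMaxScaler(), SGDRegressor(max_iter=1000, tol=1e-3))
pipe.fit(X_train, y_train)
print(f"Test R^2: {pipe.score(X_test, y_test):.3f}")
```

Bundling the scaler and model this way guarantees that any new data passed to `predict` or `score` is scaled with the training-set minimum and maximum.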