6.2.1.7. Random Forest#

Random Forest is the best-known Bagging algorithm - and the go-to non-linear model for most tabular regression problems. It starts from the same idea as Bagging (see Ensemble Methods): grow many decision trees on bootstrap samples and average their predictions. But it adds one critical twist that makes it far more powerful: feature randomisation.

At each split inside every tree, the algorithm considers only a random subset of \(m\) features (not all \(p\) features). This decorrelates the trees. Without feature randomisation, if one feature is a very strong predictor, almost every tree would split on it first - the trees would look very similar and averaging them would provide little benefit. By forcing each tree to use a random feature subset, trees must find different paths through the data and end up making more independent errors.
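The decorrelation effect can be measured directly: fit two small forests on synthetic data, one with all features available at each split and one with a random subset, and compare the average pairwise correlation of the individual trees' predictions. A quick sketch (exact numbers will vary with the data):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=10, noise=25, random_state=0)

def mean_tree_correlation(max_features):
    """Average pairwise correlation between individual trees' predictions."""
    rf = RandomForestRegressor(n_estimators=50, max_features=max_features,
                               random_state=0, n_jobs=-1).fit(X, y)
    preds = np.array([t.predict(X) for t in rf.estimators_])   # shape (50, 300)
    corr = np.corrcoef(preds)                                  # 50 x 50 tree-by-tree matrix
    return corr[~np.eye(len(corr), dtype=bool)].mean()         # mean of off-diagonal entries

print(f"max_features=1.0   : {mean_tree_correlation(1.0):.3f}")
print(f"max_features='sqrt': {mean_tree_correlation('sqrt'):.3f}")   # expect a lower value
```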

The result is:

  • Reduced variance from averaging many diverse trees

  • Strong resistance to overfitting - unlike a single deep tree, the averaged forest generalises well

  • Free validation via out-of-bag (OOB) samples - each tree is validated on the ~37% of data it did not see during training


The Math#

Training algorithm:

  1. For \(b = 1, \ldots, B\):

    1. Draw a bootstrap sample \(\mathcal{D}_b\) of size \(n\) from the training data.

    2. Grow a full (deep) decision tree on \(\mathcal{D}_b\). At every split, randomly sample \(m \leq p\) features and find the best split only among those \(m\) features.

  2. Store all \(B\) trees.

Prediction:

\[\hat{y}_{\text{RF}}(\mathbf{x}) = \frac{1}{B}\sum_{b=1}^{B} \hat{f}_b(\mathbf{x})\]

The choice \(m = \sqrt{p}\) is a common heuristic that balances bias and variance well in practice; for regression, \(m = p/3\) is another popular rule of thumb.
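For illustration only, the two training steps above can be sketched by hand with scikit-learn's DecisionTreeRegressor (a toy re-implementation on synthetic data; in practice use RandomForestRegressor, which also handles the OOB bookkeeping):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=300, n_features=10, noise=25, random_state=0)
rng = np.random.default_rng(0)
B, n, p = 100, len(X), X.shape[1]
m = int(np.sqrt(p))                              # features considered at each split

trees = []
for b in range(B):
    idx = rng.integers(0, n, size=n)             # step 1a: bootstrap sample of size n
    tree = DecisionTreeRegressor(max_features=m, # step 1b: m random features per split
                                 random_state=b) # grown to full depth by default
    tree.fit(X[idx], y[idx])
    trees.append(tree)                           # step 2: store all B trees

def predict(X_query):
    # prediction: average the B individual tree predictions
    return np.mean([t.predict(X_query) for t in trees], axis=0)

print(predict(X[:3]))
```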

Feature importance is computed as the total decrease in impurity (for regression, MSE) contributed by all splits on that feature within a tree, averaged over all trees and normalised to sum to one.
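This averaging is observable directly in scikit-learn: the forest's feature_importances_ should match the mean of the per-tree importances (each tree's importances are already normalised to sum to one). A small check on synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=300, n_features=10, n_informative=6,
                       noise=25, random_state=42)
rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)

# Average the impurity-decrease importances of the individual trees
per_tree = np.array([t.feature_importances_ for t in rf.estimators_])
manual = per_tree.mean(axis=0)
print(np.allclose(manual, rf.feature_importances_))   # expect True (up to float tolerance)
```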

Key hyperparameters:

| Hyperparameter | Guidance |
| --- | --- |
| `n_estimators` | More is better up to a point; 100–500 is usually sufficient |
| `max_features` | `"sqrt"` decorrelates trees well; note that `RandomForestRegressor` defaults to `1.0` (all features), so set it explicitly |
| `max_depth` | Default `None` (full depth) works well; restrict if RAM is a concern |
| `min_samples_leaf` | Higher → smoother individual trees; useful for noisy targets |
| `oob_score=True` | Enables a free out-of-bag validation estimate |

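Because the OOB score is free, it can stand in for a validation set when comparing hyperparameter settings. A sketch comparing a few max_features values on synthetic data (the candidate grid here is chosen for illustration):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=300, n_features=10, n_informative=6,
                       noise=25, random_state=42)

# Compare candidate max_features values using the free OOB estimate
for mf in [1, 'sqrt', 0.5, 1.0]:
    rf = RandomForestRegressor(n_estimators=200, max_features=mf,
                               oob_score=True, random_state=42, n_jobs=-1)
    rf.fit(X, y)
    print(f"max_features={mf!r:7} OOB R² = {rf.oob_score_:.3f}")
```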

In scikit-learn#

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=200,
    max_features='sqrt',
    oob_score=True,
    random_state=42,
    n_jobs=-1       # use all CPU cores
)
rf.fit(X_train, y_train)
print(f"OOB R²: {rf.oob_score_:.3f}")   # free estimate of test performance

No feature scaling is needed - Random Forest inherits this property from decision trees.
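A quick way to convince yourself: fit the same forest (same `random_state`) on an affinely rescaled copy of the features. The rescaling preserves the ordering of every feature, so every split partitions the data identically and the predictions should coincide. A sanity-check sketch:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

rf_raw    = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
rf_scaled = RandomForestRegressor(n_estimators=50, random_state=0).fit(X * 1000 + 7, y)

# Same random_state → identical bootstraps and feature subsets; trees split on
# thresholds, so an order-preserving rescaling leaves every split unchanged
print(np.allclose(rf_raw.predict(X), rf_scaled.predict(X * 1000 + 7)))   # expect True
```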


Example#


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.datasets import make_regression

np.random.seed(42)

X, y = make_regression(n_samples=300, n_features=10, n_informative=6,
                        noise=25, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
rf = RandomForestRegressor(n_estimators=200, max_features='sqrt',
                           oob_score=True, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)

train_r2  = r2_score(y_train, rf.predict(X_train))
test_r2   = r2_score(y_test,  rf.predict(X_test))
test_rmse = np.sqrt(mean_squared_error(y_test, rf.predict(X_test)))
oob_r2    = rf.oob_score_

print(f"Train R²  : {train_r2:.3f}")
print(f"Test  R²  : {test_r2:.3f}")
print(f"OOB   R²  : {oob_r2:.3f}  ← free estimate, no test data used")
print(f"Test  RMSE: {test_rmse:.1f}")
Train R²  : 0.963
Test  R²  : 0.713
OOB   R²  : 0.732  ← free estimate, no test data used
Test  RMSE: 95.3

The forest achieves a test \(R^2\) of 0.713 and an OOB \(R^2\) of 0.732. Notice how closely OOB and test performance agree - the OOB estimate is a reliable free substitute for a separate validation set.
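Beyond oob_score_, the fitted regressor also exposes oob_prediction_ - for each training point, the average prediction of only the trees that did not see it - so any metric can be computed from OOB data, not just R². A self-contained sketch (refitting here so the snippet runs on its own):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=300, n_features=10, n_informative=6,
                       noise=25, random_state=42)
rf = RandomForestRegressor(n_estimators=200, oob_score=True,
                           random_state=42, n_jobs=-1).fit(X, y)

# oob_prediction_ holds one out-of-bag prediction per training sample
oob_rmse = np.sqrt(mean_squared_error(y, rf.oob_prediction_))
print(f"OOB RMSE: {oob_rmse:.1f}")
```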

Feature Importance#

Random Forest provides interpretable feature importance scores as a by-product of training - no extra computation required:


imp = pd.Series(rf.feature_importances_,
                index=[f"Feature {i}" for i in range(X.shape[1])])
imp = imp.sort_values(ascending=False)

fig, ax = plt.subplots(figsize=(9, 4))
colors = ["#2ecc71" if i < 6 else "#e74c3c" for i in range(len(imp))]  # assumes the 6 informative features rank highest
imp.plot(kind="bar", ax=ax, edgecolor="black", alpha=0.85, color=colors)
ax.set_ylabel("Mean impurity decrease", fontsize=11)
ax.set_title("Random Forest - Feature Importances", fontsize=13, fontweight="bold")
ax.set_xticklabels(imp.index, rotation=45, ha="right")
ax.grid(True, alpha=0.3, axis="y")

from matplotlib.patches import Patch
legend_elements = [Patch(facecolor="#2ecc71", label="Informative (6)"),
                   Patch(facecolor="#e74c3c", label="Non-informative (4)")]
ax.legend(handles=legend_elements, fontsize=10)

plt.tight_layout()
plt.show()

The most important feature is Feature 1 with an importance score of 0.292. The 4 non-informative features (shown in red) cluster at the bottom - Random Forest effectively deprioritises them.
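Impurity-based importances are computed on training data and can be biased toward features with many possible split points; scikit-learn's permutation_importance offers a held-out alternative that measures the drop in test score when one feature is shuffled. A sketch reusing the same synthetic setup:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=10, n_informative=6,
                       noise=25, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=42)
rf = RandomForestRegressor(n_estimators=200, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)

# Drop in test R² when each feature is shuffled, averaged over n_repeats
result = permutation_importance(rf, X_test, y_test, n_repeats=10,
                                random_state=42, n_jobs=-1)
for i in result.importances_mean.argsort()[::-1][:3]:
    print(f"Feature {i}: {result.importances_mean[i]:.3f}")
```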

Performance Stabilises with More Trees#


tree_counts = [1, 5, 10, 25, 50, 100, 200, 400]
train_scores, test_scores = [], []

for n in tree_counts:
    rf_n = RandomForestRegressor(n_estimators=n, random_state=42, n_jobs=-1)
    rf_n.fit(X_train, y_train)
    train_scores.append(rf_n.score(X_train, y_train))
    test_scores.append(rf_n.score(X_test, y_test))

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(tree_counts, train_scores, "o-", linewidth=2, label="Train R²")
ax.plot(tree_counts, test_scores,  "s--", linewidth=2, label="Test R²")
ax.set_xlabel("Number of Trees", fontsize=12)
ax.set_ylabel("R²", fontsize=12)
ax.set_title("Random Forest - Performance vs Number of Trees",
             fontsize=13, fontweight="bold")
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Test performance plateaus around 100–200 trees. Adding more trees beyond this point yields diminishing returns - the forest is already stable.
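One practical consequence: rather than refitting from scratch at every tree count (as the experiment above does), warm_start=True keeps the already-grown trees and only grows the new ones when n_estimators is raised. A sketch of the same experiment:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=10, n_informative=6,
                       noise=25, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=42)

# warm_start=True: each fit() call below only adds the missing trees
rf = RandomForestRegressor(n_estimators=25, warm_start=True,
                           random_state=42, n_jobs=-1)
for n in [25, 50, 100, 200]:
    rf.set_params(n_estimators=n)
    rf.fit(X_train, y_train)
    print(f"{n:4d} trees: test R² = {rf.score(X_test, y_test):.3f}")
```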

Effect of max_features on the Bias–Variance Balance#


mf_options = [1, 2, 3, 5, 7, 10]   # number of features considered at each split
mf_train, mf_test = [], []

for mf in mf_options:
    rf_mf = RandomForestRegressor(n_estimators=200, max_features=mf,
                                   random_state=42, n_jobs=-1)
    rf_mf.fit(X_train, y_train)
    mf_train.append(rf_mf.score(X_train, y_train))
    mf_test.append(rf_mf.score(X_test, y_test))

fig, ax = plt.subplots(figsize=(9, 4))
ax.plot(mf_options, mf_train, "o-", linewidth=2, label="Train R²")
ax.plot(mf_options, mf_test,  "s--", linewidth=2, label="Test R²")
ax.axvline(int(np.sqrt(X.shape[1])), color="red", linestyle=":",
           linewidth=1.5, label=f"√p ≈ {int(np.sqrt(X.shape[1]))}")
ax.set_xlabel("max_features  (features per split)", fontsize=12)
ax.set_ylabel("R²", fontsize=12)
ax.set_title("Random Forest - Effect of max_features",
             fontsize=13, fontweight="bold")
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Using fewer features per split (left side) makes trees more diverse but individually weaker - higher bias, lower variance. Using all features (max_features=10) removes the decorrelation benefit. The default \(\sqrt{p}\) sits near the sweet spot.


Strengths and Weaknesses#

| | |
| --- | --- |
| **Strengths** | Excellent out-of-the-box performance; handles non-linearity; free OOB validation; robust to outliers and monotonic feature transforms; built-in feature importance |
| **Weaknesses** | Less interpretable than a single tree; slower to train/predict than linear models; cannot extrapolate beyond the range of the training targets; large forests consume significant memory |

Tip

Random Forest is one of the best default models for tabular regression. Start here when a linear model under-performs and you need non-linear power with minimal tuning.