EDS 232


Lesson 9

Bagging and Random Forests

In this lesson



  • Ensemble methods
  • Bagging: fitting many trees on bootstrap samples and averaging predictions
  • Out-of-bag error estimation: an alternative to cross-validation
  • Random forests: decorrelating trees by using a random subset of predictors at each split

A guiding dataset


22 ecological survey sites, each characterized by:

  • rainfall — annual precipitation (mm)
  • temperature — mean summer temperature (°C)
  • biomass — aboveground plant biomass (tonnes/ha) (response)

The variance problem in decision trees

The variance problem in decision trees


A single decision tree suffers from high variance: small changes in the training data can lead to very different trees and very different predictions. Deep trees are prone to overfitting.

One general strategy for reducing variance:

  1. Take many training sets from the same population
  2. Fit a separate model on each training set
  3. Average the resulting predictions

Core idea: The more uncorrelated the individual predictions are, the more variance reduction we get from averaging.

The challenge: in practice we only have a single training set. Bootstrapping is a way around this.

Bootsrapping & bagging

Bootsrap


Bootstrapping: generate a new training set by sampling with replacement from the original training set

  • Each bootstrapped sample is the same size as the original
  • Some observations appear more than once
  • Some observations are left out entirely

Bagging — Bootstrap AGGregatING


Bagging is a method to generate an ensemble model by:

  1. Generating many training sets from the single one we have by bootstrapping
  2. Fitting a model on each bootstrapped sample
  3. Then aggregate predictions to get a final prediction from the ensemble

Bagging for regression


Bagging for regression


Bagging for regression


Bagging for regression


Bagging for regression


Bagging for regression


  1. Generate \(B\) bootstrapped training sets
  2. Fit a decision tree on each bootstrapped training set
  3. Obtain \(B\) estimates \(\hat{f}^{*1}(x), \ldots, \hat{f}^{*B}(x)\), one for each tree
  4. Average the predictions:

\[\hat{f}_{\text{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x)\]

Check-in

How would you modify the bagging workflow for a classification problem? What changes? Create the corresponding diagram.

Bagging for classification


In the case of classification, the workflow is the same as the regression case, but the final aggregation step changes:

  • Instead of averaging numerical predictions, take a majority vote across all \(B\) trees
  • The final prediction is the most commonly occurring class:

Out-of-bag error estimation

Out-of-bag observations


When fitting each bootstrapped tree, some of the training observations are left out of that bootstrap sample. These are called the out-of-bag (OOB) observations for that tree.

Suppose your model has three trees fit on Bootstrap samples 1, 2, and 3 shown here. For which trees would observation 4 be an OOB observation?

Out-of-bag observations


When fitting each bootstrapped tree, some of the training observations are left out of that bootstrap sample. These are called the out-of-bag (OOB) observations for that tree.

Suppose your model has three trees fit on Bootstrap samples 1, 2, and 3 shown here. For which trees would observation 4 be an OOB observation?

Observation 4 does not appear in Bootstrap samples 2 and 3, so it is an OOB observation for those two trees.

Out-of-bag error


OOB observations give us a way to estimate test error without cross-validation:

  1. For each observation \((x_i, y_i)\), identify all trees for which it was OOB
  2. Use only those trees to predict \(x_i\) (average for regression, majority vote for classification)
  3. Repeat for all \(n\) training observations: get one OOB prediction per observation
  4. Compute MSE (regression) or classification error from these OOB predictions

The OOB error is valid because each observation is predicted only by trees that were not trained on it.

We can use this error estimate to examne model parameters such as the number of trees.

Choosing the number of trees B


Unlike tree depth, \(B\) is not a regularization parameter. A very large \(B\) does not cause overfitting.

We pick a \(B\) large enough that the OOB error stabilizes. 100 to 500 are common defaults.

Variable importance

Variable importance


With \(B\) trees we lose the direct interpretability of a single tree diagram. But we can still extract a useful summary of which predictors matter most.


For each predictor, record the total reduction in RSS (regression) or Gini impurity (classification) achieved by splits on that predictor, averaged over all \(B\) trees.


Predictors with larger average reductions are more important to the model’s predictions.

Check-in


The plot shows variable importance for a model of bagged regression trees (B=200) on our example data.

  1. Which predictor is most important for predicting biomass?
  2. When creating the trees for the ensemble, which predictor would you expect to be used in the first split for most of them?

Check-in


The plot shows variable importance for a model of bagged regression trees (B=200) on our example data.

  1. Which predictor is most important for predicting biomass?

  2. When creating the trees for the ensemble, which predictor would you expect to be used in the first split for most of them?

  3. Rainfall: it has the largest average RSS reduction across all trees.

  4. Rainfall: consistent with the decision tree lesson, where the root node always split on rainfall first.

Random forests

Random forests: decorrelating the trees


Problem with bagging:

  • if one predictor is very strongly associated with the response, nearly every tree will use it as the top split
  • the \(B\) trees look similar
  • they are correlated
  • averaging correlated predictions does not reduce variance much.

Random forests fix this by introducing randomness at each split:

  • For every tree, at each split, randomly select a subset of \(m < p\) predictors as candidates
  • A fresh random subset is drawn at every split
  • Typical defaults: \(m = \sqrt{p}\) (classification), \(m = p/3\) (regression)

No single strong predictor can dominate every tree → trees are more diverse → averaging produces a larger reduction in variance.

Check-in


Suppose you are building one tree in a random forest with predictors {rainfall, temperature, elevation, soil moisture} (\(p = 4\), \(m = 2\)).

At the root node the algorithm randomly selects {rainfall, elevation} and splits on rainfall.

At the next node, which predictors are available to split on?

A new random sample of 2 predictors is drawn independently: it could be any pair, including ones used before. The previous split does not restrict future splits.

Random forest vs. bagging vs. single tree


Check-in


  1. If \(m = p\) (we use all predictors at every split), how does a random forest relate to bagging?

  2. When would you expect a random forest to outperform bagging the most?

  1. When \(m = p\), random forests reduce to bagging exactly. There is no restriction on which predictors can be considered, so every tree is built the same way as a bagged tree.

  2. When there are one or a few dominant predictors: bagged trees become too similar to each other, while random forests force diversity by sometimes excluding the dominant predictor from consideration.