EDS 232


Lesson 9

Decision Trees

In this lesson



  • Decision trees: flexible, interpretable prediction models that partition the predictor space into rectangular regions
  • Regression trees: decision trees for predicting a numerical response variable
  • Classification trees: decision trees for predicting a categorical class label
  • Dealing with overfitting: approaches for controlling model complexity

Regression trees

A guiding dataset


22 ecological survey sites, each characterized by:

  • rainfall — annual precipitation (mm)
  • temperature — mean summer temperature (°C)
  • biomass — aboveground plant biomass (tonnes/ha) (response)

What is a regression tree?

Regression tree structure


A regression tree predicts a numerical response by partitioning the predictor space into rectangular regions and assigning a predicted value to each region.

✏️

Tree terminology


  • Root node: first split, applied to the entire dataset
  • Internal node: a non-leaf node that applies an additional split to a subset
  • Leaf node: a terminal node — no further splits
  • Branch: the path an observation takes at each split
  • Depth: number of nodes along the longest root-to-leaf path

The predicted value at a leaf is the mean response of all training observations that reach that leaf.

Making predictions


To predict for a new observation: start at the root and follow the branches based on the split conditions until reaching a leaf. The value at the leaf is the prediction.

For example, a new site with rainfall = 430 mm and temperature = 10°C is predicted to have a biomass of 3.2 t/ha.

Check-in


Use the regression tree to predict the biomass for a new site with rainfall = 750 mm and temperature = 20°C.

Check-in


Use the regression tree to predict the biomass for a new site with rainfall = 750 mm and temperature = 20°C.

  1. Root: Is rainfall ≤ 583 mm? No → go right
  2. The right (No) branch leads directly to a leaf — no further split in this region
  3. Predicted biomass: 7.4 t/ha

Node splits and prediction regions

The root node split


In regression trees, the predicted value at the leaf is the mean of all training observations that reach this leaf.

We begin with our full training dataset:

The root node split


The root node asks whether rainfall ≤ 583mm" mm. This creates a vertical division in the predictor space.

The second split


The second node looks at observations in the left region (rainfall ≤ 583 mm) and asks whether temperature ≤ 15.1 °C. This adds a horizontal split within that region only.

Leaf regions and predictions


Leaves correspond to the final regions in the predictor space.

Each observation is assigned the mean response of the training observations that fall in the same region:

Check-in


The plot shows a new split added to the right region.

Modify the tree diagram to include a node and leaves that represent this additional split.

Check-in


Regression tree: summary


A regression tree works by:

1. Dividing the predictor space — the set of all possible values for \(X_1, X_2, \ldots, X_p\) — into \(J\) non-overlapping rectangular regions \(R_1, R_2, \ldots, R_J\).

2. Assigning a prediction to each region: for every observation that falls into region \(R_j\), the prediction is the mean response of the training observations in \(R_j\).

Fitting a regression tree

How are the splits found?


For a set of regions \(R_1, \ldots, R_J\), the quality of the tree is measured by the residual sum of squares (RSS):

\[RSS = \sum_{j=1}^J \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2\]

The regions are found using recursive binary splitting:

  1. Starting from the full dataset, search over all predictors and all possible thresholds to find the single split that most reduces the RSS
  2. Repeat this process within each resulting region
  3. Stop when a stopping criterion is met (e.g., maximum depth, or too few observations to split)

Finding the first split


The algorithm evaluates every possible threshold on every predictor. Here are three candidate splits on rainfall:

RSS as a function of the split threshold


The algorithm picks the threshold that minimizes RSS, this becomes the first node of the tree.

1-node tree: the result


Finding the second split


After the first split, the algorithm considers splitting each resulting region. Here are three candidate temperature splits within the left (low-rainfall) region:

RSS within the left region


The algorithm repeats the same search, now within each existing region, and picks the threshold that most reduces RSS.

Where would this algorithm end if we kept adding nodes and subdividing the regions without any stopping criterion?

Dealing with overfitting

Overfitting in decision trees


A fully-grown decision tree will perfectly fit the training data but generalize poorly to new observations.

As the tree grows deeper, the decision boundaries become increasingly specific to the training data:

Go to website for images

Check-in


  1. The full tree has a training MSE of 0. What does this tell us about how well it fits the training data?

  2. Would you expect the full tree to perform better or worse than the depth-3 tree on new, unseen data? Why?

  3. What are two approaches you could use to control the complexity of a decision tree?

  1. A training MSE near zero means the full tree essentially memorizes the training data — every observation is in its own leaf.

  2. The full tree would likely perform worse on new data. It has memorized noise in the training set (high variance) rather than learning the underlying pattern.

  3. Two common approaches:

    • Maximum depth: limit how many levels the tree can grow
    • Minimum samples per leaf: stop splitting when a region has too few observations

Controlling tree complexity


Maximum depth and minimum samples per leaf are both hyperparameters that control the complexity vs. fit tradeoff:

  • A shallower tree has fewer, larger regions
  • A deeper tree has more, smaller regions

How would cross-validation be applied to find the best values for these hyperparameters?

Compute the CV-MSE for a grid of depth and samples-per-leaf values. Select the combination that gives the lowest cross-validation error as the hyperparameters for the final tree.

Classification trees

A classification dataset


The same 22 sites, same predictors (rainfall, temperature), but a categorical response: the site’s biome type:

What is a classification tree?


A classification tree has the same structure as a regression tree (root, internal nodes, branches, and leaves) but makes categorical predictions instead of numerical ones.

For each leaf node, the prediction is the majority class (most common class) among the training observations that reach that leaf.

Check-in


Use the classification tree to predict the biome type for a new site with rainfall = 500 mm and temperature = 15°C.

Check-in


Use the classification tree to predict the biome type for a new site with rainfall = 500 mm and temperature = 15°C.

  1. Root: Is rainfall 500 mm ≤ 675.9 mm? Yes → left branch (low rainfall)
  2. Internal node: Is temperature 15°C ≤ 18.3°C? Yes → left branch (low temperature)
  3. Leaf: The majority class in this leaf is grassland

Decision regions


Just like regression trees, classification trees divide the predictor space into rectangular regions:

Fitting a classification tree

Fitting a classification tree


A classification tree is fitted by:

1. Dividing the predictor space into \(J\) non-overlapping regions \(R_1, \ldots, R_J\).

2. Predicting the majority class among training observations in each region \(R_j\). . . .

As with regression trees, recursive binary splitting is used, but we cannot use RSS for classification. Instead, we measure the impurity of the resulting regions.

We are often interested in the class proportions at each leaf, not just the predicted class. These give a sense of prediction confidence or purity: a leaf where 95% of observations share one class is more reliable than a 60/40 split.

Check-in


Looking at the classification tree, which leaf has the most confident predictions? Which would you say is the most “pure” and why?

Check-in


Which leaf is most pure?

The forest leaf (leftmost, green). All training observations in that region belong to the same class — forest — so the predicted class proportion is 1.0 for forest and 0.0 for all others.

The tree is maximally confident there: no ambiguity in the prediction.

In contrast, the other leaves contain a mix of classes, making their predictions less reliable.

Gini impurity


The most common splitting criterion for classification trees is the Gini impurity index: a measure of how mixed the classes are in a given region.


  • A region where all observations belong to the same class → Gini impurity = 0 (perfectly pure)
  • The more evenly observations are spread across classes → higher impurity


At each step, recursive binary splitting searches for the variable and threshold that produce the purest child regions.

Entropy is another common impurity measure that behaves similarly and tends to produce very similar trees in practice.

Trees vs. linear models

Trees vs. linear models


Decision trees and linear models capture fundamentally different types of relationships:

  • Linear models (regression, ridge, lasso): assume the response changes smoothly and approximately linearly with the predictors
  • Decision trees: make no such assumption, they partition the predictor space into rectangular regions with constant predictions


Trees naturally capture nonlinear, step-like relationships that linear models cannot fit well.

Linear models perform better when the true relationship is approximately linear and smooth.

Trees vs. linear models: example


Synthetic data where Y depends on temperature in a step-function pattern:

Trees vs. linear models: summary


Decision Trees Linear Models
Interpretability High, easy to visualize Ok, coefficients interpretable
Nonlinear relationships Handles them naturally Requires feature engineering
Prediction accuracy Often lower Competitive when relationship is linear
Variance High, sensitive to small changes Generally lower
Variable selection Implicit (splits on useful variables) Requires explicit methods (e.g., lasso)