rainfall— annual precipitation (mm)temperature— mean summer temperature (°C)biomass— aboveground plant biomass (tonnes/ha) (response)

EDS 232
Lesson 9
Decision Trees
In this lesson
Regression trees
A guiding dataset
22 ecological survey sites, each characterized by:
rainfall — annual precipitation (mm)temperature — mean summer temperature (°C)biomass — aboveground plant biomass (tonnes/ha) (response)
What is a regression tree?
Regression tree structure
A regression tree predicts a numerical response by partitioning the predictor space into rectangular regions and assigning a predicted value to each region.

✏️
Tree terminology

The predicted value at a leaf is the mean response of all training observations that reach that leaf.
Making predictions
To predict for a new observation: start at the root and follow the branches based on the split conditions until reaching a leaf. The value at the leaf is the prediction.
For example, a new site with rainfall = 430 mm and temperature = 10°C is predicted to have a biomass of 3.2 t/ha.

Check-in

Use the regression tree to predict the biomass for a new site with rainfall = 750 mm and temperature = 20°C.
Check-in

Use the regression tree to predict the biomass for a new site with rainfall = 750 mm and temperature = 20°C.
Node splits and prediction regions
The root node split
In regression trees, the predicted value at the leaf is the mean of all training observations that reach this leaf.
We begin with our full training dataset:

The root node split
The root node asks whether rainfall ≤ 583mm" mm. This creates a vertical division in the predictor space.


The second split
The second node looks at observations in the left region (rainfall ≤ 583 mm) and asks whether temperature ≤ 15.1 °C. This adds a horizontal split within that region only.


Leaf regions and predictions
Leaves correspond to the final regions in the predictor space.
Each observation is assigned the mean response of the training observations that fall in the same region:


Check-in
The plot shows a new split added to the right region.
Modify the tree diagram to include a node and leaves that represent this additional split.


Check-in


Regression tree: summary
A regression tree works by:
1. Dividing the predictor space — the set of all possible values for \(X_1, X_2, \ldots, X_p\) — into \(J\) non-overlapping rectangular regions \(R_1, R_2, \ldots, R_J\).
2. Assigning a prediction to each region: for every observation that falls into region \(R_j\), the prediction is the mean response of the training observations in \(R_j\).
Fitting a regression tree
How are the splits found?
For a set of regions \(R_1, \ldots, R_J\), the quality of the tree is measured by the residual sum of squares (RSS):
\[RSS = \sum_{j=1}^J \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2\]
The regions are found using recursive binary splitting:
Finding the first split
The algorithm evaluates every possible threshold on every predictor. Here are three candidate splits on rainfall:

RSS as a function of the split threshold

The algorithm picks the threshold that minimizes RSS, this becomes the first node of the tree.
1-node tree: the result


Finding the second split
After the first split, the algorithm considers splitting each resulting region. Here are three candidate temperature splits within the left (low-rainfall) region:

RSS within the left region
The algorithm repeats the same search, now within each existing region, and picks the threshold that most reduces RSS.

Where would this algorithm end if we kept adding nodes and subdividing the regions without any stopping criterion?
Dealing with overfitting
Overfitting in decision trees
A fully-grown decision tree will perfectly fit the training data but generalize poorly to new observations.
As the tree grows deeper, the decision boundaries become increasingly specific to the training data:
Go to website for images
Check-in
The full tree has a training MSE of 0. What does this tell us about how well it fits the training data?
Would you expect the full tree to perform better or worse than the depth-3 tree on new, unseen data? Why?
What are two approaches you could use to control the complexity of a decision tree?
A training MSE near zero means the full tree essentially memorizes the training data — every observation is in its own leaf.
The full tree would likely perform worse on new data. It has memorized noise in the training set (high variance) rather than learning the underlying pattern.
Two common approaches:
Controlling tree complexity
Maximum depth and minimum samples per leaf are both hyperparameters that control the complexity vs. fit tradeoff:
How would cross-validation be applied to find the best values for these hyperparameters?
Compute the CV-MSE for a grid of depth and samples-per-leaf values. Select the combination that gives the lowest cross-validation error as the hyperparameters for the final tree.
Classification trees
A classification dataset
The same 22 sites, same predictors (rainfall, temperature), but a categorical response: the site’s biome type:

What is a classification tree?
A classification tree has the same structure as a regression tree (root, internal nodes, branches, and leaves) but makes categorical predictions instead of numerical ones.
For each leaf node, the prediction is the majority class (most common class) among the training observations that reach that leaf.

Check-in

Use the classification tree to predict the biome type for a new site with rainfall = 500 mm and temperature = 15°C.
Check-in

Use the classification tree to predict the biome type for a new site with rainfall = 500 mm and temperature = 15°C.
Decision regions
Just like regression trees, classification trees divide the predictor space into rectangular regions:

Fitting a classification tree
Fitting a classification tree
A classification tree is fitted by:
1. Dividing the predictor space into \(J\) non-overlapping regions \(R_1, \ldots, R_J\).
2. Predicting the majority class among training observations in each region \(R_j\). . . .
As with regression trees, recursive binary splitting is used, but we cannot use RSS for classification. Instead, we measure the impurity of the resulting regions.
We are often interested in the class proportions at each leaf, not just the predicted class. These give a sense of prediction confidence or purity: a leaf where 95% of observations share one class is more reliable than a 60/40 split.
Check-in

Looking at the classification tree, which leaf has the most confident predictions? Which would you say is the most “pure” and why?
Check-in

Which leaf is most pure?
The forest leaf (leftmost, green). All training observations in that region belong to the same class — forest — so the predicted class proportion is 1.0 for forest and 0.0 for all others.
The tree is maximally confident there: no ambiguity in the prediction.
In contrast, the other leaves contain a mix of classes, making their predictions less reliable.
Gini impurity
The most common splitting criterion for classification trees is the Gini impurity index: a measure of how mixed the classes are in a given region.
At each step, recursive binary splitting searches for the variable and threshold that produce the purest child regions.
Entropy is another common impurity measure that behaves similarly and tends to produce very similar trees in practice.
Trees vs. linear models
Trees vs. linear models
Decision trees and linear models capture fundamentally different types of relationships:
Trees naturally capture nonlinear, step-like relationships that linear models cannot fit well.
Linear models perform better when the true relationship is approximately linear and smooth.
Trees vs. linear models: example
Synthetic data where Y depends on temperature in a step-function pattern:

Trees vs. linear models: summary
| Decision Trees | Linear Models | |
|---|---|---|
| Interpretability | High, easy to visualize | Ok, coefficients interpretable |
| Nonlinear relationships | Handles them naturally | Requires feature engineering |
| Prediction accuracy | Often lower | Competitive when relationship is linear |
| Variance | High, sensitive to small changes | Generally lower |
| Variable selection | Implicit (splits on useful variables) | Requires explicit methods (e.g., lasso) |