First Midterm Study Guide

Day-of instructions

The midterm will take place in our usual classroom during class time.
This is a closed-book, individual exam.
No electronics are allowed.
You can bring a 3x5 notecard with study notes on both sides.
You will be provided a calculator for any small computations you may need to do.

General Concepts

1. Be able to abstract from a data scenario:
- 1.1 What the predictors are: \(X_1, \ldots, X_p\)
- 1.2 What the response variable is: \(Y\)
- 1.3 What the observations (data points) are
- 1.4 Whether inference or prediction is the main goal
- 1.5 Whether the problem is a regression or classification task
2. Provide examples of parametric and non-parametric methods and their potential advantages and drawbacks.
3. Identify which accuracy metrics are appropriate for classification and regression applications in a given scenario.
4. For the regression accuracy metrics MSE, \(R^2\), and adjusted \(R^2\), be able to:
- 4.1 Explain what each metric is measuring
- 4.2 Explain how a set of observed data points \((x_i, y_i)\) and their predicted values \(\hat{y}_i = \hat{f}(x_i)\) fit into the formulas
5. Be able to identify false positives/negatives and true positives/negatives given classification results in visual and tabular formats, and create confusion matrices.
6. For the classification accuracy metrics error rate, accuracy, precision, recall, TPR, and FPR, be able to:
- 6.1 Define each metric in terms of TP, FP, TN, and FN
- 6.2 Calculate metrics given confusion matrices
- 6.3 Explain how \(F_1\) is computed from precision and recall and how to interpret it
- 6.4 Discuss which metric might be more appropriate to prioritize in specific applied scenarios
7. Discuss how precision and recall change as a classifier correctly predicts more or fewer true positives.
8. Identify and discuss overfitting in classification and regression scenarios, particularly in relation to test and training errors.
9. Define model bias and variance; compare model results (plots) and discuss their relative bias and variance; explain the bias-variance tradeoff and the challenge it poses for minimizing the test error.
10. Explain what “data leakage” is and identify whether it is occurring in different scenarios.
11. Discuss how class imbalance may affect classification results.
12. Explain how the cross-validation process for estimating test error is implemented, and how it can be used to tune hyperparameters and select among different models.
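
As a study aid for items 6.1 to 6.3, the following is a minimal Python sketch of how the classification metrics follow from the four confusion-matrix counts. The counts (TP = 40, FP = 10, FN = 20, TN = 30) and the function name `classification_metrics` are made up for illustration and are not taken from any exercise in this guide.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute common classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)   # of the predicted positives, how many are truly positive
    recall = tp / (tp + fn)      # of the actual positives, how many were found (same as TPR)
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "precision": precision,
        "recall": recall,
        "fpr": fp / (fp + tn),   # of the actual negatives, how many were flagged positive
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because \(F_1\) is the harmonic mean of precision and recall, it is high only when both are high; here it falls between the two values.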

Methods

13. Simple Linear Regression

13.1 Interpret the coefficients \(\beta_0\) and \(\beta_1\) in the model’s formula \(Y \approx \beta_0 + \beta_1 X\), both abstractly and in specific data scenarios
13.2 Explain in general terms how the coefficients \(\hat{\beta}\) are estimated
13.3 Interpret standard errors and confidence intervals in plain language
13.4 Explain the hypothesis testing process and interpret p-values
13.5 Apply metrics, statistics, and results from a fitted model to solve inference and prediction questions
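
To make 13.2 and the regression metrics more concrete, here is a minimal Python sketch of least squares for one predictor, followed by the MSE and \(R^2\) of the fitted line. The five data points and all function names are hypothetical, chosen only for illustration; the sketch assumes the usual closed-form least squares estimates rather than any particular software used in the course.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (beta0_hat, beta1_hat) minimizing the residual sum of squares."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = sxy / sxx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1


def mse(ys, preds):
    """Mean squared error: average squared gap between observed and predicted."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def r_squared(ys, preds):
    """Proportion of the variability in y explained by the model: 1 - RSS/TSS."""
    y_bar = sum(ys) / len(ys)
    rss = sum((y - p) ** 2 for y, p in zip(ys, preds))
    tss = sum((y - y_bar) ** 2 for y in ys)
    return 1 - rss / tss


# Hypothetical observations (x_i, y_i).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

beta0, beta1 = fit_simple_linear_regression(xs, ys)
preds = [beta0 + beta1 * x for x in xs]
print(f"beta0_hat = {beta0:.2f}, beta1_hat = {beta1:.2f}")
print(f"MSE = {mse(ys, preds):.4f}, R^2 = {r_squared(ys, preds):.4f}")
```

Note that these metrics are computed on the same points used for fitting, so they are training metrics; a test MSE would use held-out observations.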

14. Multiple Linear Regression

14.1 Interpret the coefficients \(\beta_0, \beta_1, \ldots, \beta_p\) in the model’s formula \(Y \approx \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p\), both abstractly and in specific data scenarios
14.2 Explain in general terms how the coefficients \(\hat{\beta}\) are estimated
14.3 For \(p\) predictors and one response variable, discuss the difference between fitting \(p\) separate simple linear regressions and fitting a single multiple linear regression
14.4 Explain the hypothesis testing process and interpret p-values
14.5 Apply metrics, statistics, and results from a fitted model to solve inference and prediction questions
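
For 14.2, the least squares criterion can be written out as follows. This is a sketch of the standard formulation; the notation \(x_{ij}\) (value of predictor \(j\) for observation \(i\)), the design matrix \(\mathbf{X}\) (with a leading column of ones), and the response vector \(\mathbf{y}\) are assumptions introduced here and do not appear elsewhere in this guide.

```latex
% Least squares: choose the coefficients that minimize the residual sum of squares
\[
\mathrm{RSS} = \sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \ldots - \hat{\beta}_p x_{ip} \right)^2
\]
% Matrix form of the minimizer, assuming X^T X is invertible
\[
\hat{\beta} = \left( \mathbf{X}^{\top} \mathbf{X} \right)^{-1} \mathbf{X}^{\top} \mathbf{y}
\]
```

With \(p = 1\) this reduces to the simple linear regression estimates.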

15. K-Nearest Neighbors Classifier

15.1 Explain the procedure used to get a classification using KNN
15.2 Relate the K hyperparameter to variance, bias, and potential overfitting
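
The KNN procedure in 15.1 can be sketched in a few lines of Python. The training points, the class names "A" and "B", and the function name `knn_predict` are hypothetical; the sketch assumes Euclidean distance and a simple majority vote, which is one common choice.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair each training point's distance to the query with its label.
    neighbors = sorted(
        ((math.dist(point, query), label)
         for point, label in zip(train_points, train_labels)),
        key=lambda pair: pair[0],
    )
    # Count the labels of the k closest points and return the most common one.
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training data: a cluster of "A", a cluster of "B",
# and one "B" point sitting inside the "A" cluster (a noisy observation).
train_points = [(1, 1), (1, 2), (2, 1), (2.2, 2.2), (6, 6), (7, 6), (6, 7)]
train_labels = ["A", "A", "A", "B", "B", "B", "B"]

query = (2, 2)
print("K = 1:", knn_predict(train_points, train_labels, query, k=1))
print("K = 3:", knn_predict(train_points, train_labels, query, k=3))
```

With \(K = 1\) the prediction follows the single noisy neighbor, while with \(K = 3\) the surrounding points outvote it, which illustrates how small \(K\) makes the classifier more sensitive to individual training points.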

16. Logistic Regression

16.1 Explain the process of using a logistic regression model for binary classification: model \(p(X)\) with the logistic function, fit it to the training data by maximum likelihood estimation, and, once \(p(X)\) has been fitted, select a classification threshold
16.2 Interpret how the sign of the coefficients relates to the probability of belonging to the positive class
16.3 Explain the hypothesis testing process and interpret p-values
16.4 Apply metrics, statistics, and results from a fitted model to solve inference and prediction questions
16.5 Understand how an ROC curve is constructed and rank models according to their ROC curves and AUCs
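
The steps in 16.1 and 16.5 can be illustrated with a short Python sketch: a logistic model turns a predictor value into a probability, a threshold turns the probability into a class label, and sweeping the threshold traces out the ROC curve. The coefficients (\(\hat{\beta}_0 = -3.5\), \(\hat{\beta}_1 = 2.0\)), the six data points, and all function names are assumptions made up for illustration; the coefficients are simply given rather than fitted by maximum likelihood.

```python
import math


def predict_proba(x, beta0, beta1):
    """Logistic model: p(x) = 1 / (1 + exp(-(beta0 + beta1 * x)))."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))


def classify(prob, threshold=0.5):
    """Predict the positive class (1) when the probability reaches the threshold."""
    return 1 if prob >= threshold else 0


def roc_points(probs, labels):
    """(FPR, TPR) pairs obtained by lowering the threshold through every score."""
    positives = sum(labels)
    negatives = len(labels) - positives
    # Start with a threshold no score can reach, so nothing is predicted positive.
    thresholds = [float("inf")] + sorted(set(probs), reverse=True)
    points = []
    for t in thresholds:
        tp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 1)
        fp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical coefficients and data (1 = positive class).
beta0, beta1 = -3.5, 2.0
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
labels = [0, 0, 1, 0, 1, 1]

probs = [predict_proba(x, beta0, beta1) for x in xs]
predictions = [classify(p, threshold=0.5) for p in probs]
points = roc_points(probs, labels)
print("predicted labels:", predictions)
print("ROC points:", [(round(f, 2), round(t, 2)) for f, t in points])
print("AUC:", round(auc(points), 3))
```

Because \(\hat{\beta}_1\) is positive here, the predicted probability of the positive class increases with \(x\); a negative coefficient would make it decrease.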

Practice Scenarios

Please read

These are examples of the kinds of questions and the format the exam will have; the exam will not necessarily cover these exact topics.
The exam will have two scenarios (10 questions each) in which you are requested to interpret results and discuss statements.
The exam will cover all models discussed in the course, including but not limited to those in these practice exercises.

✅ Solutions to practice exercises

The answers presented are quite comprehensive, so they can help you study and clarify any misconceptions.

Practice 1 — Salmon Survival Study

A team is studying the relationship between ocean conditions and juvenile salmon survival in a coastal estuary. They collect data from 45 monitoring sites and are interested in understanding whether ocean temperature can predict how many juvenile salmon survive. The following variables are recorded at each site:

Sea surface temperature (SST): water temperature (°C) at each monitoring site
Juvenile salmon survival rate: percentage (%) of juvenile salmon that survived to a given life stage at each site

Q1. What is the predictor variable in this study? What is the response variable?

Q2. Is this a regression or a classification problem? Justify your answer in one sentence.

Q3. Is this a supervised or unsupervised learning task? Justify your answer.

Before fitting a model, the team is planning to use the following workflow:

1. Collect all 45 observations and split them into a training set (80%) and a test set (20%).
2. Fit three candidate models on the training set: simple linear regression with SST, simple linear regression with log(SST), and a polynomial regression.
3. Evaluate all three models on the test set and choose the one with the lowest test MSE.
4. Report the test MSE of the chosen model as its performance estimate.

Q4. Identify the problem in this workflow. How can it be corrected?

The team fits a simple linear regression model on the training set:

\[\widehat{\text{survival rate}} = \hat{\beta}_0 + \hat{\beta}_1 \times \text{SST}\]

The model output and summary statistics are shown below.

	Estimate	Std. Error	\(t\)-statistic	\(p\)-value
Intercept (\(\hat{\beta}_0\))	82.4	6.3	13.08	< 0.001
SST (\(\hat{\beta}_1\))	−3.1	0.8	−3.88	< 0.001

95% confidence interval for \(\beta_1\): \([-4.7,\ -1.5]\)
\(R^2 = 0.25\)

Q5. Interpret the slope coefficient \(\hat{\beta}_1 = -3.1\) in the context of this study.

Q6. Use the \(p\)-value to draw a conclusion about whether SST is a statistically significant predictor of survival rate.

Q7. One team member states:

“An \(R^2\) of 0.25 means the model is incorrect 75% of the time.”

Is this statement true or false? Explain your reasoning.

Q8. Another team member suggests:

“Let’s calculate the AUC to get a metric on the model’s performance.”

Is AUC an appropriate metric here? Explain.

The team expands their model to include two additional predictors: salinity (parts per thousand) and river discharge (m³/s):

\[\widehat{\text{survival rate}} = \hat{\beta}_0 + \hat{\beta}_1 \times \text{SST} + \hat{\beta}_2 \times \text{salinity} + \hat{\beta}_3 \times \text{discharge}\]

Q9. Interpret what the coefficient for salinity represents in this multiple regression model.

The expanded model produces the following results:

	Estimate	Std. Error	\(t\)-statistic	\(p\)-value
Intercept	78.1	7.2	10.85	< 0.001
SST	−2.8	0.9	−3.11	0.003
Salinity	0.4	0.6	0.67	0.510
Discharge	1.2	0.3	4.00	< 0.001

Q10. Based on the p-value for salinity, what conclusion do you draw about its relationship with survival rate? Explain.

Practice 2 — Soil Carbon Classification

A team of soil scientists is building a classifier to label soil samples according to carbon status. The samples are labeled as High Carbon or Low Carbon based on two soil measurements:

Bulk density index (\(X_1\)): normalized bulk density of the soil sample, on a 0–10 scale
Clay content index (\(X_2\)): normalized clay content of the soil sample, on a 0–10 scale

The team will be using High Carbon as the positive class.

Soil samples flagged as High Carbon will be published in a public database and used to certify land for carbon credit markets. Incorrectly certifying a Low Carbon site as High Carbon would undermine the program’s credibility and have financial consequences.

Q1. Which metric should the team prioritize: recall or precision? Explain why.

The team applies their KNN classifier (\(K = 3\)) to a test set of 37 soil samples.

The figure below shows the KNN decision boundary (\(K = 3\)) fitted on the training data. Four test points (T1, T2, T3, T4) have been plotted. Their true labels are shown in the table.

Q2. Explain to the team what a false positive is in the context of this problem and identify one from the test points in the plot.

Test Point	True Label
T1	Low Carbon
T2	Low Carbon
T3	High Carbon
T4	High Carbon

The confusion matrix below shows the results.

	Predicted: High	Predicted: Low
Actual: High	9	3
Actual: Low	5	20

Q3. Calculate the False Positive Rate (FPR). Explain in plain language what this value means. Use the class names, not “positive” and “negative.”

Q4. The team considers using \(K = 1\) versus \(K = 9\). Which value of \(K\) is more likely to overfit the training data? Explain using the concepts of model flexibility, bias, and variance.

Instead of KNN, the team decides to fit a logistic regression model to predict carbon status.

Q5. Once the logistic regression model is fitted, what additional decision must the team make before it can output class labels? What are the tradeoffs of setting this value too high or too low in the context of the carbon certification program?

Q6. The coefficient for bulk density index (\(X_1\)) is \(-0.9\). A teammate says:

“This means sites with higher bulk density are less likely to be classified as High Carbon, given all else is held fixed.”

Are they correct? Explain.

Q7. Someone else suggests:

“When we apply \(k\)-fold cross-validation to evaluate a single model, we end up with \(k\) different intermediate trained models. We need to choose one of these to actually use to get the final predictions.”

Is this statement true or false? Explain.