Assignment 3

Regularization and Decision Trees

Published

June 2, 2026

This assignment covers course topics on regularization, PCA, decision trees, random forests, and high-dimensional data. Task 1 and Task 2 each contribute 45% of the total grade, and Task 3 contributes 10%. Review the rubric for this assignment here.


Submission Instructions

This assignment is submitted as a GitHub repository to Gradescope. Before submitting, make sure you satisfy every item in this checklist:

Resubmissions after the due date that fail to satisfy one of the checks above will be strictly held to the course’s 50%-regrade resubmission policy (see syllabus).

If you have any questions about assignment logistics, please reach out to the instructional team.


How to Submit via GitHub to Gradescope

  1. Fork this repository from the course GitHub organization.
  2. Clone your forked repository to your local machine or the workbench server.
  3. Complete the tasks inside your cloned repository. Do not move files out of the repo.
  4. As you work, commit and push your changes regularly.
  5. When you are finished, make a final push to ensure your most recent work is on GitHub.
  6. On Gradescope, select “GitHub” as your submission method and choose your forked repository and the correct branch (usually main).

Your Gradescope submission will pull directly from whatever is in your GitHub repo at the time of submission, so be sure to submit only once you are completely finished.


How the Otter Grader Works

These notebooks use the otter-grader library to provide instant feedback on your answers as you work. Here is how it works:

Step 1 — Initialize Otter

The very first cell of each notebook initializes the grader. Always run this cell first before anything else:

# Initialize Otter
import otter
grader = otter.Notebook("hw3-task2.ipynb")

Step 2 — Complete the exercises

Fill in code in the cells marked with ... or # Your answer here. Read each question carefully and make sure your variable names match exactly what is asked since the autograder checks specific variable names.

Step 3 — Run individual checks

After completing an exercise, run the grader.check("qN") cell immediately following it. This will tell you right away whether your answer is correct:

  • Passed: You’ll see something like “q3 passed!” If so, you’re good to move on.
  • Failed: You’ll see a description of what went wrong. Re-read the question, inspect your output, and try again.

Stuck? Check that your variable names, data types, and shapes all match what the question asks for. Most failures come from a small mismatch, such as returning a list instead of a DataFrame.
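One way to spot such a mismatch before re-running a check is to print the type and shape of your answer. The sketch below is only an illustration: summarize is a hypothetical helper (it is not part of otter-grader), and answer stands in for whatever variable the question asks you to define.

```python
def summarize(value):
    """Return (type name, shape) for a candidate answer.

    Hypothetical helper for illustration; not part of otter-grader.
    """
    type_name = type(value).__name__
    # NumPy arrays and pandas DataFrames expose a .shape attribute.
    shape = getattr(value, "shape", None)
    # Fall back to the length for plain Python containers such as lists.
    if shape is None and hasattr(value, "__len__"):
        shape = (len(value),)
    return type_name, shape


# Hypothetical answer variable: a plain list where a DataFrame was expected.
answer = [0.12, 0.34, 0.56]
print(summarize(answer))  # ('list', (3,))
```

If the question asks for a DataFrame and the first element printed is list, the type is the likely cause of the failed check.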

Step 4 — Run grader.check_all() before submitting

The very last cell of each notebook runs all checks at once and gives you a summary:

q3  ✅ passed
q5  ✅ passed
q6  ❌ failed
...

You must run this cell and leave its output visible before committing and pushing your notebook. This summary will be used to grade your autograded answers. You will not receive credit for questions that failed in this section. If this cell has not been run, or its output is missing, you may lose points.

  • Do not modify or delete any grader.check() or grader.check_all() cells.
  • Passing all autograder checks does not guarantee full credit on the assignment. Some questions are also graded manually.
  • Always re-run all cells from top to bottom before your final commit to make sure everything still passes in sequence.

AI Policy

If you use generative AI on this assignment, you are expected to adhere to the following course policies:

  • Cultivate understanding: You should be able to fully understand, justify, and explain all the work you submit.
  • 🤔 Question AI outputs: Assume AI-generated answers may be incorrect and verify all information independently.
  • 🚫 Academic integrity: Submitting work you don’t understand or cannot explain will be considered plagiarism, regardless of whether AI use was disclosed.

If there are concerns about AI use in your work, your instructor will ask you to meet and discuss it. If understanding is clearly lacking and this is the first occurrence, you will have the chance to revise and resubmit for 50% of the original maximum grade within two days.


Task 1: Regularization

Predicting Building Heat Load with Ridge, Lasso, and PCA.

Follow the instructions in hw3-task1.ipynb to complete this task. You will practice:

  • Preprocessing data with train/test splits and predictor standardization using StandardScaler
  • Fitting and evaluating an OLS regression model
  • Applying Ridge regression with cross-validated hyperparameter tuning (RidgeCV) and visualizing the coefficient shrinkage path across lambda values
  • Applying Lasso regression with cross-validated tuning (LassoCV) and comparing its coefficient-zeroing behavior to Ridge
  • Comparing OLS, Ridge, and Lasso using test MSE and coefficient bar plots
  • Performing Principal Components Regression (PCR) by fitting PCA on scaled features, selecting the number of components via explained variance (≥90% threshold), and fitting OLS on the projected scores
  • Synthesizing all four models in a final summary table and interpreting which approach best fits the building energy efficiency dataset
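The difference between Ridge shrinkage and Lasso coefficient zeroing can be seen in the simplest possible case: one centered predictor and no intercept. The sketch below is a toy illustration on made-up numbers, not the notebook's method. The function names and the data are assumptions, and the penalty is written in the form stated in the comments, so the lam values do not map one-to-one onto scikit-learn's alpha.

```python
def ridge_coef(x, y, lam):
    """Minimizer of 0.5 * sum((y - b*x)^2) + 0.5 * lam * b^2 for a single predictor."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)


def lasso_coef(x, y, lam):
    """Minimizer of 0.5 * sum((y - b*x)^2) + lam * |b| (soft-thresholding)."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    shrunk = max(abs(sxy) - lam, 0.0)  # exactly zero once lam >= |sxy|
    return shrunk / sxx if sxy >= 0 else -shrunk / sxx


# Made-up, centered toy data (not the building energy dataset).
x = [-1.5, -0.5, 0.0, 0.5, 1.5]
y = [-2.9, -1.2, 0.1, 0.8, 3.2]

for lam in [0, 1, 5, 20]:
    print(lam, round(ridge_coef(x, y, lam), 3), round(lasso_coef(x, y, lam), 3))
```

As lam grows, the Ridge coefficient keeps shrinking toward zero without reaching it, while the Lasso coefficient becomes exactly zero once the penalty exceeds the absolute value of sum(x*y).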

Task 2: Decision Trees and Random Forests

Classifying Potable/Non-Potable Water

Follow the instructions in hw3-task2.ipynb to complete this task. You will practice:

  • Exploring a dataset with class imbalance and missing values, and applying median imputation with stratified train/test splitting
  • Fitting an unpruned decision tree as a baseline
  • Tuning tree depth using 5-fold cross-validation and selecting the optimal depth from a CV accuracy plot
  • Refitting and evaluating the pruned (tuned) decision tree and visualizing its structure, including interpreting root-node splits, feature thresholds, and Gini impurity
  • Fitting a Random Forest (200 trees) and comparing its test accuracy to the single-tree models
  • Visualizing and comparing feature importances (Gini-based) from both the decision tree and random forest
  • Interpreting a confusion matrix for the random forest
  • Building a final model comparison table/bar chart across all three models (unpruned tree, tuned tree, random forest)
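For interpreting the Gini impurity values shown in a tree plot, it can help to compute one by hand. The following is a minimal sketch on made-up labels (0 = non-potable, 1 = potable). The function names and the example split are assumptions for illustration, not taken from the notebook.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def split_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Made-up node with 6 non-potable (0) and 4 potable (1) samples.
parent = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
# A hypothetical threshold sends these samples left and right.
left = [0, 0, 0, 0, 0, 1]
right = [0, 1, 1, 1]

print(round(gini(parent), 3))            # 0.48
print(round(split_gini(left, right), 3))  # 0.317
```

A split is useful when the weighted impurity of the children is lower than the impurity of the parent, as it is here. A pure node, in which every sample has the same class, has impurity 0.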

Task 3: Considerations in High Dimensions

In this task you will explore common problems that arise in high-dimensional datasets, in which the number of features is larger than the number of observations, and use generative AI to aid your understanding of this setting.
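A small illustration of why this setting is difficult: when a model has as many parameters as there are observations, least squares can fit the training data perfectly even if the response is pure noise. The sketch below assumes the smallest such case, two observations and a line with an intercept and a slope, on simulated data. All names and numbers are made up for illustration.

```python
import random

random.seed(0)

# Two training observations of pure noise: y has no real relationship with x.
x = [random.gauss(0, 1) for _ in range(2)]
y = [random.gauss(0, 1) for _ in range(2)]

# With two parameters and two points, the least squares line
# passes through both points exactly.
slope = (y[1] - y[0]) / (x[1] - x[0])
intercept = y[0] - slope * x[0]

train_sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

# Fresh noise drawn the same way shows the perfect fit does not generalize.
x_new = [random.gauss(0, 1) for _ in range(1000)]
y_new = [random.gauss(0, 1) for _ in range(1000)]
test_mse = sum(
    (yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x_new, y_new)
) / len(x_new)

print("training sum of squared errors:", train_sse)
print("test mean squared error:", test_mse)
```

The training error is zero up to floating-point rounding, while the test error is not, so a perfect training fit says nothing about how well the model predicts new data.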

Your final submission will include: a concept from the reading you wanted clarification on, the initial prompt you used to get clarification from a genAI chatbot, and a short reflection on how you used genAI.

  1. Read section 6.4 Considerations in High Dimensions in the course textbook: An Introduction to Statistical Learning with applications in Python. The section covers challenges to fitting regression models and interpreting results on high-dimensional data.

  2. While reading, identify one concept or claim from the section that you did not fully understand.

  3. This step is designed to help you write more detailed prompts. Use a generative AI tool (such as ChatGPT or Claude) to help clarify the concept you did not fully understand. Save your prompt so you can report it in this task. Your prompt should go beyond “explain X to me” and should include at least two of the following elements:

    • Your background (for example: “I am X who knows Y and is learning about Z” or “I am reading X for my course on Y”)
    • The specific source of confusion
    • A requested format or approach (e.g., “use a concrete numerical example” or “create an analogy”)
    • A constraint or comparison (e.g., “explain how Z method differs from Y method” or “avoid matrix notation”)
  4. If the response didn’t fully resolve your confusion, iterate: refine your next prompt and try again.

  5. In a short paragraph (4–6 sentences), reflect on your use of generative AI in this task: which tool did you use, what prompts did you give it, and how many iterations did it take to get it to clarify your question? Did the response match the information given in the textbook? This short reflection should be independently written by you, with no AI assistance.

Your submission should be a Quarto document named hwk3-task3.qmd, saved inside your forked repo, with the following structure:

Section                  | Content
-------------------------|------------------------------------------------------------
Title + intro            | Title + one sentence explaining the content of the document
Concept to clarify       | The concept or claim you sought clarification on
Prompt                   | The initial prompt you wrote to gain understanding
Generative AI reflection | Your answer to step 5

Rubric

  • Steps 2, 3, and 5 and the Title/Intro will each be graded as 1 (complete) or 0 (incomplete).