Assignment 3
Regularization and Decision Trees
This assignment covers topics from the course on regularization, PCA, decision trees, random forests, and high-dimensional data. Task 1 and Task 2 each contribute 45% of the total grade, and Task 3 contributes 10%. Review the rubric for this assignment here.
Submission Instructions
How to Submit via GitHub to Gradescope
How the Otter Grader Works
AI Policy
Task 1: Regularization
Predicting Building Heat Load with Ridge, Lasso, and PCA.
Follow the instructions in hw3-task1.ipynb to complete this task. You will practice:
- Preprocessing data with train/test splits and predictor standardization using `StandardScaler`
- Fitting and evaluating an OLS regression model
- Applying Ridge regression with cross-validated hyperparameter tuning (`RidgeCV`) and visualizing the coefficient shrinkage path across lambda values
- Applying Lasso regression with cross-validated tuning (`LassoCV`) and comparing its coefficient-zeroing behavior to Ridge
- Comparing OLS, Ridge, and Lasso using test MSE and coefficient bar plots
- Performing Principal Components Regression (PCR) by fitting PCA on scaled features, selecting the number of components via explained variance (≥90% threshold), and fitting OLS on the projected scores
- Synthesizing all four models in a final summary table and interpreting which approach best fits the building energy efficiency dataset
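For orientation, the workflow in the list above can be sketched roughly as follows. This is a minimal sketch on synthetic data, not the notebook's solution: the dataset from `make_regression`, the `alphas` grid, the split sizes, and all variable names are assumptions made for illustration. Note that scikit-learn calls the penalty strength `alpha` rather than lambda.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LassoCV, LinearRegression, RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 8 predictors, 3 of them informative.
X, y = make_regression(n_samples=300, n_features=8, n_informative=3,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

alphas = np.logspace(-3, 3, 50)  # candidate penalty strengths (assumed grid)
models = {
    "OLS": LinearRegression(),
    "Ridge": RidgeCV(alphas=alphas),
    "Lasso": LassoCV(alphas=alphas, cv=5, random_state=0),
}
test_mse = {}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    test_mse[name] = mean_squared_error(y_test, model.predict(X_test_s))

# PCR: keep the fewest components whose cumulative explained variance >= 90%.
pca_full = PCA().fit(X_train_s)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)
n_components = int(np.searchsorted(cum_var, 0.90) + 1)
pca = PCA(n_components=n_components).fit(X_train_s)
pcr = LinearRegression().fit(pca.transform(X_train_s), y_train)
test_mse["PCR"] = mean_squared_error(
    y_test, pcr.predict(pca.transform(X_test_s)))

for name, mse in test_mse.items():
    print(f"{name}: test MSE = {mse:.2f}")
print("PCR components kept:", n_components)
```

The scaler and PCA are fit on the training split only, so no information from the test split leaks into the test MSE.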
Task 2: Decision Trees and Random Forests
Classifying Potable/Non-Potable Water
Follow the instructions in hw3-task2.ipynb to complete this task. You will practice:
- Exploring a dataset with class imbalance and missing values, and applying median imputation with stratified train/test splitting
- Fitting an unpruned decision tree as a baseline
- Tuning tree depth using 5-fold cross-validation and selecting the optimal depth from a CV accuracy plot
- Refitting and evaluating the pruned (tuned) decision tree and visualizing its structure, including interpreting root-node splits, feature thresholds, and Gini impurity
- Fitting a Random Forest (200 trees) and comparing its test accuracy to the single-tree models
- Visualizing and comparing feature importances (Gini-based) from both the decision tree and random forest
- Interpreting a confusion matrix for the random forest
- Building a final model comparison table/bar chart across all three models (unpruned tree, tuned tree, random forest)
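The steps above can be sketched roughly as follows. This is a minimal sketch on synthetic data, not the notebook's solution: the dataset from `make_classification`, the 5% missing-value rate, the depth range 1 to 10, and all variable names are assumptions made for illustration. Only the 5-fold cross-validation and the 200 trees come from the task description.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (assumption): roughly a 60/40 class split.
X, y = make_classification(n_samples=600, n_features=9, n_informative=4,
                           weights=[0.6, 0.4], random_state=0)
# Knock out about 5% of entries at random to mimic missing values.
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

# Stratify so both splits keep the original class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Learn medians on the training split only, then fill both splits.
imputer = SimpleImputer(strategy="median").fit(X_train)
X_train_i = imputer.transform(X_train)
X_test_i = imputer.transform(X_test)

# Choose max_depth by mean 5-fold CV accuracy (depth range is an assumption).
depths = range(1, 11)
cv_acc = [
    cross_val_score(DecisionTreeClassifier(max_depth=d, random_state=0),
                    X_train_i, y_train, cv=5, scoring="accuracy").mean()
    for d in depths
]
best_depth = depths[int(np.argmax(cv_acc))]

models = {
    "unpruned tree": DecisionTreeClassifier(random_state=0),
    "tuned tree": DecisionTreeClassifier(max_depth=best_depth, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}
test_acc = {}
for name, model in models.items():
    model.fit(X_train_i, y_train)
    test_acc[name] = accuracy_score(y_test, model.predict(X_test_i))
    print(f"{name}: test accuracy = {test_acc[name]:.3f}")

rf = models["random forest"]
print("best depth:", best_depth)
print("RF Gini importances:", np.round(rf.feature_importances_, 3))
print("RF confusion matrix:\n", confusion_matrix(y_test, rf.predict(X_test_i)))
```

With imbalanced classes, accuracy alone can hide poor performance on the minority class, which is why the confusion matrix is printed alongside it.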
Task 3: Considerations in High Dimensions
In this task, you will explore common problems that arise in high-dimensional datasets, in which the number of features is larger than the number of observations, and use generative AI to aid your understanding of this setting.
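One such problem can be illustrated with a small sketch. The sizes `n = 20` and `p = 50`, the pure-noise data, and all variable names are assumptions chosen for illustration. When there are more features than observations, least squares can reproduce the training responses exactly even though the predictors carry no information about the response.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n, p = 20, 50  # more features than observations (assumed sizes)

# Predictors and response are independent noise: there is nothing to learn.
X_train, y_train = rng.normal(size=(n, p)), rng.normal(size=n)
X_test, y_test = rng.normal(size=(n, p)), rng.normal(size=n)

ols = LinearRegression().fit(X_train, y_train)
train_r2 = r2_score(y_train, ols.predict(X_train))
test_r2 = r2_score(y_test, ols.predict(X_test))

print(f"training R^2 = {train_r2:.3f}")
print(f"test R^2     = {test_r2:.3f}")
```

The training R^2 is essentially 1 because the model interpolates the training data, while the test R^2 is typically far lower. A perfect training fit therefore says nothing about predictive performance in this setting.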
Your final submission will include: a concept from the reading you wanted clarification on, the initial prompt you used to get clarification from a genAI chatbot, and a short reflection on how you used genAI.
1. Read section 6.4, Considerations in High Dimensions, in the course textbook, An Introduction to Statistical Learning with Applications in Python. The section covers the challenges of fitting regression models and interpreting results on high-dimensional data.
2. While reading, identify one concept or claim from the section that you did not fully understand.
3. This step is designed to help you write more detailed prompts. Use a generative AI tool (such as ChatGPT or Claude) to help clarify the concept you did not fully understand. Save your prompt so you can report it in this task. Your prompt should go beyond “explain X to me” and should include at least two of the following elements:
   - Your background (for example: “I am X who knows Y and is learning about Z” or “I am reading X for my course on Y”)
   - The specific source of confusion
   - A requested format or approach (e.g., “use a concrete numerical example” or “create an analogy”)
   - A constraint or comparison (e.g., “explain how Z method differs from Y method” or “avoid matrix notation”)
4. If the response didn’t fully resolve your confusion, iterate: refine your next prompt and try again.
5. In a short paragraph (4–6 sentences), reflect on your use of generative AI in this task: which tool did you use, what prompts did you give it, and how many iterations did it take to clarify your question? Did the response match the information given in the textbook? This short reflection should be written independently by you, with no AI assistance.
Your submission should be a Quarto document named hwk3-task3.qmd (inside your forked repo) with the following structure:
| Section | Content |
|---|---|
| Title + intro | Title + one sentence explaining the content of the document |
| Concept to clarify | The concept or claim that you sought clarification on |
| Prompt | The initial prompt you wrote to gain understanding |
| Generative AI reflection | Answer to step 5 |
Rubric
- Steps 2, 3, and 5 and the Title + intro will each be graded as 1 (complete) or 0 (incomplete).