This lab focuses on the preprocessing stage of the ML workflow. You will apply each step to a dataset of your choice from the suggested list below, with the goal of producing clean, properly preprocessed features ready for modeling. The one concept we will walk through in detail before you begin is pipelines, a scikit-learn tool that keeps your preprocessing and modeling code organized and helps prevent data leakage.
The steps you will follow:
1. Understand the big picture
2. Get the data and do a preliminary exploration
3. Create a representative test set and lock it away
4. Explore the training data to gain insights
5. Consider feature combinations
6. Prepare the training data for ML algorithms (Build your preprocessing pipeline!)
If time permits, you can go on to steps 7-10 of the ML Workflow:
7. Try out different candidate models and estimate accuracy metrics with CV
8. Go back to the drawing board if needed
9. Select a model and fine-tune it
10. Evaluate the model on the test set
Pipelines
What is a pipeline?
Pipeline() is a tool from sklearn that chains together multiple data processing and modeling steps into a single object. The same pipeline can be reused across datasets and models in your notebook, which reduces code repetition and keeps your preprocessing consistent. Pipelines also help prevent data leakage by ensuring that every transformation is fit on your training data only and then applied, unchanged, to any new data.
Why use a Pipeline?
When you preprocess data, there is a tempting shortcut: fit the scaler (or imputer, or encoder) on the entire dataset before splitting. However, doing so leaks test-set statistics into training and produces an overly optimistic performance estimate.
A scikit-learn Pipeline chains transformers into a single object. When you call .fit() on a pipeline, each transformer is fit on training data only and then used to transform it before passing the result to the next step. When you call .transform() on new data, the pipeline applies the already-fit transformers in sequence. This all happens without the test set influencing what was learned.
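To make the leakage point concrete, here is a minimal sketch on made-up numbers (the array below is hypothetical, not taken from any of the lab datasets). It compares the mean a StandardScaler learns when it is fit on all rows with the mean it learns when it is fit on the training rows only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical single feature; the last value is an extreme observation
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [100.0]])

# shuffle=False keeps the last two rows (7 and 100) in the test set for this illustration
X_train, X_test = train_test_split(X, test_size=0.25, shuffle=False)

leaky_scaler = StandardScaler().fit(X)        # sees the test rows: leakage
clean_scaler = StandardScaler().fit(X_train)  # sees the training rows only

print('Mean learned from all rows     :', leaky_scaler.mean_[0])
print('Mean learned from training rows:', clean_scaler.mean_[0])

# The test set is transformed with statistics learned from the training data
X_test_scaled = clean_scaler.transform(X_test)
```

The two means differ because the extreme test value pulls the "leaky" estimate upward. A pipeline gives you the second behavior automatically.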
One Hot Encoder
Before we get started with creating our own pipeline, let's review how to use one-hot encoding on its own:
Machine learning models require numeric inputs. Categorical columns like 'land_use' or 'species' must be converted to numbers before training. The standard approach is one-hot encoding: replace a column with k categories with k binary columns, one per category, where exactly one column is 1 for each row.
Example: A land_use column with three categories becomes:
land_use   forest   urban   wetland
forest     1        0       0
urban      0        1       0
wetland    0        0       1
forest     1        0       0
In code:
Code
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import mean_squared_error

sample = pd.DataFrame({'land_use': ['forest', 'urban', 'wetland', 'forest', 'urban']})

# The default, sparse_output=True, returns a SciPy sparse matrix that is memory efficient
# sparse_output=False returns a NumPy array instead, which we can pass to pandas and inspect
enc = OneHotEncoder(sparse_output=False)
encoded = enc.fit_transform(sample[['land_use']])
pd.DataFrame(encoded, columns=enc.get_feature_names_out())
   land_use_forest  land_use_urban  land_use_wetland
0              1.0             0.0               0.0
1              0.0             1.0               0.0
2              0.0             0.0               1.0
3              1.0             0.0               0.0
4              0.0             1.0               0.0
Building a Numeric Pipeline
The simplest pipeline chains steps in a list of (name, object) tuples:
Code
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),  # fill missing values with column median
    ('scaler', StandardScaler())                    # standardize
])
Each step name (the first element of each tuple) is arbitrary but must be unique within the pipeline.
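As a quick check of how this behaves, the sketch below fits the same two-step pipeline on a tiny made-up column (the values are hypothetical) and then looks up what the imputer learned through named_steps. Note that the missing test value is filled with the median learned from the training data, not from the test data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])

# Hypothetical training and test values for a single numeric feature
X_train = np.array([[1.0], [2.0], [np.nan], [4.0], [9.0]])
X_test = np.array([[np.nan], [3.0]])

# fit_transform learns the median and scaling statistics from X_train only
X_train_scaled = numeric_pipeline.fit_transform(X_train)
# transform reuses those learned statistics on new data
X_test_scaled = numeric_pipeline.transform(X_test)

# Step names give access to each fitted transformer
learned_median = numeric_pipeline.named_steps['imputer'].statistics_[0]
print('Median learned from training data:', learned_median)
```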
Handling Mixed Feature Types with ColumnTransformer()
Real datasets usually have a mix of numerical and categorical columns that need different preprocessing. ColumnTransformer applies different pipelines to different column subsets and concatenates the results:
The output of preprocessor.fit_transform(X_train) is a single numeric array: the scaled numerical columns followed by the one-hot encoded categorical columns, all learned from training data only.
Example: Building a Preprocessing Pipeline
The fake dataset below mimics a simple environmental monitoring table with numerical features (some missing) and a categorical feature. Here is how to build a full preprocessing pipeline, fit it on training data only, and inspect every step.
To use this pipeline, you can simply fit it on your training data and then predict on your test data, just as we have been doing all quarter!
Code
# Fit the pipeline to your training data
pipe.fit(X_train, y_train)

# Make predictions on the test set
y_pred = pipe.predict(X_test)

# Check the MSE
mse = mean_squared_error(y_test, y_pred)
print('Mean Squared Error:', round(mse, 4))
Mean Squared Error: 3150.104
Hyperparameter Tuning with Pipelines
You can also use pipelines with cross-validated hyperparameter search.
To reference a parameter inside a pipeline step, use the pattern stepname__parametername (e.g., model__max_depth targets the max_depth parameter of the step named model).
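The naming pattern can be tried directly with set_params and get_params. This is a small sketch, assuming a two-step pipeline whose final step is named model (the step names are your choice; only the double underscore is fixed).

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', DecisionTreeRegressor(random_state=0)),
])

# 'model__max_depth' = step name 'model' + '__' + parameter name 'max_depth'
pipe.set_params(model__max_depth=3)
print(pipe.get_params()['model__max_depth'])

# List every parameter name that belongs to the 'model' step
model_params = [name for name in pipe.get_params() if name.startswith('model__')]
print(model_params)
```

Printing the names from get_params() is a handy way to confirm the exact keys to use in a search grid.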
Here we swap LinearRegression for a DecisionTreeRegressor and search over tree depth and minimum samples per leaf:
Best params: {'model__max_depth': 2, 'model__min_samples_leaf': 5}
Best CV RMSE: 59.017
The best_estimator_ stored in grid_search is a fully-fitted pipeline with the best hyperparameters. We can use it directly to evaluate on the held-out test set:
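Here is a minimal sketch of the search and the final evaluation. It assumes synthetic data from make_regression rather than the lab's example dataset, and the grid values are illustrative choices, so the best parameters and RMSE it prints will not match the output shown above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: numeric features only)
X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

tree_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', DecisionTreeRegressor(random_state=0)),
])

# Keys follow the stepname__parametername pattern
param_grid = {
    'model__max_depth': [2, 4, 6],
    'model__min_samples_leaf': [1, 5, 10],
}

grid_search = GridSearchCV(
    tree_pipe,
    param_grid,
    cv=5,
    scoring='neg_root_mean_squared_error',  # scikit-learn maximizes scores, so RMSE is negated
)
grid_search.fit(X_train, y_train)

print('Best params:', grid_search.best_params_)
print('Best CV RMSE:', round(-grid_search.best_score_, 3))

# best_estimator_ is the whole pipeline, refit on all of X_train
best_pipe = grid_search.best_estimator_
y_pred = best_pipe.predict(X_test)
test_rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print('Test RMSE:', round(test_rmse, 3))
```

Because the scaler lives inside the pipeline, it is refit within every cross-validation fold, so no validation fold influences the preprocessing.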
Choose one of the datasets below. Each is publicly available, contains a mix of feature types, and has a clear environmental prediction target. Load the data, read the documentation for each column, and then follow the workflow template in the next section.
Option 1: Global Air Pollution
City-level air quality data from ~23,000 locations worldwide, sourced from government monitoring agencies and aggregated by AQI category. Features include country, city, CO AQI value, Ozone AQI value, NO2 AQI value, and PM2.5 AQI value.
Bleaching presence/absence records for 34,846 coral reef observations across 14,405 sites in 93 countries, spanning 1980–2020. Features include sea surface temperature metrics, reef exposure, distance to land, mean turbidity, and cyclone frequency (62 variables total).
Incident-level records of shark–human interactions spanning over a century of global reports. Each row is one incident; features include year, country, location, activity (e.g., surfing, swimming, diving), victim sex and age, injury description, and whether the incident was fatal.
Use this section as a template. Fill in code and answers as you go. Each step either has a scaffold for you to complete or a question to answer.
Step 1: Understand the Big Picture
Question 1
Before touching the data, answer the following:
What is the response variable? Is this a regression or classification problem?
What performance metric will you use, and why is it appropriate?
What is a sensible baseline to beat (e.g., predicting the mean, a simple rule, a published benchmark)?
Are there any domain-specific constraints on errors? Is over-prediction or under-prediction more costly?
Step 2: Preliminary Exploration
Load your dataset and run the cells below.
Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('your_dataset.csv')  # replace with your actual load command
df.info()
Code
df.describe()
Question 2
How many numerical features and how many categorical features are there?
Which columns (if any) have missing values? What percentage of observations are missing in each?
Do the ranges in df.describe() look physically reasonable? Flag anything surprising.
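One possible way to count feature types and missing values is sketched below. The small DataFrame is a hypothetical stand-in so the snippet runs on its own; with your data, skip that part and use your own df.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for your dataset
df = pd.DataFrame({
    'pm25': [12.0, np.nan, 30.5, 8.1],
    'ozone': [40, 35, 51, 44],
    'country': ['US', 'BR', None, 'IN'],
})

# Split columns by dtype
num_cols = df.select_dtypes(include='number').columns.tolist()
cat_cols = df.select_dtypes(exclude='number').columns.tolist()
print('Numerical  :', num_cols)
print('Categorical:', cat_cols)

# Percentage of missing observations per column
missing_pct = (df.isna().mean() * 100).round(1)
print(missing_pct[missing_pct > 0])
```

Keep in mind that dtype alone can mislead: a numeric code column (for example, a site ID) may really be categorical, so check against the column documentation.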
Step 3: Create a representative test set and lock it away
Lock the test set away. Do not use it again until Step 10, when you evaluate your final model.
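A minimal sketch of a representative split is shown here, assuming a classification target with imbalanced classes (the feature and bleached columns are made up). Passing stratify=y asks train_test_split to keep the class proportions the same in both sets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical data: 80 rows of class 0 and 20 rows of class 1
df = pd.DataFrame({
    'feature': rng.normal(size=100),
    'bleached': np.repeat([0, 1], [80, 20]),
})

X = df.drop(columns='bleached')
y = df['bleached']

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,     # hold out 20% of the rows
    stratify=y,        # preserve class proportions in both sets
    random_state=42,   # make the split reproducible
)

print('Share of class 1 in train:', y_train.mean())
print('Share of class 1 in test :', y_test.mean())
```

For a regression target, one common option is to stratify on a binned version of the target or of an important predictor instead; whether that is worthwhile depends on your dataset.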
Step 4: Explore the train data to gain insights
Build your own exploratory analysis on the training set. Explore the following for each variable:
Name
Type (categorical, int/float, bounded/unbounded, text, structured, etc.)
% of missing values
Noisiness and type of noise (outliers, rounding errors, etc.)
Possibly useful for the task?
Type of distribution
Question 3
Which predictors appear most strongly related to the target? You can use corr() to find out. Describe the direction.
Are there too many predictors?
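As a sketch of the corr() approach, the snippet below ranks numeric predictors by the absolute value of their correlation with the target. The training frame, its column names (sst, turbidity, target), and the relationship between them are invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 300

# Hypothetical training data
train = pd.DataFrame({
    'sst': rng.normal(28, 1, n),
    'turbidity': rng.normal(0.1, 0.02, n),
})
train['target'] = 2.0 * train['sst'] - 20.0 * train['turbidity'] + rng.normal(0, 0.5, n)

# Correlation of every numeric predictor with the target, strongest first
corr_with_target = (
    train.corr(numeric_only=True)['target']
    .drop('target')
    .sort_values(key=abs, ascending=False)
)
print(corr_with_target)
```

The sign gives the direction of the relationship. Pearson correlation only captures linear association, so a weak value does not rule out a nonlinear relationship.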
Step 5: Consider Feature Combinations
In this step we consider if there are any combinations of our predictors that could produce a more informative feature. This is not a “one and done” process: as we are developing the model, we may come back and reassess whether there are new features that are worth including. Whenever possible, make sure your decisions are backed up by domain expertise.
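For example, a ratio of two raw columns is sometimes more informative than either column alone. The sketch below uses hypothetical columns (population, area_km2) and a hypothetical helper, add_density, to show one way to apply the same combination to both the training and test sets.

```python
import pandas as pd

# Hypothetical raw features
X_train = pd.DataFrame({'population': [1200, 560, 3100], 'area_km2': [4.0, 2.0, 10.0]})
X_test = pd.DataFrame({'population': [900], 'area_km2': [3.0]})

def add_density(frame):
    """Return a copy of frame with a population-density column added."""
    out = frame.copy()
    out['pop_density'] = out['population'] / out['area_km2']
    return out

# The same row-wise rule is applied to both sets; nothing is learned from the data
X_train = add_density(X_train)
X_test = add_density(X_test)
print(X_train)
```

Because this transformation is computed row by row and learns no statistics, applying it to the test set does not cause leakage. If you prefer to keep it inside your pipeline, scikit-learn's FunctionTransformer can wrap a function like this.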
Step 6: Build Your Preprocessing Pipeline
Identify your numerical and categorical columns, then construct a ColumnTransformer + Pipeline as shown in the review section above. Fit it on the training data only, then use it to transform both X_train and X_test, producing X_train_processed and X_test_processed.
Code
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

num_features = [...]  # fill in
cat_features = [...]  # fill in (empty list if none)

numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(sparse_output=False, handle_unknown='ignore'))
])

preprocessor = ColumnTransformer([
    ('num', numeric_pipeline, num_features),
    ('cat', categorical_pipeline, cat_features)
])

# Fit on the training data only, then reuse the fitted preprocessor on the test data
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

print(f'Processed training shape: {X_train_processed.shape}')
print(f'Processed test shape    : {X_test_processed.shape}')
Question 4
How will you handle missing values in your dataset? Justify your choice of imputation strategy.
If you one-hot encode a categorical column with k categories, how many new columns does it produce?
How many total columns does your processed feature matrix have? Is this different from the number of columns in the raw data?
If you have time, complete steps 7-10. Try to incorporate Step 7 into your Pipeline as well.
Try out different candidate models and estimate accuracy metrics with CV
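One way to fold Step 7 into a pipeline is sketched below: each candidate model is placed after the preprocessor and scored with cross-validation on the training set only. The data, column names (depth_m, site_type), and the two candidate models are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(7)
n = 150

# Hypothetical training data
X_train = pd.DataFrame({
    'depth_m': rng.uniform(1, 30, n),
    'site_type': rng.choice(['reef', 'lagoon'], n),
})
y_train = 4.0 * X_train['depth_m'] + rng.normal(0, 3, n)

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), ['depth_m']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['site_type']),
])

candidates = {
    'linear': LinearRegression(),
    'tree': DecisionTreeRegressor(max_depth=4, random_state=0),
}

cv_rmse = {}
for name, model in candidates.items():
    pipe = Pipeline([('preprocessor', preprocessor), ('model', model)])
    # cross_val_score clones the pipeline and refits it inside every fold
    scores = cross_val_score(
        pipe, X_train, y_train, cv=5, scoring='neg_root_mean_squared_error'
    )
    cv_rmse[name] = -scores.mean()
    print(f'{name}: CV RMSE = {cv_rmse[name]:.3f}')
```

The test set is never touched here; it stays locked away until the final evaluation.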