Pipelines and ML Workflows

Overview

This lab focuses on the preprocessing stage of the ML workflow. You will apply each step to a dataset of your choice from the suggested list below, with the goal of producing clean, properly preprocessed features ready for modeling. The one concept we will walk through in detail before you begin is pipelines, a scikit-learn tool that keeps your preprocessing and modeling code clean and helps prevent data leakage.

The steps you will follow:

  1. Understand the big picture
  2. Get the data and do a preliminary exploration
  3. Create a representative test set and lock it away
  4. Explore the training data to gain insights
  5. Consider feature combinations
  6. Prepare train data for ML algorithms (Build your preprocessing pipeline!)

If time permits, you can go on to steps 7-10 of the ML Workflow:

  7. Try out different candidate models and estimate accuracy metrics with CV
  8. Go back to the drawing board if needed
  9. Select a model and fine-tune it
  10. Evaluate model on test set

Pipelines

What is a pipeline?

Pipeline() is a tool from sklearn that chains multiple data processing and modeling steps into a single object. The same pipeline can be reused across datasets and models in your notebook, which reduces code repetition and keeps preprocessing consistent. Pipelines also help prevent data leakage by ensuring that transformations are fit on your training data only and then applied, unchanged, to new data.

Why use a Pipeline?

When you preprocess data, there is a tempting shortcut: fit the scaler (or imputer, or encoder) on the entire dataset before splitting. However, doing so leaks test-set statistics into training and produces an overly optimistic performance estimate.

A scikit-learn Pipeline chains transformers into a single object. When you call .fit() on a pipeline, each transformer is fit on training data only and then used to transform it before passing the result to the next step. When you call .transform() on new data, the pipeline applies the already-fit transformers in sequence. This all happens without the test set influencing what was learned.
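
To make the leakage concrete, here is a minimal sketch on a synthetic one-column array (the data and variable names are made up for illustration). It compares a scaler fit on every row with one fit on the training rows only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(loc=10, scale=3, size=(100, 1))

X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# Leaky: the scaler's mean and standard deviation include the test rows
leaky_scaler = StandardScaler().fit(X)

# Correct: statistics are learned from the training rows only
clean_scaler = StandardScaler().fit(X_train)

print("Mean learned from all rows:     ", leaky_scaler.mean_[0])
print("Mean learned from training rows:", clean_scaler.mean_[0])

# The test rows are transformed with the training statistics
X_test_scaled = clean_scaler.transform(X_test)
```

The two means differ slightly. That gap is test-set information that the leaky scaler has absorbed; a pipeline avoids it because fitting only ever sees the data you pass to .fit().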

One Hot Encoder

Before we create our own pipeline, let's review how to use one-hot encoding on its own.

Machine learning models require numeric inputs. Categorical columns like 'land_use' or 'species' must be converted to numbers before training. The standard approach is one-hot encoding: replace a column with k categories with k binary columns, one per category, where exactly one column is 1 for each row.

Example: A land_use column with three categories becomes:

land_use   forest   urban   wetland
forest     1        0       0
urban      0        1       0
wetland    0        0       1
forest     1        0       0

In code:

Code
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import mean_squared_error

sample = pd.DataFrame({
    'land_use': ['forest', 'urban', 'wetland', 'forest', 'urban']
})

# The default, sparse_output=True, returns a memory-efficient SciPy sparse matrix.
# sparse_output=False returns a dense NumPy array, which is easier to pass to pandas and inspect.
enc = OneHotEncoder(sparse_output=False)
encoded = enc.fit_transform(sample[['land_use']])
pd.DataFrame(encoded, columns=enc.get_feature_names_out())
land_use_forest land_use_urban land_use_wetland
0 1.0 0.0 0.0
1 0.0 1.0 0.0
2 0.0 0.0 1.0
3 1.0 0.0 0.0
4 0.0 1.0 0.0

Building a Numeric Pipeline

The simplest pipeline chains steps in a list of (name, object) tuples:

Code
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),  # fill missing values with column median
    ('scaler',  StandardScaler())                   # standardize to zero mean and unit variance
])

Each step name (the first element of each tuple) is arbitrary but must be unique within the pipeline.
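
As a quick check of what this pipeline does, the sketch below runs it on a toy column with one missing value (the array is made up for illustration) and looks up a fitted step by its name.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler',  StandardScaler())
])

# Toy column with one missing value
X_toy = np.array([[1.0], [np.nan], [3.0], [5.0]])

X_out = numeric_pipeline.fit_transform(X_toy)

# Fitted steps can be looked up by the name given in the tuple
print("Learned median:", numeric_pipeline.named_steps['imputer'].statistics_)
print("Scaled column: ", X_out.ravel())
```

The missing entry is filled with the median of the observed values (3.0), and the filled column is then standardized, so the two values equal to 3.0 map to 0.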


Handling Mixed Feature Types with ColumnTransformer()

Real datasets usually have a mix of numerical and categorical columns that need different preprocessing. ColumnTransformer applies different pipelines to different column subsets and concatenates the results:

Code
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

num_features = ['temp', 'humidity', 'wind_speed']
cat_features = ['land_use']

categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(sparse_output=False))
])

preprocessor = ColumnTransformer([
    ('num', numeric_pipeline, num_features),
    ('cat', categorical_pipeline, cat_features)
])

The output of preprocessor.fit_transform(X_train) is a single numeric array: the scaled numerical columns followed by the one-hot encoded categorical columns, all learned from training data only.


Example: Building a Preprocessing Pipeline

The fake dataset below mimics a simple environmental monitoring table with numerical features (some missing) and a categorical feature. Here is how to build a full preprocessing pipeline, fit it on training data only, and inspect every step.

Code
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression


np.random.seed(42)
n = 120

fake_env = pd.DataFrame({
    'temperature': np.random.normal(15, 5, n),
    'humidity':    np.random.uniform(30, 90, n),
    'wind_speed':  np.random.exponential(5, n),
    'land_use':    np.random.choice(['forest', 'urban', 'wetland', 'grassland'], n),
    'aqi':         np.random.randint(20, 200, n)
})

# Introduce realistic missingness
fake_env.loc[np.random.choice(n, 12, replace=False), 'humidity']   = np.nan
fake_env.loc[np.random.choice(n, 6,  replace=False), 'wind_speed'] = np.nan
fake_env.loc[np.random.choice(n, 4,  replace=False), 'land_use']   = np.nan

print(fake_env.shape)
print(fake_env.isnull().sum())
fake_env.head()
(120, 5)
temperature     0
humidity       12
wind_speed      6
land_use        4
aqi             0
dtype: int64
temperature humidity wind_speed land_use aqi
0 17.483571 44.258253 6.052753 forest 77
1 14.308678 NaN 7.943097 wetland 167
2 18.238443 52.066988 11.036617 forest 80
3 22.615149 67.938350 2.062412 wetland 146
4 13.829233 68.011783 2.354684 forest 124

We split first so the pipeline is only ever fit on training data:

Code
X = fake_env.drop(columns='aqi')
y = fake_env['aqi']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Input columns:", list(X_train.columns))
print(f"X_train shape before preprocessing: {X_train.shape}")
Input columns: ['temperature', 'humidity', 'wind_speed', 'land_use']
X_train shape before preprocessing: (96, 4)

Now we define the feature lists and build the pipeline. X_train has 4 columns: 3 numeric and 1 categorical.

Code
num_features = ['temperature', 'humidity', 'wind_speed']
cat_features = ['land_use']

numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler',  StandardScaler())
])

categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(sparse_output=False))
])

preprocessor = ColumnTransformer([
    ('num', numeric_pipeline, num_features),
    ('cat', categorical_pipeline, cat_features)
])

X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed  = preprocessor.transform(X_test)

print(f"X_train shape after preprocessing: {X_train_processed.shape}")
print(f"X_test  shape after preprocessing: {X_test_processed.shape}")
X_train shape after preprocessing: (96, 7)
X_test  shape after preprocessing: (24, 7)

Why 7 columns? The 3 numeric columns stay as 3 scaled columns. land_use has 4 categories, each becoming its own binary column. Total: 3 + 4 = 7.

Column names after preprocessing

By default, ColumnTransformer returns a plain NumPy array, so the column names are not carried through automatically. We can reconstruct them:

Code
num_out_cols = num_features
cat_out_cols = list(
    preprocessor.named_transformers_['cat']
                .named_steps['encoder']
                .get_feature_names_out(cat_features)
)

all_cols = num_out_cols + cat_out_cols
print("Output column names:", all_cols)
Output column names: ['temperature', 'humidity', 'wind_speed', 'land_use_forest', 'land_use_grassland', 'land_use_urban', 'land_use_wetland']

We can then wrap the processed array in a DataFrame to make it easy to inspect:

Code
X_train_df = pd.DataFrame(X_train_processed, columns=all_cols)
X_train_df.head()
temperature humidity wind_speed land_use_forest land_use_grassland land_use_urban land_use_wetland
0 0.024230 0.329350 -0.121396 0.0 1.0 0.0 0.0
1 0.412646 0.075692 -0.995871 0.0 0.0 1.0 0.0
2 -0.460885 -1.208293 -0.521291 0.0 0.0 0.0 1.0
3 -0.059091 1.162086 -0.499985 0.0 0.0 0.0 1.0
4 0.244386 -1.323940 -0.947675 0.0 0.0 1.0 0.0

Let’s compare this to our original dataframe to see what our pipeline achieved:

Code
fake_env.head()
temperature humidity wind_speed land_use aqi
0 17.483571 44.258253 6.052753 forest 77
1 14.308678 NaN 7.943097 wetland 167
2 18.238443 52.066988 11.036617 forest 80
3 22.615149 67.938350 2.062412 wetland 146
4 13.829233 68.011783 2.354684 forest 124

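Notice that temperature, humidity, and wind_speed are now standardized, missing values have been imputed, and land_use has been split into four binary columns (one for each category). The rows do not line up one-to-one with fake_env.head(), because train_test_split shuffles the rows and the processed DataFrame gets a fresh index; compare the columns rather than individual rows.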

Build models with pipelines

Above we used Pipeline() only for preprocessing steps, but we can also use it to build models and tune hyperparameters.

Code
numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler',  StandardScaler())
])

categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(sparse_output=False))
])

preprocessor = ColumnTransformer([
    ('num', numeric_pipeline, num_features),
    ('cat', categorical_pipeline, cat_features)
])


pipe = make_pipeline(preprocessor, LinearRegression())
pipe  # click on the diagram below to see the details of each step
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['temperature', 'humidity',
                                                   'wind_speed']),
                                                 ('cat',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('encoder',
                                                                   OneHotEncoder(sparse_output=False))]),
                                                  ['land_use'])])),
                ('linearregression', LinearRegression())])

To use this pipeline, simply fit it on your training data and then predict on your test data, just as we have been doing all quarter!

Code
# Fit the pipeline to your training data
pipe.fit(X_train, y_train)

# Make predictions on the test set
y_pred = pipe.predict(X_test)

# Compute the mean squared error (MSE) on the test set
mse = mean_squared_error(y_test, y_pred)
print('Mean Squared Error:', round(mse, 4))
Mean Squared Error: 3150.104

Hyperparameter Tuning with Pipelines

You can also use pipelines with cross-validated hyperparameter search.

To reference a parameter inside a pipeline step, use the pattern stepname__parametername (e.g., model__max_depth targets the max_depth parameter of the step named model).
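
If you are unsure which names are available, get_params() lists them. The sketch below uses a small stand-alone pipeline (demo_pipe is a made-up name for illustration) to show the keys for a step named model and that set_params() accepts the same names.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

demo_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', DecisionTreeRegressor(random_state=42))
])

# Every tunable parameter appears as a key in get_params()
model_keys = [k for k in demo_pipe.get_params() if k.startswith('model__')]
print(model_keys)

# The same names work with set_params()
demo_pipe.set_params(model__max_depth=4)
print(demo_pipe.named_steps['model'].max_depth)
```

Nested steps chain the same way: in a pipeline whose first step is named preprocessor and wraps the ColumnTransformer from this lab, a key such as preprocessor__num__imputer__strategy would reach the numeric imputer's strategy.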

Here we swap LinearRegression for a DecisionTreeRegressor and search over tree depth and minimum samples per leaf:

Code
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

tree_pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('model', DecisionTreeRegressor(random_state=42))
])

param_grid = {
    'model__max_depth': [2, 4, 6, 8, None],
    'model__min_samples_leaf': [1, 5, 10]
}

grid_search = GridSearchCV(tree_pipe, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

print('Best params:', grid_search.best_params_)
print('Best CV RMSE:', round((-grid_search.best_score_) ** 0.5, 4))
Best params: {'model__max_depth': 2, 'model__min_samples_leaf': 5}
Best CV RMSE: 59.017

The best_estimator_ stored in grid_search is a fully fitted pipeline: by default, GridSearchCV refits it on the entire training set using the best hyperparameters. We can use it directly to evaluate on the held-out test set:

Code
best_pipe = grid_search.best_estimator_
y_pred_tree = best_pipe.predict(X_test)

test_mse = mean_squared_error(y_test, y_pred_tree)
print('Test RMSE (best decision tree pipeline):', round(test_mse ** 0.5, 4))
Test RMSE (best decision tree pipeline): 51.7786

The Datasets

Choose one of the datasets below. Each is publicly available, contains a mix of feature types, and has a clear environmental prediction target. Load the data, read the documentation for each column, and then follow the workflow template in the next section.

Option 1: Global Air Pollution

City-level air quality data from ~23,000 locations worldwide, sourced from government monitoring agencies and aggregated by AQI category. Features include country, city, CO AQI value, Ozone AQI value, NO2 AQI value, and PM2.5 AQI value.

  • Access: Download from Kaggle

Option 2: Coral Reef Bleaching

Bleaching presence/absence records for 34,846 coral reef observations across 14,405 sites in 93 countries, spanning 1980–2020. Features include sea surface temperature metrics, reef exposure, distance to land, mean turbidity, and cyclone frequency (62 variables total).

  • Access: Download from Kaggle

Option 3: Global Shark Attacks

Incident-level records of shark–human interactions spanning over a century of global reports. Each row is one incident; features include year, country, location, activity (e.g., surfing, swimming, diving), victim sex and age, injury description, and whether the incident was fatal.


Creating a Workflow of your own

Use this section as a template. Fill in code and answers as you go. Each step either has a scaffold for you to complete or a question to answer.


Step 1: Understand the Big Picture

Question 1

Before touching the data, answer the following:

  1. What is the response variable? Is this a regression or classification problem?
  2. What performance metric will you use, and why is it appropriate?
  3. What is a sensible baseline to beat (e.g., predicting the mean, a simple rule, a published benchmark)?
  4. Are there any domain-specific constraints on errors? Is over-prediction or under-prediction more costly?
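
For item 3, one possible baseline for a regression target is to always predict the training mean. The sketch below does this with scikit-learn's DummyRegressor on made-up target values; the variable names are placeholders, not part of the lab data.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Made-up target values, for illustration only
y_train_demo = np.array([40.0, 55.0, 70.0, 95.0])
y_valid_demo = np.array([50.0, 80.0])

# DummyRegressor ignores the features, so a placeholder column is enough
X_train_demo = np.zeros((len(y_train_demo), 1))
X_valid_demo = np.zeros((len(y_valid_demo), 1))

baseline = DummyRegressor(strategy='mean').fit(X_train_demo, y_train_demo)
baseline_pred = baseline.predict(X_valid_demo)

baseline_rmse = mean_squared_error(y_valid_demo, baseline_pred) ** 0.5
print("Baseline prediction:", baseline_pred[0])
print("Baseline RMSE:", round(baseline_rmse, 3))
```

For a classification target, DummyClassifier(strategy='most_frequent') plays the same role. Any model you build later should beat this number to be worth its complexity.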

Step 2: Preliminary Exploration

Load your dataset and run the cells below.

Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('your_dataset.csv')  # replace with your actual load command
df.info()
Code
df.describe()

Question 2
  1. How many numerical features and how many categorical features are there?
  2. Which columns (if any) have missing values? What percentage of observations are missing in each?
  3. Do the ranges in df.describe() look physically reasonable? Flag anything surprising.
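
A minimal sketch of how items 1 and 2 might be answered in code, using a small made-up table in place of your dataset (demo and its columns are hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up table standing in for your dataset
demo = pd.DataFrame({
    'temperature': [12.0, np.nan, 18.5, 20.1],
    'humidity':    [55.0, 60.0, np.nan, np.nan],
    'land_use':    ['forest', 'urban', None, 'forest'],
})

# Count columns by type
n_numeric = demo.select_dtypes(include='number').shape[1]
n_categorical = demo.select_dtypes(exclude='number').shape[1]

# Percentage of missing values per column
missing_pct = demo.isna().mean() * 100

print(n_numeric, "numeric,", n_categorical, "categorical")
print(missing_pct.round(1))
```

Keep in mind that a column stored as numbers can still be categorical in meaning (for example, a coded region ID), so check the documentation rather than relying on dtypes alone.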

Step 3: Create a representative test set and lock it away

Code
from sklearn.model_selection import train_test_split

response_var = 'your_response'   # replace

X = df.drop(columns=response_var)
y = df[response_var]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f'Training   : {X_train.shape[0]} rows')
print(f'Test       : {X_test.shape[0]} rows')

Lock the test set away. Do not use it again until Step 10, when you evaluate your final model.
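
If your response variable is categorical and imbalanced, one common way to keep the test set representative is a stratified split. This is a sketch on made-up labels; whether stratification suits your dataset is a judgment call.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 80 zeros and 20 ones
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Both splits keep the original 80/20 class balance
print("Train share of class 1:", y_tr.mean())
print("Test share of class 1: ", y_te.mean())
```

Without stratify, a small test set can end up with noticeably more or fewer rare-class rows than the full data purely by chance.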


Step 4: Explore the train data to gain insights

Build your own exploratory analysis on the training set. Explore the following for each variable:

  • Name
  • Type (categorical, int/float, bounded/unbounded, text, structured, etc.)
  • % of missing values
  • Noisiness and type of noise (outliers, rounding errors, etc.)
  • Possibly useful for the task?
  • Type of distribution

Question 3
  1. Which predictors appear most strongly related to the target? You can use corr() to find out. Describe the direction.
  2. Are there too many predictors?
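
For item 1, a sketch of ranking numeric predictors by their correlation with the target. The table is synthetic and the column names are placeholders; the target is constructed to depend on temperature so the ranking is easy to read.

```python
import numpy as np
import pandas as pd

# Synthetic training table, for illustration only
rng = np.random.default_rng(0)
n = 200
temperature = rng.normal(15, 5, n)
wind_speed = rng.exponential(5, n)
train_demo = pd.DataFrame({
    'temperature': temperature,
    'wind_speed':  wind_speed,
    'target':      3 * temperature + rng.normal(0, 2, n),  # driven by temperature by construction
})

# Correlation of each numeric predictor with the target, strongest first
corr_with_target = (
    train_demo.corr(numeric_only=True)['target']
    .drop('target')
    .sort_values(key=abs, ascending=False)
)
print(corr_with_target)
```

The sign gives the direction of the relationship. Note that corr() measures linear association only, so a predictor with a low value here can still be useful to a non-linear model.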

Step 5: Consider Feature Combinations

In this step we consider if there are any combinations of our predictors that could produce a more informative feature. This is not a “one and done” process: as we are developing the model, we may come back and reassess whether there are new features that are worth including. Whenever possible, make sure your decisions are backed up by domain expertise.


Step 6: Build Your Preprocessing Pipeline

Identify your numerical and categorical columns, then construct a ColumnTransformer + Pipeline as shown in the review section above. Fit it on training data only, then apply it to both X_train and X_test to produce X_train_processed and X_test_processed.

Code
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

num_features = [...]  # fill in
cat_features = [...]  # fill in (empty list if none)

numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler',  StandardScaler())
])

categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(sparse_output=False, handle_unknown='ignore'))
])

preprocessor = ColumnTransformer([
    ('num', numeric_pipeline, num_features),
    ('cat', categorical_pipeline, cat_features)
])

X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed  = preprocessor.transform(X_test)

print(f'Processed training shape: {X_train_processed.shape}')
print(f'Processed test shape    : {X_test_processed.shape}')

Question 4
  1. How will you handle missing values in your dataset? Justify your choice of imputation strategy.
  2. If you one-hot encode a categorical column with k categories, how many new columns does it produce?
  3. How many total columns does your processed feature matrix have? Is this different from the number of columns in the raw data?

If you have time, complete steps 7-10. Try to incorporate Step 7 into your Pipeline as well.

  7. Try out different candidate models and estimate accuracy metrics with CV
  8. Go back to the drawing board if needed
  9. Select a model and fine-tune it
  10. Evaluate model on test set
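
As a starting point for comparing candidate models, the sketch below cross-validates two of them inside a pipeline. It runs on synthetic data with made-up names (X_train_demo, y_train_demo, candidates); in your own notebook you would swap in your preprocessor and training data, and choose models suited to your problem.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic training data standing in for your own X_train / y_train
rng = np.random.default_rng(42)
X_train_demo = rng.normal(size=(100, 3))
y_train_demo = X_train_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=100)

candidates = {
    'linear': LinearRegression(),
    'tree':   DecisionTreeRegressor(max_depth=4, random_state=42),
}

cv_rmse = {}
for name, model in candidates.items():
    pipe = Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler',  StandardScaler()),
        ('model',   model),
    ])
    # Preprocessing is refit inside each fold, so no fold leaks into another
    scores = cross_val_score(pipe, X_train_demo, y_train_demo,
                             cv=5, scoring='neg_root_mean_squared_error')
    cv_rmse[name] = -scores.mean()

print(cv_rmse)
```

Because the whole pipeline is passed to cross_val_score, the test set stays untouched until the final evaluation step.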