End-to-End ML Workflows

EDS 232 · Machine Learning in Environmental Science

In this lesson


We will walk through the core steps of an end-to-end ML project:

  • Framing the problem and understanding the big picture
  • Getting the data and doing preliminary exploration
  • Creating and locking away a test set
  • Exploring and cleaning the training data
  • Preparing data for ML (scrubbing, scaling, encoding)
  • Trying candidate models and evaluating with CV
  • Diagnosing what went wrong and going back to the drawing board
  • Final model selection, tuning, and evaluation

The end-to-end workflow

With your team, order the core steps of an ML workflow


  1. Understand the big picture
  2. Get the data and do a preliminary exploration
  3. Create a representative test set and lock it away
  4. Explore the train data to gain insights
  5. Consider attribute combinations
  6. Prepare train data for ML algorithms
  7. Try out different candidate models and estimate accuracy metrics with CV
  8. Go back to the drawing board if needed
  9. Select a model and fine-tune it
  10. Evaluate model on test set

🌟 Present your results!
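
As a rough illustration of step 3 (creating a representative test set), here is a minimal sketch using scikit-learn's train_test_split. The dataset, the column names (biome, rainfall_mm, biomass), and the choice to stratify on biome are all assumptions made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Hypothetical dataset: 200 sites with an imbalanced categorical column
df = pd.DataFrame({
    "biome": rng.choice(
        ["savanna", "rainforest", "grassland", "desert"],
        size=200, p=[0.4, 0.3, 0.2, 0.1],
    ),
    "rainfall_mm": rng.gamma(shape=2.0, scale=300.0, size=200),
})
df["biomass"] = 0.05 * df["rainfall_mm"] + rng.normal(0, 5, size=200)

# Hold out 20% as the test set; stratifying on biome keeps each
# category at roughly the same share in both splits
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["biome"], random_state=42
)

print(train_df["biome"].value_counts(normalize=True).round(2))
print(test_df["biome"].value_counts(normalize=True).round(2))
```

After this point, all exploration and modeling would use train_df only; test_df is set aside until the final evaluation.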

Discussion about initial steps


Step 4: Explore the training data


Goals of this step:

  1. Identify any cleaning or transformations needed
  2. Identify promising predictors
  3. Brainstorm new features worth computing

For each variable, study:

  • Type (categorical, int/float, bounded/unbounded)
  • % of missing values
  • Noisiness (outliers, rounding errors)
  • Type of distribution
  • Whether it is plausibly useful for the task
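
The per-variable checks above could be started with a small pandas summary like the one below. This is only a sketch: the training frame and its columns (biome, elevation_m, n_species) are invented for illustration, and the 9999 value stands in for a suspicious entry you might spot this way.

```python
import numpy as np
import pandas as pd

# Hypothetical training data with a missing category, a missing number,
# and one suspicious value (9999 could be an outlier or a sentinel code)
train = pd.DataFrame({
    "biome": ["savanna", "desert", None, "grassland", "savanna"],
    "elevation_m": [120.0, 450.0, 300.0, np.nan, 9999.0],
    "n_species": [12, 3, 7, 9, 11],
})

# One row per variable: type, percent missing, number of distinct values
summary = pd.DataFrame({
    "dtype": train.dtypes.astype(str),
    "pct_missing": train.isna().mean() * 100,
    "n_unique": train.nunique(),
})
print(summary)

# Basic distribution statistics for the numerical columns
print(train.describe())
```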

Step 4: Correlations


Plot each predictor against the target to uncover noise and correlations. Also compute pairwise correlations using .corr().

Important: .corr() only works for numerical variables and, with its default Pearson method, only measures linear association.

  • Close to +1 → strong positive linear correlation
  • Close to -1 → strong negative linear correlation
  • Close to 0 → no linear correlation (but a non-linear relationship may still exist!)
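
To see why a correlation near 0 does not rule out a relationship, consider this small sketch. The predictor x and the two target columns are synthetic and exist only for this example.

```python
import numpy as np
import pandas as pd

# Synthetic predictor, symmetric around 0
x = np.linspace(-3, 3, 201)

df = pd.DataFrame({
    "x": x,
    "linear_target": 2 * x + 1,   # perfectly linear in x
    "quadratic_target": x ** 2,   # fully determined by x, but not linearly
})

# Pairwise Pearson correlations (the pandas default)
corr = df.corr()
print(corr["x"].round(3))
```

The quadratic target is completely determined by x, yet its Pearson correlation with x is essentially 0, which is why plotting each predictor against the target remains worthwhile.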

Step 5: Consider attribute combinations


Sometimes combining predictors produces a more informative feature than either one alone.

  • Whenever possible, ground decisions in domain expertise
  • This is an iterative process: revisit as the model develops
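
One possible example of an attribute combination, assuming a dataset with precipitation and potential evapotranspiration columns (precip_mm and pet_mm are hypothetical names): their ratio, often called an aridity index, may carry more signal than either column alone.

```python
import pandas as pd

# Hypothetical training data: annual precipitation and potential
# evapotranspiration at four sites
train = pd.DataFrame({
    "precip_mm": [250.0, 1800.0, 600.0, 90.0],
    "pet_mm": [1500.0, 1200.0, 1000.0, 1800.0],
})

# Combine two predictors into one ratio feature
train["aridity_index"] = train["precip_mm"] / train["pet_mm"]
print(train)
```

Whether such a feature helps is an empirical question; checking its correlation with the target and its effect on cross-validated performance is a reasonable next step.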

Preparing data for ML

Step 6: Scrubbing


High-quality training data is critical. Common issues:

  • Omitted values (missing data)
  • Duplicate observations
  • Outliers

Potential fixes for each:

| Approach | When to use |
| --- | --- |
| Remove the observation | Missingness is random and n is large |
| Impute (mean, median, KNN…) | Losing observations is costly |
| Drop the entire feature | Variable has too many missing values or is not useful |

Always keep a backup of the raw data before transforming!
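
A minimal scrubbing sketch, assuming a small made-up table (site, temp_c, ph) with one duplicated row and two missing values. It uses median imputation via scikit-learn's SimpleImputer as one of several reasonable choices, and works on a copy so the raw data is preserved.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical raw data: row 2 duplicates row 1, and two values are missing
raw = pd.DataFrame({
    "site": ["a", "b", "b", "c", "d"],
    "temp_c": [14.2, 15.1, 15.1, np.nan, 13.8],
    "ph": [6.8, 7.1, 7.1, 6.5, np.nan],
})

clean = raw.copy()               # work on a copy; raw stays untouched
clean = clean.drop_duplicates()  # remove exact duplicate observations

# Fill missing numerical values with each column's median
num_cols = ["temp_c", "ph"]
imputer = SimpleImputer(strategy="median")
clean[num_cols] = imputer.fit_transform(clean[num_cols])

print(clean)
```

In a full workflow the imputer would be fit on the training set only and then reused, unchanged, on the test set.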

Step 6: Scaling


Many ML models perform better when features are on similar scales.

Standardization (StandardScaler): subtract the mean, divide by the standard deviation → zero mean, unit variance.

The target variable does not need to be scaled.

Linear scaling, clipping, and log-scaling are a few other ways you may want to scale your data. You can read a brief overview about each here.
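
Standardization with StandardScaler might look like the sketch below. The two feature columns are invented and deliberately sit on very different scales; the key assumption illustrated is that the mean and standard deviation are learned from the training data and then reused on new data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales
X_train = np.array([
    [100.0, 0.1],
    [200.0, 0.2],
    [300.0, 0.3],
    [400.0, 0.4],
])
X_test = np.array([[250.0, 0.25]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data
X_test_scaled = scaler.transform(X_test)        # reuse the training mean/std

print(X_train_scaled.mean(axis=0))  # approximately 0 for each column
print(X_train_scaled.std(axis=0))   # approximately 1 for each column
```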

Step 6: One-hot encoding


Applied to categorical predictors.

Suppose you have a variable called biome with categories savanna, rainforest, grassland, and desert.

Most ML algorithms prefer to work with numerical variables, so we can transform this variable into numbers.

A simple approach would be to assign each category a number:

  • savanna -> 0
  • rainforest -> 1
  • grassland -> 2
  • desert -> 3

One issue with this approach is that a model would treat these codes as a numerical variable, implying an order and spacing between the biomes (for example, that desert is "greater than" grassland) that does not actually exist.

Step 6: One-hot encoding


One-hot encoding creates a binary feature per class:

  • new feature equals 1 when the observation belongs to the class,
  • new feature equals 0 if the observation does not belong to the class.

For example, the one-hot encoding for the biome classes would look like:

| Observation | savanna | rainforest | grassland | desert |
| --- | --- | --- | --- | --- |
| biome = savanna | 1 | 0 | 0 | 0 |
| biome = rainforest | 0 | 1 | 0 | 0 |
| biome = grassland | 0 | 0 | 1 | 0 |
| biome = desert | 0 | 0 | 0 | 1 |

Each row has exactly one 1 and all other values 0.

For linear models, drop one encoded column to avoid perfect multicollinearity (drop parameter in OneHotEncoder).

Step 7: Try candidate models


Once you have a solid, clean training set:

  1. Select a few candidate models appropriate for the problem setup
  2. Train with default hyperparameters; don't over-invest yet
  3. Estimate performance with cross-validation (for classification, ROC curves are also informative)

The goal is a quick, comparable baseline across model families. Tuning comes later.
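
The baseline comparison could be sketched as follows. The synthetic regression data from make_regression, the three model families, and the choice of RMSE as the metric are assumptions for illustration; only random_state is set, so all other hyperparameters stay at their defaults.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for a prepared training set
X_train, y_train = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

# A few candidate model families with default hyperparameters
candidates = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=0),
    "forest": RandomForestRegressor(random_state=0),
}

# 5-fold cross-validated RMSE for each candidate (lower is better)
cv_rmse = {}
for name, model in candidates.items():
    scores = cross_val_score(
        model, X_train, y_train, cv=5,
        scoring="neg_root_mean_squared_error",
    )
    cv_rmse[name] = -scores.mean()

for name, rmse in cv_rmse.items():
    print(f"{name}: CV RMSE = {rmse:.1f}")
```

Because every candidate is scored with the same folds and metric, the numbers are directly comparable.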

Evaluating and refining

Step 8: Problems with the data


If, at this point, your results are bad, there are generally two things that could have gone wrong: “bad model” or “bad data”.

| Problem | Description | Fix |
| --- | --- | --- |
| Too small | More high-quality data generally helps | Collect more data |
| Nonrepresentative | Sampling noise or bias | Better sampling strategy |
| Poor quality | Outliers, errors, missing values obscure signal | Drop, correct, or impute |
| Irrelevant features | Noise prevents the algorithm from finding signal | Feature selection or extraction |

Step 8: Problems with models


| Problem | Description | Fix |
| --- | --- | --- |
| Overfitting | Model memorizes training noise; poor generalization | Simpler model; more data; reduce variance via hyperparameters |
| Underfitting | Model too simple to capture data structure | More flexible model; more informative features |

Whatever the challenge: it's ok to go back and experiment with different feature combinations and models.
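
One common way to tell overfitting from underfitting is to compare training scores with validation scores. The sketch below assumes synthetic noisy sine data and two decision trees of different depth; these choices are illustrative only. Note that the "test_score" key returned by cross_validate refers to the held-out validation folds, not the locked test set.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor

# Synthetic training data: a noisy sine curve
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=300)


def train_vs_val_r2(model):
    """Return mean training R^2 and mean validation R^2 over 5 folds."""
    res = cross_validate(
        model, X, y, cv=5, scoring="r2", return_train_score=True
    )
    return res["train_score"].mean(), res["test_score"].mean()


deep = DecisionTreeRegressor(random_state=0)                  # unrestricted depth
shallow = DecisionTreeRegressor(max_depth=3, random_state=0)  # constrained

deep_train, deep_val = train_vs_val_r2(deep)
shallow_train, shallow_val = train_vs_val_r2(shallow)

print(f"deep:    train R2 = {deep_train:.2f}, validation R2 = {deep_val:.2f}")
print(f"shallow: train R2 = {shallow_train:.2f}, validation R2 = {shallow_val:.2f}")
```

A large gap between training and validation scores points toward overfitting; low scores on both point toward underfitting.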

Steps 9–10: After model selection


9. Fine-tune the selected model.

Use cross-validation on the training set to search over hyperparameter combinations (e.g., GridSearchCV). All tuning decisions must use training data only: the test set stays locked!

10. Evaluate on the test set.

Evaluate exactly once. If test performance is much worse than your CV estimate, the model likely overfit during tuning.
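
Steps 9 and 10 together could look like this sketch. The synthetic data, the random forest, and the small hyperparameter grid are assumptions chosen to keep the example quick; the point is that GridSearchCV sees only the training split, and the test split is touched once at the end.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data, split once into train and (locked) test sets
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Step 9: tune with cross-validation on the training set only
param_grid = {"max_depth": [3, 6, None], "min_samples_leaf": [1, 5]}
search = GridSearchCV(
    RandomForestRegressor(n_estimators=30, random_state=0),
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_train, y_train)
cv_rmse = -search.best_score_

# Step 10: evaluate the refit best model on the test set, exactly once
test_rmse = mean_squared_error(y_test, search.predict(X_test)) ** 0.5

print("best params:", search.best_params_)
print(f"CV RMSE = {cv_rmse:.1f}, test RMSE = {test_rmse:.1f}")
```

Comparing cv_rmse with test_rmse gives the check described above: a test error far above the CV estimate suggests the tuning overfit the training data.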

🌟 Present your results.

Report the metric value and its practical meaning. Document the full pipeline, decisions made at each step, and any caveats about where the model may not generalize.