Lab 01: Introduction to Machine Learning in Python

Download the Lab Template

Download the lab template here and move it to your eds232-labs repository.


The Python ML Ecosystem

Let’s meet the main players in the Python ML library ecosystem.


NumPy — Numerical Python

Before we meet the ML-specific packages: NumPy is the backbone that everything else is built on. It provides fast, efficient multi-dimensional arrays (ndarray) and element-wise math operations.

Almost every ML library stores data as NumPy arrays under the hood.

Example

Code
import numpy as np

# Simulated daily weather readings from a forest monitoring station
temperatures = np.array([28.0, 32.5, 36.1, 29.8, 41.0, 38.3, 22.5, 35.7])
humidity     = np.array([57.0, 42.0, 33.0, 65.0, 18.0, 25.0, 78.0, 38.0])

print("Temperatures (C):", temperatures)
print("Mean temperature :", np.mean(temperatures).round(2))
print("Max temperature  :", np.max(temperatures))
print("Std deviation    :", np.std(temperatures).round(2))

# Operations apply element-wise — no loops needed!
# Rough heat-dryness index: higher = greater fire concern
heat_dryness = temperatures * (1 - humidity / 100)
print("\nHeat-dryness index:", np.round(heat_dryness, 2))
Temperatures (C): [28.  32.5 36.1 29.8 41.  38.3 22.5 35.7]
Mean temperature : 32.99
Max temperature  : 41.0
Std deviation    : 5.64

Heat-dryness index: [12.04 18.85 24.19 10.43 33.62 28.72  4.95 22.13]

SciPy — Scientific Python

SciPy builds on NumPy and provides algorithms for:

  • Statistics (scipy.stats)
  • Linear algebra (scipy.linalg)
  • Optimization (scipy.optimize) — the math behind model training
  • Spatial data and distances (scipy.spatial)

In ML, SciPy is used for statistical tests, distance calculations, and preprocessing math. Many scikit-learn functions use SciPy internally.

Module           What it does
scipy.stats      Distributions, hypothesis tests, normality checks
scipy.spatial    Distance metrics (Euclidean, cosine, etc.)
scipy.sparse     Sparse matrices — efficient storage for high-dimensional data
scipy.optimize   Minimization functions that power gradient descent
Code
import numpy as np
import scipy.stats as stats
from scipy.spatial.distance import euclidean

# --- scipy.stats: Are fire-day temperatures statistically different? ---
fire_temps    = np.array([38.5, 41.2, 36.8, 40.1, 39.5, 37.9, 42.0, 38.8])
no_fire_temps = np.array([22.1, 28.4, 25.6, 30.2, 19.8, 26.5, 23.0, 27.1])

t_stat, p_value = stats.ttest_ind(fire_temps, no_fire_temps)

print("=== t-test: fire-day vs no-fire-day temperatures ===")
print(f"Fire mean   : {fire_temps.mean():.2f} C")
print(f"No-fire mean: {no_fire_temps.mean():.2f} C")
print(f"t-statistic : {t_stat:.3f}")
print(f"p-value     : {p_value:.6f}")
print(f"Significant : {'Yes - temperature is a strong predictor!' if p_value < 0.05 else 'No'}")

# --- scipy.spatial: distance between two weather observations ---
# Each observation: [Temperature, Humidity, Wind, Rain]
day_A = [35.0, 28.0, 15.0, 0.0]   # hot, dry, windy — fire likely
day_B = [22.0, 75.0,  5.0, 2.5]   # cool, humid, calm, rainy — fire unlikely

print(f"\nEuclidean distance between day A and day B: {euclidean(day_A, day_B):.2f}")
print("(Large distance = very different weather conditions)")
=== t-test: fire-day vs no-fire-day temperatures ===
Fire mean   : 39.35 C
No-fire mean: 25.34 C
t-statistic : 10.242
p-value     : 0.000000
Significant : Yes - temperature is a strong predictor!

Euclidean distance between day A and day B: 49.84
(Large distance = very different weather conditions)

scikit-learn (sklearn) — Machine Learning Library

scikit-learn is the go-to library for classical machine learning in Python. It provides:

  • Dozens of ready-to-use algorithms (regression, decision trees, SVMs, k-means, etc.)
  • Consistent API — every model has the same .fit(), .predict(), .score() methods
  • Preprocessing tools — scaling, encoding, imputing missing values
  • Model evaluation — cross-validation, metrics, confusion matrices
  • Pipelines — chain preprocessing + modeling steps together cleanly

The sklearn API pattern — learn it once, use it everywhere:

from sklearn.some_module import SomeModel

model = SomeModel()                    # 1. Create the model
model.fit(X_train, y_train)            # 2. Train it on data
predictions = model.predict(X_test)    # 3. Make predictions
score = model.score(X_test, y_test)    # 4. Evaluate it

This same 4-step pattern works for every sklearn algorithm.
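
To make the pattern concrete, here is a minimal sketch using LinearRegression on a tiny synthetic dataset (the numbers here are invented purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: y is roughly 2x + 1 with a little noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(40, 1))
y = 2 * X.ravel() + 1 + rng.normal(0, 0.5, size=40)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression()              # 1. Create the model
model.fit(X_train, y_train)             # 2. Train it on data
predictions = model.predict(X_test)     # 3. Make predictions
score = model.score(X_test, y_test)     # 4. Evaluate it (R² for regressors)

print(f"Slope: {model.coef_[0]:.2f}, Intercept: {model.intercept_:.2f}")
print(f"Test R²: {score:.3f}")
```

Swap LinearRegression for any other estimator and the four lines stay the same.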

Module                    What it provides
sklearn.linear_model      Linear & Logistic Regression
sklearn.tree              Decision Trees
sklearn.ensemble          Random Forests, Gradient Boosting
sklearn.svm               Support Vector Machines
sklearn.neighbors         K-Nearest Neighbors
sklearn.cluster           K-Means, DBSCAN (unsupervised)
sklearn.preprocessing     Scaling, Encoding, Normalization
sklearn.model_selection   train_test_split, cross_val_score
sklearn.metrics           Accuracy, F1, RMSE, Confusion Matrix

statsmodels — Statistical Inference Library

statsmodels is Python's primary library for statistical analysis and inference. While sklearn focuses on prediction, statsmodels focuses on understanding your data, giving you the full picture of model diagnostics. It provides:

  • Detailed model summaries — coefficients, p-values, t-statistics, confidence intervals
  • Hypothesis testing — test whether a coefficient is statistically significant
  • Inference-first design — built for answering “does X actually affect Y?” not just “what does Y predict?”
  • Wide range of models — OLS, GLM, time series (ARIMA), logistic regression, and more
  • Residual diagnostics — check assumptions like normality and homoscedasticity

The statsmodels pattern:

import statsmodels.api as sm

X_with_const = sm.add_constant(X)      # 1. Add intercept manually
model = sm.OLS(y, X_with_const).fit()  # 2. Fit the model
print(model.summary())                 # 3. Get the full stats breakdown

Key attributes after fitting:

Attribute          What it gives you
model.params       Coefficients (like coef_ in sklearn)
model.pvalues      P-values for each coefficient
model.tvalues      T-statistics
model.bse          Standard errors
model.conf_int()   95% confidence intervals
model.rsquared     R² score
model.summary()    Full table with all of the above

Ecosystem Summary

NumPy          ->  Arrays and math (the foundation of everything)
SciPy          ->  Scientific algorithms: stats, distances, optimization
Pandas         ->  Data manipulation and exploration
scikit-learn   ->  Classical ML models, preprocessing, evaluation
statsmodels    ->  Statistical analysis and inference

Demo: The Algerian Forest Fires Dataset

We will use the Algerian Forest Fires dataset from the UCI Machine Learning Repository, fetched directly using the ucimlrepo package.

Background

This dataset was collected across two regions of Algeria — Bejaia (northeast) and Sidi Bel-Abbes (northwest) — during the summer of 2012 (June to September). Researchers recorded daily weather conditions alongside fire danger indices.

Predictor and Target

For this lab we’ll fit a simple linear regression with one predictor:

Variable   Role            Description
ISI        Predictor (X)   Initial Spread Index — measures how fast a fire would spread based on wind and fuel moisture
FWI        Target (y)      Fire Weather Index — overall measure of fire danger

These two are directly related in the Canadian Forest Fire Weather Index system: ISI feeds into the FWI calculation, so we expect a strong linear relationship.

This is a supervised regression problem: given X, learn \(\hat{f}(X) = \beta_0 + \beta_1 X\) that predicts FWI.

Code
import pandas as pd
from ucimlrepo import fetch_ucirepo

forest_fires = fetch_ucirepo(id=547)

df = forest_fires.data.features.reset_index(drop=True)

df
region day month year Temperature RH Ws Rain FFMC DMC DC ISI BUI FWI
0 Bejaia 1 6 2012 29 57 18 0.0 65.7 3.4 7.6 1.3 3.4 0.5
1 Bejaia 2 6 2012 29 61 13 1.3 64.4 4.1 7.6 1.0 3.9 0.4
2 Bejaia 3 6 2012 26 82 22 13.1 47.1 2.5 7.1 0.3 2.7 0.1
3 Bejaia 4 6 2012 25 89 13 2.5 28.6 1.3 6.9 0.0 1.7 0
4 Bejaia 5 6 2012 27 77 16 0.0 64.8 3.0 14.2 1.2 3.9 0.5
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
239 Sidi-Bel Abbes 26 9 2012 30 65 14 0.0 85.4 16.0 44.5 4.5 16.9 6.5
240 Sidi-Bel Abbes 27 9 2012 28 87 15 4.4 41.1 6.5 8 0.1 6.2 0
241 Sidi-Bel Abbes 28 9 2012 27 87 29 0.5 45.9 3.5 7.9 0.4 3.4 0.2
242 Sidi-Bel Abbes 29 9 2012 24 54 18 0.1 79.7 4.3 15.2 1.7 5.1 0.7
243 Sidi-Bel Abbes 30 9 2012 24 64 15 0.2 67.3 3.8 16.5 1.2 4.8 0.5

244 rows × 14 columns


Data Cleaning

This dataset has two issues we need to fix before we can build our model and visualizations.

  1. Whitespace in column names — some column names have trailing spaces that will cause silent lookup errors
  2. One corrupted row — a data entry error packed two values into one cell, which will surface as NaN after we convert columns to numeric

Step 1: Strip whitespace from column names

Code
print("Column names before cleaning:", df.columns.tolist())

df.columns = df.columns.str.strip()

print("Column names after cleaning:", df.columns.tolist())
Column names before cleaning: ['region', 'day', 'month', 'year', 'Temperature', ' RH', ' Ws', 'Rain ', 'FFMC', 'DMC', 'DC', 'ISI', 'BUI', 'FWI']
Column names after cleaning: ['region', 'day', 'month', 'year', 'Temperature', 'RH', 'Ws', 'Rain', 'FFMC', 'DMC', 'DC', 'ISI', 'BUI', 'FWI']

Step 2: Convert columns to numeric dtypes

Row 165 has a data entry error where two values were merged into one cell (e.g., DC = "14.6 9"). Pandas read the affected columns (DC, FWI) as strings instead of numbers. Converting to numeric with errors='coerce' turns that bad value into NaN, making the corrupted row easy to identify and remove.

Code
print("DC and FWI column types before conversion:")
print(df[['DC', 'FWI']].dtypes)
DC and FWI column types before conversion:
DC     object
FWI    object
dtype: object
Code
numeric_cols = [c for c in df.columns if c != 'region']
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors='coerce')

print("DC and FWI column types after conversion:")
print(df[['DC', 'FWI']].dtypes)
DC and FWI column types after conversion:
DC     float64
FWI    float64
dtype: object

Step 3: Drop rows with missing values

Any row with a NaN is unusable for modeling. Dropping them removes the one corrupted row identified above.

Code
df = df.dropna().reset_index(drop=True)

print(df.dtypes)
region          object
day              int64
month            int64
year             int64
Temperature      int64
RH               int64
Ws               int64
Rain           float64
FFMC           float64
DMC            float64
DC             float64
ISI            float64
BUI            float64
FWI            float64
dtype: object

Exploratory Data Analysis

Before modeling, let’s look at the distribution of our target variable and the relationship between our predictor and target.

Two plots below:

  1. FWI distribution — what does the range of fire danger look like?
  2. ISI vs. FWI — does a higher initial spread index correspond to higher overall fire danger?
Code
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# --- Plot 1: FWI distribution ---
axes[0].hist(df['FWI'], bins=20, color='#d62728', edgecolor='white')
axes[0].set_xlabel('Fire Weather Index (FWI)')
axes[0].set_ylabel('Number of days')
axes[0].set_title('Distribution of FWI')

# --- Plot 2: ISI vs FWI ---
axes[1].scatter(df['ISI'], df['FWI'], alpha=0.5, color='#d62728', edgecolor='white', linewidths=0.3)
axes[1].set_xlabel('Initial Spread Index (ISI)')
axes[1].set_ylabel('Fire Weather Index (FWI)')
axes[1].set_title('ISI vs. FWI')

plt.suptitle('Algerian Forest Fires — Exploratory Data Analysis', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()


Preparing X and y

Extract the single predictor (ISI) and the target (FWI) from df. sklearn expects X to be 2D, even for a single feature, so we use double brackets df[['ISI']] to keep a one-column structure; calling .values on it then yields an array of shape (n, 1) rather than a flat 1D array.

Code
X = df[['ISI']].values   # shape (n, 1) — 2D required by sklearn
y = df['FWI'].values     # shape (n,)

print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")
X shape: (243, 1)
y shape: (243,)

Training vs. Testing: The Core Concept

Before fitting a model, we split our data into two non-overlapping sets:

All Data (243 days)
|
|-- Training Set (80% = ~194 days)   <-- model learns from this
|
|-- Test Set     (20% = ~49 days)    <-- model is evaluated on this

Why? We want to know how the model performs on data it has never seen — that’s what matters in practice. Evaluating on the same data used for training gives an overly optimistic picture.

After fitting on the training set, we’ll compare R² on both sets. For a well-behaved model like linear regression, the two should be close — confirming that the model generalizes rather than just fitting noise.


train_test_split in Depth

scikit-learn provides train_test_split to split data automatically and safely.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,                 # feature matrix
    y,                 # target vector
    test_size=0.2,     # fraction (or integer count) to put in the test set
    random_state=42,   # integer seed for reproducibility
)

Let’s dig into each parameter.

random_state

train_test_split randomly shuffles the data before splitting. That is good — you don’t want all fire days landing in the training set.

But random shuffling means a different split every run, making results impossible to reproduce or compare. random_state sets a fixed random seed so the split is identical every time.

  • random_state=42 — any integer; 42 is common convention
  • random_state=None — truly random every run (not reproducible)
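
A quick sketch of what reproducibility means in practice, using toy arrays invented here:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 numbered observations so we can see exactly which rows move
X = np.arange(20).reshape(-1, 1)
y = np.arange(20)

# Same seed -> identical split every run
X_tr1, X_te1, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
X_tr2, X_te2, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)

print("Test rows, run 1:", X_te1.ravel())
print("Test rows, run 2:", X_te2.ravel())
print("Identical:", np.array_equal(X_te1, X_te2))
```

Rerun the cell as many times as you like: with random_state fixed, the same rows always land in the test set.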

stratify

For classification problems, train_test_split accepts a stratify=y argument that preserves class proportions in both splits. For regression, there are no discrete classes, so stratify is not applicable — the split is always random.
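
For intuition, here is a small sketch on an imbalanced toy classification target (the labels are invented for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced labels: 90 of class 0, 10 of class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

# stratify=y preserves the 10% class-1 proportion in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print("Train class-1 fraction:", y_tr.mean())
print("Test class-1 fraction :", y_te.mean())
```

Without stratify, an unlucky random split could leave the test set with zero class-1 examples; with it, both splits keep the 10% proportion.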

test_size

test_size accepts:

  • A float between 0 and 1 — proportion of data (e.g. 0.2 = 20%)
  • An integer — exact number of samples (e.g. 50 = 50 days)

Split   Train %   Test %   Best for
80/20   80        20       Default; works well for most datasets
70/30   70        30       Smaller datasets needing more test coverage
90/10   90        10       Large datasets where training data is precious

Rule of thumb: more training data means a better model, but you need at least 30–50 test samples for a reliable evaluation.
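
Both forms of test_size in action on the same toy data (arrays invented here):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# Float: a proportion of the data
_, X_te_frac, _, _ = train_test_split(X, y, test_size=0.2, random_state=0)

# Integer: an exact number of samples
_, X_te_count, _, _ = train_test_split(X, y, test_size=25, random_state=0)

print("test_size=0.2 ->", len(X_te_frac), "test samples")   # 20% of 100
print("test_size=25  ->", len(X_te_count), "test samples")  # exactly 25
```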

How Test Size Affects Model Performance

Choosing test_size involves a tradeoff:

  • Smaller test set → more data for training → higher train accuracy, but the test estimate is based on fewer samples and is less reliable (higher variance across different random splits)
  • Larger test set → more reliable performance estimate, but the model trains on less data

An Example

Code
from sklearn.model_selection import train_test_split

# Standard 80/20 split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    # 20% goes to test set
    random_state=42   # fix the random seed
)

print(f"Original dataset : {X.shape[0]} observations")
print()
print("Training set:")
print(f"  X_train: {X_train.shape}  -> {len(X_train)} days")
print(f"  y_train: {y_train.shape}")
print()
print("Test set:")
print(f"  X_test : {X_test.shape}   -> {len(X_test)} days")
print(f"  y_test : {y_test.shape}")
Original dataset : 243 observations

Training set:
  X_train: (194, 1)  -> 194 days
  y_train: (194,)

Test set:
  X_test : (49, 1)   -> 49 days
  y_test : (49,)
Code
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)

print(f"Test MSE: {mse:.3f}")
Test MSE: 3.111
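
As promised in the training-vs-testing section, the generalization check is to compare R² on both sets: model.score(X_train, y_train) versus model.score(X_test, y_test). Here is a self-contained sketch of that comparison on synthetic data (the linear relationship standing in for ISI vs. FWI is invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for (ISI, FWI): a noisy linear relationship
rng = np.random.default_rng(42)
X = rng.uniform(0, 15, size=(243, 1))
y = 1.8 * X.ravel() + 0.5 + rng.normal(0, 1.5, size=243)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)

r2_train = model.score(X_train, y_train)
r2_test = model.score(X_test, y_test)

print(f"Train R²: {r2_train:.3f}")
print(f"Test  R²: {r2_test:.3f}")
# The two values are close, indicating the model generalizes
# rather than memorizing the training data
```

Run the same two score() calls on the real ISI/FWI model fitted above to complete the check for this lab.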