EDS 232


Lesson 12

K-Means Clustering


In this lesson



  • Clustering as an unsupervised method for finding subgroups in data
  • The concept of a centroid and within-cluster variation
  • The \(K\)-means algorithm and how it minimizes within-cluster variation iteratively
  • Practical considerations: local optima, choosing \(K\), and interpreting results

A guiding example


Investigating elephant seal diving strategies


Consider Amanda from our first lesson. She records dive depth (meters) and dive duration (minutes) for a large number of northern elephant seals. The question she investigates is:

Are there distinct foraging strategies among these seals?


Clustering as unsupervised learning


Clustering finds subgroups in a dataset so that observations within a cluster are similar to each other and different from observations in other clusters.

. . .

Unlike classification, there is no response variable \(Y\): no prior labels, no “correct answer.”

Clustering is a type of unsupervised method: we discover structure within the data rather than predicting a known outcome.

. . .

PCA is another unsupervised method we have studied: it finds a low-dimensional representation of the observations that explains a large fraction of the variance.


The algorithm


What is a centroid?


A centroid is the average position of a set of points: the “center of mass” of the group.

Example

Obs. Depth \(x_1\) Duration \(x_2\) Cluster
1 60 18 A
2 210 6 A
3 150 15 A
4 240 26 B
5 330 11 B
6 360 29 B

What is a centroid?


A centroid is the average position of a set of points: the “center of mass” of the group.

Example

Obs. Depth \(x_1\) Duration \(x_2\) Cluster
1 60 18 A
2 210 6 A
3 150 15 A
4 240 26 B
5 330 11 B
6 360 29 B

Centroid of cluster A:

\[\bar{x}^{(A)} = \left(\frac{60 + 210 + 150}{3},\; \frac{18 + 6 + 15}{3}\right) = (140,\; 13)\]

Centroid of cluster B:

\[\bar{x}^{(B)} = \left(\frac{240 + 330 + 360}{3},\; \frac{26 + 11 + 29}{3}\right) = (310,\; 22)\]


What is a centroid?



Within-cluster variation



Within-cluster variation


For a good clustering, observations within each cluster should be as similar as possible. We measure this with within-cluster variation:

\[W(C_k) = \frac{1}{|C_k|} \sum_{i,\, i' \in C_k} \;\sum_{j=1}^{p} (x_{ij} - x_{i'j})^2\]

\(|C_k|\) = number of observations in cluster \(C_k\). \(p\) = number of features


The K-means objective



We want to find \(K\) clusters that minimize the total within-cluster variation across all clusters:

\[\underset{C_1, \ldots, C_K}{\text{minimize}} \left\{ \sum_{k=1}^{K} W(C_k) \right\}\]


. . .

The \(K\)-means algorithm finds an approximate solution to this minimization through a simple iterative procedure.


The K-means algorithm



  1. Initialize: randomly select \(K\) observations from the dataset as the initial centroids.

  2. Assign: assign each observation to the cluster whose centroid is nearest (smallest Euclidean distance).

  3. Update centroids: compute the new centroid (mean) of each cluster.

  4. Repeat steps 2–3 until assignments no longer change.

Every iteration is guaranteed to decrease (or maintain) the total within-cluster variation, so the algorithm always converges. However, it does not always converge to the global minimum!


. . .

The clustering illustrations in the following slides were created by Dr. Allison Horst.














K-means in action



K-means in action



K-means in action



K-means in action



K-means in action



K-means in action



Considerations


Local optima


Different random initializations can lead to different final clusterings — the algorithm finds a local minimum, not necessarily the global one.

Run \(K\)-means several times with different initializations and keep the solution with the lowest within cluster variation.


How to choose \(K\)


\(K\) is a hyperparameter chosen before running the algorithm. The elbow method: fit \(K\)-means for a range of \(K\) values and plot within-cluster variation vs. \(K\). The elbow (where the rate of decrease flattens) suggests a reasonable value of \(K\).

Choosing \(K\) also depends on what is practically meaningful or interpretable in the applied context.


Interpreting results


  • Clusters will always be found. The algorithm partitions observations regardless of whether true subgroups exist!

  • Clustering with different parameter choices (different \(K\), different initializations) will produce variable results.

  • The output should not be taken as absolute truth, but rather as a starting point for generating hypotheses and guiding further study.

. . .


In Amanda’s case, the clusters may correspond to real foraging strategies, but they could also reflect individual variation, seasonal patterns, or even artifacts of data collection.


That’s it for EDS 232!


Everything we covered!


  • 🤖 Machine learning — AI vs ML vs deep learning, learning from data, limitations
  • 📊 Data fundamentals — predictors, response variable, training set, test set
  • 🔮 Inference vs prediction — parametric vs non-parametric models
  • 🗂️ Problem types — supervised vs unsupervised; regression vs classification
  • 📈 Linear regression — simple & multiple; least squares estimation, polynomial
  • 🧮 Hypothesis testing — t-statistic, p-values, confidence intervals
  • 📏 Regression metrics — RSS, MSE, R², adjusted R², RSE
  • ⚖️ Bias-variance tradeoff — overfitting, model flexibility, training vs test error
  • 🔄 Cross-validation — k-fold CV, hyperparameter tuning, model selection
  • 🏘️ K-nearest neighbors (KNN) — non-parametric classification, choosing K
  • 🎭 Confusion matrix — true/false positives and negatives
  • 🎯 Classification metrics — precision, recall, F₁, accuracy, error rate
  • 🌊 Logistic regression — logistic function, log-odds, decision boundary
  • 📡 ROC curves & AUC — TPR, FPR, classification threshold tradeoffs
  • 🏔️ Ridge & lasso — regularization, penalty term λ, coefficient shrinkage
  • 🔭 PCA — principal components, scree plots, variance explained
  • 🗜️ PCR — principal component regression, dimensionality reduction
  • 🌳 Decision trees — recursive binary splitting, pruning, leaves, nodes, Gini impurity, pruning
  • 🌲 Bagging — bootstrap sampling, ensemble methods, out-of-bag error
  • 🌲🌲 Random forests — feature subsampling, variable importance
  • ⚔️ Support vector machines — margins, support vectors, kernels, C parameter
  • 🧠 Neural networks — layers, activation functions, convolutions, pooling, CNNs
  • 🎲 K-means clustering — centroids, within-cluster variation, elbow method