26  Clustering

Learning Outcomes
  • Introduction to clustering
  • Assessing the taxonomy of clustering approaches
  • K-Means clustering
  • Clustering with no explicit loss function: minimizing inertia
  • Hierarchical Agglomerative Clustering
  • Picking K: a hyperparameter

Last time, we began our journey into unsupervised learning by discussing Principal Component Analysis (PCA).

In this lecture, we will explore another very popular unsupervised learning concept: clustering. Clustering allows us to “group” similar datapoints together without being given labels of what “class” or where each point explicitly comes from. We will discuss two clustering algorithms: K-Means clustering and hierarchical agglomerative clustering, and we’ll examine the assumptions, strengths, and drawbacks of each one.

26.1 Review: Taxonomy of Machine Learning

26.1.1 Supervised Learning

In supervised learning, our goal is to create a function that maps inputs to outputs. Each model is learned from example input/output pairs (training set), validated using input/output pairs, and eventually tested on more input/output pairs. Each pair consists of:

  • Input vector
  • Output value (label)

In regression, our output value is quantitative, and in classification, our output value is categorical.

ML taxonomy

26.1.2 Unsupervised Learning

In unsupervised learning, our goal is to identify patterns in unlabeled data. In this type of learning, we do not have input/output pairs. Sometimes, we may have labels but choose to ignore them (e.g. PCA on labeled data). Instead, we are more interested in the inherent structure of the data we have rather than trying to simply predict a label using that structure of data. For example, if we are interested in dimensionality reduction, we can use PCA to reduce our data to a lower dimension.

Now, let’s consider a new problem: clustering.

26.1.3 Clustering Examples

26.1.3.1 Example 1 - Image Compression with Clustering

Clustering can be used for image compression. Why? Digital images consume significant storage and bandwidth, so reducing the number of colors simplifies images while retaining their visual appeal.

  • Clustering can group similar colors (pixels) in an image into clusters.
  • Each cluster represents a color centroid (mean color of the group).
  • Replace each pixel in the image with the color of its cluster centroid, reducing the number of unique colors.

In the example below, notice that using more clusters will make the compressed image look more similar to the original image, while using fewer clusters will look more like a silhouette of the image. Finding a good balance can help resemble the original image while storing less information.

pishi_compression

Image compression can also be done using other techniques including transform-based, prediction-based, and deep learning-based methods.

26.1.3.2 Example 2 - Clustering in Social Networks

In social network clustering, we identify groups (clusters or communities) of individuals who interact more frequently with each other. This can help analyze network behavior, predict relationships, and recommend connections. Some applications include:

  • Detecting communities on social network platforms.
  • Understanding collaboration in organizations.
  • Optimizing targeted marketing strategies.
social_network

Shown above is a synthetic graph representing interactions in a karate club, where each Node represents a member of the club and each Edge is an interaction between members (or Nodes). Members can be clustered based on the people they interact the most with.

26.1.3.3 Example 3 - Clustering in Climate Sciences

Clustering in climate science helps to identify patterns in large, complex datasets and provide insights into global and regional climate trends. Some examples include :

  • Climate zone identification
  • Weather pattern analysis
  • Environmental monitoring

In this example, we grouped regions based on average temperatures to understand climate zones.

climate_clusters

26.2 Taxonomy of Clustering Approaches

taxonomy

There are many types of clustering algorithms, and they all have strengths, inherent weaknesses, and different use cases. We will first focus on a partitional approach: K-Means clustering.

26.3 K-Means Clustering

The most popular clustering approach is K-Means. The algorithm itself entails the following:

  1. Pick an arbitrary \(k\), and randomly place \(k\) “centers”, each a different color.

  2. Repeat until convergence:

    1. Color points according to the closest center.
    2. Move the center for each color to the center of points with that color.

Consider the following data with an arbitrary \(k = 2\) and randomly placed “centers” denoted by the different colors (blue, orange):

init_cluster

Now, we will follow the rest of the algorithm. First, let us color each point according to the closest center:

cluster_class

Next, we will move the center for each color to the center of points with that color. Notice how the centers are generally well-centered amongst the data that shares its color.

cluster_iter1

Assume this process (re-color and re-set centers) repeats for a few more iterations. We eventually reach this state.

cluster_iter5

After this iteration, the center stays still and does not move at all. Thus, we have converged, and the clustering is complete!

26.3.0.1 A Quick Note

K-Means is a completely different algorithm than K-Nearest Neighbors. K-means is used for clustering, where each point is assigned to one of \(K\) clusters. On the other hand, K-Nearest Neighbors is used for classification (or, less often, regression), and the predicted value is typically the most common class among the \(K\)-nearest data points in the training set. The names may be similar, but there isn’t really anything in common.

26.4 Minimizing Inertia

Consider the following example where \(K = 4\):

four_cluster

Due to the randomness of where the \(K\) centers initialize/start, you will get a different output/clustering every time you run K-Means. Consider three possible K-Means outputs; the algorithm has converged, and the colors denote the final cluster they are clustered as.

random_outputs


Which clustering output is the best? To evaluate different clustering results, we need a loss function.

The two common loss functions are:

  • Inertia: Sum of squared distances from each data point to its center.
  • Distortion: Weighted sum of squared distances from each data point to its center.
inertia_distortion

In the example above:

  • Calculated inertia: \(0.47^2 + 0.19^2 + 0.34^2 + 0.25^2 + 0.58^2 + 0.36^2 + 0.44^2\)
  • Calculated distortion: \(\frac{0.47^2 + 0.19^2 + 0.34^2}{3} + \frac{0.25^2 + 0.58^2 + 0.36^2 + 0.44^2}{4}\)

Switching back to the four-cluster example at the beginning of this section, random.seed(25) had an inertia of 44.88, random.seed(29) had an inertia of 45.87, and random.seed(40) had an inertia of 54.27. It seems that the best clustering output was random.seed(25) with an inertia of 44.88!

It turns out that the function K-Means is trying to minimize is inertia, but often fails to find global optimum. Why does this happen? We can think of K-means as a pair of optimizers that take turns. The first optimizer holds center positions constant and optimizes data colors. The second optimizer holds data colors constant and optimizes center positions. Neither optimizer gets full control!

This is a hard problem: give an algorithm that optimizes inertia FOR A GIVEN \(K\); \(K\) is picked in advance. Your algorithm should return the EXACT best centers and colors, but you don’t need to worry about runtime.

Note: This is a bit of a CS61B/CS70/CS170 problem, so do not worry about completely understanding the tricky predicament we are in too much!

First note that there are \(k^n\) possible colorings because there are \(k\) colors that each point could be, so with \(n\) points, this gives \(k^n\) possibilities.

A potential algorithm:

  • For all possible \(k^n\) colorings:
    • Compute the \(k\) centers for that coloring.
    • Compute the inertia for the \(k\) centers.
      • If current inertia is better than best known, write down the current centers and coloring and call that the new best known.

No better algorithm has been found for solving the problem of minimizing inertia exactly.

Consider the following results of two K-Means clustering outputs:

clustering_comparison


Which clustering result do you like better? K-Means likes the one on the right better because it has lower inertia (the sum of squared distances from each data point to its center), but this raises some questions:

  • Why is the inertia on the right lower? K-Means optimizes for distance, not “blobbiness”.
  • Is clustering on the right “wrong”? Good question!

26.5 Hierarchical Agglomerative Clustering

Now, let us consider hierarchical agglomerative clustering.

hierarchical_approach


Now, let us introduce Hierarchical Agglomerative Clustering! We start with every data point in a separate cluster, and we’ll keep merging the most similar pairs of data points/clusters until we have one big cluster left. This is called a bottom-up or agglomerative method.

There are various ways to decide the order of combining clusters called Linkage Criterion:

  • Single linkage (similarity of the most similar): the distance between two clusters as the minimum distance between a point in the first cluster and a point in the second.
  • Average linkage: average similarity of pairs of points in clusters.
  • Complete linkage (similarity of the least similar): the distance between two clusters as the maximum distance between a point in the first cluster and a point in the second.

The linkage criterion decides how we measure the “distance” between two clusters. Regardless of the criterion we choose, the aim is to combine the two clusters that have the minimum “distance” between them, with the distance computed as per that criterion. In the case of complete linkage, for example, that means picking the two clusters that minimize the maximum distance between a point in the first cluster and a point in the second.

linkage

When the algorithm starts, every data point is in its own cluster. In the plot below, there are 12 data points, so the algorithm starts with 12 clusters. As the clustering begins, it assesses which clusters are the closest together.

agg1

The closest clusters are 10 and 11, so they are merged together.

agg2

Next, points 0 and 4 are merged together because they are closest.

agg3

At this point, we have 10 clusters: 8 with a single point (clusters 1, 2, 3, 4, 5, 6, 7, 8, and 9) and 2 with 2 points (clusters 0 and 10).

Although clusters 0 and 3 are not the closest, let us consider if we were trying to merge them. A tricky question arises: what is the “distance” between clusters 0 and 3? We can use the Complete-Link approach that uses the max distance among all pairs of points between groups to decide which group has smaller “distance”. In this case, the distance between clusters 0 and 3 would be the distance from the bottom 0 point to the 3 point.

agg4

Let us assume the algorithm runs a little longer, and we have reached the following state. Clusters 0 and 7 are up next, but why? The max line between any member of 0 and 6 is longer than the max line between any member of 0 and 7.

agg5

Thus, 0 and 7 are merged into 0 as they are closer under the complete linkage criterion.

After more iterations, we finally converge to the plot on the left. There are two clusters (0, 1), and the agglomerative algorithm has converged.

agg6


Notice that on the full dataset, our agglomerative clustering algorithm achieves the more “correct” output.

26.5.1 Clustering, Dendrograms, and Intuition

Agglomerative clustering is one form of “hierarchical clustering.” It is interpretable because we can keep track of when two clusters got merged (each cluster is a tree), and we can visualize the merging hierarchy, resulting in a “dendrogram.” Won’t discuss this any further for this course, but you might see these in the wild. Here are some examples:

dendro_1 dendro_2

26.5.1.1 How to use a dendrogram

  • Observe the Vertical Gaps:
    • Look for the largest vertical distance (gap) between horizontal lines as you move upward in the dendrogram.
    • This gap indicates a significant jump in distance and suggests that merging clusters beyond this point combines dissimilar groups.
  • Cut the Dendrogram:
    • Draw a horizontal line just below this large gap.
    • The number of vertical lines the horizontal line crosses represents the optimal K.
dendro_3

In this example, the horizontal line crosses 4 vertical lines so the optimal K is 4 clusters ([9,41], [5,20], [17,12], and [17,8,7,14]).

26.6 Picking K

The algorithms we’ve discussed require us to pick a \(K\) before we start. But how do we pick \(K\)? Often, the best \(K\) is subjective. For example, consider the state plot below.

states

How many clusters are there here? For K-Means, one approach to determine this is to plot inertia versus many different \(K\) values. We’d pick the \(K\) in the elbow, where we get diminishing returns afterward. Note that big, complicated data often lacks an elbow, so this method is not foolproof. Here, we would likely select \(K = 2\) or \(K = 3\).

elbow

26.6.1 Silhouette Scores

To evaluate how “well-clustered” a specific data point is, we can use the silhouette score, also termed the silhouette width. A high silhouette score indicates that a point is near the other points in its cluster; a low score means that it’s far from the other points in its cluster.

high_low

For a data point \(X\), score \(S\) is: \[S =\frac{B - A}{\max(A, B)}\] where \(A\) is the average distance to other points in the cluster, and \(B\) is the average distance to points in the closest cluster.

Consider what the highest possible value of \(S\) is and how that value can occur. The highest possible value of \(S\) is 1, which happens if every point in \(X\)’s cluster is right on top of \(X\); the average distance to other points in \(X\)’s cluster is \(0\), so \(A = 0\). Thus, \(S = \frac{B}{\max(0, B)} = \frac{B}{B} = 1\). Another case where \(S = 1\) could happen is if \(B\) is much greater than \(A\) (we denote this as \(B >> A\)).

Can \(S\) be negative? The answer is yes. If the average distance to X’s clustermates is larger than the distance to the closest cluster, then this is possible. For example, the “low score” point on the right of the image above has \(S = -0.13\).

26.6.2 Evaluating Clusters

To evaluate different clustering results, we discussed two loss functions:

  • Inertia: Sum of squared distances from each data point to its center.
    • Lower inertia indicates tighter clusters.
    • Inertia always decreases as k increases, so it doesn’t penalize overfitting.
    • Inertia doesn’t consider whether clusters are well-separated or non-spherical.
    • Use the elbow method to find optimal number of clusters.
  • Silhouette Score: Silhouette score evaluates how well a point fits within its cluster versus how poorly it fits into neighboring clusters.
    • A higher silhouette score indicates better-defined and separated clusters.
    • Useful when comparing clustering results for different values of k.
    • Negative silhouette scores can highlight misclassified points or overlapping clusters.
    • Assumes convex clusters and computationally expensive.

We discussed using the elbow method to pick a k with inertia, but let’s also discuss another method for visualizing silhouette scores.

26.6.3 Silhouette Plot

We can plot the silhouette scores for all of our datapoints. The x-axis represents the silhouette coefficient value or silhouette score. The y-axis tells us which cluster label the points belong to, as well as the number of points within a particular cluster. Points with large silhouette widths are deeply embedded in their cluster; the red dotted line shows the average. Below, we plot the silhouette score for our plot with \(K=2\).

high_low silhouette_1

Similarly, we can plot the silhouette score for the same dataset but with \(K=3\):

dendro_1 silhouette_2

The average silhouette score is lower with 3 clusters, so \(K=2\) is a better choice. This aligns with our visual intuition as well.

26.6.4 Picking K: Real World Metrics

Sometimes you can rely on real-world metrics to guide your choice of \(K\). Assume that you have a T-shirt production line and you want to decide about the different T-shirt sizes you need to produce. We can either:

  • Cluster heights and weights of customers with \(K = 3\) to design Small, Medium, and Large shirts
  • Cluster heights and weights of customers with \(K = 5\) to design XS, S, M, L, and XL shirts

To choose \(K\), consider projected costs and sales for the 2 different \(K\)s and select the one that maximizes profit.

26.7 Conclusion

We’ve now discussed a new machine learning goal — clustering — and explored two solutions:

  • K-Means Clustering tries to optimize a loss function called inertia (no known algorithm to find the optimal answer in an efficient manner)
  • Hierarchical Agglomerative Clustering builds clusters bottom-up by merging clusters “close” to each other, depending on the choice of linkage.

Our version of these algorithms required a hyperparameter \(K\). There are 4 ways to pick \(K\): intuitively, the elbow method, silhouette scores, or by harnessing real-world metrics.

There are many machine learning problems. Each can be addressed by many different solution techniques. Each has many metrics for evaluating success / loss. Many techniques can be used to solve different problem types. For example, linear models can be used for regression and classification.

We’ve only scratched the surface and haven’t discussed many important ideas, such as neural networks and deep learning. In the next lecture, we’ll provide some specific course recommendations on how to explore these topics further.