26  Clustering

Learning Outcomes
  • Introduction to clustering
  • Assessing the taxonomy of clustering approaches
  • K-Means clustering
  • Clustering with no explicit loss function: minimizing inertia
  • Hierarchical Agglomerative Clustering
  • Picking K: a hyperparameter

Last time, we began our journey into unsupervised learning by discussing Principal Component Analysis (PCA).

In this lecture, we will explore another very popular unsupervised learning concept: clustering. Clustering allows us to “group” similar datapoints together without being given labels of what “class” or where each point explicitly comes from. We will discuss two clustering algorithms: K-Means clustering and hierarchical agglomerative clustering, and we’ll examine the assumptions, strengths, and drawbacks of each one.

26.1 Review: Taxonomy of Machine Learning

26.1.1 Supervised Learning

In supervised learning, our goal is to create a function that maps inputs to outputs. Each model is learned from example input/output pairs (training set), validated using input/output pairs, and eventually tested on more input/output pairs. Each pair consists of:

  • Input vector
  • Output value (label)

In regression, our output value is quantitative, and in classification, our output value is categorical.

ML taxonomy

26.1.2 Unsupervised Learning

In unsupervised learning, our goal is to identify patterns in unlabeled data. In this type of learning, we do not have input/output pairs. Sometimes, we may have labels but choose to ignore them (e.g. PCA on labeled data). Instead, we are more interested in the inherent structure of the data we have rather than trying to simply predict a label using that structure of data. For example, if we are interested in dimensionality reduction, we can use PCA to reduce our data to a lower dimension.

Now, let’s consider a new problem: clustering.

26.1.3 Clustering Examples

26.1.3.1 Example 1

Consider this figure from Fall 2019 Midterm 2. The original dataset had 8 dimensions, but we have used PCA to reduce our data down to 2 dimensions.

blobs

Each point represents the 1st and 2nd principal component of how much time patrons spent at 8 different zoo exhibits. Visually and intuitively, we could potentially guess that this data belongs to 3 groups: one for each cluster. The goal of clustering is now to assign each point (in the 2 dimensional PCA representation) to a cluster.

clusters_ex1

This is an unsupervised task, as:

  • We don’t have labels for each visitor.
  • We want to infer patterns even without labels.

26.1.3.2 Example 2: Netflix

Now suppose you’re Netflix and are looking at information on customer viewing habits. Clustering can come in handy here. We can assign each person or show to a “cluster.” (Note: while we don’t know for sure that Netflix actually uses ML clustering to identify these categories, they could, in principle.)

Keep in mind that with clustering, we don’t need to define clusters in advance; it discovers groups automatically. On the other hand, with classification, we have to decide labels in advance. This marks one of the key differences between clustering and classification.

26.1.3.3 Example 3: Education

Let’s say we’re working with student-generated materials and pass them into the S-BERT module to extract sentence embeddings. Features from clusters are extracted to:

  1. Detect anomalies in group activities
  2. Predict the group’s median quiz grade
outline-ex3

Here we can see the outline of the anomaly detection module. It consists of:

  • S-BERT feature extraction
  • Topic extraction
  • Feature extraction
  • 16D \(\rightarrow\) 2D PCA dimensionality reduction and 2D \(\rightarrow\) 16D reconstruction
  • Anomaly detection based on reconstruction error

Looking more closely at our clustering, we can better understand the different components, which are represented by the centers. Below we have two examples.

components

Note that the details for this example are not in scope.

26.1.3.4 Example 4: Reverse Engineering Biology

Now, consider the plot below:

genes

The rows of this plot are conditions (e.g., a row might be: “poured acid on the cells”), and the columns are genes. The green coloration indicates that the gene was “off” whereas red indicates the gene was “on”. For example, the ~9 genes in the top left corner of the plot were all turned off by the 6 experiments (rows) at the top.

In a clustering lens, we might be interested in clustering similar observations together based on the reactions (on/off) to certain experiments.

For example, here is a look at our data before and after clustering.

beforeandafter4

Note: apologies if you can’t differentiate red from green by eye! Historical visualizations are not always the best.

26.2 Taxonomy of Clustering Approaches

taxonomy

There are many types of clustering algorithms, and they all have strengths, inherent weaknesses, and different use cases. We will first focus on a partitional approach: K-Means clustering.

26.3 K-Means Clustering

The most popular clustering approach is K-Means. The algorithm itself entails the following:

  1. Pick an arbitrary \(k\), and randomly place \(k\) “centers”, each a different color.

  2. Repeat until convergence:

    1. Color points according to the closest center.
    2. Move the center for each color to the center of points with that color.

Consider the following data with an arbitrary \(k = 2\) and randomly placed “centers” denoted by the different colors (blue, orange):

init_cluster

Now, we will follow the rest of the algorithm. First, let us color each point according to the closest center:

cluster_class

Next, we will move the center for each color to the center of points with that color. Notice how the centers are generally well-centered amongst the data that shares its color.

cluster_iter1

Assume this process (re-color and re-set centers) repeats for a few more iterations. We eventually reach this state.

cluster_iter5

After this iteration, the center stays still and does not move at all. Thus, we have converged, and the clustering is complete!

26.3.0.1 A Quick Note

K-Means is a completely different algorithm than K-Nearest Neighbors. K-means is used for clustering, where each point is assigned to one of \(K\) clusters. On the other hand, K-Nearest Neighbors is used for classification (or, less often, regression), and the predicted value is typically the most common class among the \(K\)-nearest data points in the training set. The names may be similar, but there isn’t really anything in common.

26.4 Minimizing Inertia

Consider the following example where \(K = 4\):

four_cluster

Due to the randomness of where the \(K\) centers initialize/start, you will get a different output/clustering every time you run K-Means. Consider three possible K-Means outputs; the algorithm has converged, and the colors denote the final cluster they are clustered as.

random_outputs


Which clustering output is the best? To evaluate different clustering results, we need a loss function.

The two common loss functions are:

  • Inertia: Sum of squared distances from each data point to its center.
  • Distortion: Weighted sum of squared distances from each data point to its center.
inertia_distortion

In the example above:

  • Calculated inertia: \(0.47^2 + 0.19^2 + 0.34^2 + 0.25^2 + 0.58^2 + 0.36^2 + 0.44^2\)
  • Calculated distortion: \(\frac{0.47^2 + 0.19^2 + 0.34^2}{3} + \frac{0.25^2 + 0.58^2 + 0.36^2 + 0.44^2}{4}\)

Switching back to the four-cluster example at the beginning of this section, random.seed(25) had an inertia of 44.96, random.seed(29) had an inertia of 45.95, and random.seed(40) had an inertia of 54.35. It seems that the best clustering output was random.seed(25) with an inertia of 44.96!

It turns out that the function K-Means is trying to minimize is inertia, but often fails to find global optimum. Why does this happen? We can think of K-means as a pair of optimizers that take turns. The first optimizer holds center positions constant and optimizes data colors. The second optimizer holds data colors constant and optimizes center positions. Neither optimizer gets full control!

This is a hard problem: give an algorithm that optimizes inertia FOR A GIVEN \(K\); \(K\) is picked in advance. Your algorithm should return the EXACT best centers and colors, but you don’t need to worry about runtime.

Note: This is a bit of a CS61B/CS70/CS170 problem, so do not worry about completely understanding the tricky predicament we are in too much!

A potential algorithm:

  • For all possible \(k^n\) colorings:
    • Compute the \(k\) centers for that coloring.
    • Compute the inertia for the \(k\) centers.
      • If current inertia is better than best known, write down the current centers and coloring and call that the new best known.

No better algorithm has been found for solving the problem of minimizing inertia exactly.

26.5 Hierarchical Agglomerative Clustering

Now, let us consider hierarchical agglomerative clustering.

hierarchical_approach


Consider the following results of two K-Means clustering outputs:

clustering_comparison


Which clustering result do you like better? It seems K-Means likes the one on the right better because it has lower inertia (the sum of squared distances from each data point to its center), but this raises some questions:

  • Why is the inertia on the right lower? K-Means optimizes for distance, not “blobbiness”.
  • Is clustering on the right “wrong”? Good question!

Now, let us introduce Hierarchical Agglomerative Clustering! We start with every data point in a separate cluster, and we’ll keep merging the most similar pairs of data points/clusters until we have one big cluster left. This is called a bottom-up or agglomerative method.

There are various ways to decide the order of combining clusters called Linkage Criterion:

  • Single linkage (similarity of the most similar): the distance between two clusters as the minimum distance between a point in the first cluster and a point in the second.
  • Complete linkage (similarity of the least similar): the distance between two clusters as the maximum distance between a point in the first cluster and a point in the second.
  • Average linkage: average similarity of pairs of points in clusters.

The linkage criterion decides how we measure the “distance” between two clusters. Regardless of the criterion we choose, the aim is to combine the two clusters that have the minimum “distance” between them, with the distance computed as per that criterion. In the case of complete linkage, for example, that means picking the two clusters that minimize the maximum distance between a point in the first cluster and a point in the second.

linkage

When the algorithm starts, every data point is in its own cluster. In the plot below, there are 12 data points, so the algorithm starts with 12 clusters. As the clustering begins, it assesses which clusters are the closest together.

agg1

The closest clusters are 10 and 11, so they are merged together.

agg2

Next, points 0 and 4 are merged together because they are closest.

agg3

At this point, we have 10 clusters: 8 with a single point (clusters 1, 2, 3, 4, 5, 6, 7, 8, and 9) and 2 with 2 points (clusters 0 and 10).

Although clusters 0 and 3 are not the closest, let us consider if we were trying to merge them. A tricky question arises: what is the “distance” between clusters 0 and 3? We can use the Complete-Link approach that uses the max distance among all pairs of points between groups to decide which group has smaller “distance”.

agg4

Let us assume the algorithm runs a little longer, and we have reached the following state. Clusters 0 and 7 are up next, but why? The max line between any member of 0 and 6 is longer than the max line between any member of 0 and 7.

agg5

Thus, 0 and 7 are merged into 0 as they are closer under the complete linkage criterion.

After more iterations, we finally converge to the plot on the left. There are two clusters (0, 1), and the agglomerative algorithm has converged.

agg6


Notice that on the full dataset, our agglomerative clustering algorithm achieves the more “correct” output.

26.5.1 Clustering, Dendrograms, and Intuition

Agglomerative clustering is one form of “hierarchical clustering.” It is interpretable because we can keep track of when two clusters got merged (each cluster is a tree), and we can visualize the merging hierarchy, resulting in a “dendrogram.” Won’t discuss this any further for this course, but you might see these in the wild. Here are some examples:

dendro_1 dendro_2

Some professors use agglomerative clustering for grading bins; if there is a big gap between two people, draw a grading threshold there. The idea is that grade clustering should be more like the figure below on the left, not the right.

grading

26.6 Picking K

The algorithms we’ve discussed require us to pick a \(K\) before we start. But how do we pick \(K\)? Often, the best \(K\) is subjective. For example, consider the state plot below.

states

How many clusters are there here? For K-Means, one approach to determine this is to plot inertia versus many different \(K\) values. We’d pick the \(K\) in the elbow, where we get diminishing returns afterward. Note that big, complicated data often lacks an elbow, so this method is not foolproof. Here, we would likely select \(K = 2\).

elbow

26.6.1 Silhouette Scores

To evaluate how “well-clustered” a specific data point is, we can use the silhouette score, also termed the silhouette width. A high silhouette score indicates that a point is near the other points in its cluster; a low score means that it’s far from the other points in its cluster.

high_low

For a data point \(X\), score \(S\) is: \[S =\frac{B - A}{\max(A, B)}\] where \(A\) is the average distance to other points in the cluster, and \(B\) is the average distance to points in the closest cluster.

Consider what the highest possible value of \(S\) is and how that value can occur. The highest possible value of \(S\) is 1, which happens if every point in \(X\)’s cluster is right on top of \(X\); the average distance to other points in \(X\)’s cluster is \(0\), so \(A = 0\). Thus, \(S = \frac{B}{\max(0, B)} = \frac{B}{B} = 1\). Another case where \(S = 1\) could happen is if \(B\) is much greater than \(A\) (we denote this as \(B >> A\)).

Can \(S\) be negative? The answer is yes. If the average distance to X’s clustermates is larger than the distance to the closest cluster, then this is possible. For example, the “low score” point on the right of the image above has \(S = -0.13\).

26.6.2 Silhouette Plot

We can plot the silhouette scores for all of our datapoints. The x-axis represents the silhouette coefficient value or silhouette score. The y-axis tells us which cluster label the points belong to, as well as the number of points within a particular cluster. Points with large silhouette widths are deeply embedded in their cluster; the red dotted line shows the average. Below, we plot the silhouette score for our plot with \(K=2\).

dendro_1 dendro_2

Similarly, we can plot the silhouette score for the same dataset but with \(K=3\):

dendro_1 dendro_2

The average silhouette score is lower with 3 clusters, so \(K=2\) is a better choice. This aligns with our visual intuition as well.

26.6.3 Picking K: Real World Metrics

Sometimes you can rely on real-world metrics to guide your choice of \(K\). For t-shirts, we can either:

  • Cluster heights and weights of customers with \(K = 3\) to design Small, Medium, and Large shirts
  • Cluster heights and weights of customers with \(K = 5\) to design XS, S, M, L, and XL shirts

To choose \(K\), consider projected costs and sales for the 2 different \(K\)s and select the one that maximizes profit.

26.7 Conclusion

We’ve now discussed a new machine learning goal —— clustering —— and explored two solutions:

  • K-Means Clustering tries to optimize a loss function called inertia (no known algorithm to find the optimal answer in an efficient manner)
  • Hierarchical Agglomerative Clustering builds clusters bottom-up by merging clusters “close” to each other, depending on the choice of linkage.

Our version of these algorithms required a hyperparameter \(K\). There are 4 ways to pick \(K\): the elbow method, silhouette scores, and by harnessing real-world metrics.

There are many machine learning problems. Each can be addressed by many different solution techniques. Each has many metrics for evaluating success / loss. Many techniques can be used to solve different problem types. For example, linear models can be used for regression and classification.

We’ve only scratched the surface and haven’t discussed many important ideas, such as neural networks and deep learning. In the last lecture, we’ll provide some specific course recommendations on how to explore these topics further.