Lecture 25 - Clustering Part 2

Isaac Schmidt, Summer 2021

In this demo, we will use spectral clustering to try to figure out the conferences of universities in NCAA Division I, based on the schedules of their men's basketball teams.

In college basketball, there are over 350 schools, divided into 32 conferences. Teams primarily play other teams within their conference, but they also play games against non-conference opponents.

For example, UC Berkeley, known as "California" or "Cal" in athletics, will play Stanford, USC, UCLA, etc., as they are all in the Pac-12 Conference. But Cal also plays other teams, such as San Diego State and Yale.

Loading in Data

The dataset we will use contains all DI college basketball games for the 2018-19 season. The data comes from Ken Massey.

The original dataset is in fixed-width format, so we import it into a DataFrame like so:
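As a minimal sketch of reading a fixed-width file with `pandas.read_fwf`: the sample rows, team names, column names, and character offsets below are all made up for illustration and are not the layout of the actual dataset.

```python
import io
import pandas as pd

# Hypothetical rows: (date, team_a, score_a, team_b, score_b). Not real games.
rows = [
    ("20181106", "Alpha St", 71, "Beta Tech", 76),
    ("20181109", "Gamma U", 96, "Delta Col", 74),
]

# Build a fixed-width text block so the column offsets are known exactly:
# date in [0, 10), team_a in [10, 22), score_a in [22, 25),
# team_b in [27, 39), score_b in [39, 42).
sample = "\n".join(
    f"{d:<10}{a:<12}{sa:>3}  {b:<12}{sb:>3}" for d, a, sa, b, sb in rows
)

# colspecs are (start, end) character offsets with end exclusive (assumed layout)
colspecs = [(0, 10), (10, 22), (22, 25), (27, 39), (39, 42)]
names = ["date", "team_a", "score_a", "team_b", "score_b"]

games = pd.read_fwf(io.StringIO(sample), colspecs=colspecs, names=names, header=None)
print(games)
```

For a real file, the `io.StringIO(sample)` argument would be replaced by the file path, and the offsets would have to match that file's actual columns.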

This dataset naturally lends itself to a graph. The vertices will be the teams, and there will be an edge between two teams if and only if they played each other during the season. If they played each other multiple times, the edge will have a larger weight.

It is likely that the "denser" regions of this graph will correspond to the individual conferences.

First, we have to do some more data wrangling, to ensure that each row represents a unique matchup:
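One way this kind of wrangling could look, sketched on a made-up table: the column names `team_a`, `team_b`, `team_1`, `team_2`, and `times_played` are assumptions, as are the team names. The idea is to put each pair in a fixed order so that "Alpha vs. Beta" and "Beta vs. Alpha" count as the same matchup, then count the repeats.

```python
import numpy as np
import pandas as pd

# Hypothetical game log: one row per game, teams listed in arbitrary order
games = pd.DataFrame({
    "team_a": ["Alpha", "Beta", "Alpha", "Gamma"],
    "team_b": ["Beta", "Alpha", "Gamma", "Beta"],
})

# Sort the two team names within each row so every pair has one canonical order
pairs = np.sort(games[["team_a", "team_b"]].to_numpy(), axis=1)

# Count how many times each unordered pair appears
matchups = (
    pd.DataFrame(pairs, columns=["team_1", "team_2"])
    .groupby(["team_1", "team_2"])
    .size()
    .reset_index(name="times_played")
)
print(matchups)
```

The `times_played` column can then serve as the edge weight in the graph.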

Basic EDA

Let's take a look at our school, California.

California played 8 schools twice. These schools are all in the same conference as Cal, the Pac-12 Conference. There are other schools in the Pac-12 that Cal only played once, such as Utah, Oregon, and Oregon St. Finally, Cal played a few teams outside of its conference, like Fresno St, Cal Poly, Temple, etc.

We can look at all matchups. We see that there are 3987 unique matchups, and no matchup was played more than 3 times.

Conferences

To verify our results, and to make plotting easier, we will load in a dataset that tells us which conference each team is in. Remember, clustering is an unsupervised learning method, which means the clustering algorithm itself won't ever see this data, but we can use it to see if spectral clustering can "figure out" what the conferences are.

Here is information about each conference.

Creating our Graph

We will use the module networkx to create and store our graph.

The following cell loads in the teams as vertices (called nodes in the software), and loads in the matchups as edges.
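A minimal sketch of building such a weighted graph with networkx, using a handful of made-up teams and matchup counts in place of the real data:

```python
import networkx as nx

# Hypothetical teams and (team_1, team_2, times_played) matchups
teams = ["Alpha", "Beta", "Gamma", "Delta"]
matchups = [
    ("Alpha", "Beta", 2),
    ("Alpha", "Gamma", 1),
    ("Beta", "Gamma", 1),
    ("Gamma", "Delta", 1),
]

G = nx.Graph()                       # undirected: a matchup has no direction
G.add_nodes_from(teams)              # one node per team
G.add_weighted_edges_from(matchups)  # stores the count under the "weight" attribute

print(G.number_of_nodes(), G.number_of_edges())
print(G["Alpha"]["Beta"]["weight"])
```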

353 nodes and 3987 edges is far too many to display, so to take a look at our graph we will just look at schools in the state of California. Do not worry about the code.

Spectral Clustering

We know there are 32 conferences, so let's try to divide our graph into 32 clusters.

As we already have a graph, the first step is to calculate the Laplacian matrix $L$.

We can do this manually:
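As a sketch of the manual computation on a small made-up graph: with weighted adjacency matrix $A$ and diagonal degree matrix $D$ (row sums of $A$), the Laplacian is $L = D - A$. The four-node adjacency matrix below is an assumption for illustration.

```python
import numpy as np

# Hypothetical weighted adjacency matrix for 4 teams
# (entry [i, j] = number of games between team i and team j)
A = np.array([
    [0, 2, 1, 0],
    [2, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

# Degree matrix: each diagonal entry is the total edge weight at that node
D = np.diag(A.sum(axis=1))

# Graph Laplacian
L = D - A
print(L)
```

Every row of $L$ sums to zero, which is a quick sanity check on the result.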

Or, as we might expect, we can use networkx to calculate the Laplacian matrix directly:
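A sketch using `nx.laplacian_matrix` on a small made-up graph, checked against the manual $D - A$ formula. The team names and weights are assumptions. Note that the function returns a SciPy sparse object, so `.toarray()` is used to get a dense NumPy array.

```python
import networkx as nx
import numpy as np

# Hypothetical weighted graph
G = nx.Graph()
G.add_weighted_edges_from([
    ("Alpha", "Beta", 2),
    ("Alpha", "Gamma", 1),
    ("Beta", "Gamma", 1),
    ("Gamma", "Delta", 1),
])

# Fix the node order so rows and columns are predictable
nodes = ["Alpha", "Beta", "Gamma", "Delta"]
L = nx.laplacian_matrix(G, nodelist=nodes, weight="weight").toarray()

# Same matrix built by hand, for comparison
A = nx.to_numpy_array(G, nodelist=nodes, weight="weight")
L_manual = np.diag(A.sum(axis=1)) - A

print(np.allclose(L, L_manual))
```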

Now, we will use numpy to calculate the eigenvalues and eigenvectors of $L$. By default, they are returned in ascending order of eigenvalue, so we will reverse the order to match the convention in Data 100, which is descending order:
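A sketch of this step, assuming `np.linalg.eigh` is the function used (it applies to symmetric matrices such as $L$ and returns eigenvalues in ascending order). The small Laplacian below is made up for illustration.

```python
import numpy as np

# Hypothetical 4-node weighted graph and its Laplacian L = D - A
A = np.array([
    [0, 2, 1, 0],
    [2, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)
L = np.diag(A.sum(axis=1)) - A

# eigh returns eigenvalues in ascending order; column i of `eigenvectors`
# is the eigenvector for eigenvalues[i]
eigenvalues, eigenvectors = np.linalg.eigh(L)

# Reorder to descending, keeping each eigenvector paired with its eigenvalue
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

print(eigenvalues)
```

After the reordering, the eigenvectors for the smallest eigenvalues are the last columns.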

With spectral clustering, as we want $k = 32$ clusters, we will have k-means use the eigenvectors corresponding to the 32 smallest eigenvalues.

However, as plotting all 32 dimensions at once is impossible, we will just look at two at a time. First, we will plot the eigenvectors corresponding to the two smallest eigenvalues:

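Uh oh! Recall that the all-ones vector is always an eigenvector of $L$ with eigenvalue 0, so the last eigenvector is just a vector of all 1's scaled to unit length. It carries no information about the clusters. So let's look at the eigenvectors corresponding to the second and third smallest eigenvalues instead.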
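This can be checked numerically on a small made-up connected graph (the adjacency matrix below is an assumption). Each row of $L$ sums to zero, so $L\mathbf{1} = \mathbf{0}$, and the unit eigenvector for eigenvalue 0 has every entry equal to $\pm 1/\sqrt{n}$.

```python
import numpy as np

# Hypothetical connected 4-node weighted graph and its Laplacian
A = np.array([
    [0, 2, 1, 0],
    [2, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)
L = np.diag(A.sum(axis=1)) - A
n = L.shape[0]

# The all-ones vector is mapped to zero, so it is an eigenvector with eigenvalue 0
ones = np.ones(n)
print(L @ ones)

# eigh sorts ascending, so the zero eigenvalue comes first here
# (it would be last after reversing to descending order)
eigenvalues, eigenvectors = np.linalg.eigh(L)
v0 = eigenvectors[:, 0]
print(eigenvalues[0], v0)
```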

We can see that even just by looking at two dimensions, some clusters appear to be relatively well isolated. In the top right, there is a cluster of Big Sky schools. At the bottom, there is a cluster of Southern Conference schools. And the top left contains a bunch of schools that play in conferences based in the northeast US.

Let's now actually cluster these points.
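A minimal sketch of the clustering step on a toy graph, assuming scikit-learn's `KMeans` is the k-means implementation. The graph here is made up: two tightly connected groups of three nodes joined by a single weak edge, with $k = 2$ standing in for the 32 conferences.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical graph: nodes 0-2 and nodes 3-5 form two dense groups
# (weight 2 inside each group), linked by one weight-1 edge between 2 and 3
A = np.zeros((6, 6))
edges = [(0, 1, 2), (0, 2, 2), (1, 2, 2),
         (3, 4, 2), (3, 5, 2), (4, 5, 2),
         (2, 3, 1)]
for i, j, w in edges:
    A[i, j] = A[j, i] = w

L = np.diag(A.sum(axis=1)) - A

k = 2
# eigh returns ascending eigenvalues, so the first k columns are the
# eigenvectors for the k smallest eigenvalues
eigenvalues, eigenvectors = np.linalg.eigh(L)
embedding = eigenvectors[:, :k]  # one row per node, k coordinates each

# Run k-means on the rows of the embedding
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embedding)
print(labels)
```

The cluster numbers themselves are arbitrary; what matters is which nodes share a label. For the full graph, the same steps would use $k = 32$ and the 32 eigenvectors described above.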

Spectral clustering was able to accurately determine the conferences of each team!