Lecture 23: PCA

Raguvir Kunani and Isaac Schmidt, Summer 2021

Let's import our data and see what we have.

The first thing we need to do is center our data. We could standardize as well, but as each column is on roughly the same scale, we will not do so here.
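Here is a minimal sketch of those two steps. The file name `grades.csv` is a placeholder, not the actual dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical file name -- substitute the real grades dataset.
grades = pd.read_csv('grades.csv')

# Center every column by subtracting its mean.
centered = grades - grades.mean()
```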

Midterm Exam and Final Exam

Let's plot our data.
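A sketch of the scatter plot, assuming the two exam columns are named `Midterm` and `Final`:

```python
import matplotlib.pyplot as plt

plt.scatter(centered['Midterm'], centered['Final'])
plt.xlabel('Midterm')
plt.ylabel('Final')
plt.show()
```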

Let's calculate the covariance matrix for these two columns. Notice how $\frac{1}{n} X^T X$ returns the same matrix as np.cov (called with bias=True, so that np.cov also divides by $n$ rather than $n - 1$).
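A sketch of both computations, reusing the centered data from above (the column names are assumptions):

```python
# Two-column centered data matrix for the exam scores.
X = centered[['Midterm', 'Final']].to_numpy()
n = X.shape[0]

# These two matrices should match.
print(X.T @ X / n)
print(np.cov(X, rowvar=False, bias=True))
```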

Now, let's determine the eigenvalues and eigenvectors of this matrix. We'll use np.linalg.eigh, which is a faster implementation than np.linalg.eig for symmetric matrices (which covariance matrices always are).
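For instance, applied to the covariance matrix computed above:

```python
# eigh returns eigenvalues in ascending order and eigenvectors as columns.
eigenvalues, eigenvectors = np.linalg.eigh(X.T @ X / n)
print(eigenvalues)
print(eigenvectors)
```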

Now, we can plot the eigenvectors, scaled by their relative eigenvalues. Note that we've scaled up both eigenvectors by the same constant, so they are more readable on the plot.
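A sketch of one way to draw the arrows (the scaling constant is arbitrary and would need tuning for the real data):

```python
plt.scatter(X[:, 0], X[:, 1])

# Draw each eigenvector from the origin, scaled by its eigenvalue
# times a common constant so the arrows are readable.
scale = 0.01  # arbitrary; adjust for the actual data
for value, vector in zip(eigenvalues, eigenvectors.T):
    plt.arrow(0, 0, *(scale * value * vector), color='red', width=0.3)

plt.axis('equal')
plt.show()
```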

SVD for Midterm and Final

We'll use np.linalg.svd to calculate the SVD of our centered matrix $X$.
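```python
# Economy SVD: X = U S V^T, with s as a 1-D array of singular values.
# (full_matrices=False gives the economy SVD; an assumption here.)
u, s, vt = np.linalg.svd(X, full_matrices=False)
print(u.shape, s.shape, vt.shape)
```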

Looks like s is an array, not a diagonal matrix. We can correct that with the following trick:
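```python
# One way to do it: np.diag builds a diagonal matrix from the 1-D array s.
S = np.diag(s)
print(S)
```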

We said in lecture that the singular values are the square roots of the eigenvalues of $X^T X$. Let's verify that:

Also notice that the eigenvalues of $X^T X$ are equal to the eigenvalues of the covariance matrix, multiplied by $n$:
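A quick numerical check of both facts, reusing the names defined above:

```python
# Eigenvalues of X^T X, in ascending order.
eigs_xtx = np.linalg.eigh(X.T @ X)[0]

# Singular values (descending) vs. square roots of eigenvalues (ascending).
print(s)
print(np.sqrt(eigs_xtx[::-1]))

# Eigenvalues of X^T X vs. n times the covariance matrix's eigenvalues.
print(eigs_xtx)
print(n * np.linalg.eigh(np.cov(X, rowvar=False, bias=True))[0])
```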

Combined Midterm and Final Score

Let's reduce the midterm and final exam results down to a single score, as we did in lecture. To do this, we multiply our centered $X$ by the first row of $V^T$:
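```python
# Project each student's centered scores onto the first row of V^T.
pc1 = X @ vt[0]
print(pc1[:10])
```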

If we were instead to use the second row of $V^T$, our scores would not be as variable.
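We can check by comparing variances:

```python
# The second direction captures less of the spread.
pc2 = X @ vt[1]
print(np.var(pc1), np.var(pc2))
```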

We can look at the full $XV$ matrix:
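```python
# X V = U S, so either expression gives both principal components at once.
print(X @ vt.T)
print(u @ np.diag(s))
```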

PCA on our Full Dataset

Let's perform PCA on our full grades dataset.
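The recipe is the same as before, just with every (assumed numeric) column included:

```python
# Center the full dataset and take its SVD.
X = (grades - grades.mean()).to_numpy()
u, s, vt = np.linalg.svd(X, full_matrices=False)
print(vt[0])  # loadings of the first principal direction
```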

If we actually wanted to build a grading scheme based on the first row of $V^T$, we could normalize the row—this is not too far off from our class's normal grading scheme!
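A sketch of that normalization, flipping signs first since the SVD only determines each row of $V^T$ up to sign:

```python
# Rescale the first row of V^T so its entries sum to 1, giving
# weights that read like a grading scheme.
row = vt[0] * np.sign(vt[0].sum())  # make the loadings positive overall
weights = row / row.sum()
print(weights)
```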

Let's calculate the full $XV$ matrix to determine all 5 principal components. We'll call this matrix pc.
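```python
# All five principal components, one per column.
pc = X @ vt.T
print(pc.shape)
```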

Picking the number of components

We can create a scree plot to determine how many principal components to use.
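One common construction plots the fraction of total variance captured by each component, which is proportional to the squared singular values:

```python
# Fraction of total variance captured by each principal component.
variance_fractions = s**2 / np.sum(s**2)

plt.plot(np.arange(1, len(s) + 1), variance_fractions, marker='o')
plt.xlabel('Principal component')
plt.ylabel('Fraction of variance captured')
plt.show()
```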

It looks like just the first principal component captures over 60 percent of the overall variance. Based on the plot, it looks like selecting either 2 or 3 principal components would be adequate.

Here is a plot displaying the first two principal components for each student, colored by the letter grade received.

You don't have access to the letter grades, but if you did, you could run code like this to generate a similar-looking plot:

```python
import seaborn as sns

ax = sns.scatterplot(x=pc[:, 0], y=pc[:, 1], hue=grades['Grade'])
```
