23  PCA II

Learning Outcomes
  • Develop a deeper understanding of how to interpret Principal Components Analysis (PCA).
  • See how we use PCA for EDA and feature generation.

In this second lecture on PCA, we will talk more about the principal components that we got last time: what do they represent? How can we interpret them? How do we use PCA to preform Exploratory Data Analysis (EDA)?

23.1 PCA Review

23.1.1 Recap from Last Time

Recall the steps we take to obtain Principal Components via Singular Value Decomposition (SVD):

  1. Center the data matrix by subtracting the mean of each attribute column

  2. To find the first \(k\) principal components:

  • Compute the SVD of the data matrix (\(X = U{\Sigma}V^{T}\))
  • The first \(k\) columns of \(U{\Sigma}\) (or equivalently, \(XV\)) contain the first \(k\) principal components of \(X\).

The principal components are a low-dimension representation that capture as much of the original data’s total variance as possible.

Component scores measure how much variance each principal component captures. If the the total number of rows of the data matrix \(X\) is \(n\), \[\text{component score } i = \frac{\sigma_i^{2}}{n}\]

They sum to the total variance if we center our data.

We can also use the SVD to get a rank-\(k\) approximation of \(X\), \(X_k\).

\[X_k = \sum_{j = 1}^{k} \sigma_ju_jv_j^{T} \]

where \(\sigma_j\) is the \(j\)th singular value, \(u_j\) is the \(j\)th column of \(U\), and \(v_j\) is the \(j\)th column of \(V\).

23.1.2 PCA is a Linear Transformation

Essentially, we can think of PCA as a linear transformation, performed by the matrix \(V\). In particular, PCA centers the data matrix \(X\), then rotates it such that the direction with the most variation (i.e. the direction that is the most spread-out) is aligned with the \(x\)-axis.

Below is an illustration of how PCA centers and rotates a 2D dataset.

PCA as centering and rotation

From the final plot on the right, we can see the principal components plotted on the two axes, with Principal Component 1 (PC1) on the \(x\)-axis and Principal Component 2 (PC2) on the \(y\)-axis. We can see that principal components are orthogonal to each other and axis-aligned.

Another perspective to look at the principal component is through matrix multiplication.

Recall that to get the first principal component, we can get the first column of \(XV\). This is the same as taking the matrix \(X\) and multiply it with the first column of \(V\):

\[ \text{PC1} = v_{11} \cdot \begin{bmatrix} | \\ \vdots \\ x_1 \\ \vdots \\ | \end{bmatrix} + \cdots + v_{j1} \cdot \begin{bmatrix} | \\ \vdots \\ x_j \\ \vdots \\ | \end{bmatrix} + \cdots + v_{d1} \cdot \begin{bmatrix} | \\ \vdots \\ x_d \\ \vdots \\ | \end{bmatrix} \]

This allows us to see that principal components are in fact linear combinations of the columns of \(X\). The first column of \(V\) indicates how each column/feature contributes to PC1. For example, if \(v_{11}\) is big, then we know the first feature is important in constructing PC1, or that the first feature is important in discerning the data points from each other.

As a summary, we saw that

  • Principal components are all orthogonal to each other
  • Principal Components are axis-aligned. If we plot two PCs on a 2D plane, one will lie on the x-axis, the other on the y-axis
  • Principal Components are linear combinations of columns in our data X

23.2 Using PCA for EDA

Let’s see how PCA can help us with our EDA!

For this, we will consider a dataset from the United States House of Representatives voted in the month of September 2019.

Specifically, we’ll look at the records of Roll call votes. For each roll call vote, members of the house will have three options from consenting (“Yea”), dissenting (“Nay”), or not presenting (“Not Voting”).

The question we potentially want to ask is “do legislators’ roll call votes show a relationship with their political party?”

The data, compiled from ProPublica source, is a “skinny” table of data where each record is a single vote by a member across any roll call in the 116th Congress, 1st session, as downloaded in February 2020. The member of the House, whom we’ll call legislator, is denoted by their bioguide alphanumeric ID in http://bioguide.congress.gov/.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import yaml
from datetime import datetime
from ds100_utils import *

np.random.seed(23) #kallisti
plt.rcParams['figure.dpi'] = 150

# February 2019 House of Representatives roll call votes
# Downloaded using https://github.com/eyeseast/propublica-congress
votes = pd.read_csv('data/votes.csv')
votes = votes.astype({"roll call": str}) 
chamber session roll call member vote
0 House 1 555 A000374 Not Voting
1 House 1 555 A000370 Yes
2 House 1 555 A000055 No
3 House 1 555 A000371 Yes
4 House 1 555 A000372 No

To make the dataset easier to work with, we will pivot this table to group each legislator and their votes across every (roll call) vote in this month. We mark 1 if the legislator voted Yes (yea), and 0 otherwise (No/nay, no vote, speaker, etc.).

def was_yes(s):
    return 1 if s.iloc[0] == "Yes" else 0
vote_pivot = votes.pivot_table(index='member', 
                                columns='roll call', 
(441, 41)
roll call 515 516 517 518 519 520 521 522 523 524 ... 546 547 548 549 550 551 552 553 554 555
A000055 1 0 0 0 1 1 0 1 1 1 ... 0 0 1 0 0 1 0 0 1 0
A000367 0 0 0 0 0 0 0 0 0 0 ... 0 1 1 1 1 0 1 1 0 1
A000369 1 1 0 0 1 1 0 1 1 1 ... 0 0 1 0 0 1 0 0 1 0
A000370 1 1 1 1 1 0 1 0 0 0 ... 1 1 1 1 1 0 1 1 1 1
A000371 1 1 1 1 1 0 1 0 0 0 ... 1 1 1 1 1 0 1 1 1 1

5 rows × 41 columns

We can see our data has 441 rows (each representing a legislator) and 41 columns (each representing a roll call vote). So this data is relatively high-dimensional.

How do we analyze this data?

While we could consider loading information about the legislator, such as their party, and see how this relates to their voting pattern, it turns out that we can do a lot with PCA to cluster legislators by how they vote. Let’s do it!

As before, we first center the data.

vote_pivot_centered = vote_pivot - np.mean(vote_pivot, axis = 0)
roll call 515 516 517 518 519 520 521 522 523 524 ... 546 547 548 549 550 551 552 553 554 555
A000055 0.129252 -0.668934 -0.526077 -0.52381 0.049887 0.587302 -0.562358 0.634921 0.594104 0.560091 ... -0.521542 -0.526077 0.045351 -0.521542 -0.519274 0.54195 -0.521542 -0.535147 0.086168 -0.503401
A000367 -0.870748 -0.668934 -0.526077 -0.52381 -0.950113 -0.412698 -0.562358 -0.365079 -0.405896 -0.439909 ... -0.521542 0.473923 0.045351 0.478458 0.480726 -0.45805 0.478458 0.464853 -0.913832 0.496599
A000369 0.129252 0.331066 -0.526077 -0.52381 0.049887 0.587302 -0.562358 0.634921 0.594104 0.560091 ... -0.521542 -0.526077 0.045351 -0.521542 -0.519274 0.54195 -0.521542 -0.535147 0.086168 -0.503401
A000370 0.129252 0.331066 0.473923 0.47619 0.049887 -0.412698 0.437642 -0.365079 -0.405896 -0.439909 ... 0.478458 0.473923 0.045351 0.478458 0.480726 -0.45805 0.478458 0.464853 0.086168 0.496599
A000371 0.129252 0.331066 0.473923 0.47619 0.049887 -0.412698 0.437642 -0.365079 -0.405896 -0.439909 ... 0.478458 0.473923 0.045351 0.478458 0.480726 -0.45805 0.478458 0.464853 0.086168 0.496599

5 rows × 41 columns

We then compute the SVD using this centered data matrix.

u, s, vt = np.linalg.svd(vote_pivot_centered, full_matrices = False)

23.2.1 PCA Plot

We can get the principal components by calculating either \(U\Sigma\) or \(XV\).

Then we can try to plot the first two principal components using a scatter plot, with PC1 on the \(x\)-axis, PC2 on the \(y\)-axis. This is often called a PCA plot.

PCA plots allow us to visually assess similarities between our data points and if there are any clusters in our dataset.

pcs = u * s
sns.scatterplot(x=pcs[:, 0], y=pcs[:, 1]);

But how do we know if our plot is valid? We are projecting from a 41-dimensional space to a 2-dimensional one; how do we know whether or not we are overlooking important details of the dataset?

Remember that our singular values are tied to variances. If the first two singular values are large and all others are small, then two dimensions are enough to describe most of what distinguishes one observation from another. If not, then a PCA plot is omitting a lot of information.

23.2.2 Scree Plot

In practice, we will calculate something called the variance ratio of each principal component. It is defined as the fraction of total variance captured by each principal component.

To get the variance ratio of principal component \(i\), we calculate

\[ \text{variance ratio} = \frac{\text{component score } i}{\text{total variance}} = \frac{\sigma_i^2/n}{\sum_{j=1}^{r}\sigma_j/n} = \frac{\sigma^2}{\sum_{j=1}^{r}\sigma_j}. \]

We can implement this in Python.

np.round(s**2 / sum(s**2), 2)
array([0.8 , 0.05, 0.02, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01,
       0.01, 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ,
       0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ,
       0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ])

Indeed, we can see the first two principal components capture \(85\%\) of the total variance, so we should be fine with our PCA plot.

In fact, we can visualize this information, using what’s called a Scree plot.

A scree plot (and where its “elbow” of the plot is located) is a visual way of checking the distribution of captured variance.

plt.plot(s**2 / sum(s**2), marker='.');
plt.xlabel("Principal Component $i$");
plt.ylabel("Variance Ratio");

Sometimes, instead of variance ratio, people can plot the actual variance/component score of each principal component. That is also a valid scree plot.

Now let’s continue analyzing the PCA plot from earlier. We can see that there are two “clusters” of points: what’s so different between those two clusters? Let’s check.

We’ll load in the information about the legislators’ political parties. Now we have a new column in the original dataset about which party (Democratic or Republican) the member is affiliated with.

base_url = 'https://github.com/unitedstates/congress-legislators'
legislators_path = 'legislators-current.yaml'
f = fetch_and_cache(base_url + legislators_path, legislators_path)
legislators_data = yaml.safe_load(open('data/legislators-current.yaml'))

def to_date(s):
    return datetime.strptime(s, '%Y-%m-%d')

legs = pd.DataFrame(
    columns=['leg_id', 'first', 'last', 'gender', 'state', 'chamber', 'party', 'birthday'],
           to_date(x['bio']['birthday'])] for x in legislators_data])


vote2d = pd.DataFrame({
    'member': vote_pivot.index,
    'pc1': pcs[:, 0],
    'pc2': pcs[:, 1]
}).merge(legs, left_on='member', right_on='leg_id')
Using cached version that was downloaded (UTC): Thu Aug  3 22:16:30 2023

First, let’s the party affiliations of those who are in the upper left corner of the PCA plot. In other words, we take all members with negative PC1 values.

vote2d[vote2d['pc1'] < 0]['party'].value_counts()
Democrat    231
Name: count, dtype: int64

All of them are Democrats! Let’s keep going.

Now I’m only getting the upper right corner—all members with PC2 value greater than -2 and PC1 value greater than 0.

vote2d[(vote2d["pc2"] > -2) & (vote2d["pc1"] > 0)]['party'].value_counts()
Republican    194
Name: count, dtype: int64

All of them are Republicans. Let’s use these party labels to color our principal components.

cp = sns.color_palette()
party_cp = [cp[i] for i in [0, 3, 1]]
party_hue = ["Democrat", "Republican", "Independent"]
sns.scatterplot(x="pc1", y="pc2",
                hue="party", palette=party_cp,  hue_order=party_hue,
                data = vote2d);

23.2.3 Biplots

Recall the values in \(V\) encodes information about how each feature in our original data matrix contributes to the principal components. Let’s incorporate that into our visualization!

Biplots superimpose the directions onto the plot of principal component 2 vs. principal component 1.

Vector \(j\) corresponds to the direction for feature \(j\) (e.g. \(v_1j, v_2j\)). - There are several ways to scale biplot vectors; in this course we plot the direction itself. - For other scaling, which can lead to more interpretable directions/loadings, check out SAS biplots

Through biplots, we can interpret how features correlate with the principal components shown: positively, negatively, or not much at all.

import random

roll_calls = sorted([517, 520, 526, 527, 555, 553]) # features to plot on biplot
# roll_calls = [515, 516, 517, 520, 526, 527, 553]

plt.figure(figsize = (7, 7))
# first plot each datapoint in terms of the first two principal components
sns.scatterplot(x="pc1", y="pc2",
                hue="party", palette=party_cp,  hue_order=party_hue,
                data = vote2d);

# next, plot the loadings for PC1 and PC2
cp = sns.color_palette()[1:] # skip blue

directions_df= pd.DataFrame(data=vt[:2,:].T, index=vote_pivot_centered.columns, columns=["dir1", "dir2"])
dir1, dir2 = directions_df["dir1"], directions_df["dir2"]
for i, feature in enumerate(roll_calls):
    feature = str(feature)
    plt.arrow(0, 0,
              dir1.loc[feature], dir2.loc[feature],
              head_width=0.2, head_length=0.2, color=cp[i], label=feature)
plt.legend(bbox_to_anchor=(1.1, 1.05));

The directions of the arrow are (\(v_1\), \(v_2\)) where \(v_1\) and \(v_2\) are how that specific feature column contributes to PC1 and PC2, respectively. \(v_1\) and \(v_2\) are elements of \(V\)’s first and second columns, respectively (i.e., \(V^{\top}\)’s first two rows).

As a summary for this demo, we saw three visualizations we can create using the results from PCA:

  • PCA plots. These are scatter plots of PC1 against PC2. They help us assess similarities between our data points and if there are any clusters in our dataset.
  • Scree plots. These are line plots showing the variance ratio captured by each principal component, largest first. If first two is large enough, we know PCA plot is good representation of data.
  • Biplots. These are PCA plots superimposed by the directions onto the plot of principal component 2 vs. principal component 1. They show how much some features contribute to PC1 and PC2.

23.3 Applications of PCA

There are many more applications of PCA. In general, we use PCA mainly for the following reasons:

  • Visually identifying clusters of similar observations in high dimensions.
  • Removing irrelevant dimensions if we suspect that the dataset is inherently low rank. For example, if the columns are collinear: there are many attributes but only a few mostly determine the rest through linear associations.
  • Finding a small basis for representing variations in complex things, e.g., images, genes.
  • Reducing the number of dimensions to make some computation cheaper.

We’ve seen the first two in action in our previous demo. In some more complex datasets like images, we can leverage PCA to simplify our dataset.

In lecture, we walked through an example using PCA on the Fashion-MNIST dataset. Feel free to check out the demo notebook if you are interested.

23.3.1 PCA with sklearn

To conclude, let’s see how we can perform PCA more easily, with sklearn.

Like other models/algorithms in sklearn we’ve worked with before, we can create a PCA object. When initiating the object, we need to specify the number of principal components we want to get back. We’ll use 2 here.

from sklearn.decomposition import PCA

pca = PCA(n_components=2)

We can fit the model using .fit and our data matrix:

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

To get the principal components, we need to call .transform:

principal_components = pca.transform(vote_pivot_centered)

Moreover, we can get the variance ratios by obtaining the explained_variance_ratio_ model attribute:

array([0.80299948, 0.05260076])

This confirms our previous calculation.