26  PCA II

  • Develop a deeper understanding of how to interpret Principal Components Analysis (PCA).
  • See applications of PCA to some real-world contexts.

26.1 PCA Review

26.1.1 Using Principal Components

Steps to obtain Principal Components via SVD:

  1. Center the data matrix by subtracting the mean of each attribute column elementwise from the column.

  2. To find the p principal components:

  • Compute the SVD of the data matrix (\(X = U{\Sigma}V^{T}\))
  • The first \(p\) columns of \(U{\Sigma}\) (or equivalently, \(XV\)) contain the \(p\) principal components of \(X\).
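As a rough sketch of these two steps with NumPy (the synthetic data matrix and the choice \(p = 2\) are assumptions; `np.linalg.svd` returns \(V^{T}\) directly):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # hypothetical (N x d) data matrix
p = 2                           # number of principal components to keep

# 1. Center each attribute column.
X_centered = X - X.mean(axis=0)

# 2. Compute the SVD of the centered data matrix.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# First p principal components: the first p columns of U * Sigma ...
pcs = U[:, :p] * S[:p]
# ... or, equivalently, the first p columns of XV.
pcs_alt = X_centered @ Vt.T[:, :p]

print(np.allclose(pcs, pcs_alt))  # True
```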

The principal components are a low-dimensional representation that captures as much of the original data’s total variance as possible.

The component scores sum to the total variance if we center our data. \[\text{component score}_i = \frac{\sigma_i^{2}}{N}\] Here, \(N\) is the number of data points.

We can also use the SVD to get a rank-p approximation of \(X\), \(X_p\).

\[X_p = \sum_{j = 1}^{p} \sigma_ju_jv_j^{T} \]

where \(\sigma_j\) is the jth singular value of \(X\), \(u_j\) is the jth column of \(U\), and \(v_j\) is the jth column of \(V\).
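A minimal sketch of this sum (with an assumed synthetic, centered data matrix), together with the equivalent matrix form \(U_p \Sigma_p V_p^{T}\):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

p = 2
# Rank-p approximation: sum of the first p rank-one terms sigma_j * u_j * v_j^T.
X_p = sum(S[j] * np.outer(U[:, j], Vt[j, :]) for j in range(p))

# Equivalent, vectorized form of the same approximation.
X_p_vec = U[:, :p] @ np.diag(S[:p]) @ Vt[:p, :]
print(np.allclose(X_p, X_p_vec))  # True
```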

26.2 Case Study: House of Representatives Voting

Let’s examine how the House of Representatives (of the 116th Congress, 1st session) voted in the month of September 2019.

Specifically, we’ll look at the records of roll call votes. From the U.S. Senate (link):

  • Roll call votes occur when a representative or senator votes “yea” or “nay,” so that the names of members voting on each side are recorded. A voice vote is a vote in which those in favor or against a measure say “yea” or “nay,” respectively, without the names or tallies of members voting on each side being recorded.

Do legislators’ roll call votes show a relationship with their political party?

Please visit this link to see the full Jupyter notebook demo.

As shown in the demo, the primary goal of PCA is to transform high-dimensional observations into a low-dimensional representation through linear transformations.

A related goal of PCA is that the low-dimensional representation should capture the variability of the original data. For example, if the first two singular values are large and the others are relatively small, then two dimensions are probably enough to describe most of what distinguishes one observation from another. However, if this is not the case, then a PCA scatter plot is probably omitting a lot of information.

We can use the following formulas to quantify the amount each principal component contributes to the total variance:

\[\text{component score} = \frac{\sigma_i^{2}}{N}\]

\[\text{total variance} = \sum_{i} \text{component score}_i\]

\[\text{variance ratio of principal component } i = \frac{\text{component score}_i}{\text{total variance}}\]
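As a hedged sketch (synthetic data assumed), these quantities follow directly from the singular values:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X_centered = X - X.mean(axis=0)
N = X_centered.shape[0]

_, S, _ = np.linalg.svd(X_centered, full_matrices=False)

component_scores = S**2 / N                      # variance captured by each PC
total_variance = component_scores.sum()
variance_ratios = component_scores / total_variance

# Sanity check: the component scores sum to the total variance of the centered data.
print(np.isclose(total_variance, X_centered.var(axis=0).sum()))  # True
```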

26.3 Interpreting PCA

26.3.1 Scree Plots

A scree plot shows the diagonal values of \(\Sigma^{2}\) (the squared singular values), largest first.

Scree plots help us visually determine the number of dimensions needed to describe the data reasonably completely. The singular values that fall in the region of the plot after a large drop-off correspond to principal components that are not needed to describe the data, since they explain a relatively small proportion of the total variance.
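A minimal scree-plot sketch with matplotlib (synthetic data assumed); the “elbow” after a large drop-off suggests how many components to keep:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X_centered = X - X.mean(axis=0)

_, S, _ = np.linalg.svd(X_centered, full_matrices=False)

# Plot the squared singular values, largest first, against the PC index.
plt.plot(np.arange(1, len(S) + 1), S**2, marker="o")
plt.xlabel("Principal Component $i$")
plt.ylabel(r"$\sigma_i^2$")
plt.title("Scree Plot")
plt.show()
```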

26.3.2 PC with SVD

After finding the SVD of \(X = U\Sigma V^{T}\), we can derive the principal components of the data. Specifically, the first \(n\) rows of \(V^{T}\) are the directions for the \(n\) principal components.

26.3.3 Columns of V are the Directions


The elements of each column of \(V\) (row of \(V^{T}\)) rotate the original feature vectors into a principal component.

The first column of \(V\) indicates how each feature contributes (e.g., positively or negatively) to principal component 1.
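A small sketch of reading off these contributions (the feature names here are hypothetical); the sign and magnitude of each entry in \(V\)’s first column show how that feature contributes to PC1:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X_centered = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)

feature_names = ["feature_1", "feature_2", "feature_3"]  # hypothetical names
# The first row of V^T (first column of V) holds the direction for PC1.
for name, weight in zip(feature_names, Vt[0, :]):
    print(f"{name}: contributes {weight:+.3f} to PC1")
```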

26.3.4 Biplots

Biplots superimpose the directions onto the plot of principal component 2 vs. principal component 1.

Vector \(j\) corresponds to the direction for feature \(j\) (e.g. \((v_{1j}, v_{2j})\)).

  • There are several ways to scale biplot vectors; in this course, we plot the direction itself.
  • For other scalings, which can lead to more interpretable directions/loadings, see SAS biplots.

Through biplots, we can interpret how features correlate with the principal components shown: positively, negatively, or not much at all.
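A minimal biplot sketch with matplotlib (synthetic data assumed): the scatter shows PC2 vs. PC1, and each arrow runs from the origin to \((v_{1j}, v_{2j})\) for feature \(j\):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

pcs = U[:, :2] * S[:2]                      # PC1 and PC2 of each observation
plt.scatter(pcs[:, 0], pcs[:, 1], alpha=0.5)

# Superimpose the direction (v_1j, v_2j) of each feature j.
for j in range(X.shape[1]):
    v1, v2 = Vt[0, j], Vt[1, j]
    plt.arrow(0, 0, v1, v2, color="red", head_width=0.02)
    plt.annotate(f"feature {j}", (v1, v2))

plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
```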

[Biplot figure from the lecture slides]

The direction of each arrow is \((v_1, v_2)\), where \(v_1\) and \(v_2\) are how that specific feature column contributes to PC1 and PC2, respectively. \(v_1\) and \(v_2\) are elements of \(V\)’s first and second columns, respectively (i.e., \(V^{T}\)’s first two rows).

Say we were considering feature 3, and say that is the green arrow in the figure (pointing toward the bottom right).

  • \(v_1\) and \(v_2\) are the third elements of the respective columns in \(V\). They are what scale feature 3’s column vector in the linear transformation to PC1 and PC2, respectively.

  • Here we would infer that \(v_1\) (in the x/PC1 direction) is positive, meaning that a linear increase in feature 3 corresponds to a linear increase in PC1, so feature 3 and PC1 are positively correlated.

  • \(v_2\) (in the y/PC2 direction) is negative, meaning that a linear increase in feature 3 corresponds to a linear decrease in PC2, so feature 3 and PC2 are negatively correlated.

26.4 Applications of PCA

26.4.1 PCA in Biology

PCA is commonly used in biomedical contexts, which have many named variables!

  1. To cluster data (Paper 1, Paper 2)

  2. To identify correlated variables (interpret rows of \(V^{T}\) as linear coefficients) (Paper 3). Uses biplots.

26.4.2 Image Classification

In machine learning, PCA is often used as a preprocessing step prior to training a supervised model.

See the following demo for how PCA is useful in building an image classification model on the Fashion-MNIST dataset.
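As a hedged sketch of this workflow with scikit-learn (the built-in digits dataset and a logistic-regression classifier stand in for the Fashion-MNIST model used in the demo):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Small stand-in image dataset (8x8 digit images) in place of Fashion-MNIST.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# PCA as a preprocessing step: reduce 64 pixel features to 20 principal
# components, then fit a supervised classifier on those components.
model = make_pipeline(PCA(n_components=20), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```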

26.4.3 Why PCA, then Model?

  1. Reduces dimensionality
  • Speeds up training, reduces the number of features, etc.
  2. Avoids multicollinearity in the new features (i.e., the principal components)
