22  PCA I

Learning Outcomes
  • Introduce principles of unsupervised learning
  • Define and carry out procedure of PCA
  • Understand connection between PCA and SVD

We are now in our second pass of the data science lifecycle: we introduced a new tool to carry out exploratory data analysis (EDA)—SQL—in the last two lectures. Now we will continue and look at a tool useful for both EDA and modeling.

22.1 Dimensionality Reduction

22.1.1 Visualizing High Dimensional Data

One of the first things we do in EDA is visualization. Visualization can give us a rough idea how similar/dissimilar some the data points are. For example, let’s consider the following dataset about genes. Each row represents an individual, and each column represents a gene expression. We can visualize the first feature using a rug plot.

Gene 1 Gene 2 Gene 3 Gene 4
10 6 12 5
11 4 9 7
8 5 10 6
3 3 2.5 2
2 2.8 1.3 4
1 1 2 7

1D Visualization of Gene data

We can see clearly that there are two “clusters” of points: one with lower Gene 1 values; one with higher Gene 1 values. Furthermore, we can visualize two or more features and the clusters are still present.

2D Visualization of Gene data

3D Visualization of Gene data

Since we are all 3-dimensional beings, we can’t visualize beyond three dimensions! However, many datasets come with more than three features. What can we do?

A lot of the times in EDA, we perform dimensionality reduction to reduce our dataset to lower dimensions, so we can easily visualize it.

22.1.2 Dimensionality

What do we mean by dimensionality reduction? In particular, what is dimensionality?

Previously, we have been working with data tables with rows and columns. Now, we have to be a bit more clear with our wording to follow the language of linear algebra.

Suppose we have a dataset of:

  • \(N\) observations (data points/rows)
  • \(d\) attributes (features/columns).

Linear Algebra views our data as a matrix:

  • \(N\) row vectors in a \(d\)-dimensional space; or
  • \(d\) column vectors in an \(N\)-dimensional space.

In Linear Algebra, the dimension of a matrix is usually its column rank. In Data Science, we will mostly follow this definition, but we have slightly more things to consider.

If we have a dataset about weights, with two columns measuring the same weights in different units (lbs and kg). We will say the dimension of this data matrix is \(1\), because the column rank of this matrix is \(1\). We can visualize this data matrix and see that the data points are indeed 1-dimensional—they lie on a single line.

Weight (lbs) Weight (kg)
113.0 51.3
136.5 61.9
153.0 69.4

Visualization of Weights Data

If we add an outlier to this dataset (might be due to a measurement error), however, the rank of the matrix would be \(2\), but because of the nature of the dataset, we would still consider the dimension of this dataset \(1\).

Visualization of Weights Data with an outlier

22.1.3 Reducing dimensionality

How do we reduce the dimension of a data matrix? We usually do this by projecting the dataset onto a lower dimensional space. There are many different ways we can project. Using the Gene data we’ve seen earlier, we can project all the data points on to the feature Gene 1, or the feature Gene 2.

Projecting Gene data onto Gene 1

Projecting Gene data onto Gene 1

We don’t have to project onto a feature though. For example, we can project the points onto an arbitrary line:

Projecting Gene data onto an arbitrary line

There are so many possible projections—how do we know which one to choose?

In general, we want the projection that is the best approximation for the original data. In other words, we want the projection that capture the most variance of the original data. The method we use to get such a projection is Principal Component Analysis (PCA).

22.2 PCA: Goal and Motivation

Let’s first see what our goals are when we are performing PCA and why we want to do it in practice.

22.2.1 Capturing Total Variance

We mentioned in the last section that we want to find the projection that capture the most variance of the original data. What do we mean by “variance”?

Similar to what we’ve seen before, variability of the data is the most important aspect that we want to maintain. It is the bulk of what we care about when we conduct EDA and modeling.

The total variance for a data matrix is defined as the sum of variances of its columns. To get the total variance of a data matrix, we get the variance of each column individually, and then sum them up.

To capture the “most variance,” one approach is to simply keep the columns with the highest variance—this is similar to what we did above with the Gene dataset where we project the data onto the feature Gene 1. This will sometimes work fine, but we want to do better.

It turns out that PCA does a better job at preserving variances. In fact, PCA will give us the projection that captures the most variances, out of all possible dimensionality reductions of the data matrix.

22.2.2 Goal and Motivation

The goal of PCA is to transform observations from high-dimensional data down to low dimensions (often 2) through linear transformations of the columns. In particular, we want to find a linear transformation that creates a low-dimension representation which captures as much of the original data’s total variance as possible.

Why do we want to do this? We’ve mentioned several reasons before, but here are some more. We often perform PCA during the Exploratory Data Analysis (EDA) stage of our data science lifecycle (if we already know what to model, we probably don’t need PCA); it helps us with

  • Visually identifying clusters of similar observations in high dimensions.
  • Removing irrelevant dimensions if we suspect that the dataset is inherently low rank. For example, if the columns are collinear: there are many attributes but only a few mostly determine the rest through linear associations.
  • Finding a small basis for representing variations in complex things, e.g., images, genes.
  • Reducing the number of dimensions to make some computation cheaper.

Why do we prefer a 2-dimensional reduction? As we’ve seen, most visualizations are 2D, so it makes the most sense visually.

22.2.3 Procedural Overview

How do we carry out PCA? Below is an overview, but you are not required to understand or perform PCA this way. We will introduce an easier and more efficient way in a later section.

To do PCA, we

  1. Center the data matrix by subtracting the mean of each attribute column.

  2. Find the principal components \(v_i\) for \(i \in \{1...k\}\), which fulfills the following criteria:

    • \(v_i\) is a unit vector that linearly combines the attributes/columns of the data matrix
    • \(v_i\) gives a one-dimensional projection of the data
    • \(v_i\) is chosen to minimize the sum of squared distances between each point and its projection onto \(v_i\).
    • \(v_i\) is orthogonal to all previous principal components.

In practice, we don’t carry out this procedure. Instead, we use singular value decomposition (SVD) to find all principal components efficiently.

22.3 Singular Value Decomposition

Before getting into Singular Value Decomposition, we need a refresher on linear algebra. In particular, how we can think of matrices as linear transformations.

22.3.1 Matrix as Transformation

Usually when we think about matrix multiplication, we think of it as taking dot product between the rows of the left matrix and the columns of the right matrix.

However, another more linear algebra centric perspective is the columns and transformations view. We’ve briefly used this perspective in the Ordinary Least Squares lecture, but it’s worth mentioning again here. Below is a simple matrix multiplication:

\[ \begin{bmatrix} {2}&{2}&{2}\\ {5}&{8}&{0}\\ \end{bmatrix}\begin{bmatrix} {2}&{1}\\ {1}&{1}\\ {4}&{1}\\ \end{bmatrix}=\begin{bmatrix} {14}&{6}\\ {18}&{13}\\ \end{bmatrix} \]

We can view the operation as taking each column of the left matrix, weight by the corresponding entry in the right matrix:

\[ \begin{bmatrix} {2}&{2}&{2}\\ {5}&{8}&{0}\\ \end{bmatrix}\begin{bmatrix} {2}\\ {1}\\ {4}\\ \end{bmatrix}=2\begin{bmatrix} {2}\\ {5}\\ \end{bmatrix}+1\begin{bmatrix} {2}\\ {8}\\ \end{bmatrix}+4\begin{bmatrix} {2}\\ {0}\\ \end{bmatrix}=\begin{bmatrix} {14}\\ {18}\\ \end{bmatrix} \]

We can think of the left matrix as the original data matrix, and the left matrix as representing a linear transformation. The columns of the left matrix are being transformed according to the rules specified by the right matrix. In this particular example, the right matrix transforms the original data matrix from 3-dimension to 2-dimension.

Understanding this perspective of matrix multiplication is crucial, because PCA does a similar thing: we transform the data matrix by right-multiplying it with a matrix to get the principal components.

Higher level view of PCA

Singular Value Decomposition will help us find this particular transformation matrix.

22.3.2 Orthonormality and Diagonal Matrices

To understand SVD, we need two more concepts from linear algebra: orthonormality and diagonal matrices.

Orthonormal is a portmanteau of two words: orthogonal and normal. The columns of a matrix is said to be orthonormal if

  • All the columns are orthogonal to each other. In other words, the dot product between any two columns is 0.
  • All the columns are unit vectors. In other words, the length each column is 1.

Orthonormal matrices have what’s called orthonormal inverses. If an \(m \times n\) matrix \(Q\) has orthonormal columns, then \(QQ^{\top} = I_m\) and \(Q^{\top}Q = I_n\).

Usually, the linear transformation represented by an orthonormal matrix is a rotation (and sometimes a reflection) of the coordinate system.

Orthonormal matrices as a rotation

Diagonal matrices are square matrices with non-zero values on the diagonal axis and zero everywhere else. Right-multiplied diagonal matrices scale each column up or down by a constant factor.

\[ \begin{bmatrix} {\mid}&{\mid}&{\mid}\\ {\vec{c}_{1}}&{\vec{c}_{2}}&{\vec{c}_{3}}\\ {\mid}&{\mid}&{\mid}\\ \end{bmatrix}\begin{bmatrix} {a_{1}}&{0}&{0}\\ {0}&{a_{2}}&{0}\\ {0}&{0}&{a_{3}}\\ \end{bmatrix}=\begin{bmatrix} {\mid}&{\mid}&{\mid}\\ {a_{1}\vec{c}_{1}}&{a_{2}\vec{c}_{2}}&{a_{3}\vec{c}_{3}}\\ {\mid}&{\mid}&{\mid}\\ \end{bmatrix} \]

Geometrically, the linear transformation represented by a diagonal matrix can be viewed as a scaling of the coordinate system.

Diagonal matrices as a scale

22.3.3 SVD

Singular value decomposition (SVD) is an important concept in linear algebra. We will not go much into the theory and details of SVD. Instead, we will only cover what is needed for a data science interpretation.

Singular value decomposition (SVD) describes a matrix decomposition into three matrices. For an \(n \times d\) data matrix \(X\) with rank \(r\), the SVD says:

\[ \Large X = U\Sigma V^{\top} \]

The first matrix \(U\) has the same shape as \(X\): \(n \times d\) and has orthonormal columns. Columns of \(U\) are called the left singular vectors.

\[ \begin{bmatrix} {\mid}&{\mid}&{}&{\mid}&{\mid}&{}&{\mid}\\ {\vec{u}_{1}}&{\vec{u}_{2}}&{\dots}&{\vec{u}_{r}}&{\vec{u}_{r+1}}&{\dots}&{\vec{u}_{d}}\\ {\mid}&{\mid}&{}&{\mid}&{\mid}&{}&{\mid}\\ \end{bmatrix} \]

Since \(U\) is orthonormal, it has an orthonormal inverse: \(UU^{\top} = I_{n}\) and \(U^{\top}U = I_d\). We can also view it as a rotation.

The second matrix \(\Sigma\) is a \(d \times d\) diagonal matrix. Its diagonal contains singular values, ordered from greatest to least. If the rank of the matrix \(X\) is \(r\), there will be \(r\) non-zero singular values.

\[ \begin{bmatrix} {\sigma_{1}}&{0}&{\dots}&{0}&{0}&{\dots}&{0}\\ {0}&{\sigma_{2}}&{\dots}&{0}&{0}&{\dots}&{0}\\ {\vdots}&{\vdots}&{\ddots}&{\vdots}&{\vdots}&{\ddots}&{\vdots}\\ {0}&{0}&{\dots}&{\sigma_{r}}&{0}&{\dots}&{0}\\ {0}&{0}&{\dots}&{0}&{0}&{\dots}&{0}\\ {\vdots}&{\vdots}&{\ddots}&{\vdots}&{\vdots}&{\ddots}&{\vdots}\\ {0}&{0}&{\dots}&{0}&{0}&{\dots}&{0}\\ \end{bmatrix} \]

The third matrix \(V^{\top}\) is a \(d \times d\) matrix with orthonormal rows. We call it “transpose” because we are interested in its rows rather than columns. In other words, the rows of \(V^{\top}\), or the columns of \(V\), are orthonormal. The columns of \(V\) are called the right singular vectors.

\[ \begin{bmatrix} {-}&{\vec{v}_{1}^{\top}}&{-}\\ {-}&{\vec{v}_{2}^{\top}}&{-}\\ {}&{\dots}&{}\\ {-}&{\vec{v}_{d}^{\top}}&{-}\\ \end{bmatrix} \]

Similar to \(U\), \(V\) is orthonormal, so it has an orthonormal inverse: \(VV^{\top} = V^{\top}V = I_{d}\). We can also view it as a rotation.

22.3.4 SVD in NumPy

In NumPy, this decomposition algorithm is already written and can be called with np.linalg.svd. There are many versions of SVD out in the wild, and we will need to set the parameter full_matrices to False to get our version.

22.4 PCA with SVD

Now that we have everything we need to carry out our PCA using the tool SVD.

In a previous section, we mentioned PCA gives us the principal components by linearly transforming the original data matrix. Where does that linear transformation come from? SVD!

Recall that SVD says \(X = U\Sigma V^{\top}\). If we right-multiply both sides by \(V\), we will get

\[ \Large XV = U\Sigma V^{\top}V = U\Sigma, \]

since \(V^{\top}V = I_d\). Therefore, this \(V\) matrix is the transformation we are looking for, and \(U\Sigma\) will get us the principal components!

To get the first \(k\) principal components, we can simply take the the first \(k\) columns of the matrix \(U\Sigma\), or equivalently the matrix \(XV\).

Alternatively, to get the first principal component (often called PC1), we can also take the first left singular vector (i.e. the first column of \(U\)) and scale it by the first singular value (i.e. the first diagonal entry in \(\Sigma\)).

We can also take the data matrix \(X\) and right-multiply it by the first right singular vector (i.e. the first column of \(V\)) to get PC1.

22.4.1 Summary

We finally know how to perform PCA with Singular Value Decomposition. Let’s summarize the procedure. Given an \(n \times d\) data matrix \(X\), we get the first \(k\) principal components using the following procedure:

  1. Center \(X\) by subtracting the mean of each column
  2. Get the Singular Value Decomposition of the centered \(X\): \(U\), \(\Sigma\), and \(V^{\top}\)
  3. Multiply either \(U\Sigma\) or \(XV\)
  4. Take the first \(k\) columns of \(U\Sigma\) or \(XV\). These are the first \(k\) principal components of \(X\).

22.4.2 PCA in NumPy

Let’s see how we can carry out this process using Python! We will load in a dataset containing measurements of rectangles: width, height, area, and perimeter.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(23) #kallisti
rectangle = pd.read_csv("data/rectangle_data.csv")
width height area perimeter
0 8 6 48 28
1 2 4 8 12
2 1 3 3 8
3 9 3 27 24
4 9 8 72 34

We can see this data matrix has rank \(3\), because perimeter is a linear combination of width and height. While we can also get area using width and height, the operation is not linear.

Now let’s carry out PCA. Step 1: Center the data matrix \(X\)

centered_df = rectangle - np.mean(rectangle, axis = 0)
width height area perimeter
0 2.97 1.35 24.78 8.64
1 -3.03 -0.65 -15.22 -7.36
2 -4.03 -1.65 -20.22 -11.36
3 3.97 -1.65 3.78 4.64
4 3.97 3.35 48.78 14.64 Step 2: Get the SVD of the centered \(X\)

Note, we will need to set the full_matrices argument to False.

U, S, Vt = np.linalg.svd(centered_df, full_matrices = False)

Let’s inspect these three matrices first.

The first 5 rows of \(U\):

0 1 2 3
0 -0.133910 0.005930 0.034734 -0.296836
1 0.086354 -0.079515 0.014948 0.711478
2 0.117766 -0.128963 0.085774 -0.065318
3 -0.027274 0.183177 0.010895 -0.031055
4 -0.258806 -0.094295 0.090270 -0.032818

\(\Sigma\) is returned as an array, because most of the values in the diagonal matrices are zero. NumPy only returns the singular values to save memory.

We can see there are only three non-zero singular values, confirming our previous judgment that the rank of \(X\) is 3.

array([197.,  27.,  23.,   0.])

To get \(\Sigma\) into its matrix form, we can use np.diag:

pd.DataFrame(np.diag(np.round(S, 1)))
0 1 2 3
0 197.4 0.0 0.0 0.0
1 0.0 27.4 0.0 0.0
2 0.0 0.0 23.3 0.0
3 0.0 0.0 0.0 0.0

Finally, \(V^{\top}\):

0 1 2 3
0 -0.098631 -0.072956 -0.931226 -0.343173
1 0.668460 -0.374186 -0.258375 0.588548
2 0.314625 -0.640483 0.257023 -0.651715
3 0.666667 0.666667 0.000000 -0.333333 Step 3: Multiply either \(U\Sigma\) or \(XV\)

To carry out matrix multiplication in NumPy, we use the @ symbol.

Let’s try \(U\Sigma\) first

pd.DataFrame(U @ np.diag(S)).head(5)
0 1 2 3
0 -26.432217 0.162686 0.807998 -2.738093e-15
1 17.045285 -2.181451 0.347732 6.562857e-15
2 23.245695 -3.538040 1.995334 -6.025133e-16
3 -5.383546 5.025395 0.253448 -2.864630e-16
4 -51.085217 -2.586948 2.099919 -3.027184e-16

Then \(XV\):

pd.DataFrame(centered_df @ Vt.T).head(5)
0 1 2 3
0 -26.432217 0.162686 0.807998 -2.978358e-15
1 17.045285 -2.181451 0.347732 1.462534e-15
2 23.245695 -3.538040 1.995334 2.350712e-15
3 -5.383546 5.025395 0.253448 -1.868135e-15
4 -51.085217 -2.586948 2.099919 -4.088581e-15

We can see these are exactly the same. Step 4: Get the first \(k\) columns

Let’s get the first two principal components of \(X\).

two_PCs = (U @ np.diag(S))[:, :2]
0 1
0 -26.432217 0.162686
1 17.045285 -2.181451
2 23.245695 -3.538040
3 -5.383546 5.025395
4 -51.085217 -2.586948

We can get the same thing using \(XV\):

(centered_df @ Vt.T).iloc[:, :2].head(5)
0 1
0 -26.432217 0.162686
1 17.045285 -2.181451
2 23.245695 -3.538040
3 -5.383546 5.025395
4 -51.085217 -2.586948

22.5 Data Variance and Centering

22.5.1 Variance with Singular Values

We said earlier that PCA gives us a projection that captures the most variance. How do we know how much variance is captured?

It turns out the singular values tell us this!

Formally, the \(i\)-th singular value tells us the component score—how much of the data variance is captured by the \(i\)th principal component. Suppose the number of data points is \(n\):

\[i\text{-th component score} = \frac{(i\text{-th singular value})^2}{n}\]

Summing up the component scores is equivalent to computing the total variance.

22.5.2 Centering

You might have noticed that we always center our data matrix \(X\) before performing PCA. We do this because only when we center the matrix will we have this nice interpretation of the singular values. In other words, if we don’t center the data matrix, singular values will not convert nicely to the component scores/variances captured by each principal component.

The formal proof of this is out of scope, but if you are interested, take a look at the next section.

22.5.3 (Bonus) Proof of component score

The proof defining component score out of scope for this class, but it is included below for your convenience.

Setup: Consider the design matrix \(X \in \mathbb{R}^{n \times d}\), where the \(j\)-th column (corresponding to the \(j\)-th feature) is \(x_j \in \mathbb{R}^n\) and the element in row \(i\), column \(j\) is \(x_{ij}\). Further define \(\tilde{X}\) as the centered design matrix. The \(j\)-th column is \(\tilde{x}_j \in \mathbb{R}^n\) and the element in row \(i\), column \(j\) is \(\tilde{x}_{ij} = x_{ij} - \bar{x_j}\), where \(\bar{x_j}\) is the mean of the \(x_j\) column vector from the original \(X\).

Variance: Construct the covariance matrix: \(\frac{1}{n} \tilde{X}^T \tilde{X} \in \mathbb{R}^{d \times d}\). The \(j\)-th element along the diagonal is the variance of the \(j\)-th column of the original design matrix \(X\):

\[\left( \frac{1}{n} \tilde{X}^T \tilde{X} \right)_{jj} = \frac{1}{n} \tilde{x}_j ^T \tilde{x}_j = \frac{1}{n} \sum_{i=i}^n (\tilde{x}_{ij} )^2 = \frac{1}{n} \sum_{i=i}^n (x_{ij} - \bar{x_j})^2\]

SVD: Suppose singular value decomposition of the centered design matrix \(\tilde{X}\) yields \(\tilde{X} = U \Sigma V^T\), where \(U \in \mathbb{R}^{n \times d}\) and \(V \in \mathbb{R}^{d \times d}\) are matrices with orthonormal columns, and \(\Sigma \in \mathbb{R}^{d \times d}\) is a diagonal matrix with singular values of \(\tilde{X}\).

\[\begin{aligned} \tilde{X}^T \tilde{X} &= (U \Sigma V^T )^T (U \Sigma V^T) \\ &= V \Sigma U^T U \Sigma V^T & (\Sigma^T = \Sigma) \\ &= V \Sigma^2 V^T & (U^T U = I) \\ \frac{1}{n} \tilde{X}^T \tilde{X} &= \frac{1}{n} V \Sigma V^T =V \left( \frac{1}{n} \Sigma \right) V^T \\ \frac{1}{n} \tilde{X}^T \tilde{X} V &= V \left( \frac{1}{n} \Sigma \right) V^T V = V \left( \frac{1}{n} \Sigma \right) & \text{(right multiply by }V \rightarrow V^T V = I \text{)} \\ V^T \frac{1}{n} \tilde{X}^T \tilde{X} V &= V^T V \left( \frac{1}{n} \Sigma \right) = \frac{1}{n} \Sigma & \text{(left multiply by }V^T \rightarrow V^T V = I \text{)} \\ \left( \frac{1}{n} \tilde{X}^T \tilde{X} \right)_{jj} &= \frac{1}{n}\sigma_j^2 & \text{(Define }\sigma_j\text{ as the} j\text{-th singular value)} \\ \frac{1}{n} \sigma_j^2 &= \frac{1}{n} \sum_{i=i}^n (x_{ij} - \bar{x_j})^2 \end{aligned}\]

The last line defines the \(j\)-th component score.