25  PCA II

Learning Outcomes
  • Dissect Singular Value Decomposition (SVD) and use it to calculate principal components
  • Develop a deeper understanding of how to interpret Principal Components Analysis (PCA)
  • See applications of PCA in some real-world contexts

25.1 Singular Value Decomposition (SVD)

Singular value decomposition (SVD) is an important concept in linear algebra. Since this class requires a linear algebra course (MATH 54, MATH 56, or EECS 16A) as a pre/co-requisite, we assume you are familiar with the basics, so we won’t explain SVD in its entirety. In particular, we will go over:

  • Why SVD is a valid decomposition of rectangular matrices
  • Why PCA is an application of SVD

We will not dive deep into the theory and details of SVD. Instead, we will only cover what is needed for a data science interpretation. If you’d like more information, check out EECS 16B Note 14 or EECS 16B Note 15.

[Linear Algebra Review] Orthonormality

Orthonormal is a combination of two words: orthogonal and normal.

When we say the columns of a matrix are orthonormal, we know that:

  1. The columns are all orthogonal to each other (all pairs of columns have a dot product of zero)
  2. All columns are unit vectors (the length of each column vector is 1)

Orthonormal matrices have a few important properties:

  • Orthonormal inverse: If a square matrix \(Q\) has orthonormal columns, then \(Q^{-1} = Q^T\), so \(QQ^T = Q^TQ = I\). More generally, if \(Q\) is \(m \times n\) with \(m > n\) and has orthonormal columns, then \(Q^TQ = I_n\), but \(QQ^T \neq I_m\).
  • Rotation of coordinates: The linear transformation represented by an orthonormal matrix is often a rotation (and less often a reflection). We can imagine columns of the matrix as where the unit vectors of the original space will land.
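To see these properties concretely, here is a small sketch (the matrix here is arbitrary; we use np.linalg.qr only as a convenient way to produce orthonormal columns):

import numpy as np

rng = np.random.default_rng(0)

# QR factorization of a random 5 x 3 matrix gives Q with orthonormal columns.
Q, _ = np.linalg.qr(rng.normal(size=(5, 3)))

# Columns are pairwise orthogonal and unit length, so Q^T Q = I_3.
print(np.allclose(Q.T @ Q, np.eye(3)))  # True

# Q Q^T is only the identity when Q is square; here it is a rank-3 projection matrix.
print(np.allclose(Q @ Q.T, np.eye(5)))  # False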
[Linear Algebra Review] Diagonal Matrices

Diagonal matrices are square matrices whose entries are all zero except (possibly) those on the main diagonal.

Right-multiplying a matrix by a diagonal matrix scales each of its columns up or down by a constant factor. Geometrically, this transformation can be viewed as scaling the coordinate system.
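For example, here is a tiny sketch with made-up numbers showing the column-scaling effect:

import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
D = np.diag([2.0, 0.5])

# Right-multiplying by a diagonal matrix scales each column of X:
# the first column is doubled and the second is halved.
print(X @ D)
# [[2.  1. ]
#  [6.  2. ]]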

Singular value decomposition (SVD) describes a matrix \(X\)’s decomposition into three matrices: \[ X = U S V^T \]

Let’s break down each of these terms one by one.

25.1.1 \(U\)

  • \(U\) is an \(n \times d\) matrix: \(U \in \mathbb{R}^{n \times d}\).
  • Its columns are orthonormal.
    • \(\vec{u_i}^T\vec{u_j} = 0\) for all pairs \(i \neq j\).
    • All vectors \(\vec{u_i}\) are unit vectors where \(|| \vec{u_i} || = 1\) .
  • Columns of \(U\) are called the left singular vectors.
  • \(U^TU = I_d\). (Since \(U\) is not square when \(n > d\), \(UU^T \neq I_n\) in general.)
  • We can think of \(U\) as a rotation.

25.1.2 \(S\)

  • \(S\) is a \(d \times d\) matrix: \(S \in \mathbb{R}^{d \times d}\).
  • The majority of the matrix is zero.
  • It has \(r\) non-zero singular values, where \(r\) is the rank of \(X\).
  • The diagonal values (the singular values \(s_1, s_2, \dots, s_r\)) are ordered from greatest to least: \(s_1 \ge s_2 \ge \dots \ge s_r > 0\).
  • We can think of \(S\) as a scaling operation.

25.1.3 \(V^T\)

  • \(V^T\) is a \(d \times d\) matrix: \(V \in \mathbb{R}^{d \times d}\).
  • Columns of \(V\) are orthonormal, so the rows of \(V^T\) are orthonormal.
  • Columns of \(V\) are called the right singular vectors.
  • \(VV^T = V^TV = I_d\)
  • We can think of \(V\) as a rotation.

25.1.4 SVD in NumPy

For this demo, we’ll work with a rectangular dataset containing \(n=100\) rows and \(d=4\) columns.

Code
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(23)  # kallisti

plt.rcParams["figure.figsize"] = (4, 4)
plt.rcParams["figure.dpi"] = 150
sns.set()

rectangle = pd.read_csv("data/rectangle_data.csv")
rectangle.head(5)
width height area perimeter
0 8 6 48 28
1 2 4 8 12
2 1 3 3 8
3 9 3 27 24
4 9 8 72 34

In NumPy, the SVD decomposition function can be called with np.linalg.svd (documentation). There are multiple versions of SVD; to get the version that we will follow, we need to set the full_matrices parameter to False.

U, S, Vt = np.linalg.svd(rectangle, full_matrices=False)

First, let’s examine U. As we can see, its dimensions are \(n \times d\).

U.shape
(100, 4)

The first 5 rows of U are shown below.

pd.DataFrame(U).head(5)
0 1 2 3
0 -0.155151 0.064830 -0.029935 0.638430
1 -0.038370 -0.089155 0.062019 -0.351010
2 -0.020357 -0.081138 0.058997 0.018831
3 -0.101519 -0.076203 -0.148160 -0.199067
4 -0.218973 0.206423 0.007274 -0.079833

\(S\) is a little different in NumPy. Since the only useful values in the diagonal matrix \(S\) are the singular values on the diagonal, only those values are returned, and they are stored in an array.

Our rectangle_data has a rank of \(3\), so we should have 3 non-zero singular values, sorted from largest to smallest.

S
array([3.62932568e+02, 6.29904732e+01, 2.56544651e+01, 2.56949990e-15])

It seems like we have 4 non-zero values instead of 3, but notice that the last value is so small (\(10^{-15}\)) that it’s practically \(0\). Hence, we can round the values to get 3 singular values.

np.round(S)
array([363.,  63.,  26.,   0.])
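As a sanity check (reusing the rectangle DataFrame loaded above), we can compute the rank directly:

# Should equal 3, matching the number of non-negligible singular values.
np.linalg.matrix_rank(rectangle)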

To get S in matrix format, we use np.diag.

Sm = np.diag(S)
Sm
array([[3.62932568e+02, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 6.29904732e+01, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 2.56544651e+01, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 2.56949990e-15]])

Finally, we can see that Vt is indeed a \(d \times d\) matrix.

Vt.shape
(4, 4)
pd.DataFrame(Vt)
0 1 2 3
0 -0.146436 -0.129942 -8.100201e-01 -0.552756
1 -0.192736 -0.189128 5.863482e-01 -0.763727
2 -0.704957 0.709155 7.951614e-03 0.008396
3 -0.666667 -0.666667 -5.257886e-17 0.333333

To check that this SVD is a valid decomposition, we can reverse it and see if it matches our original table (it does, yay!).

pd.DataFrame(U @ Sm @ Vt).head(5)
0 1 2 3
0 8.0 6.0 48.0 28.0
1 2.0 4.0 8.0 12.0
2 1.0 3.0 3.0 8.0
3 9.0 3.0 27.0 24.0
4 9.0 8.0 72.0 34.0
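We can also verify the reconstruction numerically rather than by eye (a quick check using np.allclose):

# The reconstruction matches the original data to within floating point tolerance.
np.allclose(U @ Sm @ Vt, rectangle)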

25.2 PCA with SVD

Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) can be easily mixed up, especially when you have to keep track of so many acronyms. Here is a quick summary:

  • SVD: a linear algebra algorithm that splits a matrix into 3 component parts.
  • PCA: a data science procedure used for dimensionality reduction that uses SVD as one of the steps.

25.2.1 Deriving Principal Components From SVD

After centering \(X\) so that each column has a mean of 0, we find its SVD: \[ X = U S V^T \]

Because \(X\) is centered, its covariance matrix is \(\frac{1}{n} X^T X\). The constant factor \(\frac{1}{n}\) does not change the eigenvectors, so for this derivation we work with \(\Sigma = X^T X\). Substituting the SVD of \(X\), we get

\[ \begin{align} \Sigma &= X^T X \\ &= (U S V^T)^T U S V^T \\ &= V S^T U^T U S V^T \\ &= V S^2 V^T & \text{$U^T U = I$ since $U$ is orthonormal, and $S^T = S$ since $S$ is diagonal} \end{align} \]

Right-multiplying both sides by \(V\) and using \(V^T V = I\), we get

\[ \begin{align} \Sigma V &= VS^2 V^T V \\ &= V S^2 \end{align} \]

This shows that the columns of \(V\) are the eigenvectors of the covariance matrix \(\Sigma\) and, therefore, the principal components. Additionally, the squared singular values \(S^2\) are the eigenvalues of \(\Sigma\).

We’ve now shown that the first \(k\) columns of \(V\) (equivalently, the first \(k\) rows of \(V^{T}\)) are the first k principal components of \(X\). We can use them to construct the latent vector representation of \(X\), \(Z\), by projecting \(X\) onto the principal components.


This projection gives \(Z = XV\). From this equation, we can derive a second way to calculate \(Z\):

\[ \begin{align} Z &= X V \\ &= USV^T V \\ &= U S \end{align} \]

\[Z = XV = US\]

In other words, we can construct \(X\)’s latent vector representation \(Z\) by either:

  1. projecting \(X\) onto the first \(k\) columns of \(V\): \(Z = XV[:, :k]\), or
  2. multiplying the first \(k\) columns of \(U\) by the corresponding singular values: \(Z = U[:, :k] \, S[:k, :k]\)

Using \(Z\), we can recover the original \(X\) matrix by multiplying \(Z\) by \(V^T\): \[ Z V^T = XV V^T = USV^T = X\] If we keep all \(d\) components, this recovery is exact; if we keep only the first \(k\) components, we instead obtain a rank-\(k\) approximation of \(X\).
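Before moving on, we can numerically spot-check the key claim of this derivation (that the columns of \(V\) are eigenvectors of \(X^TX\) and the squared singular values are the corresponding eigenvalues) on a small randomly generated matrix. This is a self-contained sketch: eigenvectors are only determined up to sign, and np.linalg.eigh returns eigenvalues in ascending order, so we compare absolute values and reverse the ordering.

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))
A = A - A.mean(axis=0)  # center the columns

U_, S_, Vt_ = np.linalg.svd(A, full_matrices=False)
evals, evecs = np.linalg.eigh(A.T @ A)  # eigenvalues returned in ascending order

# Squared singular values match the eigenvalues of A^T A (sorted descending).
print(np.allclose(S_**2, evals[::-1]))

# Each right singular vector matches an eigenvector up to sign.
print(np.allclose(np.abs(Vt_), np.abs(evecs[:, ::-1].T)))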

25.2.2 PCA Visualization


The elements of each column of \(V\) (row of \(V^{T}\)) rotate the original feature vectors into a principal component.

The first column of \(V\) indicates how each feature contributes (e.g. positive, negative, etc.) to principal component 1.


Taken together, this interpretation also tells us that:

  • The principal components are all orthogonal to each other because the columns of \(V\) are orthonormal.
  • Principal components are axis-aligned. That is, if you plot two PCs on a 2D plane, one will lie on the x-axis, the other on the y-axis.
  • Principal components are linear combinations of columns in our data \(X\).

25.2.3 Using Principal Components

Let’s summarize the steps to obtain Principal Components via SVD:

  1. Center the data matrix by subtracting the mean of each attribute column.

  2. To find the \(k\) principal components:

    1. Compute the SVD of the data matrix (\(X = U{S}V^{T}\)).
    2. The first \(k\) columns of \(U{S}\) (or equivalently, \(XV\)) contain the \(k\) principal components of \(X\).

25.2.4 Code Demo

Let’s now walk through an example where we compute PCA using SVD. In order to get the first \(k\) principal components from an \(n \times d\) matrix \(X\), we:

  1. Center \(X\) by subtracting the mean from each column. Notice how we specify axis=0 so that the mean is computed per column.
centered_df = rectangle - np.mean(rectangle, axis=0)
centered_df.head(5)
width height area perimeter
0 2.97 1.35 24.78 8.64
1 -3.03 -0.65 -15.22 -7.36
2 -4.03 -1.65 -20.22 -11.36
3 3.97 -1.65 3.78 4.64
4 3.97 3.35 48.78 14.64
  2. Get the Singular Value Decomposition of the centered \(X\): \(U\), \(S\), and \(V^T\)
U, S, Vt = np.linalg.svd(centered_df, full_matrices=False)
Sm = pd.DataFrame(np.diag(np.round(S, 1)))
  3. Approximate the data by multiplying by either \(US\) or \(XV\). Mathematically, these give the same result, but computationally, floating point approximation results in slightly different numbers for very small values (check out the right-most column in the cells below).
# US
pd.DataFrame(U @ np.diag(S)).head(5)
0 1 2 3
0 -26.432217 0.162686 0.807998 -1.447811e-15
1 17.045285 -2.181451 0.347732 3.893239e-15
2 23.245695 -3.538040 1.995334 -4.901321e-16
3 -5.383546 5.025395 0.253448 -3.544636e-16
4 -51.085217 -2.586948 2.099919 -4.133102e-16
# XV
pd.DataFrame(centered_df @ Vt.T).head(5)
0 1 2 3
0 -26.432217 0.162686 0.807998 -1.539509e-15
1 17.045285 -2.181451 0.347732 -2.072416e-16
2 23.245695 -3.538040 1.995334 4.588922e-16
3 -5.383546 5.025395 0.253448 -4.292862e-16
4 -51.085217 -2.586948 2.099919 -1.650532e-15
  4. Take the first \(k\) columns of \(US\) (or \(XV\)). These are the first \(k\) principal components of \(X\).
two_PCs = (U @ np.diag(S))[:, :2]  # using US
two_PCs = (centered_df @ Vt.T).iloc[:, :2]  # using XV
pd.DataFrame(two_PCs).head()
0 1
0 -26.432217 0.162686
1 17.045285 -2.181451
2 23.245695 -3.538040
3 -5.383546 5.025395
4 -51.085217 -2.586948
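As a quick follow-up sketch (reusing two_PCs, Vt, and centered_df from above), projecting the two principal components back through the first two rows of \(V^T\) gives a rank-2 approximation of the centered data:

# Rank-2 approximation of the centered data from the first 2 principal components.
rank_2_approx = two_PCs.to_numpy() @ Vt[:2, :]

# The remaining error comes entirely from the discarded third component
# (the fourth singular value is numerically zero).
np.abs(rank_2_approx - centered_df.to_numpy()).max()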

25.3 Data Variance and Centering

We define the total variance of a data matrix as the sum of the variances of its attributes (columns). The principal components provide a low-dimensional representation that captures as much of the original data’s total variance as possible. Formally, the \(i\)-th singular value tells us the component score, i.e., how much of the data variance is captured by the \(i\)-th principal component. Assuming the number of datapoints is \(n\):

\[\text{$i$-th component score} = \frac{(\text{$i$-th singular value})^2}{n}\]

Summing up the component scores is equivalent to computing the total variance if we center our data.
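We can check this equivalence numerically (a small sketch reusing centered_df and the singular values S from the demo above):

n = centered_df.shape[0]

# Sum of component scores: squared singular values divided by n.
component_score_total = np.sum(S**2) / n

# Total variance: sum of the per-column variances of the centered data.
total_variance = centered_df.var(ddof=0).sum()

np.isclose(component_score_total, total_variance)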

Data Centering: PCA includes a data centering step that precedes the singular value decomposition; it is this centering that makes the component score formula above valid.

If you want to dive deeper into PCA, Steve Brunton’s SVD Video Series is a great resource.

25.4 Interpreting PCA

25.4.1 PCA Plot

We often plot the first two principal components using a scatter plot, with PC1 on the \(x\)-axis and PC2 on the \(y\)-axis. This is often called a PCA plot.

If the first two singular values are large and all others are small, then two dimensions are enough to describe most of what distinguishes one observation from another. If not, a PCA plot is omitting a lot of information.

PCA plots help us assess similarities between our data points and if there are any clusters in our dataset. In the case study before, for example, we could create the following PCA plot:

[Figure: PCA plot of the first two principal components]

25.4.2 Scree Plots

A scree plot shows the variance ratio captured by each principal component, with the largest variance ratio first. Scree plots help us visually determine the number of dimensions needed to describe the data reasonably well. The singular values that fall in the region of the plot after a large drop-off correspond to principal components that are not needed to describe the data, since they explain a relatively low proportion of the total variance. The point where adding more principal components results in diminishing returns is called the “elbow”; it is the point just before the line flattens out. Using this “elbow method” on the plot below, we see that the elbow is at the second principal component.

[Figure: scree plot of component variance ratios]
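A common companion to the scree plot is the cumulative proportion of variance explained, which makes the point of diminishing returns easy to read off. Here is a small sketch, again reusing the singular values S from the rectangle demo:

# Proportion of total variance explained by each principal component.
variance_ratios = S**2 / np.sum(S**2)

# Cumulative proportion explained by the first 1, 2, ..., d components.
np.cumsum(variance_ratios)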

25.4.3 Biplots

Biplots superimpose the directions onto the plot of PC1 vs. PC2, where vector \(j\) corresponds to the direction for feature \(j\) (e.g., \(v_{1j}, v_{2j}\)). There are several ways to scale biplot vectors – in this course, we plot the direction itself. For other scalings, which can lead to more interpretable directions/loadings, see SAS biplots.

Through biplots, we can interpret how features correlate with the principal components shown: positively, negatively, or not much at all.


The direction of each arrow is \((v_1, v_2)\), where \(v_1\) and \(v_2\) are how that specific feature column contributes to PC1 and PC2, respectively. \(v_1\) and \(v_2\) are elements of the first and second columns of \(V\), respectively (i.e., the first two rows of \(V^T\)).

Say we were considering feature 3, and say that was the purple arrow labeled “520” here (pointing bottom right).

  • \(v_1\) and \(v_2\) are the third elements of the respective columns in \(V\). They scale feature 3’s column vector in the linear transformations that produce PC1 and PC2, respectively.
  • Here, we would infer that \(v_1\) (in the \(x\)/PC1 direction) is positive, meaning that a linear increase in feature 3 corresponds to a linear increase in PC1; that is, feature 3 and PC1 are positively correlated.
  • \(v_2\) (in the \(y\)/PC2 direction) is negative, meaning that a linear increase in feature 3 corresponds to a linear decrease in PC2; that is, feature 3 and PC2 are negatively correlated.

25.5 Example 1: House of Representatives Voting

Let’s examine how the House of Representatives (of the 116th Congress, 1st session) voted in the month of September 2019.

Specifically, we’ll look at the records of Roll call votes. From the U.S. Senate (link): roll call votes occur when a representative or senator votes “yea” or “nay” so that the names of members voting on each side are recorded. A voice vote is a vote in which those in favor or against a measure say “yea” or “nay,” respectively, without the names or tallies of members voting on each side being recorded.

Code
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import yaml
from datetime import datetime
import plotly.express as px
import plotly.graph_objects as go


votes = pd.read_csv("data/votes.csv")
votes = votes.astype({"roll call": str})
votes.head()
chamber session roll call member vote
0 House 1 555 A000374 Not Voting
1 House 1 555 A000370 Yes
2 House 1 555 A000055 No
3 House 1 555 A000371 Yes
4 House 1 555 A000372 No

Suppose we pivot this table to group each legislator and their voting pattern across every (roll call) vote in this month. We mark 1 if the legislator voted Yes (“yea”), and 0 otherwise (“No”, “nay”, no vote, speaker, etc.).

Code
def was_yes(s):
    return 1 if s.iloc[0] == "Yes" else 0


vote_pivot = votes.pivot_table(
    index="member", columns="roll call", values="vote", aggfunc=was_yes, fill_value=0
)
print(vote_pivot.shape)
vote_pivot.head()
(441, 41)
roll call 515 516 517 518 519 520 521 522 523 524 ... 546 547 548 549 550 551 552 553 554 555
member
A000055 1 0 0 0 1 1 0 1 1 1 ... 0 0 1 0 0 1 0 0 1 0
A000367 0 0 0 0 0 0 0 0 0 0 ... 0 1 1 1 1 0 1 1 0 1
A000369 1 1 0 0 1 1 0 1 1 1 ... 0 0 1 0 0 1 0 0 1 0
A000370 1 1 1 1 1 0 1 0 0 0 ... 1 1 1 1 1 0 1 1 1 1
A000371 1 1 1 1 1 0 1 0 0 0 ... 1 1 1 1 1 0 1 1 1 1

5 rows × 41 columns

Do legislators’ roll call votes show a relationship with their political party?

25.5.1 PCA with SVD

While we could consider loading information about the legislator, such as their party, and see how this relates to their voting pattern, it turns out that we can do a lot with PCA to cluster legislators by how they vote. Let’s calculate the principal components using the SVD method.

vote_pivot_centered = vote_pivot - np.mean(vote_pivot, axis=0)
u, s, vt = np.linalg.svd(vote_pivot_centered, full_matrices=False) # SVD

We can use the singular values in s to construct a scree plot:

fig = px.line(y=s**2 / sum(s**2), title='Variance Explained', width=700, height=600, markers=True)
fig.update_xaxes(title_text='Principal Component i')
fig.update_yaxes(title_text='Proportion of Variance Explained')

It looks like this graph plateaus after the third principal component, so our “elbow” is at PC 3, and most of the variance is captured by just the first three principal components. Let’s visualize them!

# calculate the first 3 principal components
vote_2d = pd.DataFrame(index=vote_pivot_centered.index)
vote_2d[["z1", "z2", "z3"]] = (u * s)[:, :3]

# plot the 3 PCs
fig = px.scatter_3d(vote_2d, x='z1', y='z2', z='z3', title='Vote Data', width=800, height=600)
fig.update_traces(marker=dict(size=5))

Based on the plot above, it looks like there are two clusters of datapoints. What do you think this corresponds to?

By incorporating member information (source), we can augment our graph with biographic data like each member’s party and gender.

Code
legislators_data = yaml.safe_load(open("data/legislators-2019.yaml"))


def to_date(s):
    return datetime.strptime(s, "%Y-%m-%d")


legs = pd.DataFrame(
    columns=[
        "leg_id",
        "first",
        "last",
        "gender",
        "state",
        "chamber",
        "party",
        "birthday",
    ],
    data=[
        [
            x["id"]["bioguide"],
            x["name"]["first"],
            x["name"]["last"],
            x["bio"]["gender"],
            x["terms"][-1]["state"],
            x["terms"][-1]["type"],
            x["terms"][-1]["party"],
            to_date(x["bio"]["birthday"]),
        ]
        for x in legislators_data
    ],
)
legs["age"] = 2024 - legs["birthday"].dt.year
legs.set_index("leg_id")
legs.sort_index()

vote_2d = vote_2d.join(legs.set_index("leg_id")).dropna()

np.random.seed(42)
vote_2d["z1_jittered"] = vote_2d["z1"] + np.random.normal(0, 0.1, len(vote_2d))
vote_2d["z2_jittered"] = vote_2d["z2"] + np.random.normal(0, 0.1, len(vote_2d))
vote_2d["z3_jittered"] = vote_2d["z3"] + np.random.normal(0, 0.1, len(vote_2d))

px.scatter_3d(vote_2d, x='z1_jittered', y='z2_jittered', z='z3_jittered', color='party', symbol="gender", size='age',
           title='Vote Data', width=800, height=600, size_max=10,
           opacity = 0.7,
           color_discrete_map={'Democrat':'blue', 'Republican':'red', "Independent": "green"},
           hover_data=['first', 'last', 'state', 'party', 'gender', 'age'])

Using SVD and PCA, we can clearly see a separation between the red dots (Republican) and blue dots (Democrat).

25.5.2 Exploring the Principal Components

We can also look at \(V^T\) directly to try to gain insight into why each component is as it is.

Code
# Bar chart of the first right singular vector (first principal direction) across roll calls
fig_eig = px.bar(x=vote_pivot_centered.columns, y=vt[0, :])
fig_eig.show()

We have the party affiliation labels so we can see if this eigenvector aligns with one of the parties.

Code
party_line_votes = (
    vote_pivot_centered.join(legs.set_index("leg_id")["party"])
    .groupby("party")
    .mean()
    .T.reset_index()
    .rename(columns={"index": "call"})
    .melt("call")
)
fig = px.bar(
    party_line_votes,
    x="call", y="value", facet_row = "party", color="party",
    color_discrete_map={'Democrat':'blue', 'Republican':'red', "Independent": "green"})
fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))

25.5.3 Biplot

Code
loadings = pd.DataFrame(
    {"pc1": np.sqrt(s[0]) * vt[0, :], "pc2": np.sqrt(s[1]) * vt[1, :]},
    index=vote_pivot_centered.columns,
)

vote_2d["num votes"] = votes[votes["vote"].isin(["Yes", "No"])].groupby("member").size()
vote_2d.dropna(inplace=True)

fig = px.scatter(
    vote_2d, 
    x='z1_jittered', 
    y='z2_jittered', 
    color='party', 
    symbol="gender", 
    size='num votes',
    title='Biplot', 
    width=800, 
    height=600, 
    size_max=10,
    opacity = 0.7,
    color_discrete_map={'Democrat':'blue', 'Republican':'red', "Independent": "green"},
    hover_data=['first', 'last', 'state', 'party', 'gender', 'age'])

for (call, pc1, pc2) in loadings.head(20).itertuples():
    fig.add_scatter(x=[0,pc1], y=[0,pc2], name=call, 
                    mode='lines+markers', textposition='top right',
                    marker= dict(size=10,symbol= "arrow-bar-up", angleref="previous"))
fig

Each roll call from the 116th Congress - 1st Session: https://clerk.house.gov/evs/2019/ROLL_500.asp

As shown in the demo, the primary goal of PCA is to transform observations from high-dimensional data down to low dimensions through linear transformations.

25.6 Example 2: Image Classification

In machine learning, PCA is often used as a preprocessing step prior to training a supervised model.

Let’s explore how PCA is useful for building an image classification model based on the Fashion-MNIST dataset, a dataset containing images of articles of clothing; these images are grayscale with a size of 28 by 28 pixels. The copyright for Fashion-MNIST is held by Zalando SE. Fashion-MNIST is licensed under the MIT license.


First, we’ll load in the data.

Code
import requests
from pathlib import Path
import time
import gzip
import os
import numpy as np
import plotly.express as px

def fetch_and_cache(data_url, file, data_dir="data", force=False):
    """
    Download and cache a url and return the file object.

    data_url: the web address to download
    file: the file in which to save the results.
    data_dir: (default="data") the location to save the data
    force: if true the file is always re-downloaded

    return: The pathlib.Path object representing the file.
    """

    data_dir = Path(data_dir)
    data_dir.mkdir(exist_ok=True)
    file_path = data_dir / Path(file)
    # If the file already exists and we want to force a download then
    # delete the file first so that the creation date is correct.
    if force and file_path.exists():
        file_path.unlink()
    if force or not file_path.exists():
        print("Downloading...", end=" ")
        resp = requests.get(data_url)
        with file_path.open("wb") as f:
            f.write(resp.content)
        print("Done!")
        last_modified_time = time.ctime(file_path.stat().st_mtime)
    else:
        last_modified_time = time.ctime(file_path.stat().st_mtime)
        print("Using cached version that was downloaded (UTC):", last_modified_time)
    return file_path


def head(filename, lines=5):
    """
    Returns the first few lines of a file.

    filename: the name of the file to open
    lines: the number of lines to include

    return: A list of the first few lines from the file.
    """
    from itertools import islice

    with open(filename, "r") as f:
        return list(islice(f, lines))


def load_data():
    """
    Loads the Fashion-MNIST dataset.

    This is a dataset of 60,000 28x28 grayscale images of 10 fashion categories,
    along with a test set of 10,000 images. This dataset can be used as
    a drop-in replacement for MNIST.

    The classes are:

    | Label | Description |
    |:-----:|-------------|
    |   0   | T-shirt/top |
    |   1   | Trouser     |
    |   2   | Pullover    |
    |   3   | Dress       |
    |   4   | Coat        |
    |   5   | Sandal      |
    |   6   | Shirt       |
    |   7   | Sneaker     |
    |   8   | Bag         |
    |   9   | Ankle boot  |

    Returns:
      Tuple of NumPy arrays: `(x_train, y_train), (x_test, y_test)`.

    **x_train**: uint8 NumPy array of grayscale image data with shapes
      `(60000, 28, 28)`, containing the training data.

    **y_train**: uint8 NumPy array of labels (integers in range 0-9)
      with shape `(60000,)` for the training data.

    **x_test**: uint8 NumPy array of grayscale image data with shapes
      (10000, 28, 28), containing the test data.

    **y_test**: uint8 NumPy array of labels (integers in range 0-9)
      with shape `(10000,)` for the test data.

    Example:

    (x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()
    assert x_train.shape == (60000, 28, 28)
    assert x_test.shape == (10000, 28, 28)
    assert y_train.shape == (60000,)
    assert y_test.shape == (10000,)

    License:
      The copyright for Fashion-MNIST is held by Zalando SE.
      Fashion-MNIST is licensed under the [MIT license](
      https://github.com/zalandoresearch/fashion-mnist/blob/master/LICENSE).

    """
    dirname = os.path.join("datasets", "fashion-mnist")
    base = "https://storage.googleapis.com/tensorflow/tf-keras-datasets/"
    files = [
        "train-labels-idx1-ubyte.gz",
        "train-images-idx3-ubyte.gz",
        "t10k-labels-idx1-ubyte.gz",
        "t10k-images-idx3-ubyte.gz",
    ]

    paths = []
    for fname in files:
        paths.append(fetch_and_cache(base + fname, fname))
        # paths.append(get_file(fname, origin=base + fname, cache_subdir=dirname))

    with gzip.open(paths[0], "rb") as lbpath:
        y_train = np.frombuffer(lbpath.read(), np.uint8, offset=8)

    with gzip.open(paths[1], "rb") as imgpath:
        x_train = np.frombuffer(imgpath.read(), np.uint8, offset=16).reshape(
            len(y_train), 28, 28
        )

    with gzip.open(paths[2], "rb") as lbpath:
        y_test = np.frombuffer(lbpath.read(), np.uint8, offset=8)

    with gzip.open(paths[3], "rb") as imgpath:
        x_test = np.frombuffer(imgpath.read(), np.uint8, offset=16).reshape(
            len(y_test), 28, 28
        )

    return (x_train, y_train), (x_test, y_test)
Code
class_names = [
    "T-shirt/top",
    "Trouser",
    "Pullover",
    "Dress",
    "Coat",
    "Sandal",
    "Shirt",
    "Sneaker",
    "Bag",
    "Ankle boot",
]
class_dict = {i: class_name for i, class_name in enumerate(class_names)}

(train_images, train_labels), (test_images, test_labels) = load_data()
print("Training images", train_images.shape)
print("Test images", test_images.shape)

rng = np.random.default_rng(42)
n = 5000
sample_idx = rng.choice(np.arange(len(train_images)), size=n, replace=False)

# Invert and normalize the images so they look better
img_mat = -1 * train_images[sample_idx]
img_mat = (img_mat - img_mat.min()) / (img_mat.max() - img_mat.min())

images = pd.DataFrame(
    {
        "images": img_mat.tolist(),
        "labels": train_labels[sample_idx],
        "class": [class_dict[x] for x in train_labels[sample_idx]],
    }
)
Using cached version that was downloaded (UTC): Thu Apr 18 23:22:33 2024
Using cached version that was downloaded (UTC): Thu Apr 18 23:22:33 2024
Using cached version that was downloaded (UTC): Thu Apr 18 23:22:33 2024
Using cached version that was downloaded (UTC): Thu Apr 18 23:22:33 2024
Training images (60000, 28, 28)
Test images (10000, 28, 28)

Let’s see what some of the images contained in this dataset look like.

Code
def show_images(images, ncols=5, max_images=30):
    # convert the subset of images into an (n, 28, 28) matrix for facet visualization
    img_mat = np.array(images.head(max_images)["images"].to_list())
    fig = px.imshow(
        img_mat,
        color_continuous_scale="gray",
        facet_col=0,
        facet_col_wrap=ncols,
        height=220 * int(np.ceil(len(images) / ncols)),
    )
    fig.update_layout(coloraxis_showscale=False)
    # Extract the facet number and convert it back to the class label.
    fig.for_each_annotation(
        lambda a: a.update(text=images.iloc[int(a.text.split("=")[-1])]["class"])
    )
    return fig


fig = show_images(images.groupby("class", as_index=False).sample(2), ncols=6)
fig.show()

Let’s break this down further and look at it by class, or the category of clothing:

Code
print(class_dict)

show_images(images.groupby('class',as_index=False).sample(2), ncols=6)
{0: 'T-shirt/top', 1: 'Trouser', 2: 'Pullover', 3: 'Dress', 4: 'Coat', 5: 'Sandal', 6: 'Shirt', 7: 'Sneaker', 8: 'Bag', 9: 'Ankle boot'}

25.6.1 Raw Data

As we can see, each 28x28 pixel image is labelled by the category of clothing it belongs to. We humans can very easily look at these images and identify the type of clothing being displayed, even if the image is a little blurry. However, this task is less intuitive for machine learning models. To illustrate this, let’s take a small sample of the training data to see how the images above are represented in their raw format:

Code
images.head()
images labels class
0 [[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,... 3 Dress
1 [[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,... 4 Coat
2 [[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,... 0 T-shirt/top
3 [[1.0, 1.0, 1.0, 1.0, 1.0, 0.996078431372549, ... 2 Pullover
4 [[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,... 1 Trouser

Each row represents one image. Every image belongs to a "class" of clothing with its enumerated "label". In place of a typically displayed image, the raw data contains a 28x28 2D array of pixel values; each pixel value is a float between 0 and 1. If we just focus on the images, we get a 3D matrix. You can think of this as a matrix containing 2D images.

X = np.array(images["images"].to_list())
X.shape 
(5000, 28, 28)

However, we’re not used to working with 3D matrices for our training data X. Typical training data expects a vector of features for each datapoint, not a matrix per datapoint. We can reshape our 3D matrix to fit the typical format by “unrolling” the 28x28 pixels into a single row vector containing 28*28 = 784 dimensions.

X = X.reshape(X.shape[0], -1)
X.shape
(5000, 784)

What we have now is 5000 datapoints that each have 784 features. That’s a lot of features! Not only would training a model on this data take a very long time, it’s also very likely that many of the features are redundant (highly correlated or nearly linearly dependent). PCA is a very good strategy to use in situations like these, where there are lots of features but we want to remove redundant information.

25.6.2 PCA with sklearn

To perform PCA, let’s begin by centering our data.

X = X - X.mean(axis=0)

We can run PCA using sklearn’s PCA package.

from sklearn.decomposition import PCA

n_comps = 50
pca = PCA(n_components=n_comps)
pca.fit(X)
PCA(n_components=50)

25.6.3 Examining PCA Results

Now that sklearn helped us find the principal components, let’s visualize a scree plot.

# Make a line plot and show markers
fig = px.line(y=pca.explained_variance_ratio_ * 100, markers=True)
fig.show()

We can see that the line starts flattening out around 2 or 3, which suggests that most of the variance is explained by just the first two or three principal components. To illustrate this, let’s plot the first three principal components and the datapoints’ corresponding classes. Can you identify any patterns?

images[['z1', 'z2', 'z3']] = pca.transform(X)[:, :3]
fig = px.scatter_3d(images, x='z1', y='z2', z='z3', color='class', hover_data=['labels'], 
              width=1000, height=800)
# set marker size to 5
fig.update_traces(marker=dict(size=5))

25.7 Why Perform PCA

As we saw in the demos, we often perform PCA during the Exploratory Data Analysis (EDA) stage of our data science lifecycle (if we already know what to model, we probably don’t need PCA!). It helps us with:

  • Visually identifying clusters of similar observations in high dimensions.
  • Removing irrelevant dimensions if we suspect that the dataset is inherently low rank. For example, if the columns are collinear: there are many attributes but only a few mostly determine the rest through linear associations.
  • Finding a small basis for representing variations in complex things, e.g., images, genes.
  • Reducing the number of dimensions to make some computations cheaper.

25.7.1 Why PCA, then Model?

  1. Reduces dimensionality, allowing us to speed up training and reduce the number of features.
  2. Avoids multicollinearity in the new features created (i.e., the principal components). A sketch of this PCA-then-model pattern is shown below.
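As a concrete illustration of this pattern, here is a minimal sketch (not part of the original demo) that reuses the unrolled Fashion-MNIST sample X and the labels from the images DataFrame in Section 25.6 and chains PCA with a classifier in an sklearn Pipeline:

from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Reuse the unrolled (5000, 784) image matrix X and its class labels.
y = images["labels"].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# PCA reduces 784 pixel features to 50 components before the classifier sees them.
model = Pipeline([
    ("pca", PCA(n_components=50)),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on the held-out split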

25.8 (Bonus) Applications of PCA

25.8.1 PCA in Biology

PCA is commonly used in biomedical contexts, which have many named variables! It can be used to:

  1. Cluster data (Paper 1, Paper 2).
  2. Identify correlated variables (interpret rows of \(V^{T}\) as linear coefficients) (Paper 3). Uses biplots.

25.9 (Bonus) PCA vs. Regression

25.9.1 Regression: Minimizing Horizontal/Vertical Error

Suppose we know the child mortality rate of a given country. Linear regression tries to predict the fertility rate from the mortality rate; for example, if the mortality is 6, we might guess the fertility is near 4. The regression line tells us the “best” prediction of fertility given all possible mortality values by minimizing the root mean squared error. See the vertical red lines (note that only some are shown).


We can also perform a regression in the reverse direction. That is, given fertility, we try to predict mortality. In this case, we get a different regression line that minimizes the root mean squared length of the horizontal lines.

25.9.2 SVD: Minimizing Perpendicular Error

The rank 1 approximation is close to, but not the same as, the mortality regression line. Instead of minimizing horizontal or vertical error, our rank 1 approximation minimizes the error perpendicular to the subspace onto which we’re projecting. That is, SVD finds the line such that if we project our data onto that line, the error between the projection and our original data is minimized. The similarity between the rank 1 approximation and the regression line here was just a coincidence. Looking at adiposity and bicep size from our body measurements dataset, we see that the 1D subspace onto which we are projecting lies between the two regression lines.
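To make the distinction concrete, here is a small sketch on synthetic, made-up data (not the mortality/fertility dataset) comparing the two regression slopes with the slope of the first principal component:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.5 * x + rng.normal(scale=0.5, size=500)

# Regression of y on x minimizes vertical error.
slope_y_on_x = np.polyfit(x, y, 1)[0]

# Regression of x on y minimizes horizontal error; re-express its slope in the (x, y) plane.
slope_x_on_y = 1 / np.polyfit(y, x, 1)[0]

# The first principal component minimizes perpendicular error.
data = np.column_stack([x, y])
data = data - data.mean(axis=0)
_, _, vt_syn = np.linalg.svd(data, full_matrices=False)
slope_pc1 = vt_syn[0, 1] / vt_syn[0, 0]

# For positively correlated data like this, the PC1 slope falls between the two regression slopes.
print(slope_y_on_x, slope_pc1, slope_x_on_y)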

25.9.3 Beyond 1D and 2D

In higher dimensions, the idea behind principal components is the same! Suppose we have 30-dimensional data and decide to use the first 5 principal components. Our procedure minimizes the error between the original 30-dimensional data and the projection of that 30-dimensional data onto the “best” 5-dimensional subspace. See CS 189 Note 10 for more details.

25.10 (Bonus) Automatic Factorization

One key fact to remember is that the decomposition is not arbitrary. The rank of a matrix limits how small our inner dimensions can be if we want to perfectly recreate our matrix. The proof for this is out of scope.

Even if we know we have to factorize our matrix using an inner dimension of \(R\), that still leaves a large space of solutions to traverse. What if we have a procedure to automatically factorize a rank \(R\) matrix into an \(R\)-dimensional representation with some transformation matrix?

  • Lower dimensional representation avoids redundant features.
  • Imagine a 1000-dimensional dataset: If the rank is only 5, it’s much easier to do EDA after this mystery procedure.

What if we wanted a 2D representation? It’s valuable to compress all of the data that is relevant into as few dimensions as possible in order to plot it efficiently. Some 2D matrices yield better approximations than others. How well can we do?

25.11 (Bonus) Proof of Component Score

The proof defining component score is out of scope for this class, but it is included below for your convenience.

Setup: Consider the design matrix \(X \in \mathbb{R}^{n \times d}\), where the \(j\)-th column (corresponding to the \(j\)-th feature) is \(x_j \in \mathbb{R}^n\) and the element in row \(i\), column \(j\) is \(x_{ij}\). Further, define \(\tilde{X}\) as the centered design matrix. The \(j\)-th column is \(\tilde{x}_j \in \mathbb{R}^n\) and the element in row \(i\), column \(j\) is \(\tilde{x}_{ij} = x_{ij} - \bar{x_j}\), where \(\bar{x_j}\) is the mean of the \(x_j\) column vector from the original \(X\).

Variance: Construct the covariance matrix: \(\frac{1}{n} \tilde{X}^T \tilde{X} \in \mathbb{R}^{d \times d}\). The \(j\)-th element along the diagonal is the variance of the \(j\)-th column of the original design matrix \(X\):

\[\left( \frac{1}{n} \tilde{X}^T \tilde{X} \right)_{jj} = \frac{1}{n} \tilde{x}_j ^T \tilde{x}_j = \frac{1}{n} \sum_{i=1}^n (\tilde{x}_{ij} )^2 = \frac{1}{n} \sum_{i=1}^n (x_{ij} - \bar{x_j})^2\]

SVD: Suppose the singular value decomposition of the centered design matrix \(\tilde{X}\) yields \(\tilde{X} = U S V^T\), where \(U \in \mathbb{R}^{n \times d}\) and \(V \in \mathbb{R}^{d \times d}\) are matrices with orthonormal columns, and \(S \in \mathbb{R}^{d \times d}\) is a diagonal matrix containing the singular values of \(\tilde{X}\).

\[ \begin{aligned} \tilde{X}^T \tilde{X} &= (U S V^T )^T (U S V^T) \\ &= V S U^T U S V^T & (S^T = S) \\ &= V S^2 V^T & (U^T U = I) \\ \frac{1}{n} \tilde{X}^T \tilde{X} &= \frac{1}{n} V S^2 V^T = V \left( \frac{1}{n} S^2 \right) V^T \\ V^T \left( \frac{1}{n} \tilde{X}^T \tilde{X} \right) V &= V^T V \left( \frac{1}{n} S^2 \right) V^T V = \frac{1}{n} S^2 & (V^T V = I) \\ \left( V^T \frac{1}{n} \tilde{X}^T \tilde{X} \, V \right)_{jj} &= \frac{1}{n} S_j^2 & \text{(define } S_j \text{ as the } j\text{-th singular value)} \\ \sum_{j=1}^{d} \frac{1}{n} S_j^2 &= \operatorname{tr}\left( \frac{1}{n} S^2 \right) = \operatorname{tr}\left( V^T \frac{1}{n} \tilde{X}^T \tilde{X} \, V \right) = \operatorname{tr}\left( \frac{1}{n} \tilde{X}^T \tilde{X} \right) = \sum_{j=1}^{d} \frac{1}{n} \sum_{i=1}^n (x_{ij} - \bar{x_j})^2 & \text{(trace is invariant under cyclic permutation, } V V^T = I \text{)} \end{aligned} \]

The quantity \(\frac{1}{n} S_j^2\) is the \(j\)-th component score. The last line shows that summing the component scores over all \(j\) recovers the total variance of the centered data, i.e., the sum of the column variances computed above.