Lecture 25 – Data 100, Spring 2024

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px

PCA is a Linear Transformation

In this short demo, we walk through the steps of PCA on a two-dimensional dataset. We wouldn't normally run PCA on data with only two features, but working in 2D lets us visualize what each step does.

In [2]:
df = pd.read_csv("data/2d.csv")
df.head(3)
Out[2]:
x y
0 2.311043 5.436627
1 2.951447 6.093710
2 2.628517 6.776799

Let's visualize the dataset first.

In [3]:
fig = px.scatter(df, x='x', y='y', title='2D Data', width=700, height=700)
fig.update_xaxes(range=[-10, 10], zeroline=True, zerolinewidth=2, zerolinecolor='black')
fig.update_yaxes(range=[-10, 10], zeroline=True, zerolinewidth=2, zerolinecolor='black')

Step 1: Center the data (and visualize).

In [4]:
centered_df = df - df.mean(axis=0)
In [5]:
fig = px.scatter(centered_df, x='x', y='y', title='Centered 2D Data', width=700, height=700)
fig.update_xaxes(range=[-10, 10], zeroline=True, zerolinewidth=2, zerolinecolor='black')
fig.update_yaxes(range=[-10, 10], zeroline=True, zerolinewidth=2, zerolinecolor='black')
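
As a quick sanity check on the centering step, the sketch below verifies that subtracting the column means leaves each column with mean zero and does not change the spread. Since `data/2d.csv` is not reproduced here, it uses a synthetic stand-in (`demo_df`, generated with an arbitrary seed); those names and values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data/2d.csv: 100 correlated 2D points with a nonzero mean.
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=1.0, size=100)
y = 2.0 * x + rng.normal(loc=1.0, scale=0.5, size=100)
demo_df = pd.DataFrame({"x": x, "y": y})

# Subtract each column's mean from every row of that column.
demo_centered = demo_df - demo_df.mean(axis=0)

# After centering, both column means are zero up to floating-point error.
means_are_zero = np.allclose(demo_centered.mean(axis=0), 0.0)

# Centering only shifts the point cloud; the per-column spread is unchanged.
spread_unchanged = np.allclose(demo_df.std(axis=0), demo_centered.std(axis=0))
```

Centering matters because the SVD in the next step measures spread around the origin, so the origin needs to sit at the center of the data.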

Step 2: Obtain the SVD of the centered data.

In [6]:
# full_matrices=False returns the reduced SVD; the rows of Vt are the columns of V
U, S, Vt = np.linalg.svd(centered_df, full_matrices=False)
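
To make the pieces returned by `np.linalg.svd` concrete, here is a minimal sketch that checks their shapes and basic properties. It runs on a synthetic centered matrix `X` (an assumption standing in for `centered_df`), not on the lecture's dataset.

```python
import numpy as np

# Hypothetical centered 2D data standing in for centered_df.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[2.0, 1.0], [0.0, 0.5]])
X = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Reduced SVD of an (n, 2) matrix: U is (n, 2), S holds 2 singular values, Vt is (2, 2).
shapes = (U.shape, S.shape, Vt.shape)

# The three factors multiply back to the original matrix: X = U diag(S) V^T.
reconstructs = np.allclose(U @ np.diag(S) @ Vt, X)

# The rows of Vt (columns of V) are orthonormal, as are the columns of U.
v_orthonormal = np.allclose(Vt @ Vt.T, np.eye(2))
u_orthonormal = np.allclose(U.T @ U, np.eye(2))

# NumPy returns the singular values sorted from largest to smallest.
sorted_descending = S[0] >= S[1]
```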

Step 3: Project the data onto the principal component directions (the columns of $V$).

In [7]:
centered_df[["z1", "z2"]] = centered_df[['x', 'y']] @ Vt.T
# centered_df[["z1", "z2"]] = U @ np.diag(S) # does the same thing
centered_df.head(3)
Out[7]:
x y z1 z2
0 -0.782371 -1.708284 -1.878825 0.018793
1 -0.141967 -1.051201 -1.017886 -0.298473
2 -0.464897 -0.368111 -0.525540 0.274668
In [8]:
fig = px.scatter(centered_df, x='z1', y='z2', title='2D Data Projected onto the Principal Components', width=700, height=700)
fig.update_xaxes(range=[-10, 10], zeroline=True, zerolinewidth=2, zerolinecolor='black')
fig.update_yaxes(range=[-10, 10], zeroline=True, zerolinewidth=2, zerolinecolor='black')
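
The commented-out line in the projection cell claims that `U @ np.diag(S)` gives the same result as multiplying by `Vt.T`. The sketch below checks that claim on a synthetic centered matrix `X` (an assumption standing in for the `x` and `y` columns of `centered_df`) and also relates the variance of each projected column to the singular values.

```python
import numpy as np

# Hypothetical centered 2D data standing in for centered_df[['x', 'y']].
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[2.0, 1.0], [0.0, 0.5]])
X = X - X.mean(axis=0)
n = X.shape[0]

U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Two equivalent ways to compute the projected coordinates (z1, z2).
Z_from_V = X @ Vt.T
Z_from_U = U * S  # scales column i of U by S[i]; same as U @ np.diag(S)
same_projection = np.allclose(Z_from_V, Z_from_U)

# Since X = U diag(S) V^T and V^T V = I, we get X V = U diag(S).
# Each projected column has sample variance S[i]**2 / (n - 1).
variance_matches = np.allclose(Z_from_V.var(axis=0, ddof=1), S**2 / (n - 1))
```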

We mentioned that $V$ transforms $X$ to get the principal components. What does that transformation look like?

It turns out that $V$ simply rotates (and possibly reflects) the centered data matrix $X$ so that the direction with the most variation (i.e. the direction in which the data is most spread out) is aligned with the horizontal axis, $z_1$.
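
One way to see that this transformation only changes the orientation of the data is to check that $V$ is an orthogonal matrix. The sketch below does this on a synthetic centered matrix `X` (an assumption, not the lecture's dataset): it confirms that multiplying by `V` preserves each point's distance from the origin and the total variance, while concentrating as much variance as possible in the first coordinate.

```python
import numpy as np

# Hypothetical centered 2D data standing in for the centered x and y columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[2.0, 1.0], [0.0, 0.5]])
X = X - X.mean(axis=0)

_, S, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T
Z = X @ V

# V is orthogonal: V^T V = I, so its determinant is +1 (rotation) or -1 (rotation plus reflection).
is_orthogonal = np.allclose(V.T @ V, np.eye(2))
det_is_pm_one = np.isclose(abs(np.linalg.det(V)), 1.0)

# Orthogonal maps preserve lengths: every point stays the same distance from the origin.
lengths_preserved = np.allclose(np.linalg.norm(X, axis=1), np.linalg.norm(Z, axis=1))

# Total variance is unchanged, but z1 now carries at least as much variance as x or y did.
var_original = X.var(axis=0, ddof=1)
var_rotated = Z.var(axis=0, ddof=1)
total_variance_preserved = np.isclose(var_original.sum(), var_rotated.sum())
z1_has_most_variance = var_rotated[0] >= var_original.max()

# The rotated coordinates are uncorrelated: Z^T Z is diagonal with entries S**2.
uncorrelated = np.allclose(Z.T @ Z, np.diag(S**2))
```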