Lecture 25 – Data 100, Spring 2024
In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
PCA is a Linear Transformation
In this short demo, we walk through the steps of PCA on a two-dimensional dataset. We wouldn't normally run PCA on data with only two features, but doing so makes it easy to visualize what each step does.
In [2]:
df = pd.read_csv("data/2d.csv")
df.head(3)
Out[2]:
| | x | y |
|---|---|---|
| 0 | 2.311043 | 5.436627 |
| 1 | 2.951447 | 6.093710 |
| 2 | 2.628517 | 6.776799 |
Let's visualize the dataset first.
In [3]:
fig = px.scatter(df, x='x', y='y', title='2D Data', width=700, height=700)
fig.update_xaxes(range=[-10, 10], zeroline=True, zerolinewidth=2, zerolinecolor='black')
fig.update_yaxes(range=[-10, 10], zeroline=True, zerolinewidth=2, zerolinecolor='black')
Step 1: Center the data (and visualize).
In [4]:
centered_df = df - df.mean(axis=0)
In [5]:
fig = px.scatter(centered_df, x='x', y='y', title='2D Data', width=700, height=700)
fig.update_xaxes(range=[-10, 10], zeroline=True, zerolinewidth=2, zerolinecolor='black')
fig.update_yaxes(range=[-10, 10], zeroline=True, zerolinewidth=2, zerolinecolor='black')
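As a quick sanity check on Step 1, the sketch below confirms that subtracting the column means leaves every column with mean zero. Because `data/2d.csv` is not reproduced here, it uses a synthetic stand-in; `toy_df`, `toy_centered`, and the random seed are assumptions made for illustration, not part of the lecture data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data/2d.csv: 100 correlated 2D points.
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=1.0, size=100)
toy_df = pd.DataFrame({"x": x, "y": 2.0 * x + rng.normal(scale=0.5, size=100)})

# df.mean(axis=0) is a Series indexed by column name, so the subtraction
# aligns on columns and removes each column's own mean from every row.
toy_centered = toy_df - toy_df.mean(axis=0)

# After centering, every column mean should be zero up to floating-point error.
means_are_zero = np.allclose(toy_centered.mean(axis=0), 0.0)
print(means_are_zero)
```

Centering matters because the SVD in the next step describes variation about the origin; without it, the first singular vector would tend to point toward the data's mean rather than along its spread.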
Step 2: Obtain the SVD of the centered data.
In [6]:
U, S, Vt = np.linalg.svd(centered_df, full_matrices=False)
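To make the output of Step 2 concrete, here is a minimal sketch that inspects the shapes returned by `np.linalg.svd` and checks that the three factors multiply back to the centered data. The matrix `X` is synthetic (an assumption standing in for the centered lecture data).

```python
import numpy as np

# Hypothetical centered data: 100 points in 2D with correlated columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[2.0, 1.0], [0.0, 0.5]])
X = X - X.mean(axis=0)

# The reduced SVD (full_matrices=False) returns U as n x 2, S as a
# length-2 vector of singular values, and Vt (V transposed) as 2 x 2.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
print(U.shape, S.shape, Vt.shape)  # (100, 2) (2,) (2, 2)

# The factors reconstruct X: X = U @ diag(S) @ Vt.
reconstructs = np.allclose(U @ np.diag(S) @ Vt, X)

# NumPy returns the singular values in descending order.
sorted_descending = S[0] >= S[1]
print(reconstructs, sorted_descending)
```

Note that `S` comes back as a 1D array rather than a diagonal matrix, which is why `np.diag(S)` appears whenever the full product is written out.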
Step 3: Project the data onto the principal components (the columns of $V$).
In [7]:
centered_df[["z1", "z2"]] = centered_df[['x', 'y']] @ Vt.T
# Equivalent, because X V = U diag(S) V^T V = U diag(S):
# centered_df[["z1", "z2"]] = U @ np.diag(S)
centered_df.head(3)
Out[7]:
| | x | y | z1 | z2 |
|---|---|---|---|---|
| 0 | -0.782371 | -1.708284 | -1.878825 | 0.018793 |
| 1 | -0.141967 | -1.051201 | -1.017886 | -0.298473 |
| 2 | -0.464897 | -0.368111 | -0.525540 | 0.274668 |
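The two ways of computing the projection in Step 3 agree because $X = U \Sigma V^T$ and $V^T V = I$, so $XV = U \Sigma V^T V = U \Sigma$. The sketch below checks this numerically on synthetic centered data; the names `X`, `Z_from_V`, and `Z_from_U` are assumptions introduced for illustration.

```python
import numpy as np

# Hypothetical centered data standing in for centered_df[['x', 'y']].
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[2.0, 1.0], [0.0, 0.5]])
X = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Route 1: project the data onto the columns of V.
Z_from_V = X @ Vt.T

# Route 2: scale each column of U by its singular value.
Z_from_U = U @ np.diag(S)

# Both routes give the same coordinates, up to floating-point error.
projections_match = np.allclose(Z_from_V, Z_from_U)
print(projections_match)
```

In practice the second route avoids a matrix product with the original data, since `U * S` (broadcasting `S` across columns) gives the same result as `U @ np.diag(S)`.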
In [8]:
fig = px.scatter(centered_df, x='z1', y='z2', title='2D Data', width=700, height=700)
fig.update_xaxes(range=[-10, 10], zeroline=True, zerolinewidth=2, zerolinecolor='black')
fig.update_yaxes(range=[-10, 10], zeroline=True, zerolinewidth=2, zerolinecolor='black')
We mentioned that $V$ transforms the centered data $X$ into its coordinates along the principal components, $Z = XV$. What does that transformation look like?
It turns out that $V$ simply rotates (and possibly reflects) the centered data matrix $X$ so that the direction with the most variation (i.e., the direction in which the data are most spread out) is aligned with the horizontal axis, `z1`.
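The rotation claim can be checked numerically. The sketch below, on synthetic centered data (an assumption, since the lecture CSV is not reproduced here), verifies that $V$ is orthogonal, that multiplying by $V$ preserves each point's distance from the origin, and that the transformed coordinates are uncorrelated, with variances $\sigma_i^2 / n$ in decreasing order.

```python
import numpy as np

# Hypothetical centered data: 100 points in 2D with correlated columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[2.0, 1.0], [0.0, 0.5]])
X = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T
Z = X @ V
n = len(X)

# V is orthogonal: V^T V = I, so its determinant is +1 (a pure rotation)
# or -1 (a rotation combined with a reflection).
is_orthogonal = np.allclose(V.T @ V, np.eye(2))
det_is_unit = np.isclose(abs(np.linalg.det(V)), 1.0)

# Orthogonal maps preserve lengths: each point keeps its distance from the origin.
norms_preserved = np.allclose(np.linalg.norm(Z, axis=1), np.linalg.norm(X, axis=1))

# Z is centered, so Z^T Z / n is its covariance matrix. It is diagonal,
# with the variances S**2 / n along the new axes.
cov_Z = Z.T @ Z / n
off_diagonal_zero = np.isclose(cov_Z[0, 1], 0.0, atol=1e-10)
variances_match = np.allclose(np.diag(cov_Z), S**2 / n)

print(is_orthogonal, det_is_unit, norms_preserved, off_diagonal_zero, variances_match)
```

Because the singular values are sorted in descending order, the first new axis always carries at least as much variance as the second, which is why the spread in the `z1` direction dominates the final plot.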