Lecture 25 Supplemental Notebook¶

Data 100, Spring 2023

Acknowledgments Page

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(23) #kallisti

plt.rcParams['figure.figsize'] = (4, 4)
plt.rcParams['figure.dpi'] = 150
sns.set()

SVD Demo¶

In [2]:
rectangle = pd.read_csv("data/rectangle_data.csv")
rectangle.tail(5)
Out[2]:
width height area perimeter
95 8 5 40 26
96 8 7 56 30
97 1 4 4 10
98 1 6 6 14
99 2 6 12 16

Center the data.¶

In [3]:
# center data
X = rectangle - np.mean(rectangle, axis=0)
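
As a quick sanity check on the centering step, the sketch below confirms that every column mean is zero after subtracting the column means. It uses a small synthetic stand-in for the rectangle data: the random widths and heights, and the names demo and demo_centered, are assumptions for illustration, not the course's CSV.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data/rectangle_data.csv (values are made up).
rng = np.random.default_rng(0)
width = rng.integers(1, 10, size=100)
height = rng.integers(1, 10, size=100)
demo = pd.DataFrame({
    "width": width,
    "height": height,
    "area": width * height,
    "perimeter": 2 * (width + height),
})

# Subtract each column's mean so every feature is centered at zero.
demo_centered = demo - demo.mean(axis=0)

# Column means should now be zero up to floating-point error.
means_are_zero = np.allclose(demo_centered.mean(axis=0), 0)
print(means_are_zero)
```

Centering matters here because PCA via SVD measures spread around the origin; without it, the first component would mostly point toward the mean of the data.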

Compute SVD.¶


Singular value decomposition (SVD) is a numerical technique that automatically decomposes a matrix into a product of two matrices. Given an input matrix $X$, SVD yields $U\Sigma$ and $V^T$ such that $X = U \Sigma V^T$. (np.linalg.svd documentation)

In [4]:
U, S, Vt = np.linalg.svd(X, full_matrices = False)

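The SVD routine returns $U$ and $\Sigma$ as two separate variables. Note that $\Sigma$ comes back as a 1-D array of singular values (S), sorted from largest to smallest, rather than as a diagonal matrix; np.diag(S) rebuilds the matrix form when needed.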
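
To make the shapes concrete, here is a minimal sketch on a hypothetical 100 × 4 centered matrix (random values, named A only for this example). With full_matrices=False, np.linalg.svd returns the reduced decomposition, so U has the same shape as the input.

```python
import numpy as np

# Hypothetical centered matrix with the same shape as X (100 rows, 4 columns).
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 4))
A = A - A.mean(axis=0)

U_demo, S_demo, Vt_demo = np.linalg.svd(A, full_matrices=False)

print(U_demo.shape)   # (100, 4)
print(S_demo.shape)   # (4,)
print(Vt_demo.shape)  # (4, 4)

# np.diag turns the 1-D array of singular values into the diagonal matrix Sigma.
Sigma_demo = np.diag(S_demo)
print(Sigma_demo.shape)  # (4, 4)
```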

In [5]:
pd.DataFrame(U)    # nicer printing with DataFrames
Out[5]:
0 1 2 3
0 -0.133910 0.005930 0.034734 -0.296836
1 0.086354 -0.079515 0.014948 0.711478
2 0.117766 -0.128963 0.085774 -0.065318
3 -0.027274 0.183177 0.010895 -0.031055
4 -0.258806 -0.094295 0.090270 -0.032818
... ... ... ... ...
95 -0.092321 0.052007 0.029907 -0.065218
96 -0.175499 -0.040147 0.039560 -0.056327
97 0.109202 -0.109114 0.013259 -0.051000
98 0.092073 -0.069417 -0.131771 -0.048640
99 0.059790 -0.058653 -0.107984 -0.074241

100 rows × 4 columns

In [6]:
S
Out[6]:
array([1.97388075e+02, 2.74346257e+01, 2.32626119e+01, 9.22425467e-15])
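
The last singular value above is on the order of 1e-15, which is zero up to floating-point error. This is expected: perimeter = 2·width + 2·height, so the four columns are linearly dependent and the centered matrix has rank 3. The sketch below checks this on synthetic rectangles (random values and the name demo are assumptions, not the course's CSV).

```python
import numpy as np

# Hypothetical stand-in for the rectangle data (random values, not the course CSV).
rng = np.random.default_rng(0)
width = rng.integers(1, 10, size=100).astype(float)
height = rng.integers(1, 10, size=100).astype(float)
demo = np.column_stack([width, height, width * height, 2 * (width + height)])
demo = demo - demo.mean(axis=0)

_, S_demo, Vt_demo = np.linalg.svd(demo, full_matrices=False)

# The smallest singular value is negligible relative to the largest...
smallest_is_negligible = S_demo[-1] < 1e-8 * S_demo[0]
# ...so the numerical rank is 3, not 4.
numerical_rank = np.linalg.matrix_rank(demo)

# The last right singular vector spans the null space: +/- (2, 2, 0, -1) / 3,
# which encodes 2*width + 2*height - perimeter = 0.
null_direction = np.array([2.0, 2.0, 0.0, -1.0]) / 3.0
alignment = abs(Vt_demo[-1] @ null_direction)

print(smallest_is_negligible, numerical_rank, round(alignment, 6))
```

The same null direction shows up in the notebook's own output: the last row of $V^T$ is approximately (0.667, 0.667, 0, -0.333).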
In [7]:
pd.DataFrame(Vt)
Out[7]:
0 1 2 3
0 -0.098631 -0.072956 -9.312257e-01 -0.343173
1 0.668460 -0.374186 -2.583754e-01 0.588548
2 0.314625 -0.640483 2.570230e-01 -0.651715
3 0.666667 0.666667 1.110223e-16 -0.333333

The two key pieces of the decomposition are $U\Sigma$ and $V^T$, which we can think of for now as analogous to our 'data' and 'transformation operation' from our manual decomposition earlier.

As we did before with our manual decomposition, we can recover our (centered) rectangle data $X$ by multiplying the left matrix $U\Sigma$ by the right matrix $V^T$.

In [8]:
pd.DataFrame(U @ np.diag(S) @ Vt)
Out[8]:
0 1 2 3
0 2.97 1.35 24.78 8.64
1 -3.03 -0.65 -15.22 -7.36
2 -4.03 -1.65 -20.22 -11.36
3 3.97 -1.65 3.78 4.64
4 3.97 3.35 48.78 14.64
... ... ... ... ...
95 2.97 0.35 16.78 6.64
96 2.97 2.35 32.78 10.64
97 -4.03 -0.65 -19.22 -9.36
98 -4.03 1.35 -17.22 -5.36
99 -3.03 1.35 -11.22 -3.36

100 rows × 4 columns

Centered data X for reference:

In [9]:
X
Out[9]:
width height area perimeter
0 2.97 1.35 24.78 8.64
1 -3.03 -0.65 -15.22 -7.36
2 -4.03 -1.65 -20.22 -11.36
3 3.97 -1.65 3.78 4.64
4 3.97 3.35 48.78 14.64
... ... ... ... ...
95 2.97 0.35 16.78 6.64
96 2.97 2.35 32.78 10.64
97 -4.03 -0.65 -19.22 -9.36
98 -4.03 1.35 -17.22 -5.36
99 -3.03 1.35 -11.22 -3.36

100 rows × 4 columns
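
Rather than comparing the two tables by eye, the reconstruction can be checked numerically. This is a minimal sketch on synthetic rectangles (random values; the names demo, U_d, S_d, Vt_d are assumptions for this example).

```python
import numpy as np

# Hypothetical stand-in for the centered rectangle data.
rng = np.random.default_rng(0)
width = rng.integers(1, 10, size=100).astype(float)
height = rng.integers(1, 10, size=100).astype(float)
demo = np.column_stack([width, height, width * height, 2 * (width + height)])
demo = demo - demo.mean(axis=0)

U_d, S_d, Vt_d = np.linalg.svd(demo, full_matrices=False)

# Same product as in the notebook: U @ diag(S) @ Vt.
reconstructed = U_d @ np.diag(S_d) @ Vt_d

# Equivalent form: broadcasting scales column j of U by S[j], which is U @ diag(S).
reconstructed_alt = (U_d * S_d) @ Vt_d

matches_input = np.allclose(reconstructed, demo)
forms_agree = np.allclose(reconstructed, reconstructed_alt)
print(matches_input, forms_agree)
```

np.allclose is used instead of exact equality because the product only matches the input up to floating-point rounding.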




Principal Component 1 (PC1)¶

$$ X = U\Sigma V^T $$

$$ XV = U\Sigma $$
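
A short derivation, assuming the standard fact that the columns of $V$ are orthonormal (so $V^T V = I$), shows why the two identities are equivalent and why the two approaches to PC1 below give the same vector. The symbols $u_1$, $v_1$, and $\sigma_1$ are notation introduced for this sketch: the first column of $U$, the first column of $V$ (equivalently the first row of $V^T$), and the first singular value.

$$ XV = (U\Sigma V^T)V = U\Sigma (V^T V) = U\Sigma $$

$$ X v_1 = \sigma_1 u_1 $$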

Approach 1: scale the first column of $U$ by the first singular value.

In [10]:
pc1 = S[0]*U[:, 0]
pd.DataFrame(pc1)
Out[10]:
0
0 -26.432217
1 17.045285
2 23.245695
3 -5.383546
4 -51.085217
... ...
95 -18.223108
96 -34.641325
97 21.555166
98 18.174109
99 11.801777

100 rows × 1 columns

Approach 2: project $X$ onto the first row of $V^T$.

In [11]:
pd.DataFrame(X @ Vt[0, :])    # Vt[0, :] is 1-D, so no transpose is needed
Out[11]:
0
0 -26.432217
1 17.045285
2 23.245695
3 -5.383546
4 -51.085217
... ...
95 -18.223108
96 -34.641325
97 21.555166
98 18.174109
99 11.801777

100 rows × 1 columns
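
The two outputs above agree to the printed precision. A minimal sketch of a numerical check, on synthetic rectangles (random values and all variable names here are assumptions for illustration), also extends the comparison from PC1 to all principal components at once.

```python
import numpy as np

# Hypothetical stand-in for the centered rectangle data.
rng = np.random.default_rng(0)
width = rng.integers(1, 10, size=100).astype(float)
height = rng.integers(1, 10, size=100).astype(float)
demo = np.column_stack([width, height, width * height, 2 * (width + height)])
demo = demo - demo.mean(axis=0)

U_d, S_d, Vt_d = np.linalg.svd(demo, full_matrices=False)

# Approach 1: first column of U scaled by the first singular value.
pc1_from_u = S_d[0] * U_d[:, 0]
# Approach 2: project the data onto the first row of Vt.
pc1_from_v = demo @ Vt_d[0, :]
approaches_agree = np.allclose(pc1_from_u, pc1_from_v)

# The same identity for every component: U * S equals X @ V.
all_pcs_from_u = U_d * S_d
all_pcs_from_v = demo @ Vt_d.T
all_agree = np.allclose(all_pcs_from_u, all_pcs_from_v)

print(approaches_agree, all_agree)
```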




Rank 1 approximation¶

In [12]:
pd.DataFrame(U[:, 0:1] @ np.diag(S[0:1]) @ Vt[0:1,:])
Out[12]:
0 1 2 3
0 2.607034 1.928383 24.614360 9.070835
1 -1.681193 -1.243552 -15.873008 -5.849490
2 -2.292745 -1.695908 -21.646989 -7.977306
3 0.530984 0.392761 5.013297 1.847490
4 5.038583 3.726962 47.571869 17.531091
... ... ... ... ...
95 1.797362 1.329481 16.969827 6.253687
96 3.416707 2.527285 32.258893 11.887984
97 -2.126006 -1.572574 -20.072725 -7.397161
98 -1.792530 -1.325906 -16.924198 -6.236872
99 -1.164020 -0.861008 -10.990118 -4.050057

100 rows × 4 columns

Centered data X for reference:

In [13]:
X
Out[13]:
width height area perimeter
0 2.97 1.35 24.78 8.64
1 -3.03 -0.65 -15.22 -7.36
2 -4.03 -1.65 -20.22 -11.36
3 3.97 -1.65 3.78 4.64
4 3.97 3.35 48.78 14.64
... ... ... ... ...
95 2.97 0.35 16.78 6.64
96 2.97 2.35 32.78 10.64
97 -4.03 -0.65 -19.22 -9.36
98 -4.03 1.35 -17.22 -5.36
99 -3.03 1.35 -11.22 -3.36

100 rows × 4 columns
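
To quantify how close a low-rank approximation is, one option is the Frobenius norm of the difference, which for a truncated SVD equals the square root of the sum of the squared discarded singular values. The sketch below uses synthetic rectangles (random values) and a hypothetical helper, rank_k_approx, that generalizes the rank 1 expression above to any k.

```python
import numpy as np

def rank_k_approx(U, S, Vt, k):
    """Keep only the first k singular values and vectors (hypothetical helper)."""
    return U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

# Hypothetical stand-in for the centered rectangle data.
rng = np.random.default_rng(0)
width = rng.integers(1, 10, size=100).astype(float)
height = rng.integers(1, 10, size=100).astype(float)
demo = np.column_stack([width, height, width * height, 2 * (width + height)])
demo = demo - demo.mean(axis=0)

U_d, S_d, Vt_d = np.linalg.svd(demo, full_matrices=False)

# Frobenius-norm error of the rank 1, 2, and 3 approximations.
errors = [np.linalg.norm(demo - rank_k_approx(U_d, S_d, Vt_d, k)) for k in (1, 2, 3)]

# Predicted error from the discarded singular values alone.
predicted = [np.sqrt(np.sum(S_d[k:] ** 2)) for k in (1, 2, 3)]

for k, err in zip((1, 2, 3), errors):
    print(k, err)
```

Because the rectangle data has rank 3, the rank 3 approximation is already exact up to rounding, so its error is essentially zero.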