Lecture 24 – Data 100, Spring 2024¶


In [1]:
import pandas as pd
import numpy as np
import scipy as sp
import plotly.express as px
import seaborn as sns

Working with High Dimensional Data¶

In the following cells, we will use visualization tools to push as far as we can toward visualizing the MPG dataset in its high-dimensional space:

In [2]:
mpg = sns.load_dataset("mpg").dropna()
mpg.head()
Out[2]:
mpg cylinders displacement horsepower weight acceleration model_year origin name
0 18.0 8 307.0 130.0 3504 12.0 70 usa chevrolet chevelle malibu
1 15.0 8 350.0 165.0 3693 11.5 70 usa buick skylark 320
2 18.0 8 318.0 150.0 3436 11.0 70 usa plymouth satellite
3 16.0 8 304.0 150.0 3433 12.0 70 usa amc rebel sst
4 17.0 8 302.0 140.0 3449 10.5 70 usa ford torino
In [3]:
px.histogram(mpg, x="displacement")
In [4]:
px.scatter(mpg, x="displacement", y="horsepower")
In [5]:
fig = px.scatter_3d(mpg, x="displacement", y="horsepower", z="weight",
                    width=800, height=800)
fig.update_traces(marker=dict(size=3))
In [6]:
fig = px.scatter_3d(mpg, x="displacement", 
                    y="horsepower", 
                    z="weight", 
                    color="model_year",
                    width=800, height=800, 
                    opacity=.7)
fig.update_traces(marker=dict(size=5))
In [7]:
fig = px.scatter_3d(mpg, x="displacement", 
                    y="horsepower", 
                    z="weight", 
                    color="model_year",
                    size="mpg",
                    symbol="origin",
                    width=900, height=800, 
                    opacity=.7)
# hide color scale legend on the plotly fig
fig.update_layout(coloraxis_showscale=False)
In [8]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)

components = pca.fit_transform(mpg[["displacement", "horsepower", "weight", "model_year"]])
mpg[["z1", "z2"]] = components
mpg.head()
Out[8]:
mpg cylinders displacement horsepower weight acceleration model_year origin name z1 z2
0 18.0 8 307.0 130.0 3504 12.0 70 usa chevrolet chevelle malibu 536.436716 50.754633
1 15.0 8 350.0 165.0 3693 11.5 70 usa buick skylark 320 730.333026 79.063105
2 18.0 8 318.0 150.0 3436 11.0 70 usa plymouth satellite 470.971839 75.338465
3 16.0 8 304.0 150.0 3433 12.0 70 usa amc rebel sst 466.393030 62.448249
4 17.0 8 302.0 140.0 3449 10.5 70 usa ford torino 481.657494 55.643203
In [9]:
px.scatter(mpg, x="z1", y="z2", color="model_year", hover_data=["displacement", "horsepower", "weight", "name"])



Return to lecture.




Singular Value Decomposition¶

Singular value decomposition is a numerical technique to decompose a matrix into the product of three matrices. Given an input matrix $X$, SVD will return $U$, $S$, and $V^T$ such that $X = U S V^T$.

Check the documentation of np.linalg.svd. There are multiple versions of SVD; to get the version that we will follow, we need to set the full_matrices parameter to False.
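To see what the full_matrices flag changes, here is a minimal sketch on a small random matrix (not part of the lecture data): with full_matrices=True the $U$ and $V^T$ factors are square, while full_matrices=False returns the compact version whose shapes we use below.

A = np.random.randn(5, 3)  # small example matrix: 5 rows, 3 columns
U_full, S_full, Vt_full = np.linalg.svd(A, full_matrices=True)
U_thin, S_thin, Vt_thin = np.linalg.svd(A, full_matrices=False)
print(U_full.shape, S_full.shape, Vt_full.shape)   # (5, 5) (3,) (3, 3)
print(U_thin.shape, S_thin.shape, Vt_thin.shape)   # (5, 3) (3,) (3, 3)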

For PCA we will typically work with data that is already centered.

In [10]:
rectangle = pd.read_csv("data/rectangle_data.csv")
rectangle
Out[10]:
width height area perimeter
0 8 6 48 28
1 2 4 8 12
2 1 3 3 8
3 9 3 27 24
4 9 8 72 34
... ... ... ... ...
95 8 5 40 26
96 8 7 56 30
97 1 4 4 10
98 1 6 6 14
99 2 6 12 16

100 rows × 4 columns

In [11]:
px.scatter_3d(rectangle, x="width", y="height", z="area", 
              width=800, height=800)
In [12]:
X = rectangle - np.mean(rectangle, axis = 0)
X
Out[12]:
width height area perimeter
0 2.97 1.35 24.78 8.64
1 -3.03 -0.65 -15.22 -7.36
2 -4.03 -1.65 -20.22 -11.36
3 3.97 -1.65 3.78 4.64
4 3.97 3.35 48.78 14.64
... ... ... ... ...
95 2.97 0.35 16.78 6.64
96 2.97 2.35 32.78 10.64
97 -4.03 -0.65 -19.22 -9.36
98 -4.03 1.35 -17.22 -5.36
99 -3.03 1.35 -11.22 -3.36

100 rows × 4 columns

In [13]:
U, S, Vt = np.linalg.svd(X, full_matrices = False)
In [14]:
print("Shape of U", U.shape)
print("Shape of S", S.shape)
print("Shape of Vt", Vt.shape)
Shape of U (100, 4)
Shape of S (4,)
Shape of Vt (4, 4)

$\Sigma$ is handled a little differently in NumPy. Since the only useful values in the diagonal matrix $\Sigma$ are the singular values on the diagonal, only those values are returned, stored in a 1-D array.

Our rectangle_data has a rank of $3$, since the perimeter column is a linear combination of the width and height columns (perimeter $= 2 \cdot \text{width} + 2 \cdot \text{height}$). So we should have 3 non-zero singular values, sorted from largest to smallest.
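As a quick sanity check (a small sketch; np.linalg.matrix_rank is not otherwise used in this lecture), we can confirm the rank of the centered data directly:

# The centered rectangle data should have rank 3, since perimeter is a
# linear combination of width and height.
print(np.linalg.matrix_rank(X.to_numpy()))  # expected: 3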

In [15]:
S
Out[15]:
array([1.97388075e+02, 2.74346257e+01, 2.32626119e+01, 8.53898299e-15])

Hmm, it looks like all four diagonal entries are non-zero. What happened?

It turns out there were some numerical rounding errors, but the last value is so small ($10^{-15}$) that it's practically $0$.

In [16]:
np.isclose(S[3], 0)
Out[16]:
True
In [17]:
S.round(5)
Out[17]:
array([197.38808,  27.43463,  23.26261,   0.     ])

If we want the diagonal elements:

In [18]:
Sm = np.diag(S)
Sm
Out[18]:
array([[1.97388075e+02, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 2.74346257e+01, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 2.32626119e+01, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 8.53898299e-15]])

Examining U:

In [19]:
pd.DataFrame(U).head(5)
Out[19]:
0 1 2 3
0 -0.133910 0.005930 0.034734 0.168825
1 0.086354 -0.079515 0.014948 0.775872
2 0.117766 -0.128963 0.085774 -0.061736
3 -0.027274 0.183177 0.010895 -0.053764
4 -0.258806 -0.094295 0.090270 -0.026449

Finally, $V^{\top}$:

In [20]:
pd.DataFrame(Vt)
Out[20]:
0 1 2 3
0 -0.098631 -0.072956 -0.931226 -0.343173
1 0.668460 -0.374186 -0.258375 0.588548
2 0.314625 -0.640483 0.257023 -0.651715
3 0.666667 0.666667 0.000000 -0.333333
In [21]:
Vt.shape
Out[21]:
(4, 4)

To check that this SVD is a valid decomposition, we can reverse it.

In [22]:
display(pd.DataFrame(U @ Sm @ Vt).head(5))
display(pd.DataFrame(X).head(5))
0 1 2 3
0 2.97 1.35 24.78 8.64
1 -3.03 -0.65 -15.22 -7.36
2 -4.03 -1.65 -20.22 -11.36
3 3.97 -1.65 3.78 4.64
4 3.97 3.35 48.78 14.64
width height area perimeter
0 2.97 1.35 24.78 8.64
1 -3.03 -0.65 -15.22 -7.36
2 -4.03 -1.65 -20.22 -11.36
3 3.97 -1.65 3.78 4.64
4 3.97 3.35 48.78 14.64
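Rather than eyeballing the two tables, we can also verify the reconstruction numerically (a minimal sketch using np.allclose):

# U @ Sm @ Vt should reproduce the centered data X up to floating point error.
print(np.allclose(U @ Sm @ Vt, X))  # expected: True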

PCA with SVD¶

Step 1: Center the Data Matrix $X$¶

In [23]:
X = rectangle - np.mean(rectangle, axis = 0)
X.head(10)
Out[23]:
width height area perimeter
0 2.97 1.35 24.78 8.64
1 -3.03 -0.65 -15.22 -7.36
2 -4.03 -1.65 -20.22 -11.36
3 3.97 -1.65 3.78 4.64
4 3.97 3.35 48.78 14.64
5 -2.03 -3.65 -20.22 -11.36
6 -1.03 -2.65 -15.22 -7.36
7 0.97 0.35 6.78 2.64
8 1.97 -3.65 -16.22 -3.36
9 2.97 -2.65 -7.22 0.64

In some situations, where the columns are measured in different units or on very different scales, it is useful to normalize the data before performing SVD. This can be done by dividing each column by its standard deviation.

In [24]:
Xstd = X / np.std(X, axis = 0)
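Equivalently, scikit-learn's StandardScaler performs the same centering and scaling in one step. A small sketch of the comparison (it should match to floating point precision, since StandardScaler also divides by the population standard deviation):

from sklearn.preprocessing import StandardScaler

# Center each column and divide by its standard deviation, as done manually above.
scaled = StandardScaler().fit_transform(rectangle)
print(np.allclose(scaled, Xstd))  # expected: True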

Step 2: Get the SVD of centered $X$¶

In [25]:
U, S, Vt = np.linalg.svd(X, full_matrices = False)

Examining the singular values:

In [26]:
pd.DataFrame(np.diag(S))
Out[26]:
0 1 2 3
0 197.388075 0.000000 0.000000 0.000000e+00
1 0.000000 27.434626 0.000000 0.000000e+00
2 0.000000 0.000000 23.262612 0.000000e+00
3 0.000000 0.000000 0.000000 8.538983e-15

Computing the contribution to the total variance:

In [27]:
pd.DataFrame(np.round(S**2 / X.shape[0], 2))
Out[27]:
0
0 389.62
1 7.53
2 5.41
3 0.00

Much of the variance is in the first dimension. This is likely because the area column takes on much larger values than the other columns and therefore dominates the variance. Let's examine the standardized version.

In [28]:
U, S, Vt = np.linalg.svd(Xstd, full_matrices = False)
In [29]:
pd.DataFrame(np.round(S**2 / X.shape[0], 2))
Out[29]:
0
0 2.89
1 1.03
2 0.09
3 0.00

Now we see that most of the variance is in the first two dimensions, which makes sense since rectangles are largely described by two numbers: their width and height.
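We can make this precise by looking at the fraction of the total variance captured by each component (a small sketch; this is the same quantity scikit-learn reports as explained_variance_ratio_):

# Fraction of total variance captured by each principal component of the standardized data.
variance_ratio = S**2 / np.sum(S**2)
print(np.round(variance_ratio, 3))
print(np.round(variance_ratio[:2].sum(), 3))  # share captured by the first two components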

Step 3: Computing Approximations to the Data¶

Let's try to approximate this data in two dimensions.

Using $Z = U S$¶

In [30]:
Z = U[:, :2] @ np.diag(S[:2])
pd.DataFrame(Z).head()
Out[30]:
0 1
0 -2.165116 0.250465
1 1.659346 -0.502647
2 2.462645 -0.416318
3 -0.856182 1.483758
4 -3.884287 -0.182623

Using $Z = X V$ (here with the standardized $X$)¶

In [31]:
Z = Xstd.to_numpy() @ Vt.T[:,:2]
pd.DataFrame(Z).head()
Out[31]:
0 1
0 -2.165116 0.250465
1 1.659346 -0.502647
2 2.462645 -0.416318
3 -0.856182 1.483758
4 -3.884287 -0.182623
In [32]:
px.scatter(x=Z[:, 0], y=Z[:, 1])
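Both constructions yield the same principal components, since $X V = U S V^T V = U S$. A minimal check:

# The two formulas Z = U S and Z = X V agree (up to floating point error).
Z_from_U = U[:, :2] @ np.diag(S[:2])
Z_from_X = Xstd.to_numpy() @ Vt.T[:, :2]
print(np.allclose(Z_from_U, Z_from_X))  # expected: True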

Comparing to scikit-learn:

In [33]:
pca = PCA(2)
pd.DataFrame(pca.fit_transform(rectangle)).head(5)
Out[33]:
0 1
0 26.432217 0.162686
1 -17.045285 -2.181451
2 -23.245695 -3.538040
3 5.383546 5.025395
4 51.085217 -2.586948
In [34]:
pd.DataFrame(pca.fit_transform(X)).head(5)
Out[34]:
0 1
0 26.432217 0.162686
1 -17.045285 -2.181451
2 -23.245695 -3.538040
3 5.383546 5.025395
4 51.085217 -2.586948
In [35]:
pd.DataFrame(pca.fit_transform(Xstd)).head(5)
Out[35]:
0 1
0 2.165116 0.250465
1 -1.659346 -0.502647
2 -2.462645 -0.416318
3 0.856182 1.483758
4 3.884287 -0.182623
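Notice that scikit-learn's first component is the negation of ours; principal component directions are only determined up to sign. A small sketch checking that the two transforms agree in magnitude:

# Compare magnitudes, since each component is only determined up to a sign flip.
Z_sklearn = pca.fit_transform(Xstd)
print(np.allclose(np.abs(Z_sklearn), np.abs(Z)))  # expected: True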

Also notice that the covariance matrix of the transformed data is (approximately) diagonal.

In [36]:
pd.DataFrame(np.cov(Z.T))
Out[36]:
0 1
0 2.915514e+00 -3.700743e-16
1 -3.700743e-16 1.037467e+00

Lower Rank Approximation of X¶

Let's now try to recover X from our approximation:

In [37]:
rectangle.head()
Out[37]:
width height area perimeter
0 8 6 48 28
1 2 4 8 12
2 1 3 3 8
3 9 3 27 24
4 9 8 72 34
In [38]:
k = 2  # rank of the approximation
U, S, Vt = np.linalg.svd(Xstd, full_matrices = False)
# Undo the standardization when reconstructing; swap in the identity
# (commented line below) to skip the rescaling.
scaling = np.diag(np.std(X, axis = 0))
# scaling = np.eye(X.shape[1])
Z = U[:,:k] @ np.diag(S[:k])  # project onto the first k principal components

# Reconstruct: map back to feature space, rescale, and re-add the column means.
rectangle_hat = pd.DataFrame(
    (Z @ Vt[:k, :]) @ scaling + np.mean(rectangle, axis = 0).to_numpy(),
    columns = rectangle.columns)

display(rectangle_hat.head(3))

fig = px.scatter_3d(rectangle, x="width", y="height", z="area",
                    width=800, height=800)
fig.add_scatter3d(x=rectangle_hat["width"], y=rectangle_hat["height"], z=rectangle_hat["area"], 
                  mode="markers", name = "approximation")
width height area perimeter
0 8.111911 6.093164 45.859191 28.410150
1 2.103964 4.086549 6.011204 12.381026
2 1.285956 3.238055 -2.470229 9.048021
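One way to quantify how good the rank-2 approximation is (a small sketch) is the per-column root mean squared reconstruction error:

# Root mean squared error between the original data and its rank-2 reconstruction, per column.
rmse = np.sqrt(((rectangle - rectangle_hat) ** 2).mean())
print(rmse.round(3))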