Lecture 22: Principal Components Analysis – Data 100, Fall 2020¶

Notebook originally by Josh Hug (Fall 2019)

Edits by Anthony D. Joseph and Suraj Rampure (Fall 2020)

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

We often create visualizations in order to facilitate exploratory data analysis. For example, we might create scatterplots to explore the relationship between pairs of variables in a dataset.

The dataset below gives the "percentage body fat, age, weight, height, and ten body circumference measurements" for 252 men.

http://jse.amstat.org/v4n1/datasets.johnson.html

For simplicity, we read in only 8 of the provided attributes, yielding the given dataframe.

#http://jse.amstat.org/datasets/fat.txt
df3 = pd.read_fwf("data/fat.dat.txt", colspecs = [(17, 21), (23, 29), (35, 37),
                                             (39, 45), (48, 53), (73, 77),
                                            (80, 85), (88, 93)], header=None, names = ["% fat", "density", "age", "weight", "height", "neck", "chest", "abdomen"])
df3.head()

We see that percentage fat and density in g/cm^3 are almost completely redundant.

sns.scatterplot(data = df3, x = "% fat", y = "density");

By contrast, while there is a strong correlation between neck and chest measurements, the resulting data is very 2 dimensional.

sns.scatterplot(data = df3, x = "neck", y = "chest");

Age and height show a small correlation as peolpe seem to get slightly smaller with greater age in this dataset.

sns.scatterplot(data = df3, x = "age", y = "height");

We note that there is one outlier where a person is slightly less than 29.5 inches tall. While there are a extraordinarily small number of adult males who are less than three feet tall, reflection on the rest of the data from this observation suggest that this was simply an error.

df3.query("height < 40")

df3 = df3.drop(41)

sns.scatterplot(data = df3, x = "age", y = "height");

We can try to visualize more than 2 attributes at once, but the relationships displayed in e.g. the color and dot size space are much harder for human readers to see. For example, above we saw that density and % fat are almost entirely redundant, but this relationship is impossible to see when comparing the colors and dot sizes.

sns.scatterplot(data = df3, x = "neck", y = "chest", hue="density", size = "% fat");

Seaborn gives us the ability to create a matrix of all possible pairs of variables. This is can be useful, though even with only 8 variables it's still difficult to fully digest.

sns.pairplot(df3);

We should note that despite the very strong relationship between % fat and density, the numerical rank of the data matrix is still 8. For the rank to be 7, we'd need the data to be almost exactly on a line. We'll talk about techniques to reduce the dimensionality over the course of this lecture and the next.

np.linalg.matrix_rank(df3)

8

House of Representatives Voting Data¶

Next, let's consider voting data from the house of representatives in the U.S. during the month of September 2019. In this example, our goal will be to try to find clusters of representatives who vote in similar ways. For example, we might expect to find that Democrats and Republicans vote similarly to other members of their party.

from pathlib import Path
from ds100_utils import fetch_and_cache
from datetime import datetime
from IPython.display import display

import yaml

plt.rcParams['figure.figsize'] = (4, 4)
plt.rcParams['figure.dpi'] = 150
sns.set()

base_url = 'https://github.com/unitedstates/congress-legislators/raw/master/'
legislators_path = 'legislators-current.yaml'
f = fetch_and_cache(base_url + legislators_path, legislators_path)
legislators_data = yaml.load(open(f))

def to_date(s):
    return datetime.strptime(s, '%Y-%m-%d')

legs = pd.DataFrame(
    columns=['leg_id', 'first', 'last', 'gender', 'state', 'chamber', 'party', 'birthday'],
    data=[[x['id']['bioguide'], 
           x['name']['first'],
           x['name']['last'],
           x['bio']['gender'],
           x['terms'][-1]['state'],
           x['terms'][-1]['type'],
           x['terms'][-1]['party'],
           to_date(x['bio']['birthday'])] for x in legislators_data])

legs.head(3)

Using cached version that was downloaded (UTC): Tue Nov 17 07:50:34 2020

/opt/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:4: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  after removing the cwd from sys.path.

# February 2019 House of Representatives roll call votes
# Downloaded using https://github.com/eyeseast/propublica-congress
votes = pd.read_csv('data/votes.csv')
votes = votes.astype({"roll call": str}) 
votes.head()

votes.merge(legs, left_on='member', right_on='leg_id').sample(5)

def was_yes(s):
    if s.iloc[0] == 'Yes':
        return 1
    else:
        return 0

vote_pivot = votes.pivot_table(index='member', 
                                columns='roll call', 
                                values='vote', 
                                aggfunc=was_yes, 
                                fill_value=0)
print(vote_pivot.shape)
vote_pivot.head()

(441, 41)

vote_pivot.shape

(441, 41)

This data has 441 observations (members of the House of Representatives including the 6 non-voting representatives) and 41 dimensions (votes). While politics is quite polarized, none of these columns are linearly dependent as we note below.

np.linalg.matrix_rank(vote_pivot)

41

Suppose we want to find clusters of similar voting behavior. We might try by reducing our data to only two dimensions and looking to see if we can identify clear patterns. Let's start by looking at what votes were most controversial.

np.var(vote_pivot, axis=0).sort_values(ascending = False)

roll call
555    0.249988
540    0.249896
530    0.249896
542    0.249783
537    0.249783
533    0.249711
534    0.249711
536    0.249711
543    0.249711
550    0.249628
552    0.249536
549    0.249536
546    0.249536
518    0.249433
517    0.249320
547    0.249320
545    0.249063
553    0.248765
525    0.248425
551    0.248240
531    0.247397
524    0.246389
526    0.246111
521    0.246111
529    0.244898
528    0.244230
527    0.243150
520    0.242378
523    0.241144
522    0.231796
539    0.231796
516    0.221461
538    0.216679
544    0.198066
515    0.112546
541    0.095218
554    0.078743
532    0.071153
535    0.051460
519    0.047398
548    0.043295
dtype: float64

We see that roll call 548 had very little variance. According to http://clerk.house.gov/evs/2019/roll548.xml, this bill was referring to the 2019 Whistleblower Complaint about President Trump and Ukraine. The full text of the house resolution for this roll call can be found at https://www.congress.gov/bill/116th-congress/house-resolution/576/text:

(1) the whistleblower complaint received on August 12, 2019, by the Inspector General of the Intelligence Community shall be transmitted immediately to the Select Committee on Intelligence of the Senate and the Permanent Select Committee on Intelligence of the House of Representatives; and

(2) the Select Committee on Intelligence of the Senate and the Permanent Select Committee on Intelligence of the House of Representatives should be allowed to evaluate the complaint in a deliberate and bipartisan manner consistent with applicable statutes and processes in order to safeguard classified and sensitive information.

We see that 421 congresspeople voted for this resolution, and 12 did not vote for this resolution. 2 members answered "present" but did not vote no, and 10 did not vote at all. Clearly, a scatterplot involving this particular dimension isn't going to be useful.

vote_pivot['548'].value_counts()

1    421
0     20
Name: 548, dtype: int64

By contrast, we saw high variance for most of the other roll call votes. Most them had variances near 0.25, which is the maximum possible for a variable which can take on values 0 or 1. Let's consider the two highest variance variables, shown below:

vote_pivot['555'].value_counts()

1    222
0    219
Name: 555, dtype: int64

vote_pivot['530'].value_counts()

1    225
0    216
Name: 530, dtype: int64

Let's use these as our two dimensions for our scatterplot and see what happens.

sns.scatterplot(x='530', y='555', data=vote_pivot);

By adding some random noise, we can get rid of the overplotting.

vote_pivot_jittered = vote_pivot.copy()
vote_pivot_jittered.loc[:, '515':'555'] += np.random.random(vote_pivot_jittered.loc[:, '515':'555'].shape) * 0.3

sns.scatterplot(x='530', y='555', data=vote_pivot_jittered);

We can also look at this data labeled by party.

vote_pivot_labeled = vote_pivot.reset_index().merge(legs, left_on='member', right_on='leg_id').set_index('member')
vote_pivot_labeled.head(5)

vote_pivot_labeled_jittered = vote_pivot_labeled.copy()
vote_pivot_labeled_jittered.loc[:, '515':'555'] += np.random.random(vote_pivot_labeled_jittered.loc[:, '515':'555'].shape) * 0.3

sns.scatterplot(x='530', y='555', data=vote_pivot_labeled_jittered, hue="party");

sns.scatterplot(x='545', y='552', data=vote_pivot_labeled_jittered, hue="party");

We see that considering only two votes does seem to do a pretty good job of telling Republicans from Democrats. We'll see later in this lecture how we can do even better using a technique called "Principle Component Analysis" or PCA.

Before we can get there, we'll need to spend some time reviewing key linear algebra principles.

Matrix Operations in Python¶

age_and_height = np.array([[182, 28], [399, 30], [725, 33]])

M = np.array([[1, 0, 0], [0, 1, 1/12]])

age_and_height @ M

array([[182.        ,  28.        ,   2.33333333],
       [399.        ,  30.        ,   2.5       ],
       [725.        ,  33.        ,   2.75      ]])

Singular Value Decomposition Experiment¶

Manual Decomposition¶

In the table below, we have the width, height, area, and perimeter of a rectangle stored in a dataframe.

rectangle = pd.read_csv("data/rectangle_data.csv")
rectangle.head(5)

Naturally the perimeter is just the sum of 2x the width and 2x the height. Thus, if we create a new dataframe that has only the width, height, and area...

rectangle_no_perim = rectangle[["width", "height", "area"]]
rectangle_no_perim.head(5)

... then we can recover the perimeter by multiplying this matrix by

1 0 0 2 0 1 0 2 0 0 1 0

transform_3D_to_4D = [[1, 0, 0, 2], [0, 1, 0, 2], [0, 0, 1, 0]]

rectangle_with_perimeter_back = np.array(rectangle_no_perim) @ transform_3D_to_4D
pd.DataFrame(rectangle_with_perimeter_back, columns = ["width", "height", "area", "perimeter"]).head(5)

Singular Value Decomposition Example¶

Singular value decomposition is a numerical technique to (among other things) automatically uncover such redundancies. Given an input matrix X, SVD will return $U\Sigma$ and $V^T$ such that $ X = U \Sigma V^T $.

u, s, vt = np.linalg.svd(rectangle, full_matrices = False)

As we did before with our manual decomposition, we can recover our original rectangle data by multiplying the three return values of this function back together.

pd.DataFrame(u * s @ vt).head(4)

The two key pieces of the decomposition are $U\Sigma$ and $V^T$, which we can think of for now as analogous to our 'data' and 'transformation operation' from our manual decomposition earlier.

Let's start by looking at $U\Sigma$, which we can compute with the Python code u*s.

u*s

array([[-5.63092679e+01,  4.08369641e+00, -7.67968689e-01,
         1.83452516e-15],
       [-1.39258714e+01, -5.61592446e+00,  1.59106852e+00,
        -3.36581235e-15],
       [-7.38836950e+00, -5.11089273e+00,  1.51352951e+00,
        -7.04453569e-17],
       [-3.68444316e+01, -4.80005945e+00, -3.80095908e+00,
        -2.03741937e-16],
       [-7.94726055e+01,  1.30026983e+01,  1.86597851e-01,
        -2.66709514e-16],
       [-7.42135662e+00, -5.11810904e+00, -1.31469604e+00,
         7.80250388e-17],
       [-1.39588585e+01, -5.62314077e+00, -1.23715703e+00,
         5.55717762e-17],
       [-3.79895573e+01, -1.31360807e+00, -2.60712770e-01,
         3.01043718e-16],
       [-1.56692269e+01, -9.65347804e+00, -4.03555325e+00,
         3.52041495e-16],
       [-2.54468092e+01, -7.81311695e+00, -3.92620778e+00,
        -8.48509568e-17],
       [-3.26875093e+01, -2.52515864e+00,  3.87695076e-01,
        -1.26500289e-16],
       [-5.38957011e+01,  2.32104364e+00, -2.20593631e+00,
        -2.37753852e-16],
       [-4.08780385e+01, -1.86471027e+00, -2.34708823e+00,
         5.13492882e-17],
       [-5.34289549e+00, -3.98065864e+00,  7.79631035e-01,
         7.20472757e-17],
       [-2.82033419e+01, -8.33535389e+00,  5.30031897e+00,
         4.89486243e-17],
       [-6.40083890e+01,  7.06150790e+00,  1.43570386e+00,
         1.74060388e-16],
       [-4.32916052e+01, -1.02057499e-01, -9.09120615e-01,
         1.11263532e-16],
       [-1.39258714e+01, -5.61592446e+00,  1.59106852e+00,
         1.29145985e-16],
       [-1.67813655e+01, -6.15981034e+00,  2.33291861e+00,
        -1.70837887e-17],
       [-1.84343928e+01, -4.99433025e+00, -4.79403711e-01,
         7.25592757e-18],
       [-1.77311945e+01, -1.07873203e+01, -4.71576755e+00,
        -5.67149843e-16],
       [-4.85936532e+01,  1.10949307e+00, -1.55752846e+00,
        -1.09949265e-16],
       [-7.94726055e+01,  1.30026983e+01,  1.86597851e-01,
        -2.66709514e-16],
       [-2.94804161e+01, -4.87776777e+00, -2.47233693e+00,
         2.22105316e-17],
       [-3.31624238e+01, -4.83891361e+00, -3.13664801e+00,
         7.19721747e-17],
       [-4.40851318e+01,  4.87898862e-01,  5.12943775e-01,
         3.02603255e-16],
       [-3.18939828e+01, -3.11511500e+00, -1.03436931e+00,
         2.99484180e-16],
       [-4.08780385e+01, -1.86471027e+00, -2.34708823e+00,
         5.13492882e-17],
       [-2.25748215e+01, -7.26562290e+00, -3.25394509e+00,
         1.87126013e-16],
       [-2.29099271e+01, -4.36551973e+00,  2.78349606e-01,
        -3.75451682e-19],
       [-9.48332419e+00, -6.25195129e+00, -1.99491034e+00,
        -2.74769124e-17],
       [-7.09469707e+01,  9.44214673e+00, -6.10913535e-01,
        -2.57057627e-16],
       [-1.39588585e+01, -5.62314077e+00, -1.23715703e+00,
         5.55717762e-17],
       [-5.34289549e+00, -3.98065864e+00,  7.79631035e-01,
         7.20472757e-17],
       [-1.77311945e+01, -1.07873203e+01, -4.71576755e+00,
        -5.67149843e-16],
       [-2.74019549e+01, -3.74031736e+00, -3.78009854e-01,
        -1.65127367e-16],
       [-5.38957011e+01,  2.32104364e+00, -2.20593631e+00,
        -2.37753852e-16],
       [-6.23718552e+01,  5.89241965e+00,  2.83391341e+00,
        -1.81938882e-17],
       [-2.94144418e+01, -4.86333514e+00,  3.18411417e+00,
         3.46804968e-16],
       [-2.94804161e+01, -4.87776777e+00, -2.47233693e+00,
         2.22105316e-17],
       [-5.71027944e+01,  4.67365277e+00,  6.54095701e-01,
         1.76237992e-16],
       [-1.47523850e+01, -5.03318441e+00,  1.84907363e-01,
         1.84935901e-16],
       [-3.26875093e+01, -2.52515864e+00,  3.87695076e-01,
        -1.26500289e-16],
       [-2.94804161e+01, -4.87776777e+00, -2.47233693e+00,
         2.22105316e-17],
       [-2.54468092e+01, -7.81311695e+00, -3.92620778e+00,
        -8.48509568e-17],
       [-1.96368596e+01, -6.70369623e+00,  3.07476870e+00,
        -1.63313563e-16],
       [-1.56692269e+01, -9.65347804e+00, -4.03555325e+00,
         3.52041495e-16],
       [-5.34289549e+00, -3.98065864e+00,  7.79631035e-01,
         7.20472757e-17],
       [-2.20834134e+01, -4.94825977e+00,  1.68451077e+00,
         2.91846763e-16],
       [-1.96368596e+01, -6.70369623e+00,  3.07476870e+00,
        -1.63313563e-16],
       [-3.79730637e+01, -1.30999991e+00,  1.15340001e+00,
        -1.06495470e-16],
       [-1.97931620e+01, -1.19211625e+01, -5.39598185e+00,
         1.11621311e-16],
       [-2.57984083e+01, -4.91662193e+00, -1.80802586e+00,
         2.97924643e-16],
       [-1.96368596e+01, -6.70369623e+00,  3.07476870e+00,
        -1.63313563e-16],
       [-5.62762807e+01,  4.09091272e+00,  2.06025686e+00,
         1.42984452e-16],
       [-1.97931620e+01, -1.19211625e+01, -5.39598185e+00,
         1.11621311e-16],
       [-2.53478478e+01, -7.79146801e+00,  4.55846888e+00,
         1.95178398e-16],
       [-2.94144418e+01, -4.86333514e+00,  3.18411417e+00,
         3.46804968e-16],
       [-3.79895573e+01, -1.31360807e+00, -2.60712770e-01,
         3.01043718e-16],
       [-5.63092679e+01,  4.08369641e+00, -7.67968689e-01,
        -7.51795410e-17],
       [-2.83187968e+01, -8.36061099e+00, -4.59847046e+00,
         2.64707301e-16],
       [-3.67454702e+01, -4.77841051e+00,  4.68371758e+00,
         4.68711369e-17],
       [-1.67813655e+01, -6.15981034e+00,  2.33291861e+00,
        -1.70837887e-17],
       [-9.43384352e+00, -6.24112682e+00,  2.24742798e+00,
         1.12537765e-16],
       [-1.47523850e+01, -5.03318441e+00,  1.84907363e-01,
         1.84935901e-16],
       [-2.82033419e+01, -8.33535389e+00,  5.30031897e+00,
         4.89486243e-17],
       [-3.18939828e+01, -3.11511500e+00, -1.03436931e+00,
         2.99484180e-16],
       [-1.68308462e+01, -6.17063481e+00, -1.90941971e+00,
        -1.42390326e-16],
       [-1.96368596e+01, -6.70369623e+00,  3.07476870e+00,
        -1.63313563e-16],
       [-1.67813655e+01, -6.15981034e+00,  2.33291861e+00,
        -1.70837887e-17],
       [-7.38836950e+00, -5.11089273e+00,  1.51352951e+00,
        -7.04453569e-17],
       [-7.38836950e+00, -5.11089273e+00,  1.51352951e+00,
        -7.04453569e-17],
       [-5.01807063e+01,  2.28940579e+00,  1.28660032e+00,
         3.04162792e-16],
       [-5.63092679e+01,  4.08369641e+00, -7.67968689e-01,
        -7.51795410e-17],
       [-2.24923537e+01, -7.24758212e+00,  3.81661879e+00,
         3.41408172e-16],
       [-3.18609957e+01, -3.10789868e+00,  1.79385624e+00,
        -1.45045689e-16],
       [-3.29742148e+00, -2.85042454e+00,  4.57325598e-02,
         1.11175616e-17],
       [-7.94726055e+01,  1.30026983e+01,  1.86597851e-01,
        -2.66709514e-16],
       [-3.26875093e+01, -2.52515864e+00,  3.87695076e-01,
        -1.26500289e-16],
       [-3.67454702e+01, -4.77841051e+00,  4.68371758e+00,
         4.68711369e-17],
       [-2.53478478e+01, -7.79146801e+00,  4.55846888e+00,
         1.95178398e-16],
       [-8.21488316e+00, -4.52815268e+00,  1.07368347e-01,
         1.47608402e-17],
       [-3.31624238e+01, -4.83891361e+00, -3.13664801e+00,
         7.19721747e-17],
       [-1.76157396e+01, -1.07620632e+01,  5.18302188e+00,
         1.93518744e-16],
       [-1.15452918e+01, -7.38579354e+00, -2.67512465e+00,
        -2.95716741e-16],
       [-4.32586181e+01, -9.48411854e-02,  1.91910494e+00,
        -2.63936669e-16],
       [-5.38957011e+01,  2.32104364e+00, -2.20593631e+00,
        -2.37753852e-16],
       [-7.42135662e+00, -5.11810904e+00, -1.31469604e+00,
         7.80250388e-17],
       [-7.09469707e+01,  9.44214673e+00, -6.10913535e-01,
        -2.57057627e-16],
       [-4.85936532e+01,  1.10949307e+00, -1.55752846e+00,
        -1.09949265e-16],
       [-7.94726055e+01,  1.30026983e+01,  1.86597851e-01,
        -2.66709514e-16],
       [-2.73854614e+01, -3.73670921e+00,  1.03610292e+00,
        -1.54079509e-16],
       [-2.73854614e+01, -3.73670921e+00,  1.03610292e+00,
        -1.54079509e-16],
       [-1.14793175e+01, -7.37136091e+00,  2.98132646e+00,
        -2.99548674e-17],
       [-4.40851318e+01,  4.87898862e-01,  5.12943775e-01,
         3.02603255e-16],
       [-4.85936532e+01,  1.10949307e+00, -1.55752846e+00,
        -1.09949265e-16],
       [-6.40248825e+01,  7.05789975e+00,  2.15910829e-02,
         3.36050508e-17],
       [-9.43384352e+00, -6.24112682e+00,  2.24742798e+00,
         1.12537765e-16],
       [-1.35247915e+01, -8.50159500e+00,  3.71522493e+00,
        -1.72447500e-16],
       [-1.96368596e+01, -6.70369623e+00,  3.07476870e+00,
        -1.63313563e-16]])

Similarly, we can look at vt.

vt

array([[-1.46435746e-01, -1.29942186e-01, -8.10020095e-01,
        -5.52755864e-01],
       [-1.92735898e-01, -1.89127741e-01,  5.86348204e-01,
        -7.63727278e-01],
       [-7.04957447e-01,  7.09155330e-01,  7.95161387e-03,
         8.39576575e-03],
       [-6.66666667e-01, -6.66666667e-01, -1.08794322e-16,
         3.33333333e-01]])

The automatic decomposition returned by the svd function looks quite different than what we got when we manually decomposed our data into "data" and "operations". That is, vt is a bunch of seemingly arbitrary numbers instead of the rather simple:

1 0 0 2 0 1 0 2 0 0 1 0

Similarly, if we look at the shape of $U\Sigma$ and $V^T$ we see that they are bigger than in our manual decomposition. Specifically $U\Sigma$ still has 4 columns, meaning that each observation is 4 dimensional. Furthermore, rather than our transformation operation $V^T$ being 3x4, it's 4x4 rows tall, meaning that it maps 4 dimensional inputs back to 4 dimensions.

This seems problematic, because our goal of using SVD was to find a transformation operation that takes 3D inputs and maps them up to 4 dimensions.

Luckily, if we look carefully at $U\Sigma$, we see that the last attribute of each observation is very close to 0.

u * s

array([[-5.63092679e+01,  4.08369641e+00, -7.67968689e-01,
         1.83452516e-15],
       [-1.39258714e+01, -5.61592446e+00,  1.59106852e+00,
        -3.36581235e-15],
       [-7.38836950e+00, -5.11089273e+00,  1.51352951e+00,
        -7.04453569e-17],
       [-3.68444316e+01, -4.80005945e+00, -3.80095908e+00,
        -2.03741937e-16],
       [-7.94726055e+01,  1.30026983e+01,  1.86597851e-01,
        -2.66709514e-16],
       [-7.42135662e+00, -5.11810904e+00, -1.31469604e+00,
         7.80250388e-17],
       [-1.39588585e+01, -5.62314077e+00, -1.23715703e+00,
         5.55717762e-17],
       [-3.79895573e+01, -1.31360807e+00, -2.60712770e-01,
         3.01043718e-16],
       [-1.56692269e+01, -9.65347804e+00, -4.03555325e+00,
         3.52041495e-16],
       [-2.54468092e+01, -7.81311695e+00, -3.92620778e+00,
        -8.48509568e-17],
       [-3.26875093e+01, -2.52515864e+00,  3.87695076e-01,
        -1.26500289e-16],
       [-5.38957011e+01,  2.32104364e+00, -2.20593631e+00,
        -2.37753852e-16],
       [-4.08780385e+01, -1.86471027e+00, -2.34708823e+00,
         5.13492882e-17],
       [-5.34289549e+00, -3.98065864e+00,  7.79631035e-01,
         7.20472757e-17],
       [-2.82033419e+01, -8.33535389e+00,  5.30031897e+00,
         4.89486243e-17],
       [-6.40083890e+01,  7.06150790e+00,  1.43570386e+00,
         1.74060388e-16],
       [-4.32916052e+01, -1.02057499e-01, -9.09120615e-01,
         1.11263532e-16],
       [-1.39258714e+01, -5.61592446e+00,  1.59106852e+00,
         1.29145985e-16],
       [-1.67813655e+01, -6.15981034e+00,  2.33291861e+00,
        -1.70837887e-17],
       [-1.84343928e+01, -4.99433025e+00, -4.79403711e-01,
         7.25592757e-18],
       [-1.77311945e+01, -1.07873203e+01, -4.71576755e+00,
        -5.67149843e-16],
       [-4.85936532e+01,  1.10949307e+00, -1.55752846e+00,
        -1.09949265e-16],
       [-7.94726055e+01,  1.30026983e+01,  1.86597851e-01,
        -2.66709514e-16],
       [-2.94804161e+01, -4.87776777e+00, -2.47233693e+00,
         2.22105316e-17],
       [-3.31624238e+01, -4.83891361e+00, -3.13664801e+00,
         7.19721747e-17],
       [-4.40851318e+01,  4.87898862e-01,  5.12943775e-01,
         3.02603255e-16],
       [-3.18939828e+01, -3.11511500e+00, -1.03436931e+00,
         2.99484180e-16],
       [-4.08780385e+01, -1.86471027e+00, -2.34708823e+00,
         5.13492882e-17],
       [-2.25748215e+01, -7.26562290e+00, -3.25394509e+00,
         1.87126013e-16],
       [-2.29099271e+01, -4.36551973e+00,  2.78349606e-01,
        -3.75451682e-19],
       [-9.48332419e+00, -6.25195129e+00, -1.99491034e+00,
        -2.74769124e-17],
       [-7.09469707e+01,  9.44214673e+00, -6.10913535e-01,
        -2.57057627e-16],
       [-1.39588585e+01, -5.62314077e+00, -1.23715703e+00,
         5.55717762e-17],
       [-5.34289549e+00, -3.98065864e+00,  7.79631035e-01,
         7.20472757e-17],
       [-1.77311945e+01, -1.07873203e+01, -4.71576755e+00,
        -5.67149843e-16],
       [-2.74019549e+01, -3.74031736e+00, -3.78009854e-01,
        -1.65127367e-16],
       [-5.38957011e+01,  2.32104364e+00, -2.20593631e+00,
        -2.37753852e-16],
       [-6.23718552e+01,  5.89241965e+00,  2.83391341e+00,
        -1.81938882e-17],
       [-2.94144418e+01, -4.86333514e+00,  3.18411417e+00,
         3.46804968e-16],
       [-2.94804161e+01, -4.87776777e+00, -2.47233693e+00,
         2.22105316e-17],
       [-5.71027944e+01,  4.67365277e+00,  6.54095701e-01,
         1.76237992e-16],
       [-1.47523850e+01, -5.03318441e+00,  1.84907363e-01,
         1.84935901e-16],
       [-3.26875093e+01, -2.52515864e+00,  3.87695076e-01,
        -1.26500289e-16],
       [-2.94804161e+01, -4.87776777e+00, -2.47233693e+00,
         2.22105316e-17],
       [-2.54468092e+01, -7.81311695e+00, -3.92620778e+00,
        -8.48509568e-17],
       [-1.96368596e+01, -6.70369623e+00,  3.07476870e+00,
        -1.63313563e-16],
       [-1.56692269e+01, -9.65347804e+00, -4.03555325e+00,
         3.52041495e-16],
       [-5.34289549e+00, -3.98065864e+00,  7.79631035e-01,
         7.20472757e-17],
       [-2.20834134e+01, -4.94825977e+00,  1.68451077e+00,
         2.91846763e-16],
       [-1.96368596e+01, -6.70369623e+00,  3.07476870e+00,
        -1.63313563e-16],
       [-3.79730637e+01, -1.30999991e+00,  1.15340001e+00,
        -1.06495470e-16],
       [-1.97931620e+01, -1.19211625e+01, -5.39598185e+00,
         1.11621311e-16],
       [-2.57984083e+01, -4.91662193e+00, -1.80802586e+00,
         2.97924643e-16],
       [-1.96368596e+01, -6.70369623e+00,  3.07476870e+00,
        -1.63313563e-16],
       [-5.62762807e+01,  4.09091272e+00,  2.06025686e+00,
         1.42984452e-16],
       [-1.97931620e+01, -1.19211625e+01, -5.39598185e+00,
         1.11621311e-16],
       [-2.53478478e+01, -7.79146801e+00,  4.55846888e+00,
         1.95178398e-16],
       [-2.94144418e+01, -4.86333514e+00,  3.18411417e+00,
         3.46804968e-16],
       [-3.79895573e+01, -1.31360807e+00, -2.60712770e-01,
         3.01043718e-16],
       [-5.63092679e+01,  4.08369641e+00, -7.67968689e-01,
        -7.51795410e-17],
       [-2.83187968e+01, -8.36061099e+00, -4.59847046e+00,
         2.64707301e-16],
       [-3.67454702e+01, -4.77841051e+00,  4.68371758e+00,
         4.68711369e-17],
       [-1.67813655e+01, -6.15981034e+00,  2.33291861e+00,
        -1.70837887e-17],
       [-9.43384352e+00, -6.24112682e+00,  2.24742798e+00,
         1.12537765e-16],
       [-1.47523850e+01, -5.03318441e+00,  1.84907363e-01,
         1.84935901e-16],
       [-2.82033419e+01, -8.33535389e+00,  5.30031897e+00,
         4.89486243e-17],
       [-3.18939828e+01, -3.11511500e+00, -1.03436931e+00,
         2.99484180e-16],
       [-1.68308462e+01, -6.17063481e+00, -1.90941971e+00,
        -1.42390326e-16],
       [-1.96368596e+01, -6.70369623e+00,  3.07476870e+00,
        -1.63313563e-16],
       [-1.67813655e+01, -6.15981034e+00,  2.33291861e+00,
        -1.70837887e-17],
       [-7.38836950e+00, -5.11089273e+00,  1.51352951e+00,
        -7.04453569e-17],
       [-7.38836950e+00, -5.11089273e+00,  1.51352951e+00,
        -7.04453569e-17],
       [-5.01807063e+01,  2.28940579e+00,  1.28660032e+00,
         3.04162792e-16],
       [-5.63092679e+01,  4.08369641e+00, -7.67968689e-01,
        -7.51795410e-17],
       [-2.24923537e+01, -7.24758212e+00,  3.81661879e+00,
         3.41408172e-16],
       [-3.18609957e+01, -3.10789868e+00,  1.79385624e+00,
        -1.45045689e-16],
       [-3.29742148e+00, -2.85042454e+00,  4.57325598e-02,
         1.11175616e-17],
       [-7.94726055e+01,  1.30026983e+01,  1.86597851e-01,
        -2.66709514e-16],
       [-3.26875093e+01, -2.52515864e+00,  3.87695076e-01,
        -1.26500289e-16],
       [-3.67454702e+01, -4.77841051e+00,  4.68371758e+00,
         4.68711369e-17],
       [-2.53478478e+01, -7.79146801e+00,  4.55846888e+00,
         1.95178398e-16],
       [-8.21488316e+00, -4.52815268e+00,  1.07368347e-01,
         1.47608402e-17],
       [-3.31624238e+01, -4.83891361e+00, -3.13664801e+00,
         7.19721747e-17],
       [-1.76157396e+01, -1.07620632e+01,  5.18302188e+00,
         1.93518744e-16],
       [-1.15452918e+01, -7.38579354e+00, -2.67512465e+00,
        -2.95716741e-16],
       [-4.32586181e+01, -9.48411854e-02,  1.91910494e+00,
        -2.63936669e-16],
       [-5.38957011e+01,  2.32104364e+00, -2.20593631e+00,
        -2.37753852e-16],
       [-7.42135662e+00, -5.11810904e+00, -1.31469604e+00,
         7.80250388e-17],
       [-7.09469707e+01,  9.44214673e+00, -6.10913535e-01,
        -2.57057627e-16],
       [-4.85936532e+01,  1.10949307e+00, -1.55752846e+00,
        -1.09949265e-16],
       [-7.94726055e+01,  1.30026983e+01,  1.86597851e-01,
        -2.66709514e-16],
       [-2.73854614e+01, -3.73670921e+00,  1.03610292e+00,
        -1.54079509e-16],
       [-2.73854614e+01, -3.73670921e+00,  1.03610292e+00,
        -1.54079509e-16],
       [-1.14793175e+01, -7.37136091e+00,  2.98132646e+00,
        -2.99548674e-17],
       [-4.40851318e+01,  4.87898862e-01,  5.12943775e-01,
         3.02603255e-16],
       [-4.85936532e+01,  1.10949307e+00, -1.55752846e+00,
        -1.09949265e-16],
       [-6.40248825e+01,  7.05789975e+00,  2.15910829e-02,
         3.36050508e-17],
       [-9.43384352e+00, -6.24112682e+00,  2.24742798e+00,
         1.12537765e-16],
       [-1.35247915e+01, -8.50159500e+00,  3.71522493e+00,
        -1.72447500e-16],
       [-1.96368596e+01, -6.70369623e+00,  3.07476870e+00,
        -1.63313563e-16]])

Thus, it makes sense that we remove the last column of $U \Sigma$.

u = u[:, 0:3]
s = s[0:3]
u * s

array([[-5.63092679e+01,  4.08369641e+00, -7.67968689e-01],
       [-1.39258714e+01, -5.61592446e+00,  1.59106852e+00],
       [-7.38836950e+00, -5.11089273e+00,  1.51352951e+00],
       [-3.68444316e+01, -4.80005945e+00, -3.80095908e+00],
       [-7.94726055e+01,  1.30026983e+01,  1.86597851e-01],
       [-7.42135662e+00, -5.11810904e+00, -1.31469604e+00],
       [-1.39588585e+01, -5.62314077e+00, -1.23715703e+00],
       [-3.79895573e+01, -1.31360807e+00, -2.60712770e-01],
       [-1.56692269e+01, -9.65347804e+00, -4.03555325e+00],
       [-2.54468092e+01, -7.81311695e+00, -3.92620778e+00],
       [-3.26875093e+01, -2.52515864e+00,  3.87695076e-01],
       [-5.38957011e+01,  2.32104364e+00, -2.20593631e+00],
       [-4.08780385e+01, -1.86471027e+00, -2.34708823e+00],
       [-5.34289549e+00, -3.98065864e+00,  7.79631035e-01],
       [-2.82033419e+01, -8.33535389e+00,  5.30031897e+00],
       [-6.40083890e+01,  7.06150790e+00,  1.43570386e+00],
       [-4.32916052e+01, -1.02057499e-01, -9.09120615e-01],
       [-1.39258714e+01, -5.61592446e+00,  1.59106852e+00],
       [-1.67813655e+01, -6.15981034e+00,  2.33291861e+00],
       [-1.84343928e+01, -4.99433025e+00, -4.79403711e-01],
       [-1.77311945e+01, -1.07873203e+01, -4.71576755e+00],
       [-4.85936532e+01,  1.10949307e+00, -1.55752846e+00],
       [-7.94726055e+01,  1.30026983e+01,  1.86597851e-01],
       [-2.94804161e+01, -4.87776777e+00, -2.47233693e+00],
       [-3.31624238e+01, -4.83891361e+00, -3.13664801e+00],
       [-4.40851318e+01,  4.87898862e-01,  5.12943775e-01],
       [-3.18939828e+01, -3.11511500e+00, -1.03436931e+00],
       [-4.08780385e+01, -1.86471027e+00, -2.34708823e+00],
       [-2.25748215e+01, -7.26562290e+00, -3.25394509e+00],
       [-2.29099271e+01, -4.36551973e+00,  2.78349606e-01],
       [-9.48332419e+00, -6.25195129e+00, -1.99491034e+00],
       [-7.09469707e+01,  9.44214673e+00, -6.10913535e-01],
       [-1.39588585e+01, -5.62314077e+00, -1.23715703e+00],
       [-5.34289549e+00, -3.98065864e+00,  7.79631035e-01],
       [-1.77311945e+01, -1.07873203e+01, -4.71576755e+00],
       [-2.74019549e+01, -3.74031736e+00, -3.78009854e-01],
       [-5.38957011e+01,  2.32104364e+00, -2.20593631e+00],
       [-6.23718552e+01,  5.89241965e+00,  2.83391341e+00],
       [-2.94144418e+01, -4.86333514e+00,  3.18411417e+00],
       [-2.94804161e+01, -4.87776777e+00, -2.47233693e+00],
       [-5.71027944e+01,  4.67365277e+00,  6.54095701e-01],
       [-1.47523850e+01, -5.03318441e+00,  1.84907363e-01],
       [-3.26875093e+01, -2.52515864e+00,  3.87695076e-01],
       [-2.94804161e+01, -4.87776777e+00, -2.47233693e+00],
       [-2.54468092e+01, -7.81311695e+00, -3.92620778e+00],
       [-1.96368596e+01, -6.70369623e+00,  3.07476870e+00],
       [-1.56692269e+01, -9.65347804e+00, -4.03555325e+00],
       [-5.34289549e+00, -3.98065864e+00,  7.79631035e-01],
       [-2.20834134e+01, -4.94825977e+00,  1.68451077e+00],
       [-1.96368596e+01, -6.70369623e+00,  3.07476870e+00],
       [-3.79730637e+01, -1.30999991e+00,  1.15340001e+00],
       [-1.97931620e+01, -1.19211625e+01, -5.39598185e+00],
       [-2.57984083e+01, -4.91662193e+00, -1.80802586e+00],
       [-1.96368596e+01, -6.70369623e+00,  3.07476870e+00],
       [-5.62762807e+01,  4.09091272e+00,  2.06025686e+00],
       [-1.97931620e+01, -1.19211625e+01, -5.39598185e+00],
       [-2.53478478e+01, -7.79146801e+00,  4.55846888e+00],
       [-2.94144418e+01, -4.86333514e+00,  3.18411417e+00],
       [-3.79895573e+01, -1.31360807e+00, -2.60712770e-01],
       [-5.63092679e+01,  4.08369641e+00, -7.67968689e-01],
       [-2.83187968e+01, -8.36061099e+00, -4.59847046e+00],
       [-3.67454702e+01, -4.77841051e+00,  4.68371758e+00],
       [-1.67813655e+01, -6.15981034e+00,  2.33291861e+00],
       [-9.43384352e+00, -6.24112682e+00,  2.24742798e+00],
       [-1.47523850e+01, -5.03318441e+00,  1.84907363e-01],
       [-2.82033419e+01, -8.33535389e+00,  5.30031897e+00],
       [-3.18939828e+01, -3.11511500e+00, -1.03436931e+00],
       [-1.68308462e+01, -6.17063481e+00, -1.90941971e+00],
       [-1.96368596e+01, -6.70369623e+00,  3.07476870e+00],
       [-1.67813655e+01, -6.15981034e+00,  2.33291861e+00],
       [-7.38836950e+00, -5.11089273e+00,  1.51352951e+00],
       [-7.38836950e+00, -5.11089273e+00,  1.51352951e+00],
       [-5.01807063e+01,  2.28940579e+00,  1.28660032e+00],
       [-5.63092679e+01,  4.08369641e+00, -7.67968689e-01],
       [-2.24923537e+01, -7.24758212e+00,  3.81661879e+00],
       [-3.18609957e+01, -3.10789868e+00,  1.79385624e+00],
       [-3.29742148e+00, -2.85042454e+00,  4.57325598e-02],
       [-7.94726055e+01,  1.30026983e+01,  1.86597851e-01],
       [-3.26875093e+01, -2.52515864e+00,  3.87695076e-01],
       [-3.67454702e+01, -4.77841051e+00,  4.68371758e+00],
       [-2.53478478e+01, -7.79146801e+00,  4.55846888e+00],
       [-8.21488316e+00, -4.52815268e+00,  1.07368347e-01],
       [-3.31624238e+01, -4.83891361e+00, -3.13664801e+00],
       [-1.76157396e+01, -1.07620632e+01,  5.18302188e+00],
       [-1.15452918e+01, -7.38579354e+00, -2.67512465e+00],
       [-4.32586181e+01, -9.48411854e-02,  1.91910494e+00],
       [-5.38957011e+01,  2.32104364e+00, -2.20593631e+00],
       [-7.42135662e+00, -5.11810904e+00, -1.31469604e+00],
       [-7.09469707e+01,  9.44214673e+00, -6.10913535e-01],
       [-4.85936532e+01,  1.10949307e+00, -1.55752846e+00],
       [-7.94726055e+01,  1.30026983e+01,  1.86597851e-01],
       [-2.73854614e+01, -3.73670921e+00,  1.03610292e+00],
       [-2.73854614e+01, -3.73670921e+00,  1.03610292e+00],
       [-1.14793175e+01, -7.37136091e+00,  2.98132646e+00],
       [-4.40851318e+01,  4.87898862e-01,  5.12943775e-01],
       [-4.85936532e+01,  1.10949307e+00, -1.55752846e+00],
       [-6.40248825e+01,  7.05789975e+00,  2.15910829e-02],
       [-9.43384352e+00, -6.24112682e+00,  2.24742798e+00],
       [-1.35247915e+01, -8.50159500e+00,  3.71522493e+00],
       [-1.96368596e+01, -6.70369623e+00,  3.07476870e+00]])

Similarly, because the observations are now 3D, we should remove the last row of $V^T$, since we want to use $V^T$ to map our now 3D data into 4D (rather than expecting 4D input data).

vt = vt[0:3, :]

After removing the redundant portions of $U\Sigma$ and $V^T$, we can verify that multiplying them together again yields our original array.

u * s @ vt

array([[ 8.,  6., 48., 28.],
       [ 2.,  4.,  8., 12.],
       [ 1.,  3.,  3.,  8.],
       [ 9.,  3., 27., 24.],
       [ 9.,  8., 72., 34.],
       [ 3.,  1.,  3.,  8.],
       [ 4.,  2.,  8., 12.],
       [ 6.,  5., 30., 22.],
       [ 7.,  1.,  7., 16.],
       [ 8.,  2., 16., 20.],
       [ 5.,  5., 25., 20.],
       [ 9.,  5., 45., 28.],
       [ 8.,  4., 32., 24.],
       [ 1.,  2.,  2.,  6.],
       [ 2.,  9., 18., 22.],
       [ 7.,  8., 56., 30.],
       [ 7.,  5., 35., 24.],
       [ 2.,  4.,  8., 12.],
       [ 2.,  5., 10., 14.],
       [ 4.,  3., 12., 14.],
       [ 8.,  1.,  8., 18.],
       [ 8.,  5., 40., 26.],
       [ 9.,  8., 72., 34.],
       [ 7.,  3., 21., 20.],
       [ 8.,  3., 24., 22.],
       [ 6.,  6., 36., 24.],
       [ 6.,  4., 24., 20.],
       [ 8.,  4., 32., 24.],
       [ 7.,  2., 14., 18.],
       [ 4.,  4., 16., 16.],
       [ 4.,  1.,  4., 10.],
       [ 9.,  7., 63., 32.],
       [ 4.,  2.,  8., 12.],
       [ 1.,  2.,  2.,  6.],
       [ 8.,  1.,  8., 18.],
       [ 5.,  4., 20., 18.],
       [ 9.,  5., 45., 28.],
       [ 6.,  9., 54., 30.],
       [ 3.,  7., 21., 20.],
       [ 7.,  3., 21., 20.],
       [ 7.,  7., 49., 28.],
       [ 3.,  3.,  9., 12.],
       [ 5.,  5., 25., 20.],
       [ 7.,  3., 21., 20.],
       [ 8.,  2., 16., 20.],
       [ 2.,  6., 12., 16.],
       [ 7.,  1.,  7., 16.],
       [ 1.,  2.,  2.,  6.],
       [ 3.,  5., 15., 16.],
       [ 2.,  6., 12., 16.],
       [ 5.,  6., 30., 22.],
       [ 9.,  1.,  9., 20.],
       [ 6.,  3., 18., 18.],
       [ 2.,  6., 12., 16.],
       [ 6.,  8., 48., 28.],
       [ 9.,  1.,  9., 20.],
       [ 2.,  8., 16., 20.],
       [ 3.,  7., 21., 20.],
       [ 6.,  5., 30., 22.],
       [ 8.,  6., 48., 28.],
       [ 9.,  2., 18., 22.],
       [ 3.,  9., 27., 24.],
       [ 2.,  5., 10., 14.],
       [ 1.,  4.,  4., 10.],
       [ 3.,  3.,  9., 12.],
       [ 2.,  9., 18., 22.],
       [ 6.,  4., 24., 20.],
       [ 5.,  2., 10., 14.],
       [ 2.,  6., 12., 16.],
       [ 2.,  5., 10., 14.],
       [ 1.,  3.,  3.,  8.],
       [ 1.,  3.,  3.,  8.],
       [ 6.,  7., 42., 26.],
       [ 8.,  6., 48., 28.],
       [ 2.,  7., 14., 18.],
       [ 4.,  6., 24., 20.],
       [ 1.,  1.,  1.,  4.],
       [ 9.,  8., 72., 34.],
       [ 5.,  5., 25., 20.],
       [ 3.,  9., 27., 24.],
       [ 2.,  8., 16., 20.],
       [ 2.,  2.,  4.,  8.],
       [ 8.,  3., 24., 22.],
       [ 1.,  8.,  8., 18.],
       [ 5.,  1.,  5., 12.],
       [ 5.,  7., 35., 24.],
       [ 9.,  5., 45., 28.],
       [ 3.,  1.,  3.,  8.],
       [ 9.,  7., 63., 32.],
       [ 8.,  5., 40., 26.],
       [ 9.,  8., 72., 34.],
       [ 4.,  5., 20., 18.],
       [ 4.,  5., 20., 18.],
       [ 1.,  5.,  5., 12.],
       [ 6.,  6., 36., 24.],
       [ 8.,  5., 40., 26.],
       [ 8.,  7., 56., 30.],
       [ 1.,  4.,  4., 10.],
       [ 1.,  6.,  6., 14.],
       [ 2.,  6., 12., 16.]])

Singular Value Decomposition Experiment¶

The reasons that $U \Sigma$ and $V^T$ look so different than the results of our manual decomposition are a consequence of how singular value decomposition. Specifically, given X, SVD computers $U$, $\Sigma$, and $V$ such that:

$\Sigma$ is a diagonal matrix containing the singular values of X.
$U$ and $V$ are matrices whose columns form an orthonormal set.
$X = U \Sigma V^T$

That is, there are an infinite number of matrices such that $X = AB$. In our example above, we created A and B through manual calculation and insight (recognizing that perimeter was 2w + 2h). SVD computes them automatically, and the results have the specific properties enumerated above.

u, s, vt = np.linalg.svd(rectangle, full_matrices = False)

The Singular Values Matrix $\Sigma$¶

The middle result returned by the SVD process is a diagonal matrix consisting of the singular values. We won't say a lot about what they mean today, but we will note that if the matrix is of rank r, then the first r singular values will be non-zero, and the rest will be zero.

s

array([3.62932568e+02, 6.29904732e+01, 2.56544651e+01, 4.29718368e-15])

Python returns the singular values in a slightly funny format (as a list). To get them into the correct form, we can use diag(s).

np.diag(s)

array([[3.62932568e+02, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 6.29904732e+01, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 2.56544651e+01, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.29718368e-15]])

If we do this, then we can compute the original matrix using only the matrix multiplication operator. That is, rather than writing u * s @ vt, we can do u @ np.diag(s) @ vt.

(u @ np.diag(s) @ vt)[0:5, :]

array([[ 8.,  6., 48., 28.],
       [ 2.,  4.,  8., 12.],
       [ 1.,  3.,  3.,  8.],
       [ 9.,  3., 27., 24.],
       [ 9.,  8., 72., 34.]])

The fact that the last singular value is zero is why we were able to remove the last dimension of $U$ and the last operation in $V^T$. That is, since the last column of $U \Sigma$ is always zero, we can just remove it completely.

(u @ np.diag(s))

array([[-5.63092679e+01,  4.08369641e+00, -7.67968689e-01,
         1.83452516e-15],
       [-1.39258714e+01, -5.61592446e+00,  1.59106852e+00,
        -3.36581235e-15],
       [-7.38836950e+00, -5.11089273e+00,  1.51352951e+00,
        -7.04453569e-17],
       [-3.68444316e+01, -4.80005945e+00, -3.80095908e+00,
        -2.03741937e-16],
       [-7.94726055e+01,  1.30026983e+01,  1.86597851e-01,
        -2.66709514e-16],
       [-7.42135662e+00, -5.11810904e+00, -1.31469604e+00,
         7.80250388e-17],
       [-1.39588585e+01, -5.62314077e+00, -1.23715703e+00,
         5.55717762e-17],
       [-3.79895573e+01, -1.31360807e+00, -2.60712770e-01,
         3.01043718e-16],
       [-1.56692269e+01, -9.65347804e+00, -4.03555325e+00,
         3.52041495e-16],
       [-2.54468092e+01, -7.81311695e+00, -3.92620778e+00,
        -8.48509568e-17],
       [-3.26875093e+01, -2.52515864e+00,  3.87695076e-01,
        -1.26500289e-16],
       [-5.38957011e+01,  2.32104364e+00, -2.20593631e+00,
        -2.37753852e-16],
       [-4.08780385e+01, -1.86471027e+00, -2.34708823e+00,
         5.13492882e-17],
       [-5.34289549e+00, -3.98065864e+00,  7.79631035e-01,
         7.20472757e-17],
       [-2.82033419e+01, -8.33535389e+00,  5.30031897e+00,
         4.89486243e-17],
       [-6.40083890e+01,  7.06150790e+00,  1.43570386e+00,
         1.74060388e-16],
       [-4.32916052e+01, -1.02057499e-01, -9.09120615e-01,
         1.11263532e-16],
       [-1.39258714e+01, -5.61592446e+00,  1.59106852e+00,
         1.29145985e-16],
       [-1.67813655e+01, -6.15981034e+00,  2.33291861e+00,
        -1.70837887e-17],
       [-1.84343928e+01, -4.99433025e+00, -4.79403711e-01,
         7.25592757e-18],
       [-1.77311945e+01, -1.07873203e+01, -4.71576755e+00,
        -5.67149843e-16],
       [-4.85936532e+01,  1.10949307e+00, -1.55752846e+00,
        -1.09949265e-16],
       [-7.94726055e+01,  1.30026983e+01,  1.86597851e-01,
        -2.66709514e-16],
       [-2.94804161e+01, -4.87776777e+00, -2.47233693e+00,
         2.22105316e-17],
       [-3.31624238e+01, -4.83891361e+00, -3.13664801e+00,
         7.19721747e-17],
       [-4.40851318e+01,  4.87898862e-01,  5.12943775e-01,
         3.02603255e-16],
       [-3.18939828e+01, -3.11511500e+00, -1.03436931e+00,
         2.99484180e-16],
       [-4.08780385e+01, -1.86471027e+00, -2.34708823e+00,
         5.13492882e-17],
       [-2.25748215e+01, -7.26562290e+00, -3.25394509e+00,
         1.87126013e-16],
       [-2.29099271e+01, -4.36551973e+00,  2.78349606e-01,
        -3.75451682e-19],
       [-9.48332419e+00, -6.25195129e+00, -1.99491034e+00,
        -2.74769124e-17],
       [-7.09469707e+01,  9.44214673e+00, -6.10913535e-01,
        -2.57057627e-16],
       [-1.39588585e+01, -5.62314077e+00, -1.23715703e+00,
         5.55717762e-17],
       [-5.34289549e+00, -3.98065864e+00,  7.79631035e-01,
         7.20472757e-17],
       [-1.77311945e+01, -1.07873203e+01, -4.71576755e+00,
        -5.67149843e-16],
       [-2.74019549e+01, -3.74031736e+00, -3.78009854e-01,
        -1.65127367e-16],
       [-5.38957011e+01,  2.32104364e+00, -2.20593631e+00,
        -2.37753852e-16],
       [-6.23718552e+01,  5.89241965e+00,  2.83391341e+00,
        -1.81938882e-17],
       [-2.94144418e+01, -4.86333514e+00,  3.18411417e+00,
         3.46804968e-16],
       [-2.94804161e+01, -4.87776777e+00, -2.47233693e+00,
         2.22105316e-17],
       [-5.71027944e+01,  4.67365277e+00,  6.54095701e-01,
         1.76237992e-16],
       [-1.47523850e+01, -5.03318441e+00,  1.84907363e-01,
         1.84935901e-16],
       [-3.26875093e+01, -2.52515864e+00,  3.87695076e-01,
        -1.26500289e-16],
       [-2.94804161e+01, -4.87776777e+00, -2.47233693e+00,
         2.22105316e-17],
       [-2.54468092e+01, -7.81311695e+00, -3.92620778e+00,
        -8.48509568e-17],
       [-1.96368596e+01, -6.70369623e+00,  3.07476870e+00,
        -1.63313563e-16],
       [-1.56692269e+01, -9.65347804e+00, -4.03555325e+00,
         3.52041495e-16],
       [-5.34289549e+00, -3.98065864e+00,  7.79631035e-01,
         7.20472757e-17],
       [-2.20834134e+01, -4.94825977e+00,  1.68451077e+00,
         2.91846763e-16],
       [-1.96368596e+01, -6.70369623e+00,  3.07476870e+00,
        -1.63313563e-16],
       [-3.79730637e+01, -1.30999991e+00,  1.15340001e+00,
        -1.06495470e-16],
       [-1.97931620e+01, -1.19211625e+01, -5.39598185e+00,
         1.11621311e-16],
       [-2.57984083e+01, -4.91662193e+00, -1.80802586e+00,
         2.97924643e-16],
       [-1.96368596e+01, -6.70369623e+00,  3.07476870e+00,
        -1.63313563e-16],
       [-5.62762807e+01,  4.09091272e+00,  2.06025686e+00,
         1.42984452e-16],
       [-1.97931620e+01, -1.19211625e+01, -5.39598185e+00,
         1.11621311e-16],
       [-2.53478478e+01, -7.79146801e+00,  4.55846888e+00,
         1.95178398e-16],
       [-2.94144418e+01, -4.86333514e+00,  3.18411417e+00,
         3.46804968e-16],
       [-3.79895573e+01, -1.31360807e+00, -2.60712770e-01,
         3.01043718e-16],
       [-5.63092679e+01,  4.08369641e+00, -7.67968689e-01,
        -7.51795410e-17],
       [-2.83187968e+01, -8.36061099e+00, -4.59847046e+00,
         2.64707301e-16],
       [-3.67454702e+01, -4.77841051e+00,  4.68371758e+00,
         4.68711369e-17],
       [-1.67813655e+01, -6.15981034e+00,  2.33291861e+00,
        -1.70837887e-17],
       [-9.43384352e+00, -6.24112682e+00,  2.24742798e+00,
         1.12537765e-16],
       [-1.47523850e+01, -5.03318441e+00,  1.84907363e-01,
         1.84935901e-16],
       [-2.82033419e+01, -8.33535389e+00,  5.30031897e+00,
         4.89486243e-17],
       [-3.18939828e+01, -3.11511500e+00, -1.03436931e+00,
         2.99484180e-16],
       [-1.68308462e+01, -6.17063481e+00, -1.90941971e+00,
        -1.42390326e-16],
       [-1.96368596e+01, -6.70369623e+00,  3.07476870e+00,
        -1.63313563e-16],
       [-1.67813655e+01, -6.15981034e+00,  2.33291861e+00,
        -1.70837887e-17],
       [-7.38836950e+00, -5.11089273e+00,  1.51352951e+00,
        -7.04453569e-17],
       [-7.38836950e+00, -5.11089273e+00,  1.51352951e+00,
        -7.04453569e-17],
       [-5.01807063e+01,  2.28940579e+00,  1.28660032e+00,
         3.04162792e-16],
       [-5.63092679e+01,  4.08369641e+00, -7.67968689e-01,
        -7.51795410e-17],
       [-2.24923537e+01, -7.24758212e+00,  3.81661879e+00,
         3.41408172e-16],
       [-3.18609957e+01, -3.10789868e+00,  1.79385624e+00,
        -1.45045689e-16],
       [-3.29742148e+00, -2.85042454e+00,  4.57325598e-02,
         1.11175616e-17],
       [-7.94726055e+01,  1.30026983e+01,  1.86597851e-01,
        -2.66709514e-16],
       [-3.26875093e+01, -2.52515864e+00,  3.87695076e-01,
        -1.26500289e-16],
       [-3.67454702e+01, -4.77841051e+00,  4.68371758e+00,
         4.68711369e-17],
       [-2.53478478e+01, -7.79146801e+00,  4.55846888e+00,
         1.95178398e-16],
       [-8.21488316e+00, -4.52815268e+00,  1.07368347e-01,
         1.47608402e-17],
       [-3.31624238e+01, -4.83891361e+00, -3.13664801e+00,
         7.19721747e-17],
       [-1.76157396e+01, -1.07620632e+01,  5.18302188e+00,
         1.93518744e-16],
       [-1.15452918e+01, -7.38579354e+00, -2.67512465e+00,
        -2.95716741e-16],
       [-4.32586181e+01, -9.48411854e-02,  1.91910494e+00,
        -2.63936669e-16],
       [-5.38957011e+01,  2.32104364e+00, -2.20593631e+00,
        -2.37753852e-16],
       [-7.42135662e+00, -5.11810904e+00, -1.31469604e+00,
         7.80250388e-17],
       [-7.09469707e+01,  9.44214673e+00, -6.10913535e-01,
        -2.57057627e-16],
       [-4.85936532e+01,  1.10949307e+00, -1.55752846e+00,
        -1.09949265e-16],
       [-7.94726055e+01,  1.30026983e+01,  1.86597851e-01,
        -2.66709514e-16],
       [-2.73854614e+01, -3.73670921e+00,  1.03610292e+00,
        -1.54079509e-16],
       [-2.73854614e+01, -3.73670921e+00,  1.03610292e+00,
        -1.54079509e-16],
       [-1.14793175e+01, -7.37136091e+00,  2.98132646e+00,
        -2.99548674e-17],
       [-4.40851318e+01,  4.87898862e-01,  5.12943775e-01,
         3.02603255e-16],
       [-4.85936532e+01,  1.10949307e+00, -1.55752846e+00,
        -1.09949265e-16],
       [-6.40248825e+01,  7.05789975e+00,  2.15910829e-02,
         3.36050508e-17],
       [-9.43384352e+00, -6.24112682e+00,  2.24742798e+00,
         1.12537765e-16],
       [-1.35247915e+01, -8.50159500e+00,  3.71522493e+00,
        -1.72447500e-16],
       [-1.96368596e+01, -6.70369623e+00,  3.07476870e+00,
        -1.63313563e-16]])

(u[:, :-1] @ np.diag(s[:-1]) @ vt[:-1, :])[0:5, :]

array([[ 8.,  6., 48., 28.],
       [ 2.,  4.,  8., 12.],
       [ 1.,  3.,  3.,  8.],
       [ 9.,  3., 27., 24.],
       [ 9.,  8., 72., 34.]])

$U$ and $V^T$¶

Let's try verifying that $U$ is orthonormal. If so, any column dot producted with itself should be 1, and any columns dot producted with any other should be zero.

np.dot(u[:, 0], u[:, 0])

1.0000000000000009

np.dot(u[:, 1], u[:, 1])

1.000000000000001

np.dot(u[:, 2], u[:, 2])

1.0

for i in range(0, u.shape[1]):
    print(f"row {i} dot producted with itself is {np.dot(u[:, i], u[:, i])}")

row 0 dot producted with itself is 1.0000000000000009
row 1 dot producted with itself is 1.000000000000001
row 2 dot producted with itself is 1.0
row 3 dot producted with itself is 1.0

for i in range(0, u.shape[1]):
    for j in range(i + 1, u.shape[1]):
        print(f"row {i} dot producted with {j} is {np.dot(u[:, i], u[:, j])}")

row 0 dot producted with 1 is -1.3877787807814457e-17
row 0 dot producted with 2 is -1.5959455978986625e-16
row 0 dot producted with 3 is 1.1102230246251565e-16
row 1 dot producted with 2 is -1.5265566588595902e-16
row 1 dot producted with 3 is 3.469446951953614e-17
row 2 dot producted with 3 is -2.7755575615628914e-17

Let's now look at $V$.

v = vt.T

for i in range(0, v.shape[1]):
    print(f"row {i} of V dot producted with itself is {np.dot(v[:, i], v[:, i])}")

row 0 of V dot producted with itself is 0.9999999999999999
row 1 of V dot producted with itself is 0.9999999999999991
row 2 of V dot producted with itself is 0.9999999999999994
row 3 of V dot producted with itself is 0.9999999999999998

for i in range(0, v.shape[1]):
    for j in range(i + 1, v.shape[1]):
        print(f"row {i} of V dot producted with {j} is {np.dot(v[:, i], v[:, j])}")

row 0 of V dot producted with 1 is -9.71445146547012e-17
row 0 of V dot producted with 2 is -2.42861286636753e-17
row 0 of V dot producted with 3 is 8.326672684688674e-17
row 1 of V dot producted with 2 is -2.931682674400804e-16
row 1 of V dot producted with 3 is 1.6653345369377348e-16
row 2 of V dot producted with 3 is -1.0928757898653885e-16

We can also see that the transpose of u (and v) is also the inverse of u (and v).

u.T @ u

array([[ 1.00000000e+00, -2.08166817e-17, -1.31838984e-16,
         1.11022302e-16],
       [-2.08166817e-17,  1.00000000e+00, -9.71445147e-17,
         4.16333634e-17],
       [-1.31838984e-16, -9.71445147e-17,  1.00000000e+00,
        -2.08166817e-17],
       [ 1.11022302e-16,  4.16333634e-17, -2.08166817e-17,
         1.00000000e+00]])

v.T @ v

array([[ 1.00000000e+00, -1.11022302e-16, -2.77555756e-17,
         6.93889390e-17],
       [-1.11022302e-16,  1.00000000e+00, -3.05311332e-16,
         1.11022302e-16],
       [-2.77555756e-17, -3.05311332e-16,  1.00000000e+00,
        -1.11022302e-16],
       [ 6.93889390e-17,  1.11022302e-16, -1.11022302e-16,
         1.00000000e+00]])

Low Rank Approximations of Rectangle Data¶

This section was not covered in lecture but may add additional context.

colnames = list(rectangle.columns)

Suppose we our area and perimeter measurements are slightly noisy.

rectangle_noisy = rectangle.copy()
rectangle_noisy["perimeter"] += np.random.normal(loc = 0, scale = 0.01, size = len(rectangle_noisy["perimeter"]))
rectangle_noisy["area"] += np.random.normal(loc = 0, scale = 0.01, size = len(rectangle_noisy["area"]))
rectangle_noisy.head(5)

u, s, vt = np.linalg.svd(rectangle_noisy, full_matrices = False)

Even on this noisy data, we can recover the exact measurements by multiplying $U \Sigma V^T$.

vt

array([[-1.46441099e-01, -1.29945951e-01, -8.10040556e-01,
        -5.52723577e-01],
       [-1.92865041e-01, -1.89071562e-01,  5.86320958e-01,
        -7.63729503e-01],
       [-7.04787579e-01,  7.09324660e-01,  7.87599872e-03,
         8.42396058e-03],
       [-6.66807730e-01, -6.66501703e-01, -1.27362120e-05,
         3.33381059e-01]])

pd.DataFrame((u @ np.diag(s) @ vt)[0:5, :]).head(5)

Let's take a look at the singular values:

s

array([3.62919871e+02, 6.29804126e+01, 2.56543792e+01, 3.26951740e-02])

Despite not being zero, we can nevertheless again eliminate the last column (dimension) of $U$, last singular value from $\Sigma$ and last row (operation) from $V^T$. We see that we get back reasonable answers, though they're a little off.

pd.DataFrame((u[:, :-1] @ np.diag(s[:-1]) @ vt[:-1, :]),
            columns = colnames).head(5)

This is what is known as a "low rank approximation" of the data. Specifically, this is a rank 3 approximation since we're using 3 of the available 4 dimensions. Since the noise was small, this rank 3 approximation is good.

As a warmup to the rest of lecture, let's see what the rank 2 approximation looks like.

Note: In the code below, I've specified which rows/columns that I do want rather than the ones I don't want. That is, instead of writing u[:, :-2], I've written u[:, 0:2].

pd.DataFrame((u[:, 0:2] @ np.diag(s[0:2]) @ vt[0:2, :]),
            columns = colnames).head(5)

This code is getting a little hard to read, so let's write a function that computes the rank k approximation of a given DataFrame.

def compute_rank_k_approximation(data, k):
    u, s, vt = np.linalg.svd(data, full_matrices = False)
    return pd.DataFrame(u[:, 0:k] @ np.diag(s[0:k]) @ vt[0:k, :], columns = data.columns)

pd.DataFrame(compute_rank_k_approximation(rectangle_noisy, 2)).head(5)

We can even look at the rank 1 approximation.

compute_rank_k_approximation(rectangle_noisy, 1).head(5)

In effect, what the SVD process has done is figured out a way to rebuild an approximation of our entire 4D matrix using only 1 column of data. This might seem somewhat shocking.

pd.DataFrame((u[:, 0:1] @ np.diag(s[0:1]) @ vt[0:1, :]),
            columns = colnames).head(5)

u, s, vt = np.linalg.svd(rectangle, full_matrices = False)
compute_rank_k_approximation(rectangle_noisy, 1).head(5)

def compute_rank_k_approximation(data, k):
    u, s, vt = np.linalg.svd(data, full_matrices = False)
    return pd.DataFrame(u[:, 0:k] @ np.diag(s[0:k]) @ vt[0:k, :], columns = data.columns)

Rank 1 Approximation of 2D Data, Data Centering¶

# Downloads from https://www.gapminder.org/data/
cm_path = 'data/child_mortality_0_5_year_olds_dying_per_1000_born.csv'
fe_path = 'data/children_per_woman_total_fertility.csv'
cm = pd.read_csv(cm_path).set_index('country')['2017'].to_frame()/10
fe = pd.read_csv(fe_path).set_index('country')['2017'].to_frame()
child_data = cm.merge(fe, left_index=True, right_index=True).dropna()
child_data.columns = ['mortality', 'fertility']
child_data.head()

def scatter14(data):
    sns.scatterplot('mortality', 'fertility', data=data)
    plt.xlim([0, 14])
    plt.ylim([0, 14])
    plt.xticks(np.arange(0, 14, 2))
    plt.yticks(np.arange(0, 14, 2))    
    
scatter14(child_data)

/opt/anaconda3/lib/python3.7/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  FutureWarning

sns.scatterplot('mortality', 'fertility', data=child_data)

/opt/anaconda3/lib/python3.7/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  FutureWarning

<AxesSubplot:xlabel='mortality', ylabel='fertility'>

u, s, vt = np.linalg.svd(child_data, full_matrices = False)

child_data_reconstructed = pd.DataFrame(u @ np.diag(s) @ vt, columns = ["mortality", "fertility"], index=child_data.index)

As we'd expect, the product of $U$, $\Sigma$, and $V^T$ recovers the original data perfectly.

child_data_reconstructed.head(5)

sns.scatterplot('mortality', 'fertility', data=child_data)

/opt/anaconda3/lib/python3.7/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  FutureWarning

<AxesSubplot:xlabel='mortality', ylabel='fertility'>

What happens if we throw away a column of $U$, a singular value from $\Sigma$, and a row from $V^T$? In this case we end up with the "rank 1 approximation" of the data.

Looking at the data, we see that it does a surprisingly good job.

#child_data_rank_1_approximation = pd.DataFrame(u[:, :-1] @ np.diag(s[:-1]) @ vt[:-1, :], columns = ["mortality", "fertility"], index=child_data.index)

child_data_rank_1_approximation = compute_rank_k_approximation(child_data, 1)
child_data_rank_1_approximation.head(5)

By plotting the data in a 2D space, we can see what's going on. We're simply getting the original data projected on to some 1 dimensional subspace.

sns.scatterplot('mortality', 'fertility', data=child_data_rank_1_approximation)

/opt/anaconda3/lib/python3.7/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  FutureWarning

<AxesSubplot:xlabel='mortality', ylabel='fertility'>

There's one significant issue with our projection, which we can see by plotting both the original data and our reconstruction on the same axis. The issue is that the projection goes through the origin but our data has a non-zero y-intercept.

sns.scatterplot('mortality', 'fertility', data=child_data)
sns.scatterplot('mortality', 'fertility', data=child_data_rank_1_approximation)

/opt/anaconda3/lib/python3.7/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  FutureWarning
/opt/anaconda3/lib/python3.7/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  FutureWarning

<AxesSubplot:xlabel='mortality', ylabel='fertility'>

While this y-intercept misalignment isn't terrible here, it can be really bad. For example, consider the 2D dataset below (from our body measurements dataset from the previous lecture).

#http://jse.amstat.org/datasets/fat.txt
body_data = pd.read_fwf("data/fat.dat.txt", colspecs = [(9, 13), (17, 21), (23, 29), (35, 37),
                                             (39, 45), (48, 53), (57, 61), (64, 69),
                                             (73, 77), (80, 85), (88, 93), (96, 101),
                                             (105, 109), (113, 117), (121, 125), (129, 133),
                                             (137, 141), (145, 149)], 
                  
                  
                  header=None, names = ["% brozek fat", "% siri fat", "density", "age", 
                                       "weight", "height", "adiposity", "fat free weight",
                                       "neck", "chest", "abdomen", "hip", "thigh",
                                       "knee", "ankle", "bicep", "forearm",
                                       "wrist"])
#body_data = body_data.drop(41) #drop the weird record
body_data.head()

density_and_abdomen = body_data[["density", "abdomen"]]
density_and_abdomen.head(5)

If we look at the data, the rank 1 approximation looks at least vaguely sane from the table.

density_and_abdomen_rank_1_approximation = compute_rank_k_approximation(density_and_abdomen, 1)
density_and_abdomen_rank_1_approximation.head(5)

But if we plot on 2D axes, we'll see that things are very wrong.

sns.scatterplot(x="density", y="abdomen", data=body_data)

<AxesSubplot:xlabel='density', ylabel='abdomen'>

density_and_abdomen_rank_1_approximation = compute_rank_k_approximation(density_and_abdomen, 1)
sns.scatterplot(x="density", y="abdomen", data=body_data)
sns.scatterplot(x="density", y="abdomen", data=density_and_abdomen_rank_1_approximation);

Since the subspace that we're projecting on to is off and to the right, we end up with a bizarre result where our rank 1 approximation believes that density increases with abdomen size, even though the data shows the opposite.

To fix this issue, we should always start the SVD process by zero-centering our data. That is, for each column, we should subtract the mean of that column.

np.mean(density_and_abdomen, axis = 0)

density     1.055574
abdomen    92.555952
dtype: float64

density_and_abdomen_centered = density_and_abdomen - np.mean(density_and_abdomen, axis = 0)
density_and_abdomen_centered.head(5)

Now when we do the approximation, things work much better.

density_and_abdomen_centered_rank_1_approximation = compute_rank_k_approximation(density_and_abdomen_centered, 1)
sns.scatterplot(x="density", y="abdomen", data=density_and_abdomen_centered)
sns.scatterplot(x="density", y="abdomen", data=density_and_abdomen_centered_rank_1_approximation);

Interpreting Principal Components¶

Let's revisit our child mortality and maternal fertility data from before.

sns.scatterplot(data = child_data, x = "mortality", y= "fertility")
plt.xlim([0, 14])
plt.ylim([0, 14])
plt.xticks(np.arange(0, 14, 2))
plt.yticks(np.arange(0, 14, 2));

Since we're going to be doing SVD, let's make sure to center our data first.

np.mean(child_data, axis = 0)

mortality    3.034590
fertility    2.775956
dtype: float64

child_means = np.mean(child_data, axis = 0)
child_data_centered = child_data - child_means
sns.scatterplot(data = child_data_centered, x = "mortality", y= "fertility")


plt.xlim([-3, 11])
plt.ylim([-3, 11])
plt.xticks(np.arange(-3, 11, 2))
plt.yticks(np.arange(-3, 11, 2));
# plt.gcf().savefig("mortality_fertility_centered.png", dpi=300, bbox_inches="tight")

Principal Components and Singular Values¶

Singular Values and Variance¶

rectangle = pd.read_csv('data/rectangle_data.csv')

rectangle_centered = rectangle - np.mean(rectangle, axis = 0)

np.var(rectangle_centered)

width          7.6891
height         5.3475
area         338.7316
perimeter     50.7904
dtype: float64

sum(np.var(rectangle_centered))

402.5586000000002

u, s, vt = np.linalg.svd(rectangle_centered, full_matrices = False)

u[0:5, :]

array([[-0.1339099 ,  0.00592996,  0.03473374, -0.14290163],
       [ 0.08635418, -0.07951453,  0.01494809,  0.73679984],
       [ 0.11776646, -0.12896256,  0.08577428, -0.06410577],
       [-0.02727392,  0.1831771 ,  0.01089507, -0.03781763],
       [-0.258806  , -0.09429499,  0.09027015, -0.0466119 ]])

np.diag(s)

array([[1.97388075e+02, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 2.74346257e+01, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 2.32626119e+01, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 8.37988302e-15]])

s**2/rectangle_centered.shape[0]

array([3.89620522e+02, 7.52658687e+00, 5.41149115e+00, 7.02224394e-31])

sum(s**2/rectangle_centered.shape[0])

402.55859999999984

Practical PCA¶

Let's now step back and try to use PCA on our body measurement and congressional voting datasets.

body_data.head(5)

u, s, vt = np.linalg.svd(body_data, full_matrices = False)

We see that some of our principal components capture more variance than others.

s

array([5.02052705e+03, 2.62085056e+02, 2.54543432e+02, 1.41812428e+02,
       5.98783810e+01, 5.29106976e+01, 4.16567900e+01, 3.12823306e+01,
       2.60271725e+01, 2.44919558e+01, 2.21128215e+01, 2.05280695e+01,
       1.86667123e+01, 1.67184654e+01, 1.24277631e+01, 7.79276210e+00,
       1.97067950e+00, 2.72193414e-01])

Or we can compute the fraction of the variance captured by each principal component. The result seems shocking at first, as our data appears to be effectively rank 1.

np.round(s**2 / sum(s**2), 2)

array([0.99, 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ,
       0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ])

This seems absurd, as clearly there are several variables that we expect to be show significant variation independent of each other, e.g. weight, height, and age. If this happens to you, it's probably because you forgot to center your data!

body_data_centered = body_data - np.mean(body_data, axis = 0)
body_data_centered.head(5)

u, s, vt = np.linalg.svd(body_data_centered, full_matrices = False)

This time, we see that the top singular value is no longer as dominant.

s

array([5.86347268e+02, 2.61819089e+02, 1.67113835e+02, 5.99708077e+01,
       5.33302486e+01, 4.22415004e+01, 3.48011060e+01, 2.92855997e+01,
       2.55426169e+01, 2.39181665e+01, 2.08396531e+01, 2.04736098e+01,
       1.85404097e+01, 1.62788697e+01, 1.23779482e+01, 7.72010337e+00,
       1.95363897e+00, 4.48706065e-02])

Looking now at the fraction of the variance captured by each principal component, we see that the top 2 or 3 principal components capture quite a lot of the variance.

np.round(s**2 / sum(s**2), 2)

array([0.76, 0.15, 0.06, 0.01, 0.01, 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ,
       0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ])

We can also show this in the form of what is usually called a "scree plot".

plt.plot(s**2);

Thus, we expect that if we were to do a rank 3 approximation, we should get back data that's pretty close to where we started, as just those 3 dimensions capture 97% of the variance.

body_data_rank_3_approximation = compute_rank_k_approximation(body_data_centered, 3) +  np.mean(body_data, axis = 0)
body_data_rank_3_approximation.head(5)

body_data.head(5)

One very interesting thing we can do is try to plot the principal components themselves. In this case, let's plot only the first two.

u, s, vt = np.linalg.svd(body_data_centered, full_matrices = False)
pcs = u * s
sns.scatterplot(x=pcs[:, 0], y=pcs[:, 1]);

np.argmax(pcs[:, 0])

38

body_data.iloc[38, :]

% brozek fat        33.8000
% siri fat          35.2000
density              1.0202
age                 46.0000
weight             363.1500
height              72.2500
adiposity           48.9000
fat free weight    240.5000
neck                51.2000
chest              136.2000
abdomen            148.1000
hip                147.7000
thigh               87.3000
knee                49.1000
ankle               29.6000
bicep               45.0000
forearm             29.0000
wrist               21.4000
Name: 38, dtype: float64

Election Data¶

from ds100_utils import fetch_and_cache
import yaml
from datetime import datetime


base_url = 'https://github.com/unitedstates/congress-legislators/raw/master/'
legislators_path = 'legislators-current.yaml'
f = fetch_and_cache(base_url + legislators_path, legislators_path)
legislators_data = yaml.load(open(f))

def to_date(s):
    return datetime.strptime(s, '%Y-%m-%d')

legs = pd.DataFrame(
    columns=['leg_id', 'first', 'last', 'gender', 'state', 'chamber', 'party', 'birthday'],
    data=[[x['id']['bioguide'], 
           x['name']['first'],
           x['name']['last'],
           x['bio']['gender'],
           x['terms'][-1]['state'],
           x['terms'][-1]['type'],
           x['terms'][-1]['party'],
           to_date(x['bio']['birthday'])] for x in legislators_data])

legs.head(3)

Using cached version that was downloaded (UTC): Tue Nov 17 07:50:34 2020

/opt/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:9: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  if __name__ == '__main__':

# February 2019 House of Representatives roll call votes
# Downloaded using https://github.com/eyeseast/propublica-congress
votes = pd.read_csv('data/votes.csv')
votes = votes.astype({"roll call": str}) 
votes.head()

def was_yes(s):
    if s.iloc[0] == 'Yes':
        return 1
    else:
        return 0
    
vote_pivot = votes.pivot_table(index='member', 
                                columns='roll call', 
                                values='vote', 
                                aggfunc=was_yes, 
                                fill_value=0)
print(vote_pivot.shape)
vote_pivot.head()

(441, 41)

vote_pivot_centered = vote_pivot - np.mean(vote_pivot, axis = 0)
vote_pivot_centered.head(5)

u, s, vt = np.linalg.svd(vote_pivot_centered, full_matrices = False)

np.round(s**2 / sum(s**2), 2)

array([0.8 , 0.05, 0.02, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01,
       0.01, 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ,
       0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ,
       0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ])

plt.plot(s**2)

[<matplotlib.lines.Line2D at 0x7fcd5f5ff4d0>]

pcs = u * s
sns.scatterplot(x=pcs[:, 0], y=pcs[:, 1]);

vote2d = pd.DataFrame({
    'member': vote_pivot.index,
    'pc1': pcs[:, 0],
    'pc2': pcs[:, 1]
}).merge(legs, left_on='member', right_on='leg_id')

vote2d[vote2d['pc1'] < 0]['party'].value_counts()

Democrat      228
Republican      1
Name: party, dtype: int64

#top right only
vote2d.query('pc2 > -2 and pc1 > 0')['party'].value_counts()

Republican    190
Name: party, dtype: int64

sns.scatterplot(x="pc1", y="pc2", hue="party", data = vote2d);

vote2d['pc1_jittered'] = vote2d['pc1'] + np.random.normal(loc = 0, scale = 0.1, size = vote2d.shape[0])
vote2d['pc2_jittered'] = vote2d['pc2'] + np.random.normal(loc = 0, scale = 0.1, size = vote2d.shape[0])

sns.scatterplot(x="pc1_jittered", y="pc2_jittered", hue="party", data = vote2d);

vote2d.head(5)

vote2d[vote2d['pc1'] > 0]['party'].value_counts()

Republican     195
Democrat         7
Libertarian      1
Name: party, dtype: int64

vote2d[vote2d['pc2'] < -1]

df = votes[votes['member'].isin(vote2d[vote2d['pc2'] < -1]['member'])]
df.groupby(['member', 'vote']).size()

member   vote      
A000367  No            31
         Yes           10
A000374  Not Voting    41
B001311  No            17
         Yes            7
C000537  No             1
         Not Voting    37
         Yes            3
C001087  No             6
         Not Voting    23
         Yes           12
G000582  Not Voting     1
         Yes            6
H001077  No            14
         Not Voting    15
         Yes           12
J000299  No            20
         Not Voting     7
         Yes           14
M001200  Not Voting    41
M001210  No            16
         Not Voting     1
         Yes            7
N000147  No             6
         Not Voting     1
P000197  Speaker       41
P000610  No             7
R000577  No            11
         Not Voting     8
         Yes           22
R000600  Not Voting     7
S001177  No             7
S001204  No             6
         Not Voting     1
dtype: int64

legs.query("leg_id == 'A000367'")

Let's look at only people who have voted more than 15 times lately¶

votes.head(5)

num_yes_or_no_votes_per_member = votes.query("vote == 'Yes' or vote == 'No'").groupby("member").size()
num_yes_or_no_votes_per_member.head(5)

member
A000055    40
A000367    41
A000369    41
A000370    41
A000371    41
dtype: int64

vote_pivot_with_yes_no_count = vote_pivot.merge(num_yes_or_no_votes_per_member.to_frame(), left_index = True, right_index = True, how="outer", ).fillna(0)
vote_pivot_with_yes_no_count = vote_pivot_with_yes_no_count.rename(columns = {0: 'yes_no_count'})
vote_pivot_with_yes_no_count.head(5)

regulars = vote_pivot_with_yes_no_count.query('yes_no_count >= 30')
regulars = regulars.drop('yes_no_count', 1)
regulars.shape

(425, 41)

regulars_centered = regulars - np.mean(regulars, axis = 0)
regulars_centered.head(5)

u, s, vt = np.linalg.svd(regulars_centered, full_matrices = False)

np.round(s**2 / sum(s**2), 2)

array([0.84, 0.02, 0.02, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01,
       0.01, 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ,
       0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ,
       0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ])

pcs = u * s
sns.scatterplot(x=pcs[:, 0], y=pcs[:, 1]);

vote2d = pd.DataFrame({
    'member': regulars_centered.index,
    'pc1': pcs[:, 0],
    'pc2': pcs[:, 1]
}).merge(legs, left_on='member', right_on='leg_id')

vote2d['pc1_jittered'] = vote2d['pc1'] + np.random.normal(loc = 0, scale = 0.1, size = vote2d.shape[0])
vote2d['pc2_jittered'] = vote2d['pc2'] + np.random.normal(loc = 0, scale = 0.1, size = vote2d.shape[0])

sns.scatterplot(x="pc1_jittered", y="pc2_jittered", hue="party", data = vote2d);

Exploring $V^T$¶

We can also look at Vt directly to try to gain insight into why each component is as it is.

num_votes = vt.shape[1]
votes = regulars.columns

def plot_pc(k):
    plt.bar(votes, vt[k, :], alpha=0.7)
    plt.xticks(votes, rotation=90);

with plt.rc_context({"figure.figsize": (12, 4)}):
    plot_pc(0)
#   plot_pc(1)

def compute_rank_k_approximation(data, k):
    u, s, vt = np.linalg.svd(data, full_matrices = False)
    return pd.DataFrame(u[:, 0:k] @ np.diag(s[0:k]) @ vt[0:k, :], columns = data.columns)

PCA vs. Linear Regression (Bonus)¶

Now we'll finally turn to trying to understand what the principle components and the low rank approximations actually mean!

Returning to our child mortality data from before, if we zero-center the child data, we see get back a better result. Note that we have to add back in the mean of each column to get things back into the right units.

means = np.mean(child_data, axis = 0)
child_data_centered = child_data - np.mean(child_data, axis = 0)
child_data_rank_1_approximation = compute_rank_k_approximation(child_data_centered, 1) + means

sns.scatterplot('mortality', 'fertility', data=child_data)
sns.scatterplot('mortality', 'fertility', data=child_data_rank_1_approximation)
plt.legend(['data', 'rank 1 approx'])
plt.xlim([0, 14])
plt.ylim([0, 14])
plt.xticks(np.arange(0, 14, 2))
plt.yticks(np.arange(0, 14, 2));

/opt/anaconda3/lib/python3.7/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  FutureWarning
/opt/anaconda3/lib/python3.7/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  FutureWarning

We can also give our rank 1 approximation as a line, showing the 1D subspace (in black) that our data is being projected onto.

sns.scatterplot('mortality', 'fertility', data=child_data)
sns.lineplot('mortality', 'fertility', data=child_data_rank_1_approximation, color='black')
plt.legend(['rank 1 approx', 'data']);
plt.xlim([0, 14])
plt.ylim([0, 14])
plt.xticks(np.arange(0, 14, 2))
plt.yticks(np.arange(0, 14, 2));

/opt/anaconda3/lib/python3.7/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  FutureWarning
/opt/anaconda3/lib/python3.7/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  FutureWarning

This plot probably brings to mind linear regression from Data 8. But PCA is NOT the same thing linear regression. Let's plot the regression lines for this data for comparison. Recall that the regression line gives us, e.g. the best possible linear prediction of the fertility given the mortality.

x, y = child_data['mortality'], child_data['fertility']
slope_x, intercept_x = np.polyfit(x, y, 1) # simple linear regression

scatter14(child_data)

plt.plot(x, slope_x * x + intercept_x, c = 'g')
for _, row in child_data.sample(20).iterrows():
    tx, ty = row['mortality'], row['fertility']
    plt.plot([tx, tx], [slope_x * tx + intercept_x, ty], c='red')
    
plt.legend(['predicted fertility']);

/opt/anaconda3/lib/python3.7/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  FutureWarning

In the plot above, the green regression line given minimizes the sum of the squared errors, given as red vertical lines.

We could also do the opposite thing, and try to predict fertility from mortality.

x, y = child_data['mortality'], child_data['fertility']
slope_y, intercept_y = np.polyfit(y, x, 1) # simple linear regression

scatter14(child_data)

plt.plot(slope_y * y + intercept_y, y, c = 'purple')
for _, row in child_data.sample(20).iterrows():
    tx, ty = row['mortality'], row['fertility']
    plt.plot([tx, slope_y * ty + intercept_y], [ty, ty], c='red')
    
plt.legend(['predicted mortality']);

/opt/anaconda3/lib/python3.7/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  FutureWarning

In the plot above, the green regression line given minimizes the sum of the squared errors, given as red horizontal lines.

Plotting the two regression lines and the 1D subspace chosen by PCA, we see that all 3 are distinct!

sns.lineplot('mortality', 'fertility', data=child_data_rank_1_approximation, color="black")
plt.plot(x, slope_x * x + intercept_x, c = 'g')
plt.plot(slope_y * y + intercept_y, y, c = 'purple');
sns.scatterplot('mortality', 'fertility', data=child_data)
plt.legend(['rank 1 approx', 'predicted fertility', 'predicted mortality'])
plt.xlim([0, 14])
plt.ylim([0, 14])
plt.xticks(np.arange(0, 14, 2))
plt.yticks(np.arange(0, 14, 2));

/opt/anaconda3/lib/python3.7/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  FutureWarning
/opt/anaconda3/lib/python3.7/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  FutureWarning

Given that the green line minimizes the "vertical error" and the purple line minimizes the "horizontal error". You might wonder what the black line minimizes. It turns out, it minimizes the "diagonal" error, i.e. the error in the direction perpendicular to itself.

sns.lineplot('mortality', 'fertility', data=child_data_rank_1_approximation, color="black")
plt.plot(x, slope_x * x + intercept_x, c = 'g')
plt.plot(slope_y * y + intercept_y, y, c = 'purple');
sns.scatterplot('mortality', 'fertility', data=child_data)

for idx, tdata in child_data.reset_index().sample(20).iterrows():
    tx = tdata["mortality"]
    ty = tdata["fertility"]
    tx_projected = child_data_rank_1_approximation.iloc[idx, 0]
    ty_projected = child_data_rank_1_approximation.iloc[idx, 1]
    plt.plot([tx, tx_projected], [ty, ty_projected], c='red')
    
plt.xlim([0, 14])
plt.ylim([0, 14])
plt.xticks(np.arange(0, 14, 2))
plt.yticks(np.arange(0, 14, 2));
plt.legend(['rank 1 approx', 'predicted fertility', 'predicted mortality']);

/opt/anaconda3/lib/python3.7/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  FutureWarning
/opt/anaconda3/lib/python3.7/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  FutureWarning

The function in the following cell makes it easy to make similar plots for whatever dataset you might be interested in.

def plot_x_regression_y_regression_1d_approximation(data): 
    xname = data.columns[0]
    yname = data.columns[1]
    
    x, y = data[xname], data[yname]
    slope_x, intercept_x = np.polyfit(x, y, 1) # simple linear regression
    
    x, y = data[xname], data[yname]
    slope_y, intercept_y = np.polyfit(y, x, 1) # simple linear regression
    
    means = np.mean(data, axis = 0)
    rank_1_approximation = compute_rank_k_approximation(data - means, 1) + means
    
    sns.lineplot(x=xname, y=yname, data=rank_1_approximation, color="black")
    plt.plot(x, slope_x * x + intercept_x, c = 'g')
    plt.plot(slope_y * y + intercept_y, y, c = 'purple');
    sns.scatterplot(xname, yname, data=data)
    
   
    for idx, tdata in data.reset_index().sample(20).iterrows():
        tx = tdata[xname]
        ty = tdata[yname]
        
        tx_projected = rank_1_approximation.iloc[idx, 0]
        ty_projected = rank_1_approximation.iloc[idx, 1]
        plt.plot([tx, tx_projected], [ty, ty_projected], c='red') 
    
    plt.legend(['1D PCA Subspace', 'predicted ' + xname, 'predicted ' + yname])

plot_x_regression_y_regression_1d_approximation(body_data.drop(41)[["adiposity", "bicep"]])

/opt/anaconda3/lib/python3.7/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  FutureWarning

	% fat	density	age	weight	height	neck	chest	abdomen
0	12.3	1.0708	23	154.25	67.75	36.2	93.1	85.2
1	6.1	1.0853	22	173.25	72.25	38.5	93.6	83.0
2	25.3	1.0414	22	154.00	66.25	34.0	95.8	87.9
3	10.4	1.0751	26	184.75	72.25	37.4	101.8	86.4
4	28.7	1.0340	24	184.25	71.25	34.4	97.3	100.0

	leg_id	first	last	gender	state	chamber	party	birthday
0	B000944	Sherrod	Brown	M	OH	sen	Democrat	1952-11-09
1	C000127	Maria	Cantwell	F	WA	sen	Democrat	1958-10-13
2	C000141	Benjamin	Cardin	M	MD	sen	Democrat	1943-10-05

	width	height	area	perimeter
0	8	6	47.991711	28.012191
1	2	4	7.996125	12.007949
2	1	3	3.005967	8.007360
3	9	3	27.005006	23.994362
4	9	8	71.988988	33.981539

	0	1	2	3
0	8.0	6.0	47.991711	28.012191
1	2.0	4.0	7.996125	12.007949
2	1.0	3.0	3.005967	8.007360
3	9.0	3.0	27.005006	23.994362
4	9.0	8.0	71.988988	33.981539

	width	height	area	perimeter
0	8.003101	6.003100	47.991711	28.010640
1	2.002333	4.002332	7.996125	12.006782
2	1.002101	3.002100	3.005967	8.006309
3	8.998765	2.998765	27.005005	23.994980
4	8.996400	7.996402	71.988988	33.983339

	chamber	session	roll call	member	vote
0	House	1	555	A000374	Not Voting
1	House	1	555	A000370	Yes
2	House	1	555	A000055	No
3	House	1	555	A000371	Yes
4	House	1	555	A000372	No

	chamber_x	session	roll call	member	vote	leg_id	first	last	gender	state	chamber_y	party	birthday
8638	House	1	551	K000391	No	K000391	Raja	Krishnamoorthi	M	IL	rep	Democrat	1973-07-19
13298	House	1	548	R000486	Yes	R000486	Lucille	Roybal-Allard	F	CA	rep	Democrat	1941-06-12
13176	House	1	547	R000616	Yes	R000616	Harley	Rouda	M	CA	rep	Democrat	1961-12-10
13018	House	1	541	R000395	Yes	R000395	Harold	Rogers	M	KY	rep	Republican	1937-12-31
5325	House	1	543	F000469	No	F000469	Russ	Fulcher	M	ID	rep	Republican	1973-07-19

roll call	515	516	517	518	519	520	521	522	523	524	...	546	547	548	549	550	551	552	553	554	555
member
A000055	1	0	0	0	1	1	0	1	1	1	...	0	0	1	0	0	1	0	0	1	0
A000367	0	0	0	0	0	0	0	0	0	0	...	0	1	1	1	1	0	1	1	0	1
A000369	1	1	0	0	1	1	0	1	1	1	...	0	0	1	0	0	1	0	0	1	0
A000370	1	1	1	1	1	0	1	0	0	0	...	1	1	1	1	1	0	1	1	1	1
A000371	1	1	1	1	1	0	1	0	0	0	...	1	1	1	1	1	0	1	1	1	1

	width	height	area	perimeter
0	7.461545	6.548142	47.997763	28.017113
1	3.124253	2.873189	7.983587	11.993373
2	2.069372	1.927959	2.994040	7.993553
3	6.320364	5.694408	27.034937	24.026993
4	9.126612	7.865352	71.987533	33.981782

	width	height	area	perimeter
0	8.246016	7.317183	45.612926	31.123552
1	2.039475	1.809748	11.281376	7.697741
2	1.083238	0.961222	5.991943	4.088546
3	5.395653	4.787885	29.846112	20.365215
4	11.635322	10.324717	64.360910	43.916063

	mortality	fertility
country
Afghanistan	6.820	4.48
Albania	1.330	1.71
Algeria	2.390	2.71
Angola	8.310	5.62
Antigua and Barbuda	0.816	2.04

	mortality	fertility
0	6.694067	4.660869
1	1.697627	1.182004
2	2.880467	2.005579
3	8.232160	5.731795
4	1.506198	1.048719

	% brozek fat	% siri fat	density	age	weight	height	adiposity	fat free weight	neck	chest	abdomen	hip	thigh	knee	ankle	bicep	forearm	wrist
0	12.6	12.3	1.0708	23	154.25	67.75	23.7	134.9	36.2	93.1	85.2	94.5	59.0	37.3	21.9	32.0	27.4	17.1
1	6.9	6.1	1.0853	22	173.25	72.25	23.4	161.3	38.5	93.6	83.0	98.7	58.7	37.3	23.4	30.5	28.9	18.2
2	24.6	25.3	1.0414	22	154.00	66.25	24.7	116.0	34.0	95.8	87.9	99.2	59.6	38.9	24.0	28.8	25.2	16.6
3	10.9	10.4	1.0751	26	184.75	72.25	24.9	164.7	37.4	101.8	86.4	101.2	60.1	37.3	22.8	32.4	29.4	18.2
4	27.8	28.7	1.0340	24	184.25	71.25	25.6	133.1	34.4	97.3	100.0	101.9	63.2	42.2	24.0	32.2	27.7	17.7

	density	abdomen
0	0.957134	85.201277
1	0.932425	83.001717
2	0.987458	87.900606
3	0.970613	86.401174
4	1.123369	99.998996

	density	abdomen
0	0.015226	-7.355952
1	0.029726	-9.555952
2	-0.014174	-4.655952
3	0.019526	-6.155952
4	-0.021574	7.444048

	% brozek fat	% siri fat	density	age	weight	height	adiposity	fat free weight	neck	chest	abdomen	hip	thigh	knee	ankle	bicep	forearm	wrist
0	-6.338492	-6.850794	0.015226	-21.884921	-24.674405	-2.39881	-1.736905	-8.813889	-1.792063	-7.724206	-7.355952	-5.404762	-0.405952	-1.290476	-1.202381	-0.273413	-1.263889	-1.129762
1	-12.038492	-13.050794	0.029726	-22.884921	-5.674405	2.10119	-2.036905	17.586111	0.507937	-7.224206	-9.555952	-1.204762	-0.705952	-1.290476	0.297619	-1.773413	0.236111	-0.029762
2	5.661508	6.149206	-0.014174	-22.884921	-24.924405	-3.89881	-0.736905	-27.713889	-3.992063	-5.024206	-4.655952	-0.704762	0.194048	0.309524	0.897619	-3.473413	-3.463889	-1.629762
3	-8.038492	-8.750794	0.019526	-18.884921	5.825595	2.10119	-0.536905	20.986111	-0.592063	0.975794	-6.155952	1.295238	0.694048	-1.290476	-0.302381	0.126587	0.736111	-0.029762
4	8.861508	9.549206	-0.021574	-20.884921	5.325595	1.10119	0.163095	-10.613889	-3.592063	-3.524206	7.444048	1.995238	3.794048	3.609524	0.897619	-0.073413	-0.963889	-0.529762

	% brozek fat	% siri fat	density	age	weight	height	adiposity	fat free weight	neck	chest	abdomen	hip	thigh	knee	ankle	bicep	forearm	wrist
0	13.732084	13.499443	1.067647	23.406671	155.524219	69.685017	22.382014	134.660313	35.750451	92.404761	81.663261	95.367676	57.832005	36.807149	22.471102	30.513755	27.871398	17.221299
1	6.949481	6.177696	1.084653	22.136786	173.262498	72.818426	22.848506	161.108343	37.281215	94.764794	81.952662	98.256188	59.629903	38.251464	23.494178	31.990484	28.799694	18.014750
2	24.388640	24.995400	1.041272	22.012988	155.515375	66.961498	24.013328	115.107332	35.327536	94.623409	87.487254	96.824325	59.191562	36.564650	22.049234	30.526265	27.727128	16.734867
3	10.535884	10.053321	1.076052	26.287526	185.315285	72.840641	24.486712	164.078362	38.187196	98.649630	87.304458	101.045858	61.263963	39.095345	23.826619	32.953935	29.261160	18.327856
4	27.324362	28.164091	1.034484	23.983222	184.352508	68.475540	26.973751	132.523943	37.446408	101.921954	96.240973	103.147363	63.286670	38.642499	23.123048	32.896046	28.991277	17.530191

roll call	515	516	517	518	519	520	521	522	523	524	...	546	547	548	549	550	551	552	553	554	555
member
A000055	0.129252	-0.668934	-0.526077	-0.52381	0.049887	0.587302	-0.562358	0.634921	0.594104	0.560091	...	-0.521542	-0.526077	0.045351	-0.521542	-0.519274	0.54195	-0.521542	-0.535147	0.086168	-0.503401
A000367	-0.870748	-0.668934	-0.526077	-0.52381	-0.950113	-0.412698	-0.562358	-0.365079	-0.405896	-0.439909	...	-0.521542	0.473923	0.045351	0.478458	0.480726	-0.45805	0.478458	0.464853	-0.913832	0.496599
A000369	0.129252	0.331066	-0.526077	-0.52381	0.049887	0.587302	-0.562358	0.634921	0.594104	0.560091	...	-0.521542	-0.526077	0.045351	-0.521542	-0.519274	0.54195	-0.521542	-0.535147	0.086168	-0.503401
A000370	0.129252	0.331066	0.473923	0.47619	0.049887	-0.412698	0.437642	-0.365079	-0.405896	-0.439909	...	0.478458	0.473923	0.045351	0.478458	0.480726	-0.45805	0.478458	0.464853	0.086168	0.496599
A000371	0.129252	0.331066	0.473923	0.47619	0.049887	-0.412698	0.437642	-0.365079	-0.405896	-0.439909	...	0.478458	0.473923	0.045351	0.478458	0.480726	-0.45805	0.478458	0.464853	0.086168	0.496599

	member	pc1	pc2	leg_id	first	last	gender	state	chamber	party	birthday	pc1_jittered	pc2_jittered
0	A000055	3.061356	0.364191	A000055	Robert	Aderholt	M	AL	rep	Republican	1965-07-22	3.158115	0.510657
1	A000367	0.188870	-2.433565	A000367	Justin	Amash	M	MI	rep	Libertarian	1980-04-18	0.053502	-2.441355
2	A000369	2.844370	0.821619	A000369	Mark	Amodei	M	NV	rep	Republican	1958-06-12	2.788314	0.939124
3	A000370	-2.607536	0.127977	A000370	Alma	Adams	F	NC	rep	Democrat	1946-05-27	-2.595474	0.037473
4	A000371	-2.607536	0.127977	A000371	Pete	Aguilar	M	CA	rep	Democrat	1979-06-19	-2.417696	0.084473

	515	516	517	518	519	520	521	522	523	524	...	547	548	549	550	551	552	553	554	555	yes_no_count
member
A000055	1	0	0	0	1	1	0	1	1	1	...	0	1	0	0	1	0	0	1	0	40.0
A000367	0	0	0	0	0	0	0	0	0	0	...	1	1	1	1	0	1	1	0	1	41.0
A000369	1	1	0	0	1	1	0	1	1	1	...	0	1	0	0	1	0	0	1	0	41.0
A000370	1	1	1	1	1	0	1	0	0	0	...	1	1	1	1	0	1	1	1	1	41.0
A000371	1	1	1	1	1	0	1	0	0	0	...	1	1	1	1	0	1	1	1	1	41.0

	515	516	517	518	519	520	521	522	523	524	...	546	547	548	549	550	551	552	553	554	555
member
A000055	0.101176	-0.691765	-0.543529	-0.541176	0.023529	0.581176	-0.581176	0.625882	0.588235	0.550588	...	-0.541176	-0.545882	0.014118	-0.538824	-0.536471	0.529412	-0.538824	-0.555294	0.056471	-0.522353
A000367	-0.898824	-0.691765	-0.543529	-0.541176	-0.976471	-0.418824	-0.581176	-0.374118	-0.411765	-0.449412	...	-0.541176	0.454118	0.014118	0.461176	0.463529	-0.470588	0.461176	0.444706	-0.943529	0.477647
A000369	0.101176	0.308235	-0.543529	-0.541176	0.023529	0.581176	-0.581176	0.625882	0.588235	0.550588	...	-0.541176	-0.545882	0.014118	-0.538824	-0.536471	0.529412	-0.538824	-0.555294	0.056471	-0.522353
A000370	0.101176	0.308235	0.456471	0.458824	0.023529	-0.418824	0.418824	-0.374118	-0.411765	-0.449412	...	0.458824	0.454118	0.014118	0.461176	0.463529	-0.470588	0.461176	0.444706	0.056471	0.477647
A000371	0.101176	0.308235	0.456471	0.458824	0.023529	-0.418824	0.418824	-0.374118	-0.411765	-0.449412	...	0.458824	0.454118	0.014118	0.461176	0.463529	-0.470588	0.461176	0.444706	0.056471	0.477647