import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import yaml
from datetime import datetime
from ds100_utils import *
np.random.seed(23) #kallisti
plt.rcParams['figure.dpi'] = 150
df = pd.read_csv("data/2d.csv")
df
  | x | y |
---|---|---|
0 | 2.311043 | 5.436627 |
1 | 2.951447 | 6.093710 |
2 | 2.628517 | 6.776799 |
3 | 2.041157 | 5.335430 |
4 | 3.916969 | 8.948526 |
... | ... | ... |
95 | 3.639231 | 8.331902 |
96 | 2.765474 | 5.621709 |
97 | 2.745027 | 7.134981 |
98 | 3.945360 | 8.198725 |
99 | 2.743246 | 6.145312 |
100 rows × 2 columns
Let's visualize the dataset first.
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
# Move left y-axis and bottom x-axis to centre, passing through (0,0)
ax.spines['left'].set_position('zero')
ax.spines['bottom'].set_position('zero')
# Eliminate upper and right axes
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')
# Show ticks in the left and lower axes only
ax.xaxis.set_ticks_position('bottom')
ax.yaxis.set_ticks_position('left')
sns.scatterplot(x="x", y="y", data = df);
plt.axis("square")
ax.set_ylabel("")
ax.set_xlabel("");
plt.xlim(-5, 5)
plt.ylim(-5, 10);
Now let's perform PCA on this 2D dataset to see what transformations are involved.
Step 1 of PCA is to center the data matrix.
centered_df = df - np.mean(df, axis = 0)
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
# Move left y-axis and bottom x-axis to centre, passing through (0,0)
ax.spines['left'].set_position('zero')
ax.spines['bottom'].set_position('zero')
# Eliminate upper and right axes
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')
# Show ticks in the left and lower axes only
ax.xaxis.set_ticks_position('bottom')
ax.yaxis.set_ticks_position('left')
sns.scatterplot(x="x", y="y", data = centered_df);
plt.axis("square")
ax.set_ylabel("")
ax.set_xlabel("");
plt.xlim(-5, 5)
plt.ylim(-5, 5);
Step 2 is to obtain the SVD of the centered data.
U, S, Vt = np.linalg.svd(centered_df, full_matrices = False)
We mentioned that $V$ transforms $X$ to get the principal components. What does that transformation look like?
PCs = centered_df @ Vt.T
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
# Move left y-axis and bottom x-axis to centre, passing through (0,0)
ax.spines['left'].set_position('zero')
ax.spines['bottom'].set_position('zero')
# Eliminate upper and right axes
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')
# Show ticks in the left and lower axes only
ax.xaxis.set_ticks_position('bottom')
ax.yaxis.set_ticks_position('left')
sns.scatterplot(x=0, y=1, data = PCs);
plt.axis("square")
ax.set_ylabel("")
ax.set_xlabel("");
plt.xlim(-5, 5)
plt.ylim(-5, 5);
It turns out that $V$ simply rotates the centered data matrix $X$ so that the direction with the most variation (i.e., the most spread-out direction) is aligned with the x-axis!
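We can sanity-check that claim: the columns of $V$ are orthonormal, so multiplying by $V$ can only rotate (or reflect) the point cloud, never stretch it. Here is a quick check on synthetic correlated data (made up here to stand in for the CSV above):

```python
import numpy as np

rng = np.random.default_rng(23)
# Hypothetical 2D data with correlated columns, standing in for data/2d.csv
X = rng.normal(size=(100, 2)) @ np.array([[1.0, 0.9], [0.0, 0.5]])
X = X - X.mean(axis=0)  # center, as in Step 1

U, S, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T

# V is orthonormal: V^T V = I, so X @ V is a rigid rotation/reflection
assert np.allclose(V.T @ V, np.eye(2))

# Rotations preserve pairwise distances, so the cloud's shape is unchanged
pcs = X @ V
assert np.allclose(np.linalg.norm(X[0] - X[1]),
                   np.linalg.norm(pcs[0] - pcs[1]))
```

Because the transformation is rigid, the scatter plot of the PCs is just a rotated copy of the centered scatter plot.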
Let's examine how the House of Representatives (of the 116th Congress, 1st session) voted in the month of September 2019.
From the U.S. Senate website:
Roll call votes occur when a representative or senator votes "yea" or "nay," so that the names of members voting on each side are recorded. A voice vote is a vote in which those in favor or against a measure say "yea" or "nay," respectively, without the names or tallies of members voting on each side being recorded.
The data, compiled from a ProPublica source, is a "skinny" table where each record is a single vote by a member on a roll call in the 116th Congress, 1st session, as downloaded in February 2020. Each member of the House, whom we'll call a legislator, is denoted by their bioguide alphanumeric ID from http://bioguide.congress.gov/.
# September 2019 House of Representatives roll call votes
# Downloaded using https://github.com/eyeseast/propublica-congress
votes = pd.read_csv('data/votes.csv')
votes = votes.astype({"roll call": str})
votes
  | chamber | session | roll call | member | vote |
---|---|---|---|---|---|
0 | House | 1 | 555 | A000374 | Not Voting |
1 | House | 1 | 555 | A000370 | Yes |
2 | House | 1 | 555 | A000055 | No |
3 | House | 1 | 555 | A000371 | Yes |
4 | House | 1 | 555 | A000372 | No |
... | ... | ... | ... | ... | ... |
17823 | House | 1 | 515 | Y000062 | Yes |
17824 | House | 1 | 515 | Y000065 | No |
17825 | House | 1 | 515 | Y000033 | Yes |
17826 | House | 1 | 515 | Z000017 | Yes |
17827 | House | 1 | 515 | P000197 | Speaker |
17828 rows × 5 columns
Suppose we pivot this table to group each legislator with their voting pattern across every roll call vote in this month. We mark 1 if the legislator voted Yes (yea), and 0 otherwise (No/nay, Not Voting, Speaker, etc.).
def was_yes(s):
return 1 if s.iloc[0] == "Yes" else 0
vote_pivot = votes.pivot_table(index='member',
columns='roll call',
values='vote',
aggfunc=was_yes,
fill_value=0)
print(vote_pivot.shape)
vote_pivot.head()
(441, 41)
member | 515 | 516 | 517 | 518 | 519 | 520 | 521 | 522 | 523 | 524 | ... | 546 | 547 | 548 | 549 | 550 | 551 | 552 | 553 | 554 | 555 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A000055 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
A000367 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 |
A000369 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
A000370 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 |
A000371 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 |
5 rows × 41 columns
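On a tiny hand-made table (the members and roll call numbers below are made up), the same pivot makes the 0/1 encoding concrete:

```python
import pandas as pd

def was_yes(s):
    # Each pivot cell holds one vote; encode "Yes" as 1, anything else as 0
    return 1 if s.iloc[0] == "Yes" else 0

toy = pd.DataFrame({
    "member":    ["A", "A", "B", "B"],
    "roll call": ["515", "516", "515", "516"],
    "vote":      ["Yes", "No", "Not Voting", "Yes"],
})

toy_pivot = toy.pivot_table(index="member", columns="roll call",
                            values="vote", aggfunc=was_yes, fill_value=0)
print(toy_pivot)
# A's row is [1, 0]; B's row is [0, 1]
```

`fill_value=0` matters in the real data: if a member has no recorded vote at all for some roll call, the cell would otherwise be missing.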
How do we analyze this data?
We could load additional information about each legislator, such as their party, and see how it relates to their voting pattern. But it turns out that PCA alone does a lot to cluster legislators by how they vote.
vote_pivot_centered = vote_pivot - np.mean(vote_pivot, axis = 0)
vote_pivot_centered.head(5)
member | 515 | 516 | 517 | 518 | 519 | 520 | 521 | 522 | 523 | 524 | ... | 546 | 547 | 548 | 549 | 550 | 551 | 552 | 553 | 554 | 555 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A000055 | 0.129252 | -0.668934 | -0.526077 | -0.52381 | 0.049887 | 0.587302 | -0.562358 | 0.634921 | 0.594104 | 0.560091 | ... | -0.521542 | -0.526077 | 0.045351 | -0.521542 | -0.519274 | 0.54195 | -0.521542 | -0.535147 | 0.086168 | -0.503401 |
A000367 | -0.870748 | -0.668934 | -0.526077 | -0.52381 | -0.950113 | -0.412698 | -0.562358 | -0.365079 | -0.405896 | -0.439909 | ... | -0.521542 | 0.473923 | 0.045351 | 0.478458 | 0.480726 | -0.45805 | 0.478458 | 0.464853 | -0.913832 | 0.496599 |
A000369 | 0.129252 | 0.331066 | -0.526077 | -0.52381 | 0.049887 | 0.587302 | -0.562358 | 0.634921 | 0.594104 | 0.560091 | ... | -0.521542 | -0.526077 | 0.045351 | -0.521542 | -0.519274 | 0.54195 | -0.521542 | -0.535147 | 0.086168 | -0.503401 |
A000370 | 0.129252 | 0.331066 | 0.473923 | 0.47619 | 0.049887 | -0.412698 | 0.437642 | -0.365079 | -0.405896 | -0.439909 | ... | 0.478458 | 0.473923 | 0.045351 | 0.478458 | 0.480726 | -0.45805 | 0.478458 | 0.464853 | 0.086168 | 0.496599 |
A000371 | 0.129252 | 0.331066 | 0.473923 | 0.47619 | 0.049887 | -0.412698 | 0.437642 | -0.365079 | -0.405896 | -0.439909 | ... | 0.478458 | 0.473923 | 0.045351 | 0.478458 | 0.480726 | -0.45805 | 0.478458 | 0.464853 | 0.086168 | 0.496599 |
5 rows × 41 columns
vote_pivot_centered.shape
(441, 41)
u, s, vt = np.linalg.svd(vote_pivot_centered, full_matrices = False)
pcs = u * s
sns.scatterplot(x=pcs[:, 0], y=pcs[:, 1]);
plt.xlabel("PC1");
plt.ylabel("PC2");
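Note that `u * s` and the earlier `centered_df @ Vt.T` are two routes to the same principal components: since $X = U \Sigma V^T$ and $V^T V = I$, we have $XV = U\Sigma$. A quick check on a synthetic matrix (made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
X = X - X.mean(axis=0)  # center first, as always

u, s, vt = np.linalg.svd(X, full_matrices=False)

# u * s scales each column of u by the matching singular value (broadcasting),
# while X @ vt.T projects the centered data onto the right singular vectors.
# Both give the principal components, up to floating-point error.
assert np.allclose(u * s, X @ vt.T)
```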
If the first two singular values are large and all others are small, then two dimensions are enough to describe most of what distinguishes one observation from another. If not, then a PCA scatter plot is omitting lots of information.
An equivalent way to evaluate this is to compute the variance ratios, i.e., the fraction of the total variance that each PC contributes.
np.round(s**2 / sum(s**2), 2)
array([0.8 , 0.05, 0.02, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ])
A scree plot (and where its "elbow" is located) is a visual way of checking the distribution of variance.
plt.plot(s**2 / sum(s**2), marker='.');
plt.xlabel("Principal Component $i$");
plt.ylabel("Variance Ratio");
Looks reasonable! Let's plot PC2 vs. PC1 again.
sns.scatterplot(x=pcs[:, 0], y=pcs[:, 1]);
plt.xlabel("PC1");
plt.ylabel("PC2");
Based on the plot above, it looks like there are two clusters of data points. What do you think these correspond to?
Suppose we load in more member information, from https://github.com/unitedstates/congress-legislators. This includes each legislator's political party.
# You can get current information about legislators with this code. In our case, we'll use
# a static copy of the 2019 membership roster to properly match our voting data.
# base_url = 'https://raw.githubusercontent.com/unitedstates/congress-legislators/main/'
# legislators_path = 'legislators-current.yaml'
# f = fetch_and_cache(base_url + legislators_path, legislators_path)
# Use 2019 data copy
legislators_data = yaml.safe_load(open('data/legislators-2019.yaml'))
def to_date(s):
return datetime.strptime(s, '%Y-%m-%d')
legs = pd.DataFrame(
columns=['leg_id', 'first', 'last', 'gender', 'state', 'chamber', 'party', 'birthday'],
data=[[x['id']['bioguide'],
x['name']['first'],
x['name']['last'],
x['bio']['gender'],
x['terms'][-1]['state'],
x['terms'][-1]['type'],
x['terms'][-1]['party'],
to_date(x['bio']['birthday'])] for x in legislators_data])
# Keep leg_id as a regular column (rather than the index)
# so that we can merge on it below.
legs.sort_index()
  | leg_id | first | last | gender | state | chamber | party | birthday |
---|---|---|---|---|---|---|---|---|
0 | B000944 | Sherrod | Brown | M | OH | sen | Democrat | 1952-11-09 |
1 | C000127 | Maria | Cantwell | F | WA | sen | Democrat | 1958-10-13 |
2 | C000141 | Benjamin | Cardin | M | MD | sen | Democrat | 1943-10-05 |
3 | C000174 | Thomas | Carper | M | DE | sen | Democrat | 1947-01-23 |
4 | C001070 | Robert | Casey | M | PA | sen | Democrat | 1960-04-13 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
534 | M001197 | Martha | McSally | F | AZ | sen | Republican | 1966-03-22 |
535 | G000592 | Jared | Golden | M | ME | rep | Democrat | 1982-07-25 |
536 | K000395 | Fred | Keller | M | PA | rep | Republican | 1965-10-23 |
537 | B001311 | Dan | Bishop | M | NC | rep | Republican | 1964-07-01 |
538 | M001210 | Gregory | Murphy | M | NC | rep | Republican | 1963-03-05 |
539 rows × 8 columns
Let's check out how party affiliations relate to the PC1, PC2 transformation from earlier:
First, let's see which party has negative PC1 scores.
vote2d = pd.DataFrame({
'member': vote_pivot.index,
'pc1': pcs[:, 0],
'pc2': pcs[:, 1]
}).merge(legs, left_on='member', right_on='leg_id')
vote2d[vote2d['pc1'] < 0]['party'].value_counts()
party
Democrat    231
Name: count, dtype: int64
Hmm, let's keep checking things out.
We didn't cover the `query` syntax used below, but if you're curious, check out the pandas documentation.
#top right only
vote2d.query('pc2 > -2 and pc1 > 0')['party'].value_counts()
party
Republican    194
Name: count, dtype: int64
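For reference, `DataFrame.query` filters rows with a boolean expression over column names; it is shorthand for boolean indexing. A toy example (made-up values):

```python
import pandas as pd

toy = pd.DataFrame({"pc1": [-1.5, 0.3, 2.0],
                    "pc2": [0.2, -3.0, 1.0]})

# Equivalent to toy[(toy["pc2"] > -2) & (toy["pc1"] > 0)]
result = toy.query("pc2 > -2 and pc1 > 0")
print(result)  # keeps only the last row, where both conditions hold
```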
Interesting. Let's use these party labels to color our principal components:
cp = sns.color_palette()
party_cp = [cp[i] for i in [0, 3, 1]]
party_hue = ["Democrat", "Republican", "Independent"]
sns.scatterplot(x="pc1", y="pc2",
hue="party", palette=party_cp, hue_order=party_hue,
data = vote2d);
There seems to be a bunch of overplotting, so let's jitter a bit.
vote2d['pc1_jittered'] = vote2d['pc1'] + np.random.normal(loc = 0, scale = 0.1, size = vote2d.shape[0])
vote2d['pc2_jittered'] = vote2d['pc2'] + np.random.normal(loc = 0, scale = 0.1, size = vote2d.shape[0])
sns.scatterplot(x="pc1_jittered", y="pc2_jittered",
hue="party", palette=party_cp, hue_order=party_hue,
data = vote2d);