Notebook originally by Josh Hug (Fall 2019)
Edits by Anthony D. Joseph and Suraj Rampure (Fall 2020)
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
We often create visualizations in order to facilitate exploratory data analysis. For example, we might create scatterplots to explore the relationship between pairs of variables in a dataset.
The dataset below gives the "percentage body fat, age, weight, height, and ten body circumference measurements" for 252 men.
http://jse.amstat.org/v4n1/datasets.johnson.html
For simplicity, we read in only 8 of the provided attributes, yielding the given dataframe.
#http://jse.amstat.org/datasets/fat.txt
df3 = pd.read_fwf("data/fat.dat.txt",
                  colspecs=[(17, 21), (23, 29), (35, 37), (39, 45),
                            (48, 53), (73, 77), (80, 85), (88, 93)],
                  header=None,
                  names=["% fat", "density", "age", "weight",
                         "height", "neck", "chest", "abdomen"])
df3.head()
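As a rough illustration of how `colspecs` works, here is a minimal sketch on a made-up two-column fixed-width table. The values, the column positions, and the name `toy` are all hypothetical and are not taken from `fat.dat.txt`. Each `(start, stop)` pair is a half-open character range, counted from 0.

```python
import io
import pandas as pd

# Hypothetical fixed-width rows: "% fat" in characters 0-4, "density" in characters 6-11.
rows = [(10.0, 1.0750), (20.5, 1.0500), (5.5, 1.0875)]
text = "\n".join(f"{fat:5.1f} {density:6.4f}" for fat, density in rows)

# Each (start, stop) pair selects the characters [start, stop) of every line.
toy = pd.read_fwf(io.StringIO(text),
                  colspecs=[(0, 5), (6, 12)],
                  header=None,
                  names=["% fat", "density"])
print(toy)
```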
We see that percentage fat and density in g/cm^3 are almost completely redundant.
sns.scatterplot(data = df3, x = "% fat", y = "density");
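One plausible reason for the redundancy: the dataset description reports that percent body fat was computed from density using Siri's equation, % fat = 495 / density - 450. Assuming the column read here is that one, the sketch below applies the equation to synthetic densities (randomly generated, not the real measurements) to show how close to a straight line the relationship is over this range.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic densities in g/cm^3, chosen only to cover a plausible range (assumption).
density = rng.uniform(1.0, 1.1, size=252)

# Siri's equation: percent body fat as a deterministic function of density.
pct_fat = 495 / density - 450

# The relationship is nonlinear (1 / density), but nearly linear over this narrow range,
# so the Pearson correlation is very close to -1.
r = np.corrcoef(density, pct_fat)[0, 1]
print(round(r, 4))
```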
By contrast, while there is a strong correlation between neck and chest measurements, the points still spread out over a genuinely two-dimensional region rather than collapsing onto a line.
sns.scatterplot(data = df3, x = "neck", y = "chest");
Age and height show a weak negative correlation: people seem to be slightly shorter at greater ages in this dataset.
sns.scatterplot(data = df3, x = "age", y = "height");
We note that there is one outlier: a person recorded as slightly less than 29.5 inches tall. While there are an extraordinarily small number of adult males who are less than three feet tall, the rest of the data for this observation suggests that this was simply an error.
df3.query("height < 40")
df3 = df3.drop(41)
sns.scatterplot(data = df3, x = "age", y = "height");
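The drop above hardcodes the row label 41. A slightly more defensive variant, sketched here on a made-up frame (the values and the name `toy` are hypothetical), takes the labels to drop from the query result itself, so it keeps working if the index changes.

```python
import pandas as pd

# Hypothetical data: one implausibly short height, stored under index label 41.
toy = pd.DataFrame({"age": [23, 44, 61, 35],
                    "height": [67.75, 29.5, 70.0, 72.25]},
                   index=[0, 41, 42, 43])

# Find the suspicious rows, then drop them by label instead of by a hardcoded number.
outliers = toy.query("height < 40")
cleaned = toy.drop(outliers.index)
print(cleaned)
```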
We can try to visualize more than 2 attributes at once, but the relationships displayed in e.g. the color and dot size space are much harder for human readers to see. For example, above we saw that density and % fat are almost entirely redundant, but this relationship is impossible to see when comparing the colors and dot sizes.
sns.scatterplot(data = df3, x = "neck", y = "chest", hue="density", size = "% fat");
Seaborn gives us the ability to create a matrix of scatterplots for all possible pairs of variables. This can be useful, though even with only 8 variables it's still difficult to fully digest.
sns.pairplot(df3);
We should note that despite the very strong relationship between % fat and density, the numerical rank of the data matrix is still 8. For the rank to drop to 7, one column would have to be an almost exact linear combination of the others; a strong but imperfect (or nonlinear) relationship is not enough. We'll talk about techniques to reduce the dimensionality over the course of this lecture and the next.
np.linalg.matrix_rank(df3)
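To see why a strong relationship does not by itself lower the rank, here is a small sketch on synthetic data (not the body fat data). A column that is an exact linear combination of the others reduces the rank, while the same column with a small amount of noise added does not.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))

# Fourth column is an exact linear combination of the first two: rank stays at 3.
exact = np.column_stack([X, 2 * X[:, 0] - X[:, 1]])

# Same matrix, but the fourth column is perturbed slightly: rank becomes 4,
# even though the column is still almost perfectly predictable from the others.
noisy = exact.copy()
noisy[:, 3] += rng.normal(scale=1e-3, size=50)

rank_exact = np.linalg.matrix_rank(exact)
rank_noisy = np.linalg.matrix_rank(noisy)
print(rank_exact, rank_noisy)  # 3 4
```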
Next, let's consider voting data from the U.S. House of Representatives during the month of September 2019. In this example, our goal will be to try to find clusters of representatives who vote in similar ways. For example, we might expect to find that Democrats and Republicans vote similarly to other members of their party.
from pathlib import Path
from ds100_utils import fetch_and_cache
from datetime import datetime
from IPython.display import display
import yaml
plt.rcParams['figure.figsize'] = (4, 4)
plt.rcParams['figure.dpi'] = 150
sns.set()
base_url = 'https://github.com/unitedstates/congress-legislators/raw/master/'
legislators_path = 'legislators-current.yaml'
f = fetch_and_cache(base_url + legislators_path, legislators_path)
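# yaml.load requires an explicit Loader in PyYAML 6+; safe_load works across versions
with open(f) as fh:
    legislators_data = yaml.safe_load(fh)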
def to_date(s):
    return datetime.strptime(s, '%Y-%m-%d')
legs = pd.DataFrame(
    columns=['leg_id', 'first', 'last', 'gender', 'state', 'chamber', 'party', 'birthday'],
    data=[[x['id']['bioguide'],
           x['name']['first'],
           x['name']['last'],
           x['bio']['gender'],
           x['terms'][-1]['state'],
           x['terms'][-1]['type'],
           x['terms'][-1]['party'],
           to_date(x['bio']['birthday'])] for x in legislators_data])
legs.head(3)
# September 2019 House of Representatives roll call votes
# Downloaded using https://github.com/eyeseast/propublica-congress
votes = pd.read_csv('data/votes.csv')
votes = votes.astype({"roll call": str})
votes.head()
votes.merge(legs, left_on='member', right_on='leg_id').sample(5)
def was_yes(s):
    if s.iloc[0] == 'Yes':
        return 1
    else:
        return 0
vote_pivot = votes.pivot_table(index='member',
                               columns='roll call',
                               values='vote',
                               aggfunc=was_yes,
                               fill_value=0)
print(vote_pivot.shape)
vote_pivot.head()
vote_pivot.shape
This data has 441 observations (members of the House of Representatives including the 6 non-voting representatives) and 41 dimensions (votes). While politics is quite polarized, none of these columns are linearly dependent as we note below.
np.linalg.matrix_rank(vote_pivot)
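The pivot step used to build `vote_pivot` can be sketched on a tiny made-up table. The member names, roll call numbers, and the helper column `yes` are hypothetical. This variant encodes "Yes" as 1 before pivoting and aggregates with `"max"`, which is a slightly different route from the `was_yes` aggregation above but should give the same kind of 0/1 member-by-vote matrix.

```python
import pandas as pd

# Hypothetical long-format votes: one row per (member, roll call).
toy_votes = pd.DataFrame({
    "member": ["A", "A", "B", "B", "C"],
    "roll call": ["1", "2", "1", "2", "1"],
    "vote": ["Yes", "No", "No", "Yes", "Not Voting"],
})

# Encode "Yes" as 1 and everything else as 0 before pivoting.
toy_votes["yes"] = (toy_votes["vote"] == "Yes").astype(int)

# One row per member, one column per roll call; missing combinations become 0.
toy_pivot = toy_votes.pivot_table(index="member",
                                  columns="roll call",
                                  values="yes",
                                  aggfunc="max",
                                  fill_value=0)
print(toy_pivot)
```

Member C has no recorded vote on roll call 2, so that cell is filled with 0, which is the same treatment `fill_value=0` gives to missing votes in `vote_pivot`.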
Suppose we want to find clusters of similar voting behavior. We might try reducing our data to only two dimensions and looking to see if we can identify clear patterns. Let's start by looking at which votes were most controversial.
np.var(vote_pivot, axis=0).sort_values(ascending = False)
We see that roll call 548 had very little variance. According to http://clerk.house.gov/evs/2019/roll548.xml, this vote concerned the 2019 whistleblower complaint about President Trump and Ukraine. The full text of the House resolution for this roll call can be found at https://www.congress.gov/bill/116th-congress/house-resolution/576/text:
(1) the whistleblower complaint received on August 12, 2019, by the Inspector General of the Intelligence Community shall be transmitted immediately to the Select Committee on Intelligence of the Senate and the Permanent Select Committee on Intelligence of the House of Representatives; and
(2) the Select Committee on Intelligence of the Senate and the Permanent Select Committee on Intelligence of the House of Representatives should be allowed to evaluate the complaint in a deliberate and bipartisan manner consistent with applicable statutes and processes in order to safeguard classified and sensitive information.
We see that 421 congresspeople voted for this resolution, and 12 did not. Of those 12, 2 members answered "present" but did not vote no, and 10 did not vote at all. Clearly, a scatterplot involving this particular dimension isn't going to be useful.
vote_pivot['548'].value_counts()
By contrast, we saw high variance for most of the other roll call votes. Most of them had variances near 0.25, which is the maximum possible for a variable that can only take on the values 0 or 1. Let's consider the two highest variance variables, shown below:
vote_pivot['555'].value_counts()
vote_pivot['530'].value_counts()
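The 0.25 ceiling mentioned above comes from the fact that a 0/1 variable with a fraction p of ones has variance p(1 - p), which peaks at p = 0.5. A quick numerical check on synthetic columns (the split sizes are made up for illustration):

```python
import numpy as np

# Variance of a 0/1 variable as a function of the fraction p of ones.
p = np.linspace(0, 1, 101)
bernoulli_var = p * (1 - p)
best_p = p[np.argmax(bernoulli_var)]
print(best_p, bernoulli_var.max())

# Hypothetical columns: an even split versus a lopsided 95% / 5% split.
even_split = np.array([1, 0] * 50)
lopsided = np.array([1] * 95 + [0] * 5)
print(np.var(even_split))  # evenly split vote: variance at the 0.25 maximum
print(np.var(lopsided))    # near-unanimous vote: variance close to 0
```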
Let's use these as our two dimensions for our scatterplot and see what happens.
sns.scatterplot(x='530', y='555', data=vote_pivot);
By adding some random noise, we can get rid of the overplotting.
vote_pivot_jittered = vote_pivot.copy()
vote_pivot_jittered.loc[:, '515':'555'] += np.random.