import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')
import seaborn as sns
import numpy as np
import pandas as pd
from glob import glob
sns.set_context("notebook")
Data source: Social Security Administration baby names, https://www.ssa.gov/OACT/babynames/index.html
We can run terminal/shell commands directly in a notebook! Here's the command to download the dataset (commented out since the download takes a while):
# !wget https://www.ssa.gov/oact/babynames/state/namesbystate.zip
!unzip namesbystate.zip
!ls
!head CA.TXT
!wc -l CA.TXT
# !cat CA.TXT
ca = pd.read_csv('CA.TXT', header=None, names=['State', 'Sex', 'Year', 'Name', 'Count'])
ca.head()
ca['Count'].head()
ca[0:3]
# ca[0]
ca.iloc[0:3, 0:2]
ca.loc[0:3, 'State']
ca.loc[0:5, 'Sex':'Name']
What is the leftmost column?
emails = ca.head()
emails.index = ['a@gmail.com', 'b@gmail.com', 'c@gmail.com', 'd@gmail.com', 'e@gmail.com']
emails
emails.loc['b@gmail.com':'d@gmail.com', 'Year':'Name']
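The slicing cells above mix positional and label-based indexing, and the two treat the end of a slice differently. Here is a small sketch on a made-up table (the `toy` frame and its values are assumptions for illustration, not SSA data):

```python
import pandas as pd

# Toy data, made up for illustration (not from the SSA files).
toy = pd.DataFrame(
    {"Name": ["Ava", "Ben", "Cal", "Dee"], "Count": [40, 30, 20, 10]},
    index=["w", "x", "y", "z"],
)

by_position = toy.iloc[0:2]   # positions 0 and 1; the end position is excluded
by_label = toy.loc["w":"y"]   # labels w, x, and y; the end label is included

print(len(by_position), len(by_label))
```

This is why ca.loc[0:3, 'State'] returns four rows while ca.iloc[0:3, 0:2] returns three.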
ca.head()
(ca['Year'] == 2016).head()
ca[ca['Year'] == 2016].head()
ca_2016 = ca[ca['Year'] == 2016]
ca_2016.sort_values('Count', ascending=False).head()
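To make the filtering step concrete, here is a minimal sketch on made-up rows shaped like the CA table (the names and counts are invented). The comparison produces a Boolean Series, and indexing with it keeps only the rows marked True:

```python
import pandas as pd

# Made-up rows in the same shape as the CA table.
toy = pd.DataFrame({
    "Year": [2015, 2016, 2016, 2016],
    "Name": ["Mia", "Emma", "Noah", "Olivia"],
    "Count": [10, 30, 50, 40],
})

mask = toy["Year"] == 2016   # Boolean Series, one entry per row
top_2016 = toy[mask].sort_values("Count", ascending=False)

print(top_2016["Name"].tolist())
```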
# Make sure that file sizes are manageable
!ls -alh *.TXT | head
glob('*.TXT')
file_names = glob('*.TXT')
baby_names = pd.concat(
    (pd.read_csv(f, names=['State', 'Sex', 'Year', 'Name', 'Count']) for f in file_names)
).reset_index(drop=True)
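Why the reset_index(drop=True) at the end? Each file gets its own 0, 1, 2, ... index, so the concatenated table would otherwise have repeated labels. A sketch using two in-memory stand-ins for the state files (the file contents below are made up):

```python
import io
import pandas as pd

# Two in-memory stand-ins for state files; the rows are made up.
fake_files = {
    "CA.TXT": "CA,F,2000,Ann,5\nCA,M,2000,Bob,7\n",
    "NY.TXT": "NY,F,2000,Ann,3\n",
}
columns = ["State", "Sex", "Year", "Name", "Count"]

frames = [
    pd.read_csv(io.StringIO(text), header=None, names=columns)
    for text in fake_files.values()
]

stacked = pd.concat(frames)                 # index labels repeat: 0, 1, 0
combined = stacked.reset_index(drop=True)   # fresh labels: 0, 1, 2

print(stacked.index.tolist(), combined.index.tolist())
```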
baby_names.head()
len(baby_names)
baby_names[
    (baby_names['State'] == 'CA')
    & (baby_names['Year'] == 1995)
    & (baby_names['Sex'] == 'M')
].head()
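A note on the filter above: each comparison is wrapped in parentheses because & binds more tightly than ==. A minimal sketch on a made-up table (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Made-up rows; only the first one matches all three conditions.
toy = pd.DataFrame({
    "State": ["CA", "CA", "NY", "CA"],
    "Sex": ["M", "F", "M", "M"],
    "Year": [1995, 1995, 1995, 1996],
})

# Combine the element-wise conditions with &, one parenthesized test per column.
selected = toy[
    (toy["State"] == "CA")
    & (toy["Year"] == 1995)
    & (toy["Sex"] == "M")
]

print(selected.index.tolist())
```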
# Now I could write 3 nested for loops...
baby_names.groupby('State').size().head()
state_counts = baby_names.loc[:, ['State', 'Count']]
state_counts.head()
state_counts.groupby('State').sum().head()
# Data 8 equivalent (datascience Table method, not available on a pandas DataFrame):
# state_counts.group('State', np.sum)
state_counts.groupby('State').agg(np.sum).head()
Using .agg with a function of our choice to aggregate.
Equivalent to this code from Data 8:
state_counts.group('State', np.sum)
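To see that these aggregation spellings agree, and what a genuinely custom aggregate looks like, here is a small sketch on made-up counts. The helper `count_range` is hypothetical, written just for this example:

```python
import pandas as pd

# Made-up counts for two states.
toy = pd.DataFrame({
    "State": ["CA", "NY", "CA", "NY"],
    "Count": [5, 3, 7, 1],
})

via_sum = toy.groupby("State").sum()
via_agg = toy.groupby("State").agg("sum")   # same result as .sum()

def count_range(series):
    """Hypothetical custom aggregate: largest minus smallest value in the group."""
    return series.max() - series.min()

via_custom = toy.groupby("State").agg(count_range)

print(via_sum)
print(via_custom)
```

The function passed to .agg receives one column of one group at a time as a Series and should return a single value.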
baby_names.groupby(['State', 'Year']).size().head()
# Select Count first so that only the numeric column is summed
baby_names.groupby(['State', 'Year'])[['Count']].sum().head()
baby_names.groupby(['State', 'Year', 'Sex'])[['Count']].sum().head()
def first(series):
    '''Returns the first value in the series.'''
    return series.iloc[0]
most_popular_names = (
    baby_names
    .groupby(['State', 'Year', 'Sex'])
    .agg(first)
)
most_popular_names
This creates a multilevel index (a MultiIndex). It is quite complex, but just know that you can still filter rows with a Boolean condition:
most_popular_names[most_popular_names['Name'] == 'Samuel']
And you can use .loc like so:
most_popular_names.loc['CA', 1997, 'M']
most_popular_names.loc['CA', 1995:2000, 'M']
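Here is the same pattern end to end on a tiny made-up table, as a sketch of how the grouped result gets its three-level index and how .loc addresses it. Taking the first row per group only gives the most popular name if rows within each group are ordered by count, which this notebook appears to assume about the files.

```python
import pandas as pd

# Made-up rows, listed in descending Count within each group.
toy = pd.DataFrame({
    "State": ["CA", "CA", "CA", "NY"],
    "Year": [1995, 1995, 1996, 1995],
    "Sex": ["M", "M", "M", "M"],
    "Name": ["Daniel", "Michael", "Daniel", "Michael"],
    "Count": [9, 8, 7, 6],
})

def first(series):
    """Return the first value in the series."""
    return series.iloc[0]

firsts = toy.groupby(["State", "Year", "Sex"]).agg(first)

# One full key (State, Year, Sex) returns a single row as a Series.
one_row = firsts.loc[("CA", 1995, "M")]

# slice(None) means "every value at this level": all years for CA boys.
ca_boys = firsts.loc[("CA", slice(None), "M"), :]

print(one_row["Name"], ca_boys["Count"].tolist())
```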
Survey question time!
baby_names.head()
baby_names['Name'].apply(len).head()
baby_names['Name'].str.len().head()
baby_names['Name'].str[-1].head()
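The three cells above compute per-name string properties in two ways. A sketch on a made-up Series shows that apply(len) and .str.len() agree, and that indexing through .str works element by element:

```python
import pandas as pd

# Made-up names for illustration.
names = pd.Series(["Emma", "Noah", "Olivia"])

lengths_apply = names.apply(len)   # calls Python's len once per element
lengths_str = names.str.len()      # pandas string method, same values
last_letters = names.str[-1]       # last character of each name

print(lengths_str.tolist(), last_letters.tolist())
```

The .str versions are usually preferred in pandas code because they also handle missing values without raising an error.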
To add a column to the DataFrame:
baby_names['Last letter'] = baby_names['Name'].str[-1]
baby_names.head()
letter_counts = (baby_names
                 .loc[:, ['Sex', 'Count', 'Last letter']]
                 .groupby(['Last letter', 'Sex'])
                 .sum())
letter_counts.head()
Use .plot to get some basic plotting functionality:
# Why is this not good?
letter_counts.plot.barh(figsize=(15, 15))
Reading the docs shows me that pandas will make one set of bars for each column in my table. How do I move each sex into its own column? I have to use pivot:
# For comparison, the group above:
# letter_counts = (baby_names
#                  .loc[:, ['Sex', 'Count', 'Last letter']]
#                  .groupby(['Last letter', 'Sex'])
#                  .sum())
last_letter_pivot = baby_names.pivot_table(
    index='Last letter',  # the rows (turned into the index)
    columns='Sex',        # the column values
    values='Count',       # the field(s) to be processed in each group
    aggfunc='sum',        # group operation
)
last_letter_pivot.head()
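One way to think about the pivot: it is the two-key groupby from before with one key moved from the rows into the columns. A sketch on made-up data, assuming a total per (letter, sex) pair is what we want:

```python
import pandas as pd

# Made-up rows for illustration.
toy = pd.DataFrame({
    "Sex": ["F", "M", "F", "M", "F"],
    "Last letter": ["a", "a", "n", "n", "a"],
    "Count": [10, 2, 3, 8, 5],
})

# Long form: one row per (Last letter, Sex) pair.
long_form = toy.groupby(["Last letter", "Sex"])["Count"].sum()

# Wide form, two ways: unstack the Sex level, or ask pivot_table directly.
wide_from_unstack = long_form.unstack("Sex")
wide_from_pivot = toy.pivot_table(
    index="Last letter", columns="Sex", values="Count", aggfunc="sum"
)

print(wide_from_pivot)
```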
last_letter_pivot.plot.barh(figsize=(10, 10))
Why is this still not ideal?
totals = last_letter_pivot['F'] + last_letter_pivot['M']
last_letter_props = pd.DataFrame({
    'F': last_letter_pivot['F'] / totals,
    'M': last_letter_pivot['M'] / totals,
}).sort_values('M')
last_letter_props.head()
last_letter_props.plot.barh(figsize=(10, 10))
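The proportions can also be computed in one step by dividing every row by its own total. This is a sketch of an alternative spelling on a made-up pivot table, not a change to the approach above:

```python
import pandas as pd

# Made-up pivot table: counts by last letter (rows) and sex (columns).
pivot = pd.DataFrame(
    {"F": [15, 3], "M": [5, 9]},
    index=pd.Index(["a", "n"], name="Last letter"),
)

totals = pivot.sum(axis=1)                          # one total per letter
props = pivot.div(totals, axis=0).sort_values("M")  # divide each row by its total

print(props)
```

After this step every row sums to 1, so letters with very different raw counts become comparable.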
What do you notice?
Let's use a subset of our dataset for now:
ca_and_ny = baby_names[
    (baby_names['Year'] == 2016)
    & (baby_names['State'].isin(['CA', 'NY']))
]
ca_and_ny.head()
We actually don't need to do any pivoting / grouping for seaborn!
sns.barplot(x=..., y=..., data=...)
Note the automatic confidence interval generation. Many seaborn functions have these nifty statistical features.
(It isn't actually useful in our case, since we have a census rather than a sample. It also makes seaborn functions run slower, because they use the bootstrap to generate the CI, so sometimes you will want to turn it off.)
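To give a feel for where the cost comes from, here is a minimal percentile-bootstrap sketch in plain NumPy. It illustrates the general idea only; the sample values, seed, and number of resamples are assumptions, and seaborn's internal procedure may differ in its details.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the sketch is repeatable
sample = np.array([12.0, 15.0, 9.0, 20.0, 14.0, 11.0])   # made-up values

n_boot = 1000   # number of resamples (an arbitrary choice for this sketch)
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(n_boot)
])

# The middle 95% of the resampled means serves as the interval.
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean={sample.mean():.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

Every bar in a plot needs its own set of resamples, which is why the extra work adds up on large tables.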
We'll work with the tips dataset just to demonstrate:
tips = sns.load_dataset("tips")
tips.head()
sns.histplot(...)  # distplot is deprecated in recent seaborn releases
sns.lmplot(x=..., y=..., data=...)