DataTables, Indexes, Pandas, and Seaborn¶

import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')
import seaborn as sns
import numpy as np
import pandas as pd
from glob import glob
sns.set_context("notebook")

Getting the Data¶

https://www.ssa.gov/OACT/babynames/index.html

https://www.ssa.gov/data/

We can run terminal/shell commands directly in a notebook! Here's the code to download the dataset (not running since it takes a while):

!wget https://www.ssa.gov/oact/babynames/state/namesbystate.zip

# !wget https://www.ssa.gov/oact/babynames/state/namesbystate.zip
!unzip namesbystate.zip

!ls

03-datatables-indexes-pandas.ipynb     NC.TXT
03live-datatables-indexes-pandas.ipynb ND.TXT
AK.TXT                                 NE.TXT
AL.TXT                                 NH.TXT
AR.TXT                                 NJ.TXT
AZ.TXT                                 NM.TXT
CA.TXT                                 NV.TXT
CO.TXT                                 NY.TXT
CT.TXT                                 OH.TXT
DC.TXT                                 OK.TXT
DE.TXT                                 OR.TXT
FL.TXT                                 PA.TXT
GA.TXT                                 RI.TXT
HI.TXT                                 SC.TXT
IA.TXT                                 SD.TXT
ID.TXT                                 StateReadMe.pdf
IL.TXT                                 TN.TXT
IN.TXT                                 TX.TXT
KS.TXT                                 UT.TXT
KY.TXT                                 VA.TXT
LA.TXT                                 VT.TXT
MA.TXT                                 WA.TXT
MD.TXT                                 WI.TXT
ME.TXT                                 WV.TXT
MI.TXT                                 WY.TXT
MN.TXT                                 lec03.ipynb
MO.TXT                                 lec03_live.ipynb
MS.TXT                                 namesbystate.zip
MT.TXT

!head CA.TXT

!wc -l CA.TXT

  367931 CA.TXT

# !cat CA.TXT

Question 1: What was the most popular name in CA last year?¶

ca = pd.read_csv('CA.TXT', header=None, names=['State', 'Sex', 'Year', 'Name', 'Count'])
ca.head()

Slicing¶

ca['Count'].head()

0    295
1    239
2    220
3    163
4    134
Name: Count, dtype: int64

ca[0:3]

# ca[0]

ca.iloc[0:3, 0:2]

ca.loc[0:3, 'State']

0    CA
1    CA
2    CA
3    CA
Name: State, dtype: object

ca.loc[0:5, 'Sex':'Name']

What is the leftmost column?

emails = ca.head()
emails.index = ['a@gmail.com', 'b@gmail.com', 'c@gmail.com', 'd@gmail.com', 'e@gmail.com']
emails

emails.loc['b@gmail.com':'d@gmail.com', 'Year':'Name']
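The cells above mix two kinds of indexing, so here is a minimal sketch of the difference on a made-up five-row table. The `toy` frame, its letter index, and all of its values are illustrative assumptions, not rows from the SSA data.

```python
import pandas as pd

# Hypothetical toy frame; every value here is made up for illustration.
toy = pd.DataFrame(
    {
        'State': ['CA'] * 5,
        'Sex': ['F'] * 5,
        'Year': [1910] * 5,
        'Name': ['A', 'B', 'C', 'D', 'E'],
        'Count': [5, 4, 3, 2, 1],
    },
    index=['a', 'b', 'c', 'd', 'e'],
)

# .iloc is position-based: the end of each slice is excluded.
by_position = toy.iloc[0:3, 0:2]            # rows 0-2, columns State and Sex

# .loc is label-based: both ends of each slice are included.
by_label = toy.loc['b':'d', 'Year':'Name']  # rows b-d, columns Year and Name
```

The leftmost column in the printed output is the index. It holds row labels, which is why `.loc` can slice by email address once the index is replaced.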

ca.head()

(ca['Year'] == 2016).head()

0    False
1    False
2    False
3    False
4    False
Name: Year, dtype: bool

ca[ca['Year'] == 2016].head()

Sorting¶

ca_sorted = ca[ca['Year'] == 2016]
ca_sorted.sort_values('Count', ascending=False).head()
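To make the filter-then-sort pattern concrete, here is a small sketch on made-up data (the `toy` frame and its counts are assumptions for illustration). The comparison builds a boolean Series, indexing with it keeps only the `True` rows, and sorting in descending order puts the largest count first.

```python
import pandas as pd

# Hypothetical data; the names and counts are placeholders.
toy = pd.DataFrame({
    'Year': [2015, 2016, 2016, 2016],
    'Name': ['A', 'B', 'C', 'D'],
    'Count': [10, 7, 12, 3],
})

mask = toy['Year'] == 2016        # boolean Series, one entry per row
in_2016 = toy[mask]               # keeps only the rows where mask is True

# Sort descending by Count and take the first row: the most popular name.
top = in_2016.sort_values('Count', ascending=False).iloc[0]
```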

Question 2: Most popular names in all states for each year?¶

Put all DFs together¶

# Make sure that file sizes are manageable
!ls -alh *.TXT | head

-rw-r--r--  1 sam  staff   548K Mar 10 00:00 AK.TXT
-rw-r--r--  1 sam  staff   2.6M Mar 10 00:00 AL.TXT
-rw-r--r--  1 sam  staff   1.9M Mar 10 00:00 AR.TXT
-rw-r--r--  1 sam  staff   2.2M Mar 10 00:00 AZ.TXT
-rw-r--r--  1 sam  staff   7.3M Mar 10 00:00 CA.TXT
-rw-r--r--  1 sam  staff   2.0M Mar 10 00:00 CO.TXT
-rw-r--r--  1 sam  staff   1.6M Mar 10 00:00 CT.TXT
-rw-r--r--  1 sam  staff   1.1M Mar 10 00:00 DC.TXT
-rw-r--r--  1 sam  staff   628K Mar 10 00:00 DE.TXT
-rw-r--r--  1 sam  staff   3.9M Mar 10 00:00 FL.TXT

glob('*.TXT')

file_names = glob('*.TXT')

baby_names = pd.concat(
    (pd.read_csv(f, names=['State', 'Sex', 'Year', 'Name', 'Count']) for f in file_names)
).reset_index(drop=True)
baby_names.head()

len(baby_names)

5838786
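Why the `reset_index(drop=True)` at the end? A rough sketch, using in-memory strings as hypothetical stand-ins for two state files (the rows are made up): each piece read by `read_csv` gets its own 0-based index, so the concatenated table would otherwise have repeated row labels.

```python
import io
import pandas as pd

columns = ['State', 'Sex', 'Year', 'Name', 'Count']

# Hypothetical stand-ins for two state files; the rows are made up.
fake_files = [
    io.StringIO("AK,F,1910,A,14\nAK,M,1910,B,8\n"),
    io.StringIO("AL,F,1910,C,30\n"),
]

combined = pd.concat(
    pd.read_csv(f, names=columns) for f in fake_files
)

# Each piece kept its own 0-based index, so the labels repeat.
repeated_index = combined.index.tolist()

# reset_index(drop=True) swaps them for a fresh 0..n-1 index
# and discards the old labels instead of keeping them as a column.
combined = combined.reset_index(drop=True)
```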

Group by state and year¶

baby_names[
    (baby_names['State'] == 'CA')
    & (baby_names['Year'] == 1995)
    & (baby_names['Sex'] == 'M')
].head()

# Now I could write 3 nested for loops...

baby_names.groupby('State').size().head()

State
AK     27624
AL    130297
AR     98853
AZ    110866
CA    367931
dtype: int64

state_counts = baby_names.loc[:, ['State', 'Count']]
state_counts.head()

state_counts.groupby('State').sum().head()

# state_counts.group('State', np.sum)  # Data 8 (datascience Table) syntax; a pandas DataFrame has no .group method

state_counts.groupby('State').agg(np.sum).head()

Using a custom function to aggregate.

Equivalent to this code from Data 8:

state_counts.group('State', np.sum)
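As a small sketch of what "custom function" means here: `.agg` calls the function once per group for each remaining column and expects a single value back. The `toy` frame and the `value_range` helper below are hypothetical and only for illustration.

```python
import pandas as pd

# Hypothetical data for illustration.
toy = pd.DataFrame({
    'State': ['AK', 'AK', 'CA', 'CA', 'CA'],
    'Count': [1, 2, 3, 4, 5],
})

# Built-in aggregation, written two equivalent ways.
built_in = toy.groupby('State').sum()
via_agg = toy.groupby('State').agg('sum')

def value_range(series):
    """Custom aggregate: largest minus smallest value in the group."""
    return series.max() - series.min()

# The custom function receives each group's Count values as a Series.
custom = toy.groupby('State').agg(value_range)
```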

Grouping by multiple columns¶

baby_names.groupby(['State', 'Year']).size().head()

State  Year
AK     1910    16
       1911    11
       1912    20
       1913    12
       1914    32
dtype: int64

baby_names.groupby(['State', 'Year']).sum().head()

baby_names.groupby(['State', 'Year', 'Sex']).sum().head()

def first(series):
    '''Returns the first value in the series.'''
    return series.iloc[0]

most_popular_names = (
    baby_names
    .groupby(['State', 'Year', 'Sex'])
    .agg(first)
)
most_popular_names

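This creates a multilevel index (a pandas MultiIndex). Taking the first row of each group gives the most popular name only because the files appear to list names in descending order of count within each state, sex, and year, as the earlier outputs suggest. The index is fairly complex, but ordinary boolean filtering still works: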

most_popular_names[most_popular_names['Name'] == 'Samuel']

And you can use .loc like so:

most_popular_names.loc['CA', 1997, 'M']

Name     Daniel
Count      4452
Name: (CA, 1997, M), dtype: object

most_popular_names.loc['CA', 1995:2000, 'M']
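Here is a minimal sketch of indexing into a multilevel index, on a made-up table (the `toy` frame and its values are assumptions). It also shows a more explicit spelling of the row key as a tuple with `slice(...)`, which can be easier to read when a range is involved.

```python
import pandas as pd

# Hypothetical data, already sorted by descending Count within each group.
toy = pd.DataFrame({
    'State': ['CA', 'CA', 'CA', 'CA', 'NY', 'NY'],
    'Year':  [1995, 1995, 1996, 1996, 1995, 1995],
    'Sex':   ['M', 'M', 'M', 'M', 'M', 'M'],
    'Name':  ['A', 'B', 'C', 'D', 'E', 'F'],
    'Count': [9, 4, 8, 2, 7, 1],
})

def first(series):
    '''Returns the first value in the series.'''
    return series.iloc[0]

top = toy.groupby(['State', 'Year', 'Sex']).agg(first)

# A full key (one label per index level) selects a single row as a Series.
one_row = top.loc[('CA', 1995, 'M')]

# A tuple with slice(...) selects a label range on the Year level;
# like other .loc slices, both endpoints are included.
year_range = top.loc[('CA', slice(1995, 1996), 'M'), :]
```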

Question 3: Can I deduce gender from the last letter of a person’s name?¶

Survey question time!

Compute last letter of each name¶

baby_names.head()

baby_names['Name'].apply(len).head()

0    4
1    5
2    4
3    8
4    5
Name: Name, dtype: int64

baby_names['Name'].str.len().head()

0    4
1    5
2    4
3    8
4    5
Name: Name, dtype: int64

baby_names['Name'].str[-1].head()

0    y
1    e
2    a
3    t
4    n
Name: Name, dtype: object

To add a column to the DataFrame:

baby_names['Last letter'] = baby_names['Name'].str[-1]
baby_names.head()
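The `.apply(len)` and `.str.len()` calls above give the same result on clean data. A quick sketch of where they differ, using a made-up Series (an assumption for illustration): the `.str` methods pass missing values through, while calling `len` on a missing value would raise a `TypeError`.

```python
import pandas as pd

names = pd.Series(['Mary', 'Helen', 'Jo'])

lengths_apply = names.apply(len)   # calls Python's len on each element
lengths_str = names.str.len()      # vectorized string method, same values
last_letters = names.str[-1]       # indexing through .str works per element

# With a missing value, .str.len() returns a missing result for that row.
with_missing = pd.Series(['Mary', None])
missing_len = with_missing.str.len()
```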

Group by last letter and sex¶

letter_counts = (baby_names
                 .loc[:, ['Sex', 'Count', 'Last letter']]
                 .groupby(['Last letter', 'Sex'])
                 .sum())
letter_counts.head()

Visualize our result¶

Use .plot to get some basic plotting functionality:

# Why is this not good?
letter_counts.plot.barh(figsize=(15, 15))

<matplotlib.axes._subplots.AxesSubplot at 0x111b5fd30>

Reading the docs shows me that pandas will make one set of bars for each column in my table. How do I move each sex into its own column? I have to use pivot:

# For comparison, the group above:
# letter_counts = (baby_names
#                  .loc[:, ('Sex', 'Count', 'Last letter')]
#                  .groupby(['Last letter', 'Sex'])
#                  .sum())

last_letter_pivot = baby_names.pivot_table(
    index='Last letter', # the rows (turned into index)
    columns='Sex', # the column values
    values='Count', # the field(s) to be processed in each group
    aggfunc='sum', # group operation
)
last_letter_pivot.head()
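One way to see how the pivot relates to the earlier group: on a made-up table (the `toy` frame is an assumption), grouping by both columns and then moving the `Sex` level of the index into the columns with `unstack` gives the same numbers as `pivot_table`.

```python
import pandas as pd

# Hypothetical data for illustration.
toy = pd.DataFrame({
    'Last letter': ['a', 'a', 'a', 'b', 'b'],
    'Sex':         ['F', 'F', 'M', 'F', 'M'],
    'Count':       [10, 5, 1, 2, 8],
})

pivoted = toy.pivot_table(
    index='Last letter',
    columns='Sex',
    values='Count',
    aggfunc='sum',
)

# Same result: group on both keys, sum, then spread Sex across the columns.
unstacked = (
    toy.groupby(['Last letter', 'Sex'])['Count']
    .sum()
    .unstack('Sex')
)
```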

last_letter_pivot.plot.barh(figsize=(10, 10))

<matplotlib.axes._subplots.AxesSubplot at 0x1127b9080>

Why is this still not ideal?

It plots raw counts rather than proportions.
The bars are not sorted in any meaningful order.

totals = last_letter_pivot['F'] + last_letter_pivot['M']

last_letter_props = pd.DataFrame({
    'F': last_letter_pivot['F'] / totals,
    'M': last_letter_pivot['M'] / totals,
}).sort_values('M')
last_letter_props.head()
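An alternative sketch of the same proportion calculation, assuming a small made-up pivot table: `DataFrame.div` with `axis=0` divides every column by the row totals at once, so the column names do not have to be written out one by one.

```python
import pandas as pd

# Hypothetical pivot table; the counts are made up.
pivot = pd.DataFrame(
    {'F': [15, 2], 'M': [5, 8]},
    index=pd.Index(['a', 'b'], name='Last letter'),
)

totals = pivot.sum(axis=1)               # one total per row (per letter)

# Divide each row by its total, then sort by the male proportion.
props = pivot.div(totals, axis=0).sort_values('M')
```

Each row of `props` sums to 1, which is what makes the bars comparable across letters.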

last_letter_props.plot.barh(figsize=(10, 10))

<matplotlib.axes._subplots.AxesSubplot at 0x112c6fac8>

What do you notice?

Seaborn¶

Let's use a subset of our dataset for now:

ca_and_ny = baby_names[
    (baby_names['Year'] == 2016)
    & (baby_names['State'].isin(['CA', 'NY']))
]
ca_and_ny.head()

We actually don't need to do any pivoting / grouping for seaborn!

sns.barplot(x=..., y=..., data=...)

Note the automatic confidence interval generation. Many seaborn functions have these nifty statistical features.

(It actually isn't useful for our case since we have a census. It also makes seaborn functions run slower since they use bootstrap to generate the CI, so sometimes you want to turn it off.)
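To give a feel for why the bootstrap is slow, here is a rough numpy sketch of a percentile bootstrap interval for a mean. This illustrates the general idea under made-up data and an arbitrary choice of 1000 resamples; it is not seaborn's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = np.array([3.0, 5.0, 7.0, 9.0, 11.0])   # made-up values

n_boot = 1000   # arbitrary number of resamples for this sketch

# Resample with replacement many times and record each resample's mean.
boot_means = np.array([
    rng.choice(sample, size=len(sample), replace=True).mean()
    for _ in range(n_boot)
])

# The middle 95% of the resampled means forms the interval.
low, high = np.percentile(boot_means, [2.5, 97.5])
```

The cost comes from repeating the resampling for every bar. In the seaborn versions I am aware of, passing `ci=None` (older releases) or `errorbar=None` (newer releases) to `sns.barplot` skips this step.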

We'll work with seaborn's built-in tips dataset just to demonstrate:

tips = sns.load_dataset("tips")
tips.head()

# Note: distplot is deprecated in newer seaborn releases in favor of histplot / displot.
sns.distplot(...)

sns.lmplot(x=..., y=..., data=...)

			Name	Count
State	Year	Sex
AK	1910	F	Mary	14
	1910	M	John	8
	1911	F	Mary	12
	1911	M	John	15
	1912	F	Mary	9
	1912	M	John	16
	1913	F	Mary	21
	1913	M	John	19
	1914	F	Mary	22
	1914	M	John	17
	1915	F	Mary	23
	1915	M	John	21
	1916	F	Mary	18
	1916	M	John	25
	1917	F	Mary	21
	1917	M	John	26
	1918	F	Mary	27
	1918	M	John	23
	1919	F	Mary	22
	1919	M	John	24
	1920	F	Mary	38
	1920	M	John	21
	1921	F	Mary	36
	1921	M	John	35
	1922	F	Mary	29
	1922	M	Robert	22
	1923	F	Mary	26
	1923	M	John	27
	1924	F	Mary	41
	1924	M	John	36
...	...	...	...	...
WY	2002	F	Madison	35
	2002	M	Ethan	52
	2003	F	Emma	38
	2003	M	Jacob	47
	2004	F	Madison	40
	2004	M	Michael	34
	2005	F	Madison	38
	2005	M	Jacob	41
	2006	F	Emily	39
	2006	M	Ethan	44
	2007	F	Madison	36
	2007	M	James	38
	2008	F	Madison	35
	2008	M	James	41
	2009	F	Isabella	36
	2009	M	Wyatt	42
	2010	F	Isabella	44
	2010	M	James	36
	2011	F	Emma	43
	2011	M	William	32
	2012	F	Emma	40
	2012	M	Liam	41
	2013	F	Sophia	42
	2013	M	Liam	33
	2014	F	Olivia	40
	2014	M	Jackson	34
	2015	F	Emma	39
	2015	M	Liam	38
	2016	F	Emma	36
	2016	M	Wyatt	46

		Count
Last letter	Sex
a	F	49128453
a	M	1585024
b	F	9666
b	M	1369244
c	F	18211

Sex	F	M
Last letter
a	49128453	1585024
b	9666	1369244
c	18211	1565621
d	564804	15423771
e	31212081	12778932

	F	M
Last letter
a	0.968746	0.031254
i	0.830335	0.169665
e	0.709510	0.290490
z	0.645210	0.354790
y	0.571341	0.428659

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4

	State	Sex	Year	Name	Count
a@gmail.com	CA	F	1910	Mary	295
b@gmail.com	CA	F	1910	Helen	239
c@gmail.com	CA	F	1910	Dorothy	220
d@gmail.com	CA	F	1910	Margaret	163
e@gmail.com	CA	F	1910	Frances	134

	State	Sex	Year	Name	Count
213461	CA	F	2016	Mia	2785
213462	CA	F	2016	Sophia	2747
213463	CA	F	2016	Emma	2592
213464	CA	F	2016	Olivia	2533
213465	CA	F	2016	Isabella	2350

	State	Sex	Year	Name	Count
675533	CA	M	1995	Daniel	5003
675534	CA	M	1995	Michael	4783
675535	CA	M	1995	Jose	4572
675536	CA	M	1995	Christopher	4096
675537	CA	M	1995	David	4029

	Count
State
AK	424852
AL	5773719
AR	3408590
AZ	3532872
CA	30115165
