DataTables, Indexes, Pandas, and Seaborn

Some useful (free) resources

Introductory:

Core Pandas/Data Science books:

Complementary resources:

OK, let's load and configure some of our core libraries (as an aside, you can find a nice visual gallery of available matplotlib sytles here).

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

plt.style.use('fivethirtyeight')
sns.set_context("notebook")

Getting the Data

https://www.ssa.gov/OACT/babynames/index.html

https://www.ssa.gov/data

As we saw before, we can download data from the internet with Python, and do so only if needed:

In [2]:
import requests
from pathlib import Path

namesbystate_path = Path('namesbystate.zip')
data_url = 'https://www.ssa.gov/oact/babynames/state/namesbystate.zip'

if not namesbystate_path.exists():
    print('Downloading...', end=' ')
    resp = requests.get(data_url)
    with namesbystate_path.open('wb') as f:
        f.write(resp.content)
    print('Done!')
Downloading... Done!

Put all DFs together

Again, we'll work off our in-memory, compressed zip archive and pull the data out of it into Pandas DataFrames without ever putting it all on disk. We can see how large the compressed and uncompressed data is:

In [3]:
import zipfile
zf = zipfile.ZipFile(namesbystate_path, 'r')
sum(f.file_size for f in zf.filelist)/1_000_000
Out[3]:
122.38892
In [4]:
sum(f.compress_size for f in zf.filelist)/1_000_000
Out[4]:
21.568281
In [5]:
__/_  # divide the next-previous result by the previous one
Out[5]:
5.674486529547719

We want a single huge dataframe containing every state's data. Let's start by reading in the dataframe for each state into a Python list of dataframes.

In [6]:
%%time
data_frames_for_all_states = []

field_names = ['State', 'Sex', 'Year', 'Name', 'Count']
i = 0
for f in zf.filelist:
    i += 1
    if not f.filename.endswith('.TXT'):
        continue
    if (i > 51):
        break
    with zf.open(f) as fh:
        data_frames_for_all_states.append(pd.read_csv(fh, header=None, names=field_names))
CPU times: user 3.7 s, sys: 735 ms, total: 4.44 s
Wall time: 4.47 s

Now, we create a single DataFrame by concatenating these into one:

In [7]:
baby_names = pd.concat(data_frames_for_all_states).reset_index(drop=True)
baby_names.tail()
Out[7]:
State Sex Year Name Count
5905782 WV M 2017 Sutton 5
5905783 WV M 2017 Sylas 5
5905784 WV M 2017 Tatum 5
5905785 WV M 2017 Tripp 5
5905786 WV M 2017 Zeke 5
In [8]:
baby_names.shape
Out[8]:
(5905787, 5)

Group by state and year

In [9]:
baby_names[
    (baby_names['State'] == 'CA')
    & (baby_names['Year'] == 1995)
    & (baby_names['Sex'] == 'M')
].head()

# The lame way to build our DataFrame would be to manually write down
# the answers for all combinations of State, Year, and Sex.
Out[9]:
State Sex Year Name Count
685081 CA M 1995 Daniel 5003
685082 CA M 1995 Michael 4783
685083 CA M 1995 Jose 4572
685084 CA M 1995 Christopher 4098
685085 CA M 1995 David 4029
In [10]:
%%time
baby_names.groupby('State').size().head()
CPU times: user 285 ms, sys: 62.3 ms, total: 347 ms
Wall time: 348 ms
Out[10]:
State
AK     28084
AL    132065
AR    100157
AZ    113111
CA    374634
dtype: int64
In [11]:
state_counts = baby_names.loc[:, ('State', 'Count')]
state_counts.head()
Out[11]:
State Count
0 AK 14
1 AK 12
2 AK 10
3 AK 8
4 AK 7
In [12]:
sg = state_counts.groupby('State')
sg
Out[12]:
<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x110cf8898>
In [13]:
state_counts.groupby('State').sum().head()
Out[13]:
Count
State
AK 430161
AL 5815853
AR 3433745
AZ 3598468
CA 30527811

For Data 8 veterans, this is equivalent to this code from Data 8:

state_and_groups.group('State', np.sum)

In pandas, could also use agg here, yielding:

state_counts.groupby('State').agg(np.sum)

Grouping by multiple columns

In [14]:
baby_names.groupby(['State', 'Year']).size().head(3)
Out[14]:
State  Year
AK     1910    16
       1911    11
       1912    20
dtype: int64
In [15]:
baby_names.groupby(['State', 'Year']).sum().head(3)
Out[15]:
Count
State Year
AK 1910 115
1911 84
1912 141
In [16]:
baby_names.groupby(['State', 'Year', 'Sex']).sum().head()
Out[16]:
Count
State Year Sex
AK 1910 F 68
M 47
1911 F 44
M 40
1912 F 82
In [17]:
#%%time
def first(series):
    '''Returns the first value in the series.'''
    return series.iloc[0]

most_popular_names = baby_names.groupby(['State', 'Year', 'Sex']).agg(first)

most_popular_names.head()
Out[17]:
Name Count
State Year Sex
AK 1910 F Mary 14
M John 8
1911 F Mary 12
M John 15
1912 F Mary 9

As we'd expect, we get a MultiIndexed DataFrame, which we can index using [] just like our single indexed DataFrames.

In [18]:
most_popular_names[most_popular_names['Name'] == 'Samuel']
Out[18]:
Name Count
State Year Sex
ID 2010 M Samuel 114

.loc is a bit more complicated:

In [19]:
most_popular_names.loc['CA', 2017, :, :]
Out[19]:
Name Count
State Year Sex
CA 2017 F Emma 2726
M Noah 2511
In [20]:
most_popular_names.loc['CA', 1997, 'M', :]
Out[20]:
Name Count
State Year Sex
CA 1997 M Daniel 4452
In [21]:
most_popular_names.loc['CA', 1997, 'M']
Out[21]:
Name     Daniel
Count      4452
Name: (CA, 1997, M), dtype: object

Question 3: Can I deduce birth sex from the last letter of a person’s name?

Compute last letter of each name

In [22]:
baby_names.head()
Out[22]:
State Sex Year Name Count
0 AK F 1910 Mary 14
1 AK F 1910 Annie 12
2 AK F 1910 Anna 10
3 AK F 1910 Margaret 8
4 AK F 1910 Helen 7
In [23]:
baby_names['Name'].apply(len).head()
Out[23]:
0    4
1    5
2    4
3    8
4    5
Name: Name, dtype: int64
In [24]:
baby_names['Name'].str.len().head()
Out[24]:
0    4
1    5
2    4
3    8
4    5
Name: Name, dtype: int64
In [25]:
baby_names['Name'].str[-1].head()
Out[25]:
0    y
1    e
2    a
3    t
4    n
Name: Name, dtype: object

To add column to dataframe:

In [26]:
baby_names['Last letter'] = baby_names['Name'].str[-1]
baby_names.head()
Out[26]:
State Sex Year Name Count Last letter
0 AK F 1910 Mary 14 y
1 AK F 1910 Annie 12 e
2 AK F 1910 Anna 10 a
3 AK F 1910 Margaret 8 t
4 AK F 1910 Helen 7 n

Group by last letter and sex

In [27]:
letter_counts = (baby_names
                 .loc[:, ('Sex', 'Count', 'Last letter')]
                 .groupby(['Last letter', 'Sex'])
                 .sum())
letter_counts.head()
Out[27]:
Count
Last letter Sex
a F 49618993
M 1606538
b F 10029
M 1389618
c F 19264

Visualize our result

Use .plot to get some basic plotting functionality:

In [28]:
# Why is this not good?
letter_counts.plot.barh(figsize=(15, 15));

Reading the docs shows me that pandas will make one set of bars for each column in my table. How do I move each sex into its own column? I have to use pivot:

In [29]:
# For comparison, the group above:
# letter_counts = (baby_names
#                  .loc[:, ('Sex', 'Count', 'Last letter')]
#                  .groupby(['Last letter', 'Sex'])
#                  .sum())

last_letter_pivot = baby_names.pivot_table(
    index='Last letter', # the rows (turned into index)
    columns='Sex', # the column values
    values='Count', # the field(s) to processed in each group
    aggfunc=sum, # group operation
)
last_letter_pivot.head()
Out[29]:
Sex F M
Last letter
a 49618993 1606538
b 10029 1389618
c 19264 1582422
d 566303 15431983
e 31435817 12863704

Slides: GroupBy/Pivot comparison slides and Quiz

At this point, I highly recommend this very nice tutorial on Pivot Tables.

In [30]:
last_letter_pivot.plot.barh(figsize=(10, 10));

Why is this still not ideal?

  • Plotting raw counts
  • Not sorted by any order
In [31]:
totals = last_letter_pivot['F'] + last_letter_pivot['M']

last_letter_props = pd.DataFrame({
    'F': last_letter_pivot['F'] / totals,
    'M': last_letter_pivot['M'] / totals,
}).sort_values('M')
last_letter_props.head()
Out[31]:
F M
Last letter
a 0.968638 0.031362
i 0.823175 0.176825
e 0.709620 0.290380
z 0.638642 0.361358
y 0.572034 0.427966
In [32]:
last_letter_props.plot.barh(figsize=(10, 10));

What do you notice?