DataTables, Indexes, Pandas, and Seaborn¶

Some useful (free) resources¶

Introductory:

Getting started with Python for research, a gentle introduction to Python in data-intensive research.
A Whirlwind Tour of Python, by Jake VanderPlas, another quick Python intro (with notebooks).

Core Pandas/Data Science books:

The Python Data Science Handbook, by Jake VanderPlas.
Python for Data Analysis, 2nd Edition, by Wes McKinney, creator of Pandas. Companion Notebooks
Effective Pandas, a book by Tom Augspurger, core Pandas developer.

Complementary resources:

An introduction to "Data Science", a collection of Notebooks by BIDS' Stéfan Van der Walt.
Effective Computation in Physics, by Kathryn D. Huff; Anthony Scopatz. Notebooks to accompany the book. Don't be fooled by the title, it's a great book on modern computational practices with very little that's physics-specific.

OK, let's load and configure some of our core libraries (as an aside, you can find a nice visual gallery of available matplotlib sytles here).

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

plt.style.use('fivethirtyeight')
sns.set_context("notebook")

Getting the Data¶

https://www.ssa.gov/OACT/babynames/index.html

https://www.ssa.gov/data

As we saw before, we can download data from the internet with Python, and do so only if needed:

import requests
from pathlib import Path

namesbystate_path = Path('namesbystate.zip')
data_url = 'https://www.ssa.gov/oact/babynames/state/namesbystate.zip'

if not namesbystate_path.exists():
    print('Downloading...', end=' ')
    resp = requests.get(data_url)
    with namesbystate_path.open('wb') as f:
        f.write(resp.content)
    print('Done!')

Downloading... Done!

Question 2: Most popular names in all states for each year of each gender?¶

Put all DFs together¶

Again, we'll work off our in-memory, compressed zip archive and pull the data out of it into Pandas DataFrames without ever putting it all on disk. We can see how large the compressed and uncompressed data is:

import zipfile
zf = zipfile.ZipFile(namesbystate_path, 'r')
sum(f.file_size for f in zf.filelist)/1_000_000

122.38892

sum(f.compress_size for f in zf.filelist)/1_000_000

21.568281

__/_  # divide the next-previous result by the previous one

5.674486529547719

We want a single huge dataframe containing every state's data. Let's start by reading in the dataframe for each state into a Python list of dataframes.

%%time
data_frames_for_all_states = []

field_names = ['State', 'Sex', 'Year', 'Name', 'Count']
i = 0
for f in zf.filelist:
    i += 1
    if not f.filename.endswith('.TXT'):
        continue
    if (i > 51):
        break
    with zf.open(f) as fh:
        data_frames_for_all_states.append(pd.read_csv(fh, header=None, names=field_names))

CPU times: user 3.7 s, sys: 735 ms, total: 4.44 s
Wall time: 4.47 s

Now, we create a single DataFrame by concatenating these into one:

baby_names = pd.concat(data_frames_for_all_states).reset_index(drop=True)
baby_names.tail()

baby_names.shape

(5905787, 5)

Group by state and year¶

baby_names[
    (baby_names['State'] == 'CA')
    & (baby_names['Year'] == 1995)
    & (baby_names['Sex'] == 'M')
].head()

# The lame way to build our DataFrame would be to manually write down
# the answers for all combinations of State, Year, and Sex.

%%time
baby_names.groupby('State').size().head()

CPU times: user 285 ms, sys: 62.3 ms, total: 347 ms
Wall time: 348 ms

State
AK     28084
AL    132065
AR    100157
AZ    113111
CA    374634
dtype: int64

state_counts = baby_names.loc[:, ('State', 'Count')]
state_counts.head()

sg = state_counts.groupby('State')
sg

<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x110cf8898>

state_counts.groupby('State').sum().head()

For Data 8 veterans, this is equivalent to this code from Data 8:

state_and_groups.group('State', np.sum)

In pandas, could also use agg here, yielding:

state_counts.groupby('State').agg(np.sum)

Grouping by multiple columns¶

baby_names.groupby(['State', 'Year']).size().head(3)

State  Year
AK     1910    16
       1911    11
       1912    20
dtype: int64

baby_names.groupby(['State', 'Year']).sum().head(3)

baby_names.groupby(['State', 'Year', 'Sex']).sum().head()

#%%time
def first(series):
    '''Returns the first value in the series.'''
    return series.iloc[0]

most_popular_names = baby_names.groupby(['State', 'Year', 'Sex']).agg(first)

most_popular_names.head()

As we'd expect, we get a MultiIndexed DataFrame, which we can index using [] just like our single indexed DataFrames.

most_popular_names[most_popular_names['Name'] == 'Samuel']

.loc is a bit more complicated:

most_popular_names.loc['CA', 2017, :, :]

most_popular_names.loc['CA', 1997, 'M', :]

most_popular_names.loc['CA', 1997, 'M']

Name     Daniel
Count      4452
Name: (CA, 1997, M), dtype: object

Question 3: Can I deduce birth sex from the last letter of a person’s name?¶

Compute last letter of each name¶

baby_names.head()

baby_names['Name'].apply(len).head()

0    4
1    5
2    4
3    8
4    5
Name: Name, dtype: int64

baby_names['Name'].str.len().head()

0    4
1    5
2    4
3    8
4    5
Name: Name, dtype: int64

baby_names['Name'].str[-1].head()

0    y
1    e
2    a
3    t
4    n
Name: Name, dtype: object

To add column to dataframe:

baby_names['Last letter'] = baby_names['Name'].str[-1]
baby_names.head()

Group by last letter and sex¶

letter_counts = (baby_names
                 .loc[:, ('Sex', 'Count', 'Last letter')]
                 .groupby(['Last letter', 'Sex'])
                 .sum())
letter_counts.head()

Visualize our result¶

Use .plot to get some basic plotting functionality:

# Why is this not good?
letter_counts.plot.barh(figsize=(15, 15));

Reading the docs shows me that pandas will make one set of bars for each column in my table. How do I move each sex into its own column? I have to use pivot:

# For comparison, the group above:
# letter_counts = (baby_names
#                  .loc[:, ('Sex', 'Count', 'Last letter')]
#                  .groupby(['Last letter', 'Sex'])
#                  .sum())

last_letter_pivot = baby_names.pivot_table(
    index='Last letter', # the rows (turned into index)
    columns='Sex', # the column values
    values='Count', # the field(s) to processed in each group
    aggfunc=sum, # group operation
)
last_letter_pivot.head()

Slides: GroupBy/Pivot comparison slides and Quiz¶

At this point, I highly recommend this very nice tutorial on Pivot Tables.

last_letter_pivot.plot.barh(figsize=(10, 10));

Why is this still not ideal?

Plotting raw counts
Not sorted by any order

totals = last_letter_pivot['F'] + last_letter_pivot['M']

last_letter_props = pd.DataFrame({
    'F': last_letter_pivot['F'] / totals,
    'M': last_letter_pivot['M'] / totals,
}).sort_values('M')
last_letter_props.head()

last_letter_props.plot.barh(figsize=(10, 10));

What do you notice?

	State	Sex	Year	Name	Count
5905782	WV	M	2017	Sutton	5
5905783	WV	M	2017	Sylas	5
5905784	WV	M	2017	Tatum	5
5905785	WV	M	2017	Tripp	5
5905786	WV	M	2017	Zeke	5

		Count
Last letter	Sex
a	F	49618993
a	M	1606538
b	F	10029
b	M	1389618
c	F	19264

Sex	F	M
Last letter
a	49618993	1606538
b	10029	1389618
c	19264	1582422
d	566303	15431983
e	31435817	12863704

	F	M
Last letter
a	0.968638	0.031362
i	0.823175	0.176825
e	0.709620	0.290380
z	0.638642	0.361358
y	0.572034	0.427966

	State	Sex	Year	Name	Count
685081	CA	M	1995	Daniel	5003
685082	CA	M	1995	Michael	4783
685083	CA	M	1995	Jose	4572
685084	CA	M	1995	Christopher	4098
685085	CA	M	1995	David	4029

			Name	Count
State	Year	Sex
AK	1910	F	Mary	14
	1910	M	John	8
	1911	F	Mary	12
	1911	M	John	15
	1912	F	Mary	9

	Count
State
AK	430161
AL	5815853
AR	3433745
AZ	3598468
CA	30527811