Introductory:
- Getting started with Python for research, a gentle introduction to Python in data-intensive research.
- A Whirlwind Tour of Python, by Jake VanderPlas, another quick Python intro (with notebooks).

Core Pandas/Data Science books:
- The Python Data Science Handbook, by Jake VanderPlas.
- Python for Data Analysis, 2nd Edition, by Wes McKinney, creator of Pandas. Companion notebooks are also available.
- Effective Pandas, a book by Tom Augspurger, core Pandas developer.

Complementary resources:
- An introduction to "Data Science", a collection of notebooks by BIDS' Stéfan Van der Walt.
- Effective Computation in Physics, by Kathryn D. Huff and Anthony Scopatz, with notebooks to accompany the book. Don't be fooled by the title: it's a great book on modern computational practices with very little that's physics-specific.
OK, let's load and configure some of our core libraries (as an aside, you can find a nice visual gallery of available matplotlib styles here).
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
plt.style.use('fivethirtyeight')
sns.set_context("notebook")
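As an aside, matplotlib can also list its bundled styles programmatically, so you can check what's available on your install before picking one. A small standalone sketch (using the non-interactive Agg backend so it runs anywhere):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend, safe without a display
import matplotlib.pyplot as plt

print(sorted(plt.style.available)[:5])          # a few of the bundled style names
print('fivethirtyeight' in plt.style.available)
```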
Our data comes from the US Social Security Administration's baby names database: https://www.ssa.gov/OACT/babynames/index.html
As we saw before, we can download data from the internet with Python, and do so only if needed:
import requests
from pathlib import Path
namesbystate_path = Path('namesbystate.zip')
data_url = 'https://www.ssa.gov/oact/babynames/state/namesbystate.zip'
if not namesbystate_path.exists():
    print('Downloading...', end=' ')
    resp = requests.get(data_url)
    with namesbystate_path.open('wb') as f:
        f.write(resp.content)
    print('Done!')
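This download-only-if-needed pattern generalizes to any data source. A minimal sketch of the same caching idea with the fetch step factored out (the helper name and file name are made up for illustration, and a fake fetch function stands in for `requests` so the sketch runs offline):

```python
from pathlib import Path

def fetch_if_needed(path, fetch):
    """Call fetch() for the bytes only when path is missing, then return the cached bytes."""
    path = Path(path)
    if not path.exists():
        path.write_bytes(fetch())
    return path.read_bytes()

# Fake fetch so the sketch runs offline; count how often it's actually called.
calls = []
def fake_fetch():
    calls.append(1)
    return b'data'

cache = Path('demo_cache.bin')
print(fetch_if_needed(cache, fake_fetch))
print(fetch_if_needed(cache, fake_fetch))
print(len(calls))  # the fetch ran only once; the second call hit the cache
cache.unlink()     # clean up the demo file
```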
Again, we'll work directly off the compressed zip archive, pulling the data out of it into Pandas DataFrames without ever writing the uncompressed files to disk. We can see how large the compressed and uncompressed data are:
import zipfile
zf = zipfile.ZipFile(namesbystate_path, 'r')
sum(f.file_size for f in zf.filelist)/1_000_000  # uncompressed size, in MB
sum(f.compress_size for f in zf.filelist)/1_000_000  # compressed size, in MB
__/_  # in IPython, `_` is the previous output and `__` the one before it, so this is the compression ratio
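Outside IPython, `_` and `__` aren't available, but the same ratio can be computed with ordinary variables. A self-contained sketch using a throwaway in-memory zip (the contents are made up just to make the example runnable):

```python
import io
import zipfile

# Build a small in-memory zip so the example is self-contained.
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w', zipfile.ZIP_DEFLATED) as zf_demo:
    zf_demo.writestr('demo.txt', 'hello world\n' * 1000)

zf_demo = zipfile.ZipFile(buf)
uncompressed = sum(f.file_size for f in zf_demo.filelist)
compressed = sum(f.compress_size for f in zf_demo.filelist)
ratio = uncompressed / compressed  # how much DEFLATE shrank the data
print(round(ratio, 1))
```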
We want a single huge dataframe containing every state's data. Let's start by reading in the dataframe for each state into a Python list of dataframes.
%%time
data_frames_for_all_states = []
field_names = ['State', 'Sex', 'Year', 'Name', 'Count']
i = 0
for f in zf.filelist:
    i += 1
    if not f.filename.endswith('.TXT'):
        continue
    if i > 51:
        break
    with zf.open(f) as fh:
        data_frames_for_all_states.append(
            pd.read_csv(fh, header=None, names=field_names))
Now, we create a single DataFrame by concatenating these into one:
baby_names = pd.concat(data_frames_for_all_states).reset_index(drop=True)
baby_names.tail()
baby_names.shape
baby_names[
(baby_names['State'] == 'CA')
& (baby_names['Year'] == 1995)
& (baby_names['Sex'] == 'M')
].head()
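The row selection above combines element-wise boolean masks with `&` (note the parentheses around each comparison, which `&`'s precedence requires). On a tiny made-up frame the mechanics are easy to see:

```python
import pandas as pd

toy = pd.DataFrame({
    'State': ['CA', 'CA', 'NY'],
    'Year': [1995, 1996, 1995],
    'Sex': ['M', 'M', 'M'],
})
# Each comparison yields a boolean Series; & combines them element-wise.
mask = (toy['State'] == 'CA') & (toy['Year'] == 1995)
print(toy[mask])  # only the rows where both conditions hold
```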
# The lame way to build our DataFrame would be to manually write down
# the answers for all combinations of State, Year, and Sex.
%%time
baby_names.groupby('State').size().head()
state_counts = baby_names.loc[:, ('State', 'Count')]
state_counts.head()
sg = state_counts.groupby('State')
sg
state_counts.groupby('State').sum().head()
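The split-apply-combine pattern behind `groupby` is easiest to see on a tiny made-up frame:

```python
import pandas as pd

toy = pd.DataFrame({
    'State': ['AK', 'AK', 'AL', 'AL'],
    'Count': [10, 5, 7, 3],
})
# groupby splits the rows by State; sum combines each group's Counts.
print(toy.groupby('State').sum())
```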
For Data 8 veterans, this is equivalent to this code from Data 8:
state_and_groups.group('State', np.sum)
In Pandas, we could also use agg here, yielding the same result:
state_counts.groupby('State').agg(np.sum)
baby_names.groupby(['State', 'Year']).size().head(3)
baby_names.groupby(['State', 'Year']).sum().head(3)
baby_names.groupby(['State', 'Year', 'Sex']).sum().head()
#%%time
def first(series):
'''Returns the first value in the series.'''
return series.iloc[0]
most_popular_names = baby_names.groupby(['State', 'Year', 'Sex']).agg(first)
most_popular_names.head()
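Note that `first` picks out the *most popular* name only because each group's rows arrive sorted by descending Count, as they are in the SSA files. A toy illustration of that assumption with made-up, pre-sorted rows:

```python
import pandas as pd

def first(series):
    '''Returns the first value in the series.'''
    return series.iloc[0]

toy = pd.DataFrame({
    'Year': [1995, 1995, 1995],
    'Name': ['Daniel', 'Michael', 'Kevin'],
    'Count': [4000, 3500, 3000],  # pre-sorted descending, like the SSA files
})
# agg(first) applies first() to each remaining column within each group.
print(toy.groupby('Year').agg(first))
```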
As we'd expect, we get a MultiIndexed DataFrame, which we can index using [] just like our single indexed DataFrames.
most_popular_names[most_popular_names['Name'] == 'Samuel']
Indexing with .loc is a bit more complicated:
most_popular_names.loc['CA', 2017, :, :]
most_popular_names.loc['CA', 1997, 'M', :]
most_popular_names.loc['CA', 1997, 'M']
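On a MultiIndexed frame, `.loc` takes one label per index level. A toy sketch (the names and values are made up):

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('CA', 1997, 'M'), ('CA', 1997, 'F'), ('NY', 1997, 'M')],
    names=['State', 'Year', 'Sex'])
toy = pd.DataFrame({'Name': ['Daniel', 'Emily', 'Michael']}, index=idx)

# One label per level selects a single row; adding a column label gives a scalar.
print(toy.loc[('CA', 1997, 'M'), 'Name'])
```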
baby_names.head()
baby_names['Name'].apply(len).head()
baby_names['Name'].str.len().head()
baby_names['Name'].str[-1].head()
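The `.str` accessor applies string operations element-wise across a Series; a minimal illustration on a throwaway Series:

```python
import pandas as pd

names = pd.Series(['Mary', 'Bob', 'Elizabeth'])
print(names.str.len().tolist())   # [4, 3, 9]
print(names.str[-1].tolist())     # ['y', 'b', 'h'] -- last letter of each name
```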
To add a column to the dataframe:
baby_names['Last letter'] = baby_names['Name'].str[-1]
baby_names.head()
letter_counts = (baby_names
.loc[:, ('Sex', 'Count', 'Last letter')]
.groupby(['Last letter', 'Sex'])
.sum())
letter_counts.head()
Use .plot to get some basic plotting functionality:
# Why is this not good?
letter_counts.plot.barh(figsize=(15, 15));
Reading the docs shows me that pandas will make one set of bars for each column in my table. How do I move each sex into its own column? I can use a pivot table:
# For comparison, the group above:
# letter_counts = (baby_names
# .loc[:, ('Sex', 'Count', 'Last letter')]
# .groupby(['Last letter', 'Sex'])
# .sum())
last_letter_pivot = baby_names.pivot_table(
    index='Last letter',  # the rows (turned into the index)
    columns='Sex',        # the column values
    values='Count',       # the field(s) to process in each group
    aggfunc=sum,          # the group operation
)
last_letter_pivot.head()
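Under the hood, this pivot table is essentially the earlier groupby followed by unstacking the Sex level into columns. A toy equivalence check on made-up rows:

```python
import pandas as pd

toy = pd.DataFrame({
    'Sex': ['F', 'M', 'F', 'M'],
    'Last letter': ['a', 'a', 'n', 'n'],
    'Count': [10, 2, 3, 9],
})

# Route 1: groupby on both keys, then pivot Sex out into columns.
via_groupby = (toy.groupby(['Last letter', 'Sex'])['Count']
                  .sum()
                  .unstack('Sex'))
# Route 2: pivot_table does the grouping and unstacking in one call.
via_pivot = toy.pivot_table(index='Last letter', columns='Sex',
                            values='Count', aggfunc='sum')
print(via_pivot)
print((via_groupby == via_pivot).all().all())
```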
At this point, I highly recommend this very nice tutorial on Pivot Tables.
last_letter_pivot.plot.barh(figsize=(10, 10));
Why is this still not ideal?
totals = last_letter_pivot['F'] + last_letter_pivot['M']
last_letter_props = pd.DataFrame({
'F': last_letter_pivot['F'] / totals,
'M': last_letter_pivot['M'] / totals,
}).sort_values('M')
last_letter_props.head()
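Equivalently, the whole pivot can be normalized in one step by dividing each row by its total with `div(axis=0)`; a sketch on made-up numbers:

```python
import pandas as pd

toy = pd.DataFrame({'F': [10.0, 3.0], 'M': [2.0, 9.0]},
                   index=pd.Index(['a', 'n'], name='Last letter'))
# Divide each row by its row total, so F and M become proportions summing to 1.
props = toy.div(toy.sum(axis=1), axis=0)
print(props)
```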
last_letter_props.plot.barh(figsize=(10, 10));
What do you notice?