Introductory:
Getting started with Python for research, a gentle introduction to Python in data-intensive research.
A Whirlwind Tour of Python, by Jake VanderPlas, another quick Python intro (with notebooks).
Core Pandas/Data Science books:
The Python Data Science Handbook, by Jake VanderPlas.
Python for Data Analysis, 2nd Edition, by Wes McKinney, creator of Pandas. Companion Notebooks
Effective Pandas, a book by Tom Augspurger, core Pandas developer.
Complementary resources:
An introduction to "Data Science", a collection of Notebooks by BIDS' Stéfan Van der Walt.
Effective Computation in Physics, by Kathryn D. Huff and Anthony Scopatz. Notebooks to accompany the book. Don't be fooled by the title: it's a great book on modern computational practices with very little that's physics-specific.
OK, let's load and configure some of our core libraries (as an aside, you can find a nice visual gallery of available matplotlib styles here).
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
plt.style.use('fivethirtyeight')
sns.set_context("notebook")
https://www.ssa.gov/OACT/babynames/index.html
As we saw before, we can download data from the internet with Python, and do so only if needed:
import requests
from pathlib import Path
namesbystate_path = Path('namesbystate.zip')
data_url = 'https://www.ssa.gov/oact/babynames/state/namesbystate.zip'
if not namesbystate_path.exists():
    print('Downloading...', end=' ')
    resp = requests.get(data_url)
    resp.raise_for_status()  # stop here rather than save an error page as the zip
    with namesbystate_path.open('wb') as f:
        f.write(resp.content)
    print('Done!')
Let's use Python to understand how this data is laid out:
import zipfile
zf = zipfile.ZipFile(namesbystate_path, 'r')
print([f.filename for f in zf.filelist])
We can pull the PDF readme to view it, but let's operate with the rest of the data in its compressed state:
zf.extract('StateReadMe.pdf')
Let's have a look at the California data, which should give us an idea of how the whole dataset is structured:
ca_name = 'CA.TXT'
with zf.open(ca_name) as f:
    for i in range(10):
        print(f.readline().rstrip().decode())
This is equivalent (on macOS or Linux) to extracting the full CA.TXT file to disk and then using the head command (if you're on Windows, don't try to run the cell below):
zf.extract(ca_name)
!head {ca_name}
!cat /tmp/environment.yml
!echo {ca_name}
A couple of practical comments:
The above uses special tricks in IPython that let you call operating system commands via !cmd, and that expand Python variables in such commands with the {var} syntax. You can find more about IPython's special tricks in this tutorial.
head doesn't work on Windows, though there are equivalent Windows commands. But by using Python code, even if it's a little bit more verbose, we have a 100% portable solution.
If the CA.TXT file were huge, it would be wasteful to write it all to disk only to look at the start of the file.
The last point is an important and general theme of this course: we need to learn how to operate with data only on an as-needed basis, because there are many situations in the real world where we can't afford to brute-force 'download all the things'.
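As a minimal sketch of this as-needed idea, the snippet below peeks at the first few lines of one archive member without writing anything to disk. The helper name peek_lines and the tiny in-memory archive with its sample rows are made up for illustration; with the real data you would pass namesbystate_path and 'CA.TXT' instead.

```python
import io
import zipfile
from itertools import islice


def peek_lines(zip_source, member, n=5):
    """Return the first n lines of one archive member as decoded strings."""
    with zipfile.ZipFile(zip_source) as archive:
        with archive.open(member) as fh:
            # islice stops iterating after n lines, so the rest of the
            # member is never read into Python.
            return [line.decode().rstrip() for line in islice(fh, n)]


# Build a small in-memory archive (hypothetical rows) so the sketch runs on its own.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr(
        'DEMO.TXT',
        'CA,F,1910,Mary,295\nCA,F,1910,Helen,239\nCA,F,1910,Dorothy,220\n',
    )

first_two = peek_lines(buffer, 'DEMO.TXT', n=2)
print(first_two)
```

Unlike !head, this runs unchanged on Windows, macOS and Linux.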
Let's remove the CA.TXT file to make sure we keep working with our compressed data, as if we couldn't extract it:
import os; os.unlink(ca_name)
import pandas as pd
field_names = ['State', 'Sex', 'Year', 'Name', 'Count']
with zf.open(ca_name) as fh:
    ca = pd.read_csv(fh, header=None, names=field_names)
ca.head()
Let's play around a bit with our indexing techniques from earlier today.
ca['Count'].head()
ca[0:3]
#ca[0]
ca.iloc[:3, -2:]
ca.loc[0:3, 'State']
ca['Name'].head()
ca[['Name']].head()
ca[ca['Year'] == 2017].tail()
ca.head()
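The indexing expressions above differ in ways that are easy to miss, so here is a small sketch that makes the differences explicit. The toy table and its counts are invented for illustration; only the column names follow the California data.

```python
import pandas as pd

# Hypothetical toy rows in the same column layout as the California table.
toy = pd.DataFrame({
    'State': ['CA'] * 5,
    'Sex': ['F', 'F', 'F', 'M', 'M'],
    'Year': [2016, 2017, 2017, 2017, 2017],
    'Name': ['Emma', 'Emma', 'Mia', 'Noah', 'Liam'],
    'Count': [2700, 2726, 2588, 2511, 2400],
})

by_position = toy.iloc[0:3]   # positions 0, 1, 2: the end is excluded
by_label = toy.loc[0:3]       # labels 0, 1, 2, 3: the end is included
as_series = toy['Name']       # single brackets return a Series
as_frame = toy[['Name']]      # a list of column names returns a DataFrame
mask = toy['Year'] == 2017    # boolean Series with one entry per row
only_2017 = toy[mask]         # keeps the rows where the mask is True

print(len(by_position), len(by_label))
print(type(as_series).__name__, type(as_frame).__name__)

# A bare key is looked up among the column labels, which is why the
# commented-out ca[0] above would fail on a table with named columns.
try:
    toy[0]
except KeyError:
    print('toy[0] raises KeyError: there is no column labelled 0')
```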
We can get a sense for the shape of our data:
ca.shape
ca.size  # total number of elements: rows * columns
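To make the difference between the two attributes concrete, here is a tiny sketch on an invented three-row table: shape is a (rows, columns) tuple, while size is the product of the two.

```python
import pandas as pd

# Hypothetical toy table: 3 rows, 2 columns.
toy = pd.DataFrame({'Name': ['Emma', 'Mia', 'Noah'], 'Count': [10, 20, 30]})

n_rows, n_cols = toy.shape
print(toy.shape)   # (3, 2)
print(toy.size)    # 6
print(len(toy))    # 3, the number of rows
```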
Pandas will give us a summary overview of the numerical data in the DataFrame:
ca.describe()
And let's look at the structure of the DataFrame:
ca.index
What we've done so far is NOT exploratory data analysis. We were just playing around a bit with the capabilities of the pandas library. Now that we're done, let's turn to the problem at hand: identifying the most common name in California in 2017.
ca2017 = ca[ca['Year'] == 2017]
ca_sorted = ca2017.sort_values('Count', ascending=False).head(10)
ca_sorted
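Because the table has a Sex column, a name given to both girls and boys appears on two rows for the same year, so the top row of the sorted table is the most common name within one sex. The sketch below, on invented toy data, shows nlargest as a shorter spelling of sort-then-head and one possible way to add up the counts across sexes before ranking. It is an illustration under those assumptions, not necessarily the approach this course takes next.

```python
import pandas as pd

# Hypothetical toy rows; 'Riley' appears under both sexes in 2017.
toy = pd.DataFrame({
    'Sex': ['F', 'F', 'M', 'M', 'M'],
    'Year': [2017, 2017, 2017, 2017, 2016],
    'Name': ['Emma', 'Riley', 'Noah', 'Riley', 'Noah'],
    'Count': [50, 30, 60, 40, 99],
})

toy2017 = toy[toy['Year'] == 2017]

# On this data, the same rows as sort_values('Count', ascending=False).head(2).
top_rows = toy2017.nlargest(2, 'Count')

# Sum the counts of each name across both sexes, then rank the totals.
combined = toy2017.groupby('Name')['Count'].sum().sort_values(ascending=False)

print(top_rows)
print(combined)
```

On the toy data the per-row ranking puts Noah first, while the combined ranking puts Riley first, which is why it matters which question is being asked.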