DataTables, Indexes, Pandas, and Seaborn¶

Some useful (free) resources¶

Introductory:

Getting started with Python for research, a gentle introduction to Python in data-intensive research.
A Whirlwind Tour of Python, by Jake VanderPlas, another quick Python intro (with notebooks).

Core Pandas/Data Science books:

The Python Data Science Handbook, by Jake VanderPlas.
Python for Data Analysis, 2nd Edition, by Wes McKinney, creator of Pandas. Companion Notebooks
Effective Pandas, a book by Tom Augspurger, core Pandas developer.

Complementary resources:

An introduction to "Data Science", a collection of Notebooks by BIDS' Stéfan van der Walt.
Effective Computation in Physics, by Kathryn D. Huff and Anthony Scopatz. Notebooks to accompany the book. Don't be fooled by the title: it's a great book on modern computational practices with very little that's physics-specific.

OK, let's load and configure some of our core libraries (as an aside, you can find a nice visual gallery of available matplotlib styles here).

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

plt.style.use('fivethirtyeight')
sns.set_context("notebook")

Getting the Data¶

https://www.ssa.gov/OACT/babynames/index.html

https://www.ssa.gov/data

As we saw before, we can download data from the internet with Python, and do so only if needed:

import requests
from pathlib import Path

namesbystate_path = Path('namesbystate.zip')
data_url = 'https://www.ssa.gov/oact/babynames/state/namesbystate.zip'

if not namesbystate_path.exists():
    print('Downloading...', end=' ')
    resp = requests.get(data_url)
    resp.raise_for_status()  # stop here rather than save an HTTP error page as the zip
    with namesbystate_path.open('wb') as f:
        f.write(resp.content)
    print('Done!')
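
The "only if needed" check above has one gap: an interrupted download can leave a truncated namesbystate.zip on disk, and the exists() test would then treat it as complete. Below is a minimal sketch of one way to guard against that. The helper name download_if_missing, its fetch parameter, and the '.part' suffix are all assumptions made for illustration; they are not part of requests or of this notebook.

```python
from pathlib import Path


def fetch_bytes(url):
    """Download url with requests, raising on HTTP error statuses."""
    import requests  # imported here so the caching logic can be tried offline
    resp = requests.get(url, timeout=60)
    resp.raise_for_status()
    return resp.content


def download_if_missing(url, dest, fetch=fetch_bytes):
    """Fetch url into dest unless dest already exists.

    Returns True if a download happened, False if the cached file was reused.
    """
    dest = Path(dest)
    if dest.exists():
        return False
    tmp = dest.with_name(dest.name + '.part')  # hypothetical temp-file suffix
    tmp.write_bytes(fetch(url))
    tmp.replace(dest)  # rename only after the whole payload is on disk
    return True
```

With this helper, the cell above would reduce to download_if_missing(data_url, namesbystate_path). Passing a different fetch callable makes it possible to exercise the caching logic without network access.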

Let's use Python to understand how this data is laid out:

import zipfile
zf = zipfile.ZipFile(namesbystate_path, 'r')
print(zf.namelist())

['AK.TXT', 'AL.TXT', 'AR.TXT', 'AZ.TXT', 'CA.TXT', 'CO.TXT', 'CT.TXT', 'DC.TXT', 'DE.TXT', 'FL.TXT', 'GA.TXT', 'HI.TXT', 'IA.TXT', 'ID.TXT', 'IL.TXT', 'IN.TXT', 'KS.TXT', 'KY.TXT', 'LA.TXT', 'MA.TXT', 'MD.TXT', 'ME.TXT', 'MI.TXT', 'MN.TXT', 'MO.TXT', 'MS.TXT', 'MT.TXT', 'NC.TXT', 'ND.TXT', 'NE.TXT', 'NH.TXT', 'NJ.TXT', 'NM.TXT', 'NV.TXT', 'NY.TXT', 'OH.TXT', 'OK.TXT', 'OR.TXT', 'PA.TXT', 'RI.TXT', 'SC.TXT', 'SD.TXT', 'StateReadMe.pdf', 'TN.TXT', 'TX.TXT', 'UT.TXT', 'VA.TXT', 'VT.TXT', 'WA.TXT', 'WI.TXT', 'WV.TXT', 'WY.TXT']
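
The listing shows one TXT file per state plus a PDF readme. As a hedged sketch of how to go one step further without extracting anything, the snippet below filters the data files by extension and reads the sizes that zipfile records for each member. It builds a tiny in-memory archive (an assumption, so the example runs without the real download); the member contents are placeholders.

```python
import io
import zipfile

# Build a tiny in-memory archive so the example runs without the real download.
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w', compression=zipfile.ZIP_DEFLATED) as demo:
    demo.writestr('CA.TXT', 'CA,F,1910,Mary,295\n' * 1000)
    demo.writestr('StateReadMe.pdf', b'%PDF- placeholder bytes')

with zipfile.ZipFile(buf) as demo:
    # Keep only the per-state data files, skipping the PDF readme.
    data_files = [name for name in demo.namelist()
                  if name.upper().endswith('.TXT')]
    # Each ZipInfo records both sizes, so nothing has to be extracted.
    sizes = {info.filename: (info.compress_size, info.file_size)
             for info in demo.infolist()}

print(data_files)
for name, (compressed, full) in sizes.items():
    print(f'{name}: {compressed} bytes compressed, {full} bytes uncompressed')
```

Run against the real zf object, the same two comprehensions would tell us how large each state file is before we decide to read it.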

We can pull the PDF readme to view it, but let's operate with the rest of the data in its compressed state:

zf.extract('StateReadMe.pdf')

'/Users/simonmo/Downloads/lec02/StateReadMe.pdf'

Let's have a look at the California data; it should give us an idea about the structure of the whole thing:

ca_name = 'CA.TXT'
with zf.open(ca_name) as f:
    for i in range(10):
        print(f.readline().rstrip().decode())

CA,F,1910,Mary,295
CA,F,1910,Helen,239
CA,F,1910,Dorothy,220
CA,F,1910,Margaret,163
CA,F,1910,Frances,134
CA,F,1910,Ruth,128
CA,F,1910,Evelyn,126
CA,F,1910,Alice,118
CA,F,1910,Virginia,101
CA,F,1910,Elizabeth,93
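
The readline loop above can be wrapped into a small reusable function. This is only a sketch: zip_head is a hypothetical name, and the UTF-8 default is an assumption that matches the bare .decode() call used above. It decodes the member as text and stops after n lines, so the rest of the file is never read.

```python
import io
import zipfile
from itertools import islice


def zip_head(archive, member, n=10, encoding='utf-8'):
    """Return the first n lines of a text member without extracting it."""
    with io.TextIOWrapper(archive.open(member), encoding=encoding) as text:
        # islice stops after n lines instead of reading the whole member.
        return [line.rstrip('\n') for line in islice(text, n)]


# Tiny in-memory archive holding three of the rows shown above.
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as demo:
    demo.writestr('CA.TXT',
                  'CA,F,1910,Mary,295\n'
                  'CA,F,1910,Helen,239\n'
                  'CA,F,1910,Dorothy,220\n')

with zipfile.ZipFile(buf) as demo:
    first_two = zip_head(demo, 'CA.TXT', n=2)
print(first_two)
```

With the real archive, zip_head(zf, ca_name) should give the same ten lines as the loop above.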

This is equivalent (on macOS or Linux) to extracting the full CA.TXT file to disk and then using the head command (if you're on Windows, don't try to run the cell below):

zf.extract(ca_name)
!head {ca_name}

!echo {ca_name}

CA.TXT

A couple of practical comments:

The above is using special tricks in IPython that let you call operating system commands via !cmd, and that expand Python variables in such commands with the {var} syntax. You can find more about IPython's special tricks in this tutorial.
head doesn't work on Windows, though there are equivalent Windows commands. But by using Python code, even if it's a little bit more verbose, we have a 100% portable solution.
If the CA.TXT file were huge, it would be wasteful to write it all to disk only to look at the start of the file.

The last point is an important and general theme of this course: we need to learn how to operate with data only on an as-needed basis, because there are many situations in the real world where we can't afford to brute-force 'download all the things'.

Let's remove the CA.TXT file to make sure we keep working with our compressed data, as if we couldn't extract it:

import os
os.unlink(ca_name)  # delete the extracted copy; the zip archive is untouched

Question 1: What was the most popular name in CA last year?¶

import pandas as pd

field_names = ['State', 'Sex', 'Year', 'Name', 'Count']
with zf.open(ca_name) as fh:
    ca = pd.read_csv(fh, header=None, names=field_names)
ca.head()
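
In the spirit of touching data only as needed: when a preview is all we want, read_csv can stop early through its nrows parameter instead of parsing the whole member. The sketch below uses a small in-memory archive (an assumption, so it runs without the real download) holding four of the rows shown earlier.

```python
import io
import zipfile

import pandas as pd

# Tiny stand-in for namesbystate.zip, built in memory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as demo:
    demo.writestr('CA.TXT',
                  'CA,F,1910,Mary,295\n'
                  'CA,F,1910,Helen,239\n'
                  'CA,F,1910,Dorothy,220\n'
                  'CA,F,1910,Margaret,163\n')

field_names = ['State', 'Sex', 'Year', 'Name', 'Count']
with zipfile.ZipFile(buf) as demo, demo.open('CA.TXT') as fh:
    # nrows=3 makes pandas stop after three data rows.
    preview = pd.read_csv(fh, header=None, names=field_names, nrows=3)
print(preview)
```

For CA.TXT the full read is cheap, so this matters mostly for files too large to load comfortably.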

Indexing Review¶

Let's play around a bit with our indexing techniques from earlier today.

ca['Count'].head()

0    295
1    239
2    220
3    163
4    134
Name: Count, dtype: int64

ca[0:3]

# ca[0]  # would raise KeyError: a bare integer is looked up as a column label, not a row

ca.iloc[:3, -2:]

ca.loc[0:3, 'State']

0    CA
1    CA
2    CA
3    CA
Name: State, dtype: object

ca['Name'].head()

0        Mary
1       Helen
2     Dorothy
3    Margaret
4     Frances
Name: Name, dtype: object

ca[['Name']].head()

ca[ca['Year'] == 2017].tail()
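
The cells in this review differ in subtle ways that are easy to miss, most notably that a .loc slice includes its end label while [] and .iloc slices exclude their end position. The sketch below replays each form on a five-row stand-in for ca (the name demo and the choice of rows are assumptions, copied from values that appear in this notebook).

```python
import pandas as pd

demo = pd.DataFrame({
    'State': ['CA'] * 5,
    'Sex': ['F'] * 5,
    'Year': [1910, 1910, 1910, 2017, 2017],
    'Name': ['Mary', 'Helen', 'Dorothy', 'Emma', 'Mia'],
    'Count': [295, 239, 220, 2726, 2588],
})

by_position = demo[0:3]                # slice in []: rows at positions 0, 1, 2
by_iloc = demo.iloc[:3, -2:]           # positions: first 3 rows, last 2 columns
by_label = demo.loc[0:3, 'State']      # labels: rows 0 through 3, end included
as_series = demo['Name']               # one label in [] gives a Series
as_frame = demo[['Name']]              # a list of labels gives a DataFrame
in_2017 = demo[demo['Year'] == 2017]   # boolean mask keeps the matching rows

print(len(by_position), len(by_label))
print(type(as_series).__name__, type(as_frame).__name__)
print(in_2017)
```

This matches what we saw above: ca.loc[0:3, 'State'] printed four values (labels 0 to 3), whereas ca[0:3] selects three rows.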

Understanding the Data¶

ca.head()

We can get a sense for the shape of our data:

ca.shape

(374634, 5)

ca.size  # rows x columns

1873170

Pandas will give us a summary overview of the numerical data in the DataFrame:

ca.describe()

And let's look at the structure of the DataFrame:

ca.index

RangeIndex(start=0, stop=374634, step=1)
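
These attributes are tied together: size is the product of the two entries of shape (for ca, 374634 × 5 = 1873170), and describe() summarizes only the numeric columns unless told otherwise. A small sketch on a stand-in frame (demo is an assumed name; the rows are values seen in this notebook) makes both points checkable.

```python
import pandas as pd

demo = pd.DataFrame({
    'State': ['CA'] * 5,
    'Sex': ['F'] * 5,
    'Year': [1910, 1910, 1910, 2017, 2017],
    'Name': ['Mary', 'Helen', 'Dorothy', 'Emma', 'Mia'],
    'Count': [295, 239, 220, 2726, 2588],
})

n_rows, n_cols = demo.shape
summary = demo.describe()  # numeric columns only: Year and Count

print(demo.size == n_rows * n_cols)  # size counts cells, not rows
print(list(summary.columns))
print(summary.loc[['min', 'max']])
```

The text columns (State, Sex, Name) are absent from the summary, which is why the describe() output for ca has only Year and Count columns.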

Sorting¶

What we've done so far is NOT exploratory data analysis; we were just playing around a bit with the capabilities of the pandas library. Now that we're done, let's turn to the problem at hand: identifying the most common name in California last year, which for this dataset means 2017, the most recent year it contains.

ca2017 = ca[ca['Year'] == 2017]
ca_sorted = ca2017.sort_values('Count', ascending=False).head(10)
ca_sorted

	State	Sex	Year	Name	Count
217344	CA	F	2017	Emma	2726
217345	CA	F	2017	Mia	2588
371716	CA	M	2017	Noah	2511
217346	CA	F	2017	Olivia	2474
217347	CA	F	2017	Sophia	2430
217348	CA	F	2017	Isabella	2337
371717	CA	M	2017	Sebastian	2264
371718	CA	M	2017	Liam	2180
371719	CA	M	2017	Ethan	2141
371720	CA	M	2017	Matthew	2120

Output of ca.describe() (from Understanding the Data):

	Year	Count
count	374634.000000	374634.000000
mean	1982.741532	81.487027
std	26.107496	302.147462
min	1910.000000	5.000000
25%	1966.000000	7.000000
50%	1989.000000	13.000000
75%	2004.000000	39.000000
max	2017.000000	8263.000000

Output of ca.head():

	State	Sex	Year	Name	Count
0	CA	F	1910	Mary	295
1	CA	F	1910	Helen	239
2	CA	F	1910	Dorothy	220
3	CA	F	1910	Margaret	163
4	CA	F	1910	Frances	134

Output of ca[ca['Year'] == 2017].tail() (from Indexing Review):

	State	Sex	Year	Name	Count
374629	CA	M	2017	Zeth	5
374630	CA	M	2017	Zeyad	5
374631	CA	M	2017	Zia	5
374632	CA	M	2017	Ziad	5
374633	CA	M	2017	Ziv	5
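
Returning to the top-ten query, sort_values('Count', ascending=False).head(10): pandas also offers nlargest, which expresses "the n rows with the biggest Count" directly, and idxmax, which returns the index label of the single largest value. The sketch below is an alternative phrasing, not the notebook's own method. It runs on a small stand-in frame (demo is an assumed name) and picks the latest year from the data instead of hard-coding 2017.

```python
import pandas as pd

demo = pd.DataFrame({
    'State': ['CA'] * 5,
    'Sex': ['F', 'F', 'F', 'M', 'F'],
    'Year': [1910, 2017, 2017, 2017, 2017],
    'Name': ['Mary', 'Emma', 'Mia', 'Noah', 'Olivia'],
    'Count': [295, 2726, 2588, 2511, 2474],
})

# Take "last year" from the data rather than hard-coding it.
latest = demo[demo['Year'] == demo['Year'].max()]

top_two = latest.nlargest(2, 'Count')  # rows with the two biggest counts
# idxmax gives the index label of the largest Count; .loc then looks up its Name.
most_common = latest.loc[latest['Count'].idxmax(), 'Name']

print(top_two)
print(most_common)
```

On the real data, ca2017.nlargest(10, 'Count') would be expected to return the same ten names as the sort_values version, though rows with tied counts may come back in a different order.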