DataTables, Indexes, Pandas, and Seaborn

Some useful (free) resources

Introductory:

Core Pandas/Data Science books:

Complementary resources:

OK, let's load and configure some of our core libraries (as an aside, you can find a nice visual gallery of available matplotlib sytles here).

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

plt.style.use('fivethirtyeight')
sns.set_context("notebook")

Getting the Data

https://www.ssa.gov/OACT/babynames/index.html

https://www.ssa.gov/data

As we saw before, we can download data from the internet with Python, and do so only if needed:

In [2]:
import requests
from pathlib import Path

namesbystate_path = Path('namesbystate.zip')
data_url = 'https://www.ssa.gov/oact/babynames/state/namesbystate.zip'

if not namesbystate_path.exists():
    print('Downloading...', end=' ')
    resp = requests.get(data_url)
    with namesbystate_path.open('wb') as f:
        f.write(resp.content)
    print('Done!')

Let's use Python to understand how this data is laid out:

In [3]:
import zipfile
zf = zipfile.ZipFile(namesbystate_path, 'r')
print([f.filename for f in zf.filelist])
['AK.TXT', 'AL.TXT', 'AR.TXT', 'AZ.TXT', 'CA.TXT', 'CO.TXT', 'CT.TXT', 'DC.TXT', 'DE.TXT', 'FL.TXT', 'GA.TXT', 'HI.TXT', 'IA.TXT', 'ID.TXT', 'IL.TXT', 'IN.TXT', 'KS.TXT', 'KY.TXT', 'LA.TXT', 'MA.TXT', 'MD.TXT', 'ME.TXT', 'MI.TXT', 'MN.TXT', 'MO.TXT', 'MS.TXT', 'MT.TXT', 'NC.TXT', 'ND.TXT', 'NE.TXT', 'NH.TXT', 'NJ.TXT', 'NM.TXT', 'NV.TXT', 'NY.TXT', 'OH.TXT', 'OK.TXT', 'OR.TXT', 'PA.TXT', 'RI.TXT', 'SC.TXT', 'SD.TXT', 'StateReadMe.pdf', 'TN.TXT', 'TX.TXT', 'UT.TXT', 'VA.TXT', 'VT.TXT', 'WA.TXT', 'WI.TXT', 'WV.TXT', 'WY.TXT']

We can pull the PDF readme to view it, but let's operate with the rest of the data in its compressed state:

In [4]:
zf.extract('StateReadMe.pdf')
Out[4]:
'/Users/simonmo/Downloads/lec02/StateReadMe.pdf'

Let's have a look at the California data, it should give us an idea about the structure of the whole thing:

In [5]:
ca_name = 'CA.TXT'
with zf.open(ca_name) as f:
    for i in range(10):
        print(f.readline().rstrip().decode())
CA,F,1910,Mary,295
CA,F,1910,Helen,239
CA,F,1910,Dorothy,220
CA,F,1910,Margaret,163
CA,F,1910,Frances,134
CA,F,1910,Ruth,128
CA,F,1910,Evelyn,126
CA,F,1910,Alice,118
CA,F,1910,Virginia,101
CA,F,1910,Elizabeth,93

This is equivalent (on macOS or Linux) to extracting the full CA.TXT file to disk and then using the head command (if you're on Windows, don't try to run the cell below):

In [6]:
zf.extract(ca_name)
!head {ca_name}









In [7]:
!cat /tmp/environment.yml
cat: /tmp/environment.yml: No such file or directory
In [8]:
!echo {ca_name}
CA.TXT

A couple of practical comments:

  • The above is using special tricks in IPython that let you call operating system commands via !cmd, and that expand Python variables in such commands with the {var} syntax. You can find more about IPython's special tricks in this tutorial.

  • head doesn't work on Windows, though there are equivalent Windows commands. But by using Python code, even if it's a little bit more verbose, we have a 100% portable solution.

  • If the CA.TXT file was huge, it would be wasteful to write it all to disk only to look at the start of the file.

The last point is an important, and general theme of this course: we need to learn how to operate with data only on an as-needed basis, because there are many situations in the real world where we can't afford to brute-force 'download all the things'.

Let's remove the CA.TXT file to make sure we keep working with our compressed data, as if we couldn't extract it:

In [9]:
import os; os.unlink(ca_name)
In [10]:
import pandas as pd

field_names = ['State', 'Sex', 'Year', 'Name', 'Count']
with zf.open(ca_name) as fh:
    ca = pd.read_csv(fh, header=None, names=field_names)
ca.head()
Out[10]:
State Sex Year Name Count
0 CA F 1910 Mary 295
1 CA F 1910 Helen 239
2 CA F 1910 Dorothy 220
3 CA F 1910 Margaret 163
4 CA F 1910 Frances 134

Indexing Review

Let's play around a bit with our indexing techniques from earlier today.

In [11]:
ca['Count'].head()
Out[11]:
0    295
1    239
2    220
3    163
4    134
Name: Count, dtype: int64
In [12]:
ca[0:3]
Out[12]:
State Sex Year Name Count
0 CA F 1910 Mary 295
1 CA F 1910 Helen 239
2 CA F 1910 Dorothy 220
In [13]:
#ca[0]
In [14]:
ca.iloc[:3, -2:]
Out[14]:
Name Count
0 Mary 295
1 Helen 239
2 Dorothy 220
In [15]:
ca.loc[0:3, 'State']
Out[15]:
0    CA
1    CA
2    CA
3    CA
Name: State, dtype: object
In [16]:
ca['Name'].head()
Out[16]:
0        Mary
1       Helen
2     Dorothy
3    Margaret
4     Frances
Name: Name, dtype: object
In [17]:
ca[['Name']].head()
Out[17]:
Name
0 Mary
1 Helen
2 Dorothy
3 Margaret
4 Frances
In [18]:
ca[ca['Year'] == 2017].tail()
Out[18]:
State Sex Year Name Count
374629 CA M 2017 Zeth 5
374630 CA M 2017 Zeyad 5
374631 CA M 2017 Zia 5
374632 CA M 2017 Ziad 5
374633 CA M 2017 Ziv 5

Understanding the Data

In [19]:
ca.head()
Out[19]:
State Sex Year Name Count
0 CA F 1910 Mary 295
1 CA F 1910 Helen 239
2 CA F 1910 Dorothy 220
3 CA F 1910 Margaret 163
4 CA F 1910 Frances 134

We can get a sense for the shape of our data:

In [20]:
ca.shape
Out[20]:
(374634, 5)
In [21]:
ca.size  # rows x columns
Out[21]:
1873170

Pandas will give us a summary overview of the numerical data in the DataFrame:

In [22]:
ca.describe()
Out[22]:
Year Count
count 374634.000000 374634.000000
mean 1982.741532 81.487027
std 26.107496 302.147462
min 1910.000000 5.000000
25% 1966.000000 7.000000
50% 1989.000000 13.000000
75% 2004.000000 39.000000
max 2017.000000 8263.000000

And let's look at the structure of the DataFrame:

In [23]:
ca.index
Out[23]:
RangeIndex(start=0, stop=374634, step=1)

Sorting

What we've done so far is NOT exploratory data analysis. We were just playing around a bit with the capabilities of the pandas library. Now that we're done, let's turn to the problem at hand: Identifying the most common name in California last year.

In [24]:
ca2017 = ca[ca['Year'] == 2017]
ca_sorted = ca2017.sort_values('Count', ascending=False).head(10)
ca_sorted
Out[24]:
State Sex Year Name Count
217344 CA F 2017 Emma 2726
217345 CA F 2017 Mia 2588
371716 CA M 2017 Noah 2511
217346 CA F 2017 Olivia 2474
217347 CA F 2017 Sophia 2430
217348 CA F 2017 Isabella 2337
371717 CA M 2017 Sebastian 2264
371718 CA M 2017 Liam 2180
371719 CA M 2017 Ethan 2141
371720 CA M 2017 Matthew 2120