In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

plt.style.use('fivethirtyeight')
sns.set_context("notebook")

Let's start by loading the California baby names again.

In [2]:
import urllib.request
import os.path
import zipfile

data_url = "https://www.ssa.gov/oact/babynames/state/namesbystate.zip"
local_filename = "babynamesbystate.zip"
if not os.path.exists(local_filename): # if the data exists don't download again
    with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
        f.write(resp.read())

zf = zipfile.ZipFile(local_filename, 'r')

ca_name = 'CA.TXT'
field_names = ['State', 'Sex', 'Year', 'Name', 'Count']
with zf.open(ca_name) as fh:
    babynames = pd.read_csv(fh, header=None, names=field_names)

babynames.sample(5)
Out[2]:
State Sex Year Name Count
199279 CA F 2012 Emmanuelle 14
37832 CA F 1957 Marilee 23
171302 CA F 2005 Ilona 7
5322 CA F 1922 Amy 23
227337 CA M 1918 August 16

Goal 1: Find the most popular baby name in California in 2018

In [3]:
babynames[babynames["Year"] == 2018].sort_values(by = "Count", ascending = False).head(5)
Out[3]:
State Sex Year Name Count
221131 CA F 2018 Emma 2722
378377 CA M 2018 Noah 2555
221132 CA F 2018 Mia 2484
221133 CA F 2018 Olivia 2456
378378 CA M 2018 Liam 2405

Goal 2: Find baby names that start with j.

Approach 1: Combine lecture 5 syntax with CS61A/CS88 ideas.

In [4]:
babynames["Name"].head(10)
Out[4]:
0         Mary
1        Helen
2      Dorothy
3     Margaret
4      Frances
5         Ruth
6       Evelyn
7        Alice
8     Virginia
9    Elizabeth
Name: Name, dtype: object
In [5]:
starts_with_j = [x[0] == 'J' for x in babynames["Name"]]
babynames[starts_with_j].sample(5)
Out[5]:
State Sex Year Name Count
331475 CA M 2001 Jaedon 12
358780 CA M 2011 Jermaine 30
272429 CA M 1972 Jamin 5
83771 CA F 1980 Joan 43
157294 CA F 2002 Julieta 49
In [6]:
j_names = babynames[[x[0] == 'J' for x in babynames["Name"]]]

Approach 2: Use the Series.str methods.

In [7]:
babynames["Name"].str.startswith('J').head(10)
Out[7]:
0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8    False
9    False
Name: Name, dtype: bool
In [8]:
starts_with_j = babynames["Name"].str.startswith('J')
starts_with_j.head(10)
Out[8]:
0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8    False
9    False
Name: Name, dtype: bool
In [9]:
babynames[babynames["Name"].str.startswith('J')].sample(5)
Out[9]:
State Sex Year Name Count
111028 CA F 1989 Josephina 10
310670 CA M 1993 Jamie 70
359123 CA M 2011 Jan 17
321070 CA M 1997 Johan 26
22795 CA F 1946 Jan 147
In [10]:
babynames[babynames["Name"].str.contains('ad')].sample(5)
Out[10]:
State Sex Year Name Count
66717 CA F 1972 Jada 7
324681 CA M 1998 Chadwick 6
303449 CA M 1990 Trinidad 23
141969 CA F 1998 Guadalupe 625
268932 CA M 1970 Shad 27
In [11]:
babynames["Name"].str.split('a').to_frame().head(5)
Out[11]:
Name
0 [M, ry]
1 [Helen]
2 [Dorothy]
3 [M, rg, ret]
4 [Fr, nces]

In-lecture challenge: Try to write a line of code that creates a list (or Series or array) of all names that end with “ert”.

In [ ]:
 

Goal 3: Sort names by their length.

Suppose we want to sort all baby names in California by their length.

As before, there are ways to do this using only lecture 5 content. For example, the montrosity below was concocted during the Sp19 version of this class.

In [12]:
babynames.iloc[[i for i, m in sorted(enumerate(babynames['Name']), key=lambda x: -len(x[1]))]].head(5)
Out[12]:
State Sex Year Name Count
102485 CA F 1986 Mariadelosangel 5
297806 CA M 1987 Franciscojavier 5
299240 CA M 1988 Franciscojavier 10
302145 CA M 1989 Franciscojavier 6
307053 CA M 1991 Ryanchristopher 7

Approach 1: Create a temporary column.

Create a new series of only the lengths. Then add that series to the dataframe as a column. Then sort by that column. Then drop that column.

In [13]:
#create a new series of only the lengths
babyname_lengths = babynames["Name"].str.len()

#add that series to the dataframe as a column
babynames["name_lengths"] = babyname_lengths
babynames.head(5)
Out[13]:
State Sex Year Name Count name_lengths
0 CA F 1910 Mary 295 4
1 CA F 1910 Helen 239 5
2 CA F 1910 Dorothy 220 7
3 CA F 1910 Margaret 163 8
4 CA F 1910 Frances 134 7
In [14]:
#sort by the temporary column
babynames = babynames.sort_values(by = "name_lengths", ascending=False)
babynames.head(5)
Out[14]:
State Sex Year Name Count name_lengths
319423 CA M 1996 Franciscojavier 8 15
302145 CA M 1989 Franciscojavier 6 15
299240 CA M 1988 Franciscojavier 10 15
312618 CA M 1993 Johnchristopher 5 15
307174 CA M 1991 Franciscojavier 6 15
In [15]:
#drop the temporary column
babynames = babynames.drop("name_lengths", axis = 1)
babynames.head(5)
Out[15]:
State Sex Year Name Count
319423 CA M 1996 Franciscojavier 8
302145 CA M 1989 Franciscojavier 6
299240 CA M 1988 Franciscojavier 10
312618 CA M 1993 Johnchristopher 5
307174 CA M 1991 Franciscojavier 6

We can also use the Python map function if we want to use an arbitrarily defined function. Suppose we want to sort by the number of occurrences of "dr" plus the number of occurences of "ea".

In [16]:
def dr_ea_count(string):
    return string.count('dr') + string.count('ea')

#create the temporary column
babynames["dr_ea_count"] = babynames["Name"].map(dr_ea_count)

#sort by the temporary column
babynames = babynames.sort_values(by = "dr_ea_count", ascending=False)

#drop that column
babynames = babynames.drop("dr_ea_count", 1)
babynames.head(5)
Out[16]:
State Sex Year Name Count
108712 CA F 1988 Deandrea 5
293396 CA M 1985 Deandrea 6
101958 CA F 1986 Deandrea 6
115935 CA F 1990 Deandrea 5
131003 CA F 1994 Leandrea 5
In [17]:
# anything else you all want to try?

Approach 2: Generate an index sorted in the desired order.

In [18]:
#let's start over by first scrambling the order of babynames
babynames = babynames.sample(frac=1)
babynames.head(5)
Out[18]:
State Sex Year Name Count
280279 CA M 1978 Malcolm 21
56099 CA F 1967 Dawne 10
329125 CA M 2000 Zavier 10
358188 CA M 2011 Raul 232
51056 CA F 1964 Carl 5

Another approach is to take advantage of the fact that .loc can accept an index. That is:

  • df.loc[idx] returns df with its rows in the same order as the given index.
  • Only works if the index exactly matches the DataFrame.

The first step was to create a sequence of the lengths of the names.

In [19]:
name_lengths = babynames["Name"].str.len()
name_lengths.head(5)
Out[19]:
280279    7
56099     5
329125    6
358188    4
51056     4
Name: Name, dtype: int64

The next step is to sort the new series we just created.

In [20]:
name_lengths_sorted_by_length = name_lengths.sort_values()
name_lengths_sorted_by_length.head(5)
Out[20]:
84400     2
256916    2
261551    2
374478    2
89965     2
Name: Name, dtype: int64

Next, we pass the index of the sorted series to the loc method of the original dataframe.

In [21]:
index_sorted_by_length = name_lengths_sorted_by_length.index
index_sorted_by_length
Out[21]:
Int64Index([ 84400, 256916, 261551, 374478,  89965, 329618, 141626, 299133,
            176157,  87352,
            ...
            102485, 329695, 312618, 312731, 297806, 324728, 302145, 299240,
            307174, 307053],
           dtype='int64', length=381214)
In [22]:
babynames.loc[index_sorted_by_length].head(5)
Out[22]:
State Sex Year Name Count
84400 CA F 1980 Ha 14
256916 CA M 1960 Al 42
261551 CA M 1964 Ty 48
374478 CA M 2016 Cy 8
89965 CA F 1982 Le 11

Note we can also do this all in one line:

In [23]:
babynames.loc[babynames["Name"].str.len().sort_values().index].head(5)
Out[23]:
State Sex Year Name Count
84400 CA F 1980 Ha 14
256916 CA M 1960 Al 42
261551 CA M 1964 Ty 48
374478 CA M 2016 Cy 8
89965 CA F 1982 Le 11

Goal 4: Name whose popularity has changed the most.

First we need to define change in popularity.

For the purposes of lecture, let’s stay simple and use the AMMD (absolute max/min difference): max(count) - min(count).

To make sure we understand this quantity, let's consider the name Jennifer.

In [24]:
jennifer_counts = babynames.query("Name == 'Jennifer'")["Count"]
jennifer_counts.head(5)
Out[24]:
28844      284
102675    4284
47574     1386
60907     4696
217549     225
Name: Count, dtype: int64

The AMMD for Jennifer is 6,061, as seen below:

In [25]:
max(jennifer_counts) - min(jennifer_counts)
Out[25]:
6061
In [26]:
def ammd(series):
    return max(series) - min(series)
In [27]:
ammd(jennifer_counts)
Out[27]:
6061

Approach 1: Naive For Loop

As a first approach, we can try to use a for loop.

In [28]:
#build dictionary where entry i is the ammd for the given name
#e.g. ammd["jennifer"] should be 6061
#ammd_of_babyname_counts = {}
#for name in ??:
#    counts_of_current_name = babynames[??]["Count"]
#    ammd_of_babyname_counts[name] = ammd(counts_of_current_name)
    
#convert to series
#ammd_of_babyname_counts = pd.Series(ammd_of_babyname_counts) 

Answer below. Note that we only used the first 100 names because otherwise the code takes ages to complete running.

In [29]:
#build dictionary where entry i is the ammd for the given name
#e.g. ammd["jennifer"] should be 6061
ammd_of_babyname_counts = {}
for name in sorted(babynames["Name"].unique())[0:100]:
    counts_of_current_name = babynames[babynames["Name"] == name]["Count"]
    ammd_of_babyname_counts[name] = ammd(counts_of_current_name)
    
#convert to series
ammd_of_babyname_counts = pd.Series(ammd_of_babyname_counts) 
ammd_of_babyname_counts.head(5)
Out[29]:
Aadan        2
Aaden      138
Aadhav       2
Aadhira      4
Aadhya      45
dtype: int64

Approach 2: Use groupby.agg

Instead, we can use the very powerful groupby.agg operation, which allows us to simply and efficiently compute what we want.

In [30]:
babynames.groupby("Name").agg(ammd).head(5)
Out[30]:
Year Count
Name
Aadan 6 2
Aaden 11 138
Aadhav 3 2
Aadhira 1 4
Aadhya 11 45

Note that the result includes both a Year and Count column. The Count column is what we want, namely the AMMD for the name in that row. To check your understanding, try to figure out what the Year column represents.

To understand how groupby works, consider the visual diagram below. The groupby function clusters rows from the original dataframe into groups (which I call subframes). The agg function then condenses each subframe into a single representative row using the provided function f.

groupby_picture.png

Some Additional Groupby Puzzles

Groupby puzzle #1: To test your understanding, try to interpret the result of the code below.

In [31]:
babynames.groupby("Year").agg(ammd).plot()
Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f7d110cfc90>

For reference, the first 5 values from the plot above are:

In [32]:
babynames.groupby("Year").agg(ammd).head(5)
Out[32]:
Count
Year
1910 290
1911 385
1912 529
1913 609
1914 768

groupby Puzzle #2

In [33]:
elections = pd.read_csv("elections.csv")
elections.sample(5)
Out[33]:
Year Candidate Party Popular vote Result %
111 1960 John Kennedy Democratic 34220984 win 50.082561
110 1956 T. Coleman Andrews States' Rights 107929 loss 0.174883
144 1996 Bill Clinton Democratic 47400125 win 49.296938
80 1924 Calvin Coolidge Republican 15723789 win 54.329113
81 1924 John W. Davis Democratic 8386242 loss 28.976291

We have to be careful when using aggregation functions. For example, the code below might be misinterpreted to say that Woodrow Wilson ran for election in 2016. Why is this happening?

elections.groupby("Party").agg(max).head(10)

groupby puzzle #3

Inspired by above, try to predict the results of the groupby operation shown. The answer is below the image.

groupby_puzzle.png

The top ?? will be "hi", the second ?? will be "tx", and the third ?? will be "sd".

groupby puzzle #4

Next we'll write code that properly returns the best result by each party. That is, each row should show the Year, Candidate, Popular Vote, Result, and % for the election in which that party saw its best results (rather than mixing them as in the example above).

In [34]:
#def get_first(s):
#    return s.iloc[0]
    
#elections_sorted_by_percent = elections.sort_values("%", ascending=False)
#elections_sorted_by_percent.groupby("Party").agg(lambda x : x.iloc[0])

groupby puzzle #6: Total Male and Female babies born

Suppose we want to find the total number of male and female babies born each year in California.

Try to figure out how we'd do this using groupby.

In [ ]:
 

Other groupby Features

It is possible to group a DataFrame by multiple features. For example, if we group by Year and Sex we get back a DataFrame with the total number of babies of each sex born in each year.

In [35]:
babynames.groupby(["Year", "Sex"]).agg(sum).head(6)
Out[35]:
Count
Year Sex
1910 F 5950
M 3213
1911 F 6602
M 3381
1912 F 9803
M 8142

The DataFrame resulting from the aggregation operation is now multi-indexed. That is, it has more than one dimension to its index. We will explore this in next week's exercises.

groupby.size()

In [36]:
#size returns a Series giving the size of each group
elections.groupby("Party").size().head(15)
Out[36]:
Party
American                  2
American Independent      3
Anti-Masonic              1
Anti-Monopoly             1
Citizens                  1
Communist                 1
Constitution              3
Constitutional Union      1
Democratic               46
Democratic-Republican     2
Dixiecrat                 1
Farmer–Labor              1
Free Soil                 2
Green                     6
Greenback                 1
dtype: int64

groupby.filter()

In [37]:
# filter gives a copy of the original DataFrame where row r is included
# if its group obeys the given condition
#
# Note: Filtering is done per GROUP, not per ROW.
elections.groupby("Year").filter(lambda sf: sf["%"].max() < 45)
Out[37]:
Year Candidate Party Popular vote Result %
23 1860 Abraham Lincoln Republican 1855993 win 39.699408
24 1860 John Bell Constitutional Union 590901 loss 12.639283
25 1860 John C. Breckinridge Southern Democratic 848019 loss 18.138998
26 1860 Stephen A. Douglas Northern Democratic 1380202 loss 29.522311
66 1912 Eugene V. Debs Socialist 901551 loss 6.004354
67 1912 Eugene W. Chafin Prohibition 208156 loss 1.386325
68 1912 Theodore Roosevelt Progressive 4122721 loss 27.457433
69 1912 William Taft Republican 3486242 loss 23.218466
70 1912 Woodrow Wilson Democratic 6296284 win 41.933422
115 1968 George Wallace American Independent 9901118 loss 13.571218
116 1968 Hubert Humphrey Democratic 31271839 loss 42.863537
117 1968 Richard Nixon Republican 31783783 win 43.565246
139 1992 Andre Marrou Libertarian 290087 loss 0.278516
140 1992 Bill Clinton Democratic 44909806 win 43.118485
141 1992 Bo Gritz Populist 106152 loss 0.101918
142 1992 George H. W. Bush Republican 39104550 loss 37.544784
143 1992 Ross Perot Independent 19743821 loss 18.956298

groupby.sum(), groupby.mean(), etc.

As an alternative to groupby.agg(sum), we can also simply do groupby.sum().

In [38]:
elections.groupby("Year").agg(sum).head()
Out[38]:
Popular vote %
Year
1824 264413 100.0
1828 1143703 100.0
1832 1287655 100.0
1836 1460216 100.0
1840 2404437 100.0
In [39]:
elections.groupby("Year").sum().head()
Out[39]:
Popular vote %
Year
1824 264413 100.0
1828 1143703 100.0
1832 1287655 100.0
1836 1460216 100.0
1840 2404437 100.0

The same applies for many other common operations.

In [40]:
elections.groupby("Year").agg(max).head()
Out[40]:
Candidate Party Popular vote Result %
Year
1824 John Quincy Adams Democratic-Republican 151271 win 57.210122
1828 John Quincy Adams National Republican 642806 win 56.203927
1832 William Wirt National Republican 702735 win 54.574789
1836 William Henry Harrison Whig 763291 win 52.272472
1840 William Henry Harrison Whig 1275583 win 53.051213
In [41]:
elections.groupby("Year").max().head()
Out[41]:
Candidate Party Popular vote Result %
Year
1824 John Quincy Adams Democratic-Republican 151271 win 57.210122
1828 John Quincy Adams National Republican 642806 win 56.203927
1832 William Wirt National Republican 702735 win 54.574789
1836 William Henry Harrison Whig 763291 win 52.272472
1840 William Henry Harrison Whig 1275583 win 53.051213
In [42]:
#elections.groupby("Year").mean().head()
#elections.groupby("Year").median().head()
elections.groupby("Year").max().head()
Out[42]:
Candidate Party Popular vote Result %
Year
1824 John Quincy Adams Democratic-Republican 151271 win 57.210122
1828 John Quincy Adams National Republican 642806 win 56.203927
1832 William Wirt National Republican 702735 win 54.574789
1836 William Henry Harrison Whig 763291 win 52.272472
1840 William Henry Harrison Whig 1275583 win 53.051213

Pivot Tables

Goal 5: Finding the number of babies born in each year of each sex.

Suppose we want to build a table showing the total number of babies born of each sex in each year. One way is to groupby using both columns of interest.

In [43]:
babynames.groupby(["Year", "Sex"]).agg(sum).head(6)
Out[43]:
Count
Year Sex
1910 F 5950
M 3213
1911 F 6602
M 3381
1912 F 9803
M 8142

A more natural approach is to use a pivot table (like we saw in data 8).

In [44]:
babynames_pivot = babynames.pivot_table(
    index='Year', # the rows (turned into index)
    columns='Sex', # the column values
    values='Count', # the field(s) to processed in each group
    aggfunc=np.max, # group operation
)
babynames_pivot.head(6)
Out[44]:
Sex F M
Year
1910 295 237
1911 390 214
1912 534 501
1913 584 614
1914 773 769
1915 998 1033

The basic idea behind pivot tables is shown in the image below.

pivot_picture.png

Extra Groupby Puzzle

In goal 1, we didn't take into account the unlikely possibility that the most popular name was actually spread across both birth sexes. For example, what if in the table below it turns out that there were 300 female Noahs born in CA in 2018. In that case, Noah would actually be the most popular.

Since our queries are getting pretty long, I've stuck them inside parentheses which allows us to spread them over many lines.

In [45]:
(
babynames[babynames["Year"] == 2018]
    .sort_values(by = "Count", ascending = False)
    .head(5)
)
Out[45]:
State Sex Year Name Count
221131 CA F 2018 Emma 2722
378377 CA M 2018 Noah 2555
221132 CA F 2018 Mia 2484
221133 CA F 2018 Olivia 2456
378378 CA M 2018 Liam 2405

Try to add a single line to the operation above so that each row represents the sum of male and female babies born in 2018 with that name. To do this, fill in the ??? below.

In [46]:
(
babynames[babynames["Year"] == 2018]
    #.???
    .sort_values(by = "Count", ascending = False)
    .head(5)
)
Out[46]:
State Sex Year Name Count
221131 CA F 2018 Emma 2722
378377 CA M 2018 Noah 2555
221132 CA F 2018 Mia 2484
221133 CA F 2018 Olivia 2456
378378 CA M 2018 Liam 2405