Introduction to Pandas, Part 2

Advanced Pandas syntax, aggregation, and joining.

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from IPython.display import display

plt.style.use('fivethirtyeight')
sns.set_context("notebook")

Let's start by loading the California baby names again.

In [2]:
import urllib.request
import os.path
import zipfile

data_url = "https://www.ssa.gov/oact/babynames/state/namesbystate.zip"
local_filename = "babynamesbystate.zip"
if not os.path.exists(local_filename): # if the data exists don't download again
    with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
        f.write(resp.read())

zf = zipfile.ZipFile(local_filename, 'r')

ca_name = 'CA.TXT'
field_names = ['State', 'Sex', 'Year', 'Name', 'Count']
with zf.open(ca_name) as fh:
    babynames = pd.read_csv(fh, header=None, names=field_names)

babynames.sample(5)
Out[2]:
State Sex Year Name Count
93934 CA F 1983 Ying 5
285204 CA M 1979 Howard 69
182581 CA F 2008 Suzanne 18
206116 CA F 2014 Madelynn 72
131780 CA F 1995 Tiara 68

Goal 1: Find the most popular baby name in California in 2018

In [3]:
babynames[babynames["Year"] == 2018].sort_values(by = "Count", ascending = False).head(5)
Out[3]:
State Sex Year Name Count
221154 CA F 2018 Emma 2741
382082 CA M 2018 Noah 2567
221155 CA F 2018 Mia 2497
221156 CA F 2018 Olivia 2464
382083 CA M 2018 Liam 2411
In [4]:
babynames[babynames["Year"] == 2018].sort_values(by = "Count").tail(5)
Out[4]:
State Sex Year Name Count
382083 CA M 2018 Liam 2411
221156 CA F 2018 Olivia 2464
221155 CA F 2018 Mia 2497
382082 CA M 2018 Noah 2567
221154 CA F 2018 Emma 2741

Goal 2: Find baby names that start with J.

Approach 1: Combine lecture 4 syntax with CS61A/CS88 ideas.

In [5]:
babynames["Name"].head(20)
Out[5]:
0          Mary
1         Helen
2       Dorothy
3      Margaret
4       Frances
5          Ruth
6        Evelyn
7         Alice
8      Virginia
9     Elizabeth
10     Florence
11        Marie
12      Mildred
13         Rose
14        Hazel
15       Louise
16    Josephine
17      Lucille
18        Grace
19       Gladys
Name: Name, dtype: object
In [6]:
starts_with_j = [ x.startswith('J') for x in babynames["Name"]]
#starts_with_j = [ x[0] == 'J'       for x in babynames["Name"]]  # Alternative. Why?
starts_with_j[:20]
Out[6]:
[False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 True,
 False,
 False,
 False]
In [7]:
starts_with_j = [x.startswith('J') for x in babynames["Name"]]
babynames[starts_with_j].sample(5)
Out[7]:
State Sex Year Name Count
321790 CA M 1996 Joe 196
292076 CA M 1983 Jamie 104
42014 CA F 1959 Janise 5
128127 CA F 1994 Jacklyn 68
190934 CA F 2010 Jessa 18
In [8]:
j_names = babynames[ [x.startswith('J') for x in babynames["Name"]] ]

Approach 2: Use the Series.str methods.

First, what is this special .str attribute??

In [9]:
s = babynames["Name"].str
n = "Jane" # a Simple string, so we can compare...
In [10]:
s  # Let's look at s, and then use <TAB> on `s.` to get a sense of what's there...
Out[10]:
<pandas.core.strings.StringMethods at 0x10d74e850>
In [11]:
babynames["Name"].str.startswith('J').head(10)
Out[11]:
0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8    False
9    False
Name: Name, dtype: bool
In [12]:
starts_with_j = babynames["Name"].str.startswith('J')
starts_with_j.head(10)
Out[12]:
0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8    False
9    False
Name: Name, dtype: bool
In [13]:
babynames[babynames["Name"].str.startswith('J')].sample(5)
Out[13]:
State Sex Year Name Count
330538 CA M 1999 Jorje 8
93698 CA F 1983 Jennafer 5
289653 CA M 1981 Jeron 6
357216 CA M 2009 Jordin 13
39502 CA F 1958 Juana 20

Quick detour - which option performs better?

In [14]:
%timeit [x[0] == 'J' for x in babynames["Name"]]
%timeit [x.startswith('J') for x in babynames["Name"]]
%timeit babynames["Name"].str.startswith('J')
51.8 ms ± 501 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
86.1 ms ± 464 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
97.6 ms ± 519 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [15]:
babynames[babynames["Name"].str.contains('ad')].sample(5)
Out[15]:
State Sex Year Name Count
269460 CA M 1967 Amador 5
201276 CA F 2012 Jadzia 5
239491 CA M 1933 Trinidad 10
368871 CA M 2013 Angad 12
292668 CA M 1983 Braden 13
In [16]:
babynames["Name"].str.split('a').to_frame().head(5)
Out[16]:
Name
0 [M, ry]
1 [Helen]
2 [Dorothy]
3 [M, rg, ret]
4 [Fr, nces]

In-lecture challenge

Try to write a line of code that creates a list (or Series or array) of all names that end with “ert”.

Answer
    babynames[babynames["Name"].str.endswith('ert')]["Name"].unique()
    

Goal 3: Sort names by their length.

Suppose we want to sort all baby names in California by their length.

As before, there are ways to do this using only lecture 4 content. For example, the monstrosity below was concocted during the Sp19 version of this class.

In [17]:
babynames.iloc[[i for i, m in sorted(enumerate(babynames['Name']), key=lambda x: -len(x[1]))]].head(5)
Out[17]:
State Sex Year Name Count
102493 CA F 1986 Mariadelosangel 5
301497 CA M 1987 Franciscojavier 5
302930 CA M 1988 Franciscojavier 10
305837 CA M 1989 Franciscojavier 6
310744 CA M 1991 Ryanchristopher 7

Approach 1: Create a temporary column.

Create a new series of only the lengths. Then add that series to the dataframe as a column. Then sort by that column. Then drop that column.

In [18]:
#create a new series of only the lengths
babyname_lengths = babynames["Name"].str.len()

#add that series to the dataframe as a column
babynames["name_lengths"] = babyname_lengths
babynames.head(5)
Out[18]:
State Sex Year Name Count name_lengths
0 CA F 1910 Mary 295 4
1 CA F 1910 Helen 239 5
2 CA F 1910 Dorothy 220 7
3 CA F 1910 Margaret 163 8
4 CA F 1910 Frances 134 7
In [19]:
#sort by the temporary column
babynames = babynames.sort_values(by = "name_lengths", ascending=False)
babynames.head(25)
Out[19]:
State Sex Year Name Count name_lengths
310744 CA M 1991 Ryanchristopher 7 15
302930 CA M 1988 Franciscojavier 10 15
333388 CA M 2000 Franciscojavier 6 15
102493 CA F 1986 Mariadelosangel 5 15
316422 CA M 1993 Ryanchristopher 5 15
301497 CA M 1987 Franciscojavier 5 15
326425 CA M 1997 Ryanchristopher 5 15
323115 CA M 1996 Franciscojavier 8 15
305837 CA M 1989 Franciscojavier 6 15
326250 CA M 1997 Franciscojavier 5 15
328420 CA M 1998 Franciscojavier 6 15
310865 CA M 1991 Franciscojavier 6 15
316309 CA M 1993 Johnchristopher 5 15
163342 CA F 2003 Mariadelcarmen 7 14
107964 CA F 1988 Mayraalejandra 8 14
108181 CA F 1988 Mariadelcarmen 7 14
142970 CA F 1998 Mariaguadalupe 25 14
349603 CA M 2006 Christianjames 5 14
107644 CA F 1988 Mariaguadalupe 10 14
192470 CA F 2010 Mariaguadalupe 7 14
325531 CA M 1997 Michaelanthony 9 14
172296 CA F 2005 Mariadelcarmen 5 14
349438 CA M 2006 Michaelanthony 6 14
140186 CA F 1997 Mariadelcarmen 12 14
129515 CA F 1994 Mariaguadalupe 11 14
In [20]:
#drop the temporary column
babynames = babynames.drop("name_lengths", axis = 'columns')
babynames.head(5)
Out[20]:
State Sex Year Name Count
310744 CA M 1991 Ryanchristopher 7
302930 CA M 1988 Franciscojavier 10
333388 CA M 2000 Franciscojavier 6
102493 CA F 1986 Mariadelosangel 5
316422 CA M 1993 Ryanchristopher 5

We can also use the Series map method if we want to use an arbitrarily defined function. Suppose we want to sort by the number of occurrences of "dr" plus the number of occurrences of "ea".

In [21]:
def dr_ea_count(string):
    return string.count('dr') + string.count('ea')

#create the temporary column
babynames["dr_ea_count"] = babynames["Name"].map(dr_ea_count)

#sort by the temporary column
babynames = babynames.sort_values(by = "dr_ea_count", ascending=False)
babynames.head()
Out[21]:
State Sex Year Name Count dr_ea_count
101964 CA F 1986 Deandrea 6 3
297086 CA M 1985 Deandrea 6 3
131014 CA F 1994 Leandrea 5 3
115943 CA F 1990 Deandrea 5 3
108718 CA F 1988 Deandrea 5 3
In [22]:
#drop that column
babynames = babynames.drop("dr_ea_count", axis = 'columns')
babynames.head(5)
Out[22]:
State Sex Year Name Count
101964 CA F 1986 Deandrea 6
297086 CA M 1985 Deandrea 6
131014 CA F 1994 Leandrea 5
115943 CA F 1990 Deandrea 5
108718 CA F 1988 Deandrea 5

Approach 2: Generate an index sorted in the desired order.

In [23]:
#let's start over by first scrambling the order of babynames
babynames = babynames.sample(frac=1)
babynames.head(5)
Out[23]:
State Sex Year Name Count
306522 CA M 1990 Louis 280
340179 CA M 2003 Edmund 14
60604 CA F 1969 Micheal 6
8678 CA F 1927 Faye 26
83654 CA F 1980 Antonia 61

Another approach is to take advantage of the fact that .loc can accept an index. That is:

  • df.loc[idx] returns df with its rows in the same order as the given index.
  • This only works if the given index contains exactly the same labels as the DataFrame's index.
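
To see how this works on something small, here's a minimal sketch on a toy frame (the names and index labels are made up):

```python
import pandas as pd

# Toy DataFrame with arbitrary integer row labels, like babynames has.
df = pd.DataFrame({"Name": ["Eve", "Bo", "Alice"]}, index=[10, 20, 30])

# Sort the lengths, then pass the sorted index to .loc to reorder the rows.
order = df["Name"].str.len().sort_values().index
reordered = df.loc[order]
print(reordered["Name"].tolist())  # shortest name first
```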

The first step is to create a series of the lengths of the names.

In [24]:
name_lengths = babynames["Name"].str.len()
name_lengths.head(5)
Out[24]:
306522    5
340179    6
60604     7
8678      4
83654     7
Name: Name, dtype: int64

The next step is to sort the new series we just created.

In [25]:
name_lengths_sorted_by_length = name_lengths.sort_values()
name_lengths_sorted_by_length.head(5)
Out[25]:
335969    2
321209    2
315453    2
229598    2
93930     2
Name: Name, dtype: int64

Next, we pass the index of the sorted series to the loc method of the original dataframe.

In [26]:
index_sorted_by_length = name_lengths_sorted_by_length.index
index_sorted_by_length
Out[26]:
Int64Index([335969, 321209, 315453, 229598,  93930, 384651, 299549, 307079,
             89494, 104477,
            ...
            310865, 326250, 316422, 333388, 326425, 305837, 323115, 302930,
            310744, 102493],
           dtype='int64', length=387781)
In [27]:
babynames.loc[index_sorted_by_length].tail(5)
Out[27]:
State Sex Year Name Count
305837 CA M 1989 Franciscojavier 6
323115 CA M 1996 Franciscojavier 8
302930 CA M 1988 Franciscojavier 10
310744 CA M 1991 Ryanchristopher 7
102493 CA F 1986 Mariadelosangel 5

Note we can also do this all in one line:

In [28]:
babynames.loc[babynames["Name"].str.len().sort_values().index].head(5)
Out[28]:
State Sex Year Name Count
335969 CA M 2001 Kc 6
321209 CA M 1995 Al 5
315453 CA M 1993 Kc 9
229598 CA M 1914 Al 5
93930 CA F 1983 Vy 5

Goal 4: Name whose popularity has changed the most.

First we need to define change in popularity.

For the purposes of lecture, let’s stay simple and use the AMMD (absolute max/min difference): max(count) - min(count).

To make sure we understand this quantity, let's consider the name Jennifer.

In [29]:
jennifer_counts = babynames.query("Name == 'Jennifer'")["Count"]
jennifer_counts.head(5)
Out[29]:
267174       5
272939      11
127582    2940
325063      15
149045    2263
Name: Count, dtype: int64

The AMMD for Jennifer is 6,059, as seen below:

In [30]:
max(jennifer_counts) - min(jennifer_counts)
Out[30]:
6059
In [31]:
def ammd(series):
    return max(series) - min(series)
In [32]:
ammd(jennifer_counts)
Out[32]:
6059
In [33]:
ammd(babynames.query("Name == 'Jessica'")["Count"])
Out[33]:
6946

Approach 1: Naive For Loop

As a first approach, we can try to use a for loop.

In [34]:
#build a dictionary mapping each name to its ammd
#e.g. ammd_of_babyname_counts["Jennifer"] should be 6059
ammd_of_babyname_counts = {}
for name in babynames["Name"].unique()[0:10]:
    counts_of_current_name = babynames[babynames["Name"] == name]["Count"]
    ammd_of_babyname_counts[name] = ammd(counts_of_current_name)
    
#convert to series
ammd_of_babyname_counts = pd.Series(ammd_of_babyname_counts) 
ammd_of_babyname_counts
Out[34]:
Louis        355
Edmund        70
Micheal      397
Faye          53
Antonia       79
Kayliana       3
Townes         1
Robert      7091
Brian       2935
Santos        76
dtype: int64

Answer below. Note that we only use the first 100 names because otherwise the code takes ages to run.

In [35]:
#build a dictionary mapping each name to its ammd
#e.g. ammd_of_babyname_counts["Jennifer"] should be 6059
ammd_of_babyname_counts = {}
for name in sorted(babynames["Name"].unique())[0:100]:
    counts_of_current_name = babynames[babynames["Name"] == name]["Count"]
    ammd_of_babyname_counts[name] = ammd(counts_of_current_name)
    
#convert to series
ammd_of_babyname_counts = pd.Series(ammd_of_babyname_counts) 
ammd_of_babyname_counts.head(5)
Out[35]:
Aadan        2
Aadarsh      0
Aaden      148
Aadhav       2
Aadhira      4
dtype: int64

Approach 2: Use groupby.agg

Instead, we can use the very powerful groupby.agg operation, which allows us to simply and efficiently compute what we want.

In [36]:
babynames.groupby("Name").agg(max).head(5)
Out[36]:
State Sex Year Count
Name
Aadan CA M 2014 7
Aadarsh CA M 2019 6
Aaden CA M 2019 158
Aadhav CA M 2019 8
Aadhira CA F 2018 10
In [37]:
babynames.head()
Out[37]:
State Sex Year Name Count
306522 CA M 1990 Louis 280
340179 CA M 2003 Edmund 14
60604 CA F 1969 Micheal 6
8678 CA F 1927 Faye 26
83654 CA F 1980 Antonia 61
In [38]:
%%time
babynames.groupby("Name").agg(ammd).head(5)
CPU times: user 1.42 s, sys: 23.3 ms, total: 1.44 s
Wall time: 1.45 s
Out[38]:
Year Count
Name
Aadan 6 2
Aadarsh 0 0
Aaden 12 148
Aadhav 5 2
Aadhira 1 4

Note that the result includes both a Year and Count column. The Count column is what we want, namely the AMMD for the name in that row. To check your understanding, try to figure out what the Year column represents.

To understand how groupby works, consider the visual diagram below. The groupby function clusters rows from the original dataframe into groups (which I call subframes). The agg function then condenses each subframe into a single representative row using the provided function f.

groupby_picture.png

Let's actually build this DataFrame and play with it interactively to understand groupby a little better:

In [39]:
d = pd.DataFrame(dict(x=[3,1,4,1,5,9,2,5,6], y=[12,7,3,2,7,3,8,6,7]), index=list('ABCABCACB') )
d
Out[39]:
x y
A 3 12
B 1 7
C 4 3
A 1 2
B 5 7
C 9 3
A 2 8
C 5 6
B 6 7
In [41]:
dgb = d.groupby(d.index)
dgb
Out[41]:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x10d73d8d0>
In [42]:
dgb.groups
Out[42]:
{'A': Index(['A', 'A', 'A'], dtype='object'),
 'B': Index(['B', 'B', 'B'], dtype='object'),
 'C': Index(['C', 'C', 'C'], dtype='object')}
In [43]:
dgb.groups['A']
Out[43]:
Index(['A', 'A', 'A'], dtype='object')
In [44]:
dgb.get_group('A')
Out[44]:
x y
A 3 12
A 1 2
A 2 8
In [45]:
dgb.agg(ammd)
Out[45]:
x y
A 2 10
B 5 0
C 5 3

Some Additional Groupby Puzzles

Groupby puzzle #1: To test your understanding, try to interpret the result of the code below.

In [46]:
babynames.head(5)
Out[46]:
State Sex Year Name Count
306522 CA M 1990 Louis 280
340179 CA M 2003 Edmund 14
60604 CA F 1969 Micheal 6
8678 CA F 1927 Faye 26
83654 CA F 1980 Antonia 61
In [47]:
babynames.groupby("Year").agg(ammd).plot();
In [48]:
for i, (k, v) in enumerate(babynames.groupby("Year")):
    if i >= 3: break
    print(f"Year: {k}")
    display(v.head())
Year: 1910
State Sex Year Name Count
106 CA F 1910 Alberta 14
64 CA F 1910 Pearl 26
225 CA F 1910 Margie 5
90 CA F 1910 Ramona 18
208 CA F 1910 Billie 5
Year: 1911
State Sex Year Name Count
299 CA F 1911 Ruby 27
287 CA F 1911 Jennie 32
427 CA F 1911 Tillie 7
228640 CA M 1911 Fred 50
249 CA F 1911 Marjorie 78
Year: 1912
State Sex Year Name Count
671 CA F 1912 Lila 10
592 CA F 1912 Bessie 23
739 CA F 1912 Erna 6
523 CA F 1912 Betty 60
508 CA F 1912 Hazel 85

For reference, the first 5 values from the plot above are:

In [49]:
babynames.groupby("Year").agg(ammd).head(5)
Out[49]:
Count
Year
1910 290
1911 385
1912 529
1913 609
1914 768

groupby Puzzle #2

In [50]:
elections = pd.read_csv("elections.csv")
elections.sample(5)
Out[50]:
Year Candidate Party Popular vote Result %
175 2016 Gary Johnson Libertarian 4489235 loss 3.307714
171 2012 Mitt Romney Republican 60933504 loss 47.384076
161 2004 Ralph Nader Independent 465151 loss 0.380663
73 1916 Frank Hanly Prohibition 221302 loss 1.197041
40 1884 James G. Blaine Republican 4856905 loss 48.312208

We have to be careful when using aggregation functions. For example, the code below might be misinterpreted to say that Woodrow Wilson ran for election in 2016. Why is this happening?

In [51]:
elections.groupby("Party").agg(max).head(10)
Out[51]:
Year Candidate Popular vote Result %
Party
American 1976 Thomas J. Anderson 873053 loss 21.554001
American Independent 1976 Lester Maddox 9901118 loss 13.571218
Anti-Masonic 1832 William Wirt 100715 loss 7.821583
Anti-Monopoly 1884 Benjamin Butler 134294 loss 1.335838
Citizens 1980 Barry Commoner 233052 loss 0.270182
Communist 1932 William Z. Foster 103307 loss 0.261069
Constitution 2016 Michael Peroutka 203091 loss 0.152398
Constitutional Union 1860 John Bell 590901 loss 12.639283
Democratic 2016 Woodrow Wilson 69498516 win 61.344703
Democratic-Republican 1824 John Quincy Adams 151271 win 57.210122

groupby puzzle #3

Inspired by above, try to predict the results of the groupby operation shown. The answer is below the image.

groupby_puzzle.png

The top ?? will be "hi", the second ?? will be "tx", and the third ?? will be "sd".

In [52]:
ds = pd.DataFrame(dict(x=[3,1,4,1,5,9,2,5,6], 
                      y=['ak', 'tx', 'fl', 'hi', 'mi', 'ak', 'ca', 'sd', 'nc']), 
                      index=list('ABCABCACB') )
ds
Out[52]:
x y
A 3 ak
B 1 tx
C 4 fl
A 1 hi
B 5 mi
C 9 ak
A 2 ca
C 5 sd
B 6 nc
In [53]:
ds.groupby(ds.index).agg(max)
Out[53]:
x y
A 3 hi
B 6 tx
C 9 sd

groupby puzzle #4

Next we'll write code that properly returns the best result by each party. That is, each row should show the Year, Candidate, Popular vote, Result, and % for the election in which that party saw its best result (rather than mixing values from different elections as in the example above). Here's what the first rows of the correct output should look like:

parties.png

In [54]:
elections_sorted_by_percent = elections.sort_values("%", ascending=False)
elections_sorted_by_percent.head(5)
Out[54]:
Year Candidate Party Popular vote Result %
114 1964 Lyndon Johnson Democratic 43127041 win 61.344703
91 1936 Franklin Roosevelt Democratic 27752648 win 60.978107
120 1972 Richard Nixon Republican 47168710 win 60.907806
79 1920 Warren Harding Republican 16144093 win 60.574501
133 1984 Ronald Reagan Republican 54455472 win 59.023326
In [55]:
#def get_first(s):
#    return s.iloc[0]
    
elections_sorted_by_percent = elections.sort_values("%", ascending=False)
elections_sorted_by_percent.groupby("Party").agg(lambda x : x.iloc[0]).head(10)
Out[55]:
Year Candidate Popular vote Result %
Party
American 1856 Millard Fillmore 873053 loss 21.554001
American Independent 1968 George Wallace 9901118 loss 13.571218
Anti-Masonic 1832 William Wirt 100715 loss 7.821583
Anti-Monopoly 1884 Benjamin Butler 134294 loss 1.335838
Citizens 1980 Barry Commoner 233052 loss 0.270182
Communist 1932 William Z. Foster 103307 loss 0.261069
Constitution 2008 Chuck Baldwin 199750 loss 0.152398
Constitutional Union 1860 John Bell 590901 loss 12.639283
Democratic 1964 Lyndon Johnson 43127041 win 61.344703
Democratic-Republican 1824 Andrew Jackson 151271 loss 57.210122

You'll soon discover that with Pandas' rich tool set, there's typically more than one way to get to the same answer. Each approach has different tradeoffs in readability, performance, memory consumption, complexity, and more. It will take some experience for you to develop a sense of which approach is better for each problem, but in general you should try to envision at least one alternative solution, especially if you find your current solution particularly convoluted or hard to read.

Here are a couple of other ways of obtaining the same result (in each case we only show the top part with head()). The first approach uses groupby but finds the location of the maximum value via the idxmax() method (look up its documentation!). We then index and sort by party to match the requested formatting:

In [56]:
elections.head()
Out[56]:
Year Candidate Party Popular vote Result %
0 1824 Andrew Jackson Democratic-Republican 151271 loss 57.210122
1 1824 John Quincy Adams Democratic-Republican 113142 win 42.789878
2 1828 Andrew Jackson Democratic 642806 win 56.203927
3 1828 John Quincy Adams National Republican 500897 loss 43.796073
4 1832 Andrew Jackson Democratic 702735 win 54.574789
In [57]:
elections.groupby('Party')['%'].idxmax()
Out[57]:
Party
American                  22
American Independent     115
Anti-Masonic               6
Anti-Monopoly             38
Citizens                 127
Communist                 89
Constitution             164
Constitutional Union      24
Democratic               114
Democratic-Republican      0
Dixiecrat                103
Farmer–Labor              78
Free Soil                 15
Green                    155
Greenback                 35
Independent              143
Liberal Republican        31
Libertarian              175
National Democratic       50
National Republican        3
National Union            27
Natural Law              148
New Alliance             136
Northern Democratic       26
Populist                  48
Progressive               68
Prohibition               49
Reform                   150
Republican               120
Socialist                 66
Southern Democratic       25
States' Rights           110
Taxpayers                147
Union                     93
Union Labor               42
Whig                      11
Name: %, dtype: int64
In [58]:
best_per_party = elections.loc[elections.groupby('Party')['%'].idxmax()]  # This is the computational part
best_per_party.set_index('Party').sort_index().head()  # This indexes by Party to match the formatting above
Out[58]:
Year Candidate Popular vote Result %
Party
American 1856 Millard Fillmore 873053 loss 21.554001
American Independent 1968 George Wallace 9901118 loss 13.571218
Anti-Masonic 1832 William Wirt 100715 loss 7.821583
Anti-Monopoly 1884 Benjamin Butler 134294 loss 1.335838
Citizens 1980 Barry Commoner 233052 loss 0.270182

And this one doesn't even use groupby! This approach instead uses the drop_duplicates method: after sorting by "%", we keep only the last occurrence of each party, which is that party's best performance. Again, the 2nd line is purely formatting:

In [59]:
best_per_party2 = elections.sort_values('%').drop_duplicates(['Party'], keep='last')
best_per_party2.set_index('Party').sort_index().head()  # Formatting
Out[59]:
Year Candidate Popular vote Result %
Party
American 1856 Millard Fillmore 873053 loss 21.554001
American Independent 1968 George Wallace 9901118 loss 13.571218
Anti-Masonic 1832 William Wirt 100715 loss 7.821583
Anti-Monopoly 1884 Benjamin Butler 134294 loss 1.335838
Citizens 1980 Barry Commoner 233052 loss 0.270182

Challenge: see if you can find yet another approach that still gives the same answer.
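
If you'd like something to check your own answer against, here is one more route, sketched on a made-up toy table with the same column names (so the real elections data isn't spoiled): sort descending by %, then keep each party's first row with groupby(...).head(1).

```python
import pandas as pd

# Made-up stand-in for the elections table.
toy = pd.DataFrame({
    "Party": ["A", "A", "B", "B"],
    "Candidate": ["w", "x", "y", "z"],
    "%": [40.0, 55.0, 30.0, 20.0],
})

# Sort descending by %, then keep each party's first (i.e., best) row.
best = (toy.sort_values("%", ascending=False)
           .groupby("Party")
           .head(1)
           .set_index("Party")
           .sort_index())
print(best)
```

Unlike agg, groupby(...).head(1) keeps entire rows intact, so there's no risk of mixing values from different elections.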


Other groupby Features

groupby.size()

In [60]:
elections.groupby("Party")
Out[60]:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x1a2d251ed0>
In [61]:
#size returns a Series giving the size of each group
elections.groupby("Party").size().head(15)
Out[61]:
Party
American                  2
American Independent      3
Anti-Masonic              1
Anti-Monopoly             1
Citizens                  1
Communist                 1
Constitution              3
Constitutional Union      1
Democratic               46
Democratic-Republican     2
Dixiecrat                 1
Farmer–Labor              1
Free Soil                 2
Green                     6
Greenback                 1
dtype: int64

groupby.filter()

In [62]:
# filter gives a copy of the original DataFrame where row r is included
# if its group obeys the given condition
#
# Note: Filtering is done per GROUP, not per ROW.
elections.groupby("Year").filter(lambda sf: sf["%"].max() < 45)
Out[62]:
Year Candidate Party Popular vote Result %
23 1860 Abraham Lincoln Republican 1855993 win 39.699408
24 1860 John Bell Constitutional Union 590901 loss 12.639283
25 1860 John C. Breckinridge Southern Democratic 848019 loss 18.138998
26 1860 Stephen A. Douglas Northern Democratic 1380202 loss 29.522311
66 1912 Eugene V. Debs Socialist 901551 loss 6.004354
67 1912 Eugene W. Chafin Prohibition 208156 loss 1.386325
68 1912 Theodore Roosevelt Progressive 4122721 loss 27.457433
69 1912 William Taft Republican 3486242 loss 23.218466
70 1912 Woodrow Wilson Democratic 6296284 win 41.933422
115 1968 George Wallace American Independent 9901118 loss 13.571218
116 1968 Hubert Humphrey Democratic 31271839 loss 42.863537
117 1968 Richard Nixon Republican 31783783 win 43.565246
139 1992 Andre Marrou Libertarian 290087 loss 0.278516
140 1992 Bill Clinton Democratic 44909806 win 43.118485
141 1992 Bo Gritz Populist 106152 loss 0.101918
142 1992 George H. W. Bush Republican 39104550 loss 37.544784
143 1992 Ross Perot Independent 19743821 loss 18.956298
In [63]:
for i, (n, g) in enumerate(elections.groupby("Party")):
    print(n)
    display(g.head(2))
    if i>3: break
American
Year Candidate Party Popular vote Result %
22 1856 Millard Fillmore American 873053 loss 21.554001
126 1976 Thomas J. Anderson American 158271 loss 0.194862
American Independent
Year Candidate Party Popular vote Result %
115 1968 George Wallace American Independent 9901118 loss 13.571218
119 1972 John G. Schmitz American Independent 1100868 loss 1.421524
Anti-Masonic
Year Candidate Party Popular vote Result %
6 1832 William Wirt Anti-Masonic 100715 loss 7.821583
Anti-Monopoly
Year Candidate Party Popular vote Result %
38 1884 Benjamin Butler Anti-Monopoly 134294 loss 1.335838
Citizens
Year Candidate Party Popular vote Result %
127 1980 Barry Commoner Citizens 233052 loss 0.270182

groupby.sum(), groupby.mean(), etc.

As an alternative to groupby.agg(sum), we can also simply do groupby.sum().

In [64]:
elections.groupby("Year").agg(sum).head()
Out[64]:
Popular vote %
Year
1824 264413 100.0
1828 1143703 100.0
1832 1287655 100.0
1836 1460216 100.0
1840 2404437 100.0
In [65]:
elections.groupby("Year").sum().head()
Out[65]:
Popular vote %
Year
1824 264413 100.0
1828 1143703 100.0
1832 1287655 100.0
1836 1460216 100.0
1840 2404437 100.0

The same applies for many other common operations.

In [66]:
elections.groupby("Year").agg(max).head()
Out[66]:
Candidate Party Popular vote Result %
Year
1824 John Quincy Adams Democratic-Republican 151271 win 57.210122
1828 John Quincy Adams National Republican 642806 win 56.203927
1832 William Wirt National Republican 702735 win 54.574789
1836 William Henry Harrison Whig 763291 win 52.272472
1840 William Henry Harrison Whig 1275583 win 53.051213
In [67]:
elections.groupby("Year").max().head()
Out[67]:
Candidate Party Popular vote Result %
Year
1824 John Quincy Adams Democratic-Republican 151271 win 57.210122
1828 John Quincy Adams National Republican 642806 win 56.203927
1832 William Wirt National Republican 702735 win 54.574789
1836 William Henry Harrison Whig 763291 win 52.272472
1840 William Henry Harrison Whig 1275583 win 53.051213
In [68]:
#elections.groupby("Year").mean().head()
#elections.groupby("Year").median().head()
elections.groupby("Year").max().head()
Out[68]:
Candidate Party Popular vote Result %
Year
1824 John Quincy Adams Democratic-Republican 151271 win 57.210122
1828 John Quincy Adams National Republican 642806 win 56.203927
1832 William Wirt National Republican 702735 win 54.574789
1836 William Henry Harrison Whig 763291 win 52.272472
1840 William Henry Harrison Whig 1275583 win 53.051213

Pivot Tables

It is possible to group a DataFrame by multiple features. For example, if we group by Year and Sex we get back a DataFrame with the total number of babies of each sex born in each year.

In [69]:
babynames.groupby(["Year", "Sex"]).agg(sum).head(6)
Out[69]:
Count
Year Sex
1910 F 5950
M 3213
1911 F 6602
M 3381
1912 F 9803
M 8142
In [70]:
babynames.groupby(["Sex", "Year"]).agg(sum).head(6)
Out[70]:
Count
Sex Year
F 1910 5950
1911 6602
1912 9803
1913 11860
1914 13815
1915 18643
In [71]:
babynames.head(10)
Out[71]:
State Sex Year Name Count
306522 CA M 1990 Louis 280
340179 CA M 2003 Edmund 14
60604 CA F 1969 Micheal 6
8678 CA F 1927 Faye 26
83654 CA F 1980 Antonia 61
220497 CA F 2017 Kayliana 6
364174 CA M 2011 Townes 6
18535 CA F 1941 Robert 7
107131 CA F 1988 Brian 15
248335 CA M 1947 Santos 14
In [72]:
babynames.groupby(["Count", "Year", "Sex"]).agg(sum).head(6)
Out[72]:
State Name
Count Year Sex
5 1910 F CACACACACACACACACACACACACACACACACACACACACACACA... MargieBillieTillieFayeMercedesEveRenaIlaJacque...
M CACACACACACACACACACACACACACACACACA LelandTomGuadalupeFloydAlexOliverStephenBenjam...
1911 F CACACACACACACACACACACACACACACACACACACACACACACA... LeolaJulietCorneliaBeverlyLeilaMamieReginaSueS...
M CACACACACACACACACACACACACACACACACACA MarvinAlanMerleOliverRogerRonaldHomerPhillipLo...
1912 F CACACACACACACACACACACACACACACACACACACACACACACA... MargretBillieMaudeLeolaTeresaChristinaFerneBer...
M CACACACACACACACACACACACACACACACACACACACACACACA... ClintonErwinPatrickNoelMyronMortonBeverlyBobEm...

The DataFrame resulting from the aggregation operation is now multi-indexed. That is, it has more than one dimension to its index. We will explore this in next week's exercises.
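
As a preview, rows of a multi-indexed DataFrame can be selected by passing a tuple of labels to .loc. A minimal sketch on a toy frame shaped like the Year/Sex result above (the counts are made up):

```python
import pandas as pd

# Toy data shaped like babynames, with made-up counts.
df = pd.DataFrame({
    "Year": [1910, 1910, 1911, 1911],
    "Sex": ["F", "M", "F", "M"],
    "Count": [5, 3, 6, 4],
})

# Grouping by two columns yields a MultiIndex on (Year, Sex).
totals = df.groupby(["Year", "Sex"]).sum()

# Select a single (Year, Sex) combination with a tuple.
f_1910 = totals.loc[(1910, "F"), "Count"]
```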

Goal 5: Find the number of babies of each sex born in each year.

Suppose we want to build a table showing the total number of babies born of each sex in each year. One way is to groupby using both columns of interest.

In [73]:
babynames.groupby(["Year", "Sex"]).agg(sum).head(6)
Out[73]:
Count
Year Sex
1910 F 5950
M 3213
1911 F 6602
M 3381
1912 F 9803
M 8142

A more natural approach is to use a pivot table (like we saw in Data 8).

In [74]:
babynames.head(5)
Out[74]:
State Sex Year Name Count
306522 CA M 1990 Louis 280
340179 CA M 2003 Edmund 14
60604 CA F 1969 Micheal 6
8678 CA F 1927 Faye 26
83654 CA F 1980 Antonia 61
In [75]:
babynames_pivot = babynames.pivot_table(
    index='Year', # the rows (turned into index)
    columns='Sex', # the column values
    values=['Count', 'Name'], # the field(s) to be processed in each group
    aggfunc=np.max, # group operation
)
babynames_pivot.head(6)
Out[75]:
Count Name
Sex F M F M
Year
1910 295 237 Yvonne William
1911 390 214 Zelma Willis
1912 534 501 Yvonne Woodrow
1913 584 614 Zelma Yoshio
1914 773 769 Zelma Yoshio
1915 998 1033 Zita Yukio

The basic idea behind pivot tables is shown in the image below.

pivot_picture.png
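
In fact, a pivot table is essentially a groupby on both keys followed by unstack, which moves one index level into the columns. A sketch on made-up data:

```python
import pandas as pd

# Made-up counts; same shape of problem as babynames above.
df = pd.DataFrame({
    "Year": [1910, 1910, 1911, 1911],
    "Sex": ["F", "M", "F", "M"],
    "Count": [7, 2, 9, 4],
})

via_pivot = df.pivot_table(index="Year", columns="Sex", values="Count", aggfunc="sum")
via_groupby = df.groupby(["Year", "Sex"])["Count"].sum().unstack("Sex")

# Both produce the same Year-by-Sex table.
```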

Extra Groupby Puzzle

In goal 1, we didn't take into account the unlikely possibility that the most popular name was actually spread across both birth sexes. For example, suppose that 300 female Noahs had also been born in CA in 2018. In that case, Noah would actually have been the most popular name overall.

Since our queries are getting pretty long, I've stuck them inside parentheses, which allows us to spread them over many lines.

In [76]:
(
babynames[babynames["Year"] == 2018]
    .sort_values(by = "Count", ascending = False)
    .head(5)
)
Out[76]:
State Sex Year Name Count
221154 CA F 2018 Emma 2741
382082 CA M 2018 Noah 2567
221155 CA F 2018 Mia 2497
221156 CA F 2018 Olivia 2464
382083 CA M 2018 Liam 2411

Try to add a single line to the operation above so that each row represents the sum of male and female babies born in 2018 with that name. To do this, fill in the ??? below.

In [77]:
(
babynames[babynames["Year"] == 2018]
    #.???
    .sort_values(by = "Count", ascending = False)
    .head(5)
)
Out[77]:
State Sex Year Name Count
221154 CA F 2018 Emma 2741
382082 CA M 2018 Noah 2567
221155 CA F 2018 Mia 2497
221156 CA F 2018 Olivia 2464
382083 CA M 2018 Liam 2411
In [78]:
def ammd2(L):
    return max(L) - min(L)

print(ammd2([1, 2, 3, 4, 5]))
4
In [79]:
ammd2 = lambda L: max(L) - min(L)
print(ammd2([1, 2, 3, 4, 5]))
4