Introduction to Pandas, Part 3¶

Advanced aggregation and joining.

In [1]:
import numpy as np
import pandas as pd
import plotly.express as px

Some Additional Groupby Puzzles¶

groupby Puzzle #1¶

In [2]:
# Let's read the elections dataset
elections = pd.read_csv("data/elections.csv")
elections.sample(5)
Out[2]:
Year Candidate Party Popular vote Result %
144 1996 Bill Clinton Democratic 47400125 win 49.296938
152 2000 George W. Bush Republican 50456002 win 47.974666
9 1836 William Henry Harrison Whig 550816 loss 37.721543
35 1880 James B. Weaver Greenback 308649 loss 3.352344
172 2016 Darrell Castle Constitution 203091 loss 0.149640

We have to be careful when using aggregation functions. For example, the code below might be misinterpreted to say that Woodrow Wilson successfully ran for election in 2020. Why is this happening?

In [3]:
elections.groupby("Party").agg(max).head(10)
Out[3]:
Year Candidate Popular vote Result %
Party
American 1976 Thomas J. Anderson 873053 loss 21.554001
American Independent 1976 Lester Maddox 9901118 loss 13.571218
Anti-Masonic 1832 William Wirt 100715 loss 7.821583
Anti-Monopoly 1884 Benjamin Butler 134294 loss 1.335838
Citizens 1980 Barry Commoner 233052 loss 0.270182
Communist 1932 William Z. Foster 103307 loss 0.261069
Constitution 2016 Michael Peroutka 203091 loss 0.152398
Constitutional Union 1860 John Bell 590901 loss 12.639283
Democratic 2020 Woodrow Wilson 81268924 win 61.344703
Democratic-Republican 1824 John Quincy Adams 151271 win 57.210122

Quick Subpuzzle¶

Inspired by above, try to predict the results of the groupby operation shown. The answer is below the image.

groupby_puzzle.png

The top ?? will be "hi", the second ?? will be "tx", and the third ?? will be "sd".

In [4]:
ds = pd.DataFrame(dict(x=[3,1,4,1,5,9,2,5,6], 
                      y=['ak', 'tx', 'fl', 'hi', 'mi', 'ak', 'ca', 'sd', 'nc']), 
                      index=list('ABCABCACB') )
ds
Out[4]:
x y
A 3 ak
B 1 tx
C 4 fl
A 1 hi
B 5 mi
C 9 ak
A 2 ca
C 5 sd
B 6 nc
In [5]:
ds.groupby(ds.index).agg(max)
Out[5]:
x y
A 3 hi
B 6 tx
C 9 sd

Completing groupby puzzle #1¶

Next we'll write code that properly returns the best result by each party. That is, each row should show the Year, Candidate, Popular Vote, Result, and % for the election in which that party saw its best results (rather than mixing them as in the example above), here's what the first rows of the correct output should look like:

parties.png

In [6]:
elections_sorted_by_percent = elections.sort_values("%", ascending=False)
elections_sorted_by_percent.head(5)
Out[6]:
Year Candidate Party Popular vote Result %
114 1964 Lyndon Johnson Democratic 43127041 win 61.344703
91 1936 Franklin Roosevelt Democratic 27752648 win 60.978107
120 1972 Richard Nixon Republican 47168710 win 60.907806
79 1920 Warren Harding Republican 16144093 win 60.574501
133 1984 Ronald Reagan Republican 54455472 win 59.023326
In [7]:
elections_sorted_by_percent = elections.sort_values("%", ascending=False)
elections_sorted_by_percent.groupby("Party").agg(lambda x : x.iloc[0]).head(9)
Out[7]:
Year Candidate Popular vote Result %
Party
American 1856 Millard Fillmore 873053 loss 21.554001
American Independent 1968 George Wallace 9901118 loss 13.571218
Anti-Masonic 1832 William Wirt 100715 loss 7.821583
Anti-Monopoly 1884 Benjamin Butler 134294 loss 1.335838
Citizens 1980 Barry Commoner 233052 loss 0.270182
Communist 1932 William Z. Foster 103307 loss 0.261069
Constitution 2008 Chuck Baldwin 199750 loss 0.152398
Constitutional Union 1860 John Bell 590901 loss 12.639283
Democratic 1964 Lyndon Johnson 43127041 win 61.344703

Grabbing the first entry is common enough that groupby has a first built-in method, which in this case will make the code a bit clearer:

In [8]:
elections_sorted_by_percent = elections.sort_values("%", ascending=False)
elections_sorted_by_percent.groupby("Party").first().head(9)
Out[8]:
Year Candidate Popular vote Result %
Party
American 1856 Millard Fillmore 873053 loss 21.554001
American Independent 1968 George Wallace 9901118 loss 13.571218
Anti-Masonic 1832 William Wirt 100715 loss 7.821583
Anti-Monopoly 1884 Benjamin Butler 134294 loss 1.335838
Citizens 1980 Barry Commoner 233052 loss 0.270182
Communist 1932 William Z. Foster 103307 loss 0.261069
Constitution 2008 Chuck Baldwin 199750 loss 0.152398
Constitutional Union 1860 John Bell 590901 loss 12.639283
Democratic 1964 Lyndon Johnson 43127041 win 61.344703

You'll soon discover that with Pandas' rich tool set, there's typically more than one way to get to the same answer. Each approach has different tradeoffs in terms of readability, performance, memory consumption, complexity and more. It will take some experience for you to develop a sense of which approach is better for each problem, but you should in general try to think if you can at least envision a different solution to a given problem, especially if you find your current solution to be particularly convoluted or hard to read.

Here's a couple of other ways of obtaining the same result (in each case we only show the top part with head()). The first approach uses groupby but finds the location of the maximum value via the idxmax() method (look up its documentation!). We then index and sort by party to match the requested formatting:

In [9]:
elections.head()
Out[9]:
Year Candidate Party Popular vote Result %
0 1824 Andrew Jackson Democratic-Republican 151271 loss 57.210122
1 1824 John Quincy Adams Democratic-Republican 113142 win 42.789878
2 1828 Andrew Jackson Democratic 642806 win 56.203927
3 1828 John Quincy Adams National Republican 500897 loss 43.796073
4 1832 Andrew Jackson Democratic 702735 win 54.574789
In [10]:
elections.groupby("Party")["%"].idxmax()
Out[10]:
Party
American                  22
American Independent     115
Anti-Masonic               6
Anti-Monopoly             38
Citizens                 127
Communist                 89
Constitution             164
Constitutional Union      24
Democratic               114
Democratic-Republican      0
Dixiecrat                103
Farmer–Labor              78
Free Soil                 15
Green                    155
Greenback                 35
Independent              143
Liberal Republican        31
Libertarian              175
National Democratic       50
National Republican        3
National Union            27
Natural Law              148
New Alliance             136
Northern Democratic       26
Populist                  48
Progressive               68
Prohibition               49
Reform                   150
Republican               120
Socialist                 66
Southern Democratic       25
States' Rights           110
Taxpayers                147
Union                     93
Union Labor               42
Whig                      11
Name: %, dtype: int64
In [11]:
# This is the computational part
best_per_party = elections.loc[elections.groupby("Party")["%"].idxmax()]

# This indexes by Party to match the formatting above
best_per_party.set_index('Party').sort_index().head() 
Out[11]:
Year Candidate Popular vote Result %
Party
American 1856 Millard Fillmore 873053 loss 21.554001
American Independent 1968 George Wallace 9901118 loss 13.571218
Anti-Masonic 1832 William Wirt 100715 loss 7.821583
Anti-Monopoly 1884 Benjamin Butler 134294 loss 1.335838
Citizens 1980 Barry Commoner 233052 loss 0.270182

And this one doesn't even use groupby! This approach instead uses the drop_duplicates method to keep only the last occurrence of of each party after having sorted by "%", which is the best performance. Again, the 2nd line is purely formatting:

In [12]:
best_per_party2 = elections.sort_values("%").drop_duplicates(["Party"], keep="last")
best_per_party2.set_index("Party").sort_index().head()  # Formatting
Out[12]:
Year Candidate Popular vote Result %
Party
American 1856 Millard Fillmore 873053 loss 21.554001
American Independent 1968 George Wallace 9901118 loss 13.571218
Anti-Masonic 1832 William Wirt 100715 loss 7.821583
Anti-Monopoly 1884 Benjamin Butler 134294 loss 1.335838
Citizens 1980 Barry Commoner 233052 loss 0.270182

Challenge: see if you can find yet another approach that still gives the same answer.


Other DataFrameGroupBy Features¶

The result of groupby is not a DataFrame or a list of DataFrames. It is instead a special type called a DataFrameGroupBy.

In [13]:
elections.head()
Out[13]:
Year Candidate Party Popular vote Result %
0 1824 Andrew Jackson Democratic-Republican 151271 loss 57.210122
1 1824 John Quincy Adams Democratic-Republican 113142 win 42.789878
2 1828 Andrew Jackson Democratic 642806 win 56.203927
3 1828 John Quincy Adams National Republican 500897 loss 43.796073
4 1832 Andrew Jackson Democratic 702735 win 54.574789
In [14]:
grouped_by_year = elections.groupby("Year")
type(grouped_by_year)
Out[14]:
pandas.core.groupby.generic.DataFrameGroupBy
In [15]:
grouped_by_year.groups.keys()
Out[15]:
dict_keys([1824, 1828, 1832, 1836, 1840, 1844, 1848, 1852, 1856, 1860, 1864, 1868, 1872, 1876, 1880, 1884, 1888, 1892, 1896, 1900, 1904, 1908, 1912, 1916, 1920, 1924, 1928, 1932, 1936, 1940, 1944, 1948, 1952, 1956, 1960, 1964, 1968, 1972, 1976, 1980, 1984, 1988, 1992, 1996, 2000, 2004, 2008, 2012, 2016, 2020])
In [16]:
grouped_by_year.groups[2020]
Out[16]:
Int64Index([178, 179, 180, 181], dtype='int64')
In [17]:
grouped_by_year.get_group(2020)
Out[17]:
Year Candidate Party Popular vote Result %
178 2020 Joseph Biden Democratic 81268924 win 51.311515
179 2020 Donald Trump Republican 74216154 loss 46.858542
180 2020 Jo Jorgensen Libertarian 1865724 loss 1.177979
181 2020 Howard Hawkins Green 405035 loss 0.255731

groupby.size()¶

In [18]:
elections.groupby("Party")
Out[18]:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7ff1b381ceb0>
In [19]:
# size returns a Series giving the size of each group
elections.groupby("Party").size().head(15)
Out[19]:
Party
American                  2
American Independent      3
Anti-Masonic              1
Anti-Monopoly             1
Citizens                  1
Communist                 1
Constitution              3
Constitutional Union      1
Democratic               47
Democratic-Republican     2
Dixiecrat                 1
Farmer–Labor              1
Free Soil                 2
Green                     7
Greenback                 1
dtype: int64

groupby.filter()¶

In [20]:
# filter gives a copy of the original DataFrame where row r is included
# if its group obeys the given condition
#
# Note: Filtering is done per GROUP, not per ROW.
elections.groupby("Year").filter(lambda sf: sf["%"].max() < 45).set_index('Year').sort_index()
Out[20]:
Candidate Party Popular vote Result %
Year
1860 Abraham Lincoln Republican 1855993 win 39.699408
1860 John Bell Constitutional Union 590901 loss 12.639283
1860 John C. Breckinridge Southern Democratic 848019 loss 18.138998
1860 Stephen A. Douglas Northern Democratic 1380202 loss 29.522311
1912 Eugene V. Debs Socialist 901551 loss 6.004354
1912 Eugene W. Chafin Prohibition 208156 loss 1.386325
1912 Theodore Roosevelt Progressive 4122721 loss 27.457433
1912 William Taft Republican 3486242 loss 23.218466
1912 Woodrow Wilson Democratic 6296284 win 41.933422
1968 George Wallace American Independent 9901118 loss 13.571218
1968 Hubert Humphrey Democratic 31271839 loss 42.863537
1968 Richard Nixon Republican 31783783 win 43.565246
1992 Andre Marrou Libertarian 290087 loss 0.278516
1992 Bill Clinton Democratic 44909806 win 43.118485
1992 Bo Gritz Populist 106152 loss 0.101918
1992 George H. W. Bush Republican 39104550 loss 37.544784
1992 Ross Perot Independent 19743821 loss 18.956298
In [21]:
# The code below lets us peek into the groups and see why they were rejected or not
for i, (n, g) in enumerate(elections.groupby("Party")):
    print(n)
    display(g)
    if i>3: break
American
Year Candidate Party Popular vote Result %
22 1856 Millard Fillmore American 873053 loss 21.554001
126 1976 Thomas J. Anderson American 158271 loss 0.194862
American Independent
Year Candidate Party Popular vote Result %
115 1968 George Wallace American Independent 9901118 loss 13.571218
119 1972 John G. Schmitz American Independent 1100868 loss 1.421524
124 1976 Lester Maddox American Independent 170274 loss 0.209640
Anti-Masonic
Year Candidate Party Popular vote Result %
6 1832 William Wirt Anti-Masonic 100715 loss 7.821583
Anti-Monopoly
Year Candidate Party Popular vote Result %
38 1884 Benjamin Butler Anti-Monopoly 134294 loss 1.335838
Citizens
Year Candidate Party Popular vote Result %
127 1980 Barry Commoner Citizens 233052 loss 0.270182

Puzzle 2: Finding the number of babies born in each year of each sex.¶

Earlier we saw how to add up the total number of babies born in each year, but what if we want the number born in each and year for each sex separately?

Dataset - US Babynames¶

Let's load the baby names dataset again.

In [22]:
import urllib.request
import os.path

# Download data from the web directly
data_url = "https://www.ssa.gov/oact/babynames/names.zip"
local_filename = "data/babynames.zip"
if not os.path.exists(local_filename): # if the data exists don't download again
    with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
        f.write(resp.read())

        
# Load data without unzipping the file
import zipfile
babynames = [] 
with zipfile.ZipFile(local_filename, "r") as zf:
    data_files = [f for f in zf.filelist if f.filename[-3:] == "txt"]
    def extract_year_from_filename(fn):
        return int(fn[3:7])
    for f in data_files:
        year = extract_year_from_filename(f.filename)
        with zf.open(f) as fp:
            df = pd.read_csv(fp, names=["Name", "Sex", "Count"])
            df["Year"] = year
            babynames.append(df)
babynames = pd.concat(babynames)


babynames.head() # show the first few rows
Out[22]:
Name Sex Count Year
0 Mary F 7065 1880
1 Anna F 2604 1880
2 Emma F 2003 1880
3 Elizabeth F 1939 1880
4 Minnie F 1746 1880

With groupby¶

In [23]:
babynames.groupby("Year").sum()
Out[23]:
Count
Year
1880 201484
1881 192690
1882 221533
1883 216944
1884 243461
... ...
2017 3570234
2018 3508655
2019 3461444
2020 3327419
2021 3361375

142 rows × 1 columns

It is possible to group a DataFrame by multiple features. For example, if we group by Year and Sex we get back a DataFrame with the total number of babies of each sex born in each year.

In [24]:
babynames.head(10)
Out[24]:
Name Sex Count Year
0 Mary F 7065 1880
1 Anna F 2604 1880
2 Emma F 2003 1880
3 Elizabeth F 1939 1880
4 Minnie F 1746 1880
5 Margaret F 1578 1880
6 Ida F 1472 1880
7 Alice F 1414 1880
8 Bertha F 1320 1880
9 Sarah F 1288 1880
In [25]:
babynames.groupby(["Year", "Sex"]).agg(sum).head(6)
Out[25]:
Count
Year Sex
1880 F 90994
M 110490
1881 F 91953
M 100737
1882 F 107847
M 113686
In [26]:
babynames.groupby(["Sex", "Year"]).agg(sum).head(6)
Out[26]:
Count
Sex Year
F 1880 90994
1881 91953
1882 107847
1883 112319
1884 129019
1885 133055

The DataFrame resulting from an aggregation operation on a table grouped by a list of columns is multi-indexed. That is, it has more than one dimension to its index. We will explore this in a future lecture.

A more natural approach is to use a pivot table (like we saw in data 8).

In [27]:
babynames.head(5)
Out[27]:
Name Sex Count Year
0 Mary F 7065 1880
1 Anna F 2604 1880
2 Emma F 2003 1880
3 Elizabeth F 1939 1880
4 Minnie F 1746 1880

With Pivot Tables¶

In [28]:
babynames_pivot = babynames.pivot_table(
    index="Year",     # the rows (turned into index)
    columns="Sex",    # the column values
    values=["Count"], # the field(s) to processed in each group
    aggfunc=np.sum,   # group operation
)
babynames_pivot.head(6)
Out[28]:
Count
Sex F M
Year
1880 90994 110490
1881 91953 100737
1882 107847 113686
1883 112319 104625
1884 129019 114442
1885 133055 107799
In [29]:
babynames_pivot = babynames.pivot_table(
    index="Year",     # the rows (turned into index)
    columns="Sex",    # the column values
    values=["Count", "Name"], 
    aggfunc=np.max,   # group operation
)
babynames_pivot.head(6)
Out[29]:
Count Name
Sex F M F M
Year
1880 7065 9655 Zula Zeke
1881 6919 8769 Zula Zeb
1882 8148 9557 Zula Zed
1883 8012 8894 Zula Zeno
1884 9217 9388 Zula Zollie
1885 9128 8756 Zula Zollie

The basic idea behind pivot tables is shown in the image below.

pivot_picture.png


Merging Tables¶

In [30]:
elections
Out[30]:
Year Candidate Party Popular vote Result %
0 1824 Andrew Jackson Democratic-Republican 151271 loss 57.210122
1 1824 John Quincy Adams Democratic-Republican 113142 win 42.789878
2 1828 Andrew Jackson Democratic 642806 win 56.203927
3 1828 John Quincy Adams National Republican 500897 loss 43.796073
4 1832 Andrew Jackson Democratic 702735 win 54.574789
... ... ... ... ... ... ...
177 2016 Jill Stein Green 1457226 loss 1.073699
178 2020 Joseph Biden Democratic 81268924 win 51.311515
179 2020 Donald Trump Republican 74216154 loss 46.858542
180 2020 Jo Jorgensen Libertarian 1865724 loss 1.177979
181 2020 Howard Hawkins Green 405035 loss 0.255731

182 rows × 6 columns

In [31]:
babynames_2020 = babynames[babynames["Year"] == 2020]
babynames_2020
Out[31]:
Name Sex Count Year
0 Olivia F 17641 2020
1 Emma F 15656 2020
2 Ava F 13160 2020
3 Charlotte F 13065 2020
4 Sophia F 13036 2020
... ... ... ... ...
31448 Zykell M 5 2020
31449 Zylus M 5 2020
31450 Zymari M 5 2020
31451 Zyn M 5 2020
31452 Zyran M 5 2020

31453 rows × 4 columns

In [32]:
elections["First Name"] = elections["Candidate"].str.split().str[0]
In [33]:
elections
Out[33]:
Year Candidate Party Popular vote Result % First Name
0 1824 Andrew Jackson Democratic-Republican 151271 loss 57.210122 Andrew
1 1824 John Quincy Adams Democratic-Republican 113142 win 42.789878 John
2 1828 Andrew Jackson Democratic 642806 win 56.203927 Andrew
3 1828 John Quincy Adams National Republican 500897 loss 43.796073 John
4 1832 Andrew Jackson Democratic 702735 win 54.574789 Andrew
... ... ... ... ... ... ... ...
177 2016 Jill Stein Green 1457226 loss 1.073699 Jill
178 2020 Joseph Biden Democratic 81268924 win 51.311515 Joseph
179 2020 Donald Trump Republican 74216154 loss 46.858542 Donald
180 2020 Jo Jorgensen Libertarian 1865724 loss 1.177979 Jo
181 2020 Howard Hawkins Green 405035 loss 0.255731 Howard

182 rows × 7 columns

In [34]:
merged = pd.merge(left = elections, right = babynames_2020, 
                  left_on = "First Name", right_on = "Name")
merged
Out[34]:
Year_x Candidate Party Popular vote Result % First Name Name Sex Count Year_y
0 1824 Andrew Jackson Democratic-Republican 151271 loss 57.210122 Andrew Andrew F 12 2020
1 1824 Andrew Jackson Democratic-Republican 151271 loss 57.210122 Andrew Andrew M 6036 2020
2 1828 Andrew Jackson Democratic 642806 win 56.203927 Andrew Andrew F 12 2020
3 1828 Andrew Jackson Democratic 642806 win 56.203927 Andrew Andrew M 6036 2020
4 1832 Andrew Jackson Democratic 702735 win 54.574789 Andrew Andrew F 12 2020
... ... ... ... ... ... ... ... ... ... ... ...
253 2016 Hillary Clinton Democratic 65853514 loss 48.521539 Hillary Hillary F 52 2020
254 2020 Joseph Biden Democratic 81268924 win 51.311515 Joseph Joseph F 9 2020
255 2020 Joseph Biden Democratic 81268924 win 51.311515 Joseph Joseph M 8430 2020
256 2020 Jo Jorgensen Libertarian 1865724 loss 1.177979 Jo Jo F 35 2020
257 2020 Jo Jorgensen Libertarian 1865724 loss 1.177979 Jo Jo M 8 2020

258 rows × 11 columns

In [35]:
merged.sort_values("Count")
Out[35]:
Year_x Candidate Party Popular vote Result % First Name Name Sex Count Year_y
242 2008 Chuck Baldwin Constitution 199750 loss 0.152398 Chuck Chuck M 5 2020
136 1888 Alson Streeter Union Labor 146602 loss 1.288861 Alson Alson M 5 2020
102 1856 Millard Fillmore American 873053 loss 21.554001 Millard Millard M 5 2020
170 1920 Parley P. Christensen Farmer–Labor 265398 loss 0.995804 Parley Parley M 6 2020
190 1952 Adlai Stevenson Democratic 27375090 loss 44.446312 Adlai Adlai F 6 2020
... ... ... ... ... ... ... ... ... ... ... ...
67 1932 William Z. Foster Communist 103307 loss 0.261069 William William M 12613 2020
65 1912 William Taft Republican 3486242 loss 23.218466 William William M 12613 2020
63 1908 William Taft Republican 7678335 win 52.013300 William William M 12613 2020
61 1908 William Jennings Bryan Democratic 6408979 loss 43.414640 William William M 12613 2020
47 1832 William Wirt Anti-Masonic 100715 loss 7.821583 William William M 12613 2020

258 rows × 11 columns

In [ ]: