Pandas, Part III¶

Adapted from Josh Hug and Narges Norouzi

Updated by Dominic Liu

A demo of advanced aggregation and joining to accompany Lecture 4.

In [1]:
import numpy as np
import pandas as pd
import plotly.express as px

More on GroupBy¶

Quick review question¶

Try to predict the results of the groupby operation shown. The answer is below the image.

groupby_puzzle.png

The top ?? will be "hi", the second ?? will be "tx", and the third ?? will be "sd".

In [2]:
ds = pd.DataFrame(dict(x=[3,1,4,1,5,9,2,5,6], 
                      y=['ak', 'tx', 'fl', 'hi', 'mi', 'ak', 'ca', 'sd', 'nc']), 
                      index=list('ABCABCACB') )
ds
Out[2]:
x y
A 3 ak
B 1 tx
C 4 fl
A 1 hi
B 5 mi
C 9 ak
A 2 ca
C 5 sd
B 6 nc
In [3]:
ds.groupby(ds.index).agg(max)
Out[3]:
x y
A 3 hi
B 6 tx
C 9 sd

groupby.size and groupby.count()¶

In [4]:
df = pd.DataFrame({'letter':['A','A','B','C','C','C'], 
                   'num':[1,2,3,4,np.NaN,4], 
                   'state':[np.NaN, 'tx', 'fl', 'hi', np.NaN, 'ak']})
df
Out[4]:
letter num state
0 A 1.0 NaN
1 A 2.0 tx
2 B 3.0 fl
3 C 4.0 hi
4 C NaN NaN
5 C 4.0 ak

groupby.size() returns a Series, indexed by the letters that we grouped by, with values denoting the number of rows in each group/sub-DataFrame. It does not care about missing (NaN) values.

In [5]:
df.groupby("letter").size()
Out[5]:
letter
A    2
B    1
C    3
dtype: int64

groupby.count() returns a DataFrame, indexed by the letters that we grouped by. Each column represents the number of non-missing values for that letter.

In [6]:
df.groupby("letter").count()
Out[6]:
num state
letter
A 2 1
B 1 1
C 2 2

You might recall value_counts() function we talked about last week. What's the difference?

In [7]:
df["letter"].value_counts()
Out[7]:
C    3
A    2
B    1
Name: letter, dtype: int64

Turns out value_counts() does something similar to groupby.size(), except that it also sorts the resulting Series in descending order.

Filtering by Group¶

In [8]:
# Let's read the elections dataset
elections = pd.read_csv("data/elections.csv")
elections.sample(5)
Out[8]:
Year Candidate Party Popular vote Result %
64 1908 William Jennings Bryan Democratic 6408979 loss 43.414640
90 1936 Alf Landon Republican 16679543 loss 36.648285
111 1960 John Kennedy Democratic 34220984 win 50.082561
18 1852 John P. Hale Free Soil 155210 loss 4.930283
42 1888 Alson Streeter Union Labor 146602 loss 1.288861

Let's keep only the elections years where the maximum vote share % is less than 45%.

In [9]:
elections.groupby("Year").filter(lambda sf: sf["%"].max() < 45).head(10)
Out[9]:
Year Candidate Party Popular vote Result %
23 1860 Abraham Lincoln Republican 1855993 win 39.699408
24 1860 John Bell Constitutional Union 590901 loss 12.639283
25 1860 John C. Breckinridge Southern Democratic 848019 loss 18.138998
26 1860 Stephen A. Douglas Northern Democratic 1380202 loss 29.522311
66 1912 Eugene V. Debs Socialist 901551 loss 6.004354
67 1912 Eugene W. Chafin Prohibition 208156 loss 1.386325
68 1912 Theodore Roosevelt Progressive 4122721 loss 27.457433
69 1912 William Taft Republican 3486242 loss 23.218466
70 1912 Woodrow Wilson Democratic 6296284 win 41.933422
115 1968 George Wallace American Independent 9901118 loss 13.571218

groupby Puzzle¶

Attempt #1¶

We have to be careful when using aggregation functions. For example, the code below might be misinterpreted to say that Woodrow Wilson successfully ran for election in 2020. Why is this happening?

In [10]:
elections.groupby("Party").agg(max).head(10)
Out[10]:
Year Candidate Popular vote Result %
Party
American 1976 Thomas J. Anderson 873053 loss 21.554001
American Independent 1976 Lester Maddox 9901118 loss 13.571218
Anti-Masonic 1832 William Wirt 100715 loss 7.821583
Anti-Monopoly 1884 Benjamin Butler 134294 loss 1.335838
Citizens 1980 Barry Commoner 233052 loss 0.270182
Communist 1932 William Z. Foster 103307 loss 0.261069
Constitution 2016 Michael Peroutka 203091 loss 0.152398
Constitutional Union 1860 John Bell 590901 loss 12.639283
Democratic 2020 Woodrow Wilson 81268924 win 61.344703
Democratic-Republican 1824 John Quincy Adams 151271 win 57.210122

Attempt #2¶

Next we'll write code that properly returns the best result by each party. That is, each row should show the Year, Candidate, Popular Vote, Result, and % for the election in which that party saw its best results (rather than mixing them as in the example above), here's what the first rows of the correct output should look like:

parties.png

In [11]:
elections_sorted_by_percent = elections.sort_values("%", ascending=False)
elections_sorted_by_percent.head(8)
Out[11]:
Year Candidate Party Popular vote Result %
114 1964 Lyndon Johnson Democratic 43127041 win 61.344703
91 1936 Franklin Roosevelt Democratic 27752648 win 60.978107
120 1972 Richard Nixon Republican 47168710 win 60.907806
79 1920 Warren Harding Republican 16144093 win 60.574501
133 1984 Ronald Reagan Republican 54455472 win 59.023326
84 1928 Herbert Hoover Republican 21427123 win 58.368524
86 1932 Franklin Roosevelt Democratic 22821277 win 57.672125
109 1956 Dwight Eisenhower Republican 35579180 win 57.650654
In [12]:
elections_sorted_by_percent.groupby("Party").first()
Out[12]:
Year Candidate Popular vote Result %
Party
American 1856 Millard Fillmore 873053 loss 21.554001
American Independent 1968 George Wallace 9901118 loss 13.571218
Anti-Masonic 1832 William Wirt 100715 loss 7.821583
Anti-Monopoly 1884 Benjamin Butler 134294 loss 1.335838
Citizens 1980 Barry Commoner 233052 loss 0.270182
Communist 1932 William Z. Foster 103307 loss 0.261069
Constitution 2008 Chuck Baldwin 199750 loss 0.152398
Constitutional Union 1860 John Bell 590901 loss 12.639283
Democratic 1964 Lyndon Johnson 43127041 win 61.344703
Democratic-Republican 1824 Andrew Jackson 151271 loss 57.210122
Dixiecrat 1948 Strom Thurmond 1175930 loss 2.412304
Farmer–Labor 1920 Parley P. Christensen 265398 loss 0.995804
Free Soil 1848 Martin Van Buren 291501 loss 10.138474
Green 2000 Ralph Nader 2882955 loss 2.741176
Greenback 1880 James B. Weaver 308649 loss 3.352344
Independent 1992 Ross Perot 19743821 loss 18.956298
Liberal Republican 1872 Horace Greeley 2834761 loss 44.071406
Libertarian 2016 Gary Johnson 4489235 loss 3.307714
National Democratic 1896 John M. Palmer 134645 loss 0.969566
National Republican 1828 John Quincy Adams 500897 loss 43.796073
National Union 1864 Abraham Lincoln 2211317 win 54.951512
Natural Law 1996 John Hagelin 113670 loss 0.118219
New Alliance 1988 Lenora Fulani 217221 loss 0.237804
Northern Democratic 1860 Stephen A. Douglas 1380202 loss 29.522311
Populist 1892 James B. Weaver 1041028 loss 8.645038
Progressive 1912 Theodore Roosevelt 4122721 loss 27.457433
Prohibition 1892 John Bidwell 270879 loss 2.249468
Reform 1996 Ross Perot 8085294 loss 8.408844
Republican 1972 Richard Nixon 47168710 win 60.907806
Socialist 1912 Eugene V. Debs 901551 loss 6.004354
Southern Democratic 1860 John C. Breckinridge 848019 loss 18.138998
States' Rights 1956 T. Coleman Andrews 107929 loss 0.174883
Taxpayers 1996 Howard Phillips 184656 loss 0.192045
Union 1936 William Lemke 892378 loss 1.960733
Union Labor 1888 Alson Streeter 146602 loss 1.288861
Whig 1840 William Henry Harrison 1275583 win 53.051213

Alternative solutions¶

You'll soon discover that with Pandas' rich tool set, there's typically more than one way to get to the same answer. Each approach has different tradeoffs in terms of readability, performance, memory consumption, complexity and more. It will take some experience for you to develop a sense of which approach is better for each problem, but you should in general try to think if you can at least envision a different solution to a given problem, especially if you find your current solution to be particularly convoluted or hard to read.

Here's a couple of other ways of obtaining the same result (in each case we only show the top part with head()). The first approach uses groupby but finds the location of the maximum value via the idxmax() method (look up its documentation!). We then index and sort by party to match the requested formatting:

In [13]:
elections.groupby("Party")["%"].idxmax()
Out[13]:
Party
American                  22
American Independent     115
Anti-Masonic               6
Anti-Monopoly             38
Citizens                 127
Communist                 89
Constitution             164
Constitutional Union      24
Democratic               114
Democratic-Republican      0
Dixiecrat                103
Farmer–Labor              78
Free Soil                 15
Green                    155
Greenback                 35
Independent              143
Liberal Republican        31
Libertarian              175
National Democratic       50
National Republican        3
National Union            27
Natural Law              148
New Alliance             136
Northern Democratic       26
Populist                  48
Progressive               68
Prohibition               49
Reform                   150
Republican               120
Socialist                 66
Southern Democratic       25
States' Rights           110
Taxpayers                147
Union                     93
Union Labor               42
Whig                      11
Name: %, dtype: int64
In [14]:
# This is the computational part
best_per_party = elections.loc[elections.groupby("Party")["%"].idxmax()]

# This indexes by Party to match the formatting above
best_per_party.set_index('Party').sort_index().head() 
Out[14]:
Year Candidate Popular vote Result %
Party
American 1856 Millard Fillmore 873053 loss 21.554001
American Independent 1968 George Wallace 9901118 loss 13.571218
Anti-Masonic 1832 William Wirt 100715 loss 7.821583
Anti-Monopoly 1884 Benjamin Butler 134294 loss 1.335838
Citizens 1980 Barry Commoner 233052 loss 0.270182

And this one doesn't even use groupby! This approach instead uses the drop_duplicates method to keep only the last occurrence of of each party after having sorted by "%", which is the best performance. Again, the 2nd line is purely formatting:

In [15]:
best_per_party2 = elections.sort_values("%").drop_duplicates(["Party"], keep="last")
best_per_party2.set_index("Party").sort_index().head()  # Formatting
Out[15]:
Year Candidate Popular vote Result %
Party
American 1856 Millard Fillmore 873053 loss 21.554001
American Independent 1968 George Wallace 9901118 loss 13.571218
Anti-Masonic 1832 William Wirt 100715 loss 7.821583
Anti-Monopoly 1884 Benjamin Butler 134294 loss 1.335838
Citizens 1980 Barry Commoner 233052 loss 0.270182

Challenge: see if you can find yet another approach that still gives the same answer.

DataFrameGroupBy Objects¶

The result of groupby is not a DataFrame or a list of DataFrames. It is instead a special type called a DataFrameGroupBy.

In [16]:
grouped_by_party = elections.groupby("Party")
type(grouped_by_party)
Out[16]:
pandas.core.groupby.generic.DataFrameGroupBy

GroupBy objects are structured like dictionaries. In fact, we can actually see the dictionaries with the following code:

In [17]:
grouped_by_party.groups
Out[17]:
{'American': [22, 126], 'American Independent': [115, 119, 124], 'Anti-Masonic': [6], 'Anti-Monopoly': [38], 'Citizens': [127], 'Communist': [89], 'Constitution': [160, 164, 172], 'Constitutional Union': [24], 'Democratic': [2, 4, 8, 10, 13, 14, 17, 20, 28, 29, 34, 37, 39, 45, 47, 52, 55, 57, 64, 70, 74, 77, 81, 83, 86, 91, 94, 97, 100, 105, 108, 111, 114, 116, 118, 123, 129, 134, 137, 140, 144, 151, 158, 162, 168, 176, 178], 'Democratic-Republican': [0, 1], 'Dixiecrat': [103], 'Farmer–Labor': [78], 'Free Soil': [15, 18], 'Green': [149, 155, 156, 165, 170, 177, 181], 'Greenback': [35], 'Independent': [121, 130, 143, 161, 167, 174], 'Liberal Republican': [31], 'Libertarian': [125, 128, 132, 138, 139, 146, 153, 159, 163, 169, 175, 180], 'National Democratic': [50], 'National Republican': [3, 5], 'National Union': [27], 'Natural Law': [148], 'New Alliance': [136], 'Northern Democratic': [26], 'Populist': [48, 61, 141], 'Progressive': [68, 82, 101, 107], 'Prohibition': [41, 44, 49, 51, 54, 59, 63, 67, 73, 75, 99], 'Reform': [150, 154], 'Republican': [21, 23, 30, 32, 33, 36, 40, 43, 46, 53, 56, 60, 65, 69, 72, 79, 80, 84, 87, 90, 96, 98, 104, 106, 109, 112, 113, 117, 120, 122, 131, 133, 135, 142, 145, 152, 157, 166, 171, 173, 179], 'Socialist': [58, 62, 66, 71, 76, 85, 88, 92, 95, 102], 'Southern Democratic': [25], 'States' Rights': [110], 'Taxpayers': [147], 'Union': [93], 'Union Labor': [42], 'Whig': [7, 9, 11, 12, 16, 19]}

The keys of the dictionary are the groups (in this case, Party), and the values are the indices of rows belonging to that group. We can access a particular sub-DataFrame using get_group:

In [18]:
grouped_by_party.get_group("Socialist")
Out[18]:
Year Candidate Party Popular vote Result %
58 1904 Eugene V. Debs Socialist 402810 loss 2.985897
62 1908 Eugene V. Debs Socialist 420852 loss 2.850866
66 1912 Eugene V. Debs Socialist 901551 loss 6.004354
71 1916 Allan L. Benson Socialist 590524 loss 3.194193
76 1920 Eugene V. Debs Socialist 913693 loss 3.428282
85 1928 Norman Thomas Socialist 267478 loss 0.728623
88 1932 Norman Thomas Socialist 884885 loss 2.236211
92 1936 Norman Thomas Socialist 187910 loss 0.412876
95 1940 Norman Thomas Socialist 116599 loss 0.234237
102 1948 Norman Thomas Socialist 139569 loss 0.286312

Pivot Tables¶

In [19]:
import urllib.request
import os.path

# Download data from the web directly
data_url = "https://www.ssa.gov/oact/babynames/names.zip"
local_filename = "data/babynames.zip"
if not os.path.exists(local_filename): # if the data exists don't download again
    with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
        f.write(resp.read())

        
# Load data without unzipping the file
import zipfile
babynames = [] 
with zipfile.ZipFile(local_filename, "r") as zf:
    data_files = [f for f in zf.filelist if f.filename[-3:] == "txt"]
    def extract_year_from_filename(fn):
        return int(fn[3:7])
    for f in data_files:
        year = extract_year_from_filename(f.filename)
        with zf.open(f) as fp:
            df = pd.read_csv(fp, names=["Name", "Sex", "Count"])
            df["Year"] = year
            babynames.append(df)
babynames = pd.concat(babynames)


babynames.head() # show the first few rows
Out[19]:
Name Sex Count Year
0 Mary F 7065 1880
1 Anna F 2604 1880
2 Emma F 2003 1880
3 Elizabeth F 1939 1880
4 Minnie F 1746 1880

Groupby with multiple columns¶

We want to build a table showing the total number of babies born of each sex in each year. One way is to groupby using both columns of interest:

In [20]:
babynames.groupby(["Year", "Sex"]).sum().head(6)
Out[20]:
Count
Year Sex
1880 F 90994
M 110490
1881 F 91953
M 100737
1882 F 107847
M 113686

pivot_table¶

In [21]:
babynames.pivot_table(
    index = "Year", 
    columns = "Sex", 
    values = "Count", 
    aggfunc = np.sum).head(6)
Out[21]:
Sex F M
Year
1880 90994 110490
1881 91953 100737
1882 107847 113686
1883 112320 104625
1884 129019 114442
1885 133055 107799

pivot_picture.png

pivot_table with multiple values¶

In [22]:
babynames.pivot_table(
    index = "Year", 
    columns = "Sex", 
    values = ["Count", "Name"], 
    aggfunc = np.max).head(6)
Out[22]:
Count Name
Sex F M F M
Year
1880 7065 9655 Zula Zeke
1881 6919 8769 Zula Zeb
1882 8148 9557 Zula Zed
1883 8012 8894 Zula Zeno
1884 9217 9388 Zula Zollie
1885 9128 8756 Zula Zollie

Join Tables¶

What if we want to know the popularity of presidential candidates' first names in 2020. What can we do?

In [23]:
elections.head(10)
Out[23]:
Year Candidate Party Popular vote Result %
0 1824 Andrew Jackson Democratic-Republican 151271 loss 57.210122
1 1824 John Quincy Adams Democratic-Republican 113142 win 42.789878
2 1828 Andrew Jackson Democratic 642806 win 56.203927
3 1828 John Quincy Adams National Republican 500897 loss 43.796073
4 1832 Andrew Jackson Democratic 702735 win 54.574789
5 1832 Henry Clay National Republican 484205 loss 37.603628
6 1832 William Wirt Anti-Masonic 100715 loss 7.821583
7 1836 Hugh Lawson White Whig 146109 loss 10.005985
8 1836 Martin Van Buren Democratic 763291 win 52.272472
9 1836 William Henry Harrison Whig 550816 loss 37.721543
In [24]:
babynames_2020 = babynames[babynames["Year"] == 2020]
babynames_2020.head(10)
Out[24]:
Name Sex Count Year
0 Olivia F 17664 2020
1 Emma F 15680 2020
2 Ava F 13179 2020
3 Charlotte F 13083 2020
4 Sophia F 13070 2020
5 Amelia F 12780 2020
6 Isabella F 12182 2020
7 Mia F 11243 2020
8 Evelyn F 9503 2020
9 Harper F 8834 2020
In [25]:
elections["First Name"] = elections["Candidate"].str.split(" ").str[0]
elections
Out[25]:
Year Candidate Party Popular vote Result % First Name
0 1824 Andrew Jackson Democratic-Republican 151271 loss 57.210122 Andrew
1 1824 John Quincy Adams Democratic-Republican 113142 win 42.789878 John
2 1828 Andrew Jackson Democratic 642806 win 56.203927 Andrew
3 1828 John Quincy Adams National Republican 500897 loss 43.796073 John
4 1832 Andrew Jackson Democratic 702735 win 54.574789 Andrew
... ... ... ... ... ... ... ...
177 2016 Jill Stein Green 1457226 loss 1.073699 Jill
178 2020 Joseph Biden Democratic 81268924 win 51.311515 Joseph
179 2020 Donald Trump Republican 74216154 loss 46.858542 Donald
180 2020 Jo Jorgensen Libertarian 1865724 loss 1.177979 Jo
181 2020 Howard Hawkins Green 405035 loss 0.255731 Howard

182 rows × 7 columns

Unlike in Data 8, the join function is called merge in pandas. join in pandas does something slightly different—we won't talk about it in this class.

In [26]:
merged = pd.merge(left = elections, right = babynames_2020, 
                  left_on = "First Name", right_on = "Name")
merged
Out[26]:
Year_x Candidate Party Popular vote Result % First Name Name Sex Count Year_y
0 1824 Andrew Jackson Democratic-Republican 151271 loss 57.210122 Andrew Andrew F 12 2020
1 1824 Andrew Jackson Democratic-Republican 151271 loss 57.210122 Andrew Andrew M 6049 2020
2 1828 Andrew Jackson Democratic 642806 win 56.203927 Andrew Andrew F 12 2020
3 1828 Andrew Jackson Democratic 642806 win 56.203927 Andrew Andrew M 6049 2020
4 1832 Andrew Jackson Democratic 702735 win 54.574789 Andrew Andrew F 12 2020
... ... ... ... ... ... ... ... ... ... ... ...
253 2016 Hillary Clinton Democratic 65853514 loss 48.521539 Hillary Hillary F 52 2020
254 2020 Joseph Biden Democratic 81268924 win 51.311515 Joseph Joseph F 9 2020
255 2020 Joseph Biden Democratic 81268924 win 51.311515 Joseph Joseph M 8458 2020
256 2020 Jo Jorgensen Libertarian 1865724 loss 1.177979 Jo Jo F 35 2020
257 2020 Jo Jorgensen Libertarian 1865724 loss 1.177979 Jo Jo M 8 2020

258 rows × 11 columns

In [27]:
merged.sort_values("Count", ascending = False)
Out[27]:
Year_x Candidate Party Popular vote Result % First Name Name Sex Count Year_y
53 1896 William Jennings Bryan Democratic 6509052 loss 46.871053 William William M 12643 2020
47 1832 William Wirt Anti-Masonic 100715 loss 7.821583 William William M 12643 2020
69 1936 William Lemke Union 892378 loss 1.960733 William William M 12643 2020
67 1932 William Z. Foster Communist 103307 loss 0.261069 William William M 12643 2020
55 1896 William McKinley Republican 7112138 win 51.213817 William William M 12643 2020
... ... ... ... ... ... ... ... ... ... ... ...
170 1920 Parley P. Christensen Farmer–Labor 265398 loss 0.995804 Parley Parley M 6 2020
190 1952 Adlai Stevenson Democratic 27375090 loss 44.446312 Adlai Adlai F 6 2020
242 2008 Chuck Baldwin Constitution 199750 loss 0.152398 Chuck Chuck M 5 2020
136 1888 Alson Streeter Union Labor 146602 loss 1.288861 Alson Alson M 5 2020
102 1856 Millard Fillmore American 873053 loss 21.554001 Millard Millard M 5 2020

258 rows × 11 columns