Lecture 4 – Data 100, Spring 2025¶


A demonstration of advanced pandas syntax to accompany Lecture 4.

In [1]:
import numpy as np
import pandas as pd
import plotly.express as px

Loading babynames Dataset¶

In [2]:
import urllib.request
import os.path
import zipfile

data_url = "https://www.ssa.gov/oact/babynames/state/namesbystate.zip"
local_filename = "data/babynamesbystate.zip"
if not os.path.exists(local_filename): # If the data already exists, don't download it again
    with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
        f.write(resp.read())

zf = zipfile.ZipFile(local_filename, 'r')

ca_name = 'STATE.CA.TXT'
field_names = ['State', 'Sex', 'Year', 'Name', 'Count']
with zf.open(ca_name) as fh:
    babynames = pd.read_csv(fh, header=None, names=field_names)

babynames.tail(10)
Out[2]:
State Sex Year Name Count
407418 CA M 2022 Zach 5
407419 CA M 2022 Zadkiel 5
407420 CA M 2022 Zae 5
407421 CA M 2022 Zai 5
407422 CA M 2022 Zay 5
407423 CA M 2022 Zayvier 5
407424 CA M 2022 Zia 5
407425 CA M 2022 Zora 5
407426 CA M 2022 Zuriel 5
407427 CA M 2022 Zylo 5

Custom sorting¶

Approach 1: Create a temporary column¶

In [3]:
# Create a Series of the length of each name
babyname_lengths = babynames["Name"].str.len()

# Add a column named "name_lengths" that includes the length of each name
babynames["name_lengths"] = babyname_lengths
babynames.head(5)
Out[3]:
State Sex Year Name Count name_lengths
0 CA F 1910 Mary 295 4
1 CA F 1910 Helen 239 5
2 CA F 1910 Dorothy 220 7
3 CA F 1910 Margaret 163 8
4 CA F 1910 Frances 134 7
In [4]:
# Sort by the temporary column
babynames = babynames.sort_values(by="name_lengths", ascending=False)
babynames.head(5)
Out[4]:
State Sex Year Name Count name_lengths
334166 CA M 1996 Franciscojavier 8 15
337301 CA M 1997 Franciscojavier 5 15
339472 CA M 1998 Franciscojavier 6 15
321792 CA M 1991 Ryanchristopher 7 15
327358 CA M 1993 Johnchristopher 5 15
In [5]:
# Drop the 'name_lengths' column
babynames = babynames.drop("name_lengths", axis="columns")
babynames.head(5)
Out[5]:
State Sex Year Name Count
334166 CA M 1996 Franciscojavier 8
337301 CA M 1997 Franciscojavier 5
339472 CA M 1998 Franciscojavier 6
321792 CA M 1991 Ryanchristopher 7
327358 CA M 1993 Johnchristopher 5

Approach 2: Sorting using the key argument¶

Apply the key function to the values before sorting. The key function should be vectorized: it should expect a Series and return a Series with the same shape as the input. It will be applied independently to each column listed in by.

In [6]:
babynames.sort_values(by="Name", key=lambda x:x.str.len(), ascending=False).head()
Out[6]:
State Sex Year Name Count
334166 CA M 1996 Franciscojavier 8
327472 CA M 1993 Ryanchristopher 5
337301 CA M 1997 Franciscojavier 5
337477 CA M 1997 Ryanchristopher 5
312543 CA M 1987 Franciscojavier 5
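Because the key function is called once for each column listed in by, it must be able to handle every column you sort on. Here is a sketch (not from the lecture) that sorts by name length and breaks ties with the raw Count; it relies on the passed Series keeping its column name:

# Sketch: when `by` lists multiple columns, the key function receives each
# column as its own Series; checking the Series' name lets us treat the
# string "Name" column and the numeric "Count" column differently.
babynames.sort_values(
    by=["Name", "Count"],
    key=lambda col: col.str.len() if col.name == "Name" else col,
    ascending=False
).head()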

Approach 3: Sorting Using the map Function¶

We can also use the Series map method if we want to apply an arbitrarily defined function. Suppose we want to sort by the number of occurrences of "dr" plus the number of occurrences of "ea" in each name.

In [7]:
# First, define a function to count the number of times "dr" or "ea" appear in each name
def dr_ea_count(string):
    return string.count('dr') + string.count('ea')

# Then, use `map` to apply `dr_ea_count` to each name in the "Name" column
babynames["dr_ea_count"] = babynames["Name"].map(dr_ea_count)

# Sort the DataFrame by the new "dr_ea_count" column so we can see our handiwork
babynames = babynames.sort_values(by="dr_ea_count", ascending=False)
babynames.head()
Out[7]:
State Sex Year Name Count dr_ea_count
115957 CA F 1990 Deandrea 5 3
101976 CA F 1986 Deandrea 6 3
131029 CA F 1994 Leandrea 5 3
108731 CA F 1988 Deandrea 5 3
308131 CA M 1985 Deandrea 6 3
In [8]:
# Drop the `dr_ea_count` column
babynames = babynames.drop("dr_ea_count", axis="columns")
babynames.head(5)
Out[8]:
State Sex Year Name Count
115957 CA F 1990 Deandrea 5
101976 CA F 1986 Deandrea 6
131029 CA F 1994 Leandrea 5
108731 CA F 1988 Deandrea 5
308131 CA M 1985 Deandrea 6
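As an aside, the key argument from Approach 2 can reuse this same function, avoiding the temporary column altogether. A quick sketch (not part of the original notebook):

# Sketch: pass dr_ea_count through the key argument instead of creating
# and then dropping a helper column.
babynames.sort_values(
    by="Name",
    key=lambda s: s.map(dr_ea_count),
    ascending=False
).head()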

Grouping¶

Group rows that share a common feature, then aggregate data across the group.

In this example, we count the total number of babies born each year (considering only a small subset of the data for simplicity).

In [9]:
babynames.groupby("Year")
Out[9]:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7c77cd79b690>
In [10]:
# Select only the numerical columns of interest, then group by "Year"
babies_by_year = babynames[["Year", "Count"]].groupby("Year").agg(sum)
babies_by_year
Out[10]:
Count
Year
1910 9163
1911 9983
1912 17946
1913 22094
1914 26926
... ...
2018 395436
2019 386996
2020 362882
2021 362582
2022 360023

113 rows × 1 columns

In [11]:
# Plotting baby counts per year
fig = px.line(babies_by_year, y="Count")
fig.update_layout(font_size=18, 
                  autosize=False, 
                  width=700, 
                  height=400)

Slido Exercise¶

Try to predict the results of the groupby operation shown. The answer is below the image.

[Image: groupby puzzle]

The first ?? will be "hi", the second ?? will be "tx", and the third ?? will be "sd". Note that max is applied to each column independently, so a group's x and y maxima can come from different rows.

In [12]:
ds = pd.DataFrame(dict(x=[3, 1, 4, 1, 5, 9, 2, 5, 6], 
                      y=['ak', 'tx', 'fl', 'hi', 'mi', 'ak', 'ca', 'sd', 'nc']), 
                      index=list('ABCABCACB') )
ds
Out[12]:
x y
A 3 ak
B 1 tx
C 4 fl
A 1 hi
B 5 mi
C 9 ak
A 2 ca
C 5 sd
B 6 nc
In [13]:
# Performing groupby on the first column with max aggregation function
ds.groupby(ds.index).agg(max)
Out[13]:
x y
A 3 hi
B 6 tx
C 9 sd

Case Study: Name "Popularity"¶

In this exercise, let's find the name with sex "F" that has dropped most in popularity since its peak usage in California. We'll start by filtering babynames to only include names corresponding to sex "F".

In [14]:
f_babynames = babynames[babynames["Sex"]=="F"]
f_babynames
Out[14]:
State Sex Year Name Count
115957 CA F 1990 Deandrea 5
101976 CA F 1986 Deandrea 6
131029 CA F 1994 Leandrea 5
108731 CA F 1988 Deandrea 5
193104 CA F 2010 Deandra 5
... ... ... ... ... ...
192320 CA F 2010 Annaly 7
212793 CA F 2015 Iyanna 6
202350 CA F 2013 Lailah 50
212784 CA F 2015 Helene 6
212787 CA F 2015 Holley 6

239537 rows × 5 columns

In [15]:
# We sort the data by "Year"
f_babynames = f_babynames.sort_values("Year")
f_babynames
Out[15]:
State Sex Year Name Count
23 CA F 1910 Bernice 59
219 CA F 1910 Katharine 5
170 CA F 1910 Clarice 7
26 CA F 1910 Doris 56
30 CA F 1910 Ethel 52
... ... ... ... ... ...
236551 CA F 2022 Dani 47
235914 CA F 2022 Iris 361
235918 CA F 2022 Lucy 345
235973 CA F 2022 Sage 226
239452 CA F 2022 Rylin 5

239537 rows × 5 columns

To build our intuition on how to answer our research question, let's visualize the prevalence of the name "Jennifer" over time.

In [16]:
jenn_entries = f_babynames[f_babynames["Name"]=="Jennifer"]
jenn_entries
Out[16]:
State Sex Year Name Count
13610 CA F 1934 Jennifer 5
16325 CA F 1938 Jennifer 5
16993 CA F 1939 Jennifer 6
17533 CA F 1940 Jennifer 13
18210 CA F 1941 Jennifer 24
... ... ... ... ... ...
221406 CA F 2018 Jennifer 167
225149 CA F 2019 Jennifer 145
228787 CA F 2020 Jennifer 141
232561 CA F 2021 Jennifer 91
236136 CA F 2022 Jennifer 114

86 rows × 5 columns

In [17]:
# We'll talk about how to generate plots in a later lecture
fig = px.line(jenn_entries, x="Year", y="Count")

fig.update_layout(font_size = 18, 
                  autosize=False, 
                  width=1000, 
                  height=400)

We'll need a mathematical definition for the change in popularity of a name in California.

Define the metric "Ratio to Peak" (RTP). We'll calculate this as the count of the name in 2022 (the most recent year for which we have data) divided by the largest count of this name in any year.

A demo calculation for Jennifer:

In [18]:
# Construct a Series containing our Jennifer count data
jenn_counts_ser = jenn_entries["Count"]
In [19]:
# In the year with the highest Jennifer count, 6065 Jennifers were born
max_jenn = np.max(jenn_counts_ser)
max_jenn
Out[19]:
6065
In [20]:
# Remember that we sorted f_babynames by "Year". 
# This means that grabbing the final entry gives us the most recent count of Jennifers: 114
# In 2022, the most recent year for which we have data, 114 Jennifers were born
curr_jenn = jenn_counts_ser.iloc[-1]
curr_jenn
Out[20]:
114
In [21]:
# Compute the RTP
curr_jenn / max_jenn
Out[21]:
0.018796372629843364

We can also write a function that produces the ratio_to_peak for a given Series. This will allow us to use .groupby to compute the RTP for every name in the dataset at once.

In [22]:
def ratio_to_peak(series):
    """
    Compute the RTP for a Series containing the counts per year for a single name.
    Assumes the counts are sorted by year in ascending order.
    """
    return series.iloc[-1] / np.max(series)
In [23]:
# Then, find the RTP
ratio_to_peak(jenn_counts_ser)
Out[23]:
0.018796372629843364

Now, let's use .groupby to compute the RTPs for all names in the dataset.

You may see a warning message when running the cell below. As discussed in lecture, pandas can't apply an aggregation function like this to non-numeric data (it doesn't make sense to divide "CA" by a number). Instead, we can select the numerical columns of interest directly.

In [24]:
rtp_table = f_babynames.groupby("Name")[["Year", "Count"]].agg(ratio_to_peak)
rtp_table
Out[24]:
Year Count
Name
Aadhini 1.0 1.000000
Aadhira 1.0 0.500000
Aadhya 1.0 0.660000
Aadya 1.0 0.586207
Aahana 1.0 0.269231
... ... ...
Zyanya 1.0 0.466667
Zyla 1.0 1.000000
Zylah 1.0 1.000000
Zyra 1.0 1.000000
Zyrah 1.0 0.833333

13782 rows × 2 columns

In [25]:
# Results in a TypeError
# rtp_table = f_babynames.groupby("Name").agg(ratio_to_peak)
# rtp_table

This is the pandas equivalent of .group from Data 8. If we wanted to achieve this same result using the datascience library, we would write:

f_babynames.group("Name", ratio_to_peak)

Slido Exercise¶

Is there a row where Year is not equal to 1?

In [26]:
# Unique values in the Year column
rtp_table["Year"].unique()
Out[26]:
array([1.])
In [27]:
# Dropping the "Year" column
rtp_table.drop("Year", axis="columns", inplace=True)
rtp_table
Out[27]:
Count
Name
Aadhini 1.000000
Aadhira 0.500000
Aadhya 0.660000
Aadya 0.586207
Aahana 0.269231
... ...
Zyanya 0.466667
Zyla 1.000000
Zylah 1.000000
Zyra 1.000000
Zyrah 0.833333

13782 rows × 1 columns

In [28]:
# Rename "Count" to "Count RTP" for clarity
rtp_table = rtp_table.rename(columns={"Count":"Count RTP"})
rtp_table
Out[28]:
Count RTP
Name
Aadhini 1.000000
Aadhira 0.500000
Aadhya 0.660000
Aadya 0.586207
Aahana 0.269231
... ...
Zyanya 0.466667
Zyla 1.000000
Zylah 1.000000
Zyra 1.000000
Zyrah 0.833333

13782 rows × 1 columns

In [29]:
# What name has fallen the most in popularity?
rtp_table.sort_values("Count RTP")
Out[29]:
Count RTP
Name
Debra 0.001260
Debbie 0.002815
Carol 0.003180
Tammy 0.003249
Susan 0.003305
... ...
Fidelia 1.000000
Naveyah 1.000000
Finlee 1.000000
Roseline 1.000000
Aadhini 1.000000

13782 rows × 1 columns

We can visualize the decrease in the popularity of the name "Debra":

In [30]:
def plot_name(*names):
    fig = px.line(f_babynames[f_babynames["Name"].isin(names)], 
                  x="Year", y="Count", color="Name",
                  title=f"Popularity for: {names}")
    fig.update_layout(font_size=18, 
                  autosize=False, 
                  width=1000, 
                  height=400)
    return fig

plot_name("Debra")
In [31]:
# Find the 10 names that have decreased the most in popularity
top10 = rtp_table.sort_values("Count RTP").head(10).index
top10
Out[31]:
Index(['Debra', 'Debbie', 'Carol', 'Tammy', 'Susan', 'Cheryl', 'Shannon',
       'Tina', 'Michele', 'Terri'],
      dtype='object', name='Name')
In [32]:
plot_name(*top10)

For fun, try plotting your name or your friends' names.

groupby.size() and groupby.count()¶

In [33]:
df = pd.DataFrame({'letter':['A', 'A', 'B', 'C', 'C', 'C'], 
                   'num':[1, 2, 3, 4, np.nan, 4], 
                   'state':[np.nan, 'tx', 'fl', 'hi', np.nan, 'ak']})
df
Out[33]:
letter num state
0 A 1.0 NaN
1 A 2.0 tx
2 B 3.0 fl
3 C 4.0 hi
4 C NaN NaN
5 C 4.0 ak

groupby.size() returns a Series, indexed by the letters that we grouped by, with values denoting the number of rows in each group/sub-DataFrame. It does not care about missing (NaN) values.

In [34]:
df.groupby("letter").size()
Out[34]:
letter
A    2
B    1
C    3
dtype: int64

groupby.count() returns a DataFrame, indexed by the letters that we grouped by. Each column represents the number of non-missing values for that letter.

In [35]:
df.groupby("letter").count()
Out[35]:
num state
letter
A 2 1
B 1 1
C 2 2

You might recall the value_counts() function we talked about last week. What's the difference?

In [36]:
df["letter"].value_counts()
Out[36]:
letter
C    3
A    2
B    1
Name: count, dtype: int64

It turns out that value_counts() does something similar to groupby.size(), except that it also sorts the resulting Series in descending order.
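As a quick check (not in the original notebook), sorting the grouped result reproduces the value_counts() output:

# groupby.size() sorted in descending order mirrors value_counts()
df.groupby("letter").size().sort_values(ascending=False)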

Filtering by Group¶

In [37]:
# Let's read the elections dataset
elections = pd.read_csv("data/elections.csv")
elections.sample(5)
Out[37]:
Year Candidate Party Popular vote Result %
163 2008 Bob Barr Libertarian 523715 loss 0.399565
58 1904 Eugene V. Debs Socialist 402810 loss 2.985897
150 1996 Ross Perot Reform 8085294 loss 8.408844
48 1892 James B. Weaver Populist 1041028 loss 8.645038
118 1972 George McGovern Democratic 29173222 loss 37.670670

Let's keep only the election years where the maximum vote share (%) is less than 45%.

In [38]:
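# .filter keeps every row of each group for which the given function returns True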
elections.groupby("Year").filter(lambda sf: sf["%"].max() < 45).head(10)
Out[38]:
Year Candidate Party Popular vote Result %
23 1860 Abraham Lincoln Republican 1855993 win 39.699408
24 1860 John Bell Constitutional Union 590901 loss 12.639283
25 1860 John C. Breckinridge Southern Democratic 848019 loss 18.138998
26 1860 Stephen A. Douglas Northern Democratic 1380202 loss 29.522311
66 1912 Eugene V. Debs Socialist 901551 loss 6.004354
67 1912 Eugene W. Chafin Prohibition 208156 loss 1.386325
68 1912 Theodore Roosevelt Progressive 4122721 loss 27.457433
69 1912 William Taft Republican 3486242 loss 23.218466
70 1912 Woodrow Wilson Democratic 6296284 win 41.933422
115 1968 George Wallace American Independent 9901118 loss 13.571218
In [39]:
elections_max_percentage = elections.groupby("Year")[["%"]].agg(max)
elections_max_percentage
Out[39]:
%
Year
1824 57.210122
1828 56.203927
1832 54.574789
1836 52.272472
1840 53.051213
1844 50.749477
1848 47.309296
1852 51.013168
1856 45.306080
1860 39.699408
1864 54.951512
1868 52.665305
1872 55.928594
1876 51.528376
1880 48.369234
1884 48.884933
1888 48.656799
1892 46.121393
1896 51.213817
1900 52.342640
1904 56.562787
1908 52.013300
1912 41.933422
1916 49.367987
1920 60.574501
1924 54.329113
1928 58.368524
1932 57.672125
1936 60.978107
1940 54.871202
1944 53.773801
1948 49.601536
1952 55.325173
1956 57.650654
1960 50.082561
1964 61.344703
1968 43.565246
1972 60.907806
1976 50.271900
1980 50.897944
1984 59.023326
1988 53.518845
1992 43.118485
1996 49.296938
2000 48.491813
2004 50.771824
2008 53.023510
2012 51.258484
2016 48.521539
2020 51.311515
2024 49.808629
In [40]:
elections_max_percentage.sort_values(by="%").head()
Out[40]:
%
Year
1860 39.699408
1912 41.933422
1992 43.118485
1968 43.565246
1856 45.306080

groupby Puzzle¶

Suppose we want to know each party's best election result.

Attempt #1¶

We have to be careful when using aggregation functions. For example, the code below might be misinterpreted to say that Woodrow Wilson successfully ran for election in 2024. Why is this happening?

In [41]:
elections.groupby("Party").agg(max).head(10)
Out[41]:
Year Candidate Popular vote Result %
Party
American 1976 Thomas J. Anderson 873053 loss 21.554001
American Independent 1976 Lester Maddox 9901118 loss 13.571218
Anti-Masonic 1832 William Wirt 100715 loss 7.821583
Anti-Monopoly 1884 Benjamin Butler 134294 loss 1.335838
Citizens 1980 Barry Commoner 233052 loss 0.270182
Communist 1932 William Z. Foster 103307 loss 0.261069
Constitution 2016 Michael Peroutka 203091 loss 0.152398
Constitutional Union 1860 John Bell 590901 loss 12.639283
Democratic 2024 Woodrow Wilson 81268924 win 61.344703
Democratic-Republican 1824 John Quincy Adams 151271 win 57.210122

Attempt #2¶

Next, we'll write code that properly returns the best result by each party. That is, each row should show the Year, Candidate, Popular Vote, Result, and % for the election in which that party saw its best results (rather than mixing them as in the example above). Here's what the first rows of the correct output should look like:

[Image: parties.png, the first rows of the expected output]

In [42]:
elections_sorted_by_percent = elections.sort_values("%", ascending=False)
elections_sorted_by_percent.head(8)
Out[42]:
Year Candidate Party Popular vote Result %
114 1964 Lyndon Johnson Democratic 43127041 win 61.344703
91 1936 Franklin Roosevelt Democratic 27752648 win 60.978107
120 1972 Richard Nixon Republican 47168710 win 60.907806
79 1920 Warren Harding Republican 16144093 win 60.574501
133 1984 Ronald Reagan Republican 54455472 win 59.023326
84 1928 Herbert Hoover Republican 21427123 win 58.368524
86 1932 Franklin Roosevelt Democratic 22821277 win 57.672125
109 1956 Dwight Eisenhower Republican 35579180 win 57.650654
In [43]:
elections_sorted_by_percent.groupby("Party").first()
Out[43]:
Year Candidate Popular vote Result %
Party
American 1856 Millard Fillmore 873053 loss 21.554001
American Independent 1968 George Wallace 9901118 loss 13.571218
Anti-Masonic 1832 William Wirt 100715 loss 7.821583
Anti-Monopoly 1884 Benjamin Butler 134294 loss 1.335838
Citizens 1980 Barry Commoner 233052 loss 0.270182
Communist 1932 William Z. Foster 103307 loss 0.261069
Constitution 2008 Chuck Baldwin 199750 loss 0.152398
Constitutional Union 1860 John Bell 590901 loss 12.639283
Democratic 1964 Lyndon Johnson 43127041 win 61.344703
Democratic-Republican 1824 Andrew Jackson 151271 loss 57.210122
Dixiecrat 1948 Strom Thurmond 1175930 loss 2.412304
Farmer–Labor 1920 Parley P. Christensen 265398 loss 0.995804
Free Soil 1848 Martin Van Buren 291501 loss 10.138474
Green 2000 Ralph Nader 2882955 loss 2.741176
Greenback 1880 James B. Weaver 308649 loss 3.352344
Independent 1992 Ross Perot 19743821 loss 18.956298
Liberal Republican 1872 Horace Greeley 2834761 loss 44.071406
Libertarian 2016 Gary Johnson 4489235 loss 3.307714
Libertarian Party 2024 Chase Oliver 650130 loss 0.418895
National Democratic 1896 John M. Palmer 134645 loss 0.969566
National Republican 1828 John Quincy Adams 500897 loss 43.796073
National Union 1864 Abraham Lincoln 2211317 win 54.951512
Natural Law 1996 John Hagelin 113670 loss 0.118219
New Alliance 1988 Lenora Fulani 217221 loss 0.237804
Northern Democratic 1860 Stephen A. Douglas 1380202 loss 29.522311
Populist 1892 James B. Weaver 1041028 loss 8.645038
Progressive 1912 Theodore Roosevelt 4122721 loss 27.457433
Prohibition 1892 John Bidwell 270879 loss 2.249468
Reform 1996 Ross Perot 8085294 loss 8.408844
Republican 1972 Richard Nixon 47168710 win 60.907806
Socialist 1912 Eugene V. Debs 901551 loss 6.004354
Southern Democratic 1860 John C. Breckinridge 848019 loss 18.138998
States' Rights 1956 T. Coleman Andrews 107929 loss 0.174883
Taxpayers 1996 Howard Phillips 184656 loss 0.192045
Union 1936 William Lemke 892378 loss 1.960733
Union Labor 1888 Alson Streeter 146602 loss 1.288861
Whig 1840 William Henry Harrison 1275583 win 53.051213

Alternative Solutions¶

You'll soon discover that with pandas' rich tool set, there's typically more than one way to get to the same answer. Each approach has different tradeoffs in terms of readability, performance, memory consumption, complexity, and more. It will take some experience for you to develop a sense of which approach is better for each problem, but you should, in general, try to consider whether you can envision a different solution to a given problem, especially if you find your current solution particularly convoluted or hard to read.

Here are a couple of other ways of obtaining the same result (in each case, we only show the top part with head()). The first approach uses groupby but finds the location of the maximum value via the idxmax() method (look up its documentation!). We then index and sort by Party to match the requested formatting:

In [44]:
elections.groupby("Party")["%"].idxmax()
Out[44]:
Party
American                  22
American Independent     115
Anti-Masonic               6
Anti-Monopoly             38
Citizens                 127
Communist                 89
Constitution             164
Constitutional Union      24
Democratic               114
Democratic-Republican      0
Dixiecrat                103
Farmer–Labor              78
Free Soil                 15
Green                    155
Greenback                 35
Independent              143
Liberal Republican        31
Libertarian              175
Libertarian Party        186
National Democratic       50
National Republican        3
National Union            27
Natural Law              148
New Alliance             136
Northern Democratic       26
Populist                  48
Progressive               68
Prohibition               49
Reform                   150
Republican               120
Socialist                 66
Southern Democratic       25
States' Rights           110
Taxpayers                147
Union                     93
Union Labor               42
Whig                      11
Name: %, dtype: int64
In [45]:
# This is the computational part
best_per_party = elections.loc[elections.groupby("Party")["%"].idxmax()]

# This indexes by Party to match the formatting above
best_per_party.set_index('Party').sort_index().head() 
Out[45]:
Year Candidate Popular vote Result %
Party
American 1856 Millard Fillmore 873053 loss 21.554001
American Independent 1968 George Wallace 9901118 loss 13.571218
Anti-Masonic 1832 William Wirt 100715 loss 7.821583
Anti-Monopoly 1884 Benjamin Butler 134294 loss 1.335838
Citizens 1980 Barry Commoner 233052 loss 0.270182

And this one doesn't even use groupby! This approach instead uses the drop_duplicates method to keep only the last occurrence of each party after having sorted by "%", which corresponds to that party's best performance. Again, the second line is purely formatting:

In [46]:
best_per_party2 = elections.sort_values("%").drop_duplicates(["Party"], keep="last")
best_per_party2.set_index("Party").sort_index().head()  # Formatting
Out[46]:
Year Candidate Popular vote Result %
Party
American 1856 Millard Fillmore 873053 loss 21.554001
American Independent 1968 George Wallace 9901118 loss 13.571218
Anti-Masonic 1832 William Wirt 100715 loss 7.821583
Anti-Monopoly 1884 Benjamin Butler 134294 loss 1.335838
Citizens 1980 Barry Commoner 233052 loss 0.270182

Challenge: See if you can find yet another approach that still gives the same answer.
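If you get stuck, here is one possible sketch (there are others, and this is not from the lecture):

# Sketch: after sorting by "%" in descending order, groupby(...).head(1)
# keeps the top row of each party's group, i.e. that party's best result.
best_per_party3 = (
    elections.sort_values("%", ascending=False)
             .groupby("Party")
             .head(1)
)
best_per_party3.set_index("Party").sort_index().head()  # Formatting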

DataFrameGroupBy Objects¶

The result of groupby is not a DataFrame or a list of DataFrames. It is instead a special type called a DataFrameGroupBy.

In [47]:
grouped_by_party = elections.groupby("Party")
type(grouped_by_party)
Out[47]:
pandas.core.groupby.generic.DataFrameGroupBy

GroupBy objects are structured like dictionaries. In fact, we can see the underlying dictionary with the following code:

In [48]:
grouped_by_party.groups
Out[48]:
{'American': [22, 126], 'American Independent': [115, 119, 124], 'Anti-Masonic': [6], 'Anti-Monopoly': [38], 'Citizens': [127], 'Communist': [89], 'Constitution': [160, 164, 172], 'Constitutional Union': [24], 'Democratic': [2, 4, 8, 10, 13, 14, 17, 20, 28, 29, 34, 37, 39, 45, 47, 52, 55, 57, 64, 70, 74, 77, 81, 83, 86, 91, 94, 97, 100, 105, 108, 111, 114, 116, 118, 123, 129, 134, 137, 140, 144, 151, 158, 162, 168, 176, 178, 183], 'Democratic-Republican': [0, 1], 'Dixiecrat': [103], 'Farmer–Labor': [78], 'Free Soil': [15, 18], 'Green': [149, 155, 156, 165, 170, 177, 181, 184], 'Greenback': [35], 'Independent': [121, 130, 143, 161, 167, 174, 185], 'Liberal Republican': [31], 'Libertarian': [125, 128, 132, 138, 139, 146, 153, 159, 163, 169, 175, 180], 'Libertarian Party': [186], 'National Democratic': [50], 'National Republican': [3, 5], 'National Union': [27], 'Natural Law': [148], 'New Alliance': [136], 'Northern Democratic': [26], 'Populist': [48, 61, 141], 'Progressive': [68, 82, 101, 107], 'Prohibition': [41, 44, 49, 51, 54, 59, 63, 67, 73, 75, 99], 'Reform': [150, 154], 'Republican': [21, 23, 30, 32, 33, 36, 40, 43, 46, 53, 56, 60, 65, 69, 72, 79, 80, 84, 87, 90, 96, 98, 104, 106, 109, 112, 113, 117, 120, 122, 131, 133, 135, 142, 145, 152, 157, 166, 171, 173, 179, 182], 'Socialist': [58, 62, 66, 71, 76, 85, 88, 92, 95, 102], 'Southern Democratic': [25], 'States' Rights': [110], 'Taxpayers': [147], 'Union': [93], 'Union Labor': [42], 'Whig': [7, 9, 11, 12, 16, 19]}

The keys of the dictionary are the groups (in this case, Party), and the values are the indices of rows belonging to that group. We can access a particular sub-DataFrame using get_group:

In [49]:
grouped_by_party.get_group("Socialist")
Out[49]:
Year Candidate Party Popular vote Result %
58 1904 Eugene V. Debs Socialist 402810 loss 2.985897
62 1908 Eugene V. Debs Socialist 420852 loss 2.850866
66 1912 Eugene V. Debs Socialist 901551 loss 6.004354
71 1916 Allan L. Benson Socialist 590524 loss 3.194193
76 1920 Eugene V. Debs Socialist 913693 loss 3.428282
85 1928 Norman Thomas Socialist 267478 loss 0.728623
88 1932 Norman Thomas Socialist 884885 loss 2.236211
92 1936 Norman Thomas Socialist 187910 loss 0.412876
95 1940 Norman Thomas Socialist 116599 loss 0.234237
102 1948 Norman Thomas Socialist 139569 loss 0.286312

Pivot Tables¶

Groupby with multiple columns¶

We want to build a table showing the total number of babies born of each sex in each year. One way is to group by both columns of interest:

In [50]:
babynames.groupby(["Year", "Sex"])[["Count"]].sum().head(6)
Out[50]:
Count
Year Sex
1910 F 5950
M 3213
1911 F 6602
M 3381
1912 F 9804
M 8142

pivot_table¶

In [51]:
babynames.pivot_table(
    index="Year", 
    columns="Sex", 
    values="Count", 
    aggfunc=np.sum).head(6)
Out[51]:
Sex F M
Year
1910 5950 3213
1911 6602 3381
1912 9804 8142
1913 11860 10234
1914 13815 13111
1915 18643 17192
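The pivot table above contains the same information as the earlier groupby on ["Year", "Sex"]; the Sex level has simply been moved into the columns. A sketch (not from the lecture) showing the equivalence with unstack:

# Sketch: unstacking the "Sex" level of the grouped result reproduces
# the pivot table layout shown above.
babynames.groupby(["Year", "Sex"])[["Count"]].sum().unstack().head(6)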

pivot_table with Multiple values¶

In [52]:
babynames.pivot_table(
    index="Year", 
    columns="Sex", 
    values=["Count", "Name"], 
    aggfunc=np.max).head(6)
Out[52]:
Count Name
Sex F M F M
Year
1910 295 237 Yvonne William
1911 390 214 Zelma Willis
1912 534 501 Yvonne Woodrow
1913 584 614 Zelma Yoshio
1914 773 769 Zelma Yoshio
1915 998 1033 Zita Yukio

Join Tables¶

What if we want to know the popularity of presidential candidates' first names in California in 2022? What can we do?

In [53]:
elections.head(10)
Out[53]:
Year Candidate Party Popular vote Result %
0 1824 Andrew Jackson Democratic-Republican 151271 loss 57.210122
1 1824 John Quincy Adams Democratic-Republican 113142 win 42.789878
2 1828 Andrew Jackson Democratic 642806 win 56.203927
3 1828 John Quincy Adams National Republican 500897 loss 43.796073
4 1832 Andrew Jackson Democratic 702735 win 54.574789
5 1832 Henry Clay National Republican 484205 loss 37.603628
6 1832 William Wirt Anti-Masonic 100715 loss 7.821583
7 1836 Hugh Lawson White Whig 146109 loss 10.005985
8 1836 Martin Van Buren Democratic 763291 win 52.272472
9 1836 William Henry Harrison Whig 550816 loss 37.721543
In [54]:
babynames_2022 = babynames[babynames["Year"]==2022]
babynames_2022.head(10)
Out[54]:
State Sex Year Name Count
237964 CA F 2022 Leandra 10
404916 CA M 2022 Leandro 99
405892 CA M 2022 Andreas 14
235927 CA F 2022 Andrea 322
405695 CA M 2022 Deandre 18
237588 CA F 2022 Andromeda 13
238547 CA F 2022 Jeannette 7
405620 CA M 2022 Rudra 20
406178 CA M 2022 Alexandre 10
239282 CA F 2022 Kahealani 5
In [55]:
elections["First Name"] = elections["Candidate"].str.split(" ").str[0]
elections
Out[55]:
Year Candidate Party Popular vote Result % First Name
0 1824 Andrew Jackson Democratic-Republican 151271 loss 57.210122 Andrew
1 1824 John Quincy Adams Democratic-Republican 113142 win 42.789878 John
2 1828 Andrew Jackson Democratic 642806 win 56.203927 Andrew
3 1828 John Quincy Adams National Republican 500897 loss 43.796073 John
4 1832 Andrew Jackson Democratic 702735 win 54.574789 Andrew
... ... ... ... ... ... ... ...
182 2024 Donald Trump Republican 77303568 win 49.808629 Donald
183 2024 Kamala Harris Democratic 75019230 loss 48.336772 Kamala
184 2024 Jill Stein Green 861155 loss 0.554864 Jill
185 2024 Robert Kennedy Independent 756383 loss 0.487357 Robert
186 2024 Chase Oliver Libertarian Party 650130 loss 0.418895 Chase

187 rows × 7 columns

Unlike in Data 8, the join operation is called merge in pandas. pandas also has a join method, but it does something slightly different; we won't talk about it in this class.

In [56]:
merged = pd.merge(left=elections, right=babynames_2022, 
                  left_on="First Name", right_on="Name")
merged
Out[56]:
Year_x Candidate Party Popular vote Result % First Name State Sex Year_y Name Count
0 1824 Andrew Jackson Democratic-Republican 151271 loss 57.210122 Andrew CA M 2022 Andrew 741
1 1828 Andrew Jackson Democratic 642806 win 56.203927 Andrew CA M 2022 Andrew 741
2 1832 Andrew Jackson Democratic 702735 win 54.574789 Andrew CA M 2022 Andrew 741
3 1824 John Quincy Adams Democratic-Republican 113142 win 42.789878 John CA M 2022 John 490
4 1828 John Quincy Adams National Republican 500897 loss 43.796073 John CA M 2022 John 490
... ... ... ... ... ... ... ... ... ... ... ... ...
151 2016 Evan McMullin Independent 732273 loss 0.539546 Evan CA M 2022 Evan 488
152 2016 Hillary Clinton Democratic 65853514 loss 48.521539 Hillary CA F 2022 Hillary 10
153 2020 Joseph Biden Democratic 81268924 win 51.311515 Joseph CA M 2022 Joseph 785
154 2024 Chase Oliver Libertarian Party 650130 loss 0.418895 Chase CA M 2022 Chase 203
155 2024 Chase Oliver Libertarian Party 650130 loss 0.418895 Chase CA F 2022 Chase 6

156 rows × 12 columns
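Note that pd.merge performs an inner join by default, so candidates whose first names don't appear in the 2022 California data are dropped. A hedged sketch of keeping every candidate with a left join instead:

# Sketch (not from the lecture): how="left" keeps all candidates, filling
# NaN where a first name has no match in the 2022 baby names data.
merged_left = pd.merge(left=elections, right=babynames_2022,
                       how="left", left_on="First Name", right_on="Name")
merged_left.head()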

In [57]:
merged.sort_values("Count", ascending=False)
Out[57]:
Year_x Candidate Party Popular vote Result % First Name State Sex Year_y Name Count
73 1884 Benjamin Butler Anti-Monopoly 134294 loss 1.335838 Benjamin CA M 2022 Benjamin 1524
74 1888 Benjamin Harrison Republican 5443633 win 47.858041 Benjamin CA M 2022 Benjamin 1524
75 1892 Benjamin Harrison Republican 5176108 loss 42.984101 Benjamin CA M 2022 Benjamin 1524
51 1920 James M. Cox Democratic 9139661 loss 34.293063 James CA M 2022 James 1086
39 1844 James Polk Democratic 1339570 win 50.749477 James CA M 2022 James 1086
... ... ... ... ... ... ... ... ... ... ... ... ...
92 1912 Woodrow Wilson Democratic 6296284 win 41.933422 Woodrow CA M 2022 Woodrow 6
93 1916 Woodrow Wilson Democratic 9126868 win 49.367987 Woodrow CA M 2022 Woodrow 6
116 1964 Lyndon Johnson Democratic 43127041 win 61.344703 Lyndon CA M 2022 Lyndon 6
155 2024 Chase Oliver Libertarian Party 650130 loss 0.418895 Chase CA F 2022 Chase 6
146 2016 Darrell Castle Constitution 203091 loss 0.149640 Darrell CA M 2022 Darrell 5

156 rows × 12 columns
