Lecture 3 – Data 100, Fall 2023¶

Data 100, Fall 2023

Acknowledgments Page

A demonstration of advanced pandas syntax to accompany Lecture 3.

In [1]:
import numpy as np
import pandas as pd
import plotly.express as px
In [2]:
# Loading the elections DataFrame
elections = pd.read_csv("data/elections.csv")

elections.head() 
Out[2]:
Year Candidate Party Popular vote Result %
0 1824 Andrew Jackson Democratic-Republican 151271 loss 57.210122
1 1824 John Quincy Adams Democratic-Republican 113142 win 42.789878
2 1828 Andrew Jackson Democratic 642806 win 56.203927
3 1828 John Quincy Adams National Republican 500897 loss 43.796073
4 1832 Andrew Jackson Democratic 702735 win 54.574789

Data Extraction in Pandas¶

Integer-Based Extraction Using iloc¶

iloc selects items by row and column integer position.

Arguments to .iloc can be:

  1. A list.
  2. A slice (syntax is exclusive of the right hand side of the slice).
  3. A single value.
In [3]:
# Select the rows at positions 1, 2, and 3.
# Select the columns at positions 0, 1, and 2.
# Remember that Python indexing begins at position 0!
elections.iloc[[1, 2, 3], [0, 1, 2]]
Out[3]:
Year Candidate Party
1 1824 John Quincy Adams Democratic-Republican
2 1828 Andrew Jackson Democratic
3 1828 John Quincy Adams National Republican
In [4]:
# Index-based extraction using a list of rows and a slice of column indices
elections.iloc[[1, 2, 3], 0:3]
Out[4]:
Year Candidate Party
1 1824 John Quincy Adams Democratic-Republican
2 1828 Andrew Jackson Democratic
3 1828 John Quincy Adams National Republican
In [5]:
# Selecting all rows using a colon
elections.iloc[:, 0:3]
Out[5]:
Year Candidate Party
0 1824 Andrew Jackson Democratic-Republican
1 1824 John Quincy Adams Democratic-Republican
2 1828 Andrew Jackson Democratic
3 1828 John Quincy Adams National Republican
4 1832 Andrew Jackson Democratic
... ... ... ...
177 2016 Jill Stein Green
178 2020 Joseph Biden Democratic
179 2020 Donald Trump Republican
180 2020 Jo Jorgensen Libertarian
181 2020 Howard Hawkins Green

182 rows × 3 columns

In [6]:
elections.iloc[[1, 2, 3], 1]
Out[6]:
1    John Quincy Adams
2       Andrew Jackson
3    John Quincy Adams
Name: Candidate, dtype: object
In [7]:
# Extracting the value at row 0 and the second column
elections.iloc[0,1]
Out[7]:
'Andrew Jackson'

Context-dependent Extraction using []¶

We could technically do anything we want using loc or iloc. However, in practice, the [] operator is often used instead to yield more concise code.

[] is a bit trickier to understand than loc or iloc, but it achieves essentially the same functionality. The difference is that [] is context-dependent.

[] only takes one argument, which may be:

  1. A slice of row integers.
  2. A list of column labels.
  3. A single column label.

If we provide a slice of row numbers, we get the numbered rows.

In [8]:
elections[3:7]
Out[8]:
Year Candidate Party Popular vote Result %
3 1828 John Quincy Adams National Republican 500897 loss 43.796073
4 1832 Andrew Jackson Democratic 702735 win 54.574789
5 1832 Henry Clay National Republican 484205 loss 37.603628
6 1832 William Wirt Anti-Masonic 100715 loss 7.821583

If we provide a list of column names, we get the listed columns.

In [9]:
elections[["Year", "Candidate", "Result"]]
Out[9]:
Year Candidate Result
0 1824 Andrew Jackson loss
1 1824 John Quincy Adams win
2 1828 Andrew Jackson win
3 1828 John Quincy Adams loss
4 1832 Andrew Jackson win
... ... ... ...
177 2016 Jill Stein loss
178 2020 Joseph Biden win
179 2020 Donald Trump loss
180 2020 Jo Jorgensen loss
181 2020 Howard Hawkins loss

182 rows × 3 columns

And if we provide a single column name we get back just that column, stored as a Series.

In [10]:
elections["Candidate"]
Out[10]:
0         Andrew Jackson
1      John Quincy Adams
2         Andrew Jackson
3      John Quincy Adams
4         Andrew Jackson
             ...        
177           Jill Stein
178         Joseph Biden
179         Donald Trump
180         Jo Jorgensen
181       Howard Hawkins
Name: Candidate, Length: 182, dtype: object

Slido Exercise¶

In [11]:
weird = pd.DataFrame({
    1:["topdog","botdog"], 
    "1":["topcat","botcat"]
})
weird
Out[11]:
1 1
0 topdog topcat
1 botdog botcat
In [12]:
weird[1]
Out[12]:
0    topdog
1    botdog
Name: 1, dtype: object
In [13]:
weird["1"]
Out[13]:
0    topcat
1    botcat
Name: 1, dtype: object
In [14]:
weird[1:]
Out[14]:
1 1
1 botdog botcat

Dataset: California baby names¶

In today's lecture, we'll work with the babynames dataset, which contains information about the names of infants born in California.

The cell below pulls census data from a government website and then loads it into a usable form. The code shown here is outside of the scope of Data 100, but you're encouraged to dig into it if you are interested!

In [15]:
import urllib.request
import os.path
import zipfile

data_url = "https://www.ssa.gov/oact/babynames/state/namesbystate.zip"
local_filename = "data/babynamesbystate.zip"
if not os.path.exists(local_filename): # If the data exists don't download again
    with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
        f.write(resp.read())

zf = zipfile.ZipFile(local_filename, 'r')

ca_name = 'STATE.CA.TXT'
field_names = ['State', 'Sex', 'Year', 'Name', 'Count']
with zf.open(ca_name) as fh:
    babynames = pd.read_csv(fh, header=None, names=field_names)

babynames.head()
Out[15]:
State Sex Year Name Count
0 CA F 1910 Mary 295
1 CA F 1910 Helen 239
2 CA F 1910 Dorothy 220
3 CA F 1910 Margaret 163
4 CA F 1910 Frances 134

Conditional Selection¶

In [16]:
# Ask yourself: Why is :9 is the correct slice to select the first 10 rows?
babynames_first_10_rows = babynames.loc[:9, :]

babynames_first_10_rows
Out[16]:
State Sex Year Name Count
0 CA F 1910 Mary 295
1 CA F 1910 Helen 239
2 CA F 1910 Dorothy 220
3 CA F 1910 Margaret 163
4 CA F 1910 Frances 134
5 CA F 1910 Ruth 128
6 CA F 1910 Evelyn 126
7 CA F 1910 Alice 118
8 CA F 1910 Virginia 101
9 CA F 1910 Elizabeth 93

By passing in a sequence (list, array, or Series) of boolean values, we can extract a subset of the rows in a DataFrame. We will keep only the rows that correspond to a boolean value of True.

In [17]:
# Notice how we have exactly 10 elements in our boolean array argument.
babynames_first_10_rows[[True, False, True, False, True, False, True, False, True, False]]
Out[17]:
State Sex Year Name Count
0 CA F 1910 Mary 295
2 CA F 1910 Dorothy 220
4 CA F 1910 Frances 134
6 CA F 1910 Evelyn 126
8 CA F 1910 Virginia 101
In [18]:
# Or using .loc to filter a DataFrame by a Boolean array argument
babynames_first_10_rows.loc[[True, False, True, False, True, False, True, False, True, False], :]
Out[18]:
State Sex Year Name Count
0 CA F 1910 Mary 295
2 CA F 1910 Dorothy 220
4 CA F 1910 Frances 134
6 CA F 1910 Evelyn 126
8 CA F 1910 Virginia 101

Oftentimes, we'll use boolean selection to check for entries in a DataFrame that meet a particular condition.

In [19]:
# First, use a logical condition to generate a boolean Series
logical_operator = (babynames["Sex"] == "F")
logical_operator
Out[19]:
0          True
1          True
2          True
3          True
4          True
          ...  
407423    False
407424    False
407425    False
407426    False
407427    False
Name: Sex, Length: 407428, dtype: bool
In [20]:
# Then, use this boolean Series to filter the DataFrame
babynames[logical_operator]
Out[20]:
State Sex Year Name Count
0 CA F 1910 Mary 295
1 CA F 1910 Helen 239
2 CA F 1910 Dorothy 220
3 CA F 1910 Margaret 163
4 CA F 1910 Frances 134
... ... ... ... ... ...
239532 CA F 2022 Zemira 5
239533 CA F 2022 Ziggy 5
239534 CA F 2022 Zimal 5
239535 CA F 2022 Zosia 5
239536 CA F 2022 Zulay 5

239537 rows × 5 columns

Boolean selection also works with loc!

In [21]:
# Notice that we did not have to specify columns to select 
# If no columns are referenced, pandas will automatically select all columns
babynames.loc[babynames["Sex"] == "F"]
Out[21]:
State Sex Year Name Count
0 CA F 1910 Mary 295
1 CA F 1910 Helen 239
2 CA F 1910 Dorothy 220
3 CA F 1910 Margaret 163
4 CA F 1910 Frances 134
... ... ... ... ... ...
239532 CA F 2022 Zemira 5
239533 CA F 2022 Ziggy 5
239534 CA F 2022 Zimal 5
239535 CA F 2022 Zosia 5
239536 CA F 2022 Zulay 5

239537 rows × 5 columns

Bitwise Operators¶

To filter on multiple conditions, we combine boolean operators using bitwise comparisons.

Symbol Usage Meaning
~ ~p Returns negation of p
| p | q p OR q
& p & q p AND q
^ p ^ q p XOR q (exclusive or)
In [22]:
babynames[(babynames["Sex"] == "F") & (babynames["Year"] < 2000)]
Out[22]:
State Sex Year Name Count
0 CA F 1910 Mary 295
1 CA F 1910 Helen 239
2 CA F 1910 Dorothy 220
3 CA F 1910 Margaret 163
4 CA F 1910 Frances 134
... ... ... ... ... ...
149050 CA F 1999 Zareen 5
149051 CA F 1999 Zeinab 5
149052 CA F 1999 Zhane 5
149053 CA F 1999 Zoha 5
149054 CA F 1999 Zoila 5

149055 rows × 5 columns

In [23]:
babynames[(babynames["Sex"] == "F") | (babynames["Year"] < 2000)]
Out[23]:
State Sex Year Name Count
0 CA F 1910 Mary 295
1 CA F 1910 Helen 239
2 CA F 1910 Dorothy 220
3 CA F 1910 Margaret 163
4 CA F 1910 Frances 134
... ... ... ... ... ...
342435 CA M 1999 Yuuki 5
342436 CA M 1999 Zakariya 5
342437 CA M 1999 Zavier 5
342438 CA M 1999 Zayn 5
342439 CA M 1999 Zayne 5

342440 rows × 5 columns

Slido Exercises¶

We want to obtain the first three baby names with count > 250.

No description has been provided for this image

In [24]:
babynames.iloc[[0, 233, 484], [3, 4]]
Out[24]:
Name Count
0 Mary 295
233 Mary 390
484 Mary 534
In [25]:
babynames.loc[[0, 233, 484]]
Out[25]:
State Sex Year Name Count
0 CA F 1910 Mary 295
233 CA F 1911 Mary 390
484 CA F 1912 Mary 534
In [26]:
babynames.loc[babynames["Count"] > 250, ["Name", "Count"]].head(3)
Out[26]:
Name Count
0 Mary 295
233 Mary 390
484 Mary 534
In [27]:
babynames.loc[babynames["Count"] > 250, ["Name", "Count"]].iloc[0:2, :]
Out[27]:
Name Count
0 Mary 295
233 Mary 390

.isin for Selection based on a list, array, or Series¶

In [28]:
# Note: The parentheses surrounding the code make it possible to break the code into multiple lines for readability

(
    babynames[(babynames["Name"] == "Bella") | 
              (babynames["Name"] == "Alex") |
              (babynames["Name"] == "Narges") |
              (babynames["Name"] == "Lisa")]
)
Out[28]:
State Sex Year Name Count
6289 CA F 1923 Bella 5
7512 CA F 1925 Bella 8
12368 CA F 1932 Lisa 5
14741 CA F 1936 Lisa 8
17084 CA F 1939 Lisa 5
... ... ... ... ... ...
393248 CA M 2018 Alex 495
396111 CA M 2019 Alex 438
398983 CA M 2020 Alex 379
401788 CA M 2021 Alex 333
404663 CA M 2022 Alex 344

317 rows × 5 columns

In [29]:
# A more concise method to achieve the above: .isin
names = ["Bella", "Alex", "Narges", "Lisa"]
display(babynames["Name"].isin(names))
display(babynames[babynames["Name"].isin(names)])
0         False
1         False
2         False
3         False
4         False
          ...  
407423    False
407424    False
407425    False
407426    False
407427    False
Name: Name, Length: 407428, dtype: bool
State Sex Year Name Count
6289 CA F 1923 Bella 5
7512 CA F 1925 Bella 8
12368 CA F 1932 Lisa 5
14741 CA F 1936 Lisa 8
17084 CA F 1939 Lisa 5
... ... ... ... ... ...
393248 CA M 2018 Alex 495
396111 CA M 2019 Alex 438
398983 CA M 2020 Alex 379
401788 CA M 2021 Alex 333
404663 CA M 2022 Alex 344

317 rows × 5 columns

.str Functions for Defining a Condition¶

In [30]:
# What if we only want names that start with "N"?
display(babynames["Name"].str.startswith("N"))
display(babynames[babynames["Name"].str.startswith("N")])
0         False
1         False
2         False
3         False
4         False
          ...  
407423    False
407424    False
407425    False
407426    False
407427    False
Name: Name, Length: 407428, dtype: bool
State Sex Year Name Count
76 CA F 1910 Norma 23
83 CA F 1910 Nellie 20
127 CA F 1910 Nina 11
198 CA F 1910 Nora 6
310 CA F 1911 Nellie 23
... ... ... ... ... ...
407319 CA M 2022 Nilan 5
407320 CA M 2022 Niles 5
407321 CA M 2022 Nolen 5
407322 CA M 2022 Noriel 5
407323 CA M 2022 Norris 5

12229 rows × 5 columns

Adding, Removing, and Modifying Columns¶

Add a Column¶

To add a column, use [] to reference the desired new column, then assign it to a Series or array of appropriate length.

In [31]:
# Create a Series of the length of each name
babyname_lengths = babynames["Name"].str.len()

# Add a column named "name_lengths" that includes the length of each name
babynames["name_lengths"] = babyname_lengths

babynames
Out[31]:
State Sex Year Name Count name_lengths
0 CA F 1910 Mary 295 4
1 CA F 1910 Helen 239 5
2 CA F 1910 Dorothy 220 7
3 CA F 1910 Margaret 163 8
4 CA F 1910 Frances 134 7
... ... ... ... ... ... ...
407423 CA M 2022 Zayvier 5 7
407424 CA M 2022 Zia 5 3
407425 CA M 2022 Zora 5 4
407426 CA M 2022 Zuriel 5 6
407427 CA M 2022 Zylo 5 4

407428 rows × 6 columns

Modify a Column¶

To modify a column, use [] to access the desired column, then re-assign it to a new array or Series.

In [32]:
# Modify the "name_lengths" column to be one less than its original value
babynames["name_lengths"] = babynames["name_lengths"] - 1
babynames
Out[32]:
State Sex Year Name Count name_lengths
0 CA F 1910 Mary 295 3
1 CA F 1910 Helen 239 4
2 CA F 1910 Dorothy 220 6
3 CA F 1910 Margaret 163 7
4 CA F 1910 Frances 134 6
... ... ... ... ... ... ...
407423 CA M 2022 Zayvier 5 6
407424 CA M 2022 Zia 5 2
407425 CA M 2022 Zora 5 3
407426 CA M 2022 Zuriel 5 5
407427 CA M 2022 Zylo 5 3

407428 rows × 6 columns

Rename a Column Name¶

Rename a column using the .rename() method.

In [33]:
# Rename "name_lengths" to "Length"
babynames = babynames.rename(columns={"name_lengths":"Length"})
babynames
Out[33]:
State Sex Year Name Count Length
0 CA F 1910 Mary 295 3
1 CA F 1910 Helen 239 4
2 CA F 1910 Dorothy 220 6
3 CA F 1910 Margaret 163 7
4 CA F 1910 Frances 134 6
... ... ... ... ... ... ...
407423 CA M 2022 Zayvier 5 6
407424 CA M 2022 Zia 5 2
407425 CA M 2022 Zora 5 3
407426 CA M 2022 Zuriel 5 5
407427 CA M 2022 Zylo 5 3

407428 rows × 6 columns

Delete a Column¶

Remove a column using .drop().

In [34]:
# Remove our new "Length" column
babynames = babynames.drop("Length", axis="columns")
babynames
Out[34]:
State Sex Year Name Count
0 CA F 1910 Mary 295
1 CA F 1910 Helen 239
2 CA F 1910 Dorothy 220
3 CA F 1910 Margaret 163
4 CA F 1910 Frances 134
... ... ... ... ... ...
407423 CA M 2022 Zayvier 5
407424 CA M 2022 Zia 5
407425 CA M 2022 Zora 5
407426 CA M 2022 Zuriel 5
407427 CA M 2022 Zylo 5

407428 rows × 5 columns

Useful Utility Functions¶

NumPy¶

The NumPy functions you encountered in Data 8 are compatible with objects in pandas.

In [35]:
yash_counts = babynames[babynames["Name"] == "Yash"]["Count"]
yash_counts
Out[35]:
331824     8
334114     9
336390    11
338773    12
341387    10
343571    14
345767    24
348230    29
350889    24
353445    29
356221    25
358978    27
361831    29
364905    24
367867    23
370945    18
374055    14
376756    18
379660    18
383338     9
385903    12
388529    17
391485    16
394906    10
397874     9
400171    15
403092    13
406006    13
Name: Count, dtype: int64
In [36]:
# Average number of babies named Yash each year

np.mean(yash_counts)
Out[36]:
17.142857142857142
In [37]:
# Max number of babies named Yash born in any single year

max(yash_counts)
Out[37]:
29

Built-In pandas Methods¶

There are many, many utility functions built into pandas, far more than we can possibly cover in lecture. You are encouraged to explore all the functionality outlined in the pandas documentation.

In [38]:
# Returns the shape of the object in the format (num_rows, num_columns)
babynames.shape
Out[38]:
(407428, 5)
In [39]:
# Returns the total number of entries in the object, equal to num_rows * num_columns
babynames.size
Out[39]:
2037140
In [40]:
# What summary statistics can we describe?
babynames.describe()
Out[40]:
Year Count
count 407428.000000 407428.000000
mean 1985.733609 79.543456
std 27.007660 293.698654
min 1910.000000 5.000000
25% 1969.000000 7.000000
50% 1992.000000 13.000000
75% 2008.000000 38.000000
max 2022.000000 8260.000000
In [41]:
# Our statistics are slightly different when working with a Series
babynames["Sex"].describe()
Out[41]:
count     407428
unique         2
top            F
freq      239537
Name: Sex, dtype: object
In [42]:
# Randomly sample row(s) from the DataFrame
babynames.sample()
Out[42]:
State Sex Year Name Count
208579 CA F 2014 Rayah 7
In [43]:
# Rerun this cell a few times – you'll get different results!
babynames.sample(5).iloc[:, 2:]
Out[43]:
Year Name Count
92989 1983 Dayana 8
323162 1992 Kareem 27
58873 1969 Rebecca 1025
83706 1980 Marta 53
61539 1970 Rebeca 32
In [44]:
# Sampling with replacement
babynames[babynames["Year"] == 2000].sample(4, replace = True).iloc[:,2:]
Out[44]:
Year Name Count
150031 2000 Savana 31
344662 2000 Caeden 5
344849 2000 Marquel 5
344046 2000 Eusebio 8
In [45]:
# Count the number of times each unique value occurs in a Series
babynames["Name"].value_counts()
Out[45]:
Name
Jean         223
Francis      221
Guadalupe    218
Jessie       217
Marion       214
            ... 
Renesme        1
Purity         1
Olanna         1
Nohea          1
Zayvier        1
Name: count, Length: 20437, dtype: int64
In [46]:
# Return an array of all unique values in the Series
babynames["Name"].unique()
Out[46]:
array(['Mary', 'Helen', 'Dorothy', ..., 'Zae', 'Zai', 'Zayvier'],
      dtype=object)
In [47]:
# Sort a Series
babynames["Name"].sort_values()
Out[47]:
366001      Aadan
384005      Aadan
369120      Aadan
398211    Aadarsh
370306      Aaden
           ...   
220691      Zyrah
197529      Zyrah
217429      Zyrah
232167      Zyrah
404544      Zyrus
Name: Name, Length: 407428, dtype: object
In [48]:
# Sort a DataFrame – there are lots of Michaels in California
babynames.sort_values(by = "Count", ascending = False)
Out[48]:
State Sex Year Name Count
268041 CA M 1957 Michael 8260
267017 CA M 1956 Michael 8258
317387 CA M 1990 Michael 8246
281850 CA M 1969 Michael 8245
283146 CA M 1970 Michael 8196
... ... ... ... ... ...
317292 CA M 1989 Olegario 5
317291 CA M 1989 Norbert 5
317290 CA M 1989 Niles 5
317289 CA M 1989 Nikola 5
407427 CA M 2022 Zylo 5

407428 rows × 5 columns

Custom sorting¶

Approach 1: Create a temporary column¶

In [49]:
# Create a Series of the length of each name
babyname_lengths = babynames["Name"].str.len()

# Add a column named "name_lengths" that includes the length of each name
babynames["name_lengths"] = babyname_lengths
babynames.head(5)
Out[49]:
State Sex Year Name Count name_lengths
0 CA F 1910 Mary 295 4
1 CA F 1910 Helen 239 5
2 CA F 1910 Dorothy 220 7
3 CA F 1910 Margaret 163 8
4 CA F 1910 Frances 134 7
In [50]:
# Sort by the temporary column
babynames = babynames.sort_values(by = "name_lengths", ascending=False)
babynames.head(5)
Out[50]:
State Sex Year Name Count name_lengths
334166 CA M 1996 Franciscojavier 8 15
337301 CA M 1997 Franciscojavier 5 15
339472 CA M 1998 Franciscojavier 6 15
321792 CA M 1991 Ryanchristopher 7 15
327358 CA M 1993 Johnchristopher 5 15
In [51]:
# Drop the 'name_length' column
babynames = babynames.drop("name_lengths", axis = 'columns')
babynames.head(5)
Out[51]:
State Sex Year Name Count
334166 CA M 1996 Franciscojavier 8
337301 CA M 1997 Franciscojavier 5
339472 CA M 1998 Franciscojavier 6
321792 CA M 1991 Ryanchristopher 7
327358 CA M 1993 Johnchristopher 5

Approach 2: Sorting using the key argument¶

In [52]:
babynames.sort_values("Name", key = lambda x:x.str.len(), ascending = False).head()
Out[52]:
State Sex Year Name Count
334166 CA M 1996 Franciscojavier 8
327472 CA M 1993 Ryanchristopher 5
337301 CA M 1997 Franciscojavier 5
337477 CA M 1997 Ryanchristopher 5
312543 CA M 1987 Franciscojavier 5

Approach 3: Sorting Using the map Function¶

We can also use the Python map function if we want to use an arbitrarily defined function. Suppose we want to sort by the number of occurrences of "dr" plus the number of occurences of "ea".

In [53]:
# First, define a function to count the number of times "dr" or "ea" appear in each name
def dr_ea_count(string):
    return string.count('dr') + string.count('ea')

# Then, use `map` to apply `dr_ea_count` to each name in the "Name" column
babynames["dr_ea_count"] = babynames["Name"].map(dr_ea_count)

# Sort the DataFrame by the new "dr_ea_count" column so we can see our handiwork
babynames = babynames.sort_values(by = "dr_ea_count", ascending=False)
babynames.head()
Out[53]:
State Sex Year Name Count dr_ea_count
115957 CA F 1990 Deandrea 5 3
101976 CA F 1986 Deandrea 6 3
131029 CA F 1994 Leandrea 5 3
108731 CA F 1988 Deandrea 5 3
308131 CA M 1985 Deandrea 6 3
In [54]:
# Drop the `dr_ea_count` column
babynames = babynames.drop("dr_ea_count", axis = 'columns')
babynames.head(5)
Out[54]:
State Sex Year Name Count
115957 CA F 1990 Deandrea 5
101976 CA F 1986 Deandrea 6
131029 CA F 1994 Leandrea 5
108731 CA F 1988 Deandrea 5
308131 CA M 1985 Deandrea 6

Grouping¶

Group rows that share a common feature, then aggregate data across the group.

In this example, we count the total number of babies born in each year (considering only a small subset of the data, for simplicity).

No description has been provided for this image

In [55]:
# The code below uses the full babynames dataset, which is why some numbers are different relative to the diagram
babynames[["Year", "Count"]].groupby("Year").agg(sum)
Out[55]:
Count
Year
1910 9163
1911 9983
1912 17946
1913 22094
1914 26926
... ...
2018 395436
2019 386996
2020 362882
2021 362582
2022 360023

113 rows × 1 columns

There are many different aggregation functions we can use, all of which are useful in different applications.

In [56]:
# What is the earliest year in which each name appeared?
babynames.groupby("Name")[["Year"]].agg(min)
Out[56]:
Year
Name
Aadan 2008
Aadarsh 2019
Aaden 2007
Aadhav 2014
Aadhini 2022
... ...
Zymir 2020
Zyon 1999
Zyra 2012
Zyrah 2011
Zyrus 2021

20437 rows × 1 columns

In [57]:
# What is the largest single-year count of each name?
babynames.groupby("Name")[["Count"]].agg(max)
Out[57]:
Count
Name
Aadan 7
Aadarsh 6
Aaden 158
Aadhav 8
Aadhini 6
... ...
Zymir 5
Zyon 17
Zyra 16
Zyrah 6
Zyrus 5

20437 rows × 1 columns

Case Study: Name "Popularity"¶

In this exercise, let's find the name with sex "F" that has dropped most in popularity since its peak usage. We'll start by filtering babynames to only include names corresponding to sex "F".

In [58]:
f_babynames = babynames[babynames["Sex"] == "F"]
f_babynames
Out[58]:
State Sex Year Name Count
115957 CA F 1990 Deandrea 5
101976 CA F 1986 Deandrea 6
131029 CA F 1994 Leandrea 5
108731 CA F 1988 Deandrea 5
193104 CA F 2010 Deandra 5
... ... ... ... ... ...
192320 CA F 2010 Annaly 7
212793 CA F 2015 Iyanna 6
202350 CA F 2013 Lailah 50
212784 CA F 2015 Helene 6
212787 CA F 2015 Holley 6

239537 rows × 5 columns

In [59]:
# We sort the data by year
f_babynames = f_babynames.sort_values("Year")
f_babynames
Out[59]:
State Sex Year Name Count
23 CA F 1910 Bernice 59
219 CA F 1910 Katharine 5
170 CA F 1910 Clarice 7
26 CA F 1910 Doris 56
30 CA F 1910 Ethel 52
... ... ... ... ... ...
236551 CA F 2022 Dani 47
235914 CA F 2022 Iris 361
235918 CA F 2022 Lucy 345
235973 CA F 2022 Sage 226
239452 CA F 2022 Rylin 5

239537 rows × 5 columns

To build our intuition on how to answer our research question, let's visualize the prevalence of the name "Jennifer" over time.

In [60]:
# We'll talk about how to generate plots in a later lecture
fig = px.line(f_babynames[f_babynames["Name"] == "Jennifer"],
              x = "Year", y = "Count")
fig.update_layout(font_size = 18, 
                  autosize=False, 
                  width=1000, 
                  height=400)

We'll need a mathematical definition for the change in popularity of a name.

Define the metric "ratio to peak" (RTP). We'll calculate this as the count of the name in 2022 (the most recent year for which we have data) divided by the largest count of this name in any year.

A demo calculation for Jennifer:

In [61]:
# In the year with the highest Jennifer count, 6065 Jennifers were born
max_jenn = np.max(f_babynames[f_babynames["Name"] == "Jennifer"]["Count"])
max_jenn
Out[61]:
6065
In [62]:
# Remember that we sorted f_babynames by year. 
# This means that grabbing the final entry gives us the most recent count of Jennifers: 114
# In 2022, the most recent year for which we have data, 114 Jennifers were born
curr_jenn = f_babynames[f_babynames["Name"] == "Jennifer"]["Count"].iloc[-1]
curr_jenn
Out[62]:
114
In [63]:
# Compute the RTP
curr_jenn / max_jenn
Out[63]:
0.018796372629843364

We can also write a function that produces the ratio_to_peakfor a given Series. This will allow us to use .groupby to speed up our computation for all names in the dataset.

In [64]:
def ratio_to_peak(series):
    """
    Compute the RTP for a Series containing the counts per year for a single name
    """
    return series.iloc[-1] / np.max(series)
In [65]:
# Construct a Series containing our Jennifer count data
jenn_counts_ser = f_babynames[f_babynames["Name"] == "Jennifer"]["Count"]

# Then, find the RTP
ratio_to_peak(jenn_counts_ser)
Out[65]:
0.018796372629843364

Now, let's use .groupby to compute the RTPs for all names in the dataset.

You may see a warning message when running the cell below. As discussed in lecture, pandas can't apply an aggregation function to non-numeric data (it doens't make sense to divide "CA" by a number). By default, .groupby will drop any columns that cannot be aggregated.

In [66]:
rtp_table = f_babynames.groupby("Name")[["Year", "Count"]].agg(ratio_to_peak)
rtp_table
Out[66]:
Year Count
Name
Aadhini 1.0 1.000000
Aadhira 1.0 0.500000
Aadhya 1.0 0.660000
Aadya 1.0 0.586207
Aahana 1.0 0.269231
... ... ...
Zyanya 1.0 0.466667
Zyla 1.0 1.000000
Zylah 1.0 1.000000
Zyra 1.0 1.000000
Zyrah 1.0 0.833333

13782 rows × 2 columns

In [67]:
# Results in a TypeError
# rtp_table = f_babynames.groupby("Name").agg(ratio_to_peak)
# rtp_table

This is the pandas equivalent of .group from Data 8. If we wanted to achieve this same result using the datascience library, we would write:

f_babynames.group("Name", ratio_to_peak)

To avoid the warning message above, we explicitly extract only the columns relevant to our analysis before using .agg.

In [68]:
# Recompute the RTPs, but only performing the calculation on the "Count" column
rtp_table = f_babynames.groupby("Name")[["Count"]].agg(ratio_to_peak)
rtp_table
Out[68]:
Count
Name
Aadhini 1.000000
Aadhira 0.500000
Aadhya 0.660000
Aadya 0.586207
Aahana 0.269231
... ...
Zyanya 0.466667
Zyla 1.000000
Zylah 1.000000
Zyra 1.000000
Zyrah 0.833333

13782 rows × 1 columns

In [69]:
# Rename "Count" to "Count RTP" for clarity
rtp_table = rtp_table.rename(columns = {"Count": "Count RTP"})
rtp_table
Out[69]:
Count RTP
Name
Aadhini 1.000000
Aadhira 0.500000
Aadhya 0.660000
Aadya 0.586207
Aahana 0.269231
... ...
Zyanya 0.466667
Zyla 1.000000
Zylah 1.000000
Zyra 1.000000
Zyrah 0.833333

13782 rows × 1 columns

In [70]:
# What name has fallen the most in popularity?
rtp_table.sort_values("Count RTP")
Out[70]:
Count RTP
Name
Debra 0.001260
Debbie 0.002815
Carol 0.003180
Tammy 0.003249
Susan 0.003305
... ...
Fidelia 1.000000
Naveyah 1.000000
Finlee 1.000000
Roseline 1.000000
Aadhini 1.000000

13782 rows × 1 columns

We can visualize the decrease in the popularity of the name "Debra:"

In [71]:
def plot_name(*names):
    fig = px.line(f_babynames[f_babynames["Name"].isin(names)], 
                  x = "Year", y = "Count", color="Name",
                  title=f"Popularity for: {names}")
    fig.update_layout(font_size = 18, 
                  autosize=False, 
                  width=1000, 
                  height=400)
    return fig

plot_name("Debra")
In [72]:
# Find the 10 names that have decreased the most in popularity
top10 = rtp_table.sort_values("Count RTP").head(10).index
top10
Out[72]:
Index(['Debra', 'Debbie', 'Carol', 'Tammy', 'Susan', 'Cheryl', 'Shannon',
       'Tina', 'Michele', 'Terri'],
      dtype='object', name='Name')
In [73]:
plot_name(*top10)

For fun, try plotting your name or your friends' names.