Lecture 3 – Data 100, Fall 2023¶
Data 100, Fall 2023
A demonstration of advanced pandas
syntax to accompany Lecture 3.
import numpy as np
import pandas as pd
import plotly.express as px
# Loading the elections DataFrame
elections = pd.read_csv("data/elections.csv")
elections.head()
Year | Candidate | Party | Popular vote | Result | % | |
---|---|---|---|---|---|---|
0 | 1824 | Andrew Jackson | Democratic-Republican | 151271 | loss | 57.210122 |
1 | 1824 | John Quincy Adams | Democratic-Republican | 113142 | win | 42.789878 |
2 | 1828 | Andrew Jackson | Democratic | 642806 | win | 56.203927 |
3 | 1828 | John Quincy Adams | National Republican | 500897 | loss | 43.796073 |
4 | 1832 | Andrew Jackson | Democratic | 702735 | win | 54.574789 |
iloc
selects items by row and column integer position.
Arguments to .iloc
can be:
- A list.
- A slice (syntax is exclusive of the right hand side of the slice).
- A single value.
# Select the rows at positions 1, 2, and 3.
# Select the columns at positions 0, 1, and 2.
# Remember that Python indexing begins at position 0!
elections.iloc[[1, 2, 3], [0, 1, 2]]
Year | Candidate | Party | |
---|---|---|---|
1 | 1824 | John Quincy Adams | Democratic-Republican |
2 | 1828 | Andrew Jackson | Democratic |
3 | 1828 | John Quincy Adams | National Republican |
# Index-based extraction using a list of rows and a slice of column indices
elections.iloc[[1, 2, 3], 0:3]
Year | Candidate | Party | |
---|---|---|---|
1 | 1824 | John Quincy Adams | Democratic-Republican |
2 | 1828 | Andrew Jackson | Democratic |
3 | 1828 | John Quincy Adams | National Republican |
# Selecting all rows using a colon
elections.iloc[:, 0:3]
Year | Candidate | Party | |
---|---|---|---|
0 | 1824 | Andrew Jackson | Democratic-Republican |
1 | 1824 | John Quincy Adams | Democratic-Republican |
2 | 1828 | Andrew Jackson | Democratic |
3 | 1828 | John Quincy Adams | National Republican |
4 | 1832 | Andrew Jackson | Democratic |
... | ... | ... | ... |
177 | 2016 | Jill Stein | Green |
178 | 2020 | Joseph Biden | Democratic |
179 | 2020 | Donald Trump | Republican |
180 | 2020 | Jo Jorgensen | Libertarian |
181 | 2020 | Howard Hawkins | Green |
182 rows × 3 columns
elections.iloc[[1, 2, 3], 1]
1 John Quincy Adams 2 Andrew Jackson 3 John Quincy Adams Name: Candidate, dtype: object
# Extracting the value at row 0 and the second column
elections.iloc[0,1]
'Andrew Jackson'
Context-dependent Extraction using []
¶
We could technically do anything we want using loc
or iloc
. However, in practice, the []
operator is often used instead to yield more concise code.
[]
is a bit trickier to understand than loc
or iloc
, but it achieves essentially the same functionality. The difference is that []
is context-dependent.
[]
only takes one argument, which may be:
- A slice of row integers.
- A list of column labels.
- A single column label.
If we provide a slice of row numbers, we get the numbered rows.
elections[3:7]
Year | Candidate | Party | Popular vote | Result | % | |
---|---|---|---|---|---|---|
3 | 1828 | John Quincy Adams | National Republican | 500897 | loss | 43.796073 |
4 | 1832 | Andrew Jackson | Democratic | 702735 | win | 54.574789 |
5 | 1832 | Henry Clay | National Republican | 484205 | loss | 37.603628 |
6 | 1832 | William Wirt | Anti-Masonic | 100715 | loss | 7.821583 |
If we provide a list of column names, we get the listed columns.
elections[["Year", "Candidate", "Result"]]
Year | Candidate | Result | |
---|---|---|---|
0 | 1824 | Andrew Jackson | loss |
1 | 1824 | John Quincy Adams | win |
2 | 1828 | Andrew Jackson | win |
3 | 1828 | John Quincy Adams | loss |
4 | 1832 | Andrew Jackson | win |
... | ... | ... | ... |
177 | 2016 | Jill Stein | loss |
178 | 2020 | Joseph Biden | win |
179 | 2020 | Donald Trump | loss |
180 | 2020 | Jo Jorgensen | loss |
181 | 2020 | Howard Hawkins | loss |
182 rows × 3 columns
And if we provide a single column name we get back just that column, stored as a Series
.
elections["Candidate"]
0 Andrew Jackson 1 John Quincy Adams 2 Andrew Jackson 3 John Quincy Adams 4 Andrew Jackson ... 177 Jill Stein 178 Joseph Biden 179 Donald Trump 180 Jo Jorgensen 181 Howard Hawkins Name: Candidate, Length: 182, dtype: object
Slido Exercise¶
weird = pd.DataFrame({
1:["topdog","botdog"],
"1":["topcat","botcat"]
})
weird
1 | 1 | |
---|---|---|
0 | topdog | topcat |
1 | botdog | botcat |
weird[1]
0 topdog 1 botdog Name: 1, dtype: object
weird["1"]
0 topcat 1 botcat Name: 1, dtype: object
weird[1:]
1 | 1 | |
---|---|---|
1 | botdog | botcat |
Dataset: California baby names¶
In today's lecture, we'll work with the babynames
dataset, which contains information about the names of infants born in California.
The cell below pulls census data from a government website and then loads it into a usable form. The code shown here is outside of the scope of Data 100, but you're encouraged to dig into it if you are interested!
import urllib.request
import os.path
import zipfile
data_url = "https://www.ssa.gov/oact/babynames/state/namesbystate.zip"
local_filename = "data/babynamesbystate.zip"
if not os.path.exists(local_filename): # If the data exists don't download again
with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
f.write(resp.read())
zf = zipfile.ZipFile(local_filename, 'r')
ca_name = 'STATE.CA.TXT'
field_names = ['State', 'Sex', 'Year', 'Name', 'Count']
with zf.open(ca_name) as fh:
babynames = pd.read_csv(fh, header=None, names=field_names)
babynames.head()
State | Sex | Year | Name | Count | |
---|---|---|---|---|---|
0 | CA | F | 1910 | Mary | 295 |
1 | CA | F | 1910 | Helen | 239 |
2 | CA | F | 1910 | Dorothy | 220 |
3 | CA | F | 1910 | Margaret | 163 |
4 | CA | F | 1910 | Frances | 134 |
Conditional Selection¶
# Ask yourself: Why is :9 is the correct slice to select the first 10 rows?
babynames_first_10_rows = babynames.loc[:9, :]
babynames_first_10_rows
State | Sex | Year | Name | Count | |
---|---|---|---|---|---|
0 | CA | F | 1910 | Mary | 295 |
1 | CA | F | 1910 | Helen | 239 |
2 | CA | F | 1910 | Dorothy | 220 |
3 | CA | F | 1910 | Margaret | 163 |
4 | CA | F | 1910 | Frances | 134 |
5 | CA | F | 1910 | Ruth | 128 |
6 | CA | F | 1910 | Evelyn | 126 |
7 | CA | F | 1910 | Alice | 118 |
8 | CA | F | 1910 | Virginia | 101 |
9 | CA | F | 1910 | Elizabeth | 93 |
By passing in a sequence (list, array, or Series
) of boolean values, we can extract a subset of the rows in a DataFrame
. We will keep only the rows that correspond to a boolean value of True
.
# Notice how we have exactly 10 elements in our boolean array argument.
babynames_first_10_rows[[True, False, True, False, True, False, True, False, True, False]]
State | Sex | Year | Name | Count | |
---|---|---|---|---|---|
0 | CA | F | 1910 | Mary | 295 |
2 | CA | F | 1910 | Dorothy | 220 |
4 | CA | F | 1910 | Frances | 134 |
6 | CA | F | 1910 | Evelyn | 126 |
8 | CA | F | 1910 | Virginia | 101 |
# Or using .loc to filter a DataFrame by a Boolean array argument
babynames_first_10_rows.loc[[True, False, True, False, True, False, True, False, True, False], :]
State | Sex | Year | Name | Count | |
---|---|---|---|---|---|
0 | CA | F | 1910 | Mary | 295 |
2 | CA | F | 1910 | Dorothy | 220 |
4 | CA | F | 1910 | Frances | 134 |
6 | CA | F | 1910 | Evelyn | 126 |
8 | CA | F | 1910 | Virginia | 101 |
Oftentimes, we'll use boolean selection to check for entries in a DataFrame
that meet a particular condition.
# First, use a logical condition to generate a boolean Series
logical_operator = (babynames["Sex"] == "F")
logical_operator
0 True 1 True 2 True 3 True 4 True ... 407423 False 407424 False 407425 False 407426 False 407427 False Name: Sex, Length: 407428, dtype: bool
# Then, use this boolean Series to filter the DataFrame
babynames[logical_operator]
State | Sex | Year | Name | Count | |
---|---|---|---|---|---|
0 | CA | F | 1910 | Mary | 295 |
1 | CA | F | 1910 | Helen | 239 |
2 | CA | F | 1910 | Dorothy | 220 |
3 | CA | F | 1910 | Margaret | 163 |
4 | CA | F | 1910 | Frances | 134 |
... | ... | ... | ... | ... | ... |
239532 | CA | F | 2022 | Zemira | 5 |
239533 | CA | F | 2022 | Ziggy | 5 |
239534 | CA | F | 2022 | Zimal | 5 |
239535 | CA | F | 2022 | Zosia | 5 |
239536 | CA | F | 2022 | Zulay | 5 |
239537 rows × 5 columns
Boolean selection also works with loc
!
# Notice that we did not have to specify columns to select
# If no columns are referenced, pandas will automatically select all columns
babynames.loc[babynames["Sex"] == "F"]
State | Sex | Year | Name | Count | |
---|---|---|---|---|---|
0 | CA | F | 1910 | Mary | 295 |
1 | CA | F | 1910 | Helen | 239 |
2 | CA | F | 1910 | Dorothy | 220 |
3 | CA | F | 1910 | Margaret | 163 |
4 | CA | F | 1910 | Frances | 134 |
... | ... | ... | ... | ... | ... |
239532 | CA | F | 2022 | Zemira | 5 |
239533 | CA | F | 2022 | Ziggy | 5 |
239534 | CA | F | 2022 | Zimal | 5 |
239535 | CA | F | 2022 | Zosia | 5 |
239536 | CA | F | 2022 | Zulay | 5 |
239537 rows × 5 columns
Bitwise Operators¶
To filter on multiple conditions, we combine boolean operators using bitwise comparisons.
Symbol | Usage | Meaning |
---|---|---|
~ | ~p | Returns negation of p |
| | p | q | p OR q |
& | p & q | p AND q |
^ | p ^ q | p XOR q (exclusive or) |
babynames[(babynames["Sex"] == "F") & (babynames["Year"] < 2000)]
State | Sex | Year | Name | Count | |
---|---|---|---|---|---|
0 | CA | F | 1910 | Mary | 295 |
1 | CA | F | 1910 | Helen | 239 |
2 | CA | F | 1910 | Dorothy | 220 |
3 | CA | F | 1910 | Margaret | 163 |
4 | CA | F | 1910 | Frances | 134 |
... | ... | ... | ... | ... | ... |
149050 | CA | F | 1999 | Zareen | 5 |
149051 | CA | F | 1999 | Zeinab | 5 |
149052 | CA | F | 1999 | Zhane | 5 |
149053 | CA | F | 1999 | Zoha | 5 |
149054 | CA | F | 1999 | Zoila | 5 |
149055 rows × 5 columns
babynames[(babynames["Sex"] == "F") | (babynames["Year"] < 2000)]
State | Sex | Year | Name | Count | |
---|---|---|---|---|---|
0 | CA | F | 1910 | Mary | 295 |
1 | CA | F | 1910 | Helen | 239 |
2 | CA | F | 1910 | Dorothy | 220 |
3 | CA | F | 1910 | Margaret | 163 |
4 | CA | F | 1910 | Frances | 134 |
... | ... | ... | ... | ... | ... |
342435 | CA | M | 1999 | Yuuki | 5 |
342436 | CA | M | 1999 | Zakariya | 5 |
342437 | CA | M | 1999 | Zavier | 5 |
342438 | CA | M | 1999 | Zayn | 5 |
342439 | CA | M | 1999 | Zayne | 5 |
342440 rows × 5 columns
Slido Exercises¶
We want to obtain the first three baby names with count > 250
.
babynames.iloc[[0, 233, 484], [3, 4]]
Name | Count | |
---|---|---|
0 | Mary | 295 |
233 | Mary | 390 |
484 | Mary | 534 |
babynames.loc[[0, 233, 484]]
State | Sex | Year | Name | Count | |
---|---|---|---|---|---|
0 | CA | F | 1910 | Mary | 295 |
233 | CA | F | 1911 | Mary | 390 |
484 | CA | F | 1912 | Mary | 534 |
babynames.loc[babynames["Count"] > 250, ["Name", "Count"]].head(3)
Name | Count | |
---|---|---|
0 | Mary | 295 |
233 | Mary | 390 |
484 | Mary | 534 |
babynames.loc[babynames["Count"] > 250, ["Name", "Count"]].iloc[0:2, :]
Name | Count | |
---|---|---|
0 | Mary | 295 |
233 | Mary | 390 |
.isin
for Selection based on a list, array, or Series
¶
# Note: The parentheses surrounding the code make it possible to break the code into multiple lines for readability
(
babynames[(babynames["Name"] == "Bella") |
(babynames["Name"] == "Alex") |
(babynames["Name"] == "Narges") |
(babynames["Name"] == "Lisa")]
)
State | Sex | Year | Name | Count | |
---|---|---|---|---|---|
6289 | CA | F | 1923 | Bella | 5 |
7512 | CA | F | 1925 | Bella | 8 |
12368 | CA | F | 1932 | Lisa | 5 |
14741 | CA | F | 1936 | Lisa | 8 |
17084 | CA | F | 1939 | Lisa | 5 |
... | ... | ... | ... | ... | ... |
393248 | CA | M | 2018 | Alex | 495 |
396111 | CA | M | 2019 | Alex | 438 |
398983 | CA | M | 2020 | Alex | 379 |
401788 | CA | M | 2021 | Alex | 333 |
404663 | CA | M | 2022 | Alex | 344 |
317 rows × 5 columns
# A more concise method to achieve the above: .isin
names = ["Bella", "Alex", "Narges", "Lisa"]
display(babynames["Name"].isin(names))
display(babynames[babynames["Name"].isin(names)])
0 False 1 False 2 False 3 False 4 False ... 407423 False 407424 False 407425 False 407426 False 407427 False Name: Name, Length: 407428, dtype: bool
State | Sex | Year | Name | Count | |
---|---|---|---|---|---|
6289 | CA | F | 1923 | Bella | 5 |
7512 | CA | F | 1925 | Bella | 8 |
12368 | CA | F | 1932 | Lisa | 5 |
14741 | CA | F | 1936 | Lisa | 8 |
17084 | CA | F | 1939 | Lisa | 5 |
... | ... | ... | ... | ... | ... |
393248 | CA | M | 2018 | Alex | 495 |
396111 | CA | M | 2019 | Alex | 438 |
398983 | CA | M | 2020 | Alex | 379 |
401788 | CA | M | 2021 | Alex | 333 |
404663 | CA | M | 2022 | Alex | 344 |
317 rows × 5 columns
.str
Functions for Defining a Condition¶
# What if we only want names that start with "N"?
display(babynames["Name"].str.startswith("N"))
display(babynames[babynames["Name"].str.startswith("N")])
0 False 1 False 2 False 3 False 4 False ... 407423 False 407424 False 407425 False 407426 False 407427 False Name: Name, Length: 407428, dtype: bool
State | Sex | Year | Name | Count | |
---|---|---|---|---|---|
76 | CA | F | 1910 | Norma | 23 |
83 | CA | F | 1910 | Nellie | 20 |
127 | CA | F | 1910 | Nina | 11 |
198 | CA | F | 1910 | Nora | 6 |
310 | CA | F | 1911 | Nellie | 23 |
... | ... | ... | ... | ... | ... |
407319 | CA | M | 2022 | Nilan | 5 |
407320 | CA | M | 2022 | Niles | 5 |
407321 | CA | M | 2022 | Nolen | 5 |
407322 | CA | M | 2022 | Noriel | 5 |
407323 | CA | M | 2022 | Norris | 5 |
12229 rows × 5 columns
Adding, Removing, and Modifying Columns¶
Add a Column¶
To add a column, use []
to reference the desired new column, then assign it to a Series
or array of appropriate length.
# Create a Series of the length of each name
babyname_lengths = babynames["Name"].str.len()
# Add a column named "name_lengths" that includes the length of each name
babynames["name_lengths"] = babyname_lengths
babynames
State | Sex | Year | Name | Count | name_lengths | |
---|---|---|---|---|---|---|
0 | CA | F | 1910 | Mary | 295 | 4 |
1 | CA | F | 1910 | Helen | 239 | 5 |
2 | CA | F | 1910 | Dorothy | 220 | 7 |
3 | CA | F | 1910 | Margaret | 163 | 8 |
4 | CA | F | 1910 | Frances | 134 | 7 |
... | ... | ... | ... | ... | ... | ... |
407423 | CA | M | 2022 | Zayvier | 5 | 7 |
407424 | CA | M | 2022 | Zia | 5 | 3 |
407425 | CA | M | 2022 | Zora | 5 | 4 |
407426 | CA | M | 2022 | Zuriel | 5 | 6 |
407427 | CA | M | 2022 | Zylo | 5 | 4 |
407428 rows × 6 columns
Modify a Column¶
To modify a column, use []
to access the desired column, then re-assign it to a new array or Series.
# Modify the "name_lengths" column to be one less than its original value
babynames["name_lengths"] = babynames["name_lengths"] - 1
babynames
State | Sex | Year | Name | Count | name_lengths | |
---|---|---|---|---|---|---|
0 | CA | F | 1910 | Mary | 295 | 3 |
1 | CA | F | 1910 | Helen | 239 | 4 |
2 | CA | F | 1910 | Dorothy | 220 | 6 |
3 | CA | F | 1910 | Margaret | 163 | 7 |
4 | CA | F | 1910 | Frances | 134 | 6 |
... | ... | ... | ... | ... | ... | ... |
407423 | CA | M | 2022 | Zayvier | 5 | 6 |
407424 | CA | M | 2022 | Zia | 5 | 2 |
407425 | CA | M | 2022 | Zora | 5 | 3 |
407426 | CA | M | 2022 | Zuriel | 5 | 5 |
407427 | CA | M | 2022 | Zylo | 5 | 3 |
407428 rows × 6 columns
Rename a Column Name¶
Rename a column using the .rename()
method.
# Rename "name_lengths" to "Length"
babynames = babynames.rename(columns={"name_lengths":"Length"})
babynames
State | Sex | Year | Name | Count | Length | |
---|---|---|---|---|---|---|
0 | CA | F | 1910 | Mary | 295 | 3 |
1 | CA | F | 1910 | Helen | 239 | 4 |
2 | CA | F | 1910 | Dorothy | 220 | 6 |
3 | CA | F | 1910 | Margaret | 163 | 7 |
4 | CA | F | 1910 | Frances | 134 | 6 |
... | ... | ... | ... | ... | ... | ... |
407423 | CA | M | 2022 | Zayvier | 5 | 6 |
407424 | CA | M | 2022 | Zia | 5 | 2 |
407425 | CA | M | 2022 | Zora | 5 | 3 |
407426 | CA | M | 2022 | Zuriel | 5 | 5 |
407427 | CA | M | 2022 | Zylo | 5 | 3 |
407428 rows × 6 columns
Delete a Column¶
Remove a column using .drop()
.
# Remove our new "Length" column
babynames = babynames.drop("Length", axis="columns")
babynames
State | Sex | Year | Name | Count | |
---|---|---|---|---|---|
0 | CA | F | 1910 | Mary | 295 |
1 | CA | F | 1910 | Helen | 239 |
2 | CA | F | 1910 | Dorothy | 220 |
3 | CA | F | 1910 | Margaret | 163 |
4 | CA | F | 1910 | Frances | 134 |
... | ... | ... | ... | ... | ... |
407423 | CA | M | 2022 | Zayvier | 5 |
407424 | CA | M | 2022 | Zia | 5 |
407425 | CA | M | 2022 | Zora | 5 |
407426 | CA | M | 2022 | Zuriel | 5 |
407427 | CA | M | 2022 | Zylo | 5 |
407428 rows × 5 columns
Useful Utility Functions¶
NumPy
¶
The NumPy
functions you encountered in Data 8 are compatible with objects in pandas
.
yash_counts = babynames[babynames["Name"] == "Yash"]["Count"]
yash_counts
331824 8 334114 9 336390 11 338773 12 341387 10 343571 14 345767 24 348230 29 350889 24 353445 29 356221 25 358978 27 361831 29 364905 24 367867 23 370945 18 374055 14 376756 18 379660 18 383338 9 385903 12 388529 17 391485 16 394906 10 397874 9 400171 15 403092 13 406006 13 Name: Count, dtype: int64
# Average number of babies named Yash each year
np.mean(yash_counts)
17.142857142857142
# Max number of babies named Yash born in any single year
max(yash_counts)
29
Built-In pandas
Methods¶
There are many, many utility functions built into pandas
, far more than we can possibly cover in lecture. You are encouraged to explore all the functionality outlined in the pandas
documentation.
# Returns the shape of the object in the format (num_rows, num_columns)
babynames.shape
(407428, 5)
# Returns the total number of entries in the object, equal to num_rows * num_columns
babynames.size
2037140
# What summary statistics can we describe?
babynames.describe()
Year | Count | |
---|---|---|
count | 407428.000000 | 407428.000000 |
mean | 1985.733609 | 79.543456 |
std | 27.007660 | 293.698654 |
min | 1910.000000 | 5.000000 |
25% | 1969.000000 | 7.000000 |
50% | 1992.000000 | 13.000000 |
75% | 2008.000000 | 38.000000 |
max | 2022.000000 | 8260.000000 |
# Our statistics are slightly different when working with a Series
babynames["Sex"].describe()
count 407428 unique 2 top F freq 239537 Name: Sex, dtype: object
# Randomly sample row(s) from the DataFrame
babynames.sample()
State | Sex | Year | Name | Count | |
---|---|---|---|---|---|
208579 | CA | F | 2014 | Rayah | 7 |
# Rerun this cell a few times – you'll get different results!
babynames.sample(5).iloc[:, 2:]
Year | Name | Count | |
---|---|---|---|
92989 | 1983 | Dayana | 8 |
323162 | 1992 | Kareem | 27 |
58873 | 1969 | Rebecca | 1025 |
83706 | 1980 | Marta | 53 |
61539 | 1970 | Rebeca | 32 |
# Sampling with replacement
babynames[babynames["Year"] == 2000].sample(4, replace = True).iloc[:,2:]
Year | Name | Count | |
---|---|---|---|
150031 | 2000 | Savana | 31 |
344662 | 2000 | Caeden | 5 |
344849 | 2000 | Marquel | 5 |
344046 | 2000 | Eusebio | 8 |
# Count the number of times each unique value occurs in a Series
babynames["Name"].value_counts()
Name Jean 223 Francis 221 Guadalupe 218 Jessie 217 Marion 214 ... Renesme 1 Purity 1 Olanna 1 Nohea 1 Zayvier 1 Name: count, Length: 20437, dtype: int64
# Return an array of all unique values in the Series
babynames["Name"].unique()
array(['Mary', 'Helen', 'Dorothy', ..., 'Zae', 'Zai', 'Zayvier'], dtype=object)
# Sort a Series
babynames["Name"].sort_values()
366001 Aadan 384005 Aadan 369120 Aadan 398211 Aadarsh 370306 Aaden ... 220691 Zyrah 197529 Zyrah 217429 Zyrah 232167 Zyrah 404544 Zyrus Name: Name, Length: 407428, dtype: object
# Sort a DataFrame – there are lots of Michaels in California
babynames.sort_values(by = "Count", ascending = False)
State | Sex | Year | Name | Count | |
---|---|---|---|---|---|
268041 | CA | M | 1957 | Michael | 8260 |
267017 | CA | M | 1956 | Michael | 8258 |
317387 | CA | M | 1990 | Michael | 8246 |
281850 | CA | M | 1969 | Michael | 8245 |
283146 | CA | M | 1970 | Michael | 8196 |
... | ... | ... | ... | ... | ... |
317292 | CA | M | 1989 | Olegario | 5 |
317291 | CA | M | 1989 | Norbert | 5 |
317290 | CA | M | 1989 | Niles | 5 |
317289 | CA | M | 1989 | Nikola | 5 |
407427 | CA | M | 2022 | Zylo | 5 |
407428 rows × 5 columns
# Create a Series of the length of each name
babyname_lengths = babynames["Name"].str.len()
# Add a column named "name_lengths" that includes the length of each name
babynames["name_lengths"] = babyname_lengths
babynames.head(5)
State | Sex | Year | Name | Count | name_lengths | |
---|---|---|---|---|---|---|
0 | CA | F | 1910 | Mary | 295 | 4 |
1 | CA | F | 1910 | Helen | 239 | 5 |
2 | CA | F | 1910 | Dorothy | 220 | 7 |
3 | CA | F | 1910 | Margaret | 163 | 8 |
4 | CA | F | 1910 | Frances | 134 | 7 |
# Sort by the temporary column
babynames = babynames.sort_values(by = "name_lengths", ascending=False)
babynames.head(5)
State | Sex | Year | Name | Count | name_lengths | |
---|---|---|---|---|---|---|
334166 | CA | M | 1996 | Franciscojavier | 8 | 15 |
337301 | CA | M | 1997 | Franciscojavier | 5 | 15 |
339472 | CA | M | 1998 | Franciscojavier | 6 | 15 |
321792 | CA | M | 1991 | Ryanchristopher | 7 | 15 |
327358 | CA | M | 1993 | Johnchristopher | 5 | 15 |
# Drop the 'name_length' column
babynames = babynames.drop("name_lengths", axis = 'columns')
babynames.head(5)
State | Sex | Year | Name | Count | |
---|---|---|---|---|---|
334166 | CA | M | 1996 | Franciscojavier | 8 |
337301 | CA | M | 1997 | Franciscojavier | 5 |
339472 | CA | M | 1998 | Franciscojavier | 6 |
321792 | CA | M | 1991 | Ryanchristopher | 7 |
327358 | CA | M | 1993 | Johnchristopher | 5 |
Approach 2: Sorting using the key
argument¶
babynames.sort_values("Name", key = lambda x:x.str.len(), ascending = False).head()
State | Sex | Year | Name | Count | |
---|---|---|---|---|---|
334166 | CA | M | 1996 | Franciscojavier | 8 |
327472 | CA | M | 1993 | Ryanchristopher | 5 |
337301 | CA | M | 1997 | Franciscojavier | 5 |
337477 | CA | M | 1997 | Ryanchristopher | 5 |
312543 | CA | M | 1987 | Franciscojavier | 5 |
Approach 3: Sorting Using the map
Function¶
We can also use the Python map function if we want to use an arbitrarily defined function. Suppose we want to sort by the number of occurrences of "dr" plus the number of occurences of "ea".
# First, define a function to count the number of times "dr" or "ea" appear in each name
def dr_ea_count(string):
return string.count('dr') + string.count('ea')
# Then, use `map` to apply `dr_ea_count` to each name in the "Name" column
babynames["dr_ea_count"] = babynames["Name"].map(dr_ea_count)
# Sort the DataFrame by the new "dr_ea_count" column so we can see our handiwork
babynames = babynames.sort_values(by = "dr_ea_count", ascending=False)
babynames.head()
State | Sex | Year | Name | Count | dr_ea_count | |
---|---|---|---|---|---|---|
115957 | CA | F | 1990 | Deandrea | 5 | 3 |
101976 | CA | F | 1986 | Deandrea | 6 | 3 |
131029 | CA | F | 1994 | Leandrea | 5 | 3 |
108731 | CA | F | 1988 | Deandrea | 5 | 3 |
308131 | CA | M | 1985 | Deandrea | 6 | 3 |
# Drop the `dr_ea_count` column
babynames = babynames.drop("dr_ea_count", axis = 'columns')
babynames.head(5)
State | Sex | Year | Name | Count | |
---|---|---|---|---|---|
115957 | CA | F | 1990 | Deandrea | 5 |
101976 | CA | F | 1986 | Deandrea | 6 |
131029 | CA | F | 1994 | Leandrea | 5 |
108731 | CA | F | 1988 | Deandrea | 5 |
308131 | CA | M | 1985 | Deandrea | 6 |
Grouping¶
Group rows that share a common feature, then aggregate data across the group.
In this example, we count the total number of babies born in each year (considering only a small subset of the data, for simplicity).
# The code below uses the full babynames dataset, which is why some numbers are different relative to the diagram
babynames[["Year", "Count"]].groupby("Year").agg(sum)
Count | |
---|---|
Year | |
1910 | 9163 |
1911 | 9983 |
1912 | 17946 |
1913 | 22094 |
1914 | 26926 |
... | ... |
2018 | 395436 |
2019 | 386996 |
2020 | 362882 |
2021 | 362582 |
2022 | 360023 |
113 rows × 1 columns
There are many different aggregation functions we can use, all of which are useful in different applications.
# What is the earliest year in which each name appeared?
babynames.groupby("Name")[["Year"]].agg(min)
Year | |
---|---|
Name | |
Aadan | 2008 |
Aadarsh | 2019 |
Aaden | 2007 |
Aadhav | 2014 |
Aadhini | 2022 |
... | ... |
Zymir | 2020 |
Zyon | 1999 |
Zyra | 2012 |
Zyrah | 2011 |
Zyrus | 2021 |
20437 rows × 1 columns
# What is the largest single-year count of each name?
babynames.groupby("Name")[["Count"]].agg(max)
Count | |
---|---|
Name | |
Aadan | 7 |
Aadarsh | 6 |
Aaden | 158 |
Aadhav | 8 |
Aadhini | 6 |
... | ... |
Zymir | 5 |
Zyon | 17 |
Zyra | 16 |
Zyrah | 6 |
Zyrus | 5 |
20437 rows × 1 columns
Case Study: Name "Popularity"¶
In this exercise, let's find the name with sex "F" that has dropped most in popularity since its peak usage. We'll start by filtering babynames
to only include names corresponding to sex "F".
f_babynames = babynames[babynames["Sex"] == "F"]
f_babynames
State | Sex | Year | Name | Count | |
---|---|---|---|---|---|
115957 | CA | F | 1990 | Deandrea | 5 |
101976 | CA | F | 1986 | Deandrea | 6 |
131029 | CA | F | 1994 | Leandrea | 5 |
108731 | CA | F | 1988 | Deandrea | 5 |
193104 | CA | F | 2010 | Deandra | 5 |
... | ... | ... | ... | ... | ... |
192320 | CA | F | 2010 | Annaly | 7 |
212793 | CA | F | 2015 | Iyanna | 6 |
202350 | CA | F | 2013 | Lailah | 50 |
212784 | CA | F | 2015 | Helene | 6 |
212787 | CA | F | 2015 | Holley | 6 |
239537 rows × 5 columns
# We sort the data by year
f_babynames = f_babynames.sort_values("Year")
f_babynames
State | Sex | Year | Name | Count | |
---|---|---|---|---|---|
23 | CA | F | 1910 | Bernice | 59 |
219 | CA | F | 1910 | Katharine | 5 |
170 | CA | F | 1910 | Clarice | 7 |
26 | CA | F | 1910 | Doris | 56 |
30 | CA | F | 1910 | Ethel | 52 |
... | ... | ... | ... | ... | ... |
236551 | CA | F | 2022 | Dani | 47 |
235914 | CA | F | 2022 | Iris | 361 |
235918 | CA | F | 2022 | Lucy | 345 |
235973 | CA | F | 2022 | Sage | 226 |
239452 | CA | F | 2022 | Rylin | 5 |
239537 rows × 5 columns
To build our intuition on how to answer our research question, let's visualize the prevalence of the name "Jennifer" over time.
# We'll talk about how to generate plots in a later lecture
fig = px.line(f_babynames[f_babynames["Name"] == "Jennifer"],
x = "Year", y = "Count")
fig.update_layout(font_size = 18,
autosize=False,
width=1000,
height=400)
We'll need a mathematical definition for the change in popularity of a name.
Define the metric "ratio to peak" (RTP). We'll calculate this as the count of the name in 2022 (the most recent year for which we have data) divided by the largest count of this name in any year.
A demo calculation for Jennifer:
# In the year with the highest Jennifer count, 6065 Jennifers were born
max_jenn = np.max(f_babynames[f_babynames["Name"] == "Jennifer"]["Count"])
max_jenn
6065
# Remember that we sorted f_babynames by year.
# This means that grabbing the final entry gives us the most recent count of Jennifers: 114
# In 2022, the most recent year for which we have data, 114 Jennifers were born
curr_jenn = f_babynames[f_babynames["Name"] == "Jennifer"]["Count"].iloc[-1]
curr_jenn
114
# Compute the RTP
curr_jenn / max_jenn
0.018796372629843364
We can also write a function that produces the ratio_to_peak
for a given Series
. This will allow us to use .groupby
to speed up our computation for all names in the dataset.
def ratio_to_peak(series):
"""
Compute the RTP for a Series containing the counts per year for a single name
"""
return series.iloc[-1] / np.max(series)
# Construct a Series containing our Jennifer count data
jenn_counts_ser = f_babynames[f_babynames["Name"] == "Jennifer"]["Count"]
# Then, find the RTP
ratio_to_peak(jenn_counts_ser)
0.018796372629843364
Now, let's use .groupby
to compute the RTPs for all names in the dataset.
You may see a warning message when running the cell below. As discussed in lecture, pandas
can't apply an aggregation function to non-numeric data (it doens't make sense to divide "CA" by a number). By default, .groupby
will drop any columns that cannot be aggregated.
rtp_table = f_babynames.groupby("Name")[["Year", "Count"]].agg(ratio_to_peak)
rtp_table
Year | Count | |
---|---|---|
Name | ||
Aadhini | 1.0 | 1.000000 |
Aadhira | 1.0 | 0.500000 |
Aadhya | 1.0 | 0.660000 |
Aadya | 1.0 | 0.586207 |
Aahana | 1.0 | 0.269231 |
... | ... | ... |
Zyanya | 1.0 | 0.466667 |
Zyla | 1.0 | 1.000000 |
Zylah | 1.0 | 1.000000 |
Zyra | 1.0 | 1.000000 |
Zyrah | 1.0 | 0.833333 |
13782 rows × 2 columns
# Results in a TypeError
# rtp_table = f_babynames.groupby("Name").agg(ratio_to_peak)
# rtp_table
This is the pandas
equivalent of .group
from Data 8. If we wanted to achieve this same result using the datascience
library, we would write:
f_babynames.group("Name", ratio_to_peak)
To avoid the warning message above, we explicitly extract only the columns relevant to our analysis before using .agg
.
# Recompute the RTPs, but only performing the calculation on the "Count" column
rtp_table = f_babynames.groupby("Name")[["Count"]].agg(ratio_to_peak)
rtp_table
Count | |
---|---|
Name | |
Aadhini | 1.000000 |
Aadhira | 0.500000 |
Aadhya | 0.660000 |
Aadya | 0.586207 |
Aahana | 0.269231 |
... | ... |
Zyanya | 0.466667 |
Zyla | 1.000000 |
Zylah | 1.000000 |
Zyra | 1.000000 |
Zyrah | 0.833333 |
13782 rows × 1 columns
# Rename "Count" to "Count RTP" for clarity
rtp_table = rtp_table.rename(columns = {"Count": "Count RTP"})
rtp_table
Count RTP | |
---|---|
Name | |
Aadhini | 1.000000 |
Aadhira | 0.500000 |
Aadhya | 0.660000 |
Aadya | 0.586207 |
Aahana | 0.269231 |
... | ... |
Zyanya | 0.466667 |
Zyla | 1.000000 |
Zylah | 1.000000 |
Zyra | 1.000000 |
Zyrah | 0.833333 |
13782 rows × 1 columns
# What name has fallen the most in popularity?
rtp_table.sort_values("Count RTP")
Count RTP | |
---|---|
Name | |
Debra | 0.001260 |
Debbie | 0.002815 |
Carol | 0.003180 |
Tammy | 0.003249 |
Susan | 0.003305 |
... | ... |
Fidelia | 1.000000 |
Naveyah | 1.000000 |
Finlee | 1.000000 |
Roseline | 1.000000 |
Aadhini | 1.000000 |
13782 rows × 1 columns
We can visualize the decrease in the popularity of the name "Debra:"
def plot_name(*names):
fig = px.line(f_babynames[f_babynames["Name"].isin(names)],
x = "Year", y = "Count", color="Name",
title=f"Popularity for: {names}")
fig.update_layout(font_size = 18,
autosize=False,
width=1000,
height=400)
return fig
plot_name("Debra")
# Find the 10 names that have decreased the most in popularity
top10 = rtp_table.sort_values("Count RTP").head(10).index
top10
Index(['Debra', 'Debbie', 'Carol', 'Tammy', 'Susan', 'Cheryl', 'Shannon', 'Tina', 'Michele', 'Terri'], dtype='object', name='Name')
plot_name(*top10)
For fun, try plotting your name or your friends' names.