Lecture 3 – Data 100, Fall 2024¶

Data 100, Fall 2024

Acknowledgments Page

A demonstration of advanced pandas syntax to accompany Lecture 3.

In [1]:
import numpy as np
import pandas as pd
In [2]:
# Loading the elections DataFrame
elections = pd.read_csv("data/elections.csv")

elections.head() 
Out[2]:
Year Candidate Party Popular vote Result %
0 1824 Andrew Jackson Democratic-Republican 151271 loss 57.210122
1 1824 John Quincy Adams Democratic-Republican 113142 win 42.789878
2 1828 Andrew Jackson Democratic 642806 win 56.203927
3 1828 John Quincy Adams National Republican 500897 loss 43.796073
4 1832 Andrew Jackson Democratic 702735 win 54.574789

Integer-Based Extraction Using iloc¶

iloc selects items by row and column integer position.

Arguments to .iloc can be:

  1. A list.
  2. A slice (syntax is exclusive of the right hand side of the slice).
  3. A single value.
In [3]:
# Select the rows at positions 1, 2, and 3.
# Select the columns at positions 0, 1, and 2.
# Remember that Python indexing begins at position 0!
elections.iloc[[1, 2, 3], [0, 1, 2]]
Out[3]:
Year Candidate Party
1 1824 John Quincy Adams Democratic-Republican
2 1828 Andrew Jackson Democratic
3 1828 John Quincy Adams National Republican
In [4]:
# Index-based extraction using a list of rows and a slice of 
# column integer positions 
elections.iloc[[1, 2, 3], 0:3]
Out[4]:
Year Candidate Party
1 1824 John Quincy Adams Democratic-Republican
2 1828 Andrew Jackson Democratic
3 1828 John Quincy Adams National Republican
In [5]:
# Selecting all rows using a colon
elections.iloc[:, 0:3]
Out[5]:
Year Candidate Party
0 1824 Andrew Jackson Democratic-Republican
1 1824 John Quincy Adams Democratic-Republican
2 1828 Andrew Jackson Democratic
3 1828 John Quincy Adams National Republican
4 1832 Andrew Jackson Democratic
... ... ... ...
177 2016 Jill Stein Green
178 2020 Joseph Biden Democratic
179 2020 Donald Trump Republican
180 2020 Jo Jorgensen Libertarian
181 2020 Howard Hawkins Green

182 rows × 3 columns

In [6]:
# Extracting data from one column
elections.iloc[[1, 2, 3], 1]
Out[6]:
1    John Quincy Adams
2       Andrew Jackson
3    John Quincy Adams
Name: Candidate, dtype: object
In [7]:
# Extracting the value at row 0 and the second column
elections.iloc[0,1]
Out[7]:
'Andrew Jackson'

Context-dependent Extraction using []¶

We could technically do anything we want using loc or iloc. However, in practice, the [] operator is often used instead to yield more concise code.

[] is a bit trickier to understand than loc or iloc, but it achieves essentially the same functionality. The difference is that [] is context-dependent.

[] only takes one argument, which may be:

  1. A slice of row integers.
  2. A list of column labels.
  3. A single column label.

If we provide a slice of row numbers, we get the numbered rows.

In [8]:
elections[3:7]
Out[8]:
Year Candidate Party Popular vote Result %
3 1828 John Quincy Adams National Republican 500897 loss 43.796073
4 1832 Andrew Jackson Democratic 702735 win 54.574789
5 1832 Henry Clay National Republican 484205 loss 37.603628
6 1832 William Wirt Anti-Masonic 100715 loss 7.821583

If we provide a list of column names, we get the listed columns.

In [9]:
elections[["Year", "Candidate", "Result"]]
Out[9]:
Year Candidate Result
0 1824 Andrew Jackson loss
1 1824 John Quincy Adams win
2 1828 Andrew Jackson win
3 1828 John Quincy Adams loss
4 1832 Andrew Jackson win
... ... ... ...
177 2016 Jill Stein loss
178 2020 Joseph Biden win
179 2020 Donald Trump loss
180 2020 Jo Jorgensen loss
181 2020 Howard Hawkins loss

182 rows × 3 columns

And if we provide a single column name we get back just that column, stored as a Series.

In [10]:
elections["Candidate"]
Out[10]:
0         Andrew Jackson
1      John Quincy Adams
2         Andrew Jackson
3      John Quincy Adams
4         Andrew Jackson
             ...        
177           Jill Stein
178         Joseph Biden
179         Donald Trump
180         Jo Jorgensen
181       Howard Hawkins
Name: Candidate, Length: 182, dtype: object

Slido Exercise¶

In [11]:
weird = pd.DataFrame({"a":["one fish", "two fish"], 
                      "b":["red fish", "blue fish"]})
weird
Out[11]:
a b
0 one fish red fish
1 two fish blue fish
In [12]:
weird.iloc[1, 1]
Out[12]:
'blue fish'
In [13]:
# weird.loc[1, 1]
In [14]:
weird.iloc[1, :]
Out[14]:
a     two fish
b    blue fish
Name: 1, dtype: object
In [15]:
weird.loc[1, 'b']
Out[15]:
'blue fish'
In [16]:
weird.iloc[1, 0]
Out[16]:
'two fish'

Dataset: California baby names¶

In today's lecture, we'll work with the babynames dataset, which contains information about the names of infants born in California.

The cell below pulls census data from a government website and then loads it into a usable form. The code shown here is outside of the scope of Data 100, but you're encouraged to dig into it if you are interested!

In [17]:
import urllib.request
import os.path
import zipfile

data_url = "https://www.ssa.gov/oact/babynames/state/namesbystate.zip"
local_filename = "data/babynamesbystate.zip"
if not os.path.exists(local_filename): # If the data exists don't download again
    with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
        f.write(resp.read())

zf = zipfile.ZipFile(local_filename, 'r')

ca_name = 'STATE.CA.TXT'
field_names = ['State', 'Sex', 'Year', 'Name', 'Count']
with zf.open(ca_name) as fh:
    babynames = pd.read_csv(fh, header=None, names=field_names)

babynames.head()
Out[17]:
State Sex Year Name Count
0 CA F 1910 Mary 295
1 CA F 1910 Helen 239
2 CA F 1910 Dorothy 220
3 CA F 1910 Margaret 163
4 CA F 1910 Frances 134

Conditional Selection¶

In [18]:
# Ask yourself: Why is :9 is the correct slice to select the first 10 rows?
babynames_first_10_rows = babynames.loc[:9, :]

babynames_first_10_rows
Out[18]:
State Sex Year Name Count
0 CA F 1910 Mary 295
1 CA F 1910 Helen 239
2 CA F 1910 Dorothy 220
3 CA F 1910 Margaret 163
4 CA F 1910 Frances 134
5 CA F 1910 Ruth 128
6 CA F 1910 Evelyn 126
7 CA F 1910 Alice 118
8 CA F 1910 Virginia 101
9 CA F 1910 Elizabeth 93

By passing in a sequence (list, array, or Series) of boolean values, we can extract a subset of the rows in a DataFrame. We will keep only the rows that correspond to a boolean value of True.

In [19]:
# Notice how we have exactly 10 elements in our boolean array argument.
babynames_first_10_rows[[True, False, True, False, True, False, True, False, True, False]]
Out[19]:
State Sex Year Name Count
0 CA F 1910 Mary 295
2 CA F 1910 Dorothy 220
4 CA F 1910 Frances 134
6 CA F 1910 Evelyn 126
8 CA F 1910 Virginia 101
In [20]:
# Or using .loc to filter a DataFrame by a Boolean array argument.
babynames_first_10_rows.loc[[True, False, True, False, True, False, True, False, True, False], :]
Out[20]:
State Sex Year Name Count
0 CA F 1910 Mary 295
2 CA F 1910 Dorothy 220
4 CA F 1910 Frances 134
6 CA F 1910 Evelyn 126
8 CA F 1910 Virginia 101

Oftentimes, we'll use boolean selection to check for entries in a DataFrame that meet a particular condition.

In [21]:
# First, use a logical condition to generate a boolean Series
logical_operator = (babynames["Sex"] == "F")
logical_operator
Out[21]:
0          True
1          True
2          True
3          True
4          True
          ...  
407423    False
407424    False
407425    False
407426    False
407427    False
Name: Sex, Length: 407428, dtype: bool
In [22]:
# Then, use this boolean Series to filter the DataFrame
babynames[logical_operator]
Out[22]:
State Sex Year Name Count
0 CA F 1910 Mary 295
1 CA F 1910 Helen 239
2 CA F 1910 Dorothy 220
3 CA F 1910 Margaret 163
4 CA F 1910 Frances 134
... ... ... ... ... ...
239532 CA F 2022 Zemira 5
239533 CA F 2022 Ziggy 5
239534 CA F 2022 Zimal 5
239535 CA F 2022 Zosia 5
239536 CA F 2022 Zulay 5

239537 rows × 5 columns

Boolean selection also works with loc!

In [23]:
# Notice that we did not have to specify columns to select 
# If no columns are referenced, pandas will automatically select all columns
babynames.loc[babynames["Sex"] == "F"]
Out[23]:
State Sex Year Name Count
0 CA F 1910 Mary 295
1 CA F 1910 Helen 239
2 CA F 1910 Dorothy 220
3 CA F 1910 Margaret 163
4 CA F 1910 Frances 134
... ... ... ... ... ...
239532 CA F 2022 Zemira 5
239533 CA F 2022 Ziggy 5
239534 CA F 2022 Zimal 5
239535 CA F 2022 Zosia 5
239536 CA F 2022 Zulay 5

239537 rows × 5 columns

To filter on multiple conditions, we combine boolean operators using bitwise comparisons.

Symbol Usage Meaning
~ ~p Returns negation of p
| p | q p OR q
& p & q p AND q
^ p ^ q p XOR q (exclusive or)
In [24]:
babynames[(babynames["Sex"] == "F") & (babynames["Year"] < 2000)]
Out[24]:
State Sex Year Name Count
0 CA F 1910 Mary 295
1 CA F 1910 Helen 239
2 CA F 1910 Dorothy 220
3 CA F 1910 Margaret 163
4 CA F 1910 Frances 134
... ... ... ... ... ...
149050 CA F 1999 Zareen 5
149051 CA F 1999 Zeinab 5
149052 CA F 1999 Zhane 5
149053 CA F 1999 Zoha 5
149054 CA F 1999 Zoila 5

149055 rows × 5 columns

In [25]:
babynames[(babynames["Sex"] == "F") | (babynames["Year"] < 2000)]
Out[25]:
State Sex Year Name Count
0 CA F 1910 Mary 295
1 CA F 1910 Helen 239
2 CA F 1910 Dorothy 220
3 CA F 1910 Margaret 163
4 CA F 1910 Frances 134
... ... ... ... ... ...
342435 CA M 1999 Yuuki 5
342436 CA M 1999 Zakariya 5
342437 CA M 1999 Zavier 5
342438 CA M 1999 Zayn 5
342439 CA M 1999 Zayne 5

342440 rows × 5 columns

Slido Exercises¶

We want to obtain the first three baby names with count > 250.

No description has been provided for this image

In [26]:
babynames.iloc[[0, 233, 484], [3, 4]]
Out[26]:
Name Count
0 Mary 295
233 Mary 390
484 Mary 534
In [27]:
babynames.loc[[0, 233, 484]]
Out[27]:
State Sex Year Name Count
0 CA F 1910 Mary 295
233 CA F 1911 Mary 390
484 CA F 1912 Mary 534
In [28]:
babynames.loc[babynames["Count"] > 250, ["Name", "Count"]].head(3)
Out[28]:
Name Count
0 Mary 295
233 Mary 390
484 Mary 534
In [29]:
babynames.loc[babynames["Count"] > 250, ["Name", "Count"]].iloc[0:2, :]
Out[29]:
Name Count
0 Mary 295
233 Mary 390
In [30]:
# Note: The parentheses surrounding the code make it possible to 
# break the code into multiple lines for readability

(
    babynames[(babynames["Name"] == "Bella") | 
              (babynames["Name"] == "Alex") |
              (babynames["Name"] == "Narges") |
              (babynames["Name"] == "Lisa")]
)
Out[30]:
State Sex Year Name Count
6289 CA F 1923 Bella 5
7512 CA F 1925 Bella 8
12368 CA F 1932 Lisa 5
14741 CA F 1936 Lisa 8
17084 CA F 1939 Lisa 5
... ... ... ... ... ...
393248 CA M 2018 Alex 495
396111 CA M 2019 Alex 438
398983 CA M 2020 Alex 379
401788 CA M 2021 Alex 333
404663 CA M 2022 Alex 344

317 rows × 5 columns

In [31]:
# A more concise method to achieve the above: .isin
names = ["Bella", "Alex", "Narges", "Lisa"]
display(babynames["Name"].isin(names))
display(babynames[babynames["Name"].isin(names)])
0         False
1         False
2         False
3         False
4         False
          ...  
407423    False
407424    False
407425    False
407426    False
407427    False
Name: Name, Length: 407428, dtype: bool
State Sex Year Name Count
6289 CA F 1923 Bella 5
7512 CA F 1925 Bella 8
12368 CA F 1932 Lisa 5
14741 CA F 1936 Lisa 8
17084 CA F 1939 Lisa 5
... ... ... ... ... ...
393248 CA M 2018 Alex 495
396111 CA M 2019 Alex 438
398983 CA M 2020 Alex 379
401788 CA M 2021 Alex 333
404663 CA M 2022 Alex 344

317 rows × 5 columns

In [32]:
# What if we only want names that start with "N"?
display(babynames["Name"].str.startswith("N"))

display(babynames[babynames["Name"].str.startswith("N")])
0         False
1         False
2         False
3         False
4         False
          ...  
407423    False
407424    False
407425    False
407426    False
407427    False
Name: Name, Length: 407428, dtype: bool
State Sex Year Name Count
76 CA F 1910 Norma 23
83 CA F 1910 Nellie 20
127 CA F 1910 Nina 11
198 CA F 1910 Nora 6
310 CA F 1911 Nellie 23
... ... ... ... ... ...
407319 CA M 2022 Nilan 5
407320 CA M 2022 Niles 5
407321 CA M 2022 Nolen 5
407322 CA M 2022 Noriel 5
407323 CA M 2022 Norris 5

12229 rows × 5 columns

Adding, Removing, and Modifying Columns¶

To add a column, use [] to reference the desired new column, then assign it to a Series or array of appropriate length.

In [33]:
# Create a Series of the length of each name.
babyname_lengths = babynames["Name"].str.len()

# Add a column named "name_lengths" that includes the length of each name.
babynames["name_lengths"] = babyname_lengths

babynames
Out[33]:
State Sex Year Name Count name_lengths
0 CA F 1910 Mary 295 4
1 CA F 1910 Helen 239 5
2 CA F 1910 Dorothy 220 7
3 CA F 1910 Margaret 163 8
4 CA F 1910 Frances 134 7
... ... ... ... ... ... ...
407423 CA M 2022 Zayvier 5 7
407424 CA M 2022 Zia 5 3
407425 CA M 2022 Zora 5 4
407426 CA M 2022 Zuriel 5 6
407427 CA M 2022 Zylo 5 4

407428 rows × 6 columns

To modify a column, use [] to access the desired column, then re-assign it to a new array or Series.

In [34]:
# Modify the "name_lengths" column to be one less than its original value.
babynames["name_lengths"] = babynames["name_lengths"] - 1
babynames
Out[34]:
State Sex Year Name Count name_lengths
0 CA F 1910 Mary 295 3
1 CA F 1910 Helen 239 4
2 CA F 1910 Dorothy 220 6
3 CA F 1910 Margaret 163 7
4 CA F 1910 Frances 134 6
... ... ... ... ... ... ...
407423 CA M 2022 Zayvier 5 6
407424 CA M 2022 Zia 5 2
407425 CA M 2022 Zora 5 3
407426 CA M 2022 Zuriel 5 5
407427 CA M 2022 Zylo 5 3

407428 rows × 6 columns

Rename a column using the .rename() method.

In [35]:
# Rename "name_lengths" to "Length".
babynames = babynames.rename(columns={"name_lengths":"Length"})
babynames
Out[35]:
State Sex Year Name Count Length
0 CA F 1910 Mary 295 3
1 CA F 1910 Helen 239 4
2 CA F 1910 Dorothy 220 6
3 CA F 1910 Margaret 163 7
4 CA F 1910 Frances 134 6
... ... ... ... ... ... ...
407423 CA M 2022 Zayvier 5 6
407424 CA M 2022 Zia 5 2
407425 CA M 2022 Zora 5 3
407426 CA M 2022 Zuriel 5 5
407427 CA M 2022 Zylo 5 3

407428 rows × 6 columns

Remove a column using .drop().

In [36]:
# Remove our new "Length" column.
babynames = babynames.drop("Length", axis="columns")
babynames
Out[36]:
State Sex Year Name Count
0 CA F 1910 Mary 295
1 CA F 1910 Helen 239
2 CA F 1910 Dorothy 220
3 CA F 1910 Margaret 163
4 CA F 1910 Frances 134
... ... ... ... ... ...
407423 CA M 2022 Zayvier 5
407424 CA M 2022 Zia 5
407425 CA M 2022 Zora 5
407426 CA M 2022 Zuriel 5
407427 CA M 2022 Zylo 5

407428 rows × 5 columns

Useful Utility Functions¶

NumPy¶

The NumPy functions you encountered in Data 8 are compatible with objects in pandas.

In [37]:
yash_counts = babynames[babynames["Name"] == "Yash"]["Count"]
yash_counts
Out[37]:
331824     8
334114     9
336390    11
338773    12
341387    10
343571    14
345767    24
348230    29
350889    24
353445    29
356221    25
358978    27
361831    29
364905    24
367867    23
370945    18
374055    14
376756    18
379660    18
383338     9
385903    12
388529    17
391485    16
394906    10
397874     9
400171    15
403092    13
406006    13
Name: Count, dtype: int64
In [38]:
# Average number of babies named Yash each year

np.mean(yash_counts)
Out[38]:
17.142857142857142
In [39]:
# Max number of babies named Yash born in any single year

max(yash_counts)
Out[39]:
29

Built-In pandas Methods¶

There are many, many utility functions built into pandas, far more than we can possibly cover in lecture. You are encouraged to explore all the functionality outlined in the pandas documentation.

In [40]:
# Returns the shape of the object in the format (num_rows, num_columns)
babynames.shape
Out[40]:
(407428, 5)
In [41]:
# Returns the total number of entries in the object, equal to num_rows * num_columns
babynames.size
Out[41]:
2037140
In [42]:
# What summary statistics can we describe?
babynames.describe()
Out[42]:
Year Count
count 407428.000000 407428.000000
mean 1985.733609 79.543456
std 27.007660 293.698654
min 1910.000000 5.000000
25% 1969.000000 7.000000
50% 1992.000000 13.000000
75% 2008.000000 38.000000
max 2022.000000 8260.000000
In [43]:
# Our statistics are slightly different when working with a Series.
babynames["Sex"].describe()
Out[43]:
count     407428
unique         2
top            F
freq      239537
Name: Sex, dtype: object
In [44]:
# Randomly sample row(s) from the DataFrame.
babynames.sample()
Out[44]:
State Sex Year Name Count
154936 CA F 2001 Darya 10
In [45]:
# Rerun this cell a few times – you'll get different results!
babynames.sample(5).iloc[:, 2:]
Out[45]:
Year Name Count
339370 1998 Vito 7
99351 1985 Janea 5
386376 2015 Reilly 8
15857 1938 Sherry 40
118930 1991 Tatum 8
In [46]:
# Sampling with replacement.
babynames[babynames["Year"] == 2000].sample(4, replace = True).iloc[:,2:]
Out[46]:
Year Name Count
344801 2000 Justine 5
151714 2000 Cloey 7
342700 2000 Troy 158
152033 2000 Evelina 6
In [47]:
# Count the number of times each unique value occurs in a Series.
babynames["Name"].value_counts()
Out[47]:
Name
Jean         223
Francis      221
Guadalupe    218
Jessie       217
Marion       214
            ... 
Renesme        1
Purity         1
Olanna         1
Nohea          1
Zayvier        1
Name: count, Length: 20437, dtype: int64
In [48]:
# Return an array of all unique values in the Series.
babynames["Name"].unique()
Out[48]:
array(['Mary', 'Helen', 'Dorothy', ..., 'Zae', 'Zai', 'Zayvier'],
      dtype=object)
In [49]:
# Sort a Series.
babynames["Name"].sort_values()
Out[49]:
366001      Aadan
384005      Aadan
369120      Aadan
398211    Aadarsh
370306      Aaden
           ...   
220691      Zyrah
197529      Zyrah
217429      Zyrah
232167      Zyrah
404544      Zyrus
Name: Name, Length: 407428, dtype: object
In [50]:
# Sort a DataFrame – there are lots of Michaels in California.
babynames.sort_values(by="Count", ascending=False)
Out[50]:
State Sex Year Name Count
268041 CA M 1957 Michael 8260
267017 CA M 1956 Michael 8258
317387 CA M 1990 Michael 8246
281850 CA M 1969 Michael 8245
283146 CA M 1970 Michael 8196
... ... ... ... ... ...
317292 CA M 1989 Olegario 5
317291 CA M 1989 Norbert 5
317290 CA M 1989 Niles 5
317289 CA M 1989 Nikola 5
407427 CA M 2022 Zylo 5

407428 rows × 5 columns

Custom sorting¶

Approach 1: Create a temporary column¶

In [51]:
# Create a Series of the length of each name.
babyname_lengths = babynames["Name"].str.len()

# Add a column named "name_lengths" that includes the length of each name.
babynames["name_lengths"] = babyname_lengths
babynames.head(5)
Out[51]:
State Sex Year Name Count name_lengths
0 CA F 1910 Mary 295 4
1 CA F 1910 Helen 239 5
2 CA F 1910 Dorothy 220 7
3 CA F 1910 Margaret 163 8
4 CA F 1910 Frances 134 7
In [52]:
# Sort by the temporary column.
babynames = babynames.sort_values(by="name_lengths", ascending=False)
babynames.head(5)
Out[52]:
State Sex Year Name Count name_lengths
334166 CA M 1996 Franciscojavier 8 15
337301 CA M 1997 Franciscojavier 5 15
339472 CA M 1998 Franciscojavier 6 15
321792 CA M 1991 Ryanchristopher 7 15
327358 CA M 1993 Johnchristopher 5 15
In [53]:
# Drop the 'name_length' column.
babynames = babynames.drop("name_lengths", axis="columns")
babynames.head(5)
Out[53]:
State Sex Year Name Count
334166 CA M 1996 Franciscojavier 8
337301 CA M 1997 Franciscojavier 5
339472 CA M 1998 Franciscojavier 6
321792 CA M 1991 Ryanchristopher 7
327358 CA M 1993 Johnchristopher 5

Approach 2: Sorting using the key argument¶

In [54]:
babynames.sort_values("Name", key=lambda x:x.str.len(), ascending=False).head()
Out[54]:
State Sex Year Name Count
334166 CA M 1996 Franciscojavier 8
327472 CA M 1993 Ryanchristopher 5
337301 CA M 1997 Franciscojavier 5
337477 CA M 1997 Ryanchristopher 5
312543 CA M 1987 Franciscojavier 5

Approach 3: Sorting Using the map Function¶

We can also use the Python map function if we want to use an arbitrarily defined function. Suppose we want to sort by the number of occurrences of "dr" plus the number of occurences of "ea".

In [55]:
# First, define a function to count 
# the number of times "dr" or "ea" appear in each name.
def dr_ea_count(string):
    return string.count('dr') + string.count('ea')

# Then, use `map` to apply `dr_ea_count` to each name in the "Name" column.
babynames["dr_ea_count"] = babynames["Name"].map(dr_ea_count)

# Sort the DataFrame by the new "dr_ea_count" column 
# so we can see our handiwork.
babynames = babynames.sort_values(by="dr_ea_count", ascending=False)
babynames.head()
Out[55]:
State Sex Year Name Count dr_ea_count
115957 CA F 1990 Deandrea 5 3
101976 CA F 1986 Deandrea 6 3
131029 CA F 1994 Leandrea 5 3
108731 CA F 1988 Deandrea 5 3
308131 CA M 1985 Deandrea 6 3
In [56]:
# Drop the `dr_ea_count` column.
babynames = babynames.drop("dr_ea_count", axis="columns")
babynames.head(5)
Out[56]:
State Sex Year Name Count
115957 CA F 1990 Deandrea 5
101976 CA F 1986 Deandrea 6
131029 CA F 1994 Leandrea 5
108731 CA F 1988 Deandrea 5
308131 CA M 1985 Deandrea 6

Grouping¶

Group rows that share a common feature, then aggregate data across the group.

In this example, we count the total number of babies born in each year (considering only a small subset of the data, for simplicity).

No description has been provided for this image

In [57]:
# The code below uses the full babynames dataset, which is why 
# some numbers are different relative to the diagram.
babynames[["Year", "Count"]].groupby("Year").agg(sum)
Out[57]:
Count
Year
1910 9163
1911 9983
1912 17946
1913 22094
1914 26926
... ...
2018 395436
2019 386996
2020 362882
2021 362582
2022 360023

113 rows × 1 columns

There are many different aggregation functions we can use, all of which are useful in different applications.

In [58]:
# What is the earliest year in which each name appeared?
babynames.groupby("Name")[["Year"]].agg(min)
Out[58]:
Year
Name
Aadan 2008
Aadarsh 2019
Aaden 2007
Aadhav 2014
Aadhini 2022
... ...
Zymir 2020
Zyon 1999
Zyra 2012
Zyrah 2011
Zyrus 2021

20437 rows × 1 columns

In [59]:
# What is the largest single-year count of each name?
babynames.groupby("Name")[["Count"]].agg(max)
Out[59]:
Count
Name
Aadan 7
Aadarsh 6
Aaden 158
Aadhav 8
Aadhini 6
... ...
Zymir 5
Zyon 17
Zyra 16
Zyrah 6
Zyrus 5

20437 rows × 1 columns