Lecture 3 – Data 100, Spring 2025¶

Data 100, Spring 2025

Acknowledgments Page

A demonstration of advanced pandas syntax to accompany Lecture 3.

In [1]:
import numpy as np
import pandas as pd
In [2]:
# Loading the elections DataFrame
elections = pd.read_csv("data/elections.csv")

elections.head() 
Out[2]:
Year Candidate Party Popular vote Result %
0 1824 Andrew Jackson Democratic-Republican 151271 loss 57.210122
1 1824 John Quincy Adams Democratic-Republican 113142 win 42.789878
2 1828 Andrew Jackson Democratic 642806 win 56.203927
3 1828 John Quincy Adams National Republican 500897 loss 43.796073
4 1832 Andrew Jackson Democratic 702735 win 54.574789

Integer-Based Extraction Using iloc¶

iloc selects items by row and column integer position.

Arguments to .iloc can be:

  1. A list.
  2. A slice (syntax is exclusive of the right hand side of the slice).
  3. A single value.
In [3]:
# Select the rows at positions 1, 2, and 3.
# Select the columns at positions 0, 1, and 2.
# Remember that Python indexing begins at position 0!
elections.iloc[[1, 2, 3], [0, 1, 2]]
Out[3]:
Year Candidate Party
1 1824 John Quincy Adams Democratic-Republican
2 1828 Andrew Jackson Democratic
3 1828 John Quincy Adams National Republican
In [4]:
# Index-based extraction using a list of rows and a slice of 
# column integer positions 
elections.iloc[[1, 2, 3], 0:3]
Out[4]:
Year Candidate Party
1 1824 John Quincy Adams Democratic-Republican
2 1828 Andrew Jackson Democratic
3 1828 John Quincy Adams National Republican
In [5]:
# Selecting all rows using a colon
elections.iloc[:, 0:3]
Out[5]:
Year Candidate Party
0 1824 Andrew Jackson Democratic-Republican
1 1824 John Quincy Adams Democratic-Republican
2 1828 Andrew Jackson Democratic
3 1828 John Quincy Adams National Republican
4 1832 Andrew Jackson Democratic
... ... ... ...
182 2024 Donald Trump Republican
183 2024 Kamala Harris Democratic
184 2024 Jill Stein Green
185 2024 Robert Kennedy Independent
186 2024 Chase Oliver Libertarian Party

187 rows × 3 columns

In [6]:
# Extracting data from one column
elections.iloc[[1, 2, 3], 1]
Out[6]:
1    John Quincy Adams
2       Andrew Jackson
3    John Quincy Adams
Name: Candidate, dtype: object
In [7]:
# Extracting the value at row 0 and the second column
elections.iloc[0,1]
Out[7]:
'Andrew Jackson'

Context-dependent Extraction using []¶

We could technically do anything we want using loc or iloc. However, in practice, the [] operator is often used instead to yield more concise code.

[] is a bit trickier to understand than loc or iloc, but it achieves essentially the same functionality. The difference is that [] is context-dependent.

[] only takes one argument, which may be:

  1. A slice of row integers.
  2. A list of column labels.
  3. A single column label.

If we provide a slice of row numbers, we get the numbered rows.

In [8]:
elections[3:7]
Out[8]:
Year Candidate Party Popular vote Result %
3 1828 John Quincy Adams National Republican 500897 loss 43.796073
4 1832 Andrew Jackson Democratic 702735 win 54.574789
5 1832 Henry Clay National Republican 484205 loss 37.603628
6 1832 William Wirt Anti-Masonic 100715 loss 7.821583

If we provide a list of column names, we get the listed columns.

In [9]:
elections[["Year", "Candidate", "Result"]]
Out[9]:
Year Candidate Result
0 1824 Andrew Jackson loss
1 1824 John Quincy Adams win
2 1828 Andrew Jackson win
3 1828 John Quincy Adams loss
4 1832 Andrew Jackson win
... ... ... ...
182 2024 Donald Trump win
183 2024 Kamala Harris loss
184 2024 Jill Stein loss
185 2024 Robert Kennedy loss
186 2024 Chase Oliver loss

187 rows × 3 columns

And if we provide a single column name we get back just that column, stored as a Series.

In [10]:
elections["Candidate"]
Out[10]:
0         Andrew Jackson
1      John Quincy Adams
2         Andrew Jackson
3      John Quincy Adams
4         Andrew Jackson
             ...        
182         Donald Trump
183        Kamala Harris
184           Jill Stein
185       Robert Kennedy
186         Chase Oliver
Name: Candidate, Length: 187, dtype: object

Slido Exercise¶

In [11]:
weird = pd.DataFrame({"a":["one fish", "two fish"], 
                      "b":["red fish", "blue fish"]})
weird
Out[11]:
a b
0 one fish red fish
1 two fish blue fish
In [12]:
weird.iloc[1, 1]
Out[12]:
'blue fish'
In [13]:
# weird.loc[1, 1]
In [14]:
weird.iloc[1, :]
Out[14]:
a     two fish
b    blue fish
Name: 1, dtype: object
In [15]:
weird.loc[1, 'b']
Out[15]:
'blue fish'
In [16]:
weird.iloc[1, 0]
Out[16]:
'two fish'

Dataset: California baby names¶

In today's lecture, we'll work with the babynames dataset, which contains information about the names of infants born in California.

The cell below pulls census data from a government website and then loads it into a usable form. The code shown here is outside of the scope of Data 100, but you're encouraged to dig into it if you are interested!

In [17]:
import urllib.request
import os.path
import zipfile

data_url = "https://www.ssa.gov/oact/babynames/state/namesbystate.zip"
local_filename = "data/babynamesbystate.zip"
if not os.path.exists(local_filename): # If the data exists don't download again
    with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
        f.write(resp.read())

zf = zipfile.ZipFile(local_filename, 'r')

ca_name = 'STATE.CA.TXT'
field_names = ['State', 'Sex', 'Year', 'Name', 'Count']
with zf.open(ca_name) as fh:
    babynames = pd.read_csv(fh, header=None, names=field_names)

babynames.head()
Out[17]:
State Sex Year Name Count
0 CA F 1910 Mary 295
1 CA F 1910 Helen 239
2 CA F 1910 Dorothy 220
3 CA F 1910 Margaret 163
4 CA F 1910 Frances 134

Conditional Selection¶

In [18]:
# Ask yourself: Why is :9 is the correct slice to select the first 10 rows?
babynames_first_10_rows = babynames.loc[:9, :]

babynames_first_10_rows
Out[18]:
State Sex Year Name Count
0 CA F 1910 Mary 295
1 CA F 1910 Helen 239
2 CA F 1910 Dorothy 220
3 CA F 1910 Margaret 163
4 CA F 1910 Frances 134
5 CA F 1910 Ruth 128
6 CA F 1910 Evelyn 126
7 CA F 1910 Alice 118
8 CA F 1910 Virginia 101
9 CA F 1910 Elizabeth 93

By passing in a sequence (list, array, or Series) of boolean values, we can extract a subset of the rows in a DataFrame. We will keep only the rows that correspond to a boolean value of True.

In [19]:
# Notice how we have exactly 10 elements in our boolean array argument.
babynames_first_10_rows[[True, False, True, False, True, False, True, False, True, False]]
Out[19]:
State Sex Year Name Count
0 CA F 1910 Mary 295
2 CA F 1910 Dorothy 220
4 CA F 1910 Frances 134
6 CA F 1910 Evelyn 126
8 CA F 1910 Virginia 101
In [20]:
# Or using .loc to filter a DataFrame by a Boolean array argument.
babynames_first_10_rows.loc[[True, False, True, False, True, False, True, False, True, False], :]
Out[20]:
State Sex Year Name Count
0 CA F 1910 Mary 295
2 CA F 1910 Dorothy 220
4 CA F 1910 Frances 134
6 CA F 1910 Evelyn 126
8 CA F 1910 Virginia 101

Oftentimes, we'll use boolean selection to check for entries in a DataFrame that meet a particular condition.

In [21]:
# First, use a logical condition to generate a boolean Series
logical_operator = (babynames["Sex"]=="F")
logical_operator
Out[21]:
0          True
1          True
2          True
3          True
4          True
          ...  
407423    False
407424    False
407425    False
407426    False
407427    False
Name: Sex, Length: 407428, dtype: bool
In [22]:
# Then, use this boolean Series to filter the DataFrame
babynames[logical_operator]
Out[22]:
State Sex Year Name Count
0 CA F 1910 Mary 295
1 CA F 1910 Helen 239
2 CA F 1910 Dorothy 220
3 CA F 1910 Margaret 163
4 CA F 1910 Frances 134
... ... ... ... ... ...
239532 CA F 2022 Zemira 5
239533 CA F 2022 Ziggy 5
239534 CA F 2022 Zimal 5
239535 CA F 2022 Zosia 5
239536 CA F 2022 Zulay 5

239537 rows × 5 columns

Boolean selection also works with loc!

In [23]:
# Notice that we did not have to specify columns to select 
# If no columns are referenced, pandas will automatically select all columns
babynames.loc[babynames["Sex"]=="F"]
Out[23]:
State Sex Year Name Count
0 CA F 1910 Mary 295
1 CA F 1910 Helen 239
2 CA F 1910 Dorothy 220
3 CA F 1910 Margaret 163
4 CA F 1910 Frances 134
... ... ... ... ... ...
239532 CA F 2022 Zemira 5
239533 CA F 2022 Ziggy 5
239534 CA F 2022 Zimal 5
239535 CA F 2022 Zosia 5
239536 CA F 2022 Zulay 5

239537 rows × 5 columns

To filter on multiple conditions, we combine boolean operators using bitwise comparisons.

Symbol Usage Meaning
~ ~p Returns negation of p
| p | q p OR q
& p & q p AND q
^ p ^ q p XOR q (exclusive or)
In [24]:
babynames[(babynames["Sex"]=="F") & (babynames["Year"]<2000)]
Out[24]:
State Sex Year Name Count
0 CA F 1910 Mary 295
1 CA F 1910 Helen 239
2 CA F 1910 Dorothy 220
3 CA F 1910 Margaret 163
4 CA F 1910 Frances 134
... ... ... ... ... ...
149050 CA F 1999 Zareen 5
149051 CA F 1999 Zeinab 5
149052 CA F 1999 Zhane 5
149053 CA F 1999 Zoha 5
149054 CA F 1999 Zoila 5

149055 rows × 5 columns

In [25]:
babynames[(babynames["Sex"]=="F") | (babynames["Year"]<2000)]
Out[25]:
State Sex Year Name Count
0 CA F 1910 Mary 295
1 CA F 1910 Helen 239
2 CA F 1910 Dorothy 220
3 CA F 1910 Margaret 163
4 CA F 1910 Frances 134
... ... ... ... ... ...
342435 CA M 1999 Yuuki 5
342436 CA M 1999 Zakariya 5
342437 CA M 1999 Zavier 5
342438 CA M 1999 Zayn 5
342439 CA M 1999 Zayne 5

342440 rows × 5 columns

Slido Exercises¶

We want to obtain the first three baby names with count > 250.

No description has been provided for this image

In [26]:
babynames.iloc[[0, 233, 484], [3, 4]]
Out[26]:
Name Count
0 Mary 295
233 Mary 390
484 Mary 534
In [27]:
babynames.loc[[0, 233, 484]]
Out[27]:
State Sex Year Name Count
0 CA F 1910 Mary 295
233 CA F 1911 Mary 390
484 CA F 1912 Mary 534
In [28]:
babynames.loc[babynames["Count"]>250, ["Name", "Count"]].head(3)
Out[28]:
Name Count
0 Mary 295
233 Mary 390
484 Mary 534
In [29]:
babynames.loc[babynames["Count"]>250, ["Name", "Count"]].iloc[0:2, :]
Out[29]:
Name Count
0 Mary 295
233 Mary 390
In [30]:
# Note: The parentheses surrounding the code make it possible to 
# break the code into multiple lines for readability

(
    babynames[(babynames["Name"]=="Bella") | 
              (babynames["Name"]=="Alex") |
              (babynames["Name"]=="Narges") |
              (babynames["Name"]=="Lisa")]
)
Out[30]:
State Sex Year Name Count
6289 CA F 1923 Bella 5
7512 CA F 1925 Bella 8
12368 CA F 1932 Lisa 5
14741 CA F 1936 Lisa 8
17084 CA F 1939 Lisa 5
... ... ... ... ... ...
393248 CA M 2018 Alex 495
396111 CA M 2019 Alex 438
398983 CA M 2020 Alex 379
401788 CA M 2021 Alex 333
404663 CA M 2022 Alex 344

317 rows × 5 columns

In [31]:
# A more concise method to achieve the above: .isin
names = ["Bella", "Alex", "Narges", "Lisa"]
display(babynames["Name"].isin(names))
display(babynames[babynames["Name"].isin(names)])
0         False
1         False
2         False
3         False
4         False
          ...  
407423    False
407424    False
407425    False
407426    False
407427    False
Name: Name, Length: 407428, dtype: bool
State Sex Year Name Count
6289 CA F 1923 Bella 5
7512 CA F 1925 Bella 8
12368 CA F 1932 Lisa 5
14741 CA F 1936 Lisa 8
17084 CA F 1939 Lisa 5
... ... ... ... ... ...
393248 CA M 2018 Alex 495
396111 CA M 2019 Alex 438
398983 CA M 2020 Alex 379
401788 CA M 2021 Alex 333
404663 CA M 2022 Alex 344

317 rows × 5 columns

In [32]:
# What if we only want names that start with "N"?
display(babynames["Name"].str.startswith("N"))

display(babynames[babynames["Name"].str.startswith("N")])
0         False
1         False
2         False
3         False
4         False
          ...  
407423    False
407424    False
407425    False
407426    False
407427    False
Name: Name, Length: 407428, dtype: bool
State Sex Year Name Count
76 CA F 1910 Norma 23
83 CA F 1910 Nellie 20
127 CA F 1910 Nina 11
198 CA F 1910 Nora 6
310 CA F 1911 Nellie 23
... ... ... ... ... ...
407319 CA M 2022 Nilan 5
407320 CA M 2022 Niles 5
407321 CA M 2022 Nolen 5
407322 CA M 2022 Noriel 5
407323 CA M 2022 Norris 5

12229 rows × 5 columns

Adding, Removing, and Modifying Columns¶

To add a column, use [] to reference the desired new column, then assign it to a Series or array of appropriate length.

In [33]:
# Create a Series of the length of each name.
babyname_lengths = babynames["Name"].str.len()

# Add a column named "name_lengths" that includes the length of each name.
babynames["name_lengths"] = babyname_lengths

babynames
Out[33]:
State Sex Year Name Count name_lengths
0 CA F 1910 Mary 295 4
1 CA F 1910 Helen 239 5
2 CA F 1910 Dorothy 220 7
3 CA F 1910 Margaret 163 8
4 CA F 1910 Frances 134 7
... ... ... ... ... ... ...
407423 CA M 2022 Zayvier 5 7
407424 CA M 2022 Zia 5 3
407425 CA M 2022 Zora 5 4
407426 CA M 2022 Zuriel 5 6
407427 CA M 2022 Zylo 5 4

407428 rows × 6 columns

To modify a column, use [] to access the desired column, then re-assign it to a new array or Series.

In [34]:
# Modify the "name_lengths" column to be one less than its original value.
babynames["name_lengths"] = babynames["name_lengths"] - 1
babynames
Out[34]:
State Sex Year Name Count name_lengths
0 CA F 1910 Mary 295 3
1 CA F 1910 Helen 239 4
2 CA F 1910 Dorothy 220 6
3 CA F 1910 Margaret 163 7
4 CA F 1910 Frances 134 6
... ... ... ... ... ... ...
407423 CA M 2022 Zayvier 5 6
407424 CA M 2022 Zia 5 2
407425 CA M 2022 Zora 5 3
407426 CA M 2022 Zuriel 5 5
407427 CA M 2022 Zylo 5 3

407428 rows × 6 columns

Rename a column using the .rename() method.

In [35]:
# Rename "name_lengths" to "Length".
babynames = babynames.rename(columns={"name_lengths":"Length"})
babynames
Out[35]:
State Sex Year Name Count Length
0 CA F 1910 Mary 295 3
1 CA F 1910 Helen 239 4
2 CA F 1910 Dorothy 220 6
3 CA F 1910 Margaret 163 7
4 CA F 1910 Frances 134 6
... ... ... ... ... ... ...
407423 CA M 2022 Zayvier 5 6
407424 CA M 2022 Zia 5 2
407425 CA M 2022 Zora 5 3
407426 CA M 2022 Zuriel 5 5
407427 CA M 2022 Zylo 5 3

407428 rows × 6 columns

Remove a column using .drop().

In [36]:
# Remove our new "Length" column.
babynames = babynames.drop("Length", axis="columns")
babynames
Out[36]:
State Sex Year Name Count
0 CA F 1910 Mary 295
1 CA F 1910 Helen 239
2 CA F 1910 Dorothy 220
3 CA F 1910 Margaret 163
4 CA F 1910 Frances 134
... ... ... ... ... ...
407423 CA M 2022 Zayvier 5
407424 CA M 2022 Zia 5
407425 CA M 2022 Zora 5
407426 CA M 2022 Zuriel 5
407427 CA M 2022 Zylo 5

407428 rows × 5 columns

Useful Utility Functions¶

NumPy¶

The NumPy functions you encountered in Data 8 are compatible with objects in pandas.

In [37]:
yash_counts = babynames[babynames["Name"]=="Yash"]["Count"]
yash_counts
Out[37]:
331824     8
334114     9
336390    11
338773    12
341387    10
343571    14
345767    24
348230    29
350889    24
353445    29
356221    25
358978    27
361831    29
364905    24
367867    23
370945    18
374055    14
376756    18
379660    18
383338     9
385903    12
388529    17
391485    16
394906    10
397874     9
400171    15
403092    13
406006    13
Name: Count, dtype: int64
In [38]:
# Average number of babies named Yash each year
np.mean(yash_counts)
Out[38]:
17.142857142857142
In [39]:
# Max number of babies named Yash born in any single year
max(yash_counts)
Out[39]:
29

Built-In pandas Methods¶

There are many, many utility functions built into pandas, far more than we can possibly cover in lecture. You are encouraged to explore all the functionality outlined in the pandas documentation.

In [40]:
# Returns the shape of the object in the format (num_rows, num_columns)
babynames.shape
Out[40]:
(407428, 5)
In [41]:
# Returns the total number of entries in the object, equal to num_rows * num_columns
babynames.size
Out[41]:
2037140
In [42]:
# What summary statistics can we describe?
babynames.describe()
Out[42]:
Year Count
count 407428.000000 407428.000000
mean 1985.733609 79.543456
std 27.007660 293.698654
min 1910.000000 5.000000
25% 1969.000000 7.000000
50% 1992.000000 13.000000
75% 2008.000000 38.000000
max 2022.000000 8260.000000
In [43]:
# Our statistics are slightly different when working with a Series.
babynames["Sex"].describe()
Out[43]:
count     407428
unique         2
top            F
freq      239537
Name: Sex, dtype: object
In [44]:
# Randomly sample row(s) from the DataFrame.
babynames.sample()
Out[44]:
State Sex Year Name Count
240526 CA M 1914 Allan 10
In [45]:
# Rerun this cell a few times – you'll get different results!
babynames.sample(5).iloc[:, 2:]
Out[45]:
Year Name Count
149126 2000 Sabrina 625
255778 1942 Mack 9
66265 1972 Judi 12
289601 1974 Gabino 7
114584 1990 Janeen 11
In [46]:
# Sampling with replacement.
babynames[babynames["Year"]==2000].sample(4, replace=True).iloc[:,2:]
Out[46]:
Year Name Count
151976 2000 Berenize 6
342835 2000 Ray 78
149400 2000 Marie 123
344076 2000 Jossue 8
In [47]:
# Count the number of times each unique value occurs in a Series.
babynames["Name"].value_counts()
Out[47]:
Name
Jean         223
Francis      221
Guadalupe    218
Jessie       217
Marion       214
            ... 
Renesme        1
Purity         1
Olanna         1
Nohea          1
Zayvier        1
Name: count, Length: 20437, dtype: int64
In [48]:
# Return an array of all unique values in the Series.
babynames["Name"].unique()
Out[48]:
array(['Mary', 'Helen', 'Dorothy', ..., 'Zae', 'Zai', 'Zayvier'],
      dtype=object)
In [49]:
# Sort a Series.
babynames["Name"].sort_values()
Out[49]:
366001      Aadan
384005      Aadan
369120      Aadan
398211    Aadarsh
370306      Aaden
           ...   
220691      Zyrah
197529      Zyrah
217429      Zyrah
232167      Zyrah
404544      Zyrus
Name: Name, Length: 407428, dtype: object
In [50]:
# Sort a DataFrame – there are lots of Michaels in California.
babynames.sort_values(by="Count", ascending=False)
Out[50]:
State Sex Year Name Count
268041 CA M 1957 Michael 8260
267017 CA M 1956 Michael 8258
317387 CA M 1990 Michael 8246
281850 CA M 1969 Michael 8245
283146 CA M 1970 Michael 8196
... ... ... ... ... ...
317292 CA M 1989 Olegario 5
317291 CA M 1989 Norbert 5
317290 CA M 1989 Niles 5
317289 CA M 1989 Nikola 5
407427 CA M 2022 Zylo 5

407428 rows × 5 columns

Custom sorting¶

Approach 1: Create a temporary column¶

In [51]:
# Create a Series of the length of each name.
babyname_lengths = babynames["Name"].str.len()

# Add a column named "name_lengths" that includes the length of each name.
babynames["name_lengths"] = babyname_lengths
babynames.head(5)
Out[51]:
State Sex Year Name Count name_lengths
0 CA F 1910 Mary 295 4
1 CA F 1910 Helen 239 5
2 CA F 1910 Dorothy 220 7
3 CA F 1910 Margaret 163 8
4 CA F 1910 Frances 134 7
In [52]:
# Sort by the temporary column.
babynames = babynames.sort_values(by="name_lengths", ascending=False)
babynames.head(5)
Out[52]:
State Sex Year Name Count name_lengths
334166 CA M 1996 Franciscojavier 8 15
337301 CA M 1997 Franciscojavier 5 15
339472 CA M 1998 Franciscojavier 6 15
321792 CA M 1991 Ryanchristopher 7 15
327358 CA M 1993 Johnchristopher 5 15
In [53]:
# Drop the 'name_length' column.
babynames = babynames.drop("name_lengths", axis="columns")
babynames.head(5)
Out[53]:
State Sex Year Name Count
334166 CA M 1996 Franciscojavier 8
337301 CA M 1997 Franciscojavier 5
339472 CA M 1998 Franciscojavier 6
321792 CA M 1991 Ryanchristopher 7
327358 CA M 1993 Johnchristopher 5

Approach 2: Sorting using the key argument¶

In [54]:
babynames.sort_values("Name", key=lambda x:x.str.len(), ascending=False).head()
Out[54]:
State Sex Year Name Count
334166 CA M 1996 Franciscojavier 8
327472 CA M 1993 Ryanchristopher 5
337301 CA M 1997 Franciscojavier 5
337477 CA M 1997 Ryanchristopher 5
312543 CA M 1987 Franciscojavier 5

Approach 3: Sorting Using the map Function¶

We can also use the Python map function if we want to use an arbitrarily defined function. Suppose we want to sort by the number of occurrences of "dr" plus the number of occurences of "ea".

In [55]:
# First, define a function to count 
# the number of times "dr" or "ea" appear in each name.
def dr_ea_count(string):
    return string.count('dr') + string.count('ea')

# Then, use `map` to apply `dr_ea_count` to each name in the "Name" column.
babynames["dr_ea_count"] = babynames["Name"].map(dr_ea_count)

# Sort the DataFrame by the new "dr_ea_count" column 
# so we can see our handiwork.
babynames = babynames.sort_values(by="dr_ea_count", ascending=False)
babynames.head()
Out[55]:
State Sex Year Name Count dr_ea_count
115957 CA F 1990 Deandrea 5 3
101976 CA F 1986 Deandrea 6 3
131029 CA F 1994 Leandrea 5 3
108731 CA F 1988 Deandrea 5 3
308131 CA M 1985 Deandrea 6 3
In [56]:
# Drop the `dr_ea_count` column.
babynames = babynames.drop("dr_ea_count", axis="columns")
babynames.head(5)
Out[56]:
State Sex Year Name Count
115957 CA F 1990 Deandrea 5
101976 CA F 1986 Deandrea 6
131029 CA F 1994 Leandrea 5
108731 CA F 1988 Deandrea 5
308131 CA M 1985 Deandrea 6