3  Pandas II

Learning Outcomes
  • Continue building familiarity with pandas syntax.
  • Extract data from a DataFrame using conditional selection.
  • Recognize situations where aggregation is useful and identify the correct technique for performing an aggregation.

Last time, we introduced the pandas library as a toolkit for processing data. We learned the DataFrame and Series data structures, familiarized ourselves with the basic syntax for manipulating tabular data, and began writing our first lines of pandas code.

In this lecture, we’ll start to dive into some advanced pandas syntax. You may find it helpful to follow along with a notebook of your own as we walk through these new pieces of code.

We’ll start by loading the babynames dataset.

Code
# This code pulls census data and loads it into a DataFrame
# We won't cover it explicitly in this class, but you are welcome to explore it on your own
import pandas as pd
import numpy as np
import urllib.request
import os.path
import zipfile

data_url = "https://www.ssa.gov/oact/babynames/state/namesbystate.zip"
local_filename = "data/babynamesbystate.zip"
if not os.path.exists(local_filename): # If the data exists don't download again
    with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
        f.write(resp.read())

zf = zipfile.ZipFile(local_filename, 'r')

ca_name = 'STATE.CA.TXT'
field_names = ['State', 'Sex', 'Year', 'Name', 'Count']
with zf.open(ca_name) as fh:
    babynames = pd.read_csv(fh, header=None, names=field_names)

babynames.head()
State Sex Year Name Count
0 CA F 1910 Mary 295
1 CA F 1910 Helen 239
2 CA F 1910 Dorothy 220
3 CA F 1910 Margaret 163
4 CA F 1910 Frances 134

3.1 Conditional Selection

Conditional selection allows us to select a subset of rows in a DataFrame that satisfy some specified condition.

To understand how to use conditional selection, we must look at another possible input of the .loc and [] methods – a boolean array, which is simply an array or Series where each element is either True or False. This boolean array must have a length equal to the number of rows in the DataFrame. It will return all rows that correspond to a value of True in the array. We used a very similar technique when performing conditional extraction from a Series in the last lecture.

To see this in action, let’s select all even-indexed rows in the first 10 rows of our DataFrame.

# Ask yourself: why is :9 is the correct slice to select the first 10 rows?
babynames_first_10_rows = babynames.loc[:9, :]

# Notice how we have exactly 10 elements in our boolean array argument
babynames_first_10_rows[[True, False, True, False, True, False, True, False, True, False]]
State Sex Year Name Count
0 CA F 1910 Mary 295
2 CA F 1910 Dorothy 220
4 CA F 1910 Frances 134
6 CA F 1910 Evelyn 126
8 CA F 1910 Virginia 101

We can perform a similar operation using .loc.

babynames_first_10_rows.loc[[True, False, True, False, True, False, True, False, True, False], :]
State Sex Year Name Count
0 CA F 1910 Mary 295
2 CA F 1910 Dorothy 220
4 CA F 1910 Frances 134
6 CA F 1910 Evelyn 126
8 CA F 1910 Virginia 101

These techniques worked well in this example, but you can imagine how tedious it might be to list out True and Falsefor every row in a larger DataFrame. To make things easier, we can instead provide a logical condition as an input to .loc or [] that returns a boolean array with the necessary length.

For example, to return all names associated with F sex:

# First, use a logical condition to generate a boolean array
logical_operator = (babynames["Sex"] == "F")

# Then, use this boolean array to filter the DataFrame
babynames[logical_operator].head()
State Sex Year Name Count
0 CA F 1910 Mary 295
1 CA F 1910 Helen 239
2 CA F 1910 Dorothy 220
3 CA F 1910 Margaret 163
4 CA F 1910 Frances 134

Recall from the previous lecture that .head() will return only the first few rows in the DataFrame. In reality, babynames[logical operator] contains as many rows as there are entries in the original babynames DataFrame with sex "F".

Here, logical_operator evaluates to a Series of boolean values with length 407428.

Code
print("There are a total of {} values in 'logical_operator'".format(len(logical_operator)))
There are a total of 407428 values in 'logical_operator'

Rows starting at row 0 and ending at row 239536 evaluate to True and are thus returned in the DataFrame. Rows from 239537 onwards evaluate to False and are omitted from the output.

Code
print("The 0th item in this 'logical_operator' is: {}".format(logical_operator.iloc[0]))
print("The 239536th item in this 'logical_operator' is: {}".format(logical_operator.iloc[239536]))
print("The 239537th item in this 'logical_operator' is: {}".format(logical_operator.iloc[239537]))
The 0th item in this 'logical_operator' is: True
The 239536th item in this 'logical_operator' is: True
The 239537th item in this 'logical_operator' is: False

Passing a Series as an argument to babynames[] has the same effect as using a boolean array. In fact, the [] selection operator can take a boolean Series, array, and list as arguments. These three are used interchangeably throughout the course.

We can also use .loc to achieve similar results.

babynames.loc[babynames["Sex"] == "F"].head()
State Sex Year Name Count
0 CA F 1910 Mary 295
1 CA F 1910 Helen 239
2 CA F 1910 Dorothy 220
3 CA F 1910 Margaret 163
4 CA F 1910 Frances 134

Boolean conditions can be combined using various bitwise operators, allowing us to filter results by multiple conditions. In the table below, p and q are boolean arrays or Series.

Symbol Usage Meaning
~ ~p Returns negation of p
| p | q p OR q
& p & q p AND q
^ p ^ q p XOR q (exclusive or)

When combining multiple conditions with logical operators, we surround each individual condition with a set of parenthesis (). This imposes an order of operations on pandas evaluating your logic and can avoid code erroring.

For example, if we want to return data on all names with sex "F" born before the year 2000, we can write:

babynames[(babynames["Sex"] == "F") & (babynames["Year"] < 2000)].head()
State Sex Year Name Count
0 CA F 1910 Mary 295
1 CA F 1910 Helen 239
2 CA F 1910 Dorothy 220
3 CA F 1910 Margaret 163
4 CA F 1910 Frances 134

Note that we’re working with Series, so using and in place of &, or or in place | will error.

# This line of code will raise a ValueError
# babynames[(babynames["Sex"] == "F") and (babynames["Year"] < 2000)].head()

If we want to return data on all names with sex "F" or all born before the year 2000, we can write:

babynames[(babynames["Sex"] == "F") | (babynames["Year"] < 2000)].head()
State Sex Year Name Count
0 CA F 1910 Mary 295
1 CA F 1910 Helen 239
2 CA F 1910 Dorothy 220
3 CA F 1910 Margaret 163
4 CA F 1910 Frances 134

Boolean array selection is a useful tool, but can lead to overly verbose code for complex conditions. In the example below, our boolean condition is long enough to extend for several lines of code.

# Note: The parentheses surrounding the code make it possible to break the code on to multiple lines for readability
(
    babynames[(babynames["Name"] == "Bella") | 
              (babynames["Name"] == "Alex") |
              (babynames["Name"] == "Ani") |
              (babynames["Name"] == "Lisa")]
).head()
State Sex Year Name Count
6289 CA F 1923 Bella 5
7512 CA F 1925 Bella 8
12368 CA F 1932 Lisa 5
14741 CA F 1936 Lisa 8
17084 CA F 1939 Lisa 5

Fortunately, pandas provides many alternative methods for constructing boolean filters.

The .isin function is one such example. This method evaluates if the values in a Series are contained in a different sequence (list, array, or Series) of values. In the cell below, we achieve equivalent results to the DataFrame above with far more concise code.

names = ["Bella", "Alex", "Narges", "Lisa"]
babynames["Name"].isin(names).head()
0    False
1    False
2    False
3    False
4    False
Name: Name, dtype: bool
babynames[babynames["Name"].isin(names)].head()
State Sex Year Name Count
6289 CA F 1923 Bella 5
7512 CA F 1925 Bella 8
12368 CA F 1932 Lisa 5
14741 CA F 1936 Lisa 8
17084 CA F 1939 Lisa 5

The function str.startswith can be used to define a filter based on string values in a Series object. It checks to see if string values in a Series start with a particular character.

# Identify whether names begin with the letter "N"
babynames["Name"].str.startswith("N").head()
0    False
1    False
2    False
3    False
4    False
Name: Name, dtype: bool
# Extracting names that begin with the letter "N"
babynames[babynames["Name"].str.startswith("N")].head()
State Sex Year Name Count
76 CA F 1910 Norma 23
83 CA F 1910 Nellie 20
127 CA F 1910 Nina 11
198 CA F 1910 Nora 6
310 CA F 1911 Nellie 23

3.2 Adding, Removing, and Modifying Columns

In many data science tasks, we may need to change the columns contained in our DataFrame in some way. Fortunately, the syntax to do so is fairly straightforward.

To add a new column to a DataFrame, we use a syntax similar to that used when accessing an existing column. Specify the name of the new column by writing df["column"], then assign this to a Series or array containing the values that will populate this column.

# Create a Series of the length of each name. 
babyname_lengths = babynames["Name"].str.len()

# Add a column named "name_lengths" that includes the length of each name
babynames["name_lengths"] = babyname_lengths
babynames.head(5)
State Sex Year Name Count name_lengths
0 CA F 1910 Mary 295 4
1 CA F 1910 Helen 239 5
2 CA F 1910 Dorothy 220 7
3 CA F 1910 Margaret 163 8
4 CA F 1910 Frances 134 7

If we need to later modify an existing column, we can do so by referencing this column again with the syntax df["column"], then re-assigning it to a new Series or array of the appropriate length.

# Modify the “name_lengths” column to be one less than its original value
babynames["name_lengths"] = babynames["name_lengths"] - 1
babynames.head()
State Sex Year Name Count name_lengths
0 CA F 1910 Mary 295 3
1 CA F 1910 Helen 239 4
2 CA F 1910 Dorothy 220 6
3 CA F 1910 Margaret 163 7
4 CA F 1910 Frances 134 6

We can rename a column using the .rename() method. It takes in a dictionary that maps old column names to their new ones.

# Rename “name_lengths” to “Length”
babynames = babynames.rename(columns={"name_lengths":"Length"})
babynames.head()
State Sex Year Name Count Length
0 CA F 1910 Mary 295 3
1 CA F 1910 Helen 239 4
2 CA F 1910 Dorothy 220 6
3 CA F 1910 Margaret 163 7
4 CA F 1910 Frances 134 6

If we want to remove a column or row of a DataFrame, we can call the .drop (documentation) method. Use the axis parameter to specify whether a column or row should be dropped. Unless otherwise specified, pandas will assume that we are dropping a row by default.

# Drop our new "Length" column from the DataFrame
babynames = babynames.drop("Length", axis="columns")
babynames.head(5)
State Sex Year Name Count
0 CA F 1910 Mary 295
1 CA F 1910 Helen 239
2 CA F 1910 Dorothy 220
3 CA F 1910 Margaret 163
4 CA F 1910 Frances 134

Notice that we re-assigned babynames to the result of babynames.drop(...). This is a subtle but important point: pandas table operations do not occur in-place. Calling df.drop(...) will output a copy of df with the row/column of interest removed without modifying the original df table.

In other words, if we simply call:

# This creates a copy of `babynames` and removes the column "Name"...
babynames.drop("Name", axis="columns")

# ...but the original `babynames` is unchanged! 
# Notice that the "Name" column is still present
babynames.head(5)
State Sex Year Name Count
0 CA F 1910 Mary 295
1 CA F 1910 Helen 239
2 CA F 1910 Dorothy 220
3 CA F 1910 Margaret 163
4 CA F 1910 Frances 134

3.3 Useful Utility Functions

pandas contains an extensive library of functions that can help shorten the process of setting and getting information from its data structures. In the following section, we will give overviews of each of the main utility functions that will help us in Data 100.

Discussing all functionality offered by pandas could take an entire semester! We will walk you through the most commonly-used functions and encourage you to explore and experiment on your own.

  • NumPy and built-in function support
  • .shape
  • .size
  • .describe()
  • .sample()
  • .value_counts()
  • .unique()
  • .sort_values()

The pandas documentation will be a valuable resource in Data 100 and beyond.

3.3.1 NumPy

pandas is designed to work well with NumPy, the framework for array computations you encountered in Data 8. Just about any NumPy function can be applied to pandas DataFrames and Series.

# Pull out the number of babies named Yash each year
yash_count = babynames[babynames["Name"] == "Yash"]["Count"]
yash_count.head()
331824     8
334114     9
336390    11
338773    12
341387    10
Name: Count, dtype: int64
# Average number of babies named Yash each year
np.mean(yash_count)
17.142857142857142
# Max number of babies named Yash born in any one year
np.max(yash_count)
29

3.3.2 .shape and .size

.shape and .size are attributes of Series and DataFrames that measure the “amount” of data stored in the structure. Calling .shape returns a tuple containing the number of rows and columns present in the DataFrame or Series. .size is used to find the total number of elements in a structure, equivalent to the number of rows times the number of columns.

Many functions strictly require the dimensions of the arguments along certain axes to match. Calling these dimension-finding functions is much faster than counting all of the items by hand.

# Return the shape of the DataFrame, in the format (num_rows, num_columns)
babynames.shape
(407428, 5)
# Return the size of the DataFrame, equal to num_rows * num_columns
babynames.size
2037140

3.3.3 .describe()

If many statistics are required from a DataFrame (minimum value, maximum value, mean value, etc.), then .describe() (documentation) can be used to compute all of them at once.

babynames.describe()
Year Count
count 407428.000000 407428.000000
mean 1985.733609 79.543456
std 27.007660 293.698654
min 1910.000000 5.000000
25% 1969.000000 7.000000
50% 1992.000000 13.000000
75% 2008.000000 38.000000
max 2022.000000 8260.000000

A different set of statistics will be reported if .describe() is called on a Series.

babynames["Sex"].describe()
count     407428
unique         2
top            F
freq      239537
Name: Sex, dtype: object

3.3.4 .sample()

As we will see later in the semester, random processes are at the heart of many data science techniques (for example, train-test splits, bootstrapping, and cross-validation). .sample() (documentation) lets us quickly select random entries (a row if called from a DataFrame, or a value if called from a Series).

By default, .sample() selects entries without replacement. Pass in the argument replace=True to sample with replacement.

# Sample a single row
babynames.sample()
State Sex Year Name Count
322054 CA M 1991 Sherwin 6

Naturally, this can be chained with other methods and operators (iloc, etc.).

# Sample 5 random rows, and select all columns after column 2
babynames.sample(5).iloc[:, 2:]
Year Name Count
348524 2002 Kenji 17
197715 2012 Rylee 221
239390 2022 Nahia 5
175096 2006 Rylin 9
29180 1951 Pearl 28
# Randomly sample 4 names from the year 2000, with replacement, and select all columns after column 2
babynames[babynames["Year"] == 2000].sample(4, replace = True).iloc[:, 2:]
Year Name Count
150014 2000 Allyssa 31
152815 2000 Zahraa 5
342879 2000 Tucker 63
344867 2000 Nery 5

3.3.5 .value_counts()

The Series.value_counts() (documentation) method counts the number of occurrence of each unique value in a Series. In other words, it counts the number of times each unique value appears. This is often useful for determining the most or least common entries in a Series.

In the example below, we can determine the name with the most years in which at least one person has taken that name by counting the number of times each name appears in the "Name" column of babynames. Note that the return value is also a Series.

babynames["Name"].value_counts().head()
Jean         223
Francis      221
Guadalupe    218
Jessie       217
Marion       214
Name: Name, dtype: int64

3.3.6 .unique()

If we have a Series with many repeated values, then .unique() (documentation) can be used to identify only the unique values. Here we return an array of all the names in babynames.

babynames["Name"].unique()
array(['Mary', 'Helen', 'Dorothy', ..., 'Zae', 'Zai', 'Zayvier'],
      dtype=object)

3.3.7 .sort_values()

Ordering a DataFrame can be useful for isolating extreme values. For example, the first 5 entries of a row sorted in descending order (that is, from highest to lowest) are the largest 5 values. .sort_values (documentation) allows us to order a DataFrame or Series by a specified column. We can choose to either receive the rows in ascending order (default) or descending order.

# Sort the "Count" column from highest to lowest
babynames.sort_values(by="Count", ascending=False).head()
State Sex Year Name Count
268041 CA M 1957 Michael 8260
267017 CA M 1956 Michael 8258
317387 CA M 1990 Michael 8246
281850 CA M 1969 Michael 8245
283146 CA M 1970 Michael 8196

Unlike when calling .value_counts() on a DataFrame, we do not need to explicitly specify the column used for sorting when calling .value_counts() on a Series. We can still specify the ordering paradigm – that is, whether values are sorted in ascending or descending order.

# Sort the "Name" Series alphabetically
babynames["Name"].sort_values(ascending=True).head()
366001      Aadan
384005      Aadan
369120      Aadan
398211    Aadarsh
370306      Aaden
Name: Name, dtype: object

3.4 Parting Note

Manipulating DataFrames is not a skill that is mastered in just one day. Due to the flexibility of pandas, there are many different ways to get from point A to point B. We recommend trying multiple different ways to solve the same problem to gain even more practice and reach that point of mastery sooner.

Next, we will start digging deeper into the mechanics behind grouping data.