3  Pandas II

  • Build familiarity with advanced pandas syntax
  • Extract data from a DataFrame using conditional selection
  • Recognize situations where aggregation is useful and identify the correct technique for performing an aggregation

Last time, we introduced the pandas library as a toolkit for processing data. We learned about the DataFrame and Series data structures, familiarized ourselves with the basic syntax for manipulating tabular data, and began writing our first lines of pandas code.

In this lecture, we’ll start to dive into some advanced pandas syntax. You may find it helpful to follow along with a notebook of your own as we walk through these new pieces of code.

We’ll start by loading the babynames dataset.

import pandas as pd
import numpy as np
import urllib.request
import os.path
import zipfile

data_url = "https://www.ssa.gov/oact/babynames/state/namesbystate.zip"
local_filename = "babynamesbystate.zip"
if not os.path.exists(local_filename): # if the data exists don't download again
    with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
        f.write(resp.read())

zf = zipfile.ZipFile(local_filename, 'r')

ca_name = 'CA.TXT'
field_names = ['State', 'Sex', 'Year', 'Name', 'Count']
with zf.open(ca_name) as fh:
    babynames = pd.read_csv(fh, header=None, names=field_names)

babynames.head()
State Sex Year Name Count
0 CA F 1910 Mary 295
1 CA F 1910 Helen 239
2 CA F 1910 Dorothy 220
3 CA F 1910 Margaret 163
4 CA F 1910 Frances 134

3.1 Conditional Selection

Conditional selection allows us to select the subset of rows in a DataFrame that satisfy some specified condition.

To understand how to use conditional selection, we must look at another possible input to the .loc and [] methods – a boolean array, which is simply an array where each element is either True or False. This boolean array must have a length equal to the number of rows in the DataFrame. Only the rows at positions corresponding to a True value in the array will be returned.

To see this in action, let’s select all even-indexed rows in the first 10 rows of our DataFrame.

# Ask yourself: why is :9 the correct slice to select the first 10 rows?
babynames_first_10_rows = babynames.loc[:9, :]

# Notice how we have exactly 10 elements in our boolean array argument
babynames_first_10_rows[[True, False, True, False, True, False, True, False, True, False]]
State Sex Year Name Count
0 CA F 1910 Mary 295
2 CA F 1910 Dorothy 220
4 CA F 1910 Frances 134
6 CA F 1910 Evelyn 126
8 CA F 1910 Virginia 101

We can perform a similar operation using .loc.

babynames_first_10_rows.loc[[True, False, True, False, True, False, True, False, True, False], :]
State Sex Year Name Count
0 CA F 1910 Mary 295
2 CA F 1910 Dorothy 220
4 CA F 1910 Frances 134
6 CA F 1910 Evelyn 126
8 CA F 1910 Virginia 101

These techniques worked well in this example, but you can imagine how tedious it might be to list out Trues and Falses for every row in a larger DataFrame. To make things easier, we can instead provide a logical condition as an input to .loc or [] that evaluates to a boolean array of the necessary length.

For example, to return all names associated with F sex:

# First, use a logical condition to generate a boolean array
logical_operator = (babynames["Sex"] == "F")

# Then, use this boolean array to filter the DataFrame
babynames[logical_operator].head()
State Sex Year Name Count
0 CA F 1910 Mary 295
1 CA F 1910 Helen 239
2 CA F 1910 Dorothy 220
3 CA F 1910 Margaret 163
4 CA F 1910 Frances 134

Here, logical_operator evaluates to a Series of boolean values with length 400762.

print("There are a total of {} values in 'logical_operator'".format(len(logical_operator)))
There are a total of 400762 values in 'logical_operator'

Rows at positions 0 through 235790 evaluate to True and are thus returned in the DataFrame.

print("The 0th item in this 'logical_operator' is: {}".format(logical_operator.iloc[0]))
print("The 235790th item in this 'logical_operator' is: {}".format(logical_operator.iloc[235790]))
print("The 235791th item in this 'logical_operator' is: {}".format(logical_operator.iloc[235791]))
The 0th item in this 'logical_operator' is: True
The 235790th item in this 'logical_operator' is: True
The 235791st item in this 'logical_operator' is: False

Passing a Series as an argument to babynames[] has the same effect as using a boolean array. In fact, the [] selection operator can take a boolean Series, array, or list as an argument. These three are used interchangeably throughout the course.
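To illustrate the equivalence, here is a minimal sketch (the mask variable names are our own) showing all three forms selecting the same rows from the first five rows of the DataFrame:

# A boolean list, NumPy array, and Series can each serve as the selection mask
mask_list = [True, True, False, False, False]
mask_array = np.array(mask_list)
mask_series = pd.Series(mask_list)

first_5 = babynames.head(5)
# All three produce the same two-row result
print(first_5[mask_list].equals(first_5[mask_array]))    # True
print(first_5[mask_array].equals(first_5[mask_series]))  # True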

We can also use .loc to achieve similar results.

babynames.loc[babynames["Sex"] == "F"].head()
State Sex Year Name Count
0 CA F 1910 Mary 295
1 CA F 1910 Helen 239
2 CA F 1910 Dorothy 220
3 CA F 1910 Margaret 163
4 CA F 1910 Frances 134

Boolean conditions can be combined using various operators that allow us to filter results by multiple conditions. Some examples include the & (and) operator and the | (or) operator.

Note: When combining multiple conditions with logical operators, be sure to surround each condition with a set of parentheses (). If you forget, your code will throw an error; we sketch why after the example below.

For example, if we want to return data on all female names recorded before the 21st century, we can write:

babynames[(babynames["Sex"] == "F") & (babynames["Year"] < 2000)].head()
State Sex Year Name Count
0 CA F 1910 Mary 295
1 CA F 1910 Helen 239
2 CA F 1910 Dorothy 220
3 CA F 1910 Margaret 163
4 CA F 1910 Frances 134
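Why are the parentheses necessary? In Python, the & operator binds more tightly than comparison operators like == and <, so dropping the parentheses changes how the expression is grouped. A minimal sketch of the failure mode:

# Without parentheses, Python parses the condition as
#   babynames["Sex"] == ("F" & babynames["Year"]) < 2000
# and evaluating "F" & babynames["Year"] raises an error:
# babynames[babynames["Sex"] == "F" & babynames["Year"] < 2000]   # Error!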

Boolean array selection is a useful tool, but it can lead to overly verbose code for complex conditions. pandas provides many alternatives:

(
    babynames[(babynames["Name"] == "Bella") | 
              (babynames["Name"] == "Alex") |
              (babynames["Name"] == "Ani") |
              (babynames["Name"] == "Lisa")]
).head()
# Note: The parentheses surrounding the code make it possible to break the code onto multiple lines for readability
State Sex Year Name Count
6289 CA F 1923 Bella 5
7512 CA F 1925 Bella 8
12368 CA F 1932 Lisa 5
14741 CA F 1936 Lisa 8
17084 CA F 1939 Lisa 5

The .isin method can also be used to filter DataFrames. It selects the rows whose value in a particular column matches any one of several specified values.

names = ["Bella", "Alex", "Ani", "Lisa"]
babynames[babynames["Name"].isin(names)].head()
State Sex Year Name Count
6289 CA F 1923 Bella 5
7512 CA F 1925 Bella 8
12368 CA F 1932 Lisa 5
14741 CA F 1936 Lisa 8
17084 CA F 1939 Lisa 5

The .str.startswith method can be used to define a filter based on string values in a Series object.

babynames[babynames["Name"].str.startswith("N")].head()
State Sex Year Name Count
76 CA F 1910 Norma 23
83 CA F 1910 Nellie 20
127 CA F 1910 Nina 11
198 CA F 1910 Nora 6
310 CA F 1911 Nellie 23

3.2 Handy Utility Functions

pandas contains an extensive library of functions that can help shorten the process of setting and getting information from its data structures. In the following section, we will give overviews of each of the main utility functions that will help us in Data 100.

  • NumPy and built-in function support
  • .shape
  • .size
  • .describe()
  • .sample()
  • .value_counts()
  • .unique()
  • .sort_values()

3.2.1 NumPy

Because pandas is built on top of NumPy, Series and DataFrames interoperate smoothly with NumPy functions, as well as with Python built-ins like max and sum.

bella_counts = babynames[babynames["Name"] == "Bella"]["Count"]
# Average number of babies named Bella each year
np.mean(bella_counts)
270.1860465116279
# Max number of babies named Bella born in a given year
max(bella_counts)
902

3.2.2 .shape and .size

.shape and .size are attributes of Series and DataFrames that measure the “amount” of data stored in the structure. Accessing .shape returns a tuple containing the number of rows and columns in a DataFrame (for a Series, the tuple contains just the number of rows). .size gives the total number of elements in a structure; for a DataFrame, this is equivalent to the number of rows times the number of columns.

Many functions strictly require the dimensions of their arguments along certain axes to match. Accessing these attributes is much faster than counting all of the items by hand.

babynames.shape
(400762, 5)
babynames.size
2003810
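These attributes work on Series as well. As a quick sketch of the Series case (the outputs in the comments follow from the row count above):

# A Series is one-dimensional, so .shape is a tuple with a single entry
babynames["Name"].shape    # (400762,)

# For a Series, .size is simply the number of entries
babynames["Name"].size     # 400762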

3.2.3 .describe()

If many statistics are required from a DataFrame (minimum value, maximum value, mean value, etc.), then .describe() can be used to compute all of them at once.

babynames.describe()
Year Count
count 400762.000000 400762.000000
mean 1985.131287 79.953781
std 26.821004 295.414618
min 1910.000000 5.000000
25% 1968.000000 7.000000
50% 1991.000000 13.000000
75% 2007.000000 38.000000
max 2021.000000 8262.000000

A different set of statistics will be reported if .describe() is called on a Series.

babynames["Sex"].describe()
count     400762
unique         2
top            F
freq      235791
Name: Sex, dtype: object

3.2.4 .sample()

As we will see later in the semester, random processes are at the heart of many data science techniques (for example, train-test splits, bootstrapping, and cross-validation). .sample() lets us quickly select random entries (a row if called from a DataFrame, or a value if called from a Series).

babynames.sample()
State Sex Year Name Count
94914 CA F 1984 Ivonne 21
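# Sample 5 rows, keeping only the columns from position 2 onward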
babynames.sample(5).iloc[:, 2:]
Year Name Count
104391 1987 Britnie 10
223218 2018 Maribelle 11
84035 1980 Letisia 24
251285 1941 Bryan 12
56022 1967 Tera 12
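# Sample 4 rows from the year 2000, with replacement (the same row may be drawn more than once)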
babynames[babynames["Year"] == 2000].sample(4, replace = True).iloc[:, 2:]
Year Name Count
340545 2000 Lizandro 7
152666 2000 Nerissa 5
150438 2000 Delanie 18
151502 2000 Jackelyne 8

3.2.5 .value_counts()

When we want to know the distribution of the items in a Series (for example, what items are most/least common), we use .value_counts() to get a breakdown of the unique values and their counts. In the example below, we can find the names that have been used for the most years by counting the number of times each name appears in the "Name" column of babynames.

babynames["Name"].value_counts().head()
Jean         221
Francis      219
Guadalupe    216
Jessie       215
Marion       213
Name: Name, dtype: int64

3.2.6 .unique()

If we have a Series with many repeated values, then .unique() can be used to identify only the unique values. Here, we can get an array of all the unique names in babynames.

Exercise: what function can we call on the Series below to get the number of unique names?

babynames["Name"].unique()
array(['Mary', 'Helen', 'Dorothy', ..., 'Zyire', 'Zylo', 'Zyrus'],
      dtype=object)

3.2.7 .sort_values()

Ordering a DataFrame can be useful for isolating extreme values. For example, the first 5 entries of a column sorted in descending order (that is, from highest to lowest) are its 5 largest values. .sort_values allows us to order a DataFrame or Series by a specified rule. For DataFrames, we must specify the column by which we want to compare the rows; the method then returns the rows sorted by that column. We can choose to receive the rows in either ascending order (default) or descending order.

babynames.sort_values(by = "Count", ascending=False).head()
State Sex Year Name Count
263272 CA M 1956 Michael 8262
264297 CA M 1957 Michael 8250
313644 CA M 1990 Michael 8247
278109 CA M 1969 Michael 8244
279405 CA M 1970 Michael 8197

We do not need to explicitly specify the column used for sorting when calling .sort_values() on a Series. We can still specify the ordering – that is, whether values are sorted in ascending or descending order.

babynames["Name"].sort_values(ascending=True).head()
380256      Aadan
362255      Aadan
365374      Aadan
394460    Aadarsh
366561      Aaden
Name: Name, dtype: object

3.2.7.1 Sorting With a Custom Key

Using .sort_values can be useful in many situations, but it may not cover all use cases. This is because pandas automatically sorts values according to numeric value (for numeric data) or alphabetical order (for string data). The following code finds the top 5 most popular names in California in 2021.

# Sort names by count in year 2021
# Recall that `.head(5)` displays the first five rows in the DataFrame
babynames[babynames["Year"] == 2021].sort_values("Count", ascending=False).head()
State Sex Year Name Count
397909 CA M 2021 Noah 2591
397910 CA M 2021 Liam 2469
232145 CA F 2021 Olivia 2395
232146 CA F 2021 Emma 2171
397911 CA M 2021 Mateo 2108

This offers us a lot of functionality, but what if we need to sort by some other metric? For example, what if we wanted to find the longest names in the DataFrame?

We can do this by specifying the key parameter of .sort_values. The key parameter is assigned to a function of our choice. This function is then applied to each value in the specified column, and pandas sorts the DataFrame by the function's outputs.

# Here, a lambda function is applied to find the length of each value, `x`, in the "Name" column
babynames.sort_values("Name", key=lambda x: x.str.len(), ascending=False).head(5)
State Sex Year Name Count
313143 CA M 1989 Franciscojavier 6
333732 CA M 1997 Ryanchristopher 5
330421 CA M 1996 Franciscojavier 8
323615 CA M 1993 Johnchristopher 5
310235 CA M 1988 Franciscojavier 10

3.3 Adding and Removing Columns

To add a new column to a DataFrame, we use syntax similar to that used when accessing an existing column. Specify the name of the new column by writing dataframe["new_column"], then assign this to a Series or array containing the values that will populate the column.

# Add a column named "name_lengths" that includes the length of each name
babynames["name_lengths"] = babynames["Name"].str.len()
babynames.head(5)
State Sex Year Name Count name_lengths
0 CA F 1910 Mary 295 4
1 CA F 1910 Helen 239 5
2 CA F 1910 Dorothy 220 7
3 CA F 1910 Margaret 163 8
4 CA F 1910 Frances 134 7
# Sort by the temporary column
babynames = babynames.sort_values(by = "name_lengths", ascending=False)
babynames.head()
State Sex Year Name Count name_lengths
313143 CA M 1989 Franciscojavier 6 15
333732 CA M 1997 Ryanchristopher 5 15
330421 CA M 1996 Franciscojavier 8 15
323615 CA M 1993 Johnchristopher 5 15
310235 CA M 1988 Franciscojavier 10 15

In the example above, we made use of a built-in method provided by the str accessor to get the length of each name. We then used the name_lengths column to sort the DataFrame. What if we had wanted to generate the values in our new column using a function of our own making?

We can do this using the Series .map method. .map takes in a function as input, and will apply this function to each value of a Series.

For example, say we wanted to find the number of occurrences of the sequence “dr” or “ea” in each name.

# First, define a function to count the number of times "dr" or "ea" appear in each name
def dr_ea_count(string):
    return string.count("dr") + string.count("ea")

# Then, use `map` to apply `dr_ea_count` to each name in the "Name" column
babynames["dr_ea_count"] = babynames["Name"].map(dr_ea_count)

# Sort the DataFrame by the new "dr_ea_count" column so we can see our handiwork
babynames.sort_values(by = "dr_ea_count", ascending = False).head(5)
State Sex Year Name Count name_lengths dr_ea_count
101969 CA F 1986 Deandrea 6 8 3
304390 CA M 1985 Deandrea 6 8 3
131022 CA F 1994 Leandrea 5 8 3
115950 CA F 1990 Deandrea 5 8 3
108723 CA F 1988 Deandrea 5 8 3

If we want to remove a column or row of a DataFrame, we can call the .drop method. Use the axis parameter to specify whether a column or row should be dropped. Unless otherwise specified, pandas will assume that we are dropping a row.

# Drop our "dr_ea_count" and "length" columns from the DataFrame
babynames = babynames.drop(["dr_ea_count", "name_lengths"], axis="columns")
babynames.head(5)
State Sex Year Name Count
313143 CA M 1989 Franciscojavier 6
333732 CA M 1997 Ryanchristopher 5
330421 CA M 1996 Franciscojavier 8
323615 CA M 1993 Johnchristopher 5
310235 CA M 1988 Franciscojavier 10

Notice that we reassigned babynames to the result of babynames.drop(...). This is a subtle but important point: pandas table operations do not occur in-place. Calling dataframe.drop(...) will output a copy of dataframe with the row/column of interest removed, without modifying the original dataframe.

In other words, if we simply call:

# This creates a copy of `babynames` and removes the row with label 3...
babynames.drop(3, axis="rows")

# ...but the original `babynames` is unchanged! 
# Notice that the row with label 3 is still present
babynames.head(5)
State Sex Year Name Count
313143 CA M 1989 Franciscojavier 6
333732 CA M 1997 Ryanchristopher 5
330421 CA M 1996 Franciscojavier 8
323615 CA M 1993 Johnchristopher 5
310235 CA M 1988 Franciscojavier 10

3.4 Aggregating Data with GroupBy

Up until this point, we have been working with individual rows of DataFrames. As data scientists, we often wish to investigate trends across a larger subset of our data. For example, we may want to compute some summary statistic (the mean, median, sum, etc.) for a group of rows in our DataFrame. To do this, we’ll use pandas GroupBy objects.

Let’s say we wanted to aggregate all rows in babynames for a given year.

babynames.groupby("Year")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000022E30AA5880>

What does this strange output mean? Calling .groupby has generated a GroupBy object. You can imagine this as a set of “mini” sub-DataFrames, where each subframe contains all of the rows from babynames that correspond to a particular year.

The diagram below shows a simplified view of babynames to help illustrate this idea.

Creating a GroupBy object

We can’t work with a GroupBy object directly – that is why you saw that strange output earlier, rather than a standard view of a DataFrame. To actually manipulate values within these “mini” DataFrames, we’ll need to call an aggregation method. This is a method that tells pandas how to aggregate the values within the GroupBy object. Once the aggregation is applied, pandas will return a normal (now grouped) DataFrame.
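While we cannot view a GroupBy object like a normal DataFrame, we can retrieve any single subframe with the .get_group method. As a quick sanity check of the "mini" DataFrame picture:

# Retrieve the sub-DataFrame containing all of the rows for the year 1910
babynames.groupby("Year").get_group(1910).head()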

The first aggregation method we’ll consider is .agg. The .agg method takes in a function as its argument; this function is then applied to each column of a “mini” grouped DataFrame. We end up with a new DataFrame with one aggregated row per subframe. Let’s see this in action by finding the sum of all counts for each year in babynames – this is equivalent to finding the number of babies born in each year.

babynames.groupby("Year").agg(sum).head(5)
Count
Year
1910 9163
1911 9983
1912 17946
1913 22094
1914 26926

We can relate this back to the diagram we used above. Remember that the diagram uses a simplified version of babynames, which is why we see smaller values for the summed counts.

Performing an aggregation

Calling .agg has condensed each subframe back into a single row. This gives us our final output: a DataFrame that is now indexed by "Year", with a single row for each unique year in the original babynames DataFrame.

You may be wondering: where did the "State", "Sex", and "Name" columns go? Logically, it doesn’t make sense to sum the string data in these columns (how would we add “Mary” + “Ann”?). Because of this, pandas will simply omit these columns when it performs the aggregation on the DataFrame. Since this happens implicitly, without the user specifying that these columns should be ignored, it’s easy to run into troubling situations where columns are removed without the programmer noticing. It is better coding practice to select only the columns we care about before performing the aggregation.

# Same result, but now we explicitly tell pandas to only consider the "Count" column when summing
babynames.groupby("Year")[["Count"]].agg(sum).head(5)
Count
Year
1910 9163
1911 9983
1912 17946
1913 22094
1914 26926
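As a side note, .agg also accepts the name of a common aggregation as a string, and GroupBy objects expose shorthand methods for the same operations. To our understanding, both of the following are equivalent to the call above:

# Pass the aggregation by name...
babynames.groupby("Year")[["Count"]].agg("sum").head(5)

# ...or call the shorthand method directly
babynames.groupby("Year")[["Count"]].sum().head(5)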

3.4.1 Parting note

Manipulating DataFrames is a skill that is not mastered in just one day. Due to the flexibility of pandas, there are many different ways to get from point A to point B. We recommend trying multiple different ways to solve the same problem to gain even more practice and reach that point of mastery sooner.

Next, we will start digging deeper into the mechanics behind grouping data.