Learning Outcomes
- Extract data from a DataFrame using conditional selection
- Recognize situations where aggregation is useful and identify the correct technique for performing an aggregation
Last time, we introduced the pandas library as a toolkit for processing data. We learned the DataFrame and Series data structures, familiarized ourselves with the basic syntax for manipulating tabular data, and began writing our first lines of pandas code.
In this lecture, we’ll start to dive into some advanced pandas syntax. You may find it helpful to follow along with a notebook of your own as we walk through these new pieces of code.
We’ll start by loading the babynames dataset.
import pandas as pd
import numpy as np
import urllib.request
import os.path
import zipfile

data_url = "https://www.ssa.gov/oact/babynames/state/namesbystate.zip"
local_filename = "babynamesbystate.zip"
if not os.path.exists(local_filename):  # if the data exists don't download again
    with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
        f.write(resp.read())

zf = zipfile.ZipFile(local_filename, 'r')

ca_name = 'CA.TXT'
field_names = ['State', 'Sex', 'Year', 'Name', 'Count']
with zf.open(ca_name) as fh:
    babynames = pd.read_csv(fh, header=None, names=field_names)

babynames.head()
  State Sex  Year      Name  Count
0    CA   F  1910      Mary    295
1    CA   F  1910     Helen    239
2    CA   F  1910   Dorothy    220
3    CA   F  1910  Margaret    163
4    CA   F  1910   Frances    134
3.1 Conditional Selection
Conditional selection allows us to select a subset of rows in a DataFrame that satisfy some specified condition.
To understand how to use conditional selection, we must look at another possible input of the .loc and [] methods – a boolean array, which is simply an array where each element is either True or False. This boolean array must have a length equal to the number of rows in the DataFrame. It will return all rows that correspond to a True value in the array.
To see this in action, let’s select all even-indexed rows in the first 10 rows of our DataFrame.
# Ask yourself: why is :9 the correct slice to select the first 10 rows?
babynames_first_10_rows = babynames.loc[:9, :]

# Notice how we have exactly 10 elements in our boolean array argument
babynames_first_10_rows[[True, False, True, False, True, False, True, False, True, False]]
These techniques worked well in this example, but you can imagine how tedious it might be to list out Trues and Falses for every row in a larger DataFrame. To make things easier, we can instead provide a logical condition as an input to .loc or [] that returns a boolean array with the necessary length.
For example, to return all names associated with F sex:
# First, use a logical condition to generate a boolean array
logical_operator = (babynames["Sex"] == "F")

# Then, use this boolean array to filter the DataFrame
babynames[logical_operator].head()
  State Sex  Year      Name  Count
0    CA   F  1910      Mary    295
1    CA   F  1910     Helen    239
2    CA   F  1910   Dorothy    220
3    CA   F  1910  Margaret    163
4    CA   F  1910   Frances    134
Here, logical_operator evaluates to a Series of boolean values with length 400762.
print("There are a total of {} values in 'logical_operator'".format(len(logical_operator)))
There are a total of 400762 values in 'logical_operator'
Rows starting at row 0 and ending at row 235790 evaluate to True and are thus returned in the DataFrame.
print("The 0th item in this 'logical_operator' is: {}".format(logical_operator.iloc[0]))print("The 235790th item in this 'logical_operator' is: {}".format(logical_operator.iloc[235790]))print("The 235791th item in this 'logical_operator' is: {}".format(logical_operator.iloc[235791]))
The 0th item in this 'logical_operator' is: True
The 235790th item in this 'logical_operator' is: True
The 235791st item in this 'logical_operator' is: False
Passing a Series as an argument to babynames[] has the same effect as using a boolean array. In fact, the [] selection operator can take a boolean Series, array, or list as an argument. These three are used interchangeably throughout the course.
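For instance, here is a minimal sketch of this interchangeability; all three lines below apply the same filter:

# Build the boolean Series once
is_female = babynames["Sex"] == "F"

babynames[is_female]              # boolean Series
babynames[is_female.to_numpy()]   # boolean NumPy array
babynames[list(is_female)]        # boolean Python list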
We can also use .loc to achieve similar results.
babynames.loc[babynames["Sex"] == "F"].head()
  State Sex  Year      Name  Count
0    CA   F  1910      Mary    295
1    CA   F  1910     Helen    239
2    CA   F  1910   Dorothy    220
3    CA   F  1910  Margaret    163
4    CA   F  1910   Frances    134
Boolean conditions can be combined using various operators that allow us to filter results by multiple conditions. Some examples include the & (and) operator and the | (or) operator.
Note: When combining multiple conditions with logical operators, be sure to surround each condition with a set of parentheses (). If you forget, your code will throw an error.
For example, if we want to return data on all females born before the 21st century, we can write:
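# Combine both conditions with &, surrounding each condition with parentheses
babynames[(babynames["Sex"] == "F") & (babynames["Year"] < 2000)].head()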
Boolean array selection is a useful tool, but it can lead to overly verbose code for complex conditions. In the example below, the boolean condition is long enough to span several lines of code.
# Note: The parentheses surrounding the code make it possible to break the code
# into multiple lines for readability
(
    babynames[(babynames["Name"] == "Bella") |
              (babynames["Name"] == "Alex") |
              (babynames["Name"] == "Ani") |
              (babynames["Name"] == "Lisa")]
).head()
       State Sex  Year   Name  Count
6289      CA   F  1923  Bella      5
7512      CA   F  1925  Bella      8
12368     CA   F  1932   Lisa      5
14741     CA   F  1936   Lisa      8
17084     CA   F  1939   Lisa      5
Fortunately, pandas provides alternative methods for constructing boolean filters; the .isin function is one such example. It selects the rows whose value in a particular column is contained in a given list of values.
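For example, here is a minimal sketch that reproduces the verbose filter above using .isin:

# Keep only the rows whose "Name" appears in the list `names`
names = ["Bella", "Alex", "Ani", "Lisa"]
babynames[babynames["Name"].isin(names)].head()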
3.2 Handy Utility Functions
pandas contains an extensive library of functions that can help shorten the process of setting and getting information from its data structures. In the following section, we will give overviews of each of the main utility functions that will help us in Data 100.
3.2.1 NumPy
The NumPy functions you encountered in Data 8 are compatible with objects in pandas.

# First, extract the "Count" Series for all rows where the name is "Bella"
bella_counts = babynames[babynames["Name"] == "Bella"]["Count"]

# Average number of babies named Bella each year
np.mean(bella_counts)
270.1860465116279
# Max number of babies named Bella born in a given year
max(bella_counts)
902
3.2.2 .shape and .size
.shape and .size are attributes of Series and DataFrames that measure the “amount” of data stored in the structure. Calling .shape returns a tuple containing the number of rows and columns present in the DataFrame or Series. .size is used to find the total number of elements in a structure, equivalent to the number of rows times the number of columns.
Many functions strictly require the dimensions of the arguments along certain axes to match. Calling these dimension-finding functions is much faster than counting all of the items by hand.
babynames.shape
(400762, 5)
babynames.size
2003810
3.2.3 .describe()
If many statistics are required from a DataFrame (minimum value, maximum value, mean value, etc.), then .describe() can be used to compute all of them at once.
babynames.describe()
                Year          Count
count  400762.000000  400762.000000
mean     1985.131287      79.953781
std        26.821004     295.414618
min      1910.000000       5.000000
25%      1968.000000       7.000000
50%      1991.000000      13.000000
75%      2007.000000      38.000000
max      2021.000000    8262.000000
A different set of statistics will be reported if .describe() is called on a Series.
babynames["Sex"].describe()
count 400762
unique 2
top F
freq 235791
Name: Sex, dtype: object
3.2.4 .sample()
As we will see later in the semester, random processes are at the heart of many data science techniques (for example, train-test splits, bootstrapping, and cross-validation). .sample() lets us quickly select random entries (a row if called from a DataFrame, or a value if called from a Series).
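For example, here is a quick sketch (the rows returned will differ from run to run):

# Select one random row of the DataFrame
babynames.sample()

# Select five random rows, sampling with replacement
babynames.sample(5, replace=True)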
3.2.5 .value_counts()
When we want to know the distribution of the items in a Series (for example, what items are most/least common), we use .value_counts() to get a breakdown of the unique values and their counts. In the example below, we can determine the name with the most years in which at least one person has taken that name by counting the number of times each name appears in the "Name" column of babynames.
babynames["Name"].value_counts().head()
Jean 221
Francis 219
Guadalupe 216
Jessie 215
Marion 213
Name: Name, dtype: int64
3.2.6 .unique()
If we have a Series with many repeated values, then .unique() can be used to identify only the unique values. Here we can get a list of all the names in babynames.
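# Return an array containing each unique name exactly once
babynames["Name"].unique()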
Exercise: what function can we call on the Series above to get the number of unique names?
3.2.7 .sort_values()
Ordering a DataFrame can be useful for isolating extreme values. For example, the first 5 entries of a column sorted in descending order (that is, from highest to lowest) are its 5 largest values. .sort_values allows us to order a DataFrame or Series by a specified rule. For DataFrames, we must specify the column by which we want to compare the rows; the function then returns the rows ordered by that column's values. We can choose to receive the rows in either ascending order (the default) or descending order.
We do not need to explicitly specify the column used for sorting when calling .sort_values() on a Series. We can still specify the ordering paradigm – that is, whether values are sorted in ascending or descending order.
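For example, here is a minimal sketch on the "Count" Series:

# Sort a Series directly; no column argument is needed
babynames["Count"].sort_values(ascending=False).head()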
Using .sort_values can be useful in many situations, but it may not cover all use cases. This is because pandas automatically sorts values according to numeric value (for number data) or alphabetical order (for string data). The following code finds the top 5 most popular names in California in 2021.
# Sort names by count in year 2021
# Recall that `.head(5)` displays the first five rows in the DataFrame
babynames[babynames["Year"] == 2021].sort_values("Count", ascending=False).head()
       State Sex  Year    Name  Count
397909    CA   M  2021    Noah   2591
397910    CA   M  2021    Liam   2469
232145    CA   F  2021  Olivia   2395
232146    CA   F  2021    Emma   2171
397911    CA   M  2021   Mateo   2108
This offers us a lot of functionality, but what if we need to sort by some other metric? For example, what if we wanted to find the longest names in the DataFrame?
We can do this by specifying the key parameter of .sort_values. The key parameter is assigned to a function of our choice. This function is then applied to each value in the specified column. pandas will, finally, sort the DataFrame by the values outputted by the function.
# Here, a lambda function is applied to find the length of each value, `x`, in the "Name" column
babynames.sort_values("Name", key=lambda x: x.str.len(), ascending=False).head(5)
       State Sex  Year             Name  Count
313143    CA   M  1989  Franciscojavier      6
333732    CA   M  1997  Ryanchristopher      5
330421    CA   M  1996  Franciscojavier      8
323615    CA   M  1993  Johnchristopher      5
310235    CA   M  1988  Franciscojavier     10
3.3 Adding and Removing Columns
To add a new column to a DataFrame, we use a syntax similar to that used when accessing an existing column. Specify the name of the new column by writing dataframe["new_column"], then assign this to a Series or Array containing the values that will populate this column.
# Add a column named "name_lengths" that includes the length of each name
babynames["name_lengths"] = babynames["Name"].str.len()
babynames.head(5)
  State Sex  Year      Name  Count  name_lengths
0    CA   F  1910      Mary    295             4
1    CA   F  1910     Helen    239             5
2    CA   F  1910   Dorothy    220             7
3    CA   F  1910  Margaret    163             8
4    CA   F  1910   Frances    134             7
# Sort by the temporary column
babynames = babynames.sort_values(by="name_lengths", ascending=False)
babynames.head()
       State Sex  Year             Name  Count  name_lengths
313143    CA   M  1989  Franciscojavier      6            15
333732    CA   M  1997  Ryanchristopher      5            15
330421    CA   M  1996  Franciscojavier      8            15
323615    CA   M  1993  Johnchristopher      5            15
310235    CA   M  1988  Franciscojavier     10            15
In the example above, we made use of a built-in function provided by the str accessor to get the length of each name. We then used the name_lengths column to sort the DataFrame. What if we had wanted to generate the values in our new column using a function of our own making?
We can do this using the Series .map method. .map takes in a function as input, and will apply this function to each value of a Series.
For example, say we wanted to find the number of occurrences of the sequence “dr” or “ea” in each name.
# First, define a function to count the number of times "dr" or "ea" appear in each name
def dr_ea_count(string):
    return string.count("dr") + string.count("ea")

# Then, use `map` to apply `dr_ea_count` to each name in the "Name" column
babynames["dr_ea_count"] = babynames["Name"].map(dr_ea_count)

# Sort the DataFrame by the new "dr_ea_count" column so we can see our handiwork
babynames.sort_values(by="dr_ea_count", ascending=False).head(5)
       State Sex  Year      Name  Count  name_lengths  dr_ea_count
101969    CA   F  1986  Deandrea      6             8            3
304390    CA   M  1985  Deandrea      6             8            3
131022    CA   F  1994  Leandrea      5             8            3
115950    CA   F  1990  Deandrea      5             8            3
108723    CA   F  1988  Deandrea      5             8            3
If we want to remove a column or row of a DataFrame, we can call the .drop method. Use the axis parameter to specify whether a column or row should be dropped. Unless otherwise specified, pandas will assume that we are dropping a row by default.
# Drop our "dr_ea_count" and "name_lengths" columns from the DataFrame
babynames = babynames.drop(["dr_ea_count", "name_lengths"], axis="columns")
babynames.head(5)
       State Sex  Year             Name  Count
313143    CA   M  1989  Franciscojavier      6
333732    CA   M  1997  Ryanchristopher      5
330421    CA   M  1996  Franciscojavier      8
323615    CA   M  1993  Johnchristopher      5
310235    CA   M  1988  Franciscojavier     10
Notice that we reassigned babynames to the result of babynames.drop(...). This is a subtle but important point: pandas table operations do not occur in-place. Calling dataframe.drop(...) will output a copy of dataframe with the row/column of interest removed, without modifying the original dataframe.
In other words, if we simply call:
# This creates a copy of `babynames` and removes the row with label 3...
babynames.drop(3, axis="rows")

# ...but the original `babynames` is unchanged!
# Notice that the row with label 3 is still present
babynames.head(5)
       State Sex  Year             Name  Count
313143    CA   M  1989  Franciscojavier      6
333732    CA   M  1997  Ryanchristopher      5
330421    CA   M  1996  Franciscojavier      8
323615    CA   M  1993  Johnchristopher      5
310235    CA   M  1988  Franciscojavier     10
3.4 Aggregating Data with GroupBy
Up until this point, we have been working with individual rows of DataFrames. As data scientists, we often wish to investigate trends across a larger subset of our data. For example, we may want to compute some summary statistic (the mean, median, sum, etc.) for a group of rows in our DataFrame. To do this, we’ll use pandas GroupBy objects.
Let’s say we wanted to aggregate all rows in babynames for a given year.
babynames.groupby("Year")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000022E30AA5880>
What does this strange output mean? Calling .groupby has generated a GroupBy object. You can imagine this as a set of “mini” sub-DataFrames, where each subframe contains all of the rows from babynames that correspond to a particular year.
The diagram below shows a simplified view of babynames to help illustrate this idea.
[Diagram: Creating a GroupBy object]
We can’t work with a GroupBy object directly – that is why you saw that strange output earlier, rather than a standard view of a DataFrame. To actually manipulate values within these “mini” DataFrames, we’ll need to call an aggregation method. This is a method that tells pandas how to aggregate the values within the GroupBy object. Once the aggregation is applied, pandas will return a normal (now grouped) DataFrame.
The first aggregation method we’ll consider is .agg. The .agg method takes in a function as its argument; this function is then applied to each column of a “mini” grouped DataFrame. We end up with a new DataFrame with one aggregated row per subframe. Let’s see this in action by finding the sum of all counts for each year in babynames – this is equivalent to finding the number of babies born in each year.
babynames.groupby("Year").agg(sum).head(5)
      Count
Year
1910   9163
1911   9983
1912  17946
1913  22094
1914  26926
We can relate this back to the diagram we used above. Remember that the diagram uses a simplified version of babynames, which is why we see smaller values for the summed counts.
[Diagram: Performing an aggregation]
Calling .agg has condensed each subframe back into a single row. This gives us our final output: a DataFrame that is now indexed by "Year", with a single row for each unique year in the original babynames DataFrame.
You may be wondering: where did the "State", "Sex", and "Name" columns go? Logically, it doesn’t make sense to sum the string data in these columns (how would we add “Mary” + “Ann”?). Because of this, pandas will simply omit these columns when it performs the aggregation on the DataFrame. Since this happens implicitly, without the user specifying that these columns should be ignored, it’s easy to run into troubling situations where columns are removed without the programmer noticing. It is better coding practice to select only the columns we care about before performing the aggregation.
# Same result, but now we explicitly tell pandas to only consider the "Count" column when summing
babynames.groupby("Year")[["Count"]].agg(sum).head(5)
      Count
Year
1910   9163
1911   9983
1912  17946
1913  22094
1914  26926
3.4.1 Parting note
Manipulating DataFrames is a skill that is not mastered in just one day. Due to the flexibility of pandas, there are many different ways to get from point A to point B. We recommend trying multiple different ways to solve the same problem to gain even more practice and reach that point of mastery sooner.
Next, we will start digging deeper into the mechanics behind grouping data.