In this notebook we review some of the key advanced Pandas operations:
- groupby: grouping collections of records that share the same value for one set of fields and then computing aggregate statistics over the remaining fields
- pivot: similar to groupby, except the results are presented slightly differently
- merge: joining data from a pair of dataframes into a single dataframe
To illustrate these operations we will use some toy data about people's favorite colors and numbers. To protect people's identities, their favorite numbers and colors are fictional.
import pandas as pd
people = pd.DataFrame(
[["Joey", "blue", 42, "M"],
["Weiwei", "blue", 50, "F"],
["Joey", "green", 8, "M"],
["Karina", "green", 7, "F"],
["Fernando", "pink", -9, "M"],
["Nhi", "blue", 3, "F"],
["Sam", "pink", -42, "M"]],
columns = ["Name", "Color", "Number", "Sex"])
people
The groupby operator groups the rows of the table that share the same value in one or more columns.
grps = people.groupby("Color")
grps
grps.size()
grps.apply(lambda df: display(df))
people.loc[grps.indices["blue"],:]
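A GroupBy object can also be iterated directly, yielding (group key, sub-DataFrame) pairs. A minimal sketch on a small illustrative frame (the `df` name and values are hypothetical, not the people table above):

```python
import pandas as pd

# A tiny frame in the same shape as the people table (values are illustrative).
df = pd.DataFrame(
    [["Joey", "blue", 42],
     ["Karina", "green", 7],
     ["Nhi", "blue", 3]],
    columns=["Name", "Color", "Number"])

# Iterating a GroupBy yields (group key, sub-DataFrame) pairs.
for color, group in df.groupby("Color"):
    print(color, "->", list(group["Name"]))
```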
We will commonly combine groupby with column selection (e.g., df.groupby("Region")["Sales"]) followed by an aggregate calculation on that column:
people.groupby("Color")["Number"].median()
people.groupby("Color")["Number"].mean()
people.groupby("Color")["Number"].count()
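The selected column also supports .agg with a list of function names, producing one result column per aggregate. A sketch on a hypothetical three-row frame (`df` and `stats` are illustrative names):

```python
import pandas as pd

# Illustrative data; the column names mirror the people table.
df = pd.DataFrame({"Color": ["blue", "blue", "pink"],
                   "Number": [42, 3, -9]})

# .agg with a list of names returns one column per aggregate.
stats = df.groupby("Color")["Number"].agg(["median", "mean", "count"])
print(stats)
```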
Remember, we can group by one or more columns:
people.groupby(["Color", "Sex"])['Number'].count()
people.groupby(["Color", "Sex"])[['Name','Number']].count()
import numpy as np
def avg_str_len(series):
    return series.str.len().mean()
res = (
    people
    .groupby(["Color", "Sex"])
    .aggregate({"Name": avg_str_len, "Number": np.mean})
)
res
Notice that the groupby operation creates an index based on the grouping columns.
res.loc[('blue', 'F'), :]
res.loc[['green'], :]
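Another way to slice a MultiIndex result is .xs, which takes a cross-section at a named level. A sketch using an illustrative grouped sum (the names `df`, `grouped`, and `females` are hypothetical):

```python
import pandas as pd

# Illustrative data producing a (Color, Sex) MultiIndex like res above.
df = pd.DataFrame(
    [["blue", "F", 2], ["blue", "M", 1], ["green", "F", 1]],
    columns=["Color", "Sex", "Number"])
grouped = df.groupby(["Color", "Sex"])["Number"].sum()

# Cross-section: every row whose Sex level equals "F".
females = grouped.xs("F", level="Sex")
print(females)
```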
In some cases we might want to leave the grouping fields as columns:
(
    people
    .groupby(["Color", "Sex"], as_index=False)
    .aggregate({"Name": "first", "Number": np.mean})
)
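Equivalently, we can group with the default index and call reset_index() afterwards, which turns the grouping fields back into ordinary columns. A sketch on illustrative data (`df` and `flat` are hypothetical names):

```python
import pandas as pd

# Illustrative data with the same columns as the people table.
df = pd.DataFrame({"Color": ["blue", "blue", "pink"],
                   "Sex": ["M", "F", "M"],
                   "Number": [42, 3, -9]})

# Group with the default index, then flatten it back into columns.
flat = df.groupby(["Color", "Sex"])["Number"].mean().reset_index()
print(flat)
```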
Pivot is used to examine aggregates with respect to two characteristics. You might construct a pivot of sales data if you wanted to look at average sales broken down by year and market.
The pivot operation is essentially a groupby operation that transforms the rows and the columns. For example, consider the following groupby operation:
people.groupby(["Color", "Sex"])['Number'].count()
We can use pivot to compute the same result, displayed slightly differently:
people.pivot_table(
    values = "Number",  # the entry to aggregate over
    index = "Color",    # the row grouping attributes
    columns = "Sex",    # the column grouping attributes
    aggfunc = "count"   # the aggregation function
)
Notice that:
- The second grouping field (Sex) has been "pivoted" from the rows to the columns.
- There is a missing value at pink and F, since none of the females chose pink as their favorite color.

We can specify how missing values are filled in:
people.pivot_table(
    values = "Number",
    index = "Color",
    columns = "Sex",
    aggfunc = "count",
    fill_value = 0.0
)
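The equivalence between pivot_table and groupby can be made explicit with unstack, which moves an index level into the columns. A sketch on illustrative data (`via_groupby` and `via_pivot` are hypothetical names):

```python
import pandas as pd

# Illustrative data with the same columns as the people table.
df = pd.DataFrame({"Color": ["blue", "blue", "pink"],
                   "Sex": ["M", "F", "M"],
                   "Number": [42, 3, -9]})

# groupby + unstack: count per (Color, Sex), then move Sex into the columns.
via_groupby = df.groupby(["Color", "Sex"])["Number"].count().unstack("Sex")
via_pivot = df.pivot_table(values="Number", index="Color",
                           columns="Sex", aggfunc="count")
print(via_groupby)
```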
The merge operation combines data from two dataframes into one dataframe. The merge operation in Pandas behaves like a join operation in SQL (we will cover SQL joins later in the semester). Unfortunately, Pandas also offers a join function, which is a limited version of merge.
Suppose I also have a list of email addresses that I would like to combine with my people dataframe from above:
email = pd.DataFrame(
[["Deb", "deborah_nolan@berkeley.edu"],
["Sam", "samlau95@berkeley.edu"],
["John", "doe@nope.com"],
["Joey", "jegonzal@cs.berkeley.edu"],
["Weiwei", "weiwzhang@berkeley.edu"],
["Weiwei", "weiwzhang+123@berkeley.edu"],
["Karina", "kgoot@berkeley.edu"]],
columns = ["User Name", "Email"])
email
I can use the merge function to combine these two tables:
people.merge(email,
how = "inner",
left_on = "Name", right_on = "User Name")
Notice that:
- Fernando and Deb are missing from the output: Fernando didn't have an email address and Deb didn't have a color preference, so neither had a match in the inner join.
- Joey occurred twice in the people table and therefore shows up twice in the output.
- Weiwei occurred twice in the email table and appears twice in the output.

We could group by name (or by email) and take only the first:
(
    people
    .merge(email,
           how = "inner",
           left_on = "Name", right_on = "User Name")
    .groupby('Name').first()
)
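Alternatively, we can de-duplicate the join keys before merging with drop_duplicates. A sketch that rebuilds just the duplicated fragment of the email table (the `unique_email` name is illustrative):

```python
import pandas as pd

# Just the duplicated fragment of the email table above.
email = pd.DataFrame(
    [["Weiwei", "weiwzhang@berkeley.edu"],
     ["Weiwei", "weiwzhang+123@berkeley.edu"],
     ["Karina", "kgoot@berkeley.edu"]],
    columns=["User Name", "Email"])

# Keep only the first email for each user before joining.
unique_email = email.drop_duplicates(subset="User Name", keep="first")
print(unique_email)
```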
The above join was an inner join. What if we wanted to keep all of the people and leave the email field missing when a person's email address is not present? That is a left join. Similarly, a right join keeps all of the email records, and an outer join keeps all records from both tables:
people.merge(email,
how = "left",
left_on = "Name", right_on = "User Name")
people.merge(email,
how = "right",
left_on = "Name", right_on = "User Name")
people.merge(email,
how = "outer",
left_on = "Name", right_on = "User Name")
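When debugging joins, merge's indicator=True option adds a _merge column recording whether each row matched in the left table, the right table, or both. A sketch with toy names and placeholder addresses (the `left`/`right` frames are illustrative, not the full tables above):

```python
import pandas as pd

# Toy tables; the addresses are placeholders.
left = pd.DataFrame({"Name": ["Joey", "Fernando"], "Number": [42, -9]})
right = pd.DataFrame({"User Name": ["Joey", "Deb"],
                      "Email": ["a@example.com", "b@example.com"]})

# indicator=True adds a _merge column: left_only, right_only, or both.
merged = left.merge(right, how="outer",
                    left_on="Name", right_on="User Name",
                    indicator=True)
print(merged[["Name", "User Name", "_merge"]])
```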
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
plt.style.use('fivethirtyeight')
sns.set_context("notebook")
The following function is a useful helper for downloading and caching data.
def fetch_and_cache(data_url, file, data_dir="data", force=False):
    """
    Download and cache a url and return the file object.
    data_url: the web address to download
    file: the file in which to save the results.
    data_dir: (default="data") the location to save the data
    force: if true the file is always re-downloaded
    return: The pathlib.Path object representing the file.
    """
    import requests
    from pathlib import Path
    data_dir = Path(data_dir)
    data_dir.mkdir(exist_ok=True)
    file_path = data_dir / Path(file)
    if force and file_path.exists():
        file_path.unlink()
    if force or not file_path.exists():
        print('Downloading...', end=' ')
        resp = requests.get(data_url)
        with file_path.open('wb') as f:
            f.write(resp.content)
        print('Done!')
    else:
        import time
        birth_time = time.ctime(file_path.stat().st_ctime)
        print("Using cached version downloaded:", birth_time)
    return file_path
data_url = 'https://www.ssa.gov/oact/babynames/state/namesbystate.zip'
namesbystate_path = fetch_and_cache(data_url, 'namesbystate.zip')
import zipfile
zf = zipfile.ZipFile(namesbystate_path, 'r')
field_names = ['State', 'Sex', 'Year', 'Name', 'Count']
def load_dataframe_from_zip(zf, f):
    with zf.open(f) as fh:
        return pd.read_csv(fh, header=None, names=field_names)
states = [
    load_dataframe_from_zip(zf, f)
    for f in sorted(zf.filelist, key=lambda x: x.filename)
    if f.filename.endswith('.TXT')
]
baby_names = pd.concat(states).reset_index(drop=True)
baby_names['Last Letter'] = baby_names['Name'].str[-1]
baby_names.head()
baby_names.size
We can use the groupby operation to determine the total number of registered babies whose names end in each letter:
last_letter_totals = baby_names.groupby('Last Letter')['Count'].sum()
last_letter_totals
last_letter_totals.plot.bar(figsize=(10, 10))
We can use the pivot operation to break the last letter of each name down by birth sex:
last_letter_pivot = baby_names.pivot_table(
    values='Count',       # the field to be aggregated in each group
    index='Last Letter',  # the rows (turned into the index)
    columns='Sex',        # the column grouping attribute
    aggfunc='sum',        # the aggregation function
)
last_letter_pivot.head()
last_letter_pivot.plot.bar(figsize=(10, 10));
These are the total counts. We might instead be interested in the proportion of male and female babies among the names ending in each letter.
prop_last_letter_pivot = last_letter_pivot.div(last_letter_totals, axis=0)
prop_last_letter_pivot
prop_last_letter_pivot.plot.bar(figsize=(10, 10))
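As a sanity check, each row of the proportion table should sum to 1, since we divided each row of counts by that row's total. A sketch of the same division on toy counts (the `counts`/`props` names and values are illustrative):

```python
import pandas as pd

# Toy counts: two last letters, broken down by sex.
counts = pd.DataFrame({"F": [10, 0], "M": [30, 5]},
                      index=pd.Index(["a", "b"], name="Last Letter"))
totals = counts.sum(axis=1)

# Divide each row by its total: every row of proportions sums to 1.
props = counts.div(totals, axis=0)
print(props.sum(axis=1))
```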
If we display the bars in order of the proportion of males we get a much clearer picture:
(
    prop_last_letter_pivot
    .sort_values("M")
    .plot.bar(figsize=(10, 10))
)