In this notebook we review some of the key advanced Pandas operations.
groupby: grouping collections of records that share the same value for one set of fields and then computing aggregate statistic over the remaining fields
pivot: similar to groupby except the results are presented slightly differently ...
merge: join data from a pair of dataframes into a single dataframe.
To illustrate these operations we will use some toy data about peoples favorite colors and numbers. To protect peoples identities their favorite numbers and colors are fictional.
import pandas as pd
people = pd.DataFrame(
[["Joey", "blue", 42, "M"],
["Weiwei", "blue", 50, "F"],
["Joey", "green", 8, "M"],
["Karina", "green", 7, "F"],
["Fernando", "pink", -9, "M"],
["Nhi", "blue", 3, "F"],
["Sam", "pink", -42, "M"]],
columns = ["Name", "Color", "Number", "Sex"])
people
The groupby
operator groups rows in the table that are the same in one or more columns.
grps = people.groupby("Color")
grps
grps.size()
grps.apply(lambda df: display(df))
people.loc[grps.indices["blue"],:]
We will commonly combine groupby
with column selection (e.g., df.groupby("Region")["Sales"]
) and then finally adding some aggregate calculation on that column:
people.groupby("Color")["Number"].median()
people.groupby("Color")["Number"].mean()
people.groupby("Color")["Number"].count()
Remember we can group by one or more columns
people.groupby(["Color", "Sex"])['Number'].count()
people.groupby(["Color", "Sex"])[['Name','Number']].count()
import numpy as np
def avg_str_len(series):
return series.str.len().mean()
res = (
people
.groupby(["Color", "Sex"])
.aggregate({"Name": avg_str_len, "Number": np.mean})
)
res
Notice that the groupby
operation creates an index based on the grouping columns.
res.loc[['blue','F'], :]
res.loc[['green'], :]
In some cases we might want to leave the grouping fields as columns:
(
people
.groupby(["Color", "Sex"], as_index=False)
.aggregate({"Name": "first", "Number": np.mean})
)
Pivot is used to examine aggregates with respect to two characteristics. You might construct a pivot of sales data if you wanted to look at average sales broken down by year and market.
The pivot operation is essentially a groupby
operation that transforms the rows and the columns. For example consider the following groupby operation:
people.groupby(["Color", "Sex"])['Number'].count()
We can use pivot
to compute the same result but displayed slightly differently:
people.pivot_table(
values = "Number", # the entry to aggregate over
index = "Color", # the row grouping attributes
columns = "Sex", # the column grouping attributes
aggfunc = "count" # the aggregation function
)
Notice that:
Sex
) has been "pivoted" from the rows to column location. pink
and F
since none of the females chose pink
as their favorite color.We can specify how missing values are filled in:
people.pivot_table(
values = "Number",
index = "Color",
columns = "Sex",
aggfunc = "count",
fill_value = 0.0
)
The merge operation combines data from two dataframes into one dataframe. The merge operation in Pandas behaves like a join operation in SQL (we will cover SQL joins later in the semester). Unfortunately, Pandas also offers a join
function which is a limited version of merge
.
Suppose I also have a list of email addresses that I would like to combine with my people
dataframe from above
email = pd.DataFrame(
[["Deb", "deborah_nolan@berkeley.edu"],
["Sam", "samlau95@berkeley.edu"],
["John", "doe@nope.com"],
["Joey", "jegonzal@cs.berkeley.edu"],
["Weiwei", "weiwzhang@berkeley.edu"],
["Weiwei", "weiwzhang+123@berkeley.edu"],
["Karina", "kgoot@berkeley.edu"]],
columns = ["User Name", "Email"])
email
I can use the merge function to combine these two tables:
people.merge(email,
how = "inner",
left_on = "Name", right_on = "User Name")
Notice that:
Fernando
didn't have an email address and Deb
didn't have a color preference.Joey
occurred twice in the people table and shows up twice in the output. Weiwei
occurred twice in the email table and appears twice in the output.We could group by name (or by email) and take only the first:
(
people
.merge(email,
how = "inner",
left_on = "Name", right_on = "User Name")
.groupby('Name').first()
)
The above join was an inner join. What if we wanted to keep all of the people and leave missing in the email address field when their email addresses are not present.
people.merge(email,
how = "left",
left_on = "Name", right_on = "User Name")
people.merge(email,
how = "right",
left_on = "Name", right_on = "User Name")
people.merge(email,
how = "outer",
left_on = "Name", right_on = "User Name")