Advanced Pandas Operations¶

In this notebook we review some of the key advanced Pandas operations.

groupby: group records that share the same value for a set of fields, then compute aggregate statistics over the remaining fields
pivot: similar to groupby except the results are presented slightly differently ...
merge: join data from a pair of dataframes into a single dataframe.

To illustrate these operations we will use some toy data about people's favorite colors and numbers. To protect people's identities, their favorite numbers and colors are fictional.

import pandas as pd

people = pd.DataFrame(
    [["Joey",      "blue",    42,  "M"],
     ["Weiwei",    "blue",    50,  "F"],
     ["Joey",      "green",    8,  "M"],
     ["Karina",    "green",    7,  "F"],
     ["Fernando",  "pink",    -9,  "M"],
     ["Nhi",       "blue",     3,  "F"],
     ["Sam",       "pink",   -42,  "M"]], 
    columns = ["Name", "Color", "Number", "Sex"])
people

Groupby¶

The groupby operator groups together the rows of a table that share the same value in one or more columns.

grps = people.groupby("Color")
grps

<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x10ec9bf60>

grps.size()

Color
blue     3
green    2
pink     2
dtype: int64

grps.apply(lambda df: display(df))

people.loc[grps.indices["blue"],:]
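As a minimal sketch of other ways to look inside the groups, the snippet below iterates over the grouped object and pulls out a single group with get_group. The people table is re-created so the snippet runs on its own; the variable names group_names and blue are only illustrative.

```python
import pandas as pd

# Same toy data as above, repeated so this cell is self-contained.
people = pd.DataFrame(
    [["Joey",      "blue",    42,  "M"],
     ["Weiwei",    "blue",    50,  "F"],
     ["Joey",      "green",    8,  "M"],
     ["Karina",    "green",    7,  "F"],
     ["Fernando",  "pink",    -9,  "M"],
     ["Nhi",       "blue",     3,  "F"],
     ["Sam",       "pink",   -42,  "M"]],
    columns=["Name", "Color", "Number", "Sex"])

grps = people.groupby("Color")

# Iterating over a GroupBy object yields (key, sub-dataframe) pairs.
group_names = {color: list(group["Name"]) for color, group in grps}

# get_group returns the rows that belong to a single key.
blue = grps.get_group("blue")

# grps.indices holds row positions, so iloc is the position-based
# counterpart of the loc lookup shown above (they agree here because
# the index is the default 0..n-1 range).
blue_by_position = people.iloc[grps.indices["blue"]]
```

With a non-default index, the positions in grps.indices would no longer match the index labels, which is why get_group or iloc may be the safer choice.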

We will commonly combine groupby with column selection (e.g., df.groupby("Region")["Sales"]) and then apply an aggregate calculation to that column:

people.groupby("Color")["Number"].median()

Color
blue     42.0
green     7.5
pink    -25.5
Name: Number, dtype: float64

people.groupby("Color")["Number"].mean()

Color
blue     31.666667
green     7.500000
pink    -25.500000
Name: Number, dtype: float64

people.groupby("Color")["Number"].count()

Color
blue     3
green    2
pink     2
Name: Number, dtype: int64

Remember that we can group by one or more columns:

people.groupby(["Color", "Sex"])['Number'].count()

Color  Sex
blue   F      2
       M      1
green  F      1
       M      1
pink   M      2
Name: Number, dtype: int64

people.groupby(["Color", "Sex"])[['Name','Number']].count()

import numpy as np

def avg_str_len(series):
    return series.str.len().mean()

res = (
    people
        .groupby(["Color", "Sex"])
        .aggregate({"Name": avg_str_len, "Number": "mean"})  # the string "mean" avoids the warning recent pandas emits for np.mean
)

res
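The same per-column aggregation can also be written with named aggregation, which lets us choose the output column names. This is a sketch rather than the notebook's own method; the output names avg_name_len and mean_number are hypothetical, and the people table is re-created so the snippet runs on its own.

```python
import pandas as pd

# Same toy data as above, repeated so this cell is self-contained.
people = pd.DataFrame(
    [["Joey",      "blue",    42,  "M"],
     ["Weiwei",    "blue",    50,  "F"],
     ["Joey",      "green",    8,  "M"],
     ["Karina",    "green",    7,  "F"],
     ["Fernando",  "pink",    -9,  "M"],
     ["Nhi",       "blue",     3,  "F"],
     ["Sam",       "pink",   -42,  "M"]],
    columns=["Name", "Color", "Number", "Sex"])

res_named = (
    people
        .groupby(["Color", "Sex"])
        .agg(
            # output_name=(input_column, aggregation)
            avg_name_len=("Name", lambda s: s.str.len().mean()),
            mean_number=("Number", "mean"),
        )
)
```

Renaming the outputs makes it clearer that the Name column now holds an average string length rather than a name.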

Grouping and Indexes¶

Notice that the groupby operation creates an index based on the grouping columns.

res.loc[('blue', 'F'), :]  # a tuple selects one (Color, Sex) pair; a list would be read as several Color labels

res.loc[['green'], :]

In some cases we might want to leave the grouping fields as columns:

(
    people
        .groupby(["Color", "Sex"], as_index=False)
        .aggregate({"Name": "first", "Number": "mean"})
)
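Another way to get the grouping fields back as ordinary columns is to call reset_index on the aggregated result. The sketch below re-creates the people table so it runs on its own; the name flat is only illustrative.

```python
import pandas as pd

# Same toy data as above, repeated so this cell is self-contained.
people = pd.DataFrame(
    [["Joey",      "blue",    42,  "M"],
     ["Weiwei",    "blue",    50,  "F"],
     ["Joey",      "green",    8,  "M"],
     ["Karina",    "green",    7,  "F"],
     ["Fernando",  "pink",    -9,  "M"],
     ["Nhi",       "blue",     3,  "F"],
     ["Sam",       "pink",   -42,  "M"]],
    columns=["Name", "Color", "Number", "Sex"])

flat = (
    people
        .groupby(["Color", "Sex"])
        .agg({"Name": "first", "Number": "mean"})
        .reset_index()   # move Color and Sex from the index back into columns
)
```

This is handy when the aggregate was computed earlier with the default index and we only later decide that we want flat columns.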

Pivot¶

Pivot is used to examine aggregates with respect to two characteristics. You might construct a pivot of sales data if you wanted to look at average sales broken down by year and market.

The pivot operation is essentially a groupby operation that rearranges the result so that one grouping field labels the rows and the other labels the columns. For example, consider the following groupby operation:

people.groupby(["Color", "Sex"])['Number'].count()

Color  Sex
blue   F      2
       M      1
green  F      1
       M      1
pink   M      2
Name: Number, dtype: int64

We can use pivot_table to compute the same result but displayed slightly differently:

people.pivot_table(
    values  = "Number", # the entry to aggregate over
    index   = "Color",  # the row grouping attributes
    columns = "Sex",    # the column grouping attributes
    aggfunc = "count"   # the aggregation function
)

Notice that:

The second "grouping" column (Sex) has been "pivoted" from the rows to the columns.
There is a missing value (NaN) for pink and F since none of the females chose pink as their favorite color.
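To make the link between the two operations concrete, here is a small sketch that reproduces the pivot table from the groupby result by unstacking the Sex level of the index. The people table is re-created so the snippet runs on its own; counts, wide, and pivoted are illustrative names.

```python
import pandas as pd

# Same toy data as above, repeated so this cell is self-contained.
people = pd.DataFrame(
    [["Joey",      "blue",    42,  "M"],
     ["Weiwei",    "blue",    50,  "F"],
     ["Joey",      "green",    8,  "M"],
     ["Karina",    "green",    7,  "F"],
     ["Fernando",  "pink",    -9,  "M"],
     ["Nhi",       "blue",     3,  "F"],
     ["Sam",       "pink",   -42,  "M"]],
    columns=["Name", "Color", "Number", "Sex"])

# Long form: one row per (Color, Sex) pair that actually occurs.
counts = people.groupby(["Color", "Sex"])["Number"].count()

# Wide form: move the Sex level of the index into the columns.
wide = counts.unstack("Sex")

# The pivot_table call from above, for comparison.
pivoted = people.pivot_table(
    values="Number", index="Color", columns="Sex", aggfunc="count")
```

The (pink, F) pair never occurs in the long form, so unstacking has to fill that cell with NaN, just as pivot_table does.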

We can specify how missing values are filled in:

people.pivot_table(
    values  = "Number",
    index   = "Color",
    columns = "Sex",
    aggfunc = "count",
    fill_value = 0.0
)

Merging (joining)¶

The merge operation combines data from two dataframes into one dataframe. The merge operation in Pandas behaves like a join operation in SQL (we will cover SQL joins later in the semester). Unfortunately, Pandas also offers a join function, which is a more limited version of merge.

Suppose I also have a list of email addresses that I would like to combine with my people dataframe from above:

email = pd.DataFrame(
    [["Deb",  "deborah_nolan@berkeley.edu"],
     ["Sam",  "samlau95@berkeley.edu"],
     ["John", "doe@nope.com"],
     ["Joey", "jegonzal@cs.berkeley.edu"],
     ["Weiwei", "weiwzhang@berkeley.edu"],
     ["Weiwei", "weiwzhang+123@berkeley.edu"],
     ["Karina", "kgoot@berkeley.edu"]], 
    columns = ["User Name", "Email"])
email

I can use the merge function to combine these two tables:

people.merge(email, 
            how = "inner",
            left_on = "Name", right_on = "User Name")

Notice that:

The output dataframe only contains rows whose names appear in both tables. For example, Fernando didn't have an email address and Deb didn't have a color preference.
The name Joey occurred twice in the people table and shows up twice in the output.
The name Weiwei occurred twice in the email table and appears twice in the output.
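Repeated keys like these can be spotted before merging. The sketch below lists the duplicated names in each table and then uses the validate argument of merge, which raises a MergeError when the keys are not unique in the way we claim. Both tables are re-created so the snippet runs on its own; dup_people, dup_email, and keys_are_unique are illustrative names.

```python
import pandas as pd

# Same toy data as above, repeated so this cell is self-contained.
people = pd.DataFrame(
    [["Joey",      "blue",    42,  "M"],
     ["Weiwei",    "blue",    50,  "F"],
     ["Joey",      "green",    8,  "M"],
     ["Karina",    "green",    7,  "F"],
     ["Fernando",  "pink",    -9,  "M"],
     ["Nhi",       "blue",     3,  "F"],
     ["Sam",       "pink",   -42,  "M"]],
    columns=["Name", "Color", "Number", "Sex"])

email = pd.DataFrame(
    [["Deb",  "deborah_nolan@berkeley.edu"],
     ["Sam",  "samlau95@berkeley.edu"],
     ["John", "doe@nope.com"],
     ["Joey", "jegonzal@cs.berkeley.edu"],
     ["Weiwei", "weiwzhang@berkeley.edu"],
     ["Weiwei", "weiwzhang+123@berkeley.edu"],
     ["Karina", "kgoot@berkeley.edu"]],
    columns=["User Name", "Email"])

# Names that appear more than once in each table.
dup_people = people["Name"][people["Name"].duplicated()].unique()
dup_email = email["User Name"][email["User Name"].duplicated()].unique()

# Ask merge to check that the keys are unique on both sides.
try:
    people.merge(email, left_on="Name", right_on="User Name",
                 validate="one_to_one")
    keys_are_unique = True
except pd.errors.MergeError:
    keys_are_unique = False
```

Here the check fails, which is the signal that the merged table will contain repeated rows for some names.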

How could we fix the duplicate entries?¶

We could group by name (or by email) and take only the first row in each group:

(
    people
        .merge(email, 
            how = "inner",
            left_on = "Name", right_on = "User Name")
        .groupby('Name').first()
)
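Grouping after the merge keeps only one row per name, so Joey's second color is dropped along with Weiwei's second email. If the goal is only to avoid repeated email addresses, one alternative (a sketch, not the notebook's method) is to drop duplicate user names from the email table before merging. Both tables are re-created so the snippet runs on its own; email_unique and merged are illustrative names.

```python
import pandas as pd

# Same toy data as above, repeated so this cell is self-contained.
people = pd.DataFrame(
    [["Joey",      "blue",    42,  "M"],
     ["Weiwei",    "blue",    50,  "F"],
     ["Joey",      "green",    8,  "M"],
     ["Karina",    "green",    7,  "F"],
     ["Fernando",  "pink",    -9,  "M"],
     ["Nhi",       "blue",     3,  "F"],
     ["Sam",       "pink",   -42,  "M"]],
    columns=["Name", "Color", "Number", "Sex"])

email = pd.DataFrame(
    [["Deb",  "deborah_nolan@berkeley.edu"],
     ["Sam",  "samlau95@berkeley.edu"],
     ["John", "doe@nope.com"],
     ["Joey", "jegonzal@cs.berkeley.edu"],
     ["Weiwei", "weiwzhang@berkeley.edu"],
     ["Weiwei", "weiwzhang+123@berkeley.edu"],
     ["Karina", "kgoot@berkeley.edu"]],
    columns=["User Name", "Email"])

# Keep only the first email address listed for each user name.
email_unique = email.drop_duplicates(subset="User Name", keep="first")

# Every row of people now matches at most one email row.
merged = people.merge(email_unique,
                      how="inner",
                      left_on="Name", right_on="User Name")
```

Which approach is appropriate depends on whether the two Joey rows describe one person with two favorite colors or two different people.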

Left Joins¶

The above join was an inner join. What if we wanted to keep all of the people and leave the email address field missing (NaN) when a person's email address is not present?

people.merge(email, 
            how = "left",
            left_on = "Name", right_on = "User Name")

Right Joins¶

people.merge(email, 
            how = "right",
            left_on = "Name", right_on = "User Name")

Outer Joins¶

people.merge(email, 
            how = "outer",
            left_on = "Name", right_on = "User Name")
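To see which table each row of an outer join came from, merge accepts indicator=True, which adds a _merge column with the values left_only, right_only, or both. The sketch below re-creates both tables so it runs on its own; outer and sources are illustrative names.

```python
import pandas as pd

# Same toy data as above, repeated so this cell is self-contained.
people = pd.DataFrame(
    [["Joey",      "blue",    42,  "M"],
     ["Weiwei",    "blue",    50,  "F"],
     ["Joey",      "green",    8,  "M"],
     ["Karina",    "green",    7,  "F"],
     ["Fernando",  "pink",    -9,  "M"],
     ["Nhi",       "blue",     3,  "F"],
     ["Sam",       "pink",   -42,  "M"]],
    columns=["Name", "Color", "Number", "Sex"])

email = pd.DataFrame(
    [["Deb",  "deborah_nolan@berkeley.edu"],
     ["Sam",  "samlau95@berkeley.edu"],
     ["John", "doe@nope.com"],
     ["Joey", "jegonzal@cs.berkeley.edu"],
     ["Weiwei", "weiwzhang@berkeley.edu"],
     ["Weiwei", "weiwzhang+123@berkeley.edu"],
     ["Karina", "kgoot@berkeley.edu"]],
    columns=["User Name", "Email"])

outer = people.merge(email,
                     how="outer",
                     left_on="Name", right_on="User Name",
                     indicator=True)   # adds a "_merge" column

# How many rows matched in both tables, or in only one of them.
sources = outer["_merge"].value_counts()
```

The rows marked both are exactly those an inner join returns; a left join adds the left_only rows (Fernando and Nhi), and a right join adds the right_only rows (Deb and John).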

	Name	Color	Number	Sex
0	Joey	blue	42	M
1	Weiwei	blue	50	F
2	Joey	green	8	M
3	Karina	green	7	F
4	Fernando	pink	-9	M
5	Nhi	blue	3	F
6	Sam	pink	-42	M

		Name	Number
Color	Sex
blue	F	4.5	26.5
blue	M	4.0	42.0
green	F	6.0	7.0
green	M	4.0	8.0
pink	M	5.5	-25.5

	User Name	Email
0	Deb	deborah_nolan@berkeley.edu
1	Sam	samlau95@berkeley.edu
2	John	doe@nope.com
3	Joey	jegonzal@cs.berkeley.edu
4	Weiwei	weiwzhang@berkeley.edu
5	Weiwei	weiwzhang+123@berkeley.edu
6	Karina	kgoot@berkeley.edu

	Color	Number	Sex	User Name	Email
Name
Joey	blue	42	M	Joey	jegonzal@cs.berkeley.edu
Karina	green	7	F	Karina	kgoot@berkeley.edu
Sam	pink	-42	M	Sam	samlau95@berkeley.edu
Weiwei	blue	50	F	Weiwei	weiwzhang@berkeley.edu

Sex	F	M
Color
blue	2.0	1.0
green	1.0	1.0
pink	NaN	2.0