Advanced Pandas Operations

In this notebook we review some of the key advanced Pandas operations.

  • groupby: grouping collections of records that share the same value for one set of fields and then computing aggregate statistic over the remaining fields

  • pivot: similar to groupby except the results are presented slightly differently ...

  • merge: join data from a pair of dataframes into a single dataframe.

To illustrate these operations we will use some toy data about peoples favorite colors and numbers. To protect peoples identities their favorite numbers and colors are fictional.

In [1]:
import pandas as pd

people = pd.DataFrame(
    [["Joey",      "blue",    42,  "M"],
     ["Weiwei",    "blue",    50,  "F"],
     ["Joey",      "green",    8,  "M"],
     ["Karina",    "green",    7,  "F"],
     ["Fernando",  "pink",    -9,  "M"],
     ["Nhi",       "blue",     3,  "F"],
     ["Sam",       "pink",   -42,  "M"]], 
    columns = ["Name", "Color", "Number", "Sex"])
people
Out[1]:
Name Color Number Sex
0 Joey blue 42 M
1 Weiwei blue 50 F
2 Joey green 8 M
3 Karina green 7 F
4 Fernando pink -9 M
5 Nhi blue 3 F
6 Sam pink -42 M

Groupby

The groupby operator groups rows in the table that are the same in one or more columns.

In [2]:
grps = people.groupby("Color")
grps
Out[2]:
<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x10ec9bf60>
In [3]:
grps.size()
Out[3]:
Color
blue     3
green    2
pink     2
dtype: int64
In [4]:
grps.apply(lambda df: display(df))
Name Color Number Sex
0 Joey blue 42 M
1 Weiwei blue 50 F
5 Nhi blue 3 F
Name Color Number Sex
0 Joey blue 42 M
1 Weiwei blue 50 F
5 Nhi blue 3 F
Name Color Number Sex
2 Joey green 8 M
3 Karina green 7 F
Name Color Number Sex
4 Fernando pink -9 M
6 Sam pink -42 M
Out[4]:
In [5]:
people.loc[grps.indices["blue"],:]
Out[5]:
Name Color Number Sex
0 Joey blue 42 M
1 Weiwei blue 50 F
5 Nhi blue 3 F

We will commonly combine groupby with column selection (e.g., df.groupby("Region")["Sales"]) and then finally adding some aggregate calculation on that column:

In [6]:
people.groupby("Color")["Number"].median()
Out[6]:
Color
blue     42.0
green     7.5
pink    -25.5
Name: Number, dtype: float64
In [7]:
people.groupby("Color")["Number"].mean()
Out[7]:
Color
blue     31.666667
green     7.500000
pink    -25.500000
Name: Number, dtype: float64
In [8]:
people.groupby("Color")["Number"].count()
Out[8]:
Color
blue     3
green    2
pink     2
Name: Number, dtype: int64

Remember we can group by one or more columns

In [9]:
people.groupby(["Color", "Sex"])['Number'].count()
Out[9]:
Color  Sex
blue   F      2
       M      1
green  F      1
       M      1
pink   M      2
Name: Number, dtype: int64
In [10]:
people.groupby(["Color", "Sex"])[['Name','Number']].count()
Out[10]:
Name Number
Color Sex
blue F 2 2
M 1 1
green F 1 1
M 1 1
pink M 2 2
In [11]:
import numpy as np

def avg_str_len(series):
    return series.str.len().mean()

res = (
    people
        .groupby(["Color", "Sex"])
        .aggregate({"Name": avg_str_len, "Number": np.mean})
)

res
Out[11]:
Name Number
Color Sex
blue F 4.5 26.5
M 4.0 42.0
green F 6.0 7.0
M 4.0 8.0
pink M 5.5 -25.5

Grouping and Indexes

Notice that the groupby operation creates an index based on the grouping columns.

In [12]:
res.loc[['blue','F'], :]
Out[12]:
Name Number
Color Sex
blue F 4.5 26.5
M 4.0 42.0
In [13]:
res.loc[['green'], :]
Out[13]:
Name Number
Color Sex
green F 6.0 7.0
M 4.0 8.0

In some cases we might want to leave the grouping fields as columns:

In [14]:
(
    people
        .groupby(["Color", "Sex"], as_index=False)
        .aggregate({"Name": "first", "Number": np.mean})
)
Out[14]:
Color Sex Name Number
0 blue F Weiwei 26.5
1 blue M Joey 42.0
2 green F Karina 7.0
3 green M Joey 8.0
4 pink M Fernando -25.5

Pivot

Pivot is used to examine aggregates with respect to two characteristics. You might construct a pivot of sales data if you wanted to look at average sales broken down by year and market.

The pivot operation is essentially a groupby operation that transforms the rows and the columns. For example consider the following groupby operation:

In [15]:
people.groupby(["Color", "Sex"])['Number'].count()
Out[15]:
Color  Sex
blue   F      2
       M      1
green  F      1
       M      1
pink   M      2
Name: Number, dtype: int64

We can use pivot to compute the same result but displayed slightly differently:

In [16]:
people.pivot_table(
    values  = "Number", # the entry to aggregate over
    index   = "Color",  # the row grouping attributes
    columns = "Sex",    # the column grouping attributes
    aggfunc = "count"   # the aggregation function
)
Out[16]:
Sex F M
Color
blue 2.0 1.0
green 1.0 1.0
pink NaN 2.0

Notice that:

  1. the second "grouping" column (Sex) has been "pivoted" from the rows to column location.
  2. there is a missing value for pink and F since none of the females chose pink as their favorite color.

We can specify how missing values are filled in:

In [17]:
people.pivot_table(
    values  = "Number",
    index   = "Color",
    columns = "Sex",
    aggfunc = "count",
    fill_value = 0.0
)
Out[17]:
Sex F M
Color
blue 2 1
green 1 1
pink 0 2

Merging (joining)

The merge operation combines data from two dataframes into one dataframe. The merge operation in Pandas behaves like a join operation in SQL (we will cover SQL joins later in the semester). Unfortunately, Pandas also offers a join function which is a limited version of merge.

Suppose I also have a list of email addresses that I would like to combine with my people dataframe from above

In [18]:
email = pd.DataFrame(
    [["Deb",  "deborah_nolan@berkeley.edu"],
     ["Sam",  "samlau95@berkeley.edu"],
     ["John", "doe@nope.com"],
     ["Joey", "jegonzal@cs.berkeley.edu"],
     ["Weiwei", "weiwzhang@berkeley.edu"],
     ["Weiwei", "weiwzhang+123@berkeley.edu"],
     ["Karina", "kgoot@berkeley.edu"]], 
    columns = ["User Name", "Email"])
email
Out[18]:
User Name Email
0 Deb deborah_nolan@berkeley.edu
1 Sam samlau95@berkeley.edu
2 John doe@nope.com
3 Joey jegonzal@cs.berkeley.edu
4 Weiwei weiwzhang@berkeley.edu
5 Weiwei weiwzhang+123@berkeley.edu
6 Karina kgoot@berkeley.edu

I can use the merge function to combine these two tables:

In [19]:
people.merge(email, 
            how = "inner",
            left_on = "Name", right_on = "User Name")
Out[19]:
Name Color Number Sex User Name Email
0 Joey blue 42 M Joey jegonzal@cs.berkeley.edu
1 Joey green 8 M Joey jegonzal@cs.berkeley.edu
2 Weiwei blue 50 F Weiwei weiwzhang@berkeley.edu
3 Weiwei blue 50 F Weiwei weiwzhang+123@berkeley.edu
4 Karina green 7 F Karina kgoot@berkeley.edu
5 Sam pink -42 M Sam samlau95@berkeley.edu

Notice that:

  1. the output dataframe only contains rows that have names in both tables. For example, Fernando didn't have an email address and Deb didn't have a color preference.
  2. The name Joey occurred twice in the people table and shows up twice in the output.
  3. The name Weiwei occurred twice in the email table and appears twice in the output.

How could we fix the duplicate entries?

We could group by name (or by email) and take only the first:

In [20]:
(
    people
        .merge(email, 
            how = "inner",
            left_on = "Name", right_on = "User Name")
        .groupby('Name').first()
)
Out[20]:
Color Number Sex User Name Email
Name
Joey blue 42 M Joey jegonzal@cs.berkeley.edu
Karina green 7 F Karina kgoot@berkeley.edu
Sam pink -42 M Sam samlau95@berkeley.edu
Weiwei blue 50 F Weiwei weiwzhang@berkeley.edu

Left Joins

The above join was an inner join. What if we wanted to keep all of the people and leave missing in the email address field when their email addresses are not present.

In [21]:
people.merge(email, 
            how = "left",
            left_on = "Name", right_on = "User Name")
Out[21]:
Name Color Number Sex User Name Email
0 Joey blue 42 M Joey jegonzal@cs.berkeley.edu
1 Weiwei blue 50 F Weiwei weiwzhang@berkeley.edu
2 Weiwei blue 50 F Weiwei weiwzhang+123@berkeley.edu
3 Joey green 8 M Joey jegonzal@cs.berkeley.edu
4 Karina green 7 F Karina kgoot@berkeley.edu
5 Fernando pink -9 M NaN NaN
6 Nhi blue 3 F NaN NaN
7 Sam pink -42 M Sam samlau95@berkeley.edu

Right Joins

In [22]:
people.merge(email, 
            how = "right",
            left_on = "Name", right_on = "User Name")
Out[22]:
Name Color Number Sex User Name Email
0 Joey blue 42.0 M Joey jegonzal@cs.berkeley.edu
1 Joey green 8.0 M Joey jegonzal@cs.berkeley.edu
2 Weiwei blue 50.0 F Weiwei weiwzhang@berkeley.edu
3 Weiwei blue 50.0 F Weiwei weiwzhang+123@berkeley.edu
4 Karina green 7.0 F Karina kgoot@berkeley.edu
5 Sam pink -42.0 M Sam samlau95@berkeley.edu
6 NaN NaN NaN NaN Deb deborah_nolan@berkeley.edu
7 NaN NaN NaN NaN John doe@nope.com

Outer Joins

In [23]:
people.merge(email, 
            how = "outer",
            left_on = "Name", right_on = "User Name")
Out[23]:
Name Color Number Sex User Name Email
0 Joey blue 42.0 M Joey jegonzal@cs.berkeley.edu
1 Joey green 8.0 M Joey jegonzal@cs.berkeley.edu
2 Weiwei blue 50.0 F Weiwei weiwzhang@berkeley.edu
3 Weiwei blue 50.0 F Weiwei weiwzhang+123@berkeley.edu
4 Karina green 7.0 F Karina kgoot@berkeley.edu
5 Fernando pink -9.0 M NaN NaN
6 Nhi blue 3.0 F NaN NaN
7 Sam pink -42.0 M Sam samlau95@berkeley.edu
8 NaN NaN NaN NaN Deb deborah_nolan@berkeley.edu
9 NaN NaN NaN NaN John doe@nope.com