Advanced Pandas Operations

In this notebook we review some of the key advanced Pandas operations.

  • groupby: grouping collections of records that share the same value for one set of fields and then computing aggregate statistics over the remaining fields

  • pivot: similar to groupby, except the results are presented in a different layout

  • merge: joining data from a pair of dataframes into a single dataframe.

To illustrate these operations we will use some toy data about people's favorite colors and numbers. To protect people's identities, their favorite numbers and colors are fictional.

In [1]:
import pandas as pd

people = pd.DataFrame(
    [["Joey",      "blue",    42,  "M"],
     ["Weiwei",    "blue",    50,  "F"],
     ["Joey",      "green",    8,  "M"],
     ["Karina",    "green",    7,  "F"],
     ["Fernando",  "pink",    -9,  "M"],
     ["Nhi",       "blue",     3,  "F"],
     ["Sam",       "pink",   -42,  "M"]], 
    columns = ["Name", "Color", "Number", "Sex"])
people
Out[1]:
Name Color Number Sex
0 Joey blue 42 M
1 Weiwei blue 50 F
2 Joey green 8 M
3 Karina green 7 F
4 Fernando pink -9 M
5 Nhi blue 3 F
6 Sam pink -42 M

Groupby

The groupby operator groups rows of the table that share the same values in one or more columns.

In [2]:
grps = people.groupby("Color")
grps
Out[2]:
<pandas.core.groupby.DataFrameGroupBy object at 0x11246ffd0>
In [3]:
grps.size()
Out[3]:
Color
blue     3
green    2
pink     2
dtype: int64
In [4]:
grps.apply(lambda df: display(df))
Name Color Number Sex
0 Joey blue 42 M
1 Weiwei blue 50 F
5 Nhi blue 3 F
Name Color Number Sex
0 Joey blue 42 M
1 Weiwei blue 50 F
5 Nhi blue 3 F
Name Color Number Sex
2 Joey green 8 M
3 Karina green 7 F
Name Color Number Sex
4 Fernando pink -9 M
6 Sam pink -42 M
Out[4]:

Notice that the first (blue) group is displayed twice: on this version of Pandas, apply invokes the function on the first group an extra time to infer the shape of the result.
In [5]:
people.loc[grps.indices["blue"],:]
Out[5]:
Name Color Number Sex
0 Joey blue 42 M
1 Weiwei blue 50 F
5 Nhi blue 3 F
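
As an aside, a more direct way to pull out a single group than indexing through grps.indices is the get_group method; a minimal sketch on a subset of the same toy data:

```python
import pandas as pd

# A subset of the toy data above is enough to illustrate
people = pd.DataFrame(
    [["Joey",   "blue",  42, "M"],
     ["Weiwei", "blue",  50, "F"],
     ["Nhi",    "blue",   3, "F"],
     ["Sam",    "pink", -42, "M"]],
    columns=["Name", "Color", "Number", "Sex"])

# get_group returns the sub-dataframe for a single key directly,
# without going through grps.indices
blue = people.groupby("Color").get_group("blue")
print(blue)
```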

We will commonly combine groupby with column selection (e.g., df.groupby("Region")["Sales"]) and then finally adding some aggregate calculation on that column:

In [6]:
people.groupby("Color")["Number"].median()
Out[6]:
Color
blue     42.0
green     7.5
pink    -25.5
Name: Number, dtype: float64
In [7]:
people.groupby("Color")["Number"].mean()
Out[7]:
Color
blue     31.666667
green     7.500000
pink    -25.500000
Name: Number, dtype: float64
In [8]:
people.groupby("Color")["Number"].count()
Out[8]:
Color
blue     3
green    2
pink     2
Name: Number, dtype: int64

Remember that we can group by one or more columns:

In [9]:
people.groupby(["Color", "Sex"])['Number'].count()
Out[9]:
Color  Sex
blue   F      2
       M      1
green  F      1
       M      1
pink   M      2
Name: Number, dtype: int64
In [10]:
people.groupby(["Color", "Sex"])[['Name','Number']].count()
Out[10]:
Name Number
Color Sex
blue F 2 2
M 1 1
green F 1 1
M 1 1
pink M 2 2
In [11]:
import numpy as np

def avg_str_len(series):
    return series.str.len().mean()

res = (
    people
        .groupby(["Color", "Sex"])
        .aggregate({"Name": avg_str_len, "Number": "mean"})
)

res
Out[11]:
Name Number
Color Sex
blue F 4.5 26.5
M 4.0 42.0
green F 6.0 7.0
M 4.0 8.0
pink M 5.5 -25.5

Grouping and Indexes

Notice that the groupby operation creates an index based on the grouping columns.

In [12]:
res.loc[['blue'], :]
Out[12]:
Name Number
Color Sex
blue F 4.5 26.5
M 4.0 42.0
In [13]:
res.loc[['green'], :]
Out[13]:
Name Number
Color Sex
green F 6.0 7.0
M 4.0 8.0

In some cases we might want to leave the grouping fields as columns:

In [14]:
(
    people
        .groupby(["Color", "Sex"], as_index=False)
        .aggregate({"Name": "first", "Number": "mean"})
)
Out[14]:
Color Sex Name Number
0 blue F Weiwei 26.5
1 blue M Joey 42.0
2 green F Karina 7.0
3 green M Joey 8.0
4 pink M Fernando -25.5
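
Equivalently, we can let groupby build the index as usual and then call reset_index to turn the grouping fields back into ordinary columns; a small sketch:

```python
import pandas as pd

people = pd.DataFrame(
    [["Joey", "blue", 42], ["Weiwei", "blue", 50],
     ["Karina", "green", 7], ["Sam", "pink", -42]],
    columns=["Name", "Color", "Number"])

# Grouping turns the grouping column into the index ...
means = people.groupby("Color")["Number"].mean()

# ... and reset_index moves it back into a plain column,
# matching the effect of as_index=False
means_as_columns = means.reset_index()
print(means_as_columns)
```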

Pivot

Pivot is used to examine aggregates with respect to two characteristics. You might construct a pivot of sales data if you wanted to look at average sales broken down by year and market.

The pivot operation is essentially a groupby operation that transforms the rows and the columns. For example consider the following groupby operation:

In [15]:
people.groupby(["Color", "Sex"])['Number'].count()
Out[15]:
Color  Sex
blue   F      2
       M      1
green  F      1
       M      1
pink   M      2
Name: Number, dtype: int64

We can use pivot to compute the same result but displayed slightly differently:

In [16]:
people.pivot_table(
    values  = "Number", # the entry to aggregate over
    index   = "Color",  # the row grouping attributes
    columns = "Sex",    # the column grouping attributes
    aggfunc = "count"   # the aggregation function
)
Out[16]:
Sex F M
Color
blue 2.0 1.0
green 1.0 1.0
pink NaN 2.0

Notice that:

  1. the second "grouping" column (Sex) has been "pivoted" from the rows to the columns.
  2. there is a missing value for pink and F since none of the females chose pink as their favorite color.
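
To make the connection explicit, the same table can be produced by grouping on both columns and then unstacking the inner index level (Sex) into columns; a sketch on the same toy data:

```python
import pandas as pd

people = pd.DataFrame(
    [["Joey", "blue", "M"], ["Weiwei", "blue", "F"],
     ["Joey", "green", "M"], ["Karina", "green", "F"],
     ["Fernando", "pink", "M"], ["Nhi", "blue", "F"],
     ["Sam", "pink", "M"]],
    columns=["Name", "Color", "Sex"])

# groupby + unstack: move the inner level of the row index (Sex)
# out into the columns -- this is what pivot_table is doing
counts = people.groupby(["Color", "Sex"])["Name"].count().unstack("Sex")
print(counts)
```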

We can specify how missing values are filled in:

In [17]:
people.pivot_table(
    values  = "Number",
    index   = "Color",
    columns = "Sex",
    aggfunc = "count",
    fill_value = 0.0
)
Out[17]:
Sex F M
Color
blue 2 1
green 1 1
pink 0 2

Merging (joining)

The merge operation combines data from two dataframes into one dataframe. The merge operation in Pandas behaves like a join operation in SQL (we will cover SQL joins later in the semester). Somewhat confusingly, Pandas also offers a separate join function, which is a more limited version of merge.

Suppose I also have a list of email addresses that I would like to combine with my people dataframe from above

In [18]:
email = pd.DataFrame(
    [["Deb",  "deborah_nolan@berkeley.edu"],
     ["Sam",  "samlau95@berkeley.edu"],
     ["John", "doe@nope.com"],
     ["Joey", "jegonzal@cs.berkeley.edu"],
     ["Weiwei", "weiwzhang@berkeley.edu"],
     ["Weiwei", "weiwzhang+123@berkeley.edu"],
     ["Karina", "kgoot@berkeley.edu"]], 
    columns = ["User Name", "Email"])
email
Out[18]:
User Name Email
0 Deb deborah_nolan@berkeley.edu
1 Sam samlau95@berkeley.edu
2 John doe@nope.com
3 Joey jegonzal@cs.berkeley.edu
4 Weiwei weiwzhang@berkeley.edu
5 Weiwei weiwzhang+123@berkeley.edu
6 Karina kgoot@berkeley.edu

I can use the merge function to combine these two tables:

In [19]:
people.merge(email, 
            how = "inner",
            left_on = "Name", right_on = "User Name")
Out[19]:
Name Color Number Sex User Name Email
0 Joey blue 42 M Joey jegonzal@cs.berkeley.edu
1 Joey green 8 M Joey jegonzal@cs.berkeley.edu
2 Weiwei blue 50 F Weiwei weiwzhang@berkeley.edu
3 Weiwei blue 50 F Weiwei weiwzhang+123@berkeley.edu
4 Karina green 7 F Karina kgoot@berkeley.edu
5 Sam pink -42 M Sam samlau95@berkeley.edu

Notice that:

  1. the output dataframe only contains rows that have names in both tables. For example, Fernando didn't have an email address and Deb didn't have a color preference.
  2. The name Joey occurred twice in the people table and shows up twice in the output.
  3. The name Weiwei occurred twice in the email table and appears twice in the output.

How could we fix the duplicate entries?

We could group by name (or by email) and take only the first:

In [20]:
(
    people
        .merge(email, 
            how = "inner",
            left_on = "Name", right_on = "User Name")
        .groupby('Name').first()
)
Out[20]:
Color Number Sex User Name Email
Name
Joey blue 42 M Joey jegonzal@cs.berkeley.edu
Karina green 7 F Karina kgoot@berkeley.edu
Sam pink -42 M Sam samlau95@berkeley.edu
Weiwei blue 50 F Weiwei weiwzhang@berkeley.edu
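
Another fix is to remove the duplicates on the join key before merging, using drop_duplicates; a minimal sketch (note this only removes duplicates coming from the email side):

```python
import pandas as pd

people = pd.DataFrame(
    [["Joey", "blue"], ["Weiwei", "blue"], ["Sam", "pink"]],
    columns=["Name", "Color"])

email = pd.DataFrame(
    [["Joey",   "jegonzal@cs.berkeley.edu"],
     ["Weiwei", "weiwzhang@berkeley.edu"],
     ["Weiwei", "weiwzhang+123@berkeley.edu"],
     ["Sam",    "samlau95@berkeley.edu"]],
    columns=["User Name", "Email"])

# Keep only the first email per user, so the right table has
# at most one row per join key and the merge cannot fan out
deduped = email.drop_duplicates(subset="User Name", keep="first")
merged = people.merge(deduped, how="inner",
                      left_on="Name", right_on="User Name")
print(merged)
```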

Left Joins

The above join was an inner join. What if we wanted to keep all of the people and leave the email field missing (NaN) for those whose email addresses are not present?

In [21]:
people.merge(email, 
            how = "left",
            left_on = "Name", right_on = "User Name")
Out[21]:
Name Color Number Sex User Name Email
0 Joey blue 42 M Joey jegonzal@cs.berkeley.edu
1 Weiwei blue 50 F Weiwei weiwzhang@berkeley.edu
2 Weiwei blue 50 F Weiwei weiwzhang+123@berkeley.edu
3 Joey green 8 M Joey jegonzal@cs.berkeley.edu
4 Karina green 7 F Karina kgoot@berkeley.edu
5 Fernando pink -9 M NaN NaN
6 Nhi blue 3 F NaN NaN
7 Sam pink -42 M Sam samlau95@berkeley.edu

Right Joins

In [22]:
people.merge(email, 
            how = "right",
            left_on = "Name", right_on = "User Name")
Out[22]:
Name Color Number Sex User Name Email
0 Joey blue 42.0 M Joey jegonzal@cs.berkeley.edu
1 Joey green 8.0 M Joey jegonzal@cs.berkeley.edu
2 Weiwei blue 50.0 F Weiwei weiwzhang@berkeley.edu
3 Weiwei blue 50.0 F Weiwei weiwzhang+123@berkeley.edu
4 Karina green 7.0 F Karina kgoot@berkeley.edu
5 Sam pink -42.0 M Sam samlau95@berkeley.edu
6 NaN NaN NaN NaN Deb deborah_nolan@berkeley.edu
7 NaN NaN NaN NaN John doe@nope.com

Outer Joins

In [23]:
people.merge(email, 
            how = "outer",
            left_on = "Name", right_on = "User Name")
Out[23]:
Name Color Number Sex User Name Email
0 Joey blue 42.0 M Joey jegonzal@cs.berkeley.edu
1 Joey green 8.0 M Joey jegonzal@cs.berkeley.edu
2 Weiwei blue 50.0 F Weiwei weiwzhang@berkeley.edu
3 Weiwei blue 50.0 F Weiwei weiwzhang+123@berkeley.edu
4 Karina green 7.0 F Karina kgoot@berkeley.edu
5 Fernando pink -9.0 M NaN NaN
6 Nhi blue 3.0 F NaN NaN
7 Sam pink -42.0 M Sam samlau95@berkeley.edu
8 NaN NaN NaN NaN Deb deborah_nolan@berkeley.edu
9 NaN NaN NaN NaN John doe@nope.com
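
When it is unclear which rows matched, merge's indicator=True option adds a _merge column marking each output row as left_only, right_only, or both; a small sketch:

```python
import pandas as pd

people = pd.DataFrame([["Joey"], ["Fernando"]], columns=["Name"])
email = pd.DataFrame(
    [["Joey", "jegonzal@cs.berkeley.edu"],
     ["Deb",  "deborah_nolan@berkeley.edu"]],
    columns=["User Name", "Email"])

# indicator=True appends a categorical _merge column recording
# the provenance of each row in the outer join
out = people.merge(email, how="outer",
                   left_on="Name", right_on="User Name",
                   indicator=True)
print(out[["Name", "User Name", "_merge"]])
```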

Finishing the Baby Names Lecture:

Standard imports

In [24]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

plt.style.use('fivethirtyeight')
sns.set_context("notebook")

Downloading the data

The following function is a useful helper for downloading and caching data.

In [25]:
def fetch_and_cache(data_url, file, data_dir="data", force=False):
    """
    Download and cache a url and return the file object.
    
    data_url: the web address to download
    file: the file in which to save the results.
    data_dir: (default="data") the location to save the data
    force: if true the file is always re-downloaded 
    
    return: The pathlib.Path object representing the file.
    """
    import requests
    from pathlib import Path
    data_dir = Path(data_dir)
    data_dir.mkdir(exist_ok=True)
    file_path = data_dir/Path(file)
    if force and file_path.exists():
        file_path.unlink()
    if force or not file_path.exists():
        print('Downloading...', end=' ')
        resp = requests.get(data_url)
        with file_path.open('wb') as f:
            f.write(resp.content)
        print('Done!')
    else:
        import time 
        birth_time = time.ctime(file_path.stat().st_ctime)
        print("Using cached version downloaded:", birth_time)
    return file_path
In [26]:
data_url = 'https://www.ssa.gov/oact/babynames/state/namesbystate.zip'
namesbystate_path = fetch_and_cache(data_url, 'namesbystate.zip')
Using cached version downloaded: Sun Jan 28 14:42:53 2018

Loading from ZipFile

In [27]:
import zipfile
zf = zipfile.ZipFile(namesbystate_path, 'r')

field_names = ['State', 'Sex', 'Year', 'Name', 'Count']

def load_dataframe_from_zip(zf, f):
    with zf.open(f) as fh: 
        return pd.read_csv(fh, header=None, names=field_names)
        
states = [
    load_dataframe_from_zip(zf, f)
    for f in sorted(zf.filelist, key=lambda x:x.filename) 
    if f.filename.endswith('.TXT')
]

baby_names = pd.concat(states).reset_index(drop=True)

Question 3: Can I deduce birth sex from the last letter of a person’s name?

In [28]:
baby_names['Last Letter'] = baby_names['Name'].str[-1]
baby_names.head()
Out[28]:
State Sex Year Name Count Last Letter
0 AK F 1910 Mary 14 y
1 AK F 1910 Annie 12 e
2 AK F 1910 Anna 10 a
3 AK F 1910 Margaret 8 t
4 AK F 1910 Helen 7 n
In [29]:
baby_names.size
Out[29]:
35032716

How common is each last letter?

We can use the groupby operation to determine the total number of registered babies whose names end in each letter:

In [30]:
last_letter_totals = baby_names.groupby('Last Letter')['Count'].sum()
last_letter_totals
Out[30]:
Last Letter
a    50713477
b     1378910
c     1583832
d    15988575
e    43991013
f      155669
g      533604
h    13492007
i     3573915
j       13412
k     5023272
l    18824393
m     5730924
n    52147158
o     3856068
p      648095
q        5582
r    13021669
s    19218283
t    11053023
u       95990
v       30538
w     3049750
x      605755
y    40258082
z      161535
Name: Count, dtype: int64
In [31]:
last_letter_totals.plot.bar(figsize=(10, 10))
Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a181a4dd8>

Breakdown by Birth Sex

We can use the pivot operation to break the last letter of each name down by the birth sex:

In [32]:
last_letter_pivot = baby_names.pivot_table(
    values='Count',       # the field to aggregate within each group
    index='Last Letter',  # the row grouping attribute (becomes the index)
    columns='Sex',        # the column grouping attribute
    aggfunc='sum',        # the aggregation function
)
last_letter_pivot.head()
Out[32]:
Sex F M
Last Letter
a 49128453 1585024
b 9666 1369244
c 18211 1565621
d 564804 15423771
e 31212081 12778932
In [33]:
last_letter_pivot.plot.bar(figsize=(10, 10));

These are the total counts. We might instead be interested in the proportion of each sex among the names ending in each letter.

In [34]:
prop_last_letter_pivot = last_letter_pivot.div(last_letter_totals, axis=0)
prop_last_letter_pivot
Out[34]:
Sex F M
Last Letter
a 0.968746 0.031254
b 0.007010 0.992990
c 0.011498 0.988502
d 0.035325 0.964675
e 0.709510 0.290490
f 0.003193 0.996807
g 0.024381 0.975619
h 0.521534 0.478466
i 0.830335 0.169665
j 0.068372 0.931628
k 0.002880 0.997120
l 0.263349 0.736651
m 0.063177 0.936823
n 0.341749 0.658251
o 0.079855 0.920145
p 0.000955 0.999045
q 0.016123 0.983877
r 0.281733 0.718267
s 0.166683 0.833317
t 0.198072 0.801928
u 0.484623 0.515377
v 0.085827 0.914173
w 0.013170 0.986830
x 0.030559 0.969441
y 0.571341 0.428659
z 0.645210 0.354790
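
The same proportions can also be computed from the pivot table alone, dividing each row by its own total rather than by the separately computed last_letter_totals; a sketch with a small stand-in table:

```python
import pandas as pd

# A small stand-in for last_letter_pivot: counts by last letter and sex
last_letter_pivot = pd.DataFrame(
    {"F": [90, 10], "M": [10, 90]},
    index=pd.Index(["a", "b"], name="Last Letter"))

# Divide each row by its row total; axis=0 aligns the row-sum
# series with the rows of the table
props = last_letter_pivot.div(last_letter_pivot.sum(axis=1), axis=0)
print(props)
```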
In [35]:
prop_last_letter_pivot.plot.bar(figsize=(10, 10))
Out[35]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a32262320>

If we display the bars in order of the proportion of males we get a much clearer picture:

In [36]:
(
    prop_last_letter_pivot
        .sort_values("M")
        .plot.bar(figsize=(10, 10))
)
Out[36]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a327dbf60>