Advanced Pandas Operations

In this notebook we review some of the key advanced Pandas operations.

  • groupby: grouping collections of records that share the same value for one set of fields and then computing aggregate statistics over the remaining fields

  • pivot: similar to groupby, except the results are presented in a different layout

  • merge: joining data from a pair of dataframes into a single dataframe.

To illustrate these operations we will use some toy data about people's favorite colors and numbers. To protect people's identities, their favorite numbers and colors are fictional.

In [1]:
import pandas as pd

people = pd.DataFrame(
    [["Joey",      "blue",    42,  "M"],
     ["Weiwei",    "blue",    50,  "F"],
     ["Joey",      "green",    8,  "M"],
     ["Karina",    "green",    7,  "F"],
     ["Fernando",  "pink",    -9,  "M"],
     ["Nhi",       "blue",     3,  "F"],
     ["Sam",       "pink",   -42,  "M"]], 
    columns = ["Name", "Color", "Number", "Sex"])
people
Out[1]:
Name Color Number Sex
0 Joey blue 42 M
1 Weiwei blue 50 F
2 Joey green 8 M
3 Karina green 7 F
4 Fernando pink -9 M
5 Nhi blue 3 F
6 Sam pink -42 M

Groupby

The groupby operator groups rows of the table that share the same values in one or more columns.

In [2]:
grps = people.groupby("Color")
grps
Out[2]:
<pandas.core.groupby.DataFrameGroupBy object at 0x11246ffd0>
In [3]:
grps.size()
Out[3]:
Color
blue     3
green    2
pink     2
dtype: int64
In [4]:
grps.apply(lambda df: display(df))
Name Color Number Sex
0 Joey blue 42 M
1 Weiwei blue 50 F
5 Nhi blue 3 F
Name Color Number Sex
0 Joey blue 42 M
1 Weiwei blue 50 F
5 Nhi blue 3 F
Name Color Number Sex
2 Joey green 8 M
3 Karina green 7 F
Name Color Number Sex
4 Fernando pink -9 M
6 Sam pink -42 M
Out[4]:

Notice that the first (blue) group is displayed twice: on this version of Pandas, apply invokes the function on the first group an extra time to infer the shape of the result.
In [5]:
people.loc[grps.indices["blue"],:]
Out[5]:
Name Color Number Sex
0 Joey blue 42 M
1 Weiwei blue 50 F
5 Nhi blue 3 F
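
As an aside, a more direct way to pull out a single group than indexing through grps.indices is the get_group method; a minimal sketch on a subset of the same toy data:

```python
import pandas as pd

# A subset of the toy data above is enough to illustrate
people = pd.DataFrame(
    [["Joey",   "blue",  42, "M"],
     ["Weiwei", "blue",  50, "F"],
     ["Nhi",    "blue",   3, "F"],
     ["Sam",    "pink", -42, "M"]],
    columns=["Name", "Color", "Number", "Sex"])

# get_group returns the sub-dataframe for a single key directly,
# without going through grps.indices
blue = people.groupby("Color").get_group("blue")
print(blue)
```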

We will commonly combine groupby with column selection (e.g., df.groupby("Region")["Sales"]) and then finally adding some aggregate calculation on that column:

In [6]:
people.groupby("Color")["Number"].median()
Out[6]:
Color
blue     42.0
green     7.5
pink    -25.5
Name: Number, dtype: float64
In [7]:
people.groupby("Color")["Number"].mean()
Out[7]:
Color
blue     31.666667
green     7.500000
pink    -25.500000
Name: Number, dtype: float64
In [8]:
people.groupby("Color")["Number"].count()
Out[8]:
Color
blue     3
green    2
pink     2
Name: Number, dtype: int64

Remember that we can group by one or more columns:

In [9]:
people.groupby(["Color", "Sex"])['Number'].count()
Out[9]:
Color  Sex
blue   F      2
       M      1
green  F      1
       M      1
pink   M      2
Name: Number, dtype: int64
In [10]:
people.groupby(["Color", "Sex"])[['Name','Number']].count()
Out[10]:
Name Number
Color Sex
blue F 2 2
M 1 1
green F 1 1
M 1 1
pink M 2 2
In [11]:
import numpy as np

def avg_str_len(series):
    return series.str.len().mean()

res = (
    people
        .groupby(["Color", "Sex"])
        .aggregate({"Name": avg_str_len, "Number": "mean"})
)

res
Out[11]:
Name Number
Color Sex
blue F 4.5 26.5
M 4.0 42.0
green F 6.0 7.0
M 4.0 8.0
pink M 5.5 -25.5

Grouping and Indexes

Notice that the groupby operation creates an index based on the grouping columns.

In [12]:
res.loc[['blue'], :]
Out[12]:
Name Number
Color Sex
blue F 4.5 26.5
M 4.0 42.0
In [13]:
res.loc[['green'], :]
Out[13]:
Name Number
Color Sex
green F 6.0 7.0
M 4.0 8.0

In some cases we might want to leave the grouping fields as columns:

In [14]:
(
    people
        .groupby(["Color", "Sex"], as_index=False)
        .aggregate({"Name": "first", "Number": "mean"})
)
Out[14]:
Color Sex Name Number
0 blue F Weiwei 26.5
1 blue M Joey 42.0
2 green F Karina 7.0
3 green M Joey 8.0
4 pink M Fernando -25.5
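
Equivalently, we can let groupby build the index as usual and then call reset_index to turn the grouping fields back into ordinary columns; a small sketch:

```python
import pandas as pd

people = pd.DataFrame(
    [["Joey", "blue", 42], ["Weiwei", "blue", 50],
     ["Karina", "green", 7], ["Sam", "pink", -42]],
    columns=["Name", "Color", "Number"])

# Grouping turns the grouping column into the index ...
means = people.groupby("Color")["Number"].mean()

# ... and reset_index moves it back into a plain column,
# matching the effect of as_index=False
means_as_columns = means.reset_index()
print(means_as_columns)
```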

Pivot

Pivot is used to examine aggregates with respect to two characteristics. You might construct a pivot of sales data if you wanted to look at average sales broken down by year and market.

The pivot operation is essentially a groupby operation that transforms the rows and the columns. For example consider the following groupby operation:

In [15]:
people.groupby(["Color", "Sex"])['Number'].count()
Out[15]:
Color  Sex
blue   F      2
       M      1
green  F      1
       M      1
pink   M      2
Name: Number, dtype: int64

We can use pivot to compute the same result but displayed slightly differently:

In [16]:
people.pivot_table(
    values  = "Number", # the entry to aggregate over
    index   = "Color",  # the row grouping attributes
    columns = "Sex",    # the column grouping attributes
    aggfunc = "count"   # the aggregation function
)
Out[16]:
Sex F M
Color
blue 2.0 1.0
green 1.0 1.0
pink NaN 2.0

Notice that:

  1. the second "grouping" column (Sex) has been "pivoted" from the rows to the columns.
  2. there is a missing value for pink and F since none of the females chose pink as their favorite color.
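
To make the connection explicit, the same table can be produced by grouping on both columns and then unstacking the inner index level (Sex) into columns; a sketch on the same toy data:

```python
import pandas as pd

people = pd.DataFrame(
    [["Joey", "blue", "M"], ["Weiwei", "blue", "F"],
     ["Joey", "green", "M"], ["Karina", "green", "F"],
     ["Fernando", "pink", "M"], ["Nhi", "blue", "F"],
     ["Sam", "pink", "M"]],
    columns=["Name", "Color", "Sex"])

# groupby + unstack: move the inner level of the row index (Sex)
# out into the columns -- this is what pivot_table is doing
counts = people.groupby(["Color", "Sex"])["Name"].count().unstack("Sex")
print(counts)
```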

We can specify how missing values are filled in:

In [17]:
people.pivot_table(
    values  = "Number",
    index   = "Color",
    columns = "Sex",
    aggfunc = "count",
    fill_value = 0.0
)
Out[17]:
Sex F M
Color
blue 2 1
green 1 1
pink 0 2

Merging (joining)

The merge operation combines data from two dataframes into one dataframe. The merge operation in Pandas behaves like a join operation in SQL (we will cover SQL joins later in the semester). Somewhat confusingly, Pandas also offers a separate join function, which is a more limited version of merge.

Suppose I also have a list of email addresses that I would like to combine with my people dataframe from above

In [18]:
email = pd.DataFrame(
    [["Deb",  "deborah_nolan@berkeley.edu"],
     ["Sam",  "samlau95@berkeley.edu"],
     ["John", "doe@nope.com"],
     ["Joey", "jegonzal@cs.berkeley.edu"],
     ["Weiwei", "weiwzhang@berkeley.edu"],
     ["Weiwei", "weiwzhang+123@berkeley.edu"],
     ["Karina", "kgoot@berkeley.edu"]], 
    columns = ["User Name", "Email"])
email
Out[18]:
User Name Email
0 Deb deborah_nolan@berkeley.edu
1 Sam samlau95@berkeley.edu
2 John doe@nope.com
3 Joey jegonzal@cs.berkeley.edu
4 Weiwei weiwzhang@berkeley.edu
5 Weiwei weiwzhang+123@berkeley.edu
6 Karina kgoot@berkeley.edu

I can use the merge function to combine these two tables:

In [19]:
people.merge(email, 
            how = "inner",
            left_on = "Name", right_on = "User Name")
Out[19]:
Name Color Number Sex User Name Email
0 Joey blue 42 M Joey jegonzal@cs.berkeley.edu
1 Joey green 8 M Joey jegonzal@cs.berkeley.edu
2 Weiwei blue 50 F Weiwei weiwzhang@berkeley.edu
3 Weiwei blue 50 F Weiwei weiwzhang+123@berkeley.edu
4 Karina green 7 F Karina kgoot@berkeley.edu
5 Sam pink -42 M Sam samlau95@berkeley.edu

Notice that:

  1. the output dataframe only contains rows that have names in both tables. For example, Fernando didn't have an email address and Deb didn't have a color preference.
  2. The name Joey occurred twice in the people table and shows up twice in the output.
  3. The name Weiwei occurred twice in the email table and appears twice in the output.

How could we fix the duplicate entries?

We could group by name (or by email) and take only the first:

In [20]:
(
    people
        .merge(email, 
            how = "inner",
            left_on = "Name", right_on = "User Name")
        .groupby('Name').first()
)
Out[20]:
Color Number Sex User Name Email
Name
Joey blue 42 M Joey jegonzal@cs.berkeley.edu
Karina green 7 F Karina kgoot@berkeley.edu
Sam pink -42 M Sam samlau95@berkeley.edu
Weiwei blue 50 F Weiwei weiwzhang@berkeley.edu
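
Another fix is to remove the duplicates on the join key before merging, using drop_duplicates; a minimal sketch (note this only removes duplicates coming from the email side):

```python
import pandas as pd

people = pd.DataFrame(
    [["Joey", "blue"], ["Weiwei", "blue"], ["Sam", "pink"]],
    columns=["Name", "Color"])

email = pd.DataFrame(
    [["Joey",   "jegonzal@cs.berkeley.edu"],
     ["Weiwei", "weiwzhang@berkeley.edu"],
     ["Weiwei", "weiwzhang+123@berkeley.edu"],
     ["Sam",    "samlau95@berkeley.edu"]],
    columns=["User Name", "Email"])

# Keep only the first email per user, so the right table has
# at most one row per join key and the merge cannot fan out
deduped = email.drop_duplicates(subset="User Name", keep="first")
merged = people.merge(deduped, how="inner",
                      left_on="Name", right_on="User Name")
print(merged)
```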

Left Joins

The above join was an inner join. What if we wanted to keep all of the people and leave the email field missing (NaN) for those whose email addresses are not present?

In [21]:
people.merge(email, 
            how = "left",
            left_on = "Name", right_on = "User Name")
Out[21]:
Name Color Number Sex User Name Email
0 Joey blue 42 M Joey jegonzal@cs.berkeley.edu
1 Weiwei blue 50 F Weiwei weiwzhang@berkeley.edu
2 Weiwei blue 50 F Weiwei weiwzhang+123@berkeley.edu
3 Joey green 8 M Joey jegonzal@cs.berkeley.edu
4 Karina green 7 F Karina kgoot@berkeley.edu
5 Fernando pink -9 M NaN NaN
6 Nhi blue 3 F NaN NaN
7 Sam pink -42 M Sam samlau95@berkeley.edu

Right Joins

In [22]:
people.merge(email, 
            how = "right",
            left_on = "Name", right_on = "User Name")
Out[22]:
Name Color Number Sex User Name Email
0 Joey blue 42.0 M Joey jegonzal@cs.berkeley.edu
1 Joey green 8.0 M Joey jegonzal@cs.berkeley.edu
2 Weiwei blue 50.0 F Weiwei weiwzhang@berkeley.edu
3 Weiwei blue 50.0 F Weiwei weiwzhang+123@berkeley.edu
4 Karina green 7.0 F Karina kgoot@berkeley.edu
5 Sam pink -42.0 M Sam samlau95@berkeley.edu
6 NaN NaN NaN NaN Deb deborah_nolan@berkeley.edu
7 NaN NaN NaN NaN John doe@nope.com

Outer Joins

In [23]:
people.merge(email, 
            how = "outer",
            left_on = "Name", right_on = "User Name")
Out[23]:
Name Color Number Sex User Name Email
0 Joey blue 42.0 M Joey jegonzal@cs.berkeley.edu
1 Joey green 8.0 M Joey jegonzal@cs.berkeley.edu
2 Weiwei blue 50.0 F Weiwei weiwzhang@berkeley.edu
3 Weiwei blue 50.0 F Weiwei weiwzhang+123@berkeley.edu
4 Karina green 7.0 F Karina kgoot@berkeley.edu
5 Fernando pink -9.0 M NaN NaN
6 Nhi blue 3.0 F NaN NaN
7 Sam pink -42.0 M Sam samlau95@berkeley.edu
8 NaN NaN NaN NaN Deb deborah_nolan@berkeley.edu
9 NaN NaN NaN NaN John doe@nope.com
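
When it is unclear which rows matched, merge's indicator=True option adds a _merge column marking each output row as left_only, right_only, or both; a small sketch:

```python
import pandas as pd

people = pd.DataFrame([["Joey"], ["Fernando"]], columns=["Name"])
email = pd.DataFrame(
    [["Joey", "jegonzal@cs.berkeley.edu"],
     ["Deb",  "deborah_nolan@berkeley.edu"]],
    columns=["User Name", "Email"])

# indicator=True appends a categorical _merge column recording
# the provenance of each row in the outer join
out = people.merge(email, how="outer",
                   left_on="Name", right_on="User Name",
                   indicator=True)
print(out[["Name", "User Name", "_merge"]])
```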

Finishing the Baby Names Lecture:

Standard imports

In [24]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

plt.style.use('fivethirtyeight')
sns.set_context("notebook")

Downloading the data

The following function is a useful helper for downloading and caching data.

In [25]:
def fetch_and_cache(data_url, file, data_dir="data", force=False):
    """
    Download and cache a url and return the file object.
    
    data_url: the web address to download
    file: the file in which to save the results.
    data_dir: (default="data") the location to save the data
    force: if true the file is always re-downloaded 
    
    return: The pathlib.Path object representing the file.
    """
    import requests
    from pathlib import Path
    data_dir = Path(data_dir)
    data_dir.mkdir(exist_ok=True)
    file_path = data_dir/Path(file)
    if force and file_path.exists():
        file_path.unlink()
    if force or not file_path.exists():
        print('Downloading...', end=' ')
        resp = requests.get(data_url)
        with file_path.open('wb') as f:
            f.write(resp.content)
        print('Done!')
    else:
        import time 
        birth_time = time.ctime(file_path.stat().st_ctime)
        print("Using cached version downloaded:", birth_time)
    return file_path
In [26]:
data_url = 'https://www.ssa.gov/oact/babynames/state/namesbystate.zip'
namesbystate_path = fetch_and_cache(data_url, 'namesbystate.zip')
Using cached version downloaded: Sun Jan 28 14:42:53 2018

Loading from ZipFile

In [27]:
import zipfile
zf = zipfile.ZipFile(namesbystate_path, 'r')

field_names = ['State', 'Sex', 'Year', 'Name', 'Count']

def load_dataframe_from_zip(zf, f):
    with zf.open(f) as fh: 
        return pd.read_csv(fh, header=None, names=field_names)
        
states = [
    load_dataframe_from_zip(zf, f)
    for f in sorted(zf.filelist, key=lambda x:x.filename) 
    if f.filename.endswith('.TXT')
]

baby_names = pd.concat(states).reset_index(drop=True)

Question 3: Can I deduce birth sex from the last letter of a person’s name?

In [28]:
baby_names['Last Letter'] = baby_names['Name'].str[-1]
baby_names.head()
Out[28]:
State Sex Year Name Count Last Letter
0 AK F 1910 Mary 14 y
1 AK F 1910 Annie 12 e
2 AK F 1910 Anna 10 a
3 AK F 1910 Margaret 8 t
4 AK F 1910 Helen 7 n
In [29]:
baby_names.size
Out[29]:
35032716

How common is each last letter?

We can use the groupby operation to determine the total number of registered babies whose names end in each letter:

In [30]:
last_letter_totals = baby_names.groupby('Last Letter')['Count'].sum()
last_letter_totals
Out[30]:
Last Letter
a    50713477
b     1378910
c     1583832
d    15988575
e    43991013
f      155669
g      533604
h    13492007
i     3573915
j       13412
k     5023272
l    18824393
m     5730924
n    52147158
o     3856068
p      648095
q        5582
r    13021669
s    19218283
t    11053023
u       95990
v       30538
w     3049750
x      605755
y    40258082
z      161535
Name: Count, dtype: int64
In [31]:
last_letter_totals.plot.bar(figsize=(10, 10))
Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a181a4dd8>

Breakdown by Birth Sex

We can use the pivot operation to break the last letter of each name down by the birth sex:

In [32]:
last_letter_pivot = baby_names.pivot_table(
    values='Count',       # the field to aggregate within each group
    index='Last Letter',  # the row grouping attribute (becomes the index)
    columns='Sex',        # the column grouping attribute
    aggfunc='sum',        # the aggregation function
)
last_letter_pivot.head()
Out[32]:
Sex F M
Last Letter
a 49128453 1585024
b 9666 1369244
c 18211 1565621
d 564804 15423771
e 31212081 12778932
In [33]:
last_letter_pivot.plot.bar(figsize=(10, 10));

These are the total counts. We might instead be interested in the proportion of each sex among the names ending in each letter.

In [34]:
prop_last_letter_pivot = last_letter_pivot.div(last_letter_totals, axis=0)
prop_last_letter_pivot
Out[34]:
Sex F M
Last Letter
a 0.968746 0.031254
b 0.007010 0.992990
c 0.011498 0.988502
d 0.035325 0.964675
e 0.709510 0.290490
f 0.003193 0.996807
g 0.024381 0.975619
h 0.521534 0.478466
i 0.830335 0.169665
j 0.068372 0.931628
k 0.002880 0.997120
l 0.263349 0.736651
m 0.063177 0.936823
n 0.341749 0.658251
o 0.079855 0.920145
p 0.000955 0.999045
q 0.016123 0.983877
r 0.281733 0.718267
s 0.166683 0.833317
t 0.198072 0.801928
u 0.484623 0.515377
v 0.085827 0.914173
w 0.013170 0.986830
x 0.030559 0.969441
y 0.571341 0.428659
z 0.645210 0.354790
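
The same proportions can also be computed from the pivot table alone, dividing each row by its own total rather than by the separately computed last_letter_totals; a sketch with a small stand-in table:

```python
import pandas as pd

# A small stand-in for last_letter_pivot: counts by last letter and sex
last_letter_pivot = pd.DataFrame(
    {"F": [90, 10], "M": [10, 90]},
    index=pd.Index(["a", "b"], name="Last Letter"))

# Divide each row by its row total; axis=0 aligns the row-sum
# series with the rows of the table
props = last_letter_pivot.div(last_letter_pivot.sum(axis=1), axis=0)
print(props)
```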
In [35]:
prop_last_letter_pivot.plot.bar(figsize=(10, 10))
Out[35]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a32262320>

If we display the bars in order of the proportion of males we get a much clearer picture:

In [36]:
(
    prop_last_letter_pivot
        .sort_values("M")
        .plot.bar(figsize=(10, 10))
)
Out[36]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a327dbf60>