In [1]:
import pandas as pd
import numpy as np


# Simple Demonstration of Joins¶

To illustrate how joins work we create two toy tables here:

In [2]:
favorite_numbers = pd.DataFrame(
[["Joey", 42],
["Joey", 3.14],
["Deb", 7],
["Sam", 9],
["Nhi", 3],
["Bob", np.NaN]], columns = ["Name", "Number"])
[["Joey", "jegonzal@berkeley.edu"],
["Deb", "deborah_nolan@berkeley.edu"],
["Sam", "samlau95@berkeley.edu"],
["John", "doe@nope.com"],
["Joey", "jegonzal@cs.berkeley.edu"]], columns = ["Name", "Email"])


The first table contains favorite numbers of some famous people. (The numbers have been changes to protect their identity.)

In [3]:
favorite_numbers

Out[3]:
Name Number
0 Joey 42.00
1 Joey 3.14
2 Deb 7.00
3 Sam 9.00
4 Nhi 3.00
5 Bob NaN

The second table contains the individuals email addresses.

In [4]:
email_addr

Out[4]:
Name Email
0 Joey jegonzal@berkeley.edu
1 Deb deborah_nolan@berkeley.edu
2 Sam samlau95@berkeley.edu
3 John doe@nope.com
4 Joey jegonzal@cs.berkeley.edu

## Many ways to join the data¶

There are actually many ways you could imagine combining data from both of these tables. In the following we work through a few example methods.

## pd.merge¶

Probably the most general and standard way to join tables in pandas is to use the merge function:

In [5]:
pd.merge(favorite_numbers, email_addr)

Out[5]:
Name Number Email
0 Joey 42.00 jegonzal@berkeley.edu
1 Joey 42.00 jegonzal@cs.berkeley.edu
2 Joey 3.14 jegonzal@berkeley.edu
3 Joey 3.14 jegonzal@cs.berkeley.edu
4 Deb 7.00 deborah_nolan@berkeley.edu
5 Sam 9.00 samlau95@berkeley.edu

Notice in the above join that:

1. The Name column is used to define which records match from each table. Pandas will by default join on any matching column names.
2. Only the records that occur in both tables are included in the final table. This is called an inner join.
3. Joey occurs 4 times since the name Joey had two email addresses and two favorite numbers.

Many of the joins you will do in data science will either be inner joins or left joins (see below).

In [6]:
pd.merge(favorite_numbers, email_addr, on="Name", how="inner")

Out[6]:
Name Number Email
0 Joey 42.00 jegonzal@berkeley.edu
1 Joey 42.00 jegonzal@cs.berkeley.edu
2 Joey 3.14 jegonzal@berkeley.edu
3 Joey 3.14 jegonzal@cs.berkeley.edu
4 Deb 7.00 deborah_nolan@berkeley.edu
5 Sam 9.00 samlau95@berkeley.edu

### Left Join¶

A left join will keep all the entries in the left table even if they have no matching entry in the right table. For example we didn't have Nhi or Bob's email address and so they appear as missing values rather than being dropped from the join.

In [7]:
pd.merge(favorite_numbers, email_addr, on="Name",how="left")

Out[7]:
Name Number Email
0 Joey 42.00 jegonzal@berkeley.edu
1 Joey 42.00 jegonzal@cs.berkeley.edu
2 Joey 3.14 jegonzal@berkeley.edu
3 Joey 3.14 jegonzal@cs.berkeley.edu
4 Deb 7.00 deborah_nolan@berkeley.edu
5 Sam 9.00 samlau95@berkeley.edu
6 Nhi 3.00 NaN
7 Bob NaN NaN

### Outer Join¶

The outer join keeps entries in both tables even if they don't have a match in the other and substitutes NaN for missing values.

In [8]:
pd.merge(favorite_numbers, email_addr, on="Name", how="outer")

Out[8]:
Name Number Email
0 Joey 42.00 jegonzal@berkeley.edu
1 Joey 42.00 jegonzal@cs.berkeley.edu
2 Joey 3.14 jegonzal@berkeley.edu
3 Joey 3.14 jegonzal@cs.berkeley.edu
4 Deb 7.00 deborah_nolan@berkeley.edu
5 Sam 9.00 samlau95@berkeley.edu
6 Nhi 3.00 NaN
7 Bob NaN NaN
8 John NaN doe@nope.com

## Using the Join function¶

Pandas also provides a join function which joints two tables on their index. This function also let's you specify what kind of join you would like.

In [9]:
favorite_numbers.set_index("Name").join(email_addr.set_index("Name"), how="inner")

Out[9]:
Number Email
Name
Deb 7.00 deborah_nolan@berkeley.edu
Joey 42.00 jegonzal@berkeley.edu
Joey 42.00 jegonzal@cs.berkeley.edu
Joey 3.14 jegonzal@berkeley.edu
Joey 3.14 jegonzal@cs.berkeley.edu
Sam 9.00 samlau95@berkeley.edu