In [1]:
import pandas as pd
import numpy as np

Simple Demonstration of Joins

To illustrate how joins work we create two toy tables here:

In [2]:
favorite_numbers = pd.DataFrame(
    [["Joey", 42],
     ["Joey", 3.14],
     ["Deb", 7],
     ["Sam", 9],
     ["Nhi", 3],
     ["Bob", np.NaN]], columns = ["Name", "Number"])
email_addr = pd.DataFrame(
    [["Joey", "jegonzal@berkeley.edu"],
     ["Deb", "deborah_nolan@berkeley.edu"],
     ["Sam", "samlau95@berkeley.edu"],
     ["John", "doe@nope.com"],
     ["Joey", "jegonzal@cs.berkeley.edu"]], columns = ["Name", "Email"])

The first table contains favorite numbers of some famous people. (The numbers have been changes to protect their identity.)

In [3]:
favorite_numbers
Out[3]:
Name Number
0 Joey 42.00
1 Joey 3.14
2 Deb 7.00
3 Sam 9.00
4 Nhi 3.00
5 Bob NaN

The second table contains the individuals email addresses.

In [4]:
email_addr
Out[4]:
Name Email
0 Joey jegonzal@berkeley.edu
1 Deb deborah_nolan@berkeley.edu
2 Sam samlau95@berkeley.edu
3 John doe@nope.com
4 Joey jegonzal@cs.berkeley.edu

Many ways to join the data

There are actually many ways you could imagine combining data from both of these tables. In the following we work through a few example methods.

pd.merge

Probably the most general and standard way to join tables in pandas is to use the merge function:

In [5]:
pd.merge(favorite_numbers, email_addr)
Out[5]:
Name Number Email
0 Joey 42.00 jegonzal@berkeley.edu
1 Joey 42.00 jegonzal@cs.berkeley.edu
2 Joey 3.14 jegonzal@berkeley.edu
3 Joey 3.14 jegonzal@cs.berkeley.edu
4 Deb 7.00 deborah_nolan@berkeley.edu
5 Sam 9.00 samlau95@berkeley.edu

Notice in the above join that:

  1. The Name column is used to define which records match from each table. Pandas will by default join on any matching column names.
  2. Only the records that occur in both tables are included in the final table. This is called an inner join.
  3. Joey occurs 4 times since the name Joey had two email addresses and two favorite numbers.

Many of the joins you will do in data science will either be inner joins or left joins (see below).

We could be more specific about the join using the following additional arguments.

In [6]:
pd.merge(favorite_numbers, email_addr, on="Name", how="inner")
Out[6]:
Name Number Email
0 Joey 42.00 jegonzal@berkeley.edu
1 Joey 42.00 jegonzal@cs.berkeley.edu
2 Joey 3.14 jegonzal@berkeley.edu
3 Joey 3.14 jegonzal@cs.berkeley.edu
4 Deb 7.00 deborah_nolan@berkeley.edu
5 Sam 9.00 samlau95@berkeley.edu

Left Join

A left join will keep all the entries in the left table even if they have no matching entry in the right table. For example we didn't have Nhi or Bob's email address and so they appear as missing values rather than being dropped from the join.

In [7]:
pd.merge(favorite_numbers, email_addr, on="Name",how="left")
Out[7]:
Name Number Email
0 Joey 42.00 jegonzal@berkeley.edu
1 Joey 42.00 jegonzal@cs.berkeley.edu
2 Joey 3.14 jegonzal@berkeley.edu
3 Joey 3.14 jegonzal@cs.berkeley.edu
4 Deb 7.00 deborah_nolan@berkeley.edu
5 Sam 9.00 samlau95@berkeley.edu
6 Nhi 3.00 NaN
7 Bob NaN NaN

Outer Join

The outer join keeps entries in both tables even if they don't have a match in the other and substitutes NaN for missing values.

In [8]:
pd.merge(favorite_numbers, email_addr, on="Name", how="outer")
Out[8]:
Name Number Email
0 Joey 42.00 jegonzal@berkeley.edu
1 Joey 42.00 jegonzal@cs.berkeley.edu
2 Joey 3.14 jegonzal@berkeley.edu
3 Joey 3.14 jegonzal@cs.berkeley.edu
4 Deb 7.00 deborah_nolan@berkeley.edu
5 Sam 9.00 samlau95@berkeley.edu
6 Nhi 3.00 NaN
7 Bob NaN NaN
8 John NaN doe@nope.com

Using the Join function

Pandas also provides a join function which joints two tables on their index. This function also let's you specify what kind of join you would like.

In [9]:
favorite_numbers.set_index("Name").join(email_addr.set_index("Name"), how="inner")
Out[9]:
Number Email
Name
Deb 7.00 deborah_nolan@berkeley.edu
Joey 42.00 jegonzal@berkeley.edu
Joey 42.00 jegonzal@cs.berkeley.edu
Joey 3.14 jegonzal@berkeley.edu
Joey 3.14 jegonzal@cs.berkeley.edu
Sam 9.00 samlau95@berkeley.edu