import pandas as pd
import numpy as np
To illustrate how joins work we create two toy tables here:
favorite_numbers = pd.DataFrame(
[["Joey", 42],
["Joey", 3.14],
["Deb", 7],
["Sam", 9],
["Nhi", 3],
["Bob", np.NaN]], columns = ["Name", "Number"])
email_addr = pd.DataFrame(
[["Joey", "jegonzal@berkeley.edu"],
["Deb", "deborah_nolan@berkeley.edu"],
["Sam", "samlau95@berkeley.edu"],
["John", "doe@nope.com"],
["Joey", "jegonzal@cs.berkeley.edu"]], columns = ["Name", "Email"])
The first table contains favorite numbers of some famous people. (The numbers have been changes to protect their identity.)
favorite_numbers
The second table contains the individuals email addresses.
email_addr
There are actually many ways you could imagine combining data from both of these tables. In the following we work through a few example methods.
Probably the most general and standard way to join tables in pandas is to use the merge function:
pd.merge(favorite_numbers, email_addr)
Notice in the above join that:
Name
column is used to define which records match from each table. Pandas will by default join on any matching column names.Many of the joins you will do in data science will either be inner joins or left joins (see below).
We could be more specific about the join using the following additional arguments.
pd.merge(favorite_numbers, email_addr, on="Name", how="inner")
A left join will keep all the entries in the left table even if they have no matching entry in the right table. For example we didn't have Nhi or Bob's email address and so they appear as missing values rather than being dropped from the join.
pd.merge(favorite_numbers, email_addr, on="Name",how="left")
The outer join keeps entries in both tables even if they don't have a match in the other and substitutes NaN for missing values.
pd.merge(favorite_numbers, email_addr, on="Name", how="outer")
Pandas also provides a join function which joints two tables on their index. This function also let's you specify what kind of join you would like.
favorite_numbers.set_index("Name").join(email_addr.set_index("Name"), how="inner")