Join Review

In this notebook we'll briefly review joining tables as discussed in data 8.

Often data is spread across two tables. Joining provides us with a way to naturally combine related tables.

Let's start by reading data from the given zip file. To showcase how to do read data that is inside zip files, we're going to use the zipfile module. Doing so will allow us to avoid needing to unzip the data. Running the cell below, we see that the zip file contains elections.csv and presidents.csv.

We could call my_zip.extractall() to unzip the files, but we won't. Instead, we'll read directly from the zip file itself.

To join tables df and df2, we call the function df.merge(df2). Merge is just the word that the authors of pandas picked for joining tables. I don't know why.

Note: Unfortunately, Pandas also has a function called df.join. This is a limited version of merge. For ths sake of generality, we will only use merge in this class.

I can use the merge function to combine these two tables:

Notice that:

  1. The output dataframe only contains rows that have names in both tables. For example, presidents before 1824 do not appear in the joined table because there was no popular vote before 1824.
  2. The name Andrew Jackson occurred three times in the election table and shows up three times in the output.
  3. Grover Cleveland occurs six times! Twice for every election he was in. This is because he appears three times in the elections table and twice in the presidents table. This results in 3 x 2 = 6 combinations.
  4. Several presidents are missing because their names are not an exact match. For example, John F. Kennedy is "John Kennedy" in the elections table and "John F. Kennedy" in the presidents table.

With the caveats above in mind, tThis merged DataFrame is handy because we can use it to plot, e.g. the age of each president when they were elected.

How could we fix the duplicate Grover Cleveland?

We could group by name/candidate and take only the first:

Right Join

The above join was an inner join. What if we wanted to keep all of the presidents and leave missing data for years when there was no popular vote? In this case we'd do a "right" join, where we make sure to include EVERY row from our right dataframe, in this case presidents.

Similarly, if we also want to include candidates not in the presidents table (e.g. because they had never won), we can use a "left" join.

If we wanted to keep both, we can instead do an "outer join".