In [1]:
import numpy as np
import pandas as pd

Print Visualization

In [2]:
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')
sns.set_context("notebook")

Web visualization

In [3]:
import plotly.offline as py
import plotly.express as px
import cufflinks as cf
cf.set_config_file(sharing="private", offline=True, offline_connected=False)

The Data Table Package

In data8 you used numpy and datascience

In [4]:
import numpy as np
from datascience import Table

Working with Matrices

Creating a toy matrix

In [5]:
m = np.array(np.arange(12, dtype=float)).reshape(3,4)
m
Out[5]:
array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., 11.]])

Slicing rows and columns

In [6]:
m[:, [1,3]]
Out[6]:
array([[ 1.,  3.],
       [ 5.,  7.],
       [ 9., 11.]])

Doing linear algebra

In [7]:
m.T @ (m + m) 
Out[7]:
array([[160., 184., 208., 232.],
       [184., 214., 244., 274.],
       [208., 244., 280., 316.],
       [232., 274., 316., 358.]])

Working with Tables

Load data

In [8]:
t = Table.read_table("elections.csv")
t
Out[8]:
Candidate Party % Year Result
Reagan Republican 50.7 1980 win
Carter Democratic 41 1980 loss
Anderson Independent 6.6 1980 loss
Reagan Republican 58.8 1984 win
Mondale Democratic 37.6 1984 loss
Bush Republican 53.4 1988 win
Dukakis Democratic 45.6 1988 loss
Clinton Democratic 43 1992 win
Bush Republican 37.4 1992 loss
Perot Independent 18.9 1992 loss

... (13 rows omitted)

Access columns and rows:

In [9]:
t.column("Party")
Out[9]:
array(['Republican', 'Democratic', 'Independent', 'Republican',
       'Democratic', 'Republican', 'Democratic', 'Democratic',
       'Republican', 'Independent', 'Democratic', 'Republican',
       'Independent', 'Democratic', 'Republican', 'Democratic',
       'Republican', 'Democratic', 'Republican', 'Democratic',
       'Republican', 'Democratic', 'Republican'], dtype='<U11')
In [10]:
t.row(2)
Out[10]:
Row(Candidate='Anderson', Party='Independent', %=6.6, Year=1980, Result='loss')

Define predicates

In [11]:
from datascience import are
t.where("Year", are.above_or_equal_to(2000))
Out[11]:
Candidate Party % Year Result
Gore Democratic 48.4 2000 loss
Bush Republican 47.9 2000 win
Kerry Democratic 48.3 2004 loss
Bush Republican 50.7 2004 win
Obama Democratic 52.9 2008 win
McCain Republican 45.7 2008 loss
Obama Democratic 51.1 2012 win
Romney Republican 47.2 2012 loss
Clinton Democratic 48.2 2016 loss
Trump Republican 46.1 2016 win

and do many other things...

The Table class in the datascience package implements a simple DataFrame. However, the Table package is not particularly optimized and doesn't support a wide range useful functionality. In this class, we will use Pandas, a more feature rich and widely adopted DataFrame library.

Pandas

It is customary to import pandas as pd

In [12]:
import pandas as pd

Reading Data Sources

Pandas has a number of very useful file reading tools. You can see them enumerated by typing "pd.re" and pressing tab. We'll be using read_csv today.

In [13]:
elections = pd.read_csv("elections.csv")
elections # if we end a cell with an expression or variable name, the result will print
Out[13]:
Candidate Party % Year Result
0 Reagan Republican 50.7 1980 win
1 Carter Democratic 41.0 1980 loss
2 Anderson Independent 6.6 1980 loss
3 Reagan Republican 58.8 1984 win
4 Mondale Democratic 37.6 1984 loss
5 Bush Republican 53.4 1988 win
6 Dukakis Democratic 45.6 1988 loss
7 Clinton Democratic 43.0 1992 win
8 Bush Republican 37.4 1992 loss
9 Perot Independent 18.9 1992 loss
10 Clinton Democratic 49.2 1996 win
11 Dole Republican 40.7 1996 loss
12 Perot Independent 8.4 1996 loss
13 Gore Democratic 48.4 2000 loss
14 Bush Republican 47.9 2000 win
15 Kerry Democratic 48.3 2004 loss
16 Bush Republican 50.7 2004 win
17 Obama Democratic 52.9 2008 win
18 McCain Republican 45.7 2008 loss
19 Obama Democratic 51.1 2012 win
20 Romney Republican 47.2 2012 loss
21 Clinton Democratic 48.2 2016 loss
22 Trump Republican 46.1 2016 win

Read a table from a website

In [14]:
dfs = pd.read_html("https://en.wikipedia.org/wiki/Greenhouse_gas")
dfs[4]  # read the 5th table on the page
Out[14]:
Food Types Greenhouse Gas Emissions (g CO2-Ceq per g protein)
0 Ruminant Meat 62.00
1 Recirculating Aquaculture 30.00
2 Trawling Fishery 26.00
3 Non-recirculating Aquaculture 12.00
4 Pork 10.00
5 Poultry 10.00
6 Dairy 9.10
7 Non-trawling Fishery 8.60
8 Eggs 6.80
9 Starchy Roots 1.70
10 Wheat 1.20
11 Maize 1.20
12 Legumes 0.25

Read a Microsoft Excel file

In [15]:
pd.read_excel("fossil_fuel.xlsx", sheet_name="Data2")
Out[15]:
Seven main fossil fuel combustion sources Contribution (%)
0 Liquid fuels (e.g., gasoline, fuel oil) 0.36
1 Solid fuels (e.g., coal) 0.35
2 Gaseous fuels (e.g., natural gas) 0.2
3 Cement production 3 %
4 Flaring gas industrially and at wells < 1%
5 Non-fuel hydrocarbons < 1%
6 "International bunker fuels" of transport not ... 4 %

Looking at only a few rows

We can use the head command to return only a few rows of a dataframe.

In [16]:
elections.head()
Out[16]:
Candidate Party % Year Result
0 Reagan Republican 50.7 1980 win
1 Carter Democratic 41.0 1980 loss
2 Anderson Independent 6.6 1980 loss
3 Reagan Republican 58.8 1984 win
4 Mondale Democratic 37.6 1984 loss
In [17]:
elections.head(3)
Out[17]:
Candidate Party % Year Result
0 Reagan Republican 50.7 1980 win
1 Carter Democratic 41.0 1980 loss
2 Anderson Independent 6.6 1980 loss

There is also a tail command.

In [18]:
elections.tail(7)
Out[18]:
Candidate Party % Year Result
16 Bush Republican 50.7 2004 win
17 Obama Democratic 52.9 2008 win
18 McCain Republican 45.7 2008 loss
19 Obama Democratic 51.1 2012 win
20 Romney Republican 47.2 2012 loss
21 Clinton Democratic 48.2 2016 loss
22 Trump Republican 46.1 2016 win

A random sample of 7 entries.

In [19]:
# Note I am seeding the sample so later stages in my
# notebook work. My favorite seed is 42
elec_sample = elections.sample(7, random_state=42) 
elec_sample
Out[19]:
Candidate Party % Year Result
15 Kerry Democratic 48.3 2004 loss
9 Perot Independent 18.9 1992 loss
0 Reagan Republican 50.7 1980 win
8 Bush Republican 37.4 1992 loss
17 Obama Democratic 52.9 2008 win
12 Perot Independent 8.4 1996 loss
1 Carter Democratic 41.0 1980 loss

Sampling with replacement

In [20]:
elections.sample(10, replace=True)
Out[20]:
Candidate Party % Year Result
6 Dukakis Democratic 45.6 1988 loss
19 Obama Democratic 51.1 2012 win
7 Clinton Democratic 43.0 1992 win
1 Carter Democratic 41.0 1980 loss
8 Bush Republican 37.4 1992 loss
4 Mondale Democratic 37.6 1984 loss
1 Carter Democratic 41.0 1980 loss
20 Romney Republican 47.2 2012 loss
12 Perot Independent 8.4 1996 loss
21 Clinton Democratic 48.2 2016 loss

Sampling columns?

In [21]:
elections.sample(2, axis=1).head()
Out[21]:
Candidate Result
0 Reagan win
1 Carter loss
2 Anderson loss
3 Reagan win
4 Mondale loss

Utility Operations

In addition to head, tail, and sample the are a range of useful operations.

Shape returns the number of rows and columns.

In [22]:
elections.shape
Out[22]:
(23, 5)

Size describes the number of "cells" in the dataframe

In [23]:
elections.size
Out[23]:
115

We can sort the rows by their values:

In [24]:
elections.sort_values(["Year", "Result"])
Out[24]:
Candidate Party % Year Result
1 Carter Democratic 41.0 1980 loss
2 Anderson Independent 6.6 1980 loss
0 Reagan Republican 50.7 1980 win
4 Mondale Democratic 37.6 1984 loss
3 Reagan Republican 58.8 1984 win
6 Dukakis Democratic 45.6 1988 loss
5 Bush Republican 53.4 1988 win
8 Bush Republican 37.4 1992 loss
9 Perot Independent 18.9 1992 loss
7 Clinton Democratic 43.0 1992 win
11 Dole Republican 40.7 1996 loss
12 Perot Independent 8.4 1996 loss
10 Clinton Democratic 49.2 1996 win
13 Gore Democratic 48.4 2000 loss
14 Bush Republican 47.9 2000 win
15 Kerry Democratic 48.3 2004 loss
16 Bush Republican 50.7 2004 win
18 McCain Republican 45.7 2008 loss
17 Obama Democratic 52.9 2008 win
20 Romney Republican 47.2 2012 loss
19 Obama Democratic 51.1 2012 win
21 Clinton Democratic 48.2 2016 loss
22 Trump Republican 46.1 2016 win

We can rename columns:

In [25]:
elections.rename(columns={"%": "Percent"}).head()
Out[25]:
Candidate Party Percent Year Result
0 Reagan Republican 50.7 1980 win
1 Carter Democratic 41.0 1980 loss
2 Anderson Independent 6.6 1980 loss
3 Reagan Republican 58.8 1984 win
4 Mondale Democratic 37.6 1984 loss

Note that the rename method returned a new dataframe and didn't modify the original one.

Most operations in Pandas are not mutating. This produces cleaner code. If you change something it should be stored in a new appropriately named variable.

In [26]:
elections.head()
Out[26]:
Candidate Party % Year Result
0 Reagan Republican 50.7 1980 win
1 Carter Democratic 41.0 1980 loss
2 Anderson Independent 6.6 1980 loss
3 Reagan Republican 58.8 1984 win
4 Mondale Democratic 37.6 1984 loss

Casting types.

In [27]:
elections.astype({"Year": float}).head()
Out[27]:
Candidate Party % Year Result
0 Reagan Republican 50.7 1980.0 win
1 Carter Democratic 41.0 1980.0 loss
2 Anderson Independent 6.6 1980.0 loss
3 Reagan Republican 58.8 1984.0 win
4 Mondale Democratic 37.6 1984.0 loss

We can get summary statistics for each column

In [28]:
elections.describe(include="all")
Out[28]:
Candidate Party % Year Result
count 23 23 23.000000 23.000000 23
unique 15 3 NaN NaN 2
top Bush Democratic NaN NaN loss
freq 4 10 NaN NaN 13
mean NaN NaN 42.513043 1996.869565 NaN
std NaN NaN 13.476117 11.627961 NaN
min NaN NaN 6.600000 1980.000000 NaN
25% NaN NaN 40.850000 1988.000000 NaN
50% NaN NaN 47.200000 1996.000000 NaN
75% NaN NaN 49.950000 2006.000000 NaN
max NaN NaN 58.800000 2016.000000 NaN

You can even transpose a dataframe

In [29]:
elections.transpose()
Out[29]:
0 1 2 3 4 5 6 7 8 9 ... 13 14 15 16 17 18 19 20 21 22
Candidate Reagan Carter Anderson Reagan Mondale Bush Dukakis Clinton Bush Perot ... Gore Bush Kerry Bush Obama McCain Obama Romney Clinton Trump
Party Republican Democratic Independent Republican Democratic Republican Democratic Democratic Republican Independent ... Democratic Republican Democratic Republican Democratic Republican Democratic Republican Democratic Republican
% 50.7 41 6.6 58.8 37.6 53.4 45.6 43 37.4 18.9 ... 48.4 47.9 48.3 50.7 52.9 45.7 51.1 47.2 48.2 46.1
Year 1980 1980 1980 1984 1984 1988 1988 1992 1992 1992 ... 2000 2000 2004 2004 2008 2008 2012 2012 2016 2016
Result win loss loss win loss win loss win loss loss ... loss win loss win win loss win loss loss win

5 rows × 23 columns

The fact that you can take the transpose implies some interesting symmetry properties that will lead us to treat rows and columns symmetrically. This has interesting implications on how we refer to rows and columns!

Anatomy of a DataFrame

The DataFrame has named columns

In [30]:
elections.head()
Out[30]:
Candidate Party % Year Result
0 Reagan Republican 50.7 1980 win
1 Carter Democratic 41.0 1980 loss
2 Anderson Independent 6.6 1980 loss
3 Reagan Republican 58.8 1984 win
4 Mondale Democratic 37.6 1984 loss
In [31]:
elections.columns
Out[31]:
Index(['Candidate', 'Party', '%', 'Year', 'Result'], dtype='object')

The columns have types

In [32]:
elections.dtypes
Out[32]:
Candidate     object
Party         object
%            float64
Year           int64
Result        object
dtype: object

You can access a the rows as an array of lists

In [33]:
elections.values
Out[33]:
array([['Reagan', 'Republican', 50.7, 1980, 'win'],
       ['Carter', 'Democratic', 41.0, 1980, 'loss'],
       ['Anderson', 'Independent', 6.6, 1980, 'loss'],
       ['Reagan', 'Republican', 58.8, 1984, 'win'],
       ['Mondale', 'Democratic', 37.6, 1984, 'loss'],
       ['Bush', 'Republican', 53.4, 1988, 'win'],
       ['Dukakis', 'Democratic', 45.6, 1988, 'loss'],
       ['Clinton', 'Democratic', 43.0, 1992, 'win'],
       ['Bush', 'Republican', 37.4, 1992, 'loss'],
       ['Perot', 'Independent', 18.9, 1992, 'loss'],
       ['Clinton', 'Democratic', 49.2, 1996, 'win'],
       ['Dole', 'Republican', 40.7, 1996, 'loss'],
       ['Perot', 'Independent', 8.4, 1996, 'loss'],
       ['Gore', 'Democratic', 48.4, 2000, 'loss'],
       ['Bush', 'Republican', 47.9, 2000, 'win'],
       ['Kerry', 'Democratic', 48.3, 2004, 'loss'],
       ['Bush', 'Republican', 50.7, 2004, 'win'],
       ['Obama', 'Democratic', 52.9, 2008, 'win'],
       ['McCain', 'Republican', 45.7, 2008, 'loss'],
       ['Obama', 'Democratic', 51.1, 2012, 'win'],
       ['Romney', 'Republican', 47.2, 2012, 'loss'],
       ['Clinton', 'Democratic', 48.2, 2016, 'loss'],
       ['Trump', 'Republican', 46.1, 2016, 'win']], dtype=object)

Indexes (Part 1)

All dataframes have an index.

In [34]:
elections
Out[34]:
Candidate Party % Year Result
0 Reagan Republican 50.7 1980 win
1 Carter Democratic 41.0 1980 loss
2 Anderson Independent 6.6 1980 loss
3 Reagan Republican 58.8 1984 win
4 Mondale Democratic 37.6 1984 loss
5 Bush Republican 53.4 1988 win
6 Dukakis Democratic 45.6 1988 loss
7 Clinton Democratic 43.0 1992 win
8 Bush Republican 37.4 1992 loss
9 Perot Independent 18.9 1992 loss
10 Clinton Democratic 49.2 1996 win
11 Dole Republican 40.7 1996 loss
12 Perot Independent 8.4 1996 loss
13 Gore Democratic 48.4 2000 loss
14 Bush Republican 47.9 2000 win
15 Kerry Democratic 48.3 2004 loss
16 Bush Republican 50.7 2004 win
17 Obama Democratic 52.9 2008 win
18 McCain Republican 45.7 2008 loss
19 Obama Democratic 51.1 2012 win
20 Romney Republican 47.2 2012 loss
21 Clinton Democratic 48.2 2016 loss
22 Trump Republican 46.1 2016 win
In [35]:
elections.index
Out[35]:
RangeIndex(start=0, stop=23, step=1)

By default a RangeIndex is attached enumerating the rows. This is shown in bold as the far left column. Recall that we sampled the elections table. Let's examine that sample:

In [36]:
elec_sample
Out[36]:
Candidate Party % Year Result
15 Kerry Democratic 48.3 2004 loss
9 Perot Independent 18.9 1992 loss
0 Reagan Republican 50.7 1980 win
8 Bush Republican 37.4 1992 loss
17 Obama Democratic 52.9 2008 win
12 Perot Independent 8.4 1996 loss
1 Carter Democratic 41.0 1980 loss
In [37]:
elec_sample.index
Out[37]:
Int64Index([15, 9, 0, 8, 17, 12, 1], dtype='int64')

Notice that the index is different. It maintained the index of the rows in the original table. This is really useful if we wanted to go back and relate derived tables with their original values.

You can change the index.

In [38]:
elec_sample_iyear = elec_sample.set_index("Year")
elec_sample_iyear
Out[38]:
Candidate Party % Result
Year
2004 Kerry Democratic 48.3 loss
1992 Perot Independent 18.9 loss
1980 Reagan Republican 50.7 win
1992 Bush Republican 37.4 loss
2008 Obama Democratic 52.9 win
1996 Perot Independent 8.4 loss
1980 Carter Democratic 41.0 loss
In [39]:
elec_sample_iyear.index
Out[39]:
Int64Index([2004, 1992, 1980, 1992, 2008, 1996, 1980], dtype='int64', name='Year')

Note that the set_index operation is not mutating.

In [40]:
elections.index
Out[40]:
RangeIndex(start=0, stop=23, step=1)
In [41]:
elections.head()
Out[41]:
Candidate Party % Year Result
0 Reagan Republican 50.7 1980 win
1 Carter Democratic 41.0 1980 loss
2 Anderson Independent 6.6 1980 loss
3 Reagan Republican 58.8 1984 win
4 Mondale Democratic 37.6 1984 loss

The index allows you to reference rows by name. You will see this in a moment when we talk about slicing.

Note: The index does not need to be unique.

The Columns are also an index

Recall that we could get the list of column names

In [42]:
elections.columns
Out[42]:
Index(['Candidate', 'Party', '%', 'Year', 'Result'], dtype='object')

Notice that the return type is an index. Recall we could transpose a DataFrame. This is effectively swapping the row and column index:

In [43]:
elections.transpose().index
Out[43]:
Index(['Candidate', 'Party', '%', 'Year', 'Result'], dtype='object')

Accessing Rows and Columns (Slicing)

There are many ways to access rows and columns of a Pandas DataFrame. We will spend some time reviewing most of the options.

Accessing Columns using [ ]

The DataFrame class has an indexing operator [] that lets you do a variety of different things.

Just like the Table in data8 you can access columns using the square [ ] brakets.

In [44]:
elec_sample
Out[44]:
Candidate Party % Year Result
15 Kerry Democratic 48.3 2004 loss
9 Perot Independent 18.9 1992 loss
0 Reagan Republican 50.7 1980 win
8 Bush Republican 37.4 1992 loss
17 Obama Democratic 52.9 2008 win
12 Perot Independent 8.4 1996 loss
1 Carter Democratic 41.0 1980 loss

You can pass a list of columns names:

In [45]:
elec_sample[ ["Candidate", "Year"] ] # space added to show the list
Out[45]:
Candidate Year
15 Kerry 2004
9 Perot 1992
0 Reagan 1980
8 Bush 1992
17 Obama 2008
12 Perot 1996
1 Carter 1980
In [46]:
elec_sample[["Candidate", "Year", "Result"]] # No space is more standard
Out[46]:
Candidate Year Result
15 Kerry 2004 loss
9 Perot 1992 loss
0 Reagan 1980 win
8 Bush 1992 loss
17 Obama 2008 win
12 Perot 1996 loss
1 Carter 1980 loss

If you pass a list with a single element you get back a DataFrame.

In [47]:
elec_sample[["Candidate"]]
Out[47]:
Candidate
15 Kerry
9 Perot
0 Reagan
8 Bush
17 Obama
12 Perot
1 Carter

Series

If you pass a single item instead of a list you get back a Series

In [48]:
party = elec_sample["Party"]
party
Out[48]:
15     Democratic
9     Independent
0      Republican
8      Republican
17     Democratic
12    Independent
1      Democratic
Name: Party, dtype: object

When accessing a single column we get back a pd.Series object

In [49]:
type(party)
Out[49]:
pandas.core.series.Series

The series object represents a single column (or row) of data. The Series object has a index, a name, and values. A series can be thought of as a map.

In [50]:
party.index
Out[50]:
Int64Index([15, 9, 0, 8, 17, 12, 1], dtype='int64')
In [51]:
party.name
Out[51]:
'Party'
In [52]:
party.values
Out[52]:
array(['Democratic', 'Independent', 'Republican', 'Republican',
       'Democratic', 'Independent', 'Democratic'], dtype=object)

We can convert a Series into a DataFrame

In [53]:
party.to_frame()
Out[53]:
Party
15 Democratic
9 Independent
0 Republican
8 Republican
17 Democratic
12 Independent
1 Democratic

Series act like numpy arrays and support most numpy operations

In [54]:
year = elec_sample["Year"]
year
Out[54]:
15    2004
9     1992
0     1980
8     1992
17    2008
12    1996
1     1980
Name: Year, dtype: int64
In [55]:
year.mean()
Out[55]:
1993.142857142857

Apply numpy opeations:

In [56]:
np.sin(year * 3)
Out[56]:
15   -0.845947
9     0.637133
0     0.682887
8     0.637133
17   -0.999992
12    0.124082
1     0.682887
Name: Year, dtype: float64

We can sort the Series by value or by index.

In [57]:
np.sin(year * 3).sort_values()
Out[57]:
17   -0.999992
15   -0.845947
12    0.124082
9     0.637133
8     0.637133
0     0.682887
1     0.682887
Name: Year, dtype: float64
In [58]:
np.sin(year * 3).sort_index()
Out[58]:
0     0.682887
1     0.682887
8     0.637133
9     0.637133
12    0.124082
15   -0.845947
17   -0.999992
Name: Year, dtype: float64

Counting unique values

Series also has a very useful function .value_counts() which allows us to compute the number of occurences of each unique value.

In [59]:
year.value_counts()
Out[59]:
1980    2
1992    2
2008    1
1996    1
2004    1
Name: Year, dtype: int64
In [60]:
party_counts = elections['Party'].value_counts()
party_counts
Out[60]:
Democratic     10
Republican     10
Independent     3
Name: Party, dtype: int64

Note that in each case we also got back a series and these series (like all series) are maps from index to value.

In [61]:
party_counts.index
Out[61]:
Index(['Democratic', 'Republican', 'Independent'], dtype='object')
In [62]:
party_counts.values
Out[62]:
array([10, 10,  3])
In [63]:
party_counts["Independent"]
Out[63]:
3

Indexes allow us to relate data

Notice how in call cases I keep track of the index making it possible to relate this data back to the sample and even the original table.

For example, in the following we create a new series with the name weird and join it back with the original data.

In [64]:
weird = np.sin(year * 3).rename("weird")
weird
Out[64]:
15   -0.845947
9     0.637133
0     0.682887
8     0.637133
17   -0.999992
12    0.124082
1     0.682887
Name: weird, dtype: float64

Here we use the Pandas join operation. This joins on the index. You will learn more about this next and the more general merge operation next week.

In [65]:
elections.join(weird) 
Out[65]:
Candidate Party % Year Result weird
0 Reagan Republican 50.7 1980 win 0.682887
1 Carter Democratic 41.0 1980 loss 0.682887
2 Anderson Independent 6.6 1980 loss NaN
3 Reagan Republican 58.8 1984 win NaN
4 Mondale Democratic 37.6 1984 loss NaN
5 Bush Republican 53.4 1988 win NaN
6 Dukakis Democratic 45.6 1988 loss NaN
7 Clinton Democratic 43.0 1992 win NaN
8 Bush Republican 37.4 1992 loss 0.637133
9 Perot Independent 18.9 1992 loss 0.637133
10 Clinton Democratic 49.2 1996 win NaN
11 Dole Republican 40.7 1996 loss NaN
12 Perot Independent 8.4 1996 loss 0.124082
13 Gore Democratic 48.4 2000 loss NaN
14 Bush Republican 47.9 2000 win NaN
15 Kerry Democratic 48.3 2004 loss -0.845947
16 Bush Republican 50.7 2004 win NaN
17 Obama Democratic 52.9 2008 win -0.999992
18 McCain Republican 45.7 2008 loss NaN
19 Obama Democratic 51.1 2012 win NaN
20 Romney Republican 47.2 2012 loss NaN
21 Clinton Democratic 48.2 2016 loss NaN
22 Trump Republican 46.1 2016 win NaN
In [66]:
weird.to_frame().join(elections)
Out[66]:
weird Candidate Party % Year Result
15 -0.845947 Kerry Democratic 48.3 2004 loss
9 0.637133 Perot Independent 18.9 1992 loss
0 0.682887 Reagan Republican 50.7 1980 win
8 0.637133 Bush Republican 37.4 1992 loss
17 -0.999992 Obama Democratic 52.9 2008 win
12 0.124082 Perot Independent 8.4 1996 loss
1 0.682887 Carter Democratic 41.0 1980 loss

What kind of join is this?

Column Assignment (Mutating)

You can modify and even add columns using the square brackets [ ]

In [67]:
tmp = elec_sample.copy()
tmp["Year"] = tmp["Year"] * -1 + .5
tmp
Out[67]:
Candidate Party % Year Result
15 Kerry Democratic 48.3 -2003.5 loss
9 Perot Independent 18.9 -1991.5 loss
0 Reagan Republican 50.7 -1979.5 win
8 Bush Republican 37.4 -1991.5 loss
17 Obama Democratic 52.9 -2007.5 win
12 Perot Independent 8.4 -1995.5 loss
1 Carter Democratic 41.0 -1979.5 loss

Adding a new column by assignment:

In [68]:
tmp["Corrected Year"] = tmp["Year"] * -1 + .5
tmp
Out[68]:
Candidate Party % Year Result Corrected Year
15 Kerry Democratic 48.3 -2003.5 loss 2004.0
9 Perot Independent 18.9 -1991.5 loss 1992.0
0 Reagan Republican 50.7 -1979.5 win 1980.0
8 Bush Republican 37.4 -1991.5 loss 1992.0
17 Obama Democratic 52.9 -2007.5 win 2008.0
12 Perot Independent 8.4 -1995.5 loss 1996.0
1 Carter Democratic 41.0 -1979.5 loss 1980.0
In [69]:
tmp["Random Numbers"] = np.random.randn(tmp.shape[0])
tmp
Out[69]:
Candidate Party % Year Result Corrected Year Random Numbers
15 Kerry Democratic 48.3 -2003.5 loss 2004.0 0.597980
9 Perot Independent 18.9 -1991.5 loss 1992.0 -1.416501
0 Reagan Republican 50.7 -1979.5 win 1980.0 -1.692268
8 Bush Republican 37.4 -1991.5 loss 1992.0 -2.071410
17 Obama Democratic 52.9 -2007.5 win 2008.0 -0.362172
12 Perot Independent 8.4 -1995.5 loss 1996.0 0.223105
1 Carter Democratic 41.0 -1979.5 loss 1980.0 0.444457

Accessing by rows and columns by index .loc

You can access rows and columns of a DataFrame by name using the .loc[ ] syntax.

In [70]:
elec_sample
Out[70]:
Candidate Party % Year Result
15 Kerry Democratic 48.3 2004 loss
9 Perot Independent 18.9 1992 loss
0 Reagan Republican 50.7 1980 win
8 Bush Republican 37.4 1992 loss
17 Obama Democratic 52.9 2008 win
12 Perot Independent 8.4 1996 loss
1 Carter Democratic 41.0 1980 loss
In [71]:
elec_sample.loc[:, ["Party", "Year"] ]
Out[71]:
Party Year
15 Democratic 2004
9 Independent 1992
0 Republican 1980
8 Republican 1992
17 Democratic 2008
12 Independent 1996
1 Democratic 1980

The syntax for .loc is:

  df.loc[ rows_list, column_list ]

We can pass a list of row names (index values):

In [72]:
elec_sample
Out[72]:
Candidate Party % Year Result
15 Kerry Democratic 48.3 2004 loss
9 Perot Independent 18.9 1992 loss
0 Reagan Republican 50.7 1980 win
8 Bush Republican 37.4 1992 loss
17 Obama Democratic 52.9 2008 win
12 Perot Independent 8.4 1996 loss
1 Carter Democratic 41.0 1980 loss
In [73]:
elec_sample.loc[[1,15,9], :]
Out[73]:
Candidate Party % Year Result
1 Carter Democratic 41.0 1980 loss
15 Kerry Democratic 48.3 2004 loss
9 Perot Independent 18.9 1992 loss
In [74]:
elec_sample.loc[[1,15,9]]
Out[74]:
Candidate Party % Year Result
1 Carter Democratic 41.0 1980 loss
15 Kerry Democratic 48.3 2004 loss
9 Perot Independent 18.9 1992 loss
In [75]:
elec_sample_iyear
Out[75]:
Candidate Party % Result
Year
2004 Kerry Democratic 48.3 loss
1992 Perot Independent 18.9 loss
1980 Reagan Republican 50.7 win
1992 Bush Republican 37.4 loss
2008 Obama Democratic 52.9 win
1996 Perot Independent 8.4 loss
1980 Carter Democratic 41.0 loss

How many rows will this call return

In [76]:
elec_sample_iyear.loc[[2004, 1980], :]
Out[76]:
Candidate Party % Result
Year
2004 Kerry Democratic 48.3 loss
1980 Reagan Republican 50.7 win
1980 Carter Democratic 41.0 loss

Loc also supports slicing (for all types, including numeric and string labels!). Note that the slicing for loc is inclusive, even for numeric slices. In general, avoiding range slicing with .loc.

In [77]:
elections.loc[0:4, 'Candidate':'Year']
Out[77]:
Candidate Party % Year
0 Reagan Republican 50.7 1980
1 Carter Democratic 41.0 1980
2 Anderson Independent 6.6 1980
3 Reagan Republican 58.8 1984
4 Mondale Democratic 37.6 1984

Keep in mind that the ranges need to be over the index values and not the locations.

Note the index values need to have well defined contiguous ranges.

In [78]:
elec_sample_iyear.sort_index().loc[1980:2004, 'Candidate':'Party']
Out[78]:
Candidate Party
Year
1980 Reagan Republican
1980 Carter Democratic
1992 Perot Independent
1992 Bush Republican
1996 Perot Independent
2004 Kerry Democratic

If we provide only a single label for the column argument, we get back a Series just as with regular [ ] indexing.

In [79]:
elec_sample.loc[:, 'Candidate']
Out[79]:
15     Kerry
9      Perot
0     Reagan
8       Bush
17     Obama
12     Perot
1     Carter
Name: Candidate, dtype: object

If we want a data frame instead and don't want to use to_frame, we can provde a list containing the column name.

In [80]:
elec_sample.loc[:, ['Candidate']]
Out[80]:
Candidate
15 Kerry
9 Perot
0 Reagan
8 Bush
17 Obama
12 Perot
1 Carter

We can also select a single row. Notice that in this case we also get back a Series where the index is the set of columns.

In [81]:
obama_row = elec_sample_iyear.loc[2008, :]
obama_row
Out[81]:
Candidate         Obama
Party        Democratic
%                  52.9
Result              win
Name: 2008, dtype: object
In [82]:
obama_row.name
Out[82]:
2008
In [83]:
obama_row.index
Out[83]:
Index(['Candidate', 'Party', '%', 'Result'], dtype='object')
In [84]:
obama_row.values
Out[84]:
array(['Obama', 'Democratic', 52.9, 'win'], dtype=object)

It is worth noting that the Series also functions like a map from the index to the values.

In [85]:
obama_row["Party"]
Out[85]:
'Democratic'

If we omit the column argument altogether, the default behavior is to retrieve all columns.

In [86]:
elections.set_index("Year").loc[[2008, 2012]]
Out[86]:
Candidate Party % Result
Year
2008 Obama Democratic 52.9 win
2008 McCain Republican 45.7 loss
2012 Obama Democratic 51.1 win
2012 Romney Republican 47.2 loss

Indexing a Single Value

What happens if you give scalar arguments for the requested rows AND columns. The answer is that you get back just a single value.

In [87]:
elections.loc[0, 'Candidate']
Out[87]:
'Reagan'

Boolean Array Selection (WHERE in SQL)

The .loc[] and also [ ] support arrays of booleans as an input. In this case, the array must be exactly as long as the number of rows. The result is a filtered version of the data frame, where only rows corresponding to True appear.

In [88]:
elec_sample
Out[88]:
Candidate Party % Year Result
15 Kerry Democratic 48.3 2004 loss
9 Perot Independent 18.9 1992 loss
0 Reagan Republican 50.7 1980 win
8 Bush Republican 37.4 1992 loss
17 Obama Democratic 52.9 2008 win
12 Perot Independent 8.4 1996 loss
1 Carter Democratic 41.0 1980 loss
In [89]:
elec_sample.loc[[False, False, False, False, True, False, False]]
Out[89]:
Candidate Party % Year Result
17 Obama Democratic 52.9 2008 win

You can also pass the same arguments to the [ ] operator.

In [90]:
elec_sample[[False, False, False, False, True, False, False]]
Out[90]:
Candidate Party % Year Result
17 Obama Democratic 52.9 2008 win

One very common task in Data Science is filtering. Boolean Array Selection is one way to achieve this in Pandas. We start by observing logical operators like the equality operator can be applied to Pandas Series data to generate a Boolean Array. For example, we can compare the 'Result' column to the String 'win':

In [91]:
elections.head(5)
Out[91]:
Candidate Party % Year Result
0 Reagan Republican 50.7 1980 win
1 Carter Democratic 41.0 1980 loss
2 Anderson Independent 6.6 1980 loss
3 Reagan Republican 58.8 1984 win
4 Mondale Democratic 37.6 1984 loss
In [92]:
iswin = elections['Result'] == 'win'
iswin
Out[92]:
0      True
1     False
2     False
3      True
4     False
5      True
6     False
7      True
8     False
9     False
10     True
11    False
12    False
13    False
14     True
15    False
16     True
17     True
18    False
19     True
20    False
21    False
22     True
Name: Result, dtype: bool

The output of the logical operator applied to the Series is another Series with the same name and index, but of datatype boolean. The entry at row #i represents the result of the application of that operator to the entry of the original Series at row #i.

Such a boolean Series can be used as an argument to the [] operator. For example, the following code creates a DataFrame of all election winners since 1980.

In [93]:
elections[iswin]
Out[93]:
Candidate Party % Year Result
0 Reagan Republican 50.7 1980 win
3 Reagan Republican 58.8 1984 win
5 Bush Republican 53.4 1988 win
7 Clinton Democratic 43.0 1992 win
10 Clinton Democratic 49.2 1996 win
14 Bush Republican 47.9 2000 win
16 Bush Republican 50.7 2004 win
17 Obama Democratic 52.9 2008 win
19 Obama Democratic 51.1 2012 win
22 Trump Republican 46.1 2016 win

Above, we've assigned the result of the logical operator to a new variable called iswin. This is uncommon. Usually, the series is created and used on the same line. Such code is a little tricky to read at first, but you'll get used to it quickly.

In [94]:
elections[elections['Result'] == 'win']
Out[94]:
Candidate Party % Year Result
0 Reagan Republican 50.7 1980 win
3 Reagan Republican 58.8 1984 win
5 Bush Republican 53.4 1988 win
7 Clinton Democratic 43.0 1992 win
10 Clinton Democratic 49.2 1996 win
14 Bush Republican 47.9 2000 win
16 Bush Republican 50.7 2004 win
17 Obama Democratic 52.9 2008 win
19 Obama Democratic 51.1 2012 win
22 Trump Republican 46.1 2016 win

We can select multiple criteria by creating multiple boolean Series and combining them using the & operator.

In [95]:
elections[
    (elections['Result'] == 'win') & 
    (elections['%'] < 50)
]
Out[95]:
Candidate Party % Year Result
7 Clinton Democratic 43.0 1992 win
10 Clinton Democratic 49.2 1996 win
14 Bush Republican 47.9 2000 win
22 Trump Republican 46.1 2016 win

Using the logical negation ~ operator (not).

In [96]:
elections[
    (elections['Result'] == 'win') & 
    ~(elections['%'] < 50)
]
Out[96]:
Candidate Party % Year Result
0 Reagan Republican 50.7 1980 win
3 Reagan Republican 58.8 1984 win
5 Bush Republican 53.4 1988 win
16 Bush Republican 50.7 2004 win
17 Obama Democratic 52.9 2008 win
19 Obama Democratic 51.1 2012 win

The | operator is the symbol for or.

In [97]:
elections[
    ~((elections['Party'] == "Democratic") | 
        (elections['Party'] == "Republican"))
]
Out[97]:
Candidate Party % Year Result
2 Anderson Independent 6.6 1980 loss
9 Perot Independent 18.9 1992 loss
12 Perot Independent 8.4 1996 loss

If we have multiple conditions (say Republican or Democratic), we can use the isin operator to simplify our code.

In [98]:
elections[elections['Party'].isin(["Republican", "Democratic"])]
Out[98]:
Candidate Party % Year Result
0 Reagan Republican 50.7 1980 win
1 Carter Democratic 41.0 1980 loss
3 Reagan Republican 58.8 1984 win
4 Mondale Democratic 37.6 1984 loss
5 Bush Republican 53.4 1988 win
6 Dukakis Democratic 45.6 1988 loss
7 Clinton Democratic 43.0 1992 win
8 Bush Republican 37.4 1992 loss
10 Clinton Democratic 49.2 1996 win
11 Dole Republican 40.7 1996 loss
13 Gore Democratic 48.4 2000 loss
14 Bush Republican 47.9 2000 win
15 Kerry Democratic 48.3 2004 loss
16 Bush Republican 50.7 2004 win
17 Obama Democratic 52.9 2008 win
18 McCain Republican 45.7 2008 loss
19 Obama Democratic 51.1 2012 win
20 Romney Republican 47.2 2012 loss
21 Clinton Democratic 48.2 2016 loss
22 Trump Republican 46.1 2016 win

An alternate simpler way to get back a specific set of rows is to use the query command.

In [99]:
elections.query("Result == 'win' and Year < 2000")
Out[99]:
Candidate Party % Year Result
0 Reagan Republican 50.7 1980 win
3 Reagan Republican 58.8 1984 win
5 Bush Republican 53.4 1988 win
7 Clinton Democratic 43.0 1992 win
10 Clinton Democratic 49.2 1996 win

Note, the query command needs a bit of care and cannot be applied to that contain periods or special characters.

In [100]:
tmp2 = elections.rename(columns={"Year": "Elec Year", "%": "%*100"})
tmp2.head()
Out[100]:
Candidate Party %*100 Elec Year Result
0 Reagan Republican 50.7 1980 win
1 Carter Democratic 41.0 1980 loss
2 Anderson Independent 6.6 1980 loss
3 Reagan Republican 58.8 1984 win
4 Mondale Democratic 37.6 1984 loss
In [101]:
tmp2.query("`Elec Year` > 2000")
Out[101]:
Candidate Party %*100 Elec Year Result
15 Kerry Democratic 48.3 2004 loss
16 Bush Republican 50.7 2004 win
17 Obama Democratic 52.9 2008 win
18 McCain Republican 45.7 2008 loss
19 Obama Democratic 51.1 2012 win
20 Romney Republican 47.2 2012 loss
21 Clinton Democratic 48.2 2016 loss
22 Trump Republican 46.1 2016 win

In general, I don't use the query function because of these issues.

Accessing by Location iloc (Integer-Location)

This is similar to a spreadsheet

loc's cousin iloc is very similar, but is used to access based on numerical position instead of label. For example, to access to the top 3 rows and top 3 columns of a table, we can use [0:3, 0:3]. iloc slicing is exclusive, just like standard Python slicing of numerical values.

In [102]:
elec_small = elections.head(5)
elec_small
Out[102]:
Candidate Party % Year Result
0 Reagan Republican 50.7 1980 win
1 Carter Democratic 41.0 1980 loss
2 Anderson Independent 6.6 1980 loss
3 Reagan Republican 58.8 1984 win
4 Mondale Democratic 37.6 1984 loss

We use the the dataframe.iloc[row_slice, column_slice] to access specific rows and columns by their location. For example, to access to the top 3 rows and top 3 columns of a table, we can use [0:3, 0:3]. iloc slicing is exclusive, just like standard Python slicing of numerical values.

In [103]:
elec_small.iloc[0:3, 0:3]
Out[103]:
Candidate Party %
0 Reagan Republican 50.7
1 Carter Democratic 41.0
2 Anderson Independent 6.6

Return the last three rows using slicing?

Solution elec_small.iloc[-3:, :]
In [104]:
# code here




Caution

We will use both loc and iloc in the course. Loc is generally preferred for a number of reasons, for example:

  1. It is harder to make mistakes since you have to literally write out what you want to get.
  2. Code is easier to read, because the reader doesn't have to know e.g. what column #31 represents.
  3. It is robust against permutations of the data, e.g. the social security administration switches the order of two columns.

However, iloc is sometimes more convenient. We'll provide examples of when iloc is the superior choice.



Running Python on Rows

In general, we avoid directly running python code on each row or iterating over dataframes. It is often orders of magnitude faster to use builtin operations to modify entire columns at once.

However, occasionally you need to apply code to the dataframe directly.

The apply function executes the input function on each row (if axis=1) or each column (if axis=0):

In [105]:
elections.apply(lambda row: row['Party'][0:3], axis=1)
Out[105]:
0     Rep
1     Dem
2     Ind
3     Rep
4     Dem
5     Rep
6     Dem
7     Dem
8     Rep
9     Ind
10    Dem
11    Rep
12    Ind
13    Dem
14    Rep
15    Dem
16    Rep
17    Dem
18    Rep
19    Dem
20    Rep
21    Dem
22    Rep
dtype: object

You can also directly iterate over the rows:

In [106]:
for (year, row) in elections.set_index("Year").iterrows():
    if row['Result'] == "win":
        print(f"In {year} the winner was {row['Party']}")
In 1980 the winner was Republican
In 1984 the winner was Republican
In 1988 the winner was Republican
In 1992 the winner was Democratic
In 1996 the winner was Democratic
In 2000 the winner was Republican
In 2004 the winner was Republican
In 2008 the winner was Democratic
In 2012 the winner was Democratic
In 2016 the winner was Republican

Quick Challenge

Which of the following expressions return DataFrame of the first 3 Candidate and Party names for candidates that won with more than 50% of the vote.

In [107]:
elections.iloc[[0, 3, 5], [0, 3]]
Out[107]:
Candidate Year
0 Reagan 1980
3 Reagan 1984
5 Bush 1988
In [108]:
elections.loc[[0, 3, 5], "Candidate":"Year"]
Out[108]:
Candidate Party % Year
0 Reagan Republican 50.7 1980
3 Reagan Republican 58.8 1984
5 Bush Republican 53.4 1988
In [109]:
elections.loc[elections["%"] > 50, ["Candidate", "Year"]].head(3)
Out[109]:
Candidate Year
0 Reagan 1980
3 Reagan 1984
5 Bush 1988
In [110]:
elections.loc[elections["%"] > 50, ["Candidate", "Year"]].iloc[0:2, :]
Out[110]:
Candidate Year
0 Reagan 1980
3 Reagan 1984

Handy Properties and Utility Functions for Series and DataFrames

Python Operations on Numerical DataFrames and Series

Consider a series of only the vote percentages of election winners.

In [111]:
winners = elections.query("Result == 'win'")["%"]
winners
Out[111]:
0     50.7
3     58.8
5     53.4
7     43.0
10    49.2
14    47.9
16    50.7
17    52.9
19    51.1
22    46.1
Name: %, dtype: float64

We can perform various Python operations (including numpy operations) to DataFrames and Series.

In [112]:
max(winners)
Out[112]:
58.8
In [113]:
np.mean(winners)
Out[113]:
50.38

We can also do more complicated operations like computing the mean squared error, i.e. the average L2 loss.

In [114]:
c = 50.38
mse = np.mean((c - winners)**2)
mse
Out[114]:
16.741599999999995
In [115]:
c2 = 50.35
mse2 = np.mean((c2 - winners)**2)
mse2
Out[115]:
16.742499999999996

We can also apply mathematical operations to a DataFrame so long as it has only numerical data.

In [116]:
(elections[["%", "Year"]] + 3).head(5)
Out[116]:
% Year
0 53.7 1983
1 44.0 1983
2 9.6 1983
3 61.8 1987
4 40.6 1987

Baby Names Data

Now let's play around a bit with the large baby names dataset we saw in lecture 1. We'll start by loading that dataset from the social security administration's website.

To keep the data small enough to avoid crashing datahub, we're going to look at only California rather than looking at the national dataset.

In [117]:
import urllib.request
import os.path
import zipfile

data_url = "https://www.ssa.gov/oact/babynames/state/namesbystate.zip"
local_filename = "babynamesbystate.zip"
if not os.path.exists(local_filename): # if the data exists don't download again
    with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
        f.write(resp.read())

zf = zipfile.ZipFile(local_filename, 'r')

ca_name = 'CA.TXT'
field_names = ['State', 'Sex', 'Year', 'Name', 'Count']
with zf.open(ca_name) as fh:
    babynames = pd.read_csv(fh, header=None, names=field_names)

babynames.sample(5)
Out[117]:
State Sex Year Name Count
82545 CA F 1979 Karly 7
208953 CA F 2014 Verena 6
6379 CA F 1924 Jean 460
168445 CA F 2005 Amy 638
263594 CA M 1966 Stephen 1245

Goal 1: Find the most popular female baby names in California in 2018

Solution babynames[ (babynames["Year"] == 2018) & (babynames["Sex"] == "F") ].sort_values(by = "Count", ascending = False).head(20)
In [118]:
# Solution here

Goal 2: Make a plot of how many baby girls were named Nora over the years.

Solution babynames[(babynames.Name == "Nora") & (babynames.Sex == "F") ].iplot(x="Year", y="Count")
In [119]:
# Solution here
In [ ]: