Pandas Basics

A high-level overview of the Pandas library.

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

plt.style.use('fivethirtyeight')
sns.set_context("notebook")

Reading in DataFrames from Files

Pandas has a number of very useful file reading tools. You can see them enumerated by typing "pd.re" and pressing tab. We'll be using read_csv today.

In [2]:
elections = pd.read_csv("elections.csv")
elections # if we end a cell with an expression or variable name, the result will print
Out[2]:
Candidate Party % Year Result
0 Reagan Republican 50.7 1980 win
1 Carter Democratic 41.0 1980 loss
2 Anderson Independent 6.6 1980 loss
3 Reagan Republican 58.8 1984 win
4 Mondale Democratic 37.6 1984 loss
5 Bush Republican 53.4 1988 win
6 Dukakis Democratic 45.6 1988 loss
7 Clinton Democratic 43.0 1992 win
8 Bush Republican 37.4 1992 loss
9 Perot Independent 18.9 1992 loss
10 Clinton Democratic 49.2 1996 win
11 Dole Republican 40.7 1996 loss
12 Perot Independent 8.4 1996 loss
13 Gore Democratic 48.4 2000 loss
14 Bush Republican 47.9 2000 win
15 Kerry Democratic 48.3 2004 loss
16 Bush Republican 50.7 2004 win
17 Obama Democratic 52.9 2008 win
18 McCain Republican 45.7 2008 loss
19 Obama Democratic 51.1 2012 win
20 Romney Republican 47.2 2012 loss
21 Clinton Democratic 48.2 2016 loss
22 Trump Republican 46.1 2016 win

We can use the head command to return only a few rows of a dataframe.

In [3]:
elections.head(10)
Out[3]:
Candidate Party % Year Result
0 Reagan Republican 50.7 1980 win
1 Carter Democratic 41.0 1980 loss
2 Anderson Independent 6.6 1980 loss
3 Reagan Republican 58.8 1984 win
4 Mondale Democratic 37.6 1984 loss
5 Bush Republican 53.4 1988 win
6 Dukakis Democratic 45.6 1988 loss
7 Clinton Democratic 43.0 1992 win
8 Bush Republican 37.4 1992 loss
9 Perot Independent 18.9 1992 loss

There is also a tail command.

In [4]:
elections.tail(7)
Out[4]:
Candidate Party % Year Result
16 Bush Republican 50.7 2004 win
17 Obama Democratic 52.9 2008 win
18 McCain Republican 45.7 2008 loss
19 Obama Democratic 51.1 2012 win
20 Romney Republican 47.2 2012 loss
21 Clinton Democratic 48.2 2016 loss
22 Trump Republican 46.1 2016 win

The read_csv command lets us specify a column to use an index. For example, we could have used Year as the index.

In [5]:
elections_year_index = pd.read_csv("elections.csv", index_col = "Year")
elections_year_index.head(5)
Out[5]:
Candidate Party % Result
Year
1980 Reagan Republican 50.7 win
1980 Carter Democratic 41.0 loss
1980 Anderson Independent 6.6 loss
1984 Reagan Republican 58.8 win
1984 Mondale Democratic 37.6 loss

Alternately, we could have used the set_index commmand.

In [6]:
elections_party_index = elections.set_index("Party")
elections_party_index.head(5)
Out[6]:
Candidate % Year Result
Party
Republican Reagan 50.7 1980 win
Democratic Carter 41.0 1980 loss
Independent Anderson 6.6 1980 loss
Republican Reagan 58.8 1984 win
Democratic Mondale 37.6 1984 loss

The set_index command (along with all other data frame methods) does not modify the dataframe. That is, the original "elections" is untouched. Note: There is a flag called "inplace" which does modify the calling dataframe.

In [7]:
elections.head() #the index remains unchanged
Out[7]:
Candidate Party % Year Result
0 Reagan Republican 50.7 1980 win
1 Carter Democratic 41.0 1980 loss
2 Anderson Independent 6.6 1980 loss
3 Reagan Republican 58.8 1984 win
4 Mondale Democratic 37.6 1984 loss

By contrast, column names are ideally unique. For example, if we try to read in a file for which column names are not unique, Pandas will automatically rename any duplicates.

In [8]:
dups = pd.read_csv("duplicate_columns.csv")
dups
Out[8]:
name name.1 flavor
0 john smith vanilla
1 zhang shan chocolate
2 fulan alfulani NaN
3 hong gildong banana

The [] Operator

The DataFrame class has an indexing operator [] that lets you do a variety of different things. If your provide a String to the [] operator, you get back a Series corresponding to the requested label.

In [9]:
elections_year_index.head(6)
Out[9]:
Candidate Party % Result
Year
1980 Reagan Republican 50.7 win
1980 Carter Democratic 41.0 loss
1980 Anderson Independent 6.6 loss
1984 Reagan Republican 58.8 win
1984 Mondale Democratic 37.6 loss
1988 Bush Republican 53.4 win
In [10]:
elections_year_index["Candidate"].head(6)
Out[10]:
Year
1980      Reagan
1980      Carter
1980    Anderson
1984      Reagan
1984     Mondale
1988        Bush
Name: Candidate, dtype: object

The [] operator also accepts a list of strings. In this case, you get back a DataFrame corresponding to the requested strings.

In [11]:
elections_year_index[["Candidate", "Party"]].head()
Out[11]:
Candidate Party
Year
1980 Reagan Republican
1980 Carter Democratic
1980 Anderson Independent
1984 Reagan Republican
1984 Mondale Democratic

A list of one label also returns a DataFrame. This can be handy if you want your results as a DataFrame, not a series.

In [12]:
elections_year_index[["Candidate"]].head()
Out[12]:
Candidate
Year
1980 Reagan
1980 Carter
1980 Anderson
1984 Reagan
1984 Mondale

Note that we can also use the to_frame method to turn a Series into a DataFrame.

In [13]:
elections_year_index["Candidate"].to_frame().head()
Out[13]:
Candidate
Year
1980 Reagan
1980 Carter
1980 Anderson
1984 Reagan
1984 Mondale

The [] operator also accepts numerical slices as arguments. In this case, we are indexing by row, not column!

In [14]:
elections_year_index[0:3]
Out[14]:
Candidate Party % Result
Year
1980 Reagan Republican 50.7 win
1980 Carter Democratic 41.0 loss
1980 Anderson Independent 6.6 loss

If you provide a single argument to the [] operator, it tries to use it as a name. This is true even if the argument passed to [] is an integer.

In [15]:
#elections_year_index[0] #this does not work, try uncommenting this to see it fail in action, woo

The following cells allow you to test your understanding.

In [16]:
weird = pd.DataFrame({
    1:["topdog","botdog"], 
    "1":["topcat","botcat"]
})
weird
Out[16]:
1 1
0 topdog topcat
1 botdog botcat
In [17]:
weird[1] #try to predict the output
Out[17]:
0    topdog
1    botdog
Name: 1, dtype: object
In [18]:
weird["1"] #try to predict the output
Out[18]:
0    topcat
1    botcat
Name: 1, dtype: object
In [19]:
weird[1:] #try to predict the output
Out[19]:
1 1
1 botdog botcat

Boolean Array Selection

The [] operator also supports array of booleans as an input. In this case, the array must be exactly as long as the number of rows. The result is a filtered version of the data frame, where only rows corresponding to True appear.

In [20]:
elections_year_index
Out[20]:
Candidate Party % Result
Year
1980 Reagan Republican 50.7 win
1980 Carter Democratic 41.0 loss
1980 Anderson Independent 6.6 loss
1984 Reagan Republican 58.8 win
1984 Mondale Democratic 37.6 loss
1988 Bush Republican 53.4 win
1988 Dukakis Democratic 45.6 loss
1992 Clinton Democratic 43.0 win
1992 Bush Republican 37.4 loss
1992 Perot Independent 18.9 loss
1996 Clinton Democratic 49.2 win
1996 Dole Republican 40.7 loss
1996 Perot Independent 8.4 loss
2000 Gore Democratic 48.4 loss
2000 Bush Republican 47.9 win
2004 Kerry Democratic 48.3 loss
2004 Bush Republican 50.7 win
2008 Obama Democratic 52.9 win
2008 McCain Republican 45.7 loss
2012 Obama Democratic 51.1 win
2012 Romney Republican 47.2 loss
2016 Clinton Democratic 48.2 loss
2016 Trump Republican 46.1 win
In [21]:
elections_year_index[[False, False, False, False, False, 
          False, False, True, False, False,
          True, False, False, False, True,
          False, False, False, False, False,
          False, False, True]]
Out[21]:
Candidate Party % Result
Year
1992 Clinton Democratic 43.0 win
1996 Clinton Democratic 49.2 win
2000 Bush Republican 47.9 win
2016 Trump Republican 46.1 win

One very common task in Data Science is filtering. Boolean Array Selection is one way to achieve this in Pandas. We start by observing logical operators like the equality operator can be applied to Pandas Series data to generate a Boolean Array. For example, we can compare the 'Result' column to the String 'win':

In [22]:
elections_year_index.head(5)
Out[22]:
Candidate Party % Result
Year
1980 Reagan Republican 50.7 win
1980 Carter Democratic 41.0 loss
1980 Anderson Independent 6.6 loss
1984 Reagan Republican 58.8 win
1984 Mondale Democratic 37.6 loss
In [23]:
iswin = elections_year_index['Result'] == 'win'
iswin#.head(5)
Out[23]:
Year
1980     True
1980    False
1980    False
1984     True
1984    False
1988     True
1988    False
1992     True
1992    False
1992    False
1996     True
1996    False
1996    False
2000    False
2000     True
2004    False
2004     True
2008     True
2008    False
2012     True
2012    False
2016    False
2016     True
Name: Result, dtype: bool

The output of the logical operator applied to the Series is another Series with the same name and index, but of datatype boolean. The entry at row #i represents the result of the application of that operator to the entry of the original Series at row #i.

Such a boolean Series can be used as an argument to the [] operator. For example, the following code creates a DataFrame of all election winners since 1980.

In [24]:
elections_year_index[iswin]
Out[24]:
Candidate Party % Result
Year
1980 Reagan Republican 50.7 win
1984 Reagan Republican 58.8 win
1988 Bush Republican 53.4 win
1992 Clinton Democratic 43.0 win
1996 Clinton Democratic 49.2 win
2000 Bush Republican 47.9 win
2004 Bush Republican 50.7 win
2008 Obama Democratic 52.9 win
2012 Obama Democratic 51.1 win
2016 Trump Republican 46.1 win

Above, we've assigned the result of the logical operator to a new variable called iswin. This is uncommon. Usually, the series is created and used on the same line. Such code is a little tricky to read at first, but you'll get used to it quickly.

In [25]:
elections_year_index[elections_year_index['Result'] == 'win']
Out[25]:
Candidate Party % Result
Year
1980 Reagan Republican 50.7 win
1984 Reagan Republican 58.8 win
1988 Bush Republican 53.4 win
1992 Clinton Democratic 43.0 win
1996 Clinton Democratic 49.2 win
2000 Bush Republican 47.9 win
2004 Bush Republican 50.7 win
2008 Obama Democratic 52.9 win
2012 Obama Democratic 51.1 win
2016 Trump Republican 46.1 win

We can select multiple criteria by creating multiple boolean Series and combining them using the & operator.

In [26]:
win50plus = (elections_year_index['Result'] == 'win') & (elections_year_index['%'] < 50)
In [27]:
win50plus.head(5)
Out[27]:
Year
1980    False
1980    False
1980    False
1984    False
1984    False
dtype: bool
In [28]:
elections_year_index[(elections_year_index['Result'] == 'win')
          & (elections_year_index['%'] < 50)]

# Note for Python experts: The reason we use the & symbol and not the word "and" is because the Python __and__ 
# method overrides the "&" operator, not the "and" operator.
Out[28]:
Candidate Party % Result
Year
1992 Clinton Democratic 43.0 win
1996 Clinton Democratic 49.2 win
2000 Bush Republican 47.9 win
2016 Trump Republican 46.1 win

The | operator is the symbol for or.

In [29]:
elections_year_index[(elections_year_index['Party'] == 'Republican')
          | (elections_year_index['Party'] == "Democratic")]
Out[29]:
Candidate Party % Result
Year
1980 Reagan Republican 50.7 win
1980 Carter Democratic 41.0 loss
1984 Reagan Republican 58.8 win
1984 Mondale Democratic 37.6 loss
1988 Bush Republican 53.4 win
1988 Dukakis Democratic 45.6 loss
1992 Clinton Democratic 43.0 win
1992 Bush Republican 37.4 loss
1996 Clinton Democratic 49.2 win
1996 Dole Republican 40.7 loss
2000 Gore Democratic 48.4 loss
2000 Bush Republican 47.9 win
2004 Kerry Democratic 48.3 loss
2004 Bush Republican 50.7 win
2008 Obama Democratic 52.9 win
2008 McCain Republican 45.7 loss
2012 Obama Democratic 51.1 win
2012 Romney Republican 47.2 loss
2016 Clinton Democratic 48.2 loss
2016 Trump Republican 46.1 win

If we have multiple conditions (say Republican or Democratic), we can use the isin operator to simplify our code.

In [30]:
elections_year_index['Party'].isin(["Republican", "Democratic"])
Out[30]:
Year
1980     True
1980     True
1980    False
1984     True
1984     True
1988     True
1988     True
1992     True
1992     True
1992    False
1996     True
1996     True
1996    False
2000     True
2000     True
2004     True
2004     True
2008     True
2008     True
2012     True
2012     True
2016     True
2016     True
Name: Party, dtype: bool
In [31]:
elections_year_index[elections_year_index['Party'].isin(["Republican", "Democratic"])]
Out[31]:
Candidate Party % Result
Year
1980 Reagan Republican 50.7 win
1980 Carter Democratic 41.0 loss
1984 Reagan Republican 58.8 win
1984 Mondale Democratic 37.6 loss
1988 Bush Republican 53.4 win
1988 Dukakis Democratic 45.6 loss
1992 Clinton Democratic 43.0 win
1992 Bush Republican 37.4 loss
1996 Clinton Democratic 49.2 win
1996 Dole Republican 40.7 loss
2000 Gore Democratic 48.4 loss
2000 Bush Republican 47.9 win
2004 Kerry Democratic 48.3 loss
2004 Bush Republican 50.7 win
2008 Obama Democratic 52.9 win
2008 McCain Republican 45.7 loss
2012 Obama Democratic 51.1 win
2012 Romney Republican 47.2 loss
2016 Clinton Democratic 48.2 loss
2016 Trump Republican 46.1 win

An alternate simpler way to get back a specific set of rows is to use the query command.

In [32]:
elections_year_index.query?
In [33]:
elections_year_index.query("Result == 'win' and Year < 2000")
Out[33]:
Candidate Party % Result
Year
1980 Reagan Republican 50.7 win
1984 Reagan Republican 58.8 win
1988 Bush Republican 53.4 win
1992 Clinton Democratic 43.0 win
1996 Clinton Democratic 49.2 win

Label-based access with loc

In [34]:
elections.head(5)
Out[34]:
Candidate Party % Year Result
0 Reagan Republican 50.7 1980 win
1 Carter Democratic 41.0 1980 loss
2 Anderson Independent 6.6 1980 loss
3 Reagan Republican 58.8 1984 win
4 Mondale Democratic 37.6 1984 loss
In [35]:
elections.loc[[0, 1, 2, 3, 4], ['Candidate','Party', 'Year']]
Out[35]:
Candidate Party Year
0 Reagan Republican 1980
1 Carter Democratic 1980
2 Anderson Independent 1980
3 Reagan Republican 1984
4 Mondale Democratic 1984

Note: The loc command won't work with numeric arguments if we're using the elections DataFrame that was indexed by year.

In [36]:
elections_year_index.head(5)
Out[36]:
Candidate Party % Result
Year
1980 Reagan Republican 50.7 win
1980 Carter Democratic 41.0 loss
1980 Anderson Independent 6.6 loss
1984 Reagan Republican 58.8 win
1984 Mondale Democratic 37.6 loss
In [37]:
#causes error
elections_year_index.loc[[0, 1, 2, 3, 4], ['Candidate','Party']]#
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-37-216dbccfb42e> in <module>
      1 #causes error
----> 2 elections_year_index.loc[[0, 1, 2, 3, 4], ['Candidate','Party']]#

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py in __getitem__(self, key)
   1416                 except (KeyError, IndexError, AttributeError):
   1417                     pass
-> 1418             return self._getitem_tuple(key)
   1419         else:
   1420             # we by definition only have the 0th axis

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py in _getitem_tuple(self, tup)
    812         # ugly hack for GH #836
    813         if self._multi_take_opportunity(tup):
--> 814             return self._multi_take(tup)
    815 
    816         # no shortcut needed

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py in _multi_take(self, tup)
    867         d = {
    868             axis: self._get_listlike_indexer(key, axis)
--> 869             for (key, axis) in zip(tup, o._AXIS_ORDERS)
    870         }
    871         return o._reindex_with_indexers(d, copy=True, allow_dups=True)

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py in <dictcomp>(.0)
    867         d = {
    868             axis: self._get_listlike_indexer(key, axis)
--> 869             for (key, axis) in zip(tup, o._AXIS_ORDERS)
    870         }
    871         return o._reindex_with_indexers(d, copy=True, allow_dups=True)

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py in _get_listlike_indexer(self, key, axis, raise_missing)
   1090 
   1091         self._validate_read_indexer(
-> 1092             keyarr, indexer, o._get_axis_number(axis), raise_missing=raise_missing
   1093         )
   1094         return keyarr, indexer

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis, raise_missing)
   1175                 raise KeyError(
   1176                     "None of [{key}] are in the [{axis}]".format(
-> 1177                         key=key, axis=self.obj._get_axis_name(axis)
   1178                     )
   1179                 )

KeyError: "None of [Int64Index([0, 1, 2, 3, 4], dtype='int64', name='Year')] are in the [index]"
In [38]:
elections_year_index.loc[[1980, 1984], ['Candidate','Party']]
Out[38]:
Candidate Party
Year
1980 Reagan Republican
1980 Carter Democratic
1980 Anderson Independent
1984 Reagan Republican
1984 Mondale Democratic

Loc also supports slicing (for all types, including numeric and string labels!). Note that the slicing for loc is inclusive, even for numeric slices.

In [39]:
elections.loc[0:4, 'Candidate':'Year']
Out[39]:
Candidate Party % Year
0 Reagan Republican 50.7 1980
1 Carter Democratic 41.0 1980
2 Anderson Independent 6.6 1980
3 Reagan Republican 58.8 1984
4 Mondale Democratic 37.6 1984
In [40]:
elections_year_index.loc[1980:1984, 'Candidate':'Party']
Out[40]:
Candidate Party
Year
1980 Reagan Republican
1980 Carter Democratic
1980 Anderson Independent
1984 Reagan Republican
1984 Mondale Democratic

If we provide only a single label for the column argument, we get back a Series.

In [41]:
elections.loc[0:4, 'Candidate']
Out[41]:
0      Reagan
1      Carter
2    Anderson
3      Reagan
4     Mondale
Name: Candidate, dtype: object

If we want a data frame instead and don't want to use to_frame, we can provde a list containing the column name.

In [42]:
elections.loc[0:4, ['Candidate']]
Out[42]:
Candidate
0 Reagan
1 Carter
2 Anderson
3 Reagan
4 Mondale

If we give only one row but many column labels, we'll get back a Series corresponding to a row of the table. This new Series has a neat index, where each entry is the name of the column that the data came from.

In [43]:
elections.head(1)
Out[43]:
Candidate Party % Year Result
0 Reagan Republican 50.7 1980 win
In [44]:
elections.loc[0, 'Candidate':'Year']
Out[44]:
Candidate        Reagan
Party        Republican
%                  50.7
Year               1980
Name: 0, dtype: object
In [45]:
elections.loc[[0], 'Candidate':'Year']
Out[45]:
Candidate Party % Year
0 Reagan Republican 50.7 1980

If we omit the column argument altogether, the default behavior is to retrieve all columns.

In [46]:
elections.loc[[2, 4, 5]]
Out[46]:
Candidate Party % Year Result
2 Anderson Independent 6.6 1980 loss
4 Mondale Democratic 37.6 1984 loss
5 Bush Republican 53.4 1988 win

Loc also supports boolean array inputs instead of labels. The Boolean arrays must be of the same length as the row/column shape of the dataframe, respectively (in versions prior to 0.25, Pandas used to allow size mismatches and would assume the missing values were all False, this was changed in 2019).

In [47]:
elections.loc[[True, False, False, True, False, False, True, True, True, False, False, True, 
               True, True, False, True, True, False, False, False, True, False, False], # row mask
              [True, False, False, True, True] # column mask
             ]
Out[47]:
Candidate Year Result
0 Reagan 1980 win
3 Reagan 1984 win
6 Dukakis 1988 loss
7 Clinton 1992 win
8 Bush 1992 loss
11 Dole 1996 loss
12 Perot 1996 loss
13 Gore 2000 loss
15 Kerry 2004 loss
16 Bush 2004 win
20 Romney 2012 loss
In [48]:
elections.loc[[0, 3], ['Candidate', 'Year']]
Out[48]:
Candidate Year
0 Reagan 1980
3 Reagan 1984

We can use boolean array arguments for one axis of the data, and labels for the other.

In [49]:
elections.loc[[True, False, False, True, False, False, True, True, True, False, False, True, 
               True, True, False, True, True, False, False, False, True, False, False], # row mask
              
              'Candidate':'%' # column label slice
             ]
Out[49]:
Candidate Party %
0 Reagan Republican 50.7
3 Reagan Republican 58.8
6 Dukakis Democratic 45.6
7 Clinton Democratic 43.0
8 Bush Republican 37.4
11 Dole Republican 40.7
12 Perot Independent 8.4
13 Gore Democratic 48.4
15 Kerry Democratic 48.3
16 Bush Republican 50.7
20 Romney Republican 47.2

A student asks what happens if you give scalar arguments for the requested rows AND columns. The answer is that you get back just a single value.

In [50]:
elections.loc[15, '%']
Out[50]:
48.3

Positional access with iloc

loc's cousin iloc is very similar, but is used to access based on numerical position instead of label. For example, to access to the top 3 rows and top 3 columns of a table, we can use [0:3, 0:3]. iloc slicing is exclusive, just like standard Python slicing of numerical values.

In [51]:
elections.head(5)
Out[51]:
Candidate Party % Year Result
0 Reagan Republican 50.7 1980 win
1 Carter Democratic 41.0 1980 loss
2 Anderson Independent 6.6 1980 loss
3 Reagan Republican 58.8 1984 win
4 Mondale Democratic 37.6 1984 loss
In [52]:
elections.iloc[:3, 2:]
Out[52]:
% Year Result
0 50.7 1980 win
1 41.0 1980 loss
2 6.6 1980 loss

We will use both loc and iloc in the course. Loc is generally preferred for a number of reasons, for example:

  1. It is harder to make mistakes since you have to literally write out what you want to get.
  2. Code is easier to read, because the reader doesn't have to know e.g. what column #31 represents.
  3. It is robust against permutations of the data, e.g. the social security administration switches the order of two columns.

However, iloc is sometimes more convenient. We'll provide examples of when iloc is the superior choice.

Quick Challenge

Which of the following expressions return DataFrame of the first 3 Candidate and Year for candidates that won with more than 50% of the vote.

In [53]:
elections.head(10)
Out[53]:
Candidate Party % Year Result
0 Reagan Republican 50.7 1980 win
1 Carter Democratic 41.0 1980 loss
2 Anderson Independent 6.6 1980 loss
3 Reagan Republican 58.8 1984 win
4 Mondale Democratic 37.6 1984 loss
5 Bush Republican 53.4 1988 win
6 Dukakis Democratic 45.6 1988 loss
7 Clinton Democratic 43.0 1992 win
8 Bush Republican 37.4 1992 loss
9 Perot Independent 18.9 1992 loss
In [54]:
elections.iloc[[0, 3, 5], [0, 3]]
Out[54]:
Candidate Year
0 Reagan 1980
3 Reagan 1984
5 Bush 1988
In [55]:
elections.loc[[0, 3, 5], "Candidate":"Year"]
Out[55]:
Candidate Party % Year
0 Reagan Republican 50.7 1980
3 Reagan Republican 58.8 1984
5 Bush Republican 53.4 1988
In [56]:
elections.loc[elections["%"] > 50, ["Candidate", "Year"]].head(3)
Out[56]:
Candidate Year
0 Reagan 1980
3 Reagan 1984
5 Bush 1988
In [57]:
elections.loc[elections["%"] > 50, ["Candidate", "Year"]].iloc[0:2, :]
Out[57]:
Candidate Year
0 Reagan 1980
3 Reagan 1984

Sampling

Pandas dataframes also make it easy to get a sample. We simply use the sample method and provide the number of samples that we'd like as the arugment. Sampling is done without replacement by default. Set replace=True if you want replacement.

In [58]:
elections.sample(10)
Out[58]:
Candidate Party % Year Result
4 Mondale Democratic 37.6 1984 loss
11 Dole Republican 40.7 1996 loss
0 Reagan Republican 50.7 1980 win
3 Reagan Republican 58.8 1984 win
14 Bush Republican 47.9 2000 win
9 Perot Independent 18.9 1992 loss
15 Kerry Democratic 48.3 2004 loss
13 Gore Democratic 48.4 2000 loss
8 Bush Republican 37.4 1992 loss
18 McCain Republican 45.7 2008 loss
In [59]:
elections.query("Year < 1992").sample(50, replace=True)
Out[59]:
Candidate Party % Year Result
4 Mondale Democratic 37.6 1984 loss
0 Reagan Republican 50.7 1980 win
5 Bush Republican 53.4 1988 win
0 Reagan Republican 50.7 1980 win
6 Dukakis Democratic 45.6 1988 loss
1 Carter Democratic 41.0 1980 loss
2 Anderson Independent 6.6 1980 loss
5 Bush Republican 53.4 1988 win
3 Reagan Republican 58.8 1984 win
6 Dukakis Democratic 45.6 1988 loss
0 Reagan Republican 50.7 1980 win
6 Dukakis Democratic 45.6 1988 loss
2 Anderson Independent 6.6 1980 loss
1 Carter Democratic 41.0 1980 loss
1 Carter Democratic 41.0 1980 loss
6 Dukakis Democratic 45.6 1988 loss
5 Bush Republican 53.4 1988 win
4 Mondale Democratic 37.6 1984 loss
4 Mondale Democratic 37.6 1984 loss
2 Anderson Independent 6.6 1980 loss
2 Anderson Independent 6.6 1980 loss
1 Carter Democratic 41.0 1980 loss
2 Anderson Independent 6.6 1980 loss
5 Bush Republican 53.4 1988 win
4 Mondale Democratic 37.6 1984 loss
6 Dukakis Democratic 45.6 1988 loss
5 Bush Republican 53.4 1988 win
2 Anderson Independent 6.6 1980 loss
1 Carter Democratic 41.0 1980 loss
1 Carter Democratic 41.0 1980 loss
2 Anderson Independent 6.6 1980 loss
6 Dukakis Democratic 45.6 1988 loss
3 Reagan Republican 58.8 1984 win
3 Reagan Republican 58.8 1984 win
1 Carter Democratic 41.0 1980 loss
6 Dukakis Democratic 45.6 1988 loss
3 Reagan Republican 58.8 1984 win
4 Mondale Democratic 37.6 1984 loss
0 Reagan Republican 50.7 1980 win
2 Anderson Independent 6.6 1980 loss
0 Reagan Republican 50.7 1980 win
2 Anderson Independent 6.6 1980 loss
0 Reagan Republican 50.7 1980 win
0 Reagan Republican 50.7 1980 win
1 Carter Democratic 41.0 1980 loss
3 Reagan Republican 58.8 1984 win
3 Reagan Republican 58.8 1984 win
4 Mondale Democratic 37.6 1984 loss
6 Dukakis Democratic 45.6 1988 loss
2 Anderson Independent 6.6 1980 loss

Handy Properties and Utility Functions for Series and DataFrames

Python Operations on Numerical DataFrames and Series

Consider a series of only the vote percentages of election winners.

In [60]:
winners = elections.query("Result == 'win'")["%"]
winners
Out[60]:
0     50.7
3     58.8
5     53.4
7     43.0
10    49.2
14    47.9
16    50.7
17    52.9
19    51.1
22    46.1
Name: %, dtype: float64

We can perform various Python operations (including numpy operations) to DataFrames and Series.

In [61]:
max(winners)
Out[61]:
58.8
In [62]:
np.mean(winners)
Out[62]:
50.38

We can also do more complicated operations like computing the mean squared error, i.e. the average L2 loss. (This will mean more in the next few weeks.)

In [63]:
c = 50.38
mse = np.mean((c - winners)**2)
mse
Out[63]:
16.741599999999995
In [64]:
c2 = 50.35
mse2 = np.mean((c2 - winners)**2)
mse2
Out[64]:
16.742499999999996

We can also apply mathematical operations to a DataFrame so long as it has only numerical data.

In [65]:
(elections[["%", "Year"]] + 3).head(5)
Out[65]:
% Year
0 53.7 1983
1 44.0 1983
2 9.6 1983
3 61.8 1987
4 40.6 1987

Handy Utility Methods

The head, shape, size, and describe methods can be used to quickly get a good sense of the data we're working with. For example:

In [66]:
mottos = pd.read_csv("mottos.csv", index_col="State")
In [67]:
mottos.head(20)
Out[67]:
Motto Translation Language Date Adopted
State
Alabama Audemus jura nostra defendere We dare defend our rights! Latin 1923
Alaska North to the future English 1967
Arizona Ditat Deus God enriches Latin 1863
Arkansas Regnat populus The people rule Latin 1907
California Eureka (Εὕρηκα) I have found it Greek 1849
Colorado Nil sine numine Nothing without providence. Latin November 6, 1861
Connecticut Qui transtulit sustinet He who transplanted sustains Latin October 9, 1662
Delaware Liberty and Independence English 1847
Florida In God We Trust English 1868
Georgia Wisdom, Justice, Moderation English 1798
Hawaii Ua mau ke ea o ka ʻāina i ka pono The life of the land is perpetuated in righteo... Hawaiian July 31, 1843
Idaho Esto perpetua Let it be perpetual Latin 1890
Illinois State sovereignty, national union English 1819
Indiana The Crossroads of America English 1937
Iowa Our liberties we prize and our rights we will ... English 1847
Kansas Ad astra per aspera To the stars through difficulties Latin 1861
Kentucky Deo gratiam habeamus Let us be grateful to God Latin 2002
Louisiana Union, justice, confidence English 1902
Maine Dirigo I lead Latin 1820
Maryland Fatti maschii, parole femine Manly deeds, womanly words Italian 1874
In [68]:
mottos.size
Out[68]:
200

The fact that the size is 200 means our data file is relatively small, with only 200 total entries.

In [69]:
mottos.shape
Out[69]:
(50, 4)

Since we're looking at data for states, and we see the number 50, it looks like we've mostly likely got a complete dataset that omits Washington D.C. and U.S. territories like Guam and Puerto Rico.

In [70]:
mottos.describe()
Out[70]:
Motto Translation Language Date Adopted
count 50 49 50 50
unique 50 30 8 47
top Live Free or Die Latin 1861
freq 1 20 23 2

Above, we see a quick summary of all the data. For example, the most common language for mottos is Latin, which covers 23 different states. Does anything else seem surprising?

We can get a direct reference to the index using .index.

In [71]:
mottos.index
Out[71]:
Index(['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado',
       'Connecticut', 'Delaware', 'Florida', 'Georgia', 'Hawaii', 'Idaho',
       'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana',
       'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota',
       'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada',
       'New Hampshire', 'New Jersey', 'New Mexico', 'New York',
       'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon',
       'Pennsylvania', 'Rhode Island', 'South Carolina', 'South Dakota',
       'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington',
       'West Virginia', 'Wisconsin', 'Wyoming'],
      dtype='object', name='State')

We can also access individual properties of the index, for example, mottos.index.name.

In [72]:
mottos.index.name
Out[72]:
'State'

This reflects the fact that in our data frame, the index IS the state name!

In [73]:
mottos.head(2)
Out[73]:
Motto Translation Language Date Adopted
State
Alabama Audemus jura nostra defendere We dare defend our rights! Latin 1923
Alaska North to the future English 1967

It turns out the columns also have an Index. We can access this index by using .columns.

In [74]:
mottos.head(2)
Out[74]:
Motto Translation Language Date Adopted
State
Alabama Audemus jura nostra defendere We dare defend our rights! Latin 1923
Alaska North to the future English 1967

There are also a ton of useful utility methods we can use with Data Frames and Series. For example, we can create a copy of a data frame sorted by a specific column using sort_values.

In [75]:
elections.sort_values('%', ascending=False)
Out[75]:
Candidate Party % Year Result
3 Reagan Republican 58.8 1984 win
5 Bush Republican 53.4 1988 win
17 Obama Democratic 52.9 2008 win
19 Obama Democratic 51.1 2012 win
0 Reagan Republican 50.7 1980 win
16 Bush Republican 50.7 2004 win
10 Clinton Democratic 49.2 1996 win
13 Gore Democratic 48.4 2000 loss
15 Kerry Democratic 48.3 2004 loss
21 Clinton Democratic 48.2 2016 loss
14 Bush Republican 47.9 2000 win
20 Romney Republican 47.2 2012 loss
22 Trump Republican 46.1 2016 win
18 McCain Republican 45.7 2008 loss
6 Dukakis Democratic 45.6 1988 loss
7 Clinton Democratic 43.0 1992 win
1 Carter Democratic 41.0 1980 loss
11 Dole Republican 40.7 1996 loss
4 Mondale Democratic 37.6 1984 loss
8 Bush Republican 37.4 1992 loss
9 Perot Independent 18.9 1992 loss
12 Perot Independent 8.4 1996 loss
2 Anderson Independent 6.6 1980 loss

As mentioned before, all Data Frame methods return a copy and do not modify the original data structure, unless you set inplace to True.

In [76]:
elections.head(5)
Out[76]:
Candidate Party % Year Result
0 Reagan Republican 50.7 1980 win
1 Carter Democratic 41.0 1980 loss
2 Anderson Independent 6.6 1980 loss
3 Reagan Republican 58.8 1984 win
4 Mondale Democratic 37.6 1984 loss

If we want to sort in reverse order, we can set ascending=False.

In [77]:
elections.sort_values('%', ascending=False)
Out[77]:
Candidate Party % Year Result
3 Reagan Republican 58.8 1984 win
5 Bush Republican 53.4 1988 win
17 Obama Democratic 52.9 2008 win
19 Obama Democratic 51.1 2012 win
0 Reagan Republican 50.7 1980 win
16 Bush Republican 50.7 2004 win
10 Clinton Democratic 49.2 1996 win
13 Gore Democratic 48.4 2000 loss
15 Kerry Democratic 48.3 2004 loss
21 Clinton Democratic 48.2 2016 loss
14 Bush Republican 47.9 2000 win
20 Romney Republican 47.2 2012 loss
22 Trump Republican 46.1 2016 win
18 McCain Republican 45.7 2008 loss
6 Dukakis Democratic 45.6 1988 loss
7 Clinton Democratic 43.0 1992 win
1 Carter Democratic 41.0 1980 loss
11 Dole Republican 40.7 1996 loss
4 Mondale Democratic 37.6 1984 loss
8 Bush Republican 37.4 1992 loss
9 Perot Independent 18.9 1992 loss
12 Perot Independent 8.4 1996 loss
2 Anderson Independent 6.6 1980 loss

We can also use sort_values on Series objects.

In [78]:
mottos['Language'].sort_values(ascending=False).head(10)
Out[78]:
State
Montana           Spanish
Missouri            Latin
North Carolina      Latin
Arizona             Latin
Arkansas            Latin
Colorado            Latin
Connecticut         Latin
Idaho               Latin
Kansas              Latin
Kentucky            Latin
Name: Language, dtype: object

For Series, the value_counts method is often quite handy.

In [79]:
elections['Party'].value_counts()
Out[79]:
Democratic     10
Republican     10
Independent     3
Name: Party, dtype: int64
In [80]:
mottos['Language'].value_counts()
Out[80]:
Latin             23
English           21
Greek              1
Hawaiian           1
Italian            1
Spanish            1
Chinook Jargon     1
French             1
Name: Language, dtype: int64

Also commonly used is the unique method, which returns all unique values as a numpy array.

In [81]:
mottos['Language'].unique()
Out[81]:
array(['Latin', 'English', 'Greek', 'Hawaiian', 'Italian', 'French',
       'Spanish', 'Chinook Jargon'], dtype=object)

Baby Names Data

Now let's play around a bit with a large baby names dataset that is publicly available. We'll start by loading that dataset from the social security administration's website.

To keep the data small enough to avoid crashing datahub, we're going to look at only California rather than looking at the national dataset.

In [82]:
import urllib.request
import os.path
import zipfile

data_url = "https://www.ssa.gov/oact/babynames/state/namesbystate.zip"
local_filename = "babynamesbystate.zip"
if not os.path.exists(local_filename): # if the data exists don't download again
    with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
        f.write(resp.read())

zf = zipfile.ZipFile(local_filename, 'r')

ca_name = 'CA.TXT'
field_names = ['State', 'Sex', 'Year', 'Name', 'Count']
with zf.open(ca_name) as fh:
    babynames = pd.read_csv(fh, header=None, names=field_names)

babynames.sample(5)
Out[82]:
State Sex Year Name Count
302729 CA M 1988 Deshaun 13
296571 CA M 1985 Anderson 11
269009 CA M 1967 Jayson 15
354618 CA M 2008 Rayhan 10
42123 CA F 1959 Teressa 5

Goal 1: Find the most popular baby name in California in 2018

In [83]:
babynames[babynames["Year"] == 2018].sort_values(by = "Count", ascending = False).head(5)
Out[83]:
State Sex Year Name Count
221154 CA F 2018 Emma 2741
382082 CA M 2018 Noah 2567
221155 CA F 2018 Mia 2497
221156 CA F 2018 Olivia 2464
382083 CA M 2018 Liam 2411
In [84]:
most_common_count = babynames[babynames["Year"] == 2018].sort_values(by = "Count", ascending = False).iloc[0]["Count"]
most_common_count
Out[84]:
2741

Goal 2: Baby names that start with j. Hard to do with today's tools.

Goal 3: Name whose popularity has changed the most. Also tough.

These goals are hard with our tools so far. Will discuss next ime.