In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

plt.style.use('fivethirtyeight')
sns.set_context("notebook")

Reading in DataFrames from Files

Pandas has a number of very useful file reading tools. You can see them enumerated by typing "pd.re" and pressing tab. We'll be using read_csv today.

In [2]:
elections = pd.read_csv("elections.csv")
elections # if we end a cell with an expression or variable name, the result will print
Out[2]:
Candidate Party % Year Result
0 Reagan Republican 50.7 1980 win
1 Carter Democratic 41.0 1980 loss
2 Anderson Independent 6.6 1980 loss
3 Reagan Republican 58.8 1984 win
4 Mondale Democratic 37.6 1984 loss
5 Bush Republican 53.4 1988 win
6 Dukakis Democratic 45.6 1988 loss
7 Clinton Democratic 43.0 1992 win
8 Bush Republican 37.4 1992 loss
9 Perot Independent 18.9 1992 loss
10 Clinton Democratic 49.2 1996 win
11 Dole Republican 40.7 1996 loss
12 Perot Independent 8.4 1996 loss
13 Gore Democratic 48.4 2000 loss
14 Bush Republican 47.9 2000 win
15 Kerry Democratic 48.3 2004 loss
16 Bush Republican 50.7 2004 win
17 Obama Democratic 52.9 2008 win
18 McCain Republican 45.7 2008 loss
19 Obama Democratic 51.1 2012 win
20 Romney Republican 47.2 2012 loss
21 Clinton Democratic 48.2 2016 loss
22 Trump Republican 46.1 2016 win

We can use the head command to return only a few rows of a dataframe.

In [3]:
elections.head(7)
Out[3]:
Candidate Party % Year Result
0 Reagan Republican 50.7 1980 win
1 Carter Democratic 41.0 1980 loss
2 Anderson Independent 6.6 1980 loss
3 Reagan Republican 58.8 1984 win
4 Mondale Democratic 37.6 1984 loss
5 Bush Republican 53.4 1988 win
6 Dukakis Democratic 45.6 1988 loss

There is also a tail command.

In [4]:
elections.tail(7)
Out[4]:
Candidate Party % Year Result
16 Bush Republican 50.7 2004 win
17 Obama Democratic 52.9 2008 win
18 McCain Republican 45.7 2008 loss
19 Obama Democratic 51.1 2012 win
20 Romney Republican 47.2 2012 loss
21 Clinton Democratic 48.2 2016 loss
22 Trump Republican 46.1 2016 win

The read_csv command lets us specify a column to use an index. For example, we could have used Year as the index.

In [5]:
elections_year_index = pd.read_csv("elections.csv", index_col = "Year")
elections_year_index.head(5)
Out[5]:
Candidate Party % Result
Year
1980 Reagan Republican 50.7 win
1980 Carter Democratic 41.0 loss
1980 Anderson Independent 6.6 loss
1984 Reagan Republican 58.8 win
1984 Mondale Democratic 37.6 loss

Alternately, we could have used the set_index commmand.

In [6]:
elections_party_index = elections.set_index("Party")
elections_party_index.head(5)
Out[6]:
Candidate % Year Result
Party
Republican Reagan 50.7 1980 win
Democratic Carter 41.0 1980 loss
Independent Anderson 6.6 1980 loss
Republican Reagan 58.8 1984 win
Democratic Mondale 37.6 1984 loss

The set_index command (along with all other data frame methods) does not modify the dataframe. That is, the original "elections" is untouched. Note: There is a flag called "inplace" which does modify the calling dataframe.

In [7]:
elections.head() #the index remains unchanged
Out[7]:
Candidate Party % Year Result
0 Reagan Republican 50.7 1980 win
1 Carter Democratic 41.0 1980 loss
2 Anderson Independent 6.6 1980 loss
3 Reagan Republican 58.8 1984 win
4 Mondale Democratic 37.6 1984 loss

By contrast, column names are ideally unique. For example, if we try to read in a file for which column names are not unique, Pandas will automatically rename any duplicates.

In [8]:
dups = pd.read_csv("duplicate_columns.csv")
dups
Out[8]:
name name.1 flavor
0 john smith vanilla
1 zhang shan chocolate
2 fulan alfulani NaN
3 hong gildong banana

The [] Operator

The DataFrame class has an indexing operator [] that lets you do a variety of different things. If your provide a String to the [] operator, you get back a Series corresponding to the requested label.

In [9]:
elections_year_index.head(6)
Out[9]:
Candidate Party % Result
Year
1980 Reagan Republican 50.7 win
1980 Carter Democratic 41.0 loss
1980 Anderson Independent 6.6 loss
1984 Reagan Republican 58.8 win
1984 Mondale Democratic 37.6 loss
1988 Bush Republican 53.4 win
In [10]:
elections_year_index["Candidate"].head(6)
Out[10]:
Year
1980      Reagan
1980      Carter
1980    Anderson
1984      Reagan
1984     Mondale
1988        Bush
Name: Candidate, dtype: object

The [] operator also accepts a list of strings. In this case, you get back a DataFrame corresponding to the requested strings.

In [11]:
elections_year_index[["Candidate", "Party"]].head(6)
Out[11]:
Candidate Party
Year
1980 Reagan Republican
1980 Carter Democratic
1980 Anderson Independent
1984 Reagan Republican
1984 Mondale Democratic
1988 Bush Republican

A list of one label also returns a DataFrame. This can be handy if you want your results as a DataFrame, not a series.

In [12]:
elections_year_index[["Candidate"]].head(6)
Out[12]:
Candidate
Year
1980 Reagan
1980 Carter
1980 Anderson
1984 Reagan
1984 Mondale
1988 Bush

Note that we can also use the to_frame method to turn a Series into a DataFrame.

In [13]:
elections_year_index["Candidate"].to_frame().head(5)
Out[13]:
Candidate
Year
1980 Reagan
1980 Carter
1980 Anderson
1984 Reagan
1984 Mondale

The [] operator also accepts numerical slices as arguments. In this case, we are indexing by row, not column!

In [14]:
elections_year_index[0:3]
Out[14]:
Candidate Party % Result
Year
1980 Reagan Republican 50.7 win
1980 Carter Democratic 41.0 loss
1980 Anderson Independent 6.6 loss

If you provide a single argument to the [] operator, it tries to use it as a name. This is true even if the argument passed to [] is an integer.

In [15]:
#elections_year_index[0] #this does not work, try uncommenting this to see it fail in action, woo

The following cells allow you to test your understanding.

In [16]:
weird = pd.DataFrame({
    1:["topdog","botdog"], 
    "1":["topcat","botcat"]
})
weird
Out[16]:
1 1
0 topdog topcat
1 botdog botcat
In [17]:
weird[1] #try to predict the output
Out[17]:
0    topdog
1    botdog
Name: 1, dtype: object
In [18]:
weird["1"] #try to predict the output
Out[18]:
0    topcat
1    botcat
Name: 1, dtype: object
In [19]:
weird[1:] #try to predict the output
Out[19]:
1 1
1 botdog botcat

Boolean Array Selection

The [] operator also supports array of booleans as an input. In this case, the array must be exactly as long as the number of rows. The result is a filtered version of the data frame, where only rows corresponding to True appear.

In [20]:
elections_year_index[[False, False, False, False, False, 
          False, False, True, False, False,
          True, False, False, False, True,
          False, False, False, False, False,
          False, False, True]]
Out[20]:
Candidate Party % Result
Year
1992 Clinton Democratic 43.0 win
1996 Clinton Democratic 49.2 win
2000 Bush Republican 47.9 win
2016 Trump Republican 46.1 win

One very common task in Data Science is filtering. Boolean Array Selection is one way to achieve this in Pandas. We start by observing logical operators like the equality operator can be applied to Pandas Series data to generate a Boolean Array. For example, we can compare the 'Result' column to the String 'win':

In [21]:
elections_year_index.head(5)
Out[21]:
Candidate Party % Result
Year
1980 Reagan Republican 50.7 win
1980 Carter Democratic 41.0 loss
1980 Anderson Independent 6.6 loss
1984 Reagan Republican 58.8 win
1984 Mondale Democratic 37.6 loss
In [22]:
iswin = elections_year_index['Result'] == 'win'
iswin#.head(5)
Out[22]:
Year
1980     True
1980    False
1980    False
1984     True
1984    False
1988     True
1988    False
1992     True
1992    False
1992    False
1996     True
1996    False
1996    False
2000    False
2000     True
2004    False
2004     True
2008     True
2008    False
2012     True
2012    False
2016    False
2016     True
Name: Result, dtype: bool

The output of the logical operator applied to the Series is another Series with the same name and index, but of datatype boolean. The entry at row #i represents the result of the application of that operator to the entry of the original Series at row #i.

Such a boolean Series can be used as an argument to the [] operator. For example, the following code creates a DataFrame of all election winners since 1980.

In [23]:
elections_year_index[iswin]
Out[23]:
Candidate Party % Result
Year
1980 Reagan Republican 50.7 win
1984 Reagan Republican 58.8 win
1988 Bush Republican 53.4 win
1992 Clinton Democratic 43.0 win
1996 Clinton Democratic 49.2 win
2000 Bush Republican 47.9 win
2004 Bush Republican 50.7 win
2008 Obama Democratic 52.9 win
2012 Obama Democratic 51.1 win
2016 Trump Republican 46.1 win

Above, we've assigned the result of the logical operator to a new variable called iswin. This is uncommon. Usually, the series is created and used on the same line. Such code is a little tricky to read at first, but you'll get used to it quickly.

In [24]:
elections_year_index[elections_year_index['Result'] == 'win']
Out[24]:
Candidate Party % Result
Year
1980 Reagan Republican 50.7 win
1984 Reagan Republican 58.8 win
1988 Bush Republican 53.4 win
1992 Clinton Democratic 43.0 win
1996 Clinton Democratic 49.2 win
2000 Bush Republican 47.9 win
2004 Bush Republican 50.7 win
2008 Obama Democratic 52.9 win
2012 Obama Democratic 51.1 win
2016 Trump Republican 46.1 win

We can select multiple criteria by creating multiple boolean Series and combining them using the & operator.

In [25]:
win50plus = (elections_year_index['Result'] == 'win') & (elections_year_index['%'] < 50)
In [26]:
win50plus.head(5)
Out[26]:
Year
1980    False
1980    False
1980    False
1984    False
1984    False
dtype: bool
In [27]:
elections_year_index[(elections_year_index['Result'] == 'win')
          & (elections_year_index['%'] < 50)]

# Note for Python experts: The reason we use the & symbol and not the word "and" is because the Python __and__ 
# method overrides the "&" operator, not the "and" operator.
Out[27]:
Candidate Party % Result
Year
1992 Clinton Democratic 43.0 win
1996 Clinton Democratic 49.2 win
2000 Bush Republican 47.9 win
2016 Trump Republican 46.1 win

The | operator is the symbol for or.

In [28]:
elections_year_index[(elections_year_index['Party'] == 'Republican')
          | (elections_year_index['Party'] == "Democratic")]
Out[28]:
Candidate Party % Result
Year
1980 Reagan Republican 50.7 win
1980 Carter Democratic 41.0 loss
1984 Reagan Republican 58.8 win
1984 Mondale Democratic 37.6 loss
1988 Bush Republican 53.4 win
1988 Dukakis Democratic 45.6 loss
1992 Clinton Democratic 43.0 win
1992 Bush Republican 37.4 loss
1996 Clinton Democratic 49.2 win
1996 Dole Republican 40.7 loss
2000 Gore Democratic 48.4 loss
2000 Bush Republican 47.9 win
2004 Kerry Democratic 48.3 loss
2004 Bush Republican 50.7 win
2008 Obama Democratic 52.9 win
2008 McCain Republican 45.7 loss
2012 Obama Democratic 51.1 win
2012 Romney Republican 47.2 loss
2016 Clinton Democratic 48.2 loss
2016 Trump Republican 46.1 win

If we have multiple conditions (say Republican or Democratic), we can use the isin operator to simplify our code.

In [29]:
elections_year_index['Party'].isin(["Republican", "Democratic"])
Out[29]:
Year
1980     True
1980     True
1980    False
1984     True
1984     True
1988     True
1988     True
1992     True
1992     True
1992    False
1996     True
1996     True
1996    False
2000     True
2000     True
2004     True
2004     True
2008     True
2008     True
2012     True
2012     True
2016     True
2016     True
Name: Party, dtype: bool
In [30]:
elections_year_index[elections_year_index['Party'].isin(["Republican", "Democratic"])]
Out[30]:
Candidate Party % Result
Year
1980 Reagan Republican 50.7 win
1980 Carter Democratic 41.0 loss
1984 Reagan Republican 58.8 win
1984 Mondale Democratic 37.6 loss
1988 Bush Republican 53.4 win
1988 Dukakis Democratic 45.6 loss
1992 Clinton Democratic 43.0 win
1992 Bush Republican 37.4 loss
1996 Clinton Democratic 49.2 win
1996 Dole Republican 40.7 loss
2000 Gore Democratic 48.4 loss
2000 Bush Republican 47.9 win
2004 Kerry Democratic 48.3 loss
2004 Bush Republican 50.7 win
2008 Obama Democratic 52.9 win
2008 McCain Republican 45.7 loss
2012 Obama Democratic 51.1 win
2012 Romney Republican 47.2 loss
2016 Clinton Democratic 48.2 loss
2016 Trump Republican 46.1 win

An alternate simpler way to get back a specific set of rows is to use the query command.

In [31]:
elections_year_index.query("Result == 'win' and Year < 2000")
Out[31]:
Candidate Party % Result
Year
1980 Reagan Republican 50.7 win
1984 Reagan Republican 58.8 win
1988 Bush Republican 53.4 win
1992 Clinton Democratic 43.0 win
1996 Clinton Democratic 49.2 win

loc and iloc

In [32]:
elections.head(5)
Out[32]:
Candidate Party % Year Result
0 Reagan Republican 50.7 1980 win
1 Carter Democratic 41.0 1980 loss
2 Anderson Independent 6.6 1980 loss
3 Reagan Republican 58.8 1984 win
4 Mondale Democratic 37.6 1984 loss
In [33]:
elections.loc[[0, 1, 2, 3, 4], ['Candidate','Party', 'Year']]
Out[33]:
Candidate Party Year
0 Reagan Republican 1980
1 Carter Democratic 1980
2 Anderson Independent 1980
3 Reagan Republican 1984
4 Mondale Democratic 1984

Note: The loc command won't work with numeric arguments if we're using the elections DataFrame that was indexed by year.

In [34]:
elections_year_index.head(5)
Out[34]:
Candidate Party % Result
Year
1980 Reagan Republican 50.7 win
1980 Carter Democratic 41.0 loss
1980 Anderson Independent 6.6 loss
1984 Reagan Republican 58.8 win
1984 Mondale Democratic 37.6 loss
In [35]:
#causes error
#elections_year_index.loc[[0, 1, 2, 3, 4], ['Candidate','Party']]#
In [36]:
elections_year_index.loc[[1980, 1984], ['Candidate','Party']]
Out[36]:
Candidate Party
Year
1980 Reagan Republican
1980 Carter Democratic
1980 Anderson Independent
1984 Reagan Republican
1984 Mondale Democratic

Loc also supports slicing (for all types, including numeric and string labels!). Note that the slicing for loc is inclusive, even for numeric slices.

In [37]:
elections.loc[0:4, 'Candidate':'Year']
Out[37]:
Candidate Party % Year
0 Reagan Republican 50.7 1980
1 Carter Democratic 41.0 1980
2 Anderson Independent 6.6 1980
3 Reagan Republican 58.8 1984
4 Mondale Democratic 37.6 1984
In [38]:
elections_year_index.loc[1980:1984, 'Candidate':'Party']
Out[38]:
Candidate Party
Year
1980 Reagan Republican
1980 Carter Democratic
1980 Anderson Independent
1984 Reagan Republican
1984 Mondale Democratic

If we provide only a single label for the column argument, we get back a Series.

In [39]:
elections.loc[0:4, 'Candidate']
Out[39]:
0      Reagan
1      Carter
2    Anderson
3      Reagan
4     Mondale
Name: Candidate, dtype: object

If we want a data frame instead and don't want to use to_frame, we can provde a list containing the column name.

In [40]:
elections.loc[0:4, ['Candidate']]
Out[40]:
Candidate
0 Reagan
1 Carter
2 Anderson
3 Reagan
4 Mondale

If we give only one row but many column labels, we'll get back a Series corresponding to a row of the table. This new Series has a neat index, where each entry is the name of the column that the data came from.

In [41]:
elections.head(1)
Out[41]:
Candidate Party % Year Result
0 Reagan Republican 50.7 1980 win
In [42]:
elections.loc[0, 'Candidate':'Year']
Out[42]:
Candidate        Reagan
Party        Republican
%                  50.7
Year               1980
Name: 0, dtype: object
In [43]:
elections.loc[[0], 'Candidate':'Year']
Out[43]:
Candidate Party % Year
0 Reagan Republican 50.7 1980

If we omit the column argument altogether, the default behavior is to retrieve all columns.

In [44]:
elections.loc[[2, 4, 5]]
Out[44]:
Candidate Party % Year Result
2 Anderson Independent 6.6 1980 loss
4 Mondale Democratic 37.6 1984 loss
5 Bush Republican 53.4 1988 win

Loc also supports boolean array inputs instead of labels. If the arrays are too short, loc assumes the missing values are False. Note: For newer versions of Pandas, loc will not assume that the missing values are False and instead throw an error.

In [45]:
elections.loc[[True, False, False, True], [True, False, False, True]]
Out[45]:
Candidate Year
0 Reagan 1980
3 Reagan 1984
In [46]:
elections.loc[[0, 3], ['Candidate', 'Year']]
Out[46]:
Candidate Year
0 Reagan 1980
3 Reagan 1984

We can use boolean array arguments for one axis of the data, and labels for the other.

In [47]:
elections.loc[[True, False, False, True], 'Candidate':'%']
Out[47]:
Candidate Party %
0 Reagan Republican 50.7
3 Reagan Republican 58.8

A student asks what happens if you give scalar arguments for the requested rows AND columns. The answer is that you get back just a single value.

In [48]:
elections.loc[0, 'Candidate']
Out[48]:
'Reagan'

iloc

loc's cousin iloc is very similar, but is used to access based on numerical position instead of label. For example, to access to the top 3 rows and top 3 columns of a table, we can use [0:3, 0:3]. iloc slicing is exclusive, just like standard Python slicing of numerical values.

In [49]:
elections.head(5)
Out[49]:
Candidate Party % Year Result
0 Reagan Republican 50.7 1980 win
1 Carter Democratic 41.0 1980 loss
2 Anderson Independent 6.6 1980 loss
3 Reagan Republican 58.8 1984 win
4 Mondale Democratic 37.6 1984 loss
In [50]:
elections.iloc[0:3, 0:3]
Out[50]:
Candidate Party %
0 Reagan Republican 50.7
1 Carter Democratic 41.0
2 Anderson Independent 6.6
In [51]:
mottos = pd.read_csv('mottos.csv')
In [52]:
mottos.iloc[0:3, 0:3]
Out[52]:
State Motto Translation
0 Alabama Audemus jura nostra defendere We dare defend our rights!
1 Alaska North to the future
2 Arizona Ditat Deus God enriches

We will use both loc and iloc in the course. Loc is generally preferred for a number of reasons, for example:

  1. It is harder to make mistakes since you have to literally write out what you want to get.
  2. Code is easier to read, because the reader doesn't have to know e.g. what column #31 represents.
  3. It is robust against permutations of the data, e.g. the social security administration switches the order of two columns.

However, iloc is sometimes more convenient. We'll provide examples of when iloc is the superior choice.

Quick Challenge

Which of the following expressions return DataFrame of the first 3 Candidate and Party names for candidates that won with more than 50% of the vote.

In [53]:
elections.iloc[[0, 3, 5], [0, 3]]
Out[53]:
Candidate Year
0 Reagan 1980
3 Reagan 1984
5 Bush 1988
In [54]:
elections.loc[[0, 3, 5], "Candidate":"Year"]
Out[54]:
Candidate Party % Year
0 Reagan Republican 50.7 1980
3 Reagan Republican 58.8 1984
5 Bush Republican 53.4 1988
In [55]:
elections.loc[elections["%"] > 50, ["Candidate", "Year"]].head(3)
Out[55]:
Candidate Year
0 Reagan 1980
3 Reagan 1984
5 Bush 1988
In [56]:
elections.loc[elections["%"] > 50, ["Candidate", "Year"]].iloc[0:2, :]
Out[56]:
Candidate Year
0 Reagan 1980
3 Reagan 1984

Sampling

Pandas dataframes also make it easy to get a sample. We simply use the sample method and provide the number of samples that we'd like as the arugment. Sampling is done without replacement by default. Set replace=True if you want replacement.

In [57]:
elections.sample(10)
Out[57]:
Candidate Party % Year Result
17 Obama Democratic 52.9 2008 win
11 Dole Republican 40.7 1996 loss
13 Gore Democratic 48.4 2000 loss
14 Bush Republican 47.9 2000 win
15 Kerry Democratic 48.3 2004 loss
12 Perot Independent 8.4 1996 loss
18 McCain Republican 45.7 2008 loss
2 Anderson Independent 6.6 1980 loss
8 Bush Republican 37.4 1992 loss
3 Reagan Republican 58.8 1984 win
In [58]:
elections.query("Year < 1992").sample(50, replace=True)
Out[58]:
Candidate Party % Year Result
4 Mondale Democratic 37.6 1984 loss
6 Dukakis Democratic 45.6 1988 loss
3 Reagan Republican 58.8 1984 win
0 Reagan Republican 50.7 1980 win
3 Reagan Republican 58.8 1984 win
2 Anderson Independent 6.6 1980 loss
6 Dukakis Democratic 45.6 1988 loss
2 Anderson Independent 6.6 1980 loss
0 Reagan Republican 50.7 1980 win
3 Reagan Republican 58.8 1984 win
6 Dukakis Democratic 45.6 1988 loss
6 Dukakis Democratic 45.6 1988 loss
5 Bush Republican 53.4 1988 win
6 Dukakis Democratic 45.6 1988 loss
6 Dukakis Democratic 45.6 1988 loss
5 Bush Republican 53.4 1988 win
6 Dukakis Democratic 45.6 1988 loss
5 Bush Republican 53.4 1988 win
5 Bush Republican 53.4 1988 win
1 Carter Democratic 41.0 1980 loss
2 Anderson Independent 6.6 1980 loss
6 Dukakis Democratic 45.6 1988 loss
5 Bush Republican 53.4 1988 win
5 Bush Republican 53.4 1988 win
4 Mondale Democratic 37.6 1984 loss
1 Carter Democratic 41.0 1980 loss
4 Mondale Democratic 37.6 1984 loss
4 Mondale Democratic 37.6 1984 loss
3 Reagan Republican 58.8 1984 win
3 Reagan Republican 58.8 1984 win
4 Mondale Democratic 37.6 1984 loss
6 Dukakis Democratic 45.6 1988 loss
2 Anderson Independent 6.6 1980 loss
0 Reagan Republican 50.7 1980 win
4 Mondale Democratic 37.6 1984 loss
3 Reagan Republican 58.8 1984 win
3 Reagan Republican 58.8 1984 win
5 Bush Republican 53.4 1988 win
4 Mondale Democratic 37.6 1984 loss
2 Anderson Independent 6.6 1980 loss
0 Reagan Republican 50.7 1980 win
3 Reagan Republican 58.8 1984 win
0 Reagan Republican 50.7 1980 win
4 Mondale Democratic 37.6 1984 loss
3 Reagan Republican 58.8 1984 win
5 Bush Republican 53.4 1988 win
5 Bush Republican 53.4 1988 win
3 Reagan Republican 58.8 1984 win
6 Dukakis Democratic 45.6 1988 loss
3 Reagan Republican 58.8 1984 win

Handy Properties and Utility Functions for Series and DataFrames

Python Operations on Numerical DataFrames and Series

Consider a series of only the vote percentages of election winners.

In [59]:
winners = elections.query("Result == 'win'")["%"]
winners
Out[59]:
0     50.7
3     58.8
5     53.4
7     43.0
10    49.2
14    47.9
16    50.7
17    52.9
19    51.1
22    46.1
Name: %, dtype: float64

We can perform various Python operations (including numpy operations) to DataFrames and Series.

In [60]:
max(winners)
Out[60]:
58.8
In [61]:
np.mean(winners)
Out[61]:
50.38000000000001

We can also apply mathematical operations to a DataFrame so long as it has only numerical data.

In [62]:
(elections[["%", "Year"]] + 3).head(5)
Out[62]:
% Year
0 53.7 1983
1 44.0 1983
2 9.6 1983
3 61.8 1987
4 40.6 1987

Handy Utility Methods

The head, shape, size, and describe methods can be used to quickly get a good sense of the data we're working with. For example:

In [63]:
mottos.head(20)
Out[63]:
State Motto Translation Language Date Adopted
0 Alabama Audemus jura nostra defendere We dare defend our rights! Latin 1923
1 Alaska North to the future English 1967
2 Arizona Ditat Deus God enriches Latin 1863
3 Arkansas Regnat populus The people rule Latin 1907
4 California Eureka (Εὕρηκα) I have found it Greek 1849
5 Colorado Nil sine numine Nothing without providence. Latin November 6, 1861
6 Connecticut Qui transtulit sustinet He who transplanted sustains Latin October 9, 1662
7 Delaware Liberty and Independence English 1847
8 Florida In God We Trust English 1868
9 Georgia Wisdom, Justice, Moderation English 1798
10 Hawaii Ua mau ke ea o ka ʻāina i ka pono The life of the land is perpetuated in righteo... Hawaiian July 31, 1843
11 Idaho Esto perpetua Let it be perpetual Latin 1890
12 Illinois State sovereignty, national union English 1819
13 Indiana The Crossroads of America English 1937
14 Iowa Our liberties we prize and our rights we will ... English 1847
15 Kansas Ad astra per aspera To the stars through difficulties Latin 1861
16 Kentucky Deo gratiam habeamus Let us be grateful to God Latin 2002
17 Louisiana Union, justice, confidence English 1902
18 Maine Dirigo I lead Latin 1820
19 Maryland Fatti maschii, parole femine Manly deeds, womanly words Italian 1874
In [64]:
mottos.size
Out[64]:
250

The fact that the size is 200 means our data file is relatively small, with only 200 total entries.

In [65]:
mottos.shape
Out[65]:
(50, 5)

Since we're looking at data for states, and we see the number 50, it looks like we've mostly likely got a complete dataset that omits Washington D.C. and U.S. territories like Guam and Puerto Rico.

In [66]:
mottos.describe()
Out[66]:
State Motto Translation Language Date Adopted
count 50 50 49 50 50
unique 50 50 30 8 47
top Kansas Equality before the law Latin 1893
freq 1 1 20 23 2

Above, we see a quick summary of all the data. For example, the most common language for mottos is Latin, which covers 23 different states. Does anything else seem surprising?

We can get a direct reference to the index using .index.

In [67]:
mottos.index
Out[67]:
RangeIndex(start=0, stop=50, step=1)

We can also access individual properties of the index, for example, mottos.index.name.

In [68]:
mottos.index.name

This reflects the fact that in our data frame, the index IS the state name!

In [69]:
mottos.head(2)
Out[69]:
State Motto Translation Language Date Adopted
0 Alabama Audemus jura nostra defendere We dare defend our rights! Latin 1923
1 Alaska North to the future English 1967

It turns out the columns also have an Index. We can access this index by using .columns.

In [70]:
mottos.columns
Out[70]:
Index(['State', 'Motto', 'Translation', 'Language', 'Date Adopted'], dtype='object')

There are also a ton of useful utility methods we can use with Data Frames and Series. For example, we can create a copy of a data frame sorted by a specific column using sort_values.

In [71]:
elections.sort_values('%')
Out[71]:
Candidate Party % Year Result
2 Anderson Independent 6.6 1980 loss
12 Perot Independent 8.4 1996 loss
9 Perot Independent 18.9 1992 loss
8 Bush Republican 37.4 1992 loss
4 Mondale Democratic 37.6 1984 loss
11 Dole Republican 40.7 1996 loss
1 Carter Democratic 41.0 1980 loss
7 Clinton Democratic 43.0 1992 win
6 Dukakis Democratic 45.6 1988 loss
18 McCain Republican 45.7 2008 loss
22 Trump Republican 46.1 2016 win
20 Romney Republican 47.2 2012 loss
14 Bush Republican 47.9 2000 win
21 Clinton Democratic 48.2 2016 loss
15 Kerry Democratic 48.3 2004 loss
13 Gore Democratic 48.4 2000 loss
10 Clinton Democratic 49.2 1996 win
16 Bush Republican 50.7 2004 win
0 Reagan Republican 50.7 1980 win
19 Obama Democratic 51.1 2012 win
17 Obama Democratic 52.9 2008 win
5 Bush Republican 53.4 1988 win
3 Reagan Republican 58.8 1984 win

As mentioned before, all Data Frame methods return a copy and do not modify the original data structure, unless you set inplace to True.

In [72]:
elections.head(5)
Out[72]:
Candidate Party % Year Result
0 Reagan Republican 50.7 1980 win
1 Carter Democratic 41.0 1980 loss
2 Anderson Independent 6.6 1980 loss
3 Reagan Republican 58.8 1984 win
4 Mondale Democratic 37.6 1984 loss

If we want to sort in reverse order, we can set ascending=False.

In [73]:
elections.sort_values('%', ascending=False)
Out[73]:
Candidate Party % Year Result
3 Reagan Republican 58.8 1984 win
5 Bush Republican 53.4 1988 win
17 Obama Democratic 52.9 2008 win
19 Obama Democratic 51.1 2012 win
0 Reagan Republican 50.7 1980 win
16 Bush Republican 50.7 2004 win
10 Clinton Democratic 49.2 1996 win
13 Gore Democratic 48.4 2000 loss
15 Kerry Democratic 48.3 2004 loss
21 Clinton Democratic 48.2 2016 loss
14 Bush Republican 47.9 2000 win
20 Romney Republican 47.2 2012 loss
22 Trump Republican 46.1 2016 win
18 McCain Republican 45.7 2008 loss
6 Dukakis Democratic 45.6 1988 loss
7 Clinton Democratic 43.0 1992 win
1 Carter Democratic 41.0 1980 loss
11 Dole Republican 40.7 1996 loss
4 Mondale Democratic 37.6 1984 loss
8 Bush Republican 37.4 1992 loss
9 Perot Independent 18.9 1992 loss
12 Perot Independent 8.4 1996 loss
2 Anderson Independent 6.6 1980 loss

We can also use sort_values on Series objects.

In [74]:
mottos['Language'].sort_values().head(10)
Out[74]:
46    Chinook Jargon
49           English
29           English
28           English
27           English
26           English
48           English
37           English
38           English
40           English
Name: Language, dtype: object
In [75]:
mottos.query("State == 'Washington'")
Out[75]:
State Motto Translation Language Date Adopted
46 Washington Al-ki or Alki By and by Chinook Jargon

For Series, the value_counts method is often quite handy.

In [76]:
elections['Party'].value_counts()
Out[76]:
Democratic     10
Republican     10
Independent     3
Name: Party, dtype: int64
In [77]:
mottos['Language'].value_counts()
Out[77]:
Latin             23
English           21
Greek              1
Spanish            1
Italian            1
Chinook Jargon     1
French             1
Hawaiian           1
Name: Language, dtype: int64

Also commonly used is the unique method, which returns all unique values as a numpy array.

In [78]:
mottos['Language'].unique()
Out[78]:
array(['Latin', 'English', 'Greek', 'Hawaiian', 'Italian', 'French',
       'Spanish', 'Chinook Jargon'], dtype=object)

Baby Names Data

Now let's play around a bit with the large baby names dataset we saw in lecture 1. We'll start by loading that dataset from the social security administration's website.

To keep the data small enough to avoid crashing datahub, we're going to look at only California rather than looking at the national dataset.

In [79]:
import urllib.request
import os.path
import zipfile

data_url = "https://www.ssa.gov/oact/babynames/state/namesbystate.zip"
local_filename = "babynamesbystate.zip"
if not os.path.exists(local_filename): # if the data exists don't download again
    with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
        f.write(resp.read())

zf = zipfile.ZipFile(local_filename, 'r')

ca_name = 'CA.TXT'
field_names = ['State', 'Sex', 'Year', 'Name', 'Count']
with zf.open(ca_name) as fh:
    babynames = pd.read_csv(fh, header=None, names=field_names)

babynames.sample(5)
Out[79]:
State Sex Year Name Count
298627 CA M 1988 Cyrus 30
233607 CA M 1929 Jaime 8
371572 CA M 2015 Kaan 8
164215 CA F 2003 Patty 5
48265 CA F 1963 Lily 22

Goal 1: Find the most popular baby name in California in 2018

In [80]:
babynames[babynames["Year"] == 2018].sort_values(by = "Count", ascending = False).head(5)
Out[80]:
State Sex Year Name Count
221131 CA F 2018 Emma 2722
378377 CA M 2018 Noah 2555
221132 CA F 2018 Mia 2484
221133 CA F 2018 Olivia 2456
378378 CA M 2018 Liam 2405

Goal 2: Baby names that start with j. Hard to do with today's tools.

Goal 3: Name whose popularity has changed the most. Also tough.

These goals are hard with our tools so far. Will discuss next ime.