%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

plt.style.use('fivethirtyeight')
sns.set_context("notebook")

Reading in DataFrames from Files¶

Pandas has a number of very useful file reading tools. You can see them enumerated by typing "pd.re" and pressing tab. We'll be using read_csv today.

elections = pd.read_csv("elections.csv")
elections # if we end a cell with an expression or variable name, the result will print

We can use the head command to return only a few rows of a dataframe.

elections.head(7)

There is also a tail command.

elections.tail(7)

The read_csv command lets us specify a column to use an index. For example, we could have used Year as the index.

elections_year_index = pd.read_csv("elections.csv", index_col = "Year")
elections_year_index.head(5)

Alternately, we could have used the set_index commmand.

elections_party_index = elections.set_index("Party")
elections_party_index.head(5)

The set_index command (along with all other data frame methods) does not modify the dataframe. That is, the original "elections" is untouched. Note: There is a flag called "inplace" which does modify the calling dataframe.

elections.head() #the index remains unchanged

By contrast, column names are ideally unique. For example, if we try to read in a file for which column names are not unique, Pandas will automatically rename any duplicates.

dups = pd.read_csv("duplicate_columns.csv")
dups

The [] Operator¶

The DataFrame class has an indexing operator [] that lets you do a variety of different things. If your provide a String to the [] operator, you get back a Series corresponding to the requested label.

elections_year_index.head(6)

elections_year_index["Candidate"].head(6)

Year
1980      Reagan
1980      Carter
1980    Anderson
1984      Reagan
1984     Mondale
1988        Bush
Name: Candidate, dtype: object

The [] operator also accepts a list of strings. In this case, you get back a DataFrame corresponding to the requested strings.

elections_year_index[["Candidate", "Party"]].head(6)

A list of one label also returns a DataFrame. This can be handy if you want your results as a DataFrame, not a series.

elections_year_index[["Candidate"]].head(6)

Note that we can also use the to_frame method to turn a Series into a DataFrame.

elections_year_index["Candidate"].to_frame().head(5)

The [] operator also accepts numerical slices as arguments. In this case, we are indexing by row, not column!

elections_year_index[0:3]

If you provide a single argument to the [] operator, it tries to use it as a name. This is true even if the argument passed to [] is an integer.

#elections_year_index[0] #this does not work, try uncommenting this to see it fail in action, woo

The following cells allow you to test your understanding.

weird = pd.DataFrame({
    1:["topdog","botdog"], 
    "1":["topcat","botcat"]
})
weird

weird[1] #try to predict the output

0    topdog
1    botdog
Name: 1, dtype: object

weird["1"] #try to predict the output

0    topcat
1    botcat
Name: 1, dtype: object

weird[1:] #try to predict the output

Boolean Array Selection¶

The [] operator also supports array of booleans as an input. In this case, the array must be exactly as long as the number of rows. The result is a filtered version of the data frame, where only rows corresponding to True appear.

elections_year_index[[False, False, False, False, False, 
          False, False, True, False, False,
          True, False, False, False, True,
          False, False, False, False, False,
          False, False, True]]

One very common task in Data Science is filtering. Boolean Array Selection is one way to achieve this in Pandas. We start by observing logical operators like the equality operator can be applied to Pandas Series data to generate a Boolean Array. For example, we can compare the 'Result' column to the String 'win':

elections_year_index.head(5)

iswin = elections_year_index['Result'] == 'win'
iswin#.head(5)

Year
1980     True
1980    False
1980    False
1984     True
1984    False
1988     True
1988    False
1992     True
1992    False
1992    False
1996     True
1996    False
1996    False
2000    False
2000     True
2004    False
2004     True
2008     True
2008    False
2012     True
2012    False
2016    False
2016     True
Name: Result, dtype: bool

The output of the logical operator applied to the Series is another Series with the same name and index, but of datatype boolean. The entry at row #i represents the result of the application of that operator to the entry of the original Series at row #i.

Such a boolean Series can be used as an argument to the [] operator. For example, the following code creates a DataFrame of all election winners since 1980.

elections_year_index[iswin]

Above, we've assigned the result of the logical operator to a new variable called iswin. This is uncommon. Usually, the series is created and used on the same line. Such code is a little tricky to read at first, but you'll get used to it quickly.

elections_year_index[elections_year_index['Result'] == 'win']

We can select multiple criteria by creating multiple boolean Series and combining them using the & operator.

win50plus = (elections_year_index['Result'] == 'win') & (elections_year_index['%'] < 50)

win50plus.head(5)

Year
1980    False
1980    False
1980    False
1984    False
1984    False
dtype: bool

elections_year_index[(elections_year_index['Result'] == 'win')
          & (elections_year_index['%'] < 50)]

# Note for Python experts: The reason we use the & symbol and not the word "and" is because the Python __and__ 
# method overrides the "&" operator, not the "and" operator.

The | operator is the symbol for or.

elections_year_index[(elections_year_index['Party'] == 'Republican')
          | (elections_year_index['Party'] == "Democratic")]

If we have multiple conditions (say Republican or Democratic), we can use the isin operator to simplify our code.

elections_year_index['Party'].isin(["Republican", "Democratic"])

Year
1980     True
1980     True
1980    False
1984     True
1984     True
1988     True
1988     True
1992     True
1992     True
1992    False
1996     True
1996     True
1996    False
2000     True
2000     True
2004     True
2004     True
2008     True
2008     True
2012     True
2012     True
2016     True
2016     True
Name: Party, dtype: bool

elections_year_index[elections_year_index['Party'].isin(["Republican", "Democratic"])]

An alternate simpler way to get back a specific set of rows is to use the query command.

elections_year_index.query("Result == 'win' and Year < 2000")

loc and iloc¶

elections.head(5)

elections.loc[[0, 1, 2, 3, 4], ['Candidate','Party', 'Year']]

Note: The loc command won't work with numeric arguments if we're using the elections DataFrame that was indexed by year.

elections_year_index.head(5)

#causes error
#elections_year_index.loc[[0, 1, 2, 3, 4], ['Candidate','Party']]#

elections_year_index.loc[[1980, 1984], ['Candidate','Party']]

Loc also supports slicing (for all types, including numeric and string labels!). Note that the slicing for loc is inclusive, even for numeric slices.

elections.loc[0:4, 'Candidate':'Year']

elections_year_index.loc[1980:1984, 'Candidate':'Party']

If we provide only a single label for the column argument, we get back a Series.

elections.loc[0:4, 'Candidate']

0      Reagan
1      Carter
2    Anderson
3      Reagan
4     Mondale
Name: Candidate, dtype: object

If we want a data frame instead and don't want to use to_frame, we can provde a list containing the column name.

elections.loc[0:4, ['Candidate']]

If we give only one row but many column labels, we'll get back a Series corresponding to a row of the table. This new Series has a neat index, where each entry is the name of the column that the data came from.

elections.head(1)

elections.loc[0, 'Candidate':'Year']

Candidate        Reagan
Party        Republican
%                  50.7
Year               1980
Name: 0, dtype: object

elections.loc[[0], 'Candidate':'Year']

If we omit the column argument altogether, the default behavior is to retrieve all columns.

elections.loc[[2, 4, 5]]

Loc also supports boolean array inputs instead of labels. If the arrays are too short, loc assumes the missing values are False. Note: For newer versions of Pandas, loc will not assume that the missing values are False and instead throw an error.

elections.loc[[True, False, False, True], [True, False, False, True]]

elections.loc[[0, 3], ['Candidate', 'Year']]

We can use boolean array arguments for one axis of the data, and labels for the other.

elections.loc[[True, False, False, True], 'Candidate':'%']

A student asks what happens if you give scalar arguments for the requested rows AND columns. The answer is that you get back just a single value.

elections.loc[0, 'Candidate']

'Reagan'

iloc¶

loc's cousin iloc is very similar, but is used to access based on numerical position instead of label. For example, to access to the top 3 rows and top 3 columns of a table, we can use [0:3, 0:3]. iloc slicing is exclusive, just like standard Python slicing of numerical values.

elections.head(5)

elections.iloc[0:3, 0:3]

mottos = pd.read_csv('mottos.csv')

mottos.iloc[0:3, 0:3]

We will use both loc and iloc in the course. Loc is generally preferred for a number of reasons, for example:

It is harder to make mistakes since you have to literally write out what you want to get.
Code is easier to read, because the reader doesn't have to know e.g. what column #31 represents.
It is robust against permutations of the data, e.g. the social security administration switches the order of two columns.

However, iloc is sometimes more convenient. We'll provide examples of when iloc is the superior choice.

Quick Challenge¶

Which of the following expressions return DataFrame of the first 3 Candidate and Party names for candidates that won with more than 50% of the vote.

elections.iloc[[0, 3, 5], [0, 3]]

elections.loc[[0, 3, 5], "Candidate":"Year"]

elections.loc[elections["%"] > 50, ["Candidate", "Year"]].head(3)

elections.loc[elections["%"] > 50, ["Candidate", "Year"]].iloc[0:2, :]

Sampling¶

Pandas dataframes also make it easy to get a sample. We simply use the sample method and provide the number of samples that we'd like as the arugment. Sampling is done without replacement by default. Set replace=True if you want replacement.

elections.sample(10)

elections.query("Year < 1992").sample(50, replace=True)

Handy Properties and Utility Functions for Series and DataFrames¶

Python Operations on Numerical DataFrames and Series¶

Consider a series of only the vote percentages of election winners.

winners = elections.query("Result == 'win'")["%"]
winners

0     50.7
3     58.8
5     53.4
7     43.0
10    49.2
14    47.9
16    50.7
17    52.9
19    51.1
22    46.1
Name: %, dtype: float64

We can perform various Python operations (including numpy operations) to DataFrames and Series.

max(winners)

58.8

np.mean(winners)

50.38000000000001

We can also apply mathematical operations to a DataFrame so long as it has only numerical data.

(elections[["%", "Year"]] + 3).head(5)

Handy Utility Methods¶

The head, shape, size, and describe methods can be used to quickly get a good sense of the data we're working with. For example:

mottos.head(20)

mottos.size

250

The fact that the size is 200 means our data file is relatively small, with only 200 total entries.

mottos.shape

(50, 5)

Since we're looking at data for states, and we see the number 50, it looks like we've mostly likely got a complete dataset that omits Washington D.C. and U.S. territories like Guam and Puerto Rico.

mottos.describe()

Above, we see a quick summary of all the data. For example, the most common language for mottos is Latin, which covers 23 different states. Does anything else seem surprising?

We can get a direct reference to the index using .index.

mottos.index

RangeIndex(start=0, stop=50, step=1)

We can also access individual properties of the index, for example, mottos.index.name.

mottos.index.name

This reflects the fact that in our data frame, the index IS the state name!

mottos.head(2)

It turns out the columns also have an Index. We can access this index by using .columns.

mottos.columns

Index(['State', 'Motto', 'Translation', 'Language', 'Date Adopted'], dtype='object')

There are also a ton of useful utility methods we can use with Data Frames and Series. For example, we can create a copy of a data frame sorted by a specific column using sort_values.

elections.sort_values('%')

As mentioned before, all Data Frame methods return a copy and do not modify the original data structure, unless you set inplace to True.

elections.head(5)

If we want to sort in reverse order, we can set ascending=False.

elections.sort_values('%', ascending=False)

We can also use sort_values on Series objects.

mottos['Language'].sort_values().head(10)

46    Chinook Jargon
49           English
29           English
28           English
27           English
26           English
48           English
37           English
38           English
40           English
Name: Language, dtype: object

mottos.query("State == 'Washington'")

For Series, the value_counts method is often quite handy.

elections['Party'].value_counts()

Democratic     10
Republican     10
Independent     3
Name: Party, dtype: int64

mottos['Language'].value_counts()

Latin             23
English           21
Greek              1
Spanish            1
Italian            1
Chinook Jargon     1
French             1
Hawaiian           1
Name: Language, dtype: int64

Also commonly used is the unique method, which returns all unique values as a numpy array.

mottos['Language'].unique()

array(['Latin', 'English', 'Greek', 'Hawaiian', 'Italian', 'French',
       'Spanish', 'Chinook Jargon'], dtype=object)

Baby Names Data¶

Now let's play around a bit with the large baby names dataset we saw in lecture 1. We'll start by loading that dataset from the social security administration's website.

To keep the data small enough to avoid crashing datahub, we're going to look at only California rather than looking at the national dataset.

import urllib.request
import os.path
import zipfile

data_url = "https://www.ssa.gov/oact/babynames/state/namesbystate.zip"
local_filename = "babynamesbystate.zip"
if not os.path.exists(local_filename): # if the data exists don't download again
    with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
        f.write(resp.read())

zf = zipfile.ZipFile(local_filename, 'r')

ca_name = 'CA.TXT'
field_names = ['State', 'Sex', 'Year', 'Name', 'Count']
with zf.open(ca_name) as fh:
    babynames = pd.read_csv(fh, header=None, names=field_names)

babynames.sample(5)

Goal 1: Find the most popular baby name in California in 2018

babynames[babynames["Year"] == 2018].sort_values(by = "Count", ascending = False).head(5)

Goal 2: Baby names that start with j. Hard to do with today's tools.

Goal 3: Name whose popularity has changed the most. Also tough.

These goals are hard with our tools so far. Will discuss next ime.

	name	name.1	flavor
0	john	smith	vanilla
1	zhang	shan	chocolate
2	fulan	alfulani	NaN
3	hong	gildong	banana

	State	Motto	Translation
0	Alabama	Audemus jura nostra defendere	We dare defend our rights!
1	Alaska	North to the future	—
2	Arizona	Ditat Deus	God enriches

	%	Year
0	53.7	1983
1	44.0	1983
2	9.6	1983
3	61.8	1987
4	40.6	1987

	State	Motto	Translation	Language	Date Adopted
0	Alabama	Audemus jura nostra defendere	We dare defend our rights!	Latin	1923
1	Alaska	North to the future	—	English	1967
2	Arizona	Ditat Deus	God enriches	Latin	1863
3	Arkansas	Regnat populus	The people rule	Latin	1907
4	California	Eureka (Εὕρηκα)	I have found it	Greek	1849
5	Colorado	Nil sine numine	Nothing without providence.	Latin	November 6, 1861
6	Connecticut	Qui transtulit sustinet	He who transplanted sustains	Latin	October 9, 1662
7	Delaware	Liberty and Independence	—	English	1847
8	Florida	In God We Trust	—	English	1868
9	Georgia	Wisdom, Justice, Moderation	—	English	1798
10	Hawaii	Ua mau ke ea o ka ʻāina i ka pono	The life of the land is perpetuated in righteo...	Hawaiian	July 31, 1843
11	Idaho	Esto perpetua	Let it be perpetual	Latin	1890
12	Illinois	State sovereignty, national union	—	English	1819
13	Indiana	The Crossroads of America	—	English	1937
14	Iowa	Our liberties we prize and our rights we will ...	—	English	1847
15	Kansas	Ad astra per aspera	To the stars through difficulties	Latin	1861
16	Kentucky	Deo gratiam habeamus	Let us be grateful to God	Latin	2002
17	Louisiana	Union, justice, confidence	—	English	1902
18	Maine	Dirigo	I lead	Latin	1820
19	Maryland	Fatti maschii, parole femine	Manly deeds, womanly words	Italian	1874

	State	Motto	Translation	Language	Date Adopted
0	Alabama	Audemus jura nostra defendere	We dare defend our rights!	Latin	1923
1	Alaska	North to the future	—	English	1967

	Candidate	Party	%	Year	Result
0	Reagan	Republican	50.7	1980	win
1	Carter	Democratic	41.0	1980	loss
2	Anderson	Independent	6.6	1980	loss
3	Reagan	Republican	58.8	1984	win
4	Mondale	Democratic	37.6	1984	loss
5	Bush	Republican	53.4	1988	win
6	Dukakis	Democratic	45.6	1988	loss
7	Clinton	Democratic	43.0	1992	win
8	Bush	Republican	37.4	1992	loss
9	Perot	Independent	18.9	1992	loss
10	Clinton	Democratic	49.2	1996	win
11	Dole	Republican	40.7	1996	loss
12	Perot	Independent	8.4	1996	loss
13	Gore	Democratic	48.4	2000	loss
14	Bush	Republican	47.9	2000	win
15	Kerry	Democratic	48.3	2004	loss
16	Bush	Republican	50.7	2004	win
17	Obama	Democratic	52.9	2008	win
18	McCain	Republican	45.7	2008	loss
19	Obama	Democratic	51.1	2012	win
20	Romney	Republican	47.2	2012	loss
21	Clinton	Democratic	48.2	2016	loss
22	Trump	Republican	46.1	2016	win

	Candidate	%	Year	Result
Party
Republican	Reagan	50.7	1980	win
Democratic	Carter	41.0	1980	loss
Independent	Anderson	6.6	1980	loss
Republican	Reagan	58.8	1984	win
Democratic	Mondale	37.6	1984	loss

	State	Motto	Translation	Language	Date Adopted
count	50	50	49	50	50
unique	50	50	30	8	47
top	Kansas	Equality before the law	—	Latin	1893
freq	1	1	20	23	2

	State	Sex	Year	Name	Count
298627	CA	M	1988	Cyrus	30
233607	CA	M	1929	Jaime	8
371572	CA	M	2015	Kaan	8
164215	CA	F	2003	Patty	5
48265	CA	F	1963	Lily	22

	State	Sex	Year	Name	Count
221131	CA	F	2018	Emma	2722
378377	CA	M	2018	Noah	2555
221132	CA	F	2018	Mia	2484
221133	CA	F	2018	Olivia	2456
378378	CA	M	2018	Liam	2405