Pandas Basics¶

A high-level overview of the Pandas library. This is not a standalone notebook. Please see the slides or video for a narrative explanation of what is going on.

Alternately, lab 2 provides a narrative explanation of much of the content of this notebook.

In [6]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

plt.style.use('fivethirtyeight')
sns.set_context("notebook")

Reading in DataFrames from Files¶

Pandas has a number of very useful file reading tools. You can see them enumerated by typing "pd.re" and pressing tab. We'll be using read_csv today.

In [7]:
elections = pd.read_csv("elections.csv")
elections # if we end a cell with an expression or variable name, the result will print
Out[7]:
Year Candidate Party Popular vote Result %
0 1824 Andrew Jackson Democratic-Republican 151271 loss 57.210122
1 1824 John Quincy Adams Democratic-Republican 113142 win 42.789878
2 1828 Andrew Jackson Democratic 642806 win 56.203927
3 1828 John Quincy Adams National Republican 500897 loss 43.796073
4 1832 Andrew Jackson Democratic 702735 win 54.574789
... ... ... ... ... ... ...
177 2016 Jill Stein Green 1457226 loss 1.073699
178 2020 Joseph Biden Democratic 81268924 win 51.311515
179 2020 Donald Trump Republican 74216154 loss 46.858542
180 2020 Jo Jorgensen Libertarian 1865724 loss 1.177979
181 2020 Howard Hawkins Green 405035 loss 0.255731

182 rows × 6 columns

As a simple indexing example, consider the code below, which returns the first 5 rows of the DataFrame.

In [8]:
elections.loc[0:4]
Out[8]:
Year Candidate Party Popular vote Result %
0 1824 Andrew Jackson Democratic-Republican 151271 loss 57.210122
1 1824 John Quincy Adams Democratic-Republican 113142 win 42.789878
2 1828 Andrew Jackson Democratic 642806 win 56.203927
3 1828 John Quincy Adams National Republican 500897 loss 43.796073
4 1832 Andrew Jackson Democratic 702735 win 54.574789

We can also use the head command to return only a few rows of a dataframe.

In [5]:
elections.head(5)
Out[5]:
Year Candidate Party Popular vote Result %
0 1824 Andrew Jackson Democratic-Republican 151271 loss 57.210122
1 1824 John Quincy Adams Democratic-Republican 113142 win 42.789878
2 1828 Andrew Jackson Democratic 642806 win 56.203927
3 1828 John Quincy Adams National Republican 500897 loss 43.796073
4 1832 Andrew Jackson Democratic 702735 win 54.574789

Or the tail command to get the last so many rows.

In [6]:
elections.tail(5)
Out[6]:
Year Candidate Party Popular vote Result %
177 2016 Jill Stein Green 1457226 loss 1.073699
178 2020 Joseph Biden Democratic 81268924 win 51.311515
179 2020 Donald Trump Republican 74216154 loss 46.858542
180 2020 Jo Jorgensen Libertarian 1865724 loss 1.177979
181 2020 Howard Hawkins Green 405035 loss 0.255731

If we want a subset of the columns, we can also use loc just to ask for those.

In [9]:
elections.loc[0:4, "Year":"Party"]
Out[9]:
Year Candidate Party
0 1824 Andrew Jackson Democratic-Republican
1 1824 John Quincy Adams Democratic-Republican
2 1828 Andrew Jackson Democratic
3 1828 John Quincy Adams National Republican
4 1832 Andrew Jackson Democratic

loc¶

loc selects items by row and column label.

In [10]:
elections.loc[[87, 25, 179], ["Year", "Candidate", "Result"]]
Out[10]:
Year Candidate Result
87 1932 Herbert Hoover loss
25 1860 John C. Breckinridge loss
179 2020 Donald Trump loss
In [11]:
elections.loc[[87, 25, 179], "Popular vote":"%"]
Out[11]:
Popular vote Result %
87 15761254 loss 39.830594
25 848019 loss 18.138998
179 74216154 loss 46.858542
In [12]:
elections.loc[[87, 25, 179], "Popular vote"]
Out[12]:
87     15761254
25       848019
179    74216154
Name: Popular vote, dtype: int64
In [10]:
elections.loc[:, ["Year", "Candidate", "Result"]]
Out[10]:
Year Candidate Result
0 1824 Andrew Jackson loss
1 1824 John Quincy Adams win
2 1828 Andrew Jackson win
3 1828 John Quincy Adams loss
4 1832 Andrew Jackson win
... ... ... ...
177 2016 Jill Stein loss
178 2020 Joseph Biden win
179 2020 Donald Trump loss
180 2020 Jo Jorgensen loss
181 2020 Howard Hawkins loss

182 rows × 3 columns

iloc¶

iloc selects items by row and column number.

In [13]:
elections.iloc[[1, 2, 3], [0, 1, 2]]
Out[13]:
Year Candidate Party
1 1824 John Quincy Adams Democratic-Republican
2 1828 Andrew Jackson Democratic
3 1828 John Quincy Adams National Republican
In [14]:
elections.iloc[[1, 2, 3], 0:2]
Out[14]:
Year Candidate
1 1824 John Quincy Adams
2 1828 Andrew Jackson
3 1828 John Quincy Adams
In [15]:
elections.iloc[[1, 2, 3], 1]
Out[15]:
1    John Quincy Adams
2       Andrew Jackson
3    John Quincy Adams
Name: Candidate, dtype: object
In [16]:
elections.iloc[:, [0, 1, 4]]
Out[16]:
Year Candidate Result
0 1824 Andrew Jackson loss
1 1824 John Quincy Adams win
2 1828 Andrew Jackson win
3 1828 John Quincy Adams loss
4 1832 Andrew Jackson win
... ... ... ...
177 2016 Jill Stein loss
178 2020 Joseph Biden win
179 2020 Donald Trump loss
180 2020 Jo Jorgensen loss
181 2020 Howard Hawkins loss

182 rows × 3 columns

[]¶

We could technically do anything we want using loc or iloc. However, in practice, the [] operator is often used instead to yield more concise code.

[] is a bit trickier to understand than loc or iloc, but it does essentially the same thing.

If we provide a slice of row numbers, we get the numbered rows.

In [34]:
elections[3:7]
Out[34]:
Year Candidate Party Popular vote Result %
3 1828 John Quincy Adams National Republican 500897 loss 43.796073
4 1832 Andrew Jackson Democratic 702735 win 54.574789
5 1832 Henry Clay National Republican 484205 loss 37.603628
6 1832 William Wirt Anti-Masonic 100715 loss 7.821583

If we provide a list of column names, we get the listed columns.

In [38]:
elections[["Year", "Candidate", "Result"]].tail(5)
Out[38]:
Year Candidate Result
177 2016 Jill Stein loss
178 2020 Joseph Biden win
179 2020 Donald Trump loss
180 2020 Jo Jorgensen loss
181 2020 Howard Hawkins loss

And if we provide a single column name we get back just that column.

In [40]:
elections["Candidate"].tail(5)
Out[40]:
177        Jill Stein
178      Joseph Biden
179      Donald Trump
180      Jo Jorgensen
181    Howard Hawkins
Name: Candidate, dtype: object
In [17]:
weird = pd.DataFrame({
    1:["topdog","botdog"], 
    "1":["topcat","botcat"]
})
weird
Out[17]:
1 1
0 topdog topcat
1 botdog botcat
In [18]:
#weird[1] #try to predict the output
In [19]:
#weird["1"] #try to predict the output
In [20]:
#weird[1:] #try to predict the output

Pandas Datastructures: DataFrames, Series, and Indices¶

In [53]:
type(elections)
Out[53]:
pandas.core.frame.DataFrame
In [54]:
type(elections["Candidate"])
Out[54]:
pandas.core.series.Series
In [50]:
mottos = pd.read_csv("mottos.csv", index_col = "State")
mottos.loc["Alabama":"California"]
Out[50]:
Motto Translation Language Date Adopted
State
Alabama Audemus jura nostra defendere We dare defend our rights! Latin 1923
Alaska North to the future — English 1967
Arizona Ditat Deus God enriches Latin 1863
Arkansas Regnat populus The people rule Latin 1907
California Eureka (Εὕρηκα) I have found it Greek 1849
In [58]:
elections["Candidate"].tail(5).to_frame()
Out[58]:
Candidate
177 Jill Stein
178 Joseph Biden
179 Donald Trump
180 Jo Jorgensen
181 Howard Hawkins
In [59]:
elections[["Candidate"]].tail(5)
Out[59]:
Candidate
177 Jill Stein
178 Joseph Biden
179 Donald Trump
180 Jo Jorgensen
181 Howard Hawkins
In [60]:
mottos.index
Out[60]:
Index(['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado',
       'Connecticut', 'Delaware', 'Florida', 'Georgia', 'Hawaii', 'Idaho',
       'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana',
       'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota',
       'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada',
       'New Hampshire', 'New Jersey', 'New Mexico', 'New York',
       'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon',
       'Pennsylvania', 'Rhode Island', 'South Carolina', 'South Dakota',
       'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington',
       'West Virginia', 'Wisconsin', 'Wyoming'],
      dtype='object', name='State')
In [61]:
mottos.columns
Out[61]:
Index(['Motto', 'Translation', 'Language', 'Date Adopted'], dtype='object')

Conditional Selection¶

In [21]:
elections[elections["Party"] == "Independent"]
Out[21]:
Year Candidate Party Popular vote Result %
121 1976 Eugene McCarthy Independent 740460 loss 0.911649
130 1980 John B. Anderson Independent 5719850 loss 6.631143
143 1992 Ross Perot Independent 19743821 loss 18.956298
161 2004 Ralph Nader Independent 465151 loss 0.380663
167 2008 Ralph Nader Independent 739034 loss 0.563842
174 2016 Evan McMullin Independent 732273 loss 0.539546
In [22]:
elections["Party"] == "Independent"
Out[22]:
0      False
1      False
2      False
3      False
4      False
       ...  
177    False
178    False
179    False
180    False
181    False
Name: Party, Length: 182, dtype: bool

Boolean array selection also works with loc!

In [68]:
elections.loc[elections["Party"] == "Independent"]
Out[68]:
Year Candidate Party Popular vote Result %
121 1976 Eugene McCarthy Independent 740460 loss 0.911649
130 1980 John B. Anderson Independent 5719850 loss 6.631143
143 1992 Ross Perot Independent 19743821 loss 18.956298
161 2004 Ralph Nader Independent 465151 loss 0.380663
167 2008 Ralph Nader Independent 739034 loss 0.563842
174 2016 Evan McMullin Independent 732273 loss 0.539546
In [23]:
elections[(elections["Result"] == "win") & (elections["%"] < 47)]
Out[23]:
Year Candidate Party Popular vote Result %
1 1824 John Quincy Adams Democratic-Republican 113142 win 42.789878
20 1856 James Buchanan Democratic 1835140 win 45.306080
23 1860 Abraham Lincoln Republican 1855993 win 39.699408
47 1892 Grover Cleveland Democratic 5553898 win 46.121393
70 1912 Woodrow Wilson Democratic 6296284 win 41.933422
117 1968 Richard Nixon Republican 31783783 win 43.565246
140 1992 Bill Clinton Democratic 44909806 win 43.118485
173 2016 Donald Trump Republican 62984828 win 46.407862
In [24]:
elections2 = pd.read_csv("annoying_puzzle2.csv")
elections2
Out[24]:
Candidate Party % Year Result
0 Reagan Republican 50.897944 1980 win
1 Carter Democratic 41.132848 1980 loss
2 Anderson Independent 6.631143 1980 loss
3 Reagan Republican 59.023326 1984 win
4 Mondale Democratic 40.729429 1984 loss
5 Bush Republican 53.518845 1988 win
6 Dukakis Democratic 45.770691 1988 loss
In [25]:
#Which of the following yield the desired answer on the slides?
#elections2.iloc[[0, 3, 5], [0, 3]]
#elections2.loc[[0, 3, 5], "Candidate":"Year"]
#elections2.loc[elections["%"] > 50, ["Candidate", "Year"]].head(3)
#elections2.loc[elections["%"] > 50, ["Candidate", "Year"]].iloc[0:2, :]
In [86]:
(
    elections[(elections["Party"] == "Anti-Masonic")  | 
              (elections["Party"] == "American")      |
              (elections["Party"] == "Anti-Monopoly") |
              (elections["Party"] == "American Independent")]
)
#Note: The parentheses surrounding the code make it possible to break the code on to multiple lines for readability
Out[86]:
Year Candidate Party Popular vote Result %
6 1832 William Wirt Anti-Masonic 100715 loss 7.821583
22 1856 Millard Fillmore American 873053 loss 21.554001
38 1884 Benjamin Butler Anti-Monopoly 134294 loss 1.335838
115 1968 George Wallace American Independent 9901118 loss 13.571218
119 1972 John G. Schmitz American Independent 1100868 loss 1.421524
124 1976 Lester Maddox American Independent 170274 loss 0.209640
126 1976 Thomas J. Anderson American 158271 loss 0.194862
In [88]:
a_parties = ["Anti-Masonic", "American", "Anti-Monopoly", "American Independent"]
elections[elections["Party"].isin(a_parties)]
Out[88]:
Year Candidate Party Popular vote Result %
6 1832 William Wirt Anti-Masonic 100715 loss 7.821583
22 1856 Millard Fillmore American 873053 loss 21.554001
38 1884 Benjamin Butler Anti-Monopoly 134294 loss 1.335838
115 1968 George Wallace American Independent 9901118 loss 13.571218
119 1972 John G. Schmitz American Independent 1100868 loss 1.421524
124 1976 Lester Maddox American Independent 170274 loss 0.209640
126 1976 Thomas J. Anderson American 158271 loss 0.194862
In [89]:
elections[elections["Party"].str.startswith("A")]
Out[89]:
Year Candidate Party Popular vote Result %
6 1832 William Wirt Anti-Masonic 100715 loss 7.821583
22 1856 Millard Fillmore American 873053 loss 21.554001
38 1884 Benjamin Butler Anti-Monopoly 134294 loss 1.335838
115 1968 George Wallace American Independent 9901118 loss 13.571218
119 1972 John G. Schmitz American Independent 1100868 loss 1.421524
124 1976 Lester Maddox American Independent 170274 loss 0.209640
126 1976 Thomas J. Anderson American 158271 loss 0.194862
In [91]:
elections.query('Year >= 2000 and Result == "win"')
Out[91]:
Year Candidate Party Popular vote Result %
152 2000 George W. Bush Republican 50456002 win 47.974666
157 2004 George W. Bush Republican 62040610 win 50.771824
162 2008 Barack Obama Democratic 69498516 win 53.023510
168 2012 Barack Obama Democratic 65915795 win 51.258484
173 2016 Donald Trump Republican 62984828 win 46.407862
178 2020 Joseph Biden Democratic 81268924 win 51.311515
In [94]:
parties = ["Republican", "Democratic"]
elections.query('Result == "win" and Party not in @parties')
Out[94]:
Year Candidate Party Popular vote Result %
1 1824 John Quincy Adams Democratic-Republican 113142 win 42.789878
11 1840 William Henry Harrison Whig 1275583 win 53.051213
16 1848 Zachary Taylor Whig 1360235 win 47.309296
27 1864 Abraham Lincoln National Union 2211317 win 54.951512

Built In Functions¶

In [128]:
winners = elections.query('Result == "win"')["%"]
winners.head(5)
Out[128]:
1     42.789878
2     56.203927
4     54.574789
8     52.272472
11    53.051213
Name: %, dtype: float64
In [129]:
np.mean(winners)
Out[129]:
51.711492943
In [130]:
max(winners)
Out[130]:
61.34470329
In [135]:
elections
Out[135]:
Year Candidate Party Popular vote Result %
0 1824 Andrew Jackson Democratic-Republican 151271 loss 57.210122
1 1824 John Quincy Adams Democratic-Republican 113142 win 42.789878
2 1828 Andrew Jackson Democratic 642806 win 56.203927
3 1828 John Quincy Adams National Republican 500897 loss 43.796073
4 1832 Andrew Jackson Democratic 702735 win 54.574789
... ... ... ... ... ... ...
177 2016 Jill Stein Green 1457226 loss 1.073699
178 2020 Joseph Biden Democratic 81268924 win 51.311515
179 2020 Donald Trump Republican 74216154 loss 46.858542
180 2020 Jo Jorgensen Libertarian 1865724 loss 1.177979
181 2020 Howard Hawkins Green 405035 loss 0.255731

182 rows × 6 columns

In [131]:
elections.size
Out[131]:
1092
In [132]:
elections.shape
Out[132]:
(182, 6)
In [134]:
elections.describe()
Out[134]:
Year Popular vote %
count 182.000000 1.820000e+02 182.000000
mean 1934.087912 1.235364e+07 27.470350
std 57.048908 1.907715e+07 22.968034
min 1824.000000 1.007150e+05 0.098088
25% 1889.000000 3.876395e+05 1.219996
50% 1936.000000 1.709375e+06 37.677893
75% 1988.000000 1.897775e+07 48.354977
max 2020.000000 8.126892e+07 61.344703
In [137]:
elections.sample(5).iloc[:, 0:2]
Out[137]:
Year Candidate
178 2020 Joseph Biden
51 1896 Joshua Levering
41 1884 John St. John
165 2008 Cynthia McKinney
143 1992 Ross Perot
In [143]:
elections.query('Year == 2000').sample(4, replace = True).iloc[:, 0:2]
Out[143]:
Year Candidate
151 2000 Al Gore
151 2000 Al Gore
154 2000 Pat Buchanan
155 2000 Ralph Nader
In [145]:
elections["Candidate"].value_counts()
Out[145]:
Norman Thomas         5
Ralph Nader           4
Franklin Roosevelt    4
Eugene V. Debs        4
Andrew Jackson        3
                     ..
Silas C. Swallow      1
Alton B. Parker       1
John G. Woolley       1
Joshua Levering       1
Howard Hawkins        1
Name: Candidate, Length: 132, dtype: int64
In [147]:
elections["Party"].unique()
Out[147]:
array(['Democratic-Republican', 'Democratic', 'National Republican',
       'Anti-Masonic', 'Whig', 'Free Soil', 'Republican', 'American',
       'Constitutional Union', 'Southern Democratic',
       'Northern Democratic', 'National Union', 'Liberal Republican',
       'Greenback', 'Anti-Monopoly', 'Prohibition', 'Union Labor',
       'Populist', 'National Democratic', 'Socialist', 'Progressive',
       'Farmer–Labor', 'Communist', 'Union', 'Dixiecrat',
       "States' Rights", 'American Independent', 'Independent',
       'Libertarian', 'Citizens', 'New Alliance', 'Taxpayers',
       'Natural Law', 'Green', 'Reform', 'Constitution'], dtype=object)
In [148]:
elections["Candidate"].sort_values()
Out[148]:
75           Aaron S. Watkins
27            Abraham Lincoln
23            Abraham Lincoln
108           Adlai Stevenson
105           Adlai Stevenson
                ...          
19             Winfield Scott
37     Winfield Scott Hancock
74             Woodrow Wilson
70             Woodrow Wilson
16             Zachary Taylor
Name: Candidate, Length: 182, dtype: object
In [151]:
elections.sort_values("%", ascending = False)
Out[151]:
Year Candidate Party Popular vote Result %
114 1964 Lyndon Johnson Democratic 43127041 win 61.344703
91 1936 Franklin Roosevelt Democratic 27752648 win 60.978107
120 1972 Richard Nixon Republican 47168710 win 60.907806
79 1920 Warren Harding Republican 16144093 win 60.574501
133 1984 Ronald Reagan Republican 54455472 win 59.023326
... ... ... ... ... ... ...
165 2008 Cynthia McKinney Green 161797 loss 0.123442
148 1996 John Hagelin Natural Law 113670 loss 0.118219
160 2004 Michael Peroutka Constitution 143630 loss 0.117542
141 1992 Bo Gritz Populist 106152 loss 0.101918
156 2004 David Cobb Green 119859 loss 0.098088

182 rows × 6 columns