Pandas, Part I¶

Adapted from Josh Hug and Narges Norouzi

Updated by Bella Crouch

A high-level overview of the pandas library to accompany Lecture 2.

In [1]:
# `pd` is the conventional alias for Pandas, as `np` is for NumPy
import pandas as pd

Series, DataFrames, and Indices¶

Series, DataFrames, and Indices are fundamental pandas data structures for storing tabular data and processing the data using vectorized operations.

Series¶

Series is a 1-D labeled array data. We can think of it as columnar data.

Creating a new Series object¶

Below we create a Series object and will look into its two components: 1) values and 2) index.

In [2]:
s = pd.Series([-1, 10, 2])
s
Out[2]:
0    -1
1    10
2     2
dtype: int64
In [3]:
s.values
Out[3]:
array([-1, 10,  2])
In [4]:
s.index
Out[4]:
RangeIndex(start=0, stop=3, step=1)

In the example above, pandas automatically generated an Index of integer labels. We can also create a Series object by providing a custom Index.

In [5]:
s = pd.Series([-1, 10, 2], index = ["a", "b", "c"])
s
Out[5]:
a    -1
b    10
c     2
dtype: int64
In [6]:
s.values
Out[6]:
array([-1, 10,  2])
In [7]:
s.index
Out[7]:
Index(['a', 'b', 'c'], dtype='object')

After it has been created, we can reassign the Index of a Series to a new Index.

In [8]:
s.index = ["first", "second", "third"]
s
Out[8]:
first     -1
second    10
third      2
dtype: int64

Selection in Series¶

We can select a single value or a set of values in a Series using:

  • A single label
  • A list of labels
  • A filtering condition
In [9]:
s = pd.Series([4, -2, 0, 6], index = ["a", "b", "c", "d"])
s
Out[9]:
a    4
b   -2
c    0
d    6
dtype: int64

Selection using one or more label(s)

In [10]:
# Selection using a single label
# Notice how the return value is a single array element
s["a"]
Out[10]:
4
In [11]:
# Selection using a list of labels
# Notice how the return value is another Series
s[["a", "c"]]
Out[11]:
a    4
c    0
dtype: int64

Selection using a filter condition

In [12]:
# Filter condition: select all elements greater than 0
s>0
Out[12]:
a     True
b    False
c    False
d     True
dtype: bool
In [13]:
# Use the Boolean filter to select data from the original Series
s[s>0]
Out[13]:
a    4
d    6
dtype: int64

DataFrame¶

A DataFrame is a 2-D tabular data structure with both row and column labels. In this lecture, we will see how a DataFrame can be created from scratch or loaded from a file.

Loading data from a file into a DataFrame¶

For loading data into a DataFrame, pandas has a number of very useful file reading tools. We'll be using read_csv today to load data from a csv file into a DataFrame object.

In [14]:
elections = pd.read_csv("data/elections.csv")
elections
Out[14]:
Year Candidate Party Popular vote Result %
0 1824 Andrew Jackson Democratic-Republican 151271 loss 57.210122
1 1824 John Quincy Adams Democratic-Republican 113142 win 42.789878
2 1828 Andrew Jackson Democratic 642806 win 56.203927
3 1828 John Quincy Adams National Republican 500897 loss 43.796073
4 1832 Andrew Jackson Democratic 702735 win 54.574789
... ... ... ... ... ... ...
177 2016 Jill Stein Green 1457226 loss 1.073699
178 2020 Joseph Biden Democratic 81268924 win 51.311515
179 2020 Donald Trump Republican 74216154 loss 46.858542
180 2020 Jo Jorgensen Libertarian 1865724 loss 1.177979
181 2020 Howard Hawkins Green 405035 loss 0.255731

182 rows × 6 columns

Creating a new DataFrame object¶

We can also create a DataFrame in variety of ways. Here we cover the following:

  1. Using a list and column names
  2. From a dictionary
  3. From a Series
In [15]:
# Creating a single-column DataFrame using a list
df_list_1 = pd.DataFrame([1, 2, 3], columns = ["Number"])
display(df_list_1)
Number
0 1
1 2
2 3
In [16]:
# Creating a multi-column DataFrame using a list of lists
df_list_2 = pd.DataFrame([[1, "one"], [2, "two"]], columns = ["Number", "Description"])
df_list_2
Out[16]:
Number Description
0 1 one
1 2 two
In [17]:
# Creating a DataFrame from a dictionary of columns
df_dict_1 = pd.DataFrame({"Fruit":["Strawberry", "Orange"], "Price":[5.49, 3.99]})
df_dict_1
Out[17]:
Fruit Price
0 Strawberry 5.49
1 Orange 3.99
In [18]:
# Creating a DataFrame from a list of row dictionaries
df_dict_2 = pd.DataFrame([{"Fruit":"Strawberry", "Price":5.49}, 
                   {"Fruit":"Orange", "Price":3.99}])
df_dict_2
Out[18]:
Fruit Price
0 Strawberry 5.49
1 Orange 3.99
In [19]:
# In the examples below, we create a DataFrame from a Series

s_a = pd.Series(["a1", "a2", "a3"], index = ["r1", "r2", "r3"])
s_b = pd.Series(["b1", "b2", "b3"], index = ["r1", "r2", "r3"])
In [20]:
# Passing Series objects for columns
df_ser = pd.DataFrame({"A-column":s_a, "B-column":s_b})
df_ser
Out[20]:
A-column B-column
r1 a1 b1
r2 a2 b2
r3 a3 b3
In [21]:
# Passing a Series to the DataFrame constructor to make a one-column dataframe
df_ser = pd.DataFrame(s_a)
df_ser
Out[21]:
0
r1 a1
r2 a2
r3 a3
In [22]:
# Using to_frame() to convert a Series to DataFrame
ser_to_df = s_a.to_frame()
ser_to_df
Out[22]:
0
r1 a1
r2 a2
r3 a3
In [23]:
# Creating a DataFrame from a csv file and specifying the index column
mottos = pd.read_csv("data/mottos.csv", index_col = "State")
mottos.head(5) # Using `.head` shows only the first 5 rows to save space
Out[23]:
Motto Translation Language Date Adopted
State
Alabama Audemus jura nostra defendere We dare defend our rights! Latin 1923
Alaska North to the future — English 1967
Arizona Ditat Deus God enriches Latin 1863
Arkansas Regnat populus The people rule Latin 1907
California Eureka (Εὕρηκα) I have found it Greek 1849
In [24]:
elections.set_index("Candidate", inplace=True) # This sets the index to the "Candidate" column
elections
Out[24]:
Year Party Popular vote Result %
Candidate
Andrew Jackson 1824 Democratic-Republican 151271 loss 57.210122
John Quincy Adams 1824 Democratic-Republican 113142 win 42.789878
Andrew Jackson 1828 Democratic 642806 win 56.203927
John Quincy Adams 1828 National Republican 500897 loss 43.796073
Andrew Jackson 1832 Democratic 702735 win 54.574789
... ... ... ... ... ...
Jill Stein 2016 Green 1457226 loss 1.073699
Joseph Biden 2020 Democratic 81268924 win 51.311515
Donald Trump 2020 Republican 74216154 loss 46.858542
Jo Jorgensen 2020 Libertarian 1865724 loss 1.177979
Howard Hawkins 2020 Green 405035 loss 0.255731

182 rows × 5 columns

In [25]:
elections.reset_index(inplace=True) # Revert the index back to its default numeric labeling
elections
Out[25]:
Candidate Year Party Popular vote Result %
0 Andrew Jackson 1824 Democratic-Republican 151271 loss 57.210122
1 John Quincy Adams 1824 Democratic-Republican 113142 win 42.789878
2 Andrew Jackson 1828 Democratic 642806 win 56.203927
3 John Quincy Adams 1828 National Republican 500897 loss 43.796073
4 Andrew Jackson 1832 Democratic 702735 win 54.574789
... ... ... ... ... ... ...
177 Jill Stein 2016 Green 1457226 loss 1.073699
178 Joseph Biden 2020 Democratic 81268924 win 51.311515
179 Donald Trump 2020 Republican 74216154 loss 46.858542
180 Jo Jorgensen 2020 Libertarian 1865724 loss 1.177979
181 Howard Hawkins 2020 Green 405035 loss 0.255731

182 rows × 6 columns

Slicing in DataFrames¶

We can use .head to return only a few rows of a dataframe.

In [26]:
elections.head() # By default, calling .head with no argument will show the first 5 rows
Out[26]:
Candidate Year Party Popular vote Result %
0 Andrew Jackson 1824 Democratic-Republican 151271 loss 57.210122
1 John Quincy Adams 1824 Democratic-Republican 113142 win 42.789878
2 Andrew Jackson 1828 Democratic 642806 win 56.203927
3 John Quincy Adams 1828 National Republican 500897 loss 43.796073
4 Andrew Jackson 1832 Democratic 702735 win 54.574789
In [27]:
elections.head(3)
Out[27]:
Candidate Year Party Popular vote Result %
0 Andrew Jackson 1824 Democratic-Republican 151271 loss 57.210122
1 John Quincy Adams 1824 Democratic-Republican 113142 win 42.789878
2 Andrew Jackson 1828 Democratic 642806 win 56.203927

We can also use .tail to get the last so many rows.

In [28]:
elections.tail(5)
Out[28]:
Candidate Year Party Popular vote Result %
177 Jill Stein 2016 Green 1457226 loss 1.073699
178 Joseph Biden 2020 Democratic 81268924 win 51.311515
179 Donald Trump 2020 Republican 74216154 loss 46.858542
180 Jo Jorgensen 2020 Libertarian 1865724 loss 1.177979
181 Howard Hawkins 2020 Green 405035 loss 0.255731

loc¶

loc selects items by row and column label.

In [29]:
elections.loc[[87, 25, 179], ["Year", "Candidate", "Result"]]
Out[29]:
Year Candidate Result
87 1932 Herbert Hoover loss
25 1860 John C. Breckinridge loss
179 2020 Donald Trump loss
In [30]:
elections.loc[[87, 25, 179], "Popular vote":"%"]
Out[30]:
Popular vote Result %
87 15761254 loss 39.830594
25 848019 loss 18.138998
179 74216154 loss 46.858542
In [31]:
elections.loc[[87, 25, 179], "Popular vote"]
Out[31]:
87     15761254
25       848019
179    74216154
Name: Popular vote, dtype: int64
In [32]:
elections.loc[0, "Candidate"]
Out[32]:
'Andrew Jackson'
In [33]:
elections.loc[:, ["Year", "Candidate", "Result"]]
Out[33]:
Year Candidate Result
0 1824 Andrew Jackson loss
1 1824 John Quincy Adams win
2 1828 Andrew Jackson win
3 1828 John Quincy Adams loss
4 1832 Andrew Jackson win
... ... ... ...
177 2016 Jill Stein loss
178 2020 Joseph Biden win
179 2020 Donald Trump loss
180 2020 Jo Jorgensen loss
181 2020 Howard Hawkins loss

182 rows × 3 columns

iloc¶

iloc selects items by row and column integer position.

In [34]:
# Select the rows at positions 1, 2, and 3
# Select the columns at positions 0, 1, and 2
# Remember that Python indexing begins at position 0!
elections.iloc[[1, 2, 3], [0, 1, 2]]
Out[34]:
Candidate Year Party
1 John Quincy Adams 1824 Democratic-Republican
2 Andrew Jackson 1828 Democratic
3 John Quincy Adams 1828 National Republican
In [35]:
elections.iloc[[1, 2, 3], 0:3]
Out[35]:
Candidate Year Party
1 John Quincy Adams 1824 Democratic-Republican
2 Andrew Jackson 1828 Democratic
3 John Quincy Adams 1828 National Republican
In [36]:
elections.iloc[[1, 2, 3], 1]
Out[36]:
1    1824
2    1828
3    1828
Name: Year, dtype: int64
In [37]:
elections.iloc[:, 0:3]
Out[37]:
Candidate Year Party
0 Andrew Jackson 1824 Democratic-Republican
1 John Quincy Adams 1824 Democratic-Republican
2 Andrew Jackson 1828 Democratic
3 John Quincy Adams 1828 National Republican
4 Andrew Jackson 1832 Democratic
... ... ... ...
177 Jill Stein 2016 Green
178 Joseph Biden 2020 Democratic
179 Donald Trump 2020 Republican
180 Jo Jorgensen 2020 Libertarian
181 Howard Hawkins 2020 Green

182 rows × 3 columns

[]¶

We could technically do anything we want using loc or iloc. However, in practice, the [] operator is often used instead to yield more concise code.

[] is a bit trickier to understand than loc or iloc, but it achieves essentially the same functionality. The difference is that [] is context-dependent.

If we provide a slice of row numbers, we get the numbered rows.

In [38]:
elections[3:7]
Out[38]:
Candidate Year Party Popular vote Result %
3 John Quincy Adams 1828 National Republican 500897 loss 43.796073
4 Andrew Jackson 1832 Democratic 702735 win 54.574789
5 Henry Clay 1832 National Republican 484205 loss 37.603628
6 William Wirt 1832 Anti-Masonic 100715 loss 7.821583

If we provide a list of column names, we get the listed columns.

In [39]:
elections[["Year", "Candidate", "Result"]]
Out[39]:
Year Candidate Result
0 1824 Andrew Jackson loss
1 1824 John Quincy Adams win
2 1828 Andrew Jackson win
3 1828 John Quincy Adams loss
4 1832 Andrew Jackson win
... ... ... ...
177 2016 Jill Stein loss
178 2020 Joseph Biden win
179 2020 Donald Trump loss
180 2020 Jo Jorgensen loss
181 2020 Howard Hawkins loss

182 rows × 3 columns

And if we provide a single column name we get back just that column, stored as a Series.

In [40]:
elections["Candidate"]
Out[40]:
0         Andrew Jackson
1      John Quincy Adams
2         Andrew Jackson
3      John Quincy Adams
4         Andrew Jackson
             ...        
177           Jill Stein
178         Joseph Biden
179         Donald Trump
180         Jo Jorgensen
181       Howard Hawkins
Name: Candidate, Length: 182, dtype: object

Slido Exercises¶

In [41]:
example = pd.Series([4, 5, 6], index=["one", "two", "three"])
example[example > 4].values
Out[41]:
array([5, 6])
In [42]:
df = pd.DataFrame({"c1":[1, 2, 3, 4], "c2":[2, 4, 6, 8]})
df.columns
Out[42]:
Index(['c1', 'c2'], dtype='object')
In [43]:
weird = pd.DataFrame({
    1:["topdog","botdog"], 
    "1":["topcat","botcat"]
})
weird
Out[43]:
1 1
0 topdog topcat
1 botdog botcat
In [44]:
weird[1]
Out[44]:
0    topdog
1    botdog
Name: 1, dtype: object
In [45]:
weird["1"]
Out[45]:
0    topcat
1    botcat
Name: 1, dtype: object
In [46]:
weird[1:]
Out[46]:
1 1
1 botdog botcat