Lecture 2 – Data 100, Fall 2023¶

Data 100, Fall 2023

Acknowledgments Page

A high-level overview of the pandas library to accompany Lecture 2.

In [1]:
# `pd` is the conventional alias for Pandas, as `np` is for NumPy
import pandas as pd

Series, DataFrames, and Indices¶

Series, DataFrames, and Indices are fundamental pandas data structures for storing tabular data and processing the data using vectorized operations.

Series¶

A Series is a 1-D labeled array of data. We can think of it as columnar data.

Creating a new Series object¶

Below, we create a Series object and will look into its two components: 1) values and 2) index.

In [2]:
s = pd.Series(["welcome", "to", "data 100"])

s
Out[2]:
0     welcome
1          to
2    data 100
dtype: object
In [3]:
s.values
Out[3]:
array(['welcome', 'to', 'data 100'], dtype=object)
In [4]:
s.index
Out[4]:
RangeIndex(start=0, stop=3, step=1)

In the example above, pandas automatically generated an Index of integer labels. We can also create a Series object by providing a custom Index.

In [5]:
s = pd.Series([-1, 10, 2], index = ["a", "b", "c"])
s
Out[5]:
a    -1
b    10
c     2
dtype: int64
In [6]:
s.values
Out[6]:
array([-1, 10,  2])
In [7]:
s.index
Out[7]:
Index(['a', 'b', 'c'], dtype='object')

After it has been created, we can reassign the Index of a Series to a new Index.

In [8]:
s.index = ["first", "second", "third"]
s
Out[8]:
first     -1
second    10
third      2
dtype: int64

Selection in Series¶

We can select a single value or a set of values in a Series using:

  • A single label
  • A list of labels
  • A filtering condition
In [9]:
s = pd.Series([4, -2, 0, 6], index = ["a", "b", "c", "d"])
s
Out[9]:
a    4
b   -2
c    0
d    6
dtype: int64

Selection using one or more label(s)

In [10]:
# Selection using a single label
# Notice how the return value is a single array element
s["a"]
Out[10]:
4
In [11]:
# Selection using a list of labels
# Notice how the return value is another Series
s[["a", "c"]]
Out[11]:
a    4
c    0
dtype: int64

Selection using a filter condition

In [12]:
# Filter condition: select all elements greater than 0
s>0
Out[12]:
a     True
b    False
c    False
d     True
dtype: bool
In [13]:
# Use the Boolean filter to select data from the original Series
s[s>0]
Out[13]:
a    4
d    6
dtype: int64

DataFrame¶

A DataFrame is a 2-D tabular data structure with both row and column labels. In this lecture, we will see how a DataFrame can be created from scratch or loaded from a file.

Creating a new DataFrame object¶

We can also create a DataFrame in a variety of ways. Here, we cover the following:

  1. From a CSV file
  2. Using a list and column names
  3. From a dictionary
  4. From a Series

Creating a DataFrame from a CSV file¶

For loading data into a DataFrame, pandas has a number of very useful file reading tools. We'll be using read_csv today to load data from a CSV file into a DataFrame object.

In [14]:
elections = pd.read_csv("data/elections.csv")
elections
Out[14]:
Year Candidate Party Popular vote Result %
0 1824 Andrew Jackson Democratic-Republican 151271 loss 57.210122
1 1824 John Quincy Adams Democratic-Republican 113142 win 42.789878
2 1828 Andrew Jackson Democratic 642806 win 56.203927
3 1828 John Quincy Adams National Republican 500897 loss 43.796073
4 1832 Andrew Jackson Democratic 702735 win 54.574789
... ... ... ... ... ... ...
177 2016 Jill Stein Green 1457226 loss 1.073699
178 2020 Joseph Biden Democratic 81268924 win 51.311515
179 2020 Donald Trump Republican 74216154 loss 46.858542
180 2020 Jo Jorgensen Libertarian 1865724 loss 1.177979
181 2020 Howard Hawkins Green 405035 loss 0.255731

182 rows × 6 columns

By passing a column to the index_col attribute, the Index can be defined at the initialization.

In [15]:
elections = pd.read_csv("data/elections.csv", index_col = "Year")
elections
Out[15]:
Candidate Party Popular vote Result %
Year
1824 Andrew Jackson Democratic-Republican 151271 loss 57.210122
1824 John Quincy Adams Democratic-Republican 113142 win 42.789878
1828 Andrew Jackson Democratic 642806 win 56.203927
1828 John Quincy Adams National Republican 500897 loss 43.796073
1832 Andrew Jackson Democratic 702735 win 54.574789
... ... ... ... ... ...
2016 Jill Stein Green 1457226 loss 1.073699
2020 Joseph Biden Democratic 81268924 win 51.311515
2020 Donald Trump Republican 74216154 loss 46.858542
2020 Jo Jorgensen Libertarian 1865724 loss 1.177979
2020 Howard Hawkins Green 405035 loss 0.255731

182 rows × 5 columns

Creating a DataFrame using a list and column names¶

In [16]:
# Creating a single-column DataFrame using a list
df_list_1 = pd.DataFrame([1, 2, 3], 
                         columns = ["Number"])
display(df_list_1)
Number
0 1
1 2
2 3
In [17]:
# Creating a multi-column DataFrame using a list of lists
df_list_2 = pd.DataFrame([[1, "one"], [2, "two"]], 
                         columns = ["Number", "Description"])
df_list_2
Out[17]:
Number Description
0 1 one
1 2 two

Creating a DataFrame from a dictionary¶

In [18]:
# Creating a DataFrame from a dictionary of columns
df_dict_1 = pd.DataFrame({"Fruit":["Strawberry", "Orange"], 
                          "Price":[5.49, 3.99]})
df_dict_1
Out[18]:
Fruit Price
0 Strawberry 5.49
1 Orange 3.99
In [19]:
# Creating a DataFrame from a list of row dictionaries
df_dict_2 = pd.DataFrame([{"Fruit":"Strawberry", "Price":5.49}, 
                          {"Fruit":"Orange", "Price":3.99}])
df_dict_2
Out[19]:
Fruit Price
0 Strawberry 5.49
1 Orange 3.99

Creating a DataFrame from a Series¶

In [20]:
# In the examples below, we create a DataFrame from a Series

s_a = pd.Series(["a1", "a2", "a3"], index = ["r1", "r2", "r3"])
s_b = pd.Series(["b1", "b2", "b3"], index = ["r1", "r2", "r3"])
In [21]:
# Passing Series objects for columns
df_ser = pd.DataFrame({"A-column":s_a, "B-column":s_b})
df_ser
Out[21]:
A-column B-column
r1 a1 b1
r2 a2 b2
r3 a3 b3
In [22]:
# Passing a Series to the DataFrame constructor to make a one-column dataframe
df_ser = pd.DataFrame(s_a)
df_ser
Out[22]:
0
r1 a1
r2 a2
r3 a3
In [23]:
# Using to_frame() to convert a Series to DataFrame
ser_to_df = s_a.to_frame()
ser_to_df
Out[23]:
0
r1 a1
r2 a2
r3 a3
In [24]:
# Creating a DataFrame from a CSV file and specifying the Index column
elections = pd.read_csv("data/elections.csv", index_col = "Candidate")
elections.head(5) # Using `.head` shows only the first 5 rows to save space
Out[24]:
Year Party Popular vote Result %
Candidate
Andrew Jackson 1824 Democratic-Republican 151271 loss 57.210122
John Quincy Adams 1824 Democratic-Republican 113142 win 42.789878
Andrew Jackson 1828 Democratic 642806 win 56.203927
John Quincy Adams 1828 National Republican 500897 loss 43.796073
Andrew Jackson 1832 Democratic 702735 win 54.574789
In [25]:
elections.reset_index(inplace = True) # Need to reset the index to keep 'Candidate' as one of the DataFrane Columns
elections.set_index("Party", inplace=True) # This sets the index to the "Candidate" column
elections
Out[25]:
Candidate Year Popular vote Result %
Party
Democratic-Republican Andrew Jackson 1824 151271 loss 57.210122
Democratic-Republican John Quincy Adams 1824 113142 win 42.789878
Democratic Andrew Jackson 1828 642806 win 56.203927
National Republican John Quincy Adams 1828 500897 loss 43.796073
Democratic Andrew Jackson 1832 702735 win 54.574789
... ... ... ... ... ...
Green Jill Stein 2016 1457226 loss 1.073699
Democratic Joseph Biden 2020 81268924 win 51.311515
Republican Donald Trump 2020 74216154 loss 46.858542
Libertarian Jo Jorgensen 2020 1865724 loss 1.177979
Green Howard Hawkins 2020 405035 loss 0.255731

182 rows × 5 columns

DataFrame attributes: index, columns, and shape¶

In [26]:
elections.index
Out[26]:
Index(['Democratic-Republican', 'Democratic-Republican', 'Democratic',
       'National Republican', 'Democratic', 'National Republican',
       'Anti-Masonic', 'Whig', 'Democratic', 'Whig',
       ...
       'Constitution', 'Republican', 'Independent', 'Libertarian',
       'Democratic', 'Green', 'Democratic', 'Republican', 'Libertarian',
       'Green'],
      dtype='object', name='Party', length=182)
In [27]:
elections.columns
Out[27]:
Index(['Candidate', 'Year', 'Popular vote', 'Result', '%'], dtype='object')

The Index column can be set to the default list of integers by calling reset_index() on a DataFrame.

In [28]:
elections.reset_index(inplace=True) # Revert the index back to its default numeric labeling
elections
Out[28]:
Party Candidate Year Popular vote Result %
0 Democratic-Republican Andrew Jackson 1824 151271 loss 57.210122
1 Democratic-Republican John Quincy Adams 1824 113142 win 42.789878
2 Democratic Andrew Jackson 1828 642806 win 56.203927
3 National Republican John Quincy Adams 1828 500897 loss 43.796073
4 Democratic Andrew Jackson 1832 702735 win 54.574789
... ... ... ... ... ... ...
177 Green Jill Stein 2016 1457226 loss 1.073699
178 Democratic Joseph Biden 2020 81268924 win 51.311515
179 Republican Donald Trump 2020 74216154 loss 46.858542
180 Libertarian Jo Jorgensen 2020 1865724 loss 1.177979
181 Green Howard Hawkins 2020 405035 loss 0.255731

182 rows × 6 columns

In [29]:
elections.shape
Out[29]:
(182, 6)

Slicing in DataFrames¶

We can use .head to return only a few rows of a dataframe.

In [30]:
# Loading DataFrame again to keep the original ordering of columns
elections = pd.read_csv("data/elections.csv")

elections.head() # By default, calling .head with no argument will show the first 5 rows
Out[30]:
Year Candidate Party Popular vote Result %
0 1824 Andrew Jackson Democratic-Republican 151271 loss 57.210122
1 1824 John Quincy Adams Democratic-Republican 113142 win 42.789878
2 1828 Andrew Jackson Democratic 642806 win 56.203927
3 1828 John Quincy Adams National Republican 500897 loss 43.796073
4 1832 Andrew Jackson Democratic 702735 win 54.574789
In [31]:
elections.head(3)
Out[31]:
Year Candidate Party Popular vote Result %
0 1824 Andrew Jackson Democratic-Republican 151271 loss 57.210122
1 1824 John Quincy Adams Democratic-Republican 113142 win 42.789878
2 1828 Andrew Jackson Democratic 642806 win 56.203927

We can also use .tail to get the last so many rows.

In [32]:
elections.tail(5)
Out[32]:
Year Candidate Party Popular vote Result %
177 2016 Jill Stein Green 1457226 loss 1.073699
178 2020 Joseph Biden Democratic 81268924 win 51.311515
179 2020 Donald Trump Republican 74216154 loss 46.858542
180 2020 Jo Jorgensen Libertarian 1865724 loss 1.177979
181 2020 Howard Hawkins Green 405035 loss 0.255731

Label-Based Extraction Usingloc¶

Arguments to .loc can be:

  1. A list.
  2. A slice (syntax is inclusive of the right-hand side of the slice).
  3. A single value.

loc selects items by row and column label.

In [33]:
# Selection by a list
elections.loc[[87, 25, 179], ["Year", "Candidate", "Result"]]
Out[33]:
Year Candidate Result
87 1932 Herbert Hoover loss
25 1860 John C. Breckinridge loss
179 2020 Donald Trump loss
In [34]:
# Selection by a list and a slice of columns
elections.loc[[87, 25, 179], "Popular vote":"%"]
Out[34]:
Popular vote Result %
87 15761254 loss 39.830594
25 848019 loss 18.138998
179 74216154 loss 46.858542
In [35]:
# Extracting all rows using a colon
elections.loc[:, ["Year", "Candidate", "Result"]]
Out[35]:
Year Candidate Result
0 1824 Andrew Jackson loss
1 1824 John Quincy Adams win
2 1828 Andrew Jackson win
3 1828 John Quincy Adams loss
4 1832 Andrew Jackson win
... ... ... ...
177 2016 Jill Stein loss
178 2020 Joseph Biden win
179 2020 Donald Trump loss
180 2020 Jo Jorgensen loss
181 2020 Howard Hawkins loss

182 rows × 3 columns

In [36]:
# Extracting all columns using a colon
elections.loc[[87, 25, 179], :]
Out[36]:
Year Candidate Party Popular vote Result %
87 1932 Herbert Hoover Republican 15761254 loss 39.830594
25 1860 John C. Breckinridge Southern Democratic 848019 loss 18.138998
179 2020 Donald Trump Republican 74216154 loss 46.858542
In [37]:
# Selection by a list and a single-column label
elections.loc[[87, 25, 179], "Popular vote"]
Out[37]:
87     15761254
25       848019
179    74216154
Name: Popular vote, dtype: int64
In [38]:
# Note that if we pass "Popular vote" in a list, the output will be a DataFrame
elections.loc[[87, 25, 179], ["Popular vote"]]
Out[38]:
Popular vote
87 15761254
25 848019
179 74216154
In [39]:
# Selection by a row label and a column label
elections.loc[0, "Candidate"]
Out[39]:
'Andrew Jackson'

Integer-Based Extraction Using iloc¶

iloc selects items by row and column integer position.

Arguments to .iloc can be:

  1. A list.
  2. A slice (syntax is exclusive of the right hand side of the slice).
  3. A single value.
In [40]:
# Select the rows at positions 1, 2, and 3.
# Select the columns at positions 0, 1, and 2.
# Remember that Python indexing begins at position 0!
elections.iloc[[1, 2, 3], [0, 1, 2]]
Out[40]:
Year Candidate Party
1 1824 John Quincy Adams Democratic-Republican
2 1828 Andrew Jackson Democratic
3 1828 John Quincy Adams National Republican
In [41]:
# Index-based extraction using a list of rows and a slice of column indices
elections.iloc[[1, 2, 3], 0:3]
Out[41]:
Year Candidate Party
1 1824 John Quincy Adams Democratic-Republican
2 1828 Andrew Jackson Democratic
3 1828 John Quincy Adams National Republican
In [42]:
# Selecting all rows using a colon
elections.iloc[:, 0:3]
Out[42]:
Year Candidate Party
0 1824 Andrew Jackson Democratic-Republican
1 1824 John Quincy Adams Democratic-Republican
2 1828 Andrew Jackson Democratic
3 1828 John Quincy Adams National Republican
4 1832 Andrew Jackson Democratic
... ... ... ...
177 2016 Jill Stein Green
178 2020 Joseph Biden Democratic
179 2020 Donald Trump Republican
180 2020 Jo Jorgensen Libertarian
181 2020 Howard Hawkins Green

182 rows × 3 columns

In [43]:
elections.iloc[[1, 2, 3], 1]
Out[43]:
1    John Quincy Adams
2       Andrew Jackson
3    John Quincy Adams
Name: Candidate, dtype: object
In [44]:
# Extracting the value at row 0 and the second column
elections.iloc[0,1]
Out[44]:
'Andrew Jackson'

Context-dependent Extraction using []¶

We could technically do anything we want using loc or iloc. However, in practice, the [] operator is often used instead to yield more concise code.

[] is a bit trickier to understand than loc or iloc, but it achieves essentially the same functionality. The difference is that [] is context-dependent.

[] only takes one argument, which may be:

  1. A slice of row integers.
  2. A list of column labels.
  3. A single column label.

If we provide a slice of row numbers, we get the numbered rows.

In [45]:
elections[3:7]
Out[45]:
Year Candidate Party Popular vote Result %
3 1828 John Quincy Adams National Republican 500897 loss 43.796073
4 1832 Andrew Jackson Democratic 702735 win 54.574789
5 1832 Henry Clay National Republican 484205 loss 37.603628
6 1832 William Wirt Anti-Masonic 100715 loss 7.821583

If we provide a list of column names, we get the listed columns.

In [46]:
elections[["Year", "Candidate", "Result"]]
Out[46]:
Year Candidate Result
0 1824 Andrew Jackson loss
1 1824 John Quincy Adams win
2 1828 Andrew Jackson win
3 1828 John Quincy Adams loss
4 1832 Andrew Jackson win
... ... ... ...
177 2016 Jill Stein loss
178 2020 Joseph Biden win
179 2020 Donald Trump loss
180 2020 Jo Jorgensen loss
181 2020 Howard Hawkins loss

182 rows × 3 columns

And if we provide a single column name we get back just that column, stored as a Series.

In [47]:
elections["Candidate"]
Out[47]:
0         Andrew Jackson
1      John Quincy Adams
2         Andrew Jackson
3      John Quincy Adams
4         Andrew Jackson
             ...        
177           Jill Stein
178         Joseph Biden
179         Donald Trump
180         Jo Jorgensen
181       Howard Hawkins
Name: Candidate, Length: 182, dtype: object

Slido Exercises¶

In [48]:
example = pd.Series([4, 5, 6], index=["one", "two", "three"])
example[example > 4].values
Out[48]:
array([5, 6])
In [49]:
df = pd.DataFrame({"c1":[1, 2, 3, 4], "c2":[2, 4, 6, 8]})
df.columns
Out[49]:
Index(['c1', 'c2'], dtype='object')
In [50]:
weird = pd.DataFrame({
    1:["topdog","botdog"], 
    "1":["topcat","botcat"]
})
weird
Out[50]:
1 1
0 topdog topcat
1 botdog botcat
In [51]:
weird[1]
Out[51]:
0    topdog
1    botdog
Name: 1, dtype: object
In [52]:
weird["1"]
Out[52]:
0    topcat
1    botcat
Name: 1, dtype: object
In [53]:
weird[1:]
Out[53]:
1 1
1 botdog botcat