# `pd` is the conventional alias for Pandas, as `np` is for NumPy
import pandas as pd
Series, DataFrames, and Indices are fundamental pandas data structures for storing tabular data and processing the data using vectorized operations.
s = pd.Series([-1, 10, 2])
s
0 -1 1 10 2 2 dtype: int64
s.values
array([-1, 10, 2])
s.index
RangeIndex(start=0, stop=3, step=1)
In the example above, pandas automatically generated an Index of integer labels. We can also create a Series object by providing a custom Index.
s = pd.Series([-1, 10, 2], index = ["a", "b", "c"])
s
a -1 b 10 c 2 dtype: int64
s.values
array([-1, 10, 2])
s.index
Index(['a', 'b', 'c'], dtype='object')
After it has been created, we can reassign the Index of a Series to a new Index.
s.index = ["first", "second", "third"]
s
first -1 second 10 third 2 dtype: int64
We can select a single value or a set of values in a Series using:
s = pd.Series([4, -2, 0, 6], index = ["a", "b", "c", "d"])
s
a 4 b -2 c 0 d 6 dtype: int64
Selection using one or more label(s)
# Selection using a single label
# Notice how the return value is a single array element
s["a"]
4
# Selection using a list of labels
# Notice how the return value is another Series
s[["a", "c"]]
a 4 c 0 dtype: int64
Selection using a filter condition
# Filter condition: select all elements greater than 0
s>0
a True b False c False d True dtype: bool
# Use the Boolean filter to select data from the original Series
s[s>0]
a 4 d 6 dtype: int64
A DataFrame is a 2-D tabular data structure with both row and column labels. In this lecture, we will see how a DataFrame can be created from scratch or loaded from a file.
DataFrame¶For loading data into a DataFrame, pandas has a number of very useful file reading tools. We'll be using read_csv today to load data from a csv file into a DataFrame object.
elections = pd.read_csv("data/elections.csv")
elections
| Year | Candidate | Party | Popular vote | Result | % | |
|---|---|---|---|---|---|---|
| 0 | 1824 | Andrew Jackson | Democratic-Republican | 151271 | loss | 57.210122 | 
| 1 | 1824 | John Quincy Adams | Democratic-Republican | 113142 | win | 42.789878 | 
| 2 | 1828 | Andrew Jackson | Democratic | 642806 | win | 56.203927 | 
| 3 | 1828 | John Quincy Adams | National Republican | 500897 | loss | 43.796073 | 
| 4 | 1832 | Andrew Jackson | Democratic | 702735 | win | 54.574789 | 
| ... | ... | ... | ... | ... | ... | ... | 
| 177 | 2016 | Jill Stein | Green | 1457226 | loss | 1.073699 | 
| 178 | 2020 | Joseph Biden | Democratic | 81268924 | win | 51.311515 | 
| 179 | 2020 | Donald Trump | Republican | 74216154 | loss | 46.858542 | 
| 180 | 2020 | Jo Jorgensen | Libertarian | 1865724 | loss | 1.177979 | 
| 181 | 2020 | Howard Hawkins | Green | 405035 | loss | 0.255731 | 
182 rows × 6 columns
DataFrame object¶We can also create a DataFrame in variety of ways. Here we cover the following:
Series# Creating a single-column DataFrame using a list
df_list_1 = pd.DataFrame([1, 2, 3], columns = ["Number"])
display(df_list_1)
| Number | |
|---|---|
| 0 | 1 | 
| 1 | 2 | 
| 2 | 3 | 
# Creating a multi-column DataFrame using a list of lists
df_list_2 = pd.DataFrame([[1, "one"], [2, "two"]], columns = ["Number", "Description"])
df_list_2
| Number | Description | |
|---|---|---|
| 0 | 1 | one | 
| 1 | 2 | two | 
# Creating a DataFrame from a dictionary of columns
df_dict_1 = pd.DataFrame({"Fruit":["Strawberry", "Orange"], "Price":[5.49, 3.99]})
df_dict_1
| Fruit | Price | |
|---|---|---|
| 0 | Strawberry | 5.49 | 
| 1 | Orange | 3.99 | 
# Creating a DataFrame from a list of row dictionaries
df_dict_2 = pd.DataFrame([{"Fruit":"Strawberry", "Price":5.49}, 
                   {"Fruit":"Orange", "Price":3.99}])
df_dict_2
| Fruit | Price | |
|---|---|---|
| 0 | Strawberry | 5.49 | 
| 1 | Orange | 3.99 | 
# In the examples below, we create a DataFrame from a Series
s_a = pd.Series(["a1", "a2", "a3"], index = ["r1", "r2", "r3"])
s_b = pd.Series(["b1", "b2", "b3"], index = ["r1", "r2", "r3"])
# Passing Series objects for columns
df_ser = pd.DataFrame({"A-column":s_a, "B-column":s_b})
df_ser
| A-column | B-column | |
|---|---|---|
| r1 | a1 | b1 | 
| r2 | a2 | b2 | 
| r3 | a3 | b3 | 
# Passing a Series to the DataFrame constructor to make a one-column dataframe
df_ser = pd.DataFrame(s_a)
df_ser
| 0 | |
|---|---|
| r1 | a1 | 
| r2 | a2 | 
| r3 | a3 | 
# Using to_frame() to convert a Series to DataFrame
ser_to_df = s_a.to_frame()
ser_to_df
| 0 | |
|---|---|
| r1 | a1 | 
| r2 | a2 | 
| r3 | a3 | 
# Creating a DataFrame from a csv file and specifying the index column
mottos = pd.read_csv("data/mottos.csv", index_col = "State")
mottos.head(5) # Using `.head` shows only the first 5 rows to save space
| Motto | Translation | Language | Date Adopted | |
|---|---|---|---|---|
| State | ||||
| Alabama | Audemus jura nostra defendere | We dare defend our rights! | Latin | 1923 | 
| Alaska | North to the future | — | English | 1967 | 
| Arizona | Ditat Deus | God enriches | Latin | 1863 | 
| Arkansas | Regnat populus | The people rule | Latin | 1907 | 
| California | Eureka (Εὕρηκα) | I have found it | Greek | 1849 | 
elections.set_index("Candidate", inplace=True) # This sets the index to the "Candidate" column
elections
| Year | Party | Popular vote | Result | % | |
|---|---|---|---|---|---|
| Candidate | |||||
| Andrew Jackson | 1824 | Democratic-Republican | 151271 | loss | 57.210122 | 
| John Quincy Adams | 1824 | Democratic-Republican | 113142 | win | 42.789878 | 
| Andrew Jackson | 1828 | Democratic | 642806 | win | 56.203927 | 
| John Quincy Adams | 1828 | National Republican | 500897 | loss | 43.796073 | 
| Andrew Jackson | 1832 | Democratic | 702735 | win | 54.574789 | 
| ... | ... | ... | ... | ... | ... | 
| Jill Stein | 2016 | Green | 1457226 | loss | 1.073699 | 
| Joseph Biden | 2020 | Democratic | 81268924 | win | 51.311515 | 
| Donald Trump | 2020 | Republican | 74216154 | loss | 46.858542 | 
| Jo Jorgensen | 2020 | Libertarian | 1865724 | loss | 1.177979 | 
| Howard Hawkins | 2020 | Green | 405035 | loss | 0.255731 | 
182 rows × 5 columns
elections.reset_index(inplace=True) # Revert the index back to its default numeric labeling
elections
| Candidate | Year | Party | Popular vote | Result | % | |
|---|---|---|---|---|---|---|
| 0 | Andrew Jackson | 1824 | Democratic-Republican | 151271 | loss | 57.210122 | 
| 1 | John Quincy Adams | 1824 | Democratic-Republican | 113142 | win | 42.789878 | 
| 2 | Andrew Jackson | 1828 | Democratic | 642806 | win | 56.203927 | 
| 3 | John Quincy Adams | 1828 | National Republican | 500897 | loss | 43.796073 | 
| 4 | Andrew Jackson | 1832 | Democratic | 702735 | win | 54.574789 | 
| ... | ... | ... | ... | ... | ... | ... | 
| 177 | Jill Stein | 2016 | Green | 1457226 | loss | 1.073699 | 
| 178 | Joseph Biden | 2020 | Democratic | 81268924 | win | 51.311515 | 
| 179 | Donald Trump | 2020 | Republican | 74216154 | loss | 46.858542 | 
| 180 | Jo Jorgensen | 2020 | Libertarian | 1865724 | loss | 1.177979 | 
| 181 | Howard Hawkins | 2020 | Green | 405035 | loss | 0.255731 | 
182 rows × 6 columns
DataFrames¶We can use .head to return only a few rows of a dataframe.
elections.head() # By default, calling .head with no argument will show the first 5 rows
| Candidate | Year | Party | Popular vote | Result | % | |
|---|---|---|---|---|---|---|
| 0 | Andrew Jackson | 1824 | Democratic-Republican | 151271 | loss | 57.210122 | 
| 1 | John Quincy Adams | 1824 | Democratic-Republican | 113142 | win | 42.789878 | 
| 2 | Andrew Jackson | 1828 | Democratic | 642806 | win | 56.203927 | 
| 3 | John Quincy Adams | 1828 | National Republican | 500897 | loss | 43.796073 | 
| 4 | Andrew Jackson | 1832 | Democratic | 702735 | win | 54.574789 | 
elections.head(3)
| Candidate | Year | Party | Popular vote | Result | % | |
|---|---|---|---|---|---|---|
| 0 | Andrew Jackson | 1824 | Democratic-Republican | 151271 | loss | 57.210122 | 
| 1 | John Quincy Adams | 1824 | Democratic-Republican | 113142 | win | 42.789878 | 
| 2 | Andrew Jackson | 1828 | Democratic | 642806 | win | 56.203927 | 
We can also use .tail to get the last so many rows.
elections.tail(5)
| Candidate | Year | Party | Popular vote | Result | % | |
|---|---|---|---|---|---|---|
| 177 | Jill Stein | 2016 | Green | 1457226 | loss | 1.073699 | 
| 178 | Joseph Biden | 2020 | Democratic | 81268924 | win | 51.311515 | 
| 179 | Donald Trump | 2020 | Republican | 74216154 | loss | 46.858542 | 
| 180 | Jo Jorgensen | 2020 | Libertarian | 1865724 | loss | 1.177979 | 
| 181 | Howard Hawkins | 2020 | Green | 405035 | loss | 0.255731 | 
loc¶loc selects items by row and column label.
elections.loc[[87, 25, 179], ["Year", "Candidate", "Result"]]
| Year | Candidate | Result | |
|---|---|---|---|
| 87 | 1932 | Herbert Hoover | loss | 
| 25 | 1860 | John C. Breckinridge | loss | 
| 179 | 2020 | Donald Trump | loss | 
elections.loc[[87, 25, 179], "Popular vote":"%"]
| Popular vote | Result | % | |
|---|---|---|---|
| 87 | 15761254 | loss | 39.830594 | 
| 25 | 848019 | loss | 18.138998 | 
| 179 | 74216154 | loss | 46.858542 | 
elections.loc[[87, 25, 179], "Popular vote"]
87 15761254 25 848019 179 74216154 Name: Popular vote, dtype: int64
elections.loc[0, "Candidate"]
'Andrew Jackson'
elections.loc[:, ["Year", "Candidate", "Result"]]
| Year | Candidate | Result | |
|---|---|---|---|
| 0 | 1824 | Andrew Jackson | loss | 
| 1 | 1824 | John Quincy Adams | win | 
| 2 | 1828 | Andrew Jackson | win | 
| 3 | 1828 | John Quincy Adams | loss | 
| 4 | 1832 | Andrew Jackson | win | 
| ... | ... | ... | ... | 
| 177 | 2016 | Jill Stein | loss | 
| 178 | 2020 | Joseph Biden | win | 
| 179 | 2020 | Donald Trump | loss | 
| 180 | 2020 | Jo Jorgensen | loss | 
| 181 | 2020 | Howard Hawkins | loss | 
182 rows × 3 columns
iloc¶iloc selects items by row and column integer position.
# Select the rows at positions 1, 2, and 3
# Select the columns at positions 0, 1, and 2
# Remember that Python indexing begins at position 0!
elections.iloc[[1, 2, 3], [0, 1, 2]]
| Candidate | Year | Party | |
|---|---|---|---|
| 1 | John Quincy Adams | 1824 | Democratic-Republican | 
| 2 | Andrew Jackson | 1828 | Democratic | 
| 3 | John Quincy Adams | 1828 | National Republican | 
elections.iloc[[1, 2, 3], 0:3]
| Candidate | Year | Party | |
|---|---|---|---|
| 1 | John Quincy Adams | 1824 | Democratic-Republican | 
| 2 | Andrew Jackson | 1828 | Democratic | 
| 3 | John Quincy Adams | 1828 | National Republican | 
elections.iloc[[1, 2, 3], 1]
1 1824 2 1828 3 1828 Name: Year, dtype: int64
elections.iloc[:, 0:3]
| Candidate | Year | Party | |
|---|---|---|---|
| 0 | Andrew Jackson | 1824 | Democratic-Republican | 
| 1 | John Quincy Adams | 1824 | Democratic-Republican | 
| 2 | Andrew Jackson | 1828 | Democratic | 
| 3 | John Quincy Adams | 1828 | National Republican | 
| 4 | Andrew Jackson | 1832 | Democratic | 
| ... | ... | ... | ... | 
| 177 | Jill Stein | 2016 | Green | 
| 178 | Joseph Biden | 2020 | Democratic | 
| 179 | Donald Trump | 2020 | Republican | 
| 180 | Jo Jorgensen | 2020 | Libertarian | 
| 181 | Howard Hawkins | 2020 | Green | 
182 rows × 3 columns
[]¶We could technically do anything we want using loc or iloc. However, in practice, the [] operator is often used instead to yield more concise code.
[] is a bit trickier to understand than loc or iloc, but it achieves essentially the same functionality. The difference is that [] is context-dependent.
If we provide a slice of row numbers, we get the numbered rows.
elections[3:7]
| Candidate | Year | Party | Popular vote | Result | % | |
|---|---|---|---|---|---|---|
| 3 | John Quincy Adams | 1828 | National Republican | 500897 | loss | 43.796073 | 
| 4 | Andrew Jackson | 1832 | Democratic | 702735 | win | 54.574789 | 
| 5 | Henry Clay | 1832 | National Republican | 484205 | loss | 37.603628 | 
| 6 | William Wirt | 1832 | Anti-Masonic | 100715 | loss | 7.821583 | 
If we provide a list of column names, we get the listed columns.
elections[["Year", "Candidate", "Result"]]
| Year | Candidate | Result | |
|---|---|---|---|
| 0 | 1824 | Andrew Jackson | loss | 
| 1 | 1824 | John Quincy Adams | win | 
| 2 | 1828 | Andrew Jackson | win | 
| 3 | 1828 | John Quincy Adams | loss | 
| 4 | 1832 | Andrew Jackson | win | 
| ... | ... | ... | ... | 
| 177 | 2016 | Jill Stein | loss | 
| 178 | 2020 | Joseph Biden | win | 
| 179 | 2020 | Donald Trump | loss | 
| 180 | 2020 | Jo Jorgensen | loss | 
| 181 | 2020 | Howard Hawkins | loss | 
182 rows × 3 columns
And if we provide a single column name we get back just that column, stored as a Series.
elections["Candidate"]
0         Andrew Jackson
1      John Quincy Adams
2         Andrew Jackson
3      John Quincy Adams
4         Andrew Jackson
             ...        
177           Jill Stein
178         Joseph Biden
179         Donald Trump
180         Jo Jorgensen
181       Howard Hawkins
Name: Candidate, Length: 182, dtype: object
example = pd.Series([4, 5, 6], index=["one", "two", "three"])
example[example > 4].values
array([5, 6])
df = pd.DataFrame({"c1":[1, 2, 3, 4], "c2":[2, 4, 6, 8]})
df.columns
Index(['c1', 'c2'], dtype='object')
weird = pd.DataFrame({
    1:["topdog","botdog"], 
    "1":["topcat","botcat"]
})
weird
| 1 | 1 | |
|---|---|---|
| 0 | topdog | topcat | 
| 1 | botdog | botcat | 
weird[1]
0 topdog 1 botdog Name: 1, dtype: object
weird["1"]
0 topcat 1 botcat Name: 1, dtype: object
weird[1:]
| 1 | 1 | |
|---|---|---|
| 1 | botdog | botcat |