# `pd` is the conventional alias for Pandas, as `np` is for NumPy
import pandas as pd
Series, DataFrames, and Indices are fundamental pandas
data structures for storing tabular data and processing the data using vectorized operations.
s = pd.Series([-1, 10, 2])
s
0 -1 1 10 2 2 dtype: int64
s.values
array([-1, 10, 2])
s.index
RangeIndex(start=0, stop=3, step=1)
In the example above, pandas
automatically generated an Index of integer labels. We can also create a Series
object by providing a custom Index.
s = pd.Series([-1, 10, 2], index = ["a", "b", "c"])
s
a -1 b 10 c 2 dtype: int64
s.values
array([-1, 10, 2])
s.index
Index(['a', 'b', 'c'], dtype='object')
After it has been created, we can reassign the Index of a Series
to a new Index.
s.index = ["first", "second", "third"]
s
first -1 second 10 third 2 dtype: int64
We can select a single value or a set of values in a Series
using:
s = pd.Series([4, -2, 0, 6], index = ["a", "b", "c", "d"])
s
a 4 b -2 c 0 d 6 dtype: int64
Selection using one or more label(s)
# Selection using a single label
# Notice how the return value is a single array element
s["a"]
4
# Selection using a list of labels
# Notice how the return value is another Series
s[["a", "c"]]
a 4 c 0 dtype: int64
Selection using a filter condition
# Filter condition: select all elements greater than 0
s>0
a True b False c False d True dtype: bool
# Use the Boolean filter to select data from the original Series
s[s>0]
a 4 d 6 dtype: int64
A DataFrame
is a 2-D tabular data structure with both row and column labels. In this lecture, we will see how a DataFrame
can be created from scratch or loaded from a file.
DataFrame
¶For loading data into a DataFrame
, pandas
has a number of very useful file reading tools. We'll be using read_csv
today to load data from a csv file into a DataFrame
object.
elections = pd.read_csv("data/elections.csv")
elections
Year | Candidate | Party | Popular vote | Result | % | |
---|---|---|---|---|---|---|
0 | 1824 | Andrew Jackson | Democratic-Republican | 151271 | loss | 57.210122 |
1 | 1824 | John Quincy Adams | Democratic-Republican | 113142 | win | 42.789878 |
2 | 1828 | Andrew Jackson | Democratic | 642806 | win | 56.203927 |
3 | 1828 | John Quincy Adams | National Republican | 500897 | loss | 43.796073 |
4 | 1832 | Andrew Jackson | Democratic | 702735 | win | 54.574789 |
... | ... | ... | ... | ... | ... | ... |
177 | 2016 | Jill Stein | Green | 1457226 | loss | 1.073699 |
178 | 2020 | Joseph Biden | Democratic | 81268924 | win | 51.311515 |
179 | 2020 | Donald Trump | Republican | 74216154 | loss | 46.858542 |
180 | 2020 | Jo Jorgensen | Libertarian | 1865724 | loss | 1.177979 |
181 | 2020 | Howard Hawkins | Green | 405035 | loss | 0.255731 |
182 rows × 6 columns
DataFrame
object¶We can also create a DataFrame
in variety of ways. Here we cover the following:
Series
# Creating a single-column DataFrame using a list
df_list_1 = pd.DataFrame([1, 2, 3], columns = ["Number"])
display(df_list_1)
Number | |
---|---|
0 | 1 |
1 | 2 |
2 | 3 |
# Creating a multi-column DataFrame using a list of lists
df_list_2 = pd.DataFrame([[1, "one"], [2, "two"]], columns = ["Number", "Description"])
df_list_2
Number | Description | |
---|---|---|
0 | 1 | one |
1 | 2 | two |
# Creating a DataFrame from a dictionary of columns
df_dict_1 = pd.DataFrame({"Fruit":["Strawberry", "Orange"], "Price":[5.49, 3.99]})
df_dict_1
Fruit | Price | |
---|---|---|
0 | Strawberry | 5.49 |
1 | Orange | 3.99 |
# Creating a DataFrame from a list of row dictionaries
df_dict_2 = pd.DataFrame([{"Fruit":"Strawberry", "Price":5.49},
{"Fruit":"Orange", "Price":3.99}])
df_dict_2
Fruit | Price | |
---|---|---|
0 | Strawberry | 5.49 |
1 | Orange | 3.99 |
# In the examples below, we create a DataFrame from a Series
s_a = pd.Series(["a1", "a2", "a3"], index = ["r1", "r2", "r3"])
s_b = pd.Series(["b1", "b2", "b3"], index = ["r1", "r2", "r3"])
# Passing Series objects for columns
df_ser = pd.DataFrame({"A-column":s_a, "B-column":s_b})
df_ser
A-column | B-column | |
---|---|---|
r1 | a1 | b1 |
r2 | a2 | b2 |
r3 | a3 | b3 |
# Passing a Series to the DataFrame constructor to make a one-column dataframe
df_ser = pd.DataFrame(s_a)
df_ser
0 | |
---|---|
r1 | a1 |
r2 | a2 |
r3 | a3 |
# Using to_frame() to convert a Series to DataFrame
ser_to_df = s_a.to_frame()
ser_to_df
0 | |
---|---|
r1 | a1 |
r2 | a2 |
r3 | a3 |
# Creating a DataFrame from a csv file and specifying the index column
mottos = pd.read_csv("data/mottos.csv", index_col = "State")
mottos.head(5) # Using `.head` shows only the first 5 rows to save space
Motto | Translation | Language | Date Adopted | |
---|---|---|---|---|
State | ||||
Alabama | Audemus jura nostra defendere | We dare defend our rights! | Latin | 1923 |
Alaska | North to the future | — | English | 1967 |
Arizona | Ditat Deus | God enriches | Latin | 1863 |
Arkansas | Regnat populus | The people rule | Latin | 1907 |
California | Eureka (Εὕρηκα) | I have found it | Greek | 1849 |
elections.set_index("Candidate", inplace=True) # This sets the index to the "Candidate" column
elections
Year | Party | Popular vote | Result | % | |
---|---|---|---|---|---|
Candidate | |||||
Andrew Jackson | 1824 | Democratic-Republican | 151271 | loss | 57.210122 |
John Quincy Adams | 1824 | Democratic-Republican | 113142 | win | 42.789878 |
Andrew Jackson | 1828 | Democratic | 642806 | win | 56.203927 |
John Quincy Adams | 1828 | National Republican | 500897 | loss | 43.796073 |
Andrew Jackson | 1832 | Democratic | 702735 | win | 54.574789 |
... | ... | ... | ... | ... | ... |
Jill Stein | 2016 | Green | 1457226 | loss | 1.073699 |
Joseph Biden | 2020 | Democratic | 81268924 | win | 51.311515 |
Donald Trump | 2020 | Republican | 74216154 | loss | 46.858542 |
Jo Jorgensen | 2020 | Libertarian | 1865724 | loss | 1.177979 |
Howard Hawkins | 2020 | Green | 405035 | loss | 0.255731 |
182 rows × 5 columns
elections.reset_index(inplace=True) # Revert the index back to its default numeric labeling
elections
Candidate | Year | Party | Popular vote | Result | % | |
---|---|---|---|---|---|---|
0 | Andrew Jackson | 1824 | Democratic-Republican | 151271 | loss | 57.210122 |
1 | John Quincy Adams | 1824 | Democratic-Republican | 113142 | win | 42.789878 |
2 | Andrew Jackson | 1828 | Democratic | 642806 | win | 56.203927 |
3 | John Quincy Adams | 1828 | National Republican | 500897 | loss | 43.796073 |
4 | Andrew Jackson | 1832 | Democratic | 702735 | win | 54.574789 |
... | ... | ... | ... | ... | ... | ... |
177 | Jill Stein | 2016 | Green | 1457226 | loss | 1.073699 |
178 | Joseph Biden | 2020 | Democratic | 81268924 | win | 51.311515 |
179 | Donald Trump | 2020 | Republican | 74216154 | loss | 46.858542 |
180 | Jo Jorgensen | 2020 | Libertarian | 1865724 | loss | 1.177979 |
181 | Howard Hawkins | 2020 | Green | 405035 | loss | 0.255731 |
182 rows × 6 columns
DataFrame
s¶We can use .head
to return only a few rows of a dataframe.
elections.head() # By default, calling .head with no argument will show the first 5 rows
Candidate | Year | Party | Popular vote | Result | % | |
---|---|---|---|---|---|---|
0 | Andrew Jackson | 1824 | Democratic-Republican | 151271 | loss | 57.210122 |
1 | John Quincy Adams | 1824 | Democratic-Republican | 113142 | win | 42.789878 |
2 | Andrew Jackson | 1828 | Democratic | 642806 | win | 56.203927 |
3 | John Quincy Adams | 1828 | National Republican | 500897 | loss | 43.796073 |
4 | Andrew Jackson | 1832 | Democratic | 702735 | win | 54.574789 |
elections.head(3)
Candidate | Year | Party | Popular vote | Result | % | |
---|---|---|---|---|---|---|
0 | Andrew Jackson | 1824 | Democratic-Republican | 151271 | loss | 57.210122 |
1 | John Quincy Adams | 1824 | Democratic-Republican | 113142 | win | 42.789878 |
2 | Andrew Jackson | 1828 | Democratic | 642806 | win | 56.203927 |
We can also use .tail
to get the last so many rows.
elections.tail(5)
Candidate | Year | Party | Popular vote | Result | % | |
---|---|---|---|---|---|---|
177 | Jill Stein | 2016 | Green | 1457226 | loss | 1.073699 |
178 | Joseph Biden | 2020 | Democratic | 81268924 | win | 51.311515 |
179 | Donald Trump | 2020 | Republican | 74216154 | loss | 46.858542 |
180 | Jo Jorgensen | 2020 | Libertarian | 1865724 | loss | 1.177979 |
181 | Howard Hawkins | 2020 | Green | 405035 | loss | 0.255731 |
loc
¶loc
selects items by row and column label.
elections.loc[[87, 25, 179], ["Year", "Candidate", "Result"]]
Year | Candidate | Result | |
---|---|---|---|
87 | 1932 | Herbert Hoover | loss |
25 | 1860 | John C. Breckinridge | loss |
179 | 2020 | Donald Trump | loss |
elections.loc[[87, 25, 179], "Popular vote":"%"]
Popular vote | Result | % | |
---|---|---|---|
87 | 15761254 | loss | 39.830594 |
25 | 848019 | loss | 18.138998 |
179 | 74216154 | loss | 46.858542 |
elections.loc[[87, 25, 179], "Popular vote"]
87 15761254 25 848019 179 74216154 Name: Popular vote, dtype: int64
elections.loc[0, "Candidate"]
'Andrew Jackson'
elections.loc[:, ["Year", "Candidate", "Result"]]
Year | Candidate | Result | |
---|---|---|---|
0 | 1824 | Andrew Jackson | loss |
1 | 1824 | John Quincy Adams | win |
2 | 1828 | Andrew Jackson | win |
3 | 1828 | John Quincy Adams | loss |
4 | 1832 | Andrew Jackson | win |
... | ... | ... | ... |
177 | 2016 | Jill Stein | loss |
178 | 2020 | Joseph Biden | win |
179 | 2020 | Donald Trump | loss |
180 | 2020 | Jo Jorgensen | loss |
181 | 2020 | Howard Hawkins | loss |
182 rows × 3 columns
iloc
¶iloc
selects items by row and column integer position.
# Select the rows at positions 1, 2, and 3
# Select the columns at positions 0, 1, and 2
# Remember that Python indexing begins at position 0!
elections.iloc[[1, 2, 3], [0, 1, 2]]
Candidate | Year | Party | |
---|---|---|---|
1 | John Quincy Adams | 1824 | Democratic-Republican |
2 | Andrew Jackson | 1828 | Democratic |
3 | John Quincy Adams | 1828 | National Republican |
elections.iloc[[1, 2, 3], 0:3]
Candidate | Year | Party | |
---|---|---|---|
1 | John Quincy Adams | 1824 | Democratic-Republican |
2 | Andrew Jackson | 1828 | Democratic |
3 | John Quincy Adams | 1828 | National Republican |
elections.iloc[[1, 2, 3], 1]
1 1824 2 1828 3 1828 Name: Year, dtype: int64
elections.iloc[:, 0:3]
Candidate | Year | Party | |
---|---|---|---|
0 | Andrew Jackson | 1824 | Democratic-Republican |
1 | John Quincy Adams | 1824 | Democratic-Republican |
2 | Andrew Jackson | 1828 | Democratic |
3 | John Quincy Adams | 1828 | National Republican |
4 | Andrew Jackson | 1832 | Democratic |
... | ... | ... | ... |
177 | Jill Stein | 2016 | Green |
178 | Joseph Biden | 2020 | Democratic |
179 | Donald Trump | 2020 | Republican |
180 | Jo Jorgensen | 2020 | Libertarian |
181 | Howard Hawkins | 2020 | Green |
182 rows × 3 columns
[]
¶We could technically do anything we want using loc
or iloc
. However, in practice, the []
operator is often used instead to yield more concise code.
[]
is a bit trickier to understand than loc
or iloc
, but it achieves essentially the same functionality. The difference is that []
is context-dependent.
If we provide a slice of row numbers, we get the numbered rows.
elections[3:7]
Candidate | Year | Party | Popular vote | Result | % | |
---|---|---|---|---|---|---|
3 | John Quincy Adams | 1828 | National Republican | 500897 | loss | 43.796073 |
4 | Andrew Jackson | 1832 | Democratic | 702735 | win | 54.574789 |
5 | Henry Clay | 1832 | National Republican | 484205 | loss | 37.603628 |
6 | William Wirt | 1832 | Anti-Masonic | 100715 | loss | 7.821583 |
If we provide a list of column names, we get the listed columns.
elections[["Year", "Candidate", "Result"]]
Year | Candidate | Result | |
---|---|---|---|
0 | 1824 | Andrew Jackson | loss |
1 | 1824 | John Quincy Adams | win |
2 | 1828 | Andrew Jackson | win |
3 | 1828 | John Quincy Adams | loss |
4 | 1832 | Andrew Jackson | win |
... | ... | ... | ... |
177 | 2016 | Jill Stein | loss |
178 | 2020 | Joseph Biden | win |
179 | 2020 | Donald Trump | loss |
180 | 2020 | Jo Jorgensen | loss |
181 | 2020 | Howard Hawkins | loss |
182 rows × 3 columns
And if we provide a single column name we get back just that column, stored as a Series
.
elections["Candidate"]
0 Andrew Jackson 1 John Quincy Adams 2 Andrew Jackson 3 John Quincy Adams 4 Andrew Jackson ... 177 Jill Stein 178 Joseph Biden 179 Donald Trump 180 Jo Jorgensen 181 Howard Hawkins Name: Candidate, Length: 182, dtype: object
example = pd.Series([4, 5, 6], index=["one", "two", "three"])
example[example > 4].values
array([5, 6])
df = pd.DataFrame({"c1":[1, 2, 3, 4], "c2":[2, 4, 6, 8]})
df.columns
Index(['c1', 'c2'], dtype='object')
weird = pd.DataFrame({
1:["topdog","botdog"],
"1":["topcat","botcat"]
})
weird
1 | 1 | |
---|---|---|
0 | topdog | topcat |
1 | botdog | botcat |
weird[1]
0 topdog 1 botdog Name: 1, dtype: object
weird["1"]
0 topcat 1 botcat Name: 1, dtype: object
weird[1:]
1 | 1 | |
---|---|---|
1 | botdog | botcat |