# `pd` is the conventional alias for Pandas, as `np` is for NumPy
import pandas as pd
Series, DataFrames, and Indices are fundamental pandas data structures for storing tabular data and processing the data using vectorized operations.
s = pd.Series(["welcome", "to", "data 100"])
s
0 welcome 1 to 2 data 100 dtype: object
s.values
array(['welcome', 'to', 'data 100'], dtype=object)
s.index
RangeIndex(start=0, stop=3, step=1)
In the example above, pandas automatically generated an Index of integer labels. We can also create a Series object by providing a custom Index.
s = pd.Series([-1, 10, 2], index = ["a", "b", "c"])
s
a -1 b 10 c 2 dtype: int64
s.values
array([-1, 10, 2])
s.index
Index(['a', 'b', 'c'], dtype='object')
After it has been created, we can reassign the Index of a Series to a new Index.
s.index = ["first", "second", "third"]
s
first -1 second 10 third 2 dtype: int64
We can select a single value or a set of values in a Series using:
s = pd.Series([4, -2, 0, 6], index = ["a", "b", "c", "d"])
s
a 4 b -2 c 0 d 6 dtype: int64
Selection using one or more label(s)
# Selection using a single label
# Notice how the return value is a single array element
s["a"]
4
# Selection using a list of labels
# Notice how the return value is another Series
s[["a", "c"]]
a 4 c 0 dtype: int64
Selection using a filter condition
# Filter condition: select all elements greater than 0
s>0
a True b False c False d True dtype: bool
# Use the Boolean filter to select data from the original Series
s[s>0]
a 4 d 6 dtype: int64
A DataFrame is a 2-D tabular data structure with both row and column labels. In this lecture, we will see how a DataFrame can be created from scratch or loaded from a file.
DataFrame object¶We can also create a DataFrame in a variety of ways. Here, we cover the following:
SeriesDataFrame from a CSV file¶For loading data into a DataFrame, pandas has a number of very useful file reading tools. We'll be using read_csv today to load data from a CSV file into a DataFrame object.
elections = pd.read_csv("data/elections.csv")
elections
| Year | Candidate | Party | Popular vote | Result | % | |
|---|---|---|---|---|---|---|
| 0 | 1824 | Andrew Jackson | Democratic-Republican | 151271 | loss | 57.210122 | 
| 1 | 1824 | John Quincy Adams | Democratic-Republican | 113142 | win | 42.789878 | 
| 2 | 1828 | Andrew Jackson | Democratic | 642806 | win | 56.203927 | 
| 3 | 1828 | John Quincy Adams | National Republican | 500897 | loss | 43.796073 | 
| 4 | 1832 | Andrew Jackson | Democratic | 702735 | win | 54.574789 | 
| ... | ... | ... | ... | ... | ... | ... | 
| 177 | 2016 | Jill Stein | Green | 1457226 | loss | 1.073699 | 
| 178 | 2020 | Joseph Biden | Democratic | 81268924 | win | 51.311515 | 
| 179 | 2020 | Donald Trump | Republican | 74216154 | loss | 46.858542 | 
| 180 | 2020 | Jo Jorgensen | Libertarian | 1865724 | loss | 1.177979 | 
| 181 | 2020 | Howard Hawkins | Green | 405035 | loss | 0.255731 | 
182 rows × 6 columns
By passing a column to the index_col attribute, the Index can be defined at the initialization.
elections = pd.read_csv("data/elections.csv", index_col = "Candidate")
elections
| Candidate | Party | Popular vote | Result | % | |
|---|---|---|---|---|---|
| Year | |||||
| 1824 | Andrew Jackson | Democratic-Republican | 151271 | loss | 57.210122 | 
| 1824 | John Quincy Adams | Democratic-Republican | 113142 | win | 42.789878 | 
| 1828 | Andrew Jackson | Democratic | 642806 | win | 56.203927 | 
| 1828 | John Quincy Adams | National Republican | 500897 | loss | 43.796073 | 
| 1832 | Andrew Jackson | Democratic | 702735 | win | 54.574789 | 
| ... | ... | ... | ... | ... | ... | 
| 2016 | Jill Stein | Green | 1457226 | loss | 1.073699 | 
| 2020 | Joseph Biden | Democratic | 81268924 | win | 51.311515 | 
| 2020 | Donald Trump | Republican | 74216154 | loss | 46.858542 | 
| 2020 | Jo Jorgensen | Libertarian | 1865724 | loss | 1.177979 | 
| 2020 | Howard Hawkins | Green | 405035 | loss | 0.255731 | 
182 rows × 5 columns
DataFrame using a list and column names¶# Creating a single-column DataFrame using a list
df_list_1 = pd.DataFrame([1, 2, 3], 
                         columns = ["Number"])
display(df_list_1)
| Number | |
|---|---|
| 0 | 1 | 
| 1 | 2 | 
| 2 | 3 | 
# Creating a multi-column DataFrame using a list of lists
df_list_2 = pd.DataFrame([[1, "one"], [2, "two"]], 
                         columns = ["Number", "Description"])
df_list_2
| Number | Description | |
|---|---|---|
| 0 | 1 | one | 
| 1 | 2 | two | 
DataFrame from a dictionary¶# Creating a DataFrame from a dictionary of columns
df_dict_1 = pd.DataFrame({"Fruit":["Strawberry", "Orange"], 
                          "Price":[5.49, 3.99]})
df_dict_1
| Fruit | Price | |
|---|---|---|
| 0 | Strawberry | 5.49 | 
| 1 | Orange | 3.99 | 
# Creating a DataFrame from a list of row dictionaries
df_dict_2 = pd.DataFrame([{"Fruit":"Strawberry", "Price":5.49}, 
                          {"Fruit":"Orange", "Price":3.99}])
df_dict_2
| Fruit | Price | |
|---|---|---|
| 0 | Strawberry | 5.49 | 
| 1 | Orange | 3.99 | 
DataFrame from a Series¶# In the examples below, we create a DataFrame from a Series
s_a = pd.Series(["a1", "a2", "a3"], index = ["r1", "r2", "r3"])
s_b = pd.Series(["b1", "b2", "b3"], index = ["r1", "r2", "r3"])
# Passing Series objects for columns
df_ser = pd.DataFrame({"A-column":s_a, "B-column":s_b})
df_ser
| A-column | B-column | |
|---|---|---|
| r1 | a1 | b1 | 
| r2 | a2 | b2 | 
| r3 | a3 | b3 | 
# Passing a Series to the DataFrame constructor to make a one-column dataframe
df_ser = pd.DataFrame(s_a)
df_ser
| 0 | |
|---|---|
| r1 | a1 | 
| r2 | a2 | 
| r3 | a3 | 
# Using to_frame() to convert a Series to DataFrame
ser_to_df = s_a.to_frame()
ser_to_df
| 0 | |
|---|---|
| r1 | a1 | 
| r2 | a2 | 
| r3 | a3 | 
# Creating a DataFrame from a CSV file and specifying the Index column
elections = pd.read_csv("data/elections.csv", index_col = "Candidate")
elections.head(5) # Using `.head` shows only the first 5 rows to save space
| Year | Party | Popular vote | Result | % | |
|---|---|---|---|---|---|
| Candidate | |||||
| Andrew Jackson | 1824 | Democratic-Republican | 151271 | loss | 57.210122 | 
| John Quincy Adams | 1824 | Democratic-Republican | 113142 | win | 42.789878 | 
| Andrew Jackson | 1828 | Democratic | 642806 | win | 56.203927 | 
| John Quincy Adams | 1828 | National Republican | 500897 | loss | 43.796073 | 
| Andrew Jackson | 1832 | Democratic | 702735 | win | 54.574789 | 
elections.reset_index(inplace = True) # Need to reset the index to keep 'Candidate' as one of the DataFrane Columns
elections.set_index("Party", inplace=True) # This sets the index to the "Candidate" column
elections
| Candidate | Year | Popular vote | Result | % | |
|---|---|---|---|---|---|
| Party | |||||
| Democratic-Republican | Andrew Jackson | 1824 | 151271 | loss | 57.210122 | 
| Democratic-Republican | John Quincy Adams | 1824 | 113142 | win | 42.789878 | 
| Democratic | Andrew Jackson | 1828 | 642806 | win | 56.203927 | 
| National Republican | John Quincy Adams | 1828 | 500897 | loss | 43.796073 | 
| Democratic | Andrew Jackson | 1832 | 702735 | win | 54.574789 | 
| ... | ... | ... | ... | ... | ... | 
| Green | Jill Stein | 2016 | 1457226 | loss | 1.073699 | 
| Democratic | Joseph Biden | 2020 | 81268924 | win | 51.311515 | 
| Republican | Donald Trump | 2020 | 74216154 | loss | 46.858542 | 
| Libertarian | Jo Jorgensen | 2020 | 1865724 | loss | 1.177979 | 
| Green | Howard Hawkins | 2020 | 405035 | loss | 0.255731 | 
182 rows × 5 columns
DataFrame attributes: index, columns, and shape¶elections.index
Index(['Democratic-Republican', 'Democratic-Republican', 'Democratic',
       'National Republican', 'Democratic', 'National Republican',
       'Anti-Masonic', 'Whig', 'Democratic', 'Whig',
       ...
       'Constitution', 'Republican', 'Independent', 'Libertarian',
       'Democratic', 'Green', 'Democratic', 'Republican', 'Libertarian',
       'Green'],
      dtype='object', name='Party', length=182)
elections.columns
Index(['Candidate', 'Year', 'Popular vote', 'Result', '%'], dtype='object')
The Index column can be set to the default list of integers by calling reset_index() on a DataFrame.
elections.reset_index(inplace=True) # Revert the index back to its default numeric labeling
elections
| Party | Candidate | Year | Popular vote | Result | % | |
|---|---|---|---|---|---|---|
| 0 | Democratic-Republican | Andrew Jackson | 1824 | 151271 | loss | 57.210122 | 
| 1 | Democratic-Republican | John Quincy Adams | 1824 | 113142 | win | 42.789878 | 
| 2 | Democratic | Andrew Jackson | 1828 | 642806 | win | 56.203927 | 
| 3 | National Republican | John Quincy Adams | 1828 | 500897 | loss | 43.796073 | 
| 4 | Democratic | Andrew Jackson | 1832 | 702735 | win | 54.574789 | 
| ... | ... | ... | ... | ... | ... | ... | 
| 177 | Green | Jill Stein | 2016 | 1457226 | loss | 1.073699 | 
| 178 | Democratic | Joseph Biden | 2020 | 81268924 | win | 51.311515 | 
| 179 | Republican | Donald Trump | 2020 | 74216154 | loss | 46.858542 | 
| 180 | Libertarian | Jo Jorgensen | 2020 | 1865724 | loss | 1.177979 | 
| 181 | Green | Howard Hawkins | 2020 | 405035 | loss | 0.255731 | 
182 rows × 6 columns
elections.shape
(182, 6)
DataFrames¶We can use .head to return only a few rows of a dataframe.
# Loading DataFrame again to keep the original ordering of columns
elections = pd.read_csv("data/elections.csv")
elections.head() # By default, calling .head with no argument will show the first 5 rows
| Year | Candidate | Party | Popular vote | Result | % | |
|---|---|---|---|---|---|---|
| 0 | 1824 | Andrew Jackson | Democratic-Republican | 151271 | loss | 57.210122 | 
| 1 | 1824 | John Quincy Adams | Democratic-Republican | 113142 | win | 42.789878 | 
| 2 | 1828 | Andrew Jackson | Democratic | 642806 | win | 56.203927 | 
| 3 | 1828 | John Quincy Adams | National Republican | 500897 | loss | 43.796073 | 
| 4 | 1832 | Andrew Jackson | Democratic | 702735 | win | 54.574789 | 
elections.head(3)
| Year | Candidate | Party | Popular vote | Result | % | |
|---|---|---|---|---|---|---|
| 0 | 1824 | Andrew Jackson | Democratic-Republican | 151271 | loss | 57.210122 | 
| 1 | 1824 | John Quincy Adams | Democratic-Republican | 113142 | win | 42.789878 | 
| 2 | 1828 | Andrew Jackson | Democratic | 642806 | win | 56.203927 | 
We can also use .tail to get the last so many rows.
elections.tail(5)
| Year | Candidate | Party | Popular vote | Result | % | |
|---|---|---|---|---|---|---|
| 177 | 2016 | Jill Stein | Green | 1457226 | loss | 1.073699 | 
| 178 | 2020 | Joseph Biden | Democratic | 81268924 | win | 51.311515 | 
| 179 | 2020 | Donald Trump | Republican | 74216154 | loss | 46.858542 | 
| 180 | 2020 | Jo Jorgensen | Libertarian | 1865724 | loss | 1.177979 | 
| 181 | 2020 | Howard Hawkins | Green | 405035 | loss | 0.255731 | 
loc¶Arguments to .loc can be:
loc selects items by row and column label.
# Selection by a list
elections.loc[[87, 25, 179], ["Year", "Candidate", "Result"]]
| Year | Candidate | Result | |
|---|---|---|---|
| 87 | 1932 | Herbert Hoover | loss | 
| 25 | 1860 | John C. Breckinridge | loss | 
| 179 | 2020 | Donald Trump | loss | 
# Selection by a list and a slice of columns
elections.loc[[87, 25, 179], "Popular vote":"%"]
| Popular vote | Result | % | |
|---|---|---|---|
| 87 | 15761254 | loss | 39.830594 | 
| 25 | 848019 | loss | 18.138998 | 
| 179 | 74216154 | loss | 46.858542 | 
# Extracting all rows using a colon
elections.loc[:, ["Year", "Candidate", "Result"]]
| Year | Candidate | Result | |
|---|---|---|---|
| 0 | 1824 | Andrew Jackson | loss | 
| 1 | 1824 | John Quincy Adams | win | 
| 2 | 1828 | Andrew Jackson | win | 
| 3 | 1828 | John Quincy Adams | loss | 
| 4 | 1832 | Andrew Jackson | win | 
| ... | ... | ... | ... | 
| 177 | 2016 | Jill Stein | loss | 
| 178 | 2020 | Joseph Biden | win | 
| 179 | 2020 | Donald Trump | loss | 
| 180 | 2020 | Jo Jorgensen | loss | 
| 181 | 2020 | Howard Hawkins | loss | 
182 rows × 3 columns
# Extracting all columns using a colon
elections.loc[[87, 25, 179], :]
| Year | Candidate | Party | Popular vote | Result | % | |
|---|---|---|---|---|---|---|
| 87 | 1932 | Herbert Hoover | Republican | 15761254 | loss | 39.830594 | 
| 25 | 1860 | John C. Breckinridge | Southern Democratic | 848019 | loss | 18.138998 | 
| 179 | 2020 | Donald Trump | Republican | 74216154 | loss | 46.858542 | 
# Selection by a list and a single-column label
elections.loc[[87, 25, 179], "Popular vote"]
87 15761254 25 848019 179 74216154 Name: Popular vote, dtype: int64
# Note that if we pass "Popular vote" in a list, the output will be a DataFrame
elections.loc[[87, 25, 179], ["Popular vote"]]
| Popular vote | |
|---|---|
| 87 | 15761254 | 
| 25 | 848019 | 
| 179 | 74216154 | 
# Selection by a row label and a column label
elections.loc[0, "Candidate"]
'Andrew Jackson'
iloc¶iloc selects items by row and column integer position.
Arguments to .iloc can be:
# Select the rows at positions 1, 2, and 3.
# Select the columns at positions 0, 1, and 2.
# Remember that Python indexing begins at position 0!
elections.iloc[[1, 2, 3], [0, 1, 2]]
| Year | Candidate | Party | |
|---|---|---|---|
| 1 | 1824 | John Quincy Adams | Democratic-Republican | 
| 2 | 1828 | Andrew Jackson | Democratic | 
| 3 | 1828 | John Quincy Adams | National Republican | 
# Index-based extraction using a list of rows and a slice of column indices
elections.iloc[[1, 2, 3], 0:3]
| Year | Candidate | Party | |
|---|---|---|---|
| 1 | 1824 | John Quincy Adams | Democratic-Republican | 
| 2 | 1828 | Andrew Jackson | Democratic | 
| 3 | 1828 | John Quincy Adams | National Republican | 
# Selecting all rows using a colon
elections.iloc[:, 0:3]
| Year | Candidate | Party | |
|---|---|---|---|
| 0 | 1824 | Andrew Jackson | Democratic-Republican | 
| 1 | 1824 | John Quincy Adams | Democratic-Republican | 
| 2 | 1828 | Andrew Jackson | Democratic | 
| 3 | 1828 | John Quincy Adams | National Republican | 
| 4 | 1832 | Andrew Jackson | Democratic | 
| ... | ... | ... | ... | 
| 177 | 2016 | Jill Stein | Green | 
| 178 | 2020 | Joseph Biden | Democratic | 
| 179 | 2020 | Donald Trump | Republican | 
| 180 | 2020 | Jo Jorgensen | Libertarian | 
| 181 | 2020 | Howard Hawkins | Green | 
182 rows × 3 columns
elections.iloc[[1, 2, 3], 1]
1 John Quincy Adams 2 Andrew Jackson 3 John Quincy Adams Name: Candidate, dtype: object
# Extracting the value at row 0 and the second column
elections.iloc[0,1]
'Andrew Jackson'
[]¶We could technically do anything we want using loc or iloc. However, in practice, the [] operator is often used instead to yield more concise code.
[] is a bit trickier to understand than loc or iloc, but it achieves essentially the same functionality. The difference is that [] is context-dependent.
[] only takes one argument, which may be:
If we provide a slice of row numbers, we get the numbered rows.
elections[3:7]
| Year | Candidate | Party | Popular vote | Result | % | |
|---|---|---|---|---|---|---|
| 3 | 1828 | John Quincy Adams | National Republican | 500897 | loss | 43.796073 | 
| 4 | 1832 | Andrew Jackson | Democratic | 702735 | win | 54.574789 | 
| 5 | 1832 | Henry Clay | National Republican | 484205 | loss | 37.603628 | 
| 6 | 1832 | William Wirt | Anti-Masonic | 100715 | loss | 7.821583 | 
If we provide a list of column names, we get the listed columns.
elections[["Year", "Candidate", "Result"]]
| Year | Candidate | Result | |
|---|---|---|---|
| 0 | 1824 | Andrew Jackson | loss | 
| 1 | 1824 | John Quincy Adams | win | 
| 2 | 1828 | Andrew Jackson | win | 
| 3 | 1828 | John Quincy Adams | loss | 
| 4 | 1832 | Andrew Jackson | win | 
| ... | ... | ... | ... | 
| 177 | 2016 | Jill Stein | loss | 
| 178 | 2020 | Joseph Biden | win | 
| 179 | 2020 | Donald Trump | loss | 
| 180 | 2020 | Jo Jorgensen | loss | 
| 181 | 2020 | Howard Hawkins | loss | 
182 rows × 3 columns
And if we provide a single column name we get back just that column, stored as a Series.
elections["Candidate"]
0         Andrew Jackson
1      John Quincy Adams
2         Andrew Jackson
3      John Quincy Adams
4         Andrew Jackson
             ...        
177           Jill Stein
178         Joseph Biden
179         Donald Trump
180         Jo Jorgensen
181       Howard Hawkins
Name: Candidate, Length: 182, dtype: object
example = pd.Series([4, 5, 6], index=["one", "two", "three"])
example[example > 4].values
array([5, 6])
df = pd.DataFrame({"c1":[1, 2, 3, 4], "c2":[2, 4, 6, 8]})
df.columns
Index(['c1', 'c2'], dtype='object')