# `pd` is the conventional alias for Pandas, as `np` is for NumPy
import pandas as pd
2 Pandas I
Data scientists work with data stored in a variety of formats. The primary focus of this class is in understanding tabular data – one of the most widely used formats in data science. This note introduces DataFrames, which are among the most popular representations of tabular data. We’ll also introduce pandas
, the standard Python package for manipulating data in DataFrames.
2.1 Introduction to Exploratory Data Analysis
Imagine you collected, or have been given a box of data. What do you do next?
The first step is to clean your data. Data cleaning often corrects issues in the structure and formatting of data, including missing values and unit conversions.
Data scientists have coined the term exploratory data analysis (EDA) to describe the process of transforming raw data to insightful observations. EDA is an open-ended analysis of transforming, visualizing, and summarizing patterns in data. In order to conduct EDA, we first need to familiarize ourselves with pandas
– an important programming tool.
2.2 Introduction to Pandas
pandas
is a data analysis library to make data cleaning and analysis fast and convenient in Python.
The pandas
library adopts many coding idioms from NumPy
. The biggest difference is that pandas
is designed for working with tabular data, one of the most common data formats (and the focus of Data 100).
Before writing any code, we must import pandas
into our Python environment.
2.3 Series, DataFrames, and Indices
There are three fundamental data structures in pandas
:
- Series: 1D labeled array data; best thought of as columnar data
- DataFrame: 2D tabular data with rows and columns
- Index: A sequence of row/column labels
DataFrames, Series, and Indices can be represented visually in the following diagram.
Notice how the DataFrame is a two dimensional object – it contains both rows and columns. The Series above is a singular column of this DataFrame, namely the Candidate
column. Both contain an Index, or a shared list of row labels (the integers from 0 to 5, inclusive).
2.3.1 Series
A Series represents a column of a DataFrame; more generally, it can be any 1-dimensional array-like object containing values of the same type with associated data labels, called its index.
import pandas as pd
= pd.Series([-1, 10, 2])
s print(s)
0 -1
1 10
2 2
dtype: int64
# Data contained within the Series s.array
<PandasArray>
[-1, 10, 2]
Length: 3, dtype: int64
# The Index of the Series s.index
RangeIndex(start=0, stop=3, step=1)
By default, row indices in pandas
are a sequential list of integers beginning from 0. Optionally, a list of desired indices can be passed to the index
argument.
= pd.Series([-1, 10, 2], index = ["a", "b", "c"])
s print(s)
a -1
b 10
c 2
dtype: int64
Indices can also be changed after initialization.
= ["first", "second", "third"]
s.index print(s)
first -1
second 10
third 2
dtype: int64
2.3.1.1 Selection in Series
Similar to an array, we can select a single value or a set of values from a Series. There are 3 primary methods of selecting data.
- A single index label
- A list of index labels
- A filtering condition
Let’s define the following Series ser
.
= pd.Series([4, -2, 0, 6], index = ["a", "b", "c", "d"])
ser print(ser)
a 4
b -2
c 0
d 6
dtype: int64
2.3.1.1.1 A Single Index Label
print(ser["a"]) # Notice how the return value is a single array element
4
2.3.1.1.2 A List of Index Labels
"a", "c"]] # Notice how the return value is another Series ser[[
a 4
c 0
dtype: int64
2.3.1.1.3 A Filtering Condition
Perhaps the most interesting (and useful) method of selecting data from a Series is with a filtering condition.
We first must apply a vectorized boolean operation to our Series that encodes the filter conditon.
> 0 # Filter condition: select all elements greater than 0 ser
a True
b False
c False
d True
dtype: bool
Upon “indexing” in our Series with this condition, pandas
selects only the rows with True
values.
> 0] ser[ser
a 4
d 6
dtype: int64
2.3.2 DataFrames
In Data 8, you encountered the Table
class of the datascience
library, which represented tabular data. In Data 100, we’ll be using the DataFrame
class of the pandas
library.
Here is an example of a DataFrame that contains election data.
import pandas as pd
= pd.read_csv("data/elections.csv")
elections elections
Year | Candidate | Party | Popular vote | Result | % | |
---|---|---|---|---|---|---|
0 | 1824 | Andrew Jackson | Democratic-Republican | 151271 | loss | 57.210122 |
1 | 1824 | John Quincy Adams | Democratic-Republican | 113142 | win | 42.789878 |
2 | 1828 | Andrew Jackson | Democratic | 642806 | win | 56.203927 |
3 | 1828 | John Quincy Adams | National Republican | 500897 | loss | 43.796073 |
4 | 1832 | Andrew Jackson | Democratic | 702735 | win | 54.574789 |
... | ... | ... | ... | ... | ... | ... |
177 | 2016 | Jill Stein | Green | 1457226 | loss | 1.073699 |
178 | 2020 | Joseph Biden | Democratic | 81268924 | win | 51.311515 |
179 | 2020 | Donald Trump | Republican | 74216154 | loss | 46.858542 |
180 | 2020 | Jo Jorgensen | Libertarian | 1865724 | loss | 1.177979 |
181 | 2020 | Howard Hawkins | Green | 405035 | loss | 0.255731 |
182 rows × 6 columns
Let’s dissect the code above.
We first import the
pandas
library into our Python environment, using the aliaspd
.
import pandas as pd
There are a number of ways to read data into a DataFrame. In Data 100, our data are typically stored in a CSV (comma-seperated values) file format. We can import a CSV file into a DataFrame by passing the data path as an argument to the following
pandas
function.
pd.read_csv("elections.csv")
This code stores our DataFrame object in the elections
variable. Upon inspection, our elections
DataFrame has 182 rows and 6 columns (Year
, Candidate
, Party
, Popular Vote
, Result
, %
). Each row represents a single record – in our example, a presedential candidate from some particular year. Each column represents a single attribute, or feature of the record.
In the example above, we constructed a DataFrame object using data from a CSV file. As we’ll explore in the next section, we can create a DataFrame with data of our own.
2.3.2.1 Creating a DataFrame
There are many ways to create a DataFrame. Here, we will cover the most popular approaches.
- Using a list and column names
- From a dictionary
- From a Series
2.3.2.1.1 Using a List and Column Names
Consider the following examples.
= pd.DataFrame([1, 2, 3], columns=["Numbers"])
df_list df_list
Numbers | |
---|---|
0 | 1 |
1 | 2 |
2 | 3 |
The first code cell creates a DataFrame with a single column Numbers
, while the second creates a DataFrame with an additional column Description
. Notice how a 2D list of values is required to initialize the second DataFrame – each nested list represents a single row of data.
= pd.DataFrame([[1, "one"], [2, "two"]], columns = ["Number", "Description"])
df_list df_list
Number | Description | |
---|---|---|
0 | 1 | one |
1 | 2 | two |
2.3.2.1.2 From a Dictionary
A second (and more common) way to create a DataFrame is with a dictionary. The dictionary keys represent the column names, and the dictionary values represent the column values.
= pd.DataFrame({"Fruit": ["Strawberry", "Orange"], "Price": [5.49, 3.99]})
df_dict df_dict
Fruit | Price | |
---|---|---|
0 | Strawberry | 5.49 |
1 | Orange | 3.99 |
2.3.2.1.3 From a Series
Earlier, we explained how a Series was synonymous to a column in a DataFrame. It follows then, that a DataFrame is equivalent to a collection of Series, which all share the same index.
In fact, we can initialize a DataFrame by merging two or more Series.
# Notice how our indices, or row labels, are the same
= pd.Series(["a1", "a2", "a3"], index = ["r1", "r2", "r3"])
s_a = pd.Series(["b1", "b2", "b3"], index = ["r1", "r2", "r3"])
s_b
"A-column": s_a, "B-column": s_b}) pd.DataFrame({
A-column | B-column | |
---|---|---|
r1 | a1 | b1 |
r2 | a2 | b2 |
r3 | a3 | b3 |
2.3.3 Indices
The major takeaway: we can think of a DataFrame as a collection of Series that all share the same Index.
On a more technical note, an Index doesn’t have to be an integer, nor does it have to be unique. For example, we can set the index of the elections
Dataframe to be the name of presedential candidates. Selecting a new Series from this modified DataFrame yields the following.
# This sets the index to the "Candidate" column
"Candidate", inplace=True) elections.set_index(
To retrieve the indices of a DataFrame, simply use the .index
attribute of the DataFrame class.
elections.index
Index(['Andrew Jackson', 'John Quincy Adams', 'Andrew Jackson',
'John Quincy Adams', 'Andrew Jackson', 'Henry Clay', 'William Wirt',
'Hugh Lawson White', 'Martin Van Buren', 'William Henry Harrison',
...
'Darrell Castle', 'Donald Trump', 'Evan McMullin', 'Gary Johnson',
'Hillary Clinton', 'Jill Stein', 'Joseph Biden', 'Donald Trump',
'Jo Jorgensen', 'Howard Hawkins'],
dtype='object', name='Candidate', length=182)
# This resets the index to be the default list of integers
=True) elections.reset_index(inplace
2.4 Slicing in DataFrames
Now that we’ve learned how to create DataFrames, let’s dive deeper into their capabilities.
The API (application programming interface) for the DataFrame class is enormous. In this section, we’ll discuss several methods of the DataFrame API that allow us to extract subsets of data.
The simplest way to manipulate a DataFrame is to extract a subset of rows and columns, known as slicing. We will do so with three primary methods of the DataFrame class:
.loc
.iloc
[]
2.4.1 Indexing with .loc
The .loc
operator selects rows and columns in a DataFrame by their row and column label(s), respectively. The row labels (commonly referred to as the indices) are the bold text on the far left of a DataFrame, while the column labels are the column names found at the top of a DataFrame.
To grab data with .loc
, we must specify the row and column label(s) where the data exists. The row labels are the first argument to the .loc
function; the column labels are the second. For example, we can select the the row labeled 0
and the column labeled Candidate
from the elections
DataFrame.
0, 'Candidate'] elections.loc[
'Andrew Jackson'
To select multiple rows and columns, we can use Python slice notation. Here, we select both the first four rows and columns.
0:3, 'Year':'Popular vote'] elections.loc[
Year | Party | Popular vote | |
---|---|---|---|
0 | 1824 | Democratic-Republican | 151271 |
1 | 1824 | Democratic-Republican | 113142 |
2 | 1828 | Democratic | 642806 |
3 | 1828 | National Republican | 500897 |
Suppose that instead, we wanted every column value for the first four rows in the elections
DataFrame. The shorthand :
is useful for this.
0:3, :] elections.loc[
Candidate | Year | Party | Popular vote | Result | % | |
---|---|---|---|---|---|---|
0 | Andrew Jackson | 1824 | Democratic-Republican | 151271 | loss | 57.210122 |
1 | John Quincy Adams | 1824 | Democratic-Republican | 113142 | win | 42.789878 |
2 | Andrew Jackson | 1828 | Democratic | 642806 | win | 56.203927 |
3 | John Quincy Adams | 1828 | National Republican | 500897 | loss | 43.796073 |
There are a couple of things we should note. Unlike conventional Python, Pandas allows us to slice string values (in our example, the column labels). Secondly, slicing with .loc
is inclusive. Notice how our resulting DataFrame includes every row and column between and including the slice labels we specified.
Equivalently, we can use a list to obtain multiple rows and columns in our elections
DataFrame.
0, 1, 2, 3], ['Year', 'Candidate', 'Party', 'Popular vote']] elections.loc[[
Year | Candidate | Party | Popular vote | |
---|---|---|---|---|
0 | 1824 | Andrew Jackson | Democratic-Republican | 151271 |
1 | 1824 | John Quincy Adams | Democratic-Republican | 113142 |
2 | 1828 | Andrew Jackson | Democratic | 642806 |
3 | 1828 | John Quincy Adams | National Republican | 500897 |
Lastly, we can interchange list and slicing notation.
0, 1, 2, 3], :] elections.loc[[
Candidate | Year | Party | Popular vote | Result | % | |
---|---|---|---|---|---|---|
0 | Andrew Jackson | 1824 | Democratic-Republican | 151271 | loss | 57.210122 |
1 | John Quincy Adams | 1824 | Democratic-Republican | 113142 | win | 42.789878 |
2 | Andrew Jackson | 1828 | Democratic | 642806 | win | 56.203927 |
3 | John Quincy Adams | 1828 | National Republican | 500897 | loss | 43.796073 |
2.4.2 Indexing with .iloc
Slicing with .iloc
works similarily to .loc
, although .iloc
uses the integer positions of rows and columns rather the labels. The arguments to the .iloc
function also behave similarly - single values, lists, indices, and any combination of these are permitted.
Let’s begin reproducing our results from above. We’ll begin by selecting for the first presedential candidate in our elections
DataFrame:
# elections.loc[0, "Candidate"] - Previous approach
0, 1] elections.iloc[
1824
Notice how the first argument to both .loc
and .iloc
are the same. This is because the row with a label of 0 is conveniently in the 0th (or first) position of the elections
DataFrame. Generally, this is true of any DataFrame where the row labels are incremented in ascending order from 0.
However, when we select the first four rows and columns using .iloc
, we notice something.
# elections.loc[0:3, 'Year':'Popular vote'] - Previous approach
0:4, 0:4] elections.iloc[
Candidate | Year | Party | Popular vote | |
---|---|---|---|---|
0 | Andrew Jackson | 1824 | Democratic-Republican | 151271 |
1 | John Quincy Adams | 1824 | Democratic-Republican | 113142 |
2 | Andrew Jackson | 1828 | Democratic | 642806 |
3 | John Quincy Adams | 1828 | National Republican | 500897 |
Slicing is no longer inclusive in .iloc
- it’s exclusive. This is one of Pandas syntatical subtleties; you’ll get used to with practice.
List behavior works just as expected.
#elections.loc[[0, 1, 2, 3], ['Year', 'Candidate', 'Party', 'Popular vote']] - Previous Approach
0, 1, 2, 3], [0, 1, 2, 3]] elections.iloc[[
Candidate | Year | Party | Popular vote | |
---|---|---|---|---|
0 | Andrew Jackson | 1824 | Democratic-Republican | 151271 |
1 | John Quincy Adams | 1824 | Democratic-Republican | 113142 |
2 | Andrew Jackson | 1828 | Democratic | 642806 |
3 | John Quincy Adams | 1828 | National Republican | 500897 |
This discussion begs the question: when should we use .loc
vs .iloc
? In most cases, .loc
is generally safer to use. You can imagine .iloc
may return incorrect values when applied to a dataset where the ordering of data can change.
2.4.3 Indexing with []
The []
selection operator is the most baffling of all, yet the commonly used. It only takes a single argument, which may be one of the following:
- A slice of row numbers
- A list of column labels
- A single column label
That is, []
is context dependent. Let’s see some examples.
2.4.3.1 A slice of row numbers
Say we wanted the first four rows of our elections
DataFrame.
0:4] elections[
Candidate | Year | Party | Popular vote | Result | % | |
---|---|---|---|---|---|---|
0 | Andrew Jackson | 1824 | Democratic-Republican | 151271 | loss | 57.210122 |
1 | John Quincy Adams | 1824 | Democratic-Republican | 113142 | win | 42.789878 |
2 | Andrew Jackson | 1828 | Democratic | 642806 | win | 56.203927 |
3 | John Quincy Adams | 1828 | National Republican | 500897 | loss | 43.796073 |
2.4.3.2 A list of column labels
Suppose we now want the first four columns.
"Year", "Candidate", "Party", "Popular vote"]] elections[[
Year | Candidate | Party | Popular vote | |
---|---|---|---|---|
0 | 1824 | Andrew Jackson | Democratic-Republican | 151271 |
1 | 1824 | John Quincy Adams | Democratic-Republican | 113142 |
2 | 1828 | Andrew Jackson | Democratic | 642806 |
3 | 1828 | John Quincy Adams | National Republican | 500897 |
4 | 1832 | Andrew Jackson | Democratic | 702735 |
... | ... | ... | ... | ... |
177 | 2016 | Jill Stein | Green | 1457226 |
178 | 2020 | Joseph Biden | Democratic | 81268924 |
179 | 2020 | Donald Trump | Republican | 74216154 |
180 | 2020 | Jo Jorgensen | Libertarian | 1865724 |
181 | 2020 | Howard Hawkins | Green | 405035 |
182 rows × 4 columns
2.4.3.3 A single column label
Lastly, if we only want the Candidate
column.
"Candidate"] elections[
0 Andrew Jackson
1 John Quincy Adams
2 Andrew Jackson
3 John Quincy Adams
4 Andrew Jackson
...
177 Jill Stein
178 Joseph Biden
179 Donald Trump
180 Jo Jorgensen
181 Howard Hawkins
Name: Candidate, Length: 182, dtype: object
The output looks like a Series! In this course, we’ll become very comfortable with []
, especially for selecting columns. In practice, []
is much more common than .loc
.
2.5 Parting Note
The pandas
library is enormous and contains many useful functions. Here is a link to documentation.
The introductory pandas
lectures will cover important data structures and methods you should be fluent in. However, we want you to get familiar with the real world programming practice of …Googling! Answers to your questions can be found in documentation, Stack Overflow, etc.
With that, let’s move on to Pandas II.