import numpy as np
import pandas as pd
Plot visualization
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')
sns.set_context("notebook")
Web visualization
import plotly.offline as py
import plotly.express as px
import cufflinks as cf
cf.set_config_file(sharing="private", offline=True, offline_connected=False)
In data8 you used numpy and datascience
import numpy as np
from datascience import Table
Creating a toy matrix
m = np.array(np.arange(12, dtype=float)).reshape(3,4)
m
Slicing rows and columns
m[:, [1,3]]
Doing linear algebra
m.T @ (m + m)
Load data
t = Table.read_table("elections.csv")
t
Access columns and rows:
t.column("Party")
t.row(2)
Define predicates
from datascience import are
t.where("Year", are.above_or_equal_to(2000))
and do many other things...
The Table class in the datascience package implements a simple DataFrame. However, Table is not particularly optimized and doesn't support a wide range of useful functionality. In this class, we will use Pandas, a more feature-rich and widely adopted DataFrame library.
It is customary to import pandas as pd
import pandas as pd
Pandas has a number of very useful file reading tools. You can see them enumerated by typing "pd.re" and pressing tab. We'll be using read_csv today.
elections = pd.read_csv("elections.csv")
elections # if we end a cell with an expression or variable name, the result will print
Read a table from a website
dfs = pd.read_html("https://en.wikipedia.org/wiki/Greenhouse_gas")
dfs[4] # read the 5th table on the page
Read a Microsoft Excel file
pd.read_excel("fossil_fuel.xlsx", sheet_name="Data2")
We can use the head command to return only a few rows of a dataframe.
elections.head()
elections.head(3)
There is also a tail command.
elections.tail(7)
A random sample of 7 entries.
# Note I am seeding the sample so later stages in my
# notebook work. My favorite seed is 42
elec_sample = elections.sample(7, random_state=42)
elec_sample
Sampling with replacement
elections.sample(10, replace=True)
Sampling columns?
elections.sample(2, axis=1).head()
In addition to head, tail, and sample, there is a range of other useful operations.
Shape returns the number of rows and columns.
elections.shape
Size describes the number of "cells" in the dataframe
elections.size
We can sort the rows by their values:
elections.sort_values(["Year", "Result"])
We can rename columns:
elections.rename(columns={"%": "Percent"}).head()
Note that the rename method returned a new DataFrame and didn't modify the original one. Most operations in Pandas are not mutating; this produces cleaner code. If you change something, store the result in a new, appropriately named variable.
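A minimal self-contained sketch (with made-up toy data) of this non-mutating behavior:

```python
import pandas as pd

df = pd.DataFrame({"%": [50.7, 48.9], "Year": [1980, 1984]})
renamed = df.rename(columns={"%": "Percent"})  # returns a NEW DataFrame

# the original DataFrame is untouched
print(df.columns.tolist())       # ['%', 'Year']
print(renamed.columns.tolist())  # ['Percent', 'Year']
```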
elections.head()
Casting types.
elections.astype({"Year": float}).head()
We can get summary statistics for each column
elections.describe(include="all")
You can even transpose a dataframe
elections.transpose()
The fact that you can take the transpose implies some interesting symmetry properties that will lead us to treat rows and columns symmetrically. This has interesting implications on how we refer to rows and columns!
The DataFrame has named columns
elections.head()
elections.columns
The columns have types
elections.dtypes
You can access the rows as a 2D numpy array
elections.values
All dataframes have an index.
elections
elections.index
By default a RangeIndex is attached enumerating the rows. This is shown in bold as the far left column. Recall that we sampled the elections table. Let's examine that sample:
elec_sample
elec_sample.index
Notice that the index is different. It maintained the index of the rows in the original table. This is really useful if we wanted to go back and relate derived tables with their original values.
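A quick self-contained sketch (toy data) of how a sample keeps the labels of its source rows:

```python
import pandas as pd

df = pd.DataFrame({"x": range(10, 20)})  # RangeIndex 0..9
samp = df.sample(3, random_state=42)

# the sampled rows keep their labels from the original index
print(samp.index.tolist())
```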
You can change the index.
elec_sample_iyear = elec_sample.set_index("Year")
elec_sample_iyear
elec_sample_iyear.index
Note that the set_index operation is not mutating.
elections.index
elections.head()
The index allows you to reference rows by name. You will see this in a moment when we talk about slicing.
Note: The index does not need to be unique.
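A small sketch (toy data) showing what a non-unique index means in practice: .loc with a duplicated label returns every matching row.

```python
import pandas as pd

df = pd.DataFrame({"Party": ["Dem", "Rep", "Dem"]},
                  index=[2000, 2000, 2004])  # the label 2000 appears twice

# a duplicated label selects a DataFrame of all matching rows
print(df.loc[2000])
```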
Recall that we could get the list of column names
elections.columns
Notice that the return type is an index. Recall we could transpose a DataFrame. This is effectively swapping the row and column index:
elections.transpose().index
There are many ways to access rows and columns of a Pandas DataFrame. We will spend some time reviewing most of the options.
[ ]
The DataFrame class has an indexing operator [ ] that lets you do a variety of different things.
Just like the Table in data8 you can access columns using the square [ ] brackets.
elec_sample
You can pass a list of columns names:
elec_sample[ ["Candidate", "Year"] ] # space added to show the list
elec_sample[["Candidate", "Year", "Result"]] # No space is more standard
If you pass a list with a single element you get back a DataFrame.
elec_sample[["Candidate"]]
If you pass a single item instead of a list you get back a Series
party = elec_sample["Party"]
party
When accessing a single column we get back a pd.Series object
type(party)
The Series object represents a single column (or row) of data. The Series object has an index, a name, and values. A series can be thought of as a map.
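The "map" analogy can be sketched concretely (toy data, made-up labels): looking up a label on a Series behaves like a dict lookup.

```python
import pandas as pd

s = pd.Series(["Dem", "Rep"], index=["winner", "loser"], name="Party")

# label lookup works like a dict lookup
print(s["winner"])
# and a Series converts directly to a dict
print(s.to_dict())
```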
party.index
party.name
party.values
We can convert a Series into a DataFrame
party.to_frame()
Series act like numpy arrays and support most numpy operations
year = elec_sample["Year"]
year
year.mean()
Apply numpy operations:
np.sin(year * 3)
We can sort the Series by value or by index.
np.sin(year * 3).sort_values()
np.sin(year * 3).sort_index()
Series also has a very useful function .value_counts() which allows us to compute the number of occurrences of each unique value.
year.value_counts()
party_counts = elections['Party'].value_counts()
party_counts
Note that in each case we also got back a series and these series (like all series) are maps from index to value.
party_counts.index
party_counts.values
party_counts["Independent"]
Notice how in all cases I keep track of the index, making it possible to relate this data back to the sample and even the original table.
For example, in the following we create a new series with the name weird and join it back with the original data.
weird = np.sin(year * 3).rename("weird")
weird
Here we use the Pandas join operation, which joins on the index. You will learn more about this and the more general merge operation next week.
elections.join(weird)
weird.to_frame().join(elections)
What kind of join is this?
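A small self-contained sketch (toy data) that answers this by inspection: DataFrame.join aligns on the index and defaults to a left join, so every row of the left table survives and unmatched rows get NaN.

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2, 3]}, index=[0, 1, 2])
right = pd.Series([10.0, 30.0], index=[0, 2], name="b")

joined = left.join(right)  # how="left" by default: all rows of `left` kept
print(joined)              # row with label 1 has no match, so b is NaN there
```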
You can modify and even add columns using the square brackets [ ]
tmp = elec_sample.copy()
tmp["Year"] = tmp["Year"] * -1 + .5
tmp
Adding a new column by assignment:
tmp["Corrected Year"] = tmp["Year"] * -1 + .5
tmp
tmp["Random Numbers"] = np.random.randn(tmp.shape[0])
tmp
.loc
You can access rows and columns of a DataFrame by name using the .loc[ ] syntax.
elec_sample
elec_sample.loc[:, ["Party", "Year"] ]
The syntax for .loc is:
df.loc[rows_list, column_list]
We can pass a list of row names (index values):
elec_sample
elec_sample.loc[[1,15,9], :]
elec_sample.loc[[1,15,9]]
elec_sample_iyear
How many rows will this call return?
elec_sample_iyear.loc[[2004, 1980], :]
Loc also supports slicing (for all types, including numeric and string labels!). Note that slicing with .loc is inclusive, even for numeric slices. In general, avoid range slicing with .loc.
elections.loc[0:4, 'Candidate':'Year']
Keep in mind that the ranges need to be over the index values and not the positions. For a range slice to be well defined, the index values also need to be sorted into contiguous order.
elec_sample_iyear.sort_index().loc[1980:2004, 'Candidate':'Party']
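A minimal self-contained sketch (toy data) of inclusive label slicing on a sorted index:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=[1980, 1996, 2004, 2012])

# label slices with .loc include BOTH endpoints
print(s.loc[1996:2004].tolist())  # [2, 3]
```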
If we provide only a single label for the column argument, we get back a Series just as with regular [ ] indexing.
elec_sample.loc[:, 'Candidate']
If we want a data frame instead and don't want to use to_frame, we can provide a list containing the column name.
elec_sample.loc[:, ['Candidate']]
We can also select a single row. Notice that in this case we also get back a Series where the index is the set of columns.
obama_row = elec_sample_iyear.loc[2008, :]
obama_row
obama_row.name
obama_row.index
obama_row.values
It is worth noting that the Series also functions like a map from the index to the values.
obama_row["Party"]
If we omit the column argument altogether, the default behavior is to retrieve all columns.
elections.set_index("Year").loc[[2008, 2012]]
What happens if you give scalar arguments for the requested rows AND columns? The answer is that you get back just a single value.
elections.loc[0, 'Candidate']
Both .loc[ ] and [ ] support arrays of booleans as input. In this case, the array must be exactly as long as the number of rows. The result is a filtered version of the DataFrame, where only rows corresponding to True appear.
elec_sample
elec_sample.loc[[False, False, False, False, True, False, False]]
You can also pass the same arguments to the [ ]
operator.
elec_sample[[False, False, False, False, True, False, False]]
One very common task in Data Science is filtering. Boolean Array Selection is one way to achieve this in Pandas. We start by observing logical operators like the equality operator can be applied to Pandas Series data to generate a Boolean Array. For example, we can compare the 'Result' column to the String 'win':
elections.head(5)
iswin = elections['Result'] == 'win'
iswin
The output of the logical operator applied to the Series is another Series with the same name and index, but of datatype boolean. The entry at row #i represents the result of the application of that operator to the entry of the original Series at row #i.
Such a boolean Series can be used as an argument to the [] operator. For example, the following code creates a DataFrame of all election winners since 1980.
elections[iswin]
Above, we've assigned the result of the logical operator to a new variable called iswin. This is uncommon. Usually, the series is created and used on the same line. Such code is a little tricky to read at first, but you'll get used to it quickly.
elections[elections['Result'] == 'win']
We can select multiple criteria by creating multiple boolean Series and combining them using the & operator.
elections[
(elections['Result'] == 'win') &
(elections['%'] < 50)
]
Using the logical negation operator ~ (not).
elections[
(elections['Result'] == 'win') &
~(elections['%'] < 50)
]
The | operator is the symbol for or.
elections[
~((elections['Party'] == "Democratic") |
(elections['Party'] == "Republican"))
]
If we have multiple conditions (say Republican or Democratic), we can use the isin operator to simplify our code.
elections[elections['Party'].isin(["Republican", "Democratic"])]
An alternate, simpler way to get back a specific set of rows is to use the query command.
elections.query("Result == 'win' and Year < 2000")
Note, the query command needs a bit of care: it cannot be applied to column names that contain spaces, periods, or other special characters unless the names are wrapped in backticks.
tmp2 = elections.rename(columns={"Year": "Elec Year", "%": "%*100"})
tmp2.head()
tmp2.query("`Elec Year` > 2000")
In general, I don't use the query function because of these issues.
iloc (Integer-Location)
loc's cousin iloc is very similar, but is used to access rows and columns based on numerical position instead of label, similar to indexing cells in a spreadsheet.
elec_small = elections.head(5)
elec_small
We use dataframe.iloc[row_slice, column_slice] to access specific rows and columns by their position. For example, to access the top 3 rows and top 3 columns of a table, we can use [0:3, 0:3]. iloc slicing is exclusive, just like standard Python slicing.
elec_small.iloc[0:3, 0:3]
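A toy side-by-side sketch (made-up data) of the two slicing conventions:

```python
import pandas as pd

df = pd.DataFrame({"a": [10, 20, 30, 40]})

loc_slice = df.loc[0:2, "a"]   # label slice: endpoint INCLUDED
iloc_slice = df.iloc[0:2, 0]   # position slice: endpoint EXCLUDED

print(loc_slice.tolist())   # [10, 20, 30]
print(iloc_slice.tolist())  # [10, 20]
```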
Return the last three rows using slicing?
elec_small.iloc[-3:, :]
We will use both loc and iloc in the course. loc is generally preferred for a number of reasons; for example, label-based access doesn't break when rows or columns are reordered, and it makes the intent of the code clearer. However, iloc is sometimes more convenient. We'll provide examples of when iloc is the superior choice.
In general, we avoid directly running python code on each row or iterating over dataframes. It is often orders of magnitude faster to use builtin operations to modify entire columns at once.
However, occasionally you need to apply code to the dataframe directly.
The apply function executes the input function on each row (if axis=1) or each column (if axis=0):
elections.apply(lambda row: row['Party'][0:3], axis=1)
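A self-contained sketch (toy data, made-up values) contrasting the two axis settings:

```python
import pandas as pd

df = pd.DataFrame({"Party": ["Democratic", "Republican"],
                   "Result": ["win", "loss"]})

# axis=1: the function is called once per row (each row arrives as a Series)
abbrev = df.apply(lambda row: row["Party"][0:3], axis=1)

# axis=0: the function is called once per column
lengths = df.apply(len, axis=0)

print(abbrev.tolist())   # ['Dem', 'Rep']
print(lengths.tolist())  # [2, 2] -- each column has 2 entries
```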
You can also directly iterate over the rows:
for (year, row) in elections.set_index("Year").iterrows():
if row['Result'] == "win":
print(f"In {year} the winner was {row['Party']}")
Which of the following expressions return a DataFrame of the first 3 Candidate and Party names for candidates that won with more than 50% of the vote?
elections.iloc[[0, 3, 5], [0, 3]]
elections.loc[[0, 3, 5], "Candidate":"Year"]
elections.loc[elections["%"] > 50, ["Candidate", "Year"]].head(3)
elections.loc[elections["%"] > 50, ["Candidate", "Year"]].iloc[0:2, :]
Consider a series of only the vote percentages of election winners.
winners = elections.query("Result == 'win'")["%"]
winners
We can perform various Python operations (including numpy operations) to DataFrames and Series.
max(winners)
np.mean(winners)
We can also do more complicated operations like computing the mean squared error, i.e. the average L2 loss.
c = 50.38
mse = np.mean((c - winners)**2)
mse
c2 = 50.35
mse2 = np.mean((c2 - winners)**2)
mse2
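A quick self-contained check (toy numbers, not the elections data) of the well-known fact that the constant minimizing mean squared error is the mean of the data:

```python
import numpy as np

data = np.array([48.0, 50.0, 52.0, 58.0])

def mse(c):
    # average squared distance from the constant c to the data
    return np.mean((c - data) ** 2)

cbar = data.mean()  # 52.0
# the MSE at the mean is no larger than at nearby constants
print(mse(cbar), mse(cbar + 1), mse(cbar - 1))  # 14.0 15.0 15.0
```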
We can also apply mathematical operations to a DataFrame so long as it has only numerical data.
(elections[["%", "Year"]] + 3).head(5)
Now let's play around a bit with the large baby names dataset we saw in lecture 1. We'll start by loading that dataset from the social security administration's website.
To keep the data small enough to avoid crashing datahub, we're going to look at only California rather than looking at the national dataset.
import urllib.request
import os.path
import zipfile
data_url = "https://www.ssa.gov/oact/babynames/state/namesbystate.zip"
local_filename = "babynamesbystate.zip"
if not os.path.exists(local_filename): # if the data exists don't download again
with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
f.write(resp.read())
zf = zipfile.ZipFile(local_filename, 'r')
ca_name = 'CA.TXT'
field_names = ['State', 'Sex', 'Year', 'Name', 'Count']
with zf.open(ca_name) as fh:
babynames = pd.read_csv(fh, header=None, names=field_names)
babynames.sample(5)
Goal 1: Find the most popular female baby names in California in 2018
babynames[
(babynames["Year"] == 2018) &
(babynames["Sex"] == "F")
].sort_values(by = "Count", ascending = False).head(20)
Goal 2: Make a plot of how many baby girls were named Nora over the years.
babynames[(babynames.Name == "Nora") & (babynames.Sex == "F") ].iplot(x="Year", y="Count")