import numpy as np
import pandas as pd
Plot visualization
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')
sns.set_context("notebook")
Web visualization
import plotly.offline as py
import plotly.express as px
import cufflinks as cf
cf.set_config_file(sharing="private", offline=True, offline_connected=False)
In data8 you used numpy and datascience
import numpy as np
from datascience import Table
Creating a toy matrix
m = np.array(np.arange(12, dtype=float)).reshape(3,4)
m
Slicing rows and columns
m[:, [1,3]]
Doing linear algebra
m.T @ (m + m)
Load data
t = Table.read_table("elections.csv")
t
Access columns and rows:
t.column("Party")
t.row(2)
Define predicates
from datascience import are
t.where("Year", are.above_or_equal_to(2000))
and do many other things...
The Table class in the datascience package implements a simple DataFrame. However, Table is not particularly optimized and doesn't support a wide range of useful functionality. In this class, we will use Pandas, a more feature-rich and widely adopted DataFrame library.
It is customary to import pandas as pd
import pandas as pd
Pandas has a number of very useful file reading tools. You can see them enumerated by typing "pd.re" and pressing tab. We'll be using read_csv today.
elections = pd.read_csv("elections.csv")
elections # if we end a cell with an expression or variable name, the result will print
Read a table from a website
dfs = pd.read_html("https://en.wikipedia.org/wiki/Greenhouse_gas")
dfs[4] # read the 5th table on the page
Read a Microsoft Excel file
pd.read_excel("fossil_fuel.xlsx", sheet_name="Data2")
We can use the head command to return only a few rows of a dataframe.
elections.head()
elections.head(3)
There is also a tail command.
elections.tail(7)
A random sample of 7 entries.
# Note I am seeding the sample so later stages in my
# notebook work. My favorite seed is 42
elec_sample = elections.sample(7, random_state=42)
elec_sample
Sampling with replacement
elections.sample(10, replace=True)
Sampling columns?
elections.sample(2, axis=1).head()
In addition to head, tail, and sample, there is a range of other useful operations.
Shape returns the number of rows and columns.
elections.shape
Size describes the number of "cells" in the dataframe
elections.size
We can sort the rows by their values:
elections.sort_values(["Year", "Result"])
We can rename columns:
elections.rename(columns={"%": "Percent"}).head()
Note that the rename method returned a new DataFrame and didn't modify the original one. Most operations in Pandas are not mutating; this produces cleaner code. If you change something, store the result in a new, appropriately named variable.
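A minimal self-contained sketch (with made-up toy data) of this non-mutating behavior:

```python
import pandas as pd

df = pd.DataFrame({"%": [50.7, 48.9], "Year": [1980, 1984]})
renamed = df.rename(columns={"%": "Percent"})  # returns a NEW DataFrame

# the original DataFrame is untouched
print(df.columns.tolist())       # ['%', 'Year']
print(renamed.columns.tolist())  # ['Percent', 'Year']
```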
elections.head()
Casting types.
elections.astype({"Year": float}).head()
We can get summary statistics for each column
elections.describe(include="all")
You can even transpose a dataframe
elections.transpose()
The fact that you can take the transpose implies some interesting symmetry properties that will lead us to treat rows and columns symmetrically. This has interesting implications on how we refer to rows and columns!
The DataFrame has named columns
elections.head()
elections.columns
The columns have types
elections.dtypes
You can access the rows as a 2D numpy array
elections.values
All dataframes have an index.
elections
elections.index
By default a RangeIndex is attached enumerating the rows. This is shown in bold as the far left column. Recall that we sampled the elections table. Let's examine that sample:
elec_sample
elec_sample.index
Notice that the index is different. It maintained the index of the rows in the original table. This is really useful if we wanted to go back and relate derived tables with their original values.
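A quick self-contained sketch (toy data) of how a sample keeps the labels of its source rows:

```python
import pandas as pd

df = pd.DataFrame({"x": range(10, 20)})  # RangeIndex 0..9
samp = df.sample(3, random_state=42)

# the sampled rows keep their labels from the original index
print(samp.index.tolist())
```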
You can change the index.
elec_sample_iyear = elec_sample.set_index("Year")
elec_sample_iyear
elec_sample_iyear.index
Note that the set_index operation is not mutating.
elections.index
elections.head()
The index allows you to reference rows by name. You will see this in a moment when we talk about slicing.
Note: The index does not need to be unique.
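A small sketch (toy data) showing what a non-unique index means in practice: .loc with a duplicated label returns every matching row.

```python
import pandas as pd

df = pd.DataFrame({"Party": ["Dem", "Rep", "Dem"]},
                  index=[2000, 2000, 2004])  # the label 2000 appears twice

# a duplicated label selects a DataFrame of all matching rows
print(df.loc[2000])
```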
Recall that we could get the list of column names
elections.columns
Notice that the return type is an index. Recall we could transpose a DataFrame. This is effectively swapping the row and column index:
elections.transpose().index
There are many ways to access rows and columns of a Pandas DataFrame. We will spend some time reviewing most of the options.
[ ]
The DataFrame class has an indexing operator [ ] that lets you do a variety of different things.
Just like the Table in data8 you can access columns using the square [ ] brackets.
elec_sample
You can pass a list of columns names:
elec_sample[ ["Candidate", "Year"] ] # space added to show the list
elec_sample[["Candidate", "Year", "Result"]] # No space is more standard
If you pass a list with a single element you get back a DataFrame.
elec_sample[["Candidate"]]
If you pass a single item instead of a list you get back a Series
party = elec_sample["Party"]
party
When accessing a single column we get back a pd.Series object
type(party)
The Series object represents a single column (or row) of data. The Series object has an index, a name, and values. A series can be thought of as a map.
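The "map" analogy can be sketched concretely (toy data, made-up labels): looking up a label on a Series behaves like a dict lookup.

```python
import pandas as pd

s = pd.Series(["Dem", "Rep"], index=["winner", "loser"], name="Party")

# label lookup works like a dict lookup
print(s["winner"])
# and a Series converts directly to a dict
print(s.to_dict())
```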
party.index
party.name
party.values
We can convert a Series into a DataFrame
party.to_frame()
Series act like numpy arrays and support most numpy operations
year = elec_sample["Year"]
year
year.mean()
Apply numpy operations:
np.sin(year * 3)
We can sort the Series by value or by index.
np.sin(year * 3).sort_values()
np.sin(year * 3).sort_index()
Series also has a very useful function .value_counts() which allows us to compute the number of occurrences of each unique value.
year.value_counts()
party_counts = elections['Party'].value_counts()
party_counts
Note that in each case we also got back a series and these series (like all series) are maps from index to value.
party_counts.index
party_counts.values
party_counts["Independent"]
Notice how in all cases I keep track of the index, making it possible to relate this data back to the sample and even the original table.
For example, in the following we create a new series with the name weird and join it back with the original data.
weird = np.sin(year * 3).rename("weird")
weird
Here we use the Pandas join operation, which joins on the index. You will learn more about this and the more general merge operation next week.
elections.join(weird)
weird.to_frame().join(elections)
What kind of join is this?
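A small self-contained sketch (toy data) that answers this by inspection: DataFrame.join aligns on the index and defaults to a left join, so every row of the left table survives and unmatched rows get NaN.

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2, 3]}, index=[0, 1, 2])
right = pd.Series([10.0, 30.0], index=[0, 2], name="b")

joined = left.join(right)  # how="left" by default: all rows of `left` kept
print(joined)              # row with label 1 has no match, so b is NaN there
```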
You can modify and even add columns using the square brackets [ ]
tmp = elec_sample.copy()
tmp["Year"] = tmp["Year"] * -1 + .5
tmp
Adding a new column by assignment:
tmp["Corrected Year"] = tmp["Year"] * -1 + .5
tmp
tmp["Random Numbers"] = np.random.randn(tmp.shape[0])
tmp
.loc
You can access rows and columns of a DataFrame by name using the .loc[ ] syntax.
elec_sample
elec_sample.loc[:, ["Party", "Year"] ]
The syntax for .loc is:
df.loc[rows_list, column_list]
We can pass a list of row names (index values):
elec_sample
elec_sample.loc[[1,15,9], :]
elec_sample.loc[[1,15,9]]
elec_sample_iyear
How many rows will this call return?
elec_sample_iyear.loc[[2004, 1980], :]
Loc also supports slicing (for all types, including numeric and string labels!). Note that slicing with .loc is inclusive, even for numeric slices. In general, avoid range slicing with .loc.
elections.loc[0:4, 'Candidate':'Year']
Keep in mind that the ranges need to be over the index values and not the positions. For a range slice to be well defined, the index values also need to be sorted into contiguous order.
elec_sample_iyear.sort_index().loc[1980:2004, 'Candidate':'Party']
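A minimal self-contained sketch (toy data) of inclusive label slicing on a sorted index:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=[1980, 1996, 2004, 2012])

# label slices with .loc include BOTH endpoints
print(s.loc[1996:2004].tolist())  # [2, 3]
```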
If we provide only a single label for the column argument, we get back a Series just as with regular [ ] indexing.
elec_sample.loc[:, 'Candidate']
If we want a data frame instead and don't want to use to_frame, we can provide a list containing the column name.
elec_sample.loc[:, ['Candidate']]
We can also select a single row. Notice that in this case we also get back a Series where the index is the set of columns.
obama_row = elec_sample_iyear.loc[2008, :]
obama_row
obama_row.name
obama_row.index
obama_row.values
It is worth noting that the Series also functions like a map from the index to the values.
obama_row["Party"]
If we omit the column argument altogether, the default behavior is to retrieve all columns.
elections.set_index("Year").loc[[2008, 2012]]
What happens if you give scalar arguments for the requested rows AND columns? The answer is that you get back just a single value.
elections.loc[0, 'Candidate']
Both .loc[ ] and [ ] support arrays of booleans as input. In this case, the array must be exactly as long as the number of rows. The result is a filtered version of the DataFrame, where only rows corresponding to True appear.
elec_sample
elec_sample.loc[[False, False, False, False, True, False, False]]
You can also pass the same arguments to the [ ]
operator.
elec_sample[[False, False, False, False, True, False, False]]
One very common task in Data Science is filtering. Boolean Array Selection is one way to achieve this in Pandas. We start by observing logical operators like the equality operator can be applied to Pandas Series data to generate a Boolean Array. For example, we can compare the 'Result' column to the String 'win':
elections.head(5)
iswin = elections['Result'] == 'win'
iswin
The output of the logical operator applied to the Series is another Series with the same name and index, but of datatype boolean. The entry at row #i represents the result of the application of that operator to the entry of the original Series at row #i.
Such a boolean Series can be used as an argument to the [] operator. For example, the following code creates a DataFrame of all election winners since 1980.
elections[iswin]
Above, we've assigned the result of the logical operator to a new variable called iswin. This is uncommon. Usually, the series is created and used on the same line. Such code is a little tricky to read at first, but you'll get used to it quickly.
elections[elections['Result'] == 'win']
We can select multiple criteria by creating multiple boolean Series and combining them using the & operator.
elections[
(elections['Result'] == 'win') &
(elections['%'] < 50)
]
Using the logical negation operator ~ (not).
elections[
(elections['Result'] == 'win') &
~(elections['%'] < 50)
]
The | operator is the symbol for or.
elections[
~((elections['Party'] == "Democratic") |
(elections['Party'] == "Republican"))
]
If we have multiple conditions (say Republican or Democratic), we can use the isin operator to simplify our code.
elections[elections['Party'].isin(["Republican", "Democratic"])]
An alternate, simpler way to get back a specific set of rows is to use the query command.
elections.query("Result == 'win' and Year < 2000")
Note, the query command needs a bit of care: it cannot be applied to column names that contain spaces, periods, or other special characters unless the names are wrapped in backticks.
tmp2 = elections.rename(columns={"Year": "Elec Year", "%": "%*100"})
tmp2.head()
tmp2.query("`Elec Year` > 2000")
In general, I don't use the query function because of these issues.
iloc (Integer-Location)
loc's cousin iloc is very similar, but is used to access rows and columns based on numerical position instead of label, similar to indexing cells in a spreadsheet.
elec_small = elections.head(5)
elec_small
We use dataframe.iloc[row_slice, column_slice] to access specific rows and columns by their position. For example, to access the top 3 rows and top 3 columns of a table, we can use [0:3, 0:3]. iloc slicing is exclusive, just like standard Python slicing.
elec_small.iloc[0:3, 0:3]
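A toy side-by-side sketch (made-up data) of the two slicing conventions:

```python
import pandas as pd

df = pd.DataFrame({"a": [10, 20, 30, 40]})

loc_slice = df.loc[0:2, "a"]   # label slice: endpoint INCLUDED
iloc_slice = df.iloc[0:2, 0]   # position slice: endpoint EXCLUDED

print(loc_slice.tolist())   # [10, 20, 30]
print(iloc_slice.tolist())  # [10, 20]
```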
Return the last three rows using slicing?
elec_small.iloc[-3:, :]
We will use both loc and iloc in the course. loc is generally preferred for a number of reasons; for example, label-based access doesn't break when rows or columns are reordered, and it makes the intent of the code clearer. However, iloc is sometimes more convenient. We'll provide examples of when iloc is the superior choice.
In general, we avoid directly running python code on each row or iterating over dataframes. It is often orders of magnitude faster to use builtin operations to modify entire columns at once.
However, occasionally you need to apply code to the dataframe directly.
The apply function executes the input function on each row (if axis=1) or each column (if axis=0):
elections.apply(lambda row: row['Party'][0:3], axis=1)
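A self-contained sketch (toy data, made-up values) contrasting the two axis settings:

```python
import pandas as pd

df = pd.DataFrame({"Party": ["Democratic", "Republican"],
                   "Result": ["win", "loss"]})

# axis=1: the function is called once per row (each row arrives as a Series)
abbrev = df.apply(lambda row: row["Party"][0:3], axis=1)

# axis=0: the function is called once per column
lengths = df.apply(len, axis=0)

print(abbrev.tolist())   # ['Dem', 'Rep']
print(lengths.tolist())  # [2, 2] -- each column has 2 entries
```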
You can also directly iterate over the rows:
for (year, row) in elections.set_index("Year").iterrows():
if row['Result'] == "win":
print(f"In {year} the winner was {row['Party']}")
Which of the following expressions return a DataFrame of the first 3 Candidate and Party names for candidates that won with more than 50% of the vote?
elections.iloc[[0, 3, 5], [0, 3]]
elections.loc[[0, 3, 5], "Candidate":"Year"]
elections.loc[elections["%"] > 50, ["Candidate", "Year"]].head(3)
elections.loc[elections["%"] > 50, ["Candidate", "Year"]].iloc[0:2, :]
Consider a series of only the vote percentages of election winners.
winners = elections.query("Result == 'win'")["%"]
winners
We can perform various Python operations (including numpy operations) to DataFrames and Series.
max(winners)
np.mean(winners)
We can also do more complicated operations like computing the mean squared error, i.e. the average L2 loss.
c = 50.38
mse = np.mean((c - winners)**2)
mse
c2 = 50.35
mse2 = np.mean((c2 - winners)**2)
mse2
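A quick self-contained check (toy numbers, not the elections data) of the well-known fact that the constant minimizing mean squared error is the mean of the data:

```python
import numpy as np

data = np.array([48.0, 50.0, 52.0, 58.0])

def mse(c):
    # average squared distance from the constant c to the data
    return np.mean((c - data) ** 2)

cbar = data.mean()  # 52.0
# the MSE at the mean is no larger than at nearby constants
print(mse(cbar), mse(cbar + 1), mse(cbar - 1))  # 14.0 15.0 15.0
```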
We can also apply mathematical operations to a DataFrame so long as it has only numerical data.
(elections[["%", "Year"]] + 3).head(5)
Now let's play around a bit with the large baby names dataset we saw in lecture 1. We'll start by loading that dataset from the social security administration's website.
To keep the data small enough to avoid crashing datahub, we're going to look at only California rather than looking at the national dataset.
import urllib.request
import os.path
import zipfile
data_url = "https://www.ssa.gov/oact/babynames/state/namesbystate.zip"
local_filename = "babynamesbystate.zip"
if not os.path.exists(local_filename): # if the data exists don't download again
with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
f.write(resp.read())
zf = zipfile.ZipFile(local_filename, 'r')
ca_name = 'CA.TXT'
field_names = ['State', 'Sex', 'Year', 'Name', 'Count']
with zf.open(ca_name) as fh:
babynames = pd.read_csv(fh, header=None, names=field_names)
babynames.sample(5)
Goal 1: Find the most popular female baby names in California in 2018
babynames[
(babynames["Year"] == 2018) &
(babynames["Sex"] == "F")
].sort_values(by = "Count", ascending = False).head(20)
Goal 2: Make a plot of how many baby girls were named Nora over the years.
babynames[(babynames.Name == "Nora") & (babynames.Sex == "F") ].iplot(x="Year", y="Count")