Python Software in Data 100

We will be using Python and Jupyter Notebooks for most of the class. You should take some time to learn how to use Jupyter and some of the keyboard shortcuts.

Software Packages

We will be using a wide range of different Python software packages. To install and manage these packages we will be using the Conda environment manager. The following is a list of packages we will routinely use in lectures and homeworks:

Linear Algebra

In [1]:
import numpy as np

Data manipulation

In [2]:
import pandas as pd

Visualization

In [3]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Interactivity

In [4]:
import ipywidgets as widgets
from ipywidgets import interact

Occasionally we will also use interactive visualization libraries like Plotly:

In [5]:
## Plotly plotting support
import plotly.offline as py
py.init_notebook_mode()
import plotly.graph_objs as go
import plotly.figure_factory as ff
import cufflinks as cf
cf.set_config_file(offline=True, world_readable=True, theme='ggplot')

The Data Science Lifecycle

In this lecture we provide a sample of the various topics covered in DS100. In particular we will discuss key aspects of the data science lifecycle:

  1. Question/Problem Formulation:
    1. What do we want to know or what problems are we trying to solve?
    2. What are our hypotheses?
    3. What are our metrics of success?

  2. Data Acquisition and Cleaning:
    1. What data do we have and what data do we need?
    2. How will we collect more data?
    3. Is our data representative?
    4. How do we organize the data for analysis?

  3. Exploratory Data Analysis:
    1. How is our data organized and what does it contain?
    2. What are the biases, anomalies, or other issues with the data?
    3. How do we transform the data to enable effective analysis?

  4. Prediction and Inference:
    1. What does the data say about the world?
    2. Does it answer our questions or accurately solve the problem?
    3. How robust are our conclusions and can we trust the predictions?

Starting with a Question: Who are you (the students of DS100)?

This is a pretty vague question but let's start with the goal of learning something about the students in the class.

Data Acquisition and Cleaning

In DS100 we will study various methods to collect data.

To answer this question, I downloaded the course roster and extracted everyone's names.

In [6]:
students = pd.read_csv("roster.csv")
students.head()
Out[6]:
Name Role
0 NOAH Student
1 HAYDEN Student
2 Christopher Student
3 Sharon Student
4 Shruti Student

What are some of the issues that we will need to address in this data?

Answer:

  1. What is the meaning of the Role field?
  2. Some names are capitalized and some are not.

In the above sample we notice that some of the names are capitalized and some are not. This will be an issue in our later analysis so let's convert all names to lower case.

In [7]:
students['Name'] = students['Name'].str.lower()
students.head()
Out[7]:
Name Role
0 noah Student
1 hayden Student
2 christopher Student
3 sharon Student
4 shruti Student




Exploratory Data Analysis

In DS100 we will study exploratory data analysis and practice analyzing new datasets.

How many records do we have?

A good starting point is understanding the size of the data.

Solution

In [8]:
print("There are", len(students), "students on the roster.")
There are 664 students on the roster.



Is this big data? (Would you call this a "big class"?)

Answer

This would not normally constitute big data ... however this is a common data size for a lot of data analysis tasks.

Is this a big class? YES!

Is my data representative of the population I want to study?



Should we be worried about the sample size? Is this even a sample?

Answer:

This is (or at least was) a complete census of the class containing all the official students. We will learn more about data collection and sampling in the next lecture.



Understanding the structure of data

It is important that we understand the meaning of each field and how the data is organized.

In [9]:
students.head()
Out[9]:
Name Role
0 noah Student
1 hayden Student
2 christopher Student
3 sharon Student
4 shruti Student

What is the meaning of the Role field?

Solution: Understanding the meaning of a field can often be achieved by looking at the type of data it contains (in particular, the counts of its unique values).

In [10]:
students['Role'].value_counts().to_frame()
Out[10]:
Role
Student 600
Waitlist Student 64

Summarizing the Data

We will often want to numerically or visually summarize the data. How can we summarize the name field?

In DS100 we will deal with many different kinds of data (not just numbers) and we will study techniques for analyzing diverse types of data.

A good starting point might be to examine the lengths of the strings.

In [11]:
sns.distplot(students['Name'].str.len(), rug=True, axlabel="Number of Characters")
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x11983b860>

The above density plot combines a histogram, a kernel density estimate, and a rug plot to convey information about the distribution of name lengths.
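The kernel density estimate underlying this plot is easy to compute directly. Below is a minimal NumPy sketch (using made-up name lengths, not the roster data): each observation contributes a Gaussian "bump," and the bumps are averaged to form a smooth density.

```python
import numpy as np

# Hypothetical name lengths, purely for illustration (not the roster data).
lengths = np.array([4.0, 5, 5, 6, 6, 6, 7, 7, 8, 9, 11])

def kde(xs, data, bw):
    # Center a Gaussian bump of width bw on each observation,
    # then average the bumps to get a smooth density estimate.
    z = (xs[:, None] - data[None, :]) / bw
    bumps = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return bumps.mean(axis=1) / bw

xs = np.linspace(0, 16, 400)
density = kde(xs, lengths, bw=1.0)

# A density is non-negative and integrates to (approximately) one.
area = density.sum() * (xs[1] - xs[0])
```

The bandwidth bw controls the amount of smoothing; seaborn chooses one automatically from the data.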

In DS100 we will learn a lot about how to visualize data.



Does the above plot seem reasonable? Why might we want to check the lengths of strings?

Answer: Yes, the above plot seems reasonable for name lengths. We might be concerned if there were 0- or 1-letter names, as these might represent abbreviations or missing entries.






What does a name tell us about a person?

We normally don't get to pick our names but they can say a lot about us.

Question: What information might a name reveal about a person?

Here are some examples we will explore in this lecture:

  1. Gender
  2. Age




Obtaining More Data

To study what a name tells us about a person we will download data from the United States Social Security Administration containing the number of registered names broken down by year, sex, and name. This is often called the baby names data, as Social Security numbers are typically assigned at birth.

Note: In the following we download the data programmatically to ensure that the process is reproducible.

In [12]:
import urllib.request
import os.path

data_url = "https://www.ssa.gov/oact/babynames/names.zip"
local_filename = "babynames.zip"
if not os.path.exists(local_filename): # if the data exists don't download again
    with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
        f.write(resp.read())

The data is organized into separate files in the format yobYYYY.txt with each file containing the name, sex, and count of babies registered in that year.

Loading the Data

Note: In the following we load the data directly into python without decompressing the zipfile.

In DS100 we will think a bit more about how we can be efficient in our data analysis to support processing large datasets.

In [13]:
import zipfile
babynames = [] 
with zipfile.ZipFile(local_filename, "r") as zf:
    data_files = [f for f in zf.filelist if f.filename[-3:] == "txt"]
    def extract_year_from_filename(fn):
        return int(fn[3:7])
    for f in data_files:
        year = extract_year_from_filename(f.filename)
        with zf.open(f) as fp:
            df = pd.read_csv(fp, names=["Name", "Sex", "Count"])
            df["Year"] = year
            babynames.append(df)
babynames = pd.concat(babynames)




Understanding the Setting

In DS100 you will have to learn about different data sources on your own.

Reading from SSN Office description:

All names are from Social Security card applications for births that occurred in the United States after 1879. Note  that many people born before 1937 never applied for a Social Security card, so their names are not included in our data. For others who did apply, our records may not show the place of birth, and again their names are not included in our data.

All data are from a 100% sample of our records on Social Security card applications as of March 2017.




Data Cleaning

Examining the data:

In [14]:
babynames.head()
Out[14]:
Name Sex Count Year
0 Mary F 9217 1884
1 Anna F 3860 1884
2 Emma F 2587 1884
3 Elizabeth F 2549 1884
4 Minnie F 2243 1884

In our earlier analysis we converted names to lower case. We will do the same again here:

In [15]:
babynames['Name'] = babynames['Name'].str.lower()
babynames.head()
Out[15]:
Name Sex Count Year
0 mary F 9217 1884
1 anna F 3860 1884
2 emma F 2587 1884
3 elizabeth F 2549 1884
4 minnie F 2243 1884




Exploratory Data Analysis

How many people does this data represent?

In [16]:
format(babynames['Count'].sum(), ',d') 
Out[16]:
'344,533,897'
In [17]:
len(babynames)
Out[17]:
1891894

Is this number low or high?

Answer

It seems low. However the social security website states:

All names are from Social Security card applications for births that occurred in the United States after 1879. **Note that many people born before 1937 never applied for a Social Security card, so their names are not included in our data.** For others who did apply, our records may not show the place of birth, and again their names are not included in our data. All data are from a 100% sample of our records on Social Security card applications as of the end of February 2016.




Temporal Patterns Conditioned on Gender

In DS100 we still study how to visualize and analyze relationships in data.

In this example we construct a pivot table which aggregates the number of babies registered for each year by Sex.

In [18]:
pivot_year_name_count = pd.pivot_table(babynames, 
        index=['Year'], # the row index
        columns=['Sex'], # the column values
        values='Count', # the field(s) to process in each group
        aggfunc=np.sum,
    )

pivot_year_name_count.head()
Out[18]:
Sex F M
Year
1880 90992 110491
1881 91953 100743
1882 107847 113686
1883 112318 104627
1884 129020 114443

We can visualize these descriptive statistics:

In [19]:
pink_blue = ["#E188DB", "#334FFF"]
with sns.color_palette(sns.color_palette(pink_blue)):
    pivot_year_name_count.plot(marker=".")
    plt.title("Registered Names vs Year Stratified by Sex")
    plt.ylabel('Names Registered that Year')

In DS100 we will learn to use many different plotting technologies.

In [20]:
pivot_year_name_count.iplot(
    mode="lines+markers", size=8, colors=pink_blue,
    xTitle="Year", yTitle='Names Registered that Year',
    filename="Registered SSN Names")
In [21]:
pivot_year_name_nunique = pd.pivot_table(babynames, 
        index=['Year'], 
        columns=['Sex'], 
        values='Name', 
        aggfunc=lambda x: len(np.unique(x)),
    )

pivot_year_name_nunique.iplot(
    mode="lines+markers", size=8, colors=pink_blue,
    xTitle="Year", yTitle='Number of Unique Names',
    filename="Unique SSN Names")

Some observations:

  1. Registration data seems limited in the early 1900s because many people did not register before 1937.
  2. You can see the baby boomers.
  3. Females have greater diversity of names.




Estimating the Sex of a Baby Given Its Name

We can use the baby names dataset to compute the total number of babies with each name broken down by Sex.

In [22]:
sex_counts = pd.pivot_table(babynames, index='Name', columns='Sex', values='Count',
                            aggfunc='sum', fill_value=0., margins=True)
sex_counts.head()
Out[22]:
Sex F M All
Name
aaban 0.0 96.0 96.0
aabha 35.0 0.0 35.0
aabid 0.0 10.0 10.0
aabir 0.0 5.0 5.0
aabriella 26.0 0.0 26.0

For each name we would like to estimate the probability that the baby is Female.

$$ \Large \hat{\textbf{P}\hspace{0pt}}(\texttt{Female} \,\,\, | \,\,\, \texttt{Name} ) = \frac{\textbf{Count}(\texttt{Female and Name})}{\textbf{Count}(\texttt{Name})} $$

We can calculate this estimator from the data:

In [23]:
prob_female = sex_counts['F'] / sex_counts['All'] 
prob_female.head()
Out[23]:
Name
aaban        0.0
aabha        1.0
aabid        0.0
aabir        0.0
aabriella    1.0
dtype: float64

Testing the function:

In [24]:
prob_female["joey"]
Out[24]:
0.10852281623492981
In [25]:
prob_female["fernando"]
Out[25]:
0.0061703653700109695
In [26]:
prob_female["nhi"]
Out[26]:
1.0
In [27]:
prob_female["sona"]
Out[27]:
1.0

We can define a function to return the most likely Sex for a name.

In [28]:
def sex_from_name(name):
    lower_name = name.lower()
    if lower_name in prob_female.index:
        return 'F' if prob_female[lower_name] > 0.5 else 'M'
    else:
        return "Unknown"
In [29]:
sex_from_name("Joey")
Out[29]:
'M'

Estimating the fraction of Females in DS100

Can we use the Baby Names data?

What fraction of the student names are in the baby names database:

In [30]:
names = pd.Index(students["Name"]).intersection(prob_female.index)
print("Fraction of names in the babynames data:", len(names) / len(students))
Fraction of names in the babynames data: 0.8855421686746988

Applying the sex_from_name function to all students

We can apply the sex_from_name function to all students in the class to estimate the number of male and female students.

In [31]:
count_by_sex = students['Name'].apply(sex_from_name).value_counts().to_frame()
count_by_sex
Out[31]:
Name
M 404
F 184
Unknown 76

Using the above we can estimate the fraction of female students in the class

In [32]:
count_by_sex.loc['F'] / (count_by_sex.loc['M'] + count_by_sex.loc['F'])
Out[32]:
Name    0.312925
dtype: float64
  1. How do we feel about this estimate?
  2. Do we trust it?

Using simulation to estimate uncertainty

Below we build a primitive estimate of the uncertainty in our model by simulating the Sex of each student according to the baby names dataset.

In [33]:
def simulate_class(students):
    p = prob_female.loc[students['Name']].dropna()
    is_female = np.random.rand(len(p)) < p
    return np.mean(is_female)

proportions = np.array([simulate_class(students) for n in range(1000)])
In [34]:
sns.distplot(proportions, rug=True, axlabel="Proportion Female")
Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x121dcc4a8>




Bonus Material: Does your name reveal your age?

Feel free to download the notebook and follow along with your own name.

Distribution of a Name over Time

We want to estimate the probability of being born in a particular year given someone's name. To construct this estimate for each name we need to compute the number of babies born each year with that name.

In the following code block we use the pivot_table expression to compute the total number of babies born with a given name for each year.

In [35]:
name_year_pivot = babynames.pivot_table( 
        index=['Year'], columns=['Name'], values='Count', aggfunc=np.sum)
name_year_pivot.head()
Out[35]:
Name aaban aabha aabid aabir aabriella aada aadam aadan aadarsh aaden ... zytaveon zytavion zytavious zyus zyva zyvion zyvon zyyanna zyyon zzyzx
Year
1880 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1881 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1882 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1883 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1884 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 96174 columns

Notice that there are a lot of NaN values. These correspond to names that didn't occur in that year.

To estimate the probability of being born in a year given the name we need to compute:

$$ \Large \hat{\textbf{P}\hspace{0pt}}(\texttt{Year} \,\,\, | \,\,\, \texttt{Name} ) = \frac{\textbf{Count}(\texttt{Year and Name})}{\textbf{Count}(\texttt{Name})} $$

For years in which a name never occurred, we will replace the NaN with probability 0.0 using fillna.

In [36]:
prob_year_given_name = name_year_pivot.div(name_year_pivot.sum()).fillna(0.0)
prob_year_given_name.head()
Out[36]:
Name aaban aabha aabid aabir aabriella aada aadam aadan aadarsh aaden ... zytaveon zytavion zytavious zyus zyva zyvion zyvon zyyanna zyyon zzyzx
Year
1880 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1881 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1882 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1883 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1884 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 96174 columns

Visualizing the $\hat{\textbf{P}\hspace{0pt}}(\texttt{Year} \,\,\, | \,\,\, \texttt{Name} )$

In the following we visualize the probability of names over time:

In [37]:
prob_year_given_name[["joey", "fernando", "joyce", "deb"]].iplot(
    mode="lines+markers", size=8, xTitle="Year", yTitle='Proportion',
    filename="Name Popularity")

Notice that some names are spread over time while others have more specific times when they were popular.





Trying more Contemporary Names

We can also examine some more contemporary names.

In [38]:
prob_year_given_name[["keisha", "kanye", "nemo"]].iplot(
    mode="lines+markers", size=8, xTitle="Year", yTitle='Proportion',
    filename="Name Popularity2")

Notice that Kanye was popular in 2004 and "Finding Nemo" came out in 2003.





Question: How old is the class?

Ideally, we would run a census collecting the age of each student. What are the limitations of this approach?

  1. It is time consuming/costly to collect this data.
  2. Students may refuse to answer.
  3. Students may not answer truthfully.




Can we use the Baby Names data?

What fraction of the student names are in the baby names database:

In [39]:
names = pd.Index(students["Name"]).intersection(prob_year_given_name.columns)
print("Fraction of names in the babynames data:", len(names) / len(students))
Fraction of names in the babynames data: 0.8855421686746988




Simulation

In Data8 we relied on simulation. Here we define a function to simulate the year of birth of a baby given its name by sampling from the distribution that we estimated above.

In [40]:
def simulate_age_given_name(name):
    years = prob_year_given_name.index.values
    return np.random.choice(years, size=1, p = prob_year_given_name.loc[:, name])[0]

Let's simulate a year of birth for Professor Gonzalez.

In [41]:
simulate_age_given_name("joey")
Out[41]:
1955

The following function simulates a year of birth for each student in the class and returns the average.

In [42]:
def simulate_class_avg_age(names):
    return np.mean([simulate_age_given_name(n) for n in names])
In [43]:
simulate_class_avg_age(names)
Out[43]:
1983.3486394557824

We can run this function many times to compute the distribution over the average ages of the class. How many times should we run?
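There is no single right answer, but the standard error of a Monte Carlo average shrinks like 1/sqrt(n) in the number of runs, so quadrupling the runs only halves the uncertainty. A small synthetic check (the 1983 center and spread of 2 below are made-up stand-ins, not the class data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for one simulated class-average age:
# a draw centered at 1983 with spread 2 (made-up numbers).
true_mean, spread = 1983.0, 2.0

stderr = {}
for n in [100, 400, 1600]:
    # Repeat the n-run experiment 200 times and measure the
    # spread of the resulting averages.
    means = rng.normal(true_mean, spread, size=(200, n)).mean(axis=1)
    stderr[n] = means.std()

# stderr[n] shrinks roughly like 1 / sqrt(n).
```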

In [44]:
dist_avg_age = np.array([simulate_class_avg_age(names) for c in range(200)])
In [45]:
# sns.distplot(dist_avg_age)
f = ff.create_distplot([dist_avg_age], ["Class Ages"],bin_size=0.25)
py.iplot(f)

We have just constructed a Monte Carlo simulation of the average ages of the class under the estimates that we constructed.




Is this a good estimator?

  1. How many of you were born around 1983?
  2. How many of you were born before 1988?




What went wrong?

  1. Our age distribution looked at the popularity of a name through all time. Who was born before 1890?
  2. Students are likely to have been born much more recently.

How can we incorporate this knowledge?





Incorporating Prior Knowledge

What if we constrain our data to a more realistic time window?

In [46]:
lthresh = 1985
uthresh = 2005
prior = pd.Series(0.000001, index = name_year_pivot.index, name="prior") 
prior[(prior.index > lthresh) & (prior.index < uthresh)] = 1.0
prior = prior/np.sum(prior)
prior.plot()
Out[46]:
<matplotlib.axes._subplots.AxesSubplot at 0x124885278>

Incorporating the Prior Belief into our Model

We apply Bayes rule to capture our prior knowledge.

$$\large P(y \,|\, n) = \frac{P(n\, | \, y) P(y)}{P(n)} $$

Time permitting we will cover some basics of Bayesian modeling in DS100.
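Before applying Bayes' rule to the full table, the computation can be sanity-checked on a toy example. The two names, two years, and all counts below are made up purely for illustration:

```python
import numpy as np
import pandas as pd

# Made-up counts of two names across two years.
counts = pd.DataFrame({"alice": [10, 90], "zelda": [50, 50]},
                      index=pd.Index([1950, 2000], name="Year"))

# Likelihood P(name | year): normalize each year (row) to sum to one.
likelihood = counts.div(counts.sum(axis=1), axis=0)

# Prior P(year): believe 2000 is four times as likely as 1950.
prior = pd.Series([0.2, 0.8], index=counts.index)

# Posterior P(year | name) is proportional to P(name | year) * P(year),
# renormalized so each name's column sums to one.
unnorm = likelihood.mul(prior, axis=0)
posterior = unnorm.div(unnorm.sum(axis=0), axis=1)
```

Because "alice" is far more common in 2000 and the prior also favors 2000, the posterior concentrates there; "zelda" is split more evenly.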

In [47]:
year_name_pivot = babynames.pivot_table( 
        index=['Name'], columns=['Year'], values='Count', aggfunc=np.sum)
prob_name_given_year = year_name_pivot.div(year_name_pivot.sum()).fillna(0)
u = (prob_name_given_year * prior)
posterior = (u.div(u.sum(axis=1), axis=0)).fillna(0.0).transpose()
posterior_age_dist = np.mean(posterior[names],axis=1)
In [48]:
posterior_age_dist.iplot(xTitle="Year", yTitle="P(Age of Student)")
In [49]:
post_class_ages = []
for i in range(10000):
    post_class_ages.append(
        np.mean(np.random.choice(posterior_age_dist.index.values, size=len(names), 
                                 p=posterior_age_dist)))
print(np.percentile(post_class_ages, [2.5, 50, 97.5]))
f = ff.create_distplot([post_class_ages], ["Posterior Class Ages"],bin_size=0.25)
py.iplot(f)
[ 1995.40982143  1995.90306122  1996.41496599]