In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline

## Plotly plotting support
# import plotly.plotly as py

import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.figure_factory as ff
import cufflinks as cf

cf.set_config_file(offline=True, world_readable=True, theme='ggplot')

Overview

In this lecture we provide a sample of the various topics covered in DS100. In particular, we will discuss key aspects of the data science lifecycle:

  1. Question/Problem Formulation:
    1. What do we want to know or what problems are we trying to solve?
    2. What are our hypotheses?
    3. What are our metrics of success?

  2. Data Acquisition and Cleaning:
    1. What data do we have and what data do we need?
    2. How will we collect more data?
    3. How do we organize the data for analysis?

  3. Exploratory Data Analysis:
    1. Do we already have relevant data?
    2. What are the biases, anomalies, or other issues with the data?
    3. How do we transform the data to enable effective analysis?

  4. Prediction and Inference:
    1. What does the data say about the world?
    2. Does it answer our questions or accurately solve the problem?
    3. How robust are our conclusions?

Question: Who are you (the students of DS100)?

This is a pretty vague question, but let's start with the goal of learning something about the students in the class.

Data Acquisition and Cleaning

In DS100 we will study various methods to collect data.

To answer this question, I downloaded the course roster and extracted everyone's names.

In [2]:
students = pd.read_csv("roster.csv")
students.head()
Out[2]:
Name Role
0 Keeley Student
1 John Student
2 BRYAN Student
3 Kaylan Student
4 Sol Student

What are some of the issues that we will need to address in this data?

Answer:

  1. What is the meaning of the Role field?
  2. Some names are capitalized and others are not.




Data Cleaning

In DS100 we will study how to identify anomalies in data and apply corrections.

In the above sample we notice that some of the names are capitalized and some are not. This will be an issue in our later analysis, so let's convert all names to lower case.

In [3]:
students['Name'] = students['Name'].str.lower()
students.head()
Out[3]:
Name Role
0 keeley Student
1 john Student
2 bryan Student
3 kaylan Student
4 sol Student




Exploratory Data Analysis

In DS100 we will study exploratory data analysis and practice analyzing new datasets.



A good starting point is understanding the size of the data.

How many records do we have?

Solution

In [4]:
print("There are", len(students), "students on the roster.")
There are 279 students on the roster.



Is this big data? (or at least "big class")

Answer:

This would not normally constitute big data; however, it is a common size for many data analysis tasks.



Should we be worried about the sample size? Is this even a sample?

Answer:

This is (or at least was) a complete census of the class containing all the officially enrolled students. We will learn more about data collection and sampling.



What is the meaning of the Role field?

Solution

Understanding the meaning of a field can often be achieved by looking at the type of data it contains and, in particular, the counts of its unique values.

In [5]:
pd.DataFrame(students['Role'].value_counts())
Out[5]:
Role
Student 237
Waitlist Student 42



What about the names? How can we summarize this field?

In DS100 we will deal with many different kinds of data (not just numbers), and we will study techniques for analyzing diverse types of data.

A good starting point might be to examine the lengths of the strings.

In [6]:
sns.distplot(students['Name'].str.len(), rug=True, axlabel="Number of Characters")
Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x118c88c88>

The above density plot combines a histogram with a kernel density estimate and a rug plot to convey information about the distribution of name lengths.

In DS100 we will learn a lot about how to visualize data.



Does the above plot seem reasonable? Why might we want to check the lengths of strings?

Answer

Yes, the above plot seems reasonable for name lengths. We might be concerned if there were 0- or 1-character names, as these might represent abbreviations or missing entries.
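
As a quick check, we could look for any suspiciously short names (a minimal sketch, not part of the original analysis):

# Any 0- or 1-character names would warrant a closer look;
# an empty result means none were found.
students[students['Name'].str.len() <= 1]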






What is in a name?

We normally don't get to pick our names but they can say a lot about us. What information might a name reveal about a person?

Here are some examples we will explore in this lecture:

  1. Age
  2. Gender




Obtaining More Data

To study what a name tells about a person we will download data from the United States Social Security Administration containing the number of registered names broken down by year, sex, and name. This is often called the baby names data, as Social Security numbers are typically assigned at birth.

Note: In the following we download the data programmatically to ensure that the process is reproducible.

In [7]:
import urllib.request
import os.path

data_url = "https://www.ssa.gov/oact/babynames/names.zip"
local_filename = "babynames.zip"
if not os.path.exists(local_filename): # if the data exists don't download again
    with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
        f.write(resp.read())

The data is organized into separate files in the format yobYYYY.txt with each file containing the name, sex, and count of babies registered in that year.
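
Before parsing, we can peek inside the archive to confirm this layout (a minimal sketch):

import zipfile

# List the first few files in the archive; we expect names like "yob1880.txt".
with zipfile.ZipFile(local_filename, "r") as zf:
    print([f.filename for f in zf.filelist[:5]])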

Loading the Data

Note: In the following we load the data directly into Python without decompressing the zip file.

In DS100 we will think a bit more about how we can be efficient in our data analysis to support processing large datasets.

In [8]:
import zipfile
babynames = [] 
with zipfile.ZipFile(local_filename, "r") as zf:
    # Select the per-year data files, which are named "yobYYYY.txt".
    data_files = [f for f in zf.filelist if f.filename[-3:] == "txt"]
    def extract_year_from_filename(fn):
        return int(fn[3:7])  # "yobYYYY.txt" -> YYYY
    for f in data_files:
        year = extract_year_from_filename(f.filename)
        with zf.open(f) as fp:
            df = pd.read_csv(fp, names=["Name", "Sex", "Count"])
            df["Year"] = year
            babynames.append(df)
babynames = pd.concat(babynames)




Understanding the Setting

In DS100 you will have to learn about different data sources on your own.

Reading from the Social Security Administration's description:

All names are from Social Security card applications for births that occurred in the United States after 1879. Note that many people born before 1937 never applied for a Social Security card, so their names are not included in our data. For others who did apply, our records may not show the place of birth, and again their names are not included in our data.

All data are from a 100% sample of our records on Social Security card applications as of March 2017.
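
As a quick sanity check against this description, we can confirm the range of years actually present in the data (a minimal sketch):

# Births "after 1879" means the data should begin in 1880.
print(babynames['Year'].min(), babynames['Year'].max())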




Data Cleaning

Examining the data:

In [9]:
babynames.head()
Out[9]:
Name Sex Count Year
0 Mary F 9217 1884
1 Anna F 3860 1884
2 Emma F 2587 1884
3 Elizabeth F 2549 1884
4 Minnie F 2243 1884

In our earlier analysis we converted names to lower case. We will do the same again here:

In [10]:
babynames['Name'] = babynames['Name'].str.lower()
babynames.head()
Out[10]:
Name Sex Count Year
0 mary F 9217 1884
1 anna F 3860 1884
2 emma F 2587 1884
3 elizabeth F 2549 1884
4 minnie F 2243 1884




Exploratory Data Analysis

How many people does this data represent?

In [11]:
format(babynames['Count'].sum(), ',d') 
Out[11]:
'344,533,897'

Is this number low or high?

Answer

It seems low. However, this is what the Social Security website states: All names are from Social Security card applications for births that occurred in the United States after 1879. Note that many people born before 1937 never applied for a Social Security card, so their names are not included in our data. For others who did apply, our records may not show the place of birth, and again their names are not included in our data. All data are from a 100% sample of our records on Social Security card applications as of the end of February 2016.
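
To see how much the pre-1937 coverage gap matters, we could check how many of the registrations in the data come from earlier births (a minimal sketch, not part of the original analysis):

# Total registered names for birth years before 1937, when many
# people never applied for a Social Security card.
format(babynames[babynames['Year'] < 1937]['Count'].sum(), ',d')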





Temporal Patterns Conditioned on Gender

In DS100 we will study how to visualize and analyze relationships in data.

In [12]:
pivot_year_name_count = pd.pivot_table(babynames, 
        index=['Year'], # the row index
        columns=['Sex'], # the column values
        values='Count', # the field(s) to be processed in each group
        aggfunc=np.sum,
    )

pink_blue = ["#E188DB", "#334FFF"]
with sns.color_palette(pink_blue):
    pivot_year_name_count.plot(marker=".")
    plt.title("Registered Names vs Year Stratified by Sex")
    plt.ylabel('Names Registered that Year')

In DS100 we will learn to use many different plotting technologies.

In [13]:
pivot_year_name_count.iplot(
    mode="lines+markers", size=8, colors=pink_blue,
    xTitle="Year", yTitle='Names Registered that Year',
    filename="Registered SSN Names")

How has the number of unique names for each sex changed?

In [14]:
pivot_year_name_nunique = pd.pivot_table(babynames, 
        index=['Year'], 
        columns=['Sex'], 
        values='Name', 
        aggfunc=lambda x: len(np.unique(x)),
    )

pivot_year_name_nunique.iplot(
    mode="lines+markers", size=8, colors=pink_blue,
    xTitle="Year", yTitle='Number of Unique Names',
    filename="Unique SSN Names")

The increase in unique names could in part be due to the increasing number of total registrations, so let's look at the fraction of unique names instead:

In [15]:
(pivot_year_name_nunique/pivot_year_name_count).iplot(
    mode="lines+markers", size=8, colors=pink_blue,
    xTitle="Year", yTitle='Fraction of Unique Names',
    filename="Unique SSN Names")

What patterns do we see?

Some observations

  1. Registration data seems limited in the early 1900s, because many people born before 1937 never registered.
  2. You can see the baby boomers.
  3. Females have greater diversity of names.




Question: Does your name reveal your age?

In the following cell we define a variable for your name. Feel free to download the notebook and follow along.

In [16]:
my_name = "joey" # all lowercase

For each name, compute the proportion of registrations in each year:

In [17]:
# Rows are years, columns are names, entries are total counts.
name_year_pivot = babynames.pivot_table( 
        index=['Year'], columns=['Name'], values='Count', aggfunc=np.sum)
# Normalize each column to sum to 1, giving an empirical P(Year | Name).
prob_year_given_name = name_year_pivot.div(name_year_pivot.sum()).fillna(0)
In [18]:
prob_year_given_name[[my_name, "joseph", "deborah"]].iplot(
    mode="lines+markers", size=8, xTitle="Year", yTitle='Poportion',
    filename="Name Popularity")




Trying More Contemporary Names

In [19]:
prob_year_given_name[[my_name, "keisha", "kanye"]].iplot(
    mode="lines+markers", size=8, xTitle="Year", yTitle='Poportion',
    filename="Name Popularity2")




Question: How old is the class?

Ideally, we would run a census collecting the age of each student. What are the limitations of this approach?

  1. It is time consuming/costly to collect this data.
  2. Students may refuse to answer.
  3. Students may not answer truthfully.




Can we use the Baby Names data?

What fraction of the student names appear in the baby names database?

In [20]:
names = pd.Index(students["Name"]).intersection(prob_year_given_name.columns)
In [21]:
print("Fraction of names in the babynames data:" , len(names) / len(students))
Fraction of names in the babynames data: 0.8458781362007168
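
We might also inspect which roster names were not matched (a minimal sketch; these are simply names absent from the SSA data):

# Roster names that never appear in the babynames data.
missing = pd.Index(students["Name"]).difference(prob_year_given_name.columns)
missing[:10]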

Simulation

In Data8 we relied on simulation:

In [22]:
def simulate_name(name):
    # Sample a birth year according to the empirical P(Year | Name).
    years = prob_year_given_name.index.values
    return np.random.choice(years, size=1, p=prob_year_given_name.loc[:, name])[0]
In [23]:
simulate_name("joey")
Out[23]:
1984
In [24]:
def simulate_class_avg(num_classes, names):
    # Simulate num_classes classes, averaging one sampled birth year per name.
    return np.array([np.mean([simulate_name(n) for n in names])
                     for _ in range(num_classes)])
In [25]:
simulate_class_avg(1, names)
Out[25]:
array([ 1982.27542373])

Simulating the Average Age of Students in the Class

In [26]:
class_ages = simulate_class_avg(200, names)
In [27]:
f = ff.create_distplot([class_ages], ["Class Ages"], bin_size=0.25)
py.iplot(f)

Directly Marginalizing the Empirical Distribution

We could build the probability distribution for the birth year of a randomly chosen student directly:

$$ \tilde{P}(\text{year} \, |\, \text{this class}) = \sum_{\text{name}} \tilde{P}(\text{year} \, | \, \text{name}) \tilde{P}(\text{name} \, | \, \text{this class}) $$

In DS100 we will explore the use of probability calculations to derive new estimators.

However, we already have direct estimates of these distributions, so instead we can marginalize over the student names:

In [28]:
age_dist = prob_year_given_name[names].mean(axis=1)
age_dist.iplot(xTitle="Year", yTitle="P(Age of Student)")
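
To connect this code to the formula above: taking the mean across the name columns is exactly the marginalization with a uniform P(name | this class). A minimal sketch making the weights explicit (assuming uniform weights over the distinct roster names):

# Explicit marginalization: sum over names of P(year | name) * P(name | class),
# with uniform weights 1/len(names) over the distinct roster names.
name_weights = pd.Series(1 / len(names), index=names)
explicit_age_dist = prob_year_given_name[names].dot(name_weights)
# This matches prob_year_given_name[names].mean(axis=1) above.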

What is the expected birth year of the students?

In [29]:
# E[Year]: sum over years of year * P(year).
np.sum(age_dist * age_dist.index.values)
Out[29]:
1983.8467418005248

Is this a reasonable estimate? Can we quantify our uncertainty?

In [30]:
class_ages = []
for i in range(10000):
    # Draw a birth year for each student from the class age distribution
    # and record the class average.
    class_ages.append(
        np.mean(np.random.choice(age_dist.index.values, size=len(names), p=age_dist)))
print(np.percentile(class_ages, [2.5, 50, 97.5]))
f = ff.create_distplot([class_ages], ["Class Ages"], bin_size=0.25)
py.iplot(f)
[ 1980.20338983  1983.88135593  1987.31811441]

Is this a good estimator?

  1. How many of you were born around 1983?
  2. How many of you were born before 1988?




What went wrong?

  1. Our age distribution looked at the popularity of a name across all years. Who in this class was born before 1890?
  2. Students are likely to have been born much more recently.

How can we incorporate this knowledge?





Incorporating Prior Knowledge

What if we constrain our data to a more realistic time window?

In [31]:
lthresh = 1985
uthresh = 2005
# Near-zero probability outside the window, uniform inside; then normalize.
prior = pd.Series(0.000001, index=name_year_pivot.index, name="prior")
prior[(prior.index > lthresh) & (prior.index < uthresh)] = 1.0
prior = prior / np.sum(prior)
prior.plot()
Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x11e8fe4a8>

Incorporating the Prior Belief into our Model

We take advantage of Bayesian reasoning, where $y$ is the birth year and $n$ is the name:

$$\large P(y \,|\, n) = \frac{P(n\, | \, y) P(y)}{P(n)} $$

Time permitting we will cover some basics of Bayesian modeling in DS100.

In [32]:
# Rows are names, columns are years, entries are total counts.
year_name_pivot = babynames.pivot_table( 
        index=['Name'], columns=['Year'], values='Count', aggfunc=np.sum)
# Normalize each year's column to get the likelihood P(Name | Year).
prob_name_given_year = year_name_pivot.div(year_name_pivot.sum()).fillna(0)
# Multiply by the prior P(Year) and renormalize each row to obtain
# the posterior P(Year | Name).
u = (prob_name_given_year * prior)
posterior = (u.div(u.sum(axis=1), axis=0)).fillna(0.0).transpose()
# Average the posterior distributions across the roster names.
posterior_age_dist = np.mean(posterior[names], axis=1)
In [33]:
posterior_age_dist.iplot(xTitle="Year", yTitle="P(Age of Student)")
In [34]:
post_class_ages = []
for i in range(10000):
    post_class_ages.append(
        np.mean(np.random.choice(posterior_age_dist.index.values, size=len(names), 
                                 p=posterior_age_dist)))
print(np.percentile(post_class_ages, [2.5, 50, 97.5]))
f = ff.create_distplot([post_class_ages], ["Posterior Class Ages"], bin_size=0.25)
py.iplot(f)
[ 1995.25423729  1996.01271186  1996.77542373]

What is the gender distribution of the class?

We can construct a similar analysis for gender, looking at recent names only.

In [35]:
# Restrict to the prior window of years; rows are Sex, columns are Names.
gender_pivot = babynames[(babynames['Year'] > lthresh) & (babynames['Year'] < uthresh)].pivot_table( 
        index=['Sex'], columns=['Name'], values='Count', aggfunc=np.sum).fillna(0.0)
# Normalize each name's column to obtain P(Sex | Name).
prob_gender_given_name = gender_pivot / gender_pivot.sum()
In [36]:
# Average P(Sex | Name) over the roster names present in this window.
prob_gender_given_name[names.intersection(prob_gender_given_name.columns)].mean(axis=1)
Out[36]:
Sex
F    0.376185
M    0.623815
dtype: float64

Would we expect this to be a good estimator?