Lecture 1 – Data 100, Fall 2020

by Anthony D. Joseph

adapted from Joey Gonzalez, Josh Hug, Suraj Rampure

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Plotly plotting support
import plotly.offline as py
py.init_notebook_mode()
import plotly.graph_objs as go
import plotly.figure_factory as ff
import plotly.express as px
# import cufflinks as cf
# cf.set_config_file(offline=False, world_readable=False, theme='ggplot')

Load and clean the roster

In [2]:
names = pd.read_csv("names.csv")
major_year = pd.read_csv("major_year.csv")[['Majors', 'Terms in Attendance']]
In [3]:
names.head(20)
Out[3]:
Name
0 Gene
1 Andrew
2 Michael
3 Archita
4 JAKE
5 Fred
6 Yash
7 Lauren
8 John
9 Emahn
10 Charles
11 Iris
12 Kseniya
13 Alex
14 Justin
15 Frank
16 Jaeyun
17 Andrew
18 Daisuke
19 Basil
In [4]:
names['Name'] = names['Name'].str.lower()
print("Number of Students:", len(names))
names.head(20)
Number of Students: 1164
Out[4]:
Name
0 gene
1 andrew
2 michael
3 archita
4 jake
5 fred
6 yash
7 lauren
8 john
9 emahn
10 charles
11 iris
12 kseniya
13 alex
14 justin
15 frank
16 jaeyun
17 andrew
18 daisuke
19 basil
In [5]:
names.describe()
Out[5]:
Name
count 1164
unique 788
top michael
freq 14
In [6]:
major_year["Majors"] = major_year["Majors"].str.replace("BS","").str.replace("BA","")
In [7]:
major_year['Terms in Attendance'] = major_year['Terms in Attendance'].astype(str)
In [8]:
major_year.head(20)
Out[8]:
Majors Terms in Attendance
0 Materials Science & Eng 5
1 Economics , Statistics 5
2 Computer Science 5
3 Computer Science 7
4 Data Science 5
5 Letters & Sci Undeclared UG 3
6 Letters & Sci Undeclared UG 8
7 Business Administration , Electrical Eng & Com... 3
8 Applied Mathematics , Computer Science 5
9 Letters & Sci Undeclared UG 5
10 Cognitive Science 3
11 Data Science 5
12 Letters & Sci Undeclared UG 5
13 Letters & Sci Undeclared UG 5
14 Letters & Sci Undeclared UG 3
15 Applied Mathematics 6
16 Cognitive Science 7
17 Computer Science 8
18 Letters & Sci Undeclared UG 5
19 Bioengineering 5
In [9]:
major_year.describe()
Out[9]:
Majors Terms in Attendance
count 1164 1164
unique 151 8
top Letters & Sci Undeclared UG 5
freq 287 505

We now know the general structure of our datasets. Let's now ask some questions.

What is the distribution of the lengths of students' names in this class?

In [10]:
sns.distplot(names['Name'].str.len(), rug=True, axlabel="Number of Characters");

What are the majors of students in the class?

In [11]:
(
    major_year["Majors"]
        .str.lower()
        .value_counts().sort_values(ascending=False)
        .head(20).plot(kind='barh', title = "Major")
);
In [12]:
px.bar(major_year['Majors'].value_counts().to_frame().reset_index().head(20), 
       x = 'Majors',
       y = 'index',
       orientation = 'h')
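The counts above treat each combination of majors (for example "Economics , Statistics") as its own category. A minimal sketch of counting individual majors instead, assuming the " , " separator visible in the table above; the toy Series is illustrative and is not the real roster:

```python
import pandas as pd

# Toy stand-in for major_year["Majors"]; the values are illustrative only.
majors = pd.Series([
    "Computer Science",
    "Economics , Statistics",
    "Applied Mathematics , Computer Science",
])

# Split on the comma, give each major its own row, and trim the padding.
individual_majors = majors.str.split(",").explode().str.strip()
major_counts = individual_majors.value_counts()
print(major_counts)
```

With this approach a double major contributes one count to each of their majors, so the counts sum to more than the number of students.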

What is the gender of the class?

How can we answer this question?

In [13]:
print(major_year.columns)
print(names.columns)
Index(['Majors', 'Terms in Attendance'], dtype='object')
Index(['Name'], dtype='object')

Ideas:

  1. What do we mean by gender?
  2. Can we use the name to estimate gender?
  3. How would we build a model of gender given the name?
  4. Where can we get data for such a model?

US Social Security Data

Public dataset containing baby names and their sex.

Understanding the Setting

In Data 100 you will have to learn about different data sources on your own.

Reading from the Social Security Administration's description:

All names are from Social Security card applications for births that occurred in the United States after 1879. Note that many people born before 1937 never applied for a Social Security card, so their names are not included in our data. For others who did apply, our records may not show the place of birth, and again their names are not included in our data.

All data are from a 100% sample of our records on Social Security card applications as of March 2017.

Get data programmatically

In [14]:
import urllib.request
import os.path

# Download data from the web directly
data_url = "https://www.ssa.gov/oact/babynames/names.zip"
local_filename = "babynames.zip"
if not os.path.exists(local_filename): # if the data exists don't download again
    with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
        f.write(resp.read())

        
# Load data without unzipping the file
import zipfile
babynames = [] 
with zipfile.ZipFile(local_filename, "r") as zf:
    data_files = [f for f in zf.filelist if f.filename[-3:] == "txt"]
    def extract_year_from_filename(fn):
        return int(fn[3:7])
    for f in data_files:
        year = extract_year_from_filename(f.filename)
        with zf.open(f) as fp:
            df = pd.read_csv(fp, names=["Name", "Sex", "Count"])
            df["Year"] = year
            babynames.append(df)
babynames = pd.concat(babynames)


babynames.head()
Out[14]:
Name Sex Count Year
0 Mary F 7065 1880
1 Anna F 2604 1880
2 Emma F 2003 1880
3 Elizabeth F 1939 1880
4 Minnie F 1746 1880
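The loading pattern above can be tried without downloading anything. The sketch below builds a small in-memory archive that mimics the assumed layout (one "yobYYYY.txt" file per year plus a non-data file) and reads it the same way. The file contents and counts are made up, and `ignore_index=True` is a choice made here, not what the cell above does:

```python
import io
import zipfile
import pandas as pd

# Build a tiny in-memory archive with the assumed "yobYYYY.txt" layout.
# The names and counts are invented for illustration.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("yob1880.txt", "Mary,F,10\nJohn,M,12\n")
    zf.writestr("yob1881.txt", "Mary,F,11\nJohn,M,9\n")
    zf.writestr("readme.pdf", "not a data file")

frames = []
with zipfile.ZipFile(buffer, "r") as zf:
    for info in zf.infolist():
        if not info.filename.endswith(".txt"):
            continue  # skip anything that is not a yearly data file
        year = int(info.filename[3:7])  # "yob1880.txt" -> 1880
        with zf.open(info) as fp:
            df = pd.read_csv(fp, names=["Name", "Sex", "Count"])
        df["Year"] = year
        frames.append(df)

# ignore_index=True gives one continuous row index across all years.
toy_babynames = pd.concat(frames, ignore_index=True)
print(toy_babynames)
```

Without `ignore_index=True`, each year keeps its own 0-based row labels, which is why the index values in the real table restart for every year.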

A little bit of data cleaning:

In [15]:
babynames['Name'] = babynames['Name'].str.lower()
babynames.tail()
Out[15]:
Name Sex Count Year
32028 zylas M 5 2018
32029 zyran M 5 2018
32030 zyrie M 5 2018
32031 zyron M 5 2018
32032 zzyzx M 5 2018

Exploratory Data Analysis

How many people does this data represent?

In [16]:
format(babynames['Count'].sum(), ',d')
Out[16]:
'351,653,025'
In [17]:
len(babynames)
Out[17]:
1957046

Is this number low or high?

It seems low. However, the Social Security website states:

All names are from Social Security card applications for births that occurred in the United States after 1879. **Note that many people born before 1937 never applied for a Social Security card, so their names are not included in our data.** For others who did apply, our records may not show the place of birth, and again their names are not included in our data. All data are from a 100% sample of our records on Social Security card applications as of the end of February 2016.

Let's query to find rows that match desired conditions.

In [18]:
babynames[(babynames['Name'] == 'vela') & (babynames['Sex'] == 'F')].tail(5)
Out[18]:
Name Sex Count Year
7601 vela F 16 2014
6009 vela F 22 2015
5592 vela F 24 2016
7401 vela F 16 2017
6194 vela F 20 2018
In [19]:
babynames[(babynames['Name'] == 'anthony') & (babynames['Year'] == 2000)]
Out[19]:
Name Sex Count Year
2781 anthony F 52 2000
17671 anthony M 19651 2000
In [20]:
babynames.query('Name.str.contains("data")', engine='python')
Out[20]:
Name Sex Count Year
9760 kidata F 5 1975
24915 datavion M 5 1995
23609 datavious M 7 1997
12100 datavia F 7 2000
27502 datavion M 6 2001
28908 datari M 5 2001
29135 datavian M 5 2002
29136 datavious M 5 2002
30570 datavion M 5 2004
17135 datavia F 5 2005
31023 datavion M 5 2005
31019 datavion M 6 2006
33337 datavious M 5 2007
33338 datavius M 5 2007
33397 datavious M 5 2008
33077 datavion M 5 2009
32490 datavious M 5 2010
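The filters above use two equivalent styles, boolean masks and `query` strings. A small sketch on a made-up table shows that they select the same rows; the data here is hypothetical and only borrows a few names from the outputs above:

```python
import pandas as pd

# Hypothetical rows in the same shape as babynames.
toy = pd.DataFrame({
    "Name": ["vela", "vela", "anthony", "datavion"],
    "Sex": ["F", "M", "M", "M"],
    "Count": [20, 5, 100, 6],
    "Year": [2018, 2018, 2000, 2001],
})

# Boolean mask: combine conditions with & and wrap each one in parentheses.
by_mask = toy[(toy["Name"] == "vela") & (toy["Sex"] == "F")]

# The same filter written as a query string.
by_query = toy.query('Name == "vela" and Sex == "F"')

# Substring match with a plain boolean mask, as an alternative to
# calling .str methods inside query with engine='python'.
contains_data = toy[toy["Name"].str.contains("data")]
```

The parentheses in the mask version matter because `&` binds more tightly than `==` in Python.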

Proportion of Male and Female Individuals Over Time

In this example we construct a pivot table which aggregates the number of babies registered for each year by Sex.

In [21]:
pivot_year_name_count = pd.pivot_table(babynames, 
        index=['Year'], # the row index
        columns=['Sex'], # the column values
        values='Count', # the field(s) to be processed in each group
        aggfunc='sum',
    )

pivot_year_name_count.head()
Out[21]:
Sex F M
Year
1880 90994 110490
1881 91953 100743
1882 107847 113686
1883 112319 104625
1884 129019 114442
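To see what the pivot table is doing, it may help to compare it with an explicit group-and-sum on a tiny made-up table. This is a sketch with hypothetical counts; it assumes every Year and Sex combination is present, so no missing values appear:

```python
import pandas as pd

# Hypothetical rows in the same shape as babynames.
toy = pd.DataFrame({
    "Name": ["mary", "john", "mary", "john", "anna"],
    "Sex": ["F", "M", "F", "M", "F"],
    "Count": [10, 12, 11, 9, 4],
    "Year": [1880, 1880, 1881, 1881, 1881],
})

by_pivot = pd.pivot_table(toy, index="Year", columns="Sex",
                          values="Count", aggfunc="sum")

# Group on both keys, sum, then move Sex from the row index to the columns.
by_groupby = toy.groupby(["Year", "Sex"])["Count"].sum().unstack("Sex")

print(by_pivot)
```

Both tables have one row per year and one column per sex, and each cell holds the total count for that pair.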
In [22]:
pivot_year_name_count.plot(title='Names Registered that Year');
In [23]:
fig = go.Figure()
fig.add_trace(go.Scatter(x = pivot_year_name_count.index, y = pivot_year_name_count['F'], name = 'F', line=dict(color='gold')))
fig.add_trace(go.Scatter(x = pivot_year_name_count.index, y = pivot_year_name_count['M'], name = 'M', line=dict(color='blue')))
fig.update_layout(xaxis_title = 'Year', yaxis_title = 'Names Registered')

How many unique names for each year?

In [24]:
pivot_year_name_nunique = pd.pivot_table(babynames, 
        index=['Year'], 
        columns=['Sex'], 
        values='Name', 
        aggfunc=lambda x: len(np.unique(x)),
    )

pivot_year_name_nunique.plot(
   title='Number of Unique Names');
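The lambda above counts distinct names in each group. A sketch of the same count using pandas' built-in `"nunique"` aggregation, on made-up rows (the repeated "john" row is deliberate, to show that duplicates are counted once):

```python
import numpy as np
import pandas as pd

# Hypothetical rows; "john" appears twice for 1881 on purpose.
toy = pd.DataFrame({
    "Name": ["mary", "anna", "john", "mary", "john", "john"],
    "Sex": ["F", "F", "M", "F", "M", "M"],
    "Year": [1880, 1880, 1880, 1881, 1881, 1881],
})

with_lambda = pd.pivot_table(toy, index="Year", columns="Sex", values="Name",
                             aggfunc=lambda x: len(np.unique(x)))
with_nunique = pd.pivot_table(toy, index="Year", columns="Sex", values="Name",
                              aggfunc="nunique")
```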

Some observations:

  1. Registration data seems limited in the early 1900s because many people born before 1937 never applied for a Social Security card.
  2. You can see the baby boomers and the echo boom.
  3. Female names show greater diversity than male names.

Computing the Proportion of Female Babies For Each Name

In [25]:
sex_counts = pd.pivot_table(babynames, index='Name', columns='Sex', values='Count',
                            aggfunc='sum', fill_value=0., margins=True)
sex_counts.head()
Out[25]:
Sex F M All
Name
aaban 0 114 114
aabha 35 0 35
aabid 0 16 16
aabidah 5 0 5
aabir 0 10 10

Compute proportion of female babies given each name.

In [26]:
prop_female = sex_counts['F'] / sex_counts['All'] 
prop_female.head(10)
Out[26]:
Name
aaban        0.0
aabha        1.0
aabid        0.0
aabidah      1.0
aabir        0.0
aabriella    1.0
aada         1.0
aadam        0.0
aadan        0.0
aadarsh      0.0
dtype: float64
In [27]:
prop_female.tail(10)
Out[27]:
Name
zytavion     0.000000
zytavious    0.000000
zyus         0.000000
zyva         1.000000
zyvion       0.000000
zyvon        0.000000
zyyanna      1.000000
zyyon        0.000000
zzyzx        0.000000
All          0.495031
dtype: float64
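Note that the last entry, "All", is not a name: it is the margin row that `margins=True` adds, so its value is the overall proportion of female registrations. A minimal sketch on made-up counts of how that row arises and how it could be separated from the per-name values (the variable names here are hypothetical):

```python
import pandas as pd

# Hypothetical counts in the same shape as babynames.
toy = pd.DataFrame({
    "Name": ["avery", "avery", "sarah", "anthony"],
    "Sex": ["F", "M", "F", "M"],
    "Count": [7, 3, 9, 20],
})

counts = pd.pivot_table(toy, index="Name", columns="Sex", values="Count",
                        aggfunc="sum", fill_value=0, margins=True)
prop = counts["F"] / counts["All"]

# The row labelled "All" is the overall proportion, not a name.
overall = prop["All"]
per_name = prop.drop("All")
```

Because the student names are lowercased before lookup, the capitalized "All" label does not collide with any of them, but it does add one to the length of the Series.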

Testing a few names

In [28]:
prop_female['audi']
Out[28]:
0.6
In [29]:
prop_female['anthony']
Out[29]:
0.004882192670149836
In [30]:
prop_female['joey']
Out[30]:
0.11189110521359473
In [31]:
prop_female['avery']
Out[31]:
0.6934594472508525
In [32]:
prop_female["sarah"]
Out[32]:
0.9969234241567629
In [33]:
prop_female["min"]
Out[33]:
0.37598736176935227
In [34]:
prop_female["pat"]
Out[34]:
0.6001585544619619

Build Simple Classifier (Model)

We can define a function that returns the most likely sex for a name. On an exact tie (a proportion of exactly 0.5), the function returns 'M'. If the name does not appear in the Social Security dataset, it returns "Unknown".

In [35]:
def sex_from_name(name):
    lower_name = name.lower()
    if lower_name in prop_female.index:
        return 'F' if prop_female[lower_name] > 0.5 else 'M'
    else:
        return "Unknown"
In [36]:
sex_from_name("audi")
Out[36]:
'F'
In [37]:
sex_from_name("joey")
Out[37]:
'M'
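The three branches of the classifier (majority female, tie or majority male, and unknown) can be checked on a small made-up table. This sketch copies the function above but adds a hypothetical `prop_female` parameter so it can run on toy proportions, which are invented for illustration:

```python
import pandas as pd

# Hypothetical proportions; in the notebook these come from prop_female.
toy_prop_female = pd.Series({"avery": 0.69, "joey": 0.11, "pat": 0.5})

def sex_from_name(name, prop_female=toy_prop_female):
    lower_name = name.lower()
    if lower_name in prop_female.index:
        # Strictly greater than 0.5, so an exact tie falls through to 'M'.
        return "F" if prop_female[lower_name] > 0.5 else "M"
    return "Unknown"

results = {n: sex_from_name(n) for n in ["Avery", "joey", "Pat", "Zhiying"]}
print(results)
```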

What fraction of students in Data 100 this semester have names in the Social Security dataset?

In [38]:
student_names = pd.Index(names["Name"]).intersection(prop_female.index)
print("Fraction of names in the babynames data:" , len(student_names) / len(names))
Fraction of names in the babynames data: 0.8685567010309279
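The roster contains repeated names, and whether `Index.intersection` keeps duplicates may depend on the pandas version. A sketch of an alternative that computes the fraction per student with `isin`, which does not depend on that behavior; the roster and name list below are made up:

```python
import pandas as pd

# Hypothetical roster with a repeated name, and a hypothetical set of known names.
roster = pd.Series(["andrew", "andrew", "zhiying", "gene"], name="Name")
known_names = pd.Index(["andrew", "gene", "michael"])

# One True/False per student, so a repeated name counts once per student.
in_dataset = roster.isin(known_names)
fraction_found = in_dataset.mean()

# Distinct names that were not found.
missing = roster[~in_dataset].unique()
```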

Which names are not in the dataset?

Why might these names not appear?

In [39]:
missing_names = pd.Index(names["Name"]).difference(prop_female.index)
missing_names
Out[39]:
Index(['adithyan', 'air', 'ameek', 'amritansh', 'amrut', 'angikaar', 'anjing',
       'arhubur', 'armyben', 'atte',
       ...
       'yuqi', 'yuxiao', 'zetian', 'zhaotian', 'zhe', 'zhenyi', 'zhijian',
       'zhiping', 'zhiying', 'ziming'],
      dtype='object', name='Name', length=152)

Estimating the fraction of female and male students

In [40]:
names['Pred. Sex'] = names['Name'].apply(sex_from_name)
(names[names['Pred. Sex'] != "Unknown"]['Pred. Sex'].value_counts()/len(names[names['Pred. Sex'] != "Unknown"])).plot(kind="barh");
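The one-line expression above drops the "Unknown" predictions and divides the remaining counts by their total. A sketch of the same computation in steps, using `value_counts(normalize=True)` on made-up predictions:

```python
import pandas as pd

# Hypothetical predictions standing in for names['Pred. Sex'].
pred = pd.Series(["F", "M", "Unknown", "M", "F", "M"], name="Pred. Sex")

# Keep only the students whose name was found in the dataset.
known = pred[pred != "Unknown"]

# normalize=True returns fractions that sum to 1 instead of raw counts.
fractions = known.value_counts(normalize=True)
print(fractions)
```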

Using simulation to estimate uncertainty

Previously we treated a name that is given to females 40% of the time as a "Male" name, which discards our uncertainty. Simulation gives a better distributional estimate: in each simulated class, every name is drawn as female with probability equal to its proportion of female babies.

Restricting our attention to students in the class

In [41]:
len(prop_female)
Out[41]:
98401
In [42]:
ds100_prob_female = prop_female.loc[prop_female.index.intersection(names['Name'])]
ds100_prob_female.tail(20)
Out[42]:
Name
yash         0.000000
ye           0.421053
yifan        0.117647
yifei        1.000000
yiming       0.000000
yohaan       0.000000
yong         0.289044
yoon         0.266055
yuan         0.159091
yuchen       0.219512
yuxuan       0.372093
zach         0.000000
zachary      0.002828
zack         0.000000
zain         0.036658
zechariah    0.001364
zehra        1.000000
zi           0.508197
zijun        0.000000
zizi         1.000000
dtype: float64

Running the simulation

In [43]:
one_simulation = np.random.rand(len(ds100_prob_female)) < ds100_prob_female
one_simulation.tail(20)
Out[43]:
Name
yash         False
ye            True
yifan        False
yifei         True
yiming       False
yohaan       False
yong          True
yoon         False
yuan         False
yuchen        True
yuxuan        True
zach         False
zachary      False
zack         False
zain         False
zechariah    False
zehra         True
zi            True
zijun        False
zizi          True
dtype: bool
In [44]:
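# Run one simulated class: draw each name as female with its estimated
# probability and return the fraction of draws that came out female.
def simulate_class(prob_female):
    is_female = np.random.rand(len(prob_female)) < prob_female
    return np.mean(is_female)

# Repeat the simulation many times to get a distribution of fractions.
fraction_female_simulations = np.array([simulate_class(ds100_prob_female) for n in range(10000)])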
In [45]:
# plt.hist(fraction_female_simulations, bins=np.arange(0.4, 0.46, 0.0025), ec='w');
# pd.Series(fraction_female_simulations).iplot(kind="hist", bins=30)
ff.create_distplot([fraction_female_simulations], ['Fraction Female'], bin_size=0.0025, show_rug=False)
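The simulated distribution can also be summarized numerically. The sketch below is one possible approach, not the notebook's method: it draws all simulations at once with a seeded NumPy generator and reports a mean and a 95% percentile interval. The probabilities are hypothetical stand-ins for ds100_prob_female, and the seed and interval width are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical per-name probabilities standing in for ds100_prob_female.
prob_female = np.array([0.0, 0.42, 0.12, 1.0, 0.29, 0.51])

n_sims = 10_000
# One row per simulated class, one column per name.
draws = rng.random((n_sims, len(prob_female))) < prob_female
fraction_female = draws.mean(axis=1)

estimate = fraction_female.mean()
low, high = np.percentile(fraction_female, [2.5, 97.5])
print(f"mean {estimate:.3f}, 95% interval [{low:.3f}, {high:.3f}]")
```

As in the notebook, each entry here is one name rather than one student, so a name shared by several students is drawn only once per simulation.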