Lecture 1 – Data 100, Fall 2020

by Anthony D. Joseph

adapted from Joey Gonzalez, Josh Hug, Suraj Rampure

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Plotly plotting support
import plotly.offline as py
py.init_notebook_mode()
import plotly.graph_objs as go
import plotly.figure_factory as ff
import plotly.express as px
# import cufflinks as cf
# cf.set_config_file(offline=False, world_readable=False, theme='ggplot')

Load and clean the roster

names = pd.read_csv("names.csv")
major_year = pd.read_csv("major_year.csv")[['Majors', 'Terms in Attendance']]

names.head(20)

names['Name'] = names['Name'].str.lower()
print("Number of Students:", len(names))
names.head(20)

Number of Students: 1164

names.describe()

major_year["Majors"] = major_year["Majors"].str.replace("BS","").str.replace("BA","")

major_year['Terms in Attendance'] = major_year['Terms in Attendance'].astype(str)

major_year.head(20)

major_year.describe()
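The cleaning steps above can be tried on a small made-up frame. This is only a sketch: the major strings below are invented, and the extra `str.strip()` call is an addition (not in the lecture code) that removes the space left behind once the degree suffix is gone.

```python
import pandas as pd

# Hypothetical values; the real roster is not reproduced here.
toy = pd.DataFrame({
    "Majors": ["Computer Science BS", "Economics BA", "Data Science BA"],
    "Terms in Attendance": [5, 3, 7],
})

# Same chained replacement as in the lecture code, plus strip() to drop the
# trailing space (strip() is an assumption added for this sketch).
toy["Majors"] = (
    toy["Majors"]
    .str.replace("BS", "", regex=False)
    .str.replace("BA", "", regex=False)
    .str.strip()
)

# Casting to str makes describe() report count/unique/top/freq
# instead of numeric summaries such as the mean.
toy["Terms in Attendance"] = toy["Terms in Attendance"].astype(str)

print(toy["Majors"].tolist())
print(toy.describe())
```

Note that a plain substring replacement removes "BS" or "BA" wherever it occurs, so a major whose name happens to contain those letters would also be altered.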

Now that we know the general structure of our datasets, let's ask some questions.

What is the distribution of the lengths of students' names in this class?

sns.distplot(names['Name'].str.len(), rug=True, axlabel="Number of Characters");
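The quantity being plotted is just the number of characters in each name. A minimal sketch of that computation without any plotting, using a handful of names for illustration:

```python
import pandas as pd

# A few illustrative names; not the full roster.
toy_names = pd.Series(["gene", "andrew", "michael", "jake", "iris"], name="Name")

lengths = toy_names.str.len()                        # characters per name
length_counts = lengths.value_counts().sort_index()  # how many names have each length

print(length_counts)
print("Mean length:", lengths.mean())
```

The histogram in the lecture is a picture of a table like `length_counts`, with a density estimate drawn over it.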

What are the majors of students in the class?

(
    major_year["Majors"]
        .str.lower()
        .value_counts().sort_values(ascending=False)
        .head(20).plot(kind='barh', title = "Major")
);

px.bar(major_year['Majors'].value_counts().to_frame().reset_index().head(20), 
       x = 'Majors',
       y = 'index',
       orientation = 'h')
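The counts above treat each distinct string as one major, so a double major is counted as its own category. As a sketch, assuming the " , " separator seen in some rows marks multiple majors, the strings can be split so that each major is counted on its own. The values below are made up.

```python
import pandas as pd

# Hypothetical entries in the same style as the Majors column.
majors = pd.Series([
    "Computer Science",
    "Economics , Statistics",
    "Applied Mathematics , Computer Science",
    "Data Science",
])

individual = (
    majors.str.lower()
    .str.split(",")   # one list of majors per student
    .explode()        # one row per (student, major) pair
    .str.strip()      # remove the spaces around the comma
)
counts = individual.value_counts()
print(counts)
```

With this approach a student with two majors contributes to two bars, so the counts no longer sum to the number of students.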

What is the gender of the class?

How can we answer this question?

print(major_year.columns)
print(names.columns)

Index(['Majors', 'Terms in Attendance'], dtype='object')
Index(['Name'], dtype='object')

Ideas:

What do we mean by gender?
Can we use the name to estimate gender?
How would we build a model of gender given the name?
Where can we get data for such a model?

Public dataset containing baby names and their sex.

Understanding the Setting

In Data 100 you will have to learn about different data sources on your own.

Reading from the Social Security Administration's description of the data:

All names are from Social Security card applications for births that occurred in the United States after 1879. Note that many people born before 1937 never applied for a Social Security card, so their names are not included in our data. For others who did apply, our records may not show the place of birth, and again their names are not included in our data.

All data are from a 100% sample of our records on Social Security card applications as of March 2017.

Get data programmatically

import urllib.request
import os.path

# Download data from the web directly
data_url = "https://www.ssa.gov/oact/babynames/names.zip"
local_filename = "babynames.zip"
if not os.path.exists(local_filename): # skip the download if the file already exists
    with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
        f.write(resp.read())

        
# Load data without unzipping the file
import zipfile
babynames = [] 
with zipfile.ZipFile(local_filename, "r") as zf:
    data_files = [f for f in zf.filelist if f.filename[-3:] == "txt"]
    def extract_year_from_filename(fn):
        return int(fn[3:7])
    for f in data_files:
        year = extract_year_from_filename(f.filename)
        with zf.open(f) as fp:
            df = pd.read_csv(fp, names=["Name", "Sex", "Count"])
            df["Year"] = year
            babynames.append(df)
babynames = pd.concat(babynames)


babynames.head()
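To see how the loading loop works without downloading anything, the sketch below builds a tiny archive in memory with made-up counts and reads it back the same way. The file layout (`yobYYYY.txt` plus a non-data file) mimics what the loop above expects; the helper `year_from_filename` and the regular expression are assumptions introduced here as a stricter alternative to slicing `fn[3:7]`.

```python
import io
import re
import zipfile

import pandas as pd

# Build a small in-memory zip with invented data in the yobYYYY.txt layout.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("yob1880.txt", "Mary,F,10\nJohn,M,12\n")
    zf.writestr("yob1881.txt", "Mary,F,11\nJohn,M,9\n")
    zf.writestr("readme.pdf", "not a data file")
buffer.seek(0)

def year_from_filename(filename):
    """Pull the four-digit year out of names such as 'yob1880.txt'."""
    match = re.fullmatch(r"yob(\d{4})\.txt", filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename}")
    return int(match.group(1))

frames = []
with zipfile.ZipFile(buffer, "r") as zf:
    for info in zf.infolist():
        if not info.filename.endswith(".txt"):
            continue  # skip anything that is not a yearly data file
        with zf.open(info) as fp:
            df = pd.read_csv(fp, names=["Name", "Sex", "Count"])
        df["Year"] = year_from_filename(info.filename)
        frames.append(df)

# ignore_index=True gives one continuous index; the lecture code keeps
# each file's own row numbers instead.
toy_babynames = pd.concat(frames, ignore_index=True)
print(toy_babynames)
```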

A little bit of data cleaning:

babynames['Name'] = babynames['Name'].str.lower()
babynames.tail()

Exploratory Data Analysis

How many people does this data represent?

format(babynames['Count'].sum(), ',d')

'351,653,025'

len(babynames)

1957046

Is this number low or high?

It seems low. However, the Social Security website states:

All names are from Social Security card applications for births that occurred in the United States after 1879. **Note that many people born before 1937 never applied for a Social Security card, so their names are not included in our data.** For others who did apply, our records may not show the place of birth, and again their names are not included in our data. All data are from a 100% sample of our records on Social Security card applications as of the end of February 2016.

Let's query to find rows that match desired conditions.

babynames[(babynames['Name'] == 'vela') & (babynames['Sex'] == 'F')].tail(5)

babynames[(babynames['Name'] == 'anthony') & (babynames['Year'] == 2000)]

babynames.query('Name.str.contains("data")', engine='python')
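The three filters above use two styles: boolean masks and query strings. A minimal sketch on a made-up frame shows that they select the same rows; the data values here are invented for illustration.

```python
import pandas as pd

# Hypothetical rows in the same shape as babynames.
toy = pd.DataFrame({
    "Name": ["vela", "vela", "anthony", "datavion"],
    "Sex": ["F", "M", "M", "M"],
    "Count": [16, 5, 100, 6],
    "Year": [2014, 2014, 2000, 2001],
})

# Boolean masks: each comparison yields a True/False Series,
# and & combines them row by row (the parentheses are required).
mask = (toy["Name"] == "vela") & (toy["Sex"] == "F")
by_mask = toy[mask]

# The same filter written as a query string.
by_query = toy.query('Name == "vela" and Sex == "F"', engine="python")

# Substring match, the mask equivalent of the str.contains query.
has_data = toy[toy["Name"].str.contains("data")]

print(by_mask)
print(has_data)
```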

Proportion of Male and Female Individuals Over Time

In this example we construct a pivot table which aggregates the number of babies registered for each year by Sex.

pivot_year_name_count = pd.pivot_table(babynames, 
        index=['Year'], # the row index
        columns=['Sex'], # the column values
        values='Count', # the field(s) to be processed in each group
        aggfunc='sum', # total count for each (Year, Sex) cell
    )

pivot_year_name_count.head()
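A pivot table with a sum is a group-by followed by a reshape. The sketch below, on made-up counts, builds the same table both ways so the correspondence is visible.

```python
import pandas as pd

# Invented counts, same columns as babynames.
toy = pd.DataFrame({
    "Name": ["mary", "anna", "john", "mary", "john"],
    "Sex": ["F", "F", "M", "F", "M"],
    "Count": [10, 4, 12, 11, 9],
    "Year": [1880, 1880, 1880, 1881, 1881],
})

# One row per Year, one column per Sex, each cell the sum of Count.
pivot = pd.pivot_table(toy, index="Year", columns="Sex", values="Count", aggfunc="sum")

# Equivalent: group by both keys, sum, then move Sex from the index to the columns.
grouped = toy.groupby(["Year", "Sex"])["Count"].sum().unstack("Sex")

print(pivot)
```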

pivot_year_name_count.plot(title='Names Registered that Year');

fig = go.Figure()
fig.add_trace(go.Scatter(x = pivot_year_name_count.index, y = pivot_year_name_count['F'], name = 'F', line=dict(color='gold')))
fig.add_trace(go.Scatter(x = pivot_year_name_count.index, y = pivot_year_name_count['M'], name = 'M', line=dict(color='blue')))
fig.update_layout(xaxis_title = 'Year', yaxis_title = 'Names Registered')

How many unique names for each year?

pivot_year_name_nunique = pd.pivot_table(babynames, 
        index=['Year'], 
        columns=['Sex'], 
        values='Name', 
        aggfunc=lambda x: len(np.unique(x)),
    )

pivot_year_name_nunique.plot(
   title='Number of Unique Names');
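Counting distinct values per group can also be written with the built-in "nunique" aggregation. This is a sketch on made-up rows, offered as an alternative spelling of the lambda above rather than a change to the lecture code.

```python
import pandas as pd

# Invented rows, same columns as babynames.
toy = pd.DataFrame({
    "Name": ["mary", "anna", "john", "mary", "john", "james"],
    "Sex": ["F", "F", "M", "F", "M", "M"],
    "Count": [10, 4, 12, 11, 9, 7],
    "Year": [1880, 1880, 1880, 1881, 1881, 1881],
})

# Each cell holds the number of distinct names for that (Year, Sex) pair.
unique_names = pd.pivot_table(toy, index="Year", columns="Sex",
                              values="Name", aggfunc="nunique")
print(unique_names)
```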

Some observations:

Registration data seems limited in the early 1900s because many people born before 1937 never applied for a Social Security card.
You can see the baby boomers and the echo boom.
Female babies have a greater diversity of names.

Computing the Proportion of Female Babies For Each Name

sex_counts = pd.pivot_table(babynames, index='Name', columns='Sex', values='Count',
                            aggfunc='sum', fill_value=0., margins=True)
sex_counts.head()

Compute the proportion of female babies for each name.

prop_female = sex_counts['F'] / sex_counts['All'] 
prop_female.head(10)

Name
aaban        0.0
aabha        1.0
aabid        0.0
aabidah      1.0
aabir        0.0
aabriella    1.0
aada         1.0
aadam        0.0
aadan        0.0
aadarsh      0.0
dtype: float64

prop_female.tail(10)

Name
zytavion     0.000000
zytavious    0.000000
zyus         0.000000
zyva         1.000000
zyvion       0.000000
zyvon        0.000000
zyyanna      1.000000
zyyon        0.000000
zzyzx        0.000000
All          0.495031
dtype: float64
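The role of `margins=True` and `fill_value` is easier to see on a tiny table. In this sketch the counts are invented; `margins=True` adds an "All" row and column holding totals, which is why an "All" entry shows up at the end of the proportions.

```python
import pandas as pd

# Made-up counts for three names.
toy = pd.DataFrame({
    "Name": ["avery", "avery", "sarah", "anthony", "anthony"],
    "Sex": ["F", "M", "F", "F", "M"],
    "Count": [60, 40, 50, 1, 99],
})

# fill_value=0 replaces missing (Name, Sex) combinations, e.g. no male "sarah".
sex_counts = pd.pivot_table(toy, index="Name", columns="Sex", values="Count",
                            aggfunc="sum", fill_value=0, margins=True)

# Share of each name's total count that was recorded as female.
prop_female = sex_counts["F"] / sex_counts["All"]

print(sex_counts)
print(prop_female)
```

If the overall total is not wanted in later lookups, the "All" row can be removed with `prop_female.drop("All")`.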

Testing a few names

prop_female['audi']

0.6

prop_female['anthony']

0.004882192670149836

prop_female['joey']

0.11189110521359473

prop_female['avery']

0.6934594472508525

prop_female["sarah"]

0.9969234241567629

prop_female["min"]

0.37598736176935227

prop_female["pat"]

0.6001585544619619

Build Simple Classifier (Model)

We can define a function that returns the most likely sex for a name. If there is an exact tie (a proportion of exactly 0.5), the function returns 'M'. If the name does not appear in the Social Security dataset, it returns "Unknown".

def sex_from_name(name):
    lower_name = name.lower()
    if lower_name in prop_female.index:
        return 'F' if prop_female[lower_name] > 0.5 else 'M'
    else:
        return "Unknown"

sex_from_name("audi")

'F'

sex_from_name("joey")

'M'
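The three branches of the classifier (above 0.5, at or below 0.5, and missing) can be exercised with a stand-in lookup table. The proportions below are made up, the key "tie" is a hypothetical name chosen to hit the boundary case, and passing the table as a default argument is a convenience for this sketch.

```python
import pandas as pd

# Made-up proportions for illustration only.
toy_prop_female = pd.Series({"sarah": 0.99, "joey": 0.11, "tie": 0.5})

def sex_from_name(name, prop_female=toy_prop_female):
    """Return 'F', 'M', or 'Unknown' using a 0.5 threshold on the proportion."""
    lower_name = name.lower()
    if lower_name in prop_female.index:
        # Strict inequality: a proportion of exactly 0.5 falls through to 'M'.
        return "F" if prop_female[lower_name] > 0.5 else "M"
    return "Unknown"

results = {n: sex_from_name(n) for n in ["Sarah", "joey", "Tie", "zhenyi"]}
print(results)
```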

What fraction of students in Data 100 this semester have names in the SSN dataset?

student_names = pd.Index(names["Name"]).intersection(prop_female.index)
print("Fraction of names in the babynames data:" , len(student_names) / len(names))

Fraction of names in the babynames data: 0.8685567010309279
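When a roster contains repeated first names, "fraction of names" and "fraction of students" can differ, and depending on the pandas version a set-style intersection may collapse the repeats. As a sketch on a made-up roster, a per-student membership test with `isin` makes the distinction explicit.

```python
import pandas as pd

# Hypothetical roster with one repeated name, and a stand-in for the known names.
roster = pd.Series(["andrew", "andrew", "iris", "zhenyi"], name="Name")
known_names = pd.Index(["andrew", "iris", "sarah"])

# One True/False per student: is this student's name in the reference data?
in_dataset = roster.isin(known_names)
fraction_of_students = in_dataset.mean()

# The same question asked about distinct names rather than students.
fraction_of_distinct = roster.drop_duplicates().isin(known_names).mean()

print("Per student:", fraction_of_students)
print("Per distinct name:", fraction_of_distinct)
```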

Which names are not in the dataset?

Why might these names not appear?

missing_names = pd.Index(names["Name"]).difference(prop_female.index)
missing_names

Index(['adithyan', 'air', 'ameek', 'amritansh', 'amrut', 'angikaar', 'anjing',
       'arhubur', 'armyben', 'atte',
       ...
       'yuqi', 'yuxiao', 'zetian', 'zhaotian', 'zhe', 'zhenyi', 'zhijian',
       'zhiping', 'zhiying', 'ziming'],
      dtype='object', name='Name', length=152)

Estimating the fraction of female and male students

names['Pred. Sex'] = names['Name'].apply(sex_from_name)
known = names[names['Pred. Sex'] != "Unknown"]
(known['Pred. Sex'].value_counts() / len(known)).plot(kind="barh");
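The fractions being plotted can be computed directly with `value_counts(normalize=True)`. A minimal sketch on made-up predictions:

```python
import pandas as pd

# Hypothetical predictions for six students.
pred = pd.Series(["F", "M", "M", "Unknown", "F", "M"], name="Pred. Sex")

# Drop the students whose names were not found, then convert counts to shares.
known = pred[pred != "Unknown"]
fractions = known.value_counts(normalize=True)

print(fractions)
```

The shares are taken over the students with a known prediction only, so they sum to 1 even though one student was excluded.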

Using simulation to estimate uncertainty

Previously we treated a name which is given to females 40% of the time as a "Male" name. This doesn't capture our uncertainty. We can use simulation to provide a better distributional estimate.
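A small numeric sketch shows how much a hard 0.5 cutoff can distort the estimate. The probabilities below are invented: four names that are female 40% of the time and one that is female 90% of the time.

```python
import numpy as np

# Made-up per-name probabilities of being female.
probs = np.array([0.4, 0.4, 0.4, 0.4, 0.9])

# Hard classification: every 0.4 name is counted as male.
hard_fraction = np.mean(probs > 0.5)

# Expected fraction female if each name is treated as a weighted coin flip.
expected_fraction = probs.mean()

print("Hard-threshold estimate:", hard_fraction)
print("Expected fraction:", expected_fraction)
```

Here the cutoff reports one female name in five, while the probabilities imply that half the group is female on average.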

Restricting our attention to students in the class

len(prop_female)

98401

ds100_prob_female = prop_female.loc[prop_female.index.intersection(names['Name'])]
ds100_prob_female.tail(20)

Name
yash         0.000000
ye           0.421053
yifan        0.117647
yifei        1.000000
yiming       0.000000
yohaan       0.000000
yong         0.289044
yoon         0.266055
yuan         0.159091
yuchen       0.219512
yuxuan       0.372093
zach         0.000000
zachary      0.002828
zack         0.000000
zain         0.036658
zechariah    0.001364
zehra        1.000000
zi           0.508197
zijun        0.000000
zizi         1.000000
dtype: float64

Running the simulation

one_simulation = np.random.rand(len(ds100_prob_female)) < ds100_prob_female
one_simulation.tail(20)

Name
yash         False
ye            True
yifan        False
yifei         True
yiming       False
yohaan       False
yong          True
yoon         False
yuan         False
yuchen        True
yuxuan        True
zach         False
zachary      False
zack         False
zain         False
zechariah    False
zehra         True
zi            True
zijun        False
zizi          True
dtype: bool

# Run one simulation of the class: draw each name as female with its estimated
# probability, then return the fraction of draws that came up female.
def simulate_class(prob_female):
    is_female = np.random.rand(len(prob_female)) < prob_female
    return np.mean(is_female)

fraction_female_simulations = np.array([simulate_class(ds100_prob_female) for n in range(10000)])

# plt.hist(fraction_female_simulations, bins=np.arange(0.4, 0.46, 0.0025), ec='w');
# pd.Series(fraction_female_simulations).iplot(kind="hist", bins=30)
ff.create_distplot([fraction_female_simulations], ['Fraction Female'], bin_size=0.0025, show_rug=False)
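The simulation can be summarized numerically as well as plotted. The sketch below uses made-up probabilities and a seeded generator (`np.random.default_rng`, chosen here for repeatability; the lecture code uses `np.random.rand`). It compares the simulated mean with the expected value and reports a central 95% interval.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded so the sketch is repeatable

# Made-up per-name probabilities of being female.
prob_female = np.array([0.0, 0.42, 0.12, 1.0, 0.29, 0.51, 0.99, 0.004])

def simulate_fraction_female(probs, rng):
    """Draw one Bernoulli outcome per entry and return the fraction drawn as female."""
    # A uniform draw in [0, 1) falls below p with probability p.
    return np.mean(rng.random(len(probs)) < probs)

simulations = np.array(
    [simulate_fraction_female(prob_female, rng) for _ in range(10_000)]
)

print("Simulated mean:", simulations.mean())
print("Expected value:", prob_female.mean())
print("Central 95% interval:", np.percentile(simulations, [2.5, 97.5]))
```

The simulated mean should land close to the plain average of the probabilities; what the simulation adds is the spread around that average.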

Output of names.head(20) before lowercasing:

	Name
0	Gene
1	Andrew
2	Michael
3	Archita
4	JAKE
5	Fred
6	Yash
7	Lauren
8	John
9	Emahn
10	Charles
11	Iris
12	Kseniya
13	Alex
14	Justin
15	Frank
16	Jaeyun
17	Andrew
18	Daisuke
19	Basil

Output of names.head(20) after lowercasing:

	Name
0	gene
1	andrew
2	michael
3	archita
4	jake
5	fred
6	yash
7	lauren
8	john
9	emahn
10	charles
11	iris
12	kseniya
13	alex
14	justin
15	frank
16	jaeyun
17	andrew
18	daisuke
19	basil

Output of major_year.head(20):

	Majors	Terms in Attendance
0	Materials Science & Eng	5
1	Economics , Statistics	5
2	Computer Science	5
3	Computer Science	7
4	Data Science	5
5	Letters & Sci Undeclared UG	3
6	Letters & Sci Undeclared UG	8
7	Business Administration , Electrical Eng & Com...	3
8	Applied Mathematics , Computer Science	5
9	Letters & Sci Undeclared UG	5
10	Cognitive Science	3
11	Data Science	5
12	Letters & Sci Undeclared UG	5
13	Letters & Sci Undeclared UG	5
14	Letters & Sci Undeclared UG	3
15	Applied Mathematics	6
16	Cognitive Science	7
17	Computer Science	8
18	Letters & Sci Undeclared UG	5
19	Bioengineering	5

Output of babynames.head():

	Name	Sex	Count	Year
0	Mary	F	7065	1880
1	Anna	F	2604	1880
2	Emma	F	2003	1880
3	Elizabeth	F	1939	1880
4	Minnie	F	1746	1880

Output of babynames.tail() after lowercasing:

	Name	Sex	Count	Year
32028	zylas	M	5	2018
32029	zyran	M	5	2018
32030	zyrie	M	5	2018
32031	zyron	M	5	2018
32032	zzyzx	M	5	2018

Output of major_year.describe():

	Majors	Terms in Attendance
count	1164	1164
unique	151	8
top	Letters & Sci Undeclared UG	5
freq	287	505

Output of the query for female rows named "vela" (last five):

	Name	Sex	Count	Year
7601	vela	F	16	2014
6009	vela	F	22	2015
5592	vela	F	24	2016
7401	vela	F	16	2017
6194	vela	F	20	2018

Output of the query for names containing "data":

	Name	Sex	Count	Year
9760	kidata	F	5	1975
24915	datavion	M	5	1995
23609	datavious	M	7	1997
12100	datavia	F	7	2000
27502	datavion	M	6	2001
28908	datari	M	5	2001
29135	datavian	M	5	2002
29136	datavious	M	5	2002
30570	datavion	M	5	2004
17135	datavia	F	5	2005
31023	datavion	M	5	2005
31019	datavion	M	6	2006
33337	datavious	M	5	2007
33338	datavius	M	5	2007
33397	datavious	M	5	2008
33077	datavion	M	5	2009
32490	datavious	M	5	2010

Output of pivot_year_name_count.head():

Sex	F	M
Year
1880	90994	110490
1881	91953	100743
1882	107847	113686
1883	112319	104625
1884	129019	114442