Lecture 1 – Data 100, Summer 2023¶

by Lisa Yan

  • adapted from Joseph E. Gonzalez, Anthony D. Joseph, Josh Hug, Suraj Rampure.
  • minor updates by Narges Norouzi, Fernando Pérez, and Dominic Liu.

Software Packages¶

We will be using a wide range of Python software packages. To install and manage these packages, we will use the Conda environment manager. The following is a list of packages we will routinely use in lectures and homeworks:

In [2]:
# linear algebra, probability
import numpy as np

# data manipulation
import pandas as pd

# visualization
import matplotlib.pyplot as plt
import seaborn as sns

## interactive visualization library
import plotly.offline as py
py.init_notebook_mode()
import plotly.graph_objs as go
import plotly.figure_factory as ff
import plotly.express as px

We will learn how to use all of the technologies used in this demo.

For now, just sit back and think critically about the data and our guided analysis.

1. Starting with a Question: Who are you (the students of Data 100)?¶

This is a pretty vague question, but let's start with the goal of learning something about the students in the class.

Here are some "simple" questions:

  1. How many students do we have?
  2. What are your majors?
  3. What year are you?
  4. Diversity ...?

2. Data Acquisition and Cleaning¶

In Data 100 we will study various methods to collect data.

To answer this question, I downloaded the course roster and extracted everyone's names and majors.

In [3]:
# pd stands for pandas, which we will start learning tomorrow
# some pandas syntax is shared with data8's datascience package
majors = pd.read_csv("data/majors.csv")
names = pd.read_csv("data/names.csv")

3. Exploratory Data Analysis¶

In Data 100 we will study exploratory data analysis and practice analyzing new datasets.

I didn't tell you the details of the data! Let's check out the data and infer its structure. Then we can start answering the simple questions we posed.

Peeking at the Data¶

In [4]:
majors.head(20)
Out[4]:
Majors Terms in Attendance
0 Environmental Sciences BS 8
1 Letters & Sci Undeclared UG 7
2 Letters & Sci Undeclared UG 3
3 Data Science BA 8
4 Computer Science BA 5
5 Civil Engineering BS 8
6 Data Science BA 8
7 Summer Internatnl Visitor UG U
8 Data Science BA 5
9 Data Science BA 3
10 Cognitive Science BA 8
11 Applied Mathematics BA 7
12 Energy Engineering BS 7
13 Electrical Eng & Comp Sci BS 8
14 Computer Science BA 7
15 Letters & Sci Undeclared UG 8
16 Data Science BA, Sociology BA 8
17 Data Science BA 5
18 Computer Science BA, Mathematics BA 7
19 Civil Engineering BS 8
In [5]:
names.head()
Out[5]:
Name Role
0 Leah Student
1 Mariano Student
2 Jocelyn Student
3 LOGAN Student
4 Shaojie Student

What is one potential issue we may need to address in this data?¶

Answer: The capitalization of the names is inconsistent.

In the sample above, we notice that some of the names are capitalized and some are not. This will be an issue in our later analysis, so let's convert all names to lowercase.

In [6]:
names['Name'] = names['Name'].str.lower()
In [7]:
names.head()
Out[7]:
Name Role
0 leah Student
1 mariano Student
2 jocelyn Student
3 logan Student
4 shaojie Student

How many records do we have?¶

In [8]:
print(len(names))
print(len(majors))
176
176

Based on what we know of our class, each record is most likely a student.

Understanding the structure of data¶

It is important that we understand the meaning of each field and how the data is organized.

In [9]:
names.head()
Out[9]:
Name Role
0 leah Student
1 mariano Student
2 jocelyn Student
3 logan Student
4 shaojie Student

Q: What is the meaning of the Role field?

A: Understanding the meaning of a field can often be achieved by looking at the type of data it contains (in particular, the counts of its unique values).

We use the value_counts() method in pandas:

In [10]:
names['Role'].value_counts().to_frame()  # counts of unique Roles
Out[10]:
Role
Student 175
#REF! 1

It appears that one student has an erroneous role given as "#REF!". What else can we learn about this student? Let's see their name.

In [11]:
# boolean index to find rows where Role is #REF!
names[names['Role'] == "#REF!"]
Out[11]:
Name Role
22 #ref! #REF!

Though this single bad record won't have much of an impact on our analysis, we can clean our data by removing this record.

In [12]:
names = names[names['Role'] != "#REF!"]

Double check: Let's verify that we removed only the single bad record.

In [13]:
names['Role'].value_counts().to_frame()  # again, counts of unique Roles
Out[13]:
Role
Student 175

Remember we loaded in two files. Let's explore the fields of majors and check for bad records:

In [14]:
majors.columns   # get column names
Out[14]:
Index(['Majors', 'Terms in Attendance'], dtype='object')
In [15]:
majors['Terms in Attendance'].value_counts().to_frame()
Out[15]:
Terms in Attendance
5 53
7 46
8 38
3 17
U 14
6 4
G 2
#REF! 1
4 1

It looks like the numbers represent semesters, G represents graduate students, and U might represent something else, perhaps campus visitors. But we do still have a bad record:

In [16]:
majors[majors['Terms in Attendance'] == "#REF!"]
Out[16]:
Majors Terms in Attendance
169 #REF! #REF!
In [17]:
majors = majors[majors['Terms in Attendance'] != "#REF!"]
majors['Terms in Attendance'].value_counts().to_frame()
Out[17]:
Terms in Attendance
5 53
7 46
8 38
3 17
U 14
6 4
G 2
4 1

Detail: The bad majors record sat at a different row index than the bad names record. So while the number of records in each table matches, the row indices don't align, and we'll have to keep these tables separate in order to do our analysis.
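The index mismatch can be seen on a toy example. This is a sketch with synthetic mini rosters (names_demo and majors_demo are made up, not the actual course files):

```python
import pandas as pd

# Hypothetical mini versions of the two rosters (synthetic data, not the
# actual course files) to illustrate the index mismatch.
names_demo = pd.DataFrame({"Name": ["ana", "#ref!", "ben", "cai"]})
majors_demo = pd.DataFrame({"Majors": ["Data Science BA", "Economics BA",
                                       "#REF!", "Computer Science BA"]})

# Drop the bad record from each table; the bad rows sit at different indices
names_demo = names_demo[names_demo["Name"] != "#ref!"]       # drops index 1
majors_demo = majors_demo[majors_demo["Majors"] != "#REF!"]  # drops index 2

print(names_demo.index.tolist())   # [0, 2, 3]
print(majors_demo.index.tolist())  # [0, 1, 3]
```

The two tables now have the same length, but their surviving indices differ, so aligning them row-by-row on index would pair up the wrong students.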

Summarizing the Data¶

We will often want to summarize the data numerically or visually. The describe() method provides a brief, high-level description of our data frame.

In [18]:
names.describe()
Out[18]:
Name Role
count 175 175
unique 164 1
top max Student
freq 2 175
In [19]:
majors.describe()
Out[19]:
Majors Terms in Attendance
count 175 175
unique 45 8
top Letters & Sci Undeclared UG 5
freq 43 53

Q: What do you think top and freq represent?

A: top: most frequent entry, freq: the frequency of that entry
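A minimal sketch of this behavior, using a tiny synthetic Series (not our actual roster data):

```python
import pandas as pd

# A tiny synthetic Series showing what describe() reports for
# non-numeric data: count, unique, top (the modal value), freq (its count)
roles = pd.Series(["Student", "Student", "Student", "TA"])
summary = roles.describe()
print(summary)
# count = 4, unique = 2, top = "Student", freq = 3
```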


What are your majors?¶

What are the top majors?

In [20]:
majors_count = (       # method chaining in pandas
    majors['Majors']
    .value_counts()
    .sort_values(ascending=False) # highest first
    .to_frame()
    .head(20)          # get the top 20
)

# or, uncomment below to parse double majors
# majors_count = (
#     majors['Majors']
#     .str.split(", ") # get double majors
#     .explode()       # one major to every row
#     .value_counts()
#     .sort_values(ascending=True)
#     .to_frame()
#     .tail(20)
# )

majors_count
Out[20]:
Majors
Letters & Sci Undeclared UG 43
Data Science BA 34
Computer Science BA 11
Summer Internatnl Visitor UG 8
Chemical Engineering BS 6
Civil Engineering BS 6
Applied Mathematics BA 5
Electrical Eng & Comp Sci BS 5
Molecular & Cell Biology BA 4
Cognitive Science BA 4
Environ Econ & Policy BS 3
Economics BA 3
Summer Domestic Visitor UG 3
Environmental Sciences BS 2
Computer Science BA, Mathematics BA 2
Summer Internatnl Visitor UG, UCBX Concurrent International 2
Chemical Biology BS, Letters & Sci Undeclared UG 2
Sociology BA 2
Mathematics BA 2
Applied Mathematics BA, Computer Science BA 2

We will often use visualizations to make sense of data¶

In Data 100, we will deal with many different kinds of data (not just numbers) and we will study techniques for describing each type of data.

How can we summarize the Majors field? A good starting point might be to use a bar plot:

In [21]:
# interactive using plotly
fig = px.bar(majors_count, orientation='h')
fig.update_layout(showlegend=False,
                  xaxis_title='Count',
                  yaxis_title='Major')

What year are you?¶

In [22]:
fig = px.histogram(majors['Terms in Attendance'].sort_values(),
                   histnorm='probability')
fig.update_layout(showlegend=False,
                  xaxis_title="Term",
                  yaxis_title="Fraction of Class")

Diversity and Data Science:¶

Unfortunately, surveys of data scientists suggest that there are far fewer women in data science.

To learn more check out the Kaggle Executive Summary or study the Raw Data.


What fraction of the students are female?¶

We often ask this question because we want to improve the data science program here at Berkeley, especially now that it has grown into a new college, the College of Computing, Data Science, and Society (CDSS), Berkeley's first new college in 50 years.

This is actually a fairly complex question. What do we mean by female? Is this a question about the sex or gender identity of the students? They are not the same thing.

  • Sex refers predominantly to biological characteristics.
  • Gender is much more complex with societal and cultural implications and refers to how people identify themselves.

Most likely, CDSS is interested in improving gender diversity by ensuring that our program is inclusive. Let's reword this question:

Reworded: What is the gender diversity of our students?¶

How could we answer this question?¶

In [23]:
print(majors.columns)
print(names.columns)
Index(['Majors', 'Terms in Attendance'], dtype='object')
Index(['Name', 'Role'], dtype='object')

We don't have the data.¶

Where can we get the data?


(1) We could run a survey!¶

In Data 100, we will learn different ways people collect data (e.g. sampling and census) and their limitations.

We actually did, in the Pre-semester Survey. However, it's not due yet, so we don't have the complete data. Here's what we have so far:

Considering that UC Berkeley's enrolled undergraduate students are 54% women, Data 100 is not doing great :(

Two limitations of this result:

  • Not everyone filled out the form. (But is it a representative sample of our students?)
  • The question is optional, so some students chose not to answer it. (What problem will this cause?)

(2) ... or we could try to use the data we have to estimate the _sex_ of the students as a proxy for gender?!?!¶

Please do not attempt option (2) alone. What I am about to do is flawed in so many ways and we will discuss these flaws in a moment and throughout the semester.

However, it will illustrate some very basic inferential modeling and how we might combine multiple data sources to try and reason about something we haven't measured.

The idea is to use first name as a proxy for sex, as a proxy for gender.

$$ \text{Name} \rightarrow \text{Sex} \rightarrow \text{Gender} $$

What potential problems do you see with the method?

To attempt option (2), we will first look at a second data source.


US Social Security Data¶

To study what a name tells about a person we will download data from the United States Social Security office containing the number of registered names broken down by year, sex, and name. This is often called the Baby Names Data as social security numbers (SSNs) are typically given at birth.

1. What does a name tell us about a person?¶

A: In this demo we'll use a person's name to estimate their sex. But a person's name tells us many things (more on this later).

2. Acquire data programmatically¶

Note 1: In the following we download the data programmatically to ensure that the process is reproducible.

Note 2: We also load the data directly into Python without decompressing the zip file.

In Data 100 we will think a bit more about how we can be efficient in our data analysis to support processing large datasets.

In [24]:
import urllib.request
import os.path

# Download data from the web directly
data_url = "https://www.ssa.gov/oact/babynames/names.zip"
local_filename = "babynames.zip"
if not os.path.exists(local_filename): # if the data exists don't download again
    with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
        f.write(resp.read())

        
# Load data without unzipping the file
import zipfile
babynames = [] 
with zipfile.ZipFile(local_filename, "r") as zf:
    data_files = [f for f in zf.filelist if f.filename[-3:] == "txt"]
    def extract_year_from_filename(fn):
        return int(fn[3:7])
    for f in data_files:
        year = extract_year_from_filename(f.filename)
        with zf.open(f) as fp:
            df = pd.read_csv(fp, names=["Name", "Sex", "Count"])
            df["Year"] = year
            babynames.append(df)
babynames = pd.concat(babynames)


babynames.head() # show the first few rows
Out[24]:
Name Sex Count Year
0 Mary F 7065 1880
1 Anna F 2604 1880
2 Emma F 2003 1880
3 Elizabeth F 1939 1880
4 Minnie F 1746 1880

2 (cont). Understanding the Setting¶

In Data 100 you will have to learn about different data sources (and their limitations) on your own.

Reading from the Social Security Administration's description (bolded for readability):

All names are from Social Security card applications for births that occurred in the United States after 1879. Note that many people born before 1937 never applied for a Social Security card, so their names are not included in our data. For others who did apply, our records may not show the place of birth, and again their names are not included in our data.

To safeguard privacy, we exclude from our tabulated lists of names those that would indicate, or would allow the ability to determine, names with fewer than 5 occurrences in any geographic area. If a name has less than 5 occurrences for a year of birth in any state, the sum of the state counts for that year will be less than the national count.

All data are from a 100% sample of our records on Social Security card applications as of March 2022.
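As a hedged illustration of the suppression rule above, with made-up numbers for a hypothetical name (these figures are not from the real dataset):

```python
import pandas as pd

# Synthetic illustration of the suppression rule: state counts under 5 are
# excluded from the published state files, so the sum of the published
# state counts can fall below the national count.
state_counts = pd.Series({"CA": 7, "NY": 4, "TX": 3})  # true counts, hypothetical name

national_count = state_counts.sum()                            # all 14 count nationally
published_state_total = state_counts[state_counts >= 5].sum()  # NY, TX suppressed -> 7

print(national_count, published_state_total)  # 14 7
```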

A little bit of data cleaning¶

Examining the data:

In [25]:
babynames
Out[25]:
Name Sex Count Year
0 Mary F 7065 1880
1 Anna F 2604 1880
2 Emma F 2003 1880
3 Elizabeth F 1939 1880
4 Minnie F 1746 1880
... ... ... ... ...
31910 Zuberi M 5 2022
31911 Zydn M 5 2022
31912 Zylon M 5 2022
31913 Zymeer M 5 2022
31914 Zymeire M 5 2022

2085158 rows × 4 columns

In our earlier analysis we converted names to lower case. We will do the same again here:

In [26]:
babynames['Name'] = babynames['Name'].str.lower()
babynames.head()
Out[26]:
Name Sex Count Year
0 mary F 7065 1880
1 anna F 2604 1880
2 emma F 2003 1880
3 elizabeth F 1939 1880
4 minnie F 1746 1880

3. Exploratory Data Analysis (and Visualization)¶

How many people does this data represent?

In [27]:
format(babynames['Count'].sum(), ',d') # sum of 'Count' column
Out[27]:
'365,296,191'
In [28]:
format(len(babynames), ',d')       # number of rows
Out[28]:
'2,085,158'

Q: Is this number low or high?

Answer

It seems low (the 2021 US population was 331.9 million). However, the Social Security website states:

All names are from Social Security card applications for births that occurred in the United States after 1879. Note that many people born before 1937 never applied for a Social Security card, so their names are not included in our data. For others who did apply, our records may not show the place of birth, and again their names are not included in our data. All data are from a 100% sample of our records on Social Security card applications as of the end of February 2016.

Trying a simple query:

In [29]:
# how many babies named Nora were born in 2018?
babynames[(babynames['Name'] == 'nora') & (babynames['Year'] == 2018)]
Out[29]:
Name Sex Count Year
29 nora F 5836 2018
28545 nora M 7 2018

Trying a more complex query using query() (to be discussed soon!):

In [30]:
# how many baby names contain the word "data"?
babynames.query('Name.str.contains("data")', engine='python')
Out[30]:
Name Sex Count Year
9766 kidata F 5 1975
24914 datavion M 5 1995
23614 datavious M 7 1997
12103 datavia F 7 2000
27510 datavion M 6 2001
28918 datari M 5 2001
29144 datavian M 5 2002
29145 datavious M 5 2002
30576 datavion M 5 2004
17141 datavia F 5 2005
31035 datavion M 5 2005
31028 datavion M 6 2006
33345 datavious M 5 2007
33346 datavius M 5 2007
33416 datavious M 5 2008
33095 datavion M 5 2009
32510 datavious M 5 2010

Temporal Patterns Conditioned on Male/Female¶

In Data 100 we will study how to visualize and analyze relationships in data.

In this example we construct a pivot table which aggregates the number of babies registered for each year by Sex.

We'll discuss pivot tables in detail in the next few lectures.

In [31]:
# counts number of M and F babies per year
year_sex = pd.pivot_table(
        babynames, 
        index=['Year'], # the row index
        columns=['Sex'], # the column values
        values='Count', # the field(s) to be processed in each group
        aggfunc=np.sum,
    )[["M", "F"]]

year_sex.head()
Out[31]:
Sex M F
Year
1880 110490 90994
1881 100737 91953
1882 113686 107847
1883 104625 112320
1884 114442 129019

We can visualize these descriptive statistics:

In [32]:
# more interactive using plotly
fig = px.line(year_sex)
fig.update_layout(title="Total Babies per Year",
                  yaxis_title="Number of Babies")

How many unique names for each year?¶

In [33]:
# counts number of M and F *names* per year
year_sex_unique = pd.pivot_table(babynames, 
        index=['Year'], 
        columns=['Sex'], 
        values='Name', 
        aggfunc=lambda x: len(np.unique(x)),
    )
fig = px.line(year_sex_unique)
fig.update_layout(title="Unique Names Per Year",
                  yaxis_title="Number of Baby Names")

Some observations:

  1. Registration data seems limited in the early 1900s, because many people did not register before 1937.
  2. You can see the Baby Boomers (born 1940s to 1960s) and the Echo Boomers (aka millennials, born 1980s to 2000).
  3. Female names show slightly greater diversity than male names.

4. Understand the World: Prediction and Inference¶

Let's use the Baby Names dataset to estimate the fraction of female students in the class.

Compute the Proportion of Female Babies For Each Name¶

First, we construct a pivot table to compute the total number of babies registered for each Name, broken down by Sex.

In [34]:
# counts number of M and F babies per name
name_sex = pd.pivot_table(babynames, index='Name', columns='Sex', values='Count',
                            aggfunc='sum', fill_value=0., margins=True)
name_sex.sample(5)
Out[34]:
Sex F M All
Name
shaquaya 96 0 96
zenniyah 5 0 5
sarahanne 262 0 262
kinnick 5 490 495
alberico 0 6 6
In [35]:
name_sex.loc["alex",]
Out[35]:
Sex
F        9558
M      281054
All    290612
Name: alex, dtype: int64

Second, we compute the proportion of female babies for each name. This is our estimated probability that a baby with that name is female:

$$ \hat{\textbf{P}}(\text{Female} \mid \text{Name}) = \frac{\textbf{Count}(\text{Female and Name})}{\textbf{Count}(\text{Name})} $$
In [36]:
prop_female = (name_sex['F'] / name_sex['All']).rename("Prop. Female")
prop_female.to_frame().sample(10)
Out[36]:
Prop. Female
Name
yimi 0.0
donold 0.0
eilam 0.0
kathleene 1.0
xabier 0.0
taeleigh 1.0
nicolaus 0.0
sarabella 1.0
cadee 1.0
sairus 0.0

Test a few names¶

In [37]:
prop_female["bella"]
Out[37]:
0.9994112105511069
In [38]:
prop_female["dominic"]
Out[38]:
0.006654203581782221
In [39]:
prop_female["joey"]
Out[39]:
0.11911693756342644
In [40]:
prop_female["ani"]
Out[40]:
0.9983204568357407
In [41]:
prop_female["minh"]
Out[41]:
0.12516411378555797
In [42]:
prop_female["pat"]
Out[42]:
0.6001286385257426
In [43]:
prop_female["jaspreet"]
Out[43]:
0.6043956043956044

Next, Build a Simple Classifier (Model)¶

We can define a function to return the most likely sex for a name. If there is an exact tie, or the name does not appear in the Social Security dataset, the function returns Unknown.

In [44]:
def sex_from_name(name):
    lower_name = name.lower()
    if lower_name not in prop_female.index or prop_female[lower_name] == 0.5:
        return "Unknown"
    elif prop_female[lower_name] > 0.5:
        return "F"
    else:
        return "M"
In [45]:
sex_from_name("nora")
Out[45]:
'F'
In [46]:
sex_from_name("joey")
Out[46]:
'M'
In [47]:
sex_from_name("pat")
Out[47]:
'F'

4 (cont). Estimating the fraction of female and male students in Data 100¶

Let's try out our simple classifier! We'll use the apply() function to classify each student name:

In [48]:
# apply sex_from_name to each student name
names['Pred. Sex'] = names['Name'].apply(sex_from_name)
px.bar(names['Pred. Sex'].value_counts()/len(names))

Interpreting the unknowns¶

That's a lot of Unknowns.

...But we can still estimate the fraction of female students in the class:

In [49]:
count_by_sex = names['Pred. Sex'].value_counts().to_frame()
count_by_sex
Out[49]:
Pred. Sex
M 84
F 55
Unknown 36
In [50]:
count_by_sex.loc['F']/(count_by_sex.loc['M'] + count_by_sex.loc['F'])
Out[50]:
Pred. Sex    0.395683
dtype: float64

Questions:

  1. How do we feel about this estimate?
  2. Do we trust it?




Q: What fraction of students in Data 100 this semester have names in the SSN dataset?

In [51]:
print("Fraction of names in the babynames data:", 
      names['Name'].isin(prop_female.index).mean())
Fraction of names in the babynames data: 0.7942857142857143

Q: Which names are not in the dataset?

Why might these names not appear?

In [52]:
# the tilde ~ negates the boolean index. More next week.
names[~names['Name'].isin(prop_female.index)].sample(10)
Out[52]:
Name Role Pred. Sex
28 youyansu Student Unknown
161 qingyang Student Unknown
4 shaojie Student Unknown
130 qiuxuan Student Unknown
131 canhui Student Unknown
110 ayeon Student Unknown
125 haorui Student Unknown
152 jasleem Student Unknown
85 yanfu Student Unknown
172 zhuoxuan Student Unknown
In [53]:
"hangxing" in prop_female.index
Out[53]:
False

Using simulation to estimate uncertainty¶

Previously, we treated a name that is given to females 40% of the time as a "Male" name, because the probability was less than 0.5. This doesn't capture our uncertainty.

We can use simulation to provide a better distributional estimate. We'll use 50% for names not in the Baby Names dataset.

In [54]:
# add the computed SSN F proportion to each row. 0.5 for Unknowns.
# merge() effectively "join"s two tables together. to be covered next week.
names['Prop. Female'] = (
    names[['Name']].merge(prop_female, how='left', left_on='Name', 
                          right_index=True)['Prop. Female']
        .fillna(0.5)
)
names.head(10)
Out[54]:
Name Role Pred. Sex Prop. Female
0 leah Student F 0.998094
1 mariano Student M 0.000000
2 jocelyn Student F 0.996825
3 logan Student M 0.073723
4 shaojie Student Unknown 0.500000
5 janmesh Student Unknown 0.500000
6 shaun Student M 0.042693
7 cara Student F 0.998826
8 karl Student M 0.009020
9 haonan Student Unknown 0.500000

Running the simulation¶

In [55]:
# if a randomly picked number from [0.0, 1.0) is under the Female proportion, then F
names['Sim. Female'] = np.random.rand(len(names)) < names['Prop. Female']
names.tail(20)
Out[55]:
Name Role Pred. Sex Prop. Female Sim. Female
156 brett Student M 0.020783 False
157 teddy Student M 0.046772 False
158 yifei Student F 1.000000 True
159 jeffae Student Unknown 0.500000 True
160 sriya Student F 1.000000 True
161 qingyang Student Unknown 0.500000 False
162 tan Student M 0.018425 False
163 xuanning Student Unknown 0.500000 True
164 xuanjie Student Unknown 0.500000 True
165 alexa Student F 0.997828 True
166 bryan Student M 0.004424 False
167 seoin Student Unknown 0.500000 False
168 aaron Student M 0.007205 False
169 xinyi Student F 1.000000 True
170 tisha Student F 1.000000 True
171 zhen Student M 0.197368 False
172 zhuoxuan Student Unknown 0.500000 False
173 suge Student Unknown 0.500000 False
174 king Student M 0.000979 False
175 samantha Student F 0.997949 True

Given such a simulation, we can compute the fraction of the class that is female.

  1. How do we feel about this new estimate?
  2. Do we trust it?
In [56]:
# proportion of Trues in the 'Sim. Female' column
names['Sim. Female'].mean()
Out[56]:
0.4057142857142857

Now that we're performing a simulation, the above proportion is random: it depends on the random numbers we picked to determine whether a student was Female.

Let's run the above simulation several times and see what the distribution of this female proportion is. The cell below may take a few seconds to run.

In [57]:
# simulates one class: returns the simulated fraction of female students
def simulate_class(students):
    is_female = students['Prop. Female'] > np.random.rand(len(students))
    return np.mean(is_female)

sim_frac_female = np.array([simulate_class(names) for _ in range(10000)])
In [58]:
fig = ff.create_distplot([sim_frac_female], ['Fraction Female'], bin_size=0.0025, show_rug=False)
fig.update_layout(xaxis_title='Prop. Female',
                  yaxis_title='Percentage',
                  title='Distribution of Simulated Proportions of Females in the Class')
ax = sns.histplot(sim_frac_female, stat='probability', kde=True, bins=20)
sns.rugplot(sim_frac_female, ax=ax)
ax.set_xlabel("Fraction Female")
ax.set_title('Distribution of Simulated Fractions Female in the Class');

In Data 100 we will learn about kernel density estimates, rug plots, and other visualization techniques.





Limitations of Baby Names dataset¶

UC Berkeley teaches students from around the world.¶

We saw with our Simple Classifier that many student names were classified as "Unknown," often because they weren't in the SSN Baby Names Dataset.

Recall the SSN dataset:

All names are from Social Security card applications for births that occurred in the United States after 1879.

That statement is not reflective of all of our students!!

In [59]:
# students who were not in the SSN Baby Names Dataset
names[~names['Name'].isin(prop_female.index)].sample(10)
Out[59]:
Name Role Pred. Sex Prop. Female Sim. Female
29 yuwen Student Unknown 0.5 False
131 canhui Student Unknown 0.5 True
28 youyansu Student Unknown 0.5 True
144 pujitha Student Unknown 0.5 True
44 juelius Student Unknown 0.5 False
115 zcjanin Student Unknown 0.5 True
81 khalled Student Unknown 0.5 True
9 haonan Student Unknown 0.5 False
20 nathaniel-vivi Student Unknown 0.5 True
71 xiahe Student Unknown 0.5 True

Names change over time.¶

Data reaching back to 1879 (or even 1937) does not reflect the diversity and context of U.S. baby names today.

Here are some choice names that show how the sex distribution of particular names has varied over time:

In [60]:
subset_names = ["edris", "jamie", "jordan", "leslie", "taylor", "willie"]
subset_babynames_year = (pd.pivot_table(
                    babynames[babynames['Name'].isin(subset_names)],
                    index=['Name', 'Year'], columns='Sex', values='Count',
                    aggfunc='sum', fill_value=0, margins=True)
                 .drop(labels='All', level=0, axis=0) # drop cumulative row
                 .rename_axis(None, axis=1) # remove pivot table col name
                 .reset_index() # move (name, year) back into columns
                 .assign(Propf = lambda row: row.F/(row.F + row.M))
                )
ax = sns.lineplot(data=subset_babynames_year,
                  x='Year', y='Propf', hue='Name')
ax.set_title("Ratio of Female Babies over Time for Select Names")
ax.set_ylabel("Proportion of Female Names in a Year")
ax.legend(loc="lower left");

Bonus: How we selected which names to plot¶

Curious as to how we got the above names? We picked out two types of names:

  • names that had high variability in their F/M ratio across years
  • common names whose overall F/M ratio fell within a set range

Check it out:

In [61]:
"""
get a subset of names that:
    have had propf above a threshold, as well as
    have been counted for more than a certain number of years
Note: while we could do our analysis over all names,
    it turns out many names don't matter.
    So to save computation power, we just work
    with a subset of names we know may be candidates.
"""
# these are thresholds we set as data analysts
propf_min = 0.2
propf_max = 0.8
year_thresh = 30

propf_countyear = (babynames
                   .groupby('Name').count()
                   .merge(prop_female.to_frame(), on='Name')
                   .rename(columns={'Prop. Female': 'Propf'})
                   .query("@propf_min < Propf < @propf_max & Year > @year_thresh & Name != 'All'")
                  )[['Propf', 'Year']]
propf_countyear
Out[61]:
Propf Year
Name
aalijah 0.352246 40
aamari 0.380074 35
aaren 0.240409 74
aarin 0.270723 77
aarion 0.221212 59
... ... ...
zephyr 0.229419 76
ziah 0.742308 45
ziyan 0.432911 36
zohar 0.643636 34
zuriel 0.200661 63

1631 rows × 2 columns

In [62]:
# construct a pivot table of (name, year) to count
keep_names = propf_countyear.reset_index()['Name']
name_year_sex = (pd.pivot_table(
                    babynames[babynames['Name'].isin(keep_names)],
                    index=['Name', 'Year'], columns='Sex', values='Count',
                    aggfunc='sum', fill_value=0, margins=True)
                 .drop(labels='All', level=0, axis=0) # drop cumulative row
                 .rename_axis(None, axis=1) # remove pivot table col name
                 .reset_index() # move (name, year) back into columns
                 .assign(Propf = lambda row: row.F/(row.F + row.M))
                )
name_year_sex
Out[62]:
Name Year F M All Propf
0 aalijah 1994 5 0 5 1.000000
1 aalijah 1995 5 0 5 1.000000
2 aalijah 1999 5 0 5 1.000000
3 aalijah 2001 9 0 9 1.000000
4 aalijah 2002 9 6 15 0.600000
... ... ... ... ... ... ...
89419 zuriel 2018 28 44 72 0.388889
89420 zuriel 2019 25 64 89 0.280899
89421 zuriel 2020 31 63 94 0.329787
89422 zuriel 2021 26 59 85 0.305882
89423 zuriel 2022 24 69 93 0.258065

89424 rows × 6 columns

In [63]:
"""
Compute two statistics per name:
- Count of number of babies with name
- Variance of proportion of females
  (i.e., how much the proportion of females varied
  across different years)
"""
names_to_include = 40
group_names =  (name_year_sex
                       .groupby('Name')
                       .agg({'Propf': 'var', 'All': 'sum'})
                       .rename(columns={'Propf': 'Propf Var', 'All': 'Total'})
                       .reset_index()
                      )
In [64]:
# pick some high variance names
high_variance_names = (group_names
                       .sort_values('Propf Var', ascending=False)
                       .head(names_to_include)
                       .sort_values('Total', ascending=False)
                      )

high_variance_names.head(5)
Out[64]:
Name Propf Var Total
19 aime 0.238514 2270
962 linzy 0.229197 1518
979 luan 0.252101 1255
468 edris 0.230026 1050
961 linzie 0.246861 900
In [65]:
# pick some common names
common_names = (group_names
                .sort_values('Total', ascending=False)
                .head(names_to_include)
               )
common_names.head(10)
Out[65]:
Name Propf Var Total
1586 willie 0.026477 595921
729 jordan 0.011999 525151
1484 taylor 0.110481 439450
950 leslie 0.147733 381953
653 jamie 0.021096 357061
70 angel 0.032182 349577
938 lee 0.005389 294727
706 jessie 0.027124 279884
1026 marion 0.017836 261205
358 dana 0.059549 246020