Lecture 1 – Data 100, Spring 2023¶

by Lisa Yan

  • adapted from Joseph E. Gonzalez, Anthony D. Joseph, Josh Hug, Suraj Rampure.
  • minor updates by Narges Norouzi and Fernando Pérez.

Software Packages¶

We will be using a wide range of different Python software packages. To install and manage these packages we will be using the Conda environment manager. The following is a list of packages we will routinely use in lectures and homeworks:

In [65]:
# linear algebra, probability
import numpy as np

# data manipulation
import pandas as pd

# visualization
import matplotlib.pyplot as plt
import seaborn as sns

## interactive visualization library
import plotly.offline as py
py.init_notebook_mode()
import plotly.graph_objs as go
import plotly.figure_factory as ff
import plotly.express as px
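To set up a similar environment locally, a minimal Conda workflow might look like the following sketch. The environment name `data100` and the Python version are illustrative assumptions, not the official course setup:

```shell
# create an isolated environment for the course (name is illustrative)
conda create -n data100 python=3.10

# activate it and install the packages imported above
conda activate data100
conda install numpy pandas matplotlib seaborn plotly
```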

We will learn how to use all of the technologies used in this demo.

For now, just sit back and think critically about the data and our guided analysis.

1. Starting with a Question: Who are you (the students of DS100)?¶

This is a pretty vague question but let's start with the goal of learning something about the students in the class.

Here are some "simple" questions:

  1. How many students do we have?
  2. What are your majors?
  3. What year are you?
  4. Diversity ...?

2. Data Acquisition and Cleaning¶

In DS100 we will study various methods to collect data.

To answer this question, I downloaded the course roster and extracted everyone's names and majors.

In [66]:
# pd stands for pandas, which we will learn next week
# some pandas syntax shared with data8's datascience package
majors = pd.read_csv("data/majors.csv")
names = pd.read_csv("data/names.csv")

3. Exploratory Data Analysis¶

In DS100 we will study exploratory data analysis and practice analyzing new datasets.

I didn't tell you the details of the data! Let's check out the data and infer its structure. Then we can start answering the simple questions we posed.

Peeking at the Data¶

In [67]:
majors.head(20)
Out[67]:
Majors Terms in Attendance
0 Environ Health Sciences PhD G
1 Molecular & Cell Biology BA 6
2 Applied Mathematics BA, Computer Science BA 8
3 Electrical Eng & Comp Sci BS 4
4 Public Health MPH G
5 Letters & Sci Undeclared UG 6
6 Computer Science BA, Data Science BA 6
7 Applied Mathematics BA, Computer Science BA 4
8 Chemical Biology BS, Computer Science BA 6
9 Electrical Eng & Comp Sci BS 4
10 Applied Mathematics BA 4
11 Letters & Sci Undeclared UG 4
12 Bioengineering BS 8
13 Letters & Sci Undeclared UG 8
14 Environ Econ & Policy BS, Letters & Sci Undecl... 6
15 Letters & Sci Undeclared UG 4
16 Letters & Sci Undeclared UG 4
17 Letters & Sci Undeclared UG, Nutritional Scien... 4
18 Letters & Sci Undeclared UG 6
19 Letters & Sci Undeclared UG 6
In [68]:
names.head()
Out[68]:
Name Role
0 Yue Student
1 Kevin Student
2 Ryan Student
3 JOHN Student
4 Alex Student

What is one potential issue we may need to address in this data?¶

Answer: Some names appear capitalized.

In the above sample we notice that some of the names are capitalized and some are not. This inconsistency will cause problems in our later analysis, so let's convert all names to lower case.

In [69]:
names['Name'] = names['Name'].str.lower()
In [70]:
names.head()
Out[70]:
Name Role
0 yue Student
1 kevin Student
2 ryan Student
3 john Student
4 alex Student

How many records do we have?¶

In [71]:
print(len(names))
print(len(majors))
1287
1287

Based on what we know of our class, each record is most likely a student.



Q: Is this big data (would you call this a "big class")?

This would not normally constitute big data ... however this is a common data size for a lot of data analysis tasks.

Is this a big class? YES!

Understanding the structure of data¶

It is important that we understand the meaning of each field and how the data is organized.

In [72]:
names.head()
Out[72]:
Name Role
0 yue Student
1 kevin Student
2 ryan Student
3 john Student
4 alex Student

Q: What is the meaning of the Role field?

A: Understanding the meaning of a field can often be achieved by looking at the types of data it contains (in particular, the counts of its unique values).

We use the value_counts() function in pandas:

In [73]:
names['Role'].value_counts().to_frame()  # counts of unique Roles
Out[73]:
Role
Student 1201
Waitlist Student 85
#REF! 1

It appears that one student has an erroneous role given as "#REF!". What else can we learn about this student? Let's see their name.

In [74]:
# boolean index to find rows where Role is #REF!
names[names['Role'] == "#REF!"]
Out[74]:
Name Role
211 #ref! #REF!

Though this single bad record won't have much of an impact on our analysis, we can clean our data by removing this record.

In [75]:
names = names[names['Role'] != "#REF!"]

Double check: Let's double check that our record removal only removed the single bad record.

In [76]:
names['Role'].value_counts().to_frame()  # again, counts of unique Roles
Out[76]:
Role
Student 1201
Waitlist Student 85

Remember we loaded in two files. Let's explore the fields of majors and check for bad records:

In [77]:
majors.columns   # get column names
Out[77]:
Index(['Majors', 'Terms in Attendance'], dtype='object')
In [78]:
majors['Terms in Attendance'].value_counts().to_frame()
Out[78]:
Terms in Attendance
4 540
6 360
8 176
G 134
7 34
2 19
5 17
3 6
#REF! 1

It looks like the numbers represent semesters in attendance and G represents graduate students. But we do still have a bad record:

In [79]:
majors[majors['Terms in Attendance'] == "#REF!"]
Out[79]:
Majors Terms in Attendance
513 #REF! #REF!
In [80]:
majors = majors[majors['Terms in Attendance'] != "#REF!"]
majors['Terms in Attendance'].value_counts().to_frame()
Out[80]:
Terms in Attendance
4 540
6 360
8 176
G 134
7 34
2 19
5 17
3 6

Detail: The index of the deleted majors record (513) is different from the index of the bad names record (211). So while the number of records in each table matches, the row indices no longer line up, and we'll have to keep these tables separate in order to do our analysis.
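A tiny, hypothetical illustration (toy tables, not the real roster) of how two equal-length tables can end up with mismatched row indices after cleaning:

```python
import pandas as pd

# two toy tables of the same length, each with one bad record in a different row
a = pd.DataFrame({'x': ['p', 'bad', 'q']})
b = pd.DataFrame({'y': ['r', 's', 'bad']})

# boolean filtering keeps the original row labels
a = a[a['x'] != 'bad']   # keeps index 0, 2
b = b[b['y'] != 'bad']   # keeps index 0, 1

print(len(a) == len(b))          # True: same number of records
print(a.index.equals(b.index))   # False: row indices no longer line up
```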

Summarizing the Data¶

We will often want to numerically or visually summarize the data. The describe() method provides a brief high level description of our data frame.

In [81]:
names.describe()
Out[81]:
Name Role
count 1286 1286
unique 961 2
top michael Student
freq 13 1201
In [82]:
majors.describe()
Out[82]:
Majors Terms in Attendance
count 1286 1286
unique 155 8
top Letters & Sci Undeclared UG 4
freq 543 540

Q: What do you think top and freq represent?

A: top: most frequent entry, freq: the frequency of that entry
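As a quick sanity check of `top` and `freq` on a hypothetical toy Series (not the course data):

```python
import pandas as pd

# describe() on a non-numeric column reports count, unique, top, and freq
s = pd.Series(['cs', 'cs', 'cs', 'math', 'stat'])
summary = s.describe()

print(summary['top'])    # the most frequent entry: 'cs'
print(summary['freq'])   # how often it appears: 3
```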


What are your majors?¶

What are the top majors:

In [83]:
majors_count = (       # method chaining in pandas
    majors['Majors']
    .value_counts()
    .sort_values(ascending=True) # lowest first
    .to_frame()
    .tail(20)          # get the top 20
)

# or, comment out to parse double majors
# majors_count = (
#     majors['Majors']
#     .str.split(", ") # get double majors
#     .explode()       # one major to every row
#     .value_counts()
#     .sort_values(ascending=True)
#     .to_frame()
#     .tail(20)
# )

majors_count
Out[83]:
Majors
Chemical Biology BS 7
Bioengineering BS 7
Psychology BA 7
Applied Mathematics BA, Computer Science BA 7
Business Administration BS 9
Chemical Engineering BS 10
Info Mgmt & Systems MIMS 10
Business Administration MBA 10
Environ Econ & Policy BS 10
Electrical Eng & Comp Sci MEng 13
Applied Mathematics BA 15
Molecular & Cell Biology BA 18
Public Health MPH 20
Cognitive Science BA 27
Civil Engineering BS 34
Electrical Eng & Comp Sci BS 53
Economics BA 59
Data Science BA 77
Computer Science BA 87
Letters & Sci Undeclared UG 543

We will often use visualizations to make sense of data¶

In DS100 we will deal with many different kinds of data (not just numbers) and we will study techniques to analyze diverse types of data.

How can we summarize the Majors field? A good starting point might be to use a bar plot:

In [84]:
# interactive using plotly
fig = px.bar(majors_count, orientation='h')
fig.update_layout(showlegend=False,
                  xaxis_title='Count',
                  yaxis_title='Major')

What year are you?¶

In [85]:
fig = px.histogram(majors['Terms in Attendance'].sort_values(),
                   histnorm='probability')
fig.update_layout(showlegend=False,
                  xaxis_title="Term",
                  yaxis_title="Fraction of Class")

Diversity and Data Science:¶

Unfortunately, surveys of data scientists suggest that there are far fewer women than men in data science.

To learn more check out the Kaggle Executive Summary or study the Raw Data.


What fraction of the students are female?¶

I actually get asked this question a lot as we try to improve the data science program at Berkeley.

This is actually a fairly complex question. What do we mean by female? Is this a question about the sex or gender identity of the students? They are not the same thing.

  • Sex refers predominantly to biological characteristics.
  • Gender is much more complex with societal and cultural implications and refers to how people identify themselves.

Most likely, my colleagues are interested in improving gender diversity, by ensuring that our program is inclusive. Let's reword this question:

Reworded: What is the gender diversity of our students?¶

How could we answer this question?¶

In [86]:
print(majors.columns)
print(names.columns)
Index(['Majors', 'Terms in Attendance'], dtype='object')
Index(['Name', 'Role'], dtype='object')

We don't have the data.¶

Where can we get the data?


(1) We could run a survey!¶

(2) ... or we could try to use the data we have to estimate the _sex_ of the students as a proxy for gender?!?!¶

Please do not attempt option (2) alone. What I am about to do is flawed in so many ways and we will discuss these flaws in a moment and throughout the semester.

However, it will illustrate some very basic inferential modeling and how we might combine multiple data sources to try and reason about something we haven't measured.

To attempt option (2), we will first look at a second data source.


US Social Security Data¶

To study what a name tells about a person, we will download data from the United States Social Security Administration containing the number of registered names broken down by year, sex, and name. This is often called the Baby Names Data, as Social Security numbers (SSNs) are typically assigned at birth.

1. What does a name tell us about a person?¶

A: In this demo we'll use a person's name to estimate their sex. But a person's name tells us many things (more on this later).

2. Acquire data programmatically¶

Note 1: In the following we download the data programmatically to ensure that the process is reproducible.

Note 2: We also load the data directly into Python without decompressing the zip file.

In DS100 we will think a bit more about how we can be efficient in our data analysis to support processing large datasets.

In [90]:
import urllib.request
import os.path

# Download data from the web directly
data_url = "https://www.ssa.gov/oact/babynames/names.zip"
local_filename = "babynames.zip"
if not os.path.exists(local_filename): # if the data exists don't download again
    with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
        f.write(resp.read())

        
# Load data without unzipping the file
import zipfile
babynames = [] 
with zipfile.ZipFile(local_filename, "r") as zf:
    data_files = [f for f in zf.filelist if f.filename[-3:] == "txt"]
    def extract_year_from_filename(fn):
        return int(fn[3:7])
    for f in data_files:
        year = extract_year_from_filename(f.filename)
        with zf.open(f) as fp:
            df = pd.read_csv(fp, names=["Name", "Sex", "Count"])
            df["Year"] = year
            babynames.append(df)
babynames = pd.concat(babynames)


babynames.head() # show the first few rows
Out[90]:
Name Sex Count Year
0 Mary F 7065 1880
1 Anna F 2604 1880
2 Emma F 2003 1880
3 Elizabeth F 1939 1880
4 Minnie F 1746 1880

2 (cont). Understanding the Setting¶

In Data 100 you will have to learn about different data sources (and their limitations) on your own.

Reading from the Social Security Administration's description of the data:

All names are from Social Security card applications for births that occurred in the United States after 1879. Note that many people born before 1937 never applied for a Social Security card, so their names are not included in our data. For others who did apply, our records may not show the place of birth, and again their names are not included in our data.

To safeguard privacy, we exclude from our tabulated lists of names those that would indicate, or would allow the ability to determine, names with fewer than 5 occurrences in any geographic area. If a name has less than 5 occurrences for a year of birth in any state, the sum of the state counts for that year will be less than the national count.

All data are from a 100% sample of our records on Social Security card applications as of March 2022.

A little bit of data cleaning¶

Examining the data:

In [91]:
babynames
Out[91]:
Name Sex Count Year
0 Mary F 7065 1880
1 Anna F 2604 1880
2 Emma F 2003 1880
3 Elizabeth F 1939 1880
4 Minnie F 1746 1880
... ... ... ... ...
31532 Zyeire M 5 2021
31533 Zyel M 5 2021
31534 Zyian M 5 2021
31535 Zylar M 5 2021
31536 Zyn M 5 2021

2052781 rows × 4 columns

In our earlier analysis we converted names to lower case. We will do the same again here:

In [92]:
babynames['Name'] = babynames['Name'].str.lower()
babynames.head()
Out[92]:
Name Sex Count Year
0 mary F 7065 1880
1 anna F 2604 1880
2 emma F 2003 1880
3 elizabeth F 1939 1880
4 minnie F 1746 1880

3. Exploratory Data Analysis (and Visualization)¶

How many people does this data represent?

In [93]:
format(babynames['Count'].sum(), ',d') # sum of 'Count' column
Out[93]:
'361,888,233'
In [94]:
format(len(babynames), ',d')       # number of rows
Out[94]:
'2,052,781'

Q: Is this number low or high?

Answer

It seems low (the 2021 US population was 331.9 million). However the social security website states:

All names are from Social Security card applications for births that occurred in the United States after 1879. Note that many people born before 1937 never applied for a Social Security card, so their names are not included in our data. For others who did apply, our records may not show the place of birth, and again their names are not included in our data. All data are from a 100% sample of our records on Social Security card applications as of the end of February 2016.

Trying a simple query using a boolean index:

In [95]:
# how many Nora's were born in 2018?
babynames[(babynames['Name'] == 'nora') & (babynames['Year'] == 2018)]
Out[95]:
Name Sex Count Year
29 nora F 5833 2018
28527 nora M 7 2018

Trying a more complex query using query() (to be discussed next week):

In [96]:
# how many baby names contain the word "data"?
babynames.query('Name.str.contains("data")', engine='python')
Out[96]:
Name Sex Count Year
9765 kidata F 5 1975
24914 datavion M 5 1995
23613 datavious M 7 1997
12103 datavia F 7 2000
27509 datavion M 6 2001
28918 datari M 5 2001
29140 datavian M 5 2002
29141 datavious M 5 2002
30573 datavion M 5 2004
17140 datavia F 5 2005
31032 datavion M 5 2005
31023 datavion M 6 2006
33344 datavious M 5 2007
33345 datavius M 5 2007
33408 datavious M 5 2008
33091 datavion M 5 2009
32500 datavious M 5 2010

Temporal Patterns Conditioned on Male/Female¶

In DS100 we will study how to visualize and analyze relationships in data.

In this example we construct a pivot table which aggregates the number of babies registered for each year by Sex.

We'll discuss pivot tables in detail next week.

In [97]:
# counts number of M and F babies per year
year_sex = pd.pivot_table(
        babynames, 
        index=['Year'], # the row index
        columns=['Sex'], # the column values
        values='Count', # the field(s) to be processed in each group
        aggfunc=np.sum,
    )[["M", "F"]]

year_sex.head()
Out[97]:
Sex M F
Year
1880 110490 90994
1881 100737 91953
1882 113686 107847
1883 104625 112319
1884 114442 129019

We can visualize these descriptive statistics:

In [98]:
# more interactive using plotly
fig = px.line(year_sex)
fig.update_layout(title="Total Babies per Year",
                  yaxis_title="Number of Babies")

How many unique names for each year?¶

In [99]:
# counts number of M and F *names* per year
year_sex_unique = pd.pivot_table(babynames, 
        index=['Year'], 
        columns=['Sex'], 
        values='Name', 
        aggfunc=lambda x: len(np.unique(x)),
    )
fig = px.line(year_sex_unique)
fig.update_layout(title="Unique Names Per Year",
                  yaxis_title="Number of Baby Names")

Some observations:

  1. Registration data seems limited in the early 1900s, because many people did not register before 1937.
  2. You can see the Baby Boomers (born 1940s-1960s) and the Echo Boomers (aka millennials, 1980s to 2000).
  3. Female babies have a slightly greater diversity of names.

4. Understand the World: Prediction and Inference¶

Let's use the Baby Names dataset to estimate the fraction of female students in the class.

Compute the Proportion of Female Babies For Each Name¶

First, we construct a pivot table to compute the total number of babies registered for each Name, broken down by Sex.

In [100]:
# counts number of M and F babies per name
name_sex = pd.pivot_table(babynames, index='Name', columns='Sex', values='Count',
                            aggfunc='sum', fill_value=0., margins=True)
name_sex.head()
Out[100]:
Sex F M All
Name
aaban 0 120 120
aabha 51 0 51
aabid 0 16 16
aabidah 5 0 5
aabir 0 10 10

Second, we compute the proportion of female babies for each name. This is our estimated probability that a baby with that name is female:

$$ \hat{P}(\text{Female} \mid \text{Name}) = \frac{\text{Count}(\text{Female and Name})}{\text{Count}(\text{Name})} $$
In [101]:
prop_female = (name_sex['F'] / name_sex['All']).rename("Prop. Female")
prop_female.to_frame().head(10)
Out[101]:
Prop. Female
Name
aaban 0.0
aabha 1.0
aabid 0.0
aabidah 1.0
aabir 0.0
aabriella 1.0
aada 1.0
aadam 0.0
aadan 0.0
aadarsh 0.0

Test a few names¶

In [35]:
prop_female["lisa"]
Out[35]:
0.9971219500105333
In [103]:
prop_female["narges"]
Out[103]:
1.0
In [104]:
prop_female["josh"]
Out[104]:
0.0
In [105]:
prop_female["avery"]
Out[105]:
0.7058240944624468
In [106]:
prop_female["min"]
Out[106]:
0.37598736176935227
In [107]:
prop_female["pat"]
Out[107]:
0.600140600694029
In [108]:
prop_female["jaspreet"]
Out[108]:
0.6043956043956044

Next, Build a Simple Classifier (Model)¶

We can define a function to return the most likely Sex for a name. If there is an exact tie or the name does not appear in the Social Security dataset, the function returns Unknown.

In [109]:
def sex_from_name(name):
    lower_name = name.lower()
    if lower_name not in prop_female.index or prop_female[lower_name] == 0.5:
        return "Unknown"
    elif prop_female[lower_name] > 0.5:
        return "F"
    else:
        return "M"
In [110]:
sex_from_name("nora")
Out[110]:
'F'
In [111]:
sex_from_name("josh")
Out[111]:
'M'
In [112]:
sex_from_name("pat")
Out[112]:
'F'

4 (cont). Estimating the fraction of female and male students in DS100¶

Let's try out our simple classifier! We'll use the apply() function to classify each student name:

In [113]:
# apply sex_from_name to each student name
names['Pred. Sex'] = names['Name'].apply(sex_from_name)
px.bar(names['Pred. Sex'].value_counts()/len(names))

Interpreting the unknowns¶

That's a lot of Unknowns.

...But we can still estimate the fraction of female students in the class:

In [114]:
count_by_sex = names['Pred. Sex'].value_counts().to_frame()
count_by_sex
Out[114]:
Pred. Sex
M 591
F 513
Unknown 182
In [115]:
count_by_sex.loc['F']/(count_by_sex.loc['M'] + count_by_sex.loc['F'])
Out[115]:
Pred. Sex    0.464674
dtype: float64

Questions:

  1. How do we feel about this estimate?
  2. Do we trust it?




Q: What fraction of students in Data 100 this semester have names in the SSN dataset?

In [116]:
print("Fraction of names in the babynames data:", 
      names['Name'].isin(prop_female.index).mean())
Fraction of names in the babynames data: 0.8584758942457231

Q: Which names are not in the dataset?

Why might these names not appear?

In [117]:
# the tilde ~ negates the boolean index. More next week.
names[~names['Name'].isin(prop_female.index)].sample(10)
Out[117]:
Name Role Pred. Sex
468 sauhard Student Unknown
521 zilong Waitlist Student Unknown
1166 zcjanin Student Unknown
640 kamer Student Unknown
1122 limi Student Unknown
324 peiying Student Unknown
884 arunava Student Unknown
938 zeming Student Unknown
1017 xainab Student Unknown
1273 inho Student Unknown

Using simulation to estimate uncertainty¶

Previously we treated a name which is given to females 40% of the time as a "Male" name, because the probability was less than 0.5. This doesn't capture our uncertainty.

We can use simulation to provide a better distributional estimate. We'll use 50% for names not in the Baby Names dataset.

In [118]:
# add the computed SSN F proportion to each row. 0.5 for Unknowns.
# merge() effectively "join"s two tables together. to be covered next week.
names['Prop. Female'] = (
    names[['Name']].merge(prop_female, how='left', left_on='Name', 
                          right_index=True)['Prop. Female']
        .fillna(0.5)
)
names.head(10)
Out[118]:
Name Role Pred. Sex Prop. Female
0 yue Student F 0.532609
1 kevin Student M 0.004508
2 ryan Student M 0.026558
3 john Student M 0.004200
4 alex Student M 0.032599
5 cael Waitlist Student M 0.003687
6 angela Student F 0.996816
7 michael Student M 0.004941
8 sean Student M 0.006971
9 andrew Student M 0.003750

Running the simulation¶

In [119]:
# if a randomly picked number from [0.0, 1.0) is under the Female proportion, then F
names['Sim. Female'] = np.random.rand(len(names)) < names['Prop. Female']
names.tail(20)
Out[119]:
Name Role Pred. Sex Prop. Female Sim. Female
1267 amy Student F 0.997355 True
1268 nithurhan Student Unknown 0.500000 True
1269 harpreet Student F 0.504970 False
1270 jaquelyn Student F 1.000000 True
1271 shujie Student Unknown 0.500000 True
1272 latifa Student F 1.000000 True
1273 inho Student Unknown 0.500000 True
1274 malavika Student F 1.000000 True
1275 mihir Student M 0.000000 False
1276 sanik Student Unknown 0.500000 False
1277 cindy Student F 0.997003 True
1278 raine Student F 0.843631 True
1279 allison Waitlist Student F 0.987817 True
1280 caleb Student M 0.002063 False
1281 robin Waitlist Student F 0.864324 True
1282 sam Student M 0.011800 False
1283 shivani Student F 1.000000 True
1284 harjot Student M 0.245509 False
1285 harshil Student M 0.000000 False
1286 zora Student F 0.997671 True

Given such a simulation, we can compute the fraction of the class that is female.

  1. How do we feel about this new estimate?
  2. Do we trust it?
In [120]:
# proportion of Trues in the 'Sim. Female' column
names['Sim. Female'].mean()
Out[120]:
0.49144634525660963

Now that we're performing a simulation, the above proportion is random: it depends on the random numbers we picked to determine whether a student was Female.

Let's run the above simulation several times and see what the distribution of this Female proportion is. The below cell may take a few seconds to run.

In [121]:
# simulate the class once and return the fraction of female students
def simulate_class(students):
    is_female = students['Prop. Female'] > np.random.rand(len(students))
    return np.mean(is_female)

sim_frac_female = np.array([simulate_class(names) for _ in range(10000)])
In [122]:
fig = ff.create_distplot([sim_frac_female], ['Fraction Female'], bin_size=0.0025, show_rug=False)
fig.update_layout(xaxis_title='Prop. Female',
                  yaxis_title='Percentage',
                  title='Distribution of Simulated Proportions of Females in the Class')
ax = sns.histplot(sim_frac_female, stat='probability', kde=True, bins=20)
sns.rugplot(sim_frac_female, ax=ax)
ax.set_xlabel("Fraction Female")
ax.set_title('Distribution of Simulated Fractions Female in the Class');

In DS100 we will study Kernel Density Estimates, Rug Plots, and other visualization techniques.





Limitations of Baby Names dataset¶

UC Berkeley teaches students from around the world.¶

We saw with our Simple Classifier that many student names were classified as "Unknown," often because they weren't in the SSN Baby Names Dataset.

Recall the SSN dataset:

All names are from Social Security card applications for births that occurred in the United States after 1879.

That statement is not reflective of all of our students!!

In [123]:
# students who were not in the SSN Baby Names Dataset
names[~names['Name'].isin(prop_female.index)].sample(10)
Out[123]:
Name Role Pred. Sex Prop. Female Sim. Female
782 hongbi Student Unknown 0.5 False
1119 sungsoo Waitlist Student Unknown 0.5 False
384 seoyoung Student Unknown 0.5 True
822 zhiqi Student Unknown 0.5 False
533 buyankhuu Student Unknown 0.5 False
965 weihao Student Unknown 0.5 True
1019 huaxiao Student Unknown 0.5 False
259 aswinkarthik Student Unknown 0.5 True
87 saiteja Student Unknown 0.5 True
778 filhaq Student Unknown 0.5 True

Names change over time.¶

Using data from 1879 (or even 1937) does not represent the diversity and context of U.S. baby names today.

Here are some choice names to show you how the distribution of particular names has varied with time:

In [124]:
subset_names = ["edris", "jamie", "jordan", "leslie", "taylor", "willie"]
subset_babynames_year = (pd.pivot_table(
                    babynames[babynames['Name'].isin(subset_names)],
                    index=['Name', 'Year'], columns='Sex', values='Count',
                    aggfunc='sum', fill_value=0, margins=True)
                 .drop(labels='All', level=0, axis=0) # drop cumulative row
                 .rename_axis(None, axis=1) # remove pivot table col name
                 .reset_index() # move (name, year) back into columns
                 .assign(Propf=lambda df: df.F / (df.F + df.M))  # proportion female
                )
ax = sns.lineplot(data=subset_babynames_year,
                  x='Year', y='Propf', hue='Name')
ax.set_title("Ratio of Female Babies over Time for Select Names")
ax.set_ylabel("Proportion of Female Names in a Year")
ax.legend(loc="lower left");

Bonus: How we selected which names to plot¶

Curious as to how we got the above names? We picked out two types of names:

  • names that had a high variability in F/M naming over years
  • common names that had an average F/M ratio over a set threshold

Check it out:

In [125]:
"""
get a subset of names that:
    have had propf above a threshold, as well as
    have been counted for more than a certain number of years
Note: while we could do our analysis over all names,
    it turns out many names don't matter.
    So to save computation power, we just work
    with a subset of names we know may be candidates.
"""
# these are thresholds we set as data analysts
propf_min = 0.2
propf_max = 0.8
year_thresh = 30

propf_countyear = (babynames
                   .groupby('Name').count()
                   .merge(prop_female.to_frame(), on='Name')
                   .rename(columns={'Prop. Female': 'Propf'})
                   .query("@propf_min < Propf < @propf_max & Year > @year_thresh & Name != 'All'")
                  )[['Propf', 'Year']]
propf_countyear
Out[125]:
Propf Year
Name
aalijah 0.366995 39
aamari 0.377953 33
aaren 0.244367 73
aarin 0.283472 76
aarion 0.223926 58
... ... ...
zephyr 0.236254 74
ziah 0.746377 43
ziyan 0.446475 35
zohar 0.649057 32
zyan 0.201788 50

1602 rows × 2 columns

In [126]:
# construct a pivot table of (name, year) to count
keep_names = propf_countyear.reset_index()['Name']
name_year_sex = (pd.pivot_table(
                    babynames[babynames['Name'].isin(keep_names)],
                    index=['Name', 'Year'], columns='Sex', values='Count',
                    aggfunc='sum', fill_value=0, margins=True)
                 .drop(labels='All', level=0, axis=0) # drop cumulative row
                 .rename_axis(None, axis=1) # remove pivot table col name
                 .reset_index() # move (name, year) back into columns
                 .assign(Propf=lambda df: df.F / (df.F + df.M))  # proportion female
                )
name_year_sex
Out[126]:
Name Year F M All Propf
0 aalijah 1994 5 0 5 1.000000
1 aalijah 1995 5 0 5 1.000000
2 aalijah 1999 5 0 5 1.000000
3 aalijah 2001 9 0 9 1.000000
4 aalijah 2002 9 6 15 0.600000
... ... ... ... ... ... ...
87522 zyan 2017 9 78 87 0.103448
87523 zyan 2018 6 88 94 0.063830
87524 zyan 2019 11 87 98 0.112245
87525 zyan 2020 6 86 92 0.065217
87526 zyan 2021 8 104 112 0.071429

87527 rows × 6 columns

In [127]:
"""
Compute two statistics per name:
- Count of number of babies with name
- Variance of proportion of females
  (i.e., how much the proportion of females varied
  across different years)
"""
names_to_include = 40
group_names =  (name_year_sex
                       .groupby('Name')
                       .agg({'Propf': 'var', 'All': 'sum'})
                       .rename(columns={'Propf': 'Propf Var', 'All': 'Total'})
                       .reset_index()
                      )
In [128]:
# pick some high variance names
high_variance_names = (group_names
                       .sort_values('Propf Var', ascending=False)
                       .head(names_to_include)
                       .sort_values('Total', ascending=False)
                      )

high_variance_names.head(5)
Out[128]:
Name Propf Var Total
20 aime 0.239362 2253
946 linzy 0.227941 1509
964 luan 0.252438 1185
460 edris 0.227961 1033
945 linzie 0.246861 900
In [129]:
# pick some common names
common_names = (group_names
                .sort_values('Total', ascending=False)
                .head(names_to_include)
               )
common_names.head(10)
Out[129]:
Name Propf Var Total
1557 willie 0.026535 595724
718 jordan 0.012061 520518
1456 taylor 0.109349 437595
934 leslie 0.147521 381471
643 jamie 0.020710 356210
71 angel 0.032230 343815
921 lee 0.005340 294340
696 jessie 0.027282 279340
1009 marion 0.017952 261017
351 dana 0.059373 245740