Lecture 1 – Data 100, Spring 2021

by Joseph E. Gonzalez

adapted from Anthony D. Joseph, Josh Hug, Suraj Rampure

Simple Questions about the Class

  1. How many students do we have?
  2. What are their majors?
  3. What year are they?
  4. Diversity ...?

Load and clean the roster
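
The loading step might look roughly like the sketch below. The roster file is not shown in these notes, so the column names (`Name`, `Major`, `Term in`) and the rows are hypothetical stand-ins, read from an in-memory string rather than a real file.

```python
import io
import pandas as pd

# Hypothetical roster rows; the real roster's file name and columns are assumptions.
roster_csv = io.StringIO(
    "Name,Major,Term in\n"
    "  Alice ,Computer Science BA,Junior\n"
    "BOB,Data Science BA,Senior\n"
    "carol,Letters & Sci Undeclared UG,Sophomore\n"
)
students = pd.read_csv(roster_csv)

# Clean the names: trim stray whitespace and lowercase them so that
# later lookups against another dataset match consistently.
students["Name"] = students["Name"].str.strip().str.lower()

print(len(students))    # how many students are on the roster
print(students.head())  # peek at the first few rows
```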

Peeking at the Data

How many students do we have?

What are their Majors?

What are the top majors:
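
One way to rank majors is with `value_counts`, which counts each distinct value and sorts the counts in descending order. This is a sketch on a made-up `Major` column, not the actual roster.

```python
import pandas as pd

# Made-up majors for illustration only.
majors = pd.Series(
    ["Data Science", "Computer Science", "Data Science",
     "Economics", "Data Science", "Computer Science"],
    name="Major",
)

# Count students per major and keep the two most common.
top_majors = majors.value_counts().head(2)
print(top_majors)
```

Calling `top_majors.plot(kind="barh")` would draw the same counts as a horizontal bar chart, assuming matplotlib is installed.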

We will often use visualizations to make sense of data

What Year are they?










Diversity and Data Science:

Unfortunately, surveys of data scientists suggest that there are far fewer women than men in data science:

To learn more, check out the Kaggle Executive Summary or study the Raw Data.










What fraction of the students are female?

I actually get asked this question a lot as we try to improve the data science program at Berkeley.

This is actually a fairly complex question. What do we mean by female? Is this a question about the sex or gender identity of the students? They are not the same thing.

Most likely, my colleagues are interested in improving gender diversity, by ensuring that our program is inclusive.




How could we answer this question?













We don't have the data.

Where can we get the data?







(1) We could run a survey!








(2) ... or we could try to use the data we have to estimate the sex of the students as a proxy for gender.

What I am about to do is flawed in many ways, and we will discuss these flaws in a moment and throughout the semester. However, it will illustrate some very basic inferential modeling and how we might combine multiple data sources to reason about something we haven't measured.









US Social Security Data

Public dataset containing baby names and their sex.

Understanding the Setting

In Data 100 you will have to learn about different data sources (and their limitations) on your own.

Reading from the Social Security Administration's description of the data:

All names are from Social Security card applications for births that occurred in the United States after 1879. Note that many people born before 1937 never applied for a Social Security card, so their names are not included in our data. For others who did apply, our records may not show the place of birth, and again their names are not included in our data.

To safeguard privacy, we exclude from our tabulated lists of names those that would indicate, or would allow the ability to determine, names with fewer than 5 occurrences in any geographic area. If a name has less than 5 occurrences for a year of birth in any state, the sum of the state counts for that year will be less than the national count.

All data are from a 100% sample of our records on Social Security card applications as of March 2020.

Get data programmatically
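
A minimal sketch of the parsing step, assuming the national archive is a zip of files named `yobYYYY.txt`, each with headerless `name,sex,count` rows. The helper `load_baby_names` is a hypothetical name. To keep the example runnable offline it parses a tiny in-memory archive with made-up counts. The real bytes could instead be fetched with `urllib.request.urlopen`, with the archive URL assumed to be `https://www.ssa.gov/oact/babynames/names.zip`.

```python
import io
import zipfile
import pandas as pd

def load_baby_names(zip_bytes):
    """Read every yobYYYY.txt file in the archive into one DataFrame."""
    frames = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for filename in archive.namelist():
            # Skip anything that is not a per-year data file (e.g. a readme).
            if not (filename.startswith("yob") and filename.endswith(".txt")):
                continue
            with archive.open(filename) as fh:
                frame = pd.read_csv(fh, header=None,
                                    names=["Name", "Sex", "Count"])
            # The year is encoded in the file name, not in the rows.
            frame["Year"] = int(filename[3:7])
            frames.append(frame)
    return pd.concat(frames, ignore_index=True)

# Tiny in-memory stand-in for the real archive (made-up counts).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("yob1990.txt", "Jessica,F,10\nMichael,M,12\n")
    archive.writestr("yob1991.txt", "Ashley,F,9\n")

babynames = load_baby_names(buffer.getvalue())
print(babynames)
```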

A little bit of data cleaning:

Exploratory Data Analysis

How many people does this data represent?

Trying a simple query:

Let's use this data to estimate the fraction of female students in the class.

Proportion of Male and Female Individuals Over Time

In this example we construct a pivot table which aggregates the number of babies registered for each year by Sex.
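
The pivot table described above could be built along these lines. The `babynames` frame here is a small made-up sample using the column names `Name`, `Sex`, `Count`, and `Year`, which are assumptions about the cleaned data.

```python
import pandas as pd

# Made-up sample in the assumed shape of the cleaned baby-names table.
babynames = pd.DataFrame({
    "Name":  ["Mary", "John", "Mary", "John", "Alex", "Alex"],
    "Sex":   ["F", "M", "F", "M", "F", "M"],
    "Count": [10, 8, 6, 9, 2, 3],
    "Year":  [1950, 1950, 1951, 1951, 1951, 1951],
})

# One row per year, one column per sex, each cell the total babies registered.
year_sex = pd.pivot_table(
    babynames, index="Year", columns="Sex", values="Count", aggfunc="sum"
)

# Divide each row by its total to get the proportion of each sex per year.
prop_by_year = year_sex.div(year_sex.sum(axis=1), axis=0)
print(year_sex)
print(prop_by_year)
```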

How many unique names for each year?
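
One possible way to count distinct names per year and sex is `groupby` followed by `nunique`. This is a sketch on made-up rows and may differ from the aggregation used in lecture.

```python
import pandas as pd

# Made-up sample in the assumed shape of the cleaned baby-names table.
babynames = pd.DataFrame({
    "Name":  ["Mary", "John", "Mary", "John", "Alex", "Alex"],
    "Sex":   ["F", "M", "F", "M", "F", "M"],
    "Count": [10, 8, 6, 9, 2, 3],
    "Year":  [1950, 1950, 1951, 1951, 1951, 1951],
})

# Count distinct names within each (Year, Sex) group, then spread Sex
# into columns so that each row is a year.
unique_names = babynames.groupby(["Year", "Sex"])["Name"].nunique().unstack()
print(unique_names)
```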

Some observations:

  1. Registration data seems limited in the early 1900s because many people born before 1937 never applied for a Social Security card.
  2. You can see the baby boomers and the echo boom.
  3. Female babies have a greater diversity of names.

Computing the Proportion of Female Babies For Each Name

Compute proportion of female babies given each name.
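
A sketch of this computation: sum the counts for each name by sex across all years, then divide the female total by the combined total. The data is made up, and lowercasing the names is an assumption made so that they match a lowercased roster.

```python
import pandas as pd

# Made-up sample in the assumed shape of the cleaned baby-names table.
babynames = pd.DataFrame({
    "Name":  ["Mary", "John", "Mary", "John", "Alex", "Alex"],
    "Sex":   ["F", "M", "F", "M", "F", "M"],
    "Count": [10, 8, 6, 9, 2, 3],
    "Year":  [1950, 1950, 1951, 1951, 1951, 1951],
})
babynames["Name"] = babynames["Name"].str.lower()

# Total babies per name and sex; names never given to one sex get 0.
counts = pd.pivot_table(
    babynames, index="Name", columns="Sex", values="Count",
    aggfunc="sum", fill_value=0,
)

# Proportion of babies with each name who were registered as female.
prop_female = counts["F"] / (counts["F"] + counts["M"])
print(prop_female)
```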

Testing a few names

Build Simple Classifier (Model)

We can define a function that returns the most likely Sex for a name. If there is an exact tie, or the name does not appear in the Social Security dataset, the function returns Unknown.
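
The rule above can be sketched as a small function. The name `sex_from_name` and the proportions in `prop_female` are hypothetical. In practice the proportions would come from the Social Security data.

```python
import pandas as pd

# Made-up proportion of female babies per (lowercased) name.
prop_female = pd.Series({"mary": 1.0, "john": 0.0, "alex": 0.4, "jordan": 0.5})

def sex_from_name(name, prop_female=prop_female):
    """Return 'F', 'M', or 'Unknown' for a first name."""
    key = name.lower()
    if key not in prop_female.index:
        return "Unknown"          # name does not appear in the dataset
    p = prop_female[key]
    if p > 0.5:
        return "F"
    if p < 0.5:
        return "M"
    return "Unknown"              # exact tie

for name in ["Mary", "Alex", "Jordan", "Zzyzx"]:
    print(name, sex_from_name(name))
```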

Estimating the fraction of female and male students

What fraction of students in Data 100 this semester have names in the Social Security dataset?
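
A sketch of the coverage check, using made-up student names and proportions: `isin` marks which names appear in the index of the per-name table, and the mean of that boolean mask is the fraction covered.

```python
import pandas as pd

# Made-up data: per-name proportions and a handful of student first names.
prop_female = pd.Series({"mary": 1.0, "john": 0.0, "alex": 0.4})
student_names = pd.Series(["mary", "alex", "xiaoming", "john"])

# True where the student's name appears in the baby-names table.
known = student_names.isin(prop_female.index)

print(known.mean())                     # fraction of students covered
missing = student_names[~known]
print(missing.tolist())                 # names with no match
```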

Which names are not in the dataset?

Why might these names not appear?

Using simulation to estimate uncertainty

Previously we treated a name which is given to females 40% of the time as a "Male" name. This doesn't capture our uncertainty. We can use simulation to provide a better distributional estimate.

Running the simulation
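
A minimal sketch of such a simulation, under the assumption that each student's sex is drawn independently as female with probability equal to the proportion for their name. The names, the proportions, the seed, and the number of simulated classes are all made up.

```python
import numpy as np
import pandas as pd

# Made-up proportions and class list.
prop_female = pd.Series({"mary": 1.0, "john": 0.0, "alex": 0.4})
student_names = ["mary", "alex", "john", "alex"]

# Look up each student's probability; names not in the table are dropped.
p = prop_female.reindex(student_names).dropna().to_numpy()

rng = np.random.default_rng(42)
n_sims = 1000

# Each row is one simulated class; each entry is True if that student
# was drawn as female in that simulation.
draws = rng.random((n_sims, len(p))) < p
frac_female = draws.mean(axis=1)

print(frac_female.mean())
print(np.percentile(frac_female, [2.5, 97.5]))
```

The spread of `frac_female` gives a distribution over plausible class compositions rather than the single number produced by the hard classifier.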