Lecture 1 – Data 100, Spring 2022

by Lisa Yan

adapted from Joseph E. Gonzalez, Anthony D. Joseph, Josh Hug, Suraj Rampure

Using this notebook:

Links from the course website often open the JupyterLab view, where you can browse files and access the Terminal. You can click to hide the sidebar.

To view this notebook in Simple Notebook view (which helps conserve browser memory):

Software Packages

We will be using a wide range of different Python software packages. To install and manage these packages we will be using the Conda environment manager. The following is a list of packages we will routinely use in lectures and homeworks:

We will learn how to use all of the technologies used in this demo.

For now, just sit back and think critically about the data and our guided analysis.

1. Starting with a Question: Who are you (the students of DS100)?

This is a pretty vague question but let's start with the goal of learning something about the students in the class.

Here are some "simple" questions:

  1. How many students do we have?
  2. What are your majors?
  3. What year are you?
  4. Diversity ...?

2. Data Acquisition and Cleaning

In DS100 we will study various methods to collect data.

To answer this question, I downloaded the course roster and extracted everyone's names and majors.

3. Exploratory Data Analysis

In DS100 we will study exploratory data analysis and practice analyzing new datasets.

I didn't tell you the details of the data! Let's check out the data and infer its structure. Then we can start answering the simple questions we posed.

Peeking at the Data

What is one potential issue we may need to address in this data?

Answer: Some names appear capitalized.

In the above sample we notice that some of the names are capitalized and some are not. This will be an issue in our later analysis so let's convert all names to lower case.
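A minimal sketch of the lower-casing step, assuming the roster is a pandas DataFrame with a `Name` column. The tiny table below is a made-up stand-in, not the real roster.

```python
import pandas as pd

# Hypothetical stand-in for the roster; the real column names may differ.
names = pd.DataFrame({
    "Name": ["ALICE", "bob", "Carol"],
    "Role": ["Student", "Student", "Student"],
})

# .str.lower() lower-cases every entry in the column at once.
names["Name"] = names["Name"].str.lower()
print(names["Name"].tolist())
```

After this step, comparisons such as matching names against another table no longer depend on capitalization.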

How many records do we have?

Based on what we know of our class, each record is most likely a student.



Q: Is this big data (would you call this a "big class")?

This would not normally constitute big data ... however this is a common data size for a lot of data analysis tasks.

Is this a big class? YES!

Understanding the structure of data

It is important that we understand the meaning of each field and how the data is organized.

Q: What is the meaning of the Role field?

A: Understanding the meaning of a field can often be achieved by looking at the types of data it contains (in particular the counts of its unique values).

We use the value_counts() method in pandas:
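As a rough illustration, here is `value_counts()` on a small made-up `Role` column. The role labels are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical roles; the real roster has many more rows.
roles = pd.Series(
    ["Student", "Student", "Waitlist Student", "Student", "#REF!"],
    name="Role",
)

# Counts how often each distinct value occurs, most frequent first.
counts = roles.value_counts()
print(counts)
```

Rare or odd-looking values (like "#REF!") stand out immediately in this kind of summary.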

It appears that one student has an erroneous role given as "#REF!". What else can we learn about this student? Let's see their name.

Though this single bad record won't have much of an impact on our analysis, we can clean our data by removing this record.

Double check: Let's double check that our record removal only removed the single bad record.
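One possible way to drop the bad record and double check the result, sketched on a made-up four-row table (the real notebook may do this differently):

```python
import pandas as pd

# Hypothetical roster with one bad record at row index 2.
names = pd.DataFrame({
    "Name": ["alice", "bob", "unknown", "carol"],
    "Role": ["Student", "Student", "#REF!", "Student"],
})

n_before = len(names)
is_bad = names["Role"] == "#REF!"
print(names[is_bad])        # inspect the bad record before removing it

# Keep every row that is NOT flagged as bad.
names = names[~is_bad]

# Double check: exactly one record was removed.
removed = n_before - len(names)
print(removed)
```

Note that filtering keeps the original row labels, so the remaining index here is 0, 1, 3 rather than 0, 1, 2.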

Remember we loaded in two files. Let's explore the fields of majors and check for bad records:

It looks like the numbers represent semesters, G represents graduate students, and U might represent something else, maybe campus visitors. But we do still have a bad record:

Detail: The deleted majors record (number 683) has a different row index than the bad names record (number 146). So while the number of records in each table matches, the row indices don't match, so we'll have to keep these tables separate in order to do our analysis.

Summarizing the Data

We will often want to numerically or visually summarize the data. The describe() method provides a brief high level description of our data frame.

Q: What do you think top and freq represent?

A: top: most frequent entry, freq: the frequency of that entry
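To see `top` and `freq` in action, here is a small sketch on made-up data. The column names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical majors table with text-only columns.
majors = pd.DataFrame({
    "Majors": ["Data Science", "Computer Science", "Data Science", "Economics"],
    "Terms in Attendance": ["4", "6", "G", "4"],
})

# For text columns, describe() reports count, unique, top, and freq.
summary = majors.describe()
print(summary)
```

Here "Data Science" appears twice, so it is reported as `top` with a `freq` of 2.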





What are your majors?

What are the top majors:

We will often use visualizations to make sense of data

In DS100 we will deal with many different kinds of data (not just numbers) and we will study techniques for working with diverse types of data.

How can we summarize the Majors field? A good starting point might be to use a bar plot:

What year are you?










Diversity and Data Science:

Unfortunately, surveys of data scientists suggest that there are far fewer women than men in data science:

To learn more check out the Kaggle Executive Summary or study the Raw Data.










What fraction of the students are female?

I actually get asked this question a lot as we try to improve the data science program at Berkeley.

This is actually a fairly complex question. What do we mean by female? Is this a question about the sex or gender identity of the students? They are not the same thing.

Most likely, my colleagues are interested in improving gender diversity, by ensuring that our program is inclusive. Let's reword this question:

Reworded: What is the gender diversity of our students?




How could we answer this question?













We don't have the data.

Where can we get the data?







(1) We could run a survey!








(2) ... or we could try to use the data we have to estimate the _sex_ of the students as a proxy for gender?!?!

Please do not attempt option (2) alone. What I am about to do is flawed in so many ways and we will discuss these flaws in a moment and throughout the semester.

However, it will illustrate some very basic inferential modeling and how we might combine multiple data sources to try and reason about something we haven't measured.

To attempt option (2), we will first look at a second data source.









US Social Security Data

To study what a name tells us about a person, we will download data from the United States Social Security Administration containing the number of registered names broken down by year, sex, and name. This is often called the Baby Names data, since Social Security numbers (SSNs) are typically assigned at birth.

1. What does a name tell us about a person?

A: In this demo we'll use a person's name to estimate their sex. But a person's name tells us many things (more on this later).

2. Acquire data programmatically

Note 1: In the following we download the data programmatically to ensure that the process is reproducible.

Note 2: We also load the data directly into Python without decompressing the zip file.
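A sketch of reading CSV files straight out of a zip archive, without extracting it to disk. To keep it runnable offline, it builds a tiny in-memory archive with made-up counts; the `yobYYYY.txt` file naming and the `Name, Sex, Count` column order are assumptions about the archive layout.

```python
import io
import zipfile

import pandas as pd

# Build a tiny in-memory zip so the sketch runs offline (counts are made up).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("yob1990.txt", "Jessica,F,10\nMichael,M,12\n")
    zf.writestr("yob1991.txt", "Ashley,F,11\nMichael,M,9\n")
buffer.seek(0)

frames = []
with zipfile.ZipFile(buffer) as zf:
    for filename in zf.namelist():
        if not filename.endswith(".txt"):
            continue  # skip any non-data files in the archive
        # zf.open gives a file-like object that read_csv can consume directly.
        with zf.open(filename) as fh:
            df = pd.read_csv(fh, header=None, names=["Name", "Sex", "Count"])
        df["Year"] = int(filename[3:7])  # assumed naming: yobYYYY.txt
        frames.append(df)

babynames = pd.concat(frames, ignore_index=True)
print(babynames)
```

With a real download, the bytes of the HTTP response could take the place of `buffer`, so nothing needs to be unpacked by hand.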

In DS100 we will think a bit more about how we can be efficient in our data analysis to support processing large datasets.




2 (cont). Understanding the Setting

In Data 100 you will have to learn about different data sources (and their limitations) on your own.

Reading from the SSN office's description (bolded for readability):

All names are from Social Security card applications for births that occurred in the United States after 1879. Note that many people born before 1937 never applied for a Social Security card, so their names are not included in our data. For others who did apply, our records may not show the place of birth, and again their names are not included in our data.

To safeguard privacy, we exclude from our tabulated lists of names those that would indicate, or would allow the ability to determine, names with fewer than 5 occurrences in any geographic area. If a name has less than 5 occurrences for a year of birth in any state, the sum of the state counts for that year will be less than the national count.

All data are from a 100% sample of our records on Social Security card applications as of March 2021.

A little bit of data cleaning

Examining the data:

In our earlier analysis we converted names to lower case. We will do the same again here:

3. Exploratory Data Analysis (and Visualization)

How many people does this data represent?

Q: Is this number low or high?

Answer

It seems low (the 2020 US population was 329.5 million). However, the Social Security website states:

All names are from Social Security card applications for births that occurred in the United States after 1879. Note that many people born before 1937 never applied for a Social Security card, so their names are not included in our data. For others who did apply, our records may not show the place of birth, and again their names are not included in our data. All data are from a 100% sample of our records on Social Security card applications as of the end of February 2016.

Trying a simple query using query():

Trying a more complex query using query() (to be discussed next week):
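For reference, a small sketch of both kinds of query on made-up data. The column names `Name`, `Sex`, `Count`, and `Year` are assumed here.

```python
import pandas as pd

# Hypothetical baby names table (counts are made up).
babynames = pd.DataFrame({
    "Name": ["jessica", "michael", "ashley", "michael", "jessica"],
    "Sex": ["F", "M", "F", "M", "M"],
    "Count": [10, 12, 11, 9, 1],
    "Year": [1990, 1990, 1991, 1991, 1991],
})

# Simple query: a single condition written as a string expression.
jessicas = babynames.query('Name == "jessica"')

# More complex query: several conditions combined with `and`.
common_boys_1991 = babynames.query('Sex == "M" and Year >= 1991 and Count > 5')

print(jessicas)
print(common_boys_1991)
```

The string passed to `query()` refers to columns by name, which often reads more naturally than chained boolean masks.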

Temporal Patterns Conditioned on Male/Female

In DS100 we will study how to visualize and analyze relationships in data.

In this example we construct a pivot table which aggregates the number of babies registered for each year by Sex.

We'll discuss pivot tables in detail next week.
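Until then, here is a minimal sketch of such a pivot table on made-up data, assuming columns named `Year`, `Sex`, and `Count`:

```python
import pandas as pd

# Hypothetical baby names table (counts are made up).
babynames = pd.DataFrame({
    "Name": ["jessica", "michael", "ashley", "michael", "jessica"],
    "Sex": ["F", "M", "F", "M", "M"],
    "Count": [10, 12, 11, 9, 1],
    "Year": [1990, 1990, 1991, 1991, 1991],
})

# One row per Year, one column per Sex, each cell the total Count.
year_sex = pd.pivot_table(
    babynames, index="Year", columns="Sex", values="Count", aggfunc="sum"
)
print(year_sex)
```

Each cell answers "how many babies of this sex were registered in this year", which is the shape we want for plotting trends over time.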

We can visualize these descriptive statistics:

How many unique names for each year?

Some observations:

  1. Registration data seems limited in the early 1900s, because many people born before 1937 never applied for a Social Security card.
  2. You can see the baby boomers (born 1940s-1960s) and the Echo Boomers (aka millennials, 1980s to 2000).
  3. Female names show slightly greater diversity than male names.

4. Understand the World: Prediction and Inference

Let's use the Baby Names dataset to estimate the fraction of female students in the class.

Compute the Proportion of Female Babies For Each Name

First, we construct a pivot table to compute the total number of babies registered for each Name, broken down by Sex.

Second, we compute the proportion of female babies for each name. This is our estimated probability that the baby is Female:

$$ \hat{\textbf{P}\hspace{0pt}}(\text{Female} \,\,\, | \,\,\, \text{Name} ) = \frac{\textbf{Count}(\text{Female and Name})}{\textbf{Count}(\text{Name})} $$
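A sketch of the two steps above on made-up counts, assuming columns named `Name`, `Sex`, and `Count`. The `fill_value=0` is a choice made here so that names with no records for one sex get a count of zero instead of a missing value.

```python
import pandas as pd

# Hypothetical totals (made up for illustration).
babynames = pd.DataFrame({
    "Name": ["jessica", "jessica", "michael", "taylor", "taylor"],
    "Sex": ["F", "M", "M", "F", "M"],
    "Count": [90, 10, 100, 50, 50],
})

# Step 1: total count for each Name, broken down by Sex.
sex_counts = pd.pivot_table(
    babynames, index="Name", columns="Sex", values="Count",
    aggfunc="sum", fill_value=0,
)

# Step 2: Count(Female and Name) / Count(Name).
prop_female = sex_counts["F"] / (sex_counts["F"] + sex_counts["M"])
print(prop_female)
```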

Test a few names

Next, Build a Simple Classifier (Model)

We can define a function to return the most likely Sex for a name. If there is an exact tie, or the name does not appear in the Social Security dataset, the function returns Unknown.
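One way such a function might look, as a sketch. The name `sex_from_name` and the lookup table `prop_female` (a Series of female proportions indexed by lower-cased name) are hypothetical, with made-up values.

```python
import pandas as pd

# Hypothetical proportions of female babies per name (made up).
prop_female = pd.Series({"jessica": 0.9, "michael": 0.0, "taylor": 0.5})

def sex_from_name(name):
    """Return "F", "M", or "Unknown" for a given first name."""
    name = name.lower()
    if name not in prop_female.index:
        return "Unknown"      # name is absent from the dataset
    p = prop_female[name]
    if p > 0.5:
        return "F"
    if p < 0.5:
        return "M"
    return "Unknown"          # exact tie

print(sex_from_name("Jessica"), sex_from_name("taylor"), sex_from_name("zzyzx"))
```

Note that this rule throws away how confident we are: a name that is 51% female and one that is 99% female get the same label.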

4 (cont). Estimating the fraction of female and male students in DS100

Let's try out our simple classifier! We'll use the apply() method to classify each student name:

Interpreting the unknowns

That's a lot of Unknowns.

...But we can still estimate the fraction of female students in the class:
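A sketch of two ways to turn predicted labels into a fraction, using made-up predictions. Which denominator is appropriate is a judgment call: dropping the Unknowns assumes they look like the students we could classify, which may well be false.

```python
import pandas as pd

# Hypothetical classifier output for eight students.
predicted = pd.Series(["F", "M", "Unknown", "F", "M", "M", "Unknown", "F"])
counts = predicted.value_counts()

# Option A: ignore Unknowns entirely.
frac_female_known = counts["F"] / (counts["F"] + counts["M"])

# Option B: keep Unknowns in the denominator (a lower bound on the fraction).
frac_female_all = counts["F"] / len(predicted)

print(frac_female_known, frac_female_all)
```

The gap between the two numbers is one crude indication of how much the Unknowns could move the estimate.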

Questions:

  1. How do we feel about this estimate?
  2. Do we trust it?




Q: What fraction of students in Data 100 this semester have names in the SSN dataset?

Q: Which names are not in the dataset?

Why might these names not appear?

Using simulation to estimate uncertainty

Previously we treated a name which is given to females 40% of the time as a "Male" name, because the probability was less than 0.5. This doesn't capture our uncertainty.

We can use simulation to provide a better distributional estimate. We'll use 50% for names not in the Baby Names dataset.
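A minimal sketch of one simulated draw, assuming a lookup Series `prop_female` indexed by lower-cased name (values here are made up). Each student is marked Female with probability equal to their name's proportion, and names missing from the lookup get 0.5.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed so the sketch is repeatable

# Hypothetical proportions and student names (made up).
prop_female = pd.Series({"jessica": 0.9, "michael": 0.0, "taylor": 0.5})
students = pd.Series(["jessica", "michael", "taylor", "zzyzx"])

# Look up each student's probability; unseen names default to 0.5.
p = students.map(prop_female).fillna(0.5)

# One uniform draw per student: Female if the draw falls below p.
simulated_female = rng.random(len(p)) < p
frac_female = simulated_female.mean()
print(frac_female)
```

Unlike the hard classifier, a name that is 40% female now contributes a Female draw about 40% of the time.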

Running the simulation

Given such a simulation, we can compute the fraction of the class that is female.

  1. How do we feel about this new estimate?
  2. Do we trust it?

Now that we're performing a simulation, the above proportion is random: it depends on the random numbers we picked to determine whether a student was Female.

Let's run the above simulation several times and see what the distribution of this Female proportion is. The below cell may take a few seconds to run.
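A sketch of repeating the simulation many times, using made-up per-student probabilities. The helper name `simulate_fraction` and the choice of 1000 repetitions are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical per-student probabilities of being Female (made up).
p = np.array([0.9, 0.0, 0.5, 0.5, 0.2])

def simulate_fraction(p, rng):
    """Draw one simulated class and return its fraction of Female students."""
    return (rng.random(len(p)) < p).mean()

# Repeat the simulation to see how much the fraction varies between runs.
fractions = np.array([simulate_fraction(p, rng) for _ in range(1000)])

print(fractions.mean())
print(np.percentile(fractions, [2.5, 97.5]))  # middle 95% of simulated values
```

The spread of `fractions` reflects only the randomness of the draws; it says nothing about whether the probabilities themselves are right for our students.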

In DS100 we will learn about kernel density estimates, rug plots, and other visualization techniques.





Limitations of Baby Names dataset

UC Berkeley teaches students from around the world.

We saw with our Simple Classifier that many student names were classified as "Unknown," often because they weren't in the SSN Baby Names Dataset.

Recall the SSN dataset:

All names are from Social Security card applications for births that occurred in the United States after 1879.

That statement is not reflective of all of our students!!

Names change over time.

Using data from 1879 (or even 1937) does not represent the diversity and context of U.S. baby names today.

Here are some choice names to show you how the distribution of particular names has varied with time:

Bonus: How we selected which names to plot

Curious as to how we got the above names? We picked out two types of names:

Check it out: