We will be using a wide range of different Python software packages. To install and manage these packages we will be using the Conda environment manager. The following is a list of packages we will routinely use in lectures and homeworks:
# Linear algebra, probability
import numpy as np
# Data manipulation
import pandas as pd
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Interactive visualization library
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.figure_factory as ff
import plotly.express as px
We will learn how to use all of the technologies used in this demo.
For now, just sit back and think critically about the data and our guided analysis.
This is a pretty vague question but let's start with the goal of learning something about the students in the class.
Here are some "simple" questions:
In Data 100 we will study various methods to collect data.
To answer this question, I downloaded the course roster and extracted everyone's names and majors.
# pd stands for pandas, which we will learn starting from next lecture
# Some pandas syntax shared with data8's datascience package
majors = pd.read_csv("data/majors-sp24.csv")
names = pd.read_csv("data/names-sp24.csv")
In Data 100 we will study exploratory data analysis and practice analyzing new datasets.
I didn't tell you the details of the data! Let's check out the data and infer its structure. Then we can start answering the simple questions we posed.
# Let's peek at the first 20 rows of the majors dataframe
majors.head(20)
| Majors | Terms in Attendance | |
|---|---|---|
| 0 | Electrical Eng & Comp Sci BS | 8.0 | 
| 1 | Data Science BA (Subplan: Robotics) | 6.0 | 
| 2 | Electrical Eng & Comp Sci MEng | G | 
| 3 | Letters & Sci Undeclared UG | 4.0 | 
| 4 | Molecular & Cell Biology BA (Subplan: Biochem,... | 6.0 | 
| 5 | Civil Engineering BS | 8.0 | 
| 6 | Computer Science BA | 4.0 | 
| 7 | Political Economy BA | 6.0 | 
| 8 | Letters & Sci Undeclared UG | 8.0 | 
| 9 | Art BA | 4.0 | 
| 10 | Chemical Biology BS | 6.0 | 
| 11 | Letters & Sci Undeclared UG | 4.0 | 
| 12 | Chemical Engineering PhD | G | 
| 13 | Computer Science BA | 4.0 | 
| 14 | Computer Science BA | 4.0 | 
| 15 | Environmental Sciences BS | 2.0 | 
| 16 | Letters & Sci Undeclared UG | 4.0 | 
| 17 | Letters & Sci Undeclared UG | 2.0 | 
| 18 | Letters & Sci Undeclared UG | 6.0 | 
| 19 | Cognitive Science BA | 4.0 | 
# Let's peek at the first 5 rows (default) of the names dataframe
names.head()
| Name | |
|---|---|
| 0 | Emily | 
| 1 | Andrew | 
| 2 | Andrew | 
| 3 | NATHAN | 
| 4 | Yimo | 
Answer: Some names appear capitalized.
In the above sample we notice that some of the names are capitalized and some are not. This will be an issue in our later analysis so let's convert all names to lower case.
names['Name'] = names['Name'].str.lower()
names.head()
| Name | |
|---|---|
| 0 | emily | 
| 1 | andrew | 
| 2 | andrew | 
| 3 | nathan | 
| 4 | yimo | 
names dataset¶# Below are the most common, in descending frequency
first_letter = names['Name'].str[0].value_counts()
first_letter.head()
Name a 185 j 136 s 133 m 99 r 65 Name: count, dtype: int64
# Let's visualize this first letter distribution 
plt.bar(first_letter.index, first_letter.values)
plt.xlabel('First Letter')
plt.ylabel('Frequency')
plt.title('First Letter Frequency Distribution')
plt.show()
In the United States, "J" and "A" names are the most popular first initials. Seems like our visualization also reflects this!
name_lengths = names['Name'].str.len()
plt.hist(name_lengths, bins=range(min(name_lengths), max(name_lengths) + 2), edgecolor='black')
plt.xlabel('Name Length')
plt.ylabel('Frequency')
plt.title('Distribution of Length of Names')
average_length = name_lengths.sum() / len(name_lengths)
plt.axvline(average_length, color='red', linestyle='dashed', linewidth=1, label=f'Average: {average_length:.2f}')
plt.legend()
plt.xticks(range(min(name_lengths), max(name_lengths) + 1))
plt.show()
The average length of names in the United States is also around 6 letters!
print(len(names))
print(len(majors))
1276 1276
Based on what we know of our class, each record is most likely a student.
It is important that we understand the meaning of each field and how the data is organized.
names.head()
| Name | |
|---|---|
| 0 | emily | 
| 1 | andrew | 
| 2 | andrew | 
| 3 | nathan | 
| 4 | yimo | 
It appears that one student has an erroneous role given as "#REF!". What else can we learn about this student? Let's see their name.
# Boolean index to find rows where Role is #REF!
names[names['Name'] == "#ref!"]
| Name | |
|---|---|
| 691 | #ref! | 
Though this single bad record won't have much of an impact on our analysis, we can clean our data by removing this record.
names = names[names['Name'] != "#ref!"]
Double check: Let's double check that our record removal only removed the single bad record.
names['Name'].value_counts().to_frame()  # Again, counts of unique Roles
| count | |
|---|---|
| Name | |
| matthew | 13 | 
| ethan | 9 | 
| andrew | 9 | 
| emily | 8 | 
| nathan | 8 | 
| ... | ... | 
| wolfgang | 1 | 
| sukriti | 1 | 
| in | 1 | 
| teresa | 1 | 
| olivia | 1 | 
914 rows × 1 columns
Remember we loaded in two files. Let's explore the fields of majors and check for bad records:
majors dataset¶majors.columns   # Get column names
Index(['Majors', 'Terms in Attendance'], dtype='object')
majors['Terms in Attendance'].value_counts().to_frame()
| count | |
|---|---|
| Terms in Attendance | |
| 4.0 | 539 | 
| 6.0 | 407 | 
| 8.0 | 168 | 
| G | 52 | 
| 2.0 | 43 | 
| 7.0 | 30 | 
| 5.0 | 13 | 
| #REF! | 1 | 
| 3.0 | 1 | 
It looks like numbers represent semesters, G represents graduate students. But we do still have a bad record:
majors[majors['Terms in Attendance'] == "#REF!"]
| Majors | Terms in Attendance | |
|---|---|---|
| 395 | #REF! | #REF! | 
majors = majors[majors['Terms in Attendance'] != "#REF!"]
majors['Terms in Attendance'].value_counts().to_frame()
| count | |
|---|---|
| Terms in Attendance | |
| 4.0 | 539 | 
| 6.0 | 407 | 
| 8.0 | 168 | 
| G | 52 | 
| 2.0 | 43 | 
| 7.0 | 30 | 
| 5.0 | 13 | 
| 3.0 | 1 | 
Detail: The deleted majors record number is different from the record number of the bad names record. So while the number of records in each table matches, the row indices don't match, so we'll have to keep these tables separate in order to do our analysis.
We will often want to numerically or visually summarize the data. The describe() method provides a brief high level description of our data frame.
names.describe()
| Name | |
|---|---|
| count | 1275 | 
| unique | 914 | 
| top | matthew | 
| freq | 13 | 
Q: What do you think top and freq represent?
Answer: top: most frequent entry, freq: the frequency of that entry
majors.describe()
| Majors | Terms in Attendance | |
|---|---|---|
| count | 1275 | 1253 | 
| unique | 259 | 8 | 
| top | Letters & Sci Undeclared UG | 4.0 | 
| freq | 341 | 539 | 
What are the top majors:
majors_count = (       # Method chaining in pandas
    majors['Majors']
    .value_counts()
    .sort_values(ascending=False) # Highest first
    .to_frame()
    .head(20)          # Get the top 20
)
majors_count
| count | |
|---|---|
| Majors | |
| Letters & Sci Undeclared UG | 341 | 
| Computer Science BA | 148 | 
| Data Science BA (Subplan: Business/Industrial Analytics) | 50 | 
| Electrical Eng & Comp Sci BS | 43 | 
| Economics BA | 37 | 
| Cognitive Science BA | 33 | 
| Data Science BA (Subplan: Economics) | 32 | 
| Civil Engineering BS | 30 | 
| Data Science BA (Subplan: Applied Mathematics & Modeling) | 23 | 
| Mol Sci & Software Engin MMSSE (Subplan: Part-Time) | 22 | 
| Letters & Sci Undeclared UG (Subplan: Applied HD Data Science) | 19 | 
| Economics BA, Minor: Data Science UG | 17 | 
| Applied Mathematics BA (Subplan: Data Science) | 15 | 
| Industrial Eng & Ops Rsch BS | 15 | 
| Data Science BA (Subplan: Cognition) | 14 | 
| Data Science BA (Subplan: Robotics) | 12 | 
| Electrical Eng & Comp Sci BS, Minor: Data Science UG | 11 | 
| Bioengineering BS | 10 | 
| Applied Mathematics BA | 9 | 
| Business Administration BS | 8 | 
In Data 100, we will deal with many different kinds of data (not just numbers) and we will study techniques to describe types of data.
How can we summarize the Majors field? A good starting point might be to use a bar plot:
# Interactive using plotly
fig = px.bar(majors_count.loc[::-1], orientation='h')
fig.update_layout(showlegend=False, 
                  xaxis_title='Count', 
                  yaxis_title='Major',
                  autosize=False, 
                  width=800, 
                  height=500)
fig = px.histogram(majors['Terms in Attendance'].sort_values(),
                   histnorm='probability')
fig.update_layout(showlegend=False,
                  xaxis_title="Term",
                  yaxis_title="Fraction of Class",
                  autosize=False, 
                  width=800, 
                  height=250)
# Replacing terms in attendance data with the degree objective
majors.loc[majors.loc[:, 'Terms in Attendance'] != 'G', 'Terms in Attendance'] = 'Undergraduate'
majors.loc[majors.loc[:, 'Terms in Attendance'] == 'G', 'Terms in Attendance'] = 'Graduate'
majors.rename(columns={'Terms in Attendance': 'Ungrad Grad'}, inplace=True)
majors.describe()
| Majors | Ungrad Grad | |
|---|---|---|
| count | 1275 | 1275 | 
| unique | 259 | 2 | 
| top | Letters & Sci Undeclared UG | Undergraduate | 
| freq | 341 | 1223 | 
What is the ratio between graduate and undergraduate students in Data 100, and how does it compare with campus distribution?
What is the proportion of different majors in Data 100, and how does it compare with historical campus trends?
We often ask this question because we want to improve the data science program here in Berkeley, especially since it has now grown into a new college—College of Computing, Data Science, and Society—Berkeley's first new college in 50 years.
print(majors.columns)
print(names.columns)
Index(['Majors', 'Ungrad Grad'], dtype='object') Index(['Name'], dtype='object')
Note 1: In the following, we download the data programmatically to ensure that the process is reproducible.
Note 2: We also load the data directly into Python.
In Data 100 we will think a bit more about how we can be efficient in our data analysis to support processing large datasets.
url = "https://docs.google.com/spreadsheets/d/1J7tz3GQLs3M6hFseJCE9KhjVhe4vKga8Q2ezu0oG5sQ/gviz/tq?tqx=out:csv"
university_majors = pd.read_csv(url, 
                                usecols = ['Academic Yr', 'Semester', 'Ungrad Grad', 
                                           'Entry Status', 'Major Short Nm', 'Student Headcount'])
# Examining the data
university_majors
| Academic Yr | Semester | Ungrad Grad | Entry Status | Major Short Nm | Student Headcount | |
|---|---|---|---|---|---|---|
| 0 | 2013-14 | Fall | Graduate | Graduate | Education | 327 | 
| 1 | 2013-14 | Fall | Graduate | Graduate | Special Education | 14 | 
| 2 | 2013-14 | Fall | Graduate | Graduate | Science & Math Education | 16 | 
| 3 | 2013-14 | Fall | Graduate | Graduate | Chemical Engineering | 132 | 
| 4 | 2013-14 | Fall | Graduate | Graduate | Chemistry | 404 | 
| ... | ... | ... | ... | ... | ... | ... | 
| 7278 | 2022-23 | Spring | Undergraduate | Transfer Entrant | Nut Sci-Physio & Metabol | 20 | 
| 7279 | 2022-23 | Spring | Undergraduate | Transfer Entrant | Nutritional Sci-Dietetics | 3 | 
| 7280 | 2022-23 | Spring | Undergraduate | Transfer Entrant | Nutritional Sci-Toxicology | 3 | 
| 7281 | 2022-23 | Spring | Undergraduate | Transfer Entrant | Genetics & Plant Biology | 10 | 
| 7282 | 2022-23 | Spring | Undergraduate | Transfer Entrant | Microbial Biology | 49 | 
7283 rows × 6 columns
The data is reported on a semester basis. We will aggregate data across different semesters in a year by taking average of Fall and Spring semester enrollment information.
# Reporting student data based on academic year
university_majors = (university_majors.groupby(
    ['Academic Yr', 'Ungrad Grad', 'Entry Status', 'Major Short Nm'], as_index = False)[["Student Headcount"]]
                     .mean()
                    )
university_majors
| Academic Yr | Ungrad Grad | Entry Status | Major Short Nm | Student Headcount | |
|---|---|---|---|---|---|
| 0 | 2013-14 | Graduate | Graduate | African American Studies | 30.0 | 
| 1 | 2013-14 | Graduate | Graduate | Ag & Resource Economics | 71.5 | 
| 2 | 2013-14 | Graduate | Graduate | Anc Hist & Medit Archae | 15.0 | 
| 3 | 2013-14 | Graduate | Graduate | Anthropology | 88.5 | 
| 4 | 2013-14 | Graduate | Graduate | Applied Mathematics | 16.5 | 
| ... | ... | ... | ... | ... | ... | 
| 3735 | 2022-23 | Undergraduate | Transfer Entrant | Spanish and Portuguese | 12.0 | 
| 3736 | 2022-23 | Undergraduate | Transfer Entrant | Statistics | 35.0 | 
| 3737 | 2022-23 | Undergraduate | Transfer Entrant | Sustainable Environ Dsgn | 4.5 | 
| 3738 | 2022-23 | Undergraduate | Transfer Entrant | Theater & Perf Studies | 44.0 | 
| 3739 | 2022-23 | Undergraduate | Transfer Entrant | Urban Studies | 16.0 | 
3740 rows × 5 columns
university_grad_vs_ungrd = (university_majors.groupby(
    ['Academic Yr', 'Ungrad Grad'], as_index = False)[["Student Headcount"]]
                            .sum()
                           )
proportions = university_grad_vs_ungrd.pivot(index='Academic Yr', columns='Ungrad Grad', values='Student Headcount')
proportions['Total'] = proportions['Undergraduate'] + proportions['Graduate']
proportions['Undergrad Proportion'] = proportions['Undergraduate'] / proportions['Total']
proportions['Grad Proportion'] = proportions['Graduate'] / proportions['Total']
fig = px.bar(proportions.reset_index(), 
             x='Academic Yr', 
             y=['Undergraduate', 'Graduate'],
             title='Number of Grad vs. Undergrad Students',
             labels={'value': 'Number of Students'},
             color_discrete_map={'Undergraduate': 'blue', 'Graduate': 'orange'})
fig.update_layout(barmode='relative', autosize=False, width=800, height=600)
fig.show()
data100_grad = majors['Ungrad Grad'].loc[majors['Ungrad Grad'] == 'Graduate'].count()
data100_undergrad = majors['Ungrad Grad'].loc[majors['Ungrad Grad'] == 'Undergraduate'].count()
print("Number of graduate students in Data 100: ", data100_grad)
print("Number of undergraduate students in Data 100: ", data100_undergrad)
Number of graduate students in Data 100: 52 Number of undergraduate students in Data 100: 1223
data100_row = {'Graduate':[data100_grad], 
               'Undergraduate':[data100_undergrad], 
               'Total':[data100_grad + data100_undergrad], 
               'Undergrad Proportion':[data100_undergrad / (data100_grad + data100_undergrad)],
               'Grad Proportion':[data100_grad / (data100_grad + data100_undergrad)], 
               }
new_row_df = pd.DataFrame(data100_row)
proportions.loc['Data 100'] = new_row_df.iloc[0]
fig = px.bar(proportions.reset_index(), 
             x='Academic Yr', 
             y=['Undergrad Proportion', 'Grad Proportion'],
             title='Proportions of Grad vs. Undergrad Students',
             labels={'value': 'Proportion'},
             color_discrete_map={'Undergrad Proportion': 'blue', 'Grad Proportion': 'orange'})
fig.update_layout(barmode='relative', autosize=False, width=800, height=600)
fig.show()
data100_top_20_majors = (       # Method chaining in pandas
    majors['Majors']
    .value_counts()
    .sort_values(ascending=False) # Highest first
    .to_frame()
    .head(20)          # Get the top 20
)
major_trends = university_majors.groupby(['Academic Yr', 'Major Short Nm'], 
                                         as_index = False)[["Student Headcount"]].sum()
print("Top 20 majors at Berkeley in 2022-23")
major_trends[major_trends.loc[:, 'Academic Yr'] == '2022-23'].sort_values('Student Headcount', ascending=False).head(20)
Top 20 majors at Berkeley in 2022-23
| Academic Yr | Major Short Nm | Student Headcount | |
|---|---|---|---|
| 1993 | 2022-23 | Letters & Sci Undeclared | 10651.0 | 
| 1983 | 2022-23 | L&S Computer Science | 2102.5 | 
| 1932 | 2022-23 | Electrical Eng & Comp Sci | 2093.0 | 
| 1894 | 2022-23 | Business Administration | 1645.5 | 
| 1928 | 2022-23 | Economics | 1579.5 | 
| 1984 | 2022-23 | L&S Data Science | 1325.5 | 
| 2020 | 2022-23 | Molecular & Cell Biology | 1225.5 | 
| 2011 | 2022-23 | Mechanical Engineering | 1208.0 | 
| 1992 | 2022-23 | Law (JD) | 1023.0 | 
| 1973 | 2022-23 | Info & Data Science-MIDS | 1021.5 | 
| 2042 | 2022-23 | Political Science | 1005.0 | 
| 1948 | 2022-23 | Evening & Weekend MBA | 919.0 | 
| 2043 | 2022-23 | Psychology | 760.0 | 
| 1901 | 2022-23 | Chemistry | 691.0 | 
| 2057 | 2022-23 | Sociology | 663.0 | 
| 1881 | 2022-23 | Architecture | 604.5 | 
| 1900 | 2022-23 | Chemical Engineering | 595.0 | 
| 1889 | 2022-23 | Bioengineering | 576.0 | 
| 1912 | 2022-23 | Cognitive Science | 505.5 | 
| 1879 | 2022-23 | Applied Mathematics | 497.0 | 
print("Top 20 majors at Berkeley since 2013")
major_trends.groupby(['Major Short Nm'], as_index = False)[['Student Headcount']].sum().sort_values('Student Headcount', ascending=False).head(20)
     
Top 20 majors at Berkeley since 2013
| Major Short Nm | Student Headcount | |
|---|---|---|
| 150 | Letters & Sci Undeclared | 101418.0 | 
| 77 | Electrical Eng & Comp Sci | 18431.0 | 
| 137 | L&S Computer Science | 14818.0 | 
| 33 | Business Administration | 14302.5 | 
| 72 | Economics | 14000.0 | 
| 216 | Political Science | 10334.0 | 
| 176 | Mechanical Engineering | 10193.5 | 
| 149 | Law (JD) | 9645.5 | 
| 95 | Evening & Weekend MBA | 7932.5 | 
| 233 | Sociology | 6719.5 | 
| 217 | Psychology | 6533.0 | 
| 40 | Chemical Engineering | 6126.5 | 
| 41 | Chemistry | 5941.5 | 
| 13 | Architecture | 5680.5 | 
| 85 | English | 5520.0 | 
| 187 | Molecular & Cell Biology | 5431.0 | 
| 125 | Info & Data Science-MIDS | 5105.0 | 
| 26 | Bioengineering | 5061.0 | 
| 128 | Integrative Biology | 5060.5 | 
| 211 | Physics | 5055.0 | 
print("Top 20 majors at Berkeley in Data 100")
print(data100_top_20_majors)
Top 20 majors at Berkeley in Data 100
                                                    count
Majors                                                   
Letters & Sci Undeclared UG                           341
Computer Science BA                                   148
Data Science BA (Subplan: Business/Industrial A...     50
Electrical Eng & Comp Sci BS                           43
Economics BA                                           37
Cognitive Science BA                                   33
Data Science BA (Subplan: Economics)                   32
Civil Engineering BS                                   30
Data Science BA (Subplan: Applied Mathematics &...     23
Mol Sci & Software Engin MMSSE (Subplan: Part-T...     22
Letters & Sci Undeclared UG (Subplan: Applied H...     19
Economics BA, Minor: Data Science UG                   17
Applied Mathematics BA (Subplan: Data Science)         15
Industrial Eng & Ops Rsch BS                           15
Data Science BA (Subplan: Cognition)                   14
Data Science BA (Subplan: Robotics)                    12
Electrical Eng & Comp Sci BS, Minor: Data Scien...     11
Bioengineering BS                                      10
Applied Mathematics BA                                  9
Business Administration BS                              8
data100_top_20_majors.index = data100_top_20_majors.index.str.rsplit(' ', n=1).str[0]
print("Top 20 majors at Berkeley in Data 100")
print(data100_top_20_majors)
Top 20 majors at Berkeley in Data 100
                                                    count
Majors                                                   
Letters & Sci Undeclared                              341
Computer Science                                      148
Data Science BA (Subplan: Business/Industrial          50
Electrical Eng & Comp Sci                              43
Economics                                              37
Cognitive Science                                      33
Data Science BA (Subplan:                              32
Civil Engineering                                      30
Data Science BA (Subplan: Applied Mathematics &        23
Mol Sci & Software Engin MMSSE (Subplan:               22
Letters & Sci Undeclared UG (Subplan: Applied H...     19
Economics BA, Minor: Data Science                      17
Applied Mathematics BA (Subplan: Data                  15
Industrial Eng & Ops Rsch                              15
Data Science BA (Subplan:                              14
Data Science BA (Subplan:                              12
Electrical Eng & Comp Sci BS, Minor: Data Science      11
Bioengineering                                         10
Applied Mathematics                                     9
Business Administration                                 8
fig = px.line(major_trends[major_trends["Major Short Nm"].isin(data100_top_20_majors.index)], 
                        x = "Academic Yr", y = "Student Headcount", color = "Major Short Nm")
fig.update_layout(autosize=False, width=800, height=600)
fig.show()    
data100_top_19_majors = data100_top_20_majors.iloc[1:,:]
fig = px.line(major_trends[major_trends["Major Short Nm"].isin(data100_top_19_majors.index)], 
                        x = "Academic Yr", y = "Student Headcount", color = "Major Short Nm")
fig.update_layout(autosize=False, width=800, height=600)
fig.show()