Syllabus

⚠️ This content is archived as of March 2026 and is retained exclusively for reference. Find current offerings.

This schedule is still under development and is subject to change.

Week	Lecture	Date	Topic
1	1	01/17/2017	Course Overview [Gonzalez] In this lecture we define and motivate the study of data science and outline the key ideas covered throughout the class. Lecture Notes Slides (pptx, pdf, pdf 6up) Additional Optional Reading: Chapter 1 from Doing Data Science and Data Science from Scratch Homework 1 Released
1	2	01/19/2017	The Data Science Lifecycle [Gonzalez] In this lecture we introduce the data-science life-cycle and explore each stage by analyzing tweets from the 2016 presidential election. Lecture Notes Slides (pptx, pdf, pdf 6up) SF Food Safety Demo (html, raw, data)
2	3	01/24/2017	Problem Formulation and Experimental Design [Yu] In this lecture we provide an overview of how to formulate hypotheses, identify sources of data, and construct basic experiments to collect data. Lecture Notes Slides (pptx, pdf, pdf 6up) Homework 2 Released Homework 1 Due
2	4	01/26/2017	Data Wrangling [Hellerstein] In this lecture we explore the challenges of data preparation (e.g., assessing, structuring, cleaning, and rolling up data) and the kinds of errors commonly found in the real world. Lecture Notes Slides (pptx, pdf, pdf 6up) Wrangler Software (optional) Additional reading for the curious: Quartz Bad Data Guide; Bad Data Handbook (O’Reilly book, free on berkeley.edu networks); Research Directions in Data Wrangling, Heer et al. 2011; Quantitative Data Cleaning For Large Databases, Hellerstein 2008; Exploratory Data Mining and Cleaning, Dasu and Johnson (book)
3	5	01/31/2017	Exploratory Data Analysis [Nolan] In this lecture we provide an overview of exploratory data analysis (EDA). Lecture Notes Slides (pptx, pdf, pdf 6up) Additional reading for the curious: Exploratory Data Analysis, Tukey 1977 (book); Now You See It, Few 2009 (book)
3	6	02/02/2017	Visualization and Communication [Nolan] This lecture covers how to effectively visualize and communicate complex results to a broader audience. Lecture Notes: Slides (pptx, pdf, pdf 6up)
4	7	02/07/2017	Advanced Python Data Science Tools [Gonzalez] In this lecture we will introduce Pandas, dataframe manipulation, python visualization, and some of the batch oriented philosophy of scalable data processing. Lecture Notes: Summary Slides (pptx, pdf, pdf 6up) Extended Notebook (html, ipynb) Homework 3 Released Homework 2 Due
4	8	02/09/2017	Prediction and Inference [Yu] In this lecture we will explore the key types and challenges of inference and predictions. We will provide an overview of the categories of prediction problems and introduce some of the key machine learning tools in python. Lecture Notes: Slides (pptx, pdf, pdf 6up)
5	9	02/14/2017	Relational Algebra and SQL [Hellerstein] In this lecture we introduce SQL and the relational model. Lecture Notes: Slides (pptx, pdf, pdf 6up).
5	10	02/16/2017	SQL Continued [Hellerstein] In this lecture we will introduce data analysis techniques with a focus on aggregation and summary statistics. Lecture Notes: Slides (continued from last lecture) Extended Notebook: (html no output, ipynb no output, data) Additional resources for the curious CS186 Slides, 2016. PPTX, PDF, Lecture Video 1, Lecture Video 2, Lecture Video 3 PostgreSQL Manual SQLfiddle
6	11	02/21/2017	Advanced SQL [Hellerstein] In this lecture we will cover SQL joins, views, and CTEs, as well as advanced aggregation including order statistics, window functions and user-defined aggregates. Extended Notebook: (html no output, ipynb no output) Gonzalez follow-up notebook (html, ipynb) Homework 4 Released Homework 3 Due
6	12	02/23/2017	Basic Modeling using Statistical Distributions [Nolan] In this lecture we provide an overview of several basic distributions and discuss some of the challenges of working with skewed data. Lecture Notes: Slides (pptx, pdf, pdf 6up).
7	13	02/28/2017	Maximum Likelihood Estimation [Nolan] In this lecture we fit basic models to data by applying the method of maximum likelihood estimation. Slides (pptx, pdf, pdf 6up).
7	14	03/02/2017	Maximum Likelihood Estimation Continued [Nolan] This lecture will continue discussion on the method of maximum likelihood. Slides (pptx, pdf, pdf 6up).
8	15	03/07/2017	Midterm Review [Gonzalez] Slides (pptx, pdf, pdf 6up).
8	16	03/09/2017	Midterm This may change in the weeks before class starts as we adjust the schedule.
9	17	03/14/2017	Least Squares Regression and Hypothesis Testing [Yu] This lecture dives into the details of least squares regression through the lens of empirical risk minimization while discussing some of the key modeling assumptions. Slides (pptx, pdf, pdf 6up). Homework 4 Due Homework 5 Released
9	18	03/16/2017	Least Squares Regression and Hypothesis Testing [Yu] This lecture dives into the details of least squares regression through the lens of empirical risk minimization while discussing some of the key modeling assumptions. Slides (pptx, pdf, pdf 6up). Reading Chapter 3.1
10	19	03/21/2017	Feature Engineering, Over-fitting, and Cross Validation [Gonzalez] In this lecture we will begin to do some machine learning. We will explore how simple linear techniques can be used to address complex non-linear relationships on a wide range of data types. We will start to use scikit-learn to build and visualize models in higher dimensional spaces. We will address a key challenge in machine learning, over-fitting, and discuss how cross-validation can be used to address it. The following interactive (html) notebooks walk through the concepts we use in lecture and are suggested reading materials. Least-Squares Linear Regression: (html, ipynb) Feature Engineering Part 1: (html, ipynb, data) An archive zip file of all notebooks, data, and figures for regression and subsequent over-fitting lectures. Optional reading: Chapter 3.1, 3.2.
10	20	03/23/2017	Feature Engineering, Over-fitting, and Cross Validation Continued [Gonzalez] In this lecture we continue the discussion from the last lecture pushing further into feature engineering. Feature Engineering Part 1: (html, ipynb) Feature Engineering Part 2: (html, ipynb, data) An archive zip file of all notebooks, data, and figures for regression and subsequent over-fitting lectures. Optional reading Chapter 2.1, 2.2
11	21	03/28/2017	Spring Break
11	22	03/30/2017	Spring Break
12	23	04/04/2017	Regularization and the Bias Variance tradeoff [Gonzalez] In this lecture we will continue our exploration of over-fitting and derive the fundamental bias variance tradeoff for the least squares model. We will then introduce the concept of regularization and explore the commonly used L1 and L2 regularization functions. Slides: (pptx, pdf, handout) Interactive Notebook on Cross Validation and the Bias Variance Tradeoff: (html, ipynb) An archive zip file of all notebooks, data, and figures for regression and subsequent over-fitting lectures. An alternative derivation of the Bias Variance Trade-Off provided by Professor Yu (pdf) Homework 5 Due Homework 6 Released
12	24	04/06/2017	Logistic Regression [Gonzalez] In this lecture we will finish our discussion on regularization and begin to study how linear models can be used to build classifiers through logistic regression. Slides: (pptx, pdf, handout) Interactive Notebook on Regularization: (html, ipynb) Interactive Notebook on Logistic Regression: (html, ipynb)
13	25	04/11/2017	Finish Logistic Regression and Start K-Means [Gonzalez and Yu] In this lecture we will finish our discussion on logistic regression and begin to explore unsupervised learning techniques. In particular we will start with K-means and work towards the more general EM algorithm. Part 2 of Logistic Regression Slides: (pptx, pdf, handout) We will continue to use the previous notebook on logistic regression. K-Means Slides: (pptx, pdf, handout) Additional Reading: K-Means Clustering tutorial on scikit-learn
13	26	04/13/2017	Clustering and Expectation Maximization (EM) [Yu] This lecture will continue to cover EM and more general mixed membership clustering techniques. EM and Hierarchical Clustering Slides: pptx, pdf, handout Optional Reading: Silhouette analysis tutorial on scikit-learn.
14	27	04/18/2017	Map-Reduce, Spark, and Big Data [Gonzalez] In this lecture we will introduce the Map-Reduce model of distributed computation and then dive into the Apache Spark Map-Reduce system developed at Berkeley. We will talk about how to use the computational frameworks to scale data processing. Slides: pptx, pdf, handout Notebook demonstrating distributed least squares linear regression in Apache Spark Map-Reduce Cloud Notebook Additional Reading: The Apache Spark programming guide provides a fairly detailed overview of how to use Spark. Be sure to switch the code examples to Python by selecting the Python tab above each code snippet. Python RDD API Python Dataframe API Databricks Cloud Apache Spark tutorial Information about using Databricks Cloud which we will be using for homework. Homework 6 Due Homework 7 Released
14	28	04/20/2017	Guest Lecturer on Data Science and Ethics [Charis Thompson] Slides (ppt, pdf, handout)
15	29	04/25/2017	Finish Discussion on Spark and Classification In the previous lectures we moved quickly through some important concepts in distributed data processing and classification. Because both of these ideas are critical in many data science applications, we will return to the discussion on Spark and review how the relational operators we learned earlier in the class enable scalable distributed computing. We will then return to the topic of classification and review logistic regression and how it can be made to run in a distributed computing environment. Time permitting we will touch on Deep Learning as a generalization of the ideas in logistic regression. Slides: (pptx, pdf, handout) Notebooks: databricks cloud Slides on Gender Bias: (pptx, pdf, handout)
15	30	04/27/2017	PCA and the Berkeley Data Science Major [Nolan and Cathryn Carson] In this lecture we will provide an overview of dimensionality reduction and discuss the PCA method. We will conclude with a discussion from Cathryn Carson on the development and status of the Berkeley Data Science Major. Slides: pdf Homework 7 Due
16	31	05/02/2017	RRR Review [Hellerstein and Yu] This will be part one of a two part exam review lecture to be held during the regular lecture slot. Hellerstein Slides: pptx, pdf, handout. Yu Slides: pptx, pdf, handout Homework 7 Due (optional extension)
16	32	05/04/2017	RRR Review [Gonzalez and Nolan] This will be part two of a two part exam review lecture to be held during the regular lecture slot. Gonzalez Slides: pptx, pdf, handout Nolan Slides: pptx, pdf, handout
17	33	05/11/2017	Final Exam The final exam will be from 3:00 to 6:00 PM on Thursday, May 11, in 100 GPB (Genetics and Plant Biology). For details about exam scheduling visit the Berkeley Exam Calendar.
