This schedule is still under development and is subject to change.
Week | Lecture | Date | Topic |
---|---|---|---|
|
1 | 01/17/2017 |
Course Overview [Gonzalez] In this lecture we define and motivate the study of data science and outline the key ideas covered throughout the class. Lecture Notes
Homework 1 Released |
2 | 01/19/2017 |
The Data Science Lifecycle [Gonzalez] In this lecture we introduce the data science lifecycle and explore each stage by analyzing tweets from the 2016 presidential election. Lecture Notes |
|
|
3 | 01/24/2017 |
Problem Formulation and Experimental Design [Yu] In this lecture we provide an overview of how to formulate hypotheses, identify sources of data, and construct basic experiments to collect data. Lecture Notes
Homework 2 Released; Homework 1 Due |
4 | 01/26/2017 |
Data Wrangling [Hellerstein] In this lecture we explore the challenges of data preparation (e.g., assessing, structuring, cleaning, and rolling up data) and the kinds of errors commonly found in the real world. Lecture Notes
|
|
|
5 | 01/31/2017 |
Exploratory Data Analysis [Nolan] In this lecture we provide an overview of exploratory data analysis (EDA). Lecture Notes |
6 | 02/02/2017 |
Visualization and Communication [Nolan] This lecture covers how to visualize and communicate complex results effectively to a broader audience. Lecture Notes: |
|
|
7 | 02/07/2017 |
Advanced Python Data Science Tools [Gonzalez] In this lecture we will introduce Pandas, dataframe manipulation, Python visualization, and some of the batch-oriented philosophy of scalable data processing. Lecture Notes:
Homework 3 Released; Homework 2 Due |
8 | 02/09/2017 |
Prediction and Inference [Yu] In this lecture we will explore the key types and challenges of inference and prediction. We will provide an overview of the categories of prediction problems and introduce some of the key machine learning tools in Python. Lecture Notes: |
|
|
9 | 02/14/2017 |
Relational Algebra and SQL [Hellerstein] In this lecture we introduce SQL and the relational model. Lecture Notes: |
10 | 02/16/2017 |
SQL Continued [Hellerstein] In this lecture we will introduce data analysis techniques with a focus on aggregation and summary statistics. Lecture Notes:
|
|
|
11 | 02/21/2017 |
Advanced SQL [Hellerstein] In this lecture we will cover SQL joins, views, and CTEs, as well as advanced aggregation including order statistics, window functions, and user-defined aggregates.
Homework 4 Released; Homework 3 Due |
12 | 02/23/2017 |
Basic Modeling using Statistical Distributions [Nolan] In this lecture we provide an overview of several basic distributions and discuss some of the challenges of working with skewed data. Lecture Notes: |
|
|
13 | 02/28/2017 |
Maximum Likelihood Estimation [Nolan] In this lecture we fit basic models to data by applying the method of maximum likelihood estimation. |
14 | 03/02/2017 |
Maximum Likelihood Estimation Continued [Nolan] This lecture will continue the discussion of the method of maximum likelihood. |
|
|
15 | 03/07/2017 |
Midterm Review [Gonzalez] |
16 | 03/09/2017 |
Midterm: This may change in the weeks before class starts as we adjust the schedule. |
|
|
17 | 03/14/2017 |
Least Squares Regression and Hypothesis Testing [Yu] This lecture dives into the details of least squares regression through the lens of empirical risk minimization while discussing some of the key modeling assumptions.
Homework 4 Due; Homework 5 Released |
18 | 03/16/2017 |
Least Squares Regression and Hypothesis Testing [Yu] This lecture dives into the details of least squares regression through the lens of empirical risk minimization while discussing some of the key modeling assumptions.
|
|
|
19 | 03/21/2017 |
Feature Engineering, Over-fitting, and Cross Validation [Gonzalez] In this lecture we will begin to do some machine learning. We will explore how simple linear techniques can be used to address complex non-linear relationships on a wide range of data types. We will start to use scikit-learn to build and visualize models in higher-dimensional spaces. We will address a key challenge in machine learning, over-fitting, and discuss how cross-validation can be used to address it. The following interactive (html) notebooks walk through the concepts we use in lecture and are suggested reading materials. |
20 | 03/23/2017 |
Feature Engineering, Over-fitting, and Cross Validation Continued [Gonzalez] In this lecture we continue the discussion from the last lecture, pushing further into feature engineering. |
|
|
21 | 03/28/2017 |
Spring Break |
22 | 03/30/2017 |
Spring Break |
|
|
23 | 04/04/2017 |
Regularization and the Bias-Variance Tradeoff [Gonzalez] In this lecture we will continue our exploration of over-fitting and derive the fundamental bias-variance tradeoff for the least squares model. We will then introduce the concept of regularization and explore the commonly used L1 and L2 regularization functions.
Homework 5 Due; Homework 6 Released |
24 | 04/06/2017 |
Logistic Regression [Gonzalez] In this lecture we will finish our discussion on regularization and begin to study how linear models can be used to build classifiers through logistic regression. |
|
|
25 | 04/11/2017 |
Finish Logistic Regression and Start K-Means [Gonzalez and Yu] In this lecture we will finish our discussion on logistic regression and begin to explore unsupervised learning techniques. In particular, we will start with K-means and work towards the more general EM algorithm.
Additional Reading:
|
26 | 04/13/2017 |
Clustering and Expectation Maximization (EM) [Yu] This lecture will continue to cover EM and more general mixed membership clustering techniques. Optional Reading:
|
|
|
27 | 04/18/2017 |
Map-Reduce, Spark, and Big Data [Gonzalez] In this lecture we will introduce the Map-Reduce model of distributed computation and then dive into the Apache Spark Map-Reduce system developed at Berkeley. We will talk about how to use these computational frameworks to scale data processing.
Additional Reading:
Homework 6 Due; Homework 7 Released |
28 | 04/20/2017 |
Guest Lecturer on Data Science and Ethics [Charis Thompson] |
|
|
29 | 04/25/2017 |
Finish Discussion on Spark and Classification. In the previous lectures we moved quickly through some important concepts in distributed data processing and classification. Because both of these ideas are critical in many data science applications, we will return to the discussion on Spark and review how the relational operators we learned earlier in the class enable scalable distributed computing. We will then return to the topic of classification and review logistic regression and how it can be made to run in a distributed computing environment. Time permitting, we will touch on Deep Learning as a generalization of the ideas in logistic regression. |
30 | 04/27/2017 |
PCA and the Berkeley Data Science Major [Nolan and Cathryn Carson] In this lecture we will provide an overview of dimensionality reduction and discuss the PCA method. We will conclude with a discussion from Cathryn Carson on the development and status of the Berkeley Data Science Major.
Homework 7 Due |
|
|
31 | 05/02/2017 |
RRR Review [Hellerstein and Yu] This will be part one of a two-part exam review lecture to be held during the regular lecture slot. Homework 7 Due (optional extension) |
32 | 05/04/2017 |
RRR Review [Gonzalez and Nolan] This will be part two of a two-part exam review lecture to be held during the regular lecture slot. |
|
|
33 | 05/11/2017 |
Final Exam: The final exam will be from 3:00 to 6:00 PM on Thursday, May 11, in 100 GPB (Genetics and Plant Biology). For details about exam scheduling, visit the Berkeley Exam Calendar. |