Syllabus

This syllabus is still under development and is subject to change.

Week Lecture Date Topic
1 1 1/16/2018

Course Overview and Review of Python and Probability [Gonzalez]

In this lecture we provide an overview of what it means to be a data scientist by examining recent surveys of data scientists. By exploring the names of students in the class we will review some basic Python tools for data manipulation as well as some of the mathematical concepts that will be required in DS100.

Textbook: Chapter 1

[ pptx | pdf | handout | HTML notebook | notebook.zip | roster data (Berkeley only) | optional reading (Data Science and Science) | screencast ]

2 1/18/2018

Data Design and Sources of Bias [Perez]

Fundamentally, (data) science is the study of using data to learn about the world and solve problems. However, how and what data is collected can have a profound impact on what we can learn and the problems we can solve. In this lecture, we will begin to explore various mechanisms for data collection and their implications for our ability to generalize. In particular we will discuss differences between censuses, surveys, controlled experiments, and observational studies. We will highlight the power of simple randomization and the fallacies of data scale.

Textbook: Chapter 2

[ pptx | pdf | handout | random sampling notes (typed) | screencast ]

2 3 1/23/2018

Data Manipulation using Pandas [Perez]

While data comes in many forms, most data analysis is done on tabular data. Mastering the skills of constructing, cleaning, joining, aggregating, and manipulating tabular data is essential to data science. In this lecture, we will introduce Pandas, the open-source Python data manipulation library widely used by data scientists. In addition to introducing new syntax, we will introduce new concepts including indexes, the role of column operations on system performance, and basic tools to begin visualizing data.

Textbook: Chapter 3

[ pptx | pdf | handout | notebook | notebook(html) | screencast ]

Lab 1 Released: Setup and Notebook

Homework 1 Released: Setup, Prerequisites, and Image Classification

4 1/25/2018

Data Manipulation using Pandas - continued [Perez]

Continuation of the previous lecture - detailed discussion of the structure of a Pandas DataFrame and various access and grouping operations. Notebooks and slides from the previous lecture remain relevant.
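As a rough illustration of the access and grouping operations mentioned above, here is a minimal sketch. The table, its column names, and its values are invented for this example and are not taken from the lecture notebooks.

```python
import pandas as pd

# Hypothetical toy table; names and counts are invented for illustration.
babies = pd.DataFrame({
    "name": ["Ava", "Ava", "Liam", "Liam", "Noah"],
    "year": [2016, 2017, 2016, 2017, 2017],
    "count": [10, 12, 7, 9, 4],
})

# Access: label-based selection with .loc and a boolean mask.
recent = babies.loc[babies["year"] == 2017, ["name", "count"]]

# Grouping: split rows by name, then aggregate each group's counts.
totals = babies.groupby("name")["count"].sum()

# Pivot: one row per name, one column per year; missing cells become 0.
wide = babies.pivot_table(index="name", columns="year",
                          values="count", aggfunc="sum", fill_value=0)

print(totals)
print(wide)
```

The grouped result is a Series indexed by name, while the pivot table is a DataFrame indexed by name with one column per year.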

Textbook: Chapter 3

[ group/pivot pptx | group/pivot pdf | groupby notebook | groupby notebook (html version) | screencast ]

Vitamin 1 Released

3 5 1/30/2018

Data Cleaning and EDA [Gonzalez]

Whether collected by you or obtained from someone else, raw data is seldom ready for immediate analysis. Through exploratory data analysis we can often discover important anomalies, identify limitations in the collection process, and better inform subsequent goal-oriented analysis. In this lecture we will discuss how to identify and correct common data anomalies and their implications for future analysis. We will also discuss key properties of data including structure, granularity, faithfulness, temporality, and scope and how these properties can inform how we prepare, analyze, and visualize data.

Textbook: Chapter 4

[ pptx | pdf | handout | Advanced Pandas Notebook | Advanced Pandas Notebook (HTML Version) | Python utilities | Piazza Thread | screencast ]

Lab 2 Released: Pandas Overview

Homework 2 Released: Food Safety Data Cleaning and EDA

6 2/1/2018

EDA and Visualization [Gonzalez and Perez]

In this lecture we will continue our discussion of EDA and start to work through a real-world exercise in EDA using public crime data for the city of Berkeley. We will also begin to introduce tools for data visualization using Pandas, Seaborn, and Matplotlib.

Textbook: Chapter 5

[ pptx | pdf | handout | Advanced Pandas Notebook | Advanced Pandas Notebook (HTML Version) | EDA and Data Cleaning Notebook | EDA and Data Cleaning Notebook (HTML Version) | Piazza Thread | screencast ]

Vitamin 2 Released

4 7 2/6/2018

Visualization and Data Transformations [Perez]

A large fraction of the human brain is devoted to visual perception. As a consequence, visualization is a critical tool in both exploratory data analysis and communicating complex relationships in data. However, making informative and clear visualizations of complex concepts can be challenging. In this lecture, we explore good and bad visualizations and describe how to choose visualizations for various kinds of data and goals. Directly visualizing data can result in less informative plots for several reasons: curvilinear relationships can be difficult to assess; large numbers of observations can hide core features; and it can be difficult to visualize large numbers of variables. In this lecture, we discuss techniques of data transformation, smoothing, and dimensionality reduction to address challenges in creating informative visualizations. With these additional analytics we can often reveal important and informative patterns in data. We pick up with transformations.

Textbook: Chapter 6

[ intro matplotlib notebook (HTML Version) | advanced matplotlib notebook (HTML Version) | code | Piazza Thread | screencast ]

Lab 3 Released: Plotting

Homework 3 Released: EDA of Bike Sharing

8 2/8/2018

Visualization and Data Transformations (continued) [Perez]

We will finalize our discussion of data visualization principles and techniques.

Textbook: Chapter 6

[ 08-visualization (PPTX) | 08-figures notebook | 08-figures notebook (HTML Version) | 08-matplotlib-beyond-basics notebook | 08-matplotlib-beyond-basics notebook (HTML Version) | 08-matplotlib-common-plots notebook | 08-matplotlib-common-plots notebook (HTML Version) | data | Piazza Thread | screencast ]

Vitamin 3 Released

5 9 2/13/2018

Web Technologies [Gonzalez]

Data are available on the Internet - they can be embedded in web pages, provided after a form submission, and accessible through a REST API. We can access data from these various sources using HTTP - the HyperText Transfer Protocol. We will cover HTTP at a high level, and provide examples of how to use it to access web pages, submit forms, and create REST requests.
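To make the difference between fetching a page and submitting a form concrete, the sketch below builds (but does not send) two HTTP requests with Python's standard library. The endpoint URL and parameter names are hypothetical placeholders, not a real API used in the course.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and parameters, used only for illustration.
BASE_URL = "https://api.example.com/v1/search"
params = {"q": "data science", "count": 5}

# A GET request carries its parameters in the URL query string.
get_request = Request(BASE_URL + "?" + urlencode(params), method="GET")

# A form submission is typically a POST with the fields in the request body.
post_request = Request(
    BASE_URL,
    data=urlencode(params).encode("utf-8"),
    headers={"Content-Type": "application/x-www-form-urlencoded"},
    method="POST",
)

# Nothing is sent until a request is passed to urllib.request.urlopen().
print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.data)
```

Keeping request construction separate from sending makes it easy to inspect exactly what would go over the wire before contacting a server.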

[ PPTX | PDF | Handout | All Code (zip) | Kernel Smoothing | Pandas and HTML | Working with JSON | Web Requests | Calling Twitter REST APIs | Piazza Thread | screencast ]

Lab 4 Released: Plotting, Smoothing, Transformation

Project 1 Released: Twitter Analysis

10 2/15/2018

Working with Text [Sam Lau (Guest Lecturer)]

Whether in documents, tweets, or records in a table, text data is ubiquitous and presents a unique set of challenges for data scientists. How do you extract key phrases from text? What are meaningful aggregate summaries of text? How do you visualize textual data? In this lecture we will introduce a set of techniques (e.g., bag-of-words) to transform text into numerical data and subsequent tabular analysis. We will also introduce regular expressions as a mechanism for cleaning and transforming text data.
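A minimal sketch of the bag-of-words idea, using a regular expression for tokenization. The two documents are invented, and the tokenizer (lowercase runs of letters) is one simple assumption among many possible choices.

```python
import re
from collections import Counter

# Invented example documents.
docs = [
    "Data science is fun!",
    "Is data cleaning fun? Data cleaning is essential.",
]

def tokenize(text):
    """Lowercase the text and extract runs of letters with a regular expression."""
    return re.findall(r"[a-z]+", text.lower())

# Bag-of-words: each document becomes a word -> count mapping.
bags = [Counter(tokenize(doc)) for doc in docs]

# A shared, sorted vocabulary turns the bags into rows of a numeric table.
vocab = sorted(set().union(*bags))
table = [[bag[word] for word in vocab] for bag in bags]

print(vocab)
for row in table:
    print(row)
```

Each row of the table is one document and each column is one vocabulary word, which is the tabular form that later analysis can work with.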

[ keynote | pdf | handout | notebook (ipynb) | notebook (html) | data | Piazza Thread | screencast ]

Vitamin 4 Released

6 11 2/20/2018

Finish REST and Start Relational Databases and SQL [Gonzalez]

Much of the important data in the world is stored in relational database management systems. In this lecture we will introduce the key concepts in relational databases including the relational data model, basic schema design, and data independence. We will then begin to dig into the SQL language for accessing and manipulating relational data.

[ pptx | pdf | handout | SQL Part 1 (html) | Code | W3C Tutorial | Piazza Thread | screencast ]

12 2/22/2018

More Advanced SQL [Biye Jiang (Guest Lecturer)]

In this lecture we review more advanced SQL queries including joins and common table expressions and discuss how to combine computation in a database with Python.
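The sketch below shows a join combined with a common table expression, driven from Python. The lecture materials use PostgreSQL; this example substitutes Python's built-in sqlite3 module so that it runs without a database server, and the tables and values are invented.

```python
import sqlite3

# In-memory database with two invented tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE donors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE donations (donor_id INTEGER, amount REAL);
    INSERT INTO donors VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Alan');
    INSERT INTO donations VALUES (1, 50.0), (1, 25.0), (2, 500.0);
""")

# The CTE computes per-donor totals; the LEFT JOIN keeps donors with no donations.
query = """
    WITH totals AS (
        SELECT donor_id, SUM(amount) AS total
        FROM donations
        GROUP BY donor_id
    )
    SELECT d.name, COALESCE(t.total, 0) AS total
    FROM donors AS d
    LEFT JOIN totals AS t ON t.donor_id = d.id
    ORDER BY d.id
"""
rows = conn.execute(query).fetchall()
conn.close()

for name, total in rows:
    print(name, total)
```

The aggregation happens inside the database, and only the small summarized result is pulled into Python.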

[ pptx | pdf | handout | PostgreSQL for Mac | PostgreSQL for Windows | PostgreSQL for Linux | Notebook HTML | Notebook ipynb | Loading the SQL Dump | Piazza Thread | screencast ]

Vitamin 5 Released

7 13 2/27/2018

Modeling and Estimation [Perez]

How do we pick a number to represent a dataset? A key step in data science is developing models that capture the essential signal in data while providing insight into the phenomena that govern the data and enable effective prediction. In this lecture we address the fundamental question of how to choose a number and more generally a model that reflects the data. We will introduce the concept of loss functions and begin to develop basic models.
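One way to see how a loss function determines "the number" is to minimize two common losses over a grid of candidates. This is a minimal sketch with invented data; the grid search is only for illustration and is not necessarily how the lecture develops the result.

```python
from statistics import mean, median

# Invented sample; the value 20 is a deliberate outlier.
data = [2, 3, 3, 5, 20]

def avg_squared_loss(c, ys):
    """Average squared distance between the summary c and the data."""
    return sum((y - c) ** 2 for y in ys) / len(ys)

def avg_absolute_loss(c, ys):
    """Average absolute distance between the summary c and the data."""
    return sum(abs(y - c) for y in ys) / len(ys)

# Search a fine grid of candidate summaries between the min and max of the data.
grid = [i / 100 for i in range(200, 2001)]
best_squared = min(grid, key=lambda c: avg_squared_loss(c, data))
best_absolute = min(grid, key=lambda c: avg_absolute_loss(c, data))

print(best_squared, mean(data))     # squared loss is minimized at the mean
print(best_absolute, median(data))  # absolute loss is minimized at the median
```

The outlier pulls the squared-loss minimizer well above most of the data, while the absolute-loss minimizer stays at the median.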

Textbook: Chapter 10

[ 13-modeling-and-estimation (PPTX) | pdf | handout | Estimation notebook (HTML Version) | convex-functions notebook (HTML Version) | code and data (includes notebooks and scripts as needed) | Piazza Thread | screencast ]

Lab 6 Released: Regular Expressions, SQL

Homework 4 Released: SQL, FEC Data, and Small Donors

14 3/01/2018

Gradient Descent for Model Estimation [Jake Soloff]

In this lecture we will continue our development of models within the framework of loss minimization. In particular, we will explore how calculus can be used to analytically and numerically minimize loss functions. We will introduce the widely used gradient descent algorithm in the context of recommendation systems.
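A minimal sketch of gradient descent on the average squared loss of a constant model, compared against the analytic minimizer. The data, learning rate, and step count are assumptions chosen for illustration; the lecture itself develops the algorithm in the context of recommendation systems.

```python
# Invented data; the model predicts a single constant theta for every point.
data = [2.0, 3.0, 3.0, 5.0, 20.0]

def loss(theta):
    """Average squared loss of the constant model."""
    return sum((y - theta) ** 2 for y in data) / len(data)

def gradient(theta):
    """Derivative of the average squared loss with respect to theta."""
    return sum(-2 * (y - theta) for y in data) / len(data)

def gradient_descent(theta0, learning_rate=0.1, steps=200):
    """Repeatedly step against the gradient, starting from theta0."""
    theta = theta0
    for _ in range(steps):
        theta -= learning_rate * gradient(theta)
    return theta

theta_hat = gradient_descent(theta0=0.0)
analytic = sum(data) / len(data)  # setting the derivative to zero gives the mean

print(theta_hat, analytic, loss(theta_hat))
```

For this loss each step shrinks the distance to the minimizer by a constant factor, so the numerical and analytic answers agree closely after a few hundred steps.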

Textbook: Chapter 11

[ Lecture Notebook HTML | Lecture Notebook iPynb | Raw Data | The Loss Game (be a gradient descender...) | Tutorial on Numerical Optimization | Piazza Thread | screencast ]

8 15 3/6/2018

Midterm Review [Gonzalez]

This lecture will review key topics from the course that will be covered on the midterm.

[ PPTX | PDF | Handout | Practice Midterm | Practice Midterm solutions | Piazza Thread | screencast ]

16 3/8/2018

Midterm [Gonzalez and Perez]

The midterm will take place in class (attendance is mandatory).

9 17 3/13/2018

Generalization and Empirical Risk Minimization [Gonzalez]

So far we have focused on how to estimate a descriptive statistic or more generally the parameters of a model that reflects our data. What does this say about the population? How can we generalize beyond what we observe? In this lecture we recast our loss minimization approach in the context of empirical risk minimization. In the process we will review basic probability concepts including expectations, bias, and variance.

[ PPTX | PDF | Handout | screencast | Piazza Thread ]

Lab 8 Released: Modeling and Estimation

18 3/15/2018

The Bias Variance Tradeoff and Regularization [Perez]

There is a fundamental tension in predictive modeling between our ability to fit the data and to generalize to the world. In this lecture we characterize this tension through the tradeoff between bias and variance. We will derive the bias and variance decomposition of the least squares objective. We then discuss how to manage this tradeoff by augmenting our objective with a regularization penalty.

[ (PPTX) | (PDF) | (Handout) | Bias_Variance_Regularization_Simplified notebook (HTML Version) | code and data (includes notebooks and scripts as needed) | screencast | Piazza Thread | (optional reading) Statistical Justifications of Bias decomposition - Ch. 12 | (optional reading) Bias Variance Tradeoff and MSE ]

Homework 5 Released: Modeling and Gradient Descent

Vitamin 6 Released

10 19 3/20/2018

Linear Regression and Feature Engineering [Gonzalez]

Linear regression is at the foundation of most machine learning and statistical methods. We have already introduced linear models in an informal way; in this lecture we formalize the setup of a linear model as a parametric description of a dataset, whose parameters can be estimated computationally. We study the normal equations from the perspective of optimization and discuss some of the computational issues around solving the normal equations. We will then transition to the task of feature engineering and describe a range of techniques for transforming data to enable linear models to fit complex relationships.
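The sketch below solves the normal equations for a model with an intercept and one feature, then shows how an engineered feature lets the same linear solver fit a curved relationship. The helper name fit_line and the data are invented for this example; a real analysis would typically use a linear algebra library rather than hand-written sums.

```python
def fit_line(features, ys):
    """Solve the 2x2 normal equations (X^T X) theta = X^T y for y ~ a + b * feature."""
    n = len(features)
    sx = sum(features)
    sxx = sum(f * f for f in features)
    sy = sum(ys)
    sxy = sum(f * y for f, y in zip(features, ys))
    det = n * sxx - sx * sx  # zero when all feature values are equal
    a = (sy * sxx - sx * sxy) / det
    b = (n * sxy - sx * sy) / det
    return a, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 9.0, 19.0]  # generated as y = 1 + 2 * x**2

# A fit on the raw feature x cannot capture the curve exactly...
a_raw, b_raw = fit_line(xs, ys)
# ...but the same linear solver recovers it on the engineered feature x**2.
a_sq, b_sq = fit_line([x ** 2 for x in xs], ys)

print(a_raw, b_raw)
print(a_sq, b_sq)
```

The model stays linear in its parameters in both cases; only the feature handed to the solver changes.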

Textbook: Chapter 12 , Chapter 13

[ pptx | pdf | handout | HTML Notebook | code.zip | screencast | Piazza Thread | (optional reading) Least Squares and Logistic Regression - Ch 10, 11 | (optional reading) Understanding Feature engineering ]

Lab 9 Released: Bootstrap

Homework 6: Feature Engineering & Linear Models

20 3/22/2018

Finish Cross-Validation and Regularization [Gonzalez]

In this lecture, we will finish our discussion on linear regression by reviewing how to use the scikit-learn regression package. We will then explore the challenges of overfitting and review how regularization can be used to address overfitting. We will introduce cross-validation as a mechanism to estimate the test error and to select the regularization parameters. (Lecture links updated)
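To illustrate how cross-validation can select a regularization parameter, here is a minimal sketch in plain Python. The lecture uses scikit-learn; this version instead hand-codes a one-feature, no-intercept ridge model so that it is self-contained. The data, the candidate penalties, and the helper names are all invented for illustration.

```python
import random

def ridge_slope(xs, ys, lam):
    """Minimize sum((y - b*x)^2) + lam * b^2 for a no-intercept model (closed form)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def cv_error(xs, ys, lam, k=5):
    """Average held-out squared error over k folds (fold i holds out indices i, i+k, ...)."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        test = [i for i in range(n) if i % k == fold]
        b = ridge_slope([xs[i] for i in train], [ys[i] for i in train], lam)
        total += sum((ys[i] - b * xs[i]) ** 2 for i in test) / len(test)
    return total / k

rng = random.Random(0)  # fixed seed so the sketch is reproducible
xs = [rng.uniform(-1, 1) for _ in range(40)]
ys = [3 * x + rng.gauss(0, 0.5) for x in xs]  # invented data: slope 3 plus noise

candidates = [0.0, 0.1, 1.0, 10.0, 100.0]
errors = {lam: cv_error(xs, ys, lam) for lam in candidates}
best_lam = min(errors, key=errors.get)

print(errors)
print(best_lam)
```

Each candidate penalty is scored only on data the model did not see during fitting, which is what makes the average a usable estimate of test error.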

[ pptx | pdf | handout | HTML Notebook | code.zip | screencast | Piazza Thread | (optional reading) Regularization in Machine Learning ]

Vitamin 7 Released

11 21 4/3/2018

Classification and Logistic Regression [Gonzalez]

We consider the case where our response is categorical; in particular, we focus on the simple case where the response has two categories. We begin by using least squares to fit the binary response to categorical explanatory variables and find that the predictions are proportions. Next, we consider a more complex model that is linear in quantitative explanatory variables, which is called the linear probability model, and we uncover limitations of this model. We motivate an alternative model, the logistic, by examining a local linear fit and matching its shape. We also draw connections between the logistic and log odds. Lastly, we introduce an alternative loss function that is more appropriate for working with probabilities, the Kullback-Leibler Divergence. We derive a representation of the K-L divergence for binary response variables.
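The relationship between the logistic function, log odds, and the K-L divergence for a binary response can be sketched in a few lines. The function names are chosen for this example, and treating 0 * log(0) as 0 is the usual convention assumed here.

```python
import math

def sigmoid(t):
    """Logistic function: maps a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

def log_odds(p):
    """Inverse of the logistic function: log(p / (1 - p))."""
    return math.log(p / (1.0 - p))

def binary_kl(p, q):
    """K-L divergence between Bernoulli(p) and Bernoulli(q), with 0 * log(0) taken as 0."""
    total = 0.0
    for a, b in ((p, q), (1.0 - p, 1.0 - q)):
        if a > 0:
            total += a * math.log(a / b)
    return total

def cross_entropy(y, q):
    """For an observed label y in {0, 1}, binary_kl(y, q) reduces to this loss."""
    return -(y * math.log(q) + (1 - y) * math.log(1.0 - q))

print(sigmoid(0.0))                                # 0.5
print(log_odds(sigmoid(2.0)))                      # approximately 2.0
print(binary_kl(1, 0.8), cross_entropy(1, 0.8))    # the two losses agree
```

When the observed label is 0 or 1, one term of the divergence vanishes, which is why it collapses to the familiar cross-entropy loss.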

[ pptx | pdf | handout | HTML Notebook (Part1) | HTML Notebook (Part2) | code.zip | screencast | Piazza Thread | (optional reading) Least Squares and Logistic Regression - Ch 10, 11 ]

Lab 10 Released: Feature Engineering and Cross-Validation

22 4/5/2018

Classification and Logistic Regression (Part 2) [Gonzalez]

Continued discussion of material in previous lecture.

[ pptx | pdf | handout | HTML Notebook (Part1) | HTML Notebook (Part2) | code.zip | Monte Carlo/Central Limit Notebook (ipynb) | Monte Carlo/Central Limit Notebook (HTML) | screencast | Piazza Thread | (optional reading) Least Squares and Logistic Regression - Ch 10, 11 ]

Vitamin 8 Released

12 23 4/10/2018

Probability theory, Monte Carlo Simulation, and Bootstrapping [Perez]

We saw in the last lecture that we can study parameter estimators using theoretical and computational approaches. In this lecture, we will delve deeper into the bootstrap to study the behavior of the empirical 75th percentile as an estimator for its population counterpart. We will derive the empirical quantile through optimization of a loss function, show that the population parameter minimizes the expected loss, bootstrap the sampling distribution of the empirical 75th percentile, and use the bootstrapped distribution to provide interval estimates for the population parameter. In addition, we will provide a more comprehensive review of basic probability.
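A minimal sketch of bootstrapping the empirical 75th percentile and forming a percentile interval. The sample is simulated, and the percentile definition, replicate count, and interval method are assumptions made for this example rather than the lecture's exact choices.

```python
import math
import random

def percentile_75(values):
    """Smallest value with at least 75% of the data at or below it."""
    ordered = sorted(values)
    return ordered[math.ceil(0.75 * len(ordered)) - 1]

def bootstrap_interval(sample, replicates=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the 75th percentile."""
    rng = random.Random(seed)
    n = len(sample)
    stats = sorted(
        percentile_75(rng.choices(sample, k=n))  # resample with replacement
        for _ in range(replicates)
    )
    lo = stats[int((1 - level) / 2 * replicates)]
    hi = stats[int((1 + level) / 2 * replicates) - 1]
    return lo, hi

rng = random.Random(1)
sample = [rng.expovariate(1.0) for _ in range(200)]  # invented data
estimate = percentile_75(sample)
lower, upper = bootstrap_interval(sample)

print(estimate, lower, upper)
```

Each replicate treats the observed sample as a stand-in for the population, so the spread of the replicated percentiles approximates the estimator's sampling variability.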

[ 23-boostrap-mc (PPTX) | central-limit notebook (HTML Version) | prngs notebook (HTML Version) | restaurant-estimation notebook (HTML Version) | code and data (includes notebooks and scripts as needed) | screencast | Piazza Thread ]

Lab 11 Released: Logistic Regression

24 4/12/2018

Hypothesis Testing [Perez]

A key step in inference is often answering a question about the world. We will consider 4 such questions to varying degrees of detail: 1) Is there enough evidence to bring someone to trial? 2) Is there evidence of an earth-like planet orbiting a star? 3) Do female TAs get lower teaching evaluations than male TAs? We use hypothesis testing to answer these questions. In particular, we examine a collection of non-parametric hypothesis tests. These powerful procedures build on the basic idea of random simulation to help quantify the rarity of a particular phenomenon. In the process of using these procedures we will also touch on the challenges of false discovery and multiple testing.
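One simulation-based, non-parametric procedure of the kind described above is a permutation test. The sketch below is a generic version with invented ratings; it is not the specific analysis from the lecture, and the one-sided test statistic is an assumption made for illustration.

```python
import random

def permutation_test(group_a, group_b, permutations=5000, seed=0):
    """One-sided p-value for mean(group_a) - mean(group_b) under random relabeling."""
    rng = random.Random(seed)
    observed = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_large = 0
    for _ in range(permutations):
        rng.shuffle(pooled)  # under the null, group labels are exchangeable
        diff = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / (len(pooled) - n_a)
        if diff >= observed:
            at_least_as_large += 1
    return observed, at_least_as_large / permutations

# Invented ratings for two groups.
ratings_a = [4.5, 4.0, 4.8, 4.2, 4.6, 4.4]
ratings_b = [3.9, 4.1, 3.7, 4.0, 3.8, 4.3]
observed, p_value = permutation_test(ratings_a, ratings_b)

print(observed, p_value)
```

The p-value is the fraction of shuffled labelings that produce a difference at least as large as the observed one, which quantifies how rare the observation would be if the labels did not matter.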

[ 24-hypothesis-testing (PPTX) | hypergeom notebook (HTML Version) | code and data (includes notebooks and scripts as needed) | screencast | Piazza Thread ]

Vitamin 9 Released

13 25 4/17/2018

P-values, Probability, Priors, Rabbits, Quantifauxcation, and Cargo-Cult Statistics [Prof. Philip Stark, Statistics Department, Guest Lecturer.]

A critical discussion of randomness, p-values, hypothesis testing and statistical inference. Prof. Stark will discuss the subtleties intrinsic to these concepts and how we can easily mislead ourselves by following quantitative recipes naïvely, without a proper understanding of their underlying assumptions. These ideas will be illustrated with concrete examples from bird/wind turbine interactions, earthquake modeling and randomized controlled trials.

[ P-values, rabbits & friends (zipped notebook) | P-values, rabbits & friends (HTML Version) | P-values, rabbits & friends (PDF) | screencast | Piazza Thread ]

Lab 12 Released: Hypothesis Testing, Baby Weights

Project 2 Released: Spam vs. Ham Classification

26 4/19/2018

Big Data [Vikram Sreekanti (Guest Lecturer - EECS PhD student)]

Data management at the level of big organizations can be confusing and often relies on many different technologies. In this lecture we will provide an overview of organizational data management and introduce some of the key technologies used to store and compute on data at scale. Time permitting we will dive into some basic Spark programming.

[ Slides (PPTX) | Slides (PDF) | Spark Notebook (HTML) | Spark Notebook (ipynb) | screencast | Piazza Thread ]

Vitamin 10 Released

14 27 4/24/2018

Ethics in Data Science [Guest Speaker Joshua Kroll]

Data science is being used in a growing number of settings to make decisions that impact people's lives. In this lecture we will discuss just a few of the many ethical and legal considerations in the application of data science to real-world problems.

Our guest speaker Joshua Kroll is a computer scientist and researcher interested in the governance of automated decision-making systems, especially those built with machine learning. He is currently a Postdoctoral Research Scholar at UC Berkeley’s School of Information, working with Deirdre Mulligan. Before that he received a PhD in Computer Science in the Security Group at Princeton University. His dissertation on Accountable Algorithms was advised by Edward W. Felten and supported by the Center for Information Technology Policy, where he studied topics in security, privacy, and how technology informs policy decisions. Joshua was the program chair of this year’s edition of the successful workshop series “Fairness, Accountability, and Transparency in Machine Learning (FAT/ML)”.

[ PPTX | PDF | Handout | screencast ]

28 4/26/2018

Randomized experiments, A/B tests and sequential monitoring [Guest Speaker Steve Howard.]

It is now commonplace for organizations with websites or mobile apps to run randomized controlled experiments, or “A/B tests” as they’re often called in industry. Such experiments provide a reliable way to determine which product changes lead to the most successful user interactions. In this lecture we will discuss why randomized experiments are so important, talk about some of the key design choices that go into A/B tests, and get a brief introduction to sequential monitoring of experimental results.

[ PDF slides | Demo (HTML) | Demo (ipynb) ]

15 29 5/01/2018

Exam Review Part 1 [Gonzalez]

In this exam review we will cover material up to the midterm. This will be largely based on the midterm exam review but we will try to reference more recent concepts.

[ PPTX | PDF | Handout ]

30 5/03/2018

Exam Review Part 2 [Perez]

In this exam review we will cover material after the midterm.

[ PPTX ]