Lecture 16 – Cross-Validation and Regularization

Presented by Anthony D. Joseph, Joseph Gonzalez, Suraj Rampure, Paul Shao

Content by Joseph Gonzalez, Suraj Rampure, Paul Shao

Important: Read this before proceeding with the lectures, as it details what materials you should focus on. (This is also largely recapped in Video 16.1.)

Sections 16.1 through 16.4 discuss train-test splits and cross-validation.

16.1, in addition to giving an overview of the lecture, walks through why we need to split our data into train and test in the first place, and how cross-validation works. It primarily consists of slides.
16.2 and 16.3 walk through the process of creating a basic train-test split, and evaluating models that we’ve fit on our training data using our testing data. Code is in “Part 1”.
16.4 walks through the process of implementing cross-validation. In this video there are references to a Pipeline object in scikit-learn. This is not in scope for us, so do not worry about its details. Code is in “Part 1”.
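The workflow described in 16.2 through 16.4 can be sketched roughly as follows. This is a minimal illustration, not the notebook code from “Part 1”: the synthetic data, the 80/20 split, the choice of 5 folds, and all variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic data (an assumption for illustration): a linear signal plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Hold out 20% of the rows as a test set that the model never sees while fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training data only, then compare training and test error.
model = LinearRegression().fit(X_train, y_train)
train_mse = mean_squared_error(y_train, model.predict(X_train))
test_mse = mean_squared_error(y_test, model.predict(X_test))

# 5-fold cross-validation on the training data: each fold takes one turn as the
# validation set while the model is refit on the remaining four folds.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
neg_mse = cross_val_score(
    LinearRegression(), X_train, y_train,
    cv=kfold, scoring="neg_mean_squared_error",
)
cv_mse = -neg_mse.mean()  # scikit-learn reports negated MSE, so flip the sign

print(f"train MSE: {train_mse:.4f}")
print(f"test MSE:  {test_mse:.4f}")
print(f"CV MSE:    {cv_mse:.4f}")
```

Note that the test set is left out of cross-validation entirely; in this sketch it is only used once, to estimate how the fitted model does on unseen data.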

Sections 16.5 and 16.6 discuss regularization.

16.5 discusses why we need to regularize, and how penalties on the norm of our parameter vector accomplish this goal.
16.6 explicitly states the optimal model parameters when using the L2 penalty on our linear model (called “ridge regression”).
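As a rough sketch of the idea in 16.5 and 16.6, suppose the regularized objective is ||y - X theta||^2 + lam * ||theta||^2 with no intercept term. Under that assumption, the minimizer is theta = (X^T X + lam * I)^(-1) X^T y. The scaling convention is an assumption here: if the loss is averaged over n points instead, lam is replaced by n * lam. The function name `ridge_closed_form` and the synthetic data are hypothetical.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Minimize ||y - X theta||^2 + lam * ||theta||^2 (no intercept)."""
    d = X.shape[1]
    # Solve (X^T X + lam * I) theta = X^T y rather than forming the inverse.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([3.0, 0.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=100)

theta_ols = ridge_closed_form(X, y, lam=0.0)     # no penalty: ordinary least squares
theta_ridge = ridge_closed_form(X, y, lam=50.0)  # L2 penalty shrinks the parameters

print("OLS norm:  ", np.linalg.norm(theta_ols))
print("ridge norm:", np.linalg.norm(theta_ridge))
```

Increasing `lam` pulls the parameter vector toward zero, which is how the penalty on the norm limits model complexity.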

There are also three supplementary videos accompanying this lecture. They don’t introduce any new material, but may still be helpful for your understanding. Because the runtime of this lecture is already quite long, they are listed as supplementary rather than required, and they do not have accompanying Quick Checks.

16.7 and 16.8 walk through implementing ridge and LASSO regression in a notebook. These videos are helpful in explaining how regularization and cross-validation are used in practice. These videos again use Pipeline, which is not in scope. Code is in “Part 2”.
16.9 is another supplementary video, created by Paul Shao (a TA for Data 100 in Spring 2020). It gives a great high-level overview of both the bias-variance tradeoff and regularization.
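The kind of workflow shown in 16.7 and 16.8 might look something like the sketch below, which uses cross-validation to choose a regularization strength for ridge and LASSO. It is not the “Part 2” notebook code: it deliberately avoids Pipeline, and the synthetic data, the grid of `alpha` values, and the helper `cv_mse` are assumptions made for illustration. In scikit-learn, the `alpha` parameter of `Ridge` and `Lasso` controls the strength of the penalty.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic data (an assumption): only the first 3 of 10 features matter.
rng = np.random.default_rng(1)
n, d = 150, 10
X = rng.normal(size=(n, d))  # features are already on a common scale here
true_theta = np.zeros(d)
true_theta[:3] = [4.0, -3.0, 2.0]
y = X @ true_theta + rng.normal(scale=1.0, size=n)

alphas = [0.01, 0.1, 1.0, 10.0]  # hypothetical grid of penalty strengths

def cv_mse(model):
    """Average 5-fold cross-validated mean squared error for a model."""
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    return -scores.mean()

# Estimate the validation error for each candidate penalty strength.
ridge_errors = {a: cv_mse(Ridge(alpha=a)) for a in alphas}
lasso_errors = {a: cv_mse(Lasso(alpha=a)) for a in alphas}

# Keep the penalty strength with the lowest cross-validated error.
best_ridge_alpha = min(ridge_errors, key=ridge_errors.get)
best_lasso_alpha = min(lasso_errors, key=lasso_errors.get)

# Refit on all of the data with the chosen penalty strengths.
ridge = Ridge(alpha=best_ridge_alpha).fit(X, y)
lasso = Lasso(alpha=best_lasso_alpha).fit(X, y)

print("best ridge alpha:", best_ridge_alpha)
print("best LASSO alpha:", best_lasso_alpha)
print("LASSO coefficients set exactly to zero:", int(np.sum(lasso.coef_ == 0.0)))
```

With real data, features are usually standardized before regularizing so that the penalty treats them comparably; that step is skipped here only because the synthetic features already share a scale.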

The Quick Check for this lecture is due Monday, November 2nd at 11:59 PM. Once you submit, one of the following Google Forms (chosen at random) will give you an alphanumeric code; enter this code into the “Lecture 16” question in the “Quick Check Codes” assignment on Gradescope to get credit for submitting this Quick Check.

Topic	Video	Quick Check
16.1 Lecture overview. Training error vs. testing error. Why we need to split our data into train and test. How cross-validation works, and why it is useful.		16.1
16.2 Using scikit-learn to construct a train-test split.		16.2
16.3 Building a linear model and determining its training and test error.		16.3
16.4 Implementing cross-validation, and using it to help select a model.		16.4
16.5 An overview of regularization.		16.5
16.6 Ridge regression and LASSO regression.		16.6
16.7 Supplemental. Using ridge regression and cross-validation in scikit-learn.		N/A
16.8 Supplemental. Using LASSO regression and cross-validation in scikit-learn.		N/A
16.9 Supplemental. An overview of the bias-variance tradeoff, and how it interfaces with regularization.		N/A
