Lecture 18 – Cross-Validation and Regularization
Presented by Anthony D. Joseph, Joseph Gonzalez, Suraj Rampure, Paul Shao
Content by Joseph Gonzalez, Suraj Rampure, Paul Shao
Important: Read this before proceeding with the lectures, as it details what materials you should focus on. (This is also largely recapped in Video 18.1.)
Sections 18.1 through 18.4 discuss train-test splits and cross-validation.
18.1, in addition to giving an overview of the lecture, walks through why we need to split our data into train and test in the first place, and how cross-validation works. It primarily consists of slides.
18.2 and 18.3 walk through the process of creating a basic train-test split, and evaluating models that we’ve fit on our training data using our testing data. Code is in “Part 1”.
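A minimal sketch of what those notebook steps look like (the data here is synthetic and invented purely for illustration; the lecture notebook uses its own dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 points, 3 features, known linear signal.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Hold out 20% of the data as a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit only on the training set; the test set stays untouched until evaluation.
model = LinearRegression()
model.fit(X_train, y_train)

train_mse = mean_squared_error(y_train, model.predict(X_train))
test_mse = mean_squared_error(y_test, model.predict(X_test))
```

The key discipline is that the test set is used exactly once, at the end, to estimate how the model performs on unseen data.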
18.4 walks through the process of implementing cross-validation. This video references a
Pipeline object in
scikit-learn. This is not in scope for us, so do not worry about its details. Code is in “Part 1”.
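Since Pipeline is out of scope, here is a sketch of cross-validation using `cross_val_score` directly on an estimator (again on made-up synthetic data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data, as before.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# 5-fold cross-validation: each fold takes one turn as the validation set
# while the model is trained on the remaining four folds.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    LinearRegression(), X, y,
    cv=kf, scoring="neg_mean_squared_error",
)
cv_mse = -scores.mean()  # average validation MSE across the 5 folds
```

To select among candidate models, you would compute `cv_mse` for each candidate and pick the one with the lowest average validation error, all without ever touching the test set.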
Sections 18.5 and 18.6 discuss regularization.
18.5 discusses why we need to regularize, and how penalties on the norm of our parameter vector accomplish this goal. 18.6 explicitly lists the optimal model parameter when using the L2 penalty on our linear model (called “ridge regression”).
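For reference, under the convention where the objective is the mean squared error plus an L2 penalty, $\frac{1}{n} \lVert \mathbb{Y} - \mathbb{X}\theta \rVert_2^2 + \lambda \lVert \theta \rVert_2^2$ (note that other texts fold the $n$ into $\lambda$), the closed-form ridge solution discussed in 18.6 is:

```latex
\hat{\theta}_{\text{ridge}} = \left( \mathbb{X}^\top \mathbb{X} + n\lambda I \right)^{-1} \mathbb{X}^\top \mathbb{Y}
```

Unlike ordinary least squares, this inverse always exists for $\lambda > 0$, since $\mathbb{X}^\top \mathbb{X} + n\lambda I$ is positive definite. LASSO (the L1 penalty) has no such closed form and is solved numerically.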
There are also three supplementary videos accompanying this lecture. They don’t introduce any new material, but may still be helpful for your understanding. They are listed as supplementary and not required since the runtime of this lecture is already quite long. They do not have accompanying Quick Checks for this reason.
18.7 and 18.8 walk through implementing ridge and LASSO regression in a notebook. These videos are helpful in explaining how regularization and cross-validation are used in practice. These videos again use Pipeline, which is not in scope. Code is in “Part 2”.
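A Pipeline-free sketch of fitting both regularized models with scikit-learn (synthetic data invented for illustration; `alpha` is scikit-learn's name for the regularization strength):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data where only the first two of five features matter.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Ridge (L2 penalty) shrinks all coefficients toward zero.
ridge = Ridge(alpha=1.0).fit(X, y)

# LASSO (L1 penalty) can set irrelevant coefficients exactly to zero.
lasso = Lasso(alpha=0.1).fit(X, y)

print(ridge.coef_)
print(lasso.coef_)
```

In practice, `alpha` itself is a hyperparameter chosen by cross-validation, which is exactly how the two halves of this lecture fit together.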
18.9 is another supplementary video, created by Paul Shao (a TA for Data 100 in Spring 2020). It gives a great high-level overview of both the bias-variance tradeoff and regularization.
A reminder: each video in the table below has an accompanying Quick Check on the course website. These are not required, but are suggested to help you check your understanding.
| Video | Description |
| --- | --- |
| 18.1 | Lecture overview. Training error vs. testing error. Why we need to split our data into train and test. How cross-validation works, and why it is useful. |
| 18.2 | Using scikit-learn to construct a train-test split. |
| 18.3 | Building a linear model and determining its training and test error. |
| 18.4 | Implementing cross-validation, and using it to help select a model. |
| 18.5 | An overview of regularization. |
| 18.6 | Ridge regression and LASSO regression. |
| 18.7 | *Supplemental.* Using ridge regression and cross-validation in scikit-learn. |
| 18.8 | *Supplemental.* Using LASSO regression and cross-validation in scikit-learn. |
| 18.9 | *Supplemental.* An overview of the bias-variance tradeoff, and how it interfaces with regularization. |