Lecture 16, Part 2 – Data 100, Fall 2020

by Joseph Gonzalez (Spring 2020)

Note: scikit-learn's Pipeline functionality is explored at length in this notebook. IT IS NOT IN SCOPE FOR FALL 2020. Instead, focus on the bigger picture of how we are using regularization in practice.


In this notebook we explore the use of regularization techniques to address overfitting.


As with other notebooks we will use the same set of standard imports.

The Data

For this notebook, we will use the seaborn mpg dataset which describes the fuel mileage (measured in miles per gallon or mpg) of various cars along with characteristics of those cars. Our goal will be to build a model that can predict the fuel mileage of a car based on the characteristics of that car.

The first thing we will want to do with this data is construct a train/test split.
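As a rough sketch of what such a split can look like, the following uses scikit-learn's train_test_split. The data here is synthetic (a stand-in for the cars table), and the 25% test fraction and the random seeds are assumptions made for illustration, not values taken from this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cars table: 100 rows, 3 numeric features.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Hold out 25% of the rows as test data.
# Fixing random_state makes the split reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=83
)
print(X_tr.shape, X_te.shape)
```

The test rows are set aside at this point and should not be touched again until the final evaluation.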

Building a Few Basic Models

The following code uses scikit-learn Pipelines, which are not in scope for the fall. Again, treat a Pipeline as a black box that specifies the features that our models use.

We have defined and trained (fit) 5 models:


Before we proceed, we'll look at how to use cross_val_score, a function in sklearn.model_selection, to give us the CV RMSE of our models.

The cross_val_score function takes a score function which it cross validates. That score function must take three arguments: the model, X, and y. We implement a Root Mean Squared Error (RMSE) score function:

We can then use cross_val_score to estimate the test RMSE for the model that only uses the number of cylinders c:

Taking the mean we get the average cross validation (CV) RMSE:
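The steps above (a three-argument score function, a call to cross_val_score, and the mean over folds) can be sketched as follows. The name rmse_score, the synthetic single-feature data, and the choice of 5 folds are assumptions for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def rmse_score(model, X, y):
    """Score function with the (model, X, y) signature described above."""
    return np.sqrt(np.mean((y - model.predict(X)) ** 2))

# Synthetic data standing in for a model with a single feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=200)

# cross_val_score returns one score per fold; cv=5 requests 5 folds.
fold_rmse = cross_val_score(
    LinearRegression(), X, y, scoring=rmse_score, cv=5
)
cv_rmse = fold_rmse.mean()
print(fold_rmse)
print(cv_rmse)
```

One caveat: scikit-learn's model selection tools generally assume that larger scores are better. That does not matter when we only read the numbers returned by cross_val_score, but a raw RMSE scorer like this one should not be handed to a tool that maximizes the score.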

Visualizing the Train/CV/Test RMSE

In the following helper function we plot the Train and CV RMSE. We also plot the test RMSE for educational purposes, but you should not do this! If you use the height of the test RMSE bar to make decisions in designing your model, you have invalidated the test data. It is sort of like looking at the exam when studying.

Notice that as we added features the training and cross validation error generally decreased. Also notice that the cross validation error and the test error are generally higher than the training error.

Adding the Text Features

Adding the Origin

The origin is a categorical feature which takes on only a few values:

This can be extracted using a OneHotEncoder in the SelectColumns stage of the pipeline. Notice that ["origin"] is in brackets; this caused me some confusion when preparing lecture. The OneHotEncoder wants its input in column form (a two-dimensional table with a single column) rather than list form.
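To make the bracket distinction concrete, here is a small sketch outside of any pipeline. The four-row table is made up for illustration; only the column name origin comes from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Tiny illustrative table; the rows are made up for this sketch.
df = pd.DataFrame({"origin": ["usa", "japan", "europe", "usa"]})

print(df["origin"].shape)    # one-dimensional Series
print(df[["origin"]].shape)  # two-dimensional, single-column DataFrame

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense.
encoded = encoder.fit_transform(df[["origin"]]).toarray()

print(encoder.categories_[0])  # learned categories, in sorted order
print(encoded)                 # one indicator column per category
```

Each row of the encoded array contains a single 1 marking that row's category.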

Adding the Vehicle Name

In the previous lecture we added the vehicle name using the CountVectorizer which implements a bag-of-words encoding.
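As a reminder of what a bag-of-words encoding produces, here is a minimal sketch using CountVectorizer with its default settings on three example name strings. The strings are illustrative and the notebook's own pipeline configuration may differ.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three example vehicle names used only for illustration.
names = [
    "chevrolet chevelle malibu",
    "buick skylark 320",
    "ford torino",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(names)  # sparse matrix of word counts

# vocabulary_ maps each word to its column index; sort words by column.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocab)
print(counts.toarray())
```

Every distinct word becomes its own column, which is why adding the name introduces so many new features.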

Notice that adding the origin of the vehicle resulted in a small decrease in training error and validation error, but adding the name resulted in a large decrease in training error and a moderate increase in validation error. The introduction of the vehicle name feature resulted in overfitting. To see why this might have happened, let's look at the number of features in each model.

Notice that the above plot is in log scale on the y axis! The addition of the vehicle name resulted in a large jump in the number of features. Perhaps some of these new features are useful and some are not. What we really want now is a mechanism to select which of these features to keep and which to ignore or down-weight. This can be done by using regularization.


Broadly speaking, regularization refers to methods used to control over-fitting. However, in this class we will focus on parametric regularization techniques.

The simplest way to think about regularization is in the context of our original loss minimization problem. Recall we defined the loss function which determines what is the best parameter value $\hat{\theta}$ for our model.

$$ \hat{\theta} = \arg \min_\theta \frac{1}{n} \sum_{i=1}^n \textbf{Loss}\left(Y_i, f_\theta(X_i)\right) $$

The loss captures how well our model fits the data. What we need is a way to penalize models that over-fit to the data. We can do this by adding an extra term to our loss minimization problem:

$$ \hat{\theta} = \arg \min_\theta \frac{1}{n} \sum_{i=1}^n \textbf{Loss}\left(Y_i, f_\theta(X_i)\right) + \alpha \textbf{Reg}(f_\theta, X, Y) $$

The Reg function measures how much our model overfits, and the $\alpha$ parameter (really a hyper-parameter) allows us to balance fitting our data against overfitting. The remaining challenge is how to define this Reg function and determine the value of our additional $\alpha$ parameter.

The Regularization Hyper-parameter

The $\alpha$ parameter is our regularization hyper-parameter. It is a hyper-parameter because it is not a model parameter but a choice of how we want to balance fitting the data and "over-fitting". The goal is to find a value of this hyper-parameter to maximize our accuracy on the test data. However, we can't use the test data to make modeling decisions so we turn to cross validation. The standard way to find the best $\alpha$ is to try a bunch of values (perhaps using binary search) and take the one with the lowest cross validation error.

You may have noticed that in the video lecture we use $\lambda$ instead of $\alpha$. This is because many textbooks use $\lambda$ and sklearn uses $\alpha$.

The Regularization Function

In our cartoon formulation of the regularized loss, the regularization function Reg is supposed to "measure" how much our model "overfits". It depends on the model parameters and also can depend on the data. An obvious choice for this function could be the gap between training and cross validation error but that is difficult to optimize and somewhat circular since both training and validation error depend on solving the regularized loss minimization problem.

We have already seen that the more features we have the more likely we are to overfit to our data. For our linear model, if we set a parameter to 0 then that is the same as not having that feature. Therefore, one possible regularization function could be to count the non-zero parameters in our model.

$$ \textbf{Reg}(\theta) = \sum_{k=1}^d \mathbb{1}(\theta_k \neq 0) $$

Minimizing this function on its own would mean ignoring all the features, which would certainly not overfit. This is actually the "feature selection" regularization objective. Unfortunately, optimizing this objective is very hard (NP-hard).

However, there are some good approximations we can use:

$$ \textbf{Reg}^\text{ridge}(\theta) = \sum_{k=1}^d \theta_k^2 $$


$$ \textbf{Reg}^\text{lasso}(\theta) = \sum_{k=1}^d \left|\theta_k \right| $$

Each of these regularization functions (and their combination) gives rise to a different regression technique.
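To make the three penalties concrete, here is a small NumPy sketch that evaluates each of them on one parameter vector and plugs a penalty into a regularized squared loss. The function names and the example vector are assumptions chosen for illustration.

```python
import numpy as np

def reg_count(theta):
    """Feature selection penalty: number of non-zero parameters."""
    return int(np.count_nonzero(theta))

def reg_ridge(theta):
    """Ridge (L2) penalty: sum of squared parameters."""
    return float(np.sum(theta ** 2))

def reg_lasso(theta):
    """Lasso (L1) penalty: sum of absolute parameter values."""
    return float(np.sum(np.abs(theta)))

def regularized_mse(theta, X, y, alpha, reg):
    """Average squared loss of a linear model plus alpha times a penalty."""
    return float(np.mean((y - X @ theta) ** 2) + alpha * reg(theta))

theta = np.array([0.0, 3.0, -4.0, 0.0])
print(reg_count(theta), reg_ridge(theta), reg_lasso(theta))  # 2 25.0 7.0
```

Note how the ridge penalty grows much faster than the lasso penalty for large parameters, while the counting penalty ignores magnitude entirely.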

Ridge Regression

Ridge regression combines the ridge (L2, Squared) regularization function with the least squares loss.

$$ \hat{\theta}_\alpha = \arg \min_\theta \left(\frac{1}{n} \sum_{i=1}^n \left(Y_i - f_\theta(X_i)\right)^2 \right) + \alpha \sum_{k=1}^d \theta_k^2 $$

Ridge regression, like ordinary least squares regression, also has a closed form solution for the optimal $\hat{\theta}_\alpha$:

$$ \hat{\theta}_\alpha = \left(X^T X + n \alpha \mathbf{I}\right)^{-1} X^T Y $$

where $\mathbf{I}$ is the identity matrix, $n$ is the number of data points, and $\alpha$ is the regularization hyper-parameter.

Notice that even if $X^T X$ is not full rank, the addition of $n \alpha \mathbf{I}$ (which is full rank whenever $\alpha > 0$) makes $\left(X^T X + n \alpha \mathbf{I}\right)$ invertible. Thus, ridge regression addresses our earlier issue of having an underdetermined system and partially improves the numerical stability of the solution.
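The closed form can be checked numerically. The sketch below builds a synthetic design matrix with a duplicated column (so $X^T X$ is singular), solves the ridge system, and confirms that the gradient of the regularized objective vanishes at the solution. The helper name ridge_closed_form and the data are assumptions for this example, and no intercept term is handled.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Solve (X^T X + n * alpha * I) theta = X^T y for theta."""
    n, d = X.shape
    A = X.T @ X + n * alpha * np.eye(d)
    return np.linalg.solve(A, X.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
X = np.hstack([X, X[:, [0]]])  # third column duplicates the first
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=50)

# X^T X is 3 x 3 but rank deficient, so ordinary least squares
# has no unique solution here.
rank = np.linalg.matrix_rank(X.T @ X)
print(rank)

alpha = 0.1
theta = ridge_closed_form(X, y, alpha)
print(theta)

# Gradient of (1/n) * ||y - X theta||^2 + alpha * ||theta||^2 at theta.
n = X.shape[0]
grad = -(2 / n) * X.T @ (y - X @ theta) + 2 * alpha * theta
print(np.abs(grad).max())  # should be numerically zero
```

If you compare against scikit-learn, keep in mind that its Ridge class penalizes the unaveraged sum of squared errors, so its alpha argument plays the role of $n \alpha$ in the formula above.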

Lasso Regression

Lasso regression combines the absolute (L1) regularization function with the least squares loss.

$$ \hat{\theta}_\alpha = \arg \min_\theta \left(\frac{1}{n} \sum_{i=1}^n \left(Y_i - f_\theta(X_i)\right)^2 \right) + \alpha \sum_{k=1}^d |\theta_k| $$

Lasso is actually an acronym (and a cool name) which stands for Least Absolute Shrinkage and Selection Operator. It is an absolute operator because the penalty is the absolute value. It is a shrinkage operator because it favors smaller parameter values. It is a selection operator because it has the peculiar property of pushing parameter values all the way to zero, thereby selecting the remaining features. It is this last property that makes Lasso regression so useful. By using Lasso regression and setting a sufficiently large value of $\alpha$ you can eliminate features that are not informative.

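Unfortunately, there is no closed form solution for Lasso regression, so iterative optimization algorithms are used instead. Because the absolute value is not differentiable at zero, plain gradient descent does not apply directly; coordinate descent (which scikit-learn's Lasso uses) and subgradient or proximal gradient methods are common choices.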
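To give a feel for one such iterative method, here is a minimal sketch of coordinate descent with soft-thresholding, written for the Lasso objective above (average squared loss plus $\alpha$ times the L1 penalty). It is an illustration under simplifying assumptions (no intercept, a fixed number of sweeps, synthetic data), not the implementation used by any library.

```python
import numpy as np

def soft_threshold(rho, t):
    """Shrink rho toward zero by t, returning exactly 0 if |rho| <= t."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def lasso_coordinate_descent(X, y, alpha, n_sweeps=200):
    """Minimize (1/n) * ||y - X theta||^2 + alpha * ||theta||_1."""
    n, d = X.shape
    theta = np.zeros(d)
    col_scale = (X ** 2).sum(axis=0) / n  # (1/n) * x_j^T x_j per column
    for _ in range(n_sweeps):
        for j in range(d):
            # Residual with feature j's current contribution added back.
            partial_resid = y - X @ theta + X[:, j] * theta[j]
            rho = X[:, j] @ partial_resid / n
            # Exact minimizer over theta_j with the other coordinates fixed.
            theta[j] = soft_threshold(rho, alpha / 2) / col_scale[j]
    return theta

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
true_theta = np.array([3.0, 0.0, 0.0, -2.0, 0.0])
y = X @ true_theta + rng.normal(scale=0.1, size=200)

theta_hat = lasso_coordinate_descent(X, y, alpha=0.5)
theta_big = lasso_coordinate_descent(X, y, alpha=100.0)
print(np.round(theta_hat, 2))
print(theta_big)
```

The soft-thresholding step is what produces exact zeros: any coordinate whose correlation with the residual is below the threshold is set to zero rather than merely made small. With a very large $\alpha$, every coordinate falls below the threshold.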

Normalizing the Features

When applying Ridge or Lasso regression it is important that you normalize your features. Why (think about it for a second)?

The issue is that if your features are on very different scales (some very large, some very small), then the magnitudes of their coefficients may differ substantially. Think of the coefficients as unit translations from units of features (e.g., cubic centimeters and horsepower) to units of the thing you are trying to predict (e.g., miles per gallon). The regularization function penalizes all of these unit translation coefficients equally, even though some may need to be much larger than others simply because of the units. By standardizing the features we can address this variation. You can standardize a feature by computing:

$$ z = \frac{x - \textbf{Mean}(x)}{\textbf{StdDev}(x)} $$
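A short NumPy sketch of this computation, applied column by column. The numbers are made up to mimic two features on very different scales. The mean and standard deviation are computed from the training rows only and then reused for new rows.

```python
import numpy as np

# Made-up training rows: two features on very different scales.
X_train = np.array([
    [1500.0,  90.0],
    [3000.0, 150.0],
    [4500.0, 210.0],
])
X_new = np.array([[3000.0, 120.0]])

# Column-wise statistics from the training data only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

Z_train = (X_train - mean) / std
Z_new = (X_new - mean) / std  # reuse the training statistics

print(Z_train)
print(Z_new)
```

After standardization every column of the training data has mean 0 and standard deviation 1, so the penalty treats the coefficients on an equal footing.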

Ridge Regression in SK Learn

Both Ridge Regression and Lasso are built-in classes in SKLearn. Let's start by importing the Ridge Regression class, which has the same interface as the LinearRegression class we used earlier:

Take a look at the documentation. Notice the regularized loss function.

We can plug the Ridge Regression model in place of the LinearRegression model in our earlier pipeline.

We should also standardize our features:
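A minimal sketch of a pipeline that standardizes and then fits ridge regression. The step names, the synthetic two-feature data, and the value alpha=0.5 are assumptions for illustration; the notebook's actual pipeline also includes a column selection stage that is omitted here.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
# Two synthetic features on very different scales.
X = np.column_stack([
    rng.normal(0, 1000, size=100),
    rng.normal(0, 1, size=100),
])
y = 0.002 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

model = Pipeline([
    ("standardize", StandardScaler()),  # z-score each column
    ("regression", Ridge(alpha=0.5)),   # ridge on the standardized columns
])
model.fit(X, y)

coef = model.named_steps["regression"].coef_
print(coef)
```

On the raw scale the two true coefficients differ by a factor of 1000, but after standardization the fitted coefficients are of comparable size, so the penalty no longer favors one feature because of its units.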

Notice that as we introduce regularization we are reducing the cross validation error and also increasing the training error.

What about different $\alpha$ hyper-parameter values?

That is too much regularization. Let's use cross validation to pick the best value.

Cross Validation to Tune Regularization Parameter

The following uses cross validation to tune the regularization parameter.

We can plot the cross validation error against the different $\alpha$ values and pick the $\alpha$ with the smallest cross validation error.
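The search can be sketched as a loop over candidate values. The logarithmically spaced grid, the 5 folds, the synthetic data, and the rmse_score helper are assumptions for this example rather than the exact settings used in lecture.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def rmse_score(model, X, y):
    """Score function with the (model, X, y) signature."""
    return np.sqrt(np.mean((y - model.predict(X)) ** 2))

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 40))  # many features relative to rows
y = X[:, 0] - X[:, 1] + rng.normal(scale=1.0, size=60)

alphas = np.logspace(-2, 3, 20)  # candidates from 0.01 to 1000
cv_rmse = np.array([
    cross_val_score(
        make_pipeline(StandardScaler(), Ridge(alpha=a)),
        X, y, scoring=rmse_score, cv=5,
    ).mean()
    for a in alphas
])

# Keep the candidate with the smallest average CV RMSE.
best_alpha = alphas[np.argmin(cv_rmse)]
print(best_alpha, cv_rmse.min())
```

Placing the scaler inside the pipeline means it is refit on the training folds of each split, so the held-out fold never influences the standardization.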

Adding the best model:

We didn't have to do all of that work. SKLearn has a Ridge Regression class with built-in cross validation:
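A minimal sketch of that class, RidgeCV, on synthetic data. The candidate grid and the 5 folds are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(5)
X = rng.normal(size=(80, 10))
y = X[:, 0] + rng.normal(scale=0.5, size=80)

alphas = np.logspace(-2, 3, 20)

# RidgeCV tries every candidate alpha with cross validation and then
# refits on all of the data using the best one.
model = RidgeCV(alphas=alphas, cv=5).fit(X, y)

print(model.alpha_)  # the selected regularization strength
print(model.coef_)
```

The fitted object can then be used like any other regression model, for example inside a pipeline after a standardization step.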

Lasso in SKLearn

Similarly we can swap Ridge Regression for Lasso by simply importing the Lasso object from sklearn.linear_model.

We can examine the distribution of model coefficients to see that Lasso is selecting only a few of the features to use in its predictions:

Here we get the names of all the features in the model.

Finally, we select the features that had non-zero coefficients (parameters). Not surprisingly, many of these features are likely good predictors of the fuel economy of a car.
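The selection step can be sketched as follows with synthetic data in which only two of twenty features matter. The feature names, the data, and the value alpha=0.5 are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(6)
# Hypothetical feature names for 20 synthetic columns.
feature_names = np.array([f"feature_{k}" for k in range(20)])
X = rng.normal(size=(200, 20))
# Only columns 2 and 7 actually influence y.
y = 4.0 * X[:, 2] - 3.0 * X[:, 7] + rng.normal(scale=0.5, size=200)

model = Lasso(alpha=0.5).fit(X, y)

# Boolean mask of coefficients that Lasso did not push to zero.
nonzero = model.coef_ != 0
selected = feature_names[nonzero]

print(int(nonzero.sum()), "of", len(feature_names), "features kept")
print(selected)
```

Note that scikit-learn's Lasso scales the squared loss by $\frac{1}{2n}$ rather than $\frac{1}{n}$, so its alpha argument is not numerically identical to the $\alpha$ in the formula earlier in this notebook.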