Lecture 17

by Isaac Schmidt

Adapted from Joseph Gonzalez, Spring 2020

Train Test Split and Cross Validation

In this section we will work through the train test-split and the process of cross validation.


As with other notebooks we will use the same set of standard imports.

The Data

For this notebook, we will use the seaborn mpg dataset which describes the fuel mileage (measured in miles per gallon or mpg) of various cars along with characteristics of those cars. Our goal will be to build a model that can predict the fuel mileage of a car based on the characteristics of that car.

This data has some rows with missing data. We will ignore those rows until later for the sake of this lecture. We can use the Pandas DataFrame.isna function to find rows with missing values and drop them, although of course, this is not always the best idea!

Train Test Split

The first thing we will want to do with this data is construct a train/test split. Constructing a train test split before EDA and data cleaning can often be helpful. This allows us to see if our data cleaning and any conclusions we draw from visualizations generalize to new data. This can be done by re-running the data cleaning and EDA process on the test dataset.

Using Pandas Operations

We can sample the entire dataset to get a permutation and then select a range of rows.

Selecting a range of rows for training and test.

Checking that they add up.

Shuffling with Numpy

We can directly shuffle the data with numpy, and then select the corresponding rows from our original DataFrame.

Using SKLearn

We can use the train_test_split function from sklearn.model_selection to do this easily.

Building A Basic Model

Let's go through the process of building a model. Let's start by looking at the raw quantitative features available. We will first use just our own feature function (as we did in previous lectures). This function will just extract the quantitative features we can use for our model.

Then we fit a scikit-learn LinearRegression model to our training data.

To evaluate the error we will use the Root Mean Squared Error (RMSE) which is like the mean squared error but in the correct units (mpg) instead of (mpg^2).

The training error is:

Oh no! We just used the test data to evaluate our model! We shouldn't have done that.

(Don't worry, this is only for demonstration purposes. But seriously, don't try this at home.)

Notice: The test error is slightly higher than the training error. This is typically (but not always) the case. Sometimes we get lucky and the test data is "easier to predict" or happens to closely follow the training data.


Keeping track of all the models.

In this notebook (and in life) we will want to keep track of all our models.

We store our model settings in a dictionary. The key is some identifying name, and the value is a 2-element tuple, with the first element being the transformation function (e.g. basic_design_matrix), and the second element being an empty model object (e.g. LinearRegression()).

More Feature Transformations

As in Lecture 14, we might want to look at the displacement per cylinder as well.

We can build a linear model using the same quantitative features as before, but with this new "displacement per cylinder" feature.

Our training error is definitely lower than with the previous model, but what we really care about is the model's performance on new data. But we shouldn't actually ever look at the test data. Instead, to compare these models, we can use cross-validation to "mimic" the train-test split.

In the following function we use the sklearn KFold cross validation class.

Here we define a five fold cross validation with

five_fold = KFold(n_splits=5)

Then we loop over the 5 splits and get the indicies (tr_ind) in the training data to use for training and the indices (va_ind) in the training data to use for validation:

for tr_ind, te_ind in five_fold.split(tr):

Valiating the model

The following helper function generates a plot comparing all the models in the transformations dictionary.

So not only did the new displacement / cylinders feature improve our training error, it also improved our cross-validation error. This indicates that this feature helps our model generalize, or in other words, that it has "predictive power."

Now let's try adding some categorical data, such as the origin column. As this is categorical data, we will have to one-hot encode this variable.

Fortunately, it looks like we have only three possible values for origin. We will use scikit-learn's one-hot encoder to do the transformation. Check out Lecture 14 for a refresher on how this works.

It looks like adding these new features about origin didn't really affect our model.

Let's try if we can gain any information from the name column. This column contains the make and model of each car. The models are fairly unique, so let's try to extract information about the brand (e.g. ford). The following cell shows the top 20 words that appear in this column.

It looks like there is at least one model here (corolla), but it does show the most common brands. We will add one column for each of these strings, with a 1 for a specific car indicating that the name of the car contains the string.

Note: In practice, you would use scikit-learn or some other package, but we will do this manually just to be explicit about what we're doing.

Interesting. Adding the brand information to our design matrix decreased our training error, but it increased our cross-validation error. Looks like we overfit!


In this section we explore the use of regularization techniques to address overfitting.

Ridge Regression

Ridge regression combines the ridge (L2, Squared) regularization function with the least squares loss.

$$ \hat{\theta}_\alpha = \arg \min_\theta \left[ \left(\frac{1}{n} \sum_{i=1}^n \left(Y_i - f_\theta(X_i)\right)^2 \right) + \alpha \sum_{k=1}^d \theta_k^2 \right] $$

Ridge regression, like ordinary least squares regression, also has a closed form solution for the optimal $\hat{\theta}_\alpha$

$$ \hat{\theta}_\alpha = \left(X^T X + n \alpha \mathbf{I}\right)^{-1} X^T Y $$

where $\mathbf{I}$ is the identity matrix, $n$ is the number of data points, and $\alpha$ is the regularization hyperparameter.

Notice that even if $X^T X$ is not full rank, the addition of $n \alpha \mathbf{I}$ (which is full rank) makes $\left(X^T X + n \alpha \mathbf{I}\right)$ invertible. Thus, ridge regression addresses the possible issue of having an underdetermined system and partially improves the numerical stability of the solution.

The Regularization Hyperparameter

The $\alpha$ parameter is our regularization hyperparameter. It is a hyperparameter because it is not a model parameter but a choice of how we want to balance fitting the data and "over-fitting". The goal is to find a value of this hyper"parameter to maximize our accuracy on the test data. However, we can't use the test data to make modeling decisions so we turn to cross validation. The standard way to find the best $\alpha$ is to try a bunch of values (perhaps using binary search) and take the one with the lowest cross validation error.

You may have noticed that in the video lecture we use $\lambda$ instead of $\alpha$. This is because many textbooks use $\lambda$ and sklearn uses $\alpha$.

Ridge Regression in SK Learn

Both Ridge Regression and Lasso are built-in functions in SKLearn. Let's start by importing the Ridge Regression class which behaves identically to the LinearRegression class we used earlier:

Take a look at the documentation. Notice the regularized loss function.

Instead of just looking at the top 20 brands, let's bag-of-words encode the entire name column.

Woah, our RMSE is really, really large! This is due to the fact that by adding all of these columns, our design matrix is no longer full rank. numpy tried to take the inverse of $\mathbb{X}^T \mathbb{X}$, but it ended up sending the parameters to really extreme values, leading to really extreme predictions.

To get around this, let's try regularization. As we're introducing regularization, let's also standardize our quantitative features:

That looks more like it. Let's add this model into our set:

Notice how our training error dropped significantly, but our CV error only changed a little bit. Let's try a different value of $\alpha$:

Oops, that was too much regularization. Let's do a search to find the best $\alpha$:

Lasso Regression

Lasso regression combines the absolute (L1) regularization function with the least squares loss.

$$ \hat{\theta}_\alpha = \arg \min_\theta \left(\frac{1}{n} \sum_{i=1}^n \left(Y_i - f_\theta(X_i)\right)^2 \right) + \alpha \sum_{k=1}^d |\theta_k| $$

Lasso is actually an acronym (and a cool name) which stands for Least Absolute Shrinkage and Selection Operator. It is an absolute operator because it is the absolute value. It is a shrinkage operator because it favors smaller parameter values. It is a selection operator because it has the peculiar property of pushing parameter values all the way to zero thereby selecting the remaining features. It is this last property that makes Lasso regression so useful. By using Lasso regression and setting sufficiently large value of $\alpha$ you can eliminate features that are not informative.

Unfortunately, there is no closed form solution for Lasso regression and so iterative optimization algorithms like gradient descent are typically used.

Let's compare the distribution of the parameters for both the RidgeCV and LassoCV models.