Lecture 16, Part 1 – Data 100, Fall 2020

by Joseph Gonzalez (Spring 2020)

Note: scikit-learn's Pipeline functionality is explored at length in this notebook. IT IS NOT IN SCOPE FOR FALL 2020. Instead, focus on the bigger picture, of how we are splitting our data into train and test, and how we are using cross validation.

Train Test Split and Cross Validation

In this notebook we will work through the train test-split and the process of cross validation.


As with other notebooks we will use the same set of standard imports.

The Data

For this notebook, we will use the seaborn mpg dataset which describes the fuel mileage (measured in miles per gallon or mpg) of various cars along with characteristics of those cars. Our goal will be to build a model that can predict the fuel mileage of a car based on the characteristics of that car.

Train Test Split

The first thing we will want to do with this data is construct a train/test split. Constructing a train test split before EDA and data cleaning can often be helpful. This allows us to see if our data cleaning and any conclusions we draw from visualizations generalize to new data. This can be done by re-running the data cleaning and EDA process on the test dataset.

Using Pandas Operations

We can sample the entire dataset to get a permutation and then select a range of rows.

Selecting a range of rows for training and test

Checking that they add up.

Using SKLearn

We can use the train_test_split function from sklearn.model_selection to do this easily.

Quick Visualization

Out of curiosity what does the mgp field look like? I am going to look at both the train and test distributions but in practice we should avoid looking at the test data.

Building A Basic Model

Let's go through the process of building a model. Let's start by looking at just engine characteristics like "cylinders" and the "displacement". We will first use just our own feature function (as we did in previous lectures). Then we will introduce how to use sklearn Pipelines to combine feature functions and models. As we will see, by combining the feature function and model, we can simplify subsequent training and testing since we are guaranteed that our feature functions are the same on both the training and test datasets.

My first feature function will just extract the two features that I want to use in my model.

Then I fit an sklearn LinearRegression model to my training data.

To evaluate the error we will use the Root Mean Squared Error (RMSE) which is like the mean squared error but in the correct units (mpg) instead of (mpg^2).

The training error is:

Don't try this at home!

The test error is:

Oh no! We just used the test data to evaluate our model! We shouldn't have done that.

(Don't worry, we are trained professionals and this is only for demonstration purposes. But seriously, don't try this at home.)

Notice: The test error is slightly higher than the training error. This is typically (but not always) the case. Sometimes we get lucky and the test data is "easier to predict" or happens to closely follow the training data.

SKLearn Pipelines

Again, for Fall 2020, you do not need to know how to use Pipelines in scikit-learn. They are quite involved. Fortunately, they are merely an accessory to the concepts of this lecture, and not the core of it. If you treat each instance of a Pipeline as a black-box way of specifying which features our model should have, you will be able to understand the cross-validation content just fine.

We have removed much of the dialogue around Pipeline from this lecture, but if you're interested, you can skim the documentation on pipelines.

Keeping track of all the models.

In this notebook (and in life) we will want to keep track of all our models. Here I will store the models in a dictionary with a (not great) name so I can remember which model is which and can easily compare my models in a plot.

More Feature Transformations

We might also want to look at the displacement per cylinder. This is an additional feature transformation that we can add to the first stage of our pipeline. To define this transformation we first need to create a function transformer:

We can then add this as an additional column transformation:

Again, we evaluate the model on our training dataset and see a reduction in error:

Adding More Features

We can now add additional features about the car.

I have put the following code in a try/except statement because I know it will raise an error. What do you think will go wrong?

There appear to be NaN (missing values) in the data (take a look at the horsepower column). We need to deal with these missing values. In previous lectures I mentioned imputation and a standard imputation technique is to replace the missing value with the mean for that column. Scikit learn has a built-in imputation function that we can add to our pipeline after we select the desired columns. The imputation will actually be applied to all the columns. If we wanted to apply it to a specific column then we would need to put it inside the ColumnTransformer.

Notice: The imputation function actually needs to be fit to data so it is also part of the model.

We can now train our model.

Saving the model for later comparison:

Evaluating the training error:

We reduced the training error but what about the test error? We really shouldn't look at the test error so instead we will use cross validation to compare the accuracy:

Cross Validation

In the following function we use the sklearn KFold cross validation class.

Here we define a five fold cross validation with

five_fold = KFold(n_splits=5)

Then we loop over the 5 splits and get the indicies (tr_ind) in the training data to use for training and the indices (va_ind) in the training data to use for validation:

for tr_ind, te_ind in five_fold.split(tr):

Valiating the model

The following helper function generates a plot comparing all the models in the models dictionary.

Notice I made the Test RMSE invisible(ish) because you shouldn't look at it until we are done. But again for demonstration purposes I plotted in so we can see how it compares to the training and cross validation errors.

Can you improve the model further? Let's try adding the model year

Comparing the models

The model year improved accuracy quite a bit! This improvement also appears to generalize as it also reduced the cross validation error.

Going too Far?

Can we use the car's name to predict MPG? The name contains general features like the manufacturer that might help but it also contains the vehicle make which is probably too specific and not all the vehicles in test will appear in training.

Let's try by applying the CountVectorizer (which implements the bag of words features). At this point we are also likely to have too many dimensions in our model and we are not applying any regularization technique to compensate (because we haven't covered regularization in lecture yet.)

Notice: That we are using an additional column transformation.

Overfitting!. We substantially reduced the training error but actually made the generalization error worse!