by Joseph Gonzalez (Spring 2020)
Note: scikit-learn's Pipeline functionality is explored at length in this notebook. IT IS NOT IN SCOPE FOR FALL 2020. Instead, focus on the bigger picture: how we split our data into train and test sets, and how we use cross validation.
In this notebook we will work through the train/test split and the process of cross validation.
As with other notebooks we will use the same set of standard imports.
import numpy as np
import pandas as pd
import plotly.offline as py
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
import cufflinks as cf
cf.set_config_file(offline=True, sharing=False, theme='ggplot');
from sklearn.linear_model import LinearRegression
For this notebook, we will use the seaborn mpg dataset, which describes the fuel mileage (measured in miles per gallon, or mpg) of various cars along with characteristics of those cars. Our goal will be to build a model that can predict the fuel mileage of a car based on the characteristics of that car.
from seaborn import load_dataset
data = load_dataset("mpg")
data
The first thing we will want to do with this data is construct a train/test split. Constructing a train/test split before EDA and data cleaning can often be helpful. This allows us to see whether our data cleaning and any conclusions we draw from visualizations generalize to new data. This can be done by re-running the data cleaning and EDA process on the test dataset.
We can sample the entire dataset to get a permutation and then select a range of rows.
shuffled_data = data.sample(frac=1., random_state=42)
shuffled_data
Selecting a range of rows for training and test
split_point = int(shuffled_data.shape[0]*0.90)
split_point
tr = shuffled_data.iloc[:split_point]
te = shuffled_data.iloc[split_point:]
tr
te
Checking that they add up.
len(tr) + len(te) == len(data)
We can use the train_test_split function from sklearn.model_selection to do this easily.
from sklearn.model_selection import train_test_split
tr, te = train_test_split(data, test_size=0.1, random_state=83)
tr
te
Out of curiosity, what does the mpg field look like? I am going to look at both the train and test distributions, but in practice we should avoid looking at the test data.
ff.create_distplot([tr['mpg'], te['mpg']], ['train mpg', 'test mpg'])
Let's go through the process of building a model. Let's start by looking at just engine characteristics like "cylinders" and "displacement". We will first use just our own feature function (as we did in previous lectures). Then we will introduce how to use sklearn Pipelines to combine feature functions and models. As we will see, by combining the feature function and model, we can simplify subsequent training and testing, since we are guaranteed that our feature functions are the same on both the training and test datasets.
My first feature function will just extract the two features that I want to use in my model.
def phi(df):
    return df[["cylinders", "displacement"]]
Then I fit an sklearn LinearRegression model to my training data.
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(phi(tr), tr['mpg'])
To evaluate the error we will use the Root Mean Squared Error (RMSE), which is like the mean squared error but in the correct units (mpg instead of mpg²).
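Written out, for n points with true values y and predictions ŷ:

$$\text{RMSE}(y, \hat{y}) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}$$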
def rmse(y, yhat):
    return np.sqrt(np.mean((y - yhat)**2))
The training error is:
Y_hat = model.predict(phi(tr))
Y = tr['mpg']
print("Training Error (RMSE):", rmse(Y, Y_hat))
The test error is:
Y_hat = model.predict(phi(te))
Y = te['mpg']
print("Test Error (RMSE):", rmse(Y, Y_hat))
Oh no! We just used the test data to evaluate our model! We shouldn't have done that.
(Don't worry, we are trained professionals and this is only for demonstration purposes. But seriously, don't try this at home.)
Notice: The test error is slightly higher than the training error. This is typically (but not always) the case. Sometimes we get lucky and the test data is "easier to predict" or happens to closely follow the training data.
Again, for Fall 2020, you do not need to know how to use Pipelines in scikit-learn. They are quite involved. Fortunately, they are merely an accessory to the concepts of this lecture, and not the core of it. If you treat each instance of a Pipeline as a black-box way of specifying which features our model should have, you will be able to understand the cross-validation content just fine.
We have removed much of the dialogue around Pipeline from this lecture, but if you're interested, you can skim the documentation on pipelines.
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
model = Pipeline([
("SelectColumns", ColumnTransformer([("keep", "passthrough", ["cylinders", "displacement"])])),
("LinearModel", LinearRegression())
])
model.fit(tr, tr['mpg']);
In this notebook (and in life) we will want to keep track of all our models. Here I will store the models in a dictionary with a (not great) name so I can remember which model is which and can easily compare my models in a plot.
models = {"c+d": model}
We might also want to look at the displacement per cylinder. This is an additional feature transformation that we can add to the first stage of our pipeline. To define this transformation we first need to create a function transformer:
from sklearn.preprocessing import FunctionTransformer
def compute_volume(X):
    return np.expand_dims(X[:, 1] / X[:, 0], axis=1)
volume_transformer = FunctionTransformer(compute_volume, validate=True)
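As a quick sanity check, we can apply the transformer by hand to a couple of made-up rows of [cylinders, displacement] values (a minimal sketch; these rows are hypothetical, not taken from the dataset):

# The transformer should return displacement / cylinders as a single column.
example = np.array([[4., 120.], [8., 350.]])
volume_transformer.fit_transform(example)
# expected: array([[30.  ], [43.75]])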
We can then add this as an additional column transformation:
model = Pipeline([
("SelectColumns", ColumnTransformer([
("keep", "passthrough", ["cylinders", "displacement"]),
("cyl_vol", volume_transformer, ["cylinders", "displacement"])])),
("LinearModel", LinearRegression())
])
model.fit(tr, tr['mpg']);
Again, we evaluate the model on our training dataset and see a reduction in error:
Y_hat = model.predict(tr)
Y = tr['mpg']
print("Training Error (RMSE):", rmse(Y, Y_hat))
models["c+d+d/c"] = model
We can now add additional features about the car.
quantitative_features = ["cylinders", "displacement", "horsepower", "weight", "acceleration"]
model = Pipeline([
("SelectColumns", ColumnTransformer([
("keep", "passthrough", quantitative_features),
("cyl_vol", volume_transformer, ["cylinders", "displacement"])])),
("LinearModel", LinearRegression())
])
I have put the following code in a try/except statement because I know it will raise an error. What do you think will go wrong?
try:
    model.fit(tr, tr['mpg'])
except ValueError as err:
    print(err)
There appear to be NaN (missing) values in the data (take a look at the horsepower column). We need to deal with these missing values. In previous lectures I mentioned imputation; a standard imputation technique is to replace the missing value with the mean for that column. Scikit-learn has a built-in imputation function that we can add to our pipeline after we select the desired columns. The imputation will actually be applied to all the columns. If we wanted to apply it to a specific column, then we would need to put it inside the ColumnTransformer (see the sketch after the pipeline below).
Notice: The imputation function actually needs to be fit to data, so it is also part of the model.
from sklearn.impute import SimpleImputer
model = Pipeline([
("SelectColumns", ColumnTransformer([
("keep", "passthrough", quantitative_features),
("cyl_vol", volume_transformer, ["cylinders", "displacement"])])),
("Imputation", SimpleImputer()),
("LinearModel", LinearRegression())
])
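As an aside, here is a hypothetical variant (not used in the rest of this notebook) illustrating how the imputer could instead be nested inside the ColumnTransformer so that only the horsepower column is imputed:

# Hypothetical sketch: impute only "horsepower" and pass the other columns through.
model_hp_only = Pipeline([
    ("SelectColumns", ColumnTransformer([
        ("keep", "passthrough", ["cylinders", "displacement", "weight", "acceleration"]),
        ("impute_hp", SimpleImputer(), ["horsepower"]),
        ("cyl_vol", volume_transformer, ["cylinders", "displacement"])])),
    ("LinearModel", LinearRegression())
])

We continue below with the pipeline defined above, which imputes all of the selected columns.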
We can now train our model.
model.fit(tr, tr['mpg']);
Saving the model for later comparison:
models['c+d+d/c+h+w+a'] = model
Evaluating the training error:
Y_hat = model.predict(tr)
Y = tr['mpg']
print("Training Error (RMSE):", rmse(Y, Y_hat))
We reduced the training error, but what about the test error? We really shouldn't look at the test error, so instead we will use cross validation to compare model accuracy:
Here we define a five-fold cross validation with five_fold = KFold(n_splits=5). We then loop over the 5 splits with for tr_ind, va_ind in five_fold.split(tr):, which gives us the indices (tr_ind) in the training data to use for training and the indices (va_ind) in the training data to use for validation.
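Before wrapping this in a helper function, we can peek at what the splitter produces (a small illustrative sketch; the exact fold sizes depend on the split above):

from sklearn.model_selection import KFold

# Each iteration yields integer index arrays for the training portion
# and the held-out validation portion of the training set.
five_fold = KFold(n_splits=5)
for tr_ind, va_ind in five_fold.split(tr):
    print("train fold size:", len(tr_ind), " validation fold size:", len(va_ind))

The helper function below clones the model, fits the clone on each training fold, records the RMSE on the corresponding validation fold, and returns the average across the five folds: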
from sklearn.model_selection import KFold
from sklearn.base import clone
def cross_validate_rmse(model):
    model = clone(model)
    five_fold = KFold(n_splits=5)
    rmse_values = []
    for tr_ind, va_ind in five_fold.split(tr):
        model.fit(tr.iloc[tr_ind, :], tr['mpg'].iloc[tr_ind])
        rmse_values.append(rmse(tr['mpg'].iloc[va_ind], model.predict(tr.iloc[va_ind, :])))
    return np.mean(rmse_values)
Validating the model:
cross_validate_rmse(model)
The following helper function generates a plot comparing all the models in the models dictionary.
def compare_models(models):
    # Compute the training error for each model
    training_rmse = [rmse(tr['mpg'], model.predict(tr)) for model in models.values()]
    # Compute the cross validation error for each model
    validation_rmse = [cross_validate_rmse(model) for model in models.values()]
    # Compute the test error for each model (don't do this!)
    test_rmse = [rmse(te['mpg'], model.predict(te)) for model in models.values()]
    names = list(models.keys())
    fig = go.Figure([
        go.Bar(x=names, y=training_rmse, name="Training RMSE"),
        go.Bar(x=names, y=validation_rmse, name="CV RMSE"),
        go.Bar(x=names, y=test_rmse, name="Test RMSE", opacity=.3)])
    return fig
fig = compare_models(models)
fig.update_yaxes(range=[2,5.1], title="RMSE")
Notice I made the Test RMSE invisible(ish) because you shouldn't look at it until we are done. But again, for demonstration purposes, I plotted it so we can see how it compares to the training and cross validation errors.
Can you improve the model further? Let's try adding the model year.
quantitative_features = ["cylinders", "displacement", "horsepower", "weight", "acceleration", "model_year"]
model = Pipeline([
("SelectColumns", ColumnTransformer([
("keep", "passthrough", quantitative_features),
("cyl_vol", volume_transformer, ["cylinders", "displacement"])
])),
("Imputation", SimpleImputer()),
("LinearModel", LinearRegression())
])
model.fit(tr, tr['mpg'])
models['c+d+d/c+h+w+a+y'] = model
Comparing the models
fig = compare_models(models)
fig.update_yaxes(range=[2,5.1], title="RMSE")
The model year improved accuracy quite a bit! This improvement also appears to generalize, since it reduced the cross validation error as well.
Can we use the car's name to predict MPG? The name contains general features like the manufacturer that might help, but it also contains the specific vehicle model, which is probably too specific; moreover, not all the vehicles in the test set will appear in training.
Let's try applying the CountVectorizer (which implements bag-of-words features). At this point we are also likely to have too many dimensions in our model, and we are not applying any regularization technique to compensate (because we haven't covered regularization in lecture yet).
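To see what the bag-of-words encoding does, here is a small sketch applying CountVectorizer to a few example name strings (chosen purely for illustration):

from sklearn.feature_extraction.text import CountVectorizer

# Each string becomes a vector of word counts over the learned vocabulary.
toy_names = ["ford galaxie 500", "chevrolet impala", "ford torino"]
vec = CountVectorizer()
counts = vec.fit_transform(toy_names)
print(vec.vocabulary_)   # word -> column index
print(counts.toarray())  # one row of counts per name

Adding the CountVectorizer on the name column to our pipeline: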
from sklearn.feature_extraction.text import CountVectorizer
model = Pipeline([
("SelectColumns", ColumnTransformer([
("keep", "passthrough", quantitative_features),
("cyl_vol", volume_transformer, ["cylinders", "displacement"]),
("text", CountVectorizer(), "name")
])),
("Imputation", SimpleImputer()),
("LinearModel", LinearRegression())
])
Notice that we are using an additional column transformation.
model.fit(tr, tr['mpg'])
models['c+d+d/c+h+w+a+y+n'] = model
fig = compare_models(models)
fig.update_yaxes(range=[0,5.1], title="RMSE")
Overfitting! We substantially reduced the training error but actually made the generalization error worse!
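Regularization (which we have not covered in lecture yet) is the standard way to compensate for the extra dimensions. Purely as a preview, here is a hedged sketch (not part of this lecture's material) of the same pipeline with the LinearRegression stage swapped for a Ridge penalty; the alpha value is an arbitrary illustrative choice:

from sklearn.linear_model import Ridge

# Preview sketch: identical feature pipeline, but Ridge shrinks the many
# bag-of-words coefficients instead of fitting them freely.
ridge_model = Pipeline([
    ("SelectColumns", ColumnTransformer([
        ("keep", "passthrough", quantitative_features),
        ("cyl_vol", volume_transformer, ["cylinders", "displacement"]),
        ("text", CountVectorizer(), "name")
    ])),
    ("Imputation", SimpleImputer()),
    ("LinearModel", Ridge(alpha=10))
])
ridge_model.fit(tr, tr['mpg']);

Finally, we take the model that performed best under cross validation, refit it, and check its error: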
best_model = clone(models['c+d+d/c+h+w+a+y'])
best_model.fit(data, data['mpg']);
rmse(best_model.predict(te), te['mpg'])