In this notebook we explore the use of regularization techniques to address overfitting.
from IPython.display import YouTubeVideo
YouTubeVideo("MjxeN7OVP1Y")
As with the other notebooks, we will use the same set of standard imports.
import numpy as np
import pandas as pd
import plotly.offline as py
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
import cufflinks as cf
cf.set_config_file(offline=True, sharing=False, theme='ggplot');
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
For this notebook, we will use the seaborn mpg
dataset which describes the fuel mileage (measured in miles per gallon or mpg) of various cars along with characteristics of those cars. Our goal will be to build a model that can predict the fuel mileage of a car based on the characteristics of that car.
from seaborn import load_dataset
data = load_dataset("mpg")
data
The first thing we will want to do with this data is construct a train/test split.
tr, te = train_test_split(data, test_size=0.25, random_state=83)
In the previous lecture we introduced how to use sklearn pipelines to build a model which contains its feature engineering steps. Here we will create a sequence of models that use progressively more of the quantitative features.
YouTubeVideo("rBjgkCbrHqs")
models = {}
quantitative_features = ["cylinders", "displacement", "horsepower", "weight", "acceleration", "model_year"]
for i in range(len(quantitative_features)):
    # The features to include in the ith model
    features = quantitative_features[:(i+1)]
    # The name we are giving to the ith model
    name = ",".join([name[0] for name in features])
    # The pipeline for the ith model
    model = Pipeline([
        ("SelectColumns", ColumnTransformer([
            ("keep", "passthrough", features),
        ])),
        ("Imputation", SimpleImputer()),
        ("LinearModel", LinearRegression())
    ])
    # Fit the pipeline
    model.fit(tr, tr['mpg'])
    # Saving the ith model
    models[name] = model
We have defined and trained (fit) 6 models, one for each prefix of the quantitative features:
models.keys()
In the previous lecture, we introduced cross validation and described how to implement it. In this lecture, we will use the sklearn implementation of cross validation to estimate our test error without looking at the test data.
YouTubeVideo("B_93YOsvVEo")
from sklearn.model_selection import cross_val_score
The cross_val_score function takes a score function which it cross validates. That score function must take three arguments: the model, X, and y. We implement a Root Mean Squared Error (RMSE) score function:
def rmse_score(model, X, y):
    return np.sqrt(np.mean((y - model.predict(X))**2))
We can then use cross_val_score to estimate the test RMSE for the model that uses only the number of cylinders (the model named c):
cross_val_score(models['c'], tr, tr['mpg'], scoring=rmse_score, cv=5)
Taking the mean we get the average cross validation (CV) RMSE:
np.mean(cross_val_score(models['c'], tr, tr['mpg'], scoring=rmse_score, cv=5))
In the following helper function we plot the training and CV RMSE. We also plot the test RMSE for educational purposes only: you should not do this! If you use the height of the test RMSE bar to make decisions when designing your model, you have invalidated the test data, which is a bit like looking at the exam while studying for it.
def compare_models(models):
    # Compute the training error for each model
    training_rmse = [rmse_score(model, tr, tr['mpg']) for model in models.values()]
    # Compute the cross validation error for each model
    validation_rmse = [np.mean(cross_val_score(model, tr, tr['mpg'], scoring=rmse_score, cv=5))
                       for model in models.values()]
    # Compute the test error for each model (don't do this!)
    test_rmse = [rmse_score(model, te, te['mpg']) for model in models.values()]
    names = list(models.keys())
    fig = go.Figure([
        go.Bar(x = names, y = training_rmse, name="Training RMSE"),
        go.Bar(x = names, y = validation_rmse, name="CV RMSE"),
        go.Bar(x = names, y = test_rmse, name="Test RMSE", opacity=.3)])
    fig.update_yaxes(title="RMSE")
    return fig
compare_models(models)
Notice that as we added features the training and cross validation error generally decreased. Also notice that the cross validation error and the test error are generally higher than the training error.
In the previous lecture we also introduced text features, and I suggested you might want to add the origin of the vehicle as well. Here we will add both of these features.
The origin is a categorical feature which takes on only a few values:
tr['origin'].value_counts()
This can be extracted using a OneHotEncoder in the SelectColumns stage of the pipeline. Notice that ["origin"] is in brackets; this caused me some confusion when preparing the lecture. The OneHotEncoder wants its input as a 2D column (which is what the bracketed list of column names selects from the DataFrame) rather than a 1D series.
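To see why the brackets matter, here is a minimal sketch (not part of the original lecture) showing that the OneHotEncoder accepts the 2D selection tr[["origin"]] but would reject the 1D series tr["origin"]:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
# Works: tr[["origin"]] is a DataFrame with one column, so we get one
# indicator column per origin value.
enc.fit_transform(tr[["origin"]]).toarray()[:5]
# enc.fit_transform(tr["origin"])  # would raise an error: 1D input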
from sklearn.preprocessing import OneHotEncoder
model = Pipeline([
    ("SelectColumns", ColumnTransformer([
        ("keep", "passthrough", quantitative_features),
        ("origin_encoder", OneHotEncoder(), ["origin"])
    ])),
    ("Imputation", SimpleImputer()),
    ("LinearModel", LinearRegression())
])
model.fit(tr, tr['mpg'])
name = ",".join([name[0] for name in quantitative_features]) + ",o"
models[name] = model
In the previous lecture we added the vehicle name using the CountVectorizer
which implements a bag-of-words encoding.
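As a quick illustration (a sketch, not part of the original lecture), the CountVectorizer builds a vocabulary from the vehicle names and encodes each name as a vector of word counts. Unlike the OneHotEncoder, it expects a 1D sequence of strings, which is why "name" is passed without brackets in the pipeline below:
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
# Each row becomes a (sparse) vector of word counts over the learned vocabulary.
bag = vec.fit_transform(tr["name"])
bag.shape, vec.get_feature_names()[:10]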
from sklearn.feature_extraction.text import CountVectorizer
model = Pipeline([
    ("SelectColumns", ColumnTransformer([
        ("keep", "passthrough", quantitative_features),
        ("origin_encoder", OneHotEncoder(), ["origin"]),
        ("text", CountVectorizer(), "name")
    ])),
    ("Imputation", SimpleImputer()),
    ("LinearModel", LinearRegression())
])
model.fit(tr, tr['mpg'])
name = ",".join([name[0] for name in quantitative_features]) + ",o,n"
models[name] = model
compare_models(models)
Notice that adding the origin of the vehicle resulted in a small decrease in both training and validation error, but adding the name resulted in a large decrease in training error and a moderate increase in validation error. The introduction of the vehicle name feature resulted in overfitting. To see why this might have happened, let's look at the number of features in each model.
fig = go.Figure()
fig.add_trace(go.Bar(
    x = list(models.keys()),
    y = [len(models[m]["LinearModel"].coef_) for m in models]
))
fig.update_yaxes(title="Number of Features", type="log")
Notice that the above plot is in log scale on the y axis! The addition of the model name resulted in a large jump in features. Perhaps some of these new features are useful and some are not. What we really want now is a mechanism to select which of these features to keep and which to ignore or down-weight. This can be done by using regularization.
Broadly speaking, regularization refers to methods used to control over-fitting. However, in this class we will focus on parametric regularization techniques. The following video provides a high-level overview of parametric regularization in the context of loss minimization.
YouTubeVideo("NqKtsZpHmRY")
A brief note on the following text.
YouTubeVideo("0ffDWUdSvHg")
The simplest way to think about regularization is in the context of our original loss minimization problem. Recall that we defined the loss function which determines the best parameter value $\hat{\theta}$ for our model.
$$ \hat{\theta} = \arg \min_\theta \frac{1}{n} \sum_{i=1}^n \textbf{Loss}\left(Y_i, f_\theta(X_i)\right) $$The loss captures how well our model fits the data. What we need is a way to penalize models that over-fit to the data. We can do this by adding an extra term to our loss minimization problem:
$$ \hat{\theta} = \arg \min_\theta \frac{1}{n} \sum_{i=1}^n \textbf{Loss}\left(Y_i, f_\theta(X_i)\right) + \alpha \textbf{Reg}(f_\theta, X, Y) $$The Reg function measures how much our model overfits, and the $\alpha$ parameter (really a hyper-parameter) allows us to balance fitting our data against overfitting. The remaining challenge is how to define this Reg function and determine the value of our additional $\alpha$ parameter.
The $\alpha$ parameter is our regularization hyper-parameter. It is a hyper-parameter because it is not a model parameter but a choice of how we want to balance fitting the data and "over-fitting". The goal is to find a value of this hyper-parameter to maximize our accuracy on the test data. However, we can't use the test data to make modeling decisions so we turn to cross validation. The standard way to find the best $\alpha$ is to try a bunch of values (perhaps using binary search) and take the one with the lowest cross validation error.
You may have noticed that in the video lecture we use $\lambda$ instead of $\alpha$. This is because many textbooks use $\lambda$ and sklearn uses $\alpha$.
In our cartoon formulation of the regularized loss, the regularization function Reg is supposed to "measure" how much our model "overfits". It depends on the model parameters and also can depend on the data. An obvious choice for this function could be the gap between training and cross validation error but that is difficult to optimize and somewhat circular since both training and validation error depend on solving the regularized loss minimization problem.
We have already seen that the more features we have the more likely we are to overfit to our data. For our linear model, if we set a parameter to 0 then that is the same as not having that feature. Therefore, one possible regularization function could be to count the non-zero parameters in our model.
$$ \textbf{Reg}(\theta) = \sum_{k=1}^d \left(\theta_k \neq 0\right) $$To minimize this function alone would be to ignore all the features, and that would certainly not overfit. This is actually the "feature selection" regularization objective. Unfortunately, optimizing this objective is very hard (NP-hard).
However, there are some good approximations we can use:
$$ \textbf{Reg}^\text{ridge}(\theta) = \sum_{k=1}^d \theta_k^2 $$and
$$ \textbf{Reg}^\text{lasso}(\theta) = \sum_{k=1}^d \left|\theta_k \right| $$Each of these regularization functions (and their combinations) gives rise to a different regression technique.
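To make these penalty functions concrete, here is a tiny sketch (toy numbers, not from the lecture) computing the feature-count, ridge, and lasso penalties for a small coefficient vector:
theta = np.array([0.0, 3.0, -0.5, 0.0, 2.0])   # a toy coefficient vector
num_nonzero = np.count_nonzero(theta)          # feature-selection penalty: 3
ridge_penalty = np.sum(theta**2)               # ridge penalty (sum of squares): 13.25
lasso_penalty = np.sum(np.abs(theta))          # lasso penalty (sum of absolute values): 5.5
num_nonzero, ridge_penalty, lasso_penalty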
Ridge regression combines the ridge (L2, Squared) regularization function with the least squares loss.
$$ \hat{\theta}_\alpha = \arg \min_\theta \left(\frac{1}{n} \sum_{i=1}^n \left(Y_i - f_\theta(X_i)\right)^2 \right) + \alpha \sum_{k=1}^d \theta_k^2 $$Ridge regression, like ordinary least squares regression, also has a closed form solution for the optimal $\hat{\theta}_\alpha$
$$ \hat{\theta}_\alpha = \left(X^T X + n \alpha \mathbf{I}\right)^{-1} X^T Y $$where $\mathbf{I}$ is the identity matrix, $n$ is the number of data points, and $\alpha$ is the regularization hyper-parameter.
Notice that even if $X^T X$ is not full rank, the addition of $n \alpha \mathbf{I}$ (which is full rank) makes $\left(X^T X + n \alpha \mathbf{I}\right)$ invertible. Thus, ridge regression addresses our earlier issue of having an underdetermined system and partially improves the numerical stability of the solution.
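As a sanity check on the closed form above, here is a hedged sketch (using small random data rather than the mpg features) that computes $\hat{\theta}_\alpha$ directly with NumPy and compares it to sklearn's Ridge. Note that sklearn's Ridge does not divide the squared loss by $n$, so its alpha argument corresponds to $n\alpha$ in our notation.
from sklearn.linear_model import Ridge
# Toy data, used only to check the formula.
rng = np.random.default_rng(42)
n, d = 100, 5
X = rng.normal(size=(n, d))
Y = X @ rng.normal(size=d) + rng.normal(scale=0.1, size=n)
alpha = 0.5
# Closed-form ridge solution: (X^T X + n * alpha * I)^{-1} X^T Y
theta_hat = np.linalg.solve(X.T @ X + n * alpha * np.eye(d), X.T @ Y)
# sklearn minimizes ||Y - X theta||^2 + alpha_sk * ||theta||^2, so alpha_sk = n * alpha.
theta_sklearn = Ridge(alpha=n * alpha, fit_intercept=False).fit(X, Y).coef_
np.allclose(theta_hat, theta_sklearn)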
Lasso regression combines the absolute (L1) regularization function with the least squares loss.
$$ \hat{\theta}_\alpha = \arg \min_\theta \left(\frac{1}{n} \sum_{i=1}^n \left(Y_i - f_\theta(X_i)\right)^2 \right) + \alpha \sum_{k=1}^d |\theta_k| $$Lasso is actually an acronym (and a cool name) which stands for Least Absolute Shrinkage and Selection Operator. It is an absolute operator because it uses the absolute value. It is a shrinkage operator because it favors smaller parameter values. It is a selection operator because it has the peculiar property of pushing parameter values all the way to zero, thereby selecting the remaining features. It is this last property that makes Lasso regression so useful. By using Lasso regression and setting a sufficiently large value of $\alpha$, you can eliminate features that are not informative.
Unfortunately, there is no closed form solution for Lasso regression and so iterative optimization algorithms like gradient descent are typically used.
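For intuition, here is a minimal sketch of one such iterative method, proximal gradient descent (ISTA) with soft-thresholding, applied to the lasso objective in our notation. This is an illustration only; sklearn's Lasso actually uses coordinate descent and scales the squared loss slightly differently.
def soft_threshold(z, t):
    # Proximal operator of t * ||.||_1: shrink each coordinate toward zero by t.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, Y, alpha, iterations=2000):
    # Minimize (1/n) ||Y - X theta||^2 + alpha ||theta||_1 by proximal gradient descent.
    n, d = X.shape
    # Step size 1/L, where L is the Lipschitz constant of the smooth part's gradient.
    L = 2 * np.linalg.norm(X, ord=2)**2 / n
    theta = np.zeros(d)
    for _ in range(iterations):
        grad = 2 * X.T @ (X @ theta - Y) / n
        theta = soft_threshold(theta - grad / L, alpha / L)
    return theta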
When applying Ridge or Lasso regression it is important that you normalize your features. Why (think about it for a second)?
The issue is that if you have some very large and some very small features, then the relative magnitudes of the coefficients may differ substantially. Think of the coefficients as unit translations from the units of the features (e.g., cubic centimeters and horsepower) to the units of the thing you are trying to predict (e.g., miles per gallon). The regularization function treats all these unit-translation coefficients equally, but some may need to be much larger than others. Therefore, by standardizing the features we can address this variation. You can standardize a feature by computing:
$$ z = \frac{x - \textbf{Mean}(x)}{\textbf{StdDev}(x)} $$Both Ridge Regression and Lasso are built into SKLearn. Let's start by importing the Ridge regression class, which has the same interface as the LinearRegression class we used earlier:
YouTubeVideo("frdGPG10dOA")
from sklearn.linear_model import Ridge
Take a look at the documentation. Notice the regularized loss function.
Ridge?
We can plug the Ridge regression model in place of the LinearRegression model in our earlier pipeline:
ridge_model = Pipeline([
    ("SelectColumns", ColumnTransformer([
        ("keep", "passthrough", quantitative_features),
        ("origin_encoder", OneHotEncoder(), ["origin"]),
        ("text", CountVectorizer(), "name")
    ])),
    ("Imputation", SimpleImputer()),
    ("LinearModel", Ridge(alpha=0.5))
])
ridge_model.fit(tr, tr['mpg'])
models["Ridge(alpha=0.5)"] = ridge_model
compare_models(models)
We should also standardize our features:
from sklearn.preprocessing import StandardScaler
ridge_model = Pipeline([
    ("SelectColumns", ColumnTransformer([
        ("keep", StandardScaler(), quantitative_features),
        ("origin_encoder", OneHotEncoder(), ["origin"]),
        ("text", CountVectorizer(), "name")
    ])),
    ("Imputation", SimpleImputer()),
    # ("Standarize", StandardScaler(with_mean=False)),
    ("LinearModel", Ridge(alpha=0.5))
])
ridge_model.fit(tr, tr['mpg'])
models["RidgeN(alpha=0.5)"] = ridge_model
compare_models(models)
Notice that as we introduce regularization we are reducing the cross validation error and also increasing the training error.
What about different $\alpha$ hyper-parameter values?
ridge_model = Pipeline([
    ("SelectColumns", ColumnTransformer([
        ("keep", StandardScaler(), quantitative_features),
        ("origin_encoder", OneHotEncoder(), ["origin"]),
        ("text", CountVectorizer(), "name")
    ])),
    ("Imputation", SimpleImputer()),
    ("LinearModel", Ridge(alpha=10))
])
ridge_model.fit(tr, tr['mpg'])
models["RidgeN(alpha=10)"] = ridge_model
compare_models(models)
That is too much regularization. Let's use cross validation to pick the best value.
The following uses cross validation to tune the regularization parameter.
ridge_model = Pipeline([
    ("SelectColumns", ColumnTransformer([
        ("keep", StandardScaler(), quantitative_features),
        ("origin_encoder", OneHotEncoder(), ["origin"]),
        ("text", CountVectorizer(), "name")
    ])),
    ("Imputation", SimpleImputer()),
    ("LinearModel", Ridge(alpha=10))
])

alphas = np.linspace(0.5, 20, 30)
cv_values = []
train_values = []
test_values = []
for alpha in alphas:
    ridge_model.set_params(LinearModel__alpha=alpha)
    cv_values.append(np.mean(cross_val_score(ridge_model, tr, tr['mpg'], scoring=rmse_score, cv=5)))
    ridge_model.fit(tr, tr['mpg'])
    train_values.append(rmse_score(ridge_model, tr, tr['mpg']))
    test_values.append(rmse_score(ridge_model, te, te['mpg']))
We can plot the cross validation error against the different $\alpha$ values and pick the $\alpha$ with the smallest cross validation error.
fig = go.Figure()
fig.add_trace(go.Scatter(x = alphas, y = train_values, mode="lines+markers", name="Train"))
fig.add_trace(go.Scatter(x = alphas, y = cv_values, mode="lines+markers", name="CV"))
fig.add_trace(go.Scatter(x = alphas, y = test_values, mode="lines+markers", name="Test"))
fig.update_layout(xaxis_title=r"$\alpha$", yaxis_title="RMSE")
Adding the best model:
best_alpha = alphas[np.argmin(cv_values)]
ridge_model.set_params(LinearModel__alpha=best_alpha)
ridge_model.fit(tr, tr['mpg'])
models["RidgeN(alpha_best)"] = ridge_model
compare_models(models)
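As an aside, the same search over $\alpha$ can also be written with sklearn's GridSearchCV; the sketch below (not used in the rest of this notebook) wraps our rmse_score with a sign flip because GridSearchCV picks the parameters that maximize the score:
from sklearn.model_selection import GridSearchCV

def neg_rmse_score(model, X, y):
    # GridSearchCV maximizes its score, so we negate the RMSE.
    return -rmse_score(model, X, y)

alpha_search = GridSearchCV(
    ridge_model,
    param_grid={"LinearModel__alpha": np.linspace(0.5, 20, 30)},
    scoring=neg_rmse_score,
    cv=5)
alpha_search.fit(tr, tr['mpg'])
alpha_search.best_params_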
We didn't have to do all of that work. SKLearn has a Ridge Regression class with built-in cross validation:
from sklearn.linear_model import RidgeCV
alphas = np.linspace(0.5, 3, 30)
ridge_model = Pipeline([
    ("SelectColumns", ColumnTransformer([
        ("keep", StandardScaler(), quantitative_features),
        ("origin_encoder", OneHotEncoder(), ["origin"]),
        ("text", CountVectorizer(), "name")
    ])),
    ("Imputation", SimpleImputer()),
    ("LinearModel", RidgeCV(alphas=alphas))
])
ridge_model.fit(tr, tr['mpg'])
models["RidgeCV"] = ridge_model
compare_models(models)
Similarly, we can swap Ridge Regression for Lasso regression by simply importing the Lasso and LassoCV classes:
YouTubeVideo("hqZNVrZ3flw")
from sklearn.linear_model import Lasso, LassoCV
lasso_model = Pipeline([
    ("SelectColumns", ColumnTransformer([
        ("keep", StandardScaler(), quantitative_features),
        ("origin_encoder", OneHotEncoder(), ["origin"]),
        ("text", CountVectorizer(), "name")
    ])),
    ("Imputation", SimpleImputer()),
    ("LinearModel", LassoCV(cv=3))
])
lasso_model.fit(tr, tr['mpg'])
models["LassoCV"] = lasso_model
compare_models(models)
We can examine the distribution of model coefficients to see that Lasso is selecting only a few of the features to use in its predictions:
ff.create_distplot([
        models['LassoCV']["LinearModel"].coef_,
        models['RidgeCV']["LinearModel"].coef_],
    ["Lasso", "Ridge"], bin_size=0.1)
Here we get the names of all the features in the model.
ct = models['LassoCV']['SelectColumns']
feature_names = (
    quantitative_features +
    list(ct.named_transformers_['origin_encoder'].get_feature_names()) +
    list(ct.named_transformers_['text'].get_feature_names())
)
feature_names = np.array(feature_names)
print(feature_names)
Finally, we select the features that had non-zero coefficients (parameters). Not surprisingly, many of these features are likely good predictors of the fuel economy of a car.
kept = ~np.isclose(models['LassoCV']["LinearModel"].coef_, 0)
feature_names[kept]