In this section we will work through the train-test split and the process of cross-validation. As with other notebooks, we will use the same set of standard imports.
import numpy as np
import pandas as pd
import plotly.offline as py
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
import cufflinks as cf
cf.set_config_file(offline=True, sharing=False, theme='ggplot');
from sklearn.linear_model import LinearRegression
For this notebook, we will use the seaborn `mpg` dataset, which describes the fuel mileage (measured in miles per gallon, or mpg) of various cars along with characteristics of those cars. Our goal will be to build a model that can predict the fuel mileage of a car based on the characteristics of that car.
from seaborn import load_dataset
data = load_dataset("mpg")
data
| | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name |
|---|---|---|---|---|---|---|---|---|---|
0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | usa | chevrolet chevelle malibu |
1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | usa | buick skylark 320 |
2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | usa | plymouth satellite |
3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | usa | amc rebel sst |
4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | usa | ford torino |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
393 | 27.0 | 4 | 140.0 | 86.0 | 2790 | 15.6 | 82 | usa | ford mustang gl |
394 | 44.0 | 4 | 97.0 | 52.0 | 2130 | 24.6 | 82 | europe | vw pickup |
395 | 32.0 | 4 | 135.0 | 84.0 | 2295 | 11.6 | 82 | usa | dodge rampage |
396 | 28.0 | 4 | 120.0 | 79.0 | 2625 | 18.6 | 82 | usa | ford ranger |
397 | 31.0 | 4 | 119.0 | 82.0 | 2720 | 19.4 | 82 | usa | chevy s-10 |
398 rows × 9 columns
This data has some rows with missing data. We will ignore those rows until later for the sake of this lecture. We can use the Pandas `DataFrame.isna` function to find rows with missing values and drop them, although, of course, this is not always the best idea!
data = data[~data.isna().any(axis=1)].copy()
The first thing we will want to do with this data is construct a train/test split. Constructing the split before EDA and data cleaning can often be helpful: it lets us check whether our data cleaning and any conclusions we draw from visualizations generalize to new data, by re-running the cleaning and EDA process on the test dataset.
We can sample the entire dataset to get a permutation and then select a range of rows.
shuffled_data = data.sample(frac = 1, random_state = 42)
shuffled_data
| | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name |
|---|---|---|---|---|---|---|---|---|---|
79 | 26.0 | 4 | 96.0 | 69.0 | 2189 | 18.0 | 72 | europe | renault 12 (sw) |
276 | 21.6 | 4 | 121.0 | 115.0 | 2795 | 15.7 | 78 | europe | saab 99gle |
248 | 36.1 | 4 | 91.0 | 60.0 | 1800 | 16.4 | 78 | japan | honda civic cvcc |
56 | 26.0 | 4 | 91.0 | 70.0 | 1955 | 20.5 | 71 | usa | plymouth cricket |
393 | 27.0 | 4 | 140.0 | 86.0 | 2790 | 15.6 | 82 | usa | ford mustang gl |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
72 | 15.0 | 8 | 304.0 | 150.0 | 3892 | 12.5 | 72 | usa | amc matador (sw) |
107 | 18.0 | 6 | 232.0 | 100.0 | 2789 | 15.0 | 73 | usa | amc gremlin |
272 | 23.8 | 4 | 151.0 | 85.0 | 2855 | 17.6 | 78 | usa | oldsmobile starfire sx |
352 | 29.9 | 4 | 98.0 | 65.0 | 2380 | 20.7 | 81 | usa | ford escort 2h |
103 | 11.0 | 8 | 400.0 | 150.0 | 4997 | 14.0 | 73 | usa | chevrolet impala |
392 rows × 9 columns
We then select a range of rows for training and testing:
split_point = int(shuffled_data.shape[0] * 0.90)
split_point
352
tr = shuffled_data.iloc[:split_point]
te = shuffled_data.iloc[split_point:]
len(tr), len(te)
(352, 40)
tr.head()
| | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name |
|---|---|---|---|---|---|---|---|---|---|
79 | 26.0 | 4 | 96.0 | 69.0 | 2189 | 18.0 | 72 | europe | renault 12 (sw) |
276 | 21.6 | 4 | 121.0 | 115.0 | 2795 | 15.7 | 78 | europe | saab 99gle |
248 | 36.1 | 4 | 91.0 | 60.0 | 1800 | 16.4 | 78 | japan | honda civic cvcc |
56 | 26.0 | 4 | 91.0 | 70.0 | 1955 | 20.5 | 71 | usa | plymouth cricket |
393 | 27.0 | 4 | 140.0 | 86.0 | 2790 | 15.6 | 82 | usa | ford mustang gl |
te.head()
| | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name |
|---|---|---|---|---|---|---|---|---|---|
245 | 36.1 | 4 | 98.0 | 66.0 | 1800 | 14.4 | 78 | usa | ford fiesta |
55 | 27.0 | 4 | 97.0 | 60.0 | 1834 | 19.0 | 71 | europe | volkswagen model 111 |
51 | 30.0 | 4 | 79.0 | 70.0 | 2074 | 19.5 | 71 | europe | peugeot 304 |
176 | 19.0 | 6 | 232.0 | 90.0 | 3211 | 17.0 | 75 | usa | amc pacer |
191 | 22.0 | 6 | 225.0 | 100.0 | 3233 | 15.4 | 76 | usa | plymouth valiant |
Checking that the split sizes add up to the size of the full dataset:
len(tr) + len(te) == len(data)
True
Alternatively, we can generate a random permutation of the row indices with `numpy` and then select the corresponding rows from our original DataFrame.
np.random.seed(100)
shuffled_indices = np.random.permutation(np.arange(len(data)))
shuffled_indices
array([124, 140, 276, 252, 326, 136, 369, 132, 387, 174, 225, 356, 257, 239, 231, 267, 7, 129, 258, 234, 43, 190, 227, 368, 75, 149, 201, 288, 78, 163, 347, 284, 152, 1, 246, 213, 21, 110, 161, 69, 56, 198, 160, 134, 97, 195, 255, 98, 54, 118, 361, 18, 311, 64, 272, 295, 298, 127, 191, 5, 103, 377, 266, 346, 90, 385, 188, 293, 96, 46, 50, 282, 248, 120, 233, 209, 187, 27, 235, 338, 328, 352, 372, 304, 308, 6, 153, 219, 279, 121, 3, 20, 125, 166, 307, 309, 60, 84, 342, 80, 147, 133, 31, 345, 45, 47, 260, 150, 391, 59, 334, 23, 88, 332, 15, 33, 171, 355, 169, 265, 386, 241, 249, 178, 362, 19, 26, 297, 35, 157, 39, 244, 375, 10, 199, 184, 208, 367, 65, 259, 285, 41, 378, 203, 104, 128, 216, 151, 142, 158, 40, 217, 32, 48, 327, 197, 123, 173, 204, 61, 71, 305, 330, 126, 115, 271, 85, 159, 164, 52, 321, 154, 205, 315, 29, 358, 139, 302, 319, 162, 111, 296, 177, 371, 300, 175, 331, 324, 281, 339, 62, 247, 99, 269, 112, 37, 189, 206, 374, 83, 373, 314, 51, 263, 341, 370, 42, 357, 229, 236, 318, 179, 87, 268, 55, 22, 379, 313, 101, 11, 291, 108, 376, 194, 25, 117, 81, 366, 275, 242, 230, 82, 292, 156, 278, 333, 329, 74, 224, 254, 218, 306, 145, 320, 130, 113, 349, 351, 287, 388, 353, 210, 28, 344, 221, 24, 168, 148, 380, 322, 77, 340, 34, 144, 182, 232, 73, 301, 57, 365, 44, 92, 146, 200, 264, 138, 223, 220, 67, 109, 12, 16, 89, 337, 243, 382, 286, 222, 107, 186, 116, 122, 165, 262, 277, 9, 253, 196, 310, 384, 250, 256, 102, 289, 76, 119, 237, 114, 95, 170, 381, 94, 214, 38, 261, 36, 299, 180, 176, 360, 215, 207, 212, 325, 70, 131, 185, 335, 0, 348, 68, 383, 17, 30, 106, 13, 72, 273, 202, 192, 274, 172, 294, 167, 303, 238, 312, 283, 181, 63, 105, 2, 336, 251, 183, 270, 317, 193, 49, 135, 389, 91, 4, 100, 211, 245, 141, 364, 155, 86, 93, 137, 58, 316, 363, 228, 143, 390, 240, 290, 14, 226, 66, 53, 354, 350, 79, 343, 359, 323, 280, 8])
data.iloc[shuffled_indices].head()
| | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name |
|---|---|---|---|---|---|---|---|---|---|
125 | 20.0 | 6 | 198.0 | 95.0 | 3102 | 16.5 | 74 | usa | plymouth duster |
142 | 26.0 | 4 | 79.0 | 67.0 | 1963 | 15.5 | 74 | europe | volkswagen dasher |
278 | 31.5 | 4 | 89.0 | 71.0 | 1990 | 14.9 | 78 | europe | volkswagen scirocco |
254 | 20.2 | 6 | 200.0 | 85.0 | 2965 | 15.8 | 78 | usa | ford fairmont (auto) |
328 | 30.0 | 4 | 146.0 | 67.0 | 3250 | 21.8 | 80 | europe | mercedes-benz 240d |
tr = data.iloc[shuffled_indices[:split_point]]
te = data.iloc[shuffled_indices[split_point:]]
len(tr), len(te)
(352, 40)
We can use the `train_test_split` function from `sklearn.model_selection` to do this easily.
from sklearn.model_selection import train_test_split
tr, te = train_test_split(data, test_size = 0.1, random_state = 83)
len(tr), len(te)
(352, 40)
tr.head()
| | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name |
|---|---|---|---|---|---|---|---|---|---|
6 | 14.0 | 8 | 454.0 | 220.0 | 4354 | 9.0 | 70 | usa | chevrolet impala |
352 | 29.9 | 4 | 98.0 | 65.0 | 2380 | 20.7 | 81 | usa | ford escort 2h |
47 | 19.0 | 6 | 250.0 | 100.0 | 3282 | 15.0 | 71 | usa | pontiac firebird |
39 | 14.0 | 8 | 400.0 | 175.0 | 4464 | 11.5 | 71 | usa | pontiac catalina brougham |
304 | 37.3 | 4 | 91.0 | 69.0 | 2130 | 14.7 | 79 | europe | fiat strada custom |
te.head()
| | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name |
|---|---|---|---|---|---|---|---|---|---|
87 | 13.0 | 8 | 350.0 | 145.0 | 3988 | 13.0 | 73 | usa | chevrolet malibu |
279 | 29.5 | 4 | 98.0 | 68.0 | 2135 | 16.6 | 78 | japan | honda accord lx |
319 | 31.3 | 4 | 120.0 | 75.0 | 2542 | 17.5 | 80 | japan | mazda 626 |
173 | 24.0 | 4 | 119.0 | 97.0 | 2545 | 17.0 | 75 | japan | datsun 710 |
148 | 26.0 | 4 | 116.0 | 75.0 | 2246 | 14.0 | 74 | europe | fiat 124 tc |
Let's go through the process of building a model, starting with the raw quantitative features available. As in previous lectures, we will first use our own feature function, which simply extracts the quantitative features we can use for our model.
def basic_design_matrix(df):
    X = df[["cylinders", "displacement",
            "horsepower", "weight", "acceleration", "model_year"]]
    return X
basic_design_matrix(tr)
| | cylinders | displacement | horsepower | weight | acceleration | model_year |
|---|---|---|---|---|---|---|
6 | 8 | 454.0 | 220.0 | 4354 | 9.0 | 70 |
352 | 4 | 98.0 | 65.0 | 2380 | 20.7 | 81 |
47 | 6 | 250.0 | 100.0 | 3282 | 15.0 | 71 |
39 | 8 | 400.0 | 175.0 | 4464 | 11.5 | 71 |
304 | 4 | 91.0 | 69.0 | 2130 | 14.7 | 79 |
... | ... | ... | ... | ... | ... | ... |
394 | 4 | 97.0 | 52.0 | 2130 | 24.6 | 82 |
258 | 6 | 231.0 | 105.0 | 3380 | 15.8 | 78 |
297 | 5 | 183.0 | 77.0 | 3530 | 20.1 | 79 |
23 | 4 | 121.0 | 113.0 | 2234 | 12.5 | 70 |
83 | 4 | 98.0 | 80.0 | 2164 | 15.0 | 72 |
352 rows × 6 columns
Then we fit a scikit-learn `LinearRegression` model to our training data.
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(basic_design_matrix(tr), tr['mpg'])
LinearRegression()
To evaluate the error we will use the Root Mean Squared Error (RMSE), which is like the mean squared error but in the units of the response (mpg) instead of squared units (mpg²).
def rmse(y, yhat):
    return np.sqrt(np.mean((y - yhat)**2))
The training error is:
Y_hat = model.predict(basic_design_matrix(tr))
Y = tr['mpg']
print("Training Error (RMSE):", rmse(Y, Y_hat))
Training Error (RMSE): 3.374582699942459
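We can evaluate the fitted model on the held-out test data in the same way (a sketch; `Y_hat_te` and `Y_te` below are our own variable names, but `model`, `rmse`, and `te` are as defined above):
Y_hat_te = model.predict(basic_design_matrix(te))  # sketch: our own variable names
Y_te = te['mpg']
print("Test Error (RMSE):", rmse(Y_te, Y_hat_te))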
Oh no! We just used the test data to evaluate our model! We shouldn't have done that.
(Don't worry, this is only for demonstration purposes. But seriously, don't try this at home.)
Notice: The test error is slightly higher than the training error. This is typically (but not always) the case. Sometimes we get lucky and the test data is "easier to predict" or happens to closely follow the training data.
In this notebook (and in life) we will want to keep track of all our models. We store our model settings in a dictionary: the key is an identifying name, and the value is a 2-element tuple whose first element is the transformation function (e.g. `basic_design_matrix`) and whose second element is an unfitted model object (e.g. `LinearRegression()`).
models = {"quant": (basic_design_matrix, LinearRegression())}
As in Lecture 14, we might want to look at the displacement per cylinder as well.
def dispcyl_design_matrix(df):
    X = basic_design_matrix(df)
    X['displacement/cylinder'] = X['displacement'] / X['cylinders']
    return X
dispcyl_design_matrix(tr)
| | cylinders | displacement | horsepower | weight | acceleration | model_year | displacement/cylinder |
|---|---|---|---|---|---|---|---|
6 | 8 | 454.0 | 220.0 | 4354 | 9.0 | 70 | 56.750000 |
352 | 4 | 98.0 | 65.0 | 2380 | 20.7 | 81 | 24.500000 |
47 | 6 | 250.0 | 100.0 | 3282 | 15.0 | 71 | 41.666667 |
39 | 8 | 400.0 | 175.0 | 4464 | 11.5 | 71 | 50.000000 |
304 | 4 | 91.0 | 69.0 | 2130 | 14.7 | 79 | 22.750000 |
... | ... | ... | ... | ... | ... | ... | ... |
394 | 4 | 97.0 | 52.0 | 2130 | 24.6 | 82 | 24.250000 |
258 | 6 | 231.0 | 105.0 | 3380 | 15.8 | 78 | 38.500000 |
297 | 5 | 183.0 | 77.0 | 3530 | 20.1 | 79 | 36.600000 |
23 | 4 | 121.0 | 113.0 | 2234 | 12.5 | 70 | 30.250000 |
83 | 4 | 98.0 | 80.0 | 2164 | 15.0 | 72 | 24.500000 |
352 rows × 7 columns
We can build a linear model using the same quantitative features as before, but with this new "displacement per cylinder" feature.
model = LinearRegression()
model.fit(dispcyl_design_matrix(tr), tr['mpg'])
models['quant+dc'] = (dispcyl_design_matrix, LinearRegression())
Y_hat = model.predict(dispcyl_design_matrix(tr))
Y = tr['mpg']
print("Training Error (RMSE):", rmse(Y, Y_hat))
Training Error (RMSE): 3.033309344625912
Our training error is definitely lower than with the previous model, but what we really care about is the model's performance on new data. Since we shouldn't ever look at the test data, we can instead compare these models using cross-validation to "mimic" the train-test split.
In the following function we use the sklearn `KFold` cross-validation class. We define a five-fold cross validation with
five_fold = KFold(n_splits=5)
Then we loop over the 5 splits and get the indices (`tr_ind`) in the training data to use for training and the indices (`va_ind`) in the training data to use for validation:
for tr_ind, va_ind in five_fold.split(tr):
from sklearn.model_selection import KFold
from sklearn.base import clone
def cross_validate_rmse(phi_function, model):
    model = clone(model)
    five_fold = KFold(n_splits = 5, random_state = 100, shuffle = True)
    rmse_values = []
    for tr_ind, va_ind in five_fold.split(tr):
        X_train = phi_function(tr.iloc[tr_ind, :])
        y_train = tr['mpg'].iloc[tr_ind]
        X_val = phi_function(tr.iloc[va_ind, :])
        y_val = tr['mpg'].iloc[va_ind]
        model.fit(X_train, y_train)
        rmse_values.append(rmse(y_val, model.predict(X_val)))
    return np.mean(rmse_values)
Validating the model:
cross_validate_rmse(dispcyl_design_matrix, LinearRegression())
3.1113159765045717
The following helper function generates a plot comparing all the models in the `models` dictionary.
def compare_models(models):
    # Compute the training error for each model
    training_rmse = []
    for transformation, model in models.values():
        model = clone(model)
        model.fit(transformation(tr), tr['mpg'])
        training_rmse.append(rmse(tr['mpg'], model.predict(transformation(tr))))
    # Compute the cross-validation error for each model
    validation_rmse = [cross_validate_rmse(transformation, model)
                       for transformation, model in models.values()]
    names = list(models.keys())
    fig = go.Figure([
        go.Bar(x = names, y = training_rmse, name="Training RMSE"),
        go.Bar(x = names, y = validation_rmse, name="CV RMSE")])
    return fig
fig = compare_models(models)
fig.update_yaxes(range = [0, 4], title = "RMSE")
So not only did the new displacement / cylinders feature improve our training error, it also improved our cross-validation error. This indicates that this feature helps our model generalize, or in other words, that it has "predictive power."
Now let's try adding some categorical data, such as the `origin` column. As this is categorical data, we will have to one-hot encode this variable.
data['origin'].value_counts()
usa       245
japan      79
europe     68
Name: origin, dtype: int64
Fortunately, it looks like we have only three possible values for `origin`. We will use scikit-learn's one-hot encoder to do the transformation. Check out Lecture 14 for a refresher on how this works.
from sklearn.preprocessing import OneHotEncoder
oh_enc = OneHotEncoder()
oh_enc.fit(data[['origin']])
def origin_design_matrix(df):
    X = dispcyl_design_matrix(df)
    ohe_cols = pd.DataFrame(oh_enc.transform(df[['origin']]).todense(),
                            columns = oh_enc.get_feature_names(),
                            index = df.index)
    return X.join(ohe_cols)
models['quant+dc+o'] = (origin_design_matrix, LinearRegression())
origin_design_matrix(tr)
| | cylinders | displacement | horsepower | weight | acceleration | model_year | displacement/cylinder | x0_europe | x0_japan | x0_usa |
|---|---|---|---|---|---|---|---|---|---|---|
6 | 8 | 454.0 | 220.0 | 4354 | 9.0 | 70 | 56.750000 | 0.0 | 0.0 | 1.0 |
352 | 4 | 98.0 | 65.0 | 2380 | 20.7 | 81 | 24.500000 | 0.0 | 0.0 | 1.0 |
47 | 6 | 250.0 | 100.0 | 3282 | 15.0 | 71 | 41.666667 | 0.0 | 0.0 | 1.0 |
39 | 8 | 400.0 | 175.0 | 4464 | 11.5 | 71 | 50.000000 | 0.0 | 0.0 | 1.0 |
304 | 4 | 91.0 | 69.0 | 2130 | 14.7 | 79 | 22.750000 | 1.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
394 | 4 | 97.0 | 52.0 | 2130 | 24.6 | 82 | 24.250000 | 1.0 | 0.0 | 0.0 |
258 | 6 | 231.0 | 105.0 | 3380 | 15.8 | 78 | 38.500000 | 0.0 | 0.0 | 1.0 |
297 | 5 | 183.0 | 77.0 | 3530 | 20.1 | 79 | 36.600000 | 1.0 | 0.0 | 0.0 |
23 | 4 | 121.0 | 113.0 | 2234 | 12.5 | 70 | 30.250000 | 1.0 | 0.0 | 0.0 |
83 | 4 | 98.0 | 80.0 | 2164 | 15.0 | 72 | 24.500000 | 0.0 | 0.0 | 1.0 |
352 rows × 10 columns
fig = compare_models(models)
fig.update_yaxes(range = [0, 4], title = "RMSE")
It looks like adding these new features about origin didn't really affect our model.
Let's see if we can gain any information from the `name` column. This column contains the make and model of each car. The model names are fairly unique, so let's try to extract information about the brand (e.g. `ford`). The following cell shows the top 20 words that appear in this column.
tr['name'].str.split().explode().value_counts().head(20)
ford          44
chevrolet     37
plymouth      28
(sw)          27
dodge         26
amc           26
toyota        22
custom        17
datsun        17
buick         16
volkswagen    14
pontiac       14
brougham      10
oldsmobile    10
honda         10
mercury       10
rabbit        10
corolla        9
mazda          9
corona         7
Name: name, dtype: int64
It looks like there is at least one model name here (`corolla`), but the list mostly captures the most common brands. We will add one column for each of these strings, with a 1 for a specific car indicating that the name of the car contains the string.
Note: in practice, you would use scikit-learn or some other package, but we will do this manually just to be explicit about what we're doing.
brands = tr['name'].str.split().explode().value_counts().head(20).index
def brands_design_matrix(df):
    X = origin_design_matrix(df)
    for brand in brands:
        X[brand] = df['name'].str.contains(brand, regex = False).astype(float)
    return X
models['quant+dc+o+b'] = (brands_design_matrix, LinearRegression())
brands_design_matrix(tr)
| | cylinders | displacement | horsepower | weight | acceleration | model_year | displacement/cylinder | x0_europe | x0_japan | x0_usa | ... | volkswagen | pontiac | brougham | oldsmobile | honda | mercury | rabbit | corolla | mazda | corona |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6 | 8 | 454.0 | 220.0 | 4354 | 9.0 | 70 | 56.750000 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
352 | 4 | 98.0 | 65.0 | 2380 | 20.7 | 81 | 24.500000 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
47 | 6 | 250.0 | 100.0 | 3282 | 15.0 | 71 | 41.666667 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
39 | 8 | 400.0 | 175.0 | 4464 | 11.5 | 71 | 50.000000 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
304 | 4 | 91.0 | 69.0 | 2130 | 14.7 | 79 | 22.750000 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
394 | 4 | 97.0 | 52.0 | 2130 | 24.6 | 82 | 24.250000 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
258 | 6 | 231.0 | 105.0 | 3380 | 15.8 | 78 | 38.500000 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
297 | 5 | 183.0 | 77.0 | 3530 | 20.1 | 79 | 36.600000 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
23 | 4 | 121.0 | 113.0 | 2234 | 12.5 | 70 | 30.250000 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
83 | 4 | 98.0 | 80.0 | 2164 | 15.0 | 72 | 24.500000 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
352 rows × 30 columns
fig = compare_models(models)
fig.update_yaxes(range = [0, 4], title = "RMSE")
Interesting. Adding the brand information to our design matrix decreased our training error, but it increased our cross-validation error. Looks like we overfit!
In this section we explore the use of regularization techniques to address overfitting.
Ridge regression combines the ridge (L2, Squared) regularization function with the least squares loss.
$$ \hat{\theta}_\alpha = \arg \min_\theta \left[ \left(\frac{1}{n} \sum_{i=1}^n \left(Y_i - f_\theta(X_i)\right)^2 \right) + \alpha \sum_{k=1}^d \theta_k^2 \right] $$

Ridge regression, like ordinary least squares regression, also has a closed form solution for the optimal $\hat{\theta}_\alpha$:
$$ \hat{\theta}_\alpha = \left(X^T X + n \alpha \mathbf{I}\right)^{-1} X^T Y $$

where $\mathbf{I}$ is the identity matrix, $n$ is the number of data points, and $\alpha$ is the regularization hyperparameter.
Notice that even if $X^T X$ is not full rank, the addition of $n \alpha \mathbf{I}$ (which is full rank) makes $\left(X^T X + n \alpha \mathbf{I}\right)$ invertible. Thus, ridge regression addresses the possible issue of having an underdetermined system and partially improves the numerical stability of the solution.
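As a quick sketch of this closed-form solution (the helper `ridge_closed_form` is our own, not part of the lecture code, and `X` and `Y` are assumed to be numpy arrays):
# Sketch of the ridge closed-form solution above. ridge_closed_form is our own
# helper; X (n x d) and Y (n,) are assumed to be numpy arrays.
def ridge_closed_form(X, Y, alpha):
    n, d = X.shape
    # For alpha > 0, (X^T X + n*alpha*I) is always invertible,
    # even when X^T X is not full rank.
    return np.linalg.solve(X.T @ X + n * alpha * np.eye(d), X.T @ Y)
Note that scikit-learn's `Ridge` omits the $\frac{1}{n}$ factor in its loss, so its `alpha` corresponds to $n\alpha$ in the formula above.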
The $\alpha$ parameter is our regularization hyperparameter. It is a hyperparameter because it is not a model parameter but a choice of how we want to balance fitting the data and over-fitting. The goal is to find a value of this hyperparameter that maximizes our accuracy on the test data. However, we can't use the test data to make modeling decisions, so we turn to cross validation. The standard way to find the best $\alpha$ is to try a bunch of values (for example, over a grid) and take the one with the lowest cross-validation error.
You may have noticed that in the video lecture we use $\lambda$ instead of $\alpha$. This is because many textbooks use $\lambda$ and sklearn uses $\alpha$.
Both ridge regression and Lasso are built into scikit-learn. Let's start by importing the `Ridge` regression class, which behaves identically to the `LinearRegression` class we used earlier:
from sklearn.linear_model import Ridge
Take a look at the documentation. Notice the regularized loss function.
Ridge?
Instead of just looking at the top 20 brands, let's bag-of-words encode the entire name column.
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
vectorizer.fit(tr['name'])
CountVectorizer()
def name_design_matrix(df):
    X = origin_design_matrix(df)
    feature_names = vectorizer.get_feature_names()
    X[feature_names] = vectorizer.transform(df['name']).toarray()
    return X
name_design_matrix(tr)
| | cylinders | displacement | horsepower | weight | acceleration | model_year | displacement/cylinder | x0_europe | x0_japan | x0_usa | ... | volare | volkswagen | volvo | vw | wagon | woody | xe | yorker | zephyr | zx |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6 | 8 | 454.0 | 220.0 | 4354 | 9.0 | 70 | 56.750000 | 0.0 | 0.0 | 1.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
352 | 4 | 98.0 | 65.0 | 2380 | 20.7 | 81 | 24.500000 | 0.0 | 0.0 | 1.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
47 | 6 | 250.0 | 100.0 | 3282 | 15.0 | 71 | 41.666667 | 0.0 | 0.0 | 1.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
39 | 8 | 400.0 | 175.0 | 4464 | 11.5 | 71 | 50.000000 | 0.0 | 0.0 | 1.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
304 | 4 | 91.0 | 69.0 | 2130 | 14.7 | 79 | 22.750000 | 1.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
394 | 4 | 97.0 | 52.0 | 2130 | 24.6 | 82 | 24.250000 | 1.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
258 | 6 | 231.0 | 105.0 | 3380 | 15.8 | 78 | 38.500000 | 0.0 | 0.0 | 1.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
297 | 5 | 183.0 | 77.0 | 3530 | 20.1 | 79 | 36.600000 | 1.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
23 | 4 | 121.0 | 113.0 | 2234 | 12.5 | 70 | 30.250000 | 1.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
83 | 4 | 98.0 | 80.0 | 2164 | 15.0 | 72 | 24.500000 | 0.0 | 0.0 | 1.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
352 rows × 294 columns
cross_validate_rmse(name_design_matrix, LinearRegression())
1856664698.7350745
Woah, our RMSE is really, really large! This is because, by adding all of these columns, our design matrix is no longer full rank. `numpy` tried to take the inverse of $\mathbb{X}^T \mathbb{X}$, but it ended up sending the parameters to really extreme values, leading to really extreme predictions.
To get around this, let's try regularization. As we're introducing regularization, let's also standardize our quantitative features:
from sklearn.preprocessing import StandardScaler
quantitative_features = ["cylinders", "displacement", "horsepower", "weight", "acceleration", "model_year"]
scaler = StandardScaler()
scaler.fit(tr[quantitative_features])
StandardScaler()
def name_design_matrix_std(df):
    X = name_design_matrix(df)
    X[quantitative_features] = scaler.transform(X[quantitative_features])
    return X
name_design_matrix_std(tr)
| | cylinders | displacement | horsepower | weight | acceleration | model_year | displacement/cylinder | x0_europe | x0_japan | x0_usa | ... | volare | volkswagen | volvo | vw | wagon | woody | xe | yorker | zephyr | zx |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6 | 1.452785 | 2.426007 | 2.916388 | 1.580639 | -2.320642 | -1.611836 | 56.750000 | 0.0 | 0.0 | 1.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
352 | -0.874316 | -0.933437 | -1.034011 | -0.720703 | 1.862364 | 1.351615 | 24.500000 | 0.0 | 0.0 | 1.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
47 | 0.289235 | 0.500933 | -0.141986 | 0.330873 | -0.175511 | -1.342431 | 41.666667 | 0.0 | 0.0 | 1.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
39 | 1.452785 | 1.916429 | 1.769498 | 1.708880 | -1.426838 | -1.342431 | 50.000000 | 0.0 | 0.0 | 1.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
304 | -0.874316 | -0.999493 | -0.932065 | -1.012159 | -0.282767 | 0.812806 | 22.750000 | 1.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
394 | -0.874316 | -0.942873 | -1.365335 | -1.012159 | 3.256700 | 1.621020 | 24.250000 | 1.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
258 | 0.289235 | 0.321637 | -0.014553 | 0.445124 | 0.110507 | 0.543401 | 38.500000 | 0.0 | 0.0 | 1.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
297 | -0.292540 | -0.131322 | -0.728174 | 0.619998 | 1.647851 | 0.812806 | 36.600000 | 1.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
23 | -0.874316 | -0.716394 | 0.189338 | -0.890913 | -1.069316 | -1.611836 | 30.250000 | 1.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
83 | -0.874316 | -0.933437 | -0.651714 | -0.972521 | -0.175511 | -1.073026 | 24.500000 | 0.0 | 0.0 | 1.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
352 rows × 294 columns
cross_validate_rmse(name_design_matrix_std, Ridge(alpha = .5))
3.1168076796253716
That looks more like it. Let's add this model into our set:
models['quant+dc+o+n-Ridge.5'] = (name_design_matrix_std, Ridge(alpha = .5))
fig = compare_models(models)
fig.update_yaxes(range = [0, 4], title = "RMSE")
Notice how our training error dropped significantly, but our CV error only changed a little bit. Let's try a different value of $\alpha$:
models['quant+dc+o+n-Ridge100'] = (name_design_matrix_std, Ridge(alpha = 100))
fig = compare_models(models)
fig.update_yaxes(range = [0, 4], title = "RMSE")
Oops, that was too much regularization. Let's do a search to find the best $\alpha$:
alphas = np.linspace(.5, 3.5, 20)
cv_values = []
train_values = []
for alpha in alphas:
    model = Ridge(alpha = alpha)
    model.fit(name_design_matrix_std(tr), tr['mpg'])
    train_values.append(rmse(tr['mpg'], model.predict(name_design_matrix_std(tr))))
    validation_rmse = cross_validate_rmse(name_design_matrix_std, model)
    cv_values.append(validation_rmse)
fig = go.Figure()
fig.add_trace(go.Scatter(x = alphas, y = train_values, mode="lines+markers", name="Train"))
fig.add_trace(go.Scatter(x = alphas, y = cv_values, mode="lines+markers", name="CV"))
fig.update_layout(xaxis_title=r"$\alpha$", yaxis_title="RMSE")
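Based on the plot, the cross-validation error is lowest around $\alpha = 1.75$, so let's add that model to our collection: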
models['quant+dc+o+n-Ridge1.75'] = (name_design_matrix_std, Ridge(alpha = 1.75))
fig = compare_models(models)
fig.update_yaxes(range = [0, 4], title = "RMSE")
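scikit-learn also provides a `RidgeCV` model that selects $\alpha$ by cross-validation automatically. Let's add it as well: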
from sklearn.linear_model import RidgeCV
models['quant+dc+o+n-RidgeCV'] = (name_design_matrix_std, RidgeCV())
fig = compare_models(models)
fig.update_yaxes(range = [0, 4], title = "RMSE")
Lasso regression combines the absolute (L1) regularization function with the least squares loss.
$$ \hat{\theta}_\alpha = \arg \min_\theta \left(\frac{1}{n} \sum_{i=1}^n \left(Y_i - f_\theta(X_i)\right)^2 \right) + \alpha \sum_{k=1}^d |\theta_k| $$

Lasso is actually an acronym (and a cool name) which stands for Least Absolute Shrinkage and Selection Operator. It is absolute because it uses the absolute value of the parameters. It is a shrinkage operator because it favors smaller parameter values. It is a selection operator because it has the peculiar property of pushing parameter values all the way to zero, thereby selecting the remaining features. It is this last property that makes Lasso regression so useful: by using Lasso regression and setting a sufficiently large value of $\alpha$, you can eliminate features that are not informative.
Unfortunately, there is no closed form solution for Lasso regression and so iterative optimization algorithms like gradient descent are typically used.
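To see the selection property in action, here is a small sketch (the `alpha` values are illustrative guesses, not tuned values) that fits Lasso at a few regularization strengths on our standardized design matrix and counts how many coefficients are pushed exactly to zero:
from sklearn.linear_model import Lasso
# Sketch: larger alpha zeroes out more coefficients.
# The alpha values below are illustrative guesses.
for alpha in [0.01, 0.1, 1.0]:
    lasso_model = Lasso(alpha = alpha)
    lasso_model.fit(name_design_matrix_std(tr), tr['mpg'])
    n_zero = np.sum(lasso_model.coef_ == 0)
    print(alpha, ":", n_zero, "of", len(lasso_model.coef_), "coefficients are exactly 0")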
from sklearn.linear_model import Lasso, LassoCV
models['quant+dc+o+n-LassoCV'] = (name_design_matrix_std, LassoCV())
fig = compare_models(models)
fig.update_yaxes(range = [0, 4], title = "RMSE")
Let's compare the distribution of the parameters for both the `RidgeCV` and `LassoCV` models.
model = RidgeCV()
model.fit(name_design_matrix_std(tr), tr['mpg'])
ridge_coef = model.coef_
model = LassoCV()
model.fit(name_design_matrix_std(tr), tr['mpg'])
lasso_coef = model.coef_
ff.create_distplot([ridge_coef, lasso_coef], ["Ridge", "Lasso"], bin_size = 0.1)
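Finally, we can list the features to which Lasso assigned a positive coefficient: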
name_design_matrix_std(tr).columns[lasso_coef > 0]
Index(['displacement', 'model_year', 'x0_japan', 'datsun', 'diesel', 'plymouth', 'pontiac', 'vw'], dtype='object')