Suraj Rampure and Allen Shen
This notebook has 5 sections:
sklearn
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn.linear_model as lm
model = lm.LinearRegression()
model.fit(X, y) does both of these for us!
model.coef_ and model.intercept_ give us the values of $\hat{\theta}$ after fitting.
Use our model to make predictions: $\hat{\mathbb{Y}} = \mathbb{X} \hat{\theta}$
model.predict(X)
df = sns.load_dataset('tips')
df.head()
def rmse(y, yhat):
return np.sqrt(np.mean((y - yhat)**2))
Let's start by fitting a simple linear regression model to our familiar tips data. Specifically, we will use total_bill to predict tip.
model1 = lm.LinearRegression()
model1.fit(df[['total_bill']], df['tip'])
pred1 = model1.predict(df[['total_bill']])
rmse(df['tip'], pred1)
model1.coef_
model1.intercept_
plt.scatter(df['tip'], pred1);
Notably, our RMSE was 1.0178504025697377.
Now, let's add a completely unrelated column to our data, and include it as a feature in our model.
df['useless'] = np.random.randn(len(df)) * 342
df
model2 = lm.LinearRegression()
model2.fit(df[['total_bill', 'useless']], df['tip'])
pred2 = model2.predict(df[['total_bill', 'useless']])
rmse(df['tip'], pred2)
model2.coef_
model2.intercept_
plt.scatter(df['tip'], pred2);
Our new RMSE was marginally lower! Why?
Note that the coefficient for our useless feature is very close to 0.
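To see why the training RMSE cannot go up: the two-feature model can always reproduce the one-feature model by giving the useless feature a coefficient of 0, so least squares will only ever do at least as well on the training data. A quick check of this (a sketch added here for illustration, reusing the objects defined above):
# Appending a 0 coefficient for `useless` to the first model's parameters
# reproduces its predictions exactly, so the two-feature model can never
# have a higher training RMSE than the one-feature model.
manual_pred = df[['total_bill', 'useless']] @ np.append(model1.coef_, 0) + model1.intercept_
rmse(df['tip'], manual_pred), rmse(df['tip'], pred1)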
For the original model:
np.corrcoef(pred1, df['tip'])[0, 1]**2
np.var(pred1) / np.var(df['tip'])
Note: model.score for a LinearRegression model also computes the $R^2$ value! See:
model1.score(df[['total_bill']], df['tip'])
Note that all three of the above values are the same.
For the model with the useless feature:
np.var(pred2) / np.var(df['tip'])
Recall, we can interpret $R^2$ as being the proportion of variance in our true $y$ values that our fitted values capture. This is saying that our model with the useless feature included accounts for more of the variation in our true $y$ values than the first model does.
Does this make sense?
We haven't yet formally taught you how to use scikit-learn's inbuilt train/test split, so we will do this by hand for now.
idx = np.arange(len(df))
np.random.shuffle(idx)
idx
len(idx)
split_point = int((3/4) * len(idx))
split_point
train, test = df.iloc[idx[:split_point]], df.iloc[idx[split_point:]]
train
test
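For reference, scikit-learn has a built-in helper that does this shuffling and slicing for us; here is a minimal sketch (the 25% test size and the random seed are arbitrary choices):
from sklearn.model_selection import train_test_split

# Randomly holds out 25% of the rows as a test set, analogous to the manual split above.
train_df, test_df = train_test_split(df, test_size=0.25, random_state=42)
train_df.shape, test_df.shape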
new_model1 = lm.LinearRegression()
new_model1.fit(train[['total_bill']], train['tip'])
new_pred1_train = new_model1.predict(train[['total_bill']])
new_pred1_test = new_model1.predict(test[['total_bill']])
new_model1_train_rmse = rmse(train['tip'], new_pred1_train)
new_model1_test_rmse = rmse(test['tip'], new_pred1_test)
new_model1_train_rmse, new_model1_test_rmse
Now, for our model with the useless feature:
new_model2 = lm.LinearRegression()
new_model2.fit(train[['total_bill', 'useless']], train['tip'])
new_pred2_train = new_model2.predict(train[['total_bill', 'useless']])
new_pred2_test = new_model2.predict(test[['total_bill', 'useless']])
new_model2_train_rmse = rmse(train['tip'], new_pred2_train)
new_model2_test_rmse = rmse(test['tip'], new_pred2_test)
new_model2_train_rmse, new_model2_test_rmse
new_model1.coef_
new_model2.coef_
Note that here, training RMSE went down, but test RMSE went up. This is generally what happens when you include features that aren't truly relevant to the underlying relationship in your data. We call this overfitting.
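To make this concrete, here is a small illustrative sketch (not part of the original analysis above): as we keep adding columns of pure noise, training RMSE keeps falling while test RMSE tends to rise.
np.random.seed(42)
train_rmses, test_rmses = [], []
train_X = train[['total_bill']].copy()
test_X = test[['total_bill']].copy()
for i in range(20):
    m = lm.LinearRegression().fit(train_X, train['tip'])
    train_rmses.append(rmse(train['tip'], m.predict(train_X)))
    test_rmses.append(rmse(test['tip'], m.predict(test_X)))
    # Add another useless random feature before the next fit.
    train_X[f'junk{i}'] = np.random.randn(len(train_X))
    test_X[f'junk{i}'] = np.random.randn(len(test_X))
plt.plot(train_rmses, label='train RMSE')
plt.plot(test_rmses, label='test RMSE')
plt.xlabel('number of useless features')
plt.legend();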
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn.linear_model as lm
from sklearn.feature_extraction import DictVectorizer
df = sns.load_dataset('tips')
df.head()
model1 = lm.LinearRegression()
model1.fit(df.drop(columns='tip'), df['tip'])
This errors! sklearn can only work with numeric features, and sex, smoker, day, and time are categorical, so we need to encode them as numbers first.
df = df.drop(columns='tip')
cat_cols = ['sex', 'smoker', 'day', 'time']
vec_enc = DictVectorizer()
vec_enc.fit(df[cat_cols].to_dict(orient='records'))
cat_data = vec_enc.transform(df[cat_cols].to_dict(orient='records')).toarray()
cat_data
cat_data_names = vec_enc.get_feature_names()  # note: in newer versions of scikit-learn, this is get_feature_names_out()
cat_data_names
cat_data = pd.DataFrame(cat_data, columns=cat_data_names)
df_ohe = pd.concat([df, cat_data], axis=1).drop(columns=cat_cols) # Drop original categorical columns
df_ohe.head()
df_ohe.shape
X = pd.concat([df_ohe, pd.Series(np.ones(df_ohe.shape[0]), name='intercept')], axis=1)
X.head()
X.shape
np.linalg.matrix_rank(X)
The matrix is not full rank: within each categorical feature, the dummy columns sum to the intercept column of all ones (for example, sex=Male + sex=Female = intercept), so we drop one dummy per categorical feature.
X = X.drop(columns=['day=Sat', 'sex=Female', 'smoker=No', 'time=Dinner'])
X.head()
X.shape
np.linalg.matrix_rank(X)
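Now that every column is numeric and the redundant dummies have been dropped, the fit that errored at the start of this section goes through. A minimal sketch (the names tip and model_ohe are introduced here just for illustration; we reload the target since we dropped it from df above):
tip = sns.load_dataset('tips')['tip']  # the target column we dropped from df earlier
model_ohe = lm.LinearRegression(fit_intercept=False)  # X already contains an intercept column
model_ohe.fit(X, tip)
model_ohe.coef_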
df_ohe2 = pd.get_dummies(df)
df_ohe2.head()
X_2 = pd.concat([df_ohe2, pd.Series(np.ones(df_ohe2.shape[0]), name='intercept')], axis=1)
X_2.head()
X_2.shape
np.linalg.matrix_rank(X_2)
df_ohe2 = pd.get_dummies(df, drop_first=True)
df_ohe2.head()
X_2 = pd.concat([df_ohe2, pd.Series(np.ones(df_ohe2.shape[0]), name='intercept')], axis=1)
X_2.head()
X_2.shape
np.linalg.matrix_rank(X_2)
X_3 = pd.get_dummies(df[['total_bill', 'sex']])
X_3.head()
np.linalg.matrix_rank(X_3)
X_4 = pd.concat([X_3, pd.Series(np.ones(X_3.shape[0]), name='intercept')], axis=1)
X_4.head()
np.linalg.matrix_rank(X_4)
Note that X_4 is not full rank because sex_Male + sex_Female is a column of all ones, so intercept - sex_Male is exactly sex_Female:
(X_4['intercept'] - X_4['sex_Male']).iloc[:5]
X_5 = pd.get_dummies(df[['total_bill', 'sex', 'smoker']])
X_5.tail(5)
np.linalg.matrix_rank(X_5)
(X_5['sex_Male'] + X_5['sex_Female'] - X_5['smoker_Yes']).iloc[-5:]
X_6 = pd.concat([X_5, pd.Series(np.ones(X_5.shape[0]), name='intercept')], axis=1)
X_6.tail()
np.linalg.matrix_rank(X_6)
df = sns.load_dataset('tips')
X_7 = df[['total_bill', 'size', 'size']]
X_7 = pd.concat([X_7, pd.Series(np.ones(X_7.shape[0]), name='intercept')], axis=1)
X_7.head()
np.linalg.matrix_rank(X_7)
model2 = lm.LinearRegression(fit_intercept=False)
model2.fit(X_7, df['tip'])
model2_coef = model2.coef_
model2_coef
model2.intercept_
model2.predict(X_7)[:5]
X_7.iloc[:5] @ model2_coef
Our model is:
$$\theta_0 + \theta_1 \cdot size + \theta_2 \cdot size + \theta_3 \cdot total\_bill$$
model2_coef_modified = model2_coef.copy()
model2_coef_modified[1] = model2_coef[1] - 100
model2_coef_modified[2] = model2_coef[2] + 100
model2_coef
model2_coef_modified
X_7.iloc[:5] @ model2_coef_modified
Our model is now:
$$\theta_0 + (\theta_1 - 100) \cdot size + (\theta_2 + 100) \cdot size + \theta_3 \cdot total\_bill$$
$$= \theta_0 + \theta_1 \cdot size - 100 \cdot size + \theta_2 \cdot size + 100 \cdot size + \theta_3 \cdot total\_bill$$
$$= \theta_0 + \theta_1 \cdot size + \theta_2 \cdot size + \theta_3 \cdot total\_bill$$
which is the same as before!
X_8 = df[['total_bill', 'size']].copy()  # .copy() so the column assignment below doesn't trigger a chained-assignment warning
X_8.loc[:, '2 * size'] = 2 * X_8['size']
X_8 = pd.concat([X_8, pd.Series(np.ones(X_8.shape[0]), name='intercept')], axis=1)
X_8.head()
np.linalg.matrix_rank(X_8)
model3 = lm.LinearRegression(fit_intercept=False)
model3.fit(X_8, df['tip'])
model3_coef = model3.coef_
model3_coef
model3.predict(X_8)[:5]
X_8.iloc[:5] @ model3_coef
Can I do the same thing as before?
model3_coef_modified = model3_coef.copy()
model3_coef_modified[1] = model3_coef[1] - 100
model3_coef_modified[2] = model3_coef[2] + 100
X_8.iloc[:5] @ model3_coef_modified
model3_coef_modified = model3_coef.copy()
model3_coef_modified[1] = model3_coef[1] - 100
model3_coef_modified[2] = model3_coef[2] + 50
X_8.iloc[:5] @ model3_coef_modified
This matches the original predictions: subtracting 100 from the size coefficient removes 100 * size from each prediction, while adding 50 to the 2 * size coefficient adds 50 * (2 * size) = 100 * size back. Adding 100 to both, as in the previous attempt, does not cancel out.
X_9 = df[['total_bill', 'size']].copy()
X_9.loc[:, '2 * size + 3'] = 2 * X_9['size'] + 3
X_9 = pd.concat([X_9, pd.Series(np.ones(X_9.shape[0]), name='intercept')], axis=1)
X_9.head()
np.linalg.matrix_rank(X_9)
X_10 = df[['total_bill', 'size']].copy()
X_10.loc[:, 'size ** 2'] = X_10['size'] ** 2
X_10 = pd.concat([X_10, pd.Series(np.ones(X_10.shape[0]), name='intercept')], axis=1)
X_10.head()
np.linalg.matrix_rank(X_10)
Is the following model linear? (Suppose $x$ represents a single observation of our raw data matrix.)
$$f_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 \sin(x_1) + \theta_4 \cos(x_1x_2) e^{x_1}$$
Yes, because it is linear in terms of the parameters. We could formulate this model as $$f_\theta(x) = x^T \theta$$ where $x = \begin{bmatrix} 1 \\ x_1 \\ x_2 \\ \sin(x_1) \\ \cos(x_1x_2) e^{x_1} \end{bmatrix}$ and $\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \theta_3 \\ \theta_4 \end{bmatrix}$.
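As a quick sanity check, here is a minimal sketch of fitting such a model with ordinary least squares, using total_bill and size from the tips data as stand-ins for $x_1$ and $x_2$ (an arbitrary choice, purely for illustration):
x1, x2 = df['total_bill'], df['size']
# Build the transformed features; the model is linear in theta, so ordinary least squares applies.
features = pd.DataFrame({
    'x1': x1,
    'x2': x2,
    'sin(x1)': np.sin(x1),
    'cos(x1 x2) e^x1': np.cos(x1 * x2) * np.exp(x1),
})
lin_model = lm.LinearRegression()  # theta_0 is handled by the intercept
lin_model.fit(features, df['tip'])
lin_model.intercept_, lin_model.coef_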
What about this model?
$$f_\theta(x) = \theta_0 + \theta_1 x + \theta_2 \sin(\theta_3 x)$$
No, because $\theta_3$ is inside a $\sin$ function: we cannot write this model in the form $x^T \theta$, so it is not a linear model. It is still a perfectly valid model, and we can still find its optimal parameters, just not with (linear) least squares.
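For instance, here is a minimal sketch of estimating its parameters numerically with scipy.optimize.minimize (total_bill and tip are used as stand-in data, and the starting point is an arbitrary choice):
from scipy.optimize import minimize

x, y = df['total_bill'].values, df['tip'].values

def mse(theta):
    # Mean squared error of the non-linear model for a given parameter vector.
    t0, t1, t2, t3 = theta
    return np.mean((y - (t0 + t1 * x + t2 * np.sin(t3 * x))) ** 2)

result = minimize(mse, x0=np.ones(4))
result.x  # numerically estimated theta_0, ..., theta_3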