Live Lecture 4 Supplemental Notebook

Data 100, Summer 2020

Suraj Rampure and Allen Shen

This notebook has 5 sections:

  1. An overview of the modeling process, and how it parallels with sklearn
  2. An exploration of how training RMSE never increases when we add more features, but test RMSE can increase
  3. One-hot encoding, and issues with an intercept term
  4. Redundant features and rank
  5. What makes models linear
In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn.linear_model as lm

1. An overview of the modeling process

  1. Choose a model
  2. Choose a loss function
    • model = lm.LinearRegression() does both of these for us!
  3. Minimize average loss to find the optimal $\hat{\theta}$
    • model.fit(X, y)
    • model.coef_ and model.intercept_ give us the values of $\hat{\theta}$ after fitting

Use our model to make predictions: $\hat{\mathbb{Y}} = \mathbb{X} \hat{\theta}$

  • model.predict(X)
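Putting these steps together, here is a minimal sketch of the workflow on a tiny synthetic dataset (the data below is made up purely for illustration):

# Minimal sketch of the sklearn modeling workflow on synthetic data.
import numpy as np
import sklearn.linear_model as lm

X = np.array([[1.0], [2.0], [3.0], [4.0]])   # design matrix with one feature
y = np.array([2.1, 3.9, 6.2, 8.1])           # observed responses

model = lm.LinearRegression()    # choose model (linear) and loss (squared)
model.fit(X, y)                  # minimize average loss over theta
model.coef_, model.intercept_    # the fitted theta-hat
model.predict(X)                 # y-hat = X theta-hat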

2. RMSE on training data never increases when we add more features

In [2]:
df = sns.load_dataset('tips')
df.head()
Out[2]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
In [3]:
def rmse(y, yhat):
    return np.sqrt(np.mean((y - yhat)**2))
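This helper computes the root mean squared error between the observed values $y$ and the fitted values $\hat{y}$:

$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^n \left(y_i - \hat{y}_i\right)^2}$$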

Let's start by fitting a simple linear regression model to our familiar tips data. Specifically, we will use total_bill to predict tip.

In [4]:
model1 = lm.LinearRegression()
model1.fit(df[['total_bill']], df['tip'])
pred1 = model1.predict(df[['total_bill']])
In [5]:
rmse(df['tip'], pred1)
Out[5]:
1.0178504025697377
In [6]:
model1.coef_
Out[6]:
array([0.10502452])
In [7]:
model1.intercept_
Out[7]:
0.9202696135546731
In [8]:
plt.scatter(df['tip'], pred1);

Notably, our RMSE was 1.0178504025697377.

Now, let's add a completely unrelated column to our data, and include it as a feature in our model.

In [9]:
df['useless'] = np.random.randn(len(df)) * 342
In [10]:
df
Out[10]:
total_bill tip sex smoker day time size useless
0 16.99 1.01 Female No Sun Dinner 2 131.771350
1 10.34 1.66 Male No Sun Dinner 3 142.896971
2 21.01 3.50 Male No Sun Dinner 3 26.060033
3 23.68 3.31 Male No Sun Dinner 2 -84.694002
4 24.59 3.61 Female No Sun Dinner 4 -122.228618
... ... ... ... ... ... ... ... ...
239 29.03 5.92 Male No Sat Dinner 3 -421.752080
240 27.18 2.00 Female Yes Sat Dinner 2 -156.040601
241 22.67 2.00 Male Yes Sat Dinner 2 -261.258523
242 17.82 1.75 Male No Sat Dinner 2 53.197656
243 18.78 3.00 Female No Thur Dinner 2 -403.448678

244 rows × 8 columns

In [11]:
model2 = lm.LinearRegression()
model2.fit(df[['total_bill', 'useless']], df['tip'])
pred2 = model2.predict(df[['total_bill', 'useless']])
In [12]:
rmse(df['tip'], pred2)
Out[12]:
1.0155449131945493
In [13]:
model2.coef_
Out[13]:
array([ 0.10487718, -0.00019831])
In [14]:
model2.intercept_
Out[14]:
0.9177561519777169
In [15]:
plt.scatter(df['tip'], pred2);

Our new RMSE was marginally lower! Why? Because the second model could always set the coefficient on useless to 0 and exactly recover the first model, adding a feature can never make the optimal training RMSE worse; it can only stay the same or decrease.

Note that the coefficient for our useless feature is indeed very close to 0.

What about Multiple $R^2$?

For the original model:

In [16]:
np.corrcoef(pred1, df['tip'])[0, 1]**2
Out[16]:
0.4566165863516758
In [17]:
np.var(pred1) / np.var(df['tip'])
Out[17]:
0.4566165863516756

Note: model.score for a LinearRegression model also computes the $R^2$ value! See:

In [18]:
model1.score(df[['total_bill']], df['tip'])
Out[18]:
0.45661658635167657

Note that the above three values are all the same.

For the model with the useless feature:

In [19]:
np.var(pred2) / np.var(df['tip'])
Out[19]:
0.4590753875504143

Recall that we can interpret $R^2$ as the proportion of variance in our true $y$ values that our fitted values capture. This says that the model with the useless feature included accounts for more of the variation in the true $y$ values than the first model does.

Does this make sense?

Let's look at how such a model performs on unseen ("test") data.

We haven't yet formally taught you how to use scikit-learn's built-in train/test split, so we will do it by hand for now.
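For reference, scikit-learn's train_test_split does the same thing in one call; a minimal sketch (the variable names and random_state here are arbitrary), though we'll stick with the manual split below:

from sklearn.model_selection import train_test_split

# Hold out 25% of the rows as a test set, analogous to the manual split below.
train_df, test_df = train_test_split(df, test_size=0.25, random_state=42)
train_df.shape, test_df.shape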

In [20]:
idx = np.arange(len(df))
np.random.shuffle(idx)
In [21]:
idx
Out[21]:
array([  1, 201,  71, 154,  33, 183,  38, 232,  88, 224, 144, 139, 240,
       187, 195, 226,  53,  98, 149,  22, 210,  46, 181, 173,  70, 213,
        54,  64,  48,  73, 134,  99,   2,  26, 116,  11, 118,   8,  78,
        89, 124, 158, 186, 112, 138,  96, 163,  31, 108, 105,  94, 188,
       217, 171,  37,   5, 145, 129,  86,  58,   4, 206, 161, 242,  32,
       153, 192, 177,  44,  65, 142, 231,  90,  68, 125, 172,  80,  61,
       214, 165, 109, 236, 126, 209, 141, 241, 151, 238, 228, 212, 176,
        25,  66, 104, 235, 203,   9,  23, 101,  77,  72,  30, 170, 220,
       202,  67, 119, 137, 123,  20,  18,  55, 204, 136,  42,  85,  10,
        36,  14,  35,  92, 199,  60,  15, 239,  39, 184, 155, 211, 185,
       215,  40, 100,  21, 222, 115, 243, 120, 233, 156, 102,   6,  17,
       175,  81,  95, 107, 230, 194,  87,  63, 147, 111,  82,  50, 167,
       121, 150, 207,  75, 113, 219,  27, 152,  69, 216,  79, 182,  93,
        51, 218,  59, 190,  91,  29, 114,  24,  43, 117, 168, 127, 146,
       169, 128,   3, 174,  34, 225,  41, 110, 132, 221, 189, 160,  76,
       234,  62, 200, 143,  83, 130,  28, 164,  16,  57, 227,   7,  45,
        47, 229, 178, 197,  56, 179,  49, 135, 131, 208,  19, 103, 162,
       180, 157, 198, 237, 193,  97, 159,  84, 122, 196, 148,   0, 140,
       191, 106, 133,  13,  12, 166, 223,  52,  74, 205])
In [22]:
len(idx)
Out[22]:
244
In [23]:
split_point = int((3/4) * len(idx))
In [24]:
split_point
Out[24]:
183
In [25]:
train, test = df.iloc[idx[:split_point]], df.iloc[idx[split_point:]]
In [26]:
train
Out[26]:
total_bill tip sex smoker day time size useless
1 10.34 1.66 Male No Sun Dinner 3 142.896971
201 12.74 2.01 Female Yes Thur Lunch 2 -514.661072
71 17.07 3.00 Female No Sat Dinner 3 -225.277185
154 19.77 2.00 Male No Sun Dinner 4 -257.022761
33 20.69 2.45 Female No Sat Dinner 4 -280.069292
... ... ... ... ... ... ... ... ...
117 10.65 1.50 Female No Thur Lunch 2 -259.823678
168 10.59 1.61 Female Yes Sat Dinner 2 169.550729
127 14.52 2.00 Female No Thur Lunch 2 -461.221484
146 18.64 1.36 Female No Thur Lunch 3 -29.758910
169 10.63 2.00 Female Yes Sat Dinner 2 570.352014

183 rows × 8 columns

In [27]:
test
Out[27]:
total_bill tip sex smoker day time size useless
128 11.38 2.00 Female No Thur Lunch 2 46.440591
3 23.68 3.31 Male No Sun Dinner 2 -84.694002
174 16.82 4.00 Male Yes Sun Dinner 2 -1.382111
34 17.78 3.27 Male No Sat Dinner 2 -99.177752
225 16.27 2.50 Female Yes Fri Lunch 2 26.222213
... ... ... ... ... ... ... ... ...
166 20.76 2.24 Male No Sun Dinner 2 -72.179415
223 15.98 3.00 Female No Fri Lunch 3 143.733794
52 34.81 5.20 Female No Sun Dinner 4 470.422379
74 14.73 2.20 Female No Sat Dinner 2 -488.249524
205 16.47 3.23 Female Yes Thur Lunch 3 363.289350

61 rows × 8 columns

In [28]:
new_model1 = lm.LinearRegression()
new_model1.fit(train[['total_bill']], train['tip'])
new_pred1_train = new_model1.predict(train[['total_bill']])
new_pred1_test = new_model1.predict(test[['total_bill']])
In [29]:
new_model1_train_rmse = rmse(train['tip'], new_pred1_train)
new_model1_test_rmse = rmse(test['tip'], new_pred1_test)

new_model1_train_rmse, new_model1_test_rmse
Out[29]:
(1.043322299509456, 0.9434971905441378)

Now, for our model with the useless feature:

In [30]:
new_model2 = lm.LinearRegression()
new_model2.fit(train[['total_bill', 'useless']], train['tip'])
new_pred2_train = new_model2.predict(train[['total_bill', 'useless']])
new_pred2_test = new_model2.predict(test[['total_bill', 'useless']])
In [31]:
new_model2_train_rmse = rmse(train['tip'], new_pred2_train)
new_model2_test_rmse = rmse(test['tip'], new_pred2_test)

new_model2_train_rmse, new_model2_test_rmse
Out[31]:
(1.0368979705593446, 0.959464772352727)
In [32]:
new_model1.coef_
Out[32]:
array([0.11048425])
In [33]:
new_model2.coef_
Out[33]:
array([ 0.11026443, -0.00032182])

Note that here, training RMSE went down but test RMSE went up. This is generally what happens when you include features that aren't truly relevant to the underlying relationship in your data. We call this overfitting.
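To see this effect more directly, here is a sketch that keeps adding pure-noise features and refits the model, reusing the train/test split and the rmse helper from above (the noise_i column names are made up for illustration):

# Train RMSE keeps shrinking as we pile on noise features,
# while test RMSE typically creeps up.
np.random.seed(0)
tr, te = train.copy(), test.copy()
feats = ['total_bill']
for i in range(5):
    col = f'noise_{i}'
    tr[col] = np.random.randn(len(tr))
    te[col] = np.random.randn(len(te))
    feats.append(col)
    m = lm.LinearRegression().fit(tr[feats], tr['tip'])
    print(len(feats),
          round(rmse(tr['tip'], m.predict(tr[feats])), 4),
          round(rmse(te['tip'], m.predict(te[feats])), 4))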

3. One hot encoding

In [34]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn.linear_model as lm

from sklearn.feature_extraction import DictVectorizer
In [35]:
df = sns.load_dataset('tips')
df.head()
Out[35]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

Why do we need one hot encoding?

In [36]:
model1 = lm.LinearRegression()
model1.fit(df.drop(columns='tip'), df['tip'])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-36-7f6b6959884c> in <module>
      1 model1 = lm.LinearRegression()
----> 2 model1.fit(df.drop(columns='tip'), df['tip'])

/opt/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/base.py in fit(self, X, y, sample_weight)
    461         n_jobs_ = self.n_jobs
    462         X, y = check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo'],
--> 463                          y_numeric=True, multi_output=True)
    464 
    465         if sample_weight is not None and np.atleast_1d(sample_weight).ndim > 1:

/opt/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
    717                     ensure_min_features=ensure_min_features,
    718                     warn_on_dtype=warn_on_dtype,
--> 719                     estimator=estimator)
    720     if multi_output:
    721         y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,

/opt/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    534         # make sure we actually converted to numeric:
    535         if dtype_numeric and array.dtype.kind == "O":
--> 536             array = array.astype(np.float64)
    537         if not allow_nd and array.ndim >= 3:
    538             raise ValueError("Found array with dim %d. %s expected <= 2."

ValueError: could not convert string to float: 'Female'

How to perform a one hot encoding in scikit-learn

In [37]:
df = df.drop(columns='tip')
cat_cols = ['sex', 'smoker', 'day', 'time']
In [38]:
vec_enc = DictVectorizer()
vec_enc.fit(df[cat_cols].to_dict(orient='records'))
cat_data = vec_enc.transform(df[cat_cols].to_dict(orient='records')).toarray()
cat_data
Out[38]:
array([[0., 0., 1., ..., 0., 1., 0.],
       [0., 0., 1., ..., 0., 1., 0.],
       [0., 0., 1., ..., 0., 1., 0.],
       ...,
       [0., 1., 0., ..., 1., 1., 0.],
       [0., 1., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 1., 0.]])
In [39]:
cat_data_names = vec_enc.get_feature_names()
cat_data_names
Out[39]:
['day=Fri',
 'day=Sat',
 'day=Sun',
 'day=Thur',
 'sex=Female',
 'sex=Male',
 'smoker=No',
 'smoker=Yes',
 'time=Dinner',
 'time=Lunch']
In [40]:
cat_data = pd.DataFrame(cat_data, columns=cat_data_names)
df_ohe = pd.concat([df, cat_data], axis=1).drop(columns=cat_cols) # Drop original categorical columns
df_ohe.head()
Out[40]:
total_bill size day=Fri day=Sat day=Sun day=Thur sex=Female sex=Male smoker=No smoker=Yes time=Dinner time=Lunch
0 16.99 2 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0
1 10.34 3 0.0 0.0 1.0 0.0 0.0 1.0 1.0 0.0 1.0 0.0
2 21.01 3 0.0 0.0 1.0 0.0 0.0 1.0 1.0 0.0 1.0 0.0
3 23.68 2 0.0 0.0 1.0 0.0 0.0 1.0 1.0 0.0 1.0 0.0
4 24.59 4 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0
In [41]:
df_ohe.shape
Out[41]:
(244, 12)
In [42]:
X = pd.concat([df_ohe, pd.Series(np.ones(df_ohe.shape[0]), name='intercept')], axis=1)
X.head()
Out[42]:
total_bill size day=Fri day=Sat day=Sun day=Thur sex=Female sex=Male smoker=No smoker=Yes time=Dinner time=Lunch intercept
0 16.99 2 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 1.0
1 10.34 3 0.0 0.0 1.0 0.0 0.0 1.0 1.0 0.0 1.0 0.0 1.0
2 21.01 3 0.0 0.0 1.0 0.0 0.0 1.0 1.0 0.0 1.0 0.0 1.0
3 23.68 2 0.0 0.0 1.0 0.0 0.0 1.0 1.0 0.0 1.0 0.0 1.0
4 24.59 4 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 1.0
In [43]:
X.shape
Out[43]:
(244, 13)

What does this output?

In [44]:
np.linalg.matrix_rank(X)
Out[44]:
9

What's the issue with the design matrix above?
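One way to see the issue: within each categorical feature, the one-hot columns sum to the all-ones intercept column, so four of the 13 columns are redundant (one per categorical feature) and the rank is 13 − 4 = 9. A quick check, using the column names produced above:

# Each group of one-hot columns sums to the intercept column,
# so the columns of X are linearly dependent.
day_cols = ['day=Fri', 'day=Sat', 'day=Sun', 'day=Thur']
print(np.allclose(X[day_cols].sum(axis=1), X['intercept']))           # True
print(np.allclose(X['sex=Female'] + X['sex=Male'], X['intercept']))   # True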

In [45]:
X = X.drop(columns=['day=Sat', 'sex=Female', 'smoker=No', 'time=Dinner'])
X.head()
Out[45]:
total_bill size day=Fri day=Sun day=Thur sex=Male smoker=Yes time=Lunch intercept
0 16.99 2 0.0 1.0 0.0 0.0 0.0 0.0 1.0
1 10.34 3 0.0 1.0 0.0 1.0 0.0 0.0 1.0
2 21.01 3 0.0 1.0 0.0 1.0 0.0 0.0 1.0
3 23.68 2 0.0 1.0 0.0 1.0 0.0 0.0 1.0
4 24.59 4 0.0 1.0 0.0 0.0 0.0 0.0 1.0
In [46]:
X.shape
Out[46]:
(244, 9)

What does this output?

In [47]:
np.linalg.matrix_rank(X)
Out[47]:
9

The design matrix now has 9 columns and rank 9, i.e. full column rank: dropping one level from each categorical feature removes the redundancy with the intercept column.

One hot encoding with Pandas

In [48]:
df_ohe2 = pd.get_dummies(df)
df_ohe2.head()
Out[48]:
total_bill size sex_Male sex_Female smoker_Yes smoker_No day_Thur day_Fri day_Sat day_Sun time_Lunch time_Dinner
0 16.99 2 0 1 0 1 0 0 0 1 0 1
1 10.34 3 1 0 0 1 0 0 0 1 0 1
2 21.01 3 1 0 0 1 0 0 0 1 0 1
3 23.68 2 1 0 0 1 0 0 0 1 0 1
4 24.59 4 0 1 0 1 0 0 0 1 0 1
In [49]:
X_2 = pd.concat([df_ohe2, pd.Series(np.ones(df_ohe2.shape[0]), name='intercept')], axis=1)
X_2.head()
Out[49]:
total_bill size sex_Male sex_Female smoker_Yes smoker_No day_Thur day_Fri day_Sat day_Sun time_Lunch time_Dinner intercept
0 16.99 2 0 1 0 1 0 0 0 1 0 1 1.0
1 10.34 3 1 0 0 1 0 0 0 1 0 1 1.0
2 21.01 3 1 0 0 1 0 0 0 1 0 1 1.0
3 23.68 2 1 0 0 1 0 0 0 1 0 1 1.0
4 24.59 4 0 1 0 1 0 0 0 1 0 1 1.0
In [50]:
X_2.shape
Out[50]:
(244, 13)
In [51]:
np.linalg.matrix_rank(X_2)
Out[51]:
9
In [52]:
df_ohe2 = pd.get_dummies(df, drop_first=True)
df_ohe2.head()
Out[52]:
total_bill size sex_Female smoker_No day_Fri day_Sat day_Sun time_Dinner
0 16.99 2 1 1 0 0 1 1
1 10.34 3 0 1 0 0 1 1
2 21.01 3 0 1 0 0 1 1
3 23.68 2 0 1 0 0 1 1
4 24.59 4 1 1 0 0 1 1
In [53]:
X_2 = pd.concat([df_ohe2, pd.Series(np.ones(df_ohe2.shape[0]), name='intercept')], axis=1)
X_2.head()
Out[53]:
total_bill size sex_Female smoker_No day_Fri day_Sat day_Sun time_Dinner intercept
0 16.99 2 1 1 0 0 1 1 1.0
1 10.34 3 0 1 0 0 1 1 1.0
2 21.01 3 0 1 0 0 1 1 1.0
3 23.68 2 0 1 0 0 1 1 1.0
4 24.59 4 1 1 0 0 1 1 1.0
In [54]:
X_2.shape
Out[54]:
(244, 9)
In [55]:
np.linalg.matrix_rank(X_2)
Out[55]:
9

More examples of one hot encoding

In [56]:
X_3 = pd.get_dummies(df[['total_bill', 'sex']])
X_3.head()
Out[56]:
total_bill sex_Male sex_Female
0 16.99 0 1
1 10.34 1 0
2 21.01 1 0
3 23.68 1 0
4 24.59 0 1

What would this output?

In [57]:
np.linalg.matrix_rank(X_3)
Out[57]:
3
In [58]:
X_4 = pd.concat([X_3, pd.Series(np.ones(X_3.shape[0]), name='intercept')], axis=1)
X_4.head()
Out[58]:
total_bill sex_Male sex_Female intercept
0 16.99 0 1 1.0
1 10.34 1 0 1.0
2 21.01 1 0 1.0
3 23.68 1 0 1.0
4 24.59 0 1 1.0

What would this output?

In [59]:
np.linalg.matrix_rank(X_4)
Out[59]:
3
In [60]:
(X_4['intercept'] - X_4['sex_Male']).iloc[:5]
Out[60]:
0    1.0
1    0.0
2    0.0
3    0.0
4    1.0
dtype: float64
In [61]:
X_5 = pd.get_dummies(df[['total_bill', 'sex', 'smoker']])
X_5.tail(5)
Out[61]:
total_bill sex_Male sex_Female smoker_Yes smoker_No
239 29.03 1 0 0 1
240 27.18 0 1 1 0
241 22.67 1 0 1 0
242 17.82 1 0 0 1
243 18.78 0 1 0 1

What would this output?

In [62]:
np.linalg.matrix_rank(X_5)
Out[62]:
4
In [63]:
(X_5['sex_Male'] + X_5['sex_Female'] - X_5['smoker_Yes']).iloc[-5:]
Out[63]:
239    1
240    0
241    0
242    1
243    1
dtype: uint8
In [64]:
X_6 = pd.concat([X_5, pd.Series(np.ones(X_5.shape[0]), name='intercept')], axis=1)
X_6.tail()
Out[64]:
total_bill sex_Male sex_Female smoker_Yes smoker_No intercept
239 29.03 1 0 0 1 1.0
240 27.18 0 1 1 0 1.0
241 22.67 1 0 1 0 1.0
242 17.82 1 0 0 1 1.0
243 18.78 0 1 0 1 1.0

What would this output?

In [65]:
np.linalg.matrix_rank(X_6)
Out[65]:
4
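To see why the rank is 4 rather than 6: both one-hot pairs sum to the intercept column, giving two linear dependencies (6 columns − 2 dependencies = rank 4). A quick check:

# Each one-hot pair sums to the all-ones intercept column.
print(np.allclose(X_6['sex_Male'] + X_6['sex_Female'], X_6['intercept']))    # True
print(np.allclose(X_6['smoker_Yes'] + X_6['smoker_No'], X_6['intercept']))   # True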

4. Duplicate features

In [66]:
df = sns.load_dataset('tips')
In [67]:
X_7 = df[['total_bill', 'size', 'size']]
X_7 = pd.concat([X_7, pd.Series(np.ones(X_7.shape[0]), name='intercept')], axis=1)
X_7.head()
Out[67]:
total_bill size size intercept
0 16.99 2 2 1.0
1 10.34 3 3 1.0
2 21.01 3 3 1.0
3 23.68 2 2 1.0
4 24.59 4 4 1.0

What would this output?

In [68]:
np.linalg.matrix_rank(X_7)
Out[68]:
3

What's the issue with this again?
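One way to see the problem: the two identical size columns make $\mathbb{X}^T \mathbb{X}$ singular, so the least squares solution $\hat{\theta}$ is not unique; sklearn simply returns one of infinitely many minimizers. A quick check:

# X_7 has rank 3 but 4 columns, so X^T X (a 4x4 matrix) is not invertible.
XtX = X_7.values.T @ X_7.values
print(np.linalg.matrix_rank(XtX))   # 3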

In [69]:
model2 = lm.LinearRegression(fit_intercept=False)
model2.fit(X_7, df['tip'])
Out[69]:
LinearRegression(copy_X=True, fit_intercept=False, n_jobs=None, normalize=False)
In [70]:
model2_coef = model2.coef_
model2_coef
Out[70]:
array([0.09271334, 0.0962989 , 0.0962989 , 0.66894474])
In [71]:
model2.intercept_
Out[71]:
0.0
In [72]:
model2.predict(X_7)[:5]
Out[72]:
array([2.62933992, 2.20539403, 3.19464533, 3.24959215, 3.71915687])
In [73]:
X_7.iloc[:5] @ model2_coef
Out[73]:
0    2.629340
1    2.205394
2    3.194645
3    3.249592
4    3.719157
dtype: float64

How can I change the model coefficients so that the predictions are the same?

Our model is:

$$\theta_0 + \theta_1 \cdot size + \theta_2 \cdot size + \theta_3 \cdot total\_bill$$
In [74]:
model2_coef_modified = model2_coef.copy()
model2_coef_modified[1] = model2_coef[1] - 1000000000
model2_coef_modified[2] = model2_coef[2] + 1000000000
In [75]:
model2_coef
Out[75]:
array([0.09271334, 0.0962989 , 0.0962989 , 0.66894474])
In [76]:
model2_coef_modified
Out[76]:
array([ 9.27133368e-02, -1.00000000e+09,  1.00000000e+09,  6.68944741e-01])
In [77]:
X_7.iloc[:5] @ model2_coef_modified
Out[77]:
0    2.629340
1    2.205394
2    3.194646
3    3.249592
4    3.719157
dtype: float64

Our model is now:

$$\theta_0 + (\theta_1 - 10^9) \cdot size + (\theta_2 + 10^9) \cdot size + \theta_3 \cdot total\_bill$$
$$= \theta_0 + \theta_1 \cdot size - 10^9 \cdot size + \theta_2 \cdot size + 10^9 \cdot size + \theta_3 \cdot total\_bill$$
$$= \theta_0 + \theta_1 \cdot size + \theta_2 \cdot size + \theta_3 \cdot total\_bill$$

which is the same as before!

Using 2 times size as a feature

In [78]:
X_8 = df[['total_bill', 'size']].copy()  # .copy() avoids a SettingWithCopyWarning below
X_8.loc[:, '2 * size'] = 2 * X_8['size']
X_8 = pd.concat([X_8, pd.Series(np.ones(X_8.shape[0]), name='intercept')], axis=1)
X_8.head()
Out[78]:
total_bill size 2 * size intercept
0 16.99 2 4 1.0
1 10.34 3 6 1.0
2 21.01 3 6 1.0
3 23.68 2 4 1.0
4 24.59 4 8 1.0

What would this output?

In [79]:
np.linalg.matrix_rank(X_8)
Out[79]:
3
In [80]:
model3 = lm.LinearRegression(fit_intercept=False)
model3.fit(X_8, df['tip'])
Out[80]:
LinearRegression(copy_X=True, fit_intercept=False, n_jobs=None, normalize=False)
In [81]:
model3_coef = model3.coef_
model3_coef
Out[81]:
array([0.09271334, 0.03851956, 0.07703912, 0.66894474])
In [82]:
model3.predict(X_8)[:5]
Out[82]:
array([2.62933992, 2.20539403, 3.19464533, 3.24959215, 3.71915687])
In [83]:
X_8.iloc[:5] @ model3_coef
Out[83]:
0    2.629340
1    2.205394
2    3.194645
3    3.249592
4    3.719157
dtype: float64

How can I change the coefficients so that the predictions are the same?

Can I do the same thing as before?

In [84]:
model3_coef_modified = model3_coef.copy()
model3_coef_modified[1] = model3_coef[1] - 100
model3_coef_modified[2] = model3_coef[2] + 100
In [85]:
X_8.iloc[:5] @ model3_coef_modified
Out[85]:
0    202.629340
1    302.205394
2    303.194645
3    203.249592
4    403.719157
dtype: float64
In [86]:
model3_coef_modified = model3_coef.copy()
model3_coef_modified[1] = model3_coef[1] - 100
model3_coef_modified[2] = model3_coef[2] + 50
In [87]:
X_8.iloc[:5] @ model3_coef_modified
Out[87]:
0    2.629340
1    2.205394
2    3.194645
3    3.249592
4    3.719157
dtype: float64

This works because subtracting 100 from the size coefficient removes $100 \cdot size$ from each prediction, while adding 50 to the 2 * size coefficient adds back $50 \cdot (2 \cdot size) = 100 \cdot size$, so the predictions are unchanged.

Thought exercise: What happens if I try to add 2 * size + 3 as a feature?

In [88]:
X_9 = df[['total_bill', 'size']].copy()  # .copy() avoids a SettingWithCopyWarning below
X_9.loc[:, '2 * size + 3'] = 2 * X_9['size'] + 3
X_9 = pd.concat([X_9, pd.Series(np.ones(X_9.shape[0]), name='intercept')], axis=1)
X_9.head()
Out[88]:
total_bill size 2 * size + 3 intercept
0 16.99 2 7 1.0
1 10.34 3 9 1.0
2 21.01 3 9 1.0
3 23.68 2 7 1.0
4 24.59 4 11 1.0

What would this output?

In [89]:
np.linalg.matrix_rank(X_9)
Out[89]:
3

The rank is still 3, because 2 * size + 3 is a linear combination of the existing columns: $2 \cdot size + 3 \cdot intercept$.

Adding size squared as a feature

In [90]:
X_10 = df[['total_bill', 'size']].copy()
X_10.loc[:, 'size ** 2'] = X_10['size'] ** 2
X_10 = pd.concat([X_10, pd.Series(np.ones(X_10.shape[0]), name='intercept')], axis=1)
X_10.head()
Out[90]:
total_bill size size ** 2 intercept
0 16.99 2 4 1.0
1 10.34 3 9 1.0
2 21.01 3 9 1.0
3 23.68 2 4 1.0
4 24.59 4 16 1.0

What would this output?

In [91]:
np.linalg.matrix_rank(X_10)
Out[91]:
4

This time the rank increases to 4: $size^2$ is a nonlinear transformation of size, so it is not a linear combination of the existing columns.

5. What makes a model linear?

Is the following model linear? (Suppose $x$ represents a single observation of our raw data matrix.)

$$f_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 \sin(x_1) + \theta_4 \cos(x_1x_2) e^{x_1}$$

Yes, because it is linear in terms of the parameters. We could formulate this model as $$f_\theta(x) = x^T \theta$$ where $x = \begin{bmatrix} 1 \\ x_1 \\ x_2 \\ \sin(x_1) \\ \cos(x_1x_2) e^{x_1} \end{bmatrix}$ and $\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \theta_3 \\ \theta_4 \end{bmatrix}$.
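Since this model is linear in $\theta$, we can fit it with ordinary least squares by precomputing the transformed features as columns. A minimal sketch on synthetic data (x1, x2, and y below are made up for illustration):

# Build the transformed feature columns, then fit with ordinary least squares.
np.random.seed(0)
x1 = np.random.uniform(0, 2, size=100)
x2 = np.random.uniform(0, 2, size=100)
y = 1 + 2*x1 - x2 + 0.5*np.sin(x1) + 0.1*np.cos(x1*x2)*np.exp(x1) + np.random.normal(0, 0.1, size=100)

X_feat = np.column_stack([x1, x2, np.sin(x1), np.cos(x1*x2)*np.exp(x1)])
model = lm.LinearRegression()      # fit_intercept=True plays the role of theta_0
model.fit(X_feat, y)
model.coef_, model.intercept_      # estimates of theta_1, ..., theta_4 and theta_0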


What about this model?

$$f_\theta(x) = \theta_0 + \theta_1 x + \theta_2 \sin(\theta_3 x)$$

No, because $\theta_3$ appears inside the $\sin$ function. We cannot write this model in the form $x^T \theta$, so it is not a linear model. It is still a model, though, and we can still find its optimal parameters, just not with the closed-form least squares solution; we would need to minimize the loss numerically instead.
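For instance, assuming scipy is available, we could minimize the average squared loss numerically; a minimal sketch on synthetic data (x, y, and the starting guess below are made up for illustration):

from scipy.optimize import minimize

# theta_0 + theta_1*x + theta_2*sin(theta_3*x) is not linear in theta,
# so we minimize the average squared loss numerically rather than in closed form.
np.random.seed(0)
x = np.random.uniform(0, 10, size=200)
y = 1 + 0.5*x + 2*np.sin(1.5*x) + np.random.normal(0, 0.3, size=200)

def avg_squared_loss(theta):
    preds = theta[0] + theta[1]*x + theta[2]*np.sin(theta[3]*x)
    return np.mean((y - preds)**2)

result = minimize(avg_squared_loss, x0=np.array([0.0, 0.0, 1.0, 1.0]))
result.x   # numerically optimized theta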