Suraj Rampure and Allen Shen
This notebook has 5 sections:
sklearn
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn.linear_model as lm
model = lm.LinearRegression()
model.fit(X, y) does both of these for us!
model.coef_ and model.intercept_ give us the values of $\hat{\theta}$ after fitting.
Use our model to make predictions: $\hat{\mathbb{Y}} = \mathbb{X} \hat{\theta}$
model.predict(X)
df = sns.load_dataset('tips')
df.head()
def rmse(y, yhat):
return np.sqrt(np.mean((y - yhat)**2))
Let's start by fitting a simple linear regression model to our familiar tips data. Specifically, we will use total_bill to predict tip.
model1 = lm.LinearRegression()
model1.fit(df[['total_bill']], df['tip'])
pred1 = model1.predict(df[['total_bill']])
rmse(df['tip'], pred1)
model1.coef_
model1.intercept_
plt.scatter(df['tip'], pred1);
Notably, our RMSE was 1.0178504025697377.
Now, let's add a completely unrelated column to our data, and include it as a feature in our model.
df['useless'] = np.random.randn(len(df)) * 342
df
model2 = lm.LinearRegression()
model2.fit(df[['total_bill', 'useless']], df['tip'])
pred2 = model2.predict(df[['total_bill', 'useless']])
rmse(df['tip'], pred2)
model2.coef_
model2.intercept_
plt.scatter(df['tip'], pred2);
Our new RMSE was marginally lower! Why?
Note that the coefficient for our useless feature is very close to 0.
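To see why the training RMSE cannot go up: the two-feature model can always reproduce the one-feature model by giving the useless feature a coefficient of 0, so least squares will only ever do at least as well on the training data. A quick check of this (a sketch added here for illustration, reusing the objects defined above):
# Appending a 0 coefficient for `useless` to the first model's parameters
# reproduces its predictions exactly, so the two-feature model can never
# have a higher training RMSE than the one-feature model.
manual_pred = df[['total_bill', 'useless']] @ np.append(model1.coef_, 0) + model1.intercept_
rmse(df['tip'], manual_pred), rmse(df['tip'], pred1)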
For the original model:
np.corrcoef(pred1, df['tip'])[0, 1]**2
np.var(pred1) / np.var(df['tip'])
Note: model.score for a LinearRegression model also computes the $R^2$ value! See:
model1.score(df[['total_bill']], df['tip'])
Note that all three of the above values are the same.
For the model with the useless feature:
np.var(pred2) / np.var(df['tip'])
Recall, we can interpret $R^2$ as being the proportion of variance in our true $y$ values that our fitted values capture. This is saying that our model with the useless feature included accounts for more of the variation in our true $y$ values than the first model does.
Does this make sense?
We haven't yet formally taught you how to use scikit-learn's inbuilt train/test split, so we will do this by hand for now.
idx = np.arange(len(df))
np.random.shuffle(idx)
idx
len(idx)
split_point = int((3/4) * len(idx))
split_point
train, test = df.iloc[idx[:split_point]], df.iloc[idx[split_point:]]
train
test
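For reference, scikit-learn has a built-in helper that does this shuffling and slicing for us; here is a minimal sketch (the 25% test size and the random seed are arbitrary choices):
from sklearn.model_selection import train_test_split

# Randomly holds out 25% of the rows as a test set, analogous to the manual split above.
train_df, test_df = train_test_split(df, test_size=0.25, random_state=42)
train_df.shape, test_df.shape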
new_model1 = lm.LinearRegression()
new_model1.fit(train[['total_bill']], train['tip'])
new_pred1_train = new_model1.predict(train[['total_bill']])
new_pred1_test = new_model1.predict(test[['total_bill']])
new_model1_train_rmse = rmse(train['tip'], new_pred1_train)
new_model1_test_rmse = rmse(test['tip'], new_pred1_test)
new_model1_train_rmse, new_model1_test_rmse
Now, for our model with the useless feature:
new_model2 = lm.LinearRegression()
new_model2.fit(train[['total_bill', 'useless']], train['tip'])
new_pred2_train = new_model2.predict(train[['total_bill', 'useless']])
new_pred2_test = new_model2.predict(test[['total_bill', 'useless']])
new_model2_train_rmse = rmse(train['tip'], new_pred2_train)
new_model2_test_rmse = rmse(test['tip'], new_pred2_test)
new_model2_train_rmse, new_model2_test_rmse
new_model1.coef_
new_model2.coef_
Note that here, training RMSE went down, but test RMSE went up. This is generally what happens when you include features that aren't truly relevant to the underlying relationship in your data. We call this overfitting.
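To make this concrete, here is a small illustrative sketch (not part of the original analysis above): as we keep adding columns of pure noise, training RMSE keeps falling while test RMSE tends to rise.
np.random.seed(42)
train_rmses, test_rmses = [], []
train_X = train[['total_bill']].copy()
test_X = test[['total_bill']].copy()
for i in range(20):
    m = lm.LinearRegression().fit(train_X, train['tip'])
    train_rmses.append(rmse(train['tip'], m.predict(train_X)))
    test_rmses.append(rmse(test['tip'], m.predict(test_X)))
    # Add another useless random feature before the next fit.
    train_X[f'junk{i}'] = np.random.randn(len(train_X))
    test_X[f'junk{i}'] = np.random.randn(len(test_X))
plt.plot(train_rmses, label='train RMSE')
plt.plot(test_rmses, label='test RMSE')
plt.xlabel('number of useless features')
plt.legend();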
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn.linear_model as lm
from sklearn.feature_extraction import DictVectorizer
df = sns.load_dataset('tips')
df.head()
model1 = lm.LinearRegression()
model1.fit(df.drop(columns='tip'), df['tip'])
This errors! sklearn can only work with numeric features, and sex, smoker, day, and time are categorical, so we need to encode them as numbers first.
df = df.drop(columns='tip')
cat_cols = ['sex', 'smoker', 'day', 'time']
vec_enc = DictVectorizer()
vec_enc.fit(df[cat_cols].to_dict(orient='records'))
cat_data = vec_enc.transform(df[cat_cols].to_dict(orient='records')).toarray()
cat_data
cat_data_names = vec_enc.get_feature_names()  # note: in newer versions of scikit-learn, this is get_feature_names_out()
cat_data_names
cat_data = pd.DataFrame(cat_data, columns=cat_data_names)
df_ohe = pd.concat([df, cat_data], axis=1).drop(columns=cat_cols) # Drop original categorical columns
df_ohe.head()
df_ohe.shape
X = pd.concat([df_ohe, pd.Series(np.ones(df_ohe.shape[0]), name='intercept')], axis=1)
X.head()
X.shape
np.linalg.matrix_rank(X)
The matrix is not full rank: within each categorical feature, the dummy columns sum to the intercept column of all ones (for example, sex=Male + sex=Female = intercept), so we drop one dummy per categorical feature.
X = X.drop(columns=['day=Sat', 'sex=Female', 'smoker=No', 'time=Dinner'])
X.head()
X.shape
np.linalg.matrix_rank(X)
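Now that every column is numeric and the redundant dummies have been dropped, the fit that errored at the start of this section goes through. A minimal sketch (the names tip and model_ohe are introduced here just for illustration; we reload the target since we dropped it from df above):
tip = sns.load_dataset('tips')['tip']  # the target column we dropped from df earlier
model_ohe = lm.LinearRegression(fit_intercept=False)  # X already contains an intercept column
model_ohe.fit(X, tip)
model_ohe.coef_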
df_ohe2 = pd.get_dummies(df)
df_ohe2.head()
X_2 = pd.concat([df_ohe2, pd.Series(np.ones(df_ohe2.shape[0]), name='intercept')], axis=1)
X_2.head()
X_2.shape
np.linalg.matrix_rank(X_2)
df_ohe2 = pd.get_dummies(df, drop_first=True)
df_ohe2.head()
X_2 = pd.concat([df_ohe2, pd.Series(np.ones(df_ohe2.shape[0]), name='intercept')], axis=1)
X_2.head()
X_2.shape
np.linalg.matrix_rank(X_2)
X_3 = pd.get_dummies(df[['total_bill', 'sex']])
X_3.head()
np.linalg.matrix_rank(X_3)
X_4 = pd.concat([X_3, pd.Series(np.ones(X_3.shape[0]), name='intercept')], axis=1)
X_4.head()
np.linalg.matrix_rank(X_4)
Note that X_4 is not full rank because sex_Male + sex_Female is a column of all ones, so intercept - sex_Male is exactly sex_Female:
(X_4['intercept'] - X_4['sex_Male']).iloc[:5]
X_5 = pd.get_dummies(df[['total_bill', 'sex', 'smoker']])
X_5.tail(5)
np.linalg.matrix_rank(X_5)
(X_5['sex_Male'] + X_5['sex_Female'] - X_5['smoker_Yes']).iloc[-5:]
X_6 = pd.concat([X_5, pd.Series(np.ones(X_5.shape[0]), name='intercept')], axis=1)
X_6.tail()
np.linalg.matrix_rank(X_6)
df = sns.load_dataset('tips')
X_7 = df[['total_bill', 'size', 'size']]
X_7 = pd.concat([X_7, pd.Series(np.ones(X_7.shape[0]), name='intercept')], axis=1)
X_7.head()
np.linalg.matrix_rank(X_7)
model2 = lm.LinearRegression(fit_intercept=False)
model2.fit(X_7, df['tip'])
model2_coef = model2.coef_
model2_coef
model2.intercept_
model2.predict(X_7)[:5]
X_7.iloc[:5] @ model2_coef
Our model is:
$$\theta_0 + \theta_1 \cdot size + \theta_2 \cdot size + \theta_3 \cdot total\_bill$$
model2_coef_modified = model2_coef.copy()
model2_coef_modified[1] = model2_coef[1] - 100
model2_coef_modified[2] = model2_coef[2] + 100
model2_coef
model2_coef_modified
X_7.iloc[:5] @ model2_coef_modified
Our model is now:
$$\theta_0 + (\theta_1 - 100) \cdot size + (\theta_2 + 100) \cdot size + \theta_3 \cdot total\_bill$$
$$= \theta_0 + \theta_1 \cdot size - 100 \cdot size + \theta_2 \cdot size + 100 \cdot size + \theta_3 \cdot total\_bill$$
$$= \theta_0 + \theta_1 \cdot size + \theta_2 \cdot size + \theta_3 \cdot total\_bill$$
which is the same as before!
X_8 = df[['total_bill', 'size']].copy()  # .copy() so the column assignment below doesn't trigger a chained-assignment warning
X_8.loc[:, '2 * size'] = 2 * X_8['size']
X_8 = pd.concat([X_8, pd.Series(np.ones(X_8.shape[0]), name='intercept')], axis=1)
X_8.head()
np.linalg.matrix_rank(X_8)
model3 = lm.LinearRegression(fit_intercept=False)
model3.fit(X_8, df['tip'])
model3_coef = model3.coef_
model3_coef
model3.predict(X_8)[:5]
X_8.iloc[:5] @ model3_coef
Can I do the same thing as before?
model3_coef_modified = model3_coef.copy()
model3_coef_modified[1] = model3_coef[1] - 100
model3_coef_modified[2] = model3_coef[2] + 100
X_8.iloc[:5] @ model3_coef_modified
model3_coef_modified = model3_coef.copy()
model3_coef_modified[1] = model3_coef[1] - 100
model3_coef_modified[2] = model3_coef[2] + 50
X_8.iloc[:5] @ model3_coef_modified
This matches the original predictions: subtracting 100 from the size coefficient removes 100 * size from each prediction, while adding 50 to the 2 * size coefficient adds 50 * (2 * size) = 100 * size back. Adding 100 to both, as in the previous attempt, does not cancel out.
X_9 = df[['total_bill', 'size']].copy()
X_9.loc[:, '2 * size + 3'] = 2 * X_9['size'] + 3
X_9 = pd.concat([X_9, pd.Series(np.ones(X_9.shape[0]), name='intercept')], axis=1)
X_9.head()
np.linalg.matrix_rank(X_9)
X_10 = df[['total_bill', 'size']].copy()
X_10.loc[:, 'size ** 2'] = X_10['size'] ** 2
X_10 = pd.concat([X_10, pd.Series(np.ones(X_10.shape[0]), name='intercept')], axis=1)
X_10.head()
np.linalg.matrix_rank(X_10)
Is the following model linear? (Suppose $x$ represents a single observation of our raw data matrix.)
$$f_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 \sin(x_1) + \theta_4 \cos(x_1x_2) e^{x_1}$$
Yes, because it is linear in terms of the parameters. We could formulate this model as $$f_\theta(x) = x^T \theta$$ where $x = \begin{bmatrix} 1 \\ x_1 \\ x_2 \\ \sin(x_1) \\ \cos(x_1x_2) e^{x_1} \end{bmatrix}$ and $\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \theta_3 \\ \theta_4 \end{bmatrix}$.
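As a quick sanity check, here is a minimal sketch of fitting such a model with ordinary least squares, using total_bill and size from the tips data as stand-ins for $x_1$ and $x_2$ (an arbitrary choice, purely for illustration):
x1, x2 = df['total_bill'], df['size']
# Build the transformed features; the model is linear in theta, so ordinary least squares applies.
features = pd.DataFrame({
    'x1': x1,
    'x2': x2,
    'sin(x1)': np.sin(x1),
    'cos(x1 x2) e^x1': np.cos(x1 * x2) * np.exp(x1),
})
lin_model = lm.LinearRegression()  # theta_0 is handled by the intercept
lin_model.fit(features, df['tip'])
lin_model.intercept_, lin_model.coef_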
What about this model?
$$f_\theta(x) = \theta_0 + \theta_1 x + \theta_2 \sin(\theta_3 x)$$
No, because $\theta_3$ is inside a $\sin$ function: we cannot write this model in the form $x^T \theta$, so it is not a linear model. It is still a perfectly valid model, and we can still find its optimal parameters, just not with (linear) least squares.
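For instance, here is a minimal sketch of estimating its parameters numerically with scipy.optimize.minimize (total_bill and tip are used as stand-in data, and the starting point is an arbitrary choice):
from scipy.optimize import minimize

x, y = df['total_bill'].values, df['tip'].values

def mse(theta):
    # Mean squared error of the non-linear model for a given parameter vector.
    t0, t1, t2, t3 = theta
    return np.mean((y - (t0 + t1 * x + t2 * np.sin(t3 * x))) ** 2)

result = minimize(mse, x0=np.ones(4))
result.x  # numerically estimated theta_0, ..., theta_3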