In this lecture we will introduce linear models and least squares linear regression. We will explain the basic parametric model formulation and how to optimize it using linear algebra and geometry rather than gradient descent.
Updated: This notebook has been slightly updated with additional explanations beyond the original posting. I have also added a video walk-through of the notebook, which is embedded at the very end.
In this notebook we will be using the following packages:
import numpy as np
import pandas as pd
import plotly.offline as py
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
import cufflinks as cf
cf.set_config_file(offline=True, sharing=False, theme='ggplot');
# Support embedding YouTube Videos in Notebooks
from IPython.display import YouTubeVideo
This is the first of a series of lectures that are designed to be completed online rather than in class. This new lecture format is experimental and we ask for your feedback on Piazza.
The playlist for all the videos in this lecture can be found here:
Least Squares Linear Regression
We have designed this video notebook to be a series of short videos followed by simple questions (with answers provided) and demos. There are no grades for working through this notebook; however, we hope that combining the videos with activities will make the material more engaging.
The following short video provides a recap of the modeling recipe (model, loss, optimize). If you are comfortable with this you can safely skip this video.
YouTubeVideo("u0YycwQ814c")
What are the three stages of modeling used in this class?
In the following video we introduce the multiple linear regression setting.
YouTubeVideo("21PN6a3s22I")
To motivate the multiple linear regression setting we will work with the classic diamonds dataset.
data = pd.read_csv("data/diamonds.csv.zip", index_col=0)
data.head()
Each record in the dataset corresponds to a single diamond. The fields include:

* carat: the weight of the diamond
* cut: the quality of the cut (Fair, Good, Very Good, Premium, and Ideal)
* color: the diamond color, ranging from J (worst) to D (best)
* clarity: a measure of the diamond's clarity (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
* depth and table: measurements describing the proportions of the diamond
* price: the price of the diamond

We are interested in predicting the price of a diamond given its characteristics. Therefore, we would like to fit a model:
$$ f_\theta\left(\text{Diamond's Characteristics}\right) \rightarrow \text{Price} $$

Suppose for now we focus only on the diamond's carat, depth, and table values. In this case, we would want a model of the form:

$$ f_\theta\left(\textbf{carat}, \textbf{depth}, \textbf{table}\right) \rightarrow \textbf{price} $$

We could express this as a linear model of the form:

$$ f_\theta\left(\textbf{carat}, \textbf{depth}, \textbf{table}\right) = \theta_0 + \theta_1 * \textbf{carat} + \theta_2 * \textbf{depth} + \theta_3 * \textbf{table} $$

where $\theta_0$ is the intercept parameter and $\theta_1$, ..., $\theta_3$ are the "slopes" with respect to each of the covariates.
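To make this concrete, here is a quick sketch of a single prediction using hypothetical (made up, not fitted) parameter values and the first diamond's carat, depth, and table values (0.23, 61.5, and 55.0):

theta_0, theta_1, theta_2, theta_3 = 200.0, 7000.0, -100.0, -100.0  # hypothetical values
carat, depth, table = 0.23, 61.5, 55.0  # first diamond in the dataset
# The prediction is the intercept plus a weighted sum of the features.
theta_0 + theta_1 * carat + theta_2 * depth + theta_3 * table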
Focusing on just the carat, depth, and table features and the price our dataset looks like:
data[['carat', 'depth', 'table', 'price']]
In lecture, we saw that the predictions for the entire data set $\hat{\mathbb{Y}}$ can be computed as:
$$ \hat{\mathbb{Y}} = \mathbb{X} \theta $$

The covariate matrix $\mathbb{X} \in \mathbb{R}^{n,d+1}$ has $n$ rows corresponding to each record in the dataset and $d+1$ columns corresponding to the original $d$ columns in the data plus an additional column of 1s. For example, the first five rows of $\mathbb{X}$ for this dataset would look like:
|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 1.0 | 0.23 | 61.5 | 55.0 |
| 1 | 1.0 | 0.21 | 59.8 | 61.0 |
| 2 | 1.0 | 0.23 | 56.9 | 65.0 |
| 3 | 1.0 | 0.29 | 62.4 | 58.0 |
| 4 | 1.0 | 0.31 | 63.3 | 58.0 |
We can extract our covariate matrix and add a column of all 1s:
n,d = data[['carat', 'depth', 'table']].shape
print("n = ", n)
print("d = ", d)
The following function prepends a column of 1s to the matrix. Here I am using the np.hstack
function to add the extra column.
def add_ones_column(data):
    # Prepend a column of 1s (the intercept feature); use data.shape[0]
    # rather than the global n so the function works for any input matrix.
    return np.hstack([np.ones((data.shape[0], 1)), data])
We can then add a 1s column to our dataset.
X = add_ones_column(data[['carat', 'depth', 'table']].to_numpy())
X
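As a quick sanity check (a sketch), we can confirm that X has n rows and d + 1 columns and that the first column is all 1s:

print(X.shape)   # expected: (n, d + 1)
print(X[:5, 0])  # expected: all 1s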
We can define the linear model in python using numpy. Here I will assume the input is in row vector form, so this function will work with a single row or with the entire matrix. Note that in lecture we discussed the case where a single row is stored in column vector form (as is the case in most math textbooks).
The @
symbol is the matrix multiply operation and is equivalent to writing xt.dot(theta)
.
def linear_model(theta, xt):
return xt @ theta # The @ symbol is matrix multiply
For any value of $\theta$ (even a random one) we can make a prediction using our model:
np.random.seed(42)
theta_guess = np.random.randn(d+1, 1)
theta_guess
Note: Above, we used np.random.randn
, which generates draws from a normal (Gaussian) distribution with mean 0 and variance 1 (the parameters d+1
and 1
specify the number of rows and columns of the returned array). The fact that these values are drawn from a normal distribution is irrelevant here; just think of them as coming from a random number generator.
Making a single prediction for the 3rd row:
linear_model(theta_guess, X[2,:])
Making a prediction for all the rows at once:
linear_model(theta_guess, X)
Copying the notation directly from the video lecture:
Y_hat = X @ theta_guess
Y_hat
or using our function produces the same thing:
Y_hat = linear_model(theta_guess, X)
Y_hat
The next video defines and minimizes the least squares loss by using a geometric argument.
YouTubeVideo("zkJ3CULN8Wk")
To elaborate a little on the video above:
Recall from linear algebra that two vectors are orthogonal when their dot product is zero. The standard way to write the dot product is $a \cdot b$, but it can also be written as $a^T b$ or $b^T a$ (all of these forms evaluate to $a_1 b_1 + a_2 b_2 + ... + a_n b_n$).
The notion of orthogonality extends to matrices. If we want a vector $r$ to be orthogonal to the span of a matrix $X$ (i.e., to each column of the matrix), that is the same as requiring $X^T r = 0$, where the right-hand side is a vector of zeros with one entry per column of $X$. In our case, $r = y - X \hat{\theta}$, so we have $X^T (y - X \hat{\theta}) = 0$.
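Expanding this orthogonality condition gives the normal equation that we will write down formally below:

$$ X^T y - X^T X \hat{\theta} = 0 \quad \Longrightarrow \quad X^T X \hat{\theta} = X^T y $$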
The Least Squares Regression loss is the average squared loss:
\begin{align} L(\theta) &= \frac{1}{n}\sum_{i=1}^n \left( \mathbb{Y}_i - \left(\mathbb{X} \theta\right)_i \right)^2 \\ &= \frac{1}{n}\sum_{i=1}^n \left( \mathbb{Y}_i - \mathbb{X}_i \theta \right)^2 \\ &= \frac{1}{n} || \mathbb{Y} - \mathbb{X} \theta ||_2^2 \\ &= \frac{1}{n}\left( \mathbb{Y} - \mathbb{X}\theta \right)^T \left( \mathbb{Y} - \mathbb{X}\theta \right) \end{align}

Note: Above, we used two facts from linear algebra. First, we used the fact that the square of the $L_2$ norm of a vector is equal to the sum of the squares of its components. Put less cryptically, that means that for any vector $r$, $||r||_2^2 = r_1^2 + r_2^2 + ... + r_n^2 = \sum_{i = 1}^n r_i^2$. We used this going from line 2 to 3, where $r = \mathbb{Y} - \mathbb{X} \theta$. The second fact we used was that (again for any vector $r$) $||r||_2^2 = r \cdot r = r^Tr$.
Here $\mathbb{Y}_i$ is the $i$th observed response (diamond price) and $\mathbb{X}_i$ is the row vector corresponding to the $i$th row of the covariates with an added 1 at the beginning.
It is worth noting that since our goal is typically to minimize the loss and we often don't care about its actual value we can drop the $\frac{1}{n}$:
\begin{align} L(\theta) &= \left( \mathbb{Y} - \mathbb{X}\theta \right)^T \left( \mathbb{Y} - \mathbb{X}\theta \right) \end{align}

In order to calculate the loss we need the response vector $\mathbb{Y}$ (the thing we are trying to predict).
Y = data[['price']].to_numpy()
Y
Directly writing the average squared loss:
def squared_loss(theta):
return np.mean((Y - X @ theta)**2)
squared_loss(theta_guess)
Using matrix notation:
def squared_loss(theta):
return ((Y - X @ theta).T @ (Y - X @ theta)).item() / Y.shape[0]
squared_loss(theta_guess)
You may notice that there is some variation in the last digits of the floating point calculation. This is due to the ordering of numerical operations over large sums.
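One way to check that the two implementations agree up to this floating point error is np.allclose (a quick sketch using the variables defined above):

# Compare the matrix form of the loss with the direct average squared loss.
np.allclose(squared_loss(theta_guess), np.mean((Y - X @ theta_guess)**2))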
In the video lecture we defined the normal equation as:
$$ \mathbb{X}^T \mathbb{X} \hat{\theta} = \mathbb{X}^T \mathbb{Y} $$

Solving for $\hat{\theta}$ gives the minimizer of the squared loss with respect to our data.
If we assume $\mathbb{X}^T \mathbb{X}$ is invertible (full rank) then we get the following solution:
$$ \hat{\theta} = \left( \mathbb{X}^T \mathbb{X} \right)^{-1} \mathbb{X}^T \mathbb{Y} $$

The following video provides more intuition about the shape of each of the terms in the solution to the normal equation, as well as better ways to solve it.
YouTubeVideo("9vWR-19WQKU")
While it is not as numerically stable or efficient, we can compute $\hat{\theta}$ directly using matrix inversion. To do this, we import the inv
function from the numpy linear algebra library:
from numpy.linalg import inv
theta_hat = inv(X.T @ X) @ X.T @ Y
theta_hat
A more efficient way to solve the normal equation is to use the solve function to solve the linear system

$$ A \hat{\theta} = b $$

where $A=\mathbb{X}^T \mathbb{X}$ and $b=\mathbb{X}^T \mathbb{Y}$. The solve function is slightly more efficient because it avoids computing the inverse and the subsequent matrix multiply. For matrices that are less well behaved, solve can also be more numerically stable.
from numpy.linalg import solve
theta_hat = solve(X.T @ X, X.T @ Y)
theta_hat
We have just computed the global optimum of the squared loss.
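As an aside, numpy also provides np.linalg.lstsq, which solves the least squares problem directly without explicitly forming $\mathbb{X}^T \mathbb{X}$. A minimal sketch:

from numpy.linalg import lstsq
# lstsq returns the solution along with residuals, rank, and singular values;
# we only keep the solution here, which should match theta_hat.
theta_lstsq, _, _, _ = lstsq(X, Y, rcond=None)
theta_lstsq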
Now that we have estimated the model parameters $\hat{\theta}$, we can also predict the price $\hat{\mathbb{Y}}$ for each of our diamonds:
Y_hat = X @ theta_hat
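Recall from the geometric argument that the residual $\mathbb{Y} - \hat{\mathbb{Y}}$ should be orthogonal to every column of $\mathbb{X}$. As a quick check (a sketch), the entries below should all be zero up to floating point error:

# The residual is orthogonal to the 1s column and to the carat, depth,
# and table columns.
X.T @ (Y - Y_hat)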
In previous lectures we plotted the residuals. We can do that against each of the dimensions, but in general it is difficult to visualize the residuals directly.
fig = px.scatter(x = X[:,1], y = Y - Y_hat,opacity=0.2)
fig.update_xaxes(title="Carat")
fig.update_yaxes(title="Residual")
fig
From the above plot we see that we are overestimating the price of big diamonds and underestimating the price of small diamonds.
fig = px.scatter(x = X[:,2], y = Y - Y_hat,opacity=0.2)
fig.update_xaxes(title="Depth")
fig.update_yaxes(title="Residual")
fig
There is no consistent pattern in the residuals.
fig = px.scatter(x = X[:,3], y = Y - Y_hat,opacity=0.2)
fig.update_xaxes(title="Table")
fig.update_yaxes(title="Residual")
fig
A better way to visualize higher-dimensional regression models is to compare the predicted values against the observed values:
fig = px.scatter(x = Y.flatten(), y = Y_hat.flatten(), opacity=0.2)
fig.add_trace(go.Scatter(x=[0,20000], y=[0, 20000], name="Y_hat=Y"))
fig.update_xaxes(title="Y")
fig.update_yaxes(title="Y_hat")
fig
What do we see? We are underestimating the price of expensive diamonds! Perhaps we should consider the cut, color, and clarity.
Several of you asked for a video walk-through of the notebook. I have created one here:
YouTubeVideo("7M3yLv_DYX4")
In the next notebook lecture we will see how to improve this model by applying feature engineering techniques to incorporate categorical data.