In this lecture we will introduce linear models and least squares linear regression. We will explain the basic parametric model formulation and how to optimize it using linear algebra and geometry rather than gradient descent.
Updated: This notebook has been slightly updated with additional explanations beyond the original posting. I have also added a video walk-through of the notebook, which is embedded at the very end.
In this notebook we will be using the following packages:
import numpy as np
import pandas as pd
import plotly.offline as py
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
import cufflinks as cf
cf.set_config_file(offline=True, sharing=False, theme='ggplot');
# Support embedding YouTube Videos in Notebooks
from IPython.display import YouTubeVideo
This is the first of a series of lectures that are designed to be completed online rather than in class. This new lecture format is experimental and we ask for your feedback on Piazza.
The playlist for all the videos in this lecture can be found here:
Least Squares Linear Regression
We have designed this video notebook to be a series of short videos followed by simple questions (with answers provided) and demos. There are no grades for working through this notebook; however, we hope that combining the videos with activities will make the material more engaging.
The following short video provides a recap of the modeling recipe (model, loss, optimize). If you are comfortable with this you can safely skip this video.
YouTubeVideo("u0YycwQ814c")
What are the three stages of modeling used in this class?
In the following video we introduce the multiple linear regression setting.
YouTubeVideo("21PN6a3s22I")
To motivate the multiple linear regression setting we will work with the classic diamonds dataset.
data = pd.read_csv("data/diamonds.csv.zip", index_col=0)
data.head()
Each record in the dataset corresponds to a single diamond. The fields include:

* carat: the weight of the diamond
* cut: the quality of the cut (Fair, Good, Very Good, Premium, and Ideal)
* color: the diamond color, ranging from J (worst) to D (best)
* clarity: a measure of the diamond's clarity (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
* depth and table: measurements describing the proportions of the diamond
* price: the price of the diamond

We are interested in predicting the price of a diamond given its characteristics. Therefore, we would like to fit a model:
$$ f_\theta\left(\text{Diamond's Characteristics}\right) \rightarrow \text{Price} $$

Suppose for now we focus only on the diamond's carat, depth, and table values. In this case, we would want a model of the form:

$$ f_\theta\left(\textbf{carat}, \textbf{depth}, \textbf{table}\right) \rightarrow \textbf{price} $$

We could express this as a linear model of the form:

$$ f_\theta\left(\textbf{carat}, \textbf{depth}, \textbf{table}\right) = \theta_0 + \theta_1 * \textbf{carat} + \theta_2 * \textbf{depth} + \theta_3 * \textbf{table} $$

where $\theta_0$ is the intercept parameter and $\theta_1$, ..., $\theta_3$ are the "slopes" with respect to each of the covariates.
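To make this concrete, here is a quick sketch of a single prediction using hypothetical (made up, not fitted) parameter values and the first diamond's carat, depth, and table values (0.23, 61.5, and 55.0):

theta_0, theta_1, theta_2, theta_3 = 200.0, 7000.0, -100.0, -100.0  # hypothetical values
carat, depth, table = 0.23, 61.5, 55.0  # first diamond in the dataset
# The prediction is the intercept plus a weighted sum of the features.
theta_0 + theta_1 * carat + theta_2 * depth + theta_3 * table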
Focusing on just the carat, depth, and table features and the price our dataset looks like:
data[['carat', 'depth', 'table', 'price']]
In lecture, we saw that the predictions for the entire data set $\hat{\mathbb{Y}}$ can be computed as:
$$ \hat{\mathbb{Y}} = \mathbb{X} \theta $$

The covariate matrix $\mathbb{X} \in \mathbb{R}^{n,d+1}$ has $n$ rows corresponding to each record in the dataset and $d+1$ columns corresponding to the original $d$ columns in the data plus an additional column of 1s. For example, the first five rows of $\mathbb{X}$ for this dataset would look like:
|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 1.0 | 0.23 | 61.5 | 55.0 |
| 1 | 1.0 | 0.21 | 59.8 | 61.0 |
| 2 | 1.0 | 0.23 | 56.9 | 65.0 |
| 3 | 1.0 | 0.29 | 62.4 | 58.0 |
| 4 | 1.0 | 0.31 | 63.3 | 58.0 |
We can extract our covariate matrix and add a column of all 1s:
n,d = data[['carat', 'depth', 'table']].shape
print("n = ", n)
print("d = ", d)
The following function prepends a column of 1s to the matrix. Here I am using the np.hstack
function to add the extra column.
def add_ones_column(data):
    # Prepend a column of 1s (the intercept feature); use data.shape[0]
    # rather than the global n so the function works for any input matrix.
    return np.hstack([np.ones((data.shape[0], 1)), data])
We can then add a 1s column to our dataset.
X = add_ones_column(data[['carat', 'depth', 'table']].to_numpy())
X
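As a quick sanity check (a sketch), we can confirm that X has n rows and d + 1 columns and that the first column is all 1s:

print(X.shape)   # expected: (n, d + 1)
print(X[:5, 0])  # expected: all 1s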
We can define the linear model in python using numpy. Here I will assume the input is in row vector form, so this function will work with a single row or with the entire matrix. Note that in lecture we discussed the case where a single row is stored in column vector form (as is the case in most math textbooks).
The @
symbol is the matrix multiply operation and is equivalent to writing xt.dot(theta)
.
def linear_model(theta, xt):
return xt @ theta # The @ symbol is matrix multiply
For any value of $\theta$ (even a random one) we can make a prediction using our model:
np.random.seed(42)
theta_guess = np.random.randn(d+1, 1)
theta_guess
Note: Above, we used np.random.randn
, which generates draws from a normal (Gaussian) distribution with mean 0 and variance 1 (the parameters d+1
and 1
specify the number of rows and columns of the returned array). The fact that these values are drawn from a normal distribution is irrelevant here; just think of them as coming from a random number generator.
Making a single prediction for the 3rd row:
linear_model(theta_guess, X[2,:])
Making a prediction for all the rows at once:
linear_model(theta_guess, X)
Copying the notation directly from the video lecture:
Y_hat = X @ theta_guess
Y_hat
or using our function produces the same thing:
Y_hat = linear_model(theta_guess, X)
Y_hat
The next video defines and minimizes the least squares loss by using a geometric argument.
YouTubeVideo("zkJ3CULN8Wk")
To elaborate a little on the video above:
Recall from linear algebra that two vectors are orthogonal when their dot product is zero. The standard way to write the dot product is $a \cdot b$, but it can also be written as $a^T b$ or $b^T a$ (all of these forms evaluate to $a_1 b_1 + a_2 b_2 + ... + a_n b_n$).
The notion of orthogonality extends to matrices. If we want a vector $r$ to be orthogonal to the span of a matrix $X$ (i.e., to each column of the matrix), that is the same as requiring $X^T r = 0$, where the right-hand side is a vector of zeros with one entry per column of $X$. In our case, $r = y - X \hat{\theta}$, so we have $X^T (y - X \hat{\theta}) = 0$.
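Expanding this orthogonality condition gives the normal equation that we will write down formally below:

$$ X^T y - X^T X \hat{\theta} = 0 \quad \Longrightarrow \quad X^T X \hat{\theta} = X^T y $$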
The Least Squares Regression loss is the average squared loss:
\begin{align} L(\theta) &= \frac{1}{n}\sum_{i=1}^n \left( \mathbb{Y}_i - \left(\mathbb{X} \theta\right)_i \right)^2 \\ &= \frac{1}{n}\sum_{i=1}^n \left( \mathbb{Y}_i - \mathbb{X}_i \theta \right)^2 \\ &= \frac{1}{n} || \mathbb{Y} - \mathbb{X} \theta ||_2^2 \\ &= \frac{1}{n}\left( \mathbb{Y} - \mathbb{X}\theta \right)^T \left( \mathbb{Y} - \mathbb{X}\theta \right) \end{align}

Note: Above, we used two facts from linear algebra. First, we used the fact that the square of the $L_2$ norm of a vector is equal to the sum of the squares of its components. Put less cryptically, that means that for any vector $r$, $||r||_2^2 = r_1^2 + r_2^2 + ... + r_n^2 = \sum_{i = 1}^n r_i^2$. We used this going from line 2 to 3, where $r = \mathbb{Y} - \mathbb{X} \theta$. The second fact we used was that (again for any vector $r$) $||r||_2^2 = r \cdot r = r^Tr$.
Here $\mathbb{Y}_i$ is the $i$th observed response (diamond price) and $\mathbb{X}_i$ is the row vector corresponding to the $i$th row of the covariates with an added 1 at the beginning.
It is worth noting that since our goal is typically to minimize the loss and we often don't care about its actual value we can drop the $\frac{1}{n}$:
\begin{align} L(\theta) &= \left( \mathbb{Y} - \mathbb{X}\theta \right)^T \left( \mathbb{Y} - \mathbb{X}\theta \right) \end{align}

In order to calculate the loss we need the response vector $\mathbb{Y}$ (the thing we are trying to predict).
Y = data[['price']].to_numpy()
Y
Directly writing the average squared loss:
def squared_loss(theta):
return np.mean((Y - X @ theta)**2)
squared_loss(theta_guess)
Using matrix notation:
def squared_loss(theta):
return ((Y - X @ theta).T @ (Y - X @ theta)).item() / Y.shape[0]
squared_loss(theta_guess)
You may notice that there is some variation in the last digits of the floating point calculation. This is due to the ordering of numerical operations over large sums.
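One way to check that the two implementations agree up to this floating point error is np.allclose (a quick sketch using the variables defined above):

# Compare the matrix form of the loss with the direct average squared loss.
np.allclose(squared_loss(theta_guess), np.mean((Y - X @ theta_guess)**2))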
In the video lecture we defined the normal equation as:
$$ \mathbb{X}^T \mathbb{X} \hat{\theta} = \mathbb{X}^T \mathbb{Y} $$

Solving for $\hat{\theta}$ gives the minimizer of the squared loss with respect to our data.
If we assume $\mathbb{X}^T \mathbb{X}$ is invertible (full rank) then we get the following solution:
$$ \hat{\theta} = \left( \mathbb{X}^T \mathbb{X} \right)^{-1} \mathbb{X}^T \mathbb{Y} $$

The following video provides more intuition about the shape of each of the terms in the solution to the normal equation, as well as better ways to solve it.
YouTubeVideo("9vWR-19WQKU")
While it is not as numerically stable or efficient, we can compute $\hat{\theta}$ directly using matrix inversion. To do this, we import the inv
function from the numpy linear algebra library:
from numpy.linalg import inv
theta_hat = inv(X.T @ X) @ X.T @ Y
theta_hat
A more efficient way to solve the normal equation is to use the solve function to solve the linear system

$$ A \hat{\theta} = b $$

where $A=\mathbb{X}^T \mathbb{X}$ and $b=\mathbb{X}^T \mathbb{Y}$. The solve function is slightly more efficient because it avoids computing the inverse and the subsequent matrix multiply. For matrices that are less well behaved, solve can also be more numerically stable.
from numpy.linalg import solve
theta_hat = solve(X.T @ X, X.T @ Y)
theta_hat
We have just computed the global optimum of the squared loss.
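As an aside, numpy also provides np.linalg.lstsq, which solves the least squares problem directly without explicitly forming $\mathbb{X}^T \mathbb{X}$. A minimal sketch:

from numpy.linalg import lstsq
# lstsq returns the solution along with residuals, rank, and singular values;
# we only keep the solution here, which should match theta_hat.
theta_lstsq, _, _, _ = lstsq(X, Y, rcond=None)
theta_lstsq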
Now that we have estimated the model parameters $\hat{\theta}$, we can also predict the price $\hat{\mathbb{Y}}$ for each of our diamonds:
Y_hat = X @ theta_hat
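Recall from the geometric argument that the residual $\mathbb{Y} - \hat{\mathbb{Y}}$ should be orthogonal to every column of $\mathbb{X}$. As a quick check (a sketch), the entries below should all be zero up to floating point error:

# The residual is orthogonal to the 1s column and to the carat, depth,
# and table columns.
X.T @ (Y - Y_hat)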
In previous lectures we plotted the residuals. We can do that against each of the dimensions, but in general it is difficult to visualize the residuals directly.
fig = px.scatter(x = X[:,1], y = Y - Y_hat,opacity=0.2)
fig.update_xaxes(title="Carat")
fig.update_yaxes(title="Residual")
fig
From the above plot we see that we are overestimating the price of big diamonds and underestimating the price of small diamonds.
fig = px.scatter(x = X[:,2], y = Y - Y_hat,opacity=0.2)
fig.update_xaxes(title="Depth")
fig.update_yaxes(title="Residual")
fig
There is no consistent pattern in the residuals.
fig = px.scatter(x = X[:,3], y = Y - Y_hat,opacity=0.2)
fig.update_xaxes(title="Table")
fig.update_yaxes(title="Residual")
fig
A better way to visualize higher-dimensional regression models is to compare the predicted values against the observed values:
fig = px.scatter(x = Y.flatten(), y = Y_hat.flatten(), opacity=0.2)
fig.add_trace(go.Scatter(x=[0,20000], y=[0, 20000], name="Y_hat=Y"))
fig.update_xaxes(title="Y")
fig.update_yaxes(title="Y_hat")
fig
What do we see? We are underestimating the price of expensive diamonds! Perhaps we should consider the cut, color, and clarity.
Several of you asked for a video walk-through of the notebook. I have created one here:
YouTubeVideo("7M3yLv_DYX4")
In the next notebook lecture we will see how to improve this model by applying feature engineering techniques to incorporate categorical data.