Least Squares Regression in SKLearn¶

In this mini video lecture notebook we will introduce how to fit linear models using Scikit Learn. Scikit is a library containing a wide number of classic machine learning algorithms and is to machine learning what pandas is to data in Python.

Imports¶

In this notebook we will be using the following packages:

import numpy as np
import pandas as pd

import plotly.offline as py
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
import cufflinks as cf
cf.set_config_file(offline=True, sharing=False, theme='ggplot');

import torch
from sklearn.linear_model import LinearRegression

# Support embedding YouTube Videos in Notebooks
from IPython.display import YouTubeVideo

Video Walk-Through¶

The following video walk-through presents this notebook. I did not break it into sections but will try to do that in my next notebook.

YouTubeVideo("lFzRiinHSzU")

The Data¶

For this notebook, we will focus on a synthetic dataset which is easy to visualize since it has two feature columns.

synth_data = pd.read_csv("data/synth_data.csv.zip")
synth_data.head()

This dataset is simple enough that we can easily visualize it. Take the data for a spin (literally).

fig = go.Figure()
data_scatter = go.Scatter3d(x=synth_data["X1"], y=synth_data["X2"], z=synth_data["Y"], 
                            mode="markers",
                            marker=dict(size=2))
fig.add_trace(data_scatter)
fig.update_layout(margin=dict(l=0, r=0, t=0, b=0), 
                  height=600)
fig

Ordinary Least Squares Regression without SKLearn¶

We can start using the normal equations:

$$ \hat{\theta} = \left( \mathbb{X}^T \mathbb{X} \right)^{-1} \mathbb{X}^T \mathbb{Y} $$

While it is not as numerically stable or efficient. We can compute $\hat{\theta}$ by direction using matrix inversion:

from numpy.linalg import inv, solve

def least_squares_by_inv(X, Y):
    return inv(X.T @ X) @ X.T @ Y

A more efficient way to solve the normal equations is using the solve function to solve the linear systems:

$$ A \theta = b $$

where $A=\mathbb{X}^T \mathbb{X}$ and $b=\mathbb{X}^T \mathbb{Y}$:

def least_squares_by_solve(X, Y):
    return solve(X.T @ X, X.T @ Y)

Testing on Synthetic Data¶

Let's quickly test the above code on our synthetic dataset:

X = synth_data[["X1", "X2"]].to_numpy()
Y = synth_data[["Y"]].to_numpy()

theta_hat = least_squares_by_inv(X,Y)
theta_hat

array([[ 0.24723532],
       [-0.50987377]])

theta_hat = least_squares_by_solve(X,Y)
theta_hat

array([[ 0.24723532],
       [-0.50987377]])

They both agree!

Let's visualize the Linear Model¶

For the synthetic dataset we can visualize the model in three dimensions. To do this we will use the following plotting function that at evenly space grid points in for X0 and X1.

def plot_plane(f, X, grid_points = 30):
    u = np.linspace(X[:,0].min(),X[:,0].max(), grid_points)
    v = np.linspace(X[:,1].min(),X[:,1].max(), grid_points)
    xu, xv = np.meshgrid(u,v)
    X = np.vstack((xu.flatten(),xv.flatten())).transpose()
    z = f(X)
    return go.Surface(x=xu, y=xv, z=z.reshape(xu.shape),opacity=0.8)

Plotting the data and the plane

fig = go.Figure()
fig.add_trace(data_scatter)
fig.add_trace(plot_plane(lambda x: x.dot(least_squares_by_inv(X,Y)), X))
fig.update_layout(margin=dict(l=0, r=0, t=0, b=0), 
                  height=600)

Something is Wrong!¶

The above plane doesn't seem that close to the data. What did we do wrong? Let's add a cyan colored sphere at the origin (0,0,0) in the above plot.

fig = go.Figure()
fig.add_trace(data_scatter)
fig.add_trace(plot_plane(lambda x: x.dot(least_squares_by_inv(X,Y)), X))
fig.add_trace(go.Scatter3d(x=[0], y=[0], z=[0], marker=dict(color="cyan")))
fig.update_layout(margin=dict(l=0, r=0, t=0, b=0), 
                  height=600)

The plane is passing through the origin. Actually, any parameterization of the current model must go through the origin. Why is that? Let's look at the current equation for our model:

\begin{align} f_\theta(x_1, x_2) = \theta_1 x_1 + \theta_2 x_2 \end{align}

What is the value of this function when $x_1=0$ and $x_2=0$? Does it depend $\theta$? What are we missing?

Answer

We forgot to add a constant term (a 1s column) to our $\mathbb{X}$ covariate matrix.

Adding a 1s Column¶

We would like the model to be:

\begin{align} f_\theta(x_1, x_2) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 \end{align}

We can do this by adding an $x_0$ (an extra column) which always takes the value 1.

\begin{align} f_\theta(x_0, x_1, x_2) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 \end{align}

By doing this we can keep the general form of the linear equation (no need for special cases):

\begin{align} f_\theta(x) = \sum_{i=1}^p \theta_i x_i \end{align}

The following function adds a 1s column to our matrix. We are horizontally stacking a 1 column vector with our covariate matrix to make a new covariate matrix:

def add_ones_column(X):
    return np.hstack([np.ones((X.shape[0],1)), X])

Before

X

array([[-1.25459881,  4.50714306],
       [ 2.31993942,  0.98658484],
       [-3.4398136 , -3.4400548 ],
       ...,
       [ 2.51375086,  1.56955156],
       [ 4.56614621, -4.31041984],
       [-4.42945279, -2.17812925]])

After

add_ones_column(X)

array([[ 1.        , -1.25459881,  4.50714306],
       [ 1.        ,  2.31993942,  0.98658484],
       [ 1.        , -3.4398136 , -3.4400548 ],
       ...,
       [ 1.        ,  2.51375086,  1.56955156],
       [ 1.        ,  4.56614621, -4.31041984],
       [ 1.        , -4.42945279, -2.17812925]])

Resolving for the optimal $\hat{\theta}$

least_squares_by_solve(add_ones_column(X), Y)

array([[ 4.79938336],
       [ 0.30690004],
       [-0.55918582]])

To plot the model we need to be a bit more careful. My plotting code assumes that the $X$ matrix is $n$ by $2$ dimensions. So we will make a special model function that transforms $X$ by adding an extra dimension and then renders the prediction.

theta_hat = least_squares_by_solve(add_ones_column(X),Y)
def model_append_ones(X):
    return add_ones_column(X) @ theta_hat

Taking our new model for a spin!

fig = go.Figure()
fig.add_trace(data_scatter)
fig.add_trace(plot_plane(model_append_ones, X))
fig.update_layout(margin=dict(l=0, r=0, t=0, b=0), 
                  height=600)

Success!!! That seems to fit the data nicely! That wasn't too hard but let's try doing using sklearn instead.

Using SKLearn¶

Scikit Learn, or as the cool kids call it sklearn (pronounced s-k-learn), is an large package of useful machine learning algorithms. For this lecture, we will use the LinearRegression model in the linear_model module. The fact that there is an entire module with many different models within the linear_model module might suggest that we have a lot to cover still (we do!).

What you should know about sklearn models:

Models are created by first building an instance of the model:
```
model = SuperCoolModelType(args)
```
You then fit the model by calling the fit function passing in data:
```
model.fit(df[['X1' 'X1']], df[['Y']])
```
You then can make predictions by calling predict:
```
model.predict(df2[['X1' 'X1']])
```

The neat part about sklearn is most models behave like this. So if you want to try a cool new model you just change the class of mode you are using.

The LinearRegression Model¶

We import the LinearRegression model

from sklearn.linear_model import LinearRegression

Create an instance of the model:

model = LinearRegression(fit_intercept=True)

Fit the model by passing it the $X$ and $Y$ data:

model.fit(synth_data[["X1", "X2"]], synth_data[["Y"]])

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Make some predictions and even save them back to the original DataFrame

synth_data['Y_hat'] = model.predict(synth_data[["X1", "X2"]])
synth_data

Visualize the model as before. Notice I just pass the model.predict function to our visualization code which will invoke model.predict on a regular grid of points to make the surface plot:

fig = go.Figure()
fig.add_trace(data_scatter)
fig.add_trace(plot_plane(model.predict, X))
fig.update_layout(margin=dict(l=0, r=0, t=0, b=0), 
                  height=600)

Setting Model "Hyper-Parameters"¶

Notice that when we created an instance of the model we set fit_intercept=False. This is a hyper-parameter of the model. What is a hyper-parameter these are the parameters that we do not optimize but instead choose as part of the modeling process. In this case the fit_intercept hyper-parameter adds the additional $\theta_0$ term to our model without adding a 1s column to the data. Internally, it probably adds a 1s column to the data.

What happens if we set the fit_intercept=False?

model_no_intercept = LinearRegression(fit_intercept=False)
model_no_intercept.fit(synth_data[["X1", "X2"]], synth_data[["Y"]])

LinearRegression(copy_X=True, fit_intercept=False, n_jobs=None, normalize=False)

fig = go.Figure()
fig.add_trace(data_scatter)
fig.add_trace(plot_plane(model_no_intercept.predict, X))
fig.update_layout(margin=dict(l=0, r=0, t=0, b=0), 
                  height=600)

Trying an Advanced Model¶

In an earlier lecture we alluded to Kernel Regression as a more advanced technique. To demonstrate the simplicity and power of sklearn lets try using a Kernel regression model instead.

from sklearn.kernel_ridge import KernelRidge

Creating a KernelRidge model and setting the kernel function type to be RBF which stands for radial basis functions and nothing else:

super_model = KernelRidge(kernel="rbf")

Fitting the RBF KernelRidge regression model to our data:

super_model.fit(synth_data[["X1", "X2"]], synth_data[["Y"]])

KernelRidge(alpha=1, coef0=1, degree=3, gamma=None, kernel='rbf',
            kernel_params=None)

Plotting the model:

fig = go.Figure()
fig.add_trace(data_scatter)
fig.add_trace(plot_plane(super_model.predict, X))
fig.update_layout(margin=dict(l=0, r=0, t=0, b=0), 
                  height=600)

Notice that we basically did exactly the same thing as before but with a different model class.

	X1	X2	Y
0	-1.254599	4.507143	1.526396
1	2.319939	0.986585	5.190449
2	-3.439814	-3.440055	4.980978
3	-4.419164	3.661761	1.130775
4	1.011150	2.080726	5.849364

	X1	X2	Y	Y_hat
0	-1.254599	4.507143	1.526396	1.894016
1	2.319939	0.986585	5.190449	4.959689
2	-3.439814	-3.440055	4.980978	5.667334
3	-4.419164	3.661761	1.130775	1.395537
4	1.011150	2.080726	5.849364	3.946193
...	...	...	...	...
995	2.655129	-3.410918	6.771133	7.521580
996	1.102251	-3.646459	8.293461	7.176713
997	2.513751	1.569552	5.577740	4.693183
998	4.566146	-4.310420	8.982427	8.611059
999	-4.429453	-2.178129	3.283895	4.657963