Lecture 14, Part 1 – Data 100, Summer 2020

by Joseph Gonzalez (Spring 2020)

Least Squares Regression in SKLearn

In this notebook we will introduce how to fit linear models using Scikit Learn. Scikit is a library containing a wide number of classic machine learning algorithms and is to machine learning what pandas is to data in Python.


In this notebook we will be using the following packages:

In [1]:
import numpy as np
import pandas as pd
In [2]:
import plotly.offline as py
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
import cufflinks as cf
cf.set_config_file(offline=True, sharing=False, theme='ggplot');
In [3]:
import torch
from sklearn.linear_model import LinearRegression

The Data

For this notebook, we will focus on a synthetic dataset which is easy to visualize since it has two feature columns.

In [4]:
synth_data = pd.read_csv("data/synth_data.csv.zip")
X1 X2 Y
0 -1.254599 4.507143 1.526396
1 2.319939 0.986585 5.190449
2 -3.439814 -3.440055 4.980978
3 -4.419164 3.661761 1.130775
4 1.011150 2.080726 5.849364

This dataset is simple enough that we can easily visualize it. Take the data for a spin (literally).

In [5]:
fig = go.Figure()
data_scatter = go.Scatter3d(x=synth_data["X1"], y=synth_data["X2"], z=synth_data["Y"], 
fig.update_layout(margin=dict(l=0, r=0, t=0, b=0), 


Ordinary Least Squares Regression without SKLearn

We can start using the normal equations:

$$ \hat{\theta} = \left( \mathbb{X}^T \mathbb{X} \right)^{-1} \mathbb{X}^T \mathbb{Y} $$

While it is not as numerically stable or efficient. We can compute $\hat{\theta}$ by direction using matrix inversion:

In [6]:
from numpy.linalg import inv, solve
In [7]:
def least_squares_by_inv(X, Y):
    return inv(X.T @ X) @ X.T @ Y

A more efficient way to solve the normal equations is using the solve function to solve the linear systems:

$$ A \theta = b $$

where $A=\mathbb{X}^T \mathbb{X}$ and $b=\mathbb{X}^T \mathbb{Y}$:

In [8]:
def least_squares_by_solve(X, Y):
    return solve(X.T @ X, X.T @ Y)

Testing on Synthetic Data

Let's quickly test the above code on our synthetic dataset:

In [9]:
X = synth_data[["X1", "X2"]].to_numpy()
Y = synth_data[["Y"]].to_numpy()
In [10]:
theta_hat = least_squares_by_inv(X,Y)
array([[ 0.24723532],
In [11]:
theta_hat = least_squares_by_solve(X,Y)
array([[ 0.24723532],

They both agree!

Visualizing the Linear Model

For the synthetic dataset we can visualize the model in three dimensions. To do this we will use the following plotting function that at evenly space grid points in for X0 and X1.

In [12]:
def plot_plane(f, X, grid_points = 30):
    u = np.linspace(X[:,0].min(),X[:,0].max(), grid_points)
    v = np.linspace(X[:,1].min(),X[:,1].max(), grid_points)
    xu, xv = np.meshgrid(u,v)
    X = np.vstack((xu.flatten(),xv.flatten())).transpose()
    z = f(X)
    return go.Surface(x=xu, y=xv, z=z.reshape(xu.shape),opacity=0.8)

Plotting the data and the plane:

In [14]:
fig = go.Figure()
fig.add_trace(plot_plane(lambda x: x.dot(least_squares_by_inv(X,Y)), X))
fig.update_layout(margin=dict(l=0, r=0, t=0, b=0), 

Something is Wrong!

The above plane doesn't seem that close to the data. What did we do wrong? Let's add a cyan colored sphere at the origin (0,0,0) in the above plot.

In [15]:
fig = go.Figure()
fig.add_trace(plot_plane(lambda x: x.dot(least_squares_by_inv(X,Y)), X))
fig.add_trace(go.Scatter3d(x=[0], y=[0], z=[0], marker=dict(color="cyan")))
fig.update_layout(margin=dict(l=0, r=0, t=0, b=0), 

The plane is passing through the origin. Actually, any parameterization of the current model must go through the origin. Why is that? Let's look at the current equation for our model:

\begin{align} f_\theta(x_1, x_2) = \theta_1 x_1 + \theta_2 x_2 \end{align}

What is the value of this function when $x_1=0$ and $x_2=0$? Does it depend $\theta$? What are we missing?

Answer We forgot to add a constant term (a 1s column) to our $\mathbb{X}$ covariate matrix.

Adding a 1s Column

We would like the model to be:

\begin{align} f_\theta(x_1, x_2) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 \end{align}

We can do this by adding an $x_0$ (an extra column) which always takes the value 1.

\begin{align} f_\theta(x_0, x_1, x_2) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 \end{align}

By doing this we can keep the general form of the linear equation (no need for special cases):

\begin{align} f_\theta(x) = \sum_{i=1}^p \theta_i x_i \end{align}

The following function adds a 1s column to our matrix. We are horizontally stacking a 1 column vector with our covariate matrix to make a new covariate matrix:

In [16]:
def add_ones_column(X):
    return np.hstack([np.ones((X.shape[0],1)), X])


In [17]:
array([[-1.25459881,  4.50714306],
       [ 2.31993942,  0.98658484],
       [-3.4398136 , -3.4400548 ],
       [ 2.51375086,  1.56955156],
       [ 4.56614621, -4.31041984],
       [-4.42945279, -2.17812925]])


In [18]:
array([[ 1.        , -1.25459881,  4.50714306],
       [ 1.        ,  2.31993942,  0.98658484],
       [ 1.        , -3.4398136 , -3.4400548 ],
       [ 1.        ,  2.51375086,  1.56955156],
       [ 1.        ,  4.56614621, -4.31041984],
       [ 1.        , -4.42945279, -2.17812925]])

Resolving for the optimal $\hat{\theta}$

In [19]:
least_squares_by_solve(add_ones_column(X), Y)
array([[ 4.79938336],
       [ 0.30690004],

To plot the model we need to be a bit more careful. My plotting code assumes that the $X$ matrix is $n$ by $2$ dimensions. So we will make a special model function that transforms $X$ by adding an extra dimension and then renders the prediction.

In [20]:
theta_hat = least_squares_by_solve(add_ones_column(X),Y)
def model_append_ones(X):
    return add_ones_column(X) @ theta_hat

Taking our new model for a spin!

In [21]:
fig = go.Figure()
fig.add_trace(plot_plane(model_append_ones, X))
fig.update_layout(margin=dict(l=0, r=0, t=0, b=0), 

Success! That seems to fit the data nicely! That wasn't too hard but let's try using sklearn instead.

Using SKLearn

Scikit Learn, or as the cool kids call it sklearn (pronounced s-k-learn), is an large package of useful machine learning algorithms. For this lecture, we will use the LinearRegression model in the linear_model module. The fact that there is an entire module with many different models within the linear_model module might suggest that we have a lot to cover still (we do!).

What you should know about sklearn models:

  1. Models are created by first building an instance of the model:
    model = SuperCoolModelType(args)
  2. You then fit the model by calling the fit function passing in data:
    model.fit(df[['X1' 'X1']], df[['Y']])
  3. You then can make predictions by calling predict:
    model.predict(df2[['X1' 'X1']])

The neat part about sklearn is most models behave like this. So if you want to try a cool new model you just change the class of model you are using.

The LinearRegression Model

We import the LinearRegression model.

In [23]:
from sklearn.linear_model import LinearRegression

Create an instance of the model:

In [24]:
model = LinearRegression(fit_intercept=True)

Fit the model by passing it the $X$ and $Y$ data:

In [25]:
model.fit(synth_data[["X1", "X2"]], synth_data[["Y"]])
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Make some predictions and even save them back to the original DataFrame:

In [27]:
synth_data['Y_hat'] = model.predict(synth_data[["X1", "X2"]])
X1 X2 Y Y_hat
0 -1.254599 4.507143 1.526396 1.894016
1 2.319939 0.986585 5.190449 4.959689
2 -3.439814 -3.440055 4.980978 5.667334
3 -4.419164 3.661761 1.130775 1.395537
4 1.011150 2.080726 5.849364 3.946193
... ... ... ... ...
995 2.655129 -3.410918 6.771133 7.521580
996 1.102251 -3.646459 8.293461 7.176713
997 2.513751 1.569552 5.577740 4.693183
998 4.566146 -4.310420 8.982427 8.611059
999 -4.429453 -2.178129 3.283895 4.657963

1000 rows × 4 columns

Visualize the model as before. Notice I just pass the model.predict function to our visualization code which will invoke model.predict on a regular grid of points to make the surface plot:

In [28]:
fig = go.Figure()
fig.add_trace(plot_plane(model.predict, X))
fig.update_layout(margin=dict(l=0, r=0, t=0, b=0), 

Setting Model "Hyper-Parameters"

Notice that when we created an instance of the model we set fit_intercept=False. This is a hyper-parameter of the model. Hyperparameters are parameters that we do not optimize but instead choose as part of the modeling process. In this case the fit_intercept hyper-parameter adds the additional $\theta_0$ term to our model without adding a 1s column to the data. Internally, it probably adds a 1s column to the data.

What happens if we set the fit_intercept=False?

In [29]:
model_no_intercept = LinearRegression(fit_intercept=False)
model_no_intercept.fit(synth_data[["X1", "X2"]], synth_data[["Y"]])
LinearRegression(copy_X=True, fit_intercept=False, n_jobs=None, normalize=False)
In [30]:
fig = go.Figure()
fig.add_trace(plot_plane(model_no_intercept.predict, X))
fig.update_layout(margin=dict(l=0, r=0, t=0, b=0), 

Trying an Advanced Model

To demonstrate the simplicity and power of sklearn lets try using a Kernel regression model instead. Don't worry about how this works.

In [31]:
from sklearn.kernel_ridge import KernelRidge

Creating a KernelRidge model and setting the kernel function type to be RBF which stands for radial basis functions (and nothing else):

In [33]:
super_model = KernelRidge(kernel="rbf")

Fitting the RBF KernelRidge regression model to our data:

In [34]:
super_model.fit(synth_data[["X1", "X2"]], synth_data[["Y"]])
KernelRidge(alpha=1, coef0=1, degree=3, gamma=None, kernel='rbf',

Plotting the model:

In [35]:
fig = go.Figure()
fig.add_trace(plot_plane(super_model.predict, X))
fig.update_layout(margin=dict(l=0, r=0, t=0, b=0),