by Joseph Gonzalez (Spring 2020)
In this notebook we will introduce how to fit linear models using Scikit Learn. Scikit is a library containing a wide number of classic machine learning algorithms and is to machine learning what pandas is to data in Python.
In this notebook we will be using the following packages:
import numpy as np
import pandas as pd
import plotly.offline as py
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
import cufflinks as cf
cf.set_config_file(offline=True, sharing=False, theme='ggplot');
import torch
from sklearn.linear_model import LinearRegression
For this notebook, we will focus on a synthetic dataset which is easy to visualize since it has two feature columns.
synth_data = pd.read_csv("data/synth_data.csv.zip")
synth_data.head()
This dataset is simple enough that we can easily visualize it. Take the data for a spin (literally).
fig = go.Figure()
data_scatter = go.Scatter3d(x=synth_data["X1"], y=synth_data["X2"], z=synth_data["Y"],
mode="markers",
marker=dict(size=2))
fig.add_trace(data_scatter)
fig.update_layout(margin=dict(l=0, r=0, t=0, b=0),
height=600)
fig
We can start using the normal equations:
$$ \hat{\theta} = \left( \mathbb{X}^T \mathbb{X} \right)^{-1} \mathbb{X}^T \mathbb{Y} $$While it is not as numerically stable or efficient. We can compute $\hat{\theta}$ by direction using matrix inversion:
from numpy.linalg import inv, solve
def least_squares_by_inv(X, Y):
return inv(X.T @ X) @ X.T @ Y
A more efficient way to solve the normal equations is using the solve
function to solve the linear systems:
where $A=\mathbb{X}^T \mathbb{X}$ and $b=\mathbb{X}^T \mathbb{Y}$:
def least_squares_by_solve(X, Y):
return solve(X.T @ X, X.T @ Y)
Let's quickly test the above code on our synthetic dataset:
X = synth_data[["X1", "X2"]].to_numpy()
Y = synth_data[["Y"]].to_numpy()
theta_hat = least_squares_by_inv(X,Y)
theta_hat
theta_hat = least_squares_by_solve(X,Y)
theta_hat
They both agree!
For the synthetic dataset we can visualize the model in three dimensions. To do this we will use the following plotting function that at evenly space grid points in for X0
and X1
.
def plot_plane(f, X, grid_points = 30):
u = np.linspace(X[:,0].min(),X[:,0].max(), grid_points)
v = np.linspace(X[:,1].min(),X[:,1].max(), grid_points)
xu, xv = np.meshgrid(u,v)
X = np.vstack((xu.flatten(),xv.flatten())).transpose()
z = f(X)
return go.Surface(x=xu, y=xv, z=z.reshape(xu.shape),opacity=0.8)
Plotting the data and the plane:
fig = go.Figure()
fig.add_trace(data_scatter)
fig.add_trace(plot_plane(lambda x: x.dot(least_squares_by_inv(X,Y)), X))
fig.update_layout(margin=dict(l=0, r=0, t=0, b=0),
height=600)
The above plane doesn't seem that close to the data. What did we do wrong? Let's add a cyan colored sphere at the origin (0,0,0) in the above plot.
fig = go.Figure()
fig.add_trace(data_scatter)
fig.add_trace(plot_plane(lambda x: x.dot(least_squares_by_inv(X,Y)), X))
fig.add_trace(go.Scatter3d(x=[0], y=[0], z=[0], marker=dict(color="cyan")))
fig.update_layout(margin=dict(l=0, r=0, t=0, b=0),
height=600)
The plane is passing through the origin. Actually, any parameterization of the current model must go through the origin. Why is that? Let's look at the current equation for our model:
\begin{align} f_\theta(x_1, x_2) = \theta_1 x_1 + \theta_2 x_2 \end{align}What is the value of this function when $x_1=0$ and $x_2=0$? Does it depend $\theta$? What are we missing?
We would like the model to be:
\begin{align} f_\theta(x_1, x_2) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 \end{align}We can do this by adding an $x_0$ (an extra column) which always takes the value 1.
\begin{align} f_\theta(x_0, x_1, x_2) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 \end{align}By doing this we can keep the general form of the linear equation (no need for special cases):
\begin{align} f_\theta(x) = \sum_{i=1}^p \theta_i x_i \end{align}The following function adds a 1s column to our matrix. We are horizontally stacking a 1 column vector with our covariate matrix to make a new covariate matrix:
def add_ones_column(X):
return np.hstack([np.ones((X.shape[0],1)), X])
Before:
X
After:
add_ones_column(X)
Resolving for the optimal $\hat{\theta}$
least_squares_by_solve(add_ones_column(X), Y)
To plot the model we need to be a bit more careful. My plotting code assumes that the $X$ matrix is $n$ by $2$ dimensions. So we will make a special model function that transforms $X$ by adding an extra dimension and then renders the prediction.
theta_hat = least_squares_by_solve(add_ones_column(X),Y)
def model_append_ones(X):
return add_ones_column(X) @ theta_hat
Taking our new model for a spin!
fig = go.Figure()
fig.add_trace(data_scatter)
fig.add_trace(plot_plane(model_append_ones, X))
fig.update_layout(margin=dict(l=0, r=0, t=0, b=0),
height=600)
Success! That seems to fit the data nicely! That wasn't too hard but let's try using sklearn
instead.
Scikit Learn, or as the cool kids call it sklearn (pronounced s-k-learn), is an large package of useful machine learning algorithms. For this lecture, we will use the LinearRegression
model in the linear_model
module. The fact that there is an entire module with many different models within the linear_model
module might suggest that we have a lot to cover still (we do!).
What you should know about sklearn
models:
model = SuperCoolModelType(args)
model.fit(df[['X1' 'X1']], df[['Y']])
model.predict(df2[['X1' 'X1']])
The neat part about sklearn is most models behave like this. So if you want to try a cool new model you just change the class of model you are using.
We import the LinearRegression
model.
from sklearn.linear_model import LinearRegression
Create an instance of the model:
model = LinearRegression(fit_intercept=True)
Fit the model by passing it the $X$ and $Y$ data:
model.fit(synth_data[["X1", "X2"]], synth_data[["Y"]])
Make some predictions and even save them back to the original DataFrame:
synth_data['Y_hat'] = model.predict(synth_data[["X1", "X2"]])
synth_data
Visualize the model as before. Notice I just pass the model.predict
function to our visualization code which will invoke model.predict
on a regular grid of points to make the surface plot:
fig = go.Figure()
fig.add_trace(data_scatter)
fig.add_trace(plot_plane(model.predict, X))
fig.update_layout(margin=dict(l=0, r=0, t=0, b=0),
height=600)
Notice that when we created an instance of the model we set fit_intercept=False
. This is a hyper-parameter of the model. Hyperparameters are parameters that we do not optimize but instead choose as part of the modeling process. In this case the fit_intercept
hyper-parameter adds the additional $\theta_0$ term to our model without adding a 1s column to the data. Internally, it probably adds a 1s column to the data.
What happens if we set the fit_intercept=False
?
model_no_intercept = LinearRegression(fit_intercept=False)
model_no_intercept.fit(synth_data[["X1", "X2"]], synth_data[["Y"]])
fig = go.Figure()
fig.add_trace(data_scatter)
fig.add_trace(plot_plane(model_no_intercept.predict, X))
fig.update_layout(margin=dict(l=0, r=0, t=0, b=0),
height=600)
To demonstrate the simplicity and power of sklearn
lets try using a Kernel regression model instead. Don't worry about how this works.
from sklearn.kernel_ridge import KernelRidge
Creating a KernelRidge
model and setting the kernel function type to be RBF which stands for radial basis functions (and nothing else):
super_model = KernelRidge(kernel="rbf")
Fitting the RBF KernelRidge regression model to our data:
super_model.fit(synth_data[["X1", "X2"]], synth_data[["Y"]])
Plotting the model:
fig = go.Figure()
fig.add_trace(data_scatter)
fig.add_trace(plot_plane(super_model.predict, X))
fig.update_layout(margin=dict(l=0, r=0, t=0, b=0),
height=600)
Notice that we basically did exactly the same thing as before but with a different model class.