In [3]:
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
import plotly.figure_factory as ff

Fitting Linear Models¶

In this notebook we briefly review the normal equations, introduce the scikit-learn framework, discuss evaluation metrics, and describe methods to visualize model fit.

Toy Data Set¶

To enable easy visualization of the model fitting process, we will use a simple synthetic data set.

In [4]:
data = pd.read_csv("data/synthetic_data.csv")
data.head()
Out[4]:
X0 X1 Y
0 -1.254599 4.507143 0.669332
1 2.319939 0.986585 5.430523
2 -3.439814 -3.440055 7.640933
3 -4.419164 3.661761 -1.836007
4 1.011150 2.080726 6.148833

We can visualize the data in three dimensions:

In [5]:
data_scatter = go.Scatter3d(x=data["X0"], y=data["X1"], z=data["Y"], 
                            mode="markers",
                            marker=dict(size=2))
layout = dict(margin=dict(l=0, r=0, t=0, b=0), 
              height=600,
              scene = dict(xaxis_title='X0', yaxis_title='X1', zaxis_title='Y'))
go.Figure([data_scatter], layout)

Fitting an Ordinary Least Squares Model¶

Given a model of the form:

$$ \hat{\mathbb{Y}} = f_\theta(\mathbb{X}) = \mathbb{X} \theta $$

and taking the average squared loss over our data:

$$ R(\theta) = \frac{1}{n}\sum_{i=1}^n \left(\mathbb{Y}_i - (\mathbb{X}\theta)_i\right)^2 $$

In lecture, we showed that the $\hat{\theta}$ that minimizes this loss:

$$ \hat{\theta} = \arg\min_\theta R(\theta) $$

is given by the solution to the normal equations:

$$ \hat{\theta} = \left( \mathbb{X}^T \mathbb{X} \right)^{-1} \mathbb{X}^T \mathbb{Y} $$
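As a quick reminder of where this comes from (a sketch of the argument rather than the full derivation from lecture), setting the gradient of $R(\theta)$ to zero gives the system that $\hat{\theta}$ must satisfy:

$$ \nabla_\theta R(\theta) = -\frac{2}{n} \mathbb{X}^T \left( \mathbb{Y} - \mathbb{X}\theta \right) = 0 \quad\Longrightarrow\quad \mathbb{X}^T \mathbb{X} \, \hat{\theta} = \mathbb{X}^T \mathbb{Y} $$

When $\mathbb{X}^T \mathbb{X}$ is invertible, solving this system yields the closed-form expression above.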

We can directly implement this expression using standard linear algebra routines:

In [6]:
from numpy.linalg import inv
def ordinary_least_squares(X, Y):
    # Solve the normal equations directly: theta_hat = (X^T X)^{-1} X^T Y
    return inv(X.T @ X) @ X.T @ Y
In [7]:
theta_hat = ordinary_least_squares(data[["X0", "X1"]], data[["Y"]])
theta_hat
Out[7]:
Y
0 0.166823
1 -0.603339

A more efficient way to solve the normal equations is to use the solve function on the linear system:

$$ \mathbb{X}^T \mathbb{X} \theta = \mathbb{X}^T \mathbb{Y} $$

which can be written more compactly as:

$$ A \theta = b $$

where $A=\mathbb{X}^T \mathbb{X}$ and $b=\mathbb{X}^T \mathbb{Y}$:

In [8]:
from numpy.linalg import solve
def ordinary_least_squares(X, Y):
    # Solve the linear system (X^T X) theta = X^T Y without forming an explicit inverse
    return solve(X.T @ X, X.T @ Y)
In [9]:
theta_hat = ordinary_least_squares(data[["X0", "X1"]], data[["Y"]])
theta_hat
Out[9]:
array([[ 0.16682271],
       [-0.60333908]])

Notice that this second implementation produces a numpy array rather than a DataFrame. In the first version the matrix operations propagated pandas types through to the result, while numpy's solve routine works with and returns plain numpy arrays.
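If you prefer labeled output, one optional convenience (an addition, not from the original notebook) is to wrap the fitted coefficients back into a pandas Series indexed by feature name:

# Attach feature names to the fitted coefficients for readability
theta_series = pd.Series(theta_hat.flatten(), index=["X0", "X1"], name="theta_hat")
theta_series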

Making A Prediction¶

We can use our $\hat{\theta}$ to make predictions:

$$ \hat{\mathbb{Y}} = \mathbb{X} \hat{\theta} $$
In [10]:
data["Yhat"] = data[["X0", "X1"]] @ theta_hat
data
Out[10]:
X0 X1 Y Yhat
0 -1.254599 4.507143 0.669332 -2.928631
1 2.319939 0.986585 5.430523 -0.208227
2 -3.439814 -3.440055 7.640933 1.501680
3 -4.419164 3.661761 -1.836007 -2.946501
4 1.011150 2.080726 6.148833 -1.086700
... ... ... ... ...
495 2.994159 1.946965 5.910233 -0.675186
496 -2.278549 0.902307 4.166517 -0.924511
497 -1.390261 -4.084179 8.897440 2.232218
498 4.173136 -3.631814 4.823716 2.887389
499 4.502374 -0.539942 5.124975 1.076866

500 rows × 4 columns

How good are our predictions? We can plot $Y$ versus $\hat{Y}$:

In [11]:
fig = px.scatter(data, x="Yhat", y="Y")
fig.add_trace(go.Scatter(x=[-5,5], y=[-5,5], name="y=yhat"))
fig.update_layout(xaxis_title ="Yhat")

We can also plot the residual distribution.

In [12]:
data['residuals'] = data['Y'] - data['Yhat']
data
Out[12]:
X0 X1 Y Yhat residuals
0 -1.254599 4.507143 0.669332 -2.928631 3.597963
1 2.319939 0.986585 5.430523 -0.208227 5.638749
2 -3.439814 -3.440055 7.640933 1.501680 6.139253
3 -4.419164 3.661761 -1.836007 -2.946501 1.110494
4 1.011150 2.080726 6.148833 -1.086700 7.235534
... ... ... ... ... ...
495 2.994159 1.946965 5.910233 -0.675186 6.585419
496 -2.278549 0.902307 4.166517 -0.924511 5.091027
497 -1.390261 -4.084179 8.897440 2.232218 6.665222
498 4.173136 -3.631814 4.823716 2.887389 1.936327
499 4.502374 -0.539942 5.124975 1.076866 4.048109

500 rows × 5 columns

In [13]:
px.histogram(data, x='residuals', nbins=100)

Visualizing the Model¶

For the synthetic data set we can visualize the model in three dimensions. To do this we will use the following plotting function, which evaluates the model at evenly spaced grid points spanning the range of X0 and X1.

In [14]:
def make_surface(f, X, grid_points=30, name="Linear Model"):
    # Build an evenly spaced grid covering the observed range of each feature
    u = np.linspace(X[:, 0].min(), X[:, 0].max(), grid_points)
    v = np.linspace(X[:, 1].min(), X[:, 1].max(), grid_points)
    xu, xv = np.meshgrid(u, v)
    # Evaluate the model at every grid point
    X_grid = np.vstack((xu.flatten(), xv.flatten())).transpose()
    z = f(X_grid)
    # Reshape the predictions back onto the grid to draw the surface
    return go.Surface(x=xu, y=xv, z=z.reshape(xu.shape), opacity=0.8, name=name, showscale=False)

Plotting the data and the plane

In [15]:
simple_plane = make_surface(
    lambda X: X @ theta_hat, 
    data[["X0", "X1"]].to_numpy(), 
    grid_points=5,
    name="plane")
go.Figure([data_scatter, simple_plane], layout)

Notice that the plane is constrained to pass through the origin. To fix this, we will need to add a constant term to the model. However, to simplify the process, let's switch to the scikit-learn Python library for our modeling.
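Before we switch, here is a minimal sketch (not part of the original notebook) of what adding the constant term by hand would look like: append a column of ones to the features and reuse our normal-equation solver.

# Append a column of ones so the fitted plane no longer has to pass through the origin
X_with_bias = np.hstack([np.ones((len(data), 1)), data[["X0", "X1"]].to_numpy()])
theta_with_bias = ordinary_least_squares(X_with_bias, data[["Y"]].to_numpy())
theta_with_bias  # first entry is the intercept; the remaining two are the slopes for X0 and X1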

Introducing Scikit-learn¶

In this class, we introduce the normal equations as well as several other algorithms to provide some insight into how these techniques work and, perhaps more importantly, how they fail. However, in practice you will seldom need to implement the core algorithms yourself and will instead use various machine learning software packages. Here, we will focus on the widely used scikit-learn package.

Scikit-learn, or as the cool kids call it sklearn (pronounced s-k-learn), is a large package of useful machine learning algorithms. For this lecture, we will use the LinearRegression model in the linear_model module. The fact that the linear_model module contains many different models might suggest that we still have a lot to cover (we do!).

What you should know about sklearn models:

  1. Models are created by first building an instance of the model:
    model = SuperCoolModelType(args)
    
  2. You then fit the model by calling the fit function, passing in the data:
    model.fit(df[['X0', 'X1']], df[['Y']])
    
  3. You can then make predictions by calling predict:
    model.predict(df2[['X0', 'X1']])
    

The neat part about sklearn is that most models behave like this, so if you want to try a cool new model you often only need to change the class of model you are using.
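For instance, here is a small illustrative sketch (not from the original notebook) that swaps in a different model class, Ridge regression from the same linear_model module, without changing the rest of the workflow:

from sklearn.linear_model import Ridge

ridge_model = Ridge(alpha=1.0)                               # 1. build an instance of the model
ridge_model.fit(data[["X0", "X1"]], data["Y"])               # 2. fit it to the features and target
ridge_predictions = ridge_model.predict(data[["X0", "X1"]])  # 3. make predictions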

Using Scikit-learn¶

We import the LinearRegression model

In [16]:
from sklearn.linear_model import LinearRegression

Create an instance of the model. This time we will add an intercept term to the model directly.

In [17]:
model = LinearRegression(fit_intercept=True)

Fit the model by passing it the $X$ and $Y$ data:

In [18]:
model.fit(data[["X0", "X1"]], data[["Y"]])
Out[18]:
LinearRegression()
In [19]:
model.coef_
Out[19]:
array([[ 0.34667351, -0.68472445]])
In [20]:
model.intercept_
Out[20]:
array([4.87661395])
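Putting the fitted coefficients and intercept together, the learned model is approximately:

$$ \hat{Y} \approx 4.877 + 0.347 \, X_0 - 0.685 \, X_1 $$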

Make some predictions and even save them back into the original DataFrame:

In [21]:
data['Yhat_sklearn'] = model.predict(data[["X0", "X1"]])
data
Out[21]:
X0 X1 Y Yhat residuals Yhat_sklearn
0 -1.254599 4.507143 0.669332 -2.928631 3.597963 1.355527
1 2.319939 0.986585 5.430523 -0.208227 5.638749 5.005337
2 -3.439814 -3.440055 7.640933 1.501680 6.139253 6.039611
3 -4.419164 3.661761 -1.836007 -2.946501 1.110494 0.837309
4 1.011150 2.080726 6.148833 -1.086700 7.235534 3.802429
... ... ... ... ... ... ...
495 2.994159 1.946965 5.910233 -0.675186 6.585419 4.581475
496 -2.278549 0.902307 4.166517 -0.924511 5.091027 3.468870
497 -1.390261 -4.084179 8.897440 2.232218 6.665222 7.191185
498 4.173136 -3.631814 4.823716 2.887389 1.936327 8.810121
499 4.502374 -0.539942 5.124975 1.076866 4.048109 6.807179

500 rows × 6 columns

Analyzing the fit again:

In [22]:
fig = px.scatter(data, x="Yhat_sklearn", y="Y")
fig.add_trace(go.Scatter(x=[0,10], y=[0,10], name="y=yhat"))
fig.update_layout(xaxis_title ="yhat")

We can also plot the residual distribution.

In [23]:
data['residuals_sklearn'] = data['Y'] - data['Yhat_sklearn'] 
data
Out[23]:
X0 X1 Y Yhat residuals Yhat_sklearn residuals_sklearn
0 -1.254599 4.507143 0.669332 -2.928631 3.597963 1.355527 -0.686195
1 2.319939 0.986585 5.430523 -0.208227 5.638749 5.005337 0.425186
2 -3.439814 -3.440055 7.640933 1.501680 6.139253 6.039611 1.601322
3 -4.419164 3.661761 -1.836007 -2.946501 1.110494 0.837309 -2.673316
4 1.011150 2.080726 6.148833 -1.086700 7.235534 3.802429 2.346404
... ... ... ... ... ... ... ...
495 2.994159 1.946965 5.910233 -0.675186 6.585419 4.581475 1.328758
496 -2.278549 0.902307 4.166517 -0.924511 5.091027 3.468870 0.697647
497 -1.390261 -4.084179 8.897440 2.232218 6.665222 7.191185 1.706255
498 4.173136 -3.631814 4.823716 2.887389 1.936327 8.810121 -3.986405
499 4.502374 -0.539942 5.124975 1.076866 4.048109 6.807179 -1.682204

500 rows × 7 columns

In [24]:
px.histogram(data['residuals_sklearn'], nbins=100)
In [25]:
plane_with_bias = make_surface(model.predict, 
    data[["X0", "X1"]].to_numpy(), 
    grid_points=5,
    name="Plane with Bias")
go.Figure([data_scatter, simple_plane, plane_with_bias], layout)

Computing Error Metrics¶

As we tune the features in our model, it will be important to define some useful error metrics.
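For reference, with residuals $r_i = \mathbb{Y}_i - \hat{\mathbb{Y}}_i$, the metrics reported below are:

$$ \text{MSE} = \frac{1}{n}\sum_{i=1}^n r_i^2, \qquad \text{MAE} = \frac{1}{n}\sum_{i=1}^n \left|r_i\right|, \qquad \text{RMSE} = \sqrt{\text{MSE}} $$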

In [26]:
from sklearn.metrics import mean_squared_error, mean_absolute_error
In [27]:
print("Mean Squared Error:", 
      mean_squared_error(data["Y"], data["Yhat_sklearn"]))
Mean Squared Error: 3.7756044320694873
In [28]:
print("Mean Absolute Error:", 
      mean_absolute_error(data["Y"], data["Yhat_sklearn"]))
Mean Absolute Error: 1.5446935067491216
In [29]:
print("Root Mean Squared Error:", 
      np.sqrt(mean_squared_error(data["Y"], data["Yhat_sklearn"])))
Root Mean Squared Error: 1.9430914626104163
In [30]:
print("Standard Deviation of Residuals:",  
      np.std(data['residuals_sklearn']))
Standard Deviation of Residuals: 1.9430914626104165
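The last two numbers agree because the fitted model includes an intercept, so the residuals have essentially zero mean and the root mean squared error reduces to the (population) standard deviation of the residuals:

$$ \text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^n r_i^2} = \sqrt{\frac{1}{n}\sum_{i=1}^n \left(r_i - \bar{r}\right)^2} = \text{SD}(r) \qquad \text{when } \bar{r} = 0 $$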

As we play with the model, we will want a standard way to evaluate and visualize the fit:

In [31]:
from plotly.subplots import make_subplots

def evaluate_model(predict, grid_points=30):  
    # Compute Y and Yhat
    Y = data["Y"].to_numpy()
    Yhat = predict(data[["X0", "X1"]].to_numpy()).flatten()
    # Compute and print error metrics
    print("Mean Absolute Error:", mean_absolute_error(Y, Yhat))
    print("Root Mean Squared Error:", np.sqrt(mean_squared_error(Y, Yhat)))

    # Make Side by Side Plots
    fig = make_subplots(rows=1, cols=2, 
                        specs=[[{'type': 'scene'}, {'type': 'xy'}]])
    fig.add_trace(data_scatter, row=1, col=1)
    surf = make_surface(predict, 
                       data[["X0", "X1"]].to_numpy(), 
                       grid_points=grid_points)
    fig.add_trace(surf, row=1, col=1)
    fig.update_layout(layout)
    
    fig.add_trace(go.Scatter(x=Yhat, y=Y, mode="markers"), row=1, col=2)
    ymin = np.min([np.min(Y), np.min(Yhat)])
    ymax = np.max([np.max(Y), np.max(Yhat)])
    fig.add_trace(go.Scatter(x=[ymin,ymax], y=[ymin,ymax], name="y=yhat"))
    fig.update_layout(xaxis_title ="yhat", yaxis_title="Y")
    fig.update_layout(showlegend=False)
    fig.show()

Examining our latest model:

In [32]:
evaluate_model(model.predict, grid_points=5)
Mean Absolute Error: 1.5446935067491216
Root Mean Squared Error: 1.9430914626104163

Our first model without the intercept term:

In [33]:
no_intercept_model = LinearRegression(fit_intercept=False)
no_intercept_model.fit(data[["X0", "X1"]], data[["Y"]])
Out[33]:
LinearRegression(fit_intercept=False)
In [34]:
evaluate_model(no_intercept_model.predict, grid_points=5)
Mean Absolute Error: 4.814916851370774
Root Mean Squared Error: 5.217264403053303

Examining the data above, we see that there is some periodic structure as well as some curvature. Can we fit this data with a linear model?