In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context("talk")
%matplotlib inline

import plotly.offline as py
py.init_notebook_mode(connected=False)
import plotly.graph_objs as go
import plotly.figure_factory as ff
import cufflinks as cf
cf.set_config_file(offline=False, world_readable=True, theme='ggplot')

Non-linear Relationships

For this exercise we are going to use synthetic data to illustrate the basic ideas of model design. Notice that here we load the data from a CSV file rather than generating it within the notebook.

In [2]:
train_data = pd.read_csv("data/toy_training_data.csv")
train_data.head()
Out[2]:
          X          Y
0 -9.889558  12.778085
1 -9.588310   9.888070
2 -9.312230   4.183466
3 -9.095454   0.940616
4 -9.070992  -2.349544
In [3]:
# Visualize the data ---------------------
train_points = go.Scatter(name = "Training Data", 
                          x = train_data['X'], y = train_data['Y'], 
                          mode = 'markers')
# layout = go.Layout(autosize=False, width=800, height=600)
py.iplot(go.Figure(data=[train_points]))

Solving the Normal Equations

There are several ways to solve the normal equations.

Extracting the Data as Matrices

In [4]:
# Note that we select a list of columns to return a matrix (n,p)
X = train_data[['X']].values
print("X shape:", X.shape)

Y = train_data['Y'].values
print("Y shape:", Y.shape)
X shape: (75, 1)
Y shape: (75,)

Adding the ones column

In [5]:
Phi = np.hstack([X, np.ones((len(X), 1))])
Phi[1:5,:]
Out[5]:
array([[-9.58831011,  1.        ],
       [-9.31222958,  1.        ],
       [-9.09545422,  1.        ],
       [-9.07099175,  1.        ]])

Solving for $\hat{\theta{}}$

There are multiple ways to solve for $\hat{\theta{}}$. Following the solution to the normal equations:

$$\large \hat{\theta{}} = \left(\Phi^T \Phi\right)^{-1} \Phi^T Y $$

we get:

In [6]:
theta_hat = np.linalg.inv(Phi.T @ Phi) @ Phi.T @ Y
theta_hat
Out[6]:
array([  1.63923611,  20.54032661])

However, explicitly computing the inverse and then multiplying can be accomplished more efficiently (and more stably) with a dedicated solve routine:

$$\large A^{-1}b = \texttt{solve}(A, b) $$

In [7]:
theta_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ Y)
theta_hat
Out[7]:
array([  1.63923611,  20.54032661])

Using a Package to estimate the linear model

In practice, it is generally better to use specialized software packages for linear regression. In Python, scikit-learn is the standard package for regression.

scikit-learn logo

Here we will take a very brief tour of how to use scikit-learn for regression. Over the next few weeks we will use scikit-learn for a range of different tasks.

You can use the scikit-learn linear_model package to solve the normal equations. This package supports a wide range of generalized linear models. For those who are interested in studying machine learning, I would encourage you to skim through the descriptions of the various models in the linear_model package. These are the foundation of most practical applications of supervised machine learning.

In [8]:
from sklearn import linear_model
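
As a brief, hedged illustration (these models are not used in this lecture), the regularized variants in linear_model expose the same fit/predict interface as LinearRegression; the alpha values below are arbitrary choices:

In [ ]:
# Sketch only: two of the many models in sklearn.linear_model.
# Both are trained with .fit(X, Y) and used with .predict(X), just like LinearRegression.
from sklearn.linear_model import Ridge, Lasso

ridge_reg = Ridge(alpha=1.0)    # L2-regularized least squares
lasso_reg = Lasso(alpha=0.1)    # L1-regularized least squares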

Intercept Term

Scikit-learn can automatically add the intercept term. This can be helpful since you don't need to remember to add it later when making a prediction. In the following we create a model object.

In [9]:
line_reg = linear_model.LinearRegression(fit_intercept=True)

We can then fit the model in one line (this solves the normal equations).

In [10]:
# Fit the model to the data
line_reg.fit(train_data[['X']], train_data['Y'])
np.hstack([line_reg.coef_, line_reg.intercept_]) 
Out[10]:
array([  1.63923611,  20.54032661])

Examining the solution

In the following we plot the solution along with its residuals.

Making Predictions Using Linear Algebra

Making predictions at each of the training data points.

In [11]:
y_hat = Phi @ theta_hat
y_hat[:5]
Out[11]:
array([ 4.32900655,  4.8228224 ,  5.27538359,  5.63072958,  5.67082936])

We can also use the trained model to render predictions.

In [12]:
y_hat = line_reg.predict(train_data[['X']])
y_hat[:5]
Out[12]:
array([ 4.32900655,  4.8228224 ,  5.27538359,  5.63072958,  5.67082936])

Visualizing the fit

To visualize the fit line we will make predictions at 1000 evenly spaced query points.

In [13]:
X_query = np.linspace(train_data['X'].min()-1, train_data['X'].max() +1, 1000)
Phi_query = np.hstack([X_query[:,np.newaxis], np.ones((len(X_query),1))])
y_query_pred = Phi_query @ theta_hat

We can then plot the residuals along with the line.

In [14]:
# Define the least squares regression line 
basic_line = go.Scatter(name = r"Linear Model", x=X_query, y = y_query_pred)

# Define the residual line segments, a separate line for each 
# training point
residual_lines = [
    go.Scatter(x=[x,x], y=[y,yhat],
               mode='lines', showlegend=False, 
               line=dict(color='black', width = 0.5))
    for (x, y, yhat) in zip(train_data['X'], train_data['Y'], y_hat)
]

# Combine the plot elements
py.iplot([train_points, basic_line] + residual_lines)

Examining the Residuals

It is often helpful to examine the residuals. Ideally the residuals should be normally distributed.

In [15]:
residuals = train_data['Y'] - y_hat
sns.distplot(residuals)
# plt.savefig("residuals.pdf")
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a16790f98>

We might also plot $\hat{Y}$ vs $Y$. Ideally, the data should be along the diagonal.

In [16]:
# Plot.ly plotting code
py.iplot(go.Figure(
    data = [go.Scatter(x=train_data['Y'], y=y_hat, mode='markers', name="Points"),
            go.Scatter(x=[-10, 50], y = [-10, 50], line=dict(color='black'), 
                       name="Diagonal", mode='lines')],
    layout = dict(title=r"Y_hat vs Y Plot", xaxis=dict(title="Y"), 
                  yaxis=dict(title=r"$\hat{Y}$"))
))

What should the sum of the residuals be?

In [17]:
np.sum(residuals)
Out[17]:
-1.8118839761882555e-13
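
This (numerically) zero sum is not a coincidence. Because $\Phi$ contains the ones column, the normal equations force the residual vector to be orthogonal to it:

$$ \large \Phi^T \left( Y - \Phi \hat{\theta} \right) = 0 \quad \Longrightarrow \quad \mathbf{1}^T \left( Y - \Phi \hat{\theta} \right) = \sum_{i=1}^n \left( y_i - \hat{y}_i \right) = 0 $$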

Plotting against dimensions in X

When plotted against any one dimension we don't want to see any patterns:

In [18]:
# Plot.ly plotting code
py.iplot(go.Figure(
    data = [go.Scatter(x=train_data['X'], y=residuals, mode='markers')],
    layout = dict(title="Residual Plot", xaxis=dict(title="X"), 
                  yaxis=dict(title="Residual"))
))

Is this data linear?

How would you describe this data?

  1. Is there a linear relationship between $X$ and $Y$?
  2. Are there other patterns?
  3. How noisy is the data?
  4. Do we see obvious outliers?

The relationship between X and Y does appear to have a linear trend, but there also appear to be other, cyclic patterns in the relationship.

However, in this lecture we will show that linear models can still be used to model this data very effectively.





Question: Can we fit this non-linear cyclic structure with a linear model?

Yes! Let's see how.





What does it mean to be a linear model?

Let's return to what it means to be a linear model:

$$\large f_\theta(x) = x^T \theta = \sum_{j=1}^p x_j \theta_j $$

In what sense is the above model linear?

  1. Linear in the features $x$?
  2. Linear in the parameters $\theta$?
  3. Linear in both at the same time?

Yes, Yes, and No. If we look at just the features or just the parameters the model is linear. However, if we look at both at the same time it is not. Why?
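
To see why joint linearity fails, scale both the features and the parameters by 2. A jointly linear map would scale the output by 2, but

$$ \large f_{2\theta}(2x) = (2x)^T (2\theta) = 4 \, x^T \theta = 4 f_\theta(x) $$

The model is bilinear: linear in each argument separately, but not in both jointly.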





Feature Functions

Consider the following alternative model formulation:

$$\large f_\theta\left( x \right) = \phi(x)^T \theta = \sum_{j=1}^{k} \phi_j(x) \theta_j $$

where $\phi_j$ is an arbitrary function from $x\in \mathbb{R}^p$ to $\phi(x)_j \in \mathbb{R}$ and we define $k$ of these functions. We often refer to these functions $\phi_j$ as feature functions or basis functions and their design plays a critical role in both how we capture prior knowledge and our ability to fit complicated data.
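
For example, with $x \in \mathbb{R}^2$ one possible (purely illustrative) choice of $k = 4$ feature functions is

$$ \large \phi(x) = \left[\, x_1, \; x_2, \; x_1 x_2, \; \sin(x_1) \,\right] $$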





The Transformed Covariate Matrix

As a consequence, while the model $f_\theta\left(x \right)$ is no longer linear in $x$, it is still a linear model because it is linear in $\theta$. This means we can continue to use the normal equations to compute the optimal parameters.

Minimizing the squared loss (derivation not shown), we obtain the normal equations:

$$ \large \hat{\theta} = \left( \Phi^T \Phi \right)^{-1} \Phi^T Y $$

It is worth noting that the model is also linear in $\phi$ and that the $\phi_j$ form a new basis (hence the term basis functions) in which the data live. As a consequence, we can think of $\phi$ as mapping the data into a new (often higher-dimensional) space in which the relationship between $y$ and $\phi(x)$ is defined by a hyperplane.
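
As a concrete sketch of this workflow, we can build a transformed matrix and solve the normal equations exactly as before. Here phi_quad is a hypothetical quadratic feature map chosen only for illustration; the lecture constructs a sine-based map below.

In [ ]:
# Sketch: build a transformed feature matrix and solve the normal equations.
# phi_quad maps x to [x, x^2, 1] (a hypothetical choice, not the one used later).
def phi_quad(x):
    return np.hstack([x, x**2, np.ones((len(x), 1))])

Phi_quad = phi_quad(train_data[['X']].values)       # shape (n, 3)
theta_quad = np.linalg.solve(Phi_quad.T @ Phi_quad, Phi_quad.T @ train_data['Y'].values)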





Capturing Domain Knowledge

Feature functions can be used to capture domain knowledge by:

  1. Introducing additional information from other sources
  2. Combining related features
  3. Encoding non-linear patterns

Suppose I had data about customer purchases and I wanted to estimate their income:

\begin{align}
\phi(\text{date}, \text{lat}, \text{lon}, \text{amount})_1 &= \textbf{isWinter}(\text{date}) \\
\phi(\text{date}, \text{lat}, \text{lon}, \text{amount})_2 &= \cos\left( \frac{\textbf{Hour}(\text{date})}{12} \pi \right) \\
\phi(\text{date}, \text{lat}, \text{lon}, \text{amount})_3 &= \frac{\text{amount}}{\textbf{avg\_spend}[\textbf{ZipCode}[\text{lat}, \text{lon}]]} \\
\phi(\text{date}, \text{lat}, \text{lon}, \text{amount})_4 &= \exp\left(-\textbf{Distance}\left((\text{lat},\text{lon}), \textbf{StoreA}\right)\right)^2 \\
\phi(\text{date}, \text{lat}, \text{lon}, \text{amount})_5 &= \exp\left(-\textbf{Distance}\left((\text{lat},\text{lon}), \textbf{StoreB}\right)\right)^2
\end{align}

Notice: In the above feature functions:

  1. The transformations are non-linear
  2. They pull in additional information
  3. They may encode implicit knowledge
  4. The functions $\phi$ do not depend on $\theta$
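
As a rough sketch of how the first two feature functions above might be implemented (the choice of winter months and the use of pd.Timestamp are assumptions for illustration only):

In [ ]:
# Illustrative sketch (assumed conventions) of the first two feature functions above.
def is_winter(date):
    # Assume northern-hemisphere winter: December, January, February.
    return float(pd.Timestamp(date).month in (12, 1, 2))

def hour_of_day_phi(date):
    # cos(Hour(date) / 12 * pi): a smooth, cyclic encoding of the time of day.
    return np.cos(pd.Timestamp(date).hour / 12 * np.pi)

[is_winter("2018-01-15 09:30"), hour_of_day_phi("2018-01-15 09:30")]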










Transforming the Toy Data

In our toy data set we observed a cyclic pattern. Here we construct a $\phi$ to capture the cyclic nature of our data and visualize the corresponding hyperplane.

In the following cell we define a function $\phi$ that maps $x\in \mathbb{R}$ to the vector $[x,\sin(x)] \in \mathbb{R}^2$

$$ \large \phi(x) = [x, \sin(x)] $$

Why not:

$$ \large \phi(x) = [x, \sin(\theta_3 x + \theta_4)] $$

This would no longer be linear in $\theta$. However, in practice we might want to consider a range of $\sin$ basis functions:

$$ \large \phi_{\alpha,\beta}(x) = \sin(\alpha x + \beta) $$

for different values of $\alpha$ and $\beta$. The parameters $\alpha$ and $\beta$ are typically called hyperparameters because (at least in this setting) they are not set automatically through learning.

In [19]:
def sin_phi(x):
    return np.hstack([x, np.sin(x)])
In [20]:
Phi = sin_phi(train_data[["X"]])
Phi[:5]
Out[20]:
array([[-9.88955766,  0.44822588],
       [-9.58831011,  0.16280424],
       [-9.31222958, -0.11231092],
       [-9.09545422, -0.32340318],
       [-9.07099175, -0.34645201]])

Fit a Linear Model on $\Phi$

We can again use the scikit-learn package to fit a linear model on the transformed space.

Fitting the Basic Linear Model

In [21]:
from sklearn import linear_model
basic_reg = linear_model.LinearRegression(fit_intercept=True)
basic_reg.fit(train_data[['X']], train_data['Y'])
Out[21]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Fitting the Transformed Linear Model on $\Phi$

In [22]:
from sklearn import linear_model
sin_reg = linear_model.LinearRegression(fit_intercept=True)
sin_reg.fit(sin_phi(train_data[["X"]]), train_data['Y'])
Out[22]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Making Predictions at Query Locations

In [23]:
X_query = np.linspace(train_data['X'].min()-1, train_data['X'].max() +1, 100)
Y_basic_query = basic_reg.predict(X_query[:, np.newaxis])
Y_sin_query = sin_reg.predict(sin_phi(X_query[:, np.newaxis]))

Visualizing the Fit

In [24]:
# Define the least squares regression line 
basic_line = go.Scatter(name = r"Basic Model", x=X_query, y = Y_basic_query)
sin_line = go.Scatter(name = r"Transformed Model", x=X_query, y = Y_sin_query)

# Define the residual line segments, a separate line for each 
# training point
residual_lines = [
    go.Scatter(x=[x,x], y=[y,yhat],
               mode='lines', showlegend=False, 
               line=dict(color='black', width = 0.5))
    for (x, y, yhat) in zip(train_data['X'], train_data['Y'], 
                            sin_reg.predict(sin_phi(train_data[["X"]])))
]

# Combine the plot elements
py.iplot([train_points, basic_line, sin_line] + residual_lines)

Linear Model in a Transformed Space

As discussed earlier, the model we just constructed, while non-linear in $x$, is actually a linear model in $\phi(x)$, and we can visualize that linear model's structure in higher dimensions.

In [25]:
# Plot the data in higher dimensions
phi3d = go.Scatter3d(name = "Raw Data",
    x = Phi[:,0], y = Phi[:,1], z = train_data['Y'],
    mode = 'markers',
    marker = dict(size=3),
    showlegend=False
)

# Compute the prediction plane
(u,v) = np.meshgrid(np.linspace(-10,10,5), np.linspace(-1,1,5))
coords = np.vstack((u.flatten(),v.flatten())).T
ycoords = sin_reg.predict(coords)
fit_plane = go.Surface(name = "Fitting Hyperplane",
    x = np.reshape(coords[:,0], (5,5)),
    y = np.reshape(coords[:,1], (5,5)),
    z = np.reshape(ycoords, (5,5)),
    opacity = 0.8, cauto = False, showscale = False,
    colorscale = [[0, 'rgb(255,0,0)'], [1, 'rgb(255,0,0)']]
)

# Construct residual lines
Yhat = sin_reg.predict(Phi)
residual_lines = [
    go.Scatter3d(x=[x[0],x[0]], y=[x[1],x[1]], z=[y, yhat],
                 mode='lines', showlegend=False, 
                 line=dict(color='black'))
    for (x, y, yhat) in zip(Phi, train_data['Y'], Yhat)
]

    
# Label the axis and orient the camera
layout = go.Layout(
    scene=go.Scene(
        xaxis=go.XAxis(title='X'),
        yaxis=go.YAxis(title='sin(X)'),
        zaxis=go.ZAxis(title='Y'),
        aspectratio=dict(x=1.,y=1., z=1.), 
        camera=dict(eye=dict(x=-1, y=-1, z=0))
    )
)

py.iplot(go.Figure(data=[phi3d, fit_plane] + residual_lines, layout=layout))

Error Estimates

How well are we fitting the data? We can compute the root mean squared error (RMSE).
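
Written out, the quantity computed by the rmse helper below is

$$ \large \textbf{RMSE}(y, \hat{y}) = \sqrt{\frac{1}{n} \sum_{i=1}^n \left(y_i - \hat{y}_i\right)^2} $$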

In [26]:
def rmse(y, yhat):
    return np.sqrt(np.mean((yhat-y)**2))
In [27]:
basic_rmse = rmse(train_data['Y'], basic_reg.predict(train_data[['X']]))
sin_rmse = rmse(train_data['Y'], sin_reg.predict(sin_phi(train_data[['X']])))
In [28]:
py.iplot(go.Figure(data =[go.Bar(
            x=[r'Basic Regression', 
               r'Sine Transformation'],
            y=[basic_rmse, sin_rmse]
            )], layout = go.Layout(title="Loss Comparison", 
                           yaxis=dict(title="RMSE"))))

Generic Feature Functions

We will now explore a range of generic feature transformations. However, before we proceed it is worth contrasting two categories of feature functions and their applications.

  1. Interpretable Features: In settings where our goal is to understand the model (e.g., identify important features that predict customer churn) we may want to construct meaningful features based on our understanding of the domain.

  2. Generic Features: However, in other settings where our primary goal is to make accurate predictions, we may instead introduce generic feature functions that enable our models to fit and generalize complex relationships.

Gaussian Radial Basis Functions

One of the most widely used types of generic feature functions is the Gaussian radial basis function (RBF). These feature functions take the form:

$$ \phi_{(\lambda, u_1, \ldots, u_k)}(x) = \left[\exp\left( - \frac{\left|\left|x-u_1\right|\right|_2^2}{\lambda^2} \right), \ldots, \exp\left( - \frac{\left|\left| x-u_k \right|\right|_2^2}{\lambda^2} \right) \right] $$

The hyper-parameters $u_1$ through $u_k$ and $\lambda$ are not optimized with $\theta$ but instead are set externally. In many cases the $u_i$ may correspond to points in the training data. The term $\lambda$ defines the spread of the basis function and determines the "smoothness" of the function $f_\theta(\phi(x))$.

The following is a plot of three radial basis functions centered at $u = 2$ with different values of $\lambda$.

In [ ]:
def gaussian_rbf(u, lam=1):
    return lambda x: np.exp(-(x - u)**2 / lam**2)
In [ ]:
tmpX = np.linspace(-2, 6,1000)
py.iplot([
    dict(name=r"$\lambda=0.5$", x=tmpX, 
         y=gaussian_rbf(2, lam=0.5)(tmpX)),
    dict(name=r"$\lambda=1$", x=tmpX, 
         y=gaussian_rbf(2, lam=1.)(tmpX)),
    dict(name=r"$\lambda=2$", x=tmpX, 
         y=gaussian_rbf(2, lam=2.)(tmpX))
])

Develop some helper code

To simplify the following analysis we create two helper functions.

  1. uniform_rbf_phi, which constructs a set of uniformly spaced RBF features; each feature takes a large value when the input $x$ is near the corresponding center.
  2. evaluate_basis, which takes a feature function and a description, fits a linear model on the transformed data, and returns the fitted line, its training RMSE, and the fitted model.
In [ ]:
def uniform_rbf_phi(x, lam=1, num_basis = 10, minvalue=-9, maxvalue=9):
    return np.hstack([gaussian_rbf(u, lam)(x) for u in np.linspace(minvalue, maxvalue, num_basis)])
In [ ]:
tmpXTall = np.linspace(-11, 11,1000)[:,np.newaxis]
py.iplot([
    dict(name=r"$\lambda=0.1$", x=tmpXTall[:,0], 
         y=uniform_rbf_phi(tmpXTall, lam=0.1).mean(axis=1)),
    dict(name=r"$\lambda=0.5$", x=tmpXTall[:,0], 
         y=uniform_rbf_phi(tmpXTall, lam=0.5).mean(axis=1)),
    dict(name=r"$\lambda=1$", x=tmpXTall[:,0], 
         y=uniform_rbf_phi(tmpXTall, lam=1.).mean(axis=1)),
])
In [ ]:
def evaluate_basis(phi, desc):
    # Apply transformation
    Phi = phi(train_data[["X"]])
    
    # Fit a model
    reg_model = linear_model.LinearRegression(fit_intercept=True)
    reg_model.fit(Phi, train_data['Y'])
    
    # Create plot line
    X_test = np.linspace(-11, 11, 1000) # Fine grained test X
    Phi_test = phi(X_test[:,np.newaxis])
    Yhat_test = reg_model.predict(Phi_test)
    line = go.Scatter(name = desc, x=X_test, y=Yhat_test)
    
    # Compute RMSE
    Yhat = reg_model.predict(Phi)
    error = rmse(train_data['Y'], Yhat)
    
    # return results
    return (line, error, reg_model)

Visualizing 10 RBF features

In [ ]:
(rbf_line10, rbf_rmse10, rbf_reg10) = \
    evaluate_basis(lambda x: uniform_rbf_phi(x, lam=1, num_basis=10), r"RBF10")

py.iplot([train_points, rbf_line10, basic_line, sin_line])

Visualizing 50 RBF features (Really Connecting the Dots!)

We are now getting a really good fit to this dataset!!!!

In [ ]:
(rbf_line50, rbf_rmse50, rbf_reg50) = \
    evaluate_basis(lambda x: uniform_rbf_phi(x, lam=0.3, num_basis=50), r"RBF50")

fig = go.Figure(data=[train_points, rbf_line50, rbf_line10, sin_line, basic_line ], 
                   layout = go.Layout(xaxis=dict(range=[-10,10]), 
                                      yaxis=dict(range=[-10,50])))
py.iplot(fig)

Examining the Error

In [ ]:
train_bars = go.Bar(
            x=[r'Basic Regression', 
               r'Sine Transformation',
               r'RBF 10',
               r'RBF 50'],
            y=[basic_rmse, sin_rmse, rbf_rmse10, rbf_rmse50],
            name="Training Erorr")
py.iplot(go.Figure(data = [train_bars], layout = go.Layout(title="Loss Comparison", 
                           yaxis=dict(title="RMSE"))))

Which is the best model?

We started with the objective of minimizing the training loss (error). As we increased the model sophistication by adding features we were able to fit increasingly complex functions to the data and reduce the loss. However, is our ultimate goal to minimize training error?

Ideally we would like to minimize the error we make when making new predictions at unseen values of $X$. One way to evaluate that error is to use a test dataset that is distinct from the dataset used to train the model. Fortunately, we have such a test dataset.

In [ ]:
test_data = pd.read_csv("data/toy_test_data.csv")

test_points = go.Scatter(name = "Test Data", x = test_data['X'], y = test_data['Y'], 
                       mode = 'markers', marker=dict(symbol="cross", color="red"))
py.iplot([train_points, test_points])
In [ ]:
def test_rmse(phi, reg):
    return rmse(test_data['Y'], reg.predict(phi(test_data[['X']])))
In [ ]:
test_bars = go.Bar(
            x=[r'Basic Regression', 
               r'Sine Transformation',
               r'RBF 10',
               r'RBF 50'],
            y=[test_rmse(lambda x: x, basic_reg),
               test_rmse(sin_phi, sin_reg),
               test_rmse(lambda x: uniform_rbf_phi(x, lam=1, num_basis=10), rbf_reg10),
               test_rmse(lambda x: uniform_rbf_phi(x, lam=0.3, num_basis=50), rbf_reg50)
              ],
            name="Test Error"
            )
py.iplot(go.Figure(data =[train_bars, test_bars], layout = go.Layout(title="Loss Comparison", 
                           yaxis=dict(title="RMSE"))))

What's happening: Over-fitting

As we increase the expressiveness of our model we begin to over-fit to the variability in our training data. That is, we are learning patterns that do not generalize beyond our training dataset.

Over-fitting is a key challenge in machine learning and statistical inference. At its core is a fundamental trade-off between bias and variance: the desire to explain the training data and yet be robust to variation in the training data.

We will study the bias-variance trade-off more in the next lecture, but for now we will focus on the trade-off between under-fitting and over-fitting:








Train, Test, and Validation Split

To manage over-fitting it is essential to split your initial training data into a training and testing dataset.

The Train - Test Split

Before running cross validation, split the data into train and test subsets (typically a 90-10 split). How should you do this? You want the test data to reflect the prediction goal:

  1. Random sample of the training data
  2. Future examples
  3. Different stratifications

Ask yourself: where will I be using this model, and how does that relate to my test data?
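
As a minimal sketch (illustration only; the random seed is an arbitrary choice), a random 90-10 split can be done with scikit-learn:

In [ ]:
# Sketch: a random 90-10 train-test split (option 1 above) using scikit-learn.
from sklearn.model_selection import train_test_split

tr, te = train_test_split(train_data, test_size=0.1, random_state=42)
print("train:", tr.shape, " test:", te.shape)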

Do not look at the test data until after selecting your final model. Also, it is very important to not look at the test data until after selecting your final model. Finally, you should not look at the test data until after selecting your final model.

Cross Validation

With the remaining training data:

  1. Split the remaining training dataset into $k$ folds, as in the figure above (5-fold is standard). Split the data in the same way you constructed the test set (typically at random).
  2. For each split, train the model on the training fraction and then compute the error (RMSE) on the validation fraction.
  3. Average the error across the validation fractions to estimate the test error (a minimal code sketch of this procedure follows below).
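
The following sketch applies this procedure to the sine model from earlier, reusing linear_model, sin_phi, and rmse defined above; the fold count, shuffling, and random seed are arbitrary illustrative choices.

In [ ]:
# Sketch: 5-fold cross validation of the sine model on the training data.
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_rmse = []
for tr_ind, va_ind in kf.split(train_data):
    tr, va = train_data.iloc[tr_ind], train_data.iloc[va_ind]
    model = linear_model.LinearRegression(fit_intercept=True)
    model.fit(sin_phi(tr[['X']]), tr['Y'])
    cv_rmse.append(rmse(va['Y'], model.predict(sin_phi(va[['X']]))))
np.mean(cv_rmse)   # average validation RMSE across the folds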

Questions:

  1. What is this accomplishing?
  2. What are the implications of the choice of $k$?







Answers:

  1. This is repeatedly simulating the train-test split we did earlier. We repeat the process because any single split gives a noisy estimate of the test error.
  2. Larger $k$ means we average our validation error over more folds, which makes our estimate of the test error more stable. It also means each validation set is smaller, so we have more data to train on in each fold. However, larger $k$ also means we have to train the model more often, which gets computationally expensive.