Last time, we introduced the modeling process. We set up a framework to predict target variables as functions of our features, following a set workflow:
1. Choose a model: decide how predictions will be generated from the features.
2. Choose a loss function: decide how to quantify the error of a single prediction.
3. Fit the model: find the parameter values that minimize the average loss across the dataset.
4. Evaluate model performance: judge how well the fitted model does its job.
To illustrate this process, we derived the optimal model parameters under simple linear regression with mean squared error (MSE) as the risk function. In this lecture, we’ll continue familiarizing ourselves with the modeling process by finding the best model parameters under a new model. We’ll also test out two different loss functions to understand how our choice of loss influences model design. Later on, we’ll consider what happens when a linear model isn’t the best choice to capture trends in our data – and what solutions there are to create better models.
In today’s lecture, our focus will be on the constant model. The constant model is slightly different from the simple linear regression model we’ve explored previously. Rather than generating predictions from an inputted feature variable, the constant model predicts the same constant number every time. We call this constant $\theta$: $\hat{y} = \theta$.
The constant model is also a parametric model, with $\theta$ as its one and only parameter. Our task is to find the value $\hat{\theta}$ that best fits the data.
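To make this concrete, here is a minimal sketch of a constant model’s predictions; the parameter value and the feature array below are made up for illustration, not taken from the lecture.

import numpy as np

theta = 25.0                               # a hypothetical fitted parameter value
x_new = np.array([1.5, 2.0, 3.7])          # example feature values; the constant model ignores them
predictions = np.full(len(x_new), theta)   # every prediction is the same constant: [25., 25., 25.]
predictions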
Consider the case where L2 (squared) loss is used as the loss function and mean squared error is used as the empirical risk function. More concretely, this means the empirical risk function is

$$\hat{R}(\theta) = \frac{1}{n}\sum_{i=1}^{n}(y_i - \theta)^2.$$

Then we will minimize this risk by taking its derivative with respect to $\theta$:

$$\frac{d}{d\theta}\hat{R}(\theta) = \frac{1}{n}\sum_{i=1}^{n} -2\,(y_i - \theta) = -\frac{2}{n}\sum_{i=1}^{n}(y_i - \theta).$$

We can then get the estimating equation and solve it:

$$-\frac{2}{n}\sum_{i=1}^{n}(y_i - \hat{\theta}) = 0 \;\Longrightarrow\; \sum_{i=1}^{n} y_i = n\hat{\theta} \;\Longrightarrow\; \hat{\theta} = \frac{1}{n}\sum_{i=1}^{n} y_i.$$

It turns out that the optimal parameter is simply the mean of the observed values: $\hat{\theta} = \bar{y}$.
Let’s take a moment to interpret this result. Our optimal model parameter, $\hat{\theta}$, is the value of the parameter that minimizes the empirical risk $\hat{R}(\theta)$. This minimum value of the risk can be written as

$$\hat{R}(\hat{\theta}) = \min_{\theta}\,\hat{R}(\theta).$$

To restate the above in plain English: we are looking at the value of the empirical risk when it takes the best parameter as input. This optimal model parameter, $\hat{\theta}$, is the value of $\theta$ that achieves this minimum.

For modeling purposes, we care less about the minimum value of cost, $\hat{R}(\hat{\theta})$, and more about the value of $\theta$ that results in this lowest average loss. That is, we want to find the argument $\theta$ that minimizes the empirical risk function:

$$\hat{\theta} = \underset{\theta}{\operatorname{arg\,min}}\;\hat{R}(\theta).$$
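To see the distinction between the minimum risk and the argmin concretely, here is a brute-force sketch on a made-up dataset; the observations and the parameter grid below are illustrative, not from the lecture.

import numpy as np

y = np.array([3., 5., 7., 8., 20.])          # made-up observations
thetas = np.linspace(0, 25, 2501)            # grid of candidate parameter values (step 0.01)

# Empirical risk (MSE) of the constant model at each candidate theta
risks = np.array([np.mean((y - t)**2) for t in thetas])

best = np.argmin(risks)
print(thetas[best])   # the argmin: matches np.mean(y) = 8.6
print(risks[best])    # the minimum value of the risk itself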
We see now that changing the model used for prediction leads to a quite different result for the optimal model parameter. What happens if we instead change the loss function used in model evaluation?
This time, we will consider the constant model with L1 (absolute) loss as the loss function. This means that the average loss will be expressed as the mean absolute error (MAE):

$$\hat{R}(\theta) = \frac{1}{n}\sum_{i=1}^{n} |y_i - \theta|.$$

To fit the model and find the optimal parameter value $\hat{\theta}$, we take the derivative of the risk with respect to $\theta$:

$$\frac{d}{d\theta}\hat{R}(\theta) = \frac{1}{n}\sum_{i=1}^{n} \frac{d}{d\theta}\,|y_i - \theta|.$$
Here, we seem to have run into a problem: the derivative of an absolute value is undefined when its argument is 0 (i.e. when $y_i = \theta$). For now, we’ll ignore this case; it turns out that doing so doesn’t change our final answer.

To compute the derivative, consider two cases. When $\theta < y_i$, the term $|y_i - \theta|$ equals $y_i - \theta$; when $\theta > y_i$, it equals $-(y_i - \theta) = \theta - y_i$. Taking derivatives:

$$\frac{d}{d\theta}(y_i - \theta) = -1, \qquad \frac{d}{d\theta}(\theta - y_i) = +1.$$

This means that we obtain a different value for the derivative for datapoints where $\theta < y_i$ than for those where $\theta > y_i$. We can summarize this as

$$\frac{d}{d\theta}\hat{R}(\theta) = \frac{1}{n}\left[\sum_{\theta < y_i}(-1) \;+\; \sum_{\theta > y_i}(+1)\right].$$

To finish finding the best value of $\hat{\theta}$, set this derivative equal to 0 and solve. The derivative is 0 when the number of observations greater than $\hat{\theta}$ equals the number of observations less than $\hat{\theta}$; in other words, $\hat{\theta}$ must have the same number of points above and below it. This is exactly the definition of the median, so $\hat{\theta} = \text{median}(y)$.
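As a quick numerical check of this result, here is a sketch on a made-up array (not a cell from the lecture): the MAE evaluated at the median is no larger than at any other candidate value, such as the mean.

import numpy as np

y = np.array([3., 5., 7., 8., 20.])

def mae(theta, y):
    # Mean absolute error of the constant model that always predicts `theta`
    return np.mean(np.abs(y - theta))

print(mae(np.median(y), y))   # risk at the median (7.0): 4.0
print(mae(np.mean(y), y))     # risk at the mean (8.6): 4.56 -- larger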
Now, we’ve tried our hand at fitting a model under both MSE and MAE cost functions. How do the two results compare?
Let’s consider a dataset where each entry represents the number of drinks sold at a boba store each day. We’ll fit a constant model to predict the number of drinks that will be sold tomorrow.
import numpy as np

drinks = np.array([20, 21, 22, 29, 33])
drinks

array([20, 21, 22, 29, 33])
From our derivations above, we know that the optimal model parameter under MSE cost is the mean of the dataset. Under MAE cost, the optimal parameter is the median of the dataset.
np.mean(drinks), np.median(drinks)
(25.0, 22.0)
If we plot each empirical risk function across several possible values of $\theta$, we find that each curve is minimized at the corresponding optimal parameter: the MSE reaches its minimum at the mean of the data, while the MAE reaches its minimum at the median.
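One way to produce such a plot (a sketch, not the original notebook cell; it reuses the drinks array and the NumPy import from above) is to evaluate both risk functions over a grid of candidate parameters:

import matplotlib.pyplot as plt

thetas = np.linspace(15, 40, 500)                           # candidate values of theta
mse = [np.mean((drinks - t)**2) for t in thetas]            # empirical risk under L2 loss
mae = [np.mean(np.abs(drinks - t)) for t in thetas]         # empirical risk under L1 loss

fig, ax = plt.subplots(1, 2, figsize=(8, 3))
ax[0].plot(thetas, mse)
ax[0].axvline(np.mean(drinks), color="tab:red", linestyle="--")    # minimized at the mean
ax[0].set_title("MSE")
ax[1].plot(thetas, mae)
ax[1].axvline(np.median(drinks), color="tab:red", linestyle="--")  # minimized at the median
ax[1].set_title("MAE")
ax[0].set_xlabel(r"$\theta$")
ax[1].set_xlabel(r"$\theta$");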
Notice that the MSE above is a smooth function – it is differentiable at all points, making it easy to minimize using numerical methods. The MAE, in contrast, is not differentiable at each of its “kinks.” We’ll explore how the smoothness of the risk functions can impact our ability to apply numerical optimization in a few weeks.
How do outliers affect each risk function? Imagine we add an outlying observation of 1000 drinks to the dataset. The mean of the data increases substantially, while the median is nearly unaffected.
drinks_with_outlier = np.append(drinks, 1000)
display(drinks_with_outlier)
np.mean(drinks_with_outlier), np.median(drinks_with_outlier)

array([  20,   21,   22,   29,   33, 1000])
(187.5, 25.5)
This means that under the MSE, the optimal model parameter $\hat{\theta}$ is strongly affected by the presence of outliers, while under the MAE it is barely affected at all. In other words, the MSE is sensitive to outliers, whereas the MAE is robust to them.
Let’s try another experiment. This time, we’ll add an additional, non-outlying datapoint to the data.
drinks_with_additional_observation = np.append(drinks, 35)
drinks_with_additional_observation

array([20, 21, 22, 29, 33, 35])
When we again visualize the cost functions, we find that the MAE now plots a horizontal line between 22 and 29. This means that there are infinitely many optimal values for the model parameter: any value $\hat{\theta} \in [22, 29]$ minimizes the MAE. In contrast, the MSE still has a single, unique minimizer; the MAE is not guaranteed to have a unique solution.
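To verify this numerically, here is a quick sketch (not an original cell) that evaluates the MAE at several values inside and at the edges of that interval; the result is identical each time.

# The MAE of the constant model is the same for any theta between 22 and 29
for t in [22, 24, 25.5, 27, 29]:
    print(t, np.mean(np.abs(drinks_with_additional_observation - t)))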
This leaves us with one final question – how “good” are the predictions made by this “best” fitted model?
One way we might want to evaluate our model’s performance is by computing summary statistics. If the mean and standard deviation of our predictions are close to those of the original observed $y_i$ values, we might be tempted to say the model has done well. However, summary statistics alone can be deceiving; the classic illustration is Anscombe’s quartet.

It turns out that the four sets of points in Anscombe’s quartet all have identical means, standard deviations, and correlation coefficients. However, it only makes sense to model the first of these four sets of data using SLR! It is important to visualize your data before starting to model to confirm that your choice of model makes sense for the data.
Another way of evaluating model performance is by using performance metrics. A common choice of metric is the Root Mean Squared Error, or RMSE. The RMSE is simply the square root of MSE. Taking the square root converts the value back into the original, non-squared units of $y_i$, which makes the metric easier to interpret; a lower RMSE indicates more “accurate” predictions (a lower average loss across the dataset):

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}.$$
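As a short sketch (assuming the constant model fitted under MSE, which predicts the mean of the drinks data), the RMSE can be computed directly:

y_hat = np.mean(drinks)                          # the constant model's prediction for every day
rmse = np.sqrt(np.mean((drinks - y_hat)**2))     # square root of the MSE
rmse                                             # about 5.1, in units of drinks sold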
We may also wish to visualize the model’s residuals, defined as the difference between the observed and predicted values, $e_i = y_i - \hat{y}_i$. A good model’s residuals should be scattered roughly symmetrically around zero, with no obvious pattern.
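A minimal residual-plot sketch for the constant model on the drinks data (not an original cell) could look like this:

import matplotlib.pyplot as plt

residuals = drinks - np.mean(drinks)             # observed minus predicted values
plt.figure(figsize=(4, 3))
plt.scatter(np.arange(len(drinks)), residuals)
plt.axhline(0, color="tab:red")                  # a good model's residuals scatter around this line
plt.xlabel("Observation index")
plt.ylabel("Residual");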
At this point, we have an effective method of fitting models to predict linear relationships. Given a feature variable and target, we can apply our four-step process to find the optimal model parameters.
A key word above is linear. When we computed parameter estimates earlier, we assumed that the feature $x_i$ and the target $y_i$ shared a roughly linear relationship.
Data in the real world isn’t always so straightforward. Consider the dataset below, which contains information about the ages and lengths of dugongs.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dugong = pd.read_csv("data/dugongs.txt", delimiter="\t").sort_values("Length")
x, y = dugong["Length"], dugong["Age"]

# `corrcoef` computes the correlation coefficient between two variables
# `std` finds the standard deviation
r = np.corrcoef(x, y)[0, 1]
theta_1 = r*np.std(y)/np.std(x)
theta_0 = np.mean(y) - theta_1*np.mean(x)

fig, ax = plt.subplots(1, 2, dpi=200, figsize=(8, 3))
ax[0].scatter(x, y)
ax[0].set_xlabel("Length")
ax[0].set_ylabel("Age")

ax[1].scatter(x, y)
ax[1].plot(x, theta_0 + theta_1*x, "tab:red")
ax[1].set_xlabel("Length")
ax[1].set_ylabel("Age");
Looking at the plot on the left, we see that there is a slight curvature to the data points. Plotting the SLR curve on the right results in a poor fit.
For SLR to perform well, we’d like there to be a rough linear trend relating "Age" and "Length". What is making the raw data deviate from a linear relationship? Notice that the data points with "Length" greater than 2.6 have disproportionately high values of "Age" relative to the rest of the data. If we could manipulate these data points to have lower "Age" values, we’d “shift” these points downwards and reduce the curvature in the data. Applying a logarithmic transformation to "Age" (that is, taking $\log(\text{Age})$) would do just that.
An important word on logarithms: when we write “log” here, we mean the natural logarithm (base $e$), which is also what np.log computes.
z = np.log(y)

r = np.corrcoef(x, z)[0, 1]
theta_1 = r*np.std(z)/np.std(x)
theta_0 = np.mean(z) - theta_1*np.mean(x)

fig, ax = plt.subplots(1, 2, dpi=200, figsize=(8, 3))
ax[0].scatter(x, z)
ax[0].set_xlabel("Length")
ax[0].set_ylabel(r"$\log{(Age)}$")

ax[1].scatter(x, z)
ax[1].plot(x, theta_0 + theta_1*x, "tab:red")
ax[1].set_xlabel("Length")
ax[1].set_ylabel(r"$\log{(Age)}$")

plt.subplots_adjust(wspace=0.3);
Our SLR fit looks a lot better! We now have a new target variable: the SLR model is now trying to predict the log of "Age", rather than the untransformed "Age". In other words, we are applying the transformation $z_i = \log(y_i)$ and fitting the model $\hat{z}_i = \theta_0 + \theta_1 x_i$.
It turns out that this linearized relationship can help us understand the underlying relationship between $x_i$ and $y_i$. If we undo the log transformation, we find

$$\log(y_i) \approx \theta_0 + \theta_1 x_i \;\Longrightarrow\; y_i \approx e^{\theta_0 + \theta_1 x_i} = e^{\theta_0}\,e^{\theta_1 x_i}.$$

For some constants $C = e^{\theta_0}$ and $k = \theta_1$, this says $y_i \approx C e^{k x_i}$: "Age" is approximately an exponential function of "Length". Plotting this exponential curve on the original, untransformed axes corroborates the fit.
plt.figure(dpi=120, figsize=(4, 3))

plt.scatter(x, y)
plt.plot(x, np.exp(theta_0)*np.exp(theta_1*x), "tab:red")
plt.xlabel("Length")
plt.ylabel("Age");
You may wonder: why did we choose to apply a log transformation specifically? Why not some other function to linearize the data?
Practically, many other mathematical operations that modify the relative scales of "Age" and "Length" could have worked here. The Tukey-Mosteller Bulge Diagram is a useful tool for summarizing what transformations can linearize the relationship between two variables. To determine what transformations might be appropriate, trace the shape of the “bulge” made by your data. Find the quadrant of the diagram that matches this bulge. The transformations shown on the vertical and horizontal axes of this quadrant can help improve the fit between the variables.
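As a rough practical illustration (a sketch, not part of the original notes; it reuses x and y from the dugong cells above), one way to compare candidate transformations is to check how linear each transformed relationship looks, for example via the correlation coefficient:

# Compare how linear the relationship becomes under a few candidate transformations
candidates = {
    "no transform": (x, y),
    "log y": (x, np.log(y)),
    "sqrt y": (x, np.sqrt(y)),
    "x squared": (x**2, y),
}
for name, (xt, yt) in candidates.items():
    print(f"{name:>12}: r = {np.corrcoef(xt, yt)[0, 1]:.3f}")

The transformation whose correlation coefficient is closest to ±1 produces the most linear scatter, which is the same judgment the bulge diagram helps you make visually.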