
Constant Model, Loss, and Transformations

Last time, we introduced the modeling process. We set up a framework to predict target variables as functions of our features, following a set workflow:

  1. Choose a model - how should we represent the world?

  2. Choose a loss function - how do we quantify prediction error?

  3. Fit the model - how do we choose the best parameter of our model given our data?

  4. Evaluate model performance - how do we evaluate whether this process gave rise to a good model?

To illustrate this process, we derived the optimal model parameters under simple linear regression (SLR) with mean squared error (MSE) as the cost function. A summary of the SLR modeling process is shown below:

First, choose a model. Second, choose a loss function. Third, fit the model. And fourth, evaluate the model performance.

At the end of last lecture, we dived deeper into step 4, evaluating model performance, using SLR as an example. In this lecture, we'll continue familiarizing ourselves with the modeling process by finding the best model parameters under a new model, the constant model, and we'll test out two different loss functions to understand how our choice of loss influences model design. Later on, we'll consider what happens when a linear model isn't the best choice to capture trends in our data and what solutions there are to create better models.

Before we get into the modeling process, let’s quickly review some important terminology.

Constant Model + MSE

Now, we’ll shift from the SLR model to the constant model, also known as a summary statistic. The constant model is slightly different from the simple linear regression model we’ve explored previously. Rather than generating predictions from an inputted feature variable, the constant model always predicts the same constant number. This ignores any relationships between variables. For example, let’s say we want to predict the number of drinks a boba shop sells in a day. Boba tea sales likely depend on the time of year, the weather, how the customers feel, whether school is in session, etc., but the constant model ignores these factors in favor of a simpler model. In other words, the constant model employs a simplifying assumption. It often serves as a baseline to compare other more complicated models against. If a complex model has comparable or worse performance compared to a very simple constant model, it might call into question how useful the other features in the more complex model really are.

It is also a parametric, statistical model:

$$\hat{y} = \theta_0$$

$\theta_0$ is the parameter of the constant model, just as $\theta_0$ and $\theta_1$ were the parameters in SLR. Since our parameter $\theta_0$ is 1-dimensional ($\theta_0 \in \mathbb{R}$), we now have no input to our model and will always predict $\hat{y} = \theta_0$.

Deriving the optimal $\theta_0$

Our task now is to determine what value of $\theta_0$ best represents the optimal model – in other words, what number should we guess each time to have the lowest possible average loss on our data?

Like before, we'll use Mean Squared Error (MSE). Recall that the MSE is average squared loss (L2 loss) over the data $D = \{y_1, y_2, ..., y_n\}$.

$$\hat{R}(\theta) = \frac{1}{n}\sum^{n}_{i=1} (y_i - \hat{y_i})^2$$

Our modeling process now looks like this:

  1. Choose a model: constant model

  2. Choose a loss function: L2 loss

  3. Fit the model

  4. Evaluate model performance

Given the constant model $\hat{y} = \theta_0$, we can rewrite the MSE equation as

$$\hat{R}(\theta) = \frac{1}{n}\sum^{n}_{i=1} (y_i - \theta_0)^2$$

We can fit the model by finding the optimal $\hat{\theta_0}$ that minimizes the MSE using a calculus approach.

  1. Differentiate with respect to $\theta_0$:

$$\begin{align} \frac{d}{d\theta_0}R(\theta) &= \frac{d}{d\theta_0}\left(\frac{1}{n}\sum^{n}_{i=1} (y_i - \theta_0)^2\right) \\ &= \frac{1}{n}\sum^{n}_{i=1} \frac{d}{d\theta_0} (y_i - \theta_0)^2 \quad \quad \text{a derivative of a sum is a sum of derivatives} \\ &= \frac{1}{n}\sum^{n}_{i=1} 2 (y_i - \theta_0) (-1) \quad \quad \text{chain rule} \\ &= \frac{-2}{n}\sum^{n}_{i=1} (y_i - \theta_0) \quad \quad \text{simplify constants} \end{align}$$

  2. Set the derivative equal to 0:

$$0 = \frac{-2}{n}\sum^{n}_{i=1} (y_i - \hat{\theta_0})$$

  3. Solve for $\hat{\theta_0}$:

$$\begin{align} 0 &= \frac{-2}{n}\sum^{n}_{i=1} (y_i - \hat{\theta_0}) \\ &= \sum^{n}_{i=1} (y_i - \hat{\theta_0}) \quad \quad \text{divide both sides by } \frac{-2}{n} \\ &= \left(\sum^{n}_{i=1} y_i\right) - \left(\sum^{n}_{i=1} \hat{\theta_0}\right) \quad \quad \text{separate sums} \\ &= \left(\sum^{n}_{i=1} y_i\right) - (n \cdot \hat{\theta_0}) \quad \quad c + c + \dots + c = nc \\ n \cdot \hat{\theta_0} &= \sum^{n}_{i=1} y_i \\ \hat{\theta_0} &= \frac{1}{n} \sum^{n}_{i=1} y_i \\ \hat{\theta_0} &= \bar{y} \end{align}$$

Let's take a moment to interpret this result. $\hat{\theta_0} = \bar{y}$ is the optimal parameter for the constant model + MSE. It holds true regardless of what data sample you have, and it provides some formal reasoning as to why the mean is such a common summary statistic.

Our optimal model parameter is the value of the parameter that minimizes the cost function. This minimum value of the cost function can be expressed as:

$$R(\hat{\theta_0}) = \min_{\theta_0} R(\theta_0)$$

To restate the above in plain English: we are looking at the value of the cost function when it takes the best parameter as input. This optimal model parameter, $\hat{\theta_0}$, is the value of $\theta_0$ that minimizes the cost $R$.

For modeling purposes, we care less about the minimum value of cost, $R(\hat{\theta_0})$, and more about the value of $\theta_0$ that results in this lowest average loss. In other words, we concern ourselves with finding the best parameter value such that:

$$\hat{\theta_0} = \underset{\theta_0}{\arg\min}\:R(\theta_0)$$

That is, we want to find the argument $\theta_0$ that minimizes the cost function.
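
As a quick sanity check of this derivation, we can evaluate the constant model's MSE over a grid of candidate $\theta_0$ values and confirm that the minimizer lands at the sample mean. This is a small sketch using a made-up array of values, not part of the lecture's code or dataset.

```python
import numpy as np

# A small, hypothetical sample of target values (illustration only)
y = np.array([3.0, 7.0, 8.0, 12.0])

def mse(theta_0, y):
    """Average squared loss of the constant model y_hat = theta_0."""
    return np.mean((y - theta_0) ** 2)

# Evaluate the MSE over a grid of candidate theta_0 values
thetas = np.linspace(y.min() - 5, y.max() + 5, 1001)
losses = np.array([mse(t, y) for t in thetas])

# The grid minimizer should sit at (or immediately next to) the sample mean
print(thetas[np.argmin(losses)], np.mean(y))
```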

Comparing Two Different Models, Both Fit with MSE

Now that we’ve explored the constant model with an L2 loss, we can compare it to the SLR model that we learned last lecture. Consider the dataset below, which contains information about the ages and lengths of dugongs. Suppose we wanted to predict dugong ages:

| | Constant Model | Simple Linear Regression |
|---|---|---|
| model | $\hat{y} = \theta_0$ | $\hat{y} = \theta_0 + \theta_1 x$ |
| data | sample of ages $D = \{y_1, y_2, ..., y_n\}$ | sample of ages $D = \{(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)\}$ |
| dimensions | $\hat{\theta_0}$ is 1-D | $\hat{\theta} = [\hat{\theta_0}, \hat{\theta_1}]$ is 2-D |
| loss surface | 2-D: a curve over $\theta_0$ with a single minimum | 3-D: a surface over $(\theta_0, \theta_1)$ with the minimum marked |
| loss model | $\hat{R}(\theta) = \frac{1}{n}\sum^{n}_{i=1} (y_i - \theta_0)^2$ | $\hat{R}(\theta_0, \theta_1) = \frac{1}{n}\sum^{n}_{i=1} (y_i - (\theta_0 + \theta_1 x_i))^2$ |
| RMSE | 7.72 | 4.31 |
| predictions visualized | rug plot with $\hat{\theta}_0$ marked | scatter plot of the data points with the prediction line |

(Notice how the points for our SLR scatter plot are visually not a great linear fit. We’ll come back to this).

The code for generating the graphs and models is included below, but we won’t go over it in too much depth.

Click to see the code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import itertools
from mpl_toolkits.mplot3d import Axes3D
dugongs = pd.read_csv("data/dugongs.csv")
data_constant = dugongs["Age"]
data_linear = dugongs[["Length", "Age"]]
Click to see the code
# Big font helper
def adjust_fontsize(size=None):
    SMALL_SIZE = 8
    MEDIUM_SIZE = 10
    BIGGER_SIZE = 12
    if size is not None:
        SMALL_SIZE = MEDIUM_SIZE = BIGGER_SIZE = size

    plt.rc("font", size=SMALL_SIZE)  # controls default text sizes
    plt.rc("axes", titlesize=SMALL_SIZE)  # fontsize of the axes title
    plt.rc("axes", labelsize=MEDIUM_SIZE)  # fontsize of the x and y labels
    plt.rc("xtick", labelsize=SMALL_SIZE)  # fontsize of the tick labels
    plt.rc("ytick", labelsize=SMALL_SIZE)  # fontsize of the tick labels
    plt.rc("legend", fontsize=SMALL_SIZE)  # legend fontsize
    plt.rc("figure", titlesize=BIGGER_SIZE)  # fontsize of the figure title

plt.style.use("default")  # Revert style to default mpl
Click to see the code
# Constant Model + MSE
plt.style.use('default') # Revert style to default mpl
adjust_fontsize(size=16)
%matplotlib inline

def mse_constant(theta, data):
    return np.mean(np.array([(y_obs - theta) ** 2 for y_obs in data]), axis=0)

thetas = np.linspace(-20, 42, 1000)
l2_loss_thetas = mse_constant(thetas, data_constant)

# Plotting the loss surface
plt.plot(thetas, l2_loss_thetas)
plt.xlabel(r'$\theta_0$')
plt.ylabel(r'MSE')

# Optimal point
thetahat = np.mean(data_constant)
plt.scatter([thetahat], [mse_constant(thetahat, data_constant)], s=50, label = r"$\hat{\theta}_0$")
plt.legend();
plt.show()
<Figure size 640x480 with 1 Axes>
Click to see the code
# SLR + MSE
def mse_linear(theta_0, theta_1, data_linear):
    data_x, data_y = data_linear.iloc[:, 0], data_linear.iloc[:, 1]
    return np.mean(
        np.array([(y - (theta_0 + theta_1 * x)) ** 2 for x, y in zip(data_x, data_y)]),
        axis=0,
    )


# plotting the loss surface
theta_0_values = np.linspace(-80, 20, 80)
theta_1_values = np.linspace(-10, 30, 80)
mse_values = np.array(
    [[mse_linear(x, y, data_linear) for x in theta_0_values] for y in theta_1_values]
)

# Optimal point
data_x, data_y = data_linear.iloc[:, 0], data_linear.iloc[:, 1]
theta_1_hat = np.corrcoef(data_x, data_y)[0, 1] * np.std(data_y) / np.std(data_x)
theta_0_hat = np.mean(data_y) - theta_1_hat * np.mean(data_x)

# Create the 3D plot
fig = plt.figure(figsize=(7, 5))
ax = fig.add_subplot(111, projection="3d")

X, Y = np.meshgrid(theta_0_values, theta_1_values)
surf = ax.plot_surface(
    X, Y, mse_values, cmap="viridis", alpha=0.6
)  # Use alpha to make it slightly transparent

# Scatter point using matplotlib
sc = ax.scatter(
    [theta_0_hat],
    [theta_1_hat],
    [mse_linear(theta_0_hat, theta_1_hat, data_linear)],
    marker="o",
    color="red",
    s=100,
    label="theta hat",
)

# Create a colorbar
cbar = fig.colorbar(surf, ax=ax, shrink=0.5, aspect=10)
cbar.set_label("Cost Value")

ax.set_title("MSE for different $\\theta_0, \\theta_1$")
ax.set_xlabel("$\\theta_0$")
ax.set_ylabel("$\\theta_1$")
ax.set_zlabel("MSE");

# plt.show()
<Figure size 700x500 with 2 Axes>
Click to see the code
# Predictions
yobs = data_linear["Age"]  # The true observations y
xs = data_linear["Length"]  # Needed for linear predictions
n = len(yobs)  # Predictions

yhats_constant = [thetahat for i in range(n)]  # Not used, but food for thought
yhats_linear = [theta_0_hat + theta_1_hat * x for x in xs]
Click to see the code
# In case we're in a weird style state
sns.set_theme()
adjust_fontsize(size=16)
%matplotlib inline

fig = plt.figure(figsize=(8, 1.5))
sns.rugplot(yobs, height=0.25, lw=2) ;
plt.axvline(thetahat, color='red', lw=4, label=r"$\hat{\theta}_0$");
plt.legend()
plt.yticks([]);
# plt.show()
<Figure size 800x150 with 1 Axes>
Click to see the code
# In case we're in a weird style state
sns.set_theme()
adjust_fontsize(size=16)
%matplotlib inline

sns.scatterplot(x=xs, y=yobs)
plt.plot(xs, yhats_linear, color='red', lw=4);
# plt.savefig('dugong_line.png', bbox_inches = 'tight');
# plt.show()
<Figure size 640x480 with 1 Axes>

Interpreting the RMSE (Root Mean Squared Error):

  • Because the constant model's error is HIGHER than the linear model's error,

  • the constant model is WORSE than the linear model (at least for this metric).

Constant Model + MAE

We see now that changing the model used for prediction leads to a wildly different result for the optimal model parameter. What happens if we instead change the loss function used in model evaluation?

This time, we will consider the constant model with L1 (absolute loss) as the loss function. This means that the average loss will be expressed as the Mean Absolute Error (MAE).

  1. Choose a model: constant model

  2. Choose a loss function: L1 loss

  3. Fit the model

  4. Evaluate model performance

Deriving the optimal $\theta_0$

Recall that the MAE is average absolute loss (L1 loss) over the data $D = \{y_1, y_2, ..., y_n\}$.

$$\hat{R}(\theta_0) = \frac{1}{n}\sum^{n}_{i=1} |y_i - \hat{y_i}|$$

Given the constant model $\hat{y} = \theta_0$, we can write the MAE as:

$$\hat{R}(\theta_0) = \frac{1}{n}\sum^{n}_{i=1} |y_i - \theta_0|$$

To fit the model, we find the optimal parameter value $\hat{\theta_0}$ that minimizes the MAE using a calculus approach:

  1. Differentiate with respect to $\theta_0$:

$$\begin{align} \hat{R}(\theta_0) &= \frac{1}{n}\sum^{n}_{i=1} |y_i - \theta_0| \\ \frac{d}{d\theta_0} R(\theta_0) &= \frac{d}{d\theta_0} \left(\frac{1}{n} \sum^{n}_{i=1} |y_i - \theta_0| \right) \\ &= \frac{1}{n} \sum^{n}_{i=1} \frac{d}{d\theta_0} |y_i - \theta_0| \end{align}$$
  • Here, we seem to have run into a problem: the derivative of an absolute value is undefined when the argument is 0 (i.e. when $y_i = \theta_0$). For now, we'll ignore this issue. It turns out that disregarding this case doesn't influence our final result.

  • To perform the derivative, consider two cases. When $\theta_0$ is less than or equal to $y_i$, the term $y_i - \theta_0$ will be positive and the absolute value has no impact. When $\theta_0$ is greater than $y_i$, the term $y_i - \theta_0$ will be negative. Applying the absolute value will convert this to a positive value, which we can express by saying $-(y_i - \theta_0) = \theta_0 - y_i$.

$$|y_i - \theta_0| = \begin{cases} y_i - \theta_0 \quad \text{ if } \theta_0 \le y_i \\ \theta_0 - y_i \quad \text{ if } \theta_0 > y_i \end{cases}$$

  • Taking derivatives:

$$\frac{d}{d\theta_0} |y_i - \theta_0| = \begin{cases} \frac{d}{d\theta_0} (y_i - \theta_0) = -1 \quad \text{if } \theta_0 < y_i \\ \frac{d}{d\theta_0} (\theta_0 - y_i) = 1 \quad \text{if } \theta_0 > y_i \end{cases}$$

  • This means that we obtain a different value for the derivative for data points where $\theta_0 < y_i$ and where $\theta_0 > y_i$. We can summarize this by saying:

$$\frac{d}{d\theta_0} R(\theta_0) = \frac{1}{n} \sum^{n}_{i=1} \frac{d}{d\theta_0} |y_i - \theta_0| = \frac{1}{n} \left[\sum_{\theta_0 < y_i} (-1) + \sum_{\theta_0 > y_i} (+1) \right]$$

  • In other words, we take the sum of values for $i = 1, 2, ..., n$:

    • $-1$ if our observation $y_i$ is greater than our prediction $\hat{\theta_0}$

    • $+1$ if our observation $y_i$ is smaller than our prediction $\hat{\theta_0}$

  2. Set the derivative equal to 0:

$$0 = \frac{1}{n}\sum_{\hat{\theta_0} < y_i} (-1) + \frac{1}{n}\sum_{\hat{\theta_0} > y_i} (+1)$$

  3. Solve for $\hat{\theta_0}$:

$$0 = -\frac{1}{n}\sum_{\hat{\theta_0} < y_i} (1) + \frac{1}{n}\sum_{\hat{\theta_0} > y_i} (1)$$

$$\sum_{\hat{\theta_0} < y_i} (1) = \sum_{\hat{\theta_0} > y_i} (1)$$

Thus, the constant model parameter $\theta = \hat{\theta_0}$ that minimizes MAE must satisfy:

$$\sum_{\hat{\theta_0} < y_i} (1) = \sum_{\hat{\theta_0} > y_i} (1)$$

In other words, the number of observations greater than $\hat{\theta_0}$ must be equal to the number of observations less than $\hat{\theta_0}$; there must be an equal number of points to the left and right of $\hat{\theta_0}$. This is the definition of the median, so our optimal value is

$$\hat{\theta}_0 = \text{median}(y)$$
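
As with the MSE case, this result is easy to check numerically by scanning the MAE over a grid of candidate $\theta_0$ values; the sample below is made up for illustration and is not part of the lecture code.

```python
import numpy as np

# A small, hypothetical sample with an odd number of points (illustration only)
y = np.array([3.0, 7.0, 8.0, 12.0, 20.0])

def mae(theta_0, y):
    """Average absolute loss of the constant model y_hat = theta_0."""
    return np.mean(np.abs(y - theta_0))

thetas = np.linspace(y.min(), y.max(), 1001)
losses = np.array([mae(t, y) for t in thetas])

# The grid minimizer should land (approximately) at the median, 8.0
print(thetas[np.argmin(losses)], np.median(y))
```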

Summary: Loss Optimization, Calculus, and Critical Points

First, define the objective function as average loss.

  • Plug in L1 or L2 loss.

  • Plug in the model so that the resulting expression is a function of $\theta$.

Then, find the minimum of the objective function:

  1. Differentiate with respect to $\theta$.

  2. Set equal to 0.

  3. Solve for $\hat{\theta}$.

  4. (If we have multiple parameters) repeat steps 1-3 with partial derivatives.

Recall critical points from calculus: $R(\hat{\theta})$ could be a minimum, maximum, or saddle point!

  • We should technically also perform the second derivative test, i.e., show $R''(\hat{\theta}) > 0$.

  • MSE has a property, convexity, that guarantees that $R(\hat{\theta})$ is a global minimum.

  • The proof of convexity for MAE is beyond this course.

Comparing Loss Functions

We’ve now tried our hand at fitting a model under both MSE and MAE cost functions. How do the two results compare?

Let’s consider a dataset where each entry represents the number of drinks sold at a bubble tea store each day. We’ll fit a constant model to predict the number of drinks that will be sold tomorrow.

drinks = np.array([20, 21, 22, 29, 33])
drinks
array([20, 21, 22, 29, 33])
np.mean(drinks), np.median(drinks)
(25.0, 22.0)

If we plot each empirical risk function across several possible values of $\theta$, we find that each $\hat{\theta}$ does indeed correspond to the lowest value of error:

The graph on the left is the constant model loss with MSE error. The lowest point on the curve is annotated as theta = 25.0. The graph on the right is the constant model with MAE error. The lowest point on the curve is annotated as theta = 22.0.

Notice that the MSE above is a smooth function – it is differentiable at all points, making it easy to minimize using numerical methods. The MAE, in contrast, is not differentiable at each of its “kinks.” We’ll explore how the smoothness of the cost function can impact our ability to apply numerical optimization in a few weeks.
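
The plotting code for these two curves isn't included above; a minimal sketch in the same spirit, reusing the `drinks` array and the `numpy`/`matplotlib` imports from the earlier cells, might look like the following.

```python
# Sketch: constant-model MSE and MAE over a grid of theta values for the drinks data
thetas = np.linspace(15, 40, 500)
mse_curve = [np.mean((drinks - t) ** 2) for t in thetas]
mae_curve = [np.mean(np.abs(drinks - t)) for t in thetas]

fig, ax = plt.subplots(1, 2, figsize=(9, 3))
ax[0].plot(thetas, mse_curve)
ax[0].axvline(np.mean(drinks), color="red", ls="--", label=r"$\hat{\theta}_0$ = mean")
ax[0].set_xlabel(r"$\theta_0$")
ax[0].set_ylabel("MSE")
ax[0].legend()

ax[1].plot(thetas, mae_curve)
ax[1].axvline(np.median(drinks), color="red", ls="--", label=r"$\hat{\theta}_0$ = median")
ax[1].set_xlabel(r"$\theta_0$")
ax[1].set_ylabel("MAE")
ax[1].legend()
plt.tight_layout()
```

Substituting the modified arrays introduced below (and widening the `thetas` grid as needed) produces plots like the later comparison figures.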

How do outliers affect each cost function? Imagine we add an extreme outlier to the dataset: a day on which 1033 drinks were sold. The mean of the data increases substantially, while the median is nearly unaffected.

drinks_with_outlier = np.append(drinks, 1033)
display(drinks_with_outlier)
np.mean(drinks_with_outlier), np.median(drinks_with_outlier)
array([ 20, 21, 22, 29, 33, 1033])
(193.0, 25.5)
On the left, the MSE curve is shown with its minimum at theta = 193. On the right, the MAE curve is shown with its minimum at theta = 25.5.

This means that under the MSE, the optimal model parameter $\hat{\theta}$ is strongly affected by the presence of outliers. Under the MAE, the optimal parameter is not as influenced by outlying data. We can generalize this by saying that the MSE is sensitive to outliers, while the MAE is robust to outliers.

Let’s try another experiment. This time, we’ll add an additional, non-outlying datapoint to the data.

drinks_with_additional_observation = np.append(drinks, 35)
drinks_with_additional_observation
array([20, 21, 22, 29, 33, 35])

When we again visualize the cost functions, we find that the MAE curve is now flat (a horizontal line) between 22 and 29. This means that there are infinitely many optimal values for the model parameter: any value $\hat{\theta} \in [22, 29]$ will minimize the MAE. In contrast, the MSE still has a single best value for $\hat{\theta}$. In other words, the MSE has a unique solution for $\hat{\theta}$; the MAE is not guaranteed to have a single unique solution.

On the left, the MSE curve is shown with a unique minimum at theta = 26.7. On the right, the MAE curve is shown with a flat stretch of minimizers between theta = 22 and theta = 29.
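
We can confirm the flat region numerically: every $\theta$ between the two middle observations gives exactly the same MAE, while the MSE keeps changing. A quick check using the `drinks_with_additional_observation` array defined above:

```python
# MAE is constant for any theta between the two middle observations (22 and 29);
# MSE is not.
for theta in [22, 24, 26, 29]:
    mae = np.mean(np.abs(drinks_with_additional_observation - theta))
    mse = np.mean((drinks_with_additional_observation - theta) ** 2)
    print(f"theta = {theta}: MAE = {mae:.2f}, MSE = {mse:.2f}")
```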

To summarize our example,

| | MSE (Mean Squared Loss) | MAE (Mean Absolute Loss) |
|---|---|---|
| Loss Function | $\hat{R}(\theta) = \frac{1}{n}\sum^{n}_{i=1} (y_i - \theta_0)^2$ | $\hat{R}(\theta) = \frac{1}{n}\sum^{n}_{i=1} \vert y_i - \theta_0 \vert$ |
| Optimal $\hat{\theta_0}$ | $\hat{\theta_0} = \text{mean}(y) = \bar{y}$ | $\hat{\theta_0} = \text{median}(y)$ |
| Loss Surface | smooth, bowl-shaped curve with a unique minimum | piecewise-linear curve whose minimum may be a flat region |
| Shape | Smooth: easy to minimize using numerical methods (in a few weeks). | Piecewise: not differentiable at each of the "kinks". Harder to minimize. |
| Outliers | Sensitive to outliers (since they change the mean substantially). Sensitivity also depends on the dataset size. | More robust to outliers. |
| $\hat{\theta_0}$ Uniqueness | Unique $\hat{\theta_0}$ | Infinitely many $\hat{\theta_0}$s possible |

Transformations of Linear Models

At this point, we have an effective method of fitting models to predict linear relationships. Given a feature variable and target, we can apply our four-step process to find the optimal model parameters.

A key word above is linear. When we computed parameter estimates earlier, we assumed that $x_i$ and $y_i$ shared a roughly linear relationship. Data in the real world isn't always so straightforward, but we can transform the data to try and obtain linearity.

The Tukey-Mosteller Bulge Diagram is a useful tool for summarizing what transformations can linearize the relationship between two variables. To determine what transformations might be appropriate, trace the shape of the “bulge” made by your data. Find the quadrant of the diagram that matches this bulge. The transformations shown on the vertical and horizontal axes of this quadrant can help improve the fit between the variables.

Illustration of how to use the tukey-mosteller bulge diagram.

Note that:

  • There are multiple solutions. Some will fit better than others.

  • sqrt and log make a value “smaller.”

  • Raising to a power makes a value “bigger.”

  • Each of these transformations equates to increasing or decreasing the scale of an axis.

Other goals in addition to linearity are possible, for example, making data appear more symmetric. Linearity allows us to fit lines to the transformed data.

Let’s revisit our dugongs example. The lengths and ages are plotted below:

Click to see the code
# `corrcoef` computes the correlation coefficient between two variables
# `std` finds the standard deviation
x = dugongs["Length"]
y = dugongs["Age"]
r = np.corrcoef(x, y)[0, 1]
theta_1 = r * np.std(y) / np.std(x)
theta_0 = np.mean(y) - theta_1 * np.mean(x)

fig, ax = plt.subplots(1, 2, dpi=200, figsize=(8, 3))
ax[0].scatter(x, y)
ax[0].set_xlabel("Length")
ax[0].set_ylabel("Age")

ax[1].scatter(x, y)
ax[1].plot(x, theta_0 + theta_1 * x, "tab:red")
ax[1].set_xlabel("Length")
ax[1].set_ylabel("Age");
<Figure size 1600x600 with 2 Axes>

Looking at the plot on the left, we see that there is a slight curvature to the data points. Plotting the SLR curve on the right results in a poor fit.

For SLR to perform well, we'd like there to be a rough linear trend relating "Age" and "Length". What is making the raw data deviate from a linear relationship? Notice that the data points with "Length" greater than 2.6 have disproportionately high values of "Age" relative to the rest of the data. If we could manipulate these data points to have lower "Age" values, we'd "shift" these points downwards and reduce the curvature in the data. Applying a logarithmic transformation to $y_i$ (that is, taking $\log($"Age"$)$) would achieve just that.

An important word on $\log$: in Data 100 (and most upper-division STEM courses), $\log$ denotes the natural logarithm with base $e$. The base-10 logarithm, where relevant, is indicated by $\log_{10}$.
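
NumPy follows the same convention: `np.log` computes the natural (base-$e$) logarithm, while `np.log10` is the base-10 version. A quick check:

```python
import numpy as np

print(np.log(np.e))    # 1.0 -- np.log is the natural (base-e) logarithm
print(np.log10(100))   # 2.0 -- np.log10 is the base-10 logarithm
```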

Click to see the code
z = np.log(y)

r = np.corrcoef(x, z)[0, 1]
theta_1 = r * np.std(z) / np.std(x)
theta_0 = np.mean(z) - theta_1 * np.mean(x)

fig, ax = plt.subplots(1, 2, dpi=200, figsize=(8, 3))
ax[0].scatter(x, z)
ax[0].set_xlabel("Length")
ax[0].set_ylabel(r"$\log{(Age)}$")

ax[1].scatter(x, z)
ax[1].plot(x, theta_0 + theta_1 * x, "tab:red")
ax[1].set_xlabel("Length")
ax[1].set_ylabel(r"$\log{(Age)}$")

plt.subplots_adjust(wspace=0.3)
<Figure size 1600x600 with 2 Axes>

Our SLR fit looks a lot better! We now have a new target variable: the SLR model is now trying to predict the log of "Age", rather than the untransformed "Age". In other words, we are applying the transformation $z_i = \log{(y_i)}$. Notice that the resulting model is still linear in the parameters $\theta = [\theta_0, \theta_1]$. The SLR model becomes:

$$\hat{\log{y}} = \theta_0 + \theta_1 x$$

$$\hat{z} = \theta_0 + \theta_1 x$$

It turns out that this linearized relationship can help us understand the underlying relationship between $x$ and $y$. If we rearrange the relationship above, we find:

$$\log{(y)} = \theta_0 + \theta_1 x$$
$$y = e^{\theta_0 + \theta_1 x}$$
$$y = (e^{\theta_0})e^{\theta_1 x}$$
$$y = C e^{k x}$$

for some constants $C = e^{\theta_0}$ and $k = \theta_1$.

$y$ is an exponential function of $x$. Applying an exponential fit to the untransformed variables corroborates this finding.

Click to see the code
plt.figure(dpi=120, figsize=(4, 3))

plt.scatter(x, y)
plt.plot(x, np.exp(theta_0) * np.exp(theta_1 * x), "tab:red")
plt.xlabel("Length")
plt.ylabel("Age");
<Figure size 480x360 with 1 Axes>

You may wonder: why did we choose to apply a log transformation specifically? Why not some other function to linearize the data?

Practically, many other mathematical operations that modify the relative scales of "Age" and "Length" could have worked here. For example, instead of "squishing" the difference between age values by taking $\log($"Age"$)$, we could have "magnified" the differences between length values by taking ("Length")$^2$.
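
As a rough, hedged check of this claim, we can compare how linear each version of the relationship looks by printing correlation coefficients (correlation only measures linear association, so this is a heuristic, not a definitive test):

```python
# Compare the linearity of three versions of the dugongs relationship
x = dugongs["Length"]
y = dugongs["Age"]

print("r(Length,    Age):      ", np.corrcoef(x, y)[0, 1])
print("r(Length,    log(Age)): ", np.corrcoef(x, np.log(y))[0, 1])
print("r(Length**2, Age):      ", np.corrcoef(x**2, y)[0, 1])
```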

Multiple Linear Regression

Multiple linear regression is an extension of simple linear regression that adds additional features to the model. The multiple linear regression model takes the form:

$$\hat{y} = \theta_0 + \theta_1 x_{1} + \theta_2 x_{2} + ... + \theta_p x_{p}$$

Our predicted value of $y$, $\hat{y}$, is a linear combination of the individual features, $x_i$, weighted by the parameters, $\theta_i$.
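
To make the formula concrete, here is a minimal sketch of computing a single prediction with $p = 3$ features; the parameter and feature values are made up, and fitting the parameters is the subject of the next lecture.

```python
import numpy as np

theta_0 = 2.0                       # intercept (made-up value)
theta = np.array([0.5, -1.2, 3.0])  # theta_1, ..., theta_p (made-up values)
x = np.array([4.0, 1.0, 0.5])       # feature values x_1, ..., x_p for one observation

# y_hat = theta_0 + theta_1 * x_1 + ... + theta_p * x_p
y_hat = theta_0 + x @ theta
print(y_hat)  # 2.0 + 2.0 - 1.2 + 1.5 = 4.3
```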

We’ll dive deeper into Multiple Linear Regression in the next lecture.

Bonus: Calculating Constant Model MSE Using an Algebraic Trick

Earlier, we calculated the constant model MSE using calculus. It turns out that there is a much more elegant way of performing this same minimization algebraically, without using calculus at all.

In this calculation, we use the fact that the sum of deviations from the mean is 0, that is, $\sum_{i=1}^{n} (y_i - \bar{y}) = 0$.

Let’s quickly walk through the proof for this:

$$\begin{align} \sum_{i=1}^{n} (y_i - \bar{y}) &= \sum_{i=1}^{n} y_i - \sum_{i=1}^{n} \bar{y} \\ &= \sum_{i=1}^{n} y_i - n\bar{y} \\ &= \sum_{i=1}^{n} y_i - n \cdot \frac{1}{n}\sum_{i=1}^{n}y_i \\ &= \sum_{i=1}^{n} y_i - \sum_{i=1}^{n}y_i \\ &= 0 \end{align}$$
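
A one-line numerical sanity check of this fact (exact up to floating-point rounding), using any array of values:

```python
import numpy as np

y = np.array([20, 21, 22, 29, 33], dtype=float)  # any sample works; these are the drinks values
print(np.sum(y - np.mean(y)))                    # 0.0 (up to floating-point rounding)
```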

In our calculations, we'll also be using the definition of the variance of a sample. As a refresher:

$$\sigma_y^2 = \frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2$$

Getting into our calculation for MSE minimization:

$$\begin{align} R(\theta) &= \frac{1}{n}\sum^{n}_{i=1} (y_i - \theta)^2 \\ &= \frac{1}{n}\sum^{n}_{i=1} [(y_i - \bar{y}) + (\bar{y} - \theta)]^2 \quad \quad \text{using the trick that } a - b \text{ can be written as } (a - c) + (c - b) \text{ for any } a, b, c \\ &= \frac{1}{n}\sum^{n}_{i=1} [(y_i - \bar{y})^2 + 2(y_i - \bar{y})(\bar{y} - \theta) + (\bar{y} - \theta)^2] \\ &= \frac{1}{n}\left[\sum^{n}_{i=1}(y_i - \bar{y})^2 + 2(\bar{y} - \theta)\sum^{n}_{i=1}(y_i - \bar{y}) + n(\bar{y} - \theta)^2\right] \quad \quad \text{distribute the sum to individual terms} \\ &= \frac{1}{n}\sum^{n}_{i=1}(y_i - \bar{y})^2 + \frac{2}{n}(\bar{y} - \theta)\cdot 0 + (\bar{y} - \theta)^2 \quad \quad \text{sum of deviations from the mean is 0} \\ &= \sigma_y^2 + (\bar{y} - \theta)^2 \end{align}$$

Since variance can't be negative, we know that our first term, $\sigma_y^2$, is greater than or equal to 0. Also note that the first term doesn't involve $\theta$ at all, meaning changing our model won't change this value. For the purposes of determining $\hat{\theta}$, we can then essentially ignore this term.

Looking at the second term, $(\bar{y} - \theta)^2$, since it is squared, we know it must be greater than or equal to 0. As this term does involve $\theta$, picking the value of $\theta$ that minimizes this term will allow us to minimize our average loss. For the second term to equal 0, we need $\theta = \bar{y}$; in other words, $\hat{\theta} = \bar{y} = \text{mean}(y)$.

  • Variance, $\sigma_y^2$: This term represents the spread of the data points around their mean, $\bar{y}$, and is a measure of the data's inherent variability. Importantly, it does not depend on the choice of $\theta$, meaning it's a fixed property of the data. Variance serves as an indicator of the data's dispersion and is crucial in understanding the dataset's structure, but it remains constant regardless of how we adjust our model parameter $\theta$.

  • Bias Squared, $(\bar{y} - \theta)^2$: This term captures the bias of the estimator, defined as the square of the difference between the mean of the data points, $\bar{y}$, and the parameter $\theta$. The bias quantifies the systematic error introduced when estimating $\theta$. Minimizing this term is essential for improving the accuracy of the estimator. When $\theta = \bar{y}$, the bias is 0, indicating that the estimator is unbiased for the parameter it estimates. This highlights a critical principle in statistical estimation: choosing $\theta$ to be the sample mean, $\bar{y}$, minimizes the average loss, rendering the estimator both efficient and unbiased for the population mean.
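
The decomposition $R(\theta) = \sigma_y^2 + (\bar{y} - \theta)^2$ derived above is also easy to verify numerically. Here is a small sketch (the array and the choice of $\theta$ values are ours) checking that both sides agree:

```python
import numpy as np

y = np.array([20, 21, 22, 29, 33], dtype=float)
var_y = np.var(y)  # population variance: (1/n) * sum of (y_i - y_bar)^2

for theta in [0.0, 22.0, np.mean(y), 40.0]:
    lhs = np.mean((y - theta) ** 2)          # R(theta): constant-model MSE at theta
    rhs = var_y + (np.mean(y) - theta) ** 2  # sigma_y^2 + (y_bar - theta)^2
    print(f"theta = {theta}: MSE = {lhs:.3f}, decomposition = {rhs:.3f}")
```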