⛰️ Lecture 13 – Data 100, Spring 2025¶

Data 100, Spring 2025

Acknowledgments Page

In [1]:
import pandas as pd
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import numpy as np
pd.options.mode.chained_assignment = None  # default='warn'





Fitting our Multiple Linear Regression Model¶

Consider the penguins dataset.

In [2]:
# Load in the penguins dataset provided by seaborn package
df = sns.load_dataset("penguins")

# Filter to only include Adelie species and remove any rows with missing values
df = df[df["species"] == "Adelie"].dropna()

df
Out[2]:
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female
5 Adelie Torgersen 39.3 20.6 190.0 3650.0 Male
... ... ... ... ... ... ... ...
147 Adelie Dream 36.6 18.4 184.0 3475.0 Female
148 Adelie Dream 36.0 17.8 195.0 3450.0 Female
149 Adelie Dream 37.8 18.1 193.0 3750.0 Male
150 Adelie Dream 36.0 17.1 187.0 3700.0 Female
151 Adelie Dream 41.5 18.5 201.0 4000.0 Male

146 rows × 7 columns

Suppose we could measure flippers and weight easily, but not bill dimensions. How can we predict bill depth from flipper length and/or body mass?

For demo purposes, we'll drop all columns except the variables whose relationships we're interested in modeling.

In [3]:
df = df[["bill_depth_mm", "flipper_length_mm", "body_mass_g"]]
df
Out[3]:
bill_depth_mm flipper_length_mm body_mass_g
0 18.7 181.0 3750.0
1 17.4 186.0 3800.0
2 18.0 195.0 3250.0
4 19.3 193.0 3450.0
5 20.6 190.0 3650.0
... ... ... ...
147 18.4 184.0 3475.0
148 17.8 195.0 3450.0
149 18.1 193.0 3750.0
150 17.1 187.0 3700.0
151 18.5 201.0 4000.0

146 rows × 3 columns

Suppose we want to create a linear regression model that predicts a penguin's bill depth $y$ using both their flipper length $x_1$ and body mass $x_2$, plus an intercept term.

$$\Large \hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2$$

Another way to write this:

$$ \widehat{\texttt{bill\_depth\_mm}} = \theta_0 + \theta_1 * \texttt{flipper\_length\_mm} + \theta_2 * \texttt{body\_mass\_g} $$

OLS Approach 1: Use Solution to Normal Equation¶

We saw in last lecture that we can model the above multiple linear regression equation using matrix multiplication:

$$ \large \hat{\mathbb{Y}} = \mathbb{X} \theta$$

The optimal $\hat{\theta}$ that minimizes MSE also solves the normal equation:

$$ \left( \mathbb{X}^T \mathbb{X} \right) \hat{\theta} = \mathbb{X}^T \mathbb{Y}$$

If $\mathbb{X}$ is full column rank, then there is a unique solution to $\hat{\theta}$:

$$\large \hat{\theta} = \left( \mathbb{X}^T \mathbb{X} \right)^{-1} \mathbb{X}^T \mathbb{Y}$$

Note: This derivation is one of the most challenging in the course, especially for those of you who are learning linear algebra during the same semester. Be kind to yourself!



Let's try this in code. We have to add a bias column to the design matrix if we want to include an intercept in the model:

In [4]:
# Construct the design matrix X as a dataframe and add a bias column
X = df[["flipper_length_mm", "body_mass_g"]]
X["bias"] = 1
X
Out[4]:
flipper_length_mm body_mass_g bias
0 181.0 3750.0 1
1 186.0 3800.0 1
2 195.0 3250.0 1
4 193.0 3450.0 1
5 190.0 3650.0 1
... ... ... ...
147 184.0 3475.0 1
148 195.0 3450.0 1
149 193.0 3750.0 1
150 187.0 3700.0 1
151 201.0 4000.0 1

146 rows × 3 columns

In [5]:
# Construct the outcome vector y
y = df["bill_depth_mm"]
y
Out[5]:
0      18.7
1      17.4
2      18.0
4      19.3
5      20.6
       ... 
147    18.4
148    17.8
149    18.1
150    17.1
151    18.5
Name: bill_depth_mm, Length: 146, dtype: float64

Recall the solution to the normal equation if $\mathbb{X}$ is full column rank:

$$\hat{\theta} = \left( \mathbb{X}^T \mathbb{X} \right)^{-1} \mathbb{X}^T \mathbb{Y}$$

In [6]:
# M.T transposes a matrix M
# X @ Y multiplies matrices X and Y
# np.linalg.inv(M) finds the inverse of a matrix M
# By default, the dataframe X will be silently converted to a matrix
theta_using_normal_equation = np.linalg.inv(X.T @ X) @ X.T @ y
theta_using_normal_equation
Out[6]:
0     0.009828
1     0.001477
2    11.002995
dtype: float64

So, our fitted model is as follows:

$$ \widehat{\texttt{bill\_depth\_mm}} = 11 + 0.0098 * \texttt{flipper\_length\_mm} + 0.0015 * \texttt{body\_mass\_g} $$

The bias column is the last column of our design matrix, so the intercept (11) appears last in the parameter list. More often, the bias column is made the first column of the design matrix, as in the quick sketch below.
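For illustration (a quick sketch added here, not something we use later in this notebook), we could instead put the bias first; the fitted parameters are the same, just reordered so that the intercept comes first:

# Hypothetical alternative ordering: bias as the leftmost column of the design matrix
X_alt = df[["flipper_length_mm", "body_mass_g"]].copy()
X_alt.insert(0, "bias", 1)                       # bias column first instead of last
np.linalg.inv(X_alt.T @ X_alt) @ X_alt.T @ y     # intercept (~11.0) now appears first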

Note: The code above is inefficient. We won't go into this in Data 100, but it's better to use np.linalg.solve rather than computing the explicit matrix inverse:

In [7]:
# Solves for θ in (X^T X) θ = X^T y by treating it as a system of linear equations
np.linalg.solve(X.T @ X, X.T @ y)
Out[7]:
array([9.82848689e-03, 1.47749591e-03, 1.10029953e+01])

Make Predictions¶

We'll save the predictions in a column so we can compare them against predictions from sklearn in the next section.

In [8]:
# Convert the dataframe representation of the design matrix to a numpy array, 
# and then multiply it by the theta vector to get the predicted outcomes.
# Yhat = Xθ
df["pred_bill_depth_mm"] = X.to_numpy() @ theta_using_normal_equation
df
Out[8]:
bill_depth_mm flipper_length_mm body_mass_g pred_bill_depth_mm
0 18.7 181.0 3750.0 18.322561
1 17.4 186.0 3800.0 18.445578
2 18.0 195.0 3250.0 17.721412
4 19.3 193.0 3450.0 17.997254
5 20.6 190.0 3650.0 18.263268
... ... ... ... ...
147 18.4 184.0 3475.0 17.945735
148 17.8 195.0 3450.0 18.016911
149 18.1 193.0 3750.0 18.440503
150 17.1 187.0 3700.0 18.307657
151 18.5 201.0 4000.0 18.888505

146 rows × 4 columns




Instructor Note: Return to Lecture!






Using sklearn to fit our Multiple Linear Regression Model¶

An alternate approach to optimize our model is to use the sklearn.linear_model.LinearRegression class. (Documentation)

In [9]:
# Import the LinearRegression class
from sklearn.linear_model import LinearRegression
  1. Create an sklearn object.

    First we create a model. At this point the model has not been fit so it has no parameters.

In [10]:
# Construct a blank LinearRegression model object
model = LinearRegression()
model
Out[10]:
LinearRegression()
  2. Fit the object to data.

Then we "fit" the model, which means computing the parameters that minimize the loss function.

The first argument of the fit function should be a matrix (or DataFrame), and the second should be the observations we're trying to predict.

Note: The LinearRegression class is hard coded to use the MSE as its loss function.

In [11]:
# Fit the linear regression model to the data
model.fit(
    X=df[["flipper_length_mm", "body_mass_g"]], 
    y=df["bill_depth_mm"]
  )
Out[11]:
LinearRegression()
  3. Analyze fit or call predict.

    Now that our model is trained, we can ask it questions.

    The code below asks the model to estimate the bill depth (in mm) for a penguin with a 185 mm flipper length and a 3750 g body mass.

In [12]:
# Generate a prediction (yhat) by providing the X values to the model.
# The input to .predict() is a list of lists. 
# Each sub-list contains the X values for each observation.
# That's why there are double brackets!
model.predict([[185, 3750.0]]) 
/srv/conda/envs/notebook/lib/python3.11/site-packages/sklearn/base.py:439: UserWarning: X does not have valid feature names, but LinearRegression was fitted with feature names
  warnings.warn(
Out[12]:
array([18.36187501])

We can also ask the model to generate a series of predictions:

In [13]:
# Can also provide a design matrix or dataframe to .predict()
df["sklearn_preds"] = model.predict(df[["flipper_length_mm", "body_mass_g"]])
df
Out[13]:
bill_depth_mm flipper_length_mm body_mass_g pred_bill_depth_mm sklearn_preds
0 18.7 181.0 3750.0 18.322561 18.322561
1 17.4 186.0 3800.0 18.445578 18.445578
2 18.0 195.0 3250.0 17.721412 17.721412
4 19.3 193.0 3450.0 17.997254 17.997254
5 20.6 190.0 3650.0 18.263268 18.263268
... ... ... ... ... ...
147 18.4 184.0 3475.0 17.945735 17.945735
148 17.8 195.0 3450.0 18.016911 18.016911
149 18.1 193.0 3750.0 18.440503 18.440503
150 17.1 187.0 3700.0 18.307657 18.307657
151 18.5 201.0 4000.0 18.888505 18.888505

146 rows × 5 columns

Looking at the predictions, we see that they are exactly the same as what we calculated with our analytic formula!
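As a quick sanity check (added here, not in the original lecture), we can verify this programmatically:

# The sklearn predictions and the normal-equation predictions agree (up to floating point error)
np.allclose(df["pred_bill_depth_mm"], df["sklearn_preds"])   # expected: True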

Analyze parameters.

We can ask the model for its intercept and slopes using the intercept_ and coef_ attributes, respectively.

In [14]:
# There is one intercept term in the model
model.intercept_
Out[14]:
11.002995277447074
In [15]:
# There are two slope terms in the model
model.coef_
Out[15]:
array([0.00982849, 0.0014775 ])

So, our fitted model is as follows:

$$ \widehat{\texttt{bill\_depth\_mm}} = 11 + 0.0098 * \texttt{flipper\_length\_mm} + 0.0015 * \texttt{body\_mass\_g} $$

The parameters are the same as with our analytic formula.

In [16]:
# vs. analytical solutions
theta_using_normal_equation
Out[16]:
0     0.009828
1     0.001477
2    11.002995
dtype: float64

They look the same, but why are they out of order...?

  • Remember the order of columns we used in the normal equation approach! The bias column was the last column of the design matrix, so the intercept appears last in the parameter list. The quick check below confirms that the two solutions agree once we account for this ordering.
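Here is that check (a small sketch, not from the lecture):

# sklearn stores the slopes in coef_ and the intercept separately in intercept_.
# Appending the intercept reproduces the [flipper, body_mass, bias] ordering used above.
sklearn_theta = np.append(model.coef_, model.intercept_)
np.allclose(sklearn_theta, theta_using_normal_equation)   # expected: True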

Visualize the Fit¶

We can visualize this data in three dimensions, but for many (most) real-world problems this will not be possible (or helpful).

Note, the following code is out of scope for this class.

In [17]:
fig = px.scatter_3d(df, x="flipper_length_mm", y="body_mass_g", z="bill_depth_mm")

# Create a grid of points at which to evaluate the plane
grid_resolution = 2
(u,v) = np.meshgrid(
    np.linspace(df["flipper_length_mm"].min(), df["flipper_length_mm"].max(), grid_resolution),
    np.linspace(df["body_mass_g"].min(), df["body_mass_g"].max(), grid_resolution))
features = pd.DataFrame({"flipper_length_mm": u.flatten(),
                         "body_mass_g": v.flatten()})
# Make predictions at every point on the grid
zs = model.predict(features)

# create the Surface
fig.add_trace(go.Surface(x=u, y=v, z= zs.reshape(u.shape), opacity=0.9, showscale=False))
fig.update_layout(autosize=False, width=800, height=600)

We see that the predictions all lie in a plane. In higher dimensions, the predictions all lie in a "hyperplane".

Analyze performance.

The sklearn package also provides a function mean_squared_error() (documentation) that computes the MSE from a vector of observations and a vector of predictions. This saves us from computing the MSE manually from the residuals.

In [18]:
# Calculate the MSE for a prediction vector and a ground truth outcome vector
from sklearn.metrics import mean_squared_error
mean_squared_error(df["bill_depth_mm"], df["sklearn_preds"])
Out[18]:
0.9764070438843998
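As a sanity check (a quick sketch, not part of the lecture), computing the MSE by hand from the residuals gives the same value:

# Manual MSE: mean of the squared residuals
residuals = df["bill_depth_mm"] - df["sklearn_preds"]
np.mean(residuals ** 2)   # matches mean_squared_error above (~0.976)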



Instructor Note: Return to Lecture!



Bonus Demo¶

Later in the semester we will learn about other models. However, you already know enough to start using other techniques:

In [19]:
# The sklearn approach is flexible to different model types, like trees! 
from sklearn.tree import DecisionTreeRegressor

tree_model = DecisionTreeRegressor()

tree_model.fit(
    X=df[["flipper_length_mm", "body_mass_g"]], 
    y=df["bill_depth_mm"])
Out[19]:
DecisionTreeRegressor()
In [20]:
df["sklearn_dt_preds"] = tree_model.predict(df[["flipper_length_mm", "body_mass_g"]])
In [21]:
mean_squared_error(df["bill_depth_mm"], df["sklearn_dt_preds"])
Out[21]:
0.051107305936073065

Lower error! A better model? Let's visualize it.

In [22]:
fig = px.scatter_3d(df, x="flipper_length_mm", y="body_mass_g", z="bill_depth_mm")

# Create a grid of points at which to evaluate the model
grid_resolution = 20
(u,v) = np.meshgrid(
    np.linspace(df["flipper_length_mm"].min(), df["flipper_length_mm"].max(), grid_resolution),
    np.linspace(df["body_mass_g"].min(), df["body_mass_g"].max(), grid_resolution))
features = pd.DataFrame({"flipper_length_mm": u.flatten(),
                         "body_mass_g": v.flatten()})

# Make predictions at every point on the grid
zs = tree_model.predict(features) #<------------------ Only change

# create the Surface
fig.add_trace(go.Surface(x=u, y=v, z= zs.reshape(u.shape), opacity=0.9, showscale=False))
fig.update_layout(autosize=False, width=800, height=600)





Minimizing an Arbitrary 1D Function¶

Suppose we want to find the value of $x$ that minimizes the arbitrary function given below:

$$ \frac{1}{10}\left(x^4 - 15x^3 + 80 x^2 - 180x + 144\right) $$

In [23]:
def f(x):
    return (x**4 - 15*x**3 + 80*x**2 - 180*x + 144)/10

Minimization via visualization:

In [24]:
x = np.linspace(1, 6.75, 200)
fig = px.line(y = f(x), x = x)
fig.update_layout(font_size = 16)
fig.update_layout(autosize=False, width=800, height=600)

Above, we see that the minimum is somewhere around 5.3ish.

The Naive Approach: Guess and Check¶

Another approach is to guess and check.

In [25]:
# np.argmin returns the index of the smallest value in an array
# This function returns the x value corresponding to the smallest
# value of f(x), for an input array of x values
def simple_minimize(f, xs):
    y = [f(x) for x in xs]
    return xs[np.argmin(y)]
In [26]:
guesses = [5.3, 5.31, 5.32, 5.33, 5.34, 5.35]
simple_minimize(f, guesses)
Out[26]:
5.33

This process is essentially the same as making the plot above, except that we only evaluate the function at a few selected points.

In [27]:
xs = np.linspace(1, 7, 200)
sparse_xs = np.linspace(1.5, 6.5, 5)

ys = f(xs)
sparse_ys = f(sparse_xs)

fig = px.line(x = xs, y = f(xs))
fig.add_scatter(x = sparse_xs, y = f(sparse_xs), mode = "markers", marker_size=16)
fig.update_layout(showlegend= False)
fig.update_layout(autosize=False, width=800, height=600)
fig.show()

This basic approach suffers from three major flaws (the quick check after this list illustrates the first two):

  1. If the minimum is outside our range of guesses, the answer will be completely wrong.
  2. Even if our range of guesses is correct, if the guesses are too coarse, our answer will be inaccurate.
  3. It is not computationally efficient: we may evaluate the function at a vast number of bad guesses.
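For example (a quick check added here, not in the original notebook), handing simple_minimize a bad range or a coarse grid produces a poor answer:

# Flaw 1: the true minimum (near 5.33) lies outside the range of guesses
print(simple_minimize(f, np.linspace(6, 7, 100)))   # returns 6.0, nowhere near the true minimum
# Flaw 2: the range is fine, but the guesses are too coarse
print(simple_minimize(f, np.linspace(1, 7, 4)))     # guesses are [1, 3, 5, 7], so this returns 5.0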






scipy.optimize.minimize¶

One way to minimize this mathematical function is to use the scipy.optimize.minimize function. It takes a function and a starting guess and tries to find the minimum.

In [28]:
from scipy.optimize import minimize

# x0 is a starting guess for the value of x that minimizes f.
# minimize() will iteratively try to improve this guess.
# Don't worry about all of the details of the output.
# The important part is "x", the value of x that minimizes f.
minimize(f, x0 = 3.5)
Out[28]:
  message: Optimization terminated successfully.
  success: True
   status: 0
      fun: -0.13827491292945523
        x: [ 2.393e+00]
      nit: 3
      jac: [ 6.484e-06]
 hess_inv: [[ 7.385e-01]]
     nfev: 20
     njev: 10

How can this one line of code find the minimum of any mathematical function so quickly?

Behind the scenes, scipy.optimize.minimize uses a collection of techniques to compute the minimizing value of a function. Many of these techniques operate on numerical approximations to the gradient.
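For intuition (a minimal sketch, not how scipy is actually implemented), a derivative can be approximated numerically with a central difference; approx_derivative below is a hypothetical helper, not a scipy function:

# Central-difference approximation: f'(x) ≈ (f(x + h) - f(x - h)) / (2h) for a small h
def approx_derivative(f, x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2 * h)

approx_derivative(f, 3.5)   # estimated slope of f at x = 3.5 (about 0.025)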

In this lecture, we will learn the basics of gradient descent, then implement it ourselves.




Instructor Note: Return to Lecture!





Descending with Derivatives¶

Instead of choosing all of our guesses ahead of time, we can start from a single guess and iteratively improve it.

The key insight:

  • If the derivative of the function is negative, that means the function is decreasing, so we should go to the right (i.e., pick a bigger x).

  • If the derivative of the function is positive, that means the function is increasing, so we should go to the left (i.e., pick a smaller x).

Thus, the derivative tells us which way to go.

Desmos demo: https://www.desmos.com/calculator/twpnylu4lr

Taking the derivative of my arbitrary function:

\begin{align} f(x) & = \frac{1}{10}\left(x^4 - 15x^3 + 80 x^2 - 180x + 144\right)\\ \frac{d}{d x} f(x) & = \frac{1}{10}\left(4x^3 - 45x^2 + 160 x - 180\right)\\ \end{align}

In [29]:
def deriv_f(x):
    return (1/10) * (4*x**3 - 45*x**2 + 160*x - 180)
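As a quick check of the insight above (a small sketch, not part of the lecture), the sign of the derivative on either side of the local minimum near $x \approx 2.39$ tells us which way to step:

# Negative derivative at x = 2.0: f is decreasing, so step right (toward the minimum)
print(deriv_f(2.0))   # -0.8
# Positive derivative at x = 3.0: f is increasing, so step left (back toward the minimum)
print(deriv_f(3.0))   # 0.3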
In [30]:
f_line = go.Scatter(x = xs, y = f(xs), mode = "lines", name = "f")
derivative_line = go.Scatter(x = xs, y = deriv_f(xs), 
                             mode = "lines", name = "f'", line = {"dash": "dash"})
# add horizontal line at 0
zero_line = go.Scatter(x = xs, y = 0*xs, mode = "lines", name='f(x) = 0')
roots = np.array([2.3927, 3.5309, 5.3263])  # critical points of f, found numerically
root_markers = go.Scatter(x = np.array(roots), y = 0*roots, 
                         mode = "markers", name = "f'(x) = 0", marker_size = 12)

fig = go.Figure()
fig.add_traces([f_line, derivative_line, zero_line, root_markers])
fig.update_layout(font_size = 20, yaxis_range=[-1, 3])
fig.update_layout(autosize=False, width=800, height=600)
fig.show()

How does the derivative at a point show us which direction to move?

In [31]:
x = 4.3
fig = go.Figure()
fig.add_trace(f_line)
# Adding a red arrow in the direction of the gradient.
#  Note the arrow is just a direction along the x dimension 
#  (the y position is just for illustrative purposes)
fig.add_trace(go.Scatter(
    x=[x, x - deriv_f(x)], y=[f(x), f(x)],  
    marker= dict(size=10,symbol= "arrow-bar-up", angleref="previous"),
    name="-f'(x0)"
    ))
# Add the Green circle for our guess
fig.add_trace(go.Scatter(x=[x],y=[f(x)], 
                         marker_color="green", marker_size=12,
                         mode="markers", name="x0"))
fig.update_layout(font_size = 20, yaxis_range=[-1, 3])
fig.update_layout(autosize=False, width=800, height=600)
fig

Manually Descending with Derivatives¶

Armed with this knowledge, let's try to see if we can use the derivative to optimize the function.

We start by making some guess for the minimizing value of $x$. Then, we look at the derivative of the function at this value of $x$, and step downhill in the opposite direction. We can express our new rule as a recurrence relation:

$$x^{(t+1)} = x^{(t)} - \frac{d}{dx} f(x^{(t)})$$

We obtain our next guess for the minimizing value of $x$ at timestep $t+1$ ($x^{(t+1)}$) by taking our last guess ($x^{(t)}$) and subtracting the derivative of the function at that point ($\frac{d}{dx} f(x^{(t)})$).

Running the algorithm one step at a time:

In [32]:
# `x` is our current guess for the minimum.
# `derivative` is a function that takes in a single value and 
# returns the value of the derivative of f at the given input value.
def take_one_step(x, derivative):
    new_x = x - derivative(x)
    return new_x

Starting with an initial guess of 4.0 and taking 10 steps:¶

In [33]:
x = 4.0
steps = [x]
for i in range(10):
    x = take_one_step(x, deriv_f)
    steps.append(x) 

print(steps)
[4.0, 4.4, 5.0464000000000055, 5.496730601062393, 5.0808624852305115, 5.489980392167775, 5.092824872119241, 5.486755386070718, 5.0984728528436225, 5.485072693208349, 5.101402551267881]

Visualizing the optimization steps¶

Note: The following visualization code is out-of-scope for Data 100, but you should understand the visualization.

In [34]:
# This code is out-of-scope for data-100 but could be fun to learn.
def plot_steps(steps, f = f, f_line = f_line):
    fig = go.Figure()
    fig.add_trace(f_line)
    fig.add_trace(go.Scatter(x = steps, y = [f(s) for s in steps], 
                             mode = "lines+markers", line = {"dash": "dash", "color": "red"},
                             name = "Path", 
                             marker_symbol="arrow",
                             marker_angleref="previous",
                             marker_standoff=4,
                             marker_size = 16))
    fig.add_trace(go.Scatter(x = steps, y = [f(s) for s in steps], 
                             mode = "markers", 
                             name = "Path",
                             marker_color="red",
                             showlegend=False,
                             marker_size = 8))
    fig.update_layout(font_size = 20)
    fig.update_layout(autosize=False, width=800, height=600)

    return fig
In [35]:
plot_steps(steps)

Looking pretty good! We do have a problem though – once we arrive close to the minimum value of the function, our guesses "bounce" back and forth past the minimum without ever reaching it.

  • In other words, each step we take when updating our guess moves us too far. We can address this by decreasing the size of each step.

Let's update our algorithm to use a learning rate (also sometimes called the step size), which controls how far we move with each update. We represent the learning rate with $\alpha$.

$$x^{(t+1)} = x^{(t)} - \alpha \frac{d}{dx} f(x^{(t)})$$

A small $\alpha$ means that we will take small update steps; a large $\alpha$ means we will take large steps.

Let's update our function to use $\alpha=0.3$.

In [36]:
def take_one_step_lr(x, alpha, derivative):
    # Find our new guess using the recurrence relation
    new_x = x - alpha * derivative(x)
    return new_x
In [37]:
x = 4.0
steps = [x]
for i in range(15):
    x = take_one_step_lr(x, alpha=0.3, derivative = deriv_f)
    print(x)
    steps.append(x) 

plot_steps(steps)
4.12
4.267296639999997
4.442725838159953
4.640926244829146
4.846183704850335
5.032118544823421
5.17201478493924
5.2564844894138165
5.297911492494514
5.315427176589101
5.322260602055931
5.324832983472768
5.325787650752968
5.3261400404400865
5.326269854338316




Instructor Note: Return to Lecture!




The Gradient Descent Algorithm¶

The process we just explored above is called gradient descent – we compute the direction of greatest increase of the function (its gradient, which in one dimension is simply the derivative) with respect to the variable we wish to optimize, then step in the opposite direction to descend toward the minimum of the function.

$$ x^{(t+1)} = x^{(t)} - \alpha \frac{d}{dx} f(x^{(t)}) $$

In the cell below, we define gradient_descent, which runs the gradient descent algorithm until it has collected a specified number of guesses T and stores all guesses along the way.

In [38]:
def gradient_descent(deriv_f, initial_guess, alpha, T):
    """Performs T steps of gradient descent on df using learning rate alpha starting
       from initial_guess. Returns a numpy array of all guesses over time."""
    guesses = [initial_guess]
    current_guess = initial_guess
    while len(guesses) < T:
        current_guess = current_guess - alpha * deriv_f(current_guess)
        guesses.append(current_guess)
        
    return np.array(guesses)

Below, we see a visualization of the trajectory taken by this algorithm.

Try modifying the initial_guess, learning rate alpha, and number of updates (i.e., time steps) T.

In [39]:
trajectory = gradient_descent(deriv_f, initial_guess=1.6, alpha=0.75, T=20)
print(trajectory)
plot_steps(trajectory)
[1.6        3.3112     3.18920918 3.01472352 2.79207742 2.56776716
 2.42826486 2.39421613 2.39274816 2.39274798 2.39274798 2.39274798
 2.39274798 2.39274798 2.39274798 2.39274798 2.39274798 2.39274798
 2.39274798 2.39274798]
In [40]:
trajectory = gradient_descent(deriv_f, initial_guess=6, alpha=0.75, T=20)
print(trajectory)
plot_steps(trajectory)
[6.         4.2        4.6086     5.12279483 5.38817984 5.28497822
 5.34793725 5.31315502 5.33375146 5.32197109 5.32885604 5.32488006
 5.32719254 5.32585303 5.32663079 5.32617982 5.32644152 5.32628973
 5.32637779 5.32632671]

Above, we've simply run our algorithm a fixed number of times. More sophisticated implementations will stop based on a variety of different stopping criteria, e.g., error getting too small, error getting too large, etc. We will not discuss these in our course.
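For reference only (a sketch under the assumption that we stop when an update becomes tiny; other stopping criteria are possible), a version with a simple stopping condition might look like this:

def gradient_descent_with_stopping(deriv_f, initial_guess, alpha, max_steps=1000, tol=1e-8):
    """Like gradient_descent, but stops early once a step changes the guess by less than tol."""
    guesses = [initial_guess]
    current_guess = initial_guess
    for _ in range(max_steps):
        step = alpha * deriv_f(current_guess)
        current_guess = current_guess - step
        guesses.append(current_guess)
        if abs(step) < tol:   # stopping criterion: the guess barely moved
            break
    return np.array(guesses)

# Converges to the same minimizer as before, typically in far fewer than max_steps iterations
gradient_descent_with_stopping(deriv_f, initial_guess=6, alpha=0.75)[-1]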

In the next part, we'll return to the world of real data and see how this procedure might be useful for optimizing models.


Instructor Note: Return to Lecture!






Gradient Descent on a 1D Model¶

Recall our modeling workflow from the past few lectures:

  • Define a model with some parameters $\theta_i$
  • Choose a loss function
  • Select the values of $\theta_i$ that minimize the loss function on the data

Gradient descent is a powerful technique for completing this last task. By applying the gradient descent algorithm, we can select values for our parameters $\theta_i$ that will lead to the model having minimal loss on the training data.

When using gradient descent in a modeling context:

  • We make guesses for the minimizing $\theta_i$
  • We compute the derivative of the loss function $L$

Using our gradient descent rule from before:

$$\theta^{(t+1)} = \theta^{(t)} - \alpha \frac{d}{d\theta} L(\theta^{(t)})$$

To see this in action, let's consider a case where we have a simple linear model with no intercept.

$$\hat{y} = \theta_1 x$$

Let's apply our gradient_descent function from before to optimize our model on the tips dataset. We will try to select the best parameter $\theta_1$ to predict the tip $y$ from the total_bill $x$.

$$ \widehat{\texttt{tip}} = \texttt{total\_bill} * \theta_1 $$

We want to find the parameter $\theta_1$ such that the mean squared error is minimized. Our loss function is:

$$L(\theta) = MSE(\theta) = \frac{1}{n} \sum_{i=1}^n (y_i - \theta_1x_i)^2$$

In [41]:
df = sns.load_dataset("tips")
df.head()
Out[41]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

We want to predict the tip using the total bill.

In [42]:
x = df["total_bill"]
y_obs = df["tip"]

We can visualize the value of the MSE on our dataset for different possible choices of $\theta_1$. To optimize our model, we want to select the value of $\theta_1$ that leads to the lowest MSE.

In [43]:
def mse_single_arg(theta_1):
    """Returns the MSE on our data for the given theta1"""
    y_hat = theta_1 * x
    return np.mean((y_hat - y_obs) ** 2) 

thetas = np.linspace(-1.5, 1, 100)

mse_line = go.Scatter(x = thetas, y = [mse_single_arg(theta_1) for theta_1 in thetas], mode = "lines", name = "MSE")
fig = go.Figure()
fig.add_trace(mse_line)
fig.update_layout(autosize=False, width=800, height=600, xaxis_title="theta_1", yaxis_title="MSE")

To apply gradient descent, we need to compute the derivative of the loss function with respect to our parameter $\theta_1$. This comes out to be:

$$\frac{d}{d\theta_1} L(\theta^{(t)}) = \frac{-2}{n} \sum_{i=1}^n (y_i - \theta_1^{(t)}x_i)x_i$$

Here, we denote our parameter as $\theta_1^{(t)}$ to remind ourselves that we compute the derivative assuming $\theta_1$ has the value of our current guess.

Our gradient descent update rule is:

$$\theta_1^{(t+1)} = \theta_1^{(t)} - \alpha \frac{-2}{n} \sum_{i=1}^n (y_i - \theta_1^{(t)}x_i)x_i$$

To use our gradient descent function, we need to compute the derivative of the MSE.

The derivative of the MSE with respect to theta_1 is implemented below.

In [44]:
def deriv_mse_single_arg(theta_1):
    """Returns the derivative of the MSE on our data for the given theta1"""
    y_hat = theta_1 * x
    return np.mean(-2 * (y_obs - y_hat) * x)

Now, we can apply gradient descent to select a value for $\theta_1$.

In [45]:
trajectory = gradient_descent(deriv_mse_single_arg, initial_guess=-0.5, alpha=0.0001, T=100)
print(f"Final guess for theta_1: {trajectory[-1]}")
plot_steps(trajectory,  mse_single_arg,  mse_line)
Final guess for theta_1: 0.14369554654231262
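Because this model has a single parameter and we use squared loss, the minimizer also has a closed form: setting the derivative above to zero gives $\hat{\theta}_1 = \frac{\sum_i x_i y_i}{\sum_i x_i^2}$. As a quick check (a sketch added here, not from the lecture), gradient descent should have landed close to this value:

# Closed-form minimizer for the no-intercept model
theta_1_closed_form = np.sum(x * y_obs) / np.sum(x ** 2)
print(theta_1_closed_form)   # should be close to the final gradient descent guess above
print(trajectory[-1])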



Instructor Note: Return to Lecture!








Gradient Descent on Multi-Dimensional Models¶

Now suppose we improve our model so that we want to predict the tip from the total_bill plus a constant offset, in other words:

$$\widehat{\textrm{tip}} = \theta_0 + \theta_1 \cdot \textrm{total\_bill}$$

Our simple linear regression model has two parameters, $\theta_0$ and $\theta_1$. We need to optimize both of them.

Fortunately, gradient descent readily extends to models with multiple dimensions.

Defining a 2D MSE Function¶

Now, we will generate our predictions as $$\hat{y} = \theta_0 + \theta_1 x_1$$

In the cell below, we add a bias term to our data to represent the constant intercept $\theta_0$.

In [46]:
tips_with_bias = df.copy()
tips_with_bias["bias"] = 1
tips_with_bias = tips_with_bias[["bias", "total_bill"]]
tips_with_bias.head()
Out[46]:
bias total_bill
0 1 16.99
1 1 10.34
2 1 21.01
3 1 23.68
4 1 24.59
In [47]:
X = tips_with_bias
y = df["tip"]

Throughout this problem, we'll assume we want to minimize the mean squared error of our predictions:

In [48]:
def mse_loss(theta):
    y_hat = X @ theta
    return np.mean((y - y_hat) ** 2)    

Using this function, we can visualize our loss function. Because we now want to understand how the loss changes with respect to two parameters, we create a loss surface. Each point on the surface represents the loss of the model for a particular choice of $\theta_0$ and $\theta_1$.

The cell below uses lots of syntax you've never seen. No need to worry about any unfamiliar plotting code; for now, focus on the output.

In [49]:
import plotly.graph_objects as go

uvalues = np.linspace(-1, 5, 20)
vvalues = np.linspace(-0.1, 0.35, 20)
(u,v) = np.meshgrid(uvalues, vvalues)
thetas = np.vstack((u.flatten(),v.flatten()))


MSE = np.array([mse_loss(t) for t in thetas.T])

loss_surface = go.Surface(x=u, 
    y=v, z=np.reshape(MSE, u.shape),
    contours = {"z": {"show": True, "start": 0, "end": 50, "size": 2, "color": "white"}})

# This is an approximate guess for the optimal point.
ind = np.argmin(MSE)
optimal_point = go.Scatter3d(name = "Optimal Point",
    x = [thetas.T[ind,0]], y = [thetas.T[ind,1]], 
    z = [MSE[ind]],
    marker=dict(size=10, color="red"))

fig = go.Figure(data=[loss_surface, optimal_point])
fig.update_layout(scene = dict(
    xaxis_title = "theta0",
    yaxis_title = "theta1",
    zaxis_title = "MSE"), autosize=False, width=800, height=600)

fig.show()

Play around with the plot above. We have marked the lowest point on the surface in red – this is the combination of $\theta_0$ and $\theta_1$ that leads to the lowest MSE for the model.

Alternatively, we can visualize a bird's-eye view of the loss surface from above using a contour plot:

In [50]:
contour = go.Contour(x=u[0], y=v[:, 0], z=np.reshape(MSE, u.shape), 
                     contours=dict(start=0, end=70,size=2))
fig = go.Figure(contour)
fig.update_layout(
    xaxis_title = "theta0",
    yaxis_title = "theta1", autosize=False, width=800, height=600)

fig.show()

Applying Gradient Descent in 2D¶

When working with multidimensional models, we optimize a vector of parameters. Our new update rule is:

$$\vec{\theta}^{(t+1)} = \vec{\theta}^{(t)} - \alpha \nabla_{\vec{\theta}} L(\vec{\theta}^{(t)})$$

Where $\nabla_{\vec{\theta}} L(\vec{\theta}^{(t)})$ is the gradient of the loss function. It is the vector of the partial derivatives of loss with respect to each parameter $\theta_i$.

Let's derive the gradient of the loss for our problem:

$$ \textbf{L}(\theta) = \textbf{MSE}(\theta) = \frac{1}{n}\sum_{i=1}^n \left(y_i - \left(\theta_0 + \theta_1 x_i\right)\right)^2 $$

Computing the partial derivatives: \begin{align} \frac{\partial }{\partial \theta_0}\textbf{L}(\theta) & = \frac{1}{n}\sum_{i=1}^n \frac{\partial }{\partial \theta_0}\left(y_i - \left(\theta_0 + \theta_1 x_i\right)\right)^2 \\ & = \frac{1}{n}\sum_{i=1}^n 2\left(y_i - \left(\theta_0 + \theta_1 x_i\right)\right) \left(- \frac{\partial }{\partial \theta_0}\left(\theta_0 + \theta_1 x_i\right)\right) \\ & = \frac{1}{n}\sum_{i=1}^n 2\left(y_i - \left(\theta_0 + \theta_1 x_i\right)\right) \left(- 1 \right) \\ & = \frac{-2}{n}\sum_{i=1}^n \left(y_i - \left(\theta_0 + \theta_1 x_i\right)\right) \\ \frac{\partial }{\partial \theta_1}\textbf{L}(\theta) & = \frac{1}{n}\sum_{i=1}^n \frac{\partial }{\partial \theta_1}\left(y_i - \left(\theta_0 + \theta_1 x_i\right)\right)^2 \\ & = \frac{1}{n}\sum_{i=1}^n 2\left(y_i - \left(\theta_0 + \theta_1 x_i\right)\right) \left(- \frac{\partial }{\partial \theta_1}\left(\theta_0 + \theta_1 x_i\right)\right) \\ & = \frac{1}{n}\sum_{i=1}^n 2\left(y_i - \left(\theta_0 + \theta_1 x_i\right)\right) \left(- x_i \right) \\ & = \frac{-2}{n}\sum_{i=1}^n \left(y_i - \left(\theta_0 + \theta_1 x_i\right)\right) x_i \\ \end{align}

In the above we used the chain rule: $$ \frac{\partial}{\partial\theta}f(g(\theta))=\left.\left(\frac{\partial}{\partial u}f(u) \right)\right\rvert_{u=\theta}\,\frac{\partial}{\partial\theta}g(\theta) $$

We obtain the gradient:

$$ \nabla \textbf{L}(\theta) = \begin{bmatrix} \frac{-2}{n}\sum_{i=1}^n \left(y_i - \left(\theta_0 + \theta_1 x_i\right)\right) \\ \frac{-2}{n}\sum_{i=1}^n \left(y_i - \left(\theta_0 + \theta_1 x_i\right)\right) x_i \end{bmatrix} $$

In the cell below, we define helper functions to compute the gradient of MSE with respect to our two parameters $\theta_0$ and $\theta_1$, stored in the array theta.

In [51]:
def mse_gradient(theta):
    """Returns the gradient of the MSE on our data for the given theta"""
    x1 = X.iloc[:, 1]
    dth0 = np.mean(-2 * (y - (theta[0] + theta[1]*x1)))
    dth1 = np.mean(-2 * (y - (theta[0] + theta[1]*x1)) * x1)
    return np.array([dth0, dth1])

Now, we can use our gradient_descent function from before to optimize $\theta_0$ and $\theta_1$ at the same time! The final estimates for the ideal model parameters are very similar to the guesses we may have made simply by inspecting the plot of the loss surface from before.

The cell below may take a moment to run.

In [52]:
# Note that `initial_guess` is now an array of size 2.
guesses = gradient_descent(mse_gradient, initial_guess=np.array([1, .5]), alpha=0.001, T=10000)

pd.DataFrame(guesses, columns=["theta_0", "theta_1"]).tail(10)
Out[52]:
theta_0 theta_1
9990 0.922487 0.104931
9991 0.922486 0.104931
9992 0.922485 0.104931
9993 0.922484 0.104931
9994 0.922484 0.104931
9995 0.922483 0.104931
9996 0.922482 0.104931
9997 0.922481 0.104931
9998 0.922481 0.104931
9999 0.922480 0.104932

Comparing with the scipy minimize function.

In [53]:
minimize(mse_loss, x0 = [0,0])
Out[53]:
  message: Optimization terminated successfully.
  success: True
   status: 0
      fun: 1.0360194420115247
        x: [ 9.203e-01  1.050e-01]
      nit: 3
      jac: [ 1.490e-08 -1.490e-08]
 hess_inv: [[ 2.980e+00 -1.253e-01]
            [-1.253e-01  6.335e-03]]
     nfev: 15
     njev: 5

We can actually provide the gradient information to the scipy optimizer to get an even faster solution. This is out of scope for Data 100!

jac stands for Jacobian.

In [54]:
minimize(mse_loss, x0 = [0,0], jac=mse_gradient)
Out[54]:
  message: Optimization terminated successfully.
  success: True
   status: 0
      fun: 1.036019442011377
        x: [ 9.203e-01  1.050e-01]
      nit: 3
      jac: [ 3.494e-16 -6.989e-16]
 hess_inv: [[ 2.980e+00 -1.253e-01]
            [-1.253e-01  6.335e-03]]
     nfev: 5
     njev: 5



Instructor Note: Return to Lecture!





Stochastic Gradient Descent on Multi-Dimensional Models¶
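In stochastic gradient descent (SGD), instead of computing the gradient of the loss over the entire dataset at every step, we estimate it from a randomly sampled mini-batch. Each update is cheaper to compute but noisier. In the cells below, mse_gradient is redefined to take the design matrix and outcomes as explicit arguments so that a random sample can be passed in at each step, and the learning rate eta is decayed as eta/t so that the noisy updates settle down over time.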

In [55]:
def mse_gradient(theta, X, y):
    """Returns the gradient of the MSE on our data for the given theta"""
    x0 = X.iloc[:, 0]
    x1 = X.iloc[:, 1]
    dth0 = np.mean(-2 * (y - theta[0]*x0 - theta[1]*x1) * x0)
    dth1 = np.mean(-2 * (y - theta[0]*x0 - theta[1]*x1) * x1)
    return np.array([dth0, dth1])
In [56]:
# Run stochastic gradient descent (SGD)
# eta is the initial learning rate, which decays as a function of the time step
def sgd(grad, X, y, initial_theta, eta=0.3, max_iter=5000, batch_size=50):
    theta = initial_theta
    thetas = [theta]
    n = len(X)
    for t in range(1, max_iter):
        X_sample = X.sample(batch_size)
        y_sample = y.loc[X_sample.index]
        theta = theta - eta/t * grad(theta, X_sample, y_sample)
        thetas.append(theta)
    return thetas
In [57]:
thetas = sgd(mse_gradient, X, y, 
             initial_theta = np.array([1, .5]), 
             eta = 0.001, 
             max_iter = 10000,
             batch_size=1)
thetas[-5:]
Out[57]:
[array([0.98110353, 0.10248307]),
 array([0.98110341, 0.10248119]),
 array([0.98110332, 0.1024799 ]),
 array([0.98110344, 0.10248167]),
 array([0.98110334, 0.10248054])]