---
title: Gradient Descent
execute:
echo: true
warning: false
format:
html:
code-fold: true
code-tools: true
toc: true
toc-title: Gradient Descent
page-layout: full
theme:
- cosmo
- cerulean
callout-icon: false
jupyter: myenv
---
::: {.callout-note collapse="true"}
## Learning Outcomes
* Understand the standard workflow for fitting models in `sklearn`
* Describe the conceptual basis for gradient descent
* Compute the gradient descent update on a provided dataset
:::
At this point, we've grown quite familiar with the modeling process. We've introduced the concept of loss, used it to fit several types of models, and, most recently, extended our analysis to multiple regression. Along the way, we've forged our way through the mathematics of deriving the optimal model parameters in all of its gory detail. It's time to make our lives a little easier – let's implement the modeling process in code!
In this lecture, we'll explore three techniques for model fitting:
1. Translating our derived formulas for regression to Python
2. Using the `sklearn` Python package
3. Applying gradient descent for numerical optimization
## `sklearn`: Implementing Derived Formulas in Code
Throughout this lecture, we'll refer to the `penguins` dataset.
quarto-executable-code-5450563D
```python
import pandas as pd
import seaborn as sns
import numpy as np
penguins = sns.load_dataset("penguins")
penguins = penguins[penguins["species"] == "Adelie"].dropna()
penguins.head(5)
```
Suppose our goal is to predict the value of the `'bill depth'` for a particular penguin given its `'flipper length'`.
quarto-executable-code-5450563D
```python
#| code-fold: false
# Define the design matrix, X...
X = penguins[["flipper_length_mm"]]
# ...as well as the target variable, y
y = penguins[["bill_depth_mm"]]
```
### Simple Linear Regression (SLR)
In the SLR framework we learned last week, this means we are saying our model for `bill depth`, $y$, is a linear function of `flipper length`, $x$:
$$\hat{y} = \theta_0 + \theta_1 x$$
Let's do some EDA first.
quarto-executable-code-5450563D
```python
import matplotlib.pyplot as plt
plt.xlabel("flipper length (mm)")
plt.ylabel("bill depth (mm)")
plt.scatter(data = penguins, x = "flipper_length_mm", y = "bill_depth_mm");
```
Based on our EDA, there is a linear relationship, though it is somewhat weak.
#### SLR w/ Derived Analytical Formulas
Let $\hat{\theta}_0$ and $\hat{\theta}_1$ be the choices that minimize the Mean Squared Error.
One approach to compute $\hat{\theta}_0$ and $\hat{\theta}_1$ is analytically, using the equations we derived in a previous lecture:
$$\hat{\theta}_0 = \bar{y} - \hat{\theta}_1 \bar{x}$$
$$\hat{\theta}_1 = r \frac{\sigma_y}{\sigma_x}$$
$$r = \frac{1}{n} \sum_{i=1}^{n}\left( \frac{x_i - x}{\sigma_x} \right) \left( \frac{y_i - y}{\sigma_y} \right) $$
Let's implement these using the base numpy library, which provides many important functions such as `.mean` and `.std` .
quarto-executable-code-5450563D
```python
x = penguins["flipper_length_mm"]
y = penguins["bill_depth_mm"]
x_bar, sigma_x = np.mean(x), np.std(x)
y_bar, sigma_y = np.mean(y), np.std(y)
r = np.sum((x - x_bar) / sigma_x * (y - y_bar) / sigma_y) / len(x)
theta1_hat = r * sigma_y / sigma_x
theta0_hat = y_bar - theta1_hat * x_bar
print(f"bias parameter: {theta0_hat}, \nslope parameter: {theta1_hat}".format(theta0_hat,theta1_hat))
```
#### SLR Analytical Approach Performance
Let's first assess how "good" this model is using a performance metric. For this exercise, let's use the MSE. As a review:
**Mean Squared Error**: We can compute this explicitly by averaging the square of the residuals $e_i$:
$$\large MSE = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \frac{1}{n}\sum_{i=1}^n (e_i)^2 = \frac{1}{n}$$
quarto-executable-code-5450563D
```python
# using our estimated parameter values to create a column containing our
# SLR predictions and errors
penguins["analytical_preds_slr"] = theta0_hat + theta1_hat * penguins["flipper_length_mm"]
penguins["residual"] = penguins["bill_depth_mm"] - penguins["analytical_preds_slr"]
penguins
print("MSE: ", np.mean(penguins["residual"]**2))
```
Let's plot plot our results.
quarto-executable-code-5450563D
```python
sns.scatterplot(data = penguins, x = "flipper_length_mm", y = "bill_depth_mm")
plt.plot(penguins["flipper_length_mm"], penguins["analytical_preds_slr"], 'r') # a line
plt.legend([r'$\hat{y}$', '$y$']);
```
#### SLR w/ `sklearn`
We've already saved a lot of time (and avoided tedious calculations) by translating our derived formulas into code. However, we still had to go through the process of writing out the linear algebra ourselves.
To make life *even easier*, we can turn to the `sklearn` Python library. `sklearn` is a robust library of machine learning tools used extensively in research and industry. It gives us a wide variety of in-built modeling frameworks and methods, so we'll keep returning to `sklearn` techniques as we progress through Data 100.
Regardless of the specific type of model being implemented, `sklearn` follows a standard set of steps for creating a model.
1. Create a model object. This generates a new instance of the model class. You can think of it as making a new copy of a standard "template" for a model. In pseudocode, this looks like:
```
my_model = ModelName()
```
2. Fit the model to the `X` design matrix and `Y` target vector. This calculates the optimal model parameters "behind the scenes" without us explicitly working through the calculations ourselves. The fitted parameters are then stored within the model for use in future predictions:
```
my_model.fit(X, Y)
```
3. Analyze the fitted parameters using `.coef_` or `.intercept_`, or use the fitted model to make predictions on the `X` input data using `.predict`.
```
my_model.coef_
my_model.intercept_
my_model.predict(X)
```
Let's put this into action with our multiple regression task. First, initialize an instance of the `LinearRegression` class.
quarto-executable-code-5450563D
```python
#| code-fold: false
from sklearn.linear_model import LinearRegression
model = LinearRegression()
```
Next, fit the model instance to the design matrix `X` and target vector `Y` by calling `.fit`.
quarto-executable-code-5450563D
```python
#| code-fold: false
model.fit(X, y)
```
And, lastly, generate predictions for $\hat{Y}$ using the `.predict` method. Here are the first 5 penguins in our dataset. How close are our analytical solution's predictions to our `sklearn` model's predictions?
quarto-executable-code-5450563D
```python
#| code-fold: false
# Like before, show just the first 5 predictions. The output of predict is usually a np.array
penguins["sklearn_preds_slr"] = model.predict(X)
sklearn_5 = penguins["sklearn_preds_slr"][:5].to_numpy()
print("Sklearn solution: ", sklearn_5)
analytical_5 = penguins["analytical_preds_slr"][:5].to_numpy()
print("Analytical solution: ", analytical_5)
```
You can also use the model to predict what the `bill depth` of a hypothetical penguin with `flipper length` of 185mm would have.
quarto-executable-code-5450563D
```python
# this produces a warning since we
# did not specify what X this refers
# to, but since we only
# have one input it is negligible
model.predict([[185]])
```
We can also check if the fitted parameters,$\hat{\theta}_0,\hat{\theta}_1$, themselves are similar to our analytical solution. Note that since we can have at most 1 intercept in a SLR or OLS model, so we always get back a `scalar` value from `.intercept`. However, when OLS can have multiple coefficient values, so `.coef` returns an array.
quarto-executable-code-5450563D
```python
theta0 = model.intercept_ # this a scalar
print("analytical bias term: ", theta0_hat)
print("sklearn bias term: ", theta0)
theta1 = model.coef_ # this an array
print("analytical coefficient terms: ", theta1_hat)
print("sklearn coefficient terms: ", theta1)
```
#### SLR `sklearn` Performance
The `sklearn` package also provides a function that computes the MSE from a list of observations and predictions. This avoids us having to manually compute MSE by first computing residuals.
[documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html)
quarto-executable-code-5450563D
```python
from sklearn.metrics import mean_squared_error
MSE_sklearn = mean_squared_error(penguins["bill_depth_mm"], penguins["sklearn_preds_slr"])
print("MSE: ", MSE_sklearn)
```
We've generated the exact same predictions and error as before, but without any need for manipulating matrices ourselves!
### Multiple Linear Regression
In the previous lecture, we expressed multiple linear regression using matrix notation.
$$\hat{\mathbb{Y}} = \mathbb{X}\theta$$
#### OLS w/ Derived Analytical Formulas
We used a geometric approach to derive the following expression for the optimal model parameters under MSE error, also called Ordinary Least Squares (OLS):
$$\hat{\theta} = (\mathbb{X}^T \mathbb{X})^{-1}\mathbb{X}^T \mathbb{Y}$$
That's a whole lot of matrix manipulation. How do we implement it in Python?
There are three operations we need to perform here: multiplying matrices, taking transposes, and finding inverses.
* To perform matrix multiplication, use the `@` operator
* To take a transpose, call the `.T` attribute of an array or DataFrame
* To compute an inverse, use `numpy`'s in-built method `np.linalg.inv`
quarto-executable-code-5450563D
```python
#| code-fold: false
X = penguins[["flipper_length_mm", "body_mass_g"]].copy()
X["bias"] = np.ones(len(X))
y = penguins["bill_depth_mm"]
theta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
theta_hat
```
Note that since we added "bias" last, `theta_hat[2]` is our estimated value for the $\theta_0$. To make predictions using our newly-fitted model coefficients, matrix-multiply `X` and `theta_hat`.
quarto-executable-code-5450563D
```python
#| code-fold: false
y_hat = X.to_numpy() @ theta_hat
# Show just the first 5 predictions to save space on the page
y_hat[:5]
penguins["analytical_preds_ols"] = y_hat
```
Note, this technique doesn't work if our X is **not invertible**.
#### OLS w/ `sklearn`
We can actually compute the optimal parameters very easily using sklearn, using the exact code that we wrote earlier. Note: sklearn does NOT use the normal equations. Instead it uses gradient descent, a technique we will learn about soon, which can minimize ANY function, not just the MSE.
quarto-executable-code-5450563D
```python
# creating our design matrix.
# Note:
# - no bias term needed here bc of sklearn automatically includes one
# - to remove the intercept term, set the fit_intercept = true in LinearRegression constructor.
X_2d = penguins[["flipper_length_mm", "body_mass_g"]]
y = penguins["bill_depth_mm"]
model_2d = LinearRegression() # note fit_intercept=True by default
model_2d.fit(X_2d, y)
```
Now we again have a model with which we can use to make predictions. For example, we can ask our model about a penguin's bill depth if they have 185-mm flipper length and 3750 g body mass.
quarto-executable-code-5450563D
```python
penguins["sklearn_predictions_ols"] = model_2d.predict(X_2d)
model_2d.predict([[185, 3750]])
# since we have a 2d data matrix, we maintain the same
# row-column expectation for our inputs.
```
Just like with SLR, we cam also extract the coefficient estimates using `.coef_` and `.intercept`. The reason why `.intercept` returns an array should now be more clear.
quarto-executable-code-5450563D
```python
print(f"(sklearn) theta0: {model_2d.intercept_}")
print(f"(analytical) theta0 {theta_hat[2]}")
print(f"(sklearn) theta1: {model_2d.coef_[0]}")
print(f"(analytical) theta1 {theta_hat[0]}")
print(f"(sklearn) theta2: {model_2d.coef_[1]}")
print(f"(analytical) theta2 {theta_hat[1]}")
```
#### Visualizing Our 2D Linear Model Predictions
When we have two axis with which we can change an input, moving along this input plan creates a 2d plane with which we can get model outputs for. For example, for every single penguin with a `flipper length` , we also must specify a `body mass`. These two values in combination will help us predict the `bill depth`. Thus, we see that the predictions all lie in a 2d-plane. In higher dimensions, they all lie in a "hyperplane".
quarto-executable-code-5450563D
```python
from mpl_toolkits.mplot3d import Axes3D # noqa: F401 unused import
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(penguins["flipper_length_mm"], penguins["body_mass_g"], penguins["bill_depth_mm"])
plt.xlabel('flipper_length_mm')
plt.ylabel('body_mass_g')
```
quarto-executable-code-5450563D
```python
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(penguins["flipper_length_mm"], penguins["body_mass_g"], penguins["bill_depth_mm"])
xx, yy = np.meshgrid(range(170, 220, 10), range(2500, 4500, 100))
zz = ( 11.0029 + 0.00982 * xx + 0.001477 * yy) # thetas_using_sklearn
ax.plot_surface(xx, yy, zz, alpha=0.2)
plt.xlabel('flipper_length_mm')
plt.ylabel('body_mass_g')
plt.gcf().savefig("plane.png", dpi = 300, bbox_inches = "tight")
```
### Loss Terminology
We use the word "loss" in two different (but very related) contexts in this course.
- In general, loss is the cost function that measures how far off model's prediction(s) is(are) from the actual value(s).
- Per-datapoint loss is a cost function that measures the cost of $y$ vs $\hat{y}$ for a particular datapoint.
- Loss (without any adjectives) is generally a cost function measured across all datapoints. Often times, *empirical risk* is *average per-datapoint loss*.
- We prioritize using the latter term, because we don't particularly look at a given datapoint's loss when optimizing a model.
- In other words, the dataset-level loss is the objective function that we'd like to minimize using gradient descent.
- We achieve this minimization by using **per-datapoint** loss values.
## Gradient Descent
At this point, we're fairly comfortable with fitting a regression model under MSE risk (indeed, we've done it three times now!). It's important to remember, however, that the results we've found previously apply to one very specific case: the equations we used above are only relevant to a linear regression model using MSE as the cost function. In reality, we'll be working with a wide range of model types and objective functions, not all of which are as straightforward as the scenario we've discussed previously. This means that we need some more generalizable way of fitting a model to minimize loss.
To do this, we'll introduce the technique of **gradient descent**.
### Minimizing a 1D Function
Let's shift our focus away from MSE to consider some new, arbitrary cost function. You can think of this function as outputting the empirical risk associated with some parameter `theta`.
<img src="images/arbitrary.png" alt='arbitrary' width='600'>
#### The Naive Approach: Guess and Check
Above, we saw that the minimum is somewhere around 5.3ish. Let's see if we can figure out how to find the exact minimum algorithmically from scratch. One way very slow and terrible way would be manual guess-and-check.
quarto-executable-code-5450563D
```python
def arbitrary(x):
return (x**4 - 15*x**3 + 80*x**2 - 180*x + 144)/10
def simple_minimize(f, xs):
# Takes in a function f and a set of values xs.
# Calculates the value of the function f at all values x in xs
# Takes the minimum value of f(x) and returns the corresponding value x
y = [f(x) for x in xs]
return xs[np.argmin(y)]
simple_minimize(arbitrary, np.linspace(1, 7, 20))
```
#### Scipy.optimize.minimize
One way to minimize this mathematical function is to use the `scipy.optimize.minimize` function. It takes a function and a starting guess and tries to find the minimum.
quarto-executable-code-5450563D
```python
from scipy.optimize import minimize
# takes a function f and a starting point x0 and returns a readout
# with the optimal input value of x which minimizes f
minimize(arbitrary, x0 = 3.5)
```
Our choice of start point can affect the outcome. For example if we start to the left, we get stuck in the local minimum on the left side.
quarto-executable-code-5450563D
```python
minimize(arbitrary, x0 = 1)
#here we see the optimal value is different from before
```
`scipy.optimize.minimize` is great. It may also seem a bit magical. How could you write a function that can find the minimum of any mathematical function? There are a number of ways to do this, which we'll explore in today's lecture, eventually arriving at the important idea of **gradient descent**, which is the principle that `scipy.optimize.minimize` uses.
It turns out that under the hood, the `fit` method for `LinearRegression` models uses gradient descent. Gradient descent is also how much of machine learning works, including even advanced neural network models.
In Data 100, the gradient descent process will usually be invisible to us, hidden beneath an abstraction layer. However, to be good data scientists, it's important that we know the basic principles beyond the optimization functions that harness to find optimal parmaeters.
### Digging into Gradient Descent
Looking at the function across this domain, it is clear that the function's minimum value occurs around $\theta = 5.3$. Let's pretend for a moment that we *couldn't* see the full view of the cost function. How would we guess the value of $\theta$ that minimizes the function?
It turns out that the first derivative of the function can give us a clue. In the plots below, the line indicates the value of the derivative of each value of $\theta$. The derivative is negative where it is red and positive where it is green.
Say we make a guess for the minimizing value of $\theta$. Remember that we read plots from left to right, and assume that our starting $\theta$ value is to the left of the optimal $\hat{\theta}$. If the guess "undershoots" the true minimizing value – our guess for $\theta$ is not quite at the value of the $\hat{\theta}$ that truly minimizes the function – the derivative will be **negative** in value. This means that if we increase $\theta$ (move further to the right), then we **can decrease** our loss function further. If this guess "overshoots" the true minimizing value, the derivative will be positive in value, implying the converse.
<img src="images/step.png" alt='step' width='600'>
We can use this pattern to help formulate our next guess for the optimal $\hat{\theta}$. Consider the case where we've undershot $\theta$ by guessing too low of a value. We'll want our next guess to be greater in value than the previous guess – that is, we want to shift our guess to the right. You can think of this as following the slope "downhill" to the function's minimum value.
<img src="images/neg_step.png" alt='neg_step' width='600'>
If we've overshot $\hat{\theta}$ by guessing too high of a value, we'll want our next guess to be lower in value – we want to shift our guess for $\hat{\theta}$ to the left.
<img src="images/pos_step.png" alt='pos_step' width='600'>
## Gradient Descent in 1 Dimension
These observations lead us to the **gradient descent update rule**:
$$\theta^{(t+1)} = \theta^{(t)} - \alpha \frac{d}{d\theta}L(\theta^{(t)})$$
Begin with our guess for $\hat{\theta}$ at timestep $t$. To find our guess for $\hat{\theta}$ at the next timestep, $t+1$, subtract the objective function's derivative evaluted at $\theta^{(t)}$, $\frac{d}{d\theta} L(\theta^{(t)})$, scaled by a positive value $\alpha$. We've replaced the generic function $f$ with $L$ to indicate that we are minimizing loss.
Supposing that any local minima is a global minimum (see **convexity** at the end of this page):
* If our guess $\theta^{(t)}$ is to the left of $\hat{\theta}$ (undershooting), the first derivative will be negative. Subtracting a negative number from $\theta^{(t)}$ will *increase* the value of the next guess, $\theta^{(t+1)}$ and move our loss function down. The guess will shift to the right.
* If our guess $\theta^{(t)}$ was too high (overshooting $\hat{\theta}$), the first derivative will be positive. Subtracting a positive number from $\theta^{(t)}$ will *decrease* the value of the next guess, $\theta^{(t+1)}$ and move our loss function down. The guess will shift to the left.
Put together, this captures the same behavior we reasoned through above. We repeatedly update our guess for the optimal $\theta$ until we've completed a set number of updates, or until each additional update iteration does not change the value of $\theta$. In this second case, we say that gradient descent has **converged** on a solution.
The $\alpha$ term in the update rule is known as the **learning rate**. It is a positive value represents the size of each gradient descent update step – in other words, how "far" should we step to the left or right with each updated guess? A high value of $\alpha$ will lead to large differences in value between consecutive guesses for $\hat{\theta}$; a low value of $\alpha$ will result in smaller differences in value between consecutive guesses. This is the first example of a **hyperparameter**, a parameter that is hand picked by the data scientist that changes the model's behavior, in the course.
quarto-executable-code-5450563D
```python
# define the derivative of the arbitrary function we want to minimize
def derivative_arbitrary(x):
return (4*x**3 - 45*x**2 + 160*x - 180)/10
def gradient_descent(df, initial_guess, alpha, n):
"""Performs n steps of gradient descent on df using learning rate alpha starting
from initial_guess. Returns a numpy array of all guesses over time."""
guesses = [initial_guess]
current_guess = initial_guess
while len(guesses) < n:
current_guess = current_guess - alpha * df(current_guess)
guesses.append(current_guess)
return np.array(guesses)
# calling our function gives us the path that gradient descent takes for 20 steps
# with a learning rate of 0.3 starting at theta = 4
trajectory = gradient_descent(derivative_arbitrary, 4, 0.3, 20)
trajectory
```
Above, we've simply run our algorithm a fixed number of times. More sophisticated implementations will stop based on a variety of different stopping criteria, e.g. error getting too small, error getting too large, etc. We will not discuss these in our course.
### Application of 1D Gradient Descent
We’ve seen how to find the optimal parameters for a 1D linear model for the penguin dataset:
- Using the derived equations from Data 8.
- Using `sklearn`.
- Uses gradient descent under the hood!
In real practice in this course, we’ll usually use `sklearn`. But for now, let’s see how we can do the gradient descent ourselves.
Let's consider a case where we have a linear model with no offset.
$$\hat{y} = \theta_1 x$$
We want to find the parameter $\theta_1$ such that the L2 loss is minimized. In sklearn, this is easy. To avoid fitting an intercept, we set `fit_intercept` to false.
quarto-executable-code-5450563D
```python
model = LinearRegression(fit_intercept = False)
df = sns.load_dataset("tips")
model.fit(df[["total_bill"]], df["tip"])
model.coef_ # the optimal tip percentage is 14.37%
```
### Creating an Explicit MSE Function
To employ gradient descent and do this ourselves, we need to define a function upon which we can use gradient descent. Suppose we select the L2 loss as our loss function. In this case, our goal will be to minimize the mean squared error.
Let's start by writing a function that computes the MSE for a given choice of $\theta_1$ on our dataset.
quarto-executable-code-5450563D
```python
def mse_single_arg(theta1):
"""Returns the MSE on our data for the given theta1"""
x = df["total_bill"]
y_obs = df["tip"]
y_hat = theta1 * x
return np.mean((y_hat - y_obs) ** 2)
mse_single_arg(0.1437)
# The minimum loss value that we can achieve is 1.178 dollars on average away from the truth
```
### Plotting the MSE Function
Since we only have 1 parameter, we can simply cross reference our results with a simnple plot. We do not want to always do this since some functions can have thousands of inputs, making them difficult to plot. We can plot the MSE as a function of `theta1`. It turns out to look pretty smooth, and quite similar to a parabola.
quarto-executable-code-5450563D
```python
theta1s = np.linspace(0, 0.2, 200)
x = df["total_bill"]
y_obs = df["tip"]
MSEs = [mse_single_arg(theta1) for theta1 in theta1s]
plt.plot(theta1s, MSEs)
plt.xlabel(r"Choice for $\theta_1$")
plt.ylabel(r"MSE");
```
The minimum appears to be around $\theta_1 = 0.14$. We can once again check this naively.
quarto-executable-code-5450563D
```python
simple_minimize(mse_single_arg, np.linspace(0, 0.2, 21))
```
As before, what we're doing is computing all the starred values below and then returning the $\theta_1$ that goes with the minimum value.
quarto-executable-code-5450563D
```python
theta1s = np.linspace(0, 0.2, 200)
sparse_theta1s = np.linspace(0, 0.2, 21)
loss = [mse_single_arg(theta1) for theta1 in theta1s]
sparse_loss = [mse_single_arg(theta1) for theta1 in sparse_theta1s]
plt.plot(theta1s, loss)
plt.plot(sparse_theta1s, sparse_loss, 'r*')
plt.xlabel(r"Choice for $\theta_1$")
plt.ylabel(r"MSE");
```
#### Using Scipy.Optimize.minimize
quarto-executable-code-5450563D
```python
import scipy.optimize
from scipy.optimize import minimize
minimize(mse_single_arg, x0 = 0)
```
#### Using Our Gradient Descent Function
Another approach is to use our 1D gradient descent algorithm from earlier. This is the exact same function as earlier. We can run it for 100 steps and see where it ultimately ends up.
quarto-executable-code-5450563D
```python
def mse_loss_derivative_single_arg(theta_1):
"""Returns the derivative of the MSE on our data for the given theta1"""
x = df["total_bill"]
y_obs = df["tip"]
y_hat = theta_1 * x
return np.mean(2 * (y_hat - y_obs) * x)
gradient_descent(mse_loss_derivative_single_arg, 0.05, 0.0001, 100)[-5:]
```
## Multidimensional Gradient Descent
We're in good shape now: we've developed a technique to find the minimum value of a more complex objective function.
The function we worked with above was one-dimensional – we were only minimizing the function with respect to a single parameter, $\theta$. However, as we've seen before, we often need to optimize a cost function with respect to several parameters (for example, when selecting the best model parameters for multiple linear regression). We'll need to extend our gradient descent rule to *multidimensional* objective functions.
Now suppose we improve our model so that we want to predict the tip from the total_bill plus a constant offset, in other words:
$$\textrm{tip} = \theta_0 + \theta_1 \textrm{bill}$$
To put this in more concrete terms what this means, let's return to the familiar case of simple linear regression with MSE loss.
$$\text{MSE}(\theta_0,\:\theta_1) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \theta_0 - \theta_1 x)^2$$
Now, loss is expressed in terms of *two* parameters, $\theta_0$ and $\theta_1$. Rather than a one-dimensional loss function as we had above, we are now dealing with a two-dimensional **loss surface**.
quarto-executable-code-5450563D
```python
# This code is for illustration purposes only
# It contains a lot of syntax you have not seen
import plotly.graph_objects as go
model = LinearRegression(fit_intercept = False)
df = sns.load_dataset("tips")
df["bias"] = 1
model.fit(df[["bias","total_bill"]], df["tip"])
model.coef_
uvalues = np.linspace(0, 2, 10)
vvalues = np.linspace(0, 0.2, 10)
(u,v) = np.meshgrid(uvalues, vvalues)
thetas = np.vstack((u.flatten(),v.flatten()))
X = df[["bias","total_bill"]].to_numpy()
Y = df["tip"].to_numpy()
def mse_loss_single_arg(theta):
return mse_loss(theta, X, Y)
def mse_loss(theta, X, y_obs):
y_hat = X @ theta
return np.mean((y_hat - Y) ** 2)
MSE = np.array([mse_loss_single_arg(t) for t in thetas.T])
loss_surface = go.Surface(x=u, y=v, z=np.reshape(MSE, u.shape))
ind = np.argmin(MSE)
optimal_point = go.Scatter3d(name = "Optimal Point",
x = [thetas.T[ind,0]], y = [thetas.T[ind,1]],
z = [MSE[ind]],
marker=dict(size=10, color="red"))
fig = go.Figure(data=[loss_surface, optimal_point])
fig.update_layout(scene = dict(
xaxis_title = "theta0",
yaxis_title = "theta1",
zaxis_title = "MSE"))
fig.show()
```
Though our objective function looks a little different, we can use the same principles as we did earlier to locate the optimal model parameters. Notice how the minimum value of MSE, marked by the red dot in the plot above, occurs in the "valley" of the loss surface. Like before, we want our guesses for the best pair of $(\theta_0,\:\theta_1)$ to move "downhill" towards this minimum point.
The difference now is that we need to update guesses for *both* $\theta_0$ and $\theta_1$ that minimize a loss function $L(\theta, \mathbb{X}, \mathbb{Y})$:
$$\theta_0^{(t+1)} = \theta_0^{(t)} - \alpha \frac{\partial L}{\partial \theta_0}\Bigm\vert_{\theta=\theta^{(t)}} \qquad \qquad \theta_1^{(t+1)} = \theta_1^{(t)} - \alpha \frac{\partial L}{\partial \theta_1}\Bigm\vert_{\theta=\theta^{(t)}}$$
We can tidy this statement up by using vector notation:
$$\begin{bmatrix}
\theta_{0}^{(t+1)} \\
\theta_{1}^{(t+1)} \\
\end{bmatrix}
=
\begin{bmatrix}
\theta_{0}^{(t)} \\
\theta_{1}^{(t)} \\
\end{bmatrix}
- \alpha
\begin{bmatrix}
\frac{\partial L}{\partial \theta_{0}}\vert_{\theta=\theta^{(t)}} \\
\frac{\partial L}{\partial \theta_{1}}\vert_{\theta=\theta^{(t)}} \\
\end{bmatrix}
$$
To save ourselves from writing out long column vectors, we'll introduce some new notation. $\vec{\theta}^{(t)}$ is a column vector of guesses for each model parameter $\theta_i$ at timestep $t$. We call $\nabla_{\vec{\theta}} L$ the **gradient vector.** In plain English, it means "take the derivative of loss, $L(\theta, \mathbb{X}, \mathbb{Y})$, with respect to each model parameter in $\vec{\theta}$, and evaluate it at the given $\theta = \theta^{(t)}$."
$$\vec{\theta}^{(t+1)}
= \vec{\theta}^{(t)} - \alpha \nabla_{\vec{\theta}} L(\theta^{(t)}, \mathbb{X}, \mathbb{Y})
$$
### Gradient Notation
Consider the 2D function: $$f(\theta_0, \theta_1) = 8\theta_0^2 + 3\theta_0\theta_1$$
For a function of 2 variables, $f(\theta_0, \theta_1)$ we define the gradient as $\nabla_\theta f = \frac{\partial f}{\partial \theta_0} \vec{i} + \frac{\partial f}{\partial \theta_1} \vec{j}$, where $\vec{i}$ and $\vec{j}$ are the unit vectors in the $\theta_0$ and $\theta_1$ directions.
$$\frac{\partial f}{\partial \theta_0} = 16\theta_0 + 3\theta_1$$
$$\frac{\partial f}{\partial \theta_1} = 3\theta_0$$
$$\nabla_\theta f = (16\theta_0 + 3\theta_1)\vec{i} + (3\theta_0)\vec{j}$$
We can also write it in column vector notation.
$$\nabla_\theta f = \begin{bmatrix} \frac{\partial f}{\partial \theta_0} \\ \frac{\partial f}{\partial \theta_1} \\... \end{bmatrix}$$
EX: $\nabla_\theta f = \begin{bmatrix}16\theta_0 + 3\theta_1 \\ 3\theta_0\end{bmatrix}$
You should read these gradients as:
- $\frac{\partial f}{\partial \theta_0}$ : If I nudge the 1st model weight, what happens to loss?
- $\frac{\partial f}{\partial \theta_1}$ :If I nudge the 2nd model weight, what happens to loss?
- etc.
### Visualizing Gradient Descent
First, we need to be able to easily determine the gradient for any pair of values, $\theta_0, \theta_1$.
quarto-executable-code-5450563D
```python
tips_with_bias = df.copy()
tips_with_bias["bias"] = 1
X = tips_with_bias[["bias", "total_bill"]]
X.head(5)
def mse_gradient(theta, X, y_obs):
"""Returns the gradient of the MSE on our data for the given theta"""
x0 = X.iloc[:, 0]
x1 = X.iloc[:, 1]
dth0 = np.mean(-2 * (y_obs - theta[0]*x0 - theta[1]*x1) * x0)
dth1 = np.mean(-2 * (y_obs - theta[0]*x0 - theta[1]*x1) * x1)
return np.array([dth0, dth1])
def mse_gradient_single_arg(theta):
"""Returns the gradient of the MSE on our data for the given theta"""
X = tips_with_bias[["bias", "total_bill"]]
y_obs = tips_with_bias["tip"]
return mse_gradient(theta, X, y_obs)
X = tips_with_bias[["bias", "total_bill"]]
y_obs = tips_with_bias["tip"]
ex1_mse = mse_gradient(np.array([0, 0]), X, y_obs)
print(f"Gradient for values theta0 = 0 and theta1 = 0 : {ex1_mse}")
```
Using our previously previously defined `gradient_descent` function, we can see if our intuition extends to higher dimensions.
quarto-executable-code-5450563D
```python
#print out the last 10 guesses our algorithm outputs to save space
guesses = gradient_descent(mse_gradient_single_arg, np.array([0, 0]), 0.001, 10000)[-10:]
```
## Mini-Batch Gradient Decsent and Stochastic Gradient Descent
Formally, the algorithm we derived above is called **batch gradient descent.** For each iteration of the algorithm, the derivative of loss is computed across the entire batch of available data. While this update rule works well in theory, it is not practical in all circumstances. For large datasets (with perhaps billions of data points), finding the gradient across all the data is incredibly computationally taxing.
**Mini-batch gradient descent** tries to address this issue. In mini-batch descent, only a subset of the data is used to compute an estimate of the gradient. For example, we might consider only 10% of the total data at each gradient descent update step. At the next iteration, a different 10% of the data is sampled to perform the following update. Once the entire dataset has been used, the process is repeated. Each complete "pass" through the data is known as a **training epoch**. In practice, we choose the mini-batch size to be $32$.
In the extreme case, we might choose a batch size of only 1 data point – that is, a single data point is used to estimate the gradient of loss with each update step. This is known as **stochastic gradient descent**.
Batch gradient descent is a deterministic technique – because the entire dataset is used at each update iteration, the algorithm will always advance towards the minimum of the loss surface. In contrast, both mini-batch and stochastic gradient descent involve an element of randomness. Since only a subset of the full data is used to update the guess for $\vec{\theta}$ at each iteration, there's a chance the algorithm will not progress towards the true minimum of loss with each update. Over the longer term, these stochastic techniques should still converge towards the optimal solution.
The diagrams below represent a "bird's eye view" of a loss surface from above. Notice that batch gradient descent takes a direct path towards the optimal $\hat{\theta}$. Stochastic gradient descent, in contrast, "hops around" on its path to the minimum point on the loss surface. This reflects the randomness of the sampling process at each update step.
<img src="images/stochastic.png" alt='stochastic' width='600'>
## Convexity
In our analysis above, we focused our attention on the global minimum of the loss function. You may be wondering: what about the local minimum just to the left?
If we had chosen a different starting guess for $\theta$, or a different value for the learning rate $\alpha$, we may have converged on the local minimum, rather than on the true optimum value of loss.
<img src="images/local.png" alt='local' width='600'>
If the loss function is **convex**, gradient descent is guaranteed to find the global minimum of the objective function. Formally, a function $f$ is convex if:
$$tf(a) + (1-t)f(b) \geq f(ta + (1-t)b)$$
To put this into words: if you drew a line between any two points on the curve, all values on the curve must be *on or below* the line. Importantly, any local minimum of a convex function is also its global minimum.
<img src="images/convex.png" alt='convex' width='600'>
In summary, non-convex loss functions can cause problems with optimization. This means that our choice of loss function is an key factor in our modeling process. It turns out that MSE *is* convex, which is a major reason why it is such a popular choice of loss function.