Lecture 18: Gradient Descent

by Josh Hug (Fall 2019)

Minimizing an Arbitrary 1D Function

Visually, we can see above that the minimum is somewhere around 5.3ish. Let's see if we can figure out how to find the exact minimum algorithmically.
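For concreteness, the examples below use the stand-in function sketched here. The specific quartic is an assumption on our part, chosen so that its global minimum lands near $x \approx 5.33$, consistent with the plot:

```python
import numpy as np

def arbitrary(x):
    """A stand-in quartic with its global minimum near x ≈ 5.33.

    NOTE: this specific function is an assumption; any smooth function
    with a similar shape would illustrate the same ideas.
    """
    return (x**4 - 15 * x**3 + 80 * x**2 - 180 * x + 144) / 10
```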

One very slow and terrible way would be manual guess-and-check.

A somewhat better approach is to use brute force to try out a bunch of x values and return the one that yields the lowest loss.
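A minimal sketch of this brute-force search, assuming the `arbitrary` function above and a guess range of $[1, 7]$ (both assumptions):

```python
def simple_minimize(f, xs):
    """Return the candidate x that yields the smallest value of f(x)."""
    losses = [f(x) for x in xs]
    return xs[np.argmin(losses)]

# 20 evenly spaced guesses between 1 and 7 (the range is an assumption).
guesses = np.linspace(1, 7, 20)
best_x = simple_minimize(arbitrary, guesses)
```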

This process is essentially the same as before, when we made a graphical plot; the difference is that we're only looking at 20 selected points.

This basic approach is incredibly inefficient, and suffers from two major flaws:

  1. If the minimum is outside our range of guesses, the answer will be completely wrong.
  2. Even if our range of guesses is correct, if the guesses are too coarse, our answer will be inaccurate.

Better Approach: Gradient Descent

Instead of choosing all of our guesses ahead of time, we can start from a single guess and try to iteratively improve on it.

The key insight is this: If the derivative of the function is negative, that means the function is decreasing, so we should go to the right (i.e. pick a bigger x). If the derivative of the function is positive, that means the function is increasing, so we should go to the left (i.e. pick a smaller x).

Thus, the derivative tells us which way to go.

Armed with this knowledge, let's try to see if we can use the derivative to optimize the function.
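For our assumed stand-in quartic, we can write out the derivative by hand (again, this is tied to the function we assumed above):

```python
def derivative_arbitrary(x):
    """Hand-computed derivative of the stand-in quartic defined earlier."""
    return (4 * x**3 - 45 * x**2 + 160 * x - 180) / 10
```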

Written as a recurrence relation, the process we've described above is:

$$ x^{(t+1)} = x^{(t)} - 0.3 \frac{d}{dx} f(x^{(t)}) $$
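For example, starting from $x^{(t)} = 4$ on our assumed stand-in quartic (where $f'(4) = -0.4$), the next guess is $x^{(t+1)} = 4 - 0.3 \times (-0.4) = 4.12$: a small step to the right, exactly as the negative derivative dictates.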

This algorithm is also known as "gradient descent".

Given a current $x$, gradient descent creates its next guess for $x$ based on the sign and magnitude of the derivative.

Our choice of 0.3 above was totally arbitrary. Naturally, we can generalize by replacing it with a parameter, typically represented by $\alpha$, and often called the "learning rate".

$$ x^{(t+1)} = x^{(t)} - \alpha \frac{d}{dx} f(x^{(t)}) $$

We can also write up this procedure in code. A minimal sketch, assuming the `derivative_arbitrary` function defined above (function and variable names are our own):
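```python
def gradient_descent(df, initial_guess, alpha, n):
    """Run n steps of gradient descent on a 1D function.

    df: derivative of the function being minimized
    initial_guess: starting value of x
    alpha: learning rate
    n: number of iterations (a fixed count, per the discussion below)

    Returns every guess made, so we can visualize the trajectory.
    """
    guesses = [initial_guess]
    for _ in range(n):
        guesses.append(guesses[-1] - alpha * df(guesses[-1]))
    return np.array(guesses)

# For example, 20 steps from x = 4 with alpha = 0.3
# (the starting point and step count are assumptions):
trajectory = gradient_descent(derivative_arbitrary, 4, 0.3, 20)
```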

Below, we see a visualization of the trajectory taken by this algorithm.
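One way to produce such a plot, sketched with matplotlib (the styling choices are our own):

```python
import matplotlib.pyplot as plt

xs = np.linspace(1, 7, 200)
plt.plot(xs, arbitrary(xs), label="f(x)")
# Overlay the sequence of guesses computed above.
plt.plot(trajectory, arbitrary(trajectory), "o-", color="red", label="trajectory")
plt.xlabel("x")
plt.ylabel("f(x)")
plt.legend()
plt.show()
```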

Above, we've simply run our algorithm a fixed number of times. More sophisticated implementations stop based on a variety of criteria, e.g. the error getting sufficiently small, or the error starting to grow (a sign the algorithm is diverging), etc. We will not discuss these in our course.

In the next part, we'll return to the world of data science and see how this procedure might be useful for optimizing models.

We'll continue where we left off earlier. We'll see 5 different ways of computing the parameters of a 1D, then a 2D, linear model. These five techniques will be:

  1. Brute Force
  2. Closed Form Solutions
  3. Gradient Descent
  4. scipy.optimize.minimize
  5. sklearn.linear_model.LinearRegression

Linear Regression With No Offset

Let's consider a case where we have a linear model with no offset. That is, we want to find the parameter $\gamma$ such that the L2 loss is minimized.

We'll use a one-parameter model where the predicted output is $\hat{\gamma}$ times the x value, i.e. $\hat{y} = \hat{\gamma} x$. For example, if $\hat{\gamma} = 0.1$, we are making the prediction line $\hat{y} = 0.1x$, shown below.

Suppose we select the L2 loss as our loss function. In this case, our goal will be to minimize the mean squared error (MSE).

Let's start by writing a function that computes the MSE for a given choice of $\gamma$ on our dataset.
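A sketch of such a function, using a toy dataset since the course's actual data isn't shown here (both the data values and the names are assumptions):

```python
# Toy dataset for illustration only; the real dataset is an assumption here.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

def mse_loss(gamma, x, y):
    """Mean squared error of the no-offset model y_hat = gamma * x."""
    y_hat = gamma * x
    return np.mean((y - y_hat) ** 2)
```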

Our goal is to find the $\hat{\gamma}$ with minimum MSE.

Approach 2: Closed Form Solutions

On HW5 problem 3, you used calculus to derive that the optimal answer is:

$$\hat{\gamma} = \frac{\sum(x_i y_i)}{\sum(x_i^2)}$$
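In brief, the derivation amounts to setting the derivative of the MSE with respect to $\gamma$ to zero:

$$ \frac{d}{d\gamma} \frac{1}{n} \sum_i (y_i - \gamma x_i)^2 = -\frac{2}{n} \sum_i x_i (y_i - \gamma x_i) = 0 \quad \Longrightarrow \quad \hat{\gamma} = \frac{\sum_i x_i y_i}{\sum_i x_i^2} $$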

We can calculate this value below.
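A minimal sketch, reusing the toy `x` and `y` arrays from the MSE example above:

```python
# Closed-form optimum for the no-offset model.
gamma_hat = np.sum(x * y) / np.sum(x**2)
print(gamma_hat)
```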