Lecture 17: Gradient Descent – Data 100, Fall 2020¶

by Josh Hug (Fall 2019)

Minimizing an Arbitrary 1D Function¶

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Configure nice plotting defaults - (this must be done in a cell separate
# from %matplotlib call)
plt.style.use('seaborn')
sns.set_context('talk', font_scale=1.4)
plt.rcParams['figure.figsize'] = (10, 7)

def arbitrary(x):
    return (x**4 - 15*x**3 + 80*x**2 - 180*x + 144)/10

x = np.linspace(1, 7, 100)
plt.plot(x, arbitrary(x))
axes = plt.gca()
axes.set_ylim([-1, 3]);
# plt.savefig("fx4.png", dpi = 300, bbox_inches = "tight")

Visually, we can see above that the minimum is somewhere around 5.3ish. Let's see if we can figure out how to find the exact minimum algorithmically.

One way very slow and terrible way would be manual guess-and-check.

arbitrary(5.4)

-0.6854400000000282

A somewhat better approach is to use brute force to try out a bunch of x values and return the one that yields the lowest loss.

def simple_minimize(f, xs):
    y = [f(x) for x in xs]
    return xs[np.argmin(y)]

simple_minimize(arbitrary, np.linspace(1, 7, 20))

5.421052631578947

This process is essentially the same as before where we made a graphical plot, it's just that we're only looking at 20 selected points.

xs = np.linspace(1, 7, 200)
sparse_xs = np.linspace(1, 7, 21)

ys = arbitrary(xs)
sparse_ys = arbitrary(sparse_xs)

plt.plot(xs, ys, label = "f")
plt.plot(sparse_xs, sparse_ys, 'r*', label = "f");
# plt.savefig("f_brute_force.png", dpi=300, bbox_inches = "tight")

This basic approach is incredibly inefficient, and suffers from two major flaws:

If the minimum is outside our range of guesses, the answer will be completely wrong.
Even if our range of guesses is correct, if the guesses are too coarse, our answer will be inaccurate.

Better Approach: Gradient Descent¶

Instead of choosing all of our guesses ahead of time, we can instead start from a single guess and try to iteratively improve on our choice.

They key insight is this: If the derivative of the function is negative, that means the function is decreasing, so we should go to the right (i.e. pick a bigger x). If the derivative of the function is positive, that means the function is increasing, so we should go to the left (i.e. pick a smaller x).

Thus, the derivative tells us which way to go.

#desmos demo: https://www.desmos.com/calculator/twpnylu4lr

def derivative_arbitrary(x):
    return (4*x**3 - 45*x**2 + 160*x - 180)/10

def line(x):
    return (0*x)
plt.plot(x, arbitrary(x))
plt.plot(x, derivative_arbitrary(x), '--')
plt.plot(x, line(x), '.')

plt.legend(['f(x)', 'df(x)'])
axes = plt.gca()
axes.set_ylim([-1, 3]);

def plot_arbitrary():
    x = np.linspace(1, 7, 100)
    plt.plot(x, arbitrary(x))
    axes = plt.gca()
    axes.set_ylim([-1, 3])

def plot_x_on_f(f, x):
    y = f(x)
    default_args = dict(label=r'$ \theta $', zorder=2,
                        s=200, c=sns.xkcd_rgb['green'])
    plt.scatter([x], [y], **default_args)

def plot_x_on_f_empty(f, x):
    y = f(x)
    default_args = dict(label=r'$ \theta $', zorder=2,
                        s=200, c = 'none', edgecolor=sns.xkcd_rgb['green'])
    plt.scatter([x], [y], **default_args)

def plot_tangent_on_f(f, x, eps=1e-6):
    slope = ((f(x + eps) - f(x - eps))
             / (2 * eps))
    xs = np.arange(x - 1, x + 1, 0.05)
    ys = f(x) + slope * (xs - x)
    plt.plot(xs, ys, zorder=3, c=sns.xkcd_rgb['green'], linestyle='--')

plot_arbitrary()
plot_x_on_f(arbitrary, 2)
plot_tangent_on_f(arbitrary, 2)
plt.xlabel('x')
plt.ylabel('f(x)');
#plt.savefig('dfx_1.png')

plot_arbitrary()
plot_x_on_f(arbitrary, 6)
plot_tangent_on_f(arbitrary, 6)
#plt.savefig('dfx_2.png')

Armed with this knowledge, let's try to see if we can use the derivative to optimize the function.

derivative_arbitrary(4)

-0.4

derivative_arbitrary(4 + 0.4)

-0.6464000000000055

derivative_arbitrary(4 + 0.4 + 0.64)

-0.45757440000001

derivative_arbitrary(4 + 0.4 + 0.64 + 0.457)

0.4166188892000037

derivative_arbitrary(4)

-0.4

derivative_arbitrary(4 + 0.4)

-0.6464000000000055

derivative_arbitrary(4 + 0.4 + 0.64)

-0.45757440000001

derivative_arbitrary(4 + 0.4 + 0.64 + 0.46)

0.425

def plot_one_step(x):
    new_x = x - derivative_arbitrary(x)
    plot_arbitrary()
    plot_x_on_f(arbitrary, new_x)
    plot_x_on_f_empty(arbitrary, x)
    print(f'old x: {x}')
    print(f'new x: {new_x}')

plot_one_step( 5.48)

old x: 5.48
new x: 5.1101631999999935

def plot_one_step_better(x):
    new_x = x - 0.3 * derivative_arbitrary(x)
    plot_arbitrary()
    plot_x_on_f(arbitrary, new_x)
    plot_x_on_f_empty(arbitrary, x)
    print(f'old x: {x}')
    print(f'new x: {new_x}')

plot_one_step_better(5.32)

old x: 5.32
new x: 5.32398784

Written as a recurrence relation, the process we've described above is:

$$ x^{(t+1)} = x^{(t)} - 0.3 \frac{d}{dx} f(x) $$

This algorithm is also known as "gradient descent".

Given a current $\gamma$, gradient descent creates its next guess for $\gamma$ based on the sign and magnitude of the derivative.

Our choice of 0.3 above was totally arbitrary. Naturally, we can generalize by replacing it with a parameter, typically represented by $\alpha$, and often called the "learning rate".

$$ x^{(t+1)} = x^{(t)} - \alpha \frac{d}{dx} f(x) $$

We can also write up this procedure in code as given below:

def gradient_descent(df, initial_guess, alpha, n):
    guesses = [initial_guess]
    guess = initial_guess
    while len(guesses) < n:
        guess = guess - alpha * df(guess)
        guesses.append(guess)
    return np.array(guesses)

trajectory = gradient_descent(derivative_arbitrary, 4, 1.5, 20)
trajectory

array([4.        , 4.6       , 5.6284    , 4.39841721, 5.36714058,
       5.23345463, 5.50301118, 4.85283564, 5.77676546, 3.72284331,
       3.96846217, 4.53018801, 5.55049763, 4.6927978 , 5.70865641,
       4.05220573, 4.7135408 , 5.72270104, 3.98700301, 4.57133193])

Below, we see a visualization of the trajectory taken by this algorithm.

trajectory = gradient_descent(derivative_arbitrary, 1, 0.9, 20)
plot_arbitrary()
plt.plot(trajectory, arbitrary(trajectory));

Above, we've simply run our algorithm a fixed number of times. More sophisticated implementations will stop based on a variety of different stopping criteria, e.g. error getting too small, error getting too large, etc. We will not discuss these in our course.

In the next part, we'll return to the world of data science and see how this procedure might be useful for optimizing models.

plt.rcParams['figure.figsize'] = (4, 4)
plt.rcParams['figure.dpi'] = 150
plt.rcParams['lines.linewidth'] = 3
sns.set()

We'll continue where we left off earlier. We'll see 5 different ways of computing parameters for a 1D, then 2D linear model. These five techniques will be:

Brute Force
Closed Form Solutions
Gradient Descent
scipy.optimize.minimize
sklean.linear_model.LinearRegression

Linear Regression With No Offset¶

Let's consider a case where we have a linear model with no offset. That is, we want to find the parameter $\gamma$ such that the L2 loss is minimized.

tips = sns.load_dataset("tips")

sns.scatterplot(tips["total_bill"], tips["tip"]);

We'll use a one parameter model that the output is $\hat{\gamma}$ times the x value. For example if $\hat{\gamma} = 0.1$, then $\hat{y} = \hat{\gamma} x$, and we are making the prediction line below.

sns.scatterplot(tips["total_bill"], tips["tip"])
x = tips["total_bill"]
y_hat = 0.1 * x
plt.plot(x, y_hat, 'r')
plt.legend(['$\hat{y}$', '$y$']);
#plt.savefig("tip_vs_total_bill.png", dpi=300)

Suppose we select the L2 loss as our loss function. In this case, our goal will be to minimize the mean squared error.

Let's start by writing a function that computes the MSE for a given choice of $\gamma$ on our dataset.

def mse_loss(gamma, x, y_obs):
    y_hat = gamma * x
    return np.mean((y_hat - y_obs) ** 2)

x = tips["total_bill"]
y_obs = tips["tip"]
mse_loss(0.1, x, y_obs)

2.077768372950818

Our goal is to find the $\hat{\gamma}$ with minimum MSE.

Approach 1: Closed Form Solutions¶

On HW5 problem 3, you derived using calculus that the optimal answer is:

$$\hat{\gamma} = \frac{\sum(x_i y_i)}{\sum(x_i^2)}$$

We can calculate this value below.

np.sum(tips["tip"] * tips["total_bill"])/np.sum(tips["total_bill"]**2)

0.14373189527721666

Alternately, we can use the generic equation for linear regression that we derived in lecture using the definition of orthogonality.

X = tips[["total_bill"]]
y = tips["tip"]
np.linalg.inv(X.T @ X) @ X.T @ y

0    0.143732
dtype: float64

Optimization Approach #2A: Plotting the MSE vs. $\hat{\gamma}$¶

Since x and y_obs are fixed, the only variable is gamma.

For clarity, let's define a python function that returns the MSE as a function of a single argument gamma.

def mse_single_arg(gamma):
    """Returns the MSE on our data for the given gamma"""
    x = tips["total_bill"]
    y_obs = tips["tip"]
    y_hat = gamma * x
    return mse_loss(gamma, x, y_obs)

mse_single_arg(0.1)

2.077768372950818

Thus we can plot the MSE as a function of gamma. It turns out to look pretty smooth, and quite similar to a parabola.

gammas = np.linspace(0, 0.2, 200)
x = tips["total_bill"]
y_obs = tips["tip"]

MSEs = [mse_single_arg(gamma) for gamma in gammas]

plt.plot(gammas, MSEs)
plt.xlabel(r"Choice for $\hat{\gamma}$")
plt.ylabel(r"MSE");
#plt.savefig("tips_MSE_vs_gamma.png", dpi=300, bbox_inches = "tight")

The minimum appears to be around $\hat{\gamma} = 0.14$.

Approach 2B: Brute Force¶

Recall our simple_minimize function from earlier, redefined below for your convenience.

def simple_minimize(f, xs):
    y = [f(x) for x in xs]
    return xs[np.argmin(y)]

simple_minimize(mse_single_arg, np.linspace(0, 0.2, 21))

0.14

As before, what we're doing is computing all the starred values below and then returning the $\hat{\theta}$ that goes with the minimum value.

gammas = np.linspace(0, 0.2, 200)
sparse_gammas = np.linspace(0, 0.2, 21)

loss = [mse_single_arg(gamma) for gamma in gammas]
sparse_loss = [mse_single_arg(gamma) for gamma in sparse_gammas]

plt.plot(gammas, loss)
plt.plot(sparse_gammas, sparse_loss, 'r*')
plt.xlabel(r"Choice for $\hat{\gamma}$")
plt.ylabel(r"MSE");
#plt.savefig("tips_brute_force.png", dpi=300, bbox_inches = "tight")

Approach 3: Use Gradient Descent¶

Another approach is to use our 1D gradient descent algorithm from earlier.

def gradient_descent(df, initial_guess, alpha, n):
    guesses = [initial_guess]
    guess = initial_guess
    while len(guesses) < n:
        guess = guess - alpha * df(guess)
        guesses.append(guess)
    return np.array(guesses)

To use this function, we need to compute the derivative of the MSE. The MSE is repeated below for convenience.

def mse_loss(gamma, x, y_obs):
    y_hat = gamma * x
    return np.mean((y_hat - y_obs) ** 2)

def mse_loss_derivative(gamma, x, y_obs):
    y_hat = gamma * x
    return np.mean(2 * (y_hat - y_obs) * x)

#def squared_loss_derivative(y_hat, y_obs, x):
#    """Returns the derivative of the squared loss for a single prediction"""
#    return 2*(y_hat - y_obs)*x
#    
#def mse_derivative(y_hat, y_obs, x):
#    """Returns the derivative of the MSE"""
#    return np.mean(squared_loss_derivative(y_hat, y_obs, x))

We can try out different values of gamma and see what we get back as our derivative.

gamma = 0.1
x = tips["total_bill"]
y_obs = tips["tip"]

mse_loss_derivative(gamma, x, y_obs)

-41.14398663934429

Just like our mse_of_gamma, we can write a function that returns the derivative of the MSE as a function of a single argument gamma.

def mse_loss_derivative_single_arg(gamma):
    x = tips["total_bill"]
    y_obs = tips["tip"]

    return mse_loss_derivative(gamma, x, y_obs)

mse_loss_derivative_single_arg(0.1)

-41.14398663934429

gradient_descent(mse_loss_derivative_single_arg, 0.05, 0.0001, 100)

array([0.05      , 0.05881852, 0.06680736, 0.0740446 , 0.08060095,
       0.08654045, 0.09192116, 0.09679563, 0.10121151, 0.10521192,
       0.10883597, 0.11211906, 0.11509327, 0.11778766, 0.12022855,
       0.1224398 , 0.12444301, 0.12625776, 0.12790176, 0.1293911 ,
       0.13074031, 0.13196259, 0.13306988, 0.13407298, 0.13498172,
       0.13580495, 0.13655074, 0.13722636, 0.13783841, 0.13839289,
       0.13889519, 0.13935024, 0.13976248, 0.14013593, 0.14047425,
       0.14078073, 0.14105839, 0.14130992, 0.14153778, 0.14174421,
       0.14193122, 0.14210063, 0.1422541 , 0.14239314, 0.14251909,
       0.14263319, 0.14273656, 0.1428302 , 0.14291504, 0.14299189,
       0.14306151, 0.14312458, 0.14318172, 0.14323348, 0.14328037,
       0.14332285, 0.14336134, 0.1433962 , 0.14342778, 0.14345639,
       0.14348231, 0.1435058 , 0.14352707, 0.14354634, 0.1435638 ,
       0.14357961, 0.14359394, 0.14360692, 0.14361868, 0.14362933,
       0.14363898, 0.14364772, 0.14365564, 0.14366281, 0.14366931,
       0.1436752 , 0.14368053, 0.14368537, 0.14368974, 0.14369371,
       0.1436973 , 0.14370056, 0.14370351, 0.14370618, 0.1437086 ,
       0.14371079, 0.14371277, 0.14371457, 0.1437162 , 0.14371768,
       0.14371902, 0.14372023, 0.14372133, 0.14372232, 0.14372322,
       0.14372404, 0.14372478, 0.14372545, 0.14372605, 0.1437266 ])

In the context of minimizing loss, we can write out the gradient descent rule for generating the next $\gamma$ as:

$$ \gamma^{(t+1)} = \gamma^{(t)} - \alpha \frac{\partial}{\partial \gamma} L(\gamma^{(t)}, \Bbb{X}, \vec{\hat{y}}) $$

Here $L$ is our chosen loss function, $\Bbb{X}$ is our design matrix, and $\vec{\hat{y}}$ are our observations. During the gradient descent algorithm, we treat $\Bbb{X}$ and $\vec{\hat{y}}$ as constants.

Approach 4: scipy.optimize.minimize¶

We can also use scipy.optimize.minimize.

import scipy.optimize
from scipy.optimize import minimize
minimize(mse_single_arg, x0 = 0)

      fun: 1.1781161154513213
 hess_inv: array([[1]])
      jac: array([4.24683094e-06])
  message: 'Optimization terminated successfully.'
     nfev: 9
      nit: 1
     njev: 3
   status: 0
  success: True
        x: array([0.14373189])

A natural question that arises: How does scipy.optimize.minimize work? We won't discuss the exact algorithm used by the code (see this wikipedia page about the BFGS algorithm if you're curious, though).

Gradient descent is related to BFGS, though generally doesn't work as well. Comparison of numerical optimization algorithms is very far beyond the scope of our course.

Approach 5: sklearn.linear_model.LinearRegression¶

We can also go one level of abstraction higher and simply fit a linear model using sklearn.

import sklearn.linear_model
from sklearn.linear_model import LinearRegression
model = LinearRegression(fit_intercept = False)

X = tips[["total_bill"]]
y = tips["tip"]
model.fit(X, y)

LinearRegression(copy_X=True, fit_intercept=False, n_jobs=None, normalize=False)

model.coef_

array([0.1437319])

Multi Dimensional Models¶

#tips datset
data = sns.load_dataset("tips")
data.head()

Now suppose we improve our model so that we want to predict the tip from the total_bill plus a constant offset, in other words:

$$\textrm{tip} = \hat{\theta}_0 + \hat{\theta}_1 \textrm{bill}$$

To keep things in the simple framework from lecture, we can create a design matrix with the bias vector in one column and the total_bill in the other.

tips_with_bias = tips.copy()
tips_with_bias["bias"] = 1
X = tips_with_bias[["total_bill", "bias"]]
X.head(5)

Now, we can give our predictions as $$\vec{\hat{y}} = f_{\vec{\hat{\theta}}}(\Bbb{X}) = \Bbb{X} \vec{\hat{\theta}}$$

For example, the predictions below correspond to assuming every table leaves a tip of \$1.50 plus 5% of their total bill.

X @ np.array([0.05, 1.50])

0      2.3495
1      2.0170
2      2.5505
3      2.6840
4      2.7295
        ...
239    2.9515
240    2.8590
241    2.6335
242    2.3910
243    2.4390
Length: 244, dtype: float64

Throughout this problem, we'll assume we want to minimize the mean squared error of our predictions, i.e.

def mse_loss(theta, X, y_obs):
    y_hat = X @ theta
    return np.mean((y_hat - y_obs) ** 2)

For example, the loss assuming the model described above is:

mse_loss(np.array([0.05, 1.50]), X, y_obs)

1.534052175204919

Approach 1: Closed Form Solution¶

y_obs

0      1.01
1      1.66
2      3.50
3      3.31
4      3.61
       ...
239    5.92
240    2.00
241    2.00
242    1.75
243    3.00
Name: tip, Length: 244, dtype: float64

X = tips_with_bias[["total_bill", "bias"]]
X.head(5)

y_obs = tips["tip"]
y_obs.head(5)

0    1.01
1    1.66
2    3.50
3    3.31
4    3.61
Name: tip, dtype: float64

theta_hat = np.linalg.inv(X.T @ X) @ X.T @ y_obs
theta_hat

0    0.105025
1    0.920270
dtype: float64

#Note: It's generally a better idea to use np.linalg.solve.
#np.linalg.inv is slow and can sometimes return incorrect results due to rounding issues.
#np.linalg.solve is faster and generally better behaved.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y_obs)

theta_hat

0    0.105025
1    0.920270
dtype: float64

Approach 2: Brute Force¶

As before, we can simply try out a bunch of theta values and see which works best. In this case, since we have a 2D theta, there's a much bigger space of possible values to try.

def mse_loss(theta, X, y_obs):
    y_hat = X @ theta
    return np.mean((y_hat - y_obs) ** 2)

As before, it's convenient to first create an mse function of a single argument.

X = tips_with_bias[["total_bill", "bias"]]
y_obs = tips["tip"]

def mse_loss_single_arg(theta):
    return mse_loss(theta, X, y_obs)

mse_loss_single_arg([0.01, 0.02])

9.479444821434424

Using this function, we can create a 3D plot. This uses lots of syntax you've never seen.

import plotly.graph_objects as go

uvalues = np.linspace(0, 0.2, 10)#)1)
vvalues = np.linspace(0, 2, 10)#)1)
(u,v) = np.meshgrid(uvalues, vvalues)
thetas = np.vstack((u.flatten(),v.flatten()))

MSE = np.array([mse_loss_single_arg(t) for t in thetas.T])

loss_surface = go.Surface(x=u, y=v, z=np.reshape(MSE, u.shape))

ind = np.argmin(MSE)
optimal_point = go.Scatter3d(name = "Optimal Point",
    x = [thetas.T[ind,0]], y = [thetas.T[ind,1]],
    z = [MSE[ind]],
    marker=dict(size=10, color="red"))

fig = go.Figure(data=[loss_surface, optimal_point])
fig.update_layout(scene = dict(
    xaxis_title = "theta0",
    yaxis_title = "theta1",
    zaxis_title = "MSE"))
fig.show()

Approach 3: Gradient Descent¶

Gradient descent is exactly like it was before, only now our gradient is somewhat more complicated.

def mse_gradient(theta, X, y_obs):
    """Returns the gradient of the MSE on our data for the given theta"""
    x0 = X.iloc[:, 0]
    x1 = X.iloc[:, 1]
    dth0 = np.mean(-2 * (y_obs - theta[0] * x0 - theta[1] * x1) * x0)
    dth1 = np.mean(-2 * (y_obs - theta[0] * x0 - theta[1] * x1) * x1)
    return np.array([dth0, dth1])

X = tips_with_bias[["total_bill", "bias"]]
y_obs = tips["tip"]
mse_gradient(np.array([0, 0]), X, y_obs)

array([-135.22631803,   -5.99655738])

def mse_gradient_single_arg(theta):
    """Returns the gradient of the MSE on our data for the given theta"""
    X = tips_with_bias[["total_bill", "bias"]]
    y_obs = tips["tip"]
    return mse_gradient(theta, X, y_obs)

mse_gradient_single_arg(np.array([0, 0]))

array([-135.22631803,   -5.99655738])

gradient_descent(mse_gradient_single_arg, np.array([0, 0]), 0.001, 200)

array([[0.        , 0.        ],
       [0.13522632, 0.00599656],
       [0.14299127, 0.00662996],
       [0.14342571, 0.00695482],
       [0.14343856, 0.00726185],
       [0.14342717, 0.00756775],
       [0.14341439, 0.00787348],
       [0.14340154, 0.00817912],
       [0.14338868, 0.00848465],
       [0.14337583, 0.00879007],
       [0.14336298, 0.0090954 ],
       [0.14335014, 0.00940062],
       [0.1433373 , 0.00970574],
       [0.14332447, 0.01001076],
       [0.14331164, 0.01031568],
       [0.14329882, 0.01062049],
       [0.14328599, 0.0109252 ],
       [0.14327318, 0.01122981],
       [0.14326036, 0.01153432],
       [0.14324756, 0.01183873],
       [0.14323475, 0.01214303],
       [0.14322195, 0.01244724],
       [0.14320916, 0.01275134],
       [0.14319637, 0.01305533],
       [0.14318358, 0.01335923],
       [0.1431708 , 0.01366302],
       [0.14315802, 0.01396672],
       [0.14314525, 0.01427031],
       [0.14313248, 0.0145738 ],
       [0.14311971, 0.01487719],
       [0.14310695, 0.01518047],
       [0.14309419, 0.01548366],
       [0.14308144, 0.01578674],
       [0.14306869, 0.01608972],
       [0.14305595, 0.0163926 ],
       [0.14304321, 0.01669538],
       [0.14303047, 0.01699806],
       [0.14301774, 0.01730063],
       [0.14300502, 0.01760311],
       [0.14299229, 0.01790548],
       [0.14297957, 0.01820775],
       [0.14296686, 0.01850992],
       [0.14295415, 0.01881199],
       [0.14294144, 0.01911396],
       [0.14292874, 0.01941583],
       [0.14291605, 0.01971759],
       [0.14290335, 0.02001926],
       [0.14289067, 0.02032082],
       [0.14287798, 0.02062228],
       [0.1428653 , 0.02092365],
       [0.14285262, 0.02122491],
       [0.14283995, 0.02152607],
       [0.14282729, 0.02182713],
       [0.14281462, 0.02212808],
       [0.14280196, 0.02242894],
       [0.14278931, 0.0227297 ],
       [0.14277666, 0.02303035],
       [0.14276401, 0.02333091],
       [0.14275137, 0.02363136],
       [0.14273873, 0.02393172],
       [0.1427261 , 0.02423197],
       [0.14271347, 0.02453212],
       [0.14270085, 0.02483218],
       [0.14268823, 0.02513213],
       [0.14267561, 0.02543198],
       [0.142663  , 0.02573173],
       [0.14265039, 0.02603138],
       [0.14263778, 0.02633093],
       [0.14262518, 0.02663038],
       [0.14261259, 0.02692973],
       [0.1426    , 0.02722898],
       [0.14258741, 0.02752812],
       [0.14257483, 0.02782717],
       [0.14256225, 0.02812612],
       [0.14254968, 0.02842497],
       [0.14253711, 0.02872372],
       [0.14252454, 0.02902237],
       [0.14251198, 0.02932091],
       [0.14249942, 0.02961936],
       [0.14248687, 0.02991771],
       [0.14247432, 0.03021596],
       [0.14246177, 0.03051411],
       [0.14244923, 0.03081215],
       [0.1424367 , 0.0311101 ],
       [0.14242416, 0.03140795],
       [0.14241164, 0.0317057 ],
       [0.14239911, 0.03200335],
       [0.14238659, 0.0323009 ],
       [0.14237408, 0.03259835],
       [0.14236157, 0.0328957 ],
       [0.14234906, 0.03319295],
       [0.14233656, 0.0334901 ],
       [0.14232406, 0.03378715],
       [0.14231156, 0.0340841 ],
       [0.14229907, 0.03438095],
       [0.14228659, 0.03467771],
       [0.1422741 , 0.03497436],
       [0.14226163, 0.03527091],
       [0.14224915, 0.03556737],
       [0.14223668, 0.03586372],
       [0.14222422, 0.03615998],
       [0.14221176, 0.03645614],
       [0.1421993 , 0.0367522 ],
       [0.14218685, 0.03704815],
       [0.1421744 , 0.03734401],
       [0.14216195, 0.03763977],
       [0.14214951, 0.03793544],
       [0.14213708, 0.038231  ],
       [0.14212465, 0.03852646],
       [0.14211222, 0.03882183],
       [0.1420998 , 0.03911709],
       [0.14208738, 0.03941226],
       [0.14207496, 0.03970732],
       [0.14206255, 0.04000229],
       [0.14205014, 0.04029716],
       [0.14203774, 0.04059193],
       [0.14202534, 0.04088661],
       [0.14201295, 0.04118118],
       [0.14200056, 0.04147566],
       [0.14198817, 0.04177003],
       [0.14197579, 0.04206431],
       [0.14196341, 0.04235849],
       [0.14195104, 0.04265257],
       [0.14193867, 0.04294655],
       [0.1419263 , 0.04324043],
       [0.14191394, 0.04353422],
       [0.14190158, 0.04382791],
       [0.14188923, 0.04412149],
       [0.14187688, 0.04441498],
       [0.14186454, 0.04470838],
       [0.1418522 , 0.04500167],
       [0.14183986, 0.04529487],
       [0.14182753, 0.04558796],
       [0.1418152 , 0.04588096],
       [0.14180288, 0.04617386],
       [0.14179056, 0.04646666],
       [0.14177824, 0.04675937],
       [0.14176593, 0.04705198],
       [0.14175362, 0.04734448],
       [0.14174132, 0.04763689],
       [0.14172902, 0.04792921],
       [0.14171672, 0.04822142],
       [0.14170443, 0.04851354],
       [0.14169214, 0.04880556],
       [0.14167986, 0.04909748],
       [0.14166758, 0.0493893 ],
       [0.14165531, 0.04968103],
       [0.14164304, 0.04997265],
       [0.14163077, 0.05026418],
       [0.14161851, 0.05055562],
       [0.14160625, 0.05084695],
       [0.141594  , 0.05113819],
       [0.14158175, 0.05142933],
       [0.1415695 , 0.05172037],
       [0.14155726, 0.05201131],
       [0.14154502, 0.05230216],
       [0.14153279, 0.05259291],
       [0.14152056, 0.05288356],
       [0.14150833, 0.05317412],
       [0.14149611, 0.05346458],
       [0.14148389, 0.05375494],
       [0.14147168, 0.0540452 ],
       [0.14145947, 0.05433537],
       [0.14144727, 0.05462543],
       [0.14143507, 0.05491541],
       [0.14142287, 0.05520528],
       [0.14141068, 0.05549506],
       [0.14139849, 0.05578474],
       [0.1413863 , 0.05607432],
       [0.14137412, 0.05636381],
       [0.14136195, 0.0566532 ],
       [0.14134978, 0.05694249],
       [0.14133761, 0.05723168],
       [0.14132544, 0.05752078],
       [0.14131328, 0.05780978],
       [0.14130113, 0.05809869],
       [0.14128898, 0.0583875 ],
       [0.14127683, 0.05867621],
       [0.14126468, 0.05896482],
       [0.14125254, 0.05925334],
       [0.14124041, 0.05954176],
       [0.14122828, 0.05983008],
       [0.14121615, 0.06011831],
       [0.14120403, 0.06040644],
       [0.14119191, 0.06069448],
       [0.14117979, 0.06098242],
       [0.14116768, 0.06127026],
       [0.14115557, 0.06155801],
       [0.14114347, 0.06184565],
       [0.14113137, 0.06213321],
       [0.14111928, 0.06242066],
       [0.14110719, 0.06270802],
       [0.1410951 , 0.06299529],
       [0.14108302, 0.06328246],
       [0.14107094, 0.06356953],
       [0.14105886, 0.0638565 ],
       [0.14104679, 0.06414338],
       [0.14103473, 0.06443017],
       [0.14102266, 0.06471685],
       [0.1410106 , 0.06500344]])

If you play around with the code above, you'll see that it's pretty finicky about the start point and learning rate.

Thus, another approach is to use a more sophisticated numerical optimization library.

For reference, the general matrix form of the gradient is given below. We have not discussed how to derive this in class.

def mse_gradient(theta, X, y_obs):
    """Returns the gradient of the MSE on our data for the given theta"""
    n = len(X)
    return -2 / n * (X.T @ y_obs - X.T @ X @ theta)

Approach 4: scipy.optimize.minimize¶

scipy.optimize.minimize(mse_loss_single_arg, x0 = [0, 0])

      fun: 1.0360194420116033
 hess_inv: array([[ 0.00633488, -0.12534156],
       [-0.12534156,  2.980001  ]])
      jac: array([5.96046448e-08, 5.96046448e-08])
  message: 'Optimization terminated successfully.'
     nfev: 20
      nit: 3
     njev: 5
   status: 0
  success: True
        x: array([0.10502446, 0.92027071])

Approach 5: sklearn.linear_model.LinearRegression¶

As before, we can also go one level of abstraction higher and simply fit a linear model using sklearn.

We can either do this by using our tips_with_bias and fit_intercept = False, or with our original tips dataframe and fit_intercept = True.

import sklearn.linear_model
from sklearn.linear_model import LinearRegression
model = LinearRegression(fit_intercept = False)

X = tips_with_bias[["total_bill", "bias"]]
y = tips["tip"]
model.fit(X, y)

LinearRegression(copy_X=True, fit_intercept=False, n_jobs=None, normalize=False)

model.coef_

array([0.10502452, 0.92026961])

model = LinearRegression()
X = tips[["total_bill"]]
y = tips["tip"]
model.fit(X, y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

model.coef_

array([0.10502452])

model.intercept_

0.9202696135546731

sns.scatterplot(tips["total_bill"], tips["tip"])
x = tips["total_bill"]
y_hat = model.intercept_ + model.coef_ * tips["total_bill"]
plt.plot(x, y_hat, 'r')
plt.legend(['$\hat{y}$', '$y$']);
#plt.savefig("tip_vs_total_bill_linear_regression.png", dpi=300)

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4

	total_bill	bias
0	16.99	1
1	10.34	1
2	21.01	1
3	23.68	1
4	24.59	1

	total_bill	bias
0	16.99	1
1	10.34	1
2	21.01	1
3	23.68	1
4	24.59	1