# Lecture 17: Gradient Descent â€“ Data 100, Fall 2020¶

by Josh Hug (Fall 2019)

## Minimizing an Arbitrary 1D Function¶

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
# Configure nice plotting defaults - (this must be done in a cell separate
# from %matplotlib call)
plt.style.use('seaborn')
sns.set_context('talk', font_scale=1.4)
plt.rcParams['figure.figsize'] = (10, 7)

In [3]:
def arbitrary(x):
return (x**4 - 15*x**3 + 80*x**2 - 180*x + 144)/10

x = np.linspace(1, 7, 100)
plt.plot(x, arbitrary(x))
axes = plt.gca()
axes.set_ylim([-1, 3]);
# plt.savefig("fx4.png", dpi = 300, bbox_inches = "tight")


Visually, we can see above that the minimum is somewhere around 5.3ish. Let's see if we can figure out how to find the exact minimum algorithmically.

One way very slow and terrible way would be manual guess-and-check.

In [4]:
arbitrary(5.4)

Out[4]:
-0.6854400000000282

A somewhat better approach is to use brute force to try out a bunch of x values and return the one that yields the lowest loss.

In [5]:
def simple_minimize(f, xs):
y = [f(x) for x in xs]
return xs[np.argmin(y)]

In [6]:
simple_minimize(arbitrary, np.linspace(1, 7, 20))

Out[6]:
5.421052631578947

This process is essentially the same as before where we made a graphical plot, it's just that we're only looking at 20 selected points.

In [7]:
xs = np.linspace(1, 7, 200)
sparse_xs = np.linspace(1, 7, 21)

ys = arbitrary(xs)
sparse_ys = arbitrary(sparse_xs)

plt.plot(xs, ys, label = "f")
plt.plot(sparse_xs, sparse_ys, 'r*', label = "f");
# plt.savefig("f_brute_force.png", dpi=300, bbox_inches = "tight")


This basic approach is incredibly inefficient, and suffers from two major flaws:

1. If the minimum is outside our range of guesses, the answer will be completely wrong.
2. Even if our range of guesses is correct, if the guesses are too coarse, our answer will be inaccurate.

### Better Approach: Gradient Descent¶

Instead of choosing all of our guesses ahead of time, we can instead start from a single guess and try to iteratively improve on our choice.

They key insight is this: If the derivative of the function is negative, that means the function is decreasing, so we should go to the right (i.e. pick a bigger x). If the derivative of the function is positive, that means the function is increasing, so we should go to the left (i.e. pick a smaller x).

Thus, the derivative tells us which way to go.

In [8]:
#desmos demo: https://www.desmos.com/calculator/twpnylu4lr

In [9]:
def derivative_arbitrary(x):
return (4*x**3 - 45*x**2 + 160*x - 180)/10

def line(x):
return (0*x)
plt.plot(x, arbitrary(x))
plt.plot(x, derivative_arbitrary(x), '--')
plt.plot(x, line(x), '.')

plt.legend(['f(x)', 'df(x)'])
axes = plt.gca()
axes.set_ylim([-1, 3]);

In [10]:
def plot_arbitrary():
x = np.linspace(1, 7, 100)
plt.plot(x, arbitrary(x))
axes = plt.gca()
axes.set_ylim([-1, 3])

def plot_x_on_f(f, x):
y = f(x)
default_args = dict(label=r'$\theta$', zorder=2,
s=200, c=sns.xkcd_rgb['green'])
plt.scatter([x], [y], **default_args)

def plot_x_on_f_empty(f, x):
y = f(x)
default_args = dict(label=r'$\theta$', zorder=2,
s=200, c = 'none', edgecolor=sns.xkcd_rgb['green'])
plt.scatter([x], [y], **default_args)

def plot_tangent_on_f(f, x, eps=1e-6):
slope = ((f(x + eps) - f(x - eps))
/ (2 * eps))
xs = np.arange(x - 1, x + 1, 0.05)
ys = f(x) + slope * (xs - x)
plt.plot(xs, ys, zorder=3, c=sns.xkcd_rgb['green'], linestyle='--')

In [11]:
plot_arbitrary()
plot_x_on_f(arbitrary, 2)
plot_tangent_on_f(arbitrary, 2)
plt.xlabel('x')
plt.ylabel('f(x)');
#plt.savefig('dfx_1.png')

In [12]:
plot_arbitrary()
plot_x_on_f(arbitrary, 6)
plot_tangent_on_f(arbitrary, 6)
#plt.savefig('dfx_2.png')


Armed with this knowledge, let's try to see if we can use the derivative to optimize the function.

In [13]:
derivative_arbitrary(4)

Out[13]:
-0.4
In [14]:
derivative_arbitrary(4 + 0.4)

Out[14]:
-0.6464000000000055
In [15]:
derivative_arbitrary(4 + 0.4 + 0.64)

Out[15]:
-0.45757440000001
In [16]:
derivative_arbitrary(4 + 0.4 + 0.64 + 0.457)

Out[16]:
0.4166188892000037
In [17]:
derivative_arbitrary(4)

Out[17]:
-0.4
In [18]:
derivative_arbitrary(4 + 0.4)

Out[18]:
-0.6464000000000055
In [19]:
derivative_arbitrary(4 + 0.4 + 0.64)

Out[19]:
-0.45757440000001
In [20]:
derivative_arbitrary(4 + 0.4 + 0.64 + 0.46)

Out[20]:
0.425
In [21]:
def plot_one_step(x):
new_x = x - derivative_arbitrary(x)
plot_arbitrary()
plot_x_on_f(arbitrary, new_x)
plot_x_on_f_empty(arbitrary, x)
print(f'old x: {x}')
print(f'new x: {new_x}')

In [22]:
plot_one_step( 5.48)

old x: 5.48
new x: 5.1101631999999935

In [23]:
def plot_one_step_better(x):
new_x = x - 0.3 * derivative_arbitrary(x)
plot_arbitrary()
plot_x_on_f(arbitrary, new_x)
plot_x_on_f_empty(arbitrary, x)
print(f'old x: {x}')
print(f'new x: {new_x}')

In [24]:
plot_one_step_better(5.32)

old x: 5.32
new x: 5.32398784


Written as a recurrence relation, the process we've described above is:

$$x^{(t+1)} = x^{(t)} - 0.3 \frac{d}{dx} f(x)$$

This algorithm is also known as "gradient descent".

Given a current $\gamma$, gradient descent creates its next guess for $\gamma$ based on the sign and magnitude of the derivative.

Our choice of 0.3 above was totally arbitrary. Naturally, we can generalize by replacing it with a parameter, typically represented by $\alpha$, and often called the "learning rate".

$$x^{(t+1)} = x^{(t)} - \alpha \frac{d}{dx} f(x)$$

We can also write up this procedure in code as given below:

In [25]:
def gradient_descent(df, initial_guess, alpha, n):
guesses = [initial_guess]
guess = initial_guess
while len(guesses) < n:
guess = guess - alpha * df(guess)
guesses.append(guess)
return np.array(guesses)

In [26]:
trajectory = gradient_descent(derivative_arbitrary, 4, 1.5, 20)
trajectory

Out[26]:
array([4.        , 4.6       , 5.6284    , 4.39841721, 5.36714058,
5.23345463, 5.50301118, 4.85283564, 5.77676546, 3.72284331,
3.96846217, 4.53018801, 5.55049763, 4.6927978 , 5.70865641,
4.05220573, 4.7135408 , 5.72270104, 3.98700301, 4.57133193])

Below, we see a visualization of the trajectory taken by this algorithm.

In [27]:
trajectory = gradient_descent(derivative_arbitrary, 1, 0.9, 20)
plot_arbitrary()
plt.plot(trajectory, arbitrary(trajectory));


Above, we've simply run our algorithm a fixed number of times. More sophisticated implementations will stop based on a variety of different stopping criteria, e.g. error getting too small, error getting too large, etc. We will not discuss these in our course.

In the next part, we'll return to the world of data science and see how this procedure might be useful for optimizing models.

In [28]:
plt.rcParams['figure.figsize'] = (4, 4)
plt.rcParams['figure.dpi'] = 150
plt.rcParams['lines.linewidth'] = 3
sns.set()


We'll continue where we left off earlier. We'll see 5 different ways of computing parameters for a 1D, then 2D linear model. These five techniques will be:

1. Brute Force
2. Closed Form Solutions
3. Gradient Descent
4. scipy.optimize.minimize
5. sklean.linear_model.LinearRegression

## Linear Regression With No Offset¶

Let's consider a case where we have a linear model with no offset. That is, we want to find the parameter $\gamma$ such that the L2 loss is minimized.

In [29]:
tips = sns.load_dataset("tips")

In [30]:
sns.scatterplot(tips["total_bill"], tips["tip"]);


We'll use a one parameter model that the output is $\hat{\gamma}$ times the x value. For example if $\hat{\gamma} = 0.1$, then $\hat{y} = \hat{\gamma} x$, and we are making the prediction line below.

In [31]:
sns.scatterplot(tips["total_bill"], tips["tip"])
x = tips["total_bill"]
y_hat = 0.1 * x
plt.plot(x, y_hat, 'r')
plt.legend(['$\hat{y}$', '$y$']);
#plt.savefig("tip_vs_total_bill.png", dpi=300)


Suppose we select the L2 loss as our loss function. In this case, our goal will be to minimize the mean squared error.

Let's start by writing a function that computes the MSE for a given choice of $\gamma$ on our dataset.

In [32]:
def mse_loss(gamma, x, y_obs):
y_hat = gamma * x
return np.mean((y_hat - y_obs) ** 2)

In [33]:
x = tips["total_bill"]
y_obs = tips["tip"]
mse_loss(0.1, x, y_obs)

Out[33]:
2.077768372950818

Our goal is to find the $\hat{\gamma}$ with minimum MSE.

### Approach 1: Closed Form Solutions¶

On HW5 problem 3, you derived using calculus that the optimal answer is:

$$\hat{\gamma} = \frac{\sum(x_i y_i)}{\sum(x_i^2)}$$

We can calculate this value below.

In [34]:
np.sum(tips["tip"] * tips["total_bill"])/np.sum(tips["total_bill"]**2)

Out[34]:
0.14373189527721666

Alternately, we can use the generic equation for linear regression that we derived in lecture using the definition of orthogonality.

In [35]:
X = tips[["total_bill"]]
y = tips["tip"]
np.linalg.inv(X.T @ X) @ X.T @ y

Out[35]:
0    0.143732
dtype: float64

### Optimization Approach #2A: Plotting the MSE vs. $\hat{\gamma}$¶

Since x and y_obs are fixed, the only variable is gamma.

For clarity, let's define a python function that returns the MSE as a function of a single argument gamma.

In [36]:
def mse_single_arg(gamma):
"""Returns the MSE on our data for the given gamma"""
x = tips["total_bill"]
y_obs = tips["tip"]
y_hat = gamma * x
return mse_loss(gamma, x, y_obs)

In [37]:
mse_single_arg(0.1)

Out[37]:
2.077768372950818

Thus we can plot the MSE as a function of gamma. It turns out to look pretty smooth, and quite similar to a parabola.

In [38]:
gammas = np.linspace(0, 0.2, 200)
x = tips["total_bill"]
y_obs = tips["tip"]

MSEs = [mse_single_arg(gamma) for gamma in gammas]

plt.plot(gammas, MSEs)
plt.xlabel(r"Choice for $\hat{\gamma}$")
plt.ylabel(r"MSE");
#plt.savefig("tips_MSE_vs_gamma.png", dpi=300, bbox_inches = "tight")


The minimum appears to be around $\hat{\gamma} = 0.14$.

### Approach 2B: Brute Force¶

Recall our simple_minimize function from earlier, redefined below for your convenience.

In [39]:
def simple_minimize(f, xs):
y = [f(x) for x in xs]
return xs[np.argmin(y)]

In [40]:
simple_minimize(mse_single_arg, np.linspace(0, 0.2, 21))

Out[40]:
0.14

As before, what we're doing is computing all the starred values below and then returning the $\hat{\theta}$ that goes with the minimum value.

In [41]:
gammas = np.linspace(0, 0.2, 200)
sparse_gammas = np.linspace(0, 0.2, 21)

loss = [mse_single_arg(gamma) for gamma in gammas]
sparse_loss = [mse_single_arg(gamma) for gamma in sparse_gammas]

plt.plot(gammas, loss)
plt.plot(sparse_gammas, sparse_loss, 'r*')
plt.xlabel(r"Choice for $\hat{\gamma}$")
plt.ylabel(r"MSE");
#plt.savefig("tips_brute_force.png", dpi=300, bbox_inches = "tight")


### Approach 3: Use Gradient Descent¶

Another approach is to use our 1D gradient descent algorithm from earlier.

In [42]:
def gradient_descent(df, initial_guess, alpha, n):
guesses = [initial_guess]
guess = initial_guess
while len(guesses) < n:
guess = guess - alpha * df(guess)
guesses.append(guess)
return np.array(guesses)


To use this function, we need to compute the derivative of the MSE. The MSE is repeated below for convenience.

In [43]:
def mse_loss(gamma, x, y_obs):
y_hat = gamma * x
return np.mean((y_hat - y_obs) ** 2)

In [44]:
def mse_loss_derivative(gamma, x, y_obs):
y_hat = gamma * x
return np.mean(2 * (y_hat - y_obs) * x)

In [45]:
#def squared_loss_derivative(y_hat, y_obs, x):
#    """Returns the derivative of the squared loss for a single prediction"""
#    return 2*(y_hat - y_obs)*x
#
#def mse_derivative(y_hat, y_obs, x):
#    """Returns the derivative of the MSE"""
#    return np.mean(squared_loss_derivative(y_hat, y_obs, x))


We can try out different values of gamma and see what we get back as our derivative.

In [46]:
gamma = 0.1
x = tips["total_bill"]
y_obs = tips["tip"]

mse_loss_derivative(gamma, x, y_obs)

Out[46]:
-41.14398663934429

Just like our mse_of_gamma, we can write a function that returns the derivative of the MSE as a function of a single argument gamma.

In [47]:
def mse_loss_derivative_single_arg(gamma):
x = tips["total_bill"]
y_obs = tips["tip"]

return mse_loss_derivative(gamma, x, y_obs)

In [48]:
mse_loss_derivative_single_arg(0.1)

Out[48]:
-41.14398663934429
In [49]:
gradient_descent(mse_loss_derivative_single_arg, 0.05, 0.0001, 100)

Out[49]:
array([0.05      , 0.05881852, 0.06680736, 0.0740446 , 0.08060095,
0.08654045, 0.09192116, 0.09679563, 0.10121151, 0.10521192,
0.10883597, 0.11211906, 0.11509327, 0.11778766, 0.12022855,
0.1224398 , 0.12444301, 0.12625776, 0.12790176, 0.1293911 ,
0.13074031, 0.13196259, 0.13306988, 0.13407298, 0.13498172,
0.13580495, 0.13655074, 0.13722636, 0.13783841, 0.13839289,
0.13889519, 0.13935024, 0.13976248, 0.14013593, 0.14047425,
0.14078073, 0.14105839, 0.14130992, 0.14153778, 0.14174421,
0.14193122, 0.14210063, 0.1422541 , 0.14239314, 0.14251909,
0.14263319, 0.14273656, 0.1428302 , 0.14291504, 0.14299189,
0.14306151, 0.14312458, 0.14318172, 0.14323348, 0.14328037,
0.14332285, 0.14336134, 0.1433962 , 0.14342778, 0.14345639,
0.14348231, 0.1435058 , 0.14352707, 0.14354634, 0.1435638 ,
0.14357961, 0.14359394, 0.14360692, 0.14361868, 0.14362933,
0.14363898, 0.14364772, 0.14365564, 0.14366281, 0.14366931,
0.1436752 , 0.14368053, 0.14368537, 0.14368974, 0.14369371,
0.1436973 , 0.14370056, 0.14370351, 0.14370618, 0.1437086 ,
0.14371079, 0.14371277, 0.14371457, 0.1437162 , 0.14371768,
0.14371902, 0.14372023, 0.14372133, 0.14372232, 0.14372322,
0.14372404, 0.14372478, 0.14372545, 0.14372605, 0.1437266 ])

In the context of minimizing loss, we can write out the gradient descent rule for generating the next $\gamma$ as:

$$\gamma^{(t+1)} = \gamma^{(t)} - \alpha \frac{\partial}{\partial \gamma} L(\gamma^{(t)}, \Bbb{X}, \vec{\hat{y}})$$

Here $L$ is our chosen loss function, $\Bbb{X}$ is our design matrix, and $\vec{\hat{y}}$ are our observations. During the gradient descent algorithm, we treat $\Bbb{X}$ and $\vec{\hat{y}}$ as constants.

### Approach 4: scipy.optimize.minimize¶

We can also use scipy.optimize.minimize.

In [50]:
import scipy.optimize
from scipy.optimize import minimize
minimize(mse_single_arg, x0 = 0)

Out[50]:
      fun: 1.1781161154513213
hess_inv: array([[1]])
jac: array([4.24683094e-06])
message: 'Optimization terminated successfully.'
nfev: 9
nit: 1
njev: 3
status: 0
success: True
x: array([0.14373189])

A natural question that arises: How does scipy.optimize.minimize work? We won't discuss the exact algorithm used by the code (see this wikipedia page about the BFGS algorithm if you're curious, though).

Gradient descent is related to BFGS, though generally doesn't work as well. Comparison of numerical optimization algorithms is very far beyond the scope of our course.

### Approach 5: sklearn.linear_model.LinearRegression¶

We can also go one level of abstraction higher and simply fit a linear model using sklearn.

In [51]:
import sklearn.linear_model
from sklearn.linear_model import LinearRegression
model = LinearRegression(fit_intercept = False)

In [52]:
X = tips[["total_bill"]]
y = tips["tip"]
model.fit(X, y)

Out[52]:
LinearRegression(copy_X=True, fit_intercept=False, n_jobs=None, normalize=False)
In [53]:
model.coef_

Out[53]:
array([0.1437319])

## Multi Dimensional Models¶

In [124]:
#tips datset
data = sns.load_dataset("tips")
data.head()

Out[124]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

Now suppose we improve our model so that we want to predict the tip from the total_bill plus a constant offset, in other words:

$$\textrm{tip} = \hat{\theta}_0 + \hat{\theta}_1 \textrm{bill}$$

To keep things in the simple framework from lecture, we can create a design matrix with the bias vector in one column and the total_bill in the other.

In [125]:
tips_with_bias = tips.copy()
tips_with_bias["bias"] = 1
X = tips_with_bias[["total_bill", "bias"]]
X.head(5)

Out[125]:
total_bill bias
0 16.99 1
1 10.34 1
2 21.01 1
3 23.68 1
4 24.59 1

Now, we can give our predictions as $$\vec{\hat{y}} = f_{\vec{\hat{\theta}}}(\Bbb{X}) = \Bbb{X} \vec{\hat{\theta}}$$

For example, the predictions below correspond to assuming every table leaves a tip of \$1.50 plus 5% of their total bill. In [126]: X @ np.array([0.05, 1.50])  Out[126]: 0 2.3495 1 2.0170 2 2.5505 3 2.6840 4 2.7295 ... 239 2.9515 240 2.8590 241 2.6335 242 2.3910 243 2.4390 Length: 244, dtype: float64 Throughout this problem, we'll assume we want to minimize the mean squared error of our predictions, i.e. In [127]: def mse_loss(theta, X, y_obs): y_hat = X @ theta return np.mean((y_hat - y_obs) ** 2)  For example, the loss assuming the model described above is: In [128]: mse_loss(np.array([0.05, 1.50]), X, y_obs)  Out[128]: 1.534052175204919 ### Approach 1: Closed Form Solution¶ In [129]: y_obs  Out[129]: 0 1.01 1 1.66 2 3.50 3 3.31 4 3.61 ... 239 5.92 240 2.00 241 2.00 242 1.75 243 3.00 Name: tip, Length: 244, dtype: float64 In [130]: X = tips_with_bias[["total_bill", "bias"]] X.head(5)  Out[130]: total_bill bias 0 16.99 1 1 10.34 1 2 21.01 1 3 23.68 1 4 24.59 1 In [131]: y_obs = tips["tip"] y_obs.head(5)  Out[131]: 0 1.01 1 1.66 2 3.50 3 3.31 4 3.61 Name: tip, dtype: float64 In [132]: theta_hat = np.linalg.inv(X.T @ X) @ X.T @ y_obs theta_hat  Out[132]: 0 0.105025 1 0.920270 dtype: float64 In [107]: #Note: It's generally a better idea to use np.linalg.solve. #np.linalg.inv is slow and can sometimes return incorrect results due to rounding issues. #np.linalg.solve is faster and generally better behaved. theta_hat = np.linalg.solve(X.T @ X, X.T @ y_obs)  In [133]: theta_hat  Out[133]: 0 0.105025 1 0.920270 dtype: float64 ### Approach 2: Brute Force¶ As before, we can simply try out a bunch of theta values and see which works best. In this case, since we have a 2D theta, there's a much bigger space of possible values to try. In [134]: def mse_loss(theta, X, y_obs): y_hat = X @ theta return np.mean((y_hat - y_obs) ** 2)  As before, it's convenient to first create an mse function of a single argument. In [135]: X = tips_with_bias[["total_bill", "bias"]] y_obs = tips["tip"] def mse_loss_single_arg(theta): return mse_loss(theta, X, y_obs)  In [136]: mse_loss_single_arg([0.01, 0.02])  Out[136]: 9.479444821434424 Using this function, we can create a 3D plot. This uses lots of syntax you've never seen. In [137]: import plotly.graph_objects as go uvalues = np.linspace(0, 0.2, 10)#)1) vvalues = np.linspace(0, 2, 10)#)1) (u,v) = np.meshgrid(uvalues, vvalues) thetas = np.vstack((u.flatten(),v.flatten())) MSE = np.array([mse_loss_single_arg(t) for t in thetas.T]) loss_surface = go.Surface(x=u, y=v, z=np.reshape(MSE, u.shape)) ind = np.argmin(MSE) optimal_point = go.Scatter3d(name = "Optimal Point", x = [thetas.T[ind,0]], y = [thetas.T[ind,1]], z = [MSE[ind]], marker=dict(size=10, color="red")) fig = go.Figure(data=[loss_surface, optimal_point]) fig.update_layout(scene = dict( xaxis_title = "theta0", yaxis_title = "theta1", zaxis_title = "MSE")) fig.show()  ### Approach 3: Gradient Descent¶ Gradient descent is exactly like it was before, only now our gradient is somewhat more complicated. In [138]: def mse_gradient(theta, X, y_obs): """Returns the gradient of the MSE on our data for the given theta""" x0 = X.iloc[:, 0] x1 = X.iloc[:, 1] dth0 = np.mean(-2 * (y_obs - theta[0] * x0 - theta[1] * x1) * x0) dth1 = np.mean(-2 * (y_obs - theta[0] * x0 - theta[1] * x1) * x1) return np.array([dth0, dth1])  In [139]: X = tips_with_bias[["total_bill", "bias"]] y_obs = tips["tip"] mse_gradient(np.array([0, 0]), X, y_obs)  Out[139]: array([-135.22631803, -5.99655738]) In [140]: def mse_gradient_single_arg(theta): """Returns the gradient of the MSE on our data for the given theta""" X = tips_with_bias[["total_bill", "bias"]] y_obs = tips["tip"] return mse_gradient(theta, X, y_obs)  In [141]: mse_gradient_single_arg(np.array([0, 0]))  Out[141]: array([-135.22631803, -5.99655738]) In [143]: gradient_descent(mse_gradient_single_arg, np.array([0, 0]), 0.001, 200)  Out[143]: array([[0. , 0. ], [0.13522632, 0.00599656], [0.14299127, 0.00662996], [0.14342571, 0.00695482], [0.14343856, 0.00726185], [0.14342717, 0.00756775], [0.14341439, 0.00787348], [0.14340154, 0.00817912], [0.14338868, 0.00848465], [0.14337583, 0.00879007], [0.14336298, 0.0090954 ], [0.14335014, 0.00940062], [0.1433373 , 0.00970574], [0.14332447, 0.01001076], [0.14331164, 0.01031568], [0.14329882, 0.01062049], [0.14328599, 0.0109252 ], [0.14327318, 0.01122981], [0.14326036, 0.01153432], [0.14324756, 0.01183873], [0.14323475, 0.01214303], [0.14322195, 0.01244724], [0.14320916, 0.01275134], [0.14319637, 0.01305533], [0.14318358, 0.01335923], [0.1431708 , 0.01366302], [0.14315802, 0.01396672], [0.14314525, 0.01427031], [0.14313248, 0.0145738 ], [0.14311971, 0.01487719], [0.14310695, 0.01518047], [0.14309419, 0.01548366], [0.14308144, 0.01578674], [0.14306869, 0.01608972], [0.14305595, 0.0163926 ], [0.14304321, 0.01669538], [0.14303047, 0.01699806], [0.14301774, 0.01730063], [0.14300502, 0.01760311], [0.14299229, 0.01790548], [0.14297957, 0.01820775], [0.14296686, 0.01850992], [0.14295415, 0.01881199], [0.14294144, 0.01911396], [0.14292874, 0.01941583], [0.14291605, 0.01971759], [0.14290335, 0.02001926], [0.14289067, 0.02032082], [0.14287798, 0.02062228], [0.1428653 , 0.02092365], [0.14285262, 0.02122491], [0.14283995, 0.02152607], [0.14282729, 0.02182713], [0.14281462, 0.02212808], [0.14280196, 0.02242894], [0.14278931, 0.0227297 ], [0.14277666, 0.02303035], [0.14276401, 0.02333091], [0.14275137, 0.02363136], [0.14273873, 0.02393172], [0.1427261 , 0.02423197], [0.14271347, 0.02453212], [0.14270085, 0.02483218], [0.14268823, 0.02513213], [0.14267561, 0.02543198], [0.142663 , 0.02573173], [0.14265039, 0.02603138], [0.14263778, 0.02633093], [0.14262518, 0.02663038], [0.14261259, 0.02692973], [0.1426 , 0.02722898], [0.14258741, 0.02752812], [0.14257483, 0.02782717], [0.14256225, 0.02812612], [0.14254968, 0.02842497], [0.14253711, 0.02872372], [0.14252454, 0.02902237], [0.14251198, 0.02932091], [0.14249942, 0.02961936], [0.14248687, 0.02991771], [0.14247432, 0.03021596], [0.14246177, 0.03051411], [0.14244923, 0.03081215], [0.1424367 , 0.0311101 ], [0.14242416, 0.03140795], [0.14241164, 0.0317057 ], [0.14239911, 0.03200335], [0.14238659, 0.0323009 ], [0.14237408, 0.03259835], [0.14236157, 0.0328957 ], [0.14234906, 0.03319295], [0.14233656, 0.0334901 ], [0.14232406, 0.03378715], [0.14231156, 0.0340841 ], [0.14229907, 0.03438095], [0.14228659, 0.03467771], [0.1422741 , 0.03497436], [0.14226163, 0.03527091], [0.14224915, 0.03556737], [0.14223668, 0.03586372], [0.14222422, 0.03615998], [0.14221176, 0.03645614], [0.1421993 , 0.0367522 ], [0.14218685, 0.03704815], [0.1421744 , 0.03734401], [0.14216195, 0.03763977], [0.14214951, 0.03793544], [0.14213708, 0.038231 ], [0.14212465, 0.03852646], [0.14211222, 0.03882183], [0.1420998 , 0.03911709], [0.14208738, 0.03941226], [0.14207496, 0.03970732], [0.14206255, 0.04000229], [0.14205014, 0.04029716], [0.14203774, 0.04059193], [0.14202534, 0.04088661], [0.14201295, 0.04118118], [0.14200056, 0.04147566], [0.14198817, 0.04177003], [0.14197579, 0.04206431], [0.14196341, 0.04235849], [0.14195104, 0.04265257], [0.14193867, 0.04294655], [0.1419263 , 0.04324043], [0.14191394, 0.04353422], [0.14190158, 0.04382791], [0.14188923, 0.04412149], [0.14187688, 0.04441498], [0.14186454, 0.04470838], [0.1418522 , 0.04500167], [0.14183986, 0.04529487], [0.14182753, 0.04558796], [0.1418152 , 0.04588096], [0.14180288, 0.04617386], [0.14179056, 0.04646666], [0.14177824, 0.04675937], [0.14176593, 0.04705198], [0.14175362, 0.04734448], [0.14174132, 0.04763689], [0.14172902, 0.04792921], [0.14171672, 0.04822142], [0.14170443, 0.04851354], [0.14169214, 0.04880556], [0.14167986, 0.04909748], [0.14166758, 0.0493893 ], [0.14165531, 0.04968103], [0.14164304, 0.04997265], [0.14163077, 0.05026418], [0.14161851, 0.05055562], [0.14160625, 0.05084695], [0.141594 , 0.05113819], [0.14158175, 0.05142933], [0.1415695 , 0.05172037], [0.14155726, 0.05201131], [0.14154502, 0.05230216], [0.14153279, 0.05259291], [0.14152056, 0.05288356], [0.14150833, 0.05317412], [0.14149611, 0.05346458], [0.14148389, 0.05375494], [0.14147168, 0.0540452 ], [0.14145947, 0.05433537], [0.14144727, 0.05462543], [0.14143507, 0.05491541], [0.14142287, 0.05520528], [0.14141068, 0.05549506], [0.14139849, 0.05578474], [0.1413863 , 0.05607432], [0.14137412, 0.05636381], [0.14136195, 0.0566532 ], [0.14134978, 0.05694249], [0.14133761, 0.05723168], [0.14132544, 0.05752078], [0.14131328, 0.05780978], [0.14130113, 0.05809869], [0.14128898, 0.0583875 ], [0.14127683, 0.05867621], [0.14126468, 0.05896482], [0.14125254, 0.05925334], [0.14124041, 0.05954176], [0.14122828, 0.05983008], [0.14121615, 0.06011831], [0.14120403, 0.06040644], [0.14119191, 0.06069448], [0.14117979, 0.06098242], [0.14116768, 0.06127026], [0.14115557, 0.06155801], [0.14114347, 0.06184565], [0.14113137, 0.06213321], [0.14111928, 0.06242066], [0.14110719, 0.06270802], [0.1410951 , 0.06299529], [0.14108302, 0.06328246], [0.14107094, 0.06356953], [0.14105886, 0.0638565 ], [0.14104679, 0.06414338], [0.14103473, 0.06443017], [0.14102266, 0.06471685], [0.1410106 , 0.06500344]]) If you play around with the code above, you'll see that it's pretty finicky about the start point and learning rate. Thus, another approach is to use a more sophisticated numerical optimization library. For reference, the general matrix form of the gradient is given below. We have not discussed how to derive this in class. In [74]: def mse_gradient(theta, X, y_obs): """Returns the gradient of the MSE on our data for the given theta""" n = len(X) return -2 / n * (X.T @ y_obs - X.T @ X @ theta)  ### Approach 4: scipy.optimize.minimize¶ In [75]: scipy.optimize.minimize(mse_loss_single_arg, x0 = [0, 0])  Out[75]:  fun: 1.0360194420116033 hess_inv: array([[ 0.00633488, -0.12534156], [-0.12534156, 2.980001 ]]) jac: array([5.96046448e-08, 5.96046448e-08]) message: 'Optimization terminated successfully.' nfev: 20 nit: 3 njev: 5 status: 0 success: True x: array([0.10502446, 0.92027071]) ### Approach 5: sklearn.linear_model.LinearRegression¶ As before, we can also go one level of abstraction higher and simply fit a linear model using sklearn. We can either do this by using our tips_with_bias and fit_intercept = False, or with our original tips dataframe and fit_intercept = True. In [76]: import sklearn.linear_model from sklearn.linear_model import LinearRegression model = LinearRegression(fit_intercept = False)  In [77]: X = tips_with_bias[["total_bill", "bias"]] y = tips["tip"] model.fit(X, y)  Out[77]: LinearRegression(copy_X=True, fit_intercept=False, n_jobs=None, normalize=False) In [78]: model.coef_  Out[78]: array([0.10502452, 0.92026961]) In [79]: model = LinearRegression() X = tips[["total_bill"]] y = tips["tip"] model.fit(X, y)  Out[79]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False) In [80]: model.coef_  Out[80]: array([0.10502452]) In [81]: model.intercept_  Out[81]: 0.9202696135546731 In [82]: sns.scatterplot(tips["total_bill"], tips["tip"]) x = tips["total_bill"] y_hat = model.intercept_ + model.coef_ * tips["total_bill"] plt.plot(x, y_hat, 'r') plt.legend(['$\hat{y}$', '$y\$']);
#plt.savefig("tip_vs_total_bill_linear_regression.png", dpi=300)

In [ ]:


In [ ]: