Simple Linear Regression¶

Model Training and Prediction Using Derived Formulas¶

In [1]:
import pandas as pd
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
import numpy as np

Consider the tips dataset. Each row represents one table that ate at a restaurant. For example, the top row describes a table of 2 eating dinner on a Sunday, where the person who paid the check was female and not a smoker. This table had a \$16.99 total bill and left a \$1.01 tip.

In [2]:
df = sns.load_dataset("tips")
df.head()
Out[2]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

In many cultures, it is customary to tip based on the total bill. This data appears to have been collected from such a culture.

In [3]:
px.scatter(df, x = "total_bill", y = "tip")

Suppose we want to create a linear regression model that predicts the tip from the total bill.

$$\hat{y} = a + b x$$

Let $\hat{a}$ and $\hat{b}$ be the choices that minimize the mean squared error.

One approach to compute $\hat{a}$ and $\hat{b}$ is to use the equations that we derived in Lecture 9:

$$\hat{a} = \bar{y} - \hat{b} \bar{x}$$

$$\hat{b} = r \frac{\sigma_y}{\sigma_x}$$

$$r = \frac{1}{n} \sum_{i=1}^{n}\left( \frac{x_i - \bar{x}}{\sigma_x} \right) \left( \frac{y_i - \bar{y}}{\sigma_y} \right)$$
In [4]:
y = df["tip"]
x = df["total_bill"]
y_bar = np.mean(y)
x_bar = np.mean(x)
sigma_y = np.std(y)
sigma_x = np.std(x)
r = np.sum((x - x_bar) / sigma_x * (y - y_bar) / sigma_y) / len(x)
b_hat = r * sigma_y / sigma_x
a_hat = y_bar - b_hat * x_bar
In [5]:
b_hat = r * sigma_y / sigma_x
b_hat
Out[5]:
0.10502451738435341
In [6]:
a_hat = y_bar - b_hat * x_bar
a_hat
Out[6]:
0.9202696135546722
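
As a quick sanity check (our own addition, not part of the original notebook), numpy's built-in correlation coefficient should agree with the r we computed by hand:

# Pearson correlation between total bill and tip; should match r computed above.
np.corrcoef(x, y)[0, 1]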

Let's look at the accuracy of this model.

In [7]:
df["predicted_tip"] = a_hat + b_hat * df["total_bill"]
In [8]:
df.head(10)
Out[8]:
total_bill tip sex smoker day time size predicted_tip
0 16.99 1.01 Female No Sun Dinner 2 2.704636
1 10.34 1.66 Male No Sun Dinner 3 2.006223
2 21.01 3.50 Male No Sun Dinner 3 3.126835
3 23.68 3.31 Male No Sun Dinner 2 3.407250
4 24.59 3.61 Female No Sun Dinner 4 3.502822
5 25.29 4.71 Male No Sun Dinner 4 3.576340
6 8.77 2.00 Male No Sun Dinner 2 1.841335
7 26.88 3.12 Male No Sun Dinner 4 3.743329
8 15.04 1.96 Male No Sun Dinner 2 2.499838
9 14.78 3.23 Male No Sun Dinner 2 2.472532
In [9]:
df["residual"] = df["tip"] - df["predicted_tip"]
In [10]:
df.head(10)
Out[10]:
total_bill tip sex smoker day time size predicted_tip residual
0 16.99 1.01 Female No Sun Dinner 2 2.704636 -1.694636
1 10.34 1.66 Male No Sun Dinner 3 2.006223 -0.346223
2 21.01 3.50 Male No Sun Dinner 3 3.126835 0.373165
3 23.68 3.31 Male No Sun Dinner 2 3.407250 -0.097250
4 24.59 3.61 Female No Sun Dinner 4 3.502822 0.107178
5 25.29 4.71 Male No Sun Dinner 4 3.576340 1.133660
6 8.77 2.00 Male No Sun Dinner 2 1.841335 0.158665
7 26.88 3.12 Male No Sun Dinner 4 3.743329 -0.623329
8 15.04 1.96 Male No Sun Dinner 2 2.499838 -0.539838
9 14.78 3.23 Male No Sun Dinner 2 2.472532 0.757468

We can compute the mean squared error:

In [11]:
np.mean(df["residual"]**2)
Out[11]:
1.036019442011377

The sklearn package (more on that soon) provides a function that computes the MSE from a list of observations and predictions.

In [12]:
from sklearn.metrics import mean_squared_error
mean_squared_error(df["tip"], df["predicted_tip"])
Out[12]:
1.036019442011377

We can also visualize the output of the model:

In [13]:
sns.scatterplot(data = df, x = "total_bill", y = "tip")
plt.plot(df["total_bill"], df["predicted_tip"], 'r')
plt.legend([r'$\hat{y}$', '$y$']);

Or in plotly:

In [14]:
import plotly.graph_objects as go

scatter_plot = px.scatter(df, x="total_bill", y="tip")
line_plot = px.line(df, x="total_bill", y="predicted_tip")
line_plot.update_traces(line=dict(color = 'rgb(255, 0, 0)'))

go.Figure(data=scatter_plot.data + line_plot.data)

Simple Linear Regression Model Training and Prediction Using sklearn¶

An alternate approach to optimize our model is to use the sklearn.linear_model.LinearRegression class.

In [15]:
from sklearn.linear_model import LinearRegression

First we create a model. At this point it is like a baby, knowing nothing.

In [16]:
model = LinearRegression()

Then we "fit" the model, which means computing the parameters that minimize the loss function. The LinearRegression class is hard coded to use the MSE as its loss function. The first argument of the fit function should be a matrix (or DataFrame), and the second should be the observation we're trying to predict, which in this case is a series of tip values.

In [17]:
model.fit(df[["total_bill"]], df["tip"])
Out[17]:
LinearRegression()

Now that our model is trained, we can ask it questions. The code below asks the model to estimate the tip for a table with a 20 dollar total bill.

In [18]:
model.predict([[20]])
/srv/conda/envs/notebook/lib/python3.9/site-packages/sklearn/base.py:450: UserWarning:

X does not have valid feature names, but LinearRegression was fitted with feature names

Out[18]:
array([3.02075996])
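
The warning above appears because we fit on a DataFrame (which carries feature names) but predicted on a plain list. A minimal way to avoid it, sketched here rather than taken from the original notebook, is to predict on a one-row DataFrame whose column name matches the training data:

# Predicting with a matching column name avoids the feature-name warning.
model.predict(pd.DataFrame({"total_bill": [20]}))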

We can also ask the model to generate a series of predictions:

In [19]:
df["sklearn_predictions"] = model.predict(df[["total_bill"]])

Looking at the predictions, we see that they are exactly the same as we got using our analytic formula.

In [20]:
df.head(10)
Out[20]:
total_bill tip sex smoker day time size predicted_tip residual sklearn_predictions
0 16.99 1.01 Female No Sun Dinner 2 2.704636 -1.694636 2.704636
1 10.34 1.66 Male No Sun Dinner 3 2.006223 -0.346223 2.006223
2 21.01 3.50 Male No Sun Dinner 3 3.126835 0.373165 3.126835
3 23.68 3.31 Male No Sun Dinner 2 3.407250 -0.097250 3.407250
4 24.59 3.61 Female No Sun Dinner 4 3.502822 0.107178 3.502822
5 25.29 4.71 Male No Sun Dinner 4 3.576340 1.133660 3.576340
6 8.77 2.00 Male No Sun Dinner 2 1.841335 0.158665 1.841335
7 26.88 3.12 Male No Sun Dinner 4 3.743329 -0.623329 3.743329
8 15.04 1.96 Male No Sun Dinner 2 2.499838 -0.539838 2.499838
9 14.78 3.23 Male No Sun Dinner 2 2.472532 0.757468 2.472532

We can ask the model for its intercept and slope with intercept_ and coef_, respectively. They are the same as the values from our analytic formula.

In [21]:
model.intercept_
Out[21]:
0.9202696135546731
In [22]:
model.coef_
Out[22]:
array([0.10502452])

Multiple Linear Regression¶

Model Training and Prediction Using sklearn¶

Let's start by removing the columns we added earlier.

In [23]:
df = df.drop(["predicted_tip", "residual", "sklearn_predictions"], axis = "columns")

Now suppose we want to compute the tip using both the total_bill and the size of the party. We can do this easily in sklearn.

In [24]:
model_2d = LinearRegression()
In [25]:
model_2d.fit(df[["total_bill", "size"]], df["tip"])
Out[25]:
LinearRegression()

Now that we have this model, we can make predictions. For example, we can ask our model about the tip for a table with a total bill of 10 dollars and 3 customers.

In [26]:
model_2d.predict([[10, 3]])
/srv/conda/envs/notebook/lib/python3.9/site-packages/sklearn/base.py:450: UserWarning:

X does not have valid feature names, but LinearRegression was fitted with feature names

Out[26]:
array([2.17387149])

Note that this prediction is different than our first model:

In [27]:
model.predict([[10]])
/srv/conda/envs/notebook/lib/python3.9/site-packages/sklearn/base.py:450: UserWarning:

X does not have valid feature names, but LinearRegression was fitted with feature names

Out[27]:
array([1.97051479])

Prediction Using the Derived Equation for a 2D Linear Regression Model¶

The equation used by this model is the multiple linear regression model described in this slide: https://docs.google.com/presentation/d/1HZu4wK3wcG81ldQ-7edyunvwVXq4LBJCdXviyi3ZW9c/edit#slide=id.g113dfce000f_0_2682

$$\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2$$

We can get $\theta_0$ with intercept_, and $\theta_1$ and $\theta_2$ with coef_.

In [28]:
print(f"theta0: {model_2d.intercept_}")
print(f"theta1 and theta2: {model_2d.coef_}")
theta0: 0.6689447408125022
theta1 and theta2: [0.09271334 0.19259779]

We can manually compute the model prediction as follows:

In [29]:
0.6689 + 0.0927 * 10 + 0.1926 * 3
Out[29]:
2.1737
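
The small difference from predict's output (2.17387149) comes from rounding the parameters to four decimal places. As a short sketch, using the unrounded fitted attributes reproduces the sklearn prediction exactly:

# Same prediction using the unrounded fitted parameters.
model_2d.intercept_ + model_2d.coef_ @ np.array([10, 3])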

Prediction Using the Matrix Form for a Multiple Linear Regression Model¶

We can also manually compute the predictions using matrix multiplication, using the equation from https://docs.google.com/presentation/d/15eEbroVt2r36TXh28C2wm6wgUHlCBCsODR09kLHhDJ8/edit#slide=id.g113dfce000f_0_1309:

$$ \hat{\mathbb{Y}} = \mathbb{X} \theta$$
In [30]:
theta = np.array([[0.6689, 0.0927, 0.1926]]).T
theta
Out[30]:
array([[0.6689],
       [0.0927],
       [0.1926]])
In [31]:
X = np.array([[1, 10, 3]])
X
Out[31]:
array([[ 1, 10,  3]])
In [32]:
X @ theta
Out[32]:
array([[2.1737]])

Using the matrix form, we can make many predictions at once. For example, let X be the first four observations from our dataset:

In [33]:
X = df[["total_bill", "size"]].head(4)
X
Out[33]:
total_bill size
0 16.99 2
1 10.34 3
2 21.01 3
3 23.68 2

If we try to use matrix multiplication, we run into an error.

In [34]:
# uncomment this line
#X.values @ theta 

What's missing is our bias term:

In [35]:
X = df[["total_bill", "size"]].head(4)
X["bias"] = [1, 1, 1, 1]
X = X[["bias", "total_bill", "size"]]
X
Out[35]:
bias total_bill size
0 1 16.99 2
1 1 10.34 3
2 1 21.01 3
3 1 23.68 2
In [36]:
X @ theta
Out[36]:
0
0 2.629073
1 2.205218
2 3.194327
3 3.249236

These match the values we get from .predict, up to the rounding of our theta values.

In [37]:
model_2d.predict(X[["total_bill", "size"]].head(4))
Out[37]:
array([2.62933992, 2.20539403, 3.19464533, 3.24959215])
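
As a sketch (not in the original notebook), we can rebuild theta from the unrounded fitted attributes so that the matrix product matches .predict exactly:

# Unrounded parameters: intercept followed by the total_bill and size coefficients.
theta_unrounded = np.append(model_2d.intercept_, model_2d.coef_)
X[["bias", "total_bill", "size"]] @ theta_unrounded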

Visualizing Our 2D Linear Model Predictions¶

In [38]:
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 unused import
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(df["total_bill"], df["size"], df["tip"])
plt.xlabel('total bill')
plt.ylabel('size')
Out[38]:
Text(0.5, 0.5, 'size')
In [39]:
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(df["total_bill"], df["size"], df["tip"])
xx, yy = np.meshgrid(range(50), range(6))
zz = ( 0.6689 + 0.0927 * xx + 0.1926 * yy)
ax.plot_surface(xx, yy, zz, alpha=0.2)
plt.xlabel('total bill')
plt.ylabel('size')
plt.gcf().savefig("hi.png", dpi = 300, bbox_inches = "tight")

We see that the predictions all lie in a plane. In higher dimensions, they all lie in a "hyperplane".

Multiple Linear Regression Model Training Using the Derived Matrix Formula¶

For our one dimensional model, we computed $\hat{a}$ and $\hat{b}$ using the equations for simple linear regression.

For a multi-dimensional model we can use the normal equation, which we derived in Lecture 11: https://docs.google.com/presentation/d/1HZu4wK3wcG81ldQ-7edyunvwVXq4LBJCdXviyi3ZW9c/edit#slide=id.g113dfce000f_0_2682

$$\hat{\theta} = \left( \mathbb{X}^T \mathbb{X} \right)^{-1} \mathbb{X}^T \mathbb{Y}$$

Note that this derivation is one of the most challenging in the course, especially for those of you who are learning linear algebra during the same semester.

In [40]:
X = df[["total_bill", "size"]].copy()
X["bias"] = np.ones(len(X))
X = X[["bias", "total_bill", "size"]]
X.head(4)
Out[40]:
bias total_bill size
0 1.0 16.99 2
1 1.0 10.34 3
2 1.0 21.01 3
3 1.0 23.68 2
In [41]:
Y = df[["tip"]]
Y.head(4)
Out[41]:
tip
0 1.01
1 1.66
2 3.50
3 3.31

The optimal parameters we computed using sklearn (rounded to four decimal places) were:

In [42]:
theta
Out[42]:
array([[0.6689],
       [0.0927],
       [0.1926]])

And if we use the normal equation, we get back:

$$\hat{\theta} = \left( \mathbb{X}^T \mathbb{X} \right)^{-1} \mathbb{X}^T \mathbb{Y}$$
In [43]:
theta_using_normal_equation = np.linalg.inv(X.T @ X) @ X.T @ Y
theta_using_normal_equation.values
Out[43]:
array([[0.66894474],
       [0.09271334],
       [0.19259779]])

Note: The code above is inefficient. We won't go into this in our class, but it's better to use np.linalg.solve rather than computing the explicit matrix inverse.
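
For reference, a minimal sketch of that approach, solving the linear system $\mathbb{X}^T \mathbb{X} \hat{\theta} = \mathbb{X}^T \mathbb{Y}$ rather than inverting $\mathbb{X}^T \mathbb{X}$, might look like this:

# Solve (X^T X) theta = X^T Y without forming an explicit inverse;
# the result should match theta_using_normal_equation above.
np.linalg.solve(X.T @ X, X.T @ Y)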

Note also: sklearn's LinearRegression does not use the normal equation either; it relies on a numerical least-squares solver. More importantly for us, much of machine learning is optimized with a procedure called gradient descent, which can minimize a very broad class of functions, not just the MSE. The rest of the lecture will explain how gradient descent works.

Minimizing an Arbitrary 1D Function¶

Suppose we want to find the minimum of the arbitrary function given below:

In [44]:
def arbitrary(x):
    return (x**4 - 15*x**3 + 80*x**2 - 180*x + 144)/10

x = np.linspace(1, 6.75, 200)
fig = px.line(y = arbitrary(x), x = x)


fig.update_layout(font_size = 16)

One way to minimize this mathematical function is to use the scipy.optimize.minimize function. It takes a function and a starting guess and tries to find the minimum.

In [45]:
from scipy.optimize import minimize

minimize(arbitrary, x0 = 3.5)
Out[45]:
      fun: -0.13827491292966557
 hess_inv: array([[0.73848255]])
      jac: array([6.48573041e-06])
  message: 'Optimization terminated successfully.'
     nfev: 20
      nit: 3
     njev: 10
   status: 0
  success: True
        x: array([2.39275266])

Our choice of starting point can affect the outcome. For example, if we start to the left (x0 = 1), we get stuck in the local minimum on the left side.

In [46]:
minimize(arbitrary, x0 = 1)
Out[46]:
      fun: -0.13827491294422317
 hess_inv: array([[0.74751575]])
      jac: array([-3.7997961e-07])
  message: 'Optimization terminated successfully.'
     nfev: 16
      nit: 7
     njev: 8
   status: 0
  success: True
        x: array([2.3927478])

scipy.optimize.minimize is great. It may also seem a bit magical. How could you write a function that can find the minimum of any mathematical function? There are a number of ways to do this, which we'll explore in today's lecture, eventually arriving at the important idea of gradient descent, which is the principle that scipy.optimize.minimize uses.

Under the hood, gradient descent (or one of its variants) is how much of machine learning is trained, including even advanced neural network models.

In DS100, the gradient descent process will usually be invisible to us, hidden beneath an abstraction layer. However, to be good data scientists, it's important that we understand the basic principles behind the optimization routines that we harness to find optimal parameters.

The Naive Approach: Guess and Check¶

Above, we saw that the global minimum is somewhere around x = 5.3. Let's see if we can figure out how to find the exact minimum algorithmically from scratch.

One very slow and terrible approach would be manual guess-and-check.

In [47]:
arbitrary(6)
Out[47]:
0.0

A somewhat better approach is to use brute force to try out a bunch of x values and return the one that yields the lowest loss.

In [48]:
def simple_minimize(f, xs):
    y = [f(x) for x in xs]  
    return xs[np.argmin(y)]
In [49]:
simple_minimize(arbitrary, np.linspace(1, 7, 20))
Out[49]:
5.421052631578947

This process is essentially the same as when we made the plot above, except that we only evaluate the function at a small set of selected points.

In [50]:
xs = np.linspace(1, 7, 200)
sparse_xs = np.linspace(1, 7, 5)

ys = arbitrary(xs)
sparse_ys = arbitrary(sparse_xs)

fig = px.line(x = xs, y = arbitrary(xs))
fig.add_scatter(x = sparse_xs, y = arbitrary(sparse_xs), mode = "markers")
fig.update_layout(showlegend= False)
fig.show()

This basic approach suffers from three major flaws:

  1. If the minimum is outside our range of guesses, the answer will be completely wrong.
  2. Even if our range of guesses is correct, if the guesses are too coarse, our answer will be inaccurate.
  3. It is computationally inefficient, since we may evaluate vast numbers of guesses that turn out to be useless.

Better Approach: Gradient Descent¶

Instead of choosing all of our guesses ahead of time, we can instead start from a single guess and try to iteratively improve on our choice.

The key insight is this: If the derivative of the function is negative, that means the function is decreasing, so we should go to the right (i.e. pick a bigger x). If the derivative of the function is positive, that means the function is increasing, so we should go to the left (i.e. pick a smaller x).

Thus, the derivative tells us which way to go.

In [51]:
#desmos demo: https://www.desmos.com/calculator/twpnylu4lr
In [52]:
import plotly.graph_objects as go

def derivative_arbitrary(x):
    return (4*x**3 - 45*x**2 + 160*x - 180)/10


fig = go.Figure()
roots = np.array([2.3927, 3.5309, 5.3263])

fig.add_trace(go.Scatter(x = xs, y = arbitrary(xs), 
                         mode = "lines", name = "f"))
fig.add_trace(go.Scatter(x = xs, y = derivative_arbitrary(xs), 
                         mode = "lines", name = "df", line = {"dash": "dash"}))
fig.add_trace(go.Scatter(x = np.array(roots), y = 0*roots, 
                         mode = "markers", name = "df = zero", marker_size = 12))
fig.update_layout(font_size = 20, yaxis_range=[-1, 3])
fig.show()
In [53]:
import matplotlib.pyplot as plt

def plot_arbitrary():
    x = np.linspace(1, 7, 100)
    plt.plot(x, arbitrary(x))
    axes = plt.gca()
    axes.set_ylim([-1, 3])
    
def plot_x_on_f(f, x):
    y = f(x)    
    default_args = dict(label=r'$ \theta $', zorder=2,
                        s=200, c=sns.xkcd_rgb['green'])
    plt.scatter([x], [y], **default_args)
    
def plot_x_on_f_empty(f, x):
    y = f(x)
    default_args = dict(label=r'$ \theta $', zorder=2,
                        s=200, c = 'none', edgecolor=sns.xkcd_rgb['green'])
    plt.scatter([x], [y], **default_args)    
    
def plot_tangent_on_f(f, x, eps=1e-6):
    slope = ((f(x + eps) - f(x - eps))
             / (2 * eps))
    xs = np.arange(x - 1, x + 1, 0.05)
    ys = f(x) + slope * (xs - x)
    plt.plot(xs, ys, zorder=3, c=sns.xkcd_rgb['green'], linestyle='--')    
In [54]:
plot_arbitrary()
plot_x_on_f(arbitrary, 2)
plot_tangent_on_f(arbitrary, 2)
plt.xlabel('x')
plt.ylabel('f(x)');
In [55]:
plot_arbitrary()
plot_x_on_f(arbitrary, 4.4)
plot_tangent_on_f(arbitrary, 4.4)

Manually Descending the Gradient¶

Armed with this knowledge, let's try to see if we can use the derivative to optimize the function.

In [56]:
guess = 4
print(f"x: {guess}, f(x): {arbitrary(guess)}, derivative f'(x): {derivative_arbitrary(guess)}")
x: 4, f(x): 0.0, derivative f'(x): -0.4
In [57]:
guess = 4 + 0.4
print(f"x: {guess}, f(x): {arbitrary(guess)}, derivative f'(x): {derivative_arbitrary(guess)}")
x: 4.4, f(x): -0.21504000000003315, derivative f'(x): -0.6464000000000055
In [58]:
def plot_one_step(x):
    new_x = x - derivative_arbitrary(x)
    plot_arbitrary()
    plot_x_on_f(arbitrary, new_x)
    plot_x_on_f_empty(arbitrary, x)
    print(f'old x: {x}')
    print(f'new x: {new_x}')    
In [59]:
plot_one_step(4)
old x: 4
new x: 4.4
In [60]:
plot_one_step(4.4)
old x: 4.4
new x: 5.0464000000000055
In [61]:
plot_one_step(5.0464)
old x: 5.0464
new x: 5.49673060106241
In [62]:
plot_one_step(5.4967)
old x: 5.4967
new x: 5.080917145374805
In [63]:
plot_one_step(5.080917145374805)
old x: 5.080917145374805
new x: 5.489966698640582
In [64]:
plot_one_step(5.489966698640582)
old x: 5.489966698640582
new x: 5.092848945470474
In [65]:
def plot_one_step_better(x):
    new_x = x - 0.3 * derivative_arbitrary(x)
    plot_arbitrary()
    plot_x_on_f(arbitrary, new_x)
    plot_x_on_f_empty(arbitrary, x)
    print(f'old x: {x}')
    print(f'new x: {new_x}')    
In [66]:
plot_one_step_better(4)
old x: 4
new x: 4.12
In [67]:
plot_one_step_better(4.12)
old x: 4.12
new x: 4.267296639999997
In [68]:
plot_one_step_better(5.17180969114245)
old x: 5.17180969114245
new x: 5.256374838146257
In [69]:
plot_one_step_better(5.323)
old x: 5.323
new x: 5.325108157959999

Gradient Descent as an Algorithmic Procedure in Code¶

Written as a recurrence relation, the process we've described above is:

$$ x^{(t+1)} = x^{(t)} - 0.3 \frac{d}{dx} f\left(x^{(t)}\right) $$

This algorithm is also known as "gradient descent".

Given a current $x$, gradient descent creates its next guess for $x$ based on the sign and magnitude of the derivative.

Our choice of 0.3 above was totally arbitrary. Naturally, we can generalize by replacing it with a parameter, typically represented by $\alpha$, and often called the "learning rate".

$$ x^{(t+1)} = x^{(t)} - \alpha \frac{d}{dx} f\left(x^{(t)}\right) $$

We can also write up this procedure in code as given below:

In [70]:
def gradient_descent(df, initial_guess, alpha, n):
    """Performs n steps of gradient descent on df using learning rate alpha starting
       from initial_guess. Returns a numpy array of all guesses over time."""
    guesses = [initial_guess]
    current_guess = initial_guess
    while len(guesses) < n:
        current_guess = current_guess - alpha * df(current_guess)
        guesses.append(current_guess)
        
    return np.array(guesses)
In [71]:
trajectory = gradient_descent(derivative_arbitrary, 4, 0.3, 20)
trajectory
Out[71]:
array([4.        , 4.12      , 4.26729664, 4.44272584, 4.64092624,
       4.8461837 , 5.03211854, 5.17201478, 5.25648449, 5.29791149,
       5.31542718, 5.3222606 , 5.32483298, 5.32578765, 5.32614004,
       5.32626985, 5.32631764, 5.32633523, 5.3263417 , 5.32634408])
In [72]:
trajectory = gradient_descent(derivative_arbitrary, 4, 1, 20)
trajectory
Out[72]:
array([4.        , 4.4       , 5.0464    , 5.4967306 , 5.08086249,
       5.48998039, 5.09282487, 5.48675539, 5.09847285, 5.48507269,
       5.10140255, 5.48415922, 5.10298805, 5.48365325, 5.10386474,
       5.48336998, 5.1043551 , 5.48321045, 5.10463112, 5.48312031])

Notice that with $\alpha = 1$, the steps overshoot and the guesses bounce back and forth around the minimum instead of settling. Below, we see a visualization of the trajectory taken by this algorithm.

In [73]:
trajectory = gradient_descent(derivative_arbitrary, 1.6, 0.6, 20)
plot_arbitrary()
plt.plot(trajectory, arbitrary(trajectory));

Above, we've simply run our algorithm a fixed number of times. More sophisticated implementations stop based on a convergence criterion, e.g. halting once the improvement between iterations becomes negligibly small. We will not discuss these in detail in our course, but a minimal sketch of one such stopping rule is given below.
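
For illustration only (this is our own sketch, not the version used in this course; the names tol and max_steps are made up), a tolerance-based variant might look like:

def gradient_descent_until_converged(df, initial_guess, alpha, tol=1e-8, max_steps=10000):
    """Sketch of gradient descent that stops once the step size drops below tol,
       or after max_steps steps, whichever comes first."""
    guesses = [initial_guess]
    current_guess = initial_guess
    for _ in range(max_steps):
        step = alpha * df(current_guess)
        current_guess = current_guess - step
        guesses.append(current_guess)
        if np.linalg.norm(step) < tol:
            break
    return np.array(guesses)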

In the next part, we'll return to the world of real data and see how this procedure might be useful for optimizing models.

Linear Regression With No Offset¶

Let's consider a case where we have a linear model with no offset.

$$\hat{y} = \theta_1 x$$

We want to find the parameter $\theta_1$ that minimizes the average L2 loss, i.e. the mean squared error. In sklearn, this is easy: to avoid fitting an intercept, we set fit_intercept to False.

In [74]:
model = LinearRegression(fit_intercept = False)
model.fit(df[["total_bill"]], df["tip"])
model.coef_
Out[74]:
array([0.1437319])

In other words, the optimal tip rate under this model is about 14.37% of the total bill.

Creating an Explicit MSE Function¶

To employ gradient descent and do this ourselves, we need to define a function upon which we can use gradient descent.

Suppose we select the L2 loss as our loss function. In this case, our goal will be to minimize the mean squared error.

Let's start by writing a function that computes the MSE for a given choice of $\theta_1$ on our dataset.

In [75]:
def mse_loss(theta1, x, y_obs):
    y_hat = theta1 * x
    return np.mean((y_hat - y_obs) ** 2)    
In [76]:
x = df["total_bill"]
y_obs = df["tip"]
mse_loss(0.147, x, y_obs)
Out[76]:
1.1831403511422127

Since x and y_obs are fixed, the only variable is theta1.

For clarity, let's define a python function that returns the MSE as a function of a single argument theta1.

In [77]:
def mse_single_arg(theta1):
    """Returns the MSE on our data for the given theta1"""
    x = df["total_bill"]
    y_obs = df["tip"]
    y_hat = theta1 * x
    return np.mean((y_hat - y_obs) ** 2) 
In [78]:
mse_single_arg(0.1437)
Out[78]:
1.1781165940051925

Brute Forcing our Explicit MSE Function¶

Thus we can plot the MSE as a function of theta1. It turns out to look pretty smooth, and quite similar to a parabola.

In [79]:
theta1s = np.linspace(0, 0.2, 200)
x = df["total_bill"]
y_obs = df["tip"]

MSEs = [mse_single_arg(theta1) for theta1 in theta1s]

plt.plot(theta1s, MSEs)
plt.xlabel(r"Choice for $\theta_1$")
plt.ylabel(r"MSE");

The minimum appears to be around $\theta_1 = 0.14$.

Recall our simple_minimize function from earlier, redefined below for your convenience.

In [80]:
def simple_minimize(f, xs):
    y = [f(x) for x in xs]  
    return xs[np.argmin(y)]
In [81]:
simple_minimize(mse_single_arg, np.linspace(0, 0.2, 21))
Out[81]:
0.14

As before, what we're doing is computing all the starred values below and then returning the $\theta_1$ that goes with the minimum value.

In [82]:
theta1s = np.linspace(0, 0.2, 200)
sparse_theta1s = np.linspace(0, 0.2, 21)

loss = [mse_single_arg(theta1) for theta1 in theta1s]
sparse_loss = [mse_single_arg(theta1) for theta1 in sparse_theta1s]

plt.plot(theta1s, loss)
plt.plot(sparse_theta1s, sparse_loss, 'r*')
plt.xlabel(r"Choice for $\theta_1$")
plt.ylabel(r"MSE");

Using scipy.optimize.minimize¶

In [83]:
import scipy.optimize
from scipy.optimize import minimize
minimize(mse_single_arg, x0 = 0)
Out[83]:
      fun: 1.178116115451325
 hess_inv: array([[1]])
      jac: array([3.20374966e-06])
  message: 'Optimization terminated successfully.'
     nfev: 6
      nit: 1
     njev: 3
   status: 0
  success: True
        x: array([0.14373189])

Using Our Gradient Descent Function¶

Another approach is to use our 1D gradient descent algorithm from earlier. This is the exact same function as earlier, but copy and pasted to remind you what it looks like.

In [84]:
def gradient_descent(df, initial_guess, alpha, n):
    """Performs n steps of gradient descent on df using learning rate alpha starting
       from initial_guess. Returns a numpy array of all guesses over time."""
    guesses = [initial_guess]
    current_guess = initial_guess
    while len(guesses) < n:
        current_guess = current_guess - alpha * df(current_guess)
        guesses.append(current_guess)
        
    return np.array(guesses)

To use this function, we need to compute the derivative of the MSE. The MSE is repeated below for convenience.

In [85]:
def mse_single_arg(theta1):
    """Returns the MSE on our data for the given theta1"""
    x = df["total_bill"]
    y_obs = df["tip"]
    y_hat = theta1 * x
    return np.mean((y_hat - y_obs) ** 2) 
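
Differentiating the MSE with respect to $\theta_1$ (a step we spell out here for completeness), since $\hat{y}_i = \theta_1 x_i$:

$$\frac{d}{d\theta_1} \frac{1}{n} \sum_{i=1}^{n} \left(\theta_1 x_i - y_i\right)^2 = \frac{1}{n} \sum_{i=1}^{n} 2 \left(\theta_1 x_i - y_i\right) x_i$$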

In code, this derivative is:

In [86]:
def mse_loss_derivative_single_arg(theta_1):
    """Returns the derivative of the MSE on our data for the given theta1"""
    x = df["total_bill"]
    y_obs = df["tip"]
    y_hat = theta_1 * x
    
    return np.mean(2 * (y_hat - y_obs) * x)
In [87]:
gradient_descent(mse_loss_derivative_single_arg, 0.05, 0.0001, 100)
Out[87]:
array([0.05      , 0.05881852, 0.06680736, 0.0740446 , 0.08060095,
       0.08654045, 0.09192116, 0.09679563, 0.10121151, 0.10521192,
       0.10883597, 0.11211906, 0.11509327, 0.11778766, 0.12022855,
       0.1224398 , 0.12444301, 0.12625776, 0.12790176, 0.1293911 ,
       0.13074031, 0.13196259, 0.13306988, 0.13407298, 0.13498172,
       0.13580495, 0.13655074, 0.13722636, 0.13783841, 0.13839289,
       0.13889519, 0.13935024, 0.13976248, 0.14013593, 0.14047425,
       0.14078073, 0.14105839, 0.14130992, 0.14153778, 0.14174421,
       0.14193122, 0.14210063, 0.1422541 , 0.14239314, 0.14251909,
       0.14263319, 0.14273656, 0.1428302 , 0.14291504, 0.14299189,
       0.14306151, 0.14312458, 0.14318172, 0.14323348, 0.14328037,
       0.14332285, 0.14336134, 0.1433962 , 0.14342778, 0.14345639,
       0.14348231, 0.1435058 , 0.14352707, 0.14354634, 0.1435638 ,
       0.14357961, 0.14359394, 0.14360692, 0.14361868, 0.14362933,
       0.14363898, 0.14364772, 0.14365564, 0.14366281, 0.14366931,
       0.1436752 , 0.14368053, 0.14368537, 0.14368974, 0.14369371,
       0.1436973 , 0.14370056, 0.14370351, 0.14370618, 0.1437086 ,
       0.14371079, 0.14371277, 0.14371457, 0.1437162 , 0.14371768,
       0.14371902, 0.14372023, 0.14372133, 0.14372232, 0.14372322,
       0.14372404, 0.14372478, 0.14372545, 0.14372605, 0.1437266 ])

In the context of minimizing loss, we can write out the gradient descent rule for generating the next $\theta_1$ as:

$$ \theta_1^{(t+1)} = \theta_1^{(t)} - \alpha \frac{\partial}{\partial \theta_1} L(\theta^{(t)}, \Bbb{X}, \vec{y}) $$

Here $L$ is our chosen loss function, $\Bbb{X}$ contains the features of our input, and $\vec{y}$ contains our observations. During the gradient descent algorithm, we treat $\Bbb{X}$ and $\vec{y}$ as constants.

Multi Dimensional Models¶

Now suppose we improve our model so that we want to predict the tip from the total_bill plus a constant offset, in other words:

$$\textrm{tip} = \theta_0 + \theta_1 \textrm{bill}$$

We know how to do this in sklearn:

In [88]:
model = LinearRegression(fit_intercept = True)
X = df[["total_bill"]]
y = df["tip"]
model.fit(X, y)
Out[88]:
LinearRegression()
In [89]:
model.coef_
Out[89]:
array([0.10502452])
In [90]:
model.intercept_
Out[90]:
0.9202696135546731

With sklearn (explicit bias column)¶

To make our lives easier, let's reframe the problem slightly:

In [91]:
tips_with_bias = df.copy()
tips_with_bias["bias"] = 1
X = tips_with_bias[["bias", "total_bill"]]
X.head(5)
Out[91]:
bias total_bill
0 1 16.99
1 1 10.34
2 1 21.01
3 1 23.68
4 1 24.59
In [92]:
model = LinearRegression(fit_intercept = False)
X = tips_with_bias[["bias", "total_bill"]]
y = df["tip"]
model.fit(X, y)
Out[92]:
LinearRegression(fit_intercept=False)
In [93]:
model.coef_
Out[93]:
array([0.92026961, 0.10502452])

Defining a 2D MSE Function¶

Now, we can give our predictions as $$\hat{y} = \theta_0 + \theta_1 \times \textrm{bill}$$

For example, the predictions below correspond to assuming every table leaves a tip of \$1.50 plus 5% of their total bill.

In [94]:
X @ np.array([1.5, 0.05]) 
Out[94]:
0      2.3495
1      2.0170
2      2.5505
3      2.6840
4      2.7295
        ...  
239    2.9515
240    2.8590
241    2.6335
242    2.3910
243    2.4390
Length: 244, dtype: float64

Throughout this problem, we'll assume we want to minimize the mean squared error of our predictions, as computed by the function below:

In [95]:
def mse_loss(theta, X, y_obs):
    y_hat = X @ theta
    return np.mean((y_hat - y_obs) ** 2)    

For example, the loss assuming the model described above is:

In [96]:
mse_loss(np.array([1.5, 0.05]), X, y_obs)
Out[96]:
1.5340521752049179

Using this function, we can create a 3D plot. This uses lots of syntax you've never seen.

In [97]:
import plotly.graph_objects as go

uvalues = np.linspace(0, 2, 10)
vvalues = np.linspace(0, 0.2, 10)
(u,v) = np.meshgrid(uvalues, vvalues)
thetas = np.vstack((u.flatten(),v.flatten()))

def mse_loss_single_arg(theta):
    return mse_loss(theta, X, y_obs)

MSE = np.array([mse_loss_single_arg(t) for t in thetas.T])

loss_surface = go.Surface(x=u, y=v, z=np.reshape(MSE, u.shape))

ind = np.argmin(MSE)
optimal_point = go.Scatter3d(name = "Optimal Point",
    x = [thetas.T[ind,0]], y = [thetas.T[ind,1]], 
    z = [MSE[ind]],
    marker=dict(size=10, color="red"))

fig = go.Figure(data=[loss_surface, optimal_point])
fig.update_layout(scene = dict(
    xaxis_title = "theta0",
    yaxis_title = "theta1",
    zaxis_title = "MSE"))
fig.show()

Optimizing the 2D Model With Our Own Gradient Descent¶

Below is the function we just derived, but in code.
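
For reference (the derivation itself was done on the slides), the two partial derivatives being implemented below, with $x_{i,0}$ the bias column and $x_{i,1}$ the total bill, are:

$$\frac{\partial}{\partial \theta_0} \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \theta_0 x_{i,0} - \theta_1 x_{i,1}\right)^2 = \frac{1}{n} \sum_{i=1}^{n} -2 \left(y_i - \theta_0 x_{i,0} - \theta_1 x_{i,1}\right) x_{i,0}$$

$$\frac{\partial}{\partial \theta_1} \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \theta_0 x_{i,0} - \theta_1 x_{i,1}\right)^2 = \frac{1}{n} \sum_{i=1}^{n} -2 \left(y_i - \theta_0 x_{i,0} - \theta_1 x_{i,1}\right) x_{i,1}$$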

In [98]:
def mse_gradient(theta, X, y_obs):
    """Returns the gradient of the MSE on our data for the given theta"""    
    x0 = X.iloc[:, 0]
    x1 = X.iloc[:, 1]
    dth0 = np.mean(-2 * (y_obs - theta[0]*x0 - theta[1]*x1) * x0)
    dth1 = np.mean(-2 * (y_obs - theta[0]*x0 - theta[1]*x1) * x1)
    return np.array([dth0, dth1])
In [99]:
X = tips_with_bias[["bias", "total_bill"]]
y_obs = tips_with_bias["tip"]
mse_gradient(np.array([0, 0]), X, y_obs)
Out[99]:
array([  -5.99655738, -135.22631803])
In [100]:
def mse_gradient_single_arg(theta):
    """Returns the gradient of the MSE on our data for the given theta"""
    X = tips_with_bias[["bias", "total_bill"]]
    y_obs = tips_with_bias["tip"]
    return mse_gradient(theta, X, y_obs)
In [101]:
mse_gradient_single_arg(np.array([0, 0]))
Out[101]:
array([  -5.99655738, -135.22631803])
In [102]:
guesses = gradient_descent(mse_gradient_single_arg, np.array([0, 0]), 0.001, 10000)
In [103]:
pd.DataFrame(guesses).tail(10)
Out[103]:
0 1
9990 0.888098 0.106378
9991 0.888108 0.106378
9992 0.888119 0.106377
9993 0.888130 0.106377
9994 0.888141 0.106376
9995 0.888151 0.106376
9996 0.888162 0.106375
9997 0.888173 0.106375
9998 0.888184 0.106375
9999 0.888194 0.106374

Note that even after 10,000 steps, the guesses (roughly 0.888 and 0.106) are still slowly creeping toward the optimum we found analytically (roughly 0.920 and 0.105). If you play around with the code above, you'll also see that it's pretty finicky about the starting point and learning rate.

Extra¶

For reference, the general matrix form of the gradient is given below. We have not discussed how to derive this in class.

In [104]:
def mse_gradient(theta, X, y_obs):
    """Returns the gradient of the MSE on our data for the given theta"""
    n = len(X)
    return -2 / n * (X.T @ y_obs - X.T @ X @ theta)
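
As a quick check (our own addition), this matrix form should reproduce the component-wise gradient we computed earlier, up to floating-point round-off:

# Should match the earlier result, array([-5.99655738, -135.22631803]).
mse_gradient(np.array([0, 0]), X, y_obs)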