by Josh Hug (Fall 2019)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Configure nice plotting defaults - (this must be done in a cell separate
# from %matplotlib call)
plt.style.use('seaborn')
sns.set_context('talk', font_scale=1.4)
plt.rcParams['figure.figsize'] = (10, 7)
def arbitrary(x):
return (x**4 - 15*x**3 + 80*x**2 - 180*x + 144)/10
x = np.linspace(1, 7, 100)
plt.plot(x, arbitrary(x))
axes = plt.gca()
axes.set_ylim([-1, 3]);
# plt.savefig("fx4.png", dpi = 300, bbox_inches = "tight")
Visually, we can see above that the minimum is somewhere around 5.3ish. Let's see if we can figure out how to find the exact minimum algorithmically.
One way very slow and terrible way would be manual guess-and-check.
arbitrary(5.2)
A somewhat better approach is to use brute force to try out a bunch of x values and return the one that yields the lowest loss.
def simple_minimize(f, xs):
y = [f(x) for x in xs]
return xs[np.argmin(y)]
simple_minimize(arbitrary, np.linspace(1, 7, 20))
This process is essentially the same as before where we made a graphical plot, it's just that we're only looking at 20 selected points.
xs = np.linspace(1, 7, 200)
sparse_xs = np.linspace(1, 7, 21)
ys = arbitrary(xs)
sparse_ys = arbitrary(sparse_xs)
plt.plot(xs, ys, label = "f")
plt.plot(sparse_xs, sparse_ys, 'r*', label = "f");
# plt.savefig("f_brute_force.png", dpi=300, bbox_inches = "tight")
This basic approach is incredibly inefficient, and suffers from two major flaws:
Instead of choosing all of our guesses ahead of time, we can instead start from a single guess and try to iteratively improve on our choice.
They key insight is this: If the derivative of the function is negative, that means the function is decreasing, so we should go to the right (i.e. pick a bigger x). If the derivative of the function is positive, that means the function is increasing, so we should go to the left (i.e. pick a smaller x).
Thus, the derivative tells us which way to go.
#desmos demo: https://www.desmos.com/calculator/twpnylu4lr
def derivative_arbitrary(x):
return (4*x**3 - 45*x**2 + 160*x - 180)/10
plt.plot(x, arbitrary(x))
plt.plot(x, derivative_arbitrary(x), '--')
plt.legend(['f(x)', 'df(x)'])
axes = plt.gca()
axes.set_ylim([-1, 3]);
def plot_arbitrary():
x = np.linspace(1, 7, 100)
plt.plot(x, arbitrary(x))
axes = plt.gca()
axes.set_ylim([-1, 3])
def plot_x_on_f(f, x):
y = f(x)
default_args = dict(label=r'$ \theta $', zorder=2,
s=200, c=sns.xkcd_rgb['green'])
plt.scatter([x], [y], **default_args)
def plot_x_on_f_empty(f, x):
y = f(x)
default_args = dict(label=r'$ \theta $', zorder=2,
s=200, c = 'none', edgecolor=sns.xkcd_rgb['green'])
plt.scatter([x], [y], **default_args)
def plot_tangent_on_f(f, x, eps=1e-6):
slope = ((f(x + eps) - f(x - eps))
/ (2 * eps))
xs = np.arange(x - 1, x + 1, 0.05)
ys = f(x) + slope * (xs - x)
plt.plot(xs, ys, zorder=3, c=sns.xkcd_rgb['green'], linestyle='--')
plot_arbitrary()
plot_x_on_f(arbitrary, 2)
plot_tangent_on_f(arbitrary, 2)
plt.xlabel('x')
plt.ylabel('f(x)');
#plt.savefig('dfx_1.png')
plot_arbitrary()
plot_x_on_f(arbitrary, 6)
plot_tangent_on_f(arbitrary, 6)
#plt.savefig('dfx_2.png')
Armed with this knowledge, let's try to see if we can use the derivative to optimize the function.
derivative_arbitrary(4)
derivative_arbitrary(4 + 0.4)
derivative_arbitrary(4 + 0.4 + 0.64)
derivative_arbitrary(4 + 0.4 + 0.64 + 0.457)
derivative_arbitrary(4)
derivative_arbitrary(4 + 0.4)
derivative_arbitrary(4 + 0.4 + 0.64)
derivative_arbitrary(4 + 0.4 + 0.64 + 0.46)
def plot_one_step(x):
new_x = x - derivative_arbitrary(x)
plot_arbitrary()
plot_x_on_f(arbitrary, new_x)
plot_x_on_f_empty(arbitrary, x)
print(f'old x: {x}')
print(f'new x: {new_x}')
plot_one_step( 5.080862485230506)
def plot_one_step_better(x):
new_x = x - 0.3 * derivative_arbitrary(x)
plot_arbitrary()
plot_x_on_f(arbitrary, new_x)
plot_x_on_f_empty(arbitrary, x)
print(f'old x: {x}')
print(f'new x: {new_x}')
plot_one_step_better(5.315427176589101)
Written as a recurrence relation, the process we've described above is:
$$ x^{(t+1)} = x^{(t)} - 0.3 \frac{d}{dx} f(x) $$This algorithm is also known as "gradient descent".
Given a current $\gamma$, gradient descent creates its next guess for $\gamma$ based on the sign and magnitude of the derivative.
Our choice of 0.3 above was totally arbitrary. Naturally, we can generalize by replacing it with a parameter, typically represented by $\alpha$, and often called the "learning rate".
We can also write up this procedure in code as given below:
def gradient_descent(df, initial_guess, alpha, n):
guesses = [initial_guess]
guess = initial_guess
while len(guesses) < n:
guess = guess - alpha * df(guess)
guesses.append(guess)
return np.array(guesses)
trajectory = gradient_descent(derivative_arbitrary, 4, 1.5, 20)
trajectory
Below, we see a visualization of the trajectory taken by this algorithm.
trajectory = gradient_descent(derivative_arbitrary, 1, 0.9, 20)
plot_arbitrary()
plt.plot(trajectory, arbitrary(trajectory));
Above, we've simply run our algorithm a fixed number of times. More sophisticated implementations will stop based on a variety of different stopping criteria, e.g. error getting too small, error getting too large, etc. We will not discuss these in our course.
In the next part, we'll return to the world of data science and see how this procedure might be useful for optimizing models.
plt.rcParams['figure.figsize'] = (4, 4)
plt.rcParams['figure.dpi'] = 150
plt.rcParams['lines.linewidth'] = 3
sns.set()
We'll continue where we left off earlier. We'll see 5 different ways of computing parameters for a 1D, then 2D linear model. These five techniques will be:
Let's consider a case where we have a linear model with no offset. That is, we want to find the parameter $\gamma$ such that the L2 loss is minimized.
tips = sns.load_dataset("tips")
sns.scatterplot(tips["total_bill"], tips["tip"]);
We'll use a one parameter model that the output is $\hat{\gamma}$ times the x value. For example if $\hat{\gamma} = 0.1$, then $\hat{y} = \hat{\gamma} x$, and we are making the prediction line below.
sns.scatterplot(tips["total_bill"], tips["tip"])
x = tips["total_bill"]
y_hat = 0.1 * x
plt.plot(x, y_hat, 'r')
plt.legend(['$\hat{y}$', '$y$']);
#plt.savefig("tip_vs_total_bill.png", dpi=300)
Suppose we select the L2 loss as our loss function. In this case, our goal will be to minimize the mean squared error.
Let's start by writing a function that computes the MSE for a given choice of $\gamma$ on our dataset.
def mse_loss(gamma, x, y_obs):
y_hat = gamma * x
return np.mean((y_hat - y_obs) ** 2)
x = tips["total_bill"]
y_obs = tips["tip"]
mse_loss(0.1, x, y_obs)
Our goal is to find the $\hat{\gamma}$ with minimum MSE.
On HW6, you'll show using calculus that the optimal answer is:
$$\hat{\gamma} = \frac{\sum(x_i y_i)}{\sum(x_i^2)}$$We can calculate this value below.
np.sum(tips["tip"] * tips["total_bill"])/np.sum(tips["total_bill"]**2)
Alternately, we can use the generic equation for linear regression that we derived in lecture using the definition of orthogonality.
X = tips[["total_bill"]]
y = tips["tip"]
np.linalg.inv(X.T @ X) @ X.T @ y
Since x
and y_obs
are fixed, the only variable is gamma
.
For clarity, let's define a python function that returns the MSE as a function of a single argument gamma
.
def mse_single_arg(gamma):
"""Returns the MSE on our data for the given gamma"""
x = tips["total_bill"]
y_obs = tips["tip"]
y_hat = gamma * x
return mse_loss(gamma, x, y_obs)
mse_single_arg(0.1)
Thus we can plot the MSE as a function of gamma
. It turns out to look pretty smooth, and quite similar to a parabola.
gammas = np.linspace(0, 0.2, 200)
x = tips["total_bill"]
y_obs = tips["tip"]
MSEs = [mse_single_arg(gamma) for gamma in gammas]
plt.plot(gammas, MSEs)
plt.xlabel(r"Choice for $\hat{\gamma}$")
plt.ylabel(r"MSE");
#plt.savefig("tips_MSE_vs_gamma.png", dpi=300, bbox_inches = "tight")
The minimum appears to be around $\hat{\gamma} = 0.14$.
Recall our simple_minimize function from earlier, redefined below for your convenience.
def simple_minimize(f, xs):
y = [f(x) for x in xs]
return xs[np.argmin(y)]
simple_minimize(mse_single_arg, np.linspace(0, 0.2, 21))
As before, what we're doing is computing all the starred values below and then returning the $\hat{\theta}$ that goes with the minimum value.
gammas = np.linspace(0, 0.2, 200)
sparse_gammas = np.linspace(0, 0.2, 21)
loss = [mse_single_arg(gamma) for gamma in gammas]
sparse_loss = [mse_single_arg(gamma) for gamma in sparse_gammas]
plt.plot(gammas, loss)
plt.plot(sparse_gammas, sparse_loss, 'r*')
plt.xlabel(r"Choice for $\hat{\gamma}$")
plt.ylabel(r"MSE");
#plt.savefig("tips_brute_force.png", dpi=300, bbox_inches = "tight")
Another approach is to use our 1D gradient descent algorithm from earlier.
def gradient_descent(df, initial_guess, alpha, n):
guesses = [initial_guess]
guess = initial_guess
while len(guesses) < n:
guess = guess - alpha * df(guess)
guesses.append(guess)
return np.array(guesses)
To use this function, we need to compute the derivative of the MSE. The MSE is repeated below for convenience.
def mse_loss(gamma, x, y_obs):
y_hat = gamma * x
return np.mean((y_hat - y_obs) ** 2)
def mse_loss_derivative(gamma, x, y_obs):
y_hat = gamma * x
return np.mean(2 * (y_hat - y_obs) * x)
#def squared_loss_derivative(y_hat, y_obs, x):
# """Returns the derivative of the squared loss for a single prediction"""
# return 2*(y_hat - y_obs)*x
#
#def mse_derivative(y_hat, y_obs, x):
# """Returns the derivative of the MSE"""
# return np.mean(squared_loss_derivative(y_hat, y_obs, x))
We can try out different values of gamma
and see what we get back as our derivative.
gamma = 0.1
x = tips["total_bill"]
y_obs = tips["tip"]
mse_loss_derivative(gamma, x, y_obs)
Just like our mse_of_gamma
, we can write a function that returns the derivative of the MSE as a function of a single argument gamma
.
def mse_loss_derivative_single_arg(gamma):
x = tips["total_bill"]
y_obs = tips["tip"]
return mse_loss_derivative(gamma, x, y_obs)
mse_loss_derivative_single_arg(0.1)
gradient_descent(mse_loss_derivative_single_arg, 0.05, 0.0001, 50)
In the context of minimizing loss, we can write out the gradient descent rule for generating the next $\gamma$ as:
$$ \gamma^{(t+1)} = \gamma^{(t)} - \alpha \frac{\partial}{\partial \gamma} L(\gamma^{(t)}, \Bbb{X}, \vec{\hat{y}}) $$Here $L$ is our chosen loss function, $\Bbb{X}$ is our design matrix, and $\vec{\hat{y}}$ are our observations. During the gradient descent algorithm, we treat $\Bbb{X}$ and $\vec{\hat{y}}$ as constants.
We can also use scipy.optimize.minimize
.
import scipy.optimize
from scipy.optimize import minimize
minimize(mse_single_arg, x0 = 0)
A natural question that arises: How does scipy.optimize.minimize
work? We won't discuss the exact algorithm used by the code (see this wikipedia page about the BFGS algorithm if you're curious, though).
Gradient descent is related to BFGS, though generally doesn't work as well. Comparison of numerical optimization algorithms is very far beyond the scope of our course.
We can also go one level of abstraction higher and simply fit a linear model using sklearn.
import sklearn.linear_model
from sklearn.linear_model import LinearRegression
model = LinearRegression(fit_intercept = False)
X = tips[["total_bill"]]
y = tips["tip"]
model.fit(X, y)
model.coef_
#tips datset
data = sns.load_dataset("tips")
data.head()
Now suppose we improve our model so that we want to predict the tip from the total_bill plus a constant offset, in other words:
$$\textrm{tip} = \hat{\theta}_0 + \hat{\theta}_1 \textrm{bill}$$To keep things in the simple framework from lecture, we can create a design matrix with the bias vector in one column and the total_bill in the other.
tips_with_bias = tips.copy()
tips_with_bias["bias"] = 1
X = tips_with_bias[["total_bill", "bias"]]
X.head(5)
Now, we can give our predictions as $$\vec{\hat{y}} = f_{\vec{\hat{\theta}}}(\Bbb{X}) = \Bbb{X} \vec{\hat{\theta}}$$
For example, the predictions below correspond to assuming every table leaves a tip of \$1.50 plus 5% of their total bill.
X @ np.array([0.05, 1.50])
Throughout this problem, we'll assume we want to minimize the mean squared error of our predictions, i.e.
def mse_loss(theta, X, y_obs):
y_hat = X @ theta
return np.mean((y_hat - y_obs) ** 2)
For example, the loss assuming the model described above is:
mse_loss(np.array([0.05, 1.50]), X, y_obs)
y_obs
X = tips_with_bias[["total_bill", "bias"]]
X.head(5)
y_obs = tips["tip"]
y_obs.head(5)
theta_hat = np.linalg.inv(X.T @ X) @ X.T @ y_obs
theta_hat
#Note: It's generally a better idea to use np.linalg.solve.
#np.linalg.inv is slow and can sometimes return incorrect results due to rounding issues.
#np.linalg.solve is faster and generally better behaved.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y_obs)
theta_hat
As before, we can simply try out a bunch of theta values and see which works best. In this case, since we have a 2D theta, there's a much bigger space of possible values to try.
def mse_loss(theta, X, y_obs):
y_hat = X @ theta
return np.mean((y_hat - y_obs) ** 2)
As before, it's convenient to first create an mse function of a single argument.
X = tips_with_bias[["total_bill", "bias"]]
y_obs = tips["tip"]
def mse_loss_single_arg(theta):
return mse_loss(theta, X, y_obs)
mse_loss_single_arg([0.01, 0.02])
Using this function, we can create a 3D plot. This uses lots of syntax you've never seen.
import plotly.graph_objects as go
uvalues = np.linspace(0, 0.2, 10)#)1)
vvalues = np.linspace(0, 2, 10)#)1)
(u,v) = np.meshgrid(uvalues, vvalues)
thetas = np.vstack((u.flatten(),v.flatten()))
MSE = np.array([mse_loss_single_arg(t) for t in thetas.T])
loss_surface = go.Surface(x=u, y=v, z=np.reshape(MSE, u.shape))
ind = np.argmin(MSE)
optimal_point = go.Scatter3d(name = "Optimal Point",
x = [thetas.T[ind,0]], y = [thetas.T[ind,1]],
z = [MSE[ind]],
marker=dict(size=10, color="red"))
fig = go.Figure(data=[loss_surface, optimal_point])
fig.update_layout(scene = dict(
xaxis_title = "theta0",
yaxis_title = "theta1",
zaxis_title = "MSE"))
fig.show()
Gradient descent is exactly like it was before, only now our gradient is somewhat more complicated.
def mse_gradient(theta, X, y_obs):
"""Returns the gradient of the MSE on our data for the given theta"""
x0 = X.iloc[:, 0]
x1 = X.iloc[:, 1]
dth0 = np.mean(-2 * (y_obs - theta[0] * x0 - theta[1] * x1) * x0)
dth1 = np.mean(-2 * (y_obs - theta[0] * x0 - theta[1] * x1) * x1)
return np.array([dth0, dth1])
X = tips_with_bias[["total_bill", "bias"]]
y_obs = tips["tip"]
mse_gradient(np.array([0, 0]), X, y_obs)
def mse_gradient_single_arg(theta):
"""Returns the gradient of the MSE on our data for the given theta"""
X = tips_with_bias[["total_bill", "bias"]]
y_obs = tips["tip"]
return mse_gradient(theta, X, y_obs)
mse_gradient_single_arg(np.array([0, 0]))
gradient_descent(mse_gradient_single_arg, np.array([0, 0]), 0.001, 20)
If you play around with the code above, you'll see that it's pretty finicky about the start point and learning rate.
Thus, another approach is to use a more sophisticated numerical optimization library.
For reference, the general matrix form of the gradient is given below. We have not discussed how to derive this in class.
def mse_gradient(theta, X, y_obs):
"""Returns the gradient of the MSE on our data for the given theta"""
n = len(X)
return -2 / n * (X.T @ y_obs - X.T @ X @ theta)
scipy.optimize.minimize(mse_loss_single_arg, x0 = [0, 0])
As before, we can also go one level of abstraction higher and simply fit a linear model using sklearn.
We can either do this by using our tips_with_bias
and fit_intercept = False
, or with our original tips
dataframe and fit_intercept = True
.
import sklearn.linear_model
from sklearn.linear_model import LinearRegression
model = LinearRegression(fit_intercept = False)
X = tips_with_bias[["total_bill", "bias"]]
y = tips["tip"]
model.fit(X, y)
model.coef_
model = LinearRegression()
X = tips[["total_bill"]]
y = tips["tip"]
model.fit(X, y)
model.coef_
model.intercept_
sns.scatterplot(tips["total_bill"], tips["tip"])
x = tips["total_bill"]
y_hat = model.intercept_ + model.coef_ * tips["total_bill"]
plt.plot(x, y_hat, 'r')
plt.legend(['$\hat{y}$', '$y$']);
#plt.savefig("tip_vs_total_bill_linear_regression.png", dpi=300)