Loading [MathJax]/jax/output/HTML-CSS/jax.js

Lecture 14:

Intro to Optimization for Data Science



Slides by:

Jake A. Soloff

jake_soloff@berkeley.edu

Scipy has Optimization Package

from scipy.optimize import minimize

minimize(loss, x0 = ...)

So... let's call it a day?

Overview


  • Loss Minimization Framework
  • Application to Recommender Systems
  • Optimization Problems
  • Gradient Descent (GD)
  • GD for Recommender Systems

Estimation by Loss Minimization - a Recipe


  • Design a model
  • Choose loss function
  • Minimize loss function

Example - Summarize data with a single number

  • Observe: numeric data X1,,Xn.
  • Design a model: a single numeric summary θ.
  • Choose loss function: L2-loss, or sum-of-square deviations (θ)=ni=1(Xiθ)2.
  • Minimize loss function: 0=ddθ|θ=ˆθ=ni=12(ˆθXi)  ˆθ=1nni=1Xi.

Example - Recommender System

  • a company sells items, collects feedback (usually called a rating) from customers (called users).
  • rather than listing the most popular items, the company wants to make personalized recommendations
  • the cost of errors varies across applications

<img src="nytrecs.png", width=50%>

  • Observe: a "rating" rui by user u on item i. A form of feedback:
    • 5-star rating
    • Purchase
    • Dwell time
    • Not clicking "next"
  • User u does not rate every item i
    • some ratings are unobserved
  • We (may) observe features xi=(xi1,xi2) about each item i.
  • Design a model: a vector of "weights" θ=(θ1,θ2) such that ruixiθ=xi1θ1+xi2θ2.

  • Notice the right-hand-side above does not depend on u, so this model is not personalized.

  • Design a personalized model: a "preference" vector yu=(yu1,yu2) for user u such that ruixiyu (different weights for each user)
  • Choose loss function: L2-loss, or sum-of-square deviations (y1,,yu)=u,i(ruixiyu)2.
    • notice the above loss function only sums over pairs (u,i) for which we observe rui
  • Minimize loss function (but how?)

<img src="volinsky.png", width=50%>

(Koren, Bell & Volinsky 2009)

Function Minimization


An unconstrained optimization problem is denoted

minxf(x), where

  • x is a collection of decision variables.
  • f(x) is the associated cost.

In our context, we have a loss function (θ) and want to solve

minθ(θ),

which is to say we want to find ˆθ such that (ˆθ)(θ) for all θ.

Minimizing Analytically

Say (θ)=(θ1)2.

In [ ]:
plt.figure(figsize=(8,5))
x = np.arange(-5, 5, .01)
plt.plot(x, (x-1)**2, 'b')
plt.plot(1, 0.0, '|', color = 'r', 
         markersize = 20, markeredgewidth = 4)
plt.xlabel('$\\theta$')
plt.ylabel('$\ell(\\theta)$')
plt.show()
  • look at the graph and see ˆθ1.
  • notice (θ)0 and (1)=0.
  • set derivative to zero. 0=ddθ|θ=ˆθ=2(ˆθ1) so ˆθ=1.

Minimizing Analytically (cont'd)

Now (θ)=(θ1)2+(θ+2)2

In [ ]:
plt.figure(figsize=(8,5))
theta = np.arange(-4, 4, .01)
a, b = -2, 1
plt.plot(theta, (theta-a)**2, 'k--')
plt.plot(theta, (theta-b)**2, 'k--')
plt.plot(theta, (theta-a)**2+(theta-b)**2, 'b')
plt.plot([a,b], [0]*2, '|', color = 'k', 
         markersize = 20, markeredgewidth = 4)
plt.plot(0.5*(a+b), 4.5, '|', color = 'r',
        markersize = 20, markeredgewidth = 4)
plt.xlabel('$\\theta$')
plt.ylabel('$\ell(\\theta)$')
plt.show()

Minimizing Analytically (cont'd)

Now (θ)=θ2(θ24)2

In [ ]:
plt.figure(figsize=(8,5))
x = np.arange(-2.5, 2.5, .01)
plt.plot(x, x**2*(x**2-4)**2, 'b')
plt.plot([-2,-np.sqrt(4/3),0,np.sqrt(4/3),2], 
         [0, 256/27, 0, 256/27, 0], '|', color='r',
         markersize = 20, markeredgewidth = 4)
plt.xlabel('$\\theta$')
plt.ylabel('$\ell(\\theta)$')
plt.show()

Remember that ddθ=0 does not imply θ is a minimizer!

(might try 2nd derivative test)

Convexity Revisited

  • Assuming convexity, (η)=0 does tell us that η is a global minimizer, as we will now show.

Proof by Picture

In [ ]:
plt.figure(figsize=(20,10))

xx = np.linspace(-4,4,50)
yy = 3*(xx+1)**2 + 2

plt.subplot(1,2,1)
plt.plot(xx,yy)
plt.plot(np.linspace(-2,1,50), 3*np.linspace(-2,1,50) + 11,'r-')
plt.plot([-2,1],[5,14],'kx',markersize=20,markeredgewidth=2)
plt.ylim([-20,80])
plt.title("zeroth order condition for convexity") #  $f(\lambda x+(1-\lambda)y) \leq \lambda f(x) + (1-\lambda)f(y)$

plt.subplot(1,2,2)
plt.plot(xx,yy)
plt.plot(xx,12*xx+2,'r-')
plt.plot(1,14,'kx',markersize=20,markeredgewidth=2)
plt.ylim([-20,80])
plt.title("first order condition for convexity") # $f(y) \geq f(x) + (y-x)\\frac{df}{dx}(x)$
plt.show()

Direct Proof

  • Recall convextiy for the loss means for all 0λ1 (λθ+(1λ)η)λ(θ)+(1λ)(η)
  • Group terms with λ in front:

(η+λ(θη))(η)+λ((θ)(η)).

  • Subtract (η) and divide by λ:

(η+λ(θη))(η)λ(θ)(η).

  • Make ratio look like a derivative:

(η+λ(θη))(η)λ(θη)(η)(θη)((θ)(η)).

  • Hence (if is convex and differentiable)

(η)+(η)(θη)(θ).

  • If (θ) is convex, it lies above its tangent approximation everywhere.

Why convexity? Recap

  • We showed if is convex and differentiable, then for all η,θ, (η)+(η)(θη)(θ).
  • In particular, if (η)=0 (i.e. η is a critical point), then (η)(θ) for all θ.
  • Checking that (η)=0 tells us that η is a global minimizer.

What about when θ is not a single number?

  • Example. θ=(θ1,θ2) is a vector and (θ)=θ21+4θ22.
  • The gradient is (θ)=(θ1,θ2)=(2θ1,8θ2).
In [ ]:
a, b = -3, 3
x = np.arange(a, b, .05)
y = np.arange(a, b, .05)
X, Y = np.meshgrid(x, y)
z = X**2 + 4*Y**2

p2 = Surface(name = "Loss Surface",
             x = x, 
             y = y, 
             z = z, 
             colorscale = 'Viridis',
             reversescale = True,
             showscale = False)

paraboloid = Contour(name = "Loss Contour",
                     x = x, 
                     y = y, 
                     z = z, 
                     colorscale = 'Viridis',
                     reversescale = True, 
                     autocontour = True,
                     xaxis = 'x2',
                     yaxis = 'y2')

lay = Layout(xaxis = {'range' : [a, b],
                      'title' : '$\\theta_1$'}, 
             yaxis = {'range' : [a, b],
                      'title' : '$\\theta_2$'},
             scene = {"domain": {'x': [0, 0.48], 
                                 'y': [0, 1]},
                      'xaxis' : {'title' : 'θ1'},
                      'yaxis' : {'title' : 'θ2'},
                      'zaxis' : {'title' : 'ℓ(θ1,θ2)'}},
             xaxis2= {'domain' : [.52, 1],
                      'range' : [a, b],
                      'title' : '$\\theta_1$'},
             yaxis2= {'anchor' : 'x2',
                      'range' : [a, b],
                      'title' : '$\\theta_2$'},
             autosize = False,
             width = 1000,
             height = 600)

iplot(Figure(data = Data([p2, paraboloid]), layout = lay))

Zero Gradient Condition

  • If θ is a single variable, (θ)=0 means the loss is locally flat around θ.
  • If θ=(θ1,θ2), the partial derivatives θj represent the derivative along one coordinate with the other held fixed.
  • Hence if the gradient (θ)=(θ1,θ2)=0, the function is locally flat around θ.
  • If is convex and (η)=0, then η is a global minimizer.

Gradient: steepest ascent, perpendicular to level curves (contours)

In [ ]:
import plotly.figure_factory as ff

a, b = -3, 3
x = np.arange(a, b, .05)
y = np.arange(a, b, .05)

trace1 = Contour(name = "Loss Contour",
                 x = x, 
                 y = y, 
                 z = z, 
                 colorscale = 'Viridis',
                 reversescale = True, 
                 showscale = False,
                 autocontour = True,)

lay = Layout(xaxis = {'range' : [a, b],
                      'title' : '$\\theta_1$'}, 
             yaxis = {'range' : [a, b],
                      'title' : '$\\theta_2$'},
             width = 600,
             height = 600)

xx = np.linspace(a, b, 7)
yy = np.linspace(a, b, 7)
X, Y = np.meshgrid(xx, yy)

trace2 = ff.create_quiver(X, Y, -2*X, -8*Y,
                           scale = .025, 
                           arrow_scale = .15,)['data'][0]
trace2['name'] = "Negative Gradient"
trace2['marker']['color'] = 'black'
trace2['marker']['line']['width'] = 8
trace2['marker']['size'] = 8

iplot(Figure(data = Data([trace1, trace2]), layout = lay))

Gradient Descent Intuition

  • Iterative optimization algorithm: given a candidate minimizer θ(t) at step t, want an improved version θ(t+1).
  • Gradient moves along direction of steepest ascent, so (θ) moves along direction of steepest descent
  • So to get from θ(t) to θ(t+1), we might as well follow the gradient!
In [ ]:
xx = np.linspace(-4,4,50)
f = lambda x: 3*(x+1)**2 + 2
df = lambda x: 6*(x+1)
yy = f(xx)

alpha = .2

x0 = 1
x1 = x0 - alpha*df(x0)
x2 = x1 - alpha*df(x1)

plt.figure(figsize=(10,10))

plt.plot(xx,yy)
plt.plot(xx,12*xx+2,'r-')

plt.plot(x0,-19,'|k', 
         markersize = 20, markeredgewidth = 4)
plt.plot(x0,f(x0),'.k', 
         markersize = 15, markeredgewidth = 4)
plt.annotate('$\\theta^{(0)}$', [x0 - .1, -17])
plt.arrow(x0, -19, -alpha*df(x0), 0, 
          width = .5, head_length = 0.5, length_includes_head = True)
plt.annotate('$-0.2\ell\'(\\theta^{(0)})$', [x1-1, -17])

plt.axvline(x=-1, ls='--')
plt.ylim([-20,80])
plt.title('$\\ell(\\theta) = 3(\\theta+1)^2+2$')
plt.show()
In [ ]:
plt.figure(figsize=(10,10))

plt.plot(xx,yy)
plt.plot(xx,12*xx+2,'r-')

plt.plot(x1,-19,'|k', 
         markersize = 20, markeredgewidth = 4)
plt.plot(x1,f(x1),'.k', 
         markersize = 15, markeredgewidth = 4)
plt.annotate('$\\theta^{(1)}$', [x1 - .1, -17])
plt.arrow(x1, -19, -alpha*df(x1), 0, 
          width = .25, head_length = 0.25, length_includes_head = True)
plt.annotate('$-0.2\ell\'(\\theta^{(1)})$', [x1+1, -17])

plt.axvline(x = -1, ls='--')
plt.ylim([-20, 80])
plt.title('$\\ell(\\theta) = 3(\\theta+1)^2+2$')
plt.show()
In [ ]:
plt.figure(figsize=(10,10))

plt.plot(xx,yy)
plt.plot(xx,12*xx+2,'r-')

plt.plot(x2,-19,'|k', 
         markersize = 20, markeredgewidth = 4)
plt.plot(x2,f(x2),'.k', 
         markersize = 15, markeredgewidth = 4)
plt.annotate('$\\theta^{(2)}$', [x2 - .1, -17])
plt.arrow(x2, -19, -alpha*df(x2), 0, 
          width = .05, head_length = 0.05, length_includes_head = True)

plt.axvline(x = -1, ls='--')
plt.ylim([-20, 80])
plt.title('$\\ell(\\theta) = 3(\\theta+1)^2+2$')
plt.show()

Gradient Descent Algorithm

  • Initialize value θ(0) (zeros, random guess, or output of another method)

  • For t from 0 until convergence, update

θ(t+1)θ(t)α(θ(t))

  • α is called the learning rate and need not be constant (i.e. use αt).
  • For sufficiently small α we will have (θ(t+1))(θ(t)).
  • We might stop once (θ(t)) is small.
In [ ]:
TOGGLER
In [ ]:
def gradient_descent(theta0, grad, alpha, tol = 1e-4, max_iter = 1e5):
    theta_path = [theta0]
    i = 0
    while(np.linalg.norm(grad(theta_path[-1])) > tol) & (i < max_iter):
        i += 1
        theta_t = theta_path[-1]
        theta_path.append(theta_t - alpha*grad(theta_t))
    return np.array(theta_path)
In [ ]:
import plotly.figure_factory as ff

a, b = -3, 3
x = np.arange(a, b, .05)
y = np.arange(a, b, .05)

trace1 = Contour(name = "Loss Contour",
                 x = x, 
                 y = y, 
                 z = z, 
                 colorscale = 'Viridis',
                 reversescale = True, 
                 showscale = False,
                 autocontour = True,)

lay = Layout(xaxis = {'range' : [a, b],
                      'title' : '$\\theta_1$'}, 
             yaxis = {'range' : [a, b],
                      'title' : '$\\theta_2$'},
             width = 600,
             height = 600)

def trace2(theta0 = np.array([2.3, 2.5]), alpha = 0.01):
    theta_path = gradient_descent(theta0, 
                                  lambda theta: np.array([2*theta[0], 8*theta[1]]), 
                                  alpha,
                                  tol = 1e-2)
    return Scatter(name = "Theta Path", 
                     x = theta_path[:,0], 
                     y = theta_path[:,1],
                     mode = "lines+markers")

title = lambda a : '$\\alpha = ' + str(a) + '$'
iplot(Figure(data = [trace1, trace2(alpha = 0.01)], layout = {**lay, **{'title': title(.01)}}))
In [ ]:
iplot(Figure(data = [trace1, trace2(alpha = 0.1)], layout = {**lay, **{'title': title(.1)}}))
In [ ]:
iplot(Figure(data = [trace1, trace2(alpha = 0.23)], layout = {**lay, **{'title': title(.23)}}))

Automatic Differentiation

    import autograd.numpy as np
    from autograd import grad

    def loss(theta):
        ...

    grad(loss)

Example - Recommender System

In [ ]:
!pip install surprise
In [ ]:
rnames = ['uid', 'iid', 'rating', 'time']
ratings = pd.read_csv('./ml-100k/u.data', 
                      sep = '\t', 
                      header = None, 
                      names = rnames)\
                     .drop(columns = ['time'])
    
inames = ['iid', 'title', 'release date', 'video release date', 'IMDb URL', 
          'unknown', 'Action', 'Adventure', 'Animation', 'Children\'s', 
          'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy',
          'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 
          'Sci-Fi', 'Thriller', 'War', 'Western']
items = pd.read_csv('./ml-100k/u.item', 
                      sep = '|', 
                      header = None, 
                      names = inames,
                      encoding = 'iso8859_2')\
                    .drop(columns = ['release date', 'video release date', 
                                     'IMDb URL', 'unknown'])

item_names = items.iloc[:,[0,1]].set_index('iid')

tot_rating = ratings.groupby('iid').sum().drop(columns = ['uid'])
avg_rating = ratings.groupby('iid').mean().drop(columns = ['uid'])
top_titles = avg_rating[(tot_rating > 1300) & (avg_rating > 3.82)].dropna().join(item_names)
inds = list(top_titles.index-1)
top_titles.sort_values('rating', ascending = False)

In this model, both user vectors yu and item vectors xi are unobserved: ruixiyu

In [ ]:
plt.figure(figsize = (17, 17))
plt.plot(algo.qi[inds, 0], algo.qi[inds, 1], 'k.')
for i in inds:
    title = item_names.loc[i+1,'title'][:-6]
    if title.strip().endswith(', The'):
        title = "The " + title[:-6]
    plt.annotate(title, algo.qi[i])
plt.xticks([])
plt.yticks([])
plt.show()

On Worrying Selectively


"Since all models are wrong the scientist 
must be alert to what is importantly wrong. 
It is inappropriate to be concerned about 
mice when there are tigers abroad."
  - George E. P. Box, Science and Statistics

Recap

  • Loss Minimization as a Recipe for Estimation
  • Concepts and Tools from Numerical Optimization
  • Application to Recommender Systems