{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "# Plotly plotting support\n", "import plotly.plotly as py\n", "# import plotly.offline as py\n", "# py.init_notebook_mode()\n", "\n", "# import cufflinks as cf\n", "# cf.go_offline() # required to use plotly offline (no account required).\n", "\n", "import plotly.graph_objs as go\n", "import plotly.figure_factory as ff\n", "\n", "# Make the notebook deterministic \n", "np.random.seed(42)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook which accompanies the lecture on the Bias Variance Tradeoff and Regularization.\n", "\n", "Notebook created by [Joseph E. Gonzalez](https://eecs.berkeley.edu/~jegonzal) for DS100." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Introducing Regularization\n", "\n", "In the previous notebook we adjusted the number of polynomial features to control model complexity and tradeoff bias and variance. However, this approach to managing model complexity has a few critical limitations:\n", "\n", "1. complexity varies discretely\n", "2. we may only need a few of the higher degree polynomial terms\n", "3. In general we may not have a natural way to order our basis\n", "\n", "Rather than changing the dimension we can instead apply regularization to the weights. More generally, we can adopt the framework of regularized loss minimization. \n", "\n", "$$\\large\n", "\\hat{\\theta} = \\arg \\min_\\theta \\frac{1}{n} \\sum_{i=1}^n \\textbf{Loss}\\left(y_i, f_\\theta(x_i)\\right) + \\lambda \\textbf{R}(\\theta)\n", "$$\n", "\n", "The **regularization** term $\\textbf{R}(\\theta)$ penalizes for $\\theta$ values that result in more complex and therefore higher variance models. The **regularization parameter** $\\lambda$ determines the degree of regularization to apply and is typically determined through cross validation.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Toy Dataset\n", "\n", "As with the previous lectures we will continue to use an easy to visualize synthetic dataset." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.random.seed(42)\n", "n = 50\n", "sigma = 10\n", "X = np.linspace(-10, 10, n)\n", "X = np.sort(X)\n", "Y = 2. * X + 10. + sigma * np.random.randn(n) + 20*np.sin(X) + 0.8*(X)**2\n", "X = X/5\n", "data_points = go.Scatter(name=\"data\", x=X, y=Y, mode='markers')\n", "py.iplot([data_points])\n", "\n", "## Train Test Split\n", "from sklearn.model_selection import train_test_split \n", "X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.25, random_state=42)\n", "train_points = go.Scatter(name=\"Train Data\", \n", " x=X_tr, y=Y_tr, mode='markers', marker=dict(color=\"blue\", symbol=\"o\"))\n", "test_points = go.Scatter(name=\"Test Data\",\n", " x=X_te, y=Y_te, mode='markers', marker=dict(color=\"red\", symbol=\"x\"))\n", "py.iplot([train_points, test_points], filename=\"toydataset-reg-lecture\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Polynomial Features:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Continuing from the previous lecture we will use polynomial features." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def poly_phi(k):\n", " return lambda X: np.array([np.sin(X*5)] + [X ** i for i in range(1, k+1)]).T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "

\n", "\n", "# Ridge Regression\n", "\n", "There are many forms for $\\textbf{R}(\\theta)$ but a common form is the squared **$L^2$** norm of $\\theta$.\n", "\n", "$$\\large\n", "\\large \\textbf{R}_{L^2}(\\theta) = \n", "\\large||\\theta||_2^2 = \\theta^T \\theta = \\sum_{k=1}^p \\theta_k^2\n", "$$\n", "\n", "In the context of least squares regression this is often referred to as **Ridge Regression** with the objective:\n", "\n", "$$\\large\n", "\\hat{\\theta} = \\arg \\min_\\theta \\frac{1}{n} \\sum_{i=1}^n \\left(y_i - f_\\theta(x_i)\\right)^2 + \\lambda ||\\theta||_2^2\n", "$$\n", "\n", "This is also sometimes called [Tikhonov Regularization](https://en.wikipedia.org/wiki/Tikhonov_regularization). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Deriving the optimal $\\hat{\\theta}$ with $L^2$ Regularization\n", "\n", "We return to our linear model formulation:\n", "\n", "$$\\large\n", "f_\\theta(x) = x^T \\theta\n", "$$\n", "\n", "Using the standard matrix notation:\n", "\n", " \n", "\n", "We can rewrite the objection\n", "\n", "\n", "\\begin{align}\\large\n", "\\hat{\\theta}_{\\text{L2}} = \\arg\\min_\\theta \\frac{1}{n}\\left(Y - X \\theta \\right)^T \\left(Y - X \\theta \\right) + \\lambda \\theta^T \\theta\n", "\\end{align}\n", "\n", "Expanding the objective term:\n", "\n", "\\begin{align}\\large\n", "L_\\lambda(\\theta) = \\left(Y - X \\theta \\right)^T \\left(Y - X \\theta \\right) + \\lambda \\theta^T \\theta = \n", "\\frac{1}{n} \\left( \n", " Y^T Y - 2 Y^T X \\theta + \\theta^T X^T X \\theta \n", "\\right) + \\lambda \\theta^T \\theta\n", "\\end{align}\n", "\n", "Taking the **gradient** with respect to $\\theta$:\n", "\n", "\n", "\\begin{align} \\large\n", "\\nabla_\\theta L_\\lambda(\\theta)\n", "& \\large =\n", "\\frac{1}{n} \\left( \n", " \\nabla_\\theta Y^T Y - \\nabla_\\theta 2 Y^T X \\theta + \\nabla_\\theta \\theta^T X^T X \\theta \n", "\\right) + \\nabla_\\theta \\lambda \\theta^T \\theta \\\\\n", "& \\large =\n", "\\frac{1}{n} \\left( \n", " 0 - 2 X^T Y + 2 X^T X \\theta \n", "\\right) + 2\\lambda \\theta\n", "\\end{align} \n", "\n", "The above gradient derivation uses the following identities:\n", "1. $\\large \\nabla_\\theta \\left( A \\theta \\right) = A^T$\n", "1. $\\large \\nabla_\\theta \\left( \\theta^T A \\theta \\right) = A\\theta + A^T \\theta$ and $\\large A = X^T X$ is symmetric\n", "\n", "Setting the gradient equal to zero we get a **regularized** version of the **normal equations**:\n", "\n", "$$\\large\n", "(X^T X + n \\lambda I) \\theta = X^T Y\n", "$$\n", "\n", "$$\\large\n", " \\theta = \\left(X^T X + n \\lambda I \\right)^{-1} X^T Y\n", "$$\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "

\n", "\n", "# Optimal $\\theta$ under $L^2$ regularization\n", "\n", "\n", "Because $\\lambda$ is a tuning parameter we often will absorb the $n$ into $\\lambda$ and rewrite the above equations as:\n", "\n", "\n", "\n", "$$\\large\n", "(X^T X + \\lambda I) \\theta = X^T Y\n", "$$\n", "\n", "$$\\large\n", " \\theta = \\left(X^T X + \\lambda I \\right)^{-1} X^T Y\n", "$$\n", "\n", "**Notice:** The addition of $\\lambda I$ ensures that $X^T X + \\lambda I$ is **full rank**. This addresses the earlier issue in least-squares regression when we had co-linear features.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "

\n", "\n", "\n", "# How does $L^2$ Regularization Help\n", "\n", "The $L^2$ penalty helps in several ways:\n", "\n", "**Manages Model Complexity**\n", "1. It ensures that uninformative features weights are relatively small (near zero) mitigating the affect of those features. \n", "1. It evenly distributes weight over similar features to reduce variance.\n", "\n", "**Practical Concerns**\n", "1. It removes degeneracy created by co-linear features\n", "1. It improves the numerical stability of\n", "\n", "---\n", "

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Visualizing $L^2$ Regularization\n", "\n", "In the following we visualize the regularization surface. Notice that it pushes weights towards zero but is relatively smooth around the origin." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "theta_range = np.linspace(-2,2,100) \n", "(u,v) = np.meshgrid(theta_range, theta_range)\n", "w_values = np.vstack((u.flatten(), v.flatten())).T\n", "\n", "def l2_sq_reg(w):\n", " return np.sum(w**2)\n", "reg_values = [l2_sq_reg(w) for w in w_values]\n", "reg_surface = go.Surface(\n", " x = u, y = v,\n", " z = np.reshape(reg_values, u.shape),\n", " contours=dict(z=dict(show=True))\n", ")\n", "\n", "# Axis labels\n", "layout = go.Layout(\n", " scene=go.Scene(\n", " xaxis=go.XAxis(title='w0'),\n", " yaxis=go.YAxis(title='w1'),\n", " aspectratio=dict(x=2.,y=2., z=1.), \n", " camera=dict(eye=dict(x=-2, y=-2, z=2))\n", " )\n", ")\n", "fig = go.Figure(data = [reg_surface], layout = layout)\n", "py.iplot(fig, filename=\"L2regularization\")" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "# Applying $L^2$ Regularization using Scikit Learn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the following we use the **linear_model.Ridge** model in scikit learn. To demonstrate the efficacy of regularization we will use the degree 32 polynomials which substantially overfit the data. " ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [], "source": [ "Phi = poly_phi(32)(X_tr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

\n", "\n", "\n", "# Normalization and the Intercept\n", "\n", "Before we proceed it is important that we appropriately normalize the data. Because the standard $L^2$ regularization methods treat each dimensional equivalently it is important that all dimensions are in the same range of values. \n", "\n", "However, if we examine the polynomial features in $\\Phi$ we notice that the distribution of values \n", "can be quite different for each dimension.\n", "\n", "For example in the following we plot the degree 3 and degree 6 dimensions:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "py.iplot(ff.create_distplot([Phi[:,3], Phi[:,6]], group_labels=['x^3', 'x^6']), filename=\"phi_dist_plot\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Notice:**\n", "1. difference in spread\n", "1. asymmetry \n", "\n", "---\n", "

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Standardizing the Data\n", "\n", "A common transformation is to center and scale the features to zero mean and unit variance:\n", "\n", "$$\\large\n", "z = \\frac{x - \\mu}{\\sigma}\n", "$$\n", "\n", "This an be accomplished by applying the StandardScalar scikit learn preprocessor." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "StandardScaler(copy=True, with_mean=True, with_std=True)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.preprocessing import StandardScaler\n", "\n", "normalizer = StandardScaler()\n", "normalizer.fit(poly_phi(32)(X_tr))" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# we define the phi function for reuse in the future\n", "def phi_fun(X):\n", " return normalizer.transform(poly_phi(32)(X))\n", "\n", "Phi = phi_fun(X_tr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Notice in the above code snippet we define a new $\\Phi$ function that applies the pre-trained normalization. This procedure **learns** something about the training data in the formulation of the normalizer. **\n", "\n", "1. Will this be an issue when we cross validate on the training data?\n", "\n", "This process of transformations: feature construction, rescaling, and then subsequently model fitting form **pipelines**. Scikit learn actually has a [pipeline framework](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) to aid with this process. \n", "\n", "---\n", "

\n", "\n", "\n", "In the following we plot the spread of the transformed dimensions. They are still not the same but are at least on the same scale." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "py.iplot(ff.create_distplot([Phi[:,3], Phi[:,6]], group_labels=['x^4', 'x^7'], bin_size=0.3), filename=\"phi_dist_plot2\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Fitting the Ridge Regression Model \n", "\n", "We are now finally ready to fit the ridge regression model. However, we haven't yet decided on a value for the regularization parameter $\\lambda$. Therefore, we will try a range of values. " ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import sklearn.linear_model as linear_model\n", "lam_values = np.hstack((np.logspace(-8,-1,10), np.logspace(-1,2,10),np.logspace(2,10,10)))\n", "\n", "models = [\n", " linear_model.Ridge(alpha = lam).fit(Phi, Y_tr)\n", " for lam in lam_values\n", "]\n", "\n", "# model = linear_model.Ridge(alpha = lam)\n", "# model.fit(Phi, Y_tr)\n", "# models.append(model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Move the slider in the following plot to see the fit for various $\\lambda$ values." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Make the x values where plot points will be generated\n", "X_plt = np.linspace(np.min(X)-1, np.max(X)+1, 200)\n", "\n", "# Generate the Plotly line objects by predicting the value at each X_plt\n", "lines = []\n", "for k in range(len(models)):\n", " ytmp = models[k].predict(phi_fun(X_plt))\n", " # Plotting software fails with large numbers\n", " ytmp[ytmp > 500] = 500\n", " ytmp[ytmp < -500] = -500\n", " lines.append(\n", " go.Scatter(name=\"Lambda \"+ str(lam_values[k]), \n", " x=X_plt, y = ytmp, visible=False))\n", "\n", "# Construct steps for the interactive slider\n", "lines.visible=True\n", "steps = []\n", "for i in range(len(lines)):\n", " step = dict(\n", " label = lines[i]['name'],\n", " method = 'restyle',\n", " args = ['visible', [False] * (len(lines)+1)],\n", " )\n", " step['args'] = True\n", " step['args'][i+1] = True # Toggle i'th trace to \"visible\"\n", " steps.append(step)\n", "\n", "# Build the slider object\n", "sliders = [dict(active = 0, pad = {\"t\": 20}, steps = steps)]\n", "\n", "# render the plot\n", "layout = go.Layout(xaxis=dict(range=[np.min(X_plt), np.max(X_plt)]), \n", " yaxis=dict(range=[np.min(Y) -5 , np.max(Y) + 5]),\n", " sliders=sliders,\n", " showlegend=False)\n", "py.iplot(go.Figure(data = [train_points] + lines, layout=layout), filename=\"ridge_regression_lines\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For large $\\lambda$ values we see a smoother fit. Notice however that the model does not appear to perform well outside of the input data range. This is a common problem with polynomial feature transformations." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using Cross Validation in the RidgeCV Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because cross validation is essential to determining the optimal regularization parameter there is built-in support for cross validation in linear_model.RidgeCV. Here we call the built-in cross validation routine passing the lambda values we wish to consider." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ridge_cv_model = linear_model.RidgeCV(alphas=lam_values, store_cv_values=True)\n", "# Fit the model to our training data\n", "ridge_cv_model.fit(Phi, Y_tr)\n", "\n", "# Plot the predicted model\n", "ridge_cv_line = go.Scatter(name = \"Ridge CV Curve\",\n", " x = X_plt,\n", " y = ridge_cv_model.predict(phi_fun(X_plt)))\n", "# render the plot\n", "layout = go.Layout(xaxis=dict(range=[np.min(X_plt), np.max(X_plt)]), \n", " yaxis=dict(range=[np.min(Y) -5 , np.max(Y)+5]))\n", "py.iplot(go.Figure(data = [train_points, ridge_cv_line], layout=layout), filename=\"ridge_cv_line\")" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ridge_cv_loss = np.sqrt(np.mean(ridge_cv_model.cv_values_,axis=0))\n", "py.iplot(\n", " go.Figure(\n", " data=[go.Scatter(name=\"CV Curve\", x=lam_values, y=ridge_cv_loss),\n", " go.Scatter(name=\"Optimum\", x=[lam_values[np.argmin(ridge_cv_loss)]], y=[np.min(ridge_cv_loss)],\n", " mode=\"markers\", marker=dict(color=\"red\", size=10))\n", " ],\n", " layout=go.Layout(xaxis=dict(title=\"Lambda\", type=\"log\"), \n", " yaxis=dict(title=\"CV RMSE\", range=[10,70]))),\n", " filename=\"ridge_cv_model_curve\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question:** What is going on with small $\\lambda$ values?\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# $L^1$ Regularization (Lasso)\n", "\n", "Another common regularization function is the sum of the absolute values:\n", "\n", "$$\\large\n", "\\large \\textbf{R}_{L^1}(\\theta) = \\sum_{k=1}^p |\\theta_k| \n", "$$\n", "\n", "This is called **$L^1$** regularization as it corresponds to the $L^1$ norm. Least squares linear regression in conjunction with the $L^1$ norm is often called the [Lasso](http://www.jstor.org/stable/2346178?seq=1#page_scan_tab_contents) (Least Absolute Shrinkage and Selection Operator). \n", "\n", "In contrast to the $L^2$ norm the $L^1$ norm encourages $\\theta_i$ values to be exactly zero in less informative dimensions thereby reducing model complexity. To see how the $L^1$ encourages sparsity consider the following illustration: \n", "\n", "\n", " \n", "\n", "In the above figures we plot the loss for settings of a two dimensional ($\\theta_1$ and $\\theta_2$) model as the elliptical contours. Without regularization the solution would be at the center of the contours. By imposing regularization we constrain the solution to living in the \"norm ball\" centered at the origin (all zero theta vector). As we increase $\\lambda$ we actually shrink the ball. Unlike the $L^2$ solutions in the $L^1$ will often \"slide to the corners\" which are aligned with axis causing subsets of the $\\theta$ vector to be exactly zero. \n", "\n", "In some settings a compromise can be achieved by using both the $L^2$ and $L^1$ norms to encourage sparsity while ensuring relatively co-linear features are given equal weight (to improve robustness).\n" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "theta_range = np.linspace(-2,2,100) \n", "(u,v) = np.meshgrid(theta_range, theta_range)\n", "w_values = np.vstack((u.flatten(), v.flatten())).T\n", "\n", "def l1_reg(w):\n", " return np.sum(np.abs(w))\n", "reg_values = [l1_reg(w) for w in w_values]\n", "reg_surface = go.Surface(\n", " x = u, y = v,\n", " z = np.reshape(reg_values, u.shape),\n", " contours=dict(z=dict(show=True))\n", ")\n", "\n", "# Axis labels\n", "layout = go.Layout(\n", " scene=go.Scene(\n", " xaxis=go.XAxis(title='w0'),\n", " yaxis=go.YAxis(title='w1'),\n", " aspectratio=dict(x=2.,y=2., z=1.), \n", " camera=dict(eye=dict(x=-2, y=-2, z=2))\n", " )\n", ")\n", "fig = go.Figure(data = [reg_surface], layout = layout)\n", "py.iplot(fig, filename=\"L1regularization\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## $L^1$ regularized regression in scikit-learn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the following we use the scikit-learn Lasso package. As before we will try a range of values for the regularization parameter. " ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [], "source": [ "lam_values = np.logspace(-1.3,2.5,20)\n", "models = []\n", "for lam in lam_values:\n", " model = linear_model.Lasso(alpha = lam, max_iter=100000)\n", " model.fit(Phi, Y_tr)\n", " models.append(model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again we can plot the fit for different regularization penalties." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Make the x values where plot points will be generated\n", "X_plt = np.linspace(np.min(X)-1, np.max(X)+1, 200)\n", "\n", "# Generate the Plotly line objects by predicting the value at each X_plt\n", "lines = []\n", "# Make the full polynomial\n", "poly = np.array([r\"\\theta_0 \\sin(x)\"] + [r\" \\theta_{\" + str(d) + \"} x^{\"+str(d)+\"} \" for d in range(1, 33)])\n", "\n", "\n", "for k in range(len(models)):\n", " ytmp = models[k].predict(phi_fun(X_plt))\n", " # Plotting software fails with large numbers\n", " ytmp[ytmp > 500] = 500\n", " ytmp[ytmp < -500] = -500\n", " num_features = np.sum(~np.isclose(models[k].coef_, 0.))\n", " # get all the nonzer terms\n", " # non_zero_terms = ~np.isclose(models[k].coef_,0)\n", " # poly_str = \"$\" +(\"+\".join(poly[non_zero_terms])) + \"$\"\n", " lines.append(\n", " go.Scatter(name=(\n", " \"Lambda \"+ str(lam_values[k]) + \n", " \" num features = \" + str(num_features)\n", " + \" out of \" + str(len(models[k].coef_))), \n", " x=X_plt, y = ytmp, visible=False))\n", "\n", "# Construct steps for the interactive slider\n", "lines.visible=True\n", "steps = []\n", "for i in range(len(lines)):\n", " step = dict(\n", " label = lines[i]['name'],\n", " method = 'restyle',\n", " args = ['visible', [False] * (len(lines)+1)],\n", " )\n", " step['args'] = True\n", " step['args'][i+1] = True # Toggle i'th trace to \"visible\"\n", " steps.append(step)\n", "\n", "# Build the slider object\n", "sliders = [dict(active = 0, pad = {\"t\": 20}, steps = steps)]\n", "\n", "# render the plot\n", "layout = go.Layout(xaxis=dict(range=[np.min(X_plt), np.max(X_plt)]), \n", " yaxis=dict(range=[np.min(Y) -5 , np.max(Y) + 5]),\n", " sliders=sliders,\n", " showlegend=False)\n", "py.iplot(go.Figure(data = [train_points] + lines, layout=layout), filename=\"lasso_regression_lines\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Notice:\n", "\n", "**What happens in the above plot for larger values of $\\lambda$? **\n", "\n", "---\n", "

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Cross Validated Solution\n", "\n", "As with Ridge regression, scikit-learn provides support for cross validation directly in the Lasso model training procedure. In the following we LassoCV to determine the best regularization parameter." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lasso_cv_model = linear_model.LassoCV(alphas=lam_values, max_iter=1000000)\n", "# Fit the model to our training data\n", "lasso_cv_model.fit(Phi, Y_tr)\n", "\n", "# Plot the predicted model\n", "lasso_cv_line = go.Scatter(name = \"Lasso CV Curve\",\n", " x = X_plt,\n", " y = lasso_cv_model.predict(phi_fun(X_plt)))\n", "# render the plot\n", "layout = go.Layout(xaxis=dict(range=[np.min(X_plt), np.max(X_plt)]), \n", " yaxis=dict(range=[np.min(Y) -5 , np.max(Y)+5]))\n", "py.iplot(go.Figure(data = [train_points, lasso_cv_line], layout=layout), filename=\"lasso_cv_line\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at the polynomial terms with non-zero $\\theta$" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/markdown": [ "$\\large\\theta_0 \\sin(x)+ \\theta_{1} x^{1} + \\theta_{2} x^{2} + \\theta_{31} x^{31} + \\theta_{32} x^{32}$" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from IPython.display import display, Markdown \n", "\n", "# get all the nonzer terms\n", "non_zero_terms = ~np.isclose(lasso_cv_model.coef_,0)\n", "\n", "# Make the full polynomial\n", "poly = np.array([r\"\\theta_0 \\sin(x)\"] + [r\" \\theta_{\" + str(d) + \"} x^{\"+str(d)+\"} \" for d in range(1, 33)])\n", "\n", "# Print only the nonzero terms\n", "display(Markdown(r\"$\\large\" + (\"+\".join(poly[non_zero_terms])) + \"$\"))\n" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.metrics import mean_squared_error\n", "from sklearn.model_selection import KFold\n", "\n", "\n", "kfold_splits = 5\n", "kfold = KFold(kfold_splits, shuffle=True, random_state=42)\n", "\n", "mse_scores = []\n", "for lam in lam_values:\n", " # One step in k-fold cross validation\n", " def score_model(train_index, test_index):\n", " model = linear_model.Lasso(alpha=lam, max_iter=1000000)\n", " model.fit(Phi[train_index,:], Y_tr[train_index])\n", " return mean_squared_error(Y_tr[test_index], model.predict(Phi[test_index,]))\n", " \n", " mse_score = np.mean([score_model(tr_ind, te_ind) \n", " for (tr_ind, te_ind) in kfold.split(Phi)])\n", " mse_scores.append(mse_score)\n", "rmse_scores = np.sqrt(np.array(mse_scores))\n", "\n", "\n", "py.iplot(\n", " go.Figure(\n", " data=[go.Scatter(name=\"CV Curve\", x=lam_values, y=rmse_scores),\n", " go.Scatter(name=\"Optimum\", x=[lam_values[np.argmin(rmse_scores)]], y=[np.min(rmse_scores)],\n", " mode=\"markers\", marker=dict(color=\"red\", size=10))\n", " ],\n", " layout=go.Layout(xaxis=dict(title=\"Lambda\",type=\"log\",range=[-1.2,1.5]), \n", " yaxis=dict(title=\"CV RMSE\", range=[5,30]))),\n", " filename=\"lasso_cv_model_curve\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python [conda env:ds100]", "language": "python", "name": "conda-env-ds100-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 2 }