{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "## Plotly plotting support\n", "import plotly.plotly as py\n", "\n", "# import plotly.offline as py\n", "# py.init_notebook_mode()\n", "\n", "import plotly.graph_objs as go\n", "import plotly.figure_factory as ff\n", "\n", "# Make the notebook deterministic \n", "np.random.seed(42)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notebook created by [Joseph E. Gonzalez](https://eecs.berkeley.edu/~jegonzal) for DS100." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "# Feature Transformations\n", "\n", "In addition to transforming categorical and text features to real valued representations, we can often improve model performance through the use of additional feature transformations. Let's start with a simple toy example\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Linear Models for Non-Linear Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To illustrate the potential for feature transformations consider the following *synthetic dataset*:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " X Y\n", "count 75.000000 75.000000\n", "mean -0.607319 -0.455212\n", "std 6.120983 12.873863\n", "min -9.889558 -25.028709\n", "25% -6.191627 -10.113630\n", "50% -1.196950 -1.648253\n", "75% 5.042387 10.910793\n", "max 9.737739 22.921518\n" ] }, { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
XY
0-9.889558-7.221915
1-9.588310-10.111930
2-9.312230-15.816534
3-9.095454-19.059384
4-9.070992-22.349544
\n", "
" ], "text/plain": [ " X Y\n", "0 -9.889558 -7.221915\n", "1 -9.588310 -10.111930\n", "2 -9.312230 -15.816534\n", "3 -9.095454 -19.059384\n", "4 -9.070992 -22.349544" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_data = pd.read_csv(\"toy_training_data.csv\")\n", "print(train_data.describe())\n", "train_data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Goal:** As usual we would like to learn a model that predicts $Y$ given $X$.\n", "\n", "What does this relationship between $X \\rightarrow Y$ look like?" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_points = go.Scatter(name = \"Training Data\", \n", " x = train_data['X'], y = train_data['Y'], \n", " mode = 'markers')\n", "# layout = go.Layout(autosize=False, width=800, height=600)\n", "py.iplot(go.Figure(data=[train_points]), \n", " filename=\"L19_b_p1\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Properties of the Data\n", "\n", "How would you describe this data?\n", "\n", "1. Is there a linear relationship between $X$ and $Y$? \n", "1. Are there other patterns?\n", "1. How noisy is the data?\n", "1. Do we see obvious outliers?\n", "\n", "---\n", "\n", "




" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Least Squares Linear Regression (Review)\n", "\n", "For the remainder of this lecture we will focus on fitting least squares linear regression models. Recall that **linear regression** models are functions of the form:\n", "\n", "\\begin{align}\\large\n", "f_\\theta(x) = x^T \\theta = \\sum_{j=1}^p \\theta_j x_j\n", "\\end{align}\n", "\n", "and **least squares** implies a loss function of the form:\n", "\n", "\\begin{align} \\large\n", "L_\\mathcal{D}(\\theta) = \\frac{1}{n} \\sum_{i=1}^n \\left(y_i - f_\\theta(x) \\right)^2\n", "\\end{align}\n", "\n", "In the previous lecture, we derived the **normal equations** which define the loss minimizing $\\hat{\\theta}$ for: \n", "\n", "\\begin{align}\\large\n", "\\hat{\\theta} \n", "& \\large = \\arg\\min L_\\mathcal{D}(\\theta) \\\\\n", "& \\large = \\arg\\min \\frac{1}{n} \\sum_{i=1}^n \\left(y_i - X \\theta \\right)^2 \\\\\n", "& \\large = \\left( X^T X \\right)^{-1} X^T Y\n", "\\end{align}\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "




" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Linear Regression with scikit-learn\n", "\n", "In this lecture we will use the [scikit-learn `linear_model` package](http://scikit-learn.org/stable/modules/linear_model.html) to compute the normal equations. This package supports a wide range of **generalized linear models**. For those who are interested in studying machine learning, I would encourage you to skim through the descriptions of the various models in the `linear_model` package. These are the foundation of most practical applications of supervised machine learning." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn import linear_model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following block of code creates an instance of the Least Squares Linear Regression model and the fits that model to the training data. " ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "line_reg = linear_model.LinearRegression(fit_intercept=True)\n", "# Fit the model to the data\n", "line_reg.fit(train_data[['X']], train_data['Y'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Notice:** In the above code block we explicitly added a **bias** (**intercept**) term by setting `fit_intercept=True`. Therefore we will not need to add an additional `constant` feature.\n", "\n", "---\n", "\n", "




" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plotting the Model\n", "\n", "To plot the model we will predict the value of $Y$ for a range of $X$ values. I will call these **query points** `X_query`." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [], "source": [ "X_query = np.linspace(-10, 10, 500)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use the regression model to render predictions at each `X_query` point." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Note that np.linspace returns a 1d vector therefore \n", "# we must transform it into a 2d column vector\n", "line_Yhat_query = line_reg.predict(\n", " np.reshape(X_query, (len(X_query),1)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To plot the residual we will also predict the $Y$ value for all the training points:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "line_Yhat = line_reg.predict(train_data[['X']])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following visualization code constructs a line as well as the residual plot" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Define the least squares regression line \n", "basic_line = go.Scatter(name = r\"$\\theta x$\", x=X_query.T, \n", " y=line_Yhat_query)\n", "\n", "# Definethe residual lines segments, a separate line for each \n", "# training point\n", "residual_lines = [\n", " go.Scatter(x=[x,x], y=[y,yhat],\n", " mode='lines', showlegend=False, \n", " line=dict(color='black', width = 0.5))\n", " for (x, y, yhat) in zip(train_data['X'], train_data['Y'], line_Yhat)\n", "]\n", "\n", "# Combine the plot elements\n", "py.iplot([train_points, basic_line] + residual_lines, filename=\"L19_b_p2\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Assessing the Fit\n", "\n", "1. Is our simple linear model a good fit? \n", "1. What do we see in the above plot?\n", "1. Are there clear outliers? \n", "1. Are there patterns to the residuals?\n", "\n", "---\n", "\n", "




" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Residual Analysis\n", "\n", "To help answer these questions it can often be helpful to plot the residuals in a **residual plot**. Residual plots plot the difference between the predicted value and the observed value in response to a particular covariate ($X$ dimension). The residual plot can help reveal patterns in the residual that might support additional modeling.\n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true }, "outputs": [], "source": [ "residuals = line_Yhat - train_data['Y'] " ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Plot.ly plotting code\n", "py.iplot(go.Figure(\n", " data = [dict(x=train_data['X'], y=residuals, mode='markers')],\n", " layout = dict(title=\"Residual Plot\", xaxis=dict(title=\"X\"), \n", " yaxis=dict(title=\"Residual\"))\n", "), filename=\"L19_b_p3\")" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "resid_vs_fit = go.Scatter(name=\"y vs yhat\", x=line_Yhat, y=residuals, mode='markers')\n", "layout = go.Layout(xaxis=dict(title=r\"$\\hat{y}$\"), yaxis=dict(title=\"Residuals\"))\n", "py.iplot(go.Figure(data=[resid_vs_fit], layout=layout), \n", " filename=\"L19_b_p3.2\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "** Do we see a pattern? **\n", "\n", "---\n", "




\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using a Smoothing Model \n", "\n", "To better visualize the pattern we could apply another regression package. In the following plotting code we call a more sophisticated regression package `sklearn.kernel_ridge` to estimate a smoothed approximation to the residuals. " ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.kernel_ridge import KernelRidge\n", "\n", "# Use Kernelized Ridge Regression with Radial Basis Functions to \n", "# compute a smoothed estimator. Later in this notebook we will \n", "# actually implement part of this ...\n", "clf = KernelRidge(kernel='rbf', alpha=2)\n", "clf.fit(train_data[['X']], residuals)\n", "residual_smoothed = clf.predict(np.reshape(X_query, (len(X_query),1)))" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Plot the residuals with with a kernel smoothing curve\n", "py.iplot(go.Figure(\n", " data = [dict(name = \"Residuals\", x=train_data['X'], y=residuals, \n", " mode='markers'),\n", " dict(name = \"Smoothed Approximation\", \n", " x=X_query, y=residual_smoothed, \n", " line=dict(dash=\"dash\"))],\n", " layout = dict(title=\"Residual Plot\", xaxis=dict(title=\"X\"), \n", " yaxis=dict(title=\"Residual\"))\n", "), filename=\"L19_b_p4\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above plot suggests a cyclic pattern to the residuals. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question:** Can we fit this non-linear cyclic structure with a linear model?\n", "\n", "


\n", "\n", "---\n", "\n", "## What does it mean to be a _linear model_\n", "\n", "\n", "Let's return to what it means to be a linear model:\n", "\n", "$$\\large\n", "f_\\theta(x) = x^T \\theta = \\sum_{j=1}^p x_j \\theta_j\n", "$$\n", "\n", "In what sense is the above model **linear**? \n", "1. Linear in the features $x$? \n", "1. Linear in the parameters $\\theta$?\n", "1. Linear in both at the same time?\n", "\n", "---\n", "





\n", "Yes, Yes, and No. If we look at just the features or just the parameters the model is linear. However, if we look at both at the same time it is not. Why?\n", "

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Feature Functions\n", "\n", "Consider the following alternative model formulation:\n", "\n", "$$\\large\n", "f_\\theta\\left( \\phi(x) \\right) = \\phi(x)^T \\theta = \\sum_{j=1}^{k} \\phi(x)_j \\theta_j\n", "$$\n", "\n", "where $\\phi_j$ is an *arbitrary function* from $x\\in \\mathbb{R}^p$ to $\\phi(x)_j \\in \\mathbb{R}$ and we define $k$ of these functions. We often refer to these functions $\\phi_j$ as **feature functions** or **basis functions** and their design plays a critical role in both how we capture prior knowledge and our ability to fit complicated data.\n", "\n", "---\n", "\n", "




\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Capturing Domain Knowledge\n", "\n", "\n", "Feature functions can be used to capture domain knowledge by:\n", "1. Introducing additional information from other sources\n", "1. Combining related features\n", "1. Encoding non-linear patterns\n", "\n", "Suppose I had data about customer purchases and I wanted to estimate their income:\n", "\n", "\\begin{align}\n", "\\phi(\\text{date}, \\text{lat}, \\text{lon}, \\text{amount})_1 \n", "&= \\textbf{isWinter}(\\text{date}) \\\\\n", "\\phi(\\text{date}, \\text{lat}, \\text{lon}, \\text{amount})_2 \n", "&= \\cos\\left( \\frac{\\textbf{Hour}(\\text{date})}{12} \\pi \\right) \\\\\n", "\\phi(\\text{date}, \\text{lat}, \\text{lon}, \\text{amount})_3 \n", "&= \\frac{\\text{amount}}{\\textbf{avg_spend}[\\textbf{ZipCode}[\\text{lat}, \\text{lon}]]} \\\\\n", "\\phi(\\text{date}, \\text{lat}, \\text{lon}, \\text{amount})_4 \n", "&= \\exp\\left(-\\textbf{Distance}\\left((\\text{lat},\\text{lon}), \\textbf{StoreA}\\right)\\right)^2 \\\\\n", "\\phi(\\text{date}, \\text{lat}, \\text{lon}, \\text{amount})_5 \n", "&= \\exp\\left(-\\textbf{Distance}\\left((\\text{lat},\\text{lon}), \\textbf{StoreB}\\right)\\right)^2 \n", "\\end{align}\n", "\n", "**Notice:** In the above feature functions:\n", "1. The transformations are non-linear\n", "1. They pull in additional information\n", "1. They may encode implicit knowledge\n", "1. The functions $\\phi$ **do not** depend on $\\theta$\n", "\n", "---\n", "




" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Linear in $\\theta$\n", "As a consequence, while the model $f_\\theta\\left( \\phi(x) \\right)$ is no longer linear in $x$ it is still a **linear model** because it _is linear in $\\theta$_. This means we can continue to use the normal equations to compute the optimal parameters. \n", "\n", "To apply the normal equations we define the transformed feature matrix:\n", "\n", "\n", "\n", "Then substituting $\\Phi$ for $X$ we obtain the **normal equation**:\n", "\n", "$$ \\large\n", "\\hat{\\theta} = \\left( \\Phi^T \\Phi \\right)^{-1} \\Phi^T Y\n", "$$\n", "\n", "It is worth noting that the model is also linear in $\\phi$ and that the $\\phi_j$ form a new **basis** (hence the term **basis functions**) in which the data live. As a consequence we can think of $\\phi$ as mapping the data into a new (often higher dimensional space) in which the relationship between $y$ and $\\phi(x)$ is defined by a **hyperplane**. \n", "\n", "---\n", "


" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Transforming the data with $\\phi$\n", "\n", "In our toy data set we observed a cyclic pattern. Here we construct a $\\phi$ to capture the cyclic nature of our data and visualize the corresponding hyperplane.\n", "\n", "\n", "In the following cell we define a function $\\phi$ that maps $x\\in \\mathbb{R}$ to the vector $[x,\\sin(x)] \\in \\mathbb{R}^2$ \n", "\n", "$$ \\large\n", "\\phi(x) = [x, \\sin(x)]\n", "$$\n", "\n", "**Why not:**\n", "\n", "$$ \\large\n", "\\phi(x) = [x, \\sin(\\theta_3 x + \\theta_4)]\n", "$$\n", "\n", "This would no longer be linear $\\theta$. However, in practice we might want to consider a range of $\\sin$ basis:\n", "\n", "$$ \\large\n", "\\phi_{\\alpha,\\beta}(x) = \\sin(\\alpha x + \\beta)\n", "$$\n", "\n", "for different values of $\\alpha$ and $\\beta$. The parameters $\\alpha$ and $\\beta$ are typically called **hyperparameters** because (at least in this setting) they are not set automatically through learning.\n" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def sin_phi(x):\n", " return [x, np.sin(x)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We then compute the matrix $\\Phi$ by applying $\\phi$ to reach row (record) in the matrix $X$. " ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[-9.88955766, 0.44822588],\n", " [-9.58831011, 0.16280424],\n", " [-9.31222958, -0.11231092],\n", " [-9.09545422, -0.32340318],\n", " [-9.07099175, -0.34645201]])" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Phi = np.array([sin_phi(x) for x in train_data['X']])\n", "\n", "# Look at a few examples\n", "Phi[:5,]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is worth noting that in practice we might prefer a more efficient \"vectorized\" version of the above code:\n", "```python\n", "Phi = np.vstack((train_data['X'], np.sin(train_data['X']))).T\n", "```\n", "however in this notebook we will use more explicit `for` loop notation.\n", "\n", "---\n", "\n", "




" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Fit a _Linear Model_ on $\\Phi$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can again use the scikit-learn package to fit a linear model on the transformed space." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "LinearRegression(copy_X=True, fit_intercept=False, n_jobs=1, normalize=False)" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sin_reg = linear_model.LinearRegression(fit_intercept=False)\n", "sin_reg.fit(Phi, train_data['Y'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plotting the predictions\n", "\n", "**What will the prediction from this mode look like?**\n", "\n", "$$ \\large\n", "f_\\theta(x) = \\phi(x)^T \\theta = \\theta_1 x + \\theta_2 \\sin x\n", "$$\n", "\n", "---\n", "\n", "





" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Making predictions at the transformed query points\n", "Phi_query = np.array([sin_phi(x) for x in X_query])\n", "sin_Yhat_query = sin_reg.predict(Phi_query)\n", "\n", "# plot the regression line\n", "sin_line = go.Scatter(name = r\"$\\theta_0 x + \\theta_1 \\sin(x)$ \", \n", " x=X_query, y=sin_Yhat_query)\n", "\n", "# Make predictions at the training points\n", "sin_Yhat = sin_reg.predict(Phi)\n", "\n", "# Plot the residual segments\n", "residual_lines = [\n", " go.Scatter(x=[x,x], y=[y,yhat],\n", " mode='lines', showlegend=False, \n", " line=dict(color='black', width = 0.5))\n", " for (x, y, yhat) in zip(train_data['X'], train_data['Y'], sin_Yhat)\n", "]\n", "\n", "py.iplot([train_points, sin_line, basic_line] + residual_lines, \n", " filename=\"L19_b_p10\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Examine the residuals again" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sin_Yhat = sin_reg.predict(Phi)\n", "residuals = train_data['Y'] - sin_Yhat\n", "\n", "\n", "# Use Kernelized Ridge Regression with Radial Basis Functions to \n", "# compute a smoothed estimator. \n", "clf = KernelRidge(kernel='rbf')\n", "clf.fit(train_data[['X']], residuals)\n", "residual_smoothed = clf.predict(np.reshape(X_query, (len(X_query),1)))\n", "\n", "# Plot the residuals with with a kernel smoothing curve\n", "py.iplot(go.Figure(\n", " data = [dict(name = \"Residuals\", \n", " x=train_data['X'], y=residuals, mode='markers'),\n", " dict(name = \"Smoothed Approximation\", \n", " x=X_query, y=residual_smoothed, \n", " line=dict(dash=\"dash\"))],\n", " layout = dict(title=\"Residual Plot\", \n", " xaxis=dict(title=\"X\"), yaxis=dict(title=\"Residual\"))\n", "), filename=\"L19_b_p11\")" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "resid_vs_fit = go.Scatter(name=\"y vs yhat\", x=sin_Yhat, y=residuals, mode='markers')\n", "layout = go.Layout(xaxis=dict(title=r\"$\\hat{y}$\"), yaxis=dict(title=\"Residuals\"))\n", "py.iplot(go.Figure(data=[resid_vs_fit], layout=layout), \n", " filename=\"L19_b_p2.2\")" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "yhat_vs_y = go.Scatter(name=\"y vs yhat\", x=train_data['Y'], y=sin_Yhat, mode='markers')\n", "slope_one = go.Scatter(name=\"Ideal\", x=[-25,25], y=[-25,25])\n", "layout = go.Layout(xaxis=dict(title=\"y\"), yaxis=dict(title=r\"$\\hat{y}$\"))\n", "py.iplot(go.Figure(data=[yhat_vs_y, slope_one], layout=layout), \n", " filename=\"L19_b_p11.1\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Look at the distribution of residuals" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 22, 
"metadata": {}, "output_type": "execute_result" } ], "source": [ "py.iplot(ff.create_distplot([residuals], group_labels=['Residuals']), \n", " filename=\"L19_b_p12\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Recall** earlier that our residuals were more spread from -10 to 10 and now they have become more concentrated. However, the outliers remain. Is that a problem?\n", "\n", "---\n", "





" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# A Linear Model in Transformed Space\n", "\n", "As discussed earlier the model we just constructed, while non-linear in $x$ is actually a linear model in $\\phi(x)$ and we can visualize that linear model's structure in higher dimensions." ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Plot the data in higher dimensions\n", "phi3d = go.Scatter3d(name = \"Raw Data\",\n", " x = Phi[:,0], y = Phi[:,1], z = train_data['Y'],\n", " mode = 'markers',\n", " marker = dict(size=3),\n", " showlegend=False\n", ")\n", "\n", "# Compute the predictin plane\n", "(u,v) = np.meshgrid(np.linspace(-10,10,5), np.linspace(-1,1,5))\n", "coords = np.vstack((u.flatten(),v.flatten())).T\n", "ycoords = coords @ sin_reg.coef_\n", "fit_plane = go.Surface(name = \"Fitting Hyperplane\",\n", " x = np.reshape(coords[:,0], (5,5)),\n", " y = np.reshape(coords[:,1], (5,5)),\n", " z = np.reshape(ycoords, (5,5)),\n", " opacity = 0.8, cauto = False, showscale = False,\n", " colorscale = [[0, 'rgb(255,0,0)'], [1, 'rgb(255,0,0)']]\n", ")\n", "\n", "# Construct residual lines\n", "Yhat = sin_reg.predict(Phi)\n", "residual_lines = [\n", " go.Scatter3d(x=[x[0],x[0]], y=[x[1],x[1]], z=[y, yhat],\n", " mode='lines', showlegend=False, \n", " line=dict(color='black'))\n", " for (x, y, yhat) in zip(Phi, train_data['Y'], Yhat)\n", "]\n", "\n", " \n", "# Label the axis and orient the camera\n", "layout = go.Layout(\n", " scene=go.Scene(\n", " xaxis=go.XAxis(title='X'),\n", " yaxis=go.YAxis(title='sin(X)'),\n", " zaxis=go.ZAxis(title='Y'),\n", " aspectratio=dict(x=1.,y=1., z=1.), \n", " camera=dict(eye=dict(x=-1, y=-1, z=0))\n", " )\n", ")\n", "\n", "py.iplot(go.Figure(data=[phi3d, fit_plane] + residual_lines, layout=layout), \n", " filename=\"L19_b_p14\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "




" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Computing the RMSE of the various methods so far:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def rmse(y, yhat):\n", " return np.sqrt(np.mean((yhat-y)**2))" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [], "source": [ "const_rmse = rmse(train_data['Y'], train_data['Y'].mean())\n", "line_rmse = rmse(train_data['Y'], line_Yhat)\n", "sin_rmse = rmse(train_data['Y'], sin_Yhat)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "py.iplot(go.Figure(data =[go.Bar(\n", " x=[r'$\\theta $', r'$\\theta x$', \n", " r'$\\theta_0 x + \\theta_1 \\sin(x)$'],\n", " y=[const_rmse, line_rmse, sin_rmse]\n", " )], layout = go.Layout(title=\"Loss Comparison\", \n", " yaxis=dict(title=\"RMSE\"))), \n", " filename=\"L19_b_p15\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By adding the `sine` feature function were were able to reduce the prediction error. How could we improve further?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Generic Feature Functions\n", "\n", "We will now explore a range of generic feature transformations. However before we proceed it is worth contrasting two categories of feature functions and their applications.\n", "\n", "1. **Interpretable Features:** In settings where our goal is to understand the model (e.g., identify important features that predict customer churn) we may want to construct meaningful features based on our understanding of the domain. \n", "\n", "1. **Generic Features:** However, in other settings where our primary goals is to make accurate predictions we may instead introduce generic feature functions that enable our models to fit _and generalize_ complex relationships. \n", "\n", "----\n", "\n", "





\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The Polynomial Basis:\n", "\n", "The first set of generic feature functions we will consider is the polynomial basis:\n", "\n", "\\begin{align}\n", "\\phi(x) = [x, x^2, x^3, \\ldots, x^k]\n", "\\end{align}\n", "\n", "We can define a generic python function to implement this basis:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def poly_phi(k):\n", " return lambda x: [x ** i for i in range(1, k+1)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To simply the comparison of feature functions we define the following routine:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def evaluate_basis(phi, desc):\n", " # Apply transformation\n", " Phi = np.array([phi(x) for x in train_data['X']])\n", " \n", " # Fit a model\n", " reg_model = linear_model.LinearRegression(fit_intercept=False)\n", " reg_model.fit(Phi, train_data['Y'])\n", " \n", " # Create plot line\n", " X_test = np.linspace(-10, 10, 1000) # Fine grained test X\n", " Phi_test = np.array([phi(x) for x in X_test])\n", " Yhat_test = reg_model.predict(Phi_test)\n", " line = go.Scatter(name = desc, x=X_test, y=Yhat_test)\n", " \n", " # Compute RMSE\n", " Yhat = reg_model.predict(Phi)\n", " error = rmse(train_data['Y'], Yhat)\n", " \n", " # return results\n", " return (line, error, reg_model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Starting with $k=5$ degree polynomials" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(poly_line, poly_rmse, poly_reg) = (\n", " evaluate_basis(poly_phi(5), \"Polynomial\")\n", ")\n", "\n", "py.iplot(go.Figure(data=[train_points, poly_line, sin_line, basic_line], \n", " layout = go.Layout(xaxis=dict(range=[-10,10]), \n", " yaxis=dict(range=[-25,25]))), \n", " filename=\"L19_b_p16\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Increasing to $k=15$" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(poly_line, poly_rmse, poly_reg) = (\n", " evaluate_basis(poly_phi(15), \"Polynomial\")\n", ")\n", "\n", "py.iplot(go.Figure(data=[train_points, poly_line, sin_line, basic_line], \n", " layout = go.Layout(xaxis=dict(range=[-10,10]), \n", " yaxis=dict(range=[-25,25]))), \n", " filename=\"L19_b_p17\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Seems like a pretty reasonable fit. Returning to the RMSE on the training data:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "py.iplot([go.Bar(\n", " x=[r'$\\theta $', r'$\\theta x$', \n", " r'$\\theta_0 x + \\theta_1 \\sin(x)$', \n", " 'Polynomial'],\n", " y=[const_rmse, line_rmse, sin_rmse, poly_rmse]\n", " )], filename=\"L19_b_p18\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This was a slight improvement. Perhaps we should increase to a higher degree polynomial? Why or why not? 
We will return to this soon.\n", "\n", "---\n", "
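As an aside, the loop-based `poly_phi` has an equivalent vectorized form using NumPy's Vandermonde helper (a sketch; the columns are $x^1, \ldots, x^k$, matching `poly_phi(k)`):

```python
def poly_Phi(X, k):
    # increasing=True yields columns x^0 ... x^k; drop the constant column.
    return np.vander(np.asarray(X), N=k + 1, increasing=True)[:, 1:]
```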




" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Gaussian Radial Basis Functions\n", "\n", "One of the more widely used generic feature functions are Gaussian radial basis functions. These feature functions take the form:\n", "\n", "$$\n", "\\phi_{(\\lambda, u_1, \\ldots, u_k)}(x) = \\left[\\exp\\left( - \\frac{\\left|\\left|x-u_1\\right|\\right|_2^2}{\\lambda} \\right), \\ldots, \n", "\\exp\\left( - \\frac{\\left|\\left| x-u_k \\right|\\right|_2^2}{\\lambda} \\right) \\right]\n", "$$\n", "\n", "The **hyper-parameters** $u_1$ through $u_k$ and $\\lambda$ are not optimized with $\\theta$ but instead are set externally. In many cases the $u_i$ may correspond to points in the training data. The term $\\lambda$ defines the spread of the basis function and determines the \"smoothness\" of the function $f_\\theta(\\phi(x))$.\n", "\n", "The following is a plot of three radial basis function centered at 2 with different values of $\\lambda$." ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def gaussian_rbf(u, lam=1):\n", " return lambda x: np.exp(-(x - u)**2 / lam**2)" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tmpX = np.linspace(-2, 6,100)\n", "py.iplot([\n", " dict(name=r\"$\\lambda=0.5$\", x=tmpX, \n", " y=gaussian_rbf(2, lam=0.5)(tmpX)),\n", " dict(name=r\"$\\lambda=1$\", x=tmpX, \n", " y=gaussian_rbf(2, lam=1.)(tmpX)),\n", " dict(name=r\"$\\lambda=2$\", x=tmpX, \n", " y=gaussian_rbf(2, lam=2.)(tmpX))\n", "], filename=\"L19_b_p19\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 10 RBF Functions $\\lambda=1$\n", "\n", "Here we plot 10 uniformly spaced RBF functions with $\\lambda=1$" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def rbf_phi(x):\n", " return [gaussian_rbf(u, 1.)(x) for u in np.linspace(-9, 9, 10)]\n", "\n", "(rbf_line, rbf_rmse, rbf_reg) = evaluate_basis(rbf_phi, r\"RBF\")\n", "\n", "py.iplot([train_points, rbf_line, poly_line, sin_line, basic_line], filename=\"L19_b_p20\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 10 RBF Functions $\\lambda=10$" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def rbf_phi(x):\n", " return [gaussian_rbf(u, 2)(x) for u in np.linspace(-9, 9, 15)]\n", "\n", "\n", "(rbf_line, rbf_rmse, rbf_reg) = (\n", " evaluate_basis(rbf_phi, r\"RBF\")\n", ")\n", "\n", "py.iplot([train_points, rbf_line, poly_line, sin_line, basic_line],\n", " filename=\"L19_b_p21\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Is this a better fit?" 
] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "py.iplot([go.Bar(\n", " x=[r'$\\theta $', r'$\\theta x$', \n", " r'$\\theta_0 x + \\theta_1 \\sin(x)$', \n", " r\"Polynomial\",\n", " r\"RBF\"],\n", " y=[const_rmse, line_rmse, sin_rmse, poly_rmse, rbf_rmse]\n", " )], filename=\"L19_b_p23\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Connecting the Dots..." ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def crazy_rbf_phi(x):\n", " return (\n", " [gaussian_rbf(u,0.5)(x) for u in np.linspace(-9, 9, 50)]\n", " )\n", "\n", "(crazy_rbf_line, crazy_rbf_rmse, crazy_rbf_reg) = (\n", " evaluate_basis(crazy_rbf_phi, \"RBF + Crazy\")\n", ")\n", "\n", "py.iplot(go.Figure(data=[train_points, crazy_rbf_line, \n", " poly_line, sin_line, basic_line],\n", " layout=go.Layout(yaxis=dict(range=[-25,25]))),\n", " filename=\"L19_b_p24\")" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_bars = go.Bar(name = \"Train\",\n", " x=[r'$\\theta $', r'$\\theta x$', r'$\\theta_0 x + \\theta_1 \\sin(x)$', \n", " \"Polynomial\",\n", " \"RBF\",\n", " \"RBF + Crazy\"],\n", " y=[const_rmse, line_rmse, sin_rmse, poly_rmse, rbf_rmse, crazy_rbf_rmse])\n", "\n", "py.iplot([train_bars], filename=\"L19_b_p25\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "# Which is the best model?\n", "\n", "We started with the objective of minimizing the training loss (error). As we increased the model sophistication by adding features we were able to fit increasingly complex functions to the data and reduce the loss. However, is our ultimate goal to minimize training error? \n", "\n", "Ideally we would like to minimize the error we make when making new predictions at unseen values of $X$. One way to evaluate that error is to use a **test dataset** which is distinct from the dataset used to train the model. 
Fortunately, we have such a test dataset.\n" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_data = pd.read_csv(\"toy_test_data.csv\")\n", "\n", "test_points = go.Scatter(name = \"Test Data\", x = test_data['X'], y = test_data['Y'], \n", " mode = 'markers', marker=dict(symbol=\"cross\", color=\"red\"))\n", "py.iplot([train_points, test_points], filename=\"L19_b_p26\")\n" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def test_rmse(phi, reg):\n", " yhat = reg.predict(np.array([phi(x) for x in test_data['X']]))\n", " return rmse(test_data['Y'], yhat)\n" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_bars = go.Bar(name = \"Test\",\n", " x=[r'$\\theta $', r'$\\theta x$', r'$\\theta_0 x + \\theta_1 \\sin(x)$', \n", " \"Polynomial\",\n", " \"RBF\",\n", " \"RBF + Crazy\"],\n", " y=[np.sqrt(np.mean((test_data['Y']-train_data['Y'].mean())**2)), \n", " test_rmse(lambda x: [x], line_reg),\n", " test_rmse(sin_phi, sin_reg),\n", " test_rmse(poly_phi(15), poly_reg),\n", " test_rmse(rbf_phi, rbf_reg),\n", " test_rmse(crazy_rbf_phi, crazy_rbf_reg)]\n", " )\n", "\n", "py.iplot([train_bars, test_bars], filename=\"L19_b_p27\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**What happened here?**\n", "



\n", "\n", "---\n", "\n", "This is a very common occurrence in machine learning. As we increased the model complexity " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# What's happening: _Over-fitting_\n", "\n", "As we increase the expressiveness of our model we begin to **over-fit** to the variability in our training data. That is we are learning patterns that do not **generalize** beyond our training dataset\n", "\n", "**Over-fitting** is a key challenge in machine learning and statistical inference. At it's core is a fundamental trade-off between **bias** and **variance**: _the desire to explain the training data and yet be robust to variation in the training data_.\n", "\n", "We will study the **bias-variance** trade-off more in the next lecture but for now we will focus on the trade-off between under fitting and over fitting:\n", "\n", "\n", "\n", "---\n", "\n", "





" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "# Train, Test, and Validation Splitting" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To manage over-fitting it is essential to split your initial training data into a training and testing dataset. \n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Train - Test Split\n", "\n", "Before running cross validation split the data into train and test subsets (typically a 90-10 split). How should you do this? You want the test data to reflect the prediction goal:\n", "1. **Random** sample of the training data\n", "1. **Future** examples\n", "1. Different **stratifications**\n", "\n", "Ask yourself, where will I be using this model and how does that relate to my test data.\n", "\n", "**Do not look at the test data until after selecting your final model.** Also, it is very important to **not look at the test data until after selecting your final model.** Finally, you should **not look at the test data until after selecting your final model.** " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "# Cross Validation\n", "\n", "_With the **remaining training data**:_\n", "1. Split the remaining **training dataset** k-ways as in the Figure above. The figure uses 5-Fold which is standard. You should split the data in the same way you constructed the test set (this is typically randomly)\n", "1. For each split train the model on the training fraction and then compute the error (RMSE) on the validation fraction.\n", "1. Average the error across each validation fraction to _estimate_ the test error.\n", "\n", "**Questions:**\n", "1. What is this accomplishing?\n", "1. What are the implication on the choice of $k$? \n", "\n", "---\n", "\n", "





\n", "\n", "\n", "\n", "**Answers:**\n", "1. This is repeatedly simulating the train-test split we did earlier. We repeat this process because it is noisy.\n", "1. **Larger $k$** means we average our validation error over more instances which makes our estimate of the test error **more stable**. This typically also means that the validation set is smaller so we have more training data. However, larger $k$ also means we have to train the model more often which gets computational expensive\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# When do we use Cross Validation\n", "\n", "We use cross validation to estimate our test performance **without looking at our test data.** Remember, **do not look at the test data until after selecting your final model.**\n", "\n", "Cross validation is commonly used to tune the model complexity. In the following we will use cross validation to find the optimal number of basis functions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Cross Validation in sklearn" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.model_selection import KFold\n", "nsplits = 5\n", "\n", "X = train_data['X']\n", "Y = train_data['Y']\n", "\n", "\n", "nbasis_values = range(3, 20)\n", "sigma_values = np.linspace(1, 20, 50)\n", "\n", "scores = []\n", "for nbasis in nbasis_values:\n", " for sigma in sigma_values:\n", " # Define my basis for this \"experiment\"\n", " def phi(x):\n", " return (\n", " [gaussian_rbf(u, sigma)(x) for u in np.linspace(-9, 9, nbasis)]\n", " )\n", " Phi = np.array([phi(x) for x in X])\n", "\n", " # One step in k-fold cross validation\n", " def score_model(train_index, test_index):\n", " model = linear_model.LinearRegression(fit_intercept=False)\n", " model.fit(Phi[train_index,], Y[train_index])\n", " return rmse(Y[test_index], model.predict(Phi[test_index,]))\n", "\n", "\n", " # Do the actual K-Fold cross validation\n", " kfold = KFold(nsplits, shuffle=True,random_state=42)\n", " score = np.mean([score_model(tr_ind, te_ind) \n", " for (tr_ind, te_ind) in kfold.split(Phi)])\n", "\n", " # Save the results in the dictionary\n", " scores.append(dict(nbasis=nbasis,sigma=sigma,score=score))" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "nbasis 8.000000\n", "score 5.351392\n", "sigma 4.489796\n", "Name: 259, dtype: float64\n" ] }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "scores_df = pd.DataFrame(scores)\n", "(b,s) = (len(nbasis_values), len(sigma_values))\n", "\n", "z = np.reshape(scores_df['score'].values,(b,s))\n", "loss_surface = go.Surface(\n", " x = np.reshape(scores_df['nbasis'].values,(b,s)), \n", " y = np.reshape(scores_df['sigma'].values,(b,s)),\n", " z = z\n", ")\n", "\n", "optimal_value = scores_df.loc[scores_df['score'].argmin(),]\n", "print(optimal_value)\n", "optimal_point = go.Scatter3d(\n", " x=optimal_value['nbasis'],\n", " y=optimal_value['sigma'],\n", " z=optimal_value['score'],\n", " marker=dict(size=10, color=\"red\")\n", ")\n", "\n", "# Axis labels\n", "layout = go.Layout(\n", " scene=go.Scene(\n", " xaxis=go.XAxis(title='nbasis'),\n", " yaxis=go.YAxis(title='sigma'),\n", " zaxis=go.ZAxis(title='log CV rmse'),\n", " aspectratio=dict(x=1.,y=1., z=1.)\n", " )\n", ")\n", "\n", "\n", "\n", "\n", "fig = 
go.Figure(data = [loss_surface,optimal_point], layout = layout)\n", "py.iplot(fig, filename=\"L19-p2-cv-loss\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Take a look at the final fit:" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def optimal_rbf_phi(x):\n", " return (\n", " [gaussian_rbf(u,optimal_value['sigma'])(x) for u in np.linspace(-9, 9, int(optimal_value['nbasis']))]\n", " )\n", "\n", "(optimal_rbf_line, optimal_rbf_rmse, optimal_rbf_reg) = (\n", " evaluate_basis(optimal_rbf_phi, \"Optimal RBF\")\n", ")\n", "\n", "py.iplot(go.Figure(data=[train_points, optimal_rbf_line],\n", " layout=go.Layout(yaxis=dict(range=[-25,25]))),\n", " filename=\"L19_final\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Next time:\n", "\n", "We will dig more into the bias-variance trade-off and introduce the concept of regularization to parametrically explore the space of model complexity. " ] } ], "metadata": { "kernelspec": { "display_name": "Python [conda env:ds100]", "language": "python", "name": "conda-env-ds100-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 2 }