{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Plotly plotting support\n",
"import plotly.plotly as py\n",
"# import plotly.offline as py\n",
"# py.init_notebook_mode()\n",
"\n",
"# import cufflinks as cf\n",
"# cf.go_offline() # required to use plotly offline (no account required).\n",
"\n",
"import plotly.graph_objs as go\n",
"import plotly.figure_factory as ff\n",
"\n",
"# Make the notebook deterministic \n",
"np.random.seed(42)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Cross Validation and the Bias Variance Tradeoff\n",
"\n",
"This notebook which accompanies the lecture on the Bias Variance Tradeoff and Regularization.\n",
"\n",
"Notebook created by [Joseph E. Gonzalez](https://eecs.berkeley.edu/~jegonzal) for DS100."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Toy Dataset\n",
"\n",
"As with the previous lectures we will continue to use an easy to visualize synthetic dataset."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.random.seed(42)\n",
"n = 50\n",
"sigma = 10\n",
"X = np.linspace(-10, 10, n)\n",
"X = np.sort(X)\n",
"Y = 2. * X + 10. + sigma * np.random.randn(n) + 20*np.sin(X) + 0.8*(X)**2\n",
"X = X/5\n",
"data_points = go.Scatter(name=\"data\", x=X, y=Y, mode='markers')\n",
"py.iplot([data_points])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Making a train Test Split"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It is always a good habbit to split data into training and test sets."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.model_selection import train_test_split \n",
"X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.25, random_state=42)\n",
"train_points = go.Scatter(name=\"Train Data\", \n",
" x=X_tr, y=Y_tr, mode='markers', marker=dict(color=\"blue\", symbol=\"o\"))\n",
"test_points = go.Scatter(name=\"Test Data\",\n",
" x=X_te, y=Y_te, mode='markers', marker=dict(color=\"red\", symbol=\"x\"))\n",
"py.iplot([train_points, test_points])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Polynomial Features:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the following we work through a basic bias variance tradeoff for different numbers of polynomial features. To make things a bit more interesting we will also add an additional sine feature:\n",
"\n",
"$$ \\large\n",
"\\Phi_d(x) = \\left[\\sin(\\omega x), x, x^2, \\ldots, x^d \\right]\n",
"$$\n",
"\n",
"with the appropriate $\\omega = 5$ value determined from prior knowledge (because we created the data). \n",
"\n",
"We use the following Python function to generate a $\\Phi$:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def poly_phi(k):\n",
" return lambda X: np.array([np.sin(X*5)] + [X ** i for i in range(1, k+1)]).T"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will consider the following basis sizes:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"basis_sizes = [1, 2, 5, 8, 16, 24, 32]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a $\\Phi$ feature matrix for each number of basis:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"Phis_tr = { k: poly_phi(k)(X_tr) for k in basis_sizes }"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Train a model for each number of feature"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from sklearn import linear_model\n",
"models = {}\n",
"for k in Phis_tr:\n",
" model = linear_model.LinearRegression()\n",
" model.fit(Phis_tr[k], Y_tr)\n",
" models[k] = model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Visualize the various models\n",
"\n",
"The following code is a bit complicated but essentially makes an interactive visualization of the various models."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Make the x values where plot points will be generated\n",
"X_plt = np.linspace(np.min(X)-1, np.max(X)+1, 200)\n",
"\n",
"# Generate the Plotly line objects by predicting the value at each X_plt\n",
"lines = []\n",
"for k in sorted(models.keys()):\n",
" ytmp = models[k].predict(poly_phi(k)(X_plt))\n",
" # Plotting software fails with large numbers\n",
" ytmp[ytmp > 500] = 500\n",
" ytmp[ytmp < -500] = -500\n",
" lines.append(\n",
" go.Scatter(name=\"degree \"+ str(k), x=X_plt,\n",
" y = ytmp,visible=False))\n",
"\n",
"# Construct steps for the interactive slider\n",
"lines[0].visible=True\n",
"steps = []\n",
"for i in range(len(lines)):\n",
" step = dict(\n",
" label = lines[i]['name'],\n",
" method = 'restyle',\n",
" args = ['visible', [False] * (len(lines)+1)],\n",
" )\n",
" step['args'][1][0] = True\n",
" step['args'][1][i+1] = True # Toggle i'th trace to \"visible\"\n",
" steps.append(step)\n",
"\n",
"# Build the slider object\n",
"sliders = [dict(active = 0, pad = {\"t\": 20}, steps = steps)]\n",
"\n",
"# render the plot\n",
"layout = go.Layout(xaxis=dict(range=[np.min(X_plt), np.max(X_plt)]), \n",
" yaxis=dict(range=[np.min(Y) -5 , np.max(Y)+5]),\n",
" sliders=sliders)\n",
"py.iplot(go.Figure(data = [train_points] + lines, layout=layout), filename=\"poly_curve_comparison\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Cross Validation\n",
"\n",
"\n",
"\n",
"_With the **remaining training data**:_\n",
"1. Split the remaining **training dataset** k-ways as in the Figure above. The figure uses 5-Fold which is standard. You should split the data in the same way you constructed the test set (this is typically randomly)\n",
"1. For each split train the model on the training fraction and then compute the error (RMSE) on the validation fraction.\n",
"1. Average the error across each validation fraction to _estimate_ the test error.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Define the error Function\n",
"\n",
"We will continue to use the mean squared error but this time rather than define it ourselves we will import it from scikit-learn."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from sklearn.metrics import mean_squared_error"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the following we compute the cross validated RMSE for different numbers of polynomial basis values"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from sklearn.model_selection import KFold\n",
"kfold_splits = 5\n",
"kfold = KFold(kfold_splits, shuffle=True, random_state=42)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"mse_scores = []\n",
"poly_degrees = np.arange(1,16)\n",
"for k in poly_degrees:\n",
" Phi = poly_phi(k)(X_tr)\n",
" \n",
" # One step in k-fold cross validation\n",
" def score_model(train_index, test_index):\n",
" model = linear_model.LinearRegression()\n",
" model.fit(Phi[train_index,], Y_tr[train_index])\n",
" return mean_squared_error(Y_tr[test_index], model.predict(Phi[test_index,]))\n",
" \n",
"\n",
" score = np.mean([score_model(tr_ind, te_ind) \n",
" for (tr_ind, te_ind) in kfold.split(Phi)])\n",
" mse_scores.append(score)\n",
" \n",
" \n",
"rmse_scores = np.sqrt(np.array(mse_scores))"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"py.iplot(\n",
" go.Figure(\n",
" data=[go.Scatter(name=\"CV Curve\", x=poly_degrees, y=rmse_scores),\n",
" go.Scatter(name=\"Optimum\", x=[poly_degrees[np.argmin(rmse_scores)]], y=[np.min(rmse_scores)],\n",
" mode=\"markers\", marker=dict(color=\"red\", size=10))\n",
" ],\n",
" layout=go.Layout(xaxis=dict(title=\"Degree\"), yaxis=dict(title=\"CV RMSE\"))),\n",
" filename=\"basis_function_cv_curve\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"### Questions? \n",
"\n",
"1. Is 2 a reasonable answer given the data? (Look at the earlier plot.)\n",
"1. Does this depend on the order in which we consider basis?\n",
"\n",
"\n",
"
\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Variance \n",
"\n",
"1. What happens if we repeat the training procedure with different versions of our trainin data? \n",
"\n",
"1. How will the models change as we inrease the degree of our polynomial?\n",
"\n",
"---\n",
"\n",
"
\n",
"\n",
"\n",
"\n",
"Here we use cross validation to simulate multiple train and test splits by splitting the training data into **train** and **validation** datasets. We then visualize each of the corresponding models."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.model_selection import KFold\n",
"\n",
"# Construct a KFold object which will define random index splits\n",
"kfold_splits = 5\n",
"kfold = KFold(kfold_splits, shuffle=True, random_state=42)\n",
"\n",
"# Construct the lines\n",
"lines = []\n",
"\n",
"# Make the test lines\n",
"X_plt = np.linspace(np.min(X)-1, np.max(X)+1, 200)\n",
"\n",
"# for each train validation split (we ignore the validation data)\n",
"for k in basis_sizes:\n",
" for (tr_ind, val_ind) in kfold.split(X_tr):\n",
" # Construct the Phi matrices for each basis size\n",
" Phi = poly_phi(k)(X_tr[tr_ind]) \n",
" # Fit the model\n",
" model = linear_model.LinearRegression()\n",
" model.fit(Phi, Y_tr[tr_ind])\n",
" # Make predictions at the plotting points\n",
" ytmp = model.predict(poly_phi(k)(X_plt))\n",
" # Plotly has a bug and fails with large numbers\n",
" ytmp[ytmp > 500] = 500\n",
" ytmp[ytmp < -500] = -500\n",
" line = go.Scatter(name=\"degree \"+ str(k), visible=False,\n",
" x=X_plt, y=ytmp)\n",
" lines.append(line)\n",
"\n",
"for l in lines[0:kfold_splits]:\n",
" l.visible=True\n",
"\n",
"steps = []\n",
"for i in range(len(basis_sizes)):\n",
" step = dict(\n",
" label = \"Degree \" + str(basis_sizes[i]) ,\n",
" method = 'restyle',\n",
" args = ['visible', [False] * (len(lines)+1)]\n",
" )\n",
" step['args'][1][-1] = True\n",
" for u in range(i*kfold_splits, (i+1)*kfold_splits):\n",
" step['args'][1][u] = True # Toggle i'th trace to \"visible\"\n",
" steps.append(step)\n",
" \n",
"sliders = [dict(\n",
" active = 0,\n",
" pad = {\"t\": 20},\n",
" steps = steps\n",
")]\n",
"\n",
"layout = go.Layout(xaxis=dict(range=[np.min(X_plt), np.max(X_plt)]), \n",
" yaxis=dict(range=[np.min(Y) -5 , np.max(Y) + 5]),\n",
" sliders=sliders,\n",
" showlegend=False)\n",
"\n",
"py.iplot(go.Figure(data = lines + [train_points], layout=layout), filename=\"cv_bv_slider\")\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Return to Slides!\n",
"\n",
"--- \n",
"\n",
"
"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda env:ds100]",
"language": "python",
"name": "conda-env-ds100-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}