{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "## Plotly plotting support\n", "import plotly.plotly as py\n", "\n", "# import plotly.offline as py\n", "# py.init_notebook_mode()\n", "\n", "import plotly.graph_objs as go\n", "import plotly.figure_factory as ff\n", "\n", "# Make the notebook deterministic\n", "np.random.seed(42)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notebook created by [Joseph E. Gonzalez](https://eecs.berkeley.edu/~jegonzal) for DS100." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "# Feature Transformations\n", "\n", "In addition to transforming categorical and text features into real-valued representations, we can often improve model performance by applying additional feature transformations. Let's start with a simple toy example.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Linear Models for Non-Linear Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To illustrate the potential of feature transformations, consider the following *synthetic dataset*:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " X Y\n", "count 75.000000 75.000000\n", "mean -0.607319 -0.455212\n", "std 6.120983 12.873863\n", "min -9.889558 -25.028709\n", "25% -6.191627 -10.113630\n", "50% -1.196950 -1.648253\n", "75% 5.042387 10.910793\n", "max 9.737739 22.921518\n" ] }, { "data": { "text/html": [ "
\n", " | X | \n", "Y | \n", "
---|---|---|
0 | \n", "-9.889558 | \n", "-7.221915 | \n", "
1 | \n", "-9.588310 | \n", "-10.111930 | \n", "
2 | \n", "-9.312230 | \n", "-15.816534 | \n", "
3 | \n", "-9.095454 | \n", "-19.059384 | \n", "
4 | \n", "-9.070992 | \n", "-22.349544 | \n", "