{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"## Plotly plotting support\n",
"import plotly.plotly as py\n",
"\n",
"# import plotly.offline as py\n",
"# py.init_notebook_mode()\n",
"\n",
"import plotly.graph_objs as go\n",
"import plotly.figure_factory as ff\n",
"\n",
"# Make the notebook deterministic \n",
"np.random.seed(42)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notebook created by [Joseph E. Gonzalez](https://eecs.berkeley.edu/~jegonzal) for DS100."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Feature Engineering\n",
"\n",
"In the next few notebooks we will explore a key part of data science, **feature engineering**: _the process of transforming the representation of model inputs to enable better model approximation._ Feature engineering enables you to:\n",
"\n",
"1. **encode** non-numeric features to be used as inputs to common numeric models\n",
"1. capture **domain knowledge** (e.g., the perceived loudness or sound is the log of the intensity)\n",
"1. **transform** complex relationships into simple linear relationships\n",
"\n",
"---\n",
"
\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Mapping from Domain to Range\n",
"\n",
"In the supervised learning setting were are given $(X,Y)$ paris with the goal of learning the mapping from $X$ to $Y$. For example, given pairs of square footage and price we want to learn a function that captures (or at least approximates) the relationship between square feet and price. Our functional approximation is some form of typically parametric mapping from some **domain** to some **range**:\n",
"\n",
"\n",
"\n",
"In this class we will focus on **Multiple Regression** in which we consider mappings from potentially high-dimensional input spaces onto the real line (i.e., $y \\in \\mathbb{R}$):\n",
"\n",
"\n",
"\n",
"It is worth noting that this is distinct from **Multivariate Regression** in which we are predicting multiple (confusing?) response values (e.g., $y \\in \\mathbb{R}^q$).\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# What is the Domain (Features)\n",
"\n",
"Suppose we are given the following table:\n",
"\n",
"\n",
"\n",
"Our goal is to learn a function that approximates the relationship between the blue and red columns. Let's assume the range, `\"Ratings\"`, are the real numbers (this may be a problem if ratings are between [0, 5] but more on that later).\n",
"\n",
"**What is the _domain_ of this function?**\n",
"\n",
"---\n",
"\n",
"
\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"The schema of the relational model provides one possible answer:\n",
"\n",
"```sql\n",
"RatingsData(uid INTEGER, age FLOAT, \n",
" state STRING, hasBought BOOLEAN,\n",
" review STRING, rating FLOAT)\n",
"```\n",
"\n",
"Which would suggest that the domain is then:\n",
"\n",
"$$\n",
"\\textbf{Domain} = \\mathbb{Z} \\times \\mathbb{R} \\times \\mathbb{S} \\times \\mathbb{B} \\times \\mathbb{S} \\times \\mathbb{R}\n",
"$$\n",
"\n",
"Unfortunately, the techniques we have discussed so far and most of the techniques in machine learning and statistics operate on real-valued vector inputs $x \\in \\mathbb{R}^d$ (or for the statisticians $x \\in \\mathbb{R}^p$). \n",
"\n",
"### Goal: \n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"Moreover, many of these techniques, especially the linear models we have been studying, assume the inputs are **continuous** variables in which the relative magnitude of the feature encode information about the response variable. \n",
"\n",
"In the following we define several basic transformations to encode features as real numbers.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"---\n",
"
\n",
"\n",
"\n",
"# Basic Feature Engineering: _Get $\\mathbb{R}$_\n",
"\n",
"Our first step as feature engineers is to translate our data into a form that encodes each feature as a continuous variable."
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"## The _Uninformative_ Feature: `uid`\n",
"\n",
"The `uid` was likely used to join the user information (e.g., `age`, and `state`) with some `Reviews` table. The `uid` presents several questions:\n",
"* What is the meaning of the `uid` *number*? \n",
"* Does the magnitude of the `uid` reveal information about the rating? \n",
"\n",
"There are several answers:\n",
"\n",
"1. Although numbers, identifiers are **typically categorical** (like strings) and as a consequence the magnitude has little meaning. In these settings we would either **drop** or **one-hot encode** the `uid`. We will return to feature dropping and one-hot-encoding in a moment.\n",
"\n",
"1. There are scenarios where the magnitude of the numerical `uid` value contains important information. When user ids are created in consecutive order, larger user ids would imply more recent users. In these cases we might to interpret the `uid` feature as a real number. \n",
"\n",
"\n",
"\n",
"---\n",
"
\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Dropping Features\n",
"\n",
"While uncommon there are certain scenarios where manually dropping features might be helpful:\n",
"\n",
"1. when the features **does not to contain information** associated with the prediction task. Dropping uninformative features can help to address over-fitting, an issue we will discuss in great detail soon. \n",
"\n",
"1. when the feature is **not available when at prediction time.** For example, the feature might contain information collected after the user entered a rating. This is a common scenario in time-series analysis.\n",
"\n",
"However in the absence of substantial domain knowledge, we would prefer to use algorithmic techniques to help eliminate features. We will discuss this more when we return to regularization.\n",
"\n",
"\n",
"---\n",
"
\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The _Continuous_ `age` Feature\n",
"\n",
"The `age` feature encodes the users age. This is already a continuous real number so no additional feature transformations are required. However, as we will soon see, we may introduce additional related features (e.g., indicators for various age groups or non-linear transformations).\n",
"\n",
"---\n",
"
\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The _Categorical_ `state` Feature\n",
"\n",
"\n",
"The `state` feature is a string encoding the category (one of the 50 states). How do we meaningfully encode such a feature as one or more real-numbers?\n",
"\n",
"We could enumerate the states in alphabetical order `AL=0`, `AK=2`, ... `WY=49`. This is a form of **dictionary encoding** which maps each category to an integer. However, this would likely be a poor feature encoding since the magnitude provides little information about the rating. \n",
"\n",
"Alternatively, we might enumerate the states based on their geographic region (e.g., lower numbers for coastal states.). While this alternative dictionary encoding may provide information there is better way to encode categorical features for machine learning algorithms."
]
},
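{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of the dictionary encoding described above (the `dictionary_encode` helper and the short list of state abbreviations are hypothetical, introduced only for illustration), each category is mapped to its position in the sorted list of categories:\n",
"\n",
"```python\n",
"def dictionary_encode(values, categories):\n",
"    # Map each category to its index in the sorted category list\n",
"    codes = {cat: i for i, cat in enumerate(sorted(categories))}\n",
"    return [codes[v] for v in values]\n",
"\n",
"# Hypothetical subset of state abbreviations\n",
"states = ['CA', 'AL', 'WY', 'AK']\n",
"encoded = dictionary_encode(['WY', 'AL', 'CA'], states)\n",
"```\n",
"\n",
"The resulting integers are valid inputs to a numeric model, but their ordering is an artifact of the alphabet rather than anything related to the rating."
]
},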
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"
\n",
"\n",
"# One-Hot Encoding\n",
"\n",
"\n",
"\n",
"One-Hot encoding, sometimes also called **dummy encoding** is a simple mechanism to encode categorical data as real numbers such that the magnitude of each dimension is meaningful. Suppose a feature can take on $k$ distinct values (e.g., $k=50$ for 50 states in the United Stated). For each distinct _possible_ value a new feature (dimension) is created. For each record, all the new features are set to zero except the one corresponding to the value in the original feature. \n",
"\n",
"The term one-hot encoding comes from a digital circuit encoding of a categorical state as particular \"hot\" wire:\n",
"\n",
"\n",
"\n",
"The following is a relatively inefficient implementation:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([ 0., 1., 0.])"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def one_hot_encoding(x, categories):\n",
" dictionary = dict(zip(categories, range(len(categories))))\n",
" enc = np.zeros(len(categories))\n",
" enc[dictionary[x]] = 1.0\n",
" return enc\n",
"\n",
"categories = [\"cat\", \"dog\", \"apple\"]\n",
"one_hot_encoding(\"dog\", categories)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Why is this inefficient? Think about a large number of states.\n",
"\n",
"
\n",
"\n",
"**Answer:** Here we are using a dense representation which does not make efficient use of memory"
]
},
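{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to avoid the dense representation is to store only the non-zero entries in a sparse matrix. The following is a rough sketch that assumes SciPy is installed; `sparse_one_hot` is a hypothetical helper written for this example and is not part of any library:\n",
"\n",
"```python\n",
"import numpy as np\n",
"from scipy import sparse\n",
"\n",
"def sparse_one_hot(values, categories):\n",
"    # Column index assigned to each category\n",
"    index = {cat: j for j, cat in enumerate(categories)}\n",
"    rows = np.arange(len(values))\n",
"    cols = np.array([index[v] for v in values])\n",
"    data = np.ones(len(values))\n",
"    # Only one non-zero entry per record is stored\n",
"    return sparse.csr_matrix((data, (rows, cols)),\n",
"                             shape=(len(values), len(categories)))\n",
"\n",
"enc = sparse_one_hot(['dog', 'cat', 'dog'], ['cat', 'dog', 'apple'])\n",
"```\n",
"\n",
"With $n$ records and $k$ categories this stores on the order of $n$ values rather than $n \\times k$."
]
},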
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"
\n",
"\n",
"\n",
"## One-Hot Encoding in Pandas"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we create a toy dataframe of pets including their name and kind:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"
\n",
" \n",
" \n",
" | \n",
" name | \n",
" kind | \n",
" age | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Goldy | \n",
" Fish | \n",
" 0.5 | \n",
"
\n",
" \n",
" 1 | \n",
" Scooby | \n",
" Dog | \n",
" 7.0 | \n",
"
\n",
" \n",
" 2 | \n",
" Brian | \n",
" Dog | \n",
" 3.0 | \n",
"
\n",
" \n",
" 3 | \n",
" Francine | \n",
" Cat | \n",
" 10.0 | \n",
"
\n",
" \n",
" 4 | \n",
" Goldy | \n",
" Dog | \n",
" 1.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" name kind age\n",
"0 Goldy Fish 0.5\n",
"1 Scooby Dog 7.0\n",
"2 Brian Dog 3.0\n",
"3 Francine Cat 10.0\n",
"4 Goldy Dog 1.0"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame({\n",
" \"name\": [\"Goldy\", \"Scooby\", \"Brian\", \"Francine\", \"Goldy\"],\n",
" \"kind\": [\"Fish\", \"Dog\", \"Dog\", \"Cat\", \"Dog\"],\n",
" \"age\": [0.5, 7., 3., 10., 1.]\n",
"}, columns = [\"name\", \"kind\", \"age\"])\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pandas has a built in function to construct one-hot encodings called **`get_dummies`**"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Cat | \n",
" Dog | \n",
" Fish | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
"
\n",
" \n",
" 1 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
"
\n",
" \n",
" 3 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 4 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Cat Dog Fish\n",
"0 0 0 1\n",
"1 0 1 0\n",
"2 0 1 0\n",
"3 1 0 0\n",
"4 0 1 0"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.get_dummies(df['kind'])"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" age | \n",
" name_Brian | \n",
" name_Francine | \n",
" name_Goldy | \n",
" name_Scooby | \n",
" kind_Cat | \n",
" kind_Dog | \n",
" kind_Fish | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0.5 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
"
\n",
" \n",
" 1 | \n",
" 7.0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" 3.0 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
"
\n",
" \n",
" 3 | \n",
" 10.0 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 4 | \n",
" 1.0 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" age name_Brian name_Francine name_Goldy name_Scooby kind_Cat \\\n",
"0 0.5 0 0 1 0 0 \n",
"1 7.0 0 0 0 1 0 \n",
"2 3.0 1 0 0 0 0 \n",
"3 10.0 0 1 0 0 1 \n",
"4 1.0 0 0 1 0 0 \n",
"\n",
" kind_Dog kind_Fish \n",
"0 0 1 \n",
"1 1 0 \n",
"2 1 0 \n",
"3 0 0 \n",
"4 1 0 "
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.get_dummies(df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Issue:** While the Pandas `pandas.get_dummies` function is very convenient and even retains meaningful column labels it has one key downside.\n",
"\n",
"The `get_dummies` function does not take the dictionary of possible values and so will not produce the same encoding if applied across multiple dataframes with different values. This can be a big issue when rendering predictions on a new dataset.\n",
"\n",
"---\n",
"\n",
"
\n"
]
},
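{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible workaround, sketched below under the assumption that the full set of values is known ahead of time, is to declare the categories explicitly before calling `get_dummies`. The `kinds` list and the two small series are made up for this example:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Assumed complete set of possible values\n",
"kinds = ['Cat', 'Dog', 'Fish']\n",
"kind_dtype = pd.CategoricalDtype(categories=kinds)\n",
"\n",
"train = pd.Series(['Fish', 'Dog', 'Cat'])\n",
"test = pd.Series(['Dog', 'Dog'])  # contains no Cat or Fish\n",
"\n",
"# Declaring the categories up front fixes the columns of the encoding\n",
"train_enc = pd.get_dummies(train.astype(kind_dtype))\n",
"test_enc = pd.get_dummies(test.astype(kind_dtype))\n",
"```\n",
"\n",
"Both encodings then have the same three columns in the same order, even though `test` never contains a cat or a fish."
]
},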
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## One-Hot Encoding in Scikit-Learn"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Scikit-learn is a widely used machine learning package in Python and provides several implementations of feature encoders for categorical data. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### DictVectorizer"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `DictVectorizer` encodes dictionaries by taking keys that map to strings and applying a one-hot encoding."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"DictVectorizer(dtype=, separator='=', sort=True,\n",
" sparse=True)"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.feature_extraction import DictVectorizer\n",
"\n",
"vec_enc = DictVectorizer()\n",
"vec_enc.fit(df.to_dict(orient='records'))"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 0.5, 0. , 0. , 1. , 0. , 0. , 1. , 0. ],\n",
" [ 7. , 0. , 1. , 0. , 0. , 0. , 0. , 1. ],\n",
" [ 3. , 0. , 1. , 0. , 1. , 0. , 0. , 0. ],\n",
" [ 10. , 1. , 0. , 0. , 0. , 1. , 0. , 0. ],\n",
" [ 1. , 0. , 1. , 0. , 0. , 0. , 1. , 0. ]])"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vec_enc.transform(df.to_dict(orient='records')).toarray()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"['age',\n",
" 'kind=Cat',\n",
" 'kind=Dog',\n",
" 'kind=Fish',\n",
" 'name=Brian',\n",
" 'name=Francine',\n",
" 'name=Goldy',\n",
" 'name=Scooby']"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vec_enc.get_feature_names()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can apply the dictionary vectorizer to new data:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 35., 1., 0., 0., 0., 0., 1., 0.],\n",
" [ 0., 0., 0., 0., 0., 0., 0., 0.],\n",
" [ 0., 0., 0., 0., 0., 0., 1., 0.]])"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vec_enc.transform([\n",
" {\"kind\": \"Cat\", \"name\": \"Goldy\", \"age\": 35},\n",
" {\"kind\": \"Bird\", \"name\": \"Fluffy\"},\n",
" {\"breed\": \"Chihuahua\", \"name\": \"Goldy\"},\n",
"]).toarray()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that the second record `{\"kind\": \"Bird\", \"name\": \"Fluffy\"}` has invalid categories and missing fields and it's encoding is entirely zero. Is this reasonable?\n",
"\n",
"---\n",
"\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### _Bonus:_ sklearn `OneHotEncoder`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The basic sklearn `OneHotEncoder` encodes a column of integers corresponding to category values. Therefore, we first need to **dictionary encode** the string values. "
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 2\n",
"1 1\n",
"2 1\n",
"3 0\n",
"4 1\n",
"dtype: int8"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Convert the \"kind\" column into a category column\n",
"kind_codes = (\n",
" df['kind'].astype(\"category\", categories=[\"Cat\", \"Dog\",\"Fish\"])\n",
" .cat.codes # Extract the category codes\n",
")\n",
"kind_codes"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 0., 0., 1.],\n",
" [ 0., 1., 0.],\n",
" [ 0., 1., 0.],\n",
" [ 1., 0., 0.],\n",
" [ 0., 1., 0.]])"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.preprocessing import OneHotEncoder\n",
"\n",
"# Build an instance of the encoder\n",
"onehot = OneHotEncoder()\n",
"\n",
"# Construct an integer column vector from the 'kind_codes' column\n",
"column_vec_kinds = np.array([kind_codes.values]).T\n",
"\n",
"# Fit the encoder (which can be resued to transform other data)\n",
"onehot.fit(column_vec_kinds)\n",
"\n",
"# Transform the column vector\n",
"onehot.transform(column_vec_kinds).toarray()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# One-Hot Encoding Icecream\n",
"\n",
"Suppose you obtain the log of icecream sales for a popular icecream shop.\n",
"\n",
"\n",
"\n",
"\n",
"The data consists of the flavor and topping, the total icecream mass (mass), and the price charged. \n"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" flavor | \n",
" topping | \n",
" mass | \n",
" price | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Chocolate | \n",
" Chocolate | \n",
" 2.5 | \n",
" 2.50 | \n",
"
\n",
" \n",
" 1 | \n",
" Vanilla | \n",
" Chocolate | \n",
" 4.8 | \n",
" 4.10 | \n",
"
\n",
" \n",
" 2 | \n",
" Strawberry | \n",
" Sprinkles | \n",
" 3.9 | \n",
" 2.26 | \n",
"
\n",
" \n",
" 3 | \n",
" Strawberry | \n",
" Sprinkles | \n",
" 3.4 | \n",
" 2.00 | \n",
"
\n",
" \n",
" 4 | \n",
" Chocolate | \n",
" Chocolate | \n",
" 1.6 | \n",
" 1.80 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" flavor topping mass price\n",
"0 Chocolate Chocolate 2.5 2.50\n",
"1 Vanilla Chocolate 4.8 4.10\n",
"2 Strawberry Sprinkles 3.9 2.26\n",
"3 Strawberry Sprinkles 3.4 2.00\n",
"4 Chocolate Chocolate 1.6 1.80"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"icecream = pd.read_csv(\"icecream_train.csv\")\n",
"icecream.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Predicting the price of icecream\n",
"\n",
"**How would you predict the price of icecream given the flavor, topping, and mass?**\n",
"\n",
"\n",
"--- \n",
"\n",
"
\n",
"Let's start simple and focus on predicting the price from the mass:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from sklearn import linear_model\n",
"\n",
"# Train a linear regression modle to predict price from mass\n",
"reg_mass = linear_model.LinearRegression()\n",
"reg_mass.fit(icecream[['mass']], icecream['price'])\n",
"\n",
"# Make predictions for each of the purchases in our dataset \n",
"yhat_mass = reg_mass.predict(icecream[['mass']])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Analyze the fit\n",
"\n",
"This is a fairly simple one-dimensional problem so we can plot the data."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def plot_fit_line(x, y, model, filename):\n",
" # Data points\n",
" points = go.Scatter(name = \"Data\", x=x,y=y, mode='markers')\n",
" # Predictions for line\n",
" x_query = np.linspace(np.min(x), np.max(x), 1000)\n",
" y_query = model.predict(np.array([x_query]).T)\n",
" model_line = go.Scatter(name=\"Model\", x=x_query, y=y_query)\n",
" # Residual line segments\n",
" residual_lines = [\n",
" go.Scatter(x=[x,x], y=[y,yhat],\n",
" mode='lines', showlegend=False, \n",
" line=dict(color='black', width = 0.5))\n",
" for (x, y, yhat) in zip(x, y, model.predict(np.array([x]).T))\n",
" ]\n",
" return py.iplot([points, model_line] + residual_lines, filename=filename)\n",
"\n",
"\n",
"plot_fit_line(icecream['mass'], icecream['price'], reg_mass, \"FE_Part1_0\") "
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"residual = yhat_mass - icecream['price']\n",
"py.iplot(ff.create_distplot([residual], group_labels=['Residuals'], bin_size=0.1), filename=\"FE_Part1_1\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"# RMSE and MAD\n",
"\n",
"When plotting the prediction error it is common to compute the **root mean squared error (RMSE)** which is the square-root of the average squared loss over the training data. \n",
"\n",
"$$ \\large\n",
"\\textbf{RMSE} = \\sqrt{\\frac{1}{n} \\sum_{i=1}^n \\left(Y_i - f_\\theta(X_i)\\right)^2}\n",
"$$\n",
"\n",
"The RMSE **error** in the **units of $Y$** (in this case price) and is biased towards points with the highest error.\n",
"\n",
"Another error metric that is a bit more robust is the **median absolute devaiation (MAD)** error. \n",
"\n",
"$$ \\large\n",
"\\textbf{MAD} = \\textbf{median}\\left(Y_i - f_\\theta(X_i)\\right)\n",
"$$\n",
"\n",
"\n",
"The RMSE error metric is closer to our squared loss objective and the MAD error is closer to an L1 loss and the corresponding Least Absolute Deviation Regression which we have not yet covered.\n",
"\n",
"Let's take a look at both:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def rmse(y, yhat):\n",
" return np.sqrt(np.mean((yhat-y)**2))\n",
"\n",
"def mad(y, yhat):\n",
" return np.median(np.abs(yhat - y))"
]
},
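{
"cell_type": "markdown",
"metadata": {},
"source": [
"To illustrate why MAD is described as more robust, here is a small sketch on made-up numbers (the arrays below are hypothetical and unrelated to the icecream data). The two metrics are redefined so the example stands on its own:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def rmse(y, yhat):\n",
"    return np.sqrt(np.mean((yhat - y)**2))\n",
"\n",
"def mad(y, yhat):\n",
"    return np.median(np.abs(yhat - y))\n",
"\n",
"y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])\n",
"yhat_small = y + 0.1             # every prediction is off by 0.1\n",
"yhat_outlier = yhat_small.copy()\n",
"yhat_outlier[-1] = 15.0          # one prediction is off by 10\n",
"\n",
"rmse_small, mad_small = rmse(y, yhat_small), mad(y, yhat_small)\n",
"rmse_outlier, mad_outlier = rmse(y, yhat_outlier), mad(y, yhat_outlier)\n",
"```\n",
"\n",
"A single bad prediction moves the RMSE from roughly 0.1 to roughly 4.5, while the MAD stays at roughly 0.1."
]
},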
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"RMSE: 0.537536988756\n",
"MAD: 0.315943896294\n"
]
}
],
"source": [
"print(\"RMSE:\", rmse(icecream['price'], yhat_mass))\n",
"print(\"MAD:\", mad(icecream['price'], yhat_mass))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Is this a good fit?\n",
"\n",
"---\n",
"
\n",
"\n",
"Often a very basic model is enough. However we notice something intresting. \n",
"\n",
"**At the same mass value there appears to be multiple icecream prices.** \n",
"\n",
"**Why?**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Stratified Analysis\n",
"\n",
"Given we have categorical data one thing we might do is first try to **stratify our analysis.** We could look at at subset of assignments and try to get a better picture of what is happening.\n",
"\n",
"I like Chocolate so I decided to look at just purchases of chocolate flavored icecream and chocolate toppings. "
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ind = (icecream['flavor'] == \"Chocolate\") & (icecream['topping'] == \"Chocolate\")\n",
"reg_chocolate = linear_model.LinearRegression()\n",
"reg_chocolate.fit(icecream[ind][['mass']], icecream[ind]['price'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's plot a stratified version of the data"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"choc_choc_points = (\n",
" go.Scatter(name=\"Chocolate+Chocolate\", \n",
" x = icecream[ind]['mass'], y = icecream[ind]['price'], \n",
" mode='markers',\n",
" marker=dict(color=\"red\", symbol=\"triangle-up\", size=10)))\n",
"\n",
"ind_flav = icecream['flavor'] == \"Chocolate\"\n",
"chocolate_points = (\n",
" go.Scatter(name=\"Choc. Flavored\", \n",
" x = icecream[ind_flav]['mass'], y = icecream[ind_flav]['price'], \n",
" mode='markers',\n",
" marker=dict(color=\"red\", symbol=\"circle-open\", size=15)))\n",
"\n",
"all_data = (\n",
" go.Scatter(name=\"Data\", \n",
" x = icecream['mass'], y = icecream['price'], mode='markers',\n",
" marker=dict(color=\"gray\")))\n",
"\n",
"x_query = np.linspace(icecream['mass'].min(), icecream['mass'].max(), 500)\n",
"line_mass = (\n",
" go.Scatter(name=\"mass Only\", \n",
" x = x_query, y = reg_mass.predict(np.array([x_query]).T), \n",
" line=dict(color=\"black\")))\n",
"\n",
"line_choclate = (\n",
" go.Scatter(name=\"Choc.+Choc. Line\", \n",
" x = x_query, y = reg_chocolate.predict(np.array([x_query]).T), \n",
" line=dict(color=\"orange\")))\n",
"\n",
"py.iplot([all_data, chocolate_points, choc_choc_points, line_mass, line_choclate], \n",
" filename=\"FE_Part1_2\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the above we plot:\n",
"1. all the original data as dots\n",
"1. a circle around chocolate flavored icecream purchases\n",
"1. a triangle over the chocolate flavored icecream purchases with chocolate toppings.\n",
"1. and both the original and chocolate-chocolate icecream regression models.\n",
"\n",
"What do we observe?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"
\n",
"\n",
"\n",
"They may charge customers differnt prices based on flavor and toppings. How can we incorporate that information?\n",
"\n",
"Let's try constructing one-hot encodings for the flavor and topping information features."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"<150x8 sparse matrix of type ''\n",
"\twith 450 stored elements in Compressed Sparse Row format>"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"one_hot_enc = DictVectorizer()\n",
"feature_columns = [\"flavor\", \"topping\", \"mass\"]\n",
"one_hot_enc.fit(icecream[feature_columns].to_dict(orient='records'))\n",
"one_hot_features = (\n",
" one_hot_enc.transform(icecream[feature_columns].to_dict(orient='records'))\n",
")\n",
"one_hot_features"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Examining a few rows we see there are multiple one hot encodings (one for flavor and one for toppings)."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"matrix([[ 1. , 0. , 0. , 2.5, 1. , 0. , 0. , 0. ],\n",
" [ 0. , 0. , 1. , 4.8, 1. , 0. , 0. , 0. ],\n",
" [ 0. , 1. , 0. , 3.9, 0. , 0. , 0. , 1. ],\n",
" [ 0. , 1. , 0. , 3.4, 0. , 0. , 0. , 1. ],\n",
" [ 1. , 0. , 0. , 1.6, 1. , 0. , 0. , 0. ]])"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"one_hot_features.todense()[:5,:]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Again we fit a model:"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Train a linear regression modle to predict price from mass\n",
"one_hot_reg = linear_model.LinearRegression()\n",
"one_hot_reg.fit(one_hot_features, icecream['price'])\n",
"\n",
"# Make predictions for each of the purchases in our dataset \n",
"yhat_one_hot = one_hot_reg.predict(one_hot_features)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## How can we visualize the fit?\n",
"\n",
"
\n",
"\n",
"---"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"residual = yhat_one_hot - icecream['price']\n",
"py.iplot(ff.create_distplot([residual], group_labels=['Residuals'], bin_size=0.01), filename=\"FE_Part1_3\")"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"py.iplot([\n",
" go.Bar(name=\"mass Only\",\n",
" x=[\"RMSE\", \"MAD\"], \n",
" y=[rmse(icecream['price'], yhat_mass), \n",
" mad(icecream['price'], yhat_mass)]),\n",
" go.Bar(name=\"OneHot + mass\",\n",
" x=[\"RMSE\", \"MAD\"],\n",
" y=[rmse(icecream['price'], yhat_one_hot), \n",
" mad(icecream['price'], yhat_one_hot)])\n",
"], filename=\"FE_Part1_4\")"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y_vs_yhat = go.Scatter(name=\"y vs yhat\", x=icecream['price'], y=yhat_one_hot, mode='markers')\n",
"slope_one = go.Scatter(name=\"Ideal\", x=[0,5], y=[0,5])\n",
"layout = go.Layout(xaxis=dict(title=\"y\"), yaxis=dict(title=\"yhat\"))\n",
"py.iplot(go.Figure(data=[y_vs_yhat, slope_one], layout=layout), \n",
" filename=\"FE_Part1_5\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## How could we improve the model?\n",
"\n",
"
\n",
"\n",
"**Icecream Pricing Model:**\n",
"\n",
"$$\\large\n",
"\\text{price} = \\text{mass} * \\theta_\\text{flavor} + \\theta_\\text{topping}\n",
"$$\n",
"\n",
"**Question** How could we encode this model so that we can learn it using linear regression?\n",
"\n",
"--- \n",
"
\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is a proposal:\n",
"\n",
"\n",
"\\begin{align}\n",
"\\phi\\left(\\text{mass}, \\text{flavor}, \\text{topping} \\right) & = \n",
" \\left[\\text{mass} * \\textbf{OneHot}\\left(\\text{flavor}\\right), \n",
" \\textbf{OneHot}\\left(\\text{topping}\\right)\\right] \n",
"\\end{align}\n",
"\n",
"To see how this works lets look at $\\theta_\\text{topping}$. \n",
"\n",
"\\begin{align}\n",
"\\textbf{OneHot}\\left(\\text{topping}(x)\\right) = \n",
"\\left[\\textbf{isSprinkles}(x), \\textbf{isFruit}(x), \\textbf{isChoc}(x), \\textbf{isNuts}(x)\\right]\n",
"\\end{align}\n",
"\n",
"\\begin{align}\n",
"\\theta_\\text{topping} = \n",
"\\left[\\theta_\\text{sprinkles}, \\theta_\\text{isFruit}, \\theta_\\text{isChoc}, \\theta_\\text{isNuts}\\right]\n",
"\\end{align}\n",
"\n",
"If we take their dot-product we select the corresponding essential learns the constant function $\\theta$ with the unique $\\theta$ value for that topping.\n",
"\n",
"\n"
]
},
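{
"cell_type": "markdown",
"metadata": {},
"source": [
"The selection performed by the dot product can be checked numerically. In this sketch the parameter values, the flavor list, and the `one_hot` helper are all hypothetical; they are not the values learned from the icecream data:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def one_hot(value, categories):\n",
"    enc = np.zeros(len(categories))\n",
"    enc[categories.index(value)] = 1.0\n",
"    return enc\n",
"\n",
"toppings = ['sprinkles', 'fruit', 'choc', 'nuts']\n",
"theta_topping = np.array([0.25, 0.50, 0.75, 1.00])  # made-up topping prices\n",
"\n",
"# The dot product picks out the parameter of the active topping\n",
"selected = one_hot('choc', toppings) @ theta_topping\n",
"\n",
"flavors = ['chocolate', 'strawberry', 'vanilla']\n",
"theta_flavor = np.array([1.0, 0.5, 0.8])  # made-up price per unit mass\n",
"\n",
"# Scaling the one-hot flavor vector by the mass gives mass * theta_flavor\n",
"mass = 2.5\n",
"price = (mass * one_hot('chocolate', flavors)) @ theta_flavor + selected\n",
"```\n",
"\n",
"Here `selected` equals the chocolate topping parameter and `price` equals `mass` times the chocolate flavor parameter plus that topping parameter, which matches the pricing model above."
]
},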
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we will construct one hot encodings for the flavor and toppings in seperate calls so we know which columns correspond to each:"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"flavor_enc = DictVectorizer()\n",
"flavor_enc.fit(icecream[[\"flavor\"]].to_dict(orient='records'))\n",
"onehot_flavor = flavor_enc.transform(icecream[[\"flavor\"]].to_dict(orient='records'))"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"topping_enc = DictVectorizer()\n",
"topping_enc.fit(icecream[[\"topping\"]].to_dict(orient='records'))\n",
"onehot_topping = topping_enc.transform(icecream[[\"topping\"]].to_dict(orient='records'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To scale the sparse matrix fo encodings by the mass we need to multiply by a sparse diaganol matrix. "
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import scipy as sp\n",
"\n",
"n = len(icecream['mass'].values)\n",
"\n",
"scaling_matrix = sp.sparse.spdiags(icecream['mass'].values, 0, n, n)\n",
"\n",
"mass_times_flavor = scaling_matrix @ onehot_flavor"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Combining the sparse `mass_times_flavor` columns with the `onehot_topping` columns we get a new feature matrix `Phi`"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"<150x7 sparse matrix of type ''\n",
"\twith 300 stored elements in COOrdinate format>"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Phi = sp.sparse.hstack([mass_times_flavor, onehot_topping])\n",
"Phi"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Again let's look at a few examples (in practice you would want to avoid the `todense()` call"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"matrix([[ 2.5, 0. , 0. , 1. , 0. , 0. , 0. ],\n",
" [ 0. , 0. , 4.8, 1. , 0. , 0. , 0. ],\n",
" [ 0. , 3.9, 0. , 0. , 0. , 0. , 1. ],\n",
" [ 0. , 3.4, 0. , 0. , 0. , 0. , 1. ],\n",
" [ 1.6, 0. , 0. , 1. , 0. , 0. , 0. ]])"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Phi.todense()[:5,:]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Fitting the linear model (once more)\n",
"\n",
"Notice that this time I am removing the intercept (bias) term since I don't believe it should be part of my model"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from sklearn import linear_model\n",
"reg_domain_knowledge = linear_model.LinearRegression(fit_intercept=False)\n",
"reg_domain_knowledge.fit(Phi, icecream['price'])\n",
"yhat_domain_knowledge = reg_domain_knowledge.predict(Phi)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Did we improve the fit?"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"py.iplot([\n",
" go.Bar(name=\"mass Only\",\n",
" x=[\"RMSE\", \"MAD\"], \n",
" y=[rmse(icecream['price'], yhat_mass), \n",
" mad(icecream['price'], yhat_mass)]),\n",
" go.Bar(name=\"OneHot + mass\",\n",
" x=[\"RMSE\", \"MAD\"],\n",
" y=[rmse(icecream['price'], yhat_one_hot), \n",
" mad(icecream['price'], yhat_one_hot)]),\n",
" go.Bar(name=\"Domain Knowledge\",\n",
" x=[\"RMSE\", \"MAD\"],\n",
" y=[rmse(icecream['price'], yhat_domain_knowledge), \n",
" mad(icecream['price'], yhat_domain_knowledge)])\n",
"], filename=\"FE_Part1_6\")"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"yhat_vs_y = go.Scatter(name=\"y vs yhat\", x=icecream['price'], y=yhat_domain_knowledge, mode='markers')\n",
"slope_one = go.Scatter(name=\"Ideal\", x=[0,5], y=[0,5])\n",
"layout = go.Layout(xaxis=dict(title=\"y\"), yaxis=dict(title=\"yhat\"))\n",
"py.iplot(go.Figure(data=[yhat_vs_y, slope_one], layout=layout), \n",
" filename=\"FE_Part1_7\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"
\n",
"\n",
"## Key Points on One-Hot Encoding\n",
"\n",
"While one-hot encoding is the standard mechanism for encoding **categorical** data there are a few issues to keep in mind:\n",
"\n",
"1. may generate **too many** dimensions/features\n",
" 1. sparse representations are often necessary\n",
" 1. watch out for issues with over-fitting (more on this soon)\n",
"\n",
"1. all possible **values must be known in advance**\n",
" 1. unable introduce new categories when making predictions\n",
" 1. be sure to use the same encoding when making predictions\n",
"\n",
"1. **missing values** are reasonably captured by a zero in all dummy features.\n",
"\n",
"1. Can be combined with other features using domain knowledge.\n",
"\n",
"---\n",
"\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The _Boolean_ `hasBought` Feature\n",
"\n",
"The `hasBought` feature is a boolean (0/1) valued feature but we it can have missing values:\n",
"\n",
"\n",
"\n",
"There are a few options for encoding `hasBought`:\n",
"\n",
"1. **Interpret directly as numbers.** If there were no missing values then the booleans are typically treated directly as continuous values.\n",
"\n",
"1. **Apply one-hot encoding.** This would create two new features `hasBought=True` and `hasBought=False`. This is probably the most general encoding but suffers from increased complexity.\n",
"\n",
"1. **1/-1 Encoding.** Another common encoding for booleans with missing values is:\n",
"\n",
"\\begin{align}\n",
"\\textbf{True} & \\Rightarrow 1 \\\\\n",
"\\textbf{Null} & \\Rightarrow 0 \\\\\n",
"\\textbf{False} & \\Rightarrow -1 \n",
"\\end{align}\n",
"\n",
"---\n",
"\n",
"
\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"## The _Text_ `review` Feature\n",
"\n",
"Encoding text as a real-valued feature is especially challenging and many of the standard transformations are **lossy**. Moreover, all of the earlier transformations (e.g., one-hot encoding and Boolean representations) preserve the information in the feature. In contrast, most of the techniques for encoding text destroy information about the word order and in many cases key parts of the grammar. \n",
"\n",
"Here we will discuss two widely used representations of text:\n",
"\n",
"* **Bag-of-Words Encoding**: encodes text by the frequency of each word\n",
"* **N-Gram Encoding**: encodes text by the frequency of sequences of words of length $N$\n",
"\n",
"Both of these encoding strategies are related to the one-hot encoding with dummy features created for every word or sequence of words and with multiple dummy features having counts greater than zero.\n",
"\n",
"---\n",
"
\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The Bag-of-Words Encoding\n",
"\n",
"\n",
"The bag-of-words encoding is widely used and a standard representation for text in many of the popular text clustering algorithms. The following is a simple illustration of the bag-of-words encoding:\n",
"\n",
"\n",
"\n",
"**Notice**\n",
"1. **Stop words are removed.** Stop-words are words like `is` and `about` that in isolation contain very little information about the meaning of the sentence. Here is a good list of [stop-words in many languages](https://code.google.com/archive/p/stop-words/). \n",
"1. **Word order information is lost.** Nonetheless the vector still suggests that the sentence is about `fun`, `machines`, and `learning`. Thought there are many possible meanings _learning machines have fun learning_ or _learning about machines is fun learning_ ...\n",
"1. **Capitalization and punctuation are typically removed.** \n",
"1. **Sparse Encoding:** is necessary to represent the bag-of-words efficiently. There are millions of possible words (including terminology, names, and misspellings) and so instantiating a `0` for every word that is not in each record would be incredibly inefficient. \n",
"\n",
"**Why is it called a bag-of-words?** A bag is another term for a **multiset**: _an unordered \n",
"collection which may contain multiple instances of each element._ \n",
"\n",
"\n",
"---\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Break?\n",
"\n",
"When professor Gonzalez was a graduate student at Carnegie Mellon University, he and several other computer scientists created the following art piece on display at the Gates Center:\n",
"\n",
"\n",
"\n",
"**Notice**\n",
"1. The unordered collection of words in the bag.\n",
"1. The stop words on the floor.\n",
"1. _The missing broom._ The original sculpture had a broom attached but the janitor got confused .... \n",
"\n",
"---\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The N-Gram Encoding\n",
"\n",
"The N-Gram encoding is a generalization of the bag-of-words encoding designed to capture limited ordering information. Consider the following passage of text:\n",
"\n",
"> _The book was not well written but I did enjoy it._\n",
"\n",
"If we re-arrange the words we can also write:\n",
"\n",
"> _The book was well written but I did not enjoy it._\n",
"\n",
"Moreover, local word order can be important when making decisions about text. The n-gram encoding captures local word order by defining counts over sliding windows. In the following example a bi-gram ($n=2$) encoding is constructed:\n",
"\n",
"\n",
"\n",
"The above n-gram would be encoded in the sparse vector:\n",
"\n",
"\n",
"\n",
"Notice that the n-gram captures key pieces of sentiment information: `\"well written\"` and `\"not enjoy\"`. \n",
"\n",
"N-grams are often used for other types of sequence data beyond text. For example, n-grams can be used to encode genomic data, protein sequences, and click logs. \n",
"\n",
"**N-Gram Issues**\n",
"1. The n-gram representation is hyper sparse and maintaining the dictionary of possible n-grams can be very costly. The **hashing trick** is a popular solution to approximate the sparse n-gram encoding. In the hashing trick each n-gram is mapped to a relatively large (e.g., 32bit) hash-id and the counts are associated with the hash index without saving the n-gram text in a dictionary. As a consequence, multiple n-grams are treated as the same.\n",
"1. As $N$ increase the chance of seeing the same n-grams at prediction time decreases rapidly.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Implementing Bag-of-words and N-grams "
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"['Some say the world will end in fire,',\n",
" 'Some say in ice.',\n",
" 'From what Ive tasted of desire',\n",
" 'I hold with those who favor fire.']"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frost_text = [x for x in \"\"\"\n",
"Some say the world will end in fire,\n",
"Some say in ice.\n",
"From what Ive tasted of desire\n",
"I hold with those who favor fire.\n",
"\"\"\".split(\"\\n\") if len(x) > 0]\n",
"\n",
"frost_text"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"CountVectorizer(analyzer='word', binary=False, decode_error='strict',\n",
" dtype=, encoding='utf-8', input='content',\n",
" lowercase=True, max_df=1.0, max_features=None, min_df=1,\n",
" ngram_range=(1, 1), preprocessor=None, stop_words='english',\n",
" strip_accents=None, token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n",
" tokenizer=None, vocabulary=None)"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.feature_extraction.text import CountVectorizer\n",
"\n",
"# Construct the tokenizer with English stop words\n",
"bow = CountVectorizer(stop_words=\"english\")\n",
"\n",
"# fit the model to the passage\n",
"bow.fit(frost_text)"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Words: [(0, 'desire'), (1, 'end'), (2, 'favor'), (3, 'hold'), (4, 'ice'), (5, 'ive'), (6, 'say'), (7, 'tasted'), (8, 'world')]\n"
]
}
],
"source": [
"# Print the words that are kept\n",
"print(\"Words:\", \n",
" list(zip(range(0,len(bow.get_feature_names())),bow.get_feature_names())))"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Sentence Encoding: \n",
"\n",
"Some say the world will end in fire,\n",
" (0, 1)\t1\n",
" (0, 6)\t1\n",
" (0, 8)\t1\n",
"------------------\n",
"Some say in ice.\n",
" (0, 4)\t1\n",
" (0, 6)\t1\n",
"------------------\n",
"From what Ive tasted of desire\n",
" (0, 0)\t1\n",
" (0, 5)\t1\n",
" (0, 7)\t1\n",
"------------------\n",
"I hold with those who favor fire.\n",
" (0, 2)\t1\n",
" (0, 3)\t1\n",
"------------------\n"
]
}
],
"source": [
"print(\"Sentence Encoding: \\n\")\n",
"# Print the encoding of each line\n",
"for (s, r) in zip(frost_text, bow.transform(frost_text)):\n",
" print(s)\n",
" print(r)\n",
" print(\"------------------\")"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"CountVectorizer(analyzer='word', binary=False, decode_error='strict',\n",
" dtype=, encoding='utf-8', input='content',\n",
" lowercase=True, max_df=1.0, max_features=None, min_df=1,\n",
" ngram_range=(1, 2), preprocessor=None, stop_words=None,\n",
" strip_accents=None, token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n",
" tokenizer=None, vocabulary=None)"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Construct the tokenizer with English stop words\n",
"bigram = CountVectorizer(ngram_range=(1, 2))\n",
"# fit the model to the passage\n",
"bigram.fit(frost_text)"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Words: [(0, 'desire'), (1, 'end'), (2, 'end in'), (3, 'favor'), (4, 'favor fire'), (5, 'fire'), (6, 'from'), (7, 'from what'), (8, 'hold'), (9, 'hold with'), (10, 'ice'), (11, 'in'), (12, 'in fire'), (13, 'in ice'), (14, 'ive'), (15, 'ive tasted'), (16, 'of'), (17, 'of desire'), (18, 'say'), (19, 'say in'), (20, 'say the'), (21, 'some'), (22, 'some say'), (23, 'tasted'), (24, 'tasted of'), (25, 'the'), (26, 'the world'), (27, 'those'), (28, 'those who'), (29, 'what'), (30, 'what ive'), (31, 'who'), (32, 'who favor'), (33, 'will'), (34, 'will end'), (35, 'with'), (36, 'with those'), (37, 'world'), (38, 'world will')]\n"
]
}
],
"source": [
"# Print the words that are kept\n",
"print(\"\\nWords:\", \n",
" list(zip(range(0,len(bigram.get_feature_names())), bigram.get_feature_names())))"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Sentence Encoding: \n",
"\n",
"Some say the world will end in fire,\n",
" (0, 1)\t1\n",
" (0, 2)\t1\n",
" (0, 5)\t1\n",
" (0, 11)\t1\n",
" (0, 12)\t1\n",
" (0, 18)\t1\n",
" (0, 20)\t1\n",
" (0, 21)\t1\n",
" (0, 22)\t1\n",
" (0, 25)\t1\n",
" (0, 26)\t1\n",
" (0, 33)\t1\n",
" (0, 34)\t1\n",
" (0, 37)\t1\n",
" (0, 38)\t1\n",
"------------------\n",
"Some say in ice.\n",
" (0, 10)\t1\n",
" (0, 11)\t1\n",
" (0, 13)\t1\n",
" (0, 18)\t1\n",
" (0, 19)\t1\n",
" (0, 21)\t1\n",
" (0, 22)\t1\n",
"------------------\n",
"From what Ive tasted of desire\n",
" (0, 0)\t1\n",
" (0, 6)\t1\n",
" (0, 7)\t1\n",
" (0, 14)\t1\n",
" (0, 15)\t1\n",
" (0, 16)\t1\n",
" (0, 17)\t1\n",
" (0, 23)\t1\n",
" (0, 24)\t1\n",
" (0, 29)\t1\n",
" (0, 30)\t1\n",
"------------------\n",
"I hold with those who favor fire.\n",
" (0, 3)\t1\n",
" (0, 4)\t1\n",
" (0, 5)\t1\n",
" (0, 8)\t1\n",
" (0, 9)\t1\n",
" (0, 27)\t1\n",
" (0, 28)\t1\n",
" (0, 31)\t1\n",
" (0, 32)\t1\n",
" (0, 35)\t1\n",
" (0, 36)\t1\n",
"------------------\n"
]
}
],
"source": [
"print(\"\\nSentence Encoding: \\n\")\n",
"# Print the encoding of each line\n",
"for (s, r) in zip(frost_text, bigram.transform(frost_text)):\n",
" print(s)\n",
" print(r)\n",
" print(\"------------------\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## _Bonus:_ Term Frequency Scaling\n",
"\n",
"If we are encoding text in a particular domain (e.g., processing insurance claims) it is likely that there will be frequent terms (e.g., `insurance` or `claim`) that provide little information. However, because these terms occur frequently they can present challenges to some modeling techniques. In these cases, additional scaling may be applied to transform the bag-of-word or n-gram vectors to emphasize the more informative terms. One of the most common scalings techniques is the **term frequency inverse document frequency (TF-IDF)** which emphasizes words that are unique to a particular record. Because the notation is confusing, I have provided a pseudo code implementation. However, you should use a more efficient sparse implementation like those provided in scikit learn.\n",
"\n",
"```python\n",
"def tfidf(X):\n",
" \"\"\"\n",
" Input: X is a bag of words matrix (rows=records, cols=terms)\n",
" \"\"\"\n",
" (ndocs, nwords) = X.shape\n",
" tf = X / X.sum(axis=1)[:, np.newaxis]\n",
" idf = ndocs / (X > 0).sum(axis=0) \n",
" return tf * np.log(idf)\n",
"```\n",
"\n",
"\n",
"While these transformations are especially important when computing similarities between vector encodings of text. We will not cover these transformations in DS100 but it is worth knowing that they exist.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Summary of Feature Encoding\n",
"\n",
"Most machine learning (ML) and statistics techniques operate on multivariate real-valued domains (i.e., vectors). As a consequence, we need methods to encode non-continuous datatypes into meaningful continuous forms. We discussed:\n",
"\n",
"1. **one-hot** (a.k.a. **dummy variable**) encoding transform categorical values into vectors of binary values with dimension equal to the number of possible values.\n",
"1. **bag-of-words** and **n-gram** encoding transform text into frequency statistics for individual terms and groups of terms. \n",
"\n",
"We will now explore how feature transformations can be used to capture domain knowledge and encode complex relationships."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda env:ds100]",
"language": "python",
"name": "conda-env-ds100-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}