14  Feature Engineering

  • Recognize the value of feature engineering as a tool to improve model performance
  • Implement polynomial feature generation and one hot encoding
  • Understand the interactions between model complexity, model variance, and training error

At this point in the course, we’ve equipped ourselves with some powerful techniques to build and optimize models. We’ve explored how to develop models of multiple variables, as well as how to fit these models to maximize their performance.

All of this was done with one major caveat: the regression models we’ve worked with so far are all linear in the input variables. We’ve assumed that our predictions should be a linear combination of the input variables. While this works well in some cases, the real world isn’t always so straightforward. In today’s lecture, we’ll learn an important method to address this issue – and consider some new problems that can arise when we do so.

14.1 Feature Engineering

Feature Engineering is the process of transforming the raw features into more informative features that can be used in modeling or EDA tasks.

Feature engineering allows you to:

  • Capture domain knowledge (e.g. periodicity or relationships between features).
  • Express non-linear relationships using simple linear models.
  • Encode non-numeric features to be used as inputs to models. Example: using the country of origin of a car as an input to modeling its efficiency.

Why doesn’t sklearn have a SquareRegression or PolynomialRegression class?

  • We can translate these into linear models with features that are polynomials of x.
  • Feature engineering saves sklearn from a lot of redundancy in its library.
  • Linear models have really nice properties.

14.2 Feature Functions

A feature function takes our original \(d\)-dimensional input, \(\mathbb{X}\), and transforms it into a \(p\)-dimensional input, \(\Phi\).

For example, when we add the squared term of an existing column, we are effectively applying a feature function: taking an \(n \times 1\) matrix, \([\text{hp}]\), and turning it into an \(n \times 2\) matrix, \([\text{hp}, \text{hp}^2]\).
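To make this concrete, here is a minimal sketch of such a feature function written by hand (using a few made-up horsepower values rather than the course dataset):

import numpy as np

def phi(hp):
    """Feature function: map an (n, 1) column of hp values to an (n, 2) matrix [hp, hp^2]."""
    hp = np.asarray(hp, dtype=float).reshape(-1, 1)
    return np.hstack([hp, hp ** 2])

hp = np.array([95, 130, 165])   # made-up horsepower values
Phi = phi(hp)
print(Phi.shape)                # (3, 2): one row per observation, columns hp and hp^2
print(Phi)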

As the number of features grows, we can capture arbitrarily complex relationships.

Let’s take a moment to dig deeper and remind ourselves where this linearity comes from. Consider the following dataset on vehicles:

Code
import seaborn as sns
vehicles = sns.load_dataset("mpg").rename(columns={"horsepower":"hp"}).dropna()
vehicles.head(5)
mpg cylinders displacement hp weight acceleration model_year origin name
0 18.0 8 307.0 130.0 3504 12.0 70 usa chevrolet chevelle malibu
1 15.0 8 350.0 165.0 3693 11.5 70 usa buick skylark 320
2 18.0 8 318.0 150.0 3436 11.0 70 usa plymouth satellite
3 16.0 8 304.0 150.0 3433 12.0 70 usa amc rebel sst
4 17.0 8 302.0 140.0 3449 10.5 70 usa ford torino

Suppose we wish to develop a model to predict a vehicle’s fuel efficiency ("mpg") as a function of its horsepower ("hp"). Glancing at the plot below, we see that the relationship between "mpg" and "hp" is non-linear – an SLR fit doesn’t capture the relationship between the two variables.

[Figure degree_1: SLR fit of mpg vs. hp]

Recall our standard multiple linear regression model. In its current form, it is linear in terms of both \(\theta_i\) and \(x_i\):

\[\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \ldots\]

Just by eyeballing the vehicle data plotted above, it seems that a quadratic model might be more appropriate. In other words, a model of the form below would likely do a better job of capturing the non-linear relationship between the two variables:

\[\hat{y} = \theta_0 + \theta_1 x + \theta_2 x^2\]

This looks fairly similar to our original multiple regression framework! Importantly, it is still linear in \(\theta_i\) – the prediction \(\hat{y}\) is a linear combination of the model parameters. This means that we can use the same linear algebra methods as before to derive the optimal model parameters when fitting the model.

You may be wondering: how can this be a linear model if there is now an \(x^2\) term? The key is that, although the model contains non-linear \(x\) terms, it is still linear with respect to the model parameters, \(\theta_i\). Because our OLS derivation only relied on the model being linear in \(\theta_i\), the method remains valid for fitting this new model.
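To see this concretely, here is a minimal sketch (on synthetic data, not the vehicle dataset) showing that the usual least-squares machinery applies unchanged once \(x^2\) is treated as just another column of the design matrix:

import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(0, 5, 50)
y = 3 - 2 * x + 0.5 * x**2 + rng.normal(scale=0.3, size=x.size)  # synthetic quadratic data

# Design matrix with columns [1, x, x^2]; the model is still linear in theta
X = np.column_stack([np.ones_like(x), x, x**2])

# Ordinary least squares works exactly as before
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_hat)  # should be close to [3, -2, 0.5]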

If we refit the model with "hp" squared as its own feature, we see that the model follows the data much more closely.

\[\hat{\text{mpg}} = \theta_0 + \theta_1 (\text{hp}) + \theta_2 (\text{hp})^2\]

[Figure degree_2: quadratic fit of mpg vs. hp]

Looks much better! What we’ve done here is called feature engineering: the process of transforming the raw features of a dataset into more informative features for modeling. By squaring the "hp" feature, we were able to create a new feature that significantly improved the quality of our model.
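As a rough sketch of how this refit might look in code (reusing the vehicles DataFrame loaded above; this is an illustration, not the exact code used to produce the figure):

from sklearn.linear_model import LinearRegression

# Engineer the squared feature, then fit an ordinary linear model on [hp, hp^2]
X = vehicles[["hp"]].copy()
X["hp^2"] = X["hp"] ** 2
y = vehicles["mpg"]

quad_model = LinearRegression()
quad_model.fit(X, y)
print(quad_model.intercept_, quad_model.coef_)  # theta_0, then (theta_1, theta_2)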

We perform feature engineering by defining a feature function. A feature function is some function applied to the original variables in the data to generate one or more new features. More formally, a feature function is said to take a \(d\) dimensional input and transform it to a \(p\) dimensional input. This results in a new, feature-engineered design matrix that we rename \(\Phi\).

\[\mathbb{X} \in \mathbb{R}^{n \times d} \longrightarrow \Phi \in \mathbb{R}^{n \times p}\]

In the vehicles example above, we applied a feature function to transform the original input with \(d=1\) features into an engineered design matrix with \(p=2\) features.
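One way to see this dimension bookkeeping in code is with sklearn’s PolynomialFeatures transformer (used again later in this note); the choice of the "hp" and "weight" columns below is just for illustration:

from sklearn.preprocessing import PolynomialFeatures

X = vehicles[["hp", "weight"]]              # n x d design matrix with d = 2
poly = PolynomialFeatures(degree=2, include_bias=False)
Phi = poly.fit_transform(X)                 # n x p engineered design matrix

print(X.shape, "->", Phi.shape)             # (n, 2) -> (n, 5)
print(poly.get_feature_names_out())         # the p = 5 features: the originals, their squares, and their product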

14.3 One Hot Encoding

Feature engineering opens up a whole new set of possibilities for designing better performing models. As you will see in lab and homework, feature engineering is one of the most important parts of the entire modeling process.

A particularly powerful use of feature engineering is to allow us to perform regression on non-numeric features. One hot encoding is a feature engineering technique that generates numeric features from categorical data, allowing us to use our usual methods to fit a regression model on the data.

To illustrate how this works, we’ll refer back to the tips data from last lecture. Consider the "day" column of the dataset:

Code
import numpy as np
np.random.seed(1337)
tips_df = sns.load_dataset("tips").sample(100)
tips_df[["day"]].head(5)
day
54 Sun
46 Sun
86 Thur
199 Thur
106 Sat

At first glance, it doesn’t seem possible to fit a regression model to this data – we can’t directly perform any mathematical operations on the entry “Thur”.

To resolve this, we instead create a new table with a feature for each unique value in the original "day" column. We then iterate through the "day" column. For each entry in "day" we fill the corresponding feature in the new table with 1. All other features are set to 0.

[Figure ohe: one hot encoding the "day" column into four indicator features]

This can be implemented in code using sklearn’s OneHotEncoder() to generate the one hot encoding, then joining these new features back onto the original DataFrame.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Perform the one hot encoding
oh_enc = OneHotEncoder()
oh_enc.fit(tips_df[['day']])
ohe_data = oh_enc.transform(tips_df[['day']]).toarray()


# Combine with original features
data_w_ohe = (tips_df
              .join(
                  pd.DataFrame(ohe_data, columns=oh_enc.get_feature_names_out(), index=tips_df.index))
             )
data_w_ohe
total_bill tip sex smoker day time size day_Fri day_Sat day_Sun day_Thur
54 25.56 4.34 Male No Sun Dinner 4 0.0 0.0 1.0 0.0
46 22.23 5.00 Male No Sun Dinner 2 0.0 0.0 1.0 0.0
86 13.03 2.00 Male No Thur Lunch 2 0.0 0.0 0.0 1.0
199 13.51 2.00 Male Yes Thur Lunch 2 0.0 0.0 0.0 1.0
106 20.49 4.06 Male Yes Sat Dinner 2 0.0 1.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ...
44 30.40 5.60 Male No Sun Dinner 4 0.0 0.0 1.0 0.0
221 13.42 3.48 Female Yes Fri Lunch 2 1.0 0.0 0.0 0.0
59 48.27 6.73 Male No Sat Dinner 4 0.0 1.0 0.0 0.0
100 11.35 2.50 Female Yes Fri Dinner 2 1.0 0.0 0.0 0.0
127 14.52 2.00 Female No Thur Lunch 2 0.0 0.0 0.0 1.0

100 rows × 11 columns

Now, the “day” feature (or rather, the four new boolean features that represent day) can be used to fit a model.
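For instance, here is a minimal sketch of how those encoded columns might enter a regression (predicting the tip from the bill total and the day indicators; the feature choice is just for illustration):

from sklearn.linear_model import LinearRegression

# Use the numeric bill total plus the one hot encoded day columns as features
day_cols = list(oh_enc.get_feature_names_out())   # the four indicator columns created above
X = data_w_ohe[["total_bill"] + day_cols]
y = data_w_ohe["tip"]

ohe_model = LinearRegression()
ohe_model.fit(X, y)
print(dict(zip(X.columns, ohe_model.coef_.round(3))))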

14.4 Higher-order Polynomial Example

Let’s return to where we started today: Creating higher-order polynomial features for the mpg dataset.

What happens if we add a feature corresponding to horsepower cubed? Or horsepower to the fourth power? The fifth power?

  • Will we get better results?
  • What will the model look like?

Let’s try it out. The code below plots polynomial models fit to the mpg dataset, from order 0 (the constant model) to order 5 (polynomial features up through horsepower to the fifth power).

Code
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt


vehicle_data = sns.load_dataset("mpg")
vehicle_data = vehicle_data.rename(columns = {"horsepower": "hp"})
vehicle_data = vehicle_data.dropna()


def get_MSE_for_degree_k_model(k):
    pipelined_model = Pipeline([
        ('poly_transform', PolynomialFeatures(degree = k)),
        ('regression', LinearRegression(fit_intercept = True))    
    ])
    pipelined_model.fit(vehicle_data[["hp"]], vehicle_data["mpg"])
    return mean_squared_error(pipelined_model.predict(vehicle_data[["hp"]]), vehicle_data["mpg"])

ks = np.array(range(0, 7))
MSEs = [get_MSE_for_degree_k_model(k) for k in ks]
MSEs_and_k = pd.DataFrame({"k": ks, "MSE": MSEs})
MSEs_and_k.set_index("k")

def plot_degree_k_model(k, MSEs_and_k, axs):
    pipelined_model = Pipeline([
        ('poly_transform', PolynomialFeatures(degree = k)),
        ('regression', LinearRegression(fit_intercept = True))    
    ])
    pipelined_model.fit(vehicle_data[["hp"]], vehicle_data["mpg"])
    
    row = k // 3
    col = k % 3
    ax = axs[row, col]
    
    sns.scatterplot(data=vehicle_data, x='hp', y='mpg', ax=ax)
    
    x_range = np.linspace(45, 210, 100).reshape(-1, 1)
    ax.plot(x_range, pipelined_model.predict(pd.DataFrame(x_range, columns=['hp'])), c='orange', linewidth=2)
    
    ax.set_ylim((0, 50))
    mse_str = f"MSE: {MSEs_and_k.loc[k, 'MSE']:.4}\norder: {k}"
    ax.text(150, 30, mse_str, dict(size=13))

fig = plt.figure(figsize=(10, 4))
axs = fig.subplots(nrows=2, ncols=3)

for k in range(6):
    plot_degree_k_model(k, MSEs_and_k, axs)
fig.tight_layout()

With the constant and linear models, there seems to be a clear pattern in the prediction errors. With a quadratic model (order 2), the fit matches our data much more consistently across different hp values. For higher order polynomials, we observe a small improvement in MSE, but not much beyond 18.98. The MSE will continue to decrease marginally as we add more and more terms to our model. However, there is no free lunch – this decrease in MSE comes at a major cost!

14.5 Variance and Training Error

We’ve seen now that feature engineering allows us to build all sorts of features to improve the performance of the model. In particular, we saw that designing a more complex feature (squaring "hp" in the vehicles data previously) substantially improved the model’s ability to capture non-linear relationships. To take full advantage of this, we might be inclined to design increasingly complex features. Consider the following three models, each of a different order (the maximum power of \(\text{hp}\) that appears in the model):

  • Model with order 1: \(\hat{\text{mpg}} = \theta_0 + \theta_1 (\text{hp})\)
  • Model with order 2: \(\hat{\text{mpg}} = \theta_0 + \theta_1 (\text{hp}) + \theta_2 (\text{hp})^2\)
  • Model with order 4: \(\hat{\text{mpg}} = \theta_0 + \theta_1 (\text{hp}) + \theta_2 (\text{hp})^2 + \theta_3 (\text{hp})^3 + \theta_4 (\text{hp})^4\)


[Figure degree_comparison: fitted models of order 1, 2, and 4 on the vehicle data]

When we use our model to make predictions on the same data that was used to fit the model, we find that the MSE decreases with increasingly complex models. The training error is the model’s error when generating predictions from the same data that was used for training purposes. We can conclude that the training error goes down as the complexity of the model increases.

[Figure train_error: training error decreases as model order increases]

This seems like good news – when working on the training data, we can improve model performance by designing increasingly complex models.

However, high model complexity comes with its own set of issues. When a model has many complicated features, it becomes increasingly sensitive to the data used to fit it. Even a small variation in the data points used to train the model may result in wildly different results for the fitted model. The plots below illustrate this idea. In each case, we’ve fit a model to two very similar sets of data (in fact, they only differ by two data points!). Notice that the model with order 2 appears roughly the same across the two sets of data; in contrast, the model with order 4 changes erratically across the two datasets.

[Figure model_variance: order 2 and order 4 models fit to two datasets that differ by only two points]

The sensitivity of the model to the data used to train it is called the model variance. As we saw above, model variance tends to increase with model complexity.
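A rough way to probe this sensitivity numerically is sketched below: we build two small training sets from the vehicle data that share most of their points and differ by two points each, then compare the predictions of the fitted models. (This is an illustrative setup, not the exact experiment behind the figures above.)

import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

def fit_and_predict(df, degree, hp_grid):
    """Fit a degree-`degree` polynomial model of mpg on hp, then predict on a fixed hp grid."""
    model = Pipeline([
        ("poly", PolynomialFeatures(degree=degree)),
        ("reg", LinearRegression()),
    ])
    model.fit(df[["hp"]], df["mpg"])
    return model.predict(hp_grid)

# Two small training sets that share 30 points and differ by two points each
base = vehicle_data.sample(30, random_state=100)
extra = vehicle_data.drop(base.index).sample(4, random_state=100)
sample_a = pd.concat([base, extra.iloc[:2]])
sample_b = pd.concat([base, extra.iloc[2:]])

# Compare the two fitted models' predictions on the same hp values
hp_grid = pd.DataFrame({"hp": np.linspace(50, 200, 25)})
for degree in [2, 4]:
    gap = np.abs(fit_and_predict(sample_a, degree, hp_grid)
                 - fit_and_predict(sample_b, degree, hp_grid))
    print(f"order {degree}: largest prediction gap between the two fits = {gap.max():.2f}")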

[Figure bvt: model variance increases with model complexity]

We will further explore this tradeoff (and more precisely define model variance) in future lectures.

14.6 Overfitting

We can see that there is a clear “trade-off” that comes from the complexity of our model. As model complexity increases, the model’s error on the training data decreases. At the same time, the model’s variance tends to increase.

Why does this matter? To answer this question, let’s take a moment to review our modeling workflow when making predictions on new data.

  1. Sample a dataset of training data from the real world
  2. Use this training data to fit a model
  3. Apply this fitted model to generate predictions on unseen data

This first step – sampling training data – is important to remember in our analysis. As we saw above, a highly complex model may produce results that vary wildly across different samples of training data. If we happen to sample a set of training data that is a poor representation of the population we are trying to model, our model may perform poorly on any new set of data it has not seen before.

To see why, consider a model fit using the training data shown on the left. Because the model is so complex, it achieves zero error on the training set – it perfectly predicts each value in the training data! When we go to use this model to make predictions on a new sample of data, however, things aren’t so good. The model now has enormous error on the unseen data.

[Figure overfit: a complex model with zero training error makes poor predictions on a new sample of data]

The phenomenon above is called overfitting. The model effectively just memorized the training data it encountered when it was fitted, leaving it unable to handle new situations.
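One common way to surface this behavior in code is to compare a model’s error on the data it was fit to against its error on held-out data. The sketch below assumes a simple random train/test split (a tool these notes have not formally introduced yet):

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Hold out a portion of the vehicle data that the model never sees while fitting
train, test = train_test_split(vehicle_data, test_size=0.3, random_state=100)

for degree in [1, 2, 6]:
    model = Pipeline([
        ("poly", PolynomialFeatures(degree=degree)),
        ("reg", LinearRegression()),
    ])
    model.fit(train[["hp"]], train["mpg"])
    train_mse = mean_squared_error(train["mpg"], model.predict(train[["hp"]]))
    test_mse = mean_squared_error(test["mpg"], model.predict(test[["hp"]]))
    print(f"order {degree}: training MSE = {train_mse:.2f}, held-out MSE = {test_mse:.2f}")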

The takeaway here: we need to strike a balance in the complexity of our models. A model that is too simple won’t be able to capture the key relationships between our variables of interest; a model that is too complex runs the risk of overfitting.

This raises the question: how do we control the complexity of a model? Stay tuned for the next lecture.