Missing Values, Categorical Features, and Text

In this notebook, we discuss:

  1. how to deal with missing values,
  2. how to encode categorical features, and
  3. how to encode text features.

In the process, we will work through feature engineering to construct a model that predicts vehicle efficiency.

The Data

For this notebook, we will use the seaborn mpg data set which describes the fuel mileage (measured in miles per gallon or mpg) of various cars along with characteristics of those cars. Our goal will be to build a model that can predict the fuel mileage of a car based on the characteristics of that car.
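In the notebook the data is loaded with seaborn's load_dataset. To keep this sketch self-contained (load_dataset fetches a CSV over the network), the example below builds a tiny stand-in frame with the same columns; the rows are made up for illustration, not real records.

```python
import pandas as pd

# Real loading in the notebook:
#   import seaborn as sns
#   mpg = sns.load_dataset("mpg")
# Stand-in frame with the same columns (rows are illustrative only).
mpg = pd.DataFrame({
    "mpg":          [18.0, 15.0, 24.0],
    "cylinders":    [8, 8, 4],
    "displacement": [307.0, 350.0, 113.0],
    "horsepower":   [130.0, 165.0, None],   # note the missing value
    "weight":       [3504, 3693, 2372],
    "acceleration": [12.0, 11.5, 15.0],
    "model_year":   [70, 70, 70],
    "origin":       ["usa", "usa", "japan"],
    "name":         ["car a", "car b", "car c"],
})
print(mpg.dtypes)
```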

Quantitative Continuous Features

This data set has several quantitative continuous features that we can use to build our first model. However, even for quantitative continuous features, we may want to do some additional feature engineering. Things to consider are:

  1. transforming features with non-linear functions (log, exp, sine, polynomials)
  2. constructing products or ratios of features
  3. dealing with missing values
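The first two ideas can be sketched as follows; the specific transformations (log of weight, displacement per cylinder) are hypothetical examples, not the notebook's actual features.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the mpg data.
df = pd.DataFrame({
    "weight":       [3504.0, 2372.0, 4100.0],
    "displacement": [307.0, 113.0, 455.0],
    "cylinders":    [8, 4, 8],
})

# 1. Non-linear transform: vehicle weights span a wide range,
#    so a log compresses the scale.
df["log_weight"] = np.log(df["weight"])

# 2. Ratio of features: displacement per cylinder.
df["disp_per_cyl"] = df["displacement"] / df["cylinders"]

print(df[["log_weight", "disp_per_cyl"]])
```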

Missing Values

We can use the Pandas DataFrame.isna function to find rows with missing values:
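A minimal sketch (assuming the frame is named mpg): isna returns a Boolean frame, and any(axis=1) collapses it to a per-row mask.

```python
import pandas as pd

mpg = pd.DataFrame({
    "horsepower": [130.0, None, 165.0],
    "weight":     [3504, 2372, 3693],
})

# True for rows with at least one missing value
has_missing = mpg.isna().any(axis=1)
print(mpg[has_missing])
```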

There are many ways to deal with missing values. A common strategy is to substitute the mean of the observed values. Because the fact that a value is missing can itself carry useful signal, it is often also a good idea to add an indicator feature recording whether the value was missing.
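Both steps can be sketched like this (column names assumed; note the indicator must be computed before imputing, or the missingness information is lost):

```python
import pandas as pd

mpg = pd.DataFrame({"horsepower": [130.0, None, 165.0]})

# Record which values were missing *before* imputing --
# the missingness itself may be informative.
mpg["horsepower_missing"] = mpg["horsepower"].isna().astype(float)

# Substitute the mean of the observed values for the NaNs.
mpg["horsepower"] = mpg["horsepower"].fillna(mpg["horsepower"].mean())
print(mpg)
```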

Using our feature function, we can fit our first model to the transformed data:
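A sketch of fitting a first model: phi below is a hypothetical feature function, the rows are made up, and the fit uses ordinary least squares via NumPy (the notebook likely uses a library model such as scikit-learn's LinearRegression, which solves the same problem).

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the mpg frame.
df = pd.DataFrame({
    "horsepower": [130.0, 165.0, 95.0, 105.0],
    "weight":     [3504.0, 3693.0, 2372.0, 2800.0],
    "mpg":        [18.0, 15.0, 24.0, 21.0],
})

def phi(df):
    """Hypothetical feature function: raw features plus an intercept column."""
    X = df[["horsepower", "weight"]].to_numpy()
    return np.hstack([np.ones((len(X), 1)), X])

# Ordinary least squares fit.
X, y = phi(df), df["mpg"].to_numpy()
theta, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ theta
rmse = np.sqrt(np.mean((y - y_hat) ** 2))
print("RMSE:", rmse)
```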

Keeping Track of Progress

Because we are going to build multiple models with different feature functions, it is important to have a standard way to track and compare them.

The following function takes a model prediction function, the name of a model, and the dictionary of models that we have already constructed. It then evaluates the new model on the data and plots how the new model performs relative to the previous models as well as the $Y$ vs $\hat{Y}$ scatter plot.

In addition, it updates the dictionary of models to include the new model for future plotting.
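The tracker might look like the sketch below. All names are assumptions, and the plotting (the RMSE comparison and the $Y$ vs $\hat{Y}$ scatter) is omitted; only the scoring and dictionary update are shown.

```python
import numpy as np

def evaluate_model(predict, name, models, X, y):
    """Hypothetical tracker: score a prediction function and record it.

    The notebook's version also plots each model's error against the
    previous models and the Y vs Y-hat scatter; that is omitted here.
    """
    y_hat = predict(X)
    rmse = float(np.sqrt(np.mean((y - y_hat) ** 2)))
    models[name] = rmse          # update the running dictionary in place
    return rmse

# Usage with a trivial constant-prediction model on toy data.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 2.0, 2.0])
models = {}
evaluate_model(lambda X: np.full(len(X), 2.0), "constant", models, X, y)
print(models)
```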