Lecture 14 – Feature Engineering
by Joseph Gonzalez (Spring 2020)
Important: This lecture is a combination of two lectures from Spring 2020 (this is why the video titles don’t match our numbering). Read this before proceeding with the lectures, as it details which concepts you should focus on.
Sections 14.1 through 14.4 discuss the core techniques of feature engineering. Slides are linked above, and code is in “Part 1” and “Part 2”.
- 14.1: Throughout this lecture, Radial Basis Functions are used as an example. For our purposes, they are purely an example and are not in scope.
- 14.2, 14.3: Entirely in scope.
- 14.4: Of the three techniques discussed, one-hot encoding is most important, though the others are still in scope.
Sections 14.5 through 14.7 discuss pitfalls to be aware of in feature engineering. There are no accompanying slides; these ideas are primarily explained in the lecture notebook “Part 3”.
- 14.5: Focus on the numerical ideas here, not the syntax of model creation (though the code is linked above).
- 14.6: The focus of this video is the content at the end, where our design matrix has too many columns, not the details of Radial Basis Functions.
- 14.7: See the comment for 14.6.
14.1: A demonstration of how to use scikit-learn to fit linear models.
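A minimal sketch of that workflow, using a tiny made-up dataset (the data and variable names here are illustrative, not taken from the lecture):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny illustrative dataset: y is exactly 2x, with no noise.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

model = LinearRegression()
model.fit(X, y)  # least-squares fit of slope and intercept

print(model.coef_)             # slope, close to [2.]
print(model.intercept_)        # intercept, close to 0.
print(model.predict([[5.0]]))  # prediction close to [10.]
```

The `.fit` / `.predict` pattern is the same for the other scikit-learn models used in this course.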
14.2: Feature functions as a method of transforming existing numerical data and of encoding non-numerical data for use in modeling.
14.3: Defining what it means for a model to be linear. The constant feature. More sophisticated numerical features.
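One way to picture these ideas: a model is linear in its *parameters*, so nonlinear feature functions (plus a constant feature for the intercept) still give a linear model. A small numpy sketch under that framing (the data here is synthetic):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])

# Design matrix: a constant feature, x itself, and a nonlinear feature x^2.
# The model theta0*1 + theta1*x + theta2*x^2 is still linear in theta.
Phi = np.column_stack([np.ones_like(x), x, x ** 2])

# Data generated from known parameters [3.0, 2.0, 0.5].
y = 3.0 + 2.0 * x + 0.5 * x ** 2

theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(theta)  # recovers approximately [3.0, 2.0, 0.5]
```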
14.4: Numerically encoding categorical data using various encodings (one-hot, bag of words, n-gram).
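Plain-Python sketches of all three encodings (the function names and examples are ours, not the lecture's):

```python
def one_hot(value, categories):
    """Map a categorical value to a 0/1 vector with one slot per category."""
    return [1 if value == c else 0 for c in categories]

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word appears in the text."""
    words = text.lower().split()
    return [words.count(w) for w in vocabulary]

def ngrams(text, n=2):
    """All runs of n consecutive words in the text."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

print(one_hot("cat", ["cat", "dog", "bird"]))                      # [1, 0, 0]
print(bag_of_words("the cat saw the dog", ["the", "cat", "dog"]))  # [2, 1, 1]
print(ngrams("the cat saw the dog"))  # [('the', 'cat'), ('cat', 'saw'), ...]
```

In practice these encodings produce the extra columns of the design matrix; libraries like scikit-learn provide production versions of them.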
14.5: Issues we may run into when our design matrix has redundant features.
14.6: Issues we may run into when our design matrix has more features than observations. Radial basis functions.
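The too-many-columns situation can be reproduced directly with random data (a synthetic sketch, not the lecture's RBF example): with more features than observations, we can fit the training data perfectly, which is exactly why such fits are suspect.

```python
import numpy as np

rng = np.random.default_rng(0)

# 3 observations but 6 features: more columns than rows.
Phi = rng.standard_normal((3, 6))
y = rng.standard_normal(3)

theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(Phi @ theta - y)  # residuals are (numerically) zero: a "perfect" fit

# Many other theta values fit equally well, so this perfect fit tells us
# little about which features actually matter.
```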
14.7: Overfitting our model to the data we used to train it leads to poor generalization to unseen data, even though generalization to unseen data is the goal of modeling.
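This failure mode can be reproduced in a few lines of numpy (the data here is synthetic and illustrative): a degree-9 polynomial interpolates 10 noisy training points almost exactly, yet does far worse on fresh points from the same underlying trend.

```python
import numpy as np

rng = np.random.default_rng(42)

# Noisy samples from a simple underlying trend.
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + 0.3 * rng.standard_normal(10)
x_test = np.linspace(0.05, 0.95, 10)
y_test = np.sin(2 * np.pi * x_test) + 0.3 * rng.standard_normal(10)

# A degree-9 polynomial has enough parameters to memorize all 10 points.
coeffs = np.polyfit(x_train, y_train, deg=9)
train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)

print(train_mse)  # essentially zero
print(test_mse)   # much larger: the model fit the noise, not the trend
```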