Lecture 14 – Feature Engineering
Presented by Joseph Gonzalez
Content by Joseph Gonzalez, John DeNero, Josh Hug
- video playlist
- code HTML: Part 1, Part 2, Part 3
- (supplementary) video and code HTML from a live lecture in Summer 2020 that reinforced some of the mathematical ideas in this lecture
Important: This lecture is a combination of two lectures from previous semesters (this is why the video titles don’t match our numbering). Read this before proceeding with the lectures, as it details which concepts you should focus on.
Sections 14.1 through 14.4 discuss the core techniques of feature engineering. Slides are linked above, and code is in “Part 1” and “Part 2”.
- 14.1: Throughout this lecture, Radial Basis Functions are used as an example. For our purposes, they are purely an example and are not in scope.
- 14.2, 14.3: Entirely in scope.
- 14.4: Of the three techniques discussed, one-hot encoding is most important, though the others are still in scope.
Sections 14.5 through 14.7 discuss pitfalls to be aware of in feature engineering. There are no accompanying slides; these ideas are primarily explained in the lecture notebook “Part 3”.
- 14.5: Focus on the numerical ideas here, not the syntax of model creation (though the code is linked above).
- 14.6: The focus of this video is the content at the end, where our design matrix has too many columns, not the details of Radial Basis Functions.
- 14.7: See the above comment.
The Quick Check for this lecture is due Monday, October 26th at 11:59PM. To get credit for this lecture’s Quick Checks, you will have to fill out all of the following Google Forms as well as the mid-semester survey, linked on the course website and on Piazza (as well as in one of the following forms).
A demonstration of how to use scikit-learn to fit linear models.
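As a rough illustration of the kind of workflow this video demonstrates, the sketch below fits a one-feature linear model with scikit-learn's `LinearRegression`. The data is synthetic and made up for this example (it is not the dataset from the lecture notebook), and the variable names are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): y = 3 + 2x with no noise.
x = np.arange(10, dtype=float)
X = x.reshape(-1, 1)   # scikit-learn expects a 2-D design matrix (rows = observations)
y = 3.0 + 2.0 * x

# fit_intercept=True is the default, so the constant term is learned automatically.
model = LinearRegression()
model.fit(X, y)

intercept = model.intercept_   # estimated constant term
slope = model.coef_[0]         # estimated weight on the single feature

# Predict for a new observation x = 10.
prediction = model.predict(np.array([[10.0]]))[0]
```

Because the synthetic data is exactly linear, the fitted intercept and slope should recover 3 and 2 up to floating-point error.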
Feature functions as a method of transforming existing numerical data and encoding non-numerical data for use in modeling.
Defining what it means for a model to be linear. The constant feature. More sophisticated numerical features.
Numerically encoding categorical data using various encodings (one-hot, bag of words, n-gram).
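To make the three encodings concrete, here is a minimal pure-Python sketch. The helper names `one_hot`, `bag_of_words`, and `n_grams` are hypothetical and written for this example only; the lecture notebooks may use library tools instead.

```python
def one_hot(values):
    """One column per distinct category; a 1 marks the category of each row."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


def bag_of_words(texts):
    """One column per vocabulary word; each entry counts occurrences in a text."""
    tokenized = [t.lower().split() for t in texts]
    vocab = sorted({w for words in tokenized for w in words})
    rows = [[words.count(w) for w in vocab] for words in tokenized]
    return vocab, rows


def n_grams(text, n=2):
    """Consecutive runs of n words, which keep some word-order information."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


categories, encoded = one_hot(["red", "blue", "red"])
vocab, counts = bag_of_words(["the cat sat", "the cat the"])
bigrams = n_grams("the cat sat", n=2)
```

Here `encoded` has one row per observation and one column per category, `counts` ignores word order entirely, and `bigrams` shows how n-grams recover some of that order at the cost of many more possible columns.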
Issues we may run into when our design matrix has redundant features.
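One common way redundancy arises is combining a constant column with a full one-hot encoding: the one-hot columns sum to the constant column. The sketch below uses a small made-up design matrix (an assumption for illustration, not taken from the lecture) to show that the matrix is then rank-deficient and that different parameter vectors give identical predictions.

```python
import numpy as np

# Hypothetical data: 5 observations of a 3-level categorical variable, one-hot encoded.
one_hot = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
], dtype=float)
bias = np.ones((5, 1))

# Constant column + all three one-hot columns: the one-hot columns sum to the bias.
X = np.hstack([bias, one_hot])
rank = np.linalg.matrix_rank(X)   # 3, although X has 4 columns

# Two different parameter vectors that produce exactly the same predictions,
# so the least-squares solution is not unique.
theta_a = np.array([0.0, 1.0, 2.0, 3.0])
theta_b = np.array([1.0, 0.0, 1.0, 2.0])
same_predictions = np.allclose(X @ theta_a, X @ theta_b)

# One possible fix: drop one of the one-hot columns to restore full column rank.
X_fixed = X[:, :-1]
rank_fixed = np.linalg.matrix_rank(X_fixed)   # 3, equal to the number of columns
```

Dropping a column is only one option; omitting the constant column instead would also remove the linear dependence.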
Issues we may run into when our design matrix has more features than observations. Radial basis functions.
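The sketch below illustrates the "more features than observations" problem with random numbers rather than radial basis features (a simplifying assumption for this example). Even though the response is pure noise, least squares fits the training data perfectly, and the fitted parameters are not unique.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 10                   # fewer observations than features
X = rng.normal(size=(n, p))    # made-up features
y = rng.normal(size=n)         # made-up response, unrelated to X

# np.linalg.lstsq returns the minimum-norm least-squares solution.
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
train_mse = np.mean((X @ theta - y) ** 2)   # essentially zero

# Any vector in the null space of X can be added without changing the fit.
# With full_matrices=True (the default), the last rows of Vt span that null space.
_, _, Vt = np.linalg.svd(X)
theta_other = theta + Vt[-1]
also_fits = np.allclose(X @ theta_other, y)
```

A training error of zero on noise is a sign that the model is memorizing the training data, not learning a pattern that would carry over to new observations.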
Overfitting our model to the data we used to train it leads to poor generalization to unseen data, even though generalizing well to unseen data is the goal of modeling.