One of the key challenges with feature engineering is that you can "over-engineer" your features and produce a model that fits the training data well but performs poorly when making predictions on new data. This is typically referred to as overfitting to your data and is the focus of the next set of lectures.
In this notebook, we provide a very simple illustration of overfitting. As you will see, and soon experience yourself, it is very easy to overfit to your data, and avoiding this will become the key challenge in designing good models.
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
For this problem we will use a very simple toy dataset to help illustrate where things can fail.
Notice that there are only 8 data points in this dataset. Small datasets are especially prone to overfitting.
data = pd.read_csv("data/train.csv")
data
|   | X | Y |
|---|---|---|
| 0 | -2.647582 | 9.140452 |
| 1 | -2.051547 | 5.336237 |
| 2 | -1.810665 | 7.195181 |
| 3 | -1.312076 | 6.095358 |
| 4 | -0.789591 | 0.721947 |
| 5 | -0.660964 | 2.177008 |
| 6 | 0.806148 | 4.367994 |
| 7 | 1.054880 | 5.852682 |
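To make the small-data concern concrete, here is a minimal sketch (not part of the original notebook) showing why 8 points are so easy to overfit: a degree-7 polynomial has exactly 8 coefficients, so it can interpolate all 8 points and drive the training error to essentially zero. The `x` and `y` arrays below are simply the values from the table above, hard-coded for illustration.

# The 8 (X, Y) pairs from the table above, hard-coded for this sketch.
x = np.array([-2.647582, -2.051547, -1.810665, -1.312076,
              -0.789591, -0.660964,  0.806148,  1.054880])
y = np.array([9.140452, 5.336237, 7.195181, 6.095358,
              0.721947, 2.177008, 4.367994, 5.852682])

# A degree-7 polynomial has 8 coefficients, so it can pass through
# all 8 points exactly: the training MSE collapses to (numerically) zero.
coeffs = np.polyfit(x, y, deg=7)
train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
print(f"degree-7 training MSE: {train_mse:.2e}")

A near-zero training error here says nothing about how the curve behaves between or beyond the 8 points, which is exactly the trap the rest of this notebook explores.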
colors = px.colors.qualitative.Plotly
# Plot the raw data as a scatter of (X, Y) points.
data_scatter = go.Scatter(x=data["X"], y=data["Y"], name="data", mode="markers",
                          marker=dict(color=colors[0]))
go.Figure([data_scatter])
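As a baseline for comparison, one natural next step (a sketch on my part, not necessarily the notebook's next cell) is to fit the already-imported `LinearRegression` to these 8 points, report its training MSE with the imported `mean_squared_error`, and overlay the prediction line on the scatter plot.

# Sketch: fit a plain least-squares line to the 8 points.
X_mat = data[["X"]].to_numpy()
model = LinearRegression().fit(X_mat, data["Y"])
print("train MSE:", mean_squared_error(data["Y"], model.predict(X_mat)))

# Evaluate the fitted line on a dense grid of X values for a smooth overlay.
x_grid = np.linspace(data["X"].min(), data["X"].max(), 100).reshape(-1, 1)
line_fit = go.Scatter(x=x_grid.flatten(), y=model.predict(x_grid),
                      name="linear fit", mode="lines",
                      line=dict(color=colors[1]))
go.Figure([data_scatter, line_fit])

A straight line will underfit a pattern like this, but its training MSE gives a reference point against which more heavily engineered (and potentially overfit) features can be judged.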