Lecture 12 – Simple Linear Regression

by Suraj Rampure

Notebook credits:

Correlation

First, let's come up with some examples of data to use in slides. (Normally this wouldn't be put in the notebook, but it might be of interest to you.)

Also, note that we use np.corrcoef here to compute the correlation coefficients, since we haven't yet defined $r$ manually.
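For reference, a tiny standalone example of how np.corrcoef is typically used (the arrays here are made up and are not the lecture's data):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 1, 4, 3, 5])

# np.corrcoef returns the full 2x2 correlation matrix;
# the off-diagonal entry is the correlation between x and y.
r = np.corrcoef(x, y)[0, 1]
r
```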

Simple Linear Regression

First, let's implement the tools we'll need for regression.
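One way to implement a correlation function is via standard units; this is a sketch, and the notebook's actual implementation may differ:

```python
import numpy as np

def standard_units(x):
    """Convert an array to standard units (z-scores)."""
    return (x - np.mean(x)) / np.std(x)

def correlation(x, y):
    """Correlation coefficient r: the mean of the product of x and y,
    both measured in standard units."""
    return np.mean(standard_units(x) * standard_units(y))
```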

Let's read in our data.

An interesting issue is that both our parent and child columns only take on a fixed set of values, so many points would be plotted directly on top of one another. We need to add some random noise (jitter); otherwise, we'll suffer from gross overplotting.
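A sketch of what this jittering might look like, assuming the DataFrame is called `galton` and has `parent` and `child` columns (these names are assumptions):

```python
import numpy as np

# Add small uniform noise to each coordinate so that stacked points separate.
# The jitter width (0.3) is arbitrary; it only affects the plot, not the model.
galton['parent_jittered'] = galton['parent'] + np.random.uniform(-0.3, 0.3, len(galton))
galton['child_jittered'] = galton['child'] + np.random.uniform(-0.3, 0.3, len(galton))
```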

Using our correlation function:

Using an in-built correlation function:

All give the same result.

What we now want to do is compute the average $y$ for a given $x$. A practical way to do this is to "bin" our x axis into 1-unit wide buckets, and then compute the average $y$ value for everything in that bucket. (We could choose bins of any width, though.)
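A sketch of one way to do this binning with pandas (the DataFrame and column names are assumptions about the dataset):

```python
import numpy as np

# Round each parent height down to the nearest integer to form 1-unit-wide bins,
# then average the child heights within each bin.
avg_child_by_bin = (
    galton
    .assign(parent_bin=np.floor(galton['parent']))
    .groupby('parent_bin')['child']
    .mean()
)
```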

Now, let's look at our predictions:

Save for the tails, where there are fewer values to draw from, our red predictions roughly follow a straight line through the "middle" of our point cloud. That's our motivation for using a line to model this bivariate data.

Note: The cool thing about plotly is that you can hover over the points and it will tell you whether each one is a prediction or an actual value.

Now, it's time to implement the optimal coefficients.
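For the model $\hat{y} = \hat{a} + \hat{b}x$, the values that minimize mean squared error are

$$\hat{b} = r \cdot \frac{\sigma_y}{\sigma_x}, \qquad \hat{a} = \bar{y} - \hat{b}\bar{x}$$

A minimal sketch of these as functions, reusing the `correlation` helper from above (the notebook's actual code may differ):

```python
import numpy as np

def slope(x, y):
    """Slope of the least-squares line: r * (SD of y) / (SD of x)."""
    return correlation(x, y) * np.std(y) / np.std(x)

def intercept(x, y):
    """Intercept of the least-squares line: mean(y) - slope * mean(x)."""
    return np.mean(y) - slope(x, y) * np.mean(x)
```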

Let's see what our linear model looks like.

Visualizing Loss Surface

Let's look at what the loss surface for the above model looks like. Don't worry too much about what this code is doing.
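In case you're curious, a rough sketch of the idea behind that code: evaluate the MSE on a grid of candidate intercept/slope values and draw the resulting surface. Here `x` and `y` stand for the parent and child heights, and `ahat`, `bhat` for the fitted intercept and slope; all of these names are assumptions, and the lecture's actual plotting code may look quite different.

```python
import numpy as np
import plotly.graph_objects as go

# Grid of candidate intercepts (a) and slopes (b) around the fitted values.
a_grid = np.linspace(ahat - 5, ahat + 5, 50)
b_grid = np.linspace(bhat - 0.5, bhat + 0.5, 50)

# MSE for every (a, b) pair on the grid.
mse = np.array([
    [np.mean((y - (a + b * x)) ** 2) for a in a_grid]
    for b in b_grid
])

go.Figure(go.Surface(x=a_grid, y=b_grid, z=mse))
```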

As you can see, our choice of $\hat{a}, \hat{b}$ truly does minimize mean squared error: it lies at the minimum of the loss surface.

Multiple Linear Regression

Let's load in a new dataset. This is aggregate per-player data from the 2018-19 NBA season.

Let's suppose our goal is to predict the number of points someone averaged (PTS; this is our dependent variable). The independent variables we'll use are

First, let's explore and fit a model using just AST.
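A sketch of this simple fit, reusing the `slope` and `intercept` helpers from earlier (the DataFrame name `nba` and the column names are assumptions):

```python
# Fit predicted PTS = a + b * AST using the single-variable formulas.
b_ast = slope(nba['AST'], nba['PTS'])
a_ast = intercept(nba['AST'], nba['PTS'])

pred_ast = a_ast + b_ast * nba['AST']
```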

The correlation between AST and PTS is relatively strong. However, the scatter plot above tells us this isn't exactly the optimal setting in which to perform linear regression. For the purposes of illustration, we'll continue with it anyway.

Let's take a look at our prediction:

Our model does okay. Let's compute the RMSE (that is, the square root of the mean squared error; we take the square root so that the RMSE is in the same units as our $y$ values). We will use this as a baseline for when we add more independent variables.
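A minimal sketch of the RMSE computation:

```python
import numpy as np

def rmse(actual, pred):
    """Root mean squared error, in the same units as the y values."""
    return np.sqrt(np.mean((actual - pred) ** 2))
```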

There's still a ton of variation in our model. Let's see if we can do better by incorporating 3PA as well (that is, the average number of 3-point shots attempted per game).

Specifically, we're looking to create the model

$$\text{predicted PTS} = \theta_0 + \theta_1 \cdot \text{AST} + \theta_2 \cdot \text{3PA}$$

In order to do this, we're going to import a new library, called sklearn. Don't worry too much about what it's doing for now – we will dedicate an entire section of lecture to it two lectures from now.
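A sketch of what fitting this model with sklearn looks like, assuming the per-player DataFrame is called `nba` and has columns 'AST', '3PA', and 'PTS' (these names are assumptions):

```python
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(nba[['AST', '3PA']], nba['PTS'])

model.intercept_   # theta_0
model.coef_        # [theta_1, theta_2]
```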

The above outputs tell us that the parameters that minimize MSE for this model are

Meaning our predictions should be of the form

$$\text{predicted PTS} = 2.1563 + 1.6407 \cdot \text{AST} + 1.2576 \cdot \text{3PA}$$

Let's visualize what our model and predictions look like.

Instead of our model being a line, it is now a plane in 3D (the colorful surface above). The blue points above are the true PTS values.

It's sometimes hard to interpret things in 3D; we can also visualize in 2D.

The yellow dots are the result of our updated linear model. It doesn't look linear here, because it is not solely a function of assists per game. (It was linear in the 3D figure above.) The yellow points here all lie on the colorful plane above.

We can also scatter our predicted values vs. our actual values.
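A sketch of that scatter, reusing the fitted `model` from above (again, the names are assumptions):

```python
import plotly.express as px

# For a perfect model, every point would fall on the line y = x.
pred = model.predict(nba[['AST', '3PA']])
px.scatter(x=nba['PTS'], y=pred,
           labels={'x': 'actual PTS', 'y': 'predicted PTS'})
```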

Let's also look at our RMSE.

It's noticeably lower than before!

Multiple $R^2$

This means that our model that only uses AST can explain 45% of the variation in the true observations (PTS values), while our model that uses AST and 3PA can explain 60% of the variation.
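One way to compute multiple $R^2$ is as the proportion of the variance of the observed values that is captured by the fitted values; for a least-squares linear fit this is equivalent to $1 - \text{MSE}/\text{Var}(y)$. A sketch (variable names are assumptions):

```python
import numpy as np

def multiple_r2(actual, pred):
    """Proportion of variance in the observations explained by the predictions."""
    return np.var(pred) / np.var(actual)
```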