Lecture 11 – Ordinary Least Squares

by Lisa Yan

Notebook credits:

  • Suraj Rampure
  • Ani Adhikari
  • Data 8's textbook, chapter 15
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import plotly.express as px
import plotly.graph_objs as go

Multiple Linear Regression

Let's load in a new dataset. This is aggregate per-player data from the 2018-19 NBA season, downloaded from Kaggle.

In [2]:
nba = pd.read_csv('nba18-19.csv', index_col=0)
nba.index.name = None # drops name of index (players are ordered by rank)
In [3]:
nba
Out[3]:
Player Pos Age Tm G GS MP FG FGA FG% ... FT% ORB DRB TRB AST STL BLK TOV PF PTS
1 Álex Abrines\abrinal01 SG 25 OKC 31 2 19.0 1.8 5.1 0.357 ... 0.923 0.2 1.4 1.5 0.6 0.5 0.2 0.5 1.7 5.3
2 Quincy Acy\acyqu01 PF 28 PHO 10 0 12.3 0.4 1.8 0.222 ... 0.700 0.3 2.2 2.5 0.8 0.1 0.4 0.4 2.4 1.7
3 Jaylen Adams\adamsja01 PG 22 ATL 34 1 12.6 1.1 3.2 0.345 ... 0.778 0.3 1.4 1.8 1.9 0.4 0.1 0.8 1.3 3.2
4 Steven Adams\adamsst01 C 25 OKC 80 80 33.4 6.0 10.1 0.595 ... 0.500 4.9 4.6 9.5 1.6 1.5 1.0 1.7 2.6 13.9
5 Bam Adebayo\adebaba01 C 21 MIA 82 28 23.3 3.4 5.9 0.576 ... 0.735 2.0 5.3 7.3 2.2 0.9 0.8 1.5 2.5 8.9
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
528 Tyler Zeller\zellety01 C 29 MEM 4 1 20.5 4.0 7.0 0.571 ... 0.778 2.3 2.3 4.5 0.8 0.3 0.8 1.0 4.0 11.5
529 Ante Žižić\zizican01 C 22 CLE 59 25 18.3 3.1 5.6 0.553 ... 0.705 1.8 3.6 5.4 0.9 0.2 0.4 1.0 1.9 7.8
530 Ivica Zubac\zubaciv01 C 21 TOT 59 37 17.6 3.6 6.4 0.559 ... 0.802 1.9 4.2 6.1 1.1 0.2 0.9 1.2 2.3 8.9
530 Ivica Zubac\zubaciv01 C 21 LAL 33 12 15.6 3.4 5.8 0.580 ... 0.864 1.6 3.3 4.9 0.8 0.1 0.8 1.0 2.2 8.5
530 Ivica Zubac\zubaciv01 C 21 LAC 26 25 20.2 3.8 7.2 0.538 ... 0.733 2.3 5.3 7.7 1.5 0.4 0.9 1.4 2.5 9.4

708 rows × 29 columns



We are interested in predicting the number of points (PTS) an athlete scored per game this season.

Suppose we want to fit a linear model using some characteristics, or features, of a player. Specifically, we'll focus on field goals, assists, and 3-point attempts:

  • FG, the number of field goals made per game (this counts both 2-point and 3-point field goals)
  • AST, the average number of assists per game, and
  • 3PA, the number of 3-point field goals attempted per game

This is how we express that model:

$$ \hat{y} = \theta_0 + \theta_1 \cdot \textbf{FG} + \theta_2 \cdot \textbf{AST} + \theta_3 \cdot \textbf{3PA} $$
In [4]:
nba[['FG', 'AST', '3PA', 'PTS']]
Out[4]:
FG AST 3PA PTS
1 1.8 0.6 4.1 5.3
2 0.4 0.8 1.5 1.7
3 1.1 1.9 2.2 3.2
4 6.0 1.6 0.0 13.9
5 3.4 2.2 0.2 8.9
... ... ... ... ...
528 4.0 0.8 0.0 11.5
529 3.1 0.9 0.0 7.8
530 3.6 1.1 0.0 8.9
530 3.4 0.8 0.0 8.5
530 3.8 1.5 0.0 9.4

708 rows × 4 columns



In lecture, we saw that the predictions for the entire data set $\hat{\mathbb{Y}}$ can be computed as:

$$ \Large \hat{\mathbb{Y}} = \mathbb{X} \theta $$

The design matrix $\mathbb{X} \in \mathbb{R}^{n \times (d+1)}$ has $n$ rows, one for each record in the dataset, and $d+1$ columns: the $d$ feature columns plus an additional column of all 1s (the bias/intercept feature).

Let's build this design matrix using pandas:

In [5]:
X = nba[['FG', 'AST', '3PA']]
X.insert(0, 'Bias', 1)
X = X.to_numpy()
X
Out[5]:
array([[1. , 1.8, 0.6, 4.1],
       [1. , 0.4, 0.8, 1.5],
       [1. , 1.1, 1.9, 2.2],
       ...,
       [1. , 3.6, 1.1, 0. ],
       [1. , 3.4, 0.8, 0. ],
       [1. , 3.8, 1.5, 0. ]])
In [6]:
X.shape
Out[6]:
(708, 4)
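
Equivalently, you could build the same design matrix with plain NumPy rather than pandas. A minimal sketch, assuming the same nba DataFrame (np.ones and np.hstack are the only new pieces):

In [ ]:
# Same design matrix, built by stacking a column of 1s next to the feature columns.
features = nba[['FG', 'AST', '3PA']].to_numpy()
X_alt = np.hstack([np.ones((len(features), 1)), features])
X_alt.shape  # (708, 4), identical to X above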

While we're at it, let's build the Y vector of our true PTS values.

In [7]:
# for nba data
Y = nba[["PTS"]].to_numpy()
n = len(Y)
print("number datapoints", n)
number datapoints 708

Example prediction

Suppose we decide to pick an arbitrary parameter $\theta$:

$$\theta = \begin{bmatrix}0.50 \\ -1.14 \\ 0.65 \\ 1.52 \end{bmatrix}$$

(For those interested: these values were drawn at random from a standard normal distribution using np.random.randn.)

In [8]:
theta_arbitrary = np.array([[0.5], [-1.14], [0.65], [1.52]])
theta_arbitrary
Out[8]:
array([[ 0.5 ],
       [-1.14],
       [ 0.65],
       [ 1.52]])

For this value of $\theta$, we can make predictions with our model using matrix multiplication.

The @ symbol is the matrix multiplication operator and is equivalent to writing X.dot(theta_arbitrary).

In [9]:
X @ theta_arbitrary
[[ 5.0700e+00]
 [ 2.8440e+00]
 [ 3.8250e+00]
 [-5.3000e+00]
 [-1.6420e+00]
 ...
 [-2.4490e+00]
 [-2.8890e+00]
 [-2.8560e+00]
 [-2.8570e+00]]
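
As a quick check that the two spellings really do agree, a minimal sketch:

In [ ]:
# The infix @ operator and the .dot method give the same predictions.
np.allclose(X @ theta_arbitrary, X.dot(theta_arbitrary))  # True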

Computing MSE

For Ordinary Least Squares, the average loss is MSE:

$$ \Large R(\theta) = \frac{1}{n} || \mathbb{Y} - \mathbb{X}\theta||^2_2 $$

NumPy has a handy function, np.linalg.norm, that computes the norm of an array (for a vector, the default is the L2 norm).

In [10]:
def mse_nba(theta):
    """
    Y is PTS
    X is intercept, FG, AST, 3PA
    """
    return (1/n) * (np.linalg.norm(Y - X @ theta) ** 2)

mse_nba(theta_arbitrary)
Out[10]:
76.52265605508474
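
As a sanity check, the same quantity can be computed without np.linalg.norm at all, since the squared L2 norm of the residual vector is just the sum of the squared residuals. A minimal sketch, reusing X, Y, and theta_arbitrary from above:

In [ ]:
# Equivalent MSE: the mean of the squared residuals.
residuals = Y - X @ theta_arbitrary
np.mean(residuals ** 2)  # should match mse_nba(theta_arbitrary)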

Is this good? Is this bad? Let's compute the optimal theta and compare!


Implementing Least Squares

From lecture, the Least Squares Estimate $\hat{\theta}$ is: $$ \Large \hat{\theta} = \left( \mathbb{X}^T \mathbb{X} \right)^{-1} \mathbb{X}^T \mathbb{Y} $$

While it is not the most numerically stable or efficient approach, we can compute $\hat{\theta}$ directly using matrix inversion. To do this, we import the inv function from NumPy's linear algebra library:

In [11]:
from numpy.linalg import inv
In [12]:
def least_squares_estimate(X, Y):
    return inv(X.T @ X) @ X.T @ Y

theta_hat = least_squares_estimate(X, Y)
theta_hat
Out[12]:
array([[-0.29253798],
       [ 2.51705703],
       [ 0.05075571],
       [ 0.31307653]])
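
As noted above, explicitly inverting $\mathbb{X}^T \mathbb{X}$ is not the most numerically stable route. In practice you would typically solve the normal equations, or hand the whole problem to a least-squares routine. A minimal sketch of both alternatives (they should agree with theta_hat up to floating-point error):

In [ ]:
# Option 1: solve the normal equations (X^T X) theta = X^T Y without forming an inverse.
theta_solve = np.linalg.solve(X.T @ X, X.T @ Y)

# Option 2: let NumPy's built-in least-squares solver handle it directly.
theta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

np.allclose(theta_solve, theta_hat), np.allclose(theta_lstsq, theta_hat)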
In [13]:
print("Arbitrary theta MSE: ", mse_nba(theta_arbitrary))
print("Optimal theta MSE:"  , mse_nba(theta_hat))
Arbitrary theta MSE:  76.52265605508474
Optimal theta MSE: 0.3963133329080335

Nice!!!


Making Least Squares Predictions

Now that we have estimated the model parameters $\hat{\theta}$, we can predict the points scored $\hat{\mathbb{Y}}$ for each of our players.

In [14]:
Y_hat = X @ theta_hat




Model Performance/Diagnosing the Model

In previous lectures, we plotted the residuals against our single input feature $x$.

For higher-dimensional regression models, we often plot the residuals against the fitted values $\hat{y}$ instead.

In [15]:
fig = px.scatter(x = Y_hat.flatten(), y = (Y - Y_hat).flatten(), opacity=0.2)
fig.add_trace(go.Scatter(x=[0, 30], y=[0,0], name="Y_hat=Y"))
fig.update_xaxes(title="Y_hat")
fig.update_yaxes(title="Y_hat - Y")
fig

Overall, while the residuals are roughly centered around 0 (i.e., on average $\hat{y} \approx y$), we see heteroskedasticity: the spread of the residuals is uneven, growing as the predicted values get larger.

Multiple $R^2$

Let's compute the coefficient of determination, or Multiple $R^2$, for our model.

In [16]:
r2_ast_fg_3pa = np.var(Y_hat) / np.var(Y)
r2_ast_fg_3pa
Out[16]:
0.9883162128703274
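
For reference, because this is OLS with an intercept term, the ratio of variances above coincides with the perhaps more familiar definition $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$. A quick check (sketch):

In [ ]:
# Equivalent R^2 via sums of squares (the equivalence holds because the model has an intercept).
ss_res = np.sum((Y - Y_hat) ** 2)
ss_tot = np.sum((Y - np.mean(Y)) ** 2)
1 - ss_res / ss_tot  # should match np.var(Y_hat) / np.var(Y)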

That's super high!!! Wait, what's up with that?

(Hint: Basketball facts)

Let's try reducing the number of features to see how this Multiple $R^2$ metric changes.

In [17]:
# use intercept, ast, 3pa
X_3d = nba[['AST', '3PA']]
X_3d.insert(0, 'Bias', 1)
X_3d = X_3d.to_numpy()

theta_ast_3pa = least_squares_estimate(X_3d, Y)
Y_hat_ast_3pa = X_3d @ theta_ast_3pa

r2_ast_3pa = np.var(Y_hat_ast_3pa) / np.var(Y)
r2_ast_3pa
Out[17]:
0.608786276366571
In [18]:
# use intercept, ast only (SLR)
X_slr = nba[['AST']]
X_slr.insert(0, 'Bias', 1)
X_slr = X_slr.to_numpy()

theta_ast_only = least_squares_estimate(X_slr, Y)
Y_hat_ast_only = X_slr @ theta_ast_only

r2_ast_only = np.var(Y_hat_ast_only) / np.var(Y)
r2_ast_only
Out[18]:
0.4570055507968593

Comparing these Multiple $R^2$ together:

In [19]:
print("(SLR) interecept, AST:   ", r2_ast_only)
print("intercept, 3PA, AST:     ", r2_ast_3pa)
print("intercept, FG, 3PA, AST: ", r2_ast_fg_3pa)
(SLR) interecept, AST:    0.4570055507968593
intercept, 3PA, AST:      0.608786276366571
intercept, FG, 3PA, AST:  0.9883162128703274

Because of how basketball is scored, knowing a player's field goals made (FG) and 3-point attempts (3PA) tells you almost exactly how many total points they scored per game; the remaining pieces are free throws and the rate at which those 3-point attempts were actually converted.
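
Concretely, a back-of-the-envelope decomposition (recalling that FG counts all made field goals, both 2-point and 3-point):

$$ \textbf{PTS} = 2 \cdot \textbf{2P} + 3 \cdot \textbf{3P} + \textbf{FT} = 2 \cdot \textbf{FG} + \textbf{3P} + \textbf{FT} $$

so once the model knows FG, the 3PA feature only has to account for the extra point earned on each made 3-pointer (note $\hat{\theta}_3 \approx 0.31$, on the order of a typical 3-point conversion rate), and free throws are essentially all that remains unexplained.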

Side note: if you want to check that Multiple $R^2$ for Simple Linear Regression is indeed the squared correlation coefficient $r^2$:

In [20]:
r = np.corrcoef(nba['AST'], nba['PTS'])[0,1]
r ** 2
Out[20]:
0.4570055507968595