by Lisa Yan
Notebook credits:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import plotly.express as px
import plotly.graph_objs as go
Let's load in a new dataset. This is aggregate per-player data from the 2018-19 NBA season, downloaded from Kaggle.
nba = pd.read_csv('nba18-19.csv', index_col=0)
nba.index.name = None # drops name of index (players are ordered by rank)
nba
| | Player | Pos | Age | Tm | G | GS | MP | FG | FGA | FG% | ... | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Álex Abrines\abrinal01 | SG | 25 | OKC | 31 | 2 | 19.0 | 1.8 | 5.1 | 0.357 | ... | 0.923 | 0.2 | 1.4 | 1.5 | 0.6 | 0.5 | 0.2 | 0.5 | 1.7 | 5.3 |
2 | Quincy Acy\acyqu01 | PF | 28 | PHO | 10 | 0 | 12.3 | 0.4 | 1.8 | 0.222 | ... | 0.700 | 0.3 | 2.2 | 2.5 | 0.8 | 0.1 | 0.4 | 0.4 | 2.4 | 1.7 |
3 | Jaylen Adams\adamsja01 | PG | 22 | ATL | 34 | 1 | 12.6 | 1.1 | 3.2 | 0.345 | ... | 0.778 | 0.3 | 1.4 | 1.8 | 1.9 | 0.4 | 0.1 | 0.8 | 1.3 | 3.2 |
4 | Steven Adams\adamsst01 | C | 25 | OKC | 80 | 80 | 33.4 | 6.0 | 10.1 | 0.595 | ... | 0.500 | 4.9 | 4.6 | 9.5 | 1.6 | 1.5 | 1.0 | 1.7 | 2.6 | 13.9 |
5 | Bam Adebayo\adebaba01 | C | 21 | MIA | 82 | 28 | 23.3 | 3.4 | 5.9 | 0.576 | ... | 0.735 | 2.0 | 5.3 | 7.3 | 2.2 | 0.9 | 0.8 | 1.5 | 2.5 | 8.9 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
528 | Tyler Zeller\zellety01 | C | 29 | MEM | 4 | 1 | 20.5 | 4.0 | 7.0 | 0.571 | ... | 0.778 | 2.3 | 2.3 | 4.5 | 0.8 | 0.3 | 0.8 | 1.0 | 4.0 | 11.5 |
529 | Ante Žižić\zizican01 | C | 22 | CLE | 59 | 25 | 18.3 | 3.1 | 5.6 | 0.553 | ... | 0.705 | 1.8 | 3.6 | 5.4 | 0.9 | 0.2 | 0.4 | 1.0 | 1.9 | 7.8 |
530 | Ivica Zubac\zubaciv01 | C | 21 | TOT | 59 | 37 | 17.6 | 3.6 | 6.4 | 0.559 | ... | 0.802 | 1.9 | 4.2 | 6.1 | 1.1 | 0.2 | 0.9 | 1.2 | 2.3 | 8.9 |
530 | Ivica Zubac\zubaciv01 | C | 21 | LAL | 33 | 12 | 15.6 | 3.4 | 5.8 | 0.580 | ... | 0.864 | 1.6 | 3.3 | 4.9 | 0.8 | 0.1 | 0.8 | 1.0 | 2.2 | 8.5 |
530 | Ivica Zubac\zubaciv01 | C | 21 | LAC | 26 | 25 | 20.2 | 3.8 | 7.2 | 0.538 | ... | 0.733 | 2.3 | 5.3 | 7.7 | 1.5 | 0.4 | 0.9 | 1.4 | 2.5 | 9.4 |
708 rows × 29 columns
We are interested in predicting the number of points (PTS) an athlete will score per game this season.
Suppose we want to fit a linear model using some characteristics, or features, of a player. Specifically, we'll focus on field goals, assists, and 3-point attempts:

- FG, the number of field goals made per game (this counts both 2-point and 3-point baskets),
- AST, the average number of assists per game, and
- 3PA, the number of 3-point field goals attempted per game.

This is how we express that model:
$$ \hat{y} = \theta_0 + \theta_1 \cdot \textbf{FG} + \theta_2 \cdot \textbf{AST} + \theta_3 \cdot \textbf{3PA} $$

nba[['FG', 'AST', '3PA', 'PTS']]
| | FG | AST | 3PA | PTS |
|---|---|---|---|---|
1 | 1.8 | 0.6 | 4.1 | 5.3 |
2 | 0.4 | 0.8 | 1.5 | 1.7 |
3 | 1.1 | 1.9 | 2.2 | 3.2 |
4 | 6.0 | 1.6 | 0.0 | 13.9 |
5 | 3.4 | 2.2 | 0.2 | 8.9 |
... | ... | ... | ... | ... |
528 | 4.0 | 0.8 | 0.0 | 11.5 |
529 | 3.1 | 0.9 | 0.0 | 7.8 |
530 | 3.6 | 1.1 | 0.0 | 8.9 |
530 | 3.4 | 0.8 | 0.0 | 8.5 |
530 | 3.8 | 1.5 | 0.0 | 9.4 |
708 rows × 4 columns
In lecture, we saw that the predictions for the entire data set $\hat{\mathbb{Y}}$ can be computed as:

$$ \Large \hat{\mathbb{Y}} = \mathbb{X}\theta $$
The design matrix $\mathbb{X} \in \mathbb{R}^{n \times (d+1)}$ has $n$ rows, one for each record in the dataset, and $d+1$ columns: the original $d$ feature columns plus an additional column of 1s (the bias/intercept feature).
Let's build this design matrix using Pandas:
X = nba[['FG', 'AST', '3PA']]
X.insert(0, 'Bias', 1)
X = X.to_numpy()
X
array([[1. , 1.8, 0.6, 4.1], [1. , 0.4, 0.8, 1.5], [1. , 1.1, 1.9, 2.2], ..., [1. , 3.6, 1.1, 0. ], [1. , 3.4, 0.8, 0. ], [1. , 3.8, 1.5, 0. ]])
X.shape
(708, 4)
While we're at it, let's build the Y vector of our true PTS values.
# for nba data
Y = nba[["PTS"]].to_numpy()
n = len(Y)
print("number datapoints", n)
number datapoints 708
Suppose we decide to pick an arbitrary parameter $\theta$:

$$\theta = \begin{bmatrix}0.50 \\ -1.14 \\ 0.65 \\ 1.52 \end{bmatrix}$$

(For those interested: I drew these from random values simulated from a standard normal distribution using np.random.randn.)
theta_arbitrary = np.array([[0.5], [-1.14], [0.65], [1.52]])
theta_arbitrary
array([[ 0.5 ], [-1.14], [ 0.65], [ 1.52]])
For this value of $\theta$ we can make a prediction using our model with matrix multiplication.

The @ symbol is the matrix multiplication operator and is equivalent to writing X.dot(theta_arbitrary).
X @ theta_arbitrary
[[ 5.0700e+00] [ 2.8440e+00] [ 3.8250e+00] [-5.3000e+00] [-1.6420e+00] ... [-2.4490e+00] [-2.8890e+00] [-2.8560e+00] [-2.8570e+00]]
For Ordinary Least Squares, the average loss is MSE:
$$ \Large R(\theta) = \frac{1}{n} || \mathbb{Y} - \mathbb{X}\theta||^2_2 $$

NumPy has a handy function np.linalg.norm that computes the norm of an array (for a vector, the default is the L2 norm; for a 2-D array like our $n \times 1$ residual, the default Frobenius norm gives the same value).
def mse_nba(theta):
"""
Y is PTS
X is intercept, FG, AST, 3PA
"""
return (1/n) * (np.linalg.norm(Y - X @ theta) ** 2)
mse_nba(theta_arbitrary)
76.52265605508474
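As a quick sanity check, the squared L2 norm divided by $n$ is just the average of the squared residuals, so the two ways of computing MSE should agree. A minimal sketch, reusing the X, Y, and theta_arbitrary defined above:
# The norm-based MSE should equal the elementwise mean of the squared residuals.
residuals = Y - X @ theta_arbitrary              # shape (n, 1)
np.isclose(np.mean(residuals ** 2), mse_nba(theta_arbitrary))  # expected: True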
Is this good? Is this bad? Let's compute the optimal theta and compare!
From lecture, the Least Squares Estimate $\hat{\theta}$ is: $$ \Large \hat{\theta} = \left( \mathbb{X}^T \mathbb{X} \right)^{-1} \mathbb{X}^T \mathbb{Y} $$
While it is not as numerically stable or efficient as other methods, we can compute $\hat{\theta}$ directly using matrix inversion. To do this, we import the inv function from the NumPy linear algebra library:
from numpy.linalg import inv
def least_squares_estimate(X, Y):
return inv(X.T @ X) @ X.T @ Y
theta_hat = least_squares_estimate(X, Y)
theta_hat
array([[-0.29253798], [ 2.51705703], [ 0.05075571], [ 0.31307653]])
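As an aside, explicitly inverting $\mathbb{X}^T \mathbb{X}$ is not the most numerically stable or efficient route. A sketch of two standard NumPy alternatives, both of which should agree with theta_hat up to floating point error: np.linalg.solve applied to the normal equations, and np.linalg.lstsq applied to the original system.
# Solve the normal equations (X^T X) theta = X^T Y without forming an explicit inverse.
theta_solve = np.linalg.solve(X.T @ X, X.T @ Y)

# Or let NumPy's least squares solver work on X theta ≈ Y directly.
theta_lstsq = np.linalg.lstsq(X, Y, rcond=None)[0]

print(np.allclose(theta_solve, theta_hat), np.allclose(theta_lstsq, theta_hat))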
print("Arbitrary theta MSE: ", mse_nba(theta_arbitrary))
print("Optimal theta MSE:" , mse_nba(theta_hat))
Arbitrary theta MSE: 76.52265605508474
Optimal theta MSE: 0.3963133329080335
Nice!!!
Now that we have estimated the model parameters $\hat{\theta}$, we can also predict the points scored $\hat{\mathbb{Y}}$ for each of our players.
Y_hat = X @ theta_hat
In previous lectures we have plotted the residual vs. our single input feature $x$.
For higher dimensional regression models, we often graph the residual with respect to the fitted values $\hat{y}$.
fig = px.scatter(x = Y_hat.flatten(), y = (Y - Y_hat).flatten(), opacity=0.2)
fig.add_trace(go.Scatter(x=[0, 30], y=[0,0], name="Y_hat=Y"))
fig.update_xaxes(title="Y_hat")
fig.update_yaxes(title="Y - Y_hat")
fig
Overall, while the residuals are roughly centered around 0 ($\hat{y} = y$), we see heteroskedasticity: the spread of the residuals is uneven, and it grows as the predictions get larger.
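To put a rough number on that unevenness, one option is to bin players by their fitted values and compare the residual spread within each bin. A minimal sketch (the choice of 5 equal-width bins is arbitrary):
# Residual spread within bins of the fitted values; under heteroskedasticity,
# the per-bin standard deviation should grow as Y_hat grows.
resid_df = pd.DataFrame({
    "Y_hat": Y_hat.flatten(),
    "residual": (Y - Y_hat).flatten(),
})
resid_df["bin"] = pd.cut(resid_df["Y_hat"], bins=5)
resid_df.groupby("bin")["residual"].std()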
Let's compute the coefficient of determination, or multiple $R^2$, for our model.
r2_ast_fg_3pa = np.var(Y_hat) / np.var(Y)
r2_ast_fg_3pa
0.9883162128703274
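For a least squares fit that includes an intercept, this variance ratio is equivalent to $1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$. A quick sketch checking that the two forms agree on our fit:
# Multiple R^2 as 1 - (residual sum of squares) / (total sum of squares).
rss = np.sum((Y - Y_hat) ** 2)
tss = np.sum((Y - np.mean(Y)) ** 2)
np.isclose(1 - rss / tss, r2_ast_fg_3pa)  # expected: True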
That's super high!!! Wait, what's up with that?
(Hint: Basketball facts)
Let's try reducing the number of features to see how this Multiple $R^2$ metric changes.
# use intercept, ast, 3pa
X_3d = nba[['AST', '3PA']]
X_3d.insert(0, 'Bias', 1)
X_3d = X_3d.to_numpy()
theta_ast_3pa = least_squares_estimate(X_3d, Y)
Y_hat_ast_3pa = X_3d @ theta_ast_3pa
r2_ast_3pa = np.var(Y_hat_ast_3pa) / np.var(Y)
r2_ast_3pa
0.608786276366571
# use intercept, ast only (SLR)
X_slr = nba[['AST']]
X_slr.insert(0, 'Bias', 1)
X_slr = X_slr.to_numpy()
theta_ast_only = least_squares_estimate(X_slr, Y)
Y_hat_ast_only = X_slr @ theta_ast_only
r2_ast_only = np.var(Y_hat_ast_only) / np.var(Y)
r2_ast_only
0.4570055507968593
Comparing these Multiple $R^2$ together:
print("(SLR) interecept, AST: ", r2_ast_only)
print("intercept, 3PA, AST: ", r2_ast_3pa)
print("intercept, FG, 3PA, AST: ", r2_ast_fg_3pa)
(SLR) intercept, AST: 0.4570055507968593
intercept, 3PA, AST: 0.608786276366571
intercept, FG, 3PA, AST: 0.9883162128703274
Because of how basketball is scored, knowing a player's field goals FG and 3-point attempts 3PA will reliably tell you how many total points they scored per game. This assumes NBA players make a good number of their 3-pointers.
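You can check this basketball fact against the data itself: in the Basketball Reference convention, FG counts every made field goal (2s and 3s), so points decompose as $\text{PTS} = 2 \cdot \text{FG} + \text{3P} + \text{FT}$. A sketch of that check, assuming this Kaggle export also carries the standard 3P (3-pointers made) and FT (free throws made) columns:
# PTS should equal 2*FG + 3P + FT, up to rounding of the per-game averages.
# Assumes '3P' and 'FT' columns exist in this export (standard Basketball Reference names).
reconstructed_pts = 2 * nba['FG'] + nba['3P'] + nba['FT']
np.allclose(reconstructed_pts, nba['PTS'], atol=0.3)  # loose tolerance for rounding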
Side note: if you wanted to check that the multiple $R^2$ for Simple Linear Regression is indeed the squared correlation coefficient $r^2$:
r = np.corrcoef(nba['AST'], nba['PTS'])[0,1]
r ** 2
0.4570055507968595
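The two values agree up to floating point error; a one-line check:
np.isclose(r ** 2, r2_ast_only)  # expected: True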