We’ve now spent a number of lectures exploring how to build effective models – we introduced the SLR and constant models, selected cost functions to suit our modeling task, and applied transformations to improve the linear fit.
Throughout all of this, we considered models of one feature ($x_i$). In this lecture, we'll extend these ideas to models that incorporate multiple features at once.

An expression is linear in $\theta$ (a set of parameters) if it is a linear combination of the elements of that set.

For example, consider the vector $\theta = [\theta_0, \theta_1, \theta_2]$. The expression $\theta_0 + 2\theta_1 + 3\theta_2$ is linear in $\theta$, while an expression like $\theta_0\theta_1 + 3\theta_2$ is not.
There are several equivalent terms in the context of regression. The ones we use most often for this course are bolded.
Multiple linear regression is an extension of simple linear regression that adds additional features to the model. The multiple linear regression model takes the form:

$$\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_p x_p$$

Our predicted value of $y$, denoted $\hat{y}$, is a linear combination of the features, $x_i$, and the parameters, $\theta_i$.
We can explore this idea further by looking at a dataset containing aggregate per-player data from the 2018-19 NBA season, downloaded from Kaggle.
```python
import pandas as pd

nba = pd.read_csv('data/nba18-19.csv', index_col=0)
nba.index.name = None  # Drops name of index (players are ordered by rank)
nba.head(5)
```
| | Player | Pos | Age | Tm | G | GS | MP | FG | FGA | FG% | ... | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Álex Abrines\abrinal01 | SG | 25 | OKC | 31 | 2 | 19.0 | 1.8 | 5.1 | 0.357 | ... | 0.923 | 0.2 | 1.4 | 1.5 | 0.6 | 0.5 | 0.2 | 0.5 | 1.7 | 5.3 |
| 2 | Quincy Acy\acyqu01 | PF | 28 | PHO | 10 | 0 | 12.3 | 0.4 | 1.8 | 0.222 | ... | 0.700 | 0.3 | 2.2 | 2.5 | 0.8 | 0.1 | 0.4 | 0.4 | 2.4 | 1.7 |
| 3 | Jaylen Adams\adamsja01 | PG | 22 | ATL | 34 | 1 | 12.6 | 1.1 | 3.2 | 0.345 | ... | 0.778 | 0.3 | 1.4 | 1.8 | 1.9 | 0.4 | 0.1 | 0.8 | 1.3 | 3.2 |
| 4 | Steven Adams\adamsst01 | C | 25 | OKC | 80 | 80 | 33.4 | 6.0 | 10.1 | 0.595 | ... | 0.500 | 4.9 | 4.6 | 9.5 | 1.6 | 1.5 | 1.0 | 1.7 | 2.6 | 13.9 |
| 5 | Bam Adebayo\adebaba01 | C | 21 | MIA | 82 | 28 | 23.3 | 3.4 | 5.9 | 0.576 | ... | 0.735 | 2.0 | 5.3 | 7.3 | 2.2 | 0.9 | 0.8 | 1.5 | 2.5 | 8.9 |
5 rows × 29 columns
Let’s say we are interested in predicting the number of points (`PTS`) an athlete will score in a basketball game this season.
Suppose we want to fit a linear model using some characteristics, or features, of a player. Specifically, we’ll focus on field goals, assists, and 3-point attempts:

- `FG`, the number of (2-point) field goals per game
- `AST`, the average number of assists per game
- `3PA`, the number of 3-point field goals attempted per game

```python
nba[['FG', 'AST', '3PA', 'PTS']].head()
```
| | FG | AST | 3PA | PTS |
|---|---|---|---|---|
| 1 | 1.8 | 0.6 | 4.1 | 5.3 |
| 2 | 0.4 | 0.8 | 1.5 | 1.7 |
| 3 | 1.1 | 1.9 | 2.2 | 3.2 |
| 4 | 6.0 | 1.6 | 0.0 | 13.9 |
| 5 | 3.4 | 2.2 | 0.2 | 8.9 |
Because we are now dealing with many parameter values, we’ve collected them all into a parameter vector. For a model with $p$ features, this vector has dimensions $(p+1) \times 1$ (one entry per feature, plus one for the intercept):

$$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_p \end{bmatrix}$$
We are working with two vectors here: a row vector representing the observed data, and a column vector containing the model parameters. The multiple linear regression model is equivalent to the dot (scalar) product of the observation vector and parameter vector:

$$\hat{y} = \begin{bmatrix} 1 & x_1 & x_2 & \cdots & x_p \end{bmatrix} \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_p \end{bmatrix} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_p x_p$$

Notice that we have inserted 1 as the first value in the observation vector. When the dot product is computed, this 1 will be multiplied with $\theta_0$ to give the intercept of the regression model. We call this 1 entry the intercept or bias term.
Given that we have three features here, we can express this model as:

$$\hat{y} = \theta_0 + \theta_1 \cdot \text{FG} + \theta_2 \cdot \text{AST} + \theta_3 \cdot \text{3PA}$$

Our features are represented by $x_1$ (`FG`), $x_2$ (`AST`), and $x_3$ (`3PA`), with each having corresponding parameters $\theta_1$, $\theta_2$, and $\theta_3$.
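As a quick sketch, we can compute a single prediction as a dot product with NumPy. The parameter values below are placeholders chosen for illustration, not fitted estimates; the observation is the first row of the table above.

```python
import numpy as np

# Placeholder parameter values for illustration -- not fitted estimates
theta = np.array([0.5, 2.0, 0.3, 1.0])   # [theta_0, theta_1, theta_2, theta_3]

# First observation in the table above: prepend a 1 for the intercept term
x = np.array([1, 1.8, 0.6, 4.1])         # [1, FG, AST, 3PA]

y_hat = x @ theta                        # dot product of observation and parameter vectors
print(y_hat)                             # a single scalar prediction
```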
In statistics, this model + loss is called Ordinary Least Squares (OLS). The solution to OLS is the parameter vector $\hat{\theta}$ that minimizes the average loss, also called the least squares estimate.
We now know how to generate a single prediction from multiple observed features. Data scientists usually work at scale – that is, they want to build models that can produce many predictions, all at once. The vector notation we introduced above gives us a hint on how we can expedite multiple linear regression. We want to use the tools of linear algebra.
Let’s think about how we can apply what we did above. To accommodate the fact that we’re considering several feature variables, we’ll adjust our notation slightly. Each observation can now be thought of as a row vector with an entry for each of the $p$ features, along with a leading 1 for the intercept term:

$$\vec{x}_i = \begin{bmatrix} 1 & x_{i1} & x_{i2} & \cdots & x_{ip} \end{bmatrix}$$
To make a prediction from the first observation in the data, we take the dot product of the parameter vector and first observation vector. To make a prediction from the second observation, we would repeat this process to find the dot product of the parameter vector and the second observation vector. If we wanted to find the model predictions for each observation in the dataset, we’d repeat this process for all $n$ observations.
Our observed data is represented by $\mathbb{X}$, known as the design matrix, formed by stacking all $n$ observation vectors on top of one another:

$$\mathbb{X} = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}$$

The matrix $\mathbb{X}$ has dimensions $n \times (p+1)$: one row per observation, and one column per feature plus a leading column of 1s for the intercept term.
To review what is happening in the design matrix: each row represents a single observation (for example, a student in Data 100), and each column represents a feature (for example, the ages of students in Data 100). This convention allows us to easily transfer our previous work in DataFrames over to this new linear algebra perspective.
The multiple linear regression model can then be restated in terms of matrices:

$$\hat{\mathbb{Y}} = \mathbb{X} \theta$$

Here, $\hat{\mathbb{Y}}$ is the prediction vector with $n$ elements ($\hat{\mathbb{Y}} \in \mathbb{R}^n$); it contains the prediction made by the model for each of the $n$ input observations.
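As a sketch of this matrix form (assuming the `nba` DataFrame loaded earlier and reusing the placeholder `theta` from the sketch above), we can build the design matrix $\mathbb{X}$ and compute every prediction with a single matrix-vector product:

```python
import numpy as np

# Stack the feature columns and prepend a column of 1s to form the design matrix
features = nba[['FG', 'AST', '3PA']].to_numpy()          # n x p matrix of features
X = np.hstack([np.ones((len(features), 1)), features])   # n x (p + 1) design matrix
Y = nba['PTS'].to_numpy()                                # vector of true values, length n

theta = np.array([0.5, 2.0, 0.3, 1.0])   # placeholder parameter vector, not fitted
Y_hat = X @ theta                        # one prediction per player
print(X.shape, Y_hat.shape)              # (n, 4) and (n,)
```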
As a refresher, let’s also review the dot product (or inner product). This is a vector operation that:

- can only be carried out on two vectors of the same length,
- sums the products of the corresponding entries of the two vectors,
- and returns a single scalar value.

$$\vec{u} \cdot \vec{v} = \sum_{i=1}^{n} u_i v_i = u_1 v_1 + u_2 v_2 + \cdots + u_n v_n$$
While this is not in scope, note that we can also interpret the dot product geometrically:

$$\vec{u} \cdot \vec{v} = \|\vec{u}\|_2 \, \|\vec{v}\|_2 \cos{\alpha}$$

where $\alpha$ is the angle between the two vectors. In particular, the dot product is 0 exactly when the two vectors are orthogonal (perpendicular).
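A quick numerical illustration of both views, using two small example vectors:

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

dot = u @ v                                                # 1*4 + 2*5 + 3*6 = 32
cos_angle = dot / (np.linalg.norm(u) * np.linalg.norm(v))  # cosine of the angle between u and v

print(dot, cos_angle)
```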
We now have a new approach to understanding models in terms of vectors and matrices. To accompany this new convention, we should update our understanding of risk functions and model fitting.
Recall our definition of MSE:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
At its heart, the MSE is a measure of distance – it gives an indication of how “far away” the predictions are from the true values, on average.
When working with vectors, this idea of “distance” or the vector’s size/length is represented by the norm. More precisely, the distance between two vectors $\vec{a}$ and $\vec{b}$ can be expressed as:

$$\|\vec{a} - \vec{b}\|_2 = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + \cdots + (a_n - b_n)^2} = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$$

The double bars are mathematical notation for the norm. The subscript 2 indicates that we are computing the L2, or Euclidean, norm.
The two norms we need to know for Data 100 are the L1 and L2 norms (sound familiar?). In this note, we’ll focus on the L2 norm. We’ll dive into the L1 norm in future lectures.
For the $n$-dimensional vector $\vec{x} = [x_1, x_2, \ldots, x_n]$, its L2 vector norm is:

$$\|\vec{x}\|_2 = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2} = \sqrt{\sum_{i=1}^{n} x_i^2}$$
The L2 vector norm is a generalization of the Pythagorean theorem in $n$ dimensions.
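For example, `np.linalg.norm` computes the L2 norm by default, matching the definition above:

```python
import numpy as np

x = np.array([3.0, 4.0])
print(np.linalg.norm(x))          # 5.0 -- the L2 norm of [3, 4]
print(np.sqrt(np.sum(x ** 2)))    # same value, computed directly from the definition
```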
We can express the MSE as a squared L2 norm if we rewrite it in terms of the vector of true values, $\mathbb{Y}$, and the prediction vector, $\hat{\mathbb{Y}}$:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \frac{1}{n} \|\mathbb{Y} - \hat{\mathbb{Y}}\|_2^2$$
Here, the superscript 2 outside of the norm double bars means that we are squaring the norm. If we plug in our linear model $\hat{\mathbb{Y}} = \mathbb{X}\theta$, we find the MSE cost as a function of the parameter vector:

$$R(\theta) = \frac{1}{n} \|\mathbb{Y} - \mathbb{X}\theta\|_2^2$$
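A quick check that the summation form and the squared-norm form of the MSE agree numerically (reusing the `X`, `Y`, and placeholder `theta` arrays from the earlier sketch):

```python
import numpy as np

# Assumes X, Y, and the placeholder theta defined in the earlier sketch
residuals = Y - X @ theta                                 # residual vector, Y - X theta
mse_from_sum = np.mean(residuals ** 2)                    # (1/n) * sum of squared errors
mse_from_norm = np.linalg.norm(residuals) ** 2 / len(Y)   # (1/n) * squared L2 norm

print(np.isclose(mse_from_sum, mse_from_norm))            # True -- the two forms agree
```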
Under the linear algebra perspective, our new task is to fit the optimal parameter vector $\hat{\theta}$ that minimizes this cost function.
We can restate this goal in two ways:

- Minimize the distance between the vector of true values, $\mathbb{Y}$, and the vector of predicted values, $\hat{\mathbb{Y}} = \mathbb{X}\theta$.
- Minimize the length (L2 norm) of the residual vector, $e = \mathbb{Y} - \mathbb{X}\theta$.
To derive the best parameter vector to meet this goal, we can turn to the geometric properties of our modeling setup.
Up until now, we’ve mostly thought of our model as a scalar product between horizontally stacked observations and the parameter vector. We can also think of $\mathbb{X}\theta$ as a linear combination of the columns of the design matrix, scaled by the parameters:

$$\hat{\mathbb{Y}} = \mathbb{X}\theta = \theta_0 \mathbb{X}_{:,\,0} + \theta_1 \mathbb{X}_{:,\,1} + \cdots + \theta_p \mathbb{X}_{:,\,p}$$

Here, $\mathbb{X}_{:,\,i}$ denotes the $i$th column of $\mathbb{X}$ (the notation is similar to `.iloc` and `.loc`: the “:” means that we are taking all entries in the $i$th column).
This new approach is useful because it allows us to take advantage of the properties of linear combinations.
Recall that the span or column space of a matrix $\mathbb{X}$, written $\text{Span}(\mathbb{X})$, is the set of all possible linear combinations of the matrix’s columns.
Because the prediction vector, $\hat{\mathbb{Y}} = \mathbb{X}\theta$, is a linear combination of the columns of $\mathbb{X}$, we know that the predictions are contained in the span of $\mathbb{X}$. That is, $\hat{\mathbb{Y}} \in \text{Span}(\mathbb{X})$.
The diagram below is a simplified view of $\text{Span}(\mathbb{X})$, assuming that each column of $\mathbb{X}$ has length $n$ (so that $\text{Span}(\mathbb{X})$ is a subspace of $\mathbb{R}^n$).

*Figure: $\text{Span}(\mathbb{X})$ drawn as a plane in $\mathbb{R}^n$, containing the prediction vector $\hat{\mathbb{Y}} = \mathbb{X}\theta$.*
Examining this diagram, we find a problem. The vector of true values, $\mathbb{Y}$, is typically not contained in $\text{Span}(\mathbb{X})$; this means our predictions $\hat{\mathbb{Y}}$ will almost never perfectly match $\mathbb{Y}$, so the model will always have some error. Our goal, then, is to find the vector in $\text{Span}(\mathbb{X})$ that is as close to $\mathbb{Y}$ as possible.
Another way of rephrasing this goal is to say that we wish to minimize the length of the residual vector $e = \mathbb{Y} - \hat{\mathbb{Y}}$, as measured by its L2 norm.
The vector in $\text{Span}(\mathbb{X})$ that is closest to $\mathbb{Y}$ is always the orthogonal projection of $\mathbb{Y}$ onto $\text{Span}(\mathbb{X})$. Thus, we should choose the parameter vector $\theta$ that makes the residual vector orthogonal to every vector in $\text{Span}(\mathbb{X})$.
How does this help us identify the optimal parameter vector, $\hat{\theta}$? Recall that two vectors are orthogonal when their dot product is zero.

Note that a vector is orthogonal to $\text{Span}(\mathbb{X})$ if and only if it is orthogonal to each column of $\mathbb{X}$. In matrix notation, a vector $\vec{v}$ is orthogonal to $\text{Span}(\mathbb{X})$ exactly when $\mathbb{X}^T \vec{v} = \vec{0}$.
Remember our goal is to find the $\hat{\theta}$ that minimizes the cost; equivalently, it is the $\hat{\theta}$ that makes the residual vector $e = \mathbb{Y} - \mathbb{X}\hat{\theta}$ orthogonal to $\text{Span}(\mathbb{X})$.
Looking at the definition of orthogonality of $e$ with $\text{Span}(\mathbb{X})$, we can write:

$$\mathbb{X}^T (\mathbb{Y} - \mathbb{X}\hat{\theta}) = \vec{0}$$
Let’s then rearrange the terms:

$$\mathbb{X}^T \mathbb{Y} - \mathbb{X}^T \mathbb{X} \hat{\theta} = \vec{0}$$
And finally, we end up with the normal equation:

$$\mathbb{X}^T \mathbb{X} \hat{\theta} = \mathbb{X}^T \mathbb{Y}$$
Any parameter vector $\hat{\theta}$ that minimizes the MSE on a dataset must satisfy this equation.
If $\mathbb{X}^T \mathbb{X}$ is invertible, we can multiply both sides by $(\mathbb{X}^T \mathbb{X})^{-1}$ to solve for the optimal parameter vector:

$$\hat{\theta} = (\mathbb{X}^T \mathbb{X})^{-1} \mathbb{X}^T \mathbb{Y}$$
This is called the least squares estimate of $\theta$: it is the value of the parameter vector that minimizes the average squared loss on our data.
Note that the least squares estimate was derived under the assumption that $\mathbb{X}^T \mathbb{X}$ is invertible. This condition holds when $\mathbb{X}$ is full column rank, i.e., when no column of $\mathbb{X}$ can be written as a linear combination of the other columns.
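Here is a sketch of computing the least squares estimate for the NBA model above, assuming the `X` and `Y` arrays built in the earlier sketch. In practice, solving the normal equation with `np.linalg.solve` is preferred over forming $(\mathbb{X}^T\mathbb{X})^{-1}$ explicitly for numerical stability:

```python
import numpy as np

# Assumes the X and Y arrays built from the nba data earlier.
# Solve the normal equation X^T X theta_hat = X^T Y.
theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# The residual vector should be (numerically) orthogonal to every column of X
e = Y - X @ theta_hat
print(np.allclose(X.T @ e, 0, atol=1e-6))   # True, up to floating point error

print(theta_hat)   # [intercept, coefficient on FG, on AST, on 3PA]
```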
Our geometric view of multiple linear regression has taken us far! We have identified the optimal set of parameter values to minimize MSE in a model of multiple features.
Now, we want to understand how well our fitted model performs. One measure of model performance is the Root Mean Squared Error, or RMSE. The RMSE is simply the square root of MSE. Taking the square root converts the value back into the original, non-squared units of $y_i$, which makes it easier to interpret: a lower RMSE means the model’s predictions are, on average, closer to the true values.

$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$
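Continuing the sketch with the fitted `theta_hat` from above, the RMSE is just the square root of the MSE:

```python
import numpy as np

# Assumes X, Y, and the fitted theta_hat from the sketch above
Y_hat = X @ theta_hat
rmse = np.sqrt(np.mean((Y - Y_hat) ** 2))
print(rmse)   # average prediction error, in the original units (points per game)
```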
When working with SLR, we generated plots of the residuals against a single feature to understand the behavior of residuals. When working with several features in multiple linear regression, it no longer makes sense to consider a single feature in our residual plots. Instead, multiple linear regression is evaluated by making plots of the residuals against the predicted values. As was the case with SLR, a multiple linear model performs well if its residual plot shows no patterns.
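A sketch of such a residual plot for the fitted NBA model, using matplotlib and the `Y` and `Y_hat` arrays from the sketch above:

```python
import matplotlib.pyplot as plt

# Assumes Y and Y_hat from the sketch above
residuals = Y - Y_hat
plt.scatter(Y_hat, residuals, alpha=0.5)
plt.axhline(0, color='red', linestyle='--')   # reference line at zero residual
plt.xlabel('Predicted points per game')
plt.ylabel('Residual')
plt.title('Residuals vs. fitted values')
plt.show()
```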
For SLR, we used the correlation coefficient to capture the association between the target variable and a single feature variable. In a multiple linear model setting, we will need a performance metric that can account for multiple features at once. Multiple $R^2$, also called the coefficient of determination, is the ratio of the variance of the fitted values, $\hat{y}_i$, to the variance of the true values, $y_i$:

$$R^2 = \frac{\text{variance of fitted values}}{\text{variance of true } y} = \frac{\sigma^2_{\hat{y}}}{\sigma^2_{y}}$$
Note that for OLS with an intercept term, for example $\hat{y} = \theta_0 + \theta_1 x_1 + \cdots + \theta_p x_p$, $R^2$ is equal to the square of the correlation between $y$ and $\hat{y}$. (For SLR, $R^2$ is equal to $r^2$, the squared correlation between $y$ and $x$.)
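A sketch of computing multiple $R^2$ both ways for the fitted model, reusing `Y` and `Y_hat` from the sketch above:

```python
import numpy as np

# Assumes Y and Y_hat from the sketch above
r2_variance_ratio = np.var(Y_hat) / np.var(Y)         # variance of fitted values / variance of y
r2_squared_corr = np.corrcoef(Y, Y_hat)[0, 1] ** 2    # squared correlation between y and y_hat

print(r2_variance_ratio, r2_squared_corr)             # equal for OLS with an intercept term
```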
Additionally, as we add more features, our fitted values tend to become closer and closer to our actual values. Thus, $R^2$ increases (it can never decrease) as we add more features to the model.
Adding more features doesn’t always mean our model is better though! We’ll see why later in the course.
To summarize:
| Model | Estimate | Unique? |
|---|---|---|
| Constant Model + MSE | $\hat{\theta} = \text{mean}(y) = \bar{y}$ | Yes. Any set of values has a unique mean. |
| Constant Model + MAE | $\hat{\theta} = \text{median}(y)$ | Yes, if odd. No, if even. Return the average of the middle 2 values. |
| Simple Linear Regression + MSE | $\hat{\theta}_0 = \bar{y} - \hat{\theta}_1 \bar{x}$, $\quad \hat{\theta}_1 = r \frac{\sigma_y}{\sigma_x}$ | Yes. Any set of non-constant* values has a unique mean, SD, and correlation coefficient. |
| OLS (Linear Model + MSE) | $\hat{\theta} = (\mathbb{X}^T \mathbb{X})^{-1} \mathbb{X}^T \mathbb{Y}$ | Yes, if $\mathbb{X}$ is full column rank (its columns are linearly independent). |