by Suraj Rampure, Allen Shen
import pandas as pd
import numpy as np
Let's start with an example dataset that we saw in lecture many times – average player statistics from the 2018-19 NBA season.
nba = pd.read_csv('nba18-19.csv')
nba.head()
These are the entropy and weighted_metric functions from Lab 11. Recall that entropy is defined as:

$$ S = -\sum_{C} p_C \log_{2} p_C $$

Recall also that the weighted entropy is given by:

$$ L = \frac{N_1 S(X) + N_2 S(Y)}{N_1 + N_2} $$

where $N_1$ is the number of samples in the left node $X$, and $N_2$ is the number of samples in the right node $Y$.
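For example, a node with two 'C's and two 'SG's has entropy $-\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2} = 1$. Splitting it into a left node with the two 'C's and a right node with the two 'SG's gives

$$ L = \frac{2 \cdot 0 + 2 \cdot 0}{2 + 2} = 0 $$

since both children are pure – a perfect split drives the weighted entropy down to 0.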
def entropy(labels):
    # Proportion of each class among the labels
    _, counts = np.unique(labels, return_counts=True)
    props = counts / len(labels)
    # S = -sum over classes of p_C * log2(p_C)
    return -np.sum(props * np.log2(props))
def weighted_metric(left, right, metric):
    # Average of the metric over the two child nodes, weighted by node size
    return (len(left) * metric(left) + len(right) * metric(right)) / (len(left) + len(right))
entropy(['C', 'C', 'C', 'C', 'C'])
entropy(['C', 'SG'])
-0.5 * np.log2(0.5) - 0.5 * np.log2(0.5)
entropy(['C'] * 10 + ['SG'] * 10)
entropy(['C', 'SG', 'PF'])
3*(-0.33 * np.log2(0.33))
-np.log2(1/3)
np.log2(3)
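These last three cells illustrate a general fact: with $k$ equally likely classes, $p_C = \frac{1}{k}$ for every class, so

$$ S = -\sum_{C} \frac{1}{k} \log_2 \frac{1}{k} = \log_2 k, $$

which is the largest entropy a node with $k$ classes can have (the $0.33$ cell is just this computation with rounded probabilities).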
weighted_metric(['C'], ['C'] * 9 + ['SG'] * 10, entropy)
(entropy(['C']) + entropy(['C'] * 9 + ['SG'] * 10)) / 2
weighted_metric(['C'] * 4 + ['SG'], ['C'] * 6 + ['SG'] * 9, entropy)
(entropy(['C'] * 4 + ['SG']) + entropy(['C'] * 6 + ['SG'] * 9)) / 2
nba.head()
weighted_metric(nba.loc[nba['Age'] >= 30, 'Pos'], nba.loc[nba['Age'] < 30, 'Pos'], entropy)
weighted_metric(nba.loc[nba['Age'] >= 25, 'Pos'], nba.loc[nba['Age'] < 25, 'Pos'], entropy)
weighted_metric(nba.loc[nba['FG'] >= 5, 'Pos'], nba.loc[nba['FG'] < 5, 'Pos'], entropy)
weighted_metric(nba.loc[nba['FGA'] >= 5, 'Pos'], nba.loc[nba['FGA'] < 5, 'Pos'], entropy)
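Rather than trying thresholds one at a time, we could wrap the search in a small helper. This is just a sketch – best_split and its list of candidate thresholds are made up for illustration, not part of the lab:
def best_split(df, feature, thresholds):
    # Weighted entropy of splitting df on `feature >= t` for each candidate t
    results = {t: weighted_metric(df.loc[df[feature] >= t, 'Pos'],
                                  df.loc[df[feature] < t, 'Pos'],
                                  entropy)
               for t in thresholds}
    # Return the threshold with the lowest weighted entropy, along with all results
    return min(results, key=results.get), results
best_split(nba, 'Age', [25, 30])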
nba_left = nba.loc[nba['Age'] >= 25]
nba_left.shape
weighted_metric(nba_left.loc[nba_left['FG'] >= 2, 'Pos'], nba_left.loc[nba_left['FG'] < 2, 'Pos'], entropy)
weighted_metric(nba_left.loc[nba_left['FG'] >= 3, 'Pos'], nba_left.loc[nba_left['FG'] < 3, 'Pos'], entropy)
nba_right = nba.loc[nba['Age'] < 25]
nba_right.shape
weighted_metric(nba_right.loc[nba_right['FG'] >= 5, 'Pos'], nba_right.loc[nba_right['FG'] < 5, 'Pos'], entropy)
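As a sanity check (this isn't part of the original lab), we could compare these hand-built splits to scikit-learn's DecisionTreeClassifier with the entropy criterion; the choice of Age and FG as the features here is just for illustration.
from sklearn.tree import DecisionTreeClassifier
# Fit a shallow tree that splits on entropy, then inspect which feature and
# threshold it chose at each internal node (-2 marks a leaf)
tree = DecisionTreeClassifier(criterion='entropy', max_depth=2)
tree.fit(nba[['Age', 'FG']], nba['Pos'])
list(zip(tree.tree_.feature, tree.tree_.threshold))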
from sklearn.linear_model import LinearRegression
nba_small = nba[['FG', 'FGA', 'FT%', '3PA', 'AST', 'PTS']].fillna(0)
nba_small
model = LinearRegression()
model.fit(nba_small[['FG', 'FGA', 'FT%', '3PA', 'AST']], nba_small['PTS'])
rmse = lambda y, yhat: np.sqrt(np.mean((y - yhat)**2))
rmse(model.predict(nba_small[['FG', 'FGA', 'FT%', '3PA', 'AST']]), nba_small['PTS'])
model.coef_
nba_small.corr()
In Lecture 21, we saw that multicollinearity is present here.
Something to note: just because multicollinearity is present doesn't mean the predictions our model makes are inaccurate. Let's look at the test RMSE of two different models – one that uses just FGA, and one that uses both FG and FGA.
from sklearn.model_selection import train_test_split
train, test = train_test_split(nba_small, test_size = 0.2)
Fitting a model that just uses FGA:
model_1_feature = LinearRegression()
model_1_feature.fit(train[['FGA']], train['PTS'])
model_1_feature.coef_
test_rmse_1_feature = rmse(model_1_feature.predict(test[['FGA']]), test['PTS'])
test_rmse_1_feature
Fitting a model that uses FG and FGA:
model_2_features = LinearRegression()
model_2_features.fit(train[['FG', 'FGA']], train['PTS'])
model_2_features.coef_
test_rmse_2_features = rmse(model_2_features.predict(test[['FG', 'FGA']]), test['PTS'])
test_rmse_2_features
The model with multicollinearity had a lower testing RMSE!
But that was an aside. Multicollinearity is more of an issue when we care about interpreting our model's coefficients.
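To see why, here is a minimal sketch – the bootstrap loop and the seed are additions, not part of the original notebook – that refits the FG + FGA model on bootstrap resamples of the training set and looks at how much the coefficients move around:
rng = np.random.default_rng(42)  # arbitrary seed, just for illustration
boot_coefs = []
for _ in range(100):
    # Resample the training set with replacement and refit the two-feature model
    idx = rng.integers(0, len(train), size=len(train))
    resample = train.iloc[idx]
    boot_coefs.append(LinearRegression().fit(resample[['FG', 'FGA']], resample['PTS']).coef_)
np.std(boot_coefs, axis=0)  # a large spread means the coefficients are unstable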
One obvious solution is to drop features that are highly correlated. But another solution is to use PCA:
linear_model = LinearRegression()
linear_model.fit(train[['FG', 'FGA', 'FT%', '3PA', 'AST']], train['PTS'])
linear_model.coef_
rmse(linear_model.predict(test[['FG', 'FGA', 'FT%', '3PA', 'AST']]), test['PTS'])
Let's import PCA from sklearn.decomposition. We never really discussed this in class, but it works pretty similarly to the other scikit-learn tools you have used in this class.
from sklearn.decomposition import PCA
pca_model = PCA(n_components = 4)
pca_model.fit(train[['FG', 'FGA', 'FT%', '3PA', 'AST']])
pcs = pca_model.transform(train[['FG', 'FGA', 'FT%', '3PA', 'AST']])
pcs
pca_model.components_
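One thing we could also peek at – this cell is an addition, not from the original notebook – is how much of the variance each principal component captures:
pca_model.explained_variance_ratio_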
Let's compare this to the output of the SVD.
D = train[['FG', 'FGA', 'FT%', '3PA', 'AST']]
X = D - np.mean(D, axis = 0)  # center the data, just as PCA does internally
u, s, vt = np.linalg.svd(X, full_matrices = False)
vt[:4]  # the first four rows of V^T correspond (up to sign) to pca_model.components_
(u * s)[:, :4]  # U Sigma gives the principal components (compare to pcs above)
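The two should agree up to sign (scikit-learn may flip the sign of individual components), so a quick check – again an addition, not in the original – is to compare absolute values:
# True if the principal components from PCA and from U Sigma match up to sign
np.allclose(np.abs(pcs), np.abs((u * s)[:, :4]))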
pc1 = pcs[:, 0]
pc2 = pcs[:, 1]
pc3 = pcs[:, 2]
pc4 = pcs[:, 3]
We can fit a linear model using these principal components as our features!
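Unlike FG and FGA, the principal components are uncorrelated with one another, which is exactly why this sidesteps multicollinearity. A quick check (an addition, not in the original notebook):
# Correlation matrix of the four principal components – approximately the identity
np.round(np.corrcoef(pcs, rowvar=False), 2)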
train['pc1'] = pc1
train['pc2'] = pc2
train['pc3'] = pc3
train['pc4'] = pc4
train.head()
linear_model_pcs = LinearRegression()
linear_model_pcs.fit(train[['pc1', 'pc2', 'pc3', 'pc4']], train['PTS'])
linear_model_pcs.coef_
Note that we are using pca_model, which was fit on the training data, to transform the test data here – the test set is projected onto the training set's principal components.
pcs_test = pca_model.transform(test[['FG', 'FGA', 'FT%', '3PA', 'AST']])
pcs_test[:5]
test['pc1'] = pcs_test[:, 0]
test['pc2'] = pcs_test[:, 1]
test['pc3'] = pcs_test[:, 2]
test['pc4'] = pcs_test[:, 3]
rmse(linear_model_pcs.predict(test[['pc1', 'pc2', 'pc3', 'pc4']]), test['PTS'])