Live Lecture 7 – Data 100, Summer 2020

by Suraj Rampure, Allen Shen

In [1]:
import pandas as pd
import numpy as np

Decision Trees

Let's start with an example dataset that we've seen many times in lecture – average player statistics from the 2018-19 NBA season.

In [2]:
nba = pd.read_csv('nba18-19.csv')
In [3]:
nba.head()
Out[3]:
Rk Player Pos Age Tm G GS MP FG FGA ... FT% ORB DRB TRB AST STL BLK TOV PF PTS
0 1 Álex Abrines\abrinal01 SG 25 OKC 31 2 19.0 1.8 5.1 ... 0.923 0.2 1.4 1.5 0.6 0.5 0.2 0.5 1.7 5.3
1 2 Quincy Acy\acyqu01 PF 28 PHO 10 0 12.3 0.4 1.8 ... 0.700 0.3 2.2 2.5 0.8 0.1 0.4 0.4 2.4 1.7
2 3 Jaylen Adams\adamsja01 PG 22 ATL 34 1 12.6 1.1 3.2 ... 0.778 0.3 1.4 1.8 1.9 0.4 0.1 0.8 1.3 3.2
3 4 Steven Adams\adamsst01 C 25 OKC 80 80 33.4 6.0 10.1 ... 0.500 4.9 4.6 9.5 1.6 1.5 1.0 1.7 2.6 13.9
4 5 Bam Adebayo\adebaba01 C 21 MIA 82 28 23.3 3.4 5.9 ... 0.735 2.0 5.3 7.3 2.2 0.9 0.8 1.5 2.5 8.9

5 rows × 30 columns

Below are the entropy and weighted_metric functions from Lab 11. Recall that entropy is defined as:

$$ S = -\sum_{C} p_C \log_{2} p_C $$

Recall that the weighted entropy is given by:

$$ L = \frac{N_1 S(X) + N_2 S(Y)}{N_1 + N_2} $$

Here, $N_1$ is the number of samples in the left node $X$, and $N_2$ is the number of samples in the right node $Y$.
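For example, a node containing two classes in equal proportion (like ['C', 'SG'] below) has entropy

$$ S = -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2} = 1, $$

which we confirm numerically in the next few cells.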

In [4]:
def entropy(labels):
    # Proportion of each class in the node; counts come from np.unique.
    _, counts = np.unique(labels, return_counts=True)
    props = counts / len(labels)
    # S = -sum_C p_C log2 p_C
    return -np.sum(props * np.log2(props))

def weighted_metric(left, right, metric):
    # Average of the metric over the two child nodes, weighted by node size.
    return (len(left) * metric(left) + len(right) * metric(right)) / (len(left) + len(right))
In [5]:
entropy(['C', 'C', 'C', 'C', 'C'])
Out[5]:
-0.0
In [6]:
entropy(['C', 'SG'])
Out[6]:
1.0
In [7]:
-0.5 * np.log2(0.5) - 0.5 * np.log2(0.5)
Out[7]:
1.0
In [8]:
entropy(['C'] * 10 +  ['SG'] * 10)
Out[8]:
1.0
In [9]:
entropy(['C', 'SG', 'PF'])
Out[9]:
1.584962500721156
In [10]:
3*(-0.33 * np.log2(0.33))
Out[10]:
1.5834674497121084
In [11]:
-np.log2(1/3)
Out[11]:
1.5849625007211563
In [12]:
np.log2(3)
Out[12]:
1.584962500721156
In [13]:
weighted_metric(['C'], ['C'] * 9 + ['SG'] * 10, entropy)
Out[13]:
0.9481008396786846
In [14]:
(entropy(['C']) + entropy(['C'] * 9 + ['SG'] * 10)) / 2
Out[14]:
0.4990004419361498
In [15]:
weighted_metric(['C'] * 4 + ['SG'], ['C'] * 6 + ['SG'] * 9, entropy)
Out[15]:
0.9086949695628419
In [16]:
(entropy(['C'] * 4 + ['SG']) + entropy(['C'] * 6 + ['SG'] * 9)) / 2
Out[16]:
0.8464393446710154
In [17]:
nba.head()
Out[17]:
Rk Player Pos Age Tm G GS MP FG FGA ... FT% ORB DRB TRB AST STL BLK TOV PF PTS
0 1 Álex Abrines\abrinal01 SG 25 OKC 31 2 19.0 1.8 5.1 ... 0.923 0.2 1.4 1.5 0.6 0.5 0.2 0.5 1.7 5.3
1 2 Quincy Acy\acyqu01 PF 28 PHO 10 0 12.3 0.4 1.8 ... 0.700 0.3 2.2 2.5 0.8 0.1 0.4 0.4 2.4 1.7
2 3 Jaylen Adams\adamsja01 PG 22 ATL 34 1 12.6 1.1 3.2 ... 0.778 0.3 1.4 1.8 1.9 0.4 0.1 0.8 1.3 3.2
3 4 Steven Adams\adamsst01 C 25 OKC 80 80 33.4 6.0 10.1 ... 0.500 4.9 4.6 9.5 1.6 1.5 1.0 1.7 2.6 13.9
4 5 Bam Adebayo\adebaba01 C 21 MIA 82 28 23.3 3.4 5.9 ... 0.735 2.0 5.3 7.3 2.2 0.9 0.8 1.5 2.5 8.9

5 rows × 30 columns

In [18]:
weighted_metric(nba.loc[nba['Age'] >= 30, 'Pos'], nba.loc[nba['Age'] < 30, 'Pos'], entropy)
Out[18]:
2.3774367986811584
In [19]:
weighted_metric(nba.loc[nba['Age'] >= 25, 'Pos'], nba.loc[nba['Age'] < 25, 'Pos'], entropy)
Out[19]:
2.375101208480124
In [20]:
weighted_metric(nba.loc[nba['FG'] >= 5, 'Pos'], nba.loc[nba['FG'] < 5, 'Pos'], entropy)
Out[20]:
2.3894903552734736
In [21]:
weighted_metric(nba.loc[nba['FGA'] >= 5, 'Pos'], nba.loc[nba['FGA'] < 5, 'Pos'], entropy)
Out[21]:
2.378945518445882
In [22]:
nba_left = nba.loc[nba['Age'] >= 25]
In [23]:
nba_left.shape
Out[23]:
(425, 30)
In [24]:
weighted_metric(nba_left.loc[nba_left['FG'] >= 2, 'Pos'], nba_left.loc[nba_left['FG'] < 2, 'Pos'], entropy)
Out[24]:
2.400588553950309
In [25]:
weighted_metric(nba_left.loc[nba_left['FG'] >= 3, 'Pos'], nba_left.loc[nba_left['FG'] < 3, 'Pos'], entropy)
Out[25]:
2.399849059712462
In [26]:
nba_right = nba.loc[nba['Age'] < 25]
In [27]:
nba_right.shape
Out[27]:
(283, 30)
In [28]:
weighted_metric(nba_right.loc[nba_right['FG'] >= 5, 'Pos'], nba_right.loc[nba_right['FG'] < 5, 'Pos'], entropy)
Out[28]:
2.2904331310436747
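To automate the kind of by-hand search in the cells above, here is a minimal sketch (not from the original lecture; find_best_split is a hypothetical helper) that tries each observed value of a feature as a threshold and returns the split with the lowest weighted entropy:

def find_best_split(df, feature, label_col='Pos', metric=entropy):
    """Try splits of the form df[feature] >= t and keep the best one."""
    best_threshold, best_loss = None, float('inf')
    for t in np.sort(df[feature].dropna().unique()):
        left = df.loc[df[feature] >= t, label_col]
        right = df.loc[df[feature] < t, label_col]
        if len(left) == 0 or len(right) == 0:
            continue  # skip degenerate splits where one child is empty
        loss = weighted_metric(left, right, metric)
        if loss < best_loss:
            best_threshold, best_loss = t, loss
    return best_threshold, best_loss

# Example usage: find_best_split(nba, 'Age') or find_best_split(nba_left, 'FG')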

Multicollinearity

In [29]:
from sklearn.linear_model import LinearRegression
In [30]:
nba_small = nba[['FG', 'FGA', 'FT%', '3PA', 'AST', 'PTS']].fillna(0)
nba_small
Out[30]:
FG FGA FT% 3PA AST PTS
0 1.8 5.1 0.923 4.1 0.6 5.3
1 0.4 1.8 0.700 1.5 0.8 1.7
2 1.1 3.2 0.778 2.2 1.9 3.2
3 6.0 10.1 0.500 0.0 1.6 13.9
4 3.4 5.9 0.735 0.2 2.2 8.9
... ... ... ... ... ... ...
703 4.0 7.0 0.778 0.0 0.8 11.5
704 3.1 5.6 0.705 0.0 0.9 7.8
705 3.6 6.4 0.802 0.0 1.1 8.9
706 3.4 5.8 0.864 0.0 0.8 8.5
707 3.8 7.2 0.733 0.0 1.5 9.4

708 rows × 6 columns

In [31]:
model = LinearRegression()
model.fit(nba_small[['FG', 'FGA', 'FT%', '3PA', 'AST']], nba_small['PTS'])
Out[31]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [32]:
rmse = lambda y, yhat: np.sqrt(np.mean((y - yhat)**2))
In [33]:
rmse(model.predict(nba_small[['FG', 'FGA', 'FT%', '3PA', 'AST']]), nba_small['PTS'])
Out[33]:
0.6207008204493863
In [34]:
model.coef_
Out[34]:
array([2.44566707, 0.03633589, 0.50454007, 0.28367305, 0.04144433])
In [35]:
nba_small.corr()
Out[35]:
FG FGA FT% 3PA AST PTS
FG 1.000000 0.973355 0.371598 0.600830 0.665761 0.990014
FGA 0.973355 1.000000 0.395902 0.725114 0.703093 0.980447
FT% 0.371598 0.395902 1.000000 0.377633 0.288057 0.401555
3PA 0.600830 0.725114 0.377633 1.000000 0.480880 0.666673
AST 0.665761 0.703093 0.288057 0.480880 1.000000 0.676022
PTS 0.990014 0.980447 0.401555 0.666673 0.676022 1.000000

In Lecture 21, we saw that multicollinearity is present here.

Something to note: just because multicollinearity is present doesn't mean that the predictions our model makes are inaccurate. Let's look at the test RMSE of two different models – one that uses just FGA, and one that uses both FG and FGA.

In [36]:
from sklearn.model_selection import train_test_split
In [37]:
train, test = train_test_split(nba_small, test_size = 0.2)

Fitting a model that just uses FGA:

In [38]:
model_1_feature = LinearRegression()
model_1_feature.fit(train[['FGA']], train['PTS'])
Out[38]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [39]:
model_1_feature.coef_
Out[39]:
array([1.31276151])
In [40]:
test_rmse_1_feature = rmse(model_1_feature.predict(test[['FGA']]), test['PTS'])
test_rmse_1_feature
Out[40]:
1.10102632702208

Fitting a model that uses FG, FGA:

In [41]:
model_2_features = LinearRegression()
model_2_features.fit(train[['FG', 'FGA']], train['PTS'])
Out[41]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [42]:
model_2_features.coef_
Out[42]:
array([1.85829074, 0.4365383 ])
In [43]:
test_rmse_2_features = rmse(model_2_features.predict(test[['FG', 'FGA']]), test['PTS'])
test_rmse_2_features
Out[43]:
0.6698015078633595

The model with multicollinearity had a lower testing RMSE!

But that was an aside. Multicollinearity is more of an issue when we care about interpreting our model's coefficients.

One solution is to drop features that are highly correlated (a quick sketch of this is below). Another solution is to use PCA, which we turn to in the next section.
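For the first approach, here is a minimal sketch (an illustration under an assumed cutoff of 0.9, not code from the lecture) that uses the correlation matrix to flag one feature out of each highly correlated pair of predictors:

# Absolute correlations between the predictors (PTS is the response, so drop it).
corr = nba_small.drop(columns=['PTS']).corr().abs()

# Keep only the upper triangle so each pair of features is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flag any feature correlated above 0.9 with an earlier feature.
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
to_drop  # expect ['FGA'] here, since FG and FGA have correlation ~0.97 in the table above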

PCA for Modeling

In [44]:
linear_model = LinearRegression()
linear_model.fit(train[['FG', 'FGA', 'FT%', '3PA', 'AST']], train['PTS'])
Out[44]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [45]:
linear_model.coef_
Out[45]:
array([2.43758621, 0.04459854, 0.4703348 , 0.28488127, 0.03486817])
In [46]:
rmse(linear_model.predict(test[['FG', 'FGA', 'FT%', '3PA', 'AST']]), test['PTS'])
Out[46]:
0.5685824971991201

Let's import PCA from sklearn.decomposition. We never really discussed this in class, but it works much like the other scikit-learn tools you've used this semester.

In [47]:
from sklearn.decomposition import PCA
In [48]:
pca_model = PCA(n_components = 4)
pca_model.fit(train[['FG', 'FGA', 'FT%', '3PA', 'AST']])
Out[48]:
PCA(copy=True, iterated_power='auto', n_components=4, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)
In [49]:
pcs = pca_model.transform(train[['FG', 'FGA', 'FT%', '3PA', 'AST']])
pcs
Out[49]:
array([[-8.18977831,  0.08870261,  0.02447443, -0.07567482],
       [-5.29775328,  0.06626676, -0.33739353,  0.14158431],
       [-3.95641467, -0.75235497,  1.20789685, -0.06102934],
       ...,
       [-4.84700163, -0.32930193,  0.08940634,  0.48581747],
       [ 0.55696575, -0.89083483, -0.49149691, -0.05097942],
       [ 1.081038  ,  2.21036665, -0.79411442,  0.31112771]])
In [50]:
pca_model.components_
Out[50]:
array([[ 0.39175754,  0.83585622,  0.0178255 ,  0.30772277,  0.22991164],
       [ 0.29980558,  0.13643796, -0.01445766, -0.91753593,  0.22230627],
       [-0.22490819, -0.20084309,  0.00179777,  0.12561255,  0.9451437 ],
       [-0.83930299,  0.4922353 , -0.04535856, -0.21639057, -0.06627684]])

Let's compare this to the output of the SVD. The rows of $V^T$ should match pca_model.components_, and the columns of $U \Sigma$ should match the transformed data. Each principal component is only determined up to a sign flip, which is why the fourth component comes out negated here.

In [51]:
D = train[['FG', 'FGA', 'FT%', '3PA', 'AST']]
X = D - np.mean(D, axis = 0)
u, s, vt = np.linalg.svd(X, full_matrices = False)
In [52]:
vt[:4]
Out[52]:
array([[ 0.39175754,  0.83585622,  0.0178255 ,  0.30772277,  0.22991164],
       [ 0.29980558,  0.13643796, -0.01445766, -0.91753593,  0.22230627],
       [-0.22490819, -0.20084309,  0.00179777,  0.12561255,  0.9451437 ],
       [ 0.83930299, -0.4922353 ,  0.04535856,  0.21639057,  0.06627684]])
In [53]:
(u * s)[:, :4]
Out[53]:
array([[-8.18977831,  0.08870261,  0.02447443,  0.07567482],
       [-5.29775328,  0.06626676, -0.33739353, -0.14158431],
       [-3.95641467, -0.75235497,  1.20789685,  0.06102934],
       ...,
       [-4.84700163, -0.32930193,  0.08940634, -0.48581747],
       [ 0.55696575, -0.89083483, -0.49149691,  0.05097942],
       [ 1.081038  ,  2.21036665, -0.79411442, -0.31112771]])
In [54]:
pc1 = pcs[:, 0]
pc2 = pcs[:, 1]
pc3 = pcs[:, 2]
pc4 = pcs[:, 3]

We can fit a linear model using these principal components as our features!

In [55]:
train['pc1'] = pc1
train['pc2'] = pc2
train['pc3'] = pc3
train['pc4'] = pc4
(Each of the four assignments above raises a pandas SettingWithCopyWarning – train is a slice of another DataFrame, and pandas suggests assigning with .loc instead; full warning output elided.)
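Aside: a minimal sketch of one way to avoid these warnings (a stylistic suggestion, not part of the original notebook) is to take an explicit copy before adding the new columns; the same applies to test below.

# Work on an explicit copy so pandas does not warn about writing to a slice.
train = train.copy()
for i in range(4):
    train[f'pc{i + 1}'] = pcs[:, i]  # pc1 ... pc4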
In [56]:
train.head()
Out[56]:
FG FGA FT% 3PA AST PTS pc1 pc2 pc3 pc4
302 0.0 0.0 0.000 0.0 0.0 0.0 -8.189778 0.088703 0.024474 -0.075675
101 1.0 2.6 0.667 0.8 0.3 2.6 -5.297753 0.066267 -0.337394 0.141584
2 1.1 3.2 0.778 2.2 1.9 3.2 -3.956415 -0.752355 1.207897 -0.061029
228 1.2 3.3 0.786 2.2 0.6 3.6 -4.132396 -0.997844 -0.063351 -0.009939
41 4.3 10.7 0.770 4.6 2.9 11.5 4.534435 -0.749357 0.228467 0.359715
In [57]:
linear_model_pcs = LinearRegression()
linear_model_pcs.fit(train[['pc1', 'pc2', 'pc3', 'pc4']], train['PTS'])
Out[57]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [58]:
linear_model_pcs.coef_
Out[58]:
array([ 1.09628576,  0.47644956, -0.48760475, -2.1092107 ])

Note that we are using the pca_model that was fit on the training data to transform the test data, so the test points are projected onto the same principal components.

In [59]:
pcs_test = pca_model.transform(test[['FG', 'FGA', 'FT%', '3PA', 'AST']])
pcs_test[:5]
Out[59]:
array([[ 1.97608652,  3.38486539, -0.68089399, -0.06083578],
       [ 1.61391651, -0.41344573, -0.77032441, -0.26199467],
       [ 2.68573153,  3.51461826, -0.37569613, -0.36744849],
       [-6.53905121,  0.45393693,  0.38958666,  0.01357322],
       [-1.50806343, -0.2737614 ,  2.50259516, -0.16586052]])
In [60]:
test['pc1'] = pcs_test[:, 0]
test['pc2'] = pcs_test[:, 1]
test['pc3'] = pcs_test[:, 2]
test['pc4'] = pcs_test[:, 3]
(The same four SettingWithCopyWarnings are raised here for the assignments to test; output elided.)
In [61]:
rmse(linear_model_pcs.predict(test[['pc1', 'pc2', 'pc3', 'pc4']]), test['PTS'])
Out[61]:
0.577268936115324
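As a final aside (not part of the original lecture), the fit-PCA-on-train / transform-test workflow above can also be written as a scikit-learn Pipeline, which keeps both steps fitted on the training data only:

from sklearn.pipeline import make_pipeline

features = ['FG', 'FGA', 'FT%', '3PA', 'AST']

# PCA is fit on the training features; LinearRegression is fit on the resulting PCs.
pca_pipeline = make_pipeline(PCA(n_components=4), LinearRegression())
pca_pipeline.fit(train[features], train['PTS'])

# predict() applies the training-set PCA transform to the test features first.
rmse(pca_pipeline.predict(test[features]), test['PTS'])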