Live Lecture 7 – Data 100, Summer 2020

by Suraj Rampure, Allen Shen

In [1]:
import pandas as pd
import numpy as np

Decision Trees

Let's start with an example dataset that we've seen many times in lecture – average player statistics from the 2018-19 NBA season.

In [2]:
nba = pd.read_csv('nba18-19.csv')
In [3]:
nba.head()
Out[3]:
Rk Player Pos Age Tm G GS MP FG FGA ... FT% ORB DRB TRB AST STL BLK TOV PF PTS
0 1 Álex Abrines\abrinal01 SG 25 OKC 31 2 19.0 1.8 5.1 ... 0.923 0.2 1.4 1.5 0.6 0.5 0.2 0.5 1.7 5.3
1 2 Quincy Acy\acyqu01 PF 28 PHO 10 0 12.3 0.4 1.8 ... 0.700 0.3 2.2 2.5 0.8 0.1 0.4 0.4 2.4 1.7
2 3 Jaylen Adams\adamsja01 PG 22 ATL 34 1 12.6 1.1 3.2 ... 0.778 0.3 1.4 1.8 1.9 0.4 0.1 0.8 1.3 3.2
3 4 Steven Adams\adamsst01 C 25 OKC 80 80 33.4 6.0 10.1 ... 0.500 4.9 4.6 9.5 1.6 1.5 1.0 1.7 2.6 13.9
4 5 Bam Adebayo\adebaba01 C 21 MIA 82 28 23.3 3.4 5.9 ... 0.735 2.0 5.3 7.3 2.2 0.9 0.8 1.5 2.5 8.9

5 rows × 30 columns

Below are the entropy and weighted_metric functions from Lab 11. Recall that entropy is defined as:

$$ S = -\sum_{C} p_C \log_{2} p_C $$

Recall that the weighted entropy is given by:

$$ L = \frac{N_1 S(X) + N_2 S(Y)}{N_1 + N_2} $$

Here, $N_1$ is the number of samples in the left node $X$, and $N_2$ is the number of samples in the right node $Y$.
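For example, a node containing two classes in equal proportion (like ['C', 'SG'] below) has entropy

$$ S = -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2} = 1, $$

which we confirm numerically in the next few cells.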

In [4]:
def entropy(labels):
    # Proportion of each class in the node; counts come from np.unique.
    _, counts = np.unique(labels, return_counts=True)
    props = counts / len(labels)
    # S = -sum_C p_C log2 p_C
    return -np.sum(props * np.log2(props))

def weighted_metric(left, right, metric):
    # Average of the metric over the two child nodes, weighted by node size.
    return (len(left) * metric(left) + len(right) * metric(right)) / (len(left) + len(right))
In [5]:
entropy(['C', 'C', 'C', 'C', 'C'])
Out[5]:
-0.0
In [6]:
entropy(['C', 'SG'])
Out[6]:
1.0
In [7]:
-0.5 * np.log2(0.5) - 0.5 * np.log2(0.5)
Out[7]:
1.0
In [8]:
entropy(['C'] * 10 +  ['SG'] * 10)
Out[8]:
1.0
In [9]:
entropy(['C', 'SG', 'PF'])
Out[9]:
1.584962500721156
In [10]:
3*(-0.33 * np.log2(0.33))
Out[10]:
1.5834674497121084
In [11]:
-np.log2(1/3)
Out[11]:
1.5849625007211563
In [12]:
np.log2(3)
Out[12]:
1.584962500721156
In [13]:
weighted_metric(['C'], ['C'] * 9 + ['SG'] * 10, entropy)
Out[13]:
0.9481008396786846
In [14]:
(entropy(['C']) + entropy(['C'] * 9 + ['SG'] * 10)) / 2
Out[14]:
0.4990004419361498
In [15]:
weighted_metric(['C'] * 4 + ['SG'], ['C'] * 6 + ['SG'] * 9, entropy)
Out[15]:
0.9086949695628419
In [16]:
(entropy(['C'] * 4 + ['SG']) + entropy(['C'] * 6 + ['SG'] * 9)) / 2
Out[16]:
0.8464393446710154
In [17]:
nba.head()
Out[17]:
Rk Player Pos Age Tm G GS MP FG FGA ... FT% ORB DRB TRB AST STL BLK TOV PF PTS
0 1 Álex Abrines\abrinal01 SG 25 OKC 31 2 19.0 1.8 5.1 ... 0.923 0.2 1.4 1.5 0.6 0.5 0.2 0.5 1.7 5.3
1 2 Quincy Acy\acyqu01 PF 28 PHO 10 0 12.3 0.4 1.8 ... 0.700 0.3 2.2 2.5 0.8 0.1 0.4 0.4 2.4 1.7
2 3 Jaylen Adams\adamsja01 PG 22 ATL 34 1 12.6 1.1 3.2 ... 0.778 0.3 1.4 1.8 1.9 0.4 0.1 0.8 1.3 3.2
3 4 Steven Adams\adamsst01 C 25 OKC 80 80 33.4 6.0 10.1 ... 0.500 4.9 4.6 9.5 1.6 1.5 1.0 1.7 2.6 13.9
4 5 Bam Adebayo\adebaba01 C 21 MIA 82 28 23.3 3.4 5.9 ... 0.735 2.0 5.3 7.3 2.2 0.9 0.8 1.5 2.5 8.9

5 rows × 30 columns

In [18]:
weighted_metric(nba.loc[nba['Age'] >= 30, 'Pos'], nba.loc[nba['Age'] < 30, 'Pos'], entropy)
Out[18]:
2.3774367986811584
In [19]:
weighted_metric(nba.loc[nba['Age'] >= 25, 'Pos'], nba.loc[nba['Age'] < 25, 'Pos'], entropy)
Out[19]:
2.375101208480124
In [20]:
weighted_metric(nba.loc[nba['FG'] >= 5, 'Pos'], nba.loc[nba['FG'] < 5, 'Pos'], entropy)
Out[20]:
2.3894903552734736
In [21]:
weighted_metric(nba.loc[nba['FGA'] >= 5, 'Pos'], nba.loc[nba['FGA'] < 5, 'Pos'], entropy)
Out[21]:
2.378945518445882
In [22]:
nba_left = nba.loc[nba['Age'] >= 25]
In [23]:
nba_left.shape
Out[23]:
(425, 30)
In [24]:
weighted_metric(nba_left.loc[nba_left['FG'] >= 2, 'Pos'], nba_left.loc[nba_left['FG'] < 2, 'Pos'], entropy)
Out[24]:
2.400588553950309
In [25]:
weighted_metric(nba_left.loc[nba_left['FG'] >= 3, 'Pos'], nba_left.loc[nba_left['FG'] < 3, 'Pos'], entropy)
Out[25]:
2.399849059712462
In [26]:
nba_right = nba.loc[nba['Age'] < 25]
In [27]:
nba_right.shape
Out[27]:
(283, 30)
In [28]:
weighted_metric(nba_right.loc[nba_right['FG'] >= 5, 'Pos'], nba_right.loc[nba_right['FG'] < 5, 'Pos'], entropy)
Out[28]:
2.2904331310436747
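To automate the kind of by-hand search in the cells above, here is a minimal sketch (not from the original lecture; find_best_split is a hypothetical helper) that tries each observed value of a feature as a threshold and returns the split with the lowest weighted entropy:

def find_best_split(df, feature, label_col='Pos', metric=entropy):
    """Try splits of the form df[feature] >= t and keep the best one."""
    best_threshold, best_loss = None, float('inf')
    for t in np.sort(df[feature].dropna().unique()):
        left = df.loc[df[feature] >= t, label_col]
        right = df.loc[df[feature] < t, label_col]
        if len(left) == 0 or len(right) == 0:
            continue  # skip degenerate splits where one child is empty
        loss = weighted_metric(left, right, metric)
        if loss < best_loss:
            best_threshold, best_loss = t, loss
    return best_threshold, best_loss

# Example usage: find_best_split(nba, 'Age') or find_best_split(nba_left, 'FG')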

Multicollinearity

In [29]:
from sklearn.linear_model import LinearRegression
In [30]:
nba_small = nba[['FG', 'FGA', 'FT%', '3PA', 'AST', 'PTS']].fillna(0)
nba_small
Out[30]:
FG FGA FT% 3PA AST PTS
0 1.8 5.1 0.923 4.1 0.6 5.3
1 0.4 1.8 0.700 1.5 0.8 1.7
2 1.1 3.2 0.778 2.2 1.9 3.2
3 6.0 10.1 0.500 0.0 1.6 13.9
4 3.4 5.9 0.735 0.2 2.2 8.9
... ... ... ... ... ... ...
703 4.0 7.0 0.778 0.0 0.8 11.5
704 3.1 5.6 0.705 0.0 0.9 7.8
705 3.6 6.4 0.802 0.0 1.1 8.9
706 3.4 5.8 0.864 0.0 0.8 8.5
707 3.8 7.2 0.733 0.0 1.5 9.4

708 rows × 6 columns

In [31]:
model = LinearRegression()
model.fit(nba_small[['FG', 'FGA', 'FT%', '3PA', 'AST']], nba_small['PTS'])
Out[31]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [32]:
rmse = lambda y, yhat: np.sqrt(np.mean((y - yhat)**2))
In [33]:
rmse(model.predict(nba_small[['FG', 'FGA', 'FT%', '3PA', 'AST']]), nba_small['PTS'])
Out[33]:
0.6207008204493863
In [34]:
model.coef_
Out[34]:
array([2.44566707, 0.03633589, 0.50454007, 0.28367305, 0.04144433])
In [35]:
nba_small.corr()
Out[35]:
FG FGA FT% 3PA AST PTS
FG 1.000000 0.973355 0.371598 0.600830 0.665761 0.990014
FGA 0.973355 1.000000 0.395902 0.725114 0.703093 0.980447
FT% 0.371598 0.395902 1.000000 0.377633 0.288057 0.401555
3PA 0.600830 0.725114 0.377633 1.000000 0.480880 0.666673
AST 0.665761 0.703093 0.288057 0.480880 1.000000 0.676022
PTS 0.990014 0.980447 0.401555 0.666673 0.676022 1.000000

In Lecture 21, we saw that multicollinearity is present here.

Something to note: just because multicollinearity is present doesn't mean that the predictions our model makes are inaccurate. Let's look at the test RMSE of two different models – one that uses just FGA, and one that uses both FG and FGA.

In [36]:
from sklearn.model_selection import train_test_split
In [37]:
train, test = train_test_split(nba_small, test_size = 0.2)

Fitting a model that just uses FGA:

In [38]:
model_1_feature = LinearRegression()
model_1_feature.fit(train[['FGA']], train['PTS'])
Out[38]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [39]:
model_1_feature.coef_
Out[39]:
array([1.31276151])
In [40]:
test_rmse_1_feature = rmse(model_1_feature.predict(test[['FGA']]), test['PTS'])
test_rmse_1_feature
Out[40]:
1.10102632702208

Fitting a model that uses FG, FGA:

In [41]:
model_2_features = LinearRegression()
model_2_features.fit(train[['FG', 'FGA']], train['PTS'])
Out[41]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [42]:
model_2_features.coef_
Out[42]:
array([1.85829074, 0.4365383 ])
In [43]:
test_rmse_2_features = rmse(model_2_features.predict(test[['FG', 'FGA']]), test['PTS'])
test_rmse_2_features
Out[43]:
0.6698015078633595

The model with multicollinearity had a lower testing RMSE!

But that was an aside. Multicollinearity is more of an issue when we care about interpreting our model's coefficients.

One solution is to drop features that are highly correlated (a quick sketch of this is below). Another solution is to use PCA, which we turn to in the next section.
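For the first approach, here is a minimal sketch (an illustration under an assumed cutoff of 0.9, not code from the lecture) that uses the correlation matrix to flag one feature out of each highly correlated pair of predictors:

# Absolute correlations between the predictors (PTS is the response, so drop it).
corr = nba_small.drop(columns=['PTS']).corr().abs()

# Keep only the upper triangle so each pair of features is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flag any feature correlated above 0.9 with an earlier feature.
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
to_drop  # expect ['FGA'] here, since FG and FGA have correlation ~0.97 in the table above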

PCA for Modeling

In [44]:
linear_model = LinearRegression()
linear_model.fit(train[['FG', 'FGA', 'FT%', '3PA', 'AST']], train['PTS'])
Out[44]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [45]:
linear_model.coef_
Out[45]:
array([2.43758621, 0.04459854, 0.4703348 , 0.28488127, 0.03486817])
In [46]:
rmse(linear_model.predict(test[['FG', 'FGA', 'FT%', '3PA', 'AST']]), test['PTS'])
Out[46]:
0.5685824971991201

Let's import PCA from sklearn.decomposition. We never really discussed this in class, but it works much like the other scikit-learn tools you've used this semester.

In [47]:
from sklearn.decomposition import PCA
In [48]:
pca_model = PCA(n_components = 4)
pca_model.fit(train[['FG', 'FGA', 'FT%', '3PA', 'AST']])
Out[48]:
PCA(copy=True, iterated_power='auto', n_components=4, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)
In [49]:
pcs = pca_model.transform(train[['FG', 'FGA', 'FT%', '3PA', 'AST']])
pcs
Out[49]:
array([[-8.18977831,  0.08870261,  0.02447443, -0.07567482],
       [-5.29775328,  0.06626676, -0.33739353,  0.14158431],
       [-3.95641467, -0.75235497,  1.20789685, -0.06102934],
       ...,
       [-4.84700163, -0.32930193,  0.08940634,  0.48581747],
       [ 0.55696575, -0.89083483, -0.49149691, -0.05097942],
       [ 1.081038  ,  2.21036665, -0.79411442,  0.31112771]])
In [50]:
pca_model.components_
Out[50]:
array([[ 0.39175754,  0.83585622,  0.0178255 ,  0.30772277,  0.22991164],
       [ 0.29980558,  0.13643796, -0.01445766, -0.91753593,  0.22230627],
       [-0.22490819, -0.20084309,  0.00179777,  0.12561255,  0.9451437 ],
       [-0.83930299,  0.4922353 , -0.04535856, -0.21639057, -0.06627684]])

Let's compare this to the output of the SVD. The rows of $V^T$ should match pca_model.components_, and the columns of $U \Sigma$ should match the transformed data. Each principal component is only determined up to a sign flip, which is why the fourth component comes out negated here.

In [51]:
D = train[['FG', 'FGA', 'FT%', '3PA', 'AST']]
X = D - np.mean(D, axis = 0)
u, s, vt = np.linalg.svd(X, full_matrices = False)
In [52]:
vt[:4]
Out[52]:
array([[ 0.39175754,  0.83585622,  0.0178255 ,  0.30772277,  0.22991164],
       [ 0.29980558,  0.13643796, -0.01445766, -0.91753593,  0.22230627],
       [-0.22490819, -0.20084309,  0.00179777,  0.12561255,  0.9451437 ],
       [ 0.83930299, -0.4922353 ,  0.04535856,  0.21639057,  0.06627684]])
In [53]:
(u * s)[:, :4]
Out[53]:
array([[-8.18977831,  0.08870261,  0.02447443,  0.07567482],
       [-5.29775328,  0.06626676, -0.33739353, -0.14158431],
       [-3.95641467, -0.75235497,  1.20789685,  0.06102934],
       ...,
       [-4.84700163, -0.32930193,  0.08940634, -0.48581747],
       [ 0.55696575, -0.89083483, -0.49149691,  0.05097942],
       [ 1.081038  ,  2.21036665, -0.79411442, -0.31112771]])
In [54]:
pc1 = pcs[:, 0]
pc2 = pcs[:, 1]
pc3 = pcs[:, 2]
pc4 = pcs[:, 3]

We can fit a linear model using these principal components as our features!

In [55]:
train['pc1'] = pc1
train['pc2'] = pc2
train['pc3'] = pc3
train['pc4'] = pc4
(Each of the four assignments above raises a pandas SettingWithCopyWarning – train is a slice of another DataFrame, and pandas suggests assigning with .loc instead; full warning output elided.)
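Aside: a minimal sketch of one way to avoid these warnings (a stylistic suggestion, not part of the original notebook) is to take an explicit copy before adding the new columns; the same applies to test below.

# Work on an explicit copy so pandas does not warn about writing to a slice.
train = train.copy()
for i in range(4):
    train[f'pc{i + 1}'] = pcs[:, i]  # pc1 ... pc4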
In [56]:
train.head()
Out[56]:
FG FGA FT% 3PA AST PTS pc1 pc2 pc3 pc4
302 0.0 0.0 0.000 0.0 0.0 0.0 -8.189778 0.088703 0.024474 -0.075675
101 1.0 2.6 0.667 0.8 0.3 2.6 -5.297753 0.066267 -0.337394 0.141584
2 1.1 3.2 0.778 2.2 1.9 3.2 -3.956415 -0.752355 1.207897 -0.061029
228 1.2 3.3 0.786 2.2 0.6 3.6 -4.132396 -0.997844 -0.063351 -0.009939
41 4.3 10.7 0.770 4.6 2.9 11.5 4.534435 -0.749357 0.228467 0.359715
In [57]:
linear_model_pcs = LinearRegression()
linear_model_pcs.fit(train[['pc1', 'pc2', 'pc3', 'pc4']], train['PTS'])
Out[57]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [58]:
linear_model_pcs.coef_
Out[58]:
array([ 1.09628576,  0.47644956, -0.48760475, -2.1092107 ])

Note that we are using the pca_model that was fit on the training data to transform the test data, so the test points are projected onto the same principal components.

In [59]:
pcs_test = pca_model.transform(test[['FG', 'FGA', 'FT%', '3PA', 'AST']])
pcs_test[:5]
Out[59]:
array([[ 1.97608652,  3.38486539, -0.68089399, -0.06083578],
       [ 1.61391651, -0.41344573, -0.77032441, -0.26199467],
       [ 2.68573153,  3.51461826, -0.37569613, -0.36744849],
       [-6.53905121,  0.45393693,  0.38958666,  0.01357322],
       [-1.50806343, -0.2737614 ,  2.50259516, -0.16586052]])
In [60]:
test['pc1'] = pcs_test[:, 0]
test['pc2'] = pcs_test[:, 1]
test['pc3'] = pcs_test[:, 2]
test['pc4'] = pcs_test[:, 3]
(The same four SettingWithCopyWarnings are raised here for the assignments to test; output elided.)
In [61]:
rmse(linear_model_pcs.predict(test[['pc1', 'pc2', 'pc3', 'pc4']]), test['PTS'])
Out[61]:
0.577268936115324
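As a final aside (not part of the original lecture), the fit-PCA-on-train / transform-test workflow above can also be written as a scikit-learn Pipeline, which keeps both steps fitted on the training data only:

from sklearn.pipeline import make_pipeline

features = ['FG', 'FGA', 'FT%', '3PA', 'AST']

# PCA is fit on the training features; LinearRegression is fit on the resulting PCs.
pca_pipeline = make_pipeline(PCA(n_components=4), LinearRegression())
pca_pipeline.fit(train[features], train['PTS'])

# predict() applies the training-set PCA transform to the test features first.
rmse(pca_pipeline.predict(test[features]), test['PTS'])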