Lecture 21 – Data 100, Summer 2020

by Suraj Rampure

adapted from John DeNero, Sam Lau, Ani Adhikari

In [2]:
import numpy as np
import pandas as pd
import sklearn.linear_model as lm
from sklearn.datasets import load_boston

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from tqdm import tnrange

plt.rcParams['figure.figsize'] = (4, 4)
plt.rcParams['figure.dpi'] = 150
plt.rcParams['lines.linewidth'] = 3
sns.set()

Estimation and Bootstrapping

Sample Mean Estimator

Let's say our population is finite and we know it: a uniform over the numbers 0 to 10,000 (inclusive). (Note: You would never need statistical inference if you knew the whole population; we're just creating a playground to try out techniques.)

In [3]:
population = np.arange(10001)
In [4]:
population
Out[4]:
array([    0,     1,     2, ...,  9998,  9999, 10000])

We might want to know the population mean. In this case, we do!

In [5]:
np.mean(population)
Out[5]:
5000.0

But if we only had a sample, then we might use the sample mean as a reasonable estimate (guess) of the true mean.

In [6]:
sample_100 = np.random.choice(population, size=100, replace=False)
np.mean(sample_100)
Out[6]:
4988.85

In this case, the estimator is the function np.mean and the population parameter is 5000. The estimate is close, but it's wrong.

Sample variance estimator for the variance of the sample mean

Here's an impractical but effective method for estimating the variance of an estimator $f$. (Note that this process does not directly involve the true population parameter; we are instead trying to get a sense of how much our guesses vary from one another.)

In [7]:
def var_estimate(f, pop, m=4000, n=100):
    """Estimate the variance of estimator f by the empirical variance.
    
    f: A function of a sample
    pop: An array representing the whole population
    m, n: Use m samples of size n to estimate the variance
    """
    estimates = []
    for j in range(m):
        sample = np.random.choice(pop, size=n, replace=False)
        estimates.append(f(sample))
    estimates = np.array(estimates)
    plt.hist(estimates, bins=30)
    plt.xlim(4000, 6000)
    return np.var(estimates)

var_estimate(np.mean, population)
Out[7]:
79065.290596824
In [8]:
# Standard deviation of the sample mean: the square root of an estimated variance
# (this value was hard-coded from an earlier run of the cell above, which is why
# it differs slightly from Out[7])
83465.5906135476**0.5
Out[8]:
288.9041201048327
In [9]:
var_estimate(np.mean, population, n=400)
Out[9]:
19732.94685082151
In [10]:
var_estimate(np.mean, population, n=1600)
Out[10]:
4342.934550494306

This is not a new phenomenon. In Lecture 3, we saw that the variance of the sample mean decreases as our sample size increases.

If we know the variance of the sampling distribution and we know that the sampling distribution is approximately normal, then we know how far off a single estimate is likely to be. About 95% of estimates will be within 2 standard deviations of the mean, so for 95% of samples, the estimate will be off by the following (or less).

In [11]:
2 * np.sqrt(var_estimate(np.mean, population))
Out[11]:
568.5183703792626
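As a quick sanity check of the "95% within 2 standard deviations" claim, here is a minimal simulation sketch (not part of the original notebook; it assumes the population array defined above). It draws many samples of size 100, computes their means, and reports the fraction that land within 2 standard deviations of the true mean.

# Sketch: fraction of sample means within 2 SDs of the true mean (should be ~0.95)
sample_means = np.array([
    np.mean(np.random.choice(population, size=100, replace=False))
    for _ in range(4000)
])
sd = np.std(sample_means)
within_2_sd = np.mean(np.abs(sample_means - np.mean(population)) <= 2 * sd)
print(within_2_sd)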

Unfortunately, this way of estimating the variance required repeated sampling from the population, which we typically can't do in practice.

Bootstrap estimator for the variance of the sample mean

Instead, we can estimate the variance using bootstrap resampling: we treat our single sample as a stand-in for the population and repeatedly resample from it with replacement.

In [12]:
def bootstrap_var_estimate(f, sample, m=4000):
    """Estimate the variance of estimator f by the empirical variance.
    
    f: A function of a sample
    sample: An array representing a sample of size n
    m: Use m bootstrap resamples of size n to estimate the variance
    """
    estimates = []
    n = len(sample)
    for j in range(m):
        resample = np.random.choice(sample, size=n, replace=True)
        estimates.append(f(resample))
    estimates = np.array(estimates)
    plt.hist(estimates, bins=30)
    return np.mean((estimates - np.mean(estimates))**2) # same as np.var(estimates)

bootstrap_var_estimate(np.mean, sample_100)
Out[12]:
90889.55377866594
In [13]:
np.mean(sample_100)
Out[13]:
4988.85
In [14]:
sample_400 = np.random.choice(population, 400, replace=False)
bootstrap_var_estimate(np.mean, sample_400)
Out[14]:
20555.38664590344
In [15]:
sample_1600 = np.random.choice(population, 1600, replace=False)
bootstrap_var_estimate(np.mean, sample_1600)
Out[15]:
5154.912524485698

We can see that the variance estimated by bootstrapping our sample is close to the variance computed by sampling directly from the population, though each bootstrap estimate is still a good amount off.
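To get a sense of how much these bootstrap variance estimates themselves fluctuate, here is a minimal sketch (not part of the original notebook; it assumes the population array from above). It repeats the whole procedure for several fresh samples of size 100 and prints each bootstrap estimate of the variance of the sample mean.

# Sketch: the bootstrap variance estimate varies from sample to sample
for _ in range(5):
    s = np.random.choice(population, size=100, replace=False)
    boot_means = np.array([
        np.mean(np.random.choice(s, size=len(s), replace=True))
        for _ in range(4000)
    ])
    print(np.var(boot_means))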

Bootstrap confidence interval

In [16]:
def ci(sample, estimator, confidence=95, m=1000):
    """Compute a confidence interval for an estimator.
    
    sample: A DataFrame or Series
    estimator: A function from a sample DataFrame to an estimate (number)
    """
    if isinstance(sample, np.ndarray):
        sample = pd.Series(sample)
    estimates = []
    n = sample.shape[0]
    for j in range(m):
        resample = sample.sample(n, replace=True)
        estimates.append(estimator(resample))
    estimates = np.array(estimates)
    slack = 100 - confidence
    lower = np.percentile(estimates, slack/2)
    upper = np.percentile(estimates, 100 - slack/2)
    return (lower, upper)

Here's one bootstrapped confidence interval for the sample mean.

In [17]:
ci(sample_100, np.mean)
Out[17]:
(4406.268, 5552.359999999999)
In [18]:
def bootstrap_dist(sample, estimator, m=10000):
    if isinstance(sample, np.ndarray):
        sample = pd.Series(sample)
    estimates = []
    n = sample.shape[0]
    for j in range(m):
        resample = sample.sample(n, replace=True)
        estimates.append(estimator(resample))
    plt.hist(estimates, bins=30)
    
bootstrap_dist(sample_100, np.mean)

To be crystal clear, the above histogram was computed by:

  • resampling from our original sample sample_100, 10000 times
  • each of those 10000 times, computing the sample mean
  • calling plt.hist on the list of 10000 bootstrapped sample means

Let's create 100 95% confidence intervals for the sample mean. We'd expect roughly 95% of them to contain the true population parameter. In practice, we wouldn't be able to check (because if we knew the true population parameter, we wouldn't be doing any of this).

In [19]:
mean_ints = [ci(np.random.choice(population, 100), np.mean)
             for _ in tnrange(100)]

You will note that many of these intervals contain the true population parameter 5000, but some do not.

In [20]:
mean_ints
Out[20]:
[(4571.893249999999, 5733.399),
 (4263.49625, 5365.62925),
 (3869.1965, 4983.3147500000005),
 (4636.598499999999, 5789.134249999999),
 (4327.271500000001, 5478.137),
 (4270.91675, 5393.41925),
 (4403.2655, 5445.7115),
 (4418.941250000001, 5526.30925),
 (4354.357, 5476.940500000001),
 (5063.61875, 6032.2474999999995),
 (4803.383, 5858.625000000001),
 (4296.8145, 5446.532),
 (4192.855, 5296.3805),
 (4335.627, 5462.401500000001),
 (4184.24975, 5250.18925),
 (4598.960000000001, 5736.44675),
 (4058.00225, 5164.8285000000005),
 (5019.01475, 6015.69075),
 (4364.94175, 5467.209),
 (4283.1135, 5489.525499999999),
 (4948.093, 6083.6665),
 (4679.0395, 5817.77825),
 (4658.19375, 5815.32225),
 (4646.28725, 5732.5135),
 (3655.92025, 4825.71875),
 (4687.769, 5878.094499999999),
 (4357.662, 5486.053),
 (4379.666499999999, 5561.24575),
 (4653.070750000001, 5665.556),
 (4307.931750000001, 5350.811),
 (4500.639999999999, 5609.02275),
 (4269.8730000000005, 5371.896250000001),
 (3987.70775, 5058.6849999999995),
 (4182.382, 5257.4794999999995),
 (4324.41, 5510.258499999999),
 (4268.9085, 5403.333250000001),
 (4397.6905, 5546.9085000000005),
 (4150.45175, 5255.7925),
 (4016.81875, 5220.2455),
 (4362.8952500000005, 5483.636),
 (4685.93325, 5886.9585),
 (4640.4535000000005, 5743.109),
 (4690.688, 5825.1205),
 (4893.824499999999, 5912.40175),
 (4220.393, 5388.75325),
 (4665.9375, 5868.964),
 (4484.740750000001, 5655.769),
 (4162.3315, 5355.2425),
 (4447.148, 5617.661749999999),
 (4527.9555, 5752.386750000001),
 (4297.00275, 5493.032749999999),
 (4313.653249999999, 5387.444750000001),
 (3905.3485, 5048.1582499999995),
 (3955.8115000000003, 5047.5777499999995),
 (4460.95225, 5610.37525),
 (4789.0332499999995, 5990.9580000000005),
 (4411.122, 5562.979),
 (4491.168000000001, 5562.275250000001),
 (4782.2697499999995, 5829.005),
 (4628.566, 5768.233749999999),
 (4569.21525, 5772.759499999999),
 (4105.1925, 5139.1955),
 (4824.986, 5900.383000000001),
 (3751.494, 4850.23525),
 (4554.333, 5693.985),
 (4264.67025, 5344.06875),
 (4291.5782500000005, 5424.49025),
 (4127.529500000001, 5294.964),
 (4350.5777499999995, 5468.468),
 (3492.59575, 4629.16825),
 (4476.262500000001, 5583.739999999999),
 (5004.25, 6055.1630000000005),
 (4805.853499999999, 5950.331749999999),
 (4430.9825, 5541.9887499999995),
 (4300.3685000000005, 5340.44375),
 (4197.187, 5299.85175),
 (4130.70175, 5292.33875),
 (4982.0019999999995, 6039.00325),
 (4618.6195, 5824.835499999999),
 (4441.14125, 5614.757250000001),
 (4759.8625, 5837.5867499999995),
 (4809.94625, 5966.3595),
 (4364.04425, 5534.9505),
 (4572.20175, 5588.756249999999),
 (4609.5485, 5667.429),
 (4734.5075, 5796.27975),
 (4340.797500000001, 5508.88475),
 (4294.96175, 5475.2164999999995),
 (4555.205499999999, 5684.9575),
 (3886.4757499999996, 4976.6295),
 (4105.98375, 5188.9887499999995),
 (4642.42925, 5806.6357499999995),
 (4421.106000000001, 5464.462),
 (4520.968250000001, 5828.00675),
 (4439.5419999999995, 5642.151749999999),
 (4329.157, 5368.13),
 (4800.0112500000005, 5861.218749999999),
 (4273.5435, 5410.737),
 (4617.12925, 5714.289),
 (4427.192500000001, 5484.4285)]

Each time you run the above simulation, the count printed below may differ slightly. It is the number of intervals from above that contain the true population parameter, and it should be close to 95.

We have also visualized the left and right endpoints of each of the confidence intervals.

  • If the left (blue) endpoint is greater than 5000 or the right (orange) endpoint is less than 5000, that interval does not contain the true population parameter.
In [21]:
plt.hist([v[0] for v in mean_ints], bins=30);
plt.hist([v[1] for v in mean_ints], bins=30);
sum([v[0] <= 5000 <= v[1] for v in mean_ints])
Out[21]:
92

Bootstrap confidence intervals for other population parameters

Median

In [22]:
bootstrap_dist(sample_100, np.median)
In [23]:
ci(sample_100, np.median)
Out[23]:
(4123.5, 6001.0)
In [24]:
# True population median
np.median(population)
Out[24]:
5000.0

Standard Deviation

In [25]:
bootstrap_dist(sample_100, np.std)
In [26]:
ci(sample_100, np.std)
Out[26]:
(2690.4671485985996, 3204.7444298364026)
In [27]:
# True population standard deviation
np.std(population)
Out[27]:
2887.0400066504103

99th Percentile

In [28]:
p99 = lambda a: np.percentile(a, 99)
p99(population)
Out[28]:
9900.0
In [29]:
bootstrap_dist(sample_100, p99)
In [30]:
ci(sample_100, p99)
Out[30]:
(9490.34, 9960.0)
In [31]:
p99_ints = [ci(np.random.choice(population, 100), p99)
            for _ in tnrange(100)]

In [32]:
sum([v[0] <= p99(population) <= v[1] for v in p99_ints])
Out[32]:
64

Extreme percentiles aren't estimated well with the bootstrap. Only about 60-70 of our 100 "95%" confidence intervals contained the true population parameter.
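One way to see why, sketched below (assuming the sample_100 array and the p99 helper defined above): a bootstrap resample can only contain values that already appear in the original sample, so the resampled 99th percentile can never exceed the sample maximum. If the sample happened to miss the upper tail of the population, every resample misses it too.

# Sketch: the bootstrapped 99th percentile is capped by the sample maximum
resampled_p99s = np.array([
    p99(np.random.choice(sample_100, size=len(sample_100), replace=True))
    for _ in range(10000)
])
print(resampled_p99s.max(), sample_100.max())  # the first can never exceed the second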

Estimating Parameters in Linear Regression

Let's revisit an old friend.

In [56]:
nba = pd.read_csv('nba18-19.csv')
In [57]:
nba
Out[57]:
Rk Player Pos Age Tm G GS MP FG FGA ... FT% ORB DRB TRB AST STL BLK TOV PF PTS
0 1 Álex Abrines\abrinal01 SG 25 OKC 31 2 19.0 1.8 5.1 ... 0.923 0.2 1.4 1.5 0.6 0.5 0.2 0.5 1.7 5.3
1 2 Quincy Acy\acyqu01 PF 28 PHO 10 0 12.3 0.4 1.8 ... 0.700 0.3 2.2 2.5 0.8 0.1 0.4 0.4 2.4 1.7
2 3 Jaylen Adams\adamsja01 PG 22 ATL 34 1 12.6 1.1 3.2 ... 0.778 0.3 1.4 1.8 1.9 0.4 0.1 0.8 1.3 3.2
3 4 Steven Adams\adamsst01 C 25 OKC 80 80 33.4 6.0 10.1 ... 0.500 4.9 4.6 9.5 1.6 1.5 1.0 1.7 2.6 13.9
4 5 Bam Adebayo\adebaba01 C 21 MIA 82 28 23.3 3.4 5.9 ... 0.735 2.0 5.3 7.3 2.2 0.9 0.8 1.5 2.5 8.9
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
703 528 Tyler Zeller\zellety01 C 29 MEM 4 1 20.5 4.0 7.0 ... 0.778 2.3 2.3 4.5 0.8 0.3 0.8 1.0 4.0 11.5
704 529 Ante Žižić\zizican01 C 22 CLE 59 25 18.3 3.1 5.6 ... 0.705 1.8 3.6 5.4 0.9 0.2 0.4 1.0 1.9 7.8
705 530 Ivica Zubac\zubaciv01 C 21 TOT 59 37 17.6 3.6 6.4 ... 0.802 1.9 4.2 6.1 1.1 0.2 0.9 1.2 2.3 8.9
706 530 Ivica Zubac\zubaciv01 C 21 LAL 33 12 15.6 3.4 5.8 ... 0.864 1.6 3.3 4.9 0.8 0.1 0.8 1.0 2.2 8.5
707 530 Ivica Zubac\zubaciv01 C 21 LAC 26 25 20.2 3.8 7.2 ... 0.733 2.3 5.3 7.7 1.5 0.4 0.9 1.4 2.5 9.4

708 rows × 30 columns

This table provides aggregate statistics for each player throughout the 2018-19 NBA season.

Let's use FG, FGA, FT%, 3PA, and AST to predict PTS. For reference:

  • FG is the number of shots a player made.
  • FGA is the number of shots a player took.
  • FT% is the proportion of free throw shots a player took that they made.
  • 3PA is the number of three-point shots a player took.
  • AST is the number of assists the player had.
In [59]:
nba_small = nba[['FG', 'FGA', 'FT%', '3PA', 'AST', 'PTS']].fillna(0)
nba_small
Out[59]:
FG FGA FT% 3PA AST PTS
0 1.8 5.1 0.923 4.1 0.6 5.3
1 0.4 1.8 0.700 1.5 0.8 1.7
2 1.1 3.2 0.778 2.2 1.9 3.2
3 6.0 10.1 0.500 0.0 1.6 13.9
4 3.4 5.9 0.735 0.2 2.2 8.9
... ... ... ... ... ... ...
703 4.0 7.0 0.778 0.0 0.8 11.5
704 3.1 5.6 0.705 0.0 0.9 7.8
705 3.6 6.4 0.802 0.0 1.1 8.9
706 3.4 5.8 0.864 0.0 0.8 8.5
707 3.8 7.2 0.733 0.0 1.5 9.4

708 rows × 6 columns

Note that this is really just for the sake of example; the correlation between FG and PTS is so high that, in practice, we wouldn't need all of these other features.
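A quick way to check that claim (a hypothetical cell, not run in the original notebook) is to compute the correlation directly; the full correlation matrix appears again later in this lecture.

# Correlation between FG and PTS in our design matrix
print(nba_small['FG'].corr(nba_small['PTS']))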

In [60]:
reg = lm.LinearRegression()
reg.fit(nba_small.iloc[:,:-1], nba_small.iloc[:,-1])
Out[60]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

The Multiple $R^2$ value is quite high:

In [61]:
reg.score(nba_small.iloc[:,:-1], nba_small.iloc[:,-1])
Out[61]:
0.9886417979129902

Let's look at the coefficients, though:

In [63]:
nba_small.columns
Out[63]:
Index(['FG', 'FGA', 'FT%', '3PA', 'AST', 'PTS'], dtype='object')
In [62]:
reg.coef_.astype(float)
Out[62]:
array([2.44566707, 0.03633589, 0.50454007, 0.28367305, 0.04144433])

The coefficient on FGA, the number of shots a player takes, is very low. This suggests that FGA is not very useful in predicting PTS. That's strange, because we'd expect the number of shots a player takes to be very useful in predicting how many points they score.

Let's look at a 95% confidence interval (created using the bootstrap percentile technique from above) for the coefficient on FGA.

In [64]:
def FGA_slope(t):
    reg = lm.LinearRegression().fit(t.iloc[:,:-1], t.iloc[:,-1])
    return reg.coef_[1]

ci(nba_small, FGA_slope)
Out[64]:
(-0.06417468441355369, 0.13449070182831074)

We see 0 is in this interval. Hmmmm....

Multicollinearity

The issue is that FGA is highly correlated with one of the other features in our design matrix.

In [65]:
nba_small.corr()
Out[65]:
FG FGA FT% 3PA AST PTS
FG 1.000000 0.973355 0.371598 0.600830 0.665761 0.990014
FGA 0.973355 1.000000 0.395902 0.725114 0.703093 0.980447
FT% 0.371598 0.395902 1.000000 0.377633 0.288057 0.401555
3PA 0.600830 0.725114 0.377633 1.000000 0.480880 0.666673
AST 0.665761 0.703093 0.288057 0.480880 1.000000 0.676022
PTS 0.990014 0.980447 0.401555 0.666673 0.676022 1.000000

The correlation between FGA and FG is very close to 1. This is a sign of multicollinearity: the individual coefficients are hard to interpret on their own.
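To make that instability concrete, here is a minimal sketch (not part of the original notebook; it assumes nba_small and lm from above) that refits the full model on a handful of bootstrap resamples of the rows. Because FG and FGA are nearly collinear, their individual coefficients can swing around from resample to resample, even though each fitted model predicts PTS about equally well.

# Sketch: coefficients on FG and FGA across bootstrap resamples of nba_small
for _ in range(5):
    resample = nba_small.sample(len(nba_small), replace=True)
    r = lm.LinearRegression().fit(resample.iloc[:, :-1], resample.iloc[:, -1])
    print(np.round(r.coef_[:2], 3))  # [FG coefficient, FGA coefficient]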

Let's look at the resulting model that comes from only using FGA as a feature.

In [66]:
simple_model = lm.LinearRegression()
simple_model.fit(nba_small[['FGA']].values.reshape(-1, 1), nba_small['PTS'])
Out[66]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

We dropped all of the other features, but the Multiple $R^2$ value below is almost the same. The simpler, the better.

In [67]:
simple_model.score(nba_small[['FGA']].values.reshape(-1, 1), nba_small['PTS'])
Out[67]:
0.9612756271235542

The coefficient on FGA in this model, of course, is positive.

In [68]:
simple_model.coef_
Out[68]:
array([1.29982787])
In [69]:
def FGA_slope_simple_model(t):
    reg = lm.LinearRegression().fit(t[['FGA']].values.reshape(-1, 1), t['PTS'])
    return reg.coef_[0]

ci(nba_small, FGA_slope_simple_model)
Out[69]:
(1.2722148050968203, 1.32856860072696)

0 is not in this interval, so the slope on FGA in this simple model (predicting PTS from FGA alone) is significantly different from 0.
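As a final check (not in the original notebook), we could also plot the bootstrap distribution of this slope using the bootstrap_dist helper defined earlier; the resulting histogram should sit well away from 0, consistent with the interval above.

# Sketch (assumes bootstrap_dist, nba_small, and FGA_slope_simple_model from above)
bootstrap_dist(nba_small, FGA_slope_simple_model, m=1000)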