Notebook by Fernando Pérez, Suraj Rampure, John DeNero, Sam Lau, Ani Adhikari.
import numpy as np
import pandas as pd
import sklearn.linear_model as lm
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from tqdm.notebook import tnrange
plt.rcParams['figure.figsize'] = (4, 4)
plt.rcParams['figure.dpi'] = 150
plt.rcParams['lines.linewidth'] = 3
sns.set()
Let's say our population is finite and we know it: a uniform over the numbers 0 to 10,000 (inclusive). (Note: You would never need statistical inference if you knew the whole population; we're just creating a playground to try out techniques.)
population = np.arange(10001)
population
We might want to know the population mean. In this case, we do!
np.mean(population)
But if we only had a sample, then we might use the sample mean as an estimate (an educated guess) of the true mean.
sample_100 = np.random.choice(population, size=100, replace=False)
np.mean(sample_100)
In this case, the estimator is the function np.mean and the population parameter is 5000. The estimate is close, but it's wrong.
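To make the vocabulary concrete, here's a quick sketch of the estimation error for this one sample (something we can only compute because we happen to know the full population):
# Estimation error of this particular sample mean. The parameter is the true
# population mean; the estimate is the sample mean from our sample of 100.
parameter = np.mean(population)
estimate = np.mean(sample_100)
estimate - parameter   # varies from sample to sample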
Here's an impractical but effective method for estimating the variance of an estimator $f$. (Note that this process does not involve the true population parameter directly; instead, we are trying to get a sense of how much our guesses vary from one another.)
def var_estimate(f, pop, m=4000, n=100):
    """Estimate the variance of estimator f by the empirical variance.
    f: A function of a sample
    pop: An array representing the whole population
    m, n: Use m samples of size n to estimate the variance
    """
    estimates = []
    for j in range(m):
        sample = np.random.choice(pop, size=n, replace=False)
        estimates.append(f(sample))
    estimates = np.array(estimates)
    plt.hist(estimates, bins=30)
    plt.xlim(4000, 6000)
    return np.var(estimates)
var_estimate(np.mean, population)
_ ** 0.5  # standard deviation of the estimator: the square root of the variance above
var_estimate(np.mean, population, n=400)
var_estimate(np.mean, population, n=1600)
This is not a new phenomenon. In Lecture 3, we saw that the variance of the sample mean decreases as our sample size increases.
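As a rough sketch of why: for a simple random sample, the variance of the sample mean is approximately $\frac{\text{Var}(\text{population})}{n}$ (ignoring the small finite-population correction for sampling without replacement), so quadrupling $n$ cuts the variance by roughly a factor of 4:
# Rough theoretical variances of the sample mean for our three sample sizes,
# using Var(sample mean) ≈ Var(population) / n. This is a sketch that ignores
# the small finite-population correction for sampling without replacement.
pop_var = np.var(population)
{n: pop_var / n for n in [100, 400, 1600]}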
If we know the variance of the sampling distribution and we know that the sampling distribution is approximately normal, then we know how far off a single estimate is likely to be. About 95% of estimates will be within 2 standard deviations of the mean, so for 95% of samples, the estimate will be off by the following (or less).
2 * np.sqrt(var_estimate(np.mean, population))
Unfortunately, estimating the variance this way required drawing thousands of samples from the population, which we usually can't do in practice.
Instead, we can estimate the variance using bootstrap resampling.
def bootstrap_var_estimate(f, sample, m=4000):
    """Estimate the variance of estimator f by the empirical variance.
    f: A function of a sample
    sample: An array representing a sample of size n
    m: Use m samples of size n to estimate the variance
    """
    estimates = []
    n = len(sample)
    for j in range(m):
        resample = np.random.choice(sample, size=n, replace=True)
        estimates.append(f(resample))
    estimates = np.array(estimates)
    plt.hist(estimates, bins=30)
    return np.mean((estimates - np.mean(estimates))**2)  # same as np.var(estimates)
bootstrap_var_estimate(np.mean, sample_100)
np.mean(sample_100)
sample_400 = np.random.choice(population, 400, replace=False)
bootstrap_var_estimate(np.mean, sample_400)
sample_1600 = np.random.choice(population, 1600, replace=False)
bootstrap_var_estimate(np.mean, sample_1600)
We can see that the variance estimated by bootstrapping our sample is close to the variance computed by repeatedly sampling from the population directly, but it is noticeably off each time.
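As a sanity check, here's a rough comparison (only possible here because we know the population) of the bootstrapped standard error for sample_100 against the classical $\sqrt{\text{Var}(\text{population})/n}$ formula:
# Compare the bootstrapped standard error of the mean to the classical formula.
# This is a sketch that assumes a simple random sample and uses our knowledge
# of the whole population, which we wouldn't have in practice.
boot_se = np.sqrt(bootstrap_var_estimate(np.mean, sample_100))
theory_se = np.sqrt(np.var(population) / len(sample_100))
boot_se, theory_se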
def ci(sample, estimator, confidence=95, m=1000):
    """Compute a confidence interval for an estimator.
    sample: A DataFrame or Series
    estimator: A function from a sample DataFrame to an estimate (number)
    """
    if isinstance(sample, np.ndarray):
        sample = pd.Series(sample)
    estimates = []
    n = sample.shape[0]
    for j in range(m):
        resample = sample.sample(n, replace=True)
        estimates.append(estimator(resample))
    estimates = np.array(estimates)
    slack = 100 - confidence
    lower = np.percentile(estimates, slack / 2)
    upper = np.percentile(estimates, 100 - slack / 2)
    return (lower, upper)
Here's one bootstrapped confidence interval for the sample mean.
np.mean(sample_100)
ci(sample_100, np.mean)
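For comparison, here's a rough normal-approximation interval built from the bootstrapped standard error; assuming the sampling distribution is roughly normal, it should roughly agree with the percentile interval above:
# A crude "sample mean plus or minus 2 bootstrap standard errors" interval.
# This sketch assumes the sampling distribution of the mean is roughly normal.
center = np.mean(sample_100)
se = np.sqrt(bootstrap_var_estimate(np.mean, sample_100))
(center - 2 * se, center + 2 * se)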
def bootstrap_dist(sample, estimator, m=10000):
    """Plot a histogram of m bootstrapped estimates computed from sample."""
    if isinstance(sample, np.ndarray):
        sample = pd.Series(sample)
    estimates = []
    n = sample.shape[0]
    for j in range(m):
        resample = sample.sample(n, replace=True)
        estimates.append(estimator(resample))
    plt.hist(estimates, bins=30)
bootstrap_dist(sample_100, np.mean)
To be crystal clear, the above histogram was computed by:
- resampling from our original sample sample_100, 10000 times
- computing the sample mean of each of those 10000 resamples
- calling plt.hist on the list of 10000 bootstrapped sample means

Let's create 100 95% confidence intervals for the sample mean. We'd expect roughly 95% of them to contain the true population parameter. In practice, we wouldn't be able to check (because if we knew the true population parameter, we wouldn't be doing any of this).
mean_ints = [ci(np.random.choice(population, 100), np.mean)
for _ in tnrange(100)]
You will note that many of these intervals contain the true population parameter 5000, but some do not. Let's print the first few:
mean_ints[:5]
We can visualize this by plotting each confidence interval as a line and then overlaying the true mean, which we happen to know in this case is 5000. We can then count how many of our CIs include 5000. The code below does that, while also color-coding the CIs that fail to include the true value in red (the true mean is the thick vertical orange line).
We expect the fraction of "OK CIs" (that is, confidence intervals that include the true mean) to be close to 0.95, but the exact value you get will change each time you rerun the above code:
true_mean = 5000
fig, ax = plt.subplots(figsize=(3,6))
nreps = len(mean_ints)
nok = 0
for i, bounds in enumerate(mean_ints):
    if bounds[0] <= true_mean <= bounds[1]:
        color = 'blue'
        nok += 1
    else:
        color = 'red'
    ax.plot(bounds, [i, i], '-o', color=color, lw=1, markersize=2)
ax.axvline(true_mean, color='orange', lw=3);
ax.set_title(f"Fraction of OK CIs: {nok/nreps}");
bootstrap_dist(sample_100, np.median)
ci(sample_100, np.median)
# True population median
np.median(population)
bootstrap_dist(sample_100, np.std)
ci(sample_100, np.std)
# True population standard deviation
np.std(population)
p99 = lambda a: np.percentile(a, 99)
p99(population)
bootstrap_dist(sample_100, p99)
ci(sample_100, p99)
p99_ints = [ci(np.random.choice(population, 100), p99)
for _ in tnrange(100)]
sum([v[0] <= p99(population) <= v[1] for v in p99_ints])
Extreme percentiles aren't estimated well with the bootstrap. Typically only around 60 to 70 of our 100 95% confidence intervals contain the true population parameter, well below the nominal 95%.
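One rough way to see why: a bootstrap resample can never contain a value larger than the largest value in our original sample, so the bootstrapped 99th percentile is capped by the sample maximum and tends to sit below the true 99th percentile:
# The bootstrap only reshuffles values already in the sample, so its estimates
# of extreme percentiles are bounded by the sample's own extremes.
sample_100.max(), p99(sample_100), p99(population)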
Let's revisit an old friend.
nba = pd.read_csv('nba18-19.csv')
nba
This table provides aggregate statistics for each player throughout the 2018-19 NBA season.
Let's use FG, FGA, FT%, 3PA, and AST to predict PTS. For reference:
- FG is the number of shots a player made.
- FGA is the number of shots a player took.
- FT% is the proportion of free throw shots a player took that they made.
- 3PA is the number of three-point shots a player took.
- AST is the number of assists the player had.

nba_small = nba[['FG', 'FGA', 'FT%', '3PA', 'AST', 'PTS']].fillna(0)
nba_small
Note that this is really just for the sake of example; the correlation between FG and PTS is so high that in practice we wouldn't need all of these other features.
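As a quick check of that claim:
# Correlation between field goals made and points scored.
nba_small['FG'].corr(nba_small['PTS'])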
reg = lm.LinearRegression()
reg.fit(nba_small.iloc[:,:-1], nba_small.iloc[:,-1])
The Multiple $R^2$ value is quite high:
reg.score(nba_small.iloc[:,:-1], nba_small.iloc[:,-1])
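As a sketch of what this score is measuring, we can recompute $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$ by hand from the model's predictions:
# Recompute R^2 manually; it should match reg.score above.
pred = reg.predict(nba_small.iloc[:, :-1])
actual = nba_small.iloc[:, -1]
1 - np.sum((actual - pred)**2) / np.sum((actual - np.mean(actual))**2)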
Let's look at the coefficients, though:
nba_small.columns
reg.coef_.astype(float)
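The coefficients are easier to read when paired with their feature names; a quick sketch:
# Label each coefficient with the column it belongs to.
pd.Series(reg.coef_, index=nba_small.columns[:-1])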
The coefficient on FGA, the number of shots a player takes, is very low. This means that FGA is not very useful in predicting PTS. That's strange, because we'd think that the number of shots a player takes would be very useful in such a case.

Let's look at a 95% confidence interval (created using the bootstrap percentile technique from above) for the coefficient on FGA.
def FGA_slope(t):
    """Fit the full model on table t and return the coefficient on FGA."""
    reg = lm.LinearRegression().fit(t.iloc[:, :-1], t.iloc[:, -1])
    return reg.coef_[1]  # FGA is the second feature column, so its coefficient is at index 1
ci(nba_small, FGA_slope)
We see 0 is in this interval. Hmmmm....
The issue is that FGA is highly correlated with one of the other features in our design matrix.
nba_small.corr()
The correlation between FGA and FG is very close to 1. This is a sign of multicollinearity, meaning that the individual coefficients have little meaning.
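As a rough side check (not needed for the argument above), the condition number of the design matrix is another common multicollinearity diagnostic; a very large value suggests the columns are nearly linearly dependent:
# Condition number of the feature matrix; large values indicate near-linear
# dependence among the columns. This is just an illustrative extra diagnostic.
np.linalg.cond(nba_small.iloc[:, :-1].values)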
Let's look at the resulting model that comes from only using FGA as a feature.
simple_model = lm.LinearRegression()
simple_model.fit(nba_small[['FGA']].values.reshape(-1, 1), nba_small['PTS'])
We dropped all of the other features, but the Multiple $R^2$ value is almost the same. The simpler, the better.
simple_model.score(nba_small[['FGA']].values.reshape(-1, 1), nba_small['PTS'])
The coefficient on FGA in this model, of course, is positive.
simple_model.coef_
def FGA_slope_simple_model(t):
    reg = lm.LinearRegression().fit(t[['FGA']].values.reshape(-1, 1), t['PTS'])
    return reg.coef_[0]
ci(nba_small, FGA_slope_simple_model)
0 is not in this interval, so we know that the slope for FGA in the linear model with PTS is significantly different from 0.