Lecture 22 – Data 100, Summer 2021

by Suraj Rampure

adapted from John DeNero, Sam Lau, Ani Adhikari

Estimation and Bootstrapping

Sample Mean Estimator

Let's say our population is finite and we know it: a uniform over the numbers 0 to 10,000 (inclusive). (Note: You would never need statistical inference if you knew the whole population; we're just creating a playground to try out techniques.)
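As a minimal sketch (the variable name `population` is ours, chosen for illustration), we can construct this playground population with NumPy:

```python
import numpy as np

# The full population: every integer from 0 to 10,000, inclusive.
population = np.arange(10001)

# Since we know the entire population, we can compute the true mean exactly.
population.mean()   # 5000.0
```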

We might want to know the population mean. In this case, we do!

But if we only had a sample, we might use the sample mean as a reasonable estimate (guess) of the true mean.

In this case, the estimator is the function np.mean and the population parameter is 5000. The estimate is close, but it's wrong.
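A sketch of this estimation step, continuing with the hypothetical population from above (the sample size and seed are arbitrary choices of ours):

```python
import numpy as np

population = np.arange(10001)          # the known population from above
rng = np.random.default_rng(42)

# Draw a simple random sample (without replacement) from the population.
sample = rng.choice(population, size=100, replace=False)

# The estimator is np.mean; the estimate is its value on our sample.
np.mean(sample)    # close to, but not exactly, 5000
```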

Sample variance estimator for the variance of the sample mean

Here's an impractical but effective method for estimating the variance of an estimator $f$. (Note that this process is not directly related to the true population parameter; we are instead trying to get a sense of how much our guesses vary from one another.)
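Here's a minimal sketch of that brute-force procedure (the sample size and repetition count are arbitrary choices of ours, not from the original notebook):

```python
import numpy as np

population = np.arange(10001)
rng = np.random.default_rng(42)

# Impractical: draw many *fresh* samples from the population and apply
# the estimator (here, np.mean) to each one.
estimates = np.array([
    np.mean(rng.choice(population, size=100, replace=False))
    for _ in range(1000)
])

# The spread of these estimates is our estimate of the variance
# of the sample mean.
estimates.var()
```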

This is not a new phenomenon. In Lecture 3, we saw that the variance of the sample mean decreases as our sample size increases.

If we know the variance of the sampling distribution and we know that the sampling distribution is approximately normal, then we know how far off a single estimate is likely to be. About 95% of estimates will be within 2 standard deviations of the mean, so for 95% of samples, the estimate will be off by the following (or less).
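Continuing the sketch above, that typical error is roughly two standard deviations of the estimates:

```python
# ~95% of estimates fall within 2 standard deviations of their center,
# so for 95% of samples the estimate is off by at most about this much.
2 * estimates.std()
```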

Unfortunately, estimating the variance this way requires repeated sampling from the population, which we typically cannot do in practice.

Bootstrap estimator for the variance of the sample mean

Instead, we can estimate the variance using bootstrap resampling.
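A minimal sketch of the bootstrap version, which only touches the one sample we actually have (names and sizes are our own choices):

```python
import numpy as np

rng = np.random.default_rng(42)
population = np.arange(10001)
sample = rng.choice(population, size=100, replace=False)  # our one sample

# Bootstrap: resample from the *sample*, with replacement, with each
# resample the same size as the original, and apply the estimator to each.
bootstrap_means = np.array([
    np.mean(rng.choice(sample, size=len(sample), replace=True))
    for _ in range(1000)
])

# Our bootstrap estimate of the variance of the sample mean.
bootstrap_means.var()
```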

We can see that the estimated variance when bootstrapping our sample is close to the variance computed by directly sampling from the population. But it's noticeably off each time.

Bootstrap confidence interval

Here's one bootstrapped confidence interval for the sample mean.

To be crystal clear, the above histogram was computed by:

- resampling from our original sample, with replacement, with each resample the same size as the original sample;
- computing the mean of each resample; and
- repeating this many times and plotting the distribution of the resulting means.

The 95% confidence interval's endpoints are the 2.5th and 97.5th percentiles of these bootstrapped means.
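A minimal sketch of that procedure, reusing the hypothetical sample from earlier:

```python
import numpy as np

rng = np.random.default_rng(42)
sample = rng.choice(np.arange(10001), size=100, replace=False)

# Resample from the sample itself, with replacement, 1000 times.
bootstrap_means = np.array([
    np.mean(rng.choice(sample, size=len(sample), replace=True))
    for _ in range(1000)
])

# The 95% percentile interval: the middle 95% of the bootstrapped means.
left, right = np.percentile(bootstrap_means, [2.5, 97.5])
print(left, right)
```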

Let's create one hundred 95% confidence intervals for the sample mean, as sketched below. We'd expect roughly 95% of them to contain the true population parameter. In practice, we wouldn't be able to check (because if we knew the true population parameter, we wouldn't be doing any of this).
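A sketch of that simulation, under the same hypothetical setup as before (the helper name `mean_ci` is ours):

```python
import numpy as np

rng = np.random.default_rng(42)
population = np.arange(10001)

def mean_ci(sample, n_resamples=1000):
    """95% percentile bootstrap CI for the mean of `sample`."""
    means = [
        np.mean(rng.choice(sample, size=len(sample), replace=True))
        for _ in range(n_resamples)
    ]
    return np.percentile(means, [2.5, 97.5])

# One fresh sample -> one bootstrapped interval; repeat 100 times.
intervals = [
    mean_ci(rng.choice(population, size=100, replace=False))
    for _ in range(100)
]

# How many of the 100 intervals contain the true parameter, 5000?
sum(left <= 5000 <= right for left, right in intervals)
```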

You will note that many of these intervals contain the true population parameter, 5000, but some do not.

Each time you run the above simulation, you may get a slightly different result. The number printed below counts how many of the 100 intervals contain the true population parameter; it should be close to 95.

We have also visualized the left and right endpoints of each of the confidence intervals.

Bootstrap confidence intervals for other population parameters
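The same recipe works for any statistic; only the estimator changes. Below is a small helper, sketched with our own naming, that we'll reuse for each parameter that follows:

```python
import numpy as np

rng = np.random.default_rng(42)

def bootstrap_ci(sample, estimator, n_resamples=1000):
    """95% percentile bootstrap CI for `estimator` applied to `sample`."""
    stats = [
        estimator(rng.choice(sample, size=len(sample), replace=True))
        for _ in range(n_resamples)
    ]
    return np.percentile(stats, [2.5, 97.5])
```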

Median

Standard Deviation

99th Percentile
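For instance, applying the hypothetical helper above to a fresh sample (the true values are known exactly because we know the whole population):

```python
sample = rng.choice(np.arange(10001), size=100, replace=False)

print(bootstrap_ci(sample, np.median))                       # true value: 5000
print(bootstrap_ci(sample, np.std))                          # true value: ~2887
print(bootstrap_ci(sample, lambda s: np.percentile(s, 99)))  # true value: 9900
```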

Extreme percentiles aren't estimated well with the bootstrap: only about 60 to 70 of our one hundred 95% confidence intervals contained the true population parameter. Intuitively, a bootstrap resample can only contain values that appear in the original sample, so the far tail of the population is poorly represented.

Estimating Parameters in Linear Regression

Let's revisit an old friend.

This table provides aggregate statistics for each player throughout the 2018-19 NBA season.

Let's use FG, FGA, FT%, 3PA, and AST to predict PTS. For reference:

- FG is the number of (2-point and 3-point) field goals made per game.
- FGA is the number of (2-point and 3-point) field goal attempts per game.
- FT% is the proportion of free throw attempts that are successful.
- 3PA is the number of 3-point field goal attempts per game.
- AST is the number of assists per game.
- PTS is the number of points scored per game.

Note that this is really just for the sake of example; the correlation between FG and PTS is so high that in practice we wouldn't need all of these other features.
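Below is a minimal sketch of fitting this model with scikit-learn. The file name `nba18-19.csv` and the variable names are our assumptions for illustration; any table of 2018-19 per-player season stats with these columns would work.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

nba = pd.read_csv('nba18-19.csv')   # hypothetical file of 2018-19 player stats

features = ['FG', 'FGA', 'FT%', '3PA', 'AST']
X, y = nba[features], nba['PTS']

# Fit ordinary least squares with an intercept (sklearn's default).
model = LinearRegression()
model.fit(X, y)
```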

The Multiple $R^2$ value is quite high:
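Continuing the sketch above:

```python
model.score(X, y)   # R^2 of the five-feature model on the training data
```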

Let's look at the coefficients, though:
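Again continuing the sketch:

```python
pd.Series(model.coef_, index=features)   # one fitted coefficient per feature
```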

The coefficient on FGA, the number of shots a player takes, is very low. This suggests that FGA is not very useful in predicting PTS. That's strange, because we'd expect the number of shots a player takes to be very useful in predicting how many points they score.

Let's look at a 95% confidence interval (created using the bootstrap percentile technique from above) for the coefficient on FGA.
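Continuing the sketch, one way to bootstrap this interval is to resample the rows (players) of the table, refit the model each time, and take percentiles of the resulting FGA coefficients:

```python
import numpy as np

def fga_coef(df):
    """Fit the five-feature model on df and return the FGA coefficient."""
    fitted = LinearRegression().fit(df[features], df['PTS'])
    return fitted.coef_[features.index('FGA')]

# Resample players with replacement and refit, 1000 times.
coefs = [
    fga_coef(nba.sample(frac=1, replace=True, random_state=i))
    for i in range(1000)
]

np.percentile(coefs, [2.5, 97.5])   # 95% percentile interval for the FGA slope
```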

We see that 0 is in this interval. Hmmmm....

Multicollinearity

The issue is that FGA is highly correlated with one of the other features in our design matrix.
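We can check this directly with a correlation matrix (continuing with the hypothetical `nba` table):

```python
# Pairwise correlations among the features and the response.
nba[['FG', 'FGA', 'FT%', '3PA', 'AST', 'PTS']].corr()
```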

The correlation between FGA and FG is very close to 1. This is a sign of multicollinearity: when features are highly correlated, the individual coefficients carry little meaning on their own.

Let's look at the resulting model that comes from only using FGA as a feature.
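A sketch of the single-feature fit, reusing the names from above:

```python
X_simple = nba[['FGA']]
simple_model = LinearRegression().fit(X_simple, y)

print(simple_model.score(X_simple, y))   # Multiple R^2 for the FGA-only model
print(simple_model.coef_)                # the coefficient on FGA
```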

We dropped all of those other features, but the Multiple $R^2$ value is almost the same. All else equal, the simpler the model, the better.

The coefficient on FGA in this model, of course, is positive.
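The interval referenced below can be computed with the same row-resampling recipe as before (a sketch, reusing our hypothetical names):

```python
def simple_fga_coef(df):
    """Refit the FGA-only model on a row resample and return its slope."""
    return LinearRegression().fit(df[['FGA']], df['PTS']).coef_[0]

coefs_simple = [
    simple_fga_coef(nba.sample(frac=1, replace=True, random_state=i))
    for i in range(1000)
]

np.percentile(coefs_simple, [2.5, 97.5])   # 95% interval for the FGA slope
```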

0 is not in this interval, so we know that the slope for FGA in the linear model predicting PTS is significantly different from 0.