Monte Carlo methods and the Central Limit Theorem

Let's discuss how to use randomized sampling tools for estimating quantities of statistical interest. This idea encompasses a large class of specific tools, broadly known as Monte Carlo Methods. Their core principle is to use randomness to obtain approximate solutions to otherwise deterministic problems. Why?

  • They are often faster than computing an exact answer
  • They are often easier to implement

Key drawback: they may (with low probability) give the wrong answer.

Alternative: Las Vegas Algorithms. These are randomized algorithms that always give correct results (or explicitly indicate failure); only their running time is random.
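To make the contrast concrete, here is a small sketch (an added illustration, not part of the original notebook; the names `mc_pi` and `lv_find` are ours): a Monte Carlo estimate of $\pi$, which is only approximate, next to a Las Vegas-style search, which returns only answers it has verified.

import numpy as np

rng = np.random.default_rng(0)

# Monte Carlo: estimate pi by throwing random points at the unit square.
# The answer is approximate and, with low probability, can be noticeably off.
def mc_pi(n):
    xy = rng.uniform(size=(n, 2))
    return 4 * np.mean(xy[:, 0] ** 2 + xy[:, 1] ** 2 <= 1)

# Las Vegas: probe random positions until we find a value we can verify;
# any answer returned is correct, only the running time (or failure) is random.
def lv_find(arr, target, max_tries=10_000):
    for _ in range(max_tries):
        i = rng.integers(len(arr))
        if arr[i] == target:
            return i       # verified, hence always correct
    return None            # explicit failure

print(mc_pi(100_000))                          # close to 3.14159, not exact
print(lv_find(np.array([3, 1, 4, 1, 5]), 4))   # always returns index 2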

Monte Carlo

We will carry out some more detailed simulation studies to examine the expected value, standard deviation, and distribution of the sample average for samples of different sizes. In the process we will confirm our theoretical results and discover new ones.

These are examples of a simple but powerful approach to studying random processes through simulation. The technique that we have been using is called Monte Carlo simulation. Briefly, suppose that we want to study the behavior of a statistic $T(X_1, \ldots, X_n)$ which is a function of $X_1, \ldots, X_n$. (The sample mean, the median, and the Huber estimator are all examples.) Then we can study the behavior of $T$ under different conditions, such as a range of sample sizes, different population distributions, sampling with or without replacement, etc. We perform the Monte Carlo simulation as follows (a generic sketch in code appears after the list):

  • Take a combination of input values, e.g., sample size and population distribution, which we call a data generation model
  • Generate data from this model and calculate the statistic of interest
  • Repeat this process many times (typically thousands)
  • Repeat the above process for all combinations of input values (data generation models).
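
Here is a rough sketch of this recipe in code (an added illustration; the names `monte_carlo` and `generate_sample`, the uniform population, and the example sample sizes are ours, not part of the notebook):

import numpy as np

rng = np.random.default_rng(42)

def monte_carlo(statistic, generate_sample, n_reps=1_000):
    """Run one data generation model: draw data n_reps times and
    collect the statistic of interest each time."""
    return np.array([statistic(generate_sample()) for _ in range(n_reps)])

# Loop over data generation models: here, a few sample sizes for a
# hypothetical Uniform(0, 1) population, with the sample mean as T.
for n in [10, 100, 1_000]:
    means = monte_carlo(np.mean, lambda: rng.uniform(size=n))
    print(n, means.mean(), means.std())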

We will study the impact of the sample size on the probability distribution of $\bar{X}$, using samples drawn without replacement from a simulated population.

In [1]:
# A bit of setup first
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from scipy.stats import norm
In [2]:
plt.style.use("seaborn")  # note: this style is named "seaborn-v0_8" in matplotlib >= 3.6
plt.rcParams['figure.figsize'] = (10, 6)
sns.set_context('talk', font_scale=1.4)

Law of Large Numbers - demonstration

First, let's consider our population to be 50,000 uniformly distributed random numbers in [0, 1]:

In [3]:
pop = pd.Series(np.random.uniform(size=50_000))

The mean of this population is, as expected, very close to 1/2:

In [4]:
pop.mean()
Out[4]:
0.49970348630035877

But let's see what happens as we study this population via sampling. Let's set up a range of sample sizes that we want to study.

In [5]:
sample_sizes = np.logspace(1, 4, 50, dtype='int')
sample_sizes
Out[5]:
array([   10,    11,    13,    15,    17,    20,    23,    26,    30,
          35,    40,    47,    54,    62,    71,    82,    95,   109,
         126,   145,   167,   193,   222,   255,   294,   339,   390,
         449,   517,   596,   686,   790,   910,  1048,  1206,  1389,
        1599,  1842,  2120,  2442,  2811,  3237,  3727,  4291,  4941,
        5689,  6551,  7543,  8685, 10000])

Next we carry out one round of our Monte Carlo study. That is, we examine the sample average for all of our sample sizes, where we generate only one sample for each size.

In [6]:
np.random.seed(707)
pd.Series([
            pop.sample(s, replace=False).mean() 
            for s in sample_sizes
        ], index=sample_sizes).plot()
plt.xscale('log')
plt.xlabel('Sample Size (log scale)')
plt.ylabel('Mean');
#plt.savefig('lln.pdf');

It appears that the sample average gets closer to the population average as the sample size grows. Note that the x-axis is on log scale.

We repeat this process 100 times to get a sense of how quickly the sample average converges to the population mean. Note that we use transparency to help see where the bulk of the values lies.

In [7]:
for i in range(0,100):
    pd.Series([
            pop.sample(s, replace=False).mean() 
            for s in sample_sizes
        ], index=sample_sizes).plot(color='grey', alpha = 0.2)

plt.xscale('log')
plt.xlabel('Sample Size (log scale)')
plt.ylabel('Mean');
#plt.savefig('lln_many.pdf');

We have discovered the Law of Large Numbers: as the sample size increases, the sample average (whether from independent draws with replacement or from a simple random sample without replacement) converges to the population average.
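
In symbols (a standard statement of the result, given here for independent draws): if $X_1, \ldots, X_n$ are independent draws from a population with mean $\mu$, then

$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i \longrightarrow \mu \quad \text{(in probability) as } n \to \infty.$$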

Square-root law

We already know the Square-root Law: the standard error of the sample average shrinks by a factor of $1/\sqrt{n}$ as the sample size $n$ grows. We can confirm this with a simulation study. Again, we vary the sample size. This time we take 1000 replications for each sample size. That is, for a particular sample size, we obtain 1000 sample averages, and we use the standard deviation of these 1000 sample averages as an approximation to the SD of the sample average.
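
For reference, the standard formulas (stated here, not derived): when sampling with replacement,

$$\mathrm{SE}(\bar{X}) = \frac{\sigma}{\sqrt{n}},$$

and when sampling without replacement from a population of size $N$,

$$\mathrm{SE}(\bar{X}) = \frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}},$$

which is close to $\sigma/\sqrt{n}$ whenever $n$ is small relative to $N$.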

In [8]:
# This cell takes a while to run
sample_sizes = np.logspace(1, 4, 50, dtype='int')
sds = []

for s in sample_sizes:
    means = [
        pop.sample(s, replace = False).mean() 
        for _ in range(1000)]
    sds.append(np.std(means))    

Since we want to confirm the Square-root Law, we plot our findings along with our theory, i.e., $\sigma/\sqrt{n}$ (because we sample without replacement, this omits the finite-population correction shown above, which only matters here for the largest sample sizes). The two curves are essentially on top of each other.

In [9]:
plt.plot(sample_sizes, sds, label='SE for 1000 Simulated Means')
plt.plot(sample_sizes, pop.std()/np.sqrt(sample_sizes), 
        label='Population SD/sqrt(sample size)')

plt.xscale('log')
plt.xlabel('Sample Size (log scale)')
plt.ylabel('Standard Error of Mean')
plt.legend();
#plt.savefig('sqrt_law.pdf')

The Central Limit Theorem

Let's consider the problem of estimating the mean of a population of size $N$ by taking samples of size $n$ from this population. The Central Limit Theorem states that if $n$ is large in absolute terms and small relative to $N$ (if sampling without replacement), the probability distribution of the sample average becomes increasingly close to the normal curve with center at the population average and SD $= \sigma/\sqrt{n}$.
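
As a quick worked check (an added illustration, not from the original notebook): for a Uniform(0, 1) population, $\mu = 1/2$ and $\sigma = \sqrt{1/12} \approx 0.289$, so for samples of size $n = 100$ the CLT says $\bar{X}$ is approximately normal with SD $\approx 0.029$.

import numpy as np
from scipy.stats import norm

# CLT approximation for the mean of n = 100 Uniform(0, 1) draws
mu, sigma = 0.5, np.sqrt(1 / 12)
n = 100
se = sigma / np.sqrt(n)          # approximately 0.029

# Normal approximation to P(0.45 <= Xbar <= 0.55)
approx = norm.cdf(0.55, mu, se) - norm.cdf(0.45, mu, se)

# Compare to a direct simulation of many sample means
sims = np.random.uniform(size=(100_000, n)).mean(axis=1)
simulated = np.mean((sims >= 0.45) & (sims <= 0.55))

print(approx, simulated)         # both close to 0.92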

In [10]:
# The population distribution: a histogram of 1000 Uniform(0, 1) draws is roughly flat
a = np.random.uniform(size=1000)
plt.hist(a, bins=30);
In [11]:
# Averages of samples of size 2: each column of b is one sample of two uniforms
sample_size = 2
repetitions = 100
b = np.random.uniform(size=(sample_size, repetitions))
bav = b.mean(axis=0)     # one sample mean per column: 100 means in total
plt.hist(bav, bins=30);  # more concentrated around 0.5 (triangular shape for n = 2)
In [12]:
# Overlay the exact normal density on a histogram of 10,000 Normal(mu, sigma) draws
mu = 0.5
sigma = 0.1
xmin, xmax = mu-5*sigma, mu+5*sigma
x = np.linspace(xmin, xmax, 200)
plt.hist(np.random.normal(mu, sigma, size=10_000), density=True, bins=50);
plt.plot(x, norm.pdf(x, mu, sigma), lw=3);
In [13]:
def uni_cl_theo(sample_size, repetitions, ax=None, bins=None):
    """Plot a histogram of `repetitions` sample means of `sample_size`
    Uniform(0, 1) draws, with the CLT normal approximation overlaid."""

    bins = 30 if bins is None else bins
    if ax is None:
        fig, ax = plt.subplots()
    
    # mean and variance of the uniform (0,1) distribution
    mu = 0.5
    sigma2_uni = 1/12
    sigma_norm = np.sqrt(sigma2_uni/sample_size)

    sample_means = np.random.uniform(size=(sample_size, repetitions)).mean(axis=0)
    assert sample_means.size == repetitions  # sanity check
   
    xmin_n, xmax_n = mu-5*sigma_norm, mu+5*sigma_norm
    xmin = min(0, xmin_n)
    xmax = max(1, xmax_n)
    x = np.linspace(xmin, xmax, 300)    
    ax.hist(sample_means, density=True, bins=bins);
    ax.plot(x, norm.pdf(x, mu, sigma_norm), lw=3, label=f'n={sample_size}, r={repetitions}')
    ax.set_xlim([xmin, xmax])
    ax.legend()
    return ax

uni_cl_theo(1, 1000);
In [14]:
for reps in [10, 100, 1000, 10000]:
    uni_cl_theo(2, reps)
In [15]:
reps = 1_000
for sample_size in [1, 2, 10, 100, 500]:
    uni_cl_theo(sample_size, reps)