
Estimators, Bias, and Variance

Last time, we introduced the idea of random variables: numerical functions of a sample. Most of our work in the last lecture was done to build a background in probability and statistics. Now that we’ve established some key ideas, we’re in a good place to apply what we’ve learned to our original goal -- understanding how the randomness of a sample impacts the model design process.

In this lecture, we will delve more deeply into the idea of fitting a model to a sample. We’ll explore how to re-express our modeling process in terms of random variables and use this new understanding to steer model complexity.

Brief Recap

Note that $\text{Cov}(X,Y)$ would equal 0 if $X$ and $Y$ are independent.

There is also one more important property of expectation that we should look at. Let $X$ and $Y$ be independent random variables:

$$
\mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y]
$$

Data-Generating Processes (DGP)

Sample Statistics

Today, we’ve talked extensively about populations; if we know the distribution of a random variable, we can reliably compute expectation, variance, functions of the random variable, etc.

In Data Science, however, we often do not have access to the whole population, so we don’t know its distribution. As such, we need to collect a sample and use its distribution to estimate or infer properties of the population. In cases like these, we can take several samples of size $n$ from the population (an easy way to do this is using `df.sample(n, replace=True)`) and compute the mean of each sample. When sampling, we make the (big) assumption that we sample uniformly at random with replacement from the population; each observation in our sample is a random variable drawn i.i.d. from our population distribution. Remember that our sample mean is a random variable, since it depends on our randomly drawn sample! The population mean, on the other hand, is simply a number (a fixed value).
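As a minimal sketch of this idea (the DataFrame `df` and its `value` column are hypothetical stand-ins for a real population), we can draw repeated samples and compare the spread of their means to the fixed population mean:

```python
import numpy as np
import pandas as pd

# Hypothetical population stored as a DataFrame with one numeric column.
rng = np.random.default_rng(42)
df = pd.DataFrame({"value": rng.exponential(scale=2.0, size=100_000)})

n = 100              # size of each sample
num_samples = 2_000  # number of sampling "universes" to simulate

# Each df.sample call simulates collecting one i.i.d. sample with replacement;
# the mean of each sample is one realization of the random variable X-bar.
sample_means = [df["value"].sample(n, replace=True).mean() for _ in range(num_samples)]

print("population mean:        ", df["value"].mean())
print("average of sample means:", np.mean(sample_means))
```

Each entry of `sample_means` is different, which is exactly what it means for the sample mean to be a random variable.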

Sample Mean Properties

Consider an i.i.d. sample $X_1, X_2, \dots, X_n$ drawn from a population with mean $\mu$ and SD $\sigma$. We define the sample mean as

$$
\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i
$$

The expectation of the sample mean is given by:

$$
\begin{align}
\mathbb{E}[\bar{X}_n] &= \frac{1}{n} \sum_{i=1}^n \mathbb{E}[X_i] \\
&= \frac{1}{n} (n \mu) \\
&= \mu
\end{align}
$$

The variance is given by (the variance of the sum splits into a sum of variances because the $X_i$ are independent):

$$
\begin{align}
\text{Var}(\bar{X}_n) &= \frac{1}{n^2} \text{Var}\left( \sum_{i=1}^n X_i \right) \\
&= \frac{1}{n^2} \left( \sum_{i=1}^n \text{Var}(X_i) \right) \\
&= \frac{1}{n^2} (n \sigma^2) = \frac{\sigma^2}{n}
\end{align}
$$

The standard deviation is:

$$
\text{SD}(\bar{X}_n) = \frac{\sigma}{\sqrt{n}}
$$

$\bar{X}_n$ is normally distributed (in the limit) by the Central Limit Theorem (CLT).
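To make these formulas concrete, here is a small simulation sketch (the exponential population is an arbitrary choice) checking that the SD of the sample mean tracks $\frac{\sigma}{\sqrt{n}}$:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0  # an Exponential(scale=2) population has mean 2 and SD 2

for n in [10, 100, 1000]:
    # 10,000 sampling universes, each an i.i.d. sample of size n
    samples = rng.exponential(scale=2.0, size=(10_000, n))
    means = samples.mean(axis=1)
    print(f"n={n:5d}  empirical SD of the sample mean: {means.std():.4f}  "
          f"theory sigma/sqrt(n): {sigma / np.sqrt(n):.4f}")
```

The empirical and theoretical SDs should agree closely, even though the population itself is skewed rather than normal.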

Central Limit Theorem

In Data 8 and in the previous lecture, you encountered the Central Limit Theorem (CLT). This is a powerful theorem for approximating the distribution of the sample mean when sampling from a population with mean $\mu$ and standard deviation $\sigma$. The CLT tells us that if an i.i.d. sample of size $n$ is large, then the probability distribution of the sample mean is roughly normal with mean $\mu$ and SD $\frac{\sigma}{\sqrt{n}}$. More generally, any theorem that provides the rough distribution of a statistic without requiring the distribution of the population is valuable to data scientists, because we rarely know much about the population.

Illustration of the central limit theorem

Importantly, the CLT assumes that each observation in our samples is drawn i.i.d. from the distribution of the population. In addition, the CLT is accurate only when $n$ is “large”, but what counts as a “large” sample size depends on the specific distribution. If a population is highly symmetric and unimodal, we could need as few as $n=20$; if a population is very skewed, we need a larger $n$. If in doubt, you can bootstrap the sample mean and see if the bootstrapped distribution is bell-shaped. Classes like Data 140 investigate this idea in great detail.
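Here is a rough sketch of that bootstrap check (the lognormal sample is an invented example): resample a single observed sample with replacement many times and inspect the shape of the bootstrapped means:

```python
import numpy as np

rng = np.random.default_rng(1)
sample = rng.lognormal(mean=0.0, sigma=1.0, size=50)  # one observed, skewed sample

# Bootstrap: resample the observed sample with replacement (same size n)
# and recompute the mean each time.
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(5_000)
])

# Crude text histogram: if the bootstrap distribution looks bell-shaped,
# the CLT approximation is probably reasonable at this n.
counts, edges = np.histogram(boot_means, bins=20)
for count, left in zip(counts, edges):
    print(f"{left:6.2f} | {'#' * (60 * count // counts.max())}")
```

If the histogram still looks heavily skewed, that is a hint the CLT approximation needs a larger $n$ for this population.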

For a more in-depth demo, check out onlinestatbook.

Estimation

At this point in the course, we’ve spent a great deal of time working with models. When we first introduced the idea of modeling a few weeks ago, we did so in the context of prediction: using models to make accurate predictions about unseen data. Another reason we might build models is to better understand complex phenomena in the world around us. Inference is the task of using a model to infer the true underlying relationships between the feature and response variables. For example, if we are working with a set of housing data, prediction might ask: given the attributes of a house, how much is it worth? Inference might ask: how much does having a local park impact the value of a house?

A major goal of inference is to draw conclusions about the full population of data given only a random sample. To do this, we aim to estimate the value of a parameter, which is a numerical function of the population (for example, the population mean $\mu$). We use a collected sample to construct a statistic, which is a numerical function of the random sample (for example, the sample mean $\bar{X}_n$). It’s helpful to think “p” for “parameter” and “population,” and “s” for “sample” and “statistic.”

Since the sample is a random subset of the population, any statistic we compute from it will likely deviate from the true population parameter, and a different sample would have produced a different value. We say that the sample statistic is an estimator of the true population parameter. Notationally, the population parameter is typically denoted $\theta$, while its estimator is denoted $\hat{\theta}$.

To address our inference question, we aim to construct estimators that closely estimate the value of the population parameter. We evaluate how “good” an estimator is by answering three questions: Does it give the right answer, on average (its bias)? How variable is it from sample to sample (its variance)? And how far off is it overall, combining both effects (its mean squared error)?

The bias of an estimator is the difference between its expected value and the true parameter: $\text{Bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta$. If the bias of an estimator $\hat{\theta}$ is zero, it is said to be an unbiased estimator. For example, the sample mean is an unbiased estimator of the population mean, since $\mathbb{E}[\bar{X}_n] = \mu$.

This relationship between bias and variance can be illustrated with an archery analogy. Imagine that the center of the target is the true parameter $\theta$ and each arrow corresponds to a separate parameter estimate $\hat{\theta}$.

Graphic showing four different targets with high/low bias and high/low variance.

Ideally, we want our estimator to have low bias and low variance, but how can we mathematically quantify that? See Bias-Variance Tradeoff for more detail.

Prediction as a Data-Generating Process

Now that we’ve established the idea of an estimator, let’s see how we can apply this learning to the modeling process. To do so, we’ll take a moment to formalize our data collection and models in the language of random variables.

Say we are working with an input variable, $x$, and a response variable, $Y$. We assume that $Y$ and $x$ are linked by some relationship $g$; in other words, $Y = g(x)$ where $g$ represents some “universal truth” or “law of nature” that defines the true underlying relationship between $x$ and $Y$. In the image below, $g$ is denoted by the red line.

As data scientists, however, we have no way of directly “seeing” the underlying relationship $g$. The best we can do is collect observed data out in the real world to try to understand this relationship. Unfortunately, the data collection process will always have some inherent error (think of the randomness you might encounter when taking measurements in a scientific experiment). We say that each observation comes with some random error or noise term, $\epsilon$ (read: “epsilon”). This error is assumed to be a random variable with expectation $\mathbb{E}(\epsilon)=0$ and variance $\text{Var}(\epsilon) = \sigma^2$, drawn i.i.d. across observations. The existence of this random noise means that our observations, $Y(x)$, are random variables, where $Y = g(x) + \epsilon$ is the data-generating process. We can see this on our graph: the points do not lie perfectly on the true underlying relationship, and the residual from that relationship is the noise $\epsilon$.

Two graphs are shown. On the left is the underlying data generation process with g shown in addition to the datapoints. On the right is what we see with only the datapoints shown.

We can only observe our i.i.d. random sample of data, represented by the blue points above. From this sample, we want to estimate the true relationship $g$. We do this by training a model on the data to obtain our optimal $\hat{\theta}$,

$$
\hat{\theta} = \text{arg}\underset{\theta}{\text{min}}\ \left(\frac{1}{n} \sum_{i=1}^n \textbf{Loss}(Y_i, f_{\theta}(X_i))\right) + \lambda\ \textbf{Regularizer}(\theta)
$$

and predicting the value of $Y$ at a given $x$ location,

$$
\hat{Y}(x) = \hat{f}(x)
$$

where the model $\hat{Y}(x)$ can be used to estimate $g$. The squared error of our prediction is:

$$
(Y - \hat{Y}(x))^2
$$

Here, $X_i$, $Y_i$, $\hat{\theta}$, and $\hat{Y}(x)$ are random variables. An example prediction model $\hat{Y}(x)$ is shown below.

$$\text{True relationship: } g(x)$$
$$\text{Observed relationship: } Y = g(x) + \epsilon$$
$$\text{Prediction: } \hat{Y}(x) = \hat{f}(x)$$
Two graphs are shown. On the left is a graph of the datapoints titled 'What we see.' On the right is 'A model we may fit, y hat' where the datapoints are shown in addition to a curve labeled y hat.

When building models, it is also important to note that our choice of features significantly impacts our estimate. In the plot below, you can see how different models (green and purple) can lead to different estimates.

Two different y hat curves are shown in addition to the datapoints

Overall, we fit (train) a model based on our sample of $(x, y)$ pairs, and our model estimates the true relationship $Y = g(x) + \epsilon$, where at every $x$, our prediction for $Y$ is $\hat{Y}(x) = \hat{f}(x)$.
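Putting these pieces together, here is a minimal end-to-end sketch (the true $g$, the noise level, and the polynomial model are all assumptions made for illustration): generate data from $Y = g(x) + \epsilon$, train a model, and predict $\hat{Y}(x)$ at a new point:

```python
import numpy as np

rng = np.random.default_rng(7)

def g(x):
    # The "universal truth" we never observe directly (an assumed example).
    return np.sin(2 * x) + 0.5 * x

# Data-generating process: Y = g(x) + epsilon, with E[eps] = 0, Var(eps) = sigma^2
x = rng.uniform(0, 3, size=60)
y = g(x) + rng.normal(0, 0.3, size=x.size)

# Fit a model f-hat; a degree-5 polynomial is just one possible feature choice.
theta_hat = np.polyfit(x, y, deg=5)
f_hat = np.poly1d(theta_hat)

x_new = 1.5
print("prediction Y-hat(1.5):", f_hat(x_new))
print("true value g(1.5):    ", g(x_new))
```

Rerunning this with a fresh seed gives a different training sample and therefore a different $\hat{\theta}$ and $\hat{Y}(x)$, which is why all of these quantities are random variables.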

Model Complexity

The bias and variance of a model (covered in detail in the next section) are largely determined by its complexity.

Two animations: the first shows two low-complexity models fit to different samples; the second shows two high-complexity models fit to different samples, whose exact curves differ at various points.
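The effect these animations illustrate can be reproduced in a few lines (a sketch with an assumed true function and noise level): refit models of different complexity on fresh samples and watch how much the prediction at one fixed point moves around:

```python
import numpy as np

rng = np.random.default_rng(3)
g = lambda x: np.sin(2 * x)          # assumed true relationship
x_grid = np.linspace(0, 3, 40)
x0 = 1.5                             # fixed query point

for degree in [1, 9]:                # low vs. high complexity
    preds_at_x0 = []
    for _ in range(200):             # 200 "parallel universes" of training data
        y = g(x_grid) + rng.normal(0, 0.3, size=x_grid.size)
        f_hat = np.poly1d(np.polyfit(x_grid, y, deg=degree))
        preds_at_x0.append(f_hat(x0))
    preds_at_x0 = np.array(preds_at_x0)
    print(f"degree={degree}  mean prediction: {preds_at_x0.mean():.3f}  "
          f"SD across universes: {preds_at_x0.std():.3f}")
```

The high-degree fit typically shows a much larger SD across universes; that spread is exactly the model variance the next section quantifies.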

Bias-Variance Tradeoff

Recall the model and the data we generated from that model in the last section:

$$\text{True relationship: } g(x)$$
$$\text{Observed relationship: } Y = g(x) + \epsilon$$
$$\text{Prediction: } \hat{Y}(x) = \hat{f}(x)$$

Just like an estimator, we can evaluate a model’s quality by considering its behavior across different training datasets (i.e., parallel sampling universes). At a fixed point $x$, we look at the model’s bias, its variance, and its overall risk:

$$\text{Bias}(f(x)) = \mathbb{E}[f(x)] - g(x)$$
$$\text{Var}(f(x)) = \mathbb{E}\left[\left(f(x) - \mathbb{E}[f(x)]\right)^2\right]$$
$$\text{Model Risk} = \mathbb{E}[(Y - \hat{Y})^2]$$

Decomposition of the Model Risk

Goal: Compute the model risk, $\mathbb{E}[(Y - \hat{Y})^2]$. We start from the identity $\text{Var}(Z) = \mathbb{E}[Z^2] - (\mathbb{E}[Z])^2$ applied to $Z = Y - \hat{Y}$, then rearrange:

$$
\begin{align*}
\text{Var}(Y - \hat{Y}) &= \mathbb{E}[ (Y - \hat{Y})^2 ] - \left( \mathbb{E}[ Y - \hat{Y} ] \right)^2 \\
\mathbb{E}[ (Y - \hat{Y})^2 ] &= \text{Var}(Y - \hat{Y}) + \left( \mathbb{E}[ Y - \hat{Y} ] \right)^2
\end{align*}
$$

To simplify these two terms, recall that $Y = g(x) + \epsilon$ and that the noise $\epsilon$ in a new observation is independent of the fitted model $\hat{Y}$ (which depends only on the training sample).
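Under that independence assumption, the two pieces simplify as follows (a short derivation filling in the omitted step, writing $\hat{Y}$ for the model’s prediction $\hat{f}(x)$):

$$
\begin{align*}
\mathbb{E}[Y - \hat{Y}] &= \mathbb{E}[g(x) + \epsilon] - \mathbb{E}[\hat{Y}] = g(x) - \mathbb{E}[\hat{Y}] = -\text{Bias}(\hat{Y}) \\
\text{Var}(Y - \hat{Y}) &= \text{Var}(\epsilon) + \text{Var}(\hat{Y}) = \sigma^2 + \text{Var}(\hat{Y})
\end{align*}
$$

Substituting these back into the identity above (the sign of the bias disappears when squared), we get: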

$$
\begin{align*}
\mathbb{E}[(Y - \hat{Y})^2] &= \sigma^2 + \text{Var}(f(\vec{X})) + \text{Bias}(f(\vec{X}))^2
\end{align*}
$$

The Bias-Variance Decomposition

In the previous section we derived the following:

$$
\begin{align*}
\mathbb{E}[(Y - \hat{Y})^2] &= \sigma^2 + \text{Var}(f(\vec{X})) + \text{Bias}(f(\vec{X}))^2
\end{align*}
$$

The above equation can be interpreted as:

$$
\begin{align*}
\text{Model Risk} = \text{Irreducible error} + \text{Model Variance} + (\text{Model Bias})^2
\end{align*}
$$
Two plots illustrating the tradeoff: in the first, as model complexity grows, variance rises and training error falls, while validation and test error are U-shaped; the chosen complexity (dashed line) sits slightly left of where the variance and training-error curves cross, with underfitting to its left and overfitting to its right. In the second, model variance rises and (model bias)² falls with complexity, test error is U-shaped, and the optimal complexity is where the two curves cross; observation variance is the gap between the test error and those two curves.

High variance corresponds to overfitting.

High bias corresponds to underfitting.

Irreducible error is the noise variance $\sigma^2$ of the data-generating process; it cannot be reduced by any choice of model.
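To close, here is a simulation sketch (true function, noise level, and model class all invented for illustration) that estimates each term of the decomposition at a fixed point and checks that they sum to the empirical risk:

```python
import numpy as np

rng = np.random.default_rng(5)
g = lambda x: np.sin(2 * x)   # assumed true relationship
sigma = 0.3                   # SD of the observation noise (the irreducible part)
x_train = np.linspace(0, 3, 40)
x0 = 1.5                      # evaluate the decomposition at this fixed point

preds = []
for _ in range(2_000):        # 2,000 training universes
    y = g(x_train) + rng.normal(0, sigma, size=x_train.size)
    f_hat = np.poly1d(np.polyfit(x_train, y, deg=3))
    preds.append(f_hat(x0))
preds = np.array(preds)

bias_sq = (preds.mean() - g(x0)) ** 2   # (Model Bias)^2
variance = preds.var()                  # Model Variance

# Risk of predicting a fresh observation Y = g(x0) + eps with each fitted model
y_new = g(x0) + rng.normal(0, sigma, size=preds.size)
risk = np.mean((y_new - preds) ** 2)

print(f"sigma^2 + variance + bias^2 = {sigma**2 + variance + bias_sq:.4f}")
print(f"empirical model risk        = {risk:.4f}")
```

The two printed numbers should agree up to simulation error, which is the decomposition above seen numerically.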