18  Estimators, Bias, and Variance

Learning Outcomes
  • Explore commonly seen random variables like Bernoulli and Binomial distributions
  • Apply the Central Limit Theorem to approximate parameters of a population
  • Use sampled data to estimate, and make inferences about, the true underlying distribution
  • Estimate the true population distribution from a sample using the bootstrapping technique

Last time, we introduced the idea of random variables: numerical functions of a sample. Most of our work in the last lecture was done to build a background in probability and statistics. Now that we’ve established some key ideas, we’re in a good place to apply what we’ve learned to our original goal – understanding how the randomness of a sample impacts the model design process.

In this lecture, we will delve more deeply into the idea of fitting a model to a sample. We’ll explore how to re-express our modeling process in terms of random variables and use this new understanding to steer model complexity.

18.1 Common Random Variables

There are several cases of random variables that appear often and have useful properties. Below are the ones we will explore further in this course. The numbers in parentheses are the parameters of a random variable, which are constants. Parameters define a random variable’s shape (i.e., distribution) and its values. For this lecture, we’ll focus more heavily on the first two (Bernoulli and Binomial) and their special properties, but you should familiarize yourself with all the ones listed below; a short simulation after the list checks the key formulas empirically:

  • Bernoulli(\(p\))
    • Takes on value 1 with probability \(p\), and 0 with probability \((1 - p)\).
    • AKA the “indicator” random variable.
    • Let \(X\) be a Bernoulli(\(p\)) random variable.
      • \(\mathbb{E}[X] = 1 * p + 0 * (1-p) = p\)
      • \(\mathbb{E}[X^2] = 1^2 * p + 0^2 * (1-p) = p\)
      • \(\text{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 = p - p^2 = p(1-p)\)
  • Binomial(\(n\), \(p\))
    • Number of 1s in \(n\) independent Bernoulli(\(p\)) trials.
    • Let \(Y\) be a Binomial(\(n\), \(p\)) random variable.
      • The distribution of \(Y\) is given by the binomial formula, and we can write \(Y = \sum_{i=1}^n X_i\) where:
        • \(X_i\) is the indicator of success on trial \(i\): \(X_i = 1\) if trial \(i\) is a success, else 0.
        • All \(X_i\) are i.i.d. and Bernoulli(\(p\)).
      • \(\mathbb{E}[Y] = \sum_{i=1}^n \mathbb{E}[X_i] = np\)
      • \(\text{Var}(Y) = \sum_{i=1}^n \text{Var}(X_i) = np(1-p)\)
        • The \(X_i\)’s are independent, so \(\text{Cov}(X_i, X_j) = 0\) for all \(i \neq j\).
  • Uniform on a finite set of values
    • The probability of each value is \(\frac{1}{\text{(number of possible values)}}\).
    • For example, a standard/fair die.
  • Uniform on the unit interval (0, 1)
    • Density is flat at 1 on (0, 1) and 0 elsewhere.
  • Normal(\(\mu, \sigma^2\)), a.k.a Gaussian
    • \(f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{\!2}\,\right)\)
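
To check these properties empirically, here is a minimal NumPy sketch (our own, not part of the lecture) that simulates Bernoulli and Binomial draws and compares the empirical mean and variance to the formulas above:

```python
import numpy as np

rng = np.random.default_rng(42)
p, n = 0.5, 20

# 100,000 Bernoulli(p) draws: a Bernoulli is a Binomial with one trial
bernoulli = rng.binomial(1, p, size=100_000)
print(bernoulli.mean(), p)              # empirical vs. theoretical E[X] = p
print(bernoulli.var(), p * (1 - p))     # empirical vs. theoretical Var(X) = p(1-p)

# 100,000 Binomial(n, p) draws
binomial = rng.binomial(n, p, size=100_000)
print(binomial.mean(), n * p)           # E[Y] = np
print(binomial.var(), n * p * (1 - p))  # Var(Y) = np(1-p)
```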

18.1.1 Example

Suppose you win cash based on the number of heads you get in a series of 20 coin flips. Let \(X_i = 1\) if the \(i\)-th coin is heads, \(0\) otherwise. Which payout strategy would you choose?

A. \(Y_A = 10 * X_1 + 10 * X_2\)

B. \(Y_B = \sum_{i=1}^{20} X_i\)

C. \(Y_C = 20 * X_1\)

Let \(X_1, X_2, ..., X_{20}\) be 20 i.i.d. Bernoulli(0.5) random variables. Since the \(X_i\)’s are independent, \(\text{Cov}(X_i, X_j) = 0\) for all pairs \(i \neq j\). Additionally, since each \(X_i\) is Bernoulli(0.5), we know that \(\mathbb{E}[X_i] = p = 0.5\) and \(\text{Var}(X_i) = p(1-p) = 0.25\). We can calculate the following for each scenario:

|                    | A. \(Y_A = 10 * X_1 + 10 * X_2\)                   | B. \(Y_B = \sum_{i=1}^{20} X_i\)            | C. \(Y_C = 20 * X_1\)                  |
|--------------------|----------------------------------------------------|---------------------------------------------|----------------------------------------|
| Expectation        | \(\mathbb{E}[Y_A] = 10(0.5) + 10(0.5) = 10\)       | \(\mathbb{E}[Y_B] = 0.5 + ... + 0.5 = 10\)  | \(\mathbb{E}[Y_C] = 20(0.5) = 10\)     |
| Variance           | \(\text{Var}(Y_A) = 10^2(0.25) + 10^2(0.25) = 50\) | \(\text{Var}(Y_B) = 0.25 + ... + 0.25 = 5\) | \(\text{Var}(Y_C) = 20^2(0.25) = 100\) |
| Standard Deviation | \(\text{SD}(Y_A) \approx 7.07\)                    | \(\text{SD}(Y_B) \approx 2.24\)             | \(\text{SD}(Y_C) = 10\)                |

As we can see, all the scenarios have the same expected value but different variances. The higher the variance, the greater the risk and uncertainty, so the “right” strategy depends on your personal preference. Would you choose the “safest” option B, the most “risky” option C, or somewhere in the middle (option A)?
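
If you’d like to verify the table, here is a short simulation sketch (ours, assuming fair coins) that replays the game many times:

```python
import numpy as np

rng = np.random.default_rng(0)
flips = rng.binomial(1, 0.5, size=(100_000, 20))  # 100,000 rounds of 20 fair flips

Y_A = 10 * flips[:, 0] + 10 * flips[:, 1]   # strategy A: first two flips, $10 each
Y_B = flips.sum(axis=1)                     # strategy B: $1 per head across all 20
Y_C = 20 * flips[:, 0]                      # strategy C: $20 on the first flip

for name, Y in [("A", Y_A), ("B", Y_B), ("C", Y_C)]:
    print(f"{name}: mean={Y.mean():.2f}, SD={Y.std():.2f}")
# Expect means near 10 and SDs near 7.07, 2.24, and 10, matching the table.
```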

18.2 Sample Statistics

Today, we’ve talked extensively about populations; if we know the distribution of a random variable, we can reliably compute expectation, variance, functions of the random variable, etc. Note that:

  • The distribution of a population describes how a random variable behaves across all individuals of interest.
  • The distribution of a sample describes how a random variable behaves in a specific sample from the population.

In Data Science, however, we often do not have access to the whole population, so we don’t know its distribution. As such, we need to collect a sample and use its distribution to estimate or infer properties of the population. In cases like these, we can take several samples of size \(n\) from the population (an easy way to do this is using df.sample(n, replace=True)) and compute the mean of each sample. When sampling, we make the (big) assumption that we sample uniformly at random with replacement from the population; each observation in our sample is then a random variable drawn i.i.d. from our population distribution. Remember that the sample mean is a random variable since it depends on our randomly drawn sample! The population mean, on the other hand, is simply a number (a fixed value).
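
As a concrete sketch of this process, suppose we have a pandas DataFrame population with a numeric height column (both the data and the column name are hypothetical); we can repeatedly call .sample(n, replace=True) and watch the sample mean vary:

```python
import numpy as np
import pandas as pd

# Hypothetical population of 50,000 heights (all names and numbers are made up)
rng = np.random.default_rng(1)
population = pd.DataFrame({"height": rng.normal(170, 8, size=50_000)})

# Draw 1,000 samples of size n = 100 uniformly at random with replacement,
# recording each sample's mean
sample_means = [
    population["height"].sample(100, replace=True).mean() for _ in range(1_000)
]

print(np.mean(sample_means))   # close to the population mean (170)
print(np.std(sample_means))    # close to sigma / sqrt(n) = 8 / 10 = 0.8
```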

18.2.1 Sample Mean

Consider an i.i.d. sample \(X_1, X_2, ..., X_n\) drawn from a population with mean \(\mu\) and SD \(\sigma\). We define the sample mean as \[\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i\]

The expectation of the sample mean is given by: \[\begin{align} \mathbb{E}[\bar{X}_n] &= \frac{1}{n} \sum_{i=1}^n \mathbb{E}[X_i] \\ &= \frac{1}{n} (n \mu) \\ &= \mu \end{align}\]

The variance is given by: \[\begin{align} \text{Var}(\bar{X}_n) &= \frac{1}{n^2} \text{Var}( \sum_{i=1}^n X_i) \\ &= \frac{1}{n^2} \left( \sum_{i=1}^n \text{Var}(X_i) \right) \\ &= \frac{1}{n^2} (n \sigma^2) = \frac{\sigma^2}{n} \end{align}\]

\(\bar{X}_n\) is approximately normally distributed by the Central Limit Theorem (CLT).

18.2.2 Central Limit Theorem

In Data 8 and in the previous lecture, you encountered the Central Limit Theorem (CLT). This is a powerful theorem for estimating the distribution of a population with mean \(\mu\) and standard deviation \(\sigma\) from a collection of smaller samples. The CLT tells us that if an i.i.d. sample of size \(n\) is large, then the probability distribution of the sample mean is roughly normal with mean \(\mu\) and SD of \(\frac{\sigma}{\sqrt{n}}\). More generally, any theorem that provides the rough distribution of a statistic without needing the distribution of the population is valuable to data scientists, because we rarely know much about the population!

[Figure: illustration of the Central Limit Theorem]

Importantly, the CLT assumes that each observation in our samples is drawn i.i.d. from the distribution of the population. In addition, the CLT is accurate only when \(n\) is “large”, but what counts as a “large” sample size depends on the specific distribution. If a population is highly symmetric and unimodal, we could need as few as \(n=20\); if a population is very skewed, we need a larger \(n\). If in doubt, you can bootstrap the sample mean and see if the bootstrapped distribution is bell-shaped. Classes like Data 140 investigate this idea in great detail.

For a more in-depth demo, check out onlinestatbook.
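
For a quick numeric check of our own, we can sample from a skewed Exponential(1) population (mean and SD both equal to 1) and watch the sample mean’s distribution obey both the CLT and the square root law discussed below:

```python
import numpy as np

rng = np.random.default_rng(7)
sigma = 1.0   # Exponential(1) population: mean 1, SD 1, heavily right-skewed

for n in [5, 50, 500]:
    # 10,000 sample means, each computed from an i.i.d. sample of size n
    means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)
    print(f"n={n}: mean={means.mean():.3f}, SD={means.std():.3f}, "
          f"sigma/sqrt(n)={sigma / np.sqrt(n):.3f}")
# The mean of `means` stays near mu = 1, the SD tracks sigma / sqrt(n),
# and a histogram of `means` looks increasingly bell-shaped as n grows.
```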

18.2.3 Using the Sample Mean to Estimate the Population Mean

Now let’s say we want to use the sample mean to estimate the population mean, for example, the average height of Cal undergraduates. We can typically collect a single sample, which has just one average. However, what if we happened, by random chance, to draw a sample with a different mean or spread than that of the population? We might get a skewed view of how the population behaves (consider the extreme case where we happen to sample the exact same value \(n\) times!).

[Figure: sampling distributions of the sample mean for \(n=200\) and \(n=800\)]

For example, notice the difference in variation between these two distributions of the sample mean, which differ only in sample size. The distribution with the bigger sample size (\(n=800\)) is tighter around the mean than the distribution with the smaller sample size (\(n=200\)). Try plugging these values into the standard deviation equation for the sample mean to make sense of this!

The CLT helps us make sense of and resolve this issue. By drawing many samples, we can consider how the distribution of the sample mean varies across multiple subsets of the data. This allows us to approximate the properties of the population without the need to survey every single member.

Given this potential variance, it is also important that we consider the average value and spread of all possible sample means, and what this means for how big \(n\) should be. For every sample size, the expected value of the sample mean is the population mean: \[\mathbb{E}[\bar{X}_n] = \mu\] We call the sample mean an unbiased estimator of the population mean and will explore this idea more in the next lecture.

Data 8 Recap: Square Root Law

The square root law (Data 8) states that if you increase the sample size by a factor, the SD of the sample mean decreases by the square root of that factor: \[\text{SD}(\bar{X}_n) = \frac{\sigma}{\sqrt{n}}\] The sample mean is more likely to be close to the population mean if we have a larger sample size.

18.3 Prediction and Inference

At this point in the course, we’ve spent a great deal of time working with models. When we first introduced the idea of modeling a few weeks ago, we did so in the context of prediction: using models to make accurate predictions about unseen data. Another reason we might build models is to better understand complex phenomena in the world around us. Inference is the task of using a model to infer the true underlying relationships between the feature and response variables. For example, if we are working with a set of housing data, prediction might ask: given the attributes of a house, how much is it worth? Inference might ask: how much does having a local park impact the value of a house?

A major goal of inference is to draw conclusions about the full population of data given only a random sample. To do this, we aim to estimate the value of a parameter, which is a numerical function of the population (for example, the population mean \(\mu\)). We use a collected sample to construct a statistic, which is a numerical function of the random sample (for example, the sample mean \(\bar{X}_n\)). It’s helpful to think “p” for “parameter” and “population,” and “s” for “sample” and “statistic.”

Since the sample represents a random subset of the population, any statistic we generate will likely deviate from the true population parameter, and it could have come out differently had we drawn a different sample. We say that the sample statistic is an estimator of the true population parameter. Notationally, the population parameter is typically called \(\theta\), while its estimator is denoted by \(\hat{\theta}\).

To address our inference question, we aim to construct estimators that closely estimate the value of the population parameter. We evaluate how “good” an estimator is by answering three questions:

  • How close is our answer to the parameter? (Risk / MSE) \[ \text{MSE}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2]\]
  • Do we get the right answer for the parameter, on average? (Bias) \[\text{Bias}(\hat{\theta}) = E[\hat{\theta} - \theta] = E[\hat{\theta}] - \theta\]
  • How variable is the answer? (Variance) \[\text{Var}(\hat{\theta}) = E[(\hat{\theta} - E[\hat{\theta}])^2] \]

This relationship can be illustrated with an archery analogy. Imagine that the center of the target is \(\theta\) and each arrow corresponds to a separate parameter estimate \(\hat{\theta}\).

Ideally, we want our estimator to have low bias and low variance, but how can we mathematically quantify that? See the Bias-Variance Tradeoff section later in this note for more detail.
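
To make these three quantities concrete before moving on, here is a small simulation sketch (our own; the shrunken estimator is invented purely for contrast) that compares the unbiased sample mean against a deliberately biased alternative and checks the standard identity \(\text{MSE}(\hat{\theta}) = \text{Bias}(\hat{\theta})^2 + \text{Var}(\hat{\theta})\):

```python
import numpy as np

rng = np.random.default_rng(3)
theta, sigma, n = 10.0, 2.0, 25          # true mean, population SD, sample size

samples = rng.normal(theta, sigma, size=(100_000, n))
estimators = {
    "sample mean (unbiased)": samples.mean(axis=1),
    "shrunken mean (biased)": 0.9 * samples.mean(axis=1),  # made-up estimator
}

for name, est in estimators.items():
    bias = est.mean() - theta                 # E[theta_hat] - theta
    var = est.var()                           # Var(theta_hat)
    mse = ((est - theta) ** 2).mean()         # E[(theta_hat - theta)^2]
    print(f"{name}: bias={bias:.3f}, var={var:.3f}, "
          f"mse={mse:.3f}, bias^2 + var={bias**2 + var:.3f}")
```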

18.3.1 Prediction as Estimation

Now that we’ve established the idea of an estimator, let’s see how we can apply this learning to the modeling process. To do so, we’ll take a moment to formalize our data collection and models in the language of random variables.

Say we are working with an input variable, \(x\), and a response variable, \(Y\). We assume that \(Y\) and \(x\) are linked by some relationship \(g\); in other words, \(Y = g(x)\) where \(g\) represents some “universal truth” or “law of nature” that defines the underlying relationship between \(x\) and \(Y\). In the image below, \(g\) is denoted by the red line.

As data scientists, however, we have no way of directly “seeing” the underlying relationship \(g\). The best we can do is collect observed data out in the real world to try to understand this relationship. Unfortunately, the data collection process will always have some inherent error (think of the randomness you might encounter when taking measurements in a scientific experiment). We say that each observation comes with some random error or noise term, \(\epsilon\) (read: “epsilon”). This error is assumed to be a random variable with expectation \(\mathbb{E}(\epsilon)=0\) and variance \(\text{Var}(\epsilon) = \sigma^2\), drawn i.i.d. across observations. The existence of this random noise means that our observations, \(Y(x)\), are random variables.

[Figure: the true relationship \(g\) (red line) and noisy observations (blue points)]

We can only observe our random sample of data, represented by the blue points above. From this sample, we want to estimate the true relationship \(g\). We do this by constructing the model \(\hat{Y}(x)\) to estimate \(g\).

\[\text{True relationship: } g(x)\]

\[\text{Observed relationship: }Y = g(x) + \epsilon\]

\[\text{Prediction: }\hat{Y}(x)\]

[Figure: the fitted model \(\hat{Y}(x)\) estimating \(g\)]

When building models, it is also important to note that our choice of features significantly impacts our estimates. In the plot below, you can see how two different models (green and purple) can lead to different estimates.

[Figure: two different fitted models (green and purple) for the same data]

18.3.1.1 Estimating a Linear Relationship

If we assume that the true relationship \(g\) is linear, we can express the response as \(Y = f_{\theta}(x)\), where our true relationship is modeled by \[Y = g(x) + \epsilon\] \[ f_{\theta}(x) = Y = \theta_0 + \sum_{j=1}^p \theta_j x_j + \epsilon\]

In our two equations above, the true relationship \(g(x) = \theta_0 + \sum_{j=1}^p \theta_j x_j\) is not random, but \(\epsilon\) is random. Hence, \(Y = f_{\theta}(x)\) is also random.

This true relationship has true, unobservable parameters \(\theta\) and random noise \(\epsilon\), so we can never observe it directly. Instead, the next best thing we can do is obtain a sample \(\Bbb{X}\), \(\Bbb{Y}\) of \(n\) observed relationships \((x, Y)\), use it to train a model, and obtain an estimate \(\hat{\theta}\) of \(\theta\): \[\hat{Y}(x) = f_{\hat{\theta}}(x) = \hat{\theta}_0 + \sum_{j=1}^p \hat{\theta}_j x_j\]

In our estimating equation above, the sample \(\Bbb{X}\), \(\Bbb{Y}\) is random (it depends on which observations we happened to collect). Hence, the estimate \(\hat{\theta}\) we calculate from the sample is also random, so our predictor \(\hat{Y}(x)\) is random, too.

Now, looking at our original equations, we can see that they have different sources of randomness. For our observed relationship, \(Y = g(x) + \epsilon\), the noise \(\epsilon\) represents errors that occur during or after the observation or measurement process. For the estimated model, the randomness comes from the sample itself: it was randomly collected from the population, a process shaped by decisions made before the measurement process.
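
To see this randomness concretely, here is a sketch of our own (assuming a hypothetical one-feature linear truth) that fits the same model on three independently drawn samples and prints the varying estimates:

```python
import numpy as np

rng = np.random.default_rng(5)
theta_0, theta_1, sigma, n = 2.0, 3.0, 1.0, 50   # hypothetical true parameters

for trial in range(3):
    x = rng.uniform(0, 1, n)
    Y = theta_0 + theta_1 * x + rng.normal(0, sigma, n)   # Y = g(x) + eps
    # Least-squares fit; np.polyfit returns [slope, intercept] for degree 1
    theta_1_hat, theta_0_hat = np.polyfit(x, Y, 1)
    print(f"sample {trial}: theta_0_hat={theta_0_hat:.3f}, "
          f"theta_1_hat={theta_1_hat:.3f}")
# Each fresh sample produces different estimates, so Y_hat(x) is itself random.
```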

18.4 Bias-Variance Tradeoff

Recall the model and the data we generated from that model in the last section:

\[\text{True relationship: } g(x)\]

\[\text{Observed relationship: }Y = g(x) + \epsilon\]

\[\text{Prediction: }\hat{Y}(x)\]

With this reformulated modeling goal, we can now revisit the Bias-Variance Tradeoff from two lectures ago (shown below):

In today’s lecture, we’ll explore a more mathematical version of the graph you see above by introducing the terms model risk, observation variance, model bias, and model variance. Eventually, we’ll work our way up to an updated version of the Bias-Variance Tradeoff graph that you see below.

18.4.1 Model Risk

Model risk is defined as the mean square prediction error of the random variable \(\hat{Y}\). It is an expectation across all samples we could have possibly gotten when fitting the model, which we can denote as random variables \(X_1, X_2, \ldots, X_n, Y\). Model risk considers the model’s performance on any sample that is theoretically possible, rather than the specific data that we have collected.

\[\text{model risk} = E\left[(Y-\hat{Y}(x))^2\right]\]

What is the origin of the error encoded by model risk? Note that there are two types of errors:

  • Chance errors: happen due to randomness alone
    • Source 1 (Observation Variance): randomness in new observations \(Y\) due to random noise \(\epsilon\)
    • Source 2 (Model Variance): randomness in the sample we used to train the models, as samples \(X_1, X_2, \ldots, X_n, Y\) are random
  • Model bias: non-random error that arises when our model differs from the true underlying function \(g\)

Recall the data-generating process we established earlier. There is a true underlying relationship \(g\), observed data (with random noise) \(Y\), and model \(\hat{Y}\).

[Figure: the true relationship \(g\), observed data \(Y\), and fitted model \(\hat{Y}\)]

To better understand model risk, we’ll zoom in on a single data point in the plot above.

[Figure: zoomed-in view of a single data point, showing \(Y(x)\), \(\hat{Y}(x)\), \(E[\hat{Y}(x)]\), and \(g(x)\)]

Remember that \(\hat{Y}(x)\) is a random variable – it is the prediction made for \(x\) after being fit on the specific sample used for training. If we had used a different sample for training, a different prediction might have been made for this value of \(x\). To capture this, the diagram above considers both the prediction \(\hat{Y}(x)\) made for a particular random training sample, and the expected prediction across all possible training samples, \(E[\hat{Y}(x)]\).

We can use this simplified diagram to break down the prediction error into smaller components. Start by considering the error on a single prediction, \(Y(x)-\hat{Y}(x)\).

[Figure: the prediction error \(Y(x) - \hat{Y}(x)\) for a single point]

We can identify three components of this error.

[Figure: the three components of the prediction error]

That is, the error can be written as:

\[Y(x)-\hat{Y}(x) = \epsilon + \left(g(x)-E\left[\hat{Y}(x)\right]\right) + \left(E\left[\hat{Y}(x)\right] - \hat{Y}(x)\right)\]

The model risk is the expected square of the expression above, \(E\left[(Y(x)-\hat{Y}(x))^2\right]\). If we square both sides and then take the expectation, we will get the following decomposition of model risk:

\[E\left[(Y(x)-\hat{Y}(x))^2\right] = E[\epsilon^2] + \left(g(x)-E\left[\hat{Y}(x)\right]\right)^2 + E\left[\left(E\left[\hat{Y}(x)\right] - \hat{Y}(x)\right)^2\right]\]

It looks like we are missing some cross-product terms when squaring the right-hand side, but it turns out that all of those cross-product terms are zero. The detailed derivation is out of scope for this class, but a proof is included at the end of this note for your reference.

This expression may look complicated at first glance, but we’ve actually already defined each term earlier in this lecture! Let’s look at them term by term.

18.4.1.1 Observation Variance

The first term in the above decomposition is \(E[\epsilon^2]\). Remember \(\epsilon\) is the random noise when observing \(Y\), with expectation \(\mathbb{E}(\epsilon)=0\) and variance \(\text{Var}(\epsilon) = \sigma^2\). We can show that \(E[\epsilon^2]\) is the variance of \(\epsilon\): \[ \begin{align*} E[\epsilon^2] &= \text{Var}(\epsilon) + \left(E[\epsilon]\right)^2\\ &= \sigma^2 + 0^2\\ &= \sigma^2. \end{align*} \]

This term describes how variable the random error \(\epsilon\) (and \(Y\)) is for each observation. This is called the observation variance. It exists due to the randomness in our observations \(Y\). It is a form of chance error we talked about in the Sampling lecture.

\[\text{observation variance} = \text{Var}(\epsilon) = \sigma^2.\]

The observation variance results from measurement errors when observing data or missing information that acts like noise. To reduce this observation variance, we could try to get more precise measurements, but it is often beyond the control of data scientists. Because of this, the observation variance \(\sigma^2\) is sometimes called “irreducible error.”

18.4.1.2 Model Variance

We will then look at the last term: \(E\left[\left(E\left[\hat{Y}(x)\right] - \hat{Y}(x)\right)^2\right]\). If you recall the definition of variance from the last lecture, this is precisely \(\text{Var}(\hat{Y}(x))\). We call this the model variance.

It describes how much the prediction \(\hat{Y}(x)\) tends to vary when we fit the model on different samples. Remember that the sample we collect can come out very differently each time, so the prediction \(\hat{Y}(x)\) will differ as well. The model variance captures this variability due to the randomness in our sampling process. Like observation variance, it is also a form of chance error, even though the sources of randomness are different.

\[\text{model variance} = \text{Var}(\hat{Y}(x)) = E\left[\left(\hat{Y}(x) - E\left[\hat{Y}(x)\right]\right)^2\right]\]

The main culprit behind large model variance is overfitting: the model pays so much attention to the details of our particular sample that small differences in the random sample lead to large differences in the fitted model. To remedy this, we reduce model complexity (e.g., remove some features or limit the magnitude of estimated model coefficients) so that we do not fit our model to the noise.

18.4.1.3 Model Bias

Finally, the second term is \(\left(g(x)-E\left[\hat{Y}(x)\right]\right)^2\). What is this? The quantity \(E\left[\hat{Y}(x)\right] - g(x)\) is called the model bias, and the second term is exactly its square.

Remember that \(g(x)\) is the fixed underlying truth and \(\hat{Y}(x)\) is our fitted model, which is random. Model bias therefore measures how far off \(g(x)\) and \(\hat{Y}(x)\) are on average over all possible samples.

\[\text{model bias} = E\left[\hat{Y}(x) - g(x)\right] = E\left[\hat{Y}(x)\right] - g(x)\]

The model bias is not random; it’s an average measure for a specific individual \(x\). If bias is positive, our model tends to overestimate \(g(x)\); if it’s negative, our model tends to underestimate \(g(x)\). And if it’s 0, we can say that our model is unbiased.

Unbiased Estimators

An unbiased model has a \(\text{model bias } = 0\). In other words, our model predicts \(g(x)\) on average.

Similarly, we can define bias for estimators like the mean. The sample mean is an unbiased estimator of the population mean: by linearity of expectation, \(\mathbb{E}[\bar{X}_n] = \mu\). Therefore, the \(\text{estimator bias} = \mathbb{E}[\bar{X}_n] - \mu = 0\).

There are two main reasons for large model biases:

  • Underfitting: our model is too simple for the data
  • Lack of domain knowledge: we don’t understand what features are useful for the response variable

To fix this, we increase model complexity (but we don’t want to overfit!) or consult domain experts to see which models make sense. You can start to see a tradeoff here: if we increase model complexity, we decrease the model bias, but we also risk increasing the model variance.

18.4.2 The Decomposition

To summarize:

  • The model risk, \(\mathbb{E}\left[(Y(x)-\hat{Y}(x))^2\right]\), is the mean squared prediction error of the model. It is an expectation and is therefore a fixed number (for a given x).
  • The observation variance, \(\sigma^2\), is the variance of the random noise in the observations. It describes how variable the random error \(\epsilon\) is for each observation and cannot be addressed by modeling.
  • The model bias, \(\mathbb{E}\left[\hat{Y}(x)\right]-g(x)\), is how “off” the \(\hat{Y}(x)\) is as an estimator of the true underlying relationship \(g(x)\).
  • The model variance, \(\text{Var}(\hat{Y}(x))\), describes how much the prediction \(\hat{Y}(x)\) tends to vary when we fit the model on different samples.

The above definitions enable us to simplify the earlier decomposition of model risk as:

\[ E[(Y(x) - \hat{Y}(x))^2] = \sigma^2 + (E[\hat{Y}(x)] - g(x))^2 + \text{Var}(\hat{Y}(x)) \] \[\text{model risk} = \text{observation variance} + (\text{model bias})^2 + \text{model variance}\]

This is known as the bias-variance tradeoff. What does it mean? Remember that the model risk is a measure of the model’s performance. Our goal in building models is to keep model risk low; this means that we will want to ensure that each component of model risk is kept at a small value.

Observation variance is an inherent, random part of the data collection process. We aren’t able to reduce the observation variance, so we’ll focus our attention on the model bias and model variance.

In the Feature Engineering lecture, we considered the issue of overfitting. We saw that the model’s error or bias tends to decrease as model complexity increases — if we design a highly complex model, it will tend to make predictions that are closer to the true relationship \(g\). At the same time, model variance tends to increase as model complexity increases; a complex model may overfit to the training data, meaning that small differences in the random samples used for training lead to large differences in the fitted model. We have a problem. To decrease model bias, we could increase the model’s complexity, which would lead to overfitting and an increase in model variance. Alternatively, we could decrease model variance by decreasing the model’s complexity at the cost of increased model bias due to underfitting.

[Figure: the bias-variance tradeoff: (model bias)^2 decreases and model variance increases with model complexity; model risk is minimized in between]

We need to strike a balance. Our goal in model creation is to use a complexity level that is high enough to keep bias low, but not so high that model variance is large.
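
To close the loop, here is a simulation sketch of our own (the cubic truth \(g\), noise level, and query point are all made up for illustration) that estimates each term of the decomposition by fitting polynomial models of different degrees on many independent training samples:

```python
import numpy as np

rng = np.random.default_rng(100)

def g(x):
    return 2 * x - x**3        # hypothetical true relationship

sigma, n, x0 = 0.5, 30, 0.5    # noise SD, training-set size, query point

for degree in [1, 3, 9]:
    preds = []
    for _ in range(5_000):     # 5,000 possible training samples
        x = rng.uniform(-1, 1, n)
        Y = g(x) + rng.normal(0, sigma, n)         # observed Y = g(x) + eps
        preds.append(np.polyval(np.polyfit(x, Y, degree), x0))
    preds = np.array(preds)
    bias_sq = (preds.mean() - g(x0)) ** 2          # (model bias)^2
    model_var = preds.var()                        # model variance
    risk = sigma**2 + bias_sq + model_var          # decomposition of model risk
    print(f"degree {degree}: bias^2={bias_sq:.4f}, "
          f"variance={model_var:.4f}, risk={risk:.4f}")
```

Under these assumptions, the degree-1 model shows high bias (underfitting), the degree-9 model shows inflated variance (overfitting), and the degree-3 model keeps both terms, and hence the estimated model risk, small.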

18.5 [Bonus] Proof of Bias-Variance Decomposition

This section walks through the detailed derivation of the Bias-Variance Decomposition in the Bias-Variance Tradeoff section above, and this content is out of scope.

We want to prove that the model risk can be decomposed as

\[ \begin{align*} E\left[(Y(x)-\hat{Y}(x))^2\right] &= E[\epsilon^2] + \left(g(x)-E\left[\hat{Y}(x)\right]\right)^2 + E\left[\left(E\left[\hat{Y}(x)\right] - \hat{Y}(x)\right)^2\right]. \end{align*} \]

To prove this, we will first need the following lemma:

If \(V\) and \(W\) are independent random variables then \(E[VW] = E[V]E[W]\).

We will prove this in the discrete finite case. Trust that it’s true in greater generality.

The job is to calculate the weighted average of the values of \(VW\), where the weights are the probabilities of those values. Here goes.

\[\begin{align*} E[VW] ~ &= ~ \sum_v\sum_w vwP(V=v \text{ and } W=w) \\ &= ~ \sum_v\sum_w vwP(V=v)P(W=w) ~~~~ \text{by independence} \\ &= ~ \sum_v vP(V=v)\sum_w wP(W=w) \\ &= ~ E[V]E[W] \end{align*}\]

Now we go into the actual proof:

18.5.1 Goal

Decompose the model risk into recognizable components.

18.5.2 Step 1

\[ \begin{align*} \text{model risk} ~ &= ~ E\left[\left(Y - \hat{Y}(x)\right)^2 \right] \\ &= ~ E\left[\left(g(x) + \epsilon - \hat{Y}(x)\right)^2 \right] \\ &= ~ E\left[\left(\epsilon + \left(g(x)- \hat{Y}(x)\right)\right)^2 \right] \\ &= ~ E\left[\epsilon^2\right] + 2E\left[\epsilon \left(g(x)- \hat{Y}(x)\right)\right] + E\left[\left(g(x) - \hat{Y}(x)\right)^2\right]\\ \end{align*} \]

On the right hand side:

  • The first term is the observation variance \(\sigma^2\).
  • The cross product term is 0 because \(\epsilon\) is independent of \(g(x) - \hat{Y}(x)\) and \(E(\epsilon) = 0\)
  • The last term is the mean squared difference between our predicted value and the value of the true function at \(x\)

18.5.3 Step 2

At this stage we have

\[ \text{model risk} ~ = ~ E\left[\epsilon^2\right] + E\left[\left(g(x) - \hat{Y}(x)\right)^2\right] \]

We don’t yet have a good understanding of \(g(x) - \hat{Y}(x)\). But we do understand the deviation \(D_{\hat{Y}(x)} = \hat{Y}(x) - E\left[\hat{Y}(x)\right]\). We know that

  • \(E\left[D_{\hat{Y}(x)}\right] ~ = ~ 0\)
  • \(E\left[D_{\hat{Y}(x)}^2\right] ~ = ~ \text{model variance}\)

So let’s add and subtract \(E\left[\hat{Y}(x)\right]\) and see if that helps.

\[ g(x) - \hat{Y}(x) ~ = ~ \left(g(x) - E\left[\hat{Y}(x)\right] \right) + \left(E\left[\hat{Y}(x)\right] - \hat{Y}(x)\right) \]

The first term on the right hand side is the model bias at \(x\). The second term is \(-D_{\hat{Y}(x)}\). So

\[ g(x) - \hat{Y}(x) ~ = ~ \text{model bias} - D_{\hat{Y}(x)} \]

18.5.4 Step 3

Remember that the model bias at \(x\) is a constant, not a random variable. Think of it as your favorite number, say 10. Then \[ \begin{align*} E\left[ \left(g(x) - \hat{Y}(x)\right)^2 \right] ~ &= ~ \text{model bias}^2 - 2(\text{model bias})E\left[D_{\hat{Y}(x)}\right] + E\left[D_{\hat{Y}(x)}^2\right] \\ &= ~ \text{model bias}^2 - 0 + \text{model variance} \\ &= ~ \text{model bias}^2 + \text{model variance} \end{align*} \]

Again, the cross-product term is \(0\) because \(E\left[D_{\hat{Y}(x)}\right] ~ = ~ 0\).

18.5.5 Step 4: Bias-Variance Decomposition

In Step 2, we had:

\[ \text{model risk} ~ = ~ \text{observation variance} + E\left[\left(g(x) - \hat{Y}(x)\right)^2\right] \]

Step 3 showed:

\[ E\left[ \left(g(x) - \hat{Y}(x)\right)^2 \right] ~ = ~ \text{model bias}^2 + \text{model variance} \]

Thus, we have proven the bias-variance decomposition:

\[ \text{model risk} = \text{observation variance} + \text{model bias}^2 + \text{model variance}. \]

That is,

\[ E\left[(Y(x)-\hat{Y}(x))^2\right] = \sigma^2 + \left(E\left[\hat{Y}(x)\right] - g(x)\right)^2 + E\left[\left(\hat{Y}(x)-E\left[\hat{Y}(x)\right]\right)^2\right] \]