Derivation of the Bias-Variance Decomposition

A. Adhikari, Data 100 Spring 2020

(slightly modified for Fall 2020)

Preliminary

This result will be used below. You don't have to know how to prove it.

If $V$ and $W$ are independent random variables then $\mathbb{E}(VW) = \mathbb{E}(V)\mathbb{E}(W)$.

Proof: We'll do this in the discrete finite case. Trust that it's true in greater generality.

The job is to calculate the weighted average of the values of $VW$, where the weights are the probabilities of those values. Here goes.

$$ \begin{align*} \mathbb{E}(VW) ~ &= ~ \sum_v\sum_w vwP(V=v \text{ and } W=w) \\ &= ~ \sum_v\sum_w vwP(V=v)P(W=w) ~~~~ \text{by independence} \\ &= ~ \sum_v vP(V=v)\sum_w wP(W=w) \\ &= ~ \mathbb{E}(V)\mathbb{E}(W) \end{align*} $$
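The calculation above can be checked numerically in a small finite case. The sketch below is only an illustration: the values and probabilities for $V$ and $W$ are made up, and independence is imposed by building the joint distribution as a product of the marginals.

```python
import numpy as np

# Hypothetical finite distributions for V and W (values and their probabilities).
v_vals = np.array([-1.0, 0.0, 2.0])
v_probs = np.array([0.2, 0.5, 0.3])
w_vals = np.array([1.0, 3.0])
w_probs = np.array([0.6, 0.4])

# Under independence, P(V=v and W=w) = P(V=v) P(W=w).
joint = np.outer(v_probs, w_probs)

# E(VW): weighted average of the products vw, with weights from the joint distribution.
e_vw = np.sum(np.outer(v_vals, w_vals) * joint)

# E(V) and E(W) computed separately from the marginals.
e_v = np.sum(v_vals * v_probs)
e_w = np.sum(w_vals * w_probs)

print(e_vw, e_v * e_w)  # the two numbers agree up to floating point error
```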

What's in this Notebook

  • The first part of what follows is an exposition of the definitions on the slides.
  • The derivation starts at Step 1.
  • Steps 1-3 are where the action is. Step 4 just puts together the results of Steps 2 and 3.

Assumptions

1. For each individual, the response is $g(x) + \epsilon$ where:

  • $x$ consists of the fixed values of all the predictor variables for that individual
  • $g$ is a fixed function, typically unknown; sometimes called the true function or the signal
  • $\epsilon$ is a random error with mean 0 and variance $\sigma^2$, independent of all other individuals; it is sometimes called noise and is unobservable, which means that we never get to see it

2. We have a random sample from the model above. We don't know $g$ so we fit something of our choice.

3. A new individual comes along at $x$. Their response is $Y = g(x) + \epsilon$ for a brand new copy of $\epsilon$. This response is random because $\epsilon$ is random, and has a hidden $x$ in it because its value is $g(x)$ plus the noise.

4. $\hat{Y}(x)$ is our model's predicted response for this individual. It is random because it depends on our sample, which is random.

Definitions

Observation Variance

The point corresponding to the new observation is $(x, Y)$ where $Y = g(x) + \epsilon$. Remember that $g(x)$ is a constant, $\mathbb{E}(\epsilon) = 0$, and $\mathbb{V}ar(\epsilon) = \sigma^2$. It follows that

  • $\mathbb{E}(Y) = g(x)$
  • $\mathbb{V}ar(Y) = \mathbb{V}ar(\epsilon) = \sigma^2$

That is why $$ \text{observation variance} ~ = ~ \sigma^2 $$

Note: Since $\epsilon$ is centered, that is, $\mathbb{E}(\epsilon) = 0$, we have $\mathbb{V}ar(\epsilon) = \mathbb{E}(\epsilon^2)$. So you will sometimes see the observation variance $\sigma^2$ written as $\mathbb{E}(\epsilon^2)$ instead of $\mathbb{V}ar(Y)$ or $\mathbb{V}ar(\epsilon)$.
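A quick simulation can make these two facts concrete. This is a sketch under assumptions that are not part of the notes: the true function `g`, the point `x`, the value of `sigma`, and the normal shape of the noise are all chosen for illustration. The model itself only requires the noise to have mean 0 and variance $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(0)

def g(x):
    # Hypothetical true function; in practice g is unknown.
    return np.sin(x)

x = 1.5        # fixed predictor value of the new individual (assumed)
sigma = 0.5    # assumed noise SD

# Many independent copies of the noise, hence many copies of Y = g(x) + epsilon.
eps = rng.normal(0, sigma, size=200_000)
Y = g(x) + eps

mean_Y = Y.mean()   # should be close to g(x)
var_Y = Y.var()     # should be close to sigma ** 2

print(mean_Y, g(x))
print(var_Y, sigma ** 2)
```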

Model Bias

The bias of an estimator is the expected difference between the estimator and what it's trying to estimate.

For the new individual at $x$, the model bias is defined by

$$ \text{model bias} ~ = ~ \mathbb{E}(\hat{Y}(x) - Y) ~ = ~ \mathbb{E}(\hat{Y}(x)) - \mathbb{E}(Y) ~ = ~ \mathbb{E}(\hat{Y}(x)) - g(x) $$

This is the difference between our prediction at $x$ and the true signal at $x$, averaged over all possible samples.

The key observation is that bias is a constant (that is, a number), not a random variable. It is a systematic error in the estimate.

Model Variance

We came up with our prediction $\hat{Y}(x)$ based on the model we chose to fit, using data from our random sample. Had that sample come out differently, our prediction might have been different. For each sample, we have a prediction, so the prediction is a random variable and thus has a mean and a variance.

The variance of the predictor $\hat{Y}(x)$ is called the model variance. By the definition of variance,

$$ \text{model variance} ~ = ~ \mathbb{E}\big{(} (\hat{Y}(x) - \mathbb{E}(\hat{Y}(x)))^2 \big{)} $$
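Model bias and model variance are both averages over all possible samples, so one way to get a feel for them is to draw many samples and refit each time. The sketch below assumes a quadratic true function, normal noise, fixed predictor values `x_train`, and a straight-line fit. All of these are illustrative choices, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(42)

def g(x):
    # Hypothetical true function; in practice g is unknown.
    return x ** 2

sigma = 1.0                         # assumed noise SD
x_train = np.linspace(-2, 2, 20)    # fixed predictor values in the sample (assumed)
x_new = 1.5                         # where the new individual comes along (assumed)

def predict_once():
    # Draw one random sample from the model and fit a straight line to it.
    y_train = g(x_train) + rng.normal(0, sigma, size=x_train.size)
    slope, intercept = np.polyfit(x_train, y_train, deg=1)
    return slope * x_new + intercept

# One prediction at x_new per simulated sample.
predictions = np.array([predict_once() for _ in range(10_000)])

model_bias = predictions.mean() - g(x_new)   # estimate of E(Y_hat(x)) - g(x)
model_variance = predictions.var()           # estimate of Var(Y_hat(x))

print(model_bias, model_variance)
```

Because a straight line cannot follow a quadratic signal, the estimated bias here is clearly nonzero no matter how many samples are simulated.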

Model Risk

This is the mean squared error of our prediction.

$$ \text{model risk} ~ = ~ \mathbb{E}\big{(} (Y - \hat{Y}(x))^2 \big{)} $$

Goal

Decompose the model risk into recognizable components.

Step 1 (Figure on Slide 26)

$$ \begin{align*} \text{model risk} ~ &= ~ \mathbb{E}\big{(} (Y - \hat{Y}(x))^2 \big{)} \\ &= ~ \mathbb{E}\big{(} (g(x) + \epsilon - \hat{Y}(x))^2 \big{)} \\ &= ~ \mathbb{E}\big{(} (\epsilon + (g(x) - \hat{Y}(x)))^2 \big{)} \\ &= ~ \mathbb{E}(\epsilon^2) + 2\mathbb{E}(\epsilon(g(x) - \hat{Y}(x))) + \mathbb{E}\big{(}(g(x) - \hat{Y}(x))^2\big{)} \end{align*} $$

On the right hand side:

  • The first term is the observation variance $\sigma^2$.
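  • The cross product term is 0. The new $\epsilon$ is independent of the sample and hence of $g(x) - \hat{Y}(x)$, so by the preliminary result $\mathbb{E}(\epsilon(g(x) - \hat{Y}(x))) = \mathbb{E}(\epsilon)\mathbb{E}(g(x) - \hat{Y}(x)) = 0$ because $\mathbb{E}(\epsilon) = 0$.
  • The last term is the mean squared difference between our predicted value and the value of the true function at $x$.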
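The claim about the cross product term can also be checked by simulation. The sketch below makes several assumptions for illustration: a quadratic `g`, normal noise, and a very simple model that predicts the average of the sampled responses. The new noise `eps_new` is drawn separately from the sample, which is what makes it independent of the prediction.

```python
import numpy as np

rng = np.random.default_rng(1)

def g(x):
    # Hypothetical true function; in practice g is unknown.
    return x ** 2

sigma = 1.0                         # assumed noise SD
x_train = np.linspace(-2, 2, 20)    # fixed predictor values in the sample (assumed)
x_new = 1.5                         # where the new individual comes along (assumed)
reps = 100_000                      # number of simulated samples

# Each row is one random sample; the prediction is the mean of that sample's responses.
y_train = g(x_train) + rng.normal(0, sigma, size=(reps, x_train.size))
y_hat = y_train.mean(axis=1)

# A brand new copy of the noise for the new individual, independent of the sample.
eps_new = rng.normal(0, sigma, size=reps)

# Empirical version of E(epsilon * (g(x) - Y_hat(x))).
cross_term = np.mean(eps_new * (g(x_new) - y_hat))

print(cross_term)  # close to 0, up to simulation error
```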

Step 2 (Figure on Slide 27)

At this stage we have

$$ \text{model risk} ~ = ~ \text{observation variance} + \mathbb{E}\big{(}(g(x) - \hat{Y}(x))^2\big{)} $$

We don't yet have a good understanding of $g(x) - \hat{Y}(x)$. But we do understand the deviation $D_{\hat{Y}(x)} = \hat{Y}(x) - \mathbb{E}(\hat{Y}(x))$. We know that

  • $\mathbb{E}(D_{\hat{Y}(x)}) ~ = ~ 0$
  • $\mathbb{E}(D_{\hat{Y}(x)}^2) ~ = ~ \text{model variance}$

So let's add and subtract $\mathbb{E}(\hat{Y}(x))$ and see if that helps.

$$ g(x) - \hat{Y}(x) ~ = ~ \big{(}g(x) - \mathbb{E}(\hat{Y}(x))\big{)} + \big{(}\mathbb{E}(\hat{Y}(x)) - \hat{Y}(x)\big{)} $$

The first term on the right hand side is the negative of the model bias at $x$, because the model bias is $\mathbb{E}(\hat{Y}(x)) - g(x)$. The second term is $-D_{\hat{Y}(x)}$. So

$$ g(x) - \hat{Y}(x) ~ = ~ -\text{model bias} - D_{\hat{Y}(x)} $$

Step 3

Remember that the model bias at $x$ is a constant, not a random variable. Think of it as your favorite number, say 10. Squaring the expression from Step 2 and taking expectations,

$$ \begin{align*} \mathbb{E}\big{(}(g(x) - \hat{Y}(x))^2\big{)} ~ &= ~ \mathbb{E}\big{(}(\text{model bias} + D_{\hat{Y}(x)})^2\big{)} \\ &= ~ \text{model bias}^2 + 2(\text{model bias})\mathbb{E}(D_{\hat{Y}(x)}) + \mathbb{E}(D_{\hat{Y}(x)}^2) \\ &= ~ \text{model bias}^2 + 0 + \text{model variance} \\ &= ~ \text{model bias}^2 + \text{model variance} \end{align*} $$

Step 4: Bias-Variance Decomposition

In Step 2 we had

$$ \text{model risk} ~ = ~ \text{observation variance} + \mathbb{E}\big{(}(g(x) - \hat{Y}(x))^2\big{)} $$

Step 3 showed

$$ \mathbb{E}\big{(}(g(x) - \hat{Y}(x))^2\big{)} ~ = ~ \text{model bias}^2 + \text{model variance} $$

Thus we have shown the bias-variance decomposition

$$ \text{model risk} ~ = ~ \text{observation variance} + \text{model bias}^2 + \text{model variance} $$

That is,

$$ \mathbb{E}\big{(} (Y - \hat{Y}(x))^2 \big{)} ~ = ~ \sigma^2 + \mathbb{E}\big{(} (g(x) - \mathbb{E}(\hat{Y}(x)))^2\big{)} + \mathbb{E}\big{(} (\hat{Y}(x) - \mathbb{E}(\hat{Y}(x)))^2 \big{)} $$
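The decomposition can be checked empirically by estimating each side separately. The following is a sketch, not a proof: it assumes a quadratic true function, normal noise, fixed predictor values `x_train`, and a straight-line fit, all chosen for illustration. The left side is estimated from fresh responses at `x_new`, and the right side from the simulated predictions.

```python
import numpy as np

rng = np.random.default_rng(2020)

def g(x):
    # Hypothetical true function; in practice g is unknown.
    return x ** 2

sigma = 1.0                         # assumed noise SD
x_train = np.linspace(-2, 2, 20)    # fixed predictor values in the sample (assumed)
x_new = 1.5                         # where the new individual comes along (assumed)
reps = 50_000                       # number of simulated samples

# One column per random sample; np.polyfit fits a line to every column at once.
y_train = g(x_train)[:, None] + rng.normal(0, sigma, size=(x_train.size, reps))
slopes, intercepts = np.polyfit(x_train, y_train, deg=1)
y_hat = slopes * x_new + intercepts   # one prediction at x_new per sample

# A new response at x_new for each repetition, with a fresh independent copy of the noise.
y_new = g(x_new) + rng.normal(0, sigma, size=reps)

# Left hand side: model risk, estimated directly.
model_risk = np.mean((y_new - y_hat) ** 2)

# Right hand side: the three components, estimated separately.
observation_variance = sigma ** 2
model_bias = y_hat.mean() - g(x_new)
model_variance = y_hat.var()
decomposition = observation_variance + model_bias ** 2 + model_variance

print(model_risk, decomposition)  # the two agree up to simulation error
```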

Special Case $\hat{Y}(x) = f_{\hat{\theta}}(x)$

In the case where we are making our predictions by fitting some function $f$ that involves parameters $\theta$, our estimate $\hat{Y}$ is $f_{\hat{\theta}}$ where $\hat{\theta}$ has been estimated from the data and hence is random.

In the bias-variance decomposition

$$ \mathbb{E}\big{(} (Y - \hat{Y}(x))^2 \big{)} ~ = ~ \sigma^2 + \mathbb{E}\big{(} (g(x) - \mathbb{E}(\hat{Y}(x)))^2\big{)} + \mathbb{E}\big{(} (\hat{Y}(x) - \mathbb{E}(\hat{Y}(x)))^2 \big{)} $$

just plug in the particular prediction $f_{\hat{\theta}}$ in place of the general prediction $\hat{Y}$:

$$ \mathbb{E}\big{(} (Y - f_{\hat{\theta}}(x))^2 \big{)} ~ = ~ \sigma^2 + \mathbb{E}\big{(} (g(x) - \mathbb{E}(f_{\hat{\theta}}(x)))^2\big{)} + \mathbb{E}\big{(} (f_{\hat{\theta}}(x) - \mathbb{E}(f_{\hat{\theta}}(x)))^2 \big{)} $$
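As a small worked instance, suppose (purely for illustration; this model is not part of the derivation above) that we fit the constant function $f_\theta(x) = \theta$ and take $\hat{\theta}$ to be the average of the $n$ sampled responses $Y_i = g(x_i) + \epsilon_i$. Write $\bar{g} = \frac{1}{n}\sum_{i=1}^n g(x_i)$ for the average signal over the sample. Then $\mathbb{E}(\hat{\theta}) = \bar{g}$, and because the errors are independent with variance $\sigma^2$, $\mathbb{V}ar(\hat{\theta}) = \sigma^2/n$. Plugging into the decomposition gives

$$ \mathbb{E}\big{(} (Y - f_{\hat{\theta}}(x))^2 \big{)} ~ = ~ \sigma^2 + (\bar{g} - g(x))^2 + \frac{\sigma^2}{n} $$

Under these assumptions the model variance shrinks like $1/n$, while the squared bias depends only on how far the average signal over the sample is from $g(x)$.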