(slightly modified for Fall 2020)

To save this file as a PDF, in Jupyter go to File -> "Download As..."/"Save and Export Notebook As..." -> "PDF via LaTeX (.pdf)"/"PDF". The menu options may vary depending on whether you are using Jupyter Notebook or Jupyter Lab.

Before proceeding with this derivation, you should be familiar with the Random Variables lecture (Lecture 3 in Summer 2020). In particular, you really need to understand expectation and variance.

This result will be used below. You don't have to know how to prove it.

**If $V$ and $W$ are independent random variables then $\mathbb{E}(VW) = \mathbb{E}(V)\mathbb{E}(W)$**.

**Proof:** We'll do this in the discrete finite case. Trust that it's true in greater generality.

The job is to calculate the weighted average of the values of $VW$, where the weights are the probabilities of those values. Here goes.

\begin{align*}
\mathbb{E}(VW) ~ &= ~ \sum_v\sum_w vwP(V=v \text{ and } W=w) \\
&= ~ \sum_v\sum_w vwP(V=v)P(W=w) ~~~~ \text{by independence} \\
&= ~ \sum_v vP(V=v)\sum_w wP(W=w) \\
&= ~ \mathbb{E}(V)\mathbb{E}(W)
\end{align*}

- The first part of what follows is an exposition of the definitions on the slides.
- The derivation starts at Step 1.
- Steps 1-3 are where the action is. Step 4 just puts together the results of Steps 2 and 3.
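As a quick sanity check of the product rule $\mathbb{E}(VW) = \mathbb{E}(V)\mathbb{E}(W)$ proved above, here is a minimal sketch that computes both sides exactly in a small discrete case. The two fair dice and the helper name `expectation` are assumptions made for illustration; they are not part of the original derivation.

```python
from fractions import Fraction
from itertools import product

# Hypothetical example: V and W are independent fair six-sided dice.
faces = range(1, 7)
p = Fraction(1, 6)  # P(V = v) = P(W = w) = 1/6 for every face


def expectation(values_and_probs):
    """Weighted average of the values, weighted by their probabilities."""
    return sum(value * prob for value, prob in values_and_probs)


e_v = expectation((v, p) for v in faces)
e_w = expectation((w, p) for w in faces)

# By independence, P(V = v and W = w) = P(V = v) * P(W = w) = p * p.
e_vw = expectation((v * w, p * p) for v, w in product(faces, faces))

print(e_v, e_w, e_vw)  # 7/2 7/2 49/4
```

Exact fractions are used so that the two sides can be compared with `==` rather than up to rounding error.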

**1.** For each individual, the response is $g(x) + \epsilon$ where:

- $x$ consists of the **fixed** values of all the predictor variables for that individual
- $g$ is a **fixed** function, typically unknown; sometimes called the *true* function or the *signal*
- $\epsilon$ is a **random error** with mean 0 and variance $\sigma^2$, independent of all other individuals; it is sometimes called *noise* and is *unobservable*, which means that we never get to see it

**2.** We have a random sample from the model above. We don't know $g$ so we fit something of our choice.

**3.** A new individual comes along at $x$. Their response is $Y = g(x) + \epsilon$ for a brand new copy of $\epsilon$. This response is random because $\epsilon$ is random, and has a hidden $x$ in it because its value is $g(x)$ plus the noise.

**4.** $\hat{Y}(x)$ is our model's predicted response for this individual. It is random because it depends on our sample, which is random.

The point corresponding to the new observation is $(x, Y)$ where $Y = g(x) + \epsilon$. Remember that $g(x)$ is a constant, $\mathbb{E}(\epsilon) = 0$, and $\mathbb{V}ar(\epsilon) = \sigma^2$. It follows that

- $\mathbb{E}(Y) = g(x)$
- $\mathbb{V}ar(Y) = \mathbb{V}ar(\epsilon) = \sigma^2$

That is why $$ \text{observation variance} ~ = ~ \sigma^2 $$

**Note:** Since $\epsilon$ is centered, that is, $\mathbb{E}(\epsilon) = 0$, we have $\mathbb{V}ar(\epsilon) = \mathbb{E}(\epsilon^2)$. So you will sometimes see the observation variance $\sigma^2$ written as $\mathbb{E}(\epsilon^2)$ instead of $\mathbb{V}ar(Y)$ or $\mathbb{V}ar(\epsilon)$.
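The two facts above, $\mathbb{E}(Y) = g(x)$ and $\mathbb{V}ar(Y) = \sigma^2$, can be illustrated by simulation. The sketch below is only an illustration: the linear signal `g`, the value of `x`, the value of `sigma`, and the choice of normal noise are all assumptions. The derivation itself only needs the noise to have mean 0 and variance $\sigma^2$.

```python
import random

random.seed(0)


def g(x):
    # Hypothetical true signal, chosen only for illustration.
    return 2.0 + 3.0 * x


x = 1.5
sigma = 2.0
n = 200_000

# Each response is the fixed signal g(x) plus a fresh copy of the noise.
# Assumption: normal noise with mean 0 and SD sigma.
ys = [g(x) + random.gauss(0.0, sigma) for _ in range(n)]

mean_y = sum(ys) / n
var_y = sum((y - mean_y) ** 2 for y in ys) / n

print("average response:", mean_y, " signal g(x):", g(x))
print("variance of response:", var_y, " sigma^2:", sigma ** 2)
```

The empirical mean and variance should be close to $g(x)$ and $\sigma^2$ respectively, up to simulation error.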

The *bias* of an estimator is the expected difference between the estimator and what it's trying to estimate.

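For the new individual at $x$, the *model bias* is defined by

$$ \text{model bias} ~ = ~ \mathbb{E}\big{(}g(x) - \hat{Y}(x)\big{)} ~ = ~ g(x) - \mathbb{E}(\hat{Y}(x)) $$

This is the difference between the true signal at $x$ and our prediction at $x$, averaged over all possible samples. Some texts define the bias with the opposite sign, $\mathbb{E}(\hat{Y}(x)) - g(x)$; only the squared bias appears in the decomposition, so the sign convention does not affect the result.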

The key observation is that bias is a constant (that is, a number), not a random variable. It is a systematic error in the estimate.

We came up with our prediction $\hat{Y}(x)$ based on the model we chose to fit, using data from our random sample. Had that sample come out differently, our prediction might have been different. For each sample, we have a prediction, so the prediction is a random variable and thus has a mean and a variance.
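To make the idea of a prediction that varies from sample to sample concrete, here is a small simulation sketch. Everything specific in it is an assumption made for illustration: the quadratic signal `g`, the fixed predictor values `xs`, the noise level `sigma`, and the decision to fit a straight line by least squares (the helper `fit_line` is hypothetical, not from the lecture).

```python
import random

random.seed(1)


def g(x):
    # Hypothetical true signal (unknown in practice).
    return x ** 2


def fit_line(xs, ys):
    """Least-squares intercept and slope for simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


xs = [i / 10 for i in range(11)]  # fixed predictor values 0.0, 0.1, ..., 1.0
sigma = 0.5
x_new = 1.0

# Draw many samples; each sample produces its own prediction at x_new.
predictions = []
for _ in range(5000):
    ys = [g(x) + random.gauss(0.0, sigma) for x in xs]
    intercept, slope = fit_line(xs, ys)
    predictions.append(intercept + slope * x_new)

mean_pred = sum(predictions) / len(predictions)
var_pred = sum((p - mean_pred) ** 2 for p in predictions) / len(predictions)

print("mean of predictions:", mean_pred, " true signal:", g(x_new))
print("variance of predictions:", var_pred)
```

Because a line cannot follow a quadratic signal, the average prediction sits away from $g(x)$ even with many samples: that gap is the systematic error, while the spread of the predictions around their own average is the sample-to-sample variability.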

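The variance of the predictor $\hat{Y}(x)$ is called the *model variance*. By the definition of variance,

$$ \text{model variance} ~ = ~ \mathbb{V}ar(\hat{Y}(x)) ~ = ~ \mathbb{E}\big{(} (\hat{Y}(x) - \mathbb{E}(\hat{Y}(x)))^2 \big{)} $$

The *model risk* is the mean squared error of our prediction.

$$ \text{model risk} ~ = ~ \mathbb{E}\big{(} (Y - \hat{Y}(x))^2 \big{)} $$

Decompose the model risk into recognizable components.

\begin{align*}
\text{model risk} ~ &= ~ \mathbb{E}\big{(} (g(x) + \epsilon - \hat{Y}(x))^2 \big{)} \\
&= ~ \mathbb{E}(\epsilon^2) + 2\mathbb{E}\big{(}\epsilon(g(x) - \hat{Y}(x))\big{)} + \mathbb{E}\big{(}(g(x) - \hat{Y}(x))^2\big{)}
\end{align*}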

On the right hand side:

- The first term is the observation variance $\sigma^2$.
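- The cross product term is 0: $\epsilon$ belongs to the new individual and is independent of $g(x) - \hat{Y}(x)$, which depends only on the sample, so the expectation of the product is $\mathbb{E}(\epsilon)\mathbb{E}(g(x) - \hat{Y}(x)) = 0$ because $\mathbb{E}(\epsilon) = 0$.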
- The last term is the mean squared difference between our predicted value and the value of the true function at $x$

At this stage we have

$$ \text{model risk} ~ = ~ \text{observation variance} + \mathbb{E}\big{(}(g(x) - \hat{Y}(x))^2\big{)} $$

We don't yet have a good understanding of $g(x) - \hat{Y}(x)$. But we do understand the deviation $D_{\hat{Y}(x)} = \hat{Y}(x) - \mathbb{E}(\hat{Y}(x))$. We know that

- $\mathbb{E}(D_{\hat{Y}(x)}) ~ = ~ 0$
- $\mathbb{E}(D_{\hat{Y}(x)}^2) ~ = ~ \text{model variance}$
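The two properties of the deviation listed above can be checked on a toy distribution. In this sketch the five prediction values are made up, and treating them as equally likely is an assumption for illustration only.

```python
# Hypothetical finite distribution: five equally likely values of the prediction.
predictions = [0.7, 0.9, 1.1, 0.8, 1.0]
n = len(predictions)

mean_pred = sum(predictions) / n

# Deviation D = prediction minus its expectation.
deviations = [p - mean_pred for p in predictions]

mean_dev = sum(deviations) / n                   # E(D)
mean_sq_dev = sum(d ** 2 for d in deviations) / n  # E(D^2)

# Variance computed by a different route, E(X^2) - (E(X))^2, for comparison.
variance = sum(p ** 2 for p in predictions) / n - mean_pred ** 2

print("E(D):", mean_dev)
print("E(D^2):", mean_sq_dev, " variance:", variance)
```

Up to floating-point rounding, the first printed value is 0 and the last two agree.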

So let's add and subtract $\mathbb{E}(\hat{Y}(x))$ and see if that helps.

$$ g(x) - \hat{Y}(x) ~ = ~ \big{(}g(x) - \mathbb{E}(\hat{Y}(x))\big{)} + \big{(}\mathbb{E}(\hat{Y}(x)) - \hat{Y}(x)\big{)} $$

The first term on the right hand side is the model bias at $x$. The second term is $-D_{\hat{Y}(x)}$. So

$$ g(x) - \hat{Y}(x) ~ = ~ \text{model bias} - D_{\hat{Y}(x)} $$

Remember that the model bias at $x$ is a constant, not a random variable. Think of it as your favorite number, say 10. Then

\begin{align*}
\mathbb{E}\big{(}(g(x) - \hat{Y}(x))^2\big{)} ~ & = ~ \text{model bias}^2 - 2(\text{model bias})\mathbb{E}(D_{\hat{Y}(x)}) + \mathbb{E}(D_{\hat{Y}(x)}^2) \\
&= ~ \text{model bias}^2 - 0 + \text{model variance} \\
&= ~ \text{model bias}^2 + \text{model variance}
\end{align*}

In Step 2 we had

$$ \text{model risk} ~ = ~ \text{observation variance} + \mathbb{E}\big{(}(g(x) - \hat{Y}(x))^2\big{)} $$

Step 3 showed

$$ \mathbb{E}\big{(}(g(x) - \hat{Y}(x))^2\big{)} ~ = ~ \text{model bias}^2 + \text{model variance} $$

Thus we have shown the bias-variance decomposition

$$ \text{model risk} ~ = ~ \text{observation variance} + \text{model bias}^2 + \text{model variance} $$

That is,

$$ \mathbb{E}\big{(} (Y - \hat{Y}(x))^2 \big{)} ~ = ~ \sigma^2 + \mathbb{E}\big{(} (g(x) - \mathbb{E}(\hat{Y}(x)))^2\big{)} + \mathbb{E}\big{(} (\hat{Y}(x) - \mathbb{E}(\hat{Y}(x)))^2 \big{)} $$

In the case where we are making our predictions by fitting some function $f$ that involves parameters $\theta$, our estimate $\hat{Y}$ is $f_{\hat{\theta}}$ where $\hat{\theta}$ has been estimated from the data and hence is random.

In the bias-variance decomposition

$$ \mathbb{E}\big{(} (Y - \hat{Y}(x))^2 \big{)} ~ = ~ \sigma^2 + \mathbb{E}\big{(} (g(x) - \mathbb{E}(\hat{Y}(x)))^2\big{)} + \mathbb{E}\big{(} (\hat{Y}(x) - \mathbb{E}(\hat{Y}(x)))^2 \big{)} $$

just plug in the particular prediction $f_{\hat{\theta}}$ in place of the general prediction $\hat{Y}$:

$$ \mathbb{E}\big{(} (Y - f_{\hat{\theta}}(x))^2 \big{)} ~ = ~ \sigma^2 + \mathbb{E}\big{(} (g(x) - \mathbb{E}(f_{\hat{\theta}}(x)))^2\big{)} + \mathbb{E}\big{(} (f_{\hat{\theta}}(x) - \mathbb{E}(f_{\hat{\theta}}(x)))^2 \big{)} $$
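The decomposition can also be checked numerically. The sketch below is a simulation under stated assumptions, not part of the derivation: the signal `g`, the fixed predictor values `xs`, the noise level `sigma`, and the choice $f_{\theta}(x) = \theta_0 + \theta_1 x$ fitted by least squares (via the hypothetical helper `fit_line`) are all picked for illustration. It estimates the left hand side and each term on the right hand side from repeated samples.

```python
import random

random.seed(2)


def g(x):
    # Hypothetical true signal; unknown in practice.
    return x ** 2


def fit_line(xs, ys):
    """Least-squares estimates (theta0_hat, theta1_hat) for a straight line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


xs = [i / 10 for i in range(11)]  # fixed predictor values in the sample
sigma = 0.5                       # SD of the noise
x_new = 1.0                       # predictor value of the new individual
reps = 20_000

predictions = []
squared_errors = []
for _ in range(reps):
    # A fresh random sample, and the parameters estimated from it.
    ys = [g(x) + random.gauss(0.0, sigma) for x in xs]
    theta0_hat, theta1_hat = fit_line(xs, ys)
    y_hat = theta0_hat + theta1_hat * x_new
    # The new individual's response uses a brand new copy of the noise.
    y_new = g(x_new) + random.gauss(0.0, sigma)
    predictions.append(y_hat)
    squared_errors.append((y_new - y_hat) ** 2)

model_risk = sum(squared_errors) / reps
mean_pred = sum(predictions) / reps
model_bias = g(x_new) - mean_pred  # the sign is irrelevant once squared
model_variance = sum((p - mean_pred) ** 2 for p in predictions) / reps
observation_variance = sigma ** 2
decomposed = observation_variance + model_bias ** 2 + model_variance

print("model risk (simulated):", model_risk)
print("sigma^2 + bias^2 + variance:", decomposed)
```

The two printed numbers should agree up to simulation error. Swapping in a more flexible fit would typically shrink the bias term and enlarge the variance term, while $\sigma^2$ stays the same whatever model is fitted.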