Before proceeding with this derivation, you should be familiar with the Random Variables lecture (Lecture 3 in Summer 2020). In particular, you really need to understand expectation and variance.
This result will be used below. You don't have to know how to prove it.
If $V$ and $W$ are independent random variables then $\mathbb{E}(VW) = \mathbb{E}(V)\mathbb{E}(W)$.
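Before the proof, the identity can be checked exactly on a small example. The sketch below uses two made-up finite distributions (`dist_v` and `dist_w` are illustrative and not from the lecture) and builds the joint probabilities by multiplying the marginals, which is what independence means. It therefore confirms the algebra of the result on one example rather than proving it.

```python
from fractions import Fraction
from itertools import product

# Hypothetical finite distributions (value -> probability); not from the lecture.
dist_v = {1: Fraction(1, 2), 2: Fraction(1, 3), 5: Fraction(1, 6)}
dist_w = {-1: Fraction(1, 4), 0: Fraction(1, 4), 3: Fraction(1, 2)}

def expectation(dist):
    """Weighted average of the values, weighted by their probabilities."""
    return sum(value * prob for value, prob in dist.items())

# Under independence, P(V=v and W=w) = P(V=v) * P(W=w).
e_vw = sum(v * w * dist_v[v] * dist_w[w] for v, w in product(dist_v, dist_w))
e_v = expectation(dist_v)
e_w = expectation(dist_w)

# Exact rational arithmetic, so the two sides can be compared with ==.
print(e_vw, e_v * e_w)
```

Using `Fraction` avoids floating point rounding, so the equality holds exactly rather than approximately.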
Proof: We'll do this in the discrete finite case. Trust that it's true in greater generality.
The job is to calculate the weighted average of the values of $VW$, where the weights are the probabilities of those values. Here goes.
$$ \begin{align*} \mathbb{E}(VW) ~ &= ~ \sum_v\sum_w vwP(V=v \text{ and } W=w) \\ &= ~ \sum_v\sum_w vwP(V=v)P(W=w) ~~~~ \text{by independence} \\ &= ~ \sum_v vP(V=v)\sum_w wP(W=w) \\ &= ~ \mathbb{E}(V)\mathbb{E}(W) \end{align*} $$

1. For each individual, the response is $g(x) + \epsilon$, where $g$ is a fixed (non-random) function and $\epsilon$ is random noise with $\mathbb{E}(\epsilon) = 0$ and $\mathbb{V}ar(\epsilon) = \sigma^2$.
2. We have a random sample from the model above. We don't know $g$ so we fit something of our choice.
3. A new individual comes along at $x$. Their response is $Y = g(x) + \epsilon$ for a brand new copy of $\epsilon$. This response is random because $\epsilon$ is random, and has a hidden $x$ in it because its value is $g(x)$ plus the noise.
4. $\hat{Y}(x)$ is our model's predicted response for this individual. It is random because it depends on our sample, which is random.
The point corresponding to the new observation is $(x, Y)$ where $Y = g(x) + \epsilon$. Remember that $g(x)$ is a constant, $\mathbb{E}(\epsilon) = 0$, and $\mathbb{V}ar(\epsilon) = \sigma^2$. It follows that
$$ \mathbb{E}(Y) ~ = ~ g(x) ~~~~ \text{and} ~~~~ \mathbb{V}ar(Y) ~ = ~ \mathbb{V}ar(\epsilon) ~ = ~ \sigma^2 $$
That is why
$$ \text{observation variance} ~ = ~ \sigma^2 $$
Note: Since $\epsilon$ is centered, that is, $\mathbb{E}(\epsilon) = 0$, we have $\mathbb{V}ar(\epsilon) = \mathbb{E}(\epsilon^2)$. So you will sometimes see the observation variance $\sigma^2$ written as $\mathbb{E}(\epsilon^2)$ instead of $\mathbb{V}ar(Y)$ or $\mathbb{V}ar(\epsilon)$.
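A small simulation can illustrate the observation variance. The sketch below assumes a particular signal `g`, a noise level `SIGMA`, and Gaussian noise. None of these come from the notes, which only require the noise to have mean 0 and variance $\sigma^2$.

```python
import random
import statistics

def g(x):
    # Hypothetical true signal; the notes leave g unspecified.
    return 2.0 + 0.5 * x * x

SIGMA = 1.5      # assumed noise standard deviation
X_NEW = 3.0      # assumed location of the new individual
rng = random.Random(0)

# Many independent copies of Y = g(x) + epsilon at the same x.
responses = [g(X_NEW) + rng.gauss(0.0, SIGMA) for _ in range(100_000)]

mean_y = statistics.fmean(responses)
var_y = statistics.pvariance(responses)

print(mean_y, g(X_NEW))     # the average response should be close to g(x)
print(var_y, SIGMA ** 2)    # the variance of the responses should be close to sigma^2
```

Changing `g` or `X_NEW` shifts the mean of the responses but should leave their variance near `SIGMA ** 2`, since $g(x)$ is a constant.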
The bias of an estimator is the expected difference between the estimator and what it's trying to estimate.
For the new individual at $x$, the model bias is defined by
$$ \text{model bias} ~ = ~ \mathbb{E}(\hat{Y}(x) - Y) ~ = ~ \mathbb{E}(\hat{Y}(x)) - \mathbb{E}(Y) ~ = ~ \mathbb{E}(\hat{Y}(x)) - g(x) $$
This is the difference between our prediction at $x$ and the true signal at $x$, averaged over all possible samples.
The key observation is that bias is a constant (that is, a number), not a random variable. It is a systematic error in the estimate.
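To see that the bias is a single number, one can approximate $\mathbb{E}(\hat{Y}(x))$ by averaging predictions over many simulated samples. The sketch below is only an illustration under assumptions that are not in the notes: a quadratic signal `g`, Gaussian noise, fixed feature values `XS`, and a deliberately crude model that predicts the average response of the sample at every $x$.

```python
import random
import statistics

def g(x):
    # Hypothetical true signal; the notes leave g unspecified.
    return 2.0 + 0.5 * x * x

SIGMA = 1.5                      # assumed noise standard deviation
XS = [0.0, 1.0, 2.0, 3.0, 4.0]   # assumed fixed feature values in every sample
X_NEW = 3.0                      # assumed location of the new individual
REPS = 20_000                    # number of simulated samples

rng = random.Random(0)

def constant_model_prediction(rng):
    """Draw one sample and predict with the average of its responses."""
    ys = [g(x) + rng.gauss(0.0, SIGMA) for x in XS]
    return statistics.fmean(ys)  # the same prediction is used at every x

predictions = [constant_model_prediction(rng) for _ in range(REPS)]

# model bias = E(Y_hat(x)) - g(x), with the expectation replaced by an average
bias_estimate = statistics.fmean(predictions) - g(X_NEW)

# For this simple model the exact bias is the average of g over XS, minus g(x).
bias_exact = statistics.fmean([g(x) for x in XS]) - g(X_NEW)

print(bias_estimate, bias_exact)
```

Each individual prediction varies from sample to sample, but their long-run average, and hence the bias, settles at one fixed value.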
We came up with our prediction $\hat{Y}(x)$ based on the model we chose to fit, using data from our random sample. Had that sample come out differently, our prediction might have been different. For each sample, we have a prediction, so the prediction is a random variable and thus has a mean and a variance.
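The following sketch makes the randomness of the prediction visible by refitting the same kind of model to several simulated samples. The signal `g`, the noise level, the feature values `XS`, and the choice of a least-squares line are all assumptions made for illustration.

```python
import random

def g(x):
    # Hypothetical true signal; the notes leave g unspecified.
    return 2.0 + 0.5 * x * x

SIGMA = 1.5                        # assumed noise standard deviation
XS = [0.5 * i for i in range(9)]   # assumed feature values 0.0, 0.5, ..., 4.0
X_NEW = 3.0                        # assumed location of the new individual

def fit_line(xs, ys):
    """Return the least-squares (intercept, slope) for the points (xs, ys)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

rng = random.Random(1)
predictions = []
for _ in range(5):
    # A fresh random sample from the model: same xs, new noise.
    ys = [g(x) + rng.gauss(0.0, SIGMA) for x in XS]
    intercept, slope = fit_line(XS, ys)
    predictions.append(intercept + slope * X_NEW)

# Five samples give five different predictions at the same x.
print(predictions)
```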
The variance of the predictor $\hat{Y}(x)$ is called the model variance. By the definition of variance,
$$ \text{model variance} ~ = ~ \mathbb{E}\big{(} (\hat{Y}(x) - \mathbb{E}(\hat{Y}(x)))^2 \big{)} $$
The model risk is the mean squared error of our prediction.
$$ \text{model risk} ~ = ~ \mathbb{E}\big{(} (Y - \hat{Y}(x))^2 \big{)} $$
The goal now is to decompose the model risk into recognizable components.
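Step 1. Write the prediction error in terms of the noise and the signal. Since $Y = g(x) + \epsilon$,
$$ Y - \hat{Y}(x) ~ = ~ \epsilon + (g(x) - \hat{Y}(x)) $$
Step 2. Square both sides and take expectations:
$$ \text{model risk} ~ = ~ \mathbb{E}(\epsilon^2) + 2\mathbb{E}\big{(}\epsilon(g(x) - \hat{Y}(x))\big{)} + \mathbb{E}\big{(}(g(x) - \hat{Y}(x))^2\big{)} $$
On the right hand side:
- The first term is the observation variance $\sigma^2$.
- The cross product term is 0. The new $\epsilon$ is independent of the sample and hence of $g(x) - \hat{Y}(x)$, so by the product result above the expectation factors as $\mathbb{E}(\epsilon)\mathbb{E}(g(x) - \hat{Y}(x))$, and $\mathbb{E}(\epsilon) = 0$.
- The last term is the mean squared difference between our prediction and the true signal at $x$.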
At this stage we have
$$ \text{model risk} ~ = ~ \text{observation variance} + \mathbb{E}\big{(}(g(x) - \hat{Y}(x))^2\big{)} $$
Step 3. We don't yet have a good understanding of $g(x) - \hat{Y}(x)$. But we do understand the deviation $D_{\hat{Y}(x)} = \hat{Y}(x) - \mathbb{E}(\hat{Y}(x))$. We know that
$$ \mathbb{E}(D_{\hat{Y}(x)}) ~ = ~ 0 ~~~~ \text{and} ~~~~ \mathbb{E}(D_{\hat{Y}(x)}^2) ~ = ~ \text{model variance} $$
So let's add and subtract $\mathbb{E}(\hat{Y}(x))$ and see if that helps.
$$ g(x) - \hat{Y}(x) ~ = ~ (g(x) - \mathbb{E}(\hat{Y}(x))) + (\mathbb{E}(\hat{Y}(x)) - \hat{Y}(x)) $$
The first term on the right hand side is the negative of the model bias at $x$, because the model bias is $\mathbb{E}(\hat{Y}(x)) - g(x)$. The second term is $-D_{\hat{Y}(x)}$. So
$$ g(x) - \hat{Y}(x) ~ = ~ -\text{model bias} - D_{\hat{Y}(x)} $$
Remember that the model bias at $x$ is a constant, not a random variable. Think of it as your favorite number, say 10. Then
$$ \begin{align*} \mathbb{E}\big{(}(g(x) - \hat{Y}(x))^2\big{)} ~ & = ~ \text{model bias}^2 + 2(\text{model bias})\mathbb{E}(D_{\hat{Y}(x)}) + \mathbb{E}(D_{\hat{Y}(x)}^2) \\ &= ~ \text{model bias}^2 + 0 + \text{model variance} \\ &= ~ \text{model bias}^2 + \text{model variance} \end{align*} $$
In Step 2 we had
$$ \text{model risk} ~ = ~ \text{observation variance} + \mathbb{E}\big{(}(g(x) - \hat{Y}(x))^2\big{)} $$
Step 3 showed
$$ \mathbb{E}\big{(}(g(x) - \hat{Y}(x))^2\big{)} ~ = ~ \text{model bias}^2 + \text{model variance} $$
Thus we have shown the bias-variance decomposition
$$ \text{model risk} ~ = ~ \text{observation variance} + \text{model bias}^2 + \text{model variance} $$
That is,
$$ \mathbb{E}\big{(} (Y - \hat{Y}(x))^2 \big{)} ~ = ~ \sigma^2 + \mathbb{E}\big{(} (g(x) - \mathbb{E}(\hat{Y}(x)))^2\big{)} + \mathbb{E}\big{(} (\hat{Y}(x) - \mathbb{E}(\hat{Y}(x)))^2 \big{)} $$
In the case where we are making our predictions by fitting some function $f$ that involves parameters $\theta$, our estimate $\hat{Y}$ is $f_{\hat{\theta}}$ where $\hat{\theta}$ has been estimated from the data and hence is random.
In the bias-variance decomposition
$$ \mathbb{E}\big{(} (Y - \hat{Y}(x))^2 \big{)} ~ = ~ \sigma^2 + \mathbb{E}\big{(} (g(x) - \mathbb{E}(\hat{Y}(x)))^2\big{)} + \mathbb{E}\big{(} (\hat{Y}(x) - \mathbb{E}(\hat{Y}(x)))^2 \big{)} $$
just plug in the particular prediction $f_{\hat{\theta}}$ in place of the general prediction $\hat{Y}$:
$$ \mathbb{E}\big{(} (Y - f_{\hat{\theta}}(x))^2 \big{)} ~ = ~ \sigma^2 + \mathbb{E}\big{(} (g(x) - \mathbb{E}(f_{\hat{\theta}}(x)))^2\big{)} + \mathbb{E}\big{(} (f_{\hat{\theta}}(x) - \mathbb{E}(f_{\hat{\theta}}(x)))^2 \big{)} $$
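The decomposition can be checked numerically by simulation. The sketch below is an illustration under assumptions that are not in the notes: a quadratic signal `g`, Gaussian noise with standard deviation `SIGMA`, fixed feature values `XS`, and a straight line $f_{\theta}(x) = \theta_0 + \theta_1 x$ fitted by least squares. Each expectation is replaced by an average over many simulated samples, so the two sides agree only approximately.

```python
import random
import statistics

def g(x):
    # Hypothetical true signal; the notes leave g unspecified.
    return 2.0 + 0.5 * x * x

SIGMA = 1.5                        # assumed noise standard deviation
XS = [0.5 * i for i in range(9)]   # assumed feature values 0.0, 0.5, ..., 4.0
X_NEW = 3.0                        # assumed location of the new individual
REPS = 50_000                      # number of simulated samples

def fit_line(xs, ys):
    """Return the least-squares (intercept, slope) for the points (xs, ys)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

rng = random.Random(0)
predictions = []       # f_theta_hat(x), one per simulated sample
squared_errors = []    # (Y - f_theta_hat(x))^2, one per simulated sample
for _ in range(REPS):
    ys = [g(x) + rng.gauss(0.0, SIGMA) for x in XS]   # random sample
    intercept, slope = fit_line(XS, ys)               # theta_hat
    y_hat = intercept + slope * X_NEW                 # prediction at x
    y_new = g(X_NEW) + rng.gauss(0.0, SIGMA)          # brand new copy of epsilon
    predictions.append(y_hat)
    squared_errors.append((y_new - y_hat) ** 2)

risk = statistics.fmean(squared_errors)               # model risk
bias = statistics.fmean(predictions) - g(X_NEW)       # model bias
variance = statistics.pvariance(predictions)          # model variance
decomposition = SIGMA ** 2 + bias ** 2 + variance

print(risk, decomposition)   # the two values should be close
```

A line cannot follow the curvature of this `g`, so the bias is nonzero here. Swapping in a more flexible model would be expected to shrink the bias term while increasing the variance term.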