
Derivation of the Bias-Variance Decomposition

A. Adhikari, Data 100 Spring 2020

(slightly modified for Fall 2020)

Preliminary

This result will be used below. You don't have to know how to prove it.

If $V$ and $W$ are independent random variables, then $E(VW) = E(V)E(W)$.

Proof: We'll do this in the discrete finite case. Trust that it's true in greater generality.

The job is to calculate the weighted average of the values of VW, where the weights are the probabilities of those values. Here goes.

$$
\begin{aligned}
E(VW) ~&=~ \sum_v \sum_w vw\,P(V=v \text{ and } W=w) \\
&=~ \sum_v \sum_w vw\,P(V=v)P(W=w) ~~~~ \text{by independence} \\
&=~ \sum_v v\,P(V=v) \sum_w w\,P(W=w) \\
&=~ E(V)E(W)
\end{aligned}
$$
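As a quick numerical sanity check (not a substitute for the proof), here is a small simulation with two hypothetical independent discrete random variables; the values and probabilities below are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Two independent discrete random variables (hypothetical values and probabilities).
V = rng.choice([1, 2, 3], size=n, p=[0.2, 0.5, 0.3])
W = rng.choice([-1, 4], size=n, p=[0.6, 0.4])

print(np.mean(V * W))           # approximately E(VW)
print(np.mean(V) * np.mean(W))  # approximately E(V)E(W); should be close to the line above
```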

What's in this Notebook

  • The first part of what follows is an exposition of the definitions on the slides.
  • The derivation starts at Step 1.
  • Steps 1-3 are where the action is. Step 4 just puts together the results of Steps 2 and 3.

Assumptions

1. For each individual, the response is g(x)+ϵ where:

  • x consists of the fixed values of all the predictor variables for that individual
  • g is a fixed function, typically unknown; sometimes called the true function or the signal
  • $\epsilon$ is a random error with mean 0 and variance $\sigma^2$, independent across individuals; it is sometimes called noise and is unobservable, which means that we never get to see it

2. We have a random sample from the model above. We don't know $g$, so we fit a model of our choice.

3. A new individual comes along at $x$. Their response is $Y = g(x) + \epsilon$ for a brand new copy of $\epsilon$. This response is random because $\epsilon$ is random; the notation $Y$ hides the dependence on $x$, since the value of $Y$ is $g(x)$ plus the noise.

4. $\hat{Y}(x)$ is our model's predicted response for this individual. It is random because it depends on our sample, which is random.
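To make these assumptions concrete, here is a minimal simulation sketch. The true function $g$, the noise level $\sigma$, and the straight-line fit below are all hypothetical choices made only for illustration; nothing in the derivation depends on them.

```python
import numpy as np

rng = np.random.default_rng(2020)

# Hypothetical choices, just for illustration.
def g(x):
    """The true function / signal (unknown in practice)."""
    return np.sin(2 * x)

sigma = 0.3                       # SD of the noise epsilon
x_train = np.linspace(0, 3, 50)   # fixed predictor values for the sampled individuals

def draw_sample():
    """One random sample from the model: each response is g(x) + epsilon."""
    return g(x_train) + rng.normal(0, sigma, size=x_train.shape)

def fit_and_predict(y_train, x_new):
    """Fit a straight line by least squares; the value at x_new is one realization of Y_hat(x_new)."""
    slope, intercept = np.polyfit(x_train, y_train, deg=1)
    return slope * x_new + intercept

y = draw_sample()
print(fit_and_predict(y, x_new=1.5))   # varies from sample to sample
```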

Definitions

Observation Variance

The point corresponding to the new observation is $(x, Y)$ where $Y = g(x) + \epsilon$. Remember that $g(x)$ is a constant, $E(\epsilon) = 0$, and $\text{Var}(\epsilon) = \sigma^2$. It follows that

  • $E(Y) = g(x)$
  • $\text{Var}(Y) = \text{Var}(\epsilon) = \sigma^2$

That is why the observation variance is $\sigma^2$.

Note: Since $\epsilon$ is centered, that is, $E(\epsilon) = 0$, we have $\text{Var}(\epsilon) = E(\epsilon^2)$. So you will sometimes see the observation variance $\sigma^2$ written as $E(\epsilon^2)$ instead of $\text{Var}(Y)$ or $\text{Var}(\epsilon)$.
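Continuing the hypothetical simulation setup sketched under the assumptions, here is a short check that $\text{Var}(Y) = \sigma^2$ at a fixed $x$.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.3              # hypothetical noise SD, as in the earlier sketch
g_x = np.sin(2 * 1.5)    # g(x) at a fixed x; a constant

Y = g_x + rng.normal(0, sigma, size=1_000_000)   # many copies of Y = g(x) + epsilon
print(Y.var(), sigma ** 2)                       # the two should be close
```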

Model Bias

The bias of an estimator is the expected difference between the estimator and what it's trying to estimate.

For the new individual at x, the model bias is defined by

$$
\text{model bias} ~=~ E\big(\hat{Y}(x) - Y\big) ~=~ E\big(\hat{Y}(x)\big) - E(Y) ~=~ E\big(\hat{Y}(x)\big) - g(x)
$$

This is the difference between our prediction at x and the true signal at x, averaged over all possible samples.

The key observation is that bias is a constant (that is, a number), not a random variable. It is a systematic error in the estimate.
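Under the hypothetical simulation setup from the assumptions, the model bias at a point can be approximated by averaging the prediction over many simulated samples; the sketch below does this at a single point $x_0$.

```python
import numpy as np

rng = np.random.default_rng(3)

def g(x):                    # hypothetical true function, as in the earlier sketch
    return np.sin(2 * x)

sigma = 0.3
x_train = np.linspace(0, 3, 50)
x0 = 1.5

def predict_from_one_sample():
    """Draw one random sample, fit a straight line, and return Y_hat(x0)."""
    y = g(x_train) + rng.normal(0, sigma, size=x_train.shape)
    slope, intercept = np.polyfit(x_train, y, deg=1)
    return slope * x0 + intercept

preds = np.array([predict_from_one_sample() for _ in range(10_000)])
model_bias = preds.mean() - g(x0)   # approximately E(Y_hat(x0)) - g(x0): a single number
print(model_bias)
```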

Model Variance

We came up with our prediction ˆY(x) based on the model we chose to fit, using data from our random sample. Had that sample come out differently, our prediction might have been different. For each sample, we have a prediction, so the prediction is a random variable and thus has a mean and a variance.

The variance of the prediction $\hat{Y}(x)$ is called the model variance. By the definition of variance,

$$
\text{model variance} ~=~ E\Big(\big(\hat{Y}(x) - E(\hat{Y}(x))\big)^2\Big)
$$

Model Risk

This is the mean squared error of our prediction.

$$
\text{model risk} ~=~ E\Big(\big(Y - \hat{Y}(x)\big)^2\Big)
$$

Goal

Decompose the model risk into recognizable components.

Step 1 (Figure on Slide 26)

$$
\begin{aligned}
\text{model risk} ~&=~ E\big((Y - \hat{Y}(x))^2\big) \\
&=~ E\big((g(x) + \epsilon - \hat{Y}(x))^2\big) \\
&=~ E\big((\epsilon + (g(x) - \hat{Y}(x)))^2\big) \\
&=~ E(\epsilon^2) + 2E\big(\epsilon\,(g(x) - \hat{Y}(x))\big) + E\big((g(x) - \hat{Y}(x))^2\big)
\end{aligned}
$$

On the right hand side:

  • The first term is the observation variance $\sigma^2$.
  • The cross product term is 0 because $\epsilon$ is independent of $g(x) - \hat{Y}(x)$ and $E(\epsilon) = 0$; by the Preliminary result, $E\big(\epsilon\,(g(x) - \hat{Y}(x))\big) = E(\epsilon)\,E\big(g(x) - \hat{Y}(x)\big) = 0$.
  • The last term is the mean squared difference between our predicted value $\hat{Y}(x)$ and the value of the true function $g(x)$.

Step 2 (Figure on Slide 27)

At this stage we have

$$
\text{model risk} ~=~ \text{observation variance} + E\big((g(x) - \hat{Y}(x))^2\big)
$$

We don't yet have a good understanding of $g(x) - \hat{Y}(x)$. But we do understand the deviation $D_{\hat{Y}(x)} = \hat{Y}(x) - E(\hat{Y}(x))$. We know that

  • $E\big(D_{\hat{Y}(x)}\big) ~=~ 0$
  • $E\big(D_{\hat{Y}(x)}^2\big) ~=~$ model variance

So let's add and subtract $E(\hat{Y}(x))$ and see if that helps.

$$
g(x) - \hat{Y}(x) ~=~ \big(g(x) - E(\hat{Y}(x))\big) + \big(E(\hat{Y}(x)) - \hat{Y}(x)\big)
$$

The first term on the right hand side is $-1$ times the model bias at $x$. The second term is $-D_{\hat{Y}(x)}$. So

$$
g(x) - \hat{Y}(x) ~=~ -\text{model bias} - D_{\hat{Y}(x)}
$$

Step 3

Remember that the model bias at x is a constant, not a random variable. Think of it as your favorite number, say 10. Then

$$
\begin{aligned}
E\big((g(x) - \hat{Y}(x))^2\big) ~&=~ \text{model bias}^2 + 2(\text{model bias})E\big(D_{\hat{Y}(x)}\big) + E\big(D_{\hat{Y}(x)}^2\big) \\
&=~ \text{model bias}^2 + 0 + \text{model variance} \\
&=~ \text{model bias}^2 + \text{model variance}
\end{aligned}
$$
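For instance, if the model bias at $x$ really were 10, then $g(x) - \hat{Y}(x) = -10 - D_{\hat{Y}(x)}$, and the expansion above would read

$$
E\big((10 + D_{\hat{Y}(x)})^2\big) ~=~ 100 + 20\,E\big(D_{\hat{Y}(x)}\big) + E\big(D_{\hat{Y}(x)}^2\big) ~=~ 100 + \text{model variance}
$$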

Step 4: Bias-Variance Decomposition

In Step 2 we had

$$
\text{model risk} ~=~ \text{observation variance} + E\big((g(x) - \hat{Y}(x))^2\big)
$$

Step 3 showed

$$
E\big((g(x) - \hat{Y}(x))^2\big) ~=~ \text{model bias}^2 + \text{model variance}
$$

Thus we have shown the bias-variance decomposition

$$
\text{model risk} ~=~ \text{observation variance} + \text{model bias}^2 + \text{model variance}
$$

That is,

$$
E\big((Y - \hat{Y}(x))^2\big) ~=~ \sigma^2 + E\Big(\big(g(x) - E(\hat{Y}(x))\big)^2\Big) + E\Big(\big(\hat{Y}(x) - E(\hat{Y}(x))\big)^2\Big)
$$
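As a Monte Carlo check of this identity, the sketch below reuses the hypothetical setup from the assumptions (a sinusoidal $g$, Gaussian noise, a straight-line fit) and compares the two sides at a single point; they should agree up to simulation error.

```python
import numpy as np

rng = np.random.default_rng(100)

def g(x):                    # hypothetical true function
    return np.sin(2 * x)

sigma = 0.3
x_train = np.linspace(0, 3, 50)
x0 = 1.5
reps = 20_000

preds = np.empty(reps)       # Y_hat(x0), one per simulated sample
sq_errors = np.empty(reps)   # (Y - Y_hat(x0))^2, one per simulated sample
for i in range(reps):
    # A fresh random sample, a straight-line fit, and a prediction at x0.
    y = g(x_train) + rng.normal(0, sigma, size=x_train.shape)
    slope, intercept = np.polyfit(x_train, y, deg=1)
    preds[i] = slope * x0 + intercept
    # A brand new individual at x0: Y = g(x0) + epsilon.
    new_Y = g(x0) + rng.normal(0, sigma)
    sq_errors[i] = (new_Y - preds[i]) ** 2

model_risk = sq_errors.mean()
obs_var    = sigma ** 2
bias_sq    = (preds.mean() - g(x0)) ** 2
model_var  = preds.var()

print(model_risk)                       # left hand side
print(obs_var + bias_sq + model_var)    # right hand side; should be close
```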

Special Case: $\hat{Y}(x) = f_{\hat{\theta}}(x)$

In the case where we make our predictions by fitting some function $f$ that involves parameters $\theta$, our estimate $\hat{Y}$ is $f_{\hat{\theta}}$, where $\hat{\theta}$ has been estimated from the data and hence is random.

In the bias-variance decomposition

$$
E\big((Y - \hat{Y}(x))^2\big) ~=~ \sigma^2 + E\Big(\big(g(x) - E(\hat{Y}(x))\big)^2\Big) + E\Big(\big(\hat{Y}(x) - E(\hat{Y}(x))\big)^2\Big)
$$

just plug in the particular prediction $f_{\hat{\theta}}$ in place of the general prediction $\hat{Y}$:

$$
E\big((Y - f_{\hat{\theta}}(x))^2\big) ~=~ \sigma^2 + E\Big(\big(g(x) - E(f_{\hat{\theta}}(x))\big)^2\Big) + E\Big(\big(f_{\hat{\theta}}(x) - E(f_{\hat{\theta}}(x))\big)^2\Big)
$$