This course note was developed in Fall 2023. If you are taking this class in a future semester, please keep in mind that this note may not be up to date with course content for that semester.
Learning Outcomes
* Define a random variable in terms of its distribution
* Compute the expectation and variance of a random variable
* Gain familiarity with the Bernoulli and binomial random variables
In the past few lectures, we’ve examined the role of complexity in influencing model performance. We’ve considered model complexity in the context of a tradeoff between two competing factors: model variance and training error.
Thus far, our analysis has been mostly qualitative. We’ve acknowledged that our choice of model complexity needs to strike a balance between model variance and training error, but we haven’t yet discussed why exactly this tradeoff exists.
To better understand the origin of this tradeoff, we will need to introduce the language of random variables. The next two lectures on probability will be a brief digression from our work on modeling so we can build up the concepts needed to understand this so-called bias-variance tradeoff. Our roadmap for the next few lectures will be:
1. Random Variables: introduce random variables, considering the concepts of expectation, variance, and covariance
2. Estimators, Bias, and Variance: re-express the ideas of model variance and training error in terms of random variables and use this new perspective to investigate our choice of model complexity
Data 8 Recap
Recall the following concepts from Data 8:
1. Sample mean: the mean of your random sample
2. Central Limit Theorem: If you draw a large random sample with replacement, then, regardless of the population distribution, the probability distribution of the sample mean
   a. is roughly normal
   b. is centered at the population mean
   c. has an $SD = \frac{\text{population SD}}{\sqrt{\text{sample size}}}$
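As a quick sketch of the CLT in action (a minimal simulation; the exponential population, sample size of 400, and repetition count are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(size=100_000)  # a decidedly non-normal population

# Draw many random samples with replacement and record each sample mean
sample_means = np.array([
    rng.choice(population, size=400, replace=True).mean()
    for _ in range(2_000)
])

print(sample_means.mean(), population.mean())     # centered at the population mean
print(sample_means.std(), population.std() / 20)  # SD is close to population SD / sqrt(400)
```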
17.1 Random Variables and Distributions
Suppose we generate a set of random data, like a random sample from some population. A **random variable** is a *numerical function* of the randomness in the data. It is *random* since our sample was drawn at random; it is *variable* because its exact value depends on how this random sample came out. As such, the domain or input of our random variable is all possible (random) outcomes in a *sample space*, and its range or output is the number line. We typically denote random variables with uppercase letters, such as $X$ or $Y$.
17.1.1 Distribution
For any random variable $X$, we need to be able to specify 2 things:
1. Possible values: the set of values the random variable can take on.
2. Probabilities: the set of probabilities describing how the total probability of 100% is split over the possible values.
If $X$ is discrete (has a finite number of possible values), the probability that a random variable $X$ takes on the value $x$ is given by $P(X=x)$, and probabilities must sum to 1: $\sum_{\text{all } x} P(X=x) = 1$.
We can often display this using a probability distribution table, which you will see in the coin toss example below.
The **distribution** of a random variable $X$ is a description of how the total probability of 100% is split over all the possible values of $X$, and it fully defines a random variable. The distribution of a discrete random variable can also be represented using a histogram. If a variable is **continuous** – it can take on infinitely many values – we can illustrate its distribution using a density curve.
Probabilities are areas. For discrete random variables, the *area of the red bars* represents the probability that a discrete random variable $X$ falls within those values. For continuous random variables, the *area under the curve* represents the probability that a continuous random variable $Y$ falls within those values.
If we sum up the total area of the bars/under the density curve, we should get 100%, or 1.
17.1.2 Example: Tossing a Coin
To give a concrete example, let's formally define a fair coin toss. A fair coin can land on heads ($H$) or tails ($T$), each with a probability of 0.5. With these possible outcomes, we can define a random variable $X$ as: $$X = \begin{cases} 1, & \text{if the coin lands heads} \\ 0, & \text{if the coin lands tails} \end{cases}$$
$X$ is a function with a domain, or input, of $\{H, T\}$ and a range, or output, of $\{1, 0\}$. We can write this in function notation as $$X(H) = 1, \quad X(T) = 0$$ The probability distribution table of $X$ is given by:

| $x$ | $P(X=x)$ |
| --- | -------- |
| 0 | $\frac{1}{2}$ |
| 1 | $\frac{1}{2}$ |
Suppose we draw a random sample $s$ of size 3 from all students enrolled in Data 100. We can define $Y$ as the number of data science students in our sample. Its domain is all possible samples of size 3, and its range is $\{0, 1, 2, 3\}$.
We can show the distribution of $Y$ in the following tables. The table on the left lists all possible samples $s$ and the number of times they can appear ($Y(s)$). We can use this to calculate the values for the table on the right, a **probability distribution table**.
17.1.3 Simulation
Given a random variable $X$'s distribution, how could we generate/simulate a population? To do so, we can randomly pick values of $X$ according to its distribution using `np.random.choice` or `df.sample`.
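For instance, here is a minimal sketch simulating the fair coin toss from earlier (the sample size and seed are arbitrary choices):

```python
import numpy as np

# Simulate 10,000 draws of X, which is 1 (heads) or 0 (tails), each with probability 0.5
np.random.seed(42)
samples = np.random.choice([0, 1], size=10_000, p=[0.5, 0.5])

print(samples.mean())  # fraction of heads, close to 0.5
```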
17.2 Expectation and Variance
There are several ways to describe a random variable. The methods shown above – a table of all samples $s, X(s)$, a distribution table $P(X=x)$, and histograms – are all definitions that *fully describe* a random variable. Often, it is easier to describe a random variable using some *numerical summary* rather than fully defining its distribution. These numerical summaries are numbers that characterize some properties of the random variable. Because they give a "summary" of how the variable tends to behave, they are *not* random – think of them as static numbers that describe a certain property of the random variable. In Data 100, we will focus our attention on the expectation and variance of a random variable.
17.2.1 Expectation
The **expectation** of a random variable $X$ is the weighted average of the values of $X$, where the weights are the probabilities of each value occurring. There are two equivalent ways to compute the expectation:

1. Apply the weights one *sample* at a time: $$\mathbb{E}[X] = \sum_{\text{all possible } s} X(s) P(s)$$
2. Apply the weights one possible *value* at a time: $$\mathbb{E}[X] = \sum_{\text{all possible } x} x P(X=x)$$
We want to emphasize that the expectation is a *number*, not a random variable. Expectation is a generalization of the average, and it has the same units as the random variable. It is also the center of gravity of the probability distribution histogram: if we simulate the variable many times, the long-run average of the simulated values approaches $\mathbb{E}[X]$.
17.2.1.1 Example 1: Coin Toss
Going back to our coin toss example, we defined the random variable $X$ as: $$X = \begin{cases} 1, & \text{if the coin lands heads} \\ 0, & \text{if the coin lands tails} \end{cases}$$ We can calculate its expectation $\mathbb{E}[X]$ using the second method of applying the weights one possible value at a time: $$\begin{align} \mathbb{E}[X] &= \sum_{x} x P(X=x) \\ &= 1 \cdot 0.5 + 0 \cdot 0.5 \\ &= 0.5 \end{align}$$ Note that $\mathbb{E}[X] = 0.5$ is not a possible value of $X$; it's an average. **The expectation of $X$ does not need to be a possible value of $X$.**
17.2.1.2 Example 2
Consider the random variable $X$:

| $x$ | $P(X=x)$ |
| --- | -------- |
| 3 | 0.1 |
| 4 | 0.2 |
| 6 | 0.4 |
| 8 | 0.3 |
To calculate its expectation, $$\begin{align} \mathbb{E}[X] &= \sum_{x} x P(X=x) \\ &= 3 \cdot 0.1 + 4 \cdot 0.2 + 6 \cdot 0.4 + 8 \cdot 0.3 \\ &= 0.3 + 0.8 + 2.4 + 2.4 \\ &= 5.9 \end{align}$$ Again, note that $\mathbb{E}[X] = 5.9$ is not a possible value of $X$; it's an average. **The expectation of $X$ does not need to be a possible value of $X$.**
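As a sanity check, we might verify this weighted average numerically (a sketch; the array names are our own):

```python
import numpy as np

# Values and probabilities from the distribution table above
x = np.array([3, 4, 6, 8])
p = np.array([0.1, 0.2, 0.4, 0.3])

expectation = np.sum(x * p)  # E[X] = sum over x of x * P(X=x)
print(expectation)           # 5.9
```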
17.2.2 Variance
The **variance** of a random variable is a measure of its chance error. It is defined as the expected squared deviation from the expectation of $X$. Put more simply, variance asks: how far does $X$ typically vary from its average value, just by chance? What is the spread of $X$'s distribution? $$\text{Var}(X) = \mathbb{E}[(X-\mathbb{E}[X])^2]$$
The units of variance are the square of the units of $X$. To get it back to the right scale, use the standard deviation of $X$: $$\text{SD}(X) = \sqrt{\text{Var}(X)}$$
Like with expectation, variance is a number, not a random variable! Its main use is to quantify chance error.
By [Chebyshev's inequality](https://www.inferentialthinking.com/chapters/14/2/Variability.html#Chebychev's-Bounds), which you saw in Data 8, no matter what the shape of the distribution of $X$ is, the vast majority of the probability lies in the interval "expectation plus or minus a few SDs."
If we expand the square and use properties of expectation, we can re-express variance as the **computational formula for variance**: $$\text{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$$ This form is often more convenient to use when computing the variance of a variable by hand, and it is also useful in Mean Squared Error calculations: if $X$ is centered, meaning $\mathbb{E}[X] = 0$, then $\text{Var}(X) = \mathbb{E}[X^2]$.
Proof
$$\begin{align} \text{Var}(X) &= \mathbb{E}[(X-\mathbb{E}[X])^2] \\ &= \mathbb{E}[X^2 - 2X\mathbb{E}[X] + (\mathbb{E}[X])^2] \\ &= \mathbb{E}[X^2] - 2\mathbb{E}[X]\mathbb{E}[X] + (\mathbb{E}[X])^2 \\ &= \mathbb{E}[X^2] - (\mathbb{E}[X])^2 \end{align}$$
How do we compute $\mathbb{E}[X^2]$? Any function of a random variable is *also* a random variable – that means that by squaring $X$, we've created a new random variable. To compute $\mathbb{E}[X^2]$, we can simply apply our definition of expectation to the random variable $X^2$: $$\mathbb{E}[X^2] = \sum_{x} x^2 P(X = x)$$
17.2.3 Example: Dice
Let $X$ be the outcome of a single fair die roll. $X$ is a random variable with distribution $$P(X=x) = \begin{cases} \frac{1}{6}, & \text{if } x \in \{1, 2, 3, 4, 5, 6\} \\ 0, & \text{otherwise} \end{cases}$$
What's the expectation $\mathbb{E}[X]$?
$$\begin{align} \mathbb{E}[X] &= 1\left(\frac{1}{6}\right) + 2\left(\frac{1}{6}\right) + 3\left(\frac{1}{6}\right) + 4\left(\frac{1}{6}\right) + 5\left(\frac{1}{6}\right) + 6\left(\frac{1}{6}\right) \\ &= \frac{1}{6}(1 + 2 + 3 + 4 + 5 + 6) \\ &= \frac{7}{2} \end{align}$$
What's the variance $\text{Var}(X)$?
Using approach 1: $$\begin{align} \text{Var}(X) &= \frac{1}{6}\left(\left(1 - \frac{7}{2}\right)^2 + \left(2 - \frac{7}{2}\right)^2 + \left(3 - \frac{7}{2}\right)^2 + \left(4 - \frac{7}{2}\right)^2 + \left(5 - \frac{7}{2}\right)^2 + \left(6 - \frac{7}{2}\right)^2\right) \\ &= \frac{35}{12} \end{align}$$
Using approach 2: $$\mathbb{E}[X^2] = \sum_{x} x^2 P(X = x) = \frac{91}{6}$$ $$\text{Var}(X) = \frac{91}{6} - \left(\frac{7}{2}\right)^2 = \frac{35}{12}$$
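Both approaches are easy to check numerically; here is one possible sketch:

```python
import numpy as np

x = np.arange(1, 7)    # faces of a fair die
p = np.full(6, 1 / 6)  # uniform probabilities

e_x = np.sum(x * p)                        # E[X] = 7/2
var_def = np.sum((x - e_x) ** 2 * p)       # approach 1: E[(X - E[X])^2]
var_comp = np.sum(x ** 2 * p) - e_x ** 2   # approach 2: E[X^2] - (E[X])^2
print(e_x, var_def, var_comp)              # 3.5, 2.9166..., 2.9166... (= 35/12)
```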
17.3 Sums of Random Variables
Often, we will work with multiple random variables at the same time. A function of a random variable is also a random variable; if you create multiple random variables based on your sample, then functions of those random variables are also random variables.
For example, if $X_1, X_2, ..., X_n$ are random variables, then so are all of these:

* $X_n^2$
* $\#\{i : X_i > 10\}$
* $\text{max}(X_1, X_2, ..., X_n)$
* $\frac{1}{n} \sum_{i=1}^n (X_i - c)^2$
* $\frac{1}{n} \sum_{i=1}^n X_i$
17.3.1 Equal vs. Identically Distributed vs. i.i.d
Suppose that we have two random variables $X$ and $Y$:
* $X$ and $Y$ are **equal** if $X(s) = Y(s)$ for every sample $s$. Regardless of the exact sample drawn, $X$ is always equal to $Y$.
* $X$ and $Y$ are **identically distributed** if the distribution of $X$ is equal to the distribution of $Y$. We say "$X$ and $Y$ are equal in distribution." That is, $X$ and $Y$ take on the same set of possible values, and each of these possible values is taken with the same probability. On any specific sample $s$, identically distributed variables do *not* necessarily share the same value. If $X = Y$, then $X$ and $Y$ are identically distributed; however, the converse is not true. For example, if $X$ is the outcome of a fair die roll and $Y = 7 - X$, then $X$ and $Y$ are identically distributed but never equal on the same roll.
* $X$ and $Y$ are **independent and identically distributed (i.i.d.)** if
  1. The variables are identically distributed.
  2. Knowing the outcome of one variable does not influence our belief of the outcome of the other.
For example, let $X_1$ and $X_2$ be the numbers shown on two rolls of a fair die. $X_1$ and $X_2$ are i.i.d., so $X_1$ and $X_2$ have the same distribution. However, the sums $Y = X_1 + X_1 = 2X_1$ and $Z = X_1 + X_2$ have different distributions, even though they have the same expectation. In particular, $Y = 2X_1$ has a larger variance than $Z$.
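A short simulation (sample size and seed arbitrary) makes the difference in spread concrete:

```python
import numpy as np

rng = np.random.default_rng(100)
x1 = rng.integers(1, 7, size=100_000)  # first die (values 1 through 6)
x2 = rng.integers(1, 7, size=100_000)  # independent second die

y = 2 * x1   # Y = X1 + X1
z = x1 + x2  # Z = X1 + X2

print(y.mean(), z.mean())  # both close to E[Y] = E[Z] = 7
print(y.var(), z.var())    # Var(Y) = 4 Var(X1) ~ 11.67 vs. Var(Z) = 2 Var(X1) ~ 5.83
```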
17.3.2 Properties of Expectation
Instead of simulating full distributions, we often just compute expectation and variance directly. Recall the definition of expectation: $$\mathbb{E}[X] = \sum_{x} x P(X=x)$$ From it, we can derive some useful properties of expectation:
1. **Linearity of expectation.** The expectation of the linear transformation $aX+b$, where $a$ and $b$ are constants, is: $$\mathbb{E}[aX+b] = a\mathbb{E}[X] + b$$
Proof
$$\begin{align} \mathbb{E}[aX+b] &= \sum_{x} (ax + b) P(X=x) \\ &= \sum_{x} (ax P(X=x) + bP(X=x)) \\ &= a\sum_{x} x P(X=x) + b\sum_{x}P(X=x) \\ &= a\mathbb{E}[X] + b \cdot 1 \end{align}$$
2. Expectation is also linear in *sums* of random variables: $$\mathbb{E}[X+Y] = \mathbb{E}[X] + \mathbb{E}[Y]$$
Proof
$$\begin{align} \mathbb{E}[X+Y] &= \sum_{s} (X+Y)(s) P(s) \\ &= \sum_{s} (X(s)P(s) + Y(s)P(s)) \\ &= \sum_{s} X(s)P(s) + \sum_{s} Y(s)P(s) \\ &= \mathbb{E}[X] + \mathbb{E}[Y] \end{align}$$
3. If $g$ is a non-linear function, then in general, $$\mathbb{E}[g(X)] \neq g(\mathbb{E}[X])$$ For example, if $X$ is $-1$ or $1$ with equal probability, then $\mathbb{E}[X] = 0$, but $\mathbb{E}[X^2] = 1 \neq 0$.
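To see this concretely, here is a two-line check of the example above (the variable names are our own):

```python
import numpy as np

x = np.array([-1, 1])      # possible values of X
p = np.array([0.5, 0.5])   # equal probabilities

e_x = np.sum(x * p)        # E[X] = 0, so g(E[X]) = 0^2 = 0
e_gx = np.sum(x ** 2 * p)  # E[g(X)] = E[X^2] = 1
print(e_gx, e_x ** 2)      # 1.0 vs. 0.0 -- not equal
```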
17.3.3 Properties of Variance
Recall the definition of variance: $$\text{Var}(X) = \mathbb{E}[(X-\mathbb{E}[X])^2]$$ Combining it with the properties of expectation, we can derive some useful properties of variance:
1. Unlike expectation, variance is *non-linear*. The variance of the linear transformation $aX+b$ is: $$\text{Var}(aX+b) = a^2 \text{Var}(X)$$
* Subsequently, $$\text{SD}(aX+b) = |a| \text{SD}(X)$$
* The full proof of this fact can be found using the definition of variance. As general intuition, consider that $aX+b$ scales the variable $X$ by a factor of $a$, then shifts the distribution of $X$ by $b$ units.
Proof
We know that $$\mathbb{E}[aX+b] = a\mathbb{E}[X] + b$$
In order to compute $\text{Var}(aX+b)$, consider that a shift by $b$ units does not affect spread, so $\text{Var}(aX+b) = \text{Var}(aX)$.
Then,
$$\begin{align} \text{Var}(aX+b) &= \text{Var}(aX) \\ &= \mathbb{E}[(aX)^2] - (\mathbb{E}[aX])^2 \\ &= \mathbb{E}[a^2X^2] - (a\mathbb{E}[X])^2 \\ &= a^2(\mathbb{E}[X^2] - (\mathbb{E}[X])^2) \\ &= a^2\text{Var}(X) \end{align}$$
* Shifting the distribution by $b$ *does not* impact the *spread* of the distribution. Thus, $\text{Var}(aX+b) = \text{Var}(aX)$.
* Scaling the distribution by $a$ *does* impact the spread of the distribution.
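A quick empirical check of this property (the constants $a$ and $b$ are chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(1, 7, size=100_000)  # simulated fair die rolls

a, b = 3, 10
print((a * x + b).var(), a ** 2 * x.var())   # nearly identical: Var(aX+b) = a^2 Var(X)
print((a * x + b).mean(), a * x.mean() + b)  # and E[aX+b] = a E[X] + b
```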
2. Variance of sums of RVs is affected by the (in)dependence of the RVs: $$\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X,Y)$$ $$\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) \qquad \text{if } X, Y \text{ independent}$$
Proof
The variance of a sum is affected by the dependence between the two random variables that are being added. Let's expand out the definition of $\text{Var}(X + Y)$ to see what's going on.
To simplify the math, let $\mu_x = \mathbb{E}[X]$ and $\mu_y = \mathbb{E}[Y]$.
$$\begin{align} \text{Var}(X + Y) &= \mathbb{E}[(X + Y - \mathbb{E}[X+Y])^2] \\ &= \mathbb{E}[((X - \mu_x) + (Y - \mu_y))^2] \\ &= \mathbb{E}[(X - \mu_x)^2 + 2(X - \mu_x)(Y - \mu_y) + (Y - \mu_y)^2] \\ &= \mathbb{E}[(X - \mu_x)^2] + \mathbb{E}[(Y - \mu_y)^2] + 2\mathbb{E}[(X - \mu_x)(Y - \mu_y)] \\ &= \text{Var}(X) + \text{Var}(Y) + 2\mathbb{E}[(X - \mu_x)(Y - \mu_y)] \end{align}$$
17.3.4 Covariance and Correlation
We define the **covariance** of two random variables as the expected product of deviations from expectation. Put more simply, covariance is a generalization of variance to *two* random variables: $\text{Cov}(X, X) = \mathbb{E}[(X - \mathbb{E}[X])^2] = \text{Var}(X)$. $$\text{Cov}(X, Y) = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])]$$
We can treat the covariance as a measure of association. Remember the definition of correlation given when we first established SLR? $$r(X, Y) = \mathbb{E}\left[\left(\frac{X-\mathbb{E}[X]}{\text{SD}(X)}\right)\left(\frac{Y-\mathbb{E}[Y]}{\text{SD}(Y)}\right)\right] = \frac{\text{Cov}(X, Y)}{\text{SD}(X)\text{SD}(Y)}$$
It turns out we've been quietly using covariance for some time now! If $X$ and $Y$ are independent, then $\text{Cov}(X, Y) = 0$ and $r(X, Y) = 0$. Note, however, that the converse is not always true: $X$ and $Y$ could have $\text{Cov}(X, Y) = r(X, Y) = 0$ but not be independent.
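To tie these together, here is a possible sketch estimating covariance and correlation from simulated data (the particular $X$ and $Y$ are an invented example):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = x + rng.normal(size=100_000)  # Y depends on X, so Cov(X, Y) != 0

cov = np.mean((x - x.mean()) * (y - y.mean()))  # E[(X - E[X])(Y - E[Y])]
r = cov / (x.std() * y.std())                   # Cov(X, Y) / (SD(X) SD(Y))
print(cov, r)                                   # close to 1.0 and 1/sqrt(2) ~ 0.71
```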
17.3.5 Summary
* Let $X$ be a random variable with distribution $P(X=x)$.
  * $\mathbb{E}[X] = \sum_{x} x P(X=x)$
  * $\text{Var}(X) = \mathbb{E}[(X-\mathbb{E}[X])^2] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$
* Let $a$ and $b$ be scalar values.
  * $\mathbb{E}[aX+b] = a\mathbb{E}[X] + b$
  * $\text{Var}(aX+b) = a^2 \text{Var}(X)$
* Let $Y$ be another random variable.
  * $\mathbb{E}[X+Y] = \mathbb{E}[X] + \mathbb{E}[Y]$
  * $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X,Y)$