Lecture 18 – Data 100, Spring 2024

Acknowledgments Page

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px




Which would you pick?

  • $\large Y_A = 10 X_1 + 10 X_2 $
  • $\large Y_B = \sum\limits_{i=1}^{20} X_i$
  • $\large Y_C = 20 X_1$

First, let's construct the probability distribution for a single fair coin, with heads coded as 1 and tails as 0. This will let us flip 20 IID coins later.

In [2]:
# First construct probability distribution for a single fair coin
p = 0.5
coin_df = pd.DataFrame({"x": [1, 0], # [Heads, Tails]
                        "P(X = x)": [p, 1 - p]})
coin_df
Out[2]:
   x  P(X = x)
0  1       0.5
1  0       0.5
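
As a quick check on this table, the mean and variance of a single flip can be computed directly from the distribution. This is a minimal sketch that rebuilds `coin_df` so it runs on its own; the names `expectation` and `variance` are introduced here for illustration and are not part of the lecture code.

```python
import pandas as pd

# Rebuild the single-coin distribution from the cell above
p = 0.5
coin_df = pd.DataFrame({"x": [1, 0],  # [Heads, Tails]
                        "P(X = x)": [p, 1 - p]})

# E[X] = sum over x of x * P(X = x)
expectation = (coin_df["x"] * coin_df["P(X = x)"]).sum()

# Var(X) = sum over x of (x - E[X])^2 * P(X = x)
variance = ((coin_df["x"] - expectation) ** 2 * coin_df["P(X = x)"]).sum()

print(expectation, variance)
```

For a fair coin these come out to $p = 0.5$ and $p(1 - p) = 0.25$.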

Choice A:

$\large Y_A = 10 X_1 + 10 X_2 $

A couple of ways to sample:

In [3]:
coin_df.sample(10, weights="P(X = x)", replace=True)["x"]
Out[3]:
0    1
0    1
1    0
0    1
0    1
0    1
0    1
0    1
0    1
0    1
Name: x, dtype: int64
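
The cell above draws ten flips, whereas a single value of $Y_A$ needs only two. As a rough sketch (the names `two_flips` and `y_a` are hypothetical, introduced for illustration), one draw of $Y_A$ could be built from the same `sample` call like this:

```python
import pandas as pd

# Rebuild the single-coin distribution so the example is self-contained
p = 0.5
coin_df = pd.DataFrame({"x": [1, 0],  # [Heads, Tails]
                        "P(X = x)": [p, 1 - p]})

# Draw X_1 and X_2 independently (with replacement) from the coin distribution
two_flips = coin_df.sample(2, weights="P(X = x)", replace=True)["x"]

# Y_A = 10 * X_1 + 10 * X_2
y_a = 10 * two_flips.sum()
print(y_a)
```

Each run prints 0, 10, or 20, depending on how many of the two flips land heads.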
In [4]:
N = 10000

np.random.rand(N,2) < p
Out[4]:
array([[False, False],
       [ True,  True],
       [ True,  True],
       ...,
       [False,  True],
       [ True,  True],
       [ True,  True]])
In [5]:
sim_flips = pd.DataFrame(
    {"Choice A": np.sum((np.random.rand(N,2) < p) * 10, axis=1)})
sim_flips
Out[5]:
      Choice A
0           20
1           20
2           20
3           10
4           20
...        ...
9995        10
9996        20
9997        20
9998         0
9999        20

10000 rows × 1 columns

Choice B:

$\large Y_B = \sum\limits_{i=1}^{20} X_i$

In [6]:
sim_flips["Choice B"] = np.sum((np.random.rand(N,20) < p), axis=1)
sim_flips
Out[6]:
      Choice A  Choice B
0           20        10
1           20        16
2           20         9
3           10        11
4           20        12
...        ...       ...
9995        10        10
9996        20         9
9997        20         7
9998         0        11
9999        20        14

10000 rows × 2 columns
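
Because $Y_B$ is a sum of 20 IID Bernoulli($p$) flips, it follows a Binomial(20, $p$) distribution, so it can also be drawn directly. The sketch below is not the lecture's approach: it uses NumPy's `default_rng` generator with an arbitrarily chosen seed, and the names `y_b_sum` and `y_b_binom` are hypothetical.

```python
import numpy as np

N = 10000
p = 0.5

rng = np.random.default_rng(42)  # seed chosen arbitrarily for reproducibility

# Sum of 20 IID Bernoulli(p) flips, mirroring the cell above
y_b_sum = np.sum(rng.random((N, 20)) < p, axis=1)

# The same distribution, drawn directly as Binomial(20, p)
y_b_binom = rng.binomial(n=20, p=p, size=N)

# Both approaches should give similar means and variances
print(y_b_sum.mean(), y_b_binom.mean())
print(y_b_sum.var(), y_b_binom.var())
```

Both sets of draws should have a mean near 10 and a variance near 5, up to simulation noise.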

Choice C:

$\large Y_C = 20 X_1$

In [7]:
# One flip per simulation; a 1-D draw avoids assigning an (N, 1) array to a column
sim_flips["Choice C"] = 20 * (np.random.rand(N) < p)
sim_flips
Out[7]:
      Choice A  Choice B  Choice C
0           20        10        20
1           20        16         0
2           20         9         0
3           10        11         0
4           20        12         0
...        ...       ...       ...
9995        10        10         0
9996        20         9        20
9997        20         7         0
9998         0        11         0
9999        20        14        20

10000 rows × 3 columns


If you're curious what these distributions look like, here are histograms of the 10,000 simulated values for each choice:
In [8]:
px.histogram(sim_flips.melt(), x="value", facet_row="variable", 
             barmode="overlay", histnorm="probability",
             title="Empirical Distributions",
             width=600, height=600)
In [9]:
pd.DataFrame([
    sim_flips.mean().rename("Simulated Mean"),
    sim_flips.var().rename("Simulated Var"),
    np.sqrt(sim_flips.var()).rename("Simulated SD")
])
Out[9]:
                 Choice A  Choice B    Choice C
Simulated Mean  10.214000  9.979400   10.092000
Simulated Var   49.879192  4.956271  100.001536
Simulated SD     7.062520  2.226268   10.000077
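
The simulated values can be compared against a short derivation. This is a sketch assuming, as in the simulation, that the $X_i$ are IID Bernoulli($p$) with $p = 0.5$, so $E[X_i] = p = 0.5$ and $\text{Var}(X_i) = p(1 - p) = 0.25$. It uses linearity of expectation, $\text{Var}(aX) = a^2 \text{Var}(X)$, and the fact that variances of independent variables add.

$$E[Y_A] = 10 E[X_1] + 10 E[X_2] = 10, \qquad \text{Var}(Y_A) = 10^2 \text{Var}(X_1) + 10^2 \text{Var}(X_2) = 50$$

$$E[Y_B] = \sum_{i=1}^{20} E[X_i] = 10, \qquad \text{Var}(Y_B) = \sum_{i=1}^{20} \text{Var}(X_i) = 5$$

$$E[Y_C] = 20 E[X_1] = 10, \qquad \text{Var}(Y_C) = 20^2 \text{Var}(X_1) = 100$$

The corresponding standard deviations are $\sqrt{50} \approx 7.07$, $\sqrt{5} \approx 2.24$, and $10$. All three choices have the same expected value, but their spreads differ, which agrees with the simulated table above up to sampling noise.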