🎲 Lecture 17 – Data 100, Summer 2025

Data 100, Summer 2025

Acknowledgments Page

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt

plt.rcParams['figure.dpi'] = 300

A Random Variable $X$

This section just replicates figures in the lecture. We won't go through it together.

Our probability distribution of $X$, shown as a table:

In [2]:
# Our random variable X
dist_df = pd.DataFrame({"x": [3, 4, 6, 8],
                        "P(X = x)": [0.1, 0.2, 0.4, 0.3]})
dist_df
Out[2]:
x P(X = x)
0 3 0.1
1 4 0.2
2 6 0.4
3 8 0.3
In [3]:
fig = px.bar(dist_df, x="x", y="P(X = x)", title="Distribution of X")
# fig.write_image("distX.png", "png",scale=2)
fig

Let's use this probability distribution to generate a table of $X_i$'s (i.e., simulated draws from the probability distribution above).

In [4]:
N = 80000
samples = np.random.choice(
    dist_df["x"], # Draw from these choiecs
    size=N, # This many times
    p=dist_df["P(X = x)"]) # According to this distribution

sim_df = pd.DataFrame({"X": samples})
sim_df
Out[4]:
X
0 3
1 8
2 8
3 8
4 4
... ...
79995 6
79996 6
79997 4
79998 6
79999 6

80000 rows × 1 columns



Let's check how well this simulated sample matches our probability distribution!

In [5]:
fig = px.histogram(sim_df, x="X", title="Empirical distribution of X", 
                   histnorm="probability")
# fig.write_image("empirical_dist.png", "png",scale=2)
fig
In [6]:
print("Simulated E[X]:", sim_df['X'].mean())
print("Simulated Var[X]:", sim_df['X'].var())
Simulated E[X]: 5.9017125
Simulated Var[X]: 2.8925632243840544
In [7]:
E_x = dist_df["x"] @ dist_df["P(X = x)"]
print("E[X]:",E_x)
E[X]: 5.9
In [8]:
Var_x = dist_df["x"]**2 @ dist_df["P(X = x)"] - E_x**2
print("Var[X]:", Var_x)
Var[X]: 2.8900000000000006
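
For comparison, here is the same calculation written out by hand from the probability table above:

$$ E[X] = \sum_x x \cdot P(X = x) = 3(0.1) + 4(0.2) + 6(0.4) + 8(0.3) = 5.9 $$

$$ \text{Var}[X] = E[X^2] - E[X]^2 = \big(9(0.1) + 16(0.2) + 36(0.4) + 64(0.3)\big) - 5.9^2 = 37.7 - 34.81 = 2.89 $$

The simulated mean and variance above agree with these values up to sampling error.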




🎲 🎲 Sum of 2 Dice Rolls

Here's where the lecture demo starts!

Here's the distribution of a single die roll:

In [9]:
roll_df = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6],
                        "P(X = x)": np.ones(6)/6})
roll_df
Out[9]:
x P(X = x)
0 1 0.166667
1 2 0.166667
2 3 0.166667
3 4 0.166667
4 5 0.166667
5 6 0.166667
In [10]:
fig = px.bar(roll_df, x="x", y="P(X = x)", title="Distribution of X")
# fig.write_image("die.png", "png",scale=2)
fig

Let $X_1, X_2$ be the outcomes of two dice rolls. Note that $X_1$ and $X_2$ are i.i.d. (independent and identically distributed).

Below, we simulate an 80,000-row table of $X_1, X_2$ values.

The code uses np.random.choice(arr, size, p) (documentation), where arr is the array of values to draw from and p is the probability associated with choosing each value.

In [11]:
N = 80000

# roll_df["x"] are the values 1 through 6
# roll_df["P(X=x)"] is a column of a constant 1/6, since all faces are 
# equally likely
sim_rolls_df = pd.DataFrame({
    "X_1": np.random.choice(roll_df["x"], size = N, p = roll_df["P(X = x)"]),
    "X_2": np.random.choice(roll_df["x"], size = N, p = roll_df["P(X = x)"])
})

sim_rolls_df
Out[11]:
X_1 X_2
0 1 3
1 5 5
2 4 6
3 4 4
4 5 6
... ... ...
79995 2 5
79996 6 5
79997 3 6
79998 3 6
79999 4 4

80000 rows × 2 columns

We can use our simulated values of $X_1, X_2$ to create new columns $2 X_1$ and $X_1 + X_2$:

In [12]:
sim_rolls_df['2 X_1'] = 2 * sim_rolls_df['X_1']
sim_rolls_df['X_1 + X_2'] = sim_rolls_df['X_1'] + sim_rolls_df['X_2']
sim_rolls_df
Out[12]:
X_1 X_2 2 X_1 X_1 + X_2
0 1 3 2 4
1 5 5 10 10
2 4 6 8 10
3 4 4 8 8
4 5 6 10 11
... ... ... ... ...
79995 2 5 4 7
79996 6 5 12 11
79997 3 6 6 9
79998 3 6 6 9
79999 4 4 8 8

80000 rows × 4 columns

Now that we have simulated samples of $2 X_1$ and $X_1 + X_2$, we can plot histograms to see their distributions:

In [13]:
plot_df = sim_rolls_df[["2 X_1", "X_1 + X_2"]].melt()

# Show 5 random rows of plot_df
display(plot_df.sample(5))

px.histogram(
  plot_df,
  x="value", 
  color="variable", 
  barmode="overlay", 
  histnorm="probability",
  title="Empirical Distributions"
)
variable value
157622 X_1 + X_2 7
92053 X_1 + X_2 3
39758 2 X_1 2
120516 X_1 + X_2 8
73768 2 X_1 8
In [14]:
# The empirical means are nearly identical, but the variance of
# X_1 + X_2 is about half the variance of 2 X_1.
pd.DataFrame([
    sim_rolls_df[["2 X_1", "X_1 + X_2"]].mean().rename("Mean"),
    sim_rolls_df[["2 X_1", "X_1 + X_2"]].var().rename("Var"),
    np.sqrt(sim_rolls_df[["2 X_1", "X_1 + X_2"]].var()).rename("SD")
])
Out[14]:
2 X_1 X_1 + X_2
Mean 7.010100 7.008737
Var 11.614743 5.820384
SD 3.408041 2.412547
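
These empirical values match the theory. A single die roll has $E[X_1] = 3.5$ and $\text{Var}[X_1] = E[X_1^2] - E[X_1]^2 = \frac{91}{6} - 3.5^2 = \frac{35}{12} \approx 2.92$, so

$$ E[2X_1] = E[X_1 + X_2] = 7, \qquad \text{Var}[2X_1] = 4\,\text{Var}[X_1] \approx 11.67, \qquad \text{Var}[X_1 + X_2] = \text{Var}[X_1] + \text{Var}[X_2] \approx 5.83, $$

where the last equality uses the independence of $X_1$ and $X_2$. Scaling one roll by 2 multiplies the variance by $2^2 = 4$, while summing two independent rolls only doubles it.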




Which would you pick?

This demonstration is similar to the two-dice example above. We won't cover it together during lecture.

  • $\large Y_A = 10 X_1 + 10 X_2 $
  • $\large Y_B = \sum\limits_{i=1}^{20} X_i$
  • $\large Y_C = 20 X_1$

First, let's construct the probability distribution for a single fair coin, where 1 represents heads and 0 represents tails. This will let us simulate flipping 20 i.i.d. coins later.

In [15]:
# First construct probability distribution for a single fair coin
p = 0.5
coin_df = pd.DataFrame({"x": [1, 0], # [Heads, Tails]
                        "P(X = x)": [p, 1 - p]})
coin_df
Out[15]:
x P(X = x)
0 1 0.5
1 0 0.5

Choice A:

$\large Y_A = 10 X_1 + 10 X_2 $

In [16]:
N = 10000

np.random.rand(N,2) < p
Out[16]:
array([[ True, False],
       [ True,  True],
       [ True,  True],
       ...,
       [ True, False],
       [False, False],
       [False, False]])
In [17]:
sim_flips = pd.DataFrame(
    {"Choice A": np.sum((np.random.rand(N,2) < p) * 10, axis=1)})
sim_flips
Out[17]:
Choice A
0 10
1 20
2 0
3 10
4 10
... ...
9995 0
9996 20
9997 10
9998 10
9999 0

10000 rows × 1 columns

Choice B:

$\large Y_B = \sum\limits_{i=1}^{20} X_i$

In [18]:
sim_flips["Choice B"] = np.sum((np.random.rand(N,20) < p), axis=1)
sim_flips
Out[18]:
Choice A Choice B
0 10 13
1 20 9
2 0 7
3 10 8
4 10 9
... ... ...
9995 0 12
9996 20 10
9997 10 12
9998 10 11
9999 0 12

10000 rows × 2 columns

Choice C:

$\large Y_C = 20 X_1$

In [19]:
sim_flips["Choice C"] = 20  * (np.random.rand(N,1) < p) 
sim_flips
Out[19]:
Choice A Choice B Choice C
0 10 13 20
1 20 9 0
2 0 7 0
3 10 8 0
4 10 9 0
... ... ... ...
9995 0 12 0
9996 20 10 20
9997 10 12 20
9998 10 11 20
9999 0 12 0

10000 rows × 3 columns


If you're curious what these choices look like, here are the empirical distributions of the simulated values:

In [20]:
px.histogram(sim_flips.melt(), x="value", facet_row="variable", 
             barmode="overlay", histnorm="probability",
             title="Empirical Distributions",
             width=600, height=600)
In [21]:
pd.DataFrame([
    sim_flips.mean().rename("Mean"),
    sim_flips.var().rename("Var"),
    np.sqrt(sim_flips.var()).rename("SD")
])
Out[21]:
Choice A Choice B Choice C
Mean 9.929000 9.995900 9.884000
Var 49.169876 5.098393 99.996544
SD 7.012124 2.257962 9.999827
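
These simulated values line up with the theory. Each fair coin flip $X_i$ has $E[X_i] = 0.5$ and $\text{Var}[X_i] = 0.25$, so all three choices have expectation $10$, but (using the independence of the $X_i$'s)

$$ \text{Var}[Y_A] = 10^2\,\text{Var}[X_1] + 10^2\,\text{Var}[X_2] = 50, \qquad \text{Var}[Y_B] = 20\,\text{Var}[X_1] = 5, \qquad \text{Var}[Y_C] = 20^2\,\text{Var}[X_1] = 100. $$

Choice B has by far the smallest spread around the common mean of 10.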

Visualizing the Binomial Distribution

This section replicates figures from the lecture slides. We won't cover it together.

The binomial distribution provides the probability of $y$ successes among $n$ trials, where each trial has $p$ probability of success:

$$ P(Y=y) = \binom{n}{y} p^y (1-p)^{n-y} $$

Equivalently, a binomial RV is a sum of $n$ i.i.d. Bernoulli RVs, each with probability $p$ of success.
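
For example, the chance of exactly one head in $n = 2$ flips of a fair coin ($p = 0.5$) is

$$ P(Y = 1) = \binom{2}{1} (0.5)^1 (1 - 0.5)^{2 - 1} = 0.5. $$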

We can visualize the binomial distribution for different combinations of $n$ and $p$, as a function of $y$:

In [22]:
import math

# Make a grid of plots of the binomial distribution for p = 0.5
# and n in [1, 2, 3, 10, 20, 100]

# math.comb(n, k) gives the binomial coefficient "n choose k"
def binom(n, k, p): 
  return math.comb(n, k) * p**k * (1-p)**(n-k)

n_values = [1, 2, 3, 10, 20, 100]

binom_df = pd.DataFrame({
    "n": np.concatenate([[n] * (n+1) for n in n_values]),
    "y": np.concatenate([np.arange(n+1) for n in n_values]),
    "P(Y=y)": [binom(n, y, 0.5) for n in n_values for y in range(n+1)]
})

fig = px.bar(binom_df, x="y", y="P(Y=y)", 
             facet_col="n", facet_col_wrap=3,
             facet_col_spacing=0.1, facet_row_spacing=0.15)

# Show axis ticks on all subplots
fig.update_yaxes(matches=None, showticklabels=True)
fig.update_xaxes(matches=None, showticklabels=True)

# Increase font size
fig.update_layout(font=dict(size=18))

# Increase size of figure
fig.update_layout(width=1000, height=600)

fig.show()





From Population to Sample

This section replicates findings from the lecture. We won't cover it together.

Remember the population distribution we looked at earlier:

In [23]:
dist_df
Out[23]:
x P(X = x)
0 3 0.1
1 4 0.2
2 6 0.4
3 8 0.3
In [24]:
# A population generated from the distribution
N = 100000
all_samples = np.random.choice(dist_df["x"], N, p=dist_df["P(X = x)"])
sim_pop_df = pd.DataFrame({"X": all_samples})
sim_pop_df
Out[24]:
X
0 8
1 6
2 8
3 4
4 6
... ...
99995 6
99996 8
99997 4
99998 6
99999 6

100000 rows × 1 columns




Suppose we draw a sample of size 100 from this giant population.

We are performing Random Sampling with Replacement: df.sample(n, replace=True) (link)

In [25]:
n = 100      # Size of our sample
sample_df = (
             sim_pop_df.sample(n, replace=True)
             # Reset the index so it runs 0 through n-1
             .reset_index(drop=True)
            )
sample_df
Out[25]:
X
0 4
1 6
2 6
3 4
4 8
... ...
95 3
96 6
97 3
98 6
99 6

100 rows × 1 columns

Our sample distribution (n = 100):

In [26]:
px.histogram(sample_df, x="X", histnorm="probability", title="Sample (n = 100)")


Compare this to our original simulated population, sim_df (N = 80,000):

In [27]:
px.histogram(sim_df, x="X", histnorm="probability", title="Population of X")
In [28]:
pd.DataFrame(
    {"Sample": [sample_df["X"].mean(), sample_df["X"].var(), np.sqrt(sample_df["X"].var())],
     "Population": [sim_df["X"].mean(), sim_df["X"].var(), np.sqrt(sim_df["X"].var())]},
    index=["Mean", "Var", "SD"])
Out[28]:
Sample Population
Mean 5.730000 5.901713
Var 3.411212 2.892563
SD 1.846947 1.700754
In [29]:
# Three different distributions of X where E(X)=25

# First distribution -- tightly concentrated around the mean
dist1_df = pd.DataFrame({"x": [20, 25, 30],
                         "P(X = x)": [0.2, 0.6, 0.2]})
# Second distribution -- symmetric but very spread out
dist2_df = pd.DataFrame({"x": [0, 25, 50],
                         "P(X = x)": [0.49, 0.02, 0.49]})
# Third distribution -- skewed right, with a long right tail
dist3_df = pd.DataFrame({"x": [0, 25, 150],
                         "P(X = x)": [0.5, 0.4, 0.1]})

fig, axs = plt.subplots(1, 3, figsize=(10, 2))
for i, dist_df in enumerate([dist1_df, dist2_df, dist3_df]):
    sns.barplot(x="x", y="P(X = x)", data=dist_df, ax=axs[i], color="skyblue")

plt.tight_layout()

plt.show()
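
As a quick sanity check, here's a small sketch that reuses the dot-product approach from the start of the notebook to compute each distribution's mean and SD (it assumes the three DataFrames are defined as above):

# Mean and SD of each distribution, via E[X] = sum of x * P(X = x)
for name, df in [("dist1", dist1_df), ("dist2", dist2_df), ("dist3", dist3_df)]:
    E = df["x"] @ df["P(X = x)"]
    Var = df["x"]**2 @ df["P(X = x)"] - E**2    # Var[X] = E[X^2] - E[X]^2
    print(f"{name}: E[X] = {E:.1f}, SD(X) = {np.sqrt(Var):.2f}")

All three means come out to 25, while the SDs are roughly 3, 25, and 43, so the distributions differ enormously in spread despite sharing the same expectation.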