Random Variables¶

Data 100, Summer 2023

Acknowledgments Page

In [1]:
import seaborn as sns
import pandas as pd
sns.set(font_scale=1.5)
import matplotlib.pyplot as plt
import numpy as np

def adjust_fontsize(size=None):
    SMALL_SIZE = 8
    MEDIUM_SIZE = 10
    BIGGER_SIZE = 12
    if size != None:
        SMALL_SIZE = MEDIUM_SIZE = BIGGER_SIZE = size

    plt.rc('font', size=SMALL_SIZE)          # controls default text sizes
    plt.rc('axes', titlesize=SMALL_SIZE)     # fontsize of the axes title
    plt.rc('axes', labelsize=MEDIUM_SIZE)    # fontsize of the x and y labels
    plt.rc('xtick', labelsize=SMALL_SIZE)    # fontsize of the tick labels
    plt.rc('ytick', labelsize=SMALL_SIZE)    # fontsize of the tick labels
    plt.rc('legend', fontsize=SMALL_SIZE)    # legend fontsize
    plt.rc('figure', titlesize=BIGGER_SIZE)  # fontsize of the figure title

plt.style.use('fivethirtyeight')
sns.set_context("talk")
sns.set_theme()
adjust_fontsize(size=20)
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

Helper functions for plotting

In [2]:
# Helper functions to plot and 
# Compute expectation, variance, standard deviation
def plot_dist(dist_df,
                      xname="x", pname="P(X = x)", varname="X",
                      save=False):
    """
    Plot a distribution from a distribution table.
    Single-variate.
    """
    plt.bar(dist_df[xname], dist_df[pname])
    plt.ylabel(pname)
    plt.xlabel(xname)
    plt.title(f"Distribution of ${varname}$")
    plt.xticks(sorted(dist_df[xname].unique()))
    if save:
        fig = plt.gcf()
        fig.patch.set_alpha(0.0)
        plt.savefig(f"dist{varname}.png", bbox_inches = 'tight');


def simulate_samples(df, xname="x", pname="P(X = x)", size=1):
    return np.random.choice(
                df[xname], # Draw from these choiecs
                size=size, # This many times
                p=df[pname]) # According to this distribution

def simulate_iid_df(dist_df, nvars, rows, varname="X"):
    """
    Make an (row x nvars) dataframe
    by calling simulate_samples for each of the nvars per row
    """
    sample_dict = {}
    for i in range(nvars):
        # Generate many datapoints 
        sample_dict[f"{varname}_{i+1}"] = \
            simulate_samples(dist_df, size=rows)
    return pd.DataFrame(sample_dict)


def plot_simulated_dist(df, colname, show_stats=True, save=False, **kwargs):
    """
    Plot a simulated population.
    """
    sns.histplot(data=df, x=colname, stat='probability', discrete=True, **kwargs)
    plt.xticks(sorted(df[colname].unique())) # if there are gaps)
    if show_stats:
        display(stats_df_multi(df, [colname]))
    if save:
        fig = plt.gcf()
        fig.patch.set_alpha(0.0)
        plt.savefig(f"sim{colname}.png", bbox_inches = 'tight');

def stats_df_multi(df, colnames):
    means = df[colnames].mean(axis=0)
    variances = df[colnames].var(axis=0)
    stdevs = df[colnames].std(axis=0)
    df_stats = pd.concat([means, variances, stdevs],axis=1).T
    df_stats['index_col'] = ["E[•]", "Var(•)", "SD(•)"]
    df_stats = df_stats.set_index('index_col', drop=True).rename_axis(None)
    return df_stats

def plot_simulated_dist_multi(df, colnames, show_stats=True):
    """
    If multiple columns provided, use separate plots.
    """
    ncols = 1
    nrows = len(colnames)
    plt.figure(figsize=(6, 2*nrows+2))
    
    for i, colname in enumerate(colnames):
        subplot_int = int(100*int(nrows) + 10*int(ncols) + int(i+1))
        plt.subplot(subplot_int)
        plot_simulated_dist(df, colname, show_stats=False)
    plt.tight_layout()
    if show_stats:
        display(stats_df_multi(df, colnames))

A Random Variable $X$¶

Our probability distribution of $X$, shown as a table:

In [3]:
# Our random variable X
dist_df = pd.DataFrame({"x": [3, 4, 6, 8],
                        "P(X = x)": [0.1, 0.2, 0.4, 0.3]})
dist_df
Out[3]:
x P(X = x)
0 3 0.1
1 4 0.2
2 6 0.4
3 8 0.3
In [4]:
plot_dist(dist_df)

Let's use this probability distribution to generate a table of $X(s)$, i.e., random variable values for many many samples.

In [5]:
# copied from above
# def simulate_samples(df, xname="x", pname="P(X = x)", size=1):
#     return np.random.choice(
#                 df[xname], # draw from these choiecs
#                 size=size, # this many times
#                 p=df[pname]) # according to this distribution

N = 80000
all_samples = simulate_samples(dist_df, size=N)
sim_df = pd.DataFrame({"X(s)": all_samples})
sim_df
Out[5]:
X(s)
0 8
1 8
2 3
3 6
4 8
... ...
79995 8
79996 6
79997 3
79998 4
79999 8

80000 rows × 1 columns



Let's check how well this simulated sample matches our probability distribution!

In [6]:
plot_simulated_dist(sim_df, "X(s)")
plt.title("Simulated distribution of $X$")
plt.show()
X(s)
E[•] 5.908350
Var(•) 2.895186
SD(•) 1.701525
In [7]:
# The tabular view of the above plot
sim_df.value_counts("X(s)").sort_values()/N
Out[7]:
X(s)
3    0.09955
4    0.19920
8    0.30270
6    0.39855
Name: count, dtype: float64




Die Is the Singular; Dice Is the Plural¶

Let X be the outcome of a single die roll. X is a random variable.

In [8]:
# Our random variable X
roll_df = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6],
                        "P(X = x)": np.ones(6)/6})
roll_df
Out[8]:
x P(X = x)
0 1 0.166667
1 2 0.166667
2 3 0.166667
3 4 0.166667
4 5 0.166667
5 6 0.166667
In [9]:
plot_dist(roll_df)




Sum of 2 Dice Rolls¶

Here's the distribution of a single die roll:

In [10]:
roll_df = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6],
                        "P(X = x)": np.ones(6)/6})
roll_df
Out[10]:
x P(X = x)
0 1 0.166667
1 2 0.166667
2 3 0.166667
3 4 0.166667
4 5 0.166667
5 6 0.166667

Let $X_1, X_2$ are the outcomes of two dice rolls. Note $X_1$ and $X_2$ are i.i.d. (independent and identically distributed).

Below I call a helper function simulate_iid_df, which simulates an 80,000-row table of $X_1, X_2$ values. It uses np.random.choice(arr, size, p) link where arr is the array the values and p is the probability associated with choosing each value. If you're interested in the implementation details, scroll up.

In [11]:
N = 80000
sim_rolls_df = simulate_iid_df(roll_df, nvars=2, rows=N)
sim_rolls_df
Out[11]:
X_1 X_2
0 1 4
1 5 3
2 1 2
3 5 5
4 4 4
... ... ...
79995 5 4
79996 2 5
79997 6 4
79998 6 4
79999 3 2

80000 rows × 2 columns

Define the following random variables, which are functions of $X_1$ and $X_2$:

  • $Y = X_1 + X_1 = 2 X_1$
  • $Z = X_1 + X_2$

We can use our simulated values of $X_1, X_2$ to create new columns $Y$ and $Z$:

In [12]:
sim_rolls_df['Y'] = 2 * sim_rolls_df['X_1']
sim_rolls_df['Z'] = sim_rolls_df['X_1'] + sim_rolls_df['X_2']
sim_rolls_df
Out[12]:
X_1 X_2 Y Z
0 1 4 2 5
1 5 3 10 8
2 1 2 2 3
3 5 5 10 10
4 4 4 8 8
... ... ... ... ...
79995 5 4 10 9
79996 2 5 4 7
79997 6 4 12 10
79998 6 4 12 10
79999 3 2 6 5

80000 rows × 4 columns

Now that we have simulated samples of $Y$ and $Z$, we can plot histograms to see their distributions!


Distribution of $Y$, which was twice the value of our first die roll:
In [13]:
plot_simulated_dist(sim_rolls_df, "Y")
Y
E[•] 7.012475
Var(•) 11.681890
SD(•) 3.417878

Distribution of $Z$, the sum of two IID dice rolls:

In [14]:
plot_simulated_dist(sim_rolls_df, "Z", color='gold')
Z
E[•] 7.005425
Var(•) 5.866194
SD(•) 2.422023

Let's compare the expectations and variances of these simulated distributions of $Y$ and $Z$.

  • We computed:
    • $\mathbb{E}[Y]$ as np.mean(sim_rolls_df['Y'])
    • $\text{Var}(Y)$ as np.var(sim_rolls_df['Y']
    • etc.
  • The larger your simulated rows $N$, the closer the simulated expectation will be to the true expectation.
  • Our approach is tedious--we have to simulate an entire population, then reduce it down to expectation/variance/standard deviation. There has to be a better way!
In [15]:
stats_df_multi(sim_rolls_df, ["Y", "Z"])
Out[15]:
Y Z
E[•] 7.012475 7.005425
Var(•) 11.681890 5.866194
SD(•) 3.417878 2.422023





From Population to Sample¶

Remember the population distribution we looked at earlier:

In [16]:
dist_df
Out[16]:
x P(X = x)
0 3 0.1
1 4 0.2
2 6 0.4
3 8 0.3
In [17]:
# A population generated from the distribution
N = 100000
all_samples = simulate_samples(dist_df, size=N)
sim_pop_df = pd.DataFrame({"X(s)": all_samples})
sim_pop_df
Out[17]:
X(s)
0 6
1 8
2 6
3 8
4 8
... ...
99995 3
99996 8
99997 3
99998 8
99999 3

100000 rows × 1 columns




Suppose we draw a sample of size 100 from this giant population.

We are performing Random Sampling with Replacement: df.sample(n, replace=True) (link)

In [18]:
n = 100      # Size of our sample
sample_df = (
             sim_pop_df.sample(n, replace=True)
             
             # Some reformatting below
             .reset_index(drop=True)
             .rename(columns={"X(s)": "X"})
            )
sample_df
Out[18]:
X
0 6
1 8
2 8
3 8
4 4
... ...
95 6
96 4
97 4
98 4
99 8

100 rows × 1 columns

Our sample distribution (n = 100):

In [19]:
sns.histplot(data=sample_df, x='X', stat='probability')
plt.xticks([3, 4, 6, 8])
plt.title(f"Sample (n = 100)")
plt.show()

print("Mean of Sample:", np.mean(sample_df['X']))
Mean of Sample: 6.24

Compare this to our original population (N = 80,000):

In [20]:
plot_simulated_dist(sim_df, "X(s)")
plt.title("Population of $X$")
plt.show()
X(s)
E[•] 5.908350
Var(•) 2.895186
SD(•) 1.701525