Lecture 09 Supplemental Notebook¶

Data 100, Spring 2023

Acknowledgments Page

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

sns.set_theme(style='darkgrid', font_scale = 1.5,
              rc={'figure.figsize':(7,5)})

rng = np.random.default_rng()

A fake election data set¶

Suppose that we are trying to run a poll to predict the mayoral election in Bearkeley City (an imaginary city that neighbors Berkeley).

First, let's grab a data set that has every single voter in Bearkeley (again, this is a fake dataset) and how they actually voted in the election.

For the purposes of this demo, assume:

  • "high income" indicates a voter is above the median household income, which is $97,834 (actual Berkeley number).
  • There are only two mayoral candidates: one Democrat and one Republican.
  • Every registered voter votes in the election for the candidate under their registered party (Dem or Rep).
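The file `bearkeley.csv` is assumed to ship with the course materials. If you don't have it, a rough stand-in with the same structure can be generated like this (the age/income/vote relationships below are invented purely for illustration and do not match the actual file):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
N = 1_300_000  # matches the size of the course dataset

# Invented relationships, chosen only to mimic the qualitative patterns
# explored later: older and higher-income voters lean Republican.
age = rng.integers(18, 83, size=N)
high_income = rng.random(N) < 0.5
p_dem = np.clip(0.9 - 0.007 * (age - 18) - 0.15 * high_income, 0.05, 0.95)
vote = np.where(rng.random(N) < p_dem, "Dem", "Rep")

fake_bearkeley = pd.DataFrame({"age": age, "high_income": high_income, "vote": vote})
```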
In [2]:
bearkeley = pd.read_csv("bearkeley.csv")

# create a 1/0 int that indicates democratic vote
bearkeley['vote.dem'] = (bearkeley['vote'] == 'Dem').astype(int)
bearkeley
Out[2]:
age high_income vote vote.dem
0 35 False Dem 1
1 42 True Rep 0
2 55 False Dem 1
3 77 True Rep 0
4 31 False Dem 1
... ... ... ... ...
1299995 62 True Dem 1
1299996 78 True Rep 0
1299997 68 False Rep 0
1299998 82 True Rep 0
1299999 23 False Dem 1

1300000 rows × 4 columns

What fraction of Bearkeley voters voted for the Democratic candidate?

In [3]:
actual_vote = np.mean(bearkeley["vote.dem"])
actual_vote
Out[3]:
0.5302792307692308

This is the actual outcome of the election. Based on this result, the Democratic candidate would win. How did our sample of retirees do?

Recreate the retiree sample¶

In [4]:
convenience_sample = bearkeley[bearkeley['age'] >= 65]
np.mean(convenience_sample["vote.dem"])
Out[4]:
0.3744755089093924

Based on this result, we would have predicted that the Republican candidate would win! What happened?

  1. Is the sample too small / noisy?
In [5]:
len(convenience_sample)
Out[5]:
359396
In [6]:
len(convenience_sample)/len(bearkeley)
Out[6]:
0.27645846153846154

This sample is huge (over a quarter of the entire population), so the error is definitely not solely chance error. There is some bias afoot.
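To make this concrete: under simple random sampling, the typical chance error of a sample proportion is about $\sqrt{p(1-p)/n}$. Plugging in the (approximate) true Democratic share and the retiree sample size:

```python
import numpy as np

p = 0.53       # approximate true Democratic share
n = 359_396    # size of the retiree sample

# Standard error of a sample proportion under an SRS
# (ignoring the finite-population correction, which would only shrink it)
se = np.sqrt(p * (1 - p) / n)
print(se)  # well under 0.001
```

A chance error of less than a tenth of a percentage point cannot account for a gap of roughly 0.53 − 0.37 ≈ 0.16, so the discrepancy must come from selection bias.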

Check for bias¶

Let us aggregate all voters by age and visualize the fraction of Democratic voters, split by income.

In [7]:
votes_by_demo = bearkeley.groupby(["age","high_income"]).agg("mean").reset_index()
votes_by_demo
Out[7]:
age high_income vote.dem
0 18 False 0.819594
1 18 True 0.667001
2 19 False 0.812214
3 19 True 0.661252
4 20 False 0.805281
... ... ... ...
125 80 True 0.259731
126 81 False 0.394946
127 81 True 0.256759
128 82 False 0.398970
129 82 True 0.248060

130 rows × 3 columns

In [8]:
import matplotlib.ticker as ticker
fig = plt.figure();
red_blue = ["#bf1518", "#397eb7"]
with sns.color_palette(sns.color_palette(red_blue)):
    ax = sns.pointplot(data=votes_by_demo, x = "age", y = "vote.dem", hue = "high_income")

ax.set_title("Voting preferences by demographics")
fig.canvas.draw()
new_ticks = [i.get_text() for i in ax.get_xticklabels()];
plt.xticks(range(0, len(new_ticks), 10), new_ticks[::10]);
  • We see that retirees (in our imaginary city) tend to vote less Democratic.
  • We also see that voters below the median income (high_income=False) tend to vote more Democratic.

Compare to a Simple Random Sample¶

What if we instead took a simple random sample (SRS) to conduct our pre-election poll?

Suppose we took an SRS of the same size as our retiree sample:

In [9]:
## By default, replace = False
n = len(convenience_sample)
random_sample = bearkeley.sample(n, replace = False)

np.mean(random_sample["vote.dem"])
Out[9]:
0.5302785785039343

This is very close to the actual vote!

In [10]:
actual_vote
Out[10]:
0.5302792307692308

It turns out that we can get pretty close to the actual vote even with a much smaller sample size, say, 800:

In [11]:
n = 800
random_sample = bearkeley.sample(n, replace = False)
np.mean(random_sample["vote.dem"])
Out[11]:
0.51375

We'll learn how to choose this number when we (re)learn the Central Limit Theorem later in the semester.

How to quantify chance error?¶

In our SRS of size 800, what would be our chance error?

Let's simulate 1000 versions of taking the size-800 SRS from before:

In [12]:
poll_result = []
nrep = 1000   # number of simulations
n = 800       # size of our sample
for i in range(0,nrep):
    random_sample = bearkeley.sample(n, replace = False)
    poll_result.append(np.mean(random_sample["vote.dem"]))
In [13]:
sns.histplot(poll_result, stat='density')
Out[13]:
<AxesSubplot:ylabel='Density'>

What fraction of these simulated samples would have predicted Democrat?

In [14]:
poll_result = pd.Series(poll_result)
np.sum(poll_result > 0.5)/1000
Out[14]:
0.959
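This empirical fraction can be sanity-checked against a normal approximation (a sketch, assuming the true share p ≈ 0.53 and that `scipy` is available):

```python
import numpy as np
from scipy import stats

p, n = 0.53, 800
se = np.sqrt(p * (1 - p) / n)

# Normal approximation: P(sample proportion exceeds 0.5)
prob_correct_call = 1 - stats.norm.cdf(0.5, loc=p, scale=se)
print(prob_correct_call)  # roughly 0.95
```

This agrees well with the simulated fraction above.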

You can see that the curve looks roughly Gaussian. We can overlay a kernel density estimate (KDE):

In [15]:
sns.histplot(poll_result, stat='density', kde=True)
Out[15]:
<AxesSubplot:ylabel='Density'>

Simulating from a Multinomial Distribution¶

Sometimes instead of having individual reports in the population, we have aggregate statistics. For example, we could have only learned that 53% of election voters voted Democrat. Even so, we can still simulate probability samples if we assume the population is large.

Specifically, we can use multinomial probabilities to simulate random samples with replacement.

Marbles¶

Suppose we have a very large bag of marbles with the following statistics:

  • 60% blue
  • 30% green
  • 10% red

We then draw 100 marbles from this bag at random with replacement.

We can simulate these draws using np.random.multinomial (see the NumPy documentation):

In [16]:
np.random.multinomial(100, [0.60, 0.30, 0.10])
Out[16]:
array([62, 29,  9])

We can repeat this simulation multiple times, say 20:

In [17]:
np.random.multinomial(100, [0.60, 0.30, 0.10], size=20)
Out[17]:
array([[60, 30, 10],
       [58, 30, 12],
       [60, 34,  6],
       [62, 32,  6],
       [65, 26,  9],
       [61, 28, 11],
       [58, 35,  7],
       [64, 26, 10],
       [59, 26, 15],
       [52, 34, 14],
       [52, 36, 12],
       [65, 29,  6],
       [67, 21, 12],
       [58, 36,  6],
       [57, 35,  8],
       [67, 30,  3],
       [63, 32,  5],
       [65, 29,  6],
       [66, 25,  9],
       [70, 24,  6]])
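Tying this back to the election: knowing only the aggregate 53% figure, we can simulate polls of size 800 directly with multinomial draws, without access to individual voter records (a sketch; the 0.53 is rounded from the population share computed earlier):

```python
import numpy as np

rng = np.random.default_rng(0)
nrep, n = 1000, 800

# Each row is one simulated poll: counts of (Dem, Rep) votes
counts = rng.multinomial(n, [0.53, 0.47], size=nrep)
dem_share = counts[:, 0] / n

# Fraction of simulated polls that would call the election for the Democrat
print(np.mean(dem_share > 0.5))
```

This reproduces the same kind of result as the earlier simulation that resampled from the full `bearkeley` table, but needs only the aggregate statistic.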