🥾 Lecture 19 – Data 100, Spring 2025¶

Data 100, Spring 2025

Acknowledgments Page

In [1]:
import numpy as np
import pandas as pd
import plotly.express as px
import sklearn.linear_model as lm
import seaborn as sns
from tqdm.notebook import tqdm

⏪ Data 8 Review: Bootstrapping, Confidence Intervals, and Hypothesis Testing¶

Suppose we want to estimate some fixed population-level quantity, like the true average height of all 32,000 UC Berkeley undergraduates. We assume that this true average height is a fixed but unknown quantity. In other words, the population-level statistic is not a random variable.

In a perfect world, we would measure the height of every UC Berkeley undergraduate, calculate the average height, and have a perfect answer to our question. But we cannot reasonably measure the height of every UC Berkeley undergraduate. Instead, we might take a random sample of, say, 10 UC Berkeley students, calculate the sample average, and then use that sample average as our "best guess" of the true average height of all 32,000 UC Berkeley students.

Here's a (fake) random sample of 10 UC Berkeley undergraduate heights in inches, along with the sample mean of those heights:

In [2]:
heights = np.array([68, 67, 69, 66, 66, 66, 71, 72, 61, 70])

heights.mean()
Out[2]:
67.6

We would say that 67.6 inches is our "best guess" of the true average height of UC Berkeley undergraduates.

Unlike the true average height, your "best guess" is a random quantity. For example, if you and a friend separately sampled 10 UC Berkeley undergraduates, your sample average heights would probably differ due to randomness in the samples. This raises the question: How much could our "best guesses" differ?

To answer this question, it would be useful for us to measure the variability of our sample statistic across parallel universes of random samples. But, we have another problem: We only get to observe one universe!

In Data 8, you learned that bootstrapping can be used to construct synthetic parallel universes. For example, here's how we could construct synthetic bags of M&Ms from one bag of M&Ms:

[Image: M&Ms and the bootstrap]

Here are the steps of bootstrapping written out:

  1. Assume that your random sample of size n is representative of the true population.
  2. To mimic a random draw of size n from the true population, randomly resample n observations with replacement from your random sample. Call this a "synthetic" random sample.
  3. To compute a synthetic "best guess", calculate the sample statistic using your synthetic random sample. For example, you could calculate the sample average.
  4. Repeat steps 2 and 3 many times. A common choice is 10,000 times.
  5. The distribution of the 10,000 synthetic "best guesses" provides a sense of uncertainty around your original "best guess".

Here's how we could generate just one synthetic random sample of heights:

In [3]:
# Set seed for reproducibility
np.random.seed(100)

sample_size = len(heights)

# Resample n values with replacement from our real sample of 10 heights
synth_heights = np.random.choice(heights, size=sample_size, replace=True)

# Compute the mean of the synthetic sample
synth_estimate = synth_heights.mean()

print("Original Heights:")
print(heights)
print("Mean of Original Heights:")
print(heights.mean())
print()
print("Synthetic Heights:")
print(synth_heights)
print("Mean of Synthetic Heights:")
print(synth_estimate)
Original Heights:
[68 67 69 66 66 66 71 72 61 70]
Mean of Original Heights:
67.6

Synthetic Heights:
[61 61 66 72 72 68 66 69 66 69]
Mean of Synthetic Heights:
67.0

Notice that our synthetic sample mean of 67 inches differs from our original "best guess" of 67.6 inches.

Let's repeat this 10,000 times:

In [4]:
sample_size = len(heights)
n_boot = 10000

# Create an empty array to hold the 10,000 synthetic "best guesses"
synth_estimates = np.zeros(n_boot)

for i in range(n_boot):

  # Resample n values with replacement from our real sample of 10 heights
  synth_heights = np.random.choice(heights, size=sample_size, replace=True)

  # Compute the mean of the synthetic sample
  synth_estimate = synth_heights.mean()

  # Append the synthetic mean to synth_estimates
  synth_estimates[i] = synth_estimate

print('Number of synthetic best guesses:')
print(len(synth_estimates))

print('First 5 synthetic best guesses:')
print(synth_estimates[:5])
Number of synthetic best guesses:
10000
First 5 synthetic best guesses:
[67.8 66.9 68.3 67.8 66. ]

To get a sense of how much our best guess could vary across parallel universes, we can visualize the resulting distribution of synthetic best guesses:

In [5]:
fig = px.histogram(pd.Series(synth_estimates), 
            title='Bootstrap Distribution of the Sample Mean Height', 
            width=800, histnorm='probability', 
            barmode="overlay", opacity=0.8)
fig.show()
[Plotly histogram: Bootstrap Distribution of the Sample Mean Height]

It looks like the best guesses of most parallel universes fall between 65 inches and 70 inches.

Bootstrap confidence intervals¶

Suppose a construction manager at Berkeley asked you to estimate the average height of Berkeley undergraduates so that doors in a new building are neither too high nor too low. It would be a good idea to not only provide your best guess of 67.6 inches, but also a sense of how uncertain you are about that best guess.

To do so, we can construct and report a bootstrap confidence interval (CI). For example, to construct a 95% CI, we would grab the middle 95% of the synthetic best guesses. In other words, we would grab the 2.5th percentile and the 97.5th percentile:

In [6]:
# Grab the 2.5th and 97.5th percentiles of the synthetic sample means
ci_bounds = np.percentile(synth_estimates, [2.5, 97.5])

print("Lower bound of 95% CI: ", ci_bounds[0])

print("Upper bound of 95% CI: ", ci_bounds[1])

fig.add_vline(x=ci_bounds[0], line_color='red')
fig.add_vline(x=ci_bounds[1], line_color='red')
fig.add_annotation(x=ci_bounds[0], y=0.02, text="Lower Bound", showarrow=True, arrowhead=2)
fig.add_annotation(x=ci_bounds[1], y=0.02, text="Upper Bound", showarrow=True, arrowhead=2)
fig.show()
Lower bound of 95% CI:  65.6
Upper bound of 95% CI:  69.4
[Plotly histogram: Bootstrap Distribution of the Sample Mean Height, with red lines marking the lower and upper CI bounds]

Given the values above, we could say "We are 95% confident that the true average height of UC Berkeley undergrads is between 65.6 inches and 69.4 inches."

  • What "confidence" really means: If we repeated the entire process across many parallel universes (collect a new random sample, bootstrap it, and construct a new 95% CI each time), we would expect about 95% of those intervals to contain the true average height.

This knowledge of uncertainty around our best guess can help us make better decisions than a single best guess alone.
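To see what that "confidence" guarantee looks like in practice, here is a rough simulation sketch (my own illustration, not part of the lecture; the population below is hypothetical): we pretend we know the full population of heights, repeatedly draw small samples, bootstrap a 95% CI from each, and check how often the interval covers the true mean. With a sample size of only 10, the coverage is typically close to, though often a bit below, 95%.

# Illustration only: a hypothetical population of 32,000 heights.
np.random.seed(0)
population = np.random.normal(67.5, 3, size=32000)
true_mean = population.mean()

n_trials = 200   # number of "parallel universes"; kept small so this runs quickly
covered = 0
for _ in range(n_trials):
    # Draw a fresh sample of 10 students from the hypothetical population
    sample = np.random.choice(population, size=10, replace=False)
    # Bootstrap the sample mean 1,000 times and form a 95% percentile CI
    boot_means = [np.random.choice(sample, size=10, replace=True).mean() for _ in range(1000)]
    lo, hi = np.percentile(boot_means, [2.5, 97.5])
    # Record whether this universe's CI covered the true mean
    covered += (lo <= true_mean <= hi)

print("Fraction of 95% CIs that covered the true mean:", covered / n_trials)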

Bootstrap hypothesis testing¶

We can also use our confidence interval to perform hypothesis testing. For example, someone might claim that the true average height of UC Berkeley undergrads is 68 inches. This person has proposed what is called a null hypothesis.

Null hypothesis: $$H_0: \mu = 68$$

In the definition above, $\mu$ is the claimed true average height of Berkeley undergrads. We can show this null population mean on our plot from above:

In [7]:
fig.add_vline(x=68, line_color='green')
fig.add_annotation(x=68, y=0.02, text="Null population mean", showarrow=True, arrowhead=2)
fig.add_annotation(x=ci_bounds[1], y=0.02, text="Upper Bound", showarrow=True, arrowhead=2)
fig.show()
[Plotly histogram: Bootstrap Distribution of the Sample Mean Height, with the CI bounds and a green line at the null population mean of 68]

Our best guess of the true average height based on our original sample is 67.6 inches. But, our bootstrap distribution of the sample mean shows that we could have plausibly made a best guess of 68 inches, in a parallel universe.

Statistical logic (out of scope) therefore tells us to fail to reject the hypothesis that the true average height is 68 inches at a 5% significance level.

  • This does not mean that the true average height is exactly 68 inches (i.e., that the hypothesis is true, or that we accept the hypothesis).
  • Instead, all we can say is that our sample data could have plausibly been observed if the true average height were indeed 68 inches, so we can't rule out the possibility that the true average height is in fact 68 inches.
  • The significance level is 1 minus the confidence level (here, 1 − 0.95 = 0.05, or 5%).

If we claimed that the true average height were, say, 70 inches, then we would reject the null hypothesis at the 5% level, since 70 inches falls outside of our 95% confidence interval.

  • In other words, it is unlikely that we could have observed a sample average height of 70 inches in a parallel universe.
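A tiny helper makes this CI-based decision rule explicit (this is my own illustrative snippet, not lecture code; the function name in_ci is hypothetical):

# Fail to reject the null hypothesis if the hypothesized value lies inside the CI.
def in_ci(null_value, ci_bounds):
    return ci_bounds[0] <= null_value <= ci_bounds[1]

print(in_ci(68, ci_bounds))   # True  -> fail to reject H0: mu = 68
print(in_ci(70, ci_bounds))   # False -> reject H0: mu = 70 at the 5% level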

From last lecture: Bias, variance, and MSE of an estimator¶

In the last lecture, we learned about the bias, variance, and MSE of an estimator, which are defined in terms of expectations over all possible random samples of the data:

$$\text{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta$$

$$\text{Variance}(\hat{\theta}) = E[(\hat{\theta} - E[\hat{\theta}])^2]$$

$$\text{MSE}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2]$$

Notice that the variance formula does not require us to know $\theta$. So, we can estimate the variance of our estimator using the bootstrap distribution of $\hat{\theta}$'s.

Estimating variance is an important component of constructing normally-approximated bootstrapped confidence intervals, which are beyond the scope of Data 100.

To estimate the expectation from a random sample of $B$ synthetic values of $\hat{\theta}$, we just compute the sample mean of the $\hat{\theta}$'s:

$$\hat{E}(\hat{\theta}) = \text{Sample mean of the } \hat{\theta}\text{'s} = \bar{\hat{\theta}} = \frac{1}{B}\sum_{i=1}^{B} \hat{\theta}_i$$

To estimate the variance from a random sample of $B$ synthetic values of $\hat{\theta}$, we just compute the sample variance of the $\hat{\theta}$'s:

$$\widehat{\text{Var}}(\hat{\theta}) = \text{Sample variance of the } \hat{\theta}\text{'s} \approx \frac{1}{B}\sum_{i=1}^{B} (\hat{\theta}_i - \bar{\hat{\theta}})^2$$

The ≈ above is out of scope for Data 100. Don't worry about it!

In [8]:
# Estimate expectation using a mean:
estimated_expectation_theta_hat = synth_estimates.mean()

# Same idea for estimated variance
estimated_variance_theta_hat = np.mean((synth_estimates - estimated_expectation_theta_hat) ** 2)

print("Estimated variance of the synthetic sample means: ", estimated_variance_theta_hat)
Estimated variance of the synthetic sample means:  0.9209339375
In [9]:
# np.var directly computes the variance of an array, so we get the same answer!
sample_variance_est = np.var(synth_estimates)
print("Estimated variance of the synthetic sample means: ", sample_variance_est)
Estimated variance of the synthetic sample means:  0.9209339375
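Though out of scope for Data 100, here is a rough sketch of how that variance estimate could feed into a normal-approximation bootstrap CI (my own illustration, not lecture code; for this data it should land close to the percentile CI computed earlier):

# Out-of-scope sketch: normal-approximation bootstrap CI, "estimate +/- 1.96 * SE".
se = np.sqrt(sample_variance_est)   # bootstrap standard error of the sample mean
center = heights.mean()             # original "best guess" of 67.6 inches
print("Normal-approximation 95% CI:", (center - 1.96 * se, center + 1.96 * se))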

Important Data 100 skill from this section: Understand how population-level quantities like $E(\text{Something})$ and $\text{Var}(\text{Something})$ can be estimated with a sample of $\text{Something}_i$'s.



Instructor note: Return to slides!


🍝 Bootstrapping a regression coefficient¶

For demonstration, we will fit an SLR model to a random sample of the mpg dataset predicting mpg (miles per gallon) from the weight (weight of the vehicle):

$$\widehat{\text{mpg}} = \hat{\theta}_0 + \hat{\theta}_1 \cdot \text{weight}$$

Then, using the bootstrap, we will construct a confidence interval around the $\hat{\theta}_1$ coefficient.

  • This confidence interval tells us the values of $\hat{\theta}_1$ that we could have observed in a parallel universe where a different random sample of cars of the same size was selected.

Suppose we collected a simple random sample of 20 cars from a population of cars. For the purposes of this demo we will assume that seaborn's mpg dataset is the population of all cars.

Here's a visualization of our sample and SLR model:

In [10]:
# Set seed for reproducibility
np.random.seed(42)

# Number of cars to sample
sample_size = 20

# Load in the mpg from seaborn
mpg = sns.load_dataset('mpg')
print("Full Data Size:", len(mpg))

# Sample `sample_size` rows from the mpg dataset
mpg_sample = mpg.sample(sample_size)
print("Sample Size:", len(mpg_sample))

px.scatter(mpg_sample, x='weight', y='mpg', trendline='ols', width=800)
Full Data Size: 398
Sample Size: 20
[Plotly scatterplot: mpg vs. weight for the 20-car sample, with the OLS trendline]

We can fit a linear model with sklearn to get an estimate of the slope:

In [11]:
model = lm.LinearRegression().fit(mpg_sample[['weight']], mpg_sample['mpg'])

print("Slope of the regression line: ", model.coef_[0])
print("Intercept of the regression line: ", model.intercept_)
Slope of the regression line:  -0.0069218218719096815
Intercept of the regression line:  43.529512519963816

Our "best guess" of the change in mpg associated with a one-unit increase in weight is about -0.007.

But a best guess is not the end of the story! We can use the bootstrap to measure uncertainty around our best guess.

Bootstrap Implementation¶

Now let's use the bootstrap to estimate what $\hat{\theta}_1$ might look like across parallel universes of SLR models fit to different random samples.

To make our code reusable, let's write a bootstrap function that takes in a random sample and an estimator function (e.g., the sample mean, or a function that extracts the $\hat{\theta}_1$ coefficient from a fitted SLR), and then uses that estimator function to construct many synthetic estimates.

In [12]:
def estimator(sample):
    """
    Fits an SLR to `sample` regressing mpg on weight,
    and returns the intercept and slope of the fitted line
    """
    model = lm.LinearRegression().fit(sample[['weight']], sample['mpg'])

    return (model.intercept_, model.coef_[0])

estimator(mpg_sample)
Out[12]:
(43.529512519963816, -0.0069218218719096815)

As expected, our estimator function returns the same intercept and slope as the model we fit above, so long as we plug in the original sample of cars.

In [13]:
print("Slope of the regression line: ", model.coef_[0])
print("Intercept of the regression line: ", model.intercept_)
Slope of the regression line:  -0.0069218218719096815
Intercept of the regression line:  43.529512519963816

Next, we will write the general-purpose bootstrap function.

Writing a general-purpose bootstrap function is a classic problem. It could show up in an interview or on the job, so it's good to understand this section well!

The bootstrap function code uses df.sample (link) to generate a bootstrap sample of the same size as the original sample.

In [14]:
def bootstrap(sample, statistic, num_repetitions):
    """
    Returns the statistic computed on each of num_repetitions
    bootstrap samples drawn from sample.
    """
    stats = []

    # tqdm provides a progress bar
    # functionally, this code is the same as `for i in np.arange(num_repetitions)`
    for i in tqdm(np.arange(num_repetitions), "Bootstrapping"):
        
        # Step 1: Resample with replacement from our original sample to generate
        # a synthetic sample of the same size
        bootstrap_sample = sample.sample(frac=1, replace=True)

        # Step 2: Calculate a synthetic estimate using the synthetic sample
        bootstrap_stat = statistic(bootstrap_sample)

        # Append the synthetic estimate to the list of estimates
        stats.append(bootstrap_stat)
        
    return stats    

Constructing MANY bootstrap slope estimates:

In general, 10,000 is a good default for the number of synthetic samples to compute. In this case, we will use 1,000 just so our code runs a little faster.

In [15]:
bs_thetas = bootstrap(mpg_sample, estimator, 1000)
print("Number of bootstrap estimates:", len(bs_thetas))
print("First 5 bootstrap estimates:", bs_thetas[:5])
Bootstrapping:   0%|          | 0/1000 [00:00<?, ?it/s]
Number of bootstrap estimates: 1000
First 5 bootstrap estimates: [(43.215970007392926, -0.006714853868176342), (45.26126846596228, -0.007064625623981777), (46.602470109629174, -0.007653151422012398), (45.84303440370839, -0.007484470553621759), (46.17677039482411, -0.007577745379669565)]

We can visualize the 1,000 synthetic SLR models we fit with the bootstrap:

In [16]:
# Plot the SLR models given in bs_thetas

# Make a scatterplot of the original data
fig = px.scatter(mpg_sample, x='weight', y='mpg', trendline='ols', width=800)

for theta in bs_thetas:
    
    # Unpack the tuple
    intercept, slope = theta

    # Create a line from the intercept and slope
    x = np.linspace(1500, 5000, 100)
    y = intercept + slope * x

    # Plot the lines transparently
    fig.add_scatter(x=x, y=y, mode='lines', line=dict(width=0.05))

fig.update_layout(title='Bootstrapped SLR Models')

fig.show()
[Plotly figure: Bootstrapped SLR Models, with 1,000 semi-transparent bootstrap regression lines overlaid on the mpg vs. weight scatterplot]

From the plot above, can you guess why the emoji for this section is spaghetti 🍝?

We were originally just interested in conducting inference on the slope, $\hat{\theta}_1$. So, let's visualize the bootstrap distribution of the synthetic $\hat{\theta}_1$ estimates:

Note that we could have done the same for the synthetic $\hat{\theta}_0$ estimates, too (see the quick aside after the plot below)!

In [17]:
# Grab the slopes from the list of (intercept, slope) tuples
bs_theta1s = [theta[1] for theta in bs_thetas]

fig = px.histogram(pd.Series(bs_theta1s),
                   title='Bootstrap Distribution of the Slope', 
                   width=800, histnorm='probability', 
                   barmode="overlay", opacity=0.8)
fig.show()
[Plotly histogram: Bootstrap Distribution of the Slope]
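As a quick aside (not shown in lecture), the same plot can be made for the synthetic intercept estimates:

# Aside: bootstrap distribution of the synthetic intercept estimates
bs_theta0s = [theta[0] for theta in bs_thetas]

px.histogram(pd.Series(bs_theta0s),
             title='Bootstrap Distribution of the Intercept',
             width=800, histnorm='probability').show()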

Computing a Bootstrap CI¶

We can compute the CI using the percentiles of the distribution of 1,000 synthetic estimates of $\hat{\theta}_1$:

In [18]:
def bootstrap_ci(bootstrap_estimates, confidence_level=95):
    """
    Returns the confidence interval for the synthetic estimates by grabbing
    the percentiles corresponding to `confidence_level`% of the samples
    """
    lower_percentile = (100 - confidence_level) / 2
    upper_percentile = 100 - lower_percentile

    # np.percentile grabs the given percentiles of an array
    return np.percentile(bootstrap_estimates, [lower_percentile, upper_percentile])

print(bootstrap_ci(bs_theta1s))
[-0.00858358 -0.00549309]

Visualizing our resulting 95% confidence interval:

In [19]:
ci_line_style = dict(color="orange", width=2, dash="dash")
fig.add_vline(x=bootstrap_ci(bs_theta1s)[0], line=ci_line_style)
fig.add_vline(x=bootstrap_ci(bs_theta1s)[1], line=ci_line_style)
[Plotly histogram: Bootstrap Distribution of the Slope, with dashed lines marking the 95% CI bounds]

Given the above, we can say that "We are 95% confident that the true $\theta_1$ falls between -0.0086 and -0.0055".

  • The true $\theta_1$ is the value we would obtain if we fit our SLR model to the entire population.

Very often, we want to test whether a regression coefficient is significantly different from 0.

  • 0 is not contained in the interval above, so we can reject the null hypothesis that $\theta_1 = 0$ at a 5% significance level.

  • In other words, it is highly unlikely that we would observe our actual sample data in a world where $\theta_1 = 0$, simply due to randomness in the sample.
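We can check this programmatically (a small illustration I'm adding, not lecture code):

# 0 falls outside the bootstrap 95% CI for the slope, so we reject H0: theta_1 = 0.
lower, upper = bootstrap_ci(bs_theta1s)
print("Does the 95% CI contain 0?", lower <= 0 <= upper)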



Instructor Note: Return to Lecture.




🟪 PurpleAir¶

This example is from the Data 100 textbook: link.

The following cell does some basic data cleaning. Don't worry about the details!

In [20]:
csv_file = 'data/Full24hrdataset.csv.gz'
usecols = ['Date', 'ID', 'region', 'PM25FM', 'PM25cf1', 'TempC', 'RH', 'Dewpoint']
full_df = pd.read_csv(csv_file, usecols=usecols, parse_dates=['Date']).dropna()
full_df.columns = ['date', 'id', 'region', 'pm25aqs', 'pm25pa', 'temp', 'rh', 'dew']
full_df = full_df[(full_df['pm25aqs'] < 50)]
# drop dates with issues in the data
bad_dates = pd.to_datetime(['2019-08-21', '2019-08-22', '2019-09-24'])

# GA is the DataFrame that contains air quality measurements
GA = full_df[(full_df['id'] == 'GA1') & (~full_df['date'].isin(bad_dates))]
GA = GA.sort_values("pm25aqs")
display(full_df["region"].value_counts())
display(GA.head())
print("Number of Rows:", GA.shape[0])
region
North                5592
West                 3750
Central Southwest    1502
Southeast            1032
Alaska                365
Name: count, dtype: int64
date id region pm25aqs pm25pa temp rh dew
5416 2019-10-31 GA1 Southeast 3.100000 7.638554 19.214186 70.443672 13.674061
5401 2019-10-09 GA1 Southeast 4.200000 10.059924 24.621388 57.696801 15.708347
5407 2019-10-17 GA1 Southeast 4.200000 6.389826 16.641975 49.377778 5.921212
5411 2019-10-23 GA1 Southeast 4.300000 4.544160 16.963735 50.861111 6.650425
5325 2019-10-23 GA1 Southeast 4.304167 4.544160 16.963735 50.861111 6.650425
Number of Rows: 176

Inverse Regression¶

After we build the model that adjusts the PurpleAir measurements using AQS, we then flip the model around and use it to predict the true air quality in the future from PurpleAir measurements when we don't have a nearby AQS instrument. This is a calibration scenario. Since the AQS measurements are close to the truth, we fit the more variable PurpleAir measurements to them; this is the calibration procedure. Then, we use the calibration curve to correct future PurpleAir measurements. This two-step process is encapsulated in the simple linear model and its flipped form below.

Inverse regression:

  • First, we fit a line to predict a PA measurement from the ground truth, as recorded by an AQS instrument:

    $$\text{PA} \approx \theta_0 + \theta_1 \cdot \text{AQS}$$

  • Next, we invert the line (i.e., we don't fit another model!) so we can use a PA measurement to predict the true air quality in places where AQS sensors are not available (the algebra is checked just after this list):

    $$\text{True Air Quality} \approx -\frac{\theta_0}{\theta_1} + \frac{1}{\theta_1}\,\text{PA}$$
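As a quick algebra check, solving the first equation for AQS gives exactly the flipped form:

$$\text{PA} \approx \theta_0 + \theta_1\,\text{AQS} \;\;\Longrightarrow\;\; \text{AQS} \approx \frac{\text{PA} - \theta_0}{\theta_1} = -\frac{\theta_0}{\theta_1} + \frac{1}{\theta_1}\,\text{PA}$$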

Why perform this “inverse regression”?

  • Intuitively, AQS measurements are “true” and have no error.
  • A linear model assumes that the inputs are fixed and known, not random. We treat the PA measurements as noisy and random, and the AQS measurements as fixed and known (i.e., accurate!).
  • Algebraically identical, but statistically different.
In [21]:
# Fit an SLR predicting purple air measurements from AQS measurements
model = lm.LinearRegression().fit(GA[['pm25aqs']], GA['pm25pa'])
theta_0, theta_1 = model.intercept_, model.coef_[0], 
In [22]:
# pm25 is a measure of air quality. pm stands for "particulate matter".
fig = px.scatter(GA, x='pm25aqs', y='pm25pa', width=800)

# This code adds the SLR fit to the scatterplot. Don't worry about the details of this code.
xtest = pd.DataFrame({"pm25aqs": np.array([GA['pm25aqs'].min(), GA['pm25aqs'].max()])})
fig.add_scatter(x=xtest["pm25aqs"], y=model.predict(xtest[["pm25aqs"]]), mode='lines', 
                name="Least Squares Fit")
[Plotly scatterplot: pm25pa vs. pm25aqs, with the least squares fit]

Invert the model by isolating the AQS term, and then rename the AQS term to the "true air quality estimate".

In [23]:
print(f"True Air Quality Estimate = {-theta_0/theta_1:.2} + {1/theta_1:.2}PA") 
True Air Quality Estimate = 1.6 + 0.46PA
In [24]:
# This code adds the inverse fit to the scatterplot. 
# It may look like we are fitting a new model, but we are just creating a model
# object to make it easier to plot the inverse fit.
# Don't worry about the details of this code.
fig = px.scatter(GA, y='pm25aqs', x='pm25pa', width=800)
model2 = lm.LinearRegression().fit(GA[['pm25pa']], GA['pm25aqs'])
xtest["pm25pa"] = np.array([GA['pm25pa'].min(), GA['pm25pa'].max()])
fig.add_scatter(x=xtest["pm25pa"], y=xtest["pm25pa"] *1/theta_1 - theta_0/theta_1 , mode='lines', 
                name="Inverse Fit")
fig.add_scatter(x=xtest["pm25pa"], y=model2.predict(xtest[['pm25pa']]), mode='lines',
                name="Least Squares Fit")
[Plotly scatterplot: pm25aqs vs. pm25pa, with the inverse fit and the least squares fit]

The Barkjohn et al. model with Relative Humidity¶

Karoline Barkjohn, Brett Gantt, and Andrea Clements from the US Environmental Protection Agency developed a model to improve the PurpleAir measurements using the AQS sensor measurements. Barkjohn and group's work was so successful that, as of this writing, official US government maps, like the AirNow Fire and Smoke map, include both AQS and PurpleAir sensors and apply Barkjohn's correction to the PurpleAir data.

$$\text{PA} \approx \theta_0 + \theta_1 \cdot \text{AQS} + \theta_2 \cdot \text{RH}$$

The model that Barkjohn settled on incorporates the relative humidity.

  • The code below fits the Barkjohn model to the data and then inverts it, exactly as we did above:
In [25]:
model_h = lm.LinearRegression().fit(GA[['pm25aqs', 'rh']], GA['pm25pa'])
[theta_1, theta_2], theta_0 = model_h.coef_, model_h.intercept_

print(f"True Air Quality Estimate = {-theta_0/theta_1:1.2} + {1/theta_1:.2}PA + {-theta_2/theta_1:.2}RH") 
True Air Quality Estimate = 7.0 + 0.44PA + -0.092RH

For comparison, recall the original AQS-only inverse fit from above: True Air Quality Estimate = 1.6 + 0.46PA. (The cell below simply re-prints the humidity model's coefficients without the RH term.)

In [26]:
print(f"True Air Quality Estimate = {-theta_0/theta_1:.2} + {1/theta_1:.2}PA") 
True Air Quality Estimate = 7.0 + 0.44PA

Note that the coefficients on PA are similar, but the intercepts are quite different due to the inclusion of relative humidity.

  • The intercept now represents the predicted air quality when PA and RH are both 0, which has a different interpretation than the original intercept.


Compared to the simple linear model that only incorporated AQS, the Barkjohn et al. model with relative humidity achieves lower error. Good for prediction!
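One quick way to check that error claim on our sample (a sketch I'm adding here, not lecture code) is to compare the two models' in-sample mean squared errors:

# Sketch: compare in-sample MSE of the AQS-only model and the humidity model.
# `model` is the AQS-only fit from In [21]; `model_h` also includes relative humidity.
mse_aqs_only = np.mean((GA['pm25pa'] - model.predict(GA[['pm25aqs']])) ** 2)
mse_with_rh = np.mean((GA['pm25pa'] - model_h.predict(GA[['pm25aqs', 'rh']])) ** 2)
print("AQS-only in-sample MSE:      ", mse_aqs_only)
print("With relative humidity, MSE: ", mse_with_rh)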




Bootstrapping the regression coefficients for Purple Air¶

From the Barkjohn et al. model, the AQS coefficient $\hat{\theta}_1$:

In [27]:
theta_1
Out[27]:
2.2540167939150546

The relative humidity coefficient $\hat{\theta}_2$ is pretty close to zero:

In [28]:
theta_2
Out[28]:
0.20630108775555359

Is incorporating humidity in the model really needed?

Null hypothesis: The null hypothesis is $\theta_2 = 0$; that is, the null model is the simpler model:

$$\text{PA} \approx \theta_0 + \theta_1 \cdot \text{AQS}$$

Repeat 1,000 times to get an approximation of the bootstrap sampling distribution of the bootstrap statistic (the fitted humidity coefficient $\hat{\theta}_2$):

In [29]:
def theta2_estimate(sample):
    model = lm.LinearRegression().fit(sample[['pm25aqs', 'rh']], sample['pm25pa'])
    return model.coef_[1]
In [30]:
bs_theta2 = bootstrap(GA, theta2_estimate, 1000)
Bootstrapping:   0%|          | 0/1000 [00:00<?, ?it/s]
In [31]:
import plotly.express as px
fig = px.histogram(x=bs_theta2,
                   labels=dict(x='Bootstrapped Humidity Coefficient'),
                   histnorm='probability', 
                   width=800)
fig.add_vline(0)
fig.add_vline(x=bootstrap_ci(bs_theta2)[0], line=ci_line_style)
fig.add_vline(x=bootstrap_ci(bs_theta2)[1], line=ci_line_style)
[Plotly histogram: bootstrapped humidity coefficient, with a line at 0 and dashed lines at the 95% CI bounds]

(We know that the center will be close to the original coefficient estimated from the sample, 0.21.)

By design, the center of the bootstrap sampling distribution will be near $\hat{\theta}$ because the bootstrap population consists of the observed data. So, rather than compute the chance of a value at least as large as the observed statistic, we find the chance of a value at least as small as 0.

The hypothesized value of 0 is far from the sampling distribution:

In [32]:
len([elem for elem in bs_theta2 if elem < 0.0]) 
Out[32]:
0

None of the 1,000 simulated regression coefficients are as small as the hypothesized value of 0. Statistical logic leads us to reject the null hypothesis that the true association between humidity and air quality is 0.
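Equivalently (a small addition of mine, not lecture code), we can express this as an approximate one-sided bootstrap p-value:

# Approximate one-sided p-value: fraction of synthetic coefficients at or below 0.
approx_p_value = np.mean(np.array(bs_theta2) <= 0.0)
print("Approximate p-value:", approx_p_value)   # 0.0 in this run (none of the 1,000 estimates were at or below 0)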




The Snowy Plover¶

This example borrows some wording from Spring 2020's Data 100, Lecture 22.

The Data¶

The Snowy Plover is a tiny bird that lives on the coast in parts of California and elsewhere. It is so small that it is vulnerable to many predators and to people and dogs that don't look where they are stepping when they go to the beach. It is considered endangered in many parts of the US.

The data are about the eggs and newly-hatched chicks of the Snowy Plover. Here's a parent bird and some eggs.

[Image: plover and eggs]

The data were collected at the Point Reyes National Seashore by a former student at Berkeley. The goal was to see how the size of an egg could be used to predict the weight of the resulting chick. The bigger the newly-hatched chick, the more likely it is to survive.

[Image: plover and chick]

Each row of the data frame below corresponds to one Snowy Plover egg and the resulting chick. Note how tiny the bird is:

  • Egg Length and Egg Breadth (widest diameter) are measured in millimeters
  • Egg Weight and Bird Weight are measured in grams; for comparison, a standard paper clip weighs about one gram
In [33]:
eggs = pd.read_csv('data/snowy_plover.csv.gz')
eggs.head()
Out[33]:
egg_weight egg_length egg_breadth bird_weight
0 7.4 28.80 21.84 5.2
1 7.7 29.04 22.45 5.4
2 7.9 29.36 22.48 5.6
3 7.5 30.10 21.71 5.3
4 8.3 30.17 22.75 5.9
In [34]:
eggs.shape
Out[34]:
(44, 4)

For a particular egg, x is the vector of length, breadth, and weight. The proposed model is

$$\widehat{\text{Newborn weight}} = \hat{\theta}_0 + \hat{\theta}_1 \cdot \text{egg\_weight} + \hat{\theta}_2 \cdot \text{egg\_length} + \hat{\theta}_3 \cdot \text{egg\_breadth}$$

Let's fit this model:

In [35]:
y = eggs["bird_weight"]
X = eggs[["egg_weight", "egg_length", "egg_breadth"]]
    
model = lm.LinearRegression(fit_intercept=True).fit(X, y)

display(pd.DataFrame(
    [model.intercept_] + list(model.coef_),
    columns=['theta_hat'],
    index=['intercept', 'egg_weight', 'egg_length', 'egg_breadth']))

all_features_mse = np.mean((y - model.predict(X)) ** 2)

print("MSE", all_features_mse)
theta_hat
intercept -4.605670
egg_weight 0.431229
egg_length 0.066570
egg_breadth 0.215914
MSE 0.045470853802757616

Let's try bootstrapping the sample to obtain 95% confidence intervals for all the parameters.

In [36]:
# This function returns a list of the coefficients of the fitted model
def all_thetas(sample):
    model = lm.LinearRegression().fit(
        sample[["egg_weight", "egg_length", "egg_breadth"]],
        sample["bird_weight"])
    return [model.intercept_] + model.coef_.tolist()

We can re-use our bootstrapping function from before to get synthetic estimates of all of the coefficients:

In [37]:
bs_thetas = pd.DataFrame(
    bootstrap(eggs, all_thetas, 1000), 
    columns=['intercept', 'egg_weight', 'egg_length', 'egg_breadth'])

bs_thetas
Bootstrapping:   0%|          | 0/1000 [00:00<?, ?it/s]
Out[37]:
intercept egg_weight egg_length egg_breadth
0 -8.103400 0.216451 0.098032 0.405838
1 -11.074321 0.077284 0.148533 0.521490
2 -2.718266 0.608443 0.035361 0.111554
3 -1.080113 0.597081 0.026363 0.053714
4 -2.124909 0.663734 0.002459 0.107473
... ... ... ... ...
995 0.034319 0.965715 -0.062090 -0.009347
996 -6.894141 0.288500 0.066708 0.368040
997 -0.395814 0.695542 -0.032516 0.068442
998 -7.291577 0.499068 0.040360 0.344834
999 -4.958756 0.428270 0.047167 0.258591

1000 rows × 4 columns

Computing the confidence intervals using the 1,000 synthetic estimates:

In [38]:
cis = (bs_thetas
       .apply(bootstrap_ci).T
       .rename(columns={0: 'lower', 1: 'upper'}))
cis
Out[38]:
lower upper
intercept -14.740386 4.844829
egg_weight -0.219311 1.082662
egg_length -0.092170 0.208965
egg_breadth -0.230592 0.725745
In [39]:
def visualize_coeffs(bs_thetas, rows, cols):
    cis = (bs_thetas
       .apply(bootstrap_ci).T
       .rename(columns={0: 'lower', 1: 'upper'}))
    display(cis)
    from plotly.subplots import make_subplots
    fig = make_subplots(rows=rows, cols=cols, subplot_titles=cis.index)
    for i, coeff_name in enumerate(cis.index):
        c = (i % cols) + 1
        r = (i // cols) + 1
        fig.add_histogram(x=bs_thetas[coeff_name], name=coeff_name, 
                        row=r, col=c, histnorm='probability')
        fig.add_vline(x=0, row=r, col=c)
        fig.add_vline(x=cis.loc[coeff_name, 'lower'], line=ci_line_style, 
                      row=r, col=c)
        fig.add_vline(x=cis.loc[coeff_name, 'upper'], line=ci_line_style, 
                      row=r, col=c)
    return fig

visualize_coeffs(bs_thetas, 2, 2)
lower upper
intercept -14.740386 4.844829
egg_weight -0.219311 1.082662
egg_length -0.092170 0.208965
egg_breadth -0.230592 0.725745
[Plotly subplots: bootstrap distributions of the intercept, egg_weight, egg_length, and egg_breadth coefficients, each with 0 and the 95% CI bounds marked]

Because all the confidence intervals contain 0, we cannot reject the null hypothesis that the true coefficient on each term is 0.

Does this mean that all the parameters are statistically indistinguishable from 0?





Inspecting the Relationship between Features¶

To see what's going on, we'll make a scatter plot matrix for the data.

In [40]:
px.scatter_matrix(eggs, width=600, height=600)
[Plotly scatter matrix of egg_weight, egg_length, egg_breadth, and bird_weight]

This shows that bird_weight is highly correlated with all the other variables (the bottom row), which means fitting a linear model is a good idea.

But we also see that egg_weight is highly correlated with the other features (the top row). We saw in lecture that this kind of collinearity can produce high-variance coefficient estimates and harm inference.

Here are the numeric correlations:

In [41]:
px.imshow(eggs.corr().round(2), text_auto=True, width=600)
[Plotly heatmap of the pairwise correlations]





Changing Our Modeling Features¶

Based on the correlations above, egg_weight looks like the strongest predictor of newborn chick weight.

An SLR model with just egg_weight performs almost as well as the model that uses all three variables, and the confidence interval for $\hat{\theta}_1$ no longer contains zero.

  • Note that the model with all three features has a slightly lower MSE! In-sample MSE will never go up when you add features.
In [42]:
y = eggs["bird_weight"]
X = eggs[["egg_weight"]]
    
model = lm.LinearRegression(fit_intercept=True).fit(X, y)

display(pd.DataFrame([model.intercept_] + list(model.coef_),
             columns=['theta_hat'],
             index=['intercept', 'egg_weight']))
print("All Features MSE: ", all_features_mse)
print("Simpler model MSE: ", np.mean((y - model.predict(X)) ** 2))
theta_hat
intercept -0.058272
egg_weight 0.718515
All Features MSE:  0.045470853802757616
Simpler model MSE:  0.046493941375556846
In [43]:
# Return a list of the intercept and slope of the SLR model using just egg_weight
def egg_weight_coeff(sample):
    model = lm.LinearRegression().fit(
        sample[["egg_weight"]],
        sample["bird_weight"])
    return [model.intercept_] + model.coef_.tolist()

We can re-use our bootstrap function from earlier to generate synthetic estimates of the coefficient on egg_weight:

In [44]:
bs_thetas_egg_weight = pd.DataFrame(
    bootstrap(eggs, egg_weight_coeff, 1000), 
    columns=['intercept', 'egg_weight'])
bs_thetas_egg_weight
Bootstrapping:   0%|          | 0/1000 [00:00<?, ?it/s]
Out[44]:
intercept egg_weight
0 -0.108481 0.721276
1 0.178621 0.687295
2 0.131337 0.698104
3 -0.022869 0.712006
4 0.329215 0.669357
... ... ...
995 -0.776623 0.803877
996 0.130192 0.692799
997 -0.540054 0.782740
998 -0.498966 0.767973
999 -0.221662 0.735186

1000 rows × 2 columns

In [45]:
visualize_coeffs(bs_thetas_egg_weight, 1, 2)
lower upper
intercept -0.783541 0.917632
egg_weight 0.602692 0.809269
[Plotly subplots: bootstrap distributions of the intercept and egg_weight coefficient, with the 95% CI bounds marked]

Notice how much tighter the confidence interval is for egg_weight above, relative to the regression where we included all three collinear terms!

As this example shows, checking for collinearity is important for inference (and less so for prediction).

When we fit a model on highly correlated features, the coefficient estimates can have high variance, which yields wide confidence intervals and prevents us from drawing meaningful statistical conclusions about the relationships between features and outputs.
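One way to see this numerically (a quick sketch using the bootstrap results already computed above) is to compare the spread of the bootstrapped egg_weight coefficient with and without the collinear features:

# The egg_weight coefficient varies far more when the collinear features are included.
print("SD of egg_weight coefficient, all features:", bs_thetas['egg_weight'].std())
print("SD of egg_weight coefficient, SLR only:    ", bs_thetas_egg_weight['egg_weight'].std())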