Lecture 20 - Regression coefficients¶

Data 100, Spring 2023

Acknowledgments Page

In [1]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn.linear_model as lm

# big font helper
def adjust_fontsize(size=None):
    SMALL_SIZE = 8
    MEDIUM_SIZE = 10
    BIGGER_SIZE = 12
    if size != None:
        SMALL_SIZE = MEDIUM_SIZE = BIGGER_SIZE = size

    plt.rc('font', size=SMALL_SIZE)          # controls default text sizes
    plt.rc('axes', titlesize=SMALL_SIZE)     # fontsize of the axes title
    plt.rc('axes', labelsize=MEDIUM_SIZE)    # fontsize of the x and y labels
    plt.rc('xtick', labelsize=SMALL_SIZE)    # fontsize of the tick labels
    plt.rc('ytick', labelsize=SMALL_SIZE)    # fontsize of the tick labels
    plt.rc('legend', fontsize=SMALL_SIZE)    # legend fontsize
    plt.rc('figure', titlesize=BIGGER_SIZE)  # fontsize of the figure title

plt.style.use('fivethirtyeight')
sns.set_context("talk")
sns.set_theme()
#plt.style.use('default') # revert style to default mpl
adjust_fontsize(size=20)
%matplotlib inline

PurpleAir¶

This example is from the Data 100 textbook: link.

In [2]:
csv_file = 'data/Full24hrdataset.csv'
usecols = ['Date', 'ID', 'region', 'PM25FM', 'PM25cf1', 'TempC', 'RH', 'Dewpoint']
full_df = (pd.read_csv(csv_file, usecols=usecols, parse_dates=['Date'])
        .dropna())
full_df.columns = ['date', 'id', 'region', 'pm25aqs', 'pm25pa', 'temp', 'rh', 'dew']
full_df = full_df.loc[(full_df['pm25aqs'] < 50)]


bad_dates = ['2019-08-21', '2019-08-22', '2019-09-24']
GA = full_df.loc[(full_df['id'] == 'GA1') & (~full_df['date'].isin(bad_dates)) , :]

After we build the model that adjusts the PurpleAir measurements using AQS, we then flip the model around and use it to predict the true air quality in the future from PurpleAir measurements when wec don't have a nearby AQS instrument. This is a calibration scenario. Since the AQS measurements are close to the truth, we fit the more variable PurpleAir measurements to them; this is the calibration procedure. Then, we use the calibration curve to correct future PurpleAir measurements. This two-step process is encapsulated in the simple linear model and its flipped form below.

Inverse regression:

  • First, we fit a line to predict a PA measurement from the ground truth, as recorded by an AQS instrument:

    $$ \text{PA} \approx \theta_0 + \theta_1\text{AQS} $$

  • Next, we flip the line around to use a PA measurement to predict the air quality,

    $$ \text{True Air Quality} \approx -\theta_0/\theta_1 + 1/\theta_1 \text{PA} $$

Why perform this “inverse regression”?

  • Intuitively, AQS measurements are “true” and have no error.
  • A linear model takes a “true” x value input and minimizes the error in the y direction.
  • Algebraically identical, but statistically different.
In [3]:
from sklearn.linear_model import LinearRegression

AQS, PA = GA[['pm25aqs']], GA['pm25pa']
    
model = LinearRegression().fit(AQS, PA)
theta_1, theta_0 = model.coef_[0], model.intercept_
In [4]:
print(f"True Air Quality Estimate = {-theta_0/theta_1:.2} + {1/theta_1:.2}PA") 
True Air Quality Estimate = 1.6 + 0.46PA

This actually matches the EDA performed earlier in the textbook, which shows that PurpleAir measurements are about twice as high as AQS measurements.

The pertinent Q-Q plot (out of scope for this semester):



What is the mean squared error of the original model?

In [5]:
from sklearn.metrics import mean_squared_error

preds_slr = model.predict(AQS)
mean_squared_error(PA, preds_slr)
Out[5]:
4.7083124633807225



Is there a better model? Relative Humidity¶

Karoline Barkjohn, Brett Gannt, and Andrea Clements from the US Environmental Protection Agency developed a model to improve the PuprleAir measurements from the AQS sensor measurements. arkjohn and group’s work was so successful that, as of this writing, the official US government maps, like the AirNow Fire and Smoke map, includes both AQS and PurpleAir sensors, and applies Barkjohn’s correction to the PurpleAir data. $$ \begin{aligned} \text{PA} \approx \theta_0 + \theta_1 \text{AQS} + \theta_2 \text{RH} \end{aligned} $$

The model that Barkjohn settled on incorporated the relative humidity:

In [6]:
AQS_RH, PA = GA[['pm25aqs', 'rh']], GA['pm25pa']
model_h = LinearRegression().fit(AQS_RH, PA)
[theta_1, theta_2], theta_0 = model_h.coef_, model_h.intercept_
    
print(f"True Air Quality Estimate = {-theta_0/theta_1:1.2} + {1/theta_1:.2}PA + {-theta_2/theta_1:.2}RH") 
True Air Quality Estimate = 7.0 + 0.44PA + -0.092RH
In [7]:
preds_humidity = model_h.predict(AQS_RH)
mean_squared_error(PA, preds_humidity)
Out[7]:
3.2977168948380413


Compared to the simple linear model that only incorporated AQS, the Barkjohn et al. model with relative humidity achieves lower error. Good for prediction!

Bootstrapping the regression coefficients¶

From the Barkjohn et al., model, AQS coefficient $\hat{\theta}_1$:

In [8]:
theta_1
Out[8]:
2.2540167939150537

The Relative Humidity coefficient $\hat{\theta}_2$ is pretty close to zero:

In [9]:
theta_2
Out[9]:
0.20630108775555353

Is incorporating humidity in the model really needed?

(Note the slight hand-wavy assumption: air quality measurements should resemble the population measurements. But this is a weak assumption because of weather conditions, time of year, location of monitors, etc. Instead we assume that measurements are similar to others taken under the same conditions as those of the original measurement.)

Null hypothesis: The null hypothesis is $\theta_2 = 0$; that is, the null model is the simpler model:

$$ \begin{aligned} \text{PA} \approx \theta_0 + \theta_1 \text{AQS} \end{aligned} $$
In [10]:
n = len(GA)           # n: size of our sample
def boot_stat(X, y):
    r = randint.rvs(low=0, high=(n-1), size=n)
    
    theta2 = LinearRegression().fit(X.iloc[r, :], y.iloc[r]).coef_[1]
    
    return theta2

We set up the design matrix and the outcome variable and check our boot_stat function once to test it.

In [11]:
from scipy.stats import randint


n = len(GA)
y = GA['pm25pa']
X = GA[['pm25aqs', 'rh']]

boot_stat(X, y)
Out[11]:
0.1635251292176499

Repeat 10,000 times to get an approximation to the boostrap sampling distirbution of the bootstrap statistic (the fitted humidity coefficient $\hat{\theta_2}$):

In [12]:
rng = np.random.default_rng(42)

boot_theta_hat = [boot_stat(X, y) for _ in range(10000)]
In [13]:
import plotly.express as px
px.histogram(x=boot_theta_hat, nbins=50,
            labels=dict(x='Bootstrapped Humidity Coefficient'),
            width=350, height=250)