import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn.linear_model as lm
# big font helper
def adjust_fontsize(size=None):
SMALL_SIZE = 8
MEDIUM_SIZE = 10
BIGGER_SIZE = 12
if size != None:
SMALL_SIZE = MEDIUM_SIZE = BIGGER_SIZE = size
plt.rc('font', size=SMALL_SIZE) # controls default text sizes
plt.rc('axes', titlesize=SMALL_SIZE) # fontsize of the axes title
plt.rc('axes', labelsize=MEDIUM_SIZE) # fontsize of the x and y labels
plt.rc('xtick', labelsize=SMALL_SIZE) # fontsize of the tick labels
plt.rc('ytick', labelsize=SMALL_SIZE) # fontsize of the tick labels
plt.rc('legend', fontsize=SMALL_SIZE) # legend fontsize
plt.rc('figure', titlesize=BIGGER_SIZE) # fontsize of the figure title
plt.style.use('fivethirtyeight')
sns.set_context("talk")
sns.set_theme()
#plt.style.use('default') # revert style to default mpl
adjust_fontsize(size=20)
%matplotlib inline
csv_file = 'data/Full24hrdataset.csv'
usecols = ['Date', 'ID', 'region', 'PM25FM', 'PM25cf1', 'TempC', 'RH', 'Dewpoint']
full_df = (pd.read_csv(csv_file, usecols=usecols, parse_dates=['Date'])
.dropna())
full_df.columns = ['date', 'id', 'region', 'pm25aqs', 'pm25pa', 'temp', 'rh', 'dew']
full_df = full_df.loc[(full_df['pm25aqs'] < 50)]
bad_dates = ['2019-08-21', '2019-08-22', '2019-09-24']
GA = full_df.loc[(full_df['id'] == 'GA1') & (~full_df['date'].isin(bad_dates)) , :]
After we build the model that adjusts the PurpleAir measurements using AQS, we then flip the model around and use it to predict the true air quality in the future from PurpleAir measurements when wec don't have a nearby AQS instrument. This is a calibration scenario. Since the AQS measurements are close to the truth, we fit the more variable PurpleAir measurements to them; this is the calibration procedure. Then, we use the calibration curve to correct future PurpleAir measurements. This two-step process is encapsulated in the simple linear model and its flipped form below.
Inverse regression:
First, we fit a line to predict a PA measurement from the ground truth, as recorded by an AQS instrument:
$$ \text{PA} \approx \theta_0 + \theta_1\text{AQS} $$
Next, we flip the line around to use a PA measurement to predict the air quality,
$$ \text{True Air Quality} \approx -\theta_0/\theta_1 + 1/\theta_1 \text{PA} $$
Why perform this “inverse regression”?
from sklearn.linear_model import LinearRegression
AQS, PA = GA[['pm25aqs']], GA['pm25pa']
model = LinearRegression().fit(AQS, PA)
theta_1, theta_0 = model.coef_[0], model.intercept_
print(f"True Air Quality Estimate = {-theta_0/theta_1:.2} + {1/theta_1:.2}PA")
True Air Quality Estimate = 1.6 + 0.46PA
This actually matches the EDA performed earlier in the textbook, which shows that PurpleAir measurements are about twice as high as AQS measurements.
The pertinent Q-Q plot (out of scope for this semester):
What is the mean squared error of the original model?
from sklearn.metrics import mean_squared_error
preds_slr = model.predict(AQS)
mean_squared_error(PA, preds_slr)
4.7083124633807225
Karoline Barkjohn, Brett Gannt, and Andrea Clements from the US Environmental Protection Agency developed a model to improve the PuprleAir measurements from the AQS sensor measurements. arkjohn and group’s work was so successful that, as of this writing, the official US government maps, like the AirNow Fire and Smoke map, includes both AQS and PurpleAir sensors, and applies Barkjohn’s correction to the PurpleAir data. $$ \begin{aligned} \text{PA} \approx \theta_0 + \theta_1 \text{AQS} + \theta_2 \text{RH} \end{aligned} $$
The model that Barkjohn settled on incorporated the relative humidity:
AQS_RH, PA = GA[['pm25aqs', 'rh']], GA['pm25pa']
model_h = LinearRegression().fit(AQS_RH, PA)
[theta_1, theta_2], theta_0 = model_h.coef_, model_h.intercept_
print(f"True Air Quality Estimate = {-theta_0/theta_1:1.2} + {1/theta_1:.2}PA + {-theta_2/theta_1:.2}RH")
True Air Quality Estimate = 7.0 + 0.44PA + -0.092RH
preds_humidity = model_h.predict(AQS_RH)
mean_squared_error(PA, preds_humidity)
3.2977168948380413
Compared to the simple linear model that only incorporated AQS, the Barkjohn et al. model with relative humidity achieves lower error. Good for prediction!
From the Barkjohn et al., model, AQS coefficient $\hat{\theta}_1$:
theta_1
2.2540167939150537
The Relative Humidity coefficient $\hat{\theta}_2$ is pretty close to zero:
theta_2
0.20630108775555353
Is incorporating humidity in the model really needed?
(Note the slight hand-wavy assumption: air quality measurements should resemble the population measurements. But this is a weak assumption because of weather conditions, time of year, location of monitors, etc. Instead we assume that measurements are similar to others taken under the same conditions as those of the original measurement.)
Null hypothesis: The null hypothesis is $\theta_2 = 0$; that is, the null model is the simpler model:
$$ \begin{aligned} \text{PA} \approx \theta_0 + \theta_1 \text{AQS} \end{aligned} $$n = len(GA) # n: size of our sample
def boot_stat(X, y):
r = randint.rvs(low=0, high=(n-1), size=n)
theta2 = LinearRegression().fit(X.iloc[r, :], y.iloc[r]).coef_[1]
return theta2
We set up the design matrix and the outcome variable and check our boot_stat
function once to test it.
from scipy.stats import randint
n = len(GA)
y = GA['pm25pa']
X = GA[['pm25aqs', 'rh']]
boot_stat(X, y)
0.1635251292176499
Repeat 10,000 times to get an approximation to the boostrap sampling distirbution of the bootstrap statistic (the fitted humidity coefficient $\hat{\theta_2}$):
rng = np.random.default_rng(42)
boot_theta_hat = [boot_stat(X, y) for _ in range(10000)]
import plotly.express as px
px.histogram(x=boot_theta_hat, nbins=50,
labels=dict(x='Bootstrapped Humidity Coefficient'),
width=350, height=250)