import numpy as np
import pandas as pd
import plotly.offline as py
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
import cufflinks as cf
cf.set_config_file(offline=True, sharing=False, theme='ggplot');
from sklearn.linear_model import LinearRegression
This notebook accompanies the slides for Lecture 21. You can think of it as a reading guide for the slides. It takes you through the ideas and results on the slides, in the context of one example.
Start by going over the Review of Linear Regression.
Now step away from the slides and examine the data we are about to analyze.
The Snowy Plover is a tiny bird that lives on the coast in parts of California and elsewhere. It is so small that it is vulnerable to many predators and to people and dogs that don't look where they are stepping when they go to the beach. It is considered endangered in many parts of the US.
The data are about the eggs and newly-hatched chicks of the Snowy Plover. Here's a parent bird and some eggs.
The data were collected at the Point Reyes National Seashore by a former student at Berkeley. The goal was to see how the size of an egg could be used to predict the weight of the resulting chick. The bigger the newly-hatched chick, the more likely it is to survive.
Each row of the data frame below corresponds to one Snowy Plover egg and the resulting chick. Note how tiny the bird is:
birds = pd.read_csv('snowy_plover.csv')
birds.head()
birds.shape
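For a quick sense of the scale of the measurements, here is a summary of the data (a small sketch; describe reports the count, mean, SD, and quartiles of each column).

# Summary statistics of the egg measurements and chick weights
birds.describe()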
We are going to be regressing Bird Weight on the size of the eggs. The scatter plot below shows the relation between Bird Weight and Egg Weight. You can see that it is linear but the vertical spread is not the same throughout.
fig1 = go.Figure()
data_scatter = go.Scatter(x=birds["Egg Weight"], y=birds["Bird Weight"],
mode="markers",
marker=dict(size=8))
fig1.add_trace(data_scatter)
fig1.update_layout(margin=dict(l=0, r=0, t=0, b=0),
height=400,
xaxis_title="Egg Weight",
yaxis_title="Bird Weight")
fig1
Here is a scatter plot of Bird Weight versus the two length variables Egg Length and Egg Breadth. You can turn the plot around to get a sense of the shape.
fig2 = go.Figure()
data_scatter = go.Scatter3d(x=birds["Egg Length"], y=birds["Egg Breadth"], z=birds["Bird Weight"],
mode="markers",
marker=dict(size=4))
fig2.add_trace(data_scatter)
fig2.update_layout(dict(margin=dict(l=0, r=0, t=0, b=0),
height=500,
scene=dict(
xaxis_title="Egg Length",
yaxis_title="Egg Breadth",
zaxis_title="Bird Weight")
))
fig2
It seems reasonable to fit a linear model to these data. Let's start by regressing Bird Weight on all three measurements on the eggs. That means $d=3$.
We will first use scikit-learn, just as you have done before.
Note that we are using a model with an intercept. That is, every element of the first column of the design matrix $\mathbb{X}$ is 1.
The other columns of $\mathbb{X}$ are Egg Length, Egg Breadth, and Egg Weight.
The observed responses $\mathbb{Y}$ are the column Bird Weight.
model = LinearRegression(fit_intercept=True)
model.fit(birds[["Egg Length", "Egg Breadth", "Egg Weight"]],
birds[["Bird Weight"]])
Here are the intercept and the slopes (also known as weights or coefficients) of Egg Length, Egg Breadth, and Egg Weight.
In terms of our model, these are the values in the vector $\hat{\theta}$.
model.intercept_, model.coef_
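As a check on the formula from lecture, here is a minimal sketch that builds the design matrix $\mathbb{X}$ explicitly, with a first column of 1s for the intercept, and solves the least squares problem with numpy. The resulting vector $\hat{\theta}$ should match scikit-learn's intercept and coefficients above.

# Design matrix X: a column of 1s followed by the three egg measurements
X_design = np.column_stack([
    np.ones(len(birds)),
    birds[["Egg Length", "Egg Breadth", "Egg Weight"]].values
])
Y_obs = birds["Bird Weight"].values

# Least squares solution theta_hat = argmin ||Y - X theta||^2
theta_hat, *_ = np.linalg.lstsq(X_design, Y_obs, rcond=None)
theta_hat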
We can now find our estimates of Bird Weight. These are called fitted values and can be used to predict the weight of the next little chick based on measurements on the egg.
In terms of our model, the fitted values are $\hat{\mathbb{Y}}$ which is calculated as $\mathbb{X}\hat{\theta}$.
# Note: this does not make a copy; all_covariates and birds refer to the same DataFrame,
# so the columns added below also appear in birds.
all_covariates = birds
all_covariates["Fitted"] = model.predict(
birds[["Egg Length", "Egg Breadth", "Egg Weight"]]
)
all_covariates.head()
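As a sanity check, the same fitted values can be computed directly from the estimated parameters, since $\hat{\mathbb{Y}} = \mathbb{X}\hat{\theta}$ is just the intercept plus a linear combination of the covariates. A small sketch:

# X theta_hat written out with sklearn's estimates: intercept + covariates @ slopes
manual_fitted = (model.intercept_
                 + birds[["Egg Length", "Egg Breadth", "Egg Weight"]].values @ model.coef_.T)
np.allclose(manual_fitted.ravel(), all_covariates["Fitted"])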
As soon as you make an estimate, you have to consider the error in it. The residuals below are defined by $e = \mathbb{Y} - \hat{\mathbb{Y}}$.
all_covariates["Residuals"] = all_covariates["Bird Weight"] - all_covariates["Fitted"]
all_covariates.head()
# The sum of the residuals is 0 (up to floating-point rounding)
all_covariates["Residuals"].sum()
These two ($\mathbb{Y}$ and $\hat{\mathbb{Y}}$) have the same average:
all_covariates["Bird Weight"].mean()
all_covariates["Fitted"].mean()
The plot below shows the residuals versus the fitted values. Notice that the residuals average to 0 and are uncorrelated with the fitted values, so there is no linear trend in the plot.
fig3 = go.Figure()
data_scatter = go.Scatter(x=birds["Fitted"], y=birds["Residuals"],
mode="markers",
marker=dict(color='red', size=8))
fig3.add_trace(data_scatter)
fig3.update_layout(margin=dict(l=0, r=0, t=0, b=0),
height=400,
xaxis_title="Fitted Values",
yaxis_title="Residuals")
fig3
Here are the observed responses $\mathbb{Y}$ plotted against the fitted values $\hat{\mathbb{Y}}$.
fig4 = go.Figure()
data_scatter = go.Scatter(x=birds["Fitted"], y=birds["Bird Weight"],
mode="markers",
marker=dict(size=8))
fig4.add_trace(data_scatter)
fig4.update_layout(margin=dict(l=0, r=0, t=0, b=0),
height=400,
xaxis_title="Fitted Values",
yaxis_title="Bird Weight")
fig4
Here is a correlation matrix that shows the correlations between all the pairs you can make from among $\mathbb{Y}$ (Bird Weight), the fitted values, and the residuals.
Note that the correlation between the fitted values and the residuals is 0.
birds[["Bird Weight", "Fitted", "Residuals"]].corr()
The correlation between the fitted values and the observed Bird Weight is pretty high: more than 0.85.
The multiple $R^2$ is the square of this correlation. It is one way of measuring how close the fitted values are to the observed responses.
# Multiple R^2
0.850884**2
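The same number can be computed directly. In scikit-learn, the score method of a fitted regression returns $R^2$; it should agree (up to rounding) with the squared correlation above. A small sketch:

# R^2 from sklearn and the squared correlation between fitted values and Bird Weight
r2_score_value = model.score(birds[["Egg Length", "Egg Breadth", "Egg Weight"]],
                             birds[["Bird Weight"]])
r2_from_corr = np.corrcoef(birds["Fitted"], birds["Bird Weight"])[0, 1] ** 2
r2_score_value, r2_from_corr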
Everything we have done thus far has been descriptive. We have not considered generalizing beyond the data that we have.
Also, everything that we have done thus far makes no assumptions about the shape of the scatter plots of the data.
To generalize – that is, to make predictions based on new eggs – we do have to make some assumptions about randomness.
Slides 25 and 26 state the model, which should feel familiar from Lecture 21.
For a particular egg, $x$ is the vector of length, breadth, and weight. The model is
$$ f_\theta(x) ~ = ~ \theta_0 + \theta_1 \cdot \text{egg\_length} + \theta_2 \cdot \text{egg\_breadth} + \theta_3 \cdot \text{egg\_weight} $$

and the observed response is $Y = f_\theta(x) + \epsilon$, where $\epsilon$ is a random error term.

To carry out the calculations needed for constructing confidence intervals and so on, I have used the module statsmodels. It's a pretty standard way of doing regression using Python. See the textbook, for example; scroll down till you see statsmodels.api being imported.

Unlike scikit-learn, statsmodels does not include an intercept unless you ask for one, so we use add_constant to specify that we want an intercept.

The summary of the fitted model below contains a lot of output. Many regression programs produce such output. Data scientists learn to use what they understand and pretend the rest isn't there. The more you study regression, the more of the entries you will be able to understand. You can go a long way just with the pieces that you understand from Data 8 and 100.
import statsmodels.api as sm
y = birds["Bird Weight"]
X_all = sm.add_constant(
(birds[["Egg Length", "Egg Breadth", "Egg Weight"]]).values)
results_all = sm.OLS(y, X_all).fit()
results_all.summary()
We are going to ignore the majority of the output. We will just focus on $R^2$ and the coefficients.
Note that the value of R-squared at the top of the table is the same as what we got earlier.

Skip to the middle section where you see the labels coef, std err, etc. The rows const, x1, x2, and x3 are respectively the Intercept, Egg Length, Egg Breadth, and Egg Weight.

The coef column is the observed value of $\hat{\theta}$, the best coefficients based on our data. You should go back and check that this is exactly what we got in the section called Estimated Parameters earlier in this notebook.

std err is short for standard error and is an approximation to the SD of the coefficient; it has been estimated from the data. The coefficient has an SD because it is a random variable. We calculated its value based on this sample, but it could have come out differently for a different sample.

Skip the next two columns and look at the two at the end. Those form approximate 95% confidence intervals for the true coefficients. For example, the last two entries in the const row are the endpoints of an approximate 95% confidence interval for the true intercept $\theta_0$. And so on.

You should check that the endpoints of the intervals aren't very different from what you would have got by going two standard errors on either side of the estimated value.
# Normal-curve based rough 95% confidence interval
# for the true intercept theta_0
-4.6057 + 4.83*np.array([-2, 2])
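You don't have to read the endpoints off the table by hand: statsmodels can return them with the conf_int method of the fitted results. A small sketch (the row labeled const is the intercept):

# 95% confidence intervals for the intercept and the three slopes
results_all.conf_int(alpha=0.05)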
Before you read this, you might want to skim the textbook, especially if you didn't take Data 8. But if you understand the hypotheses and the reasons for them, then keep reading here.
Look at the columns labeled t and P>|t|. The first is a test statistic and the second is a $p$-value for testing the null hypothesis that the corresponding true coefficient is 0.

Small $p$-values lead us to reject the null hypothesis. Equivalently, you can just look at the confidence interval. If it doesn't contain 0, reject the null (at the 5% level). If the interval does contain 0 (that is, if its two endpoints have different signs), then the data are consistent with the null hypothesis that the true coefficient is 0.
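If you prefer to work programmatically, the $p$-values are an attribute of the fitted results, and the "interval contains 0" check is a one-liner. A sketch (assuming, as above, that the rows are labeled const, x1, x2, x3):

# p-values for the null hypothesis that each true coefficient is 0
print(results_all.pvalues)

# True for each coefficient whose 95% confidence interval contains 0
ci = results_all.conf_int(alpha=0.05)
(ci.iloc[:, 0] < 0) & (ci.iloc[:, 1] > 0)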
This brings us to a mystery.
What we are realizing is that while the model fits well, none of the individual coefficients is significantly different from 0, so taken one at a time they tell us very little. We have a good overall fit but no detailed interpretation.
Now read Slides 21-22. The problem is collinearity. The covariates are related to each other, not just to the response. As a result, we can't interpret the individual coefficients.
Here is the correlation matrix of the covariates and the response.
birds[["Egg Length", "Egg Breadth", "Egg Weight", "Bird Weight"]].corr()
Because the covariates are so strongly related to each other, you can't increase one covariate while keeping the others constant. So the individual slopes can't be interpreted as the effect of changing one measurement on its own; taken individually, they have no meaning.
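One standard way to quantify this collinearity is the variance inflation factor (VIF): for each covariate it measures how well that covariate is explained by the other covariates, with values far above 1 indicating strong collinearity. Here is a minimal sketch using statsmodels' variance_inflation_factor helper on the design matrix X_all from above:

from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF for each covariate (column 0 of X_all is the intercept, so start at column 1)
covariates = ["Egg Length", "Egg Breadth", "Egg Weight"]
{name: variance_inflation_factor(X_all, i + 1) for i, name in enumerate(covariates)}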
Here are two more regressions: one of Bird Weight on Egg Weight alone, and one on Egg Length and Egg Breadth together. Notice that in each case $R^2$ hardly drops at all compared to the regression with all three covariates, and the slopes in each regression below are significantly different from 0. Separating egg weight from the two length dimensions of the egg shows that the variables do matter, and that egg length and breadth together have almost the same explanatory power as egg weight by itself.
The regression on egg weight alone looks like the one to use. It's no surprise that if you want to predict the weight of the newly-hatched chick, using the weight of the egg is your best move.
X_ew = sm.add_constant(
(birds[["Egg Weight"]]).values)
results_ew = sm.OLS(y, X_ew).fit()
results_ew.summary()
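Here is a sketch of using this regression for prediction: the get_prediction method returns a predicted Bird Weight and a 95% prediction interval for a new egg. The egg weight of 8.5 is a made-up value, in the same units as the Egg Weight column.

# Predict the chick's weight for a hypothetical new egg weighing 8.5
# (the row is [intercept, Egg Weight], matching the design matrix X_ew)
new_egg = np.array([[1.0, 8.5]])
results_ew.get_prediction(new_egg).summary_frame(alpha=0.05)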
X_el_eb = sm.add_constant(
(birds[["Egg Length", "Egg Breadth"]]).values)
results_el_eb = sm.OLS(y, X_el_eb).fit()
results_el_eb.summary()
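To see the comparison at a glance, here are the $R^2$ values of the three fits side by side (a quick sketch using the rsquared attribute of each fitted results object):

# R^2 for the full model, Egg Weight alone, and Egg Length + Egg Breadth
{"all three covariates": results_all.rsquared,
 "egg weight only": results_ew.rsquared,
 "egg length and breadth": results_el_eb.rsquared}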
The last slide (29) is the most important, because it's easy to get carried away by computation and forget that the calculations are based on some underlying assumptions about the data.