We work through a case study in this notebook, which accompanies the lecture slides for Lecture 23 of DS 100, Fall 2017.
The task is to build a model for predicting the weight of a donkey from more easily obtained measurements.
We are following the work of Milner and Rougier in Significance 2014.
import os
from IPython.display import display, Latex, Markdown
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn import linear_model
We read the data, examine the first few rows, and confirm that we have the correct number of rows and columns.
donkeys = pd.read_csv("donkeys.csv")
donkeys.head(10)
donkeys.shape
Weight was measured twice on a sample of 31 donkeys to check the accuracy of the scale. The second weighing is in WeightAlt.
Let's compare the two values by taking the differences.
two_weighings = np.subtract(donkeys['WeightAlt'], donkeys['Weight'])
sns.distplot(two_weighings.dropna())
#plt.savefig("twoweighings.pdf")
The measurements are all within 1 kg of each other.
Next let's look for unusual values that might indicate errors or other problems.
donkeys.describe()
donkeys.quantile([0.005, 0.995])
There is one emaciated donkey with a BCS (body condition score) of 1 and one overweight donkey with a BCS of 4.5. There is also one very small donkey (28 kg), which on further inspection appears to be a baby donkey. These 3 anomalous cases are dropped from the data frame. The model that we fit will apply to reasonably healthy (BCS between 1.5 and 4) mature donkeys.
We make a copy of the original data set and name it donkeys2.
donkeys2 = donkeys[(donkeys['BCS'] > 1) & (donkeys['BCS'] < 4.5) &
(donkeys['Weight'] > 50)].copy()
donkeys2 = donkeys2.reset_index()
y_donkeys2 = donkeys2['Weight']
donkeys2.head(10)
Before we proceed with our data analysis, we divide the data into an 80% - 20% split and set aside 20% for evaluation of the model.
from sklearn.model_selection import train_test_split
#train, test = train_test_split(donkeys2, test_size=0.2, random_state=1141)
indices = np.arange(len(donkeys2))
np.random.shuffle(indices)
n_train = int(np.round((len(donkeys2)*0.8)))
n_test = len(donkeys2) - n_train
indices[:n_train]
We briefly explore the data in the training set.
We examine the categorical variables with boxplots.
We also examine the relationships between the numeric explanatory variables.
train = donkeys2.iloc[indices[ :n_train], ]
sns.boxplot(x = train['BCS'], y = train['Weight'])
#plt.savefig("bcs_box.pdf")
sns.boxplot(x = train['Age'], y = train['Weight'],
order = ['<2', '2-5', '5-10', '10-15', '15-20', '>20'])
#plt.savefig("age_box.pdf")
sns.boxplot(x = train['Sex'], y = train['Weight'], order = ['female', 'stallion', 'gelding'])
#plt.savefig("sex_box.pdf")
sns.regplot('Girth', 'Length', train)
#plt.savefig('girth_length.pdf')
Observations:
We can think of a donkey as a cylinder with appendages. This leads to the mass of the donkey being roughly proportional to a product of powers of the girth and length. This implies a model for log(weight) that is linear in the logs of girth and length, i.e., $\log(weight) = \alpha + \beta \log(length) + \gamma \log(girth)$.
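To make the cylinder idealization explicit (a worked step, not a claim from the original notebook): if the body were exactly a cylinder with circumference equal to the girth, then $$ weight \propto girth^2 \times length, $$ so $$ \log(weight) = c + 2\log(girth) + \log(length), $$ which is the linear-in-logs form above with $\beta = 1$ and $\gamma = 2$; fitting the regression lets the data choose these coefficients instead.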
Milner and Rougier broaden the starting point to consider models of the form:
$$ h_\lambda(weight) = X\theta $$
where $h_\lambda(y) = \frac{y^\lambda - 1}{\lambda}$ for $\lambda \neq 0$ and $h_\lambda(y) = \log(y)$ for $\lambda = 0$, and where $X$ has 8 possible forms according to whether each of the three variables log(girth), log(height), or log(length) is included.
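As a minimal sketch of this transformation family (the helper name h_lambda is ours; for the actual fitting below we rely on scipy.stats.boxcox):
def h_lambda(y, lmbda):
    # Box-Cox transform: (y**lmbda - 1) / lmbda, with log(y) as the lmbda = 0 limit
    y = np.asarray(y, dtype=float)
    if lmbda == 0:
        return np.log(y)
    return (y**lmbda - 1) / lmbda
# e.g. h_lambda(150, 0.5) is about 22.5 and h_lambda(150, 0) is about 5.0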
All of these models use the log-transformed values for Length, Height, and Girth. We add these transformed variables to our design matrix.
donkeys2['Length_log'] = np.log(donkeys2['Length'])
donkeys2['Height_log'] = np.log(donkeys2['Height'])
donkeys2['Girth_log'] = np.log(donkeys2['Girth'])
We develop our own loss function to reflect the cost to the donkey's health of an incorrect dose. For anesthetics in particular, we want to err on the side of underdosing: overdosing can be dangerous, while the effects of the drug can be observed, so if the donkey is still in pain the dose can be increased.
For this reason, we design our own anesthetic loss function:
def an_loss(x):
    # weight the squared relative error: 1 for underdoses (x >= 0), 3 for overdoses (x < 0)
    wt = (x >= 0) + 3 * (x < 0)
    return np.square(x) * wt
xs = np.linspace(-40, 40, 500)
loss = an_loss(xs)
plt.plot(xs, loss, label="Modified Quadratic Loss")
plt.xlabel(r"relative error")
plt.ylabel(r"Loss")
plt.savefig("l1mod_plt.pdf")
Notice that the loss function is expressed in terms of the relative error $$ \frac{y - \hat{y}}{\hat{y}}. $$ A negative relative error indicates an overdose.
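As a quick numeric check of the asymmetry (our addition, not in the original notebook), a 10% underdose incurs a loss of 0.01 while a 10% overdose incurs 0.03:
an_loss(np.array([0.1, -0.1]))   # array([0.01, 0.03])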
We want to fit the best model, where we choose both the transformation for $y$ and the best subset of the three quantitative variables: girth, length, and height. To do this, we fit the 7 possible models (all non-empty subsets of the three variables) for each $\lambda$. We use our special loss function in the optimization.
We set up a few possible $\lambda$ values, keeping to values that are easily interpretable.
lmbdas = [-1., -0.75, -0.667, -0.5, -0.333, 0, 0.333, 0.5, 0.667, 0.75, 1, 1.5, 2]
len(lmbdas)
We import the stats library, which has a Box-Cox transformation.
from scipy import stats
from scipy.stats import boxcox
We will find it useful to have the inverse of the Box-Cox transformation so that we can compare the predicted values in the original units. We write our own.
def invboxcox(y, ld):
    # invert the Box-Cox transformation: exp(y) for ld == 0, otherwise (ld*y + 1)**(1/ld)
    if ld == 0:
        return np.exp(y)
    else:
        # guard against negative transformed values so the base (ld*y2 + 1) stays positive
        y2 = np.where(y > 0, y, 0.0)
        return ((ld*y2)+1)**(1/ld)
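As a sanity check (our addition, not in the original notebook), invboxcox should undo scipy's boxcox for the lambdas we use:
y_check = np.array([100., 150., 200.])
for ld in [0, 0.5]:
    print(ld, np.allclose(invboxcox(boxcox(y_check, lmbda=ld), ld), y_check))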
We import minimize in order to optimize with our special loss function.
# import the optimize library
from scipy.optimize import minimize
In the optimization we want to use the relative error, so we define a new loss function that incorporates our original loss function.
# Update the loss function to compute relative loss
def new_loss(theta, X, y):
    # mean anesthetic loss of the relative errors (y - yhat) / yhat
    yhat = X @ theta
    return np.mean(an_loss((y - yhat) / yhat))
Before we fit all $7 \times 13$ models, we optimize for the two-variable model with girth and length and the square-root transformation ($\lambda = 0.5$) of weight.
First, we examine the relationship between the transformed weight and girth.
train = donkeys2.iloc[indices[ :n_train], ]
y_orig = train["Weight"]
y_boxcox = boxcox(y_orig, lmbda = lmbdas[7])
sns.regplot(x = train['Girth'], y = y_boxcox)
#plt.savefig('girth_length.pdf')
X = train[['Girth_log', 'Length_log']]
X_1 = np.hstack([X.values, np.ones((len(X), 1))])
res = minimize(lambda theta: new_loss(theta, X_1, y_boxcox), np.ones(3))
theta_est = res['x']
y_pred = X_1 @ theta_est
y_hat = invboxcox(y_pred, lmbdas[7])
100* np.sqrt(np.mean(an_loss((y_orig - y_hat)/ y_hat)))
#np.sqrt(np.mean((an_loss((y_orig - y_hat)/ y_hat))))
100 * np.sqrt(np.mean(an_loss((y_boxcox - y_pred)/ y_pred)))
We investigate the fit by examining the residuals.
resids = y_boxcox - y_pred
plt.scatter(y_pred, resids)
Now, we're ready to fit all of the possible models.
To iterate over all possible combinations of explanatory variables and lambdas, we first generate the combinations of explanatory variables.
import itertools
models = []
explanatory_vars = ['Girth_log', 'Length_log', 'Height_log']
for L in range(1, len(explanatory_vars)+1):
    for subset in itertools.combinations(explanatory_vars, L):
        models.append(list(subset))
models
rmse_mods = []
for a_mod in models:
    rmse = []
    for lmbda in lmbdas:
        # transform the response, fit with our anesthetic loss, then convert
        # the fitted values back to the original units
        y_boxcox = boxcox(y_orig, lmbda = lmbda)
        X = train[a_mod]
        X_1 = np.hstack([X.values, np.ones((len(X), 1))])
        res = minimize(lambda theta: new_loss(theta, X_1, y_boxcox), np.ones(len(a_mod)+1))
        theta_est = res['x']
        y_pred = X_1 @ theta_est
        y_hat = invboxcox(X_1 @ theta_est, lmbda)
        # record the spread of the relative errors (in percent)
        # rmse.append(100*np.sqrt(np.mean(((y_orig - y_hat)/ y_hat)**2)))
        rmse.append(100*np.std((y_orig - y_hat)/ y_hat))
    rmse_mods.append(rmse)
rmse_mods[3]
for i in range(0, 7):
    plt.plot(lmbdas, rmse_mods[i], "-o", label = models[i])
plt.xlabel(r"lambda")
plt.ylabel("Average Loss")
plt.legend(loc = "best")
plt.savefig("lossFunction6.pdf")
The three-variable model has slightly smaller error than the girth-length model. We will stick with the two-variable model as there is little gain from the additional variable.
The best $\lambda$ is $\lambda = 1/3$, but when we compare the error rates for $\lambda = 0, 1/3, 1/2$, we see that they differ by less than 0.03, so we choose the log transformation.
Next we transform the categorical variables into dummy variables so that we can include them in the model.
donkeys2 = pd.get_dummies(donkeys2, columns=["BCS", "Age", "Sex"])
donkeys2.head()
Drop the variables that we are not using: the 'index' column, the untransformed measurements ('Height', 'Length', 'Girth'), 'Weight' (the response, which we keep separately), 'Height_log', and 'WeightAlt'. Also drop one category from each of the 'BCS', 'Age', and 'Sex' dummies so that the design matrix is not over-parameterized.
donkeys2 = donkeys2.drop(['index', 'Height', 'Length', 'Girth', 'Weight',
'Height_log', 'WeightAlt',
'BCS_3.0', 'Age_5-10', 'Sex_female'],
axis = 1)
donkeys2.head(10)
train = donkeys2.iloc[indices[:n_train], ]
y_train = y_donkeys2[indices[ :n_train]]
print("length of train", len(train))
y_train.head(10)
X_1 = np.hstack([train.values, np.ones((len(train), 1))])
y_boxcox = boxcox(y_train, 0)
res = minimize(lambda theta: new_loss(theta, X_1, y_boxcox), np.ones(X_1.shape[1]))
theta_est = res['x']
theta_est
y_pred = X_1 @ theta_est
y_pred[:10]
Let's examine the coefficients for the dummy variables to determine whether we want to keep them in the model or if we want to collapse into fewer categories.
theta_est[[2, 3, 4, 5, 6, 10,9, 7, 8, 11, 12, 13]]
labels = ['1.5','2.0','2.5','3.5','4.0','<2','2-5', '10-15','15-20','>20', 'ge', 'st']
plt.scatter(x = range(12),
y = theta_est[[2, 3, 4, 5, 6, 10,9, 7, 8, 11, 12, 13]])
plt.xticks(range(12), labels)
plt.axvline(x=4.5)
plt.axvline(x = 9.5)
The plot of coefficients above indicates that we could collapse the age categories into 3: under 2, 2 to 5, and over 5. We also see that the coefficients for the sex dummies are close to 0, so we won't lose much if we drop the sex variable from the model. Ideally we would want to compare the size of these coefficients to their standard errors, but we won't do that here.
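If we did want rough standard errors for these coefficients, one option (a sketch of a bootstrap, not part of the original analysis; n_boot is our choice) would be to resample the training rows and refit with the same loss:
n_boot = 200
boot_thetas = []
for _ in range(n_boot):
    # resample training rows with replacement and refit with the anesthetic loss
    idx = np.random.choice(len(train), size=len(train), replace=True)
    res_b = minimize(lambda theta: new_loss(theta, X_1[idx], y_boxcox[idx]), np.ones(X_1.shape[1]))
    boot_thetas.append(res_b['x'])
theta_se = np.std(boot_thetas, axis=0)   # one standard error per coefficient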
We can drop all of the sex dummies. We can also drop the dummies for age over 10 and then the base age category will be for age over 5.
donkeys2 = donkeys2.drop(['Age_10-15', 'Age_15-20', 'Age_>20',
'Sex_gelding', 'Sex_stallion'],
axis = 1)
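For comparison, collapsing Age directly into three groups before creating dummies would have an equivalent effect (a sketch on the original Age column; the grouping variable age3 is ours):
age_map = {'5-10': '>5', '10-15': '>5', '15-20': '>5', '>20': '>5'}
age3 = donkeys['Age'].replace(age_map)   # '<2' and '2-5' are left unchanged
pd.get_dummies(age3).head()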
train = donkeys2.iloc[indices[:n_train], ]
Fit the new model to the training set.
X_1 = np.hstack([train.values, np.ones((len(train), 1))])
res = minimize(lambda theta: new_loss(theta, X_1, y_boxcox), np.ones(X_1.shape[1]))
theta_est = res['x']
theta_est
Finally, we use the fitted model to predict the weights for the donkeys in the test set. First, we get the test set.
test_set = donkeys2.iloc[indices[n_train:], ]
y_test = y_donkeys2[indices[n_train:]]
print("length of test", len(test))
X_test_1 = np.hstack([test_set.values, np.ones((len(test_set), 1))])
y_boxcox_test = boxcox(y_test, 0)
Now we estimate the transformed weight from the model fitted to the training data, and we convert the predicted values into original units.
y_test_pred = X_test_1 @ theta_est
y_test_pred_o = invboxcox(y_test_pred, 0)
The following plot shows the typical errors. The red lines correspond to plus/minus 10% of the predicted weight.
plt.scatter(y_test_pred_o, y_test)
plt.plot([75, 225], [75, 225], color="red")
plt.plot([75, 225], [82.5, 247.5], color="red")
plt.plot([75, 225], [67.5, 202.5], color="red")
plt.xlabel("Predicted Weight (kg)")
plt.ylabel("Actual Weight (kg)")
plt.savefig("model_assess.pdf")
Nearly all the predicted weights are within 10% of the actual weights. Below we calculate the percentage that are within 10%.
100* np.sum(np.abs((y_test - y_test_pred_o )/y_test_pred_o) < 0.1) / len(y_test)