This notebook accompanies the lecture on Logistic Regression and was updated to incorporate the new video-notebook format. If you have not already watched the accompanying lecture, you should do that first.
In this notebook we walk through the (mis)application of least-squares regression to a binary classification task. In the process, we will show why a different model and loss function are needed. We will then demonstrate how to use the scikit-learn logistic regression model.
As with other notebooks, we will use the same set of standard imports.
import numpy as np
import pandas as pd
import plotly.offline as py
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
import cufflinks as cf
cf.set_config_file(offline=True, sharing=False, theme='ggplot');
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
Note: For this notebook, the walkthrough is all in this first video (Video 1 of Lecture 23).
from IPython.display import YouTubeVideo
YouTubeVideo("FU_2LmfYOw4")
For this lecture, we will use the Wisconsin Breast Cancer Dataset, which we can obtain from scikit-learn.
This dataset consists of measurements from tumor biopsies for 569 patients as well as whether the tumor was malignant or benign.
import sklearn.datasets
data_dict = sklearn.datasets.load_breast_cancer()
data = pd.DataFrame(data_dict['data'], columns=data_dict['feature_names'])
data
The prediction task for this data is to predict whether a tumor is benign or malignant (a binary decision) given characteristics of that tumor. As a classic machine learning dataset, the labels are provided in the "target" field of the data dictionary. To put the data back in its original context, we will create a new column called "malignant" which will be 1 if the tumor is malignant and 0 if it is benign (reversing the encoding of "target").
# In data_dict['target'], 0 is malignant and 1 is benign
data['malignant'] = (data_dict['target'] == 0)
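As a quick sanity check (this cell is not part of the original walkthrough), we can confirm the encoding against the dataset's target_names and count how many tumors fall in each class:
# target_names should list 'malignant' at index 0, matching the encoding above
print(data_dict['target_names'])
print(data['malignant'].value_counts())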
What features might be a good indication of whether a tumor is benign or malignant?
data.columns
Perhaps a good starting point is the size of the tumor. Larger tumors are probably more likely to be malignant. In the following, we plot whether the tumor was malignant (1 or 0) against the "mean radius".
points = go.Scatter(x=data['mean radius'], y = 1.*data['malignant'], mode="markers")
layout = dict(xaxis=dict(title="Mean Radius"),yaxis=dict(title="Malignant"))
go.Figure(data=[points], layout=layout)
This is a clear example of over-plotting. We can improve the above plot by jittering the data:
def jitter(data, amt=0.1):
return data + amt * np.random.rand(len(data)) - amt/2.0
points = go.Scatter(x=data['mean radius'], y = jitter(data['malignant']),
mode="markers",
marker=dict(opacity=0.5))
go.Figure(data=[points], layout=layout)
Perhaps a better way to visualize the data is using overlaid histograms.
ff.create_distplot(
[data.loc[~data['malignant'], 'mean radius'],
data.loc[data['malignant'], 'mean radius']],
group_labels=["Benign","Malignant"],
bin_size=0.5)
Question: Looking at the above histograms, could you describe a rule to predict whether or not a tumor is malignant?
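One simple rule, sketched below purely for illustration (the threshold of 15 is just a guess read off the histograms, not a fitted value), is to predict malignant whenever the mean radius exceeds a threshold:
# A hypothetical hand-picked rule: predict malignant if mean radius > 15
simple_rule = data['mean radius'] > 15
print("Fraction correct:", (simple_rule == data['malignant']).mean())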
Always split your data into training and test groups.
from sklearn.model_selection import train_test_split
data_tr, data_te = train_test_split(data, test_size=0.10, random_state=42)
print("Training Data Size: ", len(data_tr))
print("Test Data Size: ", len(data_te))
Creating the $X$ and $Y$ matrices for the training data:
X = data_tr[['mean radius']].to_numpy()
Y = data_tr['malignant'].to_numpy()
"I suppose it is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail." - Abraham Maslow The Psychology of Science
We would like to predict whether the tumor is malignant from the size of the tumor. We have encoded whether a tumor is malignant or benign as 1 or 0. Those are numbers that we could pretend are continuous and directly apply least squares regression. Why not start there?
In the following, we use scikit-learn to fit a least squares linear regression model. Note, we will not use any regularization since this is a really simple one-dimensional model with a comparatively large training dataset.
lin_reg = LinearRegression()
lin_reg.fit(X, Y)
How well does our model fit the data?
X_plt = np.expand_dims(np.linspace(X.min(), X.max(), 50),1)
model_line = go.Scatter(name="Least Squares",
x=X_plt.flatten(), y=lin_reg.predict(X_plt),
mode="lines", line=dict(color="orange"))
go.Figure([points, model_line], layout=layout)
How do we measure the error in our model? In the following, we will examine some of the different error models.
In past lectures, we have used the root mean squared error as a measure of error. We can compute that here as well:
from sklearn.metrics import mean_squared_error as mse
yhat = lin_reg.predict(X)
print("Training RMSE:", np.sqrt(mse(Y, yhat)))
What does that mean for this data? It is difficult to interpret this error in the context of a classification task.
This is a classification problem so we probably want to measure how often we predict the correct value. This is sometimes called the zero-one loss (or error):
$$ \large \textbf{ZeroOneLoss} = \frac{1}{n} \sum_{i=1}^n \textbf{I}\left[ y_i \neq f_\theta(x_i) \right] $$

However, to use the classification error we need to define a decision rule that maps $f_\theta(x)$ to the $\{0,1\}$ classification values.
Suppose we instituted the following simple decision rule:
$$\Large \text{If } f_\theta(x) > 0.5 \text{ predict 1 (malignant) else predict 0 (benign).} $$

This simple decision rule declares a tumor malignant if our model predicts a value above 0.5 (closer to 1 than to 0).
In the following we plot the implication of these decisions on our training data.
is_mal_hat = lin_reg.predict(X) > 0.5
In the following plot we color the data points according to our decision rule and depict the decision boundary as a dotted vertical line.
mal_points = go.Scatter(name="Classified as Malignant",
x=X[is_mal_hat].flatten(), y = jitter(Y[is_mal_hat]),
mode="markers", marker=dict(opacity=0.5, color="red"))
ben_points = go.Scatter(name="Classified as Benign",
x=X[~is_mal_hat].flatten(), y = jitter(Y[~is_mal_hat]),
mode="markers", marker=dict(opacity=0.5, color="blue"))
dec_boundary = (0.5 - lin_reg.intercept_)/lin_reg.coef_[0]
dec_line = go.Scatter(name="Least Squares Decision Boundary",
x = [dec_boundary,dec_boundary], y=[-0.5,1.5], mode="lines",
line=dict(color="black", dash="dot"))
go.Figure([mal_points, ben_points, model_line,dec_line], layout=layout)
ZeroOneLoss
The zero-one loss is so commonly used that there is a built-in function for it in scikit-learn. Here we compute the zero-one loss for our data.
from sklearn.metrics import zero_one_loss
print("Training Fraction incorrect:",
zero_one_loss(Y, is_mal_hat))
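To connect this back to the formula above, we can compute the same quantity directly with NumPy (shown here only as a sanity check on the definition):
# Mean of the indicator [y_i != prediction_i], matching the ZeroOneLoss formula
print("Manual ZeroOneLoss:", np.mean(Y != is_mal_hat))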
Questions
In any modeling task, when discussing the error it is often helpful to have a baseline for comparison. For example, in the earlier regression lectures, a reasonable baseline for comparison would be the constant model that just predicts the average value of $Y$.
For classification tasks, a reasonable baseline would be to predict the majority class.
print("Fraction of Malignant Samples:", Y.mean())
Therefore, if we always guessed the majority class (benign), what error would we get?
# You can figure this out from the above number
print("Guess Majority:", zero_one_loss(Y, np.zeros(len(Y))))
Not surprisingly, we get an error that is identical to the fraction of examples in the other class (or classes).
This is a standard example of a common problem in classification (and perhaps modern society): class imbalance.
Class imbalance is when a disproportionate fraction of the samples are in one class (in this case benign).
In extreme cases (e.g., fraud detection) only a tiny fraction of the training data may belong to a particular class. In these settings we can achieve very high accuracy by always predicting the frequent class, without ever learning a good classifier for the rare classes.
There are many techniques for managing class imbalance; here are a few: re-sampling the data (over-sampling the rare class or under-sampling the common class), re-weighting the loss to place more emphasis on the rare class, and adjusting the decision threshold.
In this example the class imbalance is not that extreme, so we will continue without re-sampling (a brief re-sampling sketch is shown below for reference).
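For reference only, here is a minimal sketch of one such technique, random over-sampling of the minority (malignant) class on the training set; we will not use it in the rest of the notebook:
# Illustrative only: over-sample malignant rows (with replacement) until the classes balance
malignant_tr = data_tr[data_tr['malignant']]
benign_tr = data_tr[~data_tr['malignant']]
oversampled = malignant_tr.sample(len(benign_tr), replace=True, random_state=42)
balanced_tr = pd.concat([benign_tr, oversampled])
print("Balanced fraction malignant:", balanced_tr['malignant'].mean())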
Is the linear model predicting the "probability" of a tumor being malignant? Not really. Probabilities are constrained between 0 and 1. How could we learn a model that captures this probabilistic interpretation?
At the very least our model should probably predict a number between zero and one. This would at least be closer to being a probability. We could try to constrain the model:
$$ \large p_i = \min\left(\max \left( x^T \theta , 0 \right), 1\right) $$

This would look like:
def bound01(z):
u = np.where(z > 1, 1, z)
return np.where(u < 0, 0, u)
p_line = go.Scatter(name="Truncated Least Squares",
x=X_plt.flatten(), y=bound01(lin_reg.predict(X_plt)),
mode="lines", line=dict(color="green", width=8))
go.Figure([mal_points, ben_points, model_line, p_line, dec_line], layout=layout)
So far least squares regression seems pretty reasonable and we can "force" the predicted values to be bounded between 0 and 1.
Can we interpret the truncated values as probabilities?
Perhaps, but it would depend on how the model is estimated (more on this soon).
It seems like large tumor sizes are indicative of malignant tumors. Suppose we observed a very large malignant tumor that is 100mm in mean radius. What would this do to our model?
Let's add an extra data point and see what happens:
X_ex = np.vstack([X, [100]])
Y_ex = np.hstack([Y, 1.])
lin_reg_ex = LinearRegression()
lin_reg_ex.fit(X_ex, Y_ex)
extreme_point = go.Scatter(
name="Extreme Point", x=[100], y=[1], mode="markers",
marker=dict(color="green", size=10))
model_line.line.color = "gray"
X_plt_ex = np.expand_dims(np.linspace(np.min(X)-5, np.max(X)+5, 100),1)
model_line_ex = go.Scatter(name="New Least Squares",
x=X_plt_ex.flatten(), y=lin_reg_ex.predict(X_plt_ex),
mode="lines", line=dict(color="orange"))
dec_line.line.color = "gray"
dec_boundary_ex = (0.5 - lin_reg_ex.intercept_)/lin_reg_ex.coef_[0]
dec_line_ex = go.Scatter(
name="Decision Boundary",
x = [dec_boundary_ex, dec_boundary_ex], y=[-0.5,1.5], mode="lines",
line=dict(color="black", dash="dash"))
go.Figure([mal_points, ben_points, model_line, model_line_ex, dec_line, dec_line_ex, extreme_point], layout=layout)
The addition of the extreme point shifted the linear model from the gray least squares line to the new orange least squares line. This shift moved the decision boundary and produced a less accurate model! This is a little surprising. Indeed, if we keep increasing the size of this one tumor, we can cause the model to incorrectly classify all the smaller malignant tumors as benign.
print("Before:",
zero_one_loss(Y_ex, lin_reg.predict(X_ex) > 0.5))
print("After:",
zero_one_loss(Y_ex, lin_reg_ex.predict(X_ex) > 0.5))
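To see the "keep increasing the size of this one tumor" claim in action, here is a small illustrative loop (not part of the original analysis) that refits the model with ever more extreme single points:
# Refit with one increasingly extreme malignant point and track the training error
for radius in [100, 500, 1000]:
    X_big = np.vstack([X, [radius]])
    Y_big = np.hstack([Y, 1.])
    model = LinearRegression().fit(X_big, Y_big)
    print("Extreme radius", radius, "-> training error:",
          zero_one_loss(Y_big, model.predict(X_big) > 0.5))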
To address this problem, we need to both adjust our model and also introduce a loss function that is more appropriate for the classification task. In the next notebook, we introduce the logistic regression model and negative log-likelihood (cross entropy) loss.
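As a quick preview (just a sketch here; the model and its loss are developed properly in the next notebook), the LogisticRegression class we imported above can be fit to the same training data:
# Fit the scikit-learn logistic regression model on the one-feature training data
log_reg = LogisticRegression()
log_reg.fit(X, Y)
print("Logistic regression training error:",
      zero_one_loss(Y, log_reg.predict(X)))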