23 Logistic Regression II
Today, we will continue studying the Logistic Regression model. We’ll discuss decision boundaries that help inform the classification of a particular prediction. Then, we’ll pick up from last lecture’s discussion of cross-entropy loss, study a few of its pitfalls, and learn potential remedies. We will also walk through an implementation of sklearn’s logistic regression model. Lastly, we’ll return to decision rules and discuss metrics that allow us to determine our model’s performance in different scenarios.
This will introduce us to the process of thresholding, a technique used to classify data from our model’s predicted probabilities, $P(Y=1|x)$.

23.1 Decision Boundaries
In logistic regression, we model the probability that a datapoint belongs to Class 1. Last week, we developed the logistic regression model to predict that probability, but we never actually made any classifications: given a predicted probability, which class should we assign to the datapoint?
A decision rule tells us how to interpret the output of the model to make a decision on how to classify a datapoint. We commonly make decision rules by specifying a threshold, $T$, and classifying a point as Class 1 when its predicted probability is at least $T$:

$$\hat{y} = \text{classify}(x) = \begin{cases} 1, & P(Y=1|x) \ge T \\ 0, & \text{otherwise} \end{cases}$$

The threshold is often set to $T = 0.5$, although we will see later that changing it trades off different types of classification errors.
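As a quick sketch of the decision rule in code (here, `X` and `theta` are hypothetical placeholders for a design matrix and fitted parameters, not objects defined in this lecture):

```python
import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

T = 0.5  # classification threshold

# X and theta are hypothetical placeholders for a design matrix and fitted parameters.
p = sigmoid(X @ theta)        # predicted probabilities P(Y = 1 | x)
y_hat = (p >= T).astype(int)  # decision rule: classify as Class 1 when p >= T
```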
Using our decision rule, we can define a decision boundary as the “line” that splits the data into classes based on its features. For logistic regression, the decision boundary is a hyperplane, a linear combination of the features in $p$ dimensions, and we can recover it from the final logistic regression model. For example, if we have a model with 2 features (2D), the boundary is the set of points where the predicted probability equals the threshold, $\sigma(\theta_0 + \theta_1 x_1 + \theta_2 x_2) = T$. With $T = 0.5$, this simplifies to

$$\theta_0 + \theta_1 x_1 + \theta_2 x_2 = 0$$
For a model with 2 features, the decision boundary is a line in terms of its features. To make it easier to visualize, we’ve included an example of a 1-dimensional and a 2-dimensional decision boundary below. Notice how the decision boundary predicted by our logistic regression model perfectly separates the points into two classes.
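To make the 2D case concrete, the short sketch below (with hypothetical fitted parameter values) solves the boundary equation above for $x_2$, so the boundary can be drawn as a line:

```python
import numpy as np

# Hypothetical fitted parameters for a model with 2 features.
theta_0, theta_1, theta_2 = -0.5, 2.0, 1.0

# With T = 0.5, the boundary satisfies theta_0 + theta_1 * x1 + theta_2 * x2 = 0.
x1 = np.linspace(-3, 3, 100)
x2 = -(theta_0 + theta_1 * x1) / theta_2  # x2-coordinates of points on the boundary line
```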

In real life, however, that is often not the case: we typically see some overlap between points of different classes across the decision boundary. The true classes of the 2D data are shown below:

As you can see, the decision boundary predicted by our logistic regression does not perfectly separate the two classes. There’s a “muddled” region near the decision boundary where our classifier predicts the wrong class. What would the data have to look like for the classifier to make perfect predictions?
23.2 Linear Separability and Regularization
A classification dataset is said to be linearly separable if there exists a hyperplane among the input features $x$ that perfectly separates the two classes $y$.
Linear separability in 1D can be checked with a rugplot of a single feature. For example, notice how the plot on the bottom left is linearly separable along the vertical line $x = 0$: all points of one class fall on one side of that line.

This same definition holds in higher dimensions. If there are two features, the separating hyperplane must exist in two dimensions (any line of the form $y = mx + b$).

This sounds great! When the dataset is linearly separable, a logistic regression classifier can perfectly assign datapoints into classes. However, (unexpected) complications may arise. Consider the toy dataset with 2 points and only a single feature $x$:

The optimal $\theta$ value pushes each predicted probability toward its true class: $P(Y=1|x)$ should approach 1 for the point in Class 1 and approach 0 for the point in Class 0. For this dataset, that happens only as the weight diverges ($\theta \to -\infty$), so the mean cross-entropy loss approaches 0 but never reaches it, and the optimal weights never converge.
The diverging weights cause the model to be overconfident. For example, consider a new point whose true class is 1 but which falls just on the “wrong” side of the boundary: because the weights have diverged, the model predicts $P(Y=1|x) = 0$ and classifies it as Class 0.

The loss incurred by this misclassified point is infinite: the cross-entropy loss contains the term $-\log(p)$ for a point with true class 1, and here the predicted probability $p$ is 0.
Thus, diverging weights ($|\theta| \to \infty$) occur when the data is linearly separable, and this kind of “overconfidence” is a particularly dangerous form of overfitting.
Consider the loss function with respect to the parameter $\theta$:

Though it’s very difficult to see, the plateau for negative values of $\theta$ is slightly tilted downward: the loss keeps decreasing as $\theta$ decreases, approaching 0 but never reaching it, so gradient descent keeps pushing $\theta$ toward $-\infty$ and never converges.
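For intuition, here is a minimal sketch that traces the mean cross-entropy loss over a range of $\theta$ values; the two-point dataset below is a hypothetical example arranged so that the loss keeps shrinking as $\theta \to -\infty$:

```python
import numpy as np

# Hypothetical linearly separable data: one point per class, a single feature, no intercept.
x = np.array([-1.0, 1.0])
y = np.array([1, 0])

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

def mean_cross_entropy(theta):
    p = np.clip(sigmoid(theta * x), 1e-15, 1 - 1e-15)  # clip to avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

thetas = np.linspace(-10, 10, 201)
losses = [mean_cross_entropy(t) for t in thetas]
# The loss keeps shrinking as theta becomes more negative: there is no finite minimizer.
```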
23.2.1 Regularized Logistic Regression
To avoid large weights and infinite loss (particularly on linearly separable data), we use regularization. The same principles apply as with linear regression - make sure to standardize your features first.
For example, applying L2 (ridge) regularization, our new objective is to find the $\theta$ that minimizes the regularized mean cross-entropy loss:

$$\min_{\theta} \; -\frac{1}{n} \sum_{i=1}^{n} \left( y_i \log \sigma(X_i^{T} \theta) + (1 - y_i) \log\left(1 - \sigma(X_i^{T} \theta)\right) \right) + \lambda \sum_{j=1}^{p} \theta_j^2$$
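As a minimal sketch of this objective written out in code (the function and variable names here are our own illustration, not sklearn’s):

```python
import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

def regularized_cross_entropy(theta, X, y, lam):
    """Mean cross-entropy loss plus an L2 penalty of strength lam on the weights."""
    p = np.clip(sigmoid(X @ theta), 1e-15, 1 - 1e-15)  # clip to avoid log(0)
    cross_entropy = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    return cross_entropy + lam * np.sum(theta ** 2)
```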
Now, let us compare the loss functions of un-regularized and regularized logistic regression.


As we can see, adding the regularization penalty gives the loss surface a clear global minimum: the optimal $\theta$ no longer diverges to infinity.
sklearn’s logistic regression applies L2 regularization by default, with C=1.0. Here, C is the inverse of the regularization strength $\lambda$ (that is, $C = \frac{1}{\lambda}$), so setting C to a large value, for example C=300.0, results in minimal regularization.
from sklearn.linear_model import LogisticRegression

# sklearn defaults: L2 regularization with C = 1.0
model = LogisticRegression(penalty='l2', C=1.0)
model.fit(X, Y)  # X, Y: placeholder names for the design matrix and true labels
Note that in Data 100, we only use sklearn to fit logistic regression models. There is no closed-form solution for the optimal $\theta$ vector, and the gradient is a little messy (see the bonus section below for details).
From here, the .predict function returns the predicted class $\hat{y}$ of each point, applying the default threshold of $T = 0.5$ to the predicted probabilities.
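Putting the pieces together, a typical sklearn workflow looks roughly like the sketch below; `X_train` and `Y_train` are hypothetical placeholders for training data:

```python
from sklearn.linear_model import LogisticRegression

# X_train and Y_train are hypothetical placeholders for the design matrix and binary labels.
model = LogisticRegression()  # L2 regularization with C=1.0 by default
model.fit(X_train, Y_train)

probabilities = model.predict_proba(X_train)[:, 1]  # P(Y = 1 | x) for each point
classes = model.predict(X_train)                    # applies the default T = 0.5 threshold
```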
23.3 Performance Metrics
You might be thinking, if we’ve already introduced cross-entropy loss, why do we need additional ways of assessing how well our models perform? In linear regression, we made numerical predictions and used a loss function to determine how “good” these predictions were. In logistic regression, our ultimate goal is to classify data – we are much more concerned with whether or not each datapoint was assigned the correct class using the decision rule. As such, we are interested in the quality of classifications, not the predicted probabilities.
The most basic evaluation metric is accuracy, that is, the proportion of correctly classified points:

$$\text{accuracy} = \frac{\#\text{ of points classified correctly}}{\#\text{ of points total}}$$
Translated to code:
import numpy as np

def accuracy(X, Y):
    # proportion of points whose predicted class matches the true label
    return np.mean(model.predict(X) == Y)

model.score(X, Y)  # built-in accuracy function
However, accuracy is not always a great metric for classification. To understand why, let’s consider a classification problem with 100 emails where only 5 are truly spam, and the remaining 95 are truly ham. We’ll investigate two models where accuracy is a poor metric.
- Model 1: Our first model classifies every email as non-spam. The model’s accuracy is high ($\frac{95}{100} = 0.95$), but it doesn’t detect any spam emails. Despite the high accuracy, this is a bad model.
- Model 2: The second model classifies every email as spam. The accuracy is low ($\frac{5}{100} = 0.05$), but the model correctly labels every spam email. Unfortunately, it also misclassifies every non-spam email.
As this example illustrates, accuracy is not always a good metric for classification, particularly when your data could exhibit class imbalance (e.g., very few 1’s compared to 0’s).
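To see these numbers concretely, here is a small sketch that rebuilds the 100-email example with numpy and checks both accuracies:

```python
import numpy as np

# 95 ham emails (label 0) followed by 5 spam emails (label 1), as in the example above.
y_true = np.array([0] * 95 + [1] * 5)

model_1_preds = np.zeros(100, dtype=int)  # Model 1: classify everything as ham
model_2_preds = np.ones(100, dtype=int)   # Model 2: classify everything as spam

print(np.mean(model_1_preds == y_true))   # 0.95
print(np.mean(model_2_preds == y_true))   # 0.05
```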
23.3.1 Types of Classification
There are 4 different classifications that our model might make:
- True positive: correctly classify a positive point as being positive ($y = 1$ and $\hat{y} = 1$)
- True negative: correctly classify a negative point as being negative ($y = 0$ and $\hat{y} = 0$)
- False positive: incorrectly classify a negative point as being positive ($y = 0$ and $\hat{y} = 1$)
- False negative: incorrectly classify a positive point as being negative ($y = 1$ and $\hat{y} = 0$)
These classifications can be concisely summarized in a confusion matrix.

An easy way to remember this terminology is as follows:
- Look at the second word in the phrase. Positive means a prediction of 1. Negative means a prediction of 0.
- Look at the first word in the phrase. True means our prediction was correct. False means it was incorrect.
We can now write the accuracy calculation in terms of these counts as

$$\text{accuracy} = \frac{TP + TN}{n}$$

where $n$ is the total number of points.
In sklearn, we use the following syntax to compute a confusion matrix:

from sklearn.metrics import confusion_matrix

# Y_true holds the true labels and Y_pred the predicted labels
cm = confusion_matrix(Y_true, Y_pred)
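For binary labels, sklearn orders the resulting 2×2 array as [[TN, FP], [FN, TP]], so the four counts can be unpacked directly; this small sketch reuses the hypothetical Y_true and Y_pred arrays from above:

```python
from sklearn.metrics import confusion_matrix

# Y_true and Y_pred are hypothetical arrays of true and predicted binary labels.
tn, fp, fn, tp = confusion_matrix(Y_true, Y_pred).ravel()
```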

23.3.2 Accuracy, Precision, and Recall
The purpose of our discussion of the confusion matrix was to motivate better performance metrics for classification problems with class imbalance - namely, precision and recall.
Precision is defined as

$$\text{precision} = \frac{TP}{TP + FP}$$

Precision answers the question: “Of all observations that were predicted to be $1$, what proportion was actually $1$?” It penalizes false positives.
Recall (or sensitivity) is defined as

$$\text{recall} = \frac{TP}{TP + FN}$$

Recall aims to answer: “Of all observations that were actually $1$, what proportion did we predict to be $1$?” It penalizes false negatives.
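These definitions translate directly to code. The sketch below uses sklearn’s built-in scorers; Y_true and Y_pred remain hypothetical arrays of true and predicted labels:

```python
from sklearn.metrics import precision_score, recall_score

# Y_true and Y_pred are hypothetical arrays of true and predicted binary labels.
precision = precision_score(Y_true, Y_pred)  # TP / (TP + FP)
recall = recall_score(Y_true, Y_pred)        # TP / (TP + FN)
```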
Here’s a helpful graphic that summarizes our discussion above.

23.3.3 Example Calculation
In this section, we will calculate the accuracy, precision, and recall performance metrics for our earlier spam classification example. As a reminder, we had 100 emails, 5 of which were spam. We designed two models:
- Model 1: Predict that every email is non-spam
- Model 2: Predict that every email is spam
23.3.3.1 Model 1
First, let’s begin by creating the confusion matrix.
|  | Predicted 0 | Predicted 1 |
|---|---|---|
| Actual 0 | True Negative: 95 | False Positive: 0 |
| Actual 1 | False Negative: 5 | True Positive: 0 |
Convince yourself of why our confusion matrix looks like so.
Model 1’s accuracy is $\frac{95 + 0}{100} = 0.95$. Notice how our precision ($\frac{0}{0 + 0}$) is undefined because we never predicted class $1$, and our recall ($\frac{0}{0 + 5}$) is 0 because there are no true positives.
23.3.3.2 Model 2
Our confusion matrix for Model 2 looks like so.
|  | Predicted 0 | Predicted 1 |
|---|---|---|
| Actual 0 | True Negative: 0 | False Positive: 95 |
| Actual 1 | False Negative: 0 | True Positive: 5 |
Model 2’s accuracy is $\frac{0 + 5}{100} = 0.05$. Our precision ($\frac{5}{5 + 95} = 0.05$) is low because we have many false positives, but our recall ($\frac{5}{5 + 0} = 1$) is perfect: we correctly classified every spam email, and since we never predicted class $0$, there are no false negatives.
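As a sanity check, the sketch below reproduces Model 2’s precision and recall with sklearn on the same synthetic 100-email labels used earlier:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([0] * 95 + [1] * 5)    # 95 ham, 5 spam
model_2_preds = np.ones(100, dtype=int)  # Model 2: classify everything as spam

print(precision_score(y_true, model_2_preds))  # 5 / (5 + 95) = 0.05
print(recall_score(y_true, model_2_preds))     # 5 / (5 + 0)  = 1.0
```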
23.3.4 Precision vs. Recall
Precision ($\frac{TP}{TP + FP}$) penalizes false positives, while recall ($\frac{TP}{TP + FN}$) penalizes false negatives.
In fact, precision and recall are often inversely related. This is evident in our second model, where we observed a high recall and low precision. Usually, there is a tradeoff between the two: most models can minimize either the number of FP or the number of FN, and only in rare cases both.
The specific performance metric(s) to prioritize depends on the context. In many medical settings, there might be a much higher cost to missing positive cases. For instance, in our breast cancer example, it is more costly to misclassify malignant tumors (false negatives) than it is to incorrectly classify a benign tumor as malignant (false positives). In the case of the latter, pathologists can conduct further studies to verify malignant tumors. As such, we should minimize the number of false negatives. This is equivalent to maximizing recall.
23.3.5 Two More Metrics
The True Positive Rate (TPR) is defined as

$$\text{TPR} = \frac{TP}{TP + FN}$$

You’ll notice this is equivalent to recall. In the context of our spam email classifier, it answers the question: “What proportion of spam did I mark correctly?” We’d like this to be close to $1$.
The False Positive Rate (FPR) is defined as

$$\text{FPR} = \frac{FP}{FP + TN}$$

The FPR is equal to $1 - \text{specificity}$, where specificity is the true negative rate. It answers the question: “What proportion of regular email did I mark as spam?” We’d like this to be close to $0$.
As we increase the threshold $T$, both the TPR and FPR decrease, since fewer points are predicted to be positive. This relationship is plotted below for a model on a toy dataset.

23.4 Adjusting the Classification Threshold
One way to control the balance of FP vs. FN (equivalently, precision vs. recall) is by adjusting the classification threshold $T$.
The default threshold in sklearn is $T = 0.5$. As we increase the threshold, we raise the standard for how confident the classifier must be before it predicts a point to be Class 1.

As you may notice, the choice of threshold $T$ impacts our classifier’s performance.
- High $T$: Most predictions are $0$.
  - Lots of false negatives
  - Fewer false positives
- Low $T$: Most predictions are $1$.
  - Lots of false positives
  - Fewer false negatives
In fact, we can choose a threshold $T$ based on our desired number, or proportion, of false positives and false negatives. We can do so using a few different tools. We’ll touch on two of the most important ones:
- Precision-Recall Curve (PR Curve)
- “Receiver Operating Characteristic” Curve (ROC Curve)
23.4.1 Precision-Recall Curves
A Precision-Recall Curve (PR Curve) is an alternative to the ROC curve that displays the relationship between precision and recall for various threshold values. It is constructed similarly to the ROC curve.
Let’s first consider how precision and recall change as a function of the threshold $T$. In general, increasing $T$ tends to raise precision and lower recall.

Displayed below is the PR Curve for the same toy dataset. Notice how threshold values increase as we move to the left.

Once again, the perfect classifier will resemble the orange curve, this time, facing the opposite direction.

We want our PR curve to be as close to the “top right” of this graph as possible. Again, we use the AUC to determine “closeness”, with the perfect classifier exhibiting an AUC of 1. (For a PR curve, the baseline AUC of a random classifier is not 0.5 but roughly the proportion of positive points in the data.)
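In practice, the entire PR curve can be traced with sklearn. The sketch below assumes hypothetical arrays `y_true` (true labels) and `p` (predicted probabilities from some fitted model):

```python
from sklearn.metrics import precision_recall_curve, auc

# y_true holds the true labels and p the predicted probabilities P(Y = 1 | x)
# from some fitted model (both hypothetical here).
precision, recall, thresholds = precision_recall_curve(y_true, p)
pr_auc = auc(recall, precision)  # area under the PR curve
```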
23.4.2 The ROC Curve
The “Receiver Operating Characteristic” Curve (ROC Curve) plots the tradeoff between FPR and TPR. Notice how the far left of the curve corresponds to higher threshold $T$ values.

The “perfect” classifier is the one that has a TPR of 1 and an FPR of 0. This is achieved at the top left of the plot below. More generally, its ROC curve resembles the curve in orange.

We want our model to be as close to this orange curve as possible. How do we quantify “closeness”?
We can compute the area under the curve (AUC) of the ROC curve. Notice how the perfect classifier has an AUC of 1. The closer our model’s AUC is to 1, the better it is.
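Similarly, the ROC curve and its AUC can be computed with sklearn; `y_true` and `p` are again hypothetical arrays of true labels and predicted probabilities:

```python
from sklearn.metrics import roc_curve, roc_auc_score

# y_true and p (predicted probabilities) are hypothetical, as before.
fpr, tpr, thresholds = roc_curve(y_true, p)
area = roc_auc_score(y_true, p)  # AUC of the ROC curve
```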
23.4.2.1 (Bonus) What is the “worst” AUC, and why is it 0.5?
On the other hand, a terrible model will have an AUC closer to 0.5. A random predictor assigns $P(Y = 1 | x)$ uniformly at random between 0 and 1, so it cannot distinguish between the two classes. At any threshold, its TPR and FPR are equal in expectation, so its ROC curve hugs the diagonal line, giving an AUC of 0.5.

23.5 (Bonus) Gradient Descent for Logistic Regression
Let’s define the following:

$$t_i = X_i^T \theta, \qquad p_i = \sigma(t_i) = \frac{1}{1 + e^{-t_i}}$$

Now, we can simplify the cross-entropy loss:

$$R(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \left( y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right) = -\frac{1}{n} \sum_{i=1}^{n} \left( y_i t_i - \log(1 + e^{t_i}) \right)$$

Hence, the optimal $\hat{\theta}$ is the value that minimizes this loss:

$$\hat{\theta} = \underset{\theta}{\arg\min} \; R(\theta)$$

We want to minimize $R(\theta)$, so we take the derivative with respect to $\theta$:

$$\nabla_{\theta} R(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \sigma(X_i^T \theta) \right) X_i$$

Setting the derivative equal to 0 and solving for $\hat{\theta}$ yields no closed-form solution, so we minimize the loss numerically using gradient descent.
23.5.1 Gradient Descent Update Rule
For step $t$ of gradient descent with learning rate $\rho(t)$, the update over the full dataset is

$$\theta^{(t+1)} = \theta^{(t)} + \rho(t) \cdot \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \sigma\left(X_i^T \theta^{(t)}\right) \right) X_i$$
23.5.2 Stochastic Gradient Descent Update Rule
For a single datapoint $i$ chosen uniformly at random at step $t$ (or, more generally, averaging over a random batch $B$), the stochastic update is

$$\theta^{(t+1)} = \theta^{(t)} + \rho(t) \cdot \left( y_i - \sigma\left(X_i^T \theta^{(t)}\right) \right) X_i$$
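To make the batch update rule concrete, here is a minimal gradient descent sketch (not the sklearn solver, just an illustration with hypothetical data `X`, `y` and hyperparameter values):

```python
import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

def fit_logistic_gd(X, y, learning_rate=0.1, num_iters=1000):
    """Batch gradient descent on the mean cross-entropy loss (no regularization)."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(num_iters):
        # Gradient of the mean cross-entropy loss: -(1/n) * X^T (y - sigmoid(X theta))
        gradient = -(1 / n) * X.T @ (y - sigmoid(X @ theta))
        theta = theta - learning_rate * gradient
    return theta
```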