Lecture 20 – Data 100, Summer 2021

by Suraj Rampure

Thresholding in Logistic Regression

So far, our logistic regression model predicts probabilities. But we originally set out on a mission to create a classifier. How can we use our predicted probabilities to create classifications?

Let's get back to the NBA data we had last time.

Let's call this model basic_model since it only has one feature. (Eventually, we will use more features.)

It is the same model we fit in the last lecture.

As before, we can use .predict_proba to get the predicted probabilities for each class under our logistic regression model.
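Roughly, that setup might look like the following sketch (the file path, the exact column names FG_PCT_DIFF and WON, and last lecture's model settings are assumptions here):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical path; the games data is assumed to have a FG_PCT_DIFF column
# (field goal percentage differential) and a WON label (1 if the team won, else 0).
games = pd.read_csv('nba_games.csv')

basic_model = LogisticRegression()   # settings from last lecture assumed
basic_model.fit(games[['FG_PCT_DIFF']], games['WON'])

# Column 0 is P(Y = 0 | x), column 1 is P(Y = 1 | x).
basic_model.predict_proba(games[['FG_PCT_DIFF']])
```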

We can plot our model, too:
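A sketch of one way to draw the fitted curve over the data:

```python
import numpy as np
import matplotlib.pyplot as plt

plt.scatter(games['FG_PCT_DIFF'], games['WON'], s=10, alpha=0.2)

# Overlay the fitted sigmoid curve.
xs = np.linspace(games['FG_PCT_DIFF'].min(), games['FG_PCT_DIFF'].max(), 200)
plt.plot(xs, basic_model.predict_proba(pd.DataFrame({'FG_PCT_DIFF': xs}))[:, 1], color='gold')
plt.xlabel('FG_PCT_DIFF')
plt.ylabel('P(WON = 1)');
```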

We need to apply a threshold to convert these probabilities into class labels (1 or 0). We can define a function that takes in an array of probabilities and a threshold, and returns the corresponding class labels:
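One way to write this function (predict_at_threshold is the name referenced later in this notebook):

```python
def predict_at_threshold(prob, threshold):
    """Classify as 1 wherever the predicted probability is at least `threshold`, else 0."""
    return np.where(np.asarray(prob) >= threshold, 1, 0)
```

For example, `predict_at_threshold(basic_model.predict_proba(games[['FG_PCT_DIFF']])[:, 1], 0.5)` thresholds the predicted probabilities of winning at 0.5.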

Let's look at the resulting predictions for three different (arbitrarily chosen) thresholds: $T = 0.25, T = 0.5, T = 0.75$.

As we increase our threshold, fewer and fewer observations are classified as being 1.

What about models with more than one feature? Well, if our model has more than two features, we can't really visualize it. But with two features (three parameters, once we include an intercept term), we can visualize it just fine.

Let's now use FG_PCT_DIFF and PF_DIFF to predict whether or not a team will win. PF_DIFF is the difference in the number of personal fouls your team and the other team were charged with.
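Fitting this two-feature model might look like this (again assuming the column names and model settings):

```python
better_model = LogisticRegression()   # same settings as basic_model assumed
better_model.fit(games[['FG_PCT_DIFF', 'PF_DIFF']], games['WON'])

better_model.intercept_, better_model.coef_
```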

These results mean that our model makes predictions as

$$\hat{y} = P(Y = 1 | x) = \sigma(0.035 + 34.71 \cdot \text{FG_PCT_DIFF} - 0.160 \cdot \text{PF_DIFF})$$

Interpreting these coefficients: having a positive FG_PCT_DIFF really helps your team's chances of winning, while having a positive PF_DIFF hurts your team's chances of winning.

We can visualize thresholds here, too. It'd be a little tedious to do this three times, so let's just show $T = 0.3$ (chosen arbitrarily):
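A sketch of one way to visualize this, coloring each game by its predicted class at $T = 0.3$:

```python
p_hat = better_model.predict_proba(games[['FG_PCT_DIFF', 'PF_DIFF']])[:, 1]
labels_at_03 = predict_at_threshold(p_hat, 0.3)

plt.scatter(games['FG_PCT_DIFF'], games['PF_DIFF'], c=labels_at_03, cmap='coolwarm', s=10)
plt.xlabel('FG_PCT_DIFF')
plt.ylabel('PF_DIFF');
```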

Evaluating classifiers

First, let's compute the accuracy of our better_model when using $T = 0.5$ (the default in scikit-learn).

As usual, scikit-learn can do this for us. The .score method of a LogisticRegression classifier gives us its accuracy.
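A sketch of both computations:

```python
X = games[['FG_PCT_DIFF', 'PF_DIFF']]
Y = games['WON']

# Manual accuracy at T = 0.5; .predict() uses this threshold by default.
(better_model.predict(X) == Y).mean()

# Equivalent, using the built-in score method (mean accuracy).
better_model.score(X, Y)
```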

Confusion matrix

Our good old friend scikit-learn has a built-in confusion matrix function (of course it does).
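Using sklearn.metrics.confusion_matrix:

```python
from sklearn.metrics import confusion_matrix

# With labels 0 and 1, the layout is:
# [[TN, FP],
#  [FN, TP]]
confusion_matrix(Y, better_model.predict(X))
```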

Precision and Recall

We can also compute the number of TP, TN, FP, and FN for our classifier, and hence its precision and recall.
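A sketch of computing the counts and the two metrics directly:

```python
y_true = Y.to_numpy()
y_pred = better_model.predict(X)

tp = np.sum((y_pred == 1) & (y_true == 1))
tn = np.sum((y_pred == 0) & (y_true == 0))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))

precision = tp / (tp + fp)   # of the games we predicted as wins, how many were actually wins?
recall = tp / (tp + fn)      # of the actual wins, how many did we correctly predict?
```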

These numbers match what we see in the confusion matrix above.

It's important to remember that these values are all for the threshold of $T = 0.5$, which is scikit-learn's default.

Accuracy vs. threshold, Precision vs. threshold, Recall vs. threshold

We already have a function predict_at_threshold, which takes in a list of probabilities and a threshold and computes the predicted labels (1s and 0s) at that threshold.

Let's also create our own accuracy_custom_threshold function, which takes in a fitted model, a feature matrix X, true labels y, and a threshold, and returns the accuracy of the model's predictions on X at that threshold. Note that this function is not specific to our better_model, so we can use it later.
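A sketch of such a function, built on predict_at_threshold:

```python
def accuracy_custom_threshold(model, X, y, threshold):
    """Accuracy of `model` on (X, y) when classifying at the given probability threshold."""
    p = model.predict_proba(X)[:, 1]
    return np.mean(predict_at_threshold(p, threshold) == y)

accuracy_custom_threshold(better_model, X, Y, 0.5)
```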

This is the same accuracy that model.score gave us before, so we should be good to go.

Let's plot what this accuracy looks like, for various thresholds:
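For example:

```python
thresholds = np.arange(0, 1, 0.01)
accs = [accuracy_custom_threshold(better_model, X, Y, t) for t in thresholds]

plt.plot(thresholds, accs)
plt.xlabel('threshold')
plt.ylabel('accuracy');
```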

We will make similar helper functions for precision and recall, and make similar plots.
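Sketches of the two helpers (note that at thresholds high enough that no games are classified as wins, precision becomes 0/0 and is undefined):

```python
def precision_custom_threshold(model, X, y, threshold):
    y_hat = predict_at_threshold(model.predict_proba(X)[:, 1], threshold)
    return np.sum((y_hat == 1) & (y == 1)) / np.sum(y_hat == 1)

def recall_custom_threshold(model, X, y, threshold):
    y_hat = predict_at_threshold(model.predict_proba(X)[:, 1], threshold)
    return np.sum((y_hat == 1) & (y == 1)) / np.sum(y == 1)

precisions = [precision_custom_threshold(better_model, X, Y, t) for t in thresholds]
recalls = [recall_custom_threshold(better_model, X, Y, t) for t in thresholds]
```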

Precision-Recall Curves

We can also plot precision vs. recall.

scikit-learn can also do this for us.
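Using sklearn.metrics.precision_recall_curve:

```python
from sklearn.metrics import precision_recall_curve

p_hat = better_model.predict_proba(X)[:, 1]
precision, recall, threshs = precision_recall_curve(Y, p_hat)

plt.plot(recall, precision)
plt.xlabel('recall')
plt.ylabel('precision');
```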

Same thing, just a bit more granular.

We can also plot a ROC curve (explained in lecture).
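Using sklearn.metrics.roc_curve, which returns the false positive rate and true positive rate at each threshold:

```python
from sklearn.metrics import roc_curve

fpr, tpr, threshs = roc_curve(Y, p_hat)

plt.plot(fpr, tpr)
plt.xlabel('false positive rate (FPR)')
plt.ylabel('true positive rate (TPR)');
```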

For a random classifier, the ROC curve is (approximately) the diagonal line from (0, 0) to (1, 1):

Decision Boundaries

Our model's predicted probability of a win is

$$P(Y = 1 | x) = \sigma(0.035 + 34.705 \cdot \text{FG_PCT_DIFF} - 0.160 \cdot \text{PF_DIFF})$$

The decision boundary at threshold $T = 0.3$ is where this probability is exactly equal to the threshold:

$$\sigma(0.035 + 34.705 \cdot \text{FG_PCT_DIFF} - 0.160 \cdot \text{PF_DIFF}) = 0.3$$

Applying $\sigma^{-1}$ to both sides gives a linear equation in our two features:

$$\boxed{0.035 + 34.705 \cdot \text{FG_PCT_DIFF} - 0.160 \cdot \text{PF_DIFF} = \sigma^{-1}(0.3)}$$
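Since $\sigma^{-1}$ is the logit function, this boundary is a line in the (FG_PCT_DIFF, PF_DIFF) plane. A sketch of plotting it from the fitted coefficients:

```python
from scipy.special import logit   # logit is the inverse of the sigmoid

theta_0 = better_model.intercept_[0]
theta_1, theta_2 = better_model.coef_[0]
T = 0.3

# Solve theta_0 + theta_1 * FG_PCT_DIFF + theta_2 * PF_DIFF = logit(T) for PF_DIFF.
fg = np.linspace(games['FG_PCT_DIFF'].min(), games['FG_PCT_DIFF'].max(), 100)
pf = (logit(T) - theta_0 - theta_1 * fg) / theta_2

plt.scatter(games['FG_PCT_DIFF'], games['PF_DIFF'], c=games['WON'], cmap='coolwarm', s=10)
plt.plot(fg, pf, color='black')
plt.xlabel('FG_PCT_DIFF')
plt.ylabel('PF_DIFF');
```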

Linear Separability

Consider the following toy dataset.

Let's look at the mean cross-entropy loss surface for this toy dataset, and a single feature model $\hat{y} = \sigma(\theta x)$.
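The lecture's actual toy dataset isn't reproduced here, so this sketch uses a hypothetical linearly separable one (toy_x and toy_y are made-up values); the shape of the resulting curve is the point, not the exact numbers:

```python
# Hypothetical separable toy data (the lecture's actual toy dataset is not shown here).
toy_x = np.array([-1.0, 1.0])
toy_y = np.array([0, 1])

def sigma(t):
    return 1 / (1 + np.exp(-t))

def mean_cross_entropy(theta, x, y):
    p = sigma(theta * x)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

thetas = np.linspace(-10, 10, 200)
plt.plot(thetas, [mean_cross_entropy(t, toy_x, toy_y) for t in thetas])
plt.xlabel(r'$\theta$')
plt.ylabel('mean cross-entropy loss');
```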

But if we add regularization, the loss surface has a finite minimum:
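A sketch of the same loss with an L2 penalty added (the penalty weight of 0.1 is arbitrary):

```python
def regularized_loss(theta, x, y, reg=0.1):
    # The L2 penalty grows with |theta|, so the total loss now has a finite minimizer.
    return mean_cross_entropy(theta, x, y) + reg * theta**2

plt.plot(thetas, [regularized_loss(t, toy_x, toy_y) for t in thetas])
plt.xlabel(r'$\theta$')
plt.ylabel('regularized loss');
```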

Let's look at another example dataset to illustrate linear separability.

The following data is linearly separable:

And the following here is not:

Full Demo

As a demo of the model fitting process from end-to-end, let's fit a regularized LogisticRegression model on the iris data, while performing a train/test split.

Let's try to predict the species of our irises. But there are three possible values of species right now:

So let's create a new column is_versicolor that is 1 if the iris is a versicolor, and a 0 otherwise.
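A sketch of this setup; loading the iris data through seaborn, using all four measurement columns as features, and the particular split proportion and random_state are all assumptions here:

```python
import seaborn as sns
from sklearn.model_selection import train_test_split

iris = sns.load_dataset('iris')   # assumed source; columns: sepal/petal measurements and species

# species takes three values: 'setosa', 'versicolor', 'virginica'.
iris['is_versicolor'] = (iris['species'] == 'versicolor').astype(int)

features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
iris_train, iris_test = train_test_split(iris, test_size=0.25, random_state=42)
```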

First, let's look at the coefficients if we fit without regularization:
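A sketch (penalty='none' disables regularization in the scikit-learn versions current when this was written; newer versions spell it penalty=None; the variable names here are made up):

```python
iris_model_no_reg = LogisticRegression(penalty='none', max_iter=1000)
iris_model_no_reg.fit(iris_train[features], iris_train['is_versicolor'])

iris_model_no_reg.coef_
```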

Now let's fit with regularization, using the default value of C (the regularization hyperparameter in scikit-learn; note that C is the inverse of the regularization strength, so smaller values of C mean more regularization):
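And with the default L2 regularization:

```python
iris_model_reg = LogisticRegression()   # default: L2 penalty, C = 1.0
iris_model_reg.fit(iris_train[features], iris_train['is_versicolor'])

iris_model_reg.coef_
```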

We can see the coefficients on the regularized model are significantly smaller.

Let's evaluate the training and testing accuracy of both models – regularized and not.
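Using .score on both splits:

```python
print('train, no regularization: ', iris_model_no_reg.score(iris_train[features], iris_train['is_versicolor']))
print('train, regularization:    ', iris_model_reg.score(iris_train[features], iris_train['is_versicolor']))
print('test,  no regularization: ', iris_model_no_reg.score(iris_test[features], iris_test['is_versicolor']))
print('test,  regularization:    ', iris_model_reg.score(iris_test[features], iris_test['is_versicolor']))
```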

Unsurprisingly, the regularized model performs worse on the training data.

In this case, they both happened to perform the same on the test data. Interesting!

Question: What did we forget to do here (that we should always do when performing regularized linear or logistic regression)?