🎯 Lecture 23 – Data 100, Spring 2025¶

Data 100, Spring 2025

Acknowledgments Page

In [1]:
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
pio.templates["plotly"].layout.colorway = px.colors.qualitative.Vivid
px.defaults.width = 800

from scipy.optimize import minimize
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
import sklearn.linear_model as lm

We'll continue with the games dataset from last lecture.

In [2]:
basketball = pd.read_csv("data/nba.csv")

first_team = basketball.groupby("GAME_ID").first()
second_team = basketball.groupby("GAME_ID").last()
games = first_team.merge(second_team, left_index = True, right_index = True, suffixes = ["", "_OPP"])

games['GOAL_DIFF'] = games["FG_PCT"] - games["FG_PCT_OPP"]
games['WON'] = (games['WL'] == "W").astype(int)
games = games[['TEAM_NAME', 'TEAM_NAME_OPP', 'MATCHUP', 'WON', 'WL', 'AST', 'GOAL_DIFF']]

games
Out[2]:
TEAM_NAME TEAM_NAME_OPP MATCHUP WON WL AST GOAL_DIFF
GAME_ID
21700001 Boston Celtics Cleveland Cavaliers BOS @ CLE 0 L 24 -0.049
21700002 Golden State Warriors Houston Rockets GSW vs. HOU 0 L 34 0.053
21700003 Charlotte Hornets Detroit Pistons CHA @ DET 0 L 16 -0.030
21700004 Indiana Pacers Brooklyn Nets IND vs. BKN 1 W 29 0.041
21700005 Orlando Magic Miami Heat ORL vs. MIA 1 W 22 0.042
... ... ... ... ... ... ... ...
21701226 New Orleans Pelicans San Antonio Spurs NOP vs. SAS 1 W 30 0.189
21701227 Oklahoma City Thunder Memphis Grizzlies OKC vs. MEM 1 W 32 0.069
21701228 LA Clippers Los Angeles Lakers LAC vs. LAL 0 L 27 0.017
21701229 Utah Jazz Portland Trail Blazers UTA @ POR 0 L 18 -0.090
21701230 Houston Rockets Sacramento Kings HOU @ SAC 0 L 11 -0.097

1230 rows × 7 columns

As before, we will use the "GOAL_DIFF" feature to classify whether a team won (1) or lost (0) their game.

In [3]:
px.scatter(games, x="GOAL_DIFF", y="WON", color="WL", opacity=0.1)

Decision Boundaries¶

Note: We won't go over this section together during the lecture, since the slides cover the same material.

The LogisticRegression class of sklearn.linear_model behaves very similarly to the LinearRegression class. As before, we:

  1. Initialize a model object, and
  2. Fit it to our data.

You may find it helpful to recall the model formulation of a fitted logistic regression model with one input:

$$ \hat{P}_{\hat{\theta}}(Y=1 \mid X) = \sigma \left( \hat{\theta}_0 + \hat{\theta}_1 X \right) = \frac{1}{1 + e^{-(\hat{\theta}_0 + \hat{\theta}_1 X)}} $$

In [4]:
X = games[["GOAL_DIFF"]]
Y = games["WON"]

model = lm.LogisticRegression()
model.fit(X, Y)

print("Slope:", model.coef_[0][0])
print("Intercept:", model.intercept_[0])
Slope: 11.821711344721841
Intercept: -0.022895093635768988

Now, rather than predict a numeric output, we predict the probability of a datapoint belonging to Class 1. We do this using the .predict_proba method.

In [5]:
# Preview the first 10 rows
model.predict_proba(X)[:10]
Out[5]:
array([[0.64615008, 0.35384992],
       [0.35350779, 0.64649221],
       [0.5932812 , 0.4067188 ],
       [0.38656007, 0.61343993],
       [0.38376056, 0.61623944],
       [0.59042552, 0.40957448],
       [0.70312773, 0.29687227],
       [0.63252173, 0.36747827],
       [0.31156713, 0.68843287],
       [0.61588544, 0.38411456]])

By default, .predict_proba returns a 2D array.

One column contains the predicted probability that the datapoint belongs to Class 0, and the other contains the predicted probability that it belongs to Class 1 (notice that all rows sum to a total probability of 1).
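As a quick check (a small sketch, not part of the original notebook), we can confirm that each row of probabilities sums to 1:

# Each row of predict_proba should sum to 1 across the two classes.
np.allclose(model.predict_proba(X).sum(axis=1), 1.0)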

To check which is which, we can use the .classes_ attribute.

In [6]:
model.classes_
Out[6]:
array([0, 1])

This tells us that the first column contains the probabilities of belonging to Class 0 (losing the game), and the second column contains the probabilities of belonging to Class 1 (winning). Let's grab just the probabilities of Class 1.

We then apply a decision rule: Predict Class 1 if the predicted probability of belonging to Class 1 is 0.5 or higher. Otherwise, predict Class 0.

  • Remember that 0.5 is a common threshold, but we are not required to always use 0.5.
In [7]:
# Obtain P(Y=1|x) from the output.
p = model.predict_proba(X)[:, 1]

# Apply decision rule: predict Class 1 if P(Y=1|x) >= 0.5.
(p >= 0.5).astype(int)
Out[7]:
array([0, 1, 0, ..., 1, 0, 0])

By default, the .predict method of LogisticRegression applies a 0.5 threshold to classify data.

In [8]:
# .predict will automatically apply a 0.5 threshold for a logistic regression model.
classes = model.predict(X)

classes
Out[8]:
array([0, 1, 0, ..., 1, 0, 0])

The point where the sigmoid function outputs 0.5 is the decision boundary.

  • This is the point where the model is indifferent between predicting Class 0 and Class 1.

  • This is also the point where $\theta_0 + \theta_1 x = 0$.

For this one dimensional case we can solve for the $x$ value of the decision boundary:

$$ x = - \frac{\theta_0}{\theta_1} = - \frac{\text{intercept}}{\text{slope}} $$
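Plugging in the fitted intercept ($\approx -0.0229$) and slope ($\approx 11.82$) printed above:

$$ x = - \frac{-0.0229}{11.82} \approx 0.0019 $$

which matches the value computed below.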

Let's visualize our predictions.

In [9]:
-model.intercept_[0]/model.coef_[0][0]
Out[9]:
0.0019366987543636136
In [10]:
games["Predicted Class"] = pd.Categorical(classes)

test_points = pd.DataFrame({"GOAL_DIFF": np.linspace(-0.3, 0.3, 100)})
test_points["Predicted Prob"] = model.predict_proba(test_points)[:, 1]

fig = px.scatter(games, x="GOAL_DIFF", y="WON", color="Predicted Class", opacity=0.1)

# Add the logistic regression model predictions
fig.add_trace(go.Scatter(x=test_points["GOAL_DIFF"], y=test_points["Predicted Prob"], 
                         mode="lines", name="Logistic Regression Model", 
                         line_color="black", line_width=5, line_dash="dash"))
fig.add_vline(x = -model.intercept_[0]/model.coef_[0][0], line_dash="dash", 
              line_color="black",
              annotation_text="Decision Boundary", 
              annotation_position="right")

Any time the predicted probability $p$ is less than 0.5, the model predicts Class 0. Otherwise, it predicts Class 1.

A decision boundary describes the line that splits the data into classes based on the features.

For a model with one feature, the decision boundary is a single point that separates the two classes. In general, the decision boundary lives in the space of the features, so its plot has as many dimensions as there are features.

  • We visualize this using a 1D plot that places every data point according to its feature value alone.

  • The decision boundary is not defined in terms of the predictions, so we remove that axis from our plot.

Notice that all data points to the right of our decision boundary are classified as Class 1, while all data points to the left are classified as Class 0.

In [11]:
fig = px.scatter(games, x="GOAL_DIFF", y=np.zeros(len(games)),
                 symbol="WL", symbol_sequence=[ "circle-open", "cross"], 
                 color="Predicted Class", height=300, opacity=0.1)
# fig.update_traces(marker_symbol='line-ns-open')
fig.update_traces(marker_size=8)
fig.update_layout(
    yaxis=dict(showticklabels=False, showgrid=False, zeroline=False, title=""),
)

decision_boundary =  -model.intercept_[0]/model.coef_[0][0]
fig.add_vline(x = decision_boundary, line_dash="dash", 
              line_color="black",
              annotation_text="Decision Boundary", 
              annotation_position="top right")

Two Features¶

Note: We won't go over this section together during the lecture, since the slides cover the same material.

We can repeat this process with a model that has two features: "AST" and "GOAL_DIFF". Now, the decision boundary is expressed in terms of both features.

How do we find the decision boundary in this case? We calculate the equation for the line that gives us all the points for which the model output is equal to the threshold.

  • Recall that the linear combination $x^T \hat{\theta}$ in logistic regression is the estimated log odds that a data point with features $x$ belongs to Class 1.

  • So, we can equivalently express our decision boundary as a probability threshold or a log odds threshold. The log odds are linear in the features, so our decision boundary is a line in the feature space!

$$\text{Probability threshold} = T = \frac{1}{1+e^{-\theta_0 -\theta_1\times\text{GOAL\_DIFF}-\theta_2\times\text{AST}}}$$

$$\Longrightarrow$$

$$\text{Log odds threshold} = \log \frac{T}{1-T} = \theta_0 + \theta_1\times\text{GOAL\_DIFF} + \theta_2\times\text{AST}$$

A probability threshold of 0.5 corresponds to a log odds threshold of $\log \frac{0.5}{1-0.5} = \log (1) = 0$.
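Setting the threshold to $T = 0.5$ (a log odds threshold of 0) and solving for AST in terms of GOAL_DIFF gives the boundary line that the code below constructs:

$$ \text{AST} = \frac{-(\theta_0 + \theta_1\times\text{GOAL\_DIFF})}{\theta_2} $$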

In [12]:
X_two_feature = games[["GOAL_DIFF", "AST"]]
Y = games["WON"]

two_feature_model = lm.LogisticRegression()
two_feature_model.fit(X_two_feature, Y)

# This function plots the decision boundary such that AST is a function of GOAL_DIFF.
theta0 = two_feature_model.intercept_
theta1, theta2 = two_feature_model.coef_[0]
print(theta0, theta1, theta2)
[-2.1118332] 10.785521824889493 0.09027541671741883

Make predictions using the new model:

In [13]:
games["Predicted Class"] = two_feature_model.predict(X_two_feature)
games.head()
Out[13]:
TEAM_NAME TEAM_NAME_OPP MATCHUP WON WL AST GOAL_DIFF Predicted Class
GAME_ID
21700001 Boston Celtics Cleveland Cavaliers BOS @ CLE 0 L 24 -0.049 0
21700002 Golden State Warriors Houston Rockets GSW vs. HOU 0 L 34 0.053 1
21700003 Charlotte Hornets Detroit Pistons CHA @ DET 0 L 16 -0.030 0
21700004 Indiana Pacers Brooklyn Nets IND vs. BKN 1 W 29 0.041 1
21700005 Orlando Magic Miami Heat ORL vs. MIA 1 W 22 0.042 1

In the following, we compute the decision boundary for this model:

In [14]:
# Construct the decision boundary
decision_boundary = pd.DataFrame({"GOAL_DIFF": np.linspace(-0.3, 0.3, 100)})

# Compute the y-values of the decision boundary (AST) using a grid of x-values (GOAL_DIFF).
# The decision boundary is defined by the equation:
# 0 = theta0 + theta1 * GOAL_DIFF + theta2 * AST
decision_boundary["AST"] = (theta0 + theta1*decision_boundary["GOAL_DIFF"])/(-theta2)

We can plot the decision boundary alongside the true class labels, and the predicted class labels.

Notice that there are a lot of misclassifications!

In [15]:
games['Predicted Class'] = pd.Categorical(games['Predicted Class'])
fig = px.scatter(games, x="GOAL_DIFF", y="AST", symbol="WL", 
                 hover_data=['TEAM_NAME', 'TEAM_NAME_OPP'],
                 color="Predicted Class", 
                 symbol_sequence=[ "circle-open", "cross"],
                 opacity=0.7,
                 height=600)
fig.update_traces(marker=dict(size=8))
fig.update_layout(xaxis_range=[-0.3, 0.3], yaxis_range=[5, 50])
# Add the decision boundary to the plot
fig.add_scatter(x=decision_boundary["GOAL_DIFF"], y=decision_boundary["AST"],
                mode="lines", line_color="black", line_dash="dash", 
                name="Decision Boundary")

Adding the probabilities to the plot, with lighter shades corresponding to estimated probabilities closer to 0:

In [16]:
goal_diff, ast = np.meshgrid(np.linspace(-0.3, 0.3, 50), np.linspace(5, 50, 50))
pred_grid = pd.DataFrame({"GOAL_DIFF": np.ravel(goal_diff), "AST": np.ravel(ast)})
pred_grid['Probability'] = two_feature_model.predict_proba(pred_grid)[:, 1]
# fig = go.Figure()
fig.add_contour(x=pred_grid['GOAL_DIFF'], y=pred_grid['AST'], z=pred_grid['Probability'],
                showscale=False, opacity=0.4, colorscale="Matter")

Linear Separability¶

Note: We won't go over this section together during the lecture, since the slides cover the same material.

A linearly separable dataset is one that can be perfectly separated into two classes by a hyperplane in the feature space.

  • A hyperplane is a decision boundary extended to arbitrarily many dimensions. For example, a model with three features would have a flat plane (a 2D hyperplane) in 3D feature space as its decision boundary.
In [17]:
import seaborn as sns
iris = sns.load_dataset("iris")

This dataset is linearly separable:

In [18]:
fig = px.scatter(iris[iris["species"] != "virginica"], 
                 x = "petal_length",
                 y = "petal_width", 
                 color="species", 
                 symbol="species", symbol_sequence=[ "circle", "cross"],
                 render_mode="svg")
fig.update_traces(marker=dict(size=12))
fig

And this dataset is not.

In [19]:
fig = px.scatter(iris[iris["species"] != "setosa"], 
                 x = "petal_length",
                 y = "petal_width", 
                 color="species", 
                 symbol="species", symbol_sequence=[ "circle", "cross"],
                 render_mode="svg")
fig.update_traces(marker=dict(size=12))
fig

When our data is linearly separable, we run the risk of diverging weights as the model attempts to reduce cross-entropy loss to 0.

To see why, consider the following artificially generated "toy" dataset.

In [20]:
toy_df = pd.DataFrame({"x": [-1, 1], "y": [0, 1], "label": pd.Categorical([0, 1])})
fig = px.scatter(toy_df, x="x", y="y", 
                 color="label", symbol="label", 
                 symbol_sequence=[ "circle", "cross"],
                 render_mode="svg")
fig.update_traces(marker=dict(size=12))

Let's look at the mean cross-entropy loss surface for this toy dataset and a single-feature model $\hat{y} = \sigma(\theta_1 x)$.

For this situation, our logistic regression model takes the form:

$$ \Large \hat{P}_{\theta}(Y = 1 | x) = \sigma(\theta_1 x) = \frac{1}{1 + e^{-\theta_1 x}} $$

With mean cross-entropy loss:

\begin{align} \hat{\theta}_1 &= \underset{\theta_1}{\operatorname{argmin}} - \frac{1}{n} \sum_{i=1}^n \left( y_i \log (\sigma(\theta_1 x_i)) + (1 - y_i) \log (1 - \sigma(\theta_1 x_i)) \right) \\ &= \underset{\theta_1}{\operatorname{argmin}} -\frac{1}{2} \left[ \log (1 - \sigma(-\theta_1)) + \log (\sigma(\theta_1)) \right] \\ &= \underset{\theta_1}{\operatorname{argmin}} - \log (\sigma(\theta_1)) \end{align}

In [21]:
def toy_model(theta1, x):
    return 1/(1 + np.exp(-theta1 * x))

def mean_cross_entropy_loss_toy(theta1):
    # Here we use 1 - sigma(z) = sigma(-z) to improve numerical stability.
    return - np.mean(toy_df['y'] * np.log(toy_model(theta1, toy_df['x'])) + \
                     (1-toy_df['y']) * np.log(toy_model(theta1, -toy_df['x'])))
In [22]:
thetas = np.linspace(-30, 30, 100)
fig = px.line(x=thetas, y = [mean_cross_entropy_loss_toy(theta) for theta in thetas], 
              render_mode="svg")
fig.update_layout(xaxis_title="Theta", yaxis_title="Mean CE Loss",
                  title="Mean Cross Entropy Loss for Toy Example")

Let's switch the y-axis to log scale to better visualize the loss surface for larger $\theta$.

In [23]:
fig = px.line(x=thetas, y = [mean_cross_entropy_loss_toy(theta) for theta in thetas],
              log_y=True, render_mode="svg")
fig.update_layout(xaxis_title="Theta", yaxis_title="Log Scale Mean CE Loss",
                  title="Log Scale Mean Cross Entropy Loss for Toy Example")

We can keep decreasing the loss if we increase the value of $\theta$.

If left unchecked, the logistic regression model will attempt to use infinite values as the "optimal" model parameters. We describe this phenomenon as the model weights "diverging".

We can use regularization to restrict how large the model parameters can be.

In [24]:
def regularized_loss_toy(theta1, reg):
    return mean_cross_entropy_loss_toy(theta1) + reg * theta1**2
In [25]:
reg = 0.01 # Small amount of regularization
fig = px.line(x=thetas, y = [regularized_loss_toy(theta, reg) for theta in thetas],
              render_mode="svg")
fig.update_layout(xaxis_title="Theta", yaxis_title="Mean CE Loss",
                  title=f"Mean Cross Entropy Loss for Toy Example (Regularization = {reg})")

Much better!

〰️ Regularization in sklearn¶

This is the only section we will go over together during lecture!

By default, sklearn's LogisticRegression applies L2 regularization for us (with C=1.0).

Here, a larger C means less regularization, since $\lambda=\frac{1}{C}$.

In [26]:
# LogisticRegression objects behave a lot like LinearRegression objects.
# L2 regularization is applied by default, where lambda = 1 / C.
# Bigger C means less regularization.
toy_model = lm.LogisticRegression(C=10)

# We fit to two data points: (-1, 0) and (1, 1).
toy_model.fit([[-1], [1]], [0,1])

# Generate estimated probabilities across a range of x-values.
xtest = np.linspace(-1.5, 1.5, 1000)[:, np.newaxis]
p = toy_model.predict_proba(xtest)[:,1]

fig = px.scatter(toy_df, x="x", y="y", 
         color="label", symbol="label", 
         symbol_sequence=["circle", "cross"],
         title=f"LR Fit (slope = {toy_model.coef_[0][0]}, intercept = {toy_model.intercept_[0]})",
         render_mode="svg")
fig.update_traces(marker=dict(size=15))
fig.update_layout(
  xaxis_title=dict(font=dict(size=22)),
  yaxis_title=dict(font=dict(size=22))
)
fig.add_scatter(x=np.ravel(xtest), y=p, mode="lines", name="LR Model with C=10", 
                line_color="black", opacity=0.5)

When we reduce the amount of regularization, notice that the fitted slope is larger, so the curve fits the two data points more closely.

In [27]:
# Fit exactly the same model, but reduce the regularization strength by 
# a factor of 100.
toy_model = lm.LogisticRegression(C=1000)
toy_model.fit([[-1], [1]], [0,1])

xtest = np.linspace(-1.5, 1.5, 1000)[:, np.newaxis]
p = toy_model.predict_proba(xtest)[:,1]

fig = px.scatter(toy_df, x="x", y="y", 
                 color="label", symbol="label", 
                 symbol_sequence=[ "circle", "cross"],
                 title=f"LR Fit (slope = {toy_model.coef_[0][0]}, intercept = {toy_model.intercept_[0]})",
                 render_mode="svg")
fig.update_traces(marker=dict(size=15))
fig.update_layout(
  xaxis_title=dict(font=dict(size=22)),
  yaxis_title=dict(font=dict(size=22))
)
fig.add_scatter(x=np.ravel(xtest), y=p, mode="lines", name="LR model with C=1000", 
                line_color="black", opacity=0.5)

🎯 Performance Metrics¶

Note: We won't cover this section together during lecture, since the same material is covered in the slides.

Let's return to our games data. We'll compute the accuracy of our model on this data.

In [28]:
def accuracy(X, Y):
    return np.mean(model.predict(X) == Y)

print(model.predict(X)[:5])
print(Y[:5])
accuracy(X, Y)
[0 1 0 1 1]
GAME_ID
21700001    0
21700002    0
21700003    0
21700004    1
21700005    1
Name: WON, dtype: int64
Out[28]:
0.7943089430894309

As per usual, scikit-learn can do this for us. The .score method of a LogisticRegression classifier provides the accuracy.

In [29]:
model.score(X, Y)
Out[29]:
0.7943089430894309

Important Note: model.predict and model.score use a threshold of 0.5. To use a different threshold, you must use model.predict_proba and work with probabilities directly.
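As an illustration (a small sketch, not part of the original demo), a custom threshold can be applied to the fitted model and X from above; the threshold value 0.7 here is just an example:

# Apply a custom threshold T instead of the default 0.5.
T = 0.7
custom_predictions = (model.predict_proba(X)[:, 1] >= T).astype(int)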

Confusion matrix¶

scikit-learn has a built-in confusion_matrix function in sklearn.metrics.

In [30]:
from sklearn.metrics import confusion_matrix

# Be careful – confusion_matrix takes in y_true as the first parameter and y_pred as the second.
# Don't mix these up!
cm = confusion_matrix(Y, model.predict(X))
cm
Out[30]:
array([[511, 114],
       [139, 466]])
In [31]:
fig = px.imshow(cm, x=["0", "1"], y=["0", "1"],
          labels=dict(x="Predicted", y="Actual"), 
          text_auto=True, 
          color_continuous_scale="Blues", 
          width=400, height=400)
fig.update_xaxes(side="top")

Precision and Recall¶

We can also compute the number of TP, TN, FP, and FN for our classifier, and then its precision and recall.

In [32]:
Y_hat = model.predict(X)
tp = np.sum((Y_hat == 1) & (Y == 1))
tn = np.sum((Y_hat == 0) & (Y == 0))

fp = np.sum((Y_hat == 1) & (Y == 0))
fn = np.sum((Y_hat == 0) & (Y == 1))


print("True Positives: ", tp)
print("True Negatives: ", tn)
print("False Positives:", fp)
print("False Negatives:", fn)
True Positives:  466
True Negatives:  511
False Positives: 114
False Negatives: 139

These numbers match what we see in the confusion matrix above.

Precision and Recall¶

Precision -- How precise are my positive predictions? In other words, what fraction of the points the model predicted as positive are actually positive?

In [33]:
precision = tp / (tp + fp)
precision
Out[33]:
0.803448275862069

Recall -- What proportion of actual positives did my model recall in its predictions? In other words, what proportion of actual positive cases were correctly identified by the model?

In [34]:
recall = tp / (tp + fn)
recall
Out[34]:
0.7702479338842976

True and False Positive Rates¶

The TP, TN, FP, and FN we just calculated also allow us to compute the true and false positive rates (TPR and FPR). Recall that TPR is the same as recall.

In [35]:
fpr = fp/(fp + tn)
fpr
Out[35]:
0.1824
In [36]:
tpr = tp/(tp + fn)
tpr
Out[36]:
0.7702479338842976

It's important to remember that these values are all for the threshold of $T = 0.5$, which is scikit-learn's default.

🎛️ Adjusting the Classification Threshold¶

Note: We won't go through this section of the demo together during lecture, since the slides cover the same material.

Before, we used a threshold of 0.5 in our decision rule: if the predicted probability was 0.5 or higher, we predicted Class 1; otherwise, we predicted Class 0.

In [37]:
X = games[["GOAL_DIFF"]]
Y = games["WON"]
model = lm.LogisticRegression()
model.fit(X, Y)
print("Slope:", model.coef_[0][0])
print("Intercept:", model.intercept_[0])
Slope: 11.821711344721841
Intercept: -0.022895093635768988
In [38]:
def plot_predictions(threshold = 0.5):
    games["Predicted Class"] = model.predict_proba(X)[:, 1] >= threshold
    # Needed for plotting
    games["Predicted Class"] = pd.Categorical(games["Predicted Class"])
    fig = px.scatter(games, 
            x="GOAL_DIFF", y="WON", color="Predicted Class", 
            title=f"Logistic Regression Predictions (Threshold = {threshold})")
    # Add the logistic regression model predictions
    # Make the data points for the LR model curve
    test_points = pd.DataFrame({"GOAL_DIFF": np.linspace(-0.3, 0.3, 100)})
    test_points["Predicted Prob"] = model.predict_proba(test_points)[:, 1]
    fig.add_trace(go.Scatter(x=test_points["GOAL_DIFF"], y=test_points["Predicted Prob"], 
                            mode="lines", name="Logistic Regression Model", 
                            line_color="black", line_width=5, line_dash="dash"))
    decision_boundary = (-np.log(1/threshold - 1) - model.intercept_[0])/model.coef_[0][0]
    fig.add_vline(x = decision_boundary, line_dash="dash", line_color="black",
                  annotation_text="Decision Boundary", annotation_position="right")
    return fig

plot_predictions(0.5)

What happens if we change the threshold? Below, we apply a threshold of $T=0.25$.

In [39]:
plot_predictions(0.25)

When we lower the threshold, we require a lower predicted probability before we predict Class 1. We can think of this as telling our model that it needs to be less "confident" that a data point is Class 1 before making a positive prediction. The total number of data points predicted to be Class 1 either stays the same or increases.

The converse happens if we raise the threshold. Consider setting $T=0.75$. Now, we require a higher predicted probability before we predict Class 1. The total number of data points predicted to be Class 1 decreases.

In [40]:
plot_predictions(0.75)

Thresholds and Performance Metrics¶

Note: We won't cover this section together during lecture, since the same material is covered in the slides.

How does changing the threshold impact our performance metrics?

Let's run an experiment: we'll test out several different possible thresholds.

  • For each threshold $T$, we'll make a decision rule where we classify any point with a predicted probability equal to or greater than $T$ as being in Class 1.

  • Otherwise, we'll predict Class 0.

  • We'll then compute the overall accuracy of the classifier when using that threshold.

In [41]:
# Define performance metrics dependent on the threshold value.
def predict_threshold(model, X, T): 
    prob_one = model.predict_proba(X)[:, 1]
    return (prob_one >= T).astype(int)

def accuracy_threshold(X, Y, T):
    return np.mean(predict_threshold(model, X, T) == Y)

def precision_threshold(X, Y, T):
    Y_hat = predict_threshold(model, X, T)
    denominator = np.sum(Y_hat == 1)
    if denominator == 0:
        denominator = np.nan
    return np.sum((Y_hat == 1) & (Y == 1)) / denominator
    
def recall_threshold(X, Y, T):
    Y_hat = predict_threshold(model, X, T)
    return np.sum((Y_hat == 1) & (Y == 1)) / np.sum(Y == 1)

def tpr_threshold(X, Y, T): # Same as recall
    Y_hat = predict_threshold(model, X, T)
    return np.sum((Y_hat == 1) & (Y == 1)) / np.sum(Y == 1)

def fpr_threshold(X, Y, T):
    Y_hat = predict_threshold(model, X, T)
    return np.sum((Y_hat == 1) & (Y == 0)) / np.sum(Y == 0)
In [42]:
metrics = pd.DataFrame()
metrics["Threshold"] = np.linspace(0, 1, 1000)
metrics["Accuracy"] = [accuracy_threshold(X, Y, t) for t in metrics["Threshold"]]
metrics["Precision"] = [precision_threshold(X, Y, t) for t in metrics["Threshold"]]
metrics["Recall"] = [recall_threshold(X, Y, t) for t in metrics["Threshold"]]
metrics.head()
Out[42]:
Threshold Accuracy Precision Recall
0 0.000000 0.49187 0.49187 1.0
1 0.001001 0.49187 0.49187 1.0
2 0.002002 0.49187 0.49187 1.0
3 0.003003 0.49187 0.49187 1.0
4 0.004004 0.49187 0.49187 1.0
In [43]:
fig = px.line(metrics, 
        x="Threshold", y="Accuracy",
        title="Accuracy vs. Threshold",
        render_mode="svg", width=600, height=600)
fig.update_layout(
  xaxis_title=dict(font=dict(size=22)),
  yaxis_title=dict(font=dict(size=22))
)
fig.show()

If we look at the threshold that maximizes accuracy, we find it is close to $T=0.49$.

In [44]:
# The threshold that maximizes accuracy.
metrics.sort_values("Accuracy", ascending=False).head()
Out[44]:
Threshold Accuracy Precision Recall
488 0.488488 0.798374 0.793103 0.798347
489 0.489489 0.798374 0.793103 0.798347
490 0.490490 0.798374 0.793103 0.798347
485 0.485485 0.797561 0.788026 0.804959
486 0.486486 0.797561 0.788026 0.804959

It turns out that setting $T=0.5$ does not always result in the best performance! Part of the model design process for classification includes choosing an appropriate threshold value.

Precision-Recall Curves¶

In the lecture, we noted that there is a tradeoff between precision and recall.

Precision $=\frac{TP}{\text{Positive Predictions}}=\frac{TP}{TP+FP}$ increases as the number of false positives decreases, which occurs as the threshold is raised, since raising the threshold tends to reduce the number of positive predictions.

Recall $=\frac{TP}{\text{Actual Class 1s}}=\frac{TP}{TP+FN}$ increases as the number of false negatives decreases, which occurs as the threshold is lowered, since lowering the threshold tends to decrease the number of negative predictions.

We want to keep both precision and recall high. To do so, we'll need to strategically choose a threshold value.

In [45]:
fig = px.line(metrics, 
        x="Threshold", y=["Accuracy", "Precision", "Recall"],
        title="Performance Metrics vs. Threshold",
        render_mode="svg", height=600, width=600)
fig.update_layout(
  xaxis_title=dict(font=dict(size=22)),
  yaxis_title=dict(font=dict(size=22))
)
fig.show()

A precision-recall curve tests out many possible thresholds. Each point on the curve represents the precision and recall of the classifier for a particular choice of threshold.

We choose a threshold value that keeps both precision and recall high (usually near the upper-right "corner" of the curve).

In [46]:
fig = px.line(metrics, x="Recall", y="Precision",
        title="Precision vs. Recall",
        width=600, height=600,
        render_mode="svg")
fig.update_layout(
  xaxis_title=dict(font=dict(size=22)),
  yaxis_title=dict(font=dict(size=22))
)
fig.show()

One way to balance precision and recall is to compute the F1 score. The F1 score is the harmonic mean of precision and recall:

$$F1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$

In [47]:
metrics["F1"] = (2 * metrics["Precision"] * metrics["Recall"] 
                     / (metrics["Precision"] + metrics["Recall"]))
ind = metrics['F1'].idxmax()
metrics.loc[ind,:]
Out[47]:
Threshold    0.482482
Accuracy     0.796748
Precision    0.784912
Recall       0.808264
F1           0.796417
Name: 482, dtype: float64
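For comparison (a small sketch, not part of the original notebook), scikit-learn's f1_score computes the F1 score of the default-threshold (0.5) predictions:

from sklearn.metrics import f1_score

# F1 score at the default 0.5 threshold, for comparison with the maximum above.
f1_score(Y, model.predict(X))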
In [48]:
fig = px.line(metrics, x="Threshold", y="F1",
              title="Finding F1 Score Maximum",
              render_mode="svg",
              height=600, width=600)

fig.add_scatter(x=[metrics.loc[ind, 'Threshold']], y=[metrics.loc[ind, 'F1']], 
                mode='markers', marker=dict(size=10, color='red'),
                name=f"F1 Max {metrics.loc[ind, 'Threshold']:.5f}",)

fig.update_layout(
  xaxis_title=dict(font=dict(size=22)),
  yaxis_title=dict(font=dict(size=22))
)
fig.show()
In [49]:
fig = px.line(metrics, x="Recall", y="Precision",
              title="Precision vs. Recall", width=600, height=600,
              render_mode="svg")
fig.add_scatter(x=[metrics.loc[ind, 'Recall']], y=[metrics.loc[ind, 'Precision']], 
                mode='markers', marker=dict(size=10, color='red'),
                name=f"F1 Max {metrics.loc[ind, 'Threshold']:.5f}")
fig.update_layout(legend=dict(x=.5, y=.1))
fig.update_layout(
  xaxis_title=dict(font=dict(size=22)),
  yaxis_title=dict(font=dict(size=22))
)
fig.show()

ROC Curves¶

We can repeat a similar experiment for the FPR and TPR. Remember that we want to keep FPR low and TPR high.

In [50]:
metrics["TPR"] = [tpr_threshold(X, Y, t) for t in metrics["Threshold"]]
metrics["FPR"] = [fpr_threshold(X, Y, t) for t in metrics["Threshold"]]
In [51]:
fig = px.line(metrics, x="Threshold", y=["TPR", "FPR", "Accuracy"],
        render_mode="svg", width=600, height=600)
fig.update_layout(
  xaxis_title=dict(font=dict(size=22)),
  yaxis_title=dict(font=dict(size=22))
)
fig.show()

A ROC curve tests many possible decision rule thresholds. For each possible threshold, it plots the corresponding TPR and FPR of the classifier.

"ROC" stands for "Receiver Operating Characteristic". It comes from the field of signal processing.

In [52]:
fig = px.line(metrics, x="FPR", y="TPR", title="ROC Curve", 
        width=600, height=600,
        render_mode="svg")
fig.update_layout(
  xaxis_title=dict(font=dict(size=22)),
  yaxis_title=dict(font=dict(size=22))
)
fig.show()

Ideally, a perfect classifier would have an FPR of 0 and a TPR of 1. The area under the perfect classifier's ROC curve is 1.

We often use the area under the ROC curve (abbreviated "AUC") as an indicator of model performance. The closer the AUC is to 1, the better.
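As an aside (a small sketch, not part of the original demo), scikit-learn can compute the AUC directly from the predicted probabilities:

from sklearn.metrics import roc_auc_score

# Area under the ROC curve, computed from the predicted P(Y=1|x) for each game.
roc_auc_score(Y, model.predict_proba(X)[:, 1])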

In [53]:
fig = px.line(metrics, x="FPR", y="TPR", title="ROC Curve", 
              width=600, height=600,
              render_mode="svg")
fig.add_scatter(x=[0,0,1], y=[0,1,1], mode='lines', 
                line_dash='dash', line_color='black',
                name="Perfect Classifier")
# move the legend inside the plot
fig.update_layout(legend=dict(x=.5, y=.1))
fig.update_layout(
  xaxis_title=dict(font=dict(size=22)),
  yaxis_title=dict(font=dict(size=22))
)
fig.show()