= [0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
flips flips
[0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
Up until now, we’ve focused on modeling regression problems - that is, predicting a numerical quantity from a given dataset. We explored the following workflow in the context of linear regression:
In this lecture, we will look at an entirely new modeling problem in classification, which involves predicting a categorical variable. We will discuss how to modify the steps in our existing modeling framework to meet the needs of this new machine learning problem.
You may have seen this diagram which outlines the taxonomy of machine learning.
Regression and classification are both supervised learning problems, meaning they consist of models that learn from data where the \(y\) observations are known. You’ve likely seen examples of classification models before - for example, the k-nearest neighbors algorithm from Data 8. This lecture will focus on another such model known as Logistic Regression.
More generally, a classification problem aims to categorize data. Today, we will focus on binary classification, a classification problem where data may only belong to one of two groups. One such example is the result of a coin toss. Many of the principles we’ll learn today can extended to three or more groups - called multiclass classification. In the next few lectures, we will learn about models designed specifically for these problems.
To build some intuition for logistic regression, let’s look at an introductory example to classification: the coin flip. Suppose we observe some outcomes of a coin flip (1 = Heads, 0 = Tails).
= [0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
flips flips
[0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
For the next flip, would you predict heads or tails?
A reasonable model is to assume all flips are IID (independent and identitically distributed). In other words, each flip has the same probability of returning a 1 (or heads). Let’s define a parameter \(\theta\), the probability that the next flip is a heads. We will use this parameter to inform our decision for \(\hat y\), or our 0 – 1 prediction of the next flip. If \(\theta \ge 0.5, \hat y = 1, \text{else } \hat y = 0\).
You may be inclined to say \(0.5\) is the best choice for \(\theta\). However, notice that we made no assumption about the coin itself. The coin may be biased, so we should make our decision based only on the data. We know that exactly \(\frac{4}{10}\) of the flips were heads, so we should think \(\hat \theta = 0.4\). In the next section, we will mathematically prove why this is the best possible estimate.
Let’s call the result of the coin flip a random variable \(Y\). This is a Bernoulli random variable with two outcomes. \(Y\) has the following distribution:
\[P(Y = y) = \begin{cases} 1, \text{with probability } \theta\\ 0, \text{with probability } 1 - \theta \end{cases} \]
\(\theta\) is unknown to us. But we can find the \(\theta\) that makes the data we observed the most likely, by looking towards the likelihood of the data. The likelihood is proportional to the probability of observing the data.
The probability of observing 4 Heads and 6 Tails follows the Binomial Distribution.
\[\binom{10}{4} (\theta)^4 (1-\theta)^6\]
The likelihood is proportional to the probability above. To find it, simply multiply the probabilities of obtaining each coin flip.
\[(\theta)^{4} (1-\theta)^6\]
The technique known as maximum likelihood estimation finds the \(\theta\) that maximizes the above likelihood. You can find this maximum by taking the derivative of the likelihood, but we’ll provide a more intuitive graphical solution.
import numpy as np
import matplotlib.pyplot as plt
= np.linspace(0, 1, 100)
theta **4 * (1-theta)**6); plt.plot(theta, theta
You can see this function is maximized at the value \(0.4\). Thus, \(\hat\theta = 0.4\). We will revisit these methods in our derivation of the logistic regression model.
What if instead of a 0-1 prediction response, the model returns the probability of data belonging to binary classes? - If the model’s predicted probability is greater than 0.5, classify as class 1. - If the model’s predicted probability is less than 0.5, classify as class 0.
The model still uses the concept of classification activation as a way to aggregate multiple features using a weighted combination of features.
The probabilistic classifier learns to generate the following probabilities and with the decision threshold of 0.5, classifies datapoints into two categories:
Odds is defined as the ration of the probability of an event happening (\(p\)) vs. it not happening (\(1-p\)). - “9 to 1 against” means \(p = 0.1\) - “Even odds” means \(p = 0.5\) - “3 to 1 on” means \(p = 0.75\)
\[\text{odds}(p) = \frac{p}{1 - p}\]
Domain of odds function: \([0, 1)\)
Range of odds function: \([0, +\infty)\)
Thus, the odds function maps values on \([0, 1)\) to \([1, \infty)\).
Our end goal is to use a classification activation to output a probability value based on some unbounded real number. Currently, the odds ratio takes a probability value and outputs a positive real number. If we further transform the odds function to obtain a function that maps a probability to an unbouded real number, the inverse of that function will be our desired result.
Thus, we first need a transformation takes the ouput of the odds function and maps it to an unbounded real number (i.e. maps \([0, \infty)\) to \((-\infty, \infty)\)).
Taking the log of a positive real number creates an unbounded real number!
Applying this transformation to the odds ratio gives us the Logit Function:
\[\text{log-odds}(p) = \text{log}(\frac{p}{1-p})\]
\[\text{log}(\frac{p}{1-p}) = z\] \[\frac{p}{1-p} = e^{z}\] \[p = e^{z} - pe^{z}\] \[p + pe^{z} = e^{z}\] \[(1 + e^{z})p = e^{z}\] \[p = \frac{e^{z}}{1 + e^{z}}\] \[p = \frac{1}{1 + e^{-z}}\]
While it may look daunting, the logistic regression model is just a composition of two simpler functions. One of these we know well from OLS – the linear model \(x^T\theta\). The second is the logistic function – commonly referred to as the sigmoid function.
\[\sigma(t) = \frac{1}{1 + e^{-t}}\]
This logistic function has various properties, although we won’t prove any of them.
You’ll notice the logistic regression model is just the logistic function evaluated on the linear model \(x^T\theta\). The logistic function squashes the result of \(x^T\theta\) to a probability between \(0\) and \(1\). Larger values of \(x^T\theta\) will be mapped to probabilities closer to \(1\), and vice versa. Putting it all together, we have:
\[P(Y = 1|x) = \sigma(x^T\theta) = \frac{1}{1 + e^{-x^T\theta}}\]
The figure below summarizes the logistic regression process:
Suppose we want to predict the probability that a tumor is malignant, given it’s mean radius and mean smoothness. Assume that we fit a logistic regression model (with no intercept), and somehow estimate the optimal parameters: \(\hat \theta = [0.1, -0.5]\).
Say we encounter a new breast tumor with data: \(x = [15, 1]\). We want to answer two questions:
\[P(Y = 1|x) = \sigma(x^T\hat\theta)\] \[= \sigma(0.1 * 15 + (-0.5) * 1) = \sigma(1)\] \[= \frac{1}{1 + e^{-1}}\] \[\approx 0.7311\]
Because the tumor has a \(73\%\) chance of being malignant, we should predict \(\hat y = 1\).
In our derivation of logistic regression, we showed that \(\frac{p}{1-p} = e^{x^T\theta}\). Equivalently, \[\frac{P(Y=1|x)}{P(Y=0|x)} = e^{x^T\theta}\]
Imagine our linear component has just a single feature, along with an intercept term.
\[\frac{P(Y=1|x)}{P(Y=0|x)} = e^{\theta_0 + \theta_1 x}\]
What happens when you increase \(x\) by one unit?
Rememer, the odds ratio can be interpreted as the “number of successes for each failure.” When odds increase, the proportion of successes likewise increases.
Having derived the logistic regression model, the next step is to estimate the optimal parameters \(\hat\theta\).
In linear regression, we found these optimal parameters by minimizing a cost function. The same applies to logistic regression. Let’s begin by evaluating the MSE loss function on the logistic regression model.
\[L(\theta) = \frac{1}{n}\sum_{i=1}^{n} (y_i - \sigma(x_i^T\theta))^2\]
However, there are 3 flaws with this approach.
The MSE loss surface for logistic regression is non-convex. In the following example, you can see the function rises above the secant line, a clear violation of convexity.
Depending on the initialization points, gradient descent may find multiple non-optimal solutions.
A good loss function should penalize incorrect predictions; the MSE doesn’t do this. Consider the following scenario.
In fact, loss is always bounded \(< 1\).
The MSE loss function itself is conceptually questionable.
Why are we squaring the difference of a class label and probability?
This section will introduce a more intuitive loss function - cross-entropy loss. The cross-entropy loss is defined as
\[L(\theta) = -(y\text{ log}(p) + (1-y)\text{log}(1-p))\]
\[\text{where } p = \sigma(x^T\theta)\]
The average risk over an entire dataset is
\[R(\theta) = -\frac{1}{n} \sum_{i=1}^{n}(y_i\text{ log}(p) + (1-y_i)\text{log}(1-p))\]
Cross-entropy loss addresses the 3 pitfalls of squared loss.
To prove that cross-entropy loss is always convex, we’ll need to take its derivative, which becomes difficult and out of scope. Instead, in Data 100, we will show a proof by picture.
Plotted below is the cross-entropy loss function applied to the same toy dataset as before. As you’ll notice, this is a convex function - any line connecting two points lies entirely above the function.
To understand why cross-entropy loss has strong penalization, we need to take a step back. Let’s decompose the loss function into two components.
\[L(\theta) = -(y\text{ log}(p) + (1-y)\text{log}(1-p))\]
To understand why cross-entropy loss makes a great intuitive loss function, we will look towards maximum likelihood estimation in the next section.
This section demonstrates an alternative approach to deriving the logistic regression model using the graph of averages. Our goal will be to use the mean radius of a tumor to predict whether a cancer is malignant (\(y = 1\)) or not (\(y = 0\)). Data from the study is plotted below, along with the overlayed least squares model that minimizes MSE.
Rather than using a linear model, what if used a non-linear model that was a better fit of our data, and bounded predictions between [\(0\), \(1\)]? We can look to the Data 8 Graph of Averages to formulate this model.
The Graph of Averages says
Here, \(x\) is mean radius
and \(y\) is malignant
. By the definition of an average, the prediction for a given \(x\) is thus:
\[\frac{\text{Sum of Nearby ``Malignant" Values}}{\text{Number of Points in Bin}}\]
You’ll notice that this is the proportion of malignant tumors in some point \(x\)’s bin. In a similar light, we can think of this proportion as the probability that point \(x\) belongs to the malignant class (derived from it’s nearest neighbors). That is, \(P(Y = 1|x)\). This notation reads “given a tumor with a mean radius of \(x\), it belongs to class 1 (equivalently, the malignant class) with some specified probability”.
You may recognize some similarities between \(P(Y = 1|x)\) and \(\hat \theta\) from our earlier coin toss example. \(P(Y = 1|x)\) is a proportion of malignant tumors in a bin, whereas \(\hat \theta\) was a proportion of heads in 10 coin tosses. Both also represent probabilities of some new point belonging to class \(1\).
Notice how the Graph of Averages is a much better fit of the data.
Unfortunately, the Graph of Averages begins to degenerate as we add more features. The exact reason is out of scope, but this model becomes harder to use in higher dimensions. Instead, we use Logistic Regression: a probabilistic model that tries to model the Graph of Averages.
As a recap of what we’ve discussed so far:
Our goal here is to find a mathematical function that models the Graph of Averages, or \(P(Y = 1|x)\). For shorthand, let’s call this function \(p\). We will
Note: \(P(Y = 1|x)\) and \(p\) are used interchangeably throughout the rest of the note.
To transform this “S” shaped curve, we will define a quantity called the odds ratio. The odds of an event is the probability the event happens divided by the probability that it doesn’t happen. Remember, \(p\) is shorthand for our event \(P(Y = 1|x)\).
\[\text{odds}(p) = \frac{p}{1 - p}\]
Here, we’ve applied the odds ratio to our Graph of Averages. The result is an exponential relationship.
Applying a logarithmic transformation to the y-axis will linearize the data. We know this from the Tukey-Mosteller Buldge Diagram.
\[\text{log-odds}(p) = \text{log}(\frac{p}{1-p})\]
We have a roughly linear relationship! Specifically, this is linear in our features \(x\) and parameters \(\theta\). In our one dimensional feature space, \(x\) is the mean radius
and \(\theta\) some constant.
\[\text{log}(\frac{p}{1-p}) = x^T\theta\]
Remember our original goal was to model the Graph of Averages, which we called \(p\). Here, we will algebraically solve for \(p\).
\[\text{log}(\frac{p}{1-p}) = x^T\theta\] \[\frac{p}{1-p} = e^{x^T\theta}\] \[p = e^{x^T\theta} - pe^{x^T\theta}\] \[p = \frac{e^{x^T\theta}}{1 + e^{x^T\theta}}\] \[p = \frac{1}{1 + e^{-x^T\theta}}\]
We once again end up modeling \(p\) with the sigmoid (logistic) function!