Lecture 24 Supplemental Notebook¶

Data 100, Spring 2023

Acknowledgments Page

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import plotly.offline as py
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff

from scipy.optimize import minimize

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
In [2]:
# Formatting options

# Big font helper
def adjust_fontsize(size=None):
    SMALL_SIZE = 8
    MEDIUM_SIZE = 10
    BIGGER_SIZE = 12
    if size is not None:
        SMALL_SIZE = MEDIUM_SIZE = BIGGER_SIZE = size
    plt.rcParams['font.size'] = SMALL_SIZE
    plt.rcParams['axes.titlesize'] = SMALL_SIZE
    plt.rcParams['axes.labelsize'] = MEDIUM_SIZE
    plt.rcParams['xtick.labelsize'] = SMALL_SIZE
    plt.rcParams['ytick.labelsize'] = SMALL_SIZE
    plt.rcParams['legend.fontsize'] = SMALL_SIZE
    plt.rcParams['figure.titlesize'] = BIGGER_SIZE
    # plt.rc('font', size=SMALL_SIZE)          # controls default text sizes
    # plt.rc('axes', titlesize=SMALL_SIZE)     # fontsize of the axes title
    # plt.rc('axes', labelsize=MEDIUM_SIZE)    # fontsize of the x and y labels
    # plt.rc('xtick', labelsize=SMALL_SIZE)    # fontsize of the tick labels
    # plt.rc('ytick', labelsize=SMALL_SIZE)    # fontsize of the tick labels
    # plt.rc('legend', fontsize=SMALL_SIZE)    # legend fontsize
    # plt.rc('figure', titlesize=BIGGER_SIZE)  # fontsize of the figure title
    
def savefig(fname):
    if not SAVE_FIGURES_FLAG:
        # Avoid memory overload
        return
    
    if not os.path.exists("images"):
        os.mkdir("images")
    fig = plt.gcf()
    fig.patch.set_alpha(0.0)
    plt.savefig(f"images/{fname}.png", bbox_inches = 'tight');
    
plt.rcParams['lines.linewidth'] = 3

plt.style.use('fivethirtyeight')
sns.set_context("talk")
sns.set_theme()
#sns.set()
adjust_fontsize(20)


SAVE_FIGURES_FLAG = False

(Notebook setup) Obtaining the Data¶

For this lecture, we will use the Wisconsin Breast Cancer Dataset from scikit-learn. This is the same dataset we used in Lecture 23, so we'll skip the EDA.

Classification task: Given the mean radius of tumor cells in an image, predict if the tumor is malignant (1) or benign (0).

In [3]:
import sklearn.datasets

# Load the data
data_dict = sklearn.datasets.load_breast_cancer()
data = pd.DataFrame(data_dict['data'], columns=data_dict['feature_names'])
# In data_dict['target'], 0 is malignant and 1 is benign
data['malignant'] = (data_dict['target'] == 0).astype(int)

# Split the data
from sklearn.model_selection import train_test_split
data_tr, data_te = train_test_split(data, test_size=0.10, random_state=42)
data_tr.reset_index(inplace=True, drop=True)
data_te.reset_index(inplace=True, drop=True)
print("Training Data Size: ", len(data_tr))
print("Test Data Size: ", len(data_te))

# X, Y are training data
X = data_tr[['mean radius']].to_numpy()
Y = data_tr['malignant'].to_numpy()
Training Data Size:  512
Test Data Size:  57
In [4]:
# Build the joint plot manually with stripplot so we can jitter the points
g = sns.JointGrid(data = data, x = "mean radius", y = "malignant", height=4)
g.plot_marginals(sns.histplot)
g.plot_joint(sns.stripplot,
             orient='h', order=[1, 0],
             color=sns.color_palette()[0])
(g.ax_joint).set_xticks([10, 15, 20, 25])
plt.show()



sklearn¶

The linear_model.LogisticRegression model (documentation) is what we want to use here. In order to recreate our specific model, there are a few parameters we need to set:

  • penalty = 'none': by default, LogisticRegression applies L2 regularization; this setting turns it off. Regularization is generally a good idea, which we'll discuss later.
  • fit_intercept = False: our toy model does not currently have an intercept term.
  • solver = 'lbfgs': we need to specify a numerical optimization routine for the model (similar to gradient descent). lbfgs is one such routine, and it is the default solver in recent versions of scikit-learn. A minimal instantiation with these settings is sketched below.
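
For reference, here is a minimal sketch of instantiating the model with these settings (this is a sketch only; the model actually fitted in the next section keeps the default penalty and uses fit_intercept=True, since it includes an intercept):

from sklearn.linear_model import LogisticRegression

# Sketch only: the no-regularization, no-intercept toy model described above.
# In scikit-learn >= 1.2, pass penalty=None instead of penalty='none'.
toy_clf = LogisticRegression(penalty='none', fit_intercept=False, solver='lbfgs')
# toy_clf.fit(X, Y) would then fit it on the training data defined above.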



Fit¶

We'll fit a model with the mean radius feature and an intercept (bias) term. So $\theta = (\theta_0, \theta_1)$ and our model is:

$$\hat{P}_{\theta} (Y = 1 | x) = \sigma(x^T\theta) = \sigma(\theta_0 + \theta_1 x_1)$$
In [5]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(fit_intercept=True)
model.fit(X, Y); # X, Y are training data


Optimal $\theta_0, \theta_1$ from fitting our model:

In [6]:
model.intercept_, model.coef_
Out[6]:
(array([-14.42394402]), array([[0.97889232]]))



Prediction¶


Predict probabilities: scikit-learn has a built-in .predict_proba method that returns the predicted probability of each class under our model; its columns are ordered according to the .classes_ attribute, which stores our class labels.

In [7]:
model.predict_proba([[20]])
Out[7]:
array([[0.00574364, 0.99425636]])
In [8]:
model.classes_
Out[8]:
array([0, 1])
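
As a sanity check, here's a small sketch (not from lecture) that reproduces the class-1 probability for a mean radius of 20 by plugging the fitted intercept and coefficient into the sigmoid directly:

# sigma(theta_0 + theta_1 * x) for x = 20; this should match
# model.predict_proba([[20]])[:, 1] above (about 0.9943).
z = model.intercept_[0] + model.coef_[0, 0] * 20
print(1 / (1 + np.exp(-z)))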



Let's visualize these probabilities on our training data, X and Y.

In [9]:
Prob_hat_one = model.predict_proba(X)[:, 1]
print(Prob_hat_one.shape)
(512,)
In [10]:
predictions = model.predict(X)
print(predictions.shape)
(512,)

The seaborn function stripplot automatically jitters the points within each class.

In [11]:
plt.figure(figsize=(6,6))

# Getting model predictions
y_pred = model.predict(X)

# Setting colors based on whether each prediction in y_pred is correct
color_palette = {False:'red', True:'green'}
sns.stripplot(x=X.squeeze(), y=data_tr['malignant'], 
              jitter = 0.1, orient='h', hue= y_pred==data_tr['malignant'], 
              palette = color_palette);

plt.gca().invert_yaxis()
plt.xlabel('x: mean radius')
plt.ylabel('y: class')
plt.xlim((7, 30))

# Plotting the sigmoid function with the fitted parameters
f = sns.lineplot(x= X.squeeze(), y=model.predict_proba(X)[:,1],
                 color='blue', linewidth=3,
                 label= r'$\hat{P}$' + r'$(Y=1|{0:.2f} + {1:.2f}x)$'.format(model.intercept_[0], model.coef_[0, 0]))
plt.setp(f.get_legend().get_texts(), fontsize='10');


Predict class labels: By comparison, what does .predict() do?

It predicts 1 if the computed probability for class 1 is greater than 0.5, and 0 otherwise.

$$\text{classify}(x) = \begin{cases} 1, & P(Y = 1 | x) \geq 0.5 \\ 0, & \text{otherwise} \end{cases}$$

This is what we call classification.

Edit the above cells to see sklearn's prediction for a mean radius of 10.

In [12]:
model.predict([[20]])
Out[12]:
array([1])
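
As a quick check (a sketch, not from lecture), .predict() should agree with thresholding the class-1 probability from .predict_proba() at 0.5:

# Manually apply the 0.5 threshold to the class-1 probabilities and compare
# against sklearn's predictions; expect True.
manual_preds = (model.predict_proba(X)[:, 1] >= 0.5).astype(int)
print(np.array_equal(manual_preds, model.predict(X)))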



Let's build a DataFrame to store this information. We may need it later.

In [13]:
# In case you want to see all of the probabilities and predictions
def make_prediction_df(X, Y, model):
    # Assume X has one feature and that model is already fit 
    Prob = model.predict_proba(X)
    Y_hat = model.predict(X)
    df = pd.DataFrame({"X": X.squeeze(),
                       "Y": Y,
                       "P(Y = 1 | x)": Prob[:,1],
                       "Y_hat": Y_hat})
    return df
    
predict_train_df = make_prediction_df(X, Y, model)
predict_train_df
Out[13]:
X Y P(Y = 1 | x) Y_hat
0 25.220 1 0.999965 1
1 13.480 1 0.226448 0
2 11.290 0 0.033174 0
3 12.860 0 0.137598 0
4 19.690 1 0.992236 1
... ... ... ... ...
507 8.888 0 0.003257 0
508 11.640 0 0.046105 0
509 14.290 0 0.392796 0
510 13.980 1 0.323216 0
511 12.180 0 0.075786 0

512 rows × 4 columns
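
For example (a small sketch, not from lecture), the training accuracy can be read directly off this DataFrame and should agree with model.score:

# Fraction of training points where the predicted class matches the true class.
train_acc = (predict_train_df['Y'] == predict_train_df['Y_hat']).mean()
print(train_acc, model.score(X, Y))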




Linear Separability and the Need for Regularization¶

Suppose we had the following toy data:

In [14]:
toy_df = pd.DataFrame({"x": [-1, 1], "y": [1, 0]})
#plt.scatter(toy_df['x'], toy_df['y'], s=100);
sns.scatterplot(data=toy_df, x='x', y='y',
              s=100, legend=None);

Let's look at the mean cross-entropy loss surface for this toy dataset under a simplified single-feature logistic regression model of the form:

$$ \Large \hat{P}_{\theta}(Y = 1 | x) = \sigma(\theta_1 x) = \frac{1}{1 + e^{-\theta_1 x}} $$

With mean cross-entropy loss (the second line plugs in the two toy points, $(x_1, y_1) = (-1, 1)$ and $(x_2, y_2) = (1, 0)$):

\begin{align} \hat{\theta}_1 &= \underset{\theta_1}{\operatorname{argmin}} \; - \frac{1}{n} \sum_{i=1}^n \left( y_i \log (\sigma(\theta_1 x_i)) + (1 - y_i) \log (1 - \sigma(\theta_1 x_i)) \right) \\ &= \underset{\theta_1}{\operatorname{argmin}} \; -\frac{1}{2} \left[ \log (\sigma( - \theta_1 )) + \log(1 - \sigma(\theta_1))\right] \end{align}
In [15]:
def toy_model(theta1, x):
    return 1/(1 + np.exp(-theta1 * x))

def mean_cross_entropy_loss_toy(theta1):
    # Here we use 1 - sigma(z) = sigma(-z) to improve numerical stability
    # (np.mean, not np.sum, so this matches the mean cross-entropy above)
    return - np.mean(toy_df['y'] * np.log(toy_model(theta1, toy_df['x'])) + \
                     (1-toy_df['y']) * np.log(toy_model(theta1, -toy_df['x'])))
In [16]:
thetas = np.linspace(-30, 30, 100)
plt.plot(thetas, [mean_cross_entropy_loss_toy(theta) for theta in thetas], color = 'green')
plt.ylabel(r'MCE($\theta$)')
plt.xlabel(r'$\theta$');
plt.title("No regularization")
Out[16]:
Text(0.5, 1.0, 'No regularization')

But with L2 regularization added, the loss has a finite minimizer:

In [17]:
def mce_regularized_loss_single_arg_toy(theta, reg):
    return mean_cross_entropy_loss_toy(theta) + reg * theta**2 

def regularized_loss_toy(theta1, reg):
    return mean_cross_entropy_loss_toy(theta1) + reg * theta1**2

thetas = np.linspace(-30, 30, 100)
plt.plot(thetas, [regularized_loss_toy(theta, 0.1) for theta in thetas], color = 'green')
plt.ylabel(r'MCE($\theta$) + 0.1 $\theta^2$')
plt.xlabel(r'$\theta$');
plt.title(r"Mean Loss + L2 Regularization ($\lambda$ = 0.1)")
Out[17]:
Text(0.5, 1.0, 'Mean Loss + L2 Regularization ($\\lambda$ = 0.1)')
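
To see the same phenomenon numerically, here is a small sketch (not from lecture) using scipy.optimize.minimize, imported at the top of the notebook. On this linearly separable toy data the unregularized loss has no finite minimizer, so the optimizer wanders off to a large-magnitude negative theta, while the regularized loss has a finite optimum:

# Unregularized: the loss approaches but never reaches 0, so the optimizer
# stops only once the gradient is tiny, at a large-magnitude negative theta.
res_unreg = minimize(lambda t: mean_cross_entropy_loss_toy(t[0]), x0=[0.0])

# Regularized (lambda = 0.1): the penalty term makes the minimizer finite.
res_reg = minimize(lambda t: regularized_loss_toy(t[0], 0.1), x0=[0.0])

print("Unregularized theta_1:", res_unreg.x[0])
print("Regularized theta_1:  ", res_reg.x[0])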

Linearly separable plots¶

In case you were curious about how we generated the linearly separable plots for lecture.

In [18]:
# Change y_sep vs y_nosep
y_used = 'y_sep'

data_1d = pd.DataFrame(
    {"x": [-1, -.75, -.5, -.25, .3, .4, 1, 1.2, 3],
     "y_sep": [ 1,    1,   1,   1,   0,  0, 0, 0,   0],
     "y_nosep": [1,   0,   1,   0,   0,  0, 0, 0,   0]
    })

plt.figure(figsize=((6.1, 1.5)))
sns.scatterplot(data=data_1d, x='x', y=y_used,hue=y_used,
              s=100, edgecolor='k', linewidth=1, legend=None);
plt.ylim((-.3, 1.25))
plt.yticks([0, 1])
sns.rugplot(data=data_1d, x='x', hue=y_used, height = 0.1, legend=None, linewidth=4);
plt.ylabel("y");
In [19]:
iris = sns.load_dataset('iris')

plt.figure(figsize=(6, 4))
# Separable
sns.scatterplot(data = iris[iris['species'] != 'virginica'],
               x = 'petal_length',
               y = 'petal_width',
               hue = 'species', s=100);
plt.gca().legend_.set_title(None)

The setosa and versicolor classes above are linearly separable in these two features; versicolor and virginica below are not:

In [20]:
# Not separable
plt.figure(figsize=(6, 4))
sns.scatterplot(data = iris[iris['species'] != 'setosa'],
               x = 'petal_length',
               y = 'petal_width',
               palette=sns.color_palette()[1:3],
               hue = 'species', s=100);
plt.gca().legend_.set_title(None)

Regularization Demo¶

As an end-to-end demo of the model fitting process, let's fit a regularized LogisticRegression model on the iris data, with a train/test split.

Let's try to predict the species of an iris. But there are three possible values of species right now:

In [21]:
iris = sns.load_dataset('iris')
iris['species'].unique()
Out[21]:
array(['setosa', 'versicolor', 'virginica'], dtype=object)

So let's create a new column is_versicolor that is 1 if the iris is a versicolor, and 0 otherwise.

In [22]:
iris['is_versicolor'] = (iris['species'] == 'versicolor').astype(int)
In [23]:
iris
Out[23]:
sepal_length sepal_width petal_length petal_width species is_versicolor
0 5.1 3.5 1.4 0.2 setosa 0
1 4.9 3.0 1.4 0.2 setosa 0
2 4.7 3.2 1.3 0.2 setosa 0
3 4.6 3.1 1.5 0.2 setosa 0
4 5.0 3.6 1.4 0.2 setosa 0
... ... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 virginica 0
146 6.3 2.5 5.0 1.9 virginica 0
147 6.5 3.0 5.2 2.0 virginica 0
148 6.2 3.4 5.4 2.3 virginica 0
149 5.9 3.0 5.1 1.8 virginica 0

150 rows × 6 columns

In [24]:
cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
In [25]:
from sklearn.model_selection import train_test_split
In [26]:
# Pass random_state so the split is reproducible
# (seeding Python's random module does not affect scikit-learn's splitter)
iris_train, iris_test = train_test_split(iris, test_size = 0.2, random_state = 25)

First, let's look at the coefficients if we fit without regularization:

In [27]:
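# Note: in scikit-learn >= 1.2, pass penalty=None instead of penalty='none'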
iris_model_no_reg = LogisticRegression(penalty = 'none', solver = 'lbfgs')
iris_model_no_reg.fit(iris_train[cols], iris_train['is_versicolor'])
Out[27]:
LogisticRegression(penalty='none')
In [28]:
iris_model_no_reg.coef_
Out[28]:
array([[-0.07871677, -2.77933453,  1.20743341, -2.84432124]])
In [29]:
iris_model_no_reg.intercept_
Out[29]:
array([6.81212221])

Now let's fit with regularization, using the default value of C = 1.0 (in scikit-learn, C is the inverse of the regularization strength, so smaller C means stronger regularization):

In [30]:
iris_model_reg = LogisticRegression(penalty = 'l2', solver = 'lbfgs')
iris_model_reg.fit(iris_train[cols], iris_train['is_versicolor'])
Out[30]:
LogisticRegression()
In [31]:
iris_model_reg.coef_
Out[31]:
array([[-0.10033813, -1.94462039,  0.58803632, -1.18201488]])
In [32]:
iris_model_reg.intercept_
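
Since C is the inverse of the regularization strength, decreasing it shrinks the coefficients more aggressively. Here is a small sketch of that effect (the values of C are chosen for illustration, not from lecture):

# Compare coefficient magnitudes across a few illustrative values of C.
for C in [100, 1.0, 0.01]:
    m = LogisticRegression(C=C, solver='lbfgs', max_iter=10_000)
    m.fit(iris_train[cols], iris_train['is_versicolor'])
    print(f"C = {C:>6}: sum of |coefficients| = {np.abs(m.coef_).sum():.3f}")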