import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.offline as py
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
# import cufflinks as cf
# cf.set_config_file(offline=True, sharing=False, theme='ggplot');
from scipy.optimize import minimize
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
# Formatting options
# Big font helper
def adjust_fontsize(size=None):
    SMALL_SIZE = 8
    MEDIUM_SIZE = 10
    BIGGER_SIZE = 12
    if size is not None:
        SMALL_SIZE = MEDIUM_SIZE = BIGGER_SIZE = size
    plt.rcParams['font.size'] = SMALL_SIZE
    plt.rcParams['axes.titlesize'] = SMALL_SIZE
    plt.rcParams['axes.labelsize'] = MEDIUM_SIZE
    plt.rcParams['xtick.labelsize'] = SMALL_SIZE
    plt.rcParams['ytick.labelsize'] = SMALL_SIZE
    plt.rcParams['legend.fontsize'] = SMALL_SIZE
    plt.rcParams['figure.titlesize'] = BIGGER_SIZE
def savefig(fname):
    if not SAVE_FIGURES_FLAG:
        # Avoid memory overload
        return
    if not os.path.exists("images"):
        os.mkdir("images")
    fig = plt.gcf()
    fig.patch.set_alpha(0.0)
    plt.savefig(f"images/{fname}.png", bbox_inches='tight');
plt.rcParams['lines.linewidth'] = 3
plt.style.use('fivethirtyeight')
sns.set_context("talk")
sns.set_theme()
adjust_fontsize(20)
SAVE_FIGURES_FLAG = False
toy_df = pd.DataFrame({"x": [-1, -.75, -.5, -.25, .3, .4, 1, 1.2, 3],
"y": [ 0, 0, 0, 0, 1, 1, 1, 1, 1]})
toy_df.sort_values("x")
|  | x | y |
|---|---|---|
0 | -1.00 | 0 |
1 | -0.75 | 0 |
2 | -0.50 | 0 |
3 | -0.25 | 0 |
4 | 0.30 | 1 |
5 | 0.40 | 1 |
6 | 1.00 | 1 |
7 | 1.20 | 1 |
8 | 3.00 | 1 |
plt.scatter(data=toy_df, x='x', y='y', s=100);
plt.xlabel('Data')
plt.ylabel('Class');
The classification task includes two steps: first compute the classification activation $z = f_\theta(x)$ from the features, then apply a decision rule that maps $z$ to a class.
In the above example, if we assume that the only feature $x$ is used for classification and the associated weights for $f_\theta$ are $\theta_0 = 0$ and $\theta_1 = 1$, then $f_\theta(x) = z = x$.
If we choose a step function with a decision threshold of 0 (the sign function), we obtain the following classification rule:
$$ \text{classify}(x) = \begin{cases} 1, &\quad\text{if}\ \ f_\theta(x) \geq 0 \\ 0, &\quad\text{if}\ \ f_\theta(x) < 0 \end{cases}$$
color_condition = (((toy_df['x'] < 0) & (toy_df['y'] == 0)) |
                   ((toy_df['x'] >= 0) & (toy_df['y'] == 1)))
color = np.where(color_condition, 'g', 'r')  # green: classified correctly, red: misclassified
plt.scatter(data=toy_df, x='x', y='y', s=100, c=color)
plt.xlabel('Data')
plt.ylabel('Class')
plt.step([-3,0,4], [0, 1, 1], where='post', color='b')
plt.title("Binary Classification with Sign Function Decision Rule");
What if, instead of a 0-1 output, we predict the probability of each data point belonging to class 0 or class 1, and use a 50% probability threshold to classify data points? The classifier in the probabilistic model will be:
$$ \text{classify}(x) = \begin{cases} 1, &\quad\text{if}\ \ P_\theta(Y = 1 | x) \geq 0.5 \\ 0, &\quad\text{if}\ \ P_\theta(Y = 1 | x) < 0.5 \end{cases}$$
This approach also lets us evaluate the quality of the model based on how well its predicted probabilities match the observed labels, rather than only on the final 0/1 predictions.
Our goal in a probabilistic model is to train weights for a model that returns probability values $P_{\hat{\theta}}(Y = 1 | x)$. This is a conditional probability function that outputs a value in the range $[0,1]$ having seen the data point $x$.
We follow similar steps as before:
The function that we are interested in finding has one input, $z = x^T\theta$, and outputs a probability. So the domain of the function is $(-\infty,+\infty)$ and its range is $[0,1]$. What function has these properties? Let's find out.
The odds of an event is the probability of it happening divided by the probability of it not happening. If $p$ is the probability of the event happening, then:
$$\text{odds} = \frac{p}{1-p}$$
The odds function takes a probability as input, so its domain is $[0,1)$, while its range is $[0,+\infty)$. Let's see the shape of the odds function.
def odds(p):
    return p / (1 - p)
p = np.linspace(0, 0.99, 1000)
odds_p = odds(p)
fig = px.line(x=p, y=odds_p, title="Odds Ratio")
fig.update_xaxes(title="Probability")
fig.update_yaxes(title="Odds Ratio")
fig.show()
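As a quick worked example using the `odds` helper above: an even chance ($p = 0.5$) gives odds of 1, and $p = 0.75$ gives odds of 3, meaning the event is three times as likely to happen as not.
odds(0.5), odds(0.75)  # (1.0, 3.0)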
In the previous part, we saw how a probability can be mapped into the range of all non-negative numbers. Since the classification activation $z$ can take any value in $(-\infty, +\infty)$, we now want to derive a function that maps probabilities in $(0,1)$ to $(-\infty, +\infty)$ (the inverse of the function we are eventually looking for).
How about we take the $\log$ of the function we just derived? If you take the $\log$ of values in the range $(0, +\infty)$, here is what you will get:
$$ \log(x) \in \begin{cases} (-\infty, 0], &\quad\text{if}\ \ x \in (0,1] \\ (0, +\infty), &\quad\text{if}\ \ x \in (1, +\infty) \end{cases}$$
and hence:
$$\log\left(\frac{p}{1-p}\right) \in (-\infty, +\infty)$$
This function is called the logit function. Let's see what it looks like.
# Skip p = 0, where the odds are 0 and the log is undefined
logit = np.log(odds_p[1:])
fig = px.line(x=p[1:], y=logit, title="Logit function")
fig.update_xaxes(title="Probability")
fig.update_yaxes(title="Log of the odds ratio")
fig.show()
Since the logit function is a mapping from $(0,1)$ to $(-\infty, +\infty)$, its inverse is a function that maps a real number (in our classification task, the classification activation $z$) to a probability value in $(0,1)$.
Let's derive the inverse of the logit function:
$$
\begin{aligned}
z &= \log\left(\frac{p}{1-p}\right) \\
e^z &= \frac{p}{1-p} \\
(1-p)e^z &= p \\
e^z - pe^z &= p \\
e^z &= p + pe^z \\
e^z &= p(1+e^z) \\
p &= \frac{e^z}{1+e^z} \\
p &= \frac{1}{1+e^{-z}}
\end{aligned}
$$
This function is called the sigmoid or logistic function; it maps a real number in $(-\infty, +\infty)$ to a probability. Let's see what this function looks like.
def logistic(z):
    return 1 / (1 + np.exp(-z))
z = np.linspace(-20, 20, 1000)
sig = logistic(z)
fig = px.line(x=z, y=sig, title="Logistic function")
fig.update_xaxes(title="Classification Activation")
fig.update_yaxes(title="Logistic/Probability")
fig.show()
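As a quick numerical sanity check (a small sketch; `logit_fn` below is a throwaway helper, since the name `logit` above already holds an array), the logistic function undoes the logit:
def logit_fn(p):
    # log-odds, the inverse of the logistic function
    return np.log(p / (1 - p))
p_check = np.linspace(0.01, 0.99, 50)
np.allclose(logistic(logit_fn(p_check)), p_check)  # expect True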
Now let's see a few logistic functions with different parameters applied to our previous dataset.
def sigmoid_plot(theta_0, theta_1, c):
    z = np.linspace(-5, 5, 100)
    f = sns.lineplot(x=z, y=logistic(theta_0 + theta_1 * z),
                     color=c, linewidth=3,
                     label=r'$\hat{P}$' + r'$(Y=1|{0} + {1}*x)$'.format(theta_0, theta_1))
    plt.setp(f.get_legend().get_texts(), fontsize='10')
plt.figure(figsize=(6,6))
sns.scatterplot(data=toy_df, x='x', y='y', s=100)
plt.xlabel('x')
plt.ylabel('y: class')
plt.xlim((-5, 5))
sigmoid_plot(0, 1, 'green')
sigmoid_plot(1, 1, 'blue')
sigmoid_plot(0, 5, 'red')
For this lecture, we will use the same Wisconsin Breast Cancer Dataset from scikit-learn. This dataset consists of measurements from tumor biopsies for 569 patients, as well as whether each tumor was malignant or benign.
import sklearn.datasets
data_dict = sklearn.datasets.load_breast_cancer()
data = pd.DataFrame(data_dict['data'], columns=data_dict['feature_names'])
data
|  | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.30010 | 0.14710 | 0.2419 | 0.07871 | ... | 25.380 | 17.33 | 184.60 | 2019.0 | 0.16220 | 0.66560 | 0.7119 | 0.2654 | 0.4601 | 0.11890 |
1 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.08690 | 0.07017 | 0.1812 | 0.05667 | ... | 24.990 | 23.41 | 158.80 | 1956.0 | 0.12380 | 0.18660 | 0.2416 | 0.1860 | 0.2750 | 0.08902 |
2 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.19740 | 0.12790 | 0.2069 | 0.05999 | ... | 23.570 | 25.53 | 152.50 | 1709.0 | 0.14440 | 0.42450 | 0.4504 | 0.2430 | 0.3613 | 0.08758 |
3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.24140 | 0.10520 | 0.2597 | 0.09744 | ... | 14.910 | 26.50 | 98.87 | 567.7 | 0.20980 | 0.86630 | 0.6869 | 0.2575 | 0.6638 | 0.17300 |
4 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.19800 | 0.10430 | 0.1809 | 0.05883 | ... | 22.540 | 16.67 | 152.20 | 1575.0 | 0.13740 | 0.20500 | 0.4000 | 0.1625 | 0.2364 | 0.07678 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
564 | 21.56 | 22.39 | 142.00 | 1479.0 | 0.11100 | 0.11590 | 0.24390 | 0.13890 | 0.1726 | 0.05623 | ... | 25.450 | 26.40 | 166.10 | 2027.0 | 0.14100 | 0.21130 | 0.4107 | 0.2216 | 0.2060 | 0.07115 |
565 | 20.13 | 28.25 | 131.20 | 1261.0 | 0.09780 | 0.10340 | 0.14400 | 0.09791 | 0.1752 | 0.05533 | ... | 23.690 | 38.25 | 155.00 | 1731.0 | 0.11660 | 0.19220 | 0.3215 | 0.1628 | 0.2572 | 0.06637 |
566 | 16.60 | 28.08 | 108.30 | 858.1 | 0.08455 | 0.10230 | 0.09251 | 0.05302 | 0.1590 | 0.05648 | ... | 18.980 | 34.12 | 126.70 | 1124.0 | 0.11390 | 0.30940 | 0.3403 | 0.1418 | 0.2218 | 0.07820 |
567 | 20.60 | 29.33 | 140.10 | 1265.0 | 0.11780 | 0.27700 | 0.35140 | 0.15200 | 0.2397 | 0.07016 | ... | 25.740 | 39.42 | 184.60 | 1821.0 | 0.16500 | 0.86810 | 0.9387 | 0.2650 | 0.4087 | 0.12400 |
568 | 7.76 | 24.54 | 47.92 | 181.0 | 0.05263 | 0.04362 | 0.00000 | 0.00000 | 0.1587 | 0.05884 | ... | 9.456 | 30.37 | 59.16 | 268.6 | 0.08996 | 0.06444 | 0.0000 | 0.0000 | 0.2871 | 0.07039 |
569 rows × 30 columns
The prediction task for this data is to predict whether a tumor is benign or malignant (a binary decision) given characteristics of that tumor. As a classic machine learning dataset, the prediction task is captured by `data_dict['target']`, where 0 is malignant and 1 is benign. The line `data['malignant'] = (data_dict['target'] == 0).astype(int)` puts the data back in its original context: the new column is 1 if the tumor is malignant and 0 if it is benign (reversing the definition of `target`).
# Target data_dict['target'] = 0 is malignant 1 is benign
data['malignant'] = (data_dict['target'] == 0).astype(int)
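Before modeling, it is worth checking the class balance (a quick sketch; nothing below depends on it):
# How many tumors fall in each class?
data['malignant'].value_counts()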
What features might be a good indication of whether a tumor is benign or malignant?
data.columns
Index(['mean radius', 'mean texture', 'mean perimeter', 'mean area', 'mean smoothness', 'mean compactness', 'mean concavity', 'mean concave points', 'mean symmetry', 'mean fractal dimension', 'radius error', 'texture error', 'perimeter error', 'area error', 'smoothness error', 'compactness error', 'concavity error', 'concave points error', 'symmetry error', 'fractal dimension error', 'worst radius', 'worst texture', 'worst perimeter', 'worst area', 'worst smoothness', 'worst compactness', 'worst concavity', 'worst concave points', 'worst symmetry', 'worst fractal dimension', 'malignant'], dtype='object')
data[['mean radius', 'malignant']]
|  | mean radius | malignant |
|---|---|---|
0 | 17.99 | 1 |
1 | 20.57 | 1 |
2 | 19.69 | 1 |
3 | 11.42 | 1 |
4 | 20.29 | 1 |
... | ... | ... |
564 | 21.56 | 1 |
565 | 20.13 | 1 |
566 | 16.60 | 1 |
567 | 20.60 | 1 |
568 | 7.76 | 0 |
569 rows × 2 columns
Perhaps a good starting point is the size of the tumor. Larger tumors are probably more likely to be malignant. In the following, we plot whether the tumor was malignant (1 or 0) against the `mean radius`.
sns.jointplot(data = data, x = "mean radius", y = "malignant");
This is a clear example of over-plotting. We can improve the above plot by jittering the data:
# Manual to allow for jitter
g = sns.JointGrid(data = data, x = "mean radius", y = "malignant")
g.plot_marginals(sns.histplot)
g.plot_joint(sns.stripplot,
orient='h', order=[1, 0],
color=sns.color_palette()[0])
(g.ax_joint).set_xticks([10, 15, 20, 25])
plt.show()
Perhaps a better way to visualize the data is using stacked histograms.
sns.histplot(data = data, x = "mean radius",
hue = "malignant",
binwidth=0.5,
kde=True);
plt.legend(labels=["Malignant", "Benign"]);
Question: Looking at the above histograms, could you describe a rule to predict whether or not a tumor is malignant?
from sklearn.model_selection import train_test_split
data_tr, data_te = train_test_split(data, test_size=0.10, random_state=42)
data_tr.reset_index(inplace=True, drop=True)
data_te.reset_index(inplace=True, drop=True)
print("Training Data Size: ", len(data_tr))
print("Test Data Size: ", len(data_te))
Training Data Size:  512
Test Data Size:  57
Creating the `X` and `Y` matrices for the training data:
# X, Y are training data
X = data_tr[['mean radius']].to_numpy()
Y = data_tr['malignant'].to_numpy()
What's the best model that fits our data? We will discuss this in the next section.
The `linear_model.LogisticRegression` model (documentation) is what we want to use here. In order to recreate our specific model, there are a few parameters we need to set:
- `penalty = 'none'`: by default, `LogisticRegression` uses regularization. This is generally a good idea, which we'll discuss later.
- `fit_intercept = True`: our toy model currently has an intercept term.
- `solver = 'lbfgs'`: we need to specify a numerical optimization routine for the model (similar to gradient descent). `lbfgs` is one such routine; it's the new default in `scikit-learn`.
We'll fit a model with the mean radius feature and a bias intercept. So $\theta = (\theta_0, \theta_1)$ and our model is:
$$\hat{P}_{\theta} (Y = 1 | x) = \sigma(x^T\theta) = \sigma(\theta_0 + \theta_1 x_1)$$
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(fit_intercept=True)
model.fit(X, Y); # X, Y are training data
Optimal $\theta_0, \theta_1$ from fitting our model:
model.intercept_, model.coef_
(array([-14.42394402]), array([[0.97889232]]))
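If you want to spell out the settings listed above explicitly (no regularization, explicit solver), a sketch of the call might look like the following; note that recent `scikit-learn` versions accept `penalty=None` in place of the older string `'none'`. The parameter estimates above come from the default (regularized) fit, so this variant may give slightly different values.
# Sketch: explicitly unregularized fit with the settings described above
model_unreg = LogisticRegression(penalty=None, fit_intercept=True, solver='lbfgs')
model_unreg.fit(X, Y)
model_unreg.intercept_, model_unreg.coef_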
plt.figure(figsize=(6,6))
# Getting model parameters
theta_0 = model.intercept_[0]
theta_1 = model.coef_[0]
# Getting model predictions
y_pred = np.round(logistic(theta_0 + theta_1 * data_tr['mean radius']))
# Setting colors based on correct and incorrect y_pred
color_palette = {False:'red', True:'green'}
sns.stripplot(x=data_tr['mean radius'].squeeze(), y=data_tr['malignant'],
jitter = 0.1, orient='h', hue= y_pred==data_tr['malignant'], palette = color_palette);
plt.gca().invert_yaxis()
plt.xlabel('x: mean radius')
plt.ylabel('y: class')
plt.xlim((7, 30))
# Plotting the sigmoid function with the given parameters
z = np.linspace(5, 30, 100)
f = sns.lineplot(x= z, y=logistic(theta_0 + theta_1 * z),
color='blue', linewidth=3,
label= r'$\hat{P}$' + r'$(Y=1|{0} + {1}*x)$'.format(theta_0,theta_1))
plt.setp(f.get_legend().get_texts(), fontsize='10');
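As a quick follow-up sketch, we can compute the fraction of training points this 0.5-threshold rule classifies correctly, using the `y_pred` computed in the cell above:
# Training accuracy of the thresholded predictions
(y_pred == data_tr['malignant']).mean()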
Predict probabilities: `scikit-learn` has a built-in `.predict_proba` method that allows us to get the predicted probabilities under our model. The `.classes_` attribute stores our class labels.
model.predict_proba([[20]])
array([[0.00574364, 0.99425636]])
model.classes_
array([0, 1])
Let's visualize these probabilities on our training data, `X` and `Y`.
Prob_hat_one = model.predict_proba(X)[:, 1]
Prob_hat_one.shape
(512,)
The seaborn function `stripplot` auto-plots jittered data by class.
plt.figure(figsize=(6,6))
sns.stripplot(x=X.squeeze(), y=Y,
jitter = 0.1, orient='h');
sns.lineplot(x= X.squeeze(), y=Prob_hat_one,
color='green', linewidth=5, label=r'$\hat{P}_{\theta}(Y = 1 | x)$')
plt.gca().invert_yaxis()
plt.xlabel('x: mean radius')
plt.ylabel('y: class')
Text(0, 0.5, 'y: class')
Predict class labels: By comparison, what does `.predict()` do?
It predicts 1 if the computed probability for class 1 is at least 0.5, and 0 otherwise.
$$\text{classify}(x) = \begin{cases} 1, & P(Y = 1 | x) \geq 0.5 \\ 0, & \text{otherwise} \end{cases}$$
This is what we call classification.
Edit the above cells to see sklearn's prediction for a mean radius of `10`.
model.predict([[20]])
array([1])
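As a sanity-check sketch, `.predict` should agree with thresholding the class-1 probability from `.predict_proba` at 0.5:
# Manually apply the 0.5 threshold and compare with model.predict
manual_preds = (model.predict_proba(X)[:, 1] >= 0.5).astype(int)
np.array_equal(manual_preds, model.predict(X))  # expect True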
We've chosen a model. It's now time to choose a loss function. Why not squared loss?
def mse_loss_train_nobias(theta):
    x = data_tr['mean radius']
    y_obs = data_tr['malignant']
    y_hat = logistic(x * theta)
    return np.mean((y_hat - y_obs) ** 2)
thetas = np.linspace(-10, 10, 100)
plt.plot(thetas, [mse_loss_train_nobias(theta) for theta in thetas])
plt.ylabel('MSE')
plt.xlabel(r'$\theta$');
toy_df = pd.DataFrame({
"x": [-4, -2, -0.5, 1, 3, 5],
"y": [0, 0, 1, 0, 1, 1]
})
toy_df.sort_values("x")
|  | x | y |
|---|---|---|
0 | -4.0 | 0 |
1 | -2.0 | 0 |
2 | -0.5 | 1 |
3 | 1.0 | 0 |
4 | 3.0 | 1 |
5 | 5.0 | 1 |
sns.scatterplot(data=toy_df, x='x', y='y', s=100)
plt.title("Toy classification data")
Text(0.5, 1.0, 'Toy classification data')
Let's plot the loss surface for this toy data using squared loss with the model $\hat{p} = \sigma(\theta x)$. We don't include an intercept term, so $\theta$ and $x$ are both scalars.
def mse_loss_toy_nobias(theta):
    p_hat = logistic(toy_df['x'] * theta)
    return np.mean((toy_df['y'] - p_hat)**2)
thetas = np.linspace(-10, 10, 100)
plt.plot(thetas, [mse_loss_toy_nobias(theta) for theta in thetas])
plt.title("MSE on toy classification data");
plt.xlabel(r'$\theta$')
plt.ylabel('MSE')
Text(0, 0.5, 'MSE')
This loss surface is not convex! Depending on where we start our optimization search, we'll end up with different results. Let's explore with `scipy.optimize.minimize`.
best_theta = minimize(mse_loss_toy_nobias, x0 = 0)["x"][0]
best_theta
0.5446601825581691
Remember, our 1-D model with no intercept is $\hat{p} = \sigma(\theta x)$.
sns.scatterplot(data=toy_df, x='x', y='y', s=100, label='y')
xs = np.linspace(-10, 10, 100)
plt.plot(xs, logistic(xs * best_theta), color='orange', label=r'$\sigma(x^T \theta)$')
plt.xlabel('x')
plt.legend()
plt.title(r'$\hat{\theta} = $' + f"{best_theta:.4}");
Let's try another starting point for minimizing theta.
best_theta_2 = minimize(mse_loss_toy_nobias, x0 = -5)["x"][0]
best_theta_2
-10.343653061026611
Uhhh, looks like the optimizer got stuck.
sns.scatterplot(data=toy_df, x='x', y='y', s=100, label='y')
xs = np.linspace(-10, 10, 100)
plt.plot(xs, logistic(xs * best_theta_2), color='orange', label=r'$\sigma(x^T \theta)$')
plt.xlabel('x')
plt.legend()
plt.title(r'$\hat{\theta} = $' + f"{best_theta_2:.4}")
Text(0.5, 1.0, '$\\hat{\\theta} = $-10.34')
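To confirm that the second starting point landed somewhere worse, we can compare the mean squared loss at the two solutions (a quick sketch reusing the helper above); the first value should be noticeably smaller.
# MSE at the two candidate minimizers found above
mse_loss_toy_nobias(best_theta), mse_loss_toy_nobias(best_theta_2)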
Not only is it not convex, leading to the weird issues above, but squared loss just isn't well-suited for a probability task. Since $\hat{p}_i$ is between 0 and 1, and $y_i$ is either 0 or 1, the squared loss for a single point $(y_i - \hat{p}_i)^2$ is bounded between 0 and 1.
What this means in practice: even if our prediction is terrible, the squared loss is never that large.
p_hat = np.arange(0.001, 0.999, 0.01)
loss = (1 - p_hat)**2
plt.plot(p_hat, loss, color='k')
plt.xlabel(r'$\sigma({x^T \theta})$')
plt.ylabel(r'$(1 - \hat{y})^2$')
plt.title('Squared Loss for One Individual');
Let's look at a new loss, called the log loss, for when our true observation is 1.
p_hat = np.arange(0.001, 0.999, 0.01)
loss = -np.log(p_hat)
plt.plot(p_hat, loss, color='k')
plt.xlabel('$p$: Probability that y is 1')
plt.ylabel(r'$-\log(p)$')
plt.title('Cross-Entropy Loss for one observation when $y = 1$');
We can see that this penalizes wrong predictions far more than squared loss does.
How to read this plot: Suppose the observation we're trying to predict is actually in class 1. If our model gives an 80% chance of being in class 1, the loss is relatively small (around 0.25).
If we give only a 40% chance of being in class 1, the loss is larger (around 1).
If we give only a 5% chance of being in class 1, the loss is 3.
And if we give a 0% chance of being in class 1, the loss is infinite.
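These numbers come straight from $-\log(p)$; a quick sketch to check them:
# -log(p) at the probabilities mentioned above
[-np.log(p) for p in [0.8, 0.4, 0.05]]  # roughly [0.22, 0.92, 3.0]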
What about when the true observation is 0?
p_hat = np.arange(0.001, 0.999, 0.01)
loss = -np.log(1 - p_hat)
plt.plot(p_hat, loss, color='k')
plt.xlabel('$p$: Probability that y is 1')
plt.ylabel(r'$-\log(1 - p)$')
plt.title('Cross-Entropy Loss for one observation when $y = 0$');
Much of the formal derivation is in the slides. But the equation for cross-entropy loss for a single observation is
$$\textrm{loss} = -y \log(\hat{y}) - (1-y)\log(1-\hat{y})$$
For us, since $\hat{y} = \sigma(x^T \theta)$, the expression for average cross-entropy loss is
$$R(\theta) = -\frac{1}{n} \sum_{i = 1}^n \big(y_i \log (\sigma(\mathbb{X}_i^T \theta)) + (1 - y_i) \log (1 - \sigma(\mathbb{X}_i^T \theta))\big)$$
Let's look at the loss surface for average cross-entropy loss, on our toy data from before.
def cross_entropy(y, phat):
    return -y * np.log(phat) - (1 - y) * np.log(1 - phat)
def mce_loss_toy_nobias(theta):
    p_hat = logistic(toy_df['x'] * theta)
    return np.mean(cross_entropy(toy_df['y'], p_hat))
thetas = np.linspace(-4, 4, 100)
plt.plot(thetas, [mce_loss_toy_nobias(theta) for theta in thetas], color = 'green')
plt.ylabel(r'Mean Cross-Entropy($\theta$)')
plt.xlabel(r'$\theta$');
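Unlike the squared-loss surface, this one is convex, so the optimizer should reach (approximately) the same minimizer regardless of where it starts. A quick sketch, reusing `minimize` and the helper above:
# Different starting points should all converge to essentially the same theta
[minimize(mce_loss_toy_nobias, x0=start)["x"][0] for start in (-5, 0, 5)]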