Lecture 26 - Fashion MNIST

Data 100, Spring 2023

Acknowledgments Page and UC Santa Cruz

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from ds100_utils import *

np.random.seed(23) #kallisti

plt.rcParams['figure.figsize'] = (4, 4)
plt.rcParams['figure.dpi'] = 150
sns.set()

Load the Fashion-MNIST dataset

We will be using the Fashion-MNIST dataset, which is a cool little dataset of grayscale 28x28 images of articles of clothing.

Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. Han Xiao, Kashif Rasul, Roland Vollgraf. arXiv:1708.07747 https://github.com/zalandoresearch/fashion-mnist

Load data

In [2]:
import fashion_mnist

(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
Using cached version that was downloaded (UTC): Thu Apr 20 00:00:28 2023
Using cached version that was downloaded (UTC): Thu Apr 20 00:00:28 2023
Using cached version that was downloaded (UTC): Thu Apr 20 00:00:28 2023
Using cached version that was downloaded (UTC): Thu Apr 20 00:00:28 2023

Truncate Dataset: For the purposes of this demo, we're going to randomly sample

  • 10,000 train datapoints, and
  • 1,000 test datapoints.
In [3]:
rng = np.random.default_rng(42)
n_train, n_test = 10000, 1000
train_samples = rng.choice(np.arange(len(train_images)), size=n_train, replace=False)
test_samples = rng.choice(np.arange(len(test_images)), size=n_test, replace=False)

train_images, train_labels = train_images[train_samples,:,:], train_labels[train_samples]
test_images, test_labels = test_images[test_samples,:,:], test_labels[test_samples]

train_images.shape, test_images.shape
Out[3]:
((10000, 28, 28), (1000, 28, 28))

Visualizing images

In [4]:
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
class_dict = {i:class_name for i,class_name in enumerate(class_names)}

def show_train_image(index):
    plt.figure()
    # cmap=plt.cm.binary allows us to show the picture in grayscale
    plt.imshow(train_images[index], cmap=plt.cm.binary)
    plt.title(class_names[train_labels[index]])
    plt.colorbar() # adds a bar to the side with values
    plt.show()
In [5]:
# Simply run this cell
show_train_image(0)

Let's see what kind of images we have overall.

There are 10 classes:

In [6]:
# there are 10 classes
print(len(class_names))
print(sorted(class_names))
10
['Ankle boot', 'Bag', 'Coat', 'Dress', 'Pullover', 'Sandal', 'Shirt', 'Sneaker', 'T-shirt/top', 'Trouser']
In [7]:
# Simply run this cell
# see documentation for subplot here:
# https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.pyplot.subplot.html
plt.figure(figsize=(10,10))
for i in range(25):
    plt.subplot(5,5,i+1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(train_images[i], cmap=plt.cm.binary)
    plt.xlabel(class_names[train_labels[i]])

Goals of this demo

Suppose we would like to train a logistic regression classifier to distinguish between two specific classes of clothes. Note that logistic regression is a binary classifier; if you are interested in multi-class classification beyond this course, check out this sklearn page.

We'd then like to check out two things:

  • What do the data look like? Are they linearly separable?
  • What features can we use?
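
To make the two-class setup concrete, here is a minimal sketch of restricting a labeled image array to two classes and building a 0/1 target. The arrays `images` and `labels` are small random stand-ins rather than the real data, and the choice of classes 0 and 1 (T-shirt/top vs. Trouser) is an assumption made purely for illustration.

```python
import numpy as np

# Hypothetical stand-ins for the notebook's arrays: 6 random "images" and labels.
rng = np.random.default_rng(0)
images = rng.random((6, 28, 28))
labels = np.array([0, 3, 1, 1, 7, 0])

# Assumed pair of classes for illustration: 0 (T-shirt/top) vs. 1 (Trouser).
class_a, class_b = 0, 1

# Keep only the rows that belong to one of the two classes.
mask = np.isin(labels, [class_a, class_b])
X_binary = images[mask]

# Binary target: 1 for class_b, 0 for class_a.
y_binary = (labels[mask] == class_b).astype(int)

print(X_binary.shape)  # (4, 28, 28)
print(y_binary)        # [0 1 1 0]
```

The same masking idea should carry over to a DataFrame with a `label` column, though the exact pair of classes to compare is up to you.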

Preprocess:

Normalize to 1

Pixel values range from 0 to 255; with the plt.cm.binary colormap used above, 0 is drawn as white and 255 as black. When working with image data, we generally normalize pixel values to the range 0 to 1:

In [8]:
# just run this cell
train_images = train_images/255
test_images = test_images/255

print(f'Train Min:{train_images.min()} Max:{train_images.max()}')

print(f'Test Min:{test_images.min()} Max:{test_images.max()}')

show_train_image(0)
Train Min:0.0 Max:1.0
Test Min:0.0 Max:1.0

Reshape features into 1-D

Recall that logistic regression relies on our features being 1-D, i.e., a vector, because we are trying to fit the model:

$$\hat{P}_{\theta}(Y = 1 | X = x) = \sigma(x^T \theta)$$
  • Our data is composed of grayscale images (one channel) with a resolution of $28 \times 28$.
  • We can think of each image as a point in a $28 \times 28 = 784$-dimensional space.
  • We therefore need to reshape every image in our dataset so that it can be represented by a vector of length 784.
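
As a rough sketch of the bullets above: flattening one 28x28 image gives a length-784 vector $x$ that can be plugged into $\sigma(x^T \theta)$. The image here is random noise, and `theta` is a hypothetical all-zeros parameter vector (not a fitted model), which is why the predicted probability comes out to exactly 0.5.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
image = rng.random((28, 28))   # stand-in for one normalized image

# Flatten row by row: pixel (i, j) lands at position 28*i + j.
x = image.reshape(-1)

# Hypothetical parameters, all zeros; a real model would learn these.
theta = np.zeros(x.shape[0])

# Model prediction: sigma(x^T theta)
p = sigmoid(x @ theta)

print(x.shape)  # (784,)
print(p)        # 0.5
```

Reshaping loses no information: `x.reshape(28, 28)` recovers the original image.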

Using np.reshape, we reshape both the train and test sets and convert each to a DataFrame:

In [9]:
# reshape pixels
train_images_vectors = np.reshape(train_images, (len(train_images), -1))
test_images_vectors = np.reshape(test_images, (len(test_images), -1))
train_images_vectors.shape, test_images_vectors.shape
Out[9]:
((10000, 784), (1000, 784))
In [10]:
# then, add class/label to DataFrame
train_df = pd.DataFrame(train_images_vectors)
train_df['label'] = train_labels
train_df['class'] = train_df['label'].map(class_dict)

# reorder columns just so it's easier on the eyes
PIXEL_COLS = train_df.columns.tolist()[:-2]
LABEL_COLS = ['label', 'class']

cols_reorder = LABEL_COLS + PIXEL_COLS
train_df = train_df[cols_reorder]
train_df
Out[10]:
label class 0 1 2 3 4 5 6 7 ... 774 775 776 777 778 779 780 781 782 783
0 1 Trouser 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 ... 0.003922 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0
1 3 Dress 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 ... 0.243137 0.160784 0.000000 0.000000 0.007843 0.000000 0.000000 0.0 0.0 0.0
2 3 Dress 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 ... 0.596078 0.584314 0.203922 0.196078 0.000000 0.000000 0.000000 0.0 0.0 0.0
3 7 Sneaker 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0
4 8 Bag 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 ... 0.576471 0.568627 0.509804 0.470588 0.556863 0.168627 0.000000 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9995 2 Pullover 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.600000 0.223529 0.047059 0.0 0.0 0.0
9996 0 T-shirt/top 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 ... 0.000000 0.317647 0.137255 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0
9997 0 T-shirt/top 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.007843 ... 0.494118 0.619608 0.003922 0.000000 0.011765 0.000000 0.000000 0.0 0.0 0.0
9998 6 Shirt 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0
9999 4 Coat 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 ... 0.003922 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0

10000 rows × 786 columns

In [11]:
# do the same for test dataset
test_df = pd.DataFrame(test_images_vectors)
test_df['label'] = test_labels
test_df['class'] = test_df['label'].map(class_dict)

cols_reorder = LABEL_COLS + PIXEL_COLS
test_df = test_df[cols_reorder]




PCA

How would we visualize how the features (i.e., pixels) change across classes? Would we have to pick random pixels to compare? Probably not. As humans, we distinguish the classes by higher-order shapes and interactions between the pixels, not by individual pixel values.

Enter PCA.

Here I use sklearn.decomposition.PCA which uses SVD under the hood:

Linear dimensionality reduction using Singular Value Decomposition of the data to project it to a lower dimensional space. The input data is centered but not scaled for each feature before applying the SVD.
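
To illustrate the description quoted above, here is a minimal NumPy sketch of PCA via SVD on a small toy matrix: center each feature (without scaling), take the SVD, and project onto the top `k` right singular vectors. The toy data and the names `X`, `k`, `components`, and `scores` are assumptions for illustration; this is not sklearn's actual implementation, and sklearn's output may differ from this sketch by a sign flip per component.

```python
import numpy as np

# Toy data: 200 rows, 5 correlated features (stand-in for the pixel matrix).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

k = 2  # number of components to keep

# Center each feature; do not scale.
X_centered = X - X.mean(axis=0)

# Thin SVD: X_centered = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Rows of Vt are the principal directions; keep the first k.
components = Vt[:k]

# Project the centered data onto those directions.
scores = X_centered @ components.T

print(scores.shape)  # (200, 2)
```

The projection can equivalently be computed as `U[:, :k] * s[:k]`, which avoids the extra matrix product.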

Let's look at the train set. We'll run PCA to get the first 50 components:

In [12]:
from sklearn.decomposition import PCA

n_comps = 50
PCA_COLS = [f"pc{i+1}" for i in range(n_comps)]
pca = PCA(n_components=n_comps)
pca.fit(train_df[PIXEL_COLS])
principal_components = pca.transform(train_df[PIXEL_COLS])
In [13]:
# The first 50 components
principal_components.shape
Out[13]:
(10000, 50)

Explained Variance from PCs

Note that sklearn.decomposition.PCA has an attribute called explained_variance_ratio_:

Percentage of variance explained by each of the selected components.
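
For intuition, the sketch below computes this ratio by hand from the singular values of a centered toy matrix, using the standard definition: each component's variance is its squared singular value divided by $n - 1$, and the ratio divides that by the total variance. The toy data and variable names are assumptions for illustration. Because all components are kept here, the ratios sum to 1; when only some components are kept, the sum is smaller.

```python
import numpy as np

# Toy data: 200 rows, 5 features with very different spreads.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

X_centered = X - X.mean(axis=0)

# Singular values only (sorted from largest to smallest).
s = np.linalg.svd(X_centered, compute_uv=False)

# Variance captured by each component.
explained_variance = s**2 / (len(X) - 1)

# Fraction of the total variance captured by each component.
explained_variance_ratio = explained_variance / explained_variance.sum()

print(explained_variance_ratio)
```

The total of `explained_variance` matches the sum of the per-feature sample variances, which is a quick sanity check on the computation.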

The first 50 components account for about 86% of the total variance:

In [14]:
np.sum(pca.explained_variance_ratio_)
Out[14]:
0.863129580664258

The first two components account for a little less than half of the variance (about 47%):

In [15]:
# fraction of variance explained by PC1 and PC2 together
np.sum(pca.explained_variance_ratio_[:2])
Out[15]:
0.4690051583750241

Seem reasonable? Let's check out the scree plot:

In [16]:
plt.plot(np.arange(n_comps)+1,
         100*pca.explained_variance_ratio_,
         marker='.');
plt.ylabel("% variance")
plt.xlabel("Component Number")
Out[16]:
Text(0.5, 0, 'Component Number')
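
Alongside a scree plot, it can help to look at the cumulative variance and ask how many components are needed to reach a target. The sketch below uses a made-up `ratios` array and an arbitrary 70% `threshold`, both hypothetical; they are not the values from this notebook.

```python
import numpy as np

# Hypothetical explained-variance ratios, for illustration only.
ratios = np.array([0.30, 0.20, 0.15, 0.10, 0.05, 0.05])

# Running total of variance explained by the first 1, 2, 3, ... components.
cumulative = np.cumsum(ratios)

# Assumed target: explain at least 70% of the variance.
threshold = 0.70

# argmax returns the index of the first True; add 1 to turn it into a count.
n_needed = int(np.argmax(cumulative >= threshold)) + 1

print(n_needed)  # 4
```

Note that `np.argmax` returns 0 if the threshold is never reached, so a real version should check `cumulative[-1] >= threshold` first.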

Visually, the elbow looks closer to component 3 or 4, so we can't gather too much from the scree plot alone, but let's try the first two components and see what happens:

EDA: visualizations

In [17]:
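def build_comps_df(components, label_df, colnames):
    # Wrap the component scores in a DataFrame and attach the labels.
    # Rows are matched by index, so label_df should use the default 0..n-1 index.
    df = pd.DataFrame(data=components,
                      columns=colnames)
    df["class"] = label_df["class"]
    df["label"] = label_df["label"]
    return df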
In [18]:
pca_df = build_comps_df(principal_components, train_df, PCA_COLS)
# plot PC2 vs. PC1; uncomment `hue` below to color points by class
sns.lmplot(x='pc1',
           y='pc2',
           data=pca_df,
           fit_reg=False,
           # hue='class',
           height=9,
           scatter_kws={"s":50,"alpha":0.2})
plt.title("PCA visualization (PC2 vs. PC1)")
Out[18]:
Text(0.5, 1.0, 'PCA visualization (PC2 vs. PC1)')