Lecture 25 – Data 100, Spring 2024¶

Data 100, Spring 2024

Acknowledgments Page and UC Santa Cruz

In [1]:
import pandas as pd
import numpy as np
from ds100_utils import *
import plotly.express as px

Load the Fashion-MNIST dataset¶

We will be using the Fashion-MNIST dataset, a collection of 28x28 grayscale images of articles of clothing.

Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. Han Xiao, Kashif Rasul, Roland Vollgraf. arXiv:1708.07747 https://github.com/zalandoresearch/fashion-mnist

Load data¶

In [2]:
import fashion_mnist

(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
print("Training images", train_images.shape)
print("Test images", test_images.shape)
Using cached version that was downloaded (UTC): Thu Apr 18 05:53:05 2024
Using cached version that was downloaded (UTC): Thu Apr 18 05:53:06 2024
Using cached version that was downloaded (UTC): Thu Apr 18 05:53:06 2024
Using cached version that was downloaded (UTC): Thu Apr 18 05:53:06 2024
Training images (60000, 28, 28)
Test images (10000, 28, 28)

The class names for this data are:

In [3]:
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
class_dict = {i:class_name for i,class_name in enumerate(class_names)}

We have loaded a lot of data which you can play with later (try building a classifier).

For the purposes of this demo, let's take a small sample of the training data.

In [4]:
rng = np.random.default_rng(42)
n = 5000
sample_idx = rng.choice(np.arange(len(train_images)), size=n, replace=False)

# Invert and normalize the images so they look better.
# Cast to float first so negation cannot overflow an unsigned integer dtype.
img_mat = -train_images[sample_idx].astype(float)
img_mat = (img_mat - img_mat.min())/(img_mat.max() - img_mat.min())

images = pd.DataFrame({"images": img_mat.tolist(), 
                   "labels": train_labels[sample_idx], 
                   "class": [class_dict[x] for x in train_labels[sample_idx]]})
images.head()
Out[4]:
images labels class
0 [[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,... 3 Dress
1 [[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,... 4 Coat
2 [[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,... 0 T-shirt/top
3 [[1.0, 1.0, 1.0, 1.0, 1.0, 0.996078431372549, ... 2 Pullover
4 [[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,... 1 Trouser
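
The inversion and min-max scaling above can be checked by hand on a tiny array. This is only a sketch: `toy` is a made-up 2x2 "image", and it assumes the raw pixels are 8-bit values in 0..255.

```python
import numpy as np

# Hypothetical 2x2 "image" with 8-bit pixel values (assumed range 0..255).
toy = np.array([[0, 1], [128, 255]], dtype=np.uint8)

# Cast to float before negating so the sign flip cannot wrap around in uint8.
inverted = -toy.astype(float)

# Min-max scale to [0, 1]: the smallest raw value maps to 1.0, the largest to 0.0.
scaled = (inverted - inverted.min()) / (inverted.max() - inverted.min())
print(scaled)
```

Under that assumption, a raw pixel value of 1 maps to 254/255, which is the 0.996078431372549 visible in the output above.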

Visualizing images¶

The following snippet of code visualizes the images:

In [5]:
def show_images(images, ncols=5, max_images=30):
    # Convert the subset of images into an (n, 28, 28) array for facet visualization
    img_mat = np.array(images.head(max_images)['images'].to_list())
    fig = px.imshow(img_mat, color_continuous_scale='gray', 
                    facet_col = 0, facet_col_wrap=ncols,
                    # Size the figure by the number of images actually shown
                    height = 220*int(np.ceil(len(img_mat)/ncols)))
    fig.update_layout(coloraxis_showscale=False)
    # Extract the facet number and convert it back to the class label.
    fig.for_each_annotation(lambda a: a.update(text=images.iloc[int(a.text.split("=")[-1])]['class']))
    return fig

show_images(images.head(20))

Let's look at each class:

In [6]:
show_images(images.groupby('class',as_index=False).sample(2), ncols=6)

PCA¶

How would we visualize the entire dataset? Let's use PCA to find a low dimensional representation of the images.

First, let's understand the high-dimensional representation. We will extract the matrix of images from the dataframe:

In [7]:
X = np.array(images['images'].to_list())
X.shape
Out[7]:
(5000, 28, 28)

We now "unroll" the pixels of each image into a single row vector of 28*28 = 784 dimensions:

In [8]:
X = X.reshape(X.shape[0], -1)
X.shape
Out[8]:
(5000, 784)
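
To see where each pixel ends up after unrolling, here is a small sketch on a made-up stack of 3x4 "images" (`imgs` is hypothetical and stands in for the (n, 28, 28) array).

```python
import numpy as np

# Hypothetical stack of 2 tiny 3x4 "images" standing in for the (n, 28, 28) array.
imgs = np.arange(2 * 3 * 4).reshape(2, 3, 4)

# Flatten each image into one row; -1 lets NumPy infer 3 * 4 = 12 columns.
flat = imgs.reshape(imgs.shape[0], -1)

# With NumPy's default row-major order, pixel (i, j) lands in column i * width + j.
i, j, width = 2, 1, imgs.shape[2]
assert flat[1, i * width + j] == imgs[1, i, j]

# The operation is lossless: reshaping back recovers the original images.
restored = flat.reshape(imgs.shape)
```

For the 28x28 images, the same rule puts pixel (i, j) in column 28*i + j.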

Center the data by subtracting each pixel's mean across images:

In [9]:
X = X - X.mean(axis=0)
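
Why center? A minimal sketch on made-up 2-D points (`pts` is hypothetical): without centering, the top singular direction mostly points at the mean of the data rather than along its spread. scikit-learn's `PCA` also centers its input internally, so centering by hand beforehand is harmless.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-D data: spread mostly along the x-axis, but shifted far up the y-axis.
pts = rng.normal(size=(500, 2)) * np.array([3.0, 0.3]) + np.array([0.0, 50.0])

# Subtract each column's mean, as in the cell above.
centered = pts - pts.mean(axis=0)

# Rows of Vt are right-singular vectors; the first one is the dominant direction.
top_raw = np.linalg.svd(pts, full_matrices=False)[2][0]
top_centered = np.linalg.svd(centered, full_matrices=False)[2][0]

print("column means after centering:", centered.mean(axis=0))
print("top direction without centering:", top_raw)
print("top direction with centering:   ", top_centered)
```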

Run PCA (this time we use scikit-learn):

In [10]:
from sklearn.decomposition import PCA
n_comps = 50 
pca = PCA(n_components=n_comps)
pca.fit(X)
Out[10]:
PCA(n_components=50)
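
As a rough sketch of what the fit above computes, PCA can be written with a singular value decomposition in NumPy. `pca_svd` and `X_demo` are made-up names for illustration; scikit-learn may use a different solver internally, and the signs of individual components can differ.

```python
import numpy as np

def pca_svd(X, k):
    """Return (scores, components) for the top-k principal components of X."""
    Xc = X - X.mean(axis=0)                        # center each column
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                            # (k, d): orthonormal directions
    scores = Xc @ components.T                     # (n, k): coordinates in that basis
    return scores, components

rng = np.random.default_rng(0)
# Hypothetical correlated data: 200 rows, 10 features.
X_demo = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))

Z, comps = pca_svd(X_demo, k=3)
print(Z.shape, comps.shape)
```

The rows of `comps` play the role of `pca.components_`, and `Z` plays the role of `pca.transform(X)`, up to sign.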

Examining PCA Results¶

In [11]:
# make a line plot and show markers
px.line(y=pca.explained_variance_ratio_ *100, markers=True)

Most of the variance in the data is explained by the first two or three principal components.
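
The explained variance ratio plotted above can be sketched from singular values: each component's variance is its squared singular value divided by n - 1, and the ratio divides that by the total variance. `X_demo` is made-up data with decreasing column spreads, not the image matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data whose columns have decreasing spread.
X_demo = rng.normal(size=(300, 6)) * np.array([5.0, 3.0, 1.0, 0.5, 0.5, 0.5])
Xc = X_demo - X_demo.mean(axis=0)

# Singular values come back sorted from largest to smallest.
S = np.linalg.svd(Xc, compute_uv=False)
explained_variance = S**2 / (len(Xc) - 1)
ratio = explained_variance / explained_variance.sum()
cumulative = np.cumsum(ratio)

for k, (r, c) in enumerate(zip(ratio, cumulative), start=1):
    print(f"PC{k}: {100 * r:5.1f}% of variance, {100 * c:5.1f}% cumulative")
```

This sketch keeps every component, so the ratios sum to 1. With `n_components=50` out of 784, scikit-learn's ratios are still measured against the total variance, so they sum to less than 1.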

In [12]:
images[['z1', 'z2', 'z3']] = pca.transform(X)[:, :3]
In [23]:
px.scatter(images, x='z1', y='z2', hover_data=['labels'], 
           width = 800, height = 800)
In [13]:
px.scatter(images, x='z1', y='z2', color='class', hover_data=['labels'], 
           width = 800, height = 800)
In [14]:
fig = px.scatter_3d(images, x='z1', y='z2', z='z3', color='class', hover_data=['labels'], 
              width=1000, height=800)
# set marker size to 5
fig.update_traces(marker=dict(size=5))

For comparison, let's project the same data onto three random directions:

In [15]:
# Reuse the seeded generator from above so the random projection is reproducible
rand_basis = rng.standard_normal((784, 3))
images[['z1_rand', 'z2_rand', 'z3_rand']] = X @ rand_basis
px.scatter_3d(images, x='z1_rand', y='z2_rand', z='z3_rand', color='class', hover_data=['labels'],  
              width=1000, height=800).update_traces(marker=dict(size=5))
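
One way to quantify the difference between the two projections is the fraction of total variance each 3-D basis captures. This is a sketch on made-up data (`X_demo`, `captured`, and both bases are hypothetical). Unlike the Gaussian `rand_basis` above, the random basis here is orthonormalized with a QR decomposition so the two fractions are directly comparable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 50 features with steadily decreasing spread.
X_demo = rng.normal(size=(500, 50)) * np.linspace(10.0, 0.1, 50)
Xc = X_demo - X_demo.mean(axis=0)
total_var = Xc.var(axis=0).sum()

# PCA basis: the top 3 right-singular vectors of the centered data, shape (50, 3).
pca_basis = np.linalg.svd(Xc, full_matrices=False)[2][:3].T

# Random orthonormal basis: Q factor of a Gaussian matrix, shape (50, 3).
rand_basis_demo, _ = np.linalg.qr(rng.normal(size=(50, 3)))

def captured(basis):
    """Fraction of total variance retained after projecting onto `basis`."""
    return (Xc @ basis).var(axis=0).sum() / total_var

print("PCA basis:   ", captured(pca_basis))
print("random basis:", captured(rand_basis_demo))
```

Among all orthonormal 3-D bases, the PCA basis retains the most variance, so the first number can never be smaller than the second.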

Trying other methods¶

Running the cell below might take some time:

In [16]:
from sklearn.manifold import TSNE
# n_iter is omitted: its default is 1000, and newer scikit-learn releases rename it to max_iter
tsne_model = TSNE(n_components=3, random_state=0, perplexity=30, learning_rate=200)
tsne_comps = tsne_model.fit_transform(X)
images[['tsne1', 'tsne2', 'tsne3']] = tsne_comps
In [17]:
px.scatter(images, x='tsne1', y='tsne2', color='class', hover_data=['labels'],
              width=1000, height=800)
In [18]:
px.scatter_3d(images, x='tsne1', y='tsne2', z='tsne3', color='class', hover_data=['labels'],
              width=1000, height=800).update_traces(marker=dict(size=5))

Finding a lower dimensional basis for

The plots above visualize the t-SNE vectors. Note that all embeddings are built off of the principal components, which are rotations of the original features.

When we color the points by class label, notice that t-SNE's clusters correspond reasonably well to the classes.

Apply PCA to a subset of the data¶

Let's see if we can build a better embedding for a subset of the data containing two classes that are tough to tell apart: coats and pullovers.

In [19]:
classes = ['Coat', 'Pullover']
tough_images = images[images['class'].isin(classes)].copy()
show_images(tough_images.sample(20))
In [20]:
X = np.array(tough_images['images'].to_list())
X = X.reshape(X.shape[0], -1)
X = X - X.mean(axis=0)
zs = PCA(n_components=3).fit_transform(X)
tough_images[['z1', 'z2', 'z3']] = zs
px.scatter_3d(tough_images, x='z1', y='z2', z='z3', color='class', hover_data=['labels'],
              width=1000, height=800).update_traces(marker=dict(size=5))

Logistic Regression on these hard images¶

In [21]:
import sklearn.linear_model as lm
model = lm.LogisticRegression(max_iter=1000)
y = tough_images['class'] == "Coat"
# Fit on the 3 principal component scores and report training accuracy
model.fit(zs, y)
np.mean(model.predict(zs) == y)
Out[21]:
0.6905444126074498
In [22]:
import sklearn.linear_model as lm
model = lm.LogisticRegression(max_iter=1000)
y = tough_images['class'] == "Coat"
# Fit on all 784 centered pixel features and report training accuracy
model.fit(X, y)
np.mean(model.predict(X) == y)
Out[22]:
0.9551098376313276
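
Both accuracies above are measured on the same rows the model was fit on. A held-out comparison might look like the sketch below. Everything in it is an assumption for illustration: the data is synthetic (built so that the class signal sits outside the highest-variance directions, which is one possible reason three components can underperform), and a plain gradient-descent logistic regression is swapped in for scikit-learn's `LogisticRegression`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data (not Fashion-MNIST): the first 3 features are
# high-variance noise, the other 17 carry a small class-dependent shift.
n, d = 400, 20
y_demo = rng.integers(0, 2, size=n)
X_demo = rng.normal(size=(n, d))
X_demo[:, :3] *= 5.0
X_demo[:, 3:] += 0.8 * y_demo[:, None]

# Hold out the last 100 rows; the mean and PCA basis come from training rows only.
X_tr, X_te, y_tr, y_te = X_demo[:300], X_demo[300:], y_demo[:300], y_demo[300:]
mu = X_tr.mean(axis=0)
X_tr, X_te = X_tr - mu, X_te - mu
top3 = np.linalg.svd(X_tr, full_matrices=False)[2][:3].T   # (20, 3) PCA basis

def fit_logistic(A, t, lr=0.05, steps=2000):
    """Plain gradient descent on the mean log loss; returns (weights, bias)."""
    w, b = np.zeros(A.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(A @ w + b)))   # predicted P(class 1)
        w -= lr * A.T @ (p - t) / len(t)
        b -= lr * np.mean(p - t)
    return w, b

def accuracy(A, t, w, b):
    """Fraction of rows whose predicted class (logit > 0) matches the label."""
    return np.mean((A @ w + b > 0) == (t == 1))

results = {}
for name, tr, te in [("3 PCs", X_tr @ top3, X_te @ top3),
                     ("all features", X_tr, X_te)]:
    w, b = fit_logistic(tr, y_tr)
    results[name] = (accuracy(tr, y_tr, w, b), accuracy(te, y_te, w, b))
    print(f"{name}: train={results[name][0]:.3f}, held-out={results[name][1]:.3f}")
```

For the notebook itself, the same idea would mean fitting on `tough_images` and scoring on coats and pullovers drawn from `test_images`.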