import pandas as pd
import numpy as np
from ds100_utils import *
import plotly.express as px
Load the Fashion-MNIST dataset
We will use the Fashion-MNIST dataset, a cool little dataset of 28x28 grayscale images of articles of clothing.
Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. Han Xiao, Kashif Rasul, Roland Vollgraf. arXiv:1708.07747 https://github.com/zalandoresearch/fashion-mnist
Load data
import fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
print("Training images", train_images.shape)
print("Test images", test_images.shape)
Using cached version that was downloaded (UTC): Thu Apr 18 05:53:05 2024
Using cached version that was downloaded (UTC): Thu Apr 18 05:53:06 2024
Using cached version that was downloaded (UTC): Thu Apr 18 05:53:06 2024
Using cached version that was downloaded (UTC): Thu Apr 18 05:53:06 2024
Training images (60000, 28, 28)
Test images (10000, 28, 28)
The class names for this data are:
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
# Map each integer label (0-9) to its class name.
class_dict = dict(enumerate(class_names))
We have loaded a lot of data which you can play with later (try building a classifier).
For the purposes of this demo, let's take a small sample of the training data.
rng = np.random.default_rng(42)
n = 5000
sample_idx = rng.choice(np.arange(len(train_images)), size=n, replace=False)
# Invert the images and rescale them to [0, 1] so they look better.
# Convert to float first so negation is safe even if the pixels are stored as unsigned integers.
img_mat = -train_images[sample_idx].astype(float)
img_mat = (img_mat - img_mat.min()) / (img_mat.max() - img_mat.min())
images = pd.DataFrame({"images": img_mat.tolist(),
                       "labels": train_labels[sample_idx],
                       "class": [class_dict[x] for x in train_labels[sample_idx]]})
images.head()
 | images | labels | class |
---|---|---|---|
0 | [[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,... | 3 | Dress |
1 | [[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,... | 4 | Coat |
2 | [[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,... | 0 | T-shirt/top |
3 | [[1.0, 1.0, 1.0, 1.0, 1.0, 0.996078431372549, ... | 2 | Pullover |
4 | [[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,... | 1 | Trouser |
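The inversion and min-max scaling used when building the sample can be checked on a tiny array. This is a minimal sketch, assuming the pixels are stored as unsigned 8-bit integers; the array `toy` is made up for illustration and is not part of the dataset.

```python
import numpy as np

# Hypothetical 2x2 "image" with uint8 pixel intensities (assumed storage format).
toy = np.array([[0, 255], [51, 204]], dtype=np.uint8)

# Convert to float before negating so the values cannot wrap around or overflow.
inverted = -toy.astype(float)

# Min-max scale to [0, 1]: the smallest original value (0) maps to 1.0
# and the largest (255) maps to 0.0.
scaled = (inverted - inverted.min()) / (inverted.max() - inverted.min())
print(scaled)
```

After inversion, background pixels (originally 0) end up at 1.0, which matches the runs of 1.0 at the start of each image in the table above.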
Visualizing images
The following function displays a grid of images, labeling each one with its class:
def show_images(images, ncols=5, max_images=30):
    # Only the first max_images rows are drawn, so size the figure from that subset.
    shown = images.head(max_images)
    # Convert the subset of images into an (n, 28, 28) array for facet visualization.
    img_mat = np.array(shown['images'].to_list())
    fig = px.imshow(img_mat, color_continuous_scale='gray',
                    facet_col=0, facet_col_wrap=ncols,
                    height=220 * int(np.ceil(len(shown) / ncols)))
    fig.update_layout(coloraxis_showscale=False)
    # Extract the facet number and convert it back to the class label.
    fig.for_each_annotation(lambda a: a.update(text=shown.iloc[int(a.text.split("=")[-1])]['class']))
    return fig
show_images(images.head(20))
Let's look at two random examples from each class:
show_images(images.groupby('class',as_index=False).sample(2), ncols=6)
PCA
How would we visualize the entire dataset? Let's use PCA to find a low-dimensional representation of the images.
First, let's understand the high-dimensional representation. We will extract the matrix of images from the dataframe:
X = np.array(images['images'].to_list())
X.shape
(5000, 28, 28)
We now "unroll" the pixels of each image into a single row vector with 28*28 = 784 dimensions:
X = X.reshape(X.shape[0], -1)
X.shape
(5000, 784)
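To see what the unrolling does, here is a small sketch on a made-up stack of 2x2 "images" (the array `stack` is hypothetical). It uses the same `reshape(n, -1)` call and shows that the operation is reversible.

```python
import numpy as np

# Three hypothetical 2x2 "images" stacked into an (n, height, width) array.
stack = np.arange(12).reshape(3, 2, 2)

# Flatten each image into one row; -1 lets NumPy infer height * width.
flat = stack.reshape(stack.shape[0], -1)
print(flat.shape)  # (3, 4)

# Reshaping back recovers the original images, so no information is lost.
restored = flat.reshape(stack.shape)
print(np.array_equal(stack, restored))  # True
```

Each row of `flat` lists the pixels of one image in row-major order, which is why a 28x28 image becomes a vector of length 784.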
Center the data by subtracting each pixel's mean across images:
X = X - X.mean(axis=0)
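The centering step relies on NumPy broadcasting: `mean(axis=0)` returns one mean per column, and subtracting it adjusts every row. A minimal sketch on a made-up matrix `M` (hypothetical values, not image data):

```python
import numpy as np

# Hypothetical data matrix: 4 observations (rows), 3 features (columns).
M = np.array([[1.0, 10.0, 100.0],
              [2.0, 20.0, 200.0],
              [3.0, 30.0, 300.0],
              [6.0, 40.0, 400.0]])

# mean(axis=0) averages over the rows, giving one mean per column: 3, 25, 250.
col_means = M.mean(axis=0)

# Broadcasting subtracts the matching column mean from every row.
M_centered = M - col_means

# After centering, every column has mean zero (up to floating-point error).
print(M_centered.mean(axis=0))
```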
Run PCA (this time we use scikit-learn):
from sklearn.decomposition import PCA
n_comps = 50
pca = PCA(n_components=n_comps)
pca.fit(X)
PCA(n_components=50)
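For intuition about what the fit computes, PCA on centered data can be sketched with a plain SVD. The code below is an illustration on synthetic data using only NumPy, not scikit-learn's actual implementation; the names `A`, `components`, and `Z` are made up here, and the signs of the directions may differ from what a library returns.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the image matrix: 200 rows, 10 correlated features.
A = rng.standard_normal((200, 10)) @ rng.standard_normal((10, 10))
A = A - A.mean(axis=0)  # center each column

# Thin SVD: A = U @ diag(s) @ Vt. The rows of Vt are the principal directions,
# ordered by decreasing singular value.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 3
components = Vt[:k]      # top-k principal directions, shape (k, 10)
Z = A @ components.T     # low-dimensional coordinates, shape (200, k)

# Variance along each direction and its share of the total variance.
explained_variance = s**2 / (A.shape[0] - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()
print(explained_variance_ratio[:k])
```

Here `Z` plays the role of the output of `pca.transform(X)`, and `explained_variance_ratio` plays the role of `pca.explained_variance_ratio_`.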
Examining PCA Results
# make a line plot and show markers
px.line(y=pca.explained_variance_ratio_ *100, markers=True)
The explained variance drops off sharply after the first two or three components, so those components capture much of the structure in the data.
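One way to quantify how much the leading components capture is the cumulative explained variance. The sketch below uses made-up ratios for illustration (they are not the values from this dataset); with the fitted model you would pass `pca.explained_variance_ratio_` instead.

```python
import numpy as np

# Made-up explained-variance ratios, sorted in decreasing order (hypothetical).
ratios = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])

# Running total of variance explained by the first 1, 2, 3, ... components.
cumulative = np.cumsum(ratios)

# Smallest number of components whose cumulative share reaches the threshold.
threshold = 0.85
n_needed = int(np.argmax(cumulative >= threshold)) + 1
print(n_needed)  # 4
```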
images[['z1', 'z2', 'z3']] = pca.transform(X)[:, :3]
px.scatter(images, x='z1', y='z2', hover_data=['labels'],
width = 800, height = 800)
px.scatter(images, x='z1', y='z2', color='class', hover_data=['labels'],
width = 800, height = 800)
fig = px.scatter_3d(images, x='z1', y='z2', z='z3', color='class', hover_data=['labels'],
width=1000, height=800)
# set marker size to 5
fig.update_traces(marker=dict(size=5))
For comparison, let's project the same data onto three random directions instead of the principal components:
# Reuse the seeded generator so the random projection is reproducible.
rand_basis = rng.standard_normal((784, 3))
images[['z1_rand', 'z2_rand', 'z3_rand']] = X @ rand_basis
px.scatter_3d(images, x='z1_rand', y='z2_rand', z='z3_rand', color='class', hover_data=['labels'],
width=1000, height=800).update_traces(marker=dict(size=5))
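The visual difference between the two projections can also be expressed as a number: the fraction of total variance each projection retains. The sketch below uses synthetic data and, unlike the Gaussian basis above, orthonormalizes the random directions with a QR decomposition so that the comparison with PCA is like for like. That orthonormalization is an assumption added here, and all names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic centered data whose variance is concentrated in a few directions.
scales = np.array([10.0, 5.0, 2.0] + [0.5] * 17)
A = rng.standard_normal((500, 20)) * scales
A = A - A.mean(axis=0)

# Top-3 principal directions from the SVD (columns of pca_basis).
_, _, Vt = np.linalg.svd(A, full_matrices=False)
pca_basis = Vt[:3].T

# Random 3-dimensional orthonormal basis via QR of a Gaussian matrix.
rand_orth_basis, _ = np.linalg.qr(rng.standard_normal((20, 3)))

def captured(basis):
    """Fraction of total variance retained after projecting A onto `basis`."""
    return (A @ basis).var(axis=0).sum() / A.var(axis=0).sum()

pca_share = captured(pca_basis)
rand_share = captured(rand_orth_basis)
print(pca_share, rand_share)
```

Among orthonormal bases of a given size, the principal directions retain the most variance, so `pca_share` is never smaller than `rand_share`.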