import pandas as pd
import numpy as np
from ds100_utils import *
import plotly.express as px
Load the Fashion-MNIST dataset¶
We will be using the Fashion-MNIST dataset, a cool little dataset of 28x28 grayscale images of articles of clothing.
Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. Han Xiao, Kashif Rasul, Roland Vollgraf. arXiv:1708.07747
Load data¶
import fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
print("Training images", train_images.shape)
print("Test images", test_images.shape)
Training images (60000, 28, 28)
Test images (10000, 28, 28)
The class names for this data are:
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
class_dict = {i:class_name for i,class_name in enumerate(class_names)}
We have loaded a lot of data which you can play with later (try building a classifier).
For the purposes of this demo, let's take a small sample of the training data.
rng = np.random.default_rng(42)
n = 5000
sample_idx = rng.choice(np.arange(len(train_images)), size=n, replace=False)
# Invert and normalize the images so they look better.
# Cast to float first so negating the unsigned 8-bit pixels cannot overflow.
img_mat = -1.0 * train_images[sample_idx].astype(float)
img_mat = (img_mat - img_mat.min())/(img_mat.max() - img_mat.min())
images = pd.DataFrame({"images": img_mat.tolist(),
                       "labels": train_labels[sample_idx],
                       "class": [class_dict[x] for x in train_labels[sample_idx]]})
images.head()
| images | labels | class |
0 | [[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,... | 3 | Dress |
1 | [[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,... | 4 | Coat |
2 | [[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,... | 0 | T-shirt/top |
3 | [[1.0, 1.0, 1.0, 1.0, 1.0, 0.996078431372549, ... | 2 | Pullover |
4 | [[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,... | 1 | Trouser |
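To see what the invert-and-normalize step does to pixel values, here is a minimal sketch on a made-up 2x2 "image". The array `toy` and its values are hypothetical and are not taken from Fashion-MNIST; the arithmetic mirrors the cell above.

```python
import numpy as np

# Hypothetical 2x2 "image" with 8-bit intensities (0 = darkest, 255 = brightest).
toy = np.array([[0, 51], [204, 255]], dtype=np.uint8)

# Cast to float before negating so the unsigned type cannot wrap or overflow.
inverted = -1.0 * toy.astype(float)

# Min-max scaling maps the smallest value to 0 and the largest to 1.
scaled = (inverted - inverted.min()) / (inverted.max() - inverted.min())
print(scaled)
```

After inversion the originally darkest pixel becomes the brightest, which is why the image rows in the table above start with runs of 1.0 (a white background).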
Visualizing images¶
The following snippet of code visualizes the images:
def show_images(images, ncols=5, max_images=30):
    # Keep only the images that will be drawn, so facet indices match row positions.
    images = images.head(max_images)
    # Convert the subset of images into an (n, 28, 28) array for facet visualization.
    img_mat = np.array(images['images'].to_list())
    # Size the figure from the number of rows actually drawn.
    fig = px.imshow(img_mat, color_continuous_scale='gray',
                    facet_col=0, facet_col_wrap=ncols,
                    height=220*int(np.ceil(len(images)/ncols)))
    # Extract the facet number and convert it back to the class label.
    fig.for_each_annotation(lambda a: a.update(text=images.iloc[int(a.text.split("=")[-1])]['class']))
    return fig
Let's look at each class:
show_images(images.groupby('class',as_index=False).sample(2), ncols=6)
How would we visualize the entire dataset? Let's use PCA to find a low dimensional representation of the images.
First, let's understand the high-dimensional representation. We will extract the matrix of images from the dataframe:
X = np.array(images['images'].to_list())
X.shape
(5000, 28, 28)
We now "unroll" the pixels of each image into a single row vector of 28*28 = 784 dimensions:
X = X.reshape(X.shape[0], -1)
X.shape
(5000, 784)
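As a small illustration of the unrolling step, the sketch below flattens a made-up stack of two 3x3 "images" and then restores it. The array `stack` is hypothetical; only the `reshape` pattern is taken from the cell above.

```python
import numpy as np

# Two hypothetical 3x3 "images" filled with the integers 0..17.
stack = np.arange(2 * 3 * 3).reshape(2, 3, 3)

# Keep the first axis (one row per image) and let -1 infer the remaining 3*3 = 9.
flat = stack.reshape(stack.shape[0], -1)
print(flat.shape)

# Reshaping back recovers the original layout, so no pixel information is lost.
restored = flat.reshape(-1, 3, 3)
print(np.array_equal(restored, stack))
```

Each row of `flat` lists one image's pixels in row-major order, which is the layout PCA expects: one observation per row, one feature per column.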
Center the data so that each pixel (column) has mean zero:
X = X - X.mean(axis=0)
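A tiny worked example of column centering, using a hypothetical 3x2 matrix `data` in place of the image matrix:

```python
import numpy as np

# Hypothetical data: 3 observations (rows) of 2 features (columns).
data = np.array([[1.0, 10.0],
                 [3.0, 20.0],
                 [5.0, 30.0]])

# mean(axis=0) averages down the rows, giving one mean per column: [3, 20].
column_means = data.mean(axis=0)

# Broadcasting subtracts each column's mean from every row.
centered = data - column_means
print(centered)
print(centered.mean(axis=0))
```

After centering, every column averages to zero, so the principal components describe variation around the mean image rather than the mean image itself.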
Run PCA (this time we use scikit-learn):
from sklearn.decomposition import PCA
n_comps = 50
pca = PCA(n_components=n_comps)
# Fit the model so that explained_variance_ratio_ and transform() are available below.
pca.fit(X)
Examining PCA Results¶
# make a line plot and show markers
px.line(y=pca.explained_variance_ratio_ *100, markers=True)
Most of the variance is explained by the first two or three principal components.
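To make the scree plot concrete, the following sketch computes explained-variance ratios by hand from the singular values of a centered matrix. It uses synthetic data with hypothetical feature scales, not the image matrix, and NumPy's SVD rather than scikit-learn.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 points, 5 features with decreasing (hypothetical) scales.
data = rng.standard_normal((200, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])
centered = data - data.mean(axis=0)

# The squared singular values are proportional to the variance along each component.
singular_values = np.linalg.svd(centered, compute_uv=False)
variance_ratio = singular_values**2 / np.sum(singular_values**2)

# The running total shows how much variance the first k components capture together.
cumulative = np.cumsum(variance_ratio)
print(np.round(variance_ratio, 3))
print(np.round(cumulative, 3))
```

When all components are kept, these ratios sum to 1. Reading the cumulative curve is often easier than reading the per-component plot when deciding how many components to keep.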
images[['z1', 'z2', 'z3']] = pca.transform(X)[:, :3]
px.scatter(images, x='z1', y='z2', hover_data=['labels'],
width = 800, height = 800)
px.scatter(images, x='z1', y='z2', color='class', hover_data=['labels'],
width = 800, height = 800)
fig = px.scatter_3d(images, x='z1', y='z2', z='z3', color='class', hover_data=['labels'],
width=1000, height=800)
# set marker size to 5
fig.update_traces(marker=dict(size=5))
For comparison, here is what a random projection onto three dimensions looks like:
rand_basis = np.random.randn(784, 3)
images[['z1_rand', 'z2_rand', 'z3_rand']] = X @ rand_basis
px.scatter_3d(images, x='z1_rand', y='z2_rand', z='z3_rand', color='class', hover_data=['labels'],
width=1000, height=800).update_traces(marker=dict(size=5))
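One way to quantify the visual difference is to compare how much of the total variance each three-dimensional projection retains. The sketch below does this on synthetic data. Unlike the cell above, it orthonormalizes the random basis with a QR decomposition so that both fractions are on the same scale; the helper `captured_variance` is a hypothetical name introduced for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 300 points, 10 features with decreasing (hypothetical) scales.
data = rng.standard_normal((300, 10)) * np.linspace(5.0, 0.5, 10)
centered = data - data.mean(axis=0)

# PCA basis: the top three right singular vectors of the centered data.
_, _, vt = np.linalg.svd(centered, full_matrices=False)
pca_basis = vt[:3].T

# Random basis: three random directions, orthonormalized with QR.
rand_basis, _ = np.linalg.qr(rng.standard_normal((10, 3)))

def captured_variance(basis):
    """Fraction of total variance retained after projecting onto an orthonormal basis."""
    return np.sum((centered @ basis) ** 2) / np.sum(centered ** 2)

pca_share = captured_variance(pca_basis)
rand_share = captured_variance(rand_basis)
print(round(pca_share, 3), round(rand_share, 3))
```

Among all orthonormal three-dimensional bases, the PCA basis retains the most variance, so `pca_share` can never be smaller than `rand_share`.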
Trying other methods¶
Running the cell below might take some time:
from sklearn.manifold import TSNE
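# The iteration count is left at scikit-learn's default (1000) because the keyword for it differs across versions.
tsne_model = TSNE(n_components=3, random_state=0, perplexity=30, learning_rate=200)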
tsne_comps = tsne_model.fit_transform(X)
images[['tsne1', 'tsne2', 'tsne3']] = tsne_comps
px.scatter(images, x='tsne1', y='tsne2', color='class', hover_data=['labels'],
width=1000, height=800)
px.scatter_3d(images, x='tsne1', y='tsne2', z='tsne3', color='class', hover_data=['labels'],
width=1000, height=800).update_traces(marker=dict(size=5))
Finding a lower dimensional basis for
Let's visualize the t-SNE vectors. Note that all embeddings are built off of the principal components, which are rotations of the original features.
When we add class labels to the visualization, notice that t-SNE's clusters correspond reasonably well to the true classes.
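A common way to speed up t-SNE is to reduce the data with PCA first and run t-SNE on the leading components. The sketch below shows that pipeline on two synthetic clusters. The data, the number of retained components, and the perplexity are all hypothetical choices, and this is not necessarily what the cells above do.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Synthetic stand-in for image features: two clusters of 60 points in 20 dimensions.
cluster_a = rng.normal(0.0, 1.0, size=(60, 20))
cluster_b = rng.normal(4.0, 1.0, size=(60, 20))
points = np.vstack([cluster_a, cluster_b])

# Step 1: linear reduction with PCA (5 components is an arbitrary choice here).
reduced = PCA(n_components=5).fit_transform(points)

# Step 2: nonlinear embedding of the reduced data with t-SNE.
embedding = TSNE(n_components=2, perplexity=20, init="pca",
                 random_state=0).fit_transform(reduced)
print(embedding.shape)
```

The perplexity must be smaller than the number of samples, which is why it is set well below 120 here.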
Apply PCA to a subset of the data¶
Let's see if we can build a better embedding for the subset of the data that corresponds to tough images.
classes = ['Coat', 'Pullover']
tough_images = images[images['class'].isin(classes)].copy()
X = np.array(tough_images['images'].to_list())
X = X.reshape(X.shape[0], -1)
X = X - X.mean(axis=0)
zs = PCA(n_components=3).fit_transform(X)
tough_images[['z1', 'z2', 'z3']] = zs
px.scatter_3d(tough_images, x='z1', y='z2', z='z3', color='class', hover_data=['labels'],
width=1000, height=800).update_traces(marker=dict(size=5))
Logistic Regression on these hard images¶
import sklearn.linear_model as lm
model = lm.LogisticRegression(max_iter=1000)
y = tough_images['class'] == "Coat"
model.fit(zs, y)
# Fraction of the training images that the model classifies correctly
np.mean(model.predict(zs) == y)
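The accuracy above is measured on the same points the model was fit to. A held-out estimate is usually more informative. The following sketch shows one way to get it with `train_test_split`, using synthetic 3-D features as a hypothetical stand-in for `zs` and `y`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical stand-in for (zs, y): two overlapping Gaussian clouds in 3-D.
features = np.vstack([rng.normal(0.0, 1.0, size=(200, 3)),
                      rng.normal(1.0, 1.0, size=(200, 3))])
labels = np.repeat([False, True], 200)

# Hold out 25% of the points; stratify keeps the class balance in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=0, stratify=labels)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# score() returns mean accuracy, the same quantity as np.mean(predictions == truth).
train_acc = clf.score(X_train, y_train)
test_acc = clf.score(X_test, y_test)
print(round(train_acc, 3), round(test_acc, 3))
```

To apply this to the notebook's data, one would split `zs` and `y` instead of the synthetic arrays. Strictly speaking, the PCA should then also be fit on the training portion only, so that the test images do not influence the embedding.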
