Lecture 24 – Data 100, Fall 2024

In [1]:
import pandas as pd
import numpy as np
import scipy as sp
import plotly.express as px
import seaborn as sns

Working with High Dimensional Data

In the following cells, we use visualization tools to see how far we can push plotting the MPG dataset as the number of dimensions grows:

In [2]:
mpg = sns.load_dataset("mpg").dropna()
mpg.head()
Out[2]:
mpg cylinders displacement horsepower weight acceleration model_year origin name
0 18.0 8 307.0 130.0 3504 12.0 70 usa chevrolet chevelle malibu
1 15.0 8 350.0 165.0 3693 11.5 70 usa buick skylark 320
2 18.0 8 318.0 150.0 3436 11.0 70 usa plymouth satellite
3 16.0 8 304.0 150.0 3433 12.0 70 usa amc rebel sst
4 17.0 8 302.0 140.0 3449 10.5 70 usa ford torino

Visualizing 1 Dimensional Data

In [3]:
px.histogram(mpg, x="displacement")

Visualizing 2 Dimensional Data

In [4]:
px.scatter(mpg, x="displacement", y="horsepower")

Visualizing 3 Dimensional Data

In [5]:
fig = px.scatter_3d(mpg, x="displacement", y="horsepower", z="weight",
                    width=800, height=800)
fig.update_traces(marker=dict(size=3))

Visualizing 4 Dimensional Data

In [6]:
fig = px.scatter_3d(mpg, x="displacement", 
                    y="horsepower", 
                    z="weight", 
                    color="model_year",
                    width=800, height=800, 
                    opacity=.7)
fig.update_traces(marker=dict(size=5))

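Visualizing 6 Dimensional Data

This plot encodes six variables: three spatial axes, color (model_year), marker size (mpg), and marker symbol (origin). Try clicking on the origin symbols in the legend to see how the plot changes.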

In [7]:
fig = px.scatter_3d(mpg, x="displacement", 
                    y="horsepower", 
                    z="weight", 
                    color="model_year",
                    size="mpg",
                    symbol="origin",
                    width=900, height=800, 
                    opacity=.7)
# hide the color bar and fix the axis ranges
fig.update_layout(coloraxis_showscale=False,
                  scene=dict(xaxis_range=[50, 500],
                             yaxis_range=[40, 250],
                             zaxis_range=[1000, 5000]))

Visualizing data in high-dimensional space is challenging. Plots like these can be helpful when explored interactively, but they are difficult to interpret as static images.

Dimensionality Reduction

One common approach to visualizing high-dimensional data is to use dimensionality reduction techniques. These techniques aim to find a lower-dimensional representation of the data that captures the most important information.
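To make "a lower-dimensional representation that captures the most important information" concrete, here is a minimal sketch of PCA written with NumPy's SVD. The data is synthetic and purely illustrative (an assumption, not the MPG dataset): 200 points in 5 dimensions that mostly vary along 2 hidden directions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): 5 observed columns
# generated from 2 hidden directions plus a small amount of noise.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 5))

# Center each column, then take the SVD of the centered matrix.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Rows of Vt are the principal directions; project onto the first two.
Z = X_centered @ Vt[:2].T

# Fraction of the total variance captured by each component.
explained = S**2 / np.sum(S**2)

print(Z.shape)
print(explained[:2].sum())
```

Because this data was built from two hidden directions, the first two components should account for nearly all of the variance. Real datasets rarely compress this cleanly.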

In [8]:
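from sklearn.decomposition import PCA
pca = PCA(n_components=2)

# one-hot encode the categorical "origin" column; numeric columns pass through unchanged
X = pd.get_dummies(mpg[["displacement", "horsepower", "weight", "model_year", "origin", "mpg"]])
# project every car onto the first two principal components
zs = pca.fit_transform(X)
mpg[["z1", "z2"]] = zs
mpg.head()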
Out[8]:
mpg cylinders displacement horsepower weight acceleration model_year origin name z1 z2
0 18.0 8 307.0 130.0 3504 12.0 70 usa chevrolet chevelle malibu 536.462765 50.770168
1 15.0 8 350.0 165.0 3693 11.5 70 usa buick skylark 320 730.376262 79.103119
2 18.0 8 318.0 150.0 3436 11.0 70 usa plymouth satellite 470.999791 75.360935
3 16.0 8 304.0 150.0 3433 12.0 70 usa amc rebel sst 466.436304 62.509155
4 17.0 8 302.0 140.0 3449 10.5 70 usa ford torino 481.692727 55.684400

In [9]:
fig = px.scatter(mpg, x="z1", y="z2", color="model_year", symbol="origin", 
                 hover_data=["displacement", "horsepower", "weight", "name"])
fig.update_layout(legend=dict(x=.92, y=1), xaxis_range=[-1500, 2500], yaxis_range=[-200, 300])
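One thing to keep in mind when reading this plot: PCA centers each column but does not rescale it, so columns with large numeric ranges tend to dominate the components. The z1 values above are in the hundreds, which suggests z1 largely tracks weight. The sketch below illustrates the effect with NumPy on two hypothetical, synthetic features on very different scales (stand-ins, not the real MPG columns), comparing the first principal direction before and after standardizing.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300

# Hypothetical stand-ins (not the real MPG data): one feature measured in
# the thousands, and a correlated feature measured in the tens.
big = rng.normal(3000, 800, size=n)
small = 45 - 0.0075 * big + rng.normal(0, 3, size=n)
X = np.column_stack([big, small])

def first_component(X):
    """Return the first principal direction of X (unit vector)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]

# Without rescaling, the large-scale feature dominates the first component.
raw_pc1 = first_component(X)

# After standardizing each column to unit variance, both features contribute.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
std_pc1 = first_component(X_std)

print(np.abs(raw_pc1))
print(np.abs(std_pc1))
```

Whether to standardize is a judgment call that depends on whether the original units are meaningful for the comparison. With scikit-learn, one common option is to apply `StandardScaler` to the feature matrix before calling `PCA`.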