First things first: why should you look at your data? Isn't statistics enough?
import pandas as pd
data4 = pd.read_csv('data/data4.csv')
data4.head()
datasets = data4.groupby('dataset')
datasets.agg(['count', 'mean', 'var'])
datasets[['x', 'y']].corr().loc[(slice(None), 'x'), 'y']
from scipy import stats
stats.linregress?
datasets.apply(lambda df: stats.linregress(df.x, df.y)[:2])
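The same check can be sketched without the CSV file. The block below hard-codes the four classic 11-point datasets published by Anscombe; that `data/data4.csv` holds exactly these values is an assumption, and the labels `I` to `IV` and the names `quartet` and `summary` are made up for this illustration. It uses plain NumPy (`np.polyfit`, `np.corrcoef`) in place of `stats.linregress`.

```python
import numpy as np

# Anscombe's published values; assumed (not verified) to match data/data4.csv.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    'I':   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    'II':  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    'III': (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    'IV':  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

summary = {}
for name, (xs, ys) in quartet.items():
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    slope, intercept = np.polyfit(xs, ys, 1)   # least-squares straight line
    r = np.corrcoef(xs, ys)[0, 1]              # Pearson correlation
    summary[name] = (xs.mean(), ys.mean(), slope, intercept, r)
    print(f"{name:>3}: mean_x={xs.mean():.2f} mean_y={ys.mean():.2f} "
          f"slope={slope:.3f} intercept={intercept:.3f} r={r:.3f}")
```

Every row should print nearly the same means, fitted line and correlation, even though the point clouds look nothing alike.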
Surely these four datasets must be more or less the same for all statistically meaningful purposes...
But let's double-check to be sure...
%matplotlib inline
#%matplotlib notebook
import seaborn as sns
sns.lmplot(x="x", y="y", col="dataset", hue="dataset", data=data4,
           col_wrap=2, ci=None, height=4);  # `size` was renamed to `height` in seaborn 0.9
These four datasets are known as Anscombe's quartet. But they aren't just a weird pathological specimen. Dinosaurs can do the same:
from IPython.display import Video
Video("https://pbs.twimg.com/tweet_video/CrIDuOhWYAAVzcM.mp4")
For more, see The Datasaurus dozen.
data12 = pd.read_csv('data/DatasaurusDozen.tsv', sep='\t')
data12.info()
data12.head()
datasets12 = data12.groupby('dataset')
datasets12.agg(['count', 'mean', 'var'])
datasets12[['x', 'y']].corr().loc[(slice(None), 'x'), 'y']
datasets12.apply(lambda df: stats.linregress(df.x, df.y)[:2])
sns.lmplot(x="x", y="y", col="dataset", hue="dataset", data=data12,
           col_wrap=4, ci=None, height=4);  # `size` was renamed to `height` in seaborn 0.9
The matplotlib library is a powerful tool capable of producing complex publication-quality figures with fine layout control in two and three dimensions; here we will only provide a minimal self-contained introduction to its usage that covers the functionality needed for the rest of the book. We encourage the reader to read the tutorials included with the matplotlib documentation as well as to browse its extensive gallery of examples that include source code.
Just as we typically use the shorthand np for NumPy, we will use plt for the matplotlib.pyplot module, where the easy-to-use plotting functions reside (the library also contains a rich object-oriented architecture that we don't have the space to discuss here):
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
The plot command:
x = np.random.rand(100)
plt.plot(x)
Plotting a function: $f(x) = \sin(x)$:
x = np.linspace(0, 2*np.pi, 300)
y = np.sin(x)
plt.plot(x, y); # note the ';' at the end of the line, it suppresses the Out[N] block.
The most frequently used function is simply called plot. Here is how you can make a simple plot of $\sin(x)$ and $\sin(x^2)$ for $x \in [0, 2\pi]$ with labels and a grid (we use the semicolon in the last line to suppress the display of some information that is unnecessary right now):
y2 = np.sin(x**2)
plt.plot(x, y, label=r'$\sin(x)$')
plt.plot(x, y2, label=r'$\sin(x^2)$')
plt.title('Some functions')
plt.xlabel('x')
plt.ylabel('y')
plt.grid()
plt.legend();
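For comparison, here is a minimal sketch of the same figure written against the object-oriented interface mentioned earlier. It is only meant to show how the plt functions map onto methods of an Axes object; the variable names `fig` and `ax` are a common convention, not something the rest of this text depends on.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2*np.pi, 300)

# plt.subplots() returns the Figure and a single Axes to draw on.
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label=r'$\sin(x)$')
ax.plot(x, np.sin(x**2), label=r'$\sin(x^2)$')
ax.set_title('Some functions')   # plt.title  -> ax.set_title
ax.set_xlabel('x')               # plt.xlabel -> ax.set_xlabel
ax.set_ylabel('y')               # plt.ylabel -> ax.set_ylabel
ax.grid()
ax.legend();
```

Holding on to `ax` explicitly becomes useful once a figure contains several panels, as in the image examples further down.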
You can control the style, color and other properties of the lines and markers, for example:
x = np.linspace(0, 2*np.pi, 50)
y = np.sin(x)
plt.plot(x, y, linewidth=2);
plt.plot(x, y, 'o', markersize=5, color='g');
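As a further sketch, color, line style and marker can also be combined in a single format string, or spelled out with keyword arguments. The second curve (`-y`, drawn in red) is added only for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2*np.pi, 50)
y = np.sin(x)

fig, ax = plt.subplots()

# Format string 'g--o': green color, dashed line, circle markers.
(line,) = ax.plot(x, y, 'g--o', linewidth=2, markersize=5)

# The same kind of styling with explicit keywords: red, dotted, square markers.
(line2,) = ax.plot(x, -y, color='r', linestyle=':', marker='s', markersize=4)
```

The format string is compact for quick exploration; keyword arguments are easier to read when more than two or three properties are set.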
a = np.random.rand(5,10)
a
plt.imshow(a, interpolation='bilinear', cmap=plt.cm.BuPu)
plt.figure()
plt.imshow(a, interpolation='bicubic', cmap=plt.cm.Blues)
plt.figure()
plt.imshow(a, interpolation='nearest', cmap=plt.cm.Blues);
img = plt.imread('data/dessert.png')
img.shape
plt.imshow(img);
Plot the r, g, b channels of the image. If we want to directly compare the intensity of the color data in each channel, it's visually clearest if we do so by showing all individual channels as grayscale.
With the call clim=(0,1), we ensure that the grayscale colormap spans the whole (0, 1) range for each channel, so that visual comparisons across channels make sense. Otherwise matplotlib would stretch the colormap over each channel's own minimum and maximum, making that comparison misleading.
fig, ax = plt.subplots(1,4, figsize=(10,6))
ax[0].imshow(img[:,:,0], cmap=plt.cm.Greys_r, clim=(0, 1))
ax[1].imshow(img[:,:,1], cmap=plt.cm.Greys_r, clim=(0, 1))
ax[2].imshow(img[:,:,2], cmap=plt.cm.Greys_r, clim=(0, 1))
ax[3].imshow(img);
for a in ax:
    a.set_xticklabels([]); a.set_xticks([])
    a.set_yticklabels([]); a.set_yticks([])
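The effect of clim can be checked on a small synthetic array. This is a sketch, not part of the dessert example: `channel` is a made-up single-channel image whose values only span 0.2 to 0.4, and `get_clim()` reports the color limits each image ended up with.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up single-channel "image" with a narrow value range.
channel = np.linspace(0.2, 0.4, 50).reshape(5, 10)

fig, ax = plt.subplots(1, 2, figsize=(8, 3))
auto = ax[0].imshow(channel, cmap=plt.cm.Greys_r)                # limits taken from the data
fixed = ax[1].imshow(channel, cmap=plt.cm.Greys_r, clim=(0, 1))  # limits fixed to (0, 1)
ax[0].set_title('autoscaled')
ax[1].set_title('clim=(0, 1)')

auto_limits = auto.get_clim()
fixed_limits = fixed.get_clim()
print(auto_limits, fixed_limits)
```

The autoscaled panel uses the full black-to-white range for values between 0.2 and 0.4, while the fixed panel shows the same data as a narrow band of dark grays.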
Alternatively we can show the channels in their own color, which still conveys similar information, though cross-channel comparisons are now modulated by the human visual system. Note, e.g., how despite the fact that the front of the cake has very little green (as seen above in the strict per-channel comparison), in the image below we still can see a reasonable amount of detail once it's displayed in green color. That's because the human visual system is much more sensitive to green light than red or blue:
fig, ax = plt.subplots(1,4, figsize=(10,6))
ax[0].imshow(img[:,:,0], cmap=plt.cm.Reds_r, clim=(0, 1))    # red channel
ax[1].imshow(img[:,:,1], cmap=plt.cm.Greens_r, clim=(0, 1))  # green channel
ax[2].imshow(img[:,:,2], cmap=plt.cm.Blues_r, clim=(0, 1))   # blue channel
ax[3].imshow(img);
for a in ax:
    a.set_xticklabels([]); a.set_xticks([])
    a.set_yticklabels([]); a.set_yticks([])
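One way to put a number on that sensitivity is the set of weights commonly used to turn RGB into a single brightness value. The sketch below uses the ITU-R BT.601 luma coefficients; this is one widely used convention chosen here for illustration, and the helper name `luma` is hypothetical.

```python
import numpy as np

# ITU-R BT.601 luma weights for the R, G and B channels (one common convention).
weights = np.array([0.299, 0.587, 0.114])

def luma(rgb):
    """Weighted sum over the color axis of an RGB(A) array with values in [0, 1]."""
    rgb = np.asarray(rgb, dtype=float)
    return rgb[..., :3] @ weights   # ignore an alpha channel if present

pure = np.eye(3)   # rows: pure red, pure green, pure blue
print(luma(pure))  # green contributes most, blue least
```

Assuming `img` holds float values in [0, 1], as `plt.imread` returns for PNG files, `plt.imshow(luma(img), cmap=plt.cm.Greys_r, clim=(0, 1))` would show the resulting brightness image.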