Lecture 6 – Data 100, Fall 2022

by Josh Hug

adapted from material by Ani Adhikari, Suraj Rampure, and Fernando Pérez.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
births = pd.read_csv('baby.csv')
In [3]:
plt.rcParams["hist.bins"]
Out[3]:
10
In [4]:
births.head()
Out[4]:
Birth Weight Gestational Days Maternal Age Maternal Height Maternal Pregnancy Weight Maternal Smoker
0 120 284 27 62 100 False
1 113 282 33 64 135 False
2 128 279 28 64 115 True
3 108 282 23 67 125 True
4 136 286 25 62 93 False
In [5]:
births.shape
Out[5]:
(1174, 6)

Bar Plots

We often use bar plots to display distributions of a categorical variable:

In [6]:
births['Maternal Smoker'].value_counts()
Out[6]:
False    715
True     459
Name: Maternal Smoker, dtype: int64

This would have been the Data 8 code to do something similar:

from datascience import Table
t = Table.from_df(births['Maternal Smoker'].value_counts().reset_index())
t.barh("index", "Maternal Smoker")
In [7]:
births['Maternal Smoker'].value_counts().plot(kind = 'bar');
In [8]:
ms = births['Maternal Smoker'].value_counts()
# Cast the boolean index to strings so matplotlib labels the bars as categories
plt.bar(ms.index.astype(str), ms);

Note: ending a plot call with a semicolon suppresses the text representation of the object that the call returns (the <matplotlib.axes_....>), which the notebook would otherwise display alongside the plot.
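As a small illustration of what the semicolon hides: plotting calls return objects, and a notebook echoes the text representation of the last expression in a cell. The sketch below is only an illustration under a few assumptions: it selects the non-interactive Agg backend so it can run outside a notebook, it hardcodes the two counts from the value_counts output above, and `container` is just an illustrative name.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so this runs outside a notebook
import matplotlib.pyplot as plt

# plt.bar returns an object describing the bars it drew. In a notebook, leaving
# this call as the last line of a cell (with no semicolon) echoes that object's
# text representation next to the figure.
container = plt.bar(["False", "True"], [715, 459])

# Inspect the returned object explicitly instead of relying on the cell echo.
print(type(container).__name__)
print(len(container), "bars drawn")
```

Assigning the result to a variable, as done here, also keeps the echo out of the cell output; the semicolon is simply the shorter habit when the returned object is not needed.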

In [9]:
sns.countplot(data = births, x = 'Maternal Smoker');
In [10]:
import plotly.express as px
px.histogram(births, x = 'Maternal Smoker', color = 'Maternal Smoker')
In [11]:
sns.countplot(data = births, x = 'Maternal Pregnancy Weight');
In [12]:
sns.histplot(data = births, x = 'Maternal Pregnancy Weight');
In [13]:
px.histogram(births, x = 'Maternal Pregnancy Weight')
In [14]:
sns.histplot(data = births, x = 'Maternal Pregnancy Weight', bins = 20);
sns.rugplot(data = births, x = 'Maternal Pregnancy Weight', color = "red");
In [15]:
sns.histplot(data = births, x = 'Maternal Pregnancy Weight', kde = True);
sns.rugplot(data = births, x = 'Maternal Pregnancy Weight', color = "red");

Box Plots

In [16]:
plt.figure(figsize = (3, 6))
sns.boxplot(y = "Birth Weight", data = births);
In [17]:
bweights = births["Birth Weight"]
q1 = np.percentile(bweights, 25)
q2 = np.percentile(bweights, 50)
q3 = np.percentile(bweights, 75)
iqr = q3 - q1
whisk1 = q1 - 1.5*iqr
whisk2 = q3 + 1.5*iqr

whisk1, q1, q2, q3, whisk2
Out[17]:
(73.5, 108.0, 120.0, 131.0, 165.5)
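The quantities in the cell above are the 1.5 × IQR cutoffs (often called fences). Under matplotlib's default whisker convention, which seaborn's box plot also follows by default, the drawn whiskers stop at the most extreme data points that still lie inside those cutoffs, and anything outside is drawn as an individual outlier point. The sketch below makes that distinction concrete on a small made-up array; `box_plot_summary` and `toy` are hypothetical names introduced only for this illustration, not part of the lecture code.

```python
import numpy as np

def box_plot_summary(values, whis=1.5):
    """Return quartiles, fences, whisker ends, and outliers for a 1-D array."""
    values = np.asarray(values, dtype=float)
    q1, q2, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1

    # Fences: the cutoffs computed as whisk1 and whisk2 in the cell above.
    lower_fence = q1 - whis * iqr
    upper_fence = q3 + whis * iqr

    # Whiskers end at the most extreme observations inside the fences.
    inside = values[(values >= lower_fence) & (values <= upper_fence)]
    outliers = values[(values < lower_fence) | (values > upper_fence)]

    return {
        "q1": q1, "median": q2, "q3": q3,
        "lower_fence": lower_fence, "upper_fence": upper_fence,
        "lower_whisker": inside.min(), "upper_whisker": inside.max(),
        "outliers": outliers,
    }

# Made-up data with one obviously extreme value.
toy = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])
summary = box_plot_summary(toy)
for name, value in summary.items():
    print(name, value)
```

Applied to the lecture data, the same function could be called as `box_plot_summary(births["Birth Weight"])`, assuming `births` is loaded as in the cells above.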

Violin Plots

In [18]:
plt.figure(figsize = (3, 6))
sns.violinplot(y=births["Birth Weight"]);

Side by side box plots and violin plots

In [19]:
plt.figure(figsize=(5, 8))
sns.boxplot(data=births, x = 'Maternal Smoker', y = 'Birth Weight');
In [20]:
plt.figure(figsize=(5, 8))
sns.violinplot(data=births, x = 'Maternal Smoker', y = 'Birth Weight');

Scatter plots

In [21]:
births.head()
Out[21]:
Birth Weight Gestational Days Maternal Age Maternal Height Maternal Pregnancy Weight Maternal Smoker
0 120 284 27 62 100 False
1 113 282 33 64 135 False
2 128 279 28 64 115 True
3 108 282 23 67 125 True
4 136 286 25 62 93 False
In [22]:
plt.scatter(births['Maternal Height'], births['Birth Weight']);
plt.xlabel('Maternal Height')
plt.ylabel('Birth Weight');
In [23]:
plt.scatter(data=births, x='Maternal Height', y='Birth Weight');
plt.xlabel('Maternal Height')
plt.ylabel('Birth Weight');
In [24]:
sns.scatterplot(data = births, x = 'Maternal Height', y = 'Birth Weight', hue = 'Maternal Smoker');
In [25]:
births["Maternal Height (jittered)"] = births["Maternal Height"] + np.random.uniform(-0.2, 0.2, len(births))
# sns.scatterplot returns a matplotlib Axes (not a Figure), so the unused `fig =` assignment is dropped
sns.scatterplot(data = births, x = 'Maternal Height (jittered)', y = 'Birth Weight', hue = 'Maternal Smoker');
In [26]:
# ci=None is the documented way to turn off the confidence band around each regression line
sns.lmplot(data = births, x = 'Maternal Height', y = 'Birth Weight', ci=None, hue='Maternal Smoker');
In [27]:
sns.jointplot(data = births, x = 'Maternal Height', y = 'Birth Weight');
In [28]:
sns.jointplot(data = births, x = 'Maternal Height', y = 'Birth Weight', hue='Maternal Smoker');

Hex plots and contour plots

In [29]:
sns.jointplot(data = births, x = 'Maternal Height', y = 'Birth Weight', kind='hex');
In [30]:
sns.jointplot(data = births, x = 'Maternal Height', y = 'Birth Weight', kind='kde', fill=True);
In [31]:
sns.jointplot(data = births, x = 'Maternal Height', y = 'Birth Weight', kind='kde', hue='Maternal Smoker');

Bonus

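Calling the DataFrame .plot() method with no arguments draws a line plot of every numeric column against the row index, which is rarely a meaningful picture for a table like births!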

In [32]:
births.plot();
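To see why the bare call produces a plot like the one above, the sketch below runs the same call on a tiny stand-in table and then asks for a specific plot instead. It is a rough illustration under stated assumptions: `toy` is a hypothetical DataFrame built from two columns of the first five rows shown in births.head(), the Agg backend is selected so the code runs outside a notebook, and the scatter plot is just one possible alternative.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so this runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for births: two numeric columns from the first five rows shown earlier.
toy = pd.DataFrame({
    "Birth Weight": [120, 113, 128, 108, 136],
    "Gestational Days": [284, 282, 279, 282, 286],
})

# With no arguments, .plot() draws one line per numeric column,
# using the row index (0, 1, 2, ...) as the x-axis.
ax_default = toy.plot()
print(len(ax_default.get_lines()), "lines drawn against the row index")

# Naming the kind of plot and the columns gives a more deliberate picture.
ax_scatter = toy.plot(kind="scatter", x="Gestational Days", y="Birth Weight")
```

Because the row order of births carries no meaning, connecting consecutive rows with lines suggests a trend that is not in the data; choosing the plot type and columns explicitly avoids that.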