Lecture 7 – Data 100, Spring 2022¶

by Josh Hug

adapted from material by Ani Adhikari, Suraj Rampure, and Fernando Pérez.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
%matplotlib inline
In [3]:
births = pd.read_csv('baby.csv')
In [4]:
births.head()
Out[4]:
Birth Weight Gestational Days Maternal Age Maternal Height Maternal Pregnancy Weight Maternal Smoker
0 120 284 27 62 100 False
1 113 282 33 64 135 False
2 128 279 28 64 115 True
3 108 282 23 67 125 True
4 136 286 25 62 93 False
In [5]:
births.shape
Out[5]:
(1174, 6)

Bar Plots¶

We often use bar plots to display distributions of a categorical variable:

In [6]:
births['Maternal Smoker'].value_counts()
Out[6]:
False    715
True     459
Name: Maternal Smoker, dtype: int64
In [7]:
from datascience import Table
t = Table.from_df(births['Maternal Smoker'].value_counts().reset_index())
t.barh("index", "Maternal Smoker")
In [8]:
births['Maternal Smoker'].value_counts().plot(kind = 'bar');
In [9]:
ms = births['Maternal Smoker'].value_counts();
plt.bar(ms.index, ms);

Note: putting a semicolon after a plot call hides all of the unnecessary text that comes after it (the <matplotlib.axes_....>).

In [10]:
sns.countplot(data = births, x = 'Maternal Smoker');
In [11]:
import plotly.express as px
px.histogram(births, x = 'Maternal Smoker', color = 'Maternal Smoker')