Data 100, Spring 2024
A demo of data cleaning and exploratory data analysis using the Mauna Loa CO2 data.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#%matplotlib inline
plt.rcParams['figure.figsize'] = (12, 9)
import plotly.express as px
sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option('display.max_rows', 30)
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
pd.set_option('display.float_format', '{:.2f}'.format)
# Silence some spurious seaborn warnings
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
CO2 concentrations have been monitored at Mauna Loa Observatory since 1958 (website link).
co2_file = "data/co2_mm_mlo.txt"
Let's do some EDA!!
Let's instead check out this file with JupyterLab.
.txt
file.Looking at the first few lines of the data, we spot some relevant characteristics:
We can use read_csv
to read the data into a Pandas data frame, and we provide several arguments to specify that the separators are white space, there is no header (we will set our own column names), and to skip the first 72 rows of the file.
co2 = pd.read_csv(
co2_file,
header = None,
comment = "#",
sep = r'\s+' #delimiter for continuous whitespace (stay tuned for regex next lecture))
)
co2.head()
0 | 1 | 2 | 3 | 4 | 5 | 6 | |
---|---|---|---|---|---|---|---|
0 | 1958 | 3 | 1958.21 | 315.71 | 315.71 | 314.62 | -1 |
1 | 1958 | 4 | 1958.29 | 317.45 | 317.45 | 315.29 | -1 |
2 | 1958 | 5 | 1958.38 | 317.50 | 317.50 | 314.71 | -1 |
3 | 1958 | 6 | 1958.46 | -99.99 | 317.10 | 314.85 | -1 |
4 | 1958 | 7 | 1958.54 | 315.86 | 315.86 | 314.98 | -1 |
Congratulations! You've wrangled the data!
...But our columns aren't named. We need to do more EDA.
co2 = pd.read_csv(
co2_file,
header = None,
comment = "#",
sep = '\s+', #regex for continuous whitespace (next lecture)
names = ['Yr', 'Mo', 'DecDate', 'Avg', 'Int', 'Trend', 'Days']
)
co2.head()
Yr | Mo | DecDate | Avg | Int | Trend | Days | |
---|---|---|---|---|---|---|---|
0 | 1958 | 3 | 1958.21 | 315.71 | 315.71 | 314.62 | -1 |
1 | 1958 | 4 | 1958.29 | 317.45 | 317.45 | 315.29 | -1 |
2 | 1958 | 5 | 1958.38 | 317.50 | 317.50 | 314.71 | -1 |
3 | 1958 | 6 | 1958.46 | -99.99 | 317.10 | 314.85 | -1 |
4 | 1958 | 7 | 1958.54 | 315.86 | 315.86 | 314.98 | -1 |
Scientific studies tend to have very clean data, right...? Let's jump right in and make a time series plot of CO2 monthly averages.
# sns.lineplot(x='DecDate', y='Avg', data=co2); # plotting with seaborn
px.line(co2, x='DecDate', y='Avg',
markers=True, height=600)