In this notebook we review some of the basic tools you might use to pull data from a website.
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
Suppose I am interested in looking at photovoltaic output for California. This data is aggregated on the web here:
http://www.energy.ca.gov/almanac/renewables_data/solar/index.php
The Pandas library has basic support for loading tables directly from websites.
tables = pd.read_html("http://www.energy.ca.gov/almanac/renewables_data/solar/index.php")
This returns a list of DataFrames, one for each HTML table found on the page.
len(tables)
Let's look at a little data from each table:
for t in tables:
    display(t.head())
If we return to the website here:
http://www.energy.ca.gov/almanac/renewables_data/solar/index.php
we see that this reflects the content on the website.
Let's save a backup of the data in case we need it again later. (The web changes!)
import pickle
with open("energy_download.pkl", "wb") as f:
pickle.dump(tables, f,)
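If we ever need to restore this backup (for example, if the website changes or goes offline), we can load the pickled list back in. A minimal sketch; tables_backup is just an illustrative name:
# Restore the backed-up list of DataFrames from the pickle file.
with open("energy_download.pkl", "rb") as f:
    tables_backup = pickle.load(f)
len(tables_backup)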
A better way to save large data files would be to use something like the HDF5 format. To do this you need to install PyTables:
!conda install -y pytables
then run the following:
for i, t in enumerate(tables):
    t.to_hdf("energy.h5", "table_" + str(i))
Note that this approach has the limitation that the tables need to be cleaned slightly (e.g., converting column types) before they can be stored efficiently.
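To read a table back from the HDF5 file later, we can use pd.read_hdf with the key used when saving. A quick sketch (the key "table_0" follows the naming in the loop above):
# Load the first table back from the HDF5 store by its key.
restored = pd.read_hdf("energy.h5", "table_0")
restored.head()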
Looking at the tables above, it appears that the first table contains Solar Thermal data and the second contains Solar PV data. We could rely on this ordering assumption, but the order of the tables may change. Let's instead write a function to extract each table by name:
tables[0].iloc[0,0]
[t.iloc[0,0] for t in tables]
We can write a little helper function to find a table by name, which works for at least the first two tables. We could rely on position instead, but if things move it might break our code silently.
def find_table(name, tables):
    return {t.iloc[0, 0]: t for t in tables}[name].copy()
pv_table = find_table("Solar PV", tables)
pv_table.head()
thermal_table = find_table("Solar Thermal", tables)
thermal_table.head()
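If a requested table name is ever missing (e.g., the page layout changes), the dictionary lookup above raises a bare KeyError. A slightly more defensive variant, sketched here as a hypothetical alternative, reports the names that are actually available:
def find_table_strict(name, tables):
    # Hypothetical helper: same lookup as find_table, but with a clearer error.
    lookup = {t.iloc[0, 0]: t for t in tables}
    if name not in lookup:
        raise KeyError(f"Table {name!r} not found; available: {list(lookup)}")
    return lookup[name].copy()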
When extracting data from the web with Pandas we are likely to need to do substantial cleaning.
pv_table.iloc[1,:]
pv_table.columns = pv_table.iloc[1,:]
pv_table.columns.name = None
Note that I reset the name of the column index. Why? Notice that the Series we used for the header has a name as well (its row label, 1). I don't want 1 to be the name of the column index.
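An alternative, which the clean_solar_table function below uses, is to assign the underlying NumPy array rather than the Series, so the name is never carried over in the first place:
# Equivalent alternative: .values drops the Series name, so the column
# index never inherits the row label and no reset is needed.
pv_table.columns = pv_table.iloc[1, :].values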
pv_table.head()
pv_table = pv_table.iloc[2:]
pv_table.head()
It is common in human-readable tables to see summary statistics at the bottom. We don't want to treat these as data rows:
pv_table.tail()
pv_table = pv_table[:-1]
pv_table.tail()
Pandas may not have correctly inferred the types of the numeric fields. However, to run statistics on these fields and create visualizations we will need the correct types.
pv_table.dtypes
pv_table = pv_table.astype({"Year": "int", "Net MWh": "float", "Capacity (MW)": "float"})
pv_table.dtypes
Because we need to clean two tables with the same structure, we will create a simple function that we can debug once and run on both.
def clean_solar_table(table):
    table = table.copy()
    # Extract and set the column names
    table.columns = table.iloc[1, :].values
    # Drop the header rows and the summary row at the end
    table = table.iloc[2:-1]
    # Change types
    table = table.astype({"Year": "int", "Net MWh": "float", "Capacity (MW)": "float"})
    return table.reset_index(drop=True)
clean_solar_table(thermal_table)
Both tables contain the same kind of data. Let's take the union of all the rows, adding an extra column to indicate whether each record is a Thermal or PV record.
def extract_and_combine_pv_and_thermal(tables):
    thermal_table = clean_solar_table(find_table("Solar Thermal", tables))
    pv_table = clean_solar_table(find_table("Solar PV", tables))
    thermal_table["Kind"] = "Thermal"
    pv_table["Kind"] = "PV"
    return pd.concat([thermal_table, pv_table]).reset_index(drop=True)
Extracting the final DataFrame
df = extract_and_combine_pv_and_thermal(tables)
df.head()
df.groupby("Kind")[["Capacity (MW)", "Net MWh"]].sum()
Let's save the data at this point in case we need to recover it later.
df.to_csv("combined_energy_part1.csv")