In this notebook we review some of the basic tools you might use to pull data from a website.
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
Suppose I am interested in looking at photovoltaic output for California. This data is aggregated on the web here:
http://www.energy.ca.gov/almanac/renewables_data/solar/index.php
The Pandas library has basic support for loading tables directly from websites.
tables = pd.read_html("http://www.energy.ca.gov/almanac/renewables_data/solar/index.php")
This returns a list of DataFrames, one for each HTML table found on the page.
len(tables)
Let's look at a little data from each table:
for t in tables:
    display(t.head())
If we return to the website here:
http://www.energy.ca.gov/almanac/renewables_data/solar/index.php
we see that this reflects the content on the website.
Let's save a backup of the data in case we need it again later. (The web changes!)
import pickle
with open("energy_download.pkl", "wb") as f:
pickle.dump(tables, f,)
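If we ever need to restore this backup (for example, if the website changes or goes offline), we can load the pickled list back in. A minimal sketch; tables_backup is just an illustrative name:
# Restore the backed-up list of DataFrames from the pickle file.
with open("energy_download.pkl", "rb") as f:
    tables_backup = pickle.load(f)
len(tables_backup)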
A better way to save large data files would be to use something like the HDF5 format. To do this you need to install PyTables:
!conda install -y pytables
then run the following:
for i, t in enumerate(tables):
    t.to_hdf("energy.h5", "table_" + str(i))
Note that this approach has the limitation that the tables need to be cleaned slightly (e.g., converting column types) before they can be stored efficiently.
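To read a table back from the HDF5 file later, we can use pd.read_hdf with the key used when saving. A quick sketch (the key "table_0" follows the naming in the loop above):
# Load the first table back from the HDF5 store by its key.
restored = pd.read_hdf("energy.h5", "table_0")
restored.head()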
Looking at the tables above, it appears that the first table contains Solar Thermal data and the second contains Solar PV data. We could rely on this ordering assumption, but the order of the tables may change. Let's instead write a function to extract each table by name:
tables[0].iloc[0,0]
[t.iloc[0,0] for t in tables]
We can write a little helper function to find a table by name, which works for at least the first two tables. We could rely on position instead, but if things move it might break our code silently.
def find_table(name, tables):
    return {t.iloc[0, 0]: t for t in tables}[name].copy()
pv_table = find_table("Solar PV", tables)
pv_table.head()
thermal_table = find_table("Solar Thermal", tables)
thermal_table.head()
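If a requested table name is ever missing (e.g., the page layout changes), the dictionary lookup above raises a bare KeyError. A slightly more defensive variant, sketched here as a hypothetical alternative, reports the names that are actually available:
def find_table_strict(name, tables):
    # Hypothetical helper: same lookup as find_table, but with a clearer error.
    lookup = {t.iloc[0, 0]: t for t in tables}
    if name not in lookup:
        raise KeyError(f"Table {name!r} not found; available: {list(lookup)}")
    return lookup[name].copy()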
When extracting data from the web with Pandas we are likely to need to do substantial cleaning.
pv_table.iloc[1,:]
pv_table.columns = pv_table.iloc[1,:]
pv_table.columns.name = None
Note that I reset the name of the column index. Why? Notice that the Series we used for the header has a name as well (its row label, 1). I don't want 1 to be the name of the column index.
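An alternative, which the clean_solar_table function below uses, is to assign the underlying NumPy array rather than the Series, so the name is never carried over in the first place:
# Equivalent alternative: .values drops the Series name, so the column
# index never inherits the row label and no reset is needed.
pv_table.columns = pv_table.iloc[1, :].values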
pv_table.head()
pv_table = pv_table.iloc[2:]
pv_table.head()
It is common in human-readable tables to see summary statistics at the bottom. We don't want to treat these as data rows:
pv_table.tail()
pv_table = pv_table[:-1]
pv_table.tail()
Pandas may not have correctly inferred the types of the numeric fields. However, to run statistics on these fields and create visualizations we will need the correct types.
pv_table.dtypes
pv_table = pv_table.astype({"Year": "int", "Net MWh": "float", "Capacity (MW)": "float"})
pv_table.dtypes
Because we need to clean two tables with the same structure, we will create a simple function that we can debug once and run on both.
def clean_solar_table(table):
    table = table.copy()
    # Extract and set the column names
    table.columns = table.iloc[1, :].values
    # Drop the header rows and the summary row at the end
    table = table.iloc[2:-1]
    # Change types
    table = table.astype({"Year": "int", "Net MWh": "float", "Capacity (MW)": "float"})
    return table.reset_index(drop=True)
clean_solar_table(thermal_table)
Both tables contain the same kind of data. Let's take the union of all the rows, adding an extra column to indicate whether each record is a Thermal or PV record.
def extract_and_combine_pv_and_thermal(tables):
    thermal_table = clean_solar_table(find_table("Solar Thermal", tables))
    pv_table = clean_solar_table(find_table("Solar PV", tables))
    thermal_table["Kind"] = "Thermal"
    pv_table["Kind"] = "PV"
    return pd.concat([thermal_table, pv_table]).reset_index(drop=True)
Extracting the final DataFrame
df = extract_and_combine_pv_and_thermal(tables)
df.head()
df.groupby("Kind")[["Capacity (MW)", "Net MWh"]].sum()
Let's save the data at this point in case we need to recover it later.
df.to_csv("combined_energy_part1.csv")