Pandas Downloading Tables

In this notebook we review some of the basic tools you might use to pull data from a website.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

Examining Solar Energy Usage:

Suppose I am interested in looking at photovoltaic output for California. This data is aggregated on the web here:

http://www.energy.ca.gov/almanac/renewables_data/solar/index.php

Load a Web Tables with Pandas

The Pandas library has basic support for loading tables directly from websites.

In [2]:
tables = pd.read_html("http://www.energy.ca.gov/almanac/renewables_data/solar/index.php")

This returns a list of tables.

In [3]:
len(tables)
Out[3]:
5

Lets look at a little data from each table:

In [4]:
for t in tables:
    display(t.head())
0 1 2 3 4 5 6 7
0 Solar Thermal NaN NaN NaN NaN NaN NaN NaN
1 Year Company Name EIA Plant ID CEC Plant ID Plant Name State Capacity (MW) Net MWh
2 2016 Genesis Solar LLC 57394 S0259 Genesis Solar Energy Project CA 250.0 634079
3 2016 Solar Partners I II VIII LLC 57074 S0078 Ivanpah I (Solar Partners II) CA 126.0 255762
4 2016 Solar Partners I II VIII LLC 57073 S0079 Ivanpah II (Solar Partners I) CA 133.0 199870
0 1 2 3 4 5 6 7
0 Solar PV NaN NaN NaN NaN NaN NaN NaN
1 Year Company Name EIA Plant ID CEC Plant ID Plant Name State Capacity (MW) Net MWh
2 2016 Not Available MISSING S9210 Abby Power LLC CA 1.5 2628
3 2016 Dominion Solar - CA 58984 S0312 Adams East LLC CA 19.0 49880
4 2016 Adelanto Solar LLC 59441 S0410 Adelanto Solar I CA 20.0 57181
0 1 2 3 4 5 6 7
0 Year Company Name EIA Plant ID CEC Plant ID Plant Name State Capacity (MW) Net MWh
1 2016 Agua Caliente Solar LLC 57373 S0242 Agua Caliente Solar (AZ) AZ 290.0 743533
2 2016 LS Power 57680 S0292 Arlington Valley Solar Energy II AZ 129.0 364396
3 2016 Copper Mountain Solar 57205 S0243 Copper Mountain I (NV) NV 58.0 117353
4 2016 Copper Mountain Solar 58017 S0244 Copper Mountain II (NV) NV 155.0 356633
0 1 2
0 Year Capacity (MW) Net MWh
1 2016 11500.1 23080144
2 2015 8500.9 18061557
3 2014 7053.4 12732661
4 2013 4424.4 5500727
0 1 2
0 County Capacity (MW) Net MWh
1 Alameda 14.5 23570
2 Amador 1.5 1971
3 Butte 9.5 16165
4 CA 2.0 348

If we return to the website here:

http://www.energy.ca.gov/almanac/renewables_data/solar/index.php

we see that this reflects the content on the website.

Make Backup!

Save a backup of the data in case we need it again later ... (The web changes!!)

In [5]:
import pickle
with open("energy_download.pkl", "wb") as f:
    pickle.dump(tables, f,)

A better way to save large data files would be to use something like the hd5 format. To do this you need to install PyTables:

!conda install -y pytables

then run the following:

for i, t in enumerate(tables):
    t.to_hdf("energy.h5", "table_" + str(i))

Note that this has the limitation that the table needs to be cleaned slightly to be stored efficiently (e.g., mapping types).

Extracting the Photovoltaic Table

Looking at the tables above it appears that the first table contains Solar Thermal data and the second contains Solar PV data. We could rely on this ordering assumption but the order of tables may change. Let's instead write. function to extract the table by name:

In [6]:
tables[0].iloc[0,0]
Out[6]:
'Solar Thermal'
In [7]:
[t.iloc[0,0] for t in tables]
Out[7]:
['Solar Thermal', 'Solar PV', 'Year', 'Year', 'County']

We can write a little helper function to find the table by name for at least the first two tables. We could rely on location but if things move it might break our code silently.

In [8]:
def find_table(name, tables):
    return {t.iloc[0,0]: t for t in tables}[name].copy()
In [9]:
pv_table = find_table("Solar PV", tables)
pv_table.head()
Out[9]:
0 1 2 3 4 5 6 7
0 Solar PV NaN NaN NaN NaN NaN NaN NaN
1 Year Company Name EIA Plant ID CEC Plant ID Plant Name State Capacity (MW) Net MWh
2 2016 Not Available MISSING S9210 Abby Power LLC CA 1.5 2628
3 2016 Dominion Solar - CA 58984 S0312 Adams East LLC CA 19.0 49880
4 2016 Adelanto Solar LLC 59441 S0410 Adelanto Solar I CA 20.0 57181
In [10]:
thermal_table = find_table("Solar Thermal", tables)
thermal_table.head()
Out[10]:
0 1 2 3 4 5 6 7
0 Solar Thermal NaN NaN NaN NaN NaN NaN NaN
1 Year Company Name EIA Plant ID CEC Plant ID Plant Name State Capacity (MW) Net MWh
2 2016 Genesis Solar LLC 57394 S0259 Genesis Solar Energy Project CA 250.0 634079
3 2016 Solar Partners I II VIII LLC 57074 S0078 Ivanpah I (Solar Partners II) CA 126.0 255762
4 2016 Solar Partners I II VIII LLC 57073 S0079 Ivanpah II (Solar Partners I) CA 133.0 199870

Cleaning the data

When extracting data from the web with Pandas we are likely to need to do substantial cleaning.

Get the Column Names

In [11]:
pv_table.iloc[1,:]
Out[11]:
0             Year
1     Company Name
2     EIA Plant ID
3     CEC Plant ID
4       Plant Name
5            State
6    Capacity (MW)
7          Net MWh
Name: 1, dtype: object
In [12]:
pv_table.columns = pv_table.iloc[1,:]
pv_table.columns.name = None

Note that I reset name of the columns. Why? Notice that the series object has a name as well (row 1). I don't want 1 to be the name of the column index.

In [13]:
pv_table.head()
Out[13]:
Year Company Name EIA Plant ID CEC Plant ID Plant Name State Capacity (MW) Net MWh
0 Solar PV NaN NaN NaN NaN NaN NaN NaN
1 Year Company Name EIA Plant ID CEC Plant ID Plant Name State Capacity (MW) Net MWh
2 2016 Not Available MISSING S9210 Abby Power LLC CA 1.5 2628
3 2016 Dominion Solar - CA 58984 S0312 Adams East LLC CA 19.0 49880
4 2016 Adelanto Solar LLC 59441 S0410 Adelanto Solar I CA 20.0 57181

Removing Header Rows from the Data

In [14]:
pv_table = pv_table.iloc[2:]
In [15]:
pv_table.head()
Out[15]:
Year Company Name EIA Plant ID CEC Plant ID Plant Name State Capacity (MW) Net MWh
2 2016 Not Available MISSING S9210 Abby Power LLC CA 1.5 2628
3 2016 Dominion Solar - CA 58984 S0312 Adams East LLC CA 19.0 49880
4 2016 Adelanto Solar LLC 59441 S0410 Adelanto Solar I CA 20.0 57181
5 2016 Adelanto Solar LLC 59440 S0411 Adelanto Solar II CA 7.0 17202
6 2016 Los Angeles Department of Water & Power (LADWP) 57305 S0218 Adelanto Solar Power Project CA 10.0 16245

Checking the Bottom of the Table for Summaries

It is common on human readable tables to see summary statistics at the bottom. We don't want to treat these as rows:

In [16]:
pv_table.tail()
Out[16]:
Year Company Name EIA Plant ID CEC Plant ID Plant Name State Capacity (MW) Net MWh
644 2016 Not Available MISSING S9040 Wines US Inc. Constellation CA 1.0 1622
645 2016 8me, LLC 59008 S0445 Woodmere Solar Farm CA 18.8 35916
646 2016 Yolo County General Services 57422 S0115 Yolo County Solar Project CA 1.0 1940
647 2016 Not Available MISSING S9078 Yuba College CA 1.0 1622
648 NaN Total 8616.1 17234539 NaN NaN NaN NaN
In [17]:
pv_table = pv_table[:-1]
In [18]:
pv_table.tail()
Out[18]:
Year Company Name EIA Plant ID CEC Plant ID Plant Name State Capacity (MW) Net MWh
643 2016 Duke Energy Renewables 59253 S0546 Wildwood Solar II CA 15.0 619
644 2016 Not Available MISSING S9040 Wines US Inc. Constellation CA 1.0 1622
645 2016 8me, LLC 59008 S0445 Woodmere Solar Farm CA 18.8 35916
646 2016 Yolo County General Services 57422 S0115 Yolo County Solar Project CA 1.0 1940
647 2016 Not Available MISSING S9078 Yuba College CA 1.0 1622

Fixing Data Types

Pandas may not have correctly inferred the types of numeric fields. However run statistics on these fields and create visualizations we will need their type information.

In [19]:
pv_table.dtypes
Out[19]:
Year             object
Company Name     object
EIA Plant ID     object
CEC Plant ID     object
Plant Name       object
State            object
Capacity (MW)    object
Net MWh          object
dtype: object
In [20]:
pv_table = pv_table.astype({"Year": "int", "Net MWh": "float", "Capacity (MW)": "float"})
In [21]:
pv_table.dtypes
Out[21]:
Year               int64
Company Name      object
EIA Plant ID      object
CEC Plant ID      object
Plant Name        object
State             object
Capacity (MW)    float64
Net MWh          float64
dtype: object

Creating a Generic Cleaning Function

Because we need to clean two roughly identical tables with the same structure we will create a simple function that we can debug once and run on both tables.

In [22]:
def clean_solar_table(table):
    table = table.copy()
    # Extract and set the column names
    table.columns = table.iloc[1,:].values
    # drop headers and summary at end
    table = table.iloc[2:-1]
    # Change types
    table = table.astype({"Year": "int", "Net MWh": "float", "Capacity (MW)": "float"})
    return table.reset_index(drop=True)
In [23]:
clean_solar_table(thermal_table)
Out[23]:
Year Company Name EIA Plant ID CEC Plant ID Plant Name State Capacity (MW) Net MWh
0 2016 Genesis Solar LLC 57394 S0259 Genesis Solar Energy Project CA 250.0 634079.0
1 2016 Solar Partners I II VIII LLC 57074 S0078 Ivanpah I (Solar Partners II) CA 126.0 255762.0
2 2016 Solar Partners I II VIII LLC 57073 S0079 Ivanpah II (Solar Partners I) CA 133.0 199870.0
3 2016 Solar Partners I II VIII LLC 57075 S0080 Ivanpah III (Solar Partners VIII) CA 133.0 247703.0
4 2016 Mojave Solar LLC 57331 S0104 Mojave Solar Project CA 250.0 624903.0
5 2016 Luz Solar Partners Ltd III 10439 S0071 SEGS III CA 34.2 55739.0
6 2016 Luz Solar Partners Ltd IV 10440 S0072 SEGS IV CA 34.2 55211.0
7 2016 Luz Solar Partners Ltd IX 10446 S0073 SEGS IX CA 92.0 157967.0
8 2016 Luz Solar Partners Ltd V 10441 S0074 SEGS V CA 34.2 55842.0
9 2016 Luz Solar Partners Ltd VI 10442 S0075 SEGS VI CA 35.0 50813.0
10 2016 Luz Solar Partners Ltd VII 10443 S0076 SEGS VII CA 35.0 53936.0
11 2016 Luz Solar Partners Ltd VIII 10444 S0077 SEGS VIII CA 92.0 156265.0

Combining the Thermal and PV Data

Both tables contain the same data. Let's take the union of all the rows adding an extra column to indicate whether the record is a Thermal or PV record.

In [24]:
def extract_and_combine_pv_and_thermal(tables):
    thermal_table = clean_solar_table(find_table("Solar Thermal", tables))
    pv_table = clean_solar_table(find_table("Solar PV", tables))
    thermal_table["Kind"] = "Thermal"
    pv_table["Kind"] = "PV"
    return pd.concat([thermal_table, pv_table]).reset_index(drop=True)

Extracting the final DataFrame

In [25]:
df = extract_and_combine_pv_and_thermal(tables)
df.head()
Out[25]:
Year Company Name EIA Plant ID CEC Plant ID Plant Name State Capacity (MW) Net MWh Kind
0 2016 Genesis Solar LLC 57394 S0259 Genesis Solar Energy Project CA 250.0 634079.0 Thermal
1 2016 Solar Partners I II VIII LLC 57074 S0078 Ivanpah I (Solar Partners II) CA 126.0 255762.0 Thermal
2 2016 Solar Partners I II VIII LLC 57073 S0079 Ivanpah II (Solar Partners I) CA 133.0 199870.0 Thermal
3 2016 Solar Partners I II VIII LLC 57075 S0080 Ivanpah III (Solar Partners VIII) CA 133.0 247703.0 Thermal
4 2016 Mojave Solar LLC 57331 S0104 Mojave Solar Project CA 250.0 624903.0 Thermal
In [26]:
df.groupby("Kind")[["Capacity (MW)", "Net MWh"]].sum()
Out[26]:
Capacity (MW) Net MWh
Kind
PV 8616.5 17234537.0
Thermal 1248.6 2548090.0

Saving a backup

Let's save the data at this point in case we need to recover it later.

In [27]:
df.to_csv("combined_energy_part1.csv")