Introduction to SQL Part 2¶

In this notebook we will connect to a remote SQL database and examine more advanced SQL queries. As with the first notebook, we will be using the core sqlalchemy library as an abstraction layer on-top of the underlying database management system. We will not be using the sophisticated object relational management features available in sqlalchemy.

import sqlalchemy
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as p

Connecting to the remote database¶

Often the database will be hosted on a remote server. In data100 we will use a hosted database. Because many students will be connecting to the same database we have created a list of databases and ask that you use the following script to generate the db_uri. This script will pick a random database each time you connect to try and balance the load across machines.

def get_db_url(password="dataisgreat"):
    import pandas as pd
    # Download the (potentially changing) list of databases
    db_list = pd.read_csv("http://ds100.org/sp18/assets/sql/db_list.csv")
    # Pick a random databse from the list
    db_info = db_list.sample().iloc[0]
    # Get it's info
    username = db_info['username']
    server = db_info['host']
    dbname = db_info['dbname']
    # Construct the database connection string
    db_url = 'postgres+psycopg2://{}:{}@{}/{}'.format(
        username, password, server, dbname)
    return db_url

db_url = get_db_url()
print(db_url)

postgres+psycopg2://student:dataisgreat@data100sp18-prod-db2.westus2.cloudapp.azure.com/data100

engine = sqlalchemy.create_engine(db_url, connect_args={'sslmode':'require'})

Connecting to A Local Postgres Database¶

You can also try to setup your own local Postgres Database. You will need:

The FEC (and lecture) SQL data
Install postres mac, windows, or linux.
Install the FEC data into a Postgres database (instructions). On mac and linux it looks like:

bunzip2 -c filename.gz | psql ds100

To connect to your local database you create an engine with the following command:

engine = sqlalchemy.create_engine("postgres://postgres:@localhost:5432/ds100")

#engine = sqlalchemy.create_engine("postgres://postgres:@localhost:5432/ds100")

Postgres Command Line Tool¶

For those that are interested in exploring the Postgres command line tool, you can connect to the database using the following command in a terminal.

def get_db_psql_command(password="dataisgreat"):
    import pandas as pd
    # Download the (potentially changing) list of databases
    db_list = pd.read_csv("http://ds100.org/sp18/assets/sql/db_list.csv")
    # Pick a random databse from the list
    db_info = db_list.sample().iloc[0]
    # Get it's info
    username = db_info['username']
    server = db_info['host']
    dbname = db_info['dbname']
    # Construct the database connection string
    command = 'PGPASSWORD={} psql -h {} -U {} --dbname {}'.format(
        password, server, username, dbname)
    return command

get_db_psql_command()

'PGPASSWORD=dataisgreat psql -h data100sp18-prod-db1.westus2.cloudapp.azure.com -U student --dbname data100'

Note that we will not be needing to use this tool in lecture.

Listing the Tables in the Database¶

This database contains quite a few more tables. We will explore these tables in this lecture.

engine.table_names()

['reserves',
 'ccl',
 'cm',
 'cn',
 'indiv_sample',
 'indiv_sample2',
 'individual',
 'my_matrix',
 'oppexp',
 'pas',
 'profs',
 'students',
 'tips',
 'boats',
 'sailors']

Data Generation¶

It is often assumed that when working with a database all relations (tables) must come from outside or be derived from other sources of data. It is possible to construct tables in SQL.

Sometimes it's useful to auto-generate data in queries, rather than examine data in the database. This is nice for testing, but also can be useful to play some computational tricks as you'll see in your homework.

SQL has a simple scalar function called random that returns a random value between 0.0 and 1.0. You can use this if you need to generate a column of random numbers. (The PostgreSQL manual doesn't promise much about the statistical properties of this random number generator.)

Let's roll a 6-sided die for each of the students

sql_expr = """
SELECT *, ROUND(RANDOM() * 6) as roll_dice 
FROM students;
"""
pd.read_sql(sql_expr, engine)

Is this a good implementation of a fair 6 sided die?

SELECT *, ROUND(RANDOM() * 6) as roll_dice 
FROM students;

Suppose we want to generate a whole bunch of random numbers, not related to any particular stored table -- can we do that in SQL?

SQL has a notion of table-valued functions: functions that return tables, and hence can be used in a FROM clause of a query. The standard table-valued function is called generate_series, and it's much like numpy's arange:

sql_expr = """
SELECT * 
FROM GENERATE_SERIES(1, 5);
"""
pd.read_sql(sql_expr, engine)

sql_expr = """
SELECT * 
FROM GENERATE_SERIES(1, 10, 2);
"""
pd.read_sql(sql_expr, engine)

Let's test the distribution of our earlier generator:

sql_expr = """
SELECT ROUND(6*RANDOM()) AS rando, COUNT(*)
FROM GENERATE_SERIES(1, 100000) AS flip(trial)
GROUP BY rando
ORDER BY count
"""
pd.read_sql(sql_expr, engine)

And if we want integers, we can use a PostgreSQL typecast operator (postfix ::<type>):

sql_expr = """
-- NOTE WE ALSO TAKE THE CEIL
-- What would happen if we did ot?

SELECT CEIL(6*RANDOM())::INTEGER AS rando, COUNT(*)
FROM generate_series(1, 100000) AS flip(trial)
GROUP BY rando
ORDER BY count
"""
pd.read_sql(sql_expr, engine)

Making a Random Matrix in SQL?!¶

Now suppose we want to populate a "matrix" relation my_matrix(r, c, v) full of random values. Consider the following numpy code

import numpy as np
np.random.seed(43)
# normally distributed random numbers, mean 0 variance 1
my_matrix = np.random.randint(1,6, (3,2)).astype('float')
my_matrix

array([[ 5.,  1.],
       [ 2.,  3.],
       [ 1.,  4.]])

Saving a Matrix as a table¶

Question: How could we store the above matrix as a table?

Answers:

A table with 3 rows and 2 columns
A table with 9 rows and 3 columns (how?)

Building the table in Numpy

my_matrix.flatten()

array([ 5.,  1.,  2.,  3.,  1.,  4.])

# Advanced numpy (you don't need to know this ...)
(col_id, row_id) = np.meshgrid(np.arange(2), np.arange(3))

mat_a = pd.DataFrame(
    np.vstack([row_id.flatten().T, col_id.flatten(), my_matrix.flatten()]).T, 
    columns=['r', 'c', 'v'])

mat_a

In this relational version we need to explicitly generate the r and c values. We can do this via SQL's built-in Cartesian product!

sql_expr = """
SELECT rows.r, columns.c, CEIL(6*RANDOM())::INTEGER AS v
  FROM generate_series(0,2) AS rows(r),
       generate_series(0,1) AS columns(c);
"""
pd.read_sql(sql_expr, engine)

A few take-aways from the previous cell:

Notice the schema of my_matrix reflects the fact that val is a function of the row (x) and column (y) IDs.
We've said before that the order of rows in a table isn't defined in SQL. Is this relational representation of a "matrix" faithful to the mathematical definition of a matrix? Why or why not?

[Bonus Material] User-defined functions (UDFs)¶

Sometimes we may want a custom scalar function that isn't built into SQL. Some database systems allow you to register your own user-defined functions (UDFs) in one or more programming languages. Conveniently, PostgreSQL allows us to register user-defined functions written in Python. Be aware of two things:

Calling Python for each row in a query is quite a bit slower than using the pre-compiled built-in functions in SQL ... this is akin to the use of Python loops instead of numpy calls. If you can avoid using Python UDFs you should do so to get better performance.
Python is a full-feature programming language with access to your operating system's functionality, which means it can reach outside of the scope of the query and wreak havoc, including running arbitrary UNIX commands. (PostgreSQL refers to this as an untrusted language.) Be very careful with the Python UDFs you use in your Postgres queries! If you want to be safer write UDFs in a trusted language. PostgreSQL has a number of other languages to choose from, including Java and even R!.

First we tell PostgreSQL we want to use the plpythonu package (so named because of "pl" for "programming language", "u" for "untrusted"):

try:
    engine.execute("""
        CREATE EXTENSION IF NOT EXISTS plpythonu;
    """)
    udf_support = True
except Exception as e:
    print(e)
    udf_support = False

(psycopg2.OperationalError) could not open extension control file "/usr/share/postgresql/9.5/extension/plpythonu.control": No such file or directory
 [SQL: '\n        CREATE EXTENSION IF NOT EXISTS plpythonu;\n    ']

Now let's write some trivial Python code and register it as a UDF using the create function command. Since SQL is a typed language, we need to specify the SQL types for the input and output to our function, in addition to the code (within $$ delimiters) and the language:

if udf_support:
    engine.execute("""
    DROP FUNCTION IF EXISTS fib(x INTEGER);
    """)

    engine.execute("""
    CREATE FUNCTION fib(x INTEGER) RETURNS INTEGER
    AS $$
    def fib(x):
        if x < 2:
            return x
        else:
            return fib(x-1) + fib(x-2)
    return fib(x)
    $$ LANGUAGE plpythonu;
    """)

if udf_support:
    sql_expr = """
    SELECT x, fib(x)
    FROM generate_series(1,10) AS row(x);
    """
    display(pd.read_sql(sql_expr, engine))

[Bonus Material] A Discussion on Transactions¶

It is possible to create transactions that isolate changes. This is done by starting a transaction with BEGIN. We can then proceed to make changes to the database. During this time others will not be able to see our changes. Until we end the transactions by saying ROLLBACK or COMMIT:

BEGIN;

UPDATE students SET gpa = 3.0 WHERE name = 'Bill Gates'; 

SELECT * FROM students;

ROLLBACK;

SELECT * FROM students;

Try running this in the postgres shell...

Descriptive Statistics in SQL¶

Statistics doesn't deal with individuals, it deals with groups: distributions, populations, samples and the like. As such, computing statistics in SQL focuses heavily on aggregation functions.

All SQL systems have simple descriptive statistics built in as aggregation functions:

min, max
count
sum
avg
stddev and variance, the sample standard deviation and variance.

PostgreSQL offers many more. Some handy ones include

stddev_pop and var_pop: the population standard deviation and variance, which you should use rather than stddev and variance if you know your data is the full population, not a sample.
covar_samp and covar_pop: sample and population covariance
corr, Pearson's correlation coefficient

Order Statistics: Aggregates requiring ordered input¶

You'll notice that a number of handy statistics are missing from this list, including the median and quartiles. That's because those are order statistics: they are defined based on an ordering of the values in a column.

SQL provides for this by allowing what it calls "ordered set functions", which require a WITHIN GROUP (ORDER BY <columns>) clause to accompany the order-statistic aggregate. For example, to compute the 25th percentile, 50th percentile (median) and 75th percentile in SQL, we can use the following:

sql_expr = """
SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY x) 
FROM generate_series(1,10) AS data(x);
"""
pd.read_sql(sql_expr, engine)

There are two versions of the percentile function:

percentile_cont inuous : interpolates
percentile_disc rete : returns an entry from the table

What will the following expressions return?

sql_expr = """
SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY x) 
FROM generate_series(1,10) AS data(x);
"""
pd.read_sql(sql_expr, engine)

We can compute the edges and middle of the box in a box plot:

sql_expr = """
SELECT 
    percentile_disc(0.25) WITHIN GROUP (ORDER BY x) as lower_quartile,
    percentile_disc(0.5) WITHIN GROUP (ORDER BY x) as median,
    percentile_disc(0.75) WITHIN GROUP (ORDER BY x) as upper_quartile
FROM generate_series(1,10) AS data(x);
"""
pd.read_sql(sql_expr, engine)

Working with Real Data¶

To begin our exploration of real data we will work with the publicly-available campaign finance data from the Federal Election Commission in a PostgreSQL database.

And let's have a look at the individual table's schema:

inspector = sqlalchemy.inspect(engine)
pd.DataFrame(inspector.get_columns("individual")).set_index('name')

If you are curious about the meaning of these columns check out the FEC data description

How big is this table?

sql_expr = """
SELECT COUNT(*)
FROM individual 
"""
pd.read_sql(sql_expr, engine)

Browsing Tables: `LIMIT` and sampling¶

This is not the first topic usually taught in SQL, but it's extremely useful for exploration.

OK, now we have some serious data loaded and we're ready to explore it.

Database tables are often big--hence the use of a database system. When browsing them at first, we may want to look at exemplary rows: e.g., an arbitrary number of rows, or a random sample of the rows.

To look at all of the data in the individual table, we would simply write:

select * from individual;

But that would return 20,347,829 rows into our Jupyter notebook's memory, and perhaps overflow the RAM in your computer. Instead, we could limit the size of the output to the first 3 rows as follows:

sql_expr = """
SELECT *
FROM individual 
LIMIT 4;
"""
pd.read_sql(sql_expr, engine)

Some notes on the `limit` clause:¶

Not only does it produce a small output, it's quite efficient: the database system stops iterating over the table after producing the first three rows, saving the work of examining the other nearly 40 million rows.
Recall that relations have no intrinsic order, so this is some arbitrary choice of 3 rows. Two issues to keep in mind:
1. This is a biased choice of rows. Very likely these are the first 3 rows stored in some disk file managed by the database, which may (for example) be the first 3 rows that were entered into the database, so they may not be representative of rows entered later.
2. The result is non-deterministic. Given that tables are not guaranteed to have an intrinsic order, it is considered correct for an SQL engine to return any 3 rows that satisfy this query, and return a different 3 rows each time depending on the cached data.

Constructing a Bernoulli Sample¶

As data scientists, we should be concerned about spending much time looking at a biased subset of our data. Instead, we might want an i.i.d. random sample of the rows in the table. There are various methods for sampling from a table. A simple one built into many database systems including PostgreSQL is Bernoulli sampling, in which the decision to return each row is made randomly and independently. As a metaphor, the database engine "flips a coin" for each row to decide whether to return it. We can influence the sampling rate by choosing the probability of a "true" result of the coinflip.

This is done on a per-table basis in the FROM clause of the query like so:

sql_expr = """
SELECT *
FROM individual TABLESAMPLE BERNOULLI(.00001);
"""
pd.read_sql(sql_expr, engine)

To learn more about the TABLESAMPLE clause checkout out the select docs. Note that there is a second sampling method called block sampling which is a lot like cluster sampling at the level of pages on disk!

sql_expr = """
SELECT *
FROM individual TABLESAMPLE BERNOULLI(.00001) REPEATABLE(42);
"""
pd.read_sql(sql_expr, engine)

Three things to note relative to our previous limit construct:

Bernoulli sampling is slow: it scales linearly with the table size by iterating through every row in the table.
The number of rows returned by Bernoulli sampling is probabilistic. For a table with $n$ rows and a sampling probability $p$, the output size comes from a binomial distribution with mean $np$ and variance ($np(1-p)$). For a very small $p$, the variance means we could easily get 0 rows back when trying our query!
If we don't know the size of the table, it's hard to choose a practical sampling probability. First we want to count up the number of rows $n$ (see the discussion of aggregation queries below), to inform us of a good $p$ to choose to get our desired output size. That means yet another full pass of the table to compute the count before we compute the sample!

For these reasons, if we want a proper i.i.d sample, it's a good idea to compute a nice-sized sample and store it, keeping it reasonably large for more general use. Since we will not be updating and rows in our individual table, we can do this without worrying that the sample will get "out of date" with respect to the context of individual.

We can use the CREATE TABLE AS SELECT ... (a.k.a. CTAS) pattern to do create a table that saves the output of a query:

DROP TABLE IF EXISTS indiv_sample;

CREATE TABLE indiv_sample AS
SELECT *
FROM individual TABLESAMPLE BERNOULLI(.1) REPEATABLE(42);

However, because we do not have write access to the database we will instead just select a few entires:

sql_expr = """
SELECT *
FROM individual TABLESAMPLE BERNOULLI(.1) REPEATABLE(42)
LIMIT 5;
"""
pd.read_sql(sql_expr, engine)

sql_expr = """
SELECT count(*)
FROM individual TABLESAMPLE BERNOULLI(.1) REPEATABLE(42);
"""
pd.read_sql(sql_expr, engine)

A SRS of Fixed Size¶

Here is an alternative method to construct a random sample of a fixed size. Note that this is not as efficient and will take several minutes to complete.

CREATE TABLE indiv_sample2 AS
SELECT *, RANDOM() AS u
FROM individual 
ORDER BY u
LIMIT 20000;

sql_expr = """
SELECT *, RANDOM() AS u
FROM individual 
ORDER BY u
LIMIT 5;
"""
pd.read_sql(sql_expr, engine)

Selecting rows and columns, and calling scalar (per-row) functions.¶

OK, we already had a peek at the individual table. Now let's look at specific attributes (columns) relates to who is donating how much.

In addition to referencing the columns of individual in the select clause, we can also derive new columns by writing field-level (so-called "scalar") functions. Typically we reference some table columns in those functions.

In our case, let's compute the log of transaction_amt for subsequent plotting. SQL comes with many typical functions you can use in this way, and PostgreSQL is particularly rich on this front; see the PostgreSQL manual for details.

We'll look at indiv_sample rather than individual while we're just exploring.

sql_expr = """
SELECT name, state, cmte_id,
       transaction_amt, log(transaction_amt)
FROM indiv_sample
LIMIT 10;
"""
pd.read_sql(sql_expr, engine)

We can combine SQL with python in the following way:

sql_expr = """
SELECT transaction_amt AS amt
FROM indiv_sample
WHERE transaction_amt > 0;
"""
result = pd.read_sql(sql_expr, engine)

_ = sns.distplot(result['amt'])

sql_expr = """
SELECT LOG(transaction_amt) AS log_amt
FROM indiv_sample
WHERE transaction_amt > 0;
"""
result = pd.read_sql(sql_expr, engine)
df = result['log_amt']
sns.distplot(df.astype('float'))
scales = np.array([1, 10, 20,  100, 500,  1000, 5000])
_ = plt.xticks(np.log10(scales), scales)

Examining the Tail¶

sql_expr = """
SELECT transaction_amt AS amt
FROM indiv_sample
WHERE transaction_amt > 5000;
"""
result = pd.read_sql(sql_expr, engine)

_ = sns.distplot(result['amt'], rug=True)

sql_expr = """
SELECT transaction_amt AS amt
FROM individual
WHERE transaction_amt > 5000;
"""
result = pd.read_sql(sql_expr, engine)

_ = sns.distplot(result['amt'])

sql_expr = """
SELECT log(transaction_amt) AS log_amt
FROM individual
WHERE transaction_amt > 5000;
"""
result = pd.read_sql(sql_expr, engine)

sns.distplot(result['log_amt'])
scales = np.array([5000, 20000, 100000])
_ = plt.xticks(np.log10(scales), scales)

sql_expr = """
SELECT log(transaction_amt) AS log_amt
FROM individual
WHERE transaction_amt > 1000000;
"""
result = pd.read_sql(sql_expr, engine)

sns.distplot(result['log_amt'], rug=True)
scales = np.array([1000000, 5000000, 50000000])
_ = plt.xticks(np.log10(scales), scales)

`CASE` statements: SQL conditionals in the `FROM` clause¶

What about smaller donations?

sql_expr = """
SELECT name, state, cmte_id,
       transaction_amt, LOG(transaction_amt)
FROM indiv_sample
WHERE transaction_amt < 10
LIMIT 10;
"""
try:
    display(pd.read_sql(sql_expr, engine))
except Exception as e:
    print(e)

(psycopg2.DataError) cannot take logarithm of a negative number
 [SQL: '\nSELECT name, state, cmte_id,\n       transaction_amt, LOG(transaction_amt)\nFROM indiv_sample\nWHERE transaction_amt < 10\nLIMIT 10;\n']

Uh oh, log is not defined for numbers <= 0! We need a conditional statement in the select clause to decide what function to call. We can use SQL's case construct for that.

sql_expr = """
SELECT transaction_amt,
    CASE WHEN transaction_amt > 0 THEN log(transaction_amt)
         WHEN transaction_amt = 0 THEN 0
         ELSE -1*(log(abs(transaction_amt)))
    END AS log_amt
FROM indiv_sample
"""
try:
    result = pd.read_sql(sql_expr, engine)
    sns.distplot(result['log_amt'])
except Exception as e:
    print(e)

Who donated the most?¶

sql_expr = """
SELECT transaction_amt, cmte_id, transaction_dt, name, city, state, memo_text, occupation 
FROM individual 
ORDER BY transaction_amt DESC
LIMIT 10
"""
pd.read_sql(sql_expr, engine)

Grouping Contributions by Name¶

sql_expr = """
SELECT name, SUM(transaction_amt) AS total_amt
FROM individual 
GROUP BY name
ORDER BY total_amt DESC
LIMIT 10
"""
pd.read_sql(sql_expr, engine)

Going Local with `WHERE`¶

sql_expr = """
SELECT name, SUM(transaction_amt) AS total_amt
FROM individual
WHERE city = 'SAN FRANCISCO'
GROUP BY name
ORDER BY total_amt DESC
LIMIT 20;
"""
pd.read_sql(sql_expr, engine)

sql_expr = """
SELECT name, SUM(transaction_amt) AS total_amt
FROM individual
WHERE city = 'BERKELEY'
GROUP BY name
ORDER BY total_amt DESC
LIMIT 20;
"""
pd.read_sql(sql_expr, engine)

Named Queries: Views and CTEs¶

Up to now we've looked at a single query at a time. SQL also allows us to nest queries in various ways. In this section we look at the cleaner examples of how to do this in SQL: views and Common Table Expressions (CTEs).

Views¶

In earlier examples, we created new tables and populated them from the result of queries over stored tables. There are two main drawbacks of that approach that may concern us in some cases:

The new table uses up storage, even though it is recomputable from other tables.
Out of date. The stored output will not reflect changes in the input.

For this reason, SQL provides a notion of logical views: these are basically named queries that are re-evaluated upon each reference.

The syntax is straightforward:

CREATE VIEW <name> AS
<SELECT statement>;

The resulting view <name> can be used in an SELECT query, but not in an INSERT, DELETE or UPDATE query!

As an example, we might want a view that stores just some summary statistics of transaction_amts for each date:

try:
    engine.execute("""
        DROP VIEW IF EXISTS date_stats;
        """)

    engine.execute("""
        CREATE VIEW date_stats AS
        SELECT 
            transaction_dt AS day,
            min(transaction_amt), 
            avg(transaction_amt), 
            stddev(transaction_amt),
            max(transaction_amt)
        FROM individual
        GROUP BY transaction_dt
        ORDER BY day;
        """)
    
    sql_expr = """
        SELECT * from date_stats limit 5;
        """
    
    display(pd.read_sql(sql_expr, engine))
    
except Exception as e:
    print("In the shared Data 100 Postgres server you not allowed to create views.\n\n")
    print(e)

In the shared Data 100 Postgres server you not allowed to create views.


(psycopg2.ProgrammingError) permission denied for schema public
 [SQL: '\n        CREATE VIEW date_stats AS\n        SELECT \n            transaction_dt AS day,\n            min(transaction_amt), \n            avg(transaction_amt), \n            stddev(transaction_amt),\n            max(transaction_amt)\n        FROM individual\n        GROUP BY transaction_dt\n        ORDER BY day;\n        ']

Notice that this did not create a table:

inspector.get_table_names()

['reserves',
 'ccl',
 'cm',
 'cn',
 'indiv_sample',
 'indiv_sample2',
 'individual',
 'my_matrix',
 'oppexp',
 'pas',
 'profs',
 'students',
 'tips',
 'boats',
 'sailors']

Instead it created a view: (if you are allowed to create views)

inspector.get_view_names()

['rando']

Views are not materialized¶

Let's create a random table and we will even seed the random number generator.

DROP VIEW IF EXISTS rando;

SELECT setseed(0.3);

CREATE VIEW rando(rownum, rnd) AS
SELECT rownum, round(random())::INTEGER
FROM generate_series(1,50) AS ind(rownum)

What is the sum of the rows in Random:

sql_expr = """
SELECT SUM(rnd) FROM rando;
"""
pd.read_sql(sql_expr, engine)

What was that value again?

sql_expr = """
SELECT SUM(rnd) FROM rando;
"""
pd.read_sql(sql_expr, engine)

</br></br></br>

The value changes with each invocation.

Too Many Views¶

Views can help:

Simplify queries
Make complex queries more readable
Share "sql programs" with others

Problem:

Creating a new view for each (exploratory) query will result in a lot of views!
views like: temp1, temp1_joey, temp1_joey_fixed, ...
Often you will not be allowed to create views.

We need a mechanism to decompose query into views for the scope of a single query.

Common Table Expressions (`WITH`)¶

Think of these as a view that exists only during the query.

If we're only going to use a view within a single query, it is a little inelegant to CREATE it, and then have to DROP it later to recycle the view name.

Common Table Expressions (CTEs) are like views that we use on-the-fly. (If you know about lambdas in Python, you can think of CTEs as lambda views.) The syntax for CTEs is to use a WITH clause in front of the query:

WITH <name> [(renamed columns)] AS (<SELECT statement>) [, <name2> AS (<SELECT statement>)...]

If you need multiple CTEs, you separate them with commas. We can rewrite our query above without a view as follows:

sql_expr = """
WITH per_day_stats AS (
    SELECT 
        to_date(transaction_dt, 'MMDDYYYY') as day, -- Date Parsing
        min(transaction_amt), 
        avg(transaction_amt), 
        stddev(transaction_amt),
        max(transaction_amt)
    FROM indiv_sample
    GROUP BY transaction_dt
)    
SELECT day, stddev, max - min AS spread
FROM per_day_stats
WHERE stddev IS NOT NULL
ORDER by stddev DESC
LIMIT 5
"""
pd.read_sql(sql_expr, engine)

Joins¶

Suppose now we want to determine which committees received the most money

sql_expr = """
SELECT cmte_id, SUM(transaction_amt) AS total_amt
FROM individual 
GROUP BY cmte_id
ORDER BY total_amt DESC
LIMIT 10
"""
pd.read_sql(sql_expr, engine)

What are the names of these Committees?¶

pd.DataFrame(inspector.get_columns("individual")).set_index('name')

pd.DataFrame(inspector.get_columns("cm")).set_index('name')

We can join the committee description to get the names of the committees that received the most funds.

sql_expr = """
WITH indv2cm AS
(
    SELECT cmte_id, SUM(transaction_amt) AS total_amt
    FROM individual
    GROUP BY cmte_id
    ORDER BY total_amt DESC
)
SELECT cm.cmte_nm, indv2cm.total_amt 
FROM cm, indv2cm
WHERE cm.cmte_id = indv2cm.cmte_id
ORDER BY indv2cm.total_amt DESC
LIMIT 10
"""
pd.read_sql(sql_expr, engine)

Which candidate received the most¶

pd.DataFrame(inspector.get_columns("cn")).set_index('name')

Candidate Committee Linkage Table¶

pd.DataFrame(inspector.get_columns("ccl")).set_index('name')

Joining Aggregated Indiv - CCL - CN¶

sql_expr = """
SELECT cn.cand_name, SUM(indiv.transaction_amt) AS total_amt
FROM individual AS indiv, ccl, cn
WHERE indiv.cmte_id = ccl.cmte_id AND
    ccl.cand_id = cn.cand_id 
GROUP BY cn.cand_name
ORDER BY total_amt DESC
LIMIT 10
"""
pd.read_sql(sql_expr, engine)

Localized Join for CA¶

sql_expr = """
SELECT cn.cand_name, SUM(indiv.transaction_amt) AS total_amt
FROM individual AS indiv, ccl, cn
WHERE indiv.cmte_id = ccl.cmte_id AND
    ccl.cand_id = cn.cand_id AND
    indiv.state = 'CA'
GROUP BY cn.cand_name
ORDER BY total_amt DESC
LIMIT 10
"""
pd.read_sql(sql_expr, engine)

Localized Join for FL¶

sql_expr = """
SELECT cn.cand_name, SUM(indiv.transaction_amt) AS total_amt
FROM individual AS indiv, ccl, cn
WHERE indiv.cmte_id = ccl.cmte_id AND
    ccl.cand_id = cn.cand_id AND
    indiv.state = 'FL'
GROUP BY cn.cand_name
ORDER BY total_amt DESC
LIMIT 10
"""
pd.read_sql(sql_expr, engine)

Localized Join for TX¶

sql_expr = """
SELECT cn.cand_name, SUM(indiv.transaction_amt) AS total_amt
FROM individual AS indiv, ccl, cn
WHERE indiv.cmte_id = ccl.cmte_id AND
    ccl.cand_id = cn.cand_id AND
    indiv.state = 'TX'
GROUP BY cn.cand_name
ORDER BY total_amt DESC
LIMIT 10
"""
pd.read_sql(sql_expr, engine)

Tracking Direct Committee Contributions¶

sql_expr = """
SELECT cm.cmte_nm, SUM(transaction_amt) AS total_amt
FROM pas, cm
WHERE pas.cmte_id = cm.cmte_id
GROUP BY cm.cmte_nm
ORDER BY total_amt DESC
LIMIT 5
"""
pd.read_sql(sql_expr, engine)

sql_expr = """
SELECT cn.cand_name, SUM(transaction_amt) AS total_amt
FROM pas, cn
WHERE pas.cand_id = cn.cand_id
GROUP BY cn.cand_name
ORDER BY total_amt DESC
LIMIT 5
"""
pd.read_sql(sql_expr, engine)

	name	gpa	age	dept	gender	roll_dice
0	Sergey Brin	2.8	40	CS	M	6.0
1	Danah Boyd	3.9	35	CS	F	3.0
2	Bill Gates	1.0	60	CS	M	6.0
3	Hillary Mason	4.0	35	DATASCI	F	1.0
4	Mike Olson	3.7	50	CS	M	4.0
5	Mark Zuckerberg	4.0	30	CS	M	4.0
6	Susan Wojcicki	4.0	46	BUSINESS	F	4.0
7	Marissa Meyer	4.0	45	BUSINESS	F	0.0
8	Sheryl Sandberg	4.0	47	BUSINESS	F	4.0

	rando	count
0	0.0	8336
1	6.0	8439
2	3.0	16601
3	5.0	16634
4	4.0	16640
5	1.0	16674
6	2.0	16676

	rando	count
0	5	16475
1	6	16578
2	1	16631
3	2	16726
4	3	16735
5	4	16855

	autoincrement	default	nullable	type
name
cmte_id	False	None	True	VARCHAR(9)
amndt_ind	False	None	True	CHAR(1)
rpt_tp	False	None	True	VARCHAR(3)
transaction_pgi	False	None	True	CHAR(5)
image_num	False	None	True	BIGINT
transaction_tp	False	None	True	VARCHAR(3)
entity_tp	False	None	True	VARCHAR(3)
name	False	None	True	TEXT
city	False	None	True	TEXT
state	False	None	True	CHAR(2)
zip_code	False	None	True	VARCHAR(12)
employer	False	None	True	TEXT
occupation	False	None	True	TEXT
transaction_dt	False	None	True	VARCHAR(9)
transaction_amt	False	None	True	INTEGER
other_id	False	None	True	TEXT
tran_id	False	None	True	TEXT
file_num	False	None	True	BIGINT
memo_cd	False	None	True	TEXT
memo_text	False	None	True	TEXT
sub_id	False	None	True	BIGINT

	cmte_id	amndt_ind	rpt_tp	transaction_pgi	image_num	transaction_tp	entity_tp	name	city	state	...	employer	occupation	transaction_dt	transaction_amt	other_id	tran_id	file_num	memo_cd	memo_text	sub_id
0	C00461046	N	Q1	G	15020142007	15	IND	MCNULTY, TIMOTHY	PITTSBURGH	PA	...	CARNEGIE MELLON UNIVERSITY	VP	03302015	1800	None	SA0430152007757	1006100	None	None	2043020151242363146
1	C00461046	N	Q1	P	15020142007	15	IND	ISAAC, PAUL MR	LARCHMONT	NY	...	CADOGEN MANAGEMENT LLC	ANALYST	03312015	2700	None	SA0430152007759	1006100	None	None	2043020151242363148
2	C00461046	N	Q1	P	15020142008	15	IND	KENDALL, LINDA G	KEY LARGO	FL	...	RETIRED	RETIRED	03202015	800	None	SA0430152008760	1006100	None	REATTRIBUTION REQUESTED	2043020151242363149
3	C00461046	N	Q1	P	15020142428	15	IND	BUCCIERO, ANTHONY	MURRYSVILLE	PA	...	GCMSI	MANAGEMENT	03262015	500	None	SA04301524282022	1006100	None	None	2043020151242364411

	r	c	v
0	0.0	0.0	5.0
1	0.0	1.0	1.0
2	1.0	0.0	2.0
3	1.0	1.0	3.0
4	2.0	0.0	1.0
5	2.0	1.0	4.0

	cmte_id	amndt_ind	rpt_tp	transaction_pgi	image_num	transaction_tp	entity_tp	name	city	state	...	employer	occupation	transaction_dt	transaction_amt	other_id	tran_id	file_num	memo_cd	memo_text	sub_id
0	C00578922	N	Q2	P	201607159020548786	15	IND	WIMMER, JAMES	LA GRANGE	IL	...	None	None	06272016	500	None	A015CF62D5E6A4E6E989	1085865	None	None	4071520161306753713
1	C00376673	A	12G	G	201612089039799965	15	IND	BOZZUTO, THOMAS S	BALTIMORE	MD	...	THE BOZZUTO GROUP	BUILDER	10192016	1000	None	C22086030	1132270	X	*	4122320161357985929
2	C00575795	A	12G	G	201701099041246602	15	IND	KARPEL, ARLENE	TARZANA	CA	...	RETIRED	ADVERTISING	10172016	10	None	C13106313	1137788	None	None	4011120171366473264
3	C00040253	N	30G	P	201612069037648205	15	IND	CONLAN, JAMES A	FRANKLIN	TN	...	ALLSTATE INSURANCE COMPANY	ABD-STRATEGIC DEPLOYMENT	11252016	27	None	A2016-2183275	1129285	None	None	4120720161356883629

	cmte_id	amndt_ind	rpt_tp	transaction_pgi	image_num	transaction_tp	entity_tp	name	city	state	...	employer	occupation	transaction_dt	transaction_amt	other_id	tran_id	file_num	memo_cd	memo_text	sub_id
0	C00078451	N	12G	P	201611039037111929	15	IND	SULLIVAN, ERIC	SAN DIEGO	CA	...	NATIONAL STEEL & SHIPBUIL	SUPERVISOR SECURITY	10192016	18	None	PR1570494950915	1123284	None	P/R DEDUCTION ($6.00 WEEKLY)	4113020161356319356
1	C00401224	A	M4	P	201702069043376546	24T	IND	VARILLA, JEFFREY	CHICAGO	IL	...	SELF	SCULPTOR/ARTIST	03092016	5	C00577130	SA11AI_41922193	1147747	None	EARMARKED FOR BERNIE 2016 (C00577130)	4030820171379798720

	cmte_id	amndt_ind	rpt_tp	transaction_pgi	image_num	transaction_tp	entity_tp	name	city	state	...	employer	occupation	transaction_dt	transaction_amt	other_id	tran_id	file_num	memo_cd	memo_text	sub_id
0	C00461046	N	Q1	G	15020142255	15	IND	SABIN, ANDREW	None	None	...	SABIN METAL	CEO	03112015	2700	None	SA04301522551502	1006100	None	None	2043020151242363891
1	C00461046	N	Q1	P	15020142491	15	IND	HORENKAMP, ELIZABETH	LITITZ	PA	...	None	None	03242015	204	None	SA04301524912210	1006100	None	None	2043020151242364599
2	C00545673	N	Q1	P	15970346025	15	IND	WELLS, ERSKINE MR.	WASHINGTON	DC	...	BGR GOVERNMENT AFFAIRS LLC	PRINCIPAL	03302015	500	None	AA89794F18FA74169A59	1003199	X	PARTNERSHIP: BGR GOVERNMENT AFFAIRS, LLC	4041520151241865698
3	C00443689	N	Q1	P	15951161362	15	IND	CRUM, JAMES	ODENTON	MD	...	VAN SCOYOC ASSOCIATES	VICE PRESIDENT	03052015	500	None	C6085986	1003825	None	None	4041720151241972691
4	C00005249	N	M3	None	15970303312	15	IND	DEGOLIER, LAURA P. MS.	FOND DU LAC	WI	...	DEGOLIER INSURANCE SERVICES, LLC	OWNER	02102015	125	None	13133722	998098	None	None	4031820151240822433

	cmte_id	amndt_ind	rpt_tp	transaction_pgi	image_num	transaction_tp	entity_tp	name	city	state	...	occupation	transaction_dt	transaction_amt	other_id	tran_id	file_num	memo_cd	memo_text	sub_id	u
0	C00401224	N	YE	None	201602019007031780	24T	IND	WITTEN, AMY	LAWRENCE	KS	...	PROGRAM MANAGER	09082015	58	C00577130	SA11AI_30257793	1047732	None	EARMARKED FOR BERNIE 2016 (C00577130)	4021620161268146239	6.845221e-08
1	C00575795	A	30G	G	201702149049529111	15	IND	KNIGHT, JANE	FORT MILL	SC	...	NURSE	11012016	25	None	C14751364	1148953	None	None	4021720171371420407	1.965091e-07
2	C00033423	N	M11	None	201511209003400237	15	IND	KULIG, CHRISTOPHER R	ELMHURST	IL	...	DEPT MGR COLLECTIONS	10162015	21	None	A2015-2294192	1033666	None	None	4112020151257352974	2.156012e-07
3	C00575795	A	12G	G	201701099041213135	15	IND	RASMUSSEN, AMY	OAKLAND	CA	...	EDITOR	10162016	25	None	C13059250	1137788	None	None	4011120171366372864	2.193265e-07
4	C00143743	N	M9	P	201509169002679540	15	IND	ANDERSON, KEN-JEAN	HORSESHOE BAY	TX	...	RETIRED	08312015	500	None	AC1D196AD2032433981A	1024964	None	None	4091720151253097649	2.994202e-07

	name	state	cmte_id	transaction_amt	log
0	HARRIS, ANN	MO	C00431304	50	1.698970
1	MOEN, ERIK P. MR	WA	C00012880	500	2.698970
2	DYKHOUSE, DANA J.	SD	C00476853	5400	3.732394
3	SAUNDERS, SANFORD MR.	VA	C00554261	2700	3.431364
4	DUEHN, BARBARA	TX	C00513077	15	1.176091
5	PHILLIPS, JANE K.	OK	C00513077	5	0.698970
6	MEDOW, JONATHAN	IL	C00435099	500	2.698970
7	CAMPBELL, JOSEPH	CA	C00499392	1000	3.000000
8	SABIN, ANDREW	None	C00461046	2700	3.431364
9	HORENKAMP, ELIZABETH	PA	C00461046	204	2.309630

	transaction_amt	cmte_id	transaction_dt	name	city	state	memo_text	occupation
0	50000000	C00616078	06162016	STATWARE INC. AND OTHER FIRMS	CENTERBOOK	CT	None	None
1	23787000	C00010603	11072016	HILLARY VICTORY FUND	NEW YORK	NY	None	None
2	11000000	C00575373	04102015	MERCER, ROBERT	EAST SETAUKET	NY	None	FINANCE
3	11000000	C00547349	10262016	STEYER, THOMAS F.	SAN FRANCISCO	CA	None	FOUNDER
4	10000000	C00575431	04092015	NEUGEBAUER, TOBY	DORADO	PR	None	INVESTOR
5	10000000	C00504530	08292016	ADELSON, SHELDON	LAS VEGAS	NV	None	CHAIRMAN
6	10000000	C00571703	08262016	ADELSON, SHELDON G. MR.	LAS VEGAS	NV	None	CHAIRMAN & CEO
7	10000000	C00571703	08292016	ADELSON, MIRIAM DR.	LAS VEGAS	NV	None	PHYSICIAN
8	10000000	C00571372	10212015	C.V. STARR & CO INC	NEW YORK	NY	None	None
9	10000000	C00504530	08292016	ADELSON, MIRIAM	LAS VEGAS	NV	None	PHYSICIAN

	name	total_amt
0	STEYER, THOMAS F.	90044644
1	STATWARE INC. AND OTHER FIRMS	50000000
2	EYCHANER, FRED	37921658
3	HILLARY VICTORY FUND	35379700
4	NEXTGEN CLIMATE ACTION COMMITTEE	31213354
5	ADELSON, MIRIAM	27936800
6	SUSSMAN, DONALD	25642400
7	SOROS, GEORGE	22364011
8	SENATE LEADERSHIP FUND	22355200
9	BLOOMBERG, MICHAEL R.	21675924

	name	total_amt
0	STEYER, THOMAS F.	89726944
1	MOSKOVITZ, DUSTIN	14111000
2	TUNA, CARI	6623000
3	SANDLER, HERBERT M.	5494400
4	THIEL, PETER	4066700
5	OBERNDORF, WILLIAM E. MR.	2416106
6	SCHWAB, HELEN O. MRS.	1502700
7	OBERNDORF, WILLIAM E.	1371200
8	WENDT FAMILY TRUST	1350000
9	AMERICAN PACIFIC INTERNATIONAL CAPITAL INC.	1300000
10	WILLIAMS, SARA	1273960
11	MOSKOVITZ, DUSTIN A.	1000000
12	BUELL, SUSIE T.	955000
13	WILLIAMS, EVAN	875931
14	NEXT GEN	800000
15	BUELL, SUSIE TOMPKINS	797971
16	OBERNDORF, WILLIAM E	781600
17	SCHWAB, CHARLES R.	653400
18	WILSEY, DIANE B.	554380
19	GREER, JIM	551615

	name	total_amt
0	SHENKER, SCOTT	1334600
1	SIMONS, NAT	363400
2	SIMONS, NATHANIEL	224700
3	WILKINS, MICHAEL IAN	186681
4	HEADLEY, MARK	161700
5	GUND, LOUISE LAIDLAW	148083
6	SLATER, AMY	126340
7	HEADLEY, MARK W.	115400
8	HEADLEY, MARK W	115200
9	BERLEKAMP, ELWYN	114900
10	BERNHARDT, ANTHONY	105350
11	SHENKER, SCOTT M	105000
12	LEVIEN, RAPHAEL LINUS	100000
13	SUTTIE, JILL	93565
14	DUIGNAN, SHEILA MARIE	85000
15	HOLTZMAN, STEVEN	77800
16	PARKER, MICHAEL L.	75500
17	FERGUSON, THEODOSIA	74055
18	PARKER, ANN	73700
19	FIDDLER, JERRY	70800

	generate_series
0	1
1	2
2	3
3	4
4	5

	generate_series
0	1
1	3
2	5
3	7
4	9

	day	stddev	spread
0	2015-01-29	156888.687067	499975
1	2016-03-07	25722.240498	100003
2	2015-02-12	20132.919349	49998
3	2015-04-28	11091.966531	33390
4	2016-07-13	10412.932316	49995

	cmte_id	total_amt
0	C00586537	428422747
1	C00401224	314682670
2	C00575795	293602014
3	C00495861	183763164
4	C00571372	132142087
5	C00003418	123195123
6	C00618389	107943772
7	C00571703	104275579
8	C00547349	92536032
9	C00010603	89932788

	r	c	v
0	0.0	0.0	5.0
1	0.0	1.0	1.0
2	1.0	0.0	2.0
3	1.0	1.0	3.0
4	2.0	0.0	1.0
5	2.0	1.0	4.0

	cmte_nm	total_amt
0	HILLARY VICTORY FUND	428422747
1	ACTBLUE	314682670
2	HILLARY FOR AMERICA	293602014
3	PRIORITIES USA ACTION	183763164
4	RIGHT TO RISE USA	132142087
5	REPUBLICAN NATIONAL COMMITTEE	123195123
6	TRUMP VICTORY	107943772
7	SENATE LEADERSHIP FUND	104275579
8	NEXTGEN CLIMATE ACTION COMMITTEE	92536032
9	DNC SERVICES CORP./DEM. NAT'L COMMITTEE	89932788

	autoincrement	default	nullable	type
name
cand_id	False	None	True	VARCHAR(9)
cand_name	False	None	True	TEXT
cand_pty_affiliation	False	None	True	VARCHAR(3)
cand_election_yr	False	None	True	INTEGER
cand_office_st	False	None	True	VARCHAR(2)
cand_office	False	None	True	CHAR(1)
cand_office_district	False	None	True	INTEGER
cand_ici	False	None	True	CHAR(1)
cand_status	False	None	True	CHAR(1)
cand_pcc	False	None	True	VARCHAR(9)
cand_st1	False	None	True	TEXT
cand_st2	False	None	True	TEXT
cand_city	False	None	True	TEXT
cand_st	False	None	True	VARCHAR(2)
cand_zip	False	None	True	VARCHAR(10)

	cand_name	total_amt
0	CLINTON, HILLARY RODHAM / TIMOTHY MICHAEL KAINE	293602014
1	SANDERS, BERNARD	84741223
2	TRUMP, DONALD J. / MICHAEL R. PENCE	65329708
3	RUBIO, MARCO	60770391
4	RYAN, PAUL D.	58792561
5	CRUZ, RAFAEL EDWARD "TED"	54016803
6	BUSH, JEB	33215298
7	CARSON, BENJAMIN S SR MD	27720768
8	VAN HOLLEN, CHRIS	27254529
9	PORTMAN, ROB	21122605

	cand_name	total_amt
0	CLINTON, HILLARY RODHAM / TIMOTHY MICHAEL KAINE	64312114
1	SANDERS, BERNARD	18755158
2	HARRIS, KAMALA D	9217342
3	RUBIO, MARCO	8290342
4	MCCARTHY, KEVIN	5885992
5	CRUZ, RAFAEL EDWARD "TED"	5477703
6	RYAN, PAUL D.	5419340
7	TRUMP, DONALD J. / MICHAEL R. PENCE	5183688
8	MASTO, CATHERINE CORTEZ	3405938
9	BUSH, JEB	3356045

	cand_name	total_amt
0	RUBIO, MARCO	17355941
1	CLINTON, HILLARY RODHAM / TIMOTHY MICHAEL KAINE	15315796
2	PERKINS, RANDY	10315395
3	MURPHY, PATRICK E	7926526
4	BUSH, JEB	7291164
5	RYAN, PAUL D.	6129278
6	TRUMP, DONALD J. / MICHAEL R. PENCE	5072826
7	ROONEY, FRANCIS	4280801
8	SANDERS, BERNARD	3053688
9	CRUZ, RAFAEL EDWARD "TED"	2741375

	cand_name	total_amt
0	CRUZ, RAFAEL EDWARD "TED"	20464749
1	CLINTON, HILLARY RODHAM / TIMOTHY MICHAEL KAINE	16124458
2	RYAN, PAUL D.	7905676
3	TRUMP, DONALD J. / MICHAEL R. PENCE	5763918
4	RUBIO, MARCO	4998601
5	BUSH, JEB	3904580
6	CRUZ, RAFAEL EDWARD TED	3310330
7	CARSON, BENJAMIN S SR MD	3291938
8	SANDERS, BERNARD	3135692
9	DEWHURST, DAVID H	2501335

	cmte_nm	total_amt
0	PRIORITIES USA ACTION	133443133
1	DSCC	118795008
2	RIGHT TO RISE USA	86817138
3	SENATE LEADERSHIP FUND	85994270
4	DCCC	84368850

	cand_name	total_amt
0	TRUMP, DONALD J. / MICHAEL R. PENCE	322303387
1	CLINTON, HILLARY RODHAM / TIMOTHY MICHAEL KAINE	191839266
2	BUSH, JEB	87255752
3	TOOMEY, PATRICK JOSEPH	73347997
4	AYOTTE, KELLY A	64513126