%load_ext sql
%sql postgresql://jmh:@localhost:5432/jmh
One of the fundamental operations in the relational algebra is the general-purpose theta-join and its more specific variants, equijoin and natural join. Joins are endemic to the relational model.
Consider our FEC schema. Each committee has a single row in the cm table. But each committee is associated with many donations in the individual table. Some models like XML or JSON would let us "nest" many individual donations for a committee into each cm record, but the relational model only allows us to store a single atomic value (i.e. a basic integer, float, string, etc.) in a field of a record. A table with only atomic values in its cells is said to be in first normal form. Nested data is sometimes said to be denormalized, and when you convert it back to flat tables it is normalized.
So how do we combine individual rows and cm rows in SQL? Via an equijoin on the cmte_id columns in each; the result is still a normalized table.
Recall that $R \Join_\theta S = \sigma_\theta(R \times S)$: i.e. theta joins are simple cartesian products followed by a theta selection. SQL's standard syntax is analogous:
SELECT ...
FROM R, S
WHERE <theta>;
The same applies to multi-table queries. $R \Join_{\theta_1} S \Join_{\theta_2} T \Join_{\theta_3} U = \sigma_{\theta_1 \wedge \theta_2 \wedge \theta_3} (R \times S \times T \times U)$. Or equivalently, in SQL:
SELECT ...
FROM R, S, T, U
WHERE <theta1>
AND <theta2>
AND <theta3>;
Most join queries you'll want to write in SQL will have this form; you simply need to match up the columns properly in the WHERE clause.
To begin, let's join up the indiv_sample and cm tables on cmte_id. We should get one row out for each row of indiv_sample, because cmte_id is a primary key of cm.
%%sql select *
from indiv_sample I, cm
where I.cmte_id = cm.cmte_id
limit 2;
Note some issues with column naming:

- * in the SELECT list returns all the columns of both tables, in the order the tables appear in the FROM clause.
- In the FROM clause, we introduce the variable I as an alias for indiv_sample, and reuse it in the WHERE clause. (As an alternative syntax, we could also have said from indiv_sample AS I; the AS is optional.)
- cmte_id appears in more than one table in the FROM clause, so we always need to preface it with a table variable to disambiguate it. (Try and see what happens if you remove, say, the I from I.cmte_id.)

We can put an aggregate over a join query to count rows in the output:
%%sql select count(*)
from indiv_sample I, cm
where I.cmte_id = cm.cmte_id;
Let's compare to indiv_sample without the join to validate our assumption about the output size:
%sql SELECT count(*) FROM indiv_sample;
If you look at the FEC schema, you'll see that some committees in cm are associated with candidates in cn via the cand_id field. So we can figure out which individuals donated to which candidates by joining the three tables:
%%sql
SELECT name, cand_name, transaction_amt
FROM indiv_sample I, cm, cn
WHERE I.cmte_id = cm.cmte_id
AND cm.cand_id = cn.cand_id
ORDER BY transaction_amt desc
LIMIT 5;
Some notes here:

- The columns in the SELECT list happen to each appear in only one table, so we don't need to disambiguate them via table variables like I.name, cn.cand_name, I.transaction_amt.
- The ORDER BY/LIMIT logic applies as usual.

As we learned in class, joins can be reordered (somewhat). It's interesting to see what the query optimizer chooses for a join order. We can view this by using EXPLAIN in front of our query:
%%sql
EXPLAIN
SELECT name, cand_name, transaction_amt
FROM indiv_sample I, cm, cn
WHERE I.cmte_id = cm.cmte_id
AND cm.cand_id = cn.cand_id;
On my machine, it appears that it chose to do indiv_sample $\Join$ (cm $\Join$ cn).
Sometimes you want to join a table with itself. For example, we may want to pair up candidate records for the same person running for two different offices under two different party affiliations:
%%sql
SELECT cn1.cand_name,
cn1.cand_election_yr, cn1.cand_pty_affiliation, cn1.cand_office, cn1.cand_office_st,
cn2.cand_election_yr, cn2.cand_pty_affiliation, cn2.cand_office, cn2.cand_office_st
FROM cn AS cn1, cn AS cn2
WHERE cn1.cand_name = cn2.cand_name
AND cn1.cand_office != cn2.cand_office
AND cn1.cand_pty_affiliation != cn2.cand_pty_affiliation
AND cn1.cand_election_yr <= cn2.cand_election_yr;
A few notes on the query above:

- We renamed the two instances of cn in the FROM clause to cn1 and cn2. This is required.
- cn1 and cn2 are table-valued variables; the fact that they're variables over the same table is basically irrelevant to the query semantics.
- Because every column name appears in both cn1 and cn2, every column reference needs a cn1. or cn2. out front.
- Why do you think the last AND clause is useful?

As discussed, the standard theta join $R \Join_\theta S$ of the relational algebra is a selection over the cross-product, i.e. $\sigma_\theta(R \times S)$. But it's often more intuitive to think of a join as a "lookup", or a nested loop: for each row in one table $R$ (the outer loop), look for theta-matches in another table $S$ (the inner loop).
What if we want to "preserve" rows from the outer table that have no match? This is called an outer join of $R$ and $S$. If an $R$ tuple has no matches in $S$, we preserve it by returning the $R$ tuple padded with NULL values for all the columns in $S$.
The standard version of this is called a left outer join (or LEFT JOIN in SQL), reflecting the idea that the table on the left side of the (infix) join operator is the "outer loop" -- i.e. the one we preserve. This requires introducing an infix notation for joins in SQL, like so:
%%sql select cm.cmte_id, cm.cmte_nm, cn.cand_id, cn.cand_name
from cm LEFT JOIN cn
ON cm.cand_id = cn.cand_id
limit 5;
For convenience, we can also have a RIGHT JOIN that does the same thing, but treats the right-hand side of the join operator as the table to preserve:
%%sql select cm.cmte_id, cm.cmte_nm, cn.cand_id, cn.cand_name
from cn RIGHT JOIN cm
ON cm.cand_id = cn.cand_id
limit 5;
In fact, we can think about joins symmetrically, and preserve non-matching rows from both sides! This is called a FULL JOIN in SQL:
%%sql select cm.cmte_id, cm.cmte_nm, cn.cand_id, cn.cand_name
from cn FULL JOIN cm
ON cm.cand_id = cn.cand_id
limit 5;
Having introduced the binary syntax for outer joins, SQL also adds binary syntax for theta joins (which it calls just JOIN, or confusingly, INNER JOIN) and NATURAL JOIN (which doesn't need an ON clause, for obvious reasons!). So the following three queries are equivalent:
%%sql select *
from indiv_sample I JOIN cm ON I.cmte_id = cm.cmte_id
limit 3;
%%sql select *
from indiv_sample I INNER JOIN cm ON I.cmte_id = cm.cmte_id
limit 3;
%%sql select *
from indiv_sample I NATURAL JOIN cm
limit 3;
As a matter of software engineering I would discourage you from using NATURAL JOIN ... it's not self-contained to read (you have to go consult the schema to see what a query says), and it's not robust to changes in schema over time.
We've seen a bunch of different query clauses now, and done some mixing and matching. How do they fit together? The order of evaluation should be thought of like this:

1. The FROM and WHERE clauses are evaluated to compute selections and joins.
2. The GROUP BY and HAVING clauses are evaluated to form (and filter) groups from the result of the previous step.
3. The SELECT clause is evaluated, including any aggregates.
4. The ORDER BY clause is evaluated.
5. The LIMIT clause is used to cut off output production.
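To make this concrete, here's a small illustrative query over indiv_sample (a sketch of my own, not from the lecture), with each clause annotated by the step at which it is evaluated:

%%sql
SELECT state, count(*) AS cnt   -- 3. evaluate the SELECT list, incl. aggregates
FROM indiv_sample               -- 1. scan the table (and join, if more tables)...
WHERE transaction_amt > 0       -- 1. ...filtering rows as we go
GROUP BY state                  -- 2. form groups...
HAVING count(*) > 10            -- 2. ...and filter the groups
ORDER BY cnt DESC               -- 4. sort the surviving rows
LIMIT 3;                        -- 5. cut off output production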
Up to now we've looked at a single query at a time. SQL also allows us to nest queries in various ways. In this section we look at the cleaner examples of how to do this in SQL: views and Common Table Expressions (CTEs).

Views
In earlier examples, we created new tables and populated them from the result of queries over stored tables. There are two main drawbacks of that approach that may concern us in some cases:

1. The derived table takes up storage space, redundantly with the tables it was computed from.
2. If the base tables are updated, the derived table goes stale; nothing keeps it in sync.

For this reason, SQL provides a notion of logical views: these are basically named queries that are re-evaluated upon each reference. They are rather like "macros" if you're familiar with that term.
The syntax is straightforward:
CREATE VIEW <name> AS
<SELECT statement>;
The resulting view <name> can be used in a SELECT query, but not in an INSERT, DELETE or UPDATE query!
As an example, we might want a view that stores just some summary statistics of transaction_amts for each date:
%%sql
DROP VIEW IF EXISTS date_stats;
CREATE VIEW date_stats AS
SELECT transaction_dt, min(transaction_amt), avg(transaction_amt), stddev(transaction_amt),
max(transaction_amt)
FROM indiv_sample
GROUP BY transaction_dt;
SELECT * from date_stats limit 5;
One of the nice things about views is modularity: if we have a complex query, we can break it up into smaller views and then run queries on the views.
For example, now we can ask for the date on which donations had the highest standard deviation:
%%sql
SELECT transaction_dt, stddev
FROM date_stats
WHERE stddev IS NOT NULL
ORDER BY stddev DESC
LIMIT 1;
Common Table Expressions (WITH)

If we're only going to use a view within a single query, it is a little inelegant to CREATE it, and then have to DROP it later to recycle the view name.

Common Table Expressions (CTEs) are like views that we use on-the-fly. (If you know about lambdas in Python, you can think of CTEs as lambda views.) The syntax for CTEs is to use a WITH clause in front of the query:
WITH <name> [(renamed columns)] AS
(<SELECT statement>)
[, <name2> AS (<SELECT statement>)...]
If you need multiple CTEs, you separate them with commas; a two-CTE sketch follows the next example. We can rewrite our query above without a view as follows:
%%sql
WITH perdate AS
(SELECT transaction_dt AS date, stddev(transaction_amt) AS std
FROM indiv_sample
GROUP BY transaction_dt)
SELECT date, std
FROM perdate
WHERE std IS NOT NULL
ORDER by std DESC
LIMIT 1;
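And since CTEs chain with commas, here's a minimal sketch using two of them (the names per_state and biggest are made up for illustration): the first computes a donation total per state, and the second computes the max of those totals, so the final SELECT returns the state(s) with the largest total:

%%sql
WITH per_state AS
  (SELECT state, sum(transaction_amt) AS total
   FROM indiv_sample
   GROUP BY state),
biggest AS
  (SELECT max(total) AS max_total FROM per_state)
SELECT P.state, P.total
FROM per_state P, biggest B
WHERE P.total = B.max_total;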
We can of course use views or CTEs in join queries as well, just as if they were tables. For example, we can compute the "argmax" of transaction_amt for indiv_sample: those rows that have the maximum transaction_amt:
%%sql
WITH biggest_gifts AS
(SELECT max(transaction_amt) AS max
FROM indiv_sample)
SELECT I.transaction_dt, I.name, I.state, I.transaction_amt
FROM indiv_sample I, biggest_gifts B
WHERE I.transaction_amt = B.max;
It is also possible to nest a SQL query within the WHERE clause of another SQL query: this is usually called a "subquery" or "nested query". Time prevents us from covering subqueries here. It's best if you can avoid them anyhow: they are relatively confusing, they often lead to poor performance, and in most cases there is some way to achieve the same effect without using them.
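Just so you recognize one when you see it, here is the argmax query from above rewritten with a subquery in the WHERE clause; it computes the same result as the CTE version:

%%sql
-- the same argmax, via a subquery in WHERE rather than a CTE
SELECT transaction_dt, name, state, transaction_amt
FROM indiv_sample
WHERE transaction_amt = (SELECT max(transaction_amt)
                         FROM indiv_sample);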
If you'd like to learn more, you can read the relevant material in the PostgreSQL manual or look at slides from CS186 (slides 35-41).
Like the relational algebra, SQL supports operators for union, intersect, and difference of relations. Because SQL is a multiset (i.e. duplicate-aware) language, it distinguishes between the set-based versions of these operators (which remove duplicates) and the multiset versions (which have rules about the number of duplicates in the output).
The syntax is simple:
<SELECT query>
<set operator>
<SELECT query>;
where the two queries are compatible in the sense of schemas, and the set operator is one of:

- UNION (set) or UNION ALL (multiset)
- INTERSECT (set) or INTERSECT ALL (multiset)
- EXCEPT (set) or EXCEPT ALL (multiset)

The set-based versions of these operations are straightforward. Rather than teach the number of duplicates formed for each multiset operator, I'd encourage you to think about what's intuitive, and then test it out yourself! (A tiny UNION check appears after the next example.)
As an example, you can run the query below to find the individual records that did not make it into our sample. (This query will run slowly).
%%sql
SELECT * FROM individual
EXCEPT ALL
SELECT * FROM indiv_sample
LIMIT 5;
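Speaking of testing duplicate behavior yourself: here's a tiny self-contained check. Both branches below produce the same single row, so UNION returns one row; swap in UNION ALL and you should get two:

%%sql
-- UNION removes duplicates across the branches: one row comes back.
-- Replace UNION with UNION ALL to keep both copies.
SELECT 'CA' AS state
UNION
SELECT 'CA' AS state;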
Statistics doesn't deal with individuals, it deals with groups: distributions, populations, samples and the like. As such, computing statistics in SQL focuses heavily on aggregation functions.
All SQL systems have simple descriptive statistics built in as aggregation functions:

- min, max
- count
- sum
- avg
- stddev and variance, the sample standard deviation and variance

PostgreSQL offers many more. Some handy ones include:

- stddev_pop and var_pop: the population standard deviation and variance, which you should use rather than stddev and variance if you know your data is the full population, not a sample
- covar_samp and covar_pop: sample and population covariance
- corr, Pearson's correlation coefficient
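To see the sample vs. population distinction in action, here's a minimal sketch comparing the two standard deviation flavors on indiv_sample; the sample version divides by $n-1$ rather than $n$, so the two should differ only slightly on a table this size:

%%sql
-- sample vs. population standard deviation of donation amounts
SELECT stddev(transaction_amt)     AS sample_sd,
       stddev_pop(transaction_amt) AS population_sd
FROM indiv_sample;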
You'll notice that a number of handy statistics are missing from this list, including the median and quartiles. That's because those are order statistics: they are defined based on an ordering of the values in a column.
SQL provides for this by allowing what it calls "ordered set functions", which require a WITHIN GROUP (ORDER BY <columns>) clause to accompany the order-statistic aggregate. For example, to compute the 25th percentile, 50th percentile (median) and 75th percentile in SQL, we can use the following:
%%sql
select percentile_cont(0.25) within group (order by transaction_amt) as lower_quartile,
percentile_cont(0.5) within group (order by transaction_amt) as median,
percentile_cont(0.75) within group (order by transaction_amt) as upper_quartile
from indiv_sample;
GROUP BY vs. WITHIN GROUP
Note the difference between WITHIN GROUP and GROUP BY:

- WITHIN GROUP appears in the SELECT clause, not as a standalone clause of the query
- WITHIN GROUP is associated with a single aggregate function
- WITHIN GROUP does not affect the number of groups

Side note for database aficionados: If you're clever, you can express order statistics like median in more "purely relational" SQL without resorting to WITHIN GROUP (ORDER BY ...), but (a) it's hard for people to understand, (b) it's very messy to get more than one order statistic in a single query, and (c) it's quite difficult for a query optimizer to understand and make it go fast.
Of course you can combine WITHIN GROUP and GROUP BY to compute order statistics within groups:
%%sql
select state,
percentile_cont(0.25) within group (order by transaction_amt) as lower_quartile,
percentile_cont(0.5) within group (order by transaction_amt) as median,
percentile_cont(0.75) within group (order by transaction_amt) as upper_quartile
from indiv_sample
group by state
limit 5;
Sometimes, for each row in the output of a query, you want to perform a calculation on a related set of rows in the output—often a "window" of rows that precede or follow in some order. Again, this is not very "set-oriented", but SQL provides a mechanism to do it, called a window function. The most common window functions are row_number in some order, rank in some order (where equivalent values in the ordering get the same rank), and ntile(n) in some order, which reports which n-tile the row is in:
%%sql
select row_number() over (order by transaction_amt desc),
rank() over (order by transaction_amt desc) as desc_rank,
ntile(4) over (order by transaction_amt desc) as desc_ntile,
transaction_amt
from indiv_sample
order by row_number
limit 10;
Try changing around the ORDER BY clauses in that query -- switch one or more from DESC to ASC, or change the columns around. What happens? Now put the word EXPLAIN in front of your query: do you see what the database system has to do to satisfy your query?
Sometimes you want to compute order statistics within groups. Note that we can't use a normal GROUP BY for this since we don't want one row per group; we want to know a property of each individual output row with respect to its group. We do this via a PARTITION BY clause in the windowed aggregate:
%%sql
select rank() over(partition by state order by transaction_amt desc),
state,
transaction_amt
from indiv_sample
limit 10;
We can also use standard aggregates over these partitions to append grouped aggregates to each row:
%%sql
select rank() over(partition by state order by transaction_amt desc),
       state,
       avg(transaction_amt) over(partition by state) as state_avg,
       transaction_amt
from indiv_sample
order by state
limit 10;
As a final example, let's reconsider the humble argmax:
%%sql
with ranked as (
select rank() over(partition by state order by transaction_amt desc),
transaction_dt, name, state, transaction_amt
from indiv_sample
)
select * from ranked
where rank = 1;
Does this give the same answer as the implementation at the top of the notebook? Run EXPLAIN on each to see how the database system executes the different forms of the query.
PARTITION BY vs. GROUP BY vs. WITHIN GROUP

When do we use PARTITION BY? Here are some key things to remember about PARTITION BY:

- PARTITION BY is used within an OVER clause
- an OVER clause is associated with a single window function in the SELECT clause

By contrast:

- WITHIN GROUP is used to modify a single aggregate function (not a window function)---specifically an ordered-set aggregate function
- GROUP BY partitions the input set before aggregation begins, and affects all aggregate functions in the SELECT clause.

More general Windowing
Note that PARTITION BY is a specific way to define a window "frame": based on distinct values. Sometimes you want to define window frames based on position rather than value: e.g. the running 7-day total of transaction_amt, or the rank of the current transaction_amt among the 2 days preceding or following, etc. This can be done in SQL as well, but we won't go into the details here. You can look at the PostgreSQL manual for more information.
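For a taste, here's a minimal sketch of a positional frame: a running sum over each row and the two rows preceding it in date order. (If transaction_dt is stored as a string, this ordering may not be truly chronological; the point here is just the frame syntax.)

%%sql
-- running sum over the current row plus the 2 preceding rows, in date order
select transaction_dt, transaction_amt,
       sum(transaction_amt) over (order by transaction_dt
                                  rows between 2 preceding and current row)
         as running_sum
from indiv_sample
limit 10;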
Earlier we saw how to register Python code as "user-defined functions" (UDFs) in PostgreSQL. Those were scalar functions: they produce one value of output for each input. We can also register "user-defined aggregates" (UDAs), which produce one value of output for a group of inputs.

A user-defined aggregate is based on managing running state per group. This state is determined by registering three things:

- an initial condition for the state (the initcond)
- a state transition function, called once per input value to update the state (the sfunc)
- a final function that converts the last state into the aggregate's output value (the finalfunc)
As an example, we can define a user-defined aggregate for computing GPAs based on letter grades. The initial condition of the state is the pair (0,0), representing the count and sum of grades so far. The state transition function increments the count, and adds to the sum based on decoding the letter grade into a number between 0 and 4. The final function returns the sum divided by the count, unless the count is 0, in which case it returns NULL.
%%sql
DROP AGGREGATE IF EXISTS gpa(char);
DROP FUNCTION IF EXISTS final_gpa(avg_state);
DROP FUNCTION IF EXISTS transition_gpa(avg_state, char);
DROP TYPE IF EXISTS avg_state;
CREATE TYPE avg_state AS (cnt integer, total integer);
CREATE FUNCTION transition_gpa(state avg_state, grade char) returns avg_state
AS
$$
# decode a letter grade into a number between 0 and 4
def grade_to_num(g):
    lookup = {
        'A': 4,
        'B': 3,
        'C': 2,
        'D': 1,
        'F': 0
    }
    return lookup.get(g, None)

state["cnt"] += 1
state["total"] += grade_to_num(grade)
return state
$$ LANGUAGE plpythonu;
CREATE FUNCTION final_gpa(state avg_state) returns float
AS
$$
# guard against division by zero on empty groups
if state["cnt"] == 0:
    return None
else:
    return state["total"]*1.0 / state["cnt"]
$$ LANGUAGE plpythonu;
CREATE AGGREGATE gpa(char)
(
initcond = '(0,0)',
stype = avg_state,
sfunc = transition_gpa,
finalfunc = final_gpa
);
%%sql
DROP TABLE IF EXISTS gradebook;
CREATE TABLE gradebook(name text, course text, grade char);
INSERT INTO gradebook values
('Joe', 'DS100', 'B'),
('Mary', 'DS100', 'A'),
('Sam', 'DS100', 'D');
select gpa(grade) from gradebook;
Now you can use your UDA wherever you used to use a standard aggregate function, including with GROUP BY and other clauses.
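For instance, a quick sketch over the toy gradebook table we just created, grouping by course (with only one course in the table, this returns a single row):

%%sql
-- the custom gpa aggregate works per group, like any built-in aggregate
select course, gpa(grade)
from gradebook
group by course;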
Note that there's a more involved interface for (efficient) implementation of windowed aggregates.
Details on UDAs can be found in the PostgreSQL manual.
Given what you've learned so far, here are some things you should be able to do with a table too big to fit into memory in Python:

- join it with other tables
- compute descriptive statistics and order statistics over it, overall or per group
- rank and window its rows
- run your own user-defined aggregates over it
Up to this point we have left performance entirely in the hands of the database system. As it turns out, most relational databases still leave a large degree of performance tuning to the user (or "database administrator").
The main way you can influence performance is to ask the database to build "indexes" on columns. This is particularly useful for columns that you use in your WHERE clauses (theta expressions for selection and join).
Here is an example.
%%sql
drop index if exists ind_sample_name_ix;
explain analyze
select *
from indiv_sample
where name = 'HATHAWAY, RANDOLPH';
%%sql
create index ind_sample_name_ix on indiv_sample (name);
%sql explain analyze select * \
from indiv_sample \
where name = 'HATHAWAY, RANDOLPH';
See how the execution time went down by 1-2 orders of magnitude? The index makes lookups (both exact-match lookups and small range lookups) very fast compared to scanning through all the records in the table. Of course it also made the query optimizer ("planning time") slower since it had more options to consider.
For a large table like individual this can be an enormous win at query time. Recall that individual is nearly 40,000,000 rows. I previously built an index on individual.name; look at the runtime for a lookup query on name now:
%sql explain analyze select * \
from individual \
where name = 'HATHAWAY, RANDOLPH';
Indexes can also improve join performance. For example:
%%sql
drop index if exists indiv_sample_cmte;
explain analyze select *
from indiv_sample I, cm, cn
where I.cmte_id = cm.cmte_id
and cm.cand_id = cn.cand_id
limit 10;
%%sql
create index indiv_sample_cmte on indiv_sample(cmte_id);
explain analyze select *
from indiv_sample I, cm, cn
where I.cmte_id = cm.cmte_id
and cm.cand_id = cn.cand_id
limit 10;
Again, the performance win here would be much more dramatic for the individual table, but I did not take the time and space to build an index on individual.cmte_id yet.
Why not use indexes on all your columns? Indexes take up storage space, they have to be maintained whenever the table changes (slowing down inserts, updates and deletes), and -- as we saw above -- they give the query optimizer more options to consider at planning time. So the typical practice is to add indexes as you find your work getting too slow.