We've seen a bunch of different query clauses now, and done some mixing and matching. How do they fit together? The order of evaluation should be thought of like this:
FROM
and WHERE
clauses are evaluated to compute selections and joins.GROUP BY
and HAVING
clauses are evaluated to for groups resulting from the previous stepSELECT
clause is evaluated, including any aggregatesORDER BY
clause is evaluatedLIMIT
clause is used to cut off output production.%load_ext sql
%sql postgresql://jegonzal:@localhost:5432/fec
%sql DROP VIEW IF EXISTS date_stats;
%sql DROP VIEW IF EXISTS rando;
Up to now we've looked at a single query at a time. SQL also allows us to nest queries in various ways. In this section we look at the cleaner examples of how to do this in SQL: views and Common Table Expressions (CTEs).
In earlier examples, we created new tables and populated them from the result of queries over stored tables. There are two main drawbacks of that approach that may concern us in some cases:
For this reason, SQL provides a notion of logical views: these are basically named queries that are re-evaluated upon each reference. They are rather like "macros" if you're familiar with that term.
The syntax is straightforward:
CREATE VIEW <name> AS
<SELECT statement>;
The resulting view <name>
can be used in an SELECT
query, but not in an INSERT
, DELETE
or UPDATE
query!
As an example, we might want a view that stores just some summary statistics of transaction_amt
s for each date:
%%sql
DROP VIEW IF EXISTS date_stats;
%%sql
CREATE VIEW date_stats AS
SELECT
to_date(transaction_dt, 'MMDDYYYY') as day, -- Date Parsing
min(transaction_amt),
avg(transaction_amt),
stddev(transaction_amt),
max(transaction_amt)
FROM indiv_sample
GROUP BY transaction_dt;
%%sql
SELECT * from date_stats limit 5;
!psql fec -c "\dt"
it created a view
!psql fec -c "\dv"
!psql fec -c "\d date_stats"
%%sql
DROP VIEW IF EXISTS rando;
%%sql
CREATE VIEW rando(rownum, rnd) AS
SELECT rownum, round(random())::INTEGER
FROM generate_series(1,50) AS ind(rownum)
What is the value of the sum of all the rows?
%%sql
SELECT SUM(rnd) FROM rando
What happens if I run it again?
%%sql
SELECT SUM(rnd) FROM rando
One of the nice things about views is modularity: if we have a complex query, we can break it up into smaller views and the run queries on the views.
For example, now we can ask for the day with the highest variance in donations per state:
%%sql
SELECT day, stddev
FROM date_stats
WHERE stddev IS NOT NULL
ORDER BY stddev DESC
LIMIT 5;
Views can help:
Problem:
temp1
, temp1_joey
, temp1_joey_fixed
, ... We need a mechanism to decompose query into views for the scope of a single query.
WITH
)¶Think of these as a view that exists only during the query.
If we're only going to use a view within a single query, it is a little inelegant to CREATE
it, and then have to DROP
it later to recycle the view name.
Common Table Expressions (CTEs) are like views that we use on-the-fly. (If you know about lambdas in Python, you can think of CTEs as lambda views.) The syntax for CTEs is to use a WITH
clause in front of the query:
WITH <name> [(renamed columns)] AS
(<SELECT statement>)
[, <name2> AS (<SELECT statement>)...]
If you need multiple CTEs, you separate them with commas. We can rewrite our query above without a view as follows:
%%sql
WITH per_day_stats AS (
SELECT
to_date(transaction_dt, 'MMDDYYYY') as day, -- Date Parsing
min(transaction_amt),
avg(transaction_amt),
stddev(transaction_amt),
max(transaction_amt)
FROM indiv_sample
GROUP BY transaction_dt
)
SELECT day, stddev
FROM per_day_stats
WHERE stddev IS NOT NULL
ORDER by stddev DESC
LIMIT 1;
We can of course use views or CTEs in join queries as well, just as if they were tables. For example, we can compute the "argmax" of transaction_amt
for indiv_sample
: those rows that have the maximum transaction_amt
:
%%sql
WITH biggest_gifts AS (
SELECT max(transaction_amt) AS max
FROM indiv_sample
)
SELECT I.transaction_dt, I.name, I.state, I.transaction_amt
FROM indiv_sample I, biggest_gifts B
WHERE I.transaction_amt = B.max;
It is also possible to nest a SQL query within the WHERE
clause of another SQL query: this is usually called a "subquery" or "nested query". Time prevents us from covering subqueries here. It's best if you can avoid them anyhow: they are relatively confusing, they often lead to poor performance, and in most cases there is some way to achieve the same effect without using them.
If you'd like to learn more, you can read the relevant material in the PostgreSQL manual or look at slides from CS186 (slides 35-41).
Like the relational algebra, SQL supports the operators for union, intersect, and difference of relations. Becase SQL is a multiset (i.e. duplicate-aware) language, it distinguishes between the set-based versions of these operators (which remove duplicates) and the multiset versions (which have rules about the number of duplicates in the output.
The syntax is simple:
<SELECT query>
<set operator>
<SELECT query>;
where the two queries are compatible in the sense of schemas, and the set operator is one of:
UNION
(set) or UNION ALL
(multiset)INTERSECT
(set) or INTERSECT ALL
(multiset)EXCEPT
(set) or EXCEPT ALL
(multiset).The set-based versions of these operations are straightforward. Rather than teach the number of duplicates formed for each multiset operator, I'd encourage you to think about what's intuitive, and then test it out yourself!
As an example, you can run the query below to find the individual records that did not make it into our sample. (This query will run slowly).
# %%sql
# SELECT * FROM individual
# EXCEPT ALL
# SELECT * FROM indiv_sample
# LIMIT 5;
Statistics doesn't deal with individuals, it deals with groups: distributions, populations, samples and the like. As such, computing statistics in SQL focuses heavily on aggregation functions.
All SQL systems have simple descriptive statistics built in as aggregation functions:
min, max
count
sum
avg
stddev
and variance
, the sample standard deviation and variance.PostgreSQL offers many more. Some handy ones include
stddev_pop
and var_pop
: the population standard deviation and variance, which you should use rather than stddev
and variance
if you know your data is the full population, not a sample.covar_samp
and covar_pop
: sample and population covariancecorr
, Pearson's correlation coefficientYou'll notice that a number of handy statistics are missing from this list, including the median and quartiles. That's because those are order statistics: they are defined based on an ordering of the values in a column.
SQL provides for this by allowing what it calls "ordered set functions", which require a WITHIN GROUP (ORDER BY <columns>)
clause to accompany the order-statistic aggregate. For example, to compute the 25th percentile, 50th percentile (median) and 75th percentile in SQL, we can use the following:
There are two versions of the percentile function:
percentile_cont
inuous : interpolatespercentile_disc
rete : returns an entry from the tableWhat will the following expressions return?
%%sql
SELECT
percentile_cont(0.5) WITHIN GROUP (ORDER BY x)
FROM generate_series(1,10) AS data(x);
%%sql
SELECT
percentile_disc(0.5) WITHIN GROUP (ORDER BY x)
FROM generate_series(1,10) AS data(x);
%%sql
SELECT
percentile_disc(0.25) WITHIN GROUP (ORDER BY x) as lower_quartile,
percentile_disc(0.5) WITHIN GROUP (ORDER BY x) as median,
percentile_disc(0.75) WITHIN GROUP (ORDER BY x) as upper_quartile
FROM generate_series(1,10) AS data(x);
GROUP BY vs. WITHIN GROUP
Note the difference between WITHIN GROUP
and GROUP BY
:
WITHIN GROUP
is in the FROM
clauseWITHIN GROUP
is associated with a single aggregate functionWITHIN GROUP
does not affect the number of groupsSide note for database aficionados: If you're clever, you can express order statistics like median in more "purely relational" SQL without resorting to WITHIN GROUP (ORDER BY ...)
, but (a) it's hard for people to understand, (b) it's very messy to get more than one order statistic in a single query, and (c) it's quite difficult for a query optimizer to understand and make it go fast.
Of course you can combine WITHIN GROUP
and GROUP BY
to compute order statistics within groups:
%%sql
SELECT state,
percentile_cont(0.25) WITHIN GROUP (ORDER BY transaction_amt) as lower_quartile,
percentile_cont(0.5) WITHIN GROUP (ORDER BY transaction_amt) as median,
percentile_cont(0.75) WITHIN GROUP (ORDER BY transaction_amt) as upper_quartile
FROM indiv_sample
GROUP BY state
LIMIT 5;
Sometimes, for each row in the output of a query, you want perform a calculation on a related set of rows in the output—often a "window" of rows that precede or follow in some order. Again, this is not very "set-oriented", but SQL provides a mechanism to do it, called a window function. The most common window functions are row_number
in some order, rank
in some order (where equivalent values in the ordering get the same rank), and ntile(n)
in some order, which reports which n-tile the row is in:
What will the following query compute?
%%sql
WITH empsalary(depname, empno, salary) AS (
VALUES ('develop', 8, 6000),
('develop', 10, 5200),
('develop', 11, 5200),
('develop', 9, 4500),
('develop', 7, 4200),
('personnel', 2, 3900),
('personnel', 5, 3500),
('sales', 1, 5000),
('sales', 4, 4800),
('sales', 3, 4800)
)
SELECT
empno, depname, salary,
rank() OVER (PARTITION BY depname ORDER BY salary DESC),
ntile(4) OVER (ORDER BY salary DESC)
FROM empsalary
ORDER BY empno
rank() OVER (PARTITION BY depname ORDER BY salary DESC)
rank()
is the window functionOVER
connects the function to where it is runPARTITION BY
(optional) like group by splits the table according to the column(s) provided. However unlike group by the split is only applied during the window computation and the result of the window computation is then attached to the original row.ORDER BY
determines the column on which the windowed function is appliedConcluding thoughts: