Reading the SQL Expression

We've seen a bunch of different query clauses now, and done some mixing and matching. How do they fit together? The order of evaluation can be thought of as follows:

  1. The FROM and WHERE clauses are evaluated to compute selections and joins.
  2. The GROUP BY and HAVING clauses are evaluated to form groups from the result of the previous step, and to filter those groups.
  3. The SELECT clause is evaluated, including any aggregates.
  4. The ORDER BY clause is evaluated.
  5. The LIMIT clause is used to cut off output production.
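
For example, annotating a query over the indiv_sample table used later in this notebook (the predicate values here are made up), each clause is marked with the step at which it is evaluated:

SELECT state, avg(transaction_amt)    -- 3. compute output expressions, incl. aggregates
FROM indiv_sample                     -- 1. scan the input table(s)...
WHERE transaction_amt > 0             -- 1. ...and filter rows
GROUP BY state                        -- 2. form groups...
HAVING count(*) > 10                  -- 2. ...and filter groups
ORDER BY avg(transaction_amt) DESC    -- 4. sort the output
LIMIT 5;                              -- 5. cut off output production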
In [1]:
%load_ext sql
%sql postgresql://jegonzal:@localhost:5432/fec
%sql DROP VIEW IF EXISTS date_stats;
%sql DROP VIEW IF EXISTS rando;
Done.
Done.
Out[1]:
[]

Named Queries: Views and CTEs

Up to now we've looked at a single query at a time. SQL also allows us to nest queries in various ways. In this section we look at the cleaner ways to do this in SQL: views and Common Table Expressions (CTEs).

Views

In earlier examples, we created new tables and populated them from the result of queries over stored tables. There are two main drawbacks of that approach that may concern us in some cases:

  1. The new table uses up storage, even though it is recomputable from other tables.
  2. If the input tables change, the stored output does not reflect the new state of the input.

For this reason, SQL provides a notion of logical views: these are basically named queries that are re-evaluated upon each reference. They are rather like "macros" if you're familiar with that term.

The syntax is straightforward:

CREATE VIEW <name> AS
<SELECT statement>;

The resulting view <name> can be used in a SELECT query, but not in an INSERT, DELETE or UPDATE query!

As an example, we might want a view that stores just some summary statistics of transaction_amts for each date:

In [2]:
%%sql
DROP VIEW IF EXISTS date_stats;
Done.
Out[2]:
[]
In [3]:
%%sql

CREATE VIEW date_stats AS
SELECT 
    to_date(transaction_dt, 'MMDDYYYY') as day, -- Date Parsing
    min(transaction_amt), 
    avg(transaction_amt), 
    stddev(transaction_amt),
    max(transaction_amt)
FROM indiv_sample
GROUP BY transaction_dt;
Done.
Out[3]:
[]
In [4]:
%%sql
SELECT * from date_stats limit 5;
5 rows affected.
Out[4]:
day         min  avg                    stddev             max
2015-02-06  355  355.0000000000000000   None                355
2016-02-06    1  181.8666666666666667   518.303395246268   2000
None          0  125.0000000000000000   144.337567297406    250
2015-03-23   25  1311.8750000000000000  1605.471763480546  5000
2016-07-18    0  325.9500000000000000   1105.281458557764  5000

Notice that this did not create a table:

In [5]:
!psql fec -c "\dt"
            List of relations
 Schema |     Name     | Type  |  Owner   
--------+--------------+-------+----------
 public | ccl          | table | postgres
 public | cm           | table | postgres
 public | cn           | table | postgres
 public | gradebook    | table | jegonzal
 public | indiv_sample | table | jegonzal
 public | individual   | table | postgres
 public | oppexp       | table | postgres
 public | oth          | table | postgres
 public | pas          | table | postgres
 public | profs        | table | jegonzal
(10 rows)

it created a view:

In [6]:
!psql fec -c "\dv"
           List of relations
 Schema |    Name    | Type |  Owner   
--------+------------+------+----------
 public | date_stats | view | jegonzal
(1 row)

In [7]:
!psql fec -c "\d date_stats"
   View "public.date_stats"
 Column |  Type   | Modifiers 
--------+---------+-----------
 day    | date    | 
 min    | integer | 
 avg    | numeric | 
 stddev | numeric | 
 max    | integer | 

Views are not materialized

In [8]:
%%sql
DROP VIEW IF EXISTS rando;
Done.
Out[8]:
[]
In [9]:
%%sql

CREATE VIEW rando(rownum, rnd) AS
SELECT rownum, round(random())::INTEGER
FROM generate_series(1,50) AS ind(rownum)
Done.
Out[9]:
[]

What is the value of the sum of all the rows?

In [10]:
%%sql
SELECT SUM(rnd) FROM rando
1 rows affected.
Out[10]:
sum
21

What happens if I run it again?

In [11]:
%%sql
SELECT SUM(rnd) FROM rando
1 rows affected.
Out[11]:
sum
25
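
The sum changes because the view's defining query, including the call to random(), is re-evaluated on every reference. For contrast, PostgreSQL also supports materialized views, which do store their result; here is a minimal sketch (the name rando_frozen is made up for illustration):

CREATE MATERIALIZED VIEW rando_frozen AS
SELECT rownum, round(random())::INTEGER AS rnd
FROM generate_series(1,50) AS ind(rownum);

-- SUM(rnd) over rando_frozen is now stable across queries,
-- until the stored result is explicitly recomputed with:
REFRESH MATERIALIZED VIEW rando_frozen;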

Views are useful for composition

One of the nice things about views is modularity: if we have a complex query, we can break it up into smaller views and then run queries over the views.

For example, we can now ask for the days with the highest standard deviation in donation amounts:

In [12]:
%%sql

SELECT day, stddev
FROM date_stats
WHERE stddev IS NOT NULL
ORDER BY stddev DESC
LIMIT 5;
5 rows affected.
Out[12]:
day         stddev
2016-07-13  106041.27290974
2016-02-02  76219.14861970
2016-10-26  65370.45080386
2016-06-28  52401.89791164
2016-10-27  49074.16141287

Too Many Views

Views can help:

  • Simplify queries
  • Make complex queries more readable
  • Share "SQL programs" with others

Problem:

  • Creating a new view for each (exploratory) query will result in a lot of views!
  • views like: temp1, temp1_joey, temp1_joey_fixed, ...

We need a mechanism to decompose a query into view-like pieces that exist only for the scope of that single query.

Common Table Expressions (WITH)

Think of these as a view that exists only during the query.

If we're only going to use a view within a single query, it is a little inelegant to CREATE it, and then have to DROP it later to recycle the view name.

Common Table Expressions (CTEs) are like views that we use on-the-fly. (If you know about lambdas in Python, you can think of CTEs as lambda views.) The syntax for CTEs is to use a WITH clause in front of the query:

WITH <name> [(renamed columns)] AS (<SELECT statement>) [, <name2> AS (<SELECT statement>)...]

If you need multiple CTEs, you separate them with commas. We can rewrite our query above without a view as follows:

In [13]:
%%sql
WITH per_day_stats AS (
    SELECT 
        to_date(transaction_dt, 'MMDDYYYY') as day, -- Date Parsing
        min(transaction_amt), 
        avg(transaction_amt), 
        stddev(transaction_amt),
        max(transaction_amt)
    FROM indiv_sample
    GROUP BY transaction_dt
)    
SELECT day, stddev
  FROM per_day_stats
 WHERE stddev IS NOT NULL
 ORDER BY stddev DESC
 LIMIT 1;
1 rows affected.
Out[13]:
day         stddev
2016-07-13  106041.27290974
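
Multiple CTEs chain naturally as well; here is a minimal self-contained sketch (the names nums and evens are made up):

WITH nums(x) AS (
    SELECT * FROM generate_series(1, 10)
),
evens AS (
    SELECT x FROM nums WHERE x % 2 = 0
)
SELECT sum(x) AS even_sum FROM evens;  -- 2+4+6+8+10 = 30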

We can of course use views or CTEs in join queries as well, just as if they were tables. For example, we can compute the "argmax" of transaction_amt for indiv_sample: those rows that have the maximum transaction_amt:

In [14]:
%%sql
WITH biggest_gifts AS (
    SELECT max(transaction_amt) AS max
    FROM indiv_sample
)    
SELECT I.transaction_dt, I.name, I.state, I.transaction_amt
FROM indiv_sample I, biggest_gifts B
WHERE I.transaction_amt = B.max;
1 rows affected.
Out[14]:
transaction_dt  name                           state  transaction_amt
07132016        DEMOCRATIC NATIONAL COMMITTEE  DC     600000

Nested Queries

It is also possible to nest a SQL query within the WHERE clause of another SQL query: this is usually called a "subquery" or "nested query". Time prevents us from covering subqueries here. It's best if you can avoid them anyhow: they are relatively confusing, they often lead to poor performance, and in most cases there is some way to achieve the same effect without using them.
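
Still, for reference, here is a minimal sketch of a subquery in the WHERE clause; it computes the same "argmax" over indiv_sample as the CTE join above, which achieves the same effect without nesting:

SELECT transaction_dt, name, state, transaction_amt
FROM indiv_sample
WHERE transaction_amt = (SELECT max(transaction_amt) FROM indiv_sample);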

If you'd like to learn more, you can read the relevant material in the PostgreSQL manual or look at slides from CS186 (slides 35-41).

Set and Multiset Operators (skip in class)

Like the relational algebra, SQL supports operators for union, intersection, and difference of relations. Because SQL is a multiset (i.e., duplicate-aware) language, it distinguishes between the set-based versions of these operators (which remove duplicates) and the multiset versions (which have rules about the number of duplicates in the output).

The syntax is simple:

<SELECT query>
<set operator>
<SELECT query>;

where the two queries have compatible schemas (the same number of columns, with compatible types), and the set operator is one of:

  • Union: UNION (set) or UNION ALL (multiset)
  • Intersection: INTERSECT (set) or INTERSECT ALL (multiset)
  • Difference: EXCEPT (set) or EXCEPT ALL (multiset).

The set-based versions of these operations are straightforward. Rather than teach the number of duplicates formed for each multiset operator, I'd encourage you to think about what's intuitive, and then test it out yourself!
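
For example, here is a tiny experiment using generate_series (so it runs on any PostgreSQL database) that contrasts UNION ALL with UNION:

SELECT x FROM generate_series(1,3) AS a(x)
UNION ALL
SELECT x FROM generate_series(2,4) AS b(x);
-- UNION ALL keeps duplicates: 2 and 3 each appear twice.
-- Plain UNION would instead return each of 1, 2, 3, 4 exactly once.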

As an example, you can run the query below to find the individual records that did not make it into our sample. (This query will run slowly).

In [15]:
# %%sql
# SELECT * FROM individual
# EXCEPT ALL
# SELECT * FROM indiv_sample
# LIMIT 5;

Getting Fancy with SQL Aggregates: Statistics, Windows and UDAs

Simple Descriptive Statistics in SQL

Statistics doesn't deal with individuals; it deals with groups: distributions, populations, samples and the like. As such, computing statistics in SQL focuses heavily on aggregation functions.

All SQL systems have simple descriptive statistics built in as aggregation functions:

  • min, max
  • count
  • sum
  • avg
  • stddev and variance, the sample standard deviation and variance.

PostgreSQL offers many more. Some handy ones include

  • stddev_pop and var_pop: the population standard deviation and variance, which you should use rather than stddev and variance if you know your data is the full population, not a sample.
  • covar_samp and covar_pop: sample and population covariance
  • corr, Pearson's correlation coefficient
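
As a quick sanity check of a few of these, over a synthetic column from generate_series (the alias names are illustrative):

SELECT avg(x), stddev(x), var_pop(x), corr(x, 2*x) AS perfect_corr
FROM generate_series(1,10) AS data(x);
-- corr(x, 2*x) returns 1: the two columns are perfectly linearly related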

Order Statistics: Aggregates requiring ordered input

You'll notice that a number of handy statistics are missing from this list, including the median and quartiles. That's because those are order statistics: they are defined based on an ordering of the values in a column.

SQL provides for this by allowing what it calls "ordered set functions", which require a WITHIN GROUP (ORDER BY <columns>) clause to accompany the order-statistic aggregate. For example, to compute the 25th percentile, 50th percentile (median) and 75th percentile in SQL, we can use the following:

There are two versions of the percentile function:

  • percentile_cont ("continuous"): interpolates between values
  • percentile_disc ("discrete"): returns an actual entry from the table

What will the following expressions return?

In [16]:
%%sql
SELECT 
    percentile_cont(0.5) WITHIN GROUP (ORDER BY x) 
FROM generate_series(1,10) AS data(x);
1 rows affected.
Out[16]:
percentile_cont
5.5
In [17]:
%%sql
SELECT 
    percentile_disc(0.5) WITHIN GROUP (ORDER BY x) 
FROM generate_series(1,10) AS data(x);
1 rows affected.
Out[17]:
percentile_disc
5
In [18]:
%%sql
SELECT 
    percentile_disc(0.25) WITHIN GROUP (ORDER BY x) as lower_quartile,
    percentile_disc(0.5) WITHIN GROUP (ORDER BY x) as median,
    percentile_disc(0.75) WITHIN GROUP (ORDER BY x) as upper_quartile
FROM generate_series(1,10) AS data(x);
1 rows affected.
Out[18]:
lower_quartile  median  upper_quartile
3               5       8

GROUP BY vs. WITHIN GROUP

Note the difference between WITHIN GROUP and GROUP BY:

  • WITHIN GROUP appears inside an aggregate expression in the SELECT list, not as a standalone clause
  • WITHIN GROUP is associated with a single aggregate function
  • WITHIN GROUP does not affect the number of groups

Side note for database aficionados: If you're clever, you can express order statistics like median in more "purely relational" SQL without resorting to WITHIN GROUP (ORDER BY ...), but (a) it's hard for people to understand, (b) it's very messy to get more than one order statistic in a single query, and (c) it's quite difficult for a query optimizer to understand and make it go fast.

Of course you can combine WITHIN GROUP and GROUP BY to compute order statistics within groups:

In [19]:
%%sql
SELECT state, 
       percentile_cont(0.25) WITHIN GROUP (ORDER BY transaction_amt) as lower_quartile,
       percentile_cont(0.5) WITHIN GROUP (ORDER BY transaction_amt) as median,
       percentile_cont(0.75) WITHIN GROUP (ORDER BY transaction_amt) as upper_quartile
FROM indiv_sample
GROUP BY state
LIMIT 5;
5 rows affected.
Out[19]:
state  lower_quartile  median  upper_quartile
AA     -10.0           -10.0   -10.0
AE     15.0            15.0    25.0
AK     25.0            40.5    100.0
AL     20.0            28.0    115.0
AP     27.0            48.5    83.5

Window Functions

Sometimes, for each row in the output of a query, you want to perform a calculation over a related set of output rows, often a "window" of rows that precede or follow it in some order. Again, this is not very "set-oriented", but SQL provides a mechanism for it, called a window function. The most common window functions are row_number in some order, rank in some order (where equivalent values in the ordering get the same rank), and ntile(n) in some order, which reports which n-tile the row falls in:

A simple example: what will the following query compute?

In [20]:
%%sql
WITH empsalary(depname, empno, salary) AS (
    VALUES ('develop', 8, 6000), 
           ('develop', 10, 5200),
           ('develop', 11, 5200),
           ('develop', 9, 4500),
           ('develop', 7, 4200),
           ('personnel', 2, 3900),
           ('personnel', 5, 3500),
           ('sales', 1, 5000),
           ('sales', 4, 4800),
           ('sales', 3, 4800)
)
SELECT 
    empno, depname, salary, 
    rank() OVER (PARTITION BY depname ORDER BY salary DESC),
    ntile(4) OVER (ORDER BY salary DESC) 
FROM empsalary
ORDER BY empno
10 rows affected.
Out[20]:
empno  depname    salary  rank  ntile
1      sales      5000    1     2
2      personnel  3900    1     4
3      sales      4800    2     2
4      sales      4800    2     2
5      personnel  3500    2     4
7      develop    4200    5     3
8      develop    6000    1     1
9      develop    4500    4     3
10     develop    5200    2     1
11     develop    5200    2     1

Dissecting The Query

rank() OVER (PARTITION BY depname ORDER BY salary DESC)
  • rank() is the window function
  • OVER introduces the window specification, which determines the set of rows the function is computed over
  • PARTITION BY (optional) splits the table according to the column(s) provided, much like GROUP BY. Unlike GROUP BY, however, the split is applied only during the window computation, and the result of the window computation is then attached to each original row.
  • ORDER BY determines the order of rows within each partition, i.e., the order over which the window function is applied
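
To see how ties behave, here is a minimal sketch (with made-up values) contrasting rank() with row_number():

SELECT x,
       rank()       OVER (ORDER BY x) AS rnk,
       row_number() OVER (ORDER BY x) AS rn
FROM (VALUES (10), (20), (20), (30)) AS t(x);
-- the tied values (20) share rank 2 and rank 3 is skipped,
-- while row_number assigns a distinct number to every row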




COMMIT -- end of SQL discussion

Concluding thoughts:

  • SQL is the most widely used language for manipulating and accessing data.
  • Learn SQL through practice. Give yourself time to do the homework and try writing queries multiple different ways.
  • We will return to SQL and the Relational model later in the class.

... and now we transition from tools for summarizing data to estimating probabilities and parameters of models.