We've seen a bunch of different query clauses now, and done some mixing and matching. How do they fit together? The order of evaluation should be thought of like this:

- The
`FROM`

and`WHERE`

clauses are evaluated to compute selections and joins. - The
`GROUP BY`

and`HAVING`

clauses are evaluated to for groups resulting from the previous step - The
`SELECT`

clause is evaluated, including any aggregates - The
`ORDER BY`

clause is evaluated - The
`LIMIT`

clause is used to cut off output production.

In [1]:

```
%load_ext sql
%sql postgresql://jegonzal:@localhost:5432/fec
%sql DROP VIEW IF EXISTS date_stats;
%sql DROP VIEW IF EXISTS rando;
```

Out[1]:

Up to now we've looked at a single query at a time. SQL also allows us to nest queries in various ways. In this section we look at the cleaner examples of how to do this in SQL: views and Common Table Expressions (CTEs).

In earlier examples, we created new tables and populated them from the result of queries over stored tables. There are two main drawbacks of that approach that may concern us in some cases:

- The new table uses up storage, even though it is recomputable from other tables.
- If the input tables change, the stored output does not reflect the new state of the input.

For this reason, SQL provides a notion of logical *views*: these are basically named queries that are re-evaluated upon each reference. They are rather like "macros" if you're familiar with that term.

The syntax is straightforward:

```
CREATE VIEW <name> AS
<SELECT statement>;
```

The resulting view `<name>`

can be used in an `SELECT`

query, but *not* in an `INSERT`

, `DELETE`

or `UPDATE`

query!

As an example, we might want a view that stores just some summary statistics of `transaction_amt`

s for each date:

In [2]:

```
%%sql
DROP VIEW IF EXISTS date_stats;
```

Out[2]:

In [3]:

```
%%sql
CREATE VIEW date_stats AS
SELECT
to_date(transaction_dt, 'MMDDYYYY') as day, -- Date Parsing
min(transaction_amt),
avg(transaction_amt),
stddev(transaction_amt),
max(transaction_amt)
FROM indiv_sample
GROUP BY transaction_dt;
```

Out[3]:

In [4]:

```
%%sql
SELECT * from date_stats limit 5;
```

Out[4]:

In [5]:

```
!psql fec -c "\dt"
```

it created a view

In [6]:

```
!psql fec -c "\dv"
```

In [7]:

```
!psql fec -c "\d date_stats"
```

In [8]:

```
%%sql
DROP VIEW IF EXISTS rando;
```

Out[8]:

In [9]:

```
%%sql
CREATE VIEW rando(rownum, rnd) AS
SELECT rownum, round(random())::INTEGER
FROM generate_series(1,50) AS ind(rownum)
```

Out[9]:

What is the value of the sum of all the rows?

In [10]:

```
%%sql
SELECT SUM(rnd) FROM rando
```

Out[10]:

What happens if I run it again?

In [11]:

```
%%sql
SELECT SUM(rnd) FROM rando
```

Out[11]:

One of the nice things about views is modularity: if we have a complex query, we can break it up into smaller views and the run queries on the views.

For example, now we can ask for the day with the highest variance in donations per state:

In [12]:

```
%%sql
SELECT day, stddev
FROM date_stats
WHERE stddev IS NOT NULL
ORDER BY stddev DESC
LIMIT 5;
```

Out[12]:

Views can help:

- Simplify queries
- Make complex queries more readable
- Share "sql programs" with others

Problem:

- Creating a new view for each (exploratory) query will result in a lot of views!
- views like:
`temp1`

,`temp1_joey`

,`temp1_joey_fixed`

, ...

** We need a mechanism to decompose query into views for the scope of a single query.**

`WITH`

)¶Think of these as a view that exists only during the query.

If we're only going to use a view within a single query, it is a little inelegant to `CREATE`

it, and then have to `DROP`

it later to recycle the view name.

*Common Table Expressions* (CTEs) are like views that we use on-the-fly. (If you know about lambdas in Python, you can think of CTEs as lambda views.) The syntax for CTEs is to use a `WITH`

clause in front of the query:

```
WITH <name> [(renamed columns)] AS
(<SELECT statement>)
[, <name2> AS (<SELECT statement>)...]
```

If you need multiple CTEs, you separate them with commas. We can rewrite our query above without a view as follows:

In [13]:

```
%%sql
WITH per_day_stats AS (
SELECT
to_date(transaction_dt, 'MMDDYYYY') as day, -- Date Parsing
min(transaction_amt),
avg(transaction_amt),
stddev(transaction_amt),
max(transaction_amt)
FROM indiv_sample
GROUP BY transaction_dt
)
SELECT day, stddev
FROM per_day_stats
WHERE stddev IS NOT NULL
ORDER by stddev DESC
LIMIT 1;
```

Out[13]:

`transaction_amt`

for `indiv_sample`

: those rows that have the maximum `transaction_amt`

:

In [14]:

```
%%sql
WITH biggest_gifts AS (
SELECT max(transaction_amt) AS max
FROM indiv_sample
)
SELECT I.transaction_dt, I.name, I.state, I.transaction_amt
FROM indiv_sample I, biggest_gifts B
WHERE I.transaction_amt = B.max;
```

Out[14]:

It is also possible to nest a SQL query within the `WHERE`

clause of another SQL query: this is usually called a "subquery" or "nested query". Time prevents us from covering subqueries here. It's best if you can avoid them anyhow: they are relatively confusing, they often lead to poor performance, and in most cases there is some way to achieve the same effect without using them.

If you'd like to learn more, you can read the relevant material in the PostgreSQL manual or look at slides from CS186 (slides 35-41).

Like the relational algebra, SQL supports the operators for union, intersect, and difference of relations. Becase SQL is a multiset (i.e. duplicate-aware) language, it distinguishes between the set-based versions of these operators (which remove duplicates) and the multiset versions (which have rules about the number of duplicates in the output.

The syntax is simple:

```
<SELECT query>
<set operator>
<SELECT query>;
```

where the two queries are *compatible* in the sense of schemas, and the set operator is one of:

- Union:
`UNION`

(set) or`UNION ALL`

(multiset) - Intersection:
`INTERSECT`

(set) or`INTERSECT ALL`

(multiset) - Difference:
`EXCEPT`

(set) or`EXCEPT ALL`

(multiset).

The set-based versions of these operations are straightforward. Rather than teach the number of duplicates formed for each multiset operator, I'd encourage you to think about what's intuitive, and then test it out yourself!

As an example, you can run the query below to find the individual records that did not make it into our sample. (This query will run slowly).

In [15]:

```
# %%sql
# SELECT * FROM individual
# EXCEPT ALL
# SELECT * FROM indiv_sample
# LIMIT 5;
```

Statistics doesn't deal with individuals, it deals with groups: distributions, populations, samples and the like. As such, computing statistics in SQL focuses heavily on aggregation functions.

All SQL systems have simple descriptive statistics built in as aggregation functions:

`min, max`

`count`

`sum`

`avg`

`stddev`

and`variance`

, the sample standard deviation and variance.

PostgreSQL offers many more. Some handy ones include

`stddev_pop`

and`var_pop`

: the population standard deviation and variance, which you should use rather than`stddev`

and`variance`

if you know your data is the full population, not a sample.`covar_samp`

and`covar_pop`

: sample and population covariance`corr`

, Pearson's correlation coefficient

You'll notice that a number of handy statistics are missing from this list, including the median and quartiles. That's because those are order statistics: they are defined based on an ordering of the values in a column.

SQL provides for this by allowing what it calls "ordered set functions", which require a `WITHIN GROUP (ORDER BY <columns>)`

clause to accompany the order-statistic aggregate. For example, to compute the 25th percentile, 50th percentile (median) and 75th percentile in SQL, we can use the following:

There are two versions of the percentile function:

inuous : interpolates`percentile_cont`

rete : returns an entry from the table`percentile_disc`

What will the following expressions return?

In [16]:

```
%%sql
SELECT
percentile_cont(0.5) WITHIN GROUP (ORDER BY x)
FROM generate_series(1,10) AS data(x);
```

Out[16]:

In [17]:

```
%%sql
SELECT
percentile_disc(0.5) WITHIN GROUP (ORDER BY x)
FROM generate_series(1,10) AS data(x);
```

Out[17]:

In [18]:

```
%%sql
SELECT
percentile_disc(0.25) WITHIN GROUP (ORDER BY x) as lower_quartile,
percentile_disc(0.5) WITHIN GROUP (ORDER BY x) as median,
percentile_disc(0.75) WITHIN GROUP (ORDER BY x) as upper_quartile
FROM generate_series(1,10) AS data(x);
```

Out[18]:

** GROUP BY vs. WITHIN GROUP **

Note the difference between `WITHIN GROUP`

and `GROUP BY`

:

`WITHIN GROUP`

is in the`FROM`

clause`WITHIN GROUP`

is associated with a*single*aggregate function`WITHIN GROUP`

*does not affect the number of groups*

*Side note for database aficionados: If you're clever, you can express order statistics like median in more "purely relational" SQL without resorting to WITHIN GROUP (ORDER BY ...), but (a) it's hard for people to understand, (b) it's very messy to get more than one order statistic in a single query, and (c) it's quite difficult for a query optimizer to understand and make it go fast.*

Of course you can combine `WITHIN GROUP`

and `GROUP BY`

to compute order statistics within groups:

In [19]:

```
%%sql
SELECT state,
percentile_cont(0.25) WITHIN GROUP (ORDER BY transaction_amt) as lower_quartile,
percentile_cont(0.5) WITHIN GROUP (ORDER BY transaction_amt) as median,
percentile_cont(0.75) WITHIN GROUP (ORDER BY transaction_amt) as upper_quartile
FROM indiv_sample
GROUP BY state
LIMIT 5;
```

Out[19]:

Sometimes, for each row in the output of a query, you want perform a calculation on a related *set* of rows in the outputâ€”often a "window" of rows that precede or follow in some order. Again, this is not very "set-oriented", but SQL provides a mechanism to do it, called a *window function*. The most common window functions are `row_number`

in some order, `rank`

in some order (where equivalent values in the ordering get the same rank), and `ntile(n)`

in some order, which reports which n-tile the row is in:

**What will the following query compute?**

In [20]:

```
%%sql
WITH empsalary(depname, empno, salary) AS (
VALUES ('develop', 8, 6000),
('develop', 10, 5200),
('develop', 11, 5200),
('develop', 9, 4500),
('develop', 7, 4200),
('personnel', 2, 3900),
('personnel', 5, 3500),
('sales', 1, 5000),
('sales', 4, 4800),
('sales', 3, 4800)
)
SELECT
empno, depname, salary,
rank() OVER (PARTITION BY depname ORDER BY salary DESC),
ntile(4) OVER (ORDER BY salary DESC)
FROM empsalary
ORDER BY empno
```

Out[20]:

```
rank() OVER (PARTITION BY depname ORDER BY salary DESC)
```

is the window function`rank()`

connects the function to where it is run`OVER`

(optional) like group by splits the table according to the column(s) provided. However unlike group by the split is only applied during the window computation and the result of the window computation is then attached to the original row.`PARTITION BY`

determines the column on which the windowed function is applied`ORDER BY`

Concluding thoughts:

- SQL is the most
**widely used language**for manipulating and accessing data. **Learn SQL through practice.**Give yourself time to do the homework and try writing queries multiple different ways.- We will return to SQL and the Ralational model later in the class.