The first part of this notebook discusses joins in some additional detail. Like lec22, it uses SQLite, so it should run without any special configuration.
import sqlalchemy
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
# Delete the database if it already exists.
from pathlib import Path
dbfile = Path("lec23.db")
if dbfile.exists():
    dbfile.unlink()
sqlite_uri = "sqlite:///lec23.db"
sqlite_engine = sqlalchemy.create_engine(sqlite_uri)
sql_expr = """
CREATE TABLE Sailors (
sid INTEGER,
sname CHAR(20),
rating INTEGER,
age REAL,
PRIMARY KEY (sid));
"""
result = sqlite_engine.execute(sql_expr)
sql_expr = """
CREATE TABLE Reserves (
sid INTEGER,
bid INTEGER,
day DATE,
PRIMARY KEY (sid, bid, day),
FOREIGN KEY (sid) REFERENCES Sailors,
FOREIGN KEY (bid) REFERENCES Boats);
"""
result = sqlite_engine.execute(sql_expr)
sql_expr = """
CREATE TABLE Boats (
bid INTEGER,
bname CHAR (20),
color CHAR(10),
PRIMARY KEY (bid));
"""
result = sqlite_engine.execute(sql_expr)
sql_expr = """
INSERT INTO Sailors VALUES
(22, 'dustin', 7, 45.0),
(31, 'lubber', 8, 55.0),
(58, 'rusty', 10, 35.0);
"""
result = sqlite_engine.execute(sql_expr)
sql_expr = """
INSERT INTO Boats VALUES
(101, 'Nina', 'red'),
(102, 'Pinta', 'green'),
(103, 'Santa Maria', 'blue');
"""
result = sqlite_engine.execute(sql_expr)
sql_expr = """
INSERT INTO Reserves VALUES
(22, 101, '10/10/96'),
(58, 103, '11/12/96');
"""
result = sqlite_engine.execute(sql_expr)
Once you've run the cells above, we have our basic tables. Reserves and Sailors are shown below.
sql_expr = """
SELECT * FROM Reserves;
"""
pd.read_sql(sql_expr, sqlite_engine)
sql_expr = """
SELECT * FROM Sailors;
"""
pd.read_sql(sql_expr, sqlite_engine)
As we saw in the previous lecture, we can join these tables by specifying two tables in the FROM clause and setting an equality condition to match the rows from the two tables. The R and S variables that follow the word AS are known as "range variables".
sql_expr = """
SELECT *
FROM Reserves AS R, Sailors AS S
WHERE R.sid = S.sid;"""
pd.read_sql(sql_expr, sqlite_engine)
If we don't provide a WHERE clause, our select statement returns the Cartesian product of the two tables.
sql_expr = """
SELECT *
FROM Reserves AS R, Sailors AS S;"""
pd.read_sql(sql_expr, sqlite_engine)
Of course, when we join tables we do not have to return every column.
sql_expr = """
SELECT R.sid, sname, age
FROM Reserves AS R, Sailors AS S
WHERE R.sid = S.sid;"""
pd.read_sql(sql_expr, sqlite_engine)
Technically, the "AS" isn't required to define our range variables. It is often omitted in practice. For better or worse, real world SQL implementations generally have fairly loose syntax.
sql_expr = """
SELECT R.sid, sname, age
FROM Reserves R, Sailors S
WHERE R.sid = S.sid;"""
pd.read_sql(sql_expr, sqlite_engine)
We actually don't need range variables at all, though I don't recommend writing code like this. It's hard to read and needlessly verbose.
sql_expr = """
SELECT Reserves.sid, sname, age
FROM Reserves, Sailors
WHERE Reserves.sid = Sailors.sid;"""
pd.read_sql(sql_expr, sqlite_engine)
The tables we generated above were all the result of "equi joins". An equi join is a join operation where we use equality to match rows. We can also perform "nonequi joins", which is when we use something other than equality to match rows.
sql_expr = """
SELECT S1.sname as junior_name, S1.age as junior_age,
S2.sname as senior_name, S2.age as senior_age
FROM Sailors AS S1, Sailors AS S2
WHERE S1.age < S2.age;"""
pd.read_sql(sql_expr, sqlite_engine)
There are several ways to write a join in SQL. The version we used above, with the explicit equality condition in the WHERE clause, is didactically useful for understanding the relationship between a Cartesian product and a join. However, in practice, it is more common to write joins with syntax like the example below using "INNER JOIN ... ON".
sql_expr = """SELECT s.sid, s.sname, r.bid
FROM Sailors s, Reserves r
WHERE s.sid = r.sid
AND s.age > 20;"""
pd.read_sql(sql_expr, sqlite_engine)
sql_expr = """SELECT s.sid, s.sname, r.bid
FROM Sailors s INNER JOIN Reserves r
ON s.sid = r.sid
AND s.age > 20;"""
pd.read_sql(sql_expr, sqlite_engine)
The INNER JOIN format is generally preferred because it separates the WHERE conditions that are used for joining from those that are used for other purposes (e.g. s.age > 20).
There is a third way to do this sort of join called the "NATURAL JOIN". This is just the INNER JOIN where the ON clause is automatically filled in with equality conditions for all attributes with the same name.
sql_expr = """SELECT s.sid, s.sname, r.bid
FROM Sailors s NATURAL JOIN Reserves r
WHERE s.age > 20;"""
pd.read_sql(sql_expr, sqlite_engine)
In addition to INNER JOIN, SQL also supports LEFT OUTER JOIN, RIGHT OUTER JOIN, and FULL OUTER JOIN. These have exactly the semantics you should be familiar with from the merge command in pandas. Note that what is called an "outer join" in pandas is called a "full outer join" in SQL.
Technically the OUTER keyword is optional, though we recommend that you keep it. There is no such thing as a left inner join, so whether you say LEFT OUTER JOIN or LEFT JOIN doesn't matter semantically.
The SQLite DBMS (which we are currently using in this notebook) does not support right joins or full outer joins. Try using sqlfiddle, postgres, or MySQL if you want to experiment with them; a sketch of the syntax appears below.
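The cell below is a minimal sketch of a full outer join, wrapped in a try/except because older SQLite versions reject the syntax; on a DBMS that supports it (e.g. PostgreSQL), the same query would run as-is and return every sailor and every reservation, padded with NULLs where there is no match.
sql_expr = """SELECT s.sid, s.sname, r.bid
FROM Sailors s FULL OUTER JOIN Reserves r
ON s.sid = r.sid;"""
try:
    # On PostgreSQL this returns all rows from both tables, matched where possible.
    display(pd.read_sql(sql_expr, sqlite_engine))
except Exception as e:
    print(e)  # older SQLite versions do not support FULL OUTER JOIN
A LEFT OUTER JOIN, on the other hand, is supported directly: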
sql_expr = """SELECT s.sid, s.sname, r.bid
FROM Sailors s LEFT OUTER JOIN Reserves r
ON s.sid = r.sid
AND s.age > 20;"""
pd.read_sql(sql_expr, sqlite_engine)
Null values are a little tricky in practice in SQL. Let's explore why.
sql_expr = """
INSERT INTO sailors VALUES(11, 'Jack Sparrow', NULL, 35);
"""
sqlite_engine.execute(sql_expr)
The new sailor appears with no problems.
sql_expr = """
SELECT * FROM sailors;"""
pd.read_sql(sql_expr, sqlite_engine)
If we want to enumerate all the items with (or without) NULL values for a column, we have to use the special "IS NULL" or "IS NOT NULL" syntax. We can't use "= NULL".
sql_expr = """
SELECT * FROM sailors
WHERE rating IS NULL;"""
pd.read_sql(sql_expr, sqlite_engine)
You might notice above that our null was rendered as "None" instead of "NaN" by pandas. This isn't particularly important, but be aware that null might be represented in more than one way when exported out of our DBMS and into another environment like pandas.
sql_expr = """
SELECT * FROM sailors
WHERE rating IS NOT NULL;"""
pd.read_sql(sql_expr, sqlite_engine)
sql_expr = """
SELECT * FROM sailors
WHERE rating = NULL;"""
pd.read_sql(sql_expr, sqlite_engine)
If we try to combine a null value with anything else using any expression in a select list, the result will be null.
sql_expr = """
SELECT *, rating + 1 FROM sailors;"""
pd.read_sql(sql_expr, sqlite_engine)
sql_expr = """
SELECT *, rating * age FROM sailors;"""
pd.read_sql(sql_expr, sqlite_engine)
A rule of thumb is that x OP NULL returns NULL for any operation OP, even if x is itself NULL. In other words, "NULL = NULL" returns NULL.
sql_expr = """
SELECT *, rating = NULL FROM sailors;"""
pd.read_sql(sql_expr, sqlite_engine)
sql_expr = """
SELECT *, rating != NULL FROM sailors;"""
pd.read_sql(sql_expr, sqlite_engine)
This rule isn't absolutely true. For example, NULL OR TRUE returns TRUE, but NULL OR FALSE returns NULL. In the example below, we see that SQLite represents "TRUE" by the value 1.0.
sql_expr = """
SELECT *, ((rating > 8) OR NULL) FROM sailors;"""
pd.read_sql(sql_expr, sqlite_engine)
NULLs basically act as a third logical value that is different than TRUE or FALSE. For AND, the possible outcomes involving null are:
FALSE AND NULL --> FALSE
TRUE AND NULL --> NULL
NULL AND NULL --> NULL
sql_expr = """
SELECT * FROM sailors
WHERE rating > 8 AND (age < 100);"""
pd.read_sql(sql_expr, sqlite_engine)
OR is a little trickier. Basically:
FALSE OR NULL --> NULL
TRUE OR NULL --> TRUE
NULL OR NULL --> NULL
sql_expr = """
SELECT * FROM sailors
WHERE rating > 8 OR (age < 100);"""
pd.read_sql(sql_expr, sqlite_engine)
Lastly, the logical inversion of null, i.e. NOT NULL, is also NULL.
sql_expr = """
SELECT * FROM sailors
WHERE NOT rating > 8;"""
pd.read_sql(sql_expr, sqlite_engine)
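To see these truth tables directly, we can evaluate the expressions themselves in a small sanity-check query (SQLite renders TRUE as 1, FALSE as 0, and NULL as None once the result reaches pandas):
sql_expr = """
SELECT (0 AND NULL) AS false_and_null,
       (1 AND NULL) AS true_and_null,
       (0 OR NULL) AS false_or_null,
       (1 OR NULL) AS true_or_null,
       (NOT NULL) AS not_null;"""
pd.read_sql(sql_expr, sqlite_engine)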
Everything below this line is bonus material and won't appear on any hws or exams.
In this notebook we will connect to a remote SQL database and examine more advanced SQL queries. As with the first notebook, we will be using the core sqlalchemy library as an abstraction layer on top of the underlying database management system. We will not be using the sophisticated object relational management features available in sqlalchemy.
import sqlalchemy
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
Running this notebook locally is tricky. To do so, you'll need a local PostgreSQL server with the data loaded, e.g. via:
bunzip2 -c | psql ds100
The dataset is a little over 2 gigabytes in size.
After you have your postgres server running, to connect to your local database you create an engine with the following command:
engine = sqlalchemy.create_engine("postgres://postgres:@localhost:5432/ds100")
#uncomment and run to install needed libraries for this notebook
#!conda install -y psycopg2
#!conda install -y postgresql
#!pip install ipython-sql
engine = sqlalchemy.create_engine("postgres://postgres:@localhost:5432/fec_lec23")
This database contains quite a few more tables. We will explore these tables in this lecture.
engine.table_names()
It is often assumed that when working with a database, all relations (tables) must come from outside or be derived from other stored data. However, it is also possible to construct tables directly in SQL.
Sometimes it's useful to auto-generate data in queries, rather than examine data in the database. This is nice for testing, but it can also be useful for playing some computational tricks, as you'll see in your homework.
SQL has a simple scalar function called random that returns a random value between 0.0 and 1.0. You can use this if you need to generate a column of random numbers. (The PostgreSQL manual doesn't promise much about the statistical properties of this random number generator.)
Let's roll a 6-sided die for each of the students.
sql_expr = """
SELECT *, ROUND(RANDOM() * 6) as roll_dice
FROM students;
"""
pd.read_sql(sql_expr, engine)
Suppose we want to generate a whole bunch of random numbers, not related to any particular stored table -- can we do that in SQL?
SQL has a notion of table-valued functions: functions that return tables, and hence can be used in a FROM clause of a query. The standard table-valued function is called generate_series, and it's much like numpy's arange:
sql_expr = """
SELECT *
FROM GENERATE_SERIES(1, 5) as n;
"""
pd.read_sql(sql_expr, engine)
sql_expr = """
SELECT *
FROM GENERATE_SERIES(1, 10, 2) as n;
"""
pd.read_sql(sql_expr, engine)
So if we want to get 20 die rolls, we can write a query that generates a table containing 20 such rolls.
sql_expr = """
SELECT *, ROUND(6*RANDOM()) AS roll
FROM GENERATE_SERIES(1, 20) as n;
"""
pd.read_sql(sql_expr, engine)
Of course, we don't actually care very much about the number of the roll, so there's no need to select the value returned by GENERATE_SERIES.
sql_expr = """
SELECT ROUND(6*RANDOM()) AS roll
FROM GENERATE_SERIES(1, 20) as n;
"""
pd.read_sql(sql_expr, engine)
Suppose we want to keep our random numbers around for later use. To do this, we need to create a new relation in our database. We can do this using "CREATE TABLE AS", as shown below.
sql_expr = """
CREATE TABLE roll20
AS (SELECT ROUND(6*RANDOM()) AS roll
FROM GENERATE_SERIES(1, 20) as n);
"""
engine.execute(sql_expr)
sql_expr = """
SELECT *
FROM roll20;
"""
pd.read_sql(sql_expr, engine)
We can test the properties of our rolls using aggregation functions. Challenge: Try to write a SQL query which computes the maximum, minimum, and average of our 20 die rolls, which are currently stored in the roll20 table.
sql_expr = """
-- your code here
"""
pd.read_sql(sql_expr, engine)
sql_expr = """
SELECT AVG(roll), MIN(roll), MAX(roll)
FROM roll20;
"""
pd.read_sql(sql_expr, engine)
As another exercise, try to generate a table that computes the number of rolls of each value in our roll20 table. For example, one row might be (3, 4), indicating that the number 3 occurred 4 times. Hint: you'll want to use GROUP BY.
sql_expr = """
--- your code here
"""
pd.read_sql(sql_expr, engine)
sql_expr = """
SELECT roll, COUNT(*)
FROM roll20
GROUP BY roll
"""
pd.read_sql(sql_expr, engine)
Ideally, we don't want to clog up our database with a bunch of relations like roll20. One approach is to use the TEMPORARY keyword. Relations created as temporary are invisible to other users connected to the database, and are destroyed as soon as the connection is terminated.
sql_expr = """
CREATE TEMPORARY TABLE temproll20
AS (SELECT ROUND(6*RANDOM()) AS roll
FROM GENERATE_SERIES(1, 20) as n);
"""
engine.execute(sql_expr)
sql_expr = """
SELECT COUNT(*)
FROM temproll20
GROUP BY roll;
"""
pd.read_sql(sql_expr, engine)
Another approach is to use a subexpression in the FROM clause, as shown below.
sql_expr = """
SELECT COUNT(*)
FROM (SELECT ROUND(6*RANDOM()) AS roll
FROM GENERATE_SERIES(1, 10000) as n
) sub
GROUP BY roll;
"""
pd.read_sql(sql_expr, engine)
A more common approach is to use "WITH AS" to create a temporary named table that goes away as soon as the statement completes execution.
sql_expr = """
WITH rolls AS (
SELECT ROUND(6*RANDOM()) AS roll
FROM GENERATE_SERIES(1, 10000) as n
)
SELECT COUNT(*)
FROM rolls
GROUP BY roll;
"""
pd.read_sql(sql_expr, engine)
Does this table look reasonable?
The answer follows this cell.
Skip when you're ready.
Our number generation scheme was flawed. Rounding a random number in the range [0, 1] multiplied by 6 produces 7 possible values (0 through 6), coming from 7 intervals of interest: [0, 0.5), [0.5, 1.5), [1.5, 2.5), ..., [5.5, 6]. The first and last intervals are half the width of the others, so 0 and 6 each occur with probability 1/12, while 1 through 5 each occur with probability 1/6.
To fix this we can use ceiling instead.
sql_expr = """
SELECT COUNT(*)
FROM (SELECT CEIL(6*RANDOM()) AS roll
FROM GENERATE_SERIES(1, 10000) as n
) sub
GROUP BY roll;
"""
pd.read_sql(sql_expr, engine)
Now a harder SQL challenge that you might be able to figure out. The answer above creates a temporary anonymous table that is then analyzed using COUNT(*). Try to create a new SQL expression that does not create any new tables or use subexpressions, but still computes this same table.
sql_expr = """
--- your code here
"""
pd.read_sql(sql_expr, engine)
The answer follows this cell.
Skip when you're ready.
sql_expr = """
SELECT COUNT(*), CEIL(6*RANDOM()) AS roll
FROM GENERATE_SERIES(1, 100) as n
GROUP BY roll
"""
pd.read_sql(sql_expr, engine)
Note that if we want our rolls to be integers, we can use a PostgreSQL typecast operator (postfix ::<type>):
sql_expr = """
SELECT CEIL(6*RANDOM())::INTEGER AS roll, COUNT(*)
FROM generate_series(1, 100000) AS n
GROUP BY roll
ORDER BY count
"""
pd.read_sql(sql_expr, engine)
We won't discuss this during lecture, but it is included for your edification.
Now suppose we want to populate a "matrix" relation my_matrix(r, c, v) full of random values. Consider the following numpy code.
import numpy as np
np.random.seed(43)
# uniform random integers between 1 and 5 (randint's upper bound is exclusive)
my_matrix = np.random.randint(1,6, (3,2)).astype('float')
my_matrix
Question: How could we store the above matrix as a table?
Answers:
Building the table in Numpy
my_matrix.flatten()
# Advanced numpy (you don't need to know this ...)
(col_id, row_id) = np.meshgrid(np.arange(2), np.arange(3))
mat_a = pd.DataFrame(
np.vstack([row_id.flatten().T, col_id.flatten(), my_matrix.flatten()]).T,
columns=['r', 'c', 'v'])
mat_a
In this relational version we need to explicitly generate the r and c values. We can do this via SQL's built-in Cartesian product!
sql_expr = """
SELECT rows.r, columns.c, CEIL(6*RANDOM())::INTEGER AS v
FROM generate_series(0,2) AS rows(r),
generate_series(0,1) AS columns(c);
"""
pd.read_sql(sql_expr, engine)
A few take-aways from the previous cell: the schema of my_matrix reflects the fact that the value v is a function of the row (r) and column (c) IDs.
Sometimes we may want a custom scalar function that isn't built into SQL. Some database systems allow you to register your own user-defined functions (UDFs) in one or more programming languages. Conveniently, PostgreSQL allows us to register user-defined functions written in Python. Be aware of two things:
1. Calling Python for each row in a query is quite a bit slower than using the pre-compiled built-in functions in SQL ... this is akin to the use of Python loops instead of numpy calls. If you can avoid using Python UDFs you should do so to get better performance.
2. Python is a full-featured programming language with access to your operating system's functionality, which means it can reach outside of the scope of the query and wreak havoc, including running arbitrary UNIX commands. (PostgreSQL refers to this as an untrusted language.) Be very careful with the Python UDFs you use in your Postgres queries! If you want to be safer, write UDFs in a trusted language. PostgreSQL has a number of other languages to choose from, including Java and even R!
First we tell PostgreSQL we want to use the plpythonu package (so named because of "pl" for "programming language", "u" for "untrusted"):
try:
    engine.execute("""
    CREATE EXTENSION IF NOT EXISTS plpythonu;
    """)
    udf_support = True
except Exception as e:
    print(e)
    udf_support = False
Now let's write some trivial Python code and register it as a UDF using the CREATE FUNCTION command. Since SQL is a typed language, we need to specify the SQL types for the input and output to our function, in addition to the code (within $$ delimiters) and the language:
if udf_support:
    engine.execute("""
DROP FUNCTION IF EXISTS fib(x INTEGER);
""")
    engine.execute("""
CREATE FUNCTION fib(x INTEGER) RETURNS INTEGER
AS $$
def fib(x):
    if x < 2:
        return x
    else:
        return fib(x-1) + fib(x-2)
return fib(x)
$$ LANGUAGE plpythonu;
""")
if udf_support:
    sql_expr = """
    SELECT x, fib(x)
    FROM generate_series(1,10) AS row(x);
    """
    display(pd.read_sql(sql_expr, engine))
It is possible to create transactions that isolate changes. This is done by starting a transaction with BEGIN. We can then proceed to make changes to the database; during this time, others will not be able to see our changes until we end the transaction by saying ROLLBACK or COMMIT:
BEGIN;
UPDATE students SET gpa = 3.0 WHERE name = 'Bill Gates';
SELECT * FROM students;
ROLLBACK;
SELECT * FROM students;
Try running this in the postgres shell...
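If you would rather drive the same pattern from Python, here is a minimal sketch using SQLAlchemy's connection-level transactions. It assumes the students table from the example above and uses the SQLAlchemy 1.x begin()/rollback() API:
conn = engine.connect()
trans = conn.begin()  # BEGIN
conn.execute("UPDATE students SET gpa = 3.0 WHERE name = 'Bill Gates'")
display(pd.read_sql("SELECT * FROM students", conn))  # the change is visible inside the transaction
trans.rollback()  # undo the update
display(pd.read_sql("SELECT * FROM students", conn))  # back to the original values
conn.close()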
Statistics doesn't deal with individuals, it deals with groups: distributions, populations, samples and the like. As such, computing statistics in SQL focuses heavily on aggregation functions.
All SQL systems have simple descriptive statistics built in as aggregation functions:
min, max
count
sum
avg
stddev and variance, the sample standard deviation and variance.
PostgreSQL offers many more. Some handy ones include:
stddev_pop and var_pop: the population standard deviation and variance, which you should use rather than stddev and variance if you know your data is the full population, not a sample.
covar_samp and covar_pop: sample and population covariance.
corr, Pearson's correlation coefficient.
Won't discuss during lecture, but provided for your reference.
You'll notice that a number of handy statistics are missing from this list, including the median and quartiles. That's because those are order statistics: they are defined based on an ordering of the values in a column.
SQL provides for this by allowing what it calls "ordered set functions", which require a WITHIN GROUP (ORDER BY <columns>)
clause to accompany the order-statistic aggregate. For example, to compute the 25th percentile, 50th percentile (median) and 75th percentile in SQL, we can use the following:
sql_expr = """
SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY x)
FROM generate_series(1,10) AS data(x);
"""
pd.read_sql(sql_expr, engine)
There are two versions of the percentile function:
percentile_cont ("continuous"): interpolates between values.
percentile_disc ("discrete"): returns an actual entry from the table.
What will the following expressions return?
sql_expr = """
SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY x)
FROM generate_series(1,10) AS data(x);
"""
pd.read_sql(sql_expr, engine)
We can compute the edges and middle of the box in a box plot:
sql_expr = """
SELECT
percentile_disc(0.25) WITHIN GROUP (ORDER BY x) as lower_quartile,
percentile_disc(0.5) WITHIN GROUP (ORDER BY x) as median,
percentile_disc(0.75) WITHIN GROUP (ORDER BY x) as upper_quartile
FROM generate_series(1,10) AS data(x);
"""
pd.read_sql(sql_expr, engine)
To begin our exploration of real data, we will work with the publicly-available campaign finance data from the Federal Election Commission for the 2016 election, stored in a PostgreSQL database.
And let's have a look at the individual table's schema:
inspector = sqlalchemy.inspect(engine)
pd.DataFrame(inspector.get_columns("individual")).set_index('name')
If you are curious about the meaning of these columns, check out the FEC data description.
How big is this table?
sql_expr = """
SELECT COUNT(*)
FROM individual
"""
pd.read_sql(sql_expr, engine)
LIMIT and sampling
This is not the first topic usually taught in SQL, but it's extremely useful for exploration.
OK, now we have some serious data loaded and we're ready to explore it.
Database tables are often big--hence the use of a database system. When browsing them at first, we may want to look at exemplary rows: e.g., an arbitrary number of rows, or a random sample of the rows.
To look at all of the data in the individual table, we would simply write:
select * from individual;
But that would return 20,347,829 rows into our Jupyter notebook's memory, and perhaps overflow the RAM in your computer. Instead, we could limit the size of the output to the first few rows as follows:
sql_expr = """
SELECT *
FROM individual
LIMIT 4;
"""
pd.read_sql(sql_expr, engine)
Sampling instead of the limit clause
As data scientists, we should be concerned about spending much time looking at a biased subset of our data. Instead, we might want an i.i.d. random sample of the rows in the table. There are various methods for sampling from a table. A simple one built into many database systems including PostgreSQL is Bernoulli sampling, in which the decision to return each row is made randomly and independently. As a metaphor, the database engine "flips a coin" for each row to decide whether to return it. We can influence the sampling rate by choosing the probability of a "true" result of the coinflip.
This is done on a per-table basis in the FROM
clause of the query like so:
sql_expr = """
SELECT *
FROM individual TABLESAMPLE BERNOULLI(.00001);
"""
pd.read_sql(sql_expr, engine)
To learn more about the TABLESAMPLE clause, check out the select docs. For example, we can specify a random seed using REPEATABLE. If you don't know what a random seed is, don't worry about it. Note that there is a second sampling method called block sampling, which is a lot like cluster sampling at the level of pages on disk!
sql_expr = """
SELECT *
FROM individual TABLESAMPLE BERNOULLI(.00001) REPEATABLE(42);
"""
pd.read_sql(sql_expr, engine)
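The block-sampling method mentioned above is exposed in PostgreSQL as TABLESAMPLE SYSTEM, which keeps or discards whole disk pages rather than individual rows, so the rows it returns tend to arrive in clusters. A sketch (the argument is again a percentage; the exact output depends on the table's physical layout):
sql_expr = """
SELECT *
FROM individual TABLESAMPLE SYSTEM(.01) REPEATABLE(42);
"""
pd.read_sql(sql_expr, engine)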
Three things to note relative to our previous limit construct: the sample is computed by scanning the entire table, so it is not cheaper than reading all the rows; the rows returned differ from run to run unless we fix a seed with REPEATABLE; and the number of rows returned is itself random, so we only control the expected sample size.
For these reasons, if we want a proper i.i.d sample, it's a good idea to compute a nice-sized sample and store it, keeping it reasonably large for more general use. Since we will not be updating any rows in our individual table, we can do this without worrying that the sample will get "out of date" with respect to the contents of individual.
We can use the CREATE TABLE AS SELECT ... (a.k.a. CTAS) pattern to create a table that saves the output of a query:
DROP TABLE IF EXISTS indiv_sample;
CREATE TABLE indiv_sample AS
SELECT *
FROM individual TABLESAMPLE BERNOULLI(.1) REPEATABLE(42);
sql_expr = """
SELECT *
FROM individual TABLESAMPLE BERNOULLI(.1) REPEATABLE(42)
LIMIT 5;
"""
pd.read_sql(sql_expr, engine)
sql_expr = """
SELECT count(*)
FROM individual TABLESAMPLE BERNOULLI(.1) REPEATABLE(42);
"""
pd.read_sql(sql_expr, engine)
sql_expr = """
DROP TABLE IF EXISTS indiv_sample;
CREATE TABLE indiv_sample AS
SELECT *
FROM individual TABLESAMPLE BERNOULLI(.1) REPEATABLE(42);
"""
engine.execute(sql_expr)
Here is an alternative method to construct a random sample of a fixed size. Note that this is not as efficient and will take several minutes to complete.
CREATE TABLE indiv_sample2 AS
SELECT *, RANDOM() AS u
FROM individual
ORDER BY u
LIMIT 20000;
sql_expr = """
SELECT *, RANDOM() AS u
FROM individual
ORDER BY u
LIMIT 5;
"""
pd.read_sql(sql_expr, engine)
OK, we already had a peek at the individual table. Now let's look at specific attributes (columns) related to who is donating how much.
In addition to referencing the columns of individual in the select clause, we can also derive new columns by writing field-level (so-called "scalar") functions. Typically we reference some table columns in those functions.
In our case, let's compute the log of transaction_amt for subsequent plotting. SQL comes with many typical functions you can use in this way, and PostgreSQL is particularly rich on this front; see the PostgreSQL manual for details.
We'll look at indiv_sample rather than individual while we're just exploring.
sql_expr = """
SELECT name, state, cmte_id,
transaction_amt, log(transaction_amt)
FROM indiv_sample
LIMIT 10;
"""
pd.read_sql(sql_expr, engine)
We can combine SQL with python in the following way:
sql_expr = """
SELECT transaction_amt AS amt
FROM indiv_sample
WHERE transaction_amt > 0;
"""
result = pd.read_sql(sql_expr, engine)
_ = sns.distplot(result['amt'])
sql_expr = """
SELECT LOG(transaction_amt) AS log_amt
FROM indiv_sample
WHERE transaction_amt > 0;
"""
result = pd.read_sql(sql_expr, engine)
df = result['log_amt']
sns.distplot(df.astype('float'))
scales = np.array([1, 10, 20, 100, 500, 1000, 5000])
_ = plt.xticks(np.log10(scales), scales)
sql_expr = """
SELECT transaction_amt AS amt
FROM indiv_sample
WHERE transaction_amt > 5000;
"""
result = pd.read_sql(sql_expr, engine)
_ = sns.distplot(result['amt'], rug=True)
sql_expr = """
SELECT transaction_amt AS amt
FROM individual
WHERE transaction_amt > 5000;
"""
result = pd.read_sql(sql_expr, engine)
_ = sns.distplot(result['amt'])
sql_expr = """
SELECT log(transaction_amt) AS log_amt
FROM individual
WHERE transaction_amt > 5000;
"""
result = pd.read_sql(sql_expr, engine)
sns.distplot(result['log_amt'])
scales = np.array([5000, 20000, 100000])
_ = plt.xticks(np.log10(scales), scales)
sql_expr = """
SELECT log(transaction_amt) AS log_amt
FROM individual
WHERE transaction_amt > 1000000;
"""
result = pd.read_sql(sql_expr, engine)
sns.distplot(result['log_amt'], rug=True)
scales = np.array([1000000, 5000000, 50000000])
_ = plt.xticks(np.log10(scales), scales)
CASE statements: SQL conditionals in the SELECT clause
What about smaller donations?
sql_expr = """
SELECT name, state, cmte_id,
transaction_amt, LOG(transaction_amt)
FROM indiv_sample
WHERE transaction_amt < 10
LIMIT 10;
"""
try:
    display(pd.read_sql(sql_expr, engine))
except Exception as e:
    print(e)
Uh oh, log is not defined for numbers <= 0! We need a conditional statement in the select clause to decide what function to call. We can use SQL's case construct for that.
sql_expr = """
SELECT transaction_amt,
CASE WHEN transaction_amt > 0 THEN log(transaction_amt)
WHEN transaction_amt = 0 THEN 0
ELSE -1*(log(abs(transaction_amt)))
END AS log_amt
FROM indiv_sample
"""
try:
    result = pd.read_sql(sql_expr, engine)
    sns.distplot(result['log_amt'])
except Exception as e:
    print(e)
sql_expr = """
SELECT transaction_amt, cmte_id, transaction_dt, name, city, state, memo_text, occupation
FROM individual
ORDER BY transaction_amt DESC
LIMIT 10
"""
pd.read_sql(sql_expr, engine)
sql_expr = """
SELECT name, SUM(transaction_amt) AS total_amt
FROM individual
GROUP BY name
ORDER BY total_amt DESC
LIMIT 10
"""
pd.read_sql(sql_expr, engine)
WHERE
sql_expr = """
SELECT name, SUM(transaction_amt) AS total_amt
FROM individual
WHERE city = 'SAN FRANCISCO'
GROUP BY name
ORDER BY total_amt DESC
LIMIT 20;
"""
pd.read_sql(sql_expr, engine)
sql_expr = """
SELECT name, SUM(transaction_amt) AS total_amt
FROM individual
WHERE city = 'BERKELEY'
GROUP BY name
ORDER BY total_amt DESC
LIMIT 20;
"""
pd.read_sql(sql_expr, engine)
Up to now we've looked at a single query at a time. SQL also allows us to nest queries in various ways. In this section we look at two of the cleaner ways to do this in SQL: views and Common Table Expressions (CTEs).
Will not discuss in lecture. Provided for your reference.
In earlier examples, we created new tables and populated them from the result of queries over stored tables. There are two main drawbacks of that approach that may concern us in some cases: the new table takes up storage, even though it is derived data that we could recompute; and if the stored tables it was derived from are later updated, the new table becomes stale.
For these reasons, SQL provides a notion of logical views: these are basically named queries that are re-evaluated upon each reference.
The syntax is straightforward:
CREATE VIEW <name> AS
<SELECT statement>;
The resulting view <name> can be used in a SELECT query, but not in an INSERT, DELETE or UPDATE query!
As an example, we might want a view that computes just some summary statistics of transaction_amts for each date:
try:
    engine.execute("""
    DROP VIEW IF EXISTS date_stats;
    """)
    engine.execute("""
    CREATE VIEW date_stats AS
    SELECT
        transaction_dt AS day,
        min(transaction_amt),
        avg(transaction_amt),
        stddev(transaction_amt),
        max(transaction_amt)
    FROM individual
    GROUP BY transaction_dt
    ORDER BY day;
    """)
    sql_expr = """
    SELECT * from date_stats limit 5;
    """
    display(pd.read_sql(sql_expr, engine))
except Exception as e:
    print("In the shared Data 100 Postgres server you are not allowed to create views.\n\n")
    print(e)
Notice that this did not create a table:
inspector.get_table_names()
Instead it created a view: (if you are allowed to create views)
inspector.get_view_names()
Let's create a view over some random data, and we will even seed the random number generator.
DROP VIEW IF EXISTS rando;
SELECT setseed(0.3);
CREATE VIEW rando(rownum, rnd) AS
SELECT rownum, round(random())::INTEGER
FROM generate_series(1,50) AS ind(rownum);
What is the sum of the rnd column in rando?
sql_expr = """
SELECT SUM(rnd) FROM rando;
"""
pd.read_sql(sql_expr, engine)
What was that value again?
sql_expr = """
SELECT SUM(rnd) FROM rando;
"""
pd.read_sql(sql_expr, engine)
The value changes with each invocation: rando is a view, so the query behind it (including its calls to random()) is re-evaluated every time we reference it.
Views can help us break a complex analysis into named, reusable pieces.
Problem: views persist in the database namespace, so exploratory work tends to leave behind clutter like temp1, temp1_joey, temp1_joey_fixed, ... We need a mechanism to decompose a query into views for the scope of a single query.
Common Table Expressions (WITH)
Think of these as views that exist only during the query.
If we're only going to use a view within a single query, it is a little inelegant to CREATE it, and then have to DROP it later to recycle the view name.
Common Table Expressions (CTEs) are like views that we use on-the-fly. (If you know about lambdas in Python, you can think of CTEs as lambda views.) The syntax for CTEs is to use a WITH clause in front of the query:
WITH <name> [(renamed columns)] AS
(<SELECT statement>)
[, <name2> AS (<SELECT statement>)...]
If you need multiple CTEs, you separate them with commas. We can rewrite our query above without a view as follows:
sql_expr = """
WITH per_day_stats AS (
SELECT
to_date(transaction_dt, 'MMDDYYYY') as day, -- Date Parsing
min(transaction_amt),
avg(transaction_amt),
stddev(transaction_amt),
max(transaction_amt)
FROM indiv_sample
GROUP BY transaction_dt
)
SELECT day, stddev, max - min AS spread
FROM per_day_stats
WHERE stddev IS NOT NULL
ORDER by stddev DESC
LIMIT 5
"""
pd.read_sql(sql_expr, engine)
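As a small illustration of separating multiple CTEs with commas, here is a sketch that chains two of them; it only uses generate_series, so it does not depend on the FEC tables:
sql_expr = """
WITH rolls AS (
    SELECT CEIL(6*RANDOM())::INTEGER AS roll
    FROM generate_series(1, 10000) AS n
),
counts AS (
    SELECT roll, COUNT(*) AS cnt
    FROM rolls
    GROUP BY roll
)
SELECT roll, cnt
FROM counts
ORDER BY roll;
"""
pd.read_sql(sql_expr, engine)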
Suppose now we want to determine which committees received the most money.
sql_expr = """
SELECT cmte_id, SUM(transaction_amt) AS total_amt
FROM individual
GROUP BY cmte_id
ORDER BY total_amt DESC
LIMIT 10
"""
pd.read_sql(sql_expr, engine)
pd.DataFrame(inspector.get_columns("individual")).set_index('name')
pd.DataFrame(inspector.get_columns("cm")).set_index('name')
We can join the committee description to get the names of the committees that received the most funds.
sql_expr = """
WITH indv2cm AS
(
SELECT cmte_id, SUM(transaction_amt) AS total_amt
FROM individual
GROUP BY cmte_id
ORDER BY total_amt DESC
)
SELECT cm.cmte_nm, indv2cm.total_amt
FROM cm, indv2cm
WHERE cm.cmte_id = indv2cm.cmte_id
ORDER BY indv2cm.total_amt DESC
LIMIT 10
"""
pd.read_sql(sql_expr, engine)
pd.DataFrame(inspector.get_columns("cn")).set_index('name')
pd.DataFrame(inspector.get_columns("ccl")).set_index('name')
sql_expr = """
SELECT cn.cand_name, SUM(indiv.transaction_amt) AS total_amt
FROM individual AS indiv, ccl, cn
WHERE indiv.cmte_id = ccl.cmte_id AND
ccl.cand_id = cn.cand_id
GROUP BY cn.cand_name
ORDER BY total_amt DESC
LIMIT 10
"""
pd.read_sql(sql_expr, engine)
sql_expr = """
SELECT cn.cand_name, SUM(indiv.transaction_amt) AS total_amt
FROM individual AS indiv, ccl, cn
WHERE indiv.cmte_id = ccl.cmte_id AND
ccl.cand_id = cn.cand_id AND
indiv.state = 'CA'
GROUP BY cn.cand_name
ORDER BY total_amt DESC
LIMIT 10
"""
pd.read_sql(sql_expr, engine)
sql_expr = """
SELECT cn.cand_name, SUM(indiv.transaction_amt) AS total_amt
FROM individual AS indiv, ccl, cn
WHERE indiv.cmte_id = ccl.cmte_id AND
ccl.cand_id = cn.cand_id AND
indiv.state = 'FL'
GROUP BY cn.cand_name
ORDER BY total_amt DESC
LIMIT 10
"""
pd.read_sql(sql_expr, engine)
sql_expr = """
SELECT cn.cand_name, SUM(indiv.transaction_amt) AS total_amt
FROM individual AS indiv, ccl, cn
WHERE indiv.cmte_id = ccl.cmte_id AND
ccl.cand_id = cn.cand_id AND
indiv.state = 'TX'
GROUP BY cn.cand_name
ORDER BY total_amt DESC
LIMIT 10
"""
pd.read_sql(sql_expr, engine)
sql_expr = """
SELECT cm.cmte_nm, SUM(transaction_amt) AS total_amt
FROM pas, cm
WHERE pas.cmte_id = cm.cmte_id
GROUP BY cm.cmte_nm
ORDER BY total_amt DESC
LIMIT 5
"""
pd.read_sql(sql_expr, engine)
sql_expr = """
SELECT cn.cand_name, SUM(transaction_amt) AS total_amt
FROM pas, cn
WHERE pas.cand_id = cn.cand_id
GROUP BY cn.cand_name
ORDER BY total_amt DESC
LIMIT 5
"""
pd.read_sql(sql_expr, engine)