Install SQL packages:
# !conda install -y psycopg2
# !conda install -y postgresql
# !pip install ipython-sql
# !pip install sqlalchemy
Standard imports + sqlalchemy
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sqlalchemy
%matplotlib inline
%load_ext sql
Establish a database connection to the Postgres database running on my machine localhost
using the schema ds100
postgresql_uri = "postgres://jegonzal:@localhost:5432/ds100"
sqlite_uri = "sqlite:///data/ds100.db"
default_db = postgresql_uri
%%sql $postgresql_uri
-- Need to drop views to prevent integrity constraint violations later.
DROP VIEW IF EXISTS date_stats;
The following example works through some basic table operations including:
CREATE TABLE
and DROP TABLE
¶To start, we are going to define a toy relation (a.k.a. table), populate it with some toy data, and work through some basic SQL. Deeper stuff coming soon though, I promise!
First, let's create the table of students
%%sql $default_db
-- Drop the table if it already exists
DROP TABLE IF EXISTS students;
-- Create the table profs
CREATE TABLE students(
name TEXT PRIMARY KEY,
gpa FLOAT CHECK (gpa >= 0.0 and gpa <= 4.0),
age INTEGER,
dept TEXT,
gender CHAR);
Note that each column has a fixed data type.
The DBMS will enforce these types as data is inserted.
Note also the definition of a primary key, as we discussed in the EDA lecture.
The DBMS will enforce the uniqueness of values in the key columns.
To see what we've done, let's run our first query, dumping out the content of the table: every column for each row. We denote every column with *
:
%%sql $default_db
SELECT * FROM students;
Now let's manually insert some values into the table.
%%sql $default_db
INSERT INTO students VALUES
('Sergey Brin', 2.8, 40, 'CS', 'M'),
('Danah Boyd', 3.9, 35, 'CS', 'F'),
('Bill Gates', 1.0, 60, 'CS', 'M'),
('Hillary Mason', 4.0, 35, 'DATASCI', 'F'),
('Mike Olson', 3.7, 50, 'CS', 'M'),
('Mark Zuckerberg', 4.0, 30, 'CS', 'M'),
('Cheryl Sandberg', 4.0, 47, 'BUSINESS', 'F'),
('Susan Wojcicki', 4.0, 46, 'BUSINESS', 'F'),
('Marissa Meyer', 4.0, 45, 'BUSINESS', 'F');
Note that strings in SQL must be quoted with a single quote '
character.
Note how insertions need to have values in the same order as the columns in the create table
statement! Let's make sure our data is there:
%%sql $default_db
SELECT * FROM students;
What happens if we try to insert another record with the same primary key (name
)?
# %%sql $default_db
# INSERT INTO students VALUES ('Bill Gates', 4.0, 60, 'BUSINESS', 'M')
We can populate the database using Pandas as well:
tips_df = sns.load_dataset("tips")
tips_df.head()
Create a connection with the database
engine = sqlalchemy.create_engine(default_db)
Drop the table if it already exists and then upload the table to the database.
_ = engine.execute("DROP TABLE IF EXISTS tips;")
with engine.connect() as conn:
tips_df.to_sql("tips", conn)
We can also download tables directly into pandas:
with engine.connect() as conn:
students = pd.read_sql("SELECT * FROM students", conn)
students
We can also evaluate a query string and use the SQL iPython extension to execute the query and construct a DataFrame.
query = """
SELECT * FROM tips
WHERE sex = 'Female';
"""
res = %sql $query
df = res.DataFrame()
df.head()
There is no mechanism in standard SQL to access the schema associated with each database management systems. Here we use the corresponding client tools
!sqlite3 data/ds100.db ".schema students"
!psql ds100 -c "\d students"
I found the following SQL Alchemy Quick Reference Sheet to be very helpful.
engine = sqlalchemy.create_engine(postgresql_uri)
inspector = sqlalchemy.inspect(engine)
for col in inspector.get_columns("students"):
print(col)
engine = sqlalchemy.create_engine(sqlite_uri)
inspector = sqlalchemy.inspect(engine)
for col in inspector.get_columns("students"):
print(col)
What is Bill Gates' GPA?
%%sql $default_db
SELECT * FROM students
WHERE name LIKE '%Bill%' -- SQL like regular expression
Wow, Bill has a low GPA let's lend him a hand.
%%sql $default_db
UPDATE students
SET gpa = 1.0 + gpa
WHERE LOWER(name) = 'bill gates';
And let's check the table now:
%%sql $default_db
SELECT * FROM students
WHERE name ~'^Bil.'; -- Regular expression
Suppose Mark logged into the database and tried to give himself a 5.0? Uncomment the following line to see what happens:
# %%sql
# UPDATE students
# SET gpa = 5.0
# WHERE LOWER(name) LIKE '%zuck%';
The above code fails. Why? (check the gpa.)
Reviewing our table
%%sql $default_db
SELECT * FROM students
Notice two things:
update
statement: we decide which rows get updated based entirely on the values in each row, as checked by the where
clause. There is no notion of any information outside the values in the row--e.g. there are no "object identifiers" or "row numbers"... everything is just the data and only the data.We can delete rows in much the same way we update rows:
%%sql $default_db
DELETE FROM students
WHERE name = 'Sergey Brin'
%%sql $default_db
SELECT * FROM students;
Restoring Sergey
%%sql $default_db
INSERT INTO students VALUES
('Sergey Brin', 4.0, 40, 'CS', 'M');
Now let's start looking at some slightly more interesting queries. The canonical SQL query block includes the following clauses, in the order they appear. Square brackets indicate optional clauses.
SELECT ...
FROM ...
[WHERE ...]
[GROUP BY ...]
[HAVING ...]
[ORDER BY ...]
[LIMIT ...];
Query blocks can reference one or more tables, and be nested in various ways. Before we worry about multi-table queries or nested queries, we'll work our way through examples that exercise all of these clauses on a single table.
SELECT
LIST¶The SELECT
list determines which columns to include in the output.
%%sql $default_db
SELECT name
FROM students;
SQL has a wide range of functions that can be applied to each attribute in the select list. Notice that we can alias (name) the columns with AS
. The complete list of built in PostreSQL functions is available here.
%%sql $default_db
SELECT UPPER(name) AS n, LOWER(dept) as d, LOG(gpa) * 4.0 AS log_gpa
FROM students;
As we know, SQL is a multiset logic, preserving the meaning of the number of duplicates in query results. Sometimes, however, we don't want to keep the duplicates, we want to eliminate them. This is done simply by adding the keyword DISTINCT
after the SELECT
statement:
%%sql $default_db
SELECT DISTINCT dept
FROM students;
Which rows are used when taking the distinct entries? Does it really matter?
WHERE
Clause¶The WHERE
clause determines which rows of to include by specifying a predicate (boolean expression). Rows (tuples) that satisfy this expression are returned.
%%sql $default_db
SELECT name, gpa
FROM students
WHERE dept = 'CS'
And of course we can specify both rows and columns explicitly. If we have a primary key, we can filter things down to even the cell level via a select
list of one column, and a where
clause checking equality on the primary key columns:
%%sql $default_db
SELECT gpa
FROM students
WHERE name = 'Bill Gates';
Note that even this "single-celled" response still has a uniform data type of a relation.
SQL is Closed Over Tables: SQL expressions take in tables and always produce tables. How does this compare to Pandas?
Now that you can slice and dice tables into columns, rows and cells, you have enough knowledge to poke around in a database. Let's move on to skills that you'll need as a data scientist.
GROUP BY aggregation in SQL is a lot like the group by in Pandas. SQL provides a family of [aggregate functions] for use in the select
clause. In the simplest form, queries with aggregates in the select
clause generate a single row of output, with each aggregate function performing a summary of all the rows of input. You can have many aggregate functions in your select
clause:
A list of built-in aggregate functions in PostgreSQL is here. In our case, the query we are looking for is as follows.
In the following we compute the average GPA as well as the number of students in each department:
%%sql $default_db
SELECT dept, AVG(gpa) as avg_gpa, COUNT(*)
FROM students
GROUP BY dept
We can use the HAVING
clause to apply a predicate to groups.
%%sql $default_db
SELECT dept, AVG(gpa) as avg_gpa, COUNT(*)
FROM students
GROUP BY dept
HAVING COUNT(*) >= 2
%%sql $default_db
SELECT dept, AVG(gpa) as avg_gpa, COUNT(*)
FROM students
WHERE gender = 'F'
GROUP BY dept
HAVING COUNT(*) >= 2
As a nicety, SQL allows you to order your output rows, in either ascending (ASC) or descending (DESC) order of the values in columns. For example:
%%sql $default_db
SELECT *
FROM students
ORDER BY gpa;
%%sql $default_db
SELECT *
FROM students
ORDER BY gpa, age;
%%sql $default_db
SELECT *
FROM students
ORDER BY gpa DESC, age ASC;
The limit clause limits the number of elements returned. Which elements are returned? While this depends on the order of elements which could be arbitrary beyond anything specified by the ORDER BY
clauses.
Is this a random sample?
No
Why do we use the LIMIT
clause?
Often the database we are querying is massive and retrieving the entire table as we are debugging the query can be costly in time and system resources. However, we should avoid using LIMIT
when constructing a sample of the data.
%%sql
SELECT * FROM students
LIMIT 3
... it's funny, believe me.