Install SQL packages:

# !conda install -y psycopg2
# !conda install -y postgresql
# !pip install ipython-sql
# !pip install sqlalchemy

Standard imports + sqlalchemy

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sqlalchemy

%matplotlib inline
%load_ext sql

Establish a database connection to the Postgres database running on my machine localhost using the schema ds100

postgresql_uri = "postgres://jegonzal:@localhost:5432/ds100"
sqlite_uri = "sqlite:///data/ds100.db"
default_db = postgresql_uri

%%sql $postgresql_uri
-- Need to drop views to prevent integrity constraint violations later.
DROP VIEW IF EXISTS date_stats;

Done.

[]

Basic Table Creation, Manipulation, and Queries¶

The following example works through some basic table operations including:

creating a table
adding rows
updating rows
deleting rows
querying the table

`CREATE TABLE` and `DROP TABLE`¶

To start, we are going to define a toy relation (a.k.a. table), populate it with some toy data, and work through some basic SQL. Deeper stuff coming soon though, I promise!

First, let's create the table of students

%%sql $default_db

-- Drop the table if it already exists
DROP TABLE IF EXISTS students;

-- Create the table profs
CREATE TABLE students(
    name TEXT PRIMARY KEY, 
    gpa FLOAT CHECK (gpa >= 0.0 and gpa <= 4.0), 
    age INTEGER, 
    dept TEXT, 
    gender CHAR);

Done.
Done.

[]

Note that each column has a fixed data type.

The DBMS will enforce these types as data is inserted.

Note also the definition of a primary key, as we discussed in the EDA lecture.

The DBMS will enforce the uniqueness of values in the key columns.

To see what we've done, let's run our first query, dumping out the content of the table: every column for each row. We denote every column with *:

%%sql $default_db

SELECT * FROM students;

0 rows affected.

INSERTing VALUES¶

Now let's manually insert some values into the table.

%%sql $default_db

INSERT INTO students VALUES 
 ('Sergey Brin', 2.8, 40, 'CS', 'M'),
  ('Danah Boyd', 3.9, 35, 'CS', 'F'),
  ('Bill Gates', 1.0, 60, 'CS', 'M'),
  ('Hillary Mason', 4.0, 35, 'DATASCI', 'F'),
  ('Mike Olson', 3.7, 50, 'CS', 'M'),
  ('Mark Zuckerberg', 4.0, 30, 'CS', 'M'),
  ('Cheryl Sandberg', 4.0, 47, 'BUSINESS', 'F'),
  ('Susan Wojcicki', 4.0, 46, 'BUSINESS', 'F'),
  ('Marissa Meyer', 4.0, 45, 'BUSINESS', 'F');

9 rows affected.

[]

Note that strings in SQL must be quoted with a single quote ' character.

Note how insertions need to have values in the same order as the columns in the create table statement! Let's make sure our data is there:

%%sql $default_db

SELECT * FROM students;

9 rows affected.

Primary Key Integrity Constraint¶

What happens if we try to insert another record with the same primary key (name)?

# %%sql $default_db
# INSERT INTO students VALUES ('Bill Gates', 4.0, 60, 'BUSINESS', 'M')

Using Pandas and SQL¶

We can populate the database using Pandas as well:

tips_df = sns.load_dataset("tips")
tips_df.head()

Create a connection with the database

engine = sqlalchemy.create_engine(default_db)

Drop the table if it already exists and then upload the table to the database.

_ = engine.execute("DROP TABLE IF EXISTS tips;")
with engine.connect() as conn:
    tips_df.to_sql("tips", conn)

We can also download tables directly into pandas:

with engine.connect() as conn:
    students = pd.read_sql("SELECT * FROM students", conn)
students

Another Pandas + SQL¶

We can also evaluate a query string and use the SQL iPython extension to execute the query and construct a DataFrame.

query = """
SELECT * FROM tips
WHERE sex = 'Female';
"""

res = %sql $query

df = res.DataFrame()

df.head()

87 rows affected.

Exploring the Schema¶

There is no mechanism in standard SQL to access the schema associated with each database management systems. Here we use the corresponding client tools

Sqlite3 schema information:

!sqlite3 data/ds100.db ".schema students"

CREATE TABLE students(
    name TEXT PRIMARY KEY, 
    gpa FLOAT CHECK (gpa >= 0.0 and gpa <= 4.0), 
    age INTEGER, 
    dept TEXT, 
    gender CHAR);

PostgreSQL schema information:

!psql ds100 -c "\d students"

        Table "public.students"
 Column |       Type       | Modifiers 
--------+------------------+-----------
 name   | text             | not null
 gpa    | double precision | 
 age    | integer          | 
 dept   | text             | 
 gender | character(1)     | 
Indexes:
    "students_pkey" PRIMARY KEY, btree (name)
Check constraints:
    "students_gpa_check" CHECK (gpa >= 0.0::double precision AND gpa <= 4.0::double precision)

Using SQL Alchemy's Generic Driver

I found the following SQL Alchemy Quick Reference Sheet to be very helpful.

engine = sqlalchemy.create_engine(postgresql_uri)
inspector = sqlalchemy.inspect(engine)
for col in inspector.get_columns("students"):
    print(col)

{'name': 'name', 'type': TEXT(), 'nullable': False, 'default': None, 'autoincrement': False}
{'name': 'gpa', 'type': DOUBLE_PRECISION(precision=53), 'nullable': True, 'default': None, 'autoincrement': False}
{'name': 'age', 'type': INTEGER(), 'nullable': True, 'default': None, 'autoincrement': False}
{'name': 'dept', 'type': TEXT(), 'nullable': True, 'default': None, 'autoincrement': False}
{'name': 'gender', 'type': CHAR(length=1), 'nullable': True, 'default': None, 'autoincrement': False}

engine = sqlalchemy.create_engine(sqlite_uri)
inspector = sqlalchemy.inspect(engine)
for col in inspector.get_columns("students"):
    print(col)

{'name': 'name', 'type': TEXT(), 'nullable': True, 'default': None, 'autoincrement': 'auto', 'primary_key': 1}
{'name': 'gpa', 'type': FLOAT(), 'nullable': True, 'default': None, 'autoincrement': 'auto', 'primary_key': 0}
{'name': 'age', 'type': INTEGER(), 'nullable': True, 'default': None, 'autoincrement': 'auto', 'primary_key': 0}
{'name': 'dept', 'type': TEXT(), 'nullable': True, 'default': None, 'autoincrement': 'auto', 'primary_key': 0}
{'name': 'gender', 'type': CHAR(), 'nullable': True, 'default': None, 'autoincrement': 'auto', 'primary_key': 0}

UPDATE values¶

What is Bill Gates' GPA?

%%sql $default_db

SELECT * FROM students
    WHERE name LIKE  '%Bill%' -- SQL like regular expression

1 rows affected.

Wow, Bill has a low GPA let's lend him a hand.

%%sql $default_db

UPDATE students
    SET gpa = 1.0 + gpa
    WHERE LOWER(name) = 'bill gates';

1 rows affected.

[]

And let's check the table now:

%%sql $default_db

SELECT * FROM students
    WHERE name ~'^Bil.'; -- Regular expression

1 rows affected.

Suppose Mark logged into the database and tried to give himself a 5.0? Uncomment the following line to see what happens:

# %%sql 

# UPDATE students
#     SET gpa = 5.0
#     WHERE LOWER(name) LIKE '%zuck%';

The above code fails. Why? (check the gpa.)

Reviewing our table

%%sql $default_db

SELECT * FROM students

9 rows affected.

Notice two things:

If you are using Postgres the rows likely came back in a different order than before. Remember: relations do not have a defined order, and in fact two different orders are just two ways of describing the same relation!
Note the relational style of the update statement: we decide which rows get updated based entirely on the values in each row, as checked by the where clause. There is no notion of any information outside the values in the row--e.g. there are no "object identifiers" or "row numbers"... everything is just the data and only the data.

Deleting Records¶

We can delete rows in much the same way we update rows:

%%sql $default_db

DELETE FROM students 
    WHERE name = 'Sergey Brin'

1 rows affected.

[]

%%sql $default_db

SELECT * FROM students;

8 rows affected.

Restoring Sergey

%%sql $default_db

INSERT INTO students VALUES
      ('Sergey Brin', 4.0, 40, 'CS', 'M');

1 rows affected.

[]

SELECT Queries¶

Now let's start looking at some slightly more interesting queries. The canonical SQL query block includes the following clauses, in the order they appear. Square brackets indicate optional clauses.

SELECT ...
  FROM ...
[WHERE ...]
[GROUP BY ...]
[HAVING ...]
[ORDER BY ...]
[LIMIT ...];

Query blocks can reference one or more tables, and be nested in various ways. Before we worry about multi-table queries or nested queries, we'll work our way through examples that exercise all of these clauses on a single table.

The `SELECT` LIST¶

The SELECT list determines which columns to include in the output.

%%sql $default_db

SELECT name
FROM students;

9 rows affected.

Functions in the Selection List¶

SQL has a wide range of functions that can be applied to each attribute in the select list. Notice that we can alias (name) the columns with AS. The complete list of built in PostreSQL functions is available here.

%%sql $default_db

SELECT UPPER(name) AS n, LOWER(dept) as d, LOG(gpa) * 4.0 AS log_gpa
FROM students;

9 rows affected.

Selecting Distinct Rows¶

As we know, SQL is a multiset logic, preserving the meaning of the number of duplicates in query results. Sometimes, however, we don't want to keep the duplicates, we want to eliminate them. This is done simply by adding the keyword DISTINCT after the SELECT statement:

%%sql $default_db

SELECT DISTINCT dept
FROM students;

3 rows affected.

Which rows are used when taking the distinct entries? Does it really matter?

The `WHERE` Clause¶

The WHERE clause determines which rows of to include by specifying a predicate (boolean expression). Rows (tuples) that satisfy this expression are returned.

%%sql $default_db

SELECT name, gpa
FROM students
WHERE dept = 'CS'

5 rows affected.

And of course we can specify both rows and columns explicitly. If we have a primary key, we can filter things down to even the cell level via a select list of one column, and a where clause checking equality on the primary key columns:

%%sql $default_db

SELECT gpa
FROM students
WHERE name = 'Bill Gates';

1 rows affected.

Note that even this "single-celled" response still has a uniform data type of a relation.

SQL is Closed Over Tables: SQL expressions take in tables and always produce tables. How does this compare to Pandas?

Now that you can slice and dice tables into columns, rows and cells, you have enough knowledge to poke around in a database. Let's move on to skills that you'll need as a data scientist.

Group By Aggregation¶

GROUP BY aggregation in SQL is a lot like the group by in Pandas. SQL provides a family of [aggregate functions] for use in the select clause. In the simplest form, queries with aggregates in the select clause generate a single row of output, with each aggregate function performing a summary of all the rows of input. You can have many aggregate functions in your select clause:

A list of built-in aggregate functions in PostgreSQL is here. In our case, the query we are looking for is as follows.

In the following we compute the average GPA as well as the number of students in each department:

%%sql $default_db

SELECT dept, AVG(gpa) as avg_gpa, COUNT(*)
FROM students
GROUP BY dept

3 rows affected.

We can use the HAVING clause to apply a predicate to groups.

%%sql $default_db

SELECT dept, AVG(gpa) as avg_gpa, COUNT(*)
FROM students
GROUP BY dept
HAVING COUNT(*) >= 2

2 rows affected.

%%sql  $default_db

SELECT dept, AVG(gpa) as avg_gpa, COUNT(*)
FROM students
WHERE gender = 'F'
GROUP BY dept
HAVING COUNT(*) >= 2

1 rows affected.

Ordering the output¶

As a nicety, SQL allows you to order your output rows, in either ascending (ASC) or descending (DESC) order of the values in columns. For example:

%%sql $default_db

SELECT *
FROM students
ORDER BY gpa;

9 rows affected.

%%sql $default_db

SELECT *
FROM students
ORDER BY gpa, age;

9 rows affected.

%%sql $default_db

SELECT *
FROM students
ORDER BY gpa DESC, age ASC;

9 rows affected.

LIMIT Clause¶

The limit clause limits the number of elements returned. Which elements are returned? While this depends on the order of elements which could be arbitrary beyond anything specified by the ORDER BY clauses.

Is this a random sample?

No

Why do we use the LIMIT clause?

Often the database we are querying is massive and retrieving the entire table as we are debugging the query can be costly in time and system resources. However, we should avoid using LIMIT when constructing a sample of the data.

%%sql
SELECT * FROM students 
LIMIT 3

3 rows affected.

If this is funny you are Getting IT¶

Boby Drop tables

... it's funny, believe me.

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4

	index	total_bill	tip	sex	smoker	day	time	size
0	0	16.99	1.01	Female	No	Sun	Dinner	2
1	4	24.59	3.61	Female	No	Sun	Dinner	4
2	11	35.26	5.00	Female	No	Sun	Dinner	4
3	14	14.83	3.02	Female	No	Sun	Dinner	2
4	16	10.33	1.67	Female	No	Sun	Dinner	3

n	d	log_gpa
DANAH BOYD	cs	2.364258428106
HILLARY MASON	datasci	2.40823996531185
MIKE OLSON	cs	2.27280689626798
MARK ZUCKERBERG	cs	2.40823996531185
CHERYL SANDBERG	business	2.40823996531185
SUSAN WOJCICKI	business	2.40823996531185
MARISSA MEYER	business	2.40823996531185
BILL GATES	cs	1.20411998265592
SERGEY BRIN	cs	2.40823996531185

name	gpa	age	dept	gender
Sergey Brin	2.8	40	CS	M
Danah Boyd	3.9	35	CS	F
Bill Gates	1.0	60	CS	M
Hillary Mason	4.0	35	DATASCI	F
Mike Olson	3.7	50	CS	M
Mark Zuckerberg	4.0	30	CS	M
Cheryl Sandberg	4.0	47	BUSINESS	F
Susan Wojcicki	4.0	46	BUSINESS	F
Marissa Meyer	4.0	45	BUSINESS	F

	name	gpa	age	dept	gender
0	Sergey Brin	2.8	40	CS	M
1	Danah Boyd	3.9	35	CS	F
2	Bill Gates	1.0	60	CS	M
3	Hillary Mason	4.0	35	DATASCI	F
4	Mike Olson	3.7	50	CS	M
5	Mark Zuckerberg	4.0	30	CS	M
6	Cheryl Sandberg	4.0	47	BUSINESS	F
7	Susan Wojcicki	4.0	46	BUSINESS	F
8	Marissa Meyer	4.0	45	BUSINESS	F

name	gpa	age	dept	gender
Sergey Brin	2.8	40	CS	M
Danah Boyd	3.9	35	CS	F
Hillary Mason	4.0	35	DATASCI	F
Mike Olson	3.7	50	CS	M
Mark Zuckerberg	4.0	30	CS	M
Cheryl Sandberg	4.0	47	BUSINESS	F
Susan Wojcicki	4.0	46	BUSINESS	F
Marissa Meyer	4.0	45	BUSINESS	F
Bill Gates	2.0	60	CS	M

name	gpa	age	dept	gender
Danah Boyd	3.9	35	CS	F
Hillary Mason	4.0	35	DATASCI	F
Mike Olson	3.7	50	CS	M
Mark Zuckerberg	4.0	30	CS	M
Cheryl Sandberg	4.0	47	BUSINESS	F
Susan Wojcicki	4.0	46	BUSINESS	F
Marissa Meyer	4.0	45	BUSINESS	F
Bill Gates	2.0	60	CS	M