Introduction to SQL

In this notebook we walk through basic SQL queries and review how to interact with SQL from Python. We will be using the core sqlalchemy library as an abstraction layer on-top of the underlying database management system. We will not be using the sophisticated object relational management features available in sqlalchemy.

In [1]:
import sqlalchemy

Connecting to the Database

For the first part of this assignment we will connect to a local SQLite database. SQLite is a simple file oriented database management "library." In the following we launch a SQLite database stored in the data100.db file in the local folder.

In [2]:
# Delete the database if it already exists.
from pathlib import Path
dbfile = Path("data100.db")
if dbfile.exists():
    dbfile.unlink()

sqlite_uri = "sqlite:///data100.db"
sqlite_engine = sqlalchemy.create_engine(sqlite_uri)

On the first call there should be no tables present in the file.

In [3]:
sqlite_engine.table_names()
Out[3]:
[]

Creating a Table

To start, we are going to define a toy relation (a.k.a. table), populate it with some toy data, and work through some basic SQL. First, let's create the table of students.

The table contains columns:

  • name: Text and unique (Primary Key)
  • gpa: Float with range constraints
  • age: Integer
  • dept: Text
  • sex: Char (e.g., 'M', 'F')

Note that each column has a fixed data type.

The DBMS will enforce these types as data is inserted.

Note also the definition of a primary key, as we discussed in the EDA lecture.

The DBMS will enforce the uniqueness of values in the key columns.

In [4]:
sql_expr = """
CREATE TABLE students(
    name TEXT PRIMARY KEY, 
    gpa FLOAT CHECK (gpa >= 0.0 and gpa <= 4.0), 
    age INTEGER, 
    dept TEXT, 
    gender CHAR);
"""
result = sqlite_engine.execute(sql_expr)

The query returns a sqlalchemy ResultProxy object:

In [5]:
result?

We can check to see if the result object contains any data:

In [6]:
result.returns_rows
Out[6]:
False

Inserting Values into the Table

Now let's manually insert some values into the table. Note that:

  • strings in SQL must be quoted with a single quote ' character.
  • insertions need to have values in the same order as the columns in the create table statement!
In [7]:
sql_expr = """
INSERT INTO students VALUES 
 ('Sergey Brin', 2.8, 45, 'CS', 'M'),
  ('Danah Boyd', 3.9, 40, 'CS', 'F'),
  ('Bill Gates', 1.0, 63, 'CS', 'M'),
  ('Hillary Mason', 4.0, 39, 'DATASCI', 'F'),
  ('Mike Olson', 3.7, 53, 'CS', 'M'),
  ('Mark Zuckerberg', 3.8, 34, 'CS', 'M'),
  ('Cheryl Sandberg', 3.6, 49, 'BUSINESS', 'F'),
  ('Susan Wojcicki', 3.8, 50, 'BUSINESS', 'F'),
  ('Marissa Meyer', 2.6, 43, 'BUSINESS', 'F')
 ;

"""
result = sqlite_engine.execute(sql_expr)

Again we see that this query returns nothing:

In [8]:
result.returns_rows
Out[8]:
False

Querying the Table

Now that we have populated the table we can construct a query to extract the results.

In [9]:
sql_expr = """
SELECT * FROM students;
"""
result = sqlite_engine.execute(sql_expr)
In [10]:
result.returns_rows
Out[10]:
True

Iterating the Cursor

The query returns a persistent connection in the form of a cursor which can be used to read data from the database.

In [11]:
[r for r in result.cursor]
Out[11]:
[('Sergey Brin', 2.8, 45, 'CS', 'M'),
 ('Danah Boyd', 3.9, 40, 'CS', 'F'),
 ('Bill Gates', 1.0, 63, 'CS', 'M'),
 ('Hillary Mason', 4.0, 39, 'DATASCI', 'F'),
 ('Mike Olson', 3.7, 53, 'CS', 'M'),
 ('Mark Zuckerberg', 3.8, 34, 'CS', 'M'),
 ('Cheryl Sandberg', 3.6, 49, 'BUSINESS', 'F'),
 ('Susan Wojcicki', 3.8, 50, 'BUSINESS', 'F'),
 ('Marissa Meyer', 2.6, 43, 'BUSINESS', 'F')]

However, as we read the cursor we advance it and so it can only be used once:

In [12]:
[r for r in result.cursor]
Out[12]:
[]

Using Pandas to Query the Database

You can also use Pandas to query the database. Here we pass the engine (or a connection) into the pandas.read_sql function:

In [13]:
import pandas as pd

sql_expr = """
SELECT * FROM students;
"""
pd.read_sql(sql_expr, sqlite_engine)
Out[13]:
name gpa age dept gender
0 Sergey Brin 2.8 45 CS M
1 Danah Boyd 3.9 40 CS F
2 Bill Gates 1.0 63 CS M
3 Hillary Mason 4.0 39 DATASCI F
4 Mike Olson 3.7 53 CS M
5 Mark Zuckerberg 3.8 34 CS M
6 Cheryl Sandberg 3.6 49 BUSINESS F
7 Susan Wojcicki 3.8 50 BUSINESS F
8 Marissa Meyer 2.6 43 BUSINESS F

Primary Key Integrity Constraint

What happens if we try to insert another record with the same primary key (name)?

In [14]:
sql_expr = """
INSERT INTO students VALUES ('Bill Gates', 4.0, 60, 'BUSINESS', 'M')
"""
try:
    result = sqlite_engine.execute(sql_expr)
except Exception as e:
    print(e)
(sqlite3.IntegrityError) UNIQUE constraint failed: students.name [SQL: "\nINSERT INTO students VALUES ('Bill Gates', 4.0, 60, 'BUSINESS', 'M')\n"] (Background on this error at: http://sqlalche.me/e/gkpj)

Notice in the above block of code we use try: and except Exception as e:. This accomplishes two goals:

  1. This syntax catches the exception and prevents the notebook from terminating when the error occurs (we are expecting this error!)
  2. This syntax also hides the full stack trace and only shows us the important message containing the final error.

Saving a Dataframe to a Database

We can also populate the database using Pandas. In the following we first obtain the Tips dataset from the seaborn visualization library as a dataframe:

In [15]:
import seaborn as sns

tips_df = sns.load_dataset("tips")
tips_df.head()
Out[15]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

We can then use the pandas.to_sql command to put the data in our SQLlite database:

In [16]:
tips_df.to_sql("tips", sqlite_engine)

We can see that a new table has been added:

In [17]:
sqlite_engine.table_names()
Out[17]:
['students', 'tips']

We can also query the table:

In [18]:
sql_expr = """
SELECT * FROM tips
"""
pd.read_sql(sql_expr, sqlite_engine)
Out[18]:
index total_bill tip sex smoker day time size
0 0 16.99 1.01 Female No Sun Dinner 2
1 1 10.34 1.66 Male No Sun Dinner 3
2 2 21.01 3.50 Male No Sun Dinner 3
3 3 23.68 3.31 Male No Sun Dinner 2
4 4 24.59 3.61 Female No Sun Dinner 4
5 5 25.29 4.71 Male No Sun Dinner 4
6 6 8.77 2.00 Male No Sun Dinner 2
7 7 26.88 3.12 Male No Sun Dinner 4
8 8 15.04 1.96 Male No Sun Dinner 2
9 9 14.78 3.23 Male No Sun Dinner 2
10 10 10.27 1.71 Male No Sun Dinner 2
11 11 35.26 5.00 Female No Sun Dinner 4
12 12 15.42 1.57 Male No Sun Dinner 2
13 13 18.43 3.00 Male No Sun Dinner 4
14 14 14.83 3.02 Female No Sun Dinner 2
15 15 21.58 3.92 Male No Sun Dinner 2
16 16 10.33 1.67 Female No Sun Dinner 3
17 17 16.29 3.71 Male No Sun Dinner 3
18 18 16.97 3.50 Female No Sun Dinner 3
19 19 20.65 3.35 Male No Sat Dinner 3
20 20 17.92 4.08 Male No Sat Dinner 2
21 21 20.29 2.75 Female No Sat Dinner 2
22 22 15.77 2.23 Female No Sat Dinner 2
23 23 39.42 7.58 Male No Sat Dinner 4
24 24 19.82 3.18 Male No Sat Dinner 2
25 25 17.81 2.34 Male No Sat Dinner 4
26 26 13.37 2.00 Male No Sat Dinner 2
27 27 12.69 2.00 Male No Sat Dinner 2
28 28 21.70 4.30 Male No Sat Dinner 2
29 29 19.65 3.00 Female No Sat Dinner 2
... ... ... ... ... ... ... ... ...
214 214 28.17 6.50 Female Yes Sat Dinner 3
215 215 12.90 1.10 Female Yes Sat Dinner 2
216 216 28.15 3.00 Male Yes Sat Dinner 5
217 217 11.59 1.50 Male Yes Sat Dinner 2
218 218 7.74 1.44 Male Yes Sat Dinner 2
219 219 30.14 3.09 Female Yes Sat Dinner 4
220 220 12.16 2.20 Male Yes Fri Lunch 2
221 221 13.42 3.48 Female Yes Fri Lunch 2
222 222 8.58 1.92 Male Yes Fri Lunch 1
223 223 15.98 3.00 Female No Fri Lunch 3
224 224 13.42 1.58 Male Yes Fri Lunch 2
225 225 16.27 2.50 Female Yes Fri Lunch 2
226 226 10.09 2.00 Female Yes Fri Lunch 2
227 227 20.45 3.00 Male No Sat Dinner 4
228 228 13.28 2.72 Male No Sat Dinner 2
229 229 22.12 2.88 Female Yes Sat Dinner 2
230 230 24.01 2.00 Male Yes Sat Dinner 4
231 231 15.69 3.00 Male Yes Sat Dinner 3
232 232 11.61 3.39 Male No Sat Dinner 2
233 233 10.77 1.47 Male No Sat Dinner 2
234 234 15.53 3.00 Male Yes Sat Dinner 2
235 235 10.07 1.25 Male No Sat Dinner 2
236 236 12.60 1.00 Male Yes Sat Dinner 2
237 237 32.83 1.17 Male Yes Sat Dinner 2
238 238 35.83 4.67 Female No Sat Dinner 3
239 239 29.03 5.92 Male No Sat Dinner 3
240 240 27.18 2.00 Female Yes Sat Dinner 2
241 241 22.67 2.00 Male Yes Sat Dinner 2
242 242 17.82 1.75 Male No Sat Dinner 2
243 243 18.78 3.00 Female No Thur Dinner 2

244 rows × 8 columns

Exploring the Schema

There is no mechanism in standard SQL to access the schema associated with each database management systems. However sqlalchemy provides a simple abstraction layer. I found the following SQL Alchemy Quick Reference Sheet to be very helpful.

In [19]:
inspector = sqlalchemy.inspect(sqlite_engine)
inspector.get_table_names()
Out[19]:
['students', 'tips']
In [20]:
# Get column information
for col in inspector.get_columns("students"):
    print(col)
{'name': 'name', 'type': TEXT(), 'nullable': True, 'default': None, 'autoincrement': 'auto', 'primary_key': 1}
{'name': 'gpa', 'type': FLOAT(), 'nullable': True, 'default': None, 'autoincrement': 'auto', 'primary_key': 0}
{'name': 'age', 'type': INTEGER(), 'nullable': True, 'default': None, 'autoincrement': 'auto', 'primary_key': 0}
{'name': 'dept', 'type': TEXT(), 'nullable': True, 'default': None, 'autoincrement': 'auto', 'primary_key': 0}
{'name': 'gender', 'type': CHAR(), 'nullable': True, 'default': None, 'autoincrement': 'auto', 'primary_key': 0}

Deleting Records

We can delete rows using the DELETE statement.

In [21]:
sql_expr = """
DELETE FROM students 
    WHERE LOWER(name) = 'sergey brin'
"""
sqlite_engine.execute(sql_expr)
Out[21]:
<sqlalchemy.engine.result.ResultProxy at 0x7f4ece582da0>

Notice we can rerun the above command multiple times. Why?

In [22]:
sql_expr = """
SELECT * FROM students;
"""
pd.read_sql(sql_expr, sqlite_engine)
Out[22]:
name gpa age dept gender
0 Danah Boyd 3.9 40 CS F
1 Bill Gates 1.0 63 CS M
2 Hillary Mason 4.0 39 DATASCI F
3 Mike Olson 3.7 53 CS M
4 Mark Zuckerberg 3.8 34 CS M
5 Cheryl Sandberg 3.6 49 BUSINESS F
6 Susan Wojcicki 3.8 50 BUSINESS F
7 Marissa Meyer 2.6 43 BUSINESS F

Restoring Sergey

In [23]:
sql_expr = """
INSERT INTO students VALUES
    ('Sergey Brin', 2.8, 45, 'CS', 'M');
"""
sqlite_engine.execute(sql_expr)
Out[23]:
<sqlalchemy.engine.result.ResultProxy at 0x7f4ece582400>

The above computation cannot be run more than once:

In [24]:
try:
    sql_expr = """
    INSERT INTO students VALUES
        ('Sergey Brin', 2.8, 45, 'CS', 'M');
    """
    sqlite_engine.execute(sql_expr)
except Exception as e:
    print(e)
(sqlite3.IntegrityError) UNIQUE constraint failed: students.name [SQL: "\n    INSERT INTO students VALUES\n        ('Sergey Brin', 2.8, 45, 'CS', 'M');\n    "] (Background on this error at: http://sqlalche.me/e/gkpj)

Notice that the order of the records has change. We actually have no guarantee on where Sergey is inserted in the database.

In [25]:
sql_expr = """
SELECT * FROM students;
"""
pd.read_sql(sql_expr, sqlite_engine)
Out[25]:
name gpa age dept gender
0 Danah Boyd 3.9 40 CS F
1 Bill Gates 1.0 63 CS M
2 Hillary Mason 4.0 39 DATASCI F
3 Mike Olson 3.7 53 CS M
4 Mark Zuckerberg 3.8 34 CS M
5 Cheryl Sandberg 3.6 49 BUSINESS F
6 Susan Wojcicki 3.8 50 BUSINESS F
7 Marissa Meyer 2.6 43 BUSINESS F
8 Sergey Brin 2.8 45 CS M

UPDATE values

What is Bill Gates' GPA?

In [26]:
sql_expr = """
SELECT * FROM students;
"""
pd.read_sql(sql_expr, sqlite_engine)
Out[26]:
name gpa age dept gender
0 Danah Boyd 3.9 40 CS F
1 Bill Gates 1.0 63 CS M
2 Hillary Mason 4.0 39 DATASCI F
3 Mike Olson 3.7 53 CS M
4 Mark Zuckerberg 3.8 34 CS M
5 Cheryl Sandberg 3.6 49 BUSINESS F
6 Susan Wojcicki 3.8 50 BUSINESS F
7 Marissa Meyer 2.6 43 BUSINESS F
8 Sergey Brin 2.8 45 CS M

Wow, Bill has a low GPA let's lend him a hand.

In [27]:
sql_expr = """
UPDATE students
SET gpa = 1.0 + gpa
WHERE LOWER(name) = 'bill gates';
"""
sqlite_engine.execute(sql_expr)
Out[27]:
<sqlalchemy.engine.result.ResultProxy at 0x7f4ece51fcf8>

And let's check the table now:

In [28]:
sql_expr = """
SELECT * FROM students;
"""
pd.read_sql(sql_expr, sqlite_engine)
Out[28]:
name gpa age dept gender
0 Danah Boyd 3.9 40 CS F
1 Bill Gates 2.0 63 CS M
2 Hillary Mason 4.0 39 DATASCI F
3 Mike Olson 3.7 53 CS M
4 Mark Zuckerberg 3.8 34 CS M
5 Cheryl Sandberg 3.6 49 BUSINESS F
6 Susan Wojcicki 3.8 50 BUSINESS F
7 Marissa Meyer 2.6 43 BUSINESS F
8 Sergey Brin 2.8 45 CS M

Important Note

In the update statement we decide which rows get updated based entirely on the values in each row, as checked by the where clause. There is no notion of any information outside the values in the row--e.g. there are no "object identifiers" or "row numbers"... everything is just the data and only the data.

Integrity Constraints

The integrity constraints we imposed earlier can be used to improve data quality.

Suppose we try to update everyone's GPA in CS to be one point higher.

In [29]:
try:
    sql_expr = """
        UPDATE students
        SET gpa = 1.0 + gpa
        WHERE dept = 'CS';
        """
    sqlite_engine.execute(sql_expr)
except Exception as e:
    print(e)
(sqlite3.IntegrityError) CHECK constraint failed: students [SQL: "\n        UPDATE students\n        SET gpa = 1.0 + gpa\n        WHERE dept = 'CS';\n        "] (Background on this error at: http://sqlalche.me/e/gkpj)

The above code fails. Why? (check the gpa.)

Dropping Tables

The drop command is used to remove tables from the database (be carefull!):

In [30]:
sql_expr = """
DROP TABLE tips;
"""
sqlite_engine.execute(sql_expr)
Out[30]:
<sqlalchemy.engine.result.ResultProxy at 0x7f4ece565e10>

Notice that the tips table no longer exists:

In [31]:
sqlite_engine.table_names()
Out[31]:
['students']

SELECT Queries

Now let's start looking at some slightly more interesting queries. The canonical SQL query block includes the following clauses, in the order they appear. Square brackets indicate optional clauses.

SELECT ...
  FROM ...
[WHERE ...]
[GROUP BY ...]
[HAVING ...]
[ORDER BY ...]
[LIMIT ...];

Query blocks can reference one or more tables, and be nested in various ways. Before we worry about multi-table queries or nested queries, we'll work our way through examples that exercise all of these clauses on a single table.

The SELECT and WHERE Clauses

The list that follows SELECT determines which columns to include in the output.

In [32]:
sql_expr = """
SELECT name
FROM students;
"""
pd.read_sql(sql_expr, sqlite_engine)
Out[32]:
name
0 Bill Gates
1 Cheryl Sandberg
2 Danah Boyd
3 Hillary Mason
4 Marissa Meyer
5 Mark Zuckerberg
6 Mike Olson
7 Sergey Brin
8 Susan Wojcicki

The WHERE clause determines which rows of to include by specifying a predicate (boolean expression). Rows (tuples) that satisfy this expression are returned.

In [33]:
sql_expr = """
SELECT name, gpa
FROM students
WHERE dept = 'CS'
"""
pd.read_sql(sql_expr, sqlite_engine)
Out[33]:
name gpa
0 Danah Boyd 3.9
1 Bill Gates 2.0
2 Mike Olson 3.7
3 Mark Zuckerberg 3.8
4 Sergey Brin 2.8

We can use the OR and AND keywords to combine predicates.

In [34]:
sql_expr = """
SELECT name, gpa, dept
FROM students
WHERE dept = 'CS' OR name < 'E';
"""
pd.read_sql(sql_expr, sqlite_engine)
Out[34]:
name gpa dept
0 Danah Boyd 3.9 CS
1 Bill Gates 2.0 CS
2 Mike Olson 3.7 CS
3 Mark Zuckerberg 3.8 CS
4 Cheryl Sandberg 3.6 BUSINESS
5 Sergey Brin 2.8 CS

And of course we can specify both rows and columns explicitly. If we have a primary key, we can filter things down to even the cell level via a select list of one column, and a where clause checking equality on the primary key columns:

In [35]:
sql_expr = """
SELECT gpa
FROM students
WHERE name = 'Bill Gates';
"""
pd.read_sql(sql_expr, sqlite_engine)
Out[35]:
gpa
0 2.0

Note that even this "single-celled" response still has a uniform data type of a relation.

SQL expressions take in tables and always produce tables. This is unlike pandas dataframe operations where some dataframe operations return a Series or even a single value.

Selecting Distinct Rows

SQL treats query results as a multiset, i.e. a relation can have tuples that are identical ot each other. Sometimes, however, we don't want to keep the duplicates, we want to eliminate them. This is done simply by adding the keyword DISTINCT after the SELECT statement:

In [36]:
sql_expr = """
SELECT DISTINCT dept
FROM students;
"""
pd.read_sql(sql_expr, sqlite_engine)
Out[36]:
dept
0 CS
1 DATASCI
2 BUSINESS

Which rows are used when taking the distinct entries? Does it really matter?

Note that distinct operates at the level of the tuple, not at the level of the column.

In [37]:
sql_expr = """
SELECT DISTINCT dept, gender
FROM students;
"""
pd.read_sql(sql_expr, sqlite_engine)
Out[37]:
dept gender
0 CS F
1 CS M
2 DATASCI F
3 BUSINESS F

Functions in the Selection List

SQL has a wide range of functions that can be applied to each attribute in the select list. Notice that we can alias (name) the columns with AS. The complete list of built in PostreSQL functions is available here.

In [38]:
sql_expr = """
SELECT UPPER(name), LOWER(dept) AS d, gpa/4.0 AS GPA_ratio
FROM students;
"""
pd.read_sql(sql_expr, sqlite_engine)
Out[38]:
UPPER(name) d GPA_ratio
0 DANAH BOYD cs 0.975
1 BILL GATES cs 0.500
2 HILLARY MASON datasci 1.000
3 MIKE OLSON cs 0.925
4 MARK ZUCKERBERG cs 0.950
5 CHERYL SANDBERG business 0.900
6 SUSAN WOJCICKI business 0.950
7 MARISSA MEYER business 0.650
8 SERGEY BRIN cs 0.700

Unfortunately, sqlite has very limited functionality. However, most modern full featured database management systems will offer a rich set of transformations.

Aggregation Functions

SQL also provides a number of useful aggregation functions, for example AVG.

In [39]:
sql_expr = """
SELECT AVG(age)
FROM students
WHERE dept = 'CS'
"""
pd.read_sql(sql_expr, sqlite_engine)
Out[39]:
AVG(age)
0 47.0
In [40]:
sql_expr = """
SELECT AVG(age)
FROM students
"""
pd.read_sql(sql_expr, sqlite_engine)
Out[40]:
AVG(age)
0 46.222222

Other aggregation functions include .SUM, COUNT, MAX, and MIN

In [41]:
sql_expr = """
SELECT SUM(age)
FROM students
"""
pd.read_sql(sql_expr, sqlite_engine)
Out[41]:
SUM(age)
0 416

The COUNT function, which counts the number of non-null tuples, can take * as an argument.

In [42]:
sql_expr = """
SELECT COUNT(*)
FROM students
"""
pd.read_sql(sql_expr, sqlite_engine)
Out[42]:
COUNT(*)
0 9

Now that you can slice and dice tables into columns, rows and cells, you have enough knowledge to poke around in a database. Let's move on to skills that you'll need as a data scientist.

Group By Aggregation

GROUP BY aggregation in SQL is a lot like the group by in Pandas. SQL provides a family of [aggregate functions] for use in the select clause. In the simplest form, queries with aggregates in the select clause generate a single row of output, with each aggregate function performing a summary of all the rows of input. You can have many aggregate functions in your select clause:

A list of built-in aggregate functions in PostgreSQL is here. In our case, the query we are looking for is as follows.

In the following we compute the average age of students in each department.

In [43]:
sql_expr = """
SELECT dept, AVG(age) as avg_age
FROM students
GROUP BY dept
"""
pd.read_sql(sql_expr, sqlite_engine)
Out[43]:
dept avg_age
0 BUSINESS 47.333333
1 CS 47.000000
2 DATASCI 39.000000

We can also include the number of students in each department using the COUNT aggregation function.

In [44]:
sql_expr = """
SELECT dept, AVG(age) as avg_age, COUNT(*) as count
FROM students
GROUP BY dept
"""
pd.read_sql(sql_expr, sqlite_engine)
Out[44]:
dept avg_age count
0 BUSINESS 47.333333 3
1 CS 47.000000 5
2 DATASCI 39.000000 1

Suppose we wanted to try to retrieve only groups with more than 2 students. Cannot use a WHERE clause for this (will cause an error).

In [45]:
sql_expr = """
SELECT dept, AVG(age) as avg_age
FROM students
WHERE COUNT(age) > 2 
GROUP BY dept
"""
pd.read_sql(sql_expr, sqlite_engine)
---------------------------------------------------------------------------
OperationalError                          Traceback (most recent call last)
~/miniconda2/envs/data100/lib/python3.6/site-packages/sqlalchemy/engine/base.py in _execute_context(self, dialect, constructor, statement, parameters, *args)
   1192                         parameters,
-> 1193                         context)
   1194         except BaseException as e:

~/miniconda2/envs/data100/lib/python3.6/site-packages/sqlalchemy/engine/default.py in do_execute(self, cursor, statement, parameters, context)
    506     def do_execute(self, cursor, statement, parameters, context=None):
--> 507         cursor.execute(statement, parameters)
    508 

OperationalError: misuse of aggregate: COUNT()

The above exception was the direct cause of the following exception:

OperationalError                          Traceback (most recent call last)
<ipython-input-45-6ea94085c652> in <module>()
      5 GROUP BY dept
      6 """
----> 7 pd.read_sql(sql_expr, sqlite_engine)

~/miniconda2/envs/data100/lib/python3.6/site-packages/pandas/io/sql.py in read_sql(sql, con, index_col, coerce_float, params, parse_dates, columns, chunksize)
    414             sql, index_col=index_col, params=params,
    415             coerce_float=coerce_float, parse_dates=parse_dates,
--> 416             chunksize=chunksize)
    417 
    418 

~/miniconda2/envs/data100/lib/python3.6/site-packages/pandas/io/sql.py in read_query(self, sql, index_col, coerce_float, parse_dates, params, chunksize)
   1090         args = _convert_params(sql, params)
   1091 
-> 1092         result = self.execute(*args)
   1093         columns = result.keys()
   1094 

~/miniconda2/envs/data100/lib/python3.6/site-packages/pandas/io/sql.py in execute(self, *args, **kwargs)
    981     def execute(self, *args, **kwargs):
    982         """Simple passthrough to SQLAlchemy connectable"""
--> 983         return self.connectable.execute(*args, **kwargs)
    984 
    985     def read_table(self, table_name, index_col=None, coerce_float=True,

~/miniconda2/envs/data100/lib/python3.6/site-packages/sqlalchemy/engine/base.py in execute(self, statement, *multiparams, **params)
   2073 
   2074         connection = self.contextual_connect(close_with_result=True)
-> 2075         return connection.execute(statement, *multiparams, **params)
   2076 
   2077     def scalar(self, statement, *multiparams, **params):

~/miniconda2/envs/data100/lib/python3.6/site-packages/sqlalchemy/engine/base.py in execute(self, object, *multiparams, **params)
    940         """
    941         if isinstance(object, util.string_types[0]):
--> 942             return self._execute_text(object, multiparams, params)
    943         try:
    944             meth = object._execute_on_connection

~/miniconda2/envs/data100/lib/python3.6/site-packages/sqlalchemy/engine/base.py in _execute_text(self, statement, multiparams, params)
   1102             statement,
   1103             parameters,
-> 1104             statement, parameters
   1105         )
   1106         if self._has_events or self.engine._has_events:

~/miniconda2/envs/data100/lib/python3.6/site-packages/sqlalchemy/engine/base.py in _execute_context(self, dialect, constructor, statement, parameters, *args)
   1198                 parameters,
   1199                 cursor,
-> 1200                 context)
   1201 
   1202         if self._has_events or self.engine._has_events:

~/miniconda2/envs/data100/lib/python3.6/site-packages/sqlalchemy/engine/base.py in _handle_dbapi_exception(self, e, statement, parameters, cursor, context)
   1411                 util.raise_from_cause(
   1412                     sqlalchemy_exception,
-> 1413                     exc_info
   1414                 )
   1415             else:

~/miniconda2/envs/data100/lib/python3.6/site-packages/sqlalchemy/util/compat.py in raise_from_cause(exception, exc_info)
    201     exc_type, exc_value, exc_tb = exc_info
    202     cause = exc_value if exc_value is not exception else None
--> 203     reraise(type(exception), exception, tb=exc_tb, cause=cause)
    204 
    205 if py3k:

~/miniconda2/envs/data100/lib/python3.6/site-packages/sqlalchemy/util/compat.py in reraise(tp, value, tb, cause)
    184             value.__cause__ = cause
    185         if value.__traceback__ is not tb:
--> 186             raise value.with_traceback(tb)
    187         raise value
    188 

~/miniconda2/envs/data100/lib/python3.6/site-packages/sqlalchemy/engine/base.py in _execute_context(self, dialect, constructor, statement, parameters, *args)
   1191                         statement,
   1192                         parameters,
-> 1193                         context)
   1194         except BaseException as e:
   1195             self._handle_dbapi_exception(

~/miniconda2/envs/data100/lib/python3.6/site-packages/sqlalchemy/engine/default.py in do_execute(self, cursor, statement, parameters, context)
    505 
    506     def do_execute(self, cursor, statement, parameters, context=None):
--> 507         cursor.execute(statement, parameters)
    508 
    509     def do_execute_no_params(self, cursor, statement, context=None):

OperationalError: (sqlite3.OperationalError) misuse of aggregate: COUNT() [SQL: '\nSELECT dept, AVG(age) as avg_age\nFROM students\nWHERE COUNT(age) > 2 \nGROUP BY dept\n'] (Background on this error at: http://sqlalche.me/e/e3q8)

We can use the HAVING clause to apply a predicate to groups.

In [46]:
sql_expr = """
SELECT dept, AVG(age) as avg_age, COUNT(*) as cnt
FROM students
GROUP BY dept
HAVING COUNT(*) >= 2
"""
pd.read_sql(sql_expr, sqlite_engine)
Out[46]:
dept avg_age cnt
0 BUSINESS 47.333333 3
1 CS 47.000000 5

Ordering the output

The ORDER BY clause in SQL allows you to order your output rows, in either ascending (ASC) or descending (DESC) order of the values in columns. For example:

In [47]:
sql_expr = """
SELECT name, gpa, age
FROM students
WHERE name > 'E'
ORDER BY gpa, name;
"""
pd.read_sql(sql_expr, sqlite_engine)
Out[47]:
name gpa age
0 Marissa Meyer 2.6 43
1 Sergey Brin 2.8 45
2 Mike Olson 3.7 53
3 Mark Zuckerberg 3.8 34
4 Susan Wojcicki 3.8 50
5 Hillary Mason 4.0 39
In [48]:
sql_expr = """
SELECT name, gpa, age
FROM students
WHERE name > 'E'
ORDER BY gpa DESC, name ASC;
"""
pd.read_sql(sql_expr, sqlite_engine)
Out[48]:
name gpa age
0 Hillary Mason 4.0 39
1 Mark Zuckerberg 3.8 34
2 Susan Wojcicki 3.8 50
3 Mike Olson 3.7 53
4 Sergey Brin 2.8 45
5 Marissa Meyer 2.6 43

LIMIT Clause

The limit clause limits the number of elements returned.

In [49]:
sql_expr = """
SELECT * FROM students 
LIMIT 3
"""
pd.read_sql(sql_expr, sqlite_engine)
Out[49]:
name gpa age dept gender
0 Danah Boyd 3.9 40 CS F
1 Bill Gates 2.0 63 CS M
2 Hillary Mason 4.0 39 DATASCI F

Why do we use the LIMIT clause?

Often the database we are querying is massive and retrieving the entire table as we are debugging the query can be costly in time and system resources. However, we should avoid using LIMIT when constructing a sample of the data.

Which elements are returned?

While this depends on the order of elements which could be arbitrary beyond anything specified by the ORDER BY clauses.

Is this a random sample?

No, be careful!

Test Your Understanding

Try to figure out what the query below will compute

In [50]:
sql_expr = """
SELECT dept, AVG(gpa) as avg_gpa, COUNT(*) as count
FROM students
WHERE gender = 'F'
GROUP BY dept
HAVING COUNT(*) >2
"""
pd.read_sql(sql_expr, sqlite_engine)
Out[50]:
dept avg_gpa count
0 BUSINESS 3.333333 3

Try to write a SQL statement that returns the average percent tip for diners who came on the weekends, grouped by table size and ordered by increasing tip percentage.

In [51]:
tips_df = sns.load_dataset("tips")
tips_df.to_sql("tips", sqlite_engine)
In [52]:
sql_expr = """
SELECT * from tips
LIMIT 5
"""
pd.read_sql(sql_expr, sqlite_engine) 
Out[52]:
index total_bill tip sex smoker day time size
0 0 16.99 1.01 Female No Sun Dinner 2
1 1 10.34 1.66 Male No Sun Dinner 3
2 2 21.01 3.50 Male No Sun Dinner 3
3 3 23.68 3.31 Male No Sun Dinner 2
4 4 24.59 3.61 Female No Sun Dinner 4
In [53]:
sql_expr = """
SELECT sex, smoker, avg(tip/total_bill) as pct
FROM tips
WHERE day = 'Sun' OR day = 'Sat'
GROUP BY sex, smoker
ORDER BY pct
"""
pd.read_sql(sql_expr, sqlite_engine) 
Out[53]:
sex smoker pct
0 Male Yes 0.151530
1 Female No 0.157180
2 Male No 0.159930
3 Female Yes 0.179239

A Brief Look at Joins

In [54]:
sql_expr = """
CREATE TABLE Sailors (
   sid INTEGER,   
   sname CHAR(20), 
   rating INTEGER, 
   age REAL,
   PRIMARY KEY (sid));
"""
result = sqlite_engine.execute(sql_expr)
In [55]:
sql_expr = """
 CREATE TABLE Reserves (
   sid INTEGER,      
   bid INTEGER, 
   day DATE, 
  PRIMARY KEY (sid, bid, day), 
  FOREIGN KEY (sid) REFERENCES Sailors, 
  FOREIGN KEY (bid) REFERENCES Boats);
"""
result = sqlite_engine.execute(sql_expr)
In [56]:
sql_expr = """
CREATE TABLE Boats (
   bid INTEGER,
   bname CHAR (20), 
   color CHAR(10), 
   PRIMARY KEY (bid));
"""
result = sqlite_engine.execute(sql_expr)
In [57]:
sql_expr = """
INSERT INTO Sailors VALUES
    (1, 'Popeye', 10, 22),
    (2, 'OliveOyl', 11, 39),
    (3, 'Garfield', 1, 27),
    (4, 'Bob', 5, 19),
    (5, 'SpongeBob', 11, 2);
"""
result = sqlite_engine.execute(sql_expr)
In [58]:
sql_expr = """
INSERT INTO Boats VALUES
    (101, 'Nina', 'red'),
    (102, 'Pinta', 'green'),
    (103, 'Santa Maria', 'blue');
"""
result = sqlite_engine.execute(sql_expr)
In [59]:
sql_expr = """
INSERT INTO Reserves VALUES
    (1, 102, '9/12/2017'),
    (2, 102, '9/13/2017'),
    (1, 101, '10/01/2017');
"""
result = sqlite_engine.execute(sql_expr)
In [60]:
sql_expr = """
SELECT * from SAILORS
"""
pd.read_sql(sql_expr, sqlite_engine) 
Out[60]:
sid sname rating age
0 1 Popeye 10 22.0
1 2 OliveOyl 11 39.0
2 3 Garfield 1 27.0
3 4 Bob 5 19.0
4 5 SpongeBob 11 2.0
In [61]:
sql_expr = """
SELECT * from RESERVES
"""
pd.read_sql(sql_expr, sqlite_engine) 
Out[61]:
sid bid day
0 1 102 9/12/2017
1 2 102 9/13/2017
2 1 101 10/01/2017
In [68]:
sql_expr = """
SELECT s.sid, s.sname, r.bid
    FROM Sailors s, Reserves r
   WHERE s.sid = r.sid
     AND s.age > 20;"""
pd.read_sql(sql_expr, sqlite_engine) 
Out[68]:
sid sname bid
0 1 Popeye 102
1 2 OliveOyl 102
2 1 Popeye 101