Lecture 20 – Data 100, Fall 2023¶

Data 100, Fall 2023

Acknowledgments Page

Starting Up SQL¶

Before we look at SQL syntax in detail, let's first get ourselves set up to run SQL queries in Jupyter.

Approach #1: SQL Magic¶

1. Load the sql Module.

Load %%sql cell magic.

In [1]:
%load_ext sql

2. Connect to a database.

Here, we connect to the SQLite database basic_examples.db.

In [2]:
%%sql
sqlite:///data/basic_examples.db


3. Run a simple SQL query.

Note the %%sql lets Jupyter parse the rest of the lines as a SQL command.

In [3]:
%%sql
SELECT * FROM Dragon;
Running query in 'sqlite:///data/basic_examples.db'
Out[3]:
name year cute
hiccup 2010 10
drogon 2011 -100
dragon 2 2019 0
puff 2010 100
smaug 2011 None
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

Simple query, this time on two different lines.

In [4]:
%%sql
SELECT *
FROM Dragon;
Running query in 'sqlite:///data/basic_examples.db'
Out[4]:
name year cute
hiccup 2010 10
drogon 2011 -100
dragon 2 2019 0
puff 2010 100
smaug 2011 None
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

Approach #2: pd.read_sql¶

It turns out that pandas has a special-purpose function to parse SQL queries. We can pass in a SQL query as a string to return a pandas DataFrame. To achieve the same result as we did using cell magic above, we can do the following.

1. Connect to a database

In [5]:
import sqlalchemy 
import pandas as pd

engine = sqlalchemy.create_engine("sqlite:///data/basic_examples.db")
connection = engine.connect()

2. Run a simple SQL query

In [6]:
query = """
SELECT * 
FROM Dragon;
"""

pd.read_sql(query, engine)
Out[6]:
name year cute
0 hiccup 2010 10.0
1 drogon 2011 -100.0
2 dragon 2 2019 0.0
3 puff 2010 100.0
4 smaug 2011 NaN

Tables and Schema¶

A database contains a collection of SQL tables. Let's connect to our "toy" database basic_examples.db and explore the tables it stores.

In [7]:
%%sql
sqlite:///data/basic_examples.db
In [8]:
%%sql
SELECT * FROM sqlite_master WHERE type='table'
Running query in 'sqlite:///data/basic_examples.db'
Out[8]:
type name tbl_name rootpage sql
table sqlite_sequence sqlite_sequence 7 CREATE TABLE sqlite_sequence(name,seq)
table Dragon Dragon 2 CREATE TABLE Dragon (
name TEXT PRIMARY KEY,
year INTEGER CHECK (year >= 2000),
cute INTEGER
)
table Dish Dish 4 CREATE TABLE Dish (
name TEXT PRIMARY KEY,
type TEXT,
cost INTEGER CHECK (cost >= 0)
)
table Scene Scene 6 CREATE TABLE Scene (
id INTEGER PRIMARY KEY AUTOINCREMENT,
biome TEXT NOT NULL,
city TEXT NOT NULL,
visitors INTEGER CHECK (visitors >= 0),
created_at DATETIME DEFAULT (DATETIME('now'))
)
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

Basic Queries¶

Every SQL query must contain a SELECT and FROM clause.

  • SELECT: specify the column(s) to return in the output
  • FROM: specify the database table from which to extract data
In [9]:
%%sql
SELECT * FROM Dragon;
Running query in 'sqlite:///data/basic_examples.db'
Out[9]:
name year cute
hiccup 2010 10
drogon 2011 -100
dragon 2 2019 0
puff 2010 100
smaug 2011 None
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()
In [10]:
%%sql
SELECT cute, year FROM Dragon;
Running query in 'sqlite:///data/basic_examples.db'
Out[10]:
cute year
10 2010
-100 2011
0 2019
100 2010
None 2011
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

Aliasing with AS

In [11]:
%%sql
SELECT cute AS cuteness,
       year AS birth
FROM Dragon;
Running query in 'sqlite:///data/basic_examples.db'
Out[11]:
cuteness birth
10 2010
-100 2011
0 2019
100 2010
None 2011
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

Uniqueness with DISTINCT

In [12]:
%%sql
SELECT DISTINCT year
FROM Dragon;
Running query in 'sqlite:///data/basic_examples.db'
Out[12]:
year
2010
2011
2019
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

Filtering with WHERE

In [13]:
%%sql
SELECT name, year
FROM Dragon
WHERE cute > 0;
Running query in 'sqlite:///data/basic_examples.db'
Out[13]:
name year
hiccup 2010
puff 2010
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()
In [14]:
%%sql
SELECT name, cute, year
FROM Dragon
WHERE cute > 0 OR year > 2013;
Running query in 'sqlite:///data/basic_examples.db'
Out[14]:
name cute year
hiccup 10 2010
dragon 2 0 2019
puff 100 2010
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()
In [15]:
%%sql
SELECT name, year
FROM Dragon 
WHERE name IN ("puff", "hiccup");
Running query in 'sqlite:///data/basic_examples.db'
Out[15]:
name year
hiccup 2010
puff 2010
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()
In [16]:
%%sql
SELECT name, cute
FROM Dragon
WHERE cute IS NOT NULL;
Running query in 'sqlite:///data/basic_examples.db'
Out[16]:
name cute
hiccup 10
drogon -100
dragon 2 0
puff 100
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

Ordering data using ORDER BY

In [17]:
%%sql
SELECT *
FROM Dragon
ORDER BY cute DESC;
Running query in 'sqlite:///data/basic_examples.db'
Out[17]:
name year cute
puff 2010 100
hiccup 2010 10
dragon 2 2019 0
drogon 2011 -100
smaug 2011 None
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

Restricting output with LIMIT and OFFSET

In [18]:
%%sql
SELECT *
FROM Dragon
LIMIT 2;
Running query in 'sqlite:///data/basic_examples.db'
Out[18]:
name year cute
hiccup 2010 10
drogon 2011 -100
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()
In [19]:
%%sql
SELECT *
FROM Dragon
LIMIT 2
OFFSET 1;
Running query in 'sqlite:///data/basic_examples.db'
Out[19]:
name year cute
drogon 2011 -100
dragon 2 2019 0
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

Grouping Data with GROUP BY¶

In [20]:
%%sql
SELECT *
FROM Dish;
Running query in 'sqlite:///data/basic_examples.db'
Out[20]:
name type cost
ravioli entree 10
ramen entree 13
taco entree 7
edamame appetizer 4
fries appetizer 4
potsticker appetizer 4
ice cream dessert 5
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

A small note: the fact that type is highlighted in green below is a consequence of Jupyter assuming that we are writing Python code (where type is a built-in keyword). type does not have a special meaning in SQL, so the color below does not indicate any special functionality. When we run the cell, Jupyter realizes it should recognize the code as SQL.

In [21]:
%%sql
SELECT type
FROM Dish;
Running query in 'sqlite:///data/basic_examples.db'
Out[21]:
type
entree
entree
entree
appetizer
appetizer
appetizer
dessert
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()
In [22]:
%%sql
SELECT type
FROM Dish
GROUP BY type;
Running query in 'sqlite:///data/basic_examples.db'
Out[22]:
type
appetizer
dessert
entree
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()
In [23]:
%%sql
SELECT type, SUM(cost)
FROM Dish
GROUP BY type;
Running query in 'sqlite:///data/basic_examples.db'
Out[23]:
type SUM(cost)
appetizer 12
dessert 5
entree 30
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()
In [24]:
%%sql
SELECT type, 
       SUM(cost), 
       MIN(cost),
       MAX(name)
FROM Dish
GROUP BY type;
Running query in 'sqlite:///data/basic_examples.db'
Out[24]:
type SUM(cost) MIN(cost) MAX(name)
appetizer 12 4 potsticker
dessert 5 5 ice cream
entree 30 7 taco
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()
In [25]:
%%sql
SELECT year, COUNT(cute)
FROM Dragon
GROUP BY year;
Running query in 'sqlite:///data/basic_examples.db'
Out[25]:
year COUNT(cute)
2010 2
2011 1
2019 1
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()
In [26]:
%%sql
SELECT year, COUNT(*)
FROM Dragon
GROUP BY year;
Running query in 'sqlite:///data/basic_examples.db'
Out[26]:
year COUNT(*)
2010 2
2011 2
2019 1
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

Working with the sql results as Python variables¶

By default, executing a query with a magic command produces output but doesn't save it into any Python variable:

In [27]:
%sql SELECT * FROM Dragon
Running query in 'sqlite:///data/basic_examples.db'
Out[27]:
name year cute
hiccup 2010 10
drogon 2011 -100
dragon 2 2019 0
puff 2010 100
smaug 2011 None
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

This can be inconvenient if you later want to do further processing of these data in Python.

Storing one-line %sql queries¶

For simple one-line queries, you can use IPython's ability to store the result of a magic command like %sql as if it were any other Python statement, and save the output to a variable:

In [28]:
dragon_table = %sql SELECT * FROM Dragon
dragon_table
Running query in 'sqlite:///data/basic_examples.db'
Out[28]:
name year cute
hiccup 2010 10
drogon 2011 -100
dragon 2 2019 0
puff 2010 100
smaug 2011 None
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

As noted above, the result of the query is a Python variable of type ResultSet, more specifically:

In [29]:
type(dragon_table)
Out[29]:
sql.run.ResultSet

You need to manually convert it to a Pandas DataFrame if you want to do pandas-things with its content:

In [30]:
dragon_df = dragon_table.DataFrame()
dragon_df
Out[30]:
name year cute
0 hiccup 2010 10.0
1 drogon 2011 -100.0
2 dragon 2 2019 0.0
3 puff 2010 100.0
4 smaug 2011 NaN

You can configure jupysql to automatically convert all outputs to Pandas DataFrames. This can be handy if you intend all your Python-side work to be done with Pandas, as it saves you from manually having to call .DataFrame() first on all outputs. On the other hand, you don't get access to the original SQL ResultSet object, which have a number of interesting properties and capabilities. You can learn more about those in the jupysql documentation.

For now, let's turn this on so you can see how this simplified, "pandas all the way" worfklow looks like:

In [31]:
%config SqlMagic.autopandas = True
In [32]:
dragon_df = %sql SELECT * FROM Dragon
dragon_df
Running query in 'sqlite:///data/basic_examples.db'
Out[32]:
name year cute
0 hiccup 2010 10.0
1 drogon 2011 -100.0
2 dragon 2 2019 0.0
3 puff 2010 100.0
4 smaug 2011 NaN
In [33]:
type(dragon_df)
Out[33]:
pandas.core.frame.DataFrame

Storing multi-line %%sql queries¶

For a more complex query that won't fit in one line, such as for example:

In [34]:
%%sql
SELECT year, COUNT(*)
FROM Dragon
GROUP BY year;
Running query in 'sqlite:///data/basic_examples.db'
Out[34]:
year COUNT(*)
0 2010 2
1 2011 2
2 2019 1

You can use the variable << syntax in jupysql to store its output (this will honor your autopandas state and store either a sql.run.ResultState or a Pandas DataFrame):

In [35]:
%%sql dragon_years <<
SELECT year, COUNT(*)
FROM Dragon
GROUP BY year;
Running query in 'sqlite:///data/basic_examples.db'
In [36]:
dragon_years
Out[36]:
year COUNT(*)
0 2010 2
1 2011 2
2 2019 1