Lecture 20 – Data 100, Spring 2024¶


Starting Up SQL¶

Before we look at SQL syntax in detail, let's first get ourselves set up to run SQL queries in Jupyter.

Approach #1: SQL Magic¶

1. Load the sql extension.

This provides the %sql line magic and the %%sql cell magic.

In [1]:
%load_ext sql

2. Connect to a database.

Here, we connect to the DuckDB database data/example_duck.db and give the connection the alias duck.

In [2]:
%sql duckdb:///data/example_duck.db --alias duck

If you were connecting to an "enterprise data platform" such as Snowflake or Databricks, you would instead build a SQLAlchemy engine from your credentials and hand it to the magic:

from sqlalchemy import create_engine

# Snowflake: user, password, and account_identifier are credentials you supply.
snow_engine = create_engine(
    f"snowflake://{user}:{password}@{account_identifier}")
%sql snow_engine --alias snow

# Databricks: access_token, server_hostname, http_path, catalog, and schema
# come from your workspace configuration.
db_engine = create_engine(
  url = f"databricks://token:{access_token}@{server_hostname}?" +
        f"http_path={http_path}&catalog={catalog}&schema={schema}"
)
%sql db_engine --alias db

3. Run a simple SQL query.

Note that the %%sql cell magic tells Jupyter to parse the rest of the cell as a SQL command.

In [3]:
%%sql
SELECT * FROM Dragon;
Running query in 'duck'
Out[3]:
name year cute
hiccup 2010 10
drogon 2011 -100
dragon 2 2019 0
puff 2010 100
smaug 2011 None
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

The same simple query, this time split across two lines.

In [4]:
%%sql
SELECT *
FROM Dragon;
Running query in 'duck'
Out[4]:
name year cute
hiccup 2010 10
drogon 2011 -100
dragon 2 2019 0
puff 2010 100
smaug 2011 None
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

Approach #2: pd.read_sql¶

It turns out that pandas has a special-purpose function for running SQL queries: we pass in a query as a string, along with a database connection or engine, and get back a pandas DataFrame. To achieve the same result as we did using cell magic above, we can do the following.

1. Connect to a database

In [5]:
import sqlalchemy 
import pandas as pd

engine = sqlalchemy.create_engine("duckdb:///data/example_duck.db")

2. Run a simple SQL query

In [6]:
query = """
SELECT * 
FROM Dragon;
"""

df = pd.read_sql(query, engine)
df
Out[6]:
name year cute
0 hiccup 2010 10.0
1 drogon 2011 -100.0
2 dragon 2 2019 0.0
3 puff 2010 100.0
4 smaug 2011 NaN
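
pd.read_sql can also bind parameters at execution time, which is safer than formatting values into the query string yourself. A minimal sketch, assuming the engine defined above; the :y placeholder and the params dict are just illustrative:

# Bind the year at execution time rather than pasting it into the string.
param_query = sqlalchemy.text("""
SELECT *
FROM Dragon
WHERE year = :y;
""")

pd.read_sql(param_query, engine, params={"y": 2010})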

Approach "3" -- Duck DB Special¶

Now that we are using DuckDB, we can do something a little surprising:

In [7]:
import seaborn as sns
mpg = sns.load_dataset("mpg")
In [8]:
%%sql
SELECT * FROM mpg
Running query in 'duck'
Out[8]:
mpg cylinders displacement horsepower weight acceleration model_year origin name
18.0 8 307.0 130.0 3504 12.0 70 usa chevrolet chevelle malibu
15.0 8 350.0 165.0 3693 11.5 70 usa buick skylark 320
18.0 8 318.0 150.0 3436 11.0 70 usa plymouth satellite
16.0 8 304.0 150.0 3433 12.0 70 usa amc rebel sst
17.0 8 302.0 140.0 3449 10.5 70 usa ford torino
15.0 8 429.0 198.0 4341 10.0 70 usa ford galaxie 500
14.0 8 454.0 220.0 4354 9.0 70 usa chevrolet impala
14.0 8 440.0 215.0 4312 8.5 70 usa plymouth fury iii
14.0 8 455.0 225.0 4425 10.0 70 usa pontiac catalina
15.0 8 390.0 190.0 3850 8.5 70 usa amc ambassador dpl
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()
Truncated to displaylimit of 10
If you want to see more, please visit displaylimit configuration

That's right: DuckDB can also see the DataFrames in our Python environment, allowing us to manipulate them directly in SQL!
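
As a quick illustration, we can aggregate the in-memory mpg DataFrame from SQL and pull the result back into pandas. A minimal sketch using the duckdb Python API directly, which resolves the name mpg to the local DataFrame via a replacement scan:

import duckdb

# SQL over the in-memory pandas DataFrame `mpg`, returned as a new DataFrame.
duckdb.sql("""
    SELECT origin, AVG(mpg) AS avg_mpg
    FROM mpg
    GROUP BY origin
""").df()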



Tables and Schema¶

A database contains a collection of SQL tables. Let's explore the tables stored in our "toy" database example_duck.db, which we connected to above.

In [9]:
%%sql
SELECT * FROM information_schema.tables
Running query in 'duck'
Out[9]:
table_catalog table_schema table_name table_type self_referencing_column_name reference_generation user_defined_type_catalog user_defined_type_schema user_defined_type_name is_insertable_into is_typed commit_action TABLE_COMMENT
example_duck main assignment BASE TABLE None None None None None YES NO None None
example_duck main dish BASE TABLE None None None None None YES NO None None
example_duck main dragon BASE TABLE None None None None None YES NO None None
example_duck main grade BASE TABLE None None None None None YES NO None None
example_duck main scene BASE TABLE None None None None None YES NO None None
example_duck main student BASE TABLE None None None None None YES NO None None
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()
In [10]:
%%sql
SELECT * FROM information_schema.columns
Running query in 'duck'
Out[10]:
table_catalog table_schema table_name column_name ordinal_position column_default is_nullable data_type character_maximum_length character_octet_length numeric_precision numeric_precision_radix numeric_scale datetime_precision interval_type interval_precision character_set_catalog character_set_schema character_set_name collation_catalog collation_schema collation_name domain_catalog domain_schema domain_name udt_catalog udt_schema udt_name scope_catalog scope_schema scope_name maximum_cardinality dtd_identifier is_self_referencing is_identity identity_generation identity_start identity_increment identity_maximum identity_minimum identity_cycle is_generated generation_expression is_updatable COLUMN_COMMENT
example_duck main assignment assignment_id 1 None NO INTEGER None None 32 2 0 None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None
example_duck main assignment description 2 None YES VARCHAR None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None
example_duck main dish name 1 None NO VARCHAR None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None
example_duck main dish type 2 None YES VARCHAR None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None
example_duck main dish cost 3 None YES INTEGER None None 32 2 0 None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None
example_duck main dragon name 1 None NO VARCHAR None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None
example_duck main dragon year 2 None YES INTEGER None None 32 2 0 None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None
example_duck main dragon cute 3 None YES INTEGER None None 32 2 0 None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None
example_duck main grade student_id 1 None YES INTEGER None None 32 2 0 None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None
example_duck main grade assignment_id 2 None YES INTEGER None None 32 2 0 None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()
Truncated to displaylimit of 10
If you want to see more, please visit displaylimit configuration

Getting Schema information with SQLAlchemy¶

How you list the tables varies across database platforms. For example, the statement:

SELECT * FROM information_schema.columns

only works on Postgres-compatible databases (DuckDB included); SQLite does not support it.

To get the schema for tables in SQLite, we instead need the following:

In [11]:
pd.read_sql("SELECT * FROM sqlite_schema", "sqlite:///data/basic_examples.db")
Out[11]:
type name tbl_name rootpage sql
0 table sqlite_sequence sqlite_sequence 7 CREATE TABLE sqlite_sequence(name,seq)
1 table Dragon Dragon 2 CREATE TABLE Dragon (\n name TEXT PRIMARY K...
2 index sqlite_autoindex_Dragon_1 Dragon 3 None
3 table Dish Dish 4 CREATE TABLE Dish (\n name TEXT PRIMARY KEY...
4 index sqlite_autoindex_Dish_1 Dish 5 None
5 table Scene Scene 6 CREATE TABLE Scene (\n id INTEGER PRIMARY K...

Fortunately, SQLAlchemy has some generic tools that will be helpful regardless of what database platform you use.

In [12]:
from sqlalchemy import inspect
inspector = inspect(engine)
inspector.get_table_names()
Out[12]:
['assignment', 'dish', 'dragon', 'grade', 'scene', 'student']
In [13]:
inspector.get_columns('scene')
Out[13]:
[{'name': 'id',
  'type': INTEGER(),
  'nullable': False,
  'default': None,
  'autoincrement': False,
  'comment': None},
 {'name': 'biome',
  'type': VARCHAR(),
  'nullable': False,
  'default': None,
  'autoincrement': False,
  'comment': None},
 {'name': 'city',
  'type': VARCHAR(),
  'nullable': False,
  'default': None,
  'autoincrement': False,
  'comment': None},
 {'name': 'visitors',
  'type': INTEGER(),
  'nullable': True,
  'default': None,
  'autoincrement': False,
  'comment': None},
 {'name': 'created_at',
  'type': TIMESTAMP(),
  'nullable': True,
  'default': 'current_date()',
  'autoincrement': False,
  'comment': None}]

The same inspection works with the SQLite database:

In [14]:
sqlite_engine = sqlalchemy.create_engine("sqlite:///data/basic_examples.db")
inspect(sqlite_engine).get_columns("scene")
Out[14]:
[{'name': 'id',
  'type': INTEGER(),
  'nullable': True,
  'default': None,
  'primary_key': 1},
 {'name': 'biome',
  'type': TEXT(),
  'nullable': False,
  'default': None,
  'primary_key': 0},
 {'name': 'city',
  'type': TEXT(),
  'nullable': False,
  'default': None,
  'primary_key': 0},
 {'name': 'visitors',
  'type': INTEGER(),
  'nullable': True,
  'default': None,
  'primary_key': 0},
 {'name': 'created_at',
  'type': DATETIME(),
  'nullable': True,
  'default': "DATETIME('now')",
  'primary_key': 0}]
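
Since get_columns returns a list of dictionaries, a small convenience (not part of the inspector API itself) is to wrap the result in a DataFrame for easier reading:

# Same column metadata as above, viewed as a DataFrame.
pd.DataFrame(inspect(sqlite_engine).get_columns("scene"))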

Example: table creation with constraints

A more advanced example of creating tables with primary keys, foreign keys, and a CHECK constraint:

In [15]:
%%sql

DROP TABLE IF EXISTS grade;
DROP TABLE IF EXISTS assignment;
DROP TABLE IF EXISTS student;


CREATE TABLE student (
    student_id INTEGER PRIMARY KEY,
    name VARCHAR,
    email VARCHAR
);

CREATE TABLE assignment (
    assignment_id INTEGER PRIMARY KEY,
    description VARCHAR
);

CREATE TABLE grade (
    student_id INTEGER,
    assignment_id INTEGER,
    score REAL CHECK (score > 0 AND score <= 100),
    FOREIGN KEY (student_id) REFERENCES student(student_id),
    FOREIGN KEY (assignment_id) REFERENCES assignment(assignment_id)
);

INSERT INTO student VALUES
(123, 'JoeyG', 'jegonzal@berkeley.edu'),
(456, 'NargesN', 'norouzi@berkeley.edu');

INSERT INTO assignment VALUES
(1, 'easy assignment'),
(2, 'hard assignment');
Running query in 'duck'
Out[15]:
Count
2
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()
In [16]:
%%sql 
INSERT INTO grade VALUES
(123, 1, 80),
(123, 2, 42),
(456, 2, 100);
Running query in 'duck'
Out[16]:
Count
3
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()
In [17]:
%sql SELECT * FROM grade;
Running query in 'duck'
Out[17]:
student_id assignment_id score
123 1 80.0
123 2 42.0
456 2 100.0
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()
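
These constraints are enforced at insert time. A minimal sketch, assuming the SQLAlchemy engine from earlier still points at the same database, showing the CHECK constraint rejecting an out-of-range score (150 is just an illustrative value):

from sqlalchemy import text

# A score above 100 violates CHECK (score > 0 AND score <= 100), so this insert should fail.
try:
    with engine.begin() as conn:
        conn.execute(text("INSERT INTO grade VALUES (456, 1, 150);"))
except Exception as e:
    print("Insert rejected:", e)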







Basic Queries¶

Every SQL query must contain a SELECT and FROM clause.

  • SELECT: specify the column(s) to return in the output
  • FROM: specify the database table from which to extract data
In [18]:
%%sql
SELECT * FROM Dragon;
Running query in 'duck'
Out[18]:
name year cute
hiccup 2010 10
drogon 2011 -100
dragon 2 2019 0
puff 2010 100
smaug 2011 None
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()
In [19]:
%%sql
SELECT cute, year FROM Dragon;
Running query in 'duck'
Out[19]:
cute year
10 2010
-100 2011
0 2019
100 2010
None 2011
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

Aliasing with AS

In [20]:
%%sql
SELECT cute AS cuteness,
       year AS "birth year"
FROM Dragon;
Running query in 'duck'
Out[20]:
cuteness birth year
10 2010
-100 2011
0 2019
100 2010
None 2011
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

Uniqueness with DISTINCT

In [21]:
%%sql
SELECT DISTINCT year
FROM Dragon;
Running query in 'duck'
Out[21]:
year
2010
2019
2011
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

Filtering with WHERE

In [22]:
%%sql
SELECT name, year
FROM Dragon
WHERE cute > 0;
Running query in 'duck'
Out[22]:
name year
hiccup 2010
puff 2010
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()
In [23]:
%%sql
SELECT name, cute, year
FROM Dragon
WHERE cute > 0 OR year > 2013;
Running query in 'duck'
Out[23]:
name cute year
hiccup 10 2010
puff 100 2010
dragon 2 0 2019
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()
In [24]:
%%sql
SELECT name, year
FROM Dragon 
WHERE name IN ('puff', 'hiccup');
Running query in 'duck'
Out[24]:
name year
puff 2010
hiccup 2010
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()
In [25]:
%%sql
SELECT name, cute
FROM Dragon
WHERE cute IS NOT NULL;
Running query in 'duck'
Out[25]:
name cute
hiccup 10
drogon -100
dragon 2 0
puff 100
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

Ordering data using ORDER BY

In [26]:
%%sql
SELECT *
FROM Dragon
ORDER BY cute DESC;
Running query in 'duck'
Out[26]:
name year cute
puff 2010 100
hiccup 2010 10
dragon 2 2019 0
drogon 2011 -100
smaug 2011 None
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

Restricting output with LIMIT and OFFSET

In [27]:
%%sql
SELECT *
FROM Dragon
LIMIT 2;
Running query in 'duck'
Out[27]:
name year cute
hiccup 2010 10
drogon 2011 -100
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()
In [28]:
%%sql
SELECT *
FROM Dragon
LIMIT 2
OFFSET 1;
Running query in 'duck'
Out[28]:
name year cute
drogon 2011 -100
dragon 2 2019 0
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

What if we wanted a random sample? Two options: ORDER BY RANDOM() with a LIMIT, or DuckDB's SAMPLE clause (here a reservoir sample of 2 rows, with REPEATABLE fixing the random seed so the sample is reproducible):

In [69]:
%%sql
SELECT *
FROM Dragon
ORDER BY RANDOM() 
LIMIT 2
Running query in 'duck'
Out[69]:
name year cute
0 drogon 2011 -100
1 dragon 2 2019 0
In [82]:
%%sql
SELECT * 
FROM Dragon USING SAMPLE reservoir(2 ROWS) REPEATABLE (100);
Running query in 'duck'
Out[82]:
name year cute
0 puff 2010 100
1 drogon 2011 -100

Grouping Data with GROUP BY¶

In [29]:
%%sql
SELECT *
FROM Dish;
Running query in 'duck'
Out[29]:
name type cost
ravioli entree 10
ramen entree 13
taco entree 7
edamame appetizer 4
fries appetizer 4
potsticker appetizer 4
ice cream dessert 5
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

A small note: the fact that type is highlighted in green below is a consequence of Jupyter assuming that we are writing Python code (where type is a built-in function). type does not have a special meaning in SQL, so the color does not indicate any special functionality. When we run the cell, Jupyter recognizes the code as SQL.

In [30]:
%%sql
SELECT type
FROM Dish;
Running query in 'duck'
Out[30]:
type
entree
entree
entree
appetizer
appetizer
appetizer
dessert
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()
In [31]:
%%sql
SELECT type
FROM Dish
GROUP BY type;
Running query in 'duck'
Out[31]:
type
entree
dessert
appetizer
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()
In [32]:
%%sql
SELECT type, SUM(cost)
FROM Dish
GROUP BY type;
Running query in 'duck'
Out[32]:
type sum("cost")
entree 30
dessert 5
appetizer 12
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()
In [33]:
%%sql
SELECT type, 
       SUM(cost), 
       MIN(cost),
       MAX(name)
FROM Dish
GROUP BY type;
Running query in 'duck'
Out[33]:
type sum("cost") min("cost") max("name")
entree 30 7 taco
dessert 5 5 ice cream
appetizer 12 4 potsticker
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()
In [34]:
%%sql
SELECT year, COUNT(cute)
FROM Dragon
GROUP BY year;
Running query in 'duck'
Out[34]:
year count(cute)
2010 2
2011 1
2019 1
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()
In [35]:
%%sql
SELECT year, COUNT(*)
FROM Dragon
GROUP BY year;
Running query in 'duck'
Out[35]:
year count_star()
2010 2
2011 2
2019 1
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

Working with SQL results as Python variables¶

By default, executing a query with a magic command produces output but doesn't save it into any Python variable:

In [36]:
%sql SELECT * FROM Dragon
Running query in 'duck'
Out[36]:
name year cute
hiccup 2010 10
drogon 2011 -100
dragon 2 2019 0
puff 2010 100
smaug 2011 None
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

This can be inconvenient if you later want to do further processing of these data in Python.

Storing one-line %sql queries¶

For simple one-line queries, you can use IPython's ability to store the result of a magic command like %sql as if it were any other Python statement, and save the output to a variable:

In [37]:
dragon_table = %sql SELECT * FROM Dragon
dragon_table
Running query in 'duck'
Out[37]:
name year cute
hiccup 2010 10
drogon 2011 -100
dragon 2 2019 0
puff 2010 100
smaug 2011 None
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

As noted above, the result of the query is a Python variable of type ResultSet, more specifically:

In [38]:
type(dragon_table)
Out[38]:
sql.run.ResultSet

You need to manually convert it to a Pandas DataFrame if you want to do pandas-things with its content:

In [39]:
dragon_df = dragon_table.DataFrame()
dragon_df
Out[39]:
name year cute
0 hiccup 2010 10.0
1 drogon 2011 -100.0
2 dragon 2 2019 0.0
3 puff 2010 100.0
4 smaug 2011 NaN

You can configure jupysql to automatically convert all outputs to pandas DataFrames. This can be handy if you intend to do all your Python-side work with pandas, as it saves you from having to call .DataFrame() on every output. On the other hand, you don't get access to the original SQL ResultSet object, which has a number of interesting properties and capabilities. You can learn more about those in the jupysql documentation.

For now, let's turn this on so you can see what this simplified, "pandas all the way" workflow looks like:

In [40]:
%config SqlMagic.autopandas = True
In [41]:
dragon_df = %sql SELECT * FROM Dragon
dragon_df
Running query in 'duck'
Out[41]:
name year cute
0 hiccup 2010 10.0
1 drogon 2011 -100.0
2 dragon 2 2019 0.0
3 puff 2010 100.0
4 smaug 2011 NaN
In [42]:
type(dragon_df)
Out[42]:
pandas.core.frame.DataFrame
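
Since dragon_df is now an ordinary DataFrame, the usual pandas operations apply directly. For example, the earlier GROUP BY count can be reproduced in pandas:

# Equivalent of SELECT year, COUNT(cute) FROM Dragon GROUP BY year.
dragon_df.groupby("year")["cute"].count()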

Storing multi-line %%sql queries¶

For a more complex query that won't fit on one line, such as:

In [43]:
%%sql
SELECT year, COUNT(*)
FROM Dragon
GROUP BY year;
Running query in 'duck'
Out[43]:
year count_star()
0 2010 2
1 2011 2
2 2019 1

You can use jupysql's variable << syntax to store its output (this will honor your autopandas setting and store either a sql.run.ResultSet or a pandas DataFrame):

In [44]:
%%sql dragon_years <<
SELECT year, COUNT(*)
FROM Dragon
GROUP BY year;
Running query in 'duck'
In [45]:
dragon_years
Out[45]:
year count_star()
0 2010 2
1 2011 2
2 2019 1

More Advanced SQL with DuckDB¶

You could do a lot of what we do in this class just using DuckDB and its many built-in functions:

https://duckdb.org/docs/sql/functions/overview
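
For instance, many scalar functions can be called straight from a query. A small sketch of dayname (used again on the flights data below), run through the duckdb Python API:

import duckdb

# dayname() maps a DATE to its weekday name; April 2, 2024 is a Tuesday.
duckdb.sql("SELECT dayname(DATE '2024-04-02') AS day").df()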

In [46]:
from ds100_utils import fetch_and_cache
In [47]:
url = "https://gist.github.com/domoritz/fd517a3a3a210c24a488e61870e2cf2c/raw/b1d53719e8e0eb9f6a95de82fdaccf0b001c0dea/flights-1m.parquet"
fetch_and_cache(url,"flights.parquet")
Using cached version that was downloaded (UTC): Tue Apr  2 03:41:25 2024
Out[47]:
PosixPath('data/flights.parquet')
In [48]:
%%sql
SELECT * FROM 'data/flights.parquet' LIMIT 10;
Running query in 'duck'
Out[48]:
FL_DATE DEP_DELAY ARR_DELAY AIR_TIME DISTANCE DEP_TIME ARR_TIME
0 2006-01-01 5 19 350 2475 9.083333 12.483334
1 2006-01-02 167 216 343 2475 11.783334 15.766666
2 2006-01-03 -7 -2 344 2475 8.883333 12.133333
3 2006-01-04 -5 -13 331 2475 8.916667 11.950000
4 2006-01-05 -3 -17 321 2475 8.950000 11.883333
5 2006-01-06 -4 -32 320 2475 8.933333 11.633333
6 2006-01-08 -3 -2 346 2475 8.950000 12.133333
7 2006-01-09 3 0 334 2475 9.050000 12.166667
8 2006-01-10 -7 -21 334 2475 8.883333 11.816667
9 2006-01-11 8 -10 321 2475 9.133333 12.000000
In [49]:
%%sql avg_delays <<

SELECT 
    dayname(fl_date) AS "Day of the Week", 
    mean(dep_delay) AS "Mean Departure Delay"
FROM 'data/flights.parquet'
GROUP BY "Day of the Week"
Running query in 'duck'
In [50]:
import plotly.express as px
px.bar(avg_delays, x="Day of the Week", y = "Mean Departure Delay",
       category_orders={"Day of the Week": ["Sunday", "Monday", "Tuesday", 
                      "Wednesday", "Thursday", "Friday", "Saturday"]})