💽 Lecture 20 – Data 100, Summer 2025¶

Data 100, Summer 2025

Acknowledgments Page

Starting Up SQL¶

Before we look at SQL syntax in detail, let's first get ourselves set up to run SQL queries in Jupyter.

Approach #1: SQL Magic¶

1. Load the sql Module.

Load %%sql cell magic to allow the Jupyter notebook to understand SQL.

FYI: %sql is called line magic because it only applies to one line. We will see shortly that %%sql applies to an entire cell. So, it's called cell magic.

In [1]:
%load_ext sql

2. Connect to a database.

Here, we connect to the SQLite database basic_examples.db and duckdb database example_duck.db.

In [2]:
%sql sqlite:///data/basic_examples.db --alias sqlite_ex
Connecting to 'sqlite_ex'
In [3]:
%sql duckdb:///data/example_duck.db --alias duckdb_ex
Connecting and switching to connection 'duckdb_ex'

If you were connecting to an "enterprise data platform"

from sqlalchemy import create_engine

snow_engine = create_engine(
    f"snowflake://{user}:{password}@{account_identifier}")
%sql snow_engine --alias snow

db_engine = create_engine(
  url = f"databricks://token:{access_token}@{server_hostname}?" +
        f"http_path={http_path}&catalog={catalog}&schema={schema}"
)
%sql db_engine --alias db


3. Run a simple SQL query.

%sql parses only the immmediate line as a SQL command.

In [4]:
%sql SELECT * FROM Dragon;
Running query in 'duckdb_ex'
Out[4]:
name year cute
hiccup 2010 10
drogon 2011 -100
dragon 2 2019 0
puff 2010 100
smaug 2011 None

The %%sql command, on the other hand, lets Jupyter parse the rest of the lines (the entire code block) as a SQL command.

In [5]:
%%sql
SELECT * FROM Dragon;
Running query in 'duckdb_ex'
Out[5]:
name year cute
hiccup 2010 10
drogon 2011 -100
dragon 2 2019 0
puff 2010 100
smaug 2011 None

Simple query, this time on two different lines.

In [6]:
%%sql
SELECT *
FROM Dragon;
Running query in 'duckdb_ex'
Out[6]:
name year cute
hiccup 2010 10
drogon 2011 -100
dragon 2 2019 0
puff 2010 100
smaug 2011 None

Storing one-line %sql queries¶

For simple one-line queries, you can use IPython's ability to store the result of a magic command like %sql as if it were any other Python statement, and save the output to a variable:

In [7]:
dragon_table = %sql SELECT * FROM Dragon
dragon_table
Running query in 'duckdb_ex'
Out[7]:
name year cute
hiccup 2010 10
drogon 2011 -100
dragon 2 2019 0
puff 2010 100
smaug 2011 None

As noted above, the result of the query is a Python variable of type ResultSet, more specifically:

In [8]:
type(dragon_table)
Out[8]:
sql.run.resultset.ResultSet

You need to manually convert it to a Pandas DataFrame if you want to do pandas-things with its content:

In [9]:
dragon_df = dragon_table.DataFrame()
dragon_df
Out[9]:
name year cute
0 hiccup 2010 10
1 drogon 2011 -100
2 dragon 2 2019 0
3 puff 2010 100
4 smaug 2011 <NA>

You can configure jupysql to automatically convert all outputs to Pandas DataFrames. This can be handy if you intend all your Python-side work to be done with Pandas, as it saves you from manually having to call .DataFrame() first on all outputs.

  • On the other hand, you don't get access to the original SQL ResultSet objects, which have a number of interesting properties and capabilities. You can learn more about those in the jupysql documentation.

For now, let's turn this on so you can see how this simplified, "pandas all the way" worfklow looks like:

In [10]:
%config SqlMagic.autopandas = True
In [11]:
dragon_df = %sql SELECT * FROM Dragon
dragon_df
Running query in 'duckdb_ex'
Out[11]:
name year cute
0 hiccup 2010 10
1 drogon 2011 -100
2 dragon 2 2019 0
3 puff 2010 100
4 smaug 2011 <NA>
In [12]:
type(dragon_df)
Out[12]:
pandas.core.frame.DataFrame

Storing output of multiple SQL lines¶

You can use the variable << syntax in jupysql to store its output.

  • Note: This will follow your autopandas state and store either a sql.run.ResultState or a Pandas DataFrame.
In [13]:
%%sql res <<
SELECT *
FROM Dragon;
Running query in 'duckdb_ex'
In [14]:
res
Out[14]:
name year cute
0 hiccup 2010 10
1 drogon 2011 -100
2 dragon 2 2019 0
3 puff 2010 100
4 smaug 2011 <NA>

Approach #2: pd.read_sql¶

It turns out that pandas has a special-purpose function to parse SQL queries. We can pass in a SQL query as a string to return a pandas DataFrame. To achieve the same result as we did using cell magic above, we can do the following.

1. Connect to a database

In [15]:
import sqlalchemy 
import pandas as pd

engine = sqlalchemy.create_engine("duckdb:///data/example_duck.db")

2. Run a simple SQL query

In [16]:
query = """
SELECT * 
FROM Dragon;
"""

df = pd.read_sql(query, engine)
df
Out[16]:
name year cute
0 hiccup 2010 10.0
1 drogon 2011 -100.0
2 dragon 2 2019 0.0
3 puff 2010 100.0
4 smaug 2011 NaN

Approach #3 -- DuckDB Special¶

With DuckDB we can directly reference dataframe objects in our Python environment:

In [17]:
import seaborn as sns
import duckdb
mpg = sns.load_dataset("mpg")
In [18]:
output = duckdb.query("SELECT * FROM mpg")
output
Out[18]:
┌────────┬───────────┬──────────────┬────────────┬────────┬──────────────┬────────────┬─────────┬────────────────────────────┐
│  mpg   │ cylinders │ displacement │ horsepower │ weight │ acceleration │ model_year │ origin  │            name            │
│ double │   int64   │    double    │   double   │ int64  │    double    │   int64    │ varchar │          varchar           │
├────────┼───────────┼──────────────┼────────────┼────────┼──────────────┼────────────┼─────────┼────────────────────────────┤
│   18.0 │         8 │        307.0 │      130.0 │   3504 │         12.0 │         70 │ usa     │ chevrolet chevelle malibu  │
│   15.0 │         8 │        350.0 │      165.0 │   3693 │         11.5 │         70 │ usa     │ buick skylark 320          │
│   18.0 │         8 │        318.0 │      150.0 │   3436 │         11.0 │         70 │ usa     │ plymouth satellite         │
│   16.0 │         8 │        304.0 │      150.0 │   3433 │         12.0 │         70 │ usa     │ amc rebel sst              │
│   17.0 │         8 │        302.0 │      140.0 │   3449 │         10.5 │         70 │ usa     │ ford torino                │
│   15.0 │         8 │        429.0 │      198.0 │   4341 │         10.0 │         70 │ usa     │ ford galaxie 500           │
│   14.0 │         8 │        454.0 │      220.0 │   4354 │          9.0 │         70 │ usa     │ chevrolet impala           │
│   14.0 │         8 │        440.0 │      215.0 │   4312 │          8.5 │         70 │ usa     │ plymouth fury iii          │
│   14.0 │         8 │        455.0 │      225.0 │   4425 │         10.0 │         70 │ usa     │ pontiac catalina           │
│   15.0 │         8 │        390.0 │      190.0 │   3850 │          8.5 │         70 │ usa     │ amc ambassador dpl         │
│     ·  │         · │          ·   │         ·  │     ·  │           ·  │          · │  ·      │         ·                  │
│     ·  │         · │          ·   │         ·  │     ·  │           ·  │          · │  ·      │         ·                  │
│     ·  │         · │          ·   │         ·  │     ·  │           ·  │          · │  ·      │         ·                  │
│   26.0 │         4 │        156.0 │       92.0 │   2585 │         14.5 │         82 │ usa     │ chrysler lebaron medallion │
│   22.0 │         6 │        232.0 │      112.0 │   2835 │         14.7 │         82 │ usa     │ ford granada l             │
│   32.0 │         4 │        144.0 │       96.0 │   2665 │         13.9 │         82 │ japan   │ toyota celica gt           │
│   36.0 │         4 │        135.0 │       84.0 │   2370 │         13.0 │         82 │ usa     │ dodge charger 2.2          │
│   27.0 │         4 │        151.0 │       90.0 │   2950 │         17.3 │         82 │ usa     │ chevrolet camaro           │
│   27.0 │         4 │        140.0 │       86.0 │   2790 │         15.6 │         82 │ usa     │ ford mustang gl            │
│   44.0 │         4 │         97.0 │       52.0 │   2130 │         24.6 │         82 │ europe  │ vw pickup                  │
│   32.0 │         4 │        135.0 │       84.0 │   2295 │         11.6 │         82 │ usa     │ dodge rampage              │
│   28.0 │         4 │        120.0 │       79.0 │   2625 │         18.6 │         82 │ usa     │ ford ranger                │
│   31.0 │         4 │        119.0 │       82.0 │   2720 │         19.4 │         82 │ usa     │ chevy s-10                 │
├────────┴───────────┴──────────────┴────────────┴────────┴──────────────┴────────────┴─────────┴────────────────────────────┤
│ 398 rows (20 shown)                                                                                              9 columns │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
In [19]:
type(output)
Out[19]:
duckdb.duckdb.DuckDBPyRelation
In [20]:
output.df()
Out[20]:
mpg cylinders displacement horsepower weight acceleration model_year origin name
0 18.0 8 307.0 130.0 3504 12.0 70 usa chevrolet chevelle malibu
1 15.0 8 350.0 165.0 3693 11.5 70 usa buick skylark 320
2 18.0 8 318.0 150.0 3436 11.0 70 usa plymouth satellite
3 16.0 8 304.0 150.0 3433 12.0 70 usa amc rebel sst
4 17.0 8 302.0 140.0 3449 10.5 70 usa ford torino
... ... ... ... ... ... ... ... ... ...
393 27.0 4 140.0 86.0 2790 15.6 82 usa ford mustang gl
394 44.0 4 97.0 52.0 2130 24.6 82 europe vw pickup
395 32.0 4 135.0 84.0 2295 11.6 82 usa dodge rampage
396 28.0 4 120.0 79.0 2625 18.6 82 usa ford ranger
397 31.0 4 119.0 82.0 2720 19.4 82 usa chevy s-10

398 rows × 9 columns


Tables and Schema¶

A database contains a collection of SQL tables. Let's connect to our "toy" database example_duck.db and explore the tables it stores.

In [21]:
%%sql
SELECT * FROM information_schema.tables
Running query in 'duckdb_ex'
Out[21]:
table_catalog table_schema table_name table_type self_referencing_column_name reference_generation user_defined_type_catalog user_defined_type_schema user_defined_type_name is_insertable_into is_typed commit_action TABLE_COMMENT
0 example_duck main dish BASE TABLE None None None None None YES NO None None
1 example_duck main dragon BASE TABLE None None None None None YES NO None None
2 example_duck main scene BASE TABLE None None None None None YES NO None None
In [22]:
%%sql
SELECT * FROM information_schema.columns
Running query in 'duckdb_ex'
Out[22]:
table_catalog table_schema table_name column_name ordinal_position column_default is_nullable data_type character_maximum_length character_octet_length ... identity_generation identity_start identity_increment identity_maximum identity_minimum identity_cycle is_generated generation_expression is_updatable COLUMN_COMMENT
0 example_duck main dish name 1 None NO VARCHAR <NA> <NA> ... None None None None None <NA> None None <NA> None
1 example_duck main dish type 2 None YES VARCHAR <NA> <NA> ... None None None None None <NA> None None <NA> None
2 example_duck main dish cost 3 None YES INTEGER <NA> <NA> ... None None None None None <NA> None None <NA> None
3 example_duck main dragon name 1 None NO VARCHAR <NA> <NA> ... None None None None None <NA> None None <NA> None
4 example_duck main dragon year 2 None YES INTEGER <NA> <NA> ... None None None None None <NA> None None <NA> None
5 example_duck main dragon cute 3 None YES INTEGER <NA> <NA> ... None None None None None <NA> None None <NA> None
6 example_duck main scene id 1 None NO INTEGER <NA> <NA> ... None None None None None <NA> None None <NA> None
7 example_duck main scene biome 2 None NO VARCHAR <NA> <NA> ... None None None None None <NA> None None <NA> None
8 example_duck main scene city 3 None NO VARCHAR <NA> <NA> ... None None None None None <NA> None None <NA> None
9 example_duck main scene visitors 4 None YES INTEGER <NA> <NA> ... None None None None None <NA> None None <NA> None
10 example_duck main scene created_at 5 current_date() YES TIMESTAMP <NA> <NA> ... None None None None None <NA> None None <NA> None

11 rows × 45 columns

Getting Schema information with SQLAlchemy¶

How you list the tables varies across database platforms. For example, the statement:

SELECT * FROM information_schema.columns

only works on Postgres compatible databases.

For example, if we wanted to get the schema for tables in SQLite we would need the following:

In [23]:
pd.read_sql("SELECT * FROM sqlite_schema", "sqlite:///data/basic_examples.db")
Out[23]:
type name tbl_name rootpage sql
0 table sqlite_sequence sqlite_sequence 7 CREATE TABLE sqlite_sequence(name,seq)
1 table Dragon Dragon 2 CREATE TABLE Dragon (\n name TEXT PRIMARY K...
2 index sqlite_autoindex_Dragon_1 Dragon 3 None
3 table Dish Dish 4 CREATE TABLE Dish (\n name TEXT PRIMARY KEY...
4 index sqlite_autoindex_Dish_1 Dish 5 None
5 table Scene Scene 6 CREATE TABLE Scene (\n id INTEGER PRIMARY K...

Fortunately, SQLAlchemy has some generic tools that will be helpful regardless of what database platform you use.

In [24]:
from sqlalchemy import inspect
inspector = inspect(engine)
inspector.get_table_names()
Out[24]:
['dish', 'dragon', 'scene']
In [25]:
inspector.get_columns('scene')
Out[25]:
[{'name': 'id',
  'type': Integer(),
  'nullable': False,
  'default': None,
  'autoincrement': False,
  'comment': None},
 {'name': 'biome',
  'type': String(),
  'nullable': False,
  'default': None,
  'autoincrement': False,
  'comment': None},
 {'name': 'city',
  'type': String(),
  'nullable': False,
  'default': None,
  'autoincrement': False,
  'comment': None},
 {'name': 'visitors',
  'type': Integer(),
  'nullable': True,
  'default': None,
  'autoincrement': False,
  'comment': None},
 {'name': 'created_at',
  'type': TIMESTAMP(),
  'nullable': True,
  'default': 'current_date()',
  'autoincrement': False,
  'comment': None}]

Same with SQLite

In [26]:
sqlite_engine = sqlalchemy.create_engine("sqlite:///data/basic_examples.db")
inspect(sqlite_engine).get_columns("scene")
Out[26]:
[{'name': 'id',
  'type': INTEGER(),
  'nullable': True,
  'default': None,
  'primary_key': 1},
 {'name': 'biome',
  'type': TEXT(),
  'nullable': False,
  'default': None,
  'primary_key': 0},
 {'name': 'city',
  'type': TEXT(),
  'nullable': False,
  'default': None,
  'primary_key': 0},
 {'name': 'visitors',
  'type': INTEGER(),
  'nullable': True,
  'default': None,
  'primary_key': 0},
 {'name': 'created_at',
  'type': DATETIME(),
  'nullable': True,
  'default': "DATETIME('now')",
  'primary_key': 0}]

More advanced example of creating tables with primary and foreign key constraints:

In [27]:
%sql duckdb:///data/duckdb_example.db --alias student_db
Connecting and switching to connection 'student_db'
In [28]:
%%sql student_db

DROP TABLE IF EXISTS grade;
DROP TABLE IF EXISTS assignment;
DROP TABLE IF EXISTS student;


CREATE TABLE student (
    student_id INTEGER PRIMARY KEY,
    name VARCHAR,
    email VARCHAR
);

CREATE TABLE assignment (
    assignment_id INTEGER PRIMARY KEY,
    description VARCHAR
);

CREATE TABLE grade (
    student_id INTEGER,
    assignment_id INTEGER,
    score REAL CHECK (score > 0 AND score <= 100),
    FOREIGN KEY (student_id) REFERENCES student(student_id),
    FOREIGN KEY (assignment_id) REFERENCES assignment(assignment_id)
);

INSERT INTO student VALUES
(123, 'JoeyG', 'jegonzal@berkeley.edu'),
(456, 'NargesN', 'norouzi@berkeley.edu');

INSERT INTO assignment VALUES
(1, 'easy assignment'),
(2, 'hard assignment');
Out[28]:
Success
In [29]:
%%sql 
INSERT INTO grade VALUES
(123, 1, 80),
(123, 2, 42),
(456, 2, 100);
Running query in 'student_db'
Out[29]:
Success
In [30]:
%sql SELECT * FROM grade;
Running query in 'student_db'
Out[30]:
student_id assignment_id score
0 123 1 80.0
1 123 2 42.0
2 456 2 100.0



Basic Queries¶

SELECT and FROM¶

Every SQL query must contain a SELECT and FROM clause.

  • SELECT: specify the column(s) to return in the output.
  • FROM: specify the database table from which to extract data.

First, let's reconnect to our duckdb_ex database from earlier:

In [31]:
%sql duckdb_ex
Switching to connection 'duckdb_ex'
In [32]:
%%sql
SELECT * 
FROM Dragon;
Running query in 'duckdb_ex'
Out[32]:
name year cute
0 hiccup 2010 10
1 drogon 2011 -100
2 dragon 2 2019 0
3 puff 2010 100
4 smaug 2011 <NA>
In [33]:
%%sql
SELECT cute, year 
FROM Dragon;
Running query in 'duckdb_ex'
Out[33]:
cute year
0 10 2010
1 -100 2011
2 0 2019
3 100 2010
4 <NA> 2011

Aliasing with AS¶

In [34]:
%%sql
SELECT cute AS cuteness,
       year AS "birth year"
FROM Dragon;
Running query in 'duckdb_ex'
Out[34]:
cuteness birth year
0 10 2010
1 -100 2011
2 0 2019
3 100 2010
4 <NA> 2011

AS is technically optional, but often good practice to include!

In [35]:
%%sql
SELECT cute cuteness,
       year "birth year"
FROM Dragon;
Running query in 'duckdb_ex'
Out[35]:
cuteness birth year
0 10 2010
1 -100 2011
2 0 2019
3 100 2010
4 <NA> 2011

Uniqueness with DISTINCT¶

In [36]:
%%sql
SELECT DISTINCT year
FROM Dragon;
Running query in 'duckdb_ex'
Out[36]:
year
0 2011
1 2010
2 2019

Filtering with WHERE¶

In [37]:
%%sql
SELECT name, year
FROM Dragon
WHERE cute > 0;
Running query in 'duckdb_ex'
Out[37]:
name year
0 hiccup 2010
1 puff 2010
In [38]:
%%sql
SELECT name, cute, year
FROM Dragon
WHERE cute > 0 OR year > 2013;
Running query in 'duckdb_ex'
Out[38]:
name cute year
0 hiccup 10 2010
1 dragon 2 0 2019
2 puff 100 2010
In [39]:
%%sql
SELECT name, year
FROM Dragon 
WHERE name IN ('puff', 'hiccup');
Running query in 'duckdb_ex'
Out[39]:
name year
0 hiccup 2010
1 puff 2010
In [40]:
%%sql
SELECT name, cute
FROM Dragon
WHERE cute IS NOT NULL;
Running query in 'duckdb_ex'
Out[40]:
name cute
0 hiccup 10
1 drogon -100
2 dragon 2 0
3 puff 100

Ordering data using ORDER BY¶

In [41]:
%%sql
SELECT *
FROM Dragon
ORDER BY cute DESC;
Running query in 'duckdb_ex'
Out[41]:
name year cute
0 puff 2010 100
1 hiccup 2010 10
2 dragon 2 2019 0
3 drogon 2011 -100
4 smaug 2011 <NA>

Restricting output with LIMIT and OFFSET¶

In [42]:
%%sql
SELECT *
FROM Dragon
LIMIT 2;
Running query in 'duckdb_ex'
Out[42]:
name year cute
0 hiccup 2010 10
1 drogon 2011 -100
In [43]:
%%sql
SELECT *
FROM Dragon
LIMIT 2
OFFSET 1;
Running query in 'duckdb_ex'
Out[43]:
name year cute
0 drogon 2011 -100
1 dragon 2 2019 0

Sampling with RANDOM()¶

What if we wanted a random sample:

In [44]:
%%sql
SELECT *
FROM Dragon
ORDER BY RANDOM() 
LIMIT 2
Running query in 'duckdb_ex'
Out[44]:
name year cute
0 puff 2010 100
1 dragon 2 2019 0
In [45]:
%%sql
SELECT * 
FROM Dragon USING SAMPLE reservoir(2 ROWS) REPEATABLE (100);
Running query in 'duckdb_ex'
Out[45]:
name year cute
0 drogon 2011 -100
1 smaug 2011 <NA>