🗃️ Lecture 21 – Data 100, Summer 2025¶

Data 100, Summer 2025

Acknowledgments Page

Loading the Data¶

In this lecture, we'll work with the Dish table. In the cells below, we connect to the database and query the table.

In [1]:
%load_ext sql

Add support for reading sqlite tables using duckdb.

In [2]:
%config SqlMagic.displaylimit = 100

Loading the SQLite database file using duckdb.

In [3]:
%sql duckdb:///data/basic_examples.db --alias duckdb_ex
Connecting to 'duckdb_ex'

Get the large IMDB database file for other examples in this demo.

Grouping Data with GROUP BY¶

In [4]:
%%sql
SELECT *
FROM Dish;
Running query in 'duckdb_ex'
Out[4]:
name type cost
ravioli entree 10
ramen entree 13
taco entree 7
edamame appetizer 4
fries appetizer 4
potsticker appetizer 4
ice cream dessert 5
In [5]:
%%sql
SELECT type
FROM Dish;
Running query in 'duckdb_ex'
Out[5]:
type
entree
entree
entree
appetizer
appetizer
appetizer
dessert

Re-run this cell a couple times. Notice that there is no guaranteed order!

In [6]:
%%sql
SELECT type
FROM Dish
GROUP BY type;
Running query in 'duckdb_ex'
Out[6]:
type
entree
dessert
appetizer

Remember that SQL is a declarative programming language. Even though the summation happens after the grouping "under the hood", we include the summation in the SELECT clause.

In [7]:
%%sql
SELECT type, SUM(cost)
FROM Dish
GROUP BY type;
Running query in 'duckdb_ex'
Out[7]:
type sum("cost")
entree 30
dessert 5
appetizer 12
In [8]:
%%sql
SELECT type, 
       SUM(cost), 
       MIN(cost),
       MAX(name)
FROM Dish
GROUP BY type;
Running query in 'duckdb_ex'
Out[8]:
type sum("cost") min("cost") max("name")
appetizer 12 4 potsticker
entree 30 7 taco
dessert 5 5 ice cream

Count the non-NULL values in cute:

In [9]:
%%sql
SELECT year, COUNT(cute)
FROM Dragon
GROUP BY year;
Running query in 'duckdb_ex'
Out[9]:
year count(cute)
2010 2
2011 1
2019 1

Count the number of rows in each group:

In [10]:
%%sql
SELECT year, COUNT(*)
FROM Dragon
GROUP BY year;
Running query in 'duckdb_ex'
Out[10]:
year count_star()
2010 2
2011 2
2019 1

Filtering Groups Using HAVING¶

In [11]:
%%sql
SELECT *
FROM Dish;
Running query in 'duckdb_ex'
Out[11]:
name type cost
ravioli entree 10
ramen entree 13
taco entree 7
edamame appetizer 4
fries appetizer 4
potsticker appetizer 4
ice cream dessert 5
In [12]:
%%sql
SELECT type, COUNT(*)
FROM Dish
GROUP BY type
HAVING MAX(cost) < 10;
Running query in 'duckdb_ex'
Out[12]:
type count_star()
appetizer 3
dessert 1
In [13]:
%%sql
SELECT type, COUNT(*)
FROM Dish
WHERE cost < 10
GROUP BY type
Running query in 'duckdb_ex'
Out[13]:
type count_star()
entree 1
dessert 1
appetizer 3
In [14]:
%%sql
SELECT * 
FROM Dragon
Running query in 'duckdb_ex'
Out[14]:
name year cute
hiccup 2010 10
drogon 2011 -100
dragon 2 2019 0
puff 2010 100
smaug 2011 None
In [15]:
%%sql
SELECT year, MAX(cute)
FROM Dragon
WHERE name in ('hiccup', 'dragon', 'puff')
GROUP BY year
HAVING MIN(cute) >= 0;
Running query in 'duckdb_ex'
Out[15]:
year max(cute)
2010 100

EDA in SQL¶

Our typical workflow when working with "big data" is:

  • Use SQL to query data from a database
  • Use Python (with pandas) to analyze this data in detail

We can, however, still perform simple data cleaning and re-structuring using SQL directly. To do so, we'll consider the Title table from the IMDB dataset.

In [16]:
# Use the gdown package to download the ~3GB imdb database
!pip install gdown

import os
if os.path.exists("/home/jovyan/shared/sql/imdb_duck.db"):
    imdbpath = "duckdb:////home/jovyan/shared/sql/imdb_duck.db"
elif os.path.exists("data/imdb_duck.db"):
    imdbpath =  "duckdb:///data/imdb_duck.db"
else:
    import gdown
    url = 'https://drive.google.com/uc?id=10tKOHGLt9QoOgq5Ii-FhxpB9lDSQgl1O'
    output_path = 'data/imdb_duck.db'
    gdown.download(url, output_path, quiet=False)
    imdbpath = "duckdb:///data/imdb_duck.db"
print(imdbpath)
Requirement already satisfied: gdown in /srv/conda/envs/notebook/lib/python3.11/site-packages (5.2.0)
Requirement already satisfied: beautifulsoup4 in /srv/conda/envs/notebook/lib/python3.11/site-packages (from gdown) (4.12.3)
Requirement already satisfied: filelock in /srv/conda/envs/notebook/lib/python3.11/site-packages (from gdown) (3.18.0)
Requirement already satisfied: requests[socks] in /srv/conda/envs/notebook/lib/python3.11/site-packages (from gdown) (2.32.4)
Requirement already satisfied: tqdm in /srv/conda/envs/notebook/lib/python3.11/site-packages (from gdown) (4.67.1)
Requirement already satisfied: soupsieve>1.2 in /srv/conda/envs/notebook/lib/python3.11/site-packages (from beautifulsoup4->gdown) (2.5)
Requirement already satisfied: charset_normalizer<4,>=2 in /srv/conda/envs/notebook/lib/python3.11/site-packages (from requests[socks]->gdown) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /srv/conda/envs/notebook/lib/python3.11/site-packages (from requests[socks]->gdown) (3.6)
Requirement already satisfied: urllib3<3,>=1.21.1 in /srv/conda/envs/notebook/lib/python3.11/site-packages (from requests[socks]->gdown) (2.2.1)
Requirement already satisfied: certifi>=2017.4.17 in /srv/conda/envs/notebook/lib/python3.11/site-packages (from requests[socks]->gdown) (2025.6.15)
Requirement already satisfied: PySocks!=1.5.7,>=1.5.6 in /srv/conda/envs/notebook/lib/python3.11/site-packages (from requests[socks]->gdown) (1.7.1)
duckdb:///data/imdb_duck.db
In [17]:
# Connect to the DuckDB database containing the IMDB data
from sqlalchemy import create_engine
imdb_engine = create_engine(imdbpath, connect_args={'read_only': True})
imdb_engine = create_engine("duckdb:///data/imdb_duck.db", connect_args={'read_only': True})
%sql imdb_engine --alias imdb
In [18]:
%%sql
SELECT setseed(0.42); -- Setting the random number seed for reproducibility

SELECT *
FROM Title
ORDER BY RANDOM()
LIMIT 10;
Running query in 'imdb'
Out[18]:
tconst titleType primaryTitle originaTitle isAdult startYear endYear runtimeMinutes genres
tt13766868 movie Voyage to Persia Voyage to Persia False 2005 None 85 Documentary
tt13952310 tvMovie Arik Brauer - Ich will nur Geschichten erzählen Arik Brauer - Ich will nur Geschichten erzählen False 2008 None 33 Biography,Documentary
tt0190384 movie The Eternal Light The Eternal Light False 1919 None 75 Drama,History
tt2083235 tvMovie Jos ei sauna auta Jos ei sauna auta False 1977 None 28 Documentary
tt21850110 movie French Story French Story False 2024 None 60 Drama
tt0207186 movie Vampire Child Vampire Child False 1999 None None None
tt0042243 movie Beware of Blondie Beware of Blondie False 1950 None 64 Comedy,Family
tt1522156 movie Mark of Love Mark of Love False 2009 None 86 Comedy,Romance
tt0411767 tvMovie The Post-Bag The Post-Bag False 1938 None None Musical
tt1563563 tvMovie Death and Back Death and Back False 2009 None 60 Documentary

Matching Text Using LIKE¶

% is similiar to .* in regex:

In [19]:
%%sql 
SELECT titleType, primaryTitle
FROM Title
WHERE primaryTitle LIKE '%Star Wars%'
Running query in 'imdb'
Out[19]:
titleType primaryTitle
movie Star Wars: Episode IV - A New Hope
movie Star Wars: Episode V - The Empire Strikes Back
movie Star Wars: Episode VI - Return of the Jedi
movie Star Wars: Episode I - The Phantom Menace
movie Star Wars: Episode II - Attack of the Clones
movie Star Wars: Episode III - Revenge of the Sith
tvMovie The Star Wars Holiday Special
tvMovie From Star Wars to Star Wars: The Story of Industrial Light & Magic
tvMovie Star Wars City
tvMovie From 'Star Wars' to 'Jedi': The Making of a Saga
tvMovie The Making of 'Star Wars'
movie Doraemon: Nobita's Little Star Wars
movie Saving Star Wars
tvMovie The Mythology of Star Wars
tvMovie Star Wars: Feel the Force
movie The Untitled Star Wars Mockumentary
tvMovie Star Wars: The Magic & the Mystery
tvMovie Star Wars: Shortened!
tvMovie When Star Wars Ruled the World
tvMovie Robot Chicken: Star Wars
movie Star Wars: Battle for the Holocrons
movie Rifftrax: The Star Wars Holiday Special
movie Star Wars: Rogue Squadron
movie Untitled Taika Waititi Star Wars Film
movie Star Wars: New Jedi Order
movie RiffTrax: Star Wars: The Force Awakens
tvMovie Star Wars: The Legacy Revealed
tvMovie Star Wars Tech
tvMovie Bring Back... Star Wars
tvMovie Nissan Rogue: Star Wars Rogue One Battle Tested
tvMovie Walmart: Rogue One: A Star Wars Story
tvMovie John Boyega Pranks Star Wars Fans with Surprise Photobomb at Celebration: Force For Change
tvMovie Star Wars Galaxy's Edge: Adventure Awaits
movie 77-Star Wars-17
movie Untitled Star Wars/Kevin Feige Project
tvMovie Star Wars: Music by John Williams
movie Battle Star Wars
movie RiffTrax: Star Wars: The Last Jedi
movie Built by Jedi - The Making of Star Wars Jedi: Fallen Order
movie Untitled Star Wars/Exegol Project
movie Star Wars: The Clone Wars
movie Star Wars: Hand of the Empire
tvMovie Star Wars in Japan Celebration
tvMovie The Lego Star Wars Holiday Special
movie How to Watch Star Wars, Part One: The Prequels Are Better Movies Than You Deserve
tvMovie Hollywood's Master of Myth: Joseph Campbell - The Force Behind Star Wars
movie Star Wars: New Horizons
tvMovie Toy Empire: The British Force Behind Star Wars Toys
movie Conflict: A Star Wars Story
movie Star Wars: Rogue Squadron
movie Star Wars: A Droid Story
movie Doraemon the Movie: Nobita's Little Star Wars 2021
movie Star Wars: A Galaxy in Darkness
movie Journey of the Outlaw: A Star Wars Story
movie Lego Star Wars the Last Hope
movie Revan: A Star Wars Fan Film
tvMovie Lego Star Wars Terrifying Tales
movie Star Wars Inner Demons
movie Rifftrax: Solo: A Star Wars Story
movie Star Wars: The Chosen One
movie Star Wars xxx parody
tvMovie Robot Chicken: Star Wars III
tvMovie Von Star Wars bis Harry Potter - John Williams dirigiert die Wiener Philharmoniker
movie Star Wars Heroes & Villains
tvMovie Lego Star Wars: The Padawan Menace
tvMovie Star Wars: ESPN Fantasy Football
movie Star Wars: Star Warriors
movie Plageuis: A Star Wars Story
movie Star Wars Kid: The Rise of the Digital Shadows
movie Star Wars: Light of the Jedi
movie Star Wars: The Rise of Ren
movie Luke Skywalker: A Star Wars Story
tvMovie Musk vs Bezos: The New Star Wars
tvMovie Kinect Star Wars: Duel
movie Star Wars: Star Warriors
movie Star Wars: The Mystery of the Lockstream Cargo.
movie Star Wars: Threads of Destiny
tvMovie Lego Star Wars: The Empire Strikes Out
tvMovie Lego Star Wars: The Empire Strikes Out
movie Star Wars: Episode VII - The Force Awakens
movie Star Wars: Episode VIII - The Last Jedi
movie Star Wars: Episode IX - The Rise of Skywalker
movie The Disney Star Wars Trilogy Documentary
movie The Fans vs Disney Star Wars Special Edition
movie Star Wars: Shattered Empire
movie Star Wars: Labyrinth of Evil, Part I
movie Untitled Dave Filoni Star Wars Project
movie Star Wars: Dawn of the Jedi
movie Star Wars: Crimson Empire
movie Untitled Star Wars
movie Star Wars: Unlikely Allies
tvMovie Kinect Star Wars: Girly Vader
movie Untitled Star Wars/Obaid-Chinoy Project
movie The Patrick (H) Willems Star Wars Holiday Special
movie Star Wars: Revenge of the Apprentice
movie Star Wars: The Fandom Menace
movie Star Wars: Tremors of the Force
movie Star Wars: The Force Unleashed II
movie Plastic Galaxy: The Story of Star Wars Toys
movie Rogue One: A Star Wars Story
Truncated to displaylimit of 100.
In [20]:
%%sql 
SELECT titleType, primaryTitle
FROM Title
WHERE primaryTitle SIMILAR TO '.*Star Wars.*'
Running query in 'imdb'
Out[20]:
titleType primaryTitle
movie Star Wars: Episode IV - A New Hope
movie Star Wars: Episode V - The Empire Strikes Back
movie Star Wars: Episode VI - Return of the Jedi
movie Star Wars: Episode I - The Phantom Menace
movie Star Wars: Episode II - Attack of the Clones
movie Star Wars: Episode III - Revenge of the Sith
tvMovie The Star Wars Holiday Special
tvMovie From Star Wars to Star Wars: The Story of Industrial Light & Magic
tvMovie Star Wars City
tvMovie From 'Star Wars' to 'Jedi': The Making of a Saga
tvMovie The Making of 'Star Wars'
movie Doraemon: Nobita's Little Star Wars
movie Saving Star Wars
tvMovie The Mythology of Star Wars
tvMovie Star Wars: Feel the Force
movie The Untitled Star Wars Mockumentary
tvMovie Star Wars: The Magic & the Mystery
tvMovie Star Wars: Shortened!
tvMovie When Star Wars Ruled the World
tvMovie Robot Chicken: Star Wars
movie Star Wars: Battle for the Holocrons
movie Rifftrax: The Star Wars Holiday Special
movie Star Wars: Rogue Squadron
movie Untitled Taika Waititi Star Wars Film
movie Star Wars: New Jedi Order
movie RiffTrax: Star Wars: The Force Awakens
tvMovie Star Wars: The Legacy Revealed
tvMovie Star Wars Tech
tvMovie Bring Back... Star Wars
tvMovie Nissan Rogue: Star Wars Rogue One Battle Tested
tvMovie Walmart: Rogue One: A Star Wars Story
tvMovie John Boyega Pranks Star Wars Fans with Surprise Photobomb at Celebration: Force For Change
tvMovie Star Wars Galaxy's Edge: Adventure Awaits
movie 77-Star Wars-17
movie Untitled Star Wars/Kevin Feige Project
tvMovie Star Wars: Music by John Williams
movie Battle Star Wars
movie RiffTrax: Star Wars: The Last Jedi
movie Built by Jedi - The Making of Star Wars Jedi: Fallen Order
movie Untitled Star Wars/Exegol Project
movie Star Wars: The Clone Wars
movie Star Wars: Hand of the Empire
tvMovie Star Wars in Japan Celebration
tvMovie The Lego Star Wars Holiday Special
movie How to Watch Star Wars, Part One: The Prequels Are Better Movies Than You Deserve
tvMovie Hollywood's Master of Myth: Joseph Campbell - The Force Behind Star Wars
movie Star Wars: New Horizons
tvMovie Toy Empire: The British Force Behind Star Wars Toys
movie Conflict: A Star Wars Story
movie Star Wars: Rogue Squadron
movie Star Wars: A Droid Story
movie Doraemon the Movie: Nobita's Little Star Wars 2021
movie Star Wars: A Galaxy in Darkness
movie Journey of the Outlaw: A Star Wars Story
movie Lego Star Wars the Last Hope
movie Revan: A Star Wars Fan Film
tvMovie Lego Star Wars Terrifying Tales
movie Star Wars Inner Demons
movie Rifftrax: Solo: A Star Wars Story
movie Star Wars: The Chosen One
movie Star Wars xxx parody
tvMovie Robot Chicken: Star Wars III
tvMovie Von Star Wars bis Harry Potter - John Williams dirigiert die Wiener Philharmoniker
movie Star Wars Heroes & Villains
tvMovie Lego Star Wars: The Padawan Menace
tvMovie Star Wars: ESPN Fantasy Football
movie Star Wars: Star Warriors
movie Plageuis: A Star Wars Story
movie Star Wars Kid: The Rise of the Digital Shadows
movie Star Wars: Light of the Jedi
movie Star Wars: The Rise of Ren
movie Luke Skywalker: A Star Wars Story
tvMovie Musk vs Bezos: The New Star Wars
tvMovie Kinect Star Wars: Duel
movie Star Wars: Star Warriors
movie Star Wars: The Mystery of the Lockstream Cargo.
movie Star Wars: Threads of Destiny
tvMovie Lego Star Wars: The Empire Strikes Out
tvMovie Lego Star Wars: The Empire Strikes Out
movie Star Wars: Episode VII - The Force Awakens
movie Star Wars: Episode VIII - The Last Jedi
movie Star Wars: Episode IX - The Rise of Skywalker
movie The Disney Star Wars Trilogy Documentary
movie The Fans vs Disney Star Wars Special Edition
movie Star Wars: Shattered Empire
movie Star Wars: Labyrinth of Evil, Part I
movie Untitled Dave Filoni Star Wars Project
movie Star Wars: Dawn of the Jedi
movie Star Wars: Crimson Empire
movie Untitled Star Wars
movie Star Wars: Unlikely Allies
tvMovie Kinect Star Wars: Girly Vader
movie Untitled Star Wars/Obaid-Chinoy Project
movie The Patrick (H) Willems Star Wars Holiday Special
movie Star Wars: Revenge of the Apprentice
movie Star Wars: The Fandom Menace
movie Star Wars: Tremors of the Force
movie Star Wars: The Force Unleashed II
movie Plastic Galaxy: The Story of Star Wars Toys
movie Rogue One: A Star Wars Story
Truncated to displaylimit of 100.

_ represents a single character, similar to .:

In [21]:
%%sql 
SELECT titleType, primaryTitle
FROM Title
WHERE primaryTitle LIKE 'Harry Potter and the Deathly Hallows: Part _'
Running query in 'imdb'
Out[21]:
titleType primaryTitle
movie Harry Potter and the Deathly Hallows: Part 1
movie Harry Potter and the Deathly Hallows: Part 2

Converting Data Types Using CAST¶

In [22]:
%%sql
SELECT primaryTitle, CAST(runtimeMinutes AS FLOAT)
FROM Title
LIMIT 10;
Running query in 'imdb'
Out[22]:
primaryTitle CAST(runtimeMinutes AS FLOAT)
Miss Jerry 45.0
The Corbett-Fitzsimmons Fight 100.0
Bohemios 100.0
The Story of the Kelly Gang 70.0
The Prodigal Son 90.0
Robbery Under Arms None
Hamlet None
Don Quijote None
The Fairylogue and Radio-Plays 120.0
A Cultura do Cacau None

Applying Conditions With CASE¶

CASE WHEN is the SQL version of if else if ... else

Here, we return a random order so we can see the various movie ages (otherwise, the top few entries happen to all be old movies).

In [23]:
%%sql
SELECT setseed(0.42); -- Setting the random number seed
    
SELECT titleType, startYear,
CASE WHEN startYear < '1950' THEN 'old'
     WHEN startYear < '2000' THEN 'mid-aged'
     ELSE 'new'
     END AS movie_age
FROM Title
ORDER BY RANDOM()
LIMIT 10;
Running query in 'imdb'
Out[23]:
titleType startYear movie_age
movie 2005 new
tvMovie 2008 new
movie 1919 old
tvMovie 2003 new
movie 2013 new
movie 1999 mid-aged
movie 1950 mid-aged
movie 2009 new
tvMovie 1938 old
tvMovie 2009 new


Joining Tables¶

We combine data from multiple tables by performing a join. We will explore joins using the cats database, which includes two tables: s and t.

In [24]:
%sql duckdb:///data/cats.db --alias cats
Connecting and switching to connection 'cats'
In [25]:
%%sql

DROP TABLE IF EXISTS s;
DROP TABLE IF EXISTS t;

CREATE TABLE s (
    id INTEGER,
    name VARCHAR,
);

CREATE TABLE t (
    id INTEGER,
    breed VARCHAR
);


INSERT INTO s VALUES
(0, 'Apricot'),
(1, 'Boots'),
(2, 'Cally'),
(4, 'Eugene');

INSERT INTO t VALUES
(1, 'persian'),
(2, 'ragdoll'),
(4, 'bengal'),
(5, 'persian');
Running query in 'cats'
Out[25]:
Count
In [26]:
%%sql
SELECT * FROM t;
Running query in 'cats'
Out[26]:
id breed
1 persian
2 ragdoll
4 bengal
5 persian

Inner Join¶

Only keep rows with a match. Similar to the default behavior of pd.merge.

In [27]:
%%sql
SELECT s.id, name, breed
FROM s INNER JOIN t ON s.id = t.id;
Running query in 'cats'
Out[27]:
id name breed
1 Boots persian
2 Cally ragdoll
4 Eugene bengal

By default, JOINing without specifying a join type will default to an inner join.

In [28]:
%%sql
SELECT s.id, name, breed
FROM s JOIN t ON s.id = t.id;
Running query in 'cats'
Out[28]:
id name breed
1 Boots persian
2 Cally ragdoll
4 Eugene bengal

Cross Join¶

A cross joins will join each row of the first table with every row of the second table.

In [29]:
%%sql
SELECT *
FROM s CROSS JOIN t;
Running query in 'cats'
Out[29]:
id name id_1 breed
0 Apricot 1 persian
1 Boots 1 persian
2 Cally 1 persian
4 Eugene 1 persian
0 Apricot 2 ragdoll
1 Boots 2 ragdoll
2 Cally 2 ragdoll
4 Eugene 2 ragdoll
0 Apricot 4 bengal
1 Boots 4 bengal
2 Cally 4 bengal
4 Eugene 4 bengal
0 Apricot 5 persian
1 Boots 5 persian
2 Cally 5 persian
4 Eugene 5 persian

Equivalently, a cross join can be performed using the following syntax:

In [30]:
%%sql
SELECT *
FROM s, t
Running query in 'cats'
Out[30]:
id name id_1 breed
0 Apricot 1 persian
1 Boots 1 persian
2 Cally 1 persian
4 Eugene 1 persian
0 Apricot 2 ragdoll
1 Boots 2 ragdoll
2 Cally 2 ragdoll
4 Eugene 2 ragdoll
0 Apricot 4 bengal
1 Boots 4 bengal
2 Cally 4 bengal
4 Eugene 4 bengal
0 Apricot 5 persian
1 Boots 5 persian
2 Cally 5 persian
4 Eugene 5 persian

Conceptually, an inner join is equivalent to a cross join where irrelevant rows are removed.

Below is an example of a common implicit join. An inner join happened without the JOIN keyword!

In [31]:
%%sql
SELECT *
FROM s, t
WHERE s.id = t.id;
Running query in 'cats'
Out[31]:
id name id_1 breed
1 Boots 1 persian
2 Cally 2 ragdoll
4 Eugene 4 bengal

Left Outer Join¶

Keep everything from the left table, and use NULL where there is not a match:

In [32]:
%%sql
SELECT *
FROM s LEFT JOIN t ON s.id = t.id;
Running query in 'cats'
Out[32]:
id name id_1 breed
1 Boots 1 persian
2 Cally 2 ragdoll
4 Eugene 4 bengal
0 Apricot None None

Right Outer Join¶

Keep everything from the right table, and use NULL where there is not a match:

In [33]:
%%sql
SELECT *
FROM s RIGHT JOIN t ON s.id = t.id;
Running query in 'cats'
Out[33]:
id name id_1 breed
1 Boots 1 persian
2 Cally 2 ragdoll
4 Eugene 4 bengal
None None 5 persian

Full Outer Join¶

Keep everything from the both tables, and use NULL where there is not a match:

In [34]:
%%sql
SELECT *
FROM s FULL JOIN t ON s.id = t.id;
Running query in 'cats'
Out[34]:
id name id_1 breed
1 Boots 1 persian
2 Cally 2 ragdoll
4 Eugene 4 bengal
0 Apricot None None
None None 5 persian

Aliasing in Joins¶

Let's return to the IMDB dataset. Now, we'll consider two tables: Title and Rating.

In [35]:
%sql imdb
Switching to connection 'imdb'

When working with tables that have long names, we often create an alias using the AS keyword (much like we did with columns in the previous lecture). This makes it easier to reference these tables when performing a join.

In [36]:
%%sql
SELECT primaryTitle, averageRating, genres, 
FROM Title AS T INNER JOIN Rating AS R
    ON T.tconst = R.tconst
WHERE CAST (numVotes AS INT) > 100 
ORDER BY averageRating DESC
LIMIT 10;
Running query in 'imdb'
Out[36]:
primaryTitle averageRating genres
A Maiden's Tale 10.0 Drama,Romance
The Dark Web 10.0 Crime,Thriller
Rule Number 4 9.899999618530273 None
Aima 9.899999618530273 None
Electric Burma 9.899999618530273 Documentary
Mr. Natwarlal 9.899999618530273 Action
Cuento de Primavera: A Spring Tale 9.899999618530273 Drama,Fantasy,Mystery
Imamdasta 9.899999618530273 Drama
Brother Islands, a Scuba Dive Adventure 9.899999618530273 Documentary
Nannayikoode 9.800000190734863 Drama

The AS keyword is optional – omitting it will still create an alias. We typically include AS to make our queries more readable.

In [37]:
%%sql 
SELECT primaryTitle, averageRating
FROM Title T INNER JOIN Rating R
    ON T.tconst = R.tconst
WHERE CAST (numVotes AS INT) > 100 
ORDER BY averageRating DESC
LIMIT 10;
Running query in 'imdb'
Out[37]:
primaryTitle averageRating
A Maiden's Tale 10.0
The Dark Web 10.0
Aima 9.899999618530273
Brother Islands, a Scuba Dive Adventure 9.899999618530273
Mr. Natwarlal 9.899999618530273
Imamdasta 9.899999618530273
Rule Number 4 9.899999618530273
Cuento de Primavera: A Spring Tale 9.899999618530273
Electric Burma 9.899999618530273
Ardhambardha Premakathe 9.800000190734863

Referencing columns using the full or aliased table name is important to avoid ambiguity. Suppose the tables we are trying to join both include a column with the same name, like the tconst columns present in both the Title and Rating tables of the IMDB database. If we do not specify which table's column we wish to reference, SQL will not be able to process our query.

In the cell below, it is unclear if we are referring to the tconst column from the Title table or the tconst column from the Rating table. SQL errors.

SELECT primaryTitle, averageRating
FROM Title AS T INNER JOIN Rating AS R
ON tconst = tconst;

IMDB Case Study¶

Let's perform our first data science task in SQL – we'll explore the IMDB dataset in further detail.

First, let's understand what tables are present in the database.

In [38]:
%%sql
SELECT tbl_name, sql FROM sqlite_master WHERE type='table'
Running query in 'imdb'
Out[38]:
tbl_name sql
name CREATE TABLE "name"(nconst VARCHAR PRIMARY KEY, primaryName VARCHAR, birthYear INTEGER, deathYear INTEGER, primaryProfession VARCHAR, knownForTitles VARCHAR);
principal CREATE TABLE principal(tconst VARCHAR NOT NULL, ordering INTEGER, nconst VARCHAR NOT NULL, category VARCHAR, job VARCHAR, characters VARCHAR, FOREIGN KEY (tconst) REFERENCES title(tconst), FOREIGN KEY (nconst) REFERENCES name(nconst));
rating CREATE TABLE rating(tconst VARCHAR NOT NULL, averageRating FLOAT, numVotes INTEGER, FOREIGN KEY (tconst) REFERENCES title(tconst));
title CREATE TABLE title(tconst VARCHAR PRIMARY KEY, titleType VARCHAR, primaryTitle VARCHAR, originaTitle VARCHAR, isAdult BOOLEAN, startYear INTEGER, endYear INTEGER, runtimeMinutes INTEGER, genres VARCHAR);

Let's take a closer look at the Title table. Below, we preview a few rows of interest.

In [39]:
%%sql 
SELECT *
FROM Title
WHERE primaryTitle IN ('Ginny & Georgia', 'What If...?', 'Succession', 'Veep', 'Tenet')
LIMIT 50;
Running query in 'imdb'
Out[39]:
tconst titleType primaryTitle originaTitle isAdult startYear endYear runtimeMinutes genres
tt1003027 movie What If...? Notre univers impitoyable False 2008 None 87 Comedy
tt8008862 movie Succession Succession False 2018 None None Horror
tt8819150 movie Succession Succession False 2018 None 71 Documentary
tt9663946 movie Tenet Tenet False 2020 None 150 Action,Sci-Fi,Thriller
tt0254490 movie Succession Keisho sakazuki False 1992 None 119 Drama
tt4097596 movie What If...? What If...? False None None 25 Action,Drama
tt6723592 movie Tenet Tenet False 2020 None 150 Action,Sci-Fi,Thriller

There are a few issues with how this data is stored. Can you spot them?

  • We see some rows with missing runtimeMinutes. This suggests malformed data. We could drop these rows in pandas using [] or .dropna().

Instead, let's see how we can can do these fixes in SQL.

In [40]:
%%sql 
SELECT 
    tconst AS id,
    primaryTitle AS title, 
    titleType,
    runtimeMinutes,
    startYear
FROM Title
WHERE 
    runtimeMinutes > 0 AND runtimeMinutes < 300 -- get rid of outliers and missing values --
    AND startYear > 0     -- get rid of missing values --
LIMIT 10
Running query in 'imdb'
Out[40]:
id title titleType runtimeMinutes startYear
tt0000009 Miss Jerry movie 45 1894
tt0000147 The Corbett-Fitzsimmons Fight movie 100 1897
tt0000502 Bohemios movie 100 1905
tt0000574 The Story of the Kelly Gang movie 70 1906
tt0000591 The Prodigal Son movie 90 1907
tt0000679 The Fairylogue and Radio-Plays movie 120 1908
tt0000941 Locura de amor movie 45 1909
tt0001184 Don Juan de Serrallonga movie 58 1910
tt0001285 The Life of Moses movie 50 1909
tt0001498 The Battle of Trafalgar movie 51 1911

Perhaps we're interested in only top rated action movies. We can refine our query results futher.

In [41]:
%%sql
SELECT 
    T.tconst AS id,
    primaryTitle AS title, 
    titleType,
    runtimeMinutes,
    startYear,
    genres,
    averageRating, 
    numVotes
FROM Title T, Rating R
WHERE 
    T.tconst = R.tconst 
    AND runtimeMinutes > 0 AND runtimeMinutes < 300 -- get rid of outliers and missing values --
    AND startYear > 0     -- get rid of missing values --
    AND titleType = 'movie' -- get movies
    AND numVotes >= 1000
    AND genres LIKE '%Action%'
ORDER BY averageRating DESC
LIMIT 10
Running query in 'imdb'
Out[41]:
id title titleType runtimeMinutes startYear genres averageRating numVotes
tt0259534 Ramayana: The Legend of Prince Rama movie 135 1993 Action,Adventure,Animation 9.199999809265137 14441
tt29284103 Nene Saroja movie 155 2023 Action 9.0 1038
tt0167260 The Lord of the Rings: The Return of the King movie 201 2003 Action,Adventure,Drama 9.0 1970161
tt0468569 The Dark Knight movie 152 2008 Action,Crime,Drama 9.0 2857548
tt16492678 Demon Slayer: Kimetsu no Yaiba - Tsuzumi Mansion Arc movie 87 2021 Action,Animation,Fantasy 8.899999618530273 24226
tt0237376 Kireedam movie 124 1989 Action,Drama 8.899999618530273 8490
tt0120737 The Lord of the Rings: The Fellowship of the Ring movie 178 2001 Action,Adventure,Drama 8.899999618530273 1998121
tt14090476 Sarsenapati Hambirrao movie 158 2022 Action,Drama,History 8.899999618530273 2877
tt3189256 Om movie 150 1995 Action,Crime,Drama 8.899999618530273 3917
tt0075143 The Message movie 207 1976 Action,Adventure,Biography 8.899999618530273 10447

Using CTEs to do some complex analysis¶

Here we use a Common Table Expression (CTE) to construct a table that I use as an input in another query.

  • This is a bit like temporarily storing the output of one query, using that output as part of another query, and then discarding the temporary output.

This is done with the syntax:

WITH table_name as ( 
    SELECT ...
)
SELECT ... 
FROM table_name, ...
...    
In [42]:
%%sql
WITH good_action_movies AS (
    SELECT *
    FROM Title T JOIN Rating R ON T.tconst = R.tconst  
    WHERE genres LIKE '%Action%' AND CAST (averageRating AS INT) > 7 AND CAST (numVotes AS INT) > 5000
),
prolific_actors AS (
    SELECT N.nconst, primaryName, COUNT(*) as numRoles
    FROM Name N JOIN Principal P ON N.nconst = P.nconst
    WHERE category = 'actor'
    GROUP BY N.nconst, primaryName
)
SELECT primaryTitle, primaryName, numRoles, ROUND(CAST (averageRating AS INT)) AS rating
FROM good_action_movies m, prolific_actors a, principal p
WHERE p.tconst = m.tconst AND p.nconst = a.nconst
ORDER BY rating DESC, numRoles DESC
LIMIT 20
Running query in 'imdb'
Out[42]:
primaryTitle primaryName numRoles rating
Threat Level Midnight: The Movie Steve Carell 51 10
Threat Level Midnight: The Movie Steve Carell 51 10
Threat Level Midnight: The Movie Ed Helms 36 10
Threat Level Midnight: The Movie Rainn Wilson 36 10
Threat Level Midnight: The Movie Ed Helms 36 10
Threat Level Midnight: The Movie John Krasinski 28 10
Threat Level Midnight: The Movie Craig Robinson 26 10
Threat Level Midnight: The Movie Ricky Gervais 17 10
Threat Level Midnight: The Movie Stephen Merchant 14 10
Threat Level Midnight: The Movie Creed Bratton 12 10
Threat Level Midnight: The Movie B.J. Novak 5 10
Threat Level Midnight: The Movie B.J. Novak 5 10
Kireedam Jagathy Sreekumar 661 9
Thevar Magan Nassar 436 9
Devasuram Mohanlal 409 9
Spadikam Mohanlal 409 9
Kireedam Mohanlal 409 9
Kireedam Sankaradi 403 9
Devasuram Nedumudi Venu 370 9
Spadikam Nedumudi Venu 370 9

How do ratings vary through the years?¶

If we want to know the critic ratings for each of these action movies, we can join Title with the Rating table.

In the cell below, we use the << operator to store the result of our query as a variable.

In [43]:
%config SqlMagic.autopandas = True
In [44]:
%%sql action_movies_df << 
SELECT 
    T.tconst AS id,
    primaryTitle AS title, 
    titleType,
    runtimeMinutes AS time,
    startYear AS year,
    genres,
    averageRating AS rating, 
    numVotes
FROM Title T, Rating R
WHERE 
    T.tconst = R.tconst 
    AND time > 0 AND time < 300 -- get rid of outliers and missing values --
    AND year > 0     -- get rid of missing values --
    AND titleType = 'movie' -- get movies
    AND numVotes >= 1000
    AND genres LIKE '%Action%'
ORDER BY startYear
Running query in 'imdb'
In [45]:
action_movies_df
Out[45]:
id title titleType time year genres rating numVotes
0 tt0004465 The Perils of Pauline movie 199 1914 Action,Adventure,Drama 6.3 1085
1 tt0004635 The Squaw Man movie 74 1914 Action,Drama,Romance 5.7 1041
2 tt0006333 20,000 Leagues Under the Sea movie 85 1916 Action,Adventure,Sci-Fi 6.1 1980
3 tt0011387 The Last of the Mohicans movie 73 1920 Action,Adventure,Drama 6.7 1439
4 tt0012642 A Sailor-Made Man movie 47 1921 Action,Adventure,Comedy 6.8 1246
... ... ... ... ... ... ... ... ...
7850 tt23461428 Lal Salaam movie 150 2024 Action,Drama,Sport 4.8 1020
7851 tt23752056 Mission: Chapter 1 movie 144 2024 Action,Thriller 6.1 1285
7852 tt26745202 Vey Dharuvey movie 127 2024 Action,Comedy 8.3 2290
7853 tt26923601 Game On movie 140 2024 Action,Drama,Romance 7.2 2252
7854 tt27110516 One More Shot movie 103 2024 Action,Thriller 5.5 1700

7855 rows × 8 columns

In [46]:
import plotly.express as px
px.scatter(action_movies_df, 
           x = "year", 
           y="rating", 
           opacity=0.5,
          range_y=[action_movies_df["rating"].min(), action_movies_df["rating"].max()])

Using a Common Table Expression (CTE):

In [47]:
%%sql action_grouped_df <<
WITH action_movies AS (
    SELECT 
        T.tconst AS id,
        primaryTitle AS title, 
        titleType,
        runtimeMinutes AS time,
        startYear AS year,
        genres,
        averageRating AS rating, 
        numVotes
    FROM Title T, Rating R
    WHERE 
        T.tconst = R.tconst 
        AND time > 0 AND time < 300 -- get rid of outliers and missing values --
        AND year > 0     -- get rid of missing values --
        AND titleType = 'movie' -- get movies
        AND numVotes >= 1000
    AND genres LIKE '%Action%'
)
SELECT 
    year, 
    median(rating) AS median_rating,
    quantile_cont(rating, .25) AS lower_q,
    quantile_cont(rating, .75) AS upper_q,
FROM action_movies
GROUP BY year
ORDER BY year
Running query in 'imdb'

Let's look at what we got.

In [48]:
import plotly.express as px
action_grouped_df['error_p'] = action_grouped_df['upper_q'] - action_grouped_df['median_rating']
action_grouped_df['error_m'] = action_grouped_df['median_rating'] - action_grouped_df['lower_q'] 
fig = px.line(action_grouped_df, x="year", y="median_rating", 
              error_y="error_p", error_y_minus="error_m")
fig

How is the runtime of an action movie associated with the rating it receives?¶

In [49]:
px.scatter(action_movies_df, x="time", y="rating", opacity=0.1)
In [50]:
px.density_heatmap(action_movies_df, x="time", y="rating",nbinsx=100, nbinsy=100)
In [51]:
%%sql action_grouped_df << 
WITH action_movies AS (
    SELECT 
        T.tconst AS id,
        primaryTitle AS title, 
        titleType,
        runtimeMinutes AS time,
        startYear AS year,
        genres,
        averageRating AS rating, 
        numVotes
    FROM Title T, Rating R
    WHERE 
        T.tconst = R.tconst 
        AND time > 0 AND time < 300 -- get rid of outliers and missing values --
        AND year > 0     -- get rid of missing values --
        AND titleType = 'movie' -- get movies
        AND numVotes >= 1000
    AND genres LIKE '%Action%'
)
SELECT 
    ceil(time / 10) * 10 as time_desc, 
    median(rating) AS median_rating,
    quantile_cont(rating, .25) AS lower_q,
    quantile_cont(rating, .75) AS upper_q,
FROM action_movies
GROUP BY time_desc
ORDER BY time_desc
Running query in 'imdb'
In [52]:
import plotly.express as px
action_grouped_df['error_p'] = action_grouped_df['upper_q'] - action_grouped_df['median_rating']
action_grouped_df['error_m'] = action_grouped_df['median_rating'] - action_grouped_df['lower_q'] 
fig = px.line(action_grouped_df, x="time_desc", y="median_rating", 
              error_y="error_p", error_y_minus="error_m")
fig

Could we use a movie's runtime to predict its rating?

In [53]:
import sklearn.linear_model as lm
import numpy as np
X, Y = action_movies_df["time"].to_numpy(), action_movies_df["rating"].to_numpy()

model = lm.LinearRegression()
model.fit(X[..., np.newaxis], Y)

xs = np.linspace(X.min(),X.max())
yhats = model.predict(xs[:, np.newaxis])
In [54]:
fig = px.scatter(x=X, y=Y, opacity=0.1)
fig.add_scatter(x=xs, y=yhats)
In [55]:
px.scatter(x=X, y=Y, opacity=0.1, trendline="ols")