SQL II¶

Exploring advanced SQL syntax.

Loading the Data¶

In this lecture, we'll continue our work with the Dish table. In the cells below, we connect to the database and query the table.

In [1]:

%load_ext sql

In [2]:

%%sql
sqlite:///data/basic_examples.db

In [3]:

%%sql
SELECT *
FROM Dish;

Running query in 'sqlite:///data/basic_examples.db'

Out[3]:

name	type	cost
ravioli	entree	10
ramen	entree	7
taco	entree	7
edamame	appetizer	4
fries	appetizer	4
potsticker	appetizer	4
ice cream	dessert	5

ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

Filtering Groups Using `HAVING`¶

In [4]:

%%sql
SELECT type, COUNT(*)
FROM Dish
GROUP BY type
HAVING MAX(cost) < 8;

Running query in 'sqlite:///data/basic_examples.db'

Out[4]:

type	COUNT(*)
appetizer	3
dessert	1

ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

In [5]:

%%sql
SELECT type, MAX(name)
FROM DishDietary
WHERE notes == 'gf'
GROUP BY type
HAVING MAX(cost) <= 7;

Running query in 'sqlite:///data/basic_examples.db'

Out[5]:

type	MAX(name)
appetizer	fries
entree	taco

ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

EDA in SQL¶

Our typical workflow when working with "big data" is:

Use SQL to query data from a database
Use Python (with pandas) to analyze this data in detail

We can, however, still perform simple data cleaning and re-structuring using SQL directly. To do so, we'll consider the Title table from the IMDB dataset. We use random ordering here to get a "snapshot" of representative rows sampled from throughout the table.

In [6]:

%%sql
sqlite:///data/imdbmini.db

In [7]:

%%sql
SELECT *
FROM Title
ORDER BY RANDOM()
LIMIT 10;

Running query in 'sqlite:///data/imdbmini.db'

Out[7]:

tconst	titleType	primaryTitle	originalTitle	startYear	endYear	runtimeMinutes	genres
99077	movie	Awakenings	Awakenings	1990	None	121	Biography,Drama
1023481	movie	Step Up 2: the Streets	Step Up 2: The Streets	2008	None	98	Drama,Music,Romance
5640450	movie	ARQ	ARQ	2016	None	88	Action,Sci-Fi,Thriller
1265990	movie	The Roommate	The Roommate	2011	None	94	Thriller
116209	movie	The English Patient	The English Patient	1996	None	162	Drama,Romance,War
111257	movie	Speed	Speed	1994	None	116	Action,Adventure,Thriller
110989	movie	Richie Rich	Ri¢hie Ri¢h	1994	None	95	Comedy,Family
86993	movie	The Bounty	The Bounty	1984	None	132	Adventure,Drama,History
313443	movie	Out of Time	Out of Time	2003	None	105	Crime,Drama,Mystery
1883092	tvMiniSeries	Generation War	Unsere Mütter, unsere Väter	2013	2013	270	Action,Drama,History

ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()
Truncated to displaylimit of 10
If you want to see more, please visit displaylimit configuration

Matching Text Using `LIKE`¶

In [8]:

%%sql
SELECT titleType, primaryTitle
FROM Title
WHERE primaryTitle LIKE "%Star Wars%"

Running query in 'sqlite:///data/imdbmini.db'

Out[8]:

titleType	primaryTitle
movie	Star Wars: Episode IV - A New Hope
movie	Star Wars: Episode V - The Empire Strikes Back
movie	Star Wars: Episode VI - Return of the Jedi
movie	Star Wars: Episode I - The Phantom Menace
movie	Star Wars: Episode II - Attack of the Clones
movie	Star Wars: Episode III - Revenge of the Sith
tvSeries	Star Wars: Clone Wars
tvSeries	Star Wars: The Clone Wars
movie	Star Wars: The Clone Wars
movie	Star Wars: Episode VII - The Force Awakens

ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()
Truncated to displaylimit of 10
If you want to see more, please visit displaylimit configuration

In [9]:

%%sql
SELECT titleType, primaryTitle
FROM Title
WHERE primaryTitle LIKE "Harry Potter and the Deathly Hallows: Part _"

Running query in 'sqlite:///data/imdbmini.db'

Out[9]:

titleType	primaryTitle
movie	Harry Potter and the Deathly Hallows: Part 1
movie	Harry Potter and the Deathly Hallows: Part 2

ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

Converting Data Types Using `CAST`¶

In [10]:

%%sql
SELECT primaryTitle, CAST(runtimeMinutes AS INT)
FROM Title
LIMIT 10;

Running query in 'sqlite:///data/imdbmini.db'

Out[10]:

primaryTitle	CAST(runtimeMinutes AS INT)
A Trip to the Moon	13
The Birth of a Nation	195
The Cabinet of Dr. Caligari	76
The Kid	68
Nosferatu	94
Sherlock Jr.	45
Battleship Potemkin	75
The Gold Rush	95
Metropolis	153
The General	67

ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()
Truncated to displaylimit of 10
If you want to see more, please visit displaylimit configuration

Applying Conditions With `CASE`¶

Here, we return a random order so we can see the various movie ages (otherwise, the top few entries happen to all be old movies).

In [11]:

%%sql
SELECT titleType, startYear,
CASE WHEN startYear < 1950 THEN "old"
     WHEN startYear < 2000 THEN "mid-aged"
     ELSE "new"
     END AS movie_age
FROM Title
ORDER BY RANDOM()
LIMIT 10;

Running query in 'sqlite:///data/imdbmini.db'

Out[11]:

titleType	startYear	movie_age
movie	2015	new
movie	1989	mid-aged
tvSeries	2010	new
tvSeries	2013	new
tvSeries	1989	mid-aged
tvMovie	2013	new
movie	2004	new
movie	2017	new
movie	2002	new
movie	1981	mid-aged

ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()
Truncated to displaylimit of 10
If you want to see more, please visit displaylimit configuration

Joining Tables¶

We combine data from multiple tables by performing a join. We will explore joins using the cats database, which includes two tables: s and t.

In [12]:

%%sql
sqlite:///data/basic_examples.db

In [13]:

%%sql
SELECT * FROM s;

Running query in 'sqlite:///data/basic_examples.db'

Out[13]:

id	name
0	Apricot
1	Boots
2	Cally
4	Eugene

ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

In [14]:

%%sql
SELECT * FROM t;

Running query in 'sqlite:///data/basic_examples.db'

Out[14]:

id	breed
1	persian
2	ragdoll
4	bengal
5	persian

ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

Inner Join¶

In [15]:

%%sql
SELECT s.id, name, breed
FROM s INNER JOIN t ON s.id = t.id;

Running query in 'sqlite:///data/basic_examples.db'

Out[15]:

id	name	breed
1	Boots	persian
2	Cally	ragdoll
4	Eugene	bengal

ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

By default, JOINing without specifying a join type will default to an inner join.

In [16]:

%%sql
SELECT s.id, name, breed
FROM s JOIN t ON s.id = t.id;

Running query in 'sqlite:///data/basic_examples.db'

Out[16]:

id	name	breed
1	Boots	persian
2	Cally	ragdoll
4	Eugene	bengal

ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

Cross Join¶

In [17]:

%%sql
SELECT *
FROM s CROSS JOIN t;

Running query in 'sqlite:///data/basic_examples.db'

Out[17]:

id	name	id_1	breed
0	Apricot	1	persian
0	Apricot	2	ragdoll
0	Apricot	4	bengal
0	Apricot	5	persian
1	Boots	1	persian
1	Boots	2	ragdoll
1	Boots	4	bengal
1	Boots	5	persian
2	Cally	1	persian
2	Cally	2	ragdoll

ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()
Truncated to displaylimit of 10
If you want to see more, please visit displaylimit configuration

Equivalently, a cross join can be performed using the following syntax:

In [18]:

%%sql
SELECT *
FROM s, t;

Running query in 'sqlite:///data/basic_examples.db'

Out[18]:

id	name	id_1	breed
0	Apricot	1	persian
0	Apricot	2	ragdoll
0	Apricot	4	bengal
0	Apricot	5	persian
1	Boots	1	persian
1	Boots	2	ragdoll
1	Boots	4	bengal
1	Boots	5	persian
2	Cally	1	persian
2	Cally	2	ragdoll

ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()
Truncated to displaylimit of 10
If you want to see more, please visit displaylimit configuration

Conceptually, an inner join is equivalent to a cross join where irrelevant rows are removed.

In [19]:

%%sql
SELECT *
FROM s CROSS JOIN t
WHERE s.id = t.id;

Running query in 'sqlite:///data/basic_examples.db'

Out[19]:

id	name	id_1	breed
1	Boots	1	persian
2	Cally	2	ragdoll
4	Eugene	4	bengal

ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

Left Outer Join¶

In [20]:

%%sql
SELECT *
FROM s LEFT JOIN t ON s.id = t.id;

Running query in 'sqlite:///data/basic_examples.db'

Out[20]:

id	name	id_1	breed
0	Apricot	None	None
1	Boots	1	persian
2	Cally	2	ragdoll
4	Eugene	4	bengal

ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

Right Outer Join¶

In [21]:

%%sql
SELECT *
FROM s RIGHT JOIN t ON s.id = t.id;

Running query in 'sqlite:///data/basic_examples.db'

Out[21]:

id	name	id_1	breed
1	Boots	1	persian
2	Cally	2	ragdoll
4	Eugene	4	bengal
None	None	5	persian

ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

Full Outer Join¶

In [22]:

%%sql
SELECT *
FROM s FULL JOIN t ON s.id = t.id;

Running query in 'sqlite:///data/basic_examples.db'

Out[22]:

id	name	id_1	breed
0	Apricot	None	None
1	Boots	1	persian
2	Cally	2	ragdoll
4	Eugene	4	bengal
None	None	5	persian

ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

Aliasing in Joins¶

Let's return to the IMDB dataset. Now, we'll consider two tables: Title and Rating.

In [23]:

%%sql
sqlite:///data/imdbmini.db

When working with tables that have long names, we often create an alias using the AS keyword (much like we did with columns in the previous lecture). This makes it easier to reference these tables when performing a join.

In [24]:

%%sql
SELECT primaryTitle, averageRating
FROM Title AS T INNER JOIN Rating AS R
ON T.tconst = R.tconst;

Running query in 'sqlite:///data/imdbmini.db'

Out[24]:

primaryTitle	averageRating
A Trip to the Moon	8.2
The Birth of a Nation	6.3
The Cabinet of Dr. Caligari	8.1
The Kid	8.3
Nosferatu	7.9
Sherlock Jr.	8.2
Battleship Potemkin	8.0
The Gold Rush	8.2
Metropolis	8.3
The General	8.1

ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()
Truncated to displaylimit of 10
If you want to see more, please visit displaylimit configuration

The AS keyword is optional – omitting it will still create an alias. We typically include AS to make our queries more readable.

In [25]:

%%sql
SELECT primaryTitle, averageRating
FROM Title T INNER JOIN Rating R
ON T.tconst = R.tconst;

Running query in 'sqlite:///data/imdbmini.db'

Out[25]:

primaryTitle	averageRating
A Trip to the Moon	8.2
The Birth of a Nation	6.3
The Cabinet of Dr. Caligari	8.1
The Kid	8.3
Nosferatu	7.9
Sherlock Jr.	8.2
Battleship Potemkin	8.0
The Gold Rush	8.2
Metropolis	8.3
The General	8.1

ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()
Truncated to displaylimit of 10
If you want to see more, please visit displaylimit configuration

Referencing columns using the full or aliased table name is important to avoid ambiguity. Suppose the tables we are trying to join both include a column with the same name, like the tconst columns present in both the Title and Rating tables of the IMDB database. If we do not specify which table's column we wish to reference, SQL will not be able to process our query.

In the cell below, it is unclear if we are referring to the tconst column from the Title table or the tconst column from the Rating table. SQL errors.

In [26]:

%%sql
SELECT primaryTitle, averageRating
FROM Title AS T INNER JOIN Rating AS R
ON tconst = tconst;

Running query in 'sqlite:///data/imdbmini.db'

(sqlite3.OperationalError) ambiguous column name: tconst [SQL: SELECT primaryTitle, averageRating FROM Title AS T INNER JOIN Rating AS R ON tconst = tconst;] (Background on this error at: https://sqlalche.me/e/20/e3q8)

Refer to columns in the format table_alias.column_name to avoid ambiguity.

In [27]:

%%sql
SELECT primaryTitle, averageRating
FROM Title AS T INNER JOIN Rating AS R
ON T.tconst = R.tconst;

Running query in 'sqlite:///data/imdbmini.db'

Out[27]:

primaryTitle	averageRating
A Trip to the Moon	8.2
The Birth of a Nation	6.3
The Cabinet of Dr. Caligari	8.1
The Kid	8.3
Nosferatu	7.9
Sherlock Jr.	8.2
Battleship Potemkin	8.0
The Gold Rush	8.2
Metropolis	8.3
The General	8.1

ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()
Truncated to displaylimit of 10
If you want to see more, please visit displaylimit configuration

IMDB Case Study¶

Let's perform our first data science task in SQL – we'll explore the IMDB dataset in further detail.

First, let's understand what tables are present in the database.

In [28]:

%%sql
sqlite:///data/imdbmini.db

In [29]:

%%sql
SELECT tbl_name, sql FROM sqlite_master WHERE type='table'

Running query in 'sqlite:///data/imdbmini.db'

Out[29]:

tbl_name	sql
Title	CREATE TABLE "Title" ( "tconst" INTEGER, "titleType" TEXT, "primaryTitle" TEXT, "originalTitle" TEXT, "isAdult" TEXT, "startYear" TEXT, "endYear" TEXT, "runtimeMinutes" TEXT, "genres" TEXT )
Name	CREATE TABLE "Name" ( "nconst" INTEGER, "primaryName" TEXT, "birthYear" TEXT, "deathYear" TEXT, "primaryProfession" TEXT )
Role	CREATE TABLE "Role" ( tconst INTEGER, ordering TEXT, nconst INTEGER, category TEXT, job TEXT, characters TEXT )
Rating	CREATE TABLE "Rating" ( tconst INTEGER, averageRating TEXT, numVotes TEXT )

ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

Let's take a closer look at the Title table. Below, we preview a few rows of interest.

In [30]:

%%sql
SELECT *
FROM Title
WHERE primaryTitle IN ("Ginny & Georgia", "What If...?", "Succession", "Veep", "Tenet")
LIMIT 10;

Running query in 'sqlite:///data/imdbmini.db'

Out[30]:

tconst	titleType	primaryTitle	originalTitle	startYear	endYear	runtimeMinutes	genres
10168312	tvSeries	What If...?	What If...?	2021	None	None	Action,Adventure,Animation
10813940	tvSeries	Ginny & Georgia	Ginny & Georgia	2021	None	None	Comedy,Drama
1759761	tvSeries	Veep	Veep	2012	2019	28	Comedy
6723592	movie	Tenet	Tenet	2020	None	150	Action,Sci-Fi,Thriller
7660850	tvSeries	Succession	Succession	2018	None	60	Drama

ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

There are a few issues with how this data is stored. Can you spot them?

We see some rows with missing runtimeMinutes. This suggests malformed data. We could drop these rows in pandas using [] or .dropna().
We also note that the time and year columns are currently given in string format (see the schema listed earlier), whereas we probably want them in the format. Again, we could do this conversion in pandas using pd.to_numeric or .astype.

Instead, let's see how we can can do these fixes in SQL.

In [31]:

%%sql 
SELECT tconst AS id,
  primaryTitle AS title, 
  titleType,
  CAST(runtimeMinutes AS int) AS time,  
  CAST(startYear AS int) AS year
FROM Title
WHERE time > 0 AND -- get rid of outliers and missing values --
      year > 0     -- get rid of missing values --

Running query in 'sqlite:///data/imdbmini.db'

Out[31]:

id	title	titleType	time	year
417	A Trip to the Moon	short	13	1902
4972	The Birth of a Nation	movie	195	1915
10323	The Cabinet of Dr. Caligari	movie	76	1920
12349	The Kid	movie	68	1921
13442	Nosferatu	movie	94	1922
15324	Sherlock Jr.	movie	45	1924
15648	Battleship Potemkin	movie	75	1925
15864	The Gold Rush	movie	95	1925
17136	Metropolis	movie	153	1927
17925	The General	movie	67	1926

ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()
Truncated to displaylimit of 10
If you want to see more, please visit displaylimit configuration

Perhaps we're interested in only action movies. We can refine our query results futher.

In [32]:

%%sql
SELECT tconst AS id,
  primaryTitle AS title, 
  titleType,
  CAST(runtimeMinutes AS int) AS time,  
  CAST(startYear AS int) AS year
FROM Title
WHERE time > 0 AND -- get rid of outliers and missing values --
      year > 0     -- get rid of missing values --
      AND titleType = "movie"
      AND genres LIKE "%Action%";

Running query in 'sqlite:///data/imdbmini.db'

Out[32]:

id	title	titleType	time	year
15324	Sherlock Jr.	movie	45	1924
17925	The General	movie	67	1926
23427	Scarface	movie	93	1932
29843	The Adventures of Robin Hood	movie	102	1938
40506	Key Largo	movie	100	1948
40724	Red River	movie	133	1948
42041	White Heat	movie	114	1949
46534	The War of the Worlds	movie	85	1953
47034	Godzilla	movie	96	1954
47478	Seven Samurai	movie	207	1954

ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()
Truncated to displaylimit of 10
If you want to see more, please visit displaylimit configuration

If we want to know the critic ratings for each of these action movies, we can join Title with the Rating table.

In the cell below, we use the << operator to store the result of our query as a variable.

In [33]:

%%sql action_movies_sql <<
SELECT T.tconst AS id,
  primaryTitle AS title, 
  titleType,
  CAST(runtimeMinutes AS int) AS time,  
  CAST(startYear AS int) AS year,
  CAST(averageRating AS float) AS rating
FROM Title AS T INNER JOIN Rating AS R ON T.tconst = R.tconst
WHERE time > 0 AND -- get rid of outliers and missing values --
      year > 0     -- get rid of missing values --
      AND titleType = "movie"
      AND genres LIKE "%Action%";

Running query in 'sqlite:///data/imdbmini.db'

Let's look at what we got.

In [34]:

action_movies_sql

Out[34]:

id	title	titleType	time	year	rating
15324	Sherlock Jr.	movie	45	1924	8.2
17925	The General	movie	67	1926	8.1
23427	Scarface	movie	93	1932	7.8
29843	The Adventures of Robin Hood	movie	102	1938	7.9
40506	Key Largo	movie	100	1948	7.8
40724	Red River	movie	133	1948	7.8
42041	White Heat	movie	114	1949	8.1
46534	The War of the Worlds	movie	85	1953	7.1
47034	Godzilla	movie	96	1954	7.6
47478	Seven Samurai	movie	207	1954	8.6

ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()
Truncated to displaylimit of 10
If you want to see more, please visit displaylimit configuration

A typical workflow for interacting with databases is:

Use SQL to query large amounts of data from the database. Perform basic filtering and cleaning operations so we obtain the data we need.
Convert the outputted table into a DataFrame, then peform more detailed analysis work in pandas

We use .DataFrame() to convert our SQL table action_movies into a pandas DataFrame.

In [35]:

action_movies_df = action_movies_sql.DataFrame()
action_movies_df

Out[35]:

	id	title	titleType	time	year	rating
0	15324	Sherlock Jr.	movie	45	1924	8.2
1	17925	The General	movie	67	1926	8.1
2	23427	Scarface	movie	93	1932	7.8
3	29843	The Adventures of Robin Hood	movie	102	1938	7.9
4	40506	Key Largo	movie	100	1948	7.8
...	...	...	...	...	...	...
1715	9243946	El Camino: A Breaking Bad Movie	movie	122	2019	7.3
1716	9252468	Mosul	movie	86	2019	7.2
1717	9376612	Shang-Chi and the Legend of the Ten Rings	movie	132	2021	7.9
1718	9691136	Shadow in the Cloud	movie	83	2020	4.9
1719	9777666	The Tomorrow War	movie	138	2021	6.6

1720 rows × 6 columns

Now, we can use our usual tools – pandas, numpy, matplotlib, and seaborn – to continue our work!

In [36]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn.linear_model as lm

How do ratings vary through the years?

In [37]:

sns.lineplot(data=action_movies_df, x="year", y="rating");

How does the runtime of an action movie influence the rating it receives?

In [38]:

sns.scatterplot(data=action_movies_df, x="time", y="rating");

In [39]:

# Apply transformations to linearize the data
sns.scatterplot(x=np.log(action_movies_df["time"]), y=action_movies_df["rating"]**3);

Could we use a movie's runtime to predict its rating?

In [40]:

X, Y = np.log(action_movies_df[["time"]]), action_movies_df["rating"]**3

model = lm.LinearRegression()
model.fit(X, Y)

xs = np.linspace(4, 5.5)
plt.plot(xs, model.predict(xs[:, np.newaxis]), c="tab:red")
sns.scatterplot(x=np.log(action_movies_df["time"]), y=action_movies_df["rating"]**3);

/srv/conda/envs/notebook/lib/python3.11/site-packages/sklearn/base.py:439: UserWarning: X does not have valid feature names, but LinearRegression was fitted with feature names
  warnings.warn(

SQL II¶

Loading the Data¶

Filtering Groups Using HAVING¶

EDA in SQL¶

Matching Text Using LIKE¶

Converting Data Types Using CAST¶

Applying Conditions With CASE¶

Joining Tables¶

Inner Join¶

Cross Join¶

Left Outer Join¶

Right Outer Join¶

Full Outer Join¶

Aliasing in Joins¶

IMDB Case Study¶

Filtering Groups Using `HAVING`¶

Matching Text Using `LIKE`¶

Converting Data Types Using `CAST`¶

Applying Conditions With `CASE`¶