%load_ext sql
%%sql
sqlite:///data/basic_examples.db
%%sql
SELECT *
FROM Dish;
name | type | cost |
---|---|---|
ravioli | entree | 10 |
ramen | entree | 7 |
taco | entree | 7 |
edamame | appetizer | 4 |
fries | appetizer | 4 |
potsticker | appetizer | 4 |
ice cream | dessert | 5 |
ResultSet
: to convert to pandas, call .DataFrame()
or to polars, call .PolarsDataFrame()
HAVING
¶%%sql
SELECT type, COUNT(*)
FROM Dish
GROUP BY type
HAVING MAX(cost) < 8;
type | COUNT(*) |
---|---|
appetizer | 3 |
dessert | 1 |
ResultSet
: to convert to pandas, call .DataFrame()
or to polars, call .PolarsDataFrame()
%%sql
SELECT type, MAX(name)
FROM DishDietary
WHERE notes == 'gf'
GROUP BY type
HAVING MAX(cost) <= 7;
type | MAX(name) |
---|---|
appetizer | fries |
entree | taco |
ResultSet
: to convert to pandas, call .DataFrame()
or to polars, call .PolarsDataFrame()
Our typical workflow when working with "big data" is:
pandas
) to analyze this data in detailWe can, however, still perform simple data cleaning and re-structuring using SQL directly. To do so, we'll consider the Title
table from the IMDB dataset. We use random ordering here to get a "snapshot" of representative rows sampled from throughout the table.
%%sql
sqlite:///data/imdbmini.db
%%sql
SELECT *
FROM Title
ORDER BY RANDOM()
LIMIT 10;
tconst | titleType | primaryTitle | originalTitle | isAdult | startYear | endYear | runtimeMinutes | genres |
---|---|---|---|---|---|---|---|---|
99077 | movie | Awakenings | Awakenings | 0 | 1990 | None | 121 | Biography,Drama |
1023481 | movie | Step Up 2: the Streets | Step Up 2: The Streets | 0 | 2008 | None | 98 | Drama,Music,Romance |
5640450 | movie | ARQ | ARQ | 0 | 2016 | None | 88 | Action,Sci-Fi,Thriller |
1265990 | movie | The Roommate | The Roommate | 0 | 2011 | None | 94 | Thriller |
116209 | movie | The English Patient | The English Patient | 0 | 1996 | None | 162 | Drama,Romance,War |
111257 | movie | Speed | Speed | 0 | 1994 | None | 116 | Action,Adventure,Thriller |
110989 | movie | Richie Rich | Ri¢hie Ri¢h | 0 | 1994 | None | 95 | Comedy,Family |
86993 | movie | The Bounty | The Bounty | 0 | 1984 | None | 132 | Adventure,Drama,History |
313443 | movie | Out of Time | Out of Time | 0 | 2003 | None | 105 | Crime,Drama,Mystery |
1883092 | tvMiniSeries | Generation War | Unsere Mütter, unsere Väter | 0 | 2013 | 2013 | 270 | Action,Drama,History |
ResultSet
: to convert to pandas, call .DataFrame()
or to polars, call .PolarsDataFrame()
LIKE
¶%%sql
SELECT titleType, primaryTitle
FROM Title
WHERE primaryTitle LIKE "%Star Wars%"
titleType | primaryTitle |
---|---|
movie | Star Wars: Episode IV - A New Hope |
movie | Star Wars: Episode V - The Empire Strikes Back |
movie | Star Wars: Episode VI - Return of the Jedi |
movie | Star Wars: Episode I - The Phantom Menace |
movie | Star Wars: Episode II - Attack of the Clones |
movie | Star Wars: Episode III - Revenge of the Sith |
tvSeries | Star Wars: Clone Wars |
tvSeries | Star Wars: The Clone Wars |
movie | Star Wars: The Clone Wars |
movie | Star Wars: Episode VII - The Force Awakens |
ResultSet
: to convert to pandas, call .DataFrame()
or to polars, call .PolarsDataFrame()
%%sql
SELECT titleType, primaryTitle
FROM Title
WHERE primaryTitle LIKE "Harry Potter and the Deathly Hallows: Part _"
titleType | primaryTitle |
---|---|
movie | Harry Potter and the Deathly Hallows: Part 1 |
movie | Harry Potter and the Deathly Hallows: Part 2 |
ResultSet
: to convert to pandas, call .DataFrame()
or to polars, call .PolarsDataFrame()
CAST
¶%%sql
SELECT primaryTitle, CAST(runtimeMinutes AS INT)
FROM Title
LIMIT 10;
primaryTitle | CAST(runtimeMinutes AS INT) |
---|---|
A Trip to the Moon | 13 |
The Birth of a Nation | 195 |
The Cabinet of Dr. Caligari | 76 |
The Kid | 68 |
Nosferatu | 94 |
Sherlock Jr. | 45 |
Battleship Potemkin | 75 |
The Gold Rush | 95 |
Metropolis | 153 |
The General | 67 |
ResultSet
: to convert to pandas, call .DataFrame()
or to polars, call .PolarsDataFrame()
CASE
¶Here, we return a random order so we can see the various movie ages (otherwise, the top few entries happen to all be old movies).
%%sql
SELECT titleType, startYear,
CASE WHEN startYear < 1950 THEN "old"
WHEN startYear < 2000 THEN "mid-aged"
ELSE "new"
END AS movie_age
FROM Title
ORDER BY RANDOM()
LIMIT 10;
titleType | startYear | movie_age |
---|---|---|
movie | 2015 | new |
movie | 1989 | mid-aged |
tvSeries | 2010 | new |
tvSeries | 2013 | new |
tvSeries | 1989 | mid-aged |
tvMovie | 2013 | new |
movie | 2004 | new |
movie | 2017 | new |
movie | 2002 | new |
movie | 1981 | mid-aged |
ResultSet
: to convert to pandas, call .DataFrame()
or to polars, call .PolarsDataFrame()
We combine data from multiple tables by performing a join. We will explore joins using the cats database, which includes two tables: s
and t
.
%%sql
sqlite:///data/basic_examples.db
%%sql
SELECT * FROM s;
id | name |
---|---|
0 | Apricot |
1 | Boots |
2 | Cally |
4 | Eugene |
ResultSet
: to convert to pandas, call .DataFrame()
or to polars, call .PolarsDataFrame()
%%sql
SELECT * FROM t;
id | breed |
---|---|
1 | persian |
2 | ragdoll |
4 | bengal |
5 | persian |
ResultSet
: to convert to pandas, call .DataFrame()
or to polars, call .PolarsDataFrame()
%%sql
SELECT s.id, name, breed
FROM s INNER JOIN t ON s.id = t.id;
id | name | breed |
---|---|---|
1 | Boots | persian |
2 | Cally | ragdoll |
4 | Eugene | bengal |
ResultSet
: to convert to pandas, call .DataFrame()
or to polars, call .PolarsDataFrame()
By default, JOIN
ing without specifying a join type will default to an inner join.
%%sql
SELECT s.id, name, breed
FROM s JOIN t ON s.id = t.id;
id | name | breed |
---|---|---|
1 | Boots | persian |
2 | Cally | ragdoll |
4 | Eugene | bengal |
ResultSet
: to convert to pandas, call .DataFrame()
or to polars, call .PolarsDataFrame()
%%sql
SELECT *
FROM s CROSS JOIN t;
id | name | id_1 | breed |
---|---|---|---|
0 | Apricot | 1 | persian |
0 | Apricot | 2 | ragdoll |
0 | Apricot | 4 | bengal |
0 | Apricot | 5 | persian |
1 | Boots | 1 | persian |
1 | Boots | 2 | ragdoll |
1 | Boots | 4 | bengal |
1 | Boots | 5 | persian |
2 | Cally | 1 | persian |
2 | Cally | 2 | ragdoll |
ResultSet
: to convert to pandas, call .DataFrame()
or to polars, call .PolarsDataFrame()
Equivalently, a cross join can be performed using the following syntax:
%%sql
SELECT *
FROM s, t;
id | name | id_1 | breed |
---|---|---|---|
0 | Apricot | 1 | persian |
0 | Apricot | 2 | ragdoll |
0 | Apricot | 4 | bengal |
0 | Apricot | 5 | persian |
1 | Boots | 1 | persian |
1 | Boots | 2 | ragdoll |
1 | Boots | 4 | bengal |
1 | Boots | 5 | persian |
2 | Cally | 1 | persian |
2 | Cally | 2 | ragdoll |
ResultSet
: to convert to pandas, call .DataFrame()
or to polars, call .PolarsDataFrame()
Conceptually, an inner join is equivalent to a cross join where irrelevant rows are removed.
%%sql
SELECT *
FROM s CROSS JOIN t
WHERE s.id = t.id;
id | name | id_1 | breed |
---|---|---|---|
1 | Boots | 1 | persian |
2 | Cally | 2 | ragdoll |
4 | Eugene | 4 | bengal |
ResultSet
: to convert to pandas, call .DataFrame()
or to polars, call .PolarsDataFrame()
%%sql
SELECT *
FROM s LEFT JOIN t ON s.id = t.id;
id | name | id_1 | breed |
---|---|---|---|
0 | Apricot | None | None |
1 | Boots | 1 | persian |
2 | Cally | 2 | ragdoll |
4 | Eugene | 4 | bengal |
ResultSet
: to convert to pandas, call .DataFrame()
or to polars, call .PolarsDataFrame()
%%sql
SELECT *
FROM s RIGHT JOIN t ON s.id = t.id;
id | name | id_1 | breed |
---|---|---|---|
1 | Boots | 1 | persian |
2 | Cally | 2 | ragdoll |
4 | Eugene | 4 | bengal |
None | None | 5 | persian |
ResultSet
: to convert to pandas, call .DataFrame()
or to polars, call .PolarsDataFrame()
%%sql
SELECT *
FROM s FULL JOIN t ON s.id = t.id;
id | name | id_1 | breed |
---|---|---|---|
0 | Apricot | None | None |
1 | Boots | 1 | persian |
2 | Cally | 2 | ragdoll |
4 | Eugene | 4 | bengal |
None | None | 5 | persian |
ResultSet
: to convert to pandas, call .DataFrame()
or to polars, call .PolarsDataFrame()
Let's return to the IMDB dataset. Now, we'll consider two tables: Title
and Rating
.
%%sql
sqlite:///data/imdbmini.db
When working with tables that have long names, we often create an alias using the AS
keyword (much like we did with columns in the previous lecture). This makes it easier to reference these tables when performing a join.
%%sql
SELECT primaryTitle, averageRating
FROM Title AS T INNER JOIN Rating AS R
ON T.tconst = R.tconst;
primaryTitle | averageRating |
---|---|
A Trip to the Moon | 8.2 |
The Birth of a Nation | 6.3 |
The Cabinet of Dr. Caligari | 8.1 |
The Kid | 8.3 |
Nosferatu | 7.9 |
Sherlock Jr. | 8.2 |
Battleship Potemkin | 8.0 |
The Gold Rush | 8.2 |
Metropolis | 8.3 |
The General | 8.1 |
ResultSet
: to convert to pandas, call .DataFrame()
or to polars, call .PolarsDataFrame()
The AS
keyword is optional – omitting it will still create an alias. We typically include AS
to make our queries more readable.
%%sql
SELECT primaryTitle, averageRating
FROM Title T INNER JOIN Rating R
ON T.tconst = R.tconst;
primaryTitle | averageRating |
---|---|
A Trip to the Moon | 8.2 |
The Birth of a Nation | 6.3 |
The Cabinet of Dr. Caligari | 8.1 |
The Kid | 8.3 |
Nosferatu | 7.9 |
Sherlock Jr. | 8.2 |
Battleship Potemkin | 8.0 |
The Gold Rush | 8.2 |
Metropolis | 8.3 |
The General | 8.1 |
ResultSet
: to convert to pandas, call .DataFrame()
or to polars, call .PolarsDataFrame()
Referencing columns using the full or aliased table name is important to avoid ambiguity. Suppose the tables we are trying to join both include a column with the same name, like the tconst
columns present in both the Title
and Rating
tables of the IMDB database. If we do not specify which table's column we wish to reference, SQL will not be able to process our query.
In the cell below, it is unclear if we are referring to the tconst
column from the Title
table or the tconst
column from the Rating
table. SQL errors.
%%sql
SELECT primaryTitle, averageRating
FROM Title AS T INNER JOIN Rating AS R
ON tconst = tconst;
Refer to columns in the format table_alias.column_name
to avoid ambiguity.
%%sql
SELECT primaryTitle, averageRating
FROM Title AS T INNER JOIN Rating AS R
ON T.tconst = R.tconst;
primaryTitle | averageRating |
---|---|
A Trip to the Moon | 8.2 |
The Birth of a Nation | 6.3 |
The Cabinet of Dr. Caligari | 8.1 |
The Kid | 8.3 |
Nosferatu | 7.9 |
Sherlock Jr. | 8.2 |
Battleship Potemkin | 8.0 |
The Gold Rush | 8.2 |
Metropolis | 8.3 |
The General | 8.1 |
ResultSet
: to convert to pandas, call .DataFrame()
or to polars, call .PolarsDataFrame()
Let's perform our first data science task in SQL – we'll explore the IMDB dataset in further detail.
First, let's understand what tables are present in the database.
%%sql
sqlite:///data/imdbmini.db
%%sql
SELECT tbl_name, sql FROM sqlite_master WHERE type='table'
tbl_name | sql |
---|---|
Title | CREATE TABLE "Title" ( "tconst" INTEGER, "titleType" TEXT, "primaryTitle" TEXT, "originalTitle" TEXT, "isAdult" TEXT, "startYear" TEXT, "endYear" TEXT, "runtimeMinutes" TEXT, "genres" TEXT ) |
Name | CREATE TABLE "Name" ( "nconst" INTEGER, "primaryName" TEXT, "birthYear" TEXT, "deathYear" TEXT, "primaryProfession" TEXT ) |
Role | CREATE TABLE "Role" ( tconst INTEGER, ordering TEXT, nconst INTEGER, category TEXT, job TEXT, characters TEXT ) |
Rating | CREATE TABLE "Rating" ( tconst INTEGER, averageRating TEXT, numVotes TEXT ) |
ResultSet
: to convert to pandas, call .DataFrame()
or to polars, call .PolarsDataFrame()
Let's take a closer look at the Title
table. Below, we preview a few rows of interest.
%%sql
SELECT *
FROM Title
WHERE primaryTitle IN ("Ginny & Georgia", "What If...?", "Succession", "Veep", "Tenet")
LIMIT 10;
tconst | titleType | primaryTitle | originalTitle | isAdult | startYear | endYear | runtimeMinutes | genres |
---|---|---|---|---|---|---|---|---|
10168312 | tvSeries | What If...? | What If...? | 0 | 2021 | None | None | Action,Adventure,Animation |
10813940 | tvSeries | Ginny & Georgia | Ginny & Georgia | 0 | 2021 | None | None | Comedy,Drama |
1759761 | tvSeries | Veep | Veep | 0 | 2012 | 2019 | 28 | Comedy |
6723592 | movie | Tenet | Tenet | 0 | 2020 | None | 150 | Action,Sci-Fi,Thriller |
7660850 | tvSeries | Succession | Succession | 0 | 2018 | None | 60 | Drama |
ResultSet
: to convert to pandas, call .DataFrame()
or to polars, call .PolarsDataFrame()
There are a few issues with how this data is stored. Can you spot them?
We see some rows with missing runtimeMinutes
. This suggests malformed data. We could drop these rows in pandas
using []
or .dropna()
.
We also note that the time and year columns are currently given in string format (see the schema listed earlier), whereas we probably want them in the format. Again, we could do this conversion in pandas
using pd.to_numeric
or .astype
.
Instead, let's see how we can can do these fixes in SQL.
%%sql
SELECT tconst AS id,
primaryTitle AS title,
titleType,
CAST(runtimeMinutes AS int) AS time,
CAST(startYear AS int) AS year
FROM Title
WHERE time > 0 AND -- get rid of outliers and missing values --
year > 0 -- get rid of missing values --
id | title | titleType | time | year |
---|---|---|---|---|
417 | A Trip to the Moon | short | 13 | 1902 |
4972 | The Birth of a Nation | movie | 195 | 1915 |
10323 | The Cabinet of Dr. Caligari | movie | 76 | 1920 |
12349 | The Kid | movie | 68 | 1921 |
13442 | Nosferatu | movie | 94 | 1922 |
15324 | Sherlock Jr. | movie | 45 | 1924 |
15648 | Battleship Potemkin | movie | 75 | 1925 |
15864 | The Gold Rush | movie | 95 | 1925 |
17136 | Metropolis | movie | 153 | 1927 |
17925 | The General | movie | 67 | 1926 |
ResultSet
: to convert to pandas, call .DataFrame()
or to polars, call .PolarsDataFrame()
Perhaps we're interested in only action movies. We can refine our query results futher.
%%sql
SELECT tconst AS id,
primaryTitle AS title,
titleType,
CAST(runtimeMinutes AS int) AS time,
CAST(startYear AS int) AS year
FROM Title
WHERE time > 0 AND -- get rid of outliers and missing values --
year > 0 -- get rid of missing values --
AND titleType = "movie"
AND genres LIKE "%Action%";
id | title | titleType | time | year |
---|---|---|---|---|
15324 | Sherlock Jr. | movie | 45 | 1924 |
17925 | The General | movie | 67 | 1926 |
23427 | Scarface | movie | 93 | 1932 |
29843 | The Adventures of Robin Hood | movie | 102 | 1938 |
40506 | Key Largo | movie | 100 | 1948 |
40724 | Red River | movie | 133 | 1948 |
42041 | White Heat | movie | 114 | 1949 |
46534 | The War of the Worlds | movie | 85 | 1953 |
47034 | Godzilla | movie | 96 | 1954 |
47478 | Seven Samurai | movie | 207 | 1954 |
ResultSet
: to convert to pandas, call .DataFrame()
or to polars, call .PolarsDataFrame()
If we want to know the critic ratings for each of these action movies, we can join Title
with the Rating
table.
In the cell below, we use the <<
operator to store the result of our query as a variable.
%%sql action_movies_sql <<
SELECT T.tconst AS id,
primaryTitle AS title,
titleType,
CAST(runtimeMinutes AS int) AS time,
CAST(startYear AS int) AS year,
CAST(averageRating AS float) AS rating
FROM Title AS T INNER JOIN Rating AS R ON T.tconst = R.tconst
WHERE time > 0 AND -- get rid of outliers and missing values --
year > 0 -- get rid of missing values --
AND titleType = "movie"
AND genres LIKE "%Action%";
Let's look at what we got.
action_movies_sql
id | title | titleType | time | year | rating |
---|---|---|---|---|---|
15324 | Sherlock Jr. | movie | 45 | 1924 | 8.2 |
17925 | The General | movie | 67 | 1926 | 8.1 |
23427 | Scarface | movie | 93 | 1932 | 7.8 |
29843 | The Adventures of Robin Hood | movie | 102 | 1938 | 7.9 |
40506 | Key Largo | movie | 100 | 1948 | 7.8 |
40724 | Red River | movie | 133 | 1948 | 7.8 |
42041 | White Heat | movie | 114 | 1949 | 8.1 |
46534 | The War of the Worlds | movie | 85 | 1953 | 7.1 |
47034 | Godzilla | movie | 96 | 1954 | 7.6 |
47478 | Seven Samurai | movie | 207 | 1954 | 8.6 |
ResultSet
: to convert to pandas, call .DataFrame()
or to polars, call .PolarsDataFrame()
A typical workflow for interacting with databases is:
pandas
We use .DataFrame()
to convert our SQL table action_movies
into a pandas
DataFrame.
action_movies_df = action_movies_sql.DataFrame()
action_movies_df
id | title | titleType | time | year | rating | |
---|---|---|---|---|---|---|
0 | 15324 | Sherlock Jr. | movie | 45 | 1924 | 8.2 |
1 | 17925 | The General | movie | 67 | 1926 | 8.1 |
2 | 23427 | Scarface | movie | 93 | 1932 | 7.8 |
3 | 29843 | The Adventures of Robin Hood | movie | 102 | 1938 | 7.9 |
4 | 40506 | Key Largo | movie | 100 | 1948 | 7.8 |
... | ... | ... | ... | ... | ... | ... |
1715 | 9243946 | El Camino: A Breaking Bad Movie | movie | 122 | 2019 | 7.3 |
1716 | 9252468 | Mosul | movie | 86 | 2019 | 7.2 |
1717 | 9376612 | Shang-Chi and the Legend of the Ten Rings | movie | 132 | 2021 | 7.9 |
1718 | 9691136 | Shadow in the Cloud | movie | 83 | 2020 | 4.9 |
1719 | 9777666 | The Tomorrow War | movie | 138 | 2021 | 6.6 |
1720 rows × 6 columns
Now, we can use our usual tools – pandas
, numpy
, matplotlib
, and seaborn
– to continue our work!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn.linear_model as lm
How do ratings vary through the years?
sns.lineplot(data=action_movies_df, x="year", y="rating");
How does the runtime of an action movie influence the rating it receives?
sns.scatterplot(data=action_movies_df, x="time", y="rating");
# Apply transformations to linearize the data
sns.scatterplot(x=np.log(action_movies_df["time"]), y=action_movies_df["rating"]**3);
Could we use a movie's runtime to predict its rating?
X, Y = np.log(action_movies_df[["time"]]), action_movies_df["rating"]**3
model = lm.LinearRegression()
model.fit(X, Y)
xs = np.linspace(4, 5.5)
plt.plot(xs, model.predict(xs[:, np.newaxis]), c="tab:red")
sns.scatterplot(x=np.log(action_movies_df["time"]), y=action_movies_df["rating"]**3);
/srv/conda/envs/notebook/lib/python3.11/site-packages/sklearn/base.py:439: UserWarning: X does not have valid feature names, but LinearRegression was fitted with feature names warnings.warn(