SQL continued¶

In [1]:
import pandas as pd
import plotly.express as px

Introducing sqlalchemy¶

Last time, we saw how to use the magic %%sql command to run SQL code in a Jupyter notebook.

In [2]:
%load_ext sql
In [3]:
%%sql
sqlite:///data/lec18_basic_examples.db
In [4]:
%%sql
SELECT type, MAX(cost)
FROM Dish
GROUP BY type;
 * sqlite:///data/lec18_basic_examples.db
Done.
Out[4]:
type MAX(cost)
appetizer 4
dessert 5
entree 10

Let's see how we can use the sqlalchemy library.

Unlike the %%sql magic command, sqlalchemy creates a connection object that can be used as the second argument to the pandas.read_sql command. An example is shown below.

In [5]:
import sqlalchemy
# create a SQL Alchemy connection to the database
engine = sqlalchemy.create_engine("sqlite:///data/lec18_basic_examples.db")
connection = engine.connect()
pd.read_sql("""
SELECT type, MAX(cost)
FROM Dish
GROUP BY type;""", connection)
Out[5]:
type MAX(cost)
0 appetizer 4
1 dessert 5
2 entree 10

This workflow is handy since it lets you work with the data just as you have so far in DS100.

In [6]:
df = pd.read_sql("""
SELECT type, MAX(cost)
FROM Dish
GROUP BY type;""", connection)
In [7]:
px.bar(df, x = "type", y = "MAX(cost)")

One downside is that syntax errors in your queries become harder to diagnose: when a pd.read_sql call fails due to a malformed query, the resulting stack trace is much longer and more complicated. Run the first cell below, then uncomment and run the second, for a demonstration.

In [8]:
%%sql
SELECT lightning FROM Dish
 * sqlite:///data/lec18_basic_examples.db
(sqlite3.OperationalError) no such column: lightning
[SQL: SELECT lightning FROM Dish]
(Background on this error at: https://sqlalche.me/e/14/e3q8)
In [9]:
#pd.read_sql("""SELECT lightning FROM Dish""", connection)
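Another way to see only the database's own message, without changing any IPython settings, is to catch the exception around the pd.read_sql call yourself. A minimal sketch against a throwaway in-memory SQLite database (the stand-in Dish schema here is assumed):

```python
import sqlite3

import pandas as pd

# throwaway in-memory database with a stand-in Dish table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Dish (name TEXT, type TEXT, cost INT)")

msg = ""
try:
    pd.read_sql("SELECT lightning FROM Dish", conn)
except Exception as e:
    # pandas wraps the underlying sqlite3 error; str(e) keeps the useful part
    msg = str(e)
print(msg)
```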

You can make the default IPython errors much shorter by using the %xmode magic:

In [10]:
%xmode minimal
Exception reporting mode: Minimal

This leads to much more concise errors, even for pandas:

In [11]:
pd.read_sql("""SELECT lightning FROM Dish""", connection)
OperationalError: no such column: lightning


The above exception was the direct cause of the following exception:

OperationalError: (sqlite3.OperationalError) no such column: lightning
[SQL: SELECT lightning FROM Dish]
(Background on this error at: https://sqlalche.me/e/14/e3q8)

Using %%sql the smart way¶

While it's good for you to know about SQLAlchemy, in general you can get pretty much everything done via the SQL magic, including clean interactions with pandas. For this, recall from the previous lecture two key features of the syntax, << and {var}, which are your friends:

In [12]:
query = """
SELECT * FROM Dish
LIMIT {limit}
"""
In [13]:
lim = 5
In [14]:
%%sql res <<
{query.format(limit=lim)}
 * sqlite:///data/lec18_basic_examples.db
Done.
Returning data to local variable res
In [15]:
res.sql
Out[15]:
'SELECT * FROM Dish\nLIMIT 5'
In [16]:
res.DataFrame()
Out[16]:
name type cost
0 ravioli entree 10
1 pork bun entree 7
2 taco entree 7
3 edamame appetizer 4
4 fries appetizer 4

Using the << syntax to get your SQL output into Python variables, .DataFrame() to convert them into Pandas DataFrames as needed, and potentially taking advantage of the {expression} syntax to manipulate your queries, gives you an extremely powerful toolbox.
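The same templating idea also works in plain Python. A minimal sketch of the pattern with an in-memory sqlite3 database (the toy rows are assumed), using str.format to fill in the query and pandas to get a DataFrame back:

```python
import sqlite3

import pandas as pd

# in-memory stand-in for the Dish database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Dish (name TEXT, type TEXT, cost INT)")
conn.executemany(
    "INSERT INTO Dish VALUES (?, ?, ?)",
    [("ravioli", "entree", 10), ("taco", "entree", 7), ("fries", "appetizer", 4)],
)

# the {limit} placeholder is filled in with str.format, as with the magic above
query = "SELECT * FROM Dish LIMIT {limit}"
df = pd.read_sql(query.format(limit=2), conn)
print(df.shape)  # (2, 3)
```

Note that str.format splices text directly into the query, so it is only appropriate for trusted values; for anything user-supplied, prefer the driver's parameter placeholders.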

IMDB Example¶

So far, we've only used SQL on tiny toy datasets. Let's try a real-world dataset. Run the cell below. This will probably take some time, as the files from IMDB are quite large.

In [17]:
from os.path import exists

# From https://www.imdb.com/interfaces/
from ds100_utils import fetch_and_cache
data_directory = './data'
fetch_and_cache('https://datasets.imdbws.com/title.basics.tsv.gz', 'titles.tsv.gz', data_directory)
fetch_and_cache('https://datasets.imdbws.com/name.basics.tsv.gz', 'names.tsv.gz', data_directory)
if not exists(f"{data_directory}/titles.tsv"):
    !gunzip -kf {data_directory}/titles.tsv.gz
    !gunzip -kf {data_directory}/names.tsv.gz
    
!ls -lh {data_directory}
Using cached version that was downloaded (UTC): Thu Nov  3 06:09:18 2022
Using cached version that was downloaded (UTC): Thu Nov  3 06:09:20 2022
total 3.4G
-rw-r--r-- 1 jovyan jovyan 1.6G Nov  3 06:13 imdb.db
-rw-r--r-- 1 jovyan jovyan  28K Sep 14 22:37 lec18_basic_examples.db
-rw-r--r-- 1 jovyan jovyan 688M Nov  3 06:09 names.tsv
-rw-r--r-- 1 jovyan jovyan 225M Nov  3 06:09 names.tsv.gz
-rw-r--r-- 1 jovyan jovyan 764M Nov  3 06:09 titles.tsv
-rw-r--r-- 1 jovyan jovyan 156M Nov  3 06:09 titles.tsv.gz

These files are too big for pandas to handle, at least on the datahub machines.
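If you did need to process a file of this size in pandas, one workaround is reading it in pieces with the chunksize argument, which yields DataFrames one chunk at a time instead of loading everything into memory at once. A minimal sketch on a tiny in-memory TSV (the toy rows are assumed):

```python
import io

import pandas as pd

# tiny stand-in for names.tsv; the real file is hundreds of MB
tsv = "nconst\tprimaryName\nnm1\tPerson A\nnm2\tPerson B\nnm3\tPerson C\n"

# chunksize=2 makes read_csv yield DataFrames of up to 2 rows each
total = 0
for chunk in pd.read_csv(io.StringIO(tsv), sep="\t", chunksize=2):
    total += len(chunk)
print(total)  # 3
```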

In [18]:
# Running the line below will probably crash your kernel because the names 
# dataset is too big to comfortably fit into main memory. 
#pd.read_csv(f'{data_directory}/names.tsv', sep='\t')

By contrast, SQL can handle them just fine. However, the data is not yet in .db format. To convert it, run the code below. We will not discuss how this command works in class. It will take a minute or longer to complete; once it's done, a file called imdb.db should appear in the data folder.

In [19]:
from os.path import exists

imdb_file_exists = exists('./data/imdb.db')
if not imdb_file_exists:
    !(cd data; sqlite3 imdb.db ".mode tabs" ".import titles.tsv titles" ".import names.tsv names") 2> /dev/null

Now that the file exists, we can see what's in the database.

We'll make both an explicit SQLAlchemy connection and will use the magic so we can compare both approaches:

In [20]:
imdb_db = "sqlite:///data/imdb.db"
engine = sqlalchemy.create_engine(imdb_db)
connection = engine.connect()
In [37]:
%sql $imdb_db
In [39]:
%%sql tables <<
SELECT sql FROM sqlite_master WHERE type='table';
 * sqlite:///data/imdb.db
   sqlite:///data/lec18_basic_examples.db
Done.
Returning data to local variable tables
In [40]:
tables
Out[40]:
sql
CREATE TABLE "titles"(
"tconst" TEXT,
"titleType" TEXT,
"primaryTitle" TEXT,
"originalTitle" TEXT,
"isAdult" TEXT,
"startYear" TEXT,
"endYear" TEXT,
"runtimeMinutes" TEXT,
"genres" TEXT
)
CREATE TABLE "names"(
"nconst" TEXT,
"primaryName" TEXT,
"birthYear" TEXT,
"deathYear" TEXT,
"primaryProfession" TEXT,
"knownForTitles" TEXT
)

We can view each individual entry and access its sql column (by dotted name or with []):

In [41]:
print(tables[0].sql)
CREATE TABLE "titles"(
  "tconst" TEXT,
  "titleType" TEXT,
  "primaryTitle" TEXT,
  "originalTitle" TEXT,
  "isAdult" TEXT,
  "startYear" TEXT,
  "endYear" TEXT,
  "runtimeMinutes" TEXT,
  "genres" TEXT
)
In [42]:
print(tables[1]["sql"])
CREATE TABLE "names"(
  "nconst" TEXT,
  "primaryName" TEXT,
  "birthYear" TEXT,
  "deathYear" TEXT,
  "primaryProfession" TEXT,
  "knownForTitles" TEXT
)
In [43]:
get_10_movies = """
SELECT *
FROM titles
LIMIT 10;
"""
In [44]:
%%sql
$get_10_movies
 * sqlite:///data/imdb.db
   sqlite:///data/lec18_basic_examples.db
Done.
Out[44]:
tconst titleType primaryTitle originalTitle isAdult startYear endYear runtimeMinutes genres
tt0000001 short Carmencita Carmencita 0 1894 \N 1 Documentary,Short
tt0000002 short Le clown et ses chiens Le clown et ses chiens 0 1892 \N 5 Animation,Short
tt0000003 short Pauvre Pierrot Pauvre Pierrot 0 1892 \N 4 Animation,Comedy,Romance
tt0000004 short Un bon bock Un bon bock 0 1892 \N 12 Animation,Short
tt0000005 short Blacksmith Scene Blacksmith Scene 0 1893 \N 1 Comedy,Short
tt0000006 short Chinese Opium Den Chinese Opium Den 0 1894 \N 1 Short
tt0000007 short Corbett and Courtney Before the Kinetograph Corbett and Courtney Before the Kinetograph 0 1894 \N 1 Short,Sport
tt0000008 short Edison Kinetoscopic Record of a Sneeze Edison Kinetoscopic Record of a Sneeze 0 1894 \N 1 Documentary,Short
tt0000009 movie Miss Jerry Miss Jerry 0 1894 \N 45 Romance
tt0000010 short Leaving the Factory La sortie de l'usine Lumière à Lyon 0 1895 \N 1 Documentary,Short

Action Movie EDA¶

The LIKE Keyword and CAST Keyword¶

Suppose we want a list of action movies. We can find them by selecting rows where titleType is "movie" and the genres column contains "Action". For the latter condition, we can use the LIKE keyword.

In [45]:
action_movies_query = """
SELECT tconst AS id,
  primaryTitle AS title, 
  runtimeMinutes AS time, 
  startYear AS year
FROM titles
WHERE titleType = 'movie' AND 
      genres LIKE '%Action%'"""
In [46]:
%%sql action_movies << 
{action_movies_query}
 * sqlite:///data/imdb.db
   sqlite:///data/lec18_basic_examples.db
Done.
Returning data to local variable action_movies
In [47]:
action_movies.DataFrame()
Out[47]:
id title time year
0 tt0000574 The Story of the Kelly Gang 70 1906
1 tt0002574 What Happened to Mary 150 1912
2 tt0003545 Who Will Marry Mary? \N 1913
3 tt0003747 Cameo Kirby 50 1914
4 tt0003897 The Exploits of Elaine 220 1914
... ... ... ... ...
42674 tt9904270 Get Rid of It \N \N
42675 tt9904682 SIUAT \N \N
42676 tt9905492 Midnight Reckoning \N \N
42677 tt9905708 Résilience \N \N
42678 tt9907670 Wanderer in a Business Suit \N 1961

42679 rows × 4 columns

Before moving forward, let's compare what happens if we run the exact same query via Pandas:

In [48]:
pd.read_sql(action_movies_query, connection)
Out[48]:
id title time year
0 tt0000574 The Story of the Kelly Gang 70 1906
1 tt0002574 What Happened to Mary 150 1912
2 tt0003545 Who Will Marry Mary? \N 1913
3 tt0003747 Cameo Kirby 50 1914
4 tt0003897 The Exploits of Elaine 220 1914
... ... ... ... ...
42674 tt9904270 Get Rid of It \N \N
42675 tt9904682 SIUAT \N \N
42676 tt9905492 Midnight Reckoning \N \N
42677 tt9905708 Résilience \N \N
42678 tt9907670 Wanderer in a Business Suit \N 1961

42679 rows × 4 columns

OK! We're in business: we get the same output, so we can trust that both methods behave the same and pick whichever approach we prefer.

I generally find it more convenient to work with the SQL magic, simply converting to DataFrames as needed.

Back to our data...

We see a number of rows containing "\N". These represent values that IMDB is missing. We could drop these rows in pandas using [] notation or .query.

We also note that the time and year columns are currently stored as strings, whereas we probably want them in numeric format. Again, we could do this conversion in pandas using pd.to_numeric.

Lastly, looking at the data, we also see some weird outliers, like "The Hazards of Helen", which is listed as 1,428 minutes long. We of course know how to drop these in pandas.
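For reference, the three pandas fixes just described might look like the following. A minimal sketch on a toy frame (the rows are assumed, apart from the 1,428-minute outlier mentioned above):

```python
import pandas as pd

# toy frame mimicking the raw action_movies output (values assumed for illustration)
raw = pd.DataFrame({
    "title": ["The Story of the Kelly Gang", "Who Will Marry Mary?", "The Hazards of Helen"],
    "time": ["70", r"\N", "1428"],
    "year": ["1906", "1913", "1914"],
})

# 1. drop rows carrying IMDB's "\N" missing-value marker
marker = r"\N"
clean = raw[raw["time"] != marker]

# 2. convert the string columns to numbers
clean = clean.assign(
    time=pd.to_numeric(clean["time"]),
    year=pd.to_numeric(clean["year"]),
)

# 3. drop implausible runtimes
clean = clean[(clean["time"] > 60) & (clean["time"] < 180)]
print(clean["title"].tolist())  # ['The Story of the Kelly Gang']
```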

Instead, let's see how we can do all three fixes in SQL using the CAST keyword.

The CAST keyword converts a table column to another type. In the code below, we convert runtimeMinutes and startYear to int. Any missing or invalid values are replaced by 0.

In [59]:
%%sql action_movies << 
SELECT tconst AS id,
  primaryTitle AS title, 
  CAST(runtimeMinutes AS int) AS time, 
  CAST(startYear AS int) AS year
FROM titles
WHERE genres LIKE '%Action%' AND
  titleType = 'movie' AND
  time > 60 AND time < 180 AND
  year > 0
 * sqlite:///data/imdb.db
   sqlite:///data/lec18_basic_examples.db
Done.
Returning data to local variable action_movies
In [60]:
action_movies = action_movies.DataFrame()
action_movies
Out[60]:
id title time year
0 tt0000574 The Story of the Kelly Gang 70 1906
1 tt0002574 What Happened to Mary 150 1912
2 tt0004223 The Life of General Villa 105 1914
3 tt0004450 Die Pagode 82 1917
4 tt0004635 The Squaw Man 74 1914
... ... ... ... ...
24385 tt9900748 The Robinsons 110 2019
24386 tt9900782 Kaithi 145 2019
24387 tt9900908 Useless Handcuffs 89 1969
24388 tt9901162 The Robinsons 90 2020
24389 tt9904066 Fox Hunting 66 2019

24390 rows × 4 columns

Visualization¶

In [58]:
px.histogram(action_movies, x = "year")
In [53]:
action_movies['time'].groupby(action_movies['year']).mean().plot();
In [54]:
action_movies['decade'] = (action_movies['year'] // 10) * 10
px.box(action_movies, x = 'decade', y = 'time', color = "decade")
#plt.xticks(rotation=45);

Or in seaborn:

In [55]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.boxplot(x = 'decade', y = 'time', data = action_movies)
plt.xticks(rotation=45);