import pandas as pd
import plotly.express as px
Last time, we saw how to use the magic %%sql
command to run SQL code in a Jupyter notebook.
%load_ext sql
%%sql
sqlite:///data/lec18_basic_examples.db
%%sql
SELECT type, MAX(cost)
FROM Dish
GROUP BY type;
* sqlite:///data/lec18_basic_examples.db Done.
type | MAX(cost) |
---|---|
appetizer | 4 |
dessert | 5 |
entree | 10 |
Let's see how we can use the sqlalchemy
library.
Unlike the %%sql
magic command, sqlalchemy creates a connection
object that can be used as the second argument to the pandas.read_sql
command. An example is shown below.
import sqlalchemy
# create a SQL Alchemy connection to the database
engine = sqlalchemy.create_engine("sqlite:///data/lec18_basic_examples.db")
connection = engine.connect()
pd.read_sql("""
SELECT type, MAX(cost)
FROM Dish
GROUP BY type;""", connection)
type | MAX(cost) | |
---|---|---|
0 | appetizer | 4 |
1 | dessert | 5 |
2 | entree | 10 |
This workflow is handy since it lets you work with query results using the pandas tools you're used to from DS100 so far.
df = pd.read_sql("""
SELECT type, MAX(cost)
FROM Dish
GROUP BY type;""", connection)
px.bar(df, x = "type", y = "MAX(cost)")
One downside is that syntax errors in your queries become harder to diagnose: when a pd.read_sql
call fails due to a malformed query, the resulting "stack trace" is much longer and more complicated looking. Try uncommenting and running the two cells below for a demonstration.
%%sql
SELECT lightning FROM Dish
* sqlite:///data/lec18_basic_examples.db (sqlite3.OperationalError) no such column: lightning [SQL: SELECT lightning FROM Dish] (Background on this error at: https://sqlalche.me/e/14/e3q8)
#pd.read_sql("""SELECT lightning FROM Dish""", connection)
Now, you can make the default IPython errors much shorter by using the xmode
magic:
%xmode minimal
Exception reporting mode: Minimal
This now leads to much more concise errors, even for pandas:
pd.read_sql("""SELECT lightning FROM Dish""", connection)
OperationalError: no such column: lightning The above exception was the direct cause of the following exception: OperationalError: (sqlite3.OperationalError) no such column: lightning [SQL: SELECT lightning FROM Dish] (Background on this error at: https://sqlalche.me/e/14/e3q8)
%%sql the smart way

While it's good for you to know about SQLAlchemy, in general you can get pretty much everything done via the SQL magic, including clean interactions with pandas. For this, recall from the previous lecture two key features of the syntax: <<
and {var}
, which are your friends:
query = """
SELECT * FROM Dish
LIMIT {limit}
"""
lim = 5
%%sql res <<
{query.format(limit=lim)}
* sqlite:///data/lec18_basic_examples.db Done. Returning data to local variable res
res.sql
'SELECT * FROM Dish\nLIMIT 5'
res.DataFrame()
name | type | cost | |
---|---|---|---|
0 | ravioli | entree | 10 |
1 | pork bun | entree | 7 |
2 | taco | entree | 7 |
3 | edamame | appetizer | 4 |
4 | fries | appetizer | 4 |
Using the <<
syntax to get your SQL output into Python variables, .DataFrame()
to convert results into pandas DataFrames as needed, and the {expression}
syntax to manipulate your queries gives you an extremely powerful toolbox.
So far, we've only used SQL on tiny toy datasets. Let's try using a real-world dataset. Run the cell below. This will probably take some time, as the files from IMDb are quite large.
from os.path import exists
# From https://www.imdb.com/interfaces/
from ds100_utils import fetch_and_cache
data_directory = './data'
fetch_and_cache('https://datasets.imdbws.com/title.basics.tsv.gz', 'titles.tsv.gz', data_directory)
fetch_and_cache('https://datasets.imdbws.com/name.basics.tsv.gz', 'names.tsv.gz', data_directory)
if not exists(f"{data_directory}/titles.tsv"):
!gunzip -kf {data_directory}/titles.tsv.gz
!gunzip -kf {data_directory}/names.tsv.gz
!ls -lh {data_directory}
Using cached version that was downloaded (UTC): Thu Nov 3 06:09:18 2022
Using cached version that was downloaded (UTC): Thu Nov 3 06:09:20 2022
total 3.4G
-rw-r--r-- 1 jovyan jovyan 1.6G Nov 3 06:13 imdb.db
-rw-r--r-- 1 jovyan jovyan 28K Sep 14 22:37 lec18_basic_examples.db
-rw-r--r-- 1 jovyan jovyan 688M Nov 3 06:09 names.tsv
-rw-r--r-- 1 jovyan jovyan 225M Nov 3 06:09 names.tsv.gz
-rw-r--r-- 1 jovyan jovyan 764M Nov 3 06:09 titles.tsv
-rw-r--r-- 1 jovyan jovyan 156M Nov 3 06:09 titles.tsv.gz
These files are too big for pandas to handle, at least on the datahub machines.
# Running the line below will probably crash your kernel because the names
# dataset is too big to comfortably fit into main memory.
#pd.read_csv(f'{data_directory}/names.tsv', sep='\t')
By contrast, SQL can work with them just fine. However, the data is not yet in .db
format. To convert it to .db
format, run the code below. We will not be discussing how this command works in class. It will take a minute or longer to complete. Once it's done, a file called imdb.db
should appear in the data folder.
from os.path import exists
imdb_file_exists = exists('./data/imdb.db')
if not imdb_file_exists:
!(cd data; sqlite3 imdb.db ".mode tabs" ".import titles.tsv titles" ".import names.tsv names") 2> /dev/null
Now that the file exists, we can see what's in the database.
We'll make both an explicit SQLAlchemy connection and use the magic, so we can compare both approaches:
imdb_db = "sqlite:///data/imdb.db"
engine = sqlalchemy.create_engine(imdb_db)
connection = engine.connect()
%sql $imdb_db
%%sql tables <<
SELECT sql FROM sqlite_master WHERE type='table';
* sqlite:///data/imdb.db sqlite:///data/lec18_basic_examples.db Done. Returning data to local variable tables
tables
sql |
---|
CREATE TABLE "titles"( "tconst" TEXT, "titleType" TEXT, "primaryTitle" TEXT, "originalTitle" TEXT, "isAdult" TEXT, "startYear" TEXT, "endYear" TEXT, "runtimeMinutes" TEXT, "genres" TEXT ) |
CREATE TABLE "names"( "nconst" TEXT, "primaryName" TEXT, "birthYear" TEXT, "deathYear" TEXT, "primaryProfession" TEXT, "knownForTitles" TEXT ) |
We can view each individual entry and access its sql
column (by dotted name or with []
):
print(tables[0].sql)
CREATE TABLE "titles"( "tconst" TEXT, "titleType" TEXT, "primaryTitle" TEXT, "originalTitle" TEXT, "isAdult" TEXT, "startYear" TEXT, "endYear" TEXT, "runtimeMinutes" TEXT, "genres" TEXT )
print(tables[1]["sql"])
CREATE TABLE "names"( "nconst" TEXT, "primaryName" TEXT, "birthYear" TEXT, "deathYear" TEXT, "primaryProfession" TEXT, "knownForTitles" TEXT )
get_10_movies = """
SELECT *
FROM titles
LIMIT 10;
"""
%%sql
$get_10_movies
* sqlite:///data/imdb.db sqlite:///data/lec18_basic_examples.db Done.
tconst | titleType | primaryTitle | originalTitle | isAdult | startYear | endYear | runtimeMinutes | genres |
---|---|---|---|---|---|---|---|---|
tt0000001 | short | Carmencita | Carmencita | 0 | 1894 | \N | 1 | Documentary,Short |
tt0000002 | short | Le clown et ses chiens | Le clown et ses chiens | 0 | 1892 | \N | 5 | Animation,Short |
tt0000003 | short | Pauvre Pierrot | Pauvre Pierrot | 0 | 1892 | \N | 4 | Animation,Comedy,Romance |
tt0000004 | short | Un bon bock | Un bon bock | 0 | 1892 | \N | 12 | Animation,Short |
tt0000005 | short | Blacksmith Scene | Blacksmith Scene | 0 | 1893 | \N | 1 | Comedy,Short |
tt0000006 | short | Chinese Opium Den | Chinese Opium Den | 0 | 1894 | \N | 1 | Short |
tt0000007 | short | Corbett and Courtney Before the Kinetograph | Corbett and Courtney Before the Kinetograph | 0 | 1894 | \N | 1 | Short,Sport |
tt0000008 | short | Edison Kinetoscopic Record of a Sneeze | Edison Kinetoscopic Record of a Sneeze | 0 | 1894 | \N | 1 | Documentary,Short |
tt0000009 | movie | Miss Jerry | Miss Jerry | 0 | 1894 | \N | 45 | Romance |
tt0000010 | short | Leaving the Factory | La sortie de l'usine Lumière à Lyon | 0 | 1895 | \N | 1 | Documentary,Short |
Suppose we want a list of action movies. We can do this by finding rows where the titleType is "movie" and the genres column contains "Action". To do this, we can use the LIKE keyword.
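Before running it on the real table, here is how LIKE's % wildcard behaves, sketched with Python's built-in sqlite3 module on a toy table (the table name and values here are made up purely for illustration):

```python
import sqlite3

# Hypothetical in-memory table, not the IMDb data
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (genres TEXT)")
conn.executemany("INSERT INTO demo VALUES (?)",
                 [("Action,Comedy",), ("Drama",), ("Romance,Action",)])

# '%' matches any sequence of characters (including none),
# so '%Action%' matches 'Action' appearing anywhere in the string
rows = conn.execute(
    "SELECT genres FROM demo WHERE genres LIKE '%Action%'").fetchall()
print(rows)  # the first and third rows match
```

Note that in SQLite, LIKE is case-insensitive for ASCII characters by default, so '%action%' would match these rows as well.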
action_movies_query = """
SELECT tconst AS id,
primaryTitle AS title,
runtimeMinutes AS time,
startYear AS year
FROM titles
WHERE titleType = 'movie' AND
genres LIKE '%Action%'"""
%%sql action_movies <<
{action_movies_query}
* sqlite:///data/imdb.db sqlite:///data/lec18_basic_examples.db Done. Returning data to local variable action_movies
action_movies.DataFrame()
id | title | time | year | |
---|---|---|---|---|
0 | tt0000574 | The Story of the Kelly Gang | 70 | 1906 |
1 | tt0002574 | What Happened to Mary | 150 | 1912 |
2 | tt0003545 | Who Will Marry Mary? | \N | 1913 |
3 | tt0003747 | Cameo Kirby | 50 | 1914 |
4 | tt0003897 | The Exploits of Elaine | 220 | 1914 |
... | ... | ... | ... | ... |
42674 | tt9904270 | Get Rid of It | \N | \N |
42675 | tt9904682 | SIUAT | \N | \N |
42676 | tt9905492 | Midnight Reckoning | \N | \N |
42677 | tt9905708 | Résilience | \N | \N |
42678 | tt9907670 | Wanderer in a Business Suit | \N | 1961 |
42679 rows × 4 columns
Before moving forward, let's compare what happens if we run the exact same query via Pandas:
pd.read_sql(action_movies_query, connection)
id | title | time | year | |
---|---|---|---|---|
0 | tt0000574 | The Story of the Kelly Gang | 70 | 1906 |
1 | tt0002574 | What Happened to Mary | 150 | 1912 |
2 | tt0003545 | Who Will Marry Mary? | \N | 1913 |
3 | tt0003747 | Cameo Kirby | 50 | 1914 |
4 | tt0003897 | The Exploits of Elaine | 220 | 1914 |
... | ... | ... | ... | ... |
42674 | tt9904270 | Get Rid of It | \N | \N |
42675 | tt9904682 | SIUAT | \N | \N |
42676 | tt9905492 | Midnight Reckoning | \N | \N |
42677 | tt9905708 | Résilience | \N | \N |
42678 | tt9907670 | Wanderer in a Business Suit | \N | 1961 |
42679 rows × 4 columns
OK! We're in business - both methods produce the same output, so we can trust that they work similarly, and we can pick whichever approach we prefer.
I generally find it more convenient to work with the SQL magic, and simply convert to data frames as needed.
Back to our data...
We see a number of rows containing "\N". These represent values that IMDb was missing. We could drop these rows in pandas using []
notation or query
.
We also note that the time and year columns are currently given in string format, whereas we probably want them in numeric format. Again, we could do this conversion in pandas using pd.to_numeric
.
Lastly, looking at the data, we also see that there are some weird outliers like "The Hazards of Helen" which are 1,428 minutes long. We of course know how to drop these in pandas.
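For comparison, here is a sketch of those three fixes done entirely in pandas, on a small made-up DataFrame (in practice the frame would come from pd.read_sql, and the column names follow the aliases in the query above):

```python
import pandas as pd

# Toy frame mimicking a few rows of the query output
df = pd.DataFrame({
    "title": ["The Story of the Kelly Gang", "Who Will Marry Mary?",
              "The Hazards of Helen"],
    "time": ["70", r"\N", "1428"],
    "year": ["1906", "1913", r"\N"],
})

# 1. Drop rows with IMDb's "\N" missing-value marker using [] notation
df = df[(df["time"] != r"\N") & (df["year"] != r"\N")].copy()

# 2. Convert the string columns to numbers
df["time"] = pd.to_numeric(df["time"])
df["year"] = pd.to_numeric(df["year"])

# 3. Filter out implausible runtimes
df = df[(df["time"] > 60) & (df["time"] < 180)]
print(df)  # only "The Story of the Kelly Gang" survives all three fixes
```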
Instead, let's see how we can do these three fixes in SQL using the CAST keyword.
The CAST
keyword converts a table column to another type. In the code below, we convert runtimeMinutes and startYear to int. Any missing or invalid values are replaced by 0.
%%sql action_movies <<
SELECT tconst AS id,
primaryTitle AS title,
CAST(runtimeMinutes AS int) AS time,
CAST(startYear AS int) AS year
FROM titles
WHERE genres LIKE '%Action%' AND
titleType = 'movie' AND
time > 60 AND time < 180 AND
year > 0
* sqlite:///data/imdb.db sqlite:///data/lec18_basic_examples.db Done. Returning data to local variable action_movies
action_movies = action_movies.DataFrame()
action_movies
id | title | time | year | |
---|---|---|---|---|
0 | tt0000574 | The Story of the Kelly Gang | 70 | 1906 |
1 | tt0002574 | What Happened to Mary | 150 | 1912 |
2 | tt0004223 | The Life of General Villa | 105 | 1914 |
3 | tt0004450 | Die Pagode | 82 | 1917 |
4 | tt0004635 | The Squaw Man | 74 | 1914 |
... | ... | ... | ... | ... |
24385 | tt9900748 | The Robinsons | 110 | 2019 |
24386 | tt9900782 | Kaithi | 145 | 2019 |
24387 | tt9900908 | Useless Handcuffs | 89 | 1969 |
24388 | tt9901162 | The Robinsons | 90 | 2020 |
24389 | tt9904066 | Fox Hunting | 66 | 2019 |
24390 rows × 4 columns
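As a quick sanity check on the claim that CAST turns the "\N" placeholders into 0, here is a sketch using Python's built-in sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A well-formed numeric string casts as expected...
ok = conn.execute(r"SELECT CAST('1906' AS int)").fetchone()[0]

# ...while IMDb's '\N' marker has no numeric prefix, so SQLite returns 0,
# which is why the year > 0 condition filters those rows out
missing = conn.execute(r"SELECT CAST('\N' AS int)").fetchone()[0]
print(ok, missing)
```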
px.histogram(action_movies, x = "year")
action_movies['time'].groupby(action_movies['year']).mean().plot();
action_movies['decade'] = (action_movies['year'] // 10) * 10
px.box(action_movies, x = 'decade', y = 'time', color = "decade")
#plt.xticks(rotation=45);
Or in seaborn:
import seaborn as sns
import matplotlib.pyplot as plt
sns.boxplot(x = 'decade', y = 'time', data = action_movies)
plt.xticks(rotation=45);