• Identify situations where a database may be preferred over a CSV file
  • Write basic SQL queries using SELECT, FROM, WHERE, ORDER BY, LIMIT, and OFFSET
  • Perform aggregations using GROUP BY and HAVING

In the first ten weeks of the course, we made our way through the entire data science lifecycle: we learned how to load and explore a dataset, formulate questions, and use the tools of prediction and inference to come up with answers. For the remaining weeks of the semester, we are going to make a second pass through the lifecycle, this time doing so with a different set of tools, ideas, and abstractions.

21.1 Databases

With this goal in mind, let’s go back to the very beginning of the lifecycle. We first started our work in data analysis by looking at the pandas library, which offered us powerful tools to manipulate tabular data stored in (primarily) CSV files. CSVs work well when analyzing relatively small datasets that don’t need to be shared across many users. In research and industry, however, data scientists often need to access enormous bodies of data that cannot be easily stored in a CSV format. Collaborating with others when working with CSVs can also be tricky – a real-world data scientist may run into problems when multiple users try to make modifications, or, even worse, security issues with who should and should not have access to the data.

A database is a large, organized collection of data. Databases are administered by Database Management Systems (DBMS), which are software systems that store, manage, and facilitate access to one or more databases. Databases help mitigate many of the issues that come with using CSVs for data storage: they provide reliable storage that can survive system crashes or disk failures, are optimized to compute on data that does not fit into memory, and contain special data structures to improve performance. Using databases rather than CSVs offers further benefits from the standpoint of data management. A DBMS can apply settings that configure how data is organized, block certain data anomalies (for example, enforcing non-negative weights or ages), and determine who is allowed access to the data. It can also ensure safe concurrent operations where multiple users reading and writing to the database will not lead to fatal errors.

As you may have guessed, we can’t use our usual pandas methods to work with data in a database. Instead, we’ll turn to Structured Query Language.

21.2 Structured Query Language and Database Schema

Structured Query Language, or SQL (commonly pronounced “sequel,” though this is the subject of fierce debate), is a special programming language designed to communicate with databases. You may have encountered it in classes like CS 61A or Data C88C before. Unlike Python, it is a declarative programming language – this means that rather than writing the exact logic needed to complete a task, a piece of SQL code “declares” what the desired final output should be and leaves the program to determine what logic should be implemented.

It is important to reiterate that SQL is an entirely different language from Python. However, Python does have special engines that allow us to run SQL code in a Jupyter notebook. While this is typically not how SQL is used outside of an educational setting, we will be using this workflow to illustrate how SQL queries are constructed using the tools we’ve already worked with this semester. You will learn more about how to run SQL queries in Jupyter during lab.

The syntax below will seem unfamiliar to you; for now, just focus on understanding the output displayed. We will clarify the SQL code in a bit.

An important note: in lecture, you were first introduced to a Jupyter tool called SQL cell magic to form database connections. This resource is unfortunately not compatible with the infrastructure used to host course notes. Because of this, the course notes for SQL I and SQL II will rely on the sqlalchemy library for running SQL code. Lab and homework will offer more guided practice with SQL cell magic.

To start, we’ll look at a database called dragon.db.

Code
# Load the SQL Alchemy Python library
import sqlalchemy
import pandas as pd

# Create a SQL Alchemy connection to the database
engine = sqlalchemy.create_engine("sqlite:///data/dragons.db")
connection = engine.connect()

# Query information from the database
query = """
SELECT * 
FROM sqlite_master
WHERE type="table";
"""

pd.read_sql(query, engine)
type name tbl_name rootpage sql
0 table sqlite_sequence sqlite_sequence 7 CREATE TABLE sqlite_sequence(name,seq)
1 table Dragon Dragon 2 CREATE TABLE Dragon (\n name TEXT PRIMARY K...
2 table Dish Dish 4 CREATE TABLE Dish (\n name TEXT PRIMARY KEY...
3 table Scene Scene 6 CREATE TABLE Scene (\n id INTEGER PRIMARY K...

The summary above displays information about the database. The database contains four tables, named sqlite_sequence, Dragon, Dish, and Scene. The rightmost column above lists the command that was used to construct each table.

Let’s look more closely at the command used to create the Dragon table (the second entry above).

CREATE TABLE Dragon (name TEXT PRIMARY KEY,
                     year INTEGER CHECK (year >= 2000),
                     cute INTEGER)

The statement CREATE TABLE is used to specify the schema of the table – a description of what logic is used to organize the table. Schema follow a set format:

  • ColName: the name of a column
  • DataType: the type of data to be stored in a column. Some of the most common SQL data types are INT (integers), REAL (single-precision floating point numbers), TEXT (strings), BLOB (arbitrary data, such as audio/video files), and DATETIME (a date and time).
  • Constraint: some restriction on the data to be stored in the column. Common constraints are CHECK (data must obey a certain condition), PRIMARY KEY (designate a column as the table’s primary key), NOT NULL (data cannot be null), and DEFAULT (a default fill value if no specific entry is given).

We see that Dragon contains five columns. The first of these, "name", contains text data. It is designated as the primary key of the table; that is, the data contained in "name" uniquely identifies each entry in the table. Because "name" is the primary key of the table, no two entries in the table can have the same name – a given value of "name" is unique to each dragon. The "year" column contains integer data, with the constraint that year values must be greater than or equal to 2000. The final column, "cute", contains integer data with no restrictions on allowable values.

We can verify this by viewing Dragon itself.

Code
query = """
SELECT *
FROM Dragon;
"""

pd.read_sql(query, engine)
name year cute
0 hiccup 2010 10
1 drogon 2011 -100
2 dragon 2 2019 0

Database tables (also referred to as relations) are structured much like DataFrames in pandas. Each row, sometimes called a tuple, represents a single record in the dataset. Each column, sometimes called an attribute or field, describes some feature of the record.

21.3 SELECTing From Tables

To extract and manipulate data stored in a SQL table, we will need to familiarize ourselves with the syntax to write pieces of SQL code, which we call queries.

The basic unit of a SQL query is the SELECT statement. SELECT specifies what columns we would like to extract from a given table. We use FROM to tell SQL the table from which we want to SELECT our data.

query = """
SELECT *
FROM Dragon;
"""

pd.read_sql(query, engine)
name year cute
0 hiccup 2010 10
1 drogon 2011 -100
2 dragon 2 2019 0

In SQL, * means “everything.” The query above grabs all the columns in Dragon and displays them in the outputted table. We can also specify a specific subset of columns to be SELECTed.

query = """
SELECT cute, year
FROM Dragon;
"""

pd.read_sql(query, engine)
cute year
0 10 2010
1 -100 2011
2 0 2019

And just like that, we’ve already written two SQL queries. There are a few points of note in the queries above. Firstly, notice that every “verb” is written in uppercase. It is convention to write SQL operations in capital letters, but your code will run just fine even if you choose to keep things in lowercase. Second, the query above separates each statement with a new line. SQL queries are not impacted by whitespace within the query; this means that SQL code is typically written with a new line after each statement to make things more readable. The semicolon (;) indicates the end of a query. There are some “flavors” of SQL in which a query will not run if no semicolon is present; in Data 100, however, the SQL version we will use works with or without an ending semicolon. Queries in these notes will end with semicolons to build up good habits.

The AS keyword allows us to give a column a new name after it has been SELECTed. The general syntax is:

SELECT column_name_in_database_table AS new_name_in_output_table
query = """
SELECT cute AS cuteness, year AS birth
FROM Dragon;
"""

pd.read_sql(query, engine)
cuteness birth
0 10 2010
1 -100 2011
2 0 2019

To SELECT only the unique values in a column, we use the DISTINCT keyword. This will cause any any duplicate entries in a column to be removed. If we want to find only the unique years in Dragon, without any repeats, we would write:

query = """
SELECT DISTINCT year
FROM Dragon;
"""

pd.read_sql(query, engine)
year
0 2010
1 2011
2 2019

Every SQL query must include both a SELECT and FROM statement. Intuitively, this makes sense – we know that we’ll want to extract some piece of information from the table; to do so, we also need to indicate what table we want to consider.

It is important to note that SQL enforces a strict “order of operations” – SQL clauses must always follow the same sequence. For example, the SELECT statement must always precede FROM. This means that any SQL query will follow the same structure.

SELECT <column list>
FROM <table>
[additional clauses]

The additional clauses that we use depend on the specific task trying to be achieved. We may refine our query to filter on a certain condition, aggregate a particular column, or join several tables together. We will spend the rest of this lecture outlining some useful clauses to build up our understanding of the order of operations.

21.4 Applying WHERE Conditions

The WHERE keyword is used to select only some rows of a table, filtered on a given Boolean condition.

query = """
SELECT name, year
FROM Dragon
WHERE cute > 0;
"""

pd.read_sql(query, engine)
name year
0 hiccup 2010

We can add complexity to the WHERE condition using the keywords AND, OR, and NOT, much like we would in Python.

query = """
SELECT name, year
FROM Dragon
WHERE cute > 0 OR year > 2013;
"""

pd.read_sql(query, engine)
name year
0 hiccup 2010
1 dragon 2 2019

21.5 Sorting and Restricting Output

What if we want the output table to appear in a certain order? The ORDER BY keyword behaves similarly to .sort_values() in pandas.

query = """
SELECT *
FROM Dragon
ORDER BY cute;
"""

pd.read_sql(query, engine)
name year cute
0 drogon 2011 -100
1 dragon 2 2019 0
2 hiccup 2010 10

By default, ORDER BY will display results in ascending order (with the lowest values at the top of the table). To sort in descending order, we use the DESC keyword after specifying the column to be used for ordering.

query = """
SELECT *
FROM Dragon
ORDER BY cute DESC;
"""

pd.read_sql(query, engine)
name year cute
0 hiccup 2010 10
1 dragon 2 2019 0
2 drogon 2011 -100

We can also tell SQL to ORDER BY two columns at once. This will sort the table by the first listed column, then use the values in the second listed column to break any ties.

query = """
SELECT *
FROM Dragon
ORDER BY name, cute;
"""

pd.read_sql(query, engine)
name year cute
0 dragon 2 2019 0
1 drogon 2011 -100
2 hiccup 2010 10

In many instances, we are only concerned with a certain number of rows in the output table (for example, wanting to find the first two dragons in the table). The LIMIT keyword restricts the output to a specified number of rows. It serves a function similar to that of .head() in pandas.

query = """
SELECT *
FROM Dragon
LIMIT 2;
"""

pd.read_sql(query, engine)
name year cute
0 hiccup 2010 10
1 drogon 2011 -100

The OFFSET keyword indicates the index at which LIMIT should start. In other words, we can use OFFSET to shift where the LIMITing begins by a specified number of rows. For example, we might care about the dragons that are at positions #2 and #3 in the table.

query = """
SELECT *
FROM Dragon
LIMIT 2
OFFSET 1;
"""

pd.read_sql(query, engine)
name year cute
0 drogon 2011 -100
1 dragon 2 2019 0

Let’s summarize what we’ve learned so far. We know that SELECT and FROM are the fundamental building blocks of any SQL query. We can augment these two keywords with additional clauses to refine the data in our output table.

Any clauses that we include must follow a strict ordering within the query:

SELECT <column list>
FROM <table>
[WHERE <predicate>]
[ORDER BY <column list>]
[LIMIT <number of rows>]
[OFFSET <number of rows>]

Here, any clause contained in square brackets [ ] is optional – we only need to use the keyword if it is relevant to the table operation we want to perform.

21.6 Aggregating with GROUP BY

At this point, we’ve seen that SQL offers much of the same functionality that was given to us by pandas. We can extract data from a table, filter it, and reorder it to suit our needs.

In pandas, much of our analysis work relied heavily on being able to use .groupby() to aggregate across the rows of our dataset. SQL’s answer to this task is the (very conveniently named) GROUP BY clause. While the outputs of GROUP BY are similar to those of .groupby() – in both cases, we obtain an output table where some column has been used for grouping – the syntax and logic used to group data in SQL are fairly different to the pandas implementation.

To illustrate GROUP BY, we will consider the Dish table from the basic_examples database.

Code
# Create a SQL Alchemy connection to the `basic_examples` database
engine = sqlalchemy.create_engine("sqlite:///data/basic_examples.db")
connection = engine.connect()

# Query information from the database
query = """
SELECT * 
FROM Dish
LIMIT 5;
"""

pd.read_sql(query, engine)
name type cost
0 ravioli entree 10
1 ramen entree 13
2 taco entree 7
3 edamame appetizer 4
4 fries appetizer 4

Say we wanted to count the maximum cost of any dish of a certain type. To accomplish this, we would write the following code.

query = """
SELECT type, MAX(cost)
FROM Dish
GROUP BY type;
"""

pd.read_sql(query, engine)
type MAX(cost)
0 appetizer 4
1 dessert 5
2 entree 13

What is going on here? The statement GROUP BY type tells SQL to group the data based on the value contained in the type column (whether a record is a appetizer, entree, or dessert). MAX(cost) determines the maximum cost of any dish in each type and displays the result in the output table.

You may be wondering: why does MAX(cost) come before the command to GROUP BY type? Don’t we need to form groups before we can count the number of entries in each?

Remember that SQL is a declarative programming language – a SQL programmer simply states what end result they would like to see, and leaves the task of figuring out how to obtain this result to SQL itself. This means that SQL queries sometimes don’t follow what a reader sees as a “logical” sequence of thought. Instead, SQL requires that we follow its set order of operations when constructing queries. So long as we follow this ordering, SQL will handle the underlying logic.

In practical terms: our goal with this query was to output the maximum cost of each type. To communicate this to SQL, we say that we want to SELECT the MAX value of cost for each type group.

Let’s put this into action one more time. Now, let’s find the sum of all costs for each type.

query = """
SELECT type, SUM(cost)
FROM Dish
GROUP BY type;
"""

pd.read_sql(query, engine)
type SUM(cost)
0 appetizer 12
1 dessert 5
2 entree 30

There are many aggregation functions that can be used to aggregate the data contained in each group. Some common examples are:

  • COUNT: count the number of rows associated with each group
  • MIN: find the minimum value of each group
  • MAX: find the maximum value of each group
  • SUM: sum across all records in each group

We can easily compute multiple aggregations, all at once (a task that was very tricky in pandas).

query = """
SELECT type, SUM(cost), MIN(cost), MAX(name)
FROM Dish
GROUP BY type;
"""

pd.read_sql(query, engine)
type SUM(cost) MIN(cost) MAX(name)
0 appetizer 12 4 potsticker
1 dessert 5 5 ice cream
2 entree 30 7 taco

To count the number of rows associated with each group, we use the COUNT keyword. Calling COUNT(*) will compute the total number of rows in each group, including rows with null values. Its pandas equivalent is .groupby().size().

query = """
SELECT type, COUNT(*)
FROM Dish
GROUP BY type;
"""

pd.read_sql(query, engine)
type COUNT(*)
0 appetizer 3
1 dessert 1
2 entree 3

To exclude null values when counting the rows in each group, we explicitly call COUNT on a column in the table. This is similar to calling .groupby().count() in pandas.

query = """
SELECT type, COUNT(cost)
FROM Dish
GROUP BY type;
"""

pd.read_sql(query, engine)
type COUNT(cost)
0 appetizer 3
1 dessert 1
2 entree 3

21.6.1 Filtering with HAVING

Often, we will only be interested in groups that share a certain characteristic. For example, we might only want to work with types where all dishes of that type cost less than $8.

When working with pandas, we used the .filter() method of GroupBy objects to filter out entire groups of data based on a given condition. In SQL, we can achieve something similar by using the HAVING clause. HAVING checks to see if all data across a group satisfies a certain condition. If this condition is met, the group is kept in the outputted data. If not, the group is excluded from the output.

query = """
SELECT type, COUNT(*)
FROM Dish
GROUP BY type
HAVING MAX(cost) < 8;
"""

pd.read_sql(query, engine)
type COUNT(*)
0 appetizer 3
1 dessert 1

Notice that all entrees have been excluded from the output table.

It can be helpful to read HAVING clauses this way: “give me all type groups having a maximum cost less than 8.”

You may ask: why couldn’t we have just used a WHERE clause here? It is important to remember that WHERE applies a condition to individual rows. We could, for example, use WHERE to filter rows that have particular values of cost. WHERE cannot, however, apply a condition across a group of many rows. Because WHERE only considers one row at a time, we would not be able to use it to filter on, say, the maximum cost across all rows in a group, or the sum of all costs across an entire group of records. HAVING allows us to apply a condition based on some aggregate value of a collection of grouped rows.

With these definitions of GROUP BY and HAVING in hand, let’s update our SQL order of operations. Remember: every SQL query must list clauses in this order.

SELECT <column list>
FROM <table>
[WHERE <predicate>]
[GROUP BY <column list>]
[HAVING <predicate>]
[ORDER BY <column list>]
[LIMIT <number of rows>]
[OFFSET <number of rows>]