💽 Lecture 5 (Part 2, Data Storage Types) – Data 100, Spring 2025¶
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#%matplotlib inline
plt.rcParams['figure.figsize'] = (12, 9)
sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option('display.max_rows', 30)
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 2)
# Use 2 decimal places instead of scientific notation in pandas
pd.set_option('display.float_format', '{:.2f}'.format)
# Silence some spurious seaborn warnings
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
🤹♀️ File Formats Other than CSV¶
There are many file types for storing structured data: CSV, TSV, JSON, XML, ASCII, SAS...
- In lecture, we will cover TSV and JSON, since pandas supports them out of the box.
TSV stands for tab-separated values. Think CSV, but with tabs (\t) as the delimiter instead of commas.
TSVs can be slightly faster to parse than CSVs, since tab-delimited fields rarely need quoting.
But, historical tradition has led to CSV being the more popular file format.
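To see why tab-delimited parsing can be simpler, note that a value containing a comma must be quoted in CSV, but needs no special treatment in TSV. A minimal sketch using in-memory strings (the two-row table here is made up for illustration):

```python
import io
import pandas as pd

# The same two rows, serialized as CSV and as TSV.
# In CSV, the value "8,900" must be quoted so its comma isn't
# mistaken for a delimiter; in TSV it needs no quoting at all.
csv_text = 'state,cases\nTotal,"8,900"\n'
tsv_text = 'state\tcases\nTotal\t8,900\n'

df_csv = pd.read_csv(io.StringIO(csv_text))
df_tsv = pd.read_csv(io.StringIO(tsv_text), sep='\t')

# Both parse to the same DataFrame
print(df_csv.equals(df_tsv))  # True
```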
Like we did with data/cdc_tuberculosis.csv
, you can open data/cdc_tuberculosis.tsv
in the JupyterLab Explorer to see its contents.
The pd.read_csv
function also reads TSVs if we specify the delimiter with parameter sep='\t'
(documentation).
# Identical data to Part 1!
tuberculosis_df_tsv = pd.read_csv("data/cdc_tuberculosis.tsv", sep='\t')
tuberculosis_df_tsv.head()
Unnamed: 0 | No. of TB cases | Unnamed: 2 | Unnamed: 3 | |
---|---|---|---|---|
0 | U.S. jurisdiction | 2019 | 2020 | 2021 |
1 | Total | 8,900 | 7,173 | 7,860 |
2 | Alabama | 87 | 72 | 92 |
3 | Alaska | 58 | 58 | 58 |
4 | Arizona | 183 | 136 | 129 |
# CSV representation
with open("data/cdc_tuberculosis.csv", "r") as f:
for _, row in zip(range(4), f):
print(row) # print raw strings
,No. of TB cases,,
U.S. jurisdiction,2019,2020,2021
Total,"8,900","7,173","7,860"
Alabama,87,72,92
# TSV representation
with open("data/cdc_tuberculosis.tsv", "r") as f:
for _, row in zip(range(4), f):
print(row) # print raw strings
	No. of TB cases		
U.S. jurisdiction	2019	2020	2021
Total	"8,900"	"7,173"	"7,860"
Alabama	87	72	92
# Print literal \t instead of tabbed spaces:
with open("data/cdc_tuberculosis.tsv", "r") as f:
for _, row in zip(range(4), f):
print(repr(row)) # print raw strings
'\tNo. of TB cases\t\t\n'
'U.S. jurisdiction\t2019\t2020\t2021\n'
'Total\t"8,900"\t"7,173"\t"7,860"\n'
'Alabama\t87\t72\t92\n'
Instructor note: Return to Slides!
🪆 JSON (JavaScript Object Notation)¶
The congress.gov API (Application Programming Interface) provides data about the activities and members of the United States Congress (i.e., the House of Representatives and the Senate).
- Click the link above to see the kinds of information provided by the API.
To get a JSON file containing information about the current members of Congress from California, you could use the following API call:
https://api.congress.gov/v3/member/CA?api_key=[INSERT_KEY]&limit=250&format=json&currentMember=True
You can instantly sign up for a congress.gov API key here. Once you have your key, replace [INSERT_KEY] above with your key, and enter the API call as a URL in your browser. What happens?
Once the JSON from the API call is visible in your browser, you can click File --> Save Page As to save the JSON file to your computer.
Coarsely, API keys are used to track how much a given user engages with the API. There might be limits on the number of API calls (e.g., the congress.gov API limits users to 5,000 calls per hour), or a cost per API call (e.g., using the OpenAI API to programmatically query ChatGPT).
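Rather than typing the query string by hand, you can assemble it programmatically with the standard library. A sketch (no network request is made; the DEMO_KEY placeholder is hypothetical, and urlencode will percent-encode any special characters in your real key):

```python
from urllib.parse import urlencode

base = "https://api.congress.gov/v3/member/CA"
params = {
    "api_key": "DEMO_KEY",   # hypothetical placeholder -- substitute your own key
    "limit": 250,
    "format": "json",
    "currentMember": "True",
}

# urlencode turns the dict into a properly escaped query string
url = f"{base}?{urlencode(params)}"
print(url)
```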
For convenience, the JSON file from the call above has already been downloaded for you and is saved at data/ca-congress-members.json
.
📁 File contents¶
Let's look at data/ca-congress-members.json
in the JupyterLab Explorer.
Right-click the file, and then click Open With --> Editor.
You'll notice that JSON looks a lot like a Python dictionary.
Berkeley, CA is in the 12th district. Can you find our representative in Congress?
Note: In general, it's a good idea to check the file size before opening a file in JupyterLab. Very large files can cause crashes. See the os.path.getsize documentation.
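As a quick illustration of the size check, here is a sketch using a temporary file created just for this example (the file and its contents are made up):

```python
import os
import tempfile

# Write a small temporary file, then check its size in bytes
# before deciding whether it's safe to open in JupyterLab.
with tempfile.NamedTemporaryFile(mode="w", suffix=".json", delete=False) as f:
    f.write('{"members": []}')
    path = f.name

size_bytes = os.path.getsize(path)
print(size_bytes)  # 15

os.remove(path)  # clean up the temporary file
```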
We can programmatically view the first couple lines of the file using the same functions we used with CSVs:
congress_file = "data/ca-congress-members.json"
# Inspect the first five lines of the file
with open(congress_file, "r") as f:
for i, row in enumerate(f):
print(row)
if i >= 4: break
{
  "members": [
    {
      "bioguideId": "T000491",
      "depiction": {
🐍 EDA: Digging into JSON with Python¶
JSON data closely matches the internal Python object model.
- In the following cell, we import the entire JSON datafile into a Python dictionary using the
json
package.
import json
# Import the JSON file into Python as a dictionary
with open(congress_file, "rb") as f:
congress_json = json.load(f)
type(congress_json)
dict
The congress_json
variable is a dictionary encoding the data in the JSON file.
Below, we access the first element of the members
element of the congress_json
dictionary.
- This first element is also a dictionary (and there are more dictionaries inside of it!)
# Grab the list corresponding to the `members` key in the JSON dictionary,
# and then grab the first element of this list.
# In a moment, we'll see how we knew to use the key `members`, and that
# the resulting object is a list.
congress_json['members'][0]
{'bioguideId': 'T000491', 'depiction': {'attribution': 'Image courtesy of the Member', 'imageUrl': 'https://www.congress.gov/img/member/6774606d0b34857ecc9091a9_200.jpg'}, 'district': 45, 'name': 'Tran, Derek', 'partyName': 'Democratic', 'state': 'California', 'terms': {'item': [{'chamber': 'House of Representatives', 'startYear': 2025}]}, 'updateDate': '2025-01-21T18:00:52Z', 'url': 'https://api.congress.gov/v3/member/T000491?format=json'}
How should we probe a nested dictionary like congress_json
?
We can start by identifying the top-level keys of the dictionary:
# Grab the top-level keys of the JSON dictionary
congress_json.keys()
dict_keys(['members', 'pagination', 'request'])
Looks like we have three top-level keys: members
, pagination
, and request
.
You'll often see a top-level
meta
key in JSON files. This does not refer to Meta (formerly Facebook). Instead, it typically refers to metadata (data about the data). Metadata are often maintained alongside the data.
Let's check the type of the members
element:
type(congress_json['members'])
list
Looks like a list! What are the first two elements?
congress_json['members'][:2]
[{'bioguideId': 'T000491', 'depiction': {'attribution': 'Image courtesy of the Member', 'imageUrl': 'https://www.congress.gov/img/member/6774606d0b34857ecc9091a9_200.jpg'}, 'district': 45, 'name': 'Tran, Derek', 'partyName': 'Democratic', 'state': 'California', 'terms': {'item': [{'chamber': 'House of Representatives', 'startYear': 2025}]}, 'updateDate': '2025-01-21T18:00:52Z', 'url': 'https://api.congress.gov/v3/member/T000491?format=json'}, {'bioguideId': 'M001241', 'depiction': {'attribution': 'Image courtesy of the Member', 'imageUrl': 'https://www.congress.gov/img/member/67744ed90b34857ecc909155_200.jpg'}, 'district': 47, 'name': 'Min, Dave', 'partyName': 'Democratic', 'state': 'California', 'terms': {'item': [{'chamber': 'House of Representatives', 'startYear': 2025}]}, 'updateDate': '2025-01-21T18:00:52Z', 'url': 'https://api.congress.gov/v3/member/M001241?format=json'}]
More dictionaries! You can repeat the process above to traverse the nested dictionary.
You'll notice that each record of congress_json['members']
looks like it could be a column of a DataFrame.
- The keys look a lot like column names, and the values could be the entries in each row.
But, the two other elements of congress_json
don't have the same structure as congress_json['members']
.
So, they probably don't belong in a DataFrame containing the members of Congress from CA.
We'll see the implications of this inconsistency in the next section.
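Instead of probing one key at a time, you can write a small helper that walks the nested structure and reports the type at each level. A sketch (the `describe` helper and the toy dictionary below are our own; the toy imitates the top-level shape of `congress_json`):

```python
def describe(obj, indent=0):
    """Return lines describing the keys and types of a nested JSON-like object."""
    lines = []
    prefix = " " * indent
    if isinstance(obj, dict):
        for key, value in obj.items():
            lines.append(f"{prefix}{key}: {type(value).__name__}")
            lines.extend(describe(value, indent + 2))
    elif isinstance(obj, list) and obj:
        # Describe only the first element; JSON arrays are usually homogeneous
        lines.append(f"{prefix}[0]: {type(obj[0]).__name__}")
        lines.extend(describe(obj[0], indent + 2))
    return lines

# A toy dictionary with the same top-level shape as congress_json
toy = {
    "members": [{"name": "Tran, Derek", "district": 45}],
    "pagination": {"count": 54},
    "request": {"format": "json"},
}
print("\n".join(describe(toy)))
```

Running `describe(congress_json)` itself would print the full schema of the API response.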
print(congress_json['pagination'])
print(congress_json['request'])
{'count': 54}
{'contentType': 'application/json', 'format': 'json'}
🐼 JSON with pandas¶
pandas
has a built in function called pd.read_json
for reading in JSON files.
- Uncomment the code below and see what happens.
# pd.read_json(congress_file)
Uh oh. Using the default parameters, we got an error.
The code above tries to import the entire JSON file located at
congress_file
(congress_json
), includingcongress_json['pagination']
andcongress_json['request']
.We only want to make a DataFrame out of
congress_json['members']
.
This time, let's try converting the members
element of congress_json
to a DataFrame by using pd.DataFrame
:
# Convert dictionary to DataFrame
congress_df = pd.DataFrame(congress_json['members'])
congress_df.head()
bioguideId | depiction | district | name | partyName | state | terms | updateDate | url | |
---|---|---|---|---|---|---|---|---|---|
0 | T000491 | {'attribution': 'Image courtesy of the Member'... | 45.00 | Tran, Derek | Democratic | California | {'item': [{'chamber': 'House of Representative... | 2025-01-21T18:00:52Z | https://api.congress.gov/v3/member/T000491?for... |
1 | M001241 | {'attribution': 'Image courtesy of the Member'... | 47.00 | Min, Dave | Democratic | California | {'item': [{'chamber': 'House of Representative... | 2025-01-21T18:00:52Z | https://api.congress.gov/v3/member/M001241?for... |
2 | K000400 | {'attribution': 'Image courtesy of the Member'... | 37.00 | Kamlager-Dove, Sydney | Democratic | California | {'item': [{'chamber': 'House of Representative... | 2025-01-21T18:00:52Z | https://api.congress.gov/v3/member/K000400?for... |
3 | G000598 | {'attribution': 'Image courtesy of the Member'... | 42.00 | Garcia, Robert | Democratic | California | {'item': [{'chamber': 'House of Representative... | 2025-01-21T18:00:52Z | https://api.congress.gov/v3/member/G000598?for... |
4 | K000397 | {'attribution': 'Image courtesy of the Member'... | 40.00 | Kim, Young | Republican | California | {'item': [{'chamber': 'House of Representative... | 2025-01-21T18:00:52Z | https://api.congress.gov/v3/member/K000397?for... |
We've successfully begun to rectangularize our JSON data!
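For nested records like these, pandas also provides pd.json_normalize, which flattens inner dictionaries into dotted column names instead of leaving dictionaries inside cells. A sketch on two hand-made records shaped like entries of `congress_json['members']` (values copied from the output above):

```python
import pandas as pd

# Two records imitating the shape of congress_json['members']
records = [
    {"name": "Tran, Derek", "district": 45,
     "depiction": {"attribution": "Image courtesy of the Member"}},
    {"name": "Min, Dave", "district": 47,
     "depiction": {"attribution": "Image courtesy of the Member"}},
]

# json_normalize flattens the nested 'depiction' dict into a dotted column
flat = pd.json_normalize(records)
print(sorted(flat.columns))  # ['depiction.attribution', 'district', 'name']
```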
Instructor Note: Return to Slides!
🕰️ Temporality¶
Let's briefly look at how we can use pandas dt
accessors to work with dates/times in a dataset.
We will use the Berkeley Police Department (PD) Calls for Service dataset.
calls = pd.read_csv("data/Berkeley_PD_-_Calls_for_Service.csv")
calls.head()
CASENO | OFFENSE | EVENTDT | EVENTTM | CVLEGEND | CVDOW | InDbDate | Block_Location | BLKADDR | City | State | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 21014296 | THEFT MISD. (UNDER $950) | 04/01/2021 12:00:00 AM | 10:58 | LARCENY | 4 | 06/15/2021 12:00:00 AM | Berkeley, CA\n(37.869058, -122.270455) | NaN | Berkeley | CA |
1 | 21014391 | THEFT MISD. (UNDER $950) | 04/01/2021 12:00:00 AM | 10:38 | LARCENY | 4 | 06/15/2021 12:00:00 AM | Berkeley, CA\n(37.869058, -122.270455) | NaN | Berkeley | CA |
2 | 21090494 | THEFT MISD. (UNDER $950) | 04/19/2021 12:00:00 AM | 12:15 | LARCENY | 1 | 06/15/2021 12:00:00 AM | 2100 BLOCK HASTE ST\nBerkeley, CA\n(37.864908,... | 2100 BLOCK HASTE ST | Berkeley | CA |
3 | 21090204 | THEFT FELONY (OVER $950) | 02/13/2021 12:00:00 AM | 17:00 | LARCENY | 6 | 06/15/2021 12:00:00 AM | 2600 BLOCK WARRING ST\nBerkeley, CA\n(37.86393... | 2600 BLOCK WARRING ST | Berkeley | CA |
4 | 21090179 | BURGLARY AUTO | 02/08/2021 12:00:00 AM | 6:20 | BURGLARY - VEHICLE | 1 | 06/15/2021 12:00:00 AM | 2700 BLOCK GARBER ST\nBerkeley, CA\n(37.86066,... | 2700 BLOCK GARBER ST | Berkeley | CA |
Looks like there are three columns with dates/times: EVENTDT
, EVENTTM
, and InDbDate
.
EVENTDT
stands for the date when the event took placeEVENTTM
stands for the time of day the event took place (in 24-hr format)InDbDate
is the date this call is entered into the database.
We can convert these string columns to datetime
objects using the pd.to_datetime
function.
# pd.to_datetime() is smart -- It can often infer what you want based on
# the format of the datetime string.
# But, it's not always right! It's good practice to specify the format of
# your datetimes. See the documentation and the `format` argument.
# Without format specified:
calls["EVENTDT"] = pd.to_datetime(calls["EVENTDT"])
# With format specified:
# calls["EVENTDT"] = pd.to_datetime(calls["EVENTDT"], format='%m/%d/%Y %I:%M:%S %p')
calls.head()
/tmp/ipykernel_99/2275392033.py:7: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
  calls["EVENTDT"] = pd.to_datetime(calls["EVENTDT"])
CASENO | OFFENSE | EVENTDT | EVENTTM | CVLEGEND | CVDOW | InDbDate | Block_Location | BLKADDR | City | State | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 21014296 | THEFT MISD. (UNDER $950) | 2021-04-01 | 10:58 | LARCENY | 4 | 06/15/2021 12:00:00 AM | Berkeley, CA\n(37.869058, -122.270455) | NaN | Berkeley | CA |
1 | 21014391 | THEFT MISD. (UNDER $950) | 2021-04-01 | 10:38 | LARCENY | 4 | 06/15/2021 12:00:00 AM | Berkeley, CA\n(37.869058, -122.270455) | NaN | Berkeley | CA |
2 | 21090494 | THEFT MISD. (UNDER $950) | 2021-04-19 | 12:15 | LARCENY | 1 | 06/15/2021 12:00:00 AM | 2100 BLOCK HASTE ST\nBerkeley, CA\n(37.864908,... | 2100 BLOCK HASTE ST | Berkeley | CA |
3 | 21090204 | THEFT FELONY (OVER $950) | 2021-02-13 | 17:00 | LARCENY | 6 | 06/15/2021 12:00:00 AM | 2600 BLOCK WARRING ST\nBerkeley, CA\n(37.86393... | 2600 BLOCK WARRING ST | Berkeley | CA |
4 | 21090179 | BURGLARY AUTO | 2021-02-08 | 6:20 | BURGLARY - VEHICLE | 1 | 06/15/2021 12:00:00 AM | 2700 BLOCK GARBER ST\nBerkeley, CA\n(37.86066,... | 2700 BLOCK GARBER ST | Berkeley | CA |
Now we can use the dt
accessor on this column.
We can get the month:
# 1 - January, 2 - February, ..., 12 - December
calls["EVENTDT"].dt.month
0        4
1        4
2        4
3        2
4        2
        ..
2627    12
2628     2
2629     3
2630     4
2631     2
Name: EVENTDT, Length: 2632, dtype: int32
Which day of the week the date is on:
# 0 - Monday, 1 - Tuesday, ..., 6 - Sunday
calls["EVENTDT"].dt.dayofweek
0       3
1       3
2       0
3       5
4       0
       ..
2627    0
2628    2
2629    2
2630    5
2631    4
Name: EVENTDT, Length: 2632, dtype: int32
We can also sort by datetime:
# Sort the DataFrame by datetime to find the earliest call.
calls.sort_values("EVENTDT").head()
CASENO | OFFENSE | EVENTDT | EVENTTM | CVLEGEND | CVDOW | InDbDate | Block_Location | BLKADDR | City | State | |
---|---|---|---|---|---|---|---|---|---|---|---|
2513 | 20057398 | BURGLARY COMMERCIAL | 2020-12-17 | 16:05 | BURGLARY - COMMERCIAL | 4 | 06/15/2021 12:00:00 AM | 600 BLOCK GILMAN ST\nBerkeley, CA\n(37.878405,... | 600 BLOCK GILMAN ST | Berkeley | CA |
624 | 20057207 | ASSAULT/BATTERY MISD. | 2020-12-17 | 16:50 | ASSAULT | 4 | 06/15/2021 12:00:00 AM | 2100 BLOCK SHATTUCK AVE\nBerkeley, CA\n(37.871... | 2100 BLOCK SHATTUCK AVE | Berkeley | CA |
154 | 20092214 | THEFT FROM AUTO | 2020-12-17 | 18:30 | LARCENY - FROM VEHICLE | 4 | 06/15/2021 12:00:00 AM | 800 BLOCK SHATTUCK AVE\nBerkeley, CA\n(37.8918... | 800 BLOCK SHATTUCK AVE | Berkeley | CA |
659 | 20057324 | THEFT MISD. (UNDER $950) | 2020-12-17 | 15:44 | LARCENY | 4 | 06/15/2021 12:00:00 AM | 1800 BLOCK 4TH ST\nBerkeley, CA\n(37.869888, -... | 1800 BLOCK 4TH ST | Berkeley | CA |
993 | 20057573 | BURGLARY RESIDENTIAL | 2020-12-17 | 22:15 | BURGLARY - RESIDENTIAL | 4 | 06/15/2021 12:00:00 AM | 1700 BLOCK STUART ST\nBerkeley, CA\n(37.857495... | 1700 BLOCK STUART ST | Berkeley | CA |
We can also do many things with the dt
accessor like switching time zones and converting time back to UNIX/POSIX time. Check out the documentation on .dt
accessor and time series/date functionality.
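As a quick illustration of the time-zone functionality mentioned above (the two naive timestamps below are made up for this sketch):

```python
import pandas as pd

# Two made-up naive timestamps (no time zone attached yet)
ts = pd.Series(pd.to_datetime(["2021-04-01 10:58", "2021-02-08 06:20"]))

# Attach a time zone with tz_localize, then convert with tz_convert.
# April 1 falls in PDT (UTC-7); February 8 falls in PST (UTC-8).
utc = ts.dt.tz_localize("US/Pacific").dt.tz_convert("UTC")
print(utc.dt.hour.tolist())  # [17, 14]
```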
What type are datetime objects?
calls["EVENTDT"].dtype
dtype('<M8[ns]')
ns above stands for nanoseconds.
<M8 refers to the NumPy type datetime64.
Under the hood, datetimes in pandas are integers representing the number of nanoseconds since 1/1/1970 UTC.
# datetimes in pandas are stored as integers representing number of
# nanoseconds since 1970-01-01
calls["EVENTDT"].astype(int)
0       1617235200000000000
1       1617235200000000000
2       1618790400000000000
3       1613174400000000000
4       1612742400000000000
               ...         
2627    1608508800000000000
2628    1614124800000000000
2629    1616544000000000000
2630    1619222400000000000
2631    1614297600000000000
Name: EVENTDT, Length: 2632, dtype: int64
Instructor Note: Return to Slides!
🤷 Faithfulness and missing values¶
To conclude, let's very briefly explore missingness in the Berkeley PD Calls for Service dataset.
Looking at the top of the dataframe, we can already see that there are missing values in the BLKADDR
column.
calls.head()
CASENO | OFFENSE | EVENTDT | EVENTTM | CVLEGEND | CVDOW | InDbDate | Block_Location | BLKADDR | City | State | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 21014296 | THEFT MISD. (UNDER $950) | 2021-04-01 | 10:58 | LARCENY | 4 | 06/15/2021 12:00:00 AM | Berkeley, CA\n(37.869058, -122.270455) | NaN | Berkeley | CA |
1 | 21014391 | THEFT MISD. (UNDER $950) | 2021-04-01 | 10:38 | LARCENY | 4 | 06/15/2021 12:00:00 AM | Berkeley, CA\n(37.869058, -122.270455) | NaN | Berkeley | CA |
2 | 21090494 | THEFT MISD. (UNDER $950) | 2021-04-19 | 12:15 | LARCENY | 1 | 06/15/2021 12:00:00 AM | 2100 BLOCK HASTE ST\nBerkeley, CA\n(37.864908,... | 2100 BLOCK HASTE ST | Berkeley | CA |
3 | 21090204 | THEFT FELONY (OVER $950) | 2021-02-13 | 17:00 | LARCENY | 6 | 06/15/2021 12:00:00 AM | 2600 BLOCK WARRING ST\nBerkeley, CA\n(37.86393... | 2600 BLOCK WARRING ST | Berkeley | CA |
4 | 21090179 | BURGLARY AUTO | 2021-02-08 | 6:20 | BURGLARY - VEHICLE | 1 | 06/15/2021 12:00:00 AM | 2700 BLOCK GARBER ST\nBerkeley, CA\n(37.86066,... | 2700 BLOCK GARBER ST | Berkeley | CA |
We can use the .isna()
method to get a sense of how often values in BLKADDR
are missing.
- The
.isnull()
method is functionally equivalent.
# isna() returns a Series of booleans indicating whether each element
# in the Series is missing.
print(calls['BLKADDR'].isna().head())
# The mean of a Series of booleans is the proportion of booleans that are True.
calls['BLKADDR'].isna().mean()
0     True
1     True
2    False
3    False
4    False
Name: BLKADDR, dtype: bool
0.007598784194528876
It looks like missing values are actually quite rare: Only 0.8% of records are missing a value in BLKADDR
.
Why are these values missing?
Again, looking at just the first few rows, we see that
NaN
values inBLKADDR
appear to be accompanied by latitude/longitude coordinates in theBlock_Location
column.In all likelihood, missing values in
BLKADDR
probably correspond to locations that do not have a defined address in the officer's navigation or GPS system.
The best default approach here: Leave the rows with missing BLKADDR
untouched, or replace the NaN
values with a MISSING
indicator.
- In the future, if we wanted to conduct an analysis of the streets where police incidents were most common, we might impute
BLKADDR
by using the nearest street, which we could identify with an external package.
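The indicator approach above takes one line with fillna. A sketch on a small stand-in Series (values copied from the rows shown earlier), since we don't want to modify calls in place:

```python
import numpy as np
import pandas as pd

# A small stand-in for the BLKADDR column
blkaddr = pd.Series([np.nan, np.nan, "2100 BLOCK HASTE ST",
                     "2600 BLOCK WARRING ST", "2700 BLOCK GARBER ST"])

# fillna swaps NaN for an explicit indicator, leaving real addresses alone
filled = blkaddr.fillna("MISSING")
print(filled.tolist())
```

On the real data, `calls["BLKADDR"].fillna("MISSING")` would behave the same way.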
For a very rough sense of missingness in each column of a DataFrame, you can use the info()
method.
- Based on the output, it looks like the only column with missing values is
BLKADDR
.
# You can see the total number of rows at the top of the .info() output.
# Compare this to the number of non-null values in the BLKADDR column.
calls.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2632 entries, 0 to 2631
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   CASENO          2632 non-null   int64         
 1   OFFENSE         2632 non-null   object        
 2   EVENTDT         2632 non-null   datetime64[ns]
 3   EVENTTM         2632 non-null   object        
 4   CVLEGEND        2632 non-null   object        
 5   CVDOW           2632 non-null   int64         
 6   InDbDate        2632 non-null   object        
 7   Block_Location  2632 non-null   object        
 8   BLKADDR         2612 non-null   object        
 9   City            2632 non-null   object        
 10  State           2632 non-null   object        
dtypes: datetime64[ns](1), int64(2), object(8)
memory usage: 226.3+ KB
Instructor Note: Return to Slides!