💽 Lecture 5 (Part 2, Data Storage Types) – Data 100, Spring 2025¶

Data 100, Spring 2025

Acknowledgments Page

In [1]:
import numpy as np
import pandas as pd
In [2]:
import matplotlib.pyplot as plt
import seaborn as sns
#%matplotlib inline
plt.rcParams['figure.figsize'] = (12, 9)

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option('display.max_rows', 30)
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 2)
# Use 2 decimal places instead of scientific notation in pandas
pd.set_option('display.float_format', '{:.2f}'.format)

# Silence some spurious seaborn warnings
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)





🤹‍♀️ File Formats Other than CSV¶

There are many file types for storing structured data: CSV, TSV, JSON, XML, ASCII, SAS...

  • In lecture, we will cover TSV and JSON since pandas supports them out-of-box.

TSV stands for tab-separated values. Think CSV, but with tabs (\t) as the delimiter instead of commas.

  • TSVs can be slightly faster to import than CSVs, since tab characters rarely appear inside fields, so less quoting and escaping is needed.

  • But, historical tradition has led to CSV being the more popular file format.
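As a quick sketch of the round trip between the two formats (using a tiny made-up DataFrame, not the lecture data), pandas can both write and read TSVs by passing sep='\t':

```python
import io

import pandas as pd

# A small example DataFrame (made-up data, for illustration only)
df = pd.DataFrame({"state": ["Alabama", "Alaska"], "cases": [87, 58]})

# to_csv with sep='\t' writes TSV; omitting the path returns a string
tsv_string = df.to_csv(sep="\t", index=False)
print(repr(tsv_string))  # tabs show up as \t

# read_csv with sep='\t' parses the TSV right back into a DataFrame
df_roundtrip = pd.read_csv(io.StringIO(tsv_string), sep="\t")
print(df_roundtrip)
```

The same sep='\t' argument works symmetrically for reading and writing, which is why pandas needs no separate "read_tsv" function.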


Like we did with data/cdc_tuberculosis.csv, you can open data/cdc_tuberculosis.tsv in the JupyterLab Explorer to see its contents.



The pd.read_csv function also reads TSVs if we specify the delimiter with parameter sep='\t' (documentation).

In [3]:
# Identical data to Part 1!
tuberculosis_df_tsv = pd.read_csv("data/cdc_tuberculosis.tsv", sep='\t')
tuberculosis_df_tsv.head()
Out[3]:
Unnamed: 0 No. of TB cases Unnamed: 2 Unnamed: 3
0 U.S. jurisdiction 2019 2020 2021
1 Total 8,900 7,173 7,860
2 Alabama 87 72 92
3 Alaska 58 58 58
4 Arizona 183 136 129
In [4]:
# CSV representation
with open("data/cdc_tuberculosis.csv", "r") as f:
    for _, row in zip(range(4), f):
        print(row) # print raw strings
,No. of TB cases,,

U.S. jurisdiction,2019,2020,2021

Total,"8,900","7,173","7,860"

Alabama,87,72,92

In [5]:
# TSV representation
with open("data/cdc_tuberculosis.tsv", "r") as f:
    for _, row in zip(range(4), f):
        print(row) # print raw strings
	No. of TB cases		

U.S. jurisdiction	2019	2020	2021

Total	"8,900"	"7,173"	"7,860"

Alabama	87	72	92

In [6]:
# Print literal \t instead of tabbed spaces:
with open("data/cdc_tuberculosis.tsv", "r") as f:
    for _, row in zip(range(4), f):
        print(repr(row)) # print raw strings
'\tNo. of TB cases\t\t\n'
'U.S. jurisdiction\t2019\t2020\t2021\n'
'Total\t"8,900"\t"7,173"\t"7,860"\n'
'Alabama\t87\t72\t92\n'



Instructor Note: Return to Slides!




🪆 JSON (JavaScript Object Notation)¶

The congress.gov API (Application Programming Interface) provides data about the activities and members of the United States Congress (i.e., the House of Representatives and the Senate).

  • Click the link above to see the kinds of information provided by the API.

To get a JSON file containing information about the current members of Congress from California, you could use the following API call:

  • https://api.congress.gov/v3/member/CA?api_key=[INSERT_KEY]&limit=250&format=json&currentMember=True

  • You can instantly sign up for a congress.gov API key here. Once you have your key, replace [INSERT_KEY] above with your key, and enter the API call as a URL in your browser. What happens?

  • Once the JSON from the API call is visible in your browser, you can click File --> Save Page As to save the JSON file to your computer.

  • Broadly speaking, API keys are used to track how much a given user engages with the API. There might be limits on the number of API calls (e.g., the congress.gov API limits users to 5,000 calls per hour), or a cost per API call (e.g., using the OpenAI API to interact with ChatGPT programmatically).
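The API call above can also be assembled programmatically. A minimal sketch, which only builds the URL string (the key below is a placeholder — substitute your own, and fetch the URL with a library like requests if you want the JSON directly):

```python
from urllib.parse import urlencode

# Placeholder -- replace with your own congress.gov API key
API_KEY = "INSERT_KEY"

params = {
    "api_key": API_KEY,
    "limit": 250,
    "format": "json",
    "currentMember": "True",
}

# Assemble the same API call shown above as a URL string
url = "https://api.congress.gov/v3/member/CA?" + urlencode(params)
print(url)
```

Building the query string with urlencode (rather than concatenating by hand) takes care of escaping any special characters in the parameter values.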


For convenience, the JSON file from the call above has already been downloaded for you and is saved at data/ca-congress-members.json.

📁 File contents¶

Let's look at data/ca-congress-members.json in the JupyterLab Explorer.

  • Right-click the file, and then click Open With --> Editor.

  • You'll notice that JSON looks a lot like a Python dictionary.

  • Berkeley, CA is in the 12th district. Can you find our representative in Congress?

Note: In general, it's a good idea to check the file size before opening a file in JupyterLab. Very large files can cause crashes. See os.path.getsize documentation.
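As a sketch of that file-size check (written against a temporary file so it runs anywhere — in the notebook you would pass data/ca-congress-members.json instead):

```python
import os
import tempfile

# Write a small temporary file so this example is self-contained;
# in practice you would check an existing file like the JSON above.
with tempfile.NamedTemporaryFile(mode="w", suffix=".json", delete=False) as f:
    f.write('{"members": []}')
    path = f.name

# os.path.getsize returns the file size in bytes
size_bytes = os.path.getsize(path)
print(f"{path}: {size_bytes} bytes")

os.remove(path)
```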

We can programmatically view the first couple lines of the file using the same functions we used with CSVs:

In [7]:
congress_file = "data/ca-congress-members.json"

# Inspect the first five lines of the file
with open(congress_file, "r") as f:
    for i, row in enumerate(f):
        print(row)
        if i >= 4: break
{

    "members": [

        {

            "bioguideId": "T000491",

            "depiction": {

🐍 EDA: Digging into JSON with Python¶

JSON data closely matches the internal Python object model.

  • In the following cell, we import the entire JSON datafile into a Python dictionary using the json package.
In [8]:
import json

# Import the JSON file into Python as a dictionary
with open(congress_file, "rb") as f:
    congress_json = json.load(f)

type(congress_json)
Out[8]:
dict

The congress_json variable is a dictionary encoding the data in the JSON file.

Below, we access the first element of the members element of the congress_json dictionary.

  • This first element is also a dictionary (and there are more dictionaries inside of it!)
In [9]:
# Grab the list corresponding to the `members` key in the JSON dictionary, 
# and then grab the first element of this list.
# In a moment, we'll see how we knew to use the key `members`, and that
# the resulting object is a list.
congress_json['members'][0]
Out[9]:
{'bioguideId': 'T000491',
 'depiction': {'attribution': 'Image courtesy of the Member',
  'imageUrl': 'https://www.congress.gov/img/member/6774606d0b34857ecc9091a9_200.jpg'},
 'district': 45,
 'name': 'Tran, Derek',
 'partyName': 'Democratic',
 'state': 'California',
 'terms': {'item': [{'chamber': 'House of Representatives',
    'startYear': 2025}]},
 'updateDate': '2025-01-21T18:00:52Z',
 'url': 'https://api.congress.gov/v3/member/T000491?format=json'}

How should we probe a nested dictionary like congress_json?

We can start by identifying the top-level keys of the dictionary:

In [10]:
# Grab the top-level keys of the JSON dictionary
congress_json.keys()
Out[10]:
dict_keys(['members', 'pagination', 'request'])

Looks like we have three top-level keys: members, pagination, and request.

You'll often see a top-level meta key in JSON files. This does not refer to Meta (formerly Facebook). Instead, it typically refers to metadata (data about the data). Metadata are often maintained alongside the data.

Let's check the type of the members element:

In [11]:
type(congress_json['members'])
Out[11]:
list

Looks like a list! What are the first two elements?

In [12]:
congress_json['members'][:2]
Out[12]:
[{'bioguideId': 'T000491',
  'depiction': {'attribution': 'Image courtesy of the Member',
   'imageUrl': 'https://www.congress.gov/img/member/6774606d0b34857ecc9091a9_200.jpg'},
  'district': 45,
  'name': 'Tran, Derek',
  'partyName': 'Democratic',
  'state': 'California',
  'terms': {'item': [{'chamber': 'House of Representatives',
     'startYear': 2025}]},
  'updateDate': '2025-01-21T18:00:52Z',
  'url': 'https://api.congress.gov/v3/member/T000491?format=json'},
 {'bioguideId': 'M001241',
  'depiction': {'attribution': 'Image courtesy of the Member',
   'imageUrl': 'https://www.congress.gov/img/member/67744ed90b34857ecc909155_200.jpg'},
  'district': 47,
  'name': 'Min, Dave',
  'partyName': 'Democratic',
  'state': 'California',
  'terms': {'item': [{'chamber': 'House of Representatives',
     'startYear': 2025}]},
  'updateDate': '2025-01-21T18:00:52Z',
  'url': 'https://api.congress.gov/v3/member/M001241?format=json'}]

More dictionaries! You can repeat the process above to traverse the nested dictionary.

You'll notice that each record of congress_json['members'] looks like it could be a row of a DataFrame.

  • The keys look a lot like column names, and the values could be the entries in each row.

But, the two other elements of congress_json don't have the same structure as congress_json['members'].

  • So, they probably don't belong in a DataFrame containing the members of Congress from CA.

  • We'll see the implications of this inconsistency in the next section.
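One way to probe an unfamiliar nested structure without printing everything is a small recursive helper. This is a sketch, not part of the lecture code, and the sample dictionary below just mimics the shape of congress_json:

```python
def structure(obj, indent=0):
    """Return lines describing the keys and container types of nested JSON data."""
    lines = []
    prefix = "  " * indent
    if isinstance(obj, dict):
        for key, value in obj.items():
            lines.append(f"{prefix}{key}: {type(value).__name__}")
            lines.extend(structure(value, indent + 1))
    elif isinstance(obj, list) and obj:
        # Describe only the first element; JSON lists are usually homogeneous
        lines.append(f"{prefix}[list of {len(obj)} items]")
        lines.extend(structure(obj[0], indent + 1))
    return lines

# A small example mimicking the shape of congress_json
sample = {"members": [{"name": "Tran, Derek", "district": 45}],
          "pagination": {"count": 54}}
print("\n".join(structure(sample)))
```

Running structure(congress_json) would summarize the whole file's skeleton in a few lines instead of pages of output.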

In [13]:
print(congress_json['pagination'])
print(congress_json['request'])
{'count': 54}
{'contentType': 'application/json', 'format': 'json'}

🐼 JSON with pandas¶

pandas has a built-in function, pd.read_json, for reading JSON files.

  • Uncomment the code below and see what happens.
In [14]:
# pd.read_json(congress_file)

Uh oh. Using the default parameters, we got an error.

  • The code above tries to import the entire JSON file located at congress_file (congress_json), including congress_json['pagination'] and congress_json['request'].

  • We only want to make a DataFrame out of congress_json['members'].

This time, let's try converting the members element of congress_json to a DataFrame by using pd.DataFrame:

In [15]:
# Convert dictionary to DataFrame
congress_df = pd.DataFrame(congress_json['members'])
congress_df.head()
Out[15]:
bioguideId depiction district name partyName state terms updateDate url
0 T000491 {'attribution': 'Image courtesy of the Member'... 45.00 Tran, Derek Democratic California {'item': [{'chamber': 'House of Representative... 2025-01-21T18:00:52Z https://api.congress.gov/v3/member/T000491?for...
1 M001241 {'attribution': 'Image courtesy of the Member'... 47.00 Min, Dave Democratic California {'item': [{'chamber': 'House of Representative... 2025-01-21T18:00:52Z https://api.congress.gov/v3/member/M001241?for...
2 K000400 {'attribution': 'Image courtesy of the Member'... 37.00 Kamlager-Dove, Sydney Democratic California {'item': [{'chamber': 'House of Representative... 2025-01-21T18:00:52Z https://api.congress.gov/v3/member/K000400?for...
3 G000598 {'attribution': 'Image courtesy of the Member'... 42.00 Garcia, Robert Democratic California {'item': [{'chamber': 'House of Representative... 2025-01-21T18:00:52Z https://api.congress.gov/v3/member/G000598?for...
4 K000397 {'attribution': 'Image courtesy of the Member'... 40.00 Kim, Young Republican California {'item': [{'chamber': 'House of Representative... 2025-01-21T18:00:52Z https://api.congress.gov/v3/member/K000397?for...

We've successfully begun to rectangularize our JSON data!
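Notice that the depiction and terms columns still contain dictionaries. One hedged sketch of flattening them uses pd.json_normalize, shown here on two made-up records that mimic the shape of congress_json['members'] (the URLs are placeholders):

```python
import pandas as pd

# Two records mimicking the shape of congress_json['members'] (sample data)
members = [
    {"bioguideId": "T000491", "name": "Tran, Derek",
     "depiction": {"attribution": "Image courtesy of the Member",
                   "imageUrl": "https://example.com/T000491.jpg"}},
    {"bioguideId": "M001241", "name": "Min, Dave",
     "depiction": {"attribution": "Image courtesy of the Member",
                   "imageUrl": "https://example.com/M001241.jpg"}},
]

# json_normalize flattens nested dictionaries into dotted column names,
# e.g. depiction.attribution and depiction.imageUrl
flat = pd.json_normalize(members)
print(flat.columns.tolist())
```

Passing congress_json['members'] to pd.json_normalize would flatten the real data the same way, one level of nesting per dot in the column name.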




Instructor Note: Return to Slides!



🕰️ Temporality¶

Let's briefly look at how we can use pandas dt accessors to work with dates/times in a dataset.

We will use the Berkeley Police Department (PD) Calls for Service dataset.

In [16]:
calls = pd.read_csv("data/Berkeley_PD_-_Calls_for_Service.csv")
calls.head()
Out[16]:
CASENO OFFENSE EVENTDT EVENTTM CVLEGEND CVDOW InDbDate Block_Location BLKADDR City State
0 21014296 THEFT MISD. (UNDER $950) 04/01/2021 12:00:00 AM 10:58 LARCENY 4 06/15/2021 12:00:00 AM Berkeley, CA\n(37.869058, -122.270455) NaN Berkeley CA
1 21014391 THEFT MISD. (UNDER $950) 04/01/2021 12:00:00 AM 10:38 LARCENY 4 06/15/2021 12:00:00 AM Berkeley, CA\n(37.869058, -122.270455) NaN Berkeley CA
2 21090494 THEFT MISD. (UNDER $950) 04/19/2021 12:00:00 AM 12:15 LARCENY 1 06/15/2021 12:00:00 AM 2100 BLOCK HASTE ST\nBerkeley, CA\n(37.864908,... 2100 BLOCK HASTE ST Berkeley CA
3 21090204 THEFT FELONY (OVER $950) 02/13/2021 12:00:00 AM 17:00 LARCENY 6 06/15/2021 12:00:00 AM 2600 BLOCK WARRING ST\nBerkeley, CA\n(37.86393... 2600 BLOCK WARRING ST Berkeley CA
4 21090179 BURGLARY AUTO 02/08/2021 12:00:00 AM 6:20 BURGLARY - VEHICLE 1 06/15/2021 12:00:00 AM 2700 BLOCK GARBER ST\nBerkeley, CA\n(37.86066,... 2700 BLOCK GARBER ST Berkeley CA

Looks like there are three columns with dates/times: EVENTDT, EVENTTM, and InDbDate.

  • EVENTDT is the date the event took place

  • EVENTTM is the time of day the event took place (in 24-hour format)

  • InDbDate is the date the call was entered into the database.

We can convert these string columns to datetime objects using the pd.to_datetime function.

In [17]:
# pd.to_datetime() is smart -- It can often infer what you want based on
# the format of the datetime string.
# But, it's not always right! It's good practice to specify the format of 
# your datetimes. See the documentation and the `format` argument.

# Without format specified:
calls["EVENTDT"] = pd.to_datetime(calls["EVENTDT"])

# With format specified:
# calls["EVENTDT"] = pd.to_datetime(calls["EVENTDT"], format='%m/%d/%Y %I:%M:%S %p')

calls.head()
/tmp/ipykernel_99/2275392033.py:7: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
  calls["EVENTDT"] = pd.to_datetime(calls["EVENTDT"])
Out[17]:
CASENO OFFENSE EVENTDT EVENTTM CVLEGEND CVDOW InDbDate Block_Location BLKADDR City State
0 21014296 THEFT MISD. (UNDER $950) 2021-04-01 10:58 LARCENY 4 06/15/2021 12:00:00 AM Berkeley, CA\n(37.869058, -122.270455) NaN Berkeley CA
1 21014391 THEFT MISD. (UNDER $950) 2021-04-01 10:38 LARCENY 4 06/15/2021 12:00:00 AM Berkeley, CA\n(37.869058, -122.270455) NaN Berkeley CA
2 21090494 THEFT MISD. (UNDER $950) 2021-04-19 12:15 LARCENY 1 06/15/2021 12:00:00 AM 2100 BLOCK HASTE ST\nBerkeley, CA\n(37.864908,... 2100 BLOCK HASTE ST Berkeley CA
3 21090204 THEFT FELONY (OVER $950) 2021-02-13 17:00 LARCENY 6 06/15/2021 12:00:00 AM 2600 BLOCK WARRING ST\nBerkeley, CA\n(37.86393... 2600 BLOCK WARRING ST Berkeley CA
4 21090179 BURGLARY AUTO 2021-02-08 6:20 BURGLARY - VEHICLE 1 06/15/2021 12:00:00 AM 2700 BLOCK GARBER ST\nBerkeley, CA\n(37.86066,... 2700 BLOCK GARBER ST Berkeley CA

Now we can use the dt accessor on this column.

We can get the month:

In [18]:
# 1 - January, 2 - February, ..., 12 - December
calls["EVENTDT"].dt.month
Out[18]:
0        4
1        4
2        4
3        2
4        2
        ..
2627    12
2628     2
2629     3
2630     4
2631     2
Name: EVENTDT, Length: 2632, dtype: int32

Which day of the week the date is on:

In [19]:
# 0 - Monday, 1 - Tuesday, ..., 6 - Sunday
calls["EVENTDT"].dt.dayofweek
Out[19]:
0       3
1       3
2       0
3       5
4       0
       ..
2627    0
2628    2
2629    2
2630    5
2631    4
Name: EVENTDT, Length: 2632, dtype: int32

We can also sort by datetime:

In [20]:
# Sort the DataFrame by datetime to find the earliest call.
calls.sort_values("EVENTDT").head()
Out[20]:
CASENO OFFENSE EVENTDT EVENTTM CVLEGEND CVDOW InDbDate Block_Location BLKADDR City State
2513 20057398 BURGLARY COMMERCIAL 2020-12-17 16:05 BURGLARY - COMMERCIAL 4 06/15/2021 12:00:00 AM 600 BLOCK GILMAN ST\nBerkeley, CA\n(37.878405,... 600 BLOCK GILMAN ST Berkeley CA
624 20057207 ASSAULT/BATTERY MISD. 2020-12-17 16:50 ASSAULT 4 06/15/2021 12:00:00 AM 2100 BLOCK SHATTUCK AVE\nBerkeley, CA\n(37.871... 2100 BLOCK SHATTUCK AVE Berkeley CA
154 20092214 THEFT FROM AUTO 2020-12-17 18:30 LARCENY - FROM VEHICLE 4 06/15/2021 12:00:00 AM 800 BLOCK SHATTUCK AVE\nBerkeley, CA\n(37.8918... 800 BLOCK SHATTUCK AVE Berkeley CA
659 20057324 THEFT MISD. (UNDER $950) 2020-12-17 15:44 LARCENY 4 06/15/2021 12:00:00 AM 1800 BLOCK 4TH ST\nBerkeley, CA\n(37.869888, -... 1800 BLOCK 4TH ST Berkeley CA
993 20057573 BURGLARY RESIDENTIAL 2020-12-17 22:15 BURGLARY - RESIDENTIAL 4 06/15/2021 12:00:00 AM 1700 BLOCK STUART ST\nBerkeley, CA\n(37.857495... 1700 BLOCK STUART ST Berkeley CA

We can also do many things with the dt accessor like switching time zones and converting time back to UNIX/POSIX time. Check out the documentation on .dt accessor and time series/date functionality.
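As a minimal sketch of the time-zone operations mentioned above (using two made-up timestamps, not the calls data): naive datetimes carry no time zone, so you localize them first with dt.tz_localize, then convert with dt.tz_convert.

```python
import pandas as pd

# A tiny example Series of naive timestamps (sample data)
times = pd.to_datetime(pd.Series(["2021-04-01 10:58", "2021-02-08 06:20"]))

# Attach a time zone to the naive timestamps, then convert to another zone
utc_times = times.dt.tz_localize("UTC")
pacific_times = utc_times.dt.tz_convert("US/Pacific")
print(pacific_times)
```

Note that the two timestamps end up with different UTC offsets (-07:00 and -08:00) because the conversion accounts for daylight saving time.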

What type are datetime objects?

In [21]:
calls["EVENTDT"].dtype
Out[21]:
dtype('<M8[ns]')

ns above stands for nanoseconds.

  • <M8 refers to the NumPy type datetime64

Under the hood, datetimes in pandas are stored as integers representing the number of nanoseconds since 1/1/1970 (the Unix epoch) UTC.

In [22]:
# datetimes in pandas are stored as integers representing number of 
# nanoseconds since 1970-01-01
calls["EVENTDT"].astype(int)
Out[22]:
0       1617235200000000000
1       1617235200000000000
2       1618790400000000000
3       1613174400000000000
4       1612742400000000000
               ...         
2627    1608508800000000000
2628    1614124800000000000
2629    1616544000000000000
2630    1619222400000000000
2631    1614297600000000000
Name: EVENTDT, Length: 2632, dtype: int64




Instructor Note: Return to Slides!



🤷 Faithfulness and missing values¶

To conclude, let's very briefly explore missingness in the Berkeley PD Calls for Service dataset.

Looking at the top of the dataframe, we can already see that there are missing values in the BLKADDR column.

In [23]:
calls.head()
Out[23]:
CASENO OFFENSE EVENTDT EVENTTM CVLEGEND CVDOW InDbDate Block_Location BLKADDR City State
0 21014296 THEFT MISD. (UNDER $950) 2021-04-01 10:58 LARCENY 4 06/15/2021 12:00:00 AM Berkeley, CA\n(37.869058, -122.270455) NaN Berkeley CA
1 21014391 THEFT MISD. (UNDER $950) 2021-04-01 10:38 LARCENY 4 06/15/2021 12:00:00 AM Berkeley, CA\n(37.869058, -122.270455) NaN Berkeley CA
2 21090494 THEFT MISD. (UNDER $950) 2021-04-19 12:15 LARCENY 1 06/15/2021 12:00:00 AM 2100 BLOCK HASTE ST\nBerkeley, CA\n(37.864908,... 2100 BLOCK HASTE ST Berkeley CA
3 21090204 THEFT FELONY (OVER $950) 2021-02-13 17:00 LARCENY 6 06/15/2021 12:00:00 AM 2600 BLOCK WARRING ST\nBerkeley, CA\n(37.86393... 2600 BLOCK WARRING ST Berkeley CA
4 21090179 BURGLARY AUTO 2021-02-08 6:20 BURGLARY - VEHICLE 1 06/15/2021 12:00:00 AM 2700 BLOCK GARBER ST\nBerkeley, CA\n(37.86066,... 2700 BLOCK GARBER ST Berkeley CA

We can use the .isna() method to get a sense of how often values in BLKADDR are missing.

  • The .isnull() method is functionally equivalent.
In [24]:
# isna() returns a Series of booleans indicating whether each element
# in the Series is missing.
print(calls['BLKADDR'].isna().head())

# The mean of a Series of booleans is the proportion of booleans that are True.
calls['BLKADDR'].isna().mean()
0     True
1     True
2    False
3    False
4    False
Name: BLKADDR, dtype: bool
Out[24]:
0.007598784194528876

It looks like missing values are actually quite rare: only about 0.8% of records are missing a value in BLKADDR.

Why are these values missing?

  • Again, looking at just the first few rows, we see that NaN values in BLKADDR appear to be accompanied by latitude/longitude coordinates in the Block_Location column.

  • In all likelihood, missing values in BLKADDR correspond to locations that do not have a defined address in the officer's navigation or GPS system.


The best default approach here: Leave the rows with missing BLKADDR untouched, or replace the NaN values with a MISSING indicator.

  • In the future, if we wanted to conduct an analysis of the streets where police incidents were most common, we might impute BLKADDR by using the nearest street, which we could identify with an external package.
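As a sketch of the MISSING-indicator approach (on a tiny made-up sample, not the real calls data), fillna does the replacement in one step:

```python
import numpy as np
import pandas as pd

# A small sample mimicking the BLKADDR column (made-up data)
blkaddr = pd.Series(["2100 BLOCK HASTE ST", np.nan, "1800 BLOCK 4TH ST"])

# Replace missing values with an explicit MISSING indicator
filled = blkaddr.fillna("MISSING")
print(filled)
```

Unlike silently dropping rows, an explicit indicator keeps the missingness visible in later groupbys and counts.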

For a very rough sense of missingness in each column of a DataFrame, you can use the info() method.

  • Based on the output, it looks like the only column with missing values is BLKADDR.
In [25]:
# You can see the total number of rows at the top of the .info() output.
# Compare this to the number of non-null values in the BLKADDR column.
calls.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2632 entries, 0 to 2631
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   CASENO          2632 non-null   int64         
 1   OFFENSE         2632 non-null   object        
 2   EVENTDT         2632 non-null   datetime64[ns]
 3   EVENTTM         2632 non-null   object        
 4   CVLEGEND        2632 non-null   object        
 5   CVDOW           2632 non-null   int64         
 6   InDbDate        2632 non-null   object        
 7   Block_Location  2632 non-null   object        
 8   BLKADDR         2612 non-null   object        
 9   City            2632 non-null   object        
 10  State           2632 non-null   object        
dtypes: datetime64[ns](1), int64(2), object(8)
memory usage: 226.3+ KB




Instructor Note: Return to Slides!