import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#%matplotlib inline
plt.rcParams['figure.figsize'] = (12, 9)
sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option('display.max_rows', 30)
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
pd.set_option('display.float_format', '{:.2f}'.format)
# Silence some spurious seaborn warnings
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
Structure: Other File Formats
There are many file types for storing structured data: CSV, TSV, JSON, XML, ASCII, SAS...
- Documentation will be your best friend for understanding how to process many of these file types.
- In lecture, we will cover TSV and JSON, since pandas supports them out of the box.
The pd.read_csv function also reads in TSVs if we specify the delimiter with the parameter sep='\t' (documentation).
tuberculosis_df_tsv = pd.read_csv("data/cdc_tuberculosis.tsv", sep='\t')
tuberculosis_df_tsv.head()
 | Unnamed: 0 | No. of TB cases | Unnamed: 2 | Unnamed: 3 | TB incidence | Unnamed: 5 | Unnamed: 6 |
---|---|---|---|---|---|---|---|
0 | U.S. jurisdiction | 2019 | 2020 | 2021 | 2019.00 | 2020.00 | 2021.00 |
1 | Total | 8,900 | 7,173 | 7,860 | 2.71 | 2.16 | 2.37 |
2 | Alabama | 87 | 72 | 92 | 1.77 | 1.43 | 1.83 |
3 | Alaska | 58 | 58 | 58 | 7.91 | 7.92 | 7.92 |
4 | Arizona | 183 | 136 | 129 | 2.51 | 1.89 | 1.77 |
with open("data/cdc_tuberculosis.tsv", "r") as f:
for _, row in zip(range(4), f):
print(repr(row)) # print raw strings
'\tNo. of TB cases\t\t\tTB incidence\t\t\n'
'U.S. jurisdiction\t2019\t2020\t2021\t2019\t2020\t2021\n'
'Total\t"8,900"\t"7,173"\t"7,860"\t2.71\t2.16\t2.37\n'
'Alabama\t87\t72\t92\t1.77\t1.43\t1.83\n'
Return to Slides
JSON
The City of Berkeley Open Data website has a dataset with COVID-19 Confirmed Cases among Berkeley residents by date.
Let's first check out this website.
Next, let's download this file, saving it as JSON (note the file type in the source URL).
In the interest of reproducible data science, we will download the data programmatically. We have defined some helper functions in the ds100_utils.py file, which we can then reuse in many different notebooks.
# just run this cell
from ds100_utils import fetch_and_cache
Occasionally, you will want to modify code that you have imported from a local Python library. To re-import those modifications, you can either use the Python importlib library:
import ds100_utils
from importlib import reload
reload(ds100_utils)
or use IPython magic, which will intelligently re-import code when files change:
%load_ext autoreload
%autoreload 2
covid_file = fetch_and_cache(
    "https://data.cityofberkeley.info/api/views/xn6j-b766/rows.json?accessType=DOWNLOAD",
    "confirmed-cases.json",
    force=False)
covid_file # a file path wrapper object
Using cached version that was downloaded (UTC): Mon Aug 26 16:18:21 2024
PosixPath('data/confirmed-cases.json')
File size
Often, I like to start my analysis by getting a rough estimate of the size of the data. This will help inform the tools I use and how I view the data. If it is relatively small, I might use a text editor or a spreadsheet to look at the data. If it is larger, I might jump to more programmatic exploration or even use distributed computing tools.
However, here we will use Python tools to probe the file.
Since these seem to be text files, I might also want to investigate the number of lines, which often corresponds to the number of records.
import os
print(covid_file, "is", os.path.getsize(covid_file) / 1e6, "MB")
data/confirmed-cases.json is 0.205583 MB
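We can also count the lines directly in Python; here is a small sketch, which should agree with the wc -l output below:
# Count records by streaming over the file one line at a time
with open(covid_file, "r") as f:
    line_count = sum(1 for _ in f)
print(covid_file, "has", line_count, "lines")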
As part of your workflow, you should also learn some basic Unix commands, as these are often very handy (in fact, there's an entire book called "Data Science at the Command Line" that explores this idea in depth!).
In Jupyter/IPython, you can prefix lines with !
to execute arbitrary Unix commands, and within those lines, you can refer to Python variables and expressions with the syntax {expr}
.
Here, we use the ls command to list files, using the -lh flags, which request "long format with information in human-readable form". We also use the wc command for "word count", but with the -l flag, which asks for line counts instead of words.
These two give us the same information as the code above, albeit in a slightly different form:
!ls -lh {covid_file}
!wc -l {covid_file}
-rw-r--r--  1 jegonzal  staff   201K Aug 26 16:18 data/confirmed-cases.json
    1706 data/confirmed-cases.json
File contents
Because we have a text file in a visual IDE like Jupyter/DataHub, I'm going to visually explore the data via the built-in file explorer.
To the Jupyter view!
To the Python view...?
with open(covid_file, "r") as f:
    for i, row in enumerate(f):
        print(repr(row))  # print raw strings
        if i >= 4: break
'{\n'
'  "meta" : {\n'
'    "view" : {\n'
'      "id" : "xn6j-b766",\n'
'      "name" : "COVID-19 Confirmed Cases",\n'
In the same vein, we can use the head Unix command (which is where pandas' head method comes from!) to see the first few lines of the file:
!head -5 {covid_file}
{ "meta" : { "view" : { "id" : "xn6j-b766", "name" : "COVID-19 Confirmed Cases",
Back to the Python view.
In order to load the JSON file into pandas, let's first do some EDA with the Python json package to understand the particular structure of this JSON file, so that we can decide what (if anything) to load into pandas.
EDA: Digging into JSON
Python has relatively good support for JSON data, since it closely matches the internal Python object model. In the following cell we import the entire JSON data file into a Python dictionary using the json package.
import json
with open(covid_file, "rb") as f:
    covid_json = json.load(f)
The covid_json variable is now a dictionary encoding the data in the file:
type(covid_json)
dict
Examine what keys are in the top-level JSON object
We can list the keys to determine what data is stored in the object.
covid_json.keys()
dict_keys(['meta', 'data'])
Observation: The JSON dictionary contains a meta key, which likely refers to metadata (data about the data). Metadata is often maintained with the data and can be a good source of additional information.
We can investigate the metadata further by examining the keys associated with it.
covid_json['meta'].keys()
dict_keys(['view'])
The meta key contains another dictionary called view. This likely refers to metadata about a particular "view" of some underlying database. We will learn more about views when we study SQL later in the class.
covid_json['meta']['view'].keys()
dict_keys(['id', 'name', 'assetType', 'attribution', 'averageRating', 'category', 'createdAt', 'description', 'displayType', 'downloadCount', 'hideFromCatalog', 'hideFromDataJson', 'newBackend', 'numberOfComments', 'oid', 'provenance', 'publicationAppendEnabled', 'publicationDate', 'publicationGroup', 'publicationStage', 'rowsUpdatedAt', 'rowsUpdatedBy', 'tableId', 'totalTimesRated', 'viewCount', 'viewLastModified', 'viewType', 'approvals', 'clientContext', 'columns', 'grants', 'metadata', 'owner', 'query', 'rights', 'tableAuthor', 'tags', 'flags'])
Notice that this is a nested/recursive data structure. As we dig deeper, we reveal more and more keys and the corresponding data:
covid_json
|-> meta
|   |-> view
|       |-> id
|       |-> name
|       |-> attribution
|       ...
|       |-> description
|       ...
|       |-> columns
|       ...
|-> data
    ... (haven't explored yet)
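As a small sketch of this idea, we can walk the dictionary recursively and print the nested keys, with the depth capped so the output stays readable (print_keys is a helper we define here, not part of any library):
def print_keys(d, depth=0, max_depth=1):
    # Recursively print nested dictionary keys, indented by depth
    if isinstance(d, dict) and depth <= max_depth:
        for key in d:
            print("    " * depth + str(key))
            print_keys(d[key], depth + 1, max_depth)

print_keys(covid_json)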
There is a key called description in the view sub-dictionary. This likely contains a description of the data:
print(covid_json['meta']['view']['description'])
Counts of confirmed COVID-19 cases among Berkeley residents by date. As of 6/21/22, this dataset will be updated weekly instead of daily. As of 11/14/22, this dataset only includes PCR cases.
Examining the Data Field for Records
We can look at a few entries in the data field. This is what we'll load into pandas.
for i in range(3):
    print(f"{i:03} | {covid_json['data'][i]}")
000 | ['row-yeww~gvnm.g68w', '00000000-0000-0000-386C-533081785A3A', 0, 1687370769, None, 1687370769, None, '{ }', '2019-12-01T00:00:00', '0', '0']
001 | ['row-fsx2.86py_bzpc', '00000000-0000-0000-FF2C-E5BC42CD0EE9', 0, 1687370769, None, 1687370769, None, '{ }', '2019-12-02T00:00:00', '0', '0']
002 | ['row-ggah~5w7b~ce8n', '00000000-0000-0000-6EFB-E68400EBBD1E', 0, 1687370769, None, 1687370769, None, '{ }', '2019-12-03T00:00:00', '0', '0']
Observations:
- These look like equal-length records, so maybe data is a table!
- But what does each value in a record mean? Where can we find column headers?
Back to the metadata.
Columns Metadata
Another potentially useful key in the metadata dictionary is columns. This returns a list:
covid_json['meta']['view']['columns']
[{'id': -1, 'name': 'sid', 'dataTypeName': 'meta_data', 'fieldName': ':sid', 'position': 0, 'renderTypeName': 'meta_data', 'format': {}, 'flags': ['hidden']}, {'id': -1, 'name': 'id', 'dataTypeName': 'meta_data', 'fieldName': ':id', 'position': 0, 'renderTypeName': 'meta_data', 'format': {}, 'flags': ['hidden']}, {'id': -1, 'name': 'position', 'dataTypeName': 'meta_data', 'fieldName': ':position', 'position': 0, 'renderTypeName': 'meta_data', 'format': {}, 'flags': ['hidden']}, {'id': -1, 'name': 'created_at', 'dataTypeName': 'meta_data', 'fieldName': ':created_at', 'position': 0, 'renderTypeName': 'meta_data', 'format': {}, 'flags': ['hidden']}, {'id': -1, 'name': 'created_meta', 'dataTypeName': 'meta_data', 'fieldName': ':created_meta', 'position': 0, 'renderTypeName': 'meta_data', 'format': {}, 'flags': ['hidden']}, {'id': -1, 'name': 'updated_at', 'dataTypeName': 'meta_data', 'fieldName': ':updated_at', 'position': 0, 'renderTypeName': 'meta_data', 'format': {}, 'flags': ['hidden']}, {'id': -1, 'name': 'updated_meta', 'dataTypeName': 'meta_data', 'fieldName': ':updated_meta', 'position': 0, 'renderTypeName': 'meta_data', 'format': {}, 'flags': ['hidden']}, {'id': -1, 'name': 'meta', 'dataTypeName': 'meta_data', 'fieldName': ':meta', 'position': 0, 'renderTypeName': 'meta_data', 'format': {}, 'flags': ['hidden']}, {'id': 542388895, 'name': 'Date', 'dataTypeName': 'calendar_date', 'description': 'Dates reflect the date created in CalREDIE. ', 'fieldName': 'date', 'position': 1, 'renderTypeName': 'calendar_date', 'tableColumnId': 138291427, 'cachedContents': {'non_null': '1297', 'largest': '2023-06-19T00:00:00.000', 'null': '0', 'top': [{'item': '2022-05-03T00:00:00.000', 'count': '1'}, {'item': '2021-06-04T00:00:00.000', 'count': '1'}, {'item': '2020-10-28T00:00:00.000', 'count': '1'}, {'item': '2021-05-04T00:00:00.000', 'count': '1'}, {'item': '2020-01-13T00:00:00.000', 'count': '1'}, {'item': '2022-03-12T00:00:00.000', 'count': '1'}, {'item': '2022-05-13T00:00:00.000', 'count': '1'}, {'item': '2021-12-15T00:00:00.000', 'count': '1'}, {'item': '2020-07-10T00:00:00.000', 'count': '1'}, {'item': '2021-09-11T00:00:00.000', 'count': '1'}, {'item': '2020-08-11T00:00:00.000', 'count': '1'}, {'item': '2020-04-19T00:00:00.000', 'count': '1'}, {'item': '2021-11-11T00:00:00.000', 'count': '1'}, {'item': '2023-03-02T00:00:00.000', 'count': '1'}, {'item': '2023-03-14T00:00:00.000', 'count': '1'}, {'item': '2022-10-30T00:00:00.000', 'count': '1'}, {'item': '2021-10-15T00:00:00.000', 'count': '1'}, {'item': '2020-11-08T00:00:00.000', 'count': '1'}, {'item': '2022-05-11T00:00:00.000', 'count': '1'}, {'item': '2021-01-28T00:00:00.000', 'count': '1'}], 'smallest': '2019-12-01T00:00:00.000', 'count': '1297', 'cardinality': '1297'}, 'format': {'view': 'date'}}, {'id': 542388893, 'name': 'New Cases', 'dataTypeName': 'number', 'description': 'Total number of new cases reported by date created in CalREDIE. 
', 'fieldName': 'bklhj_newcases', 'position': 2, 'renderTypeName': 'number', 'tableColumnId': 98765830, 'cachedContents': {'non_null': '1297', 'largest': '325', 'null': '0', 'top': [{'item': '0', 'count': '144'}, {'item': '1', 'count': '88'}, {'item': '2', 'count': '85'}, {'item': '4', 'count': '74'}, {'item': '3', 'count': '70'}, {'item': '5', 'count': '62'}, {'item': '6', 'count': '56'}, {'item': '7', 'count': '45'}, {'item': '8', 'count': '38'}, {'item': '9', 'count': '37'}, {'item': '12', 'count': '36'}, {'item': '11', 'count': '36'}, {'item': '13', 'count': '34'}, {'item': '10', 'count': '32'}, {'item': '16', 'count': '23'}, {'item': '15', 'count': '22'}, {'item': '17', 'count': '21'}, {'item': '19', 'count': '21'}, {'item': '18', 'count': '20'}, {'item': '14', 'count': '19'}], 'smallest': '0', 'count': '1297', 'cardinality': '112'}, 'format': {}}, {'id': 542388894, 'name': 'Cumulative Cases', 'dataTypeName': 'number', 'description': 'Total number of cumulative cases reported by date created in CalREDIE. ', 'fieldName': 'bklhj_cumulcases', 'position': 3, 'renderTypeName': 'number', 'tableColumnId': 98765829, 'cachedContents': {'non_null': '1297', 'largest': '23390', 'null': '0', 'top': [{'item': '0', 'count': '35'}, {'item': '3', 'count': '21'}, {'item': '6', 'count': '15'}, {'item': '5', 'count': '8'}, {'item': '2', 'count': '6'}, {'item': '7', 'count': '6'}, {'item': '3617', 'count': '5'}, {'item': '68', 'count': '4'}, {'item': '53', 'count': '4'}, {'item': '8', 'count': '4'}, {'item': '58', 'count': '4'}, {'item': '9', 'count': '4'}, {'item': '3620', 'count': '3'}, {'item': '23390', 'count': '3'}, {'item': '100', 'count': '3'}, {'item': '82', 'count': '3'}, {'item': '3629', 'count': '3'}, {'item': '80', 'count': '3'}, {'item': '3613', 'count': '3'}, {'item': '35', 'count': '3'}], 'smallest': '0', 'count': '1297', 'cardinality': '1154'}, 'format': {}}]
Let's go back to the file explorer.
Based on the contents of this key, what are reasonable names for each column in the data table?
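One way to start answering this in code (a quick sketch) is to pull the name field out of each column's metadata entry:
# The human-readable name recorded for each column
[c['name'] for c in covid_json['meta']['view']['columns']]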
You can also get the view that Jupyter provides in the file explorer by using Python. This displays our JSON object as an interactive graphical object with a built-in search box:
from IPython.display import JSON
JSON(covid_json)
<IPython.core.display.JSON object>
Summary of exploring the JSON file
- The above metadata tells us a lot about the columns in the data, including column names, potential data anomalies, and basic statistics.
- Because of its non-tabular structure, JSON makes it easier (than CSV) to create self-documenting data, meaning that information about the data is stored in the same file as the data.
- Self-documenting data can be helpful since it maintains its own description, and these descriptions are more likely to be updated as the data changes.
JSON with pandas
After our above EDA, let's finally go about loading the data (not the metadata) into a pandas dataframe.
In the following block of code we:
1. Translate the JSON records into a dataframe:
   - fields: covid_json['meta']['view']['columns']
   - records: covid_json['data']
2. Remove columns that have no metadata description. This would be a bad idea in general, but here we remove these columns since the above analysis suggests that they are unlikely to contain useful information. (A sketch of this step appears after the output below.)
3. Examine the tail of the table.
# pd.read_json(covid_file)  # won't give us the table we want: the records are nested inside the 'data' field
# Load the data from JSON and assign column titles
covid = pd.DataFrame(
    covid_json['data'],
    columns=[c['name'] for c in covid_json['meta']['view']['columns']])
covid.tail()
 | sid | id | position | created_at | created_meta | updated_at | updated_meta | meta | Date | New Cases | Cumulative Cases |
---|---|---|---|---|---|---|---|---|---|---|---|
1292 | row-zg3v~6bpx~tp8j | 00000000-0000-0000-7710-AECCCEB67A5A | 0 | 1687370769 | None | 1687370769 | None | { } | 2023-06-15T00:00:00 | 2 | 23384 |
1293 | row-wean-73e3.fixn | 00000000-0000-0000-27C0-D21340C3F93F | 0 | 1687370769 | None | 1687370769 | None | { } | 2023-06-16T00:00:00 | 4 | 23388 |
1294 | row-cvez.6nyd.3f6y | 00000000-0000-0000-C1AE-C76310119F30 | 0 | 1687370769 | None | 1687370769 | None | { } | 2023-06-17T00:00:00 | 2 | 23390 |
1295 | row-hzcg-24ra_7ipd | 00000000-0000-0000-D048-D04E71C55E3A | 0 | 1687370769 | None | 1687370769 | None | { } | 2023-06-18T00:00:00 | 0 | 23390 |
1296 | row-8tdn_tuw8_hujn | 00000000-0000-0000-3181-E8EECFBF2249 | 0 | 1687370769 | None | 1687370769 | None | { } | 2023-06-19T00:00:00 | 0 | 23390 |
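The code above covers steps 1 and 3 but leaves the metadata columns in place. As a sketch of step 2 (assuming the covid and covid_json objects defined above), we can keep only the columns whose metadata entries carry a description:
# Keep only columns whose metadata includes a 'description' entry;
# the hidden bookkeeping columns (sid, id, position, ...) have none.
described = [c['name'] for c in covid_json['meta']['view']['columns']
             if 'description' in c]
covid = covid[described]
covid.tail()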
Return to Slides
Temporality
Let's briefly look at how we can use pandas dt accessors to work with dates/times in a dataset.
We will use the dataset from Lab 3: the Berkeley PD Calls for Service dataset.
calls = pd.read_csv("data/Berkeley_PD_-_Calls_for_Service.csv")
calls.head()
 | CASENO | OFFENSE | EVENTDT | EVENTTM | CVLEGEND | CVDOW | InDbDate | Block_Location | BLKADDR | City | State |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 21014296 | THEFT MISD. (UNDER $950) | 04/01/2021 12:00:00 AM | 10:58 | LARCENY | 4 | 06/15/2021 12:00:00 AM | Berkeley, CA\n(37.869058, -122.270455) | NaN | Berkeley | CA |
1 | 21014391 | THEFT MISD. (UNDER $950) | 04/01/2021 12:00:00 AM | 10:38 | LARCENY | 4 | 06/15/2021 12:00:00 AM | Berkeley, CA\n(37.869058, -122.270455) | NaN | Berkeley | CA |
2 | 21090494 | THEFT MISD. (UNDER $950) | 04/19/2021 12:00:00 AM | 12:15 | LARCENY | 1 | 06/15/2021 12:00:00 AM | 2100 BLOCK HASTE ST\nBerkeley, CA\n(37.864908,... | 2100 BLOCK HASTE ST | Berkeley | CA |
3 | 21090204 | THEFT FELONY (OVER $950) | 02/13/2021 12:00:00 AM | 17:00 | LARCENY | 6 | 06/15/2021 12:00:00 AM | 2600 BLOCK WARRING ST\nBerkeley, CA\n(37.86393... | 2600 BLOCK WARRING ST | Berkeley | CA |
4 | 21090179 | BURGLARY AUTO | 02/08/2021 12:00:00 AM | 6:20 | BURGLARY - VEHICLE | 1 | 06/15/2021 12:00:00 AM | 2700 BLOCK GARBER ST\nBerkeley, CA\n(37.86066,... | 2700 BLOCK GARBER ST | Berkeley | CA |
Looks like there are three columns with dates/times: EVENTDT, EVENTTM, and InDbDate.
Most likely, EVENTDT is the date when the event took place, EVENTTM is the time of day the event took place (in 24-hour format), and InDbDate is the date this call was recorded in the database.
If we check the data type of these columns, we will see they are stored as strings. We can convert them to datetime objects using the pandas to_datetime function.
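First, a quick sketch of that check:
# The date/time columns come in as plain strings (dtype 'object')
calls[["EVENTDT", "EVENTTM", "InDbDate"]].dtypes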
calls["EVENTDT"] = pd.to_datetime(calls["EVENTDT"],
format = "%m/%d/%Y %I:%M:%S %p")
calls.head()
 | CASENO | OFFENSE | EVENTDT | EVENTTM | CVLEGEND | CVDOW | InDbDate | Block_Location | BLKADDR | City | State |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 21014296 | THEFT MISD. (UNDER $950) | 2021-04-01 | 10:58 | LARCENY | 4 | 06/15/2021 12:00:00 AM | Berkeley, CA\n(37.869058, -122.270455) | NaN | Berkeley | CA |
1 | 21014391 | THEFT MISD. (UNDER $950) | 2021-04-01 | 10:38 | LARCENY | 4 | 06/15/2021 12:00:00 AM | Berkeley, CA\n(37.869058, -122.270455) | NaN | Berkeley | CA |
2 | 21090494 | THEFT MISD. (UNDER $950) | 2021-04-19 | 12:15 | LARCENY | 1 | 06/15/2021 12:00:00 AM | 2100 BLOCK HASTE ST\nBerkeley, CA\n(37.864908,... | 2100 BLOCK HASTE ST | Berkeley | CA |
3 | 21090204 | THEFT FELONY (OVER $950) | 2021-02-13 | 17:00 | LARCENY | 6 | 06/15/2021 12:00:00 AM | 2600 BLOCK WARRING ST\nBerkeley, CA\n(37.86393... | 2600 BLOCK WARRING ST | Berkeley | CA |
4 | 21090179 | BURGLARY AUTO | 2021-02-08 | 6:20 | BURGLARY - VEHICLE | 1 | 06/15/2021 12:00:00 AM | 2700 BLOCK GARBER ST\nBerkeley, CA\n(37.86066,... | 2700 BLOCK GARBER ST | Berkeley | CA |
Now we can use the dt accessor on this column.
We can get the month:
calls["EVENTDT"].dt.month
0        4
1        4
2        4
3        2
4        2
        ..
2627    12
2628     2
2629     3
2630     4
2631     2
Name: EVENTDT, Length: 2632, dtype: int32
Or which day of the week the date falls on:
calls["EVENTDT"].dt.dayofweek
0       3
1       3
2       0
3       5
4       0
       ..
2627    0
2628    2
2629    2
2630    5
2631    4
Name: EVENTDT, Length: 2632, dtype: int32
Check the minimum values to see if there are any suspicious-looking dates from the 1970s (which would hint at mis-parsed or placeholder UNIX-epoch timestamps):
calls.sort_values("EVENTDT").head()
 | CASENO | OFFENSE | EVENTDT | EVENTTM | CVLEGEND | CVDOW | InDbDate | Block_Location | BLKADDR | City | State |
---|---|---|---|---|---|---|---|---|---|---|---|
2513 | 20057398 | BURGLARY COMMERCIAL | 2020-12-17 | 16:05 | BURGLARY - COMMERCIAL | 4 | 06/15/2021 12:00:00 AM | 600 BLOCK GILMAN ST\nBerkeley, CA\n(37.878405,... | 600 BLOCK GILMAN ST | Berkeley | CA |
624 | 20057207 | ASSAULT/BATTERY MISD. | 2020-12-17 | 16:50 | ASSAULT | 4 | 06/15/2021 12:00:00 AM | 2100 BLOCK SHATTUCK AVE\nBerkeley, CA\n(37.871... | 2100 BLOCK SHATTUCK AVE | Berkeley | CA |
154 | 20092214 | THEFT FROM AUTO | 2020-12-17 | 18:30 | LARCENY - FROM VEHICLE | 4 | 06/15/2021 12:00:00 AM | 800 BLOCK SHATTUCK AVE\nBerkeley, CA\n(37.8918... | 800 BLOCK SHATTUCK AVE | Berkeley | CA |
659 | 20057324 | THEFT MISD. (UNDER $950) | 2020-12-17 | 15:44 | LARCENY | 4 | 06/15/2021 12:00:00 AM | 1800 BLOCK 4TH ST\nBerkeley, CA\n(37.869888, -... | 1800 BLOCK 4TH ST | Berkeley | CA |
993 | 20057573 | BURGLARY RESIDENTIAL | 2020-12-17 | 22:15 | BURGLARY - RESIDENTIAL | 4 | 06/15/2021 12:00:00 AM | 1700 BLOCK STUART ST\nBerkeley, CA\n(37.857495... | 1700 BLOCK STUART ST | Berkeley | CA |
Doesn't look like it! We are good!
We can also do many things with the dt accessor, like switching time zones and converting times back to UNIX/POSIX time. Check out the documentation on the .dt accessor and time series/date functionality.
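For example, here is a small sketch (assuming the calls dataframe from above; US/Pacific is an assumed time zone for illustration, since the data dictionary does not state one):
# Attach a time zone to the (naive) datetimes, then view the same instants in UTC.
# US/Pacific is an assumption, chosen only for illustration.
localized = calls["EVENTDT"].dt.tz_localize("US/Pacific")
print(localized.dt.tz_convert("UTC").head())

# Convert back to UNIX/POSIX time: seconds since the 1970-01-01 epoch
unix_seconds = (
    localized.dt.tz_convert("UTC") - pd.Timestamp("1970-01-01", tz="UTC")
) // pd.Timedelta("1s")
print(unix_seconds.head())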