This analysis is inspired by Jonathan Dinu's excellent post: https://github.com/hopelessoptimism/happy-healthy-hungry/blob/master/h3.ipynb
Have you ever seen one of these signs?
Have you ever wondered what it means?
Is this a safe place to eat?
What does a score of 90 mean?
In this lecture we will explore food safety ratings for San Francisco to get a better picture of what these ratings mean.
Before we start, let's spend a few minutes thinking about the questions that we want to answer.
What is a good restaurant score for food safety inspections?
What are common issues with restaurants that could be better addressed?
Are there regions of the city that are sub-par?
Is the situation getting worse or better?
What are some other questions that we might want to answer?
Description of inspection data for SF:
I downloaded several feeds and stored them in the local data folder.
Let's first see what files we downloaded:
ls -lh data/san_francisco/
total 9904
-rwxr-xr-x 1 jegonzal staff 645K Jan 1 20:50 businesses.csv*
-rwxr-xr-x 1 jegonzal staff 493K Jan 1 20:50 inspections.csv*
-rwxr-xr-x 1 jegonzal staff 120B Jan 1 20:50 legend.csv*
-rwxr-xr-x 1 jegonzal staff 3.7M Jan 1 20:50 violations.csv*
There is data for other cities as well but we will focus on San Francisco for today.
Data scientists will often use many tools to better understand the data.
The first set of tools we might use are command line utilities:
head -n 5: read the first 5 lines of a file
wc: Compute the number of lines (records) in a file.
We can use these command line utilities from within the Python notebook. This is done by prefixing each command with the "bang" operator !.
Using head gives us a better picture of the contents of the businesses.csv file:
! head data/san_francisco/businesses.csv
!head -n 3 ./data/san_francisco/inspections.csv
!head -n 3 ./data/san_francisco/violations.csv
We can use cat to read the entire file.
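The cell reading the legend is not shown; presumably it was something like:

!cat data/san_francisco/legend.csv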
The legend indicates what ranges of scores mean. For example, this suggests that poor quality restaurants will have scores at or below 70.
# man cat
Press esc to close the pane.
The business_id field seems to appear in multiple files, and is probably unique in businesses?
!ls -Rlh data/san_francisco/
total 9904
-rwxr-xr-x 1 jegonzal staff 645K Jan 1 20:50 businesses.csv
-rwxr-xr-x 1 jegonzal staff 493K Jan 1 20:50 inspections.csv
-rwxr-xr-x 1 jegonzal staff 120B Jan 1 20:50 legend.csv
-rwxr-xr-x 1 jegonzal staff 3.7M Jan 1 20:50 violations.csv
This dataset is fairly small but in general the data we study may be large (e.g., too big to easily manipulate in memory).
In the following we write a small bash script to calculate the number of lines in each file.
%%bash
# We can use IPython magic commands to
# evaluate the entire cell in bash
for f in $(ls -R data/san_francisco/*); do
    wc $f
done
   6316   44398  660035 data/san_francisco/businesses.csv
  15431   15431  505325 data/san_francisco/inspections.csv
      5       6     120 data/san_francisco/legend.csv
  40937  432008 3891639 data/san_francisco/violations.csv
The wc command prints the number of lines, words, and characters, respectively, in the file.
To learn more about this command run man wc:
esc to close the popup pane.
In DS100 we will be using Python to do a lot of our analysis.
How could we write a similar program in Python?
for f in $(ls -R data/san_francisco/*); do
    wc $f
done
import os

for fname in os.listdir("data/san_francisco"):
    with open(os.path.join("data/san_francisco", fname), "rb") as f:
        lines = sum(1 for line in f)
    print("Lines in", fname, ":", lines)
Lines in businesses.csv : 6316
Lines in inspections.csv : 15431
Lines in legend.csv : 5
Lines in violations.csv : 40937
The above Python program demonstrates a few big ideas:
The with statement to open (and then automatically close) a file.
A generator expression, sum(1 for line in f), to count the number of lines without first loading the entire file into memory.
When we load data into a program it will often increase in size. Therefore we need to be careful when loading entire files:
import sys

with open("data/san_francisco/violations.csv", "r") as f:
    lines = f.readlines()
print("Size of list container:", sys.getsizeof(lines) / (10. ** 6) , "MBytes")
Size of list container: 0.342696 MBytes
print("Sum of lines objects:", sum(sys.getsizeof(l) for l in lines) / (10. ** 6), "MBytes")
Sum of lines objects: 5.856615 MBytes
!ls -Rlh data/san_francisco/violations.csv
-rwxr-xr-x 1 jegonzal staff 3.7M Jan 1 20:50 data/san_francisco/violations.csv
First note that the file takes more space in memory (roughly 6.2 MB across the list and the string objects) than it did on disk (3.7 MB). Why?
This is because the data structures used to store the contents of the file contain extra fields (e.g., lengths and pointers) that add additional overhead.
Data scientists need to think about computational issues: this is why knowing the size of a file can be important before loading it.
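For example, a quick way to check a file's size from Python before loading it (a sketch using the standard library):

import os

size_mb = os.path.getsize("data/san_francisco/violations.csv") / 1e6
print("violations.csv is", size_mb, "MB on disk")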
What if I wanted to know the number of unique businesses in the
violations.csv file without loading the entire file into memory at once?
business_ids = set()
with open("data/san_francisco/violations.csv", "r") as f:
    next(f)  # skip the header line
    for line in f:
        business_ids.add(int(line.split(",")[0]))
print("There are", len(business_ids), "unique business ids.")
There are 5259 unique business ids.
Interesting! There are fewer unique businesses in the violations.csv file than there are lines in the businesses.csv file. Perhaps some businesses have no violations? Let's check how many unique business_ids are in the businesses.csv file. To do this we will practice using bash commands instead.
Could we have computed the number of unique businesses using just Bash commands?
tail returns the end of the file; tail -n +2 returns everything starting from the second line (skipping the header).
cat reads the entire file.
cut extracts a field from each row; cut -f 1 -d "," extracts the first comma-delimited field.
sort sorts the input.
uniq returns the unique entries from a sorted input.
wc, as earlier, counts the lines, words, and characters in its input.
We can compose multiple operations using the pipe operator |.
%%bash
tail -n +2 data/san_francisco/businesses.csv | head -n 5
10,"TIRAMISU KITCHEN","033 BELDEN PL","San Francisco","CA","94104","37.791116","-122.403816","+14154217044" 19,"NRGIZE LIFESTYLE CAFE","1200 VAN NESS AVE, 3RD FLOOR","San Francisco","CA","94109","37.786848","-122.421547","+14157763262" 24,"OMNI S.F. HOTEL - 2ND FLOOR PANTRY","500 CALIFORNIA ST, 2ND FLOOR","San Francisco","CA","94104","37.792888","-122.403135","+14156779494" 31,"NORMAN'S ICE CREAM AND FREEZES","2801 LEAVENWORTH ST ","San Francisco","CA","94133","37.807155","-122.419004","" 45,"CHARLIE'S DELI CAFE","3202 FOLSOM ST ","San Francisco","CA","94110","37.747114","-122.413641","+14156415051"
%%bash
tail -n +2 data/san_francisco/businesses.csv \
    | cut -f 1 -d "," \
    | head -n 10
10
19
24
31
45
48
50
54
56
58
cut: stdin: Illegal byte sequence
The above error occurs because the file is not encoded in UTF-8. In general this is often the older ISO-8859-1 format. We can use iconv to convert the file.
!tail -n +2 data/san_francisco/businesses.csv \
    | iconv -t UTF-8 -f ISO-8859-1 \
    | cut -f 1 -d "," \
    | head -n 10
10
19
24
31
45
48
50
54
56
58
%%bash
tail -n +2 data/san_francisco/businesses.csv \
    | iconv -t UTF-8 -f ISO-8859-1 \
    | cut -f 1 -d "," \
    | sort \
    | head -n 10
10
1000
10011
1002
1003
10030
1005
1008
10083
10099
%%bash
tail -n +2 data/san_francisco/businesses.csv \
    | iconv -t UTF-8 -f ISO-8859-1 \
    | cut -f 1 -d "," \
    | sort \
    | uniq \
    | wc
6315 6315 35738
There is one fewer unique business_id than there are lines in the file. Why is that? (Recall that the first line of the file is a header.)
In DS-100 we will largely focus on analysis within Python. However, as a data scientist you may want to learn more about using these Bash shell commands (common to all POSIX shells). Others worth learning are:
cd, ls, mv, cp, and rm are file and folder traversal and manipulation commands.
grep: regular expression pattern matching.
tr: change (translate) individual characters.
tee: used to split an output down two paths.
awk: a string processing framework with its own built-in language for string transformations.
tar and gzip: compression and archiving commands.
In this class we will frequently operate on data stored in table form. A common data structure for tabular data in analysis tools is a DataFrame.
Conceptually, DataFrames behave like tables (as in a spreadsheet) with rows corresponding to data records and columns corresponding to fields. DataFrames are also like matrices in that we can index a row or column of data. However, unlike matrices, DataFrames can store data of different types in each column.
Many of you may already be familiar with DataFrames from Data8. In Data8 you used the datascience library created here at Berkeley, which contains a simple Table data structure with a few basic operations. In contrast, many of the DataFrame APIs in use today (e.g., Pandas, Spark, and R) have fairly extensive APIs with numerous shortcuts and optimizations.
One of the most commonly used DataFrame APIs today is the Python Pandas Library. This library was heavily inspired by the DataFrame concepts in the R programming language.
In DS100 we will use the Pandas DataFrame library for many of the assignments. We encourage students to read a little about Pandas here.
One of the useful features of Pandas is its ability to load fairly messy "CSV" files:
import pandas as pd

# note the need to provide encoding information for this text file
bus = pd.read_csv("data/san_francisco/businesses.csv", encoding='ISO-8859-1')
ins = pd.read_csv("data/san_francisco/inspections.csv")
vio = pd.read_csv("data/san_francisco/violations.csv")
If you look at the
pd.read_csv command you will see that it has a very large number of arguments:
To get a sense of how the data was processed, let's look at the records in each of the DataFrames:
print('Records:', len(bus))
bus.head()
| | business_id | name | address | city | state | postal_code | latitude | longitude | phone_number |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 10 | TIRAMISU KITCHEN | 033 BELDEN PL | San Francisco | CA | 94104 | 37.791116 | -122.403816 | +14154217044 |
| 1 | 19 | NRGIZE LIFESTYLE CAFE | 1200 VAN NESS AVE, 3RD FLOOR | San Francisco | CA | 94109 | 37.786848 | -122.421547 | +14157763262 |
| 2 | 24 | OMNI S.F. HOTEL - 2ND FLOOR PANTRY | 500 CALIFORNIA ST, 2ND FLOOR | San Francisco | CA | 94104 | 37.792888 | -122.403135 | +14156779494 |
| 3 | 31 | NORMAN'S ICE CREAM AND FREEZES | 2801 LEAVENWORTH ST | San Francisco | CA | 94133 | 37.807155 | -122.419004 | NaN |
| 4 | 45 | CHARLIE'S DELI CAFE | 3202 FOLSOM ST | San Francisco | CA | 94110 | 37.747114 | -122.413641 | +14156415051 |
print('Records:', len(ins))
ins.head()
print('Records:', len(vio))
vio.head()
| | business_id | date | description |
|---|---|---|---|
| 0 | 10 | 20160503 | High risk food holding temperature [ date vi... |
| 1 | 10 | 20160503 | High risk food holding temperature [ date vi... |
| 2 | 10 | 20160503 | High risk vermin infestation |
| 3 | 10 | 20160503 | Unapproved or unmaintained equipment or utensils |
| 4 | 10 | 20160503 | No thermometers or uncalibrated thermometers |
Now that we have successfully loaded the data let's start to examine what we have.
print("Bus(" + ", ".join(bus.columns) + ")") print("Ins(" + ", ".join(ins.columns) + ")") print("Vio(" + ", ".join(vio.columns) + ")")
Bus(business_id, name, address, city, state, postal_code, latitude, longitude, phone_number)
Ins(business_id, score, date, type)
Vio(business_id, date, description)
The business_id field is shared across tables
The business_id field connects records across each of the DataFrames. Later in the class we will discuss this form of foreign-key relationship between tables. For now, we will construct a modified version of these tables indexed by business_id. This will both speed up future operations and provide an implicit join key.
bus.set_index('business_id', inplace=True)
ins.set_index('business_id', inplace=True)
vio.set_index('business_id', inplace=True)
| business_id | date | description |
|---|---|---|
| 10 | 20160503 | High risk food holding temperature [ date vi... |
| 10 | 20160503 | High risk food holding temperature [ date vi... |
| 10 | 20160503 | High risk vermin infestation |
| 10 | 20160503 | Unapproved or unmaintained equipment or utensils |
| 10 | 20160503 | No thermometers or uncalibrated thermometers |
Both the inspections and violations tables have dates:
business_id
10    20160503
10    20160503
10    20160503
10    20160503
10    20160503
10    20140729
10    20140729
10    20140114
10    20140114
10    20140114
Name: date, dtype: int64
Notice the "type" of information stored in that column is an integer not a date. We need to clean that up a bit.
ins['date'] = pd.to_datetime(ins['date'], format="%Y%m%d")
vio['date'] = pd.to_datetime(vio['date'], format="%Y%m%d")
business_id
10   2016-05-03
10   2016-05-03
10   2016-05-03
10   2016-05-03
10   2016-05-03
10   2014-07-29
10   2014-07-29
10   2014-01-14
10   2014-01-14
10   2014-01-14
Name: date, dtype: datetime64[ns]
print("Oldest inspection:", ins['date'].min()) print("Oldest violation:", vio['date'].min())
Oldest inspection: 2013-12-26 00:00:00
Oldest violation: 2013-12-26 00:00:00
print("Most recent inspection:", ins['date'].max()) print("Most recent violation:", vio['date'].max())
Most recent inspection: 2016-12-23 00:00:00
Most recent violation: 2016-12-23 00:00:00
It seems like there are many routine inspections. Let's look at the distribution of inspection types. To do this we are going to select the type column and use the value_counts() function to count the occurrences, as shown below:
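The cell that produced the counts below is not shown; given the description above it was presumably:

ins['type'].value_counts()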
routine      15429
complaint        1
Name: type, dtype: int64
Strange! There is only one complaint in the entire database.
Let's look a little bit further at this complaint:
ins[ins['type'] == 'complaint']
A score of 100?! That is odd. What was the restaurant?
ins[ins['type'] == 'complaint'].join(bus)
| business_id | score | date | type | name | address | city | state | postal_code | latitude | longitude | phone_number |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 87440 | 100 | 2016-08-01 | complaint | EL AJI PERUVIAN RESTAURANT | 3015 MISSION ST | San Francisco | CA | 94110 | NaN | NaN | +14156587349 |
This probably deserves further investigation ...
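The lookup cell is not shown; since vio is indexed by business_id, something like this would retrieve the violations for business 87440 (a sketch):

vio.loc[[87440]]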
| business_id | date | description |
|---|---|---|
| 87440 | 2016-08-01 | Low risk vermin infestation |
| 87440 | 2016-08-01 | Inadequate dressing rooms or improper storage ... |
'Inadequate dressing rooms or improper storage of personal items'
And it still got a score of 100? This record is probably an anomaly but it does cause us to question the meaning of these scores.
Group the inspections by the index (the business_id):
ins_grouped_by_bid = ins.groupby(ins.index)
bus['first_inspect'] = ins_grouped_by_bid['date'].min()
bus['last_inspect'] = ins_grouped_by_bid['date'].max()
bus['num_inspect'] = ins_grouped_by_bid['date'].count()
bus['days_bt_ins'] = (bus['last_inspect'] - bus['first_inspect']).dt.days / (bus['num_inspect'] - 1)
bus['years_bt_ins'] = bus['days_bt_ins'] / 365.0
bus[['years_bt_ins', 'days_bt_ins', 'first_inspect', 'last_inspect', 'num_inspect']].head(3)
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt
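The plotting cell itself is not shown; a minimal sketch that would produce a distribution plot of the years between inspections (using seaborn's distplot, current at the time):

# Sketch: distribution of the years between inspections computed above
sns.distplot(bus['years_bt_ins'].dropna())
plt.xlabel('Years between inspections')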
[Plot: distribution of years between inspections]
What patterns do we see?
Inspections are required annually. Why longer than 1 year?
Our original goal was to understand scores. Let's see what we can find. Since restaurants can have multiple inspections let's look at the most recent.
def most_recent(df):
    return df.sort_values('date', ascending=False).iloc[0]

bus['latest_score'] = ins_grouped_by_bid.apply(most_recent)['score']
bus['mean_score'] = ins_grouped_by_bid['score'].mean()
Let's look at the spread of scores.
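The plotting cells are not shown; a sketch along these lines would produce the two distribution plots summarized below:

# Sketch: distributions of the latest and the mean inspection scores
sns.distplot(bus['latest_score'].dropna())
plt.figure()
sns.distplot(bus['mean_score'].dropna())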
[Plots: distributions of latest_score and mean_score]
The scores seem to be pretty heavily skewed towards 100. Is that good news? There also appears to be a weird clustering around 90. Why?
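The summary below comes from calling describe() on the latest scores (the cell itself is not shown):

bus['latest_score'].describe()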
count    5730.000000
mean       90.359860
std         8.061857
min        52.000000
25%        86.000000
50%        92.000000
75%        96.000000
max       100.000000
Name: latest_score, dtype: float64
What fraction of the restaurants are poor (below 70) by the city's standard?
threshold = 70
fraction = (bus['latest_score'] <= threshold).mean()  # <- Why the mean?
print("Less than", str(round(fraction * 100, 2)) + "% of the restaurants have a score of", threshold, "or less.")
Less than 2.17% of the restaurants have a score of 70 or less.
Would you feel comfortable eating at a restaurant that got a score of 90?
What data could we use to answer this question?
print("Bus(" + ", ".join(bus.columns) + ")\n") print("Ins(" + ", ".join(ins.columns) + ")\n") print("Vio(" + ", ".join(vio.columns) + ")")
Bus(name, address, city, state, postal_code, latitude, longitude, phone_number, first_inspect, last_inspect, num_inspect, days_bt_ins, years_bt_ins, latest_score, mean_score)

Ins(score, date, type)

Vio(date, description)
What fraction of businesses don't have a postal code?
Is that acceptable?
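A quick check (a sketch; the original cell is not shown). Taking the mean of the boolean mask gives the fraction of missing entries:

bus['postal_code'].isnull().mean()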
Let's look at how the businesses are spread over the postal codes.
zipcounts = bus['postal_code'].value_counts()
zipcounts.name = 'counts'
zipcounts.plot('bar')
[Bar chart: number of businesses per postal code]
CA is not a zipcode.
Compute the number of occurrences for each postal code:
We will use the Folium mapping package. To install it, make sure you are in the correct conda environment (e.g., ds100) and then use pip (not conda, which is still a few versions behind):
pip install folium
import folium
folium.__version__
We need version
0.2.0 or greater.
import folium

SF_COORDINATES = (37.76, -122.45)
sf_map = folium.Map(location=SF_COORDINATES, zoom_start=12)

# choropleth: Greek for choro=area + pleth=multitude
sf_map.choropleth(geo_path="condensed_zipcodes.json",
                  data=zipcounts,
                  key_on='properties.zipcode',
                  threshold_scale=[100, 300, 400, 450, 500, 575],
                  fill_color='YlGn', fill_opacity=0.7, line_opacity=0.5)
sf_map
This requires the Python area package. In your DS100 conda environment run:

pip install area
# Load my geographic data
import json
import numpy as np
import area

with open("condensed_zipcodes.json", "r") as f:
    geojson = json.load(f)

def extract_info(x):
    postal_code = x['properties']['zipcode']
    coords = x['geometry']['coordinates']
    if len(coords) == 1:  # unwrap the outer ring of the polygon
        coords = coords[0]
    hectares = area.area(x['geometry']) / 10000.0
    return (postal_code, hectares, np.mean(coords, axis=0))

# Construct a dataframe with the number of hectares in each zipcode
postal_codes = pd.DataFrame(
    [extract_info(g) for g in geojson['features']],
    columns=['zipcode', 'hectares', 'coords']).set_index('zipcode')
postal_codes.head()
for zipcode, row in postal_codes.iterrows():
    text = '<div style="font-size: 24pt">%s</div>' % zipcode
    sf_map.add_child(folium.Marker(row['coords'][::-1],
                                   icon=folium.DivIcon(html=text)))
sf_map
How does this relate to the regions in SF?
Would we expect small zipcodes regions to have as many restaurants as big regions?
tmp = pd.DataFrame(zipcounts).join(postal_codes)
density = (tmp['counts'] / tmp['hectares']).dropna()
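The bar chart below presumably comes from plotting this density series (the cell is not shown); a sketch:

density.sort_values(ascending=False).plot(kind='bar')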
[Bar chart: restaurant density (count per hectare) by postal code]
import folium

SF_COORDINATES = (37.77, -122.45)
sf_map = folium.Map(location=SF_COORDINATES, zoom_start=12)

# choropleth: Greek for choro=area + pleth=multitude
sf_map.choropleth(geo_path="condensed_zipcodes.json",
                  data=density,
                  key_on='properties.zipcode',
                  fill_color='YlGn', fill_opacity=0.7, line_opacity=0.5)
sf_map
/Users/jegonzal/anaconda3/envs/ds100/lib/python3.5/site-packages/ipykernel/__main__.py:9: FutureWarning: 'threshold_scale' default behavior has changed. Now you get a linear scale between the 'min' and the 'max' of your data. To get former behavior, use folium.utilities.split_six.
How does this relate to the regions in SF?
We also have geo-coordinates attached to some of the restaurants. How many of the entries are missing the latitude or longitude fields?
(bus['latitude'].isnull() | bus['longitude'].isnull()).mean()
We have missing values for 43% of the businesses. Is that a problem?
Nearly half are missing the latitude or longitude information! Let's look at the locations of those that do have a location.
bus_locations = bus[['name', 'latitude', 'longitude', 'latest_score', 'mean_score']].dropna()
print("There are", len(bus_locations), "businesses with locations.")
There are 3550 businesses with locations.
The following generates a HeatMap of locations where we have collected restaurant data with geo-location information.
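The mapping cell itself is not shown; a minimal sketch using folium's HeatMap plugin (treat the zoom and styling as assumptions):

from folium.plugins import HeatMap

heat_map = folium.Map(location=SF_COORDINATES, zoom_start=12)
# HeatMap expects a list of [lat, lon] pairs
heat_map.add_child(HeatMap(bus_locations[['latitude', 'longitude']].values.tolist()))
heat_map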
Let's define a foodie ZIP code as one with a high restaurant density (an alternative, commented out below, uses at least 200 restaurants).
# foodie_zips = zipcounts[zipcounts > 200].index
foodie_zips = density[density > 0.5].index
foodie_bus = bus[bus['postal_code'].isin(foodie_zips)]
foodie_bus.boxplot(by='postal_code', column='latest_score')
_ = plt.xticks(rotation='vertical')
sns.violinplot(x="postal_code", y="latest_score", data=foodie_bus)
_ = plt.grid(b=True, which='major', color='w', linewidth=1.0)
_ = plt.xticks(rotation='vertical')