This analysis is inspired by Jonathan Dinu's excellent post: https://github.com/hopelessoptimism/happy-healthy-hungry/blob/master/h3.ipynb
Have you ever seen one of these signs?
Have you ever wondered what it means?
Is this a safe place to eat?
What does a score of 90 mean?
In this lecture we will explore food safety ratings for San Francisco to get a better picture of what these ratings mean.
Before we start, let's spend a few minutes thinking about the questions that we want to answer.
What is a good restaurant score for food safety inspections?
What are common issues with restaurants that could be better addressed?
Are there regions of the city that are sub-par?
Is the situation getting worse or better?
What are some other questions that we might want to answer?
Description of inspection data for SF:
https://www.sfdph.org/dph/EH/Food/Score/
I downloaded several feeds and stored them in the local data folder.
Let's first see what files we downloaded into data/san_francisco:
ls -lh data/san_francisco/
There is data for other cities as well but we will focus on San Francisco for today.
Data scientists will often use many tools to better understand the data.
The first set of tools we might use are command line utilities:
head -n 5: read the first 5 lines of a file
wc: compute the number of lines (records) in a file
We can use these command line utilities from within the Python notebook. This is done by prefixing each command with the "bang" operator !.
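For example, a minimal sketch of the bang operator (assuming businesses.csv appears in the directory listing above):
# Count the lines in one of the downloaded files by prefixing a shell
# command with "!" inside the notebook.
!wc -l data/san_francisco/businesses.csv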
Head
Using head gives us a better picture of the contents of the businesses.csv file:
! head data/san_francisco/businesses.csv
!head -n 3 ./data/san_francisco/inspections.csv
!head -n 3 ./data/san_francisco/violations.csv
Use cat to read the entire file:
!cat ./data/san_francisco/legend.csv
The legend indicates what ranges of scores mean. For example, this suggests that poor quality restaurants will have scores at or below 70.
The notebook is an environment for both describing the analysis and doing the analysis.
Using man to learn
To learn more about these commands try:
man cat
# man cat
Press esc to clear the pane.
The business_id field seems to appear in multiple files and is probably unique in businesses?
We already looked at the raw file sizes:
!ls -Rlh data/san_francisco/
This dataset is fairly small but in general the data we study may be large (e.g., too big to easily manipulate in memory).
In the following we write a small bash script to calculate the number of lines in each file.
%%bash
# We can use ipython Magic commands to
# evaluate the entire cell in bash
for f in $(ls -R data/san_francisco/*); do
wc $f
done;
The wc command prints the number of lines, words, and characters (respectively) in the file.
To learn more about this command run:
man wc
Press esc to close the popup pane.
In DS100 we will be using Python to do a lot of our analysis.
How could we write a similar program in Python?
for f in $(ls -R data/san_francisco/*); do
wc $f
done;
import os
for fname in os.listdir("data/san_francisco"):
    with open(os.path.join("data/san_francisco", fname), "rb") as f:
        lines = sum(1 for line in f)
    print("Lines in", fname, ":", lines)
The above Python program demonstrates a few big ideas:
os.listdir: list the files in a directory
the with expression: open (and then automatically close) a file
a generator expression (sum(1 for line in f)): count the number of lines without first loading the entire file into memory
When we load data into a program it will often increase in size. Therefore we need to be careful when loading entire files:
import sys
with open("data/san_francisco/violations.csv", "r") as f:
    lines = f.readlines()
print("Size of list container:",
      sys.getsizeof(lines) / (10. ** 6),
      "MBytes")
print("Sum of lines objects:",
      sum(sys.getsizeof(l) for l in lines) / (10. ** 6),
      "MBytes")
!ls -Rlh data/san_francisco/violations.csv
First note that the file takes a little more space in memory than it did on disk. Why?
This is because the data structures to store the contents of the file may contain extra fields (e.g., length and points) that add additional overhead.
Data scientists need to think about computational issues. This is why knowing the size of a file can be important before loading it.
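As a small illustration (a sketch, not part of the original analysis), a Python string object carries overhead beyond its raw characters:
import sys

# A hypothetical record: the Python object is larger than its character count
# because objects store extra fields (type pointer, length, hash, etc.).
record = "10,2014-07-29,routine,94"
print(len(record), "characters, but", sys.getsizeof(record), "bytes in memory")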
What if I wanted to know the number of unique businesses in the violations.csv
file without loading the entire file into memory at once?
business_ids = set()
with open("data/san_francisco/violations.csv", "r") as f:
    next(f)  # skip the header line
    for line in f:
        business_ids.add(int(line.split(",")[0]))
print("There are", len(business_ids), "unique business ids.")
Interestingly, there are fewer businesses in the violations.csv
file than there are lines in the businesses.csv
file. Perhaps there are no violations for some businesses? Let's check to see how many unique business_id
s are in the businesses.csv
file.
To do this we will practice using bash commands instead.
Could we have computed the number of unique businesses using just Bash commands?
tail returns the end of the file; tail -n +2 returns everything starting from the second line (i.e., it skips the header)
cat reads the entire file
cut extracts a field from each row; cut -f 1 -d "," takes the first comma-delimited field
sort sorts the input
uniq returns the unique entries from a sorted input
wc, as earlier, counts the entries in the output
We can compose multiple operations using | (pipe).
%%bash
tail -n +2 data/san_francisco/businesses.csv | head -n 5
%%bash
tail -n +2 data/san_francisco/businesses.csv \
| cut -f 1 -d "," \
| head -n 10
The above error occurs because the file is encoded in a non-UTF-8 format, often the older ISO-8859-1 encoding. We can use iconv to clean up the file.
man iconv
!tail -n +2 data/san_francisco/businesses.csv \
| iconv -t UTF-8 -f ISO-8859-1 \
| cut -f 1 -d "," \
| head -n 10
Adding sort:
%%bash
tail -n +2 data/san_francisco/businesses.csv \
| iconv -t UTF-8 -f ISO-8859-1 \
| cut -f 1 -d "," \
| sort \
| head -n 10
Adding uniq:
%%bash
tail -n +2 data/san_francisco/businesses.csv \
| iconv -t UTF-8 -f ISO-8859-1 \
| cut -f 1 -d "," \
| sort \
| uniq \
| wc
man uniq
There is one less unique business_id
than there are lines in the file. Why is that?
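One plausible explanation (a sketch to check it, assuming all business_ids are unique): the raw line count includes the CSV header, which tail -n +2 skipped, so we would expect exactly one more line than unique ids.
# Count the raw lines (including the header) without loading the whole file.
with open("data/san_francisco/businesses.csv", "rb") as f:
    print(sum(1 for _ in f))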
In DS-100 we will largely focus on analysis within Python. However, as a data scientist you may want to learn more about using these Bash shell commands (common to all POSIX shells). Others worth learning are:
cd, ls, pwd, mkdir, and rm: file and folder traversal and manipulation commands
grep: regular expression pattern matching
tr: change individual characters
tee: split an output down two paths
awk: a string processing framework with its own built-in language for string transformations
zip, gzip, bzip2, and tar: compression and archiving commands
In this class we will frequently operate on data stored in table form. A common data structure for tabular data in analysis tools is a DataFrame.
Conceptually, DataFrames behave like tables (as in a spreadsheet), with rows corresponding to data records and columns corresponding to fields. DataFrames are also like matrices in that we can index a row or column of data. However, unlike matrices, DataFrames can store data of different types in each column.
Many of you may already be familiar with DataFrames from Data8. In Data8 you used a DataFrame library called DataScience created here at Berkeley. The Berkeley DataScience library contains a simple DataFrame data structure with a few basic operations. In contrast many of the DataFrame APIs in use today (e.g., Pandas, Spark, and R) have fairly extensive APIs with numerous short-cuts and optimizations.
One of the most commonly used DataFrame APIs today is the Python Pandas Library. This library was heavily inspired by the DataFrame concepts in the R programming language.
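As a minimal illustration (with made-up values), each column of a Pandas DataFrame can hold a different type, unlike a homogeneous matrix:
import pandas as pd

# A tiny hypothetical DataFrame: strings, integers, and datetimes side by side.
example = pd.DataFrame({
    "name": ["Hypothetical Taqueria", "Example Noodle House"],
    "postal_code": ["94110", "94122"],
    "latest_score": [92, 88],
    "inspected": pd.to_datetime(["2016-05-01", "2016-06-12"])
})
example.dtypes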
In DS100 we will use the Pandas DataFrame library for many of the assignments. We encourage students to read a little about Pandas here
One of the useful features of Pandas is its ability to load fairly messy "CSV" files:
import pandas as pd
# note the need to provide encoding information for this text file
bus = pd.read_csv("data/san_francisco/businesses.csv",
                  encoding='ISO-8859-1')
ins = pd.read_csv("data/san_francisco/inspections.csv")
vio = pd.read_csv("data/san_francisco/violations.csv")
If you look at the pd.read_csv
command you will see that it has a very large number of arguments:
pd.read_csv?
To get a sense of how the data was processed, let's look at the records in each DataFrame.
print('Records:', len(bus))
bus.head()
print('Records:', len(ins))
ins.head()
print('Records:', len(vio))
vio.head()
Now that we have successfully loaded the data let's start to examine what we have.
print("Bus(" + ", ".join(bus.columns) + ")")
print("Ins(" + ", ".join(ins.columns) + ")")
print("Vio(" + ", ".join(vio.columns) + ")")
The business_id field is shared across tables
The business_id field connects records across each of the DataFrames. Later in the class we will discuss this form of foreign key relationship between tables. For now, we will construct a modified version of these tables indexed by the business_id. This will both speed up future operations and provide an implicit join key.
bus.set_index('business_id', inplace=True)
ins.set_index('business_id', inplace=True)
vio.set_index('business_id', inplace=True)
vio.head()
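As a quick illustration of the implicit join key (a sketch): because the tables now share the business_id index, we can attach the business name to each violation without specifying a join column.
# Join on the shared business_id index (no explicit key needed).
vio.join(bus[['name']]).head()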
Both the inspections
and violations
tables have dates.
vio['date'].head(10)
Notice the "type" of information stored in that column is an integer not a date. We need to clean that up a bit.
ins['date'] = pd.to_datetime(ins['date'], format="%Y%m%d")
vio['date'] = pd.to_datetime(vio['date'], format="%Y%m%d")
vio['date'].head(10)
print("Oldest inspection:", ins['date'].min())
print("Oldest violation:", vio['date'].min())
print("Most recent inspection:", ins['date'].max())
print("Most recent violation:", vio['date'].max())
Inspection Data
ins.head()
It seems like there are possibly many routine
inspections. Let's look at the distribution of inspection types. To do this we are going to select the type
column and use the value_counts()
function to count the occurrences:
ins['type'].value_counts()
Strange! There is only one complaint in the entire database.
Let's look a little bit further at this complaint:
ins[ins['type'] == 'complaint']
A score of 100?! That is odd. What was the restaurant?
ins[ins['type'] == 'complaint'].join(bus)
This probably deserves further investigation ...
vio.loc[87440]
vio.loc[87440].iloc[1,1]
And it still got a score of 100? This record is probably an anomaly but it does cause us to question the meaning of these scores.
Group the inspections by the index (the business_id
)
ins_grouped_by_bid = ins.groupby(ins.index)
bus['first_inspect'] = ins_grouped_by_bid['date'].min()
bus['last_inspect'] = ins_grouped_by_bid['date'].max()
bus['num_inspect'] = ins_grouped_by_bid['date'].count()
bus['days_bt_ins'] = (bus['last_inspect'] - bus['first_inspect']).dt.days / (bus['num_inspect'] - 1)
bus['years_bt_ins'] = bus['days_bt_ins'] / 365.0
bus[['years_bt_ins', 'days_bt_ins', 'first_inspect', 'last_inspect', 'num_inspect']].head(3)
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt
bus['years_bt_ins'].hist(bins=50)
What patterns do we see?
Inspections are required annually. Why longer than 1 year?
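A quick check (a sketch) of how often the average gap exceeds one year:
# Fraction of businesses whose average time between inspections exceeds a year.
print((bus['years_bt_ins'] > 1).mean())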
Our original goal was to understand scores. Let's see what we can find. Since restaurants can have multiple inspections let's look at the most recent.
def most_recent(df):
    return df.sort_values('date', ascending=False).iloc[0]

bus['latest_score'] = ins_grouped_by_bid.apply(most_recent)['score']
bus['mean_score'] = ins_grouped_by_bid['score'].mean()
Let's look at the spread of scores.
bus['latest_score'].hist(bins=10)
bus['latest_score'].hist(bins=200)
The scores seem to be pretty heavily skewed towards 100. Is that good news? There also appears to be a weird clustering around 90. Why?
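One way to probe the clustering (a sketch): count how many of the latest scores fall in a narrow band around 90.
# How many businesses have a latest score between 88 and 92 (inclusive)?
print(bus['latest_score'].between(88, 92).sum())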
bus['latest_score'].describe()
What fraction of the restaurants are poor (at or below 70) by the city's standard?
threshold = 70
fraction = (bus['latest_score'] <= threshold).mean()  # <- Why the mean?
print("Less than", str(round(fraction * 100, 2)) +
      "% of the restaurants have a score of", threshold, "or less.")
Would you feel comfortable eating at a restaurant that got a score of 90?
What data could we use to answer this question?
print("Bus(" + ", ".join(bus.columns) + ")\n")
print("Ins(" + ", ".join(ins.columns) + ")\n")
print("Vio(" + ", ".join(vio.columns) + ")")
What fraction of businesses don't have a postal code?
bus['postal_code'].isnull().mean()
Is that acceptable?
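A note on the .mean() trick used in the last two computations (a minimal sketch): the mean of a boolean Series treats True as 1 and False as 0, so it returns the fraction of True values.
import pandas as pd

# The mean of a boolean Series is the fraction of True entries.
print(pd.Series([True, False, False, True]).mean())  # 0.5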
Let's look at how the businesses are spread over the postal codes.
zipcounts = bus['postal_code'].value_counts()
zipcounts.name = 'counts'
zipcounts.plot('bar')
Notice that CA is not a zipcode.
We now have the number of restaurants in each postal code. Let's put these counts on a map.
We will use the Folium mapping package. To install it make sure you are in the correct conda environment (e.g., ds100
) and then use pip
(not conda which is still a few versions behind):
pip install folium
import folium
folium.__version__
We need version 0.2.0
or greater.
import folium
SF_COORDINATES = (37.76, -122.45)
sf_map = folium.Map(location=SF_COORDINATES, zoom_start=12)
# choropleth: Greek for choro=area + pleth=multitude
sf_map.choropleth(geo_path = "condensed_zipcodes.json",
                  data = zipcounts,
                  key_on = 'properties.zipcode',
                  threshold_scale = [100, 300, 400, 450, 500, 575],
                  fill_color='YlGn', fill_opacity=0.7, line_opacity=0.5)
sf_map
This requires the Python area package. In your DS100 conda environment run:
pip install area
# Load the geographic data
import json
import numpy as np
import area

with open("condensed_zipcodes.json", "r") as f:
    geojson = json.load(f)

def extract_info(x):
    postal_code = x['properties']['zipcode']
    coords = x['geometry']['coordinates'][0]
    if len(coords) == 1:
        coords = coords[0]
    hectares = area.area(x['geometry']) / 10000.0
    return (postal_code, hectares, np.mean(coords, axis=0))

# Construct a dataframe with the number of hectares in each zipcode
postal_codes = pd.DataFrame(
    [extract_info(g) for g in geojson['features']],
    columns=['zipcode', 'hectares', 'coords']).set_index('zipcode')
postal_codes.head()
for zipcode, row in postal_codes.iterrows():
    text = '<div style="font-size: 24pt">%s</div>' % zipcode
    sf_map.add_child(folium.Marker(row['coords'][::-1], icon=folium.DivIcon(html=text)))
sf_map
How does this relate to the regions in SF?
Would we expect small zipcode regions to have as many restaurants as big regions?
tmp = pd.DataFrame(zipcounts).join(postal_codes)
density = (tmp['counts'] / tmp['hectares']).dropna()
density.plot(kind='bar')
import folium
SF_COORDINATES = (37.77, -122.45)
sf_map = folium.Map(location=SF_COORDINATES, zoom_start=12)
# choropleth: Greek for choro=area + pleth=multitude
sf_map.choropleth(geo_path = "condensed_zipcodes.json",
                  data = density,
                  key_on = 'properties.zipcode',
                  fill_color='YlGn', fill_opacity=0.7, line_opacity=0.5)
sf_map
How does this relate to the regions in SF?
We also have geo-coordinates attached to some of the restaurants. How many are missing?
How many of the entries are missing the latitude or longitude fields?
(bus['latitude'].isnull() | bus['longitude'].isnull()).mean()
We have missing values for about 43% of the businesses. Is that a problem?
Nearly half are missing the latitude or longitude information! Let's look at the locations of those that do have a location.
bus_locations = bus[['name', 'latitude', 'longitude', 'latest_score', 'mean_score']].dropna()
print("There are", len(bus_locations), "businesses with locations.")
The following generates a HeatMap of locations where we have collected restaurant data with geo-location information.
import folium
import folium.plugins # The Folium Javascript Map Library
locs = bus_locations[['latitude', 'longitude']].as_matrix()
SF_COORDINATES = (37.76, -122.45)
sf_map = folium.Map(location=SF_COORDINATES, zoom_start=12)
sf_map.add_children(folium.plugins.HeatMap(locs, radius = 10))
sf_map
Let's define a foodie ZIP code as one with many restaurants (e.g., at least 200 restaurants or, as computed below, a high restaurant density).
# foodie_zips = zipcounts[zipcounts > 200].index
foodie_zips = density[density > 0.5].index
foodie_bus = bus[bus['postal_code'].isin(foodie_zips)]
foodie_bus.boxplot(by='postal_code', column='latest_score')
_ = plt.xticks(rotation='vertical')
sns.violinplot(x="postal_code", y="latest_score",
               data=foodie_bus)
_ = plt.grid(b=True, which='major', color='w', linewidth=1.0)
_ = plt.xticks(rotation='vertical')
Why do the scores go above 100? (The violin plot's kernel density estimate extends beyond the observed data unless we restrict it with cut=0.)
sns.violinplot(x="postal_code", y="latest_score",
               data=foodie_bus,
               cut=0)
_ = plt.grid(b=True, which='major', color='w', linewidth=1.0)
_ = plt.xticks(rotation='vertical')
score_quantiles = bus[bus['postal_code'].isin(foodie_zips)][['postal_code', 'latest_score']].dropna()\
    .groupby('postal_code')['latest_score'].quantile([0.1, 0.25, .5, 0.75, 0.9])
First we need to get the boundaries of the zip codes in foodie_zips:
foodie_geojson = geojson.copy()
foodie_geojson['features'] = [f for f in geojson['features'] if f['properties']['zipcode'] in foodie_zips]
import folium
SF_COORDINATES = (37.76, -122.45)
sf_map = folium.Map(location=SF_COORDINATES, zoom_start=12)
sf_map.choropleth(geo_str = json.dumps(foodie_geojson),
                  data = score_quantiles[:, 0.50],
                  key_on = 'properties.zipcode',
                  fill_color='BuPu', fill_opacity=0.7, line_opacity=0.5)
sf_map
import folium
SF_COORDINATES = (37.76, -122.45)
sf_map = folium.Map(location=SF_COORDINATES, zoom_start=12)
sf_map.choropleth(geo_str = json.dumps(foodie_geojson),
                  data = score_quantiles[:, 0.1],
                  key_on = 'properties.zipcode',
                  fill_color='BuPu', fill_opacity=0.7, line_opacity=0.5)
sf_map
import folium
SF_COORDINATES = (37.76, -122.45)
sf_map = folium.Map(location=SF_COORDINATES, zoom_start=12)
sf_map.choropleth(geo_str = json.dumps(foodie_geojson),
                  data = score_quantiles[:, 0.75] - score_quantiles[:, 0.25],
                  key_on = 'properties.zipcode',
                  fill_color='BuPu', fill_opacity=0.7, line_opacity=0.5)
sf_map
SF_COORDINATES = (37.76, -122.45)
sf_map = folium.Map(location=SF_COORDINATES, zoom_start=12)
cluster = folium.MarkerCluster()
for _, row in bus_locations[bus_locations['latest_score'] < 70.0].iterrows():
    cluster.add_child(folium.Marker(location=[row['latitude'], row['longitude']],
                                    popup=(row['name'] + ": " + str(row['latest_score']))))
sf_map.add_child(cluster)
sf_map
vio.head()
vio['description'].value_counts()
It looks like the description field has date information embedded in it! How many distinct descriptions are there?
len(vio['description'].value_counts())
We can use regular expressions to extract the dates and remove the [ date violation corrected: ##/#/#### ] string.
vio['resolved'] = pd.to_datetime(vio['description'].str.extract("(\d+\/\d+\/\d+)", expand=False))
vio['desc'] = vio['description'].str.replace("\[.*\]","").str.strip()
Now let's look at all the value counts
len(vio['desc'].value_counts())
print(vio['desc'].value_counts().to_string())
Restaurants with many violations might be a bad sign:
bus['num_violations'] = vio.index.value_counts()
What are the worst offenders?
bus.sort_values('num_violations', ascending=False)[['name', 'num_violations', 'postal_code']].head(10)
That might not be fair if the restaurant is old. Let's look at the violations on the latest inspection.
vio_bus = vio.join(bus['last_inspect'])
recent_vio = vio_bus[vio_bus['date'] >= vio_bus['last_inspect']]
bus['num_recent_violations'] = recent_vio.index.value_counts()
bus.sort_values('num_recent_violations', ascending=False)[['name', 'num_violations', 'num_recent_violations']].head(10)
What about unresolved recent violations?
bus['num_unresolved_violations'] = recent_vio[recent_vio['resolved'].isnull()].index.value_counts()
bus.sort_values('num_unresolved_violations', ascending=False)[[
    'name', 'num_violations', 'num_recent_violations', 'num_unresolved_violations']].head(10)
score_quantiles2 = bus[['postal_code', 'num_recent_violations']].dropna()\
    .groupby('postal_code')['num_recent_violations'].quantile([0.25, .5, 0.75])
SF_COORDINATES = (37.76, -122.45)
sf_map = folium.Map(location=SF_COORDINATES, zoom_start=12)
sf_map.choropleth(geo_str = json.dumps(foodie_geojson),
                  data = score_quantiles2[:, 0.75],
                  key_on = 'properties.zipcode',
                  fill_color='BuPu', fill_opacity=0.7, line_opacity=0.5)
sf_map
SF_COORDINATES = (37.76, -122.45)
sf_map = folium.Map(location=SF_COORDINATES, zoom_start=12)
bus_loc = bus[~bus['latitude'].isnull()]
for bid, row in bus_loc[bus_loc['num_recent_violations'] > 7].iterrows():
    msg = "[" + str(bid) + "] " + row['name'] + ": " + str(row['latest_score']) + "--> " + \
          " -- ".join(recent_vio.loc[bid]['desc'].values)
    sf_map.add_child(folium.Marker(location=[row['latitude'], row['longitude']],
                                   popup=msg))
sf_map
Recall:
bus['latest_score'].dropna().describe()
Let's look at the violations for restaurants that scored at or above 96:
tmp = recent_vio.join(bus['latest_score'])
tmp[tmp['latest_score'] >= 96.].groupby('desc')['desc'].count().sort_values(ascending=False)
Now the violations for restaurants that scored at or below 70:
tmp = recent_vio.join(bus['latest_score'])
tmp[tmp['latest_score'] <= 70.].groupby('desc')['desc'].count().sort_values(ascending=False)
We can one-hot encode the violation descriptions for each business (using pd.get_dummies) and then examine the distribution of scores for each combination of violations:
features = pd.get_dummies(recent_vio['desc']).groupby(recent_vio.index).sum()
# training_data = bus[['latest_score', 'num_recent_violations']]\
#     .join(features)
training_data = bus[['latest_score']].join(features)
training_data = training_data.dropna()
grouped_by_vio = training_data.groupby(list(features.columns))
res = pd.DataFrame({
    'count': grouped_by_vio['latest_score'].count().values,
    'mean': grouped_by_vio['latest_score'].mean().values,
    'var': grouped_by_vio['latest_score'].var().values
})
res[res['count'] > 25]
We could try to build a regression model to identify how violations contribute to the final score.
X = training_data.drop('latest_score', axis=1).as_matrix()
Y = training_data['latest_score'].as_matrix()
How much data do we have?
print(X.shape)
import sklearn
import sklearn.model_selection
(Xtr, Xte, Ytr, Yte) = sklearn.model_selection.train_test_split(X,Y, test_size=.1)
import sklearn.linear_model
#model = sklearn.linear_model.LinearRegression()
model = sklearn.linear_model.LassoCV(eps=0.000001, n_alphas=1000, max_iter=50000, fit_intercept=True)
model.fit(Xtr, Ytr)
print("The R^2 value:", model.score(Xte, Yte))
The most negative coefficients identify the violations that contributed most to a reduction in score.
feature_weights = pd.DataFrame({
    'features': training_data.drop('latest_score', axis=1).columns,
    'weights': model.coef_})
print(feature_weights.sort_values('weights'))
There are a lot of questions we left unanswered. The data for other cities is available and we didn't even begin to answer the questions about how food safety is changing.