In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import zipfile
%matplotlib inline

Canonicalization

In [2]:
with open('county_and_state.csv') as f:
    county_and_state = pd.read_csv(f)
    
with open('county_and_population.csv') as f:
    county_and_pop = pd.read_csv(f)    

Suppose we'd like to join these two tables. Unfortunately, we can't, because the strings representing the county names don't match, as seen below.

In [3]:
county_and_state
Out[3]:
County State
0 De Witt County IL
1 Lac qui Parle County MN
2 Lewis and Clark County MT
3 St John the Baptist Parish LS
In [4]:
county_and_pop
Out[4]:
County Population
0 DeWitt 16798
1 Lac Qui Parle 8067
2 Lewis & Clark 55716
3 St. John the Baptist 43044

Before we can join them, we'll do what I call canonicalization.

Canonicalization: A process for converting data that has more than one possible representation into a "standard", "normal", or canonical form (definition via Wikipedia).

In [5]:
def canonicalize_county(county_name):
    return (
        county_name
        .lower()               # lower case
        .replace(' ', '')      # remove spaces
        .replace('&', 'and')   # replace &
        .replace('.', '')      # remove dot
        .replace('county', '') # remove county
        .replace('parish', '') # remove parish
    )
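
As a quick sanity check, two different raw spellings of the same county now map to the same canonical string:

# Both spellings collapse to 'lewisandclark'
canonicalize_county('Lewis & Clark') == canonicalize_county('Lewis and Clark County')   # True
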
In [6]:
county_and_pop['clean_county'] = county_and_pop['County'].map(canonicalize_county)
county_and_state['clean_county'] = county_and_state['County'].map(canonicalize_county)
In [7]:
county_and_pop.merge(county_and_state,
                     left_on = 'clean_county', right_on = 'clean_county')
Out[7]:
County_x Population clean_county County_y State
0 DeWitt 16798 dewitt De Witt County IL
1 Lac Qui Parle 8067 lacquiparle Lac qui Parle County MN
2 Lewis & Clark 55716 lewisandclark Lewis and Clark County MT
3 St. John the Baptist 43044 stjohnthebaptist St John the Baptist Parish LS
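
Since the join key has the same name in both tables, the merge can also be written more compactly, with the same result:

# Equivalent: a single on= argument when the key column name matches on both sides
county_and_pop.merge(county_and_state, on='clean_county')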

Processing Data from a Text Log Using Basic Python

In [8]:
with open('log.txt', 'r') as f:
    log_lines = f.readlines()
In [9]:
log_lines
Out[9]:
['169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"\n',
 '193.205.203.3 - - [2/Feb/2005:17:23:6 -0800] "GET /stat141/Notes/dim.html HTTP/1.0" 404 302 "http://eeyore.ucdavis.edu/stat141/Notes/session.html"\n',
 '169.237.46.240 - "" [3/Feb/2006:10:18:37 -0800] "GET /stat141/homework/Solutions/hw1Sol.pdf HTTP/1.1"\n']

Suppose we want to extract the day, month, year, hour, minutes, seconds, and timezone. Looking at the data, we see that these items are not in a fixed position relative to the beginning of the string. That is, slicing by some fixed offset isn't going to work.

In [10]:
log_lines[0][20:31]
Out[10]:
'26/Jan/2014'
In [11]:
log_lines[1][20:31]
Out[11]:
'/Feb/2005:1'

Instead, we'll need to use some more sophisticated thinking. Let's focus on only the first line of the file.

In [12]:
first = log_lines[0]
first
Out[12]:
'169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"\n'
In [13]:
pertinent = first.split("[")[1].split(']')[0]
day, month, rest = pertinent.split('/')
year, hour, minute, rest = rest.split(':')
seconds, time_zone = rest.split(' ')
day, month, year, hour, minute, seconds, time_zone
Out[13]:
('26', 'Jan', '2014', '10', '47', '58', '-0800')

A much more sophisticated but common approach is to extract the information we need using a regular expression. See today's lecture slides for more on regular expressions.

In [14]:
import re
pattern = r'\[(\d+)/(\w+)/(\d+):(\d+):(\d+):(\d+) (.+)\]'
day, month, year, hour, minute, second, time_zone = re.search(pattern, first).groups()
year, month, day, hour, minute, second, time_zone
Out[14]:
('2014', 'Jan', '26', '10', '47', '58', '-0800')

Or, alternatively, using the findall method:

In [15]:
import re
pattern = r'\[(\d+)/(\w+)/(\d+):(\d+):(\d+):(\d+) (.+)\]'
day, month, year, hour, minute, second, time_zone = re.findall(pattern, first)[0]
year, month, day, hour, minute, second, time_zone
Out[15]:
('2014', 'Jan', '26', '10', '47', '58', '-0800')

Note: We can return the results as a Series:

In [16]:
cols = ["Day", "Month", "Year", "Hour", "Minute", "Second", "Time Zone"]
def log_entry_to_series(line):
    return pd.Series(re.search(pattern, line).groups(), index = cols)

log_entry_to_series(first)
Out[16]:
Day             26
Month          Jan
Year          2014
Hour            10
Minute          47
Second          58
Time Zone    -0800
dtype: object

And using this function we can create a DataFrame of all the time information.

In [17]:
log_info = pd.DataFrame(columns=cols)

for line in log_lines:
    log_info = log_info.append(log_entry_to_series(line), ignore_index = True)

log_info
Out[17]:
Day Month Year Hour Minute Second Time Zone
0 26 Jan 2014 10 47 58 -0800
1 2 Feb 2005 17 23 6 -0800
2 3 Feb 2006 10 18 37 -0800
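
Note that DataFrame.append was removed in pandas 2.0, so the loop above no longer runs on current versions; an equivalent (and faster) way to build the same frame is to construct it from a list of Series in one call:

# Build the same DataFrame without repeated .append calls
log_info = pd.DataFrame([log_entry_to_series(line) for line in log_lines])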

Regular Expression From Lecture

Fill in the regex below so that after the code executes, day is “26”, month is “Jan”, and year is “2014”.

In [18]:
log_lines[0]
Out[18]:
'169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"\n'
In [19]:
pattern = r"YOUR REGEX HERE"
matches = re.findall(pattern, log_lines[0])
#day, month, year = matches[0]
#day, month, year
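
For reference, one regex that satisfies the requirement (many others work) mirrors the pattern used earlier:

# A possible solution: capture day, month, and year from the bracketed timestamp
pattern = r"\[(\d+)/(\w+)/(\d+):"
matches = re.findall(pattern, log_lines[0])
day, month, year = matches[0]   # ('26', 'Jan', '2014')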

Real World Example #1: Restaurant Data

In this example, we will show how regexes allow us to track quantitative data across categories defined by the appearance of various text fields.

Specifically, we'll see how the presence of certain keywords relates to quantitative data, e.g. how do restaurant health scores vary as a function of the number of violations that mention "vermin"?

In [20]:
vio = pd.read_csv('violations.csv', header=0, names=['id', 'date', 'desc'])
desc = vio['desc']
vio.head()
Out[20]:
id date desc
0 19 20171211 Inadequate food safety knowledge or lack of ce...
1 19 20171211 Unapproved or unmaintained equipment or utensils
2 19 20160513 Unapproved or unmaintained equipment or utensi...
3 19 20160513 Unclean or degraded floors walls or ceilings ...
4 19 20160513 Food safety certificate or food handler card n...
In [21]:
counts = desc.value_counts()

counts[:10]
Out[21]:
Unclean or degraded floors walls or ceilings                          999
Unapproved or unmaintained equipment or utensils                      659
Inadequately cleaned or sanitized food contact surfaces               493
Improper food storage                                                 476
Inadequate and inaccessible handwashing facilities                    467
Moderate risk food holding temperature                                452
Wiping cloths not clean or properly stored or inadequate sanitizer    418
Moderate risk vermin infestation                                      374
Unclean nonfood contact surfaces                                      369
Food safety certificate or food handler card not available            353
Name: desc, dtype: int64
In [22]:
# Hmmm...
counts[50:60]
Out[22]:
Inadequate HACCP plan record keeping                                                                16
Moderate risk food holding temperature   [ date violation corrected: 10/11/2016 ]                   16
Unclean or degraded floors walls or ceilings  [ date violation corrected: 11/29/2017 ]              16
Unclean or degraded floors walls or ceilings  [ date violation corrected: 12/7/2017 ]               15
Unclean or degraded floors walls or ceilings  [ date violation corrected: 11/27/2017 ]              15
Inadequately cleaned or sanitized food contact surfaces  [ date violation corrected: 9/26/2017 ]    14
Unapproved or unmaintained equipment or utensils  [ date violation corrected: 9/19/2017 ]           14
Unclean or degraded floors walls or ceilings  [ date violation corrected: 11/28/2017 ]              14
Unclean or degraded floors walls or ceilings  [ date violation corrected: 9/6/2017 ]                14
Unapproved  living quarters in food facility                                                        13
Name: desc, dtype: int64
In [23]:
# Use regular expressions to cut out the extra info in square brackets.
vio['clean_desc'] = (vio['desc']
             .str.replace(r'\s*\[.*\]$', '', regex=True)
             .str.strip()
             .str.lower())
vio.head()
Out[23]:
id date desc clean_desc
0 19 20171211 Inadequate food safety knowledge or lack of ce... inadequate food safety knowledge or lack of ce...
1 19 20171211 Unapproved or unmaintained equipment or utensils unapproved or unmaintained equipment or utensils
2 19 20160513 Unapproved or unmaintained equipment or utensi... unapproved or unmaintained equipment or utensils
3 19 20160513 Unclean or degraded floors walls or ceilings ... unclean or degraded floors walls or ceilings
4 19 20160513 Food safety certificate or food handler card n... food safety certificate or food handler card n...
In [24]:
vio['clean_desc'].value_counts().head() 
Out[24]:
unclean or degraded floors walls or ceilings               3507
moderate risk food holding temperature                     2542
inadequate and inaccessible handwashing facilities         2529
unapproved or unmaintained equipment or utensils           2382
inadequately cleaned or sanitized food contact surfaces    2301
Name: clean_desc, dtype: int64
In [25]:
#use regular expressions to assign new features for the presence of various keywords
with_features = (vio
 .assign(is_clean     = vio['clean_desc'].str.contains('clean|sanit'))
 .assign(is_high_risk = vio['clean_desc'].str.contains('high risk'))
 .assign(is_vermin    = vio['clean_desc'].str.contains('vermin'))
 .assign(is_surface   = vio['clean_desc'].str.contains('wall|ceiling|floor|surface'))
 .assign(is_human     = vio['clean_desc'].str.contains('hand|glove|hair|nail'))
 .assign(is_permit    = vio['clean_desc'].str.contains('permit|certif'))
)
with_features.head()
Out[25]:
id date desc clean_desc is_clean is_high_risk is_vermin is_surface is_human is_permit
0 19 20171211 Inadequate food safety knowledge or lack of ce... inadequate food safety knowledge or lack of ce... False False False False False True
1 19 20171211 Unapproved or unmaintained equipment or utensils unapproved or unmaintained equipment or utensils False False False False False False
2 19 20160513 Unapproved or unmaintained equipment or utensi... unapproved or unmaintained equipment or utensils False False False False False False
3 19 20160513 Unclean or degraded floors walls or ceilings ... unclean or degraded floors walls or ceilings True False False True False False
4 19 20160513 Food safety certificate or food handler card n... food safety certificate or food handler card n... False False False False True True
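
Note that Series.str.contains treats its argument as a regular expression by default, so a pattern like 'clean|sanit' flags a description containing either substring. A minimal illustration on a toy Series:

# '|' acts as "or" because str.contains does regex matching by default
pd.Series(['unclean floors', 'no sanitizer', 'missing permit']).str.contains('clean|sanit')
# -> True, True, False
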
In [26]:
count_features = (with_features
 .groupby(['id', 'date'])
 .sum()
 .reset_index()
)
count_features.iloc[255:260, :]
Out[26]:
id date is_clean is_high_risk is_vermin is_surface is_human is_permit
255 489 20150728 5.0 0.0 2.0 3.0 0.0 0.0
256 489 20150807 1.0 0.0 0.0 1.0 0.0 0.0
257 489 20160308 2.0 2.0 1.0 0.0 1.0 0.0
258 489 20160721 2.0 1.0 1.0 1.0 0.0 1.0
259 489 20161220 3.0 0.0 1.0 2.0 0.0 0.0
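
The sums count violations because pandas treats booleans as 0/1 when summing, so grouping by ['id', 'date'] gives per-inspection counts of each violation type:

# Summing a boolean Series counts its True values
pd.Series([True, False, True]).sum()   # 2
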
In [27]:
count_features.query('is_vermin > 1')
Out[27]:
id date is_clean is_high_risk is_vermin is_surface is_human is_permit
255 489 20150728 5.0 0.0 2.0 3.0 0.0 0.0
291 527 20170821 1.0 1.0 2.0 1.0 1.0 1.0
1508 2622 20160526 4.0 2.0 2.0 3.0 0.0 0.0
1573 2721 20150422 2.0 1.0 2.0 1.0 0.0 0.0
1746 2945 20150921 2.0 1.0 2.0 2.0 2.0 1.0
2277 3776 20151223 2.0 0.0 2.0 1.0 0.0 0.0
3784 6729 20160706 6.0 0.0 2.0 4.0 2.0 0.0
3949 7537 20150304 1.0 0.0 2.0 1.0 1.0 0.0
4017 7661 20160616 2.0 0.0 2.0 2.0 1.0 0.0
4165 9679 20150422 2.0 0.0 2.0 2.0 0.0 0.0
7742 69353 20150420 3.0 1.0 2.0 2.0 1.0 0.0
8319 71999 20150401 1.0 1.0 2.0 0.0 0.0 0.0
9074 77427 20170706 0.0 0.0 2.0 0.0 0.0 0.0
9759 80112 20160225 2.0 0.0 2.0 0.0 0.0 2.0
11552 87492 20161208 6.0 2.0 2.0 4.0 2.0 0.0
In [28]:
#use a new pandas feature called "melt" that we won't describe in any detail
#the granularity of the resulting frame is a violation type in a given inspection
broken_down_by_violation_type = pd.melt(count_features, id_vars=['id', 'date'],
            var_name='feature', value_name='num_vios')
broken_down_by_violation_type.sort_values(["id", "date"]).head(13)
Out[28]:
id date feature num_vios
0 19 20160513 is_clean 1.0
12262 19 20160513 is_high_risk 0.0
24524 19 20160513 is_vermin 0.0
36786 19 20160513 is_surface 1.0
49048 19 20160513 is_human 1.0
61310 19 20160513 is_permit 1.0
1 19 20171211 is_clean 0.0
12263 19 20171211 is_high_risk 0.0
24525 19 20171211 is_vermin 0.0
36787 19 20171211 is_surface 0.0
49049 19 20171211 is_human 0.0
61311 19 20171211 is_permit 1.0
2 24 20160311 is_clean 2.0
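
For reference, pd.melt "unpivots" a wide table into a long one: the id_vars columns are repeated, and every other column becomes a (variable, value) pair of rows. A minimal sketch on a toy frame (column names chosen only for illustration):

toy = pd.DataFrame({'id': [1, 2], 'is_clean': [2, 0], 'is_vermin': [1, 0]})
pd.melt(toy, id_vars=['id'], var_name='feature', value_name='num_vios')
#    id    feature  num_vios
# 0   1   is_clean         2
# 1   2   is_clean         0
# 2   1  is_vermin         1
# 3   2  is_vermin         0
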
In [29]:
#read in the scores
ins = pd.read_csv('inspections.csv',
                  header=0,
                  usecols=[0, 1, 2],
                  names=['id', 'score', 'date'])
ins.head()
Out[29]:
id score date
0 19 94 20160513
1 19 94 20171211
2 24 98 20171101
3 24 98 20161005
4 24 96 20160311
In [30]:
#join scores with the table broken down by violation type
violation_type_and_scores = (
    broken_down_by_violation_type
    .merge(ins, left_on=['id', 'date'], right_on=['id', 'date'])
)
violation_type_and_scores.head(12)
Out[30]:
id date feature num_vios score
0 19 20160513 is_clean 1.0 94
1 19 20160513 is_high_risk 0.0 94
2 19 20160513 is_vermin 0.0 94
3 19 20160513 is_surface 1.0 94
4 19 20160513 is_human 1.0 94
5 19 20160513 is_permit 1.0 94
6 19 20171211 is_clean 0.0 94
7 19 20171211 is_high_risk 0.0 94
8 19 20171211 is_vermin 0.0 94
9 19 20171211 is_surface 0.0 94
10 19 20171211 is_human 0.0 94
11 19 20171211 is_permit 1.0 94
In [31]:
sns.catplot(x='num_vios', y='score',
               col='feature', col_wrap=2,
               kind='box',
               data=violation_type_and_scores)
Out[31]:
<seaborn.axisgrid.FacetGrid at 0x1a229dce10>

Above we see, for example, that if a restaurant inspection involved 2 violations with the keyword "vermin", the score for that inspection was typically a bit below 80.

Text Processing Example 2: Police Data

In this example, we will apply string processing as part of data cleaning and exploratory data analysis.

Getting the Data

The city of Berkeley maintains an Open Data Portal for citizens to access data about the city. We will be examining Call Data.

In [32]:
import ds100_utils

calls_url = 'https://data.cityofberkeley.info/api/views/k2nh-s5h5/rows.csv?accessType=DOWNLOAD'
calls_file = ds100_utils.fetch_and_cache(calls_url, 'calls.csv')
calls = pd.read_csv(calls_file, warn_bad_lines=True)
calls.head()
Using cached version that was downloaded (UTC): Mon Sep 14 13:29:42 2020
Out[32]:
CASENO OFFENSE EVENTDT EVENTTM CVLEGEND CVDOW InDbDate Block_Location BLKADDR City State
0 19043288 SEXUAL ASSAULT FEL. 08/02/2019 12:00:00 AM 12:00 SEX CRIME 5 09/14/2020 07:00:09 AM 0 UNKNOWN\nBerkeley, CA 0 UNKNOWN Berkeley CA
1 19092424 THEFT MISD. (UNDER $950) 11/03/2019 12:00:00 AM 20:10 LARCENY 0 09/14/2020 07:00:12 AM DURANT AVE\nBerkeley, CA DURANT AVE Berkeley CA
2 19039736 DISTURBANCE 07/21/2019 12:00:00 AM 04:30 DISORDERLY CONDUCT 0 09/14/2020 07:00:08 AM 300 CANYON RD\nBerkeley, CA 300 CANYON RD Berkeley CA
3 19092062 BURGLARY AUTO 09/28/2019 12:00:00 AM 07:25 BURGLARY - VEHICLE 6 09/14/2020 07:00:11 AM SPINNAKER WAY\nBerkeley, CA SPINNAKER WAY Berkeley CA
4 19092648 VANDALISM 11/26/2019 12:00:00 AM 22:00 VANDALISM 2 09/14/2020 07:00:12 AM 1200 HOPKINS ST\nBerkeley, CA\n(37.876428, -12... 1200 HOPKINS ST Berkeley CA

How many records did we get?

In [33]:
len(calls)
Out[33]:
5227

What does an example Block_Location value look like?

In [34]:
print(calls['Block_Location'].iloc[4])
1200 HOPKINS ST
Berkeley, CA
(37.876428, -122.291736)

Preliminary observations on the data?

  1. EVENTDT -- Contains the date of the event, but the time component is always 12:00:00 AM (i.e., not meaningful).
  2. EVENTTM -- Contains the time in 24-hour format (What time zone?).
  3. CVDOW -- Encodes the day of the week (see data documentation).
  4. InDbDate -- Appears to be correctly formatted and fairly consistent in time.
  5. Block_Location -- A multi-line string that contains coordinates.
  6. BLKADDR -- Appears to be the address portion of Block_Location.
  7. City and State -- Seem redundant, given that this is supposed to be a City of Berkeley dataset.

Extracting locations

The block location contains geographic coordinates. Let's extract them.

In [35]:
calls['Block_Location'][4]
Out[35]:
'1200 HOPKINS ST\nBerkeley, CA\n(37.876428, -122.291736)'
In [36]:
calls_lat_lon = (
    calls['Block_Location']
    .str.extract(r"\((\d+\.\d+), (-\d+\.\d+)\)")
)
calls_lat_lon.columns = ['Lat', 'Lon']
calls_lat_lon.head(10)
Out[36]:
Lat Lon
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 37.876428 -122.291736
5 37.881957 -122.269551
6 37.858116 -122.268002
7 37.88014 -122.297498
8 37.894692 -122.272265
9 37.847262 -122.275622

How many records have missing values?

In [37]:
calls_lat_lon.isnull().sum()
Out[37]:
Lat    193
Lon    193
dtype: int64

Examine the missing values.

In [38]:
calls[calls_lat_lon.isnull().any(axis=1)]['Block_Location'].head(10)
Out[38]:
0                            0 UNKNOWN\nBerkeley, CA
1                           DURANT AVE\nBerkeley, CA
2                        300 CANYON RD\nBerkeley, CA
3                        SPINNAKER WAY\nBerkeley, CA
13           ADDISON STREET & GRANT ST\nBerkeley, CA
50          SHATTUCK AV & TELEGRAPH AV\nBerkeley, CA
57       SAN PABLO AVENUE &CARLETON ST\nBerkeley, CA
62       GOLDEN GATE FIELDS &GILMAN ST\nBerkeley, CA
69    CLAREMONT AVE&STONEWALL ROAD AVE\nBerkeley, CA
77          SHATTUCK AVENUE &CENTER ST\nBerkeley, CA
Name: Block_Location, dtype: object

Join in the extracted values.

In [39]:
calls['Lat'] = calls_lat_lon['Lat']
calls['Lon'] = calls_lat_lon['Lon']
calls.head()
Out[39]:
CASENO OFFENSE EVENTDT EVENTTM CVLEGEND CVDOW InDbDate Block_Location BLKADDR City State Lat Lon
0 19043288 SEXUAL ASSAULT FEL. 08/02/2019 12:00:00 AM 12:00 SEX CRIME 5 09/14/2020 07:00:09 AM 0 UNKNOWN\nBerkeley, CA 0 UNKNOWN Berkeley CA NaN NaN
1 19092424 THEFT MISD. (UNDER $950) 11/03/2019 12:00:00 AM 20:10 LARCENY 0 09/14/2020 07:00:12 AM DURANT AVE\nBerkeley, CA DURANT AVE Berkeley CA NaN NaN
2 19039736 DISTURBANCE 07/21/2019 12:00:00 AM 04:30 DISORDERLY CONDUCT 0 09/14/2020 07:00:08 AM 300 CANYON RD\nBerkeley, CA 300 CANYON RD Berkeley CA NaN NaN
3 19092062 BURGLARY AUTO 09/28/2019 12:00:00 AM 07:25 BURGLARY - VEHICLE 6 09/14/2020 07:00:11 AM SPINNAKER WAY\nBerkeley, CA SPINNAKER WAY Berkeley CA NaN NaN
4 19092648 VANDALISM 11/26/2019 12:00:00 AM 22:00 VANDALISM 2 09/14/2020 07:00:12 AM 1200 HOPKINS ST\nBerkeley, CA\n(37.876428, -12... 1200 HOPKINS ST Berkeley CA 37.876428 -122.291736
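
One caveat: str.extract returns strings, so the new Lat and Lon columns have object dtype. The mapping code below converts them to floats on the fly; they could equally be converted up front, for example:

# Optional: convert the extracted coordinate strings to numeric values now
calls[['Lat', 'Lon']] = calls[['Lat', 'Lon']].apply(pd.to_numeric)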

Examining Location information

Let's examine the geographic data (latitude and longitude). Recall that we had some missing values. Let's look at the behavior of these missing values according to crime type.

In [40]:
missing_lat_lon = calls[calls[['Lat', 'Lon']].isnull().any(axis=1)]
missing_lat_lon['CVLEGEND'].value_counts().plot(kind='barh');
In [41]:
calls['CVLEGEND'].value_counts().plot(kind='barh');

We might further normalize these counts by the overall frequency of each crime type to find which type of crime has the highest proportion of missing location values.

In [42]:
(missing_lat_lon['CVLEGEND'].value_counts() 
 / calls['CVLEGEND'].value_counts()
).sort_values(ascending=False).plot(kind="barh");

Now, let's make a crime map.

In [43]:
# you may need to install the folium package for this to work
# to do this, uncomment the line below and run it.
# !pip install folium
In [44]:
import folium
import folium.plugins

SF_COORDINATES = (37.87, -122.28)
sf_map = folium.Map(location=SF_COORDINATES, zoom_start=13)
locs = calls[['Lat', 'Lon']].astype('float').dropna().values
heatmap = folium.plugins.HeatMap(locs.tolist(), radius=10)
sf_map.add_child(heatmap)
Out[44]:
[An interactive folium heat map of the call locations renders here (the notebook must be trusted to display it).]

Questions

  1. Is campus really the safest place to be?
  2. Why are all the calls located on the street, and often at intersections?
In [45]:
import folium.plugins

locations = calls[calls['CVLEGEND'] == 'ASSAULT'][['Lat', 'Lon']]

cluster = folium.plugins.MarkerCluster()
for _, r in locations.dropna().iterrows():
    cluster.add_child(
        folium.Marker([float(r["Lat"]), float(r["Lon"])]))
    
sf_map = folium.Map(location=SF_COORDINATES, zoom_start=13)
sf_map.add_child(cluster)
sf_map
Out[45]:
[An interactive folium map with clustered markers for assault-related calls renders here (the notebook must be trusted to display it).]

Bonus Content: Using pd.to_datetime to Extract Time Information

Date parsing using pd.to_datetime.

In [46]:
pd.Series(log_lines).str.extract(r'\[(.*) -0800\]').apply(
    lambda s: pd.to_datetime(s, format='%d/%b/%Y:%H:%M:%S'))
Out[46]:
0
0 2014-01-26 10:47:58
1 2005-02-02 17:23:06
2 2006-02-03 10:18:37
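
As an aside, if we wanted to keep the UTC offset rather than stripping it, the %z directive can parse it directly; a minimal sketch along the same lines:

# Capture the full bracketed timestamp, including the offset, and parse with %z
pd.Series(log_lines).str.extract(r'\[(.*)\]').apply(
    lambda s: pd.to_datetime(s, format='%d/%b/%Y:%H:%M:%S %z'))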