import re
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
Manipulating strings directly, without regular expressions, can get us a long way. As an example, we will develop some code that cleans the county names without using regular expressions.
Let's create some sample data that has all of the issues that we want to fix in a few county names. Recall that we want to: convert the names to lower case; drop "county" and "parish" from the ends of the names; eliminate periods and blanks; and replace "&" with "and".
The following strings have all of these features that need fixing.
county_ex = [
    'De witt County',
    'Lac qui Parle County',
    'St. John the Baptist Parish',
    'Stone County',
    'Lewis & Clark County'
]
Typically, we want to convert the letters to lower case first so that afterwards we only have to work with lower case.
In our example, "county" and "parish" are both 6 letters long, so we need not do anything fancy to eliminate the word from each string: we can simply drop the last 7 characters (the word plus the blank before it).
Then we can go through the characters in each string, one at a time, eliminating the blanks and periods and replacing "&" with "and".
The code below takes care of these string manipulations:
def dropchange(l):
    # Replace "&" with "and"; drop blanks and periods; keep all other characters.
    if l == "&":
        return "and"
    elif l == " " or l == ".":
        return ""
    else:
        return l

final = []
for s in county_ex:
    lower = s.lower()          # standardize on lower case
    wo_county = lower[:-7]     # drop " county" or " parish"
    final_word = "".join([dropchange(l) for l in wo_county])
    final.append(final_word)
print(final)
Here's an alternative approach.
def amptoand(l):
    # Replace "&" with "and"; keep all other characters.
    if l == "&":
        return "and"
    else:
        return l

final = []
for s in county_ex:
    lower = s.lower()
    without_county = lower[:-7]    # drop " county" or " parish"
    final_word = "".join([amptoand(l) for l in without_county
                          if l != "." and l != " "])
    final.append(final_word)
print(final)
Why did we eliminate all blanks from the names, rather than just the one in "De witt"? We looked at only a few records and can imagine that there are other cases where we would want to drop a blank. Also, removing all of the blanks makes the county names easier to work with in general. Is there a downside to dropping the blanks? If a county is no longer uniquely identified without the blanks in its name, then that could be a problem.
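One quick way to guard against this is to verify that the cleaned names are still distinct. A minimal sketch, using the final list computed above:
# If any two cleaned names collide, the set will have fewer entries than the list.
assert len(set(final)) == len(final), "cleaned names are no longer unique"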
We can instead use a Python list comprehension to make these changes to county_ex.
cleaned_cty = [
    w.lower()[:-7]        # slice off " county" or " parish"
     .replace(".", "")
     .replace("&", "and")
     .replace(" ", "")
    for w in county_ex
]
print(cleaned_cty)
We can also work with the county name as a column in a dataframe and clean it.
Below are two approaches to creating a data frame from county_ex.
df = pd.DataFrame(pd.Series(county_ex, name="County"))
df
df = pd.DataFrame(data = county_ex, columns=["County"])
df
If we want to return to the original list format, we can get the county strings back out with:
df['County'].tolist()
Lastly, we use the string manipulation methods that Pandas provides for dataframe columns to perform the same cleaning. Rather than looping through each character one at a time, we use the str accessor to operate on the whole column of strings. The replace method can take a regular expression as an argument. An example below is "county|parish": this pattern matches either the characters "county" or the characters "parish", and if replace finds a match anywhere in the string, it replaces the match with "", i.e., it drops it.
(
    df['County'].str.lower()
    .str.replace("county", "")
    .str.replace("parish", "")
    .str.replace(".", "", regex=False)   # a literal period; as a regex, "." matches any character
    .str.replace("&", "and")
    .str.replace(" ", "")
)
Below is a more compact version that consolidates the replacements using metacharacters in regular expressions. The "county|parish" pattern matches either the characters "county" or the characters "parish", so if replace finds a match anywhere in the string, it replaces those 6 characters with "", i.e., they are dropped.
Also, the pattern "[. ]" says that a period and a blank are to be treated equivalently because they are within the square brackets. Now, if either a blank or a period is found, it is replaced with "" (that is, it is dropped).
(
    df['County'].str.lower()
    .str.replace("county|parish", "", regex=True)
    .str.replace("[. ]", "", regex=True)
    .str.replace("&", "and")
)
We have introduced some concepts related to pattern matching, such as | and [ ]. We will cover these in more detail in the examples in the class slides.
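As a small preview of these two metacharacters, here is how they behave with Python's re module on one of our sample strings (a sketch; re.sub is the plain-Python analogue of the pandas replace calls above):
sample = "st. john the baptist parish"
print(re.sub(r"county|parish", "", sample))  # "|" separates alternatives
print(re.sub(r"[. ]", "", sample))           # "[ ]" matches any one enclosed character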
In our next example, we process strings that have some structure, but not enough to always find values between commas, say, or in fixed positions in all records. We can get a long way with simple string manipulation in this example too, but we will also demonstrate how powerful a regular expression can be in extracting fields from strings.
These data are logs from a web site. Again, we use a small data set to demonstrate how we might process the file, but we don't actually process and analyze the full dataset.
!ls -lh data
The sample file has only two records, as we see in the call to head below.
!head data/smallLog.txt
We want to read each line of the file in as a string and then process the strings to create variables. We use readlines to do this.
with open("data/smallLog.txt", "r") as f:
    lines = f.readlines()
print(len(lines))
print(lines[0])
We can place lines in a dataframe and work with the strings as a column in the dataframe.
df = pd.DataFrame(pd.Series(lines, name="raw"))
df
Each log entry contains a lot of information, such as the IP address of the visitor and the date and time of the request.
When we examine the records, we see that there are certain characters that separate the fields. In our two sample records some of the pieces of information appear in the same location, but we don't want to rely on that always being the case. For example, the IP address at the beginning of the string need not always be 14 characters long. It might be 15 characters if all four of the numbers have 3 digits. This would then shift the rest of the information in the string. Instead, we want to key on the separators to locate the information.
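We can see this on our two sample records by locating the first left bracket in each line; the offsets need not agree across records (a quick check, using the lines list read in above):
# Position of the "[" that starts the date field; it shifts with the IP address length
print([line.find("[") for line in lines])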
Sometimes these separators are dashes, other times they are square brackets, commas, or colons. Let's consider the task of extracting the day, month, and year, each as a string. One way to do this is to split the string into pieces. We can do this with the split method.
We first split the string several times over to zoom in on the substring(s) of interest. That is, the date appears between [ and ].
Let's first split the string at the left square brackets. Notice that the square brackets are "eaten" by the splitting process.
date = df['raw'].str.split(r'[').str
date[0]
date[1]
The second substring contains the date. If we split the second substring at colons then the first of its substrings will contain the day, month, and year. That is,
date2 = date[1].str.split(r':').str[0]
print(date2)
Lastly, we can split this string, e.g., "26/Jan/2014", on the forward slash and assign each of the three resulting substrings to day, month, and year columns in df.
date3 = date2.str.split(r'/').str[0:3]
df['day'] = date3.str[0]
df['mon'] = date3.str[1]
df['Yr'] = date3.str[2]
df
We can perform all of these splits at once:
date_pieces = (df['raw']
    .str.split(r'[').str[1]
    .str.split(r':').str[0]
    .str.split(r'/').str[0:3]
)
print(date_pieces)
df['day'] = date_pieces.str[0]
df['mon'] = date_pieces.str[1]
df['Yr'] = date_pieces.str[2]
df
However, with regular expressions we can split on any of the three characters '[', ':', or '/' in one call to split. That is, we can treat these three characters as equivalent split points and then select the desired substrings. We do this as follows.
df1 = pd.DataFrame(pd.Series(lines, name="raw"))
date = df1['raw'].str.split(r'[\[/:]').str[1:4]
df1['day'] = date.str[0]
df1['mon'] = date.str[1]
df1['yr'] = date.str[2]
df1
In the above call to split we passed the regular expression '[\[/:]'. This is a special pattern. The outer brackets indicate that the characters within are to be treated equivalently. The literals are within the brackets: the left bracket, the forward slash, and the colon. Equivalent characters means: "when any of these are found", split the string. Notice the backslash preceding the left bracket within the outer brackets. The backslash indicates that the [ is to be treated as a literal left bracket, i.e., it is not a special character for the regex engine. We delve into regular expressions in greater depth in the next example.
vio = pd.read_csv("../../hw/hw2/data/violations.csv")
new_date = pd.to_datetime(vio['date'].astype(str), format='%Y%m%d')  # parse YYYYMMDD dates
vio['year'] = new_date.dt.year
vio = vio[vio['year'] == 2016].copy()
vio['description'].value_counts().head(20)
len(vio['description'].value_counts())
vio['description'].value_counts().head(20).plot(kind='barh')
vio['description'].value_counts().tail(10)
Remove the "[ date violation corrected ...]" piece.
vio['desc'] = vio['description'].str.replace(r"\[.*\]", "", regex=True).str.strip()
Now let's check how many unique violation descriptions we have
len(vio['desc'].value_counts())
vio['desc'].value_counts().head(20)
vio['desc'].value_counts().tail(20)
# pd.get_dummies(vio['desc'])
regex = re.compile(r"clean|sanitary")
%%timeit
vio['desc'].str.contains(regex)
%%timeit
vio['desc'].str.contains(r"clean|sanitary")
vio['isClean'] = vio['desc'].str.contains(r"clean|sanit")
surface = vio['desc'].str.contains(r"surface")
food = vio['desc'].str.contains(r"\Wfood")
vio['isFoodSurface'] = (surface & food)
vio['isHighRisk'] = vio['desc'].str.contains(r"high risk")
vio['isMedRisk'] = vio['desc'].str.contains(r"medium risk")
vio['isLowRisk'] = vio['desc'].str.contains(r"low risk")
# create risk, which is 1, 2, or 3 for low, medium, or high risk
vio["risk"] = vio['isLowRisk'] + 2*vio['isMedRisk'] + 3*vio['isHighRisk']
vio['isVermin'] = vio['desc'].str.contains(r"vermin")
vio['isStorage'] = vio['desc'].str.contains(r"thaw|cool|therm|storage")
vio['isDisplayPermit'] = vio['desc'].str.contains(r"certificate|permit")
vio['isNonFoodSurface'] = vio['desc'].str.contains(r"wall|ceiling|floor|nonfood surface")
vio['isHuman'] = vio['desc'].str.contains(r"hand|glove|hair|nail")
features = vio.groupby(['business_id', 'date']).aggregate({
    "isClean": sum,
    "risk": max
})
features.head()
a = vio.groupby(['business_id', 'date'])[['isClean', 'isHighRisk', 'isVermin',
'isStorage', 'isDisplayPermit', 'isHuman',
'isNonFoodSurface']].sum()
b = vio.groupby(['business_id', 'date'])[['risk']].max()
features = a.join(b)
features = vio.groupby(['business_id', 'date'])[['isClean', 'isHighRisk', 'isVermin',
'isStorage', 'isDisplayPermit', 'isHuman',
'isNonFoodSurface']].sum()
features.head()
ins = pd.read_csv("../../hw/hw2/data/inspections.csv").set_index(['business_id', 'date'])
ins.head()
ins.join(features).head()
features = vio.groupby(['business_id', 'date'], as_index=False)[['isClean', 'isHighRisk',
'isVermin', 'isStorage',
'isDisplayPermit', 'isHuman',
'isNonFoodSurface']].sum()
features.head()
ins = pd.read_csv("../../hw/hw2/data/inspections.csv")
ins.head()
joined_data = pd.merge(features, ins, on=['business_id', 'date'])
joined_data.head()
feature_names = ['isClean', 'isHighRisk', 'isVermin',
'isStorage', 'isDisplayPermit', 'isHuman', 'isNonFoodSurface']
import sklearn.linear_model as lm
model = lm.LinearRegression()
model.fit(joined_data[feature_names], joined_data['score'])
model.coef_
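The coefficient array alone is hard to read; pairing each coefficient with its feature name makes the fit easier to interpret (a small convenience step, not part of the analysis above):
# Match each fitted coefficient with its feature name
for name, coef in zip(feature_names, model.coef_):
    print(f"{name}: {coef:.2f}")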
plt.figure(figsize=(15, 10))
i = 1
for f in feature_names:
    plt.subplot(2, 4, i)
    sns.boxplot(y="score", x=f, data=joined_data)
    i += 1
plt.savefig("xxx.pdf")
In our final example, we work with the text of the State of the Union addresses from 1790 to 2017.
with open("data/stateoftheunion1790-2017.txt", "r") as f:
    text = f.read()
records = text.split("***")

def extract_parts(line):
    parts = line.split("\n")
    name = parts[3].strip()
    date = parts[4].strip()
    text = "\n".join(parts[5:]).strip()
    return [name, date, text]
df = pd.DataFrame([extract_parts(l) for l in records[1:]], columns=["Name", "Date", "Text"])
df.head()
df['clean text'] = (
    df['Text']
    .str.replace("\n", " ")
    .str.lower()
    .str.replace(r"[^a-z\s]", " ", regex=True)
)
df.head()
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer()
tfidf = vec.fit_transform(df['clean text'])
np.random.seed(42)
from scipy.sparse.linalg import svds
(u, s, vt) = svds(tfidf, k=2)
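Here svds returns a rank-2 factorization, so each speech gets a two-dimensional representation in u; those two coordinates are what we plot below. A quick look at the shapes:
# u: one row per speech, two columns; s: the two singular values; vt: two rows over the vocabulary
print(u.shape, s.shape, vt.shape)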
df['Year'] = df['Date'].str[-4:].astype('int')
df['x'] = u[:,0]
df['y'] = u[:,1]
sns.lmplot(x = 'x', y = 'y', data = df, hue='Year', legend=False, fit_reg=False, palette="Blues")
!pip install plotly
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.figure_factory as ff
colors = ["rgba({0},{1},{2},1)".format(*(int(255 * v) for v in c))
          for c in sns.color_palette("Blues", len(df))]   # CSS rgba channels run 0-255
colors[-1] = "rgba(252,128,51,1)"   # make the most recent address stand out in orange
py.iplot([go.Scatter(x = df['x'], y = df['y'], mode='markers', marker=dict(color=colors), text=df['Name'])])