import re
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
Manipulating strings directly, without regular expressions, can get us a long way. As an example, we will develop some code that cleans the county names without using regular expressions.
Let's create some sample data that has all of the issues that we want to fix in a few county names. Recall that we want to: convert the names to lower case; drop "county" and "parish" from the ends of the names; eliminate periods and blanks; and replace "&" with "and".
The following strings have all of these features that need fixing.
county_ex = [
    'De witt County',
    'Lac qui Parle County',
    'St. John the Baptist Parish',
    'Stone County',
    'Lewis & Clark County'
]
Typically, we want to convert the letters to lower case first so that afterwards we only have to work with lower case.
In our example, "county" and "parish" are both 6 letters long, so we need not do anything fancy to eliminate the word from each string: we can simply drop the last 7 characters (the word plus the blank before it).
Then we can go through the characters in each string, one at a time, eliminating the blanks and periods and replacing "&" with "and".
The code below takes care of these string manipulations:
def dropchange(l):
    # Replace "&" with "and"; drop blanks and periods; keep all other characters.
    if l == "&":
        return "and"
    elif l == " " or l == ".":
        return ""
    else:
        return l

final = []
for s in county_ex:
    lower = s.lower()          # standardize on lower case
    wo_county = lower[:-7]     # drop " county" or " parish"
    final_word = "".join([dropchange(l) for l in wo_county])
    final.append(final_word)
print(final)
Here's an alternative approach.
def amptoand(l):
    # Replace "&" with "and"; keep all other characters.
    if l == "&":
        return "and"
    else:
        return l

final = []
for s in county_ex:
    lower = s.lower()
    without_county = lower[:-7]    # drop " county" or " parish"
    final_word = "".join([amptoand(l) for l in without_county
                          if l != "." and l != " "])
    final.append(final_word)
print(final)
Why did we eliminate all blanks from the names, rather than just the one in "De witt"? We looked at only a few records and can imagine that there are other cases where we would want to drop a blank. Also, removing all of the blanks makes the county names easier to work with in general. Is there a downside to dropping the blanks? If a county is no longer uniquely identified without the blanks in its name, then that could be a problem.
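One quick way to guard against this is to verify that the cleaned names are still distinct. A minimal sketch, using the final list computed above:
# If any two cleaned names collide, the set will have fewer entries than the list.
assert len(set(final)) == len(final), "cleaned names are no longer unique"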
We can instead use a Python list comprehension to make these changes to county_ex.
cleaned_cty = [
    w.lower()[:-7]        # slice off " county" or " parish"
     .replace(".", "")
     .replace("&", "and")
     .replace(" ", "")
    for w in county_ex
]
print(cleaned_cty)
We can also work with the county name as a column in a dataframe and clean it.
Below are two approaches to creating a data frame from county_ex.
df = pd.DataFrame(pd.Series(county_ex, name="County"))
df
df = pd.DataFrame(data = county_ex, columns=["County"])
df
If we want to return to the original list format, we can get the county strings back out with:
df['County'].tolist()
Lastly, we use the string manipulation methods that Pandas provides for dataframe columns to perform the same cleaning. Rather than looping through each character one at a time, we use the str accessor to operate on the whole column of strings. The replace method can take a regular expression as an argument. An example below is "county|parish": this pattern matches either the characters "county" or the characters "parish", and if replace finds a match anywhere in the string, it replaces the match with "", i.e., it drops it.
(
    df['County'].str.lower()
    .str.replace("county", "")
    .str.replace("parish", "")
    .str.replace(".", "", regex=False)   # a literal period; as a regex, "." matches any character
    .str.replace("&", "and")
    .str.replace(" ", "")
)
Below is a more compact version that consolidates the replacements using metacharacters in regular expressions. The "county|parish" pattern matches either the characters "county" or the characters "parish", so if replace finds a match anywhere in the string, it replaces those 6 characters with "", i.e., they are dropped.
Also, the pattern "[. ]" says that a period and a blank are to be treated equivalently because they are within the square brackets. Now, if either a blank or a period is found, it is replaced with "" (that is, it is dropped).
(
    df['County'].str.lower()
    .str.replace("county|parish", "", regex=True)
    .str.replace("[. ]", "", regex=True)
    .str.replace("&", "and")
)
We have introduced some concepts related to pattern matching, such as | and [ ]. We will cover these in more detail in the examples in the class slides.
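As a small preview of these two metacharacters, here is how they behave with Python's re module on one of our sample strings (a sketch; re.sub is the plain-Python analogue of the pandas replace calls above):
sample = "st. john the baptist parish"
print(re.sub(r"county|parish", "", sample))  # "|" separates alternatives
print(re.sub(r"[. ]", "", sample))           # "[ ]" matches any one enclosed character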
In our next example, we process strings that have some structure, but not enough to always find values between commas, say, or in fixed positions in all records. We can get a long way with simple string manipulation in this example too, but we will also demonstrate how powerful a regular expression can be in extracting fields from strings.
These data are logs from a web site. Again, we use a small data set to demonstrate how we might process the file, but we don't actually process and analyze the full dataset.
!ls -lh data
The sample file has only two records, as we see in the call to head below.
!head data/smallLog.txt
We want to read each line of the file in as a string and then process the strings to create variables. We use readlines to do this.
with open("data/smallLog.txt", "r") as f:
    lines = f.readlines()
print(len(lines))
print(lines[0])
We can place lines in a dataframe and work with the strings as a column in the dataframe.
df = pd.DataFrame(pd.Series(lines, name="raw"))
df
Each log entry contains a lot of information, such as the IP address of the visitor and the date and time of the request.
When we examine the records, we see that there are certain characters that separate the fields. In our two sample records some of the pieces of information appear in the same location, but we don't want to rely on that always being the case. For example, the IP address at the beginning of the string need not always be 14 characters long. It might be 15 characters if all four of the numbers have 3 digits. This would then shift the rest of the information in the string. Instead, we want to key on the separators to locate the information.
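We can see this on our two sample records by locating the first left bracket in each line; the offsets need not agree across records (a quick check, using the lines list read in above):
# Position of the "[" that starts the date field; it shifts with the IP address length
print([line.find("[") for line in lines])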
Sometimes these separators are dashes, other times they are square brackets, commas, or colons. Let's consider the task of extracting the day, month, and year, each as a string. One way to do this is to split the string into pieces. We can do this with the split method.
We first split the string several times over to zoom in on the substring(s) of interest. That is, the date appears between [ and ].
Let's first split the string at the left square brackets. Notice that the square brackets are "eaten" by the splitting process.
date = df['raw'].str.split(r'[').str
date[0]
date[1]
The second substring contains the date. If we split the second substring at colons then the first of its substrings will contain the day, month, and year. That is,
date2 = date[1].str.split(r':').str[0]
print(date2)
Lastly, we can split this string, e.g., "26/Jan/2014", on the forward slash and assign each of the three resulting substrings to day, month, and year columns in df.
date3 = date2.str.split(r'/').str[0:3]
df['day'] = date3.str[0]
df['mon'] = date3.str[1]
df['Yr'] = date3.str[2]
df
We can perform all of these splits at once:
date_pieces = (df['raw']
    .str.split(r'[').str[1]
    .str.split(r':').str[0]
    .str.split(r'/').str[0:3]
)
print(date_pieces)
df['day'] = date_pieces.str[0]
df['mon'] = date_pieces.str[1]
df['Yr'] = date_pieces.str[2]
df
However, with regular expressions we can split on any of the three characters '[', ':', or '/' in one call to split. That is, we can treat these three characters as equivalent split points and then select the desired substrings. We do this as follows.
df1 = pd.DataFrame(pd.Series(lines, name="raw"))
date = df1['raw'].str.split(r'[\[/:]').str[1:4]
df1['day'] = date.str[0]
df1['mon'] = date.str[1]
df1['yr'] = date.str[2]
df1
In the above call to split we passed the regular expression '[\[/:]'. This is a special pattern. The outer brackets indicate that the characters within are to be treated equivalently. The literals are within the brackets: the left bracket, the forward slash, and the colon. Equivalent characters means: "when any of these are found", split the string. Notice the backslash preceding the left bracket within the outer brackets. The backslash indicates that the [ is to be treated as a literal left bracket, i.e., it is not a special character for the regex engine. We delve into regular expressions in greater depth in the next example.
vio = pd.read_csv("../../hw/hw2/data/violations.csv")
new_date = pd.to_datetime(vio['date'].astype(str), format='%Y%m%d')  # parse YYYYMMDD dates
vio['year'] = new_date.dt.year
vio = vio[vio['year'] == 2016].copy()
vio['description'].value_counts().head(20)
len(vio['description'].value_counts())
vio['description'].value_counts().head(20).plot(kind='barh')
vio['description'].value_counts().tail(10)
Remove the "[ date violation corrected ...]" piece.
vio['desc'] = vio['description'].str.replace(r"\[.*\]", "", regex=True).str.strip()
Now let's check how many unique violation descriptions we have
len(vio['desc'].value_counts())
vio['desc'].value_counts().head(20)
vio['desc'].value_counts().tail(20)
# pd.get_dummies(vio['desc'])
regex = re.compile(r"clean|sanitary")
%%timeit
vio['desc'].str.contains(regex)
%%timeit
vio['desc'].str.contains(r"clean|sanitary")
vio['isClean'] = vio['desc'].str.contains(r"clean|sanit")
surface = vio['desc'].str.contains(r"surface")
food = vio['desc'].str.contains(r"\Wfood")
vio['isFoodSurface'] = (surface & food)
vio['isHighRisk'] = vio['desc'].str.contains(r"high risk")
vio['isMedRisk'] = vio['desc'].str.contains(r"medium risk")
vio['isLowRisk'] = vio['desc'].str.contains(r"low risk")
# create risk, which is 1, 2, or 3 for low, medium, or high risk
vio["risk"] = vio['isLowRisk'] + 2*vio['isMedRisk'] + 3*vio['isHighRisk']
vio['isVermin'] = vio['desc'].str.contains(r"vermin")
vio['isStorage'] = vio['desc'].str.contains(r"thaw|cool|therm|storage")
vio['isDisplayPermit'] = vio['desc'].str.contains(r"certificate|permit")
vio['isNonFoodSurface'] = vio['desc'].str.contains(r"wall|ceiling|floor|nonfood surface")
vio['isHuman'] = vio['desc'].str.contains(r"hand|glove|hair|nail")
features = vio.groupby(['business_id', 'date']).aggregate({
    "isClean": sum,
    "risk": max
})
features.head()
a = vio.groupby(['business_id', 'date'])[['isClean', 'isHighRisk', 'isVermin',
'isStorage', 'isDisplayPermit', 'isHuman',
'isNonFoodSurface']].sum()
b = vio.groupby(['business_id', 'date'])[['risk']].max()
features = a.join(b)
features = vio.groupby(['business_id', 'date'])[['isClean', 'isHighRisk', 'isVermin',
'isStorage', 'isDisplayPermit', 'isHuman',
'isNonFoodSurface']].sum()
features.head()
ins = pd.read_csv("../../hw/hw2/data/inspections.csv").set_index(['business_id', 'date'])
ins.head()
ins.join(features).head()
features = vio.groupby(['business_id', 'date'], as_index=False)[['isClean', 'isHighRisk',
'isVermin', 'isStorage',
'isDisplayPermit', 'isHuman',
'isNonFoodSurface']].sum()
features.head()
ins = pd.read_csv("../../hw/hw2/data/inspections.csv")
ins.head()
joined_data = pd.merge(features, ins, on=['business_id', 'date'])
joined_data.head()
feature_names = ['isClean', 'isHighRisk', 'isVermin',
'isStorage', 'isDisplayPermit', 'isHuman', 'isNonFoodSurface']
import sklearn.linear_model as lm
model = lm.LinearRegression()
model.fit(joined_data[feature_names], joined_data['score'])
model.coef_
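The coefficient array alone is hard to read; pairing each coefficient with its feature name makes the fit easier to interpret (a small convenience step, not part of the analysis above):
# Match each fitted coefficient with its feature name
for name, coef in zip(feature_names, model.coef_):
    print(f"{name}: {coef:.2f}")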
plt.figure(figsize=(15, 10))
i = 1
for f in feature_names:
    plt.subplot(2, 4, i)
    sns.boxplot(y="score", x=f, data=joined_data)
    i += 1
plt.savefig("xxx.pdf")
In our final example, we work with the text of the State of the Union addresses from 1790 to 2017.
with open("data/stateoftheunion1790-2017.txt", "r") as f:
    text = f.read()
records = text.split("***")

def extract_parts(line):
    parts = line.split("\n")
    name = parts[3].strip()
    date = parts[4].strip()
    text = "\n".join(parts[5:]).strip()
    return [name, date, text]
df = pd.DataFrame([extract_parts(l) for l in records[1:]], columns=["Name", "Date", "Text"])
df.head()
df['clean text'] = (
    df['Text']
    .str.replace("\n", " ")
    .str.lower()
    .str.replace(r"[^a-z\s]", " ", regex=True)
)
df.head()
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer()
tfidf = vec.fit_transform(df['clean text'])
np.random.seed(42)
from scipy.sparse.linalg import svds
(u, s, vt) = svds(tfidf, k=2)
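Here svds returns a rank-2 factorization, so each speech gets a two-dimensional representation in u; those two coordinates are what we plot below. A quick look at the shapes:
# u: one row per speech, two columns; s: the two singular values; vt: two rows over the vocabulary
print(u.shape, s.shape, vt.shape)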
df['Year'] = df['Date'].str[-4:].astype('int')
df['x'] = u[:,0]
df['y'] = u[:,1]
sns.lmplot(x = 'x', y = 'y', data = df, hue='Year', legend=False, fit_reg=False, palette="Blues")
!pip install plotly
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.figure_factory as ff
colors = ["rgba({0},{1},{2},1)".format(*(int(255 * v) for v in c))
          for c in sns.color_palette("Blues", len(df))]   # CSS rgba channels run 0-255
colors[-1] = "rgba(252,128,51,1)"   # make the most recent address stand out in orange
py.iplot([go.Scatter(x = df['x'], y = df['y'], mode='markers', marker=dict(color=colors), text=df['Name'])])