-
Notifications
You must be signed in to change notification settings - Fork 6
/
dwr_07_regular_expressions.R
617 lines (469 loc) · 21.7 KB
/
dwr_07_regular_expressions.R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
#' ---
#' title: "Data Wrangling in R: Regular Expressions"
#' author: "Clay Ford"
#' date: "Spring 2016"
#' output: pdf_document
#' ---
setwd("../data")
load("datasets_L06.Rda")
# This lecture uses the following packages:
# install.packages("stringr")
library(stringr)
# install.packages("qdapRegex")
library(qdapRegex)
# Intro -------------------------------------------------------------------
# A 'regular expression' is a pattern that describes a set of strings.
# Examples:
# - all 5-digit numbers in a document
# - all 5-digit numbers ending in 00
# - words spelled in ALL CAPS
# - words in brackets or delimiters [],<>,(),{}
# - words at the end of a sentence
# - all email addresses
# - dates in a certain format
# These are examples of string patterns. Regular Expressions are the language we
# use to describe the pattern. You should know, however, regular expressions are
# a language into itself. There are entire books devoted to regular expressions.
# Quote floating around internet: "Some people, when confronted with a problem,
# think 'I know, I'll use regular expressions.' Now they have two problems."
# Regular expressions can be tricky to get right, especially for complex
# patterns.
# We will only dabble in regular expressions. Key lesson: recognize when you
# need a regular expression and know enough to cobble one together using your
# knowledge, wits and Google.
# Two PDF files you may want to download and save for reference:
# http://biostat.mc.vanderbilt.edu/wiki/pub/Main/SvetlanaEdenRFiles/regExprTalk.pdf
# http://gastonsanchez.com/Handling_and_Processing_Strings_in_R.pdf
# Good pages to print off/bookmark:
# http://www.cheatography.com/davechild/cheat-sheets/regular-expressions/
# http://regexlib.com/CheatSheet.aspx
# or just Google "regex cheatsheet"
# Good library book:
# Go to virgo, search for "Regular expressions cookbook"
# RegEx tutorials:
# http://www.rexegg.com/
# http://www.regular-expressions.info/
# Regular Expression Basics -----------------------------------------------
# Regular expressions are composed of three components:
# (1) literal characters
# (2) modifiers (or metacharacters)
# (3) character classes
# (1) LITERAL CHARACTERS
# These are the literal characters you want to match. If you want to find the
# word "factor", you search for "factor", the literal characters.
# (2) MODIFIERS
# Modifiers define patterns;
# meet the modifiers:
# $ * + . ? [ ] ^ { } | ( ) \
# precede these with double backslash (in R!) if you want to treat them as
# literal characters.
# ^ start of string
# $ end of string
# . any character except new line
# * 0 or more
# + 1 or more
# ? 0 or 1
# | or (alternative patterns)
# {} quantifier brackets: exactly {n}; at least {n,}; between {n,m}
# () group patterns together
# \ escape character (needs to be escaped itself in R: \\)
# [] character class brackets (not to be confused with R's subsetting brackets!)
# (3) CHARACTER CLASSES
# a range of characters to be matched;
# placed in brackets: []
# For example: [a-q] means all letters from a - q;
# [a-zA-Z] means all alphabetic characters;
# [0-9A-Za-z] means all alphanumeric characters;
# The ^ symbol means "not" when used in brackets, so [^abc] means "Not (a or b
# or c)"
# From R documentation: "Because their interpretation is locale- and
# implementation-dependent, character ranges are best avoided." Good advice if
# you're sharing R code. Otherwise, fine to use on your own.
# PREDEFINED CHARACTER CLASSES
# [:lower:] - Lower-case letters in the current locale. [a-z]
#
# [:upper:] - Upper-case letters in the current locale. [A-Z]
#
# [:alpha:] - Alphabetic characters: [:lower:] and [:upper:]. [a-zA-Z]
#
# [:digit:] - Digits: 0 1 2 3 4 5 6 7 8 9. [0-9]
#
# [:alnum:] - Alphanumeric characters: [:alpha:] and [:digit:]. [0-9A-Za-z]
#
# [:punct:] - Punctuation characters: ! " # $ % & ' ( ) * + , - . / : ; < = > ?
# @ [ \ ] ^ _ ` { | } ~.
#
# [:graph:] - Graphical characters: [:alnum:] and [:punct:].
#
# [:blank:] - Blank characters: space and tab, and possibly other
# locale-dependent characters such as non-breaking space.
#
# [:space:] - Space characters: tab, newline, vertical tab, form feed, carriage
# return, space and possibly other locale-dependent characters.
#
# [:print:] - Printable characters: [:alnum:], [:punct:] and space.
# Note that the brackets in these class names are part of the symbolic names,
# and must be included in addition to the brackets delimiting the bracket list!
# More regex codes! (Yay! More stuff!) Be sure to escape that backslash!
# \b - Word boundary
# \d - any decimal digit
# \w - any word character
# \s - any white-space character
# \n - a New line
# see ?regex for an indepth overview of regular expressions.
# RegEx examples ----------------------------------------------------------
# Let's create some sample text to demonstrate regular expressions:
someText <- c(" here's a sentence",
"This is me typing at 2:02 in the morning",
"Back in 1995 I was only 22.",
"You saw 4 monkeys?",
"There are 10 kinds of people, those that understand binary
and the other 9 that don't care",
"Who likes pancakes? I do. I really really like pancakes!",
" <strong>Bolded text is bold and daring</strong>",
"figure1.jpg", "cover.jpg", "report.pdf", "log.txt",
"I'm a one man wolfpack and I weigh 222",
"OMG, a 3-eyed cyclops!!!",
"2112 is an awesome album.",
"2222 is my PIN")
someText
# Examples of SUPER BASIC regex patterns:
# find elements in vector beginning with 1 or more spaces
grep("^ +", someText, value=T)
grep("^[[:blank:]]+", someText, value=T)
# find elements containing a question mark; need to "escape" the "?"
grep("\\?", someText, value=T)
# find elements ending with a question mark
grep("\\?$", someText, value=T)
# find elements containing one or more numbers
grep("[0-9]+", someText, value=T)
grep("[[:digit:]]+", someText, value=T)
# find elements containing numbers with 2 digits
grep("[0-9]{2}", someText, value=T)
grep("[[:digit:]]{2}", someText, value=T)
# text ending with .jpg; need to escape the "."
grep("\\.jpg$", someText, value=T)
# text ending with a 3=character file extension
grep("\\.[[:alpha:]]{3}$", someText, value=T)
grep("\\.\\w{3}$", someText, value=T)
# text beginning with only letters, and containing only letters, ending in .jpg
grep("^[a-zA-Z]+\\.jpg", someText, value=T)
grep("^[[:alpha:]]+\\.jpg", someText, value=T)
# text containing two consecutive "really "
grep("(really ){2}",someText, value=T)
# text containing two or more !
grep("!{2,}",someText, value=T)
# Contraction beginning with 3 letters
grep(" [[:alpha:]]{3}'", someText, value = T)
grep("\\b[[:alpha:]]{3}'", someText, value = T)
# text with 3-character words
grep("\\b\\w{3}\\b", someText, value = T)
# text with 3-character words but no file names
grep("\\b\\w{3}\\b[^[:punct:]]", someText, value = T)
# text with ALL CAPS (two or more CAPS)
grep("\\b[[:upper:]]{2,}\\b", someText, value = T)
# text with a new line
grep("\\n", someText, value = T)
# matching 0 or more times
grep("2*2", someText, value = T)
# matching 1 or more times
grep("2+2", someText, value = T)
# Search/Replace with RegEx -----------------------------------------------
# Recall sub() and gsub() functions. These perform replacement of the first and
# all matches respectively. In a previous lecture we used them to search/replace
# literal strings. Now let's use them with regular expressions. A few examples:
# Replace Repeated Whitespace with a Single Space
gsub(" +"," ", someText)
gsub("\\s+"," ",someText) # removes \n!
# Trim Leading and Trailing Whitespace:
gsub("^ +| +$","", someText)
# Or better yet, just use the built-in function
trimws(someText)
# Replace a new line with a space
gsub("\\n"," ",someText)
# Remove HTML/XML tags (basic)
# "<" followed by anything but ">" and ending with ">"
gsub("<[^>]*>","",someText)
# Or better yet, just use the qdapRegex function rm_angle()
rm_angle(someText)
# Extract with RegEx ------------------------------------------------------
# The base R functions regexpr() and gregexpr() along with regmatches() can be
# used to extract character matches, but I find the str_extract() and
# str_extract_all() in the stringr package to be easier and faster to use.
# str_extract() extracts first piece of a string that matches a pattern while
# str_extract_all() extracts all matches. A few examples:
# Extract one- or two-digit numbers:
# first match
str_extract(someText, "[0-9]{1,2}")
# all matches; returns a list
str_extract_all(someText, "[0-9]{1,2}")
# can use the base R function unlist() function to get just the numbers in a
# vector:
unlist(str_extract_all(someText, "[0-9]{1,2}"))
# Extract a string that contains a . followed by 3 lower-case letters (file
# extensions)
str_extract(someText,"\\.[a-z]{3}")
# just the file extenions without a period (not very elegant but works)
str_extract(someText,"(jpg|tif|pdf|txt)$")
# Extract text beginning with only letters, and containing only letters,
# ending in .jpg
str_extract(someText, "^[a-z]+\\.jpg")
# to get just the text
tmp <- str_extract(someText, "^[a-z]+\\.jpg")
tmp[!is.na(tmp)]
# Web scraping ------------------------------------------------------------
# Regular Expressions can be very helpful when doing web scraping. Let's scrape
# some data and demonstrate. Simply give a URL as an argument to the readLines()
# function. The readLines() function reads in text lines. The following reads in
# the HTML code of a web page into a single vector.
# 113th Congress Senate Bills: first 100 results.
senate_bills <- readLines("http://thomas.loc.gov/cgi-bin/bdquery/d?d113:0:./list/bss/d113SN.lst:")
# Notice senate_bills is a vector, not a data frame. Each element of text
# corresponds to one line of HTML code:
senate_bills[1:10]
# We'd like to create a data frame that includes bill number, bill title,
# sponsor, and number of co-sponsors.
# In the HTML we see that bill number, title, and sponsor are in lines that
# begin like this: "<p><b>15.</b>". We can use regular expressions to find all
# 1-3 digit numbers followed by a period and </b>.
# grep() can find the indices of such patterns,
k <- grep("[0-9]{1,3}\\.</b>", senate_bills)
k[1:4]
# Use k to subset the data
temp <- senate_bills[k]
head(temp)
tail(temp)
# Now replace the HTML tags with space
temp <- gsub("<[^>]*>", " ",temp)
head(temp)
tail(temp)
# break vector elements by ":"
temp <- strsplit(temp,":")
# Let's see what we have so far:
head(temp)
# To get the bill numbers we can pull out the first element of each list
# component as follows:
bill <- sapply(temp,function(x)x[1])
head(bill)
# Now we can use str_extract() to pull out the bill numbers. I've decided to
# keep the "S":
bill <-str_extract(bill, "S\\.[0-9]{1,3}")
head(bill)
# Now let's get the bill title. It's in the second element.
temp[[1]]
# pull out second element of each list component
title <- sapply(temp,function(x)x[2])
title[1:4]
# get rid of " Sponsor" at end
title <- gsub(" Sponsor$","",title)
# get rid of leading and trailing spaces
title <- trimws(title)
head(title)
# Now get the bill sponsor. It's in the third element.
temp[[1]]
sponsor <- sapply(temp,function(x)x[3])
sponsor <- trimws(sponsor) # get rid of leading spaces
head(sponsor)
# Get number of cosponsors by first finding those vector elements that contain
# the string "Cosponsors". Have to be careful; not all bills have Cosponsors
# (ie, Cosponsors (None) ) but all have the word "Cosponsors".
k <- grep("Cosponsors",senate_bills)
# subset vector to contain only those matching elements
temp <- senate_bills[k]
head(temp)
# Now extract number of cosponsors; either None or a 1-2 digit number.
cosponsors <- str_extract(temp, pattern = "\\([[:alnum:]]{1,4}\\)")
# Get rid of parentheses
cosponsors <- gsub(pattern = "[\\(|\\)]", replacement = "", cosponsors)
# Replace "None" with 0 and convert to numeric
cosponsors <- as.numeric(gsub("None",0,cosponsors))
summary(cosponsors)
# And finally create data frame
senate_bills <- data.frame(bill, title, sponsor, cosponsors,
stringsAsFactors = FALSE)
head(senate_bills)
# What if we wanted to do this for all results? We have to iterate through the URLs.
# http://thomas.loc.gov/cgi-bin/bdquery/d?d113:0:./list/bss/d113SN.lst:[[o]]&items=100&
# http://thomas.loc.gov/cgi-bin/bdquery/d?d113:100:./list/bss/d113SN.lst:[[o]]&items=100&
# http://thomas.loc.gov/cgi-bin/bdquery/d?d113:200:./list/bss/d113SN.lst:[[o]]&items=100&
# ...
# http://thomas.loc.gov/cgi-bin/bdquery/d?d113:3000:./list/bss/d113SN.lst:[[o]]&items=100&
# We also may want to create a data frame in advance to store the data
SenateBills <- data.frame(bill=character(3020), title=character(3020),
sponsor=character(3020),
cosponsors=numeric(3020),
stringsAsFactors = FALSE)
# Now cycle through the URLS using the code from above. I suppose ideally I
# would determine the upper bound of my sequence (3000) programmatically, but
# this is a one-off for the 113th congress so I'm cutting myself some slack.
for(i in seq(0,3000,100)){
senate_bills <- readLines(paste0("http://thomas.loc.gov/cgi-bin/bdquery/d?d113:",i,":./list/bss/d113SN.lst:"))
# bill number
k <- grep("[0-9]{1,3}\\.</b>", senate_bills)
temp <- senate_bills[k]
temp <- gsub("<[^>]*>", " ",temp)
temp <- strsplit(temp,":")
bill <- sapply(temp,function(x)x[1])
bill <- str_extract(bill, "S\\.[0-9]{1,4}") # need to increase to 4 digits
# title
title <- sapply(temp,function(x)x[2])
title <- gsub(" Sponsor$","",title)
title <- trimws(title)
# sponsor
sponsor <- sapply(temp,function(x)x[3])
sponsor <- trimws(sponsor)
# coponsors
k <- grep("Cosponsors",senate_bills)
temp <- senate_bills[k]
cosponsors <- str_extract(temp, pattern = "\\([[:alnum:]]{1,4}\\)")
cosponsors <- gsub(pattern = "[\\(|\\)]", replacement = "", cosponsors)
cosponsors <- as.numeric(gsub("None",0,cosponsors))
# add to data frame
rows <- (i+1):(i+length(k))
SenateBills[rows,] <- data.frame(bill, title, sponsor, cosponsors, stringsAsFactors = FALSE)
}
# For another web scraping tutorial of mine, see:
# https://github.com/UVa-R-Users-Group/meetup/tree/master/2014-10-07-web-scraping
# The rvest package by Hadley Wickham allows you to "Easily Harvest (Scrape) Web
# Pages":
# http://blog.rstudio.org/2014/11/24/rvest-easy-web-scraping-with-r/
# The XML package also has some functions for converting HTML tables to data
# frames.
# RegEx within data frames ------------------------------------------------
# Recall our allStocks data. We wanted to add a column indicating which stock
# each row belongs to. We can use gsub() and regular expressions to easily
# do this.
head(allStocks)
# Notice the row name contains the name of the stock. We can extract the row
# names and formally add them to the data frame using the rownames() function.
# first three row names:
rownames(allStocks)[1:3]
# extract all row names and add to data frame:
allStocks$Stock <- rownames(allStocks)
head(allStocks)
# Let's reset the row names:
rownames(allStocks) <- NULL
head(allStocks)
# Now we find the pattern "\\.csv\\.[0-9]{1,3}" and replace with nothing. Recall
# that "." is metacharacter that has to be escaped. [0-9]{1,3} translates to all
# numbers ranging from 0 - 999.
allStocks$Stock <- gsub(pattern = "\\.csv\\.[0-9]{1,3}",
replacement = "",
allStocks$Stock)
# and let's make our new variable a factor:
allStocks$Stock <- factor(allStocks$Stock)
head(allStocks)
tail(allStocks)
summary(allStocks$Stock)
# While we're at it, let's fix the Date. (Currently a factor.)
allStocks$Date <- as.Date(allStocks$Date, format="%d-%b-%y")
# And just for fun, graph closing price over time for all stocks one on graph:
library(ggplot2)
ggplot(allStocks, aes(x=Date, y = Close, color=Stock)) + geom_line()
# Now let's finish cleaning up the 2012 election data!
names(electionData)
# I want to drop everything to the right of the "NA.34 NA" column. Frankly I'm
# not sure what those columns contain.
# Get the column number of the column with header "NA.34 NA"
fir <- grep("NA.34 NA", names(electionData))
fir
# get the column number of the last column in the data frame
las <- ncol(electionData)
las
# Now subset the data frame; keep all columns except 72-82
electionData <- electionData[,-c(fir:las)]
# drop columns with names of "NA.1, NA.2, etc"; these are proportions. I can
# always derive them later if I want them.
ind <- grep("NA\\.[0-9]{1,2}", names(electionData))
ind
electionData <- electionData[,-ind]
# and some final clean up
names(electionData)[3] <- "Total.Popular.Vote"
names(electionData)[5] <- "Elec Vote R"
electionData$"Pop Vote D" <- NULL
rownames(electionData) <- NULL
# still some lingering character columns
which(sapply(electionData, is.character))
# convert to numeric
electionData[,2:6] <- sapply(electionData[,2:6], as.numeric)
# Now our election data contains only number of votes.
# Another extended example ------------------------------------------------
# Let's add Occupation names to the arrests data. Recall that the Occup columns
# contains numeric codes for Occupations. I'd like to make that column a factor
# where the codes are associated with levels that define the code number.
arrests$Occup[1:5]
# First we read in a document that contains Occupation code numbers and the
# occupation name. I created this from the codebook that accompanied this data.
oc <- readLines("../data/00049-Occupation-codes.txt", warn=FALSE)
# trim whitespace
oc <- trimws(oc)
# Notice all code numbers are in the first three positions. Let's use stringr
# for the str_extract() function. Notice we need to convert to integer to match
# the integer codes in the arrests data frame.
codeNums <- as.integer(str_extract(string = oc, pattern = "^[0-9]{3}"))
# Getting the code names is a little harder. There are probably a dozen
# different ways to proceed from this point on, but here's how I decided to do
# it. Basically extract everything except numbers.
codeNames <- trimws(str_extract(string = oc, pattern = "[^[:digit:]]+"))
head(codeNames)
tail(codeNames)
# Now I can make Occup a factor with levels equal to codeNums and labels equal
# to codeNames. I'm going to make a new column so we can compare to the original
# column.
arrests$Occup2 <- factor(arrests$Occup, levels = codeNums, labels = codeNames)
# some quick counts; they seem to match our source file
head(summary(arrests$Occup2))
tail(summary(arrests$Occup2))
# Apparently there are no codes in the data for Cannot Read Film (997) or None
# listed (998) despite them being listed in the code book.
nrow(subset(arrests, Occup %in% c(997,998)))
# Which codes are we using that don't have matches in the data?
setdiff(codeNums,arrests$Occup)
# 174 = secret society; codebook reports 0 so that makes sense.
# Which codes are in the data that we don't have matches for in codeNums?
setdiff(arrests$Occup, codeNums)
# 1, 2, and 178 are not listed in the codebook!
k <- setdiff(arrests$Occup, codeNums)
head(subset(arrests, Occup %in% k, select = c("Occup","Occup2")))
nrow(subset(arrests, Occup %in% k))
# 403 records with Occup code that doesn't match codebook
# Bottom line: this data, as provided by ICPSR, is a bit dirty.
# qdapRegex package -------------------------------------------------------
# The qdapRegex package has some pre-defined functions for Regular Expression
# Removal, Extraction, and Replacement. Let's explore some of them by way of an
# example.
# I have a text file of studio albums by the Canadian rock band, Rush. Read it
# in:
rushSA <- readLines("rush_albums.txt")
rushSA
# I'd like to make a data frame with two columns: album title and year released.
# We'll use qdapRegex functions to do this.
# First let's trim the white space:
rushSA <- trimws(rushSA)
# The qdapRegex package has a function called rm_between() that will
# Remove/Replace/Extract Strings Between 2 Markers. I want to use it to extract
# album release year between parentheses. Note I have to use the extract=TRUE
# argument:
year <- rm_between(rushSA, left = "(", right=")", extract=TRUE)
# That returns a list; I can use unlist() to make it a vector:
year <- unlist(year)
year
# I need to remove the string ", cover album". Could use gsub() to find and
# replace with nothing, but a more general approach would be to extract all
# numbers. Remember the tidyr helper function, extract_numeric?
year <- tidyr::extract_numeric(year)
year
# Now get the album titles; this time use the rm_between() function without the
# extract=TRUE argument. This removes everything between the (), including the
# parentheses.
album <- rm_between(rushSA, left = "(", right=")")
album
# And now our data frame:
rushStudioAlbums <- data.frame(year, album)
head(rushStudioAlbums)
# There is also a package called stringi that bills itself as "THE string
# processing package for R". As I understand it, stringr is wrapper for stringi.
# Learn more :http://www.rexamine.com/resources/stringi/
# save data for next set of lecture notes
save(list=c("electionData", "weather", "arrests", "allStocks", "popVa","airplane",
"SenateBills"), file="../data/datasets_L07.Rda")