---
title: "Text analysis"
output:
  html_document:
    toc: yes
    toc_float: yes
  html_notebook:
    toc: yes
    toc_float: yes
---
[<<BACK](https://remi-daigle.github.io/2017-CHONe-Data/)
<!--
rm(list=ls())
setwd('/Users/davidbeauchesne/dropbox/phd/misc/2017-chone-data/2017-chone-data/')
library(rmarkdown)
render(input = 'text.rmd', 'html_document')
-->
# Text analysis in R
To begin, we wish to credit Julia Silge for much of the material presented here. Her [post](http://juliasilge.com/blog/Life-Changing-Magic/) on `tidytext` data analysis and the code therein inspired most of this portion of the workshop.
```{r, echo=FALSE}
library(dplyr)
library(tidytext)
library(pdftools)
library(stringr)
library(tidyr)
library(ggplot2)
library(viridis)
library(tibble)
library(knitr)
library(widyr)
library(ggraph)
library(wordcloud2)
library(magrittr)
library(igraph)
library(scales)
```
## Selecting a text to analyze for the workshop
Let's begin by downloading two documents written by our dear Dr. Paul Snelgrove. The first is his book entitled ['Discoveries of the Census of Marine Life: Making Ocean Life Count'](http://www.cambridge.org/download_file/153663) and the second is one of his most cited papers (*n* = 886 according to Google Scholar), [Getting to the Bottom of Marine Biodiversity: Sedimentary Habitats: Ocean bottoms are the most widespread habitat on Earth and support high biodiversity and key ecosystem services](https://oup.silverchair-cdn.com/oup/backfile/Content_public/Journal/bioscience/49/2/10.2307/1313538/2/49-2-129.pdf?Expires=1493740003&Signature=Jp0aTSea-mCne3vgy6tE2JYz58l-K4iWPF3gnd8r-GagpywgUq8U9WcImKsSa~bDZOj5mY3216xNoqSDWXGLBdumI-WIRvUrHFKj1AetoS-Rsyup0NVO9nWE0te8dsIYDWEKkUXvu7-9xdHnmpa5QSNUlE8kM8V~4B68mdJR7W0eE-at~GH4p7IrPRDeAN9n8U~I2Kd-s~KkYgAV6ASxvzXhK4sLUaja~xs3n5eXdhTrcaJINFOxk2~2nU17jDgXM6PUw3W5Epvdywz4s6~eU4IyWCst4sokZAV3OczO7NiagYdKq3foTi~Y-EP~c0uvPr1gRFoxT6itFNGSSZCk6A__&Key-Pair-Id=APKAIUCZBIA4LVPAVW3Q/).
```{r, load file, eval = FALSE}
# Download file from the web
download.file('http://www.cambridge.org/download_file/153663','Snelgrove_Text_Only.pdf', mode = 'wb')
download.file('https://oup.silverchair-cdn.com/oup/backfile/Content_public/Journal/bioscience/49/2/10.2307/1313538/2/49-2-129.pdf?Expires=1493740003&Signature=Jp0aTSea-mCne3vgy6tE2JYz58l-K4iWPF3gnd8r-GagpywgUq8U9WcImKsSa~bDZOj5mY3216xNoqSDWXGLBdumI-WIRvUrHFKj1AetoS-Rsyup0NVO9nWE0te8dsIYDWEKkUXvu7-9xdHnmpa5QSNUlE8kM8V~4B68mdJR7W0eE-at~GH4p7IrPRDeAN9n8U~I2Kd-s~KkYgAV6ASxvzXhK4sLUaja~xs3n5eXdhTrcaJINFOxk2~2nU17jDgXM6PUw3W5Epvdywz4s6~eU4IyWCst4sokZAV3OczO7NiagYdKq3foTi~Y-EP~c0uvPr1gRFoxT6itFNGSSZCk6A__&Key-Pair-Id=APKAIUCZBIA4LVPAVW3Q/', '49-2-129.pdf', mode = 'wb')
```
## Importing text from pdfs in R
The package [`pdftools`](https://ropensci.org/blog/blog/2016/03/01/pdftools-and-jeroen), available through rOpenSci, allows you to read the content of a pdf rather easily. See the package documentation for more functions.
```{r, read pdf}
# Read pdf file
book <- pdf_text("Snelgrove_Text_Only.pdf")
paper <- pdf_text("49-2-129.pdf")
book[1]
length(book)
paper[1]
length(paper)
```
<br/>
As you can see, the book is imported as one string per page, for a total of 398 pages, while the paper has 10 pages, with lines separated by '\\n'. If you look at the actual pdf, you will also notice that tables are not imported into R by the `pdf_text` function. Extracting such tables could be highly useful to recover data embedded in pdf format; if you wish to do so, take a look at the package [`tabulizer`](https://ropensci.org/blog/blog/2017/04/18/tabulizer), which we will not cover in this workshop.
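For reference only (we do not run this here), a minimal sketch of what that might look like, assuming `tabulizer` and its Java dependency are installed; the page number is purely illustrative:
```{r, tabulizer sketch, eval = FALSE}
# A sketch only: extract_tables() returns a list with one matrix per detected
# table; the page number below is an illustrative placeholder
library(tabulizer)
tables <- extract_tables("49-2-129.pdf", pages = 3)
str(tables)
```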
Now let's tidy up the text to make it more easily usable for further analyses.
## Tidy up text
```{r, tidy text}
# Divide strings per line using '\n' as a separator
book <- str_split(book, '\n')
paper <- str_split(paper, '\n')
book[[1]][1:10]

# Trim whitespace at the beginning and end of lines
book <- lapply(X = book, FUN = str_trim, side = 'both')
paper <- lapply(X = paper, FUN = str_trim, side = 'both')
book[[1]][1:10]

# Transform into a matrix with one line of text per row
bookMat <- matrix(nrow = 0, ncol = 3, dimnames = list(c(), c('text', 'page', 'document')))
for(i in 1:length(book)) {
    bk <- cbind(book[[i]], rep(i, length(book[[i]])), 'Discoveries of the Census of Marine Life')
    bookMat <- rbind(bookMat, bk)
}

paperMat <- matrix(nrow = 0, ncol = 3, dimnames = list(c(), c('text', 'page', 'document')))
for(i in 1:length(paper)) {
    bk <- cbind(paper[[i]], rep(i, length(paper[[i]])), 'Getting to the Bottom of Marine Biodiversity')
    paperMat <- rbind(paperMat, bk)
}
kable(bookMat[1:10, ])

# Remove empty strings
bookMat[bookMat[, 'text'] == '', 'text'] <- NA
bookMat <- na.omit(bookMat)
paperMat[paperMat[, 'text'] == '', 'text'] <- NA
paperMat <- na.omit(paperMat)

# Convert to data.frame with proper column types
bookMat <- as.data.frame(bookMat)
bookMat[, 'text'] <- as.character(bookMat[, 'text'])
bookMat[, 'page'] <- as.numeric(paste(bookMat[, 'page']))
bookMat[, 'document'] <- as.character(paste(bookMat[, 'document']))
kable(bookMat[1:10, ])

paperMat <- as.data.frame(paperMat)
paperMat[, 'text'] <- as.character(paperMat[, 'text'])
paperMat[, 'page'] <- as.numeric(paste(paperMat[, 'page']))
paperMat[, 'document'] <- as.character(paste(paperMat[, 'document']))
kable(paperMat[1:10, ])

# Bind as a single data.frame
paul <- rbind(bookMat, paperMat)
```
<br/>
The resulting data for the paper are not perfect (*i.e.* words hyphenated across line breaks are not rejoined), but they will serve our purposes! A possible fix is sketched below.
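If this becomes an issue for your own documents, one option is to clean the raw `pdf_text()` output before splitting it into lines. A minimal sketch, not applied here:
```{r, dehyphenate, eval = FALSE}
# A minimal sketch, not applied here: rejoin words hyphenated at a line break,
# e.g. 'biodiver-\nsity' becomes 'biodiversity' (note this also merges genuinely
# hyphenated compounds that happen to break across lines)
paperRaw <- pdf_text("49-2-129.pdf")
paperClean <- str_replace_all(paperRaw, "-\\s*\\n\\s*", "")
```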
## Text analysis
<br/>
Now that we have our documents as an R data frame, we can start playing around with them. There are multiple packages that allow you to perform text analyses. We will focus on the package `tidytext` as it aligns with the tidyverse, but `tm` is another important package used for text analysis in R. Each package uses its own classes of objects, but these can be converted quite easily for use between the packages; see the package vignette (`vignette("tidying_casting", package = "tidytext")`) for a description of the steps and functions used to achieve this, and the sketch below for a quick taste.
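As a quick taste, here is a small sketch with a toy word-count table (assuming the `tm` package is installed) of casting a tidy table to the `DocumentTermMatrix` class used by `tm` and tidying it back:
```{r, cast sketch, eval = FALSE}
# A sketch with a toy word-count table (assumes the tm package is installed):
# cast_dtm() builds the DocumentTermMatrix class used by tm,
# and tidy() converts it back to a tidy one-row-per-term-per-document table
toyCounts <- tibble(document = c("doc1", "doc1", "doc2"),
                    word     = c("ocean", "benthos", "ocean"),
                    n        = c(10, 3, 5))
toyDTM <- cast_dtm(toyCounts, document, word, n)
toyDTM
tidy(toyDTM)
```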
We will first transform our combined data frame into a [tibble class object](https://www.rdocumentation.org/packages/tibble/versions/1.2) for use in `tidytext`.
```{r, tidytext}
# Convert to tibble class object for use in tidytext
paul <- as_tibble(paul)
paul
```
<br/>
### Group your text by attributes
<br/>
The text can easily be grouped by certain criteria. For example, we will start by grouping the data according to the part of the book to which they belong; there are a total of three parts to this book. Unfortunately, this particular step does not work for our pdf documents, but the code does work and we might use it again elsewhere, say for an exercise (*cough cough*).
```{r, group chapters}
paulDoc <- paul %>%
    mutate(linenumber = row_number(),
           chapter = cumsum(str_detect(text, regex("^part [\\divxlc]", ignore_case = TRUE)))) %>%
    ungroup()
paulDoc
unique(paulDoc[, 'chapter'])
```
<br/>
This grouping is not useful with these documents, since they are not divided into chapters and the code does not differentiate between documents, but the regex can still be handy to divide your own text however you see fit.
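As a quick illustration of the mechanism, here is a toy example (made-up lines, purely for demonstration) where the same `cumsum(str_detect())` trick does produce a chapter index:
```{r, toy chapters}
# Toy example: cumsum() of the logical vector returned by str_detect()
# increments the chapter counter at every matched heading
toyText <- tibble(text = c("PART I", "Some text", "More text",
                           "PART II", "Other text"))
toyText %>%
    mutate(chapter = cumsum(str_detect(text, regex("^part [\\divxlc]", ignore_case = TRUE))))
```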
<br/>
### Term frequency
A frequent analysis performed on text is to evaluate the frequency of the words it uses. This can be done by first unnesting the individual words from the text with the `unnest_tokens` function and then counting their occurrences with the `count` function.
```{r, words}
# Count per word
paulWord <- paul %>%
    unnest_tokens(word, text) %>% # divide by word
    count(word, sort = TRUE)      # counts the word frequency
paulWord
# Total number of words
totalWords <- paulWord %>% summarize(total = sum(n))
totalWords
```
<br/>
You could also apply this directly to the grouped data to obtain a word count per document (or per chapter, if your text is divided that way).
```{r, words chapter}
# Count per word per document
paulWordDocs <- paul %>%
    mutate(linenumber = row_number()) %>%
    ungroup() %>%
    unnest_tokens(word, text) %>%
    count(document, word, sort = TRUE) %>%
    ungroup()

# Attach total number of words per document
totalWords <- paulWordDocs %>% group_by(document) %>% summarize(total = sum(n))
totalWords
paulWordDocs <- left_join(paulWordDocs, totalWords)
paulWordDocs

ggplot(paulWordDocs, aes(n/total, fill = document)) +
    geom_histogram(show.legend = FALSE) +
    xlim(NA, 0.0009) +
    facet_wrap(~document, ncol = 2, scales = "free_y")
```
<br/>
Words like 'the', 'and', and 'of' are by far the most common. These are referred to as *stop words* and are usually removed in text analyses. A list of such words is available in the `stop_words` dataset and they can be removed from your data with the `anti_join` function.
```{r, stop words}
# Count per word after stop words removal
data("stop_words")
paulWordDocs <- paul %>%
    mutate(linenumber = row_number()) %>%
    ungroup() %>%
    unnest_tokens(word, text) %>%
    anti_join(stop_words) %>% # remove stop words
    count(document, word, sort = TRUE) %>%
    ungroup()

totalWords <- paulWordDocs %>% group_by(document) %>% summarize(total = sum(n))
paulWordDocs <- left_join(paulWordDocs, totalWords)
paulWordDocs

ggplot(paulWordDocs, aes(n/total, fill = document)) +
    geom_histogram(show.legend = FALSE) +
    xlim(NA, 0.0009) +
    facet_wrap(~document, ncol = 2, scales = "free_y")
```
### Word frequency comparison
The frequency of words could also be compared between documents.
```{r, word frequency}
paulBook <- as_tibble(bookMat)
paulPaper <- as_tibble(paperMat)
paulBookWord <- paulBook %>%
    mutate(linenumber = row_number()) %>%
    ungroup() %>%
    unnest_tokens(word, text) %>%
    anti_join(stop_words) %>% # remove stop words
    count(word, sort = TRUE) %>%
    ungroup()

paulPaperWord <- paulPaper %>%
    mutate(linenumber = row_number()) %>%
    ungroup() %>%
    unnest_tokens(word, text) %>%
    anti_join(stop_words) %>% # remove stop words
    count(word, sort = TRUE) %>%
    ungroup()

frequency <- paulBookWord %>%
    rename(Book = n) %>%
    inner_join(paulPaperWord) %>%
    rename(Paper = n) %>%
    mutate(Book = Book / sum(Book),
           Paper = Paper / sum(Paper)) %>%
    ungroup()

ggplot(frequency, aes(x = Book, y = Paper, color = abs(Paper - Book))) +
    geom_abline(color = "gray40") +
    geom_jitter(alpha = 0.1, size = 2.5, width = 0.4, height = 0.4) +
    geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
    scale_x_log10(labels = percent_format()) +
    scale_y_log10(labels = percent_format()) +
    scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") +
    theme_minimal(base_size = 14) +
    theme(legend.position = "none") +
    labs(title = "Comparing Word Frequencies",
         subtitle = "Word frequencies in Paul Snelgrove's book and paper",
         y = "Getting to the Bottom of Marine Biodiversity",
         x = "Discoveries of the Census of Marine Life")
```
### Sentiment analysis
`tidytext` also gives us the opportunity to perform a cursory sentiment analysis, *i.e.* evaluating whether the text is more or less negative, using the `sentiments` dataset. While it may not be that useful to qualify science as positive or negative, it may reveal some insights into the overall writing style of the author (we are looking at you, Dr. Snelgrove).
```{r, sentiment}
# Gather list of sentiments from tidytext
bing <- sentiments %>%
    filter(lexicon == "bing") %>%
    dplyr::select(-score)
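# Note: in more recent versions of tidytext, 'sentiments' contains only the bing
# lexicon (no 'lexicon' or 'score' columns); there, the equivalent would be
# bing <- get_sentiments("bing")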
bing
paulSentiment <- paul %>%
    mutate(linenumber = row_number()) %>%
    ungroup() %>%
    unnest_tokens(word, text) %>%
    anti_join(stop_words) %>% # remove stop words
    inner_join(bing) %>%      # join with sentiment dataset
    count(document, index = linenumber %/% 80, sentiment) %>%
    spread(sentiment, n, fill = 0) %>%
    mutate(sentiment = positive - negative)
paulSentiment

# plot
ggplot(paulSentiment, aes(index, sentiment, fill = document)) +
    geom_bar(stat = "identity", show.legend = FALSE) +
    facet_wrap(~document, ncol = 2, scales = "free_x") +
    theme_minimal(base_size = 13) +
    labs(title = "Sentiment in Paul Snelgrove's writing",
         y = "Sentiment") +
    scale_fill_viridis(end = 0.75, discrete = TRUE, direction = -1) +
    scale_x_discrete(expand = c(0.02, 0)) +
    theme(strip.text = element_text(hjust = 0)) +
    theme(strip.text = element_text(face = "italic")) +
    theme(axis.title.x = element_blank()) +
    theme(axis.ticks.x = element_blank()) +
    theme(axis.text.x = element_blank())

# Most common positive and negative words
paulSentimentCount <- paul %>%
    mutate(linenumber = row_number()) %>%
    ungroup() %>%
    unnest_tokens(word, text) %>%
    anti_join(stop_words) %>% # remove stop words
    inner_join(bing) %>%      # join with sentiment dataset
    count(document, word, sentiment, sort = TRUE) %>%
    ungroup()

# Contribution to sentiment
paulSentimentCount %>%
    filter(n > 10) %>%
    mutate(n = ifelse(sentiment == "negative", -n, n)) %>%
    mutate(word = reorder(word, n)) %>%
    ggplot(aes(word, n, fill = sentiment)) +
    geom_bar(stat = "identity") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    ylab("Contribution to sentiment")
```
### "Units Beyond Words"
Words are not the only units of text that can be extracted using the `unnest_tokens` function. Look at the package vignette for more information!
```{r, sentences}
paulSentences <- paul %>%
    group_by(document) %>%
    unnest_tokens(sentence, text, token = "sentences") %>%
    ungroup()
paulSentences$sentence[1]
```
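Sentences are one option (shown above); another is n-grams, *i.e.* sequences of consecutive words. A minimal sketch for bigrams:
```{r, bigrams}
# Tokenize into bigrams (pairs of consecutive words) instead of single words
paulBigrams <- paul %>%
    unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
    count(document, bigram, sort = TRUE)
paulBigrams
```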
### Networks of Words
You can also identify which words are most often used together with the `pairwise_count` function from the `widyr` package.
```{r, pair count, eval = FALSE}
paulWord <- paul %>%
    mutate(linenumber = row_number()) %>%
    ungroup() %>%
    unnest_tokens(word, text) %>%
    anti_join(stop_words)

paulWordOcc <- pairwise_count(paulWord, word, linenumber, sort = TRUE)

set.seed(1813)
paulWordOcc %>%
    filter(n >= 25) %>%
    graph_from_data_frame() %>%
    ggraph(layout = "fr") +
    geom_edge_link(aes(edge_alpha = n, edge_width = n)) +
    geom_node_point(color = "darkslategray4", size = 5) +
    geom_node_text(aes(label = name), vjust = 1.8) +
    ggtitle(expression(paste("Word Network in Paul Snelgrove's book ",
                             italic("Census of Marine Life")))) +
    theme_void()

netD3 <- paulWordOcc %>%
    filter(n >= 25) %>%
    graph_from_data_frame()
netD3 <- networkD3::igraph_to_networkD3(netD3, group = rep(1, vcount(netD3)), what = 'both')
networkD3::forceNetwork(Links = netD3$links,
                        Nodes = netD3$nodes,
                        Source = 'source',
                        Target = 'target',
                        NodeID = 'name',
                        Group = 'group',
                        zoom = TRUE,
                        linkDistance = 50,
                        fontSize = 12,
                        opacity = 0.9,
                        charge = -10)
```
<br/>
### Wordle
Another neat visual tool available in R is the ability to produce a custom wordle (word cloud) based on the results of your analyses using the package `wordcloud2`.
```{r, wordcloud}
paulWord <- paul %>%
    mutate(linenumber = row_number()) %>%
    ungroup() %>%
    unnest_tokens(word, text) %>%
    anti_join(stop_words) %>%              # remove stop words
    count(document, word, sort = TRUE) %>% # counts the word frequency
    ungroup() %>%
    filter(n >= 20)
paulWordDF <- as.data.frame(paulWord[, c('word','n')])
# Basic wordle
wordcloud2::wordcloud2(paulWordDF, size = 1, color="random-light", backgroundColor=1)
# # Word shaped wordle
# wordcloud2::letterCloud(paulWordDF, word = "Paul")
#
# # Image shaped wordle
# wordcloud2::wordcloud2(paulWordDF, figPath = "./CoML_icon.png", size = 1.5)
#
# wordcloud2::wordcloud2(paulWordDF, figPath = "./CHONe.jpg", size = 1.5)
```
[<<BACK](https://remi-daigle.github.io/2017-CHONe-Data/)