---
title: "Text analysis"
output:
  html_document:
    toc: yes
    toc_float: yes
  html_notebook:
    toc: yes
    toc_float: yes
---
[<<BACK](https://remi-daigle.github.io/2017-CHONe-Data/)
<!--
rm(list=ls())
setwd('/Users/davidbeauchesne/dropbox/phd/misc/2017-chone-data/2017-chone-data/')
library(rmarkdown)
render(input = 'text.rmd', 'html_document')
-->
# Text analysis in R
To begin, we wish to credit Julia Silge for much of the material presented here. Her [post](http://juliasilge.com/blog/Life-Changing-Magic/) on `tidytext` data analysis and the code therein inspired most of this portion of the workshop.
```{r, echo=FALSE}
library(dplyr)
library(tidytext)
library(pdftools)
library(stringr)
library(tidyr)
library(ggplot2)
library(viridis)
library(tibble)
library(knitr)
library(widyr)
library(ggraph)
library(wordcloud2)
library(magrittr)
library(igraph)
library(scales)
```
## Selecting a text to analyze for the workshop
Let's begin by downloading two documents written by our dear Dr. Paul Snelgrove. The first is his book entitled ['Discoveries of the Census of Marine Life: Making Ocean Life Count'](http://www.cambridge.org/download_file/153663) and the second is one of his most cited papers (*n* = 886 according to Google Scholar), [Getting to the Bottom of Marine Biodiversity: Sedimentary Habitats: Ocean bottoms are the most widespread habitat on Earth and support high biodiversity and key ecosystem services](https://oup.silverchair-cdn.com/oup/backfile/Content_public/Journal/bioscience/49/2/10.2307/1313538/2/49-2-129.pdf?Expires=1493740003&Signature=Jp0aTSea-mCne3vgy6tE2JYz58l-K4iWPF3gnd8r-GagpywgUq8U9WcImKsSa~bDZOj5mY3216xNoqSDWXGLBdumI-WIRvUrHFKj1AetoS-Rsyup0NVO9nWE0te8dsIYDWEKkUXvu7-9xdHnmpa5QSNUlE8kM8V~4B68mdJR7W0eE-at~GH4p7IrPRDeAN9n8U~I2Kd-s~KkYgAV6ASxvzXhK4sLUaja~xs3n5eXdhTrcaJINFOxk2~2nU17jDgXM6PUw3W5Epvdywz4s6~eU4IyWCst4sokZAV3OczO7NiagYdKq3foTi~Y-EP~c0uvPr1gRFoxT6itFNGSSZCk6A__&Key-Pair-Id=APKAIUCZBIA4LVPAVW3Q/).
```{r, load file, eval = FALSE}
# Download file from the web
download.file('http://www.cambridge.org/download_file/153663','Snelgrove_Text_Only.pdf', mode = 'wb')
download.file('https://oup.silverchair-cdn.com/oup/backfile/Content_public/Journal/bioscience/49/2/10.2307/1313538/2/49-2-129.pdf?Expires=1493740003&Signature=Jp0aTSea-mCne3vgy6tE2JYz58l-K4iWPF3gnd8r-GagpywgUq8U9WcImKsSa~bDZOj5mY3216xNoqSDWXGLBdumI-WIRvUrHFKj1AetoS-Rsyup0NVO9nWE0te8dsIYDWEKkUXvu7-9xdHnmpa5QSNUlE8kM8V~4B68mdJR7W0eE-at~GH4p7IrPRDeAN9n8U~I2Kd-s~KkYgAV6ASxvzXhK4sLUaja~xs3n5eXdhTrcaJINFOxk2~2nU17jDgXM6PUw3W5Epvdywz4s6~eU4IyWCst4sokZAV3OczO7NiagYdKq3foTi~Y-EP~c0uvPr1gRFoxT6itFNGSSZCk6A__&Key-Pair-Id=APKAIUCZBIA4LVPAVW3Q/', '49-2-129.pdf', mode = 'wb')
```
## Importing text from pdfs in R
The package [`pdftools`](https://ropensci.org/blog/blog/2016/03/01/pdftools-and-jeroen), available through rOpenSci, allows you to read the content of a pdf rather easily. See the package documentation for more functions.
```{r, read pdf}
# Read pdf file
book <- pdf_text("Snelgrove_Text_Only.pdf")
paper <- pdf_text("49-2-129.pdf")
book[1]
length(book)
paper[1]
length(paper)
```
<br/>
As you can see, the book is imported as one string per page, for a total of 398 pages, while the paper has 10 pages, with lines separated by '\\n'. If you look at the actual pdf, you will also notice that tables are not imported into R by the `pdf_text` function. Extracting such tables could be highly useful to recover data embedded in pdf format; if you wish to do so, take a look at the package [`tabulizer`](https://ropensci.org/blog/blog/2017/04/18/tabulizer), which we will not cover in this workshop.
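For reference only (we do not run this here), a minimal sketch of what that might look like, assuming `tabulizer` and its Java dependency are installed; the page number is purely illustrative:
```{r, tabulizer sketch, eval = FALSE}
# A sketch only: extract_tables() returns a list with one matrix per detected
# table; the page number below is an illustrative placeholder
library(tabulizer)
tables <- extract_tables("49-2-129.pdf", pages = 3)
str(tables)
```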
Now let's tidy up the text to make it more easily usable for further analyses.
## Tidy up text
```{r, tidy text}
# Divide strings per line using '\n' as a separator
book <- str_split(book, '\n')
paper <- str_split(paper, '\n')
book[[1]][1:10]

# Trim whitespace at the beginning and end of lines
book <- lapply(X = book, FUN = str_trim, side = 'both')
paper <- lapply(X = paper, FUN = str_trim, side = 'both')
book[[1]][1:10]

# Transform into a matrix with one line of text per row
bookMat <- matrix(nrow = 0, ncol = 3, dimnames = list(c(), c('text', 'page', 'document')))
for(i in 1:length(book)) {
    bk <- cbind(book[[i]], rep(i, length(book[[i]])), 'Discoveries of the Census of Marine Life')
    bookMat <- rbind(bookMat, bk)
}

paperMat <- matrix(nrow = 0, ncol = 3, dimnames = list(c(), c('text', 'page', 'document')))
for(i in 1:length(paper)) {
    bk <- cbind(paper[[i]], rep(i, length(paper[[i]])), 'Getting to the Bottom of Marine Biodiversity')
    paperMat <- rbind(paperMat, bk)
}
kable(bookMat[1:10, ])

# Remove empty strings
bookMat[bookMat[, 'text'] == '', 'text'] <- NA
bookMat <- na.omit(bookMat)
paperMat[paperMat[, 'text'] == '', 'text'] <- NA
paperMat <- na.omit(paperMat)

# Convert to data.frame with proper column types
bookMat <- as.data.frame(bookMat)
bookMat[, 'text'] <- as.character(bookMat[, 'text'])
bookMat[, 'page'] <- as.numeric(paste(bookMat[, 'page']))
bookMat[, 'document'] <- as.character(paste(bookMat[, 'document']))
kable(bookMat[1:10, ])

paperMat <- as.data.frame(paperMat)
paperMat[, 'text'] <- as.character(paperMat[, 'text'])
paperMat[, 'page'] <- as.numeric(paste(paperMat[, 'page']))
paperMat[, 'document'] <- as.character(paste(paperMat[, 'document']))
kable(paperMat[1:10, ])

# Bind as a single data.frame
paul <- rbind(bookMat, paperMat)
```
<br/>
The resulting data for the paper are not perfect (*i.e.* words hyphenated across line breaks are not rejoined), but they will serve our purposes! A possible fix is sketched below.
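If this becomes an issue for your own documents, one option is to clean the raw `pdf_text()` output before splitting it into lines. A minimal sketch, not applied here:
```{r, dehyphenate, eval = FALSE}
# A minimal sketch, not applied here: rejoin words hyphenated at a line break,
# e.g. 'biodiver-\nsity' becomes 'biodiversity' (note this also merges genuinely
# hyphenated compounds that happen to break across lines)
paperRaw <- pdf_text("49-2-129.pdf")
paperClean <- str_replace_all(paperRaw, "-\\s*\\n\\s*", "")
```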
## Text analysis
<br/>
Now that we have our documents as an R data frame, we can start playing around with them. There are multiple packages that allow you to perform text analyses. We will focus on the package `tidytext` as it aligns with the tidyverse, but `tm` is another important package used for text analysis in R. Each package uses its own classes of objects, but these can be converted quite easily for use between the packages; see the package vignette (`vignette("tidying_casting", package = "tidytext")`) for a description of the steps and functions used to achieve this, and the sketch below for a quick taste.
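As a quick taste, here is a small sketch with a toy word-count table (assuming the `tm` package is installed) of casting a tidy table to the `DocumentTermMatrix` class used by `tm` and tidying it back:
```{r, cast sketch, eval = FALSE}
# A sketch with a toy word-count table (assumes the tm package is installed):
# cast_dtm() builds the DocumentTermMatrix class used by tm,
# and tidy() converts it back to a tidy one-row-per-term-per-document table
toyCounts <- tibble(document = c("doc1", "doc1", "doc2"),
                    word     = c("ocean", "benthos", "ocean"),
                    n        = c(10, 3, 5))
toyDTM <- cast_dtm(toyCounts, document, word, n)
toyDTM
tidy(toyDTM)
```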
We will first transform our combined data frame into a [tibble class object](https://www.rdocumentation.org/packages/tibble/versions/1.2) for use in `tidytext`.
```{r, tidytext}
# Convert to tibble class object for use in tidytext
paul <- as_tibble(paul)
paul
```
<br/>
### Group your text by attributes
<br/>
The text can easily be grouped by certain criteria. For example, we will start by grouping the data according to the part of the book to which they belong; there are a total of three parts to this book. Unfortunately, this particular step does not work for our pdf documents, but the code does work and we might use it again elsewhere, say for an exercise (*cough cough*).
```{r, group chapters}
paulDoc <- paul %>%
    mutate(linenumber = row_number(),
           chapter = cumsum(str_detect(text, regex("^part [\\divxlc]", ignore_case = TRUE)))) %>%
    ungroup()
paulDoc
unique(paulDoc[, 'chapter'])
```
<br/>
This grouping is not useful with these documents, since they are not divided into chapters and the code does not differentiate between documents, but the regex can still be handy to divide your own text however you see fit.
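As a quick illustration of the mechanism, here is a toy example (made-up lines, purely for demonstration) where the same `cumsum(str_detect())` trick does produce a chapter index:
```{r, toy chapters}
# Toy example: cumsum() of the logical vector returned by str_detect()
# increments the chapter counter at every matched heading
toyText <- tibble(text = c("PART I", "Some text", "More text",
                           "PART II", "Other text"))
toyText %>%
    mutate(chapter = cumsum(str_detect(text, regex("^part [\\divxlc]", ignore_case = TRUE))))
```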
<br/>
### Term frequency
A frequent analysis performed on text is to evaluate the frequency of the words it uses. This can be done by first unnesting the individual words from the text with the `unnest_tokens` function and then counting their occurrences with the `count` function.
```{r, words}
# Count per word
paulWord <- paul %>%
    unnest_tokens(word, text) %>% # divide by word
    count(word, sort = TRUE)      # counts the word frequency
paulWord
# Total number of words
totalWords <- paulWord %>% summarize(total = sum(n))
totalWords
```
<br/>
You could also apply this directly to the grouped data to obtain a word count per document (or per chapter, if your text is divided that way).
```{r, words chapter}
# Count per word per document
paulWordDocs <- paul %>%
    mutate(linenumber = row_number()) %>%
    ungroup() %>%
    unnest_tokens(word, text) %>%
    count(document, word, sort = TRUE) %>%
    ungroup()

# Attach total number of words per document
totalWords <- paulWordDocs %>% group_by(document) %>% summarize(total = sum(n))
totalWords
paulWordDocs <- left_join(paulWordDocs, totalWords)
paulWordDocs

ggplot(paulWordDocs, aes(n/total, fill = document)) +
    geom_histogram(show.legend = FALSE) +
    xlim(NA, 0.0009) +
    facet_wrap(~document, ncol = 2, scales = "free_y")
```
<br/>
Words like 'the', 'and', and 'of' are by far the most common. These are referred to as *stop words* and are usually removed in text analyses. A list of such words is available in the `stop_words` dataset and they can be removed from your data with the `anti_join` function.
```{r, stop words}
# Count per word after stop words removal
data("stop_words")
paulWordDocs <- paul %>%
    mutate(linenumber = row_number()) %>%
    ungroup() %>%
    unnest_tokens(word, text) %>%
    anti_join(stop_words) %>% # remove stop words
    count(document, word, sort = TRUE) %>%
    ungroup()

totalWords <- paulWordDocs %>% group_by(document) %>% summarize(total = sum(n))
paulWordDocs <- left_join(paulWordDocs, totalWords)
paulWordDocs

ggplot(paulWordDocs, aes(n/total, fill = document)) +
    geom_histogram(show.legend = FALSE) +
    xlim(NA, 0.0009) +
    facet_wrap(~document, ncol = 2, scales = "free_y")
```
### Word frequency comparison
The frequency of words could also be compared between documents.
```{r, word frequency}
paulBook <- as_tibble(bookMat)
paulPaper <- as_tibble(paperMat)
paulBookWord <- paulBook %>%
    mutate(linenumber = row_number()) %>%
    ungroup() %>%
    unnest_tokens(word, text) %>%
    anti_join(stop_words) %>% # remove stop words
    count(word, sort = TRUE) %>%
    ungroup()

paulPaperWord <- paulPaper %>%
    mutate(linenumber = row_number()) %>%
    ungroup() %>%
    unnest_tokens(word, text) %>%
    anti_join(stop_words) %>% # remove stop words
    count(word, sort = TRUE) %>%
    ungroup()

frequency <- paulBookWord %>%
    rename(Book = n) %>%
    inner_join(paulPaperWord) %>%
    rename(Paper = n) %>%
    mutate(Book = Book / sum(Book),
           Paper = Paper / sum(Paper)) %>%
    ungroup()

ggplot(frequency, aes(x = Book, y = Paper, color = abs(Paper - Book))) +
    geom_abline(color = "gray40") +
    geom_jitter(alpha = 0.1, size = 2.5, width = 0.4, height = 0.4) +
    geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
    scale_x_log10(labels = percent_format()) +
    scale_y_log10(labels = percent_format()) +
    scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") +
    theme_minimal(base_size = 14) +
    theme(legend.position = "none") +
    labs(title = "Comparing Word Frequencies",
         subtitle = "Word frequencies in Paul Snelgrove's book and paper",
         y = "Getting to the Bottom of Marine Biodiversity",
         x = "Discoveries of the Census of Marine Life")
```
### Sentiment analysis
`tidytext` also gives us the opportunity to perform a cursory sentiment analysis, *i.e.* evaluating whether the text is more or less negative, using the `sentiments` dataset. While it may not be that useful to qualify science as positive or negative, it may reveal some insights into the overall writing style of the author (we are looking at you, Dr. Snelgrove).
```{r, sentiment}
# Gather list of sentiments from tidytext
bing <- sentiments %>%
    filter(lexicon == "bing") %>%
    dplyr::select(-score)
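# Note: in more recent versions of tidytext, 'sentiments' contains only the bing
# lexicon (no 'lexicon' or 'score' columns); there, the equivalent would be
# bing <- get_sentiments("bing")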
bing
paulSentiment <- paul %>%
    mutate(linenumber = row_number()) %>%
    ungroup() %>%
    unnest_tokens(word, text) %>%
    anti_join(stop_words) %>% # remove stop words
    inner_join(bing) %>%      # join with sentiment dataset
    count(document, index = linenumber %/% 80, sentiment) %>%
    spread(sentiment, n, fill = 0) %>%
    mutate(sentiment = positive - negative)
paulSentiment

# plot
ggplot(paulSentiment, aes(index, sentiment, fill = document)) +
    geom_bar(stat = "identity", show.legend = FALSE) +
    facet_wrap(~document, ncol = 2, scales = "free_x") +
    theme_minimal(base_size = 13) +
    labs(title = "Sentiment in Paul Snelgrove's writing",
         y = "Sentiment") +
    scale_fill_viridis(end = 0.75, discrete = TRUE, direction = -1) +
    scale_x_discrete(expand = c(0.02, 0)) +
    theme(strip.text = element_text(hjust = 0)) +
    theme(strip.text = element_text(face = "italic")) +
    theme(axis.title.x = element_blank()) +
    theme(axis.ticks.x = element_blank()) +
    theme(axis.text.x = element_blank())

# Most common positive and negative words
paulSentimentCount <- paul %>%
    mutate(linenumber = row_number()) %>%
    ungroup() %>%
    unnest_tokens(word, text) %>%
    anti_join(stop_words) %>% # remove stop words
    inner_join(bing) %>%      # join with sentiment dataset
    count(document, word, sentiment, sort = TRUE) %>%
    ungroup()

# Contribution to sentiment
paulSentimentCount %>%
    filter(n > 10) %>%
    mutate(n = ifelse(sentiment == "negative", -n, n)) %>%
    mutate(word = reorder(word, n)) %>%
    ggplot(aes(word, n, fill = sentiment)) +
    geom_bar(stat = "identity") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    ylab("Contribution to sentiment")
```
### "Units Beyond Words"
Words are not the only units of text that can be extracted using the `unnest_tokens` function. Look at the package vignette for more information!
```{r, sentences}
paulSentences <- paul %>%
    group_by(document) %>%
    unnest_tokens(sentence, text, token = "sentences") %>%
    ungroup()
paulSentences$sentence[1]
```
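Sentences are one option (shown above); another is n-grams, *i.e.* sequences of consecutive words. A minimal sketch for bigrams:
```{r, bigrams}
# Tokenize into bigrams (pairs of consecutive words) instead of single words
paulBigrams <- paul %>%
    unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
    count(document, bigram, sort = TRUE)
paulBigrams
```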
### Networks of Words
You can also identify which words are most often used together with the `pairwise_count` function from the `widyr` package.
```{r, pair count, eval = FALSE}
paulWord <- paul %>%
    mutate(linenumber = row_number()) %>%
    ungroup() %>%
    unnest_tokens(word, text) %>%
    anti_join(stop_words)

paulWordOcc <- pairwise_count(paulWord, word, linenumber, sort = TRUE)

set.seed(1813)
paulWordOcc %>%
    filter(n >= 25) %>%
    graph_from_data_frame() %>%
    ggraph(layout = "fr") +
    geom_edge_link(aes(edge_alpha = n, edge_width = n)) +
    geom_node_point(color = "darkslategray4", size = 5) +
    geom_node_text(aes(label = name), vjust = 1.8) +
    ggtitle(expression(paste("Word Network in Paul Snelgrove's book ",
                             italic("Census of Marine Life")))) +
    theme_void()

netD3 <- paulWordOcc %>%
    filter(n >= 25) %>%
    graph_from_data_frame()
netD3 <- networkD3::igraph_to_networkD3(netD3, group = rep(1, vcount(netD3)), what = 'both')
networkD3::forceNetwork(Links = netD3$links,
                        Nodes = netD3$nodes,
                        Source = 'source',
                        Target = 'target',
                        NodeID = 'name',
                        Group = 'group',
                        zoom = TRUE,
                        linkDistance = 50,
                        fontSize = 12,
                        opacity = 0.9,
                        charge = -10)
```
<br/>
### Wordle
Another neat visual tool available in R is the ability to produce a custom wordle (word cloud) based on the results of your analyses using the package `wordcloud2`.
```{r, wordcloud}
paulWord <- paul %>%
    mutate(linenumber = row_number()) %>%
    ungroup() %>%
    unnest_tokens(word, text) %>%
    anti_join(stop_words) %>%              # remove stop words
    count(document, word, sort = TRUE) %>% # counts the word frequency
    ungroup() %>%
    filter(n >= 20)
paulWordDF <- as.data.frame(paulWord[, c('word','n')])
# Basic wordle
wordcloud2::wordcloud2(paulWordDF, size = 1, color="random-light", backgroundColor=1)
# # Word shaped wordle
# wordcloud2::letterCloud(paulWordDF, word = "Paul")
#
# # Image shaped wordle
# wordcloud2::wordcloud2(paulWordDF, figPath = "./CoML_icon.png", size = 1.5)
#
# wordcloud2::wordcloud2(paulWordDF, figPath = "./CHONe.jpg", size = 1.5)
```
[<<BACK](https://remi-daigle.github.io/2017-CHONe-Data/)