---
title: Quick Intro to NLP with `R`
format:
  revealjs:
    theme: _extensions/metropolis-theme/metropolis.scss
    chalkboard: true
    logo: /images/ScPo-logo.png
    footer: "[SciencesPo Intro To Programming 2024](https://floswald.github.io/ScPoProgramming/)"
    incremental: false
    code-line-numbers: false
    highlight-style: github
    slide-number: true
author: Florian Oswald
subtitle: "[SciencesPo Intro To Programming 2024](https://floswald.github.io/ScPoProgramming/)"
date: today
date-format: "D MMMM, YYYY"
execute:
  echo: true
  cache: true
---
## Intro
In this lecture we will introduce the most basic language models with R.
This is based on a nice [introduction](https://datascienceplus.com/an-introduction-to-k-gram-language-models-in-r/) by Valerio Gherardi, author of the `kgrams` package for R.
---
## Natural Language Processing (NLP) basics
* We are all familiar with *large* language models (LLMs for short) by now.
* ChatGPT (short for _Chat Generative Pretrained Transformer_) is a proprietary solution, but there are by now many open-source alternatives.
* We will not be able to go into the details of those, but we will look at some of their simpler cousins.
## $k$-gram language models
* Let $w_i$ be the $i$-th word in a sentence, i.e. $s = w_1 w_2 \dots w_n$
* An NLP model gives the probability of observing this sentence, i.e. $\Pr(s)$.
* As usual, we can *sample* from $\Pr(s)$ to obtain *random* sentences.
* In general all the $s$ at our disposal come from a certain _corpus_, i.e. a collection of sentences/words.
## Continuation Probabilities
* Define a sequence of words as _context_: $w_1 w_2 \dots w_m$
* We can _predict_ the next word in the sequence by computing $\Pr(w|c)$, the probability that the next word is $w$, given context $c$.
* By the chain rule, $\Pr(s) = \prod_i \Pr(w_i | w_1 \cdots w_{i-1})$, so sentence probabilities are built up from continuation probabilities.
* That is, in a nutshell, what ChatGPT computes for you.
## Dictionaries
* The list of known words in an NLP model is called the _dictionary_.
* This also tells us how to deal with _unknown_ words: these are mapped to the `UNK` (unknown word) token.
* It also tells us how to deal with the end of sentences, by introducing an `EOS` (end of sentence) token.
* $k$-gram models (below) also include a `BOS` (beginning of sentence) token. Each sentence is left-padded with $N-1$ `BOS` tokens ($N$ being the order of the model). This helps predict _the first word of the next sentence_ from the preceding $N-1$ tokens. A toy sketch of this token handling follows below.
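
To make this concrete, here is a toy sketch (purely illustrative, not how `kgrams` handles this internally) of mapping out-of-dictionary words to `UNK` and padding a tokenized sentence with `BOS`/`EOS` tokens for an order-$N$ model:

```{r}
# Toy illustration: words outside the dictionary become UNK, and the
# sentence is left-padded with N - 1 BOS tokens plus a final EOS token.
dict   <- c("a", "pound", "of", "flesh")
tokens <- c("a", "pound", "of", "gold")
N      <- 3
tokens[!tokens %in% dict] <- "UNK"
c(rep("BOS", N - 1), tokens, "EOS")
```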
## $k$-gram Models
* A $k$-gram model makes a _Markovian_ assumption on continuation probabilities.
* We assume that the next word depends only on the last $N-1$ words, where $N$ is the _order_ of the model.
* We have
$$\begin{align}
\Pr(w|c) &= \Pr(w|w_1 w_2 \cdots w_{N-1})\\
c &= \cdots w_{-1} w_0 w_1 w_2 \cdots w_{N-1}
\end{align}$$
* We call the $k$-tuples of words $(w_1, w_2, \dots, w_k)$ _k-grams_ (see the small example below).
* You can see that we can only capture relatively short-range dependencies.
* As $N$ becomes large, memory requirements explode.
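
As a quick illustration (a toy sketch in base R, separate from the `kgrams` workflow used later), here is how the $k$-grams of a short tokenized sentence can be enumerated:

```{r}
# Enumerate all k-grams (here k = 2) of a tokenized sentence.
tokens <- c("pound", "of", "flesh", "and", "blood")
k <- 2
sapply(seq_len(length(tokens) - k + 1),
       function(i) paste(tokens[i:(i + k - 1)], collapse = " "))
```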
## Estimating Continuation Probabilities
* We can make a table from our corpus, counting how many times each $k$-gram occurs.
* While this is simple, we need a _smoothing_ technique to account for the fact that many potentially sensible sentences are never observed in our corpus.
* Simply speaking, smoothing takes some probability mass from the frequently observed sequences and gives it to the rarer ones.
* The maximum-likelihood estimator based on these counts is

$$\hat{\Pr}_{MLE}(w|c) = \frac{C(w_1 w_2 \cdots w_{k} w)}{C(w_1 w_2 \cdots w_{k})}$$

where $C(\cdot)$ counts occurrences in the corpus (a toy computation follows below).
* Our data is sparse: many sequences do not appear in our corpus, hence the above estimator incorrectly assigns zero probability to them.
* If the context $w_1 w_2 \cdots w_{k}$ is not in the data, the estimator is not even defined.
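
To fix ideas, here is a minimal, purely illustrative computation of this estimator on a made-up three-sentence corpus (plain base R, not `kgrams`):

```{r}
# Estimate Pr("flesh" | "pound of") by counting:
# count("pound of flesh") / count("pound of").
corpus <- c("a pound of flesh", "a pound of gold", "that pound of flesh again")
# Count the sentences containing an n-gram (each sentence contains it at most once here).
count_ngram <- function(ngram) sum(grepl(ngram, corpus, fixed = TRUE))
count_ngram("pound of flesh") / count_ngram("pound of")
```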
## Training and Testing NLP Models
* We need an evaluation metric: how good is this model?
* A widely used metric is [perplexity](https://en.wikipedia.org/wiki/Perplexity): the larger the _perplexity_ of a discrete probability distribution, the harder it is for an observer to guess the next value drawn from it.
* We will evaluate the cross-entropy $H=-\frac{1}{W} \sum_s \ln \Pr(s)$, where $W$ is the total number of words in our corpus; the perplexity is then $e^H$.
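
As a quick numerical illustration (the sentence probabilities below are made up, not produced by any model):

```{r}
# Cross-entropy H = -(1/W) * sum(log Pr(s)) over sentences s; perplexity = exp(H).
pr_s    <- c(1e-4, 5e-6, 2e-5)    # hypothetical sentence probabilities
n_words <- c(6, 8, 7)             # number of words in each sentence
H <- -sum(log(pr_s)) / sum(n_words)
exp(H)                            # perplexity
```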
## Training a k-gram model in R
```{r}
#| echo: true
library(kgrams)
```
We can get the spoken text from the following Shakespeare plays:
```{r}
#| echo: true
playcodes <- c(
"All's Well That Ends Well" = "AWW",
"Antony and Cleopatra" = "Ant",
"As You Like It" = "AYL",
"The Comedy of Errors" = "Err",
"Coriolanus" = "Cor",
"Cymbeline" = "Cym",
"Hamlet" = "Ham",
"Henry IV, Part 1" = "1H4",
"Henry IV, Part 2" = "2H4",
"Henry V" = "H5",
"Henry VI, Part 1" = "1H6",
"Henry VI, Part 2" = "2H6",
"Henry VI, Part 3" = "3H6",
"Henry VIII" = "H8",
"Julius Caesar" = "JC",
"King John" = "Jn",
"King Lear" = "Lr",
"Love's Labor's Lost" = "LLL",
"Macbeth" = "Mac",
"Measure for Measure" = "MM",
"The Merchant of Venice" = "MV",
"The Merry Wives of Windsor" = "Wiv",
"A Midsummer Night's Dream" = "MND",
"Much Ado About Nothing" = "Ado",
"Othello" = "Oth",
"Pericles" = "Per",
"Richard II" = "R2",
"Richard III" = "R3",
"Romeo and Juliet" = "Rom",
"The Taming of the Shrew" = "Shr",
"The Tempest" = "Tmp",
"Timon of Athens" = "Tim",
"Titus Andronicus" = "Tit",
"Troilus and Cressida" = "Tro",
"Twelfth Night" = "TN",
"Two Gentlemen of Verona" = "TGV",
"Two Noble Kinsmen" = "TNK",
"The Winter's Tale" = "WT"
)
```
## Getting the Text
We could get the text from "Much Ado About Nothing" as follows:
```{r}
#| echo: true
get_url_con <- function(playcode) {
  stopifnot(playcode %in% playcodes)
  url <- paste0("https://www.folgerdigitaltexts.org/", playcode, "/text")
  con <- url(url)
  return(con)
}
con <- get_url_con("Ado")
open(con)
readLines(con, 10)
```
```{r}
#| echo: true
close(con)
```
## Defining Training and Testing Data
We will use all plays but "Hamlet" as training data, and reserve this last one for testing our model.
```{r}
train_playcodes <- playcodes[names(playcodes) != "Hamlet"]
test_playcodes  <- playcodes[names(playcodes) == "Hamlet"]
```
We want to pre-process the text data: remove some HTML tags and make everything lower-case.
```{r}
.preprocess <- function(x) {
  # Remove html tags
  x <- gsub("<[^>]+>", "", x)
  # Lower-case and remove characters that are not alphanumeric or punctuation
  x <- kgrams::preprocess(x)
  return(x)
}
```
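
A quick sanity check on a made-up line (the tag should be removed and the text lower-cased by `kgrams::preprocess`):

```{r}
.preprocess("<b>MUCH ADO</b> About Nothing!")
```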
## Preprocessing Text
* We need to split sentences at sensible punctuation marks `.!?:;` and insert `EOS` and `BOS` tokens into the data.
* This will treat `.!?:;` as regular _words_, hence the model will be able to *predict* those.
```{r}
.tknz_sent <- function(x) {
  # Collapse everything to a single string
  x <- paste(x, collapse = " ")
  # Tokenize sentences
  x <- kgrams::tknz_sent(x, keep_first = TRUE)
  # Remove empty sentences
  x <- x[x != ""]
  return(x)
}
```
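
Again a quick check on a made-up string; the text should come back as separate sentences, split at the punctuation marks:

```{r}
.tknz_sent("What news, Borachio? I have a letter. Read it!")
```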
## Making $k$-gram frequency counts
* Let us now make a table of occurrences of all $k$-grams in our corpus.
* We set an _order_:
```{r}
N = 5
freqs = kgram_freqs(N, .preprocess = .preprocess, .tknz_sent = .tknz_sent)
summary(freqs)
```
* So far this is an empty model, as you can see. Let's train it on our corpus!
## Training the NLP model
```{r}
lapply(train_playcodes,
function(playcode) {
con <- get_url_con(playcode)
process_sentences(text = con, freqs = freqs, verbose = FALSE)
})
```
## Checking the Frequency tables
* The `freqs` object was modified in place by the previous call.
* Let's check it quickly:
```{r}
query(freqs, c("leonato", "pound of flesh", "smartphones"))
```
* Last thing to do: choose a smoother.
```{r}
smoothers()
```
Let's choose the _modified Kneser-Ney_ smoother and set some default parameters:
```{r}
info("mkn")
```
## Building the model
```{r}
model <- language_model(freqs, smoother = "mkn", D1 = 0.5, D2 = 0.5, D3 = 0.5)
summary(model)
```
## Making Predictions with the model
* Now we can compute probabilities for given sentences:
```{r}
sentences <- c(
"I have a letter from monsieur Berowne to one lady Rosaline.",
"I have an email from monsieur Valerio to one lady Judit."
)
probability(sentences, model)
```
or we can get the _continuation probability_ for a context:
```{r}
context <- "pound of"
words <- c("flesh", "bananas")
probability(words %|% context, model)
```
## Tuning our models
* Remember we held out "Hamlet" from our training data. Let's use it to test performance now!
```{r}
con <- get_url_con(test_playcodes)
perplexity(text = con, model = model)
```
This applies the same transformations and tokenization to the test data as it does to the training data (which is important).
## Tuning More
* We could now create a grid over the parameters of the model (`D1`, `D2`, etc.) as well as the order of the model, as sketched below.
* We would then choose those parameters for which the perplexity is smallest.
* Suppose we find that the model of order $N=4$ works best.
* Let's use it to create some random sentences!
```{r}
param(model, "N") <- 4
```
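
A rough sketch of such a tuning loop over the `mkn` discount parameters (not evaluated here; the grid values are arbitrary, and the test perplexity is recomputed after resetting the parameters via `param()`):

```{r}
#| eval: false
grid <- expand.grid(D1 = c(0.25, 0.5, 0.75),
                    D2 = c(0.25, 0.5, 0.75),
                    D3 = c(0.25, 0.5, 0.75))
ppl <- apply(grid, 1, function(p) {
  param(model, "D1") <- p[["D1"]]
  param(model, "D2") <- p[["D2"]]
  param(model, "D3") <- p[["D3"]]
  con <- get_url_con(test_playcodes)
  perplexity(text = con, model = model)
})
grid[which.min(ppl), ]  # combination with the smallest test perplexity
```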
## Random Text generation
```{r}
set.seed(840)
sample_sentences(model, 10, max_length = 20)
```
## Temperature
* The temperature parameter controls how spread out the sampling distribution is: smaller values mean the model will not deviate much from its implied distribution, while higher values mean much more randomness in the output (see the small illustration below).
```{r}
set.seed(841)
sample_sentences(model, 10, max_length = 20) # Normal temperature
```
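
Conceptually (a toy illustration of the idea, not necessarily the exact `kgrams` implementation), sampling at temperature $t$ amounts to rescaling each probability as $p^{1/t}$ and renormalizing:

```{r}
# Rescale a toy distribution with temperature t: q_i is proportional to p_i^(1/t).
# t > 1 flattens the distribution, t < 1 sharpens it.
p <- c(0.7, 0.2, 0.1)
rescale <- function(p, t) { q <- p^(1 / t); q / sum(q) }
rescale(p, t = 10)    # high temperature: close to uniform
rescale(p, t = 0.1)   # low temperature: almost all mass on the most likely word
```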
## High temperature
```{r}
set.seed(841)
sample_sentences(model, 10, max_length = 20, t = 10)
```
## Low temperature
```{r}
set.seed(841)
sample_sentences(model, 10, max_length = 20, t = 0.1)
```
# End