Word sequences:
- Natural language texts are not sets, but sequences: they have order.
- In some languages word order matters a great deal; in others it matters much less.
- Can build a grammar restricting permissible sequences (e.g. NP → ADJ + N).
- Approaches: statistical models
Why are they useful?
- For speech recognition: distinguishing acoustically similar hypotheses such as "I ate a nice peach" vs. "I ate an ice beach."
- Spelling error correction: They are leaving in about fifteen *minuets.
Counting words in Corpora:
- Counting sentences is too coarse-grained to be useful.
- Counting characters is useful for text prediction.
- Counting words is useful: a middle-grained approach (see the counting sketch after this list).
- Where do we find them? Get a corpus (plural: corpora).
- The choice of corpus has a significant impact on the resulting model.
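A minimal sketch of word counting, assuming Python and a toy list of tokens in place of a real corpus:

```python
from collections import Counter

# Toy stand-in for a real corpus; in practice this would be, e.g., the Brown Corpus.
tokens = "i ate a nice peach . i ate an apple .".split()

counts = Counter(tokens)            # word -> frequency (token counts)
total_tokens = sum(counts.values())
vocabulary_size = len(counts)       # number of distinct word types

print(counts.most_common(3))        # [('i', 2), ('ate', 2), ('.', 2)]
print(total_tokens, vocabulary_size)
```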
Example Corpora:
- Brown Corpus - just English.
- British National Corpus - balanced across a range of sources.
- Gigaword corpus - a billion words.
- Switchboard corpus - telephone conversations.
Simple N-grams models:
- Derive a probabilistic model that gives us the probability of an entire word sequence (a sentence) or the probability of the next word in a sequence.
- Simplest assumption: every word is equally likely (probability 1/|V|, where |V| is the vocabulary size). This is not correct.
- Unigram probability: P(w) = C(w) / N, the count of a word divided by the total number N of observed word tokens.
- But a word with high unigram probability does not always fit: in a given context it can be very unlikely. So we should look at conditional probabilities given the preceding words.
- In theory, the chain rule gives P(w_1^n) = P(w_1) P(w_2 | w_1) P(w_3 | w_1^2) ... P(w_n | w_1^{n-1}).
- But how do we compute P(w_n | w_1^{n-1})? The history w_1^{n-1} can be very long, and most long histories never occur in any corpus, so estimating this directly is impracticable.
- Make a simplification, the Markov assumption: approximate the probability of the n-th word given all n-1 preceding words by its probability given only the last k preceding words.
- With k = 1, this is a bigram model. With k = 2, a trigram.
- Bigram estimate: P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1}), the count of w_{n-1} followed by w_n divided by the count of w_{n-1} (see the sketch after this list).
- Sometimes the estimated probability is 0 because a bigram never occurs in the corpus.
- To solve that, smoothing.
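A minimal sketch of maximum-likelihood bigram estimation and the Markov (bigram) approximation of the chain rule. The toy corpus and the <s>/</s> sentence-boundary markers are assumptions for illustration, not part of the notes.

```python
from collections import Counter

# Toy corpus; <s> and </s> mark sentence boundaries (an illustrative convention).
sentences = [
    "<s> i ate a nice peach </s>",
    "<s> i ate an apple </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sent in sentences:
    words = sent.split()
    unigram_counts.update(words)
    bigram_counts.update(zip(words, words[1:]))  # pairs (w_{n-1}, w_n)

def bigram_prob(prev, word):
    """Maximum-likelihood estimate: P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

def sentence_prob(sentence):
    """Bigram (Markov) approximation: product of P(w_n | w_{n-1}) over the sentence."""
    words = sentence.split()
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= bigram_prob(prev, word)
    return p

print(bigram_prob("i", "ate"))                       # 1.0 in this toy corpus
print(sentence_prob("<s> i ate a nice peach </s>"))  # nonzero
print(sentence_prob("<s> i ate a nice apple </s>"))  # 0.0: bigram "nice apple" was never seen
```

The last sentence gets probability 0 only because a single bigram is unseen, which is exactly the problem smoothing addresses.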
Smoothing:
- For any n-gram, we pretend that we have seen it at least once.
- Add-one or Laplace smoothing.
- We add 1 to every bigram count (and add the vocabulary size V to each denominator so the probabilities still sum to 1); a sketch follows this list.
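A minimal sketch of add-one (Laplace) smoothing, reusing the bigram_counts and unigram_counts from the previous sketch; the vocabulary size V is taken from the same toy corpus.

```python
def laplace_bigram_prob(prev, word, vocab_size):
    """Add-one (Laplace) smoothing:
    P(w_n | w_{n-1}) = (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + V),
    where V is the vocabulary size."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + vocab_size)

V = len(unigram_counts)  # vocabulary size of the toy corpus

# The unseen bigram "nice apple" now gets a small but nonzero probability.
print(laplace_bigram_prob("nice", "apple", V))
print(laplace_bigram_prob("i", "ate", V))
```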