Language models

Word sequences:

  • Natural language texts are not sets, but sequences: they have order.
  • In some languages word order matters a great deal; in others much less.
  • Can build a grammar restricting permissible sequences (e.g. NP → ADJ + N); see the sketch after this list.
  • Alternative approach: statistical (probabilistic) language models.
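
A minimal sketch, in Python, of how a single hand-written rule like NP → ADJ + N can restrict permissible sequences; the toy lexicon and the is_np helper are hypothetical, for illustration only:

```python
# Toy lexicon (hypothetical): maps a few words to part-of-speech tags.
LEXICON = {
    "nice": "ADJ",
    "green": "ADJ",
    "peach": "N",
    "idea": "N",
}

def is_np(words):
    """Return True if the two-word sequence matches the rule NP -> ADJ + N."""
    tags = [LEXICON.get(w) for w in words]
    return tags == ["ADJ", "N"]

print(is_np(["nice", "peach"]))   # True: ADJ followed by N is permitted
print(is_np(["peach", "nice"]))   # False: same words, wrong order
```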

Why are they useful?

  • For speech recognition: distinguishing "I ate a nice peach" from "I ate an ice beach".
  • Spelling error correction: They are leaving in about fifteen *minuets.

Counting words in Corpora:

  • Counting sentences is useless.
  • Counting characters is useful for text prediction.
  • Counting words is useful: a middle-grained approach between the two (see the sketch after this list).
  • Where do we find the words? In corpora (singular: corpus).
  • The choice of corpus has a significant impact on the resulting model.
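
A minimal sketch of counting words with Python's standard library; the one-line toy corpus is an assumption, standing in for a real corpus like the ones listed below:

```python
from collections import Counter

# Toy corpus (hypothetical); a real corpus would be millions of words.
corpus = "I ate a nice peach and she ate a peach too"

tokens = corpus.lower().split()         # naive whitespace tokenisation
unigram_counts = Counter(tokens)        # word -> frequency

total = sum(unigram_counts.values())
print(unigram_counts["peach"])          # 2
print(unigram_counts["peach"] / total)  # unigram probability: 2 / 11 ≈ 0.18
```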

Example Corpora:

  • Brown Corpus - American English, about a million words.
  • British National Corpus - balanced across a range of sources.
  • Gigaword corpus - a billion words.
  • Switchboard corpus - telephone conversations.

Simple N-gram models:

  • Goal: derive a probabilistic model that gives us the probability of an entire word sequence (sentence), or the probability of the next word in a sequence.
  • Suppose any word is equally likely (e.g. 1 / |V| probability). This is not correct.
  • Unigram probability is the frequency of a word over the total observed number of words.
  • But a word with a high unigram probability will not always fit: in a given context it can be very unlikely. So we should look at conditional probabilities given the preceding words.
  • In theory we need the chain rule: P(w1 w2 ... wn) = P(w1) P(w2 | w1) P(w3 | w1 w2) ... P(wn | w1 ... wn-1).
  • But how do we compute P(wn | w1 ... wn-1)? For large n this is impracticable: most long histories never occur in any corpus.
  • Simplification: the Markov assumption. Approximate the probability of the n-th word given all n-1 preceding words by its probability given only the last k preceding words.
  • With k = 1, this is a bigram model. With k = 2, a trigram.
  • Bigram estimate: P(wn | wn-1) = C(wn-1 followed by wn) / C(wn-1); see the sketch after this list.
  • Sometimes the probability is 0 because a bigram never occurs in the corpus.
  • To solve that, we use smoothing.
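
A minimal sketch of a maximum-likelihood bigram model; the three training sentences and the <s>/</s> boundary markers are assumptions for illustration. It estimates P(wn | wn-1) = C(wn-1 wn) / C(wn-1) and multiplies the bigram probabilities to approximate the probability of a whole sentence (the Markov approximation of the chain rule):

```python
from collections import Counter

# Toy training corpus (hypothetical); <s> and </s> mark sentence boundaries.
sentences = [
    "<s> i ate a nice peach </s>",
    "<s> i ate an apple </s>",
    "<s> she ate a peach </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sent in sentences:
    words = sent.split()
    unigram_counts.update(words)
    bigram_counts.update(zip(words, words[1:]))

def bigram_prob(prev, word):
    """Maximum-likelihood estimate P(word | prev) = C(prev word) / C(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

def sentence_prob(sentence):
    """Bigram (Markov) approximation of the chain rule:
    P(w1 ... wn) ~ product of P(wi | wi-1)."""
    words = sentence.split()
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        prob *= bigram_prob(prev, word)
    return prob

print(bigram_prob("i", "ate"))                  # 1.0: "i" is always followed by "ate"
print(sentence_prob("<s> i ate a peach </s>"))  # about 0.22: every bigram was seen
print(sentence_prob("<s> i ate a beach </s>"))  # 0.0: ("a", "beach") never occurs
```

Note that the last sentence gets probability 0 because a single bigram is unseen; this is exactly the problem smoothing addresses.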

Smoothing:

  • For any n-gram, we pretend that we have seen it at least once.
  • Add-one or Laplace smoothing.
  • We add 1 to every bigram count and |V| to the denominator, so the conditional probabilities still sum to 1 (see the sketch below).
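
A minimal sketch of add-one (Laplace) smoothing over bigram counts; the counts and the five-word vocabulary are assumptions for illustration. Every bigram count gets +1, and the denominator gets +|V| so the conditional probabilities still sum to 1:

```python
from collections import Counter

# Hypothetical counts, as if collected from some training corpus.
bigram_counts = Counter({("ate", "a"): 2, ("ate", "an"): 1})
unigram_counts = Counter({"ate": 3, "a": 2, "an": 1, "peach": 2})
vocab = {"ate", "a", "an", "peach", "beach"}   # vocabulary V
V = len(vocab)

def laplace_bigram_prob(prev, word):
    """Add-one smoothing: P(word | prev) = (C(prev word) + 1) / (C(prev) + |V|)."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

print(laplace_bigram_prob("ate", "a"))      # (2 + 1) / (3 + 5) = 0.375
print(laplace_bigram_prob("ate", "beach"))  # (0 + 1) / (3 + 5) = 0.125, no longer zero
```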