
Page 1

Language modelling using N-Grams

Corpora and Statistical Methods

Lecture 7

Page 2

In this lecture

We consider one of the basic tasks in Statistical NLP:

language models are probabilistic representations of allowable sequences

This part:

methodological overview

fundamental statistical estimation models

Next part:

smoothing techniques

Page 3

Assumptions, definitions, methodology, algorithms

Part 1

Page 4

Example task

The word-prediction task (Shannon game)

Given: a sequence of words (the history)

a choice of next word

Predict: the most likely next word

Generalises easily to other problems, such as predicting the POS of unknown words based on their history.

Page 5

Applications of the Shannon game

Automatic speech recognition (cf. tutorial 1):

given a sequence of possible words, estimate its probability

Context-sensitive spelling correction:

Many spelling errors are real words: He walked for miles in the dessert. (intended: desert)

Identifying such errors requires a global estimate of the probability of a sentence.

Page 6

Applications of N-gram models generally

POS Tagging (cf. lecture 3):

predict the POS of an unknown word by looking at its history

Statistical parsing:

e.g. predict the group of words that together form a phrase

Statistical NL Generation:

given a semantic form to be realised as text, and several possible realisations, select the most probable one.

Page 7

A real-world example: Google’s “did you mean”

Google uses an n-gram model (based on sequences of characters, not words).

In this case, the sequence apple desserts is much more probable than apple deserts.

Page 8

How it works

Documents provided by the search engine are added to:

An index (for fast retrieval)

A language model (based on probability of a sequence of characters)

A submitted query (“apple deserts”) can be modified (using character insertions, deletions, substitutions and transpositions) to yield a query that fits the language model better (“apple desserts”).

Outcome is a context-sensitive spelling correction:

“apple deserts” → “apple desserts”
“frod baggins” → “frodo baggins”
“frod” → “ford”
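To make the mechanism concrete, here is a minimal sketch of this kind of candidate generation and rescoring (not Google’s actual system): candidates are produced by single-character insertions, deletions, substitutions and transpositions, and the candidate scored highest by a character-level language model wins. The scoring function char_lm_logprob is a hypothetical stand-in for such a model.

```python
import string

def edits1(query):
    """All strings one edit away from `query`: deletions, transpositions,
    substitutions and insertions (the modifications mentioned above)."""
    letters = string.ascii_lowercase + " "
    splits = [(query[:i], query[i:]) for i in range(len(query) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    substitutes = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + substitutes + inserts)

def suggest(query, char_lm_logprob):
    """Return the candidate (possibly the query itself) that the character
    language model scores as most probable."""
    candidates = {query} | edits1(query)
    return max(candidates, key=char_lm_logprob)
```

With a well-trained character model, suggest("apple deserts", char_lm_logprob) would be expected to return "apple desserts".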

Page 9

The noisy channel model

After Jurafsky and Martin (2009), Speech and Language Processing (2nd Ed.), Prentice Hall, p. 198.

Page 10

The assumptions behind n-gram models

Page 11

The Markov Assumption

Markov models:

probabilistic models which predict the likelihood of a future unit based on limited history

in language modelling, this pans out as the local history assumption:

the probability of w_n depends on a limited number of prior words

utility of the assumption:

we can rely on a small n for our n-gram models (bigram, trigram)

long n-grams become exceedingly sparse

Probabilities become very small with long sequences

Page 12

The structure of an n-gram model

The task can be re-stated in conditional probabilistic terms:

Limiting n under the Markov Assumption means:

greater chance of finding more than one occurrence of the sequence w_1 … w_{n-1}

more robust statistical estimations

N-grams are essentially equivalence classes or bins

every unique n-gram is a type or bin

$P(w_n \mid w_1 \ldots w_{n-1})$
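To spell the assumption out (a standard restatement, not from the original slide): the chain rule factorises a sequence into conditionals on the full history, and the Markov assumption truncates each history to the last n−1 words, e.g. for a trigram model (n = 3):

$P(w_1 \ldots w_N) = \prod_{i=1}^{N} P(w_i \mid w_1 \ldots w_{i-1}) \approx \prod_{i=1}^{N} P(w_i \mid w_{i-2}, w_{i-1})$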

Page 13

Structure of n-gram models (II)

If we construct a model where all histories with the same n-1 words are considered one class or bin, we have an (n-1)th order Markov Model.

Note terminology:

n-gram model = (n-1)th order Markov Model

Page 14

Methodological considerations

We are often concerned with building an n-gram model and evaluating it.

We therefore make a distinction between training and test data. You never test on your training data: if you do, you’re bound to get good results.

N-gram models tend to be overtrained, i.e. if you train on a corpus C, your model will be biased towards expecting the kinds of events in C. Another term for this: overfitting.

Page 15

Dividing the data

Given: a corpus of n units (words, sentences, … depending on the task)

A large proportion of the corpus is reserved for training.

A smaller proportion for testing/evaluation (normally 5-10%)

Page 16

Held-out (validation) data

Held-out estimation:

during training, we sometimes estimate parameters for our model empirically

commonly used in smoothing (how much probability space do we want to set aside for unseen data?)

therefore, the training set is often split further into training data and validation data

normally, held-out data is 10% of the size of the training data
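A minimal sketch of such a split, assuming the corpus is a list of units (e.g. sentences); the function name and the 80/10/10 proportions are illustrative only.

```python
import random

def split_corpus(units, train=0.8, heldout=0.1, seed=0):
    """Shuffle the corpus and split it into training, held-out (validation)
    and test portions; whatever remains after the first two goes to test."""
    units = list(units)
    random.Random(seed).shuffle(units)
    n_train = int(len(units) * train)
    n_heldout = int(len(units) * heldout)
    return (units[:n_train],                     # training data
            units[n_train:n_train + n_heldout],  # held-out / validation data
            units[n_train + n_heldout:])         # test data
```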

Page 17

Development data

A common approach:

1. train an algorithm on training data

a. (estimate further parameters on held-out data if required)

2. evaluate it

3. re-tune it

4. go back to Step 1 until no further fine-tuning is necessary

5. Carry out final evaluation

For this purpose, it’s useful to have:

training data for step 1

development set for steps 2-4

final test set for step 5

Page 18

Significance testing

Often, we compare the performance of our algorithm against some baseline.

A single, raw performance score won’t tell us much. We need to test for significance (e.g. using a t-test).

Typical method:

Split test set into several small test sets, e.g. 20 samples

evaluation carried out separately on each

mean and variance estimated based on 20 different samples

test for significant difference between algorithm and a predefined baseline
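A sketch of that comparison, assuming we already have one evaluation score per test sample for our system and for the baseline (e.g. 20 scores each); a paired t-test is one reasonable choice for this setting.

```python
from scipy import stats

def significantly_different(system_scores, baseline_scores, alpha=0.05):
    """Paired t-test over per-sample evaluation scores (e.g. 20 test splits).
    Returns the p-value and whether the difference is significant at alpha."""
    result = stats.ttest_rel(system_scores, baseline_scores)
    return result.pvalue, result.pvalue < alpha
```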

Page 19

Size of n-gram models

In a corpus of vocabulary size N, the assumption is that any combination of n words is a potential n-gram.

For a bigram model: N^2 possible n-grams in principle

For a trigram model: N^3 possible n-grams.

Page 20

Size (continued)

Each n-gram in our model is a parameter used to estimate the probability of the next possible word.

too many parameters make the model unwieldy

too many parameters lead to data sparseness: most of them will have f = 0 or 1

Most models stick to unigrams, bigrams or trigrams.

estimation can also combine different order models

Page 21

Further considerations

When building a model, we tend to take into account the start-of-sentence symbol: for the girl swallowed a large green caterpillar, the first bigrams are

<s> the
the girl

Also typical to map all tokens w such that count(w) < k to <UNK>: usually, tokens with frequency 1 or 2 are just considered “unknown” or “unseen”

this reduces the parameter space
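A minimal sketch of these two preprocessing steps, assuming each sentence is a list of tokens; with the threshold k = 3, tokens seen once or twice are mapped to <UNK>, as on the slide.

```python
from collections import Counter

def preprocess(sentences, k=3):
    """Prepend the start-of-sentence symbol <s> and replace every token w
    with count(w) < k by <UNK>, which reduces the parameter space."""
    counts = Counter(w for sent in sentences for w in sent)
    return [["<s>"] + [w if counts[w] >= k else "<UNK>" for w in sent]
            for sent in sentences]
```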

Page 22

Building models using Maximum Likelihood Estimation

Page 23

Maximum Likelihood Estimation Approach

Basic equation:

$P(w_n \mid w_1 \ldots w_{n-1}) = \frac{P(w_1 \ldots w_n)}{P(w_1 \ldots w_{n-1})}$

In a unigram model, this reduces to simple probability.

MLE models estimate probability using relative frequency.
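A minimal sketch of MLE for a bigram model: the conditional probability is simply the relative frequency of the bigram over its one-word history (history counts are taken as unigram counts here for simplicity).

```python
from collections import Counter

def mle_bigrams(sentences):
    """Estimate P(w2 | w1) = count(w1 w2) / count(w1) by relative frequency.
    Bigrams never seen in training get no entry, i.e. zero probability."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    return {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}
```

The zero probability for unseen bigrams is exactly the limitation discussed on the next slide.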

Page 24

Limitations of MLE

MLE builds the model that maximises the probability of the training data.

Events not seen in the training data are assigned zero probability.

Since n-gram models tend to be sparse, this is a real problem.

Consequences:

seen events are given more probability mass than they have

unseen events are given zero mass

Page 25

Seen/unseen

[Diagram: two regions, A and A’ — A: probability mass of events in the training data; A’: probability mass of events not in the training data]

The problem with MLE is that it distributes A’ among members of A.

Page 26

The solution

The solution is to correct the MLE estimate using a smoothing technique.

More on this in the next part.

But cf. Tutorial 1, which introduced the simplest known smoothing method.

Page 27

Adequacy of different order models

Manning & Schütze (1999) report results for n-gram models of a corpus of Jane Austen’s novels.

Task: use n-gram model to predict the probability of a sentence in the test data.

Models:

unigram: essentially a zero-context Markov model, which uses only the probability of individual words

bigram

trigram

4-gram

Page 28

Example test case

• Training corpus: five Jane Austen novels
• Corpus size = 617,091 words
• Vocabulary size = 14,585 unique types
• Task: predict the next word of the trigram

“inferior to ________”

from the test data, Persuasion:

“[In person, she was] inferior to both [sisters.]”

Page 29

Selecting an n

Vocabulary (V) = 20,000 words

n (n-gram order)    Number of bins (i.e. no. of possible unique n-grams)
2 (bigrams)         400,000,000
3 (trigrams)        8,000,000,000,000
4 (4-grams)         1.6 × 10^17
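The figures in the table are just V^n; a quick check with V = 20,000:

```python
V = 20_000
for n in (2, 3, 4):
    print(n, V ** n)
# 2 400000000
# 3 8000000000000
# 4 160000000000000000   (= 1.6 × 10^17)
```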

Page 30

Adequacy of unigrams

Problems with unigram models:

not entirely hopeless, because most sentences contain a majority of highly common words

ignores syntax completely:

P(In person she was inferior) = P(inferior was she person in)

Page 31

Adequacy of bigrams

Bigrams:

improve situation dramatically

some unexpected results:

p(she|person) decreases compared to the unigram model. Though she is very common, it is uncommon after person.

Page 32

Adequacy of trigrams

Trigram models will do brilliantly when they’re useful.

They capture a surprising amount of contextual variation in text.

Biggest limitation:

most new trigrams in test data will not have been seen in training data.

Problem carries over to 4-grams, and is much worse!

Page 33

Reliability vs. Discrimination

larger n: more information about the context of the specific instance (greater discrimination)

smaller n: more instances in training data, better statistical estimates (more reliability)

Page 34

Backing off

A possible way of striking a balance between reliability and discrimination is a backoff model (a simplified sketch follows below):

where possible, use a trigram

if the trigram is unseen, try to “back off” to a bigram model

if the bigram is also unseen, try to “back off” to a unigram
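A deliberately simplified sketch of this idea, assuming precomputed MLE probability tables keyed by tuples. A real backoff model such as Katz backoff also discounts and renormalises, which is omitted here, so these values do not form a proper probability distribution.

```python
def backoff_prob(w1, w2, w3, trigrams, bigrams, unigrams):
    """P(w3 | w1, w2) by naive backoff: use the trigram estimate if the
    trigram was seen in training, otherwise the bigram estimate, otherwise
    the unigram estimate (no discounting, unlike Katz backoff)."""
    if (w1, w2, w3) in trigrams:
        return trigrams[(w1, w2, w3)]
    if (w2, w3) in bigrams:
        return bigrams[(w2, w3)]
    return unigrams.get(w3, 0.0)
```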

Page 35

Evaluating language models

Page 36

Perplexity

Recall: Entropy is a measure of uncertainty:

high entropy = high uncertainty

perplexity:

if I’ve trained on a sample, how surprised am I when exposed to a new sample?

a measure of uncertainty of a model on new data

Page 37

Entropy as “expected value”

One way to think of the summation part is as a weighted average of the information content.

We can view this average value as an “expectation”: the expected surprise/uncertainty of our model.

$H(X) = -\sum_{x \in X} p(x) \log p(x)$
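A direct transcription of the formula, assuming the distribution is given as a dictionary of probabilities; log base 2 gives the result in bits.

```python
import math

def entropy(p):
    """H(X) = -sum over x of p(x) * log2 p(x); zero-probability outcomes
    contribute nothing to the sum."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

# A fair coin carries one bit of uncertainty:
# entropy({"heads": 0.5, "tails": 0.5}) == 1.0
```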

Page 38

Comparing distributions

We have a language model built from a sample. The sample is a probability distribution q over n-grams.

q(x) = the probability of some n-gram x in our model.

The sample is generated from a true population (“the language”) with probability distribution p.

p(x) = the probability of x in the true distribution

Page 39

Evaluating a language model

We’d like an estimate of how good our model is as a model of the language

i.e. we’d like to compare q to p

We don’t have access to p (hence, we can’t use KL divergence).

Instead, we use our test data as an estimate of p.

Page 40

Cross-entropy: basic intuition

Measure the number of bits needed to identify an event coming from p, if we code it according to q:

We draw sequences according to p;

but we sum the log of their probability according to q.

This estimate is called cross-entropy H(p,q)

Page 41

Cross-entropy: p vs. q

Cross-entropy is an upper bound on the entropy of the true distribution p:

H(p) ≤ H(p,q)

if our model distribution (q) is good, H(p,q) ≈ H(p)

We estimate cross-entropy based on our test data.

Gives an estimate of the distance of our language model from the distribution in the test sample.

Page 42

Estimating cross-entropy

$H(p,q) = -\sum_{x} p(x) \log q(x)$

(p(x): probability according to p, the test set; log q(x): entropy according to q, the language model)

Page 43

Perplexity

The perplexity of a language model with probability distribution q, relative to a test set with probability distribution p, is:

$\text{perplexity}(p,q) = 2^{H(p,q)}$

A perplexity value of k (obtained on a test set) tells us: our model is on average as surprised as it would be if it had to make k guesses for every sequence (n-gram) in the test data.

The lower the perplexity, the better the language model (the lower the surprise on our test data).
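A minimal sketch tying the last two slides together: cross-entropy is estimated as the average negative log2 probability that the model q assigns to the n-grams of the test set (the test set standing in for p), and perplexity is 2 raised to that value. model_prob is a hypothetical lookup into a (smoothed) language model, so it is assumed never to return zero.

```python
import math

def cross_entropy(test_ngrams, model_prob):
    """Approximate H(p, q) as the average -log2 q(x) over the test n-grams."""
    return sum(-math.log2(model_prob(ng)) for ng in test_ngrams) / len(test_ngrams)

def perplexity(test_ngrams, model_prob):
    """Perplexity = 2 ** H(p, q): roughly, the average number of guesses the
    model makes per n-gram in the test data."""
    return 2 ** cross_entropy(test_ngrams, model_prob)
```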

Page 44

Perplexity example (Jurafsky & Martin, 2000, p. 228)

Trained unigram, bigram and trigram models from a corpus of news text (Wall Street Journal)

applied smoothing

38 million words

Vocab of 19,979 (low-frequency words mapped to UNK).

Computed perplexity on a test set of 1.5 million words.

Page 45

J&M’s results

Trigrams do best of all.

Value suggests the extent to which the model can fit the data in the test set.

Note: with unigrams, the model has to make lots of guesses!

N-gram model    Perplexity
Unigram         962
Bigram          170
Trigram         109

Page 46

Summary

Main point about Markov-based language models:

data sparseness is always a problem

smoothing techniques are required to estimate probability of unseen events

Next part discusses more refined smoothing techniques than those seen so far.