Natural Language Processing: Language Modelling
nlpcourse.europe.naverlabs.com/slides/01-LM.pdf


Page 1:

© 2017 NAVER LABS. All rights reserved.

Matthias Gallé

Naver Labs Europe

8th January 2018

Natural Language Processing:

Language Modelling

@mgalle

Page 2:

Language Modelling

Language is ambiguous, and we are constantly decoding the most probable meaning

We want to compute

$P(s) = P(w_1 w_2 w_3 \dots w_{|s|})$

“But it must be recognized that the notion ’probability of a sentence’ is an entirely useless one, under any known interpretation of this term.” (Chomsky, 1969)

Page 3:

Language Model: uses

Spell correction

Re-ranking for:

• OCR

• ASR

• MT

But also a fundamental building block for Q&A, summarization, etc. (including IR)

P(boil an egg) > P(boil a egg)

P(boil an egg) > P(boil Enoch)

Page 4:

How to compute

$P(w_1 w_2 w_3 \dots w_{|s|}) = P(w_{|s|} \mid w_1 w_2 \dots w_{|s|-1}) \cdot P(w_{|s|-1} \mid w_1 w_2 \dots w_{|s|-2}) \cdots P(w_1)$

Def. conditional probability: P(A | B) = P(A,B) / P(B)

Ex: P(boil an egg for you) = P(you | boil an egg for) * P(for | boil an egg) * P(egg | boil an) * P(an | boil) * P(boil)
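A minimal Python sketch of this chain-rule decomposition; `cond_prob` is a hypothetical function returning P(word | preceding words), not something defined on the slides.

```python
# Chain-rule decomposition: P(w1 ... wn) = P(wn | w1..wn-1) * ... * P(w2 | w1) * P(w1).
# cond_prob(word, context) is a hypothetical callable returning P(word | context).
def sentence_prob(words, cond_prob):
    p = 1.0
    for i, w in enumerate(words):
        p *= cond_prob(w, words[:i])   # context = all preceding words
    return p

# e.g. sentence_prob("boil an egg for you".split(), cond_prob)
```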

Page 5:

Not enough statistics

s c(s)

boil 2796

boil an 269

boil an egg 28

boil an egg for 1

boil an egg for you 0

But, c(an egg for) = 8 and c(an egg) = 1571

Billion Word Corpus (http://www.statmt.org/lm-benchmark/)

Page 6:

Not enough statistics: Markovian assumption

Memoryless property

Only most recent words are important

Intuitively, this should be approximately true for text in general

How common are long-range correlations in text?

Assume that $P(w_i \mid w_1 w_2 \dots w_{i-1}) = P(w_i \mid w_{i-n+1} \dots w_{i-2} w_{i-1})$

Page 7:

Example: 2-gram model

P(boil an egg for you) = P(you | for) * P(for | egg) * P(egg | an) * P(an | boil)

s c(s)

for you 39191

egg for 136

an egg 1571

boil an 29

Page 8:

Maximum Likelihood Estimation

Just counting

$P(w_i \mid w_{i-n+1} \dots w_{i-1}) = \dfrac{c(w_{i-n+1} \dots w_{i-1} w_i)}{c(w_{i-n+1} \dots w_{i-1})}$

2-gram

P(boil an egg for you) = c(for you)/c(for) * c(egg for)/c(egg) * c(an egg)/c(an) * c(boil an)/c(boil)

= 39191/6598312 * 136/7871 * 1571/2442763 * 29/2659 = 7.198e-10

P(boil a egg for you) = 1.056e-12

and P(an egg) = 6.4e-4 vs P(a egg) = 6.51e-6

P(boil Enoch for you) =3.43e-14
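A minimal sketch of the MLE bigram computation above, hard-coding the counts shown on these slides rather than recounting the corpus.

```python
# MLE bigram estimates with the slide counts (Billion Word Corpus).
unigram_counts = {"boil": 2659, "an": 2442763, "egg": 7871, "for": 6598312}
bigram_counts = {("boil", "an"): 29, ("an", "egg"): 1571,
                 ("egg", "for"): 136, ("for", "you"): 39191}

def p_mle(word, prev):
    """P(word | prev) = c(prev word) / c(prev)."""
    return bigram_counts.get((prev, word), 0) / unigram_counts[prev]

def sentence_prob_bigram(words):
    p = 1.0
    for prev, word in zip(words, words[1:]):  # P(w1) is ignored, as on the slide
        p *= p_mle(word, prev)
    return p

print(sentence_prob_bigram("boil an egg for you".split()))  # ~7.2e-10
```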

Page 9:

Toolkits

CMU-Cambridge (http://www.speech.cs.cmu.edu/SLM)

SRILM (http://www.speech.sri.com/projects/srilm/)

KenLM (https://kheafield.com/code/kenlm/)

integrated into Moses (phrase-based MT)

Page 10:

Data

Google Books Ngrams

http://storage.googleapis.com/books/ngrams/books/datasetsv2.html

8M books (6% of all books ever published)

Binned by years (culturomics)

Ngrams of Corpus of Contemporary American English

https://www.ngrams.info/download_coca.asp

Any data-set

The one task in NLP where annotation is not an issue

Have to pre-process

(small domain-specific often > large generic-domain)

Page 11:

Evaluation

1. Word Error Rate: how many times your best prediction is incorrect

Problem: what if your second-best prediction was good, with p = 0.49?

2. Final task: impact of different LM on end-task

time-consuming

confounding variables

Page 12:

Evaluation: Perplexity

Interpretation: maximise probability

“how well can you predict upcoming word?”

“amount of surprise that the test data generates for your model“

Perplexity: $\mathrm{ppx}(q) = 2^{-\frac{1}{N}\sum_{i=1}^{N}\log_2 q(x_i)} = 2^{H(p,q)}$, where $p$ is the observed (empirical) distribution ($p(x) = c(x)/N$)

Min ppx = max probability

Unit of entropy: bits per unit (word)

how many bits do I need to encode the correct option given the model

perfect model: ppx = 1 (0 bits)
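A minimal sketch of how this is computed in practice; `model.prob` is a hypothetical per-word conditional probability, not an API from any particular toolkit.

```python
import math

def perplexity(model, test_words):
    """ppx = 2^H(p, q): 2 to the average negative log2-probability per word."""
    log2_sum = sum(math.log2(model.prob(w, test_words[:i]))
                   for i, w in enumerate(test_words))
    cross_entropy = -log2_sum / len(test_words)   # bits per word
    return 2 ** cross_entropy                     # min ppx <=> max probability
```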

Page 13:

Evaluation: Perplexity

Cross-entropy: how many bits do I need on average to encode the correct answer (drawn from p) when using a code based on q

corpus ppx/H

Penn Treebank (PTB) ~ 60

1 Billion Word Corpus ~ 30 (23 with ensemble)

Char-PTB ~ 3.03 [1.6 bits-per-character]

Char-War & Peace ~ 2.46 [1.3 bpc]

Page 14:

Shortcomings of Perplexity

$\mathrm{ppx}(q) = 2^{-\frac{1}{N}\sum_i \log_2 q(x_i)}$

What happens if you never saw x?

Lots of effort to model unseen events

SPiCe competition (http://spice.lif.univ-mrs.fr) used nDCG (from IR)

Page 15:

Drawbacks of MLE: lack of generalization

Actually, P(boil Enoch for you) = 0

“boil Enoch” never occurs

Law of NLP: You will always find new n-grams

Page 16:

Other drawbacks of MLE: Overfitting

Use the LM as a text generator (Trump speeches)

2-gram:

'The biggest concern with King of pro - fought very hard to look at a state legislature opposed it was she received a two hours during the oath to me the Obama administration has known opportunity . <END>'

'And even knows how bad for you , and leave government service members who will never before in our borders from a whole thing . <END>'

'The Democratic Convention was invaded and restore dignity and put her emails she delivers for teachers but we 're asking for the systemic failures in line , go and together as your jobs , about the side of the swamp of every year including one from our southern border with backdoor tariffs , you 've been charged with all is no moral character . <END>'

4-gram:

'I do have a reaction to the prosecutor in Baltimore who indicted those police officers who probably could have made a deal to DESTROY the laptops of government officials implicated in a massive criminal cover - up her crimes . <END>'

'It is no great secret that many of the great veterans as you saw , and it arrives on November 8th , the arrogance of Washington , D.C. <END>'

'Every day we fail to enforce our laws is a day when a loving parent is at risk of losing their tax - exempt status . <END>'

Page 17:

Other drawbacks of MLE: Overfitting

Use the LM as a text generator (Trump speeches)

7-gram:

'Perhaps it is easy for politicians to lose touch with reality when they are being paid millions of dollars to read speeches to Wall Street executives instead of spending time with real people in real pain . <END>'

'You saw it the other day with the truck screaming out the window . <END>'

Page 18:

Smoothing

“Whenever data sparsity is an issue, smoothing can help performance, and data sparsity is almost always an issue in statistical modeling. In the extreme case where there is so much training data that all parameters can be accurately trained without smoothing, one can almost always expand the model, such as by moving to a higher n-gram model, to achieve improved performance. With more parameters data sparsity becomes an issue again, but with proper smoothing the models are usually more accurate than the original models. Thus, no matter how much data one has, smoothing can almost always help performance, and for a relatively small effort.”

Chen & Goodman (1998)

Page 19:

Smoothing

Simplest: Laplace smoothing

reserve some probability mass for unseen events

Just assume you saw each word α more times than you actually did (called Laplace smoothing when α = 1)

$P(w \mid c) = \dfrac{\#(c\,w)}{\#(c)} \quad\longrightarrow\quad P(w \mid c) = \dfrac{\#(c\,w) + \alpha}{\#(c) + \alpha V}$
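A minimal sketch of add-α smoothing for bigrams, assuming hypothetical count tables like the ones built above.

```python
def p_add_alpha(word, prev, bigram_counts, unigram_counts, vocab_size, alpha=1.0):
    """Add-alpha estimate: pretend every (prev, word) pair was seen alpha extra times."""
    num = bigram_counts.get((prev, word), 0) + alpha
    den = unigram_counts.get(prev, 0) + alpha * vocab_size
    return num / den
```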

Page 20:

Back-off & Interpolation

Sometimes less is better

Back-off: if there is not enough evidence, use a smaller context

if $c(w_{i-2}w_{i-1}) > K$: $\quad p(w_i \mid w_{i-2}w_{i-1}) = p_2(w_i \mid w_{i-2}w_{i-1})$

otherwise: $\quad p(w_i \mid w_{i-2}w_{i-1}) = p_1(w_i \mid w_{i-1})$
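A minimal sketch of count-based back-off; `p2`, `p1` and the context-count table are assumed to exist (e.g. MLE estimators like the ones above).

```python
def p_backoff(w, context, p2, p1, context_counts, K=5):
    """Use the larger context only if it was seen more than K times."""
    if context_counts.get(context, 0) > K:
        return p2(w, context)        # trigram estimate, context = (w_{i-2}, w_{i-1})
    return p1(w, context[-1:])       # back off to the bigram estimate
```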

Page 21:

Back-off & Interpolation

Interpolation (aka Jelinek-Mercer) : combine signal from >1 context

$p(w_i \mid w_{i-2}w_{i-1}) = \lambda_1\, p_2(w_i \mid w_{i-2}w_{i-1}) + (1 - \lambda_1)\, p_1(w_i \mid w_{i-1})$

Can be done recursively.

Base model: MLE, or uniform

$\lambda_i$:

• estimated using held-out data (≠training). Why?

• can be context-dependent (bucketing to reduce parameter explosion)
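A minimal sketch of recursive Jelinek-Mercer interpolation down to a uniform base model; `ngram_prob` is an assumed MLE estimator and `lambdas` are hypothetical weights tuned on held-out data.

```python
def p_interp(w, context, ngram_prob, lambdas, vocab_size):
    """lambda_n * MLE(w | context) + (1 - lambda_n) * interpolated lower order."""
    if not context:                       # base model: uniform over the vocabulary
        return 1.0 / vocab_size
    lam = lambdas[len(context) - 1]       # one weight per context length
    lower = p_interp(w, context[1:], ngram_prob, lambdas, vocab_size)
    return lam * ngram_prob(w, context) + (1.0 - lam) * lower
```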

Page 22:

Witten-Bell

Intuition:

• not many different words will follow (“San ____”) or (“in spite ___”)

• if context diversity is high, then the context doesn't provide a lot of information: if a word w occurs V times and has context diversity V, then its presence is not informative

• Model the probability of using the smaller-order model as $1 - \lambda_c = \dfrac{rd(c)}{rd(c) + \sum_{w_i}\#(c\,w_i)}$

where $rd(c)$ is the right-diversity of $c$: $|\{w \mid \#(c\,w) > 0\}|$

Page 23:

Witten-Bell

Example:

• “New” always followed by “York”: $1 - \lambda_{New} = \dfrac{1}{1 + c(\text{New York})}$

• $c$ always followed by a different word: $1 - \lambda_c = \dfrac{\#c}{2\,\#c} = \dfrac{1}{2}$

Originally developed for compression

Compressing ~ Learning
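A minimal sketch of the Witten-Bell weight for a single context c, assuming a hypothetical nested count table continuation_counts[c][w] = #(c w) and an assumed lower-order estimator.

```python
def p_witten_bell(w, c, continuation_counts, lower_order_prob):
    followers = continuation_counts.get(c, {})
    rd = len(followers)                      # right-diversity |{w : #(cw) > 0}|
    total = sum(followers.values())          # sum_w #(c w)
    if total == 0:
        return lower_order_prob(w)           # unseen context: use lower order only
    one_minus_lam = rd / (rd + total)        # mass given to the smaller-order model
    p_mle = followers.get(w, 0) / total
    return (1 - one_minus_lam) * p_mle + one_minus_lam * lower_order_prob(w)
```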

Page 24:

Kneser-Ney

Key innovation: absolute discounting + better modelling of the lower-order distribution

Absolute discounting: again, reserve probability mass for unseen events

$p(w_i \mid w_{i-1}) = \dfrac{\max(c(w_{i-1}w_i) - \delta,\ 0)}{c(w_{i-1})} + \lambda_{w_{i-1}}\, p(w_i)$

$\sum_{w_i} p(w_i \mid w_{i-1}) = 1$. How often does $\delta$ get discounted?

• V?

• $rd(w_{i-1})$!

$\Rightarrow\ \lambda_{w_{i-1}} = \delta \cdot \dfrac{rd(w_{i-1})}{c(w_{i-1})}$

Page 25:

Kneser-Ney II

Lower-order model

Assumes it is coming from an interpolated model:

the lower-order model is only used when the higher-order one is useless

Intuition: assume Rica occurs very often, but only ever preceded by Costa.

In an interpolated model, the unigram probability p(Rica) would be relatively high, even though it is only used when p(Rica | c) is not considered well modelled (hence c ≠ Costa, so p(Rica) should be low)

Page 26:

Kneser-Ney II

Lower-order model

$p(w_i) = \dfrac{ld(w_i)}{N_{\text{bigrams}}}$

where $ld(w_i)$ is the left diversity of $w_i$: $|\{w \mid \#(w\,w_i) > 0\}|$, and $N_{\text{bigrams}}$ is the number of distinct bigram types

Can be done recursively as well.

“Modified Kneser-Ney” (normally) uses 3 different values for 𝛿 (for ngrams occurring 1, 2 and 3+ times).
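A minimal sketch of interpolated Kneser-Ney for bigrams with a single discount δ, assuming hypothetical bigram/unigram count tables; the "modified" variant would use three discounts instead of one.

```python
from collections import defaultdict

def make_kn_bigram(bigram_counts, unigram_counts, delta=0.75):
    right = defaultdict(set)    # rd: distinct continuations of each context
    left = defaultdict(set)     # ld: distinct left contexts of each word
    for (prev, w) in bigram_counts:
        right[prev].add(w)
        left[w].add(prev)
    n_bigram_types = len(bigram_counts)

    def p_cont(w):              # lower-order "continuation" probability
        return len(left[w]) / n_bigram_types

    def p_kn(w, prev):
        c_prev = unigram_counts[prev]
        discounted = max(bigram_counts.get((prev, w), 0) - delta, 0) / c_prev
        lam = delta * len(right[prev]) / c_prev    # mass freed by discounting
        return discounted + lam * p_cont(w)

    return p_kn
```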

Page 27:

Data structures

• Hashing

• Approximate hashing:

Bloom Filters: store $(w,\ 1 + \lfloor \log \#(w) \rfloor)$, i.e. the n-gram with a log-quantized count

• Quantize probabilities

• Suffix Trees

Smoothed Bloom filter language models: Tera-Scale LMs on the Cheap. Talbot & Osborne. EMNLP 2007

Page 28:

Data structures: Suffix Trees

https://www.researchgate.net/publication/315676593_Accelerating_a_BWT-based_exact_search_on_multi-GPU_heterogeneous_computing_platforms

https://en.wikipedia.org/wiki/Suffix_tree

With suffix links

Suffix Trees as Language Models. Kennington et al. LREC 2012

Fast, Small and Exact: Infinite-order Language Modelling with Compressed Suffix Trees. Shareghi, EMNLP 2016

Page 29:

Efficiency: Pruning

• Count-based: remove all n-grams with count < K

removes in particular singletons

• Probability-based: $p(c\,w)\,[\log p(w \mid c) - \log p'(w \mid c)]$

• Relative-entropy: $\sum_{c_i, w_j} p(c_i w_j)\,[\log p(w_j \mid c_i) - \log p'(w_j \mid c_i)]$

!! Pruning and smoothing do not always work well together

Page 30:

Out-of-Vocabulary words

What if you see new words?

(remember: ~1k new English words per year, plus named entities, spelling errors, transliterations)

One of the most common problems in NLP

1. Character-based LM

• Hybrid word-char based

2. Train with an <UNK> token

• Keep only V' words (most common, centroids, most-discriminative)

• Everything not in V' gets mapped to <UNK>
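A minimal sketch of the <UNK> recipe, keeping only the V' most common training words; the threshold is illustrative.

```python
from collections import Counter

def build_vocab(train_tokens, v_prime=10000):
    """Keep only the V' most frequent words."""
    return {w for w, _ in Counter(train_tokens).most_common(v_prime)}

def map_unk(tokens, vocab, unk="<UNK>"):
    """Everything outside V' gets mapped to <UNK>."""
    return [t if t in vocab else unk for t in tokens]
```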

Page 31:

Other LM approaches: History-based

$P(w \mid ctxt, history) = \lambda\, P(w \mid ctxt) + (1-\lambda)\, \dfrac{\#(w \in history)}{|history|}$

Or just linear interpolation: $P(w \mid ctxt, history) = \lambda\, P(w \mid ctxt) + (1-\lambda)\, P_{history}(w \mid ctxt)$

Also useful when you have a (small) amount of in-domain text
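A minimal sketch of the first (cache-style) formula, assuming a base conditional-probability function; λ would be tuned on held-out data.

```python
def p_with_history(w, ctxt, history, base_prob, lam=0.9):
    """Interpolate the base LM with the relative frequency of w in the recent history."""
    cache = history.count(w) / len(history) if history else 0.0
    return lam * base_prob(w, ctxt) + (1.0 - lam) * cache
```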

Page 32:

Other approaches: Parsing-based

We decided $P(w_1, w_2, \dots, w_{|s|}) = P(w_{|s|} \mid w_1, w_2, \dots, w_{|s|-1}) \cdots P(w_2 \mid w_1) \cdot P(w_1)$, which is completely arbitrary

Assume you have a probabilistic parser with which you can compute p(t|s), then you can define

$P(s) = \sum_t P(s \mid t)$

Page 33:

Other approaches: Class-based

Define a set of classes $c_1, c_2, \dots, c_k$

Define $P(w_i \mid ctxt) = \sum_{c_j} P(w_i \mid c_j)\, P(c_j \mid ctxt)$

Typical classes are part-of-speech tags (Noun, Verb, Adj, etc.)

But they can also be induced:

Paris is the capital of France
Berlin is the capital of Germany
Rome is the capital of Italy

CITY is the capital of COUNTRY
CITY is the capital of COUNTRY
CITY is the capital of COUNTRY
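A minimal sketch of the class-based mixture, assuming hypothetical tables p_word_given_class[c][w] and p_class_given_ctxt[ctxt][c].

```python
def p_class_based(w, ctxt, p_word_given_class, p_class_given_ctxt):
    """P(w | ctxt) = sum_c P(w | c) * P(c | ctxt)."""
    return sum(p_w_c.get(w, 0.0) * p_class_given_ctxt[ctxt].get(c, 0.0)
               for c, p_w_c in p_word_given_class.items())
```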

Page 34:

Other approaches: max-entropy

Define feature vector $\phi(w_i, w_1 \dots w_{i-1})$ of size $d$

And then, find $\theta$ that maximises

$P_\theta(w_i \mid w_1 \dots w_{i-1}) = \dfrac{\exp\!\big(\theta \cdot \phi(w_i, w_1 \dots w_{i-1})\big)}{\sum_{w \in \Sigma} \exp\!\big(\theta \cdot \phi(w, w_1 \dots w_{i-1})\big)}$

Can use any feature you can dream of (syntactic, grammatical, external)

Log-linear model: easy to train (convex)

Can control the # parameters (d)

Huge vocabulary (normalization over Σ is expensive)
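A minimal numpy sketch of the log-linear distribution, assuming a hypothetical feature function phi(word, history) that returns a length-d vector and a list `vocab` of candidate words; the sum over the whole vocabulary is what makes large vocabularies costly.

```python
import numpy as np

def p_maxent(w, history, theta, phi, vocab):
    """Softmax over theta . phi(w, history) for every candidate word."""
    scores = np.array([theta @ phi(v, history) for v in vocab])
    scores -= scores.max()                         # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[vocab.index(w)]
```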

Page 35:

Neural Language Modeling

The cat is walking in a bedroom

A dog was running in the room

For us the two are very similar, but they would share almost no parameters in an n-gram model

Language is discrete

No clear relationship between dog/cat, is/was

Page 36:

Continuous representation of words

Already existed before 2010, but was not mainstream

Nowadays, ubiquitous in NLP

Continuous => Learn

Page 37:

Overall view

[Figure: feed-forward neural LM. The previous words $w_{i-1}, w_{i-2}, w_{i-3}, w_{i-4}$ are mapped to embeddings $e(w_{i-1}), \dots, e(w_{i-4})$, fed into a hidden layer with tanh, and a softmax outputs the prediction for $w_i$.]
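A minimal numpy sketch of the pictured feed-forward neural LM (embeddings, tanh hidden layer, softmax over the vocabulary); all sizes and the random initialisation are illustrative, and training is omitted.

```python
import numpy as np

V, d, h, n_ctx = 10000, 64, 128, 4
rng = np.random.default_rng(0)
E = rng.normal(0, 0.1, (V, d))             # word embeddings e(w)
W_h = rng.normal(0, 0.1, (h, n_ctx * d))   # hidden layer
W_o = rng.normal(0, 0.1, (V, h))           # output (softmax) layer

def predict(context_ids):
    """P(w_i | w_{i-1}, ..., w_{i-4}) from the concatenated embeddings."""
    x = np.concatenate([E[i] for i in context_ids])
    hidden = np.tanh(W_h @ x)
    scores = W_o @ hidden
    scores -= scores.max()                  # numerical stability
    return np.exp(scores) / np.exp(scores).sum()

p_next = predict([12, 7, 42, 99])           # distribution over all V words
```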

Page 38:

Practical details

Lots of effort has gone into doing this efficiently:

data parallelism

parameter parallelism

The final prediction is actually an interpolation of the NN with a 3-gram model

Page 39:

Recurrent Neural Networks: the intuition

RNNs obtain state-of-the-art results in many NLP tasks

RNNs are designed to handle sequences (the main difference from vision)

Unlike a traditional NN, an RNN feeds its output back into itself:

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Page 40:

Recurrent Neural Networks: the intuition

To bypass the explosion of parameters of n-gram models, use the same weights at each time step

[Figure: the RNN unrolled over time steps $t-1$, $t$, $t+1$: at each step the input $x_t$ and the previous hidden state $h_{t-1}$ produce the new state $h_t$ and the output $y_t$, reusing the same weights $W_w$, $W_h$, $W_p$.]

h: size of hidden layer, d: size of word embedding, V: size of vocabulary

$W_h \in \mathbb{R}^{h \times h}$, $W_w \in \mathbb{R}^{h \times d}$, $W_p \in \mathbb{R}^{V \times h}$
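A minimal numpy sketch of one RNN time step with the weight shapes above; the same W_w, W_h, W_p are reused at every step. Sizes and initialisation are illustrative.

```python
import numpy as np

h_dim, d, V = 128, 64, 10000
rng = np.random.default_rng(0)
W_h = rng.normal(0, 0.1, (h_dim, h_dim))   # hidden-to-hidden
W_w = rng.normal(0, 0.1, (h_dim, d))       # embedding-to-hidden
W_p = rng.normal(0, 0.1, (V, h_dim))       # hidden-to-output

def rnn_step(x_t, h_prev):
    """One time step: new hidden state and P(next word | history so far)."""
    h_t = np.tanh(W_w @ x_t + W_h @ h_prev)
    scores = W_p @ h_t
    scores -= scores.max()                  # numerical stability
    y_t = np.exp(scores) / np.exp(scores).sum()
    return h_t, y_t
```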

Page 41:

LSTM

NN trained with gradient-descent methods

Back-propagation composes (multiplies) gradients across many layers/time steps

Multiplying many numbers ∈ [0, 1] tends to 0:

the vanishing gradient problem

Page 42:

Long short-term memory

Adds a gate: a switch that decides when to forget the past

http://deeplearning.net/tutorial/lstm.html
