

Lecture 2: Language modeling

LTAT.01.001 – Natural Language Processing
Kairit Sirts (kairit.sirts@ut.ee)

20.02.2019


The task of language modeling

The cat sat on the mat

The mat sat on the cat

The cat mat the on sat


Language modeling

Task:
• Estimate the quality/fluency/grammaticality of a natural language sentence or segment

Why?
• Generate new sentences
• Choose between several variants, picking the best-sounding one


Language modeling

Word: $w$

Sentence: $s = w_1 w_2 \dots w_n$


Language modeling

Can we use some grammaticality-checking rules to determine the fluency of the sentence $s$?
• Theoretically – yes
• In practice:
  • Grammar-checking software is unreliable
  • Grammar-checking software is only available for a few languages
  • Its output is often non-continuous, which means that:
    • It cannot be used in optimization
    • It cannot be used to easily choose the best output from many viable hypotheses


Language modeling

Instead we will try to calculate/model:

! " = ! $!$"…$#

>

>

6

P(The cat sat on the mat) P(The mat sat on the cat)

P(The cat mat the on sat)P(The mat sat on the cat)


How to compute the sentence probability?

$P(\text{The cat sat on the mat}) = \frac{\#(\text{The cat sat on the mat})}{\#(\text{all sentences})} = ?$

$P(\text{The mat sat on the cat}) = \frac{\#(\text{The mat sat on the cat})}{\#(\text{all sentences})} = ?$

$P(\text{The cat mat the on sat}) = \frac{\#(\text{The cat mat the on sat})}{\#(\text{all sentences})} = ?$

# – the number or count of such sentences

That's clearly not doable in general!


How to compute the sentence probability?

Factorize the joint probability:
• In general:

$P(A, B, C) = P(A)\, P(B \mid A)\, P(C \mid A, B)$

• Similarly:

$P(w_1, w_2, \dots, w_n) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_n \mid w_1, w_2, \dots, w_{n-1})$
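
For a concrete three-word example, the factorization expands as

$P(\text{The cat sat}) = P(\text{The}) \cdot P(\text{cat} \mid \text{The}) \cdot P(\text{sat} \mid \text{The, cat})$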

• It still does not solve the problem!


Sentence probability

• Cannot estimate directly:

$P(w_1 w_2 \dots w_n) = \frac{\#(w_1 w_2 \dots w_n)}{\#(\text{all sentences})}$

• Cannot use the factorization:

$P(w_1 w_2 \dots w_n) = \prod_{i=1}^{n} P(w_i \mid w_1 \dots w_{i-1})$


Sentence probability

But word probabilities are doable:
• Take a huge text (millions/billions of words)
• Compute the probability for each word type (unique word):

$P(w) = \frac{\#(w)}{\#(\text{all words in the text})}$

This is the maximum likelihood (ML) estimate.


Sentence probability

• What if we treat each word as independent of the other words? Then:

$P(s) \approx P(w_1) \times P(w_2) \times \cdots \times P(w_n)$

Under this model all orderings of the same words get the same probability:

P(The cat sat on the mat) = P(The mat sat on the cat)
P(The mat sat on the cat) = P(The cat mat the on sat)


Sentence probability

• Maybe add some context?

= " #ℎ% " &'( #ℎ% " )'( &'( " *+ )'( " (ℎ% *+ "(-'(|(ℎ%)

= " #ℎ% " -'( #ℎ% " )'( -'( " *+ )'( " (ℎ% *+ "(&'(|(ℎ%)

= " #ℎ% " &'( #ℎ% " -'( &'( " (ℎ% -'( " *+ (ℎ% "()'(|*+) 12

! " ≅ ! $! ×! $"|$! ×!($#|$")×⋯×!($$|$$%!)P(The cat sat on the mat)

P(The mat sat on the cat)

P(The cat mat the on sat)


Sentence probability – Markov property

Independence assumption or Markov assumption (in the context of language modeling):
• The next word depends only on the current/last word
• This is precisely the model from the previous slide; it is called a bigram language model because it looks at word bigrams


N-gram language model

In general, we talk about n-gram language models, where the next word depends on a fixed history of n−1 words.

• Unigram model – all words are independent; the classical bag-of-words (BOW) approach
• Bigram model
• Trigram model – the next word depends on the two last words
• 4-gram model
• 5-gram model


Computing n-gram probabilities

• Unigrams: $P(w_i) = \frac{\#(w_i)}{\#(\text{all words})}$

• Bigrams: $P(w_i \mid w_{i-1}) = \frac{\#(w_{i-1}, w_i)}{\#(w_{i-1})}$

• Trigrams: $P(w_i \mid w_{i-2}, w_{i-1}) = \frac{\#(w_{i-2}, w_{i-1}, w_i)}{\#(w_{i-2}, w_{i-1})}$
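
These maximum-likelihood estimates are plain counting. A minimal Python sketch (the corpus, the <s>/</s> padding symbols, and all names are illustrative, not from the slides):

```python
from collections import Counter

def bigram_mle(sentences):
    """Maximum-likelihood bigram estimates: P(w_i | w_{i-1}) = #(w_{i-1}, w_i) / #(w_{i-1})."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded[:-1])                  # history counts; </s> never starts a bigram
        bigrams.update(zip(padded[:-1], padded[1:]))
    return {(h, w): c / unigrams[h] for (h, w), c in bigrams.items()}

probs = bigram_mle([["the", "cat", "sat", "on", "the", "mat"]])
print(probs[("the", "cat")])   # 0.5 -- "the" occurs twice as a history, once followed by "cat"
```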


Sentence probability according to the n-gram model

• If

$P(w_i \mid w_1, w_2, \dots, w_{i-1}) \approx P(w_i \mid w_{i-m}, \dots, w_{i-1})$

• where $m = \text{order} - 1$:
  • Unigrams: $m = 0$
  • Bigrams: $m = 1$
  • Trigrams: $m = 2$, etc.

• Then

$P(s) = \prod_{i=1}^{n} P(w_i \mid w_1, w_2, \dots, w_{i-1}) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-m}, \dots, w_{i-1})$


Bigram language model: example

An example corpus:
1. the cat saw the mouse
2. the cat heard a mouse
3. the mouse heard
4. a mouse saw
5. a cat saw

Bigram      Count   Unigram   Count   Bigram prob
START the           START
the cat             the
cat saw             cat
saw the             saw
the mouse           the
mouse END           mouse
cat heard           cat
heard a             heard
a mouse             a
START a             START
mouse saw           mouse
saw END             saw
a cat               a


Bigram language model: example

An example corpus:
1. the cat saw the mouse
2. the cat heard a mouse
3. the mouse heard
4. a mouse saw
5. a cat saw

Bigram      Count   Unigram   Count   Bigram prob
START the   3       START     5       0.6
the cat     2       the       4       0.5
cat saw     2       cat       3       0.67
saw the     1       saw       3       0.33
the mouse   2       the       4       0.5
mouse END   2       mouse     4       0.5
cat heard   1       cat       3       0.33
heard a     1       heard     2       0.5
a mouse     2       a         3       0.67
START a     2       START     5       0.4
mouse saw   1       mouse     4       0.25
saw END     2       saw       3       0.67
a cat       1       a         3       0.33


Bigram language model: example

P(The cat heard) = ?

Bigram      Bigram prob
START the   0.6
the cat     0.5
cat saw     0.67
saw the     0.33
the mouse   0.5
mouse END   0.5
cat heard   0.33
heard a     0.5
a mouse     0.67
START a     0.4
mouse saw   0.25
saw END     0.67
a cat       0.33
heard END   0.5


Bigram language model: example

P(The cat heard) = P(START the) × P(the cat) × P(cat heard) × P(heard END)



Bigram language model: example

P(The cat heard) = P(START the) × P(the cat) × P(cat heard) × P(heard END) = 0.6 × 0.5 × 0.33 × 0.5 = 0.0495



Bigram language model: example

P(The mouse saw the cat) = ?



Bigram language model: example

P(the mouse saw the cat) = P(START the) × P(the mouse) × P(mouse saw) × P(saw the) × P(the cat) × P(cat END)



Bigram language model: example

P(the mouse saw the cat) = P(START the) × P(the mouse) × P(mouse saw) × P(saw the) × P(the cat) × P(cat END) = 0.6 × 0.5 × 0.25 × 0.33 × 0.5 × 0 = 0
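
A minimal Python sketch that reproduces this worked example on the toy corpus (START/END padding as in the tables; names are illustrative). It uses exact fractions, so it prints 0.05 where the slides' rounding of 1/3 to 0.33 gives 0.0495:

```python
from collections import Counter

corpus = [
    "the cat saw the mouse", "the cat heard a mouse",
    "the mouse heard", "a mouse saw", "a cat saw",
]
uni, bi = Counter(), Counter()
for line in corpus:
    toks = ["START"] + line.split() + ["END"]
    uni.update(toks[:-1])                    # history counts
    bi.update(zip(toks[:-1], toks[1:]))      # bigram counts

def score(sentence):
    toks = ["START"] + sentence.lower().split() + ["END"]
    p = 1.0
    for h, w in zip(toks[:-1], toks[1:]):
        p *= bi[(h, w)] / uni[h]             # unseen bigram -> count 0 -> whole product 0
    return p

print(score("The cat heard"))            # 3/5 * 2/4 * 1/3 * 1/2 = 0.05
print(score("The mouse saw the cat"))    # 0.0 -- "cat END" never occurs in the corpus
```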


Morphology

Source: www.rabiaergin.com


Sparsity issues

Natural languages are sparse!

Consider a vocabulary of size 60000:
• How many possible unigrams, bigrams, trigrams are there?
• How large a text corpus do we need to obtain reliable statistics for all n-grams?
• Does more data solve the problem completely?


Zipf’s law

• Given some corpus of natural language text, the frequency of any word is inversely proportional to its rank in the frequency table
• The most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, and so on


Zipf’s law

Figure source: Masrai and Milton, 2006. “How different is Arabic from Other Languages? The Relationship between Word Frequency and Lexical Coverage”


Smoothing

The general idea: find a way to fill the gaps in the counts
• Take care not to change the original distribution too much
• Fill in the gaps only as much as needed: as the corpus grows larger, there are fewer gaps to fill

Smoothing methods:
• Add λ method
• Interpolation
• (Modified) Kneser-Ney
• There are others


Add λ method

Assume every n-gram occurs λ more times than it actually does.
• Usual bigram probability:

$P(w_i \mid w_{i-1}) = \frac{\#(w_{i-1}, w_i)}{\#(w_{i-1})}$

• Add $0 < \lambda \le 1$ to all bigram counts:

$P_\lambda(w_i \mid w_{i-1}) = \frac{\#(w_{i-1}, w_i) + \lambda}{\#(w_{i-1}) + \lambda |V|}$

• Special case $\lambda = 1$: add-one smoothing
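
A one-function Python sketch of the smoothed estimate (the counters and the vocabulary size here are illustrative):

```python
from collections import Counter

bi = Counter()                      # bigram counts; empty here, so ("cat", "END") is unseen
uni = Counter({"cat": 3})           # history counts

def add_lambda_prob(h, w, vocab_size, lam=1.0):
    """P_lambda(w | h) = (#(h, w) + lam) / (#(h) + lam * |V|); lam=1.0 is add-one smoothing."""
    return (bi[(h, w)] + lam) / (uni[h] + lam * vocab_size)

# An unseen bigram now gets a small non-zero probability instead of 0:
print(add_lambda_prob("cat", "END", vocab_size=8))   # (0 + 1) / (3 + 8) ~= 0.09
```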


Add λ method

• Advantages:
  • Very simple
  • Easy to apply
• Disadvantages:
  • Performs poorly (according to Chen & Goodman)
  • All unseen events receive the same probability
  • All counts are increased by the same λ


Interpolation (Jelinek-Mercer smoothing)

If the bigram $w_{i-1} w_i$ is unseen:
• Originally its probability would be 0:

$P(w_i \mid w_{i-1}) = 0$

• Instead of 0, we could use the probability of the shorter n-gram (the unigram): $P(w_i)$
• We must make sure that the total probability mass remains the same
• Thus, interpolate between the unigram and bigram distributions:

$P_{JM}(w_i \mid w_{i-1}) = \lambda P(w_i \mid w_{i-1}) + (1 - \lambda) P(w_i)$


Interpolation (Jelinek-Mercer smoothing)

• Recursive formulation: the nth-order smoothed model is defined recursively as a linear interpolation between the nth-order maximum likelihood (ML) model and the (n−1)th-order smoothed model:

$P_{JM}(w_i \mid w_{i-n+1}, \dots, w_{i-1}) = \lambda\, P_{ML}(w_i \mid w_{i-n+1}, \dots, w_{i-1}) + (1 - \lambda)\, P_{JM}(w_i \mid w_{i-n+2}, \dots, w_{i-1})$

• Can ground the recursion with:
  • the 1st-order unigram model, or
  • the 0th-order uniform model:

$P(w) = \frac{1}{|V|}$


Software for language modeling

• KenLM: https://github.com/kpu/kenlm
• SRILM: http://www.speech.sri.com/projects/srilm/
• IRSTLM: http://hlt-mt.fbk.eu/technologies/irstlm
• Others: http://www.statmt.org/moses/?n=FactoredTraining.BuildingLanguageModel


Language model evaluation

• Intrinsic evaluation: perplexity
  • Quick and simple
  • Improvements in perplexity might not translate into improvements in downstream tasks
• Extrinsic evaluation: in a downstream task (machine translation, speech recognition, etc.)
  • More difficult and time-consuming
  • More accurate evaluation (although beware of confounding with other factors)


Perplexity

• Perplexity is a measurement of how well a probability model predicts a sample
• A language model is a probability model over language
• To evaluate a language model, compute the perplexity over a held-out set (test set):

$PP = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 P(w_i \mid w_{i-m}, \dots, w_{i-1})}$
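
A small Python sketch of this formula (the per-word log-probabilities would come from an actual language model; here they are illustrative):

```python
import math

def perplexity(log2_probs):
    """PP = 2^(-(1/N) * sum_i log2 P(w_i | history))."""
    return 2 ** (-sum(log2_probs) / len(log2_probs))

# A model that assigns every word probability 1/247 has perplexity 247:
print(perplexity([math.log2(1 / 247)] * 100))   # ~247.0
```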


Perplexity

• The lower the perplexity, the better the language model, i.e., the less “surprised” the model is on seeing the evaluation data
• The exponent is really the cross-entropy, which measures the number of bits needed to represent a word:

$H(\tilde{p}, \hat{p}) = -\sum_{i=1}^{|V|} \tilde{p}(w_i) \log_2 \hat{p}(w_i)$

• $\tilde{p}(w_i) = \frac{\#(w_i)}{N}$ – the empirical unigram probability
• $\hat{p}(w_i) = P_{LM}(w_i \mid w_{i-m}, \dots, w_{i-1})$


Perplexity

• Let’s assume that the cross-entropy on a test set is 7.95
• This means that each word in the test set could be encoded with 7.95 bits
• The model perplexity would be $2^{7.95} \approx 247$ per word
• This means that the model is as confused on the test data as if it had to choose uniformly at random from 247 possibilities for each word


Perplexity

• Perplexity is corpus-specific: only perplexities calculated on the same test set are comparable
• For a meaningful comparison, the vocabulary sizes of the two language models must be the same, e.g.:
  • You can compare a bigram language model to a trigram language model that both use a vocabulary of size 10000
  • You cannot compare a trigram language model using a vocabulary of size 10000 to a trigram language model using a vocabulary of size 20000


Neural language models

• Window-based feed-forward neural language model
• Recurrent neural language model


Feed-forward neural language model (Bengio et al., 2003)


$x = [e(w_{i-n+1}); \dots; e(w_{i-2}); e(w_{i-1})]$

$h = g(xW_h + b_h)$

$P(w_i \mid w_{i-n+1}, \dots, w_{i-2}, w_{i-1}) = \mathrm{softmax}(hW_o + b_o)$
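
A minimal numpy sketch of this forward pass, assuming tanh as the nonlinearity g and illustrative dimensions (not the ones used by Bengio et al.):

```python
import numpy as np

V, d, n, H = 10000, 100, 3, 256           # vocab size, embedding dim, context length, hidden dim
rng = np.random.default_rng(0)
E = rng.normal(size=(V, d)) * 0.1         # embedding table: row w is e(w)
W_h, b_h = rng.normal(size=(n * d, H)) * 0.1, np.zeros(H)
W_o, b_o = rng.normal(size=(H, V)) * 0.1, np.zeros(V)

def ffnn_lm(context_ids):
    """P(w_i | n previous words): concatenate embeddings, one hidden layer, softmax output."""
    x = np.concatenate([E[w] for w in context_ids])   # x = [e(w_{i-n}); ...; e(w_{i-1})]
    h = np.tanh(x @ W_h + b_h)                        # h = g(x W_h + b_h), with g = tanh
    logits = h @ W_o + b_o
    exp = np.exp(logits - logits.max())               # numerically stable softmax
    return exp / exp.sum()

p = ffnn_lm([4, 17, 42])        # three illustrative word ids
print(p.shape, p.sum())         # (10000,) 1.0
```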


Recurrent neural language model (Mikolov et al., 2010)

Source: http://colah.github.io

$h_i = g(x_i W_x + h_{i-1} W_h + b_h)$

$P(w_i \mid w_1, \dots, w_{i-2}, w_{i-1}) = \mathrm{softmax}(h_i W_o + b_o)$
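
The same kind of numpy sketch for the recurrent step (illustrative dimensions; a real implementation would use an RNN/LSTM library):

```python
import numpy as np

V, d, H = 10000, 100, 128
rng = np.random.default_rng(0)
E = rng.normal(size=(V, d)) * 0.1
W_x, W_h = rng.normal(size=(d, H)) * 0.1, rng.normal(size=(H, H)) * 0.1
b_h = np.zeros(H)
W_o, b_o = rng.normal(size=(H, V)) * 0.1, np.zeros(V)

def rnn_lm(word_ids):
    """Yields P(w_i | w_1, ..., w_{i-1}) at each step; the whole history lives in h."""
    h = np.zeros(H)
    for w in word_ids:
        h = np.tanh(E[w] @ W_x + h @ W_h + b_h)   # h_i = g(x_i W_x + h_{i-1} W_h + b_h)
        logits = h @ W_o + b_o
        exp = np.exp(logits - logits.max())
        yield exp / exp.sum()                     # softmax(h_i W_o + b_o)

for p in rnn_lm([4, 17, 42]):
    print(p.argmax(), round(float(p.max()), 6))
```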


Training the language model with cross-entropy loss

!!"#$$%&'("#)* "#, # = −'+,-

.#+ log "#+ = − log "#(

|V| - the vocabulary sizet – index of the correct word
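
In code, with a one-hot target the loss reduces to a single lookup (numpy sketch; natural log for illustration):

```python
import numpy as np

def cross_entropy_loss(y_hat, t):
    """L = -sum_i y_i log y_hat_i; with one-hot y, only the correct index t survives."""
    return -np.log(y_hat[t])

y_hat = np.array([0.1, 0.7, 0.2])        # model distribution over a 3-word toy vocabulary
print(cross_entropy_loss(y_hat, t=1))    # -log 0.7 ~= 0.357
```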


Why is the softmax over a large vocabulary computationally costly?

• What is a softmax?

!"#$%&' (! = *"!∑!" *"!"

• Now take the derivative from this with respect to (!• The sum over the whole vocabulary will remain in the derivative

(check it yourself)
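
A numpy sketch that makes the point concrete: the softmax Jacobian $\partial s_k / \partial x_i = s_k(\delta_{ki} - s_i)$ is dense, because the normalizer couples every output to every input, so the gradient costs O(|V|) per training step:

```python
import numpy as np

def softmax(x):
    exp = np.exp(x - x.max())        # subtract max for numerical stability
    return exp / exp.sum()

x = np.random.default_rng(0).normal(size=5)   # pretend |V| = 5
s = softmax(x)

# Jacobian ds_k / dx_i = s_k * (delta_ki - s_i): a full |V| x |V| matrix
jacobian = np.diag(s) - np.outer(s, s)
print(jacobian.shape)   # (5, 5)
```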


How to handle a large softmax?

• Hierarchical softmax
  • Decompose the softmax layer into a binary tree
  • Reduce the complexity of the output distribution from O(|V|) to O(log |V|)
• Self-normalization
• Approximate softmax

source: https://becominghuman.ai


What to do with infrequent words

• Typically, the vocabulary size is fixed, ranging anywhere between 10K and 200K words
• Still, there will always be words that are not part of the vocabulary (remember Turkish?)
• The most common approach is to simply replace all out-of-vocabulary (OOV) words with a special UNK token
• Another option is to reduce the sparsity by constructing the vocabulary from subword units: morphemes, characters, syllables, …
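
A minimal Python sketch of the UNK replacement (vocabulary size and tokenization are illustrative):

```python
from collections import Counter

def build_vocab(tokenized_corpus, max_size=10000):
    """Keep the max_size most frequent words; everything else will map to UNK."""
    counts = Counter(w for sent in tokenized_corpus for w in sent)
    return {w for w, _ in counts.most_common(max_size)}

def replace_oov(tokens, vocab):
    return [w if w in vocab else "UNK" for w in tokens]

vocab = build_vocab([["the", "cat", "sat"], ["the", "dog"]], max_size=3)
print(replace_oov(["the", "aardvark", "sat"], vocab))   # e.g. ['the', 'UNK', 'sat']
```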


What to do with infrequent words

• What if there are no UNKs in the training set?
  • Use a random UNK vector during testing
  • Randomly replace some infrequent words with UNK during training
• Construct word embeddings from characters (we’ll talk about it in more detail later)
  • Works for input (context) words
  • Cannot be used for output words


Character-level language model

• For instance, for generating text with markup
• A. Karpathy, 2015. The Unreasonable Effectiveness of Recurrent Neural Networks
• Generated text based on an LM trained on Wikipedia (example omitted here)


Using language models

• For scoring sentences
  • Speech recognition
  • Text classification
  • Statistical machine translation
• For generating text
  • Neural machine translation
  • Dialogue generation
  • Abstractive summarization
