Natural Language Processing Lecture 6: Information Theory; Spelling, Edit Distance, and Noisy Channels

Upload: others

Post on 29-Jan-2021

2 views

Category:

Documents


0 download

TRANSCRIPT

  • Natural Language Processing

    Lecture 6: Information Theory; Spelling, Edit Distance, and Noisy Channels

  • A Taste of Information Theory

    • Shannon Entropy, H(p)
    • Cross-entropy, H(p; q)
    • Perplexity

  • Codebook

    Name     Code
    Biden    000
    Sanders  001
    Harris   010
    YKW      011
    Booker   100
    Warren   101
    Yang     110
    Marianne 111

  • Codebook

    Name     Code  Probability
    Biden    000   1/4
    Sanders  001   1/16
    Harris   010   1/64
    YKW      011   1/2
    Booker   100   1/64
    Warren   101   1/8
    Yang     110   1/64
    Marianne 111   1/64

  • Codebook

    Name     Probability  New Code
    Biden    1/4          10
    Sanders  1/16         1110
    Harris   1/64         111100
    YKW      1/2          0
    Booker   1/64         111101
    Warren   1/8          110
    Yang     1/64         111110
    Marianne 1/64         111111
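
    The second codebook is the payoff: when code lengths track probabilities
    (short codes for likely candidates), the expected code length drops from
    the 3 bits of the fixed codebook to the Shannon entropy H(p). A minimal
    Python sketch of that check, using exactly the names and numbers from the
    tables above:

    import math

    probs = {"Biden": 1/4, "Sanders": 1/16, "Harris": 1/64, "YKW": 1/2,
             "Booker": 1/64, "Warren": 1/8, "Yang": 1/64, "Marianne": 1/64}
    new_code = {"Biden": "10", "Sanders": "1110", "Harris": "111100",
                "YKW": "0", "Booker": "111101", "Warren": "110",
                "Yang": "111110", "Marianne": "111111"}

    entropy = -sum(p * math.log2(p) for p in probs.values())           # H(p)
    expected_len = sum(probs[n] * len(c) for n, c in new_code.items())

    print(entropy)       # 2.0 bits
    print(expected_len)  # 2.0 bits, vs. 3.0 for the fixed 3-bit codebook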

  • A Taste of Information Theory

    • Shannon Entropy, H(p)
    • Cross-entropy, H(p; q)
    • Perplexity
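
    Cross-entropy and perplexity follow the same recipe: H(p, q) measures how
    many bits a code optimized for a model q costs when outcomes are really
    drawn from p, and perplexity is 2 raised to the (cross-)entropy. A small
    sketch reusing the distribution above, with a deliberately bad uniform
    model (an illustrative choice, not from the slides):

    import math

    p = {"Biden": 1/4, "Sanders": 1/16, "Harris": 1/64, "YKW": 1/2,
         "Booker": 1/64, "Warren": 1/8, "Yang": 1/64, "Marianne": 1/64}
    q = {name: 1/8 for name in p}  # model that wrongly assumes uniformity

    cross_entropy = -sum(p[x] * math.log2(q[x]) for x in p)  # H(p, q)
    print(cross_entropy)       # 3.0 bits; always >= H(p) = 2.0
    print(2 ** cross_entropy)  # perplexity 8.0: as uncertain as 8 equal options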

  • Three Spelling Problems

    1. Detecting isolated non-words: “graffe”, “exampel”

    2. Fixing isolated non-words: “graffe” → “giraffe”, “exampel” → “example”

    3. Fixing errors in context: “I ate desert” → “I ate dessert”; “It was written be me” → “It was written by me”

  • String edit distance

    • How many letter changes to map A to B?

    • Substitutions
      E X A M P E L
      E X A M P L E   (2 substitutions)

    • Insertions
      E X A P L E
      E X A M P L E   (1 insertion)

    • Deletions
      E X A M M P L E
      E X A _ M P L E (1 deletion)

  • Levenshtein Distance

    D_{0,0} = 0

    D_{i,j} = \min \begin{cases}
      D_{i-1,j} + \mathrm{inscost}(t_i) \\
      D_{i,j-1} + \mathrm{delcost}(s_j) \\
      D_{i-1,j-1} + \mathrm{substcost}(t_i, s_j)
    \end{cases}
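
    The recurrence translates directly into a dynamic program. A sketch with
    the cost functions as parameters (unit costs by default, which may differ
    from the costs used in lecture):

    def edit_matrix(s, t,
                    inscost=lambda c: 1,
                    delcost=lambda c: 1,
                    substcost=lambda a, b: 0 if a == b else 1):
        # D[i][j] = cheapest way to turn s[:j] into t[:i]
        D = [[0] * (len(s) + 1) for _ in range(len(t) + 1)]
        for i in range(1, len(t) + 1):   # left boundary: build t by insertions
            D[i][0] = D[i - 1][0] + inscost(t[i - 1])
        for j in range(1, len(s) + 1):   # bottom boundary: erase s by deletions
            D[0][j] = D[0][j - 1] + delcost(s[j - 1])
        for i in range(1, len(t) + 1):
            for j in range(1, len(s) + 1):
                D[i][j] = min(D[i - 1][j] + inscost(t[i - 1]),
                              D[i][j - 1] + delcost(s[j - 1]),
                              D[i - 1][j - 1] + substcost(t[i - 1], s[j - 1]))
        return D

    def edit_distance(s, t):
        return edit_matrix(s, t)[-1][-1]

    print(edit_distance("exampel", "example"))  # 2: the two substitutions above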

  • String Edit Distance

  • String edit distance

    #  9
    L  8
    E  7
    P  6
    M  5
    M  4
    A  3
    X  2
    E  1
    #  0  1  2  3  4  5  6  7  8
       #  E  X  A  M  P  L  E  #

  • String edit distance

    #  9  8  7  6  5
    L  8  7  6  5  4
    E  7  6  5  4  3
    P  6  5  4  3  2
    M  5  4  3  2  1
    M  4  3  2  1  0  1
    A  3  2  1  0  1  2
    X  2  1  0  1  2  3
    E  1  0  1  2  3  4
    #  0  1  2  3  4  5  6  7  8
       #  E  X  A  M  P  L  E  #

  • String edit distance

    #  9  8  7  6  5  4  4  6  5
    L  8  7  6  5  4  3  3  5  7
    E  7  6  5  4  3  2  3  2  3
    P  6  5  4  3  2  1  2  3  4
    M  5  4  3  2  1  2  3  4  5
    M  4  3  2  1  0  1  2  3  4
    A  3  2  1  0  1  2  3  4  5
    X  2  1  0  1  2  3  4  5  6
    E  1  0  1  2  3  4  5  6  7
    #  0  1  2  3  4  5  6  7  8
       #  E  X  A  M  P  L  E  #

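    What the tables do not show is how an alignment is read back out: start at
    the final cell and walk to (0, 0), at each step asking which of the three
    moves could have produced the current value. A traceback sketch, reusing
    edit_matrix from above and assuming its default unit costs:

    def traceback(s, t):
        D = edit_matrix(s, t)
        i, j, ops = len(t), len(s), []
        while i > 0 or j > 0:
            if (i > 0 and j > 0
                    and D[i][j] == D[i - 1][j - 1] + (s[j - 1] != t[i - 1])):
                ops.append("copy " + s[j - 1] if s[j - 1] == t[i - 1]
                           else f"subst {s[j - 1]} -> {t[i - 1]}")
                i, j = i - 1, j - 1
            elif i > 0 and D[i][j] == D[i - 1][j] + 1:
                ops.append(f"insert {t[i - 1]}")
                i -= 1
            else:
                ops.append(f"delete {s[j - 1]}")
                j -= 1
        return ops[::-1]

    print(traceback("exampel", "example"))
    # [..., 'copy p', 'subst e -> l', 'subst l -> e']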

  • Levenshtein vs. Hamming Distance

    D_{0,0} = 0

    D_{i,j} = \min \begin{cases}
      D_{i-1,j} + 1 \\
      D_{i,j-1} + 1 \\
      D_{i-1,j-1} + \mathrm{substcost}(t_i, s_j)
    \end{cases}
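
    With insertion and deletion costs fixed at 1, this is exactly what the
    edit_matrix sketch above computes when inscost and delcost are left at
    their defaults.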

  • Levenshtein Distance with Transposition

    D_{0,0} = 0

    D_{i,j} = \min \begin{cases}
      D_{i-1,j} + \mathrm{inscost}(t_i) \\
      D_{i,j-1} + \mathrm{delcost}(s_j) \\
      D_{i-1,j-1} + \mathrm{substcost}(t_i, s_j) \\
      D_{i-2,j-2} + \mathrm{transcost}(s_{j-1}, s_j) \quad \text{if } s_{j-1} = t_i \text{ and } s_j = t_{i-1}
    \end{cases}
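
    The extra case fires only when the last two characters of the two prefixes
    are the same pair in swapped order, i.e. a transposition. A sketch of the
    extended DP (assuming unit costs for all four operations):

    def damerau_levenshtein(s, t):
        D = [[0] * (len(s) + 1) for _ in range(len(t) + 1)]
        for i in range(len(t) + 1):
            for j in range(len(s) + 1):
                if i == 0 or j == 0:
                    D[i][j] = i + j              # boundary: pure inserts/deletes
                    continue
                D[i][j] = min(D[i - 1][j] + 1,                           # insert
                              D[i][j - 1] + 1,                           # delete
                              D[i - 1][j - 1] + (s[j - 1] != t[i - 1]))  # subst
                if (i > 1 and j > 1 and s[j - 2] == t[i - 1]
                        and s[j - 1] == t[i - 2]):
                    D[i][j] = min(D[i][j], D[i - 2][j - 2] + 1)          # transpose
        return D[len(t)][len(s)]

    print(damerau_levenshtein("exmaple", "example"))  # 1 instead of 2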

  • Three Spelling Problems

    ✓ Detecting isolated non-words
    ✓ Fixing isolated non-words
    3. Fixing errors in context

  • Kernighan’s Model: A Noisy Channel

    [Diagram: the source emits “example”; the noisy channel corrupts it to “exmaple”.]

  • “acress”

    c        freq(c)  p(t | c)               %
    actress  1343     p(delete t)            37
    cress    0        p(delete a)            0
    caress   4        p(transpose a & c)     0
    access   2280     p(substitute r for c)  0
    across   8436     p(substitute e for o)  18
    acres    2879     p(delete s)            21
    acres    2879     p(delete s)            23
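
    How the table is used: each candidate c is scored by P(c) · P(t | c), with
    P(c) estimated from the frequencies above. In the sketch below, the corpus
    size and the channel probabilities are made-up placeholders chosen to land
    near the slide’s % column (in Kernighan’s model they come from
    confusion-matrix counts over a typo corpus), so only the shape of the
    computation is meant literally:

    N = 44_000_000                      # assumed corpus size, for P(c)

    candidates = {                      # c: (freq(c), hypothetical p(t | c))
        "actress": (1343, 121e-6),      # delete t
        "cress":   (0,    46e-6),       # delete a
        "caress":  (4,    95e-6),       # transpose a & c
        "access":  (2280, 0.98e-6),     # substitute r for c
        "across":  (8436, 9.3e-6),      # substitute e for o
        "acres":   (2879, 67e-6),       # delete s (both positions combined)
    }

    def score(c):
        freq, p_channel = candidates[c]
        return (freq + 0.5) / N * p_channel  # add-0.5 smoothing rescues freq = 0

    total = sum(score(c) for c in candidates)
    for c in sorted(candidates, key=score, reverse=True):
        print(f"{c:8s} {100 * score(c) / total:4.0f}%")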

  • How to choose between options

    • Probabilities of edits
      – Insertions, deletions, substitutions
      – Transpositions

    • Probability of the new word

  • Noisy Channel Model (General)

    [Diagram: the source emits y; the channel turns y into the observation x; decoding recovers y from x.]

  • Probability model

    • Most likely word given observation: argmax_W P(W | O)

    • By Bayes’ rule, this is equivalent to: argmax_W P(O | W) P(W) / P(O)

    • Which is equivalent to argmax_W ( P(W) P(O | W) ) (the denominator is constant)

    • P(O | W) is calculated from edit distance
    • P(W) is calculated from a language model
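
    A toy end-to-end decoder for that argmax, where P(W) is a made-up unigram
    language model and P(O | W) is approximated by charging a constant factor
    per edit (both are stand-ins for the lecture’s models; edit_distance is
    the sketch from earlier in this lecture):

    lm = {"example": 3e-5, "exam": 5e-5, "maple": 1e-6}   # hypothetical P(W)

    def p_obs_given_word(obs, word, per_edit=0.01):
        return per_edit ** edit_distance(obs, word)       # crude channel model

    def correct(obs, lm):
        return max(lm, key=lambda w: lm[w] * p_obs_given_word(obs, w))

    print(correct("exmaple", lm))  # "example": ties "maple" on edits, wins on P(W)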