Natural Language Processing Lecture 6: Information Theory; Spelling, Edit Distance, and Noisy Channels

Upload: others

Post on 29-Jan-2021

2 views

Category:

Documents


0 download

TRANSCRIPT

  • Natural Language Processing

    Lecture 6: Information Theory; Spelling, Edit Distance, and Noisy Channels

  • A Taste of Information Theory

    • Shannon Entropy, H(p)
    • Cross-entropy, H(p; q)
    • Perplexity

  • Codebook

    Name     Code
    Biden    000
    Sanders  001
    Harris   010
    YKW      011
    Booker   100
    Warren   101
    Yang     110
    Marianne 111

  • Codebook

    Name     Code  Probability
    Biden    000   1/4
    Sanders  001   1/16
    Harris   010   1/64
    YKW      011   1/2
    Booker   100   1/64
    Warren   101   1/8
    Yang     110   1/64
    Marianne 111   1/64

  • Codebook

    Name     Probability  New Code
    Biden    1/4          10
    Sanders  1/16         1110
    Harris   1/64         111100
    YKW      1/2          0
    Booker   1/64         111101
    Warren   1/8          110
    Yang     1/64         111110
    Marianne 1/64         111111
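
    The second codebook is the payoff: when code lengths track probabilities
    (short codes for likely candidates), the expected code length drops from
    the 3 bits of the fixed codebook to the Shannon entropy H(p). A minimal
    Python sketch of that check, using exactly the names and numbers from the
    tables above:

    import math

    probs = {"Biden": 1/4, "Sanders": 1/16, "Harris": 1/64, "YKW": 1/2,
             "Booker": 1/64, "Warren": 1/8, "Yang": 1/64, "Marianne": 1/64}
    new_code = {"Biden": "10", "Sanders": "1110", "Harris": "111100",
                "YKW": "0", "Booker": "111101", "Warren": "110",
                "Yang": "111110", "Marianne": "111111"}

    entropy = -sum(p * math.log2(p) for p in probs.values())           # H(p)
    expected_len = sum(probs[n] * len(c) for n, c in new_code.items())

    print(entropy)       # 2.0 bits
    print(expected_len)  # 2.0 bits, vs. 3.0 for the fixed 3-bit codebook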

  • A Taste of Information Theory

    • Shannon Entropy, H(p)
    • Cross-entropy, H(p; q)
    • Perplexity
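
    Cross-entropy and perplexity follow the same recipe: H(p, q) measures how
    many bits a code optimized for a model q costs when outcomes are really
    drawn from p, and perplexity is 2 raised to the (cross-)entropy. A small
    sketch reusing the distribution above, with a deliberately bad uniform
    model (an illustrative choice, not from the slides):

    import math

    p = {"Biden": 1/4, "Sanders": 1/16, "Harris": 1/64, "YKW": 1/2,
         "Booker": 1/64, "Warren": 1/8, "Yang": 1/64, "Marianne": 1/64}
    q = {name: 1/8 for name in p}  # model that wrongly assumes uniformity

    cross_entropy = -sum(p[x] * math.log2(q[x]) for x in p)  # H(p, q)
    print(cross_entropy)       # 3.0 bits; always >= H(p) = 2.0
    print(2 ** cross_entropy)  # perplexity 8.0: as uncertain as 8 equal options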

  • Three Spelling Problems

    1. Detecting isolated non-words: “graffe”, “exampel”

    2. Fixing isolated non-words: “graffe” → “giraffe”, “exampel” → “example”

    3. Fixing errors in context: “I ate desert” → “I ate dessert”; “It was written be me” → “It was written by me”

  • String edit distance

    • How many letter changes to map A to B?

    • Substitutions
      E X A M P E L
      E X A M P L E   (2 substitutions)

    • Insertions
      E X A P L E
      E X A M P L E   (1 insertion)

    • Deletions
      E X A M M P L E
      E X A _ M P L E (1 deletion)

  • Levenshtein Distance

    D_{0,0} = 0

    D_{i,j} = \min \begin{cases}
      D_{i-1,j} + \mathrm{inscost}(t_i) \\
      D_{i,j-1} + \mathrm{delcost}(s_j) \\
      D_{i-1,j-1} + \mathrm{substcost}(t_i, s_j)
    \end{cases}
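
    The recurrence translates directly into a dynamic program. A sketch with
    the cost functions as parameters (unit costs by default, which may differ
    from the costs used in lecture):

    def edit_matrix(s, t,
                    inscost=lambda c: 1,
                    delcost=lambda c: 1,
                    substcost=lambda a, b: 0 if a == b else 1):
        # D[i][j] = cheapest way to turn s[:j] into t[:i]
        D = [[0] * (len(s) + 1) for _ in range(len(t) + 1)]
        for i in range(1, len(t) + 1):   # left boundary: build t by insertions
            D[i][0] = D[i - 1][0] + inscost(t[i - 1])
        for j in range(1, len(s) + 1):   # bottom boundary: erase s by deletions
            D[0][j] = D[0][j - 1] + delcost(s[j - 1])
        for i in range(1, len(t) + 1):
            for j in range(1, len(s) + 1):
                D[i][j] = min(D[i - 1][j] + inscost(t[i - 1]),
                              D[i][j - 1] + delcost(s[j - 1]),
                              D[i - 1][j - 1] + substcost(t[i - 1], s[j - 1]))
        return D

    def edit_distance(s, t):
        return edit_matrix(s, t)[-1][-1]

    print(edit_distance("exampel", "example"))  # 2: the two substitutions above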

  • String Edit Distance

  • String edit distance

    #  9
    L  8
    E  7
    P  6
    M  5
    M  4
    A  3
    X  2
    E  1
    #  0  1  2  3  4  5  6  7  8
       #  E  X  A  M  P  L  E  #

  • String edit distance

    #  9  8  7  6  5
    L  8  7  6  5  4
    E  7  6  5  4  3
    P  6  5  4  3  2
    M  5  4  3  2  1
    M  4  3  2  1  0  1
    A  3  2  1  0  1  2
    X  2  1  0  1  2  3
    E  1  0  1  2  3  4
    #  0  1  2  3  4  5  6  7  8
       #  E  X  A  M  P  L  E  #

  • String edit distance

    #  9  8  7  6  5  4  4  6  5
    L  8  7  6  5  4  3  3  5  7
    E  7  6  5  4  3  2  3  2  3
    P  6  5  4  3  2  1  2  3  4
    M  5  4  3  2  1  2  3  4  5
    M  4  3  2  1  0  1  2  3  4
    A  3  2  1  0  1  2  3  4  5
    X  2  1  0  1  2  3  4  5  6
    E  1  0  1  2  3  4  5  6  7
    #  0  1  2  3  4  5  6  7  8
       #  E  X  A  M  P  L  E  #

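    What the tables do not show is how an alignment is read back out: start at
    the final cell and walk to (0, 0), at each step asking which of the three
    moves could have produced the current value. A traceback sketch, reusing
    edit_matrix from above and assuming its default unit costs:

    def traceback(s, t):
        D = edit_matrix(s, t)
        i, j, ops = len(t), len(s), []
        while i > 0 or j > 0:
            if (i > 0 and j > 0
                    and D[i][j] == D[i - 1][j - 1] + (s[j - 1] != t[i - 1])):
                ops.append("copy " + s[j - 1] if s[j - 1] == t[i - 1]
                           else f"subst {s[j - 1]} -> {t[i - 1]}")
                i, j = i - 1, j - 1
            elif i > 0 and D[i][j] == D[i - 1][j] + 1:
                ops.append(f"insert {t[i - 1]}")
                i -= 1
            else:
                ops.append(f"delete {s[j - 1]}")
                j -= 1
        return ops[::-1]

    print(traceback("exampel", "example"))
    # [..., 'copy p', 'subst e -> l', 'subst l -> e']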

  • Levenshtein vs. Hamming Distance

    D_{0,0} = 0

    D_{i,j} = \min \begin{cases}
      D_{i-1,j} + 1 \\
      D_{i,j-1} + 1 \\
      D_{i-1,j-1} + \mathrm{substcost}(t_i, s_j)
    \end{cases}
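
    With insertion and deletion costs fixed at 1, this is exactly what the
    edit_matrix sketch above computes when inscost and delcost are left at
    their defaults.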

  • Levenshtein Distance with Transposition

    D_{0,0} = 0

    D_{i,j} = \min \begin{cases}
      D_{i-1,j} + \mathrm{inscost}(t_i) \\
      D_{i,j-1} + \mathrm{delcost}(s_j) \\
      D_{i-1,j-1} + \mathrm{substcost}(t_i, s_j) \\
      D_{i-2,j-2} + \mathrm{transcost}(s_{j-1}, s_j) \quad \text{if } s_{j-1} = t_i \text{ and } s_j = t_{i-1}
    \end{cases}
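
    The extra case fires only when the last two characters of the two prefixes
    are the same pair in swapped order, i.e. a transposition. A sketch of the
    extended DP (assuming unit costs for all four operations):

    def damerau_levenshtein(s, t):
        D = [[0] * (len(s) + 1) for _ in range(len(t) + 1)]
        for i in range(len(t) + 1):
            for j in range(len(s) + 1):
                if i == 0 or j == 0:
                    D[i][j] = i + j              # boundary: pure inserts/deletes
                    continue
                D[i][j] = min(D[i - 1][j] + 1,                           # insert
                              D[i][j - 1] + 1,                           # delete
                              D[i - 1][j - 1] + (s[j - 1] != t[i - 1]))  # subst
                if (i > 1 and j > 1 and s[j - 2] == t[i - 1]
                        and s[j - 1] == t[i - 2]):
                    D[i][j] = min(D[i][j], D[i - 2][j - 2] + 1)          # transpose
        return D[len(t)][len(s)]

    print(damerau_levenshtein("exmaple", "example"))  # 1 instead of 2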

  • Three Spelling Problems

    ✓ Detecting isolated non-words
    ✓ Fixing isolated non-words
    3. Fixing errors in context

  • Kernighan’s Model: A Noisy Channel

    [Diagram: the source emits “example”; the noisy channel corrupts it to “exmaple”.]

  • “acress”

    c        freq(c)  p(t | c)               %
    actress  1343     p(delete t)            37
    cress    0        p(delete a)            0
    caress   4        p(transpose a & c)     0
    access   2280     p(substitute r for c)  0
    across   8436     p(substitute e for o)  18
    acres    2879     p(delete s)            21
    acres    2879     p(delete s)            23
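
    How the table is used: each candidate c is scored by P(c) · P(t | c), with
    P(c) estimated from the frequencies above. In the sketch below, the corpus
    size and the channel probabilities are made-up placeholders chosen to land
    near the slide’s % column (in Kernighan’s model they come from
    confusion-matrix counts over a typo corpus), so only the shape of the
    computation is meant literally:

    N = 44_000_000                      # assumed corpus size, for P(c)

    candidates = {                      # c: (freq(c), hypothetical p(t | c))
        "actress": (1343, 121e-6),      # delete t
        "cress":   (0,    46e-6),       # delete a
        "caress":  (4,    95e-6),       # transpose a & c
        "access":  (2280, 0.98e-6),     # substitute r for c
        "across":  (8436, 9.3e-6),      # substitute e for o
        "acres":   (2879, 67e-6),       # delete s (both positions combined)
    }

    def score(c):
        freq, p_channel = candidates[c]
        return (freq + 0.5) / N * p_channel  # add-0.5 smoothing rescues freq = 0

    total = sum(score(c) for c in candidates)
    for c in sorted(candidates, key=score, reverse=True):
        print(f"{c:8s} {100 * score(c) / total:4.0f}%")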

  • How to choose between options

    • Probabilities of edits
      – Insertions, deletions, substitutions
      – Transpositions

    • Probability of the new word

  • Noisy Channel Model (General)

    [Diagram: the source emits y; the channel turns y into the observation x; decoding recovers y from x.]

  • Probability model

    • Most likely word given observation: argmax_W P(W | O)

    • By Bayes’ rule, this is equivalent to: argmax_W P(O | W) P(W) / P(O)

    • Which is equivalent to argmax_W ( P(W) P(O | W) ) (the denominator is constant)

    • P(O | W) is calculated from edit distance
    • P(W) is calculated from a language model
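
    A toy end-to-end decoder for that argmax, where P(W) is a made-up unigram
    language model and P(O | W) is approximated by charging a constant factor
    per edit (both are stand-ins for the lecture’s models; edit_distance is
    the sketch from earlier in this lecture):

    lm = {"example": 3e-5, "exam": 5e-5, "maple": 1e-6}   # hypothetical P(W)

    def p_obs_given_word(obs, word, per_edit=0.01):
        return per_edit ** edit_distance(obs, word)       # crude channel model

    def correct(obs, lm):
        return max(lm, key=lambda w: lm[w] * p_obs_given_word(obs, w))

    print(correct("exmaple", lm))  # "example": ties "maple" on edits, wins on P(W)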