introduction to natural language processingcoling.epfl.ch/slides/courstal-corrlex-en.pdf · •a...

34
Out-of- Vocabulary forms Spelling Error Correction ©EPFL J.-C. Chappelier & M. Rajman Out of Vocabulary Forms Spelling Error correction J.-C. Chappelier & M. Rajman Laboratoire d’Intelligence Artificielle Faculté I&C Introduction to INLP – 1 / 33

Upload: others

Post on 13-Aug-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Introduction to Natural Language Processingcoling.epfl.ch/slides/coursTAL-corrlex-en.pdf · •a given lexicon (regular language) and a set of transformations define the edit space

Out-of-Vocabularyforms

Spelling ErrorCorrection

©EPFLJ.-C. Chappelier & M. Rajman

Out of Vocabulary FormsSpelling Error correction

J.-C. Chappelier & M. Rajman

Laboratoire d’Intelligence ArtificielleFaculté I&C

Introduction to INLP – 1 / 33

Page 2: Introduction to Natural Language Processingcoling.epfl.ch/slides/coursTAL-corrlex-en.pdf · •a given lexicon (regular language) and a set of transformations define the edit space

Out-of-Vocabularyforms

Spelling ErrorCorrection

©EPFLJ.-C. Chappelier & M. Rajman

Contents

I Out of Vocabulary FormsI Spelling Error Correction

I Edit distanceI Spelling error correction with FSAI Weighted edit distance

Introduction to INLP – 2 / 33

Page 3: Introduction to Natural Language Processingcoling.epfl.ch/slides/coursTAL-corrlex-en.pdf · •a given lexicon (regular language) and a set of transformations define the edit space

Out-of-Vocabularyforms

Spelling ErrorCorrection

©EPFLJ.-C. Chappelier & M. Rajman

Out of Vocabulary forms

I Out of Vocabulary (OoV) forms matter:they occur quite frequently (e.g. ' 10% in newspapers)

What do they consist of?I spelling errors:foget, summmary, usqge, ...

I neologisms:Internetization, Tacherism, ...

I borrowings:gestalt, rendez-vous, ...

I forms difficult to exhaustively lexicalize:(numbers,) proper names, abbreviations, ...

I identification based on patterns is not well-adapted for all OoV forms+ We will focus here on spelling errors, neologisms and borrowings

Introduction to INLP – 3 / 33

Page 4: Introduction to Natural Language Processingcoling.epfl.ch/slides/coursTAL-corrlex-en.pdf · •a given lexicon (regular language) and a set of transformations define the edit space

Out-of-Vocabularyforms

Spelling ErrorCorrection

©EPFLJ.-C. Chappelier & M. Rajman

Spelling errors and neologisms

I for spelling errors (resp. neologisms), distortions (resp. derivations) are modelledby transformations, i.e. rewriting rules (sometimes weighted)

Example:I Transposition (distortion): XY → YX [1.0]

where X and Y stands for variables

I tripling (distortion): XX → XXX [1.0]

I name derivation: ize:INF → ization:N [1.0]

I a given lexicon (regular language) and a set of transformations define the editspace to be explored

+ The aim is to find the position of the OoV forms in the edit space with respect to known(lexicalized) forms (neighbourhoods, similarity, distance)

Introduction to INLP – 4 / 33

Page 5: Introduction to Natural Language Processingcoling.epfl.ch/slides/coursTAL-corrlex-en.pdf · •a given lexicon (regular language) and a set of transformations define the edit space

Out-of-Vocabularyforms

Spelling ErrorCorrection

©EPFLJ.-C. Chappelier & M. Rajman

Spelling errors and neologisms (2)

I if the transformation set is simple enough: automatic (or semi-automatic) learningof the transformation set is possible

Examples:I morphological rules for SpanishI transformations for spelling error correction after OCR:

I print a selected set of texts (you have in electronic form)making use of several different printers and fonts

I OCR the printed copiesI learn required transformations from errors

Introduction to INLP – 5 / 33

Page 6: Introduction to Natural Language Processingcoling.epfl.ch/slides/coursTAL-corrlex-en.pdf · •a given lexicon (regular language) and a set of transformations define the edit space

Out-of-Vocabularyforms

Spelling ErrorCorrection

©EPFLJ.-C. Chappelier & M. Rajman

BorrowingsFor borrowings + identification of the source language[when no large coverage lexica are avalaible for the other languages, but only representative

texts]

Decomposition into n-grams of characters.Example: for trigrams

dribble→ (dri,rib,ibb,bbl,ble)

In practice: n varies from 2 to 4

From reference corpora, estimate the likelihood of a word to belong to a givenlanguage.Example for trigrams:

P(dribble|L) = P(dri|L) · P(rib|L)

P(ri|L)· ... · P(ble|L)

P(bl|L)

Trigrams for French, English, German and Spanish: 87% discrimination accuracyIntroduction to INLP – 6 / 33

Page 7: Introduction to Natural Language Processingcoling.epfl.ch/slides/coursTAL-corrlex-en.pdf · •a given lexicon (regular language) and a set of transformations define the edit space

Out-of-Vocabularyforms

Spelling ErrorCorrection

©EPFLJ.-C. Chappelier & M. Rajman

Likelihood vs. Posterior probabilityIn the former slide, why make use of the likelihood P(w |L) rather than the posteriorprobability P(L|w)?I They are both hard to accurately model without further assumptions (w belongs

to a huge set!)but no further simplification can be made on P(L|w): w is fixed (and there isnothing to gain “simplifying” L!)P(w |L) can be further simplified making assumptions on w

I Using the Bayes’ rule:

argmaxL

P(L|w) = argmaxL

P(w |L) ·P(L)

+ introduces the likelihood anyway! (which could then be simplified further)

I If you can accurately estimate P(L), sure, make use of it!

I Otherwise, the least biaised hypothesis (maximum entropy) is to a priori assumethat all languages are all equally possible: maximizing posterior probability is thenthe same as maximizing likelihood

Introduction to INLP – 7 / 33

Page 8: Introduction to Natural Language Processingcoling.epfl.ch/slides/coursTAL-corrlex-en.pdf · •a given lexicon (regular language) and a set of transformations define the edit space

Out-of-Vocabularyforms

Spelling ErrorCorrectionTwo approaches

Edit distance

Spell. err. corr. witha FSA

Weighted editdistance

©EPFLJ.-C. Chappelier & M. Rajman

Contents

I Out of Vocabulary Formsä Spelling Error Correction

I Edit distanceI Spelling error correction with FSAI Weighted edit distance

Introduction to INLP – 7 / 33

Page 9: Introduction to Natural Language Processingcoling.epfl.ch/slides/coursTAL-corrlex-en.pdf · •a given lexicon (regular language) and a set of transformations define the edit space

Out-of-Vocabularyforms

Spelling ErrorCorrectionTwo approaches

Edit distance

Spell. err. corr. witha FSA

Weighted editdistance

©EPFLJ.-C. Chappelier & M. Rajman

Spelling error correction

correct strings

all strings

max. distancemax. distance

input stringinput string

correct strings

solutionsTwo approaches:

Exact Probabilisticlexicon-based

correct forms: lexicon any stringmetric: edit distance probability

In this lecture:I only a few words about the probabilistic approach (next slide)I mainly: exact, lexicon-based, approach

Introduction to INLP – 8 / 33

Page 10: Introduction to Natural Language Processingcoling.epfl.ch/slides/coursTAL-corrlex-en.pdf · •a given lexicon (regular language) and a set of transformations define the edit space

Out-of-Vocabularyforms

Spelling ErrorCorrectionTwo approaches

Edit distance

Spell. err. corr. witha FSA

Weighted editdistance

©EPFLJ.-C. Chappelier & M. Rajman

Probabilistic approach summarized (1/2)

Make (one more time!) use of n-grams (both levels, characters and tokens, arecombined)

w : OoV token to be corrected

c: candidate correction, out of C (w), set of possible candidates for w

argmaxc∈C (w)

P(c|w) = argmaxc∈C (w)

P(c) ·P(w |c)

P(c): language model (n-grams of tokens/words; n = 1 here, but could easily beextended to neighboring tokens (n > 1 then))

P(w |c): error model: edit distance and/or m-grams of characters

Introduction to INLP – 9 / 33

Page 11: Introduction to Natural Language Processingcoling.epfl.ch/slides/coursTAL-corrlex-en.pdf · •a given lexicon (regular language) and a set of transformations define the edit space

Out-of-Vocabularyforms

Spelling ErrorCorrectionTwo approaches

Edit distance

Spell. err. corr. witha FSA

Weighted editdistance

©EPFLJ.-C. Chappelier & M. Rajman

Probabilistic approach summarized (2/2)

A usual (unexplicit?) assumption is that P(w |c) is many orders of magnitude higher forsmaller edit distance (than for higher): thus closer candidate are considereds first,leading to this simple algorithm,where Cd (w) is the set of candidates at distance d from w :

I if C1(w) is not empty, return argmaxc∈C1(w)

P(c);

I (else) if C2(w) is not empty, return argmaxc∈C2(w)

P(c);

I etc...

For more details: see http://norvig.com/spell-correct.html

Introduction to INLP – 10 / 33

Page 12: Introduction to Natural Language Processingcoling.epfl.ch/slides/coursTAL-corrlex-en.pdf · •a given lexicon (regular language) and a set of transformations define the edit space

Out-of-Vocabularyforms

Spelling ErrorCorrectionTwo approaches

Edit distance

Spell. err. corr. witha FSA

Weighted editdistance

©EPFLJ.-C. Chappelier & M. Rajman

Edit distance

also called “Levenshtein distance”

+ distance between 2 forms/strings= minimal number of transformations to change one into the other

+ depends on the set of transformations considered

Examples of transformations:` insertion: exmple→ example` deletion: example→ exmple` substitution: exemple→ example` transposition: exmaple→ example

Introduction to INLP – 11 / 33

Page 13: Introduction to Natural Language Processingcoling.epfl.ch/slides/coursTAL-corrlex-en.pdf · •a given lexicon (regular language) and a set of transformations define the edit space

Out-of-Vocabularyforms

Spelling ErrorCorrectionTwo approaches

Edit distance

Spell. err. corr. witha FSA

Weighted editdistance

©EPFLJ.-C. Chappelier & M. Rajman

Computation of edit distance (1)

Notations:Xi : i th char of string X

X ji : if i ≤ j : substring Xi ,...,Xj ; empty string otherwise

Example: X = castleX3 = s X 6

4 = tle X 41 = cast X 0

1 = ε

Computation of the distance D(X ,Y ) by dynamic programming:

+ step by step in a chart m where each cell mij contains the distance between thetwo substrings X i

1 and Y j1 :

mij = D(X i1,Y

j1)

Introduction to INLP – 12 / 33

Page 14: Introduction to Natural Language Processingcoling.epfl.ch/slides/coursTAL-corrlex-en.pdf · •a given lexicon (regular language) and a set of transformations define the edit space

Out-of-Vocabularyforms

Spelling ErrorCorrectionTwo approaches

Edit distance

Spell. err. corr. witha FSA

Weighted editdistance

©EPFLJ.-C. Chappelier & M. Rajman

Computation of edit distance (2)

D(X 01 ;Y j

1) = j initialization

D(X i1;Y 0

1 ) = i

D(X i1;Y j

1) = D(X i−11 ;Y j−1

1 ) if Xi = Yj (equality)

= 1 + min{

D(X i−21 ;Y j−2

1 ),

D(X i−11 ;Y j

1),D(X i1;Y j−1

1 )} else if i ≥ 2 and j ≥

2 and Xi−1 = Yj andXi = Yj−1 (transposition,deletion, insertion)

= 1 + min{

D(X i−11 ;Y j−1

1 ),

D(X i−11 ;Y j

1),D(X i1;Y j−1

1 )} else (substitution, dele-

tion, insertion)

Introduction to INLP – 13 / 33

Page 15: Introduction to Natural Language Processingcoling.epfl.ch/slides/coursTAL-corrlex-en.pdf · •a given lexicon (regular language) and a set of transformations define the edit space

Out-of-Vocabularyforms

Spelling ErrorCorrectionTwo approaches

Edit distance

Spell. err. corr. witha FSA

Weighted editdistance

©EPFLJ.-C. Chappelier & M. Rajman

Computation of edit distance: computation order

Id

Subs

Transp

Del

Ins

Y

X

i

jj-2

i-2

+ several possible ways of computing: rowwise, columnwise or diagonal

Introduction to INLP – 14 / 33

Page 16: Introduction to Natural Language Processingcoling.epfl.ch/slides/coursTAL-corrlex-en.pdf · •a given lexicon (regular language) and a set of transformations define the edit space

Out-of-Vocabularyforms

Spelling ErrorCorrectionTwo approaches

Edit distance

Spell. err. corr. witha FSA

Weighted editdistance

©EPFLJ.-C. Chappelier & M. Rajman

Computation of edit distance (3)Example, columnwise:

for all i from 0 to |X | (size of X ) domi0 = i

for all j from 1 to |Y | dom0j = jfor all i from 1 to |X | do

if Xi = Yj thenmij = mi−1,j−1

else if i ≥ 2 and j ≥ 2 and Xi−1 = Yj and Xi = Yj−1 then

mij = 1 + min{

mi−2,j−2; mi ,j−1; mi−1,j

}else

mij = 1 + min{

mi−1,j−1; mi ,j−1; mi−1,j

}Return m|X |,|Y |

Introduction to INLP – 15 / 33

Page 17: Introduction to Natural Language Processingcoling.epfl.ch/slides/coursTAL-corrlex-en.pdf · •a given lexicon (regular language) and a set of transformations define the edit space

Out-of-Vocabularyforms

Spelling ErrorCorrectionTwo approaches

Edit distance

Spell. err. corr. witha FSA

Weighted editdistance

©EPFLJ.-C. Chappelier & M. Rajman

Edit Distance (example)

D(exmple;exemple) D(exmaple;example)

e x e m p l e

0 1 2 3 4 5 6 7

e 1 0 1 2 3 4 5 6

x 2 1 0 1 2 3 4 5

m 3 2 1 1 1 2 3 4

p 4 3 2 2 2 1 2 3

l 5 4 3 3 3 2 1 2

e 6 5 4 3 4 3 2 1

e x a m p l e

0 1 2 3 4 5 6 7

e 1 0 1 2 3 4 5 6

x 2 1 0 1 2 3 4 5

m 3 2 1 1 1 2 3 4

a 4 3 2 1 1 2 3 4

p 5 4 3 2 2 1 2 3

l 6 5 4 3 3 2 1 2

e 7 6 5 4 4 3 2 1

Introduction to INLP – 16 / 33

Page 18: Introduction to Natural Language Processingcoling.epfl.ch/slides/coursTAL-corrlex-en.pdf · •a given lexicon (regular language) and a set of transformations define the edit space

Out-of-Vocabularyforms

Spelling ErrorCorrectionTwo approaches

Edit distance

Spell. err. corr. witha FSA

Weighted editdistance

©EPFLJ.-C. Chappelier & M. Rajman

Spelling error correction using a FSA

Problem: approximative search of lexicalized (surface) forms= within a max. distance range

i.e. Fault-tolerant recognition (within a regular language):

Find all ending paths such that the corresponding string is within adistance range less than θ of the given input string.

Remark: a trie is a special case of FSA

Introduction to INLP – 17 / 33

Page 19: Introduction to Natural Language Processingcoling.epfl.ch/slides/coursTAL-corrlex-en.pdf · •a given lexicon (regular language) and a set of transformations define the edit space

Out-of-Vocabularyforms

Spelling ErrorCorrectionTwo approaches

Edit distance

Spell. err. corr. witha FSA

Weighted editdistance

©EPFLJ.-C. Chappelier & M. Rajman

Finite-State Automata (FSA)

Formally:I Q: (finite) set of statesI Σ: (finite) alphabetI δ : arcs (mapping from Q×Σ to Q)

q1

q2a

q1

q2

δ ( , a )=

I q0 ∈Q: inital stateI F ⊂Q: final states

Interface:I initialState(): provides q0

I (q,a)=nextAfter(p,c): returns next state and characterafter character ’c’ starting from state ’p’Formally: returns Argminα {(q,α) ∈Q×Σ such that α > c and δ (p,α) = q}

I isFinal(p): are we done with p? Checks whether p ∈ F or not.

p q

· ··

c

a

· · ·

Introduction to INLP – 18 / 33

Page 20: Introduction to Natural Language Processingcoling.epfl.ch/slides/coursTAL-corrlex-en.pdf · •a given lexicon (regular language) and a set of transformations define the edit space

Out-of-Vocabularyforms

Spelling ErrorCorrectionTwo approaches

Edit distance

Spell. err. corr. witha FSA

Weighted editdistance

©EPFLJ.-C. Chappelier & M. Rajman

Pruning criteria: cut-off edit distance

To make it useful in practice⇒ Fast⇒ good pruning+ cut-off edit distance: [Oflazer 1996]

Co(X n1 ,Y

m1 ) = min

I(m)≤i≤J(m)D(X i

1;Y m1 )

I(m) = min(n,max(1,m−θ)) J(m) = min(n,max(1,m + θ))

Important property:

Co(X ,Y ) > θ =⇒ ∀Z D(X ,Y + Z ) > θ

Introduction to INLP – 19 / 33

Page 21: Introduction to Natural Language Processingcoling.epfl.ch/slides/coursTAL-corrlex-en.pdf · •a given lexicon (regular language) and a set of transformations define the edit space

Out-of-Vocabularyforms

Spelling ErrorCorrectionTwo approaches

Edit distance

Spell. err. corr. witha FSA

Weighted editdistance

©EPFLJ.-C. Chappelier & M. Rajman

Cut-off Edit Distance: example

=2θ

seX

Y

n=7

m=4

x m p l e

I=4-2=2 J=4+2=6

ex

exm

exmp

exmpl

exmple

e x ma

Ye x a m

0 1 2 3 4e 1 0 1 2 3x 2 1 0 1 2m 3 2 1 1 1p 4 3 2 2 2l 5 4 3 3 3e 6 5 4 4 4

X s 7 6 5 5 5

Co(X ,Y ) = min{2,1,2,3,4}= 1

Introduction to INLP – 20 / 33

Page 22: Introduction to Natural Language Processingcoling.epfl.ch/slides/coursTAL-corrlex-en.pdf · •a given lexicon (regular language) and a set of transformations define the edit space

Out-of-Vocabularyforms

Spelling ErrorCorrectionTwo approaches

Edit distance

Spell. err. corr. witha FSA

Weighted editdistance

©EPFLJ.-C. Chappelier & M. Rajman

Walk through a FSA within a θ distance rangePrefix-compatible Depth-first version

Input: a string to be corrected (X ), a lexicon in the form of a FSA and a maximal errorthreshold (θ )

Push(ε,ε,q0)while Stack is not empty do

Pop(Z ,c,p)(q,a) = nextAfter(p,c)if (q,a) 6= /0 then

Push(Z ,a,p)Y ← Z + aif Co(X ,Y )≤ θ then

Push(Y ,ε,q)if isFinal(q) and D(X ,Y )≤ θ then

Add Y to solutions

q0 p qZca

Introduction to INLP – 21 / 33

Page 23: Introduction to Natural Language Processingcoling.epfl.ch/slides/coursTAL-corrlex-en.pdf · •a given lexicon (regular language) and a set of transformations define the edit space

Out-of-Vocabularyforms

Spelling ErrorCorrectionTwo approaches

Edit distance

Spell. err. corr. witha FSA

Weighted editdistance

©EPFLJ.-C. Chappelier & M. Rajman

Example: ababa vs. (aba|bab)* with θ = 11

[2] [2] [2] [2] [2] [2]

1

2

3

5

4

2 4

3

2 4

1

42

3

2

5

4

1

3

2

2 4

5

4

X=ababa (aba|bab)* for θ =1

[0]

[1]

[1]

[1]

[1]

[1]

[1]

[1]

[1]

[2]

[1]

[1]

[0]

[0]

[0]

[0]

[0]

.vs.

a

a

a

b

b

b

a

b

b

a

a

b

b

b

b

b

a

a

a

b

b

b

b a

a

a

a

a

1 1 1

a b a a b a0 1 2 3 4 5 6

a 1 0 1 2 3 4 5b 2 1 0 1 2 3 4a 3 2 1 0 1 2 3b 4 3 2 1 1 1 2a 5 4 3 2 1 1 1

Solutions: abaaba, ababab, bababaIntroduction to INLP – 22 / 33

Page 24: Introduction to Natural Language Processingcoling.epfl.ch/slides/coursTAL-corrlex-en.pdf · •a given lexicon (regular language) and a set of transformations define the edit space

Out-of-Vocabularyforms

Spelling ErrorCorrectionTwo approaches

Edit distance

Spell. err. corr. witha FSA

Weighted editdistance

©EPFLJ.-C. Chappelier & M. Rajman

Implementation issues

À Efficient computation of Co with the previously described chart :

+ recomputation of the last column (m) only

+ Computation of D and Co in the same loop

Á Y ← Z + a: beware (local copies, pointers etc...).Similarly, do not naively implement "Push(Y ,q)".

 In some (programming) languages: it could be worth transposing the algorithm:Y (which is changing) for rows and X for columns

Introduction to INLP – 23 / 33

Page 25: Introduction to Natural Language Processingcoling.epfl.ch/slides/coursTAL-corrlex-en.pdf · •a given lexicon (regular language) and a set of transformations define the edit space

Out-of-Vocabularyforms

Spelling ErrorCorrectionTwo approaches

Edit distance

Spell. err. corr. witha FSA

Weighted editdistance

©EPFLJ.-C. Chappelier & M. Rajman

Contents

I Out of Vocabulary FormsI Spelling Error Correction

I Edit distanceI Spelling error correction with FSAä Weighted edit distance

Introduction to INLP – 24 / 33

Page 26: Introduction to Natural Language Processingcoling.epfl.ch/slides/coursTAL-corrlex-en.pdf · •a given lexicon (regular language) and a set of transformations define the edit space

Out-of-Vocabularyforms

Spelling ErrorCorrectionTwo approaches

Edit distance

Spell. err. corr. witha FSA

Weighted editdistance

©EPFLJ.-C. Chappelier & M. Rajman

Limitations?

å weightingExample: diacritics, uppercase

eleves→ élèves aloves→ élèves

å specific transformationsExample: typing errors

tupe→ type more generally: deuit→ fruitusqge→ usage

å whitespacestheothers→ the others othe rs→ others

+ 3 aspects of the same problem

Solution: generalization of the edit distance: weighted edit distance

Introduction to INLP – 25 / 33

Page 27: Introduction to Natural Language Processingcoling.epfl.ch/slides/coursTAL-corrlex-en.pdf · •a given lexicon (regular language) and a set of transformations define the edit space

Out-of-Vocabularyforms

Spelling ErrorCorrectionTwo approaches

Edit distance

Spell. err. corr. witha FSA

Weighted editdistance

©EPFLJ.-C. Chappelier & M. Rajman

Weighted Edit Distance

weighted transformations such that :à W (Id) = 0à W (f ) > 0 f 6= Id

à W (f−1) = W (f )

à W (f ◦g) = W (f ) + W (g)

D(X ;Y ) = minf :Y=f (X )

W (f )

+ It is actually a distance on Σ∗

Difference with the preceding distance: W (f ) is not necessarily the same (= 1).

Introduction to INLP – 26 / 33

Page 28: Introduction to Natural Language Processingcoling.epfl.ch/slides/coursTAL-corrlex-en.pdf · •a given lexicon (regular language) and a set of transformations define the edit space

Out-of-Vocabularyforms

Spelling ErrorCorrectionTwo approaches

Edit distance

Spell. err. corr. witha FSA

Weighted editdistance

©EPFLJ.-C. Chappelier & M. Rajman

Remarks

Ê Distance on Σ∗ ⇒ ∀X Y , ∃f : Y = f (X )

True if Ins and Del are in the transformation set

Ë non overlapping transformationsi.e. cannot apply a transformation to the result of the previous transformation

Counter-Example: ba Transp→ abSub→ ac

Introduction to INLP – 27 / 33

Page 29: Introduction to Natural Language Processingcoling.epfl.ch/slides/coursTAL-corrlex-en.pdf · •a given lexicon (regular language) and a set of transformations define the edit space

Out-of-Vocabularyforms

Spelling ErrorCorrectionTwo approaches

Edit distance

Spell. err. corr. witha FSA

Weighted editdistance

©EPFLJ.-C. Chappelier & M. Rajman

Coherence Constraints

"Semantic Integrity":I W (Del) + W (Ins(x)) > W (Sub(x))

I W (Split) < W (Ins(x)) ( which implies W (Merge) < W (Del))

I W (Transp) < W (Ins(x)) + W (Del)

+ Introduction a new f such that f = ◦i fi , is useful if and only ifW (f ) < ∑

iW (fi )

Introduction to INLP – 28 / 33

Page 30: Introduction to Natural Language Processingcoling.epfl.ch/slides/coursTAL-corrlex-en.pdf · •a given lexicon (regular language) and a set of transformations define the edit space

Out-of-Vocabularyforms

Spelling ErrorCorrectionTwo approaches

Edit distance

Spell. err. corr. witha FSA

Weighted editdistance

©EPFLJ.-C. Chappelier & M. Rajman

Weighted Edit Distance: computation

xxx

yyy

xxxx

yyy

xxxx

yyyyy

X :

Y :

input range(f)

min2min1

min2 = min [ min1(f) + W(f) ]f

(min1 and min2 are the values stored in the chart)

Introduction to INLP – 29 / 33

Page 31: Introduction to Natural Language Processingcoling.epfl.ch/slides/coursTAL-corrlex-en.pdf · •a given lexicon (regular language) and a set of transformations define the edit space

Out-of-Vocabularyforms

Spelling ErrorCorrectionTwo approaches

Edit distance

Spell. err. corr. witha FSA

Weighted editdistance

©EPFLJ.-C. Chappelier & M. Rajman

Weighted Edit Distance: computation (2)

D(X 01 ;Y j

1) = j initializationD(X i

1;Y 01 ) = i

D(X i1;Y j

1) = D(X i−11 ;Y j−1

1 ) if Xi = Yj (equality)= W (f ) + min{min1(f )} for all applicable trans-

formations f of thesame weight

= ... for all possibleweights.

increasing W (f )

+ The optimization lies in the grouping of similar cases: same weight and compatibletransformations(Example: previously Transp and Sub were incompatible becauseW (Transp) < 2W (Sub). But each of them is compatible with Del and Ins.)

Note: {min1(f )} is the set of all the minimal values for all possible f at this point; theyshall, of course, already be computed at this point (loop condition)

Introduction to INLP – 30 / 33

Page 32: Introduction to Natural Language Processingcoling.epfl.ch/slides/coursTAL-corrlex-en.pdf · •a given lexicon (regular language) and a set of transformations define the edit space

Out-of-Vocabularyforms

Spelling ErrorCorrectionTwo approaches

Edit distance

Spell. err. corr. witha FSA

Weighted editdistance

©EPFLJ.-C. Chappelier & M. Rajman

Example

D(example;exemple) D(exémple;exemple)e x e m p l e

0 1 2 3 4 5 6 7e 1 0 1 2 3 4 5 6x 2 1 0 1 2 3 4 5a 3 2 1 1 2 3 4 5m 4 3 2 2 1 2 3 4p 5 4 3 3 2 1 2 3l 6 5 4 4 3 2 1 2e 7 6 5 4 4 3 2 1

e x e m p l e0 1 2 3 4 5 6 7

e 1 0 1 2 3 4 5 6x 2 1 0 1 2 3 4 5é 3 2 1 0.1 1.1 2.1 3.1 4.1m 4 3 2 1.1 0.1 1.1 2.1 3.1p 5 4 3 2.1 1.1 0.1 1.1 2.1l 6 5 4 3.1 2.1 1.1 0.1 1.1e 7 6 5 4 3.1 2.1 1.1 0.1

W (é↔e)=0.1

Introduction to INLP – 31 / 33

Page 33: Introduction to Natural Language Processingcoling.epfl.ch/slides/coursTAL-corrlex-en.pdf · •a given lexicon (regular language) and a set of transformations define the edit space

Out-of-Vocabularyforms

Spelling ErrorCorrectionTwo approaches

Edit distance

Spell. err. corr. witha FSA

Weighted editdistance

©EPFLJ.-C. Chappelier & M. Rajman

Keypoints

ß One has to handle out of vocabulary forms

ß Edit (Levenshtein) distance, weighted edit distance

ß Spelling error correction with FSA

Introduction to INLP – 32 / 33

Page 34: Introduction to Natural Language Processingcoling.epfl.ch/slides/coursTAL-corrlex-en.pdf · •a given lexicon (regular language) and a set of transformations define the edit space

Out-of-Vocabularyforms

Spelling ErrorCorrectionTwo approaches

Edit distance

Spell. err. corr. witha FSA

Weighted editdistance

©EPFLJ.-C. Chappelier & M. Rajman

References

K. Oflazer, Error-tolerant Finite State Recognition with Applications to MorphologicalAnalysis and Spelling Correction, Computational Linguistics, Volume 22, Number 1,1996.

Section 8.2 in M. Rajman editor, "Speech and Language Engineering", EPFL Press,2006.

Sections 3.10 and 3.11 in D. Jurafsky and J. H. Martin, "Speech and LanguageProcessing", Prentice Hall, 2008 (2nd edition).

Section 3.3 in C. D. Manning, P. Raghavan and H. Schütze, "Introduction to InformationRetrieval", Cambridge University Press. 2008

Introduction to INLP – 33 / 33