A large list of confusion sets for spellchecking assessed against a corpus of real-word errors Jenny Pedler, Roger Mitton LREC 2010


Page 1: LREC 2010

A large list of confusion sets for spellchecking assessed against a corpus of real-word errors

Jenny Pedler, Roger Mitton

LREC 2010

Page 2: LREC 2010

Some real-word errors

The sand-eel is the principle food for many birds and animals.

Our teacher tort us to spell.

Henley Regatta comes near the top of the English social calender.

Page 3: LREC 2010

Spellchecker-induced real-word errors

The Wine Bar Company is opening a chain of brassieres.

The nightwatchman threw the switch and eliminated the backyard.

Page 4: LREC 2010

Cupertino, California

Page 7: LREC 2010

... to encourage cooperation and ...

Cupertino

co-operation

....

Page 8: LREC 2010

The original Cupertinos

"reinforcing bilateral and multilateral Cupertino"

"South Asian Association for regional Cupertino"

Page 9: LREC 2010

Confusion sets

{cite, sight, site}

{form, from}

{passed, past}

{peace, piece}

{principal, principle}

{quiet, quite, quit}

{their, there, they're}

{weather, whether}

{you're, your}

Page 11: LREC 2010

He had quiet a young girl staying with him of 17 named Ethel Monticue.

quite?

quit?
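The lookup step behind this example can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses only the nine confusion sets shown on the earlier slide, and the function names `alternatives` and `flag_confusables` are hypothetical. It only proposes candidates; deciding between them still needs the context rules discussed later.

```python
# Illustrative subset: the confusion sets shown on the "Confusion sets" slide.
CONFUSION_SETS = [
    {"cite", "sight", "site"},
    {"form", "from"},
    {"passed", "past"},
    {"peace", "piece"},
    {"principal", "principle"},
    {"quiet", "quite", "quit"},
    {"their", "there", "they're"},
    {"weather", "whether"},
    {"you're", "your"},
]


def alternatives(word):
    """Return the other members of every confusion set that contains `word`."""
    word = word.lower()
    found = set()
    for members in CONFUSION_SETS:
        if word in members:
            found |= members - {word}
    return sorted(found)


def flag_confusables(sentence):
    """List (position, word, alternatives) for each word that belongs to a set."""
    flagged = []
    for position, token in enumerate(sentence.split()):
        word = token.strip(".,;:!?\"")  # drop surrounding punctuation
        options = alternatives(word)
        if options:
            flagged.append((position, word, options))
    return flagged


hits = flag_confusables("He had quiet a young girl staying with him")
print(hits)
```

For the sentence on the slide, the only flagged word is "quiet", with "quit" and "quite" offered as the alternatives to be weighed.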

Page 12: LREC 2010

The confusion-set approach has been demonstrated to work with

(a) a short list of confusion sets,

(b) artificial test data.

Page 13: LREC 2010

To assess its potential for real, unrestricted text, we need:

(1) a realistically-sized list of confusion sets,

(2) a corpus of running text containing genuine real-word errors.

Page 14: LREC 2010

A list of confusion sets

• Tuned string-to-string edit-distance

• ~ 6000 sets

• Headword (confusables)
– wright (right, write)
– right (rite, write)
– write (right, rite, writ)

• Inflected forms

• Proper nouns

• Usage errors – e.g. <fewer, less>
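A rough sketch of the two ideas on this slide: a string-to-string edit distance and a headword-to-confusables mapping. The distance below is plain unit-cost Levenshtein, swapped in for illustration; the slide says the authors' measure was tuned, and that tuning is not described here. The three dictionary entries are copied from the slide; the helper name `lists_back` is hypothetical.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance with unit costs, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


# Entries as shown on the slide: headword -> confusables.
CONFUSABLES = {
    "wright": ["right", "write"],
    "right": ["rite", "write"],
    "write": ["right", "rite", "writ"],
}


def lists_back(headword, confusable):
    """True if `confusable` is itself a headword that lists `headword`."""
    return headword in CONFUSABLES.get(confusable, [])


print(edit_distance("wright", "right"))  # one deletion
print(lists_back("right", "write"))      # write lists right
print(lists_back("wright", "right"))     # right does not list wright
```

The entries on the slide are not symmetric: "wright" lists "right" as a confusable, but "right" does not list "wright". Storing one entry per headword, rather than unordered sets, keeps that distinction.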

Page 15: LREC 2010

A corpus of real-word errors

Sentences 675

Words 12024

Total errors (tokens) 833

Distinct errors (types) 428

Distinct error/target pairs 495

quit|quiet

quit|quite

Page 16: LREC 2010

The collation of the information was <ERR targ = really> relay </ERR> <ERR targ = quite> quit </ERR> easy to do.

Corpus mark-up example
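A minimal sketch of how mark-up in this style could be read back, assuming every error is wrapped exactly as in the example sentence (`<ERR targ = TARGET> ERROR </ERR>`). The regular expression and the function name `extract_errors` are assumptions made for illustration; the downloadable corpus may differ in detail.

```python
import re

# Matches <ERR targ = TARGET> ERROR </ERR>, tolerating the spacing on the slide.
ERR_PATTERN = re.compile(r"<ERR\s+targ\s*=\s*([^>]+?)\s*>\s*(.*?)\s*</ERR>")


def extract_errors(text):
    """Return (error, target) pairs in order of appearance."""
    return [(m.group(2), m.group(1)) for m in ERR_PATTERN.finditer(text)]


sample = ("The collation of the information was <ERR targ = really> relay </ERR> "
          "<ERR targ = quite> quit </ERR> easy to do.")

pairs = extract_errors(sample)
error_tokens = len(pairs)                          # every marked error
error_types = len({error for error, _ in pairs})   # distinct error words
distinct_pairs = len(set(pairs))                   # distinct error/target pairs

# Replacing each marked error by its target gives the intended sentence.
corrected = ERR_PATTERN.sub(lambda m: m.group(1), sample)
print(pairs)
print(corrected)
```

Counting tokens, types and error/target pairs separately mirrors the three error totals in the corpus summary: one error word with two different targets counts once as a type but twice as a pair.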

Page 17: LREC 2010

Corpus profile: Frequent errors

Error|target pair    Frequency
there|their          35
form|from            20
to|too               19
their|there          19
a|an                 18
its|it's             17
your|you're          15
weather|whether      12
cant|can't           10
collage|college      9

Page 18: LREC 2010

Corpus profile: Homophone errors

Homophone set N. Occs

there, their, they're 38

to, too, two 23

its, it's 17

your, you're 15

weather, whether 12

herd, heard 5

witch, which 4

hear, here 3

wile, while 3

14% of distinct error/target pairs

Page 19: LREC 2010

Corpus profile: Simple errors

Error type                        N. errors   % errors
Omission (e.g. ether, either)     142         29%
Substitution (e.g. vary, very)    104         21%
Insertion (e.g. bellow, below)    56          11%
Transposition (e.g. dose, does)   12          2%
All simple                        314         63%
All error pairs                   495         100%
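The four simple error types can be told apart mechanically. The sketch below is one possible way to do it, assuming each pair is given as (error, target) in the same order as the examples in the table; the function name `classify_simple_error` is hypothetical. It also rechecks the "All simple" row from the counts in the table.

```python
def classify_simple_error(error, target):
    """Name the single-edit relation between error and target, or None."""
    if len(error) + 1 == len(target):
        # The error is the target with one letter left out.
        if any(target[:i] + target[i + 1:] == error for i in range(len(target))):
            return "omission"
    elif len(error) == len(target) + 1:
        # The error is the target with one extra letter.
        if any(error[:i] + error[i + 1:] == target for i in range(len(error))):
            return "insertion"
    elif len(error) == len(target):
        diffs = [i for i, (e, t) in enumerate(zip(error, target)) if e != t]
        if len(diffs) == 1:
            return "substitution"
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and error[diffs[0]] == target[diffs[1]]
                and error[diffs[1]] == target[diffs[0]]):
            return "transposition"
    return None  # not a simple (single-edit) error


examples = [("ether", "either"), ("vary", "very"),
            ("bellow", "below"), ("dose", "does"), ("martial", "material")]
labels = [classify_simple_error(error, target) for error, target in examples]
print(labels)

# Counts from the table: the four simple types sum to the "All simple" row.
counts = {"omission": 142, "substitution": 104, "insertion": 56, "transposition": 12}
all_simple = sum(counts.values())
share_of_pairs = round(100 * all_simple / 495)
print(all_simple, share_of_pairs)
```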

Page 20: LREC 2010

How would our list cope with our corpus?

                                                                          Types   Tokens
Detectable and correctable, e.g. shod (should)                            44%     58%
Detectable but not correctable, e.g. martial (material)                   16%     12%
Not detectable (inflection error), e.g. friend (friends), take (taken)    23%     17%
Not detectable (other), e.g. pads (passed)                                17%     13%
Total (100%)                                                              495     833

Page 21: LREC 2010

Non-detectable/non-correctable

Error not a headword (“non-detectable”)

Target not a candidate (“non-correctable”)

Pair              Frequency    Pair              Frequency
a, an             17           an, a             4
the, they         4            cause, because    3
is, his           2            as, has           2
is, it            2            easy, easily      2
i, it             2            for, from         2
u, your           2            in, is            2

mouths, months    2
none, non         2
no, know          2
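The two definitions above translate directly into a check against the list. The sketch below assumes the list is held as a headword-to-confusables mapping. The entries in `TOY_LIST` are invented for illustration and are not taken from the authors' list; they are chosen only so that the outcomes match the examples quoted on the previous slide.

```python
def categorise(error, target, confusables):
    """Apply the slide's definitions to one error/target pair."""
    if error not in confusables:
        # Error is not a headword, so the word would never be examined.
        return "not detectable"
    if target not in confusables[error]:
        # Error is a headword, but the intended word is not among its candidates.
        return "detectable but not correctable"
    return "detectable and correctable"


# Hypothetical entries, NOT the real list: chosen to reproduce the slide's examples.
TOY_LIST = {
    "shod": ["should", "shed"],
    "martial": ["marital", "marshal"],
}

results = {
    pair: categorise(pair[0], pair[1], TOY_LIST)
    for pair in [("shod", "should"), ("martial", "material"), ("pads", "passed")]
}
for pair, outcome in results.items():
    print(pair, outcome)
```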

Page 22: LREC 2010

Using the list for spellchecking

• Rules based on surrounding context

• May be unreliable
– 25% errors have another error within 2 words
– 9% are another real-word error

• Syntax-based methods
– Easiest to implement
– Shown to have good performance

Page 23: LREC 2010

Syntax-based rules: potential

Tagsets                                                           Types   Tokens
Distinct, e.g. bellow (NN1, VVB, VVI), below (AV0, PRP)           58%     68%
? Overlapping, e.g. pray (VVB, VVI, AV0), prey (NN1, VVB, VVI)    31%     25%
Matching, e.g. confirm (VVI, VVB), conform (VVI, VVB)             11%     7%
Total errors (=100%)                                              299     580
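The three rows of this table correspond to a simple set comparison between the part-of-speech tags of the error and those of the target. A minimal sketch, using the tagsets exactly as given in the examples above; the function name `tagset_relation` is hypothetical, and how the authors derived each word's tagset is not stated on the slide.

```python
def tagset_relation(error_tags, target_tags):
    """Compare the possible tags of an error word and its target."""
    error_tags, target_tags = set(error_tags), set(target_tags)
    if error_tags == target_tags:
        return "matching"      # syntax alone cannot separate the two words
    if error_tags & target_tags:
        return "overlapping"   # separable only when a non-shared tag is required
    return "distinct"          # no tag in common


# Tagsets as given in the table's examples.
relations = {
    ("bellow", "below"): tagset_relation(["NN1", "VVB", "VVI"], ["AV0", "PRP"]),
    ("pray", "prey"): tagset_relation(["VVB", "VVI", "AV0"], ["NN1", "VVB", "VVI"]),
    ("confirm", "conform"): tagset_relation(["VVI", "VVB"], ["VVI", "VVB"]),
}
for pair, relation in relations.items():
    print(pair, relation)
```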

Page 24: LREC 2010

Resources available for download

www.dcs.bbk.ac.uk/~jenny/resources.html