A large list of confusion sets for spellchecking assessed against a corpus of real-word errors
Jenny Pedler, Roger Mitton
LREC 2010
Some real-word errors
The sand-eel is the principle food for many birds and animals.
Our teacher tort us to spell.
Henley Regatta comes near the top of the English social calender.
Spellchecker-induced real-word errors
The Wine Bar Company is opening a chain of brassieres.
The nightwatchman threw the switch and eliminated the backyard.
Cupertino, California
... to encourage cooperation and ...
Cupertino
co-operation
....
The original Cupertinos
"reinforcing bilateral and multilateral Cupertino"
"South Asian Association for regional Cupertino"
Confusion sets
{cite, sight, site}
{form, from}
{passed, past}
{peace, piece}
{principal, principle}
{quiet, quite, quit}
{their, there, they're}
{weather, whether}
{you're, your}
He had quiet a young girl staying with him of 17 named Ethel Monticue.
quiet: quite? quit?
The confusion-set approach has been demonstrated to work with
(a) a short list of confusion sets,
(b) artificial test data.
To assess its potential for real, unrestricted text, we need:
(1) a realistically-sized list of confusion sets,
(2) a corpus of running text containing genuine real-word errors.
A list of confusion sets
• Tuned string-to-string edit-distance
• ~ 6000 sets
• Headword (confusables)
– wright (right, write)
– right (rite, write)
– write (right, rite, writ)
• Inflected forms
• Proper nouns
• Usage errors – e.g. <fewer, less>
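A minimal sketch of how such a list might be stored and queried, assuming a plain Python dictionary keyed by headword. The three entries are the ones shown on the slide; the names CONFUSABLES and flag_candidates are illustrative and are not taken from the authors' resources.

```python
# Hypothetical in-memory form of the list: headword -> its confusables.
# Only the three entries shown on the slide are included here.
CONFUSABLES = {
    "wright": ["right", "write"],
    "right": ["rite", "write"],
    "write": ["right", "rite", "writ"],
}


def flag_candidates(sentence):
    """Return (position, word, confusables) for every word that is a headword."""
    flagged = []
    for position, word in enumerate(sentence.lower().split()):
        if word in CONFUSABLES:
            flagged.append((position, word, CONFUSABLES[word]))
    return flagged


print(flag_candidates("Please wright to me"))
# [(1, 'wright', ['right', 'write'])]
```

In these three entries the relation is not symmetric: wright lists right, but right does not list wright.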
A corpus of real-word errors
Sentences 675
Words 12024
Total errors (tokens) 833
Distinct errors (types) 428
Distinct error/target pairs 495
e.g. quit|quiet, quit|quite
Corpus mark-up example
The collation of the information was <ERR targ = really> relay </ERR> <ERR targ = quite> quit </ERR> easy to do.
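A sketch of how error/target pairs could be pulled out of this mark-up, assuming every error is tagged exactly as in the example sentence. The regular expression and the function name error_target_pairs are assumptions made for illustration, not part of the downloadable resources.

```python
import re
from collections import Counter

# Pattern for the mark-up shown on the slide: <ERR targ = TARGET> ERROR </ERR>
ERR_TAG = re.compile(r"<ERR\s+targ\s*=\s*([^>]+?)\s*>\s*(.*?)\s*</ERR>")


def error_target_pairs(text):
    """Return a list of (error, target) pairs found in marked-up text."""
    return [(error, target) for target, error in ERR_TAG.findall(text)]


sentence = (
    "The collation of the information was <ERR targ = really> relay </ERR> "
    "<ERR targ = quite> quit </ERR> easy to do."
)
pairs = error_target_pairs(sentence)
pair_counts = Counter(pairs)

print(pairs)  # [('relay', 'really'), ('quit', 'quite')]
print(len(pairs))                          # total errors (tokens)
print(len({error for error, _ in pairs}))  # distinct errors (types)
print(len(pair_counts))                    # distinct error/target pairs
```

Run over the whole corpus, the same three counts would correspond to the token, type and pair figures in the corpus summary.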
Corpus profile: Frequent errors
Error|target pair    Frequency
there|their          35
form|from            20
to|too               19
their|there          19
a|an                 18
its|it's             17
your|you're          15
weather|whether      12
cant|can't           10
collage|college      9
Corpus profile: Homophone errors
Homophone set N. Occs
there, their, they're 38
to, too, two 23
its, it's 17
your, you're 15
weather, whether 12
herd, heard 5
witch, which 4
hear, here 3
wile, while 3
14% of distinct error/target pairs
Corpus profile: Simple errors
Error Type                        N. Errors   % Errors
Omission (e.g. ether, either)     142         29%
Substitution (e.g. vary, very)    104         21%
Insertion (e.g. bellow, below)    56          11%
Transposition (e.g. dose, does)   12          2%
All simple                        314         63%
All error pairs                   495         100%
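The four simple error types can be sketched as a single-edit classifier. This is an illustration under the assumption that "simple" means the error differs from the target by exactly one letter operation; the function name simple_error_type is hypothetical.

```python
def simple_error_type(error, target):
    """Classify an error/target pair as one of four single-edit types, or None."""
    if len(error) == len(target):
        diffs = [i for i, (e, t) in enumerate(zip(error, target)) if e != t]
        if len(diffs) == 1:
            return "substitution"
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and error[diffs[0]] == target[diffs[1]]
                and error[diffs[1]] == target[diffs[0]]):
            return "transposition"
        return None
    # Lengths differ: the longer word with one letter deleted
    # must equal the shorter word.
    shorter, longer = sorted((error, target), key=len)
    if len(longer) - len(shorter) != 1:
        return None
    for i in range(len(longer)):
        if longer[:i] + longer[i + 1:] == shorter:
            # A letter missing from the error is an omission;
            # an extra letter in the error is an insertion.
            return "omission" if len(error) < len(target) else "insertion"
    return None


for error, target in [("ether", "either"), ("vary", "very"),
                      ("bellow", "below"), ("dose", "does")]:
    print(error, target, simple_error_type(error, target))
```

Applied to the slide's own examples, this returns omission, substitution, insertion and transposition respectively.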
How would our list cope with our corpus?
                                                                           Types   Tokens
Detectable and correctable, e.g. shod (should)                             44%     58%
Detectable but not correctable, e.g. martial (material)                    16%     12%
Not detectable (inflection error), e.g. friend (friends), take (taken)     23%     17%
Not detectable (other), e.g. pads (passed)                                 17%     13%
Total (100%)                                                               495     833
Non-detectable/non-correctable
Error not a headword (“non-detectable”)    Target not a candidate (“non-correctable”)
Pair          Frequency                    Pair               Frequency
a, an         17                           an, a              4
the, they     4                            cause, because     3
is, his       2                            as, has            2
is, it        2                            easy, easily       2
i, it         2                            for, from          2
u, your       2                            in, is             2
mouths, months 2
none, non 2
no, know 2
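The distinction between "non-detectable" and "non-correctable" can be sketched as two membership tests against the list. The toy list below is an assumption for illustration only: shod and martial are made headwords to mirror the slide's examples, and the confusables given for martial are invented, not copied from the actual list.

```python
# Toy list for illustration; entries are hypothetical, not the authors' data.
TOY_LIST = {
    "shod": ["should"],
    "martial": ["marital"],
}


def outcome(error, target, confusables):
    """Best case for an error/target pair, given a headword -> confusables list."""
    if error not in confusables:
        # The error is not a headword, so it is never examined.
        return "not detectable"
    if target not in confusables[error]:
        # The error is examined, but the intended word is never proposed.
        return "detectable but not correctable"
    return "detectable and correctable"


print(outcome("shod", "should", TOY_LIST))       # detectable and correctable
print(outcome("martial", "material", TOY_LIST))  # detectable but not correctable
print(outcome("pads", "passed", TOY_LIST))       # not detectable
```

This only describes what the list makes possible; whether a checker actually flags and corrects a detectable error depends on the decision method applied afterwards.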
Using the list for spellchecking
• Rules based on surrounding context
• May be unreliable
– 25% errors have another error within 2 words
– 9% are another real-word error
• Syntax-based methods
– Easiest to implement
– Shown to have good performance
Syntax-based rules: potential
Tagsets                                                            Types   Tokens
Distinct, e.g. bellow (NN1, VVB, VVI), below (AV0, PRP)            58%     68%
? Overlapping, e.g. pray (VVB, VVI, AV0), prey (NN1, VVB, VVI)     31%     25%
Matching, e.g. confirm (VVI, VVB), conform (VVI, VVB)              11%     7%
Total errors (=100%)                                               299     580
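The three tagset categories amount to a set comparison, sketched here with the part-of-speech tags shown on the slide. The function name tagset_relation is illustrative, and the tagsets are only those listed in the examples above.

```python
def tagset_relation(error_tags, target_tags):
    """Compare the part-of-speech tagsets of an error and its target."""
    error_tags, target_tags = set(error_tags), set(target_tags)
    if error_tags == target_tags:
        return "matching"      # syntax alone cannot separate the two words
    if error_tags & target_tags:
        return "overlapping"   # syntax helps only for the non-shared tags
    return "distinct"          # a tag mismatch always separates the two words


print(tagset_relation(["NN1", "VVB", "VVI"], ["AV0", "PRP"]))         # distinct
print(tagset_relation(["VVB", "VVI", "AV0"], ["NN1", "VVB", "VVI"]))  # overlapping
print(tagset_relation(["VVI", "VVB"], ["VVI", "VVB"]))                # matching
```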
Resources available for download
www.dcs.bbk.ac.uk/~jenny/resources.html