Web Information Retrieval (Web IR)
Handout #3: Dictionaries and Tolerant Retrieval
Mohammad Sadegh Taherzadeh
ECE Department, Yazd University
Autumn 2011
Architecture of Search Engines
[Figure: search engine architecture diagram. Labeled components: Web, Crawler(s), Page Repository, Indexer Module, Collection Analysis Module, Indexes (Text, Structure, Utility), Query Engine, Ranking, Client, Queries.]
Introduction
• 10% to 12% of search engine queries are misspelled.
• Spelling correction affects information retrieval.
• A good spelling corrector should only act when it is clear that the user made an error.
Spelling Errors
• Typographic errors
  – These errors occur when the correct spelling of the word is known but the word is mistyped.
  – Example: taht --> that
  – Word boundaries (example: home page --> homepage)
• Cognitive errors
  – These errors occur when the correct spelling of the word is not known.
  – Example: seprate --> separate
Spelling Error Correction
• The problem of spelling error correction entails three sub-problems:
  – Detection of an error
  – Generation of candidate corrections
  – Ranking of candidate corrections
Spelling Error Correction (cont.)
• An example (a Persian query meaning "social security records inquiry"):
  – Misspelled input query: استعلام سوابق تعمین اجتماعی
  – Error detection: the word تعمین is flagged as misspelled
  – Candidate generation: { تامین، تعمیر، تضمین، تعمیم، تخمین، تعیین }
  – Candidate ranking: { تعمیر، تضمین، تعیین، تعمیم، تامین، تخمین }
  – Correction: استعلام سوابق تامین اجتماعی
Implementing Spelling Correction
• There are two basic principles underlying most spelling correction algorithms:
  – 1. Of the various alternative correct spellings for a misspelled query, choose the "nearest" one.
    • This demands that we have a notion of nearness or proximity between a pair of queries.
  – 2. When two correctly spelled queries are tied (or nearly tied), select the one that is more common.
    • The simplest notion of "more common" is the number of occurrences of the term in the collection.
    • A different notion of "more common" is employed in many search engines, especially on the web: use the correction that is most common among queries typed in by other users.
Error Detection
• N-gram based techniques
  – Spellcheckers without dictionaries
  – Non-positional vs. positional
  – The technique begins by going through the dictionary and tabulating all the trigrams (three-letter sequences).
    • For instance, "abs" will occur quite often ("absent", "crabs").
    • Whereas "pkx" won't occur at all, so the technique would detect "pkxie", which might have been mistyped for "pixie".
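The trigram tabulation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the word list `DICTIONARY` is a hypothetical toy sample, the function names are made up for this handout, and the sketch is non-positional (it ignores where in the word a trigram occurs).

```python
def trigrams(word):
    """Return the set of three-letter sequences in a word."""
    return {word[i:i + 3] for i in range(len(word) - 2)}


def build_trigram_table(dictionary):
    """Tabulate every trigram that occurs in at least one dictionary word."""
    table = set()
    for word in dictionary:
        table |= trigrams(word.lower())
    return table


def unseen_trigrams(word, table):
    """Trigrams of `word` that never occur in the dictionary (evidence of an error)."""
    return sorted(trigrams(word.lower()) - table)


# Hypothetical toy dictionary; a real one would hold many thousands of words.
DICTIONARY = ["absent", "crabs", "pixie", "pixel", "dixie"]
TABLE = build_trigram_table(DICTIONARY)

# A word is flagged when at least one of its trigrams was never tabulated.
suspicious = unseen_trigrams("pkxie", TABLE)
```

With this toy table, "pixie" yields no unseen trigrams, while "pkxie" is flagged because trigrams such as "pkx" never occur in the dictionary.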
• Dictionary based techniques
  – Given a word, look it up in the dictionary for validation.
  – Dictionary construction issues
  – Effective search
    • Hash table lookup
    • Trie (a prefix tree; the name comes from text "retrieval")
• For example:
  – استعلام سوابق تعمین اجتماعی ("social security records inquiry", with تامین misspelled as تعمین) ✓
  – معنی واژه تعمین ("meaning of the word تعمین") ╳
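As a rough illustration of dictionary lookup with a trie, the sketch below stores words character by character and reports query tokens that are not in the dictionary. The class and function names and the tiny word list are assumptions made for this example, not part of the handout.

```python
class TrieNode:
    def __init__(self):
        self.children = {}      # maps a character to the next node
        self.is_word = False    # True if a dictionary word ends at this node


class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def contains(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word


def non_words(query, trie):
    """Return the query tokens that are not found in the dictionary."""
    return [token for token in query.split() if not trie.contains(token)]


# Hypothetical toy dictionary built from the handout's English examples.
DICTIONARY = Trie(["that", "home", "page", "homepage", "separate"])
flagged = non_words("taht home page", DICTIONARY)
```

A Python `set` would serve the same purpose as the hash table option; the trie is shown because it also supports prefix-based operations.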
• Types of errors:
  – Non-word errors
  – Real-word errors
• Most errors in web queries are real-word errors.
• Context-based error detection is used for real-word errors.
Generate Candidates
• Candidate generation techniques:
  – Minimum edit distance techniques
  – Similarity key techniques
  – Rule-based techniques
  – N-gram-based techniques
  – Probabilistic techniques
  – Neural networks
Minimum edit distance techniques
• Edit distance
  – Given two character strings s1 and s2, the edit distance between them is the minimum number of edit operations required to transform s1 into s2.
  – Edit operations (Damerau-Levenshtein distance):
    • Insertion, e.g. typing acress for cress
    • Deletion, e.g. typing acress for actress
    • Substitution, e.g. typing acress for across
    • Transposition, e.g. typing acress for caress
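The four edit operations above can be turned into a small dynamic program. The sketch below computes the restricted form of the Damerau-Levenshtein distance (often called optimal string alignment), which counts a transposition of two adjacent characters as one operation; the function name `osa_distance` is chosen for this example only.

```python
def osa_distance(s1, s2):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(s1) + 1, len(s2) + 1
    # d[i][j] = distance between the first i chars of s1 and the first j chars of s2
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i              # delete all i characters of s1
    for j in range(cols):
        d[0][j] = j              # insert all j characters of s2
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # drop a character of s1
                d[i][j - 1] + 1,         # add a character of s2
                d[i - 1][j - 1] + cost,  # substitute (or match)
            )
            # Transposition of two adjacent characters.
            if (i > 1 and j > 1
                    and s1[i - 1] == s2[j - 2]
                    and s1[i - 2] == s2[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


# Each intended word is one edit away from the typed string "acress".
distances = {w: osa_distance("acress", w)
             for w in ["cress", "actress", "across", "caress"]}
```

Each of the four example words from the list is at distance 1 from "acress" under this measure.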
• The literature on spelling correction claims that 80% to 95% of spelling errors are at an edit distance of 1 from the target.
• Compute the edit distance between the erroneous word and all dictionary words.
• Select those dictionary words whose edit distance is within a pre-specified threshold.
Similarity key techniques
• Aim: assign a common code to similar words and strings.
• Coding schemes
  – Sound similarity (recieve ➡ receive)
    • Soundex algorithm
  – Shape similarity (انتحاب is corrected to انتخاب, "selection")
    • Shapex algorithm
• Soundex
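The details of this slide are not preserved in the transcript. As a hedged illustration, the sketch below implements a simplified Soundex: keep the first letter, map the remaining letters to digit classes, collapse adjacent repeats, drop the zeros, and pad to four characters. It is an assumption-laden simplification, not necessarily the variant shown in the lecture, and it omits the special handling of "h" and "w" found in the full American Soundex rules.

```python
# Letter classes of Soundex; vowels and h, w, y map to "0" and are dropped later.
SOUNDEX_CODES = {}
for letters, digit in [("aeiouhwy", "0"), ("bfpv", "1"), ("cgjkqsxz", "2"),
                       ("dt", "3"), ("l", "4"), ("mn", "5"), ("r", "6")]:
    for letter in letters:
        SOUNDEX_CODES[letter] = digit


def soundex(word):
    """Simplified Soundex code: first letter followed by three digits."""
    letters = [ch for ch in word.lower() if ch in SOUNDEX_CODES]
    if not letters:
        return ""
    digits = [SOUNDEX_CODES[ch] for ch in letters]
    # Collapse runs of identical adjacent digits into one.
    collapsed = [digits[0]]
    for d in digits[1:]:
        if d != collapsed[-1]:
            collapsed.append(d)
    # Replace the first digit by the original letter, then drop all zeros.
    tail = "".join(d for d in collapsed[1:] if d != "0")
    return (letters[0].upper() + tail + "000")[:4]


# A sound-alike misspelling receives the same key as the correct word.
same_key = soundex("recieve") == soundex("receive")
```

Words with the same code, such as "Robert" and "Rupert", become candidate corrections for each other.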
N-Gram Based Technique
• N-grams
  – An N-gram is a sequence of N adjacent letters in a word.
  – The more N-grams two strings share, the more similar they are.
• Similarity coefficient δ
  – δ = |common N-grams| / |total distinct N-grams in both strings|
  – Jaccard coefficient
• N-gram similarity example: fact vs. fract
  – Bigrams in fact: -f fa ac ct t- (5 bigrams)
  – Bigrams in fract: -f fr ra ac ct t- (6 bigrams)
  – Union: -f fa fr ra ac ct t- (7 bigrams)
  – Common: -f ac ct t- (4 bigrams)
  – δ = 4/7 ≈ 0.57
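The fact/fract calculation can be reproduced with a short Python sketch. The function names are illustrative, and the single "-" boundary marker on each side follows the convention used in the example above.

```python
def ngrams(word, n=2, pad="-"):
    """Set of character n-grams of a word, with one boundary marker on each side."""
    padded = pad + word + pad
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b, n=2):
    """Jaccard coefficient: |common n-grams| / |n-grams in the union|."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        return 0.0
    return len(grams_a & grams_b) / len(union)


delta = jaccard("fact", "fract")   # 4 common bigrams out of 7 in the union
```

Because the n-grams are held in sets, a bigram that occurs twice in one word is counted once, which matches the set-based definition of the Jaccard coefficient.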
• Candidate generation
  – N-gram inverted index
  – For example, the misspelling "bord" is split into the bigrams bo, or, rd.
  – Looking these bigrams up in the index, we would enumerate "aboard", "boardroom" and "border".
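A bigram inverted index for candidate generation might look like the following sketch. The vocabulary is a hypothetical toy list, and the rule of keeping terms that share at least two bigrams with the misspelling (`min_shared=2`) is an assumption of this example rather than something the slide specifies.

```python
from collections import defaultdict


def bigrams(word):
    """Set of two-letter sequences in a word (no boundary markers)."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def build_bigram_index(vocabulary):
    """Map each bigram to the set of vocabulary terms that contain it."""
    index = defaultdict(set)
    for term in vocabulary:
        for gram in bigrams(term):
            index[gram].add(term)
    return index


def candidates(misspelling, index, min_shared=2):
    """Terms that share at least `min_shared` bigrams with the misspelling."""
    counts = defaultdict(int)
    for gram in bigrams(misspelling):
        for term in index.get(gram, ()):
            counts[term] += 1
    return sorted(term for term, shared in counts.items() if shared >= min_shared)


# Hypothetical toy vocabulary.
INDEX = build_bigram_index(["aboard", "boardroom", "border", "card", "fact"])
found = candidates("bord", INDEX)
```

Here "card" shares only the bigram rd with "bord" and is filtered out, while the three terms from the slide each share at least two bigrams. The surviving candidates would then be ranked, for example by edit distance or by the Jaccard coefficient.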
Probabilistic Techniques
• Find the most probable transmitted word (correct dictionary word) for a received erroneous string (misspelling).
• Generic algorithm
  – The model assigns to each correct dictionary word a probability of being the intended correction of the misspelling. The word with the highest probability is considered the closest match (or the actual intended word).
Probabilistic Techniques (cont.)
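The formulas on this slide are not preserved in the transcript. A common way to formalize "the most probable transmitted word for a received string" is the noisy channel model, sketched below as an assumption rather than as the slide's exact content. Here s denotes the observed misspelling, w ranges over the dictionary V, P(s | w) is the error model, and P(w) is the language model prior.

```latex
\hat{w}
  = \arg\max_{w \in V} P(w \mid s)
  = \arg\max_{w \in V} \frac{P(s \mid w)\, P(w)}{P(s)}
  = \arg\max_{w \in V} P(s \mid w)\, P(w)
```

The denominator P(s) is the same for every candidate w, so it can be dropped without changing which word maximizes the expression.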
Error model
• Letter-to-letter confusion probabilities
  – [Kernighan 1990]
  – Keyboard adjacencies; a probability matrix
  – Rule base
• String-to-string confusion probabilities
  – [Brill 2000]
  – This needs a training set of (si, wi) string pairs, where si represents a spelling error and wi is the corresponding corrected word.
• For each training pair (q1, q2):
  – We count the frequencies of edit operations α → β. These frequencies are then used to compute P(α → β), the probability that users who intended to type the string α typed β instead.
  – As an example, we extract the following edit operations from the training pair (satellite, satillite):
    • Window size 1: e → i
    • Window size 2: te → ti, el → il
    • Window size 3: tel → til, ate → ati, ell → ill
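The window extraction in the satellite/satillite example can be sketched as follows. This is a deliberately simplified illustration: it only handles equal-length pairs that differ by substitutions, whereas a full string-to-string error model also has to align insertions and deletions. The function name is an assumption of this example.

```python
def substitution_windows(intended, typed, max_window=3):
    """Collect (alpha, beta) substring pairs around each substituted position.

    Simplification: only equal-length pairs (pure substitutions) are handled.
    """
    if len(intended) != len(typed):
        raise ValueError("this sketch only handles equal-length pairs")
    n = len(intended)
    ops = {k: [] for k in range(1, max_window + 1)}
    for i in range(n):
        if intended[i] == typed[i]:
            continue
        for k in range(1, max_window + 1):
            # Every window of length k that covers the differing position i.
            for start in range(max(0, i - k + 1), min(i, n - k) + 1):
                ops[k].append((intended[start:start + k], typed[start:start + k]))
    return ops


ops = substitution_windows("satellite", "satillite")
```

Counting these pairs over a whole training set gives the frequencies from which P(α → β) is estimated.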
Language Model
• Example: سازمان بیمه تامین ... (Persian: "Organization of ... Security Insurance", where the word after تامین is left to be guessed)
• Guessing the next word, or word prediction.
• Definition
  – A statistical language model is a probability distribution over sequences of words.
  – Having a way to estimate the relative likelihood of different phrases is useful in many natural language processing applications.
Language Model (cont.)
• We might represent the probability of a word sequence as follows:
  P(w_1, w_2, ..., w_{n-1}, w_n)
• We can use the chain rule of probability to decompose this probability:
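The equation from the slide is not preserved in the transcript; the standard chain rule decomposition, written with the same symbols w_1, ..., w_n, reads:

```latex
P(w_1, w_2, \ldots, w_n)
  = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_n \mid w_1, \ldots, w_{n-1})
  = \prod_{k=1}^{n} P(w_k \mid w_1, \ldots, w_{k-1})
```

For k = 1 the conditioning context is empty, so the first factor is simply P(w_1).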
• But how can we compute the conditional probabilities in this decomposition?
• By counting N-grams of words in corpora.
  – The general equation for this N-gram approximation to the conditional probability of the next word in a sequence is:
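The slide's equation is missing from the transcript. In its usual textbook form, the N-gram approximation conditions the next word only on the previous N - 1 words instead of on the whole history:

```latex
P(w_n \mid w_1, \ldots, w_{n-1}) \approx P(w_n \mid w_{n-N+1}, \ldots, w_{n-1})
```

For N = 2 (a bigram model) the context shrinks to the single preceding word w_{n-1}.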
• For the bigram model (N = 2):
• For example:
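The slide's own formula and example are not preserved. As a sketch, a bigram probability is commonly estimated by maximum likelihood as P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1}), where C counts occurrences in a corpus. The Python below applies that estimate to a hypothetical three-sentence corpus; the corpus, the function names, and the `<s>` / `</s>` sentence markers are assumptions of this example.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Count bigrams and their left contexts over whitespace-tokenized sentences."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        context_counts.update(tokens[:-1])            # every token that has a successor
        bigram_counts.update(zip(tokens, tokens[1:]))
    return context_counts, bigram_counts


def bigram_probability(prev, word, context_counts, bigram_counts):
    """Maximum likelihood estimate of P(word | prev); 0.0 for an unseen context."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]


# Hypothetical toy corpus.
CORPUS = [
    "social security records",
    "social security office",
    "social media",
]
CONTEXTS, BIGRAMS = train_bigram_model(CORPUS)
p_security = bigram_probability("social", "security", CONTEXTS, BIGRAMS)  # 2 of 3
```

Unseen bigrams get probability zero under this estimate, which is why practical models add smoothing.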
• To improve the language model:
  – Co-occurrence frequencies + confusion sets
  – N-gram POS probabilities
  – ...
Forms of spelling correction
• Isolated-term
• Context-sensitive
End
• Questions?