comparing word relatedness measures based on google n-grams aminul islam, evangelos milios, vlado...
![Page 1: Comparing Word Relatedness Measures Based on Google n-grams Aminul ISLAM, Evangelos MILIOS, Vlado KEŠELJ Faculty of Computer Science Dalhousie University,](https://reader036.vdocuments.site/reader036/viewer/2022070605/5a4d1ada7f8b9ab05997455a/html5/thumbnails/1.jpg)
Comparing Word Relatedness Measures Based on Google n-grams
Aminul ISLAM, Evangelos MILIOS, Vlado KEŠELJ
Faculty of Computer Science, Dalhousie University, Halifax, Canada
[email protected], [email protected], [email protected]
COLING 2012
Introduction
● Word relatedness has a wide range of applications
– IR: image retrieval, query extension, …
– Paraphrase recognition
– Malapropism detection and correction
– Automatic creation of thesauri
– Speech recognition
– …
Introduction
● Methods can be categorized into 3:
– Corpus-based
  ● Supervised
  ● Unsupervised
– Knowledge-based
  ● Semantic resources were used
– Hybrid
Introduction
● This paper focuses on unsupervised corpus-based measures
● Six measures are compared
Problem
● Unsupervised corpus-based measures usually rely on co-occurrence statistics, mostly word n-grams and their frequencies
– The co-occurrence statistics are corpus-specific
– Most corpora do not come with co-occurrence statistics, so they cannot be used on-line
– Some measures use web search results, but those results vary over time
Motivation
● How can different measures be compared fairly?
● Observation
– Co-occurrence statistics are used by these measures
– A corpus with co-occurrence information, e.g. Google n-grams, is probably a good resource
Google N-Grams
● A publicly available corpus with
– Co-occurrence statistics (uni-gram to 5-gram)
– A large volume of <del>web text</del>
  ● Digitized books: over 5.2 million books published since 1500
– Data format:
  ● ngram year match_count volume_count
  ● e.g.:
    – analysis is often described as 1991 1 1 1
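The example row carries four trailing numbers, while the format line names three numeric fields. The sketch below assumes the extra number is a page count sitting between match_count and volume_count, and that fields are whitespace-separated. Both points are assumptions, and `NgramRow` and `parse_row` are hypothetical names used only for illustration.

```python
from typing import NamedTuple


class NgramRow(NamedTuple):
    ngram: str
    year: int
    match_count: int
    page_count: int  # assumed field, not named on the slide
    volume_count: int


def parse_row(line: str) -> NgramRow:
    """Parse one row, reading the numeric fields from the right."""
    tokens = line.split()
    if len(tokens) < 5:
        raise ValueError(f"expected an n-gram plus four numbers, got: {line!r}")
    # Everything before the last four tokens belongs to the n-gram itself,
    # so n-grams of any order (uni-gram to 5-gram) parse the same way.
    *words, year, match_count, page_count, volume_count = tokens
    return NgramRow(
        " ".join(words),
        int(year),
        int(match_count),
        int(page_count),
        int(volume_count),
    )


row = parse_row("analysis is often described as 1991 1 1 1")
print(row.ngram, row.year, row.match_count)
```

Reading from the right keeps the parser independent of how many words the n-gram contains.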
Another Motivation
● To find an indirect mapping between Google n-grams and web search results
– Thus, the measures might be usable on-line
How About WordNet?
● In 2006, Budanitsky and Hirst evaluated 5 knowledge-based measures using WordNet
– Creating a resource like WordNet requires a lot of effort
– Its coverage of words is not sufficient for NLP tasks
– The resource is language-specific, while Google n-grams cover more than 10 languages
Notations
● C(w1 … wn)
– Frequency of the n-gram
● D(w1 … wn)
– # of web docs (up to 5-grams)
● M(w1, w2)
– C(w1 wi w2)
Notations
● (w1, w2)
– 1/2 [ C(w1 wi w2) + C(w2 wi w1) ]
● N
– # of docs used in Google n-grams
● |V|
– # of uni-grams in Google n-grams
● Cmax
– max frequency in Google n-grams
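As a rough illustration of the tri-gram quantities in the notation, the sketch below computes M(w1, w2) and the two-direction mean from a small in-memory table. It assumes that `wi` stands for any middle word, so C(w1 wi w2) is summed over all tri-grams that start with w1 and end with w2. The counts are made up, and `trigram_M` and `trigram_mean` are hypothetical helper names.

```python
from collections import Counter

# Toy tri-gram frequencies (made-up numbers, for illustration only).
trigram_counts = Counter({
    ("cat", "and", "dog"): 4,
    ("cat", "or", "dog"): 2,
    ("dog", "chases", "cat"): 3,
    ("cat", "eats", "fish"): 5,
})


def trigram_M(w1: str, w2: str, counts: Counter) -> int:
    """Total frequency of tri-grams starting with w1 and ending with w2.

    Assumption: the middle word wi ranges over every word.
    """
    return sum(c for (first, _, last), c in counts.items()
               if first == w1 and last == w2)


def trigram_mean(w1: str, w2: str, counts: Counter) -> float:
    """1/2 [ C(w1 wi w2) + C(w2 wi w1) ], averaging both word orders."""
    return 0.5 * (trigram_M(w1, w2, counts) + trigram_M(w2, w1, counts))


print(trigram_M("cat", "dog", trigram_counts))     # 6
print(trigram_mean("cat", "dog", trigram_counts))  # 4.5
```

Averaging both directions makes the quantity symmetric in w1 and w2, even though each tri-gram count is order-sensitive.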
Assumptions
● Some measures use web search results and co-occurrence information that Google n-grams do not provide, but
– C(w1) ≥ D(w1)
– C(w1 w2) ≥ D(w1 w2)
● This is because uni-grams and bi-grams may occur multiple times in one document
Assumptions
● Considering the lower limits
– C(w1) ≈ D(w1)
– C(w1 w2) ≈ D(w1 w2)
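The relation between C and D can be checked on a toy corpus. The sketch below counts total occurrences (C) and the number of documents containing an n-gram (D) for three made-up documents; the documents and the helper names are assumptions for illustration, not part of the original material.

```python
def ngrams(tokens, n):
    """All contiguous n-token windows of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def C(ngram, docs):
    """Total number of occurrences of the n-gram across all documents."""
    target = tuple(ngram.split())
    return sum(ngrams(doc.split(), len(target)).count(target) for doc in docs)


def D(ngram, docs):
    """Number of documents that contain the n-gram at least once."""
    target = tuple(ngram.split())
    return sum(1 for doc in docs if target in ngrams(doc.split(), len(target)))


# Made-up documents, for illustration only.
docs = ["the cat saw the cat", "a dog", "the dog saw the cat"]

for gram in ["cat", "the cat", "dog"]:
    print(gram, C(gram, docs), D(gram, docs))
```

"cat" occurs three times but in only two documents, so C exceeds D; "dog" occurs once in each of two documents, so C equals D. Replacing D with C is exact only in the second situation.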
Measures
● Jaccard Coefficient
● Simpson Coefficient
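The formulas for these two measures did not survive in the transcript. As a hedged sketch, the code below uses the textbook definitions of the Jaccard and Simpson (overlap) coefficients, with n-gram frequencies C standing in for document counts D as in the approximation C ≈ D. It may differ in detail from the exact form used by the authors; the function names and the example counts are hypothetical.

```python
def jaccard(c1: float, c2: float, c12: float) -> float:
    """Textbook Jaccard: co-occurrence over the size of the union.

    c1 = C(w1), c2 = C(w2), c12 = C(w1 w2).
    """
    union = c1 + c2 - c12
    return c12 / union if union > 0 else 0.0


def simpson(c1: float, c2: float, c12: float) -> float:
    """Textbook Simpson (overlap): co-occurrence over the smaller count."""
    smaller = min(c1, c2)
    return c12 / smaller if smaller > 0 else 0.0


# Made-up counts: C(w1) = 100, C(w2) = 50, C(w1 w2) = 25.
print(jaccard(100, 50, 25))  # 0.2
print(simpson(100, 50, 25))  # 0.5
```

Simpson normalizes by the rarer word only, so it is less penalized than Jaccard when one word is much more frequent than the other.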
Measures
● Dice Coefficient
● Pointwise Mutual Information
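The formulas on this slide were lost in the transcript as well. The sketch below uses the textbook Dice coefficient and pointwise mutual information, again with frequencies C in place of document counts and N as the normalizer. These are standard definitions rather than a confirmed copy of the authors' variants, and the numbers are made up.

```python
import math


def dice(c1: float, c2: float, c12: float) -> float:
    """Textbook Dice: twice the co-occurrence over the sum of the counts."""
    total = c1 + c2
    return 2.0 * c12 / total if total > 0 else 0.0


def pmi(c1: float, c2: float, c12: float, n: float) -> float:
    """Textbook PMI: log2 of P(w1, w2) / (P(w1) * P(w2)), with P = C / N."""
    if min(c1, c2, c12) <= 0:
        return float("-inf")  # undefined for zero counts; treated as unrelated
    return math.log2(c12 * n / (c1 * c2))


# Made-up counts: C(w1) = 100, C(w2) = 50, C(w1 w2) = 40, N = 1000.
print(dice(100, 50, 40))
print(pmi(100, 50, 40, 1000))
```

PMI is positive when the pair co-occurs more often than independence would predict and negative when it co-occurs less often.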
Measures
● Normalized Google Distance (NGD) variation
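The variation referred to here is not recoverable from the transcript. For orientation only, the sketch below implements the standard Normalized Google Distance of Cilibrasi and Vitányi, with frequencies C substituted for page counts. It is a distance (smaller means more related), and it is not the variation used on the slide; the function name and counts are hypothetical.

```python
import math


def ngd(c1: float, c2: float, c12: float, n: float) -> float:
    """Standard NGD with C(w1), C(w2), C(w1 w2) and normalizer N."""
    if min(c1, c2, c12) <= 0:
        return float("inf")  # no evidence of co-occurrence
    log1, log2_, log12 = math.log(c1), math.log(c2), math.log(c12)
    denominator = math.log(n) - min(log1, log2_)
    if denominator <= 0:
        raise ValueError("N must exceed the smaller of the two word counts")
    return (max(log1, log2_) - log12) / denominator


# Made-up counts: C(w1) = 100, C(w2) = 50, C(w1 w2) = 25, N = 1000.
print(ngd(100, 50, 25, 1000))
```

A pair that always occurs together (c1 = c2 = c12) gets distance 0 under this definition.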
Measures
● Relatedness based on Tri-grams (RT)
Evaluation
● Compare with human judgments
– It is considered to be the upper limit
● Evaluate the measures with respect to a particular application
– Evaluate relatedness of words
● Text Similarity
Compare With Human Judgments
● Rubenstein and Goodenough's 65 Word Pairs
– 51 people rated 65 pairs of English words on a scale of 0.0 to 4.0
● Miller and Charles' 28 Noun Pairs
– Restricted R&G to 30 pairs, rated by 38 human judges
– Most researchers use 28 pairs because 2 were omitted from an early version of WordNet
Result
Result
Application-based Evaluation
● TOEFL's 80 Synonym Questions
– Given a problem word, infinite, and four alternative words, limitless, relative, unusual, structural, choose the most related word
● ESL's 50 Synonym Questions
– Same task as TOEFL's 80 synonym questions
– Except that the synonym questions are from English as a 2nd Language tests
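A synonym question of this kind reduces to picking the alternative with the highest relatedness to the problem word. The sketch below shows that selection step with a placeholder scoring function. The scores are invented for illustration and do not come from any of the compared measures; `answer_synonym_question` and `toy_relatedness` are hypothetical names.

```python
from typing import Callable, Sequence


def answer_synonym_question(
    problem: str,
    alternatives: Sequence[str],
    relatedness: Callable[[str, str], float],
) -> str:
    """Return the alternative most related to the problem word."""
    return max(alternatives, key=lambda candidate: relatedness(problem, candidate))


# Invented scores standing in for a real relatedness measure.
TOY_SCORES = {"limitless": 0.9, "relative": 0.2, "unusual": 0.3, "structural": 0.1}


def toy_relatedness(w1: str, w2: str) -> float:
    return TOY_SCORES.get(w2, 0.0)


choice = answer_synonym_question(
    "infinite",
    ["limitless", "relative", "unusual", "structural"],
    toy_relatedness,
)
print(choice)  # limitless
```

Any measure with the signature (word, word) to score can be passed in, which is what allows several measures to be compared on the same question set.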
Result
Result
Text Similarity
● Find the similarity between two text items
● Use the different word relatedness measures within a single text similarity measure, and evaluate the results of that text similarity measure on a standard data set
● 30 sentence pairs from one of the most widely used data sets were used
Result
Result
● Pearson correlation coefficient with mean human similarity ratings:
– Ho et al. (2010) used one WordNet-based measure, applied its scores in the method of Islam and Inkpen (2008), and achieved 0.895
– Tsatsaronis et al. (2010) achieved 0.856
– Islam et al. (2012) achieved 0.916
● The improvement over Ho et al. (2010) is statistically significant at the 0.05 level
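The figures above are Pearson correlation coefficients between system scores and mean human ratings. As a minimal sketch of how such a coefficient is computed, the function below implements the standard formula. The rating and score lists are made up and do not reproduce any reported number.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(var_x * var_y)


# Made-up human ratings and system scores, for illustration only.
human = [0.1, 1.2, 2.5, 3.8]
system = [0.05, 0.30, 0.55, 0.90]
print(round(pearson(human, system), 3))
```

The coefficient is invariant to the scale of either list, so human ratings on a 0.0 to 4.0 scale can be compared directly with system scores on a different range.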
Conclusion
● Any measure that uses n-gram statistics can easily be applied to the Google n-gram corpus and be fairly compared with existing work on standard data sets for different tasks
● An indirect mapping of co-occurrence statistics between the Google n-gram corpus and a web search engine was found using some assumptions
Conclusion
● Measures based on n-grams are language-independent
– They can be implemented for another language if it has a sufficiently large n-gram corpus