![Page 1: Finding Similar Questions in Large Question and Answer Archives](https://reader030.vdocuments.site/reader030/viewer/2022033106/56812c00550346895d907237/html5/thumbnails/1.jpg)
Finding Similar Questions in Large Question and Answer Archives
Jiwoon Jeon, W. Bruce Croft and Joon Ho Lee
Retrieval Models for Question and Answer Archives
Jiwoon Jeon, W. Bruce Croft and Xiaobing Xue
Presenter: Sawood Alam <[email protected]>
![Page 2: Finding Similar Questions in Large Question and Answer Archives](https://reader030.vdocuments.site/reader030/viewer/2022033106/56812c00550346895d907237/html5/thumbnails/2.jpg)
Finding Similar Questions in Large Question and Answer Archives
Jiwoon Jeon, W. Bruce Croft and Joon Ho Lee
Center for Intelligent Information Retrieval, Computer Science Department
University of Massachusetts, Amherst, MA 01003
[jeon,croft,joonho]@cs.umass.edu
CIKM '05, Proceedings of the 14th ACM Conference on Information and Knowledge Management, 2005
![Page 3: Finding Similar Questions in Large Question and Answer Archives](https://reader030.vdocuments.site/reader030/viewer/2022033106/56812c00550346895d907237/html5/thumbnails/3.jpg)
Introduction
• Q&A systems quickly build large archives
– Naver, a popular Korean search site, gets 25,000+ questions per day
• Great linguistic resource
• Answering questions from the archive before a human response appears
![Page 4: Finding Similar Questions in Large Question and Answer Archives](https://reader030.vdocuments.site/reader030/viewer/2022033106/56812c00550346895d907237/html5/thumbnails/4.jpg)
Q&A Over Usual Search
• Opinion or summary
• Direct answers rather than relevant documents
• Search in a collection of questions associated with answers
• Lexical similarity vs. semantic similarity
– Is downloading movies illegal?
– Can I share a copy of a DVD online?
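The two example questions above share no words at all, which is why purely lexical matching cannot relate them. A minimal sketch of that mismatch, assuming a naive lowercase tokenizer and the overlap coefficient as the lexical measure (the helper names `tokens` and `overlap_coefficient` are illustrative, not from the slides):

```python
import re

def tokens(text):
    """Lowercase a string and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap_coefficient(a, b):
    """|A ∩ B| / min(|A|, |B|); 0.0 when either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

q1 = tokens("Is downloading movies illegal?")
q2 = tokens("Can I share a copy of a DVD online?")

# The questions are semantically related but have no token in common.
lexical_score = overlap_coefficient(q1, q2)
print(lexical_score)  # 0.0
```

Any measure built only on shared tokens gives this pair a score of zero, which motivates the translation-based approach in the following slides.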
![Page 5: Finding Similar Questions in Large Question and Answer Archives](https://reader030.vdocuments.site/reader030/viewer/2022033106/56812c00550346895d907237/html5/thumbnails/5.jpg)
Solving Word Mismatch Problem
• Knowledge database (machine readable dictionaries) – unreliable performance
• Manual rules or templates – hard to scale
• Statistical technique – most promising
– Requires large training data set
![Page 6: Finding Similar Questions in Large Question and Answer Archives](https://reader030.vdocuments.site/reader030/viewer/2022033106/56812c00550346895d907237/html5/thumbnails/6.jpg)
Question and Answer Archive
• Average lengths (words)
• Title: 5.8
• Body: 49
• Answer: 179
![Page 7: Finding Similar Questions in Large Question and Answer Archives](https://reader030.vdocuments.site/reader030/viewer/2022033106/56812c00550346895d907237/html5/thumbnails/7.jpg)
Relevance Judgments
• Eighteen different retrieval results (varying retrieval algorithms)
– Query likelihood, Okapi BM25 and overlap coefficient
• Top 20 Q&A pairs from each retrieval result
• Manual judgment
• Correctness of answer was ignored
• Manual browsing for missing relevant Q&A pairs
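The pooling step described above (take the top 20 Q&A pairs from each retrieval result, then judge the union) can be sketched as follows. This is only an illustration under assumed data structures: each retrieval result is a ranked list of made-up Q&A pair ids, and `pool_for_judgment` is a hypothetical helper name.

```python
def pool_for_judgment(ranked_lists, depth=20):
    """Union of the top-`depth` items from every ranked list, in first-seen order."""
    pooled = []
    seen = set()
    for ranking in ranked_lists:
        for item in ranking[:depth]:
            if item not in seen:
                seen.add(item)
                pooled.append(item)
    return pooled

# Toy data: three runs over made-up Q&A pair ids, pooled at depth 2.
runs = [
    ["qa1", "qa2", "qa3"],
    ["qa2", "qa4", "qa1"],
    ["qa5", "qa1", "qa6"],
]
pool = pool_for_judgment(runs, depth=2)
print(pool)  # ['qa1', 'qa2', 'qa4', 'qa5']
```

Each pooled pair is then judged manually once, however many runs returned it.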
![Page 8: Finding Similar Questions in Large Question and Answer Archives](https://reader030.vdocuments.site/reader030/viewer/2022033106/56812c00550346895d907237/html5/thumbnails/8.jpg)
Field Importance
![Page 9: Finding Similar Questions in Large Question and Answer Archives](https://reader030.vdocuments.site/reader030/viewer/2022033106/56812c00550346895d907237/html5/thumbnails/9.jpg)
Generation of Training Sample
• LM-HRANK
Sim(A, B) = (1/r1 + 1/r2) / 2
Where:
• Answer A retrieves B at rank r1
• Answer B retrieves A at rank r2
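The similarity above is the mean of the two reciprocal ranks, so it equals 1 only when each answer retrieves the other at rank 1 and falls quickly as either rank grows. A small sketch of the arithmetic, assuming ranks start at 1 (the function name `hrank_similarity` is illustrative):

```python
def hrank_similarity(r1, r2):
    """Mean of the reciprocal ranks: (1/r1 + 1/r2) / 2, with ranks starting at 1."""
    if r1 < 1 or r2 < 1:
        raise ValueError("ranks must be >= 1")
    return (1.0 / r1 + 1.0 / r2) / 2.0

print(hrank_similarity(1, 1))  # 1.0
print(hrank_similarity(1, 4))  # 0.625
```

Because both ranks enter the score, a pair scores high only when the retrieval is mutual, not when just one answer ranks the other highly.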
![Page 10: Finding Similar Questions in Large Question and Answer Archives](https://reader030.vdocuments.site/reader030/viewer/2022033106/56812c00550346895d907237/html5/thumbnails/10.jpg)
Word Translation Probabilities
![Page 11: Finding Similar Questions in Large Question and Answer Archives](https://reader030.vdocuments.site/reader030/viewer/2022033106/56812c00550346895d907237/html5/thumbnails/11.jpg)
Experiments and Results
![Page 12: Finding Similar Questions in Large Question and Answer Archives](https://reader030.vdocuments.site/reader030/viewer/2022033106/56812c00550346895d907237/html5/thumbnails/12.jpg)
Examples and Analysis
![Page 13: Finding Similar Questions in Large Question and Answer Archives](https://reader030.vdocuments.site/reader030/viewer/2022033106/56812c00550346895d907237/html5/thumbnails/13.jpg)
Retrieval Models for Question and Answer Archives
Jiwoon Jeon
Google, Inc., Mountain View, CA 94043, USA
[email protected]
W. Bruce Croft and Xiaobing Xue
Center for Intelligent Information Retrieval, Computer Science Department
University of Massachusetts, Amherst, MA 01003
[croft,xuexb]@cs.umass.edu
SIGIR '08, Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2008
![Page 14: Finding Similar Questions in Large Question and Answer Archives](https://reader030.vdocuments.site/reader030/viewer/2022033106/56812c00550346895d907237/html5/thumbnails/14.jpg)
Introduction
• Word mismatch problem
• Focus on translation based approach
• Explanation of poor performance of pure IBM model vs. query-likelihood language model
• Proposed a mixed model
– Question part: translation based language model
– Answer part: query likelihood language model
![Page 15: Finding Similar Questions in Large Question and Answer Archives](https://reader030.vdocuments.site/reader030/viewer/2022033106/56812c00550346895d907237/html5/thumbnails/15.jpg)
LM vs. IBM model 1
![Page 16: Finding Similar Questions in Large Question and Answer Archives](https://reader030.vdocuments.site/reader030/viewer/2022033106/56812c00550346895d907237/html5/thumbnails/16.jpg)
Question Part
![Page 17: Finding Similar Questions in Large Question and Answer Archives](https://reader030.vdocuments.site/reader030/viewer/2022033106/56812c00550346895d907237/html5/thumbnails/17.jpg)
Answer Part
• Gamma = 0: translation based (for question part)
• Gamma = 1: query likelihood LM (for answer part)
• Beta = 0: combination model
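The formula behind these parameter settings is on the slide image and not in this transcript. As a rough sketch only, assume the mixture has the form alpha * P_ml(w|q) + beta * sum over t in q of T(w|t) * P_ml(t|q) + gamma * P_ml(w|a), with alpha + beta + gamma = 1 and collection smoothing left out. Under that assumption the three bullet cases fall out directly. All names and the toy translation table below are hypothetical.

```python
from collections import Counter

def ml_prob(word, text_tokens):
    """Maximum-likelihood estimate of `word` in a token list."""
    if not text_tokens:
        return 0.0
    return Counter(text_tokens)[word] / len(text_tokens)

def mixture_prob(word, question, answer, trans, alpha, beta, gamma):
    """Assumed form: alpha*P_ml(w|q) + beta*sum_t T(w|t)*P_ml(t|q) + gamma*P_ml(w|a)."""
    if abs(alpha + beta + gamma - 1.0) > 1e-9:
        raise ValueError("alpha + beta + gamma must sum to 1")
    # Translation term: probability mass that question words pass on to `word`.
    translated = sum(
        trans.get((word, t), 0.0) * ml_prob(t, question) for t in set(question)
    )
    return (alpha * ml_prob(word, question)
            + beta * translated
            + gamma * ml_prob(word, answer))

# Toy data (made up): T("cheat" | "boyfriend") = 0.5.
question = ["is", "my", "boyfriend", "lying"]
answer = ["cheat", "dump", "him", "cheat"]
trans = {("cheat", "boyfriend"): 0.5}

only_translation = mixture_prob("cheat", question, answer, trans, 0.0, 1.0, 0.0)
only_answer_lm = mixture_prob("cheat", question, answer, trans, 0.0, 0.0, 1.0)
no_translation = mixture_prob("cheat", question, answer, trans, 0.5, 0.0, 0.5)
print(only_translation, only_answer_lm, no_translation)  # 0.125 0.5 0.25
```

In this toy case the query word "cheat" never occurs in the question, so it gets a nonzero score only through the translation term or the answer text.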
![Page 18: Finding Similar Questions in Large Question and Answer Archives](https://reader030.vdocuments.site/reader030/viewer/2022033106/56812c00550346895d907237/html5/thumbnails/18.jpg)
Word-to-Word Translation Probability
• Word “cheat” in question
– “trust”, “forgive”, “dump” and “leave” etc. in answer
• Word “cheat” in answer
– “husband” and “boyfriend” etc. in question
• All these words are useful to attack word mismatch problem
– Combined probability used: P(Q|A) and P(A|Q)
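The transcript does not say how the two directions are combined. One simple option, assumed here purely for illustration, is a linear average of the two word-to-word tables, so that translations learned in either direction end up in a single table. The function name, the `(target, source)` key layout and the probabilities below are all made up.

```python
def average_tables(table_a, table_b, weight=0.5):
    """Linear interpolation of two word-to-word tables keyed by (target, source)."""
    keys = set(table_a) | set(table_b)
    return {
        k: weight * table_a.get(k, 0.0) + (1.0 - weight) * table_b.get(k, 0.0)
        for k in keys
    }

# Toy tables: translations of "cheat" learned in each direction.
question_to_answer = {("trust", "cheat"): 0.5}
answer_to_question = {("husband", "cheat"): 0.25, ("trust", "cheat"): 0.25}

combined = average_tables(question_to_answer, answer_to_question)
print(combined[("trust", "cheat")], combined[("husband", "cheat")])  # 0.375 0.125
```

After combining, "cheat" can be matched against both answer-side words such as "trust" and question-side words such as "husband".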
![Page 19: Finding Similar Questions in Large Question and Answer Archives](https://reader030.vdocuments.site/reader030/viewer/2022033106/56812c00550346895d907237/html5/thumbnails/19.jpg)
Examples
![Page 20: Finding Similar Questions in Large Question and Answer Archives](https://reader030.vdocuments.site/reader030/viewer/2022033106/56812c00550346895d907237/html5/thumbnails/20.jpg)
Experimental Results
![Page 21: Finding Similar Questions in Large Question and Answer Archives](https://reader030.vdocuments.site/reader030/viewer/2022033106/56812c00550346895d907237/html5/thumbnails/21.jpg)
Conclusions
• Translation based language model for question part and QL language model for answer part
• Experiment done on a Q&A web service where people answer others' questions
• Future work
– Testing effect of proposed model on FAQ archives
– Yahoo! Answers collection
– Phrase based machine translation rather than word based translation