DCU@FIRE 2012: Monolingual and Crosslingual SMS-based FAQ Retrieval
Johannes Leveling
CNGL, School of Computing, Dublin City University, Ireland
Outline
Motivation
System Setup and Changes
Monolingual Experiments
Crosslingual Experiments
SMT system
Training data
Translation results
OOV Reduction
FAQ Retrieval Results
Conclusions and Future Work
Motivation
Task: Given an SMS query, find FAQ documents answering the query
Last year’s DCU system:
SMS correction and normalisation
In-domain (ID) retrieval: Three approaches (SOLR, Lucene, Term Overlap)
Out-of-domain (OOD) detection: Three approaches (term overlap, normalized BM25 scores, ML)
Combination of ID retrieval and OOD results
Motivation
This year’s system:
Same SMS correction and normalisation, plus one more spelling correction resource (manually created)
Single retrieval approach: Lucene with BM25 retrieval model
Single OOD detection approach: IB-1 classification using TiMBL (machine learning), with additional features for term overlap and normalized BM25 scores
Trained statistical machine translation system for document translation (Hindi to English)
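The OOD detection step on this slide can be illustrated with a minimal sketch. It assumes two features per query (term overlap with the top-ranked FAQ, and a BM25 score divided by query length as one possible normalization) and uses a plain 1-nearest-neighbour rule with Euclidean distance as a stand-in for TiMBL's IB1, which offers its own distance metrics and feature weighting. All function names, the normalization, and the toy training values are hypothetical and not taken from the DCU system.

```python
from math import dist

def term_overlap(query_terms, doc_terms):
    """Fraction of distinct query terms that also occur in the document."""
    q = set(query_terms)
    if not q:
        return 0.0
    return len(q & set(doc_terms)) / len(q)

def normalized_score(bm25_score, query_terms):
    """Hypothetical normalization: BM25 score per query term."""
    return bm25_score / len(query_terms) if query_terms else 0.0

def classify_1nn(features, training):
    """Return the label of the closest training instance (1-nearest neighbour)."""
    nearest = min(training, key=lambda example: dist(features, example[0]))
    return nearest[1]

# Toy training instances: ((term overlap, normalized score), label)
training = [
    ((0.9, 2.5), "ID"),
    ((0.1, 0.3), "OOD"),
]

query = ["train", "ticket", "refund"]
top_faq = ["how", "to", "get", "a", "refund", "for", "a", "train", "ticket"]
features = (term_overlap(query, top_faq), normalized_score(6.0, query))
print(classify_1nn(features, training))
```

A query whose best FAQ match shares few terms and scores low ends up near the OOD examples and would be answered with "NONE".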
Questions
Investigate:
the influence of OOD detection on system performance
the influence of out-of-vocabulary (OOV) words on crosslingual performance
Collection Statistics
| Language | Documents | Training (rel/non_rel) | Test (rel/non_rel) |
|---|---|---|---|
| English | 7251 | 4476 (3047/1429) | 1733 (726/1007) |
| Hindi | 1994 | 554 (173/381) | 579 (200/379) |
| English to Hindi | 1994 | 554 (173/381) | 431 (75/356) |
Monolingual Experiments (Setup)
Experiments for English and Hindi
Processing steps:
Normalize SMS and FAQ documents
Correct SMS queries
Retrieve answers
Detect OOD queries (or not), e.g. “NONE” queries
Produce final result
Crosslingual Experiments (Setup)
Experiments for English to Hindi
Additional translation step to translate Hindi FAQ documents into English
Translation is based on a newly trained statistical machine translation (SMT) system
Problems:
sparse training data → combination of different training resources
out-of-vocabulary (OOV) words → OOV reduction
Crosslingual Experiments (SMT System)
Training an SMT system:
Data preparation: tokenization/normalization scripts
Data alignment: GIZA++ for word-level alignment
Phrase extraction: Moses MT toolkit
Training a language model: SRILM for trigram LM with Kneser-Ney smoothing
Tuning: minimum error rate training (MERT)
Crosslingual Experiments (Training Data)
Agro (agricultural domain): 246 sentences
Crowdsourced HI-EN data: 50k sentences
EILMT (tourism domain): 6700 sentences
ICON: 7000 sentences
TIDES: 50k sentences
FIRE ad-hoc queries: 200 titles, 200 descriptions
Interlanguage Wikipedia links: 27k entries
OPUS/KDE: 97k entries
UWdict: 128k entries
Translation Results (Hindi to English)
| Data | Training / Test / Development | BLEU |
|---|---|---|
| TIDES | 49,504 / 697 / 988 | 13.30 |
| Crowdsourced EN-HI | 41,396 / 8000 / 4000 | 7.04 |
| ICON | 7000 / 500 / 500 | 25.38 |
OOV Reduction
Problem: 15.4% untranslated words in translation output
Idea: modify untranslated words to obtain a translation
OOV reduction is based on two resources:
UWdict
Manually created transliteration lexicon (TRL): 639 entries
OOV Reduction
Word modifications:
Character normalization, e.g. replace Chandrabindu with Bindu, delete Virama character, replace long with short vowels
Stemming: Lucene Hindi stemmer
Transliteration: ITRANS transliteration rules, plus rules for cleaning up ITRANS results
Decompounding: a word is split at every position into candidate constituents; it is decompounded if both constituents have a translation
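Two of the word modifications above, character normalization and decompounding, can be sketched as follows. This is an illustration under assumptions, not the system's code: the exact character mappings (in particular which long/short vowel pairs are covered), the choice to return the first valid split, and the toy lexicon with romanized entries are all hypothetical.

```python
# Devanagari code points are written as escapes so the mapping is explicit.
CHAR_MAP = {
    "\u0901": "\u0902",  # Chandrabindu -> Bindu (Anusvara)
    "\u094D": "",        # delete Virama
    "\u0908": "\u0907",  # independent vowel II -> I (long -> short)
    "\u090A": "\u0909",  # independent vowel UU -> U
    "\u0940": "\u093F",  # vowel sign II -> vowel sign I
    "\u0942": "\u0941",  # vowel sign UU -> vowel sign U
}
_TABLE = str.maketrans(CHAR_MAP)

def normalize(word):
    """Apply the character-level replacements and deletions to a word."""
    return word.translate(_TABLE)

def decompound(word, lexicon):
    """Split the word at every position; accept the first split whose two
    constituents both have an entry (a translation) in the lexicon."""
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if left in lexicon and right in lexicon:
            return left, right
    return None  # no split with two translatable constituents

# Toy lexicon with romanized entries, for illustration only.
lexicon = {"rail": "rail", "gaadi": "vehicle"}
print(decompound("railgaadi", lexicon))
print(decompound("unknown", lexicon))
```

Trying every split position is simple but can accept spurious splits when short strings happen to be lexicon entries; a minimum constituent length would be one way to limit that.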
OOV Reduction Results (Hindi to English)
| Lookup form | Lookup data | Count | % Reduction |
|---|---|---|---|
| original term | UWdict | 4,728 | 14.5 |
| original term | TRL | 83 | 0.3 |
| normalized term | UWdict | 419 | 1.3 |
| normalized term | TRL | 24 | 0.1 |
| stemmed term | UWdict | 1,413 | 4.4 |
| stemmed term | TRL | 14 | 0.0 |
| stemmed normalized term | UWdict | 135 | 0.4 |
| stemmed normalized term | TRL | 0 | 0.0 |
| compound constituents | UWdict | 721 | 2.2 |
| transliteration | N/A | 24,973 | 76.8 |
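As a sanity check on the table above, the percentage column can be recomputed from the counts. The sketch assumes that each percentage is the count's share of the sum of all counts, an interpretation the slide does not state. Under that assumption the recomputed shares agree with the reported ones to within about 0.1 percentage points.

```python
# Counts and reported percentages copied from the OOV reduction results table.
rows = [
    ("original term", "UWdict", 4728, 14.5),
    ("original term", "TRL", 83, 0.3),
    ("normalized term", "UWdict", 419, 1.3),
    ("normalized term", "TRL", 24, 0.1),
    ("stemmed term", "UWdict", 1413, 4.4),
    ("stemmed term", "TRL", 14, 0.0),
    ("stemmed normalized term", "UWdict", 135, 0.4),
    ("stemmed normalized term", "TRL", 0, 0.0),
    ("compound constituents", "UWdict", 721, 2.2),
    ("transliteration", "N/A", 24973, 76.8),
]

# Assumption: the denominator is the sum of all counts in the table.
total = sum(count for _, _, count, _ in rows)

recomputed = [round(100 * count / total, 1) for _, _, count, _ in rows]

for (form, data, count, reported), share in zip(rows, recomputed):
    print(f"{form:<24} {data:<7} {count:>6} {share:>5.1f} (reported {reported:.1f})")
print("total handled OOV tokens:", total)
```

Read this way, transliteration accounts for roughly three quarters of all handled OOV words, with direct UWdict lookup of the original term a distant second.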
FAQ Retrieval Results
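| Run | Language | OOD detection | OOV reduction | ID correct | OOD correct | MRR |
|---|---|---|---|---|---|---|
| 1 | EN | N | - | 661/726 | 19/1007 | 0.937 |
| 2 | EN | Y | - | 595/726 | 981/1007 | 0.949 |
| 1 | HI | N | - | 77/379 | 13/379 | 0.473 |
| 2 | HI | Y | - | 26/379 | 375/379 | 0.880 |
| 1 | EN2HI | N | N | 29/75 | 41/1007 | 0.450 |
| 2 | EN2HI | N | Y | 22/75 | 60/1007 | 0.365 |
| 3 | EN2HI | Y | Y | 4/75 | 989/1007 | 0.444 |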
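The MRR column reports mean reciprocal rank. A minimal sketch of the standard definition follows: for each query, take 1 divided by the rank of the first relevant document (0 if none is retrieved), then average over queries. How the FIRE task scores OOD ("NONE") queries within MRR is not stated on the slide, so that part is left out; the function name and toy data are hypothetical.

```python
def mean_reciprocal_rank(rankings, relevant):
    """rankings: query id -> ranked list of document ids.
    relevant: query id -> set of relevant document ids."""
    if not rankings:
        return 0.0
    total = 0.0
    for qid, ranking in rankings.items():
        gold = relevant.get(qid, set())
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id in gold:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(rankings)

# Toy example: first relevant hit at rank 2, at rank 1, and not retrieved.
rankings = {"q1": ["d3", "d1"], "q2": ["d2"], "q3": ["d9"]}
relevant = {"q1": {"d1"}, "q2": {"d2"}, "q3": {"d7"}}
print(mean_reciprocal_rank(rankings, relevant))
```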
Conclusions
Monolingual experiments:
Good performance for English and Hindi
OOD detection improves MRR (but reduces the number of correct ID queries)
Crosslingual experiments:
Lower performance
OOD detection reduces MRR
OOV reduction reduces MRR
Future work
Further analysis of our results needed:
Normalization issues for MT training data?
Unbalanced OOD training data for Hindi and English?
Is there Hindi textese (e.g. abbreviations)?
Does the training data match the test data (manually or automatically created)?
Improve transliteration approach
Comparison to other submissions
10q 4 ur @ensn