our ir strategy
DESCRIPTION
Multilingual Search using Query Translation and Collection Selection Jacques Savoy, Pierre-Yves Berger University of Neuchatel, Switzerland www.unine.ch/info/clef/. 1. Effective monolingual IR system (combining indexing schemes?) 2. Combining query translation tools - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: Our IR Strategy](https://reader036.vdocuments.site/reader036/viewer/2022062314/56814a79550346895db78f32/html5/thumbnails/1.jpg)
Multilingual Search using
Query Translation and
Collection Selection
Jacques Savoy, Pierre-Yves Berger
University of Neuchatel, Switzerland
www.unine.ch/info/clef/
![Page 2: Our IR Strategy](https://reader036.vdocuments.site/reader036/viewer/2022062314/56814a79550346895db78f32/html5/thumbnails/2.jpg)
Our IR Strategy
1. Effective monolingual IR system1. Effective monolingual IR system
(combining indexing schemes?)(combining indexing schemes?)
2. Combining query translation tools2. Combining query translation tools
3. Effective collection selection and 3. Effective collection selection and
merging strategiesmerging strategies
![Page 3: Our IR Strategy](https://reader036.vdocuments.site/reader036/viewer/2022062314/56814a79550346895db78f32/html5/thumbnails/3.jpg)
Effective Monolingual IR
1.1. Define a stopword listDefine a stopword list
2.2. Have a « good » stemmerHave a « good » stemmer
Removing only the feminineRemoving only the feminineand plural suffixes (inflections)and plural suffixes (inflections)Focus on Focus on nounsnouns and and adjectivesadjectives see www.unine.ch/info/clef/see www.unine.ch/info/clef/
![Page 4: Our IR Strategy](https://reader036.vdocuments.site/reader036/viewer/2022062314/56814a79550346895db78f32/html5/thumbnails/4.jpg)
Indexing Units
For Indo-European languages : Set of significant words
CJK: bigramswith stoplistwithout stemming
For Finnish, we used 4-grams.
![Page 5: Our IR Strategy](https://reader036.vdocuments.site/reader036/viewer/2022062314/56814a79550346895db78f32/html5/thumbnails/5.jpg)
IR Models
Probabilistic Okapi Prosit or
deviation from randomness
Vector-space Lnu-ltc tf-idf (ntc-ntc) binary (bnn-bnn)
![Page 6: Our IR Strategy](https://reader036.vdocuments.site/reader036/viewer/2022062314/56814a79550346895db78f32/html5/thumbnails/6.jpg)
Monolingual Evaluation
Model English French Portug.
Q=TD word word word
Okapi 0.5422 0.4685 0.4835
Prosit 0.5313 0.4568 0.4695
Lnu-ltc 0.4979 0.4349 0.4579
tf-idf 0.3706 0.3309 0.3708
binary 0.3005 0.2017 0.1834
![Page 7: Our IR Strategy](https://reader036.vdocuments.site/reader036/viewer/2022062314/56814a79550346895db78f32/html5/thumbnails/7.jpg)
Monolingual Evaluation
Model Finnish Russian
Q=TD word word
Okapi 0.4773 0.3800
Prosit 0.4620 0.3448
Lnu-ltc 0.4643 0.3794
tf-idf 0.3862 0.2716
binary 0.1859 0.1512
![Page 8: Our IR Strategy](https://reader036.vdocuments.site/reader036/viewer/2022062314/56814a79550346895db78f32/html5/thumbnails/8.jpg)
Monolingual Evaluation
Model Finnish Russian
Q=TD word 4-gram word 4-gram
Okapi 0.4773 0.5386 0.3800 0.2890
Prosit 0.4620 0.5357 0.3448 0.2879
Lnu-ltc 0.4643 0.5022 0.3794 0.2852
tf-idf 0.3862 0.4466 0.2716 0.1916
binary 0.1859 0.2387 0.1512 0.0373
![Page 9: Our IR Strategy](https://reader036.vdocuments.site/reader036/viewer/2022062314/56814a79550346895db78f32/html5/thumbnails/9.jpg)
Problems with Finnish
Finnish can be characterized by:
1. A large number of cases (12+),
2. Agglutinative (saunastani)
& compound construction (rakkauskirje)
3. Variations in the stem
(matto, maton, mattoja, …).
Do we need a deeper morpholgical analysis?
![Page 10: Our IR Strategy](https://reader036.vdocuments.site/reader036/viewer/2022062314/56814a79550346895db78f32/html5/thumbnails/10.jpg)
Improvement using…
Relevance feedback (Rocchio)?
![Page 11: Our IR Strategy](https://reader036.vdocuments.site/reader036/viewer/2022062314/56814a79550346895db78f32/html5/thumbnails/11.jpg)
Monolingual Evaluation
Q=TD English French Finnish Russian
Okapi 0.5422 0.4685 0.5386 0.3800
+PRF 0.5704 0.4851 0.5308 0.3945
+5.2% +3.5% -1.5% +3.8%
Prosit 0.5313 0.4568 0.5357 0.3448
+PRF 0.5742 0.4643 0.5684 0.3736
+8.1% +1.6% +6.1% +8.4%
![Page 12: Our IR Strategy](https://reader036.vdocuments.site/reader036/viewer/2022062314/56814a79550346895db78f32/html5/thumbnails/12.jpg)
Relevance FeedbackMAP after blind query-expansion
0 10 15 20 30 40 50 60 75 100# of terms added (from 3 best-ranked doc.)
MAP
0.36
0.38
0.4
0.42
0.44
0.46
0.48
0.5
Prosit Okapidtu-dtn Lnu-ltc
![Page 13: Our IR Strategy](https://reader036.vdocuments.site/reader036/viewer/2022062314/56814a79550346895db78f32/html5/thumbnails/13.jpg)
Improvement using…
Data fusion approaches (see the paper)
Improvement with the English, Finnish and Portuguese corpora
Hurt the MAP with the French and Russian collections
![Page 14: Our IR Strategy](https://reader036.vdocuments.site/reader036/viewer/2022062314/56814a79550346895db78f32/html5/thumbnails/14.jpg)
Bilingual IR
1. Machine-readable dictionaries (MRD)e.g., Babylon
2. Machine translation services (MT)e.g., FreeTranslation BabelFish
3. Parallel and/or comparable corpora(not used in this evaluation campaign)
![Page 15: Our IR Strategy](https://reader036.vdocuments.site/reader036/viewer/2022062314/56814a79550346895db78f32/html5/thumbnails/15.jpg)
Bilingual IR
“Tour de France WinnerWho won the Tour de France in 1995?”
(Babylon1) “France, vainqueurOrganisation Mondiale de la Santé, le, France,* 1995”
(Right) “Le vainqueur du Tour de FranceQui a gagné le tour de France en 1995 ?”
![Page 16: Our IR Strategy](https://reader036.vdocuments.site/reader036/viewer/2022062314/56814a79550346895db78f32/html5/thumbnails/16.jpg)
Bilingual EN->FR/FI/RU/PT
TDOkapi
Frenchword
Finnish4-gram
Russianword
Portugword
Manual 0.4685 0.5386 0.3800 0.4835
Baby1 0.3706 0.1965 0.2209 0.3071
FreeTra 0.3845 N/A 0.3067 0.4057
InterTra 0.2664 0.2653 0.1216 0.3277
Reverso 0.3830 N/A 0.2960 N/A
Combi. 0.4066 0.3042 0.3888 0.4204
![Page 17: Our IR Strategy](https://reader036.vdocuments.site/reader036/viewer/2022062314/56814a79550346895db78f32/html5/thumbnails/17.jpg)
Bilingual EN->FR/FI/RU/PT
TDOkapi
Frenchword
Finnish4-gram
Russianword
Portugword
Manual 0.4685 0.5386 0.3800 0.4835
Baby1 79.1% 36.5% 58.1% 63.5%
FreeTra 82.1% N/A 80.7% 83.9%
InterTra 56.9% 49.3% 32.0% 67.8%
Reverso 81.8% N/A 77.9% N/A
Combi. 86.8% 56.5% 102.3% 86.9%
![Page 18: Our IR Strategy](https://reader036.vdocuments.site/reader036/viewer/2022062314/56814a79550346895db78f32/html5/thumbnails/18.jpg)
Bilingual IR
Improvement … Query combination (concatenation) Relevance feedback
(yes for PT, FR, RU, no for FI)Data fusion (yes for FR, no for RU, FI, PT)
![Page 19: Our IR Strategy](https://reader036.vdocuments.site/reader036/viewer/2022062314/56814a79550346895db78f32/html5/thumbnails/19.jpg)
Multilingual EN->EN FI FR RU
1. Create a common index Document translation (DT)
2. Search on each language and merge the result lists (QT)
3. Mix QT and DT
4. No translation
![Page 20: Our IR Strategy](https://reader036.vdocuments.site/reader036/viewer/2022062314/56814a79550346895db78f32/html5/thumbnails/20.jpg)
Merging Problem
EN
<–– Merging<–– Merging
RUFRFI
![Page 21: Our IR Strategy](https://reader036.vdocuments.site/reader036/viewer/2022062314/56814a79550346895db78f32/html5/thumbnails/21.jpg)
Merging Problem
1 EN120 1.22 EN200 1.03 EN050 0.74 EN705 0.6…
1 FR043 0.82 FR120 0.753 FR055 0.654 …
1 RU050 1.62 RU005 1.13 RU120 0.94 …
![Page 22: Our IR Strategy](https://reader036.vdocuments.site/reader036/viewer/2022062314/56814a79550346895db78f32/html5/thumbnails/22.jpg)
Merging Strategies
Round-robin (baseline)
Raw-score merging
Normalize
Biased round-robin
Z-score
Logistic regression
![Page 23: Our IR Strategy](https://reader036.vdocuments.site/reader036/viewer/2022062314/56814a79550346895db78f32/html5/thumbnails/23.jpg)
Merging Problem
1 EN120 1.22 EN200 1.03 EN050 0.74 EN705 0.6…
1 FR043 0.82 FR120 0.753 FR055 0.654 …
1 RU050 1.62 RU005 1.13 RU120 0.94 …
1 RU… 1 RU… 2 EN…2 EN…3 RU…3 RU…4 ….4 ….
![Page 24: Our IR Strategy](https://reader036.vdocuments.site/reader036/viewer/2022062314/56814a79550346895db78f32/html5/thumbnails/24.jpg)
Merging Strategies
Round-robin (baseline)
Raw-score merging
Normalize
Biased round-robin
Z-score
Logistic regression
![Page 25: Our IR Strategy](https://reader036.vdocuments.site/reader036/viewer/2022062314/56814a79550346895db78f32/html5/thumbnails/25.jpg)
Test-Collection CLEF-2004
EN FI FR RU
size 154 MB 137 MB 244 MB 68 MB
doc 56,472 55,344 90,261 16,716
topic 42 45 49 34
rel./query 8.9 9.2 18.7 3.6
![Page 26: Our IR Strategy](https://reader036.vdocuments.site/reader036/viewer/2022062314/56814a79550346895db78f32/html5/thumbnails/26.jpg)
Merging Strategies
Round-robin (baseline)
Raw-score merging
Normalize
Biased round-robin
Z-score
Logistic regression
![Page 27: Our IR Strategy](https://reader036.vdocuments.site/reader036/viewer/2022062314/56814a79550346895db78f32/html5/thumbnails/27.jpg)
Z-Score Normalization
1 EN120 1.22 EN200 1.03 EN050 0.74 EN765 0.6… …
Compute the mean and standard deviation
New score = ((old score-) / ) +
1 EN120 7.02 EN200 5.03 EN050 2.04 EN765 1.0…
![Page 28: Our IR Strategy](https://reader036.vdocuments.site/reader036/viewer/2022062314/56814a79550346895db78f32/html5/thumbnails/28.jpg)
MLIR Evaluation
Condition A (meta-search)
Prosit/Okapi (+PRF, data fusion for
FR, EN)
Condition C (same SE)
Prosit only (+PRF)
![Page 29: Our IR Strategy](https://reader036.vdocuments.site/reader036/viewer/2022062314/56814a79550346895db78f32/html5/thumbnails/29.jpg)
Multilingual Evaluation
EN->EN FR FI RU Cond. A Cond. C
Round-robin 0.2386 0.2358
Raw-score 0.0642 0.3067
Norm 0.2899 0.2646
Biased RR 0.2639 0.2613
Z-score wt 0.2669 0.2867
Logistic 0.3090 0.3393
![Page 30: Our IR Strategy](https://reader036.vdocuments.site/reader036/viewer/2022062314/56814a79550346895db78f32/html5/thumbnails/30.jpg)
Collection Selection
General idea:1. Use the logistic regression to compute
a probability of relevance for each document;
2. Sum the document scores from the first 15 best ranked documents;
3. If this sum ≥ , consider all the list,otherwise select only the first m doc.
![Page 31: Our IR Strategy](https://reader036.vdocuments.site/reader036/viewer/2022062314/56814a79550346895db78f32/html5/thumbnails/31.jpg)
Multilingual Evaluation
EN->EN FR FI RU Cond. A Cond. C
Round-robin 0.2386 0.2358
Logistic reg. 0.3090 0.3393
Optimal select 0.3234 0.3558
Select (m=0) 0.2957 0.3405
Select (m=3) 0.2953 0.3378
![Page 32: Our IR Strategy](https://reader036.vdocuments.site/reader036/viewer/2022062314/56814a79550346895db78f32/html5/thumbnails/32.jpg)
Conclusion (monolingual)
Finnish is still a difficult language
Blind-query expansion: different patterns with different IR models
Data fusion (?)
![Page 33: Our IR Strategy](https://reader036.vdocuments.site/reader036/viewer/2022062314/56814a79550346895db78f32/html5/thumbnails/33.jpg)
Conclusion (bilingual)
Translation resources not available for all languages pairs
Improvement bycombining query translations yes, but can we have a better combination scheme?
![Page 34: Our IR Strategy](https://reader036.vdocuments.site/reader036/viewer/2022062314/56814a79550346895db78f32/html5/thumbnails/34.jpg)
Conclusion (multilingual)
Selection and merging are still hard problems
Overfitting is not always the best choice (see results with Condition C)
![Page 35: Our IR Strategy](https://reader036.vdocuments.site/reader036/viewer/2022062314/56814a79550346895db78f32/html5/thumbnails/35.jpg)
Multilingual Search using
Query Translation and
Collection Selection
Jacques Savoy, Pierre-Yves Berger
University of Neuchatel, Switzerland
www.unine.ch/info/clef/