a merging strategy proposal: the 2-step retrieval status value method

A merging strategy proposal: The 2-step retrieval status value method. Fernando Martínez-Santiago · L. Alfonso Ureña-López · Maite Martín-Valdivia. Department of Computer Science, University of Jaén, Jaén, Spain. Inf Retrieval (2006) 9: 71–93

Page 1: A merging strategy proposal: The 2-step retrieval status value method

A merging strategy proposal: The 2-step retrieval status value method

Fernando Martínez-Santiago · L. Alfonso Ureña-López · Maite Martín-Valdivia

Department of Computer Science, University of Jaén, Jaén, Spain

Inf Retrieval (2006) 9: 71–93

Page 2: A merging strategy proposal: The 2-step retrieval status value method

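Merging problem

• A query is searched in each monolingual collection (Language 1, Language 2, Language 3).

• Result lists per language:
– Language 1: d11 d12 d13 …
– Language 2: d21 d22 d23 …
– Language 3: d31 d32 d33 …

• A merge strategy combines them into a single result list, e.g. d31 d32 d21 d11 d12 d23 d13 …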

Page 3: A merging strategy proposal: The 2-step retrieval status value method

Traditional solution

• Round-Robin
– Language 1 list: d11 d12 d13 …
– Language 2 list: d21 d22 d23 …
– Language 3 list: d31 d32 d33 …
– Merge: d11 d21 d31 d12 d22 d32 …

• Raw-scoring

• Normalized scoring

– 1)

– 2)
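The traditional merging strategies listed on this slide can be sketched as follows. This is a minimal illustration, not the authors' code. The two normalization formulas on the slide did not survive extraction, so the sketch assumes two commonly used variants (divide by the list maximum, and min-max scaling); the function names and the toy scores are hypothetical.

```python
from itertools import zip_longest


def round_robin(lists):
    """Interleave ranked lists: rank 1 of every list, then rank 2, and so on."""
    merged = []
    for rank_group in zip_longest(*lists):
        # zip_longest pads exhausted lists with None; skip the padding.
        merged.extend(doc for doc in rank_group if doc is not None)
    return merged


def merge_by_score(scored_lists):
    """Pool (doc, score) pairs from all lists and sort by score, best first."""
    pooled = [pair for lst in scored_lists for pair in lst]
    return sorted(pooled, key=lambda pair: pair[1], reverse=True)


def normalize_by_max(scored):
    """Assumed variant 1: divide each score by the best score in its list."""
    top = max(score for _, score in scored)
    return [(doc, score / top) for doc, score in scored]


def normalize_min_max(scored):
    """Assumed variant 2: rescale the scores of one list to [0, 1]."""
    scores = [score for _, score in scored]
    lo, hi = min(scores), max(scores)
    span = hi - lo
    return [(doc, (score - lo) / span if span else 0.0) for doc, score in scored]


# Toy data (hypothetical scores on different scales per language).
lang1 = [("d11", 10.0), ("d12", 4.0)]
lang2 = [("d21", 3.0), ("d22", 1.0)]

print(round_robin([["d11", "d12", "d13"], ["d21", "d22", "d23"], ["d31", "d32", "d33"]]))
print(merge_by_score([lang1, lang2]))                                      # raw scoring
print(merge_by_score([normalize_by_max(lang1), normalize_by_max(lang2)]))  # normalized scoring
```

With raw scoring, the list with the larger score scale dominates the merged ranking; normalizing each list first puts the top document of every language on an equal footing.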

Page 4: A merging strategy proposal: The 2-step retrieval status value method

Traditional solution

• Logistic regression (Calvé and Savoy (2000), Savoy (2003a))

• LVQ neural networks (Martín et al. 2003)

Page 5: A merging strategy proposal: The 2-step retrieval status value method

2-step retrieval status value method

• Step 1:

– Translating the query and searching it on each monolingual collection produces two results:

a) a concept vocabulary T', in which each concept consists of a term together with its corresponding translations

b) a multilingual collection D', the union of the 1000 documents retrieved for each language.

Page 6: A merging strategy proposal: The 2-step retrieval status value method

2-step retrieval status value method

• Step 2:

– Re-index D', considering only the T' vocabulary.

– Given a concept, its document frequency is obtained by grouping together the document frequencies of the terms that make up the concept.

Page 7: A merging strategy proposal: The 2-step retrieval status value method

2-step retrieval status value method

• For example:

• The Spanish word "casa" translates into the English words "house" and "home".

Given a document, the term frequency is calculated as usual, and the document frequency of the concept is the sum of the document frequencies of "casa", "house" and "home".
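The "casa / house / home" example can be made concrete with a small sketch. The toy documents, the concept table and the function names below are hypothetical. Counting the concept's term frequency as the number of occurrences of any of its member terms is an assumption about what "as usual" means here.

```python
from collections import Counter

# Hypothetical concept vocabulary T': concept -> the terms grouped under it.
concepts = {"casa": {"casa", "house", "home"}}

# Toy multilingual collection D' (already tokenized).
docs = {
    "es1": ["la", "casa", "blanca"],
    "en1": ["white", "house"],
    "en2": ["home", "sweet", "home"],
    "en3": ["a", "house", "is", "a", "home"],
}


def term_df(collection):
    """Document frequency of every term: the number of documents containing it."""
    df = Counter()
    for tokens in collection.values():
        df.update(set(tokens))
    return df


def concept_df(concept_terms, df):
    """Concept document frequency: the sum of the member terms' frequencies.

    A document containing two member terms (en3 here) is counted twice,
    because the slide defines the value as a plain sum.
    """
    return sum(df[term] for term in concept_terms)


def concept_tf(concept_terms, tokens):
    """Assumed concept term frequency: occurrences of any member term."""
    return sum(1 for token in tokens if token in concept_terms)


df = term_df(docs)
print(concept_df(concepts["casa"], df))           # df(casa) + df(house) + df(home)
print(concept_tf(concepts["casa"], docs["en2"]))  # "home" occurs twice in en2
```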

Page 8: A merging strategy proposal: The 2-step retrieval status value method

Mixed 2-step RSV

• Not aligned words

• Raw mixed 2-step RSV method
– For a given $\tau_{ij}$ (term j in the monolingual collection i), the document frequency value will be:
  • as in the 2-step method, if $\tau_{ij}$ is aligned;
  • the initial weight in the first step of the method, if the translation of $\tau_{ij}$ into the other languages is unknown.

• $RSV_i = \alpha \cdot RSV_i^{\mathrm{align}} + (1 - \alpha) \cdot RSV_i^{\mathrm{nonalign}}$

– α = 0.75
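The combination of aligned and non-aligned scores above is a simple linear interpolation. A minimal sketch, with hypothetical document identifiers and scores:

```python
def mixed_rsv(rsv_align, rsv_nonalign, alpha=0.75):
    """Interpolate the aligned-term and non-aligned-term scores of a document."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * rsv_align + (1.0 - alpha) * rsv_nonalign


# Hypothetical (RSV_align, RSV_nonalign) pairs for three documents.
scores = {"d11": (2.0, 1.0), "d21": (1.0, 2.0), "d31": (3.0, 0.0)}

# Rank the documents by their mixed score, best first.
ranking = sorted(scores, key=lambda doc: mixed_rsv(*scores[doc]), reverse=True)
print(ranking)
```

With α = 0.75 the aligned part dominates, so a document that matches only aligned terms (d31) can still outrank one with a high non-aligned score (d21).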

Page 9: A merging strategy proposal: The 2-step retrieval status value method

Mixed 2-step RSV

• Normalized mixed 2-step RSV method

– α = 0.75

Page 10: A merging strategy proposal: The 2-step retrieval status value method

Mixed 2-step RSV

• Learning-based algorithm

– Logistic regression

• α, β1, β2 and β3 must be estimated by using the iteratively re-weighted least squares method

– LVQ neural network (Martín et al. 2003)

Page 11: A merging strategy proposal: The 2-step retrieval status value method

Use machine translation to align words

• $P_{en}$ = "Pesticides in baby food"
– Unigrams $P_{en}$ = {Pesticides, baby, food}
– Bigrams $P_{en}$ = {Pesticides baby, baby food}

• The translated expression is:
– $EXP_{en}$ = {Pesticides in baby food} {Pesticides, baby, food} {Pesticides baby, baby food}

• Then we have:
– $P_{sp}$ = {Pesticidas alimento niños}
– Unigrams $P_{sp}$ = {Pesticidas, bebé, alimento} (Unigrams $P_{sp}$ is the translation of Unigrams $P_{en}$)
– Bigrams $P_{sp}$ = {Pesticidas bebés, alimento niños} (Bigrams $P_{sp}$ is the translation of Bigrams $P_{en}$)

Page 12: A merging strategy proposal: The 2-step retrieval status value method

Use machine translation to align words

• For each $word^{sp}_i \in$ Unigrams $P_{sp}$ do

– (a) if $word^{sp}_i \in P_{sp}$, then remove $word^{sp}_i$ from $P_{sp}$, and add $(word^{sp}_i, word^{en}_i)$ to the set of aligned words ALIGNED

• Thus, we obtain:
– $P_{sp}$ = {niños}
– ALIGNED = {(pesticidas, pesticides), (alimento, food)}

Page 13: A merging strategy proposal: The 2-step retrieval status value method

Use machine translation to align words

• For each bigram $bigram^{sp}_i \in$ Bigrams $P_{sp}$

– (a) if $(word^{sp}_1, word^{en}_1) \in$ ALIGNED ($word^{sp}_1$ is aligned with $word^{en}_1$) and $word^{sp}_2 \in P_{sp}$, then remove $word^{sp}_2$ from $P_{sp}$ and add $(word^{sp}_2, word^{en}_2)$ to the ALIGNED set.

– (b) if $(word^{sp}_1, word^{en}_2) \in$ ALIGNED and $word^{sp}_2 \in P_{sp}$, then remove $word^{sp}_2$ from $P_{sp}$ and add $(word^{sp}_2, word^{en}_1)$ to the ALIGNED set.

Page 14: A merging strategy proposal: The 2-step retrieval status value method

Use machine translation to align words

– (c) if $(word^{sp}_2, word^{en}_1) \in$ ALIGNED and $word^{sp}_1 \in P_{sp}$, then remove $word^{sp}_1$ from $P_{sp}$ and add $(word^{sp}_1, word^{en}_2)$ to the ALIGNED set.

– (d) if $(word^{sp}_2, word^{en}_2) \in$ ALIGNED and $word^{sp}_1 \in P_{sp}$, then remove $word^{sp}_1$ from $P_{sp}$ and add $(word^{sp}_1, word^{en}_1)$ to the ALIGNED set.

• $P_{sp} = \emptyset$
• ALIGNED = {(pesticidas, pesticides), (alimento, food), (niños, baby)}
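The unigram and bigram alignment rules (a) to (d) can be sketched in a few lines and checked against the "Pesticides in baby food" example. This is a sketch under stated assumptions, not the authors' implementation: the function name `align` and its argument layout are hypothetical, translated unigrams and bigrams are assumed to be paired with their English sources by position, words are lowercased, and the four bigram rules are tried in order with at most one applied per bigram.

```python
def align(phrase_sp, unigrams_sp, unigrams_en, bigrams_sp, bigrams_en):
    """Return (words of P_sp left unaligned, set of (spanish, english) pairs)."""
    remaining = set(phrase_sp)  # P_sp: words of the translated whole phrase
    aligned = set()             # ALIGNED

    # Unigram pass: a translated unigram that also appears in P_sp is aligned
    # with the English word it was translated from.
    for w_sp, w_en in zip(unigrams_sp, unigrams_en):
        if w_sp in remaining:
            remaining.discard(w_sp)
            aligned.add((w_sp, w_en))

    # Bigram pass: if one word of a bigram is already aligned, the other word
    # is aligned with the remaining English word of the source bigram.
    for (sp1, sp2), (en1, en2) in zip(bigrams_sp, bigrams_en):
        if (sp1, en1) in aligned and sp2 in remaining:    # rule (a)
            remaining.discard(sp2)
            aligned.add((sp2, en2))
        elif (sp1, en2) in aligned and sp2 in remaining:  # rule (b)
            remaining.discard(sp2)
            aligned.add((sp2, en1))
        elif (sp2, en1) in aligned and sp1 in remaining:  # rule (c)
            remaining.discard(sp1)
            aligned.add((sp1, en2))
        elif (sp2, en2) in aligned and sp1 in remaining:  # rule (d)
            remaining.discard(sp1)
            aligned.add((sp1, en1))
    return remaining, aligned


remaining, aligned = align(
    phrase_sp=["pesticidas", "alimento", "niños"],
    unigrams_sp=["pesticidas", "bebé", "alimento"],
    unigrams_en=["pesticides", "baby", "food"],
    bigrams_sp=[("pesticidas", "bebés"), ("alimento", "niños")],
    bigrams_en=[("pesticides", "baby"), ("baby", "food")],
)
print(remaining)
print(sorted(aligned))
```

In this run the second bigram triggers rule (b): "alimento" is already aligned with "food", so "niños" is aligned with "baby", which leaves no unaligned word in the Spanish phrase.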

Page 15: A merging strategy proposal: The 2-step retrieval status value method

Method conclusion

• Fully aligned words
– 2-step method

• Partially aligned words
– Raw mixed 2-step RSV
– Normalized mixed 2-step RSV
– Logistic regression mixed 2-step RSV
– Neural network mixed 2-step RSV

• Algorithm to align phrases and translations

Page 16: A merging strategy proposal: The 2-step retrieval status value method

Experiment

• Documents
– CLEF 2003 has two tasks, CLEF 2003-8 and CLEF 2003-4. CLEF 2003-4 is limited to four languages (English, French, German and Spanish).

• Query (Title + Description)

Page 17: A merging strategy proposal: The 2-step retrieval status value method

Experiment

• The documents are indexed with the Zprise IR system, using the OKAPI probabilistic model (fixed at b = 0.75 and k1 = 1.2)

• Translation strategies
– Machine Readable Dictionary (MRD, Babylon)
  • picking the first translation available (under the heading "Babylon 1") or the first two terms (indicated under the label "Babylon 2")
– Machine Translation (MT, Babelfish)
– Mixed MT and MRD
  • taking together the Babelfish and Babylon 1 translations.
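For the OKAPI weighting mentioned above, the slide gives only the parameter values b = 0.75 and k1 = 1.2. The sketch below shows where those two parameters enter the widely used BM25 term weight. The exact formula variant used by Zprise is not stated in the slides, so the IDF form and the function name here are assumptions.

```python
import math


def bm25_term_weight(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """BM25 weight of one query term in one document (a common textbook form).

    k1 controls how quickly repeated occurrences saturate; b controls how
    strongly the weight is normalized by document length.
    """
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)


# Hypothetical statistics: a term found in 10 of 100 documents.
average_doc = bm25_term_weight(tf=1, df=10, n_docs=100, doc_len=100, avg_doc_len=100)
long_doc = bm25_term_weight(tf=1, df=10, n_docs=100, doc_len=200, avg_doc_len=100)
print(average_doc, long_doc)
```

The same single occurrence is worth less in the longer document, which is the effect of the length normalization governed by b.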

Page 18: A merging strategy proposal: The 2-step retrieval status value method

Experiment 1 – multilingual results with fully aligned queries

Page 19: A merging strategy proposal: The 2-step retrieval status value method

Experiment 1 – multilingual results with fully aligned queries

Page 20: A merging strategy proposal: The 2-step retrieval status value method

Experiment 1 – analysis of failures

Too many documents from the Spanish collection for this query

Page 21: A merging strategy proposal: The 2-step retrieval status value method

Experiment 1 – analysis of failures

Page 22: A merging strategy proposal: The 2-step retrieval status value method

Experiment 2 – multilingual results with partially aligned queries

• Based on the MRD translation approach

Page 23: A merging strategy proposal: The 2-step retrieval status value method

Experiment 2 – multilingual results with partially aligned queries

• Based on the MRD translation approach

Page 24: A merging strategy proposal: The 2-step retrieval status value method

Experiment 2 – multilingual results with partially aligned queries

• Based on the MT translation approach

• With the CLEF 2001–2002 test collection and the CLEF2001+CLEF2002+CLEF2003 query set (160 queries, five languages: EN, SP, DE, FR, IT)

Page 25: A merging strategy proposal: The 2-step retrieval status value method

Conclusion

• Future effort
– Dealing with translation probabilities.
– Testing the method with other translation strategies such as the Multilingual Similarity Thesaurus.
– n-grams indexing.
– Continue studying strategies to deal with aligned and non-aligned term queries: the integration of both sorts of terms by means of Bayesian networks.