Flexible and Efficient Toolbox for Information Retrieval
MIRACLE group
José Miguel Goñi-Menoyo (UPM), José Carlos González-Cristóbal (UPM-Daedalus), Julio Villena-Román (UC3M-Daedalus)




Our approach

- New Year's Resolution: work with all languages in CLEF (adhoc, image, web, geo, iclef, qa…)
- Wish list:
  - Language-dependent stuff
  - Language-independent stuff
  - Versatile combination
  - Fast
  - Simple for non-computer scientists
- Not to reinvent the wheel again every year!
- Approach: Toolbox for information retrieval


Agenda

- Toolbox
- 2005 Experiments
- 2005 Results
- 2006 Homework


Toolbox Basics

- Toolbox made of small one-function tools
- Processing as a pipeline (borrowed from Unix): each tool combination leads to a different run approach
- Shallow I/O interfaces:
  - tools in several programming languages (C/C++, Java, Perl, PHP, Prolog…),
  - with different design approaches, and
  - from different sources (own development, downloading, …)
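The pipeline idea can be sketched as a chain of small one-function tools. This is only an illustration, not the MIRACLE code: the tool names (`tokenize`, `lowercase`, `drop_stopwords`), the tiny stop-word list and the `pipeline` helper are all assumptions made for the example.

```python
from functools import reduce
from typing import Callable, List

Tool = Callable[[List[str]], List[str]]

# Hypothetical stop-word list, used only in this example.
STOPWORDS = {"the", "of", "in", "a"}

def tokenize(text: str) -> List[str]:
    """Split raw text on whitespace (a stand-in for a real tokenizer)."""
    return text.split()

def lowercase(tokens: List[str]) -> List[str]:
    return [t.lower() for t in tokens]

def drop_stopwords(tokens: List[str]) -> List[str]:
    return [t for t in tokens if t not in STOPWORDS]

def pipeline(*tools: Tool) -> Tool:
    """Chain tools left to right, like a Unix pipe."""
    def run(tokens: List[str]) -> List[str]:
        return reduce(lambda acc, tool: tool(acc), tools, tokens)
    return run

# Two different tool combinations give two different "runs".
run_a = pipeline(lowercase, drop_stopwords)
run_b = pipeline(lowercase)
```

Because each tool only consumes and produces a token list, swapping or reordering tools defines a new experiment without touching the tools themselves.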


MIRACLE Tools

- Tokenizer:
  - pattern matching: isolate punctuation; split sentences, paragraphs, passages
  - identifies some entities: compounds, numbers, initials, abbreviations, dates
  - extracts indexing terms
  - own development (written in Perl) or "outsourced"
- Proper noun extraction
  - Naive algorithm: uppercase words unless stop-word, stop-clef or verb/adverb
- Stemming: generally "outsourced"
- Transforming tools: lowercase, accents and diacritical characters are normalized, transliteration
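The naive proper noun rule and the transforming step can be sketched as follows. This is a minimal sketch under stated assumptions: the word lists are made up, the verb/adverb check is left out because it needs a lexicon the slides do not describe, and the function names are hypothetical.

```python
import unicodedata
from typing import List

# Hypothetical word lists; real ones would be language-specific resources.
STOPWORDS = {"the", "in", "of", "about"}
STOPCLEFS = {"find", "documents", "relevant"}

def normalize(token: str) -> str:
    """Lowercase a token and strip accents and diacritics."""
    decomposed = unicodedata.normalize("NFD", token.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def proper_nouns(tokens: List[str]) -> List[str]:
    """Keep capitalized tokens unless they are stop-words or stop-clefs."""
    excluded = STOPWORDS | STOPCLEFS
    return [t for t in tokens
            if t[:1].isupper() and normalize(t) not in excluded]
```

The capitalized topic word "Find" is dropped as a stop-clef, while names such as "Lisbon" survive the filter.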


More MIRACLE Tools

- Filtering tools:
  - stop-words and stop-clefs
  - phrase pattern filter (for topics)
- Automatic translation issues: "outsourced" to available on-line resources or desktop applications
  - Bultra (En→Bu), Webtrance (En→Bu), AutTrans (Es→Fr, Es→Pt), MoBiCAT (En→Hu)
  - Systran, BabelFish Altavista, Babylon, FreeTranslation, Google Language Tools
  - InterTrans, WordLingo, Reverso
- Semantic expansion
  - EuroWordNet
  - own resources for Spanish
- The philosopher's stone: indexing and retrieval system
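A phrase pattern filter for topics could look like the sketch below. The actual pattern list is not given in the slides, so the two regular expressions and the name `filter_topic` are assumptions chosen only to show the mechanism.

```python
import re

# Hypothetical boilerplate phrases that appear in topic statements.
PHRASE_PATTERNS = [
    re.compile(r"\bfind documents about\b", re.IGNORECASE),
    re.compile(r"\brelevant documents (?:will|should)\b", re.IGNORECASE),
]

def filter_topic(text: str) -> str:
    """Remove boilerplate phrases from a topic and collapse whitespace."""
    for pattern in PHRASE_PATTERNS:
        text = pattern.sub(" ", text)
    return " ".join(text.split())
```

Text that matches no pattern passes through unchanged, so the filter is safe to place anywhere in a topic-processing chain.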


Indexing and Retrieval System

- Implements boolean, vectorial and probabilistic BM25 retrieval models
  - Only BM25 was used in CLEF 2005
  - Only the OR operator was used for terms
- Native support for UTF-8 (and other) encodings
  - No transliteration scheme is needed
  - Good results for Bulgarian
- More efficiency achieved than with previous engines
  - Several orders of magnitude in indexing time
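BM25 scoring with OR semantics can be sketched as below. The slides do not state the parameter values or the exact IDF variant used by the engine, so `k1=1.2`, `b=0.75` and the smoothed IDF here are common textbook choices taken as assumptions, and `bm25_rank` is a hypothetical name.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

def bm25_rank(query: List[str], docs: Dict[str, List[str]],
              k1: float = 1.2, b: float = 0.75) -> List[Tuple[str, float]]:
    """Rank docs (id -> token list) for a token-list query, OR semantics."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs.values()) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs.values() for term in set(d))
    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in set(query):
            if tf[term] == 0:
                continue  # OR: a missing term simply contributes nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

A document matching any query term is retrieved; documents matching more terms, or matching them more often, accumulate a higher score.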


Trie-based index

Example terms: calm, cast, coating, coat, money, monk, month
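A trie stores terms that share a prefix along a common path, so "coat" and "coating" reuse the same first four nodes. The sketch below uses Python dictionaries for the child links; the class names and the `postings` field are assumptions for illustration, not the layout of the MIRACLE index.

```python
from typing import Dict, List, Optional

class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        # Set only on nodes where a complete indexed term ends.
        self.postings: Optional[List[str]] = None

class Trie:
    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, term: str, posting: str) -> None:
        """Walk or create one node per character, then record the posting."""
        node = self.root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        if node.postings is None:
            node.postings = []
        node.postings.append(posting)

    def lookup(self, term: str) -> List[str]:
        """Return the postings of an exact term, or an empty list."""
        node = self.root
        for ch in term:
            node = node.children.get(ch)
            if node is None:
                return []
        return node.postings or []
```

Lookup cost depends on the length of the term, not on the size of the vocabulary.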


1st course implementation: linked arrays

Example terms: calm, cast, coating, coat, money, monk, month


Efficient tries: avoiding empty cells

Example terms: abacus, abet, ace, baby, be, beach, bee
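To see why empty cells matter, the sketch below builds a linked-array trie in which every node is a fixed array with one cell per letter, then counts how many cells are actually used. The 26-letter alphabet and the function names are assumptions; the compact layout used in the actual system is not described in this transcript. End-of-word markers are omitted because only cell usage is measured.

```python
from typing import List, Optional, Tuple

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def build_array_trie(words: List[str]) -> List[List[Optional[int]]]:
    """Each node is an array of len(ALPHABET) cells; None marks an empty cell."""
    nodes: List[List[Optional[int]]] = [[None] * len(ALPHABET)]  # node 0 = root
    for word in words:
        current = 0
        for ch in word:
            cell = ALPHABET.index(ch)
            if nodes[current][cell] is None:
                nodes.append([None] * len(ALPHABET))
                nodes[current][cell] = len(nodes) - 1
            current = nodes[current][cell]
    return nodes

def cell_usage(nodes: List[List[Optional[int]]]) -> Tuple[int, int]:
    """Return (used cells, total cells) over all node arrays."""
    total = sum(len(node) for node in nodes)
    used = sum(1 for node in nodes for cell in node if cell is not None)
    return used, total
```

For the seven example terms this builds 20 nodes with 520 cells, of which only 19 hold a link; a representation that stores only the occupied cells avoids the other 501.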


Basic Experiments

- S: Standard sequence (tokenization, filtering, stemming, transformation)
- N: Non stemming
- R: Use of narrative field in topics
- T: Ignore narrative field
- r1: Pseudo-relevance feedback (with 1st retrieved document)
- P: Proper noun extraction (in topics)

SR, ST, r1SR, NR, NT, NP


Paragraph indexing

- H: Paragraph indexing
  - docpars (document paragraphs) are indexed instead of docs
  - term → doc1#1, doc69#5 …
  - combination of docpars relevance:

$$\mathrm{rel}_N = \mathrm{rel}_{mN} + \frac{\alpha}{n} \sum_{j \neq m} \mathrm{rel}_{jN}$$

  - n = paragraphs retrieved for doc N
  - rel_jN = relevance of paragraph j of doc N
  - m = paragraph with maximum relevance
  - α = 0.75 (experimental)

HR, HT
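The docpars combination formula translates directly into a few lines of code. The function name `combine_docpars` is hypothetical, and the input is assumed to be the non-empty list of scores of the paragraphs retrieved for one document.

```python
from typing import List

def combine_docpars(paragraph_scores: List[float], alpha: float = 0.75) -> float:
    """Document relevance = best paragraph + (alpha / n) * sum of the others."""
    n = len(paragraph_scores)          # paragraphs retrieved for the document
    best = max(paragraph_scores)       # rel of paragraph m
    rest = sum(paragraph_scores) - best
    return best + alpha / n * rest
```

For example, paragraph scores 2.0, 1.0 and 1.0 give 2.0 + (0.75 / 3) * 2.0 = 2.5, while a document with a single retrieved paragraph keeps that paragraph's score.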


Combined experiments

- "Democratic system": documents with good score in many experiments are likely to be relevant
- a: Average: merging of several experiments, adding relevance
- x: WDX - asymmetric combination of two experiments:
  - First (more relevant) non-weighted D documents from run A
  - Rest of documents from run A, with W weight
  - All documents from run B, with X weight
  - Relevance re-sorting
  - Mostly used for combining base runs with proper noun runs

aHRSR, aHTST, xNP01HR1, xNP01r1SR1
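Both combination schemes can be sketched over runs represented as dictionaries from document id to relevance. Several details are assumptions, since the slides do not specify them: a document missing from a run counts as 0 in the average, and in WDX a document present in both runs gets the sum of its two weighted scores. The function names are hypothetical.

```python
from typing import Dict, List, Tuple

Run = Dict[str, float]
Ranking = List[Tuple[str, float]]

def _resort(scores: Run) -> Ranking:
    """Relevance re-sorting: highest score first."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

def average(*runs: Run) -> Ranking:
    """Add the relevance of each document over all runs, divided by run count."""
    docs = set().union(*runs)
    return _resort({d: sum(r.get(d, 0.0) for r in runs) / len(runs) for d in docs})

def wdx(run_a: Run, run_b: Run, w: float, d: int, x: float) -> Ranking:
    """Top d docs of run A unweighted, rest of A times w, all of B times x."""
    merged: Run = {}
    for rank, (doc, score) in enumerate(_resort(run_a)):
        merged[doc] = score if rank < d else w * score
    for doc, score in run_b.items():
        merged[doc] = merged.get(doc, 0.0) + x * score
    return _resort(merged)
```

The asymmetry lets a trusted base run keep its top documents in place while a second run, such as a proper noun run, can only promote documents further down the list.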


Multilingual merging

- Standard approaches for merging:
  - No normalization and relevance re-sorting
  - Standard normalization and relevance re-sorting
  - Min-max normalization and relevance re-sorting
- Miracle approach for merging:
  - The number of docs selected from a collection (language) is proportional to the average relevance of its first N docs (N=1, 10, 50, 125, 250, 1000). Then one of the standard approaches is used
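The two ingredients of the merging step can be sketched as follows: min-max normalization of one collection's scores, and a per-language quota proportional to the average relevance of the first N documents. The function names, the rounding of quotas to whole documents and the handling of constant score lists are assumptions; each run is assumed to be a non-empty score list sorted in descending order.

```python
from typing import Dict, List

def min_max(scores: List[float]) -> List[float]:
    """Rescale scores linearly so the lowest maps to 0 and the highest to 1."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0 for _ in scores]  # assumption: constant lists map to 1.0
    return [(s - lo) / (hi - lo) for s in scores]

def quotas(runs: Dict[str, List[float]], n: int, total: int) -> Dict[str, int]:
    """Docs to take per language, proportional to mean score of its top n."""
    avg = {lang: sum(s[:n]) / len(s[:n]) for lang, s in runs.items()}
    norm = sum(avg.values())
    return {lang: round(total * a / norm) for lang, a in avg.items()}
```

After the quotas are fixed, the selected documents from each language would be merged with one of the standard approaches, for instance min-max normalization followed by relevance re-sorting.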


Results

We performed…

… countless experiments!

(just for the adhoc task)


Monolingual Bulgarian

Stemmer (UTF-8): Neuchâtel

Rank: 4th


Bilingual English→Bulgarian

(83% of monolingual)

En→Bu: Bultra, Webtrance

Rank: 1st


Monolingual Hungarian

Stemmer: Neuchâtel

Rank: 3rd


Bilingual English→Hungarian

(87% of monolingual)

En→Hu: MoBiCAT

Rank: 1st


Monolingual French

Stemmer: Snowball

Rank: >5th


Bilingual English→French

(79% of monolingual)

En→Fr: Systran

Rank: 5th


Bilingual Spanish→French

(81% of monolingual)

Es→Fr: ATrans, Systran

(Rank: 5th)


Monolingual Portuguese

Stemmer: Snowball

Rank: >5th (4th)


Bilingual English→Portuguese

(55% of monolingual)

En→Pt: Systran

Rank: 3rd


Bilingual Spanish→Portuguese

(88% of monolingual)

Es→Pt: ATrans

(Rank: 2nd)


Multilingual-8 (En, Es, Fr)

Rank: 2nd [Fr, En], 3rd [Es]


Conclusions and homework

- Toolbox = "imagination is the limit"
  - Focus on interesting linguistic things instead of boring text manipulation
  - Reusability (half of the work is done for next year!)
- Keys for good results:
  - Fast IR engine is essential
  - Native character encoding support
  - Topic narrative
  - Good translation engines make the difference
- Homework:
  - further development on system modules, fine tuning
  - Spanish, French, Portuguese…