TRANSCRIPT

AQUAINT Workshop, October 2005
AQUAINT Program

JAVELIN Project Briefing
Eric Nyberg, Teruko Mitamura, Jamie Callan, Robert Frederking, Jaime Carbonell,
Matthew Bilotti, Jeongwoo Ko, Frank Lin, Lucian Lita, Vasco Pedro, Andrew Schlaikjer,
Hideki Shima, Luo Si, David Svoboda
Language Technologies Institute, Carnegie Mellon University
Status Update
• Project Start: September 30, 2004 (now in Month 13)
• Last Six Months:
  – Initial CLQA system evaluated in NTCIR (English-Japanese, English-Chinese)
  – Multilingual Distributed IR evaluated in the CLEF competition
  – Initial Phase II English system entered in the TREC relationship track
Multilingual QA
JAVELIN Multilingual QA
• End-to-end systems for English-to-Chinese and English-to-Japanese QA
• Participated in the NTCIR-5 CLQA-1 (E-C, E-J) evaluation
  – http://www.slt.atr.jp/CLQA/
• The NTCIR-5 workshop will be held in Tokyo, Japan, December 6-9, 2005
NTCIR CLQA1 Task Overview
• EC, CC, CE, EJ, JE subtasks
  – Answers are named entities (e.g. person name, organization name, location, artifact, date, money, time)
  – We were the only team that participated in both the EC and EJ subtasks
• Question/answer data set
  – EC: 200 questions for training and the formal run
  – EJ: 300 questions for training and 200 for the formal run
• Corpus
  – EC: United Daily News 2000-2001 (466,564 articles)
  – EJ: Yomiuri Newspaper 2000-2001 (658,719 articles)
CLQA1 Evaluation Criteria
• Only the top answer candidate is judged, along with its supporting document
• Correct answers that were not properly supported by the returned document were judged "unsupported"
• An answer is incorrect even if a substring of it is correct
• Issue: we found that the gold-standard set of supporting documents is not complete
MLQA Architecture
[Architecture diagram: the JAVELIN module pipeline QA (Question Analyzer) → RS (Retrieval Strategist) → IX (Information Extractor) → AG (Answer Generator) together with the EM, with a Keyword Translator inserted after the QA. The RS searches English, Chinese, and Japanese corpora, each with its own index. Original modules/resources are distinguished from the new multilingual (ML) modules/resources.]
Example walkthrough (English question, Japanese corpus):
1. Input question: "How much did the Japan Bank for International Cooperation decide to loan to the Taiwan High-Speed Corporation?"
2. QA output: Answer Type = MONEY; Keywords = Bank for International Cooperation, Taiwan High-Speed Corporation, loan
3. Keyword Translator output: Answer Type = MONEY; Keywords = _____ (the keywords rendered in the target language)
4. RS output, a ranked document list:
   DocID = JY-20010705J1TYMCC1300010, Confidence = 44.01
   DocID = JY-20011116J1TYMCB1300010, Confidence = 42.95
   ...
5. IX output, answer candidates with supporting passages: Answer Candidate = _____, Confidence = 0.0718, Passage = _____
6. AG: cluster and re-rank the answer candidates (see the sketch after this walkthrough)
7. Final output: Answer = _____
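The clustering and re-ranking step (step 6) can be illustrated with a minimal sketch. The heuristics here (merging candidates by normalized surface form and summing their confidences, with dummy candidate data) are assumptions for illustration, not the actual JAVELIN AG algorithm:

```python
from collections import defaultdict

def cluster_and_rerank(candidates):
    """candidates: list of (answer_text, confidence, doc_id) tuples."""
    clusters = defaultdict(lambda: {"text": None, "score": 0.0, "docs": []})
    for text, conf, doc_id in candidates:
        key = " ".join(text.lower().split())  # naive surface-form normalization
        clusters[key]["text"] = text
        clusters[key]["score"] += conf        # confidences act as votes
        clusters[key]["docs"].append(doc_id)
    return sorted(clusters.values(), key=lambda c: c["score"], reverse=True)

# Dummy candidates: two surface variants of the same answer merge and
# outrank a singleton candidate.
ranked = cluster_and_rerank([
    ("candidate A", 0.0718, "doc-1"),
    ("Candidate  A", 0.0510, "doc-2"),
    ("candidate B", 0.0650, "doc-3"),
])
print(ranked[0]["text"], round(ranked[0]["score"], 4))  # candidate A 0.1228
```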
Formal Run Result
Subtask   No. of Participants   No. of Submissions   MAX       MIN     MEDIAN      AVE
EC        4                     8                    25 (33)   6 (8)   14.5 (19)   15.63 (19.75)
EJ        4                     11                   25 (31)   0 (0)   17 (18)     12.73 (14.61)

Only the top answer candidate is judged. Scores are numbers of correct answers; counts including unsupported answers are shown in parentheses.
With Partial Gold Standard Input
[Table: per-module results with partial gold-standard input; only its footnotes are recoverable here.]
a. Average precision of answer-type detection
b. Average precision of keyword translation over the 200 formal-run questions
c. Average precision of document retrieval; counted if a correct document was ranked 1st-15th
d. Average precision of answer extraction; counted if a correct answer was ranked 1st-100th
e. The MRR measure of IX performance, calculated by averaging the reciprocal rank of each answer
f. Overall accuracy of the system
g. Accuracy including unsupported answers
• The QA (Question Analyzer) and RS (Retrieval Strategist) have relatively high accuracy
• Translation accuracy greatly affects overall accuracy
  – Accuracy in the RS increased by 26.5% in EC and 22.5% in EJ
  – If unsupported answers are considered, there is a 10.5% improvement in accuracy for EC and 2.5% for EJ
  – We found correct documents that are not in the gold-standard set
• There is room for improvement in the IX
  – Raise accuracy and reduce noise
  – Average precision of answer extraction is calculated by counting correct answers ranked 1st-100th
  – The MRR measure of IX performance is calculated by averaging the reciprocal rank of each answer (a minimal sketch follows this list)
• The validation function in the AG is crucial
  – Filter out noise in the IX output
  – Boost the rank of the correct answer
  – Only the topmost answer candidate is judged at the end, so there is a big accuracy drop
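For reference, a minimal sketch of the MRR computation mentioned above, under the usual convention that each question contributes the reciprocal rank of its first correct answer, or 0 when no correct answer is returned:

```python
def mean_reciprocal_rank(ranks):
    """ranks: per question, the 1-based rank of the first correct answer,
    or None if no correct answer was returned."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

# Four questions with correct answers at ranks 1, 3, (none), 2:
print(mean_reciprocal_rank([1, 3, None, 2]))  # (1 + 1/3 + 0 + 1/2) / 4 = 0.4583...
```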
Next Steps for Multilingual QA
• Improve translation of keywords for E-C and E-J (e.g. named-entity translation)
• Improve extraction using syntactic and semantic information in Chinese and Japanese (e.g. use of CaboCha)
• Improve the validation function in the AG
• Upcoming evaluation(s):
  – NTCIR CLQA-2, if available in 2006
  – AQUAINT E-C definition question pilot, when training/test data is available
• Integrate with Distributed IR (next slides)
Current Multilingual QA Systems
[Diagram: English questions feed three separate systems: English QA over an English corpus, Chinese CLQA over a Chinese corpus (answers in Chinese), and Japanese CLQA over a Japanese corpus (answers in Japanese).]
Three separate systems, no distributed IR.
Future Vision
[Diagram: English questions enter a single integrated system in which English QA, Chinese CLQA, and Japanese CLQA share a Distributed IR layer over the English, Chinese, and Japanese collections, returning answers in Chinese and Japanese.]
A single, integrated system with distributed IR.
Multilingual Distributed Information Retrieval
What Is Distributed IR?
• A method of searching across multiple full-text search engines
  – Also called "federated search" or searching "the hidden Web"
• Important when relevant information is scattered across many search engines
  – Within an organization
  – On the Web
  – Which ones have the information you need?
Many Search Engines Don't Speak English
Multilingual Distributed IR: Recent Progress
Research: extend monolingual algorithms to multilingual environments
• Multilingual query-based sampling
  – Monolingual corpora
• Multilingual result-merging
  – Given retrieval results in N languages, produce a single multilingual ranked list
Evaluation
• CLEF Multi-8 Ad-hoc Retrieval task
  – English (2), Spanish (1), French (2), Italian (2), Swedish (1), German (2), Finnish (1), Dutch (2)
• Why CLEF?
  – More languages than NTCIR (more languages is harder)
  – CLEF is focusing on result-merging this year
    • Models uncooperative environments, where we have no control over individual search engines
CLEF 2005: Two Cross-Lingual Retrieval Tasks
• Usual ad-hoc cross-lingual retrieval
  – Cooperative search engines, under our control
  – English queries, documents in 8 languages, 8 search engines
• Multilingual results-merging task
  – Uncooperative search engines, nothing under our control
    • We get only ranked lists of documents from each engine
  – We treat the task as a multilingual federated search problem
    • Documents in language l are stored in search engine s
    • Minimize the cost of downloading, indexing, and translating documents
CLEF 2005: Usual Ad-hoc Cross-lingual Retrieval
For each query:
1. Four distinct retrieval methods r
   – Translate the English query into the target language, with and without pseudo-relevance feedback
   – Translate all documents into English, with and without pseudo-relevance feedback
   – Lemur search engine
2. Combine all results from each method r into a multilingual result
3. Combine the results from all methods into a final result
Use training data to maximize combination accuracy (a sketch of the combination step follows).
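A minimal sketch of the combination step. The slides do not specify the combination model, so the per-method min-max score normalization and the trained per-method weights below are assumptions for illustration:

```python
def normalize(results):
    """results: dict doc_id -> raw score from one retrieval method."""
    lo, hi = min(results.values()), max(results.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in results.items()}

def combine(method_results, method_weights):
    """Weighted sum of normalized scores, one results dict per method."""
    combined = {}
    for results, w in zip(method_results, method_weights):
        for doc, s in normalize(results).items():
            combined[doc] = combined.get(doc, 0.0) + w * s
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

# Two methods with weights tuned on training queries (values made up):
final = combine(
    [{"d1": 12.0, "d2": 7.5}, {"d1": 0.8, "d3": 0.6}],
    [0.6, 0.4],
)
```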
CLEF 2005: Cross-lingual Results Merging Task
For each query:
1. Download a few top-ranked documents from each source
2. Create "comparable scores" for each downloaded document by combining the results of the four methods (previous slide)
3. For each downloaded document we then have the pair <source search-engine score, comparable score>
4. Train language-specific, query-specific logistic models to transform any source-specific score into a comparable score
5. Estimate comparable scores for all ranked documents from each source
6. Merge documents by their comparable scores (a sketch of steps 4-6 follows)
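A minimal sketch of steps 4-6. The exact logistic form and the least-squares fit via scipy's curve_fit are assumptions for illustration; the actual system trains its own language- and query-specific logistic models:

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(x, a, b):
    return 1.0 / (1.0 + np.exp(-(a * x + b)))

def fit_transform(train_pairs):
    """train_pairs: [(source_score, comparable_score), ...] from the few
    downloaded documents of one source engine for one query (step 3)."""
    xs, ys = map(np.array, zip(*train_pairs))
    (a, b), _ = curve_fit(logistic, xs, ys, p0=[1.0, 0.0], maxfev=5000)
    return lambda s: float(logistic(s, a, b))

def merge(sources):
    """sources: list of (train_pairs, ranked) per engine, where ranked is
    the engine's full list of (doc_id, source_score) pairs (steps 5-6)."""
    merged = []
    for train_pairs, ranked in sources:
        to_comparable = fit_transform(train_pairs)  # step 4
        merged += [(doc, to_comparable(s)) for doc, s in ranked]
    return sorted(merged, key=lambda kv: kv[1], reverse=True)
```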
Multilingual Distributed IR: CLEF Results
Mean Average Precision (MAP) across 40 queries:

Task                             Our Best Run   Other Best Run   Median Run
Ad hoc Cross-lingual Retrieval   0.449          0.333            0.261
Result Merging                   0.419          0.329            0.298
Extending JAVELIN with Domain Semantics
[Architecture diagram: off-line indexing runs a Text Annotator (IdentiFinder, ASSERT, MXTerminator) over the corpus, populating an Ontology/Annotations Database and a Semantic Index. At question time the Question Analyzer produces the question's key predicates, the Retrieval Strategist retrieves candidate predicates, the Information Extractor produces a ranked predicate list, and the Answer Generator returns answer passages.]
Annotation example: "John S. gave Mary an orchid for her birthday."
Stages: basic tokens, NE tagger, entity tagger, semantic parser, reference resolver, verb expansion, unified terms, predicate structure formation.
All tags are stand-off annotations stored in a relational data model (a minimal schema sketch follows).
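To make the stand-off idea concrete, here is a minimal relational sketch: annotations live in their own table and point into the untouched text by character offsets, so multiple annotators can layer tags over one source. The schema is hypothetical, not JAVELIN's actual data model:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (doc_id INTEGER PRIMARY KEY, text TEXT);
CREATE TABLE annotations (
    ann_id    INTEGER PRIMARY KEY,
    doc_id    INTEGER REFERENCES documents(doc_id),
    start_off INTEGER,   -- character offset, inclusive
    end_off   INTEGER,   -- character offset, exclusive
    layer     TEXT,      -- e.g. 'ne', 'predicate', 'role'
    label     TEXT       -- e.g. 'PERSON', 'give', 'ARG0'
);
""")
text = "John S. gave Mary an orchid for her birthday."
conn.execute("INSERT INTO documents VALUES (1, ?)", (text,))
conn.executemany(
    "INSERT INTO annotations (doc_id, start_off, end_off, layer, label)"
    " VALUES (1, ?, ?, ?, ?)",
    [(0, 7, "ne", "PERSON"),        # 'John S.'
     (13, 17, "ne", "PERSON"),      # 'Mary'
     (8, 12, "predicate", "give"),  # 'gave'
     (0, 7, "role", "ARG0"),        # giver
     (13, 17, "role", "ARG2")],     # recipient
)
for start, end, label in conn.execute(
        "SELECT start_off, end_off, label FROM annotations WHERE layer='ne'"):
    print(label, repr(text[start:end]))
```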
Retrieval on Predicate-Argument Structure
[Pipeline diagram: Input Question → Question Analysis → Document Retrieval → Answer Extraction → Post-Processing → Output Answers]
Worked example:
1. Input question: "Who did Smith meet?"
2. Question Analysis produces a predicate-argument template: meet(ARG0 = Smith, ARG1 = ?x)
3. The IR engine sees the template rather than bare keywords. Some retrieved documents:
   – "Frank met Alice. Smith dislikes Bob."
   – "Smith met Jones."
4. The template is matched against predicate instances stored in an RDBMS:
   – meet(ARG0 = Frank, ARG1 = Alice): no match
   – dislikes(ARG0 = Smith, ARG1 = Bob): no match (wrong predicate)
   – meet(ARG0 = Smith, ARG1 = Jones): matching predicate instance
5. Answer: "Jones" (see the SQL sketch below)
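A minimal sketch of the RDBMS matching step, using a hypothetical predicates table (this schema is an illustration, not JAVELIN's actual one). The template meet(ARG0 = Smith, ARG1 = ?x) becomes a SELECT in which the known arguments are bound and the free variable is the selected column:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE predicates (doc_id TEXT, target TEXT, arg0 TEXT, arg1 TEXT)")
conn.executemany("INSERT INTO predicates VALUES (?, ?, ?, ?)", [
    ("d1", "meet", "Frank", "Alice"),
    ("d1", "dislikes", "Smith", "Bob"),
    ("d2", "meet", "Smith", "Jones"),
])
# meet(ARG0 = Smith, ARG1 = ?x): bind target and arg0, select arg1.
rows = conn.execute(
    "SELECT arg1, doc_id FROM predicates WHERE target = ? AND arg0 = ?",
    ("meet", "Smith")).fetchall()
print(rows)  # [('Jones', 'd2')] -> answer "Jones"
```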
Preliminary Results: TREC 2005 Relationship QA Track
• Partial system:
  – Semantic indexing not fully integrated
  – Question analysis module incomplete
• Our goal: measure the ability to retrieve relevant nuggets
• Submitted a second run with manual predicate bracketing of the questions
• Results (MRR over relevant nuggets):
  – Run 1: 0.1356
  – Run 2: 0.5303
Example: Question Analysis
Topic: "The analyst is interested in Iraqi oil smuggling. Specifically, is Iraq smuggling oil to other countries, and if so, which countries? In addition, who is behind the Iraqi oil smuggling?"
Extracted predicate-argument structures:
– interested(ARG0 = the analyst, ARG1 = Iraqi oil smuggling)
– smuggling(ARG0 = Iraq, ARG1 = oil, ARG2 = other countries)
– smuggling(ARG0 = Iraq, ARG1 = oil, ARG2 = which countries)
– is behind(ARG0 = who, ARG1 = the Iraqi oil smuggling)
Example: Results
Topic: as above (Iraqi oil smuggling). Sample retrieved nuggets:
1. "The amount of oil smuggled out of Iraq has doubled since August last year, when oil prices began to increase," Gradeck said in a telephone interview Wednesday from Bahrain.
2. U.S.: Russian Tanker Had Iraqi Oil. By ROBERT BURNS, AP Military Writer. WASHINGTON (AP) – Tests of oil samples taken from a Russian tanker suspected of violating the U.N. embargo on Iraq show that it was loaded with petroleum products derived from both Iranian and Iraqi crude, two senior defense officials said.
5. With no American or allied effort to impede the traffic, between 50,000 and 60,000 barrels of Iraqi oil and fuel products a day are now being smuggled along the Turkish route, Clinton administration officials estimate.
(7 of 15 nuggets judged relevant)
Next Steps
• Better question analysis
  – Retrain an ASSERT-style annotator, or incorporate rule-based NLP from HALO (KANTOO)
• Semantic indexing and retrieval
  – Moving to Indri allows exact representation of our predicate structure in the index and in queries
• Ranking retrieved predicate instances
  – Aggregating information across documents
• Extracting answers from predicate-argument structure
Key Predicates Using Event Semantics from a Domain Ontology
• possess is a precondition of operate, export, …
• possess is a postcondition of assemble, buy, …
• More useful passages are matched (a sketch of the expansion follows)
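A minimal sketch of how such precondition/postcondition links might expand a key predicate into related event predicates before retrieval; encoding the ontology as plain dictionaries is an assumption for illustration:

```python
# If a question asks about 'possess', passages about events that require
# or produce possession are also useful, so the key predicate set grows.
PRECONDITION_OF = {"possess": ["operate", "export"]}   # possess enables these
POSTCONDITION_OF = {"possess": ["assemble", "buy"]}    # these yield possession

def expand_key_predicates(predicate):
    related = {predicate}
    related.update(PRECONDITION_OF.get(predicate, []))
    related.update(POSTCONDITION_OF.get(predicate, []))
    return related

print(sorted(expand_key_predicates("possess")))
# ['assemble', 'buy', 'export', 'operate', 'possess']
```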
Improved Results
[Chart comparing results across the predicates assemble, operate, install, develop, export, import, and manufacture.]
Indexing of Predicate Structures Implemented Using Indri (October '05)
(web demo available)
Example query:
  #combine[predicate]( buy.target #any:gpe.arg0 weapon.arg1 )
This requests predicate extents whose target verb is "buy", whose ARG0 is any geopolitical entity (GPE), and whose ARG1 matches "weapon".
Some Recent Papers
• E. Nyberg, R. Frederking, T. Mitamura, M. Bilotti, K. Hannan, L. Hiyakumoto, J. Ko, F. Lin, L. Lita, V. Pedro, and A. Schlaikjer, "JAVELIN I and II Systems at TREC 2005", notebook paper submitted to TREC 2005.
• F. Lin, H. Shima, M. Wang, and T. Mitamura, "CMU JAVELIN System for NTCIR5 CLQA1", to appear in Proceedings of the 5th NTCIR Workshop.
• L. Si and J. Callan, "Modeling Search Engine Effectiveness for Federated Search", Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Salvador, Brazil.
• L. Si and J. Callan, "CLEF 2005: Multilingual Retrieval by Combining Multiple Multilingual Ranked Lists", Sixth Workshop of the Cross-Language Evaluation Forum (CLEF 2005), Vienna, Austria.
• E. Nyberg, T. Mitamura, R. Frederking, V. Pedro, M. Bilotti, A. Schlaikjer, and K. Hannan (2005), "Extending the JAVELIN QA System with Domain Semantics", to appear in Proceedings of AAAI 2005 (Workshop on Question Answering in Restricted Domains).
• L. Hiyakumoto, L. V. Lita, and E. Nyberg (2005), "Multi-Strategy Information Extraction for Question Answering", FLAIRS 2005, to appear.
http://www.cs.cmu.edu/~ehn/JAVELIN
Questions?