IRSAW – Towards Semantic Annotation for Question Answering
DESCRIPTION
Talk given at the CNI Spring 2007 Task Force Meeting, April 16–17, Phoenix, AZ

TRANSCRIPT
IRSAW – Towards Semantic Annotation of Documents for Question Answering
Johannes Leveling
CNI Spring 2007 Task Force Meeting, Apr. 16–17, 2007, Phoenix, AZ
IRSAW: Intelligent Information Retrieval on the Basis of a Semantically Annotated Web
Funded by the DFG (Deutsche Forschungsgemeinschaft)
Outline
1 Introduction
2 IRSAW: Basic Architecture, The MultiNet Paradigm, Processing Phases, Modules
3 Results
Johannes Leveling IRSAW – Towards Semantic Annotation of Documents for QA 2 / 30
Motivation
More and more information becomes available on the internet, but precise answers are difficult to find
→ Develop a semantically based question answering (QA) framework in the IRSAW project
General idea:
Retrieve documents from the internet,
Semantically analyze and annotate the question and documents, and
Apply deep linguistic methods for question answering on document content
Background and Embedding in a General Strategy
Semantically based natural language processing
Knowledge representation MultiNet (concept-oriented)
Support by a large, semantically oriented computational lexicon → word sense disambiguation
Homogeneous representation of lexical knowledge, general background knowledge (world knowledge), dialogue context, and the meaning of sentences and texts
Emphasis on semantic relations in all applications (not just concepts or descriptors)
NLI-Z39.50: Beyond Descriptor Search
Predecessor: a natural language interface for the Z39.50 protocol
Natural language interface to libraries and information providers on the internet
Transformation of semantic structures into expressions of formal retrieval languages
Includes features such as de-duplication of results, phonetic search, decomposition of compounds, and query expansion with additional concepts
Example query: Where do I find books by Peter Jackson which were published in the last ten years with Springer and Addison-Wesley?
IRSAW
Multi-staged approach
Questions are processed in three phases, accessing web search engines, local databases, and a semantic network knowledge base
Web documents → semantic annotation
Architecture of IRSAW
[Architecture diagram: a natural language question enters the IRSAW framework for question processing, while retrieved documents undergo preprocessing; the modules InSicht, QAP, and MIRA find answer candidates; their answer streams are merged, the candidates are validated (MAVE), scored and ranked, and the final answer candidate is selected.]
IRSAW – Methods and Modules (1/3)
Apply a parser (for the German language) to produce a semantic network representation of texts (based on the knowledge representation paradigm MultiNet)
→ Allows a full semantic interpretation of questions and documents, on which logical inferences are based (state of the art: mostly shallow methods)
IRSAW – Methods and Modules (2/3)
Combine different data streams containing answer candidates
→ Use different methods to produce answer streams to increase recall and robustness
Logically validate answers
→ Select validated answers from the streams of answer candidates to increase precision
IRSAW – Methods and Modules (3/3)
Natural language generation of answers
→ Allows for rephrasing from text and combination of answer fragments from different documents (state of the art: extracting snippets from the text)
IRSAW also aims at investigating linguistic phenomena in questions and documents (e.g. idioms, metonymy, and temporal and spatial aspects)
IRSAW – Software
The IRSAW project will result in two software components accessible via the internet:
1 The question answering system IRSAW
2 A web service for the semantic annotation of (web) documents
MultiNet: Example of a Semantic Network

In which year did Charles de Gaulle die? / In welchem Jahr starb Charles de Gaulle?

[Semantic network diagram: the dying event c6 (SUBS sterben, TEMP past.0) is linked via AFF to the person node c31 (SUB Mensch, fact real) and via TEMP to the questioned node c5 (wh-question, SUB Jahr). The person node carries ATTR links to c33 (SUB Nachname, VAL De_Gaulle) and c32 (SUB Vorname, VAL Charles); each node bears layer features such as gener sp, quant one, refer det, card 1, etype 0.]
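The example network can be held in simple data structures. The following is a hypothetical Python sketch, not the project's actual representation (WOCADI and the MultiNet tool chain use their own formats); node identifiers and feature names are taken from the example above.

```python
# Hypothetical sketch: the de Gaulle example network as labeled nodes
# and relation triples (illustrative only, not MultiNet's real format).
nodes = {
    "c6":  {"SUBS": "sterben", "TEMP_layer": "past.0"},   # the dying event
    "c31": {"SUB": "Mensch", "fact": "real", "card": 1},  # the person
    "c33": {"SUB": "Nachname", "VAL": "De_Gaulle"},       # surname attribute
    "c32": {"SUB": "Vorname", "VAL": "Charles"},          # first-name attribute
    "c5":  {"SUB": "Jahr", "role": "wh-question"},        # questioned year
}
edges = [
    ("c6", "AFF", "c31"),    # the event affects the person
    ("c6", "TEMP", "c5"),    # the event happened in the questioned year
    ("c31", "ATTR", "c33"),  # person -> surname
    ("c31", "ATTR", "c32"),  # person -> first name
]

def neighbours(node, relation):
    """All nodes reachable from `node` via `relation`."""
    return [t for s, r, t in edges if s == node and r == relation]

# The questioned node is the TEMP argument of the dying event:
print(neighbours("c6", "TEMP"))  # ['c5']
```

Answering the question then amounts to finding the node that fills the TEMP role of the event node, which is exactly what graph matching against a document network does.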
MultiNet: Tools and Resources
Syntactic-semantic parser: WOCADI (Word Class Controlled Disambiguating Parser)
Large semantic computational lexicon: HaGenLex (Hagen German Lexicon)
Workbench for the computer lexicographer: LiaPlus (Lexicon in Action)
IRSAW: First Processing Phase
Transform the user question into an IR query
Preselect information resources (→ broker)
Send the IR query to web search engines and web portals (→ lists of URLs)
Retrieve the referenced web documents and convert them
IRSAW: Second Processing Phase
Segment and index text passages from the web in a local database
Access units of textual information of certain types (chapters, paragraphs, sentences, or phrases)
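The segment-and-index step can be sketched as follows; this is a deliberately naive Python illustration (sentence splitting on periods, an in-memory inverted index), not IRSAW's actual database layer.

```python
# Hypothetical sketch of the second phase: segment a document into
# sentence-level passages and build an inverted index over them.
from collections import defaultdict

def segment(text):
    """Naive sentence segmentation on '.', kept simple for illustration."""
    return [s.strip() + "." for s in text.split(".") if s.strip()]

def build_index(passages):
    """Map each lowercase token to the set of passage ids containing it."""
    index = defaultdict(set)
    for i, p in enumerate(passages):
        for token in p.lower().strip(".").split():
            index[token].add(i)
    return index

passages = segment("De Gaulle starb 1970. Die Revolution war 1917.")
index = build_index(passages)
print(sorted(index["1970"]))  # [0]
```

Indexing at several granularities (chapter, paragraph, sentence, phrase) as described on the slide would simply mean running the same indexing over differently sized segments.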
IRSAW: Third Processing Phase
Employ different modules to produce data streams containing answer candidates:
QAP (Question Answering by Pattern Matching),
MIRA (Modified Information Retrieval Approach), and
InSicht
Merge, rank, and logically validate answer candidates and select the best answer (MAVE)
InSicht
Analyze text segments (question, texts) with WOCADI and return the representation of the meaning of a text as a semantic network
Expand queries with semantically related concepts
→ High recall
Paraphrase the answer node in the semantic network (generate the answer)
Match semantic networks
→ High precision
+ Co-reference resolution, logical inference rules / textual entailments
InSicht Logical Entailment
((rule
   ((subs ?n1 "ermorden.1.1")  ;; murder
    (aff ?n1 ?n2)
    ->
    (subs ?n3 "sterben.1.1")   ;; die
    (aff ?n3 ?n2)))
 (ktype categ)
 (name "ermorden.1.1_entailment"))
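Applying such an entailment rule amounts to forward chaining over the fact base. The following hypothetical Python sketch implements this one rule over relation triples; the triple encoding and the `entailed_` node naming are illustrative assumptions, not InSicht's actual inference engine.

```python
# Hypothetical sketch of applying the entailment above: whenever the fact
# base contains a murder event affecting some entity, add the entailed
# dying event for that entity. Triple format and node names are assumed.
def apply_ermorden_entailment(facts):
    """facts: set of (relation, arg1, arg2) triples; returns enlarged set."""
    derived = set(facts)
    # Find all event nodes that are instances of ermorden.1.1 ('to murder').
    murder_events = {e for (rel, e, concept) in facts
                     if rel == "subs" and concept == "ermorden.1.1"}
    # For each victim of a murder event, assert the entailed dying event.
    for (rel, event, victim) in facts:
        if rel == "aff" and event in murder_events:
            new_event = f"entailed_{event}"
            derived.add(("subs", new_event, "sterben.1.1"))  # 'to die'
            derived.add(("aff", new_event, victim))
    return derived

facts = {("subs", "c1", "ermorden.1.1"), ("aff", "c1", "c2")}
enlarged = apply_ermorden_entailment(facts)
print(("subs", "entailed_c1", "sterben.1.1") in enlarged)  # True
```

With this inference, a question about dying can be matched against a document that only mentions a murder.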
InSicht Example
User question: In which year did Charles de Gaulle die? / In welchem Jahr starb Charles de Gaulle?
Text passage: France's chief of state Jacques Chirac acknowledged the merits of general and statesman Charles de Gaulle, who died 25 years ago. / Frankreichs Staatschef Jacques Chirac hat die Verdienste des vor 25 Jahren gestorbenen Generals und Staatsmannes Charles de Gaulle gewürdigt. (SDA.951109.0236)
Answer: 1970 (deictic temporal expression resolved; document written in 1995)
QAP – Question Answering by Pattern Matching
Create patterns by processing known question-answer pairs
Search for text passages containing keywords from the question
Apply pattern matching to answer candidates
Extract the answer string from the variable binding
+ Robustness, high precision for a small class of questions
– No logical inferences possible
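The steps above can be sketched in a few lines of Python. This is an illustrative toy, not QAP itself: the hand-written year pattern stands in for patterns that QAP would learn from known question-answer pairs.

```python
# Hypothetical sketch of the pattern-matching idea behind QAP: a question
# class maps to surface patterns whose capture group binds the answer.
import re

# Hand-written pattern standing in for a learned one: a four-digit year.
YEAR_PATTERNS = [
    re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b"),
]

def qap_answer(keywords, passage):
    """Return a year string if the passage mentions all keywords, else None."""
    # Passage selection: require all question keywords to occur.
    if not all(k.lower() in passage.lower() for k in keywords):
        return None
    # Pattern matching: the capture group is the extracted answer string.
    for pattern in YEAR_PATTERNS:
        m = pattern.search(passage)
        if m:
            return m.group(1)
    return None

passage = "Die von der Russischen Revolution 1917 inspirierte Satire ..."
print(qap_answer(["Revolution"], passage))  # 1917
```

The precision-vs-coverage tradeoff on the slide is visible here: the pattern is reliable when it fires, but it only covers year questions and, as the next example shows, it has no notion of metonymy or ellipsis.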
QAP Example
User question: In which year was the Russian Revolution? / In welchem Jahr fand die russische Revolution statt?
Text passage: The satire inspired by the Russian Revolution 1917 lets the dream of liberty and equality fail because of humans. / Die von der Russischen Revolution 1917 inspirierte Satire läßt den Traum von Freiheit und Gleichheit an den Menschen scheitern. (FR940612-000533)
Answer: 1917 (the pattern matching subsystem ignores metonymy and ellipsis)
MIRA – Modified Information Retrieval Approach
Train a tagger on answer classes (LOC, PER, ORG)
Search for text passages containing keywords from the question
Use the tagger on the answer candidate sentences and select the most frequent word sequence
+ Highly recall-oriented
– Very low precision; works only for a small class of questions (with answer type LOC, PER, or ORG)
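MIRA's selection step can be sketched as follows; this hypothetical Python snippet assumes the tagger has already run (its output is faked as pre-tagged candidates) and only shows the frequency-based choice of the answer.

```python
# Hypothetical sketch of MIRA's selection step: among candidates that a
# tagger marked with the expected answer class, keep the most frequent.
from collections import Counter

def mira_select(tagged_candidates, answer_type):
    """tagged_candidates: list of (word_sequence, tag) pairs from the
    tagger; return the most frequent sequence of the expected type."""
    counts = Counter(seq for seq, tag in tagged_candidates
                     if tag == answer_type)
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Faked tagger output over several retrieved sentences:
candidates = [("Neil Armstrong", "PER"), ("Buzz Aldrin", "PER"),
              ("Neil Armstrong", "PER"), ("Mond", "LOC")]
print(mira_select(candidates, "PER"))  # Neil Armstrong
```

Frequency is a crude signal, which matches the slide's assessment: high recall, very low precision, and no applicability outside the trained answer classes.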
MIRA Example
User question: Who was the first man on the moon? / Wer war der erste Mensch auf dem Mond?
Text passage: Twenty-five years ago Neil Armstrong was the first man to step onto the moon, but today manned space flight stagnates. / Vor 25 Jahren betrat Neil Armstrong als erster Mensch den Mond, doch heute stagniert die bemannte Raumfahrt. (FR940724-001243)
Answer: Neil Armstrong (PER)
MAVE – MultiNet-based Answer Verification
Validate answer candidates
Test the logical validity of each answer candidate (using inferences, entailments)
Add heuristic quality indicators as a fallback strategy
Select the most trusted answer
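The selection logic can be sketched as a two-tier ranking. This hypothetical Python snippet is only an illustration of the idea (logical validation dominates, heuristic indicators serve as fallback); the candidate fields and scores are assumptions, not MAVE's actual model.

```python
# Hypothetical sketch of MAVE-style selection: prefer logically validated
# candidates; use a heuristic quality score as fallback / tie-breaker.
def mave_select(candidates):
    """candidates: list of dicts with 'answer', 'validated' (bool), and a
    heuristic 'score' in [0, 1]; return the most trusted answer."""
    # Tuple comparison: validation (True > False) dominates the score.
    best = max(candidates, key=lambda c: (c["validated"], c["score"]))
    return best["answer"]

candidates = [
    {"answer": "1971", "validated": False, "score": 0.9},
    {"answer": "1970", "validated": True,  "score": 0.6},
]
print(mave_select(candidates))  # 1970
```

The point of the tuple key is that a validated candidate beats any unvalidated one regardless of its heuristic score, which mirrors the slide's "fallback strategy" wording.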
Evaluations
InSicht evaluation: best performance on the monolingual German question answering task at the Cross-Language Evaluation Forum 2005 (QA@CLEF 2005)
IRSAW evaluation at QA@CLEF 2006, combining the InSicht and QAP answer streams:
one of the best results in the monolingual German QA track;
best results in the answer validation task with MAVE
IRSAW evaluation (for RIAO 2007): InSicht, QAP, and MIRA answer streams, plus logical validation with MAVE
→ better results with more answer streams and logical answer validation
Evaluation Results
Results for answer validation of answer candidates for 600 questions (InSicht: I, MIRA: M, QAP: Q; c = correct, i = inexact, w = wrong)

QA streams                  c      i      w
IRSAW: I                  199.4   10.9   15.7
IRSAW: I+M+Q              244.4   16.9  255.7
IRSAW: I+M+Q (Optimum)    290.0   15.0  215.0
Project Status
Implementation of the modules for the QA system IRSAW completed
Evaluations are now based on corpora of newspaper articles / Wikipedia (corpora of millions of sentences)
Semantic annotation ⇒ Semantic Web
Future Work
Next: evaluation on web pages (even bigger corpus, dynamic content)
Add more robustness
Add an acoustic interface (speech input)
Create an English prototype (?)
Selected References
Ingo Glöckner, Sven Hartrumpf, and Johannes Leveling. Logical validation, answer merging and witness selection – a case study in multi-stream question answering. To appear in: Proceedings of RIAO 2007, 2007.

Sven Hartrumpf and Johannes Leveling. University of Hagen at QA@CLEF 2006: Interpretation and normalization of temporal expressions. In Alessandro Nardi, Carol Peters, and José Luis Vicedo, editors, Results of the CLEF 2006 Cross-Language System Evaluation Campaign, Working Notes for the CLEF 2006 Workshop, Alicante, Spain, 2006.

Hermann Helbig. Knowledge Representation and the Semantics of Natural Language. Springer, Berlin, 2006.