TRANSCRIPT
XML Retrieval: A content-oriented perspective
Mounia Lalmas
Department of Computer Science
Queen Mary, University of London
Outline
Part I - Content-oriented XML retrieval
Part II - Evaluating content-oriented XML retrieval
XML Retrieval: Motivation
XML is able to represent a mixture of “structured” and textual (“unstructured”) information.
XML applications: digital libraries, content management.
XML repositories: IEEE INEX collection, LexisNexis, the Library of Congress collection.
XML Retrieval: DB and IR views
Data-centric view (DB)—XML as exchange format for structured data
Document-centric view (IR)—XML as format for representing the logical structure of documents
Now increasingly both views (DB+IR)
Data Centric XML Documents: Example
<CLASS name="DCS317" num_of_std="100">
  <LECTURER lecid="111">Thomas</LECTURER>
  <STUDENT marks="70" origin="Oversea">
    <NAME>Tassos</NAME>
  </STUDENT>
  <STUDENT marks="30" origin="EU">
    <NAME>Christof</NAME>
  </STUDENT>
</CLASS>
Document Centric XML Documents: Example
<CLASS name="DCS317" num_of_std="100">
  <LECTURER lecid="111">Thomas</LECTURER>
  <STUDENT stuid="007">
    <NAME>James Bond</NAME> is the best student in the class.
    He scored <INTERM>95</INTERM> points out of <MAX>100</MAX>.
    His presentation of <ARTICLE>Using Materialized Views in Data Warehouse</ARTICLE> was brilliant.
  </STUDENT>
  <STUDENT stuid="131">
    <NAME>Donald Duck</NAME> is not a very good student.
    He scored <INTERM>20</INTERM> points…
  </STUDENT>
</CLASS>
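The document-centric case can be made concrete with Python's standard library: in mixed content, much of the prose lives in the "tail" text between tags, not inside any single element. A small sketch (the fragment is a cleaned-up version of the example above):

```python
# Mixed content in a document-centric XML fragment: element text and
# the "tail" text after each closing tag together form the prose.
import xml.etree.ElementTree as ET

doc = """<CLASS name="DCS317" num_of_std="100">
  <LECTURER lecid="111">Thomas</LECTURER>
  <STUDENT stuid="007">
    <NAME>James Bond</NAME> is the best student in the class.
    He scored <INTERM>95</INTERM> points out of <MAX>100</MAX>.
  </STUDENT>
</CLASS>"""

root = ET.fromstring(doc)
student = root.find("STUDENT")

# itertext() interleaves element text and tail text, reassembling the
# running prose; whitespace is normalised for display.
prose = " ".join(" ".join(student.itertext()).split())
print(prose)
```

In a data-centric document every value sits cleanly inside one element and the tails are empty; here the tails carry most of the content, which is why IR-style indexing of the text is needed.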
Content-oriented XML retrieval
Traditional IR is about finding documents relevant to a user’s information need, e.g. an entire book.
XML retrieval allows users to retrieve document components (elements) that are more focussed on their information needs, e.g. a chapter, a page, or several paragraphs of a book instead of the entire book.
The structure of documents is exploited to identify which document components to retrieve.
• Structure improves precision
• Exploit visual memory
Book
Chapters
Sections
Subsections
World Wide Web
This is only another way to look at the layout and structure of a document: a structured document does not necessarily contain text only. Structured document retrieval on the Web is an important topic of today’s research.
XML retrieval allows users to retrieve document components that are more focussed, e.g. a subsection of a book instead of an entire book.
SEARCHING = QUERYING + BROWSING
Content-oriented XML retrieval
Focussed retrieval: Scientific Collection
Query: model checking aviation systems
Answer: one section in a workshop report
Focussed Retrieval: Encyclopedia
Information need: volcanic eruption prediction
Answer: a relatively small portion of the volcano topic
Focussed retrieval: Technical Manual
Query: segmentation fault windows services for unix
Answer: only a single paragraph in a long manual
XML: eXtensible Mark-up Language
Meta-language (user-defined tags) currently being adopted as the document format language by W3C
Used to describe content and structure (and not layout)
Grammar described in a DTD (used for validation)
<lecture>
  <title> Structured Document Retrieval </title>
  <author>
    <fnm> Smith </fnm>
    <snm> John </snm>
  </author>
  <chapter>
    <title> Introduction into SDR </title>
    <paragraph> … </paragraph>
    …
  </chapter>
  …
</lecture>
<!ELEMENT lecture (title, author+, chapter+)>
<!ELEMENT author (fnm*, snm)>
<!ELEMENT fnm (#PCDATA)>
…
XML: eXtensible Mark-up Language
Use of XPath notation to refer to the XML structure
chapter/title: title is a direct sub-component of chapter
//title: any title
chapter//title: title is a direct or indirect sub-component of chapter
chapter/paragraph[2]: the second direct paragraph of any chapter
chapter/*: all direct sub-components of a chapter
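These patterns can be tried directly against the lecture example using the limited XPath subset supported by Python's `xml.etree.ElementTree` (a sketch; full XPath engines support much more):

```python
# The XPath patterns from the slide, expressed in ElementTree's
# restricted XPath subset.
import xml.etree.ElementTree as ET

lecture = ET.fromstring("""<lecture>
  <title>Structured Document Retrieval</title>
  <author><fnm>Smith</fnm><snm>John</snm></author>
  <chapter>
    <title>Introduction into SDR</title>
    <paragraph>First paragraph.</paragraph>
    <paragraph>Second paragraph.</paragraph>
  </chapter>
</lecture>""")

direct = lecture.findall("chapter/title")        # title directly under chapter
anywhere = lecture.findall(".//title")           # any title, at any depth
second = lecture.findall("chapter/paragraph[2]") # second direct paragraph
children = lecture.findall("chapter/*")          # all direct children of chapter

print(len(direct), len(anywhere), second[0].text, len(children))
# → 1 2 Second paragraph. 3
```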
XML Queries
Content-only (CO) queries: standard IR queries, but here we retrieve document components. E.g. “Wine tasting in Granada”
Structure-only queries: usually not that useful from an IR perspective. E.g. “Paragraph containing a diagram next to a table”
Content-and-structure (CAS) queries: put constraints on which types of components are to be retrieved
• E.g. “Articles that contain sections about hotels in Granada, and that contain a picture of the Alhambra; return the titles of these articles”
Where to look (support elements), what to return (target elements)
Content-oriented XML retrieval
Return document components at the right level of granularity (e.g. a book, a chapter, a section, a paragraph, a table, a figure, etc.), relevant to the user’s information need with regard to content and structure.
SEARCHING = QUERYING + BROWSING
Right level of granularity: The challenge
Query: wordnet information retrieval
(Simplified) Conceptual model
[Diagram: the classic IR pipeline extended with structure. Structured documents (content + structure) are indexed into a document representation: an inverted file plus a structure index, with statistics such as tf and idf. The query is formulated into a query representation. A retrieval function matches content + structure, and retrieval results are presented together with related components.]
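A minimal sketch of the indexing step in this model: an inverted file whose postings refer to element paths rather than whole documents. The document name and paths below are illustrative, not part of any INEX system.

```python
# Inverted file over elements: each posting is (document, element path),
# so any element containing a term is a candidate retrieval unit.
from collections import defaultdict

def index_elements(doc_id, elements):
    """elements: dict mapping an element path to its text content."""
    postings = defaultdict(list)
    for path, text in elements.items():
        for term in text.lower().split():
            postings[term].append((doc_id, path))
    return postings

idx = index_elements("article1", {
    "/article/title": "XML retrieval",
    "/article/sec[1]": "XML retrieval models",
    "/article/sec[2]": "XML authoring",
})
print(idx["xml"])   # every element containing "xml"
```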
Challenge 1: term weights
[Figure: an article with Title, Section 1 and Section 2; element term weights such as 0.9 XML (Title), 0.5 XML and 0.4 retrieval (Section 1), 0.2 XML and 0.7 authoring (Section 2).]
How to obtain document and collection statistics (e.g. tf, idf)?
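The challenge can be shown in a few lines: the idf of a term changes depending on whether "document frequency" is counted over articles or over elements. A toy example, not INEX's actual statistics:

```python
# idf computed over two different units of counting: articles vs elements.
import math

elements = {  # (article, element) -> text
    ("a1", "title"): "xml retrieval",
    ("a1", "sec1"): "xml retrieval",
    ("a1", "sec2"): "xml authoring",
    ("a2", "title"): "databases",
}

def idf(term, units):
    n = len(units)
    df = sum(1 for text in units if term in text.split())
    return math.log(n / df)

# Concatenate element texts per article to get article-level units.
article_texts = {}
for (art, _), text in elements.items():
    article_texts[art] = article_texts.get(art, "") + " " + text

idf_articles = idf("xml", list(article_texts.values()))  # df over 2 articles
idf_elements = idf("xml", list(elements.values()))       # df over 4 elements
print(idf_articles, idf_elements)  # the two choices disagree
```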
Challenge 2: augmentation weights
[Figure: the article tree again, with augmentation weights 0.5, 0.8 and 0.2 on the edges from the article to Title, Section 1 and Section 2.]
Which components contribute best to the content of “article”?
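One reading of augmentation weighting, as a sketch: a parent's score is its own score plus its children's scores, each down-weighted by an augmentation factor such as the 0.5/0.8/0.2 edge weights in the slide. The child scores below are illustrative; actual augmentation models at INEX differ in detail.

```python
# Score propagation with augmentation weights on parent-child edges.
def augmented_score(own_score, children):
    """children: list of (augmentation_weight, child_score) pairs."""
    return own_score + sum(w * s for w, s in children)

article = augmented_score(0.0, [
    (0.5, 0.9),  # Title
    (0.8, 0.9),  # Section 1
    (0.2, 0.9),  # Section 2
])
print(round(article, 2))  # → 1.35
```

With equal child scores, the augmentation weights alone decide which component contributes most to the article.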
Challenge 3: component weights
[Figure: the article tree again, with component weights (0.6, 0.4, 0.4, 0.5) attached to the component types.]
Which component type (tag) is a good retrieval unit?
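A simple sketch of component weighting: multiply an element's content score by a prior for its tag type, so that tags which make good retrieval units are favoured. The priors and scores below are illustrative.

```python
# Tag-type priors acting as component weights on content scores.
tag_prior = {"article": 0.6, "title": 0.4, "sec": 0.5, "fig": 0.1}

def weighted(tag, content_score):
    return tag_prior.get(tag, 0.0) * content_score

candidates = [("article", 0.7), ("title", 0.9), ("sec", 0.8)]
ranked = sorted(candidates, key=lambda ts: weighted(*ts), reverse=True)
print(ranked[0])  # the tag prior can overturn the raw content ranking
```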
Challenge 4: overlapping elements
[Figure: the article tree again; both the article and Section 1 contain “XML retrieval”.]
“Section 1” and “article” are both relevant to “XML retrieval”, so which one to return?
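One common way to remove overlap, sketched below: keep, among any ancestor/descendant pair, only the higher-scoring element. Containment is approximated by a path-prefix test, and the paths and scores are illustrative.

```python
# Greedy overlap removal: accept elements in decreasing score order,
# skipping any element that contains or is contained in one already kept.
def remove_overlap(scored):
    """scored: list of (path, score); a path-prefix match means containment."""
    keep = []
    for path, score in sorted(scored, key=lambda x: -x[1]):
        if not any(path.startswith(p) or p.startswith(path)
                   for p, _ in keep):
            keep.append((path, score))
    return keep

result = remove_overlap([
    ("/article", 0.8),
    ("/article/sec[1]", 0.9),  # contained in /article
    ("/article/sec[2]", 0.3),  # contained in /article
])
print(result)
```

Here the article is dropped because the higher-scoring Section 1 is already kept, while Section 2 survives since it does not overlap Section 1.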
Approaches …
vector space model, probabilistic model, Bayesian networks, language models, extended DB models, Boolean model, natural language processing, cognitive models, ontologies, parameter estimation, tuning, smoothing, fusion, phrases, term statistics, collection statistics, component statistics, proximity search, logistic regression, belief models, relevance feedback, divergence from randomness, machine learning
Content-oriented XML retrieval: Conclusion
Efficiency
— Not just documents, but all their elements
Models
— Statistics to be adapted or redefined
— Combination of evidence
Users
— What is focussed retrieval?
— Do users really want elements?
Interface and presentation issues
Outline
Part I - Content-oriented XML retrieval
Part II - Evaluating content-oriented XML retrieval
Evaluation of XML retrieval: INEX
Promote research and stimulate development of XML information access and retrieval, through:
— Creation of an evaluation infrastructure and organisation of regular evaluation campaigns for system testing
— Building of an XML information access and retrieval research community
— Construction of test-suites
Collaborative effort: participants contribute to the development of the collection
Ends with a yearly workshop, in December, in Dagstuhl, Germany
INEX has allowed a new community in XML information access to emerge, as shown by the number of publications (64, not final, in 2005; 37 in 2004; 13 in 2003).
INEX: Background
University of Amsterdam, NL; University of Otago, NZ; University of Chile, CL; CWI, NL; Carnegie Mellon University, USA; IBM Research Lab, IL; University of Minnesota Duluth, USA; University of Paris 6, FR; Queensland University of Technology, AUS; University of California, Berkeley, USA; Royal School of LIS, DK; Queen Mary, University of London, UK; University of Duisburg-Essen, DE; INRIA-Rocquencourt, FR; Utrecht University, NL
Sponsored by the DELOS Network of Excellence for Digital Libraries under the FP6 IST programme
Mainly dependent on voluntary efforts; coordination is distributed across tasks and tracks
Main Institutions involved in Coordination for 2005
INEX 2005 Participants
64 participants: 32 Europe; 12 N. America; 10 Asia; 5 Oceania; 5 other. Over 3,000 e-mails in 2005!
Max-Planck-Institut fuer Informatik, Germany; Information Studies, Royal School of LIS, Denmark; University of California, Berkeley, USA; Peking University, China; University of Granada, Spain; University of Amsterdam, The Netherlands; University of Otago, New Zealand; Queen Mary University of London, UK; University of Toronto, Canada; Utrecht University, The Netherlands; City University London, UK; University of Kaiserslautern, Germany; INRIA-Rocquencourt, France; University of Wollongong in Dubai; IRIT - Toulouse, France; RMIT University, Australia; Ecoles des Mines de Saint-Etienne, France; Queensland University of Technology, Australia; University of Klagenfurt, Austria; Fondazione Ugo Bordoni, Italy; University of Tampere, Finland; Carnegie Mellon University, USA; Cornell University, USA; University of Illinois at Urbana-Champaign, USA; IBM Haifa Research Lab, Israel; Ochanomizu University, Japan; The Hebrew University of Jerusalem, Israel; Laboratoire d’Informatique de Paris 6, France; University of Minnesota Duluth, USA; University of Rostock, Germany; University of California, Los Angeles, USA; University of Udine, Italy
University of South-Brittany, France; Nagoya University, Japan; University of Waterloo, Canada; Rutgers University, USA; Kyungpook National University, Korea; University of Chile, Chile; Hiroshima City University, Japan; University of Helsinki, Finland; AT&T Labs-Research, USA; Microsoft Research Lab Cambridge, UK; University of Twente, The Netherlands; Centre for Mathematics & Computer Science (CWI), NL; University of Utah, USA; University Duisburg-Essen, Germany; University of Ostrava, Czech Republic; Hong Kong Baptist University, Hong Kong; University of Sheffield, UK; Oslo University College, Norway; L3S Research Center, Germany; University of Michigan, USA; CLIPS-IMAG Grenoble, France; Wuhan University, China; Nara Institute of Science and Technology, Japan; Ritsumeikan University, Japan; University of Tsukuba, Japan; State University of Montes Claros, Montes Claros (MG), Brazil; INRIA Sophia Antipolis; Charles de Gaulle University - Lille 3; University of Siena, Italy; Australian Research Council, Canberra, Australia; University of Wollongong, Wollongong, Australia; University of Padova, Italy
Test suite for evaluating retrieval performance
Is your XML engine retrieving the relevant information, while at the same time avoiding returning irrelevant information?
Document collection
Topics reflecting realistic information needs
Retrieval tasks, stating what the XML search engine should return as answers
Relevance assessments, stating which elements are relevant to which topics
Metrics to measure retrieval effectiveness
INEX test suites
Documents: ~500MB (+241MB): 12,107 (16,819) articles in XML format from IEEE Computer Society journals and magazines; 8 million elements!
INEX 2002: 60 topics, inex_eval metric
INEX 2003: 66 topics, use of a subset of XPath, inex_eval and inex_eval_ng metrics
INEX 2004: 75 topics, subset of the 2003 XPath subset (NEXI). Official metric: inex_eval. Others: inex_eval_ng, XCG, t2i, ERR, PRUM, …
INEX 2005: 87 topics, NEXI. Official metric: XCG
INEX Topics
CO topic: open standards for digital video in distance learning
CAS topic: //article[about(.,'formal methods verify correctness aviation systems')]//sec[about(.,'case study application model checking theorem proving')]
— Candidate topics submitted by participants; must have some relevant elements, not too few and not too many
— Selection process performed by INEX organisers
Retrieval tasks I
CO retrieval task: same as standard IR, but return elements
open standards for digital video in distance learning
+S retrieval task: the user adds structural hints to the query to narrow down the number of returned elements
//article//sec[about(.,open standards for digital video in distance learning)]
Three strategies:
— Focussed strategy: assume that the user prefers the single most relevant element.
— Thorough strategy: assume that the user prefers all highly relevant elements.
— Fetch-and-browse strategy: assume that the user is interested in highly relevant elements that are contained only within highly relevant articles.
Retrieval tasks II
CAS retrieval task:
— where to look for the relevant elements (i.e. support elements)
— what type of elements to return (i.e. target elements)
— strict and vague interpretations applied to both support and target elements
//article[about(.,'formal methods verify correctness aviation systems')]//sec[about(.,'case study application model checking theorem proving')]
Relevance in XML retrieval
The smallest component (specificity) that is highly relevant (exhaustivity)
Specificity: extent to which a document component is focused on the information need, while being an informative unit.
Exhaustivity: extent to which the information contained in a document component satisfies the information need.
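These two graded dimensions are typically collapsed into a single relevance value by a quantisation function. A sketch, assuming grades on a 0-3 scale as used at INEX; the "generalised" function below is a simple stand-in for INEX's actual quantisation table, not its exact definition:

```python
# Quantisation of (exhaustivity, specificity) into one relevance value.
def quant_strict(exhaustivity, specificity):
    # Strict: only highly exhaustive AND highly specific elements count.
    return 1.0 if (exhaustivity, specificity) == (3, 3) else 0.0

def quant_generalised(exhaustivity, specificity):
    # Illustrative stand-in: average of the two normalised grades.
    return (exhaustivity + specificity) / 6.0

print(quant_strict(3, 3), quant_strict(3, 2), quant_generalised(3, 1))
```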
XML retrieval evaluation
[Figure: an article tree (sections s1, s2, s3; subsections ss1, ss2 under s1), shown from the XML retrieval perspective and from the XML evaluation perspective.]
Relevance assessment task
Topics are assessed by the INEX participants, using an on-line interface
Completeness
— Rules that force assessors to assess related elements
— E.g. if an element is assessed relevant, its parent element and children elements must also be assessed
— …
Consistency
— Rules to enforce consistent assessments
— E.g. the parent of a relevant element must also be relevant, although to a different extent
— E.g. exhaustivity increases going up; specificity increases going down
— …
Assessing a topic takes a week!
On average, 2 topics per participant
Duplicate assessments (12 topics) in INEX 2004
% Agreement
Topic   %      Type
1       12.59  CAS
2        2.95  CAS
3       22.85  CAS
4        8.60  CAS
5       60.87  CAS
6        0.00  CAS
7       27.53  CAS
8        7.63  CO
9       25.22  CO
10       9.89  CO
11       5.65  CO
12       9.08  CO
Avg     12.19

Tag           %
Abs           7.53
App          13.64
Art           2.44
Article      21.70
Atl           1.95
B            16.45
Bb           15.37
Bdy          20.33
Bib          14.84
Bm           15.79
Fig          20.25
Fm            6.06
Index-entry   0.00
Ip1          10.11
Item         10.16
Lists (sum)   5.14
P             9.51
P2           10.84
Ref           5.00
Sec          15.90
Ss1          14.01
Ss2          10.45
St            5.94
Measuring effectiveness: Metrics
A research problem in itself!
Metrics:
inex_eval - official INEX metric until 2004
inex_eval_ng
ERR (expected ratio of relevant units)
XCG (XML cumulative gain) - official INEX metric in 2005
t2i (tolerance to irrelevance)
PRUM (Precision Recall with User Modelling)
HiXEval
…
What is the problem? Relevance propagates up!
~26,000 relevant elements on ~14,000 relevant paths
Propagated assessments: ~45%; increase in the number of relevant elements: ~182%
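Why relevance propagates up can be sketched directly: any ancestor of a relevant element also contains that relevant content, so closing the assessed set under ancestors inflates it considerably. The path below is illustrative:

```python
# Closing a set of relevant element paths under their ancestors:
# every ancestor of a relevant element is itself relevant (to some extent).
def propagate_up(relevant_paths):
    closed = set()
    for path in relevant_paths:
        parts = path.strip("/").split("/")
        for i in range(1, len(parts) + 1):
            closed.add("/" + "/".join(parts[:i]))
    return closed

assessed = {"/article/bdy/sec[2]/p[3]"}
print(sorted(propagate_up(assessed)))
# one assessed paragraph yields four relevant elements on its path
```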
Precision-Recall-based metric and Overlap
Simulated runs
Overlap in results
Rank  System (run)                                          Avg Prec  % Overlap
1     IBM Haifa Research Lab (CO-0.5-LAREFIENMENT)          0.1437    80.89
2     IBM Haifa Research Lab (CO-0.5)                       0.1340    81.46
3     University of Waterloo (Waterloo-Baseline)            0.1267    76.32
4     University of Amsterdam (UAms-CO-T-FBack)             0.1174    81.85
5     University of Waterloo (Waterloo-Expanded)            0.1173    75.62
6     Queensland University of Technology (CO_PS_Stop50K)   0.1073    75.89
7     Queensland University of Technology (CO_PS_099_049)   0.1072    76.81
8     IBM Haifa Research Lab (CO-0.5-Clustering)            0.1043    81.10
9     University of Amsterdam (UAms-CO-T)                   0.1030    71.96
10    LIP6 (simple)                                         0.0921    64.29

Official INEX 2004 results for CO topics (1500 retrieved elements)
Final words
Challenging research issues in XML retrieval are not ‘just’ about the effective retrieval of XML documents, but also about what and how to evaluate!
INEX 2006 document collection
— Wikipedia (English) XML document collection: full texts, marked up in XML, of about 1,900,000 articles
— 228,546 categories, totalling over 100 Gigabytes (10 Gigabytes without pictures)
— 3000 different tags; an article has on average 500 XML nodes, with an average depth of 5
Additional tracks in 2006
— interactive, heterogeneous collection, document mining, relevance feedback, natural language query processing, multimedia, XML entity search
Gracias