TRANSCRIPT
XML Retrieval: A content-oriented perspective
Mounia Lalmas
Department of Computer Science
Queen Mary, University of London
Outline
Part I - Content-oriented XML retrieval
Part II - Evaluating content-oriented XML retrieval
XML Retrieval: Motivation
XML is able to represent a mixture of “structured” and textual (“unstructured”) information.
XML applications: digital libraries, content management.
XML repositories: IEEE INEX collection, LexisNexis, the Library of Congress collection.
XML Retrieval: DB and IR views
Data-centric view (DB)—XML as exchange format for structured data
Document-centric view (IR)—XML as format for representing the logical structure of documents
Now increasingly both views (DB+IR)
Data Centric XML Documents: Example
<CLASS name="DCS317" num_of_std="100">
  <LECTURER lecid="111">Thomas</LECTURER>
  <STUDENT marks="70" origin="Oversea">
    <NAME>Tassos</NAME>
  </STUDENT>
  <STUDENT marks="30" origin="EU">
    <NAME>Christof</NAME>
  </STUDENT>
</CLASS>
Document Centric XML Documents: Example
<CLASS name="DCS317" num_of_std="100">
  <LECTURER lecid="111">Thomas</LECTURER>
  <STUDENT stuid="007">
    <NAME>James Bond</NAME> is the best student in the class.
    He scored <INTERM>95</INTERM> points out of <MAX>100</MAX>.
    His presentation of <ARTICLE>Using Materialized Views in Data Warehouse</ARTICLE> was brilliant.
  </STUDENT>
  <STUDENT stuid="131">
    <NAME>Donald Duck</NAME> is not a very good student.
    He scored <INTERM>20</INTERM> points…
  </STUDENT>
</CLASS>
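The document-centric case can be made concrete with Python's standard library: in mixed content, much of the prose lives in the "tail" text between tags, not inside any single element. A small sketch (the fragment is a cleaned-up version of the example above):

```python
# Mixed content in a document-centric XML fragment: element text and
# the "tail" text after each closing tag together form the prose.
import xml.etree.ElementTree as ET

doc = """<CLASS name="DCS317" num_of_std="100">
  <LECTURER lecid="111">Thomas</LECTURER>
  <STUDENT stuid="007">
    <NAME>James Bond</NAME> is the best student in the class.
    He scored <INTERM>95</INTERM> points out of <MAX>100</MAX>.
  </STUDENT>
</CLASS>"""

root = ET.fromstring(doc)
student = root.find("STUDENT")

# itertext() interleaves element text and tail text, reassembling the
# running prose; whitespace is normalised for display.
prose = " ".join(" ".join(student.itertext()).split())
print(prose)
```

In a data-centric document every value sits cleanly inside one element and the tails are empty; here the tails carry most of the content, which is why IR-style indexing of the text is needed.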
Content-oriented XML retrieval
Traditional IR is about finding documents relevant to a user’s information need, e.g. an entire book.
XML retrieval allows users to retrieve document components (elements) that are more focussed on their information needs, e.g. a chapter, a page, or several paragraphs of a book instead of the entire book.
The structure of documents is exploited to identify which document components to retrieve.
• Structure improves precision
• Exploit visual memory
Book
Chapters
Sections
Subsections
World Wide Web
This is only another way to look at the layout and structure of a document: a structured document does not necessarily contain text only. Structured document retrieval on the Web is an important topic of today’s research.
XML retrieval allows users to retrieve document components that are more focussed, e.g. a subsection of a book instead of an entire book.
SEARCHING = QUERYING + BROWSING
Content-oriented XML retrieval
Focussed retrieval: Scientific Collection
Query: model checking aviation systems
Answer: one section in a workshop report
Focussed Retrieval: Encyclopedia
Information need: volcanic eruption prediction
Answer: a relatively small portion of the volcano topic
Focussed retrieval: Technical Manual
Query: segmentation fault windows services for unix
Answer: only a single paragraph in a long manual
XML: eXtensible Mark-up Language
Meta-language (user-defined tags) currently being adopted as the document format language by W3C
Used to describe content and structure (and not layout)
Grammar described in a DTD (used for validation)
<lecture>
  <title> Structured Document Retrieval </title>
  <author>
    <fnm> Smith </fnm>
    <snm> John </snm>
  </author>
  <chapter>
    <title> Introduction into SDR </title>
    <paragraph> … </paragraph>
    …
  </chapter>
  …
</lecture>
<!ELEMENT lecture (title, author+, chapter+)>
<!ELEMENT author (fnm*, snm)>
<!ELEMENT fnm (#PCDATA)>
…
XML: eXtensible Mark-up Language
Use of XPath notation to refer to the XML structure
chapter/title: title is a direct sub-component of chapter
//title: any title
chapter//title: title is a direct or indirect sub-component of chapter
chapter/paragraph[2]: the second direct paragraph of any chapter
chapter/*: all direct sub-components of a chapter
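These patterns can be tried directly against the lecture example using the limited XPath subset supported by Python's `xml.etree.ElementTree` (a sketch; full XPath engines support much more):

```python
# The XPath patterns from the slide, expressed in ElementTree's
# restricted XPath subset.
import xml.etree.ElementTree as ET

lecture = ET.fromstring("""<lecture>
  <title>Structured Document Retrieval</title>
  <author><fnm>Smith</fnm><snm>John</snm></author>
  <chapter>
    <title>Introduction into SDR</title>
    <paragraph>First paragraph.</paragraph>
    <paragraph>Second paragraph.</paragraph>
  </chapter>
</lecture>""")

direct = lecture.findall("chapter/title")        # title directly under chapter
anywhere = lecture.findall(".//title")           # any title, at any depth
second = lecture.findall("chapter/paragraph[2]") # second direct paragraph
children = lecture.findall("chapter/*")          # all direct children of chapter

print(len(direct), len(anywhere), second[0].text, len(children))
# → 1 2 Second paragraph. 3
```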
XML Queries
Content-only (CO) queries: standard IR queries, but here we retrieve document components. E.g. “Wine tasting in Granada”
Structure-only queries: usually not that useful from an IR perspective. E.g. “Paragraph containing a diagram next to a table”
Content-and-structure (CAS) queries: put constraints on which types of components are to be retrieved
• E.g. “Articles that contain sections about hotels in Granada, and that contain a picture of the Alhambra; return the titles of these articles”
Where to look (support elements), what to return (target elements)
Content-oriented XML retrieval
Return document components at the right level of granularity (e.g. a book, a chapter, a section, a paragraph, a table, a figure, etc.), relevant to the user’s information need with regard to content and structure.
SEARCHING = QUERYING + BROWSING
Right level of granularity: The challenge
Query: wordnet information retrieval
(Simplified) Conceptual model
[Diagram: the classic IR pipeline extended with structure. Structured documents (content + structure) are indexed into a document representation: an inverted file plus a structure index, with statistics such as tf and idf. The query is formulated into a query representation. A retrieval function matches content + structure, and retrieval results are presented together with related components.]
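A minimal sketch of the indexing step in this model: an inverted file whose postings refer to element paths rather than whole documents. The document name and paths below are illustrative, not part of any INEX system.

```python
# Inverted file over elements: each posting is (document, element path),
# so any element containing a term is a candidate retrieval unit.
from collections import defaultdict

def index_elements(doc_id, elements):
    """elements: dict mapping an element path to its text content."""
    postings = defaultdict(list)
    for path, text in elements.items():
        for term in text.lower().split():
            postings[term].append((doc_id, path))
    return postings

idx = index_elements("article1", {
    "/article/title": "XML retrieval",
    "/article/sec[1]": "XML retrieval models",
    "/article/sec[2]": "XML authoring",
})
print(idx["xml"])   # every element containing "xml"
```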
Challenge 1: term weights
[Figure: an article with Title, Section 1 and Section 2; element term weights such as 0.9 XML (Title), 0.5 XML and 0.4 retrieval (Section 1), 0.2 XML and 0.7 authoring (Section 2).]
How to obtain document and collection statistics (e.g. tf, idf)?
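The challenge can be shown in a few lines: the idf of a term changes depending on whether "document frequency" is counted over articles or over elements. A toy example, not INEX's actual statistics:

```python
# idf computed over two different units of counting: articles vs elements.
import math

elements = {  # (article, element) -> text
    ("a1", "title"): "xml retrieval",
    ("a1", "sec1"): "xml retrieval",
    ("a1", "sec2"): "xml authoring",
    ("a2", "title"): "databases",
}

def idf(term, units):
    n = len(units)
    df = sum(1 for text in units if term in text.split())
    return math.log(n / df)

# Concatenate element texts per article to get article-level units.
article_texts = {}
for (art, _), text in elements.items():
    article_texts[art] = article_texts.get(art, "") + " " + text

idf_articles = idf("xml", list(article_texts.values()))  # df over 2 articles
idf_elements = idf("xml", list(elements.values()))       # df over 4 elements
print(idf_articles, idf_elements)  # the two choices disagree
```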
Challenge 2: augmentation weights
[Figure: the article tree again, with augmentation weights 0.5, 0.8 and 0.2 on the edges from the article to Title, Section 1 and Section 2.]
Which components contribute best to the content of “article”?
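One reading of augmentation weighting, as a sketch: a parent's score is its own score plus its children's scores, each down-weighted by an augmentation factor such as the 0.5/0.8/0.2 edge weights in the slide. The child scores below are illustrative; actual augmentation models at INEX differ in detail.

```python
# Score propagation with augmentation weights on parent-child edges.
def augmented_score(own_score, children):
    """children: list of (augmentation_weight, child_score) pairs."""
    return own_score + sum(w * s for w, s in children)

article = augmented_score(0.0, [
    (0.5, 0.9),  # Title
    (0.8, 0.9),  # Section 1
    (0.2, 0.9),  # Section 2
])
print(round(article, 2))  # → 1.35
```

With equal child scores, the augmentation weights alone decide which component contributes most to the article.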
Challenge 3: component weights
[Figure: the article tree again, with component weights (0.6, 0.4, 0.4, 0.5) attached to the component types.]
Which component type (tag) is a good retrieval unit?
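A simple sketch of component weighting: multiply an element's content score by a prior for its tag type, so that tags which make good retrieval units are favoured. The priors and scores below are illustrative.

```python
# Tag-type priors acting as component weights on content scores.
tag_prior = {"article": 0.6, "title": 0.4, "sec": 0.5, "fig": 0.1}

def weighted(tag, content_score):
    return tag_prior.get(tag, 0.0) * content_score

candidates = [("article", 0.7), ("title", 0.9), ("sec", 0.8)]
ranked = sorted(candidates, key=lambda ts: weighted(*ts), reverse=True)
print(ranked[0])  # the tag prior can overturn the raw content ranking
```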
Challenge 4: overlapping elements
[Figure: the article tree again; both the article and Section 1 contain “XML retrieval”.]
“Section 1” and “article” are both relevant to “XML retrieval”, so which one to return?
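One common way to remove overlap, sketched below: keep, among any ancestor/descendant pair, only the higher-scoring element. Containment is approximated by a path-prefix test, and the paths and scores are illustrative.

```python
# Greedy overlap removal: accept elements in decreasing score order,
# skipping any element that contains or is contained in one already kept.
def remove_overlap(scored):
    """scored: list of (path, score); a path-prefix match means containment."""
    keep = []
    for path, score in sorted(scored, key=lambda x: -x[1]):
        if not any(path.startswith(p) or p.startswith(path)
                   for p, _ in keep):
            keep.append((path, score))
    return keep

result = remove_overlap([
    ("/article", 0.8),
    ("/article/sec[1]", 0.9),  # contained in /article
    ("/article/sec[2]", 0.3),  # contained in /article
])
print(result)
```

Here the article is dropped because the higher-scoring Section 1 is already kept, while Section 2 survives since it does not overlap Section 1.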
Approaches …
vector space model, probabilistic model, Bayesian networks, language models, extended DB models, Boolean model, natural language processing, cognitive models, ontologies, parameter estimation, tuning, smoothing, fusion, phrases, term statistics, collection statistics, component statistics, proximity search, logistic regression, belief models, relevance feedback, divergence from randomness, machine learning
Content-oriented XML retrieval: Conclusion
Efficiency
— Not just documents, but all their elements
Models
— Statistics to be adapted or redefined
— Combination of evidence
Users
— What is focussed retrieval?
— Do users really want elements?
Interface and presentation issues
Outline
Part I - Content-oriented XML retrieval
Part II - Evaluating content-oriented XML retrieval
Evaluation of XML retrieval: INEX
Promote research and stimulate development of XML information access and retrieval, through:
— Creation of an evaluation infrastructure and organisation of regular evaluation campaigns for system testing
— Building of an XML information access and retrieval research community
— Construction of test-suites
Collaborative effort: participants contribute to the development of the collection
Ends with a yearly workshop, in December, in Dagstuhl, Germany
INEX has allowed a new community in XML information access to emerge, as shown by the number of publications (64, not final, in 2005; 37 in 2004; 13 in 2003).
INEX: Background
University of Amsterdam, NL; University of Otago, NZ; University of Chile, CL; CWI, NL; Carnegie Mellon University, USA; IBM Research Lab, IL; University of Minnesota Duluth, USA; University of Paris 6, FR; Queensland University of Technology, AUS; University of California, Berkeley, USA; Royal School of LIS, DK; Queen Mary, University of London, UK; University of Duisburg-Essen, DE; INRIA-Rocquencourt, FR; Utrecht University, NL
Sponsored by the DELOS Network of Excellence for Digital Libraries under the FP6 IST programme
Mainly dependent on voluntary efforts; coordination is distributed across tasks and tracks
Main Institutions involved in Coordination for 2005
INEX 2005 Participants
64 participants: 32 Europe; 12 N. America; 10 Asia; 5 Oceania; 5 other. Over 3,000 e-mails in 2005!
Max-Planck-Institut fuer Informatik, Germany; Information Studies, Royal School of LIS, Denmark; University of California, Berkeley, USA; Peking University, China; University of Granada, Spain; University of Amsterdam, The Netherlands; University of Otago, New Zealand; Queen Mary University of London, UK; University of Toronto, Canada; Utrecht University, The Netherlands; City University London, UK; University of Kaiserslautern, Germany; INRIA-Rocquencourt, France; University of Wollongong in Dubai; IRIT - Toulouse, France; RMIT University, Australia; Ecoles des Mines de Saint-Etienne, France; Queensland University of Technology, Australia; University of Klagenfurt, Austria; Fondazione Ugo Bordoni, Italy; University of Tampere, Finland; Carnegie Mellon University, USA; Cornell University, USA; University of Illinois at Urbana-Champaign, USA; IBM Haifa Research Lab, Israel; Ochanomizu University, Japan; The Hebrew University of Jerusalem, Israel; Laboratoire d’Informatique de Paris 6, France; University of Minnesota Duluth, USA; University of Rostock, Germany; University of California, Los Angeles, USA; University of Udine, Italy
University of South-Brittany, France; Nagoya University, Japan; University of Waterloo, Canada; Rutgers University, USA; Kyungpook National University, Korea; University of Chile, Chile; Hiroshima City University, Japan; University of Helsinki, Finland; AT&T Labs-Research, USA; Microsoft Research Lab Cambridge, UK; University of Twente, The Netherlands; Centre for Mathematics & Computer Science (CWI), NL; University of Utah, USA; University Duisburg-Essen, Germany; University of Ostrava, Czech Republic; Hong Kong Baptist University, Hong Kong; University of Sheffield, UK; Oslo University College, Norway; L3S Research Center, Germany; University of Michigan, USA; CLIPS-IMAG Grenoble, France; Wuhan University, China; Nara Institute of Science and Technology, Japan; Ritsumeikan University, Japan; University of Tsukuba, Japan; State University of Montes Claros, Montes Claros (MG), Brazil; INRIA Sophia Antipolis; Charles de Gaulle University - Lille 3; University of Siena, Italy; Australian Research Council, Canberra, Australia; University of Wollongong, Wollongong, Australia; University of Padova, Italy
Test suite for evaluating retrieval performance
Is your XML engine retrieving the relevant information, while at the same time avoiding returning irrelevant information?
Document collection
Topics reflecting realistic information needs
Retrieval tasks, stating what the XML search engine should return as answers
Relevance assessments, stating which elements are relevant to which topics
Metrics to measure retrieval effectiveness
INEX test suites
Documents: ~500MB (+241MB): 12,107 (16,819) articles in XML format from IEEE Computer Society journals and magazines; 8 million elements!
INEX 2002: 60 topics, inex_eval metric
INEX 2003: 66 topics, use of a subset of XPath, inex_eval and inex_eval_ng metrics
INEX 2004: 75 topics, subset of the 2003 XPath subset (NEXI). Official metric: inex_eval. Others: inex_eval_ng, XCG, t2i, ERR, PRUM, …
INEX 2005: 87 topics, NEXI. Official metric: XCG
INEX Topics
CO topic: open standards for digital video in distance learning
CAS topic: //article[about(.,'formal methods verify correctness aviation systems')]//sec[about(.,'case study application model checking theorem proving')]
— Candidate topics submitted by participants; must have some relevant elements, not too few and not too many
— Selection process performed by INEX organisers
Retrieval tasks I
CO retrieval task: same as standard IR, but return elements
open standards for digital video in distance learning
+S retrieval task: the user adds structural hints to the query to narrow down the number of returned elements
//article//sec[about(.,open standards for digital video in distance learning)]
Three strategies:
— Focussed strategy: assume that the user prefers the single most relevant element.
— Thorough strategy: assume that the user prefers all highly relevant elements.
— Fetch-and-browse strategy: assume that the user is interested in highly relevant elements that are contained only within highly relevant articles.
Retrieval tasks II
CAS retrieval task:
— where to look for the relevant elements (i.e. support elements)
— what type of elements to return (i.e. target elements)
— strict and vague interpretations applied to both support and target elements
//article[about(.,'formal methods verify correctness aviation systems')]//sec[about(.,'case study application model checking theorem proving')]
Relevance in XML retrieval
The smallest component (specificity) that is highly relevant (exhaustivity)
Specificity: extent to which a document component is focused on the information need, while being an informative unit.
Exhaustivity: extent to which the information contained in a document component satisfies the information need.
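These two graded dimensions are typically collapsed into a single relevance value by a quantisation function. A sketch, assuming grades on a 0-3 scale as used at INEX; the "generalised" function below is a simple stand-in for INEX's actual quantisation table, not its exact definition:

```python
# Quantisation of (exhaustivity, specificity) into one relevance value.
def quant_strict(exhaustivity, specificity):
    # Strict: only highly exhaustive AND highly specific elements count.
    return 1.0 if (exhaustivity, specificity) == (3, 3) else 0.0

def quant_generalised(exhaustivity, specificity):
    # Illustrative stand-in: average of the two normalised grades.
    return (exhaustivity + specificity) / 6.0

print(quant_strict(3, 3), quant_strict(3, 2), quant_generalised(3, 1))
```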
XML retrieval evaluation
[Figure: an article tree (sections s1, s2, s3; subsections ss1, ss2 under s1), shown from the XML retrieval perspective and from the XML evaluation perspective.]
Relevance assessment task
Topics are assessed by the INEX participants, using an on-line interface
Completeness
— Rules that force assessors to assess related elements
— E.g. if an element is assessed relevant, its parent element and children elements must also be assessed
— …
Consistency
— Rules to enforce consistent assessments
— E.g. the parent of a relevant element must also be relevant, although to a different extent
— E.g. exhaustivity increases going up; specificity increases going down
— …
Assessing a topic takes a week!
On average, 2 topics per participant
Duplicate assessments (12 topics) in INEX 2004
% Agreement
Topic   %      Type
1       12.59  CAS
2        2.95  CAS
3       22.85  CAS
4        8.60  CAS
5       60.87  CAS
6        0.00  CAS
7       27.53  CAS
8        7.63  CO
9       25.22  CO
10       9.89  CO
11       5.65  CO
12       9.08  CO
Avg     12.19

Tag           %
Abs           7.53
App          13.64
Art           2.44
Article      21.70
Atl           1.95
B            16.45
Bb           15.37
Bdy          20.33
Bib          14.84
Bm           15.79
Fig          20.25
Fm            6.06
Index-entry   0.00
Ip1          10.11
Item         10.16
Lists (sum)   5.14
P             9.51
P2           10.84
Ref           5.00
Sec          15.90
Ss1          14.01
Ss2          10.45
St            5.94
Measuring effectiveness: Metrics
A research problem in itself!
Metrics:
inex_eval - official INEX metric until 2004
inex_eval_ng
ERR (expected ratio of relevant units)
XCG (XML cumulative gain) - official INEX metric in 2005
t2i (tolerance to irrelevance)
PRUM (Precision Recall with User Modelling)
HiXEval
…
What is the problem? Relevance propagates up!
~26,000 relevant elements on ~14,000 relevant paths
Propagated assessments: ~45%; increase in the number of relevant elements: ~182%
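Why relevance propagates up can be sketched directly: any ancestor of a relevant element also contains that relevant content, so closing the assessed set under ancestors inflates it considerably. The path below is illustrative:

```python
# Closing a set of relevant element paths under their ancestors:
# every ancestor of a relevant element is itself relevant (to some extent).
def propagate_up(relevant_paths):
    closed = set()
    for path in relevant_paths:
        parts = path.strip("/").split("/")
        for i in range(1, len(parts) + 1):
            closed.add("/" + "/".join(parts[:i]))
    return closed

assessed = {"/article/bdy/sec[2]/p[3]"}
print(sorted(propagate_up(assessed)))
# one assessed paragraph yields four relevant elements on its path
```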
Precision-Recall-based metric and Overlap
Simulated runs
Overlap in results
Rank  System (run)                                          Avg Prec  % Overlap
1     IBM Haifa Research Lab (CO-0.5-LAREFIENMENT)          0.1437    80.89
2     IBM Haifa Research Lab (CO-0.5)                       0.1340    81.46
3     University of Waterloo (Waterloo-Baseline)            0.1267    76.32
4     University of Amsterdam (UAms-CO-T-FBack)             0.1174    81.85
5     University of Waterloo (Waterloo-Expanded)            0.1173    75.62
6     Queensland University of Technology (CO_PS_Stop50K)   0.1073    75.89
7     Queensland University of Technology (CO_PS_099_049)   0.1072    76.81
8     IBM Haifa Research Lab (CO-0.5-Clustering)            0.1043    81.10
9     University of Amsterdam (UAms-CO-T)                   0.1030    71.96
10    LIP6 (simple)                                         0.0921    64.29

Official INEX 2004 results for CO topics (1500 retrieved elements)
Final words
Challenging research issues in XML retrieval are not ‘just’ about the effective retrieval of XML documents, but also about what and how to evaluate!
INEX 2006 document collection
— Wikipedia (English) XML document collection: full texts, marked up in XML, of about 1,900,000 articles
— 228,546 categories, totalling over 100 Gigabytes (10 Gigabytes without pictures)
— 3000 different tags; an article has on average 500 XML nodes, with an average depth of 5
Additional tracks in 2006
— interactive, heterogeneous collection, document mining, relevance feedback, natural language query processing, multimedia, XML entity search
Gracias