XML Information Retrieval

Hui Fang

Department of Computer Science

University of Illinois at Urbana-Champaign

Some slides are borrowed from Norbert Fuhr’s XML Tutorial.

Outline

• XML basics

• Research Topics

• XML IR
  – Tasks
  – Retrieval methods
  – Clustering XML documents

XML standards

Basic XML

• Hierarchical document format for information exchange on the WWW

• Self-describing data (tags)

• Nested element structure having a root

• Element data can have
  – Attributes
  – Sub-elements

(Slides from Jayavel Shanmugasundaram)

Example XML document

<?xml version="1.0" encoding="ISO-8859-1" ?>
<book>
  <title>Finding Out About: A Cognitive Perspective on Search Engine Technology and the WWW</title>
  <author id="rbelew">
    <name>
      <firstname>Richard</firstname>
      <lastname>Belew</lastname>
    </name>
    <address>
      <city>San Diego</city>
      <zip>92093</zip>
    </address>
  </author>
</book>
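As a minimal sketch of how the attribute and the nested sub-elements of this example can be read programmatically, the snippet below parses the book document with Python's standard `xml.etree.ElementTree`. The helper name `outline` is an illustrative choice, not part of the slides.

```python
import xml.etree.ElementTree as ET

BOOK_XML = """<?xml version="1.0" encoding="ISO-8859-1" ?>
<book>
  <title>Finding Out About: A Cognitive Perspective on Search Engine Technology and the WWW</title>
  <author id="rbelew">
    <name>
      <firstname>Richard</firstname>
      <lastname>Belew</lastname>
    </name>
    <address>
      <city>San Diego</city>
      <zip>92093</zip>
    </address>
  </author>
</book>"""

# Parse from bytes so the encoding declaration in the prolog is honoured.
root = ET.fromstring(BOOK_XML.encode("iso-8859-1"))
author = root.find("author")


def outline(elem, depth=0):
    """Return one line per element: indented tag, attributes, own text."""
    text = (elem.text or "").strip()
    lines = ["  " * depth + f"{elem.tag} {elem.attrib} {text!r}"]
    for child in elem:
        lines.extend(outline(child, depth + 1))
    return lines


print("\n".join(outline(root)))
print("attribute id =", author.get("id"))
print("sub-element lastname =", root.findtext("author/name/lastname"))
```

The attribute is reached through `get`, while sub-elements are reached through path expressions, which mirrors the attribute/element distinction made on the slide.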

Tree structure of XML documents

book
  title: Finding….
  author (id="rbelew")
    name
      first name: Richard
      last name: Belew
    address
      city: San Diego
      zip code: 92093

The basic XML standard does not deal with …

• Standardization of element names → XML namespaces

• Structure of element content → XML DTDs

• Data types of element content → XML Schema

XML namespace

<table>
  <tr>
    <td>Apples</td>
    <td>Bananas</td>
  </tr>
</table>

<table>
  <name>GPA Table</name>
  <width>80</width>
  <length>120</length>
</table>

Provide a method to avoid element name conflicts

XML namespace (Cont.)

<h:table xmlns:h="http://www.w3.org/TR/html4/">
  <h:tr>
    <h:td>Apples</h:td>
    <h:td>Bananas</h:td>
  </h:tr>
</h:table>

<f:table xmlns:f="http://www.w3schools.com/gpa">
  <f:name>GPA Table</f:name>
  <f:width>80</f:width>
  <f:length>120</f:length>
</f:table>

Provide a method to avoid element name conflicts
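A small sketch of what the prefixes buy in practice: once both tables carry a namespace, a parser sees two different element names even though both are written `table`. The wrapping `<root>` element is an assumption added only so that both fragments fit into one well-formed document.

```python
import xml.etree.ElementTree as ET

# <root> is a hypothetical wrapper; the two tables come from the slide.
DOC = """<root>
  <h:table xmlns:h="http://www.w3.org/TR/html4/">
    <h:tr><h:td>Apples</h:td><h:td>Bananas</h:td></h:tr>
  </h:table>
  <f:table xmlns:f="http://www.w3schools.com/gpa">
    <f:name>GPA Table</f:name><f:width>80</f:width><f:length>120</f:length>
  </f:table>
</root>"""

root = ET.fromstring(DOC)

# ElementTree expands each prefix to "{namespace-uri}localname".
tags = [child.tag for child in root]

# Prefix-to-URI map used when querying; the prefixes are local to this map.
ns = {"h": "http://www.w3.org/TR/html4/", "f": "http://www.w3schools.com/gpa"}
html_cells = [td.text for td in root.findall("h:table/h:tr/h:td", ns)]
gpa_name = root.findtext("f:table/f:name", namespaces=ns)

print(tags)
print(html_cells, gpa_name)
```

Queries against one namespace never match elements of the other, so the name conflict disappears.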

XML Document Type Definition

Define the document structure with a list of legal elements

<?xml version="1.0"?>
<!DOCTYPE note SYSTEM "note.dtd">
<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Have a rest!</body>
</note>

<!ELEMENT note (to,from,heading,body)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT body (#PCDATA)>
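To make the content model concrete, the sketch below checks by hand what this particular DTD demands: a `note` root whose children are exactly `to, from, heading, body` in that order, each holding character data only. It is a hand-rolled illustration for this one DTD, not a DTD validator; the function name `conforms_to_note_dtd` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Content model taken from: <!ELEMENT note (to,from,heading,body)>
NOTE_CHILDREN = ["to", "from", "heading", "body"]


def conforms_to_note_dtd(xml_text):
    """Return True if xml_text satisfies the note DTD shown on the slide."""
    root = ET.fromstring(xml_text)
    if root.tag != "note":
        return False
    children = list(root)
    # The sequence (to,from,heading,body) fixes both the names and the order.
    if [child.tag for child in children] != NOTE_CHILDREN:
        return False
    # (#PCDATA) allows character data only, so no nested elements.
    return all(len(child) == 0 for child in children)


valid = ("<note><to>Tove</to><from>Jani</from>"
         "<heading>Reminder</heading><body>Have a rest!</body></note>")
wrong_order = ("<note><from>Jani</from><to>Tove</to>"
               "<heading>Reminder</heading><body>Have a rest!</body></note>")

print(conforms_to_note_dtd(valid), conforms_to_note_dtd(wrong_order))
```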

Research Topics related to XML

Research Topics

• IR areas
  – Retrieval models
  – Query languages
  – …

• DB areas
  – Query languages
  – System architecture
  – Applying relational DB technology to XML data
  – Streaming XML
  – XML query processing
  – XML indexing and compression
  – …

XML IR

INEX: Initiative for the Evaluation of XML Retrieval

• Documents: 12,107 articles in XML format

• Queries: 30 content-only (CO); 30 content-and-structure (CAS)

• Relevance Assessments: by participating groups

• Participants: 36 active groups in 2003

CO search task

• Document as hierarchical structure of nested elements

• Type of elements is not considered

• Query refers to content only

• Query syntax as in standard text retrieval

• Task: Find the smallest subtree (element) satisfying the query
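The CO task can be illustrated with a deliberately naive Boolean sketch: descend the tree and return the deepest elements whose subtree text still contains every query term. This ignores ranking entirely, the sample document is a simplified assumption, and `smallest_matches` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

DOC = """<document>
  <title>XML Retrieval</title>
  <chapter><heading>Introduction</heading></chapter>
  <chapter>
    <heading>XML Query Lang XQL</heading>
    <section><heading>Examples</heading></section>
    <section><heading>Syntax</heading>We describe syntax of XQL</section>
  </chapter>
</document>"""


def smallest_matches(elem, terms):
    """Return the most specific elements whose subtree text has every term."""
    words = set(" ".join(elem.itertext()).lower().split())
    if not all(term in words for term in terms):
        return []  # this subtree cannot satisfy the query
    # Prefer matching descendants; fall back to this element itself.
    hits = [m for child in elem for m in smallest_matches(child, terms)]
    return hits or [elem]


root = ET.fromstring(DOC)
syntax_xql = [e.tag for e in smallest_matches(root, ["syntax", "xql"])]
examples_xql = [e.tag for e in smallest_matches(root, ["examples", "xql"])]
print(syntax_xql, examples_xql)
```

The first query is answered by a single section, while the second needs the enclosing chapter because its terms are spread over two children.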

Example of CO Topic

<INEX-Topic topic-id="45" query-type="CO" ct-no="056">
  <Title>
    <cw>augmented reality and medicine</cw>
  </Title>
  <Description>
    How virtual (or augmented) reality can contribute to improve the medical and surgical practice.
  </Description>
  <Narrative>
    In order to be considered relevant, a document/component must include considerations about applications of computer graphics and especially augmented (or virtual) reality to medicine (including surgery).
  </Narrative>
  <Keywords>
    Augmented virtual reality medicine surgery improve computer assisted aided image
  </Keywords>
</INEX-Topic>

CAS search task

• Queries contain explicit references to the XML structure, by restricting
  – The context of interest
    • <te>: target element
  – The context of certain search concepts
    • (<cw>, <ce>) pairs

Example of CAS topic

<INEX-Topic topic-id="09" query-type="CAS" ct-no="048">
  <Title>
    <te>article</te>
    <cw>non-monotonic reasoning</cw><ce>bdy/sec</ce>
    <cw>1999 2000</cw><ce>hdr//yr</ce>
    <cw>-calendar</cw><ce>tig/atl</ce>
    <cw>belief revision</cw>
  </Title>
  <Narrative>
    Retrieve all articles from the years 1999-2000 that deal with works on non-monotonic reasoning. Do not retrieve CfPs/calendar entries.
  </Narrative>
  <Keywords>
    non-monotonic reasoning belief revision
  </Keywords>
</INEX-Topic>

XML Retrieval Methods

• XIRQL
  – XML query languages with IR-related features

• Language models

• JuruXML

XIRQL(I)

• CO approaches:
  – Split document text into disjoint nodes
  – Index nodes separately
  – Aggregate indexing weights for higher-level elements (subtrees)

Index nodes as units for term weighting

Application of known indexing functions (e.g. tf*idf)

[Figure: example document tree; index nodes are numbered 1–5]

document (class="H.3.3")
  author: John Smith
  title: XML Retrieval
  chapter
    heading: Introduction
    text: This. . .
  chapter
    heading: XML Query Lang. XQL
    section
      heading: Examples
    section
      heading: Syntax
      text: We describe syntax of XQL

Index nodes for relevance-oriented search

[Figure: the example document tree (document, author, title, two chapters, two sections) with index nodes 1–5 marked]

Q1: syntax example

Q2: XQL

Combining weights

…by disjunction

Q1: syntax example

Q2: XQL

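[Figure: a chapter with two child sections and their indexing weights]

chapter: 0.3 XQL
  section1: 0.5 example
  section2: 0.8 XQL, 0.7 syntax

Weights propagated to chapter: 0.5 example, 0.7 syntax

XQL at chapter: 0.8 + 0.3 - 0.8*0.3 = 0.86

Q1 at chapter: 0.7*0.5 = 0.35

Need to return the most specific element satisfying the query!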
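The arithmetic on this slide can be reproduced with a few lines. The sketch assumes the weights are treated as independent probabilities, so that two weights for the same term combine by probabilistic disjunction and the terms of a query combine by product; `prob_or` is an illustrative name.

```python
def prob_or(weights):
    """Disjunction of independent probabilistic weights: a + b - a*b, folded."""
    result = 0.0
    for w in weights:
        result = result + w - result * w
    return result


# XQL occurs in one section (0.8) and in the chapter's own text (0.3).
chapter_xql = prob_or([0.8, 0.3])

# Q1 = "syntax example": the chapter sees 0.7 syntax and 0.5 example.
chapter_q1 = 0.7 * 0.5

print(round(chapter_xql, 2), round(chapter_q1, 2))
```

Because the disjunction can only grow as more weights are added, the chapter scores 0.86 for Q2 and outranks the section scoring 0.8, which is exactly the problem the closing remark of the slide points at.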

Combining weights… with augmentation weight

Q2: XQL

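[Figure: the same chapter; each edge from a section up to the chapter carries augmentation weight 0.6]

chapter: 0.3 XQL
  section1: 0.5 example
  section2: 0.8 XQL, 0.7 syntax

Weights propagated to chapter: 0.30 example (0.6*0.5), 0.42 syntax (0.6*0.7)

XQL at chapter: 0.48 + 0.3 - 0.48*0.3 = 0.636 ≈ 0.64, with 0.48 = 0.6*0.8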
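A sketch of the same propagation with an augmentation weight, under the assumption that every child weight is multiplied by the factor before being combined with the parent's weight by disjunction. The function name `propagate` and the dictionary layout are illustrative choices.

```python
def propagate(own, children, augmentation):
    """Combine a node's own term weights with down-weighted child weights.

    own:          dict term -> weight for the node's own text
    children:     list of dicts term -> weight, one per child element
    augmentation: factor in (0, 1] applied to weights coming from children
    """
    combined = dict(own)
    for child in children:
        for term, weight in child.items():
            down = augmentation * weight
            prev = combined.get(term, 0.0)
            combined[term] = prev + down - prev * down  # disjunction
    return combined


section1 = {"example": 0.5}
section2 = {"XQL": 0.8, "syntax": 0.7}
chapter = propagate({"XQL": 0.3}, [section1, section2], augmentation=0.6)

print({term: round(w, 3) for term, w in sorted(chapter.items())})
```

With the factor 0.6 the chapter reaches about 0.64 for XQL and now ranks below the section at 0.8, so the more specific element wins; with a factor of 1.0 the function reproduces the plain disjunction value 0.86.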

XIRQL(II)

• CAS approaches

– Extension of XQL by
  • Weighting and ranking
  • Data types with vague predicates
  • Structural relativism

XQL Expressions

• Path condition
  – search for single elements: heading
  – parent-child: chapter/heading
  – ancestor-descendant: chapter//section
  – document root: /book/*

• Filter wrt. structure: //chapter[heading]

• Filter wrt. content: /document[@class="H.3.3" $and$ author="John Smith"]
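XQL itself is not available in Python's standard library, but `xml.etree.ElementTree` supports a small XPath subset that covers the same kinds of conditions. The sketch below is therefore an approximation in a different path language, not XQL: paths are relative to the root element, `$and$` is expressed by chaining two predicates, and the `wrapper` element is an assumption added so that a predicate can be applied to the document element itself.

```python
import xml.etree.ElementTree as ET

DOC = """<document class="H.3.3">
  <author>John Smith</author>
  <title>XML Retrieval</title>
  <chapter><heading>Introduction</heading></chapter>
  <chapter>
    <heading>XML Query Lang. XQL</heading>
    <section><heading>Examples</heading></section>
    <section><heading>Syntax</heading>We describe syntax of XQL</section>
  </chapter>
</document>"""

root = ET.fromstring(DOC)

# Single elements anywhere below the root (XQL: heading).
all_headings = root.findall(".//heading")

# Parent-child (XQL: chapter/heading).
chapter_headings = root.findall("chapter/heading")

# Ancestor-descendant (XQL: chapter//section).
sections = root.findall("chapter//section")

# Children of the document root (XQL: /book/* on the slide).
root_children = root.findall("*")

# Filter wrt. structure (XQL: //chapter[heading]).
chapters_with_heading = root.findall(".//chapter[heading]")

# Filter wrt. content; the hypothetical wrapper makes the root filterable.
wrapper = ET.Element("wrapper")
wrapper.append(root)
by_content = wrapper.findall("document[@class='H.3.3'][author='John Smith']")

print(len(all_headings), len(chapter_headings), len(sections))
print(len(root_children), len(chapters_with_heading), len(by_content))
```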

Data types with vague predicates

• Compares two values of a specific data type
  – E.g. near, broader, narrower

• Returns a (probabilistic) matching value
  – E.g. “Search for an artist named Ulbrich, living in Frankfurt, Germany about 100 years ago”
  – Candidate match: Ernst Olbrich, Darmstadt, 1899

$P(\text{Olbrich} \approx_p \text{Ulbrich}) = 0.8$ (phonetic similarity)

$P(1899 \approx_n 1903) = 0.9$ (numeric similarity)

$P(\text{Darmstadt} \approx_g \text{Frankfurt}) = 0.7$ (geographic distance)

Semantic Relativism

• Drop distinction attribute/element: ~author searches for attribute or element

• Generalize to data types: #personname searches for attributes/elements of a specific data type

Language models

• Generate language models for each node in the tree

• Combine the children language models using linear interpolation

• Use EM approach to train the linear interpolation parameters

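Element-specific language models---CO approaches

Higher-level nodes: mixture of language models

$$P(w \mid e) = \sum_i \lambda_i \, P(w \mid e_i)$$

where the $e_i$ are the children of element $e$ and the $\lambda_i$ are the interpolation weights.

Query: dog and cat

[Figure: a body element with two children, each with interpolation weight 0.5]

$P(\text{dog} \mid \text{body}) = 0.55$

$P(\text{cat} \mid \text{body}) = 0.45$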
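A minimal sketch of the mixture for a higher-level node. The slide's figure with the child models is lost, so the two child distributions below are hypothetical values chosen only because they reproduce the stated results 0.55 and 0.45 under weights 0.5 and 0.5; the function name `mixture` is likewise an assumption.

```python
def mixture(child_models, weights):
    """Linear interpolation of child language models: sum_i lambda_i * P(w | e_i)."""
    vocabulary = set().union(*child_models)
    return {
        word: sum(lam * model.get(word, 0.0)
                  for lam, model in zip(weights, child_models))
        for word in vocabulary
    }


# Hypothetical child models (not from the slides).
section1 = {"dog": 0.7, "cat": 0.3}
section2 = {"dog": 0.4, "cat": 0.6}

body = mixture([section1, section2], weights=[0.5, 0.5])
print({word: round(p, 2) for word, p in sorted(body.items())})
```

In the slides the interpolation weights are trained with EM rather than fixed by hand; the fixed 0.5/0.5 split here only mirrors the example.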

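Type-specific language models---CAS approaches

[Figure: document tree with edge weights 0.5 / 0.5 at each level, e.g. P(title) = 0.5, P(body) = 0.5, P(section1) = 0.5]

• “Return components of type x where it has component y that contains the query term w”

e.g. return documents where the title contains the word “bird”

$$P(\text{bird} \mid \text{title}) \cdot P(\text{title}) = 1 \times 0.5 = 0.5$$

e.g. return documents where the body’s first section contains the word “dog”

$$P(\text{dog} \mid \text{section1}) \cdot P(\text{section1}) \cdot P(\text{body}) = 1 \times 0.5 \times 0.5 = 0.25$$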

Juru-XML

• Element-specific indexing + vector space model:
  – Transform query into set of (term, path)-conditions
  – Vague matching of path conditions
  – Modified cosine similarity as retrieval function

JuruXML(1)---Transform Query

JuruXML(2)---Vague matching of path conditions

JuruXML(3)---Retrieval function

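• Standard cosine similarity
  – $w_Q(t_i)$: query term weight of term $t_i$
  – $w_D(t_i)$: indexing weight of term $t_i$ in the document

$$\rho(Q, D) = \frac{1}{\|Q\|\,\|D\|} \sum_{t_i \in Q \cap D} w_Q(t_i)\, w_D(t_i)$$

• Modified cosine similarity
  – $w_Q(t_i, c_i^Q)$: query term weight of pair $(t_i, c_i^Q)$
  – $w_D(t_i, c_j^D)$: indexing weight of pair $(t_i, c_j^D)$ in the document

$$\rho(Q, D) = \frac{\sum_{(t_i, c_i^Q) \in Q} \; \sum_{(t_i, c_j^D) \in D} w_Q(t_i, c_i^Q)\, w_D(t_i, c_j^D)\, cr(c_i^Q, c_j^D)}{\|Q\|\,\|D\|}$$

where $cr(c_i^Q, c_j^D)$ scores how well the document context $c_j^D$ matches the query context $c_i^Q$ (the vague matching of path conditions).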
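The modified cosine can be sketched as a double loop over (term, context) pairs. The slides do not preserve the definition of `cr`, so `context_resemblance` below is a hypothetical stand-in: it returns the length ratio of the two paths when the query path is a subsequence of the document path and 0 otherwise. Contexts are represented as tuples of tag names, which is also an assumption.

```python
import math


def is_subsequence(short, long):
    """True if the tags of `short` appear in `long` in the same order."""
    remaining = iter(long)
    return all(tag in remaining for tag in short)


def context_resemblance(query_ctx, doc_ctx):
    """Hypothetical cr: 1.0 for equal paths, smaller for looser matches."""
    if not doc_ctx:
        return 1.0 if not query_ctx else 0.0
    if is_subsequence(query_ctx, doc_ctx):
        return len(query_ctx) / len(doc_ctx)
    return 0.0


def modified_cosine(query, doc):
    """query, doc: dicts mapping (term, context) -> weight."""
    q_norm = math.sqrt(sum(w * w for w in query.values()))
    d_norm = math.sqrt(sum(w * w for w in doc.values()))
    if q_norm == 0 or d_norm == 0:
        return 0.0
    total = 0.0
    for (q_term, q_ctx), q_weight in query.items():
        for (d_term, d_ctx), d_weight in doc.items():
            if q_term == d_term:  # same term, possibly different context
                total += q_weight * d_weight * context_resemblance(q_ctx, d_ctx)
    return total / (q_norm * d_norm)


query = {("xql", ("chapter", "heading")): 1.0}
doc = {
    ("xql", ("document", "chapter", "heading")): 0.8,
    ("syntax", ("document", "chapter", "section")): 0.6,
}
score = modified_cosine(query, doc)
print(round(score, 3))
```

When every matched pair has identical contexts, `cr` is 1 and the function reduces to the standard cosine similarity.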

JuruXML(4)---Alternative approach (Merging contexts)

• For each query term $(t_i, c_i^Q)$, treat all matched document terms $(t_i, c_j^D)$ equally from the user perspective.

• Define a weight function $w(c_i^Q)$
  – E.g. $w(c_i^Q) = 1 + |c_i^Q|$

Clustering XML documents

Document similarity

• Document representation: document → N-dimensional vector
  – N = # document features
  – Feature sets
    • Text only
    • Tags only
    • Text + Tags

• Feature weighting in the document vector

• Similarity measure: vector similarity
  – E.g. cosine measure

$$\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\|\,\|d_2\|}$$
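The three feature sets and the cosine measure can be sketched as follows. Raw occurrence counts are used as feature weights, which is an assumption; the slides leave the weighting scheme open. Tag features are written with angle brackets so they cannot collide with text terms.

```python
import math
from collections import Counter
import xml.etree.ElementTree as ET


def features(xml_text, use_text=True, use_tags=True):
    """Build a count vector over text terms, tag names, or both."""
    counts = Counter()
    for elem in ET.fromstring(xml_text).iter():
        if use_tags:
            counts["<" + elem.tag + ">"] += 1
        if use_text:
            for chunk in (elem.text, elem.tail):
                counts.update((chunk or "").lower().split())
    return counts


def cosine(d1, d2):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(d1[f] * d2[f] for f in d1.keys() & d2.keys())
    norm = (math.sqrt(sum(v * v for v in d1.values()))
            * math.sqrt(sum(v * v for v in d2.values())))
    return dot / norm if norm else 0.0


a = "<book><title>xml retrieval</title></book>"
b = "<article><title>xml retrieval</title></article>"

text_only = cosine(features(a, use_tags=False), features(b, use_tags=False))
tags_only = cosine(features(a, use_text=False), features(b, use_text=False))
text_and_tags = cosine(features(a), features(b))
print(round(text_only, 2), round(tags_only, 2), round(text_and_tags, 2))
```

The two toy documents share all their text but only one of two tags, so the similarity depends visibly on which feature set is chosen.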

Clustering methods

• Hierarchical clustering
  – Main weakness: quadratic complexity

• Partitional clustering
  – K-means
    • Linear time complexity
    • Simplicity of its algorithm

K-Means clustering algorithm
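The algorithm on this slide did not survive extraction, so the following is a generic K-means sketch rather than the slide's exact formulation. It assumes Euclidean distance on dense vectors and seeds the centroids with the first k points; document clustering would more commonly use cosine similarity and random seeds.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain K-means; returns (cluster index per point, centroids)."""
    centroids = [list(p) for p in points[:k]]  # assumption: first k points seed
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centroids.append(
                    [sum(column) / len(members) for column in zip(*members)]
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return assignment, centroids


points = [[0, 0], [10, 10], [0, 1], [10, 11]]
assignment, centroids = kmeans(points, k=2)
print(assignment, centroids)
```

Each iteration touches every point once per centroid, which is where the linear time complexity in the number of documents comes from.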

Measuring clustering quality

• External quality: comparison of clusters with external classification
  – Entropy: distribution of classes within clusters
  – Purity: largest class in a cluster / cluster size

• Internal quality: calculate average inter- and intra-cluster similarities
  – Cohesiveness (overall similarity)
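The two external measures can be sketched directly from their one-line definitions. Each cluster is given as the list of true class labels of its members; averaging per-cluster values weighted by cluster size is a common convention assumed here, not something the slide specifies.

```python
import math
from collections import Counter


def purity(cluster_labels):
    """Size of the largest class in the cluster divided by the cluster size."""
    counts = Counter(cluster_labels)
    return max(counts.values()) / len(cluster_labels)


def entropy(cluster_labels):
    """Entropy (base 2) of the class distribution within one cluster."""
    n = len(cluster_labels)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(cluster_labels).values())


def weighted_average(clusters, measure):
    """Average a per-cluster measure, weighting each cluster by its size."""
    total = sum(len(cluster) for cluster in clusters)
    return sum(len(cluster) * measure(cluster) for cluster in clusters) / total


clusters = [["db", "db", "db", "ir"], ["ir", "ir", "db", "db"]]
overall_purity = weighted_average(clusters, purity)
overall_entropy = weighted_average(clusters, entropy)
print(round(overall_purity, 3), round(overall_entropy, 3))
```

Higher purity and lower entropy both indicate clusters that agree better with the external classification.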

Discussion

• Text alone gives best results

• Text+tags: problem with weighting of tags vs. terms

Conclusion

• XML basics

• XML Retrieval Tasks and methods

• Clustering XML documents

Bayesian Networks

Context-dependent Retrieval

• The score of an element is given by its RSV (Retrieval Status Value).

• The RSV of a node depends on the RSVs of the nodes in its context (parent nodes).

• Elements with the highest values are then presented to the user.

Bayesian Networks

Bayesian Networks(Cont.)
