XML Information Retrieval

XML Information Retrieval. Hui Fang, Department of Computer Science, University of Illinois at Urbana-Champaign. Some slides are borrowed from Norbert Fuhr's XML tutorial.

Page 1: XML Information Retrieval

XML Information Retrieval

Hui Fang
Department of Computer Science

University of Illinois at Urbana-Champaign

Some slides are borrowed from Norbert Fuhr's XML tutorial.

Page 2: XML Information Retrieval

Outline

• XML basics

• Research Topics

• XML IR
– Tasks
– Retrieval methods
– Clustering XML documents

Page 3: XML Information Retrieval

XML standards

Page 4: XML Information Retrieval

Basic XML

• Hierarchical document format for information exchange on the WWW

• Self-describing data (tags)

• Nested element structure with a single root

• Element data can have
– Attributes
– Sub-elements

(Slides from Jayavel Shanmugasundaram)

Page 5: XML Information Retrieval

Example XML document (slide callouts: Attribute, Element)

<?xml version="1.0" encoding="ISO-8859-1" ?>
<!-- Edited with XML Spy v4.2 -->
<book>
  <title>Finding Out About: A Cognitive Perspective on Search Engine Technology and the WWW</title>
  <author id="rbelew">
    <name>
      <firstname>Richard</firstname>
      <lastname>Belew</lastname>
    </name>
    <address>
      <city>San Diego</city>
      <zip>92093</zip>
    </address>
  </author>
</book>
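The example document can be explored programmatically. The following is a minimal sketch using Python's standard `xml.etree.ElementTree` module; the variable names are illustrative and not part of the slides. It shows how the attribute (`id`) and the nested sub-elements of `<author>` are accessed.

```python
import xml.etree.ElementTree as ET

# The example document from the slide (XML declaration and comment omitted).
BOOK_XML = """<book>
  <title>Finding Out About: A Cognitive Perspective on Search Engine Technology and the WWW</title>
  <author id="rbelew">
    <name>
      <firstname>Richard</firstname>
      <lastname>Belew</lastname>
    </name>
    <address>
      <city>San Diego</city>
      <zip>92093</zip>
    </address>
  </author>
</book>"""

root = ET.fromstring(BOOK_XML)           # the single root element: <book>
author = root.find("author")             # a sub-element of the root

author_id = author.get("id")             # attribute access
lastname = author.findtext("name/lastname").strip()   # nested sub-element text
child_tags = [child.tag for child in author]          # direct sub-elements

print(root.tag, author_id, lastname, child_tags)
```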

Page 6: XML Information Retrieval

Tree structure of XML documents

book
  title: Finding…
  author (id="rbelew")
    name
      first name: Richard
      last name: Belew
    address
      city: San Diego
      zip code: 92093

Page 7: XML Information Retrieval

The basic XML standard does not deal with …

• Standardization of element names → XML namespaces

• Structure of element content → XML DTDs

• Data types of element content → XML Schema

Page 8: XML Information Retrieval

XML namespace

<table>
  <tr>
    <td>Apples</td>
    <td>Bananas</td>
  </tr>
</table>

<table>
  <name>GPA Table</name>
  <width>80</width>
  <length>120</length>
</table>

Namespaces provide a method to avoid element name conflicts, such as the two different <table> elements here.

Page 9: XML Information Retrieval

XML namespace (cont.)

<h:table xmlns:h="http://www.w3.org/TR/html4/">
  <h:tr>
    <h:td>Apples</h:td>
    <h:td>Bananas</h:td>
  </h:tr>
</h:table>

<f:table xmlns:f="http://www.w3schools.com/gpa">
  <f:name>GPA Table</f:name>
  <f:width>80</f:width>
  <f:length>120</f:length>
</f:table>

Namespaces provide a method to avoid element name conflicts.
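To see why the prefixes resolve the conflict, the two tables can be parsed with Python's standard `xml.etree.ElementTree`. This is a small sketch, not part of the slides: the parser expands each prefix to its namespace URI, so the two `table` elements end up with different qualified names.

```python
import xml.etree.ElementTree as ET

HTML_TABLE = (
    '<h:table xmlns:h="http://www.w3.org/TR/html4/">'
    "<h:tr><h:td>Apples</h:td><h:td>Bananas</h:td></h:tr>"
    "</h:table>"
)
GPA_TABLE = (
    '<f:table xmlns:f="http://www.w3schools.com/gpa">'
    "<f:name>GPA Table</f:name><f:width>80</f:width><f:length>120</f:length>"
    "</f:table>"
)

html_root = ET.fromstring(HTML_TABLE)
gpa_root = ET.fromstring(GPA_TABLE)

# ElementTree stores qualified names as "{namespace-uri}localname".
print(html_root.tag)   # {http://www.w3.org/TR/html4/}table
print(gpa_root.tag)    # {http://www.w3schools.com/gpa}table

# Queries use a prefix-to-URI mapping; the prefix itself is arbitrary.
ns = {"h": "http://www.w3.org/TR/html4/"}
cells = [td.text for td in html_root.findall("h:tr/h:td", ns)]
print(cells)
```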

Page 10: XML Information Retrieval

XML Document Type Definition

Defines the document structure with a list of legal elements.

<?xml version="1.0"?>
<!DOCTYPE note SYSTEM "note.dtd">
<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Have a rest!</body>
</note>

note.dtd:

<!ELEMENT note (to,from,heading,body)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT body (#PCDATA)>

Page 11: XML Information Retrieval

Research Topics related to XML

Page 12: XML Information Retrieval

Research Topics

• IR areas
– Retrieval models
– Query languages
– …

• DB areas
– Query languages
– System architecture
– Applying relational DB technology to XML data
– Streaming XML
– XML query processing
– XML indexing and compression
– …

Page 13: XML Information Retrieval

XML IR

Page 14: XML Information Retrieval

INEX: Initiative for the Evaluation of XML Retrieval

• Documents: 12,107 articles in XML format

• Queries: 30 content-only (CO); 30 content-and-structure (CAS)

• Relevance assessments: by participating groups

• Participants: 36 active groups in 2003

Page 15: XML Information Retrieval

CO search task

• Document as a hierarchical structure of nested elements

• The type of elements is not considered

• Query refers to content only

• Query syntax as in standard text retrieval

• Task: find the smallest subtree (element) satisfying the query
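The task statement can be illustrated with a deliberately simplified sketch. It assumes Boolean matching (an element "satisfies" the query if its subtree text contains every query term), whereas real CO systems rank elements by a relevance score. The sample document and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

def subtree_words(elem):
    """Lower-cased words of all text inside the element's subtree."""
    return set(" ".join(elem.itertext()).lower().split())

def smallest_matching_elements(root, query):
    """Return elements that contain all query terms while no child does."""
    terms = query.lower().split()
    hits = []

    def visit(elem):
        if not all(t in subtree_words(elem) for t in terms):
            return False
        child_hit = False
        for child in elem:            # element types are ignored, as in CO search
            if visit(child):
                child_hit = True
        if not child_hit:             # no smaller subtree satisfies the query
            hits.append(elem)
        return True

    visit(root)
    return hits

# Hypothetical sample document.
DOC = """<article>
  <sec><title>Augmented reality</title><p>augmented reality in medicine and surgery</p></sec>
  <sec><title>Graphics</title><p>rendering pipelines</p></sec>
</article>"""

root = ET.fromstring(DOC)
narrow = [e.tag for e in smallest_matching_elements(root, "augmented medicine")]
wide = [e.tag for e in smallest_matching_elements(root, "reality rendering")]
print(narrow, wide)
```

In the first query a single paragraph suffices; in the second the terms are spread over two sections, so only the whole article satisfies the query.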

Page 16: XML Information Retrieval

Example of CO topic

<INEX-Topic topic-id="45" query-type="CO" ct-no="056">
  <Title>
    <cw>augmented reality and medicine</cw>
  </Title>
  <Description>How virtual (or augmented) reality can contribute to improve the medical and surgical practice.</Description>
  <Narrative>In order to be considered relevant, a document/component must include considerations about applications of computer graphics and especially augmented (or virtual) reality to medicine (including surgery).</Narrative>
  <Keywords>Augmented virtual reality medicine surgery improve computer assisted aided image</Keywords>
</INEX-Topic>

Page 17: XML Information Retrieval

CAS search task

• Queries contain explicit references to the XML structure, by restricting
– The context of interest
  • <te>: target element
– The context of certain search concepts
  • (<cw>, <ce>) pairs

Page 18: XML Information Retrieval

Example of CAS topic

<INEX-Topic topic-id="09" query-type="CAS" ct-no="048">
  <Title>
    <te>article</te>
    <cw>non-monotonic reasoning</cw><ce>bdy/sec</ce>
    <cw>1999 2000</cw><ce>hdr//yr</ce>
    <cw>-calendar</cw><ce>tig/atl</ce>
    <cw>belief revision</cw>
  </Title>
  <Narrative>Retrieve all articles from the years 1999-2000 that deal with works on non-monotonic reasoning. Do not retrieve CfPs/calendar entries.</Narrative>
  <Keywords>non-monotonic reasoning belief revision</Keywords>
</INEX-Topic>

Page 19: XML Information Retrieval

XML Retrieval Methods

• XIRQL
– An XML query language with IR-related features

• Language models

• JuruXML

Page 20: XML Information Retrieval

XIRQL (I)

• CO approaches:
– Split document text into disjoint nodes
– Index nodes separately
– Aggregate indexing weights for higher-level elements (subtrees)

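Page 21: XML Information Retrieval

Index nodes as units for term weighting

Application of known indexing functions (e.g. tf*idf)

[Figure: example document tree with five numbered index nodes (1 to 5). The document element has class="H.3.3", an author ("John Smith"), a title ("XML Retrieval") and two chapters. One chapter has the heading "Introduction" and the text "This. . ."; the other has the heading "XML Query Lang. XQL" and two sections with the headings "Examples" and "Syntax" and the text "We describe syntax of XQL".]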

Page 22: XML Information Retrieval

Index nodes for relevance-oriented search

[Figure: the example document tree with five numbered index nodes (1 to 5): a document with class="H.3.3", author "John Smith", title "XML Retrieval", and two chapters with headings ("Introduction", "XML Query Lang. XQL") and sections ("Examples", "Syntax", "We describe syntax of XQL").]

Q1: syntax example

Q2: XQL

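Page 23: XML Information Retrieval

Combining weights … by disjunction

Q1: syntax example

Q2: XQL

[Figure: a chapter with two child sections and their indexing weights]

section1: 0.5 example
section2: 0.8 XQL, 0.7 syntax
chapter (own text): 0.3 XQL
chapter (after propagating section weights): 0.5 example, 0.7 syntax, 0.86 XQL

Q2 at chapter: 0.8 + 0.3 - 0.8*0.3 = 0.86

Q1 at chapter: 0.7*0.5 = 0.35

Need to return the most specific element satisfying the query!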
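The arithmetic on this slide follows the usual probabilistic combination rules. A small sketch, assuming the weights are probabilities of independent events (the function names are illustrative):

```python
def p_or(a, b):
    """Disjunction of two independent events: P(A or B) = a + b - a*b."""
    return a + b - a * b

def p_and(a, b):
    """Conjunction of two independent events: P(A and B) = a*b."""
    return a * b

# Q2 ("XQL") at chapter level: the chapter's own weight (0.3) is combined
# with the weight propagated from the section (0.8) by disjunction.
chapter_xql = p_or(0.8, 0.3)

# Q1 ("syntax example") at chapter level: both terms must occur.
chapter_q1 = p_and(0.7, 0.5)

print(round(chapter_xql, 2), round(chapter_q1, 2))
```

Note that the chapter's score for "XQL" exceeds the section's own weight of 0.8, so plain disjunction favours the larger element, which conflicts with returning the most specific element.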

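Page 24: XML Information Retrieval

Combining weights … with augmentation weight

Q2: XQL

[Figure: a chapter with two child sections; each section-to-chapter edge carries an augmentation weight of 0.6]

section1: 0.5 example
section2: 0.8 XQL, 0.7 syntax
chapter (own text): 0.3 XQL
chapter (after propagating section weights): 0.30 example, 0.42 syntax, 0.64 XQL

Q2 at chapter: 0.48 + 0.3 - 0.48*0.3 ≈ 0.64, where 0.48 = 0.6*0.8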
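The effect of the augmentation weight can be checked numerically. This sketch assumes, as the figures suggest, that a child's weight is multiplied by the augmentation weight (0.6) before being combined by disjunction with the parent's own weight; the function names are illustrative.

```python
AUGMENTATION = 0.6   # down-weighting factor on each section-to-chapter edge

def p_or(a, b):
    """Disjunction of two independent events."""
    return a + b - a * b

def propagate(child_weight, augmentation=AUGMENTATION):
    """Weight that a child term contributes to its parent element."""
    return augmentation * child_weight

chapter_example = propagate(0.5)             # 0.30
chapter_syntax = propagate(0.7)              # 0.42
chapter_xql = p_or(propagate(0.8), 0.3)      # 0.48 + 0.3 - 0.48*0.3 = 0.636

print(round(chapter_example, 2), round(chapter_syntax, 2), round(chapter_xql, 2))
```

With augmentation the chapter scores about 0.64 for "XQL", below the section's 0.8, so the more specific element is now ranked first.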

Page 25: XML Information Retrieval

XIRQL (II)

• CAS approaches
– Extension of XQL by
  • Weighting and ranking
  • Data types with vague predicates
  • Structural relativism

Page 26: XML Information Retrieval

XQL Expressions

• Path conditions
– Search for single elements: heading
– Parent-child: chapter/heading
– Ancestor-descendant: chapter//section
– Document root: /book/*

• Filter w.r.t. structure: //chapter[heading]

• Filter w.r.t. content: /document[@class="H.3.3" $and$ author="John Smith"]
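XQL is not available in Python's standard library, but the limited XPath subset of `xml.etree.ElementTree` supports comparable path conditions. The sketch below uses that subset as a stand-in for XQL; the sample document is a hypothetical reconstruction of the running example, and the content filter on the root is written as plain Python rather than as a path expression.

```python
import xml.etree.ElementTree as ET

# Hypothetical reconstruction of the running example document.
DOC = """<document class="H.3.3">
  <author>John Smith</author>
  <title>XML Retrieval</title>
  <chapter>
    <heading>Introduction</heading>
  </chapter>
  <chapter>
    <heading>XML Query Lang. XQL</heading>
    <section><heading>Examples</heading></section>
    <section><heading>Syntax</heading></section>
  </chapter>
</document>"""

root = ET.fromstring(DOC)

# Single elements (heading): every heading anywhere in the document.
all_headings = [h.text for h in root.iter("heading")]

# Parent-child (chapter/heading): headings directly under a chapter.
chapter_headings = [h.text for h in root.findall("chapter/heading")]

# Ancestor-descendant (chapter//section): sections at any depth below a chapter.
sections = root.findall("chapter//section")

# Structural filter (//chapter[heading]): chapters that have a heading child.
chapters_with_heading = root.findall(".//chapter[heading]")

# Content filter on the root element, expressed directly in Python.
root_matches = (
    root.get("class") == "H.3.3" and root.findtext("author") == "John Smith"
)

print(all_headings, chapter_headings, len(sections), root_matches)
```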

Page 27: XML Information Retrieval

Data types with vague predicates

• Compares two values of a specific data type
– E.g. near, broader, narrower

• Returns a (probabilistic) matching value
– E.g. "Search for an artist named Ulbrich, living in Frankfurt, Germany about 100 years ago"

Candidate: Ernst Olbrich, Darmstadt, 1899

$P(\text{Olbrich} \approx_p \text{Ulbrich}) = 0.8$ (phonetic similarity)

$P(1899 \approx_n 1903) = 0.9$ (numeric similarity)

$P(\text{Darmstadt} \approx_g \text{Frankfurt}) = 0.7$ (geographic distance)

Page 28: XML Information Retrieval

Semantic Relativism

• Drop the distinction between attribute and element: ~author searches for an attribute or an element

• Generalize to data types: #personname searches for attributes/elements of a specific data type

Page 29: XML Information Retrieval

Language models

• Generate language models for each node in the tree

• Combine the children language models using linear interpolation

• Use EM approach to train the linear interpolation parameters

Page 30: XML Information Retrieval

Element-specific language models: CO approaches

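Page 31: XML Information Retrieval

Higher-level nodes: mixture of language models

$P(w \mid \theta) = \sum_i \lambda_i \, P(w \mid \theta_i)$

Query: dog and cat

Interpolation weights (edge labels in the figure): 0.5, 0.5

$P(\text{dog} \mid \text{body}) = 0.55$

$P(\text{cat} \mid \text{body}) = 0.45$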
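The mixture can be made concrete with a small sketch. The child models below are hypothetical values chosen so that equal interpolation weights of 0.5 reproduce the body probabilities on the slide; the interpolation weights are fixed here rather than trained with EM, and the query score assumes a unigram query-likelihood model.

```python
import math

def mixture(child_models, lambdas):
    """Linear interpolation: P(w | parent) = sum_i lambda_i * P(w | child_i)."""
    vocabulary = set().union(*child_models)
    return {
        w: sum(lam * model.get(w, 0.0) for lam, model in zip(lambdas, child_models))
        for w in vocabulary
    }

def query_likelihood(query_terms, model):
    """Unigram query likelihood: product of per-term probabilities."""
    return math.prod(model.get(t, 0.0) for t in query_terms)

# Hypothetical child (section) language models.
section1 = {"dog": 0.7, "cat": 0.3}
section2 = {"dog": 0.4, "cat": 0.6}

body = mixture([section1, section2], [0.5, 0.5])
score = query_likelihood(["dog", "cat"], body)

print(round(body["dog"], 2), round(body["cat"], 2), round(score, 4))
```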

Page 32: XML Information Retrieval

Type-specific language models: CAS approaches

Page 33: XML Information Retrieval

[Figure: component tree with edge weights 0.5, 0.5 and 0.5, 0.5]

• "Return components of type x that have a component y containing the query term w"

e.g. return documents where the title contains the word "bird":

$P(\text{bird} \mid \text{title}) \cdot P(\text{title}) = 1 \times 0.5 = 0.5$

e.g. return documents where the body's first section contains the word "dog":

$P(\text{dog} \mid \text{section1}) \cdot P(\text{section1}) \cdot P(\text{body}) = 1 \times 0.5 \times 0.5 = 0.25$

Page 34: XML Information Retrieval

JuruXML

• Element-specific indexing + vector space model:
– Transform the query into a set of (term, path) conditions
– Vague matching of path conditions
– Modified cosine similarity as retrieval function

Page 35: XML Information Retrieval

JuruXML (1): Transform query

Page 36: XML Information Retrieval

JuruXML (2): Vague matching of path conditions

Page 37: XML Information Retrieval

JuruXML (3): Retrieval function

• Standard cosine similarity
– $w_Q(t_i)$: query term weight of term $t_i$
– $w_D(t_i)$: indexing weight of term $t_i$ in the document

$$\mathrm{sim}(Q, D) = \frac{1}{\|Q\| \, \|D\|} \sum_{t_i \in Q \cap D} w_Q(t_i) \, w_D(t_i)$$

• Modified cosine similarity
– $w_Q(t_i, c_i^Q)$: query term weight of pair $(t_i, c_i^Q)$
– $w_D(t_i, c_j^D)$: indexing weight of pair $(t_i, c_j^D)$ in the document

$$\mathrm{sim}(Q, D) = \frac{1}{\|Q\| \, \|D\|} \sum_{(t_i, c_i^Q) \in Q} \; \sum_{(t_i, c_j^D) \in D} w_Q(t_i, c_i^Q) \, w_D(t_i, c_j^D) \, cr(c_i^Q, c_j^D)$$
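The modified cosine similarity can be sketched in a few lines. The slides do not define the context resemblance function cr, so the version below is an assumption: it returns (1 + |query path|) / (1 + |document path|) when the query path is a subsequence of the document path, and 0 otherwise. Contexts are represented as tuples of element names, and all weights in the example are hypothetical.

```python
import math

def is_subsequence(short, long):
    """True if the steps of `short` appear in `long` in the same order."""
    remaining = iter(long)
    return all(step in remaining for step in short)

def context_resemblance(query_ctx, doc_ctx):
    """Assumed cr function: rewards document paths close to the query path."""
    if not is_subsequence(query_ctx, doc_ctx):
        return 0.0
    return (1 + len(query_ctx)) / (1 + len(doc_ctx))

def norm(weights):
    return math.sqrt(sum(w * w for w in weights.values()))

def modified_cosine(query, doc, cr=context_resemblance):
    """query, doc: dicts mapping (term, context) pairs to weights."""
    q_norm, d_norm = norm(query), norm(doc)
    if q_norm == 0 or d_norm == 0:
        return 0.0
    total = 0.0
    for (q_term, q_ctx), w_q in query.items():
        for (d_term, d_ctx), w_d in doc.items():
            if q_term == d_term:                 # same term, possibly different context
                total += w_q * w_d * cr(q_ctx, d_ctx)
    return total / (q_norm * d_norm)

# Hypothetical query and document vectors.
query = {("xql", ("chapter",)): 1.0}
doc = {
    ("xql", ("chapter", "section")): 0.8,
    ("syntax", ("chapter", "section")): 0.6,
}
score = modified_cosine(query, doc)
print(round(score, 4))
```

When every matched pair has identical contexts, cr is 1 and the function reduces to the standard cosine similarity.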

Page 38: XML Information Retrieval

JuruXML (4): Alternative approach (merging contexts)

• For each query term $(t_i, c_i^Q)$, treat all matched document terms $(t_i, c_j^D)$ equally from the user perspective.

• Define a weight function $w(c_i^Q)$
– E.g. $w(c_i^Q) = 1 + |c_i^Q|$

Page 39: XML Information Retrieval

Clustering XML documents

Page 40: XML Information Retrieval

Document similarity

• Document representation: document → N-dimensional vector
– N = number of document features
– Feature sets
  • Text only
  • Tags only
  • Text + tags

• Feature weighting in the document vector

• Similarity measure: vector similarity
– E.g. cosine measure

$$\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$$

Page 41: XML Information Retrieval

Clustering methods

• Hierarchical clustering
– Main weakness: quadratic complexity

• Partitional clustering
– K-means
  • Linear time complexity
  • Simplicity of its algorithm

Page 42: XML Information Retrieval

K-Means clustering algorithm
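A minimal sketch of the standard K-means iteration (assign each document to its closest centroid, then recompute the centroids) is shown here in plain Python. It uses the cosine measure as similarity and, as a simplifying assumption, takes the first k vectors as initial centroids instead of choosing them at random; the example vectors are hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def mean(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def kmeans(vectors, k, max_iterations=20):
    centroids = [list(v) for v in vectors[:k]]    # assumption: deterministic seeds
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: each vector joins its most similar centroid.
        new_assignment = [
            max(range(k), key=lambda c: cosine(v, centroids[c])) for v in vectors
        ]
        if new_assignment == assignment:          # converged
            break
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = mean(members)
    return assignment, centroids

# Hypothetical document vectors over three features.
docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0], [0.9, 0.1, 0.0], [0.0, 0.8, 1.0]]
labels, centers = kmeans(docs, k=2)
print(labels)
```

Each iteration compares every document with k centroids, which is where the linear time complexity in the number of documents comes from.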

Page 43: XML Information Retrieval

Measuring clustering quality

• External quality: comparison of clusters with an external classification
– Entropy: distribution of classes within clusters
– Purity: largest class in a cluster / cluster size

• Internal quality: calculate average inter- and intra-cluster similarities
– Cohesiveness (overall similarity)
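The two external measures can be computed from the class labels of the documents in each cluster. The sketch below assumes base-2 logarithms for entropy and a size-weighted average over clusters; both are common conventions rather than choices stated on the slide, and the example labels are hypothetical.

```python
import math
from collections import Counter

def cluster_purity(labels):
    """Size of the largest class in the cluster divided by the cluster size."""
    return Counter(labels).most_common(1)[0][1] / len(labels)

def cluster_entropy(labels):
    """Entropy of the class distribution inside one cluster (0 = pure)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def weighted_average(clusters, measure):
    """Average a per-cluster measure, weighting each cluster by its size."""
    total = sum(len(cluster) for cluster in clusters)
    return sum(len(cluster) / total * measure(cluster) for cluster in clusters)

# Hypothetical clustering: each inner list holds the true classes of one cluster.
clusters = [["a", "a", "a", "b"], ["b", "b"]]

purity = weighted_average(clusters, cluster_purity)
entropy = weighted_average(clusters, cluster_entropy)
print(round(purity, 3), round(entropy, 3))
```

Higher purity and lower entropy both indicate clusters that agree better with the external classification.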

Page 44: XML Information Retrieval

Discussion

• Text alone gives the best results

• Text + tags: problem with weighting of tags vs. terms

Page 45: XML Information Retrieval

Conclusion

• XML basics

• XML Retrieval Tasks and methods

• Clustering XML documents

Page 46: XML Information Retrieval

Bayesian Networks

Page 47: XML Information Retrieval

Context-dependent Retrieval

• The score of an element is given by its RSV (Retrieval Status Value).

• The RSV of a node depends on the RSVs of the nodes in its context (parent nodes).

• Elements with the highest values are then presented to the user.

Page 48: XML Information Retrieval

Bayesian Networks

Page 49: XML Information Retrieval

Bayesian Networks (cont.)