XML Information Retrieval
Hui Fang
Department of Computer Science
University of Illinois at Urbana-Champaign
Some slides are borrowed from Norbert Fuhr’s XML Tutorial.
Outline
• XML basics
• Research Topics
• XML IR
  – Tasks
  – Retrieval methods
  – Clustering XML documents
XML standards
Basic XML
• Hierarchical document format for information exchange in WWW
• Self describing data (tags)
• Nested element structure having a root
• Element data can have
  – Attributes
  – Sub-elements
(Slides from Jayavel Shanmugasundaram)
Example XML document (labels on the slide: Attribute, Element)
<?xml version="1.0" encoding="ISO-8859-1" ?>
<!-- Edited with XML Spy v4.2 -->
<book>
  <title>Finding Out About: A Cognitive Perspective on Search Engine Technology and the WWW</title>
  <author id="rbelew">
    <name>
      <firstname>Richard</firstname>
      <lastname>Belew</lastname>
    </name>
    <address>
      <city>San Diego</city>
      <zip>92093</zip>
    </address>
  </author>
</book>
Tree structure of XML documents
book
  title: Finding…
  author (id="rbelew")
    name
      First name: Richard
      Last name: Belew
    address
      city: San Diego
      Zip code: 92093
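The tree view above can be reproduced programmatically. The following is a minimal sketch, assuming Python's standard `xml.etree.ElementTree` module; the helper name `outline` is made up for illustration and is not part of the slides.

```python
import xml.etree.ElementTree as ET

# The example document from the slide (whitespace normalized).
BOOK_XML = """<book>
  <title>Finding Out About: A Cognitive Perspective on Search Engine Technology and the WWW</title>
  <author id="rbelew">
    <name>
      <firstname>Richard</firstname>
      <lastname>Belew</lastname>
    </name>
    <address>
      <city>San Diego</city>
      <zip>92093</zip>
    </address>
  </author>
</book>"""


def outline(element, depth=0):
    """Return one indented line per element: tag, attributes, and own text."""
    text = (element.text or "").strip()
    attrs = "".join(f' {key}="{value}"' for key, value in element.attrib.items())
    lines = ["  " * depth + element.tag + attrs + (f": {text}" if text else "")]
    for child in element:  # iterating an Element yields its sub-elements
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(BOOK_XML)
tree_lines = outline(root)
print("\n".join(tree_lines))
```

Each nesting level of the document becomes one indentation level, which makes the single root and the attribute versus sub-element distinction visible.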
Basic XML standard does not deal with …
• Standardization of element names → XML namespaces
• Structure of element content → XML DTDs
• Data types of element content → XML Schema
XML namespace
<table>
  <tr>
    <td>Apples</td>
    <td>Bananas</td>
  </tr>
</table>
<table>
  <name>GPA Table</name>
  <width>80</width>
  <length>120</length>
</table>
Provide a method to avoid element name conflicts
XML namespace (Cont.)
<h:table xmlns:h="http://www.w3.org/TR/html4/">
  <h:tr>
    <h:td>Apples</h:td>
    <h:td>Bananas</h:td>
  </h:tr>
</h:table>
<f:table xmlns:f="http://www.w3schools.com/gpa">
  <f:name>GPA Table</f:name>
  <f:width>80</f:width>
  <f:length>120</f:length>
</f:table>
Provide a method to avoid element name conflicts
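To see how the prefixes resolve the conflict, here is a small sketch, assuming Python's standard `xml.etree.ElementTree` parser. The variable names and the prefix map `ns` are chosen for illustration only.

```python
import xml.etree.ElementTree as ET

HTML_TABLE = """<h:table xmlns:h="http://www.w3.org/TR/html4/">
  <h:tr><h:td>Apples</h:td><h:td>Bananas</h:td></h:tr>
</h:table>"""

GPA_TABLE = """<f:table xmlns:f="http://www.w3schools.com/gpa">
  <f:name>GPA Table</f:name><f:width>80</f:width><f:length>120</f:length>
</f:table>"""

html_root = ET.fromstring(HTML_TABLE)
gpa_root = ET.fromstring(GPA_TABLE)

# ElementTree expands every prefix to "{namespace-uri}localname",
# so the two "table" elements end up with different qualified names.
print(html_root.tag)
print(gpa_root.tag)

# Look up children through a prefix map of our own choosing.
ns = {"h": "http://www.w3.org/TR/html4/", "f": "http://www.w3schools.com/gpa"}
cells = [td.text for td in html_root.findall("h:tr/h:td", ns)]
width = gpa_root.findtext("f:width", namespaces=ns)
print(cells, width)
```

The prefix letters themselves carry no meaning; only the namespace URI they are bound to distinguishes the two `table` vocabularies.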
XML Document Type Definition
Define the document structure with a list of legal elements
<?xml version="1.0"?>
<!DOCTYPE note SYSTEM "note.dtd">
<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Have a rest!</body>
</note>
<!ELEMENT note (to,from,heading,body)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT body (#PCDATA)>
Research Topics related to XML
Research Topics
• IR areas
– Retrieval Models
– Query Languages
– …
• DB areas
– Query Languages
– System architecture
– Apply relational DB technology to XML data
– Streaming XML
– XML Query Processing
– XML indexing and compression
– ……
XML IR
INEX: Initiative for the Evaluation of XML Retrieval
• Documents: 12,107 articles in XML format
• Queries: 30 content-only (CO); 30 content-and-structure (CAS)
• Relevance Assessments: by participating groups
• Participants: 36 active groups in 2003
CO search task
• Document as hierarchical structure of nested elements
• Type of elements is not considered
• Query refers to content only
• Query syntax as in standard text retrieval
• Task: Find the smallest subtree (element) satisfying the query
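The "smallest subtree" idea can be illustrated with a deliberately simplified sketch. It assumes Boolean term containment instead of ranked retrieval, and the function names and sample document are hypothetical, not taken from INEX.

```python
import xml.etree.ElementTree as ET


def contains_all(element, terms):
    """True if the element's full text (descendants included) has every term."""
    words = set(" ".join(element.itertext()).lower().split())
    return all(term.lower() in words for term in terms)


def smallest_matches(element, terms):
    """Return the deepest elements that still contain every query term."""
    if not contains_all(element, terms):
        return []
    deeper = [hit for child in element for hit in smallest_matches(child, terms)]
    # If no child satisfies the query on its own, this element is the answer.
    return deeper or [element]


# Hypothetical document, for illustration only.
DOC = """<article>
  <sec><title>Augmented reality</title><p>augmented reality in surgery and medicine</p></sec>
  <sec><title>Databases</title><p>query processing</p></sec>
</article>"""

article = ET.fromstring(DOC)
hits = smallest_matches(article, ["augmented", "reality", "medicine"])
print([hit.tag for hit in hits])
```

A query whose terms are spread over both sections (for example "augmented databases") is only satisfied by the whole `article`, so the root is returned in that case.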
Example of CO Topic
<INEX-Topic topic-id="45" query-type="CO" ct-no="056">
  <Title>
    <cw>augmented reality and medicine</cw>
  </Title>
  <Description>How virtual (or augmented) reality can contribute to improve the medical and surgical practice.</Description>
  <Narrative>In order to be considered relevant, a document/component must include considerations about applications of computer graphics and especially augmented (or virtual) reality to medicine (including surgery).</Narrative>
  <Keywords>Augmented virtual reality medicine surgery improve computer assisted aided image</Keywords>
</INEX-Topic>
CAS search task
• Queries contain explicit references to the XML structure, by restricting
  – The context of interest
    • <te>: target element
  – The context of certain search concepts
    • (<cw>, <ce>) pairs
Example of CAS topic
<INEX-Topic topic-id="09" query-type="CAS" ct-no="048">
  <Title>
    <te>article</te>
    <cw>non-monotonic reasoning</cw><ce>bdy/sec</ce>
    <cw>1999 2000</cw><ce>hdr//yr</ce>
    <cw>-calendar</cw><ce>tig/atl</ce>
    <cw>belief revision</cw>
  </Title>
  <Narrative>Retrieve all articles from the years 1999-2000 that deal with works on non-monotonic reasoning. Do not retrieve CfPs/calendar entries</Narrative>
  <Keywords>non-monotonic reasoning belief revision</Keywords>
</INEX-Topic>
XML Retrieval Methods
• XIRQL
– XML query languages with IR-related features
• Language models
• JuruXML
XIRQL(I)
• CO approaches:
  – Split document text into disjoint nodes
  – Index nodes separately
  – Aggregate indexing weights for higher-level elements (subtrees)
Index nodes as units for term weighting
Application of known indexing functions (e.g. tf*idf)
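[Figure: example document tree with index nodes numbered 1 to 5. The root is document (class="H.3.3") with author "John Smith", title "XML Retrieval", and two chapters. One chapter has the heading "Introduction" and the text "This. . ."; the other has the heading "XML Query Lang. XQL" and sections with the headings "Examples" and "Syntax" and the text "We describe syntax of XQL".]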
Index nodes for relevance-oriented search
[Figure: the same example document tree (document, author, title, chapters, sections) with index nodes 1 to 5.]
Q1: syntax example
Q2: XQL
Combining weights
…by disjunction
Q1: syntax example
Q2: XQL
section1: 0.5 example, 0.7 syntax
section2: 0.8 XQL
chapter: 0.3 XQL (own weight); 0.5 example, 0.7 syntax (propagated from section1)
Q1 (syntax, example): 0.7*0.5 = 0.35
Q2 in chapter: 0.8 + 0.3 - 0.8*0.3 = 0.86
Need to return most specific element satisfying the query!
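Combining weights
… with augmentation weight
Q2: XQL
section1: 0.5 example, 0.7 syntax
section2: 0.8 XQL
Augmentation weight on each section-to-chapter edge: 0.6
chapter: 0.3 XQL (own weight); 0.30 example, 0.42 syntax (propagated: 0.6*0.5 and 0.6*0.7)
Q2 in chapter: 0.48 + 0.3 - 0.48*0.3 = 0.636 ≈ 0.64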
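The arithmetic on the two "Combining weights" slides can be checked with a short sketch. The function and variable names are illustrative; only the weights 0.5, 0.7, 0.8, 0.3 and the augmentation weight 0.6 come from the slides.

```python
def prob_or(*weights):
    """Disjunction of weights, applied pairwise as a + b - a*b."""
    result = 0.0
    for w in weights:
        result = result + w - result * w
    return result


def prob_and(*weights):
    """Conjunction of weights as a plain product."""
    result = 1.0
    for w in weights:
        result *= w
    return result


section1 = {"example": 0.5, "syntax": 0.7}
section2 = {"XQL": 0.8}
chapter_own = {"XQL": 0.3}


def chapter_weight(term, augmentation=1.0):
    """Chapter weight: own weight OR-ed with the down-weighted section weights."""
    propagated = [augmentation * s[term] for s in (section1, section2) if term in s]
    own = [chapter_own[term]] if term in chapter_own else []
    return prob_or(*own, *propagated)


plain = chapter_weight("XQL")            # 0.8 + 0.3 - 0.8*0.3
augmented = chapter_weight("XQL", 0.6)   # 0.48 + 0.3 - 0.48*0.3
q1_section1 = prob_and(section1["syntax"], section1["example"])
print(round(plain, 2), round(augmented, 3), round(q1_section1, 2))
```

Without augmentation the chapter (0.86) outranks section2 (0.8) for Q2, although the section is the more specific answer. With the augmentation weight 0.6 the chapter drops to about 0.64 and section2 is ranked first.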
XIRQL(II)
• CAS approaches
– Extension of XQL by
  • Weighting and ranking
  • Data types with vague predicates
  • Structural relativism
XQL Expressions
• Path condition
  – search for single elements: heading
  – parent-child: chapter/heading
  – ancestor-descendant: chapter//section
  – document root: /book/*
• Filter wrt. structure: //chapter[heading]
• Filter wrt. content: /document[@class="H.3.3" $and$ author="John Smith"]
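XQL itself is not assumed to be available here. As a rough analogue, the limited XPath subset of Python's standard `xml.etree.ElementTree` can express similar conditions. The sample document below is hypothetical, and the `$and$` filter is approximated by chaining two predicates.

```python
import xml.etree.ElementTree as ET

# Hypothetical document, loosely modelled on the example used in these slides.
DOC = """<document class="H.3.3">
  <author>John Smith</author>
  <title>XML Retrieval</title>
  <chapter><heading>Introduction</heading></chapter>
  <chapter>
    <heading>XML Query Lang. XQL</heading>
    <section><heading>Examples</heading></section>
    <section><heading>Syntax</heading></section>
  </chapter>
</document>"""

document = ET.fromstring(DOC)
# Wrap the document so that it can itself be selected by a path expression.
collection = ET.Element("collection")
collection.append(document)

# Single elements (roughly "heading"): every heading at any depth.
all_headings = [h.text for h in collection.iter("heading")]
# Parent-child: chapter/heading
chapter_headings = [h.text for h in document.findall("chapter/heading")]
# Ancestor-descendant: chapter//section
sections = document.findall("chapter//section")
# Filter wrt. structure: //chapter[heading]
chapters_with_heading = collection.findall(".//chapter[heading]")
# Filter wrt. content: two chained predicates stand in for "$and$".
matching_documents = collection.findall(
    "document[@class='H.3.3'][author='John Smith']"
)
print(all_headings, chapter_headings, len(sections), len(matching_documents))
```

These are exact Boolean matches; the weighting, vague predicates, and relativism that XIRQL adds on top of XQL are not covered by this sketch.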
Data types with vague predicates
• Compares two values of a specific data type
  – E.g. near, broader, narrower
• Returns (probabilistic) matching value
  – E.g. “Search for an artist named Ulbrich, living in Frankfurt, Germany about 100 years ago”
Ernst Olbrich, Darmstadt, 1899
$P(\text{Olbrich} \approx_p \text{Ulbrich}) = 0.8$ (phonetic similarity)
$P(1899 \approx_n 1903) = 0.9$ (numeric similarity)
$P(\text{Darmstadt} \approx_g \text{Frankfurt}) = 0.7$ (geographic distance)
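The slides do not state how the three matching values are combined into one score for the record. One common choice, shown here purely as an assumption, is to treat the three conditions as independent and multiply them.

```python
from math import prod

# Matching values from the Olbrich example; the dictionary keys are illustrative.
match_probabilities = {
    "name (phonetic)": 0.8,
    "year (numeric)": 0.9,
    "city (geographic)": 0.7,
}

# Assumption: the three vague conditions are independent, so the probability
# that the record satisfies all of them is the product of the three values.
record_score = prod(match_probabilities.values())
print(round(record_score, 3))  # 0.504
```

Under this assumption the record "Ernst Olbrich, Darmstadt, 1899" is not rejected for failing exact matches; it simply receives a lower score than an exact hit would.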
Semantic Relativism
• Drop distinction attribute/element: ~author searches for attribute or element
• Generalize to data types: #personname searches for attributes/elements of a specific data type
Language models
• Generate language models for each node in the tree
• Combine the children language models using linear interpolation
• Use EM approach to train the linear interpolation parameters
Element-specific language models---CO Approaches
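Higher level nodes: mixture of language models
$P(w \mid \theta) = \sum_i \lambda_i \, P(w \mid \theta_i)$
Query: dog and cat
Mixture weights: 0.5, 0.5
$P(\text{dog} \mid \text{body}) = 0.55$
$P(\text{cat} \mid \text{body}) = 0.45$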
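A minimal sketch of the linear interpolation follows. The child models `section1` and `section2` are hypothetical values, picked only so that equal weights of 0.5 reproduce the probabilities 0.55 and 0.45 shown for the body node; the actual child distributions are not given in the slides.

```python
def mix(child_models, weights):
    """Linear interpolation: P(w | parent) = sum_i lambda_i * P(w | child_i)."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    vocabulary = set().union(*child_models)
    return {
        word: sum(lam * model.get(word, 0.0)
                  for lam, model in zip(weights, child_models))
        for word in vocabulary
    }


# Hypothetical child language models (not taken from the slides).
section1 = {"dog": 0.7, "cat": 0.3}
section2 = {"dog": 0.4, "cat": 0.6}

body = mix([section1, section2], [0.5, 0.5])
print(round(body["dog"], 2), round(body["cat"], 2))
```

In the approach described above the interpolation weights are not fixed by hand but trained with EM; the fixed 0.5 values here only mirror the example.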
Type-specific language models--- CAS approaches
[Figure: document tree whose edges each carry the weight 0.5]
• “Return components of type x where it has component y that contains the query term w”
e.g. return documents where the title contains the word “bird”
$P(\text{bird} \mid \text{title}) \, P(\text{title}) = 1 \times 0.5 = 0.5$
e.g. return documents where the body’s first section contains the word “dog”
$P(\text{dog} \mid \text{section1}) \, P(\text{section1}) \, P(\text{body}) = 1 \times 0.5 \times 0.5 = 0.25$
Juru-XML
• Element-specific indexing + vector space model:
  – Transform query into set of (term, path)-conditions
  – Vague matching of path conditions
  – Modified cosine similarity as retrieval function
JuruXML(1)---Transform Query
JuruXML(2)---Vague matching of path conditions
JuruXML(3)---Retrieval function
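• Standard cosine similarity
  – $w_Q(t_i)$: query term weight of term $t_i$
  – $w_D(t_i)$: indexing weight of term $t_i$ in the document
$$\rho(Q, D) = \frac{1}{|Q|\,|D|} \sum_{t_i \in Q \cap D} w_Q(t_i) \, w_D(t_i)$$
• Modified cosine similarity
  – $w_Q(t_i, c_i^Q)$: query term weight of pair $(t_i, c_i^Q)$
  – $w_D(t_i, c_j^D)$: indexing weight of pair $(t_i, c_j^D)$ in the document
$$\rho(Q, D) = \frac{1}{|Q|\,|D|} \sum_{(t_i, c_i^Q) \in Q} \; \sum_{(t_i, c_j^D) \in D} w_Q(t_i, c_i^Q) \, w_D(t_i, c_j^D) \, cr(c_i^Q, c_j^D)$$
Here $cr(c_i^Q, c_j^D)$ scores the vague match between the query path condition and the document path.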
JuruXML(4)---Alternative approach (Merging contexts)
• For each query term $(t_i, c_i^Q)$, treat all matched document terms $(t_i, c_j^D)$ equally from the user perspective.
• Define a weight function $w(c_i^Q)$
  – E.g. $w(c_i^Q) = 1 + |c_i^Q|$
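The modified cosine similarity can be sketched as follows. Two parts are assumptions made for illustration and are not taken from the slides: the concrete `context_resemblance` function (the slides name `cr` but do not define it) and the use of Euclidean norms for |Q| and |D|.

```python
from math import sqrt


def context_resemblance(query_path, doc_path):
    """Hypothetical cr: 1.0 for identical paths, 0.5 if the (non-empty) query
    path is a suffix of the document path, 0.0 otherwise."""
    if query_path == doc_path:
        return 1.0
    if query_path and doc_path[-len(query_path):] == query_path:
        return 0.5
    return 0.0


def norm(weights):
    """Euclidean norm of a weight vector (assumed meaning of |Q| and |D|)."""
    return sqrt(sum(w * w for w in weights.values()))


def modified_cosine(query, document, cr=context_resemblance):
    """query, document: dicts mapping (term, path) pairs to weights."""
    total = 0.0
    for (q_term, q_path), q_weight in query.items():
        for (d_term, d_path), d_weight in document.items():
            if q_term == d_term:  # same term; contexts are matched vaguely
                total += q_weight * d_weight * cr(q_path, d_path)
    return total / (norm(query) * norm(document))


query = {("xql", ("section",)): 1.0}
document = {
    ("xql", ("chapter", "section")): 0.8,
    ("syntax", ("chapter", "section")): 0.6,
}
score = modified_cosine(query, document)
print(round(score, 3))
```

When every matching pair has identical contexts, `cr` is 1 and the function reduces to the standard cosine similarity over (term, path) pairs.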
Clustering XML documents
Document similarity
• Document representation: document → N-dimensional vector
  – N = # document features
  – Feature sets
    • Text only
    • Tags only
    • Text + Tags
• Feature weighting in the document vector
• Similarity measure: vector similarity
  – E.g. cosine measure
$$\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{|d_1|\,|d_2|}$$
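The three feature sets and the cosine measure can be tried out with a short sketch. Raw term frequency is used as the feature weight here, which is an assumption; the slides leave the weighting scheme open. The helper names and toy documents are illustrative.

```python
from collections import Counter
from math import sqrt


def features(tokens, tags, mode="text+tags"):
    """Sparse term-frequency vector built from text tokens and/or tag names."""
    counts = Counter()
    if mode in ("text", "text+tags"):
        counts.update(tokens)
    if mode in ("tags", "text+tags"):
        # Wrap tag names in <> so they cannot collide with text tokens.
        counts.update(f"<{tag}>" for tag in tags)
    return counts


def cosine(d1, d2):
    """cos(d1, d2) = (d1 . d2) / (|d1| |d2|) for sparse vectors."""
    dot = sum(d1[f] * d2[f] for f in d1.keys() & d2.keys())
    norm1 = sqrt(sum(v * v for v in d1.values()))
    norm2 = sqrt(sum(v * v for v in d2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0


doc_a = (["xml", "retrieval"], ["article", "title"])
doc_b = (["xml", "clustering"], ["article", "title"])

similarity = {
    mode: cosine(features(*doc_a, mode=mode), features(*doc_b, mode=mode))
    for mode in ("text", "tags", "text+tags")
}
print({mode: round(value, 2) for mode, value in similarity.items()})
```

Two documents with the same markup are identical under "tags only", which hints at why mixing tags and text requires careful weighting.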
Clustering methods
• Hierarchical clustering
  – Main weakness: quadratic complexity
• Partitional clustering
  – K-means
    • Linear time complexity
    • Simplicity of its algorithm
K-Means clustering algorithm
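The slide body is not part of this transcript, so the following is a generic sketch of the textbook K-means loop (Lloyd's algorithm), not necessarily the variant shown in the talk. It assumes dense vectors, squared Euclidean distance, and a naive initialization with the first k points.

```python
def kmeans(points, k, iterations=20):
    """Textbook K-means on dense vectors; returns (assignments, centroids)."""
    centroids = [list(p) for p in points[:k]]  # naive init: first k points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


points = [(0, 0), (0, 1), (5, 5), (6, 5), (1, 0), (5, 6)]
assignments, centroids = kmeans(points, k=2)
print(assignments)
```

Each iteration touches every point once per centroid, so the cost grows linearly with the number of documents, in contrast to the quadratic cost of hierarchical clustering.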
Measuring clustering quality
• External quality: comparison of clusters with external classification
  – Entropy: distribution of classes within clusters
  – Purity: largest class in a cluster / cluster size
• Internal quality: calculate average inter- and intra-cluster similarities
  – Cohesiveness (overall similarity)
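The two external measures can be computed as in the sketch below. Averaging the per-cluster values weighted by cluster size is a common convention and is assumed here; the class labels "db" and "ir" are made up for the example.

```python
from collections import Counter
from math import log2


def purity(cluster_labels):
    """Size of the largest class in the cluster divided by the cluster size."""
    counts = Counter(cluster_labels)
    return max(counts.values()) / len(cluster_labels)


def entropy(cluster_labels):
    """Entropy (in bits) of the class distribution inside one cluster."""
    n = len(cluster_labels)
    return -sum((c / n) * log2(c / n) for c in Counter(cluster_labels).values())


def weighted_average(measure, clusters):
    """Size-weighted average of a per-cluster measure (assumed convention)."""
    total = sum(len(cluster) for cluster in clusters)
    return sum(len(cluster) / total * measure(cluster) for cluster in clusters)


# Each cluster is the list of true class labels of its members.
clusters = [["db", "db", "db", "ir"], ["ir", "ir"]]
overall_purity = weighted_average(purity, clusters)
overall_entropy = weighted_average(entropy, clusters)
print(round(overall_purity, 3), round(overall_entropy, 3))
```

Higher purity and lower entropy both indicate clusters that agree better with the external classification.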
Discussion
• Text alone gives best results
• Text+tags: problem with weighting of tags vs. terms
Conclusion
• XML basics
• XML Retrieval Tasks and methods
• Clustering XML documents
Bayesian Networks
Context-dependent Retrieval
• The score of an element is given by its RSV (Retrieval Status Value).
• The RSV of a node depends on the RSVs of the nodes in its context (parent nodes).
• Elements with highest values are then presented to the user.
Bayesian Networks
Bayesian Networks(Cont.)