on efficient part-match querying of xml data

On Efficient Part-match Querying of XML Data

DATESO 2004

Michal Krátký, [email protected] Andrt, [email protected]

Department of Computer ScienceVŠB–Technical University of Ostrava

Czech Republic

www.fei.vsb.czwww.fei.vsb.cz

Contents

Introduction – XML, query languages, indexing XML data, part-match querying.

Multi-dimensional approach to indexing XML data.

Extension of the multi-dimensional approach for keyword-based querying.

Index data structures. Preliminary experimental results.

2/21

Introduction

Native XML database. Set of documents is a database, DTD (XML Schema) is its database schema.

XML query languages (XPath, XQL, XQuery,…).

A common feature is a possibility to formulate paths in the XML graph (regular path expressions, XPath axes and so on).

Approaches based on: relational decomposition, trie, multi-dimensional, signatures and so on.

3/21

Part-match querying XML data

Some approaches for keyword or phrase based searching were published: XQuery-IR (WebDb’02), XKeyword (ICDE’03) and so on.

Knowledges from IR are applied. Query languages contain operators for

matching term occurrence. For example contains(), ~=.

4/21

Multi-dimensional approach to indexing XML data

5/21

A graph is a set of the paths. XML document is decomposed to paths and labelled paths.

labelled path: lp ∈ XLP:

s0,s1,...,slPN

path: p ∈ XP:

idU(u0),idU(u1),...,idU(ulLP),s

idU(ui) – unique number

of a node ui

Indexes

Term index – a storage of strings si of an XML document and their idT(si).

Labelled path index – a storage of points representing labelled paths.

Path index – a storage of points representing paths.

6/21

Examplelabelled path index, path index

books,book,id; books,book,title and books,book,author. Points (0,1,2); (0,1,4) and (0,1,6) are created using idT of element and attribute names, idLP = 0, 1 and 2.

For example, the path to value The Two Towers. The labelled path books,book,title with idLP 1 belongs. Vector (1,0,1,3,5) is created using idLP, unique numbers idU of elements, and idT of the term.

7/21

Query for values of elements and attributes

XPath query: books/book[author=“Joseph Heller”] 3 phases of a query processing, finding:

● idT of terms from the term index,

● idLP 2 of labelled path books,book,author from the labelled path index: point query (0,1,6),

● points from the path index: range query (2,0,0,0,12)×(2,max,max,max,12). 8/21

Enhanced querying

XPath axes are processed by a range query or sequence of range queries. For example axis descendent: (0,idU(u0),…,idU(ul-1), idU(u),0,…, 0):(maxD,idU(u0),…,idU(ul-1), idU(u), maxD,…,maxD).

Regular path expression. For example //title[name=‘Chaudhri’] is processed by a complex range query. The query is possible to process in one run in the multi-dimensional data structure.

9/21

Comparison of approaches

Mainline approaches (XISS, XPath Accelerator) index single element (attribute). For example query /e1[e2=‘dog’] is processed by joining single results.

Result formatting. For example a result of the query //name is all matched subtree.

Operation Update and Insert are simple possible.

10/21

Keyword-based searching

Motivation:/PLAY[PERSONAE/PERSONA~=OTHELLO]/TITLE Path-Labelled Path-Term (PLT) index is

added. The index indexes an 3-dimensional space:

(idP, idLP, idT).

idP is added into the point representing path: (idP,idLP,idU0

,idU1,…,idUl

,s).

11/21

Path-Labelled Path-Term index Example

12/21

Query processing plan Example

13/21

Index data structures

Paged and balanced multi-dimensional data structures – (B)UB-trees, variants of R-trees.

Problems: ● indexing points with different dimensions. ● narrow range query – the signature is

applied for efficient processing – Signature R-tree.

Efficient processing of the complex range query.

14/21

Efficient processing the complex range query

Complex range query = sequence of range queries: qb1,qb2,…,qbn.

The query is possible to process in one run in the multi-dimensional data structure.

15/21

Experimental results

Protein Sequence Database XML document:

● the document size is 683MB,

● number of elements: 21,305,818,

● number of attributes:1,290,647.

● maximal length of path: 7.

BUB-forest, R*-forest, Signature BUB-tree and R*-tree. Index structures: trees indexing spaces of dimension n=7 and n=9.

16/21

Experimental resultsQueries: ProteinDatabase/ProteinEntry/[reference/refinfo/ authors/author='Smith, E.L.']

17/21

Experimental results Regular path expression

Query: //uid='89071748' , 5 labelled paths were matched.

Naive processing the complex range query: DAC: 368

Efficient processing the complex range query: DAC: 139

Time: 0.03s, Improvement: 2.5x

18/21

Preliminary experimental results Keyword-based searching

othello.xml:

● document size is 250kB,

● maximal length of the path: 6

● number of paths: 4,967

● number of labelled paths: 13

● number of terms: 8,744

● PLT index: 27,127

19/21

Preliminary experimental results Keyword-based searching

Query: /PLAY[PERSONAE/PERSONA~=OTHELLO]/TITLE

Labelled path index: result size: 1, DAC: 3

PLT index: result size: 1, DAC: 3 Path index: result size: 1, DAC: 13 Path index: result size: 1, DAC: 4

20/21

Conclusion

21/21

Θ(m × log n), Θ(c × m × log n) vs. Θ(m1 × m2), m1 ,m2 ≥ m.

Efficient processing a query with AND condition. Signature is applied.

Multi-dimensional approach for term searching may be applied (e.g. *comp*).

The update operation of XML documents. Comparison with another approaches for test

collections (INEX, XMark, …).

http://www.cs.vsb.cz/arg

References

M. Krátký, J. Pokorný, V. Snášel: Implementation of XPath Axes in the Multi-dimensional Approach to Indexing XML Data. Accepted at International Workshop on Database Technologies for Handling XML information on the Web, DataX, Int'l Conference on EDBT, Heraklion - Crete, Greece, 2004.

M. Krátký, J. Pokorný, T. Skopal, V. Snášel: The Geometric Framework for Exact and Similarity Querying XML data. In Proceedings of EurAsia-ICT 2002. Shiraz, Iran, Springer Verlag, LNCS 2510.

M. Krátký, T. Skopal, and V. Snášel: Multidimensional Term Indexing for Efficient Processing of Complex Queries. Kybernetika, Journal of the Academy of Sciences of the Czech Republic, 2004, accepted.

Paths, labelled paths Paths 0,1,2,’003-04212’; 0,5,6,’001-00863’

and 0,9,10,’045-00012’ belong to the labelled path books,book,id,

. . .

Paths 0,1,4,’J.R.R. Tolkien’; 0,5,8,’J.R.R. Tolkien’ and 0,9,12,’Joseph Heller’ belong to the labelled path books,book,author.

Complex queries

Query for values and XPath axis processing, e.g. books/book[author='Joseph Heller']/title

● Combination of above described techniques: query for value, XPath axis processing.

Regular path expression queries for example: books//author

● A sequence of range queries processes this query in the path and labelled path index: books, author - books,*,author - books,*,…,*,author.

(B)UB-tree, R-treeUB-treeUB-tree

Z-address

B-treeB-tree

Narrow range query – signature multi-dimensional ds

Regions intersecting a query hyper box are searched, O(NI × logc n).

Ratio cR of relevant NR and intersect NI regions

≪ 1 with an increasing dimension. Signatures are applied to better filtration of irrelevant

regions – signature md structures.

Signature R-tree


Queries: ProteinDatabase/ProteinEntry/[reference/refinfo/ authors/author='Smith, E.L.']

on efficient part-match querying of xml data

Documents

examplelabelled path

labelled path books

path indexbooks

labelled paths

attributesxpath query

example query e1e2

xml document

query languages xpath