Development Of a New Indexing Technique for XML Document Retrieval by: Amjad Ali Amjad


Post on 09-Jul-2015

TRANSCRIPT

Page 1: Development of a new indexing technique for XML document retrieval

Development Of a New Indexing Technique for XML Document Retrieval

by:

Amjad Ali Amjad

Page 2: Development of a new indexing technique for XML document retrieval

Agenda

Introduction

Background

Problem Statement

Proposed Solution

Results and discussions

Conclusion and future directions

Page 3: Development of a new indexing technique for XML document retrieval

Introduction

What is XML?

XML is a markup language much like HTML

XML was designed to describe data

XML tags are not predefined. You must define your own tags

XML uses a Document Type Definition (DTD) or an XML Schema to describe the data

Page 4: Development of a new indexing technique for XML document retrieval

Introduction (continued)

Example Doc 1:

<invoice>
  <buyer>
    <name>ABC Corp</name>
    <address>1 Industrial Way</address>
  </buyer>
  <seller>
    <name>Acme Inc</name>
    <address>2 Acme Rd.</address>
  </seller>
  <item count="3">saw</item>
  <item count="2">drill</item>
</invoice>

Page 5: Development of a new indexing technique for XML document retrieval

Introduction (continued)

Database management systems are increasingly being called upon to manage semi-structured data: data with an irregular or changing organization

Semi-structured data is often represented as a graph (tree structure)

Evaluating queries over semi-structured data involves navigating paths through this relationship structure

Page 6: Development of a new indexing technique for XML document retrieval

Introduction (continued)

Index construction for efficient access

Traditional indexing techniques are applicable

Parent child nesting relationship

Expensive querying for this representation even with indexes

Page 7: Development of a new indexing technique for XML document retrieval

Introduction (continued)

Building a specialized data manager

For a semi-structured data repository

Example projects: LORE, Tamino, XYZFind

Update causes a re-computation

How to deal with update problem

Page 8: Development of a new indexing technique for XML document retrieval

Introduction (continued)

Relative Region Coordinate

Knowledge of start and end position

Within the parent element

Only need to update the portion of the index file

Value indexes on attribute values

Page 9: Development of a new indexing technique for XML document retrieval

Introduction (continued)

Term-based inverted indices on element content when this is a large piece of text.

Index on tag name (i.e., given a tag name, we can return all the elements with the specified tag).
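
The kinds of index listed above (value indexes on attribute values, term-based inverted indices on element content, and an index on tag name) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: element ids are taken to be document-order positions, and `build_indexes` and the dictionary layouts are hypothetical names, not part of the presented system.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

DOC = """<invoice>
  <buyer><name>ABC Corp</name><address>1 Industrial Way</address></buyer>
  <seller><name>Acme Inc</name><address>2 Acme Rd.</address></seller>
  <item count="3">saw</item>
  <item count="2">drill</item>
</invoice>"""

def build_indexes(xml_text):
    """Build three toy indexes keyed by document-order element id (assumed id scheme)."""
    root = ET.fromstring(xml_text)
    tag_index = defaultdict(list)    # tag name -> element ids
    term_index = defaultdict(set)    # lower-cased term -> element ids
    value_index = defaultdict(list)  # (attribute name, value) -> element ids
    for elem_id, elem in enumerate(root.iter()):
        tag_index[elem.tag].append(elem_id)
        for name, value in elem.attrib.items():
            value_index[(name, value)].append(elem_id)
        for term in re.findall(r"\w+", (elem.text or "").lower()):
            term_index[term].add(elem_id)
    return tag_index, term_index, value_index

tag_index, term_index, value_index = build_indexes(DOC)
items = tag_index["item"]                  # all <item> elements
acme_hits = term_index["acme"]             # elements whose text contains "acme"
count_three = value_index[("count", "3")]  # elements with count="3"
```

Each lookup is a single dictionary access, which is the point of keeping the three indexes separate from the structural one.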

Page 10: Development of a new indexing technique for XML document retrieval

Background

Position-based indexing

Queries are processed by manipulating the range of offsets of words, elements or attributes.

In path-based indexing, the location of words is expressed in terms of structural elements, and the paths in the tree structure are used for query processing.
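
As a rough illustration of the path-based idea, the sketch below maps each word to the set of root-to-element paths under which it occurs. It is a simplified assumption of how such an index might look; `build_path_index` and the path string format are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

def build_path_index(xml_text):
    """Map each lower-cased term to the element paths whose text contains it."""
    index = defaultdict(set)

    def visit(elem, parent_path):
        path = f"{parent_path}/{elem.tag}"
        for term in re.findall(r"\w+", (elem.text or "").lower()):
            index[term].add(path)
        for child in elem:
            visit(child, path)

    visit(ET.fromstring(xml_text), "")
    return index

path_index = build_path_index(
    "<invoice><seller><name>Acme Inc</name>"
    "<address>2 Acme Rd.</address></seller></invoice>"
)
acme_paths = path_index["acme"]
```

A query such as "acme under seller/name" then reduces to filtering the stored paths rather than comparing offsets.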

Page 11: Development of a new indexing technique for XML document retrieval

Background (continued)

BitCube: a three-dimensional indexing for XML documents

According to this technique, documents can be represented hierarchically by XML elements, and XML documents are represented and indexed in three dimensions.

Page 12: Development of a new indexing technique for XML document retrieval

Background (continued)

Content and Structure in indexing and ranking XML

Index structures with ranking support are therefore needed for fast access to relevant parts of large document collections. An analysis reveals that ranking parameters related to both the content and structure of data are poorly supported by most known XML indexes.

Page 13: Development of a new indexing technique for XML document retrieval

Background (continued)

Ctree

It provides an indexing structure that is based on two levels: path summary and detailed element-level relationships. The first one, the path summary, is a tree that is extracted from the original data

Page 14: Development of a new indexing technique for XML document retrieval

Background (continued)

Indexing for XML Siblings:

Given the importance of XPath-based query access, Grust proposed an R-tree index, referred to here as a whole-tree index (WI). Such an index, however, has a very high cost for the following-sibling and preceding-sibling axes. To address this problem, this method develops a family of index structures, referred to as split-tree indexes (SI), in which (i) XML data is horizontally split by a simple yet efficient criterion, and (ii) the split value is associated with tree labeling.

Page 15: Development of a new indexing technique for XML document retrieval

Background (continued)

High-performance XML Storage/Retrieval System.

The basic idea of this technique is to allocate a field ID to each text data item of the XML element and to register it in the structure index and text index. The structure index manages the hierarchical structure of each field, and the text index manages the field ID and document ID in which the query words appear. The structure index is one big data tree and represents the overlapped structure of documents.

Page 16: Development of a new indexing technique for XML document retrieval

Background (continued)

Indexing documents for queries on structure, content and attributes

It explains position-based indexing and path-based indexing for accessing XML documents by content, structure, or attributes.

Page 17: Development of a new indexing technique for XML document retrieval

Background (continued)

Extensible index technique

An extensible index technique is proposed to express position information between nodes in an XML document. It is an efficient index technique that simplifies the comparative object applied to a search query and minimizes the reconstruction of the index structure caused by update operations. In addition, the authors propose an extensible index technique with deferred update.

Page 18: Development of a new indexing technique for XML document retrieval

Problem Statement

Support of element addressing

Doc.ID should include NodeId (XPath) + Offset

Index size becomes very large

XPath expressions are long

Support of typed data

Integer, float, simple types of XML schema

Requires classical indexes for certain elements

Page 19: Development of a new indexing technique for XML document retrieval

Problem Statement (continued)

Query processing

Structural joins

Text search

Exact search

Support of updates

Incremental updates would be a plus

Page 20: Development of a new indexing technique for XML document retrieval

Problem Statement (continued)

Evaluation criteria

Identifiers: per node or per document

Descendant/ancestor search: by join algorithm, by graph traversal, or by OID comparison

Keyword search: by element scan or by B-tree traversal

Update: incremental

Index size: entry number and entry size

Page 21: Development of a new indexing technique for XML document retrieval

Problem Statement (continued)

In indexing structures that use absolute addresses to pinpoint where data resides, an update causes a re-computation

If the update frequency is high the cost of reconstruction is unbearable

Support of updating the indexes is not considered in most of the indexing techniques.

Page 22: Development of a new indexing technique for XML document retrieval

Problem Statement (continued)

Updates are an issue in any such labeling scheme. It is conceivable that a complete re-labeling could be required for each update.

The existing techniques do not support the storage of multiple documents at the same time.

Page 23: Development of a new indexing technique for XML document retrieval

Proposed Technique

An XML document instance is a plain-text file that uses markup delimiters (tags) to define the logical structure of a document in a hierarchical fashion.

Robert Korfhage proposed three purposes of indexing in IR, which can best take advantage of structured documents:

To permit easy location of documents by topic;

To define topic areas and hence relate one document to another;

To predict the relevance of a given document to a given information need.

Page 24: Development of a new indexing technique for XML document retrieval

Proposed Technique (continued)

The current structured query and indexing models for XML have not fulfilled these requirements.

The ideal system seems to be one that provides efficient and comprehensive indexing of document content and structure, and is able to predict the degree of relevance that each matching document has to a particular query.

Page 25: Development of a new indexing technique for XML document retrieval

Proposed Technique (continued)

There is a node corresponding to each element, with child nodes for sub-elements. However, all attributes of an element node are grouped together into a single node, which is then stored as a child node of that element node.

The content of an element node, if any, is pulled out into a separate child node.
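
A minimal sketch of this tree model, assuming Python's standard `xml.etree.ElementTree` as the parser. The `Node` class and `to_tree` function are hypothetical names introduced only for illustration.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str             # "element", "attributes" or "content"
    name: str = ""        # tag name, for element nodes only
    value: object = None  # attribute dict or text, for the other kinds
    children: list = field(default_factory=list)

def to_tree(elem):
    """Convert an element into a node with attribute and content child nodes."""
    node = Node("element", elem.tag)
    if elem.attrib:
        # All attributes share one child node.
        node.children.append(Node("attributes", value=dict(elem.attrib)))
    text = (elem.text or "").strip()
    if text:
        # Element content is pulled out into its own child node.
        node.children.append(Node("content", value=text))
    for child in elem:
        node.children.append(to_tree(child))
    return node

item = to_tree(ET.fromstring('<item count="3">saw</item>'))
child_kinds = [child.kind for child in item.children]
```

For the `item` element of the invoice example, the result is one element node with two children: a single attributes node and a single content node.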

Page 26: Development of a new indexing technique for XML document retrieval

Proposed Technique (continued)

Ancestor–descendant relationship

A node (S1, E1, L1) is the ancestor of node (S2, E2, L2) iff S1 < S2 ∧ E1 > E2

Parent–child relationship

A node (S1, E1, L1) is the parent of node (S2, E2, L2) iff S1 < S2 ∧ E1 > E2 ∧ L1 = L2 − 1

Page 27: Development of a new indexing technique for XML document retrieval

Proposed Technique (continued)

S1 and S2 are start labels, E1 and E2 are end labels, and L1 and L2 are level labels in these formulae.
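
The two conditions translate directly into code. The sketch below assumes both labels belong to the same document; the `Label` type and the concrete label values are hypothetical and chosen only to match the shape of the invoice example.

```python
from typing import NamedTuple

class Label(NamedTuple):
    start: int  # S
    end: int    # E
    level: int  # L

def is_ancestor(a, d):
    """a is an ancestor of d iff a's region strictly contains d's region."""
    return a.start < d.start and a.end > d.end

def is_parent(p, c):
    """p is the parent of c iff it is an ancestor exactly one level above."""
    return is_ancestor(p, c) and p.level == c.level - 1

# Hypothetical labels: invoice contains buyer and seller; buyer contains name.
invoice = Label(1, 12, 0)
buyer = Label(2, 7, 1)
name = Label(3, 4, 2)
seller = Label(8, 11, 1)
```

With these values, `invoice` is an ancestor but not the parent of `name`, while `buyer` is its parent. No tree traversal is needed, only integer comparisons.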

We address the update issue by leaving gaps between successive label values.
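
To see why gaps help, consider two sibling nodes whose labels were assigned with unused values between them. A new sibling can take labels from the gap without touching any existing label. The sketch below is an assumption about how this could work: the gap size of 10 and the `labels_in_gap` helper are hypothetical, and the slides do not specify how values inside a gap are chosen.

```python
def labels_in_gap(left_end, right_start):
    """Pick a (start, end) pair strictly between two existing label values.

    Returns None when the gap is too small, in which case some
    relabeling would be unavoidable.
    """
    width = right_start - left_end
    if width < 3:
        return None
    step = width // 3
    return left_end + step, left_end + 2 * step

# Hypothetical sibling labels (start, end), assigned with a gap of 10.
first = (20, 30)
second = (40, 50)

# Insert a new sibling between them: only the new node gets labels.
new_sibling = labels_in_gap(first[1], second[0])
```

The new labels fall inside the parent's region and between the two siblings, so the ancestor and parent tests keep working for every node. Repeated insertions at the same spot eventually exhaust the gap.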

Page 28: Development of a new indexing technique for XML document retrieval

Results and discussions

System architecture

Data Parser

The Data Parser takes an XML document as input, and produces a parse tree as output.

The Data Manager takes each node of the tree, marks its indices, and stores it in the database.
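
The parser and data manager pipeline might look roughly like the following. This is a sketch under several assumptions not stated in the slides: SQLite as the database, a single `node` table, a label gap of 10, and the hypothetical function names `index_document` and `descendants`.

```python
import sqlite3
import xml.etree.ElementTree as ET

GAP = 10  # assumed spacing between successive label values

def index_document(conn, doc_id, xml_text, gap=GAP):
    """Parse one document and store a (start, end, level) row per element."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS node ("
        "doc INTEGER, tag TEXT, start_pos INTEGER, end_pos INTEGER, level INTEGER)"
    )
    rows = []
    position = 0

    def visit(elem, level):
        nonlocal position
        position += gap
        start = position
        for child in elem:
            visit(child, level + 1)
        position += gap
        rows.append((doc_id, elem.tag, start, position, level))

    visit(ET.fromstring(xml_text), 0)
    conn.executemany("INSERT INTO node VALUES (?, ?, ?, ?, ?)", rows)

def descendants(conn, doc_id, tag):
    """Tags of all descendants of elements named `tag`, via a structural self-join."""
    query = (
        "SELECT d.tag FROM node AS a JOIN node AS d "
        "ON d.doc = a.doc AND a.start_pos < d.start_pos AND a.end_pos > d.end_pos "
        "WHERE a.doc = ? AND a.tag = ? ORDER BY d.start_pos"
    )
    return [row[0] for row in conn.execute(query, (doc_id, tag))]

conn = sqlite3.connect(":memory:")
index_document(conn, 1, "<invoice><buyer><name>ABC Corp</name>"
                        "<address>1 Industrial Way</address></buyer>"
                        '<item count="3">saw</item></invoice>')
index_document(conn, 2, "<invoice><seller><name>Acme Inc</name></seller></invoice>")
buyer_parts = descendants(conn, 1, "buyer")
```

The document number column lets several documents share one table, and the join condition keeps matches within a single document.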

Page 29: Development of a new indexing technique for XML document retrieval

Results and discussions (continued)

If the node is of mixed type, with multiple content parts interspersed with sub-elements, each content part is pulled out into a separate child node.

All processing instructions, comments, and such are simply ignored
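
The handling of mixed content can be sketched with Python's standard `xml.etree.ElementTree`, whose default parser also discards comments and processing instructions. The function name `content_and_children` is hypothetical, and the sketch only lists the child nodes in order rather than storing them.

```python
import xml.etree.ElementTree as ET

def content_and_children(elem):
    """List an element's children in order, one content node per text part."""
    parts = []
    if elem.text and elem.text.strip():
        parts.append(("content", elem.text.strip()))
    for child in elem:
        parts.append(("element", child.tag))
        # Text following a sub-element is stored in its `tail`.
        if child.tail and child.tail.strip():
            parts.append(("content", child.tail.strip()))
    return parts

para = ET.fromstring(
    "<p><!-- ignored note -->Use a <b>saw</b> or a <b>drill</b> today</p>"
)
parts = content_and_children(para)
```

Here the three text fragments become three separate content nodes interleaved with the two `b` elements, and the comment leaves no trace in the tree.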

Page 30: Development of a new indexing technique for XML document retrieval

Conclusion and future directions

Reconstruction of index file due to a partial update is a problem that XML database applications inevitably have to face

We have developed an indexing system based on two indexing techniques, the extensible index technique and relative region coordinate based indexing of XML documents, combined with our own proposed scheme, which assigns a level number to each node of an XML document and a document number to each document.

Page 31: Development of a new indexing technique for XML document retrieval

Conclusion & future directions (continued)

The costly update of the index structure is avoided, because the index structure remains unaffected when new nodes are added.

Parent-child and ancestor-descendant relationships can be found easily, for efficient retrieval.

Page 32: Development of a new indexing technique for XML document retrieval

Conclusion & future directions (continued)

All processing instructions, comments, and such are currently ignored. In the future, yet another child node of the element node could be created to hold all such data.

An index that is efficient for both update and retrieval may not be available.

Page 33: Development of a new indexing technique for XML document retrieval

Conclusion & future directions (continued)

One alternative is to build two separate indices, one suited to frequent updates and the other better at query processing.

In this case, a transformation mechanism between the indexing structures needs to be developed.