

M.Tech. Dissertation

Browsing and Querying on XML Data Sources

submitted in partial fulfillment of the requirements for the degree of

Master of Technology

By

Urmila Kelkar

Roll No : 00305402

Under the guidance of

Prof. S. Sudarshan

Department of Computer Science and Engineering

Indian Institute of Technology, Bombay

Mumbai

January 17, 2002



Dissertation Approval Sheet

This is to certify that the dissertation entitled Browsing and Querying in XML Data Sources by Urmila

Kelkar is approved for the award of the degree of Master of Technology.

Prof. S. Sudarshan

(Guide)

Internal Examiner

External Examiner

Chairman

Date :



Acknowledgement

I would like to thank my guide, Dr. S. Sudarshan, for his untiring support and encouragement throughout my M.Tech. project. I would also like to acknowledge those who, from behind the scenes, have contributed their ideas and energies. Special thanks to all my colleagues from the Informatics lab.

Urmila Kelkar

January 17, 2002


Contents

1 Introduction

2 Related Work
    2.1 Browsing
        2.1.1 Blended Browsing and Querying by BBQ
        2.1.2 XML based information mediation with MIX
        2.1.3 BANKS
    2.2 Keyword Search
        2.2.1 BANKS
        2.2.2 DataSpot

3 Scalable Browsing of XML documents
    3.1 Design Issues
        3.1.1 Incremental browsing of XML documents
        3.1.2 Mapping from an XML document to the Foldertree
        3.1.3 IDREF to ID links
        3.1.4 Sending serialized objects over HttpConnection
    3.2 Implementation Details
        3.2.1 Interaction between the Servlet and the Applet
        3.2.2 Browser setup
        3.2.3 Data structures
    3.3 Scalability of the system
    3.4 Extensions to the Foldertree
        3.4.1 Styling enhancements
        3.4.2 Interactive Foldertree
    3.5 Select Interface
        3.5.1 Working of Select Interface
        3.5.2 Extensions

4 Integrating Keyword Search with Browsing
    4.1 Keyword Search
    4.2 Browsing search results
    4.3 Browsing Extensions

5 Conclusions and Future work


Abstract

The Web is used extensively by information seekers. Search engines such as Google retrieve information from HTML documents and let users reach the desired information by typing a few words and following hyperlinks. The goal of our project is to design and implement a system that provides a powerful way of extracting information from XML documents, using browsing and keyword search.

Our system provides a directory-tree-like interface for browsing nested XML data, coupled with IDREF to ID links and stylesheets. To facilitate customized views of the same XML document, we provide menus to drop elements, find matching elements, drop subtrees and so on. Additionally, we provide keyword search, where users can type a few keywords to get the desired information from the XML data source.



Chapter 1

Introduction

XML is an evolving technology that is becoming important because of its standardized data representation format. XML documents focus on the semantics of data; they do not provide information about how to display the data contained in the document. XML portrays a semistructured data model, which is likely to be used to publish heterogeneous data.

Consider the example of electronic patient records (EPR) used by public health services. Many European countries aim to introduce EPR as a standard way of maintaining patient records. XML, which provides users with a flexible way to mark up nested data, appears well suited for maintaining EPRs. Administrative committees in a number of hospitals are now incorporating XML within their prescribed standards for maintaining patient records. This emphasizes the need for a system that facilitates information retrieval from the underlying XML documents.

We aim to develop a system that facilitates browsing of XML documents and provides keyword search as well. In browsing, we focus on the two access patterns that will be most commonly used: document traversal and pattern matching queries. The nesting structure of an XML document is used to provide navigation through the document in a tree format, called a foldertree. Users can choose a particular style for displaying XML documents. Our system also provides a menu to drop the selected subtree, to expand the selected subtree, to drop a particular element and to highlight matching elements. Since XML documents are potentially large, it is important to make the system scalable. Our system facilitates scalable browsing using an incremental approach. We also provide an interface called the Select Interface, which helps users retrieve information from XML documents using pattern matching queries. A few systems, such as Blended Browsing and Querying (BBQ) [MLP99] and XML based information mediation with MIX (MIX) [BGL 99], support querying and browsing of XML documents. BBQ uses an incremental, on-demand approach.

Our system additionally provides keyword search, which is not supported by BBQ and MIX. The keyword search module constructs a graph from the XML documents available in the data source. We use the least common ancestor technique to find the answer results. The answer result tree can be browsed like any other XML document.

Chapter 2 gives an overview of related work in the area of browsing XML data sources. The browsing interface offered by our system is described in Chapter 3. Chapter 4 briefly describes the keyword search module and the browsing of search results. The detailed approach to keyword search is described by Megha Meshram in her dissertation [Mes02]. Chapter 5 outlines conclusions and describes future work.


Chapter 2

Related Work

The Web has introduced a new paradigm of browsing and keyword search to retrieve information from HTML documents. Our goal is to provide a similar system to browse through XML documents. Techniques used for information retrieval from HTML documents cannot be used directly for XML documents. This chapter gives an overview of systems supporting browsing, keyword search and querying of XML documents.

2.1 Browsing

Search engines such as Google help a naive user to browse through information using hyperlinks. There are

a few systems that provide support for browsing through XML documents. The following sections describe

these systems.

2.1.1 Blended Browsing and Querying by BBQ

Blended Browsing and Querying (BBQ) [MLP99] provides a Document Type Definition (DTD) based graphical user interface, which facilitates XML query construction and browsing of results by non-expert users. It is used as a front end to the virtual source exported by the MIX mediator system. A virtual source may be an actual XML source or an XML view created by the mediator.

BBQ assigns each document a separate window with its data and schema displayed side by side. Both data and schema can be navigated using directory-like structures. BBQ querying is schema driven. Underneath, it uses the XMAS (XML Matching And Structuring) query language, which supports joins, filtering and constraints on leaf nodes. BBQ assumes that users cannot come up with a focused query at the very first step, and so it facilitates iterative query refinement. It supports queries on multiple DTDs. Users can specify the structure of a query by dragging and dropping elements from the source DTDs, by introducing new elements, or by grouping elements according to the value of other elements. Once execution of the query starts, users can browse through partial results. BBQ uses the Document Object Model Application Programming Interface (DOM API) and an incremental approach for browsing potentially very large documents and subsequent query results.

2.1.2 XML based information mediation with MIX

Mediation of information from heterogeneous data sources becomes crucial as data from different sources, such as HTML documents, XML documents, relational databases and legacy data, is published over the Web. MIX [BGL 99] employs XML as a semistructured data model to provide a uniform and flexible representation of arbitrary source data. Since XML can itself become a stumbling block when formulating meaningful queries against semistructured databases, MIX uses the XML DTD as a structural description of the data exchanged by the components of the mediator.

MIX focuses on valid XML documents, which conform to a DTD. XML queries are expressed in a high-level, declarative query language known as XMAS, which evolved from ideas in XML-QL and other XML query languages. It allows pattern matching as well as group-by queries. To facilitate querying of heterogeneous sources, XML wrappers are provided, which export data in a uniform format to the mediator. MIX uses BBQ as its graphical interface. XMAS queries generated by BBQ are sent to the mediator for execution and the results are displayed using BBQ.

2.1.3 BANKS

BANKS, an acronym for Browsing ANd Keyword Search [BHN 01], facilitates browsing and keyword search in relational databases. Earlier, we had worked on a module of the BANKS system. To retrieve information from an underlying relational database, users normally need to know SQL (Structured Query Language). BANKS instead uses the keyword search paradigm introduced by the Web to retrieve the desired information.

BANKS uses a tabular format to display information retrieved from the database. The foreign key to primary key (FK-PK) relationship is used to provide links from one table to another in the form of hyperlinks. BANKS also provides menus, implemented in JavaScript, for sorting the records of a table on a particular column, grouping records based on a column, and so on. It also provides a Select Interface to get records matching specified values. In addition, BANKS provides templates such as crosstab, nested, foldertree, bar chart and pie chart to display information from the database graphically.

Browsing and keyword search in XML can also be thought of as an extension to BANKS. XML is considered semistructured data, while BANKS uses a relational (structured) database as its back end. Hence, the strategies and approaches used for browsing and keyword search in XML are different, but the underlying need for browsing and keyword search is the same.

XML documents can be displayed using a foldertree structure like the one used in BANKS. Menus such as drop elements, select matching elements and drop subtree can be used to interact with the foldertree. We can also provide a group-by selection interface in which users select a group-by element and a result element to get a graphical view of the data, like the bar chart used in BANKS. For example, rather than viewing a whole collection of CDs at once, users may prefer a year-wise or artist-wise collection of CDs; here CD is specified as the result element and the year or the artist as the group-by element. We can also facilitate querying of XML documents, using some query language at the back end.
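The CD example can be illustrated with a small Java sketch. It is only an illustration of the group-by idea, not part of the system: the element names cd, year and title, the class name GroupBySketch and its methods are all assumptions made for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class GroupBySketch {

    /** Text of the first descendant of e with the given tag, or "" if there is none. */
    static String childText(Element e, String tag) {
        NodeList hits = e.getElementsByTagName(tag);
        return hits.getLength() == 0 ? "" : hits.item(0).getTextContent().trim();
    }

    /** Groups the labels of all result elements by the value of their group-by element. */
    static Map<String, List<String>> groupBy(String xml, String resultTag,
                                             String groupTag, String labelTag) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception ex) {
            throw new IllegalArgumentException("could not parse XML", ex);
        }
        Map<String, List<String>> groups = new TreeMap<>(); // sorted by group value
        NodeList results = doc.getElementsByTagName(resultTag);
        for (int i = 0; i < results.getLength(); i++) {
            Element result = (Element) results.item(i);
            groups.computeIfAbsent(childText(result, groupTag), k -> new ArrayList<>())
                  .add(childText(result, labelTag));
        }
        return groups;
    }

    public static void main(String[] args) {
        // Hypothetical CD collection: result element cd, group-by element year.
        String xml = "<catalog>"
                + "<cd><title>A</title><year>1985</year></cd>"
                + "<cd><title>B</title><year>1988</year></cd>"
                + "<cd><title>C</title><year>1985</year></cd>"
                + "</catalog>";
        System.out.println(groupBy(xml, "cd", "year", "title"));
    }
}
```

Passing artist instead of year as the group-by element would give the artist-wise view, provided the documents carry such an element.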

2.2 Keyword Search

Keyword search on documents is widely used for finding information on the Internet. The keyword search paradigm is equally important for XML documents. The following sections describe prior work on keyword search in databases.

2.2.1 BANKS

BANKS supports keyword search to retrieve information from the underlying relational database without requiring users to know the schema details. Just as keyword search on the Internet returns documents containing the given words, BANKS returns tuples containing the given words. BANKS exploits the foreign key to primary key relationships between the tables of a database to form a graph called the meta-data graph. It also uses a tuple-level graph to identify particular tuples. Given a set of words, two words are considered to be close to each other if, in a table, they are in the same row and the same column, or in different columns of a row, or in rows of different tables linked by foreign keys. The details of the keyword search algorithm are given in [HBN 01]. Search results are ranked and sorted before they are presented.

To facilitate keyword search in XML documents, we can use the built-in hierarchical structure of an XML document to form a graph. We can make an entry in the text index for every word from every document, except for stop words. The text index can store DOM Node references, which help us to traverse the graph to find the shortest paths between keyword nodes. The simplest way to find a tree containing all the words is to use the least common ancestor technique. The root of the result tree is the common intersection node, while the leaf nodes refer to the search terms. Search results can be ranked based on the depth of the tree or, say, the longest edge of the tree. The result tree can be browsed using the foldertree like any other XML document.
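The text index and the least common ancestor step can be sketched in Java as follows. This is a minimal sketch under stated assumptions, not the keyword search module itself: the stop-word list, the class name LcaSearchSketch and its method names are hypothetical, only the first hit of each keyword is used, and ranking is left out.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class LcaSearchSketch {
    // Hypothetical stop-word list; the text does not say which words are skipped.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of", "and"));

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception ex) {
            throw new IllegalArgumentException("could not parse XML", ex);
        }
    }

    /** Maps each lower-cased word to the elements whose own text contains it. */
    static Map<String, List<Node>> buildIndex(Document doc) {
        Map<String, List<Node>> index = new HashMap<>();
        NodeList all = doc.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Node element = all.item(i);
            for (Node c = element.getFirstChild(); c != null; c = c.getNextSibling()) {
                if (c.getNodeType() != Node.TEXT_NODE) continue;
                for (String w : c.getNodeValue().toLowerCase().split("\\W+")) {
                    if (!w.isEmpty() && !STOP_WORDS.contains(w)) {
                        index.computeIfAbsent(w, k -> new ArrayList<>()).add(element);
                    }
                }
            }
        }
        return index;
    }

    /** Walks up from a, remembers its ancestors, then walks up from b until one matches. */
    static Node leastCommonAncestor(Node a, Node b) {
        Set<Node> ancestors = Collections.newSetFromMap(new IdentityHashMap<>());
        for (Node n = a; n != null; n = n.getParentNode()) ancestors.add(n);
        for (Node n = b; n != null; n = n.getParentNode()) {
            if (ancestors.contains(n)) return n;
        }
        return null;
    }

    /** Tag name of the answer root for two keywords, or null if a keyword is not indexed. */
    static String answerRoot(String xml, String w1, String w2) {
        Map<String, List<Node>> index = buildIndex(parse(xml));
        List<Node> h1 = index.get(w1.toLowerCase());
        List<Node> h2 = index.get(w2.toLowerCase());
        if (h1 == null || h2 == null) return null;
        Node lca = leastCommonAncestor(h1.get(0), h2.get(0));
        return lca == null ? null : lca.getNodeName();
    }

    public static void main(String[] args) {
        String xml = "<catalog><cd><title>Blue Train</title><artist>Coltrane</artist></cd>"
                + "<cd><title>Kind of Blue</title><artist>Davis</artist></cd></catalog>";
        System.out.println(answerRoot(xml, "train", "coltrane"));
    }
}
```

In this sketch, two keywords that occur inside the same cd element yield that cd as the root of the answer tree, while keywords from different cd elements meet only at catalog, which gives a deeper and therefore lower-ranked tree under the depth-based ranking mentioned above.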

2.2.2 DataSpot

DataSpot [DEGP98] introduces a new approach to database querying and information retrieval by providing end users with the capability of exploring databases using free-form queries and navigation. This capability is based on a novel, schema-less representation of data called a Hyperbase. The DataSpot representation and search technology is the foundation of the DataSpot system. DataSpot has since been renamed Mercado and is available as a commercial product.

A DataSpot Hyperbase is a graph structure composed of nodes, edges and node labels. Nodes are related by directed edges of two types. A simple edge connects a parent node to a child node; the children of a node are ordered. An identification edge indicates that a reference node uniquely identifies the subject node. A DataSpot query is an associative search over a Hyperbase. The input to a query is a set of nodes called the query sources. The result of a query is a list of answers, where each answer is a connected Hyperbase containing the query sources. The answers to a query are ranked. Users can view answer records in detail, navigate to related records, or submit continuation queries from the current record or from a set of records.


Chapter 3

Scalable Browsing of XML documents

The primary motivation for developing a browsing system is to ease the end user's task. Naive users should be able to get the desired information from XML documents with just a few clicks rather than by writing complex queries. We have developed a system that provides such an interface for extracting the desired information from XML documents.

The central idea is to provide a scalable browsing system for navigating through XML documents. Since an XML document consists of markup tags rather than formatting tags, we need some mechanism to convert XML-encoded information into the true data model and to make it presentable. We use a foldertree structure to display XML documents. A foldertree is a simple hierarchical structure like the directory tree used in Windows. Consider the document order.xml shown in Figure 3.1. Our system provides users with a foldertree view of this XML document, as shown in Figure 3.2.

3.1 Design Issues

This section describes the main design issues in our system: the incremental browsing approach, the mapping of an XML document to the foldertree, IDREF to ID links, and the serialized object used for communication between the client end and the server end of the system.

3.1.1 Incremental browsing of XML documents

The approach used for displaying XML documents brings nodes on demand (i.e. as requested by users) and displays the tree incrementally. To indicate to users that a particular node has more child nodes yet to be retrieved, a dummy node called More... is displayed.

This incremental approach makes the design scalable, since a user is not required to wait for all child nodes to arrive. The following sections explain our approach in detail.

Approach I

In the initial stages of the work, we implemented the following non-incremental model. The model consists of a servlet running at the server end and an applet working at the client end, communicating with each other. The steps are as follows:

1. The servlet parses the whole XML document and gets an in-memory DOM tree.
2. The applet sends a request over HttpConnection for a particular XML document.
3. The servlet traverses the in-memory DOM tree in BFS (breadth-first search) order, creates a newObject corresponding to each DOM Node and sends all the objects, one by one, over HttpConnection. See Section 3.1.4 for the details of newObject.
4. The applet receives the newObject corresponding to every DOM Node and attaches each newObject to its parent to form a tree.
5. The applet displays the tree using Java Swing.

This approach did not work well for huge XML documents, because sending the whole DOM tree at once was not feasible. We therefore moved to the incremental model described below.

Figure 3.1: Original XML document

Approach II

The servlet, running in the background, sends DOM Nodes on demand, as requested by users at the client end. This makes the servlet stateless: the applet saves the state of the request and sends the next request according to the user's navigation. The following steps are taken:

1. The servlet parses the whole XML document and gets an in-memory DOM tree.
2. Initially, the applet requests the root node, identified by id=1, along with a few child nodes; the number of child nodes is specified in the configuration file.
3. In general, the applet sends a request with the node id of a parent node and a range of child nodes, identified by the from and to parameters of the request. The request also includes a docName parameter identifying the XML document. The object called newObject, described in Section 3.1.4, is used for communication between the servlet and the applet.
4. In response to the request from the applet, the servlet sends the root node if id is equal to 1; otherwise it sends only the child nodes in the range given by from and to, taken from the requested XML document.
5. The servlet sends a special object with myID=0 as a demarcating object, to indicate the end of the response.
6. The applet updates the tree by appending the received child nodes to their respective parent nodes and displays the tree.
7. On the applet side, once the tree is displayed, the applet waits for a user request. A user can request the child nodes of a particular node by clicking on that node; a request is then sent to the servlet asking for the child nodes of that node.
8. A dummy node named More... is used to indicate that the parent node has more child nodes yet to be retrieved. Users can click on the More... node to request those child nodes, or they can click on the parent node itself.

Figure 3.2: Foldertree view of order.xml document displaying IDREF to ID links
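The on-demand selection of child nodes in the steps above can be sketched as follows. This is a minimal sketch, assuming that from and to are 1-based and inclusive and that only element children are counted; the text does not fix either point. The class name ChildSliceSketch and its methods are hypothetical, and for simplicity the More... placeholder is appended by the same function, although the text does not say whether the servlet or the applet creates it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class ChildSliceSketch {
    static final String MORE = "More...";

    /** Names of the element children numbered from..to (1-based, inclusive). */
    static List<String> childSlice(Node parent, int from, int to) {
        List<Node> elements = new ArrayList<>();
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.ELEMENT_NODE) elements.add(c);
        }
        List<String> slice = new ArrayList<>();
        int last = Math.min(to, elements.size());
        for (int i = Math.max(from, 1); i <= last; i++) {
            slice.add(elements.get(i - 1).getNodeName());
        }
        // Placeholder telling the user that further children can still be requested.
        if (last < elements.size()) slice.add(MORE);
        return slice;
    }

    /** Convenience wrapper: parses the XML and slices the children of the root element. */
    static List<String> childSliceOfRoot(String xml, int from, int to) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return childSlice(doc.getDocumentElement(), from, to);
        } catch (Exception ex) {
            throw new IllegalArgumentException("could not parse XML", ex);
        }
    }

    public static void main(String[] args) {
        String xml = "<order><item/><item/><part/><part/><part/></order>";
        System.out.println(childSliceOfRoot(xml, 1, 2)); // first request
        System.out.println(childSliceOfRoot(xml, 3, 5)); // after clicking More...
    }
}
```

Because every request carries the parent, from and to, the function needs no memory of earlier requests, which is the sense in which the servlet is stateless.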

3.1.2 Mapping from an XML document to the Foldertree

This section describes the steps taken to map an XML document to the foldertree while displaying it incrementally.

A foldertree is chosen to display the document since it is suitable for displaying hierarchical, nested structures. The JAXP [JAX] DOM parser is used to parse XML documents. It gives us an in-memory tree corresponding to a document, where every node is a DOM Node. We only consider nodes of the types DOCUMENT_NODE, DOCUMENT_TYPE_NODE, ELEMENT_NODE, ATTRIBUTE_NODE and TEXT_NODE. The Document node is the root of the document, while the Document type node is used to identify the DTD associated with the document. The Document node corresponds to the root of the foldertree. Element nodes and Attribute nodes correspond to foldertree nodes (FTNs) and leaf nodes in the foldertree. Element, Text and Attribute nodes are transformed into the foldertree structure as follows:

ELEMENT_NODE: The name of the Element node (i.e. the markup tag) in the XML document is assigned to the corresponding foldertree node (FTN) in the foldertree. Figure 3.1 and Figure 3.2 demonstrate the mapping from an XML document to the foldertree.

TEXT_NODE: A Text node in the XML document corresponds to a Leaf node of the foldertree. Since the Text node carries the actual data, and an Element node is assumed to have only a single Text node as its child, we append the value of the Leaf node (corresponding to the Text node) to its parent foldertree node and remove the Leaf node from the foldertree, as it is redundant.

ATTRIBUTE_NODE: An AttributeName=AttributeValue pair for every Attribute node in the XML document is appended to the FTN corresponding to the Element node that carries the attribute. For example, as shown in Figure 3.1 and Figure 3.2, the Element node OrderData has the Attribute node startDate=1/11/2001. We append startDate=1/11/2001 to the FTN corresponding to OrderData in the foldertree.
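The three mapping rules above can be sketched as a single function that builds the text shown for a foldertree node. This is a minimal sketch: the separators (a space before each attribute pair and " : " before the text value) are assumptions, since the exact layout of Figure 3.2 is not reproduced here, and the class name FolderLabelSketch and the Note element are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class FolderLabelSketch {

    /** Builds the foldertree label: tag name, then name=value pairs, then the text value. */
    static String label(Element e) {
        StringBuilder sb = new StringBuilder(e.getTagName());           // ELEMENT_NODE rule
        NamedNodeMap attrs = e.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {                   // ATTRIBUTE_NODE rule
            Node a = attrs.item(i);
            sb.append(' ').append(a.getNodeName()).append('=').append(a.getNodeValue());
        }
        StringBuilder text = new StringBuilder();
        for (Node c = e.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.TEXT_NODE) text.append(c.getNodeValue());
        }
        String value = text.toString().trim();
        if (!value.isEmpty()) sb.append(" : ").append(value);           // TEXT_NODE rule
        return sb.toString();
    }

    /** Convenience wrapper: label of the first element with the given tag. */
    static String labelOf(String xml, String tag) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return label((Element) doc.getElementsByTagName(tag).item(0));
        } catch (Exception ex) {
            throw new IllegalArgumentException("could not parse XML", ex);
        }
    }

    public static void main(String[] args) {
        String xml = "<OrderData startDate=\"1/11/2001\"><Note>rush</Note></OrderData>";
        System.out.println(labelOf(xml, "OrderData"));
        System.out.println(labelOf(xml, "Note"));
    }
}
```

Folding the text value into the label of the parent is what makes the separate Leaf node redundant.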

Attributes can have different types, such as CDATA, ID, IDREF and ENTITY. Currently our system supports only attributes of type ID and IDREF. IDREF to ID links are created by checking the attribute type. Suppose an Element node in the XML document has an attribute of type IDREF; we then identify the Element node that has an attribute of type ID (i.e. the ID node) whose value matches the IDREF value. In the foldertree, the FTN corresponding to the Element node having the ID is attached as a child of the FTN corresponding to the Element node having the IDREF. In an XML document, the ID node and the IDREF node can lie far apart; our system provides a way to browse from the IDREF node to the ID node.

IDREF to ID links are thus created as part of the initialization of the system, and the in-memory DOM tree corresponding to the XML document is updated to include the IDREF to ID links, as described in the following section.

3.1.3 IDREF to ID links

An XML document contains elements, and an element can have attributes. An attribute of type ID identifies an element uniquely within a document. The IDREF to ID relationship is analogous to the foreign key to primary key relationship. An attribute of type IDREF indicates that the element refers to another element having an attribute of type ID with the same value. Here, we assume that the value of an ID attribute is unique over the entire document.

The DOM parser does not provide an API to identify attributes of a particular type, say ID or IDREF, whereas the SAX parser API does support such identification. Since using the SAX parser would add overhead, as one more SAX scan would be required in addition to the DOM parser scan, we use a DTD parser [DTD] instead. The DTD parser helps us to identify IDREF elements and ID elements. We support identification of IDREF to ID links only if the document has a DTD associated with it. The in-memory DOM tree is updated to include IDREF to ID links as follows:

1. If the XML document has a DTD associated with it, parse the DTD. For every element, check whether any attribute is of type ID or of type IDREF:
   - For an attribute of type IDREF, enter the ElementName-AttributeName pair into the IDREF hashtable.
   - For an attribute of type ID, enter the ElementName-AttributeName pair into the ID hashtable.
2. When finished with the DTD, parse the XML document to construct an in-memory DOM tree:
   - If an ElementName-AttributeName pair in the document matches some entry in the ID hashtable, enter the id value of the ID Node in the ID array.
   - If it matches some entry in the IDREF hashtable, enter the idref value of the IDREF Node in the IDREF array.
3. After the DOM tree is completely constructed, for every entry in the IDREF array do the following:
   - Get the ID Node with the matching value.
   - Clone the ID Node to get a duplicate ID Node, since DOM does not allow the same node to appear at two places in the tree. If we do not clone the ID node, it is removed from its original place in the XML document and appended to the child list of the IDREF node. Because we want to keep the ID node where it is in the original XML document and additionally append it to the child list of the IDREF node, we clone it.
   - Append the duplicate ID Node to the child list of the IDREF Node.

Figure 3.2 shows IDREF to ID links in order.xml document where Invoice refers to Customer and LineItem

refers to Part. We append Customer to the child list of Invoice and Part to the child list of LineItem for

matching IDREF and ID values in the document.
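The linking step can be sketched with the standard DOM calls cloneNode and appendChild. This is a simplified sketch, not the procedure of the system: instead of reading ElementName-AttributeName pairs from a DTD parser, it assumes the caller already knows the name of the ID attribute and of the IDREF attribute. The attribute names cid and cref, the class name IdrefLinkSketch and its methods are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class IdrefLinkSketch {

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception ex) {
            throw new IllegalArgumentException("could not parse XML", ex);
        }
    }

    /** Appends a deep clone of each referenced ID element under the referring element. */
    static void linkIdrefs(Document doc, String idAttr, String idrefAttr) {
        // Take a snapshot first: the NodeList is live and would grow as clones are appended.
        List<Element> snapshot = new ArrayList<>();
        NodeList all = doc.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) snapshot.add((Element) all.item(i));

        Map<String, Element> byId = new HashMap<>();
        for (Element e : snapshot) {
            if (e.hasAttribute(idAttr)) byId.put(e.getAttribute(idAttr), e);
        }
        for (Element e : snapshot) {
            if (!e.hasAttribute(idrefAttr)) continue;
            Element target = byId.get(e.getAttribute(idrefAttr));
            // Appending the target itself would move it; the clone leaves the original in place.
            if (target != null) e.appendChild(target.cloneNode(true));
        }
    }

    static List<String> elementChildNames(Element parent) {
        List<String> names = new ArrayList<>();
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.ELEMENT_NODE) names.add(c.getNodeName());
        }
        return names;
    }

    /** Links the document, then lists the element children of the first element named tag. */
    static List<String> linkedChildNames(String xml, String idAttr, String idrefAttr, String tag) {
        Document doc = parse(xml);
        linkIdrefs(doc, idAttr, idrefAttr);
        return elementChildNames((Element) doc.getElementsByTagName(tag).item(0));
    }

    public static void main(String[] args) {
        String xml = "<Order><Customer cid=\"c1\"><Name>Asha</Name></Customer>"
                + "<Invoice cref=\"c1\"/></Order>";
        System.out.println(linkedChildNames(xml, "cid", "cref", "Invoice"));
        System.out.println(linkedChildNames(xml, "cid", "cref", "Order"));
    }
}
```

After linking, Invoice has a copy of Customer as a child, while the original Customer is still a child of Order, which is the behaviour that cloning is meant to guarantee.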

3.1.4 Sending serialized objects over HttpConnection

The system consists of a client end (browser end) and a server end. XML documents are stored in a data source at the server end, while the foldertree is displayed at the client end. This section describes the object used to send an XML document from the server end to the client end.

Users at the client end can send a request for a particular XML document to be browsed, or they may request the child nodes of an FTN, numbered from, say, ten to twenty. This request is sent over an HttpConnection to the servlet running at the back end. In response, the servlet sends the requested nodes to the client end. The client end redisplays the tree after attaching the received nodes to their corresponding parents.

The DOM parser gives us an in-memory DOM tree for the document. However, since a DOM Node is not serializable, we cannot send it over HttpConnection. We have constructed our own object, which stores the DOM Node information along with a few tags that make it easier to attach the nodes to their respective parent nodes at the client end. The class newObject describes the serializable object used for sending node information from the server end to the client end.


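class newObject implements java.io.Serializable {
    String folderValue;       // value displayed in the foldertree
    int isLeaf;               // 1 = leaf, 0 = non-leaf
    int myID;                 // unique ID of this node
    int parentID;             // ID of the parent node
    int numChildren;          // number of child nodes
    int numChildrenAtClient;  // child nodes already received by the applet
}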

The newObject carries the following parameters, which are required for reconstruction of the tree:

folderValue - a String holding the value that is displayed in the foldertree.

isLeaf - an integer that indicates whether the Node is a leaf node or a non-leaf node (1 = leaf, 0 = non-leaf). Nodes with isLeaf equal to 0 have to be stored temporarily on the client side, so that whenever we get their child nodes we can append them to the parent node. Nodes with isLeaf equal to 1 need not be stored, since they are leaf nodes.

myID - an integer used to assign a unique ID to the Node.

parentID - an integer used to store the ID of the parent of the Node. It helps in appending child nodes to the correct parent Node.

numChildren - an integer used to store the number of child nodes of the Node.

numChildrenAtClient - an integer indicating the number of child nodes already received by the applet at the client end.
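The way such objects travel between the two ends can be sketched with Java object streams. This is a minimal sketch, not the code of the system: byte arrays stand in for the HttpConnection streams, the class NodeRecord is a hypothetical stand-in for newObject (with numChildrenAtClient left out), and the parentID of -1 for the root is an assumption. The marker object with myID=0 follows the protocol described in Section 3.1.1.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class NodeStreamSketch {

    /** Hypothetical stand-in for the newObject class of the text. */
    static class NodeRecord implements Serializable {
        private static final long serialVersionUID = 1L;
        final String folderValue;
        final int isLeaf, myID, parentID, numChildren;

        NodeRecord(String folderValue, int isLeaf, int myID, int parentID, int numChildren) {
            this.folderValue = folderValue;
            this.isLeaf = isLeaf;
            this.myID = myID;
            this.parentID = parentID;
            this.numChildren = numChildren;
        }
    }

    /** Server side: writes every record, then a marker record with myID = 0. */
    static byte[] writeResponse(List<NodeRecord> nodes) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                for (NodeRecord n : nodes) out.writeObject(n);
                out.writeObject(new NodeRecord("", 1, 0, 0, 0)); // end-of-response marker
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Client side: reads records until the marker with myID = 0 arrives. */
    static List<NodeRecord> readResponse(byte[] data) {
        List<NodeRecord> received = new ArrayList<>();
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            while (true) {
                NodeRecord n = (NodeRecord) in.readObject();
                if (n.myID == 0) break;
                received.add(n);
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
        return received;
    }

    /** Client side: groups the received node IDs under the ID of their parent. */
    static Map<Integer, List<Integer>> childrenByParent(List<NodeRecord> nodes) {
        Map<Integer, List<Integer>> children = new TreeMap<>();
        for (NodeRecord n : nodes) {
            children.computeIfAbsent(n.parentID, k -> new ArrayList<>()).add(n.myID);
        }
        return children;
    }

    public static void main(String[] args) {
        List<NodeRecord> sent = Arrays.asList(
                new NodeRecord("Order", 0, 1, -1, 2),    // root; parentID -1 is an assumption
                new NodeRecord("Customer", 1, 2, 1, 0),
                new NodeRecord("Invoice", 1, 3, 1, 0));
        System.out.println(childrenByParent(readResponse(writeResponse(sent))));
    }
}
```

The parentID carried by each record is what lets the client attach a node to the right parent without any knowledge of the DOM tree held at the server.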

3.2 Implementation Details

We work with a sample data source containing saved XML documents. The system is developed in Java [JDK]. It uses a Java Servlet [JS] at the back end, a Swing applet [JDK] at the client end, and standard interfaces such as a DTD parser [DTD], DOM [JAX] and XQL [XQL], as shown in Figure 3.3. Since XML is a document format and not a data format, we need to preprocess it. An XML parser is used to retrieve the actual data from XML documents. XML parsers currently available include the JAXP parser (DOM, the Document Object Model, and SAX, the Simple API for XML) [JAX], Xerces (Apache's parser in Java) and libxml (in C). We use the JAXP parser. XML documents can have a DTD (Document Type Definition) associated with them. A few XML documents in the data source have a DTD describing the schema of the document. Our system does not require a DTD, but if one is available, it helps while browsing.

3.2.1 Interaction between the Servlet and the Applet

The servlet is set up at the URL corresponding to the servletRoot entry in the configuration file. All XML documents that we would like to browse need to be placed in a directory called XMLDocs inside the public_html directory, since applets can read only files stored in public areas; otherwise the applets would need to be signed, which is a somewhat complex procedure. The applet acts as the master, asking for a particular XML document, while the servlet runs in the background, serving the requests of the applet. The servlet uses the DOM API to parse the XML document and constructs an in-memory DOM tree from it. Ideally, the servlet would send DOM Node objects over the HttpConnection using an output stream, and the applet would read them from the HttpConnection using an input stream. The problem is that a DOM Node is not serializable, so it is not possible to send a DOM Node as a stream over the HttpConnection. We therefore use an object designed to store the DOM Node information in a serializable format. The object is described in Section 3.1.4. The same object is used at the back end as well as at the client end.

Figure 3.3: Overview of the system (components shown: XML data source; servlet with DTD parser, DOM and XQL; network; Browsing Interface (foldertree applet) and Select Interface)

All XML documents in the data source are parsed to get the corresponding in-memory DOM trees. Our system maintains a reference to the root node of each in-memory DOM tree so that requested nodes can be retrieved quickly. At the initialization of the system, we update each in-memory DOM tree by attaching text nodes, attaching attributes, and attaching ID nodes as children of IDREF nodes, as described in Sections 3.1.2 and 3.1.3.

3.2.2 Browser setup

We use a Swing applet [JDK] to display an XML document in foldertree format. Netscape 4.7 does not support Java Swing. One option is to set up the JRE (Java Runtime Environment) with the path set for plugins, but this is somewhat complex. A simpler option is to place the swingall.jar file in the /usr/lib/netscape/java/classes path of a Unix environment, which makes the browser Swing enabled. Netscape 4.7 and earlier versions do not provide support for stylesheets associated with XML documents. Our system embeds style information (font and colour) in the foldertree and displays it using Swing. Hence a better option is to use Netscape 6.1 or a higher version.

3.2.3 Data structures

The system is composed of the servlet at the back end and the applet at the client end. It is assumed that the client end has the same version of the JVM (Java Virtual Machine) installed as the servlet side. To understand the system in detail, we need to understand the control flow and the data structures used at both ends.

Data structures at Servlet end

/* initialized at the startup of system */
static int numOfDocs;
static String docNames[];

/* hashtable of document references */
static Hashtable hashNode_to_id;
static Hashtable hashid_to_Node;

/* per request */
static String docName;
static int reqid;
static int from;
static int to;

At the initialization, the system reads the available documents from the Configuration file. numOfDocs

indicates the number of documents in the data source. An array stores the DOM reference to the root

node of the in-memory tree for each document. hashNode_to_id and hashid_to_Node are Hashtables of

Hashtables. hashNode_to_id is a hashtable with the document number (index in the docNames array) as key and

the node-to-id hashtable for that document as value. The node-to-id hashtable stores the DOM Node as key and

the unique value assigned to that node as value. hashid_to_Node is a hashtable with the document number as key

and the id-to-node hashtable for that document as value. The doInit() function of the servlet does this

initialization work.

The servlet stores four variables - docName, reqid, from and to. docName indicates that the applet is asking for

the XML document named docName. reqid, from and to indicate that the applet requires the child nodes

numbered from from to to of the node with id value equal to reqid. The servlet accepts these parameters in

doGet() and accordingly sends the response over the HttpConnection.
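The two-level lookup and the range request described above can be sketched as follows. This is a minimal illustration, not the servlet's code: DocNode is a hypothetical stand-in for a DOM Node, ids are assumed to be assigned in preorder, and from and to are assumed to be 0-based and inclusive. None of these details are stated in the text.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a DOM node; the system itself stores DOM Node objects. */
final class DocNode {
    final String name;
    final List<DocNode> children = new ArrayList<>();

    DocNode(String name) { this.name = name; }

    DocNode add(DocNode child) {
        children.add(child);
        return this;
    }
}

final class ServletIndex {
    // document number -> (node -> id) and document number -> (id -> node)
    private final Map<Integer, Map<DocNode, Integer>> hashNodeToId = new HashMap<>();
    private final Map<Integer, Map<Integer, DocNode>> hashIdToNode = new HashMap<>();

    /** Assigns ids to all nodes of one document; preorder numbering is an assumption. */
    void register(int docNumber, DocNode root) {
        Map<DocNode, Integer> nodeToId = new IdentityHashMap<>();
        Map<Integer, DocNode> idToNode = new HashMap<>();
        Deque<DocNode> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            DocNode n = stack.pop();
            int id = idToNode.size();
            nodeToId.put(n, id);
            idToNode.put(id, n);
            // push children in reverse so that the first child is numbered first
            for (int i = n.children.size() - 1; i >= 0; i--) {
                stack.push(n.children.get(i));
            }
        }
        hashNodeToId.put(docNumber, nodeToId);
        hashIdToNode.put(docNumber, idToNode);
    }

    /** Returns the names of children from..to (0-based, inclusive) of the node with id reqid. */
    List<String> children(int docNumber, int reqid, int from, int to) {
        DocNode node = hashIdToNode.get(docNumber).get(reqid);
        List<String> out = new ArrayList<>();
        int last = Math.min(to, node.children.size() - 1);
        for (int i = Math.max(from, 0); i <= last; i++) {
            out.add(node.children.get(i).name);
        }
        return out;
    }

    /** Looks up the id that was assigned to a node of the given document. */
    int idOf(int docNumber, DocNode n) {
        return hashNodeToId.get(docNumber).get(n);
    }
}
```

Clamping the range to the number of available children lets the client ask for a fixed-size window without first knowing how many children a node has.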

Data structures at Applet end

/* hashtables */
Hashtable hashid_to_Path;
Hashtable hashPath_to_id;
Hashtable hashid_to_DMTN;
Hashtable hashid_to_Object;

/* parameters sent in request to servlet */
String docName;
int reqid;
int fromChild;
int toChild;

To browse through the XML document, the servlet creates a new applet. Every applet stores the four hashtables

mentioned above for constructing the tree. hashid_to_Path is a hashtable with the myID of a newObject as key

and the parentPath appended by the folderValue of the newObject as value. On the applet side, we need to identify

the id of the node when a user clicks on a particular node. To uniquely identify any node, we use the

whole path to that node as a key value. Hence the hashtable hashPath_to_id, providing the map from path to id, is used.

hashid_to_DMTN is used to get the actual tree node (a DefaultMutableTreeNode) to which the

received child nodes are appended. Further, hashid_to_Object is used to retrieve the newObject

corresponding to an id value, since tags like isLeaf, myID and parentID are stored in the newObject.

The four parameters, docName, reqid, fromChild and toChild constitute the request sent from the applet

to the servlet as described above.
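The bookkeeping at the applet end can be sketched as follows. The class AppletTreeState and its method names are hypothetical, "/" is an assumed path separator, and the sketch assumes that sibling labels are distinct so that a path identifies exactly one node. The fourth hashtable (id to newObject) is left out for brevity.

```java
import java.util.HashMap;
import java.util.Map;
import javax.swing.tree.DefaultMutableTreeNode;

final class AppletTreeState {
    final Map<Integer, String> hashidToPath = new HashMap<>();
    final Map<String, Integer> hashPathToId = new HashMap<>();
    final Map<Integer, DefaultMutableTreeNode> hashidToDMTN = new HashMap<>();

    /** Registers the root node of the document being browsed. */
    void addRoot(int id, String folderValue) {
        register(id, "/" + folderValue, new DefaultMutableTreeNode(folderValue));
    }

    /** Appends a child received from the servlet below its parent and records its path. */
    void addChild(int parentId, int id, String folderValue) {
        DefaultMutableTreeNode child = new DefaultMutableTreeNode(folderValue);
        hashidToDMTN.get(parentId).add(child);
        register(id, hashidToPath.get(parentId) + "/" + folderValue, child);
    }

    /** Resolves the path of a clicked node to the id sent as reqid in the next request. */
    Integer idForPath(String path) {
        return hashPathToId.get(path);
    }

    private void register(int id, String path, DefaultMutableTreeNode node) {
        hashidToPath.put(id, path);
        hashPathToId.put(path, id);
        hashidToDMTN.put(id, node);
    }
}
```

When a user clicks a node, the applet would build the path from the tree selection, call idForPath, and send the resulting id together with docName, fromChild and toChild.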


plet for reading a particular file on local machine or we need to provide the whole file, as input to the applet.

Thus, the approach becomes somewhat complex.

Our system uses style information to display the folder nodes and leaf nodes of a foldertree in the style that

is currently set as default. Users can change the default style using the select Style option given in the menu. In

our system, on the servlet end, the style chosen by the user is saved in the Configuration file. Whenever the

applet at the client end asks for a particular document, the servlet sends the current style settings to the applet

over the stream. The approach can be extended further to display the foldertree according to the stylesheets

associated with XML documents.

3.4.2 Interactive Foldertree

The system provides users with a mouse-over menu to play around with the foldertree to get customized

views of the same XML document. Figure 3.4 shows how the system provides navigation in foldertree

format. It also displays the menu provided to facilitate interaction with the foldertree.

Find matching - The menu helps users to highlight elements having the value same as that of the

selected element.

Expand subtree - Currently we provide only one-level expansion, in which the child nodes of the selected

element are displayed. This feature can be extended to expand a particular node up to the number of levels

asked for by the user.

Drop subtree - Drop Subtree drops the whole subtree below the selected element.

Drop element - Drop element drops the elements with the name same as selected element.

Figure 3.5: Query: Get articlesTuple from sigmod.xml containing Donald


Figure 3.6: Query Result: Get articlesTuple from sigmod.xml containing Donald

3.5 Select Interface

Our system provides the Select interface to select a particular element from an XML document. If users

know exactly what they want, they can specify pattern matching queries using the Select interface to get the

desired result. We plan to extend it to support more complex group-by and nested queries.

We are currently using the XQL engine provided by GMD-IPSI [XQL], which provides XQL APIs to run basic

queries on XML documents.

3.5.1 Working of Select Interface

XPATH expressions are needed to specify a query in XQL. Not every XML document has a DTD associated

with it. Hence, we create the hierarchy of elements from the XML document itself, which helps us to query a

particular element from the document. At the initialization step, this nesting of elements along with their

XPATH expressions is stored in a file. Queries are in the form of a contains clause. Users can type

in values in the Select interface and get the elements containing the specified values as a result of the query. For

example, while browsing sigmod.xml, which contains a collection of articles from sigmod, users might want

to get the articles written by author Donald. Here, users can specify Donald in the field articlesTuple to

get a detailed list of articles written by author Donald. The result of the query is displayed in tabular format

using HTML. Indentation is used to portray the nesting of elements.

At the initialization of the system, we create a .xpath file for every XML document in the data source. Our

system provides an XPATH module for that purpose. These files are used to form a query when users specify

values in the Select interface.

Figure 3.7: Bar graph displaying year-wise distribution of CDs for an XML document containing a collection of CDs (x-axis: Year, with values such as 1995, 1996, 1997, 2000; y-axis: Number of CDs)

Figure 3.5 shows the Query interface for specifying values. sigmod.xml is a set of sigmod articles, each

having fields like volume number, title, authors etc., as shown in Figure 3.5. The query looks like

articlesTuple contains Donald. The result is shown in Figure 3.6. The document contains two entries with

author value equal to Donald. Both of these tuples are presented to the user as a result of the query.
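The two steps described in this section, collecting the element hierarchy with its path expressions and then running a contains query against one path, can be sketched as follows. The sketch uses the JDK's javax.xml.xpath API in place of the GMD-IPSI XQL engine, whose API is not shown here. The class SelectSketch, its method names and the sample element names in the usage note are assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class SelectSketch {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    /** Collects the distinct element paths of a document in document order. */
    static Set<String> elementPaths(String xml) {
        Set<String> paths = new LinkedHashSet<>();
        collect(parse(xml).getDocumentElement(), "", paths);
        return paths;
    }

    private static void collect(Element e, String prefix, Set<String> paths) {
        String path = prefix + "/" + e.getTagName();
        paths.add(path);
        for (Node c = e.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c instanceof Element) {
                collect((Element) c, path, paths);
            }
        }
    }

    /** Counts the elements at the given path whose text content contains the value. */
    static int countContains(String xml, String path, String value) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // the value is bound as an XPath variable so that quotes in it cannot break the expression
            xpath.setXPathVariableResolver(name -> "v".equals(name.getLocalPart()) ? value : null);
            NodeList hits = (NodeList) xpath.evaluate(
                    path + "[contains(., $v)]", parse(xml), XPathConstants.NODESET);
            return hits.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("bad query", e);
        }
    }
}
```

For a document whose root element articles holds articlesTuple elements, countContains(xml, "/articles/articlesTuple", "Donald") evaluates the expression /articles/articlesTuple[contains(., $v)], which corresponds to the query articlesTuple contains Donald.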

3.5.2 Extensions

The select interface currently provided can be enhanced as described below :

The select interface provided is not yet integrated with the browsing system. We can use the same

foldertree, as used for browsing an XML document, to display query results with query-result nodes

in expanded state and the remaining portion of the document in collapsed state.

Select interface supports pattern matching queries on element nodes from the document. We can

extend it to include attribute nodes too.

We have not taken scalability into consideration. For potentially large documents, query execution

takes quite a long time, since the XQL engine traverses the whole document again to find

matching elements. An incremental approach can be used to run the query in the background and display

partial results, if some query language provides that feature. We would need to replace the current

querying engine with a better one.

Currently we have focused only on implementing the contains clause. We can provide group-by queries

in which users select a group-by element and a result element. This is similar to the concept of

group-by templates provided in BANKS. For example, for a document portraying a collection of CDs,

users would like to see the list of CDs grouped by year. Here users can give the year as the group-by

element and the CD as the result element, and get the list of CDs grouped by year as a result. Further,

this information can be displayed in a graphical manner as shown in Figure 3.7. The rectangular bar

representing a year can have a hyperlink to the actual records giving the details of the CDs in that

particular year.
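The proposed group-by extension can be sketched in a few lines. The sketch assumes that the values of the group-by element and the result element have already been extracted from the document as pairs. The class GroupBySketch and the text rendering of the bars are hypothetical and only illustrate the grouping step behind a chart like Figure 3.7.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

final class GroupBySketch {
    /** Groups result values by group-by key; each row is {groupKey, resultValue}. */
    static SortedMap<String, List<String>> groupBy(String[]... rows) {
        SortedMap<String, List<String>> groups = new TreeMap<>();
        for (String[] row : rows) {
            groups.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }
        return groups;
    }

    /** Renders one text bar per group, one '#' per result element in that group. */
    static String bars(SortedMap<String, List<String>> groups) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, List<String>> e : groups.entrySet()) {
            sb.append(e.getKey()).append(' ');
            for (int i = 0; i < e.getValue().size(); i++) {
                sb.append('#');
            }
            sb.append('\n');
        }
        return sb.toString();
    }
}
```

Because each group keeps the list of its result elements, a bar can later be linked to the records it stands for.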


Chapter 4

Integrating Keyword Search with Browsing

The primary motivation behind keyword search is to provide an interface that helps naive users extract

information from an XML data source just by typing a few keywords.

4.1 Keyword Search

The keyword search routine constructs a graph from the XML documents, where nodes in the graph

correspond to the nodes from the documents. Parent-child edges and IDREF-ID edges form the edges of the

graph. The keyword search algorithm runs on the preconstructed graph to get answer results. The algorithm

can be described as follows :

Construct the graph from the XML data source.

Create an in-memory text index containing all words from all documents, except for stop words. Stop

words are words such as a, an, the which occur very frequently.

Take search terms as input from the user.

Traverse the graph starting from search terms.

Follow the backward edges to find an intersection node common to all search terms.

The intersection node is the root node of the answer result with the leaf nodes representing search

terms.

If there is more than one answer result, then sort them according to relevance. The details of

how the relevance score is calculated for a result tree are given in [Mes02].

Return the list of answer results to the user. An answer result is not only the name of the XML document

in which the word lies but also the relevant portion of that XML document.

The construction of the graph, the construction of the in-memory text index, the search technique used and the

ranking of answer results according to relevance are described in detail by Megha Meshram in her dissertation

[Mes02].
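A heavily simplified sketch of the steps listed above is given here. It is not the algorithm of [Mes02]: it works on a plain tree with parent-child edges only, ignores IDREF-ID edges and relevance ranking, and reports the deepest nodes whose subtree contains every search term. The classes KwNode and KeywordSearchSketch are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical graph node with a backward (child-to-parent) edge. */
final class KwNode {
    final String label;
    final String text;
    final KwNode parent;
    final List<KwNode> children = new ArrayList<>();

    KwNode(String label, String text, KwNode parent) {
        this.label = label;
        this.text = text;
        this.parent = parent;
        if (parent != null) {
            parent.children.add(this);
        }
    }
}

final class KeywordSearchSketch {
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "an", "the"));
    private final Map<String, List<KwNode>> index = new HashMap<>();

    /** Builds the in-memory text index for all nodes below root, skipping stop words. */
    KeywordSearchSketch(KwNode root) {
        indexNode(root);
    }

    private void indexNode(KwNode n) {
        for (String w : n.text.toLowerCase().split("\\W+")) {
            if (!w.isEmpty() && !STOP_WORDS.contains(w)) {
                index.computeIfAbsent(w, k -> new ArrayList<>()).add(n);
            }
        }
        for (KwNode c : n.children) {
            indexNode(c);
        }
    }

    /** Returns the labels of the deepest nodes whose subtree contains every search term. */
    List<String> search(String... terms) {
        Set<String> wanted = new HashSet<>();
        Map<KwNode, Set<String>> reached = new LinkedHashMap<>();
        for (String raw : terms) {
            String term = raw.toLowerCase();
            wanted.add(term);
            for (KwNode hit : index.getOrDefault(term, Collections.emptyList())) {
                // follow backward edges from the matching node up to the root
                for (KwNode n = hit; n != null; n = n.parent) {
                    reached.computeIfAbsent(n, k -> new HashSet<>()).add(term);
                }
            }
        }
        List<String> roots = new ArrayList<>();
        for (Map.Entry<KwNode, Set<String>> e : reached.entrySet()) {
            if (!e.getValue().containsAll(wanted)) {
                continue;
            }
            boolean childCoversAll = false;
            for (KwNode c : e.getKey().children) {
                Set<String> seen = reached.get(c);
                if (seen != null && seen.containsAll(wanted)) {
                    childCoversAll = true;
                }
            }
            if (!childCoversAll) {
                roots.add(e.getKey().label); // intersection node: root of one answer tree
            }
        }
        return roots;
    }
}
```

In this simplification, a node is reported as an answer root only if none of its children already covers all terms, which keeps answer trees as small as possible.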


Figure 4.1: Answer result tree for the query : dunkel rabbit model.

4.2 Browsing search results

The keyword search routine returns the list of answer results in decreasing order of relevance. Since the

result has hierarchical format, we have chosen foldertree to display the answer result tree. Answer result is

also displayed using incremental, on demand approach.

The keyword search module accepts keywords from the user. The step by step execution of the algorithm

given above leads to the generation of answer results. Answer results are ranked and sorted before being

displayed to the user. Users can view an answer result in a foldertree format as shown in Figure 4.1.

We use searchObject for sending a search result node over the HttpConnection. The searchObject carries the

parameters required for reconstruction of the answer result at the client end. The class searchObject is described

below :

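class searchObject {
    String folderValue;   // value displayed for the node
    boolean isLeaf;       // leaf or nonleaf node
    int no_of_children;   // number of children of the search node
    int got_children;     // children already brought to the client end
    boolean hasKeyword;   // node contains at least one search term
    String docName;       // document the search term belongs to
}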

Every searchObject carries with it the following parameters :


folderValue - is a String, representing the actual value of FTN.

isLeaf - is an Integer, which indicates whether node is a leaf node or a nonleaf node. Value 1 indicates

a nonleaf node while value 0 indicates a leaf node.

no_of_children - is an Integer, which indicates the number of children of the search node.

got_children - is an Integer, which indicates the number of child nodes currently brought to the client

end. It helps while displaying search results.

hasKeyword - is a boolean, which indicates whether the node contains at least one search term.

An answer result may contain nodes that do not contain any search term. This flag is used to display nodes

containing search terms in a different colour.

docName - is a String storing the name of the document the search term belongs to. This information

is used to browse to the respective XML document from a particular search node in answer result.

Search result is provided with mouse-over menu where users can get the name of the XML document the

search term belongs to. Further, the users can browse through the document by clicking on the menu.

Figure 4.2: Answer result tree for the query : hepatitis antibodies australia.

Consider an example where the user wants to know whether there is any medical citation written by author

dunkel containing rabbit model in its title. Here, the search query will look like dunkel rabbit model.

Figure 4.1 shows the answer result for the query : dunkel rabbit model. Consider another example, the search

query : hepatitis antibodies australia, which finds medical citations from Australia on hepatitis antibodies.

Figure 4.2 shows the answer result for this query.


4.3 Browsing Extensions

The current approach used for browsing search results is a naive one. A few features can be added to the

system to display search results in a more interesting manner.

The current approach provides a hyperlink from a search node to the root of the XML document to which

the search node belongs. Users would like to go to the corresponding node in the XML document rather

than to the root of the document. This can be done by finding the node in the XML document that

corresponds to the search node and displaying the whole subtree below it.

Currently, the browsing interface provides only the option of getting the child nodes of a particular node.

We can extend this to provide an option for getting the parent node of a particular node. This feature

may lead to interesting browsing patterns while browsing keyword search results.


Chapter 5

Conclusions and Future work

We have developed a complete system which facilitates browsing and keyword search in XML documents.

The system works with any XML data source at the back end. The system will help naive users in getting

information from XML data sources, such as a collection of sigmod papers or other articles. XML will help

information providers portray different views of the same information as per users' interest. Our system

provides a foldertree for navigating through XML documents, starting from the root, up to the leaf nodes. To get

different views of the same XML document, it provides menus. The styling feature provided by our system

helps users to view an XML document in customized styles. The system also provides a select interface to

extract desired information using pattern matching queries. The system can be enhanced in many ways.

Some of the short term extensions include :

Creating an external index - Currently we are using in-memory DOM which is kept in memory all

the time. This may become an overhead for potentially huge XML documents. A better option would

be to use persistent DOM [XQL] where we can create disk based index, which will help in fetching

a node given unique id, without the cost of parsing XML document or building an in-memory DOM

tree.

Facilitating complete support for querying of XML documents - Currently we provide an interface

which supports only simple pattern matching queries using XQL. It can be extended to support

more complex group-by and nested queries.

Using HTML to display XML documents using hyperlinks - Instead of using the foldertree, we can

generate stylesheets on-the-fly. XML document can be displayed using style sheets. Hyperlinks may

be used for inter-document references.

Exploiting features of XML document - Currently our system supports only element nodes along

with attribute nodes and text nodes. Most of the times, extracting information from these nodes is

sufficient. Further, system can be extended to include use of entities, entity references, processing

instructions and CDATA sections.

Displaying keyword search results - Every search node in an answer result tree contains a parameter

called docName, which is used to browse through the actual XML document to which the search term

belongs. Further, this feature can be extended to display only a partial view of the XML document containing

the nodes relevant to the search term, instead of displaying the whole XML document.

Extending Keyword Search - Including metadata search will be quite interesting for XML since XML

document identifies data using tags. For example, users can type in CATALOG.AUTHOR:Adams

to search for a particular author named Adams in catalog.xml. More precisely, we can restrict the


search domain only to those paths specified by the user. This feature will help users to get more

accurate results since XML identifies semantics of data contained in the document.

Longer term future work would include the following :

Combined system for structured and unstructured data - In E-commerce or marketplace systems,

where scalability and performance play critical roles, it is desirable to have a system providing

support for structured as well as unstructured querying. Using an RDBMS leads to degraded performance,

while the unstructured paradigm doesn't provide support for some structured components. For example,

consider the query : Get documents containing XML, in English language, of type pdf. Here XML

is an unstructured component while the other two are structured components.

Figure 5.1: XML document portrayed as a graph (nodes shown: OrderData, Invoice, Part P1, Part P2, Part P3, Customer C1, Customer C2, LineItem L1, LineItem L2)

Navigation through XML using a graph model - Our system uses a tree model to represent XML

document. XML document can be mapped to the graph model rather than a tree model, which will

make browsing more interesting.

The graph of an XML document can be constructed with nodes and edges such that element nodes from the

XML document represent the nodes of the graph, while parent-child edges and IDREF-ID edges

represent the edges of the graph. To portray the graph structure, we can do the following :

For IDREF to ID references in XML document, rather than replicating the ID nodes as a child

of IDREF nodes, we can have only one instance of ID node, to which every IDREF node can

refer. Refer Figure 5.1 which portrays XML document as a graph, where dotted lines represent

IDREF to ID edges, while solid lines represent parent-child edges. The original XML document

is displayed in Figure 3.1.

We can merge identical text values or entire subelements from XML document.

The graph structure will help in compressing the document by removing the replication present in the tree format. Most

important, it will be of immense use for keyword search. The current keyword search module

explicitly searches for the ID node once it receives an IDREF node, whereas this approach will provide an

in-built graph with all IDREF-ID links embedded in it. We can use the package called Grappa [LR], which

is a graph package written in Java provided by AT&T Research Labs.


Implications of graph model on browsing as compared to the Foldertree - Foldertree facilitates in-

order browsing through XML document, starting from root of the document, up to the leaf level. We

can map XML document to a connected graph as mentioned above. To browse through a particular

document, we can provide users with customized views, according to the starting node specified. For

example, if users specify OrderData as a root node, they can view the whole document as shown in

the Figure 5.1. If Invoice is specified as a starting node, users can just view the partial document that

is reachable from the Invoice, directly or indirectly.
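The proposed graph model and the notion of a view rooted at a chosen starting node can be sketched as follows. The class XmlGraphSketch is hypothetical. Each ID node is stored once and shared by all IDREF nodes that refer to it, and a view is the set of nodes reachable from the starting node over parent-child and IDREF-ID edges.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class XmlGraphSketch {
    // adjacency list: node label -> labels of nodes reachable over one edge
    private final Map<String, List<String>> edges = new LinkedHashMap<>();

    /** Adds a parent-child edge. */
    void addChild(String parent, String child) {
        link(parent, child);
    }

    /** Adds an IDREF-to-ID edge; the ID node exists once and is shared by all referrers. */
    void addReference(String idrefNode, String idNode) {
        link(idrefNode, idNode);
    }

    private void link(String from, String to) {
        edges.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
        edges.computeIfAbsent(to, k -> new ArrayList<>());
    }

    /** Returns the nodes reachable from start, directly or indirectly, in breadth-first order. */
    List<String> view(String start) {
        List<String> order = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        if (edges.containsKey(start)) {
            queue.add(start);
            seen.add(start);
        }
        while (!queue.isEmpty()) {
            String n = queue.remove();
            order.add(n);
            for (String m : edges.get(n)) {
                if (seen.add(m)) {
                    queue.add(m);
                }
            }
        }
        return order;
    }
}
```

With edges added for a document like the one in Figure 5.1 (the exact edges of that figure are not reproduced here), view("OrderData") would return the whole document while view("Invoice") would return only the part reachable from the Invoice node.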


Bibliography

[BGL 99] Chaitanya Baru, Amarnath Gupta, Bertram Ludascher, Richard Marciano, Yannis Papakonstantinou,

and Pavel Velikhov. XML-Based Information Mediation with MIX. In ACM SIGMOD

1999, exhibition program, University of California, San Diego, La Jolla, CA 92093, 1999.

[BHN 01] Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe, Soumen Chakrabarti, and S. Sudarshan. Keyword

searching and browsing in databases using BANKS. In Proc. of ICDE, 2001.

[DEGP98] Shaul Dar, Gadi Entin, Shai Geva, and Eran Palmon. DataSpot : Database Exploration Using

Plain Language. In Proc. of the 24th VLDB Conference, Data Technologies Ltd., 1998.

[DTD] Java DTD Parser. Online at http://www.wutka.com/dtdparserdownload.html .

[HBN 01] Arvind Hulgeri, Gaurav Bhalotia, Charuta Nakhe, Soumen Chakrabarti, and S. Sudarshan. Keyword

searching and browsing in databases using BANKS. In IEEE Data Engineering Bulletin,

September 2001.

[JAX] JAXP API for XML parsing 1.1.1. Available online at http://java.sun.com/xml/jaxp/dist/1.1/docs/api/overview-summary.html .

[JDK] Java API 1.2.2. Available online at http://java.sun.com/products/jdk/1.2/docs/api/index.html.

[JS] Java Servlet API. Available online at http://java.sun.com/products/servlet/2.2/javadoc/index.html .

[LR] AT&T Labs-Research. Grappa - A Java Graph Package. Available online at http://www.research.att.com/sw/tools/graphviz/packages/grappa.html .

[Mes02] Megha Meshram. Keyword Searching in XML Documents. Master's thesis, Computer Science

and Engineering Department, IIT Bombay, 2002.

[MLP99] Kevin D. Munroe, Bertram Ludascher, and Yannis Papakonstantinou. Blended Browsing and

Querying of XML in a Lazy Mediator System. In VDB 2000, University of California, San

Diego, La Jolla, CA 92093, 1999.

[XQL] GMD-IPSI XQL Engine. Available online at http://xml.darmstadt.gmd.de/xql/xql-examples.html.
