browsing and querying on xml data sources


  • 8/6/2019 Browsing and Querying on XML Data Sources

    1/29

    M.Tech. Dissertation

    Browsing and Querying on XML Data Sources

    submitted in partial fulfillment of the requirements for the degree of

    Master of Technology

    By

    Urmila Kelkar

    Roll No : 00305402

    Under the guidance of

    Prof. S. Sudarshan

    Department of Computer Science and Engineering

    Indian Institute of Technology, Bombay

    Mumbai

    January 17, 2002


    Dissertation Approval Sheet

    This is to certify that the dissertation entitled Browsing and Querying in XML Data Sources by Urmila

    Kelkar is approved for the award of the degree of Master of Technology.

    Prof. S. Sudarshan

    (Guide)

    Internal Examiner

    External Examiner

    Chairman

    Date :


    Acknowledgement

    I would like to thank my guide, Dr. S. Sudarshan, for his untiring support and encouragement throughout
    my M.Tech project. I would also like to acknowledge those who, from behind the scenes, have contributed
    their ideas and energies. Special thanks to all my colleagues from the Informatics lab.

    Urmila Kelkar

    January 17, 2002


    Contents

    1 Introduction
    2 Related Work
    2.1 Browsing
    2.1.1 Blended Browsing and Querying by BBQ
    2.1.2 XML based information mediation with MIX
    2.1.3 BANKS
    2.2 Keyword Search
    2.2.1 BANKS
    2.2.2 DataSpot
    3 Scalable Browsing of XML documents
    3.1 Design Issues
    3.1.1 Incremental browsing of XML documents
    3.1.2 Mapping from an XML document to the Foldertree
    3.1.3 IDREF to ID links
    3.1.4 Sending serialized objects over HttpConnection
    3.2 Implementation Details
    3.2.1 Interaction between the Servlet and the Applet
    3.2.2 Browser setup
    3.2.3 Data structures
    3.3 Scalability of the system
    3.4 Extensions to the Foldertree
    3.4.1 Styling enhancements
    3.4.2 Interactive Foldertree
    3.5 Select Interface
    3.5.1 Working of Select Interface
    3.5.2 Extensions
    4 Integrating Keyword Search with Browsing
    4.1 Keyword Search
    4.2 Browsing search results
    4.3 Browsing Extensions
    5 Conclusions and Future work


    Abstract

    The Web is used extensively by information seekers. Search engines such as Google retrieve information
    from HTML documents. They allow users to get the desired information by just typing a few words and
    following hyperlinks. The goal of our project is to design and implement a system that provides a powerful
    way of extracting information from XML documents, using browsing and keyword search.

    Our system provides a directory-tree-like interface to browse through nested XML data, coupled with the use
    of IDREF to ID links and the use of stylesheets. To facilitate customized views of the same XML document,
    we provide menus to drop elements, to find matching elements, to drop a subtree and so on. Additionally,
    we provide keyword search, where users can just type a few keywords to get the desired information from the
    XML data source.


    Chapter 1

    Introduction

    XML is an evolving technology that is becoming important because of its standardized data representation
    format. XML documents focus on the semantics of data; they do not provide information about how to display
    the data contained in the document. XML offers a semistructured data model which is likely to be used
    to publish heterogeneous data.

    Consider the example of electronic patient records (EPR) used by public health services. Many European
    countries aim to introduce EPR as a standard way of maintaining patient records. XML, which provides
    users with a flexible way to mark up nested data, appears well suited for maintaining EPRs. Administrative
    committees in a number of hospitals are now incorporating XML within their prescribed standards for
    maintaining patient records. This emphasizes the need for a system that facilitates information retrieval
    from the underlying XML documents.

    We aim to develop a system that facilitates browsing of XML documents and provides keyword search
    as well. In browsing, we focus on the two access patterns that we expect to be most common: document
    traversal and pattern matching queries. The nesting structure of an XML document is used to provide
    navigation through the document in a tree format, called a foldertree. Users can choose a particular style
    for displaying XML documents. Our system also provides a menu to drop the selected subtree, to expand
    the selected subtree, to drop a particular element and to highlight matching elements. Since XML documents
    can be very large, it is important to make the system scalable. Our system supports scalable browsing
    using an incremental approach. We also provide an interface, called the Select Interface, which helps users
    retrieve information from XML documents using pattern matching queries. A few systems, such as Blended
    Browsing and Querying (BBQ) [MLP99] and XML based information mediation with MIX (MIX) [BGL 99],
    support querying and browsing of XML documents. BBQ uses an incremental, on-demand approach.

    Our system additionally provides keyword search, which is not supported by BBQ and MIX. The keyword
    search module constructs a graph from the XML documents available in the data source. We use the least
    common ancestor technique to find answer results. The answer result tree can be browsed like any other
    XML document.

    Chapter 2 gives an overview of related work in the area of browsing XML data sources. The browsing
    interface offered by our system is described in Chapter 3. Chapter 4 briefly describes the keyword search
    module and browsing of search results. The detailed approach of keyword search is described by Megha
    Meshram in her dissertation [Mes02]. Chapter 5 outlines conclusions and describes future work.


    Chapter 2

    Related Work

    The Web has introduced a new paradigm of browsing and keyword search to retrieve information from
    HTML documents. Our goal is to provide a similar system to browse through XML documents. Techniques
    used for information retrieval from HTML documents cannot be used directly for XML documents.
    This chapter gives an overview of systems that support browsing, keyword search and querying of XML
    documents.

    2.1 Browsing

    Search engines such as Google help a naive user to browse through information using hyperlinks. There are

    a few systems that provide support for browsing through XML documents. The following sections describe

    these systems.

    2.1.1 Blended Browsing and Querying by BBQ

    Blended Browsing and Querying (BBQ) [MLP99] provides a Document Type Definition (DTD) based graphical
    user interface, which facilitates XML query construction and browsing of results by non-expert users.
    It is used as a front-end to the virtual source exported by the MIX mediator system. A virtual source may be
    an actual XML source or an XML view created by the mediator.

    BBQ assigns each document a separate window with its data and schema displayed side by side. Both
    data and schema can be navigated using directory-like structures. BBQ querying is schema driven. Underneath,
    it uses the XMAS (XML Matching And Structuring Language) query language, which supports joins, filtering
    and constraints on leaf nodes. BBQ assumes that users cannot come up with a focused query at the first
    step, and therefore facilitates iterative query refinement. It supports queries on multiple DTDs. Users can
    specify the structure of the query by dragging and dropping elements from the source DTDs, by introducing
    new elements, or by grouping elements according to the value of other elements. Once execution of the query
    starts, users can browse through partial results. BBQ uses the Document Object Model Application Programming
    Interface (DOM API) and an incremental approach for browsing potentially very large documents and
    subsequent query results.

    2.1.2 XML based information mediation with MIX

    Mediation of information from heterogeneous data sources becomes crucial as data from different sources,
    such as HTML documents, XML documents, relational databases and legacy data, is published over the
    Web. MIX [BGL 99] employs XML as a semistructured data model to provide a uniform and flexible
    representation of arbitrary source data. Since XML may also become a stumbling block when formulating
    meaningful queries against semistructured databases, MIX uses XML DTDs as a structural description of the
    data exchanged by the components of the mediator.

    MIX focuses on valid XML documents, which conform to a DTD. XML queries are expressed in a high-level,
    declarative query language known as XMAS, which evolved from ideas in XML-QL and other XML
    query languages. It allows pattern matching as well as group-by queries. To facilitate querying of
    heterogeneous sources, XML wrappers are provided which export data in a uniform format to the mediator.
    MIX uses BBQ as its graphical interface. XMAS queries generated by BBQ are sent to the mediator for
    execution and the results are displayed using BBQ.

    2.1.3 BANKS

    BANKS, an acronym for Browsing ANd Keyword Search [BHN 01], facilitates browsing and keyword
    search in relational databases. Earlier we had worked on a module of the BANKS system. To retrieve
    information from the underlying relational databases, users would normally need to know SQL (Structured
    Query Language). BANKS instead uses the paradigm of keyword search introduced by the Web to retrieve the
    desired information.

    BANKS uses a tabular format to display information retrieved from the database. The foreign key-primary key
    (FK-PK) relationship is used to provide links from one table to another in the form of hyperlinks. BANKS
    also provides menus, implemented in JavaScript, for sorting records of a table on a particular column, grouping
    records based on a column and so on. It also provides a Select Interface to get records matching the specified
    values. BANKS also provides templates such as crosstab, nested, foldertree, bar-chart and pie-chart to display
    information from the database graphically.

    Browsing and keyword search in XML can also be thought of as an extension to BANKS. XML is considered
    semi-structured data, while BANKS uses a relational (structured) database as its back-end. Hence, the
    strategies and approaches used for browsing and keyword search in XML are different, but the underlying
    need for browsing and keyword search is the same.

    XML documents can be displayed using a foldertree structure like the one used in BANKS. Menus such
    as drop elements, select matching elements and drop subtree can be used to interact with the foldertree. We
    can also provide a group-by selection interface in which users select a group-by element and a result element
    to get a graphical view of the data, like the bar-chart used in BANKS. For example, rather than viewing the
    whole collection of CDs at once, users may prefer a year-wise or artist-wise collection of CDs, where we
    specify CD as the result element and the year or the artist as the group-by element. We can also facilitate
    querying of XML documents, using some query language at the back-end.

    2.2 Keyword Search

    Keyword search on documents is widely used for finding desired information on the Internet. The keyword
    search paradigm is equally important for XML documents. The following sections describe prior work on
    keyword search in databases.

    2.2.1 BANKS

    BANKS supports keyword search to retrieve information from the underlying relational database; users are
    not required to know the schema details. Just as keyword search on the Internet returns documents
    containing the given words, BANKS returns tuples containing the given words. BANKS exploits the foreign
    key-primary key relationships between tables of the database to form a graph called the meta-data graph. It
    also uses a tuple-level graph to identify particular tuples. Given a set of words, two words are considered
    close to each other if, in a table, they are in the same row and same column, or in different columns of a row,
    or in rows of different tables linked by foreign keys. The details of the keyword search algorithm are given
    in [HBN 01]. Search results are ranked and sorted before being presented.

    To facilitate keyword search in XML documents, we can use the built-in hierarchical structure of an XML
    document to form a graph. We can make an entry in the text index for every word in every document,
    except for stop words. The text index can store DOM Node references, which help us traverse the graph
    to find the shortest paths between keyword nodes. The simplest way to find a tree containing all the words
    is to use the least common ancestor technique. The root of the result is the common intersection
    node, while the leaf nodes refer to the search terms. Search results can be ranked based on the depth of
    the tree or, say, the longest edge of the tree. The result tree can be browsed using the foldertree like any
    other XML document.
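    The text index and the least common ancestor step described above can be sketched roughly as follows.
    This is only an illustration under assumptions: the class name KeywordLcaSketch, its helper methods and
    the tiny stop-word list are hypothetical, and it uses the standard JAXP DOM API rather than the actual
    keyword search module of the system.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class KeywordLcaSketch {
    // Illustrative stop-word list (an assumption); a real index would use a fuller one.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "an", "the", "of"));

    /** Parses an XML string into a DOM document; errors are rethrown unchecked. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    /** Builds a text index mapping each non-stop word to the DOM text nodes containing it. */
    static Map<String, List<Node>> buildIndex(Node root) {
        Map<String, List<Node>> index = new HashMap<>();
        addToIndex(root, index);
        return index;
    }

    private static void addToIndex(Node node, Map<String, List<Node>> index) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            for (String word : node.getNodeValue().toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!word.isEmpty() && !STOP_WORDS.contains(word)) {
                    index.computeIfAbsent(word, k -> new ArrayList<>()).add(node);
                }
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            addToIndex(child, index);
        }
    }

    /** Returns the path from the document root down to the given node. */
    static List<Node> pathFromRoot(Node node) {
        List<Node> path = new ArrayList<>();
        for (Node n = node; n != null; n = n.getParentNode()) {
            path.add(0, n);
        }
        return path;
    }

    /** Least common ancestor: the deepest node shared by both root-to-node paths. */
    static Node leastCommonAncestor(Node a, Node b) {
        List<Node> pathA = pathFromRoot(a);
        List<Node> pathB = pathFromRoot(b);
        Node lca = null;
        int limit = Math.min(pathA.size(), pathB.size());
        for (int i = 0; i < limit && pathA.get(i).isSameNode(pathB.get(i)); i++) {
            lca = pathA.get(i);
        }
        return lca;
    }
}
```

    For a CD collection, the least common ancestor of a text node matching an artist name and one matching a
    year within the same CD would be that CD element, which then serves as the root of the answer tree.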

    2.2.2 DataSpot

    DataSpot [DEGP98] introduces a new approach to database query and information retrieval by providing end
    users with the capability of exploring databases using free-form queries and navigation. This capability
    is based on a novel, schema-less representation of data, called a Hyperbase. The DataSpot representation
    and search technology is the foundation of the DataSpot system. DataSpot has since been renamed Mercado
    and is available as a commercial product.

    A DataSpot Hyperbase is a graph structure comprising nodes, edges and node labels. Nodes are related
    by directed edges of two types. A simple edge connects a parent node to a child node; the
    set of children of a node is ordered. An identification edge indicates that a reference node uniquely
    identifies the subject node. A DataSpot query is an associative search over a Hyperbase. The input to a
    query is a set of nodes called the query sources. The result of a query is a list of answers, where each answer
    is a connected part of the Hyperbase containing the query sources. The answers to the query are ranked. Users
    can view answer records in detail, navigate to related records, or submit continuation queries from
    the current record or from a set of records.


    Chapter 3

    Scalable Browsing of XML documents

    The primary motivation for developing a browsing system is to ease the end user's task. Naive users
    should be able to get the desired information from XML documents with just a few clicks rather than by
    writing complex queries. We have developed a system which provides such an interface for extracting the
    desired information from XML documents.

    The central idea is to provide a scalable browsing system to navigate through XML documents. Since
    an XML document consists of markup tags and not formatting tags, we need some mechanism to convert
    XML-encoded information into the true data model and to make it presentable. We use a foldertree
    structure to display XML documents. A foldertree is a simple hierarchical structure like the directory tree
    structure used in Windows. Consider the document order.xml shown in Figure 3.1. Our system provides
    users with a foldertree view of the XML document, as shown in Figure 3.2.

    3.1 Design Issues

    This section describes various design issues in our system: the incremental browsing approach, the mapping
    of an XML document to the foldertree, IDREF to ID links, and the serialized object used for communication
    between the client end and the server end of the system.

    3.1.1 Incremental browsing of XML documents

    The approach used for displaying XML documents brings nodes on demand (i.e. as requested by users) and
    displays the tree incrementally. To indicate to users that a particular node has more child nodes yet to be
    retrieved, a dummy node called More is displayed.

    This incremental approach makes the design scalable, since a user is not required to spend time waiting
    for all child nodes to arrive. The following sections explain our approach in detail.

    Approach I

    At the initial stages of work, we implemented the following non-incremental model. The model consists of
    a servlet running at the server end and an applet working at the client end, communicating with each other.
    The steps are as follows:

    - The servlet parses the whole XML document and gets an in-memory DOM tree.
    - The applet sends a request over HttpConnection for a particular XML document.
    - The servlet traverses the in-memory DOM tree in BFS (Breadth First Search) order, creates a newObject
      corresponding to each DOM Node and sends all objects one by one over HttpConnection. Refer to Section
      3.1.4 for the details of newObject.
    - The applet receives a newObject for every DOM Node and attaches the newObjects to
      their respective parents to form a tree.
    - The applet displays the tree using Java Swing.

    Figure 3.1: Original XML document

    This approach did not work well for huge XML documents because sending the whole DOM tree at once
    was not feasible. Later, we came up with an incremental model as described below.
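    As a rough illustration of this non-incremental model, the sketch below flattens a DOM tree in BFS order
    into records carrying an ID and a parent ID, and rebuilds the tree from those records as the applet would.
    The names BfsTransferSketch, NodeRecord and FolderNode are hypothetical stand-ins (NodeRecord is a reduced
    form of newObject), and the network transfer itself is left out.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class BfsTransferSketch {
    /** Reduced stand-in for newObject: just enough to re-attach nodes to their parents. */
    static class NodeRecord {
        final int myID;
        final int parentID;
        final String folderValue;

        NodeRecord(int myID, int parentID, String folderValue) {
            this.myID = myID;
            this.parentID = parentID;
            this.folderValue = folderValue;
        }
    }

    /** Client-side tree node (a real applet would use a Swing tree node instead). */
    static class FolderNode {
        final String label;
        final List<FolderNode> children = new ArrayList<>();

        FolderNode(String label) {
            this.label = label;
        }
    }

    /** Server side: walk the element tree in BFS order, emitting one record per element. */
    static List<NodeRecord> flatten(String xml) {
        Node root;
        try {
            root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        List<NodeRecord> records = new ArrayList<>();
        Queue<Node> nodes = new ArrayDeque<>();
        Queue<Integer> parentIds = new ArrayDeque<>();
        nodes.add(root);
        parentIds.add(0); // 0 means "no parent" in this sketch
        int nextId = 1;
        while (!nodes.isEmpty()) {
            Node node = nodes.remove();
            int parentId = parentIds.remove();
            int id = nextId++;
            records.add(new NodeRecord(id, parentId, node.getNodeName()));
            for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
                if (c.getNodeType() == Node.ELEMENT_NODE) {
                    nodes.add(c);
                    parentIds.add(id);
                }
            }
        }
        return records;
    }

    /** Client side: BFS order guarantees a parent is seen before any of its children. */
    static FolderNode rebuild(List<NodeRecord> records) {
        Map<Integer, FolderNode> byId = new HashMap<>();
        FolderNode root = null;
        for (NodeRecord r : records) {
            FolderNode node = new FolderNode(r.folderValue);
            byId.put(r.myID, node);
            FolderNode parent = byId.get(r.parentID);
            if (parent == null) {
                root = node;
            } else {
                parent.children.add(node);
            }
        }
        return root;
    }
}
```

    The sketch also shows why this model scales poorly: flatten produces one record for every node of the
    document before the client can display anything.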

    Approach II

    The servlet, running in the background, sends DOM Nodes on demand, as requested by users at the client end.
    This makes the servlet stateless: the applet saves the state of the request and sends the next request
    according to the user's navigation. The following steps are taken:

    - The servlet parses the whole XML document and gets an in-memory DOM tree.
    - Initially the applet requests the root node, identified by id=1, along with a few child nodes; their
      number is specified in the configuration file.
    - In general, the applet sends a request with the node id of a parent node and the range of child nodes,
      identified by the from and to parameters in the request. It also includes a docName parameter identifying
      the XML document. The object called newObject, described in Section 3.1.4, is used for communication
      between the servlet and the applet.
    - In response to the request from the applet, the servlet sends the root node if id is equal to 1, or else
      only the child nodes numbered from from to to of the requested XML document.
    - The servlet sends a special object with myID=0 as a demarcating object, to indicate the end
      of the response.
    - The applet updates the tree by appending the received child nodes to their respective parent nodes and
      displays the tree.
    - On the applet side, once the tree is displayed, the applet waits for a user request. A user can request
      the child nodes of a particular node by just clicking on that node; a request asking for the child nodes
      of that node is then sent to the servlet.
    - A dummy node named More... is used to indicate that the parent node has more child nodes
      yet to be retrieved. Users can click on the More... node to request those child nodes, or click on the
      parent node itself.

    Figure 3.2: Foldertree view of order.xml document displaying IDREF to ID links
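    The request and response exchange in these steps could look roughly like the sketch below. It is not the
    servlet code of the system: IncrementalBrowseSketch, ServerNode and NodeMsg are hypothetical names, the
    from and to parameters are assumed to be 1-based and inclusive, and the HTTP layer is replaced by plain
    method calls.

```java
import java.util.ArrayList;
import java.util.List;

class IncrementalBrowseSketch {
    /** Message sent from server to client (a reduced stand-in for newObject). */
    static class NodeMsg {
        final int myID;
        final int parentID;
        final String folderValue;
        final int numChildren;

        NodeMsg(int myID, int parentID, String folderValue, int numChildren) {
            this.myID = myID;
            this.parentID = parentID;
            this.folderValue = folderValue;
            this.numChildren = numChildren;
        }
    }

    /** Hypothetical in-memory stand-in for a DOM node held by the servlet. */
    static class ServerNode {
        final int id;
        final String name;
        final List<ServerNode> children = new ArrayList<>();

        ServerNode(int id, String name) {
            this.id = id;
            this.name = name;
        }

        ServerNode add(ServerNode child) {
            children.add(child);
            return this;
        }
    }

    /**
     * Stateless server step: returns children number from..to of the parent
     * (assumed 1-based, inclusive), followed by an end marker with myID=0.
     */
    static List<NodeMsg> respond(ServerNode parent, int from, int to) {
        List<NodeMsg> response = new ArrayList<>();
        int last = Math.min(to, parent.children.size());
        for (int i = Math.max(from, 1); i <= last; i++) {
            ServerNode child = parent.children.get(i - 1);
            response.add(new NodeMsg(child.id, parent.id, child.name, child.children.size()));
        }
        response.add(new NodeMsg(0, 0, "", 0)); // demarcating object: end of response
        return response;
    }

    /**
     * Client step: labels to show under a parent after reading one response.
     * A "More..." entry is added while some children have not yet been received.
     */
    static List<String> childLabels(List<NodeMsg> received, int numChildrenOfParent) {
        List<String> labels = new ArrayList<>();
        for (NodeMsg msg : received) {
            if (msg.myID == 0) {
                break; // end marker reached
            }
            labels.add(msg.folderValue);
        }
        if (labels.size() < numChildrenOfParent) {
            labels.add("More...");
        }
        return labels;
    }
}
```

    Because respond takes everything it needs from its arguments, the server keeps no per-client state; the
    client only has to remember how many children of each parent it has already received.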

    3.1.2 Mapping from an XML document to the Foldertree

    This section describes the steps taken to map an XML document to the foldertree while displaying it
    incrementally.

    A foldertree is chosen to display the document since it is suitable for displaying hierarchical, nested
    structures. The jaxp [JAX] DOM parser is used to parse XML documents. It gives us an in-memory tree
    corresponding to a document, where every node is a DOM Node. We only consider nodes of the types
    DOCUMENT_NODE, DOCUMENT_TYPE_NODE, ELEMENT_NODE, ATTRIBUTE_NODE and TEXT_NODE.
    The Document node is the root of the document, while the Document type node identifies the DTD
    associated with the document. The Document node corresponds to the root of the foldertree. Element nodes
    and Attribute nodes correspond to foldertree nodes (FTNs) and leaf nodes in a foldertree. Element, Text
    and Attribute nodes are transformed into the foldertree structure as follows:

    ELEMENT_NODE: The name of the Element node (i.e. the markup tag) in the XML document is assigned
    to the corresponding foldertree node (FTN) in the foldertree. Figure 3.1 and Figure 3.2 demonstrate
    the mapping from an XML document to the foldertree.

    TEXT_NODE: A Text node in the XML document corresponds to a Leaf node of the foldertree. Since the Text
    node carries the actual data, and an Element node is assumed to have only a single Text node as its
    child, while mapping the XML document to the foldertree we append the value
    of the Leaf node (corresponding to the Text node) to its parent foldertree node and
    remove the Leaf node from the foldertree, as it is redundant.

    ATTRIBUTE_NODE: An AttributeName=AttributeValue pair for every Attribute node in the XML document
    is appended to the FTN corresponding to the Element node that carries the attribute. For
    example, as shown in Figure 3.1 and Figure 3.2, the Element node OrderData has the Attribute node
    startDate=1/11/2001. We append startDate=1/11/2001 to the FTN corresponding to OrderData in the
    foldertree.
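    A minimal sketch of this mapping, assuming the standard JAXP DOM API, is given below. The class
    FolderLabelSketch, the method folderLabel and the separators used inside the label (a space before each
    attribute pair and " : " before the text value) are illustrative assumptions, not the format used by the
    system.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class FolderLabelSketch {
    /** Parses an XML string and returns its root element; errors are rethrown unchecked. */
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    /**
     * Builds the foldertree label of one element: the tag name, then a
     * name=value pair per attribute, then the value of a sole Text child.
     */
    static String folderLabel(Element element) {
        StringBuilder label = new StringBuilder(element.getTagName());

        // ATTRIBUTE_NODE: append AttributeName=AttributeValue to the element's label.
        NamedNodeMap attributes = element.getAttributes();
        for (int i = 0; i < attributes.getLength(); i++) {
            Node attribute = attributes.item(i);
            label.append(' ').append(attribute.getNodeName())
                 .append('=').append(attribute.getNodeValue());
        }

        // TEXT_NODE: fold a single text child into the parent's label instead of
        // showing it as a separate leaf.
        Node child = element.getFirstChild();
        if (child != null && child.getNodeType() == Node.TEXT_NODE
                && child.getNextSibling() == null) {
            String text = child.getNodeValue().trim();
            if (!text.isEmpty()) {
                label.append(" : ").append(text);
            }
        }
        return label.toString();
    }
}
```

    With this sketch, an OrderData element carrying startDate="1/11/2001" gets the label
    "OrderData startDate=1/11/2001", matching the example in the text.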

    An Attribute node can have different types such as CDATA, ID, IDREF, ENTITY etc. Currently our
    system supports only attributes of type ID and IDREF. IDREF to ID links are created by checking the
    attribute type. Suppose an Element node in the XML document has an attribute of type IDREF; we then
    identify the corresponding Element node having an attribute of type ID (i.e. the ID node) whose ID value
    matches the IDREF value. In the foldertree, the FTN corresponding to the Element node having the ID is
    attached as a child of the FTN corresponding to the Element node having the IDREF. In an XML document,
    the ID node and the IDREF node can lie far apart. Our system provides a way to browse from the IDREF node
    to the ID node.

    IDREF to ID links are thus created as part of the initialization of the system, and the in-memory DOM tree
    corresponding to the XML document is updated to include IDREF to ID links, as described in the following
    section.

    3.1.3 IDREF to ID links

    An XML document contains elements, and elements can have attributes. An attribute of type ID identifies an
    element uniquely in a document. The IDREF to ID relationship is considered analogous to the foreign
    key-primary key relationship. An attribute of type IDREF indicates that the element refers to another
    element having an attribute of type ID with the same value. Here, we assume that the value of an ID
    attribute is unique over the entire document.

    The DOM parser does not provide an API to identify attributes of a particular type, say ID or IDREF, while
    the SAX parser API does support such identification. Since using a SAX parser would add overhead, as one
    more SAX scan would be required in addition to the DOM parser scan, we use a DTD parser [DTD]. The DTD
    parser helps us identify IDREF elements and ID elements. We support identification of IDREF to ID links
    only if the document has a DTD associated with it. The in-memory DOM tree is updated to include IDREF to
    ID links as follows:

    - If the XML document has a DTD associated with it:
        - Parse the DTD.
        - For every element, check whether any attribute is of type ID or of type IDREF.
            - For an attribute of type IDREF, enter the ElementName-AttributeName pair into the IDREF hashtable.
            - For an attribute of type ID, enter the ElementName-AttributeName pair into the ID hashtable.
    - When finished with the DTD, parse the XML document to construct an in-memory DOM tree:
        - If an ElementName-AttributeName pair in the document matches an entry in the ID
          hashtable, enter the id value of the ID Node in the ID array.
        - If it matches an entry in the IDREF hashtable, enter the idref value of the IDREF Node in the IDREF
          array.
    - After the DOM tree is completely constructed, for every entry in the IDREF array do the following:
        - Get the ID Node with the matching value.
        - Clone the ID Node to get a duplicate-ID Node, since DOM does not allow the same node to appear at
          two places in a tree. If we do not clone the ID node, it gets removed from its original place in the
          XML document and appended to the child list of the IDREF node. Because we want to keep
          the ID node where it is in the original XML document and additionally append it to the
          child list of the IDREF node, we clone it.
        - Append the duplicate-ID Node to the child list of the IDREF Node.

    Figure 3.2 shows IDREF to ID links in order.xml document where Invoice refers to Customer and LineItem

    refers to Part. We append Customer to the child list of Invoice and Part to the child list of LineItem for

    matching IDREF and ID values in the document.
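    The linking step can be sketched with the standard DOM API as shown below. This is an assumption-laden
    illustration rather than the system's code: the class IdrefLinkSketch is hypothetical, the two maps stand
    in for the ID and IDREF hashtables that the DTD parser would fill, and the attribute names used in the
    usage note are invented.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class IdrefLinkSketch {
    /** Parses an XML string into a DOM document; errors are rethrown unchecked. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    /**
     * Appends a deep clone of each referenced ID element under the referring element.
     * idAttrs and idrefAttrs map an element name to its ID or IDREF attribute name;
     * in the thesis these pairs come from the DTD, here the caller supplies them.
     */
    static void linkIdrefs(Document doc, Map<String, String> idAttrs,
                           Map<String, String> idrefAttrs) {
        Map<String, Element> idNodes = new HashMap<>();
        List<Element> idrefNodes = new ArrayList<>();

        // First pass: collect ID nodes by value and remember all IDREF nodes.
        // Nothing is modified yet, so iterating the live NodeList is safe.
        NodeList all = doc.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Element element = (Element) all.item(i);
            String idAttr = idAttrs.get(element.getTagName());
            if (idAttr != null && element.hasAttribute(idAttr)) {
                idNodes.put(element.getAttribute(idAttr), element);
            }
            String refAttr = idrefAttrs.get(element.getTagName());
            if (refAttr != null && element.hasAttribute(refAttr)) {
                idrefNodes.add(element);
            }
        }

        // Second pass: clone the target so the original stays in place, because
        // appendChild would otherwise move the node out of its current position.
        for (Element ref : idrefNodes) {
            String value = ref.getAttribute(idrefAttrs.get(ref.getTagName()));
            Element target = idNodes.get(value);
            if (target != null) {
                ref.appendChild(target.cloneNode(true));
            }
        }
    }
}
```

    For instance, with a hypothetical Customer element whose custId attribute is an ID and an Invoice element
    whose custRef attribute is an IDREF, calling linkIdrefs leaves the original Customer where it was and adds
    a copy of it as a child of Invoice.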

    3.1.4 Sending serialized objects over HttpConnection

    The system consists of a client end (or browser end) and a server end. XML documents are stored in a data
    source at the server end, while the foldertree is displayed at the client end. This section describes the
    object used to send an XML document from the server end to the client end.

    Users at the client end can send a request for a particular XML document to be browsed, or they may
    request the child nodes of an FTN, numbered from, say, ten to twenty. This request is sent over an
    HttpConnection to the servlet running at the back-end. In response to this request, the servlet sends the
    requested nodes to the client end. The client end redisplays the tree by attaching the received nodes to
    their corresponding parents.

    The DOM parser gives us an in-memory DOM tree for the document. But since a DOM Node is not serializable,
    we cannot send it over HttpConnection. We have constructed our own object which stores the DOM Node
    information along with a few tags, so that it is easier to attach those nodes to their respective parent
    nodes at the client end. The class newObject describes the serializable object used for sending node
    information from the server end to the client end.


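    class newObject implements java.io.Serializable {
        String folderValue;        // value displayed in the foldertree
        int isLeaf;                // 1 = leaf node, 0 = non-leaf node
        int myID;                  // unique ID of the node
        int parentID;              // ID of the parent node
        int numChildren;           // number of child nodes of the node
        int numChildrenAtClient;   // child nodes already received by the applet
    }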

    The newObject carries the following parameters, required for reconstruction of the tree.

    - folderValue: a variable of type String representing the value displayed in the foldertree.
    - isLeaf: a variable of type integer that indicates whether the Node is a leaf node or a non-leaf node
      (1 = leaf, 0 = non-leaf). Nodes with isLeaf equal to 0 have to be stored temporarily on the client
      side, so that whenever we get their child nodes, we can append them to the parent node. Nodes with
      isLeaf equal to 1 need not be stored, since they are leaf nodes.
    - myID: a variable of type integer, used to assign a unique ID to the Node.
    - parentID: an integer used to store the ID of the parent of the Node. It helps in appending child
      nodes to the correct parent Node.
    - numChildren: an integer used to store the number of child nodes of a Node.
    - numChildrenAtClient: an integer indicating the number of child nodes received by the applet at the
      client end.
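    To illustrate how such an object travels between the two ends, the sketch below writes a node object with
    Java object serialization and reads it back. It is a minimal example under assumptions: the names
    SerializedNodeSketch and NodeObject are hypothetical, and byte array streams stand in for the output and
    input streams of the HTTP connection.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class SerializedNodeSketch {
    /** Same fields as newObject; Serializable so it can cross the connection. */
    static class NodeObject implements Serializable {
        private static final long serialVersionUID = 1L;
        String folderValue;
        int isLeaf;
        int myID;
        int parentID;
        int numChildren;
        int numChildrenAtClient;
    }

    /** Server side: serialize one node object (here into a byte array). */
    static byte[] write(NodeObject node) {
        try (ByteArrayOutputStream bytes = new ByteArrayOutputStream();
             ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(node);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Client side: read the node object back from the received bytes. */
    static NodeObject read(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (NodeObject) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("NodeObject class missing on the client", e);
        }
    }
}
```

    Both ends must have the same class available, which is consistent with the same object being used at the
    back end and at the client end.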

    3.2 Implementation Details

    We are working on a sample data source containing saved XML documents. The system is developed using
    Java [JDK]. It uses a Java Servlet [JS] at the back end, a Swing Applet [JDK] at the client end and standard
    interfaces such as a DTD parser [DTD], DOM [JAX] and XQL [XQL], as shown in Figure 3.3. Since XML is
    a document format and not a data format, we need to preprocess it. An XML parser is used to retrieve the
    actual data from XML documents by preprocessing them. XML parsers currently available include the jaxp
    parser (DOM - Document Object Model, SAX - Simple API for XML) [JAX], Xerces (Apache's parser in Java)
    and libxml in C. We use the jaxp parser. XML documents can have a DTD (Document Type Definition)
    associated with them. A few XML documents in the data source have a DTD describing the schema of the
    document. Our system does not need a DTD, but if one is available, it helps while browsing.

    3.2.1 Interaction between the Servlet and the Applet

    The servlet is set up at the URL corresponding to the entry servletRoot in the configuration file. All
    XML documents that we would like to browse must be placed in a directory called XMLDocs inside the
    public_html directory, since applets can read only files stored in public areas. Otherwise the applets
    would need to be signed, which is a somewhat complex procedure. The applet acts as the master, asking
    for a particular XML document, while the servlet runs in the background serving the applet's requests.
    The servlet uses the DOM API to parse the XML document and constructs an in-memory DOM tree out of it.
    The natural approach is for the servlet to send DOM Node objects over the HttpConnection using an
    output stream, and for the applet to read them from the HttpConnection using an input stream. The
    problem is that a DOM Node is not serializable, so it cannot be sent as a stream over the
    HttpConnection. We therefore use an object designed to store the DOM Node information in a serializable
    format. The object is described in Section 3.1.4. The same object is used at the back end as well as at
    the client end.

    Figure 3.3: Overview of the system (components: XML data source; Servlet with DTD Parser, XQL and DOM;
    Network; Select Interface and Browsing Interface (Foldertree Applet))
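The idea of carrying node information in a serializable object can be sketched as follows. This is only an illustration: the class name `NodeInfo` is hypothetical (the text calls the object newObject), only the fields named in this chapter are included, and byte arrays stand in for the HTTP streams.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical serializable carrier for the information of one DOM node.
class NodeInfo implements Serializable {
    private static final long serialVersionUID = 1L;
    String folderValue;       // value shown in the foldertree
    int isLeaf;               // 1 - leaf, 0 - non-leaf
    int myID;                 // unique id of the node
    int parentID;             // id of the parent node
    int numChildren;          // number of child nodes
    int numChildrenAtClient;  // children already received by the applet
}

class NodeInfoStreamSketch {
    // Servlet side: write the object to an output stream (here an in-memory one).
    static byte[] toBytes(NodeInfo info) {
        try (ByteArrayOutputStream bytes = new ByteArrayOutputStream();
             ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(info);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Applet side: read the object back from an input stream.
    static NodeInfo fromBytes(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (NodeInfo) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        NodeInfo info = new NodeInfo();
        info.folderValue = "cd";
        info.myID = 7;
        info.parentID = 1;
        NodeInfo copy = fromBytes(toBytes(info));
        System.out.println(copy.folderValue + " id=" + copy.myID + " parent=" + copy.parentID);
    }
}
```

Both ends must load the same class definition for deserialization to succeed, which matches the remark that the same object is used at the back end and at the client end.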

    All XML documents in a data source are parsed to get the corresponding in-memory DOM trees. Our system
    maintains a reference to the root node of each in-memory DOM tree so that requested nodes can be
    retrieved quickly. At the initialization of the system, we update each in-memory DOM tree by attaching
    text nodes, attaching attributes, and attaching ID nodes as children of IDREF nodes, as described in
    Sections 3.1.2 and 3.1.3.

    3.2.2 Browser setup

    We use a Swing Applet [JDK] to display an XML document in foldertree format. Netscape 4.7 does not
    support Java Swing. One option is to set up a JRE (Java Runtime Environment) with the path set for
    plugins, but this is somewhat complex. A simpler option is to place the swingall.jar file in the
    /usr/lib/netscape/java/classes path of the Unix environment, which makes the browser Swing-enabled.
    Netscape 4.7 and earlier versions do not provide support for stylesheets associated with XML documents.
    Our system embeds style information (font and colour) in the foldertree and displays it using Swing.
    Hence a better option is to use Netscape 6.1 or a higher version.

    3.2.3 Data structures

    The system is composed of the servlet at the back end and the applet at the client end. It is assumed
    that the client end has the same version of the JVM (Java Virtual Machine) installed as the servlet
    side. To understand the system in detail, we need to understand the control flow and the data
    structures used at both ends.

    Data structures at Servlet end

    /* initialized at the startup of system */
    static int numOfDocs;
    static String docNames[];

    /* hashtable of document references */
    static Hashtable hashNode_to_id;
    static Hashtable hashid_to_Node;

    /* per request */
    static String docName;
    static int reqid;
    static int from;
    static int to;

    At initialization, the system reads the available documents from the Configuration file. numOfDocs
    indicates the number of documents in the data source. An array stores the DOM reference to the root
    node of the in-memory tree for each document. hashNode_to_id and hashid_to_Node are Hashtables of
    Hashtables. hashNode_to_id is a hashtable with the document number (index in the docNames array) as key
    and a per-document hashtable as value; that inner hashtable stores the DOM Node as key and the unique
    value assigned to that node as value. hashid_to_Node is likewise a hashtable with the document number
    as key and a per-document hashtable as value. The doInit() function of the servlet does this
    initialization work.
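A minimal sketch of the two per-document mappings is given below, for a single document. It assumes that ids are assigned in document (pre-order) order starting from 0, which the text does not state. The class name `NodeIdMapsSketch` is hypothetical, and an `IdentityHashMap` is used in place of a Hashtable so that lookups do not depend on how the DOM implementation defines equality.

```java
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class NodeIdMapsSketch {
    final Map<Node, Integer> nodeToId = new IdentityHashMap<>();
    final Map<Integer, Node> idToNode = new HashMap<>();
    private int nextId = 0;

    // Walks the subtree in pre-order and records both directions of the mapping.
    void assignIds(Node node) {
        int id = nextId++;
        nodeToId.put(node, id);
        idToNode.put(id, node);
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            assignIds(child);
        }
    }

    // Helper that creates an empty DOM document for the example.
    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = newDocument();
        Element root = doc.createElement("catalog");
        doc.appendChild(root);
        Element cd = doc.createElement("cd");
        root.appendChild(cd);

        NodeIdMapsSketch maps = new NodeIdMapsSketch();
        maps.assignIds(root);
        System.out.println("id of cd = " + maps.nodeToId.get(cd));
    }
}
```

In the system described here, one such pair of maps would be kept per document, keyed by the document number.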

    For each request the servlet stores four variables: docName, reqid, from and to. docName names the
    XML document the applet is asking for; reqid, from and to indicate that the applet requires the child
    nodes numbered from from to to of the node whose id equals reqid. The servlet accepts these parameters
    in doGet() and sends the response over the HttpConnection accordingly.
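How a request for a range of children might be answered can be sketched as follows. The numbering convention is an assumption: the sketch treats from and to as 0-based and inclusive, and clamps them to the children that exist. `ChildRangeSketch` and its helper are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class ChildRangeSketch {
    // Returns the children of parent numbered from..to (0-based, inclusive; an assumption).
    static List<Node> childRange(Node parent, int from, int to) {
        NodeList children = parent.getChildNodes();
        List<Node> result = new ArrayList<>();
        int last = Math.min(to, children.getLength() - 1);
        for (int i = Math.max(from, 0); i <= last; i++) {
            result.add(children.item(i));
        }
        return result;
    }

    // Builds <parent><c0/><c1/>...</parent> with n children for the example.
    static Element parentWithChildren(int n) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element parent = doc.createElement("parent");
            doc.appendChild(parent);
            for (int i = 0; i < n; i++) {
                parent.appendChild(doc.createElement("c" + i));
            }
            return parent;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        for (Node child : childRange(parentWithChildren(5), 1, 3)) {
            System.out.println(child.getNodeName());
        }
    }
}
```

Fetching children in ranges keeps each response small, which fits the incremental, on-demand display used by the foldertree.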

    Data structures at Applet end

    /* hashtables */
    Hashtable hashid_to_Path;
    Hashtable hashPath_to_id;
    Hashtable hashid_to_DMTN;
    Hashtable hashid_to_Object;

    /* parameters sent in request to servlet */
    String docName;
    int reqid;
    int fromChild;
    int toChild;

    To browse through an XML document, the servlet creates a new applet. Every applet stores the four
    hashtables mentioned above for constructing the tree. hashid_to_Path is a hashtable with the myID of a
    newObject as key and the parentPath appended with the folderValue of that newObject as value. On the
    applet side, we need to identify the id of a node when the user clicks on it. To identify any node
    uniquely, we use the whole path to that node as the key; hence hashPath_to_id, the hashtable providing
    the map from path to id, is used. hashid_to_DMTN is used to get the actual tree node (a
    DefaultMutableTreeNode) to which the received child nodes are appended. Further, hashid_to_Object is
    used to retrieve the newObject corresponding to an id value, since fields such as isLeaf, myID and
    parentID are stored in the newObject.

    The four parameters docName, reqid, fromChild and toChild constitute the request sent from the applet
    to the servlet, as described above.
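The client-side bookkeeping can be sketched with three of the four tables. This is an illustration under stated assumptions: a negative parentId marks the root, "/" is used as the path separator, and a parent is always received before its children. `ClientTreeSketch` and `addNode` are hypothetical names; `DefaultMutableTreeNode` is the standard Swing class.

```java
import java.util.Hashtable;
import javax.swing.tree.DefaultMutableTreeNode;

class ClientTreeSketch {
    final Hashtable<Integer, String> idToPath = new Hashtable<>();
    final Hashtable<String, Integer> pathToId = new Hashtable<>();
    final Hashtable<Integer, DefaultMutableTreeNode> idToTreeNode = new Hashtable<>();

    // Registers a received node and appends it to its parent in the foldertree.
    DefaultMutableTreeNode addNode(int myId, int parentId, String folderValue) {
        DefaultMutableTreeNode treeNode = new DefaultMutableTreeNode(folderValue);
        String parentPath = parentId < 0 ? "" : idToPath.get(parentId);
        String path = parentPath + "/" + folderValue;  // parentPath appended with folderValue
        idToPath.put(myId, path);
        pathToId.put(path, myId);
        idToTreeNode.put(myId, treeNode);
        if (parentId >= 0) {
            idToTreeNode.get(parentId).add(treeNode);
        }
        return treeNode;
    }

    public static void main(String[] args) {
        ClientTreeSketch tree = new ClientTreeSketch();
        tree.addNode(0, -1, "catalog");
        tree.addNode(1, 0, "cd");
        // A click on the node at this path is translated back to the id sent to the servlet.
        System.out.println("clicked id = " + tree.pathToId.get("/catalog/cd"));
    }
}
```

In this simplified form a path identifies a node only if sibling values differ; how the system distinguishes siblings with equal values is not covered by the sketch.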


    plet for reading a particular file on local machine or we need to provide the whole file, as input to the applet.

    Thus, the approach becomes somewhat complex.

    Our system uses style information to display the foldertree nodes and leaf nodes of a foldertree in the
    style that is currently set as default. Users can change the default style using the Select Style
    option in the menu. On the servlet end, the style chosen by the user is saved in the Configuration
    file. Whenever the applet at the client end asks for a particular document, the servlet sends the
    current style settings to the applet over the stream. The approach can be extended further to display
    the foldertree according to the stylesheets associated with XML documents.

    3.4.2 Interactive Foldertree

    The system provides users with a mouse-over menu to play around with the foldertree to get customized

    views of the same XML document. Figure 3.4 shows how the system provides navigation in foldertree

    format. It also displays the menu provided to facilitate interaction with the foldertree.

    Find matching - Highlights the elements whose value is the same as that of the selected element.

    Expand subtree - Currently we provide only one-level expansion: the child nodes of the selected
    element are displayed. This feature can be extended to expand a node up to as many levels as the
    user asks for.

    Drop subtree - Drops the whole subtree below the selected element.

    Drop element - Drops the elements with the same name as the selected element.

    Figure 3.5: Query: Get articlesTuple from sigmod.xml containing Donald


    Figure 3.6: Query Result: Get articlesTuple from sigmod.xml containing Donald

    3.5 Select Interface

    Our system provides a Select interface to select a particular element from an XML document. A user who
    knows exactly what is wanted can specify pattern matching queries using the Select interface to get the
    desired result. We plan to extend it to support complex group-by and nested queries.

    We are currently using the XQL engine provided by GMD-IPSI [XQL], which offers XQL APIs to run basic
    queries on XML documents.

    3.5.1 Working of Select Interface

    XPATH expressions are needed to specify a query in XQL. Not every XML document has a DTD associated
    with it. Hence, we create the hierarchy of elements from the XML document itself, which helps us to
    query a particular element from the document. At the initialization step, this nesting of elements,
    along with their XPATH expressions, is stored in a file. Queries are in the form of a contains clause.
    Users can type values into the Select interface and get the elements containing the specified values as
    the result of the query. For example, while browsing sigmod.xml, which contains a collection of
    articles from sigmod, users might want to get the articles written by author Donald. Here, users can
    specify Donald in the field articlesTuple to get a detailed list of articles written by author Donald.
    The result of the query is displayed in tabular format using HTML. Indentation is used to portray the
    nesting of elements.

    At the initialization of the system, we create a .xpath file for every XML document in the data source.
    Our system provides an XPATH module for that purpose. These files are used to form a query when users
    specify values in the Select interface.
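Deriving the element hierarchy from a document that has no DTD can be sketched as below. The layout of the .xpath files is not given in the text, so the sketch only collects the distinct element paths in document order. `ElementPathsSketch` is a hypothetical name and the sample document is made up.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class ElementPathsSketch {
    // Collects the distinct element paths (for example /a/b/c) of a document.
    static Set<String> elementPaths(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            Set<String> paths = new LinkedHashSet<>();
            collect(root, "", paths);
            return paths;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    private static void collect(Element element, String parentPath, Set<String> paths) {
        String path = parentPath + "/" + element.getTagName();
        paths.add(path);  // a set, so repeated elements contribute one path
        for (Node child = element.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child instanceof Element) {
                collect((Element) child, path, paths);
            }
        }
    }

    public static void main(String[] args) {
        String xml = "<articles><articlesTuple><title>T1</title></articlesTuple>"
                + "<articlesTuple><title>T2</title><authors><author>X</author></authors>"
                + "</articlesTuple></articles>";
        for (String path : elementPaths(xml)) {
            System.out.println(path);
        }
    }
}
```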

    Figure 3.5 shows the Query interface for specifying values. sigmod.xml is a set of sigmod articles,
    each having fields like volume number, title, authors etc., as shown in Figure 3.5. The query looks
    like: articlesTuple contains Donald. The result is shown in Figure 3.6. The document contains two
    entries with author value equal to Donald. Both of these tuples are presented to the user as the result
    of the query.

    Figure 3.7: Bar graph displaying year-wise distribution of CDs for an XML document containing a
    collection of CDs (x-axis: Year, e.g. 1995, 1996, 1997, 2000; y-axis: Number of CDs)
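A contains query of this kind can be approximated with the standard `javax.xml.xpath` API. This is a stand-in, not the GMD-IPSI XQL engine used by the system, whose API is not shown here. The class name, the sample document and the author names in it are made up for illustration, and the element name is assumed to be a valid XML name.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class ContainsQuerySketch {
    // Returns the elements named elementName whose text content contains value.
    static List<Element> elementsContaining(String xml, String elementName, String value) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // The search value is passed as an XPath variable, so it needs no quoting.
            xpath.setXPathVariableResolver(name -> value);
            NodeList hits = (NodeList) xpath.evaluate(
                    "//" + elementName + "[contains(., $value)]", doc, XPathConstants.NODESET);
            List<Element> result = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                result.add((Element) hits.item(i));
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<articles>"
                + "<articlesTuple><title>T1</title><author>Donald A.</author></articlesTuple>"
                + "<articlesTuple><title>T2</title><author>Someone Else</author></articlesTuple>"
                + "<articlesTuple><title>T3</title><author>Donald B.</author></articlesTuple>"
                + "</articles>";
        for (Element hit : elementsContaining(xml, "articlesTuple", "Donald")) {
            System.out.println(hit.getElementsByTagName("title").item(0).getTextContent());
        }
    }
}
```

The match is case-sensitive and looks at all text below the element, so a value occurring in the title would match as well as one occurring in an author field.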

    3.5.2 Extensions

    The select interface currently provided can be enhanced as described below :

    The select interface provided is not yet integrated with the browsing system. We can use the same

    foldertree, as used for browsing an XML document, to display query results with query-result nodes

    in expanded state and the remaining portion of the document in collapsed state.

    Select interface supports pattern matching queries on element nodes from the document. We can

    extend it to include attribute nodes too.

    We have not yet taken scalability into consideration. For potentially large documents, query execution
    takes quite a long time, since the XQL engine traverses the whole document again to find matching
    elements. An incremental approach could run the query in the background and display partial results,
    if some query language provides that feature. We would need to replace the current querying engine
    with a better one.

    Currently we have focussed only on implementing the contains clause. We can provide group-by queries
    in which users select a group-by element and a result element. This is similar to the concept of
    group-by templates provided in BANKS. For example, for a document portraying a collection of CDs,
    users might like to see the list of CDs grouped by year. Here users input the year as the group-by
    element and the CD as the result element, and get the list of CDs grouped by year as a result.
    Further, this information can be displayed graphically, as shown in Figure 3.7. The rectangular bar
    representing a year can have a hyperlink to the actual records giving the details of the CDs in that
    particular year.
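The proposed group-by extension could be sketched as follows. This is not part of the implemented system: the element names CD and YEAR, the class name `GroupBySketch` and the sample data are hypothetical, and the group key is taken from the first group-by element found below each result element.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class GroupBySketch {
    // Groups result elements by the text of their first group-by descendant, sorted by key.
    static Map<String, List<Element>> groupBy(String xml, String resultElement, String groupByElement) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, List<Element>> groups = new TreeMap<>();
            NodeList results = doc.getElementsByTagName(resultElement);
            for (int i = 0; i < results.getLength(); i++) {
                Element result = (Element) results.item(i);
                NodeList keys = result.getElementsByTagName(groupByElement);
                if (keys.getLength() == 0) {
                    continue;  // result element without a group-by value is skipped
                }
                String key = keys.item(0).getTextContent().trim();
                groups.computeIfAbsent(key, k -> new ArrayList<>()).add(result);
            }
            return groups;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<CATALOG><CD><TITLE>A</TITLE><YEAR>1995</YEAR></CD>"
                + "<CD><TITLE>B</TITLE><YEAR>1996</YEAR></CD>"
                + "<CD><TITLE>C</TITLE><YEAR>1995</YEAR></CD></CATALOG>";
        // The group sizes are the bar heights of a year-wise bar graph.
        groupBy(xml, "CD", "YEAR").forEach((year, cds) -> System.out.println(year + ": " + cds.size()));
    }
}
```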


    Chapter 4

    Integrating Keyword Search with Browsing

    The primary motivation behind keyword search is to provide an interface that helps naive users extract
    information from an XML data source just by typing a few keywords.

    4.1 Keyword Search

    The keyword search routine constructs a graph from the XML documents, where nodes in the graph
    correspond to nodes from the documents. Parent-child edges and IDREF-ID edges form the edges of the
    graph. The keyword search algorithm runs on the preconstructed graph to get answer results. The
    algorithm can be described as follows :

    Construct the graph from the XML data source.

    Create an in-memory text index containing all words from all documents, except for stop words. Stop

    words are words such as a, an, the which occur very frequently.

    Take search terms as input from the user.

    Traverse the graph starting from search terms.

    Follow the backward edges to find an intersection node common to all search terms.

    The intersection node is the root node of the answer result with the leaf nodes representing search

    terms.

    If there is more than one answer result, then sort them according to relevance. The details of
    how the relevance score is calculated for a result tree are given in [Mes02].

    Return the list of answer results to the user. An answer result is not just the name of the XML
    document where the word lies but the relevant portion of that XML document.

    The construction of the graph, the construction of the in-memory text index, the search technique
    used, and the ranking of answer results according to relevance are described in detail by Megha
    Meshram in her dissertation [Mes02].
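The steps listed above can be illustrated with a much simplified sketch. It is not the algorithm of [Mes02]: candidate roots are ordered by the sum of backward distances to the search terms, which is only a stand-in for the relevance score, and nodes are plain integer ids. All names in the sketch are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class KeywordSearchSketch {
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "an", "the"));
    private final Map<Integer, List<Integer>> backwardEdges = new HashMap<>();
    private final Map<String, Set<Integer>> textIndex = new HashMap<>();

    // Adds the words of a node's text to the in-memory index, skipping stop words.
    void addNode(int id, String text) {
        for (String word : text.toLowerCase().split("\\W+")) {
            if (word.isEmpty() || STOP_WORDS.contains(word)) continue;
            textIndex.computeIfAbsent(word, k -> new HashSet<>()).add(id);
        }
    }

    // Records a parent-child or IDREF-ID edge; only the backward direction is stored.
    void addEdge(int from, int to) {
        backwardEdges.computeIfAbsent(to, k -> new ArrayList<>()).add(from);
    }

    // Breadth-first traversal along backward edges from every node containing the term.
    private Map<Integer, Integer> backwardDistances(String term) {
        Map<Integer, Integer> distance = new HashMap<>();
        Deque<Integer> queue = new ArrayDeque<>();
        for (int start : textIndex.getOrDefault(term.toLowerCase(), Collections.emptySet())) {
            distance.put(start, 0);
            queue.add(start);
        }
        while (!queue.isEmpty()) {
            int node = queue.remove();
            for (int previous : backwardEdges.getOrDefault(node, Collections.emptyList())) {
                if (!distance.containsKey(previous)) {
                    distance.put(previous, distance.get(node) + 1);
                    queue.add(previous);
                }
            }
        }
        return distance;
    }

    // Returns the nodes reachable backward from all terms, closest candidates first.
    List<Integer> search(String... terms) {
        Map<Integer, Integer> total = null;
        for (String term : terms) {
            Map<Integer, Integer> distance = backwardDistances(term);
            if (total == null) {
                total = distance;
            } else {
                total.keySet().retainAll(distance.keySet());
                for (Map.Entry<Integer, Integer> entry : total.entrySet()) {
                    entry.setValue(entry.getValue() + distance.get(entry.getKey()));
                }
            }
        }
        List<Integer> roots = new ArrayList<>();
        if (total == null) return roots;
        roots.addAll(total.keySet());
        final Map<Integer, Integer> score = total;
        roots.sort(Comparator.<Integer>comparingInt(score::get).thenComparingInt(id -> id));
        return roots;
    }

    public static void main(String[] args) {
        KeywordSearchSketch graph = new KeywordSearchSketch();
        graph.addNode(0, "citations");
        graph.addNode(1, "citation");
        graph.addNode(2, "dunkel");
        graph.addNode(3, "the rabbit model");
        graph.addEdge(0, 1);
        graph.addEdge(1, 2);
        graph.addEdge(1, 3);
        System.out.println(graph.search("dunkel", "rabbit", "model"));
    }
}
```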


    Figure 4.1: Figure shows answer result tree for query : dunkel rabbit model.

    4.2 Browsing search results

    The keyword search routine returns the list of answer results in decreasing order of relevance. Since
    a result has a hierarchical format, we have chosen the foldertree to display the answer result tree.
    The answer result is also displayed using the incremental, on-demand approach.

    The keyword search module accepts keywords from the user. The step by step execution of the algorithm
    given above leads to the generation of answer results. Answer results are ranked and sorted before
    being displayed to the user. Users can view an answer result in foldertree format, as shown in
    Figure 4.1.

    We use searchObject for sending search result node over HttpConnection. The searchObject carries param-

    eters required for reconstruction of answer result at the client end. The class searchObject used is described

    below :

    searchObject {
        String folderValue;
        boolean isLeaf;
        int no_of_children;
        int got_children;
        boolean hasKeyword;
        String docName;
    }

    Every searchObject carries with it the following parameters :


    folderValue - is a String, representing the actual value of the FTN.

    isLeaf - is a boolean, which indicates whether the node is a leaf node (true) or a non-leaf node
    (false).

    no_of_children - is an integer, which indicates the number of children of the search node.

    got_children - is an integer, which indicates the number of child nodes currently brought to the
    client end. It helps while displaying search results.

    hasKeyword - is a boolean, which indicates whether the node contains at least one search term. An
    answer result may contain nodes that do not contain any search term; this flag is used to display
    nodes containing search terms in a different colour.

    docName - is a String storing the name of the document the search term belongs to. This information
    is used to browse to the respective XML document from a particular search node in the answer result.

    Search results are provided with a mouse-over menu from which users can get the name of the XML
    document the search term belongs to. Further, users can browse through that document by clicking on
    the menu.

    Figure 4.2: Figure shows answer result tree for query : hepatitis antibodies australia.

    Consider an example where the user wants to know whether there is any medical citation written by
    author dunkel containing rabbit model in its title. Here, the search query will look like: dunkel
    rabbit model. Figure 4.1 shows the answer result for this query. Consider another example, the search
    query hepatitis antibodies australia, which finds medical citations in Australia on hepatitis
    antibodies. Figure 4.2 shows the answer result for this query.


    4.3 Browsing Extensions

    The current approach used for browsing search results is a naive one. A few features can be added to
    the system to display search results in an interesting manner.

    The current approach provides a hyperlink from a search node to the root of the XML document to which
    the search node belongs. Users would like to go to the corresponding node in the XML document rather
    than to the root of the document. This can be done by finding the node in the XML document that
    corresponds to the search node and displaying the whole subtree below it.

    Currently, the browsing interface provides only the option of getting the child nodes of a particular
    node. We can extend this to provide an option for getting the parent node of a particular node. This
    feature may lead to interesting browsing patterns while browsing keyword search results.


    Chapter 5

    Conclusions and Future work

    We have developed a complete system which facilitates browsing and keyword search in XML documents.
    The system works with any XML data source at the back end. It will help naive users get information
    from XML data sources containing, say, sigmod papers or a collection of articles. XML will help
    information providers portray different views of the same information according to users' interests.
    Our system provides a foldertree for navigating through XML documents, starting from the root and
    going down to the leaf nodes. To get different views of the same XML document, it provides menus. The
    styling feature provided by our system helps users view an XML document in customized styles. The
    system also provides a select interface to extract desired information using pattern matching queries.
    The system can be enhanced in many ways.

    Some of the short term extensions include :

    Creating an external index - Currently we are using in-memory DOM which is kept in memory all

    the time. This may become an overhead for potentially huge XML documents. A better option would

    be to use persistent DOM [XQL] where we can create disk based index, which will help in fetching

    a node given unique id, without the cost of parsing XML document or building an in-memory DOM

    tree.

    Facilitating complete support for querying of XML documents - Currently we provide an interface
    which supports only simple pattern matching queries using XQL. It can be extended to support
    complex group-by and nested queries.

    Using HTML to display XML documents using hyperlinks - Instead of using the foldertree, we can

    generate stylesheets on-the-fly. XML document can be displayed using style sheets. Hyperlinks may

    be used for inter-document references.

    Exploiting features of XML document - Currently our system supports only element nodes along

    with attribute nodes and text nodes. Most of the times, extracting information from these nodes is

    sufficient. Further, system can be extended to include use of entities, entity references, processing

    instructions and CDATA sections.

    Displaying keyword search results - Every search node in an answer result tree contains a parameter
    called docName, which is used to browse through the actual XML document to which the search term
    belongs. This feature can be extended to display only a partial view of the XML document, containing
    the nodes relevant to the search term, instead of displaying the whole XML document.

    Extending Keyword Search - Including metadata search will be quite interesting for XML, since an XML
    document identifies data using tags. For example, users can type in CATALOG.AUTHOR:Adams to search
    for a particular author named Adams in catalog.xml. More precisely, we can restrict the search domain
    to only those paths specified by the user. This feature will help users to get more accurate results,
    since XML identifies the semantics of the data contained in the document.
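The path-restricted search term suggested in the last item could be handled along the following lines. This is a sketch of a proposed extension, so every detail is an assumption: the PATH:keyword syntax is split at the first colon, the path is taken as the exact chain of element names from the root, and names and keywords are compared case-insensitively. `QualifiedTermSketch` is a hypothetical name.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class QualifiedTermSketch {
    final List<String> path;   // element names, e.g. [CATALOG, AUTHOR]; empty if unrestricted
    final String keyword;      // the word to search for

    QualifiedTermSketch(List<String> path, String keyword) {
        this.path = path;
        this.keyword = keyword;
    }

    // Splits a term such as CATALOG.AUTHOR:Adams into its path and keyword parts.
    static QualifiedTermSketch parse(String term) {
        int colon = term.indexOf(':');
        if (colon < 0) {
            return new QualifiedTermSketch(Collections.<String>emptyList(), term);
        }
        String[] names = term.substring(0, colon).split("\\.");
        return new QualifiedTermSketch(Arrays.asList(names), term.substring(colon + 1));
    }

    // True if the node lies on the requested path and its text contains the keyword.
    boolean matches(List<String> nodePath, String text) {
        if (!path.isEmpty()) {
            if (nodePath.size() != path.size()) return false;
            for (int i = 0; i < path.size(); i++) {
                if (!path.get(i).equalsIgnoreCase(nodePath.get(i))) return false;
            }
        }
        return text.toLowerCase().contains(keyword.toLowerCase());
    }

    public static void main(String[] args) {
        QualifiedTermSketch term = parse("CATALOG.AUTHOR:Adams");
        System.out.println(term.path + " / " + term.keyword);
        System.out.println(term.matches(Arrays.asList("catalog", "author"), "Douglas Adams"));
    }
}
```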

    Longer term future work would include the following :

    Combined system for structured and unstructured data - In e-commerce or marketplace systems,
    where scalability and performance play critical roles, it is desirable to have a system providing
    support for structured as well as unstructured querying. Using an RDBMS leads to degraded
    performance, while the unstructured paradigm does not provide support for some structured components.
    Consider, for example, the query: Get documents containing XML, in English language, of type pdf.
    Here XML is an unstructured component, while the other two are structured components.

    Figure 5.1: XML document portrayed as a graph (nodes: OrderData, Invoice, Part P1, Part P2, Part P3,
    Customer C1, Customer C2, LineItem L1, LineItem L2)

    Navigation through XML using a graph model - Our system uses a tree model to represent XML

    document. XML document can be mapped to the graph model rather than a tree model, which will

    make browsing more interesting.

    The graph of an XML document can be constructed such that the element nodes from the XML document
    represent the nodes of the graph, while parent-child edges and IDREF-ID edges represent the edges of
    the graph. To portray the graph structure, we can do the following :

    For IDREF to ID references in XML document, rather than replicating the ID nodes as a child

    of IDREF nodes, we can have only one instance of ID node, to which every IDREF node can

    refer. Refer Figure 5.1 which portrays XML document as a graph, where dotted lines represent

    IDREF to ID edges, while solid lines represent parent-child edges. The original XML document

    is displayed in Figure 3.1.

    We can merge identical text values or entire subelements from XML document.

    The graph structure will help compress the document by removing the replication present in the tree
    format. Most importantly, it will be of immense use for keyword search. The current keyword search
    module explicitly searches for the ID node once it receives an IDREF node, whereas this approach
    provides a built-in graph with all IDREF-ID links embedded in it. We can use the package called
    Grappa [LR], a graph package written in Java provided by AT&T Research Labs.


    Implications of graph model on browsing as compared to the Foldertree - Foldertree facilitates in-

    order browsing through XML document, starting from root of the document, up to the leaf level. We

    can map XML document to a connected graph as mentioned above. To browse through a particular

    document, we can provide users with customized views, according to the starting node specified. For

    example, if users specify OrderData as a root node, they can view the whole document as shown in

    the Figure 5.1. If Invoice is specified as a starting node, users can just view the partial document that

    is reachable from the Invoice, directly or indirectly.
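The customized view for a chosen starting node amounts to a reachability computation over parent-child and IDREF-ID edges. The sketch below illustrates this with a breadth-first traversal. The edges used in the example are made up for illustration and are not read off Figure 5.1; `GraphViewSketch` is a hypothetical name.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class GraphViewSketch {
    private final Map<String, List<String>> edges = new HashMap<>();

    // Adds a directed edge; parent-child and IDREF-ID edges are treated alike.
    void addEdge(String from, String to) {
        edges.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
    }

    // Returns the nodes visible from start, directly or indirectly, in visiting order.
    Set<String> reachableFrom(String start) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        seen.add(start);
        queue.add(start);
        while (!queue.isEmpty()) {
            String node = queue.remove();
            for (String next : edges.getOrDefault(node, Collections.emptyList())) {
                if (seen.add(next)) {
                    queue.add(next);
                }
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        GraphViewSketch graph = new GraphViewSketch();
        // Hypothetical edges for illustration only.
        graph.addEdge("OrderData", "Invoice");
        graph.addEdge("OrderData", "Part P1");
        graph.addEdge("OrderData", "Customer C1");
        graph.addEdge("Invoice", "LineItem L1");
        graph.addEdge("LineItem L1", "Part P1");   // IDREF-ID edge
        graph.addEdge("Invoice", "Customer C1");   // IDREF-ID edge
        System.out.println(graph.reachableFrom("Invoice"));
    }
}
```

Because visited nodes are remembered, a node referred to by several IDREFs appears once in the view, and cycles introduced by IDREF-ID edges do not cause repeated visits.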


    Bibliography

    [BGL 99] Chaitanya Baru, Amarnath Gupta, Bertram Ludascher, Richard Marciano, Yannis Papakonstantinou,
    and Pavel Velikhov. XML-Based Information Mediation with MIX. In ACM SIGMOD 1999, exhibition
    program, University of California, San Diego, La Jolla, CA 92093, 1999.

    [BHN 01] Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe, Soumen Chakrabarti, and S. Sudarshan.
    Keyword searching and browsing in databases using BANKS. In Proc. of ICDE, 2001.

    [DEGP98] Shaul Dar, Gadi Entin, Shai Geva, and Eran Palmon. DataSpot: Database Exploration Using
    Plain Language. In Proc. of the 24th VLDB Conference, Data Technologies Ltd., 1998.

    [DTD] Java DTD Parser. Available online at http://www.wutka.com/dtdparserdownload.html.

    [HBN 01] Arvind Hulgeri, Gaurav Bhalotia, Charuta Nakhe, Soumen Chakrabarti, and S. Sudarshan.
    Keyword searching and browsing in databases using BANKS. In IEEE Data Engineering Bulletin,
    September 2001.

    [JAX] JAXP API for XML parsing 1.1.1. Available online at
    http://java.sun.com/xml/jaxp/dist/1.1/docs/api/overview-summary.html.

    [JDK] Java API 1.2.2. Available online at http://java.sun.com/products/jdk/1.2/docs/api/index.html.

    [JS] Java Servlet API. Available online at http://java.sun.com/products/servlet/2.2/javadoc/index.html.

    [LR] AT&T Labs-Research. Grappa - A Java Graph Package. Available online at
    http://www.research.att.com/sw/tools/graphviz/packages/grappa.html.

    [Mes02] Megha Meshram. Keyword Searching in XML Documents. Master's thesis, Computer Science
    and Engineering Department, IIT Bombay, 2002.

    [MLP99] Kevin D. Munroe, Bertram Ludascher, and Yannis Papakonstantinou. Blended Browsing and
    Querying of XML in a Lazy Mediator System. In VDB 2000, University of California, San
    Diego, La Jolla, CA 92093, 1999.

    [XQL] GMD-IPSI XQL Engine. Available online at http://xml.darmstadt.gmd.de/xql/xql-examples.html.
