copy-right@sanjaymadria, umr web data management sanjay kumar madria department of computer science...

copy-right@sanjaymadria, UMR

Web Data Management

Sanjay Kumar Madria

Department of Computer Science

University of Missouri-Rolla

[email protected]


WWW• Huge, widely distributed,

heterogeneous collection of semi-structured multimedia documents in the form of web pages connected via hyperlinks.


World Wide Web

• Web is fast growing

• More business organizations putting information in the Web

• Business on the highway

• Myriad of raw data to be processed for information


As WWW grows, more chaotic it becomes

• Web is fast growing, distributed, non-administered global information resource

• WWW allows access to text, image, video, sound and graphic data

• More business organizations creating web servers

• More chaotic environment to locate information of interest

• Lost in hyperspace syndrome


Characteristics of WWW

• WWW is a set of directed graphs

• Data in the WWW has a heterogeneous nature, self-describing and schema less

• Unstructured information , deeply nested

• No central authority to manage information

• Dynamic verses static information

• Web information discoveries - search engines


Web is Growing!

• In 1994, WWW grew by 1758 % !!

• June 1993 - 130

• June 1994 - 1265

• Dec. 1994 - 11,576

• April 1995 - 15,768

• July 1995 - 23,000+

• 2000 - !!!!!


‘COM’ domains are increasing!

• As of July 1995, 6.64 million host computers on the Internet:– 1.74 million are ‘com’ domains

– 1.41 million are ‘edu’ domains

– 0.30 million are ‘net’

– 0.27 million are ‘gov’

– 0.22 million are ‘mil’

– 0.20 million are ‘org’


Top web countries

1. Canada (1) 80% 9. New Zealand(7)101

2. US (4) 140% 10. Sweden (9) 101%

3. Ireland (3) 110% 11. Israel (12) 112%

4. Iceland (2) 68% 12. Cyprus (8) 72%

5. UK (14) 336 % 13. Hong Kong (15)148%

6. Malta (5) 155% 14. Norway (10) 64%

7. Australia (6) 133% 15. Switzerland (13) 75%

8. Singapore (11) 207% 16. Denmark (16) 105%


How users find web sites• Indexes and search engines 75

• UseNet newsgroups 44

• Cool lists 27

• New lists 24

• Listservers 23

• Print ads 21

• Word-of-mouth and e-mail 17

• Linked web advertisement 4


Limitations of Search Engines

• Do not exploit hyperlinks

• Search is limited to string matching

• Queries are evaluated on archived data rather than up-to-date data; no indexing on current data

• Low accuracy

• Replicated results

• No further manipulation possible


Limitations of Search Engines

• ERROR 404!

• No efficient document management

• Query results cannot be further manipulated

• No efficient means for knowledge discovery


More PROBLEMS• specifying/understanding what information

is wanted

• the high degree of variability of accessible information

• the variability in conceptual vocabulary or “ontology” used to describe information

• complexity of querying unstructured data


• complexity of querying structured data

• uncontrolled nature of web-based information content

• determining which information sources to search/query


Search Engine Capabilities– Selection of language

– Keywords with disjunction, adjacency, presence, absence, ...

– Word stemming (Hotbot)

– Similarity search (Excite)

– Natural language (LycosPro)

– Restrict by modification date (Hotbot) or range of dates (AltaVista)

– Restrict result types (e.g., must include images) (Hotbot)

– Restrict by geographical source (content or domain) (Hotbot)

– Restrict within various structured regions of a document (titles or URLs) (LycosPro); (summary, first heading, title, URL) (Opentext)


SEARCH & RETRIEVALSearch Engines

Search engine % web coveredHotbot 34AltaVista 28Northern Light 20Excite 14Infoseek 10Lycos 3

using several search engines is better than using only one Source: Lawrence, S., and Giles, C.L., “Searching the World Wide Web,” Science 280, pp. 98-100,

1998.


Key Objectives• Design a suitable data model to represent web

information• Development of web algebra and query

language, query optimization• Maintenance of Web data - view maintenance• Development of knowledge discovery and

web mining tools• Web warehouse • Data integration , secondary storages, indexes


Web Data Representation• HTML - Hypertext Markup Language

– fixed grammar, no regular expressions– Simple representation of data– good for simple data– difficult to extract information

• SGML - Standard Generalized Markup

Language - good for publishing deeply structured document

• XML - Extended Markup Language -a subset of SGML


Terminology• HTML - Hypertext Mark-up Language

• HTTP - Hypertext Transmission Protocol

• URL - Uniform Resource Locator

• example - <URL>:=<protocol>://<Host>/<path>/filename>[<#location>] where– <protocol> is http, ftp, gopher

– host is internet address …– #location is a textual label in the file.


• Links are specified as<A HREF=“Destination URL”>Anhor Text</A>• “destination URL is the URL of the destination document and

Anchor Text is the text that appears as an anchor when displayed.• Example: • <A HREF=http://www.ntu.edu.sg/ >Nanyang Technological

University</A>• Absolute and relative • URL <A HREF="AtlanticStates/NYStats.html">New

York</A> is relative • <A

HREF="http://www.ncsa.uiuc.edu/General/Internet/ WWW/HTMLPrimer.html"> NCSA's Beginner's Guide to HTML</A> absolute address


World Wide Web• Prevalent, persistent and informative

• HTML documents (soon, XML) created by humans or applications.

Can database technology help?

• Persistent HTML documents!!!

• Accessed day in and day out by humans and applications.


Current Research Projects• Web Query System

– W3QS, WebSQL, AKIRA, NetQL, RAW,

WebLog, Araneus

• Semistructured Data Management– LOREL, UnQL, WebOQL, Florid

• Website Management System– STRUDEL, Araneus

• Web Warehouse– WHOWEDA


Main Tasks

• Modeling and Querying the Web– view web as directed graph– content and link based queries– example - find the page that contain the word

“clinton” which has a link from a page containing word “monica”.


• Information Extraction and integration– wrapper - program to extract a structured

representation of the data; a set of tuples from HTML pages.

– Mediator - integration of data-softwares that access multiple source from a uniform interface

• Web Site Construction and Restructuring– creating sites– modeling the structure of web sites– restructuring data


MEDIATOR ARCHITECTUREUser Interface

Mediator(Query/Search/

Retrieval/Result)

Wrapper Wrapper

. . .


What to Model

• Structure of Web sites

• Internal structure of web pages

• Contents of web sites in finer granularities


Data Representation of Web Data

• Graph Data Models

• Semistructured Data Models (also graph based)


Graph Data Model

• Labeled graph data model where node represents web pages and arcs represent links between pages.

• Labels on arcs can be viewed as attribute names.

• Regular path expression queries


Semistructured Data Models

• Irregular data structure, no fixed schema known and may be implicit in the data

• Schema may be large and may change frequently

• Schema is descriptive rather than perspective; describes the current state of data, but violations of schema is still tolerated


• Data is not strongly typed; for different objects the values of the same attributes may be of differing types. (heterogenious sources)

• No restriction on the set of arcs that emanate from a given node in a graph or on the types of the values of attributes

• Ability to query the schemas; acr variables which get bound to labels on arcs, rather than nodes in the graph


Graph based Query Languages

• Use graph to model databases

• Support regular path expressions and graph construction in queries.

• Examples

Graph Log for hypertext queries

graph query language for OO


Query Languages for Semi-Structured data

• Use labeled graphs

• Query the schema of data

• Ability to accommodate irregularities in the data, such as missing links etc.

• Examples : Lorel (Stanford) , UnQL (AT&T), STRUQL (AT&T)


Comparison of Query SystemsSystem Data model Lang. style Path exp. Graph

websql Relational SQL yes No

W3QS LMG SQL Yes NO

WebLOG Relational Datalog No No

Lorel LG OQL Yes No

weboql hypertrees OQL Yes Yes

UnQL LG Recursion Yes Yes

Florid F-logic Datalog Yes NoStrudel LG Datalog Yes YesAraneus page schemes SQL Yes YesWhoweda relational SQL Yes Yes


Types of Query Languages

• First Generation

• Second generation


First Generation Query Languages

• Combine the content-based queries of search engines with structure-based queries

• Combine conditions on text pattern in documents with graph pattern describing link structures

• Examples - W3QL (TECHNION, Israel)

WebSQL (Toronto), WebLOG (Concordia)


Second generation languages• Called web data manipulation languages

• Web pages as atomic objects with properties that they contain or do not contain certain text patterns and they point to other objects

• Useful for data wrapping, transformation, and restructuring

• Useful for web site transformation and restructuring


How they Differ?• Provide access to the structure of web

objects they manipulate - return structure

• Model internal structures of web documents as well as the external links that connect them

• Support references to model hyperlinks and some support to ordered collections of records for more natural data representation

• Ability to create new complex structures as a result of a query


Examples

• Web OQL

• STRUQL

• Florid


W3QS (WWW Query System) at Technion - Israel

• Content queries

• Structural Queries

• Interfacing with user written programs and standard UNIX utilities

• Uses existing WWW indexes and search Services

• Provides view update facility


W3QS• Accessible via any WWW browsers

• API can be used by programs running anywhere in the Internet

• Support queries on the web structure by specifying starting page, a search domain and depth of links.

• File content analysis tools and filling up of forms automatically


File Types• Strict Inner Structure files such as Unix

environment files - Semantics of the data is clearly linked to the syntax

• Semi-structured files - text files containing formatting codes such as Latex or HTML files- possible to use formatting codes to analyze their semantic content

• Raw Files - no relation between meaning of file and its inner structure


Content Queries• Queries based on the content of a single

node of hypertext

• SQLCOND is used to evaluate boolean expressions

• Example - node-format = Latex and Node.author =“Sanjay”


Structure Queries• Information conveyed in the hypertext

organization itself is conveyed.

• The result is a set of nodes and links from the hypertext structure that satisfy a given graph pattern; graph with nodes and edges are annotated with conditions.

• Components are pattern definition, search engines and form completion


Node1.title =“Good article”

Node2.author= “Sanjay”

Link1.

rev=“doc”

Structure Query

Answer

URL http://../myarticles.html URL http:///.tex

<Title> Good articles</Title> \author {sanjay}

A HREF=//..rev=“doc”>


Search for an article

• Select cp n2/* result

from n1, l2, n2

where n1 in importantindexs.url

Fill n1.form as IN importantindexes.fil with Keyword = “sanjay” SQLCOND (n2.format =Laytex) and (n2.author=“sanjay”)


Query to search hypertext pattern

• Return all the articles cited in the first chapter of the book. Each chapter includes several pointers to the bibliography, for example

• <A HREF=“http//cs…/refrences.html#ref2”>

[Relativity]</> means link [Relativity] leads to the label ref2 in the references.html file.

In the references.html file the labeled link looks like <A HREF=“./relative.tex”name=“ref2”>

[relativity, sanjay]</A> this link points to relative.tex


• Select cp art/* result from Ind, l1,chap,l2,ref,l3 art where SQLCOND (ind.url = “http://) And (chap.url =/.chapter-1.html/) AND l2.HREF = /.\#13.Name/)

USING BFS.


Url http://cs.tech/bookindex.html

INDEX

Chapter 1

Chapter 2

…

References

Url

http://…/Chapter-1.html

ref 1

ref 2

ref 3

l1

ref 1

ref 2

ref 3

article

http://relative.texl3


WebSQL-University of Toronto

• Model web as relational database

• Use two relations Document and Anchor

• Document relation has one tuple for each document in the web and the anchor relation has one tuple for each anchor in each document


WebSQL• SQL-like query language for extracting

information from the web.

• Capable of systematic processing of either all the links in a page, all the pages that can be reached from a given URL through paths that match a pattern, or a combination of both.

• Provides transparent access to index servers


Document

Url title text Length Type Modif

http:// Title1 Text 1 1234 text 1-1-96

http:// Title 2 Text 2 2345 text 2-3-97


Anchor

Base label href

http:// Label 1 http://

http:// Label 1Label 2

http://http://


• Give documents’s URLs which contain same title and keyword(s)

• Select d1.url, d2.url from

document d1 such that d1 MENTIONS “keyword1” and document d2 such that d2 MENTIONS “keyword1”

where d1.title = d2.title

and NOT (d1.url = d2.url)


Find Labels of all Hyperlinks to Postscript Files

SELECT a.labelFROM Anchor a SUCH THAT base = "http://www.SomeDoc.html"

WHERE a.href CONTAINS ".ps.Z";


Documents about Databases SELECT Document d.url, d.titleFROM d SUCH THAT "http://www.OtherDoc.html" ->|=> dWHERE d.title CONTAINS "databases";Note : -> path of length one within same server=> path of length of one but different server


Retrieve all the documents in the same server that are pointed to

from the documentWhose URL is given

• Select d.url, d.title from

Document d SUCH THAT

“http//www. Cs.in -> d


Find all broken links in a page

• SELECT a.hrefFROM Anchor a SUCH THAT base = "http://the.document.to.test"WHERE protocol(a.href) = "http" AND doc(a.href) = null;


Web OQL (University of Toronto)

• Provides a framework that supports a large class of data

• Restructuring operations.

• Simple semistructured data model for documents and record-based data

• OQL-like syntax and regular expressions

• Serves as a two-way bridge between databases and the Web.


DATA MODEL

• Hypertrees are Ordered arc labeled trees with two types of arcs ; internal and external

• Internal arcs represent structured objects

• External arcs to represent refrences (huperlinks) among objects.

• Records as labels in the arcs

• Sets of related hypertrees as Web


ARCHITECTURE

• Wrappers map all data sources to trees

• The mapping can be done all at once or on demand


Example

• Extract from cspapers (paper database) title and URL of the full version of papers of “Smith”

• select [y.title,y’.URL]

from x in cs papers, y in x’

where y.authors ~”smith”


Web Creation• Create a new page for each research Group

(using the group name as URL). Each page contains the publications of the corresponding group.

• Select x’ as x.group from x in cspapers

• Select q1 as s1, q2 as s2, ...qm as sm

• where q’s are queries and each S’s is either a string query or keyword schema. “as” clause create a URL’s s1 , ..sm assigned to each new page resulting from each query.


ARANEUS• Data Model called ADM for Web

Documents - nested web objects, page schemas

• Several languages for wrapping, querying, creating and updating web sites - object algebra

• Methods and Techniques for Web Site Design and Implementation

• Presentation in SIGMOD’99

• Software is available at their home site


• Wrappers - map logical access to attribute values in a page at the ADM level tp physical access to text in the HTML source using EDITOR

• ULIXES - SQL-like query languages

• PENELOPE - manipulation language

• Site integration, semantic heterogeneities

• Materialized views

• http://poincare.dia.uniroma3.it:8080/Araneous


Lore - motivation• The data may be irregular and thus not conform

to a rigid schema.

• Relational data model has null values, and OO models have inheritance and complex objects. Both have difficulties in designing schemas to incorporate irregular data.

• It may be difficult to decide in advance on a single, correct schema, The structure of the data may evolve rapidly, data elements may change types, or data not conforming to previous structure may be added.


• Thus, there is a need for management of

semi-structured data!• Lore system manages semi-structured data. The

data managed by Lore is not confined to a schema and it may be irregular or incomplete.

• OEM is the Lore’s data model. OEM - object Exchange Model - graph based self-describing object instance model where nodes are objects and edges are labeled with attribute names and leaf nodes have atomic values

• Lore is light weight object repository and Lorel is Lore’s query language.


Object Exchange Model - OEM• Motivation - information exchange and extraction

• Why a new data model? … it not a new model.

•Each value exchanged is given an explicit label.

Object temp-in-Fahrenheit, integer, 80 - “temp-in-Fahrenheit” is the label. Each object is self-describing, with a label, type and value.

set-of-temps, set, {cmpnt1, cmpnt2} cmpnt1 is temp-in-Fahrenheit, integer, 80 cmpnt2 is temp-in-Celsius, integer, 20


Labels• Plays two roles

– identifying an object (component)– identifying the meaning of an object (component)

• Person-name both identifies cmpnt1 and coveys its meaning.

person-record, set, {cmpnt1, cmpnt2, cmpnt3} cmpnt1 is person-name, string, ``Fred’’ cmpnt2 is office-num-in-bldg-5, integer, 333 cmpnt3 is department, string, ``toy’’

• In relational data this corresponds to ….


Labels - Issues• What does the label mean?

– Database of labels – Ontology of labels - within each source

•Labels are relative (more specific) to the source of the data object. •Similar labels from different sources need to be resolved.

• Labels provide the flexibility in representing object structure


Self-describing data models• Have been in existence for a long time? Why

additional interest now?

•Use the ``nature’’ of self-describing data model for information exchange, and to extend the model to include object nesting.

•To provide an appropriate object request language (query facility)


OEM - Specification• Each object in OEM has the following structure:

– Label: A variable character string describing what the object represents.

– Type: The data type of the object’s value. Each is either an atom type, or type set.

– Value: A variable-length value of the object.– Object-ID: A unique variable-length identifier

for the object or null.

Label Type Value Object-ID


OEM - Summary• OEM is an information exchange model. It does not specify how

objects are stored at source.

• OEM does specify how objects are received at a client, but after objects are received they can be stored in any way the client likes.

• Note the schema-less nature of OEM is particularly useful when a client does not know in advance the labels or structure of OEM objects.

• Each source has a distinguished object with lexical identifier ``root’’.


Example• <biblio,set,{doc1,doc2,…,docn}>• doc1 is <doc, set, {auths1, topic1, call-

no1}>• auths1 is <auth-set,set {auth11}>• auth11 is <auth-ln, string, `Ùllman’’>• topic1 is <topic, string,``Databases’’>• call-no1 is <internal-call-no, integer, 25>• doc2 is <doc, set, {auths2, topic2, call-

no2}>• auths2 is <auth-set,set {auth21, auth22,

auth23}>• auth21 is <auth-ln, string, `Àho’’> • auth22 is <auth-ln, string, ``Hopcroft’’>

• auth23 is <auth-ln, string, `Ùllman’’> •

• topic2 is <topic, string,`Àlgorithms’’>

• call-no1 is <dewey-decimal, string, ``BR273’’>

• docn is <doc, set, {authsn, topicn, call-non}>

• authsn is <auth,string, ``Crichton’’>

• topic1 is <topic, string,``Dinosaurs’’>

• call-no1 is <fictional-call-no, integer, 95>

• biblio is the root object.


OEM - QLSELECT Fetch-expression

FROM Object

WHERE Condition

• The result of this query is itself an object, with special label ``answer’’:

answer, set, {obj1, obj2, …, objn}

• Each returned obji is a component of object specified in the From clause of the query, where the component is located by the Fetch-expression and satisfies the Condition.


Path• The notion of path is used in both Fetch-Expression in

the Select clause and the condition in the Where clause.

• Path describes traversals through an object using subobject structure and labels.

• Example: ``biblio.doc.auth’’

• Paths are used in Fetch-Expression to specify which components are are returned in the answer object.

• Paths are used in the condition to qualify the fetched objects or other (related) components in the same object structure.


Queries - Simple• Retrieve the topic of each document for which

`Ùllman’’ is one of the authors:SELECT biblio.doc.topic

FROM root

WHERE biblio.doc.auth-set.auth-ln = `Ùllman’’

• Intuitively, the query’s where clause finds all paths through subobject structure with the sequence of labels [biblio,doc,auth-set,auth-ln] such that the object at the end of the path has value `Ùllman.’’

<answer, set, {obj1, obj2}>

obj1 is <topic, string, ``Databases’’>

obj2 is <topic, string, “Algorithms”>


Queries - ``wild-cards’’• Retrieve all documents with internal call number:

SELECT biblio.?.topic

FROM root

WHERE biblio.?.internal-call-no

• ``?’’ label matches any label. For this query, the doc labels can be replaced by any other strings and query would produce the same result. By convention, two occurrences of ? In the same query must match the same label unless variables are used.

<answer, set, {obj1}>



Queries - ``wild-paths’’• Retrieve all documents with internal call number:

SELECT *.topic

FROM root

WHERE *.internal-call-no

• Symbol ``*’’ matches any path of length one or more. The use of * followed by a single label is a convenient and common way to locate objects with a certain label in complex structure. Similar to ?, two occurrences of * in the same query must match the same sequence of labels, unless variables are used.




Queries - variables• Retrieve each document for which both ``Hopcroft’’

and `Àho’’ are co-authors:SELECT biblio.doc

FROM root

WHERE biblio.doc.auth-set.auth-ln(a1)=`Àho’’ and

biblio.doc.auth-set.auth-ln(a1)=``Hopcroft’’

• Here, the query finds all the paths with structure [biblio, doc, auth-set], and with two distinct path completions with label auth with values `Àho’’ and ``Hopcroft’’


obj1 is the complete doc2


DBGroup

&1&1

&2&2 &3&3 &4&4 &5&5 &6&6

&11&11&8&8 &9&9 &10&10 &12&12 &13&13 &14&14&7&7 &15&15 &16&16

&17&17 &18&18 &19&19 &20&20

Member

Member

MemberMember

OfficeAge

Name Project

ProjectProject

Project

Name

Name AgeOffice

Office

RoomBuilding Building Room

“Clark” “Smith”

46 “Gates 252” “Lore” “Tsimmis”

“Jones” 28

An OEM Database

“CIS” “411” “CIS” 252


Lorel Queries - Simple Path Expression

•Retrieve the offices of members with age greater than 30 years:

Query SELECT DBGroup.Member.OfficeWHERE DBGroup.Member.Age > 30

Result Office “Gates 252”

Office

Building “CIS”

Room “411”


Queries - General Path Expression

Query SELECT DBGroup.Member.NameWHERE DBGroup.Member.Office(.Room

%|.Cubicle)?

Like “%252”

Result Name “Jones”

Name “Smith”

•Room% matches all labels starting from Room, like Room68. “|” stands for disjunction. “?” indicates that the label pattern is optional. “like %252” specifies that the data value should end with string “252”.


Queries - SubQueriesRetrieve Lore project members who work on other

projects

Query SELECT M.Name,

( SELECT M.Project.Title

WHERE M.Project.Title != “Lore”)

FROM DBGroup.Member M

WHERE M.Project.Title = “Lore”

Result Member

Name “Jones”

Title “Tsimmis”


Lore - Summary• Lore does facilitate query and updates on semi-

structural databases

• There has been more work done on optimization using: data guides (vldb97).

• The system is up and running: http://WWW-DB.Stanford.EDU/lore/demo/

• How is this related to WWW?

• XML-QL and related work provides the answer.


Extraction and Integration• OEM and subsequent LORE(L) can be used for

extracting information from multiple information sources.

• OEM helps navigate through unknown objects bySELECT ?

FROM root

Thus help browsing and schema discovery

• Efficient implementations are possible using partial fetch mechanism.

• Push and Pull information delivery systems are possible.

• How is this different from WebIR?


STRUDEL• Web Site Management System

• web Site from multiple sources

• STruQL - based on OEM, graphs, regular expressions, result as graph

• Example - return all the postscript papers from homepages:

• Where homepages(p), p “paper” q

• ispostscript(q) collect postscriptpages(p)

• Where C1,...Ck Create N1,...Nn link L1,...Lp, Collect G1, …Gq


Supported by Strudel: a Website Management System with StruQL as query language

where Biblio(X), X -> “paper” -> P, P -> “author” ->A,P -> “title” -> T, P -> “year” -> Y

create Root(), HomePage(A), YearPage(A,Y), PubPage(P)link Root() -> “person” -> HomePage (A),

HomePage(A) ->”yearentry” -> YearPage(A,Y),YearPage(A,Y) -> “publication” -> PubPage(P),PubPage(P) -> “author” -> HomePage(A),PubPage(P) -> “title” ->T

Complex Constructors


WebDB

• View WWW as multimedia documents in the form of web pages

• WQL supports selection, aggregation, sorting, summary, grouping

• projection on title , URL, keywords, tables, forms, images etc.


Some More Results

• UnQL - AT&T

• AKIRA- Pennstate

• NoDose - SIGMOD’98


HTML to XML

• HTML documents

• Emerging Web Standards - XML

• XML good for data interchange across platforms enterprise wide

• conversion HTML to XML - IBM, Microsoft


XML - Motivation• In HTML, both the tag semantics and tags are

fixed. There is limited and strict interpretation of tags.

• HTML is widely successful in disseminating documents across internet.

• Though data can be disseminated through HTML, its extraction is painful, and laborious.

• EDI has been a predominate mode of exchanging data among businesses. But it has very rigid format that requires highly customized applications.


XML - Introduction• XML aims to provide ease of authoring HTML

documents with ease of data exchange that is possible with EDI.

• Tags are used to markup documents.

• XML is a meta-language for describing markup languages.

• XML provides a facility to define tags and structural relationships between them.

• No pre-defined tag set implied no preconceived semantics, semantics of XML document is will be defined by applications that process them or style sheets (XSL).


XML - Goals• Straightforward to use over internet

• Support wide variety of applications, authoring, browsing, content analysis, etc.

• Easy to write programs that process XML documents and validate them.

• XML documents must be human-legible and reasonably clear.

• Design of XML shall be formal and concise - expressed as EBNF (extended Backus Naur Form) - amenable to modern compiler tools and techniques.


XML-features• Some structure - not rigid

• Extensibility - User defined tags

• nested elements

• validation - documents may specify their own grammar

• DTP (Document Type Descriptor) - schema exists with data as tag names

• Application -EDI - extraction, conversion, , transformation, integration

• can be modeled using DOM


More terminology• RDF - Resource Description Framework - a

method to describe metdata for XML documents

• XSL - Extensible Stylesheet Language - language for transforming and formatting XML.

• Transformation Language - XSLT, XPath, XPointer


Example-HTML• Print - Sanjay Madria

Web Warehouse Tutorial, ADBIS’99

HTML

<H2> Sanjay Madria </H2>

<I> Web Warehouse Tutorial, ADBIS’99</I>

Very difficult to understand, structure is hidden, describes only appearance


XML• <Ref>

<Speaker> <Firstname> Sanjay</firstname>

<Lastname> Madria</lastnaame>

</Speaker>

<Title > Web Warehouse Tutorial</Title>

<Conference> ADBIS’99</Conference>

</empty>

</Ref>

another format:

<Firstname Value “Sanjay”/>


XML Data• <book>

<title> database systems</title>

<author> John <lastname> Korth</lastname></author>

<price currency = “USD”> 5.87</price>

</book>

• DTD

• <!ELEMENT book (title, author+, price)>

• <!ELEMENT title (#PCDATA)>

• <!ELEMENT author(#PCDATA)|lastname)*


• <tr> <td width="20%" valign="top"> Firma Karl-Heinz Rosowski </td> <td width="20%" valign="top"> Maikstraße 14 </td> <td width="20%" valign="top"> 22041 Hamburg </td><td width="20%" valign="top"> 721 99 64 </td> <td width="20%" valign="top"> 21110111 </td> </tr>

• <?xml version="1.0"?>• <Addresses>• <Address id="12359">• <Name>Firma Karl-Heinz

Rosowski</Name>• <Street>Maikstraße 14</Street>• <ZIP>22041</ZIP>• <City>Hamburg</City>• <Tel>721 99 64</Tel>• <Fax>21110111</Fax> <Email/>• </Address> … • </Addresses>

HTML Version

XML Version


XML - Document - Continued•<?xml version="1.0"?> is the XML declaration.•Elements:Most common form of markup. <element> … </element>. For example <name>Jack Lemon </name>•Attributes: are name-value pairs that occur inside start-tags after the element name. For example: <Address id="12359"> attaches value 12359 to attribute id of Address element.•Entity References: to handle special characters of XML like “<“ in the XML documents.

• Comments: <!-- this is a comment --!>

• CDATA Sections: a CDATA (string of characters) section instructs the parser to ignore most markup characters. For example source code, <![CDATA[ *p = &q; b = (I <= 3);]]>, between [CDATA[ and ]] all character data is passed to an application, with out interpretation.


XML - DTD - Element Type Declarations

•Element type declarations: identify the names of elements and the nature of their content. A typical element type declaration looks like:

•<!Element Address (Name, Street, ZIP?, City, Tel+, Fax*, Email?)>

•Address is the element name, and (Name, Street, ZIP?, City, Tel+, Fax*, Email?) is the content model. Every address must contain, Name, Street, City and Tel. ZIP and Email are optional, whereas there can be zero or more Fax numbers.


• The declarations for Name, Street, ZIP …, must also be given. For example

• <!Element Name (#PCDATA)>• Attribute List Declarations: identify which

elements may have attributes, what values the attributes may hold, and what value is default. Attribute values appear only within start-tags and empty-element tags.

• <Address id="12359">


XML - Summary• HTML describes presentation• XML describes content• XML vs. HTML

– users define new tags– arbitrary nesting – validation is possible


XML and Semi Structural Data Model

• XML data is fundamentally different than relational and object oriented data.

• XML is not rigidly structured.• In relational and OO data model every data

instance has a schema which is separate and independent of the data.

• XML data is self describing and can naturally model irregularities that cannot be modeled by relational or OO data model.


• For example, data items may have missing elements or multiple occurrences of the same element; elements may have atomic values in some data items and structured values in others; and collections of elements can have heterogeneous structure.

• Even XML data that has an associated DTD is self-describing (the schema is always stored with the data) and, except for very restricted forms of DTDs, may have all the irregularities described above.

• XML is an instance of semistructured data.


XML-QL• Regular path expression

• pattern matching

• used edge labeled graphs

• extract data from existing XML documents and construct new XML documents

• support for ordered and unordered views on XML document

• simple and declarative


XML-QL• The simplest XML-QL queries extract data

from an XML document. Consider the following DTD:

<!ELEMENT book (author+,title,publisher)><!ATTLIST Book year CDATA><!ELEMENT article (author+ title year?,

(shortversion |longversion))><!ATTLIST article type CDATA><!ELEMENT publisher (name, address)><!ELEMENT author (firstname?, lastname)>


XML-QL Example Data<bib> <book year=“1995> <title> An Introduction to DB Systems </title>

<author> <lastname> Date </lastname></author><publisher><name> Addison-Wesley</name>

</publisher></book><book year=“1995>

<title> Foundations for OR Databases </title><author> <lastname> Date </lastname></author> <author> <lastname> Darwen </lastname></author> <publisher><name> Addison-Wesley</name>

</publisher></book>

</bib>


Matching Data Using Patterns• XML uses element patterns to match data in an XML document. • Find all authors of books whose publisher is Addison-Wesley in

XML document www.a.b.c/bib.xml

WHERE <book><publisher><name>Addison-Wesley</

name></publisher><title> $t </title><author> $a </author>

</book> IN “www.a.b.c/bib.xml”CONSTRUCT $a

matches every <book> element in the XML document that has at least one <title> element, one <author> element , and one publisher element whose <name> is Addison-Wesley. For each such match it binds $t and $a to every title and author pair.


XML-QL Constructing XML Data

• Often we would like format the result. • Find all authors and titles of books whose publisher is

Addison-Wesley in XML document www.a.b.c/bib.xmlWHERE <book>

<publisher><name>Addison-Wesley</></>

<title> $t </title><author> $a </author>

</book> IN “www.a.b.c/bib.xml”CONSTRUCT <result>

<author> $a </><title> $t </>

</>


Constructing XML Data -cont.Result of the query:

<result><author><lastname> Date </lastname></author><title> Introduction to Database Systems </title>

</result><result>

<author><lastname> Date </lastname></author><title> Foundations for OR Databases </title>

</result><result>

<author><lastname> Darwen </lastname></author><title> Foundations for OR Databases </title>

</result>One result for each author, duplicating title information.


XML-QL Nested Queries.WHERE <book>

<title> $t </><publisher><name>Addison-Wesley</></>

</> CONTENT_AS $p IN “www.a.b.c/bib.xml”CONSTRUCT <result>

<title> $t </>WHERE <author> $a </> in $pCONSTRUCT <author> $a </>

</><result>

<author><lastname> Date </lastname></author><title> Introduction to Database Systems </title>

</result><result>

<author><lastname> Date </lastname></author><author><lastname> Darwen

</lastname></author><title> Foundations for OR Databases </title>

</result>


XML-QL Join QueriesXML queries cab express “joins” by matching two or more elements that contain same value. Find all articles that have at least one author who has written a book since 1995.WHERE <article> <author> <firstname> $f </> // firstname $f <lastname> $l </> // lastname $l </> </> CONTENT_AS $a IN "www.a.b.c/bib.xml" <book year=$y> <author> <firstname> $f </> // join on same firstname $f <lastname> $l </> // join on same lastname $l </> </> IN "www.a.b.c/bib.xml", y > 1995 CONSTRUCT <article> $a </>


XML-QL Data Model for XML•XML graph G in which each node is represented by a unique string called object identifier (OID), G’s edges are labelled with element tags, G’s nodes are labeled with sets of attribute value pairs, G’s leaves are labeled with one string value, and G has a distinguished node called root. •


XML-QL Data Model for XML•The model allows several edges between the same two nodes with the following restriction:

between any two nodes there can be at most one edge with a given label

a node cannot have two leaf children with the same label and same string value

•XML graphs are not only derived from XML documents, but are also generated by queries.


XML- Element Identity, Ids, and IDREFS

•For element sharing XML reserves an attribute of type ID which allows a unique key to be associated with an element.

•An attribute of type IDREF allows an element to refer to another element with the designated key, and one of the type IDREFS may refer to multiple elements.


• <!ATTLIST person ID #REQUIRED>• <!ATTLIST article author IDREFS #IMPLIED>• <person ID="o123">• <firstname>John</firstname>• <lastname>Smith<lastname>• </person>• <person ID="o234">• . . .• </person>• <article author="o123 o234">• <title> ... </title>• <year> 1995 </year>• </article>


XML- Element Identity, Ids, and IDREFS


The following query produces all lastname, title pairs by joining the author element's IDREF attribute value with the person element's ID attribute value.

WHERE <article author=$i> <title> </> ELEMENT_AS $t </>, <person ID=$i> <lastname> </> ELEMENT_AS $l </>CONSTRUCT <result> $t $l</>

The idiom <title></> ELEMENT_AS $t binds $t to a <title> element with arbitrary contents. The element expression <title/> matches a <title> element with empty contents.


XML-QL- Advanced Examples

Tag VariablesRegular Path ExpressionsTransforming XML Data (from one DTD to another)Integrating Data from different XML sourcesEmbedding queries in data XML-QL

check http://www3.org/TR/NOTE-xml-ql


Summary•Even before you blink your eye…. Lot of work has gone in web data models and query languages•Some problems are addressed:Semi-structural

semi-structural data model based query languagesschema inference from semi-structural data modelefficient processing of queries on semi-structural data efficient indexing and storage structuresintegration with XML

•TraditionalWebSQL/WebOQLWeb Warehousing

•Which way will you go?


Further issues•Distributed query processing•Continuous result processing with push/pull result replenishment•Labels, labels every where, with XML more labels every where … how are semantics of queries across multiple information sources handled

•IR gives too many relevant/irrelevant results•Query Processing requires some schema knowledge that is difficult to handle across multiple sources•Can these two be bridged? Cooperative solutions.

•Next Agents, Agents everywhere, What are they doing? Will it work or Will it be a fad?

copy-right@sanjaymadria, umr web data management sanjay kumar madria department of computer science...

Documents

information slide

umr world wide web web

web business

unstructured information

accessible information

information sources

web servers

web countries