Univ. of Crete V. Christophides I. Fundulaki

CS561 Spring 2010

Web Data Management: An Introduction to Semistructured and XML Databases

V. CHRISTOPHIDES, I. FUNDULAKI
Department of Computer Science, University of Crete
ICS - FORTH, Heraklion, Crete

A bit of History
• Research:
  ◦ 1950’s: Lisp [McCarthy]
  ◦ 1960’s: Tree languages [Büchi]
  ◦ 1970’s: Relational DBs [Codd]
  ◦ 1990: Graphlog [Univ. Toronto]
  ◦ 1994: O2 extensions [INRIA]
  ◦ 1995: Tsimmis & OEM [Stanford]
  ◦ 1995: UnQL [UPenn]
  Need to handle irregular Web data. Use graph data models.
• Internet industry:
  ◦ 1957: the Sputnik launch prompts the creation of ARPA
  ◦ 1972: First demonstration of ARPANET
  ◦ 1989: Number of hosts breaks 100,000
  ◦ 1991: CERN releases the World Wide Web
    HTML as the support for information
  ◦ 1997: 20 million hosts, 1 million Web sites
  ◦ 1998: W3C releases XML to represent information on the Web
  XML provides a syntax for irregular textual Web information.

Documents vs Databases
Document world
• plenty of small documents
  ◦ usually static
• implicit structure
  ◦ section, paragraph, toc
• tagging
  ◦ human friendly
• content
  ◦ form/layout, annotation
• paradigms
  ◦ “Save as”, WYSIWYG
• metadata
  ◦ author name, date, subject
Database world
• a few large databases
  ◦ usually dynamic
• explicit structure
  ◦ types
• records
  ◦ machine friendly
• content
  ◦ data, methods
• paradigms
  ◦ Data Independence, Transaction Management, Query Languages
• metadata
  ◦ schema description

What to do with them
Documents
• editing
• spell-checking
• counting words
• retrieving (IR)
• printing
Database
• updating
• cleaning
• querying
• composing/transforming

Query Languages
Document Retrieval
    Claude Monet and San Diego Museum of Art
Database Querying
    select p
    from Artists a, a.artwork p
    where a.first = "Claude"
      and a.last = "Monet"
      and p.located = "San Diego Museum of Art"
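
The contrast between the two query styles can be sketched in Python. This is only an illustration under stated assumptions: the two toy documents, the `artists` records and the function names `keyword_search` and `artworks_by` are hypothetical and do not come from the slides. Keyword retrieval matches any text containing the words, while the structured query navigates named fields.

```python
# Hypothetical toy data: two free-text documents and the same facts as records.
documents = {
    "doc1": "Claude Monet, Haystacks at Chailly at Sunrise, San Diego Museum of Art",
    "doc2": "The San Diego Museum of Art shop sells a biography of Claude Monet",
}

artists = [
    {
        "first": "Claude",
        "last": "Monet",
        "artwork": [
            {"title": "Haystacks at Chailly at Sunrise",
             "located": "San Diego Museum of Art"},
        ],
    },
]


def keyword_search(docs, query):
    """Document retrieval: ids of documents that contain every keyword."""
    words = query.lower().split()
    return sorted(doc_id for doc_id, text in docs.items()
                  if all(word in text.lower() for word in words))


def artworks_by(artist_list, first, last, located):
    """Database querying: select p from Artists a, a.artwork p where ..."""
    return [p
            for a in artist_list
            for p in a["artwork"]
            if a["first"] == first and a["last"] == last
            and p["located"] == located]


hits = keyword_search(documents, "Claude Monet San Diego Museum of Art")
paintings = artworks_by(artists, "Claude", "Monet", "San Diego Museum of Art")
```

Both documents match the keywords, although only one describes a painting; the structured query returns exactly the artwork record.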

How is the Web Today?
• Information and its presentation are mixed up in the form of HTML documents
  ◦ all intended for human consumption
  ◦ many generated automatically by applications
• Easy to fetch any Web page, from any server, any platform
  ◦ access through a uniform interface

The Secrets of HTML Success
• Everybody can write it:
  ◦ HTML is simple
  ◦ HTML is textual: it is human readable, you can use any editor, ...
• Everybody can read it:
  ◦ HTML is portable on any platform
  ◦ The browser is the universal application
• Everybody can search it:
  ◦ Keyword-based Search Engines: high recall, low precision
• It connects pieces of information together
  ◦ Through hypertext links

What’s Wrong with HTML
• If written properly, normal HTML markup may reflect document presentation, but it cannot adequately represent the semantics & structure of data
    <B>MONET, Claude</B><BR>
    Haystacks at Chailly at Sunrise<BR>
    1865<BR>
    Oil on canvas<BR>
    30 x 60 cm (11 7/8 x 23 3/4 in.)<BR>
    San Diego Museum of Art <BR>
    <P><IMG SRC="http://192.41.13.240/artchive/m/monet/hayricks.jpg">
Intended meaning of each line, in order: Artist Name, Artifact Title, Date, Material, Dimensions, Museum, Image Reference


HTML Document Presentation

But Modern Web Applications Need More!
• Infomediaries:
  ◦ Community Web Portals
  ◦ Digital Museums & Libraries
• Electronic commerce:
  ◦ On-line Catalogs & Procurement
  ◦ Comparison Shoppers
  ◦ Market Places
  ◦ Virtual Enterprises
• Scientific applications:
  ◦ E-learning
  ◦ Data & Knowledge Grids
• Advanced Information Management
  ◦ finding,
  ◦ extracting,
  ◦ representing,
  ◦ interpreting,
  ◦ maintaining
• Flexible, Quick Interoperation: the ability to uniformly share, interpret and manipulate heterogeneous information
  ◦ applications cannot consume HTML
More than HTML documents: Data on the Web
More than Web browsers: Web-enabled Applications

Paradigm Shift on the Web
[Figure labels: application (three instances); relational data, object-relational, legacy data; Transform, Integrate, Warehouse; XML Data; WEB (HTTP)]
• New Web standard XML:
  ◦ XML generated by applications
  ◦ XML consumed by applications
• Data exchange:
  ◦ across platforms
  ◦ across organizations
• Web: from collection of documents to Web data published as documents

XML Data Representation: The Document View
<ARTIST>
  <NAME>
    <FIRST>Claude</FIRST>
    <LAST>Monet</LAST>
  </NAME>
  <ARTWORK>
    <ARTIFACT>
      <TITLE>Haystacks at Chailly at Sunrise</TITLE>
      <DATE>1865</DATE>
      <MATERIAL>Oil on canvas</MATERIAL>
      <DIM Metric='cm'> <HEIGHT>30</HEIGHT> <WIDTH>60</WIDTH> </DIM>
      <DIM Metric='in'> <HEIGHT>11 7/8</HEIGHT> <WIDTH>23 3/4</WIDTH> </DIM>
      <LOCATION>San Diego Museum of Art</LOCATION>
      <IMAGE File='http://192.41.13.240/artchive/m/monet/hayricks.jpg'/>
    </ARTIFACT>
  </ARTWORK>
</ARTIST>
Callouts in the figure: Element Name, Element Content, Empty Element, Attribute Name, Attribute Value
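
To see what the markup buys an application, the ARTIST document above can be read with Python's standard `xml.etree.ElementTree` module. This is a minimal sketch, not part of the original slides; the dictionary keys in `record` are names chosen here for illustration.

```python
import xml.etree.ElementTree as ET

ARTIST_XML = """
<ARTIST>
  <NAME><FIRST>Claude</FIRST><LAST>Monet</LAST></NAME>
  <ARTWORK>
    <ARTIFACT>
      <TITLE>Haystacks at Chailly at Sunrise</TITLE>
      <DATE>1865</DATE>
      <MATERIAL>Oil on canvas</MATERIAL>
      <DIM Metric='cm'><HEIGHT>30</HEIGHT><WIDTH>60</WIDTH></DIM>
      <DIM Metric='in'><HEIGHT>11 7/8</HEIGHT><WIDTH>23 3/4</WIDTH></DIM>
      <LOCATION>San Diego Museum of Art</LOCATION>
      <IMAGE File='http://192.41.13.240/artchive/m/monet/hayricks.jpg'/>
    </ARTIFACT>
  </ARTWORK>
</ARTIST>
""".strip()

root = ET.fromstring(ARTIST_XML)
artifact = root.find("ARTWORK/ARTIFACT")

# Each field is addressed by element name instead of by its position in the text.
record = {
    "artist": root.findtext("NAME/FIRST") + " " + root.findtext("NAME/LAST"),
    "title": artifact.findtext("TITLE"),
    "date": int(artifact.findtext("DATE")),
    "museum": artifact.findtext("LOCATION"),
    # The attribute value selects which of the two DIM elements is meant.
    "height_cm": int(artifact.findtext("DIM[@Metric='cm']/HEIGHT")),
    # IMAGE is an empty element: its information lives in the attribute.
    "image": artifact.find("IMAGE").get("File"),
}
```

The same extraction from the HTML version would have to guess which `<BR>`-separated line holds which field.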

XML Data Representation: The Database View
ARTIST
  NAME
    FIRST: Claude
    LAST: MONET
  ARTWORK
    ARTIFACT
      TITLE: Haystacks
      DATE: 1865
      MATERIAL: Oil on canvas
      DIM
        H: 30
        W: 60
      DIM
        H: 11 7/8
        W: 23 3/4
      LOCATION: San Diego Mus.
      IMAGE: ...hayricks.jpg

The Secrets of XML Popularity
• It looks like HTML...
  ◦ Simple, familiar, easy to learn, human-readable
  ◦ Universal and portable
  ◦ Supported by the W3C: trusted and quickly adopted by the industry
• …but it’s more than HTML!
  ◦ flexible: you can represent any information
  ◦ extensible: you can represent it the way you want!
• Increasing precision in XML specifications
  ◦ Well-Formed: already better than plain text
  ◦ Valid: Structure conforms to a DTD or an XML Schema

Well-Formed XML
• An object is said to be a well-formed XML document if it meets all the well-formedness constraints (WFCs) of the XML syntax:
  ◦ tags (etc.) are syntactically correct
  ◦ every start-tag has a matching end-tag
  ◦ tags are properly nested
  ◦ there is exactly one root element
• By definition, if a document is not well-formed, it is not XML
  ◦ This means that there is no such thing as an XML document that is not well-formed, and XML processors are not required to do anything with such documents
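
The well-formedness constraints can be tried out with any non-validating parser. The sketch below uses Python's standard `xml.etree.ElementTree`, which raises `ParseError` on input that is not well-formed; the helper name `is_well_formed` and the four sample strings are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if `text` parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


checks = {
    "single root, nested tags": "<NAME><FIRST>Claude</FIRST><LAST>Monet</LAST></NAME>",
    "missing end-tag": "<NAME><FIRST>Claude</NAME>",
    "badly nested tags": "<NAME><FIRST>Claude</NAME></FIRST>",
    "two roots": "<FIRST>Claude</FIRST><LAST>Monet</LAST>",
}

# Only the first sample satisfies all the constraints listed above.
results = {label: is_well_formed(doc) for label, doc in checks.items()}
```

The parser simply refuses the three faulty inputs, which matches the rule that a document that is not well-formed is not XML at all.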

Valid XML
• A well-formed document is valid only if it contains a proper DTD (or Schema) and if the document obeys the constraints of that DTD (or Schema), and therefore the XML Validity Constraints (VCs)
  ◦ only declared tags (element or attribute names) are used
  ◦ all tag occurrences conform to specified content models
• Examples:
  ◦ The following XML document is well-formed but not valid (it declares no DTD to validate against):
      <ARTIST>Claude Monet</ARTIST>
  ◦ The following XML document is not even well-formed (it has two top-level elements and therefore no single root):
      <FIRST>Claude</FIRST><LAST>Monet</LAST>

XML Document Type Definition (DTD)
<!DOCTYPE artist [
  <!ELEMENT artist (name, born, death, artwork, nationality?, influences)>
  <!ATTLIST artist oid ID #REQUIRED>
  <!ELEMENT name (first, last)>
  <!ELEMENT first (#PCDATA)>
  <!ELEMENT last (#PCDATA)>
  ...
  <!ELEMENT artwork (artifact+)>
  <!ELEMENT artifact (title, date, material, dim*, location, image)>
  <!ELEMENT title (#PCDATA)>
  ...
  <!ELEMENT dim (height, width)>
  <!ATTLIST dim metric (cm | in) 'cm'>
  <!ELEMENT location (#PCDATA)>
  <!ELEMENT image EMPTY>
  <!ATTLIST image file ENTITY #REQUIRED>
  <!ELEMENT influences (#PCDATA | aref)*>
  <!ELEMENT aref EMPTY>
  <!ATTLIST aref oref IDREF #IMPLIED>
]>
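
A content model such as `(title, date, material, dim*, location, image)` is a regular expression over child element names. The sketch below makes that concrete in Python. It is not a DTD validator: the three models in `CONTENT_MODELS` are translated by hand from the DTD above, attributes, `#PCDATA` and ID/IDREF checks are ignored, and the function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hand-translated content models (assumption: sequence, '*' and '+' suffice).
# Each child name is followed by one space so names cannot run together.
CONTENT_MODELS = {
    "artwork": r"(artifact )+",
    "artifact": r"title date material (dim )*location image ",
    "dim": r"height width ",
}


def conforms(element):
    """Check one element's children against its content model, if declared here."""
    model = CONTENT_MODELS.get(element.tag)
    if model is None:
        return True  # no model in this sketch: nothing to check
    children = "".join(child.tag + " " for child in element)
    return re.fullmatch(model, children) is not None


def is_valid(root):
    """True if every element in the tree conforms to its content model."""
    return all(conforms(element) for element in root.iter())


good = ET.fromstring(
    "<artwork><artifact><title>Haystacks</title><date>1865</date>"
    "<material>Oil on canvas</material>"
    "<dim><height>30</height><width>60</width></dim>"
    "<location>San Diego Museum of Art</location><image/></artifact></artwork>")
bad = ET.fromstring(
    "<artwork><artifact><title>Haystacks</title>"
    "<location>San Diego Museum of Art</location></artifact></artwork>")

good_ok, bad_ok = is_valid(good), is_valid(bad)
```

Both inputs are well-formed; only the first also respects the declared order and multiplicity of children.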


XML Anatomy

Is XML the Solution to Interoperability?
[Figure: Application 1 and Application 2 each hold the same ARTIST tree (NAME, ARTWORK, ARTIFACT, TITLE, DATE, MATERIAL, DIM, LOCATION, IMAGE), with the label "Communication" between them]
Document = medium for exchanging information
• Still need to agree on:
  ◦ DTDs or Schemas
  ◦ Meaning of tags
  ◦ “Operations” on data
  ◦ Meaning of operations

Large Scale Interoperation on the Web
[Figure: a Sender using DTD A and a Recipient using DTD A exchange XML-based communication using DTD A; the communication partners using DTD B and DTD C are marked with question marks]

Interoperability is still an Open Issue!
• Semantic discrepancies:
  ◦ Synonymy & Polysemy & Taxonomy
    - <ARTIFACT> vs <ARTEFACT>
    - is <ARTWORK> paintings or songs?
    - how is <… Style='Impressionism'> related to <… Style='Pointillism'>?
• Structural discrepancies:
  ◦ Aggregation
    - <NAME><FIRST>Claude</FIRST><LAST>Monet</LAST></NAME> vs <NAME>Claude Monet</NAME>
  ◦ Type
    - <ARTIFACT Kind='Painting'> ... </ARTIFACT> vs <PAINTING>Claude Monet</PAINTING>
• Syntactic discrepancies:
  ◦ <ARTIST Name='Claude Monet'> ... </ARTIST> vs <ARTIST> <NAME>Claude Monet</NAME> ... </ARTIST>
More than Web Data: Semantics on the Web
More than Web Applications: Web Services
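
The structural and syntactic discrepancies listed above force every consumer to write normalization code by hand. A minimal Python sketch follows; the function name `artist_name` and the three sample documents are assumptions for illustration. It covers only the name encodings shown on the slide, and it does nothing for semantic discrepancies, which cannot be resolved by inspecting the markup alone.

```python
import xml.etree.ElementTree as ET


def artist_name(artist):
    """Return the artist's full name from any of three encodings."""
    # Syntactic variant: the name is an attribute of <ARTIST>.
    if artist.get("Name"):
        return artist.get("Name")
    name = artist.find("NAME")
    if name is None:
        return None
    # Structural variant (aggregation): <NAME> is split into <FIRST>/<LAST>.
    first, last = name.findtext("FIRST"), name.findtext("LAST")
    if first is not None and last is not None:
        return first + " " + last
    # Flat variant: <NAME> holds the whole string.
    return (name.text or "").strip() or None


variants = [
    "<ARTIST Name='Claude Monet'/>",
    "<ARTIST><NAME><FIRST>Claude</FIRST><LAST>Monet</LAST></NAME></ARTIST>",
    "<ARTIST><NAME>Claude Monet</NAME></ARTIST>",
]
names = [artist_name(ET.fromstring(doc)) for doc in variants]
```

All three documents normalize to the same string, but only because the code hard-wires knowledge about each partner's conventions.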

The Semantic Web Vision: A Web of Meaning
[Figure labels: Museums, Artists, Artifacts, Techniques, Semantic Relationships]
• The “Next Generation Web” aims to provide infrastructure for expressing information in a precise, human-readable, and machine-interpretable form
• Enable both syntactic and semantic/structural interoperability among independently-developed Web applications, allowing them to efficiently perform sophisticated tasks for humans
• Enable Web resources (data & applications) to be accessible by their meaning rather than by keywords and syntactic forms
  ◦ Conceptual Navigation & Querying
  ◦ Inference Services (Picasso is an Artist)

Web Innovation: From Web Sites to Web Services
[Figure labels: Web Sites, Web Services; Connect, Browse/Present, Integrate/Transact, Program/Collaborate; Packets, TCP/IP, HTML, XML, UDDI, SOAP, RosettaNet; Today; time axis: 1970s, 1980s, 1990s, 1996, 2003]


Middleware Evolution & Interoperability

About the W3C and the XML Activities
• Membership organization
• Different types of groups inside W3C:
  ◦ Working groups
  ◦ Interest groups
  ◦ Coordination groups
• Status for W3C documents:
  ◦ Working draft
  ◦ Last Call
  ◦ Candidate/proposed recommendation
  ◦ Recommendation ~ Standard
• Core XML WG
  ◦ eXtensible Markup Language (XML 1.0), namespaces, Infoset
• XML Linking WG
  ◦ XML Pointer Language (XPointer), XML Linking Language
• XML Schema WG
• XML Query WG
  ◦ XML Data Model, Algebra and Query Language
• Document Object Model WG
• XSL WG
  ◦ XPath (with XML Linking WG)
  ◦ Transformation and stylesheet language (XSLT/XSL)

W3C XML Related Specifications
[Figure, credited to Ian GRAHAM: map of XML-related specifications.
 Legend: ‘Open’ std, W3C rec, W3C draft, industry std.
 Areas: XML Core, APIs, Style, Protocols, Web Services, Application areas.
 Specifications shown: SAX 1, SAX 2, XML 1.0, XML namespaces, XPath, XSLT, XSL, DOM 1, DOM 2, DOM 3, MathML, SMIL 1 & 2, SVG, XHTML 1.0, Modularized XHTML, XHTML basic, XHTML events, XForms, Canonical XML, signature, XML base, XLink, XPointer, XML query, Infoset, XML schema, RDF, XFragment, CSS 1, CSS 2, CSS 3, JDOM, JAXP, SOAP, UDDI, XML-RPC, WSDL, FinXML, dirXML, IFX, FpML, ebXML, Biztalk, WDDX, XMI, ... (100's more)]

XML is Just the Beginning...
• We now want to build advanced Web applications
  ◦ There is an urgent need for XML tools
• Designing XML tools is a data management problem:
  ◦ XML 1.0 to describe structured documents = Syntax for trees
  ◦ XML data models to describe the information content = Data model for trees
  ◦ XML schemas to describe the structure of information = Data definition language for trees
  ◦ XML languages to describe information processing = Data manipulation language for trees

Why is Database Work Important?
• Databases are large collections of data
  ◦ Even “basic” APIs to XML will fail on large XML documents
• Other reasons (all the good things that the DB community brought the world):
  ◦ Data models, integrity constraints, and schemas
  ◦ Query languages, optimizers, fast joins
  ◦ Views, Updates
  ◦ Concurrency
  ◦ Federated database systems
• The emergence of XML underscores the importance of semistructured data
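
The problem with "basic" APIs on large documents is that a tree API materializes the whole document in memory before any question can be asked. One partial workaround, sketched below with Python's standard `xml.etree.ElementTree.iterparse`, is to process elements as they stream past. The function name, the synthetic 1000-artifact document and the museum name "Elsewhere" are assumptions for illustration, and streaming alone provides none of the indexing, optimization or concurrency that a DBMS offers.

```python
import io
import xml.etree.ElementTree as ET


def count_artifacts_streaming(source, museum):
    """Count ARTIFACT elements located in `museum` while parsing incrementally."""
    count = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "ARTIFACT":
            if elem.findtext("LOCATION") == museum:
                count += 1
            # Drop the children we no longer need. The emptied ARTIFACT nodes
            # stay attached to their parent, so memory still grows, but slowly.
            elem.clear()
    return count


# Synthetic input (hypothetical data): every second artifact is in San Diego.
doc = ("<ARTWORK>" + "".join(
    "<ARTIFACT><TITLE>T%d</TITLE><LOCATION>%s</LOCATION></ARTIFACT>"
    % (i, "San Diego Museum of Art" if i % 2 == 0 else "Elsewhere")
    for i in range(1000)) + "</ARTWORK>").encode("utf-8")

n = count_artifacts_streaming(io.BytesIO(doc), "San Diego Museum of Art")
```

Every query written this way is a full scan, which is exactly the kind of cost that query optimizers and indexes were invented to avoid.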

XML Main Characteristics: Semistructured Data
• Data Schema is not what it used to be:
  ◦ not given in advance (self-describing, schema-less)
  ◦ descriptive, not prescriptive (designed by document experts, not DB experts)
  ◦ partial (documents and data mixed together)
  ◦ rapidly evolving (without notice)
  ◦ may be large (compared to the size of the data)
• Data Types are not what they used to be:
  ◦ elements and attributes are not strongly typed
  ◦ missing or additional attributes
  ◦ multiple attributes
  ◦ elements in the same collection may have different types, i.e., heterogeneous collections
  ◦ attributes with different types in different elements
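
These characteristics can be made concrete with a small self-describing collection. In the Python sketch below, the two records are hypothetical toy data (the second artwork entry in particular is invented for illustration), and `label_paths` is a name chosen here. The "schema" is not given in advance; it is derived from the data, so it is descriptive rather than prescriptive.

```python
# Toy heterogeneous collection: labels travel with the data (self-describing).
artists = [
    {"name": {"first": "Claude", "last": "Monet"},
     "artwork": [{"title": "Haystacks at Chailly at Sunrise", "date": 1865}]},
    {"name": "Pablo Picasso",            # same label, different structure
     "nationality": "Spanish",           # additional attribute
     "artwork": [{"title": "Untitled sketch", "date": "unknown"}]},  # hypothetical entry
]


def label_paths(value, prefix=()):
    """Yield (label path, leaf type name) pairs found in the data."""
    if isinstance(value, dict):
        for label, child in value.items():
            yield from label_paths(child, prefix + (label,))
    elif isinstance(value, list):
        for child in value:
            yield from label_paths(child, prefix)
    else:
        yield "/".join(prefix), type(value).__name__


# Descriptive schema: every label path seen so far, with the leaf types observed.
schema = {}
for artist in artists:
    for path, type_name in label_paths(artist):
        schema.setdefault(path, set()).add(type_name)
```

The derived schema contains both `name` and `name/first`, and `artwork/date` carries two types, which is precisely what a relational schema would forbid.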

Schemas are Useful for …
• Data readers
  ◦ What info is in a given collection?
  ◦ Thus, what queries might make sense?
• Data writers
  ◦ What should I call this piece of info?
  ◦ Is it okay to put this kind of data here?
• Efficient/effective data manipulation
  ◦ Optimize query processing
  ◦ Facilitate integration of multiple data sources
  ◦ Improve storage
  ◦ Construct indexes, statistics
  ◦ Forbid certain types of updates

A DB Paradigm Shift is Also Needed
• Managing semistructured/XML data requires rethinking the design of components of a DBMS:
  ◦ How do we model it?
    - directed labeled trees with references (i.e. graphs)
  ◦ How do we query it?
    - a new standard based on functional languages and regular path expressions
  ◦ How do we store the data?
    - looking for structural patterns
  ◦ How do we optimize queries?
    - beginning to understand (algebras, indexes, etc.)
  ◦ What about integrity constraints, views, updates, ...?
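
A regular path expression selects nodes whose root-to-node label path matches a regular expression. The Python sketch below shows the idea on a labeled tree by naive enumeration; it is not how any particular query language or engine implements it, and the names `paths` and `select` as well as the path syntax (labels joined by `/`) are assumptions for this illustration.

```python
import re
import xml.etree.ElementTree as ET

DOC = ET.fromstring(
    "<ARTIST><NAME><FIRST>Claude</FIRST><LAST>Monet</LAST></NAME>"
    "<ARTWORK><ARTIFACT><TITLE>Haystacks at Chailly at Sunrise</TITLE>"
    "<DIM Metric='cm'><HEIGHT>30</HEIGHT><WIDTH>60</WIDTH></DIM>"
    "<DIM Metric='in'><HEIGHT>11 7/8</HEIGHT><WIDTH>23 3/4</WIDTH></DIM>"
    "</ARTIFACT></ARTWORK></ARTIST>")


def paths(elem, prefix=""):
    """Yield (label path, element) for elem and all its descendants, in document order."""
    path = prefix + "/" + elem.tag
    yield path, elem
    for child in elem:
        yield from paths(child, path)


def select(root, path_regex):
    """Return the elements whose root-to-node label path matches the regex."""
    pattern = re.compile(path_regex)
    return [elem for path, elem in paths(root) if pattern.fullmatch(path)]


# "Any HEIGHT reachable from ARTIST through any number of intermediate labels".
heights = select(DOC, r"/ARTIST(/[^/]+)*/HEIGHT")
# "FIRST or LAST directly under ARTIST/NAME".
name_parts = select(DOC, r"/ARTIST/NAME/(FIRST|LAST)")
```

The wildcard step lets a query succeed without knowing how deeply the data is nested, which is what makes path expressions suitable for irregular data.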

Towards a Convergence
• Databases: relax rigid constraints imposed by schemas
  ◦ Move to a dynamic type system: semistructured data
• Documents: enrich formatting instructions with structuring/semantic information
  ◦ Add “types” to documents: XML
Semistructured Data =~ XML
Origins
• SGML, Document Management
• HTML, Web Pages
• Semi-structured data models for data integration
• ASN.1, ACeDB, Scientific Data Formats

What This Course is About
• What the database community has done:
  ◦ Semistructured data model: SSD-exps, labeled graphs
  ◦ OEM, UnQL and YATL
  ◦ Schemas, Storage, Query Optimization
• What the Web community has done:
  ◦ Data formats and APIs: XML 1.0, DOM
  ◦ Transformation and Stylesheet languages (XSLT/XSL)
• Where they meet and where they differ
  ◦ Comparison to relational and object-oriented data models
• Present emerging XML technology as a data management issue
  ◦ XML Data models
  ◦ XML Data Definition (Schema) Languages
  ◦ XML Data Manipulation (Query) languages

Your Research Projects
• XML Data Semantics
  ◦ Type Systems
  ◦ Structural & Integrity Constraints
  ◦ Incremental Validation
• XML Query Processing
  ◦ XQuery Algebras
  ◦ Tree Query Pattern Containment & Minimization
  ◦ XPath Engines
  ◦ Stream-based Query Processing
• XML Query Optimization
  ◦ Storage Schemes
  ◦ Labelling & Indexing Schemes
  ◦ Structural Joins & Cost Models
  ◦ Data Statistics & Compression
  ◦ Benchmarks, Real & Synthetic Data
• XML Data Management
  ◦ Updates, Evolution & Versioning
  ◦ Access Control & Active Rules
  ◦ Data Publishing & Relational Databases
  ◦ Warehouses & View Maintenance
• XML Database Systems
  ◦ Commercial DBMS
  ◦ Native DBMS

Retrospective
• 1960’s: Data Centric
• 1970’s: Process Centric
• 1980’s: Object Oriented
• 1990’s: Component Based
• 2000’s: XML?

Data was our First Focus
[Timeline: 60’s Data]
• Record Layouts
• Printer Layouts
• System Flow Charts
• Decision Tables
Batch Jobs were a Series of small Programs

Then we Focused on Logic
[Timeline: 60’s Data, 70’s Logic]
• GOTO-Less Programming
• Structured Programming
• Top-Down Design
Programs Became Very Large

Object Oriented Programming Focused on Runtime Behavior
[Timeline: 60’s Data, 70’s Logic, 80’s OO]
• Common Terms for Analysis and Design
• Tightly Coupled Code
Code Reuse was the Holy Grail, Rarely Achieved

Component Programming Shifted the Focus to Interfaces
[Timeline: 60’s Data, 70’s Logic, 80’s OO, 90’s Comp]
• Code Reuse
• IDE-Based Composition
• Limited Acceptance
Serialization Tied to Code

XML Returns the Focus to Data
[Timeline: 70’s Logic, 80’s OO, 90’s Comp, 00’s XML]
• XML Wrappers for Incompatible Systems
• Industry-Specific Markup Languages
• XML for Persistent Data and Composition
XML Enables Middleware for Application-Specific Data