Xyleme, 2001

A Dynamic Warehouse for the XML data of the Web

Grégory COBENA, INRIA & Xyleme SA ( Gregory.Cobena@inria.fr )

Serge Abiteboul, INRIA & Xyleme SA

http://www-rocq.inria.fr/verso/ http://www.xyleme.com/

Organization

• 1. The Web and XML

• 2. Xyleme

• 3. Data Acquisition and Maintenance

• XML Repository, Semantic Data Integration and Query Processing

• 4. Query Subscription

• Conclusion

1. The Web and XML

The Web today

• Terabytes of data

• A lot of public pages
  – 1 billion in [06/2000]
  – several million servers

• Private web: not publicly available pages

• Deep web: data hidden behind forms

HTML = Hypertext Language

Information System:

Ref   Name    Price
X23   Camera  359.99
R2D2  Robot   19350.00
Z25   PC      1299.99

HTML:

The <b> X23 </b> new camera replaces the <b> X22 </b>. It comes equipped with a flash (worth by itself <i>53.99 $</i>) and provides great quality for only <i>359.99 $</i>.

Text + presentation. Where is the data? Recovering it is hard.

XML = Semistructured Data

Information System:

Ref   Name    Price
X23   Camera  359.99
R2D2  Robot   19350.00
Z25   PC      1299.99
...

XML:

<product-table>
  <product reference="X23">
    <designation> camera </designation>
    <price unit="Dollars"> 359.99 </price>
    <description> … </description>
  </product>
  <product reference="R2D2">
    <designation> Robot </designation>
    <price unit="Dollars"> 19350 </price>
    <description> … </description>
  </product>
  ...
</product-table>

Data + structure. Semistructured: more flexible. Recovering the data is easy.

XML : Tree Types

• Semantics and structure are in paths
  – product-table/product/reference
  – product-table/product/price

[Figure: tree type of the product table]

product-table
  product
    reference
    designation
    price
    description
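
The idea that semantics live in paths can be made concrete with a small sketch. It assumes Python's standard xml.etree.ElementTree, a hypothetical sample document modelled on the product table, and a helper name `collect_paths` that does not come from the slides. Attributes are treated as one more path step, as the slide does for reference.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document modelled on the product table of the slides.
SAMPLE = """
<product-table>
  <product reference="X23">
    <designation>camera</designation>
    <price unit="Dollars">359.99</price>
  </product>
  <product reference="R2D2">
    <designation>Robot</designation>
    <price unit="Dollars">19350</price>
  </product>
</product-table>
"""


def collect_paths(element, prefix=""):
    """Return the set of label paths found at or below `element`.

    Attribute names are appended as a further path step, so the
    attribute `reference` yields product-table/product/reference.
    """
    path = f"{prefix}/{element.tag}" if prefix else element.tag
    paths = {path}
    for name in element.attrib:
        paths.add(f"{path}/{name}")
    for child in element:
        paths |= collect_paths(child, path)
    return paths


paths = collect_paths(ET.fromstring(SAMPLE))
for p in sorted(paths):
    print(p)
```

Both products contribute the same paths, which is why a set of paths can serve as a compact description of the tree type.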

2. Xyleme: A Dynamic Warehouse for the XML Data of the Web

Xyleme Research

Xyleme project at INRIA (1999-2000): explore XML + Web + DBMS to make the Web a knowledge database

• INRIA
  – Sophie Cluet: databases (OQL…)
  – Serge Abiteboul: semi-structured data + web
  – Guy Ferran: ex O2 Technology

• Mannheim University
  – Guido Moerkotte

• Université d’Orsay
  – Marie Christine Rousset

• CNAM
  – Dan Vodislav

Xyleme Company

• Started September 2000 (25 employees at the end of 2001)

• Market challenges:
  – Few XML documents are available on the Web (because of weak software support)
  – The company is focusing on private XML: press, publishers, financial data, biology…

• Technology:
  – Scalability for large amounts of data
  – Internet (+focus) / Intranet support
  – Monitoring and version management
  – Heterogeneous data integration

Architecture

• Cluster of PCs

• Developed with Linux and C++

• Communications
  – local: CORBA
  – external: HTTP

• Distribution between autonomous machines

• Now Web Services

Functional Architecture

[Figure: functional architecture. Components shown: User Interface, Xyleme Interface, Query Processor, Semantic Module, Change Control, Repository and Index Manager, Loader, Acquisition & Crawler, and Web Interface. The crawler and the web interface face the Internet.]

Architecture

[Figure: physical architecture. Groups of machines connected by Ethernet: Index, Repository, Loader | Query, Change Control and Semantic Integration, and Acquisition and Maintenance. The Internet is shown as a separate layer.]

3. Data Acquisition and Maintenance, Page Importance

Goals

• Discover XML pages on the web that are of interest to customers
  – To do this, crawl the web (HTML + XML)

• Keep them up to date

• Do this under bounded resources:
  – Memory for known URLs
  – Bandwidth

Life Cycle of a page in Xyleme

• The URL of D is discovered as a link in another page (or published by a customer)

• The page scheduler decides to read D
  – The metadata of D is read: type, last_date_update...
  – The document D is loaded

• The document D is (re)read regularly

Main Issues

• Loading of pages
  – we can load up to 5 million pages/day on a standard PC
  – main cost is the Internet connection

• Metadata management (access to disk)

• Page scheduling
  – decide which page to read or refresh next
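
One possible shape for the scheduling decision is sketched below. The slides do not give the policy, so the priority used here (importance multiplied by the time since the last read), the function name `schedule` and the example URLs are all assumptions made for illustration.

```python
import heapq


def schedule(pages, now, budget):
    """Pick up to `budget` URLs to read or refresh next.

    `pages` maps url -> (importance, last_read_time). A page that was
    discovered but never read has last_read_time=None and is treated as
    stale since time 0. The priority importance * staleness is an
    illustrative assumption, not the policy used by Xyleme.
    """
    def priority(item):
        _url, (importance, last_read) = item
        staleness = now if last_read is None else now - last_read
        return importance * staleness

    best = heapq.nlargest(budget, pages.items(), key=priority)
    return [url for url, _ in best]


pages = {
    "http://example.org/a.xml": (0.5, 90),    # important, read recently
    "http://example.org/b.xml": (0.1, 10),    # less important, stale
    "http://example.org/c.xml": (0.4, None),  # discovered, never read
}
chosen = schedule(pages, now=100, budget=2)
print(chosen)
```

The `budget` argument stands in for the bounded bandwidth: only that many pages are fetched per round, whatever the number of known URLs.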

Page Importance

• Definition: Important pages are linked to by important pages

• Offline algorithm (used by Google)

• Our online algorithm (M. Preda, S. Abiteboul, G. Cobena)
  – does not require maintaining graph information
  – faster convergence with focused crawling
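
For reference, the offline computation can be sketched as a power iteration over an explicit link graph. This is the standard textbook approach and not the online algorithm mentioned above; the damping factor, the iteration count and the toy graph are assumptions.

```python
def page_importance(links, damping=0.85, iterations=50):
    """Offline importance by power iteration over a stored link graph.

    `links` maps each page to the list of pages it links to; every link
    target must also appear as a key. `damping` is an assumed parameter.
    """
    pages = list(links)
    n = len(pages)
    score = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        nxt = {p: (1.0 - damping) / n for p in pages}
        for p, targets in links.items():
            if targets:
                # A page shares its importance among the pages it links to.
                share = damping * score[p] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # A page without out-links spreads its score evenly.
                for t in pages:
                    nxt[t] += damping * score[p] / n
        score = nxt
    return score


links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = page_importance(links)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

Note that this version needs the whole `links` graph in memory for every iteration, which is the cost the online algorithm is said to avoid.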

(XML Repository, Semantic Data Integration and Query Processing)

Querying Language

• Today: a mix of OQL and XQL

• We are currently moving to XQuery (which is also a mix of OQL and XQL…)

    Select boss/Name, boss/Phone
    From comp in BusinessDomain,
         boss in comp//Manager
    Where comp/Product contains "Xyleme"
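
To make the semantics of the query concrete, the sketch below evaluates the same selection in Python over two hypothetical documents. It is not the Xyleme query processor: the element names that do not appear in the query (Company, Dept), the data and the function name `run_query` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical documents standing in for the BusinessDomain collection.
DOCS = [
    """<Company><Product>Xyleme warehouse</Product>
         <Dept><Manager><Name>Ada</Name><Phone>555-0100</Phone></Manager></Dept>
       </Company>""",
    """<Company><Product>Other tool</Product>
         <Manager><Name>Bob</Name><Phone>555-0199</Phone></Manager>
       </Company>""",
]


def run_query(docs, keyword):
    """Return (Name, Phone) of managers in documents whose Product mentions keyword."""
    results = []
    for text in docs:                        # From comp in BusinessDomain
        comp = ET.fromstring(text)
        products = comp.findall("Product")   # comp/Product (direct children)
        if not any(keyword in "".join(p.itertext()) for p in products):
            continue                         # Where comp/Product contains keyword
        for boss in comp.iter("Manager"):    # boss in comp//Manager (any depth)
            results.append((boss.findtext("Name"), boss.findtext("Phone")))
    return results


rows = run_query(DOCS, "Xyleme")
print(rows)
```

The difference between `findall("Product")` and `iter("Manager")` mirrors the difference between `/` (child) and `//` (descendant) in the query.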

Web Heterogeneity

• Semantic domains, e.g., cinema

• Many possible types for data in this domain, many DTDs

• Semantic integration
  – one abstract DTD for the domain
  – gives the illusion that the system maintains a homogeneous database for this domain

1 domain = 1 abstract DTD

Indexing

• Standard inverted index
  – word → documents that contain this word

• Xyleme index
  – word → elements that contain this word (document + element identifier)

• Goal: more work can be performed without accessing data
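
An element-level inverted index of this kind can be sketched as follows. The numbering of elements (position in document order), the tokenisation and the function name `build_index` are assumptions; the slides only state that postings carry a document identifier and an element identifier.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_index(documents):
    """Map each word to the set of (document id, element id) containing it.

    Element ids are positions in document order, an assumed numbering.
    Only the text placed directly inside an element is indexed for it.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        root = ET.fromstring(text)
        for elem_id, elem in enumerate(root.iter()):
            for word in re.findall(r"\w+", (elem.text or "").lower()):
                index[word].add((doc_id, elem_id))
    return index


documents = {
    1: "<product><designation>camera</designation>"
       "<description>camera with flash</description></product>",
    2: "<product><designation>robot</designation></product>",
}
index = build_index(documents)

# Elements that contain both words, found without opening any document.
both = index["camera"] & index["flash"]
print(sorted(both))
```

Because postings identify elements rather than whole documents, a conjunction such as "camera and flash in the same element" is resolved on the index alone.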

4. Change Control

The Web changes all the time

• Data acquisition + maintenance
  – keep the warehouse up to date

• Version management
  – representation and storage of changes

• Change monitoring
  – query subscription

Subscription Language

• SQL-like language based on ‘atomic events’.

• Combines the use of monitoring queries and continuous queries.

• The language can be extended by adding new types of atomic events.

• Uses the XML Query Language for continuous queries: “Querying the XML Documents of the Web”, V. Aguilera, S. Cluet, F. Boiscuvier, Tech. Report

Example

    subscription myPaintings
      % what are the new painting entries in Musee d’Orsay site
      monitoring newPainting
        select URL
        where URL extends www.musee-orsay.fr/*
          and <painter> contains "Monet"

      % manage the changes in the expositions
      continuous delta Exposition
        select ... from ... where
        when monthly

      notify daily   % send me a daily report

The two conditions of the where clause in the monitoring query are atomic events.

Step 1: Atomic Event Detection

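[Figure: a document d passes through the loading modules (metadata manager, HTML parser, XML loader) at 5 million pages/day. The labels d, d/46 and d/46,67 show the atomic events attached to the document along the way; documents and alerts are then sent to complex event detection.]

• atomic event 46: URL matches pattern www.musee-orsay.fr/*

• atomic event 67: XML document contains the tag <painter> with the value “Monet”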
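
Detection of these two atomic events can be sketched in a few lines. The sketch merges into one function what the figure spreads over several loading modules, and the helper names (`url_matches`, `tag_contains`, `detect`) and the sample document are assumptions; only the event identifiers 46 and 67 and their conditions come from the slide.

```python
import fnmatch
import xml.etree.ElementTree as ET


def url_matches(pattern):
    """Atomic event on the URL: shell-style pattern match."""
    return lambda url, root: fnmatch.fnmatchcase(url, pattern)


def tag_contains(tag, value):
    """Atomic event on the content: some <tag> element contains `value`."""
    return lambda url, root: any(
        value in "".join(e.itertext()) for e in root.iter(tag)
    )


# The two atomic events of the slide, keyed by their identifiers.
ATOMIC_EVENTS = {
    46: url_matches("www.musee-orsay.fr/*"),
    67: tag_contains("painter", "Monet"),
}


def detect(url, xml_text):
    """Return the sorted ids of the atomic events raised by one document."""
    root = ET.fromstring(xml_text)
    return sorted(i for i, test in ATOMIC_EVENTS.items() if test(url, root))


doc = "<painting><painter>Claude Monet</painter></painting>"
events = detect("www.musee-orsay.fr/paintings/1.xml", doc)
print(events)
```

The result plays the role of the annotation d/46,67 in the figure: the document travels on together with the list of atomic events it raised.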

Step 2: Complex Event Detection

[Figure: the HTML parser and the XML loader feed complex event detection.]

• complex event 12: 67 & 46 (XML document contains the tag <painter> with value “Monet” and URL matches pattern www.musee-orsay.fr/*)

Millions of alerts on pages per day. Millions of subscriptions.

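Step 3: Notification Processor

[Figure: notification processing. Components and labels shown: complex event detection, alerts, triggers, clock, continuous queries and Reporter, with flows labelled notification/monitoring and notification/results. Millions of notifications/day.]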

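Architecture

[Figure: architecture of the subscription system. Components shown: Web Browser, Subscription Manager, Xyleme Subscription Manager, Xyleme Alerter, Complex Event Detection, Trigger Engine, Reporter, Xyleme Reporter, Xyleme Query Processor, and two components labelled SQL. Documents are the input flow.]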

Complex Events Algorithm

• The formal problem is NP-hard

• We proposed several possible algorithms

• Experimental (simulation) values proved the effectiveness of our solutions

• The Hash-Tree based algorithm is well suited for our application:
  – 10 million Complex Events
  – 1 million Atomic Events
  – 100 Atomic events detected per document

0.8 ms to process a document. ~2 million documents per day.
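
The matching problem (find every complex event whose atomic events were all detected on a document) can be sketched with a tree of hash tables. This is written in the spirit of a hash-tree only: the slides do not specify the data structure, so the layout below, the class name `Node` and the functions `build` and `match` are assumptions. Complex events 13 and 14 are invented for the example; event 12 is the one from the slides.

```python
class Node:
    def __init__(self):
        self.children = {}     # atomic event id -> Node (a hash table)
        self.complex_ids = []  # complex events that end at this node


def build(complex_events):
    """Build the tree from a dict: complex id -> iterable of atomic ids."""
    root = Node()
    for cid, atoms in complex_events.items():
        node = root
        for a in sorted(set(atoms)):  # one canonical order per complex event
            node = node.children.setdefault(a, Node())
        node.complex_ids.append(cid)
    return root


def match(root, detected):
    """Return ids of complex events whose atomic events are all in `detected`."""
    atoms = sorted(set(detected))
    found = []

    def walk(node, start):
        found.extend(node.complex_ids)
        # Only detected events are tried, each with one hash lookup.
        for i in range(start, len(atoms)):
            child = node.children.get(atoms[i])
            if child is not None:
                walk(child, i + 1)

    walk(root, 0)
    return sorted(found)


tree = build({12: [67, 46], 13: [46], 14: [46, 99]})
hits = match(tree, [46, 67])
print(hits)
```

The work per document depends on the atomic events detected on that document rather than on the total number of subscriptions, which is the property that matters at the scale quoted above.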

Alerters

• Each Alerter can be viewed as a plug-in that acts on a document flow.

• All sorts of Atomic events can be detected: URL pattern detection, Keywords, XPath expressions, Page rank…

• Can be distributed.

Some Advanced Alerts

• Process document flow (single pass)

• Full strings
  – Context stack
  – Reversed look-up

• XML alerts
  – Reversed XPath expressions
  – Dual context stack for ‘/’ and ‘//’
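
A single-pass XML alert with a context stack can be sketched as below. It handles only simple path expressions made of `/` and `//` steps with tag names, keeps one stack of open elements rather than the dual stack mentioned on the slide, and matches each expression from its last step backwards. All function names are assumptions, and this is not the Xyleme implementation.

```python
import io
import re
import xml.etree.ElementTree as ET


def parse(expr):
    """Split '/a//b' into [('/', 'a'), ('//', 'b')], i.e. (axis, tag) steps."""
    return re.findall(r"(//|/)([^/]+)", expr)


def matches(steps, stack):
    """Match the steps against the stack of open tags, last step first."""
    def back(s, k):
        # s: index of the step to match; k: stack position it must match.
        if stack[k] != steps[s][1]:
            return False
        axis = steps[s][0]
        if s == 0:
            return k == 0 if axis == "/" else True
        if axis == "/":
            return k >= 1 and back(s - 1, k - 1)
        return any(back(s - 1, j) for j in range(k))

    return back(len(steps) - 1, len(stack) - 1)


def alerts(xml_text, expressions):
    """Return the expressions that fire, in one pass over the document."""
    compiled = {e: parse(e) for e in expressions}
    stack, fired = [], []
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
            fired += [e for e, steps in compiled.items() if matches(steps, stack)]
        else:
            stack.pop()
    return fired


doc = "<museum><room><painting><painter>Monet</painter></painting></room></museum>"
fired = alerts(doc, ["/museum//painter", "/museum/painter", "//painting/painter"])
print(fired)
```

Starting from the last step means that most expressions are rejected after comparing a single tag with the top of the stack.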

Versions

• Objectives:
  – Temporal queries (persistent identification of nodes)
  – Version some documents or some sites (store a ‘delta’)
  – Change monitoring (query changes)

• We proposed a representation of changes: “Change-Centric Management of Versions” (VLDB 2001)

• We developed a diff algorithm for XML: “Detecting Changes in XML Documents”, G. Cobena, S. Abiteboul, A. Marian, ICDE 2002 (San Jose)
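
The role of persistent node identifiers in storing a delta can be illustrated with a deliberately simplified sketch. Each version is reduced to a mapping from a persistent identifier to a node's text; the three-part delta layout and the function names are assumptions, and this is neither the change representation nor the diff algorithm of the cited papers.

```python
def delta(old, new):
    """Describe the change from `old` to `new`.

    Both versions map a persistent node identifier to the node's text.
    """
    return {
        "deleted": {k: old[k] for k in old.keys() - new.keys()},
        "inserted": {k: new[k] for k in new.keys() - old.keys()},
        "updated": {
            k: (old[k], new[k])
            for k in old.keys() & new.keys()
            if old[k] != new[k]
        },
    }


def apply_delta(old, d):
    """Rebuild the new version from the old version and a delta."""
    new = {k: v for k, v in old.items() if k not in d["deleted"]}
    new.update(d["inserted"])
    new.update({k: after for k, (_before, after) in d["updated"].items()})
    return new


v1 = {"n1": "camera", "n2": "359.99", "n3": "old description"}
v2 = {"n1": "camera", "n2": "349.99", "n4": "new accessory"}
d = delta(v1, v2)
print(d)
```

Because identifiers persist across versions, the delta itself can be queried (for example, "which prices were updated?"), which is what change monitoring needs.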

Conclusion & Perspectives

• Focus crawling on important pages
  – Refine the notion of importance
  – Improve the discovery of important pages

• Improve change control accuracy
  – Semantic web
  – Real-time advanced processing

Thank you