organization

Xyleme, January 2001 -- Zurich 1

A Dynamic Warehouse for the XML data of the Web

Serge AbiteboulINRIA & Xyleme SA

[email protected] [email protected]://www-rocq.inria.fr/verso http://www.xyleme.com

2

Organization

• The Web and XML• Xyleme• 1. Data Acquisition and Maintenance• 2. XML Repository• 3. Semantic Data Integration• 4. Query Processing• 5. Query Subscription• Conclusion


The Web and XML

4

The Web today

• Terabytes of data

• Private web: not publicly available pages

• Deep web: data hidden behind forms

• A lot of public pages– 1 billion in [06/2000] – several millions of servers

5

The Web today

• Browsing• Search engines

– Google indexes more than 1 billion pages 11/00

– in: list of words

– out: sorted list of URLs• based on occurrence of

words in documents

• based on the link structure of the web

6

The Web today

• Queries: keywords to retrieve URLs– Imprecise

– Query results cannot be directly processed

– Difficult to extract data of interest

• Applications: based on hand-made wrappers– Expensive

– Incomplete

– Short-lived, not adapted to the Web constant changes

7

The Coming of XML

• HTML– comes from SGML– hypertext language– fixed number of tags– content and

presentation are mixed– very difficult to extract

data from a page

– old standard

• XML– also

– semistructured data

– not fixed

– not mixed

– very easy

– new standard

8

HTML = Hypertext Language

Ref Name PriceX23 Camera 359.99 R2D2 Robot 19350.00Z25 PC 1299.99

Information System

HTML

The X23 new camera replaces the X22 . It comes equipped with a flash (worth by itself 53.99 $) and provides great quality for only 359.99 $.

Text + presentationWhere is the data ?

hard

9

XML = Semistructured Data

Ref Name PriceX23 Camera 359.99 R2D2 Robot 19350.00Z25 PC 1299.99...

Information System

<product-table>< product reference=”X23"> <designation> camera </designation> <price unit=Dollars> 359.99 </price> <description> … </description></product>< product reference=”R2D2"> <designation> Robot </designation> <price unit=Dollars> 19350 </price> <description> … </description>...</product-table> XML

Data + StructureSemistructured: more flexible

easy

10

XML : Tree Types

• Semantics and structure are in paths– product-table/product/reference– product-table/product/price

product

designation descriptionprice

reference

product-table

11

XML

• Very active/noisy field - standards– schema (XML schema), stylesheet (XSL), resource

description (RDF...)– WML (wap), MathML, SMIL (multimedia), RSS

(news), RDF (metadata)...

• How fast will XML conquer the web? – so far rather slow (about 1% now of the visible

web; much more in intranets)– much faster since the arrival of Explorer 5.5


A Dynamic Warehouse for the XML Data of the Web

Xyleme

13

Xyleme

• Warehouse– Xyleme stores huge quantities of data (teraB)– Xyleme is not a search engine (only index) or a

mediator (only virtual data)

• XML– Xyleme is focused on XML, i.e., trees

• Dynamic– Xyleme is interested in data evolution/changes

14

Xyleme

• September 1999: a group of researchers from – Inria Rocquencourt, Verso Group– U. of Mannheim, Database Group– U. of Orsay, IASI Group– CNAM, Vertigo Group

• September 2000: creation of a start-up

• November 2000: about 15 people

15

Corporate Information Today

Web

Information System

manual searches using browsers

ad-hoc applications written by web-experts tailored for specific tasks and data.

I.e. inflexible and expensive

manual updates

16

Corporate Information with Xyleme

Web

Information System

Repository

Query Engine

Xyleme-warehouse

Crawling & interpreting data

publishing

updatesqueries

searches

17

Five Challenges

1. Data Acquisition and Maintenancediscover data of interest and maintain it up to date

2. Repositorystore this data and index it so that it can be

processed efficiently

3. Query Processingsupport efficiently an SQL-style query language

18

Five Challenges - continued

4. Semantic IntegrationUnderstand DTD and tags, partition the Web into

semantic domains, provide a simple view of each domain

5. Change ControlMonitor the web and offer services such as Query

Subscription

19

Challenges - continued

• Scale to the web

• Size of data: millions/billions of pages

• Size of index: terabytes

• Number of customers– thousands of simultaneous queries– millions of subscriptions

20

Repository and Index Manager

Change Control

Query Processor

Semantic Module

User Interface

Xyleme Interface

Functional Architecture

-------------------- I N T E R N E T -----------------------

Web Interface

Acquisition& Crawler

Loader

21

Architecture

• Cluster of PCs

• Developed with Linux and C++

• Communications– local: Corba– external: HTTP

• Distribution between autonomous machines

22

Index Index Index

-------------------- I N T E R N E T -----------------------

Change Control andSemantic

Integration

Change Control andSemantic

Integration

ETHERNET

Repository Repository RepositorryRepository

Loader |Query Loader |Query

Architecture

Acquisition andMaintenance

Acquisition andMaintenance


1. Data Acquisition and Maintenance

24

Goals

• Discover XML pages on the web that are of interest for customers– For this crawl the web (HTML+XML)

• Maintain them up to date

• Do this under bounded resources

25

Life Cycle of a page in Xyleme

• The URL of D is discovered as a link in another page (or published by a customer)

• The page scheduler decides to read D– The meta data of D is read

• type, last_date_update...

– The document D is loaded

• The document D is re(read) regularly

26

Main Issues

• Loading of pages– we can load up to 5 millions of pages/day on a

standard PC– main cost is Internet connection

• Metadata management

• Page scheduling– decide which page to read or refresh next

27

Metadata Management

• Example: management of the link matrix

– page i points to page j– for 1 billion URL, about 30 children/url– matrix has 30.109 edges (very sparse)

• For each page that is read, – find the IDs of the 30 children – 50 pages/second 1500 database calls/second

28

Page Scheduling

• Decide which page to read next – discovery (read first) and refresh (read again)

Based on:• Importance of the page

– read often important pages– also used to order query results

• Change rate of the page– don’t read a page that is probably up-to-date

29

Page Scheduling for Refresh

• Determine refresh frequency fi for each page i to minimize a cost function

• Minimize Under the constraint

1…N costi(fi) G 1…N fi

where costi(fi), penalty for page i, depends on the estimated importance and staleness of the page

30

Cost Function

costi(fi), penalty for page i, depends on the estimated importance and staleness of the page

• Importance of the page – link structure– pub/sub

• Staleness of the data– penalty for being out of date– penalty for aging

31

Evaluation of Change Rate• Based on the Last Date of Change

– provided by HTTP header of the page

– in general reliable but …

• Based on the number M of changes detected the last N times the pages was refreshed– limits: do not know the actual number of changes

First one more precise

32

Page Importance: Link Structure

• Intuition: a page is important if many important pages reference it : fixpoint

• Link Matrix– M(i,j) if page i refers to page j– M is a 109 109 matrix– out(i) : the outdegree of page i

• Fixpoint– W0(k) = 1/N (initialization)– Wm(k) = i [M(i,k) * Wm-1(i)/out(i) ]

33

Page Importance : Algorithm

Wm M(i,-)Wm-1(k)

+=

M(i,-) is stored as a listcomputation of Wm (line/line)for i = 1 to N do

[ read M(i,-) ; process the line ]

k

Wm(k)

out(k)

34

Page Importance: Fixpoint

• Techniques for fixpoint convergence

• Some results – convergence is fast (OK after 10)– simple precision suffices– possible on a standard PC

• Distribution and incremental evaluation

35

Page Importance: Refresh

Standard importancefor HTML/XMLpages

HTML pages areuseful onlyto discover XML

Taking pub/subinto account

circle = HTML square = XML

triangle = pub/sub


2. XML Repository

37

Storing XML documents

• Relational store (e.g., Oracle 8i)– binary long objects: not possible to access

directly elements– very typed data and Tables: efficient– otherwise: too many joins and inefficient

• Object database store (ODMG)– better adapted

• XML Native storage: Natix

38

Natix Repository

• Goal– minimize I/O for direct access and scanning– efficient direct accesses using indexing– good compaction but not at the cost of access

• Efficient storage of trees – use fixed length storage pages – variable length records inside a page

• Main issue: tree balancing

39

Tree Balancing

Record 1

Record 3Record 2

40

Tree Balancing - continued

Large collections may useseveral records


3. Semantic Data Integration

42

Web Heterogeneity

• Semantic domains, e.g., cinema

• Many possible types for data in this domain, many DTDs

• Semantic Integration– one abstract DTD for the domain– gives the illusion that the system maintains an

homogeneous database for this domain

1 domain = 1 abstract DTD

43

Relationship is not visible unless one knows the relationships between story and tale.

Cluster DTDs and Documents

44

Discover the Domains

Cluster DTDs sharing similar « tags » using data mining techniques (frequent item sets) and linguistic tools (e.g., thesaurus, heuristics to extract words from composite words or abbreviations, etc.)

to obtain domains

cdtd1 .cdtd2 .cdtd3 .

adtd1

adtd2

adtd4

Many concrete DTDs

Fewer abstract DTDs

cdtd7 .cdtd8 .cdtd9 .cdtd10 .

cdtd4 .

cdtd5 .cdtd6 .

45

Wordnet: Useful Relationships

• Synonyms One concept, two terms

•Hypernyms / Hyponyms two concepts linked through generalization/specialization - e.g., vehicle & car

•Meronyms / Holonyms two concepts linked through composition/inclusion - e.g., country & city

46

Choose an Abstract DTD / Domain

• Automatically– The analysis of a cluster, leads to « clusters of

tags »– Use a thesaurus (e.g., Wordnet) to build a

hierarchy from the clusters of tags

• Manually– Performed by a domain expert

• Hybrid

47

Mapping Concrete to Abstract

• For each concrete DTD in a domain, find how it relates to the abstract DTD:– Associate concrete tags to abstract tags using linguistic tools– Provide relationships between paths in the concrete and abstract DTD

E.g.: cdtd3/œuvre/nom/prénom and

adtd2/book/author/name/firstname

• Possibly automatic, manual or hybrid


4. Query Processing

49

Xyleme Query Language• Today: A mix of OQL and XQL• Tomorrow: the future W3C standard • Example

select product/name, product/price

from doc in catalogue,

product in doc/product

where product//components contains “flash”

and product/description contains “camera”

50

Data Distribution

• Cluster of documents = physical collection of documents ( semantic domain)

Distribution

• Storage machine– in charge of a cluster of documents

• Index machine– index for a cluster

51

Step 0: Indexing

• Standard inverted index– word documents that contain this word

• Xyleme index– word elements that contain this word

document + element identifier

• Goal: more work can be performed without accessing data

52

Step 1: Localization

• Query on an abstract dtd

• Localization of machines that host concrete DTDs that will participate in the query

global query on abstract dtd

union of querieson local machines

local queries

catalogue/product/pricerelevant for

machine 56machine 45

53

Step 2: Optimization

• Algebraic rewriting

• Linear search strategy based on simple heuristics– use in memory indexes– minimize communication

• Optimization of the global plan

• Optimization of the local plans

54

Step 3: Execution

A plan usually consists of:1. parallel translation from abstract queries to concrete

patterns on the relevant index machines

2. parallel index scans to identify the relevant elements for a concrete pattern

3. parallel construction of resulting elements

4. pipeline evaluation (i.e., no intermediate data structure)

Note: 2. Requires smart indexes

55

Execution: Abstract2Concrete

For each concrete pattern,

the local plan is optimized dynamically

for each concrete patternscan the element ids

&234 &177

for catalogue/product/pricescan relevant concrete pattern

d1//camera/price d2/product/cost d3/piano/price ...

56

Element Identifiers

• Essential for query processing

• Identifier = (preorder rank/postorder rank)– X ancestor of Y <=>

pre(X) < pre(Y) and post(X) > post(Y)

– E.g., 2<5 and 4 >2 => (2,4) ancestor (5,2)

A B C

D E F

G

1

2

3 4

5

6

71

2

3

4

5

6

7

Text

57

Patterns and Indexes

product

name description

“camera”

(d1, 12, 200), (d1, 201, 400)

(d1,1,11), (d1, 205,224) (d1,228, 237)

(d1, 229), (d2, 14)

Heuristics: to perform joins, start with the smallest cardinality (to minimize size intermediary results)


5. Change Control

59

The Web changes all the time

• Data acquisition + maintenance – keep the warehouse up-to-date

• Version management– representation and storage of changes

• Change monitoring– query subscription

60

Versions

• Version some documents or some sites

• Version some continuous queries

continuous query: query that is evaluated regularly

get each Monday the list of movies showing in Paris

61

Representing Versions: Deltas

• Version storage– current document– persistent identifiers for elements– description of changes - completed deltas

• Deltas are XML documents

• Changes can be processed like other data– exchanged: send me changes since June 1st!– queried: what are the products inserted since 2/1/99?

62

Completed Delta

<delta><unit delta previousversion=“1” thistime=“2”/><delete xid=“11” xid-children=“(17-21)” ><Product> <Name>DVD</Name> <Price>500</Price></Product> </delete><move xid=“16” new_parent=“11” new_position=“2” old_parent=“11” old_position=“1” /><update xid=“3” new_value=“50” old_value=“100” /><update xid=“8” new_value=“100” old_value=“150” /></unit_delta><unit delta>...</unit delta>

</delta>

persistentidentifier

63

Query Subscription• Users may subscribe to certain events, e.g.,

• changes in a page, a set of pages, • changes in pages from a particular semantic domain, containing some specific words or with a

particular DTD • changes of particular elements somewhere (new products in a catalog)

• Users may request to be notified • immediately at the time the event is detected• regularly, e.g., weekly• after a certain number of event detections

64

Examplesubscription myPariscope

% what are the new movie entries in Pariscope site

monitoring newMovies

select URL

where URL extends www.pariscope.fr/movies/*

and new(self)

% manage the changes in the movies showing in Paris

continuous delta Showing

select ... from ... where

when daily

notify daily % send me a daily report

65

Step 1: Atomic Event Detection

HTMLparser

XMLloader

metadatamanagerdocument

& alertsd/46

complexevent detection

atomic event 46: URL matches pattern www.inria.fr/*atomic event 67: XML documentcontains the tag painter

d/46,67

5 millions of pages/day

d

loading

66

Step2: Complex Event Detection

HTMLparser

XMLloader


comple event 12: 67 & 46 (XML document contains the tag painter and URL matches pattern www.inria.fr/*)

Millions of alerts of pages/dayMillions of subscriptions

67

Step 3: Notification Processor

notificationprocessor

continuousqueries

Millions of notifications/day


clock

triggers

alerts

notification/monitoring

notification/results


Conclusion

69

One Question Only

• The web is turning from a large collection of documents into a huge knowledge base

When will I be able to get

the precise knowledge I need?

Database + Knowledge Base + Linguistic + ...


Merci