A Data Stream Publish/Subscribe Architecture with Self-adapting Queries
Alasdair J G Gray and Werner Nutt
School of Mathematical and Computer Sciences, Heriot-Watt University, Edinburgh
4th November 2005




4th Nov 2005, A.J.G. Gray and W. Nutt, CoopIS 2005

Overview

Motivation
Publish/subscribe architecture
Answering a query
Long-lived query plans
Switching between data sources


Motivation

Scenario:
  Streams generated by distributed sensors
  Users are also distributed
  Use data integration to match users to streams
For example,
  Grid monitoring for logging and bookkeeping
  Sensor networks

[Figure: Grid, monitoring data, job progress, bookkeeping]


R-GMA: A Grid Monitoring System

Grid monitoring system that integrates streams of data

Deployed on several Grids
Continuing to be developed as part of the EGEE project
We are developing innovative extensions for R-GMA


Publishing Monitoring Data

Data can be represented in terms of relations with
  Keys: "what" and "where"
  Measurements: the "value"
  Timestamps: "when"
For example, Network ThroughPut. One reading is a tuple in the relation
  NTP (from, to, tool, psize, latency, timestamp)
  ('hw', 'ral', 'ping', 32, 11.1, 2005-06-24-15:05:34)


Consuming Monitoring Data

Users are interested in how the grid changes over time. For example,
1. Latency for large packets sent from hw
2. Links with a low latency as recorded by the PingER tool

These can be expressed as SQL selection queries

$q_1 : \sigma_{\mathit{from}=\text{'hw'} \,\wedge\, \mathit{psize} \geq 1024}(\mathit{NTP})$

$q_2 : \sigma_{\mathit{tool}=\text{'ping'} \,\wedge\, \mathit{latency} \leq 10.0}(\mathit{NTP})$
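As a rough illustration, the two selection queries can be read as filters over a stream of NTP tuples. The sketch below is only one way to write this down in Python: the class name `NTP`, the field name `from_` (needed because `from` is a Python keyword) and all readings except the first are assumptions made for the example, not part of R-GMA.

```python
from typing import Iterable, Iterator, NamedTuple


class NTP(NamedTuple):
    """One reading of the NTP relation; field order follows the slide."""
    from_: str       # 'from' is a Python keyword, hence the underscore
    to: str
    tool: str
    psize: int
    latency: float
    timestamp: str


def q1(stream: Iterable[NTP]) -> Iterator[NTP]:
    """Selection from='hw' AND psize>=1024."""
    return (t for t in stream if t.from_ == "hw" and t.psize >= 1024)


def q2(stream: Iterable[NTP]) -> Iterator[NTP]:
    """Selection tool='ping' AND latency<=10.0."""
    return (t for t in stream if t.tool == "ping" and t.latency <= 10.0)


# Only the first reading comes from the slides; the others are made up.
readings = [
    NTP("hw", "ral", "ping", 32, 11.1, "2005-06-24-15:05:34"),
    NTP("hw", "ral", "udp", 2048, 4.2, "2005-06-24-15:05:40"),
    NTP("ral", "hw", "ping", 32, 7.5, "2005-06-24-15:05:41"),
]

q1_answer = list(q1(readings))
q2_answer = list(q2(readings))
```

With these readings, `q1_answer` holds only the udp reading with the 2048 byte packet, and `q2_answer` holds only the ping reading sent from ral.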


Data Integration in a Publish/Subscribe Architecture

Local as View Approach
  Consumers pose a query over the schema to request streams
  Producers describe their stream using a view on the schema
  Queries and views are selections over a single relation

[Figure: Producers, Registry, Consumers, data streams]


What is an Answer to a Query?

Global relations contain no tuples (virtual relation)
Need to translate into query over sources
An answer stream should be
  Sound
  Complete
  Duplicate free
  Weakly ordered: all tuples that share the same key value will be in timestamp order
Order in general is difficult in a distributed setting
Weak order sufficient for more complex queries such as aggregates
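The duplicate-free and weak-order properties can be checked mechanically on a finite prefix of a stream. The following is a minimal sketch, assuming that a tuple is identified by its key value and timestamp and that the measurement can be left out; the function names are illustrative.

```python
from typing import Dict, Hashable, Iterable, Set, Tuple

Reading = Tuple[Hashable, float]  # (key value, timestamp); measurement omitted


def is_duplicate_free(stream: Iterable[Reading]) -> bool:
    """True if no (key, timestamp) pair occurs twice."""
    seen: Set[Reading] = set()
    for reading in stream:
        if reading in seen:
            return False
        seen.add(reading)
    return True


def is_weakly_ordered(stream: Iterable[Reading]) -> bool:
    """True if tuples sharing a key value appear in timestamp order."""
    latest: Dict[Hashable, float] = {}
    for key, timestamp in stream:
        if key in latest and timestamp < latest[key]:
            return False
        latest[key] = timestamp
    return True


# Not globally ordered (5 precedes 1), but ordered within each key.
interleaved = [("a", 0), ("b", 5), ("a", 1), ("b", 6), ("c", 2)]
# Two tuples for key "a" arrive in the wrong order.
swapped = [("a", 1), ("a", 0)]
```

Here `is_weakly_ordered(interleaved)` holds although the stream is not in global timestamp order, while `is_weakly_ordered(swapped)` fails.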


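Query Planning: Consumer Query

Satisfiability used to find relevant producers: a producer is relevant if the conjunction of its view and the query is satisfiable
  q1 ∧ S1: from='hw' ∧ psize≥1024 ∧ from='hw' ∧ tool='udp' is satisfiable, so S1 is relevant for q1
  q1 ∧ S3: from='hw' ∧ psize≥1024 ∧ from='ral' ∧ tool='ping' is unsatisfiable, so S3 is not relevant for q1

Producers:
  S1: from='hw' ∧ tool='udp'
  S2: from='hw' ∧ tool='ping'
  S3: from='ral' ∧ tool='ping'
  S4: from='ral' ∧ tool='udp'
  S5: from='an' ∧ tool='ping'
Consumer queries:
  q1: from='hw' ∧ psize≥1024
  q2: tool='ping' ∧ latency≤10.0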
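The relevance test for the producers and queries on this slide can be sketched as a satisfiability check on conjunctions of simple comparisons. The representation below (at most one atom per attribute, operators `=`, `>=`, `<=`) and all function names are assumptions chosen for the example; they are not taken from R-GMA.

```python
from typing import Dict, List, Tuple, Union

Atom = Tuple[str, Union[str, float]]  # (operator, constant)
Condition = Dict[str, Atom]           # attribute -> atom, read as a conjunction


def holds(value, atom: Atom) -> bool:
    """Does a concrete value satisfy one atom?"""
    op, const = atom
    if op == "=":
        return value == const
    if op == ">=":
        return value >= const
    return value <= const


def atoms_compatible(a: Atom, b: Atom) -> bool:
    """Can a single value satisfy both atoms on the same attribute?"""
    (op_a, val_a), (op_b, val_b) = a, b
    if op_a == "=":
        return holds(val_a, b)
    if op_b == "=":
        return holds(val_b, a)
    if op_a == op_b:
        return True  # two lower bounds or two upper bounds always overlap
    lower, upper = (val_a, val_b) if op_a == ">=" else (val_b, val_a)
    return lower <= upper


def satisfiable(query: Condition, view: Condition) -> bool:
    """The conjunction is satisfiable if no shared attribute is contradictory."""
    shared = query.keys() & view.keys()
    return all(atoms_compatible(query[attr], view[attr]) for attr in shared)


PRODUCERS: Dict[str, Condition] = {
    "S1": {"from": ("=", "hw"), "tool": ("=", "udp")},
    "S2": {"from": ("=", "hw"), "tool": ("=", "ping")},
    "S3": {"from": ("=", "ral"), "tool": ("=", "ping")},
    "S4": {"from": ("=", "ral"), "tool": ("=", "udp")},
    "S5": {"from": ("=", "an"), "tool": ("=", "ping")},
}
Q1: Condition = {"from": ("=", "hw"), "psize": (">=", 1024)}
Q2: Condition = {"tool": ("=", "ping"), "latency": ("<=", 10.0)}


def relevant_producers(query: Condition) -> List[str]:
    return [name for name, view in PRODUCERS.items() if satisfiable(query, view)]
```

Under these assumptions `relevant_producers(Q1)` gives S1 and S2, and `relevant_producers(Q2)` gives S2, S3 and S5.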


How does the Registry find Relevant Producers?

Producer views are stored in a structured format

Satisfiability check can be constructed as an SQL query
  SELECT producers
  WHERE NOT EXISTS
    (SELECT *
     WHERE contradictory condition);
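To make the shape of such a query concrete, here is a small runnable sketch using SQLite from Python. The schema (tables `producer`, `view_atom`, `query_atom`) is hypothetical and stores only equality atoms; the structured format actually used by the Registry is not given on the slide.

```python
import sqlite3

# Hypothetical registry schema: one row per equality atom "attr = val" of a view.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE producer (name TEXT PRIMARY KEY);
    CREATE TABLE view_atom (producer TEXT, attr TEXT, val TEXT);
    CREATE TABLE query_atom (attr TEXT, val TEXT);
""")

views = {
    "S1": {"from": "hw", "tool": "udp"},
    "S2": {"from": "hw", "tool": "ping"},
    "S3": {"from": "ral", "tool": "ping"},
    "S4": {"from": "ral", "tool": "udp"},
    "S5": {"from": "an", "tool": "ping"},
}
for name, view in views.items():
    conn.execute("INSERT INTO producer VALUES (?)", (name,))
    conn.executemany(
        "INSERT INTO view_atom VALUES (?, ?, ?)",
        [(name, attr, val) for attr, val in view.items()],
    )

# Equality part of q2 (tool='ping'). The latency bound is left out because
# no view in the example restricts latency.
conn.execute("INSERT INTO query_atom VALUES ('tool', 'ping')")

RELEVANT_PRODUCERS = """
    SELECT p.name
    FROM producer AS p
    WHERE NOT EXISTS (
        SELECT *
        FROM view_atom AS v
        JOIN query_atom AS q ON q.attr = v.attr
        WHERE v.producer = p.name
          AND v.val <> q.val      -- contradictory condition
    )
    ORDER BY p.name
"""
relevant = [row[0] for row in conn.execute(RELEVANT_PRODUCERS)]
```

A producer is returned exactly when none of its view atoms contradicts a query atom, which mirrors the NOT EXISTS pattern on the slide.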


Scalability is an Issue

Problem: Every consumer contacting every producer of interest does not scale
Even a small Grid of fewer than a dozen sites has problems
Grids may contain thousands of resources
For example, the Large Hadron Collider Computing Grid (LCG)


Republishers Allow the System to Scale

A republisher
  Consumes answers to a selection query
  Merges "trickles" into streams
  Publishes
    Answer stream
    Latest-state answer
    History
Problem: Choice in where to obtain information

[Figure: Producer S1 and Producer S2 feeding a Republisher]
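The three things a republisher publishes can be pictured with a small in-memory sketch. Everything here is an assumption for illustration: the class and method names, the reduction of a tuple to (key, measurement, timestamp), and the use of callbacks to stand in for subscribed consumers.

```python
from typing import Callable, Dict, List, Tuple

Reading = Tuple[str, float, int]  # (key, measurement, timestamp)


class Republisher:
    """Merges incoming readings and offers stream, latest-state and history."""

    def __init__(self) -> None:
        self.history: List[Reading] = []            # every reading consumed so far
        self.latest_state: Dict[str, Reading] = {}  # most recent reading per key
        self.subscribers: List[Callable[[Reading], None]] = []

    def subscribe(self, callback: Callable[[Reading], None]) -> None:
        self.subscribers.append(callback)

    def consume(self, reading: Reading) -> None:
        key, _, timestamp = reading
        self.history.append(reading)
        current = self.latest_state.get(key)
        if current is None or timestamp >= current[2]:
            self.latest_state[key] = reading
        for callback in self.subscribers:  # forward on the answer stream
            callback(reading)


republisher = Republisher()
answer_stream: List[Reading] = []
republisher.subscribe(answer_stream.append)

# Two producers each contribute a "trickle"; the values are made up.
for reading in [("hw-ral", 11.1, 1), ("ral-hw", 7.5, 2), ("hw-ral", 9.8, 3)]:
    republisher.consume(reading)
```

After these three readings the history and the answer stream both hold all of them, while the latest-state answer keeps one reading per key.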


Query Planning in the Presence of Republishers

Find all relevant publishers
Rank according to data provided
Meta query plan contains choice
Query plan uses one of R1 or R3

Producers:
  S1: from='hw' ∧ tool='udp'
  S2: from='hw' ∧ tool='ping'
  S3: from='ral' ∧ tool='ping'
  S4: from='ral' ∧ tool='udp'
  S5: from='an' ∧ tool='ping'
Republishers:
  R1: from='hw'
  R2: from='ral'
  R3: from='hw' ∧ tool='ping'
Consumer queries:
  q1: from='hw' ∧ psize≥1024
  q2: tool='ping' ∧ latency≤10.0
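One way to picture the ranking step is the sketch below. It makes several simplifying assumptions that go beyond the slide: conditions contain only equality atoms (the psize and latency bounds are dropped because no view restricts them), a republisher is taken to deliver everything its view describes, and a producer is never used to cover another publisher. The function names are illustrative.

```python
from typing import Dict, List

Condition = Dict[str, str]  # equality atoms only: attribute -> required value

PRODUCERS: Dict[str, Condition] = {
    "S1": {"from": "hw", "tool": "udp"},
    "S2": {"from": "hw", "tool": "ping"},
    "S3": {"from": "ral", "tool": "ping"},
    "S4": {"from": "ral", "tool": "udp"},
    "S5": {"from": "an", "tool": "ping"},
}
REPUBLISHERS: Dict[str, Condition] = {
    "R1": {"from": "hw"},
    "R2": {"from": "ral"},
    "R3": {"from": "hw", "tool": "ping"},
}


def relevant(view: Condition, query: Condition) -> bool:
    return all(view[a] == query[a] for a in view.keys() & query.keys())


def offered(view: Condition, query: Condition) -> Condition:
    """Condition describing what a relevant publisher contributes to the query."""
    return {**view, **query}


def covers(general: Condition, specific: Condition) -> bool:
    """True if every tuple matching `specific` also matches `general`."""
    return general.items() <= specific.items()


def meta_query_plan(query: Condition) -> List[List[str]]:
    """Each inner list is a choice between equivalent publishers."""
    publishers = {**PRODUCERS, **REPUBLISHERS}
    offers = {n: offered(v, query) for n, v in publishers.items() if relevant(v, query)}
    reps = sorted(n for n in offers if n in REPUBLISHERS)
    # Keep republishers that no other republisher strictly covers.
    maximal = [
        r for r in reps
        if not any(covers(offers[o], offers[r]) and not covers(offers[r], offers[o])
                   for o in reps)
    ]
    groups: List[List[str]] = []
    for r in maximal:
        for group in groups:
            if offers[group[0]] == offers[r]:  # equivalent for this query
                group.append(r)
                break
        else:
            groups.append([r])
    # Producers not covered by any republisher are contacted directly.
    for p in sorted(n for n in offers if n in PRODUCERS):
        if not any(covers(offers[r], offers[p]) for r in reps):
            groups.append([p])
    return groups


def query_plan(query: Condition) -> List[str]:
    """Pick one publisher from every choice in the meta query plan."""
    return [group[0] for group in meta_query_plan(query)]
```

For the equality part of q2, `meta_query_plan({"tool": "ping"})` yields a choice between R1 and R3, followed by R2 and the uncovered producer S5, which matches the statement that the query plan uses one of R1 or R3.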


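Weak Order is not Guaranteed

Tuples for same channel
(3) published before (8)
Arrive at consumer in wrong order

Producers:
  S1: from='hw' ∧ tool='udp'
  S2: from='hw' ∧ tool='ping'
  S3: from='ral' ∧ tool='ping'
  S4: from='ral' ∧ tool='udp'
Consumer query:
  q2: tool='ping' ∧ latency≤10.0

[Figure: tuples for q2 are routed through two republishers, one restricted to latency≤5.0 and one to latency>5.0; one path crosses a slow link, so tuples published in the order (3) (8) reach the consumer in the order (8) (3)]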


Generating Well Formed Query Plans

A publisher is relevant for a global query if
1. Conditions are satisfiable, and
2. All measurements that agree on their key values come from the same publisher

The measurement condition can be checked using entailment.

The example on the slide "Query Planning in the Presence of Republishers" was well formed.
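One possible reading of the entailment check is that the query's restriction on a measurement attribute must entail the publisher's restriction on it, so that a publisher never withholds some of the measurements for a key. The sketch below follows that reading; the closed-interval encoding and the names are assumptions made for the example.

```python
import math
from typing import Tuple

Interval = Tuple[float, float]  # closed interval [low, high] on one measurement

UNRESTRICTED: Interval = (-math.inf, math.inf)


def entails(query: Interval, view: Interval) -> bool:
    """query |= view: every value the query allows is also allowed by the view."""
    return view[0] <= query[0] and query[1] <= view[1]


# q2 restricts latency to at most 10.0.
q2_latency: Interval = (-math.inf, 10.0)

# A republisher restricted to latency<=5.0, as in the weak-order example.
low_latency_view: Interval = (-math.inf, 5.0)

well_formed_with_unrestricted = entails(q2_latency, UNRESTRICTED)
well_formed_with_low_latency = entails(q2_latency, low_latency_view)
```

Under this reading, a publisher with no latency restriction passes the check for q2, while the republisher restricted to latency≤5.0 fails it, because readings for the same key with a latency between 5.0 and 10.0 would have to come from elsewhere.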


Plans Need to be Maintained

Queries are long-lived
Set of publishers can change
Query plans should reflect changes
What happens when we
  Add a republisher?
  Remove a republisher?


How does a new Republisher affect our Consumers?

Find consumers for which R4 is relevant
Compare R4 to publishers in Meta Query Plan

Producers:
  S1: from='hw' ∧ tool='udp'
  S2: from='hw' ∧ tool='ping'
  S3: from='ral' ∧ tool='ping'
  S4: from='ral' ∧ tool='udp'
  S5: from='an' ∧ tool='ping'
Republishers:
  R1: from='hw'
  R2: from='ral'
  R3: from='hw' ∧ tool='ping'
  R4: TRUE
Consumer queries:
  q1: from='hw' ∧ psize≥1024
  q2: tool='ping' ∧ latency≤10.0


General Case of Adding a Republisher

Republisher relevant for a consumer query, either
1. Republisher is not maximal relevant
     No change in query plans
2. Equivalent Republisher
     Change to the Meta Query Plan
     No change to the Query Plan
3. Covering Republisher
     Change to the Meta Query Plan
     Change to the Query Plan


How does removing a Republisher affect our Consumers?

Find all consumers for which R1 was relevant
Update plans

Producers:
  S1: from='hw' ∧ tool='udp'
  S2: from='hw' ∧ tool='ping'
  S3: from='ral' ∧ tool='ping'
  S4: from='ral' ∧ tool='udp'
  S5: from='an' ∧ tool='ping'
Republishers:
  R1: from='hw'
  R2: from='ral'
  R3: from='hw' ∧ tool='ping'
  R4: TRUE
Consumer queries:
  q1: from='hw' ∧ psize≥1024
  q2: tool='ping' ∧ latency≤10.0


General Case of Dropping a Republisher

Republisher relevant for a consumer query, either
1. Republisher is not maximal relevant
     No change in query plans
2. Equivalent Republisher
     Change to the Meta Query Plan
     May need to change the Query Plan
3. Covering Republisher
     Change to the Meta Query Plan
     Change to the Query Plan
     Requires some method to patch the plan


Planning a Republisher Query

Applying Consumer planning techniques results in a problem

Producers:
  S1: from='hw' ∧ tool='udp'
  S2: from='hw' ∧ tool='ping'
  S3: from='ral' ∧ tool='ping'
  S4: from='ral' ∧ tool='udp'
  S5: from='an' ∧ tool='ping'
Republishers:
  R1: from='hw'
  R2: from='ral'
  R3: from='hw' ∧ tool='ping'
  R4: TRUE

Problem:
  Hierarchy contains cycles
  Republishers disconnected from Producers


Desirable Properties for a Hierarchy

Correctness: streams answer queries
Cycle freeness: loops can lead to duplicates
Uniqueness: hierarchy defined for a set of publishers
Local planning: Publishers and Consumers only need to communicate with the Registry


Generating Well Formed Hierarchies

Need a stricter relevance criterion
R1 can consume from R2 iff
1. Everything R2 offers is relevant to R1, and
2. R1 offers something R2 does not.

Can be checked by entailment
Ensures
  No loops in the hierarchy
  Republishers connected to the Producers
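For views that are conjunctions of equality atoms, the stricter criterion amounts to strict containment between views. The sketch below covers only republisher-to-republisher edges and assumes equality-only views; how producers are attached is not modelled. The function names are illustrative.

```python
from typing import Dict, List, Tuple

View = Dict[str, str]  # equality atoms only; the empty dict stands for TRUE

REPUBLISHERS: Dict[str, View] = {
    "R1": {"from": "hw"},
    "R2": {"from": "ral"},
    "R3": {"from": "hw", "tool": "ping"},
    "R4": {},  # TRUE
}


def entails(specific: View, general: View) -> bool:
    """specific |= general: every atom of `general` also appears in `specific`."""
    return general.items() <= specific.items()


def can_consume_from(consumer: View, source: View) -> bool:
    # 1. Everything the source offers is relevant to the consumer, and
    # 2. the consumer offers something the source does not.
    return entails(source, consumer) and not entails(consumer, source)


edges: List[Tuple[str, str]] = sorted(
    (consumer, source)
    for consumer, consumer_view in REPUBLISHERS.items()
    for source, source_view in REPUBLISHERS.items()
    if can_consume_from(consumer_view, source_view)
)
```

Here `edges` contains the pairs (consumer, source) R1 from R3, and R4 from each of R1, R2 and R3. Because the relation is strict containment, no pair appears in both directions, so the hierarchy has no loops, and R4 is never a source for R1.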


Planning a Republisher Query: 2nd Attempt

Stricter relevance criterion
Republishers only consume from publishers below them
R4 is not relevant for R1

Producers:
  S1: from='hw' ∧ tool='udp'
  S2: from='hw' ∧ tool='ping'
  S3: from='ral' ∧ tool='ping'
  S4: from='ral' ∧ tool='udp'
  S5: from='an' ∧ tool='ping'
Republishers:
  R1: from='hw'
  R2: from='ral'
  R3: from='hw' ∧ tool='ping'
  R4: TRUE


Computing Query Plans

Adding a new Consumer
1. Consumer contacts Consumer Agent
2. Consumer Agent contacts the Registry and receives a list of relevant publishers
3. Consumer Agent constructs Meta Query Plan and Query Plan
4. Consumer Agent contacts Publisher Agents in the Query Plan
5. Publisher starts streaming tuples to Consumer Agent
6. Consumer Agent merges into a single answer stream

Similar approach for adding a publisher

[Figure: Consumer, Consumer Agent, Registry, Producer Agent and two Producers, with arrows numbered 1 to 6 for the steps above]


Maintaining Query Plans

Agents maintain registry entry through a soft state registration mechanism
Registry detects change in publisher set
Poses query over internal database
Informs affected consumers/republishers
Consumer agent considers query plan

[Figure: Consumer, Consumer Agent, Registry, Producer Agent and two Producers]
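Soft state registration generally means that an entry survives only as long as it keeps being refreshed. A minimal sketch of that idea follows; the class name, the `ttl` parameter and the explicit `now` arguments are assumptions for illustration and do not describe the R-GMA Registry interface.

```python
from typing import Dict, List


class SoftStateRegistry:
    """Entries expire unless their agent re-registers within `ttl` time units."""

    def __init__(self, ttl: float) -> None:
        self.ttl = ttl
        self.expires_at: Dict[str, float] = {}

    def register(self, publisher: str, now: float) -> None:
        """Create or refresh an entry; agents are expected to call this periodically."""
        self.expires_at[publisher] = now + self.ttl

    def sweep(self, now: float) -> List[str]:
        """Drop entries that were not refreshed in time and return their names."""
        expired = sorted(p for p, t in self.expires_at.items() if t <= now)
        for publisher in expired:
            del self.expires_at[publisher]
        return expired


registry = SoftStateRegistry(ttl=30.0)
registry.register("R1", now=0.0)
registry.register("R2", now=0.0)
registry.register("R2", now=20.0)   # R2 refreshes its entry, R1 does not

removed = registry.sweep(now=35.0)  # the registry notices that R1 has gone
remaining = sorted(registry.expires_at)
```

The names returned by `sweep` are the change in the publisher set that the Registry would then use to find and inform the affected consumers and republishers.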


Consumer switching to new Publisher

R1 equivalent to R2
Plan changes to use R2
Send timestamp of oldest tuple
Stream from first tuple with timestamp
Filter against latest-state buffer
Mechanism ensures answer stream properties

[Figure: animation of consumer C1 switching from republisher R1 to R2, both fed by producer S1; tuples are labelled by key and timestamp, for example a at t0, b at t0, a at t1, b at t2, c at t2, c at t3]
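The switching steps can be sketched as follows. This is an interpretation under stated assumptions, not the published protocol: the consumer keeps the latest timestamp seen per key as its latest-state buffer, "oldest tuple" is read as the oldest entry in that buffer, the consumer has already received at least one tuple, and the example tuples are made up.

```python
from typing import Dict, List, Tuple

Reading = Tuple[str, int]  # (key, timestamp); measurements omitted


class Consumer:
    def __init__(self) -> None:
        self.latest_state: Dict[str, int] = {}  # latest timestamp seen per key
        self.answer: List[Reading] = []

    def receive(self, reading: Reading) -> None:
        """Filter against the latest-state buffer, then deliver."""
        key, timestamp = reading
        if key in self.latest_state and timestamp <= self.latest_state[key]:
            return  # already delivered before the switch
        self.latest_state[key] = timestamp
        self.answer.append(reading)

    def resume_timestamp(self) -> int:
        """Timestamp of the oldest tuple in the latest-state buffer."""
        return min(self.latest_state.values())


def stream_from(history: List[Reading], timestamp: int) -> List[Reading]:
    """Replay a publisher's history from the first tuple at or after `timestamp`."""
    for index, (_, ts) in enumerate(history):
        if ts >= timestamp:
            return history[index:]
    return []


consumer = Consumer()
for reading in [("a", 0), ("b", 0), ("a", 1)]:  # received from R1 before the switch
    consumer.receive(reading)

# The plan changes to use R2, whose history holds the same data and more.
r2_history = [("a", 0), ("b", 0), ("a", 1), ("b", 2), ("c", 2), ("c", 3)]
for reading in stream_from(r2_history, consumer.resume_timestamp()):
    consumer.receive(reading)
```

In this run the three tuples already delivered by R1 are filtered out when R2 replays them, so the answer stream continues with b at 2, c at 2 and c at 3 and stays duplicate free and weakly ordered.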


Conclusions

Formal framework for publishing and consuming stream data
Partially implemented in R-GMA
Republishers:
  Allow system to scale
  Complicate query answering problem
Republishers require special planning
We have developed algorithms that allow the system to adapt to changes in the set of publishers
Protocol developed for switching between query plans