A Data Stream Publish/Subscribe Architecture
with Self-adapting Queries
Alasdair J G Gray and Werner Nutt
School of Mathematical and Computer Sciences,
Heriot-Watt University, Edinburgh
4th November 2005
4th Nov 2005, A.J.G. Gray and W. Nutt, CoopIS 2005
Overview
Motivation
Publish/subscribe architecture
Answering a query
Long-lived query plans
Switching between data sources
Motivation
Scenario:
  Streams generated by distributed sensors
  Users are also distributed
  Use data integration to match users to streams
For example,
  Grid monitoring for logging and bookkeeping
  Sensor networks
[Figure: a Grid producing job progress, bookkeeping, and monitoring data]
R-GMA: A Grid Monitoring System
Grid monitoring system that integrates streams of data
  Deployed on several Grids
  Continuing to be developed as part of the EGEE project
We are developing innovative extensions for R-GMA
Publishing Monitoring Data
Data can be represented in terms of relations with
  Keys: "what" and "where"
  Measurements: the "value"
  Timestamps: "when"
For example, Network ThroughPut
  One reading is a tuple in the relation
  NTP (from, to, tool, psize, latency, timestamp)
  ('hw', 'ral', 'ping', 32, 11.1, 2005-06-24-15:05:34)
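As a rough illustration, one NTP reading could be held in a small Python record. The class name `NTPReading` and the `key()` helper are hypothetical, and treating from, to, tool and psize as the key attributes is an assumption drawn from the keys/measurements/timestamps split above, not something the slide states.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class NTPReading:
    """One tuple of NTP(from, to, tool, psize, latency, timestamp)."""
    from_: str            # 'from' is a Python keyword, hence the underscore
    to: str
    tool: str
    psize: int
    latency: float        # the measurement ("value")
    timestamp: datetime   # "when"

    def key(self):
        # Assumed key attributes ("what" and "where").
        return (self.from_, self.to, self.tool, self.psize)


# The example reading from the slide.
reading = NTPReading("hw", "ral", "ping", 32, 11.1,
                     datetime(2005, 6, 24, 15, 5, 34))
```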
Consuming Monitoring Data
Users are interested in how the grid changes over time. For example,
1. Latency for large packets sent from hw
2. Links with a low latency as recorded by the PingER tool
These can be expressed as SQL selection queries
$q_1 : \sigma_{from='hw' \wedge psize \geq 1024}(NTP)$
$q_2 : \sigma_{tool='ping' \wedge latency \leq 10.0}(NTP)$
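A minimal Python sketch of the two selections as filters over a stream of NTP tuples. The `select` helper is a hypothetical name, and only the first sample tuple comes from the slides; the other two are invented for illustration.

```python
from collections import namedtuple

NTP = namedtuple("NTP", ["from_", "to", "tool", "psize", "latency", "timestamp"])


def q1(t):
    # from = 'hw' AND psize >= 1024
    return t.from_ == "hw" and t.psize >= 1024


def q2(t):
    # tool = 'ping' AND latency <= 10.0
    return t.tool == "ping" and t.latency <= 10.0


def select(condition, stream):
    """Lazily yield the tuples of a (possibly unbounded) stream that satisfy the condition."""
    return (t for t in stream if condition(t))


stream = [
    NTP("hw", "ral", "ping", 32, 11.1, "2005-06-24-15:05:34"),   # reading from the slides
    NTP("hw", "ral", "udp", 2048, 9.0, "2005-06-24-15:05:40"),   # invented sample
    NTP("ral", "hw", "ping", 32, 4.2, "2005-06-24-15:05:41"),    # invented sample
]
q1_answer = list(select(q1, stream))
q2_answer = list(select(q2, stream))
```

Because `select` returns a generator, the same filter works on a stream that never ends, which is the setting the slides describe.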
Data Integration in a Publish/Subscribe Architecture
Local as View Approach
  Consumers pose a query over the schema to request streams
  Producers describe their stream using a view on the schema
  Queries and views are selections over a single relation
[Figure: Producers, Registry, and Consumers connected by data streams]
What is an Answer to a Query?
Global relations contain no tuples (virtual relation)
Need to translate into query over sources
An answer stream should be
  Sound
  Complete
  Duplicate free
  Weakly ordered: all tuples that share the same key value will be in timestamp order
Order in general is difficult in a distributed setting
Weak order sufficient for more complex queries such as aggregates
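The weak-order property can be made concrete with a small checker. This is only a sketch: the function name and the representation of tuples as (key, timestamp) pairs are assumptions, and equal timestamps for the same key are tolerated here because duplicate freeness is a separate property.

```python
from operator import itemgetter


def is_weakly_ordered(stream, key, timestamp):
    """True if, for every key value, tuples appear in non-decreasing timestamp order."""
    latest = {}
    for t in stream:
        k, ts = key(t), timestamp(t)
        if k in latest and ts < latest[k]:
            return False
        latest[k] = ts
    return True


# Illustrative (key, timestamp) pairs.
interleaved = [("a", 1), ("b", 5), ("a", 2), ("b", 6)]  # not globally ordered, fine per key
swapped = [("a", 3), ("a", 1)]                          # same key, timestamps go backwards

ok = is_weakly_ordered(interleaved, itemgetter(0), itemgetter(1))
bad = is_weakly_ordered(swapped, itemgetter(0), itemgetter(1))
```

The `interleaved` stream shows why weak order is easier to achieve than global order: tuples with different keys may overtake each other freely.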
Query Planning: Consumer Query
Satisfiability used to find relevant producers
Producers:
  S1: from='hw' ∧ tool='udp'
  S2: from='hw' ∧ tool='ping'
  S3: from='ral' ∧ tool='ping'
  S4: from='ral' ∧ tool='udp'
  S5: from='an' ∧ tool='ping'
Consumer queries:
  q1: from='hw' ∧ psize≥1024
  q2: tool='ping' ∧ latency≤10.0
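A hedged sketch of how the satisfiability test might be coded for the conditions on this slide. It assumes each condition is a conjunction with at most one constraint per attribute, and models every constraint as a closed interval (`None` meaning unbounded), so strict comparisons are not covered. All function names are hypothetical.

```python
def eq(value):
    return (value, value)


def at_least(value):
    return (value, None)


def at_most(value):
    return (None, value)


def _tightest(x, y, pick):
    """Combine two bounds, where None means 'no bound'."""
    if x is None:
        return y
    if y is None:
        return x
    return pick(x, y)


def satisfiable(c1, c2):
    """True if the conjunction of the two conditions can hold for some tuple."""
    for attr in c1.keys() & c2.keys():
        (lo1, hi1), (lo2, hi2) = c1[attr], c2[attr]
        lo = _tightest(lo1, lo2, max)
        hi = _tightest(hi1, hi2, min)
        if lo is not None and hi is not None and lo > hi:
            return False
    return True


producers = {
    "S1": {"from": eq("hw"), "tool": eq("udp")},
    "S2": {"from": eq("hw"), "tool": eq("ping")},
    "S3": {"from": eq("ral"), "tool": eq("ping")},
    "S4": {"from": eq("ral"), "tool": eq("udp")},
    "S5": {"from": eq("an"), "tool": eq("ping")},
}
q1 = {"from": eq("hw"), "psize": at_least(1024)}
q2 = {"tool": eq("ping"), "latency": at_most(10.0)}


def relevant_producers(query):
    return [name for name, view in producers.items() if satisfiable(view, query)]
```

With these inputs, `relevant_producers(q1)` gives S1 and S2, and `relevant_producers(q2)` gives S2, S3 and S5.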
How does the Registry find Relevant Producers?
Producer views are stored in a structured format
Satisfiability check can be constructed as an SQL query
  SELECT producers
  WHERE NOT EXISTS
    (SELECT *
     WHERE contradictory condition);
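To show the NOT EXISTS pattern running end to end, the sketch below uses Python's `sqlite3` module with an invented schema. The tables `producer` and `eq_pred` are assumptions made for illustration and are not the R-GMA registry schema; only equality predicates are stored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE producer (name TEXT PRIMARY KEY);
    CREATE TABLE eq_pred (producer TEXT, attr TEXT, value TEXT);  -- hypothetical schema
""")

views = {
    "S1": {"from": "hw", "tool": "udp"},
    "S2": {"from": "hw", "tool": "ping"},
    "S3": {"from": "ral", "tool": "ping"},
    "S4": {"from": "ral", "tool": "udp"},
    "S5": {"from": "an", "tool": "ping"},
}
for name, view in views.items():
    conn.execute("INSERT INTO producer VALUES (?)", (name,))
    conn.executemany("INSERT INTO eq_pred VALUES (?, ?, ?)",
                     [(name, attr, value) for attr, value in view.items()])


def relevant_producers(attr, value):
    """Producers with no stored predicate that contradicts attr = value."""
    rows = conn.execute("""
        SELECT p.name FROM producer p
        WHERE NOT EXISTS (
            SELECT * FROM eq_pred e
            WHERE e.producer = p.name AND e.attr = ? AND e.value <> ?)
        ORDER BY p.name
    """, (attr, value))
    return [name for (name,) in rows]
```

The inner query spells out the "contradictory condition": a producer is excluded only if it has fixed the same attribute to a different value.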
Scalability is an Issue
Problem: Every consumer contacting every producer of interest does not scale
  Even a small Grid of fewer than a dozen sites has problems
  Grids may contain thousands of resources
For example, Large Hadron Collider Computing Grid (LCG)
Republishers Allow the System to Scale
A republisher
  Consumes answers to a selection query
  Merges "trickles" into streams
  Publishes
    Answer stream
    Latest-state answer
    History
Problem: Choice in where to obtain information
[Figure: Producer S1 and Producer S2 feeding a Republisher]
Query Planning in the Presence of Republishers
Find all relevant publishers
Rank according to data provided
Meta query plan contains choice
Query plan uses one of R1 or R3
Producers:
  S1: from='hw' ∧ tool='udp'
  S2: from='hw' ∧ tool='ping'
  S3: from='ral' ∧ tool='ping'
  S4: from='ral' ∧ tool='udp'
  S5: from='an' ∧ tool='ping'
Republishers:
  R1: from='hw'
  R2: from='ral'
  R3: from='hw' ∧ tool='ping'
Consumer queries:
  q1: from='hw' ∧ psize≥1024
  q2: tool='ping' ∧ latency≤10.0
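One way to read the ranking step is sketched below for q2. This is a simplification under stated assumptions rather than the authors' algorithm: conditions are equality-only dictionaries, the latency part of q2 is dropped because no publisher restricts it, ranking is done by entailment between what each republisher can offer for the query, and producers are contacted directly only when no chosen republisher covers them. All function names are hypothetical.

```python
def conjoin(c1, c2):
    """Conjunction of two equality-only conditions; None if they contradict."""
    merged = dict(c1)
    for attr, value in c2.items():
        if merged.setdefault(attr, value) != value:
            return None
    return merged


def entails(c1, c2):
    """c1 entails c2 if c1 fixes every attribute that c2 fixes, to the same value."""
    return all(c1.get(attr) == value for attr, value in c2.items())


def offers(publishers, query):
    """Restrict each view to the query and drop publishers with nothing to offer."""
    result = {}
    for name, view in publishers.items():
        offer = conjoin(view, query)
        if offer is not None:
            result[name] = offer
    return result


def meta_query_plan(query, producers, republishers):
    rep, prod = offers(republishers, query), offers(producers, query)
    # Rank: keep republishers whose offer is not strictly contained in another's.
    maximal = [r for r in rep
               if not any(entails(rep[r], rep[o]) and not entails(rep[o], rep[r])
                          for o in rep)]
    # Republishers with equivalent offers form a choice group.
    groups = []
    for r in maximal:
        for group in groups:
            if entails(rep[r], rep[group[0]]) and entails(rep[group[0]], rep[r]):
                group.append(r)
                break
        else:
            groups.append([r])
    # Producers that no chosen republisher covers are contacted directly.
    direct = [p for p in prod if not any(entails(prod[p], rep[r]) for r in maximal)]
    return groups, direct


producers = {
    "S1": {"from": "hw", "tool": "udp"}, "S2": {"from": "hw", "tool": "ping"},
    "S3": {"from": "ral", "tool": "ping"}, "S4": {"from": "ral", "tool": "udp"},
    "S5": {"from": "an", "tool": "ping"},
}
republishers = {"R1": {"from": "hw"}, "R2": {"from": "ral"},
                "R3": {"from": "hw", "tool": "ping"}}
q2 = {"tool": "ping"}

groups, direct = meta_query_plan(q2, producers, republishers)
query_plan = [group[0] for group in groups] + direct  # resolve each choice arbitrarily
```

For q2 the meta query plan groups R1 with R3, which matches the choice named on the slide; the query plan here simply takes the first member of each group.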
Weak Order is not Guaranteed
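Tuples for same channel
  (3) published before (8)
  Arrive at consumer in wrong order
Producers:
  S1: from='hw' ∧ tool='udp'
  S2: from='hw' ∧ tool='ping'
  S3: from='ral' ∧ tool='ping'
  S4: from='ral' ∧ tool='udp'
Consumer query:
  q2: tool='ping' ∧ latency≤10.0
[Figure: q2 is answered through two intermediate streams, one with latency≤5.0 and one with latency>5.0; one of them crosses a slow link, so tuples (3) and (8) reach the consumer as (8) (3)]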
Generating Well Formed Query Plans
A publisher is relevant for a global query if
1. Conditions are satisfiable, and
2. All measurements that agree on their key values come from the same publisher
The measurement condition can be checked using entailment.
The example on slide 13 (Query Planning in the Presence of Republishers) was well formed.
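A possible reading of the measurement condition, sketched in Python: a publisher passes if it does not restrict any measurement attribute more tightly than the query does, that is, the query's measurement condition entails the publisher's. This interpretation, the closed-interval encoding (which ignores strict bounds), and the function names are all assumptions.

```python
import math

UNRESTRICTED = (-math.inf, math.inf)


def contains(outer, inner):
    """True if interval `inner` lies within interval `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def measurement_condition_holds(query_meas, publisher_meas):
    """Query's measurement condition entails the publisher's measurement condition."""
    return all(contains(pub_interval, query_meas.get(attr, UNRESTRICTED))
               for attr, pub_interval in publisher_meas.items())


q2_meas = {"latency": (-math.inf, 10.0)}          # latency <= 10.0

low_latency_only = {"latency": (-math.inf, 5.0)}  # publisher restricted to latency <= 5.0
high_latency_only = {"latency": (5.0, math.inf)}  # publisher restricted to latency > 5.0
unrestricted = {}                                 # publisher with no measurement restriction
```

Under this reading, neither publisher that splits on latency would be used for q2, which is the situation that produced the out-of-order tuples, while a publisher without a measurement restriction passes.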
Plans Need to be Maintained
Queries are long-lived
Set of publishers can change
Query plans should reflect changes
What happens when we
  Add a republisher?
  Remove a republisher?
How does a new Republisher affect our Consumers?
Find consumers for which R4 is relevant
Compare R4 to publishers in Meta Query Plan
Producers:
  S1: from='hw' ∧ tool='udp'
  S2: from='hw' ∧ tool='ping'
  S3: from='ral' ∧ tool='ping'
  S4: from='ral' ∧ tool='udp'
  S5: from='an' ∧ tool='ping'
Republishers:
  R1: from='hw'
  R2: from='ral'
  R3: from='hw' ∧ tool='ping'
  R4: TRUE
Consumer queries:
  q1: from='hw' ∧ psize≥1024
  q2: tool='ping' ∧ latency≤10.0
General Case of Adding a Republisher
Republisher relevant for a consumer query, either
1. Republisher is not maximal relevant
   No change in query plans
2. Equivalent Republisher
   Change to the Meta Query Plan
   No change to the Query Plan
3. Covering Republisher
   Change to the Meta Query Plan
   Change to the Query Plan
How does removing a Republisher affect our Consumers?
Find all consumers for which R1 was relevant
Update plans
Producers:
  S1: from='hw' ∧ tool='udp'
  S2: from='hw' ∧ tool='ping'
  S3: from='ral' ∧ tool='ping'
  S4: from='ral' ∧ tool='udp'
  S5: from='an' ∧ tool='ping'
Republishers:
  R1: from='hw'
  R2: from='ral'
  R3: from='hw' ∧ tool='ping'
  R4: TRUE
Consumer queries:
  q1: from='hw' ∧ psize≥1024
  q2: tool='ping' ∧ latency≤10.0
General Case of Dropping a Republisher
Republisher relevant for a consumer query, either
1. Republisher is not maximal relevant
   No change in query plans
2. Equivalent Republisher
   Change to the Meta Query Plan
   May need to change the Query Plan
3. Covering Republisher
   Change to the Meta Query Plan
   Change to the Query Plan
   Requires some method to patch the plan
Planning a Republisher Query
Applying Consumer planning techniques results in a problem
Producers:
  S1: from='hw' ∧ tool='udp'
  S2: from='hw' ∧ tool='ping'
  S3: from='ral' ∧ tool='ping'
  S4: from='ral' ∧ tool='udp'
  S5: from='an' ∧ tool='ping'
Republishers:
  R1: from='hw'
  R2: from='ral'
  R3: from='hw' ∧ tool='ping'
  R4: TRUE
Problem:
  Hierarchy contains cycles
  Republishers disconnected from Producers
Desirable Properties for a Hierarchy
Correctness: streams answer queries
Cycle freeness: loops can lead to duplicates
Uniqueness: hierarchy defined for a set of publishers
Local planning: Publishers and Consumers only need to communicate with the Registry
Generating Well Formed Hierarchies
Need a stricter relevance criterion
R1 can consume from R2 iff
1. Everything R2 offers is relevant to R1, and
2. R1 offers something R2 does not.
Can be checked by entailment
Ensures
  No loops in the hierarchy
  Republishers connected to the Producers
Planning a Republisher Query: 2nd Attempt
Stricter relevance criterion
Republishers only consume from publishers below them
R4 is not relevant for R1
Producers:
  S1: from='hw' ∧ tool='udp'
  S2: from='hw' ∧ tool='ping'
  S3: from='ral' ∧ tool='ping'
  S4: from='ral' ∧ tool='udp'
  S5: from='an' ∧ tool='ping'
Republishers:
  R1: from='hw'
  R2: from='ral'
  R3: from='hw' ∧ tool='ping'
  R4: TRUE
Computing Query Plans
Adding a new Consumer
1. Consumer contacts Consumer Agent
2. Consumer Agent contacts the Registry and receives a list of relevant publishers
3. Consumer Agent constructs Meta Query Plan and Query Plan
4. Consumer Agent contacts Publisher Agents in the Query Plan
5. Publisher starts streaming tuples to Consumer Agent
6. Consumer Agent merges into a single answer stream
Similar approach for adding a publisher
[Figure: Consumer, Consumer Agent, Registry, Producer Agent, and Producers, with arrows numbered 1 to 6 for the steps above]
Maintaining Query Plans
Agents maintain registry entry through a soft state registration mechanism
Registry detects change in publisher set
  Poses query over internal database
  Informs affected consumers/republishers
Consumer agent considers query plan
[Figure: Consumer, Consumer Agent, Registry, Producer Agent, and Producers]
Consumer switching to new Publisher
R1 equivalent to R2
Plan changes to use R2
Send timestamp of oldest tuple
Stream from first tuple with timestamp
Filter against latest-state buffer
[Figure: animation of consumer C1 switching from republisher R1 to republisher R2, both fed by producer S1, with tuples a, b, c carrying timestamps t0 to t3]
Mechanism ensures answer stream properties
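The switching steps can be sketched as follows. This is one interpretation of the bullets, not the published protocol: the consumer is assumed to keep a latest-state buffer mapping each key to the newest timestamp it has delivered, "oldest tuple" is read as the smallest timestamp in that buffer, and tuples are (key, timestamp) pairs loosely modelled on the figure. The function name is hypothetical.

```python
def switch_publisher(latest_state, new_publisher_history):
    """Return the tuples to append to the answer stream after switching.

    latest_state: dict mapping key -> timestamp of the newest tuple already
        delivered for that key (updated in place).
    new_publisher_history: list of (key, timestamp) pairs in the new
        publisher's stream order.
    """
    if latest_state:
        # Send timestamp of oldest tuple in the latest-state buffer.
        resume_from = min(latest_state.values())
        # Stream from the first tuple with (at least) that timestamp.
        start = next((i for i, (_, ts) in enumerate(new_publisher_history)
                      if ts >= resume_from), len(new_publisher_history))
        replay = new_publisher_history[start:]
    else:
        replay = list(new_publisher_history)

    answer = []
    for key, ts in replay:
        # Filter against latest-state buffer: drop what was already delivered.
        if key in latest_state and ts <= latest_state[key]:
            continue
        latest_state[key] = ts
        answer.append((key, ts))
    return answer


# Illustrative data: the consumer has already received a@0, b@0, a@1 from R1.
seen_from_r1 = {"a": 1, "b": 0}
r2_history = [("a", 0), ("b", 0), ("a", 1), ("b", 2), ("c", 2), ("c", 3)]
new_tuples = switch_publisher(seen_from_r1, r2_history)
```

Replaying from the oldest buffered timestamp is what keeps the answer complete, and the filter is what keeps it duplicate free and weakly ordered per key.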
Conclusions
Formal framework for publishing and consuming stream data
  Partially implemented in R-GMA
Republishers:
  Allow system to scale
  Complicate query answering problem
Republishers require special planning
We have developed algorithms that allow the system to adapt to changes in the set of publishers
Protocol developed for switching between query plans