DAX: Dynamically Adaptive Distributed System for Processing CompleX Continuous Queries
Bin Liu, Yali Zhu, Mariana Jbantova, Brad Momberger, and Elke A. Rundensteiner
Department of Computer Science, Worcester Polytechnic Institute
100 Institute Road, Worcester, MA 01609
Tel: 1-508-831-5857, Fax: 1-508-831-5776
{binliu, yaliz, jbantova, bmombe, rundenst}@cs.wpi.edu
VLDB’05 Demonstration
http://davis.wpi.edu/dsrg/CAPE/index.html
Uncertainties in Stream Query Processing
(Diagram: users register continuous queries with a distributed stream query engine; streaming data flows in, and streaming results/answers flow back to the users.)
• Streaming data may have time-varying rates and high volumes.
• Real-time and accurate responses are required.
• High workload of queries.
• Memory and CPU resources are limited, and the resources available for executing each operator may vary over time.
• Distribution and adaptation are therefore required.
Adaptation in Distributed Stream Processing
• Adaptation Techniques:– Spilling data to disk– Relocating work to other machines– Reoptimizing and migrating query plan
• Granularity of Adaptation:– Operator-level distribution and adaptation– Partition-level distribution and adaptation
• Integrated Methodologies:– Consider trade-offs between spill vs redistribute– Consider trade-offs between migrate vs redistribute
System Overview [LZ+05, TLJ+05]
(Architecture diagram: a Distribution Manager with a Runtime Monitor, Global Adaptation Controller, Global Plan Migrator, Query Plan Manager, Connection Manager, and Repository; CAPE continuous query processing engine nodes, each with a Data Receiver, Data Distributor, Query Processor, Local Statistics Gatherer, Local Adaptation Controller, and Local Plan Migrator; streaming data arrives over the network from stream generators, and results reach end users through an application server.)
Motivating Example
(Diagram: stock prices, volumes, reviews, external reports, and news stream into a real-time data integration server, which feeds a decision support system and decision-making applications.)
• Scalable real-time data processing systems must:
  – Produce as many results as possible at run time (e.g., 9:00am-4:00pm), using main-memory-based processing.
  – Provide complete query results (e.g., for offline analysis after 4:00pm, or whenever possible); load shedding is not acceptable, so data must temporarily be spilled to disk.
• Complex queries such as multi-joins are common. To analyze the relationship among stock prices, reports, and news, for example, we run an equi-join of the stock price, report, and news streams on stock symbols.
Initial Distribution Policies
(Diagram: example operator placements across machines M1, M2, and M3 under each policy; the legend identifies M1, M2, M3.)
• Random Distribution — Goal: equalize the workload per machine. Algorithm: iteratively takes each query operator and places it on the query processor with the least number of operators.
• Balanced Network Aware Distribution — Goal: minimize network connectivity. Algorithm: takes each query plan and creates sub-plans where neighbouring operators are grouped together.
Initial Distribution Process
(Diagram: the Distribution Manager holds the distribution table; a query plan of operators 1-8 between a stream source and an application is assigned to processing machines M1 and M2 in two steps.)
Distribution Table:
Operator    Machine
Operator 1  M1
Operator 2  M1
Operator 3  M2
Operator 4  M2
Operator 5  M1
Operator 6  M1
Operator 7  M2
Operator 8  M2
Step 1: Create distribution table using initial distribution algorithm.
Step 2: Send distribution information to processing machines (nodes).
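The two initial distribution policies above, and the distribution-table creation in Step 1, can be sketched as follows. This is a minimal illustration under assumed data structures (an operator list and an adjacency map of plan edges); it is not the actual D-CAPE code, and the function names are hypothetical.

# Sketch of the two initial distribution policies (illustrative only).
# The result is a distribution table mapping operator id -> machine id.

def random_distribution(operators, machines):
    """Place each operator on the machine that currently holds the fewest
    operators, which equalizes operator counts per machine."""
    table = {}
    load = {m: 0 for m in machines}
    for op in operators:
        target = min(machines, key=lambda m: load[m])
        table[op] = target
        load[target] += 1
    return table

def network_aware_distribution(plan_edges, operators, machines):
    """Group neighbouring operators into sub-plans so connected operators land
    on the same machine, minimizing cross-machine (network) connections."""
    table = {}
    load = {m: 0 for m in machines}
    per_machine = max(1, len(operators) // len(machines))
    for op in operators:
        if op in table:
            continue
        # start a new sub-plan on the least loaded machine
        target = min(machines, key=lambda m: load[m])
        frontier = [op]
        while frontier and load[target] < per_machine:
            cur = frontier.pop()
            if cur in table:
                continue
            table[cur] = target
            load[target] += 1
            frontier.extend(n for n in plan_edges.get(cur, []) if n not in table)
    # any leftover operators fall back to least-loaded placement
    for op in operators:
        if op not in table:
            target = min(machines, key=lambda m: load[m])
            table[op] = target
            load[target] += 1
    return table

# Example: the 8-operator plan from the slide on machines M1 and M2.
edges = {1: [2], 2: [3], 3: [4], 5: [6], 6: [7], 7: [8]}
ops = list(range(1, 9))
print(random_distribution(ops, ["M1", "M2"]))
print(network_aware_distribution(edges, ops, ["M1", "M2"]))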
Operator-level Adaptation - Redistribution
CAPE's cost models are based on the number of tuples in memory and the network output rate. The cost per machine is computed as the percentage of its memory filled with tuples (machine capacity: 4500 tuples). Operators are redistributed according to a redistribution policy; CAPE's redistribution policies are Balance and Degradation.
Statistics Table:
Machine  Tuples in memory
M1       2000 tuples
M2       4100 tuples
Distribution Table:
Operator    Machine
Operator 1  M1
Operator 2  M1
Operator 3  M2
Operator 4  M2
Operator 5  M1
Operator 6  M1
Operator 7  M2
Operator 8  M2
Cost Table (current):
Machine  Cost  Operator costs
M1       .44   Op 1: .25, Op 2: .25, Op 5: .25, Op 6: .25
M2       .91   Op 3: .3, Op 4: .2, Op 7: .3, Op 8: .2
Cost Table (desired, after Balance):
Machine  Cost  Operator costs
M1       .71   Op 1: .15, Op 2: .15, Op 5: .15, Op 6: .15, Op 7: .4
M2       .64   Op 3: .4, Op 4: .3, Op 8: .3
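A minimal sketch of a Balance-style redistribution decision under the memory-fill cost model above. The data structures, the tolerance parameter, and the candidate-selection heuristic are assumptions for illustration, not CAPE's actual implementation; note that in the slide's example op 7 is moved, whereas this sketch may pick any operator of equal size.

# Illustrative sketch of a "Balance" redistribution step: cost per machine is
# the fraction of its tuple capacity currently filled; operators are moved from
# the most loaded to the least loaded machine until costs are roughly even.

CAPACITY = 4500  # tuples per machine, as in the example

def machine_cost(tuples_on_machine):
    return tuples_on_machine / CAPACITY

def balance(distribution, op_tuples, tolerance=0.1):
    """distribution: op -> machine; op_tuples: op -> #tuples held in its states.
    Returns the updated distribution and the list of operator moves made."""
    moves = []
    while True:
        load = {}
        for op, m in distribution.items():
            load[m] = load.get(m, 0) + op_tuples[op]
        hot = max(load, key=load.get)
        cold = min(load, key=load.get)
        if machine_cost(load[hot]) - machine_cost(load[cold]) <= tolerance:
            break
        # move the operator on the hot machine whose size best fills half the gap
        gap = (load[hot] - load[cold]) / 2
        candidates = [op for op, m in distribution.items() if m == hot]
        op = min(candidates, key=lambda o: abs(op_tuples[o] - gap))
        if op_tuples[op] >= load[hot] - load[cold]:  # moving it would overshoot
            break
        distribution[op] = cold
        moves.append((op, hot, cold))
    return distribution, moves

# Example from the slide: M1 holds 2000 tuples, M2 holds 4100 tuples.
dist = {1: "M1", 2: "M1", 5: "M1", 6: "M1", 3: "M2", 4: "M2", 7: "M2", 8: "M2"}
sizes = {1: 500, 2: 500, 5: 500, 6: 500, 3: 1230, 4: 820, 7: 1230, 8: 820}
print(balance(dist, sizes))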
Redistribution Protocol: Moving Operators Across Machines
Experimental Results of Distribution and Redistribution Algorithms
Query plan performance with a query plan of 40 operators.
Observations: the initial distribution is important for query plan performance, and redistribution further improves query plan performance at run time.
(Chart: throughput over time in minutes for Random Distribution, Balanced Network Aware Distribution, RD + Redistribution, and BNA + Redistribution.)
Operator-level Adaptation: Dynamic Plan Migration
• The last step of plan re-optimization: after the optimizer generates a new query plan, how do we replace the currently running plan with the new plan on the fly?
• A new challenge in streaming systems because of stateful operators.
• A unique feature of the DAX system.
• But can we just take out the old plan and plug in the new plan?
Naive approach:
(1) Pause execution of the old plan.
(2) Drain out all tuples inside the old plan.
(3) Replace the old plan with the new plan.
(4) Resume execution with the new plan.
(Diagram: the four steps applied to a plan with joins AB and BC over inputs A, B, C.)
Key observation: purging tuples from operator states relies on processing new tuples. Draining the old plan therefore cannot finish while execution is paused, which leads to a deadlock waiting problem.
Migration Strategy - Moving State
• Basic idea: share the common states between the two migration boxes.
• Key steps:
  – Drain the tuples in the old box.
  – State matching: each state in the old box has a unique ID; during rewriting, a new ID is assigned to each newly generated state in the new box; when rewriting is done, states are matched based on their IDs.
  – State moving between matched states.
• What's left?
  – Unmatched states in the new box.
  – Unmatched states in the old box.
(Diagram: old box with joins AB, BC, CD and states SA, SB, SAB, SC, SABC, SD; new box with joins AB, CD, BC and states SA, SB, SBCD, SBC, SC, SD; both read input queues QA, QB, QC, QD and write output queue QABCD.)
Migration requirements: no missing results and no duplicates. There are two migration boxes: one contains the old sub-plan and one contains the new sub-plan. The two sub-plans are semantically equivalent and share the same input and output queues. Migration is abstracted as replacing the old box with the new box.
(Example diagram: streams A, B, and C with window W = 2; tuples a1-a2, b1-b3, and c1-c3 flow through joins AB and BC with states SA, SB, SAB, and SC over queues QA, QB, QC.)
(Diagram: states of the new box, SA, SB, SBCD, SBC, SC, SD, for joins AB, CD, BC over queues QA, QB, QC, QD.)
(Table: the possible old/new combinations of the A, B, and C sub-tuples of a joined result, from Old-Old-Old through New-New-New.)
Moving State: Unmatched States
• Unmatched new states (recomputation): recursively recompute the unmatched states in the new box from the bottom up.
• Unmatched old states (execution synchronization): first clean the tuples accumulated in the box input queues; it is then safe to discard these unmatched states.
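A schematic sketch of the moving-state steps above (drain, match states by ID, move matched states, recompute unmatched new states, discard unmatched old states). The classes and helper names are hypothetical; this is not the DAX/CAPE code.

# Schematic sketch of the Moving State migration strategy (hypothetical types).

class State:
    def __init__(self, state_id, tuples=None):
        self.state_id = state_id      # unique ID; reused when a state carries
                                      # over from the old box into the new box
        self.tuples = list(tuples or [])

def moving_state_migration(old_box_states, new_box_states, recompute):
    """old_box_states / new_box_states: dicts state_id -> State.
    recompute(state): rebuilds an unmatched new state bottom-up from its inputs.
    Returns the populated new-box states."""
    # 1. (Assume tuples buffered inside the old box have already been drained.)
    # 2. State matching: states sharing an ID exist in both boxes.
    matched = old_box_states.keys() & new_box_states.keys()
    # 3. State moving: transfer the tuples of every matched state.
    for sid in matched:
        new_box_states[sid].tuples = old_box_states[sid].tuples
    # 4. Unmatched new states: recompute them recursively, bottom up.
    for sid, state in new_box_states.items():
        if sid not in matched:
            recompute(state)
    # 5. Unmatched old states: once execution is synchronized (input queues
    #    cleaned), they can simply be discarded.
    for sid in list(old_box_states):
        if sid not in matched:
            del old_box_states[sid]
    return new_box_states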
Distributed Dynamic Migration Protocols (I)
Migration stage: Execution Synchronization.
(Diagram: the Distribution Manager coordinates machines M1 and M2, which host operators op1-op4 according to the distribution table OP1 → M1, OP2 → M2, OP3 → M1, OP4 → M2.)
At migration start:
(1) The Distribution Manager requests a SyncTime from each machine.
(2) Each machine replies with its local SyncTime.
(3) The Distribution Manager sends the global SyncTime to all machines.
(4) Each machine reports Execution Synced once it has synchronized.
Distributed Dynamic Migration Protocols (II)
Migration stage: Change Plan Shape.
(Diagram: the Distribution Manager and machines M1, M2 with operators op1-op4 and their queues; the sub-plan shape is updated on each machine.)
(5) The Distribution Manager sends the new sub-query plan to each machine.
(6) Each machine reports Plan Changed once its plan shape has been updated.
Distributed Dynamic Migration Protocols (III)
Migration stage: Fill States and Reactivate Operators.
(7) The Distribution Manager instructs the machines to fill the required states (e.g., Fill States [2, 4] on one machine and Fill States [3, 5] on the other).
(7.1)-(7.4) The machines request and move the needed states among themselves (e.g., Request state [4], Move state [4], Request state [2], Move state [2]).
(8) Each machine reports States Filled.
(9) The Distribution Manager asks the machines to reconnect their operators.
(10) Each machine reports Operator Reconnected.
(11) The Distribution Manager activates the operators (e.g., Activate [op1], Activate [op2]).
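The message flow of the three stages above can be summarized as a driver loop in the Distribution Manager. The message names follow the slides; the driver itself, and the send/wait_all interfaces it assumes, are a hypothetical sketch rather than the actual protocol implementation.

# Sketch of the distributed dynamic migration protocol as seen from the
# Distribution Manager (message names from the slides; driver is illustrative).

def run_migration(dm, machines, new_subplans, states_to_fill):
    # Stage 1: execution synchronization
    local_times = [m.send("RequestSyncTime") for m in machines]       # (1)/(2)
    global_sync = max(local_times)
    for m in machines:
        m.send("GlobalSyncTime", global_sync)                         # (3)
    dm.wait_all(machines, "ExecutionSynced")                          # (4)

    # Stage 2: change plan shape
    for m, plan in zip(machines, new_subplans):
        m.send("SendNewSubQueryPlan", plan)                           # (5)
    dm.wait_all(machines, "PlanChanged")                              # (6)

    # Stage 3: fill states and reactivate operators
    for m, states in zip(machines, states_to_fill):
        m.send("FillStates", states)                                  # (7)
        # machines then exchange RequestState / MoveState messages    # (7.1-7.4)
    dm.wait_all(machines, "StatesFilled")                             # (8)
    for m in machines:
        m.send("ReconnectOperators")                                  # (9)
    dm.wait_all(machines, "OperatorReconnected")                      # (10)
    for m in machines:
        m.send("Activate")                                            # (11)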
From Operator-level to Partition-level
• Problem with operator-level adaptation:
  – Operators have large states.
  – Moving them across machines can be expensive.
• Solution: partition-level adaptation.
  – Partition state-intensive operators [Gra90, SH03, LR05].
  – Distribute the partitioned plan onto multiple machines.
(Diagram: inputs A, B, C pass through SplitA, SplitB, and SplitC; each machine m1-m4 runs a Union, a Join, and Split operators over its share of the partitions.)
Partitioned Symmetric M-way Join
• Example query: equi-join A.A1 = B.B1 = C.C1, processed on two machines m1 and m2.
(Diagram: SplitA, SplitB, and SplitC route tuples to a 3-way join on m1 and a 3-way join on m2.)
Partitioning functions: A1 % 2 = 0 → m1, A1 % 2 = 1 → m2; B1 % 2 = 0 → m1, B1 % 2 = 1 → m2; C1 % 2 = 0 → m1, C1 % 2 = 1 → m2.
A ⋈ B ⋈ C = (PA1 ⋈ PB1 ⋈ PC1) ∪ (PA2 ⋈ PB2 ⋈ PC2)
Input tuples (A2, B2, C2 values elided in the slide):
A (A1, A2): (3, ...), (4, ...), (1, ...), (2, ...), (3, ...), (1, ...)
B (B1, B2): (1, ...), (2, ...), (2, ...), (3, ...), (3, ...), (4, ...)
C (C1, C2): (3, ...), (2, ...), (1, ...), (4, ...), (1, ...), (1, ...)
Partitions on m1: PA1 (A1): 4, 2; PB1 (B1): 2, 2, 4; PC1 (C1): 4, 2.
Partitions on m2: PA2 (A1): 1, 3, 3, 1; PB2 (B1): 1, 3, 3; PC2 (C1): 1, 1, 3, 1.
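A toy sketch of the modulo-2 split and the resulting per-machine partitioned join for this example. The second attributes (A2, B2, C2) are elided in the slide, so placeholder values are used; the function names and tuple layout are illustrative, not the DAX Split/Join operator code.

# Toy sketch of the partitioned symmetric 3-way join: tuples are routed by
# <join-key> % 2, so machine m1 joins the even-key partitions (PA1, PB1, PC1)
# and m2 the odd-key partitions (PA2, PB2, PC2).

NUM_MACHINES = 2

def split(tuple_, key_index=0):
    """Split operator: route a tuple to a machine by hashing its join key."""
    return tuple_[key_index] % NUM_MACHINES      # 0 -> m1, 1 -> m2

def three_way_join(part_a, part_b, part_c):
    """Symmetric equi-join A.A1 = B.B1 = C.C1 within one set of partitions."""
    return [(a, b, c) for a in part_a for b in part_b for c in part_c
            if a[0] == b[0] == c[0]]

# Join-key values from the slide; second attributes are placeholders.
A = [(3, "a"), (4, "a"), (1, "a"), (2, "a"), (3, "a"), (1, "a")]
B = [(1, "b"), (2, "b"), (2, "b"), (3, "b"), (3, "b"), (4, "b")]
C = [(3, "c"), (2, "c"), (1, "c"), (4, "c"), (1, "c"), (1, "c")]

# Route each stream's tuples to its per-machine partitions.
partitions = {m: {"A": [], "B": [], "C": []} for m in range(NUM_MACHINES)}
for name, stream in (("A", A), ("B", B), ("C", C)):
    for t in stream:
        partitions[split(t)][name].append(t)

# The full join is the union of the independent per-machine joins.
result = []
for m in range(NUM_MACHINES):
    p = partitions[m]
    result += three_way_join(p["A"], p["B"], p["C"])
print(len(result), "joined tuples")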
Partition-level Adaptations
• 1: State Relocation — addresses uneven workload among machines.
  (Diagram: partitions of the A, B, C join are moved between m1 and m2 through SplitA, SplitB, SplitC.)
  – Relocated states remain active on the receiving machine.
  – Incurs overhead for monitoring and moving states across machines.
• 2: State Spill — the memory overflow problem still exists.
  – Push operator states temporarily to disk (secondary storage); spilled operator states are temporarily inactive.
  – New incoming tuples probe only against the partial states remaining in memory.
Approaches: Lazy-Disk vs. Active-Disk
• Lazy-Disk Approach
  (Diagram: the Distribution Manager collects memory usage from query processors 1..n; each processor has a disk and a local adaptation controller. State spill is decided locally, state relocation globally.)
  – Independent spill and relocation decisions:
    • Distribution Manager: triggers state relocation when the memory imbalance across machines exceeds a relocation threshold and a minimum time span has passed since the last relocation.
    • Query Processor: starts state spill when Mem_used / Mem_all exceeds a spill threshold s.
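A sketch of the two independent Lazy-Disk decisions described above. The threshold names and default values are illustrative assumptions, not the exact parameters of the system.

# Sketch of the Lazy-Disk decision logic: each Query Processor decides on state
# spill locally, while the Distribution Manager decides on state relocation.
import time

SPILL_THRESHOLD = 0.80        # illustrative: spill when memory is 80% full
RELOCATION_DIFF_MB = 30       # illustrative: relocate when machines differ by > 30 MB
RELOCATION_MIN_SPAN_S = 45    # illustrative: at least 45 s between relocations

def should_spill(mem_used_mb, mem_all_mb):
    """Local decision on one query processor: Mem_used / Mem_all > threshold."""
    return mem_used_mb / mem_all_mb > SPILL_THRESHOLD

_last_relocation = 0.0

def should_relocate(mem_used_by_machine):
    """Global decision in the Distribution Manager: imbalance plus min span."""
    global _last_relocation
    imbalance = max(mem_used_by_machine.values()) - min(mem_used_by_machine.values())
    if (imbalance > RELOCATION_DIFF_MB
            and time.time() - _last_relocation > RELOCATION_MIN_SPAN_S):
        _last_relocation = time.time()
        return True
    return False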
• Active-Disk Approach
  – Partitions on different machines may have different productivity; e.g., the most productive partitions on machine 1 may be less productive than the least productive partitions on other machines.
  – Proposed technique: perform state spill globally.
  (Diagram: the Distribution Manager collects memory usage and average productivity from query processors 1..n and can force state spill on individual machines, in addition to coordinating state relocation.)
Performance Results of Lazy-Disk & Active-Disk Approaches
• Lazy-Disk vs. No-Relocation in a memory-constrained environment.
  (Chart: throughput over roughly 58 minutes for No-Relocation and Lazy-Disk.)
  Setup: three machines M1 (50%), M2 (25%), M3 (25%); input rate 30ms; tuple range 30K; Inc. join ratio 2; state-spill memory threshold 100M; state relocation: difference > 30M, memory threshold 80%, minspan 45s.
• Lazy-Disk vs. Active-Disk.
  (Chart: throughput over roughly 58 minutes for Lazy-Disk and Active-Disk.)
  Setup: three machines; input rate 30ms; tuple ranges 15K and 45K; state-spill memory threshold 80M; avg. Inc. join ratio M1 (4), M2 (1), M3 (1); maximal Force-Disk memory 100M, ratio > 2; state relocation: > 30M, memory threshold 80%, minspan 45s.
Plan-Wide State Spill: Local Methods
• Local Output
  (Diagram: a three-join plan over inputs A, B, C, D, E with Join1, Join2, and Join3; each partition maintains Poutput and Psize statistics over time.)
  – Direct extension of the single-operator solution.
  – Update operator productivity values individually.
  – Spill the partitions with the smallest Poutput/Psize values among all operators.
• Bottom-Up Pushing
  – Push states from the bottom operators first.
  – Select partitions randomly or by their local productivity values.
  – Fewer intermediate results (states) are stored, which reduces the number of state spills.
Plan-Wide State Spill: Global Output
• Poutput: contribution to the final query output.
  – Update the Poutput values of the partitions in Join3.
  – Apply Split2 to each output tuple to find the corresponding partition in Join2, and update its Poutput value.
  – And so on down the plan.
  (Diagram: the multi-join plan with SplitA, SplitB, SplitC, Split1, SplitD, Split2, and SplitE between Join1, Join2, and Join3.)
• A lineage tracing algorithm updates the Poutput statistics.
(Example diagram: output counts are traced back through operators OP1-OP4 and their partitions (p11, p12, ..., p4j); contributions of 2, 3, and 4 output tuples accumulate as 2+3+4, 3+4, and 4 along the plan.)
• Consider intermediate result size.
  – Example: P11 with Psize = 10, Poutput = 20; P12 with Psize = 10, Poutput = 20.
  – Introduce an intermediate result factor Pinter and rank partitions by Poutput / (Psize + Pinter).
  – Apply the same lineage tracing algorithm to intermediate results.
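A sketch of selecting partitions for a plan-wide spill by ranking them on Poutput / (Psize + Pinter) and pushing the least productive ones first. The partition records, the P21 entry, the Pinter values, and the 30% push fraction are illustrative assumptions, not measured statistics.

# Sketch of global, productivity-based spill selection: partitions across all
# operators are ranked by Poutput / (Psize + Pinter) and the least productive
# ones are pushed to disk until the requested fraction of tuples is spilled.

def choose_partitions_to_spill(partitions, push_fraction=0.30):
    """partitions: list of dicts with keys 'id', 'p_output', 'p_size', 'p_inter'.
    Returns the ids selected for spilling."""
    def productivity(p):
        return p["p_output"] / (p["p_size"] + p["p_inter"])
    total = sum(p["p_size"] for p in partitions)
    budget = push_fraction * total
    chosen, spilled = [], 0
    for p in sorted(partitions, key=productivity):
        if spilled >= budget:
            break
        chosen.append(p["id"])
        spilled += p["p_size"]
    return chosen

# Example with the Psize/Poutput values from the slide (Pinter and P21 assumed).
parts = [
    {"id": "P11", "p_output": 20, "p_size": 10, "p_inter": 0},
    {"id": "P12", "p_output": 20, "p_size": 10, "p_inter": 5},
    {"id": "P21", "p_output": 5,  "p_size": 10, "p_inter": 0},
]
print(choose_partitions_to_spill(parts))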
Experiment Results for Plan-Wide Spill
(Charts: throughput over roughly 50 minutes for Global Output with Penalty, Global Output, Local Output, and Bottom-Up, for two queries.)
Query 1 average join rates: Join1: 1, Join2: 3, Join3: 3. Query 2 average join rates: Join1: 3, Join2: 2, Join3: 3.
Setup: 300 partitions; memory threshold 60MB; 30% of states pushed in each state spill; average tuple inter-arrival time 50ms on each input.
Backup Slides
Plan Shape Restructuring and Distributed Stream Processing
• New slides covering Yali's migration and distribution ideas.
Migration Strategy - Parallel Track
• Basic idea: execute both plans in parallel until the old box is "expired", after which the old box is disconnected and the migration is over.
(Diagram: old box with joins AB, BC, CD and states SA, SB, SAB, SC, SABC, SD; new box with joins AB, CD, BC and states SA, SB, SBCD, SBC, SC, SD; both over queues QA, QB, QC, QD and output queue QABCD.)
• Potential duplicates: both boxes generate all-new tuples.
  – At the root operator in the old box: if both to-be-joined tuples have all-new sub-tuples, do not join them.
  – At other operators in the old box: proceed as normal.
• Pros: migrates in a gradual fashion; still produces output during migration.
• Cons: still relies on executing the old box to process tuples during the migration stage.
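A small sketch of the duplicate-avoidance rule at the root operator of the old box: a joined result is suppressed when all of its sub-tuples arrived after migration started (all-new), since the new box produces exactly those combinations. The tuple representation with an is_new flag is an illustrative assumption.

# Sketch of the parallel-track duplicate rule (illustrative representation).
# Each sub-tuple carries a flag telling whether it arrived before (old) or
# after (new) the migration start.

def root_should_join(left_subtuples, right_subtuples):
    """At the root operator of the OLD box only: skip the join when every
    sub-tuple on both sides is new, because the new box covers that case."""
    all_new = all(t["is_new"] for t in left_subtuples + right_subtuples)
    return not all_new

# Non-root operators in the old box join as normal.
a_new, b_new = {"id": "a2", "is_new": True}, {"id": "b3", "is_new": True}
c_old = {"id": "c1", "is_new": False}
print(root_should_join([a_new, b_new], [c_old]))   # True: join (has an old sub-tuple)
print(root_should_join([a_new], [b_new]))          # False: all-new, skip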
Cost estimation for MS (Moving State):
T_MS = T_match + T_move + T_recompute
     ≈ T_recompute(S_BC) + T_recompute(S_BCD)
     = λB·λC·W²·(Tj + Ts·σBC) + 2·λB·λC·λD·W³·(Tj·σBC + Ts·σBC·σBCD)
Cost estimation for PT (Parallel Track):
T_PT ≈ 2W given enough system resources.
(Diagram: migration timeline from TM-start to TM-end spanning the 1st and 2nd windows W; the old box and the new box produce old/old, old/new, and new/new tuple combinations during this period.)
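As a rough illustration, the two estimates above can be evaluated for sample parameter values (arrival rates λ, window W, selectivities σ, per-tuple join and state-insert costs Tj and Ts). All numeric values here are hypothetical, chosen only to show how the formulas are used.

# Rough numeric illustration of the cost estimates
#   T_MS ≈ λB·λC·W²·(Tj + Ts·σBC) + 2·λB·λC·λD·W³·(Tj·σBC + Ts·σBC·σBCD)
#   T_PT ≈ 2W
# All parameter values below are hypothetical.

lam_B = lam_C = lam_D = 0.05      # tuples per ms on streams B, C, D
W = 2000.0                        # window size in ms
Tj, Ts = 0.01, 0.005              # per-tuple join / state-insert cost in ms
sigma_BC, sigma_BCD = 0.1, 0.05   # join selectivities

T_MS = (lam_B * lam_C * W**2 * (Tj + Ts * sigma_BC)
        + 2 * lam_B * lam_C * lam_D * W**3 * (Tj * sigma_BC + Ts * sigma_BC * sigma_BCD))
T_PT = 2 * W

print(f"T_MS ≈ {T_MS:.1f} ms, T_PT ≈ {T_PT:.1f} ms")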
Experimental Results for Plan Migration
(Charts: migration duration (ms) vs. global window size W (ms): measured vs. estimated T_PT; measured T_MS with a polynomial fit; and T_MS vs. T_PT for window sizes up to 5000 ms.)
Observations:
• The measurements are consistent with the prior cost analysis.
• The duration of Moving State is affected by window size and arrival rates.
• The duration of Parallel Track is about 2W given enough system resources; otherwise it is also affected by system parameters such as window size and arrival rates.
Related Work on Distributed Continuous Query Processing
[1] Medusa: M. Balazinska, H. Balakrishnan, and M. Stonebraker. Contract-based load management in federated distributed systems. In 1st NSDI, March 2004.
[2] Aurora*: M. Cherniack, H. Balakrishnan, M. Balazinska, et al. Scalable distributed stream processing. In CIDR, 2003.
[3] Borealis: T. B. Team. The design of the Borealis Stream Processing Engine. Technical Report, Brown University, CS Department, August 2004
[4] Flux: M. Shah, J. Hellerstein, S. Chandrasekaran, and M. Franklin. Flux: An adaptive partitioning operator for continuous query systems. In ICDE, pages 25-36, 2003
[5] Distributed Eddies: F. Tian, and D. DeWitt. Tuple routing strategies for distributed Eddies. In VLDB Proceedings, Berlin, Germany, 2003
Related Work on Partitioned Processing
• Non-state-intensive queries [BB+02, AC+03, GT03]
  – State-intensive operators (run-time memory shortage)
• Operator-level adaptation [CB+03, SLJ+05, XZH05]
  – Fine-grained state-level adaptation (adapt partial states)
• Load shedding [TUZC03]
  – Require complete query results (no load shedding)
  – Drop input tuples to handle resource shortage
• XJoin [UF00] and Hash-Merge Join [MLA04]
  – Integrate both spill and relocation in distributed environments
  – Investigate the dependency problem for multiple operators
• Flux [SH03]
  – Multi-input operators
  – Integrates both state spill and state relocation
  – Adapts states of one single-input operator across machines
• Hash-Merge Join [MLA04], XJoin [UF00]
  – Only spill states for one single operator in central environments
CAPE Publications and Reports
[RDZ04] E. A. Rundensteiner, L. Ding, Y. Zhu, T. Sutherland and B. Pielech, "CAPE: A Constraint-Aware Adaptive Stream Processing Engine". Invited book chapter. http://www.cs.uno.edu/~nauman/streamBook/. July 2004.
[ZRH04] Y. Zhu, E. A. Rundensteiner and G. T. Heineman, "Dynamic Plan Migration for Continuous Queries Over Data Streams". SIGMOD 2004, pages 431-442.
[DMR+04] L. Ding, N. Mehta, E. A. Rundensteiner and G. T. Heineman, "Joining Punctuated Streams". EDBT 2004, pages 587-604.
[DR04] L. Ding and E. A. Rundensteiner, "Evaluating Window Joins over Punctuated Streams". CIKM 2004, to appear.
[DRH03] L. Ding, E. A. Rundensteiner and G. T. Heineman, "MJoin: A Metadata-Aware Stream Join Operator". DEBS 2003.
[RDSZBM04] E. A. Rundensteiner, L. Ding, T. Sutherland, Y. Zhu, B. Pielech and N. Mehta, "CAPE: Continuous Query Engine with Heterogeneous-Grained Adaptivity". Demonstration paper. VLDB 2004.
[SR04] T. Sutherland and E. A. Rundensteiner, "D-CAPE: A Self-Tuning Continuous Query Plan Distribution Architecture". Tech Report WPI-CS-TR-04-18, 2004.
[SPR04] T. Sutherland, B. Pielech, Y. Zhu, L. Ding, and E. A. Rundensteiner, "Adaptive Multi-Objective Scheduling Selection Framework for Continuous Query Processing". IDEAS 2005.
[SLJR05] T. Sutherland, B. Liu, M. Jbantova, and E. A. Rundensteiner, "D-CAPE: Distributed and Self-Tuned Continuous Query Processing". CIKM, Bremen, Germany, Nov. 2005.
[LR05] B. Liu and E. A. Rundensteiner, "Revisiting Pipelined Parallelism in Multi-Join Query Processing". VLDB 2005.
[B05] B. Liu and E. A. Rundensteiner, "Partition-based Adaptation Strategies Integrating Spill and Relocation". Tech Report WPI-CS-TR-05, 2005 (in submission).
CAPE Project: http://davis.wpi.edu/dsrg/CAPE/index.html
CAPE Engine: Constraint-aware Adaptive Continuous Query Processing Engine
• Exploits semantic constraints such as sliding windows and punctuations to reduce resource usage and improve response time.
• Incorporates heterogeneous-grained adaptivity at all query processing levels:
  – Adaptive query operator execution
  – Adaptive query plan re-optimization
  – Adaptive operator scheduling
  – Adaptive query plan distribution
• Processes queries in a real-time manner by employing well-coordinated heterogeneous-grained adaptations.
Analyzing Adaptation Performance
• Questions addressed:
  – Partitioned parallel processing
    • Resolves memory shortage.
    • Should we partition non-memory-intensive queries?
    • How effective is partitioning memory-intensive queries?
  – State spill
    • Known problem: slows down run-time throughput.
    • How many states to push?
    • Which states to push?
    • How to combine memory/disk states to produce complete results?
  – State relocation
    • Known asset: low overhead.
    • When (how often) to trigger state relocation?
    • Is state relocation an expensive process?
    • How to coordinate state moving without losing data and states?
• Analyzing state adaptation performance and policies:
  – Given sufficient main memory, state relocation helps run-time throughput.
  – With insufficient main memory, Active-Disk improves run-time throughput.
• Adapting multi-operator plans:
  – Dependency among operators.
  – Global throughput-oriented spill solutions improve throughput.
Percentage Spilled per Adaptation
• Amount of state pushed in each adaptation; percentage = number of tuples pushed / total number of tuples.
(Charts: run-time query throughput over minutes, and run-time main memory usage (MB) sampled every 60 seconds, for All-Mem, 10%-Push, 30%-Push, 50%-Push, and 100%-Push.)
(Setup: input rate 30ms/input; tuple range 30K; join ratio 3; adaptation threshold 200MB.)