An Adaptive Distributed Query Processing Grid Service
F. Porto, V.F.V. da Silva, M.L. Dutra, B. Schulze
Proc. VLDB Workshop on Data Management in Grids, VLDB, LNCS 3836, Trondheim, Norway, 2-3 September 2005
Course: Data Grids. Professors: Jean-Marc Pierson, Lionel Brunie
Presentation date: 01/02/2006. Student: Sammarco Aniello
Slide N.:2
PLAN
1- INTRODUCTION
2- ABSTRACT DB
3- ARCHITECTURE
4- QUERY PROCESSING
5- Grid Greedy Node (G2N) algorithm
6- Query Execution Engine Framework
7- INITIAL RESULT
8- CONCLUSION
OBJECTIVES
SOLUTIONS
RESULT
Slide N.:3
PROJECT CoDIMS (Configurable Data Integration Middleware)
CoDIMS-G is a distributed grid service for the evaluation of scientific queries. The design of CoDIMS-G focused on conceiving efficient and adaptable query evaluation strategies for the grid environment.
TESTBED: It supports the pre-processing stage of a scientific visualization application (SVA) at the National Laboratory of Scientific Computing (LNCC), Brazil.
FOCUS ON ADAPTIVE PROBLEM
OBJECTIVES
SOLUTIONS
RESULT
Slide N.:4
OBJECTIVES
(1) Dynamic scheduling and allocation of query execution engine modules onto grid nodes
(2) Adaptability of query execution to variations in environment conditions
(3) Support for special scientific operations
Slide N.:5
SOLUTIONS
Using the processing power available in a grid environment may substantially reduce the time needed to pre-process virtual particle trajectories.
(1) A new node scheduling algorithm that selects grid nodes for parallel evaluation
(2) An extension of the Eddy operator
Slide N.:6
RESULT
Reduction of the scheduling time
Slide N.:7
FOCUS ON ADAPTIVE PROBLEM
To adapt the execution of an application to the changing conditions of the selected grid nodes. The problem in this context is to identify points where execution may be interrupted on one node and restarted on other nodes.
Slide N.:8
ABSTRACT DB
The Geometry relation stores data associated with a polyhedron's geometry:
Geometry (id, time-instant, polyhedron<point>, velocity<point-velocity>)
The Particle relation holds the initial particle position:
Particle (part-id, time-instant, point)
The Resulting-vector user program computes a resulting speed vector at a specific position of the flow path:
Resulting-vector (position, polyhedron<point>, velocity<point-velocity>): velocity
The Trajectory Computing Program (TCP) computes the virtual particle's (VP's) subsequent position:
TCP (particle-id, position, velocity): new-position
The Velocity relation corresponds to the velocity vectors for each time instant.
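The relations and user programs above can be pictured as plain data types and function signatures. The following is only a sketch: the 3-D tuple representation, the averaging inside `resulting_vector`, and the single Euler step (with a hypothetical `dt` parameter) inside `tcp` are placeholder assumptions chosen to make the example runnable, not the computations used in CoDIMS-G.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float, float]          # assumption: 3-D coordinates
PointVelocity = Tuple[float, float, float]  # assumption: one velocity vector per vertex


@dataclass
class Geometry:
    id: int
    time_instant: int
    polyhedron: List[Point]
    velocity: List[PointVelocity]


@dataclass
class Particle:
    part_id: int
    time_instant: int
    point: Point


def resulting_vector(position: Point, polyhedron: List[Point],
                     velocity: List[PointVelocity]) -> PointVelocity:
    """Placeholder body: averages the vertex velocities and ignores position."""
    n = len(velocity)
    return tuple(sum(v[k] for v in velocity) / n for k in range(3))


def tcp(particle_id: int, position: Point, velocity: PointVelocity,
        dt: float = 1.0) -> Point:
    """Placeholder body: one explicit Euler step of length dt."""
    return tuple(p + dt * v for p, v in zip(position, velocity))
```

Only the signatures mirror the slide; a real Resulting-vector program would interpolate inside the polyhedron containing the position.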
Slide N.:9
ARCHITECTURE OF CoDIMS-G
Client Interface: users' requests are forwarded to the Control component.
Control Component: the core of the CoDIMS environment. It stores, manages, validates and verifies an instance configuration, and sends users' requests to the query processing system.
Parser Component: transforms the users' requests into a query graph (QG) representation.
Query Optimizer (QO): receives the graph and generates a physical distributed query execution plan (DQEP) using a cost model based on data and program statistics stored in the Metadata Manager (MM).
Scheduler Component (SC): called by the optimizer, it indicates the set of nodes of interest to be allocated for the parallelized operator. The scheduler and optimizer cooperate to generate an initial distributed parallel query execution plan (DQEP).
Query Execution Manager (QEM): responsible for deploying the query execution engine (QE) services at the nodes specified in the DQEP and for managing their life-cycle during query execution. The QEM also manages the QEs' real-time performance.
Query Engines (QE 1, QE 2, ..., QE n): a QE is the component where actual query execution takes place. QE instances are instantiated on the scheduled grid nodes. Each QE receives a fragment of the DQEP and is responsible for controlling its execution.
Slide N.:10
DISTRIBUTED QUERY PROCESSING
We express a query as a query graph QG, defined as a partially ordered set of operators $QG = \{\Omega, \prec\}$, where $\Omega$ is a set of algebraic operators and $\prec$ is a set of dependency relations: if $w_1 \prec w_2$, with $w_1, w_2 \in \Omega$ and $w_1 \neq w_2$, then $w_2$ succeeds $w_1$ in a bottom-up navigation of the DQEP, and not $(w_2 \prec w_1)$.
The optimization algorithm explores the search space of valid plans, in accordance with data dependency restrictions. It considers every execution order of the expensive operators that the QG edges allow.
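To make the idea of "valid execution orders" concrete, the sketch below enumerates every ordering of a set of operators that respects a list of dependency pairs. The function name `valid_orders` and the operator names in the usage note are hypothetical. Brute-force enumeration is used only for clarity and is not claimed to be the optimizer's search strategy.

```python
from itertools import permutations


def valid_orders(operators, precedes):
    """Return every ordering in which w1 appears before w2
    for each dependency pair (w1, w2) in `precedes`."""
    orders = []
    for perm in permutations(operators):
        position = {op: k for k, op in enumerate(perm)}
        if all(position[w1] < position[w2] for w1, w2 in precedes):
            orders.append(perm)
    return orders
```

For example, with operators `scan`, `filter`, `resulting_vector` and `tcp`, where `scan` must precede `filter` and `resulting_vector`, and `resulting_vector` must precede `tcp`, three orderings survive, all of which start with `scan`.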
ALTERNATIVES
WHY
Slide N.:11
DISTRIBUTED QUERY PROCESSING
ALTERNATIVES
(a) no parallelization
(b) scheduling according to the G2N (Grid Greedy Node) algorithm
(c) adoption of the same parallelization strategy used by the previous operator in the query execution plan
For each computed query execution plan, a cost is associated using a parallel pipeline cost function. The DQEP with the lowest cost is selected for execution.
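The selection step can be sketched as a minimum over candidate plans. This sketch assumes that one of the three alternatives is chosen for each expensive operator, which the slide suggests but does not state. The cost function is passed in by the caller, because the parallel pipeline cost function itself is not given here. `cheapest_plan`, `toy_cost`, and the numbers in `toy_cost` are purely illustrative.

```python
from itertools import product

ALTERNATIVES = ("no-parallelization", "g2n", "same-as-previous")


def cheapest_plan(operators, cost):
    """Try every assignment of an alternative to each operator
    and keep the assignment with the lowest cost."""
    candidates = product(ALTERNATIVES, repeat=len(operators))
    return min(candidates, key=lambda choice: cost(operators, choice))


def toy_cost(operators, choice):
    """Illustrative stand-in for a real cost function (made-up numbers)."""
    table = {"no-parallelization": 10.0, "g2n": 4.0, "same-as-previous": 3.0}
    if choice[0] == "same-as-previous":
        return float("inf")  # the first operator has no predecessor to copy
    return sum(table[alt] for alt in choice)
```

With the toy numbers, `cheapest_plan(["resulting-vector", "tcp"], toy_cost)` picks G2N for the first operator and reuses that strategy for the second.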
Slide N.:12
DISTRIBUTED QUERY PROCESSING
WHY
This strategy guarantees that costly programs are invoked only after all predicates have been evaluated, possibly reducing the number of tuples they have to process.
Slide N.:13
IMPLEMENTATION
G2N (throughput(tp1, tp2, …, tpn), number-tasks): result
  nodelist := descending order(throughput);
  result := result ∪ {nodelist(1)};
  cost(1) := number-tasks * nodelist(1);
  current-cost := cost(1);
  While (nodes in the list and add-new-node)
    total-cost := current-cost;
    new-node := next-node in nodelist;
    While (current-cost <= total-cost)
      move tuples from lowest node in result to new-node;
      Update costs of nodes and total-cost;
      If current-cost > total-cost
        If we could move at least 1 tuple to the new-node
          result := result ∪ {new-node}
        else
          add-new-node := false;
        Stop loop;
    endwhile
  endwhile
  output result;
The G2N algorithm receives a set of available nodes with their corresponding average throughputs (tp1, tp2, …, tpn), measured in tuples per second, and the total estimated number of tasks (T) to be evaluated.
The algorithm sorts the list of available grid nodes in decreasing order of their average throughput values. It then allocates all T tuples to the fastest node.
The main loop tries to move tuples from the already allocated nodes to a new grid node. As long as this produces a new evaluation estimate that reduces the query elapsed time, it keeps moving tuples, until the estimated elapsed time becomes higher than the last one computed. Otherwise, the algorithm stops and outputs the grid nodes accepted so far.
OUTPUT: provides the Query Optimizer with the initial query execution plan and with the re-scheduling of allocated nodes in the face of variations in estimated values.
Grid Greedy Node (G2N) algorithm
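A runnable reading of the greedy idea is sketched below. It rests on several assumptions that are not in the slides. Since throughput is given in tuples per second, a node's cost is taken as tuples divided by throughput, whereas the slide's pseudocode writes a product. The query elapsed time is the finish time of the slowest node. Tuples are moved one at a time from the node that currently finishes last. The names `g2n` and `elapsed` are illustrative.

```python
def g2n(throughputs, number_tasks):
    """Greedy node selection sketch.
    Returns ({node_index: tuples_assigned}, estimated_elapsed_seconds)."""
    # Consider nodes from fastest to slowest.
    order = sorted(range(len(throughputs)),
                   key=lambda i: throughputs[i], reverse=True)

    def elapsed(alloc):
        # Assumption: nodes work in parallel, so the slowest one sets the time.
        return max(n / throughputs[i] for i, n in alloc.items())

    result = {order[0]: number_tasks}   # all tuples start on the fastest node
    best = elapsed(result)

    for new_node in order[1:]:
        trial = dict(result)
        trial[new_node] = 0
        while True:
            # Donor: the accepted node that currently finishes last.
            donor = max((i for i in trial if i != new_node),
                        key=lambda i: trial[i] / throughputs[i])
            if trial[donor] == 0:
                break
            candidate = dict(trial)
            candidate[donor] -= 1
            candidate[new_node] += 1
            if elapsed(candidate) > elapsed(trial):
                break                   # one more tuple would slow the query
            trial = candidate
        if trial[new_node] >= 1 and elapsed(trial) < best:
            result, best = trial, elapsed(trial)   # accept the new node
        else:
            break                       # no gain: stop, keep nodes so far
    return result, best
```

With throughputs of 10, 5 and 1 tuples per second and 30 tuples, the sketch keeps the two fastest nodes (20 and 10 tuples, 2 seconds) and rejects the third, because adding it does not lower the estimate.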
Slide N.:14
ADAPTIVE QUERY EXECUTION - QEEF
Query Execution Engines (QEE): support the execution of traditional queries.
QEEF (Query Execution Engine Framework): an extensible QEE that can be adapted to new execution models, each execution model being implemented as a combination of execution modules.
SIMULATION
ANALYSIS ON BLOCK SIZE
Slide N.:15
ADAPTIVE QUERY EXECUTION - QEEF
SIMULATION
[Diagram: SPLIT, SEND, RECEIVE, Eddy and MERGE modules]
Slide N.:16
ADAPTIVE QUERY EXECUTION - QEEF
ANALYSIS ON BLOCK SIZE
Block size is an important tool for building adaptivity into the system. Eddy modifies a remote node's block size in the following scenarios:
1- Timeout (estimated time)
2- Eddy performs a local adaptation (checking current throughput values)
3- Variations in the scheduled nodes
4- When 2/3 of the tuples have been evaluated:
 - the dataflow is reduced
 - Eddy recomputes the number of scheduled nodes
 - the number of tuples in each node is increased
Slide N.:17
SCIENTIFIC APPLICATIONS
INITIAL RESULT
The QEEF framework has been extended with:
- execution of user programs (Apply operator strategy)
- spatial and temporal hash-joins (implementing the iterator interface)
- loop control over a query execution plan fragment (repeatedly evaluated)
Slide N.:18
SCIENTIFIC APPLICATIONS
INITIAL RESULT
Project configuration:
- Java 1.4.2 and Globus 3.2.1
- 20 Pentium IV 1.7 GHz processors with 256 MB of RAM, running Linux 2.4.20-31.9
We considered:
an instance with 1000 particles, executing 25 iterations for each particle.
We then increased the number of nodes:
from 1 node to 25 nodes
Results:
a gain of up to 11 times with 20 machines with respect to a centralized execution (with 2.7 tuples per second).
Problem:
the block size update strategy proved to be very useful.
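As a quick reading of the reported gain, assume the usual definition of parallel efficiency E as speedup S divided by the number of machines N (a definition not stated in the slides). The figure of up to 11 times with 20 machines then gives:

```latex
E = \frac{S}{N} \le \frac{11}{20} = 0.55
```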
Slide N.:19
CONCLUSION
CoDIMS-G is an adaptive distributed query processing grid service.
The proposed query execution strategy extends the eddy adaptive query execution model to the grid environment, taking into account variations in the run-time conditions of grid nodes.