OGSA-DAI DQP: A Developer's View
TRANSCRIPT
Bartosz Dobrzelecki, Applications Consultant, EPCC
[email protected], +44 131 650 5137
DQPConfiguration.xml
<DQPConfiguration xmlns="http://ogsadai.org.uk/dqp/namespaces/2008/12">
  <dataResources>
    <resource url="http://localhost:8080/dai/services"
              dsos="DataSourceService"
              drerID="DataRequestExecutionResource"
              resourceID="MySQLResource"
              isLocal="true"/>
    <resource url="http://localhost:8090/dai/services"
              dsos="DataSourceService"
              drerID="DataRequestExecutionResource"
              resourceID="Resource2"/>
    <resource url="http://localhost:8095/dai/services"
              dsos="DataSourceService"
              drerID="DataRequestExecutionResource"
              resourceID="MySQLResource"
              alias="MySQL"/>
  </dataResources>
  <evaluationResources>
    <resource url="http://localhost:8085/dai/services"
              drerID="DataRequestExecutionResource"/>
  </evaluationResources>
</DQPConfiguration>

Example federated table names: MySQLResource_employee, MySQL_employee
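For illustration only – a hypothetical query against the federation above, assuming both MySQL resources expose an employee table (federated table names are prefixed as shown on the slide; the column names here are invented):

```sql
-- Hypothetical federated join; MySQLResource_employee and MySQL_employee
-- are the federated table names from the slide, columns are invented.
SELECT a.id, a.salary, b.salary
FROM MySQLResource_employee a
JOIN MySQL_employee b ON a.id = b.id
```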
Query processing steps
[Diagram] Compilation and execution pipeline:

SQL query expression
  → SQLParser → Abstract Syntax Tree
  → LQPBuilder → Logical Query Plan
  → Optimiser chain → Optimised LQP
  → Partitioner → Partitioned LQP
  → WorkflowBuilder → OGSA-DAI requests and sub-workflows
  → Execute → Results
Query execution
[Diagram] A client submits an OGSA-DAI request containing an SQLQuery to the DQP Coordinator (OGSA-DAI-DQP). The coordinator sends sub-workflows, as OGSA-DAI requests, to the data nodes (OGSA-DAI Data Node 1 over DB1, Data Node 2 over DB2, Data Node 3 over DB3). Data flows from the data nodes back through the coordinator, which returns the Result.
Producing Abstract Syntax Tree (AST)
• First step: parse SQL and generate AST.
• We use ANTLR 3 to generate code from grammars.
• Two grammars:
– SQL to AST
– AST to SQL (tree grammar)
• The tree grammar is used in our OGSA-DAI Views product, which implements read-only SQL views by rewriting the AST.
• In DQP the tree grammar is used to generate string representations for column definitions, conditions, etc.
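As an illustration, a matched pair of ANTLR 3 rules might look roughly like this (rule and token names are hypothetical, not the actual DQP grammars):

```
// Parser grammar: SQL text -> AST (hypothetical rule and token names)
selectStatement
    : SELECT selectList FROM tableList (WHERE searchCondition)?
      -> ^(QUERY selectList tableList searchCondition?)
    ;

// Tree grammar: walks the same AST shape, e.g. to re-emit SQL text
selectStatement
    : ^(QUERY selectList tableList searchCondition?)
    ;
```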
AST is a contract
• We do not expect the AST to change.
• However, we do provide a mechanism for exposing new operators at the language surface.

SELECT A.aname AS name
FROM aircraft A, certified C
WHERE A.aid = C.aid
Relation valued functions
SELECT A.aname AS name
FROM outerUnion((SELECT * FROM aircraft A),
                (SELECT * FROM certified C), 'ALL') A
Logical Query Plan
• Second step: translate the AST to a logical query plan.

SELECT aname AS name
FROM aircraft
WHERE aid = 10

• Operator anatomy. Each operator has:
– an OperatorID
– a link to its parent and a list of children
– a Heading – a list of Attributes, each a (name, source, type) triple
– operator-specific internals
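The anatomy above can be sketched in code. This is a self-contained illustration only – class and field names are invented, not the real uk.org.ogsadai.dqp.lqp API:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of the operator anatomy described above (illustrative
// names, not the actual OGSA-DAI DQP classes).
public class OperatorSketch {

    // An Attribute is a (name, source, type) triple.
    public record Attribute(String name, String source, String type) {}

    public static class Operator {
        public final String operatorID;
        public Operator parent;
        public final List<Operator> children = new ArrayList<>();
        // Heading: the list of Attributes this operator produces.
        public final List<Attribute> heading = new ArrayList<>();
        // Arbitrary annotations (String -> Object), e.g. sortedness,
        // estimated cardinality, chosen join algorithm.
        public final Map<String, Object> annotations = new HashMap<>();

        public Operator(String operatorID) { this.operatorID = operatorID; }

        public void connectChild(Operator child) {
            children.add(child);
            child.parent = this;
        }
    }

    // Build a tiny SELECT-over-TABLE_SCAN fragment.
    public static Operator buildExample() {
        Operator scan = new Operator("TABLE_SCAN");
        scan.heading.add(new Attribute("aid", "aircraft", "INTEGER"));
        scan.annotations.put("estimated.cardinality", 1000L);
        Operator select = new Operator("SELECT");
        select.connectChild(scan);
        return select;
    }
}
```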
Operators
• Behaviour defined in the Operator interface
– Validation – checks if the operator gets all the input data it needs, detects missing attributes and ambiguities, deals with correlation, performs type checking.
– Update – updates operator internals after it was (re)connected.
• Operator, Heading and Attribute objects can be annotated with arbitrary annotations (key :String -> value :Object). Sample uses:
– Attribute is sorted, correlated, temporary
– Which physical algorithm to use for a join operator
– Estimated cardinality
• There will be a set of default annotations.
Operator family
• Unary:
– SELECT
– PROJECT
– RENAME
– DUPLICATE ELIMINATION
– SORT
– GROUP BY
– SCALAR GROUP BY
– ONE ROW ONLY
– TABLE_SCAN
– EXCHANGE
• Binary:
– INNER JOIN
– PRODUCT
– UNION
– INTERSECTION
– DIFFERENCE
– FULL OUTER JOIN
– [LEFT][RIGHT] OUTER JOIN
– [ANTI] SEMI JOIN
– APPLY
• [UNARY][BINARY][SCAN] REL_FUNCTION
Data Dictionary
• The Data Dictionary provides information about federated data resources, available evaluators (DRERs), and logical and physical table schemas.
• It is populated when the resource is initialised.
• Most of the entries can be annotated
– you can plug in your own code to be executed on initialisation
– you may want to annotate attributes with histograms.
• The TABLE_SCAN operator builds its Heading using data from the Data Dictionary (on update).
• After assembling, the LQP is validated.
Optimisation
• After successful validation the LQP is optimised by a chain of optimisers.
• This chain is defined as part of the Compiler configuration.
• Optimisers need to implement a single method:
Operator optimise(Operator lqpRoot,
                  DataDictionary dataDictionary,
                  CompilerConfiguration compilerConfiguration)
throws LQPException;
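A self-contained sketch of how such a chain might be applied. The real interface (above) also receives the DataDictionary and CompilerConfiguration; this toy version keeps only the LQP root, and all names besides the interface shape are invented:

```java
import java.util.List;

// Toy optimiser chain over a simplified unary LQP (illustrative only).
public class OptimiserChainSketch {

    public static class Operator {
        public final String id;
        public Operator child; // unary chain, for simplicity
        public Operator(String id, Operator child) { this.id = id; this.child = child; }
    }

    public interface Optimiser {
        Operator optimise(Operator lqpRoot);
    }

    // Example optimiser: remove redundant no-op RENAME operators.
    public static class DropRedundantRename implements Optimiser {
        public Operator optimise(Operator root) {
            if (root == null) return null;
            root.child = optimise(root.child);
            return "RENAME_NOOP".equals(root.id) ? root.child : root;
        }
    }

    // The chain is applied in order; each optimiser returns a new root.
    public static Operator runChain(Operator root, List<Optimiser> chain) {
        for (Optimiser o : chain) {
            root = o.optimise(root);
        }
        return root;
    }

    public static Operator example() {
        Operator plan = new Operator("RENAME_NOOP",
            new Operator("SELECT", new Operator("TABLE_SCAN", null)));
        return runChain(plan, List.of(new DropRedundantRename()));
    }
}
```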
Default optimisers
• Query normalisation + heuristics
– Remove redundant operators
– Select Push Down + implicit join detection
– Rename Pull Up
– Project Pull Up
• Join ordering
• Partitioning – finding best places for EXCHANGE operators
• TABLE_SCAN implosion – pushing as much processing as we can to the RDBMS
Normalisation
SELECT Temp.name, Temp.AvgSalary
FROM (
  SELECT A.aid, A.aname AS name, AVG(E.salary) AS AvgSalary
  FROM aircraft A, certified C, employees E
  WHERE A.aid = C.aid AND C.eid = E.eid AND A.cruisingrange > 1000
  GROUP BY A.aid, A.aname
) AS Temp

The AST to LQP translator is not trying to be smart – it takes it easy. The LQP is then normalised by a chain of optimisers.
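For this query, normalisation can flatten the derived table away. A plausible equivalent of the normalised form (the exact rewrite is not shown in the source):

```sql
-- Hypothetical flattened form after normalisation
SELECT A.aname AS name, AVG(E.salary) AS AvgSalary
FROM aircraft A, certified C, employees E
WHERE A.aid = C.aid AND C.eid = E.eid AND A.cruisingrange > 1000
GROUP BY A.aid, A.aname
```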
Join Ordering
• Not there yet.
• Will be based on the same cost model as in OGSA-DQP.
• We will also reuse the same algorithm that produces left deep trees.
• More sophisticated models and algorithms (considering bushy trees, semi joins, etc.) will be implemented later on.
• You can always implement your own and replace the default.
Partitioning optimiser
• Pluggable optimiser decides how to split LQP into partitions by inserting the EXCHANGE operator.
• Default optimiser will put most load on the “local” evaluator (DRER) – otherwise it will choose randomly.
TABLE_SCAN Implosion
• Not there yet.
• We will always try to push as much processing as we can to the RDBMS.
• TABLE_SCAN “eats” as much of the tree as it can and builds up an equivalent SQL query.

SELECT * FROM (
  SELECT * FROM aircraft
  WHERE aircraft.cruisingrange > 1000
) aircraft
JOIN (
  SELECT * FROM certified
) certified ON aircraft.aid = certified.aid
SQL support level of a relational resource
• TABLE_SCAN implosion needs to know what level of SQL is supported by the underlying resource:
– a fully featured RDBMS
– a simple SQL interface for CSV files, supporting only simple filtering of records
– a web service wrapper
• Relational resources will expose a resource property – a serialised object implementing an SQLSupportLevel interface similar to that defined by JDBC:

java.sql.DatabaseMetaData
public boolean supportsColumnAliasing()
public boolean supportsCorrelatedSubqueries()
public boolean supportsSubqueriesInComparisons()
public boolean supportsSubqueriesInExists()
...
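A sketch of what such a capability interface might look like. The method names mirror java.sql.DatabaseMetaData, but the SQLSupportLevel interface shown here and the CsvSupportLevel class are hypothetical:

```java
// Hypothetical capability interface in the style of java.sql.DatabaseMetaData.
public class SupportLevelSketch {

    public interface SQLSupportLevel {
        boolean supportsColumnAliasing();
        boolean supportsCorrelatedSubqueries();
        boolean supportsSubqueriesInComparisons();
    }

    // A minimal CSV wrapper: only simple filtering, no subqueries.
    public static class CsvSupportLevel implements SQLSupportLevel {
        public boolean supportsColumnAliasing() { return false; }
        public boolean supportsCorrelatedSubqueries() { return false; }
        public boolean supportsSubqueriesInComparisons() { return false; }
    }

    // Implosion would stop "eating" the tree once it reaches a construct
    // the underlying resource cannot evaluate.
    public static boolean canImplodeSubquery(SQLSupportLevel level) {
        return level.supportsSubqueriesInComparisons();
    }
}
```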
Executing the plan
• Build phase
– Each LQP Operator has an associated Activity Pipeline Builder class, which takes in an Operator and returns an Activity Output.
– Most operators can be mapped directly to a single Activity.
– Some operators may have different implementations (for example the join operator); the builder chooses the default one or is guided by an Annotation.
– The Operator -> Builder class mapping is configurable.
• Setup phase
– For each EXCHANGE a Data Source Resource is created.
• Execution phase
– All workflows (partitions) are submitted.
– The coordinator always executes a sub-workflow (with at least the EXCHANGE_CONSUMER operator).
Extensibility points
• A new Operator can be introduced by mapping a relation valued function to an Operator, and the Operator to an Activity Pipeline Builder.
• New Operator can be included in the default query normalisation by providing strategies for SELECT push down, RENAME/PROJECT pull up.
• Optimisation chain is configurable – it is easy to plug in new LQP transformations.
• Alternative physical operator implementations can be introduced by replacing default Activity Pipeline Builders – annotations can be used to choose between several implementations.
• Scalar, aggregate and relation valued User Defined Functions will be supported.
Introducing a new operator
SELECT A.aname AS name
FROM outerUnion((SELECT * FROM aircraft A),
                (SELECT * FROM certified C), 'ALL') A

• The LQP Builder will check if there is a mapping from outerUnion -> Operator and use the Operator object in the LQP.
• If there is no mapping – look for a relation valued function outerUnion in the Function Repository and connect the generic RELVAL_FUNCTION operator.
CompilerConfiguration.xml
<LQPCompilerConfiguration xmlns="http://ogsadai.org.uk/dqp/namespaces/2008/12">
  <builders operator="GROUP_BY"
            default="uk.org.ogsadai.dqp.execute.workflow.GroupBy"/>
  <builders operator="INNER_THETA_JOIN"
            default="uk.org.ogsadai.dqp.execute.workflow.ProductSelect">
    <builder name="HASH_JOIN"
             class="uk.org.ogsadai.dqp.execute.workflow.HashJoin"/>
  </builders>
  <relationFunction name="outerUnion" operator="OUTER_UNION"/>
  <operator name="OUTER_UNION"
            class="uk.org.ogsadai.dqp.lqp.operators.extra.OuterUnionOperator"/>
  <builders operator="OUTER_UNION"
            default="uk.org.ogsadai.dqp.execute.workflow.OuterUnion"/>
  <optimisationChain>
    <optimiser class="uk.org.ogsadai.dqp.lqp.optimiser.QueryNormaliser" />
    <optimiser class="uk.org.ogsadai.dqp.lqp.optimiser.SelectPushDown" />
  </optimisationChain>
</LQPCompilerConfiguration>
User Defined Functions
• Three types
– Scalar: SELECT editDistance(a.name, 'John') FROM a
– Aggregate: SELECT * FROM a HAVING a.age < median(a.age)
– Relation valued:
– Unary: SELECT * FROM sample(a, 0.75)
– Binary: SELECT * FROM fuse((SELECT * FROM a), (SELECT * FROM b))
– Scan (tuple producing): SELECT * FROM randomInt(0, 10, 1000)
• Implementations of sub-interfaces of the Function interface.
• Function Repository is part of the Data Dictionary.
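A sketch of the computation behind the scalar editDistance UDF from the example above. The real UDFs implement sub-interfaces of the Function interface; here only the scalar computation is shown as a plain static method (the class name is invented):

```java
// Hypothetical scalar UDF body: classic Levenshtein edit distance
// computed by dynamic programming.
public class EditDistanceUDF {

    public static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        // Distance from the empty prefix is the prefix length.
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(
                    Math.min(d[i - 1][j] + 1,      // deletion
                             d[i][j - 1] + 1),     // insertion
                    d[i - 1][j - 1] + cost);       // substitution
            }
        }
        return d[a.length()][b.length()];
    }
}
```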
Discovering Evaluator Capabilities
• We assume that every evaluation resource has the same set of activities and UDFs.
• Checking if activities are supported is quite easy
– Get the list of supported activities from each evaluation resource (DRER)
– Ask the Activity Pipeline Builder for a list of required activities
• Checking for UDF availability is more tricky
– Introduce a UDF Resource + a “GetUDFSchemas” activity
– Match by name and parameter list, types, return type
– Relation valued functions are problematic – they need to validate themselves inside the LQP and provide headings – this is dynamic – function schema as a script?