Apache Drill Technical Overview
TRANSCRIPT
1
> Technical Overview. Jacques Nadeau, [email protected], 22, 2013
2
Basic Process
[Diagram: a query arrives at one of several Drillbits; each Drillbit has a Distributed Cache and reads from DFS/HBase, and the cluster is coordinated through Zookeeper.]
1. Query comes to any Drillbit
2. Drillbit generates execution plan based on affinity
3. Fragments are farmed out to individual nodes
4. Data is returned to driving node
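A minimal Java sketch of this four-step flow from the driving Drillbit's point of view; the Fragment record, node names, and the overall structure are invented for illustration and are not Drill's actual code.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the basic query flow; all names are invented.
public class BasicProcessSketch {

    record Fragment(int majorId, String assignedNode) {}

    public static void main(String[] args) {
        // 1. A query arrives at any Drillbit; that node becomes the driving node.
        String query = "SELECT ... FROM sales";

        // 2. The driving Drillbit generates an execution plan based on affinity:
        //    fragments are preferentially assigned to nodes that hold the data.
        System.out.println("planning: " + query);
        Map<Integer, String> affinity = Map.of(1, "node-a", 2, "node-b");
        List<Fragment> fragments = new ArrayList<>();
        affinity.forEach((major, node) -> fragments.add(new Fragment(major, node)));

        // 3. Fragments are farmed out to individual nodes (here just printed).
        fragments.forEach(f ->
            System.out.println("send fragment " + f.majorId() + " to " + f.assignedNode()));

        // 4. Each node pushes its results back to the driving node, which returns
        //    them to the client (not modeled in this sketch).
    }
}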
3
Core Modules within a Drillbit
[Diagram of the core modules within a Drillbit: SQL Parser, Optimizer, Logical Plan, Physical Plan, Execution, Storage Engine Interface (DFS Engine, HBase Engine), RPC Endpoint, Distributed Cache.]
4
Query States
SQL: What we want to do (analyst friendly)
Logical Plan: What we want to do (language agnostic, computer friendly)
Physical Plan: How we want to do it (the best way we can tell)
Execution Plan (fragments): Where we want to do it
5
SQL
SELECT
t.cf1.name as name,
SUM(t.cf1.sales) as total_sales
FROM m7://cluster1/sales t
GROUP BY name
ORDER BY total_sales DESC
LIMIT 10;
6
Logical Plan: API/Format using JSON
Designed to be as easy as possible for language implementers to utilize
– Sugared syntax such as the sequence meta-operator
Don't constrain ourselves to a SQL-specific paradigm – support complex data type operators such as collapse and expand as well
Allow late typing
sequence: [
  { op: scan, storageengine: m7, selection: {table: sales}},
  { op: project, projections: [
      {ref: name, expr: cf1.name},
      {ref: sales, expr: cf1.sales}]},
  { op: segment, ref: by_name, exprs: [name]},
  { op: collapsingaggregate, target: by_name, carryovers: [name],
      aggregations: [{ref: total_sales, expr: sum(sales)}]},
  { op: order, ordering: [{order: desc, expr: total_sales}]},
  { op: store, storageengine: screen}]
7
Physical Plan
Insert points of parallelization where the optimizer thinks they are necessary
– If we thought that the cardinality of name would be high, we might use the alternative sort > range-merge-exchange > streaming aggregate > sort > range-merge-exchange instead of the simpler hash-random-exchange > sorting-hash-aggregate.
Pick the right version of each operator
– For example, here we've picked the sorting hash aggregate. Since a hash aggregate is already a blocking operator, doing the sort simultaneously allows us to avoid materializing an intermediate state.
Apply projection and other push-down rules into capable operators
– Note that the projection is gone, applied directly to the m7scan operator.
{ @id: 1, pop: m7scan, cluster: def, table: sales, cols: [cf1.name, cf1.sales]}
{ @id: 2, op: hash-random-exchange, input: 1, expr: 1}
{ @id: 3, op: sorting-hash-aggregate, input: 2, grouping: 1, aggr: [sum(2)], carry: [1], sort: ~aggr[0]}
{ @id: 4, op: screen, input: 3}
8
Execution Plan
Break plan into major fragments
Determine quantity of parallelization for each task based on estimated costs as well as maximum parallelization for each fragment (file size for now)
Collect up endpoint affinity for each particular HasAffinity operator
Assign particular nodes based on affinity, load and topology
Generate minor versions of each fragment for individual execution
FragmentId:
– Major = portion of dataflow
– Minor = a particular version of that execution (1 or more)
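A small sketch of the FragmentId idea (a major fragment is a portion of the dataflow, its minor fragments are the parallel executions of it); the record below is an illustration, not Drill's actual fragment handle type.

import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: one major fragment (a portion of the dataflow)
// fans out into one or more minor fragments (concrete executions on nodes).
public record FragmentId(int majorId, int minorId) {

    static List<FragmentId> parallelize(int majorId, int width) {
        List<FragmentId> ids = new ArrayList<>();
        for (int minor = 0; minor < width; minor++) {
            ids.add(new FragmentId(majorId, minor));
        }
        return ids;
    }

    public static void main(String[] args) {
        // Major fragment 2 parallelized across 3 nodes -> minor fragments 2:0, 2:1, 2:2.
        System.out.println(parallelize(2, 3));
    }
}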
9
Execution Plan, cont’d
Each execution plan has:
– One root fragment (runs on driving node)
– Leaf fragments (first tasks to run)
– Intermediate fragments (won't start until they receive data from their children)
In the case where the query output is routed to storage, the root operator will often receive metadata to present rather than data
[Diagram: fragment tree with the Root fragment at the top, Intermediate fragments in the middle, and Leaf fragments at the bottom.]
10
Example Fragments

Leaf Fragment 1
{ pop : "hash-partition-sender", @id : 1,
  child : { pop : "mock-scan", @id : 2,
    url : "http://apache.org",
    entries : [ { id : 1, records : 4000 } ] },
  destinations : [ "Cglsb2NhbGhvc3QY0gk=" ] }

Leaf Fragment 2
{ pop : "hash-partition-sender", @id : 1,
  child : { pop : "mock-scan", @id : 2,
    url : "http://apache.org",
    entries : [ { id : 1, records : 4000 }, { id : 2, records : 4000 } ] },
  destinations : [ "Cglsb2NhbGhvc3QY0gk=" ] }

Root Fragment
{ pop : "screen", @id : 1,
  child : { pop : "random-receiver", @id : 2,
    providingEndpoints : [ "Cglsb2NhbGhvc3QY0gk=" ] } }

Intermediate Fragment
{ pop : "single-sender", @id : 1,
  child : { pop : "mock-store", @id : 2,
    child : { pop : "filter", @id : 3,
      child : { pop : "random-receiver", @id : 4,
        providingEndpoints : [ "Cglsb2NhbGhvc3QYqRI=", "Cglsb2NhbGhvc3QY0gk=" ] },
      expr : " ('b') > (5) " } },
  destinations : [ "Cglsb2NhbGhvc3QYqRI=" ] }
11
Execution Flow
[Diagram: the Drill Client talks to the UserServer; the Query Foreman drives the Parser, Optimizer, and Execution Planner, and communicates with other nodes via BitCom.]
12
SQL Parser
Leverage Optiq
Add support for "any" type
Add support for nested and repeated[] references
Add transformation rules to convert from SQL AST to Logical Plan syntax
13
Optimizer
Convert Logical Plan to Physical Plan
Very much TBD
Likely leverage Optiq
Hardest problem in the system, especially given the lack of statistics
Probably not parallel
14
Execution Planner
Each scan operator provides a maximum width of parallelization based on the number of read entries (similar to splits)
Decision of parallelization width is based on a simple disk cost (size) (see the sketch below)
Affinity orders the location of fragment assignment
Storage, Scan and Exchange operators are informed of the actual endpoint assignments so they can then re-decide their entries (splits)
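A rough sketch of the width decision described above, assuming a per-slice size threshold; the 256 MB figure and all names are illustrative assumptions, not values from Drill.

public class WidthPlannerSketch {

    // Hypothetical heuristic: one minor fragment per ~256 MB of input,
    // capped by the scan's maximum width (its number of read entries / splits).
    static int decideWidth(long totalBytes, int maxWidthFromReadEntries) {
        long bytesPerSlice = 256L * 1024 * 1024;   // assumed threshold
        int widthByCost = (int) Math.max(1, (totalBytes + bytesPerSlice - 1) / bytesPerSlice);
        return Math.min(widthByCost, maxWidthFromReadEntries);
    }

    public static void main(String[] args) {
        // 3 GB of data with 8 read entries: cost suggests 12 slices, cap is 8, so width = 8.
        System.out.println(decideWidth(3L * 1024 * 1024 * 1024, 8));
    }
}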
15
Grittier
16
Execution Engine
Single JVM per Drillbit
Small heap space for object management
Small set of network event threads to manage socket operations
Callbacks for each message sent
Messages contain a header and a collection of native byte buffers (see the sketch below)
Designed to minimize copies and ser/de costs
Query setup and fragment runners are managed via processing queues & thread pools
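A minimal sketch of the message shape described above (a small header plus a collection of byte buffers handed along without copying); the deck lists Netty ByteBuf, but this self-contained illustration uses java.nio.ByteBuffer, and every name here is hypothetical.

import java.nio.ByteBuffer;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical message shape: a small header (query/fragment ids) plus raw data
// buffers that can be passed along without re-serializing the payload.
public record RpcMessageSketch(int queryId, int fragmentId, List<ByteBuffer> bodies) {

    // Callback-style send: the completion callback fires once the message is written.
    static void send(RpcMessageSketch msg, Consumer<RpcMessageSketch> onSent) {
        // ... writing the header and buffers to the socket would happen here ...
        onSent.accept(msg);
    }

    public static void main(String[] args) {
        ByteBuffer data = ByteBuffer.allocateDirect(4096);   // off-heap style buffer
        RpcMessageSketch msg = new RpcMessageSketch(1, 2, List.of(data));
        send(msg, m -> System.out.println("sent fragment " + m.fragmentId()));
    }
}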
17
Data
Records are broken into batches
Batches contain a schema and a collection of fields
Each field has a particular type (e.g. smallint)
Fields (a.k.a. columns) are stored in ValueVectors
ValueVectors are façades to byte buffers (see the sketch below)
The in-memory structure of each ValueVector is well defined and language agnostic
ValueVectors are defined based on the width and nature of the underlying data
– RepeatMap, Fixed1, Fixed2, Fixed4, Fixed8, Fixed12, Fixed16, Bit, FixedLen, VarLen1, VarLen2, VarLen4
There are three sub value vector types
– Optional (nullable), required or repeated
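A minimal sketch of a fixed-width value vector as a façade over a byte buffer (a hypothetical Fixed4-style vector of 4-byte integers); this illustrates the idea only and is not Drill's actual ValueVector code.

import java.nio.ByteBuffer;

// Hypothetical Fixed4-style vector: values live in a flat byte buffer,
// and the vector object is only a typed view over that memory.
public class Fixed4VectorSketch {
    private final ByteBuffer data;

    public Fixed4VectorSketch(int valueCount) {
        this.data = ByteBuffer.allocateDirect(valueCount * 4);  // 4 bytes per value
    }

    public void set(int index, int value) { data.putInt(index * 4, value); }

    public int get(int index) { return data.getInt(index * 4); }

    public static void main(String[] args) {
        Fixed4VectorSketch sales = new Fixed4VectorSketch(3);
        sales.set(0, 10); sales.set(1, 20); sales.set(2, 30);
        System.out.println(sales.get(1));   // prints 20
    }
}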
18
Execution Paradigm
We will have a large number of operators
Each operator works on a batch of records at a time
A loose goal is that batches are roughly a single core's L2 cache in size
Each batch of records carries a schema
An operator is responsible for reconfiguring itself if a new schema arrives (or rejecting the record batch if the schema is disallowed)
Most operators are the combination of a set of static operations along with the evaluation of query-specific expressions
Runtime-compiled operators are the combination of a pre-compiled template and a runtime-compiled set of expressions
Exchange operators are converted into Senders and Receivers when the execution plan is materialized
Each operator must support consumption of a SelectionVector, a partial materialization of a filter (sketched below)
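A small sketch of the SelectionVector idea: rather than copying surviving records, a filter materializes only the indices of the rows that pass, and downstream operators read the batch through that index list. Names and structure are illustrative assumptions.

import java.util.stream.IntStream;

public class SelectionVectorSketch {

    public static void main(String[] args) {
        int[] salesBatch = {3, 42, 7, 99, 1};                  // a batch of values

        // The filter "value > 5" does not copy data; it only materializes the
        // indices of the records that pass, i.e. a selection vector.
        int[] selection = IntStream.range(0, salesBatch.length)
                .filter(i -> salesBatch[i] > 5)
                .toArray();                                     // [1, 2, 3]

        // A downstream operator consumes the batch through the selection vector.
        for (int idx : selection) {
            System.out.println(salesBatch[idx]);                // 42, 7, 99
        }
    }
}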
19
Storage Engine
Input and output is done through storage engines
– (and the screen specialized storage operator)
A storage engine is responsible for providing metadata and statistics about the data
A storage engine exposes a set of optimizer (plan rewrite) rules to support things such as predicate pushdown
A storage engine provides one or more storage engine specific scan operators that can support affinity exposure and task splitting
– These are generated based on a StorageEngine specific configuration
The primary interfaces are RecordReader and RecordWriter (see the interface sketch below). RecordReaders are responsible for
– Converting stored data into Drill's canonical ValueVector format a batch at a time
– Providing the schema for each record batch
Our initial storage engines will be for DFS and HBase
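A rough sketch of what the RecordReader responsibilities listed above could look like as an interface; the method names and placeholder types are assumptions for illustration, not Drill's actual interfaces.

// Hypothetical shapes standing in for Drill's batch and schema types.
interface RecordBatchSketch {}
interface SchemaSketch {}

// Sketch of a reader that converts stored data into the canonical in-memory
// format one batch at a time, reporting the schema alongside each batch.
interface RecordReaderSketch {

    /** Read the next batch of records into ValueVector-style columns; null when exhausted. */
    RecordBatchSketch next();

    /** Schema of the most recently returned batch (it may change between batches). */
    SchemaSketch schema();

    /** Release any resources held by the underlying storage engine. */
    void close();
}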
20
Messages
Foreman drives query
Foreman saves intermediate fragments to distributed cache
Foreman sends leaf fragments directly to execution nodes
Executing fragments push record batches to their fragment's destination nodes
When a destination node receives the first fragment for a new query, it retrieves its appropriate fragment from the distributed cache, sets up the required framework, then waits until its start requirement is met:
– A fragment is evaluated for the number of different sending streams that are required before the query can actually be scheduled, based on each exchange's "supportsOutOfOrder" capability.
– When the IncomingBatchHandler recognizes that its start criteria have been reached, it begins (a sketch of this counting appears below)
– In the meantime, the destination node will buffer (potentially to disk)
Fragment status messages are pushed back to the foreman directly from individual nodes
A single failure status causes the foreman to cancel all other parts of the query
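A small sketch of the start-criteria handling described above: the receiving side buffers incoming batches and counts distinct sending streams, starting only once the required number has been seen. The class and method names are hypothetical.

import java.util.HashSet;
import java.util.LinkedList;
import java.util.Queue;
import java.util.Set;

// Hypothetical handler: buffers incoming batches until enough distinct
// sending streams have been seen, then drains the buffer and runs.
public class IncomingBatchHandlerSketch {
    private final int requiredStreams;
    private final Set<Integer> seenStreams = new HashSet<>();
    private final Queue<String> buffered = new LinkedList<>();
    private boolean started = false;

    public IncomingBatchHandlerSketch(int requiredStreams) {
        this.requiredStreams = requiredStreams;
    }

    public synchronized void onBatch(int sendingStreamId, String batch) {
        if (started) {
            process(batch);                        // already running: handle immediately
            return;
        }
        buffered.add(batch);                       // buffer (potentially to disk) until start
        seenStreams.add(sendingStreamId);
        if (seenStreams.size() >= requiredStreams) {
            started = true;
            buffered.forEach(this::process);       // start criteria met: drain the buffer
            buffered.clear();
        }
    }

    private void process(String batch) {
        System.out.println("processing " + batch);
    }

    public static void main(String[] args) {
        IncomingBatchHandlerSketch handler = new IncomingBatchHandlerSketch(2);
        handler.onBatch(1, "batch-a");             // buffered, only one stream seen
        handler.onBatch(2, "batch-b");             // second stream seen -> start, drain buffer
    }
}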
21
Scheduling
Plan is to leverage the concepts inside Sparrow
Reality is that receiver-side buffering and pre-assigned execution locations mean that this is very much up in the air right now
22
Operation/Configuration
Drillbit is a single JVM
Extension is done by building to an API and generating a jar file that includes a drill-module.conf file with information about where that module needs to be inserted
All configuration is done via a JSON-like configuration metaphor that supports complex types (see the sketch below)
Node discovery/service registry is done through Zookeeper
Metrics are collected utilizing the Yammer metrics module
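A brief sketch of reading a HOCON-style module configuration with the Typesafe Config library named on the Technologies slide; the drill-module.conf file name comes from the slide, but the keys shown are hypothetical examples.

import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;

public class ModuleConfigSketch {
    public static void main(String[] args) {
        // Parse a drill-module.conf from the classpath; HOCON supports nested,
        // JSON-like structures, so complex types can be expressed directly.
        Config config = ConfigFactory.parseResources("drill-module.conf");

        // Hypothetical keys describing where the module plugs in.
        String insertionPoint = config.getString("module.insertion-point");
        int priority = config.getInt("module.priority");

        System.out.println(insertionPoint + " @ " + priority);
    }
}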
23
User Interfaces
Drill provides DrillClient
– Encapsulates endpoint discovery
– Supports logical and physical plan submission, query cancellation, query status
– Supports streaming return results
Drill will provide a JDBC driver which converts JDBC into DrillClient communication (see the sketch below)
– Currently SQL parsing is done client side
• Artifact of the current state of Optiq
• Need to slim down the JDBC driver and push stuff remotely
In time, will add a REST proxy for DrillClient
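A short sketch of the JDBC path mentioned above, reusing the deck's sample query; the connection URL format is an assumption for illustration, since the deck does not specify it.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class JdbcQuerySketch {
    public static void main(String[] args) throws SQLException {
        // Hypothetical connection URL; treat it as an illustration of the flow only.
        try (Connection conn = DriverManager.getConnection("jdbc:drill:zk=localhost:2181");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT t.cf1.name AS name, SUM(t.cf1.sales) AS total_sales " +
                     "FROM m7://cluster1/sales t GROUP BY name " +
                     "ORDER BY total_sales DESC LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString("name") + ": " + rs.getLong("total_sales"));
            }
        }
    }
}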
24
Technologies
Jackson for JSON SerDe for metadata
Typesafe HOCON for configuration and module management
Netty4 as core RPC engine, protobuf for communication
Vanilla Java, Larray and Netty ByteBuf for off-heap large data structure help
Hazelcast for distributed cache
Curator on top of Zookeeper for service registry
Optiq for SQL parsing and cost optimization
Parquet (probably) as 'native' format
Janino for expression compilation
ASM for ByteCode manipulation
Yammer Metrics for metrics
Guava extensively
Carrot HPPC for primitive collections