Apache Drill - Why, What, How



DESCRIPTION

Apache Drill is the next generation of SQL query engines. It builds on ANSI SQL 2003 and extends it to handle new formats like JSON, Parquet and ORC, as well as the usual CSV, TSV, XML and other Hadoop formats. Most importantly, it melts away the barriers that have caused databases to become silos of data: by handling schema changes on the fly, it enables a whole new world of self-service and data agility never seen before.

TRANSCRIPT

Page 1: Apache Drill - Why, What, How


M. C. Srivas, CTO and Founder

Page 2: Apache Drill - Why, What, How


MapR is Unbiased Open Source

Page 3: Apache Drill - Why, What, How


Linux Is Unbiased

• Linux provides choice of databases

– MySQL

– PostgreSQL

– SQLite

• Linux provides choice of web servers

– Apache httpd

– Nginx

– Lighttpd

Page 4: Apache Drill - Why, What, How


MapR Is Unbiased

• MapR provides choice

Component       | MapR Distribution for Hadoop                                  | Distribution C       | Distribution H
Spark           | Spark (all of it) and SparkSQL                                | Spark only           | No
Interactive SQL | Impala, Drill, Hive/Tez, SparkSQL                             | One option (Impala)  | One option (Hive/Tez)
Scheduler       | YARN, Mesos                                                   | One option (YARN)    | One option (YARN)
Versions        | Hive 0.10, 0.11, 0.12, 0.13; Pig 0.11, 0.12; HBase 0.94, 0.98 | One version          | One version

Page 5: Apache Drill - Why, What, How


MapR Distribution for Apache Hadoop

[Platform diagram] MapR Data Platform (random read/write) – Data Hub, Enterprise Grade, Operational – built on MapR-FS (POSIX) and MapR-DB (high-performance NoSQL), with Security and YARN, underneath the Apache Hadoop and OSS ecosystem:
- Batch: Pig, Cascading, Spark, MapReduce v1 & v2
- Streaming: Spark Streaming, Storm*
- NoSQL & Search: HBase, Solr, Accumulo*
- SQL: Hive, Impala, Shark, Drill*
- ML, Graph: Mahout, MLLib, GraphX
- Execution engines: Tez*
- Workflow, data governance and operations: Sentry*, Oozie, ZooKeeper, Sqoop, Knox*, Whirr, Falcon*, Flume
- Provisioning & coordination: Juju, Savannah*
- Data integration & access: HttpFS, Hue, NFS, HDFS API, HBase API, JSON API
All managed by the MapR Control System (management and monitoring) through CLI, GUI and REST API.

* In roadmap for inclusion/certification

Page 6: Apache Drill - Why, What, How


Page 7: Apache Drill - Why, What, How


Hadoop as an augmentation for the EDW – why?

Page 8: Apache Drill - Why, What, How


Page 9: Apache Drill - Why, What, How


Page 10: Apache Drill - Why, What, How


Page 11: Apache Drill - Why, What, How


Page 12: Apache Drill - Why, What, How


Page 13: Apache Drill - Why, What, How


But inside, it looks like this …

Page 14: Apache Drill - Why, What, How


And this …

Page 15: Apache Drill - Why, What, How


And this …

Page 16: Apache Drill - Why, What, How


Consolidating schemas is very hard.

Page 17: Apache Drill - Why, What, How


Consolidating schemas is very hard, and it causes silos

Page 18: Apache Drill - Why, What, How


Silos make analysis very difficult

• How do I identify a unique {customer, trade} across data sets?

• How can I guarantee the lack of anomalous behavior if I can’t see all data?

Page 19: Apache Drill - Why, What, How


Hard to know what’s of value a priori

Page 20: Apache Drill - Why, What, How


Hard to know what’s of value a priori

Page 21: Apache Drill - Why, What, How


Why Hadoop

Page 22: Apache Drill - Why, What, How


Rethink SQL for Big Data

Preserve

• ANSI SQL – familiar and ubiquitous
• Performance – interactive nature crucial for BI/Analytics
• One technology – painful to manage different technologies
• Enterprise ready – system-of-record, HA, DR, security, multi-tenancy, …

Page 23: Apache Drill - Why, What, How


Rethink SQL for Big Data

Preserve

• ANSI SQL – familiar and ubiquitous
• Performance – interactive nature crucial for BI/Analytics
• One technology – painful to manage different technologies
• Enterprise ready – system-of-record, HA, DR, security, multi-tenancy, …

Invent

• Flexible data model – allow schemas to evolve rapidly; support semi-structured data types
• Agility – self-service is possible when the developer and the DBA are the same person
• Scalability – in all dimensions: data, speed, schemas, processes, management

Page 24: Apache Drill - Why, What, How


SQL is here to stay

Page 25: Apache Drill - Why, What, How


Hadoop is here to stay

Page 26: Apache Drill - Why, What, How


YOU CAN’T HANDLE REAL SQL

Page 27: Apache Drill - Why, What, How


SQL

select * from A where exists (
  select 1 from B where B.b < 100 );

• Did you know Apache Hive cannot compute this?

  – Neither can the other SQL-on-Hadoop engines of its kind, e.g. Impala and Spark/Shark

Page 28: Apache Drill - Why, What, How


Self-described Data

select cf.month, cf.year

from hbase.table1;

• Did you know normal SQL cannot handle the above?

• Neither can Hive and its variants like Impala and Shark

Page 29: Apache Drill - Why, What, How


Self-described Data

select cf.month, cf.year

from hbase.table1;

• Why?

• Because there’s no meta-store definition available
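Drill, by contrast, handles it: because there is no metastore, it treats each HBase column family as a map of raw byte values and lets the query decode them on the fly. A minimal sketch, assuming the cells under cf hold UTF-8 text (the table and column names are the ones from the slide):

select convert_from(t.cf.`month`, 'UTF8') as `month`,
       convert_from(t.cf.`year`, 'UTF8') as `year`
from hbase.`table1` t;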

Page 30: Apache Drill - Why, What, How


Self-Describing Data Ubiquitous

Centralized schema
- Static
- Managed by the DBAs
- In a centralized repository
- Long, meticulous data preparation process (ETL, create/alter schema, etc.) – can take 6-18 months

Self-describing, or schema-less, data
- Dynamic/evolving
- Managed by the applications
- Embedded in the data
- Less schema; more suitable for data with higher volume, variety and velocity

Apache Drill

Page 31: Apache Drill - Why, What, How


A Quick Tour through Apache Drill

Page 32: Apache Drill - Why, What, How


Data Source is in the Query

select timestamp, message

from dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet`

where errorLevel > 2

Page 33: Apache Drill - Why, What, How


Data Source is in the Query

select timestamp, message

from dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet`

where errorLevel > 2

This (dfs1 above) is a cluster in Apache Drill:
- DFS
- HBase
- Hive meta-store

Page 34: Apache Drill - Why, What, How


Data Source is in the Query

select timestamp, message

from dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet`

where errorLevel > 2

This (dfs1 above) is a cluster in Apache Drill:
- DFS
- HBase
- Hive meta-store

A work-space (logs above):
- Typically a sub-directory
- A Hive database

Page 35: Apache Drill - Why, What, How


Data Source is in the Query

select timestamp, message

from dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet`

where errorLevel > 2

This (dfs1 above) is a cluster in Apache Drill:
- DFS
- HBase
- Hive meta-store

A work-space (logs above):
- Typically a sub-directory
- A Hive database

A table (the back-quoted path above):
- Pathnames
- An HBase table
- A Hive table
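To make the cluster / work-space / table path concrete, a short sketch; only dfs1.logs comes from the slide, and the Hive database, Hive table and HBase table names below are hypothetical:

-- <cluster (storage plugin)>.<work-space>.<table>
select * from dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet`;   -- files under a DFS workspace
select * from hive.sales.`orders`;                               -- a table in a Hive database
select * from hbase.user.`profiles`;                             -- an HBase table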

Page 36: Apache Drill - Why, What, How


Combine data sources on the fly

• JSON

• CSV

• ORC (i.e., all Hive types)

• Parquet

• HBase tables

• … can combine them

select USERS.name, USERS.emails.work
from dfs.logs.`/data/logs` LOGS,
     dfs.users.`/profiles.json` USERS
where LOGS.uid = USERS.uid
  and errorLevel > 5
order by count(*);

Page 37: Apache Drill - Why, What, How


Can be an entire directory tree

// On a file

select errorLevel, count(*)

from dfs.logs.`/AppServerLogs/2014/Jan/part0001.parquet`

group by errorLevel;

Page 38: Apache Drill - Why, What, How


Can be an entire directory tree

// On a file

select errorLevel, count(*)

from dfs.logs.`/AppServerLogs/2014/Jan/part0001.parquet`

group by errorLevel;

// On the entire data collection: all years, all months

select errorLevel, count(*)

from dfs.logs.`/AppServerLogs`

group by errorLevel

Page 39: Apache Drill - Why, What, How


Can be an entire directory tree

// On a file

select errorLevel, count(*)

from dfs.logs.`/AppServerLogs/2014/Jan/part0001.parquet`

group by errorLevel;

// On the entire data collection: all years, all months

select errorLevel, count(*)

from dfs.logs.`/AppServerLogs`

group by errorLevel

(dirs[1] and dirs[2] label the directory levels of the tree, e.g. the year 2014 and the month Jan in the path above)

Page 40: Apache Drill - Why, What, How


Can be an entire directory tree

// On a file

select errorLevel, count(*)

from dfs.logs.`/AppServerLogs/2014/Jan/part0001.parquet`

group by errorLevel;

// On a subset of the collection: years after 2012, grouped by month as well
select errorLevel, count(*)
from dfs.logs.`/AppServerLogs`
where dirs[1] > 2012
group by errorLevel, dirs[2]

(dirs[1] and dirs[2] refer to the year and month directory levels of the tree)
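For anyone trying this on a released Drill build: the directory levels are exposed as implicit columns named dir0, dir1, and so on, so the query above would look roughly like the following sketch (assuming the /AppServerLogs/<year>/<month>/... layout shown earlier):

select dir0 as `year`, dir1 as `month`, errorLevel, count(*)
from dfs.logs.`/AppServerLogs`
where dir0 > '2012'    -- dir0/dir1 are strings taken from the path
group by dir0, dir1, errorLevel;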

Page 41: Apache Drill - Why, What, How


Querying JSON

// donuts.json
{ "name": "classic",     "fillings": [ { "name": "sugar", "cal": 400 } ] }
{ "name": "choco",       "fillings": [ { "name": "sugar", "cal": 400 }, { "name": "chocolate", "cal": 300 } ] }
{ "name": "bostoncreme", "fillings": [ { "name": "sugar", "cal": 400 }, { "name": "cream", "cal": 1000 }, { "name": "jelly", "cal": 600 } ] }

Page 42: Apache Drill - Why, What, How


Cursors inside Drill

DrillClient drill = new DrillClient().connect( … );

ResultReader r = drill.runSqlQuery("select * from `donuts.json`");

while (r.next()) {
  String donutName = r.reader("name").readString();
  ListReader fillings = r.reader("fillings");
  while (fillings.next()) {
    int calories = fillings.reader("cal").readInteger();
    if (calories > 400)
      print(donutName, calories, fillings.reader("name").readString());
  }
}

(donuts.json as shown on page 41)

Page 43: Apache Drill - Why, What, How


Direct queries on nested data

// Flattening maps in JSON, Parquet and other nested records

select name, flatten(fillings) as f

from dfs.users.`/donuts.json`

where f.cal < 300;

// lists the fillings < 300 calories

(donuts.json as shown on page 41)
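Note that filtering on the flatten alias f within the same SELECT is a convenience not every engine allows; the same query can be expressed more conservatively by pushing the flatten into a subquery (a sketch of the same idea):

select t.name, t.f
from (select name, flatten(fillings) as f
      from dfs.users.`/donuts.json`) t
where t.f.cal < 300;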

Page 44: Apache Drill - Why, What, How


Complex Data Using SQL or Fluent API

// SQL
Result r = drill.sql("select name, flatten(fillings) from `donuts.json` where fillings.cal < 300");

// or Fluent API
Result r = drill.table("donuts.json").lt("fillings.cal", 300).all();

while (r.next()) {
  String name = r.get("name").string();
  List fillings = r.get("fillings").list();
  while (fillings.next()) {
    // `calories` is not declared on the original slide; presumably the filling's cal value
    print(name, calories, fillings.get("name").string());
  }
}

{ "name": "classic",     "fillings": [ { "name": "sugar", "cal": 400 } ] }
{ "name": "choco",       "fillings": [ { "name": "sugar", "cal": 400 }, { "name": "plain", "cal": 280 } ] }
{ "name": "bostoncreme", "fillings": [ { "name": "sugar", "cal": 400 }, { "name": "cream", "cal": 1000 }, { "name": "jelly", "cal": 600 } ] }

Page 45: Apache Drill - Why, What, How


Queries on embedded data

// Embedded JSON value inside column donut-json, inside column-family cf1
// of an HBase table `donuts`

select d.name, count(d.fillings)
from (select convert_from(cf1.`donut-json`, json) as d
      from hbase.user.`donuts`);

Page 46: Apache Drill - Why, What, How


Queries inside JSON records

// Each JSON record itself can be a whole database
// Example: get all donuts with at least one filling over 300 calories

select d.name, count(d.fillings),
       max(d.fillings.cal) within record as maxcal
from (select convert_from(cf1.`donut-json`, json) as d
      from hbase.user.`donuts`)
where maxcal > 300;

Page 47: Apache Drill - Why, What, How


• Schema can change over the course of a query

• Operators are able to reconfigure themselves on schema change events

– Minimize flexibility overhead

– Support more advanced execution optimization based on actual data characteristics

Page 48: Apache Drill - Why, What, How


De-centralized metadata

// Count the number of tweets per customer, where the customers are in Hive and
// their tweets are in HBase. Note that the HBase data has no meta-data information.

select c.customerName, hb.tweets.count
from hive.CustomersDB.`Customers` c
join hbase.user.`SocialData` hb
  on c.customerId = convert_from( hb.rowkey, UTF-8);
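On released Drill versions the conversion type is written as a quoted literal and the HBase storage plugin exposes the row key as a column named row_key, so the same join would look roughly like this sketch:

select c.customerName, hb.tweets.`count`
from hive.CustomersDB.`Customers` c
join hbase.user.`SocialData` hb
  on c.customerId = convert_from(hb.row_key, 'UTF8');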

Page 49: Apache Drill - Why, What, How


So what does this all mean?

Page 50: Apache Drill - Why, What, How


A Drill Database

• What is a database with Drill/MapR?

Page 51: Apache Drill - Why, What, How


A Drill Database

• What is a database with Drill/MapR?

• Just a directory, with a bunch of related files

Page 52: Apache Drill - Why, What, How


A Drill Database

• What is a database with Drill/MapR?

• Just a directory, with a bunch of related files

• There’s no need for artificial boundaries

– No need to bunch a set of tables together to call it a “database”

Page 53: Apache Drill - Why, What, How


A Drill Database

/user/srivas/work/bugs

BugList:
symptom      | version | date    | bugid | dump-name
impala crash | 3.1.1   | 14/7/14 | 12345 | cust1.tgz
cldb slow    | 3.1.0   | 12/7/14 | 45678 | cust2.tgz

Customers:
name | rep   | se      | dump-name
xxxx | dkim  | junhyuk | cust1.tgz
yyyy | yoshi | aki     | cust2.tgz

Page 54: Apache Drill - Why, What, How


Queries are simple

select b.bugid, b.symptom, b.date
from dfs.bugs.`/Customers` c, dfs.bugs.`/BugList` b
where c.`dump-name` = b.`dump-name`

Page 55: Apache Drill - Why, What, How


Queries are simple

select b.bugid, b.symptom, b.date
from dfs.bugs.`/Customers` c, dfs.bugs.`/BugList` b
where c.`dump-name` = b.`dump-name`

Let's say I want to cross-reference against your list:

select bugid, symptom
from dfs.bugs.`/BugList` b, dfs.yourbugs.`/YourBugFile` b2
where b.bugid = b2.xxx

Page 56: Apache Drill - Why, What, How


What does it mean?

Page 57: Apache Drill - Why, What, How


What does it mean?

• No ETL

• Reach out directly to the particular table/file

• As long as the permissions are fine, you can do it

• No need to have the meta-data

– None needed

Page 58: Apache Drill - Why, What, How


Another example

select d.name, count(d.fillings)
from (select convert_from(cf1.`donut-json`, json) as d
      from hbase.user.`donuts`);

• convert_from( xx, json) invokes the json parser inside Drill

Page 59: Apache Drill - Why, What, How


Another example

select d.name, count(d.fillings)
from (select convert_from(cf1.`donut-json`, json) as d
      from hbase.user.`donuts`);

• convert_from( xx, json) invokes the json parser inside Drill

• What if you could plug in any parser?

Page 60: Apache Drill - Why, What, How


Another example

select d.name, count(d.fillings)
from (select convert_from(cf1.`donut-json`, json) as d
      from hbase.user.`donuts`);

• convert_from( xx, json) invokes the json parser inside Drill

• What if you could plug in any parser

– XML?

– Semi-conductor yield-analysis files? Oil-exploration readings?

– Telescope readings of stars?

– RFIDs of various things?

Page 61: Apache Drill - Why, What, How


No ETL

• Basically, Drill is querying the raw data directly

• Joining with processed data

• NO ETL

• Folks, this is very, very powerful

• NO ETL

Page 62: Apache Drill - Why, What, How


Seamless integration with Apache Hive

• Low latency queries on Hive tables

• Support for 100s of Hive file formats

• Ability to reuse Hive UDFs

• Support for multiple Hive metastores in a single query
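As a concrete illustration, a Hive-managed table is addressed through the hive storage plugin like any other Drill source; a small sketch reusing the CustomersDB example that appears earlier in this deck:

select count(*) as numCustomers
from hive.CustomersDB.`Customers`;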

Page 63: Apache Drill - Why, What, How


Underneath the Covers

Page 64: Apache Drill - Why, What, How


Basic Process

[Diagram: ZooKeeper coordinating a cluster of Drillbits, each with a distributed cache, co-located with DFS/HBase on every node]

1. Query comes to any Drillbit (JDBC, ODBC, CLI, protobuf)
2. Drillbit generates execution plan based on query optimization & locality
3. Fragments are farmed out to individual nodes
4. Result is returned to the driving node
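A quick sanity check for step 1: recent Drill builds expose the registered Drillbits through a system table, so from any client you can run a sketch like:

select * from sys.drillbits;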

Page 65: Apache Drill - Why, What, How


Stages of Query Planning

SQL Query → Parser → Logical Planner (heuristic and cost based) → Physical Planner (cost based) → Foreman → plan fragments sent to Drillbits
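To inspect what these stages produce for a particular query, Drill can print the plan; a sketch of the usual form (the physical plan is what gets broken into fragments for the Drillbits):

explain plan for
select errorLevel, count(*) from dfs.logs.`/AppServerLogs` group by errorLevel;

-- logical plan only
explain plan without implementation for
select errorLevel, count(*) from dfs.logs.`/AppServerLogs` group by errorLevel;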

Page 66: Apache Drill - Why, What, How


Query Execution

[Architecture diagram: JDBC, ODBC and RPC endpoints feed the Foreman; pluggable parsers (SQL, Pig, HiveQL) produce a logical plan; the Optimizer and Scheduler turn it into a physical plan; Operators execute it against a Distributed Cache and pluggable Storage Engine Interfaces (HDFS, HBase, Mongo, Cassandra)]

Page 67: Apache Drill - Why, What, How


A Query engine that is…

• Columnar/Vectorized

• Optimistic/pipelined

• Runtime compilation

• Late binding

• Extensible

Page 68: Apache Drill - Why, What, How


Columnar representation

[Diagram: a record with fields A, B, C, D, E stored row-wise in memory and column-by-column (A, B, C, D, E) on disk]

Page 69: Apache Drill - Why, What, How


Columnar Encoding

• Values in a column are stored next to one another

– Better compression

– Range-map: save min-max, can skip if not present

• Only retrieve columns participating in query

• Aggregations can be performed without decoding


Page 70: Apache Drill - Why, What, How


Run-length-encoding & Sum

• Dataset encoded as <val> <run-length>:

– 2, 4 (4 2’s)

– 8, 10 (10 8’s)

• Goal: sum all the records

• Normally:

– Decompress: 2, 2, 2, 2, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8

– Add: 2 + 2 + 2 + 2 + 8 + 8 + 8 + 8 + 8 + 8 + 8 + 8 + 8 + 8

• Optimized work: 2 * 4 + 8 * 10

– Less memory, fewer operations

Page 71: Apache Drill - Why, What, How


Bit-packed Dictionary Sort

• Dataset encoded with a dictionary and bit-positions:

– Dictionary: [Rupert, Bill, Larry] {0, 1, 2}

– Values: [1,0,1,2,1,2,1,0]

• Normal work

– Decompress & store: Bill, Rupert, Bill, Larry, Bill, Larry, Bill, Rupert

– Sort: ~24 comparisons of variable width strings

• Optimized work

– Sort dictionary: {Bill: 1, Larry: 2, Rupert: 0}

– Sort bit-packed values

– Work: max 3 string comparisons, ~24 comparisons of fixed-width dictionary bits

Page 72: Apache Drill - Why, What, How


Drill 4-value semantics

• SQL’s 3-valued semantics

– True

– False

– Unknown

• Drill adds fourth

– Repeated

Page 73: Apache Drill - Why, What, How


Vectorization

• Drill operates on more than one record at a time

– Word-sized manipulations

– SIMD-like instructions

• GCC, LLVM and JVM all do various optimizations automatically

– Manually code algorithms

• Logical Vectorization

– Bitmaps allow lightning fast null-checks

– Avoid branching to speed CPU pipeline

Page 74: Apache Drill - Why, What, How


Runtime Compilation is Faster

• JIT is smart, but more gains with runtime compilation

• Janino: Java-based Java compiler

From http://bit.ly/16Xk32x

Page 75: Apache Drill - Why, What, How


Drill compiler

[Diagram: CodeModel generates code → Janino compiles the runtime byte-code → byte-code of the generated class is merged with precompiled byte-code templates → loaded class]

Page 76: Apache Drill - Why, What, How


Optimistic

[Chart: speed vs. check-pointing – systems that checkpoint frequently compared with Apache Drill, which has no need to checkpoint]

Page 77: Apache Drill - Why, What, How


Optimistic Execution

• Recovery code trivial

– Running instances discard the failed query’s intermediate state

• Pipelining possible

– Send results as soon as batch is large enough

– Requires barrier-less decomposition of query

Page 78: Apache Drill - Why, What, How


Batches of Values

• Value vectors

– List of values, with same schema

– With the 4-value semantics for each value

• Shipped around in batches

– max 256k bytes in a batch

– max 64K rows in a batch

• RPC designed for multiple replies to a request

Page 79: Apache Drill - Why, What, How


Pipelining

• Record batches are pipelined between nodes

– ~256kB usually

• Unit of work for Drill

– Operators work on a batch

• Operator reconfiguration happens at batch boundaries

[Diagram: record batches pipelined between DrillBits]

Page 80: Apache Drill - Why, What, How


Pipelining Record Batches

[Same query-execution architecture diagram as on page 66, with record batches flowing between the operators]

Page 81: Apache Drill - Why, What, How



Pipelining

• Random access: sort without copy or restructuring

• Avoids serialization/deserialization

• Off-heap (no GC woes when lots of memory)

• Full specification + off-heap + batch

– Enables C/C++ operators (fast!)

• Read/write to disk

– when data larger than memory

[Diagram: a Drill Bit whose memory overflow uses disk]

Page 82: Apache Drill - Why, What, How


Cost-based Optimization

• Using Optiq, an extensible framework

• Pluggable rules, and cost model

• Rules for distributed plan generation

• Insert Exchange operator into physical plan

• Optiq enhanced to explore parallel query plans

• Pluggable cost model

– CPU, IO, memory, network cost (data locality)

– Storage engine features (HDFS vs HIVE vs HBase)

[Diagram: query → optimizer, with pluggable rules and a pluggable cost model]

Page 83: Apache Drill - Why, What, How


Distributed Plan Cost

• Operators have a distribution property
  – Hash, Broadcast, Singleton, …

• Exchange operators enforce distributions
  – Hash: HashToRandomExchange
  – Broadcast: BroadcastExchange
  – Singleton: UnionExchange, SingleMergeExchange

• Enumerate all, use cost to pick the best
  – Merge join vs. hash join
  – Partition-based join vs. broadcast-based join
  – Streaming aggregation vs. hash aggregation

• Aggregation in one phase or two phases
  – Partial local aggregation followed by final aggregation

[Example distributed plan: data flowing through HashToRandomExchange, Sort and Streaming-Aggregation]

Page 84: Apache Drill - Why, What, How


Interactive SQL-on-Hadoop options

                | Drill 1.0                                       | Hive 0.13 w/ Tez            | Impala 1.x
Latency         | Low                                             | Medium                      | Low
Files           | Yes (all Hive file formats, plus JSON, Text, …) | Yes (all Hive file formats) | Yes (Parquet, Sequence, …)
HBase/MapR-DB   | Yes                                             | Yes, perf issues            | Yes, with issues
Schema          | Hive or schema-less                             | Hive                        | Hive
SQL support     | ANSI SQL                                        | HiveQL                      | HiveQL (subset)
Client support  | ODBC/JDBC                                       | ODBC/JDBC                   | ODBC/JDBC
Hive compat     | High                                            | High                        | Low
Large datasets  | Yes                                             | Yes                         | Limited
Nested data     | Yes                                             | Limited                     | No
Concurrency     | High                                            | Limited                     | Medium

Page 85: Apache Drill - Why, What, How


Apache Drill Roadmap

1.0 – Data exploration / ad-hoc queries
• Low-latency SQL
• Schema-less execution
• Files & HBase/M7 support
• Hive integration
• BI and SQL tool support via ODBC/JDBC

1.1 – Advanced analytics and operational data
• HBase query speedup
• Nested data functions
• Advanced SQL functionality

2.0 – Operational SQL
• Ultra low latency queries
• Single row insert/update/delete
• Workload management

Page 86: Apache Drill - Why, What, How


MapR Distribution for Apache Hadoop

[Same platform / ecosystem diagram as on page 5]

Page 87: Apache Drill - Why, What, How


Apache Drill Resources

• Drill 0.5 released last week

• Getting started with Drill is easy

– just download the tarball and start running SQL queries on local files (see the sketch after this list)

• Mailing lists

[email protected]

[email protected]

• Docs: https://cwiki.apache.org/confluence/display/DRILL/Apache+Drill+Wiki

• Fork us on GitHub: http://github.com/apache/incubator-drill/

• Create a JIRA: https://issues.apache.org/jira/browse/DRILL
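A minimal first session after unpacking the tarball, using the bundled sqlline shell; the cp.`employee.json` sample ships with recent Drill builds, and the second path is purely illustrative:

select * from cp.`employee.json` limit 5;

select * from dfs.`/tmp/donuts.json`;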

Page 88: Apache Drill - Why, What, How


Active Drill Community

• Large community, growing rapidly

– 35-40 contributors, 16 committers

– Microsoft, LinkedIn, Oracle, Facebook, Visa, Lucidworks, Concurrent, many universities

• In 2014

– over 20 meet-ups, many more coming soon

– 3 hackathons, with 40+ participants

• Encourage you to join, learn, contribute and have fun …

Page 89: Apache Drill - Why, What, How


Drill at MapR

• World-class SQL team, ~20 people

• 150+ years combined experience building commercial databases

• Oracle, DB2, ParAccel, Teradata, SQLServer, Vertica

• Team works on Drill, Hive, Impala

• Fixed some of the toughest problems in Apache Hive

Page 90: Apache Drill - Why, What, How


Thank you!

M. C. Srivas [email protected]

Did I mention we are hiring…