apache drill: an active, ad-hoc query system for large-scale data sets
DESCRIPTION
Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets given by MapR Chief Data Engineer EMEA . Big Data User Group in Stuttgart 2013-05-16TRANSCRIPT
![Page 1: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets](https://reader033.vdocuments.site/reader033/viewer/2022042601/554fa206b4c90586258b4a2c/html5/thumbnails/1.jpg)
Apache Drill a interac.ve, ad-‐hoc query system for large-‐scale datasets
Michael Hausenblas, Chief Data Engineer EMEA, MapR Big Data User Group Stu>gart, 2013-‐05-‐16
![Page 2: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets](https://reader033.vdocuments.site/reader033/viewer/2022042601/554fa206b4c90586258b4a2c/html5/thumbnails/2.jpg)
Which workloads do you encounter in your environment?
h>p://w
ww.flickr.com
/photos/kevinomara/2866648330/ licensed under CC BY-‐N
C-‐ND 2.0
![Page 3: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets](https://reader033.vdocuments.site/reader033/viewer/2022042601/554fa206b4c90586258b4a2c/html5/thumbnails/3.jpg)
Batch processing
… for recurring tasks such as large-‐scale data mining, ETL offloading/data-‐warehousing à for the batch layer in Lambda architecture
![Page 4: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets](https://reader033.vdocuments.site/reader033/viewer/2022042601/554fa206b4c90586258b4a2c/html5/thumbnails/4.jpg)
OLTP
… user-‐facing eCommerce transac[ons, real-‐[me messaging at scale (FB), [me-‐series processing, etc. à for the serving layer in Lambda architecture
![Page 5: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets](https://reader033.vdocuments.site/reader033/viewer/2022042601/554fa206b4c90586258b4a2c/html5/thumbnails/5.jpg)
Stream processing
… in order to handle stream sources such as social media feeds or sensor data (mobile phones, RFID, weather sta[ons, etc.) à for the speed layer in Lambda architecture
![Page 6: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets](https://reader033.vdocuments.site/reader033/viewer/2022042601/554fa206b4c90586258b4a2c/html5/thumbnails/6.jpg)
Search/Informa[on Retrieval
… retrieval of items from unstructured documents (plain text, etc.), semi-‐structured data formats (JSON, etc.), as well as data stores (MongoDB, CouchDB, etc.)
![Page 7: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets](https://reader033.vdocuments.site/reader033/viewer/2022042601/554fa206b4c90586258b4a2c/html5/thumbnails/7.jpg)
h>p://www.flickr.com/photos/9479603@N02/4144121838/ licensed under CC BY-‐NC-‐ND 2.0
But what about interac.ve ad-‐hoc query at scale?
![Page 8: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets](https://reader033.vdocuments.site/reader033/viewer/2022042601/554fa206b4c90586258b4a2c/html5/thumbnails/8.jpg)
Impala
Interac[ve Query (?)
low-‐latency
![Page 9: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets](https://reader033.vdocuments.site/reader033/viewer/2022042601/554fa206b4c90586258b4a2c/html5/thumbnails/9.jpg)
Use Case: Marke[ng Campaign
• Jane, a marke[ng analyst • Determine target segments • Data from different sources
![Page 10: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets](https://reader033.vdocuments.site/reader033/viewer/2022042601/554fa206b4c90586258b4a2c/html5/thumbnails/10.jpg)
Use Case: Logis[cs
• Supplier tracking and performance • Queries – Shipments from supplier ‘ACM’ in last 24h – Shipments in region ‘US’ not from ‘ACM’
SUPPLIER_ID NAME REGION
ACM ACME Corp US
GAL GotALot Inc US
BAP Bits and Pieces Ltd Europe
ZUP Zu Pli Asia
{ "shipment": 100123, "supplier": "ACM", “timestamp": "2013-02-01", "description": ”first delivery today” }, { "shipment": 100124, "supplier": "BAP", "timestamp": "2013-02-02", "description": "hope you enjoy it” } …
![Page 11: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets](https://reader033.vdocuments.site/reader033/viewer/2022042601/554fa206b4c90586258b4a2c/html5/thumbnails/11.jpg)
Use Case: Crime Detec[on
• Online purchases • Fraud, bilking, etc. • Batch-‐generated overview • Modes – Explora[ve – Alerts
![Page 12: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets](https://reader033.vdocuments.site/reader033/viewer/2022042601/554fa206b4c90586258b4a2c/html5/thumbnails/12.jpg)
Requirements
• Support for different data sources • Support for different query interfaces • Low-‐latency/real-‐[me • Ad-‐hoc queries • Scalable, reliable
![Page 13: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets](https://reader033.vdocuments.site/reader033/viewer/2022042601/554fa206b4c90586258b4a2c/html5/thumbnails/13.jpg)
And now for something completely different …
![Page 14: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets](https://reader033.vdocuments.site/reader033/viewer/2022042601/554fa206b4c90586258b4a2c/html5/thumbnails/14.jpg)
Google’s Dremel
h>p://research.google.com/pubs/pub36632.html
Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Ma@ Tolton, Theo Vassilakis, Proc. of the 36th Int'l Conf on Very Large Data Bases (2010), pp. 330-‐339
Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google. …
“ “ Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google. …
![Page 15: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets](https://reader033.vdocuments.site/reader033/viewer/2022042601/554fa206b4c90586258b4a2c/html5/thumbnails/15.jpg)
Google’s Dremel
multi-level execution trees
columnar data layout
![Page 16: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets](https://reader033.vdocuments.site/reader033/viewer/2022042601/554fa206b4c90586258b4a2c/html5/thumbnails/16.jpg)
Google’s Dremel
nested data + schema column-striped representation
map nested data to tables
![Page 17: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets](https://reader033.vdocuments.site/reader033/viewer/2022042601/554fa206b4c90586258b4a2c/html5/thumbnails/17.jpg)
Google’s Dremel
experiments: datasets & query performance
![Page 18: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets](https://reader033.vdocuments.site/reader033/viewer/2022042601/554fa206b4c90586258b4a2c/html5/thumbnails/18.jpg)
Back to Apache Drill …
![Page 19: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets](https://reader033.vdocuments.site/reader033/viewer/2022042601/554fa206b4c90586258b4a2c/html5/thumbnails/19.jpg)
Apache Drill–key facts
• Inspired by Google’s Dremel • Standard SQL 2003 support • Plug-‐able data sources • Nested data is a first-‐class ci[zen • Schema is op.onal • Community driven, open, 100’s involved
![Page 20: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets](https://reader033.vdocuments.site/reader033/viewer/2022042601/554fa206b4c90586258b4a2c/html5/thumbnails/20.jpg)
High-‐level Architecture
![Page 21: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets](https://reader033.vdocuments.site/reader033/viewer/2022042601/554fa206b4c90586258b4a2c/html5/thumbnails/21.jpg)
Principled Query Execu[on
Source Query Parser
Logical Plan Op[mizer
Physical Plan Execu[on
SQL 2003 DrQL MongoQL DSL
scanner API Topology CF etc.
query: [ { @id: "log", op: "sequence", do: [ { op: "scan", source: “logs” }, { op: "filter", condition: "x > 3” },
parser API
![Page 22: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets](https://reader033.vdocuments.site/reader033/viewer/2022042601/554fa206b4c90586258b4a2c/html5/thumbnails/22.jpg)
Wire-‐level Architecture • Each node: Drillbit -‐ maximize data locality • Co-‐ordina[on, query planning, execu[on, etc, are distributed • Any node can act as endpoint for a query—foreman
Storage Process
Drillbit
node
Storage Process
Drillbit
node
Storage Process
Drillbit
node
Storage Process
Drillbit
node
![Page 23: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets](https://reader033.vdocuments.site/reader033/viewer/2022042601/554fa206b4c90586258b4a2c/html5/thumbnails/23.jpg)
Wire-‐level Architecture • Curator/Zookeeper for ephemeral cluster membership info • Distributed cache (Hazelcast) for metadata, locality
informa[on, etc.
Curator/Zk
Distributed Cache
Storage Process
Drillbit
node
Storage Process
Drillbit
node
Storage Process
Drillbit
node
Storage Process
Drillbit
node
Distributed Cache Distributed Cache Distributed Cache
![Page 24: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets](https://reader033.vdocuments.site/reader033/viewer/2022042601/554fa206b4c90586258b4a2c/html5/thumbnails/24.jpg)
Wire-‐level Architecture • Origina[ng Drillbit acts as foreman: manages query execu[on,
scheduling, locality informa[on, etc. • Streaming data communica.on avoiding SerDe
Curator/Zk
Distributed Cache
Storage Process
Drillbit
node
Storage Process
Drillbit
node
Storage Process
Drillbit
node
Storage Process
Drillbit
node
Distributed Cache Distributed Cache Distributed Cache
![Page 25: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets](https://reader033.vdocuments.site/reader033/viewer/2022042601/554fa206b4c90586258b4a2c/html5/thumbnails/25.jpg)
Wire-‐level Architecture Foreman turns into root of the mul[-‐level execu[on tree, leafs ac[vate their storage engine interface. node
node node
Curator/Zk
![Page 26: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets](https://reader033.vdocuments.site/reader033/viewer/2022042601/554fa206b4c90586258b4a2c/html5/thumbnails/26.jpg)
Key features
• Full SQL – ANSI SQL 2003 • Nested Data as first class ci[zen • Op[onal Schema • Extensibility Points …
![Page 27: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets](https://reader033.vdocuments.site/reader033/viewer/2022042601/554fa206b4c90586258b4a2c/html5/thumbnails/27.jpg)
Extensibility Points
• Source query à parser API • Custom operators, UDF à logical plan • Serving tree, CF, topology à physical plan/op[mizer • Data sources &formats à scanner API
Source Query Parser
Logical Plan Op[mizer
Physical Plan Execu[on
![Page 28: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets](https://reader033.vdocuments.site/reader033/viewer/2022042601/554fa206b4c90586258b4a2c/html5/thumbnails/28.jpg)
… and Hadoop? • HDFS can be a data source
• Complementary use cases*
• … use Apache Drill – Find record with specified condi[on – Aggrega[on under dynamic condi[ons
• … use MapReduce – Data mining with mul[ple itera[ons – ETL
*) h>ps://cloud.google.com
/files/BigQueryTechnicalW
P.pdf
![Page 29: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets](https://reader033.vdocuments.site/reader033/viewer/2022042601/554fa206b4c90586258b4a2c/html5/thumbnails/29.jpg)
Basic Demo
h>ps://cwiki.apache.org/confluence/display/DRILL/Demo+HowTo
{ "id": "0001", "type": "donut", ”ppu": 0.55, "batters": { "batter”: [
{ "id": "1001", "type": "Regular" }, { "id": "1002", "type": "Chocolate" },
…
data source: donuts.json query:[ { op:"sequence", do:[
{ op: "scan", ref: "donuts", source: "local-logs", selection: {data: "activity"} }, { op: "filter", expr: "donuts.ppu < 2.00" },
…
logical plan: simple_plan.json
result: out.json
{ "sales" : 700.0, "typeCount" : 1, "quantity" : 700, "ppu" : 1.0 } { "sales" : 109.71, "typeCount" : 2, "quantity" : 159, "ppu" : 0.69 } { "sales" : 184.25, "typeCount" : 2, "quantity" : 335, "ppu" : 0.55 }
![Page 30: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets](https://reader033.vdocuments.site/reader033/viewer/2022042601/554fa206b4c90586258b4a2c/html5/thumbnails/30.jpg)
BE A PART OF IT!
![Page 31: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets](https://reader033.vdocuments.site/reader033/viewer/2022042601/554fa206b4c90586258b4a2c/html5/thumbnails/31.jpg)
Status
• Heavy development by mul[ple organiza[ons
• Available – Logical plan (ADSP) – Reference interpreter – Basic SQL parser – Basic demo
![Page 32: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets](https://reader033.vdocuments.site/reader033/viewer/2022042601/554fa206b4c90586258b4a2c/html5/thumbnails/32.jpg)
Status
May 2013 • Full SQL support (+JDBC) • Physical plan • In-‐memory compressed data interfaces • Distributed execu[on
![Page 33: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets](https://reader033.vdocuments.site/reader033/viewer/2022042601/554fa206b4c90586258b4a2c/html5/thumbnails/33.jpg)
Status
May 2013 • HBase and MySQL storage engine • WebUI client
![Page 34: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets](https://reader033.vdocuments.site/reader033/viewer/2022042601/554fa206b4c90586258b4a2c/html5/thumbnails/34.jpg)
Contribu[ng Contribu[ons appreciated (not only code drops) …
• Test data & test queries • Use case scenarios (textual/SQL queries) • Documenta[on • Further schedule – Alpha Q2 – Beta Q3
![Page 35: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets](https://reader033.vdocuments.site/reader033/viewer/2022042601/554fa206b4c90586258b4a2c/html5/thumbnails/35.jpg)
Kudos to …
• Julian Hyde, Pentaho • Lisen Mu, XingCloud • Tim Chen, Microsow • Chris Merrick, RJMetrics • David Alves, UT Aus[n • Sree Vaadi, SSS/NGData • Jacques Nadeau, MapR • Ted Dunning, MapR
![Page 36: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets](https://reader033.vdocuments.site/reader033/viewer/2022042601/554fa206b4c90586258b4a2c/html5/thumbnails/36.jpg)
Engage! • Follow @ApacheDrill on Twi>er
• Sign up at mailing lists (user | dev) h>p://incubator.apache.org/drill/mailing-‐lists.html
• Standing G+ hangouts every Tuesday at 5pm GMT
h>p://j.mp/apache-‐drill-‐hangouts
• Keep an eye on h>p://drill-‐user.org/