sql for everything at cwt2014

Masahiro Nakagawa, Nov 6, 2014

Cloudera World Tokyo

SQL for Everything

Presto: Distributed SQL Query Engine

Who are you?

> Masahiro Nakagawa
  > github/twitter: @repeatedly
  > Ingress: Blue

> Treasure Data, Inc.
  > Senior Software Engineer
  > Fluentd / td-agent developer

> I love OSS :)
  > D language - Phobos committer
  > Fluentd - Main maintainer
  > MessagePack / RPC - D and Python (only RPC)
  > The organizer of Presto Source Code Reading
  > etc…

SQL on Hadoop?

SQL Players on Hadoop

Batch (latency: minutes - hours)
> Hive
> Spark SQL

Short Batch / Low latency (latency: seconds - minutes)
> Presto
> Impala
> Drill
> HAWQ
> Actian
> etc…

Stream (latency: immediate)
> Norikra
> StreamSQL

(On the slide, commercial products are shown in a distinct color.)

SQL Players on Hadoop

> Red Ocean: Batch and Short Batch / Low latency
  > Hive, Spark SQL
  > Presto, Impala, Drill
  > HAWQ, Actian, etc…
> Blue Ocean?: Stream
  > Norikra, StreamSQL

Presto

http://prestodb.io/

Presto overview

> Open sourced by Facebook
  > https://github.com/facebook/presto
    • GitHub is the primary repository
  > written in Java
  > latest version is 0.81
> Built-in useful features
  > Connectors
  > Machine Learning
  > Window function
  > Approximate query
  > etc…

What’s Presto?

A distributed SQL query engine for interactive data analysis against GBs to PBs of data.

What problems does it solve?

> We couldn’t visualize data in HDFS directly using dashboards or BI tools
  > because Hive is too slow (not interactive)
  > or ODBC connectivity is unavailable/unstable
> We needed to store daily-batch results in an interactive DB for quick response (PostgreSQL, Redshift, etc.)
  > Interactive DBs cost more and are less scalable
> Some data is not stored in HDFS
  > We need to copy the data into HDFS to analyze it

[Diagram: HDFS; Hive (daily/hourly batch); PostgreSQL, etc.; Presto; Dashboard]

[Diagram: HDFS; Hive (daily/hourly batch); Presto (interactive query); Dashboard (interactive query)]

[Diagram: Presto as a data analysis platform. SQL on any data sets: HDFS, Hive (daily/hourly batch), Cassandra, MySQL, commercial DBs. Interactive queries from the dashboard and from commercial BI tools (IBM Cognos, Tableau, ...)]

Presto’s deployment

> Facebook
  > Multiple geographical regions
  > scaled to 1,000 nodes
  > actively used by 1,000+ employees
  > processing 1PB/day
> Netflix, Dropbox, Treasure Data, Airbnb, Qubole, LINE, GREE, Scaleout, etc.
> Presto as a Service
  > Treasure Data, Qubole

PostgreSQL gateway for Presto

> A PostgreSQL protocol gateway that lets clients use PostgreSQL’s stable ODBC / JDBC drivers
> Developed by Sadayuki Furuhashi

https://github.com/treasure-data/prestogres

Distributed architecture

[Diagram: Client; Coordinator; Connector Plugin; three Workers; Storage / Metadata; Discovery Service]

What are Connectors?

> Access to storage and metadata
  > provide table schema to coordinators
  > provide table rows to workers
> Connectors are pluggable to Presto
  > written in Java
> Implementations:
  > Hive (CDH, HDP, Community), Cassandra, MySQL, JDBC, Kafka, etc…
  > Or your own connector
    • Treasure Data has its own connector
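The two responsibilities listed above (schema for the coordinator, rows for the workers) can be pictured as a small interface. This is only a Python illustration of the idea: Presto's real connector API is written in Java, and every name below (`Connector`, `get_table_schema`, `scan_rows`, `InMemoryConnector`) and the sample rows are hypothetical.

```python
from typing import Dict, Iterator, List, Protocol, Tuple

Row = Tuple[object, ...]
Schema = List[Tuple[str, str]]  # (column name, type name)


class Connector(Protocol):
    """Hypothetical interface; not Presto's actual Java SPI."""

    def get_table_schema(self, table: str) -> Schema:
        """Metadata side: what a coordinator needs while planning."""
        ...

    def scan_rows(self, table: str, columns: List[str]) -> Iterator[Row]:
        """Data side: what a worker needs while running a table scan."""
        ...


class InMemoryConnector:
    """Toy connector backed by Python lists, for illustration only."""

    def __init__(self, tables: Dict[str, Tuple[Schema, List[Row]]]) -> None:
        self._tables = tables

    def get_table_schema(self, table: str) -> Schema:
        return self._tables[table][0]

    def scan_rows(self, table: str, columns: List[str]) -> Iterator[Row]:
        schema, rows = self._tables[table]
        names = [name for name, _ in schema]
        indexes = [names.index(column) for column in columns]
        for row in rows:
            # Project only the requested columns.
            yield tuple(row[i] for i in indexes)


# Made-up sample data using the `impressions` table shape from the slides.
connector: Connector = InMemoryConnector({
    "impressions": (
        [("name", "varchar"), ("time", "bigint")],
        [("a", 1), ("b", 2), ("a", 3)],
    )
})
```

Splitting metadata access from row access mirrors the slide: planning only needs the schema, while execution only needs the rows.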

Multiple connectors in a query

[Diagram: Client; Coordinator; three Workers; Discovery Service (finds servers in a cluster); Hive Connector for HDFS / Metastore; Cassandra Connector for Cassandra; other connectors for other data sources]

Distributed architecture

> 3 types of servers:
  > Coordinator, worker, discovery service
> Get data/metadata through connector plugins
  > Presto is NOT a database
  > Presto provides SQL to existing data stores
> Client protocol is HTTP + JSON
  > Language bindings: Ruby, Python, PHP, Java (JDBC), R, Node.js...
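Because the client protocol is plain HTTP + JSON, a binding can be quite small. The sketch below assumes the commonly documented shape of the protocol: a `POST` of the SQL text to `/v1/statement` with an `X-Presto-User` header, followed by `GET` requests to each `nextUri` until none is returned. Treat those details as assumptions to verify against the Presto version in use. The names `run_query`, `Fetch`, and `urllib_fetch` are hypothetical, and error handling is omitted.

```python
import json
import urllib.request
from typing import Callable, Dict, Iterator, List, Optional

# fetch(method, url, headers, body) -> decoded JSON response
Fetch = Callable[[str, str, Dict[str, str], Optional[bytes]], dict]


def run_query(fetch: Fetch, server: str, sql: str, user: str) -> Iterator[List[object]]:
    """Submit a statement, then follow `nextUri` links until they stop."""
    headers = {"X-Presto-User": user}
    response = fetch("POST", server + "/v1/statement", headers, sql.encode("utf-8"))
    while True:
        # Each response may carry a batch of rows under "data".
        for row in response.get("data", []):
            yield row
        next_uri = response.get("nextUri")
        if next_uri is None:
            break
        response = fetch("GET", next_uri, headers, None)


def urllib_fetch(method: str, url: str, headers: Dict[str, str],
                 body: Optional[bytes]) -> dict:
    """HTTP transport built on the standard library only."""
    request = urllib.request.Request(url, data=body, headers=headers, method=method)
    with urllib.request.urlopen(request) as reply:
        return json.loads(reply.read().decode("utf-8"))
```

Passing the transport in as `fetch` keeps the paging logic testable without a running cluster. Against a real coordinator the call would look like `run_query(urllib_fetch, "http://localhost:8080", "SELECT 1", "me")`, where the host, port, and user are placeholders.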

Presto’s execution model

> Presto is NOT MapReduce
  > Uses its own execution engine
> Presto’s query plan is based on DAG
  > more like Apache Tez / Spark or traditional MPP databases
  > Impala and Drill use a similar model

Query Planner

SQL:
SELECT name, count(*) AS c FROM impressions GROUP BY name

+

Table schema:
impressions (name varchar, time bigint)

Logical query plan:
Table scan (name:varchar) → GROUP BY (name, count(*)) → Output (name, c)

Distributed query plan:
Table scan → Partial aggr → Sink → Exchange → Final aggr → Sink → Exchange → Output
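The split of `GROUP BY` into a partial and a final aggregation can be sketched in a few lines. This is a single-process Python illustration, not Presto code: `partial_aggregate`, `final_aggregate`, and the sample `splits` are hypothetical, and the list of partial results stands in for the exchange between workers.

```python
from collections import Counter
from typing import Dict, Iterable, List


def partial_aggregate(names: Iterable[str]) -> Dict[str, int]:
    """Partial aggr: count names within one worker's slice of the table scan."""
    return dict(Counter(names))


def final_aggregate(partials: Iterable[Dict[str, int]]) -> Dict[str, int]:
    """Final aggr: merge the partial counts received through the exchange."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)  # adds counts key by key
    return dict(total)


# Made-up sample data: the `name` column as scanned by two workers.
splits: List[List[str]] = [["a", "b", "a"], ["b", "c"]]
partials = [partial_aggregate(split) for split in splits]
result = final_aggregate(partials)
print(sorted(result.items()))  # [('a', 2), ('b', 2), ('c', 1)]
```

The partial step shrinks each worker's data to one row per distinct name before anything crosses the network, which is why `count(*)` can be split this way.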

Query Planner - Stages

Stage-2: Table scan → Partial aggr → Sink
  (inter-worker data transfer)
Stage-1: Exchange → Final aggr → Sink
  (inter-worker data transfer)
Stage-0: Exchange → Output

The partial and final aggregation run as a pipelined aggregation.

Execution Planner

> Distributed query plan + node list (2 workers)

[Diagram: the plan's operators (Table scan, Partial aggr, Sink, Exchange, Final aggr, Output) placed on Worker 1 and Worker 2]

All stages are pipelined
✓ No wait time
✓ No fault-tolerance
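Pipelining means a downstream stage starts consuming before the upstream stage has finished. Python generators give a rough single-process analogy. Real Presto stages run on separate workers with network exchanges, so this is only a sketch, and `table_scan`, `partial_aggr`, `final_aggr`, `chunk_size`, and the sample rows are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List

log: List[str] = []  # records the order in which the stages do work


def table_scan(rows: Iterable[str]) -> Iterator[str]:
    for row in rows:
        log.append(f"scan {row}")
        yield row


def partial_aggr(rows: Iterable[str], chunk_size: int) -> Iterator[Dict[str, int]]:
    """Emit partial counts every `chunk_size` rows instead of waiting for all input."""
    counts: Dict[str, int] = {}
    seen = 0
    for name in rows:
        counts[name] = counts.get(name, 0) + 1
        seen += 1
        if seen == chunk_size:
            log.append("emit partial")
            yield counts
            counts, seen = {}, 0
    if counts:
        log.append("emit partial")
        yield counts


def final_aggr(partials: Iterable[Dict[str, int]]) -> Dict[str, int]:
    total: Dict[str, int] = {}
    for partial in partials:
        log.append("merge partial")
        for name, count in partial.items():
            total[name] = total.get(name, 0) + count
    return total


result = final_aggr(partial_aggr(table_scan(["a", "b", "a", "c"]), chunk_size=2))
```

In `log`, the first "merge partial" entry appears before "scan c": the final aggregation is already working while the scan is still running. The same structure shows the cost noted on the slide: if any stage raises an error, the whole chain stops, since no intermediate result is persisted for a retry.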

MapReduce vs. Presto

MapReduce
> map and reduce tasks, with disk between the stages
> Write data to disk
> Wait between stages

Presto
> tasks in every stage run at the same time
> memory-to-memory data transfer
✓ No disk IO
✓ Data chunk must fit in memory

Presto Meetup

The first half of 2015

Check: treasuredata.com

Cloud service for the entire data pipeline, including Presto

http://treasure-data.com
