presto - sql on anything

1

Presto - SQL on anything

January 2017

Grzegorz Kokosiński, Karol Sobczak

Teradata Center for Hadoop

2

Agenda

- Who are we?

- What is Presto?

- What is data federation?

- Different federation strategies in other databases (Hive)

- What is supported and what are the problems

- Presto Connector

- Show time

3

Let's make some noise

• Let's tweet about this presentation!

– #whug

– #prestodb

– #teradata

• Later on we will query that data!

4

Who are we

5

What is Presto?

• 100% open source distributed SQL query engine
- Originally developed by Facebook

• Key Differentiators:
- Performance & Scale
- Cross platform query capability, not only SQL on Hadoop

• Apache licensed, hosted on GitHub
- Certified distro & support from Teradata

6

Presto Users

See more at https://github.com/prestodb/presto/wiki/Presto-Users


7

Presto in Production

• Facebook
– Multiple production clusters (100s of nodes total)
– 300 PB in HDFS, sharded MySQL, SSD-based Raptor
– 1000s of internal daily active users
– 10s-100s of concurrent queries

• Netflix
– 250+ nodes on EC2, 40+ PB in S3 (Parquet format)
– Over 650 active users and 6K+ queries daily

• Twitter
– 200+ nodes on-premises over Parquet nested data

• Uber
– 200+ nodes (2 dedicated clusters) with 25K+ & 3K+ queries daily

• FINRA
– 120+ nodes in AWS, 2 PB in S3, 200+ users (supported by Teradata)

8

Presto – Query Execution Performance

• In-memory processing

• Pipelined execution across nodes (MPP-style)
– Vectorized columnar processing
– Multithreaded execution keeps all CPU cores busy

• Presto is written in highly tuned Java
– Efficient memory management (reduced GC overhead)
– Very careful coding of inner loops
– Runtime bytecode generation

• Optimized ORC & Parquet readers

• Excellent performance with interactive SQL analytics
– Enables the use of BI tools

9

Supported data sources & file formats

• Hadoop/Hive connector & file formats (HDFS/S3):
– HDFS & S3 + HCatalog
– ORC, RCFile, Parquet, SequenceFile, Text

• Raptor
– columnar store on flash driven by Facebook

• Open source data stores (driven by the community)
– MySQL & PostgreSQL (non-parallel)
– Cassandra (by Teradata)
– Kafka
– Redis
– MongoDB
– Elasticsearch
– Accumulo (by Bloomberg)

10

ANSI SQL Support

[ WITH with_query [, ...] ]
SELECT [ ALL | DISTINCT ] select_expr [, ...]
[ FROM table1 [ [ INNER | OUTER ] JOIN table2 ON (…) ] ]
[ WHERE condition ]
[ GROUP BY expression [, ...] ]
[ HAVING condition ]
[ UNION [ ALL | DISTINCT ] select ]
[ ORDER BY expression [ ASC | DESC ] [, ...] ]
[ LIMIT [ count | ALL ] ]

In addition:
• Windowing functions
• UNNEST, TABLESAMPLE
• ROLLUP, CUBE, GROUPING SETS
• UNION, EXCEPT, INTERSECT
• Subqueries (EXISTS, IN)
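
To make the grammar on this slide concrete, here is a minimal sketch of one query that combines WITH, JOIN, WHERE, GROUP BY, HAVING, ORDER BY and LIMIT. It runs on SQLite from Python's standard library purely for illustration, not on Presto; the `users` and `tweets` tables, their columns and their contents are made up for this example.

```python
import sqlite3

# In-memory SQLite stands in for a real data source.
# Table names, columns and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER, name TEXT);
    CREATE TABLE tweets (user_id INTEGER, hashtag TEXT);
    INSERT INTO users  VALUES (1, 'ann'), (2, 'bob'), (3, 'eve');
    INSERT INTO tweets VALUES (1, 'prestodb'), (1, 'prestodb'), (1, 'whug'),
                              (2, 'prestodb'), (3, 'teradata');
""")

QUERY = """
    WITH presto_tweets AS (              -- WITH with_query
        SELECT user_id FROM tweets
        WHERE hashtag = 'prestodb'       -- WHERE condition
    )
    SELECT u.name, count(*) AS n         -- select_expr [, ...]
    FROM users u
    INNER JOIN presto_tweets t ON (u.id = t.user_id)
    GROUP BY u.name                      -- GROUP BY expression
    HAVING count(*) >= 1                 -- HAVING condition
    ORDER BY n DESC, u.name ASC          -- ORDER BY expression
    LIMIT 2                              -- LIMIT count
"""

rows = conn.execute(QUERY).fetchall()
print(rows)
```

The same statement shape should carry over to any engine that follows the grammar above, although function names and type details can differ between engines.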

11

Presto is not a database!

• Presto is a query execution engine (storage independent)

• Pluggable custom user functionalities
– Connectors
– Functions
– Types
– System access controllers
– Resource group configuration managers
– Event listeners
– …

• Built-in core functionalities:
– parser, execution, types, sql functions, monitoring

12

Data federation

• Query data from several data sources (databases)

• Streaming
– One to One
  - there is a single connection between database access points
  - e.g. PSQL via PSQL
  - using storage handlers to access RDBMS data from Hive
– Many to One
  - many connections from the nodes of one database to a single access point of the other database
  - e.g. accessing REST from a UDF in (possibly each) Hive map/reduce task
– Many to Many
  - workers talk to each other directly

• Through storage
– Needs (intermittent) data materialization

• Presto supports them all!
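
As a rough illustration of the streaming, one-to-one case, the sketch below joins rows coming from two different kinds of sources inside a single "engine" process. It is a toy model, not Presto code: the function names, the `users` table and the fake REST records are all assumptions made for this example, and an in-memory SQLite database stands in for the relational source.

```python
import sqlite3
from typing import Any, Dict, Iterator, List, Tuple


def rdbms_rows() -> Iterator[Tuple[int, str]]:
    """Source 1: a relational database read over a single connection."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ann"), (2, "bob")])
    yield from conn.execute("SELECT id, name FROM users")


def rest_rows() -> Iterator[Dict[str, Any]]:
    """Source 2: stand-in for records fetched from a REST service."""
    yield from [
        {"user_id": 1, "text": "#prestodb rocks"},
        {"user_id": 2, "text": "hello #whug"},
        {"user_id": 1, "text": "sql on anything"},
    ]


def federated_join() -> List[Tuple[str, str]]:
    # Build a hash table from one side, then stream the other side through it.
    names = {user_id: name for user_id, name in rdbms_rows()}
    return [
        (names[record["user_id"]], record["text"])
        for record in rest_rows()
        if record["user_id"] in names
    ]


result = federated_join()
print(result)
```

Neither source knows about the other; the join happens in the engine, which is the essence of federation by streaming.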

13

Data federation common problems

• model incompatibilities

• multinode streaming is not always possible

• transactions

• cost based optimizations (statistics)

• SQL pushdown (predicates, projections, aggregations?, joins?)
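
The pushdown bullet can be illustrated with a small, hedged sketch: the same filter is applied once in the engine after transferring every row, and once inside the source so that only matching rows are transferred. SQLite plays the role of the remote source here; the `events` table and both function names are hypothetical.

```python
import sqlite3

# Hypothetical remote source: 1000 rows, every tenth one from "PL".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, country TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(i, "PL" if i % 10 == 0 else "US") for i in range(1000)],
)


def scan_without_pushdown(country):
    # The source returns every row; the engine filters afterwards.
    transferred = conn.execute("SELECT id, country FROM events").fetchall()
    kept = [row for row in transferred if row[1] == country]
    return kept, len(transferred)


def scan_with_pushdown(country):
    # The predicate travels to the source, so only matching rows are transferred.
    transferred = conn.execute(
        "SELECT id, country FROM events WHERE country = ?", (country,)
    ).fetchall()
    return transferred, len(transferred)


rows_a, moved_a = scan_without_pushdown("PL")
rows_b, moved_b = scan_with_pushdown("PL")
print(moved_a, moved_b)
```

Both scans return the same rows, but the second one moves a tenth of the data. Projections can be pushed down in the same way; whether aggregations and joins can be pushed depends on what the source is able to evaluate, which is why the slide marks them with question marks.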

14

Connector

• Presto interface to access an arbitrary data source (hive, mysql, jmx)

• Provides:
– metadata
– ability to do distributed, parallel and streamed reads/writes
– transaction boundary
– physical data layouts
– statistics
– (SQL) predicate pushdown
– indexes (index join)
– session or table properties
– access control
– procedures (CALL …)
– . . .

• Most (if not all) of the above points are optional
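
To give a feel for the shape of such an interface, here is a minimal sketch covering only three of the points above: metadata, splitting a table into independently readable pieces, and streamed reads. The real Presto connector SPI is written in Java and looks different; every name below (`Connector`, `Split`, `InMemoryConnector`, and their methods) is hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Protocol, Tuple

Row = Tuple[object, ...]


@dataclass(frozen=True)
class Split:
    """A unit of work that one worker can read independently."""
    table: str
    start: int
    end: int


class Connector(Protocol):
    """Hypothetical minimal connector contract (not the Presto SPI)."""
    def list_tables(self) -> List[str]: ...
    def columns(self, table: str) -> List[str]: ...
    def splits(self, table: str, parallelism: int) -> List[Split]: ...
    def read(self, split: Split) -> Iterator[Row]: ...


class InMemoryConnector:
    """Toy connector over Python lists: metadata, splits and reads only."""

    def __init__(self, tables: Dict[str, Tuple[List[str], List[Row]]]):
        self._tables = tables

    def list_tables(self) -> List[str]:
        return sorted(self._tables)

    def columns(self, table: str) -> List[str]:
        return self._tables[table][0]

    def splits(self, table: str, parallelism: int) -> List[Split]:
        n = len(self._tables[table][1])
        step = max(1, -(-n // parallelism))  # ceiling division
        return [Split(table, i, min(i + step, n)) for i in range(0, n, step)]

    def read(self, split: Split) -> Iterator[Row]:
        # Stream rows lazily instead of returning the whole table at once.
        yield from self._tables[split.table][1][split.start:split.end]


connector: Connector = InMemoryConnector(
    {"tweets": (["id", "tag"], [(1, "whug"), (2, "prestodb"), (3, "teradata")])}
)
table_splits = connector.splits("tweets", 2)
all_rows = [row for s in table_splits for row in connector.read(s)]
print(connector.columns("tweets"), all_rows)
```

Everything beyond this core (statistics, pushdown, transactions, access control) would be an optional extra on top, which matches the remark that most of the listed capabilities are optional.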

15

Presto Architecture

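[Diagram: a Client talks to the Coordinator (Parser/analyzer, Planner, Scheduler), which works with several Workers. The Metadata API and Data location API are attached to the Coordinator and the Data stream API to the Workers; these APIs are marked as pluggable.]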

16

Data federation with Presto

• Through the storage

• Demo
– Hive

[Diagram: Presto coordinator and Presto workers next to a Hive Metastore, an HDFS NameNode and HDFS DataNodes. Arrows are labeled "metadata" and "data transfer".]

17

Data federation with Presto

• One to One

• Demo
– psql
– REST
– and above with Hive

[Diagram: Presto coordinator and Presto workers connected to a single SQL Database. Arrows are labeled "JDBC metadata" and "JDBC data".]

18

Many to many - data federation with Presto

[Diagram: a TERADATA side (four AMPs, PE) and a PRESTO side (Coordinator, four Worker Threads), each with a QG Exchange component. Arrows are labeled "Init & metadata exchange" and "Bi-directional fully parallel data exchange".]

• Key features:
– Low latency
– High performance
– Concurrency
– SQL pushdown
– Data conversion
– Compression
– Efficient CPU usage

19

Conclusion

• Presto Connector is expressive

• 3rd party data source is 1st class citizen

• Single ANSI SQL to rule them all
– use BI tools on data which is not BI friendly

• Rapid data integration

20

More information

Certified Distro: www.teradata.com/presto

Website: www.prestodb.io

Presto Users Group: www.groups.google.com/group/presto-users

GitHub: www.github.com/prestodb/presto, www.github.com/Teradata/presto


21

www.teradata.com/presto
