Post on 31-Aug-2019

Page 1: The Enterprise Presto Company STARBURST Presto: SQL-on ...biconsulting.hu/letoltes/2018budapestdata/wojciech_biela_presto_sql_on_anything.pdf · The Enterprise Presto Company STARBURST

The Enterprise Presto Company

STARBURST

Wojciech Biela, Grzegorz Kokosiński

Presto: SQL-on-Anything

©2018 Starburst Data, Inc. All Rights Reserved

Starburst in a nutshell

We are the Presto company!

● Largest team of Presto contributors outside of Facebook
● Led Presto initiative at Teradata for past 3 years
● Working in the SQL-on-Hadoop space since 2011

We offer:

● A production-ready distribution of Presto
● Professional enterprise support
● Presto managed services

Presto is SQL-on-Anything. Deploy Anywhere, Query Anything

Analyst Tools

Data Sources

Presto Community

More at https://github.com/prestodb/presto/wiki/Presto-Users

Why Presto?

● 100% open source distributed ANSI SQL engine
○ Originally developed by Facebook
○ Introduced to Fortune 500 by Teradata
○ Commercialized by Starburst
● Presto is SQL-on-Anything:
○ Deploy anywhere
○ Query anything

Presto Highlights

● Community-driven open source project
● No vendor lock-in
○ No Hadoop distro vendor lock-in
○ No storage engine vendor lock-in
○ No cloud vendor lock-in
● Query data where it lives
○ No ETL or data integration necessary to get to insights
● Proven scalability
● High concurrency
● Interactive ANSI SQL queries

Presto in Production

More at https://github.com/prestodb/presto/wiki/Presto-Users

Presto in Production

* Multiple clusters (1000s of nodes)

* 300PB in HDFS, MySQL, and Raptor

* 1000s of users, 100s of concurrent queries

Facebook - Data warehouse

● Hive + HDFS + ORC
● Multiple clusters
● Thousands of users, 300PB, 1000s of nodes
● PBs of data scanned, O(100k) queries every day
● 100s of concurrent queries

Facebook - User facing

● Usage
○ Reporting backend for ad campaign analytics
● Sharded MySQL storage
● Relatively small data (10s to 100s of TBs)
● 0.1-5 second latency
● Support for data updates
● Highly available (different DCs)
● 10-15 way joins

Presto in Production

* 250+ AWS nodes

* 100+ PB in S3 (Parquet)

* 650+ users with 6K+ queries daily

Netflix data pipeline

[Figure: Netflix data pipeline. Labels: TVs, mobile, laptop; events; dimensions; Suro / Kafka; Cassandra; Ursula; Aegisthus; Amazon S3; Teradata]

Presto in Production

* 150+ PB HDFS

* 800+ nodes (2 clusters on prem)

* 200K+ queries/day

Presto in Production

* 800+ nodes (on premise)

* Parquet data

Presto in Production

* 120+ nodes in AWS

* 4PB in S3

* 200+ users

* Starburst support

Presto in Production

More at https://github.com/prestodb/presto/wiki/Presto-Users

* Presto for interactive workload

* 200 nodes on AWS

* 20k+ queries / day

* 20PB+ data on S3

Lyft ecosystem

[Figure: Lyft ecosystem in four stages: Ingest, Storage, Compute, Visualisation. Labels: Events; MongoDB; Other DS; AWS S3; Hive; Redshift; Superset; Other tools]

Presto in Production

More at https://github.com/prestodb/presto/wiki/Presto-Users

* 100 Presto VMs (on premises)

* 1K+ HDFS nodes

* ORC data

* Starburst support

Presto in Production

More at https://github.com/prestodb/presto/wiki/Presto-Users

* 200+ nodes (on premises)

* HDFS, ObjectStore, and Cassandra

* Starburst support

Logical Data Warehouse

[Figure: logical data warehouse diagram. Labels: Yahoo! Japan DWH; Operational; Operations (RDBs); Teradata DWH; Data Lake (Hadoop); Copy & Load; ETL; QG; Presto; RDBs; NoSQL; Data Lake; Hadoop; S3]

Architecture

[Figure: pluggable architecture diagram. Labels: Client; Coordinator (Parser/analyzer, Planner, Scheduler); Worker (three shown); Metadata API; Data location API; Data stream API]

Presto Connectors

Amazon S3

Architecture

● Core Presto
○ parser, planner, optimizer and scheduler
○ execution engine
○ stateless
● Plugins
○ connectors - data+metadata
○ user defined functions
○ user defined types
○ event listeners
○ authentication and authorization
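To make the "connectors - data+metadata" idea concrete, here is a minimal Python sketch of what a connector contract could look like. It is not Presto's actual SPI (which is a Java API); the names `Connector`, `Split`, `InMemoryConnector` and `scan` are hypothetical and only mirror the three roles a connector plays: metadata, data location, and data stream.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Protocol, Tuple

Row = Tuple[object, ...]


@dataclass(frozen=True)
class Split:
    """A unit of work: one chunk of a table that a single worker can read."""
    table: str
    chunk: int


class Connector(Protocol):
    """Hypothetical connector contract (an assumption, not Presto's Java SPI)."""

    def columns(self, table: str) -> List[str]: ...     # metadata
    def splits(self, table: str) -> List[Split]: ...    # data location
    def read(self, split: Split) -> Iterator[Row]: ...  # data stream


class InMemoryConnector:
    """Toy connector that serves rows from Python lists."""

    def __init__(self, tables: Dict[str, Tuple[List[str], List[List[Row]]]]):
        # table name -> (column names, list of chunks, each chunk a list of rows)
        self._tables = tables

    def columns(self, table: str) -> List[str]:
        return self._tables[table][0]

    def splits(self, table: str) -> List[Split]:
        chunks = self._tables[table][1]
        return [Split(table, i) for i in range(len(chunks))]

    def read(self, split: Split) -> Iterator[Row]:
        yield from self._tables[split.table][1][split.chunk]


def scan(connector: Connector, table: str) -> List[Row]:
    """Read every split of a table; a real cluster would read splits in parallel."""
    rows: List[Row] = []
    for split in connector.splits(table):
        rows.extend(connector.read(split))
    return rows


demo = InMemoryConnector({
    "pets": (["name", "legs"], [[("dog", 4), ("whale", 0)], [("bird", 2)]]),
})
```

Because the engine in this sketch only talks to the three methods, swapping the in-memory store for another backend would not change `scan`, which is the point of the plugin boundary.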

Connectors - Hive

● Table metadata read from Hive catalog
● Multiple filesystems
○ HDFS, S3
● Supported file formats
○ ORC (optimized reader, optimized writer)
○ RCFile (optimized reader, optimized writer)
○ Parquet (optimized reader)
○ Avro
○ all other Hive formats

Easy deployment

● self-contained (RPM/tar.gz)
● worker auto discovery
● trivial dependencies
○ just a recent JVM
● single-port network communication
○ easy firewall/network setup
● even easier with presto-admin

Hardware agnostic

● Infrastructure agnostic
○ on premise (appliance or commodity clusters)
○ VM (OpenStack, etc.)
○ cloud (Amazon, Azure, etc.)
■ pure EC2
■ EMR
■ AWS Athena (pay-as-you-go)

Deployment for HDFS

[Figure: three layouts of Presto nodes and HDFS DataNodes (HDFS DN): Separate nodes, Shared nodes, Mixed (rack local)]

SQL support

● ANSI SQL support (good for BI tools)
○ all standard data types
○ complex subqueries support (e.g. correlated)
● Structural types
○ map, array, row
○ JSON
● Lambda expressions
○ SELECT transform(ARRAY['dog', 'whale'], x -> length(x))
■ [3, 5]
○ SELECT reduce(ARRAY[5, 20, 50], 0, (s, x) -> s + x, s -> s)
■ 75
● Spatial joins, functions and data types
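For readers who do not have a Presto cluster at hand, the two lambda examples on this slide can be mirrored in plain Python. This is only an illustration of the semantics; the helper names `transform` and `reduce_array` are ours, chosen to follow the argument order of the SQL functions (array, initial state, input function, output function).

```python
from functools import reduce as fold
from typing import Callable, List, TypeVar

T = TypeVar("T")
U = TypeVar("U")
S = TypeVar("S")
R = TypeVar("R")


def transform(array: List[T], fn: Callable[[T], U]) -> List[U]:
    """Apply fn to every element, mirroring transform(array, function)."""
    return [fn(x) for x in array]


def reduce_array(
    array: List[T],
    initial: S,
    input_fn: Callable[[S, T], S],
    output_fn: Callable[[S], R],
) -> R:
    """Fold the array into a state with input_fn, then map the final state with output_fn."""
    return output_fn(fold(input_fn, array, initial))


# SELECT transform(ARRAY['dog', 'whale'], x -> length(x))
lengths = transform(["dog", "whale"], lambda x: len(x))

# SELECT reduce(ARRAY[5, 20, 50], 0, (s, x) -> s + x, s -> s)
total = reduce_array([5, 20, 50], 0, lambda s, x: s + x, lambda s: s)

print(lengths, total)  # [3, 5] 75
```

The separate output function matters when the running state differs from the final answer, for example keeping a (sum, count) pair and dividing at the end to get an average.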

SQL support

● All standard DDL/DML is supported
○ CREATE TABLE / CREATE TABLE AS
■ connector specific extensions supported via WITH clause
○ DROP TABLE
○ INSERT
○ DELETE
○ GRANT / REVOKE
● Set of supported features depends on connector
○ richest support for Hive connector

Performance

● MPP style
○ Operators pipeline and data streaming
○ In-memory execution
● Columnar data processing
● Highly tuned Java
○ Query to ByteCode compilation
○ Memory efficient structures - Minimize GC
○ Careful inner loop implementation
● Multi-threaded execution keeps CPU busy
○ Focus on being versatile: supports both a single (long-running) query at a time and (interactive) highly concurrent workloads
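The "operators pipeline and data streaming" and "columnar data processing" points can be sketched with Python generators. This is a toy model, not Presto's engine: the operator names (`scan_op`, `filter_op`, `sum_op`) and the `Page` layout are assumptions made for illustration. Each operator pulls small columnar pages from the one before it, so no stage materializes the whole input.

```python
from typing import Callable, Dict, Iterable, Iterator, List

# A page is a small batch of rows stored column by column: column name -> values.
Page = Dict[str, List[int]]


def scan_op(values: List[int], page_size: int) -> Iterator[Page]:
    """Source operator: emit the input as fixed-size columnar pages."""
    for start in range(0, len(values), page_size):
        yield {"x": values[start:start + page_size]}


def filter_op(pages: Iterable[Page], keep: Callable[[int], bool]) -> Iterator[Page]:
    """Filter operator: process one page at a time and pass survivors downstream."""
    for page in pages:
        kept = [v for v in page["x"] if keep(v)]
        if kept:  # skip pages where nothing survived
            yield {"x": kept}


def sum_op(pages: Iterable[Page]) -> int:
    """Aggregation operator: consume the stream and keep only a running total."""
    total = 0
    for page in pages:
        total += sum(page["x"])
    return total


# Roughly: SELECT sum(x) FROM t WHERE x % 2 = 0, with t holding 0..9
result = sum_op(filter_op(scan_op(list(range(10)), page_size=4), lambda v: v % 2 == 0))
print(result)  # 20
```

Only one page per stage is in flight at any time, which is the property that lets a pipelined engine start returning results before the scan has finished.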

Cost-Based Optimizer!

● Optimizer in ‘vanilla’ Presto
○ currently rule based
● Exploit statistics provided by connectors
○ Leverage existing Hive statistics
○ Selectivity estimates and statistics of plan fragments
○ Cost calculation of plan variants
● Cost based decisions (current Starburst release 195e)
○ Join type selection
○ Join reordering
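A toy Python sketch of the two cost-based decisions listed above. Everything here is an assumption for illustration: the statistics, the cost formula (total intermediate rows of a left-deep plan) and the broadcast threshold are invented, and "join type selection" is read as choosing between a broadcast and a partitioned join. It is not the optimizer shipped in Presto or Starburst.

```python
from itertools import permutations
from typing import Dict, Sequence, Tuple

FACT_ROWS = 1_000_000

# Hypothetical statistics: dimension -> (row count, fraction of fact rows surviving the join)
DIMS: Dict[str, Tuple[int, float]] = {
    "date_dim": (365, 0.5),
    "store": (100, 0.1),
    "customer": (200_000, 0.9),
}

BROADCAST_LIMIT = 10_000  # assumed threshold, in rows


def plan_cost(order: Sequence[str]) -> float:
    """Toy cost: total number of intermediate rows produced by a left-deep plan."""
    rows, cost = float(FACT_ROWS), 0.0
    for dim in order:
        rows *= DIMS[dim][1]  # selectivity estimate shrinks the intermediate result
        cost += rows
    return cost


def best_order() -> Tuple[str, ...]:
    """Join reordering: enumerate every order and keep the cheapest one."""
    return min(permutations(DIMS), key=plan_cost)


def distribution(dim: str) -> str:
    """Join type selection: broadcast small build sides, partition large ones."""
    return "broadcast" if DIMS[dim][0] <= BROADCAST_LIMIT else "partitioned"


order = best_order()
print(order, [distribution(d) for d in order])
```

With these numbers the most selective join (`store`) runs first, which keeps every later intermediate result small. Exhaustive enumeration is only workable for a handful of tables; a production optimizer has to prune the search space.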

Cost-Based Optimizer - results

Cost-Based Optimizer in action

● 13x max speed-up
● >50% 2-5x boost
● ~10% 6-10x boost

For more see: https://www.starburstdata.com/technical-blog

Connectivity

● Presto CLI
● Enterprise JDBC/ODBC drivers
○ full JDBC and ODBC specs compliance
○ Kerberos authentication
○ LDAP authentication
● Open source JDBC driver
○ requires Java 8
○ limited support for authentication
● Language specific bindings (R, Python, Go, Ruby, …)

BI Tool Support

Security

● User authentication (CLI/ODBC/JDBC)
○ Basic
○ Kerberos
○ LDAP
● Pluggable user authorization schemes (access control)
● Connector level authorization
○ E.g. grants information stored in Hive catalog
● Support for kerberized HDFS/Hive metastore
● SSL on the wire
○ client to Presto
○ between Presto nodes

Key contributions from our team

● ANSI SQL syntax enhancements to fully support TPC-H and TPC-DS

● Spill to disk capabilities for large intermediate data sets

● Distributed sorting to handle ORDER BY for large datasets

● Security Integrations such as Kerberos, LDAP, and in-transit encryption

● Cost-Based Optimizer and other query performance improvements

● ODBC and JDBC drivers to enable BI tools such as Tableau, Qlik, etc

● Presto connectors for SQLServer and Cassandra

● Presto-Admin for easy installation & management of Presto

Enterprise Support for Presto

PrestoCare™

Administration, monitoring and support of the Presto Platform and Services

Enterprise Support

24/7 Enterprise support of Presto on-premises or in the cloud. Installation and tuning assistance. Product roadmap influence.

Professional Services

Presto architecture, tuning, integrations, implementation, and other development.

Starburst Presto Roadmap

● Kafka connector improvements

● HDFS wire encryption

● Further Cost-Based Optimizer extensions

● Execution engine improvements

● Planner improvements

● Better Avro support

● Support for Oracle Linux

● Support for Azure Cloud

More information

Certified Distro: www.starburstdata.com/presto

Project Website: www.prestodb.io

Presto Users Group: www.groups.google.com/group/presto-users

GitHub: www.github.com/prestodb/presto, www.github.com/starburstdata/presto

Learn more at:

www.starburstdata.com

[email protected]@starburstdata.com