Presto at Hadoop Summit 2016
TRANSCRIPT
![Page 1: Presto at Hadoop Summit 2016](https://reader034.vdocuments.site/reader034/viewer/2022042605/5877995b1a28ab0f778b6973/html5/thumbnails/1.jpg)
What's New in SQL-on-Hadoop and Beyond
Martin Traverso, Facebook
Kamil Bajda-Pawlikowski, Teradata
Hadoop Summit 2016, San Jose, CA
![Page 2: Presto at Hadoop Summit 2016](https://reader034.vdocuments.site/reader034/viewer/2022042605/5877995b1a28ab0f778b6973/html5/thumbnails/2.jpg)
Agenda
● Introduction
● Presto at Facebook
● Presto users and use cases
● New features
● Roadmap
![Page 3: Presto at Hadoop Summit 2016](https://reader034.vdocuments.site/reader034/viewer/2022042605/5877995b1a28ab0f778b6973/html5/thumbnails/3.jpg)
Introduction
![Page 4: Presto at Hadoop Summit 2016](https://reader034.vdocuments.site/reader034/viewer/2022042605/5877995b1a28ab0f778b6973/html5/thumbnails/4.jpg)
What is Presto
● Open source distributed SQL engine
● ANSI SQL syntax
● Custom built for interactive analytic queries
● Queries data across multiple data stores
● Flexible deployment (on premise or cloud)
● Extensible
![Page 5: Presto at Hadoop Summit 2016](https://reader034.vdocuments.site/reader034/viewer/2022042605/5877995b1a28ab0f778b6973/html5/thumbnails/5.jpg)
![Page 6: Presto at Hadoop Summit 2016](https://reader034.vdocuments.site/reader034/viewer/2022042605/5877995b1a28ab0f778b6973/html5/thumbnails/6.jpg)
Presto at Facebook
![Page 7: Presto at Hadoop Summit 2016](https://reader034.vdocuments.site/reader034/viewer/2022042605/5877995b1a28ab0f778b6973/html5/thumbnails/7.jpg)
Presto @ Facebook
● Ad-hoc/interactive queries for Hadoop warehouse
● Batch processing for Hadoop warehouse
● Analytics for user-facing products
● Analytics over various specialized stores
![Page 8: Presto at Hadoop Summit 2016](https://reader034.vdocuments.site/reader034/viewer/2022042605/5877995b1a28ab0f778b6973/html5/thumbnails/8.jpg)
Hadoop Warehouse - Stats
● 1000s of internal daily active users
● Millions of queries each month
● Scan PBs of data every day
● Process trillions of rows every day
● 10s of concurrent queries
![Page 9: Presto at Hadoop Summit 2016](https://reader034.vdocuments.site/reader034/viewer/2022042605/5877995b1a28ab0f778b6973/html5/thumbnails/9.jpg)
Hadoop Warehouse - Batch
![Page 10: Presto at Hadoop Summit 2016](https://reader034.vdocuments.site/reader034/viewer/2022042605/5877995b1a28ab0f778b6973/html5/thumbnails/10.jpg)
Presto for User-facing Products
● Requirements
○ Hundreds of ms to seconds latency, low variability
○ Availability
○ Update semantics
○ 10 - 15 way joins
● Stats
○ > 99.99% query success rate
○ 100% system availability
○ 25 - 200 concurrent queries
○ 1 - 20 queries per second
○ <100ms - 5s latency
![Page 11: Presto at Hadoop Summit 2016](https://reader034.vdocuments.site/reader034/viewer/2022042605/5877995b1a28ab0f778b6973/html5/thumbnails/11.jpg)
Presto with Raptor
● Large data sets (petabytes)
● Milliseconds to seconds latency
● Predictable performance
● 5-15 minute load latency
● Reliable data loads (no duplicates, no missing data)
● High availability
● 10s of concurrent queries
![Page 12: Presto at Hadoop Summit 2016](https://reader034.vdocuments.site/reader034/viewer/2022042605/5877995b1a28ab0f778b6973/html5/thumbnails/12.jpg)
Presto users and use cases
![Page 13: Presto at Hadoop Summit 2016](https://reader034.vdocuments.site/reader034/viewer/2022042605/5877995b1a28ab0f778b6973/html5/thumbnails/13.jpg)
Presto users
See more at https://github.com/prestodb/presto/wiki/Presto-Users
![Page 14: Presto at Hadoop Summit 2016](https://reader034.vdocuments.site/reader034/viewer/2022042605/5877995b1a28ab0f778b6973/html5/thumbnails/14.jpg)
Netflix stats
Interactive, reporting, and app-driven queries
Data warehouse: 40PB in S3
~250 nodes across multiple clusters
~650 users with ~6K+ queries/day
![Page 15: Presto at Hadoop Summit 2016](https://reader034.vdocuments.site/reader034/viewer/2022042605/5877995b1a28ab0f778b6973/html5/thumbnails/15.jpg)
Twitter stats
Ad-hoc and low-latency queries
~200 nodes dedicated to Presto
Parquet with nested data structures
![Page 16: Presto at Hadoop Summit 2016](https://reader034.vdocuments.site/reader034/viewer/2022042605/5877995b1a28ab0f778b6973/html5/thumbnails/16.jpg)
Uber stats
2 clusters
100+ machines
2000+ queries per day
HDFS on premise
![Page 17: Presto at Hadoop Summit 2016](https://reader034.vdocuments.site/reader034/viewer/2022042605/5877995b1a28ab0f778b6973/html5/thumbnails/17.jpg)
FINRA stats
120+ EC2 nodes (r3.4xlarge)
2+ PBs of data on S3 (bzip2 & ORC)
200+ users
Distro supported by Teradata
![Page 18: Presto at Hadoop Summit 2016](https://reader034.vdocuments.site/reader034/viewer/2022042605/5877995b1a28ab0f778b6973/html5/thumbnails/18.jpg)
New features
![Page 19: Presto at Hadoop Summit 2016](https://reader034.vdocuments.site/reader034/viewer/2022042605/5877995b1a28ab0f778b6973/html5/thumbnails/19.jpg)
SQL features
● DDL syntax
○ CREATE / ALTER / DROP TABLE
● DML syntax
○ INSERT / DELETE
● SQL features:
○ Data types: DECIMAL, VARCHAR(n), INT, SMALLINT, TINYINT
○ CUBE, ROLLUP, GROUPING SETS
○ INTERSECT
○ Non-equi joins
○ Uncorrelated subqueries
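To make a few of the SQL features listed above concrete, here is a minimal sketch that runs the corresponding standard SQL through Python's built-in sqlite3 module. SQLite is only a stand-in engine here, not Presto, and the `orders` and `tiers` tables are hypothetical. SQLite has no ROLLUP, so the last query emulates what `GROUP BY ROLLUP (region)` would return by appending a grand-total row with UNION ALL.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for a SQL engine; tables are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, region TEXT, amount INTEGER);
    CREATE TABLE tiers (name TEXT, lo INTEGER, hi INTEGER);
    INSERT INTO orders VALUES (1, 'east', 50), (2, 'west', 150), (3, 'east', 250);
    INSERT INTO tiers VALUES ('small', 0, 99), ('medium', 100, 199), ('large', 200, 999);
""")

# INTERSECT: regions with both an order under 100 and an order over 200.
both = conn.execute("""
    SELECT region FROM orders WHERE amount < 100
    INTERSECT
    SELECT region FROM orders WHERE amount > 200
""").fetchall()

# Non-equi join: the join condition is a range test, not an equality.
tiered = conn.execute("""
    SELECT o.id, t.name
    FROM orders o
    JOIN tiers t ON o.amount BETWEEN t.lo AND t.hi
    ORDER BY o.id
""").fetchall()

# Uncorrelated subquery: the inner query does not reference the outer row.
above_avg = conn.execute("""
    SELECT id FROM orders
    WHERE amount > (SELECT AVG(amount) FROM orders)
    ORDER BY id
""").fetchall()

# ROLLUP emulation: per-region subtotals plus one grand-total row (region NULL).
rollup = conn.execute("""
    SELECT region, SUM(amount) FROM orders GROUP BY region
    UNION ALL
    SELECT NULL, SUM(amount) FROM orders
""").fetchall()

print(both, tiered, above_avg, rollup)
```

On an engine that supports it, the last query would normally be written as a single `GROUP BY ROLLUP (region)`.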
![Page 20: Presto at Hadoop Summit 2016](https://reader034.vdocuments.site/reader034/viewer/2022042605/5877995b1a28ab0f778b6973/html5/thumbnails/20.jpg)
Other features
● Performance
○ Join and aggregation optimizations
● Connectors
○ Redis
○ MongoDB
● Kerberos
● Presto-Admin
● Ambari and YARN (via Apache Slider)
![Page 21: Presto at Hadoop Summit 2016](https://reader034.vdocuments.site/reader034/viewer/2022042605/5877995b1a28ab0f778b6973/html5/thumbnails/21.jpg)
Drivers and BI tools
● Enterprise-grade ODBC & JDBC drivers
● BI tools certifications
○ Information Builders, Looker, MicroStrategy, MS Power BI, Qlik, Tableau, ZoomData
![Page 22: Presto at Hadoop Summit 2016](https://reader034.vdocuments.site/reader034/viewer/2022042605/5877995b1a28ab0f778b6973/html5/thumbnails/22.jpg)
Roadmap
![Page 23: Presto at Hadoop Summit 2016](https://reader034.vdocuments.site/reader034/viewer/2022042605/5877995b1a28ab0f778b6973/html5/thumbnails/23.jpg)
Short term
● LDAP
● SQL features
○ Data types: FLOAT, CHAR(n), VAR/BINARY(n)
○ EXISTS, EXCEPT
○ Correlated subqueries
○ Lambda expressions
○ Prepared statements
● Connectors
○ Accumulo (by Bloomberg)
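For readers less familiar with the roadmap's SQL items, the sketch below shows what EXISTS, EXCEPT, and a correlated subquery look like in standard SQL. It again uses Python's sqlite3 module as a stand-in engine rather than Presto, and the `users` and `queries` tables are hypothetical.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for a SQL engine; tables are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER, name TEXT);
    CREATE TABLE queries (user_id INTEGER, runtime_ms INTEGER);
    INSERT INTO users VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
    INSERT INTO queries VALUES (1, 120), (1, 900), (2, 40);
""")

# EXISTS with a correlated subquery: users who ran at least one query.
# The inner query references u.id from the outer row.
active = conn.execute("""
    SELECT name FROM users u
    WHERE EXISTS (SELECT 1 FROM queries q WHERE q.user_id = u.id)
    ORDER BY name
""").fetchall()

# EXCEPT: user ids that appear in users but never in queries.
idle = conn.execute("""
    SELECT id FROM users
    EXCEPT
    SELECT user_id FROM queries
""").fetchall()

# Correlated scalar subquery: each user's slowest query (NULL if none).
slowest = conn.execute("""
    SELECT u.name,
           (SELECT MAX(q.runtime_ms) FROM queries q WHERE q.user_id = u.id)
    FROM users u
    ORDER BY u.name
""").fetchall()

print(active, idle, slowest)
```

Unlike an uncorrelated subquery, the inner queries here are logically re-evaluated for each outer row.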
![Page 24: Presto at Hadoop Summit 2016](https://reader034.vdocuments.site/reader034/viewer/2022042605/5877995b1a28ab0f778b6973/html5/thumbnails/24.jpg)
Long term
● Materialized Query Tables
● Workload management
● Spill to disk
● Cost-based Optimizer
See more at https://github.com/prestodb/presto/wiki/Roadmap
![Page 25: Presto at Hadoop Summit 2016](https://reader034.vdocuments.site/reader034/viewer/2022042605/5877995b1a28ab0f778b6973/html5/thumbnails/25.jpg)
More about Presto
GitHub: https://github.com/prestodb & https://github.com/Teradata/presto
Website: http://prestodb.io
Group: https://groups.google.com/group/presto-users
Distro: http://www.teradata.com/presto