Presto @Netflix Presto Summit June 2019 Daniel Weeks


Page 1: Presto @Netflix - Starburst Data… · Challenges with Presto Metrics Calculation CBO requires some metrics prior to execution Complete stats are calculated as part of the scan

Presto @Netflix

Presto Summit, June 2019

Daniel Weeks

Page 2:

Overview

● Data @Netflix

● Platform Architecture

● Iceberg Connector

● ETL with Presto

● What’s Next

Page 3:

Data @Netflix

Page 4:

● 190+ Markets
● 149 MM Global Paid Members*
● 60 MM US Paid Members*
● 89 MM Int’l Paid Members*

*Membership figures (rounded) as of Q1’19.

Page 5:
Page 6:
Page 7:
Page 8:

Platform Architecture

Page 9:

Data Ingestion

[Diagram labels: Ursula, Casspactor, S3, Events, Dimensions]

Page 10:

Processing Engines

[Diagram labels: ETL, Machine Learning, Scale, Exploratory, Interactive, Reporting, Audits, Custom Viz, Dashboarding, Alerts]

Page 11:

Presto Clusters

● Static Clusters
  ○ Deployed on r5.4xl EC2 instances
  ○ Primarily used for ad-hoc workloads
  ○ User query limits and aggressive timeouts

● Dynamic Clusters
  ○ Containerized deployments on Titus
  ○ Scale workers based on pending tasks
  ○ Isolate specific workloads
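
Scaling workers based on pending tasks can be pictured with a small sizing function. This is only a sketch of one possible policy, not the policy used at Netflix: the function name `desired_workers`, the `tasks_per_worker` ratio, and the min/max bounds are all assumptions made for illustration.

```python
import math


def desired_workers(pending_tasks, tasks_per_worker=8,
                    min_workers=2, max_workers=100):
    """Return a worker count sized to the backlog of pending tasks.

    All defaults are hypothetical; a real deployment would tune them.
    """
    if tasks_per_worker <= 0:
        raise ValueError("tasks_per_worker must be positive")
    # Enough workers to drain the backlog at the assumed per-worker capacity.
    needed = math.ceil(pending_tasks / tasks_per_worker)
    # Clamp so the cluster never shrinks to zero or grows without bound.
    return max(min_workers, min(max_workers, needed))


if __name__ == "__main__":
    for backlog in (0, 81, 10_000):
        print(backlog, "pending ->", desired_workers(backlog), "workers")
```

The clamp keeps a floor of warm workers for new queries and a ceiling that bounds cost; both limits are design choices of this sketch.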

Page 12:

Iceberg Connector

Page 13:

Why?

Current table formats . . .

. . . expose complexity to the user

. . . provide weak guarantees

. . . don’t scale for large datasets

Page 14:

Iceberg Features

● Full Schema Evolution
  ○ Column resolution by id
  ○ Allows for full DDL (add, drop, rename)
  ○ Nested evolution

● Advanced Partitioning
  ○ Defined by transforms (e.g. day(ts))
  ○ Hidden from users
  ○ Supports mixed partition strategies and evolution

● Atomic Commits
  ○ Snapshot isolation
  ○ History and rollback
  ○ Optimistic commits

● File System Independence
  ○ No listing
  ○ Does not rely on file renames
  ○ No consistency requirements
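
The atomic-commit bullets above (snapshot isolation, history and rollback, optimistic commits) can be illustrated with a toy in-memory model. This is a sketch under stated assumptions, not Iceberg's implementation or API: the `Table` class, `CommitConflict`, and `append_files` are hypothetical names, and a lock stands in for whatever atomic pointer swap a real catalog provides.

```python
import threading


class CommitConflict(Exception):
    """Raised when a commit was prepared against a stale snapshot."""


class Table:
    """Toy table: an append-only history of immutable file sets."""

    def __init__(self):
        self._lock = threading.Lock()   # stands in for an atomic pointer swap
        self.snapshots = [frozenset()]  # snapshot 0 is the empty table
        self.current = 0

    def current_snapshot(self):
        return self.current, self.snapshots[self.current]

    def commit(self, base_id, new_files):
        # Optimistic check: succeed only if nobody committed since base_id.
        with self._lock:
            if base_id != self.current:
                raise CommitConflict(f"base {base_id} != current {self.current}")
            self.snapshots.append(frozenset(new_files))
            self.current = len(self.snapshots) - 1
            return self.current

    def rollback(self, snapshot_id):
        # History is kept, so rollback just moves the current pointer.
        with self._lock:
            if not 0 <= snapshot_id < len(self.snapshots):
                raise ValueError("unknown snapshot")
            self.current = snapshot_id


def append_files(table, files, max_retries=3):
    """Re-read the latest snapshot and retry if another writer won the race."""
    for _ in range(max_retries):
        base_id, base_files = table.current_snapshot()
        try:
            return table.commit(base_id, base_files | set(files))
        except CommitConflict:
            continue
    raise CommitConflict("gave up after retries")


if __name__ == "__main__":
    t = Table()
    append_files(t, ["a.parquet"])
    _, seen_by_reader = t.current_snapshot()   # reader pins this snapshot
    append_files(t, ["b.parquet"])
    print(sorted(seen_by_reader))              # reader still sees only a.parquet
```

Because each snapshot is immutable, a reader that pinned an earlier snapshot is unaffected by later commits, which is the property the slide calls snapshot isolation.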

Page 15:

Iceberg Features

● Staging and Temporal Queries
  ○ Query as of time
  ○ Query a specific snapshot
  ○ Stage snapshot before committing

● Advanced Statistics
  ○ Automatically collected at file level
  ○ Used for pruning files
  ○ Accurate split planning

● Dataset Optimization
  ○ Safe rewrite
  ○ Compact small files
  ○ Manage metadata layout

● Delete / Update / Merge
  ○ Currently in design
  ○ Join us - [email protected]
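
The "Advanced Statistics" bullets on this slide describe file-level statistics used to prune files before a scan. A minimal sketch of that idea follows, assuming per-file lower and upper bounds for each column; the `DataFile` record and the `may_contain` and `plan_scan` helpers are hypothetical and do not reflect Iceberg's actual metadata classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    row_count: int
    lower: dict  # column name -> smallest value in the file
    upper: dict  # column name -> largest value in the file


def may_contain(data_file, column, lo, hi):
    """Return False only when the stats prove no row has lo <= column <= hi."""
    if column not in data_file.lower or column not in data_file.upper:
        return True  # no stats for this column: the file must be scanned
    return data_file.upper[column] >= lo and data_file.lower[column] <= hi


def plan_scan(files, column, lo, hi):
    """Keep only the files whose value range overlaps the filter range."""
    return [f for f in files if may_contain(f, column, lo, hi)]


if __name__ == "__main__":
    files = [
        DataFile("f1", 100, {"day": 1}, {"day": 10}),
        DataFile("f2", 100, {"day": 11}, {"day": 20}),
        DataFile("f3", 50, {}, {}),  # written without stats
    ]
    kept = plan_scan(files, "day", 12, 15)
    print([f.path for f in kept], sum(f.row_count for f in kept))
```

Pruning is conservative: a file is skipped only when its bounds rule it out, and the row counts of the surviving files give the planner an upper bound on the rows it will read.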

Page 16:

Challenges with Presto

● Metrics Calculation
  ○ CBO requires some metrics prior to execution
  ○ Complete stats are calculated as part of the scan
  ○ Some tradeoff in split planning time vs early execution

● Writing Non-identity Partitions
  ○ Tables appear as unpartitioned to Presto
  ○ Need to influence plan for effective write
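
To make the non-identity partition point concrete: with a transform such as day(ts), the partition value is derived from a column rather than stored in one, so a writer has to compute it and cluster rows by it to avoid scattering each partition across many files. The sketch below shows that grouping step only. It is an illustration, not the connector's write path; `day` here returns a UTC calendar date for readability, and `group_for_write` is a hypothetical helper.

```python
from collections import defaultdict
from datetime import date, datetime, timezone


def day(ts: datetime) -> date:
    """Partition transform: map a timezone-aware timestamp to its UTC date."""
    return ts.astimezone(timezone.utc).date()


def group_for_write(rows, ts_field="ts"):
    """Cluster rows by derived partition value so each group can be
    written to its own file(s)."""
    groups = defaultdict(list)
    for row in rows:
        groups[day(row[ts_field])].append(row)
    return dict(groups)


if __name__ == "__main__":
    rows = [
        {"id": 1, "ts": datetime(2019, 6, 1, 23, 30, tzinfo=timezone.utc)},
        {"id": 2, "ts": datetime(2019, 6, 2, 0, 15, tzinfo=timezone.utc)},
        {"id": 3, "ts": datetime(2019, 6, 1, 8, 0, tzinfo=timezone.utc)},
    ]
    for partition, members in sorted(group_for_write(rows).items()):
        print(partition, [r["id"] for r in members])
```

An engine that sees the table as unpartitioned will not perform this clustering on its own, which is why the slide notes that the plan needs to be influenced for an effective write.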

Page 17:

Presto for ETL

Page 18:

ETL with Presto

● ETL Historically Difficult
  ○ S3 optimizations relax file system requirements
  ○ Failures result in indeterminate state
  ○ Variable workloads result in unpredictable SLA

● Iceberg Support for ETL
  ○ Provides stronger contract for S3 warehouse
  ○ Enables Netflix patterns used with Spark

Page 19:

ETL Patterns

● Write, Audit, Publish
  ○ Write data to staged snapshot
  ○ Run audits to validate data
  ○ Publish data for downstream consumers

● Incremental Processing
  ○ Track progress with a high-watermark
  ○ Only process new data
  ○ Restate by adjusting high-watermark
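
The two patterns on this slide combine naturally: an incremental job picks up only rows past the high-watermark, writes them to a staged area, audits them, and publishes only if the audit passes. The following is a toy in-memory sketch of that flow. `StagedTable`, `audit`, and `run_incremental` are hypothetical names, the audit rule is invented for illustration, and integer timestamps stand in for real event times.

```python
class StagedTable:
    """Toy table with a staging area that downstream readers cannot see."""

    def __init__(self):
        self.published = []  # rows visible to downstream consumers
        self.staged = None   # rows written but not yet published

    def write_staged(self, rows):
        self.staged = list(rows)

    def publish(self):
        if self.staged is None:
            raise RuntimeError("nothing staged")
        self.published.extend(self.staged)
        self.staged = None

    def discard(self):
        self.staged = None


def audit(rows):
    """Hypothetical audit: every row needs a non-negative value."""
    return all(r["value"] is not None and r["value"] >= 0 for r in rows)


def run_incremental(source, target, watermark):
    """Process rows newer than the watermark; return the new watermark."""
    new_rows = [r for r in source if r["ts"] > watermark]
    if not new_rows:
        return watermark
    target.write_staged(new_rows)      # write
    if not audit(target.staged):       # audit
        target.discard()
        return watermark               # unchanged, so the next run retries
    target.publish()                   # publish
    return max(r["ts"] for r in new_rows)


if __name__ == "__main__":
    source = [{"ts": 1, "value": 10}, {"ts": 2, "value": 7}]
    table = StagedTable()
    mark = run_incremental(source, table, watermark=0)
    source.append({"ts": 3, "value": 4})
    mark = run_incremental(source, table, mark)
    print(mark, len(table.published))
```

Restating data corresponds to calling `run_incremental` with a lower watermark. This toy appends on publish, so a real restatement would also need to replace the previously published rows rather than add to them.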

Page 20:

What’s Next?

Page 21:

Iceberg Vectorized Read Path

● Materialize to Arrow Buffers
  ○ Common read path for Spark, Presto, etc.
  ○ Optimized for vectorized operations
  ○ Overlay engine specific columnar APIs

● Filter and Projection Pushdown
  ○ Map key and nested projection
  ○ Filter records during materialization
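
Filtering and projecting during materialization means rejected records and unrequested fields never reach the column buffers. The sketch below illustrates that idea with plain Python lists standing in for Arrow buffers; it does not use the Arrow library, and `get_path` and `materialize` are hypothetical helpers rather than part of any Iceberg or Presto API.

```python
def get_path(record, path):
    """Resolve a dotted path such as 'device.os' inside nested dicts."""
    value = record
    for key in path.split("."):
        if not isinstance(value, dict):
            return None
        value = value.get(key)
    return value


def materialize(records, columns, predicate=None):
    """Build one buffer per requested column, skipping filtered records."""
    buffers = {name: [] for name in columns}
    for record in records:
        # Filter pushdown: rejected records are never copied.
        if predicate is not None and not predicate(record):
            continue
        # Projection pushdown: only requested (possibly nested) fields are read.
        for name in columns:
            buffers[name].append(get_path(record, name))
    return buffers


if __name__ == "__main__":
    records = [
        {"id": 1, "ms": 120, "device": {"os": "ios", "model": "x"}},
        {"id": 2, "ms": 40, "device": {"os": "web", "model": "y"}},
        {"id": 3, "ms": 300, "device": {"os": "android", "model": "z"}},
    ]
    print(materialize(records, ["id", "device.os"], lambda r: r["ms"] > 100))
```

The output is column-oriented (one list per column), which is the layout that vectorized operators consume; a real read path would fill typed, contiguous buffers instead of Python lists.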

Page 22:

Druid Connector

● Pushdown for Aggregate Queries
  ○ Druid significantly faster for aggregates
  ○ Provides better BI tool integration
  ○ Reduces switching cost for analysts

● Convenient Interface for Users
  ○ Native Druid SQL support is limited
  ○ Expand data availability across platforms

Page 23:

View Support

● Common View Definitions
  ○ Ability to read from multiple engines
  ○ Separate representation from Hive metastore

● Dataset Evolution
  ○ Use views to evolve schema
  ○ Decouple changes from downstream consumers

Page 24:

Questions?