Presto @Netflix

Presto Summit, June 2019

Daniel Weeks

Overview

● Data @Netflix

● Platform Architecture

● Iceberg Connector

● ETL with Presto

● What’s Next

Data @Netflix

● 190+ Markets
● 149 MM Global Paid Members*
● 60 MM US Paid Members*
● 89 MM Int’l Paid Members*

*Membership figures (rounded) as of Q1’19.

Platform Architecture

[Architecture diagram. Labels: Ursula, Casspactor, S3, Events, Dimensions, Data Ingestion, Processing Engines, ETL, Machine Learning, Scale, Exploratory, Interactive, Reporting, Audits, Custom Viz, Dashboarding, Alerts]

Presto Clusters

● Static Clusters
  ○ Deployed on r5.4xl EC2 instances
  ○ Primarily used for ad-hoc workloads
  ○ User query limits and aggressive timeouts

● Dynamic Clusters
  ○ Containerized deployments on Titus
  ○ Scale workers based on pending tasks
  ○ Isolate specific workloads
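One way to picture "scale workers based on pending tasks" is a simple target-size rule. The sketch below is illustrative only: the function name, the tasks-per-worker ratio, and the floor and ceiling are assumptions, not the actual policy used on Titus.

```python
import math


def target_workers(pending_tasks: int,
                   tasks_per_worker: int = 100,   # assumed capacity per worker
                   min_workers: int = 2,          # assumed floor
                   max_workers: int = 50) -> int:  # assumed ceiling
    """Return a worker count sized to the pending-task backlog."""
    if tasks_per_worker <= 0:
        raise ValueError("tasks_per_worker must be positive")
    # Round up so a partial batch of tasks still gets a worker.
    needed = math.ceil(max(pending_tasks, 0) / tasks_per_worker)
    # Clamp to the configured bounds.
    return max(min_workers, min(max_workers, needed))


print(target_workers(0))       # empty backlog: stays at the floor
print(target_workers(1250))    # sized to the backlog
print(target_workers(10**6))   # capped at the ceiling
```

A real autoscaler would also smooth the signal over time to avoid thrashing; that is left out here.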

Iceberg Connector

Why?

Current table formats . . .

  . . . expose complexity to the user

  . . . provide weak guarantees

  . . . don’t scale for large datasets

Iceberg Features

● Full Schema Evolution
  ○ Column resolution by id
  ○ Allows for full DDL (add, drop, rename)
  ○ Nested evolution

● Advanced Partitioning
  ○ Defined by transforms (e.g. day(ts))
  ○ Hidden from users
  ○ Supports mixed partition strategies and evolution

● Atomic Commits
  ○ Snapshot isolation
  ○ History and rollback
  ○ Optimistic commits
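The commit model can be sketched as a compare-and-swap on the current snapshot: a writer builds a new snapshot from the one it read and publishes it only if nobody else committed in between, otherwise it retries. This is a toy in-memory model, not Iceberg's implementation; `Table`, `commit_append`, and `rollback` are hypothetical names.

```python
import threading


class Table:
    """Toy table: an append-only snapshot history; the last entry is current."""

    def __init__(self):
        self._snapshots = [frozenset()]   # snapshot 0 is the empty table
        self._lock = threading.Lock()

    def current(self):
        with self._lock:
            return len(self._snapshots) - 1, self._snapshots[-1]

    def snapshot(self, snapshot_id):
        with self._lock:
            return self._snapshots[snapshot_id]

    def compare_and_swap(self, base_id, files):
        """Atomically publish `files` only if `base_id` is still current."""
        with self._lock:
            if base_id != len(self._snapshots) - 1:
                return False              # someone else committed first
            self._snapshots.append(frozenset(files))
            return True


def commit_append(table, added_files, max_retries=5):
    """Optimistic commit: build on the current snapshot, retry on conflict."""
    for _ in range(max_retries):
        base_id, files = table.current()
        if table.compare_and_swap(base_id, files | set(added_files)):
            return base_id + 1
    raise RuntimeError("too many concurrent commits")


def rollback(table, snapshot_id):
    """Roll back by re-publishing an older snapshot; history is kept."""
    base_id, _ = table.current()
    return table.compare_and_swap(base_id, table.snapshot(snapshot_id))


t = Table()
commit_append(t, ["a.parquet"])   # becomes snapshot 1
commit_append(t, ["b.parquet"])   # becomes snapshot 2
rollback(t, 1)                    # current state equals snapshot 1 again
print(sorted(t.current()[1]))
```

Readers always see one complete snapshot, which is what gives snapshot isolation in this model.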

● File System Independence
  ○ No listing
  ○ Does not rely on file renames
  ○ No consistency requirements

Iceberg Features

● Staging and Temporal Queries
  ○ Query as of time
  ○ Query a specific snapshot
  ○ Stage snapshot before committing

● Advanced Statistics
  ○ Automatically collected at file level
  ○ Used for pruning files
  ○ Accurate split planning
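File-level statistics make pruning a range-overlap check: a file can be skipped when its min/max range for the filtered column cannot contain a matching value. A minimal sketch, assuming one integer column and a hypothetical `DataFile` record (real metadata tracks bounds per column):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    lower: int   # min value of the filter column in this file
    upper: int   # max value of the filter column in this file


def prune(files, lo, hi):
    """Keep only files whose [lower, upper] range can overlap [lo, hi]."""
    return [f for f in files if f.upper >= lo and f.lower <= hi]


files = [
    DataFile("f1", 0, 99),
    DataFile("f2", 100, 199),
    DataFile("f3", 200, 299),
]
# Filter: 150 <= col <= 220. f1 cannot match and is never opened.
print([f.path for f in prune(files, 150, 220)])
```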

● Dataset Optimization
  ○ Safe rewrite
  ○ Compact small files
  ○ Manage metadata layout

● Delete / Update / Merge
  ○ Currently in design
  ○ Join us - [email protected]

Challenges with Presto

● Metrics Calculation
  ○ CBO (cost-based optimizer) requires some metrics prior to execution
  ○ Complete stats are calculated as part of the scan
  ○ Some tradeoff in split planning time vs early execution

● Writing Non-identity Partitions
  ○ Tables appear as unpartitioned to Presto
  ○ Need to influence plan for effective write
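To illustrate the write problem: with a non-identity partition such as day(ts), the partition value is derived from a column rather than stored as one, so the engine sees no partition column to distribute on. The sketch below groups rows by the derived value before writing so that each group maps to a single partition. It is not the connector's implementation; the row shape and `group_for_write` are hypothetical, and `day` is modeled here as days since the Unix epoch in UTC.

```python
from collections import defaultdict
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def day(ts: datetime) -> int:
    """Days since 1970-01-01 UTC for a timezone-aware timestamp."""
    return (ts - EPOCH).days


def group_for_write(rows):
    """Bucket rows by derived partition value so each bucket hits one partition."""
    groups = defaultdict(list)
    for row in rows:
        groups[day(row["ts"])].append(row)
    return dict(groups)


rows = [
    {"id": 1, "ts": datetime(2019, 6, 20, 23, 59, tzinfo=timezone.utc)},
    {"id": 2, "ts": datetime(2019, 6, 21, 0, 1, tzinfo=timezone.utc)},
    {"id": 3, "ts": datetime(2019, 6, 20, 8, 0, tzinfo=timezone.utc)},
]
for partition, group in sorted(group_for_write(rows).items()):
    print(partition, [r["id"] for r in group])
```

Without this kind of grouping, every writer may receive rows for every partition and produce many small files.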

Presto for ETL

ETL with Presto

● ETL Historically Difficult
  ○ S3 optimizations relax file system requirements
  ○ Failures result in indeterminate state
  ○ Variable workloads result in unpredictable SLA

● Iceberg Support for ETL
  ○ Provides stronger contract for S3 warehouse
  ○ Enables Netflix patterns used with Spark

ETL Patterns

● Write, Audit, Publish
  ○ Write data to staged snapshot
  ○ Run audits to validate data
  ○ Publish data for downstream consumers
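The three steps can be sketched against a toy table that keeps staged snapshots separate from the one readers see. Everything here (`StagedTable`, `write_audit_publish`, the two audit functions) is hypothetical and only models the control flow, not the Iceberg or Presto API.

```python
class StagedTable:
    """Toy table: snapshots are stored by id; only one is visible to readers."""

    def __init__(self):
        self.snapshots = {}     # snapshot id -> list of rows
        self.published = None   # id that default reads resolve to
        self._next_id = 1

    def write_staged(self, rows):
        sid = self._next_id
        self._next_id += 1
        self.snapshots[sid] = list(rows)
        return sid

    def read(self, snapshot_id=None):
        sid = self.published if snapshot_id is None else snapshot_id
        return [] if sid is None else self.snapshots[sid]

    def publish(self, sid):
        self.published = sid


def write_audit_publish(table, rows, audits):
    sid = table.write_staged(rows)          # 1. write to a staged snapshot
    staged = table.read(sid)
    failed = [a.__name__ for a in audits if not a(staged)]  # 2. audit
    if not failed:
        table.publish(sid)                  # 3. publish only if audits pass
    return sid, failed


def non_empty(rows):
    return len(rows) > 0


def no_null_ids(rows):
    return all(r.get("id") is not None for r in rows)


table = StagedTable()
print(write_audit_publish(table, [{"id": 1}], [non_empty, no_null_ids]))
print(write_audit_publish(table, [{"id": None}], [non_empty, no_null_ids]))
print(table.read())   # still the first batch; the bad batch stayed staged
```

A failed batch remains addressable by snapshot id for debugging but is never visible to downstream consumers.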

● Incremental Processing
  ○ Track progress with a high-watermark
  ○ Only process new data
  ○ Restate by adjusting high-watermark
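A minimal sketch of the high-watermark loop, assuming snapshots carry increasing ids and the job persists the last id it processed. `process_increment` and the snapshot tuples are hypothetical.

```python
def process_increment(snapshots, watermark, handle):
    """Process snapshots newer than `watermark`; return the new watermark.

    snapshots: iterable of (snapshot_id, rows) with increasing ids.
    handle:    callback invoked with the rows of each new snapshot.
    """
    new_mark = watermark
    for sid, rows in snapshots:
        if sid > watermark:          # only process new data
            handle(rows)
            new_mark = max(new_mark, sid)
    return new_mark


snaps = [(1, ["a"]), (2, ["b"]), (3, ["c"])]
seen = []

mark = process_increment(snaps, 1, seen.extend)     # picks up snapshots 2 and 3
mark = process_increment(snaps, mark, seen.extend)  # nothing new, no work done
print(mark, seen)

# Restate: lower the watermark and the same job reprocesses snapshot 3.
restated = []
process_increment(snaps, 2, restated.extend)
print(restated)
```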

What’s Next?

Iceberg Vectorized Read Path

● Materialize to Arrow Buffers
  ○ Common read path for Spark, Presto, etc.
  ○ Optimized for vectorized operations
  ○ Overlay engine specific columnar APIs

● Filter and Projection Pushdown
  ○ Map key and nested projection
  ○ Filter records during materialization

Druid Connector

● Pushdown for Aggregate Queries
  ○ Druid significantly faster for aggregates
  ○ Provides better BI tool integration
  ○ Reduces switching cost for analysts

● Convenient Interface for Users
  ○ Native Druid SQL support is limited
  ○ Expand data availability across platforms

View Support

● Common View Definitions
  ○ Ability to read from multiple engines
  ○ Separate representation from Hive metastore

● Dataset Evolution
  ○ Use views to evolve schema
  ○ Decouple changes from downstream consumers

Questions?
