Presto @Netflix
Presto Summit, June 2019
Daniel Weeks
Overview
● Data @Netflix
● Platform Architecture
● Iceberg Connector
● ETL with Presto
● What’s Next
Data @Netflix
● 190+ Markets
● 149 MM Global Paid Members*
● 60 MM US Paid Members*
● 89 MM Int’l Paid Members*
*Membership figures (rounded) as of Q1’19.
Platform Architecture
[Architecture diagram. Labels: Data Ingestion (Ursula, Casspactor, Events, Dimensions); S3; Processing Engines (ETL, Machine Learning, Scale, Exploratory, Interactive); Reporting, Audits, Custom Viz, Dashboarding, Alerts]
Presto Clusters
● Static Clusters
  ○ Deployed on r5.4xl EC2 instances
  ○ Primarily used for ad-hoc workloads
  ○ User query limits and aggressive timeouts
● Dynamic Clusters
  ○ Containerized deployments on Titus
  ○ Scale workers based on pending tasks
  ○ Isolate specific workloads
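The "scale workers based on pending tasks" rule can be illustrated with a small sizing function. This is only a sketch under stated assumptions: the function name, the tasks-per-worker ratio, and the min/max bounds are hypothetical and are not taken from the talk or from Titus.

```python
import math


def desired_workers(pending_tasks: int, tasks_per_worker: int,
                    min_workers: int, max_workers: int) -> int:
    """Return a worker count for the given backlog (hypothetical policy).

    Assumes each worker can absorb `tasks_per_worker` pending tasks and
    that the cluster must stay within [min_workers, max_workers].
    """
    if tasks_per_worker <= 0:
        raise ValueError("tasks_per_worker must be positive")
    needed = math.ceil(pending_tasks / tasks_per_worker)
    # Clamp to the allowed cluster size.
    return max(min_workers, min(max_workers, needed))
```

With a ratio of 8 tasks per worker and bounds of 2 to 50, a backlog of 100 pending tasks would ask for 13 workers, while an empty queue falls back to the minimum of 2.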
Iceberg Connector
Why?
Current table formats . . .
. . . expose complexity to the user
. . . provide weak guarantees
. . . don’t scale for large datasets
Iceberg Features
● Full Schema Evolution
  ○ Column resolution by id
  ○ Allows for full DDL (add, drop, rename)
  ○ Nested evolution
● Advanced Partitioning
  ○ Defined by transforms (e.g. day(ts))
  ○ Hidden from users
  ○ Supports mixed partition strategies and evolution
● Atomic Commits
  ○ Snapshot isolation
  ○ History and rollback
  ○ Optimistic commits
● File System Independence
  ○ No listing
  ○ Does not rely on file renames
  ○ No consistency requirements
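To make the "defined by transforms, hidden from users" idea concrete, here is a minimal sketch of a day(ts) transform and of how a timestamp range in a query could be mapped to partitions without the user naming a partition column. The function names and the dictionary of files are hypothetical stand-ins, not the Iceberg API; the only assumption carried over from Iceberg is that the day transform counts days since 1970-01-01.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def day_transform(ts: datetime) -> int:
    """Map a timezone-aware timestamp to whole days since the Unix epoch."""
    return (ts - EPOCH).days


def files_for_range(files_by_day, start: datetime, end: datetime):
    """Select files whose day partition falls within [start, end].

    `files_by_day` is a hypothetical mapping of day value -> file paths.
    The caller filters on timestamps only; the partition value is derived.
    """
    lo, hi = day_transform(start), day_transform(end)
    return [path
            for day, paths in sorted(files_by_day.items())
            if lo <= day <= hi
            for path in paths]
```

Because the partition value is derived from the timestamp, a query that filters on ts alone still touches only the matching day partitions.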
Iceberg Features
● Staging and Temporal Queries
  ○ Query as of time
  ○ Query a specific snapshot
  ○ Stage snapshot before committing
● Advanced Statistics
  ○ Automatically collected at file level
  ○ Used for pruning files
  ○ Accurate split planning
● Dataset Optimization
  ○ Safe rewrite
  ○ Compact small files
  ○ Manage metadata layout
● Delete / Update / Merge
  ○ Currently in design
  ○ Join us - [email protected]
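File-level statistics can prune files before any data is read. The sketch below assumes each file records a min and max for one column; the `DataFile` class and `prune_files` function are hypothetical illustrations, not Iceberg's metadata classes.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DataFile:
    """Hypothetical per-file statistics for a single integer column."""
    path: str
    min_value: int
    max_value: int
    record_count: int


def prune_files(files: List[DataFile], lower: int, upper: int) -> List[DataFile]:
    """Keep only files whose [min, max] range overlaps [lower, upper]."""
    return [f for f in files
            if f.max_value >= lower and f.min_value <= upper]


def planned_records(files: List[DataFile]) -> int:
    """Sum record counts of surviving files, e.g. to size splits."""
    return sum(f.record_count for f in files)
```

A file whose range lies entirely outside the predicate is skipped, and the record counts of the remaining files give the planner a row estimate without opening them.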
Challenges with Presto
● Metrics Calculation
  ○ CBO requires some metrics prior to execution
  ○ Complete stats are calculated as part of the scan
  ○ Some tradeoff in split planning time vs early execution
● Writing Non-identity Partitions
  ○ Tables appear as unpartitioned to Presto
  ○ Need to influence plan for effective write
Presto for ETL
ETL with Presto
● ETL Historically Difficult
  ○ S3 optimizations relax file system requirements
  ○ Failures result in indeterminate state
  ○ Variable workloads result in unpredictable SLA
● Iceberg Support for ETL
  ○ Provides stronger contract for S3 warehouse
  ○ Enables Netflix patterns used with Spark
ETL Patterns
● Write, Audit, Publish
  ○ Write data to staged snapshot
  ○ Run audits to validate data
  ○ Publish data for downstream consumers
● Incremental Processing
  ○ Track progress with a high-watermark
  ○ Only process new data
  ○ Restate by adjusting high-watermark
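Both patterns can be sketched against a toy in-memory table. Everything here is a hypothetical stand-in (the `ToyTable` class, the audit callables, the `ts` field used as the watermark column); it is not the Iceberg or Presto API, only an illustration of the control flow described above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Row = dict


@dataclass
class ToyTable:
    """Hypothetical table with staged snapshots; not the Iceberg API."""
    snapshots: Dict[int, List[Row]] = field(default_factory=dict)
    current_id: Optional[int] = None

    def stage(self, rows: List[Row]) -> int:
        """Write rows to a new snapshot without making it current."""
        base = self.snapshots.get(self.current_id, [])
        snapshot_id = len(self.snapshots) + 1
        self.snapshots[snapshot_id] = base + list(rows)
        return snapshot_id

    def publish(self, snapshot_id: int) -> None:
        """Make a staged snapshot visible to downstream readers."""
        self.current_id = snapshot_id

    def read(self, snapshot_id: Optional[int] = None) -> List[Row]:
        """Read the current snapshot, or a specific one if given."""
        target = self.current_id if snapshot_id is None else snapshot_id
        return self.snapshots.get(target, [])


def write_audit_publish(table: ToyTable, rows: List[Row],
                        audits: List[Callable[[List[Row]], bool]]) -> bool:
    """Stage the write, run every audit on it, publish only if all pass."""
    staged = table.stage(rows)
    if all(audit(table.read(staged)) for audit in audits):
        table.publish(staged)
        return True
    return False  # the staged snapshot stays unpublished


def new_rows_since(rows: List[Row], watermark: int) -> Tuple[List[Row], int]:
    """Return rows newer than the watermark and the advanced watermark."""
    fresh = [r for r in rows if r["ts"] > watermark]
    new_mark = max((r["ts"] for r in fresh), default=watermark)
    return fresh, new_mark
```

A failed audit leaves downstream readers on the previous snapshot, and restating a range amounts to calling `new_rows_since` again with a lower watermark.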
What’s Next?
Iceberg Vectorized Read Path
● Materialize to Arrow Buffers
  ○ Common read path for Spark, Presto, etc.
  ○ Optimized for vectorized operations
  ○ Overlay engine-specific columnar APIs
● Filter and Projection Pushdown
  ○ Map key and nested projection
  ○ Filter records during materialization
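A rough illustration of filtering and projecting during materialization follows. Plain Python lists stand in for Arrow buffers, and the `materialize` function and its parameters are hypothetical; the actual read path was still future work at the time of the talk.

```python
from typing import Callable, Dict, Iterable, List, Optional


def materialize(rows: Iterable[dict], columns: List[str],
                predicate: Optional[Callable[[dict], bool]] = None
                ) -> Dict[str, list]:
    """Build one column buffer per projected column.

    Rows failing the predicate are dropped before any value is copied,
    and columns outside `columns` are never materialized.
    """
    buffers: Dict[str, list] = {name: [] for name in columns}
    for row in rows:
        if predicate is not None and not predicate(row):
            continue  # filtered out during materialization
        for name in columns:
            buffers[name].append(row[name])
    return buffers
```

The result is columnar (one list per column), which is the shape an engine-specific columnar API would then wrap.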
Druid Connector
● Pushdown for Aggregate Queries
  ○ Druid significantly faster for aggregates
  ○ Provides better BI tool integration
  ○ Reduces switching cost for analysts
● Convenient Interface for Users
  ○ Native Druid SQL support is limited
  ○ Expand data availability across platforms
View Support
● Common View Definitions
  ○ Ability to read from multiple engines
  ○ Separate representation from Hive metastore
● Dataset Evolution
  ○ Use views to evolve schema
  ○ Decouple changes from downstream consumers
Questions?