presto @netflix - starburst data… · challenges with presto metrics calculation cbo requires...

Presto @Netflix

Presto Summit, June 2019

Daniel Weeks

Overview

● Data @Netflix

● Platform Architecture

● Iceberg Connector

● ETL with Presto

● What’s Next

Data @Netflix

● 190+ Markets
● 149 MM Global Paid Members*
● 60 MM US Paid Members*
● 89 MM Int’l Paid Members*

*Membership figures (rounded) as of Q1’19.

Platform Architecture

[Diagram: platform architecture. Recoverable labels: Ursula, Casspactor, Events, Dimensions, Data Ingestion, Processing Engines, Machine Learning, Exploratory, Interactive, Reporting, Audits, Custom Viz, Dashboarding, Alerts.]

Presto Clusters

● Static Clusters
  ○ Deployed on r5.4xl EC2 instances
  ○ Primarily used for ad-hoc workloads
  ○ User query limits and aggressive timeouts

● Dynamic Clusters
  ○ Containerized deployments on Titus
  ○ Scale workers based on pending tasks
  ○ Isolate specific workloads
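A minimal sketch of what "scale workers based on pending tasks" could look like. The function name, the `tasks_per_worker` ratio, and the min/max bounds are assumptions for illustration; the slides do not describe the actual Titus scaling policy.

```python
import math


def desired_workers(pending_tasks, tasks_per_worker, min_workers, max_workers):
    """Return a worker count sized to the pending-task backlog, clamped to bounds.

    All parameters are hypothetical tuning knobs, not settings from the talk.
    """
    if tasks_per_worker <= 0:
        raise ValueError("tasks_per_worker must be positive")
    # Workers needed to drain the backlog at the assumed per-worker capacity.
    needed = math.ceil(pending_tasks / tasks_per_worker)
    # Never drop below the floor or exceed the ceiling.
    return max(min_workers, min(max_workers, needed))
```

A control loop would call this periodically with the current backlog and resize the cluster toward the returned value.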

Iceberg Connector

Current table formats . . .

. . . expose complexity to the user

. . . provide weak guarantees

. . . don’t scale for large datasets

Iceberg Features

● Full Schema Evolution
  ○ Column resolution by id
  ○ Allows for full DDL (add, drop, rename)
  ○ Nested evolution

● Advanced Partitioning
  ○ Defined by transforms (e.g. day(ts))
  ○ Hidden from users
  ○ Supports mixed partition strategies and evolution

● Atomic Commits
  ○ Snapshot isolation
  ○ History and rollback
  ○ Optimistic commits

● File System Independence
  ○ No listing
  ○ Does not rely on file renames
  ○ No consistency requirements
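The atomic-commit and file-system-independence points above can be illustrated with a small in-memory sketch: each snapshot carries an explicit file list (so readers never list directories), and a commit is a compare-and-swap of the current snapshot that is retried on conflict. `Snapshot`, `Table`, `CommitConflict`, and `append_files` are hypothetical names, not the Iceberg API.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    files: tuple  # explicit list of data files; no directory listing needed


class CommitConflict(Exception):
    """Raised when another writer committed first."""


class Table:
    """Hypothetical in-memory table; a real catalog would provide the atomic swap."""

    def __init__(self):
        self._lock = threading.Lock()
        self.history = [Snapshot(0, ())]  # retained snapshots, oldest first

    def current(self):
        return self.history[-1]

    def swap(self, expected, new):
        # Compare-and-swap: succeed only if nobody committed since `expected`.
        with self._lock:
            if self.history[-1].snapshot_id != expected.snapshot_id:
                raise CommitConflict("table changed since snapshot was read")
            self.history.append(new)


def append_files(table, new_files, max_retries=3):
    """Optimistically commit new files, rebuilding on the latest snapshot on conflict."""
    for _ in range(max_retries):
        base = table.current()
        candidate = Snapshot(base.snapshot_id + 1, base.files + tuple(new_files))
        try:
            table.swap(base, candidate)
            return candidate
        except CommitConflict:
            continue  # re-read the current snapshot and try again
    raise CommitConflict("gave up after retries")
```

Because a commit only adds a new snapshot and moves a pointer, no file is renamed and earlier snapshots stay readable, which is what makes history and rollback possible in this model.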

Iceberg Features

● Staging and Temporal Queries
  ○ Query as of time
  ○ Query a specific snapshot
  ○ Stage snapshot before committing

● Advanced Statistics
  ○ Automatically collected at file level
  ○ Used for pruning files
  ○ Accurate split planning

● Dataset Optimization
  ○ Safe rewrite
  ○ Compact small files
  ○ Manage metadata layout

● Delete / Update / Merge
  ○ Currently in design
  ○ Join us - dev@iceberg.apache.org
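To make the "used for pruning files" point under Advanced Statistics concrete, here is a rough sketch of min/max pruning for a single column. `FileStats` and `prune_files` are illustrative names, and the real statistics cover more than one integer range per file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    path: str
    min_value: int  # smallest value of the filtered column in this file
    max_value: int  # largest value of the filtered column in this file


def prune_files(files, lower, upper):
    """Keep only files whose [min, max] range overlaps the filter range [lower, upper]."""
    return [f for f in files if f.max_value >= lower and f.min_value <= upper]
```

Files whose range cannot contain a matching value are skipped without being opened, so planning can produce splits only for data that might match.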

Challenges with Presto

● Metrics Calculation
  ○ CBO requires some metrics prior to execution
  ○ Complete stats are calculated as part of the scan
  ○ Some tradeoff in split planning time vs early execution

● Writing Non-identity Partitions
  ○ Tables appear as unpartitioned to Presto
  ○ Need to influence plan for effective write
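The non-identity partition problem can be sketched as follows: the partition value is derived from a column by a transform such as day(ts), so an engine that sees the table as unpartitioned will not cluster rows by that value before writing. The helper names below are hypothetical, and the grouping step only stands in for whatever plan change is needed; it is not how Presto or the connector implements it.

```python
from collections import defaultdict
from datetime import datetime, timezone


def day_transform(ts):
    """Map a timezone-aware timestamp to its UTC calendar day; a stand-in for day(ts)."""
    if ts.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return ts.astimezone(timezone.utc).date()


def group_rows_by_partition(rows, ts_field="ts"):
    """Cluster rows by derived partition value so each writer touches few partitions."""
    groups = defaultdict(list)
    for row in rows:
        groups[day_transform(row[ts_field])].append(row)
    return dict(groups)
```

Without this kind of clustering, every writer may receive rows for every day and produce many small files per partition.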

Presto for ETL

ETL with Presto

● ETL Historically Difficult
  ○ S3 optimizations relax file system requirements
  ○ Failures result in indeterminate state
  ○ Variable workloads result in unpredictable SLA

● Iceberg Support for ETL
  ○ Provides stronger contract for S3 warehouse
  ○ Enables Netflix patterns used with Spark

ETL Patterns

● Write, Audit, Publish
  ○ Write data to staged snapshot
  ○ Run audits to validate data
  ○ Publish data for downstream consumers

● Incremental Processing
  ○ Track progress with a high-watermark
  ○ Only process new data
  ○ Restate by adjusting high-watermark
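Both patterns above can be combined in one rough sketch: select only rows past the high-watermark, write them to a staged area, audit them, and publish only if the audit passes. `StagedTable`, the `seq` and `user_id` fields, and the audit rule are all assumptions made for illustration; in practice the staged data would be a staged table snapshot rather than a Python list.

```python
from dataclasses import dataclass, field


@dataclass
class StagedTable:
    published: list = field(default_factory=list)  # visible to consumers
    staged: list = field(default_factory=list)     # written but not yet published
    high_watermark: int = 0                        # highest `seq` processed so far


def audit(rows):
    """Hypothetical audit: no row may have a null user_id."""
    return all(r.get("user_id") is not None for r in rows)


def run_increment(source_rows, table):
    """Process only new rows; publish them only if the audit passes."""
    new_rows = [r for r in source_rows if r["seq"] > table.high_watermark]
    if not new_rows:
        return False                      # nothing new since the last run
    table.staged = new_rows               # Write
    if not audit(table.staged):           # Audit
        table.staged = []                 # discard; consumers never saw it
        return False
    table.published.extend(table.staged)  # Publish
    table.high_watermark = max(r["seq"] for r in table.staged)
    table.staged = []
    return True


def restate(table, watermark):
    """Restate by moving the watermark back and dropping rows published after it."""
    table.published = [r for r in table.published if r["seq"] <= watermark]
    table.high_watermark = watermark
```

After `restate`, the next `run_increment` call reprocesses everything past the lowered watermark.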

What’s Next?

Iceberg Vectorized Read Path

● Materialize to Arrow Buffers
  ○ Common read path for Spark, Presto, etc.
  ○ Optimized for vectorized operations
  ○ Overlay engine specific columnar APIs

● Filter and Projection Pushdown
  ○ Map key and nested projection
  ○ Filter records during materialization
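As a rough illustration of projection and filtering during materialization, the sketch below builds column buffers only for the requested columns and drops non-matching records before any value is copied. Plain Python lists stand in for Arrow buffers, and `materialize` is a hypothetical helper, not part of Iceberg or Arrow.

```python
def materialize(records, columns, predicate=None):
    """Build one buffer per projected column, skipping records that fail `predicate`."""
    buffers = {name: [] for name in columns}
    for record in records:
        if predicate is not None and not predicate(record):
            continue  # filtered out before any column value is copied
        for name in columns:
            # Only projected columns are read; other fields are never touched.
            buffers[name].append(record.get(name))
    return buffers
```

For example, `materialize(rows, ["id"], lambda r: r["country"] == "US")` returns only the `id` buffer for matching rows, which is the work an engine would otherwise do after a full read.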

Druid Connector

● Pushdown for Aggregate Queries
  ○ Druid significantly faster for aggregates
  ○ Provides better BI tool integration
  ○ Reduces switching cost for analysts

● Convenient Interface for Users
  ○ Native Druid SQL support is limited
  ○ Expand data availability across platforms

View Support

● Common View Definitions
  ○ Ability to read from multiple engines
  ○ Separate representation from hive metastore

● Dataset Evolution
  ○ Use views to evolve schema
  ○ Decouple changes from downstream consumers

Questions?
