The Evolution of the Big Data Platform @ Netflix (OSCON 2015)
TRANSCRIPT
![Page 1: The evolution of the big data platform @ Netflix (OSCON 2015)](https://reader030.vdocuments.site/reader030/viewer/2022033004/55d1d0a3bb61eb074f8b4647/html5/thumbnails/1.jpg)
The Evolution of Big Data Platform @ Netflix
Eva Tse
July 22, 2015
![Page 2: The evolution of the big data platform @ Netflix (OSCON 2015)](https://reader030.vdocuments.site/reader030/viewer/2022033004/55d1d0a3bb61eb074f8b4647/html5/thumbnails/2.jpg)
![Page 3: The evolution of the big data platform @ Netflix (OSCON 2015)](https://reader030.vdocuments.site/reader030/viewer/2022033004/55d1d0a3bb61eb074f8b4647/html5/thumbnails/3.jpg)
![Page 4: The evolution of the big data platform @ Netflix (OSCON 2015)](https://reader030.vdocuments.site/reader030/viewer/2022033004/55d1d0a3bb61eb074f8b4647/html5/thumbnails/4.jpg)
![Page 5: The evolution of the big data platform @ Netflix (OSCON 2015)](https://reader030.vdocuments.site/reader030/viewer/2022033004/55d1d0a3bb61eb074f8b4647/html5/thumbnails/5.jpg)
![Page 6: The evolution of the big data platform @ Netflix (OSCON 2015)](https://reader030.vdocuments.site/reader030/viewer/2022033004/55d1d0a3bb61eb074f8b4647/html5/thumbnails/6.jpg)
Our biggest challenge is scale
![Page 7: The evolution of the big data platform @ Netflix (OSCON 2015)](https://reader030.vdocuments.site/reader030/viewer/2022033004/55d1d0a3bb61eb074f8b4647/html5/thumbnails/7.jpg)
Netflix Key Business Metrics
• 65+ million members
• 50 countries
• 1000+ devices supported
• 10 billion hours / quarter
![Page 8: The evolution of the big data platform @ Netflix (OSCON 2015)](https://reader030.vdocuments.site/reader030/viewer/2022033004/55d1d0a3bb61eb074f8b4647/html5/thumbnails/8.jpg)
Global Expansion
200 countries by end of 2016
![Page 9: The evolution of the big data platform @ Netflix (OSCON 2015)](https://reader030.vdocuments.site/reader030/viewer/2022033004/55d1d0a3bb61eb074f8b4647/html5/thumbnails/9.jpg)
Big Data Size
• Total ~20 PB DW on S3
• Read ~10% DW daily
• Write ~10% of read data daily
• ~500 billion events daily
• ~350 active users
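As a rough worked example, the ratios on this slide translate into daily volumes as sketched below. The decimal unit conversion (1 PB = 1000 TB) and the even spread of events across the day are assumptions, not figures from the slide, so the results are order-of-magnitude estimates only.

```python
# Back-of-the-envelope figures derived from the slide's approximate numbers.
TB_PER_PB = 1000                 # assumption: decimal units
SECONDS_PER_DAY = 24 * 60 * 60


def daily_volumes_pb(total_pb, read_fraction, write_fraction_of_read):
    """Return (read_pb, write_pb) per day for a warehouse of total_pb petabytes."""
    read_pb = total_pb * read_fraction
    write_pb = read_pb * write_fraction_of_read
    return read_pb, write_pb


def average_events_per_second(events_per_day):
    """Average rate, assuming events are spread evenly over the day."""
    return events_per_day / SECONDS_PER_DAY


read_pb, write_pb = daily_volumes_pb(
    total_pb=20, read_fraction=0.10, write_fraction_of_read=0.10
)
print(f"read   ~{read_pb:.1f} PB/day")
print(f"write  ~{write_pb * TB_PER_PB:.0f} TB/day")
print(f"events ~{average_events_per_second(500e9) / 1e6:.1f} million/sec on average")
```

Under these assumptions the warehouse reads about 2 PB and writes about 200 TB per day, and ingests several million events per second on average.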
![Page 10: The evolution of the big data platform @ Netflix (OSCON 2015)](https://reader030.vdocuments.site/reader030/viewer/2022033004/55d1d0a3bb61eb074f8b4647/html5/thumbnails/10.jpg)
Our traditional BI stack is our competition
![Page 11: The evolution of the big data platform @ Netflix (OSCON 2015)](https://reader030.vdocuments.site/reader030/viewer/2022033004/55d1d0a3bb61eb074f8b4647/html5/thumbnails/11.jpg)
How do we meet the functionality bar and yet make it scale?
How do we make big data bite-size again?
![Page 12: The evolution of the big data platform @ Netflix (OSCON 2015)](https://reader030.vdocuments.site/reader030/viewer/2022033004/55d1d0a3bb61eb074f8b4647/html5/thumbnails/12.jpg)
Our North Star
• Infrastructure
– No undifferentiated heavy lifting
• Architecture
– Scalable and sustainable
• Self-serve
– Ecosystem of tools
![Page 13: The evolution of the big data platform @ Netflix (OSCON 2015)](https://reader030.vdocuments.site/reader030/viewer/2022033004/55d1d0a3bb61eb074f8b4647/html5/thumbnails/13.jpg)
Data Pipelines
[Figure: data pipelines into AWS S3. Event data: cloud apps → Suro/Kafka → Ursula → AWS S3 (every 15 min). Dimension data: Cassandra SSTables → Aegisthus → AWS S3 (daily).]
![Page 14: The evolution of the big data platform @ Netflix (OSCON 2015)](https://reader030.vdocuments.site/reader030/viewer/2022033004/55d1d0a3bb61eb074f8b4647/html5/thumbnails/14.jpg)
[Figure: platform stack, in four layers. Storage: AWS S3, Parquet file format. Compute: (Hadoop clusters). Service: Metacat (federated metadata service), (federated execution service). Tools: Pig workflow visualization, data movement, data visualization, job/cluster performance visualization, data lineage, data quality.]
![Page 15: The evolution of the big data platform @ Netflix (OSCON 2015)](https://reader030.vdocuments.site/reader030/viewer/2022033004/55d1d0a3bb61eb074f8b4647/html5/thumbnails/15.jpg)
Evolving Big Data Processing Needs
• Analytics
• ETL
• Interactive data exploration
• Interactive slice & dice
• RT analytics & iterative/ML algo
![Page 16: The evolution of the big data platform @ Netflix (OSCON 2015)](https://reader030.vdocuments.site/reader030/viewer/2022033004/55d1d0a3bb61eb074f8b4647/html5/thumbnails/16.jpg)
Evolving Services/Tools Ecosystem
[Figure: services and tools ecosystem. Service: Metacat (federated metadata service), (federated execution service). Tools: Pig workflow visualization, data movement, data visualization, job/cluster performance visualization, data lineage, data quality. Also shown: Big Data Portal, API Portal, Big Data API.]
![Page 17: The evolution of the big data platform @ Netflix (OSCON 2015)](https://reader030.vdocuments.site/reader030/viewer/2022033004/55d1d0a3bb61eb074f8b4647/html5/thumbnails/17.jpg)
AWS S3 as our DW Storage
• S3 as single source of truth (not HDFS)
• 11 9’s durability and 4 9’s availability
• Separate compute and storage
• Key enabler for
– multiple clusters
– easy upgrade via red/black (r/b) deployment
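One way to picture why separating compute from storage makes red/black upgrades easy: the clusters hold no warehouse data, so a new cluster can be brought up beside the old one against the same S3 location and traffic switched over. The sketch below is illustrative only. `Cluster`, `ClusterRouter`, the bucket URI, and the version strings are hypothetical names, not Netflix's tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cluster:
    name: str
    version: str
    warehouse_uri: str  # shared storage; the cluster itself keeps no warehouse data


class ClusterRouter:
    """Routes jobs to the active cluster and supports a red/black switch."""

    def __init__(self, active: Cluster):
        self.active = active
        self.standby: Optional[Cluster] = None

    def stage(self, candidate: Cluster) -> None:
        # A candidate is only valid if it reads the same source of truth.
        if candidate.warehouse_uri != self.active.warehouse_uri:
            raise ValueError("candidate must point at the same warehouse")
        self.standby = candidate

    def switch(self) -> None:
        # Swap roles; the previous cluster stays around for a quick rollback.
        if self.standby is None:
            raise RuntimeError("no staged cluster to switch to")
        self.active, self.standby = self.standby, self.active


WAREHOUSE = "s3://example-warehouse/"  # hypothetical bucket
router = ClusterRouter(Cluster("red", "old-version", WAREHOUSE))
router.stage(Cluster("black", "new-version", WAREHOUSE))
router.switch()            # new cluster takes traffic; no data is copied
print(router.active.name)
router.switch()            # rollback is the same operation
print(router.active.name)
```

Because both clusters point at the same warehouse location, neither the switch nor the rollback involves moving data, which is the property the slide attributes to keeping S3 rather than HDFS as the source of truth.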
![Page 18: The evolution of the big data platform @ Netflix (OSCON 2015)](https://reader030.vdocuments.site/reader030/viewer/2022033004/55d1d0a3bb61eb074f8b4647/html5/thumbnails/18.jpg)
Evolution of Big Data Processing Systems
![Page 19: The evolution of the big data platform @ Netflix (OSCON 2015)](https://reader030.vdocuments.site/reader030/viewer/2022033004/55d1d0a3bb61eb074f8b4647/html5/thumbnails/19.jpg)
![Page 20: The evolution of the big data platform @ Netflix (OSCON 2015)](https://reader030.vdocuments.site/reader030/viewer/2022033004/55d1d0a3bb61eb074f8b4647/html5/thumbnails/20.jpg)
• Analytics
• HiveQL is close to ANSI SQL syntax
• Hive metastore serves as the single source of truth for big data metadata
![Page 21: The evolution of the big data platform @ Netflix (OSCON 2015)](https://reader030.vdocuments.site/reader030/viewer/2022033004/55d1d0a3bb61eb074f8b4647/html5/thumbnails/21.jpg)
• ETL
• Better language constructs for ETL
• Contributions since 0.11
• Customization
– Integration with Metacat to Hive Metastore
– Integration with S3
![Page 22: The evolution of the big data platform @ Netflix (OSCON 2015)](https://reader030.vdocuments.site/reader030/viewer/2022033004/55d1d0a3bb61eb074f8b4647/html5/thumbnails/22.jpg)
• Interactive data exploration and experimentation
• Why do we like Presto?
– Integration with Hive metastore
– Easy integration with S3
– Works at petabyte scale
– ANSI SQL for usability
– Fast
![Page 23: The evolution of the big data platform @ Netflix (OSCON 2015)](https://reader030.vdocuments.site/reader030/viewer/2022033004/55d1d0a3bb61eb074f8b4647/html5/thumbnails/23.jpg)
• Our contributions
– S3 file system
– Query optimizations
– Complex types support
– Parquet file format integration
– Working on predicate pushdown
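Predicate pushdown, the last item in this list, means handing a filter to the storage reader so that it can skip data that cannot match. The toy sketch below assumes each chunk of rows carries min/max statistics for a column, as columnar formats such as Parquet keep for row groups. `RowGroup` and `scan_greater_than` are hypothetical names, and this is not Presto's implementation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class RowGroup:
    values: List[int]                    # one column's values for this chunk of rows
    min_value: int = field(init=False)   # statistics computed once, at write time
    max_value: int = field(init=False)

    def __post_init__(self):
        self.min_value = min(self.values)
        self.max_value = max(self.values)


def scan_greater_than(groups: List[RowGroup], threshold: int) -> Tuple[List[int], int]:
    """Return (matching values, number of row groups actually read)."""
    matches: List[int] = []
    groups_read = 0
    for group in groups:
        if group.max_value <= threshold:
            continue  # statistics prove no row can match, so skip the chunk
        groups_read += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, groups_read


groups = [RowGroup([1, 5, 9]), RowGroup([12, 15, 18]), RowGroup([20, 25, 31])]
matches, groups_read = scan_greater_than(groups, threshold=19)
print(matches, groups_read)
```

In this example only the last of the three row groups is read; the first two are eliminated from their statistics alone.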
![Page 24: The evolution of the big data platform @ Netflix (OSCON 2015)](https://reader030.vdocuments.site/reader030/viewer/2022033004/55d1d0a3bb61eb074f8b4647/html5/thumbnails/24.jpg)
Parquet
• Columnar file format
• Supported across Hive, Pig, Presto, Spark
• Performance benefits across different processing engines
• Working on vectorized read, lazy load and lazy materialization
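To illustrate why a columnar layout helps analytic engines, the toy sketch below stores the same records row-wise and column-wise and counts how many values a single-column aggregate has to touch. It is a simplified in-memory model, not the Parquet format itself, and the record fields are made up for the example.

```python
def to_columns(rows):
    """Pivot a list of row dicts (all with the same keys) into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_row_wise(rows, name):
    """Sum one field from a row layout; returns (total, values touched)."""
    total, touched = 0, 0
    for row in rows:
        touched += len(row)   # a row layout reads whole records
        total += row[name]
    return total, touched


def sum_column_wise(columns, name):
    """Sum one field from a column layout; returns (total, values touched)."""
    column = columns[name]    # only the requested column is read
    return sum(column), len(column)


rows = [
    {"member_id": 1, "country": "US", "hours": 3},
    {"member_id": 2, "country": "BR", "hours": 5},
    {"member_id": 3, "country": "FR", "hours": 2},
]
columns = to_columns(rows)
print(sum_row_wise(rows, "hours"))        # (10, 9)
print(sum_column_wise(columns, "hours"))  # (10, 3)
```

Both layouts give the same total, but the columnar one touches a third of the values here, and the gap widens as tables gain more columns.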
![Page 25: The evolution of the big data platform @ Netflix (OSCON 2015)](https://reader030.vdocuments.site/reader030/viewer/2022033004/55d1d0a3bb61eb074f8b4647/html5/thumbnails/25.jpg)
• Interactive dashboard for slicing and dicing
• Column-based in-memory data store for time series data
• Serves a specific use case very well
![Page 26: The evolution of the big data platform @ Netflix (OSCON 2015)](https://reader030.vdocuments.site/reader030/viewer/2022033004/55d1d0a3bb61eb074f8b4647/html5/thumbnails/26.jpg)
• ETL, RT analytics, ML algorithms
• Why do we like Spark?
– Cohesive environment: batch and ‘stream’ processing
– Multiple language support: Scala, Python
– Performance benefits
– Runs on top of YARN for multi-tenancy
– Community momentum
![Page 27: The evolution of the big data platform @ Netflix (OSCON 2015)](https://reader030.vdocuments.site/reader030/viewer/2022033004/55d1d0a3bb61eb074f8b4647/html5/thumbnails/27.jpg)
Evolution of Services/Tools Ecosystem
[Figure: services and tools ecosystem. Service: Metacat (federated metadata service), (federated execution service). Tools: Pig workflow visualization, data movement, data visualization, job/cluster performance visualization, data lineage, data quality. Also shown: Big Data Portal, API Portal, Big Data API.]
![Page 28: The evolution of the big data platform @ Netflix (OSCON 2015)](https://reader030.vdocuments.site/reader030/viewer/2022033004/55d1d0a3bb61eb074f8b4647/html5/thumbnails/28.jpg)
• Federated execution engine
• Expose [your fave big data engine] as a service
• Flexible data model to support future job types
• Cluster configuration management
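The idea of exposing an engine as a service can be sketched as a small dispatcher: clients submit a job description, and the service picks a cluster from its own configuration, so clients never hard-code cluster addresses. Everything named here (`JobRequest`, `ExecutionService`, the tags, and the cluster names) is hypothetical and only illustrates the pattern, not the actual service's API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, FrozenSet, Set


@dataclass
class JobRequest:
    job_type: str                                  # e.g. "hive", "pig", "presto"
    payload: str                                   # query or script text
    cluster_tags: FrozenSet[str] = frozenset()     # constraints on where to run
    extra: Dict[str, Any] = field(default_factory=dict)  # room for future job types


class ExecutionService:
    """Keeps cluster configuration server-side and routes jobs by engine and tags."""

    def __init__(self):
        self._clusters: Dict[str, Dict[str, Set[str]]] = {}

    def register_cluster(self, name: str, engines: Set[str], tags: Set[str]) -> None:
        self._clusters[name] = {"engines": set(engines), "tags": set(tags)}

    def submit(self, job: JobRequest) -> str:
        """Return the name of the first registered cluster that can run the job."""
        for name, config in self._clusters.items():
            supports_engine = job.job_type in config["engines"]
            has_tags = set(job.cluster_tags) <= config["tags"]
            if supports_engine and has_tags:
                return name
        raise LookupError(f"no cluster can run a {job.job_type!r} job")


service = ExecutionService()
service.register_cluster("prod-batch", engines={"hive", "pig"}, tags={"prod"})
service.register_cluster("adhoc", engines={"hive", "presto"}, tags={"adhoc"})

print(service.submit(JobRequest("presto", "SELECT 1", frozenset({"adhoc"}))))
print(service.submit(JobRequest("hive", "SELECT 1", frozenset({"prod"}))))
```

Keeping the cluster registry inside the service is what lets operators add, retire, or swap clusters without changing any client; the open-ended `extra` field stands in for a data model that can absorb new job types.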
![Page 29: The evolution of the big data platform @ Netflix (OSCON 2015)](https://reader030.vdocuments.site/reader030/viewer/2022033004/55d1d0a3bb61eb074f8b4647/html5/thumbnails/29.jpg)
Metacat
• Federated metadata catalog for the whole data platform
– Proxy service to different metadata sources
• Data metrics, data usage, ownership, categorization and retention policy …
• Common interface for tools to interact with metadata
• To be open sourced in 2015 on Netflix OSS
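A federated catalog of this kind can be pictured as a proxy that gives tools one interface and forwards each lookup to whichever backing metadata store owns the table, layering attributes such as ownership and retention on top. The sketch below is an illustration under stated assumptions: `MetadataSource`, `InMemorySource`, `FederatedCatalog`, and the `source/database/table` naming scheme are hypothetical, not Metacat's API.

```python
from abc import ABC, abstractmethod
from typing import Dict, Tuple


class MetadataSource(ABC):
    """Common interface every backing metadata store must implement."""

    @abstractmethod
    def get_table(self, database: str, table: str) -> dict:
        ...


class InMemorySource(MetadataSource):
    """Stand-in for a real store such as a Hive metastore."""

    def __init__(self, tables: Dict[Tuple[str, str], dict]):
        self._tables = tables

    def get_table(self, database: str, table: str) -> dict:
        return self._tables[(database, table)]


class FederatedCatalog:
    """Proxies lookups to registered sources and adds platform-level metadata."""

    def __init__(self):
        self._sources: Dict[str, MetadataSource] = {}
        self._annotations: Dict[str, dict] = {}  # ownership, retention, ...

    def register(self, name: str, source: MetadataSource) -> None:
        self._sources[name] = source

    def annotate(self, qualified_name: str, **attributes) -> None:
        self._annotations.setdefault(qualified_name, {}).update(attributes)

    def get_table(self, qualified_name: str) -> dict:
        source_name, database, table = qualified_name.split("/")
        # Copy so the source's own record is never mutated by the merge.
        metadata = dict(self._sources[source_name].get_table(database, table))
        metadata.update(self._annotations.get(qualified_name, {}))
        return metadata


catalog = FederatedCatalog()
catalog.register(
    "hive", InMemorySource({("dw", "plays"): {"columns": ["member_id", "hours"]}})
)
catalog.annotate("hive/dw/plays", owner="data-eng", retention_days=90)
print(catalog.get_table("hive/dw/plays"))
```

Tools written against `FederatedCatalog.get_table` need no knowledge of which store holds a table, which is the "common interface" benefit the slide describes.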
![Page 30: The evolution of the big data platform @ Netflix (OSCON 2015)](https://reader030.vdocuments.site/reader030/viewer/2022033004/55d1d0a3bb61eb074f8b4647/html5/thumbnails/30.jpg)
[Figure: services and tools ecosystem. Service: Metacat (federated metadata service), (federated execution service). Tools: Pig workflow visualization, data movement, data visualization, job/cluster performance visualization, data lineage, data quality. Also shown: Big Data Portal, API Portal, Big Data API.]
![Page 31: The evolution of the big data platform @ Netflix (OSCON 2015)](https://reader030.vdocuments.site/reader030/viewer/2022033004/55d1d0a3bb61eb074f8b4647/html5/thumbnails/31.jpg)
![Page 32: The evolution of the big data platform @ Netflix (OSCON 2015)](https://reader030.vdocuments.site/reader030/viewer/2022033004/55d1d0a3bb61eb074f8b4647/html5/thumbnails/32.jpg)
![Page 33: The evolution of the big data platform @ Netflix (OSCON 2015)](https://reader030.vdocuments.site/reader030/viewer/2022033004/55d1d0a3bb61eb074f8b4647/html5/thumbnails/33.jpg)
![Page 34: The evolution of the big data platform @ Netflix (OSCON 2015)](https://reader030.vdocuments.site/reader030/viewer/2022033004/55d1d0a3bb61eb074f8b4647/html5/thumbnails/34.jpg)
![Page 35: The evolution of the big data platform @ Netflix (OSCON 2015)](https://reader030.vdocuments.site/reader030/viewer/2022033004/55d1d0a3bb61eb074f8b4647/html5/thumbnails/35.jpg)
![Page 36: The evolution of the big data platform @ Netflix (OSCON 2015)](https://reader030.vdocuments.site/reader030/viewer/2022033004/55d1d0a3bb61eb074f8b4647/html5/thumbnails/36.jpg)
Big Data API
• Integration layer for our ecosystem of tools and services
• Python library (called Kragle)
• Building block for our ETL workflow
• Building block for Big Data Portal
![Page 37: The evolution of the big data platform @ Netflix (OSCON 2015)](https://reader030.vdocuments.site/reader030/viewer/2022033004/55d1d0a3bb61eb074f8b4647/html5/thumbnails/37.jpg)
![Page 38: The evolution of the big data platform @ Netflix (OSCON 2015)](https://reader030.vdocuments.site/reader030/viewer/2022033004/55d1d0a3bb61eb074f8b4647/html5/thumbnails/38.jpg)
![Page 39: The evolution of the big data platform @ Netflix (OSCON 2015)](https://reader030.vdocuments.site/reader030/viewer/2022033004/55d1d0a3bb61eb074f8b4647/html5/thumbnails/39.jpg)
Big Data Portal
• One-stop shop for all big data related tools and services
• Built on top of Big Data API
![Page 40: The evolution of the big data platform @ Netflix (OSCON 2015)](https://reader030.vdocuments.site/reader030/viewer/2022033004/55d1d0a3bb61eb074f8b4647/html5/thumbnails/40.jpg)
![Page 41: The evolution of the big data platform @ Netflix (OSCON 2015)](https://reader030.vdocuments.site/reader030/viewer/2022033004/55d1d0a3bb61eb074f8b4647/html5/thumbnails/41.jpg)
![Page 42: The evolution of the big data platform @ Netflix (OSCON 2015)](https://reader030.vdocuments.site/reader030/viewer/2022033004/55d1d0a3bb61eb074f8b4647/html5/thumbnails/42.jpg)
![Page 43: The evolution of the big data platform @ Netflix (OSCON 2015)](https://reader030.vdocuments.site/reader030/viewer/2022033004/55d1d0a3bb61eb074f8b4647/html5/thumbnails/43.jpg)
![Page 44: The evolution of the big data platform @ Netflix (OSCON 2015)](https://reader030.vdocuments.site/reader030/viewer/2022033004/55d1d0a3bb61eb074f8b4647/html5/thumbnails/44.jpg)
Open source is an integral part of our strategy to achieve scale
![Page 45: The evolution of the big data platform @ Netflix (OSCON 2015)](https://reader030.vdocuments.site/reader030/viewer/2022033004/55d1d0a3bb61eb074f8b4647/html5/thumbnails/45.jpg)
Big Data Processing Systems
Services/Tools Ecosystem
![Page 46: The evolution of the big data platform @ Netflix (OSCON 2015)](https://reader030.vdocuments.site/reader030/viewer/2022033004/55d1d0a3bb61eb074f8b4647/html5/thumbnails/46.jpg)
Why use Open Source?
• Collaborate with other internet-scale tech companies
• Uncharted area/scale; lock-in is not desirable
• Need the flexibility to achieve scalability
BUT…
• Lots of choices
• White box approach
![Page 47: The evolution of the big data platform @ Netflix (OSCON 2015)](https://reader030.vdocuments.site/reader030/viewer/2022033004/55d1d0a3bb61eb074f8b4647/html5/thumbnails/47.jpg)
Why contribute back?
• Not IP or trade secret
• Help shape direction of projects
• Don’t want to fork and diverge
• Attract top talent
![Page 48: The evolution of the big data platform @ Netflix (OSCON 2015)](https://reader030.vdocuments.site/reader030/viewer/2022033004/55d1d0a3bb61eb074f8b4647/html5/thumbnails/48.jpg)
Why contribute our own tool?
• Share our goodness
• Set industry standard
• Community can help evolve the tool
![Page 49: The evolution of the big data platform @ Netflix (OSCON 2015)](https://reader030.vdocuments.site/reader030/viewer/2022033004/55d1d0a3bb61eb074f8b4647/html5/thumbnails/49.jpg)
![Page 50: The evolution of the big data platform @ Netflix (OSCON 2015)](https://reader030.vdocuments.site/reader030/viewer/2022033004/55d1d0a3bb61eb074f8b4647/html5/thumbnails/50.jpg)
Is open source right for you?
![Page 51: The evolution of the big data platform @ Netflix (OSCON 2015)](https://reader030.vdocuments.site/reader030/viewer/2022033004/55d1d0a3bb61eb074f8b4647/html5/thumbnails/51.jpg)
![Page 52: The evolution of the big data platform @ Netflix (OSCON 2015)](https://reader030.vdocuments.site/reader030/viewer/2022033004/55d1d0a3bb61eb074f8b4647/html5/thumbnails/52.jpg)
Measuring big data - understanding data by usage
By Charles Smith, Netflix
Tomorrow @ 1:40-2:20pm