Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Source Libraries
Transcript
- Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Source Software. Sudhir Tonse (@stonse), Danny Yuan (@g9yuayon)
- Data is the most important asset at Netflix
- If all the data is easily available to all teams, it can be leveraged in new and exciting ways
- ~1000 device types; ~500 apps/web services; ~100 billion events/day; 3.2M messages per second at peak time; 3 GB per second at peak time
- Types of Events. User Interface Events: Search Event (Matrix using PS3 ), Star Rating Event (HoC : 5 stars, Xbox, US, ). Infrastructural Events: RPC Call (API -> Billing Service, /bill/.., 200, ), Log Errors (NPE, Movie is null, , ). Other Events
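The figures on this slide can be related to each other with some back-of-envelope arithmetic. The sketch below is only an illustration: it assumes that "events" and "messages" count the same thing and that "GB" means 10^9 bytes, neither of which the slide states. The class and method names are made up for this example.

```java
// Back-of-envelope arithmetic on the figures quoted on the slide.
// Assumptions (not from the slide): one message == one event, 1 GB == 10^9 bytes.
final class PipelineScale {
    static final long EVENTS_PER_DAY = 100_000_000_000L;    // ~100 billion events/day
    static final long PEAK_MESSAGES_PER_SEC = 3_200_000L;   // 3.2M messages/s at peak
    static final long PEAK_BYTES_PER_SEC = 3_000_000_000L;  // 3 GB/s at peak
    static final long SECONDS_PER_DAY = 86_400L;

    // Average event rate over a whole day (integer division, rounds down).
    static long averageEventsPerSecond() {
        return EVENTS_PER_DAY / SECONDS_PER_DAY;
    }

    // Average message size at peak, in bytes (rounds down).
    static long averageBytesPerPeakMessage() {
        return PEAK_BYTES_PER_SEC / PEAK_MESSAGES_PER_SEC;
    }

    // How much higher the peak rate is than the daily average.
    static double peakToAverageRatio() {
        return (double) PEAK_MESSAGES_PER_SEC / averageEventsPerSecond();
    }
}
```

Under these assumptions the daily average is a little over one million events per second, the peak is somewhat under three times that, and a peak-time message averages a bit under 1 KB.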
- Making Sense of Billions of Events
- http://netflix.github.io +
- A Humble Beginning
- Evolution Scale!
- [Diagram: many Application instances]
- We Want to Process App Data in Hadoop
- Our Hadoop Ecosystem
- @NetflixOSS Big Data Tools
- Hadoop as a Service
- Pig Scripting on Steroids
- Pig Married to Clojure Map-Reduce for Clojure
- S3MPER: S3mper is a library that provides an additional layer of consistency checking on top of Amazon's S3 index through use of a consistent, secondary index.
- Efficient ETL with Cassandra
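The idea of checking an eventually consistent listing against a consistent secondary index can be sketched as a simple set difference. This is not S3mper's actual API: `ListingCheck` and `missingFromListing` are hypothetical names, and plain string collections stand in for the S3 listing and for the secondary index.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: detect files that the secondary index knows about
// but that the (possibly stale) store listing does not yet show.
final class ListingCheck {
    /** Returns paths recorded in the secondary index but absent from the store listing. */
    static Set<String> missingFromListing(Collection<String> storeListing,
                                          Collection<String> indexedPaths) {
        Set<String> missing = new TreeSet<>(indexedPaths);   // sorted for stable output
        missing.removeAll(new HashSet<>(storeListing));
        return missing;
    }
}
```

A caller could treat a non-empty result as a sign that the listing is not yet consistent, and wait and retry or fail the job rather than silently process partial input.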
- Offline Analysis
- Evolution Speed!
- We Want to Aggregate, Index, and Query Data in Real Time
- Interactive Exploration
- Let's walk through some use cases
- client activity event * /name = movieStarts
- Pipeline Challenges. App owners: send and forget. Data scientists: validation, ETL, batch processing. DevOps: stream processing, targeted search
- Message Routing
- We Want to Consume Data Selectively in Different Ways
- Message broker: high-throughput, persistent and replicated
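Selective consumption can be pictured as a set of named routes, each with a filter that decides whether an event goes to that destination. The following is a minimal sketch under that assumption, not the routing layer described in the talk. `MessageRouter`, the sink names, and the representation of an event as a string map are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch of filter-based routing: every sink registers a predicate,
// and an event is delivered to each sink whose predicate accepts it.
final class MessageRouter {
    // LinkedHashMap keeps routes in registration order.
    private final Map<String, Predicate<Map<String, String>>> routes = new LinkedHashMap<>();

    MessageRouter addRoute(String sink, Predicate<Map<String, String>> filter) {
        routes.put(sink, filter);
        return this;
    }

    /** Returns the names of all sinks that should receive this event. */
    List<String> sinksFor(Map<String, String> event) {
        List<String> matched = new ArrayList<>();
        for (Map.Entry<String, Predicate<Map<String, String>>> route : routes.entrySet()) {
            if (route.getValue().test(event)) {
                matched.add(route.getKey());
            }
        }
        return matched;
    }
}
```

For example, a batch sink might accept every event while a search sink accepts only events whose `name` field is `movieStarts`, mirroring the filter shown on the earlier use-case slide.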
- There Is More
- Intelligent Alerts
- Guided Debugging in the Right Context
- What We Need: ad-hoc queries with different dimensions; quick aggregations and Top-N queries; time series with flexible filters; quick access to raw data using boolean queries
- Druid: rapid exploration of high-dimensional data; fast ingestion and querying; time series
- Real-time indexing of event streams; killer feature: boolean search; great UI: Kibana
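To make "Top-N query" concrete, the sketch below counts how often each value of one dimension occurs and returns the N most frequent values. It is an in-memory illustration only and says nothing about how a real analytics store executes such queries. The class name `TopN` is an assumption made for this example.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// In-memory illustration of a Top-N aggregation over a single dimension.
final class TopN {
    /** Returns the n most frequent values with their counts, most frequent first. */
    static List<Map.Entry<String, Long>> topN(List<String> dimensionValues, int n) {
        // Group identical values and count them.
        Map<String, Long> counts = dimensionValues.stream()
                .collect(Collectors.groupingBy(v -> v, Collectors.counting()));
        // Sort by count descending, breaking ties alphabetically, and keep the first n.
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                .limit(n)
                .collect(Collectors.toList());
    }
}
```

Given device-type values such as Xbox, PS3, Xbox, Web, PS3, Xbox (sample data, not from the talk), `topN(values, 2)` yields Xbox first and PS3 second.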
- The Old Pipeline
- The New Pipeline
- There Is More
- It's Not All About Counters and Time Series
- RequestId  | Parent Id | Node Id | Service Name | Status
  4965-4a74  | 0         | 123     | Edge Service | 200
  4965-4a74  | 123       | 456     | Gateway      | 200
  4965-4a74  | 456       | 789     | Service A    | 200
  4965-4a74e | 456       | abc     | Service B    | 200
  Status: 200
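Records of this shape carry enough information to rebuild the call tree of a request, because each row names its own node and its parent node. The sketch below shows one way to do that, assuming the parent links form a tree with no cycles. `TraceTree` and `Span` are hypothetical names, and the request ID column is ignored for simplicity.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: rebuild a call tree from (parent id, node id, service) records.
final class TraceTree {
    static final class Span {
        final String parentId;
        final String nodeId;
        final String service;

        Span(String parentId, String nodeId, String service) {
            this.parentId = parentId;
            this.nodeId = nodeId;
            this.service = service;
        }
    }

    /** Renders the spans as indented lines, starting from spans whose parent is rootParentId. */
    static List<String> render(List<Span> spans, String rootParentId) {
        // Index spans by parent so that children can be looked up directly.
        Map<String, List<Span>> childrenByParent = new LinkedHashMap<>();
        for (Span span : spans) {
            childrenByParent.computeIfAbsent(span.parentId, k -> new ArrayList<>()).add(span);
        }
        List<String> lines = new ArrayList<>();
        walk(childrenByParent, rootParentId, 0, lines);
        return lines;
    }

    // Depth-first walk; assumes the parent links contain no cycles.
    private static void walk(Map<String, List<Span>> childrenByParent, String parentId,
                             int depth, List<String> out) {
        for (Span child : childrenByParent.getOrDefault(parentId, Collections.emptyList())) {
            out.add("  ".repeat(depth) + child.service);
            walk(childrenByParent, child.nodeId, depth + 1, out);
        }
    }
}
```

Fed the four rows from the slide with root parent `0`, this produces Edge Service at the top, Gateway beneath it, and Service A and Service B as siblings under Gateway.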
- Distributed Tracing
- A System that Supports All These
- A Data Pipeline To Glue Them All
- Make It Simple
- Message Producing: simple and uniform API: messageBus.publish(event)
- Consumption Is Simple Too:
      consumer.observe().subscribe(new Subscriber() {
          @Override
          public void onNext(Ackable ackable) {
              process(ackable.getEntity(MyEventType.class));
              ackable.ack();
          }
      });
      consumer.pause();
      consumer.resume();
- RxJava: functional reactive programming model; powerful streaming API; separation of logic and threading model
- Design Decisions. Top priority: app stability and throughput. Asynchronous operations. Aggressive buffering. Drops messages if necessary
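The combination of asynchronous operation, buffering, and dropping under pressure can be illustrated with a bounded in-memory queue whose publish call never blocks. This is a sketch of the general technique, not the pipeline's own client code. `DroppingBuffer` and its methods are hypothetical names.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: a bounded buffer that protects the application by
// dropping events instead of blocking the caller when the buffer is full.
final class DroppingBuffer<T> {
    private final BlockingQueue<T> queue;
    private final AtomicLong dropped = new AtomicLong();

    DroppingBuffer(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Never blocks: returns false and counts a drop when the buffer is full. */
    boolean publish(T event) {
        boolean accepted = queue.offer(event);   // offer() returns immediately
        if (!accepted) {
            dropped.incrementAndGet();
        }
        return accepted;
    }

    /** Intended for a background sender thread; returns null when nothing is buffered. */
    T poll() {
        return queue.poll();
    }

    long droppedCount() {
        return dropped.get();
    }
}
```

The application thread only ever calls `publish`, so a slow or unavailable downstream costs it at most a failed `offer`. The drop counter gives operators a signal that the buffer is undersized or the sender is falling behind.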
- Anything Can Fail
- Cloud Resiliency
- Fault Tolerance Features: write and forward with auto-reattached EBS (Amazon's Elastic Block Store); disk-backed queue: big-queue; customized scaling down
- There's More to Do. Contribute to @NetflixOSS. Join us :-)
- Summary http://netflix.github.io +
- You can build your own web-scale data pipeline using open source components
- Thank You! Sudhir Tonse: http://www.linkedin.com/in/sudhirtonse (Twitter: @stonse); Danny Yuan: http://www.linkedin.com/pub/danny-yuan/4/374/862 (Twitter: @g9yuayon)