Transcript
1. Sudhir Tonse (@stonse), Danny Yuan (@g9yuayon): Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Source Software

2. Data is the most important asset at Netflix
3. If all the data is easily available to all teams, it can be leveraged in new and exciting ways
4. ~1000 device types; ~500 apps/web services; ~100 billion events/day; 3.2M messages per second at peak; 3 GB per second at peak (dashboard)
5. Types of events: User interface events: Search event (Matrix using PS3), Star rating event (HoC: 5 stars, Xbox, US, ...); Infrastructural events: RPC call (API -> Billing Service, /bill/.., 200, ...), Log errors (NPE, Movie is null, ...); Other events
6. Making Sense of Billions of Events
7. http://netflix.github.io +
8. A Humble Beginning
9. Evolution: Scale!
10. (Diagram: many Application instances)
11. We Want to Process App Data in Hadoop
12. Our Hadoop Ecosystem
13. @NetflixOSS Big Data Tools
14. Hadoop as a Service
15. Pig Scripting on Steroids
16. Pig Married to Clojure: Map-Reduce for Clojure
17. S3mper: a library that provides an additional layer of consistency checking on top of Amazon's S3 index through use of a consistent, secondary index
18. Efficient ETL with Cassandra
19. Offline Analysis
20. Evolution: Speed!
21. We Want to Aggregate, Index, and Query Data in Real Time
22. Interactive Exploration
23. Let's walk through some use cases
24. client activity event * /name = movieStarts
25. Pipeline challenges: App owners: send and forget; Data scientists: validation, ETL, batch processing; DevOps: stream processing, targeted search
26. Message Routing
27. We Want to Consume Data Selectively in Different Ways
28. Message broker: high-throughput, persistent and replicated
29. There Is More
30. Intelligent Alerts
31. Intelligent Alerts
32. Guided Debugging in the Right Context
33. Guided Debugging in the Right Context
34. Guided Debugging in the Right Context
35. What we need: ad-hoc queries with different dimensions; quick aggregations and Top-N queries; time series with flexible filters; quick access to raw data using boolean queries
36. Druid: rapid exploration of high-dimensional data; fast ingestion and querying; time series
37. Real-time indexing of event streams; killer feature: boolean search; great UI: Kibana
38. The Old Pipeline
39. The New Pipeline
40. There Is More
41. It's Not All About Counters and Time Series
42. RequestId    Parent Id  Node Id  Service Name  Status
    4965-4a74    0          123      Edge Service  200
    4965-4a74    123        456      Gateway       200
    4965-4a74    456        789      Service A     200
    4965-4a74e   456        abc      Service B     200
    Status: 200
43. Distributed Tracing
44. Distributed Tracing
45. Distributed Tracing
46. A System that Supports All These
47. A Data Pipeline to Glue Them All
48. Make It Simple
49. Message producing: a simple and uniform API: messageBus.publish(event)
50. Consumption is simple too:
       consumer.observe().subscribe(new Subscriber<Ackable>() {
           @Override
           public void onNext(Ackable ackable) {
               process(ackable.getEntity(MyEventType.class));
               ackable.ack();
           }
       });
       consumer.pause();
       consumer.resume();
51. RxJava: functional reactive programming model; powerful streaming API; separation of logic and threading model
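To make the API shape on slides 49-51 concrete, here is a minimal, self-contained Java sketch. The messageBus, consumer, and Ackable names come from the slides, but their exact signatures are not published in the talk, so the interfaces below are assumptions; only RxJava's Observable and Subscriber are real library types.

    import rx.Observable;
    import rx.Subscriber;

    public class PipelineSketch {

        // Producer side (slide 49): one uniform publish call for every app.
        interface MessageBus {
            void publish(Object event);
        }

        // Each delivered message is wrapped so it can be explicitly acknowledged.
        interface Ackable {
            <T> T getEntity(Class<T> type);
            void ack();
        }

        // Consumer side (slide 50): an RxJava stream plus simple flow control.
        interface Consumer {
            Observable<Ackable> observe();
            void pause();
            void resume();
        }

        // Placeholder event type and processing step for the example.
        static class MyEventType { }

        static void process(MyEventType event) {
            System.out.println("processing " + event);
        }

        static void consume(Consumer consumer) {
            consumer.observe().subscribe(new Subscriber<Ackable>() {
                @Override public void onCompleted() { }
                @Override public void onError(Throwable t) { t.printStackTrace(); }
                @Override public void onNext(Ackable ackable) {
                    process(ackable.getEntity(MyEventType.class));
                    ackable.ack(); // acknowledge only after successful processing
                }
            });
        }
    }

Keeping the stream (Observable) separate from the processing logic (Subscriber) is what lets application code stay independent of the threading model, which is the point slide 51 makes about RxJava.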
52. Design decisions: top priority is app stability and throughput; asynchronous operations; aggressive buffering; drops messages if necessary
53. Anything Can Fail
54. Cloud Resiliency
55. Fault tolerance features: write and forward with an auto-reattached EBS (Amazon Elastic Block Store) disk-backed queue: big-queue (see the sketch after the transcript); customized scaling down
56. There's More to Do: contribute to @NetflixOSS; join us :-)
57. Summary: http://netflix.github.io +
58. You can build your own web-scale data pipeline using open source components
59. Thank You! Sudhir Tonse http://www.linkedin.com/in/sudhirtonse Twitter: @stonse; Danny Yuan http://www.linkedin.com/pub/danny-yuan/4/374/862 Twitter: @g9yuayon
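Slide 55's write-and-forward idea can be sketched with the open-source big-queue library it names: events are appended to a memory-mapped, disk-backed queue on an EBS volume (which can be re-attached to a replacement instance), then drained downstream when the broker is reachable. This is only a sketch under those assumptions: the Downstream interface, retry behavior, and paths are hypothetical, and the BigQueueImpl/enqueue/dequeue/peek calls reflect big-queue's published API as I understand it, not code from the talk.

    import com.leansoft.bigqueue.BigQueueImpl;
    import com.leansoft.bigqueue.IBigQueue;

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;

    public class WriteAndForward {

        // Directory on the EBS-backed volume; the path is illustrative.
        private static final String QUEUE_DIR = "/mnt/ebs/event-buffer";

        private final IBigQueue queue;

        public WriteAndForward() throws IOException {
            // big-queue persists entries in memory-mapped files under QUEUE_DIR,
            // so buffered events survive a process restart or instance replacement.
            this.queue = new BigQueueImpl(QUEUE_DIR, "events");
        }

        // Local write path: append the event and return immediately ("send and forget").
        public void write(String event) throws IOException {
            queue.enqueue(event.getBytes(StandardCharsets.UTF_8));
        }

        // Forwarding loop: drain the queue while the downstream broker is reachable.
        public void forward(Downstream downstream) throws IOException {
            while (!queue.isEmpty() && downstream.isReachable()) {
                byte[] data = queue.peek();     // read without removing
                if (data == null) {
                    break;
                }
                downstream.send(new String(data, StandardCharsets.UTF_8));
                queue.dequeue();                // remove only after a successful send
            }
        }

        // Hypothetical downstream broker interface for the sketch.
        interface Downstream {
            boolean isReachable();
            void send(String event);
        }
    }

Peeking before dequeuing keeps a buffered event on disk until the send succeeds, which is the property that makes write-and-forward tolerate downstream outages.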

