Transcript
  • Sudhir Tonse (@stonse), Danny Yuan (@g9yuayon): Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Source Software
  • Data is the most important asset at Netflix
  • If all the data is easily available to all teams, it can be leveraged in new and exciting ways
  • ~1000 device types; ~500 apps/web services; ~100 billion events/day; 3.2M messages per second at peak time; 3 GB per second at peak time [Figure: Dashboard]
  • Types of Events. User interface events: search event (Matrix using PS3), star rating event (HoC: 5 stars, Xbox, US). Infrastructural events: RPC call (API -> Billing Service, /bill/.., 200), log errors (NPE, Movie is null). Other events.
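The figures on this slide can be cross-checked with a little arithmetic. The sketch below assumes that "events" and "messages" count the same unit and that 1 GB means 10^9 bytes; neither is stated in the deck. Under those assumptions the daily total works out to roughly 1.16 million events per second on average, the peak is about 2.8 times that average, and a message at peak averages a little under 1 KB.

```java
class PipelineScale {
    // Approximate figures quoted on the slide.
    static final double EVENTS_PER_DAY = 100e9;
    static final double PEAK_MESSAGES_PER_SECOND = 3.2e6;
    // Assumption: 1 GB = 10^9 bytes (the slide does not say).
    static final double PEAK_BYTES_PER_SECOND = 3e9;

    /** Daily volume spread evenly over the 86,400 seconds in a day. */
    static double averageEventsPerSecond() {
        return EVENTS_PER_DAY / 86_400;
    }

    /** How much busier the peak second is than the average second. */
    static double peakToAverageRatio() {
        return PEAK_MESSAGES_PER_SECOND / averageEventsPerSecond();
    }

    /** Average payload size implied by the two peak figures. */
    static double averageBytesPerMessageAtPeak() {
        return PEAK_BYTES_PER_SECOND / PEAK_MESSAGES_PER_SECOND;
    }

    public static void main(String[] args) {
        System.out.printf("average events/s: %.0f%n", averageEventsPerSecond());
        System.out.printf("peak/average:     %.2f%n", peakToAverageRatio());
        System.out.printf("bytes/message:    %.1f%n", averageBytesPerMessageAtPeak());
    }
}
```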
  • Making Sense of Billions of Events
  • http://netflix.github.io
  • A Humble Beginning
  • Evolution: Scale!
  • [Diagram: many Application instances]
  • We Want to Process App Data in Hadoop
  • Our Hadoop Ecosystem
  • @NetflixOSS Big Data Tools
  • Hadoop as a Service
  • Pig Scripting on Steroids
  • Pig Married to Clojure: Map-Reduce for Clojure
  • S3mper: a library that provides an additional layer of consistency checking on top of Amazon's S3 index through use of a consistent, secondary index.
  • Efficient ETL with Cassandra
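The idea of checking an eventually consistent listing against a consistent secondary index can be sketched in a few lines. This is not S3mper's actual API: the class and method names are hypothetical, and the two collections stand in for an S3 listing and for whatever consistent store holds the index.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Set;
import java.util.TreeSet;

class ListingConsistencyCheck {
    /**
     * Returns the paths that the consistent secondary index knows about
     * but that the (possibly stale) store listing did not return.
     * An empty result means the listing agrees with the index.
     */
    static Set<String> missingFromListing(Collection<String> listed,
                                          Collection<String> indexed) {
        Set<String> missing = new TreeSet<>(indexed);
        missing.removeAll(listed);
        return missing;
    }

    public static void main(String[] args) {
        // Hypothetical paths: the index recorded three writes,
        // but the listing only shows two of them so far.
        Set<String> missing = missingFromListing(
                Arrays.asList("part-0000", "part-0001"),
                Arrays.asList("part-0000", "part-0001", "part-0002"));
        System.out.println("missing from listing: " + missing);
    }
}
```

A job that finds a non-empty result could wait and re-list, or fail fast, rather than silently processing partial input.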
  • Offline Analysis
  • Evolution: Speed!
  • We Want to Aggregate, Index, and Query Data in Real Time
  • Interactive Exploration
  • Let's walk through some use cases
  • client activity event * /name = movieStarts
  • Pipeline Challenges. App owners: send and forget. Data scientists: validation, ETL, batch processing. DevOps: stream processing, targeted search.
  • Message Routing
  • We Want to Consume Data Selectively in Different Ways
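Selective consumption can be pictured as a set of named routes, each with a filter that decides whether an event goes to that sink. The sketch below is an illustration only, not the pipeline's actual routing layer: `EventRouter`, the sink names "hadoop" and "search", and the map-based event shape are all hypothetical. The `name = movieStarts` filter echoes the query shown on the earlier use-case slide.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

class EventRouter {
    // Routes are kept in insertion order so results are deterministic.
    private final Map<String, Predicate<Map<String, String>>> routes = new LinkedHashMap<>();

    void addRoute(String sink, Predicate<Map<String, String>> filter) {
        routes.put(sink, filter);
    }

    /** Returns the names of every sink whose filter accepts the event. */
    List<String> route(Map<String, String> event) {
        List<String> sinks = new ArrayList<>();
        for (Map.Entry<String, Predicate<Map<String, String>>> route : routes.entrySet()) {
            if (route.getValue().test(event)) {
                sinks.add(route.getKey());
            }
        }
        return sinks;
    }

    /** Hypothetical setup: everything goes to batch, only movieStarts goes to search. */
    static EventRouter demoRouter() {
        EventRouter router = new EventRouter();
        router.addRoute("hadoop", event -> true);
        router.addRoute("search", event -> "movieStarts".equals(event.get("name")));
        return router;
    }

    /** Convenience builder for a single-field event. */
    static Map<String, String> eventNamed(String name) {
        Map<String, String> event = new HashMap<>();
        event.put("name", name);
        return event;
    }

    public static void main(String[] args) {
        EventRouter router = demoRouter();
        System.out.println(router.route(eventNamed("movieStarts")));
        System.out.println(router.route(eventNamed("heartbeat")));
    }
}
```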
  • Message broker: high-throughput, persistent, and replicated
  • There Is More
  • Intelligent Alerts
  • Guided Debugging in the Right Context
  • What We Need: ad-hoc queries with different dimensions; quick aggregations and Top-N queries; time series with flexible filters; quick access to raw data using boolean queries
  • Druid: rapid exploration of high-dimensional data; fast ingestion and querying; time series
  • Real-time indexing of event streams. Killer feature: boolean search. Great UI: Kibana.
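To make "Top-N queries" concrete, here is a minimal in-memory sketch of what such a query computes: count how often each value of one dimension occurs, then keep the N most frequent. It says nothing about how a real analytics store executes the query at scale; the class name and the sample device values are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

class TopNQuery {
    /**
     * Counts occurrences of each value and returns the n most frequent,
     * highest count first; ties are broken alphabetically by value.
     */
    static List<Map.Entry<String, Long>> topN(List<String> values, int n) {
        Map<String, Long> counts = values.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                .limit(n)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical "device" dimension taken from six events.
        List<String> devices = Arrays.asList("PS3", "Xbox", "PS3", "iPad", "Xbox", "PS3");
        for (Map.Entry<String, Long> entry : topN(devices, 2)) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
    }
}
```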
  • The Old Pipeline
  • The New Pipeline
  • There Is More
  • It's Not All About Counters and Time Series
  • RequestId   Parent Id  Node Id  Service Name  Status
    4965-4a74   0          123      Edge Service  200
    4965-4a74   123        456      Gateway       200
    4965-4a74   456        789      Service A     200
    4965-4a74e  456        abc      Service B     200
    Status: 200
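The rows above carry enough information to rebuild the call tree for a request: each row's Parent Id points at the Node Id of its caller. The sketch below links the slide's four rows that way and renders the tree with indentation. It is an illustration, not the tracing system from the talk; the `Span` type and method names are hypothetical, and the sketch assumes all four rows belong to the same request and that a Parent Id of 0 marks the root.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TraceTree {
    static final class Span {
        final String parentId, nodeId, service;
        final int status;

        Span(String parentId, String nodeId, String service, int status) {
            this.parentId = parentId;
            this.nodeId = nodeId;
            this.service = service;
            this.status = status;
        }
    }

    /** The four rows from the slide (request id omitted; see the assumption above). */
    static List<Span> slideSpans() {
        return Arrays.asList(
                new Span("0", "123", "Edge Service", 200),
                new Span("123", "456", "Gateway", 200),
                new Span("456", "789", "Service A", 200),
                new Span("456", "abc", "Service B", 200));
    }

    /** Renders the call tree, one line per span, indented two spaces per level. */
    static List<String> render(List<Span> spans, String rootParentId) {
        // Group spans by the node that called them.
        Map<String, List<Span>> children = new LinkedHashMap<>();
        for (Span span : spans) {
            children.computeIfAbsent(span.parentId, k -> new ArrayList<>()).add(span);
        }
        List<String> lines = new ArrayList<>();
        render(children, rootParentId, 0, lines);
        return lines;
    }

    private static void render(Map<String, List<Span>> children, String parentId,
                               int depth, List<String> out) {
        for (Span span : children.getOrDefault(parentId, Collections.<Span>emptyList())) {
            StringBuilder indent = new StringBuilder();
            for (int i = 0; i < depth; i++) {
                indent.append("  ");
            }
            out.add(indent + span.service + " (" + span.status + ")");
            render(children, span.nodeId, depth + 1, out);
        }
    }

    public static void main(String[] args) {
        for (String line : render(slideSpans(), "0")) {
            System.out.println(line);
        }
    }
}
```

For the slide's rows this yields Edge Service at the root, Gateway beneath it, and Service A and Service B as siblings under Gateway.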
  • Distributed Tracing
  • A System that Supports All These
  • A Data Pipeline To Glue Them All
  • Make It Simple
  • Message Producing: simple and uniform API: messageBus.publish(event)
  • Consumption Is Simple Too
      consumer.observe().subscribe(new Subscriber<Ackable>() {
          @Override
          public void onNext(Ackable ackable) {
              process(ackable.getEntity(MyEventType.class));
              ackable.ack();
          }
      });
      consumer.pause();
      consumer.resume();
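The publish, acknowledge, pause, and resume calls on the last two slides can be mimicked with a small in-memory stand-in that needs only the JDK. This is a sketch of the shape of the API, not the pipeline's implementation: `InMemoryBus` and everything inside it is hypothetical, delivery is synchronous on the caller's thread (the real system is described as asynchronous), and RxJava is deliberately left out so the example stays self-contained.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Consumer;

class InMemoryBus<T> {
    /** Wraps an event so the subscriber can acknowledge it after processing. */
    static final class Ackable<T> {
        private final T entity;
        private boolean acked;

        Ackable(T entity) { this.entity = entity; }
        T getEntity() { return entity; }
        void ack() { acked = true; }
        boolean isAcked() { return acked; }
    }

    private final List<Consumer<Ackable<T>>> subscribers = new ArrayList<>();
    private final Deque<T> pending = new ArrayDeque<>();
    private boolean paused;
    private int unacked;

    void subscribe(Consumer<Ackable<T>> subscriber) { subscribers.add(subscriber); }

    /** Queues the event and delivers it immediately unless consumption is paused. */
    void publish(T event) {
        pending.add(event);
        if (!paused) {
            drain();
        }
    }

    void pause() { paused = true; }

    /** Resumes consumption and delivers everything queued while paused. */
    void resume() {
        paused = false;
        drain();
    }

    /** Number of deliveries the subscribers did not acknowledge. */
    int unackedCount() { return unacked; }

    private void drain() {
        T event;
        while ((event = pending.poll()) != null) {
            for (Consumer<Ackable<T>> subscriber : subscribers) {
                Ackable<T> ackable = new Ackable<>(event);
                subscriber.accept(ackable);
                if (!ackable.isAcked()) {
                    unacked++;
                }
            }
        }
    }

    public static void main(String[] args) {
        InMemoryBus<String> bus = new InMemoryBus<>();
        bus.subscribe(ackable -> {
            System.out.println("processed " + ackable.getEntity());
            ackable.ack();
        });
        bus.publish("first");
        bus.pause();
        bus.publish("second"); // held until resume()
        bus.resume();
    }
}
```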
  • RxJava: functional reactive programming model; powerful streaming API; separation of logic and threading model
  • Design Decisions. Top priority: app stability and throughput. Asynchronous operations. Aggressive buffering. Drops messages if necessary.
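One common way to honor "app stability first, drop messages if necessary" is a bounded buffer whose publish call never blocks: when the buffer is full, the event is counted as dropped and the application carries on. The sketch below shows that pattern with the JDK's `ArrayBlockingQueue`; it is an assumption about how such a policy could look, not the pipeline's actual client, and `DroppingBuffer` is a hypothetical name.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

class DroppingBuffer<T> {
    private final BlockingQueue<T> queue;
    private final AtomicLong dropped = new AtomicLong();

    DroppingBuffer(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /**
     * Never blocks the calling application thread. Returns false and
     * counts a drop when the buffer is already full.
     */
    boolean publish(T event) {
        boolean accepted = queue.offer(event);
        if (!accepted) {
            dropped.incrementAndGet();
        }
        return accepted;
    }

    /** Intended for a background sender; returns null when nothing is buffered. */
    T poll() {
        return queue.poll();
    }

    long droppedCount() {
        return dropped.get();
    }

    public static void main(String[] args) {
        DroppingBuffer<String> buffer = new DroppingBuffer<>(2);
        buffer.publish("e1");
        buffer.publish("e2");
        buffer.publish("e3"); // buffer is full, so this one is dropped
        System.out.println("dropped: " + buffer.droppedCount());
        System.out.println("next to send: " + buffer.poll());
    }
}
```

Exposing the drop counter matters in practice: it lets operators see data loss as a metric instead of discovering it later in the analytics.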
  • Anything Can Fail
  • Cloud Resiliency
  • Fault Tolerance Features: write and forward with auto-reattached EBS (Amazon's Elastic Block Store); disk-backed queue: big-queue; customized scaling down
  • There's More to Do. Contribute to @NetflixOSS. Join us :-)
  • Summary: http://netflix.github.io
  • You can build your own web-scale data pipeline using open source components
  • Thank You! Sudhir Tonse (http://www.linkedin.com/in/sudhirtonse , Twitter: @stonse); Danny Yuan (http://www.linkedin.com/pub/danny-yuan/4/374/862 , Twitter: @g9yuayon)
