Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Source Libraries

Sudhir Tonse (@stonse) Danny Yuan (@g9yuayon) Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Source Software

Upload: sudhir-tonse

Post on 19-Aug-2014

Category:

Engineering


DESCRIPTION

Slides on the OSCON talk about the data platform used at Netflix for event collection, aggregation, and analysis. The platform helps Netflix process and analyze billions of events every day. Attendees will learn how to assemble their own large-scale data pipeline/analytics platform using open source software from NetflixOSS and others, such as Kafka, ElasticSearch, Druid from Metamarkets, and Hive.

TRANSCRIPT

  • Sudhir Tonse (@stonse) Danny Yuan (@g9yuayon) Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Source Software
  • Data is the most important asset at Netflix
  • If all the data is easily available to all teams, it can be leveraged in new and exciting ways
  • ~1000 device types; ~500 apps/web services; ~100 billion events/day; 3.2M messages per second at peak time; 3GB per second at peak time. [Dashboard]
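The figures on this slide support a quick back-of-the-envelope check. The sketch below only restates the slide's numbers; it assumes 1 GB = 10^9 bytes and, for the peak-to-average ratio, that one message carries one event (neither is stated in the talk). The class name `TrafficEstimate` is made up for illustration.

```java
// Back-of-the-envelope figures derived from the numbers on the slide.
final class TrafficEstimate {
    static final double EVENTS_PER_DAY = 100e9;     // ~100 billion events/day (slide)
    static final double PEAK_MSGS_PER_SEC = 3.2e6;  // 3.2M messages/s at peak (slide)
    static final double PEAK_BYTES_PER_SEC = 3e9;   // 3GB/s at peak; assumes 1 GB = 10^9 bytes

    /** Average event rate over a 24-hour day. */
    static double averageEventsPerSecond() {
        return EVENTS_PER_DAY / (24 * 60 * 60);
    }

    /** Average message size at peak, in bytes. */
    static double averagePeakMessageBytes() {
        return PEAK_BYTES_PER_SEC / PEAK_MSGS_PER_SEC;
    }

    /** Peak rate relative to the daily average; assumes one event per message. */
    static double peakToAverageRatio() {
        return PEAK_MSGS_PER_SEC / averageEventsPerSecond();
    }
}
```

Under these assumptions the average works out to about 1.16 million events per second, and a peak-time message averages a little under 1 KB.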
  • Types of Events
      • User Interface Events: Search Event (Matrix using PS3); Star Rating Event (HoC: 5 stars, Xbox, US, ...)
      • Infrastructural Events: RPC Call (API -> Billing Service, /bill/.., 200, ...); Log Errors (NPE, Movie is null, ..., ...)
      • Other Events
  • Making Sense of Billions of Events
  • http://netflix.github.io +
  • A Humble Beginning
  • Evolution: Scale!
  • [Diagram: ten Application boxes]
  • We Want to Process App Data in Hadoop
  • Our Hadoop Ecosystem
  • @NetflixOSS Big Data Tools
  • Hadoop as a Service
  • Pig Scripting on Steroids
  • Pig Married to Clojure: Map-Reduce for Clojure
  • S3mper: a library that provides an additional layer of consistency checking on top of Amazon's S3 index through use of a consistent, secondary index.
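The consistency check described on this slide can be pictured as a set comparison between what the secondary index expects and what a listing returns. This is a minimal sketch of that idea only, not S3mper's actual API; `ListingConsistencyCheck` and its method names are hypothetical, and plain in-memory sets stand in for both the index and the S3 listing.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of checking a listing against a consistent secondary index.
final class ListingConsistencyCheck {
    /** Returns the keys recorded in the index that the listing did not return. */
    static Set<String> missingFromListing(Set<String> indexedKeys, Set<String> listedKeys) {
        Set<String> missing = new TreeSet<>(indexedKeys); // copy, so the inputs stay untouched
        missing.removeAll(listedKeys);
        return missing;
    }

    /** The listing is treated as consistent when it contains every indexed key. */
    static boolean isConsistent(Set<String> indexedKeys, Set<String> listedKeys) {
        return missingFromListing(indexedKeys, listedKeys).isEmpty();
    }
}
```

A caller that finds missing keys could wait and list again before starting a job; that retry policy is a design choice outside this sketch.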
  • Efficient ETL with Cassandra
  • Offline Analysis
  • Evolution: Speed!
  • We Want to Aggregate, Index, and Query Data in Real Time
  • Interactive Exploration
  • Let's walk through some use cases
  • client activity event * /name = movieStarts
  • Pipeline Challenges
      • App owners: send and forget
      • Data scientists: validation, ETL, batch processing
      • DevOps: stream processing, targeted search
  • Message Routing
  • We Want to Consume Data Selectively in Different Ways
  • Message broker: high-throughput; persistent and replicated
  • There Is More
  • Intelligent Alerts
  • Guided Debugging in the Right Context
  • What We Need
      • Ad-hoc query with different dimensions
      • Quick aggregations and Top-N queries
      • Time series with flexible filters
      • Quick access to raw data using boolean queries
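To make "Top-N queries" concrete, the sketch below shows what such a query computes over one dimension: count occurrences per value and keep the N largest. It is an in-memory illustration only and says nothing about how the tools in the talk implement it; `TopNSketch` and the sample device names are assumptions for the example.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical in-memory Top-N over a single dimension (for example, device type).
final class TopNSketch {
    static List<Map.Entry<String, Long>> topN(List<String> values, int n) {
        // Quick aggregation: count how often each value occurs.
        Map<String, Long> counts = values.stream()
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
        // Top-N: highest count first, ties broken alphabetically for a stable result.
        return counts.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder())
                .thenComparing(Map.Entry.comparingByKey()))
            .limit(n)
            .collect(Collectors.toList());
    }
}
```

For instance, `TopNSketch.topN(Arrays.asList("PS3", "Xbox", "PS3", "iPad", "Xbox", "PS3"), 2)` ranks PS3 first with 3 events and Xbox second with 2.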
  • Druid
      • Rapid exploration of high dimensional data
      • Fast ingestion and querying
      • Time series
  • Real-time indexing of event streams; killer feature: boolean search; great UI: Kibana
  • The Old Pipeline
  • The New Pipeline
  • There Is More
  • It's Not All About Counters and Time Series
  • RequestId    Parent Id   Node Id   Service Name   Status
    4965-4a74    0           123       Edge Service   200
    4965-4a74    123         456       Gateway        200
    4965-4a74    456         789       Service A      200
    4965-4a74e   456         abc       Service B      200
    Status:200
  • Distributed Tracing
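The tracing records shown before this slide link each node to its caller through the Parent Id column, which is enough to rebuild the call tree. The following is a minimal sketch of that reconstruction, assuming all records belong to one request and that a Parent Id of "0" marks the root; `TraceTree` and `Span` are hypothetical names, not part of any library mentioned in the talk.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical reconstruction of a call tree from (parentId, nodeId) records.
final class TraceTree {
    static final class Span {
        final String requestId, parentId, nodeId, service;
        final int status;

        Span(String requestId, String parentId, String nodeId, String service, int status) {
            this.requestId = requestId;
            this.parentId = parentId;
            this.nodeId = nodeId;
            this.service = service;
            this.status = status;
        }
    }

    /** Groups spans by parent id, preserving input order among siblings. */
    static Map<String, List<Span>> childrenByParent(List<Span> spans) {
        Map<String, List<Span>> children = new LinkedHashMap<>();
        for (Span s : spans) {
            children.computeIfAbsent(s.parentId, k -> new ArrayList<>()).add(s);
        }
        return children;
    }

    /** Renders the tree depth-first, indenting each service under its caller. */
    static String render(List<Span> spans, String rootParentId) {
        StringBuilder out = new StringBuilder();
        render(childrenByParent(spans), rootParentId, 0, out);
        return out.toString();
    }

    private static void render(Map<String, List<Span>> children, String parentId,
                               int depth, StringBuilder out) {
        for (Span s : children.getOrDefault(parentId, Collections.<Span>emptyList())) {
            for (int i = 0; i < depth; i++) {
                out.append("  ");
            }
            out.append(s.service).append(" [").append(s.status).append("]\n");
            render(children, s.nodeId, depth + 1, out); // recurse into this node's callees
        }
    }
}
```

Fed the four rows from the slide, `render(spans, "0")` places Gateway under Edge Service, with Service A and Service B as siblings under Gateway.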
  • A System that Supports All These
  • A Data Pipeline To Glue Them All
  • Make It Simple
  • Message Producing
      • Simple and uniform API
      • messageBus.publish(event)
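  • Consumption Is Simple Too
        // If this is rx.Subscriber, onCompleted() and onError() are also required; the slide omits them.
        consumer.observe().subscribe(new Subscriber<Ackable>() {
            @Override
            public void onNext(Ackable ackable) {
                // Handle the typed event, then acknowledge it.
                process(ackable.getEntity(MyEventType.class));
                ackable.ack();
            }
        });
        consumer.pause();
        consumer.resume();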
  • RxJava
      • Functional reactive programming model
      • Powerful streaming API
      • Separation of logic and threading model
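The "separation of logic and threading model" point can be illustrated without any external dependency. The sketch below deliberately swaps in the JDK's own `java.util.concurrent.Flow` API (Java 9+) instead of RxJava, so it is an analogy rather than the API used in the talk: the subscriber contains only processing logic, while the executor that runs it is chosen separately. Requesting the next item after processing loosely mirrors the ack step on the previous slide; `FlowConsumerSketch` and `CollectingSubscriber` are hypothetical names.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Flow;
import java.util.concurrent.SubmissionPublisher;
import java.util.concurrent.TimeUnit;

final class FlowConsumerSketch {
    /** Processing logic only: nothing here decides which thread it runs on. */
    static final class CollectingSubscriber implements Flow.Subscriber<String> {
        final List<String> processed = new CopyOnWriteArrayList<>();
        final CountDownLatch done = new CountDownLatch(1);
        private Flow.Subscription subscription;

        @Override public void onSubscribe(Flow.Subscription s) {
            subscription = s;
            s.request(1);              // ask for the first event
        }
        @Override public void onNext(String event) {
            processed.add(event);      // "process" the event
            subscription.request(1);   // signal readiness for the next one
        }
        @Override public void onError(Throwable t) { done.countDown(); }
        @Override public void onComplete() { done.countDown(); }
    }

    /** Publishes the events and blocks until the subscriber has seen them all. */
    static List<String> consumeAll(List<String> events) {
        // The threading model is chosen here, independently of the subscriber.
        ExecutorService executor = Executors.newSingleThreadExecutor();
        CollectingSubscriber subscriber = new CollectingSubscriber();
        try {
            try (SubmissionPublisher<String> publisher =
                     new SubmissionPublisher<>(executor, Flow.defaultBufferSize())) {
                publisher.subscribe(subscriber);
                events.forEach(publisher::submit);
            } // close() issues onComplete after the buffered items are delivered
            if (!subscriber.done.await(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("timed out waiting for events");
            }
            return subscriber.processed;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            executor.shutdown();
        }
    }
}
```

Swapping `newSingleThreadExecutor()` for another executor changes where the work runs without touching the subscriber, which is the property the slide highlights.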
  • Design Decisions
      • Top priority: app stability and throughput
      • Asynchronous operations
      • Aggressive buffering
      • Drops messages if necessary
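These design decisions can be sketched as a bounded in-memory buffer whose publish call never blocks the application and drops the event when the buffer is full. This is one plausible reading of the slide, not the pipeline's actual client code; `DroppingBuffer`, its methods, and the capacity parameter are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical client-side buffer: favors app stability over guaranteed delivery.
final class DroppingBuffer<T> {
    private final BlockingQueue<T> queue;
    private final AtomicLong dropped = new AtomicLong();

    DroppingBuffer(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Never blocks the caller: returns false and counts a drop when the buffer is full. */
    boolean publish(T event) {
        boolean accepted = queue.offer(event);
        if (!accepted) {
            dropped.incrementAndGet();
        }
        return accepted;
    }

    /** Meant for a background sender thread: removes up to maxBatch buffered events. */
    List<T> drainBatch(int maxBatch) {
        List<T> batch = new ArrayList<>();
        queue.drainTo(batch, maxBatch);
        return batch;
    }

    long droppedCount() {
        return dropped.get();
    }
}
```

Exposing the drop counter matters in practice: if messages may be dropped, the number dropped should at least be observable.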
  • Anything Can Fail
  • Cloud Resiliency
  • Fault Tolerance Features
      • Write and forward with auto-reattached EBS (Amazon's Elastic Block Store)
      • Disk-backed queue: big-queue
      • Customized scaling down
  • There's More to Do
      • Contribute to @NetflixOSS
      • Join us :-)
  • Summary http://netflix.github.io +
  • You can build your own web-scale data pipeline using open source components
  • Thank You!
      • Sudhir Tonse: http://www.linkedin.com/in/sudhirtonse, Twitter: @stonse
      • Danny Yuan: http://www.linkedin.com/pub/danny-yuan/4/374/862, Twitter: @g9yuayon