big data pipeline and analytics platform using netflixoss and other open source libraries

Sudhir Tonse (@stonse)
Danny Yuan (@g9yuayon)

Big Data Pipeline and Analytics Platform

Using NetflixOSS and Other Open Source Software

Data Is the Most Important Asset at Netflix

If all the data is easily available to all teams, it can be leveraged in new and exciting ways

~1000 Device Types
~500 Apps/Web Services
~100 Billion Events/Day

3.2M messages per second at peak time

3GB per second at peak time

Dashboard

Type of Events

User Interface Events
Search Event (Matrix using PS3 )
Star Rating Event (HoC : 5 stars, Xbox, US, )

Infrastructural Events
RPC Call (API -> Billing Service, /bill/.., 200, )
Log Errors (NPE, Movie is null, , )

Other Events

Making Sense of Billions of Events

http://netflix.github.io+

A Humble Beginning

Evolution Scale!

[Diagram: many boxes, each labeled "Application"]

Something happened. Our traffic turned into a hockey stick, and the number of applications exploded. So log traffic exploded too. Simple log scraping wouldn't cut it any more.

We Want to Process App Data in Hadoop

Our Hadoop Ecosystem

@NetflixOSS Big Data Tools

Hadoop as a Service

Pig Scripting on Steroids

Pig Married to Clojure

Map-Reduce for Clojure

S3MPER

S3mper is a library that provides an additional layer of consistency checking on top of Amazon's S3 index through use of a consistent, secondary index.
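The idea of checking a listing against a consistent secondary index can be sketched roughly as below. This is not S3mper's actual API: the class and method names are hypothetical, and in-memory collections stand in for both the S3 listing and the secondary index.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; not the S3mper API.
class ListingConsistencyCheck {

    // Returns the keys recorded in the consistent secondary index
    // that the (possibly stale) listing did not return.
    static Set<String> missingFromListing(Set<String> indexedKeys, List<String> listing) {
        Set<String> missing = new TreeSet<>(indexedKeys);
        missing.removeAll(listing);
        return missing;
    }

    public static void main(String[] args) {
        // Made-up keys: the index knows three files, the listing returned two.
        Set<String> index = Set.of("logs/part-0000", "logs/part-0001", "logs/part-0002");
        List<String> listing = List.of("logs/part-0000", "logs/part-0002");

        Set<String> missing = missingFromListing(index, listing);
        if (missing.isEmpty()) {
            System.out.println("Listing is consistent with the index");
        } else {
            System.out.println("Listing is missing: " + missing);
        }
    }
}
```

A caller that finds missing keys could retry the listing or fail the job, depending on how much staleness it can tolerate.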

Efficient ETL with Cassandra

Cassandra

Offline Analysis

Evolution Speed!

We Want to Aggregate, Index, and Query Data in Real Time

Interactive Exploration

For one thing: interactive exploration. Sometimes we want to get data in real time so we can act quickly. Some data is only useful in a small time window, after all. Sometimes we want to run lots of experimental queries just to find the right insights. If we wait too long for a query to come back, we won't be able to iterate fast enough. Either way, we need to get query results back in seconds.

Let's walk through some use cases

client activity event

*

/name = movieStarts

Pipeline Challenges

App owners: send and forget
Data scientists: validation, ETL, batch processing
DevOps: stream processing, targeted search

Message Routing

Here is one example: we process more than 150 thousand events per second about user activities. What if we'd like to know, by geography, how many users started playing videos in the past 5 minutes? So I submit my query, and in a few seconds...
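As a rough sketch of what such a query computes, the snippet below counts play-start events per country over a trailing 5-minute window. It runs over an in-memory list rather than a real query engine, and the `Event` class, its fields, and the country codes are assumptions made for illustration; only the event name `movieStarts` comes from the slides.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class PlayStartsByCountry {

    // Hypothetical event shape; the field names are assumptions.
    static final class Event {
        final String name;
        final String country;
        final Instant timestamp;

        Event(String name, String country, Instant timestamp) {
            this.name = name;
            this.country = country;
            this.timestamp = timestamp;
        }
    }

    // Counts events with the given name per country inside the trailing window.
    static Map<String, Long> countByCountry(List<Event> events, String name,
                                            Instant now, Duration window) {
        Instant cutoff = now.minus(window);
        return events.stream()
                .filter(e -> e.name.equals(name))
                .filter(e -> !e.timestamp.isBefore(cutoff))
                .collect(Collectors.groupingBy(e -> e.country, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2014-01-01T00:10:00Z");
        List<Event> events = List.of(
                new Event("movieStarts", "US", now.minusSeconds(30)),
                new Event("movieStarts", "US", now.minusSeconds(120)),
                new Event("movieStarts", "BR", now.minusSeconds(200)),
                new Event("movieStarts", "US", now.minusSeconds(900)), // outside the window
                new Event("search", "US", now.minusSeconds(10)));      // different event name
        // prints {BR=1, US=2}
        System.out.println(countByCountry(events, "movieStarts", now, Duration.ofMinutes(5)));
    }
}
```

Drilling down along another dimension, such as device type, amounts to adding one more filter or changing the grouping key.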

But this is an aggregated view. What if I want to drill down into the data immediately along different dimensions? In this particular case, what if I want to find failed attempts on our Silverlight players that run on PCs and Macs?

We Want to Consume Data Selectively in Different Ways

Message broker
High-throughput
Persistent and replicated

There Is More

Intelligent Alerts

Note this is different from alerting based on monitoring metrics. Monitoring metrics are great and versatile, but they don't help us catch unexpected errors. When we build an application, we instrument our code diligently, yet it's very likely we miss some critical instrumentation points. There's one thing that we always catch, though: logged errors and unhandled exceptions. The alert provides a precise entry point and the right context for people to drill down into the right problems.
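One way to picture alerting on logged errors is the sketch below, which groups errors by application and exception type and reports the groups that cross a threshold, with a sample message as context. The threshold rule, the `LoggedError` shape, and the app name `api` are all assumptions for illustration; the slides do not describe the actual alerting logic.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ErrorAlertSketch {

    // Hypothetical shape of a logged error event.
    static final class LoggedError {
        final String app;
        final String exception;
        final String message;

        LoggedError(String app, String exception, String message) {
            this.app = app;
            this.exception = exception;
            this.message = message;
        }
    }

    // Returns one alert line per (app, exception) pair seen at least `threshold` times.
    static List<String> alerts(List<LoggedError> errors, int threshold) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        Map<String, String> sampleMessage = new LinkedHashMap<>();
        for (LoggedError e : errors) {
            String key = e.app + "/" + e.exception;
            counts.merge(key, 1, Integer::sum);
            sampleMessage.putIfAbsent(key, e.message); // keep the first message as context
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> c : counts.entrySet()) {
            if (c.getValue() >= threshold) {
                out.add(c.getKey() + " x" + c.getValue()
                        + " (e.g. \"" + sampleMessage.get(c.getKey()) + "\")");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<LoggedError> errors = List.of(
                new LoggedError("api", "NPE", "Movie is null"),
                new LoggedError("api", "NPE", "Movie is null"),
                new LoggedError("api", "NPE", "Movie is null"),
                new LoggedError("api", "TimeoutException", "call timed out"));
        alerts(errors, 2).forEach(System.out::println);
    }
}
```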


Guided Debugging in the Right Context



What We Need

Ad-hoc query with different dimensions
Quick aggregations and Top-N queries
Time series with flexible filters
Quick access to raw data using boolean queries

Druid

Rapid exploration of high-dimensional data
Fast ingestion and querying
Time series

Real-time indexing of event streams
Killer feature: boolean search
Great UI: Kibana

The Old Pipeline

The New Pipeline

There Is More

It's Not All About Counters and Time Series

Request Id    Parent Id    Node Id    Service Name    Status
4965-4a74     0            123        Edge Service    200
4965-4a74     123          456        Gateway         200
4965-4a74     456          789        Service A       200
4965-4a74e    456          abc        Service B       200
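The parent and node IDs in the rows above are enough to rebuild the call tree of a request. The sketch below shows one way to do that; the `Span` class and the `render` helper are hypothetical names, the sample rows are modeled on the table, and the code assumes the parent links contain no cycles.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TraceTreeSketch {

    // Hypothetical record of one hop in a traced request.
    static final class Span {
        final String parentId;
        final String nodeId;
        final String service;
        final int status;

        Span(String parentId, String nodeId, String service, int status) {
            this.parentId = parentId;
            this.nodeId = nodeId;
            this.service = service;
            this.status = status;
        }
    }

    // Renders the call tree as indented lines, starting from spans whose parent is rootParentId.
    static List<String> render(List<Span> spans, String rootParentId) {
        Map<String, List<Span>> childrenByParent = new LinkedHashMap<>();
        for (Span s : spans) {
            childrenByParent.computeIfAbsent(s.parentId, k -> new ArrayList<>()).add(s);
        }
        List<String> lines = new ArrayList<>();
        walk(childrenByParent, rootParentId, 0, lines);
        return lines;
    }

    // Depth-first walk; assumes the parent links form a tree (no cycles).
    private static void walk(Map<String, List<Span>> byParent, String parentId,
                             int depth, List<String> out) {
        for (Span s : byParent.getOrDefault(parentId, List.of())) {
            out.add("  ".repeat(depth) + s.service + " (" + s.status + ")");
            walk(byParent, s.nodeId, depth + 1, out);
        }
    }

    public static void main(String[] args) {
        List<Span> spans = List.of(
                new Span("0", "123", "Edge Service", 200),
                new Span("123", "456", "Gateway", 200),
                new Span("456", "789", "Service A", 200),
                new Span("456", "abc", "Service B", 200));
        render(spans, "0").forEach(System.out::println);
    }
}
```

With the sample rows, Service A and Service B both appear as children of Gateway, which itself sits under Edge Service.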

Status:200

Distributed Tracing


A System that Supports All These

A Data Pipeline To Glue Them All

Make It Simple

Message Producing

Simple and Uniform API
messageBus.publish(event)

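Consumption Is Simple Too

    consumer.observe().subscribe(new Subscriber<Ackable>() {
        @Override
        public void onNext(Ackable ackable) {
            // Process the event, then acknowledge it.
            process(ackable.getEntity(MyEventType.class));
            ackable.ack();
        }

        @Override public void onCompleted() {}
        @Override public void onError(Throwable e) { /* log or handle */ }
    });

    consumer.pause();
    consumer.resume();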

RxJava

Functional reactive programming model
Powerful streaming API
Separation of logic and threading model

Design Decisions

Top Priority: app stability and throughput
Asynchronous operations
Aggressive buffering
Drops messages if necessary
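These decisions can be illustrated with a bounded, non-blocking buffer: the publishing thread never waits, and events are dropped (and counted) once the buffer is full. This is a minimal sketch built on the standard `ArrayBlockingQueue`; the `DroppingBuffer` class is a hypothetical name and does not describe the pipeline's real implementation.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration of "buffer aggressively, drop if necessary".
class DroppingBuffer<T> {
    private final BlockingQueue<T> queue;
    private final AtomicLong dropped = new AtomicLong();

    DroppingBuffer(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Never blocks the caller: returns false and counts a drop when the buffer is full.
    boolean publish(T event) {
        boolean accepted = queue.offer(event);
        if (!accepted) {
            dropped.incrementAndGet();
        }
        return accepted;
    }

    // Intended for a background sender thread; returns null when the buffer is empty.
    T poll() {
        return queue.poll();
    }

    long droppedCount() {
        return dropped.get();
    }

    public static void main(String[] args) {
        DroppingBuffer<String> buffer = new DroppingBuffer<>(2);
        for (int i = 0; i < 5; i++) {
            buffer.publish("event-" + i);
        }
        // prints "dropped = 3": the buffer holds 2 events and nothing drained it
        System.out.println("dropped = " + buffer.droppedCount());
    }
}
```

The trade-off is deliberate: losing some events under pressure is preferred over slowing down or destabilizing the application that produces them.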

Anything Can Fail

Cloud Resiliency

Fault Tolerance Features

Write and forward with auto-reattached EBS (Amazon's Elastic Block Store)
Disk-backed queue: big-queue
Customized scaling down

There's More to Do

Contribute to @NetflixOSS
Join us :-)

Summary

http://netflix.github.io+

You can build your own web-scale data pipeline using open source components

Thank You!

Sudhir Tonse
http://www.linkedin.com/in/sudhirtonse
Twitter: @stonse

Danny Yuan
http://www.linkedin.com/pub/danny-yuan/4/374/862
Twitter: @g9yuayon
