ebay pulsar: real-time analytics platform

eBay Pulsar (Real-time Analytics Platform) 2015.03.13 [email protected] 양경모

Upload: kyoungmo-yang

Post on 14-Jul-2015


Page 1: eBay Pulsar: Real-time analytics platform

eBay Pulsar (Real-time Analytics Platform)

2015.03.13

[email protected] 양경모

Page 2: eBay Pulsar: Real-time analytics platform

Agenda

1. What is Pulsar?

2. Twitter stream processing demo

3. Key points

4. Other platforms

Page 3: eBay Pulsar: Real-time analytics platform

1. What is Pulsar?

● Developed by eBay

● Real-time analytics platform

● Stream processing framework

● Scalability

– Scale to tens of millions of events per second

● Availability

– No downtime during software upgrades, stream processing rule changes, or topology changes

● Flexibility

– SQL-like language and annotations for defining stream processing rules

Page 4: eBay Pulsar: Real-time analytics platform

Pulsar's Building Block (basic framework)

● Jetstream

– Real-time stream processing framework

– Spring IoC (Inversion of Control) container

Page 5: eBay Pulsar: Real-time analytics platform

Pulsar's Building Block (Cont.) (basic framework)

Page 6: eBay Pulsar: Real-time analytics platform

Pulsar's Building Block (Cont.) (basic framework)

● Jetstream's key points

– CEP capabilities through Esper integration.

– Define processing logic in SQL

– Extends SQL functionality and pipeline flow routing using SQL

– Hot deploy SQL without restarting applications

– Spring IoC enabling dynamic topology changes at runtime

– Clustering with elastic scaling

– Cloud deployment

Page 7: eBay Pulsar: Real-time analytics platform

Pulsar Real-time Analytics Pipeline

● Collector : Ingests events through a REST endpoint

● Sessionizer : Sessionizes the events, maintaining the session state and generating marker events

● Distributor : Filters and mutates events to different consumers; acts as an event router

● Metrics Calculator : Calculates metrics by various dimensions and persists them in the metrics store

● Replay : Replays the failed events on other stages

● ConfigApp : Configures dynamic provisioning for the whole pipeline

Page 8: eBay Pulsar: Real-time analytics platform

1) Collector

● Supports a REST API to ingest events

● Geo and device classification enrichment

● Detects fraud and bots

● Streams the enriched events to the Sessionizer stage

PulsarRawEvent:A

"si": "UUID",
"ipv4": "ip",
...
"itmP": "itmPrice",
"capQ": "campaignQuantity"

PulsarEvent:A

"device": "deviceinfo",
"geo": "geoinfo",
"raw": "PulsarRawEvent:A"

Enrichment
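The enrichment step above can be sketched roughly as follows. This is a minimal illustration, not Pulsar's Collector code: the `enrich` function, the two demo classifiers, and the `ua` field are hypothetical names introduced here, and only `si`, `ipv4`, `device`, `geo`, and `raw` come from the slide.

```python
from typing import Any, Callable, Dict

RawEvent = Dict[str, Any]


def enrich(raw: RawEvent,
           classify_device: Callable[[RawEvent], str],
           lookup_geo: Callable[[str], str]) -> Dict[str, Any]:
    """Wrap a raw event in an enriched event, keeping the raw payload intact."""
    return {
        "device": classify_device(raw),
        "geo": lookup_geo(raw.get("ipv4", "")),
        "raw": raw,
    }


# Hypothetical stand-ins for real device and geo classifiers.
def demo_device(raw: RawEvent) -> str:
    # "ua" (user agent) is an assumed field, not one shown on the slide.
    return "mobile" if "Mobile" in raw.get("ua", "") else "desktop"


def demo_geo(ip: str) -> str:
    return "private" if ip.startswith("10.") else "unknown"


raw_event = {"si": "AAAAAA", "ipv4": "10.0.0.7", "ua": "Mobile Safari"}
event = enrich(raw_event, demo_device, demo_geo)
print(event["device"], event["geo"])
```

Keeping the raw event nested under `raw` mirrors the slide's PulsarEvent layout, so later stages can still read the original fields.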

Page 9: eBay Pulsar: Real-time analytics platform

2) Sessionizer

● Sessionization: temporal grouping of events that contain a specific identifier and occur within a time window referred to as the session duration

● Session metadata and state

● Session store (in-memory cache)

PulsarEvent:A

"device": "deviceinfo",
"geo": "geoinfo",
"raw": "PulsarRawEvent:A"

Sessionization

PulsarEvent:A

"device": "deviceinfo",
"geo": "geoinfo",
"raw": "PulsarRawEvent:A"

Metadata:A

sessionId, PageId, geo-loc, device, etc.
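As a rough sketch of the idea, the snippet below groups events by identifier and keeps session state in an in-memory store. It assumes the session duration acts as an inactivity timeout, which the slide does not spell out. The `Sessionizer` class, the `begin:`/`end:` marker strings, and the session ID format are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Session:
    session_id: str
    start: float
    last_seen: float
    event_count: int = 0


class Sessionizer:
    """Groups events by identifier; a gap longer than session_duration starts a new session."""

    def __init__(self, session_duration: float) -> None:
        self.session_duration = session_duration
        self.store: Dict[str, Session] = {}      # in-memory session store
        self._sequence: Dict[str, int] = {}      # sessions seen per identifier

    def on_event(self, si: str, ts: float) -> Tuple[Session, List[str]]:
        markers: List[str] = []
        session = self.store.get(si)
        # Close the current session if the identifier was idle for too long.
        if session is not None and ts - session.last_seen > self.session_duration:
            markers.append(f"end:{session.session_id}")
            session = None
        if session is None:
            number = self._sequence.get(si, 0) + 1
            self._sequence[si] = number
            session = Session(f"{si}-{number}", ts, ts)
            self.store[si] = session
            markers.append(f"begin:{session.session_id}")
        session.last_seen = ts
        session.event_count += 1
        return session, markers


sessionizer = Sessionizer(session_duration=30.0)
_, first = sessionizer.on_event("AAAAAA", 0.0)
_, second = sessionizer.on_event("AAAAAA", 10.0)
latest, third = sessionizer.on_event("AAAAAA", 100.0)
print(first, second, third)
```

In this sketch an expired session is only detected when the next event for the same identifier arrives; a real sessionizer would presumably also expire idle sessions on a timer.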

Page 10: eBay Pulsar: Real-time analytics platform

3) Distributor

● Event filtering, mutation and routing

distributes

PulsarEvent:A

"si": "AAAAAA",
"device": "deviceinfo",
"geo": "geoinfo",
"raw": "PulsarRawEvent:A"

@OutputTo("OutboundMessageChannel")
@ClusterAffinityTag(colname="si")
@PublishOn(topics="Pulsar.MC/ssnzEvent")
select * from PulsarEvent;

OutboundMessageChannel

InboundMessageChannel

InboundMessageChannel

"Pulsar.MC/ssnzEvent"

PulsarEvent:B

"si": "BBBBBB",
"device": "deviceinfo",
"geo": "geoinfo",
"raw": "PulsarRawEvent:A"
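The effect of tagging a column for cluster affinity, as `@ClusterAffinityTag(colname="si")` does above, is that events with the same `si` value end up on the same consumer. The sketch below illustrates this with a plain hash-modulo scheme. That scheme is an assumption made for illustration and may differ from Jetstream's actual scheduler; `pick_consumer` and the node names are hypothetical.

```python
import hashlib
from typing import Any, Dict, List


def pick_consumer(event: Dict[str, Any], colname: str, consumers: List[str]) -> str:
    """Choose a consumer deterministically from the value of the affinity column."""
    key = str(event[colname]).encode("utf-8")
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(consumers)
    return consumers[index]


consumers = ["node-0", "node-1", "node-2"]
a1 = {"si": "AAAAAA", "page": 1}
a2 = {"si": "AAAAAA", "page": 2}

# Two events with the same "si" always land on the same node.
same = pick_consumer(a1, "si", consumers) == pick_consumer(a2, "si", consumers)
print(same)
```

A stable hash is used here rather than Python's built-in `hash`, which is randomized per process for strings and would break affinity across restarts.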

Page 11: eBay Pulsar: Real-time analytics platform

4) Metrics Calculator

● Real-time metrics computation engine (Esper)

● Metrics are stored in Cassandra for batch processing

context MCContext
insert into PulsarEventCount
select count(*) as count from PulsarEvent
output snapshot when terminated;

@BroadCast
@OutputTo("OutboundMessageChannel")
@PublishOn(topics="Pulsar.Report/metric")
select * from PulsarEventCount;

calculates

PulsarEventCount:C

"count": 2

OutboundMessageChannel

InboundMessageChannel

"Pulsar.Report/metric"
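The counting statement above emits one snapshot when its context ends. The slide does not show how MCContext is defined, so the sketch below assumes a fixed-length time window. `WindowedCount` is a hypothetical name, and the code only mimics the behaviour in plain Python rather than calling Esper.

```python
from typing import Dict, List, Optional


class WindowedCount:
    """Counts events per window and emits a snapshot when the window terminates."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self.window_start: Optional[float] = None
        self.count = 0

    def on_event(self, ts: float) -> List[Dict[str, int]]:
        snapshots: List[Dict[str, int]] = []
        if self.window_start is None:
            self.window_start = ts
        # The window has ended: emit its count, then start a new window.
        if ts >= self.window_start + self.window_seconds:
            snapshots.append({"count": self.count})
            self.count = 0
            self.window_start = ts
        self.count += 1
        return snapshots


calc = WindowedCount(window_seconds=10.0)
emitted: List[Dict[str, int]] = []
for ts in [0.0, 3.0, 12.0]:
    emitted.extend(calc.on_event(ts))
print(emitted)
```

With two events inside the first window, the emitted snapshot carries `"count": 2`, matching the PulsarEventCount example on the slide.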

Page 12: eBay Pulsar: Real-time analytics platform

5) Replay

● At every stage, events are stored in Kafka

● Failed events are replayed on other stages
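A minimal sketch of the replay idea follows, with an in-memory queue standing in for Kafka. `ReplayBuffer`, `deliver`, and `replay` are hypothetical names, and the snippet makes no claim about how Pulsar's Replay stage is actually implemented.

```python
from collections import deque
from typing import Any, Callable, Deque, Dict, List

Event = Dict[str, Any]
Stage = Callable[[Event], None]


class ReplayBuffer:
    """Keeps events whose delivery failed so they can be replayed later."""

    def __init__(self) -> None:
        self.failed: Deque[Event] = deque()   # stand-in for a Kafka topic

    def deliver(self, event: Event, stage: Stage) -> bool:
        try:
            stage(event)
            return True
        except Exception:
            self.failed.append(event)
            return False

    def replay(self, stage: Stage) -> int:
        """Retry every stored event once; return how many succeeded."""
        replayed = 0
        for _ in range(len(self.failed)):
            event = self.failed.popleft()
            if self.deliver(event, stage):
                replayed += 1
        return replayed


processed: List[Event] = []
state = {"up": False}


def downstream(event: Event) -> None:
    if not state["up"]:
        raise RuntimeError("stage unavailable")
    processed.append(event)


buffer = ReplayBuffer()
delivered = buffer.deliver({"si": "AAAAAA"}, downstream)  # fails, event is stored
state["up"] = True
recovered = buffer.replay(downstream)                     # succeeds on retry
print(delivered, recovered, processed)
```

Events that fail again during a replay go back onto the queue, so nothing is dropped while the downstream stage remains unavailable.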

Page 13: eBay Pulsar: Real-time analytics platform

2. Demo (Twitter stream processing)

Twitter Stream

Twitter Stream Collector

Page 14: eBay Pulsar: Real-time analytics platform

EPLs (Context)

context MCContext
insert into TwitterTopCountryCount
select count(*) as count, country from TwitterSample(country is not null)
group by country
output snapshot when terminated
order by count(*) desc limit 10;

context MCContext
insert into TwitterTopLangCount
select count(*) as count, lang from TwitterSample(lang is not null)
group by lang
output snapshot when terminated
order by count(*) desc limit 10;

context MCContext
insert into TwitterTopHashTagCount
select topKNested(1000, 20, hashtag, ',') as TopHashTag from TwitterSample(hashtag is not null)
output snapshot when terminated;

context MCContext
insert into TwitterEventCount
select count(*) as count from TwitterSample
output snapshot when terminated;
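For readers less familiar with EPL, the TwitterTopCountryCount statement amounts to a filtered group-by count with a top-10 cut. The plain-Python equivalent below is illustrative only: `top_countries` and the sample tweets are made up for the example, and nothing here uses Esper.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List, Tuple


def top_countries(tweets: Iterable[Dict[str, Any]], limit: int = 10) -> List[Tuple[str, int]]:
    """Count tweets per country, skip tweets without one, return the largest groups."""
    counts = Counter(
        tweet["country"] for tweet in tweets if tweet.get("country") is not None
    )
    return counts.most_common(limit)


sample = [
    {"country": "KR"},
    {"country": "US"},
    {"country": "KR"},
    {"country": None},   # filtered out, like "country is not null"
    {"lang": "en"},      # no country field at all
]
top = top_countries(sample)
print(top)
```

The same shape applies to TwitterTopLangCount by swapping the `country` key for `lang`.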

Page 15: eBay Pulsar: Real-time analytics platform

EPLs (Select)

@BroadCast
@OutputTo("OutboundMessageChannel")
@PublishOn(topics="Pulsar.Report/metric")
select * from TwitterTopCountryCount;

@BroadCast
@OutputTo("OutboundMessageChannel")
@PublishOn(topics="Pulsar.Report/metric")
select * from TwitterTopLangCount;

@BroadCast
@OutputTo("OutboundMessageChannel")
@PublishOn(topics="Pulsar.Report/metric")
select * from TwitterTopHashTagCount;

@BroadCast
@OutputTo("OutboundMessageChannel")
@PublishOn(topics="Pulsar.Report/metric")
select * from TwitterEventCount;

Page 16: eBay Pulsar: Real-time analytics platform

http://<hostname>:8088

Dashboard

Page 17: eBay Pulsar: Real-time analytics platform

3. Pulsar's key points

● Creating pipelines declaratively

● SQL driven processing logic with hot deployment of SQL

● Framework for custom SQL extensions

● Dynamic partitioning and flow control

● < 100 millisecond pipeline latency

● 99.99% Availability

● < 0.01% data loss

● Cloud deployable

Page 18: eBay Pulsar: Real-time analytics platform

4. Other Stream Processing Frameworks

● Storm (Trident)

– Storm Transactional Topology

– Stateful

● Storm (Esper)

– Our solution developed in NexR Project

– Integrates Esper

● Apache Spark

– Fast and general cluster computing platform for Big Data

– Supports SQL

Page 19: eBay Pulsar: Real-time analytics platform

Storm(+Esper) / Spark vs Pulsar

Points | Pulsar | Storm(Trident) | Storm(Esper) | Spark
Declarative pipeline wiring | O | X | X | X
Pipeline stitching | Run time | Build time | Build time | Build time
Hot deployment of topologies | O | X | X | X
SQL support | O | X | O | O
Hot deployment of processing rules | O | X | O | X
Pipeline flow control | O | △ | △ | ?
Stateful processing | O | O | △ | O

(O: supported, X: not supported, △: partial, ?: unknown)

<http://gopulsar.io/docs/Pulsar_Presentation.pdf>

Page 20: eBay Pulsar: Real-time analytics platform

References

● http://www.ebaytechblog.com/2015/02/23/announcing-pulsar-real-time-analytics-at-scale/#.VQIuqBCsVW2

● http://gopulsar.io/

● https://github.com/pulsarIO/realtime-analytics/wiki

● http://gopulsar.io/html/docs.html

● https://spark.apache.org/

● https://storm.apache.org/

● http://www.espertech.com/

Page 21: eBay Pulsar: Real-time analytics platform

Q & A