Processing Big Data in Realtime


Page 1: Processing Big Data in Realtime

1

Processing "BIG-DATA" In Real Time

Yanai Franchi, Tikal

Page 2: Processing Big Data in Realtime

2

Page 3: Processing Big Data in Realtime

3

Vacation to Barcelona

Page 4: Processing Big Data in Realtime

4

After a Long Travel Day

Page 5: Processing Big Data in Realtime

5

Going to a Salsa Club

Page 6: Processing Big Data in Realtime

6

Best Salsa Club NOW

● Good Music

● Crowded – Now!

Page 7: Processing Big Data in Realtime

7

Same Problem in "Gogobot"

Page 8: Processing Big Data in Realtime

8

Page 9: Processing Big Data in Realtime

9

Gogobot Checkin Heat-Map Service

Let's Develop the "Gogobot Checkins Heat-Map"

Page 10: Processing Big Data in Realtime

10

Key Notes

● Collector Service – collects checkins as text addresses
– We need to use a GeoLocation Service

● Upon each elapsed interval, the latest locations list is displayed as a Heat-Map in the GUI.

● Web-scale service – tens of thousands of checkins/second all over the world (imaginary, but let's do it for the exercise).

● Accuracy – sample data, NOT critical data.
– Proportionately representative
– Data volume is large enough to compensate for data loss.

Page 11: Processing Big Data in Realtime

11

Heat-Map Context

[Diagram: within the Gogobot system, Gogobot micro-services send text-address checkins to the Checkins Heat-Map Service; the service calls the Geo Location Service (Get-GeoCode(Address)) and exposes the last interval's locations as a Heat-Map.]

Page 12: Processing Big Data in Realtime

12

Plan A

[Diagram: simulate checkins with a file (Check-in #1, #2, #3, ...) → read text address → process checkins → GET geo location from the Geo Location Service → persist checkin intervals to the Database.]

Page 13: Processing Big Data in Realtime

13

Tons of Addresses Arriving Every Second

Page 14: Processing Big Data in Realtime

14

Architect - First Reaction...

Page 15: Processing Big Data in Realtime

15

Second Reaction...

Page 16: Processing Big Data in Realtime

16

Developer - First Reaction...

Page 17: Processing Big Data in Realtime

17

Second Reaction...

Page 18: Processing Big Data in Realtime

18

Problems?

● Tedious: spend time configuring where to send messages, deploying workers, and deploying intermediate queues.

● Brittle: there's little fault-tolerance.

● Painful to scale: partitioning the running worker/s is complicated.

Page 19: Processing Big Data in Realtime

19

What We Want?

● Horizontal scalability
● Fault-tolerance
● No intermediate message brokers!
● Higher level abstraction than message passing
● "Just works"
● Guaranteed data processing (not in this case)

Page 20: Processing Big Data in Realtime

20

Apache Storm

✔Horizontal scalability

✔Fault-tolerance

✔No intermediate message brokers!

✔Higher level abstraction than message passing

✔“Just works”

✔Guaranteed data processing

Page 21: Processing Big Data in Realtime

21

Anatomy of Storm

Page 22: Processing Big Data in Realtime

22

What is Storm?

● CEP – open source, distributed realtime computation system.
– Makes it easy to reliably process unbounded streams of tuples.
– Does for realtime processing what Hadoop did for batch processing.

● Fast – 1M tuples/sec per node.
– It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.

Page 23: Processing Big Data in Realtime

23

Streams

Unbounded sequence of tuples

[Diagram: Tuple → Tuple → Tuple → ...]

Page 24: Processing Big Data in Realtime

24

Spouts

Sources of streams

[Diagram: a spout emitting tuples]

Page 25: Processing Big Data in Realtime

25

Bolts

Process input streams and produce new streams

[Diagram: a bolt consuming input tuples and emitting new tuples]

Page 26: Processing Big Data in Realtime

26

Storm Topology

Network of spouts and bolts

[Diagram: spouts feeding streams of tuples into a network of bolts]

Page 27: Processing Big Data in Realtime

27

Guarantee for Processing

● Storm guarantees the full processing of a tuple by tracking its state.
● In case of failure, Storm can re-process it.
● Source tuples with fully "acked" trees are removed from the system (see the sketch below).
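To show how this tuple tracking is driven from user code, here is a minimal sketch (not from the original deck) of a bolt that anchors its output to the input tuple and then acks it, assuming the pre-Apache backtype.storm packages of that era:

import java.util.Map;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Hypothetical bolt, for illustration only.
public class AnchoringBolt extends BaseRichBolt {
  private OutputCollector collector;

  @Override
  public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
    this.collector = collector;
  }

  @Override
  public void execute(Tuple input) {
    // Anchoring: the emitted tuple joins the input tuple's tree,
    // so Storm can replay the source tuple if anything downstream fails.
    collector.emit(input, new Values(input.getString(0).toUpperCase()));
    // Acking: once every tuple in the tree is acked,
    // the source tuple is removed from the system.
    collector.ack(input);
  }

  @Override
  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("upper"));
  }
}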

Page 28: Processing Big Data in Realtime

28

Tasks (Bolt/Spout Instance)

Spouts and bolts execute as many tasks across the cluster

Page 29: Processing Big Data in Realtime

29

Stream Grouping

When a tuple is emitted, which task (instance) does it go to?

Page 30: Processing Big Data in Realtime

30

Stream Grouping

● Shuffle grouping: pick a random task
● Fields grouping: consistent hashing on a subset of tuple fields
● All grouping: send to all tasks
● Global grouping: pick the task with the lowest id

(See the wiring sketch below.)
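To make the groupings concrete, here is a minimal wiring sketch (not from the deck) showing how each grouping is declared on a TopologyBuilder. It reuses the deck's own spout and bolts, which are defined later in the deck; the "broadcast" and "single" bolts are hypothetical additions purely to illustrate all- and global-grouping.

import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class GroupingExamples {
  public static void main(String[] args) {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("checkins", new CheckinsSpout());

    // Shuffle grouping: each checkin goes to a random GeocodeLookupBolt task
    builder.setBolt("geocode-lookup", new GeocodeLookupBolt())
           .shuffleGrouping("checkins");

    // Fields grouping: tuples with the same "city" value always reach the same task
    builder.setBolt("heatmap-builder", new HeatMapBuilderBolt())
           .fieldsGrouping("geocode-lookup", new Fields("city"));

    // All grouping (not used in this deck): every task receives every tuple
    builder.setBolt("broadcast", new HeatMapBuilderBolt())
           .allGrouping("geocode-lookup");

    // Global grouping (not used in this deck): all tuples go to the lowest-id task
    builder.setBolt("single", new PersistorBolt())
           .globalGrouping("heatmap-builder");
  }
}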

Page 31: Processing Big Data in Realtime

31

Tasks, Executors, Workers

[Diagram: a Worker Process is a JVM; each Executor is a thread inside it; an Executor runs one or more Tasks, and each Task is an instance of a Spout/Bolt.]

Page 32: Processing Big Data in Realtime

32

[Diagram: two Nodes, each running a Supervisor and a Worker Process; each Worker Process hosts Executors running Spout A, Bolt B, and Bolt C tasks.]

Page 33: Processing Big Data in Realtime

33

Storm Architecture

[Diagram: Nimbus ↔ ZooKeeper nodes ↔ Supervisors; the Heat-Map topology is uploaded/rebalanced through Nimbus.]

Nimbus – master node (similar to the Hadoop JobTracker). NOT critical for a running topology.

Page 34: Processing Big Data in Realtime

34

Storm Architecture

ZooKeeper – a few nodes, used for cluster coordination.

Page 35: Processing Big Data in Realtime

35

Storm Architecture

Supervisors – run the worker processes.

Page 36: Processing Big Data in Realtime

36

Assembling Heatmap Topology

Page 37: Processing Big Data in Realtime

37

HeatMap Input/Output Tuples

● Input tuple: timestamp and text address:
– (9:00:07 PM, "287 Hudson St New York NY 10013")

● Output tuple: time interval and a list of points for it:
– (9:00:00 PM to 9:00:15 PM, List((40.719,-73.987), (40.726,-74.001), (40.719,-73.987)))

(A small interval helper sketch follows below.)
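The deck does not show how a checkin's timestamp is mapped to its interval; here is a minimal sketch of one obvious way to do it. The 15-second interval length is just the value from the example above, not something the deck fixes.

public class IntervalUtil {
  // Floor a checkin timestamp (millis) to the start of its interval.
  public static long intervalStart(long timestampMillis, long intervalMillis) {
    return timestampMillis - (timestampMillis % intervalMillis);
  }

  public static void main(String[] args) {
    long intervalMillis = 15 * 1000L;           // 15-second intervals, as in the example
    long checkinTime = System.currentTimeMillis();
    long key = intervalStart(checkinTime, intervalMillis);
    System.out.println("Checkin at " + checkinTime + " belongs to interval " + key);
  }
}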

Page 38: Processing Big Data in Realtime

38

Heat Map Storm Topology

[Diagram: Checkins Spout → GeocodeLookup Bolt → HeatmapBuilder Bolt → Persistor Bolt.
Example flow: (9:01 PM @ 287 Hudson St) → (9:01 PM, (40.736, -74.354)) → upon elapsed interval → (9:00 PM – 9:15 PM, List((40.73, -74.34), (51.36, -83.33), (69.73, -34.24)))]

Page 39: Processing Big Data in Realtime

39

Checkins Spout

public class CheckinsSpout extends BaseRichSpout {
  private List<String> sampleLocations;
  private int nextEmitIndex;
  private SpoutOutputCollector outputCollector;

  @Override
  public void open(Map map, TopologyContext topologyContext,
                   SpoutOutputCollector spoutOutputCollector) {
    this.outputCollector = spoutOutputCollector;
    this.nextEmitIndex = 0;
    // We hold state here - no need for thread safety
    sampleLocations = IOUtils.readLines(
        ClassLoader.getSystemResourceAsStream("sample-locations.txt"));
  }

  @Override
  public void nextTuple() {
    // Called iteratively by Storm
    String address = sampleLocations.get(nextEmitIndex);
    String checkin = new Date().getTime() + "@ADDRESS:" + address;
    outputCollector.emit(new Values(checkin));
    nextEmitIndex = (nextEmitIndex + 1) % sampleLocations.size();
  }

  @Override
  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    // Declare output fields
    declarer.declare(new Fields("str"));
  }
}

Page 40: Processing Big Data in Realtime

40

Geocode Lookup Bolt

public class GeocodeLookupBolt extends BaseBasicBolt {
  private LocatorService locatorService;

  @Override
  public void prepare(Map stormConf, TopologyContext context) {
    locatorService = new GoogleLocatorService();
  }

  @Override
  public void execute(Tuple tuple, BasicOutputCollector outputCollector) {
    String str = tuple.getStringByField("str");
    String[] parts = str.split("@");
    Long time = Long.valueOf(parts[0]);
    String address = parts[1];

    // Get geocode, create DTO
    LocationDTO locationDTO = locatorService.getLocation(address);
    if (locationDTO != null)
      outputCollector.emit(new Values(time, locationDTO));
  }

  @Override
  public void declareOutputFields(OutputFieldsDeclarer fieldsDeclarer) {
    fieldsDeclarer.declare(new Fields("time", "location"));
  }
}
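The LocatorService, GoogleLocatorService and LocationDTO types are not shown in the deck; the following is only a guess at their minimal shape, so the bolt above has something concrete to compile against.

// Hypothetical shapes - the deck does not show these types.
public interface LocatorService {
  LocationDTO getLocation(String textAddress);
}

public class LocationDTO implements java.io.Serializable {  // serializable so it can travel between tasks
  private final double latitude;
  private final double longitude;
  private final String city;

  public LocationDTO(double latitude, double longitude, String city) {
    this.latitude = latitude;
    this.longitude = longitude;
    this.city = city;
  }

  public double getLatitude()  { return latitude; }
  public double getLongitude() { return longitude; }
  public String getCity()      { return city; }
}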

Page 41: Processing Big Data in Realtime

41

Tick Tuple – Repeating Mantra

Page 42: Processing Big Data in Realtime

42

Two Streams to Heat-Map Builder

On tick tuple, we flush our Heat-Map.

[Diagram: the checkin stream (Checkin 1, 4, 5, 6, ...) and the tick-tuple stream both feed the HeatMap-Builder Bolt.]

Page 43: Processing Big Data in Realtime

43

Tick Tuple in Action

public class HeatMapBuilderBolt extends BaseBasicBolt {
  // Holds the latest intervals
  private Map<String, List<LocationDTO>> heatmaps;

  @Override
  public Map<String, Object> getComponentConfiguration() {
    Config conf = new Config();
    // Tick interval
    conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 60);
    return conf;
  }

  @Override
  public void execute(Tuple tuple, BasicOutputCollector outputCollector) {
    if (isTickTuple(tuple)) {
      // Emit accumulated intervals
    } else {
      // Add check-in info to the current interval in the Map
    }
  }

  private boolean isTickTuple(Tuple tuple) {
    return tuple.getSourceComponent().equals(Constants.SYSTEM_COMPONENT_ID)
        && tuple.getSourceStreamId().equals(Constants.SYSTEM_TICK_STREAM_ID);
  }
}
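The two branches of execute() are left as comments in the deck. One possible, purely hypothetical way to fill them in (it assumes the LocationDTO sketch above, a 15-second interval as in the earlier example, and that the bolt declares the fields "time-interval", "city", "locationsList" consumed by the Persistor Bolt):

// Inside HeatMapBuilderBolt (hypothetical fill-in, not from the deck):

private void addCheckin(Tuple tuple) {
  Long time = tuple.getLongByField("time");
  LocationDTO location = (LocationDTO) tuple.getValueByField("location");
  long interval = time - (time % (15 * 1000L));              // floor to the interval start
  String key = interval + "@" + location.getCity();          // one heat-map per interval+city
  heatmaps.computeIfAbsent(key, k -> new ArrayList<LocationDTO>()).add(location);
}

private void emitAccumulatedIntervals(BasicOutputCollector outputCollector) {
  for (Map.Entry<String, List<LocationDTO>> entry : heatmaps.entrySet()) {
    String[] parts = entry.getKey().split("@");
    outputCollector.emit(new Values(Long.valueOf(parts[0]), parts[1], entry.getValue()));
  }
  heatmaps.clear();                                           // start a fresh window
}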

Page 44: Processing Big Data in Realtime

44

Persistor Bolt

public class PersistorBolt extends BaseBasicBolt {
  private Jedis jedis;
  private ObjectMapper objectMapper;   // declaration not shown on the original slide

  @Override
  public void execute(Tuple tuple, BasicOutputCollector outputCollector) {
    Long timeInterval = tuple.getLongByField("time-interval");
    String city = tuple.getStringByField("city");
    String locationsList = objectMapper.writeValueAsString(
        tuple.getValueByField("locationsList"));

    String dbKey = "checkins-" + timeInterval + "@" + city;

    // Persist in Redis for 24h
    jedis.setex(dbKey, 3600 * 24, locationsList);

    // Publish on a Redis channel for debugging
    jedis.publish("location-key", dbKey);
  }
}
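For completeness, a hypothetical consumer-side sketch (not in the deck) showing how the key written by the Persistor Bolt could be read back from Redis with Jedis; the host and the interval value are made up for illustration.

import redis.clients.jedis.Jedis;

public class IntervalReader {
  public static void main(String[] args) {
    Jedis jedis = new Jedis("localhost");
    // The key format written by the PersistorBolt: "checkins-<interval>@<city>"
    String dbKey = "checkins-" + 1397580300000L + "@New York";    // made-up interval value
    String locationsJson = jedis.get(dbKey);                      // null if the interval expired
    System.out.println(dbKey + " -> " + locationsJson);
    jedis.close();
  }
}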

Page 45: Processing Big Data in Realtime

45

Transforming the Tuples

[Diagram: Sample Checkins File (Check-in #1, #2, #3, ...) → read text addresses → Checkins Spout → shuffle grouping → GeocodeLookup Bolt (gets geo location from the Geo Location Service) → fields grouping on "city" (group by city) → HeatmapBuilder Bolt → shuffle grouping → Persistor Bolt → Database.]

Page 46: Processing Big Data in Realtime

46

Heat Map Topology

public class LocalTopologyRunner {

  public static void main(String[] args) throws Exception {
    TopologyBuilder builder = buildTopology();
    StormSubmitter.submitTopology(
        "local-heatmap", new Config(), builder.createTopology());
  }

  private static TopologyBuilder buildTopology() {
    TopologyBuilder builder = new TopologyBuilder();

    builder.setSpout("checkins", new CheckinsSpout());

    builder.setBolt("geocode-lookup", new GeocodeLookupBolt())
           .shuffleGrouping("checkins");

    builder.setBolt("heatmap-builder", new HeatMapBuilderBolt())
           .fieldsGrouping("geocode-lookup", new Fields("city"));

    builder.setBolt("persistor", new PersistorBolt())
           .shuffleGrouping("heatmap-builder");

    return builder;
  }
}
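Despite its name, the class above submits through StormSubmitter; for a purely local test run one would typically use LocalCluster instead. A minimal sketch of that alternative (an assumption, not shown in the deck; it also assumes buildTopology() is made accessible):

import backtype.storm.Config;
import backtype.storm.LocalCluster;

public class LocalClusterRunner {
  public static void main(String[] args) throws Exception {
    LocalCluster cluster = new LocalCluster();                  // in-process "cluster" for testing
    cluster.submitTopology("local-heatmap", new Config(),
        LocalTopologyRunner.buildTopology().createTopology());  // assumes buildTopology() is visible
    Thread.sleep(60_000);                                       // let the topology run for a minute
    cluster.killTopology("local-heatmap");
    cluster.shutdown();
  }
}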

Page 47: Processing Big Data in Realtime

47

It's NOT Scaled

Page 48: Processing Big Data in Realtime

48

Page 49: Processing Big Data in Realtime

49

Scaling the Topology

public class LocalTopologyRunner {

  public static void main(String[] args) throws Exception {
    TopologyBuilder builder = buildTopology();

    Config conf = new Config();
    // Set no. of workers
    conf.setNumWorkers(2);

    StormSubmitter.submitTopology(
        "local-heatmap", conf, builder.createTopology());
  }

  private static TopologyBuilder buildTopology() {
    TopologyBuilder builder = new TopologyBuilder();

    // Parallelism hints per spout/bolt; setNumTasks increases tasks for future rebalancing
    builder.setSpout("checkins", new CheckinsSpout(), 4);

    builder.setBolt("geocode-lookup", new GeocodeLookupBolt(), 8)
           .shuffleGrouping("checkins")
           .setNumTasks(64);

    builder.setBolt("heatmap-builder", new HeatMapBuilderBolt(), 4)
           .fieldsGrouping("geocode-lookup", new Fields("city"));

    builder.setBolt("persistor", new PersistorBolt(), 2)
           .shuffleGrouping("heatmap-builder")
           .setNumTasks(4);

    return builder;
  }
}

Page 50: Processing Big Data in Realtime

50

Demo

Page 51: Processing Big Data in Realtime

51

Recap – Plan A

[Diagram: Sample Checkins File (Check-in #1, #2, #3, ...) → read text address → Storm Heat-Map Topology → GET geo location from the Geo Location Service → persist checkin intervals to the Database.]

Page 52: Processing Big Data in Realtime

52

We have something working

Page 53: Processing Big Data in Realtime

53

Add Kafka Messaging

Page 54: Processing Big Data in Realtime

54

Plan B – Kafka Spout & Bolt to HeatMap

[Diagram: checkins read from text addresses are published to the Checkin Kafka Topic → Kafka Checkins Spout → GeocodeLookup Bolt (Geo Location Service) → HeatmapBuilder Bolt → Persistor Bolt (Database) → Kafka Locations Bolt → Locations Topic.]

Page 55: Processing Big Data in Realtime

55

Page 56: Processing Big Data in Realtime

56

They are all good – but not for all use-cases

Page 57: Processing Big Data in Realtime

57

Kafka – A Little Introduction

Page 58: Processing Big Data in Realtime

58

Page 59: Processing Big Data in Realtime

59

Pub-Sub Messaging System

Page 60: Processing Big Data in Realtime

60

Page 61: Processing Big Data in Realtime

61

Page 62: Processing Big Data in Realtime

62

Page 63: Processing Big Data in Realtime

63

Page 64: Processing Big Data in Realtime

64

Doesn't Fear the File System

Page 65: Processing Big Data in Realtime

65

Page 66: Processing Big Data in Realtime

66

Page 67: Processing Big Data in Realtime

67

Page 68: Processing Big Data in Realtime

68

Topics

● Logical collections of partitions (the physical files).
● A broker contains some of the partitions for a topic. (See the producer sketch below.)
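As a concrete illustration of publishing to a topic, here is a minimal producer sketch (not from the deck) using the Kafka 0.8-era Java producer API, which matches the kafka.serializer.StringEncoder used later in the deck; the broker address and topic name are assumptions.

import java.util.Properties;
import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class CheckinProducer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("metadata.broker.list", "localhost:9092");            // assumed broker address
    props.put("serializer.class", "kafka.serializer.StringEncoder");

    Producer<String, String> producer =
        new Producer<String, String>(new ProducerConfig(props));

    // Publish one checkin line to the (assumed) checkins topic
    String checkin = System.currentTimeMillis() + "@ADDRESS:287 Hudson St New York NY 10013";
    producer.send(new KeyedMessage<String, String>("checkins-topic", checkin));

    producer.close();
  }
}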

Page 69: Processing Big Data in Realtime

69

A Partition is Consumed by Exactly One Group's Consumer

Page 70: Processing Big Data in Realtime

70

Distributed & Fault-Tolerant

Page 71: Processing Big Data in Realtime

[Diagram sequence (slides 71–83): a Kafka cluster of brokers coordinated by ZooKeeper, with two producers and two consumers. A fourth broker is added and later removed, and one consumer goes away; ZooKeeper coordinates the cluster throughout.]

Page 84: Processing Big Data in Realtime

84

Performance Benchmark

1 Broker, 1 Producer, 1 Consumer

Page 85: Processing Big Data in Realtime

85

Page 86: Processing Big Data in Realtime

86

Page 87: Processing Big Data in Realtime

87

Add Kafka to our Topology

public class LocalTopologyRunner {
  ...

  private static TopologyBuilder buildTopology() {
    ...
    // Kafka spout
    builder.setSpout("checkins", new KafkaSpout(kafkaConfig));
    ...
    // Kafka bolt
    builder.setBolt("kafkaProducer",
        new KafkaOutputBolt("localhost:9092",
            "kafka.serializer.StringEncoder", "locations-topic"))
        .shuffleGrouping("persistor");

    return builder;
  }
}
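The deck does not show how kafkaConfig is built; one common way to do it with the storm-kafka module of that era looks roughly like this (the ZooKeeper address, topic name, zkRoot and consumer id are assumptions):

import backtype.storm.spout.SchemeAsMultiScheme;
import storm.kafka.BrokerHosts;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;

public class KafkaSpoutConfigExample {
  public static KafkaSpout buildCheckinsSpout() {
    BrokerHosts hosts = new ZkHosts("localhost:2181");                 // assumed ZooKeeper address
    SpoutConfig kafkaConfig = new SpoutConfig(
        hosts, "checkins-topic", "/kafka-checkins", "checkins-spout"); // topic, zkRoot, consumer id
    kafkaConfig.scheme = new SchemeAsMultiScheme(new StringScheme());  // emit messages as strings
    return new KafkaSpout(kafkaConfig);
  }
}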

Page 88: Processing Big Data in Realtime

88

Plan C – Add Reactor

[Diagram: a Checkin HTTP Reactor receives text-address checkins and publishes them to the Checkin Kafka Topic; the Storm Heat-Map Topology consumes the checkins, gets geo locations from the Geo Location Service, persists checkin intervals to the Database, indexes interval locations in a Search Server, and publishes the interval key to the Locations Kafka Topic.]

Page 89: Processing Big Data in Realtime

89

Why Reactor ?

Page 90: Processing Big Data in Realtime

90

C10K Problem

Page 91: Processing Big Data in Realtime

91

2008: Thread Per Request/Response

Page 92: Processing Big Data in Realtime

92

Reactor Pattern Paradigm

The application registers handlers... events trigger those handlers.

Page 93: Processing Big Data in Realtime

93

Reactor Pattern – Key Points

● Single thread / single event loop
● EVERYTHING runs on it
● You MUST NOT block the event loop
● Many implementations (partial list):
– Node.js (JavaScript), EventMachine (Ruby), Twisted (Python)... and Vert.X

(A toy event-loop sketch follows below.)
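To make "everything runs on one event loop" concrete, here is a toy, hypothetical sketch (not from the deck, and far simpler than any real reactor): handlers are queued and executed one at a time on a single thread, so a blocking handler stalls everything queued behind it.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class ToyEventLoop {
  private final BlockingQueue<Runnable> events = new LinkedBlockingQueue<Runnable>();

  // Handlers are registered by posting them onto the loop's queue
  public void post(Runnable handler) {
    events.add(handler);
  }

  // The single event-loop thread: runs one handler at a time, forever
  public void run() throws InterruptedException {
    while (true) {
      events.take().run();   // if this handler blocks, every later handler waits
    }
  }

  public static void main(String[] args) throws InterruptedException {
    ToyEventLoop loop = new ToyEventLoop();
    loop.post(() -> System.out.println("fast handler 1"));
    loop.post(() -> System.out.println("fast handler 2"));
    loop.run();
  }
}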

Page 94: Processing Big Data in Realtime

94

Reactor Pattern Problems

● Some work is naturally blocking:
– Intensive data crunching
– 3rd-party blocking APIs (e.g. JDBC)

● A pure reactor (e.g. Node.js) is not a good fit for this kind of work!

Page 95: Processing Big Data in Realtime

95

Page 96: Processing Big Data in Realtime

96

Vert.X Architecture

[Diagram: multiple Vert.X instances connected through the Event Bus.]

Page 97: Processing Big Data in Realtime

97

Vert.X Goodies

● Growing module repository
● Web server
● Persistors (Mongo, JDBC, ...)
● Work queue
● Authentication manager
● Session manager
● Socket.IO
● TCP/SSL servers/clients
● HTTP/HTTPS servers/clients
● WebSockets support
● SockJS support
● Timers
● Buffers
● Streams and Pumps
● Routing
● Asynchronous File I/O

Page 98: Processing Big Data in Realtime

98

Node.js vs Vert.X

Page 99: Processing Big Data in Realtime

99

Node.js vs Vert.X

● Node.js
– JavaScript only
– Inherently single threaded
– Doesn't help much with IPC
– All code MUST be in the event loop

● Vert.X
– Polyglot (JavaScript, Java, Ruby, Python...)
– Leverages JVM multi-threading
– Event bus as a "nervous system"
– Blocking work can be done off the event loop

Page 100: Processing Big Data in Realtime

100

Node.js vs Vert.X Benchmark

AMD Phenom II X6 (6 core), 8GB RAM, Ubuntu 11.04

http://vertxproject.wordpress.com/2012/05/09/vert-x-vs-node-js-simple-http-benchmarks/

Page 101: Processing Big Data in Realtime

101

HeatMap Reactor Architecture

[Diagram: inside a Vert.X instance, an HTTPServer Verticle sends messages over the Event Bus to a Kafka module, which automatically forwards EventBus messages to a Kafka Topic consumed by the Storm Topology.]

Page 102: Processing Big Data in Realtime

102

Heat-Map Server – Only 6 LOC!

var vertx = require('vertx');
var container = require('vertx/container');
var console = require('vertx/console');
var config = container.config;

vertx.createHttpServer().requestHandler(function(request) {
  request.dataHandler(function(buffer) {
    // Send checkin to the Vert.X EventBus
    vertx.eventBus.send(config.address, {payload: buffer.toString()});
  });
  request.response.end();
}).listen(config.httpServerPort, config.httpServerHost);

console.log("HTTP CheckinsReactor started on port " + config.httpServerPort);

Page 103: Processing Big Data in Realtime

103

[Diagram – final architecture: a Checkin HTTP Firehose publishes checkins to the Checkin HTTP Reactor, which publishes them to the Checkin Kafka Topic; the Storm Heat-Map Topology consumes the checkins, gets geo locations from the Geo Location Service, persists checkin intervals to the Database, indexes interval locations in the Search Server, and publishes interval keys to the Hotzones Kafka Topic; a Web App consumes the interval keys, gets the interval locations, and pushes them to browsers via WebSocket.]

Page 104: Processing Big Data in Realtime

104

Demo

Page 105: Processing Big Data in Realtime

105

Summary

Page 106: Processing Big Data in Realtime

106

When You Go Out to a Salsa Club

● Good Music

● Crowded

Page 107: Processing Big Data in Realtime

107

More Conclusions...

● Storm – great for real-time Big Data processing; complementary to Hadoop batch jobs.

● Kafka – great messaging for logs/events data; serves as a good "source" for a Storm spout.

● Vert.X – worth trying and evaluating as a reactor implementation.

Page 108: Processing Big Data in Realtime

108

Thanks