atmosphere 2014: when storm hits data. data streams processing in real time - marcin stanislawski

Upload: proidea

Post on 16-Apr-2017

WHEN STORM HITS DATA.
DATA STREAMS PROCESSING IN REAL TIME.

MARCIN STANISLAWSKI

WHO AM I?
Architect/Developer at Interia.pl
Storm and Hadoop user
GitHub: webik
Twitter: @unilama

BIG DATA

HADOOP

WELCOME TO THE ZOO

RUN JOB

COFFEE BREAK*

RESULTS* - there are some solutions

IMPALA

implemented in C++
non-MapReduce solution

KIJI

KijiREST
HDFS/HBase/Cassandra

BATCH PROCESSING VS. STREAMING

STREAMING SOLUTIONS
Yahoo S4
Akka
Spark Streaming
Storm

STORM - WHAT IS THAT?

README.MD
Storm is a distributed realtime computation system.
Storm is simple, can be used with any programming language, and is a lot of fun to use!

CURRENT STATUS
Apache Incubation
Included in Hortonworks Data Platform
Contributed by Yahoo
Easy deploy to Amazon EC2

WHO USES

BASIC IDEA

SPOUTS
TAKE EVENTS FROM:

Kafka
Kestrel
RabbitMQ
...

AND PASS THEM TO...

BOLTS
TUPLES ARE PROCESSED IN THE WAY THAT YOU IMPLEMENT

EVENTS ARE TUPLES
( 1, "TEST", "ATMOSPHERE", "2014-05-20 10:00:40", ... )
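
The spout-to-bolt idea above can be sketched in plain Python, without any Storm API. Everything here is an assumption made for illustration: the names sample_spout and count_by_type_bolt are hypothetical, the events are hard-coded instead of being read from a queue, and the field layout only mirrors the example tuple on the slide.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple

# (id, type, source, timestamp): field layout assumed from the slide's example tuple
Event = Tuple[int, str, str, str]


def sample_spout() -> Iterator[Event]:
    """Hypothetical spout: yields hard-coded events instead of reading Kafka etc."""
    yield (1, "TEST", "ATMOSPHERE", "2014-05-20 10:00:40")
    yield (2, "TEST", "ATMOSPHERE", "2014-05-20 10:00:41")
    yield (3, "PROD", "ATMOSPHERE", "2014-05-20 10:00:42")


def count_by_type_bolt(events: Iterable[Event]) -> Counter:
    """Hypothetical bolt: the processing logic is whatever you implement,
    here a count of events per type (second field)."""
    counts = Counter()
    for _event_id, event_type, _source, _timestamp in events:
        counts[event_type] += 1
    return counts


counts = count_by_type_bolt(sample_spout())
print(dict(counts))  # {'TEST': 2, 'PROD': 1}
```

In a real topology the spout and the bolt run as separate tasks, possibly on different machines, and tuples travel between them over the network; the function call here only shows the direction of the data flow.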

OBJECTS ARE SERIALIZED USING KRYO

WRITTEN IN JAVA & CLOJURE
TOPOLOGIES ARE DAGS

ARCHITECTURE
Nimbus
Nodes (Supervisors)
UI
DRPC

EACH EVENT IS PROCESSED ONE OR MORE TIMES.

ACKING FRAMEWORK
Each tuple must be acked or failed

TUPLES TRACKING
each tuple has a random 64-bit id
Storm keeps the xor of all tuple ids that have been created and/or acked in the tree
if that xor value equals 0, the tuple tree is fully processed
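
The xor trick above works because every tuple id is folded into the tracked value exactly twice, once when the tuple is created and once when it is acked, and x ^ x == 0. A minimal sketch follows, assuming a single tracker object per tuple tree; the class AckTracker and the helper new_tuple_id are hypothetical names, not Storm's API, and the seeded generator is only there to make the example repeatable.

```python
import random

rng = random.Random(2014)  # seeded only so the sketch is repeatable


def new_tuple_id() -> int:
    """Random non-zero 64-bit id (hypothetical helper)."""
    return rng.getrandbits(64) or 1


class AckTracker:
    """Tracks one tuple tree with a single 64-bit value."""

    def __init__(self) -> None:
        self.ack_val = 0

    def created(self, tuple_id: int) -> None:
        self.ack_val ^= tuple_id  # id enters the value

    def acked(self, tuple_id: int) -> None:
        self.ack_val ^= tuple_id  # same id cancels itself out

    def fully_processed(self) -> bool:
        return self.ack_val == 0


tracker = AckTracker()

root = new_tuple_id()
tracker.created(root)                      # spout emits the root tuple

child_a, child_b = new_tuple_id(), new_tuple_id()
tracker.created(child_a)                   # a bolt emits two tuples anchored to root
tracker.created(child_b)
tracker.acked(root)                        # ...and acks its input
done_after_root = tracker.fully_processed()

tracker.acked(child_a)                     # downstream bolts ack the children
tracker.acked(child_b)
done_at_end = tracker.fully_processed()

print(done_after_root, done_at_end)
```

The memory cost is constant per tree regardless of how many tuples it contains, which is the point of the design. A false "fully processed" is possible only if unrelated ids happen to xor to zero, which is extremely unlikely with random 64-bit values.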

COMMUNICATION
Between:

Tasks: LMAX Disruptor
Workers: ØMQ -> Netty

TRIDENT
high-level abstraction
similar to Cascading/Scalding in the Hadoop world

SPOUT
Key difference - producing Stream(s)

STREAM
Batches chain with multiplication ability

STREAM OPERATIONS
Functions
Filters
Projections
Joins
Merges
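
Three of the operations listed above can be illustrated on a single batch represented as a list of dicts. This is a sketch of the semantics only, under the assumption that a function appends new fields to each tuple, a filter drops tuples, and a projection keeps a subset of fields; the helper names each_function, each_filter and project are hypothetical and are not Trident's Java API.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def each_filter(batch: List[Row], keep: Callable[[Row], bool]) -> List[Row]:
    """Filter: keep only the tuples for which the predicate holds."""
    return [row for row in batch if keep(row)]


def each_function(batch: List[Row], fn: Callable[[Row], Row]) -> List[Row]:
    """Function: append the fields returned by fn to every tuple."""
    return [{**row, **fn(row)} for row in batch]


def project(batch: List[Row], fields: List[str]) -> List[Row]:
    """Projection: keep only the named fields."""
    return [{name: row[name] for name in fields} for row in batch]


batch = [
    {"word": "storm", "count": 3},
    {"word": "hadoop", "count": 0},
]

result = project(
    each_function(
        each_filter(batch, lambda row: row["count"] > 0),
        lambda row: {"length": len(row["word"])},
    ),
    ["word", "length"],
)
print(result)  # [{'word': 'storm', 'length': 5}]
```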

STATE
Operations:

Grouping
Aggregate
Query

STATE TYPES
non-transactional
transactional
opaque transactional
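
The difference between the three state types shows up when a batch is replayed. The sketch below uses an in-memory counter and assumes each batch carries a transaction id (txid); the class names are hypothetical. The transactional variant assumes a replayed txid always carries the same batch, so it can skip it; the opaque variant assumes the contents may differ on replay, so it recomputes from the previous value.

```python
class NonTransactionalCount:
    """No replay protection: a replayed batch is counted again."""

    def __init__(self) -> None:
        self.value = 0

    def apply(self, txid: int, batch_count: int) -> None:
        self.value += batch_count


class TransactionalCount:
    """Stores the last txid next to the value and skips replays of it."""

    def __init__(self) -> None:
        self.value = 0
        self.txid = None

    def apply(self, txid: int, batch_count: int) -> None:
        if txid == self.txid:
            return  # same txid means same batch: already applied
        self.value += batch_count
        self.txid = txid


class OpaqueCount:
    """Also stores the previous value, so a replay with different
    contents can be recomputed instead of skipped."""

    def __init__(self) -> None:
        self.prev = 0
        self.value = 0
        self.txid = None

    def apply(self, txid: int, batch_count: int) -> None:
        if txid == self.txid:
            self.value = self.prev + batch_count  # redo from the previous value
        else:
            self.prev = self.value
            self.value += batch_count
            self.txid = txid


plain, transactional, opaque = NonTransactionalCount(), TransactionalCount(), OpaqueCount()

# Batch 1 is delivered twice (a replay), then batch 2 once.
for txid, batch_count in [(1, 5), (1, 5), (2, 3)]:
    plain.apply(txid, batch_count)
    transactional.apply(txid, batch_count)
    opaque.apply(txid, batch_count)

print(plain.value, transactional.value, opaque.value)  # 13 8 8
```

The stronger guarantees cost extra stored data per key: one txid for transactional state, a txid plus the previous value for opaque transactional state.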

STATE
In-memory state
NoSQL databases
External systems via APIs

DRPC

DRPC TOPOLOGY
NAMED DRPC SPOUT

USES MAIN TOPOLOGY STATES
GENERATES ONE TUPLE OUTPUT

DRPC ELEMENTS
THRIFT SERVER(S)

WITH PREDEFINED SPOUT AND BOLT

ARE YOU PROGRAMMING IN A NON-JVM LANGUAGE?
NO PROBLEM :)

Ruby
Python
Perl
PHP
...

STREAMING API
API defined as Thrift
JSON based communication
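
The JSON-based communication can be sketched as follows. The sketch assumes the framing used by Storm's multi-language protocol as I understand it: each message is a JSON document followed by a line containing only "end", an incoming tuple carries id, comp, stream, task and tuple fields, and the component answers with emit and ack commands. The initial handshake is left out, the helper names are hypothetical, and in-memory buffers stand in for the stdin/stdout pipes of a real shell component.

```python
import io
import json


def write_message(stream, message: dict) -> None:
    """Write one message: a JSON document followed by a line containing only 'end'."""
    stream.write(json.dumps(message) + "\nend\n")


def read_message(stream) -> dict:
    """Read lines up to the next 'end' line and parse them as one JSON document."""
    lines = []
    for line in stream:
        line = line.rstrip("\n")
        if line == "end":
            break
        lines.append(line)
    return json.loads("\n".join(lines))


def split_sentence_bolt(in_stream, out_stream) -> None:
    """Hypothetical bolt: emits one tuple per word, anchored to the input, then acks it."""
    incoming = read_message(in_stream)
    for word in incoming["tuple"][0].split():
        write_message(out_stream, {
            "command": "emit",
            "anchors": [incoming["id"]],
            "tuple": [word],
        })
    write_message(out_stream, {"command": "ack", "id": incoming["id"]})


# Simulate what the parent worker process would send to the bolt.
to_bolt = io.StringIO()
write_message(to_bolt, {
    "id": "42", "comp": "sentences", "stream": "default",
    "task": 1, "tuple": ["storm hits data"],
})
to_bolt.seek(0)

from_bolt = io.StringIO()
split_sentence_bolt(to_bolt, from_bolt)
from_bolt.seek(0)

replies = [read_message(from_bolt) for _ in range(4)]
print([reply["command"] for reply in replies])  # ['emit', 'emit', 'emit', 'ack']
```

Because the protocol is only text over pipes, any language that can read stdin and write JSON can implement a spout or bolt.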

RED STORM
Writing topologies in Ruby

REAL TIME ALGORITHMS

SIMPLE OPERATIONS
Sum
Count
Multiplication

MAXIMUM AND MINIMUM
don't lose current value
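
In a stream these aggregates have to be updated one value at a time, and the current minimum and maximum must be kept between updates because the earlier values are no longer available. A minimal sketch, assuming a single in-memory accumulator; the class name RunningStats is hypothetical.

```python
class RunningStats:
    """Incremental count, sum, minimum and maximum over a stream of numbers."""

    def __init__(self) -> None:
        self.count = 0
        self.total = 0.0
        self.minimum = None  # None until the first value arrives
        self.maximum = None

    def update(self, value: float) -> None:
        self.count += 1
        self.total += value
        # Compare against the stored extremes; overwriting them blindly
        # would lose the current value.
        if self.minimum is None or value < self.minimum:
            self.minimum = value
        if self.maximum is None or value > self.maximum:
            self.maximum = value


stats = RunningStats()
for reading in [4.0, 9.5, 1.5, 7.0]:
    stats.update(reading)

print(stats.count, stats.total, stats.minimum, stats.maximum)  # 4 22.0 1.5 9.5
```

If the accumulator lives only in a bolt's memory, a worker restart resets it, so in practice the current values would be persisted in some external state.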

USUALLY TWO TOPOLOGIES

LEARNING
Classification
Clustering

MODEL
Evaluator
Visualiser

BASIC ELEMENT TABLE

SIMPLE EXAMPLE

ALGORITHM EXAMPLES
k-means clustering
statistical tests (T, F, Z, Chi2)
Hidden Markov Models
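
As an illustration of the first item, k-means can be run incrementally: each arriving point moves its nearest centroid a little towards itself. The sketch below uses the sequential (online) k-means update, which is one possible streaming variant and not necessarily the one meant in the talk; the function names and the starting centroids are assumptions.

```python
from typing import List

Point = List[float]


def nearest(centroids: List[Point], point: Point) -> int:
    """Index of the centroid with the smallest squared Euclidean distance."""
    distances = [
        sum((c - p) ** 2 for c, p in zip(centroid, point))
        for centroid in centroids
    ]
    return distances.index(min(distances))


def update(centroids: List[Point], counts: List[int], point: Point) -> int:
    """Assign the point to its nearest centroid and move that centroid
    towards the point by a step of 1 / (points seen by that cluster)."""
    i = nearest(centroids, point)
    counts[i] += 1
    rate = 1.0 / counts[i]
    centroids[i] = [c + rate * (p - c) for c, p in zip(centroids[i], point)]
    return i


# Assumed starting state: two centroids, each counted as one observation.
centroids = [[0.0, 0.0], [10.0, 10.0]]
counts = [1, 1]

assignments = [update(centroids, counts, point)
               for point in ([1.0, 1.0], [9.0, 9.0])]

print(assignments)  # [0, 1]
print(centroids)    # [[0.5, 0.5], [9.5, 9.5]]
```

Only the centroids and the per-cluster counts are stored, so memory use does not grow with the number of points seen.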

ADVERT TIME :)

STORMUNIT
http://github.com/webik/StormUnit

MAVEN MOJO - COMING SOON :)
http://github.com/webik/storm-maven

WHAT NEXT...

SUMMINGBIRD
Write once, run on:

Storm
Hadoop (Scalding)
Amazon Kinesis

MAYBE BACK INTO THE ZOO
STORM YARN

THANK YOU.

QUESTIONS?