Apache Kafka and Real-Time Stream Processing
Gwen Shapira, System Architect
Confluent, @gwenshap
I’ll tell you about
• What is stream processing and why it matters
• What is Apache Kafka
• How Kafka helps stream processing
Stay awake for this part
What is Stream Processing?
Data Processing Paradigm
Request/Response
Batch
Stream Processing
Stream Processing Paradigm
• Data is generated at its own rate as “Streams”
• We can process as much or as little as we want
• Continuously
• Results are available in real time
• But nothing waits for specific results
• Time for data availability?
  • More than “few ms”
  • Less than “hours”
This is the world-changing bit
• Most of the business is…
  • Not urgent enough to require immediate response
  • But can’t wait for the next day
• “Streams of events” represents something fundamental
• Same way relational tables are fundamental
Ok, got the streams part. But what about Apache Kafka?
A cross between a messaging system and a file system
Kafka is all about LOGS
If you understand logs, you understand Kafka
Redo Log:
The most crucial structure for recovery operations … store all changes made to the database as they occur.
Important Point
The redo log is the only reliable source of information about current state of the database.
But logs are also a STREAM of events, and Kafka stores those logs
Allowing us to read the past and keep getting updates in the future
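To make "read the past and keep getting updates" concrete, here is a minimal sketch of an append-only log in plain Python. It is a toy illustration, not Kafka's API: the `Log` class, its `append` and `read_from` methods, and the sample records are all hypothetical names chosen for this example.

```python
class Log:
    """A toy append-only log: records are only ever added at the end."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Store a record and return its offset (its position in the log)."""
        self._records.append(record)
        return len(self._records) - 1

    def read_from(self, offset):
        """Return every record at or after the given offset, in order."""
        return self._records[offset:]


log = Log()
log.append("account 1 created")
log.append("account 1 credited 50")

# A new reader starts at offset 0 and sees the whole past.
history = log.read_from(0)
next_offset = len(history)

# Later a new event arrives; the same reader resumes where it left off.
log.append("account 1 debited 20")
updates = log.read_from(next_offset)
```

The reader, not the log, remembers the offset. That is what lets many independent readers replay the same history at their own pace.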
Stream Processing
Read a stream, modify it, output another stream
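The read, modify, output pattern can be sketched with plain Python generators. This is only an illustration of the idea under assumed data: the `parse` and `large_purchases` functions, the "user,amount" record format, and the threshold are hypothetical, and no Kafka client is involved.

```python
def parse(lines):
    """Read: turn raw 'user,amount' lines into event dictionaries."""
    for line in lines:
        user, amount = line.split(",")
        yield {"user": user, "amount": int(amount)}


def large_purchases(events, threshold):
    """Modify: keep only events whose amount reaches the threshold."""
    for event in events:
        if event["amount"] >= threshold:
            yield event


# The input stream could be unbounded; generators process one event at a time.
input_stream = iter(["alice,30", "bob,120", "alice,250"])

# Output: another stream, produced lazily as input events arrive.
output_stream = large_purchases(parse(input_stream), threshold=100)
result = list(output_stream)
```

Each stage consumes a stream and produces a stream, so stages can be chained without any stage waiting for the full input.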
Example: CDC-based ETL
If we use Kafka for CDC, does it mean it is ACID?
Stream Processing is Important
Kafka is a collection of logs.
How does Kafka help with stream processing?
First, how do we actually do stream processing?
Method 1: Do it yourself (Hipster stream processing)
Method 2: The Stream Processing Frameworks
• Storm
• Spark
• Flink
• Samza
• Apex
• NiFi
• StreamBase
• InfoSphere Streams
• Google Dataflow (AKA Beam)
• I can go on for 5 more pages…
Few of those are really popular!
• Pro: They handle some hard problems
• Con: It can be too complex
What do I mean by too complex?
[Diagram: an example architecture with many components. Labels: Clients (web app, “Swipe here!”), Kafka, Flume agents, Hadoop Cluster I, Hadoop Cluster II (storage, processing), Spark Streaming, HBase/memory, local cache, HDFS, Hive/Impala, Map/Reduce, Spark, Solr (search), HDFS event sink, Solr sink. Activities: fetching and updating profiles; adjusting NRT stats; batch-time adjustments; automated and manual analytical adjustments and pattern detection; automated and manual review of NRT changes and counters.]
Why so many moving parts?
We needed…
• HBase to handle complex state
• Spark requires HDFS
• Ingest layer
• Batch layer to handle re-calculations
What we really wanted was…
[Diagram labels: Inputs, Kafka, Processor, Output]
Enter Kafka Streams
3 Simplifications:
1. Uses Kafka
2. No Framework
3. Unify Tables and Streams
Doesn’t all stream processing use Kafka?
We use Kafka for… Partitioning, Scalability, Fault Tolerance
[Diagram: Kafka read by two consumer groups. Group A has three consumers (A, A, A); Group B has two consumers (B, B).]
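A rough sketch of how partitioning and consumer groups split the work, in plain Python. The assumptions are loud ones: `zlib.crc32` stands in for Kafka's real key hashing, the round-robin `assign` function is a simplified stand-in for Kafka's partition assignment, and the member names and partition count are hypothetical.

```python
import zlib

NUM_PARTITIONS = 3  # hypothetical topic with three partitions


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Map a key to a partition; the same key always lands on the same one.

    crc32 is only a deterministic stand-in for Kafka's own hash function.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign(partitions, members):
    """Spread partitions over one group's members, round-robin.

    Within a group, each partition is owned by exactly one member.
    """
    assignment = {member: [] for member in members}
    for i, partition in enumerate(partitions):
        assignment[members[i % len(members)]].append(partition)
    return assignment


partitions = list(range(NUM_PARTITIONS))

# Each group independently covers every partition, so each group sees all data.
group_a = assign(partitions, ["a1", "a2", "a3"])  # one partition per member
group_b = assign(partitions, ["b1", "b2"])        # b1 takes two, b2 takes one
```

Scaling a group means adding members until there is one per partition. Fault tolerance in this picture means re-running the assignment with the surviving members when one fails.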
No Framework
• It is just a library that does transformations
• We can add languages on top
• Kafka does everything we needed the framework to do
• You don’t need a “framework” to run queries, so why do you need one to run queries continuously?
The really important bit: Streams meet Tables
Streams: Things that happen. Events.
Tables: State of things as they are.
Databases: Only states.
Streams: Only events.
We can convert tables to streams and back:
Stream -> Apply -> Table
Table -> Change Capture -> Stream
This is called Table-Stream Duality.
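The two conversions can be sketched in a few lines of plain Python. This is an illustration of the duality, not the Kafka Streams API: `apply_stream`, `change_capture`, and the sample data are hypothetical, and deletions are left out to keep the sketch short.

```python
def apply_stream(events):
    """Stream -> Apply -> Table: replay (key, value) events; latest value wins."""
    table = {}
    for key, value in events:
        table[key] = value
    return table


def change_capture(before, after):
    """Table -> Change Capture -> Stream: one event per new or changed key."""
    return [(key, value) for key, value in after.items()
            if before.get(key) != value]


# A stream of "user moved to city" events becomes a table of current cities.
events = [("alice", "Paris"), ("bob", "Lima"), ("alice", "Rome")]
table = apply_stream(events)

# The table changes: bob moves and carol appears.
updated = dict(table, bob="Oslo", carol="Cairo")

# Capturing those changes gives a stream again...
changes = change_capture(table, updated)

# ...and replaying the full stream rebuilds the updated table.
rebuilt = apply_stream(events + changes)
```

The table is a snapshot of the stream at one point in time, and the stream is the table's full history. Either can be derived from the other.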
Streams and tables sometimes work the same. And sometimes are very different. Kafka Streams handles both.
But… where do streams come from?
We really like streams, so we created a Stream Data Platform
Where can we learn more?
• http://www.confluent.io/blog
• http://kafka.apache.org/documentation.html
• http://docs.confluent.io/current