A Practical Guide to Selecting a Stream Processing Technology
Michael G. Noll, Product Manager, Confluent

Upload: confluent

Post on 08-Jan-2017



Kafka Talk Series

Date    Title
Sep 27  Introduction to Streaming Data and Stream Processing with Apache Kafka
Oct 06  Deep Dive into Apache Kafka
Oct 27  Data Integration with Apache Kafka
Nov 17  Demystifying Stream Processing with Apache Kafka
Dec 01  A Practical Guide to Selecting a Stream Processing Technology
Dec 15  Streaming in Practice: Putting Apache Kafka in Production

https://www.confluent.io/apache-kafka-talk-series

Agenda

• Recap: What is Stream Processing?
• The Three Pillars of Stream Processing in Practice
• Key Selection Criteria
  • Organizational/Non-Technical Dimensions
  • Technical Dimensions
• Summary


Powered by Kafka (thousands more)

Spark Streaming API (2.0)

Kafka's Streams API (0.10)

Example: Streams and Tables in Kafka

Word Count

hello   2
kafka   1
world   1
…       …
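
The word-count table above can be read as the running aggregation of a stream of words. The following is a minimal Python sketch of that idea, not Kafka's API: the function name `word_count_changelog` and the sample input are assumptions chosen only to reproduce the counts shown.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple


def word_count_changelog(words: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Consume a stream of words and emit one table update per record."""
    table: Counter = Counter()          # the "table": latest count per word
    for word in words:
        table[word] += 1
        yield word, table[word]         # each update is itself a stream record


# Hypothetical input stream chosen to reproduce the counts shown above.
stream = ["hello", "kafka", "hello", "world"]
updates = list(word_count_changelog(stream))

# Replaying the update stream rebuilds the table (latest value per key wins).
table: Dict[str, int] = {}
for key, count in updates:
    table[key] = count

print(updates)
print(table)
```

Aggregating the stream yields the table, and the table's updates form a stream again. This is why the two abstractions are usually needed together.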

Streams & Databases

• A stream processing technology must have first-class support for Streams and Tables
  • With scalability, fault tolerance, …
• Why? Because most use cases require not just one, but both!
• Support – or lack thereof – strongly impacts the resulting technical architecture and development efforts
• No support means:
  • Painful Do-It-Yourself
  • Increased complexity, more moving pieces to juggle


Organizational/Non-Tech Dimensions

• Can your org understand and leverage the technology?
  • Familiarity with languages; intuitive concepts and APIs; trainings
• Are you permitted to use it in your organization?
  • Security features, licensing, open source vs. proprietary
• Can you continue to use it in the future?
  • Longevity of technology, licensing, vendor strength

Organizational/Non-Tech Dimensions

• Do you believe in the long-term vision?
  • Switching technologies in an organization is often expensive/slow: legacy migration, re-training, resistance to change, etc.
• What is the path and time to success?
  • Can you move smoothly and quickly from proof-of-concept to production?
• Areas and range of applicability in your organization
  • General-purpose vs. niche technology
  • Viable for S/M/L/XL use cases vs. for XL use cases only
  • Building core business apps vs. doing backend analytics

Organizational/Non-Tech Dimensions

• Licensing
• Vision/Roadmap
• ROI
• Impact on Organization
• Broad vs. Niche Applicability
• Time to Market
• Professional Services
• Documentation
• Examples
• User Community
• Learning Curve
• Impact on Tools, Infrastructure, …


Technical Dimensions

• Reprocessing
• Scalability & Elasticity
• Fault Tolerance
• API
• Dev/Ops Lifecycle
• Security
• Processing Model
• Out of Order Data
• Abstractions
• Time Model
• Windowing
• State


State

• Stateful processing of any kind requires…state
  • Many (most?) use cases for stream processing are stateful
  • Joins, aggregations, windowing, counting, ...
• Is state performant? Local vs. remote state?
• Is state fault-tolerant? How fast is recovery/failover?
• Is state interactively queryable?
  • Kafka: ready for use (GA)
  • Spark, Flink: under development (alpha)
  • Storm, Samza, and others: not available
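
One common way to combine fast local state with fault tolerance is to back the local store with an append-only changelog and replay that log on recovery. The sketch below illustrates the idea under simplifying assumptions: `LocalStateStore` is a hypothetical class, and a plain Python list stands in for a durable, replicated log.

```python
from typing import Dict, List, Tuple


class LocalStateStore:
    """In-process key-value state backed by an append-only changelog."""

    def __init__(self, changelog: List[Tuple[str, int]]) -> None:
        self.changelog = changelog            # stand-in for a durable log
        self.state: Dict[str, int] = {}
        for key, value in changelog:          # recovery: replay log into state
            self.state[key] = value

    def put(self, key: str, value: int) -> None:
        self.changelog.append((key, value))   # log first, then update locally
        self.state[key] = value

    def get(self, key: str, default: int = 0) -> int:
        return self.state.get(key, default)   # local read, no network hop


durable_log: List[Tuple[str, int]] = []
store = LocalStateStore(durable_log)
for user in ["alice", "bob", "alice"]:
    store.put(user, store.get(user) + 1)

# Simulate a crash: the instance is lost, only the changelog survives.
recovered = LocalStateStore(durable_log)
print(recovered.get("alice"), recovered.get("bob"))
```

Recovery time in this design grows with the length of the changelog. That is one reason "how fast is recovery/failover?" is worth asking of any candidate technology.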


Abstractions

• What are the data model and the available abstractions?
• Most common abstraction: stream of records, events
  • Kafka, Spark, Storm, Samza, Flink, Apex, ...
• New, very powerful: table of records
  • Currently unique to Kafka
  • Represents latest state and materialized views
  • State must have a first-class abstraction because, as we just saw in the previous section, state is crucial for stream processing!


Time Model

• Different use cases require different time semantics
• Great majority of use cases require event-time semantics
• Other use cases may require processing-time (e.g. real-time monitoring) or special variants like ingestion-time
• A stream processing technology should, at a minimum, support event-time to cover most use cases in practice
  • Examples: Kafka, Beam, Flink
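
To see why the choice of time semantics matters, the following sketch counts the same events per one-minute bucket, once by event-time and once by processing-time. The `Event` type, its field names, and the sample timestamps are assumptions made up for illustration.

```python
from collections import Counter
from typing import List, NamedTuple


class Event(NamedTuple):
    user: str
    event_time: int       # seconds: when the event happened
    arrival_time: int     # seconds: when the processor saw it


def count_per_minute(events: List[Event], use_event_time: bool) -> Counter:
    """Count events per one-minute bucket under the chosen time semantics."""
    counts: Counter = Counter()
    for e in events:
        ts = e.event_time if use_event_time else e.arrival_time
        counts[ts // 60] += 1
    return counts


events = [
    Event("alice", event_time=10, arrival_time=12),
    Event("bob", event_time=50, arrival_time=65),    # delayed in transit
    Event("dave", event_time=70, arrival_time=71),
]

by_event_time = count_per_minute(events, use_event_time=True)
by_processing_time = count_per_minute(events, use_event_time=False)
print(dict(by_event_time), dict(by_processing_time))
```

Bob's delayed event lands in minute 0 under event-time but in minute 1 under processing-time, so the two semantics report different per-minute counts for identical input.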


Windowing

• Windowing is an operation that groups events
• Most commonly needed: time windows, session windows
• Examples:
  • Real-time monitoring: 5-minute averages
  • Reader behavior on a website: user browsing sessions

[Figure: input data plotted along event-time and processing-time axes; colors represent different users' events (alice, bob, dave), and rectangles denote different event-time windows.]
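
Session windows can be sketched as follows: events from the same user belong to one session as long as the gap between consecutive events stays within an inactivity threshold. This is a simplified batch version that sorts by event-time first; the function name, the 60-second gap, and the sample clicks are assumptions for illustration.

```python
from typing import Dict, List, Tuple

Event = Tuple[str, int]            # (user, event-time in seconds)
Session = Tuple[int, int, int]     # (start, end, number of events)


def session_windows(events: List[Event], gap: int) -> Dict[str, List[Session]]:
    """Group each user's events into sessions split by more than `gap` seconds."""
    sessions: Dict[str, List[Session]] = {}
    for user, ts in sorted(events, key=lambda e: e[1]):
        user_sessions = sessions.setdefault(user, [])
        if user_sessions and ts - user_sessions[-1][1] <= gap:
            start, _, n = user_sessions[-1]
            user_sessions[-1] = (start, ts, n + 1)   # extend the open session
        else:
            user_sessions.append((ts, ts, 1))        # start a new session
    return sessions


clicks = [("alice", 0), ("alice", 20), ("bob", 25), ("alice", 400), ("bob", 30)]
result = session_windows(clicks, gap=60)
print(result)
```

A streaming engine cannot sort its whole input up front. It has to build and merge sessions incrementally as events arrive, which is where engine support for windowing pays off.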


Out-of-order and late-arriving data

• Is very common in practice, not a rare corner case
  • Related to time model discussion

[Figure: (1) Users with mobile phones enter airplane, lose Internet connectivity. (2) Emails are being written during the 10h flight. (3) Internet connectivity is restored, phones will send queued emails now.]

• We want control over how out-of-order data is handled
  • Example:
    • We process data in 5-minute windows, e.g. compute statistics
    • When event arrives 1 minute late: update the original result!
    • When event arrives 2 hours late: discard it!
• Handling must be efficient because it happens so often
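
The update-or-discard policy from the example above can be sketched with a grace period per window. Everything here is an assumption for illustration: the 10-minute grace period, the function name, and measuring lateness against the highest event-time seen so far rather than against the wall clock.

```python
from typing import Dict, List, Tuple

WINDOW = 5 * 60        # 5-minute tumbling windows, in seconds
GRACE = 10 * 60        # hypothetical: accept events up to 10 minutes late


def count_with_grace(event_times: List[int]) -> Tuple[Dict[int, int], int]:
    """Count events per window, given event-times in arrival order.

    Returns (count per window start, number of discarded events).
    """
    counts: Dict[int, int] = {}
    stream_time = 0        # highest event-time seen so far
    discarded = 0
    for event_time in event_times:
        stream_time = max(stream_time, event_time)
        window_start = event_time - event_time % WINDOW
        if window_start + WINDOW + GRACE <= stream_time:
            discarded += 1                 # window closed for good: drop it
            continue
        counts[window_start] = counts.get(window_start, 0) + 1   # update result
    return counts, discarded


# 250 arrives 1 minute late (after 310); 800 arrives 2 hours late (after 8000).
arrivals = [10, 200, 310, 250, 8000, 800]
counts, discarded = count_with_grace(arrivals)
print(counts, discarded)
```

The slightly late event updates the count of its original window, while the very late one is dropped because its window closed long ago.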


Reprocessing

• Re-process data by rewinding a stream back in time
• Use cases in practice include
  • Correcting output data after fixing a bug
  • Facilitate iterative and explorative development
  • A/B testing
  • Processing historical data
  • Walking through "What If?" scenarios
• Also: often used behind-the-scenes for fault tolerance
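
Rewinding works when the input is kept in a log that consumers read by position instead of consuming it destructively. The following is a minimal sketch with hypothetical `Log` and `Consumer` classes (not a real client API), showing output being corrected after a bug fix.

```python
from typing import Callable, List


class Log:
    """Append-only log whose records stay available after being read."""

    def __init__(self) -> None:
        self.records: List[str] = []

    def append(self, record: str) -> None:
        self.records.append(record)


class Consumer:
    def __init__(self, log: Log) -> None:
        self.log = log
        self.offset = 0                      # position of the next record

    def seek(self, offset: int) -> None:
        self.offset = offset                 # rewind (or skip ahead)

    def run(self, fn: Callable[[str], int]) -> List[int]:
        out = [fn(r) for r in self.log.records[self.offset:]]
        self.offset = len(self.log.records)
        return out


log = Log()
for line in ["a b", "c", "d e f"]:
    log.append(line)

consumer = Consumer(log)
buggy = consumer.run(lambda line: len(line))           # bug: counts characters
consumer.seek(0)                                       # rewind to the beginning
fixed = consumer.run(lambda line: len(line.split()))   # fix: counts words
print(buggy, fixed)
```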


Scalability, Elasticity, Fault Tolerance

• Can the technology scale according to your needs?
  • Desired latency, throughput?
  • Able to process millions of messages per second?
  • What is the minimum footprint?
• Expand/shrink capacity dynamically during operations?
  • Helps with resource utilization because most stream apps run continuously
• Resilience and fault tolerance
  • Which guarantees for data delivery and for state? "At-least-once", "exactly-once", "effectively-once", etc.
  • Failover behavior and recovery time? Automated or manual?
  • Any negative impact of fault tolerance features on performance?
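
One way to read "effectively-once" is at-least-once delivery combined with idempotent processing, so that redelivered messages do not change the result. The sketch below contrasts a naive update with a deduplicating one. The message layout, the function names, and the in-memory set of seen ids are all assumptions for illustration.

```python
from typing import Dict, List, Set, Tuple

Message = Tuple[int, str, int]     # (message id, account, amount)


def apply_naively(messages: List[Message]) -> Dict[str, int]:
    """Apply every delivered message, duplicates included."""
    balances: Dict[str, int] = {}
    for _msg_id, account, amount in messages:
        balances[account] = balances.get(account, 0) + amount
    return balances


def apply_idempotently(messages: List[Message]) -> Dict[str, int]:
    """Apply each message id at most once, skipping redeliveries."""
    balances: Dict[str, int] = {}
    seen: Set[int] = set()             # ids of messages already applied
    for msg_id, account, amount in messages:
        if msg_id in seen:
            continue                   # redelivered duplicate: skip
        seen.add(msg_id)
        balances[account] = balances.get(account, 0) + amount
    return balances


# At-least-once delivery: message 2 is redelivered after a simulated retry.
delivered = [(1, "alice", 10), (2, "bob", 5), (2, "bob", 5), (3, "alice", 1)]
naive = apply_naively(delivered)
deduped = apply_idempotently(delivered)
print(naive, deduped)
```

In a real system the set of seen ids is itself state that must survive failures, which ties the delivery guarantee back to the guarantees offered for state.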


Security

• To meet internal security policies, legal compliance, etc.
• Typical base requirements for stream processing applications:
  • Encrypt data-in-transit (e.g. from/to Kafka)
  • Authentication: "only some applications may talk to production"
  • Authorization: "access to sensitive data such as PII is restricted"
• The easier it is to use security features, the more likely they are actually being used in practice


Processing Model

• True stream processing is record-at-a-time processing
  • Benefits include low latency (millisecs), dealing efficiently with out-of-order data
  • Can provide both low latency and high throughput via internal optimizations
  • Examples: Kafka, Storm, Samza, Flink, Beam
• Some processing technologies opt for (micro)batching
  • Micro-batching has no true benefits: consider it a technical workaround to shoehorn stream-like functionality into a tool
  • Suffers from significant overhead when dealing with e.g. out-of-order/late-arriving data, when performing windowed analyses (e.g. session windows)
  • Typically a strong blocker for use cases such as fraud detection or anything where "a few seconds" of latency is prohibitive
  • Examples: Spark, Storm (Trident), Hadoop*


API

• Choice of API is a subjective matter – skills, preference, …
• Typical options
  • Declarative, expressive API: operations like map(), filter()
  • Imperative, lower-level API: callbacks like process(event)
  • Streaming SQL: STREAM SELECT … FROM … WHERE …
  • In the best case you get not just one, but all three
• "Abstractions are great!"
• "Abstractions considered harmful!"
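
The difference between the first two API styles can be illustrated with the same logic written twice: keep the even numbers and double them. The function and class names are invented for illustration and do not belong to any framework's API.

```python
from typing import Callable, Iterable, Iterator, List


# Declarative style: compose operations such as filter() and map().
def declarative(events: Iterable[int]) -> Iterator[int]:
    return map(lambda x: x * 2, filter(lambda x: x % 2 == 0, events))


# Imperative style: a callback invoked once per event, forwarding explicitly.
class Processor:
    def __init__(self, forward: Callable[[int], None]) -> None:
        self.forward = forward

    def process(self, event: int) -> None:
        if event % 2 == 0:
            self.forward(event * 2)


events = [1, 2, 3, 4]
declarative_out = list(declarative(events))

imperative_out: List[int] = []
processor = Processor(imperative_out.append)
for event in events:
    processor.process(event)

print(declarative_out, imperative_out)
```

Both produce the same output. The declarative version states what to compute, while the callback version gives explicit control over when and what to forward.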


Developer/Operations Lifecycle

• What should your daily work look and feel like?
  • "I like to do quick, iterative development" (modify/test/repeat)
  • "I want to decouple team roadmaps, project schedules"
• Big difference between App Model <-> Cluster Model
  • Testing, packaging, deployment, monitoring, operations
  • "Do I need to know Java (app) or YARN (cluster) for this?"
  • "I want reactive processing in containers that run on Mesos!"
• Rolling, no-downtime upgrades?
• Integration with existing Ops infra, tools, processes?


Summary

• What we covered is a good starting point
• But, no free lunch!
  • Understand what you need, and weigh criteria appropriately
  • Think end-to-end: idea, development, operations, troubleshooting
  • Think big-picture: future use cases, architecture, security, training, …
  • Do your own internal hackathons, proof-of-concepts
  • Do your own benchmarks
• If in doubt: simplicity beats complexity
  • Faster to learn, easier to understand, less likely to fail, …

Q&A Session

Coming Up Next

Date    Title                                                       Speaker
Dec 15  Streaming in Practice: Putting Apache Kafka in Production   Roger Hoover

https://www.confluent.io/apache-kafka-talk-series