C* Summit 2013: Real-Time Big Data with Storm, Cassandra, and In-Memory Computing by DeWayne Filppi


Upload: planet-cassandra

Post on 01-Nov-2014

Category:

Technology


DESCRIPTION

This session will describe how to resolve the processing limitations by placing the streaming and data store interfaces in-memory as well, through an in-memory computing platform, and also how to resolve the complexity challenge by implementing a DevOps approach that abstracts all the underlying infrastructure and provides single-click management of all the application tiers and services, on any environment (private/public cloud, bare metal…). And the best news is that all this optimization can be implemented seamlessly, with no code change to your apps.

TRANSCRIPT

Page 1: C* Summit 2013: Real-Time Big Data with Storm, Cassandra, and In-Memory Computing by Dewayne Filppi

Real Time Big Data With Storm, Cassandra, and In-Memory Computing

DeWayne Filppi @dfilppi

Page 2

Big Data Predictions

“Over the next few years we'll see the adoption of scalable frameworks and platforms for handling streaming, or near real-time, analysis and processing. In the same way that Hadoop has been borne out of large-scale web applications, these platforms will be driven by the needs of large-scale location-aware mobile, social and sensor use.”

Edd Dumbill, O’REILLY

2 ® Copyright 2013 Gigaspaces Ltd. All Rights Reserved

Page 3

The Two Vs of Big Data

§ Velocity
§ Volume

Page 4

We’re Living in a Real Time World…

§ Homeland Security
§ Real Time Search
§ Social
§ eCommerce
§ User Tracking & Engagement
§ Financial Services

Page 5

The Flavors of Big Data Analytics

§ Counting
§ Correlating
§ Research

Page 6

Analytics @ Twitter – Counting

§ How many signups, tweets, retweets for a topic?
§ What’s the average latency?
§ Demographics
  § Countries and cities
  § Gender
  § Age groups
  § Device types
  § …

Page 7

Analytics @ Twitter – Correlating

§ What devices fail at the same time?
§ What features get users hooked?
§ What places on the globe are “happening”?

Page 8

Analytics @ Twitter – Research

§ Sentiment analysis
  § “Obama is popular”
§ Trends
  § “People like to tweet after watching American Idol”
§ Spam patterns
  § How can you tell when a user spams?

Page 9

It’s All about Timing

“Real time” (< few seconds)
Reasonably quick (seconds - minutes)
Batch (hours/days)

Page 10

It’s All about Timing

• Event driven / stream processing
• High resolution – every tweet gets counted

• Ad-hoc querying
• Medium resolution (aggregations)

• Long running batch jobs (ETL, map/reduce)
• Low resolution (trends & patterns)

This is what we’re here to discuss :)

Page 11

VELOCITY + VAST VOLUME = IN MEMORY + BIG DATA

Page 12

In Memory Data Grid Review

§ RAM is the new disk
§ Data partitioned across a cluster
  § Large “virtual” memory space
§ Transactional
§ Highly available
§ Code collocated with data.

Page 13

Data Grid + Cassandra: A Complete Solution

• Data flows through the in-memory cluster and is written asynchronously to Cassandra
• Side effects calculated
• Filtering an option
• Enrichment an option
• Results instantly available
• Internal and external event listeners notified
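The asynchronous flow on this slide can be pictured as a write-behind store. The following is only a minimal sketch under stated assumptions: `WriteBehindGrid`, its `persist` callback, and the dictionary standing in for a Cassandra column family are all hypothetical names for illustration, not the XAP API.

```python
import queue
import threading


class WriteBehindGrid:
    """In-memory store that persists writes asynchronously (write-behind)."""

    def __init__(self, persist):
        self._data = {}                  # the in-memory "grid" partition
        self._pending = queue.Queue()    # writes waiting to reach the backing store
        self._persist = persist          # hypothetical callback, e.g. a Cassandra insert
        self._listeners = []
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def add_listener(self, callback):
        self._listeners.append(callback)

    def write(self, key, value):
        self._data[key] = value          # result is readable immediately
        self._pending.put((key, value))  # persistence happens later
        for callback in self._listeners:
            callback(key, value)         # notify event listeners

    def read(self, key):
        return self._data.get(key)

    def flush(self):
        """Block until every queued write has been persisted."""
        self._pending.join()

    def _drain(self):
        while True:
            key, value = self._pending.get()
            try:
                self._persist(key, value)
            finally:
                self._pending.task_done()


backing_store = {}   # stand-in for a Cassandra column family
events = []
grid = WriteBehindGrid(persist=backing_store.__setitem__)
grid.add_listener(lambda k, v: events.append(k))
grid.write("tweet:1", "hello storm")
in_memory_value = grid.read("tweet:1")   # available before persistence completes
grid.flush()
```

The caller sees the value as soon as `write` returns, while the slower store catches up in the background; filtering or enrichment could be added inside `write` before the value is queued.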

Page 14

Simplified Event Flow

Page 15

Grid – Cassandra Interface

§ Hector and CQL based interface
§ In memory data must be mapped to column families.
  § Configurable class to column family mapping
§ Must serialize individual fields
  § Fixed fields can use defined types
  § Variable fields (for schemaless in-memory mode) need serializers
§ Object model flattening
  § By default, nested fields are flattened.
  § Can be overridden by custom serializer.
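To make "object model flattening" concrete, here is a small sketch of turning a nested object into a single level of columns. The dotted column names and the `flatten` helper are assumptions chosen for illustration; the slide does not say how the XAP interface actually names flattened columns.

```python
def flatten(obj, prefix="", sep="."):
    """Flatten nested dicts into a single level of column-name -> value pairs."""
    columns = {}
    for name, value in obj.items():
        column = f"{prefix}{sep}{name}" if prefix else name
        if isinstance(value, dict):
            # Nested object: recurse and prefix its fields with the parent name.
            columns.update(flatten(value, column, sep))
        else:
            columns[column] = value
    return columns


tweet = {"id": 42, "user": {"name": "dfilppi", "location": {"country": "US"}}}
row = flatten(tweet)
# row == {"id": 42, "user.name": "dfilppi", "user.location.country": "US"}
```

A custom serializer, as the slide mentions, would replace this default by writing the nested value into one column instead of several.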

Page 16

Virtues and Limitations

§ Could be faster: high availability has a cost
§ Complex flows not easy to assemble or understand with simple event handlers

§ Complete stack, not just two tools of many
§ Fast.
  § Microsecond latencies for in memory operations
  § Fast enough for almost anybody
§ Highly available/self healing
§ Elastic

Page 17

Storm Background

§ Popular open source, real time, in-memory, streaming computation platform.
§ Includes distributed runtime and intuitive API for defining distributed processing flows.
§ Scalable and fault tolerant.
§ Developed at BackType, and open sourced by Twitter

Page 18

Storm Abstractions

§ Streams
  § Unbounded sequence of tuples
§ Spouts
  § Source of streams (Queues)
§ Bolts
  § Functions, Filters, Joins, Aggregations
§ Topologies

Page 19

Streaming word count with Storm

§ Storm has a simple builder interface for creating stream processing topologies
§ Storm delegates persistence to external providers
  § Cassandra, because of its write performance, is commonly used
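The shape of a streaming word count can be sketched without Storm itself. The example below is a plain-Python analogy, not Storm's builder API: generator functions play the roles of a spout and two bolts, and a `Counter` stands in for the external state (for example a Cassandra counter table). All function names are hypothetical.

```python
from collections import Counter


def sentence_spout(sentences):
    """Spout: emits one tuple per input sentence."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(stream):
    """Bolt: splits each sentence tuple into word tuples."""
    for (sentence,) in stream:
        for word in sentence.lower().split():
            yield (word,)


def count_bolt(stream, state):
    """Bolt: updates a running count per word and emits (word, count)."""
    for (word,) in stream:
        state[word] += 1
        yield (word, state[word])


counts = Counter()   # stand-in for external state such as a Cassandra counter table
topology = count_bolt(
    split_bolt(sentence_spout(["storm is fast", "Cassandra is fast"])),
    counts,
)
emitted = list(topology)   # drive the stream to completion
```

Chaining the generators mirrors wiring a topology: each stage consumes the stream produced by the previous one, and only the final bolt touches persistent state.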

Page 20

Storm: Optimistic Processing

§ Storm (quite rationally) assumes success is normal
§ Storm uses batching and pipelining for performance
§ Therefore the spout must be able to replay tuples on demand in case of error.
§ Any kind of quasi-queue-like data source can be fashioned into a spout.
§ No persistence is ever required, and speed is attained by minimizing network hops during topology processing.
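The replay requirement can be sketched as a spout that remembers every emitted tuple until it is acknowledged. This is a simplified stand-in written in plain Python: the method names loosely mirror Storm's spout callbacks (`nextTuple`, `ack`, `fail`), but `ReplayableSpout`, `run`, and `flaky` are hypothetical and nothing here uses the Storm library.

```python
class ReplayableSpout:
    """Spout that keeps emitted tuples until acked so failures can be replayed."""

    def __init__(self, source):
        self._source = list(source)   # stand-in for a queue-like data source
        self._next_id = 0
        self._in_flight = {}          # message id -> tuple awaiting ack
        self._replay = []             # message ids scheduled for re-emission

    def next_tuple(self):
        """Return (message_id, tuple), or None when nothing is left to emit."""
        if self._replay:
            msg_id = self._replay.pop(0)
            return msg_id, self._in_flight[msg_id]
        if self._next_id < len(self._source):
            msg_id = self._next_id
            self._next_id += 1
            self._in_flight[msg_id] = self._source[msg_id]
            return msg_id, self._in_flight[msg_id]
        return None

    def ack(self, msg_id):
        self._in_flight.pop(msg_id, None)   # fully processed: forget it

    def fail(self, msg_id):
        if msg_id in self._in_flight:
            self._replay.append(msg_id)     # processing failed: emit again


def run(spout, process):
    """Drive the spout, acking successes and failing errors."""
    while (emitted := spout.next_tuple()) is not None:
        msg_id, tup = emitted
        try:
            process(tup)
            spout.ack(msg_id)
        except RuntimeError:
            spout.fail(msg_id)


seen = []
failed_once = set()


def flaky(tup):
    # Fail the first attempt at "b" to force one replay.
    if tup == "b" and "b" not in failed_once:
        failed_once.add("b")
        raise RuntimeError("transient error")
    seen.append(tup)


spout = ReplayableSpout(["a", "b", "c"])
run(spout, flaky)
```

Because success is the common case, the bookkeeping is a dictionary insert and delete per tuple; the cost of replay is paid only when something actually fails.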

Page 21

Fast. Want to go faster?

§ Eliminate non-memory components
  § Replace the disk based queue with a reliable in-memory queue
  § Replace disk based state persistence with in-memory persistence
  § Asynchronously update disk based state (C*)

Page 22

Sample Architecture

Page 23

References

§ Try the Cloudify recipe
  § Download Cloudify: http://www.cloudifysource.org/
  § Download the recipe (apps/xapstream, services/xapstream):
    – https://github.com/CloudifySource/cloudify-recipes
§ XAP – Cassandra interface details:
  § http://wiki.gigaspaces.com/wiki/display/XAP95/Cassandra+Space+Persistency
§ Check out the source for the XAP Spout and a sample state implementation backed by XAP, and a Storm friendly streaming implementation on github:
  § https://github.com/Gigaspaces/storm-integration
§ For more background on the effort, check out my recent blog posts at http://blog.gigaspaces.com/
  § http://blog.gigaspaces.com/gigaspaces-and-storm-part-1-storm-clouds/
  § http://blog.gigaspaces.com/gigaspaces-and-storm-part-2-xap-integration/
  § Part 3 coming soon.

Page 25

Twitter Storm With Cassandra

Page 26

Storm Overview

Page 27

Storm Concepts

§ Streams
  § Unbounded sequence of tuples
§ Spouts
  § Source of streams (Queues)
§ Bolts
  § Functions, Filters, Joins, Aggregations
§ Topologies

Page 28

Challenge – Word Count

Diagram labels: Tweets; Count?; Word:Count

• Hottest topics
• URL mentions
• etc.

Page 29

Streaming word count with Storm

Page 30

Supercharging Storm

§ Storm doesn’t supply persistence, but provides for it
§ Storm optimizes IO to slow persistence (e.g. databases) using batching.
§ Storm processes streams. The stream provider itself needs to support persistency, batching, and reliability.

Tweets, events, whatever…
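Batching writes to a slow store, as mentioned on this slide, can be sketched as a small buffer that flushes once it reaches a fixed size. `BatchingWriter` and its `write_batch` callback are hypothetical names for illustration and do not correspond to a Storm class.

```python
class BatchingWriter:
    """Buffers updates and writes them to a slow store in batches."""

    def __init__(self, write_batch, batch_size):
        self._write_batch = write_batch   # hypothetical bulk-write callback
        self._batch_size = batch_size
        self._buffer = []

    def add(self, item):
        self._buffer.append(item)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self):
        if self._buffer:
            self._write_batch(list(self._buffer))   # one round trip per batch
            self._buffer.clear()


batches = []   # records each round trip to the pretend database
writer = BatchingWriter(batches.append, batch_size=3)
for event in ["t1", "t2", "t3", "t4", "t5", "t6", "t7"]:
    writer.add(event)
writer.flush()   # push the final partial batch
```

Seven events reach the store in three round trips instead of seven, which is the trade the slide describes: fewer, larger writes at the cost of a short delay before data is durable.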

Page 31

XAP Real Time Analytics

Page 32

Two Layer Approach

§ Advantage: Minimal “impedance mismatch” between layers.
  – Both NoSQL cluster technologies, with similar advantages
§ Grid layer serves as an in memory cache for interactive requests.
§ Grid layer serves as a real time computation fabric for CEP, and limited (to allocated memory) real time distributed query capability.

Diagram labels: Raw Event Stream (×3); In Memory Compute Cluster; Real Time Events; Raw And Derived Events; NoSQL Cluster; Reporting Engine; SCALE

Page 33

Simplified Architecture

Page 34

Key Concepts

§ Flowing event streams through memory for side effects
§ Event driven architecture executing in-memory
§ Raw events flushed, aggregations/derivations retained
§ All layers horizontally scalable
§ All layers highly available
§ Real-time analytics & cached batch analytics on same scalable layer
§ Data grid provides a transactional/consistent façade on NoSQL store (in this case eliminating SQL database entirely)

Page 35

Keep Things In Memory

Facebook keeps 80% of its data in memory (Stanford research)

RAM is 100-1000x faster than disk (random seek)
• Disk: 5-10 ms
• RAM: ~0.001 ms

Page 36

Take Aways

§ A data grid can serve different needs for big data analytics:
  § Supercharge a dedicated stream processing cluster like Storm.
    – Provide fast, reliable, transactional tuple streams and state
  § Provide a general purpose analytics platform
    – Roll your own
  § Simplify overall architecture while enhancing scalability
    – Ultra high performance/low latency
    – Dynamically scalable processing and in-memory storage
    – Eliminate messaging tier
    – Eliminate or minimize need for RDBMS

Page 37

References

§ Realtime Analytics with Storm and Hadoop
  § http://www.slideshare.net/Hadoop_Summit/realtime-analytics-with-storm
§ Learn and fork the code on github: https://github.com/Gigaspaces/storm-integration
§ Twitter Storm: http://storm-project.net
§ XAP + Storm Detailed Blog Post: http://blog.gigaspaces.com/gigaspaces-and-storm-part-2-xap-integration/
