stormwars - when the data stream shrinks

bvishnu

Upload: vishnu-rao

Post on 19-Feb-2017


Apache Storm

• A stream processing framework

• Used to pull data from a stream and perform real-time analytics on the data

A Stream…

• Can be Apache Kafka or Amazon Kinesis.

• Normally has partitions / shards for better read & write throughput

Partition Metadata

• Storm uses INTEGERS (0, 1, …) to identify partitions.

• Whereas …

• Amazon Kinesis uses STRINGS to identify partitions.

So how can we process data?

• User sorts the STRINGS (shard IDs)
• User maps the sorted shard IDs to 0...N

Shard-id-0001 <-> 0
Shard-id-0002 <-> 1
…
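The sort-then-number step above can be sketched in a few lines of Python. This is only an illustration of the idea, not Storm or Kinesis API code; the function name `index_shards` is hypothetical, and the shard IDs are the example values from the slide.

```python
def index_shards(shard_ids):
    """Map each shard ID string to an integer index, following sort order."""
    return {shard_id: index for index, shard_id in enumerate(sorted(shard_ids))}


# Example shard IDs from the slide, deliberately listed out of order.
mapping = index_shards(["Shard-id-0002", "Shard-id-0001"])
print(mapping)
```

The index of a shard depends on which other shards are in the list, which is what makes the mapping fragile when the set of shards changes.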

Storm API

Shard Split in Amazon Kinesis

Stream shrinks (3 to 2 shards)

Disturbance in the Force

• Storm partition metadata is NO longer valid, as the shard has been deleted.

• Storm partition metadata should now be:
shard-2 <-> 0
shard-3 <-> 1
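A small Python sketch of why the metadata goes stale, assuming the same sort-and-number mapping described earlier. The helper `index_shards` is a hypothetical name, not part of Storm; the shard names are the ones used on the slides.

```python
def index_shards(shard_ids):
    """Number the shard IDs 0..N in sorted order."""
    return {shard_id: index for index, shard_id in enumerate(sorted(shard_ids))}


# Mapping while all three shards exist, then after shard-1 is deleted.
before = index_shards(["shard-1", "shard-2", "shard-3"])
after = index_shards(["shard-2", "shard-3"])

# Shards whose integer index changed: (old index, new index).
shifted = {s: (before[s], after[s]) for s in after if before[s] != after[s]}
print(shifted)
```

Every surviving shard moves to a new index, so any state stored against the old integers now points at the wrong shard.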

A Solution:

• WHITE_LIST of shards for a Storm topology.
• A Storm topology pulls from a specific set of shards.

• So in our case:
– start topology-1 with WHITELIST = "shard-1"
– split shard
– start topology-2 with WHITELIST = "shard-2 & 3"
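One way to picture the whitelist is as a filter applied before the shards are numbered. The sketch below is an assumption about how such a filter could look, not the actual spout configuration; `whitelisted_partitions` is a hypothetical helper.

```python
def whitelisted_partitions(live_shards, whitelist):
    """Number only the live shards that appear in this topology's whitelist."""
    allowed = set(whitelist)
    selected = sorted(s for s in live_shards if s in allowed)
    return {shard_id: index for index, shard_id in enumerate(selected)}


# Right after the split, the parent and both child shards are visible.
live = ["shard-1", "shard-2", "shard-3"]

topology_1 = whitelisted_partitions(live, ["shard-1"])
topology_2 = whitelisted_partitions(live, ["shard-2", "shard-3"])
print(topology_1, topology_2)
```

Each topology numbers its own shards from 0, independently of the other.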

A Solution…

• When shard-1 gets deleted, topology 1 dies with it.

• Topology 2 continues processing data for the new shards.

So, there is NO metadata conflict, as there are 2 different topologies pulling data from different sets of shards.
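The no-conflict claim can be checked with a short Python sketch, again assuming a whitelist filter applied before numbering. `whitelisted_partitions` is a hypothetical helper, not a Storm API.

```python
def whitelisted_partitions(live_shards, whitelist):
    """Number only the live shards that appear in this topology's whitelist."""
    allowed = set(whitelist)
    selected = sorted(s for s in live_shards if s in allowed)
    return {shard_id: index for index, shard_id in enumerate(selected)}


whitelist_2 = ["shard-2", "shard-3"]

# Topology 2's view before and after shard-1 is deleted.
before_delete = whitelisted_partitions(["shard-1", "shard-2", "shard-3"], whitelist_2)
after_delete = whitelisted_partitions(["shard-2", "shard-3"], whitelist_2)

# Topology 1's view once its only shard is gone.
orphaned = whitelisted_partitions(["shard-2", "shard-3"], ["shard-1"])

print(before_delete == after_delete, orphaned)
```

Topology 2 sees the same mapping before and after the deletion, while topology 1 is left with nothing to read.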

Thank you &

May the Force be with you!

[email protected]@twitter
mash213.wordpress.com
linkedin.com/in/213vishnu