Source: meetupfiles.meetup.com/16951782/20140122 (streaming big data analysis)
TRANSCRIPT
Streaming Big Data Analysis: Apache Storm and SensorStorm
Wilco Wijbrandi
About me
Studied Computing Science BSc and MSc
here in Groningen
Working at TNO since 2012, department
Service Enabling & Management
Working in the field of Smart Grids and
Cloud Computing
Tools of preference: Java and OSGi
Batch processing
Pros
Easy to implement
Redo processing
Cons
Results are delayed
Requires some serious
disk space
MapReduce
Stream Processing
Useful when
There’s just too much data to store
Near real-time results
Downsides
Data is gone after processing
Harder to implement
Apache Storm
Process unbounded streams of data
Scalable
Guarantees no data loss (at least, no tuples are lost)
Robust
Fault-tolerant
Programming language agnostic (sort of)
Apache Storm
What I like about Storm
Storm is hot! That means an active community!
Robustness and reliability is taken care of
Nice WebUI: metrics and logging
Scalable: Runs on your laptop, runs in a datacenter
Later on I’ll tell you what I don’t like…
Robustness
If Nimbus fails the topology can continue, but
nothing can change (that also means no
redistributing work if a supervisor fails)
Topology
Data is transmitted as Tuples
Untyped key-value pairs
Predefined set of keys
Spouts pull in data and produce tuples
Usually there is a message queue in front of a spout
Bolts process tuples and produce new tuples (or do something else)
Topology cannot be changed once started
A storm cluster can run multiple topologies at the same time
Parallelism
Spouts and bolts have multiple
instances (Tasks)
Tasks are automatically
distributed over workers
Stream Groupings
Defines which tuples go to which bolt instance
Most common:
ShuffleGrouping: Spread tuples equally over instances
FieldsGrouping: Tuples with the same field value always end up at the same instance
AllGrouping: Broadcast tuples to all instances
[Diagram: a spout routing tuples to two instances (0 and 1) of Bolt A]
instance(tuple, field) = hash(tuple[field]) % NrOfInstances
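The routing rule above can be sketched in plain Java. This is an illustration of the idea, not Storm's actual grouping implementation; the class and method names are made up.

```java
// Sketch of how a fields grouping could pick a target task index
// from a tuple's field value (illustrative, not Storm's own code).
public class FieldsGroupingSketch {
    // Math.floorMod keeps the result non-negative even when
    // hashCode() is negative, so the index is always valid.
    public static int targetTask(Object fieldValue, int numTasks) {
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }
}
```

Because the index depends only on the field value, the same word always lands on the same bolt instance, which is what makes per-key state (like a word count) possible.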
Reliability
Fail-fast
When a supervisor/worker crashes, tasks are reassigned
Tuples not acknowledged are resubmitted
At least once semantics
Spout is responsible for resubmitting
State of bolt instances gets lost when a node crashes or tasks are
moved
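The at-least-once pattern the spout follows can be sketched as a pending map keyed by message id: acks drop the tuple, fails re-queue it. This is an illustrative standalone sketch, not Storm's API; all names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Sketch of a spout's at-least-once bookkeeping (names are made up).
public class AtLeastOnceSketch {
    private final Map<Long, String> pending = new HashMap<>();
    private final Deque<String> queue = new ArrayDeque<>();
    private long nextId = 0;

    // Emit a tuple tagged with a message id and remember it.
    public long emit(String tuple) {
        long id = nextId++;
        pending.put(id, tuple);
        return id;
    }

    // The whole tuple tree was acknowledged: forget the tuple.
    public void ack(long id) {
        pending.remove(id);
    }

    // A downstream bolt failed the tuple: queue it for resubmission.
    public void fail(long id) {
        String tuple = pending.remove(id);
        if (tuple != null) queue.addFirst(tuple);
    }

    // Next tuple to resubmit, or null if nothing failed.
    public String nextResubmission() {
        return queue.poll();
    }
}
```

This also shows why bolts can see a tuple twice: a resubmitted tuple replays the whole chain, including hops that already succeeded.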
Chaining: Success!
[Diagram: spout S emits tuple A to bolt B1; B1 emits B and C to bolts B2 and B3; acks flow back: Ack C!, Ack B!, Ack A!]
Chaining: Fail
[Diagram: spout S emits tuple A to bolt B1; B1 emits B and C downstream; Ack B! arrives but Fail A! does too, so the spout resubmits A]
B1 receives A twice
B2 receives B twice
Throttling
[Diagram: spout S with many tuples (A through O) in flight across bolts B1–B3]
In practice this happens in parallel
Spouts pull in data; this is the only place where storm can throttle
Number of ‘pending spout-tuples’ is limited
Configuration parameter MAX_SPOUT_PENDING
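The limit goes into the topology configuration; a minimal fragment might look like this (the value 1000 is purely illustrative, not a recommendation):

```java
// Cap the number of in-flight (not yet acked) spout tuples;
// the spout stops being asked for new tuples once the cap is hit.
Config conf = new Config();
conf.setMaxSpoutPending(1000);
```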
public class RandomSentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;

    @Override
    public void open(Map conf, TopologyContext context,
            SpoutOutputCollector collector) {
        this.collector = collector;
        ...
    }

    @Override
    public void nextTuple() {
        String sentence = ...
        this.collector.emit(new Values(sentence));
    }

    @Override
    public void ack(Object id) {
    }

    @Override
    public void fail(Object id) {
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
class WordCount extends BaseBasicBolt {
    Map<String, Integer> counts = new HashMap<String, Integer>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getString(0);
        Integer count = counts.get(word);
        if (count == null)
            count = 0;
        count++;
        counts.put(word, count);
        collector.emit(new Values(word, count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}
public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("spout", new RandomSentenceSpout(), 5);
    builder.setBolt("split", new SplitSentence(), 8)
            .shuffleGrouping("spout");
    builder.setBolt("count", new WordCount(), 12)
            .fieldsGrouping("split", new Fields("word"));

    Config conf = new Config();
    conf.setDebug(true);
    StormSubmitter.submitTopology(args[0], conf,
            builder.createTopology());
}
Spouts and bolts are copied to worker nodes by serializing them
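This is why lifecycle methods like open() and prepare() exist: the component object is serialized on the submitting machine and deserialized on the worker, so anything that cannot travel over the wire must be transient and rebuilt after arrival. A minimal standalone sketch (the class and field names are made up, not a Storm API):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Illustration of the serialize-then-initialize pattern.
public class SerializableComponentSketch implements Serializable {
    private final String config = "host=example";   // travels with the object
    private transient StringBuilder connection;     // NOT serialized

    // Analogous to Storm's open()/prepare(): runs on the worker,
    // after deserialization, to rebuild transient resources.
    public void open() {
        connection = new StringBuilder(config);
    }

    public boolean connectionIsNull() {
        return connection == null;
    }

    public static byte[] serialize(Serializable o) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new ObjectOutputStream(bos).writeObject(o);
        return bos.toByteArray();
    }

    public static SerializableComponentSketch deserialize(byte[] b)
            throws IOException, ClassNotFoundException {
        return (SerializableComponentSketch)
                new ObjectInputStream(new ByteArrayInputStream(b)).readObject();
    }
}
```

After a round trip, the transient field is null until open() is called, which mirrors why a bolt's constructor must not open sockets or database connections.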
Apache Storm
What I don’t like about Storm
Very static: Topologies cannot be altered while running
No real topology management, no run-time configuration
Tuples are untyped because strong typing is supposedly annoying, yet I still have to declare the output fields up front?
State of bolt is not managed (unless you use Trident)
No dependency management
Other interesting design choices
Ugly Java Clojure hybrid (but that won’t bother most users)
SensorStorm
(suggestions for a logo are welcome)
Why SensorStorm?
Dealing with sensor data is a bit different than vanilla Storm or Trident
When dealing with sensor data…
We want measurements to be in the right order
Reliability is nice, but we can live with losing a measurement once in a while
A lot of operations are reusable
What is SensorStorm?
SensorStorm is a library on top of Apache Storm
SensorStorm is an open source project with an Apache 2 license
Other research topics related to Storm
Elastic Storm
Automatically scale Storm clusters up and down -> paper
Make it easier to manage, configure and deploy topologies
Run multiple instances of the same topology
Central configuration (ConfigAdmin for Storm)
Particles are the new Tuples!
Particles are a special type of tuple
Particles always have a timestamp
Particles are strongly typed
Particles are automatically mapped to Storm Tuples, so we don’t
break compatibility
You can also customize the mapping
We take care of declareOutputFields
Time travel is complicated (ever seen Primer?)
For debugging and demo purposes…
We want to be able to process live data
We want to be able to replay a historic dataset
We want to run a historic dataset as fast as possible
So we need to be able to use the real clock, but also a fake, controlled
clock…
And that’s difficult in a
distributed system
Analysis of the measurement
We also want to analyse the frequency at which we receive
measurements
Processing of measurements does not always take the same
amount of time
What if there is no measurement at all?
We don’t want to use the time of processing, but the time the measurement was taken
Time injection
[Diagram: spout S sends Measurement Particles and injected Time Particles through bolts B1–B3]
Every particle carries a timestamp
Time Particles are injected at a fixed
(configurable) interval
Time Particles trigger different processing logic
than measurement Particles
DataParticles and MetaParticles
[Diagram: spout S sends DataParticles and MetaParticles through bolts B1–B3]
MetaParticles don’t carry measurements, but are injected into the stream to trigger certain behaviour in the stream
TimerMetaParticle carries time and triggers
scheduled tasks
ShutdownMetaParticle signals bolts to store
their state (e.g. before reassigning tasks)
What does a Particle look like?
public class SensorParticle extends AbstractDataParticle {
    public SensorParticle() {
    }

    @TupleField
    private String sensorId;

    @TupleField
    private double measurement;

    public String getSensorId() {
        return sensorId;
    }

    public double getMeasurement() {
        return measurement;
    }

    ...
}
Parallelism
[Diagram: two spout instances feeding two instances of Bolt A, which in turn feed two instances of Bolt B]
Specialized stream groupings
MetaParticles are broadcast between instances
Duplicate MetaParticles are filtered out
DataParticles have their own grouping strategy (e.g. fields grouping)
Before each bolt, particles are put in order by the SyncBuffer
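The reordering idea can be sketched with a priority queue keyed on timestamp: buffer particles from the different upstream instances and only release those at or below a known-safe timestamp. This is an illustrative sketch, not the SensorStorm SyncBuffer implementation; all names are made up.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Sketch of timestamp-ordered buffering (illustrative names).
public class SyncBufferSketch {
    // Each entry is {timestamp, value}; the queue orders by timestamp.
    private final PriorityQueue<long[]> buffer =
            new PriorityQueue<>(Comparator.comparingLong(p -> p[0]));

    // Particles may arrive out of order from different sources.
    public void add(long timestamp, long value) {
        buffer.add(new long[]{timestamp, value});
    }

    // Release everything up to the watermark (e.g. the timestamp of
    // a TimerMetaParticle seen on all inputs), in timestamp order.
    public List<Long> releaseUpTo(long watermark) {
        List<Long> out = new ArrayList<>();
        while (!buffer.isEmpty() && buffer.peek()[0] <= watermark) {
            out.add(buffer.poll()[1]);
        }
        return out;
    }
}
```

The watermark role is played in SensorStorm by the broadcast MetaParticles: once every input has passed a given timestamp, everything older can safely be emitted in order.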
SensorStormSpout and SensorStormBolt
Generic Spout and Bolt in which you can run your processing logic
[Diagram: a SensorStormSpout wraps a Fetcher and injects Time Particles; each downstream SensorStormBolt wraps a Batcher and an Operation]
Fetcher
@FetcherDeclaration(outputs = SensorParticle.class)
public class BlockFetcher implements Fetcher {
    public void prepare(Map stormConfig,
            ExternalStormConfiguration externalConfig,
            TopologyContext context) {
    }

    public void activate() {
        ...
    }

    public void deactivate() {
        ...
    }

    @Override
    public DataParticle fetchParticle() {
        return new SensorParticle();
    }
}
SingleParticleOperation
Process one particle at a time
Only DataParticles are offered
Processing a DataParticle can result
in no new Particles or many
@OperationDeclaration(inputs = SensorParticle.class,
        outputs = SensorParticle.class)
public class ExampleOperation implements SingleParticleOperation {
    public void init(...) throws OperationException {
    }

    public List<? extends DataParticle> execute(DataParticle inputParticle)
            throws OperationException {
        System.out.println("Received Particle: " + inputParticle);
        return Collections.singletonList(inputParticle);
    }
}
Batcher
Separate logic for creating Batches of particles
A Particle can be part of multiple Batches
For example, creating windows
[Diagram: particles 1–8 with three overlapping windows, Batch 1, Batch 2, and Batch 3, so that a particle appears in more than one batch]
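The overlapping windows can be sketched as a sliding window over a list of particles. This is an illustration of the batching idea, not the SensorStorm Batcher API; the class and parameters are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of overlapping batches: windows of `size` particles,
// advancing by `step`, so particles belong to multiple batches.
public class SlidingWindowSketch {
    public static List<List<Integer>> windows(
            List<Integer> particles, int size, int step) {
        List<List<Integer>> batches = new ArrayList<>();
        for (int start = 0; start + size <= particles.size(); start += step) {
            batches.add(new ArrayList<>(particles.subList(start, start + size)));
        }
        return batches;
    }
}
```

With eight particles, a window size of 4 and a step of 2 this yields three batches, and the middle particles each appear in two of them, matching the picture on the slide.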
ParticleBatchOperation
Same as the SingleParticleOperation, but now we process a DataParticleBatch
For example:
Average
Min
Max
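A batch operation like the average reduces a whole batch to one value. A minimal standalone sketch (plain arrays instead of the actual DataParticleBatch type, which is assumed here):

```java
// Sketch of an average over a batch of measurements; in SensorStorm
// this logic would live inside a ParticleBatchOperation.
public class AverageBatchSketch {
    public static double average(double[] measurements) {
        double sum = 0;
        for (double m : measurements) {
            sum += m;
        }
        return sum / measurements.length;
    }
}
```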
DIY MetaParticles
Extend the SensorStormSpout so it sends your MetaParticle
Create a MetaParticleHandler
Create an interface so Operations can interact with your MetaParticleHandler
[Diagram: inside a SensorStormBolt, a MetaParticleHandler registers with the Operation; MetaParticles go to the handler, DataParticles to the Operation]
FieldOperations
Create an Operation for every value of a field
For example: process every sensor separately
You probably want to use a FieldsGrouping before this Bolt
[Diagram: a SensorStormBolt running a separate Operation instance for each sensor: S1, S2, S3, and S4]
SensorStorm FieldOperationBolt
[Diagram: the SensorStorm stream grouping broadcasts MetaParticles; a SyncBuffer synchronizes particles from different sources and filters out duplicate MetaParticles; inside the bolt there is one field container (Batcher + Operation) per field value]
SensorStorm SingleOperationBolt
[Diagram: the same stream grouping and SyncBuffer, but a single container holding one Batcher and one Operation]
And we’re open source now!
So, if SensorStorm is for you…
Give it a try
Let us know what you think
Help us improve it
Together we can do more!
https://github.com/sensorstorm