Source: meetupfiles.meetup.com/16951782/20140122 (streaming big data analysis)
TRANSCRIPT
Streaming Big Data Analysis: Apache Storm and SensorStorm
Wilco Wijbrandi
About me
Studied Computing Science BSc and MSc
here in Groningen
Working at TNO since 2012, department
Service Enabling & Management
Working in the field of Smart Grids and
Cloud Computing
Tools of preference: Java and OSGi
Batch processing
Pros
Easy to implement
Redo processing
Cons
Results are delayed
Requires some serious
disk space
MapReduce
Stream Processing
Useful when
There’s just too much data to store
Near real-time results
Downsides
Data is gone after processing
Harder to implement
Apache Storm
Process unbounded streams of data
Scalable
Guarantees no data loss (at least, no tuples are lost)
Robust
Fault-tolerant
Programming language agnostic (sort of)
Apache Storm
What I like about Storm
Storm is hot! That means an active community!
Robustness and reliability is taken care of
Nice WebUI: metrics and logging
Scalable: Runs on your laptop, runs in a datacenter
Later on I’ll tell you what I don’t like…
Robustness
If Nimbus fails the topology can continue, but
nothing can change (that also means no
redistributing work if a supervisor fails)
Topology
Data is transmitted as Tuples
Untyped key-value pairs
Predefined set of keys
Spouts pull in data and produce tuples
Usually there is a message queue in front of a spout
Bolts process tuples and produce new tuples (or do something else)
Topology cannot be changed once started
A storm cluster can run multiple topologies at the same time
Parallelism
Spouts and bolts have multiple
instances (Tasks)
Tasks are automatically
distributed over workers
Stream Groupings
Defines which tuples go to which bolt instance
Most common:
ShuffleGrouping: Spread tuples equally over instances
FieldsGrouping: Tuples with the same field value always end up at the same instance
AllGrouping: Broadcast tuples to all instances
[Diagram: a spout routing tuples to two instances (0 and 1) of Bolt A]
instance(tuple, field) = hash(tuple[field]) % NrOfInstances
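The routing rule above can be sketched in plain Java. This is an illustration of the idea, not Storm's actual grouping implementation; the class and method names are made up.

```java
// Sketch of how a fields grouping could pick a target task index
// from a tuple's field value (illustrative, not Storm's own code).
public class FieldsGroupingSketch {
    // Math.floorMod keeps the result non-negative even when
    // hashCode() is negative, so the index is always valid.
    public static int targetTask(Object fieldValue, int numTasks) {
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }
}
```

Because the index depends only on the field value, the same word always lands on the same bolt instance, which is what makes per-key state (like a word count) possible.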
Reliability
Fail-fast
When a supervisor/worker crashes, tasks are reassigned
Tuples not acknowledged are resubmitted
At least once semantics
Spout is responsible for resubmitting
State of bolt instances gets lost when a node crashes or tasks are
moved
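The at-least-once pattern the spout follows can be sketched as a pending map keyed by message id: acks drop the tuple, fails re-queue it. This is an illustrative standalone sketch, not Storm's API; all names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Sketch of a spout's at-least-once bookkeeping (names are made up).
public class AtLeastOnceSketch {
    private final Map<Long, String> pending = new HashMap<>();
    private final Deque<String> queue = new ArrayDeque<>();
    private long nextId = 0;

    // Emit a tuple tagged with a message id and remember it.
    public long emit(String tuple) {
        long id = nextId++;
        pending.put(id, tuple);
        return id;
    }

    // The whole tuple tree was acknowledged: forget the tuple.
    public void ack(long id) {
        pending.remove(id);
    }

    // A downstream bolt failed the tuple: queue it for resubmission.
    public void fail(long id) {
        String tuple = pending.remove(id);
        if (tuple != null) queue.addFirst(tuple);
    }

    // Next tuple to resubmit, or null if nothing failed.
    public String nextResubmission() {
        return queue.poll();
    }
}
```

This also shows why bolts can see a tuple twice: a resubmitted tuple replays the whole chain, including hops that already succeeded.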
Chaining: Success!
[Diagram: spout S emits tuple A to bolt B1; B1 emits B and C to bolts B2 and B3; acks flow back: Ack C!, Ack B!, Ack A!]
Chaining: Fail
[Diagram: spout S emits tuple A to bolt B1; B1 emits B and C downstream; Ack B! arrives but Fail A! does too, so the spout resubmits A]
B1 receives A twice
B2 receives B twice
Throttling
[Diagram: spout S with many tuples (A through O) in flight across bolts B1–B3]
In practice this happens in parallel
Spouts pull in data; this is the only place where storm can throttle
Number of ‘pending spout-tuples’ is limited
Configuration parameter MAX_SPOUT_PENDING
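The limit goes into the topology configuration; a minimal fragment might look like this (the value 1000 is purely illustrative, not a recommendation):

```java
// Cap the number of in-flight (not yet acked) spout tuples;
// the spout stops being asked for new tuples once the cap is hit.
Config conf = new Config();
conf.setMaxSpoutPending(1000);
```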
public class RandomSentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;

    @Override
    public void open(Map conf, TopologyContext context,
            SpoutOutputCollector collector) {
        this.collector = collector;
        ...
    }

    @Override
    public void nextTuple() {
        String sentence = ...
        this.collector.emit(new Values(sentence));
    }

    @Override
    public void ack(Object id) {
    }

    @Override
    public void fail(Object id) {
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
class WordCount extends BaseBasicBolt {
    Map<String, Integer> counts = new HashMap<String, Integer>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getString(0);
        Integer count = counts.get(word);
        if (count == null)
            count = 0;
        count++;
        counts.put(word, count);
        collector.emit(new Values(word, count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}
public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("spout", new RandomSentenceSpout(), 5);
    builder.setBolt("split", new SplitSentence(), 8)
            .shuffleGrouping("spout");
    builder.setBolt("count", new WordCount(), 12)
            .fieldsGrouping("split", new Fields("word"));

    Config conf = new Config();
    conf.setDebug(true);
    StormSubmitter.submitTopology(args[0], conf,
            builder.createTopology());
}
Spouts and bolts are copied to worker nodes by serializing them
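This is why lifecycle methods like open() and prepare() exist: the component object is serialized on the submitting machine and deserialized on the worker, so anything that cannot travel over the wire must be transient and rebuilt after arrival. A minimal standalone sketch (the class and field names are made up, not a Storm API):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Illustration of the serialize-then-initialize pattern.
public class SerializableComponentSketch implements Serializable {
    private final String config = "host=example";   // travels with the object
    private transient StringBuilder connection;     // NOT serialized

    // Analogous to Storm's open()/prepare(): runs on the worker,
    // after deserialization, to rebuild transient resources.
    public void open() {
        connection = new StringBuilder(config);
    }

    public boolean connectionIsNull() {
        return connection == null;
    }

    public static byte[] serialize(Serializable o) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new ObjectOutputStream(bos).writeObject(o);
        return bos.toByteArray();
    }

    public static SerializableComponentSketch deserialize(byte[] b)
            throws IOException, ClassNotFoundException {
        return (SerializableComponentSketch)
                new ObjectInputStream(new ByteArrayInputStream(b)).readObject();
    }
}
```

After a round trip, the transient field is null until open() is called, which mirrors why a bolt's constructor must not open sockets or database connections.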
Apache Storm
What I don’t like about Storm
Very static: Topologies cannot be altered while running
No real topology management, no run-time configuration
Tuples are untyped because strong typing is supposedly annoying, yet I still have to declare the output fields up front?
State of bolt is not managed (unless you use Trident)
No dependency management
Other interesting design choices
Ugly Java Clojure hybrid (but that won’t bother most users)
SensorStorm
(suggestions for a logo are welcome)
Why SensorStorm?
Dealing with sensor data is a bit different than vanilla Storm or Trident
When dealing with sensor data…
We want measurements to be in the right order
Reliability is nice, but we can live with losing a measurement once in a while
A lot of operations are reusable
What is SensorStorm?
SensorStorm is a library on top of Apache Storm
SensorStorm is an open source project with an Apache 2 license
Other research topics related to Storm
Elastic Storm
Automatically scale Storm clusters up and down -> paper
Make it easier to manage, configure and deploy topologies
Run multiple instances of the same topology
Central configuration (ConfigAdmin for Storm)
Particles are the new Tuples!
Particles are a special type of tuple
Particles always have a timestamp
Particles are strongly typed
Particles are automatically mapped to Storm Tuples, so we don’t
break compatibility
You can also customize the mapping
We take care of declareOutputFields
Time travel is complicated (ever seen Primer?)
For debugging and demo purposes…
We want to be able to process live data
We want to be able to replay a historic dataset
We want to run a historic dataset as fast as possible
So we need to be able to use the real clock, but also a fake, controlled
clock…
And that’s difficult in a
distributed system
Analysis of the measurement
We also want to analyse the frequency at which we receive
measurements
Processing of measurements does not always take the same
amount of time
What if there is no measurement at all?
We don’t want to use the time of processing, but the time the measurement was taken
Time injection
[Diagram: spout S sends Measurement Particles and injected Time Particles through bolts B1–B3]
Every particle carries a timestamp
Time Particles are injected at a fixed
(configurable) interval
Time Particles trigger different processing logic
than measurement Particles
DataParticles and MetaParticles
[Diagram: spout S sends DataParticles and MetaParticles through bolts B1–B3]
MetaParticles don’t carry measurements, but are injected into the stream to trigger certain behaviour in the stream
TimerMetaParticle carries time and triggers
scheduled tasks
ShutdownMetaParticle signals bolts to store
their state (e.g. before reassigning tasks)
What does a Particle look like?
public class SensorParticle extends AbstractDataParticle {
    public SensorParticle() {
    }

    @TupleField
    private String sensorId;

    @TupleField
    private double measurement;

    public String getSensorId() {
        return sensorId;
    }

    public double getMeasurement() {
        return measurement;
    }

    ...
}
Parallelism
[Diagram: two spout instances feeding two instances of Bolt A, which in turn feed two instances of Bolt B]
Specialized stream groupings
MetaParticles are broadcast between instances
Duplicate MetaParticles are filtered out
DataParticles have their own grouping strategy (e.g. fields grouping)
Before each bolt, particles are put in order by the SyncBuffer
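The reordering idea can be sketched with a priority queue keyed on timestamp: buffer particles from the different upstream instances and only release those at or below a known-safe timestamp. This is an illustrative sketch, not the SensorStorm SyncBuffer implementation; all names are made up.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Sketch of timestamp-ordered buffering (illustrative names).
public class SyncBufferSketch {
    // Each entry is {timestamp, value}; the queue orders by timestamp.
    private final PriorityQueue<long[]> buffer =
            new PriorityQueue<>(Comparator.comparingLong(p -> p[0]));

    // Particles may arrive out of order from different sources.
    public void add(long timestamp, long value) {
        buffer.add(new long[]{timestamp, value});
    }

    // Release everything up to the watermark (e.g. the timestamp of
    // a TimerMetaParticle seen on all inputs), in timestamp order.
    public List<Long> releaseUpTo(long watermark) {
        List<Long> out = new ArrayList<>();
        while (!buffer.isEmpty() && buffer.peek()[0] <= watermark) {
            out.add(buffer.poll()[1]);
        }
        return out;
    }
}
```

The watermark role is played in SensorStorm by the broadcast MetaParticles: once every input has passed a given timestamp, everything older can safely be emitted in order.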
SensorStormSpout and SensorStormBolt
Generic Spout and Bolt in which you can run your processing logic
[Diagram: a SensorStormSpout wraps a Fetcher and injects Time Particles; each downstream SensorStormBolt wraps a Batcher and an Operation]
Fetcher
@FetcherDeclaration(outputs = SensorParticle.class)
public class BlockFetcher implements Fetcher {
    public void prepare(Map stormConfig,
            ExternalStormConfiguration externalConfig,
            TopologyContext context) {
    }

    public void activate() {
        ...
    }

    public void deactivate() {
        ...
    }

    @Override
    public DataParticle fetchParticle() {
        return new SensorParticle();
    }
}
SingleParticleOperation
Process one particle at a time
Only DataParticles are offered
Processing a DataParticle can result
in no new Particles or many
@OperationDeclaration(inputs = SensorParticle.class,
        outputs = SensorParticle.class)
public class ExampleOperation implements SingleParticleOperation {
    public void init(...) throws OperationException {
    }

    public List<? extends DataParticle> execute(DataParticle inputParticle)
            throws OperationException {
        System.out.println("Received Particle: " + inputParticle);
        return Collections.singletonList(inputParticle);
    }
}
Batcher
Separate logic for creating Batches of particles
A Particle can be part of multiple Batches
For example, creating windows
[Diagram: particles 1–8 with three overlapping windows, Batch 1, Batch 2, and Batch 3, so that a particle appears in more than one batch]
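The overlapping windows can be sketched as a sliding window over a list of particles. This is an illustration of the batching idea, not the SensorStorm Batcher API; the class and parameters are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of overlapping batches: windows of `size` particles,
// advancing by `step`, so particles belong to multiple batches.
public class SlidingWindowSketch {
    public static List<List<Integer>> windows(
            List<Integer> particles, int size, int step) {
        List<List<Integer>> batches = new ArrayList<>();
        for (int start = 0; start + size <= particles.size(); start += step) {
            batches.add(new ArrayList<>(particles.subList(start, start + size)));
        }
        return batches;
    }
}
```

With eight particles, a window size of 4 and a step of 2 this yields three batches, and the middle particles each appear in two of them, matching the picture on the slide.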
ParticleBatchOperation
Same as the SingleParticleOperation, but now we process a DataParticleBatch
For example:
Average
Min
Max
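A batch operation like the average reduces a whole batch to one value. A minimal standalone sketch (plain arrays instead of the actual DataParticleBatch type, which is assumed here):

```java
// Sketch of an average over a batch of measurements; in SensorStorm
// this logic would live inside a ParticleBatchOperation.
public class AverageBatchSketch {
    public static double average(double[] measurements) {
        double sum = 0;
        for (double m : measurements) {
            sum += m;
        }
        return sum / measurements.length;
    }
}
```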
DIY MetaParticles
Extend the SensorStormSpout so it sends your MetaParticle
Create a MetaParticleHandler
Create an interface so Operations can interact with your MetaParticleHandler
[Diagram: inside a SensorStormBolt, a MetaParticleHandler registers with the Operation; MetaParticles go to the handler, DataParticles to the Operation]
FieldOperations
Create an Operation for every value of a field
For example: process every sensor separately
You probably want to use a FieldsGrouping before this Bolt
[Diagram: a SensorStormBolt running a separate Operation instance for each sensor: S1, S2, S3, and S4]
SensorStorm FieldOperationBolt
[Diagram: the SensorStorm stream grouping broadcasts MetaParticles; a SyncBuffer synchronizes particles from different sources and filters out duplicate MetaParticles; inside the bolt there is one field container (Batcher + Operation) per field value]
SensorStorm SingleOperationBolt
[Diagram: the same stream grouping and SyncBuffer, but a single container holding one Batcher and one Operation]
And we’re open source now!
So, if SensorStorm is for you…
Give it a try
Let us know what you think
Help us improve it
Together we can do more!
https://github.com/sensorstorm