cHadoop: Continuous Hadoop
A Lightweight Continuous Jobs Mechanism for MapReduce Frameworks
Trong-Tuan Vu, INRIA Lille Nord Europe
Fabrice Huet, INRIA-University of Nice
Big Data processing landscape

Model     | Static data           | Dynamic (fast) data
Batch     | Hadoop                |
Iterative | HaLoop, Twister, PIC  |
Real-time | HOP                   |
Stream    |                       | Amazon S4, Twitter Storm
Batch Processing of Big Data
• Canonical workflow
– Push data to cluster
– Start jobs
– Pull results
– Profit!
• Works only as long as the data set does not change (see the client-side sketch below)
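As a concrete illustration, a client-side sketch of this workflow using the standard Hadoop Java API; the paths and the elided job configuration are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;

    public class BatchWorkflow {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Push data to the cluster
            fs.copyFromLocalFile(new Path("data/input.txt"), new Path("/input"));

            // Start the job (mapper/reducer/jar setup elided)
            Job job = Job.getInstance(conf, "word-count");
            // ...
            job.waitForCompletion(true);

            // Pull results back
            fs.copyToLocalFile(new Path("/output/part-r-00000"), new Path("results.txt"));
        }
    }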
Dealing with dynamic data
• Bulk arrival
• Job is submitted only once and re-runs automatically
• Slightly changes the workflow
– While (new data)
• Push, execute, pull, profit!
Continuous Analysis
[Figure: Word-Count over time]
Run 1: input "Foo Bar"  -> Foo 1, Bar 1
Run 2: input "What Bar" -> What 1, Bar 1
Merged result           -> Foo 1, Bar 2, What 1
Properties
• Efficiency
– Only process new data, not the whole data set
• Correctness
– Merging the results of all incremental runs should give
the same result as processing the whole data set (see the sketch below)
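For Word-Count, this property can be checked concretely: merging per-run results is per-word summation, so the merged counts must equal the counts over the full data set. A minimal plain-Java sketch (the class and method names are ours):

    import java.util.HashMap;
    import java.util.Map;

    public class MergeCounts {
        // For Word-Count, merging two runs' results is per-word summation.
        static Map<String, Integer> merge(Map<String, Integer> a, Map<String, Integer> b) {
            Map<String, Integer> out = new HashMap<>(a);
            b.forEach((word, n) -> out.merge(word, n, Integer::sum));
            return out;
        }

        public static void main(String[] args) {
            Map<String, Integer> run1 = Map.of("Foo", 1, "Bar", 1);   // from "Foo Bar"
            Map<String, Integer> run2 = Map.of("What", 1, "Bar", 1);  // from "What Bar"
            System.out.println(merge(run1, run2));  // Foo=1, Bar=2, What=1 (order may vary)
        }
    }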
Dependencies
[Figure: Word-2 over time (Word-2: display words which appear at least twice)]
Run 1: input "Foo Bar"  -> (no output: each word seen only once)
Run 2: input "What Bar" -> Bar (seen twice, but only across both runs)
Not all data are equal
• Processing only new data leads to incorrect results
– Because some of the old data is still needed
• Different categories
– New data
– Results
– Carried data
Carried data
• Data which has already been processed
– But could still be useful in subsequent runs
• Typically application-dependent
– Let the programmer decide what to carry
• Example (Word-2):
– Result: words which appear at least twice
– Carry: words which have appeared only once so far
Continuous Map-Reduce jobs
[Figure: successive Map/Reduce runs, each producing a Carry that is fed back into the next run]
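One plausible way successive runs could be wired together, assuming the carry is materialized as HDFS files (consistent with the setCarryFilesName call shown later): each run reads the newly arrived data plus the previous run's carry. All paths here are hypothetical:

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CarryWiring {
        // Sketch: run N consumes the new data plus run N-1's carry files.
        static void configureRun(Job job, int runNumber) throws IOException {
            FileInputFormat.addInputPath(job, new Path("/data/new"));  // freshly arrived data
            if (runNumber > 1) {
                FileInputFormat.addInputPath(job,
                        new Path("/carry/run-" + (runNumber - 1)));    // previous run's carry
            }
            FileOutputFormat.setOutputPath(job, new Path("/results/run-" + runNumber));
        }
    }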
Contribution
• A continuous Job model adapted to MapReduce
• An implementation on top of Hadoop
• An evaluation with two toy applications and a
realistic one
CONTINUOUS HADOOP
Continuous MapReduce Framework
• Based on the Hadoop MapReduce Framework
• Support for automatic re-execution of jobs
– Notification of new data
– Filtering of data by timestamp
• New API with carry function
Even Elephants are fast
• No modification to Hadoop source code
– Proxies/Interceptors
– Subclassing
– Reflection (accessing private fields)
• Use public API
• Hopefully we never have to play cat and mouse with the elephant (see the reflection sketch below)
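For illustration, a minimal sketch of the reflection technique: reading a private field through the public java.lang.reflect API, without touching the target's source code. The field being accessed is hypothetical:

    import java.lang.reflect.Field;

    public class PrivateFieldAccess {
        // Look up a private field on the target's class and bypass the access check.
        static Object readPrivateField(Object target, String fieldName) throws Exception {
            Field field = target.getClass().getDeclaredField(fieldName);
            field.setAccessible(true);
            return field.get(target);
        }
    }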
[Figure: standard Hadoop architecture (NameNode, JobTracker, TaskTrackers running Tasks, Data Nodes on the Local File System) extended with a Continuous NameNode, a Continuous JobTracker, and Continuous Jobs]
Time stamping data
• Jobs should only process new data
– Only blocks added after the last execution
• HDFS has limitations
– No in-place modification and no appending
• Add a time stamp for each block as metadata in the
Continuous NameNode (see the filtering sketch below)
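A minimal sketch of timestamp-based filtering using the stock HDFS client API; cHadoop keeps per-block timestamps in the Continuous NameNode, but file modification times illustrate the same idea (directory path and method name are hypothetical):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class NewDataFilter {
        // Keep only files added (modified) after the last execution.
        static List<Path> newerThan(FileSystem fs, Path dir, long lastRunMillis)
                throws IOException {
            List<Path> fresh = new ArrayList<>();
            for (FileStatus status : fs.listStatus(dir)) {
                if (status.getModificationTime() > lastRunMillis) {
                    fresh.add(status.getPath());
                }
            }
            return fresh;
        }
    }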
API example (Word-2-count)
ContinuousJob job = new ContinuousJob();
// ...
job.setCarryFilesName("carry");   // carried pairs are written to "carry" files

protected void continuousReduce(Text key, Iterable<IntWritable> values,
                                ContinuousContext context) {
    // ...
    if (sum < 2) {
        context.carry(key, result);   // not yet seen twice: save for the next run
    } else {
        context.write(key, result);   // seen at least twice: emit as a result
    }
}
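The elided summation is standard word-count logic; filling it in, the full reducer body might look as follows. Only ContinuousContext, carry, and the branch come from the slide; the throws clause and the loop are our assumptions:

    protected void continuousReduce(Text key, Iterable<IntWritable> values,
                                    ContinuousContext context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {  // counts from new and carried data alike
            sum += value.get();
        }
        IntWritable result = new IntWritable(sum);
        if (sum < 2) {
            context.carry(key, result);     // seen only once so far
        } else {
            context.write(key, result);     // seen at least twice
        }
    }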
Application: SPARQL Query
• A SQL-like language for the RDF data format
SELECT ?yr
WHERE {
  ?journal rdf:type bench:Journal .
  ?journal dc:title "Journal 1 (1940)"^^xsd:string .
  ?journal dcterms:issued ?yr
}
<http://localhost/publications/journals/Journal1/1940> rdf:type bench:Journal
<http://localhost/publications/journals/Journal1/1940> dc:title "Journal 1 (1940)"^^xsd:string
<http://localhost/publications/journals/Journal1/1940> dcterms:issued "1940"^^xsd:integer
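As a hedged sketch (not the paper's code) of how a selection job could match one triple pattern over line-oriented triple input, a mapper that selects the dc:title pattern and emits a binding for ?journal:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Selects triples matching: ?journal dc:title "Journal 1 (1940)"^^xsd:string
    public class SelectionMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] t = line.toString().split(" ", 3);  // subject, predicate, object
            if (t.length == 3 && t[1].equals("dc:title")
                    && t[2].startsWith("\"Journal 1 (1940)\"")) {
                context.write(new Text(t[0]), new Text(t[2]));  // binding for ?journal
            }
        }
    }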
Continuous SPARQL
[Figure: the continuous SPARQL pipeline: Selection Jobs (Map/Reduce, each with its own Carry) feeding a Join Job (Map/Reduce)]
[Figure: execution time in hundreds of seconds vs. data set size in millions of RDF triples, cHadoop vs. Hadoop. Experiments on 40 nodes]
Conclusion
• A model for processing dynamic (fast) data using
MapReduce
– Carry allows saving data for future use
• A non-intrusive implementation in Hadoop
– Automatic restarting of continuous jobs
• Limitation: the latency of restarting jobs is high