cHadoop: Continuous Hadoop
A Lightweight Continuous Jobs Mechanism for MapReduce Frameworks
Trong-Tuan Vu, INRIA Lille Nord Europe
Fabrice Huet, INRIA-University of Nice
Big Data processing landscape

Model     | Static data           | Dynamic (fast) data
Batch     | Hadoop                |
Iterative | HaLoop, Twister, PIC  |
Real-time | HOP                   |
Stream    |                       | Amazon S4, Twitter Storm
Batch Processing of Big Data
• Canonical workflow
– Push data to cluster
– Start jobs
– Pull results
– Profit!
• Works only as long as the data set does not change (see the client-side sketch below)
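As a concrete illustration, a client-side sketch of this workflow using the standard Hadoop Java API; the paths and the elided job configuration are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;

    public class BatchWorkflow {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Push data to the cluster
            fs.copyFromLocalFile(new Path("data/input.txt"), new Path("/input"));

            // Start the job (mapper/reducer/jar setup elided)
            Job job = Job.getInstance(conf, "word-count");
            // ...
            job.waitForCompletion(true);

            // Pull results back
            fs.copyToLocalFile(new Path("/output/part-r-00000"), new Path("results.txt"));
        }
    }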
Dealing with dynamic data
• Bulk arrival
• Job is submitted only once and re-runs automatically
• Slightly changes the workflow
– While (new data)
• Push, execute, pull, profit!
Continuous Analysis
[Figure: Word-Count over time]
Run 1: input "Foo Bar"  -> Foo 1, Bar 1
Run 2: input "What Bar" -> What 1, Bar 1
Merged result           -> Foo 1, Bar 2, What 1
Properties
• Efficiency
– Only process new data, not the whole data set
• Correctness
– Merging the results of all incremental runs should give
the same result as processing the whole data set (see the sketch below)
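For Word-Count, this property can be checked concretely: merging per-run results is per-word summation, so the merged counts must equal the counts over the full data set. A minimal plain-Java sketch (the class and method names are ours):

    import java.util.HashMap;
    import java.util.Map;

    public class MergeCounts {
        // For Word-Count, merging two runs' results is per-word summation.
        static Map<String, Integer> merge(Map<String, Integer> a, Map<String, Integer> b) {
            Map<String, Integer> out = new HashMap<>(a);
            b.forEach((word, n) -> out.merge(word, n, Integer::sum));
            return out;
        }

        public static void main(String[] args) {
            Map<String, Integer> run1 = Map.of("Foo", 1, "Bar", 1);   // from "Foo Bar"
            Map<String, Integer> run2 = Map.of("What", 1, "Bar", 1);  // from "What Bar"
            System.out.println(merge(run1, run2));  // Foo=1, Bar=2, What=1 (order may vary)
        }
    }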
Dependencies
[Figure: Word-2 over time (Word-2: display words which appear at least twice)]
Run 1: input "Foo Bar"  -> (no output: each word seen only once)
Run 2: input "What Bar" -> Bar (seen twice, but only across both runs)
Not all data are equal
• Processing only new data leads to incorrect results
– Because some of the old data is still needed
• Different categories
– New data
– Results
– Carried data
Carried data
• Data which has already been processed
– But could still be useful in subsequent runs
• Typically application-dependent
– Let the programmer decide what to carry
• Example (Word-2):
– Result: words which appear at least twice
– Carry: words which have appeared only once so far
Continuous Map-Reduce jobs
[Figure: successive Map/Reduce runs, each producing a Carry that is fed back into the next run]
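One plausible way successive runs could be wired together, assuming the carry is materialized as HDFS files (consistent with the setCarryFilesName call shown later): each run reads the newly arrived data plus the previous run's carry. All paths here are hypothetical:

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CarryWiring {
        // Sketch: run N consumes the new data plus run N-1's carry files.
        static void configureRun(Job job, int runNumber) throws IOException {
            FileInputFormat.addInputPath(job, new Path("/data/new"));  // freshly arrived data
            if (runNumber > 1) {
                FileInputFormat.addInputPath(job,
                        new Path("/carry/run-" + (runNumber - 1)));    // previous run's carry
            }
            FileOutputFormat.setOutputPath(job, new Path("/results/run-" + runNumber));
        }
    }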
Contribution
• A continuous Job model adapted to MapReduce
• An implementation on top of Hadoop
• An evaluation with two toy applications and a
realistic one
CONTINUOUS HADOOP
Continuous MapReduce Framework
• Based on the Hadoop MapReduce Framework
• Support for automatic re-execution of jobs
– Notification of new data
– Filtering of data by timestamp
• New API with carry function
Even Elephants are fast
• No modification to Hadoop source code
– Proxies/Interceptors
– Subclassing
– Reflection (accessing private fields)
• Use public API
• Hopefully we never have to play cat and mouse with the elephant (see the reflection sketch below)
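For illustration, a minimal sketch of the reflection technique: reading a private field through the public java.lang.reflect API, without touching the target's source code. The field being accessed is hypothetical:

    import java.lang.reflect.Field;

    public class PrivateFieldAccess {
        // Look up a private field on the target's class and bypass the access check.
        static Object readPrivateField(Object target, String fieldName) throws Exception {
            Field field = target.getClass().getDeclaredField(fieldName);
            field.setAccessible(true);
            return field.get(target);
        }
    }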
[Figure: standard Hadoop architecture (NameNode, JobTracker, TaskTrackers running Tasks, Data Nodes on the Local File System) extended with a Continuous NameNode, a Continuous JobTracker, and Continuous Jobs]
Time stamping data
• Jobs should only process new data
– Only blocks added after the last execution
• HDFS has limitations
– No in-place modification and no appending
• Add a time stamp for each block as metadata in the
Continuous NameNode (see the filtering sketch below)
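A minimal sketch of timestamp-based filtering using the stock HDFS client API; cHadoop keeps per-block timestamps in the Continuous NameNode, but file modification times illustrate the same idea (directory path and method name are hypothetical):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class NewDataFilter {
        // Keep only files added (modified) after the last execution.
        static List<Path> newerThan(FileSystem fs, Path dir, long lastRunMillis)
                throws IOException {
            List<Path> fresh = new ArrayList<>();
            for (FileStatus status : fs.listStatus(dir)) {
                if (status.getModificationTime() > lastRunMillis) {
                    fresh.add(status.getPath());
                }
            }
            return fresh;
        }
    }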
API example (Word-2-count)
ContinuousJob job = new ContinuousJob();
// ...
job.setCarryFilesName("carry");   // carried pairs are written to "carry" files

protected void continuousReduce(Text key, Iterable<IntWritable> values,
                                ContinuousContext context) {
    // ...
    if (sum < 2) {
        context.carry(key, result);   // not yet seen twice: save for the next run
    } else {
        context.write(key, result);   // seen at least twice: emit as a result
    }
}
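The elided summation is standard word-count logic; filling it in, the full reducer body might look as follows. Only ContinuousContext, carry, and the branch come from the slide; the throws clause and the loop are our assumptions:

    protected void continuousReduce(Text key, Iterable<IntWritable> values,
                                    ContinuousContext context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {  // counts from new and carried data alike
            sum += value.get();
        }
        IntWritable result = new IntWritable(sum);
        if (sum < 2) {
            context.carry(key, result);     // seen only once so far
        } else {
            context.write(key, result);     // seen at least twice
        }
    }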
Application: SPARQL Query
• A SQL-like language for the RDF data format
SELECT ?yr
WHERE {
  ?journal rdf:type bench:Journal .
  ?journal dc:title "Journal 1 (1940)"^^xsd:string .
  ?journal dcterms:issued ?yr
}
<http://localhost/publications/journals/Journal1/1940> rdf:type bench:Journal
<http://localhost/publications/journals/Journal1/1940> dc:title "Journal 1 (1940)"^^xsd:string
<http://localhost/publications/journals/Journal1/1940> dcterms:issued "1940"^^xsd:integer
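As a hedged sketch (not the paper's code) of how a selection job could match one triple pattern over line-oriented triple input, a mapper that selects the dc:title pattern and emits a binding for ?journal:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Selects triples matching: ?journal dc:title "Journal 1 (1940)"^^xsd:string
    public class SelectionMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] t = line.toString().split(" ", 3);  // subject, predicate, object
            if (t.length == 3 && t[1].equals("dc:title")
                    && t[2].startsWith("\"Journal 1 (1940)\"")) {
                context.write(new Text(t[0]), new Text(t[2]));  // binding for ?journal
            }
        }
    }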
Continuous SPARQL
[Figure: the continuous SPARQL pipeline: Selection Jobs (Map/Reduce, each with its own Carry) feeding a Join Job (Map/Reduce)]
[Figure: execution time in hundreds of seconds vs. data set size in millions of RDF triples, cHadoop vs. Hadoop. Experiments on 40 nodes]
Conclusion
• A model for processing dynamic (fast) data using
MapReduce
– Carry allows saving data for future use
• A non-intrusive implementation in Hadoop
– Automatic restarting of continuous jobs
• Limitation: the latency of restarting jobs is high