Transcript
Page 1: Marimba: Abelian - - TU K · PDF fileMarimba: A MapReduce-based Framework for Incremental Recomputations Johannes Schildgen (schildgen@cs.uni-kl.de) TU Kaiserslautern, Germany Research

Marimba:A MapReduce-based Framework for

Incremental Recomputations

Johannes Schildgen ([email protected]) TU Kaiserslautern, Germany

Research Questions• How to avoid full recomputations?• Can a MapReduce job reuse its own result from a

former computation?

BackgroundMany MapReduce jobs for analyzing Big Data require many hours andhave to be repeated again and again because the base data changescontinuously. With simple MapReduce every job has to be executed fromscratch and cannot benefit from previous computations.

What We Can Learn From Materialized ViewsMapReduce jobs and materialized views in relational DBMS have a lotin common. A long running query is executed periodically to calculatea result. When this result can be computed by only analyzing changesthe in base data (deltas) as well as the result of the formercomputation, a materialized view is called self-maintainable. Ourapproach to make MapReduce jobs self-maintainable is based on twopopular concepts, Overwrite and Increment Installation:

Marimba• Simply write Map & Reduce functions,• as intermediate value type use an Abelian type,• the Marimba framework will only read inserted

and deleted data,• deleted values will be inverted,• the changes will be aggrega-

ted on the previous result.• Generic combiner

(simply aggregates values)• For Overwrite Installation, a

Deserialize method has to be defined to convert the former result into Abelian objects.

ExamplesWordCount• map(linenum, line):

∀word in line: emit(word, 1)• invert: ℤ → ℤ ≔ ⋅ −1 ℤ

• aggregate: ℤ × ℤ → ℤ ≔ +ℤ• identity element: 0ℤ

Friends-Of-Friends Other

Results

WordCount Reverse Web-Link Graph

on Big Data

Map

ΔMapΔ

+= Mapinvert

Aggregate Reduce

(dotted arrows: user-defined functions)

removed

edge

added

edge

Former

Result

• Reverse Web-Link Graph• N-Grams• PageRank• …M

arim

ba

Had

oop

map() reduce()

Abelianinvert()

aggregate()

neutral()

HBase

deserialize()

HDFS

Map (+ invert)

Inserted

i got rhythm

i got music

Deleted

rhythm of life

Former Result

got, 8

life, 1

melody 1

of, 5

rhythm 12

got, 10

i, 2

melody 1

music, 1

of, 4

rhythm 12

i, 1

got, 1

rhythm, 1

i, 1

got, 1,

music, 1

rhythm, -1

of, -1

life, -1

got, 8

life, 1

melody 1

of, 5

rhythm 12

00:00

00:10

00:20

00:30

00:40

00:50

01:00

01:10

0% 10% 20% 30% 40% 50% 60% 70%

Tim

e [

hh

:mm

]

Changed Data

FULL INCDEC OVERWRITE

00:00

00:20

00:40

01:00

01:20

01:40

02:00

02:20

0% 10% 20% 30% 40%

Tim

e [

hh

:mm

]

Changed Data

FULL INCDEC OVERWRITE

Top Related