Implementing Generate-Test-and-Aggregate Algorithms on Hadoop


Page 1: Implementing Generate-Test-and-Aggregate Algorithms on Hadoop

Background
Motivation and Objective
Design and implementation
Performance test
Conclusion and future work

Implementing Generate-Test-and-Aggregate Algorithms on Hadoop

Yu Liu¹, Sebastian Fischer², Kento Emoto³, and Zhenjiang Hu⁴

¹ The Graduate University for Advanced Studies
²,⁴ National Institute of Informatics
³ University of Tokyo

September 28, 2011

Page 2: Implementing Generate-Test-and-Aggregate Algorithms on Hadoop

MapReduce
GTA algorithm
Parallelization of GTA algorithm

MapReduce

Computation in three phases: map, shuffle and reduce
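The three phases can be illustrated without a cluster. The following is a minimal single-process sketch in plain Java; it does not use the Hadoop API, and the class name ThreePhaseSketch and the word-count task are assumptions chosen only for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process illustration of map, shuffle, and reduce.
// It only mirrors the data flow of the three phases; nothing runs in parallel.
final class ThreePhaseSketch {

    // Map phase: turn each input record into (key, value) pairs.
    static List<Map.Entry<String, Integer>> mapPhase(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(new SimpleEntry<>(word, 1));
                }
            }
        }
        return pairs;
    }

    // Shuffle phase: group all values that share the same key.
    static Map<String, List<Integer>> shufflePhase(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return groups;
    }

    // Reduce phase: collapse each group of values into one result per key.
    static Map<String, Integer> reducePhase(Map<String, List<Integer>> groups) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : groups.entrySet()) {
            int sum = 0;
            for (int v : group.getValue()) {
                sum += v;
            }
            counts.put(group.getKey(), sum);
        }
        return counts;
    }

    static Map<String, Integer> wordCount(List<String> lines) {
        return reducePhase(shufflePhase(mapPhase(lines)));
    }
}
```

On a real cluster the map and reduce calls run on different nodes and the framework performs the shuffle; the sketch only shows what data each phase consumes and produces.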

Page 3: Implementing Generate-Test-and-Aggregate Algorithms on Hadoop

Programming with MapReduce

Programmers need to implement the following classes (Hadoop)

Page 4: Implementing Generate-Test-and-Aggregate Algorithms on Hadoop

Programming with MapReduce

The main difficulties of MapReduce programming:

Nontrivial problems are usually difficult to compute in a divide-and-conquer fashion

Efficient parallel algorithms are difficult to obtain

Page 6: Implementing Generate-Test-and-Aggregate Algorithms on Hadoop

Generate-Test-and-Aggregate Algorithm

A Generate-Test-and-Aggregate (GTA for short) algorithm consists of three parts:

generate produces all possible solution candidates.

test filters the intermediate data.

aggregate computes a summary of the valid intermediate data.

GTA is a useful and common strategy for a large class of problems

Page 7: Implementing Generate-Test-and-Aggregate Algorithms on Hadoop

An Example: Knapsack Problem

Fill a knapsack with items, each of a certain value and weight, such that the total value of the packed items is maximal while adhering to the weight restriction of the knapsack.

picture from Wikipedia

Page 8: Implementing Generate-Test-and-Aggregate Algorithms on Hadoop

An Example: Knapsack Problem

A knapsack program (GTA algorithm):

knapsack = maxvalue ◦ filter ◦ sublists

Page 9: Implementing Generate-Test-and-Aggregate Algorithms on Hadoop

E.g., there are 3 items: (1kg, $1), (1kg, $2), (2kg, $2)

sublists [(1kg, $1), (1kg, $2), (2kg, $2)]

= ⟅ [ ], [(1kg, $1)], [(1kg, $1), (1kg, $2)], [(1kg, $1), (1kg, $2), (2kg, $2)], [(1kg, $1), (2kg, $2)], [(1kg, $2)], [(1kg, $2), (2kg, $2)], [(2kg, $2)] ⟆

Page 10: Implementing Generate-Test-and-Aggregate Algorithms on Hadoop

Suppose the capacity of the knapsack is 2 kg

filter ⟅ [ ], [(1kg, $1)], [(1kg, $1), (1kg, $2)], [(1kg, $1), (1kg, $2), (2kg, $2)], [(1kg, $1), (2kg, $2)], [(1kg, $2)], [(1kg, $2), (2kg, $2)], [(2kg, $2)] ⟆

= ⟅ [ ], [(1kg, $1)], [(1kg, $1), (1kg, $2)], [(2kg, $2)], [(1kg, $2)] ⟆

Page 11: Implementing Generate-Test-and-Aggregate Algorithms on Hadoop

maxvalue ⟅ [ ], [(1kg, $1)], [(1kg, $1), (1kg, $2)], [(2kg, $2)], [(1kg, $2)] ⟆ = $3

Page 12: Implementing Generate-Test-and-Aggregate Algorithms on Hadoop

This program is simple but inefficient because it generates an exponential amount of intermediate data (2^n sublists for n items).
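To make the cost concrete, the naive program can be written down directly. The following is a sequential sketch only, not the library's code; the class name NaiveKnapsack and the encoding of an item as an int pair {weight, value} are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sequential rendering of knapsack = maxvalue . filter . sublists.
// Each item is an int array {weight, value}; weights are assumed non-negative.
final class NaiveKnapsack {

    // generate: all 2^n sublists of the input, preserving item order.
    static List<List<int[]>> sublists(List<int[]> items) {
        List<List<int[]>> result = new ArrayList<>();
        result.add(new ArrayList<>());
        for (int[] item : items) {
            int size = result.size();
            for (int i = 0; i < size; i++) {
                List<int[]> extended = new ArrayList<>(result.get(i));
                extended.add(item);
                result.add(extended);
            }
        }
        return result;
    }

    // test: keep the candidates whose total weight fits the capacity.
    static List<List<int[]>> filter(List<List<int[]>> candidates, int capacity) {
        List<List<int[]>> valid = new ArrayList<>();
        for (List<int[]> candidate : candidates) {
            int weight = 0;
            for (int[] item : candidate) {
                weight += item[0];
            }
            if (weight <= capacity) {
                valid.add(candidate);
            }
        }
        return valid;
    }

    // aggregate: the maximum total value among the given candidates.
    static int maxvalue(List<List<int[]>> candidates) {
        int best = Integer.MIN_VALUE;
        for (List<int[]> candidate : candidates) {
            int value = 0;
            for (int[] item : candidate) {
                value += item[1];
            }
            best = Math.max(best, value);
        }
        return best;
    }

    static int knapsack(List<int[]> items, int capacity) {
        return maxvalue(filter(sublists(items), capacity));
    }
}
```

For the three items of the running example and a capacity of 2 kg, sublists returns 8 candidates, filter keeps 5, and knapsack returns 3. The list built by sublists doubles with every additional item, which is the exponential blow-up mentioned above.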

Page 13: Implementing Generate-Test-and-Aggregate Algorithms on Hadoop

Theorems for Generating Efficient Parallel GTA Programs

Efficient parallel programs can be derived from users’ naive but correct programs written in terms of a generate, a test, and an aggregate function [Emoto et al., 2011]

aggregate ◦ test ◦ generate ⇒ list homomorphism

List homomorphisms are a class of recursive functions that match the divide-and-conquer paradigm very well [Bird, 87; Cole, 95].

Page 14: Implementing Generate-Test-and-Aggregate Algorithms on Hadoop

Emoto’s theorem holds under the following assumptions:

aggregate is a semiring homomorphism.

test is a list homomorphism.

generate is polymorphic over semiring structures.

Page 16: Implementing Generate-Test-and-Aggregate Algorithms on Hadoop

Motivation and Objective

Emoto’s fusion theorem shows a possible way to systematically implement efficient parallel programs with GTA algorithms

We need to evaluate this approach by implementing a practical library, which should

have an easy-to-use programming interface that helps users design GTA algorithms

be able to generate efficient parallel programs on MapReduce (Hadoop)

Page 17: Implementing Generate-Test-and-Aggregate Algorithms on Hadoop

System Overview


Page 18: Implementing Generate-Test-and-Aggregate Algorithms on Hadoop

Implementation on Hadoop

We implement the following classes:

Page 19: Implementing Generate-Test-and-Aggregate Algorithms on Hadoop


Implementation on Hadoop

MapReducer is an interface for list homomorphisms

h [ ] = id_⊕
h [a] = f a
h (x ++ y) = h x ⊕ h y

    public interface MapReducer<Elem, Val, Res> {
        public Val identity();
        public Val element(Elem elem);
        public Val combine(Val left, Val right);
        public Res postprocess(Val val);
    }
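As an illustration of how the four methods fit together, the sketch below implements sum against a copy of the interface and evaluates it by splitting the list in halves. The wrapper class MapReducerSketch, the Sum class, and the fold and run helpers are assumptions for this example and are not part of the library shown on the slide.

```java
import java.util.List;

// Hypothetical example: sum as a list homomorphism, evaluated by divide and conquer.
final class MapReducerSketch {

    // Local copy of the interface from the slide, so the example is self-contained.
    interface MapReducer<Elem, Val, Res> {
        Val identity();
        Val element(Elem elem);
        Val combine(Val left, Val right);
        Res postprocess(Val val);
    }

    // sum: h [ ] = 0, h [a] = a, h (x ++ y) = h x + h y
    static final class Sum implements MapReducer<Integer, Long, Long> {
        public Long identity() { return 0L; }
        public Long element(Integer elem) { return elem.longValue(); }
        public Long combine(Long left, Long right) { return left + right; }
        public Long postprocess(Long val) { return val; }
    }

    // Splits the list anywhere (here: in the middle) and combines the partial
    // results. Associativity of combine makes the split point irrelevant.
    static <E, V, R> V fold(MapReducer<E, V, R> h, List<E> list) {
        if (list.isEmpty()) {
            return h.identity();
        }
        if (list.size() == 1) {
            return h.element(list.get(0));
        }
        int mid = list.size() / 2;
        V left = fold(h, list.subList(0, mid));
        V right = fold(h, list.subList(mid, list.size()));
        return h.combine(left, right);
    }

    static <E, V, R> R run(MapReducer<E, V, R> h, List<E> list) {
        return h.postprocess(fold(h, list));
    }
}
```

Because the two recursive calls in fold are independent, they could be evaluated on different nodes; this is the property a MapReduce backend would exploit.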

Page 20: Implementing Generate-Test-and-Aggregate Algorithms on Hadoop

Aggregator defines a semiring homomorphism

(A, ⊕, ⊗) → (S, ⊕′, ⊗′)

    public interface Aggregator<A, S> {
        public S zero();
        public S one();
        public S singleton(A a);
        public S plus(S left, S right);
        public S times(S left, S right);
    }
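A possible aggregator for the knapsack example is the max-plus semiring, where plus chooses the better of two alternatives and times adds the values of parts that are packed together. The sketch below is an assumption about how such an aggregator could look; the names AggregatorSketch, Item, and MaxValue are hypothetical and not taken from the library.

```java
// Hypothetical max-value aggregator over the max-plus semiring.
final class AggregatorSketch {

    // Local copy of the interface from the slide, so the example is self-contained.
    interface Aggregator<A, S> {
        S zero();
        S one();
        S singleton(A a);
        S plus(S left, S right);
        S times(S left, S right);
    }

    // Hypothetical item type: a weight and a value.
    static final class Item {
        final int weight;
        final int value;
        Item(int weight, int value) {
            this.weight = weight;
            this.value = value;
        }
    }

    static final class MaxValue implements Aggregator<Item, Long> {
        // Stands for minus infinity: the summary of "no candidate at all".
        static final long NEG_INF = Long.MIN_VALUE;

        public Long zero() { return NEG_INF; }              // identity of plus (max)
        public Long one() { return 0L; }                    // identity of times (+)
        public Long singleton(Item a) { return (long) a.value; }
        public Long plus(Long left, Long right) { return Math.max(left, right); }
        public Long times(Long left, Long right) {
            // zero must annihilate times, so minus infinity is propagated explicitly.
            return (left == NEG_INF || right == NEG_INF) ? NEG_INF : left + right;
        }
    }
}
```

The explicit check in times keeps zero an annihilator and avoids overflow when adding to the minus-infinity marker.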

Page 21: Implementing Generate-Test-and-Aggregate Algorithms on Hadoop

Test is almost a list homomorphism (a homomorphism into Key followed by a postprocess step to Boolean); it extends MapReducer

    public interface Test<Elem, Key> extends MapReducer<Elem, Key, Boolean> {}
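For the knapsack example, a weight-limit test could be sketched as follows. This is an illustration under assumptions: the names TestSketch, Item, and WeightLimit are hypothetical, and weights are assumed to be non-negative integers. All weights above the capacity are collapsed into a single overflow key, so the key set stays finite.

```java
// Hypothetical weight-limit test for knapsack, written against the Test interface.
final class TestSketch {

    // Local copies of the interfaces from the slides, so the example is self-contained.
    interface MapReducer<Elem, Val, Res> {
        Val identity();
        Val element(Elem elem);
        Val combine(Val left, Val right);
        Res postprocess(Val val);
    }

    interface Test<Elem, Key> extends MapReducer<Elem, Key, Boolean> {}

    // Hypothetical item type: a non-negative weight and a value.
    static final class Item {
        final int weight;
        final int value;
        Item(int weight, int value) {
            this.weight = weight;
            this.value = value;
        }
    }

    // Keys are total weights in 0..capacity, plus capacity + 1 meaning "too heavy".
    static final class WeightLimit implements Test<Item, Integer> {
        private final int capacity;

        WeightLimit(int capacity) {
            this.capacity = capacity;
        }

        public Integer identity() { return 0; }

        public Integer element(Item item) {
            return Math.min(item.weight, capacity + 1);
        }

        public Integer combine(Integer left, Integer right) {
            return Math.min(left + right, capacity + 1);
        }

        // ok: a candidate is valid if its (capped) total weight fits the capacity.
        public Boolean postprocess(Integer key) {
            return key <= capacity;
        }
    }
}
```

Capping at capacity + 1 keeps combine associative for non-negative weights while bounding the number of distinct keys by capacity + 2.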

Page 22: Implementing Generate-Test-and-Aggregate Algorithms on Hadoop

Generator implements a MapReducer

polymorphic over semirings: via the constructor

filter embedding: the embed function returns a new generator

    public abstract class Generator<Elem, Single, Val, Res>
            implements MapReducer<Elem, Val, Res> {
        // The constructor takes an instance of Aggregator
        public Generator(Aggregator<Single, Val> aggregator) { ... }

        // Takes an instance of Test and returns a new instance of Generator
        public <Key> Generator<Elem, Single, WritableMap<Key, Val>, Res>
                embed(final Test<Single, Key> test) {
            final Generator<Elem, Single, Val, Res> base = this;
            return new Generator<Elem, Single, WritableMap<Key, Val>, Res>
                (new Aggregator<Single, WritableMap<Key, Val>>() { ... });
        }
        public Val process(List<Elem> list) { ... }
        ...
    }

Page 23: Implementing Generate-Test-and-Aggregate Algorithms on Hadoop

Implementation on Hadoop

1 Users need to write their own Generator, Test, and Aggregator by extending/implementing the ones provided by the library¹

2 An instance of Generator, which is also an efficient list homomorphism, is created at run time on each worker node

3 This list homomorphism instance can be executed by Hadoop in parallel

¹ Our library provides commonly used Generators and Aggregators.

Page 24: Implementing Generate-Test-and-Aggregate Algorithms on Hadoop


Java Code

Let’s have a look at the actual implementation of GTA Knapsack...

Page 25: Implementing Generate-Test-and-Aggregate Algorithms on Hadoop

Performance Evaluation

Environment: hardware

We configured clusters with 2, 4, 8, 16, and 32 nodes (virtual machines). Each computing/data node has one CPU (VM, [email protected], 1 core) and 3 GB of memory.

Test data

10^2 × 2^20 (≈ 10^8) knapsack items (3.2 GB)

Each item’s weight is between 0 and 10, and the capacity of the knapsack is 100.

Page 26: Implementing Generate-Test-and-Aggregate Algorithms on Hadoop

Evaluation on Hadoop

The knapsack program scales well as the number of cluster nodes increases

Page 27: Implementing Generate-Test-and-Aggregate Algorithms on Hadoop

Conclusion

The implementation of the GTA library on Hadoop can

hide the technical details of MapReduce (Hadoop)

automatically perform parallelization and optimization

generate MapReduce programs that scale well

make coding, testing, and code reuse much simpler

Page 28: Implementing Generate-Test-and-Aggregate Algorithms on Hadoop

Future Work

Optimization of the current framework to gain better performance

Extension of the current framework

Other approaches to systematic parallel programming

Page 29: Implementing Generate-Test-and-Aggregate Algorithms on Hadoop

Thanks

Questions?

The project is hosted on http://screwdriver.googlecode.com

Page 30: Implementing Generate-Test-and-Aggregate Algorithms on Hadoop

Appendix: The Computation on Semiring

Definition (Semiring)

Given a set S and two binary operations ⊕ and ⊗, the triple (S, ⊕, ⊗) is called a semiring if and only if

(S, ⊕) is a commutative monoid with identity element id_⊕

(S, ⊗) is a monoid with identity element id_⊗

⊗ is associative and distributes over ⊕

id_⊕ is a zero of ⊗: id_⊕ ⊗ a = a ⊗ id_⊕ = id_⊕

(Int, +, ×) is a semiring; (Int ∪ {−∞}, max, +) is another semiring

Definition (Semiring homomorphism)

Given two semirings (S, ⊕, ⊗) and (S′, ⊕′, ⊗′), a function hom : S → S′ is a semiring homomorphism from (S, ⊕, ⊗) to (S′, ⊕′, ⊗′) iff it is a monoid homomorphism from (S, ⊕) to (S′, ⊕′) and also a monoid homomorphism from (S, ⊗) to (S′, ⊗′).

Page 31: Implementing Generate-Test-and-Aggregate Algorithms on Hadoop

Theorem (Filter-Embedding Fusion)

Given a set A, a finite monoid (M, ⊙), a monoid homomorphism hom from ([A], ++) to (M, ⊙), a semiring (S, ⊕, ⊗), a semiring homomorphism aggregate from (⟅[A]⟆, ⊎, ×₊₊) to (S, ⊕, ⊗), a function ok : M → Bool, and a polymorphic semiring generator generate, the following equation holds:

aggregate ◦ filter (ok ◦ hom) ◦ generate_{⊎,×₊₊} (λx → ⟅[x]⟆)

= postprocess_M ok ◦ generate_{⊕_M,⊗_M} (λx → aggregate_M ⟅[x]⟆)

The result of fusion is an efficient algorithm in the form of a list homomorphism.
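For the knapsack example, the fused homomorphism can be pictured as working on small tables indexed by total weight instead of on bags of sublists. The sketch below is hand-written for illustration and is not the code produced by the library; the class name FusedKnapsackSketch, the int pair encoding {weight, value}, and the choice to drop over-capacity entries rather than keep an overflow key are all assumptions. Weights are assumed to be non-negative integers.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical fused knapsack: each partial result is a table that maps a total
// weight (0..capacity) to the best value reachable with exactly that weight.
final class FusedKnapsackSketch {
    static final long NEG_INF = Long.MIN_VALUE; // "no candidate with this weight"
    final int capacity;

    FusedKnapsackSketch(int capacity) {
        this.capacity = capacity;
    }

    // Empty input: only the empty selection, with weight 0 and value 0.
    long[] identity() {
        long[] table = new long[capacity + 1];
        Arrays.fill(table, NEG_INF);
        table[0] = 0;
        return table;
    }

    // One item: either skip it, or take it if it fits on its own.
    long[] element(int weight, int value) {
        long[] table = identity();
        if (weight <= capacity) {
            table[weight] = Math.max(table[weight], value);
        }
        return table;
    }

    // Associative combine: weights add up, values add up, best value per weight wins.
    long[] combine(long[] left, long[] right) {
        long[] table = new long[capacity + 1];
        Arrays.fill(table, NEG_INF);
        for (int i = 0; i <= capacity; i++) {
            if (left[i] == NEG_INF) continue;
            for (int j = 0; i + j <= capacity; j++) {
                if (right[j] == NEG_INF) continue;
                table[i + j] = Math.max(table[i + j], left[i] + right[j]);
            }
        }
        return table;
    }

    // Every surviving weight is within the capacity, so take the best entry.
    long postprocess(long[] table) {
        long best = NEG_INF;
        for (long value : table) {
            best = Math.max(best, value);
        }
        return best;
    }

    long[] fold(List<int[]> items) {
        if (items.isEmpty()) return identity();
        if (items.size() == 1) return element(items.get(0)[0], items.get(0)[1]);
        int mid = items.size() / 2;
        return combine(fold(items.subList(0, mid)), fold(items.subList(mid, items.size())));
    }

    long solve(List<int[]> items) {
        return postprocess(fold(items));
    }
}
```

In this sketch every intermediate result has capacity + 1 entries regardless of the number of items, in contrast to the 2^n sublists of the naive program, and combine costs on the order of capacity² operations.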

Page 33: Implementing Generate-Test-and-Aggregate Algorithms on Hadoop

List Homomorphism

List homomorphisms [Bird, 87; Cole, 95] are a class of recursive functions.

Definition of List Homomorphism

A function h is a list homomorphism if there is an associative operator ⊙ such that, for any lists x and y,

h (x ++ y) = h(x) ⊙ h(y),

where ++ is list concatenation, h [a] = f a, and id_⊙ is an identity element of ⊙, so that h(x) ⊙ id_⊙ = h(x).

Instance of a list homomorphism

sum [a] = a
sum (x ++ y) = sum x + sum y

Page 34: Implementing Generate-Test-and-Aggregate Algorithms on Hadoop

List Homomorphism

A list homomorphism can be automatically parallelized by MapReduce [Yu et al., EuroPar11].
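The same idea can be tried on a single machine. The sketch below uses Java parallel streams rather than MapReduce or Hadoop, so it is only an analogy: the per-element function and the associative operator with its identity are all the runtime needs in order to split the work. The class name ParallelSumSketch is an assumption.

```java
import java.util.List;

// Hypothetical single-machine analogy using Java parallel streams, not Hadoop.
final class ParallelSumSketch {

    // sum as a list homomorphism: f a = a, the operator is +, and its identity is 0.
    static long parallelSum(List<Integer> xs) {
        return xs.parallelStream()
                 .map(Integer::longValue)  // h [a] = f a
                 .reduce(0L, Long::sum);   // identity and associative combine
    }
}
```

Stream.reduce requires the combining function to be associative and the first argument to be its identity, which are the same conditions as in the definition of a list homomorphism.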

Page 35: Implementing Generate-Test-and-Aggregate Algorithms on Hadoop

Evaluation on Hadoop

We test 3.2 GB of data on clusters of {2, 4, 8, 16, 32} nodes and 32 GB of data on clusters of {32, 64} nodes

              2 nodes   4 nodes   8 nodes   16 nodes   32 nodes   64 nodes
time (sec.)   1602      882       482       317        961        511
speedup       –         × 1.82    × 1.83    × 1.52     –          × 1.88
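The speedup row appears to be the ratio between the running time on n nodes and the running time on 2n nodes for the same data set. That reading is an assumption, but it reproduces the reported figures:

```latex
\text{speedup}(n \to 2n) = \frac{T(n)}{T(2n)}:\qquad
\frac{1602}{882} \approx 1.82,\quad
\frac{882}{482} \approx 1.83,\quad
\frac{482}{317} \approx 1.52,\quad
\frac{961}{511} \approx 1.88
```

An ideal doubling of nodes would give a ratio of 2.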
