map/reduce and hadoop performance ioana manolescu senior researcher, oak team lead inria saclay and...

21
Map/Reduce and Hadoop performance Ioana Manolescu Senior researcher, OAK team lead Inria Saclay and Université Paris-Sud Big Data Paris, 2013

Upload: solomon-cook

Post on 27-Dec-2015

217 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Map/Reduce and Hadoop performance Ioana Manolescu Senior researcher, OAK team lead Inria Saclay and Université Paris-Sud Big Data Paris, 2013

Map/Reduce and Hadoop performance

Ioana ManolescuSenior researcher, OAK team lead

Inria Saclay and Université Paris-Sud

Big Data Paris, 2013

Page 2: Map/Reduce and Hadoop performance Ioana Manolescu Senior researcher, OAK team lead Inria Saclay and Université Paris-Sud Big Data Paris, 2013

Ioana Manolescu (Inria), Big Data Paris 2

Plan

• The Map/Reduce model• Two performance problems and ways out– Blocking steps in the Map/Reduce processing

pipeline– Non-selective data access

• Conclusion: toward Map/Reduce optimization?

03/04/13

Page 3: Map/Reduce and Hadoop performance Ioana Manolescu Senior researcher, OAK team lead Inria Saclay and Université Paris-Sud Big Data Paris, 2013

Ioana Manolescu (Inria), Big Data Paris 3

The Map/Reduce model• Problem:

– How to compute in a distributed fashion a given processing to a very large amount of data

• Map/Reduce solution:– Programmer: express the processing as

• Splitting the data • Extract (key, value pairs) from each partition (MAP)• Compute partial results for each key (REDUCE)

– Map/Reduce platform (e.g., Hadoop): • Distributes partitions, runs one MAP task per partition• Runs one or several REDUCE tasks per key• Sends data across machines from MAP to REDUCE

03/04/13

Implicit parallelism

Fault tolerance

Communication across machines

Page 4: Map/Reduce and Hadoop performance Ioana Manolescu Senior researcher, OAK team lead Inria Saclay and Université Paris-Sud Big Data Paris, 2013

Ioana Manolescu (Inria), Big Data Paris 4

Map/Reduce in detail

03/04/13

Page 5: Map/Reduce and Hadoop performance Ioana Manolescu Senior researcher, OAK team lead Inria Saclay and Université Paris-Sud Big Data Paris, 2013

Ioana Manolescu (Inria), Big Data Paris 5

Hadoop Map/Reduce

03/04/13

Data Load

Map()

Local sort

Map write

Merge

Reduce

Final write

Merge

Reduce

Final write

Merge

Reduce

Final write

Merge

Reduce

Final write

Data Load

Map()

Local sort

Map write

Data Load

Map()

Local sort

Map write

Data Load

Map()

Local sort

Map write

Map

Redu

ce

Page 6: Map/Reduce and Hadoop performance Ioana Manolescu Senior researcher, OAK team lead Inria Saclay and Université Paris-Sud Big Data Paris, 2013

Performance problem 1: Idle CPU due to blocking steps

Page 7: Map/Reduce and Hadoop performance Ioana Manolescu Senior researcher, OAK team lead Inria Saclay and Université Paris-Sud Big Data Paris, 2013

Ioana Manolescu (Inria), Big Data Paris 7

Hadoop resource usage

03/04/13

Data Load

Map()

Local sort

Map write

Merge

Reduce

Final write

Merge

Reduce

Final write

Data Load

Map()

Local sort

Map write

Blocking sort

Blocking sort-merge

Page 8: Map/Reduce and Hadoop performance Ioana Manolescu Senior researcher, OAK team lead Inria Saclay and Université Paris-Sud Big Data Paris, 2013

Ioana Manolescu (Inria), Big Data Paris 8

Hadoop benchmark[Li, Mazur, Diao, McGregor, Shenoy, ACM SIGMOD Conference 2011]

CPU stalls during I/O intensive MergeReduce strictly follows Merge

03/04/13

Page 9: Map/Reduce and Hadoop performance Ioana Manolescu Senior researcher, OAK team lead Inria Saclay and Université Paris-Sud Big Data Paris, 2013

Ioana Manolescu (Inria), Big Data Paris 9

Hash-based algorithms to improve Hadoop performance

[Li, Mazur, Diao, McGregor, Shenoy, SIGMOD 2011]Main idea: use non-blocking hash-based algorithms to

group items by keys during Map.LocalSort and Reduce.Merge

Principle of hashing:

• Partitions can be in memory or flushed to disk• If the reduce works incrementally, early send

03/04/13

Data Load

Map()

Local sort

Map write

Merge

Reduce

Final write

key value

0 9, 3, 3

1 1, 13, 1

2 2, 2, 8, 2

3, 1, 2, 8, 2, 13, 1, 2, 3, 9…

h(x) e.g. x % 3

Page 10: Map/Reduce and Hadoop performance Ioana Manolescu Senior researcher, OAK team lead Inria Saclay and Université Paris-Sud Big Data Paris, 2013

Performance problem 2: non-selective data access

Page 11: Map/Reduce and Hadoop performance Ioana Manolescu Senior researcher, OAK team lead Inria Saclay and Université Paris-Sud Big Data Paris, 2013

Ioana Manolescu (Inria), Big Data Paris 11

Data access in Hadoop

03/04/13

• Basic model: read all the data– If the tasks are selective, we don't really

need to!• Database indexes? But:– Map/Reduce works on top of a file

system (e.g. Hadoop file system, HDFS)– Data is stored only once– Hard to foresee all future processing• "Exploratory nature" of Hadoop

Data Load

Map()

Local sort

Map write

Merge

Reduce

Final write

Page 12: Map/Reduce and Hadoop performance Ioana Manolescu Senior researcher, OAK team lead Inria Saclay and Université Paris-Sud Big Data Paris, 2013

Ioana Manolescu (Inria), Big Data Paris 12

Accelerating data access in Hadoop

• Idea 1: Hadop++ [Jindal, Quiané-Ruiz, Dittrich, ACM SOCC, 2011]– Add header information to each

data split, summarizing split attribute values

– Modify the RecordReader of HDFS, used by the Map() will prune irrelevant splits

03/04/13

Data Load

Map()

Local sort

Map write

Merge

Reduce

Final write

Page 13: Map/Reduce and Hadoop performance Ioana Manolescu Senior researcher, OAK team lead Inria Saclay and Université Paris-Sud Big Data Paris, 2013

Ioana Manolescu (Inria), Big Data Paris 13

Accelerating data access in Hadoop

• Idea 2: HAIL [Dittrich, Quiané-Ruiz, Richter, Schuh, Jindal, Schad, PVLDB 2012]– Each storage node builds an

in-memory, clustered index of the data in its split

– There are three copies of each split for reliability Build three different indexes!

– Customize RecordReader

03/04/13

Data Load

Map()

Local sort

Map write

Merge

Reduce

Final write

Page 14: Map/Reduce and Hadoop performance Ioana Manolescu Senior researcher, OAK team lead Inria Saclay and Université Paris-Sud Big Data Paris, 2013

Ioana Manolescu (Inria), Big Data Paris 14

Hadoop, Hadoop++ and HAIL

03/04/13

Data Load

Map()

Local sort

Map write

Merge

Reduce

Final write

Page 15: Map/Reduce and Hadoop performance Ioana Manolescu Senior researcher, OAK team lead Inria Saclay and Université Paris-Sud Big Data Paris, 2013

Conclusion

Toward Map/Reduce optimization

Page 16: Map/Reduce and Hadoop performance Ioana Manolescu Senior researcher, OAK team lead Inria Saclay and Université Paris-Sud Big Data Paris, 2013

Ioana Manolescu (Inria), Big Data Paris 16

Optimization defined as Hadoop tuning

Hadoop parameters:[Herodotou, technical report, Duke Univ, 2011]

03/04/13

Page 17: Map/Reduce and Hadoop performance Ioana Manolescu Senior researcher, OAK team lead Inria Saclay and Université Paris-Sud Big Data Paris, 2013

Ioana Manolescu (Inria), Big Data Paris 17

Hadoop tuning

[Babu, ACM SoCC 2010]

03/04/13

Page 18: Map/Reduce and Hadoop performance Ioana Manolescu Senior researcher, OAK team lead Inria Saclay and Université Paris-Sud Big Data Paris, 2013

Ioana Manolescu (Inria), Big Data Paris 18

Hadoop performance model

03/04/13

[Li, Mazur, Diao, McGregor, Shenoy, SIGMOD 2011]

Page 19: Map/Reduce and Hadoop performance Ioana Manolescu Senior researcher, OAK team lead Inria Saclay and Université Paris-Sud Big Data Paris, 2013

Ioana Manolescu (Inria), Big Data Paris 19

Hadoop performance model

03/04/13

[Li, Mazur, Diao, McGregor, Shenoy, SIGMOD 2011]

Easy

Page 20: Map/Reduce and Hadoop performance Ioana Manolescu Senior researcher, OAK team lead Inria Saclay and Université Paris-Sud Big Data Paris, 2013

Ioana Manolescu (Inria), Big Data Paris 20

References• Shivnath Babu. "Towards automatic optimization of MapReduce

programs", ACM SoCC, 2010• Herodotos Herodotou. "Hadoop Performance Models". Duke University,

2011• Boduo Li, Edward Mazur, Yanlei Diao, Andrew McGregor, Prashant Shenoy.

"A Platform for Scalable One-Pass Analytics using MapReduce", ACM SIGMOD 2011

• Jens Dittrich, Jorge-Arnulfo Quiané-Ruiz, Stefan Richter, Stefan Schuh, Alekh Jindal, Jorg Schad. "Only Aggressive Elephants are Fast Elephants", VLDB 2012

• A.Jindal,J.-A.Quiané-Ruiz,andJ.Dittrich. "Trojan Data Layouts: Right Shoes for a Running Elephant" SOCC, 2011

• Harold Lim, Herodotos Herodotou, Shivnath Babu. "Stubby: A Transformation-based Optimizer for MapReduce Workflows" PVLDB 2012

03/04/13

Page 21: Map/Reduce and Hadoop performance Ioana Manolescu Senior researcher, OAK team lead Inria Saclay and Université Paris-Sud Big Data Paris, 2013

Ioana Manolescu (Inria), Big Data Paris 21

Merci

Questions now?

Questions/feedback later on: [email protected]

03/04/13