ApproxHadoop: Bringing Approximations to MapReduce Frameworks

Íñigo Goiri, Ricardo Bianchini, Santosh Nagarakatte, and Thu D. Nguyen



Page 1: ApproxHadoop Bringing Approximations to MapReduce Frameworks

ApproxHadoop: Bringing Approximations to MapReduce Frameworks

Íñigo Goiri, Ricardo Bianchini, Santosh Nagarakatte, and Thu D. Nguyen


Page 3: ApproxHadoop Bringing Approximations to MapReduce Frameworks

Approximate computing
• We’re producing more data than we can analyze
• Many applications do not require precise outputs
• Being precise is expensive
• Approximate computation
  • Time and/or energy vs. accuracy

[Figures: technology scaling; data warehouse growth in TB, growth rate = 173%] [IEEE Design 2014]

Page 4: ApproxHadoop Bringing Approximations to MapReduce Frameworks

Data analytics using MapReduce
• Example: Process web access logs to extract top pages
• MapReduce is a popular framework
  • User provides code (map and reduce)
  • Framework manages data access and parallel execution
  • Higher-level languages on top: Pig, Hive, …
• Hadoop is deployed widely at large scale
  • Facebook: 30PB Hadoop clusters
  • Yahoo: 16 Hadoop clusters, >42000 nodes

[Image: Yahoo Computing Coop]
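
The "top pages" example above can be sketched in a few lines of plain Python that mimic the map, shuffle, and reduce phases. This is only an illustration of the programming model, not Hadoop code: the log format (page name in the second whitespace-separated field) and the function names `map_access`, `reduce_counts`, and `top_pages` are assumptions made for this sketch.

```python
from collections import defaultdict


def map_access(line):
    """Map: emit (page, 1) for one log line.

    Assumption: the page name is the second whitespace-separated field.
    """
    fields = line.split()
    if len(fields) >= 2:
        yield fields[1], 1


def reduce_counts(page, counts):
    """Reduce: sum all the counts emitted for one page."""
    return page, sum(counts)


def top_pages(lines, k=3):
    # Shuffle: group mapper output by key (the framework does this in Hadoop).
    groups = defaultdict(list)
    for line in lines:
        for page, one in map_access(line):
            groups[page].append(one)
    totals = [reduce_counts(page, counts) for page, counts in groups.items()]
    # Sort by descending count, then by page name so ties are deterministic.
    return sorted(totals, key=lambda pc: (-pc[1], pc[0]))[:k]
```

The user-supplied parts are only `map_access` and `reduce_counts`; grouping and parallel execution are what the framework provides.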

Page 5: ApproxHadoop Bringing Approximations to MapReduce Frameworks

Our contributions
• Approximations in MapReduce
  • Approximation mechanisms
  • Error bounds based on statistical theories
• ApproxHadoop: implementation for Hadoop
  • Approximate common applications
  • Achieve target error bounds online
  • Large execution time and energy savings with high accuracy

Page 6: ApproxHadoop Bringing Approximations to MapReduce Frameworks

Approximations in MapReduce

Why can we approximate with MapReduce?
• Lines in a block have similarities
• Blocks have similarities

Example application: What is the average length of the lines of each color?

[Diagram: Blocks 1-4, Maps 1-4, Reduces 1-2, Outputs 1-2]

Page 7: ApproxHadoop Bringing Approximations to MapReduce Frameworks

Mechanisms and error bounds
• Similarities allow for accurate approximations
• Approximation mechanisms for MapReduce:
  • Drop map tasks
  • Sample input data
  • User-defined approximations (technical report)
• Bound approximation errors using:
  • Multistage sampling for aggregation applications (e.g., sum, average, ratio)
  • Extreme value theory for extreme value computations (e.g., min, max)
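
The first two mechanisms can be pictured as two nested random selections: pick which blocks get a map task at all, then pick which lines inside each surviving block are processed. The sketch below is a minimal illustration under that reading; `approximate_input` and its parameters are hypothetical names, and blocks are modeled as in-memory lists of lines rather than HDFS blocks.

```python
import random


def approximate_input(blocks, drop_ratio, sampling_ratio, seed=0):
    """Return {block_id: (block_size, sampled_lines)} for the blocks that survive.

    drop_ratio:     fraction of map tasks (blocks) to drop.
    sampling_ratio: fraction of lines to keep inside each surviving block.
    """
    rng = random.Random(seed)
    # Task dropping: keep a random subset of blocks (one map task per block).
    keep = max(1, round(len(blocks) * (1 - drop_ratio)))
    kept_ids = sorted(rng.sample(range(len(blocks)), keep))
    sampled = {}
    for block_id in kept_ids:
        lines = blocks[block_id]
        # Input data sampling: keep a random subset of the lines in the block.
        size = min(len(lines), max(1, round(len(lines) * sampling_ratio)))
        # Remember the full block size; error bounds need the sampling ratio.
        sampled[block_id] = (len(lines), rng.sample(lines, size))
    return sampled
```

Keeping the original block size next to each sample matters because the error-bound computation needs to know how much of each block was actually read.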

Page 8: ApproxHadoop Bringing Approximations to MapReduce Frameworks

Multistage sampling and MapReduce
• Combines inter/intra-cluster sampling techniques
  • Simple random sampling: inside a block → Data sampling
  • Cluster sampling: between blocks → Task dropping
• Given sampling/dropping ratios and variances
  • Compute error bounds with confidence level

[Figure labels: Cluster, Population]
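
To make "given ratios and variances, compute error bounds" concrete, here is a minimal sketch of the textbook two-stage sampling estimator for a population total. It is not taken from the slides: the function name `two_stage_total` and the input layout are assumptions, and the interval uses a normal quantile for simplicity (a Student t quantile would be the more conservative choice when few clusters are sampled).

```python
import math
from statistics import NormalDist, mean, variance


def two_stage_total(sampled_clusters, num_clusters, confidence=0.95):
    """Estimate a population total from a two-stage sample.

    sampled_clusters: list of (cluster_size, sampled_values), one per sampled
                      cluster (block); cluster_size is the full block size.
    num_clusters:     total number of clusters (blocks) in the population.
    Returns (estimate, margin) so that the interval is estimate ± margin.
    """
    n, N = len(sampled_clusters), num_clusters
    cluster_totals = []
    within = 0.0
    for M_i, values in sampled_clusters:
        m_i = len(values)
        # Scale the sample mean up to an estimated total for this cluster.
        cluster_totals.append(M_i * mean(values))
        # Intra-cluster variance term (zero if the whole cluster was read).
        s2_i = variance(values) if m_i > 1 else 0.0
        within += M_i * (M_i - m_i) * s2_i / m_i
    estimate = N / n * sum(cluster_totals)
    # Inter-cluster variance term (zero if every cluster was processed).
    s2_u = variance(cluster_totals) if n > 1 else 0.0
    var = N * (N - n) * s2_u / n + N / n * within
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return estimate, z * math.sqrt(var)
```

When every block is processed and every line is read, both variance terms vanish and the estimate equals the exact total, which is a quick sanity check on the formula.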

Page 9: ApproxHadoop Bringing Approximations to MapReduce Frameworks

Mapping multistage sampling to MapReduce
• Block → Cluster
• Intra-cluster sampling (data sampling)
• Inter-cluster sampling (task dropping)
• Track sampling ratios
• Use inter/intra variances for each line color
• Approximation with error bounds: Y±X%

Example application: What is the approximate average length of the lines of each color?

[Diagram: Population split into Blocks 1-4, Maps 1-4, Reduces 1-2, Outputs 1-2]

Page 10: ApproxHadoop Bringing Approximations to MapReduce Frameworks

Our contributions
• Approximations in MapReduce
  • Approximation mechanisms
  • Error bounds based on statistical theories
• ApproxHadoop: implementation for Hadoop
  • Approximate common applications
  • Achieve target error bounds online
  • Large execution time and energy savings with high accuracy

Page 11: ApproxHadoop Bringing Approximations to MapReduce Frameworks

Example: Using ApproxHadoop

// Precise version (plain Hadoop), pseudocode
class WordCount:
    class WCMapper extends Mapper:
        void map(String key, String value):
            foreach word w in value:
                context.write(w, 1);        // emit (word, 1)

    class WCReducer extends Reducer:
        void reduce(String key, Iterator values):
            int result = 0;
            foreach int v in values:
                result += v;                // sum the counts for this word
            context.write(key, result);

    void main():
        setInputFormat(TextInputFormat);
        run();

// Approximate version, pseudocode: the map and reduce bodies are unchanged;
// only the base classes and the input format differ.
class ApproxWordCount:
    class ApproxWCMapper extends MultiStageSamplingMapper:
        void map(String key, String value):
            foreach word w in value:
                context.write(w, 1);

    class ApproxWCReducer extends MultiStageSamplingReducer:
        void reduce(String key, Iterator values):
            int result = 0;
            foreach int v in values:
                result += v;
            context.write(key, result);

    void main():
        setInputFormat(ApproxTextInputFormat);
        run();

Page 12: ApproxHadoop Bringing Approximations to MapReduce Frameworks

How to specify approximations?
1. User specifies the dropping/sampling ratios
   • ApproxHadoop calculates the error bound
2. User specifies the target error bound
   • Example: maximum error (±1%) with a confidence level (95% confidence)
   • ApproxHadoop: see flowchart

[Flowchart boxes: Select dropping/sampling ratios; Run first subset of tasks; Run next subset of tasks; Calculate final error bounds. Decision: Target bound? (No / Yes)]
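
One simplified reading of the second mode is a loop that runs map tasks in waves and stops as soon as the estimated bound meets the target, dropping whatever tasks remain. The sketch below assumes this reading and uses task dropping only (no sampling inside blocks); it does not reproduce how ApproxHadoop actually selects its ratios. `run_until_target`, `wave_size`, and the use of a precomputed `block_totals` list in place of real map tasks are all assumptions for illustration.

```python
import math
import random
from statistics import NormalDist, variance


def run_until_target(block_totals, target_rel_error, wave_size=4,
                     confidence=0.95, seed=0):
    """Process blocks in waves until the relative error bound meets the target.

    block_totals[i] stands in for the result of running the map task on
    block i. Returns (estimate, margin, tasks_run).
    """
    if not block_totals:
        raise ValueError("need at least one block")
    N = len(block_totals)
    order = list(range(N))
    # Random order, so the tasks run so far form a random sample of blocks.
    random.Random(seed).shuffle(order)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    done = []
    while len(done) < N:
        wave = order[len(done):len(done) + wave_size]
        done.extend(block_totals[i] for i in wave)  # "run" the next subset
        n = len(done)
        estimate = N / n * sum(done)
        spread = variance(done) if n > 1 else math.inf
        # Cluster-sampling bound; exact (zero margin) once every task has run.
        margin = 0.0 if n == N else z * math.sqrt(N * (N - n) * spread / n)
        if estimate != 0 and margin / abs(estimate) <= target_rel_error:
            break  # target met: the remaining tasks are dropped
    return estimate, margin, len(done)
```

With very similar blocks the loop stops after the first wave; with a target of zero it degenerates to the precise computation over all blocks.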

Page 13: ApproxHadoop Bringing Approximations to MapReduce Frameworks

Implementation: ApproxHadoop
• Extends Hadoop 1.2.1
• Implements approximation mechanisms
• Extended reducers
• Bound estimation
• Incremental reducers
• Tune sampling ratios
• New data types
  • ApproxInteger

[Diagram: Blocks 1-4, Maps 1-4, Reduces 1-2, Outputs 1-2, output labeled Y±X%; Maps 2-3 and Blocks 2-3 appear a second time]
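
The slide names `ApproxInteger` but does not show its interface. As a purely hypothetical illustration of what a "value plus error bound" data type could carry, the sketch below defines `ApproxValue`; the class name, fields, and methods are invented for this example and are not ApproxHadoop's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApproxValue:
    """Hypothetical container for an approximate result (not ApproxHadoop's type)."""
    value: float
    margin: float            # absolute half-width of the confidence interval
    confidence: float = 0.95

    @property
    def relative_error(self):
        # Margin as a fraction of the value; undefined (infinite) for zero.
        return self.margin / abs(self.value) if self.value else float("inf")

    def contains(self, precise):
        """True if a precise result falls inside value ± margin."""
        return self.value - self.margin <= precise <= self.value + self.margin

    def __str__(self):
        # Rendered in the Y±X% style used on the slides.
        return (f"{self.value:g} ± {self.relative_error:.1%} "
                f"({self.confidence:.0%} confidence)")
```

For example, `str(ApproxValue(1000, 10))` renders as `1000 ± 1.0% (95% confidence)`.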

Page 14: ApproxHadoop Bringing Approximations to MapReduce Frameworks

Evaluation methodology
• Datasets
  • Wikipedia access logs: 1 week with 4 billion accesses for 216.9GB
  • Wikipedia articles: 40GB in XML
  • Other applications and datasets in the paper
• Metrics
  • Actual % error (approximation vs. precise)
  • Approximation with 95% confidence interval (e.g., 10±1%)
  • Run time
  • 20 runs reporting min, max and average
• Executions on 10- and 60-node clusters
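
The metrics above are simple to state in code. The helpers below are an illustrative sketch; the names `percent_error`, `within_bounds`, and `summarize_runs` are assumptions rather than anything from the evaluation scripts.

```python
from statistics import mean


def percent_error(approx, precise):
    """Actual % error of an approximate result against the precise one."""
    return abs(approx - precise) / abs(precise) * 100


def within_bounds(approx, margin, precise):
    """True if the precise value lies inside the reported interval approx ± margin."""
    return approx - margin <= precise <= approx + margin


def summarize_runs(run_times):
    """Min, max, and average run time over repeated executions."""
    return min(run_times), max(run_times), mean(run_times)
```

An interval reported as 10±1 covers a precise value of 10.5, while the actual error in that case is about 4.8%.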

Page 15: ApproxHadoop Bringing Approximations to MapReduce Frameworks

Example: Precise and approximate processing
• 1% input sampling introduces different errors in different applications
• Actual values within bounds

[Plots: Wikipedia project popularity, 1% sampling; Wikipedia article length, 1% sampling]

Page 16: ApproxHadoop Bringing Approximations to MapReduce Frameworks

User-specified input sampling ratio
• More than 30% run time reduction for less than 0.1% ratio
• Applications exhibit different speedups for the same ratios

[Plot: Wikipedia project popularity, not dropping]

Page 17: ApproxHadoop Bringing Approximations to MapReduce Frameworks

User-specified dropping/sampling ratios
• More than 55% run time reduction for less than 1% error
• Task dropping increases errors significantly but decreases run time too

[Plot: Wikipedia project popularity, 25% task dropping]

Page 18: ApproxHadoop Bringing Approximations to MapReduce Frameworks

User-specified target error
• ApproxHadoop tunes the sampling/dropping ratios depending on target

[Plot: Wikipedia project popularity; labels: No sampling, Input data sampling, Maximum sampling, Task dropping]

Page 19: ApproxHadoop Bringing Approximations to MapReduce Frameworks

Impact of input data size
• Wikipedia project popularity from 1 day (27GB) to 1 year (12.5TB)
• Larger input data brings larger savings (up to 32x)

[Plot: Runtime (seconds) vs. compressed log size (in GB)]

Page 20: ApproxHadoop Bringing Approximations to MapReduce Frameworks

Conclusions
• Apply statistical theories to MapReduce
  • Approximation mechanisms, such as input data sampling and task dropping
• Applicable to (large) classes of analytics applications
• Achieve target error bounds online with ApproxHadoop
• Tradeoff between execution time and accuracy
• Significant execution time reduction with high accuracy
• Scales well for large datasets

Page 21: ApproxHadoop Bringing Approximations to MapReduce Frameworks

ApproxHadoop: Bringing Approximations to MapReduce Frameworks

Íñigo Goiri, Ricardo Bianchini, Santosh Nagarakatte, and Thu D. Nguyen