meeting service level objectives of pig programs

Post on 24-Feb-2016

Meeting Service Level Objectives of Pig Programs

Zhuoyao Zhang, Ludmila Cherkasova, Abhishek Verma, Boon Thau Loo

University of Pennsylvania, Hewlett-Packard Labs

Cloud Environment
• Advantages
  ▫ Large amount of resources
  ▫ Elasticity
  ▫ Pay-as-you-go pricing model
• Challenges
  ▫ Distributed resources
  ▫ Error-prone

MapReduce and Pig
• MapReduce: simple and fault-tolerant framework for data processing in the cloud
• Pig
  ▫ Advanced MapReduce-based platform
  ▫ Widely used: Yahoo!, Twitter, LinkedIn
  ▫ PigLatin: a high-level declarative language for expressing data analysis tasks as Pig programs

[Figure: a Pig program shown as a workflow of MapReduce jobs j1 to j7]

Motivation
• Latency-sensitive applications
  ▫ Personalized advertising
  ▫ Spam and fraud detection
  ▫ Real-time log analysis
• How many resources does an application need to meet its deadline?

Contributions
• Performance modeling for Pig programs
  ▫ Given a Pig program, estimate its completion time as a function of the assigned resources
• Deadline-driven resource allocation estimates for Pig programs
  ▫ Given a completion time target, determine the amount of resources a Pig program needs to achieve it

Outline
• Introduction
• Building block
  ▫ Performance model for single MapReduce jobs
• Resource allocation for Pig programs
• Evaluation
• Conclusion and ongoing work

Theoretical Makespan Bounds
• Bounds-based makespan estimates
  ▫ n tasks, k servers
  ▫ avg: average duration of the n tasks
  ▫ max: maximum duration of the n tasks
• Lower bound

$$T^{low} = \frac{n \cdot avg}{k}$$

• Upper bound

$$T^{up} = \frac{(n-1) \cdot avg}{k} + max$$

Illustration
Schedule 1: 1 4 3 2 3 1 2 (Makespan = 4, Lower bound = 4)
Schedule 2: 3 1 2 3 2 1 4 (Makespan = 7, Upper bound = 8)

[Figure: timeline of each schedule across the servers]
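The bounds can be checked numerically against the illustration. The sketch below is not taken from the slides: it assumes k = 4 servers (inferred from the lower bound of 4 for a total work of 16) and assumes tasks are assigned greedily, in the listed order, to whichever server becomes free first. The function names are illustrative.

```python
import heapq


def makespan_bounds(durations, k):
    """Return (lower, upper) makespan bounds for n tasks on k servers."""
    n = len(durations)
    total = sum(durations)                 # equals n * avg
    lower = total / k                      # n * avg / k
    upper = (n - 1) * total / (n * k) + max(durations)  # (n-1) * avg / k + max
    return lower, upper


def greedy_makespan(durations, k):
    """Assumed policy: each task goes to the server that frees up first."""
    free_at = [0.0] * k                    # time at which each server is idle
    heapq.heapify(free_at)
    for d in durations:
        start = heapq.heappop(free_at)     # earliest available server
        heapq.heappush(free_at, start + d)
    return max(free_at)


tasks = [3, 1, 2, 3, 2, 1, 4]              # Schedule 2 from the illustration
low, up = makespan_bounds(tasks, k=4)
print(low, round(up, 2), greedy_makespan(tasks, k=4))  # prints: 4.0 7.43 7.0
```

Under these assumptions the upper bound evaluates to about 7.43, which is consistent with the value 8 on the slide if that figure is rounded up, and the simulated makespan of Schedule 2 falls between the two bounds.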

Estimate Completion Time for Single MR Job
• Estimate bounds on the job completion time from the job profile
  ▫ Most production jobs are executed routinely on new data sets
  ▫ Job profile built from previous runs
    Map stage: $M_{avg}$, $M_{max}$, AvgInputSize, Selectivity
    Reduce stage: $Sh_{avg}$, $Sh_{max}$, $R_{avg}$, $R_{max}$, Selectivity
  ▫ Predict the completion time of future runs from the profile

Estimate Completion Time for Single MR Job
• Estimating bounds on the duration of the map and reduce stages
• Map stage duration depends on:
  ▫ $N_M$: the number of map tasks
  ▫ $S_M$: the number of map slots

$$T_M^{low} = \frac{N_M \cdot M_{avg}}{S_M}$$

$$T_M^{up} = \frac{(N_M - 1) \cdot M_{avg}}{S_M} + M_{max}$$

• Reduce stage duration depends on:
  ▫ $N_R$: the number of reduce tasks
  ▫ $S_R$: the number of reduce slots
• Job duration $T_J^{low}$, $T_J^{up}$, $T_J^{avg}$
  ▫ Sum of the map and reduce stage durations
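A minimal sketch of how the stage bounds might be combined into job-level bounds from a profile. The class and function names are illustrative, and the treatment of the reduce stage is an assumption: shuffle and reduce phases are each bounded with the same formula as the map stage and then summed, which is simpler than what a full model would likely do.

```python
from dataclasses import dataclass


@dataclass
class JobProfile:
    """Per-task durations from a previous run (field names are illustrative)."""
    m_avg: float
    m_max: float
    sh_avg: float
    sh_max: float
    r_avg: float
    r_max: float


def stage_bounds(n_tasks, n_slots, avg, mx):
    """Lower and upper bound on a stage with n_tasks tasks and n_slots slots."""
    low = n_tasks * avg / n_slots
    up = (n_tasks - 1) * avg / n_slots + mx
    return low, up


def job_bounds(p, n_map, n_reduce, s_map, s_reduce):
    """Sum the stage bounds (assumption: shuffle and reduce bounded separately)."""
    m_low, m_up = stage_bounds(n_map, s_map, p.m_avg, p.m_max)
    sh_low, sh_up = stage_bounds(n_reduce, s_reduce, p.sh_avg, p.sh_max)
    r_low, r_up = stage_bounds(n_reduce, s_reduce, p.r_avg, p.r_max)
    return m_low + sh_low + r_low, m_up + sh_up + r_up


# Made-up profile values, for illustration only.
profile = JobProfile(m_avg=10, m_max=20, sh_avg=5, sh_max=8, r_avg=6, r_max=12)
print(job_bounds(profile, n_map=100, n_reduce=20, s_map=10, s_reduce=5))
```

Keeping the profile separate from the slot counts mirrors the idea on the slides: the profile is measured once, and the bounds are then re-evaluated for any candidate allocation.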

Resource Allocation for Single MR Job
• Given a deadline D and the job profile, find the minimal resources needed to complete the job within D
• Inputs: the number of map/reduce tasks and the statistics from the job profile
• Find the values of $S_M^J$ and $S_R^J$ that minimize $S_M^J + S_R^J$, using Lagrange multipliers
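The Lagrange-multiplier step can be sketched in closed form. This assumes the completion-time estimate has the shape A/S_M + B/S_R + C = D, the same shape as the per-job equations used later for Pig programs; the coefficients A, B and C stand for quantities derived from the task counts and the job profile, and the function name is hypothetical. Setting the derivatives of the Lagrangian to zero gives S_M / S_R = sqrt(A / B), which leads to the expressions in the code.

```python
import math


def min_slots(a, b, c, deadline):
    """Minimize s_m + s_r subject to a / s_m + b / s_r + c = deadline.

    Returns the continuous (non-integer) optimum.
    """
    slack = deadline - c
    if slack <= 0:
        raise ValueError("deadline must exceed the fixed cost c")
    root = math.sqrt(a) + math.sqrt(b)
    s_m = math.sqrt(a) * root / slack
    s_r = math.sqrt(b) * root / slack
    return s_m, s_r


# Made-up coefficients, for illustration only.
s_m, s_r = min_slots(a=1000, b=250, c=20, deadline=120)
print(s_m, s_r, 1000 / s_m + 250 / s_r + 20)
```

Since slot counts are integers, one reasonable choice is to round both values up: adding slots can only shorten the estimated time, so the rounded allocation still meets the deadline under this model.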

Outline
• Introduction
• Building block
  ▫ Performance model for single MapReduce jobs
• Resource allocation for Pig programs
• Evaluation
• Conclusion and ongoing work

Performance Model for Pig Programs
• Let $P = \{J_1, J_2, \ldots, J_N\}$; extract the job profile of each job contained in P
  ▫ Assign a unique name to each job within a program
• The program completion time is the sum of the completion times of all the jobs contained in P

$$T_P = \sum_{1 \le i \le N} T_{J_i}$$

Resource Allocation for Pig Programs
• Possible strategy: find an appropriate pair of map and reduce slots for each job in the program

$$\frac{A_1}{S_M^1} + \frac{B_1}{S_R^1} + C_1 = d_1$$

$$\frac{A_2}{S_M^2} + \frac{B_2}{S_R^2} + C_2 = d_2$$

$$\cdots$$

$$\frac{A_N}{S_M^N} + \frac{B_N}{S_R^N} + C_N = d_N$$

with

$$\sum_{1 \le i \le N} d_i = D$$

• Problem: difficult for the scheduler to implement and manage

Resource Allocation for Pig Programs
• A simpler and more elegant solution
  ▫ Allocate the same set of resources to the entire program instead of to each job
• Rewrite the previous equations as

$$T_P = \frac{\sum_{1 \le i \le N} A_i}{S_M^P} + \frac{\sum_{1 \le i \le N} B_i}{S_R^P} + \sum_{1 \le i \le N} C_i = D$$

• Find the minimum set of map and reduce slots ($S_M^P$, $S_R^P$) for the entire Pig program
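A minimal sketch of the program-level allocation, assuming each job is described by a hypothetical coefficient triple (A_i, B_i, C_i). Because all jobs share one allocation, the coefficients are summed and the problem reduces to the single-job case; the closed form used here is the Lagrange-multiplier optimum of minimizing S_M + S_R under A/S_M + B/S_R + C = D. Rounding up to whole slots is a choice made for this sketch, not something stated on the slides.

```python
import math


def program_slots(jobs, deadline):
    """jobs: list of (a_i, b_i, c_i) triples, one per MapReduce job in the program."""
    a = sum(j[0] for j in jobs)
    b = sum(j[1] for j in jobs)
    c = sum(j[2] for j in jobs)
    slack = deadline - c
    if slack <= 0:
        raise ValueError("deadline is shorter than the fixed costs of the jobs")
    root = math.sqrt(a) + math.sqrt(b)
    # Round up so the integer allocation still meets the deadline.
    s_m = math.ceil(math.sqrt(a) * root / slack)
    s_r = math.ceil(math.sqrt(b) * root / slack)
    return s_m, s_r


def program_time(jobs, s_m, s_r):
    """Estimated completion time: sum of the per-job estimates."""
    return sum(a / s_m + b / s_r + c for a, b, c in jobs)


# Made-up coefficients for a three-job program, for illustration only.
jobs = [(600, 150, 10), (300, 80, 5), (100, 20, 5)]
s_m, s_r = program_slots(jobs, deadline=120)
print(s_m, s_r, program_time(jobs, s_m, s_r))
```

The scheduler then only has to enforce one (map slots, reduce slots) pair for the whole program, which is the simplification the slide argues for.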

Experiment Setup
• 66-node cluster in 2 racks
  ▫ 4 AMD 2.39GHz cores
  ▫ 8 GB RAM
  ▫ Two 160GB hard disks
• Configuration
  ▫ 1 jobtracker, 1 namenode, 64 worker nodes
  ▫ 2 map slots and 1 reduce slot for each node

Benchmark
• PigMix benchmark
  ▫ 17 programs
  ▫ 8 tables as the input data
• Dataset
  ▫ Test dataset
    Generated with the PigMix data generator
    Total size around 1TB
  ▫ Experimental dataset
    Same layout as the test dataset
    20% larger in size

Model Accuracy
• How well does our performance model capture Pig program completion time?

[Figure: normalized results for predicted and measured completion time]

Meeting Deadlines
• Are we meeting deadlines with our resource allocation model?

[Figure: PigMix executed on the experimental data set: do we meet deadlines?]

Conclusion
• Conclusion
  ▫ The performance model can accurately estimate the completion time of MapReduce workflows
  ▫ Enables automatic resource provisioning for MapReduce workflows with deadlines
• Ongoing work
  ▫ Refine the performance model for workflows with concurrent jobs
  ▫ Incorporate failure scenarios into the current model

Thank you
