MS Thesis Defense: Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project. Young Suk Moon. Chair: Dr. Hans-Peter Bischof; Reader: Dr. Gregor von Laszewski; Observer: Dr. Minseok Kwon.


Page 1: MS Thesis Defense

Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project

Young Suk Moon

Chair: Dr. Hans-Peter Bischof
Reader: Dr. Gregor von Laszewski
Observer: Dr. Minseok Kwon

Page 2: Outline

- Introduction to the Water Threat Management Project
- Motivation
- Research Objectives
- Fault-Tolerant Queue
- Evaluation
- Conclusion

Page 3: Water Threat Management

- Motivation
  - Urban water distribution systems (WDSs) can be an easy target for terror attacks, e.g. contamination of the water.
- Methods
  - Detect contamination using the sensors located across the WDSs.
  - Run algorithms (developed by NCSU) that determine sensor locations so as to minimize the time needed to find the contaminant source locations.

Page 4: Existing Water Threat Management System Architecture

- Optimization Engine: runs the Evolutionary Algorithm (EA)
- Simulation Engine: runs EPANET

Page 5: Water Threat Management System Requirements

- Requirements
  - Time sensitive
  - Massive calculation
  - Dynamic adaptation to a Grid environment
  - Fault tolerance
- Our goal
  - The current system is not fault-tolerant: develop a fault-tolerant framework for the dynamic environment.

Page 6: Motivation

- Resource (site) outage: 5% down during 2009
- Queue wait time

Source: TeraGrid User & System News (http://news.teragrid.org/)

Page 7: Research Objectives

- Develop a fault-tolerant framework dealing with resource outages
  - Strategy: generation distribution on multiple sites
- Reduce queue wait time
  - Strategy: dynamic job dependency

Page 8: Water Threat Management Application

- Sequential & parallel processing

Page 9: Generation Distribution

- Divide generations into multiple parts as multiple jobs.
- Distribute them on multiple sites.
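The generation split described on this slide can be sketched as follows. This is only an illustration, not the thesis implementation: the function name `split_generations` and the dictionary layout are assumptions, and the split is taken to be contiguous and as even as possible.

```python
def split_generations(total_generations, sites):
    """Split generations 1..total_generations into one contiguous range per site.

    Hypothetical helper for illustration; the even, contiguous split is an assumption.
    """
    if total_generations < len(sites):
        raise ValueError("need at least one generation per site")
    base, extra = divmod(total_generations, len(sites))
    jobs = []
    start = 1
    for index, site in enumerate(sites):
        # The first `extra` sites take one additional generation each.
        size = base + (1 if index < extra else 0)
        jobs.append({"site": site, "first_gen": start, "last_gen": start + size - 1})
        start += size
    return jobs


jobs = split_generations(20, ["Abe", "Big Red"])
for job in jobs:
    print(job)
```

With 20 generations and the two sites Abe and Big Red, this yields the ranges 1-10 and 11-20, one job per site.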

Page 10: Dynamic Job Dependency

- Problems of generation distribution on multiple sites
  - Additional queue wait times
  - Each job depends on another: a job cannot be submitted before the prior job finishes.
- Solution: determine job dependency at run time.
  - Submit all jobs at the same time.
  - Whichever job starts first computes the first set of generations.
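A minimal sketch of run-time dependency resolution, assuming a shared tracker that every job contacts when it leaves its queue. The class `GenerationRangeTracker` and the job names are hypothetical, and the sketch leaves out how the results of one generation range are handed to the job that computes the next one.

```python
import threading


class GenerationRangeTracker:
    """Hands out generation ranges in order to whichever job starts first.

    Hypothetical class for illustration only.
    """

    def __init__(self, ranges):
        self._pending = list(ranges)  # e.g. [(1, 10), (11, 20)]
        self._lock = threading.Lock()

    def claim_next(self, job_name):
        """Called by a job when it leaves the queue and starts running."""
        with self._lock:
            if not self._pending:
                return None  # nothing left to compute
            first_gen, last_gen = self._pending.pop(0)
            return {"job": job_name, "first_gen": first_gen, "last_gen": last_gen}


tracker = GenerationRangeTracker([(1, 10), (11, 20)])
# Suppose the job on Big Red leaves its queue before the job on Abe.
first = tracker.claim_next("job-on-Big-Red")
second = tracker.claim_next("job-on-Abe")
print(first)
print(second)
```

Because the range is bound to a job only when the job starts, all jobs can sit in their queues at the same time, and no job has to wait for a predecessor before it is submitted.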

Page 11: Dynamic WTM Workflow Management

- Example scenario

Page 12: Fault-tolerant Queue

- Most common fault-tolerant strategies in a Grid
  - Replication
  - Checkpointing
- Limitations of checkpointing under time-criticality
  - Checkpointing degrades performance.
  - A checkpoint may not be compatible with a different site (heterogeneity).
  - A job cannot be rescheduled on the same site in case of a site outage.
- We therefore choose the replication strategy within the fault-tolerant queue.

Page 13: Fault-tolerant Queue Design

Components

- Command Line Interface
- Task Pool
- Resource Pool
- Scheduler
- Resource Checker (integration with the TeraGrid Information Services)
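How a task pool, a resource pool, and a scheduler might fit together under a replication strategy can be sketched as below. All class names, fields, and the `replicas` parameter are assumptions made for illustration; the slides do not describe the actual interfaces, and the command line interface and resource checker are reduced here to direct method calls and a flag.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Resource:
    name: str
    available: bool = True  # in this sketch, set by hand instead of a resource checker


@dataclass
class Task:
    name: str
    assigned_to: list = field(default_factory=list)


class Scheduler:
    """Hypothetical scheduler: replicates each task on available resources."""

    def __init__(self, resources, replicas=2):
        self.task_pool = deque()
        self.resource_pool = list(resources)
        self.replicas = replicas

    def submit(self, task):
        self.task_pool.append(task)

    def schedule(self):
        """Assign each pooled task to up to `replicas` available resources."""
        scheduled = []
        while self.task_pool:
            task = self.task_pool.popleft()
            up = [r for r in self.resource_pool if r.available]
            if not up:
                self.task_pool.appendleft(task)  # keep it until a resource is back
                break
            task.assigned_to = [r.name for r in up[: self.replicas]]
            scheduled.append(task)
        return scheduled


abe, big_red, queen_bee = Resource("Abe"), Resource("Big Red"), Resource("Queen Bee")
scheduler = Scheduler([abe, big_red, queen_bee], replicas=2)
abe.available = False  # pretend an outage was reported for Abe
scheduler.submit(Task("generations 1-10"))
placed = scheduler.schedule()
print(placed[0].assigned_to)
```

In this sketch a task is never placed on a resource that is marked unavailable, and it stays in the task pool if no resource is up.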

Page 14: Fault Detection in Fault-tolerant Queue

Fault detection

- Message from Grid Resource Allocation and Management (GRAM) in the Globus Toolkit
  - Communicate with GRAM to detect job failure
- TeraGrid Information Services
  - GRAM service may fail when the resource is down
  - Publishes XML documents containing the outage information
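Checking published outage information could look roughly like the sketch below. The XML layout (`outages`, `outage`, `resource`, `start`, `end`) and the sample values are invented for this example; the real TeraGrid Information Services schema is not shown in the slides and is not reproduced here.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical document layout and sample values, invented for this sketch.
SAMPLE_OUTAGE_XML = """
<outages>
  <outage>
    <resource>Abe</resource>
    <start>2009-03-01T08:00:00</start>
    <end>2009-03-01T20:00:00</end>
  </outage>
</outages>
"""


def resources_down(xml_text, at_time):
    """Return the names of resources whose outage window covers at_time."""
    down = set()
    for outage in ET.fromstring(xml_text).findall("outage"):
        start = datetime.fromisoformat(outage.findtext("start"))
        end = datetime.fromisoformat(outage.findtext("end"))
        if start <= at_time < end:
            down.add(outage.findtext("resource"))
    return down


print(resources_down(SAMPLE_OUTAGE_XML, datetime(2009, 3, 1, 12, 0)))
```

A check of this kind complements the GRAM messages: it still works when the GRAM service on a resource is unreachable because the resource itself is down.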

Page 15: Evaluation – WTM performance

WTM application performance (original)

               Abe   Big Red
#CPUs          16    16
CPU per Node   8     4

Page 16: Evaluation – Queue Wait Time

Queue wait time statistics

                   Abe     Big Red
Avg. (min)         82      42
Variance (min^2)   38513   5354
Std. dev. (min)    196     73
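As a quick consistency check on the table, the standard deviation should be the square root of the variance. The snippet below recomputes it from the reported variances; the variable names are illustrative.

```python
import math

# Reported queue wait time variances (min^2) from the table above.
variance_min2 = {"Abe": 38513, "Big Red": 5354}
mean_min = {"Abe": 82, "Big Red": 42}

std_dev_min = {site: round(math.sqrt(v)) for site, v in variance_min2.items()}
print(std_dev_min)  # {'Abe': 196, 'Big Red': 73}

# On both sites the standard deviation exceeds the mean wait time.
assert all(std_dev_min[site] > mean_min[site] for site in mean_min)
```

The recomputed values match the table, and on both sites the spread is larger than the average wait, which shows how unpredictable the queue wait time is.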

Page 17: Evaluation – Performance Overhead

Performance overhead

- Integrating a fault-tolerant framework usually causes performance degradation.
- No performance loss in our framework.

Page 18: Evaluation – Workflow Performance

Run time comparison of different workflow types

- Original deployment vs. fault-tolerant deployment
- Dynamic job dependency vs. static job dependency
- Test each type of deployment in the real Grid system, including queue wait time

Workflow         Dependency   Site Name      # Jobs   Gen. range
Original         -            Abe            1        1-20
Original         -            Big Red        1        1-20
Fault-tolerant   static       Abe, Big Red   2        1-10 (Abe), 11-20 (Big Red)
Fault-tolerant   dynamic      Abe, Big Red   2        1-10, 11-20

Page 19: Evaluation – Workflow Performance

Workflow comparison results: Experiment 1, Experiment 2, Experiment 3

Page 20: Simulation – Worst Case Run Time Comparison

A threat management system must deliver results under any circumstances.

Thus, the worst-case run time is a critical factor in the Water Threat Management system.

Page 21: Simulation – Worst Case Run Time Comparison

Simulation setup

- The generations are equally distributed among the machines.
- Use the 2009 TeraGrid outage data.
- Submit jobs every 5 minutes starting from 1/1/2009 12:00 am EST.

                          Abe    Big Red   Queen Bee
Run Time per Gen. (min)   0.52   2.07      1.02
#CPUs                     16     16        8
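The setup above can be sketched in a few lines. Two things here are assumptions, not stated on the slide: that submissions cover the whole of 2009, and that 20 generations are simulated (the number used in the workflow experiments). The function names are illustrative, and time zones are ignored.

```python
from datetime import datetime, timedelta

# Run time per generation in minutes, taken from the table above.
RUN_TIME_PER_GEN_MIN = {"Abe": 0.52, "Big Red": 2.07, "Queen Bee": 1.02}


def submission_times(start, end, step_minutes=5):
    """Yield one submission time every step_minutes in [start, end)."""
    current = start
    while current < end:
        yield current
        current += timedelta(minutes=step_minutes)


def compute_minutes(total_generations, machines):
    """Compute time per machine when generations are split equally."""
    share = total_generations / len(machines)
    return {m: share * RUN_TIME_PER_GEN_MIN[m] for m in machines}


# Assumption: submissions span all of 2009 (naive timestamps, no time zone).
times = list(submission_times(datetime(2009, 1, 1), datetime(2010, 1, 1)))
print(len(times))  # 105120 submissions, 288 per day over 365 days

# Assumption: 20 generations in total, split between two machines.
print(compute_minutes(20, ["Abe", "Big Red"]))
```

Queue wait times and outage windows would be added on top of these compute times to obtain a full run time per submission; that part depends on data not included in the transcript.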

Page 22: Simulation – Worst Case Run Time Comparison

Simulation queue wait time setup (unit: minutes)

Page 23: Simulation – Worst Case Run Time Comparison

Source: TeraGrid User & System News (http://news.teragrid.org/)

Page 24: Simulation – Worst Case Run Time Comparison

Page 25: Simulation – Worst Case Run Time Comparison

Page 26: Simulation – Median Run Time, Worst Case (Max.) Run Time

Page 27: Conclusion

- Achievement
  - Worst case run time is significantly reduced.
- Limitation
  - In “general” cases, the dynamic workflow has performance degradation.
  - This is due to the low failure rate and the difference in compute performance between different machines.
- Possible improvement
  - Migrate the generation process to a faster machine whenever possible.