
FENCE: Fault awareness ENabled Computing Environment

Zhiling Lan
Illinois Institute of Technology (Dept. of CS)
[email protected]

Reliability Concerns

• Systems are getting bigger
– 1024–4096 processors is today's "medium" size (73.4% of the TOP500)
– O(10,000)–O(100,000)-processor systems are being designed and deployed

• Even highly reliable hardware becomes an issue at scale
– 1 node fails every 10,000 hours
– 6,000 nodes fail every 1.6 hours
– 64,000 nodes fail every 5 minutes

• Need for fault management
– Losing the entire job due to one node's failure is costly in time and CPU cycles!

From "Simplicity and Complexity in Data Systems at Scale", Garth Gibson, Hadoop Summit, 2008
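The arithmetic behind these figures is the standard series-system approximation: assuming independent node failures, the system-level MTBF shrinks inversely with node count. A worked instance of the 6,000-node figure:

```latex
\mathrm{MTBF}_{\mathrm{sys}} \approx \frac{\mathrm{MTBF}_{\mathrm{node}}}{N}
\qquad \text{e.g.} \qquad
\frac{10{,}000\,\mathrm{h}}{6{,}000} \approx 1.6\,\mathrm{h}
```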


The Big Picture

• Checkpoint/restart is widely used for fault tolerance
– Simple
– I/O intensive; may trigger a cycle of deterioration
– Reactively handles failures through rollbacks

• Newly emerging proactive methods
– Good at preventing failures and avoiding rollbacks
– But rely on accurate failure prediction

• FENCE: Fault awareness ENabled Computing Environment
– A "fence" to protect system and applications from severe failure impact
– Exploits the synergy between various methods to advance fault management

FENCE Overview

• Adopt a hybrid approach:
– Offline analysis: reliability modeling and scheduling enable intelligent system configuration and mapping
– Runtime support: to avoid and mitigate imminent failures

• Explore runtime adaptation:
– Proactive actions protect applications from anticipated failures
– Reactive actions minimize the impact of unforeseeable failures

• Address fundamental issues:
– Failure prediction & diagnosis
– Adaptive management
– Runtime support
– Reliability modeling & scheduling


Key Components

[Diagram: five interlocking components — Failure Prediction & Diagnosis, Adaptive Management, Reliability Modeling, Fault-Aware Scheduling, and Runtime Support — tied together by hybrid (offline + runtime) analysis.]

Outline

• Overview of the FENCE Project
– Project website: http://www.cs.iit.edu/~zlan/fence.html

• This talk focuses on runtime support
– Failure prediction and diagnosis
– Adaptive management
– Runtime support



Failure Prediction and Diagnosis

• Health/performance monitoring:
– Hardware sensors
– System monitoring tools
– Error checking services, e.g. the Blue Gene series and Cray XT series

• Fault tolerance methods:
– Checkpointing (Open MPI, MPICH-V, BLCR, …)
– Process/object migration
– Other resilience supports


Failure Prediction and Diagnosis

• Challenges:
– Potentially overwhelming amount of data collected across the system
• Fault patterns and root causes are often buried like needles in a haystack!
– Faults are many and complex
• There is no one-size-fits-all predictive method!

• Our approaches:
– Runtime failure prediction
• Exploit ensemble learning, data mining, and statistical learning
– Automated anomaly localization
• Utilize pattern recognition techniques

Runtime Failure Prediction

[Diagram: the raw log passes through a preprocessor (categorizer + filter) to produce a clean log. Over a W-week training window, a meta-learner builds statistical rules, association rules, and a probabilistic failure pattern into a rule set; a predictor applies the rule set, and a reviser compares predicted failures against actual failures over the testing window before the prediction window begins.]

1) Meta-learner: dynamic training by integrating multiple base classifiers
2) Reviser: dynamic testing by tracing prediction accuracy
3) Predictor: event-driven prediction
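To make the three roles concrete, here is a minimal sketch of the prediction loop, assuming hypothetical base classifiers and a simple multiplicative weight update; it illustrates the structure only, not the paper's exact algorithm.

```python
from dataclasses import dataclass, field

# Hypothetical base-classifier interface; the real system uses statistical
# rules, association rules, and a fitted failure-time distribution.
class BaseClassifier:
    def train(self, events): ...
    def predict(self, event) -> bool: ...   # True = failure expected soon

@dataclass
class MetaLearner:
    bases: list                              # base classifiers to combine
    weights: list = field(default_factory=list)

    def train(self, events):
        for b in self.bases:
            b.train(events)
        self.weights = [1.0] * len(self.bases)   # start with equal trust

    def predict(self, event) -> bool:
        # Predictor: event-driven, weighted vote of the base classifiers.
        score = sum(w for b, w in zip(self.bases, self.weights)
                    if b.predict(event))
        return score >= 0.5 * sum(self.weights)

    def revise(self, event, failed: bool):
        # Reviser: trace each base's accuracy and shift weight toward the
        # classifiers that have been right recently.
        for i, b in enumerate(self.bases):
            correct = (b.predict(event) == failed)
            self.weights[i] *= 1.1 if correct else 0.9
```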


Case Studies on Blue Gene/L

                          SDSC BGL      ANL BGL
Start Date                12/6/2004     1/21/2005
End Date                  06/11/2007    2/28/2007
Log size after cleaning   704 MB        1.36 GB

• Evaluation metrics:
precision = T_p / (T_p + F_p)
recall = T_p / (T_p + F_n)

• A good prediction should achieve a high value (close to 1.0) for both metrics

Preprocessing

• Step 1: Hierarchical event categorization
– Based on LOCATION, FACILITY and ENTRY_DATA

• Step 2: Temporal compression at a single location
– Coalesce events from the same location with the same JOB_ID and LOCATION, if reported within a time duration of 300 seconds

• Step 3: Spatial compression across multiple locations
– Remove entries close to each other within a time duration of 300 seconds, with the same ENTRY_DATA and JOB_ID

Y. Liang, Y. Zhang, A. Sivasubramanium, R. Sahoo, J. Moreira, M. Gupta, "Filtering Failure Logs for a BlueGene/L Prototype", Proc. of DSN, 2005.
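A sketch of Steps 2 and 3 on parsed log entries; the dictionary field names (time, job_id, location, entry_data) are illustrative stand-ins for the actual BGL log fields.

```python
WINDOW = 300  # seconds, as on the slide

def temporal_compress(entries):
    """Step 2: coalesce events from the same job/location that are
    reported within WINDOW seconds of each other."""
    kept, last = [], {}                # last seen time per (job, location)
    for e in sorted(entries, key=lambda e: e["time"]):
        key = (e["job_id"], e["location"])
        if key not in last or e["time"] - last[key] > WINDOW:
            kept.append(e)
        last[key] = e["time"]
    return kept

def spatial_compress(entries):
    """Step 3: drop near-duplicate events across locations that share
    the same entry data and job id within WINDOW seconds."""
    kept, last = [], {}                # last seen time per (job, entry)
    for e in sorted(entries, key=lambda e: e["time"]):
        key = (e["job_id"], e["entry_data"])
        if key not in last or e["time"] - last[key] > WINDOW:
            kept.append(e)
        last[key] = e["time"]
    return kept
```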


BGL Event Categories

Main Category   Subcategories   Examples
Application     12              loadProgramFailure, loginFailure, nodemapCreateFailure, …
Iostream        8               socketReadFailure, streamReadFailure, …
Kernel          20              alignmentFailure, dataAddressFailure, instructionAddressFailure, …
Memory          22              cachePrefetchFailure, dataReadFailure, dataStoreFailure, parityFailure, …
Midplane        6               linkcardFailure, ciodSignalFailure, midplaneServiceWarning, …
Network         11              ethernetFailure, rtsFailure, torusFailure, torusConnectionErrorInfo, …
NodeCard        10              nodecardDiscoveryError, nodecardAssemblyWarning, …
Other           12              BGLMasterRestartInfo, CMCScontrolInfo, linkcardServiceWarning, …

Base Classifiers

• Statistical rules:
– Discover statistical correlations among fatal events, i.e., how often and with what probability the occurrence of one failure will influence subsequent failures

• Association rules:
– Examine causal correlations between non-fatal and fatal events, where rules are of the form (X => Y)
• If X occurs, then it is likely that Y will occur

• Probability distribution:
– Use maximum likelihood estimation (MLE) to obtain the distribution of failures
• For both logs, Weibull is a good fit
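The MLE fit can be done with off-the-shelf tools; a sketch using scipy's two-parameter Weibull (the inter-failure times below are made-up placeholders, not values from the logs):

```python
import numpy as np
from scipy import stats

# Placeholder inter-failure times in hours, as would be extracted
# from the cleaned log.
tbf = np.array([12.0, 30.5, 7.2, 55.1, 19.8, 41.3, 9.9, 26.4])

# Maximum-likelihood fit of a two-parameter Weibull (floc=0 pins the
# location parameter at zero).
shape, loc, scale = stats.weibull_min.fit(tbf, floc=0)
print(f"shape={shape:.2f}, scale={scale:.1f} h")

# P(failure within the next t hours), usable as a base classifier's score.
t = 8.0
print("P(failure <= 8h) =", stats.weibull_min.cdf(t, shape, loc, scale))
```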


Prediction Results (1)

[Plots: ANL precision and recall curves comparing Association Rule, Statistic Rule, Probability Distribution, Simple Meta-Learning, and Our method over the 32–112 range; W = 4 weeks, support = 0.06, confidence = 0.6.]

• Precision is between 0.8 and 0.9, meaning less than 20% false alarms
• Recall is between 0.7 and 0.8, meaning over 70% of failures are captured

Prediction Results (2)

[Plots: SDSC precision and recall curves for the same five methods over the 32–128 range.]

• Precision is between 0.9 and 1.0, meaning less than 10% false alarms
• Recall is around 0.8, meaning about 80% of failures are captured


Automated Anomaly Localization

• To quickly locate faulty nodes in the system with little human interaction

• Primary objectives:
– High precision, meaning a low false alarm rate
– Extremely high recall (close to 1.0)

• Two key observations:
1) Nodes performing comparable activities exhibit similar behaviors
2) In general, the majority is functioning normally, since faults are rare events

Automated Anomaly Localization

[Diagram: assemble a feature space, usually high-dimensional; obtain the most significant features via feature extraction; quickly identify "outliers" by applying a cell-based algorithm.]


Feature Collection

• Construct the feature space of the system:
– A feature is an individual measurable property of the node being observed
– Examples: CPU usage, memory usage, I/O performance, …, collected via vmstat, mpstat, iostat & netstat

• Let m be the number of features collected from n nodes, with k samples obtained per node
– X_i (i = 1, 2, …, n), each representing the feature matrix collected from the i-th node
– Reorganize each matrix X_i into a long (m×k) column vector x_i

• So the feature space X = [x_1, x_2, …, x_n] is a (m×k)×n matrix

Feature Extraction

• Goal:
– Dimension reduction
– Independent features

• Principal Component Analysis (PCA)
– A linear transformation onto new axes (i.e., principal components) ordered by the amount of data variance that they capture


Feature Extraction

• PCA steps:
– Calculate the covariance matrix of X'': C = (1/n) X'' X''^T
– Calculate the s largest eigenvalues of C: λ_1 ≥ λ_2 ≥ … ≥ λ_s, with C w_i = λ_i w_i
– Get the projection matrix W = [w_1, w_2, …, w_s]
– Project X'' into the new space: Y = W^T X''
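These steps translate directly into a few lines of numpy; a sketch, assuming X is arranged as above with one (m×k)-long column per node, and taking X'' to be the centered feature space (an assumption, since the slides do not define X'' explicitly):

```python
import numpy as np

def pca_project(X, s):
    """X: (m*k) x n feature space, one column per node.
    Returns Y = W^T X'' using the s leading principal components."""
    Xc = X - X.mean(axis=1, keepdims=True)      # center each feature (X'')
    C = (Xc @ Xc.T) / X.shape[1]                # covariance matrix
    vals, vecs = np.linalg.eigh(C)              # eigenvalues, ascending
    W = vecs[:, np.argsort(vals)[::-1][:s]]     # s largest eigenvectors
    return W.T @ Xc                             # s x n projected features

# Toy example: m = 4 features x k = 3 samples per node, n = 10 nodes.
rng = np.random.default_rng(0)
X = rng.normal(size=(4 * 3, 10))
Y = pca_project(X, s=2)
print(Y.shape)   # (2, 10)
```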

Outlier Detection

• Outliers are data points that are quite "different" from the majority based on Euclidean distance

• We choose the cell-based method due to its linear complexity
– DB(p, d): point o is a distance-based outlier if at least a fraction p of the objects lie at a distance greater than d from o
– p & d are predefined parameters
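For reference, the DB(p, d) test itself is easy to state in code. This sketch is the quadratic all-pairs version; the cell-based algorithm the slides adopt reaches linear complexity by bucketing points into a grid of cells instead.

```python
import numpy as np

def db_outliers(Y, p, d):
    """DB(p, d): a point is an outlier if at least a fraction p of the
    other points lie at Euclidean distance greater than d from it."""
    n = Y.shape[1]                               # points are columns of Y
    flags = []
    for i in range(n):
        dist = np.linalg.norm(Y - Y[:, [i]], axis=0)
        far = np.count_nonzero(dist > d) / (n - 1)
        flags.append(far >= p)
    return np.flatnonzero(flags)

# Nodes projected by PCA; one clear outlier in column 4.
Y = np.array([[0.1, 0.2, 0.0, 0.15, 5.0],
              [0.0, 0.1, 0.2, 0.05, 4.0]])
print(db_outliers(Y, p=0.9, d=1.0))              # -> [4]
```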


Experiments

• Use the Sunwulf cluster at the SCS lab
– Each node is a Sun Blade 100: 500 MHz CPU, 256 KB L2 cache, 128 MB main memory, 100 Mbps Ethernet
– Execute a parameter sweep application

• Manual fault injection:
1) Memory leaking
2) Unterminated CPU-intensive threads
3) High-frequency I/O operations
4) Network volume overflow
5) Deadlock

Result: Single Fault

[Scatter plots for the five injected faults; ellipse: true positive, rectangle: false positive, triangle: false negative.]

Parameters: m = 11, k = 5, n = 46; p = 0.9565, d = 0.4σ, where σ is the variance of Y.


Result: Single Fault

Faults                                 Recall   Precision
Memory leaking                         1        0.98
Unterminated CPU-intensive threads     1        0.80
High-frequency I/O operations          1        0.94
Network volume overflow                1        0.85
Deadlock                               1        0.94

Type #1 faults (precision > 0.90): memory leaking, high-frequency I/O operations, deadlock.
Type #2 faults (precision < 0.90): unterminated CPU-intensive threads and network volume overflow.

Result: Multiple Faults

[Scatter plot; ellipse: true positive, rectangle: false positive, triangle: false negative.]

Results of localizing simultaneous Type #1 faults. The left point is from the node injected with high-frequency I/O operations, and the right one is from the node injected with a memory leaking error.


Result: Multiple Faults

[Scatter plot, same legend.]

Results of localizing simultaneous Type #1 and #2 faults. The points inside the ellipse are true outliers caused by network volume overflow, the point in the rectangle is a false alarm, and the point in the triangle is the missed fault caused by memory leaking.

Result: Multiple Faults

[Scatter plot, same legend.]

Results of localizing simultaneous Type #2 faults. The identified outliers, including both true outliers and false alarms, spread across the space.


Result: Multiple Faults

Faults                                                         Recall   Precision
Memory leak & high-frequency I/O                               1        1.0
Memory leak & network volume overflow                          0.87     0.85
Unterminated CPU-intensive threads & network volume overflow   1        0.80

Conclusion: mixed Type #1 and #2 faults are difficult to identify, and multiple Type #2 faults can lead to a high cost for finding the real faults.



Adaptive Management

• Goal: to reduce application completion time

• Runtime adaptation:
– SKIP, to remove unnecessary overhead
– CHECKPOINT, to mitigate the recovery cost in case of unpredictable failures
– MIGRATION, to avoid anticipated failures

• Challenges:
– Imperfect prediction
– Overhead/benefit of different actions
– The availability of spare resources

Main Idea

[Diagram: application processes P1–P3 proceed through execution intervals I1, I2, … separated by adaptation points AP1–AP7; spare nodes (PS) stand by in a pool (PW). At each adaptation point the Adaptation Manager, fed by the Prediction Engine and the Cost Monitor, chooses among Skip, Checkpointing, and Migration. Possible outcomes: failure (true positive), false alarm (false positive), missed failure (false negative), and recovery (downtime/restart).]


Adaptation Manager

The manager weighs the expected cost E_next of the next interval under each action, given interval length I, recovery cost C_r, checkpoint cost C_ckp, migration cost C_pm, and the work l_current − l_last done since the last checkpoint:

• MIGRATION:
E_next = (I + 2C_r + C_pm) · f_appl + (I + C_pm) · (1 − f_appl),
where f_appl = 1 − ∏_{i=1}^{N_W} (1 − precision_i) if N_S > N_f^h (enough spare nodes for the nodes predicted to fail), and f_appl = 0 if N_S ≤ N_f^h.

• CHECKPOINT:
E_next = (I + 2C_r + C_ckp) · f_appl + (I + C_ckp) · (1 − f_appl),
where f_appl = 1 − ∏_{i=1}^{N_W} (1 − precision_i).

• SKIP:
E_next = (2C_r + (l_current − l_last) + I) · f_appl + I · (1 − f_appl),
where f_appl = 1 − ∏_{i=1}^{N_W} (1 − precision_i).

Select the action with min(E_next). Given prediction accuracy, operation costs, and available resources, the Adaptation Manager outputs SKIP, CHECKPOINT, or MIGRATION.

• An enforced FT window, derived from I, MTBF, and (1 − recall), guards against failures the predictor misses.
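Read as code, the decision rule is compact. The sketch below follows the reconstruction above (one plausible reading of the slide's formulas, not the paper's verified model); all parameter values are placeholders.

```python
def f_appl(precisions):
    """Probability that at least one of the window's predicted
    failures is real, given the per-rule precisions."""
    prob = 1.0
    for p in precisions:
        prob *= (1.0 - p)
    return 1.0 - prob

def choose_action(I, C_r, C_ckp, C_pm, l_cur, l_last, precisions,
                  spares_ok):
    """Return the action with minimal expected next-interval cost."""
    f = f_appl(precisions)
    E = {
        "SKIP":       (2*C_r + (l_cur - l_last) + I) * f + I * (1 - f),
        "CHECKPOINT": (I + 2*C_r + C_ckp) * f + (I + C_ckp) * (1 - f),
    }
    if spares_ok:  # migration is only an option with enough spare nodes
        E["MIGRATION"] = (I + 2*C_r + C_pm) * f + (I + C_pm) * (1 - f)
    return min(E, key=E.get)

# Placeholder numbers (hours): one predicted failure at 0.8 precision.
print(choose_action(I=1.0, C_r=0.2, C_ckp=0.05, C_pm=0.1,
                    l_cur=3.0, l_last=2.0, precisions=[0.8],
                    spares_ok=True))
```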

Adaptive Manager

• It can be easily implemented with existing checkpointing tools
– MPICH-V, LAM/MPI, …

• Fully operational with:
– Runtime failure prediction
– Checkpoint support
– Migration support
• Currently, a stop-and-restart approach


Experiments

• Fluid Stochastic Petri Net (FSPN) modeling
– Study the impact of computation scales, number of spare nodes, prediction accuracies, and operation costs

• Case studies
– Implementation with MPICH-V, as a new module
– Applications: ENZO, GROMACS, NPB
– TeraGrid/ANL IA32 Linux Cluster

Impact of Computation Scale

[Plots omitted.]


Impact of Prediction Accuracy

[Plots omitted.]

• It outperforms periodic checkpointing as long as recall and precision are higher than 0.30



Runtime Support

• Fault-aware runtime rescheduling
– Focus on re-allocating active jobs (i.e., running jobs) to avoid imminent failures
• Allocate spare nodes for failure prevention
• Select active jobs for rescheduling in case of resource contention

• Fast failure recovery
– Enhance checkpoint/recovery to reduce restart latency
– To appear in the Proc. of DSN'08

Spare Node Allocation

• Observation:
– Idle nodes are common in production systems, even under high load
– Our studies have shown that prob(at least 2% of system nodes are idle) >= 70%

• A dynamic and non-intrusive strategy
– Spare nodes are determined at runtime
– Guarantee job reservations made by the batch scheduler

[Diagram: a node-time chart (nodes P1–P5, times t4–t8) with jobs 4–6 waiting in the job queue; idle slots between reservations form the spare pool.]


Job Rescheduling

• Transform into a general 0-1 Knapsack model:
Determine a binary vector X = {x_i | 1 ≤ i ≤ J_s} such that

maximize Σ_{1≤i≤J_s} v_i · x_i,   x_i = 0 or 1
s.t. Σ_{1≤i≤J_s} x_i · p_i ≤ S_s

• Generate three different strategies by setting v_i:
– Service unit loss driven, to minimize the loss of service units
– Job failure rate driven, to reduce the number of failed jobs
– Failure slowdown driven, to minimize the slowdown caused by failures
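Since the p_i and S_s are node counts, the model can be solved exactly with the textbook knapsack dynamic program; a sketch, where the values list would be filled in by whichever of the three v_i strategies is in use:

```python
def reschedule(values, sizes, capacity):
    """0-1 Knapsack: pick the subset of active jobs to reschedule that
    maximizes total value v_i within the spare-node capacity S_s."""
    n = len(values)
    # best[c] = (total value, chosen job indices) using capacity c
    best = [(0, [])] * (capacity + 1)
    for i in range(n):
        # Iterate capacity downward so each job is used at most once.
        for c in range(capacity, sizes[i] - 1, -1):
            cand = best[c - sizes[i]][0] + values[i]
            if cand > best[c][0]:
                best[c] = (cand, best[c - sizes[i]][1] + [i])
    return best[capacity]

# Three jobs competing for 8 spare nodes; values encode, e.g., the
# service units that would be lost if the job stays on failing nodes.
print(reschedule(values=[10, 7, 5], sizes=[5, 4, 3], capacity=8))
# -> (15, [0, 2])
```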

Experiments

• Event-based simulations
– FCFS/EASY backfilling
– Compare system productivity with vs. without fault-aware runtime rescheduling
– Evaluation metrics:
• Performance metrics -- system utilization, response time, throughput
• Reliability metrics -- service unit loss, job failure rate, failure slowdown

• As long as failure prediction is capable of predicting 20% of failures with a false alarm rate lower than 80%, a positive gain is observed by using fault-aware runtime rescheduling


Results with System Traces

[Plots omitted.]

Summary (1/2)

• Preliminary results are very encouraging
– It is possible to capture failure cause-and-effect relations by exploiting data mining and pattern recognition technologies
• Towards automated failure prediction and diagnosis
– Runtime adaptation can significantly improve system productivity and application performance

• But many issues remain open
– Development of an automated failure prediction and diagnosis engine
• Stream data processing
• Extensive evaluation with production systems, such as the Blue Gene series


Summary (2/2)

• Development of fault-aware resource management and job scheduling
– Resource allocation for failure prevention & recovery
– Automated job recovery
• Fault-aware batch scheduler
• Simulation study

• A close collaboration with national labs and supercomputing centers is essential!

• Integration with the Fault Tolerant Backplane

Selected Publications

• Z. Lan and Y. Li, "Adaptive Fault Management of Parallel Applications for High Performance Computing", to appear in IEEE Trans. on Computers, 2008.
• Y. Li, Z. Lan, P. Gujrati, and X. Sun, "Fault-Aware Runtime Strategies for High Performance Computing", IEEE Trans. on Parallel and Distributed Systems, under minor revision.
• Y. Li and Z. Lan, "A Fast Recovery Mechanism for Checkpointing in Networked Environments", Proc. of DSN'08, 2008.
• J. Gu, Z. Zheng, Z. Lan, J. White, E. Hocks, and B.H. Park, "Dynamic Meta-Learning for Failure Prediction in Large-Scale Systems: A Case Study", Proc. of ICPP'08, 2008.
• Z. Lan, Y. Li, Z. Zheng, and P. Gujrati, "Enhancing Application Robustness through Adaptive Fault Tolerance", Proc. of the NSFNGS Workshop (in conjunction with IPDPS), 2008.
• X. Sun, Z. Lan, Y. Li, H. Jin, and Z. Zheng, "Towards a Fault-Aware Computing Environment", Proc. of the High Availability and Performance Computing Workshop, 2008.
• Z. Zheng, Y. Li, and Z. Lan, "Anomaly Localization in Large-scale Clusters", Proc. of IEEE Cluster'07, 2007.
• P. Gujrati, Y. Li, Z. Lan, R. Thakur, and J. White, "A Meta-Learning Failure Predictor for Blue Gene/L Systems", Proc. of ICPP'07, 2007.
• Y. Li, P. Gujrati, Z. Lan, and X. Sun, "Fault-Driven Re-Scheduling for Improving System-Level Fault Resilience", Proc. of ICPP'07, 2007.
• Y. Li and Z. Lan, "Exploit Failure Prediction for Adaptive Fault-Tolerance in Cluster Computing", Proc. of CCGrid'06, 2006.


Questions?

FENCE Project Website: http://www.cs.iit.edu/~zlan/fence.html

FENCE group members:
• Zhiling Lan, Xian-He Sun
• Currently, 5 Ph.D. students and 2 MS students

To our sponsor: National Science Foundation