
Making Services Fault Tolerant

Pat Chan, Michael R. Lyu
Department of Computer Science and Engineering, The Chinese University of Hong Kong

Miroslaw Malek
Department of Computer Science and Engineering, Humboldt University Berlin

Outline

Introduction
Problem Statement
Methodologies for Web Service Reliability
New Reliable Web Service Paradigm
Road Map for Experiment
Experimental Results and Discussion
Conclusion

Introduction

Service-oriented computing is becoming a reality.

Service-oriented architectures (SOA) are based on a simple model of roles.

The problems of service dependability, security and timeliness are becoming critical.

We propose experimental settings and offer a roadmap to dependable Web services.

Problem Statement

Fault-tolerant techniques: replication and diversity.

Replication is an efficient way to build reliable systems through time or space redundancy.
It increases the availability of distributed systems.
Key components are re-executed or replicated.
It protects against hardware malfunctions and transient system faults.

Another efficient technique is design diversity.
Software systems or services are designed independently by different programming teams, which defends against permanent software design faults.

We focus on the analysis of the replication techniques when applied to Web services.

A generic Web service system with spatial as well as temporal replication is proposed and investigated.

Methodologies for reliable Web services -- Redundancy

Spatial redundancy
Static redundancy: all replicas are active at the same time, and voting takes place to obtain a correct result.
Dynamic redundancy: one replica is engaged at a time while the others are kept in an active or standby state.

Temporal redundancy
Redundancy in time
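Temporal redundancy can be illustrated with a small retry wrapper. This is a minimal sketch, not the authors' implementation: the names `call_with_retry` and `flaky_service`, the attempt limit and the delay are all hypothetical and chosen only for illustration.

```python
import time


def call_with_retry(operation, max_attempts=3, delay_s=0.0):
    """Temporal redundancy: repeat the same request against the same service."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc
            if attempt < max_attempts:
                time.sleep(delay_s)  # wait before re-executing the request
    raise RuntimeError(f"all {max_attempts} attempts failed") from last_error


# Hypothetical stand-in service: fails twice with a transient fault, then succeeds.
calls = {"n": 0}


def flaky_service():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient fault")
    return "ok"


result = call_with_retry(flaky_service, max_attempts=3)
```

Retrying only masks transient faults. A permanent fault in the single replica fails on every attempt, which is where spatial redundancy comes in.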

Methodologies for reliable Web services -- Diversity

Protect redundant systems against common-mode failures

With different designs and implementations, common failure modes will probably cause different error effects.

N-version programming, recovery blocks…
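Voting over independently built versions, as in static redundancy or N-version programming, might look like the sketch below. It assumes that results are hashable and directly comparable, and that a strict majority of all N versions is required. The function names and the three toy versions are hypothetical.

```python
from collections import Counter


def majority_vote(results, total):
    """Return the value reported by a strict majority of the `total` versions."""
    if not results:
        raise RuntimeError("no version produced a result")
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= total:
        raise RuntimeError("no majority among version results")
    return value


def run_n_versions(versions, request):
    """Invoke every version on the same request and vote on the outputs."""
    results = []
    for version in versions:
        try:
            results.append(version(request))
        except Exception:
            continue  # a crashed version contributes no vote
    return majority_vote(results, total=len(versions))


# Three toy "versions" of a squaring service; the third has a design fault.
def version_a(x):
    return x * x


def version_b(x):
    return x ** 2


def version_c(x):
    return x + x + 1  # faulty implementation


voted = run_n_versions([version_a, version_b, version_c], 4)
```

Because the faulty version disagrees with the other two, the vote masks it. If all versions shared the same design fault, voting would not help, which is the argument for diversity.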

Failure Response Stages of Web Services

Fault confinement
Fault detection
Diagnosis
Fail-over
Reconfiguration
Recovery
Restart
Repair
Reintegration

[Figure: flow diagram of the failure response stages, with Online and Offline labels]

Proposed Paradigm

[Figure: Replication Manager (Web service selection algorithm, WatchDog), Registry, Client, and replicated Web services on IIS, each with an application and a database]

1. Create Web services

2. Select primary Web service (PWS)

3. Register

4. Look up

5. Get WSDL

6. Invoke Web service

7. Keep checking the availability of the PWS

8. If the PWS fails, reselect the PWS

9. Update the WSDL

Work Flow of the Replication Manager

The RM sends a message to the Web service.
Get reply: the primary Web service is available.
Do not get reply: reselect a primary Web service and map the new address to the WSDL.
All services failed: system fail.
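The Replication Manager work flow can be sketched as a single polling step. This is an illustration under stated assumptions: the slides do not specify the Web service selection algorithm, so the sketch simply takes the first replica that answers. The probe function `is_alive`, the class layout and the addresses are hypothetical, and a plain attribute stands in for the address published in the WSDL.

```python
class ReplicationManager:
    def __init__(self, replicas, is_alive):
        self.replicas = list(replicas)   # candidate Web service addresses
        self.is_alive = is_alive         # hypothetical probe: address -> bool
        self.primary = self.replicas[0]  # initial primary Web service (PWS)
        self.published_address = self.primary  # stands in for the WSDL endpoint

    def select_primary(self):
        """Assumed selection rule: first replica that replies, in list order."""
        for address in self.replicas:
            if self.is_alive(address):
                return address
        return None

    def poll_once(self):
        """One watchdog cycle: check the PWS and fail over if it is silent."""
        if self.is_alive(self.primary):
            return self.primary          # got a reply, nothing to do
        new_primary = self.select_primary()
        if new_primary is None:
            raise RuntimeError("system failure: all replicas are down")
        self.primary = new_primary
        self.published_address = new_primary  # map the new address to the WSDL
        return self.primary


# Usage with an in-memory health table instead of real network probes.
health = {"ws-a": True, "ws-b": True, "ws-c": True}
rm = ReplicationManager(["ws-a", "ws-b", "ws-c"], lambda address: health[address])
first = rm.poll_once()       # PWS replies, no change
health["ws-a"] = False       # simulate a failure of the primary
second = rm.poll_once()      # no reply, so a new PWS is selected
```

In a deployment this step would run on a timer at the polling frequency, and the probe would be a real request with a timeout.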

Road Map for Experiment Research

Redundancy in time
Redundancy in space
Sequentially
Parallel
Majority voting using N modular redundancy
Diversified version of different services

Experiments

A series of experiments is designed and performed to evaluate the reliability of the Web service: a single service without replication, a single service with retry or reboot, and a service with spatial replication.

We will also perform retry or failover when the Web service is down.

Summary of the experiments

Configuration                 None   Retry/Reboot   Failover   Both (hybrid)
Single service, no retry      0      --             --         --
Single service with retry     --     1              --         --
Single service with reboot    --     2              --         --
Spatial replication           --     --             3          4

Parameters of the Experiments

Parameter                          Current setting/metric
Request frequency                  1 req/min
Polling frequency                  5 ms
Number of replicas                 5
Client timeout period for retry    10 s
Failure rate λ                     number of failures/hour
Load (profile of the program)      % or load function
Reboot time                        10 min
Failover time                      1 s

Experimental Results

Experiments over 360-hour periods (43200 requests)

Number of failures

         Normal   Server busy   Server reboots periodically
Exp 0    4928     6130          6492
Exp 1    2210     2327          2658
Exp 2    2561     3160          3323
Exp 3    1324     1711          1658
Exp 4    1089     1148          1325

Retry: 11.97% to 4.93%
Reboot: 11.97% to 6.44%
Failover: 11.97% to 3.56%
Retry and failover: 11.97% to 2.59%

[Figure: number of failures when the server is in a normal situation]

[Figure: number of failures when the server is busy]

[Figure: number of failures when the server reboots periodically]

Reliability of the system over time

\[
\lambda(t) = \lim_{\Delta t \to 0} \frac{F(t + \Delta t) - F(t)}{\Delta t} = 0.025
\]

\[
R(t) = e^{-\lambda(t)\, t}
\]
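As a worked illustration of the exponential reliability expression, the sketch below evaluates R(t) for the failure rate 0.025 quoted on the slide. It assumes the rate is constant over time; the time unit is not stated in the slides, so `t` is simply in the same unit as the rate. The function name `reliability` is hypothetical.

```python
import math


def reliability(t, failure_rate=0.025):
    """R(t) = exp(-lambda * t) for a constant failure rate lambda (assumed)."""
    return math.exp(-failure_rate * t)


# Reliability at a few points in time, in the (unstated) time unit of the rate.
for t in (0, 10, 20, 40):
    print(f"t = {t:>2}: R(t) = {reliability(t):.4f}")
```

With a constant rate, reliability starts at 1 and falls to 1/e at t = 1/λ, which is t = 40 for λ = 0.025.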

Reliability Model

Reliability Model Parameters

ID    Description                                        Value
λn    Network failure rate                               0.02
λ*    Web service failure rate                           0.228
λ1    Resource problem rate                              0.142
λ2    Entry point failure rate                           0.150
μ*    Web service repair rate                            0.286
μ1    Resource problem repair rate                       0.979
μ2    Entry point failure repair rate                    0.979
C1    Probability that the RM responds on time           0.9
C2    Probability that the server reboots successfully   0.9
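To give a feel for how a failure rate and a repair rate interact, the sketch below computes the steady-state availability of a single repairable component from λ* and μ*. This two-state (up/down) Markov model is a deliberate simplification introduced here for illustration. It is not the model the authors solved with SHARPE, and it ignores the network, resource, entry point and coverage parameters in the table.

```python
def steady_state_availability(failure_rate, repair_rate):
    """Two-state up/down Markov model: A = mu / (lambda + mu)."""
    return repair_rate / (failure_rate + repair_rate)


# Values taken from the parameter table: Web service failure and repair rates.
web_service_failure_rate = 0.228
web_service_repair_rate = 0.286

availability = steady_state_availability(
    web_service_failure_rate, web_service_repair_rate
)
print(f"steady-state availability of one unreplicated service: {availability:.3f}")
```

The formula follows from balancing the flow between the two states: in steady state, the rate of leaving the up state (λ times the probability of being up) equals the rate of leaving the down state (μ times the probability of being down).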

Outcome (SHARPE)

Failure rates: 0.228, 0.114, 0.057

Reliability of the proposed system

Conclusion

Surveyed replication and design diversity techniques for reliable services.

Proposed a hybrid approach to improving the availability of Web services.

Carried out a series of experiments to evaluate the availability and reliability of the proposed Web service system.

N-version programming may finally become commercially viable in a service environment.
