The University of Durham e-Demand Project

Paul Townend [email protected]

14th April 2003

About the Project

The e-Demand project at Durham is basically concerned with:

• Construction of a Service-based Architecture
• Testing (Fault injection)
• Security (FT-PIR)
• Visualisation (Auto-3D)
• Fault Tolerance

Service-based Architecture

The architecture that we are developing:

[Diagram: the architecture links a service consumer, a contractor/assembly service provider, a catalogue/ontology provider and a service/solution provider through demand, provision, finding, publishing and ultra-late binding. Services shown include the e-Action service and the attack-tolerance service.]

Testing Service

Our testing service currently implements network-level fault injection.

[Diagram: the fault injector (testing service) sits at the middleware boundary between client and server. It intercepts each service request and each response, and passes on a potentially altered request or response, so either message may contain faults.]
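The interception idea can be sketched in a few lines of Python. This is only an illustration, not the project's testing service: the `FaultInjector` class, the `fault_rate` parameter, the single bit-flip fault and the `echo_server` stand-in are all assumptions made for the example.

```python
import random
from typing import Callable

Handler = Callable[[bytes], bytes]


class FaultInjector:
    """Sits between a client and a server handler and may corrupt messages."""

    def __init__(self, server: Handler, fault_rate: float, seed: int = 0) -> None:
        self.server = server
        self.fault_rate = fault_rate
        self.rng = random.Random(seed)  # seeded so test runs are repeatable
        self.injected = 0  # number of messages altered so far

    def _maybe_corrupt(self, message: bytes) -> bytes:
        # With probability fault_rate, flip one bit of one byte (assumed fault type).
        if not message or self.rng.random() >= self.fault_rate:
            return message
        index = self.rng.randrange(len(message))
        corrupted = bytearray(message)
        corrupted[index] ^= 0x01
        self.injected += 1
        return bytes(corrupted)

    def call(self, request: bytes) -> bytes:
        # Intercept the request and forward a potentially altered copy,
        # then intercept the response and potentially alter that too.
        response = self.server(self._maybe_corrupt(request))
        return self._maybe_corrupt(response)


def echo_server(request: bytes) -> bytes:
    # Hypothetical server used only for this example.
    return request
```

With `fault_rate=0.0` the injector is transparent; with `fault_rate=1.0` both the request and the response are altered on every call.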

Security service

Our Fault-Tolerant Private Information Retrieval service (FT-PIR) will allow users to query database records without revealing their true intentions.

[Diagram: a user and client, three servers and three databases.]

Visualisation service

The e-Demand project is also developing visualisation services.

We hope to have a demo available for the 2nd national All-Hands conference in Nottingham.

I don’t really know enough to say any more about this area!

Fault Tolerance on the Grid

Fault Tolerance (FT) is the main focus of this talk. FT allows a service to tolerate a fault and continue to provide its service in some fashion.

There is a great need for FT in the Grid community, but currently only the GGF Checkpoint Recovery group (Grid CPR-WG) is at work in this area.

We are seeking to perform work that will further the ease with which FT can be provided on the Grid.

In the following slides, we will look at the need for fault tolerance in the Grid, and look at some potential problems we may be able to resolve.

The Need for Fault Tolerance (1)

As applications scale to take advantage of Grid resources, their size and complexity will increase dramatically

Experience has shown that systems with complex, asynchronous and interacting activities are very prone to errors and failures.

At the same time, many Grid applications will perform long tasks that may require several days of computation, if not more.

The Need for Fault Tolerance (2)

“In a wide-area, distributed grid environment, however, the need for fault tolerance is essential. Besides having to cope with the higher probability of faults in a large system, the cost and difficulty of containing and recovering from a fault is higher. It is unacceptable that a process, host or network failure should cause a distributed grid application to irrevocably “hang” or malfunction in any way such that manual intervention is required at multiple sites.”

– Grid RPC, Events and Messaging, C. Lee, The Aerospace Corporation, September 2001

Example of the need for Fault Tolerance

Consider an application, decomposed into 100 services. Assume each service has an MTTF of 120 days and requires a week of computation.

Assuming an exponentially distributed failure mode, the failure rates of the 100 services add up, so the composed application would have an MTTF of 120 / 100 = 1.2 days.

Without any kind of FT, the application would rarely finish.
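The arithmetic behind this example can be checked with a short sketch. It assumes that the services fail independently, that a single failure stops the whole application, and that all services run for the full week; the function names are made up for illustration.

```python
import math


def composed_mttf(service_mttf_days: float, n_services: int) -> float:
    # Independent exponential failures: the rates add, so the MTTF divides by n.
    return service_mttf_days / n_services


def completion_probability(mttf_days: float, run_days: float) -> float:
    # Probability that no failure occurs during the whole run.
    return math.exp(-run_days / mttf_days)


mttf = composed_mttf(120.0, 100)
p_finish = completion_probability(mttf, 7.0)
print(f"MTTF = {mttf:.1f} days, P(finish) = {p_finish:.4f}")
```

Under these assumptions the chance of a clean one-week run is well below 1%, which is what "would rarely finish" means here.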

Expressing Fault Tolerance Capabilities (1)

Application metadata is critical for the development of Grid computing environments.

Information captured by metadata supports discovery of applications in the Grid environment.

It also facilitates the seamless composition of services.

We are therefore seeking to create a standard way of expressing fault tolerance properties in service metadata.

Expressing Fault Tolerance Capabilities (2)

This would allow a user to identify whether, for example, a service uses Recovery Blocks, Multi-Version Design or has no fault tolerance whatsoever.

This information could then be used in both WSDL and Service Data Elements.
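As a rough illustration of what such metadata might enable, the sketch below tags each service with an FT technique and filters a catalogue on it. The `FTTechnique` values, the `ServiceMetadata` record and the service names are hypothetical; no standard vocabulary exists yet, and a real version would live in WSDL or Service Data Elements rather than Python objects.

```python
from dataclasses import dataclass
from enum import Enum


class FTTechnique(Enum):
    # Hypothetical vocabulary, taken from the examples named on the slide.
    NONE = "none"
    RECOVERY_BLOCKS = "recovery-blocks"
    MULTI_VERSION_DESIGN = "multi-version-design"


@dataclass(frozen=True)
class ServiceMetadata:
    name: str
    ft_technique: FTTechnique


def fault_tolerant_services(catalogue):
    """Return only the services that advertise some form of FT."""
    return [s for s in catalogue if s.ft_technique is not FTTechnique.NONE]


catalogue = [
    ServiceMetadata("solver-a", FTTechnique.RECOVERY_BLOCKS),
    ServiceMetadata("solver-b", FTTechnique.NONE),
    ServiceMetadata("solver-c", FTTechnique.MULTI_VERSION_DESIGN),
]
selected = [s.name for s in fault_tolerant_services(catalogue)]
```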

PC Grids (1)

Perhaps one of the most attractive opportunities allowed by Grid technologies is the idea of the “PC Grid”

[Diagram: a client connected to a PC Grid server, which is connected to several PCs.]

PC Grids (2)

The issue of providing fault tolerance is not as simple as it might initially appear.

As the individual nodes on a PC Grid are all potentially insecure, it is evident that replication is the most suitable FT methodology to use.

However, different nodes in the Grid will be running at different speeds, and have different loads at any one time.

It thus becomes difficult to guarantee a job will be finished within a given amount of time.

PC Grids (3)

Simple replication might therefore not be suitable, as the server may be left waiting for a heavily loaded node to finish and submit its results, while it already has the other nodes’ results.

[Diagram: a PC Grid server and a “replication block” of five nodes; three nodes finish in 2 minutes and two nodes take 8 minutes.]
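The cost of waiting can be made concrete with the timings from the diagram. The helper names below are made up for this sketch, and treating "three results" as enough to proceed is an assumption.

```python
def wait_for_all(finish_minutes):
    # Simple replication: the server waits for the slowest node.
    return max(finish_minutes)


def wait_for_first(finish_minutes, n):
    # Time at which the n-th result has arrived.
    return sorted(finish_minutes)[n - 1]


block = [2, 2, 2, 8, 8]  # finish times in minutes, from the diagram
idle = wait_for_all(block) - wait_for_first(block, 3)
```

Here the server holds three results after 2 minutes but sits idle for a further 6 minutes if it insists on hearing from every node.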

PC Grids (4)

In addition, PC Grids are highly dynamic – nodes may join or leave at any time.

We therefore can’t make any guarantees about the performance of each node.

We can, however, make general assumptions about their ability to perform a job within a given time-frame, based on their hardware and historical load levels, etc.

So here is an initial FT scheme we are currently looking at for providing replication on PC Grids…

Replication on a PC Grid (1)

It may be the case that some jobs to be sent out on the PC Grid are more important than others.

Ideally, we want the most important jobs (or the ones requiring most compute time) to be processed quickly.

We also want to ensure that different “replication blocks” finish at approximately the same time, so that the PC Grid server isn’t waiting to vote on jobs for too long.

Replication on a PC Grid (2)

We also need to allow for the possibility of nodes within a replication block leaving the PC Grid (voluntarily or due to some kind of failure)

Given that the resources available should be plentiful, we can therefore use more replication than we strictly need.

We can then vote on the results of the first n nodes within a replication block that return results (with n being arbitrary)
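A minimal sketch of voting on the first n results might look like this. The slides do not fix a voting rule, so strict majority is an assumption here, as is the function name; results are assumed to be hashable values listed in arrival order.

```python
from collections import Counter


def vote_on_first(results, n):
    """Majority-vote over the first n results returned; None if no majority."""
    considered = results[:n]
    if len(considered) < n:
        return None  # not enough nodes have reported yet
    value, count = Counter(considered).most_common(1)[0]
    # Require a strict majority of the n results considered.
    return value if count > n // 2 else None
```

Any results that arrive after the first n are simply ignored, which is what lets the extra replicas leave or fail without holding up the vote.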

So we are thinking of something like the following…

Our Very Provisional Solution (1)

Whenever a node joins the PC Grid, it must be assigned a “performance category” based on its hardware capabilities and load.

These categories are dynamic, and continually re-assessed by the PC Grid server.

[Diagram: a PC Grid server with seven nodes; four are labelled performance category 1 and three are labelled category 2.]
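One way a server might derive a category is sketched below. Everything specific in it is an assumption made for illustration: that category 1 is the faster class, that clock speed scaled by current load is a reasonable proxy for capability, and that the 1.5 GHz threshold means anything.

```python
def performance_category(cpu_ghz: float, load: float) -> int:
    """Return 1 (fast) or 2 (slow) from clock speed and load in [0, 1]."""
    # Assumed proxy: the share of the CPU that is currently free.
    effective_ghz = cpu_ghz * (1.0 - load)
    return 1 if effective_ghz >= 1.5 else 2
```

Because the categories are dynamic, the server would call this again as each node's load changes; the same 3 GHz machine is category 1 when lightly loaded and category 2 when busy.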

Our Very Provisional Solution (2)

Similarly, when a job is submitted to a PC Grid, the scheduler must decide on the priority of the job.

This may be based on whether the job requires lots of computation, or perhaps is critical, etc.

It then identifies several nodes within the PC Grid that meet the performance requirements of the job, based on their performance category.

Our Very Provisional Solution (3)

The server then sends the replicated job to several of these nodes, to form a “replication block”.

Should resources allow, this block can contain more nodes than we need, in order to guard against some of them leaving/failing.

We might specify that – should there be a lack of nodes within the given performance category – we use a mixture of nodes from other categories.
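The selection step could be sketched as follows. The function name, the `spare` parameter and the "closest category first" fallback order are assumptions for this example, not part of the scheme as presented.

```python
def build_replication_block(nodes, category, needed, spare=1):
    """Pick node ids for a job: preferred category first, then others.

    nodes maps node id -> performance category.
    """
    target = needed + spare  # over-provision to guard against nodes leaving
    preferred = [n for n, c in nodes.items() if c == category]
    # Fall back to other categories, closest category first.
    others = sorted(
        (n for n, c in nodes.items() if c != category),
        key=lambda n: abs(nodes[n] - category),
    )
    block = (preferred + others)[:target]
    if len(block) < needed:
        raise ValueError("not enough nodes for the requested replication")
    return block


grid = {"a": 1, "b": 1, "c": 1, "d": 1, "e": 2, "f": 2, "g": 2}
```

For a job needing category 1 and three replicas, this picks all four category 1 nodes; for category 2 it takes the three category 2 nodes and borrows one from category 1.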

Example (1)

Assume we have a PC Grid like this:

[Diagram: a PC Grid server with seven nodes; four are labelled performance category 1 and three are labelled category 2.]

Then assume a job is submitted that is adjudged to have a “priority” of 1.

Example (2)

We only need 3 nodes for decent replication, but we have 4 category 1 nodes that are not in use, so why not use them all?

[Diagram: a job of priority 1 arrives from the client; the PC Grid server assigns all four category 1 nodes to the job’s replication block.]

Example (3)

Part way through computation, one of the nodes either leaves the PC Grid or is reallocated to another replication block – but we still have 3 left.

[Diagram: the job’s replication block now contains three category 1 nodes; the three category 2 nodes are unchanged.]

Example (4)

As each node finishes its task, it sends its results back to the server, and is free to be allocated elsewhere.

The server stores the results centrally and waits for the final job in the replication block to finish.

[Diagram: two of the three nodes in the job’s replication block are marked as finished.]

Example (5)

Because the nodes used were of similar performance, the server will not have to wait long and hence overhead should be kept down.

Eventually, the final node finishes, the result is voted on and, if the vote succeeds, sent back to the client.

[Diagram: all three nodes in the job’s replication block are marked as finished.]

Acceptance Tests on the Grid (1)

Speaking of voting, another area of FT that is traditionally problematic is that of acceptance testing.

This is where the result of a program/service is verified by one or more tests, performed automatically.

A number of FT schemes depend on such testing, but the testing itself must usually be simple, as it otherwise introduces unacceptable run-time overhead.

Acceptance Tests on the Grid (2)

With the Grid, this problem may be solvable for some applications.

Rather than process the acceptance test locally, we could send the data and either an executable or a schema specifying the test to perform, to an HPC node.

The overhead of compute time would thus be decreased, although whether this will be offset by the increase in communication overhead remains to be seen.
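The trade-off reduces to a simple break-even check, sketched below. The function and parameter names are made up, and the model deliberately ignores queueing on the HPC node and any cost of shipping the executable or schema separately from the data.

```python
def remote_test_is_faster(local_s: float, remote_s: float, transfer_s: float) -> bool:
    """True if running the acceptance test remotely beats running it locally.

    local_s: time to run the test locally, in seconds
    remote_s: time to run the same test on the HPC node
    transfer_s: round-trip time to send the data and test and get the verdict
    """
    return remote_s + transfer_s < local_s
```

A test that takes 60 s locally and 5 s remotely is worth farming out if the transfer costs 10 s, but not if it costs 70 s.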

Conclusions (1)

The e-Demand project is multi-faceted – it’s looking at security, visualisation, testing and fault tolerance.

The main focus of this talk has been to present some ideas we have in regard to fault tolerance.

FT is obviously needed on the Grid.

A standard way of expressing FT capabilities in a service’s metadata would be A Good Thing.

We are inclined to focus initially on the problem of providing FT to PC Grids.

Conclusions (2)

At first glance, FT on a PC Grid simply involves replication, but it soon becomes apparent that a better solution involves:

• Assessing and grouping PC Grid nodes.
• Assessing and scheduling jobs.
• Using extra redundancy to tolerate nodes leaving.
• Perhaps reallocating redundant nodes on the fly.
• Perhaps farming out computationally expensive acceptance tests to HPC nodes.
• Assessing scalability of this architecture.
• And so on!

Open Issues

Obviously, this work is still in its initial stages and so there are many things that need to be considered, such as:

• Cost (will using extra PC Grid nodes be chargeable?)
• How to choose nodes for replication blocks
• How to dynamically assess node performance
• How to dynamically assess job priorities
• What to do if consensus is not reached, or only 1 node successfully returns the job
• Whether a traditional distributed system fault model is applicable in a grid environment, or whether revision is needed.

Open Issues (2)

The traditional distributed systems fault model includes events such as:

• Physical faults
• Software faults
• Timing faults
• Communication faults
• Life-cycle faults

Is such a traditional fault model acceptable for Grid computation, or is some revision required?

Thanks!

If you have any questions or anything, then e-mail:

[email protected]

And you might even get a reply!
