The University of Durham e-Demand Project

Paul Townend
p.m.townend@dur.ac.uk

14th April 2003

About the Project

The e-Demand project at Durham is concerned with:

• Construction of a Service-based Architecture
• Testing (Fault injection)
• Security (FT-PIR)
• Visualisation (Auto-3D)
• Fault Tolerance

Service-based Architecture

The architecture that we are developing:

[Figure: the service-based architecture. Roles shown: service consumer, contractor/assembly service provider, catalogue/ontology provider, service/solution provider. Interactions shown: demand, provision, finding, publishing, ultra-late binding. Supporting services shown: e-Action service, attack-tolerance service.]

Testing Service

Our testing service currently implements network-level fault injection.

[Figure: the fault injector (testing service) sits at the middleware boundary between client and server. It intercepts each service request and each response and passes on a potentially altered version, so both the request and the response may contain faults.]
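To make the interception idea concrete, here is a minimal sketch in Python. It is not the e-Demand testing service: the names `corrupt_bytes`, `make_fault_injector` and `demo_server`, the single-bit-flip fault and the `fault_rate` parameter are all assumptions made for illustration.

```python
import random
from typing import Callable


def corrupt_bytes(message: bytes, rng: random.Random) -> bytes:
    """Flip one randomly chosen bit of the message (a very simple injected fault)."""
    if not message:
        return message
    data = bytearray(message)
    index = rng.randrange(len(data))
    data[index] ^= 1 << rng.randrange(8)
    return bytes(data)


def make_fault_injector(
    server: Callable[[bytes], bytes], fault_rate: float, seed: int = 0
) -> Callable[[bytes], bytes]:
    """Wrap a server so that requests and responses are intercepted and may be altered."""
    rng = random.Random(seed)

    def maybe_corrupt(message: bytes) -> bytes:
        # Each intercepted message is altered with probability fault_rate.
        return corrupt_bytes(message, rng) if rng.random() < fault_rate else message

    def proxy(request: bytes) -> bytes:
        response = server(maybe_corrupt(request))  # potentially altered request
        return maybe_corrupt(response)             # potentially altered response

    return proxy


def demo_server(request: bytes) -> bytes:
    """Hypothetical stand-in for a real service."""
    return b"ok:" + request


if __name__ == "__main__":
    injected = make_fault_injector(demo_server, fault_rate=0.5, seed=42)
    for _ in range(3):
        print(injected(b"hello"))
```

The client calls `proxy` exactly as it would call the server, which is the point of injecting at the middleware boundary: neither side needs to be modified.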

Security service

Our Fault Tolerant Private Information Retrieval service (FT-PIR) will allow users to query database records without revealing their true intentions.

[Figure: the user's client queries three servers, each with its own database (DB).]
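For readers new to private information retrieval, the sketch below shows the textbook two-server XOR scheme. It is only an illustration of the general idea, not the FT-PIR design: it assumes two non-colluding servers holding identical copies of the database, equal-length records, and no fault tolerance at all. All function names are hypothetical.

```python
import secrets
from typing import List, Set, Tuple


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def server_answer(database: List[bytes], subset: Set[int]) -> bytes:
    """Server side: XOR together the records whose indices were requested."""
    answer = bytes(len(database[0]))  # all-zero record
    for index in subset:
        answer = xor_bytes(answer, database[index])
    return answer


def make_queries(num_records: int, wanted: int) -> Tuple[Set[int], Set[int]]:
    """Client side: a random subset for one server, and the same subset with
    the wanted index toggled for the other. Each query alone looks random."""
    first = {i for i in range(num_records) if secrets.randbits(1)}
    second = first ^ {wanted}
    return first, second


def retrieve(database_a: List[bytes], database_b: List[bytes], wanted: int) -> bytes:
    """Combine both answers; every record except the wanted one cancels out."""
    query_a, query_b = make_queries(len(database_a), wanted)
    return xor_bytes(
        server_answer(database_a, query_a), server_answer(database_b, query_b)
    )


if __name__ == "__main__":
    records = [b"rec0", b"rec1", b"rec2", b"rec3"]
    print(retrieve(records, records, 2))
```

Neither server sees which index the user actually wanted, provided the two servers do not compare the queries they received.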

Visualisation service

The e-Demand project is also developing visualisation services.

We hope to have a demo available for the 2nd national All-Hands conference in Nottingham.

I don’t really know enough to say any more about this area!

Fault Tolerance on the Grid

Fault Tolerance (FT) is the main focus of this talk. FT allows a service to tolerate a fault and continue to provide its service in some fashion.

There is a great need for FT in the Grid community, but currently only the GGF Checkpoint Recovery group (Grid CPR-WG) is at work in this area.

We are seeking to perform work that will further the ease with which FT can be provided on the Grid.

In the following slides, we will look at the need for fault tolerance in the Grid, and look at some potential problems we may be able to resolve.

The Need for Fault Tolerance (1)

As applications scale to take advantage of Grid resources, their size and complexity will increase dramatically.

Experience has shown that systems with complex, asynchronous, interacting activities are very prone to errors and failures.

At the same time, many Grid applications will perform long tasks that may require several days of computation, if not more.

The Need for Fault Tolerance (2)

“In a wide-area, distributed grid environment, however, the need for fault tolerance is essential. Besides having to cope with the higher probability of faults in a large system, the cost and difficulty of containing and recovering from a fault is higher. It is unacceptable that a process, host or network failure should cause a distributed grid application to irrevocably “hang” or malfunction in any way such that manual intervention is required at multiple sites.”

– Grid RPC, Events and Messaging, C.Lee, The Aerospace Corporation, September 2001

Example of the need for Fault Tolerance

Consider an application decomposed into 100 services. Assume each service has an MTTF (mean time to failure) of 120 days and requires a week of computation.

Assuming an exponentially distributed failure mode, the composed application would have a MTTF of 1.2 days.

Without any kind of FT, the application would rarely finish.
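The arithmetic behind these figures can be checked with a few lines of Python. With exponentially distributed failures the failure rates of the services add, so 100 services at 1/120 failures per day give a combined rate of 100/120 per day, i.e. an MTTF of 1.2 days. The completion probability below rests on an extra assumption that is not on the slide: all 100 services must run for the same week and a single failure loses the whole run.

```python
import math


def system_mttf(service_mttf_days: float, num_services: int) -> float:
    """MTTF of an application that fails as soon as any one service fails.

    With exponential lifetimes the individual failure rates simply add.
    """
    combined_failure_rate = num_services / service_mttf_days
    return 1.0 / combined_failure_rate


def completion_probability(mttf_days: float, run_days: float) -> float:
    """Probability of surviving run_days without any failure."""
    return math.exp(-run_days / mttf_days)


mttf = system_mttf(service_mttf_days=120.0, num_services=100)
p_finish = completion_probability(mttf, run_days=7.0)
print(f"Application MTTF: {mttf:.1f} days")
print(f"Chance of an uninterrupted week: {p_finish:.2%}")
```

Under these assumptions the chance of an uninterrupted week comes out at roughly 0.3%, which is what "would rarely finish" means in practice.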

Expressing Fault Tolerance Capabilities (1)

Application metadata is critical for the development of Grid computing environments.

Information captured by metadata supports discovery of applications in the Grid environment.

It also facilitates the seamless composition of services.

We are therefore seeking to create a standard way of expressing fault tolerance properties in service metadata.

Expressing Fault Tolerance Capabilities (2)

This would allow a user to identify whether, for example, a service uses Recovery Blocks, Multi-Version Design or has no fault tolerance whatsoever.

This information could then be used in both WSDL and Service Data Elements.
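As a rough illustration of what such metadata might let a user do, here is a small in-memory model in Python. It is not WSDL or Service Data Element syntax, and no standard vocabulary exists yet: the class names, the enum values and the `find_fault_tolerant` helper are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class FaultToleranceTechnique(Enum):
    """Hypothetical vocabulary for a service's FT property."""
    NONE = "none"
    RECOVERY_BLOCKS = "recovery-blocks"
    MULTI_VERSION_DESIGN = "multi-version-design"


@dataclass(frozen=True)
class ServiceMetadata:
    """The part of a service description a user would inspect during discovery."""
    name: str
    fault_tolerance: FaultToleranceTechnique


def find_fault_tolerant(services: List[ServiceMetadata]) -> List[ServiceMetadata]:
    """Keep only the services that advertise some form of fault tolerance."""
    return [s for s in services if s.fault_tolerance is not FaultToleranceTechnique.NONE]


catalogue = [
    ServiceMetadata("solver-a", FaultToleranceTechnique.RECOVERY_BLOCKS),
    ServiceMetadata("solver-b", FaultToleranceTechnique.NONE),
    ServiceMetadata("solver-c", FaultToleranceTechnique.MULTI_VERSION_DESIGN),
]
print([s.name for s in find_fault_tolerant(catalogue)])
```

A fixed vocabulary like the enum above is what makes automated discovery and composition possible; free-text descriptions would not be machine-comparable.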

PC Grids (1)

Perhaps one of the most attractive opportunities allowed by Grid technologies is the idea of the “PC Grid”.

[Figure: a client connected to a PC Grid server, which is in turn connected to several PCs.]

PC Grids (2)

The issue of providing fault tolerance is not as simple as it might initially appear.

As the individual nodes on a PC Grid are all potentially insecure, it is evident that replication is the most suitable FT methodology to use.

However, different nodes in the Grid will be running at different speeds, and have different loads at any one time.

It thus becomes difficult to guarantee a job will be finished within a given amount of time.

PC Grids (3)

Simple replication might therefore not be suitable, as the server may be left waiting for a heavily loaded node to finish and submit its results, while it already has the other nodes’ results.

[Figure: a PC Grid server waiting on a “replication block” of five nodes. Three nodes take 2 minutes and two nodes take 8 minutes.]

PC Grids (4)

In addition, PC Grids are highly dynamic: nodes may join or leave at any time.

We therefore can’t make any guarantees about the performance of each node.

We can, however, make general assumptions about their ability to perform a job within a given time-frame, based on their hardware and historical load levels, etc.

So here is an initial FT scheme we are currently looking at for providing replication on PC Grids…

Replication on a PC Grid (1)

It may be the case that some jobs to be sent out on the PC Grid are more important than others.

Ideally, we want the most important jobs (or the ones requiring most compute time) to be processed quickly.

We also want to ensure that different “replication blocks” finish at approximately the same time, so that the PC Grid server isn’t waiting to vote on jobs for too long.

Replication on a PC Grid (2)

We also need to allow for the possibility of nodes within a replication block leaving the PC Grid (voluntarily or due to some kind of failure).

Given that the resources available should be plentiful, we can therefore use more replication than we strictly need.

We can then vote on the results of the first n nodes within a replication block that return results (with n chosen by us).
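Voting on the first n results could look something like the sketch below. The strict-majority rule, the function name `vote_on_first_n` and the result values are assumptions for illustration; the timings reuse the earlier picture of a block in which three nodes take 2 minutes and two take 8 minutes.

```python
from collections import Counter
from typing import Hashable, List, Optional, Tuple


def vote_on_first_n(
    completions: List[Tuple[float, Hashable]], n: int
) -> Tuple[float, Optional[Hashable]]:
    """Vote on the n earliest results of a replication block.

    completions holds one (finish_time_in_minutes, result) pair per node.
    Returns the time at which the vote can be taken and the strict-majority
    result among the first n (None if there is no strict majority).
    """
    if n < 1 or len(completions) < n:
        raise ValueError("need at least n completed results to vote")
    first_n = sorted(completions, key=lambda pair: pair[0])[:n]
    decision_time = first_n[-1][0]
    winner, votes = Counter(result for _, result in first_n).most_common(1)[0]
    return decision_time, (winner if votes > n / 2 else None)


# Five replicas: three fast nodes, two heavily loaded ones (one returns a wrong value).
block = [(2.0, 42), (8.0, 42), (2.0, 42), (8.0, 41), (2.0, 42)]
print(vote_on_first_n(block, n=3))  # vote as soon as three results are in
print(vote_on_first_n(block, n=5))  # wait for every node
```

With n = 3 the server can vote after 2 minutes; waiting for all five delays the same decision to 8 minutes.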

So we are thinking of something like the following…

Our Very Provisional Solution (1)

Whenever a node joins the PC Grid, it must be assigned a “performance category” based on its hardware capabilities and load.

These categories are dynamic, and continually re-assessed by the PC Grid server.

[Figure: a PC Grid server with seven nodes, labelled with performance categories 1 1 1 1 2 2 2.]
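One very simple way to assign and re-assess a category is sketched below. The scoring rule (a benchmark score scaled by the idle fraction of the CPU), the threshold of 50 and the two-category split are all invented for illustration; how to assess node performance is still an open question for us.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    benchmark_score: float  # hypothetical hardware rating, higher is faster
    load: float             # fraction of the CPU currently busy, 0.0 to 1.0


def performance_category(node: Node, fast_threshold: float = 50.0) -> int:
    """Return 1 for nodes that are currently fast, 2 for the rest."""
    effective_speed = node.benchmark_score * (1.0 - node.load)
    return 1 if effective_speed >= fast_threshold else 2


node = Node("lab-pc-07", benchmark_score=100.0, load=0.2)
print(performance_category(node))  # lightly loaded

node.load = 0.9                    # the owner starts a heavy job
print(performance_category(node))  # the server re-assesses the node
```

Because the category depends on current load as well as hardware, the same machine can move between categories over time, which is why the server has to keep re-assessing.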

Our Very Provisional Solution (2)

Similarly, when a job is submitted to a PC Grid, the scheduler must decide on the priority of the job.

This may be based on whether the job requires lots of computation, or perhaps is critical, etc.

It then identifies several nodes within the PC Grid that meet the performance requirements of the job, based on their performance category.

Our Very Provisional Solution (3)

The server then sends the replicated job to several of these nodes, to form a “replication block”.

Should resources allow, this block can contain more nodes than we need, in order to guard against some of them leaving/failing.

We might specify that – should there be a lack of nodes within the given performance category – we use a mixture of nodes from other categories.
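The selection step might be sketched as follows. This is only one possible policy, under several assumptions: a job's priority maps directly onto the performance category of the same number, spare nodes are only added when the preferred category can supply them, and a shortfall is filled from the numerically nearest categories. The function and parameter names are hypothetical.

```python
from typing import Dict, List


def choose_replication_block(
    idle_nodes: Dict[str, int], wanted_category: int, minimum: int, spare: int = 1
) -> List[str]:
    """Pick the nodes that will form a job's replication block.

    idle_nodes maps each idle node's name to its performance category.
    """
    if len(idle_nodes) < minimum:
        raise ValueError("not enough idle nodes for the requested replication level")
    # Nearest categories first, names as a tie-break so the result is deterministic.
    ranked = sorted(
        idle_nodes, key=lambda name: (abs(idle_nodes[name] - wanted_category), name)
    )
    preferred = [name for name in ranked if idle_nodes[name] == wanted_category]
    if len(preferred) >= minimum:
        # Enough nodes of the right category: add spares if resources allow.
        return preferred[: minimum + spare]
    # Too few nodes in the wanted category: mix in nodes from other categories.
    return ranked[:minimum]


grid = {"a": 1, "b": 1, "c": 1, "d": 1, "e": 2, "f": 2, "g": 2}
print(choose_replication_block(grid, wanted_category=1, minimum=3, spare=1))
```

Keeping a block within one category where possible is what lets the replicas finish at about the same time.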

Example (1)

Assume we have a PC Grid like this:

[Figure: a PC Grid server with seven nodes, labelled with performance categories 1 1 1 1 2 2 2.]

Then assume a job is submitted that is adjudged to have a “priority” of 1.

Example (2)

We only need 3 nodes for decent replication, but we have 4 that are not being used, so we may as well use all of them.

[Figure: a job of priority 1 arrives from the client at the PC Grid server (nodes 1 1 1 1 2 2 2). Four nodes are marked as the job’s replication block.]

Example (3)

Part way through computation, one of the nodes either leaves the PC Grid or is reallocated to another replication block – but we still have 3 left.

[Figure: the PC Grid server with the remaining nodes 1 1 1 2 2 2. Three nodes are left in the job’s replication block.]

Example (4)

As each node finishes its task, it sends its results back to the server, and is free to be allocated elsewhere.

The server stores the results centrally and waits for the final node in the replication block to finish.

[Figure: the PC Grid server with nodes 1 1 1 2 2 2. Two of the nodes in the job’s replication block are marked finished.]

Example (5)

Because the nodes used were of similar performance, the server will not have to wait long and hence overhead should be kept down.

Eventually the final node finishes; the results are voted on and, if the vote succeeds, the agreed result is sent back to the client.

[Figure: the PC Grid server with nodes 1 1 1 2 2 2. All three nodes in the replication block are marked finished.]

Acceptance Tests on the Grid (1)

Speaking of voting, another area of FT that is traditionally problematic is that of acceptance testing.

This is where the result of a program/service is verified by one or more tests, performed automatically.

A number of FT schemes depend on such testing, but the testing itself must usually be simple, as it otherwise introduces unacceptable run-time overhead.

Acceptance Tests on the Grid (2)

With the Grid, this problem may be solvable for some applications.

Rather than process the acceptance test locally, we could send the data and either an executable or a schema specifying the test to perform, to an HPC node.

The overhead of compute time would thus be decreased, although whether this will be offset by the increase in communication overhead remains to be seen.
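The trade-off can be written down as a simple inequality: offloading pays off only when remote compute time plus transfer time is less than local compute time. The sketch below encodes that check. It is a deliberately crude cost model, assuming a fixed bandwidth and latency and ignoring queueing on the HPC node and the cost of returning the verdict; the function name and parameters are hypothetical.

```python
def should_offload(
    local_seconds: float,
    remote_seconds: float,
    data_bytes: int,
    bandwidth_bytes_per_second: float,
    latency_seconds: float = 0.0,
) -> bool:
    """True when running the acceptance test on the HPC node is expected to be faster."""
    transfer_seconds = latency_seconds + data_bytes / bandwidth_bytes_per_second
    return remote_seconds + transfer_seconds < local_seconds


# A 60 s local test that takes 5 s remotely, over a 1 MB/s link.
print(should_offload(60.0, 5.0, data_bytes=10_000_000, bandwidth_bytes_per_second=1_000_000))
print(should_offload(60.0, 5.0, data_bytes=100_000_000, bandwidth_bytes_per_second=1_000_000))
```

With 10 MB of data the offload wins (15 s against 60 s); with 100 MB the transfer alone takes longer than running the test locally.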

Conclusions (1)

The e-Demand project is multi-faceted – it’s looking at security, visualisation, testing and fault tolerance.

The main focus of this talk has been to present some ideas we have in regard to fault tolerance.

FT is obviously needed on the Grid.

A standard way of expressing FT capabilities in a service’s metadata would be A Good Thing.

We are inclined to focus initially on the problem of providing FT to PC Grids.

Conclusions (2)

At first glance, FT on a PC Grid simply involves replication, but it soon becomes apparent that a better solution involves:

• Assessing and grouping PC Grid nodes.
• Assessing and scheduling jobs.
• Using extra redundancy to tolerate nodes leaving.
• Perhaps reallocating redundant nodes on the fly.
• Perhaps farming out computationally expensive acceptance tests to HPC nodes.
• Assessing scalability of this architecture.
• And so on!

Open Issues

Obviously, this work is still in its initial stages and so there are many things that need to be considered, such as:

• Cost (will using extra PC Grid nodes be chargeable?)
• How to choose nodes for replication blocks
• How to dynamically assess node performance
• How to dynamically assess job priorities
• What to do if consensus is not reached, or only 1 node successfully returns the job
• Whether a traditional distributed system fault model is applicable in a grid environment, or whether revision is needed.

Open Issues (2)

The traditional distributed systems fault model includes events such as:

• Physical faults
• Software faults
• Timing faults
• Communication faults
• Life-cycle faults

Is such a traditional fault model acceptable for Grid computation, or is some revision required?

Thanks!

If you have any questions, then e-mail:

p.m.townend@dur.ac.uk

And you might even get a reply!