Grid Checkpointing Architecture

Radosław Januszewski

CoreGrid Summer School 2007

European Research Network on Foundations, Software Infrastructures and Applications for large scale distributed, GRID and Peer-to-Peer Technologies

motivation

- Grids are complex and therefore prone to errors.

- The distributed nature of the Grid makes scheduling system maintenance hard.

- Each uncoordinated power-down or failure results in the loss of currently running applications.

- Loss of computation time means additional cost!

goal

To enhance the reliability, fault-tolerance and robustness of the Grid computing environment.

the solution

Grid Checkpoint Architecture (GCA): a proposal for the placement, functionality and interaction schemes of checkpointing services in the Grid environment

grid - model

[Diagram: the Grid model in four layers (User Tier, GRID Tier, Cluster Tier, Computing Nodes). Components shown: User Interface, Grid Broker, Local Resource Manager, Globally Accessible Storage (Data Management), and an Operating System on each of three computing nodes.]

GCA in the Grid

[Diagram: the Grid model extended with the GCA components. Layers: User Tier, GRID Tier, Cluster Tier, Computing Nodes. Components shown: User Interface, Grid Broker, Grid Checkpoint Service, Checkpoint Translation Service (CTS), Local Resource Manager, Globally Accessible Storage (Data Management), and a Core Service alongside the Operating System on each of three computing nodes.]

Proof of concept – the goals

• check whether the GCA survives contact with reality

• prepare the PoC on the basis of a real-life installation

• the Grid with the GCA should provide additional value compared with the "traditional" approach

GCA proof of concept installation

[Diagram: the proof-of-concept installation across the User Tier, GRID Tier, Cluster Tier and Computing Nodes. Components shown: Command Line Client, GridSphere interface and Migrating Desktop Client; GRMS; WS GRAM with the PBS JobManager; Torque/PBS Pro; a checkpoint script; three Linux nodes, each with SGIckpt; NFS shared space.]

involved elements

• GUI: command line, GridSphere, Migrating Desktop

• Broker: GRMS

• Local Resource Manager: Globus + TORQUE

• Core service: SGIckpt

Bottom-up approach

How to make the checkpointer work with the local resource manager?

[Diagram: the proof-of-concept installation (Command Line Client, GridSphere interface and Migrating Desktop Client; GRMS; WS GRAM with the PBS JobManager; Torque/PBS Pro; checkpoint script; three Linux nodes with SGIckpt; NFS shared space).]

pbs/torque special features

action checkpoint

action restart

action checkpoint_abort

config

$action checkpoint 0 !/usr/pbs/bin/pbs-mom-checkpoint.sh %globid %jobid %sid %taskid %path

$action restart 0 !/usr/pbs/bin/pbs_restart_test.sh %path %taskid

$restart_transmogrify true

$action checkpoint_abort 0 !/usr/pbs/bin/pbs-mom-checkpoint-and-stop.sh %globid %jobid %sid %taskid %path

A detailed description is available at http://checkpointing.psnc.pl
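The `$action` lines above pass the substituted values (%globid, %jobid, %sid, %taskid, %path) to the hook script as positional arguments. The slides use shell scripts whose contents are not shown, so the following is only a hedged Python sketch of how such a hook might unpack the five checkpoint arguments. The `CheckpointRequest` type, both function names and the image naming scheme are hypothetical, and no real checkpointer is called.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class CheckpointRequest:
    """Values substituted for %globid %jobid %sid %taskid %path, in that order."""
    globid: str
    jobid: str
    sid: str
    taskid: str
    path: str


def parse_checkpoint_args(argv):
    """Map the five positional arguments onto named fields, in config order."""
    if len(argv) != 5:
        raise ValueError("expected: globid jobid sid taskid path")
    return CheckpointRequest(*argv)


def checkpoint_file(req):
    """Hypothetical naming scheme (an assumption): one image per job and task under %path."""
    return str(PurePosixPath(req.path) / f"{req.jobid}.{req.taskid}.ckpt")
```

A real hook would go on to invoke the node's checkpointer with these values; that call is left out here because the slides do not show it.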

Broker – local RM connectivity

[Diagram: the proof-of-concept installation (Command Line Client, GridSphere interface and Migrating Desktop Client; GRMS; WS GRAM with the PBS JobManager; Torque/PBS Pro; checkpoint script; three Linux nodes with SGIckpt; NFS shared space).]

problem

The checkpointer: a service or resource?

<grmsJob appid="matrix_demo_submit">
  <task taskid="matrix" persistent="true" crucial="true">
    <resource>
      <localrmname>pbs</localrmname>
    </resource>
    <executable type="multiple" count="1">
      <execfile name="matrixi">
        <url>gsiftp://xxx.xxx.xxx.xxxl//home/user/povray</url>
      </execfile>
    </executable>
    <other>
      <grms_id>${JOB_ID}</grms_id>
      <checkpointable>true</checkpointable>
      <period>1</period>
    </other>
  </task>
</grmsJob>

job description with checkpointing
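The checkpointing request in the job description above is carried by the `<checkpointable>` and `<period>` elements inside `<other>`. As a rough illustration, these settings could be read back with Python's standard XML parser. The function name and the trimmed copy of the description are assumptions made for this sketch, and the unit of `<period>` is not stated in the slides.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the job description from the slide (executable section omitted).
JOB_XML = """<grmsJob appid="matrix_demo_submit">
  <task taskid="matrix" persistent="true" crucial="true">
    <resource>
      <localrmname>pbs</localrmname>
    </resource>
    <other>
      <grms_id>${JOB_ID}</grms_id>
      <checkpointable>true</checkpointable>
      <period>1</period>
    </other>
  </task>
</grmsJob>"""


def checkpoint_settings(job_xml):
    """Return {taskid: (checkpointable, period)} for every task in the description."""
    root = ET.fromstring(job_xml)
    settings = {}
    for task in root.findall("task"):
        flag = task.findtext("other/checkpointable", default="false")
        period_text = task.findtext("other/period")
        period = int(period_text) if period_text is not None else None
        settings[task.get("taskid")] = (flag.strip() == "true", period)
    return settings


print(checkpoint_settings(JOB_XML))
```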

the end-user point of view

manual scenario

[Diagram: the proof-of-concept installation (Command Line Client, GridSphere interface and Migrating Desktop Client; GRMS; WS GRAM with the PBS JobManager; Torque/PBS Pro; checkpoint script; three Linux nodes with SGIckpt; NFS shared space).]

manual scenario - restart

[Diagram: the proof-of-concept installation (Command Line Client, GridSphere interface and Migrating Desktop Client; GRMS; WS GRAM with the PBS JobManager; Torque/PBS Pro; checkpoint script; three Linux nodes with SGIckpt; NFS shared space), with an Application marked "Failure!".]

<grmsJob appid="matrix_demo_resume">
  <task taskid="matrix" persistent="true" crucial="true">
    <resource>
      <hostname>node-03.checkpointing.psnc.pl</hostname>
      <localrmname>pbs</localrmname>
    </resource>
    <executable type="multiple" count="1">
      <execfile name="matrix_long">
        <url>gsiftp://xxx.xxx.xxx.xxx//home/xxxxxx/test_apps/matrix_long</url>
      </execfile>
    </executable>
    <other>
      <grms_id>${JOB_ID}</grms_id>
      <recovery>true</recovery>
      <ckpt_id>1179315947518_matrix_demo_submit_0459</ckpt_id>
      <checkpointable>true</checkpointable>
      <period>1</period>
    </other>
  </task>
</grmsJob>
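The resume description above adds a `<hostname>` under `<resource>` and the `<recovery>` and `<ckpt_id>` elements under `<other>`. A minimal sketch of deriving such a description from a submit description follows. The helper name, its parameters and the trimmed input are assumptions for illustration, not part of GRMS, and the sketch assumes every task has `<resource>`, `<other>` and `<grms_id>` elements.

```python
import xml.etree.ElementTree as ET

# Trimmed submit description (executable section omitted for brevity).
SUBMIT_XML = """<grmsJob appid="matrix_demo_submit">
  <task taskid="matrix" persistent="true" crucial="true">
    <resource><localrmname>pbs</localrmname></resource>
    <other>
      <grms_id>${JOB_ID}</grms_id>
      <checkpointable>true</checkpointable>
      <period>1</period>
    </other>
  </task>
</grmsJob>"""


def make_resume_description(submit_xml, appid, ckpt_id, hostname=None):
    """Return a copy of the description that asks for recovery from ckpt_id."""
    root = ET.fromstring(submit_xml)
    root.set("appid", appid)
    for task in root.findall("task"):
        if hostname is not None:
            # Pin the restart to one node, as the slide's example does.
            host = ET.Element("hostname")
            host.text = hostname
            task.find("resource").insert(0, host)
        other = task.find("other")
        position = list(other).index(other.find("grms_id")) + 1
        # Inserting ckpt_id first and recovery second at the same index
        # yields the slide's order: grms_id, recovery, ckpt_id.
        for tag, text in (("ckpt_id", ckpt_id), ("recovery", "true")):
            element = ET.Element(tag)
            element.text = text
            other.insert(position, element)
    return ET.tostring(root, encoding="unicode")
```

For example, `make_resume_description(SUBMIT_XML, "matrix_demo_resume", "1179315947518_matrix_demo_submit_0459", "node-03.checkpointing.psnc.pl")` returns a description with the same recovery fields as the slide.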

failure – end-user view

problem

This semi-automatic solution is not optimal.

How can automatic job failure handling be introduced without adding new functionality to the Broker?

Use workflows!

the workflow

submit job description

job finished successfully?

- yes: send results to user

- no: submit „restart scenario” job

job finished successfully? (restart job)

- yes: send results to user

- no: return error description

Problem: using this broker we are not able to model loops
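Because the broker cannot model loops, the workflow above attempts the restart exactly once. The control flow can be sketched as straight-line Python with no loop. `JobResult`, `run_workflow` and the two callables are hypothetical stand-ins for the broker's submit operations, not GRMS functions.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class JobResult:
    succeeded: bool
    output: Optional[str] = None
    error: Optional[str] = None


def run_workflow(submit: Callable[[], JobResult],
                 submit_restart: Callable[[], JobResult]) -> str:
    """Submit the job; on failure submit the restart job once; then give up."""
    first = submit()
    if first.succeeded:
        return f"results: {first.output}"
    # Single, unrolled retry: the restart job resumes from the checkpoint.
    second = submit_restart()
    if second.succeeded:
        return f"results: {second.output}"
    return f"error: {second.error}"
```

Handling a second failure would need another explicitly unrolled branch, which is the limitation the slide points out.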

automatic scenario

[Diagram: the proof-of-concept installation (Command Line Client, GridSphere interface and Migrating Desktop Client; GRMS; WS GRAM with the PBS JobManager; Torque/PBS Pro; checkpoint script; three Linux nodes with SGIckpt; NFS shared space), with an Application marked "Failure!".]

end-user point of view

the benefits

user: more robust and fault-tolerant Grid environment

sysadmin: much easier system management thanks to the automatic checkpoint and recovery mechanism

Thank you!