Massively parallel implementation of Total-FETI DDM with application to medical image registration. Michal Merta, Alena Vašatová, Václav Hapla, David Horák. DD21, Rennes, France.


Post on 24-Feb-2016


Page 1: Massively parallel implementation of Total-FETI DDM with application to medical image registration

Massively parallel implementation of Total-FETI DDM with application to medical image registration

Michal Merta, Alena Vašatová, Václav Hapla, David Horák

DD21, Rennes, France

Page 2: Motivation

solution of large-scale scientific and engineering problems
  possibly hundreds of millions of DOFs
  linear problems
  non-linear problems
non-overlapping domain decomposition (FETI methods) with up to tens of thousands of subdomains
usage of PRACE Tier-1 and Tier-0 HPC systems

Page 3: PETSc (Portable, Extensible Toolkit for Scientific Computation)

developed by Argonne National Laboratory
data structures and routines for the scalable parallel solution of scientific applications modeled by PDEs
coded primarily in C, with good Fortran support; can also be called from C++ and Python codes
current version is 3.2 (www.mcs.anl.gov/petsc)
petsc-dev (development branch) is evolving intensively
code and mailing lists open to anybody

Page 4: PETSc components

[Figure: diagram of PETSc components, sequential / parallel]

Page 5: Trilinos

developed by Sandia National Laboratories
collection of relatively independent packages
toolkit for basic linear algebra operations, direct and iterative solvers for linear systems, PDE discretization utilities, mesh generation tools, etc.
object-oriented design, high modularity, use of modern C++ features (templating)
mainly in C++ (Fortran and Python bindings)
current version 10.10 (trilinos.sandia.gov)

Page 6: Trilinos components

[Figure: diagram of Trilinos components]

Page 7: Both PETSc and Trilinos…

are parallelized on the data level (vectors & matrices) using MPI
use BLAS and LAPACK, the de facto standard for dense LA
have their own implementation of sparse BLAS
include robust preconditioners, linear solvers (direct and iterative) and nonlinear solvers
can cooperate with many other external solvers and libraries (e.g. MATLAB, MUMPS, UMFPACK, …)
support CUDA and hybrid parallelization
are licensed as open-source

Page 8: Problem of elastostatics

$\Omega$ ... isotropic elastic body
$\Gamma_U$ ... boundary with prescribed displacements
$\Gamma_F$ ... boundary with prescribed surface traction
$f$ ... body loads

[Figure: elastic body with the boundaries $\Gamma_U$, $\Gamma_F$ and the body loads $f$]

Page 9: TFETI decomposition

[Figure: body decomposed into subdomains with artificial boundaries $\Gamma_G^{12}$, $\Gamma_G^{13}$, $\Gamma_G^{24}$, $\Gamma_G^{34}$]

$\Gamma_G^{pq}$ ... artificial boundaries between subdomains $p$ and $q$ with prescribed gluing conditions, enforced by Lagrange multipliers

Page 10: Primal discretized formulation

The FEM discretization with a suitable numbering of nodes results in the QP problem:

$$\min_{u} \ \tfrac{1}{2} u^{T} K u - f^{T} u \quad \text{s.t.} \quad Bu = c$$

$K = \operatorname{diag}(K_1, \ldots, K_{N_S}) \in \mathbb{R}^{n \times n}$ is a symmetric positive semidefinite (and so singular in general) block-diagonal global stiffness matrix
$K_s$ is the stiffness matrix of the $s$-th subdomain
$B \in \mathbb{R}^{m \times n}$ is a full rank constraint matrix, $c \in \mathbb{R}^{m}$ is the constraint RHS
$f \in \mathbb{R}^{n}$ is a load vector

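Page 11: Dual discretized formulation (homogenized)

$$\min_{\lambda} \ \tfrac{1}{2} \lambda^{T} A \lambda - \lambda^{T} b \quad \text{s.t.} \quad G\lambda = o$$

$K^{+}$ ... generalized inverse of $K$: $K K^{+} K = K$
$R$ ... $\operatorname{Im} R = \operatorname{Ker} K$
$F = B K^{+} B^{T}$, $G = R^{T} B^{T}$
$Q = G^{T} (G G^{T})^{-1} G$, $P = I - Q$ ($\operatorname{Im} Q = \operatorname{Ker} P$, $\operatorname{Im} P = \operatorname{Ker} G$)
$A = P F P$
$d = B K^{+} f - c$, $e = R^{T} f$
$\lambda_0 = G^{T} (G G^{T})^{-1} e$, $b = P (d - F \lambda_0)$

QP problem again, but with lower dimension and simpler constraints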
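The step from the primal QP to this dual can be sketched with the usual Lagrangian argument for FETI-type methods. This is a reconstruction under the assumptions that the columns of $R$ span $\operatorname{Ker} K$ and that $K^{+}$ is any generalized inverse of $K$; the coefficient vector $\alpha$ and the name $\mu$ for the homogenized multiplier (which the slide again denotes $\lambda$) are introduced here only for illustration.

```latex
\begin{align*}
L(u,\lambda) &= \tfrac12 u^{T} K u - f^{T} u + \lambda^{T}(Bu - c), \\
\nabla_u L = o \;&\Longrightarrow\; K u = f - B^{T}\lambda, \\
f - B^{T}\lambda \in \operatorname{Im} K
  \;&\Longleftrightarrow\; R^{T}(f - B^{T}\lambda) = o
  \;\Longleftrightarrow\; G\lambda = e, \\
u &= K^{+}(f - B^{T}\lambda) + R\alpha, \\
\text{dual:}\quad & \min_{\lambda}\ \tfrac12 \lambda^{T} F \lambda - \lambda^{T} d
  \quad \text{s.t.} \quad G\lambda = e, \qquad d = BK^{+}f - c, \\
\lambda = \mu + \lambda_0,\ G\lambda_0 = e \;&\Longrightarrow\;
  \min_{\mu}\ \tfrac12 \mu^{T} PFP\,\mu - \mu^{T} P(d - F\lambda_0)
  \quad \text{s.t.} \quad G\mu = o .
\end{align*}
```

Once $\lambda$ is known, substituting $u$ into $Bu = c$ gives $G^{T}\alpha = F\lambda - d$, so $\alpha = (GG^{T})^{-1} G (F\lambda - d)$ and the primal solution follows from the fourth line.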

Page 12: Primal data distribution, F action

… straightforward matrix distribution, given by a decomposition

[Figure: action $F\lambda = B K^{+} B^{T} \lambda$; $B$ is very sparse, $K^{+}$ is block diagonal, the action is embarrassingly parallel]
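Why the action of F parallelizes so easily can be seen in a minimal serial sketch: because K+ is block diagonal, F*lambda is a sum of independent per-subdomain products, and only the final sum needs communication. The function names, the dense list-of-lists storage, and the tiny two-subdomain data (two 1D spring elements, one Dirichlet row and one gluing row in B) are assumptions made for illustration; they are not the authors' PETSc or Trilinos code.

```python
def matvec(A, x):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def apply_F(B_blocks, Kplus_blocks, lam):
    """Return F*lam = sum over subdomains s of B_s Kplus_s B_s^T lam.

    B_blocks[s] holds the columns of B that belong to subdomain s
    (shape m x n_s); Kplus_blocks[s] is that subdomain's K+ (n_s x n_s).
    """
    m = len(lam)
    result = [0.0] * m
    for B_s, Kp_s in zip(B_blocks, Kplus_blocks):
        n_s = len(Kp_s)
        # local step 1: w = B_s^T lam
        w = [sum(B_s[i][j] * lam[i] for i in range(m)) for j in range(n_s)]
        # local step 2: v = Kplus_s w (in practice a local direct solve)
        v = matvec(Kp_s, w)
        # local step 3: add B_s v; in a distributed code this sum is the
        # only place where subdomains exchange data
        for i, val in enumerate(matvec(B_s, v)):
            result[i] += val
    return result


# Hypothetical data: two subdomains, each K_s = [[1, -1], [-1, 1]],
# whose Moore-Penrose inverse is 0.25 * K_s.
Kplus = [[0.25, -0.25], [-0.25, 0.25]]
B1 = [[1.0, 0.0],   # row 0: Dirichlet condition on the first DOF
      [0.0, 1.0]]   # row 1: gluing, +1 on the interface DOF of subdomain 1
B2 = [[0.0, 0.0],
      [-1.0, 0.0]]  # row 1: gluing, -1 on the interface DOF of subdomain 2

F_lam = apply_F([B1, B2], [Kplus, Kplus], [1.0, 2.0])
print(F_lam)
```

In an MPI setting each rank would own one (or a few) of the `B_s`, `Kplus_s` pairs, so steps 1 and 2 need no communication at all.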

Page 13: Coarse projector action

$Q = G^{T} (G G^{T})^{-1} G$, $P = I - Q$

… can easily take 85 % of computation time if not properly parallelized!

[Figure: schematic of the coarse projector action]

Page 14: G preprocessing and action

[Figure: two panels, "preprocessing" and "action"]

Page 15: Coarse problem preprocessing and action

[Figure: two panels, "preprocessing" and "action", with steps labelled 1, 2, 3]

Currently used variant: B2 (PPAM 2011)

Page 16: Coarse problem

Page 17: HECToR phase 3 (XE6)

the UK's largest, fastest and most powerful supercomputer
supplied by Cray Inc., operated by EPCC
uses the latest AMD "Bulldozer" multicore processor architecture
704 compute blades
each blade with 4 compute nodes, giving a total of 2816 compute nodes
each node with two 16-core AMD Opteron 2.3 GHz Interlagos processors → 32 cores per node
total of 90 112 cores
each 16-core processor shares 16 GB of memory, about 90 TB in total
theoretical peak performance over 800 Tflops
www.hector.ac.uk
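The machine totals follow from the per-blade and per-node figures by simple multiplication, which the short check below reproduces. The value of 4 double-precision floating-point operations per cycle per core is an assumption (it is not stated on the slide) used only to show that the quoted peak of "over 800 Tflops" is plausible.

```python
# Figures quoted for HECToR phase 3
blades = 704
nodes_per_blade = 4
processors_per_node = 2
cores_per_processor = 16
memory_per_processor_gb = 16
clock_hz = 2.3e9

# ASSUMPTION (not from the slide): 4 double-precision flops per cycle per core
flops_per_cycle_per_core = 4

nodes = blades * nodes_per_blade
processors = nodes * processors_per_node
cores = processors * cores_per_processor
memory_gb = processors * memory_per_processor_gb
peak_tflops = cores * clock_hz * flops_per_cycle_per_core / 1e12

print(f"nodes:       {nodes}")
print(f"cores:       {cores}")
print(f"memory [GB]: {memory_gb}")
print(f"peak [Tflops], under the stated assumption: {peak_tflops:.0f}")
```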

Page 18: Benchmark

$K^{+}$ implemented as direct solve (LU) of regularized $K$
built-in CG routine used (PETSc.KSP, Trilinos.Belos)
$E$ = 1e6, $\nu$ = 0.3, $g$ = 9.81 m·s⁻²
computed @ HECToR
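For orientation, the kind of iteration that such a built-in CG routine performs can be sketched in a few lines. This is a plain-Python illustration of unpreconditioned CG with a relative-residual stopping test, not the PETSc KSP or Trilinos Belos interface; the function names, the default tolerance, and the 2 x 2 test system are assumptions made for the example.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def cg(apply_A, b, rtol=1e-5, max_it=1000):
    """Unpreconditioned CG for A x = b with A symmetric positive definite.

    apply_A is a callable returning A*v, so A never has to be assembled.
    Stops when ||r_k|| / ||r_0|| < rtol. Returns (x, iterations).
    """
    x = [0.0] * len(b)
    r = list(b)                      # r_0 = b - A x_0 with x_0 = 0
    p = list(r)
    rr = dot(r, r)
    r0_norm = math.sqrt(rr)
    if r0_norm == 0.0:
        return x, 0
    for k in range(1, max_it + 1):
        Ap = apply_A(p)
        alpha = rr / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rr_new = dot(r, r)
        if math.sqrt(rr_new) / r0_norm < rtol:
            return x, k
        beta = rr_new / rr
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x, max_it


# Hypothetical 2 x 2 SPD test system; exact solution is (1/11, 7/11).
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x, its = cg(lambda v: [dot(row, v) for row in A], b)
print(x, its)
```

Passing the operator as a callable mirrors how a dual operator such as P F P would be used: only its action on a vector is needed, never the assembled matrix.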

Page 19: Results

# subds = # cores              1           4          16          64         256        1024
Prim. dim.                31 752     127 008     508 032   2 032 128   8 128 512  32 514 048
Dual dim.                    252       1 512       7 056      30 240     124 992     508 032
Solution time
  Trilinos                  1.39        3.01        4.80        6.25       10.31       28.05
  PETSc                     1.14        2.66        4.16        4.74        4.92        5.84
# iterations
  Trilinos                    34          63          96         105         105         102
  PETSc                       33          68          94         105         105         102
1 iter. time
  Trilinos               4.48e-2     4.76e-2     5.00e-2     5.95e-2     9.81e-2     2.75e-1
  PETSc                  3.46e-2     3.92e-2     4.42e-2     4.52e-2     4.69e-2     5.73e-2

stopping criterion: ||r_k|| / ||r_0|| < 1e-5, without preconditioning
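Since the primal dimension grows in proportion to the number of cores (31 752 unknowns per subdomain), the per-iteration times can be read as a weak-scaling experiment. The sketch below computes a weak-scaling efficiency, defined here as the 1-core iteration time divided by the N-core iteration time; that definition is a choice made for this illustration and is not a metric reported in the talk.

```python
cores = [1, 4, 16, 64, 256, 1024]

# Time of one iteration, copied from the results table
iter_time = {
    "Trilinos": [4.48e-2, 4.76e-2, 5.00e-2, 5.95e-2, 9.81e-2, 2.75e-1],
    "PETSc":    [3.46e-2, 3.92e-2, 4.42e-2, 4.52e-2, 4.69e-2, 5.73e-2],
}


def weak_scaling_efficiency(times):
    """Efficiency relative to the single-core run (1.0 = ideal)."""
    return [times[0] / t for t in times]


efficiency = {name: weak_scaling_efficiency(t) for name, t in iter_time.items()}

for name, eff in efficiency.items():
    row = "  ".join(f"{n}: {e:.2f}" for n, e in zip(cores, eff))
    print(f"{name:9s} {row}")
```

By this measure the PETSc variant keeps roughly 60 % efficiency at 1024 cores, while the Trilinos variant drops to roughly 16 %.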

Page 20: Application to image registration

Process of integrating information from two (or more) different images
Images from different sensors, different angles and/or times

Page 21: Application to image registration

In medicine:
  Monitoring of growth of a tumour
  Therapy evaluation
  Comparison of patient data with an anatomical atlas
  Data from magnetic resonance (MR), computed tomography (CT), positron emission tomography (PET)

Page 22: Elastic registration

The task is to minimize the distance between two images

$T \xrightarrow{\ \varphi\ } R$, where $\varphi(x) := x - u(x)$
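What "distance between two images" means under the transformation x - u(x) can be illustrated with a 1D toy example. The slides do not say which distance measure is used; the sum of squared differences (SSD) below is an assumption, as are the function names, the linear interpolation, the clamping at the image border, and the sample data. The elastic regularization term of the full method is omitted.

```python
def warp(T, u):
    """Sample the 1D image T at phi(x) = x - u(x) with linear interpolation.

    Positions that fall outside the image are clamped to its border.
    """
    n = len(T)
    warped = []
    for x in range(n):
        pos = min(max(x - u[x], 0.0), n - 1.0)
        i = min(int(pos), n - 2)      # left neighbour index
        t = pos - i                   # interpolation weight in [0, 1]
        warped.append((1.0 - t) * T[i] + t * T[i + 1])
    return warped


def ssd(R, T, u):
    """Sum-of-squared-differences distance between R and the warped T."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(warp(T, u), R))


# Hypothetical data: R is T shifted two pixels to the right.
T = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0]
R = [0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 1.0, 0.0]

no_displacement = [0.0] * len(T)
true_displacement = [2.0] * len(T)

print(ssd(R, T, no_displacement))    # distance before registration
print(ssd(R, T, true_displacement))  # distance with the correct displacement
```

A registration algorithm searches for the displacement field u that makes this distance small; in the elastic variant, u is additionally required to behave like the deformation of an elastic body, which is where the elasticity solver enters.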

Page 23: Elastic registration

Parallelization using TFETI method

Page 24: Results

# of subdomains              1         4        16
Primal variables         20402     81608    326432
Dual variables             903      2641      8254
Solution time [s]           41     34.54     57.44
# of iterations           2467       990       665
Time/iteration [s]        0.01      0.03      0.08

stopping criterion: ||r_k|| / ||r_0|| < 1e-5

Page 25: Solution

[Figure: solution of the elastic registration problem]

Page 26: Conclusion and future work

To consolidate the PETSc & Trilinos TFETI implementations into the form of extensions or packages
To further optimize the codes using core-hours on Tier-1/Tier-0 systems (PRACE DECI Initiative, HPC-Europa2)
To extend image registration to 3D data

Page 27: References

KOZUBEK, T. et al. Total FETI domain decomposition method and its massively parallel implementation. Accepted for publication in Advances in Engineering Software.
HORÁK, D.; HAPLA, V. TFETI coarse space projectors parallelization strategies. Accepted for publication in the proceedings of PPAM 2011, Springer LNCS, 2012.
ZITOVÁ, B.; FLUSSER, J. Image registration methods: a survey. Image and Vision Computing, Vol. 21, No. 11, 2003, pp. 977-1000.

Page 28: Thank you for your attention!