GPGPU Accelerated Cardiac Arrhythmia Simulations


GPGPU Accelerated Cardiac Arrhythmia Simulations

Wei Wang1, H. Howie Huang2, Matthew Kay2 and John Cavazos1

Abstract – Computational modeling of cardiac electrophysiology is a powerful tool for studying arrhythmia mechanisms. In particular, cardiac models are useful for gaining insights into experimental studies, and in the foreseeable future they will be used by clinicians to improve therapy for patients suffering from complex arrhythmias. Such models are highly intricate, both in their geometric structure and in the equations that represent myocyte electrophysiology. For these models to be useful in a clinical setting, cost-effective solutions for solving the models in real time must be developed. In this work, we hypothesized that low-cost GPGPU-based hardware systems can be used to accelerate arrhythmia simulations. We ported a two dimensional monodomain cardiac model and executed it on various GPGPU platforms. Electrical activity was simulated during point stimulation and rotor activity. Our GPGPU implementations provided significant speedups over the CPU implementation: 18X for point stimulation and 12X for rotor activity. We found that the number of threads that could be launched concurrently was a critical factor in optimizing the GPGPU implementations.

I. INTRODUCTION

Each year approximately 300,000 people in the US die suddenly of a cardiac arrhythmia. Patients with arrhythmias are typically treated with pharmaceutical and/or ablation therapies. Cardiac ablations are conducted using a catheter that delivers radio frequency energy to sites in the heart to kill tissue from which an arrhythmia originates. To identify these ablation sites, cardiac mapping systems calculate local depolarization times by recording extracellular potentials (electrograms) from many locations on the endocardial surface, which are rendered as an activation map to show the progression of electrical waves. With activation maps, arrhythmia pathways can be rendered on a realistic geometry of the surface of a patient's heart. Cardiologists use the maps to guide the ablation to interrupt the pathway and cure the arrhythmia. This type of image-guided therapy has cured many patients suffering from arrhythmias; however, arrhythmias with complex pathways often require multiple ablation procedures which, ultimately, may not be successful.

Computational cardiac modeling provides a powerful approach for improving the efficacy of cardiac ablation therapy. One strategy is to use the data provided by clinical imaging systems, such as electroanatomical mapping or high resolution MRI systems, to build a detailed numerical model of a patient's heart. A cardiologist could then test an ablation procedure using the model to determine if it would successfully interrupt an arrhythmia pathway to cure the arrhythmia. If the model does not indicate therapy success, then alternate ablation strategies could be explored using the model.

To accomplish this goal, the time required to solve cardiac models must be significantly reduced, especially if the model is to be solved multiple times to optimize an ablation strategy. This requirement demands vast computational resources that are currently only provided by supercomputers, which typically entail high cost and strict physical constraints (e.g., space and energy).

Recent developments in the field of high performance computing have leveraged the computational capabilities of General-Purpose Graphics Processing Units (GPGPUs), which have been extensively used in many research fields, e.g., bioinformatics, signal processing, astronomy, weather forecasting, and molecular modeling. For cardiac models, parallel computation of ionic currents at a large number of model nodes using GPGPUs should provide significant performance improvements over that of traditional processors (CPUs). Previous work showed that, running on an Xbox, a GPGPU implementation of a cardiac tissue model was twice as fast as a CPU, even for a small-scale model [1]. In this paper, we demonstrate that significant speed-ups can be achieved when a cluster of GPGPUs is used to solve a two dimensional monodomain cardiac action potential model.

1 W. Wang and J. Cavazos are with the Department of Computer and Information Sciences, University of Delaware, Newark DE 19716 USA (e-mail: [email protected]).
2 H. H. Huang and M. Kay are with the Department of Electrical and Computer Engineering, The George Washington University, Washington DC, 20052 USA (e-mail: [email protected]).

Figure 1. A: Cardiac muscle is modeled as a large geometrical network of nodes that are electronically coupled. B: The electrical potential of the cell membrane at each node is represented as a large set of differential equations. C: Numerical integration of the differential equations provides transmembrane voltage (action potentials) at each node. D: Spatiotemporal visualization of transmembrane voltage reveals electrophysiological mechanisms of arrhythmias (an electrical rotor is shown).

Figure 2. General code that is solved at each time step to compute transmembrane potential (Vm).

    for (Xstep = 1; Xstep < Nx+1; ++Xstep) {
      for (Ystep = 1; Ystep < Ny+1; ++Ystep) {
        stability();   // check for numerical stability
        stimulate();   // apply stimulating current, if necessary
        brgates();     // update ionic gating equations
        brcurrents();  // update ionic currents
        Vmdiff();      // update diffusion terms for next time step
      } // end Ystep loop
    } // end Xstep loop
    bcs(); // apply boundary conditions

II. BACKGROUND

A. GPGPUs

General-Purpose Graphics Processing Units (GPGPUs) achieve high performance through massively parallel processing on hundreds of computing cores. With the help of a parallel programming model, e.g., CUDA (Compute Unified Device Architecture), application developers can take advantage of CUDA-enabled GPUs that are available in desktop and notebook computers, professional workstations, and supercomputer clusters.

B. Cardiac Model

In our cardiac model, transmembrane potential (Vm) at each node in a rectilinear 2D grid was computed using a continuum approach with no-flux boundary conditions and finite difference integration, as we have previously described [2-3]. Although the 2D model is not clinically relevant, it allows us to quickly prototype different techniques that could then be applied to a clinically relevant 3D model. An overview of the model and representative results are shown in Figure 1. The general algorithm for the model is shown in Figure 2. The differential equations were solved independently on a matrix of Nx*Ny nodes at each time step. Therefore, within each time step there is no data inter-dependency, which fits the GPU architecture well – ample opportunities can be exploited for data-level parallelism.

Membrane ionic current kinetics (Iion, µA/cm2) are computed using the Drouhard-Roberge formulation of the inward sodium current (INa) [4] and the Beeler-Reuter formulations of the slow inward current (Is), time-independent potassium current (IK1), and time-activated outward current (Ix1) [5]. Fiber orientation was 33 degrees. The diffusion coefficient along fibers was 0.00076 cm2/msec and across fibers was 0.00038 cm2/msec. Point stimulation and electrical rotor activity were simulated. All simulations were checked for accuracy and numerical stability.
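To make the scheme above concrete, the following is a minimal sketch in plain C of one explicit finite-difference diffusion step on Vm with no-flux boundaries. This is an illustration, not the original code: diffusion is isotropic, the ionic-current functions and fiber anisotropy are omitted, and the grid size and function names are ours.

```c
#define NX 8
#define NY 8

/* One explicit diffusion step on Vm with no-flux boundaries.
 * D = diffusion coefficient, dt = time step, dx = node spacing.
 * A no-flux boundary is modeled by mirroring the edge value, so
 * the flux term (neighbor - V) vanishes at the border. */
static void diffusion_step(double Vm[NX][NY], double D, double dt, double dx) {
    double next[NX][NY];
    for (int x = 0; x < NX; ++x) {
        for (int y = 0; y < NY; ++y) {
            double up    = (y + 1 < NY) ? Vm[x][y + 1] : Vm[x][y];
            double down  = (y > 0)      ? Vm[x][y - 1] : Vm[x][y];
            double left  = (x > 0)      ? Vm[x - 1][y] : Vm[x][y];
            double right = (x + 1 < NX) ? Vm[x + 1][y] : Vm[x][y];
            /* 5-point Laplacian, forward Euler in time */
            next[x][y] = Vm[x][y]
                + D * dt / (dx * dx)
                * (up + down + left + right - 4.0 * Vm[x][y]);
        }
    }
    for (int x = 0; x < NX; ++x)
        for (int y = 0; y < NY; ++y)
            Vm[x][y] = next[x][y];
}
```

A full time step would follow this with the per-node ionic updates (brgates, brcurrents). With mirrored edge values the net flux across each boundary is zero, which is the discrete form of the no-flux condition, so the total of Vm over the grid is conserved by the diffusion step alone.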

III. GPGPU-BASED CARDIAC ARRHYTHMIA MODEL

As shown in Figure 2, the general algorithm loops through each node in a 2D grid (Xstep represents the coordinate in the X direction and Ystep the coordinate in the Y direction). Inside the loop, the same set of functions is solved at each node. The temporal loop is outside the nested spatial loops. Because of the sequential structure of the program, large spatial domains and/or long simulations require solution times that increase in proportion to the domain area.

Typical parallel implementations of N-dimensional cardiac models are relatively straightforward because, once the diffusion currents have been computed, there is no data dependency between neighboring nodes in the grid at any particular time step. Therefore, the differential equations that represent myocyte electrophysiology (the brgates and brcurrents functions) can be solved at each node in any sequence.
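This order-independence can be checked with a toy per-node update in plain C: because the update reads only the node's own state (the diffusion terms being fixed for the step), an ascending sweep and a descending sweep give identical results. The relaxation rule below is an illustrative stand-in for the Beeler-Reuter kinetics, not the actual equations.

```c
#define NN 16

/* Hypothetical per-node membrane update: relaxes Vm toward a
 * rest potential using only that node's own state, so nodes are
 * independent within a time step. */
static double node_update(double vm) {
    return vm + 0.1 * (-85.0 - vm);
}

/* Update all nodes in ascending or descending order. */
static void sweep(double Vm[NN], int ascending) {
    if (ascending)
        for (int i = 0; i < NN; ++i) Vm[i] = node_update(Vm[i]);
    else
        for (int i = NN - 1; i >= 0; --i) Vm[i] = node_update(Vm[i]);
}
```

Since the two sweep orders agree node for node, the nodes can equally well be updated concurrently, which is what the GPGPU implementation exploits.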

In our GPGPU implementation of the cardiac model, the basic idea is to eliminate the double inner loops (Xsteps and Ysteps), where the data parallelism resides and the speedup can be achieved. The outer time loop cannot be eliminated, because each time step depends on the previous one. The GPGPU model works in basically the same way as the CPU model but with larger "bandwidth": thanks to the hundreds of cores in a commodity GPGPU, the set of functions can be applied to different nodes in the grid in parallel. Note that GPGPU computing falls into the SIMD (Single Instruction, Multiple Data) category, which means that the threads execute the same instructions but the data processed by these threads belong to different nodes.
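The decomposition can be previewed on the CPU before looking at the CUDA configuration: suppose each launch covers W columns of the grid with one thread per node, and successive launches advance by W columns until the grid is covered. In this plain-C sketch the loop variables stand in for CUDA's blockIdx and threadIdx, and all sizes are hypothetical.

```c
#define NX 12
#define NY 4
#define W  3   /* columns handled per launch (hypothetical) */

/* Emulates one launch of W blocks * NY threads: block b handles
 * column (Xstep + b) and thread t handles row t of that column.
 * On the GPU the two loops run concurrently as blockIdx.x and
 * threadIdx.y; here they are ordinary loops. */
static void launch(int Xstep, int visited[NX][NY]) {
    for (int b = 0; b < W; ++b)          /* blockIdx.x */
        for (int t = 0; t < NY; ++t) {   /* threadIdx.y */
            int x = Xstep + b;
            if (x < NX)
                visited[x][t] += 1;      /* node processed once */
        }
}
```

Stepping the outer loop by W (for (Xstep = 0; Xstep < NX; Xstep += W)) covers every node exactly once, which is the invariant the thread assignment below must preserve.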

We have implemented a model running on GPGPUs that completely eliminates one loop (Xsteps or Ysteps) and reduces the number of iterations in the other. This is done by assigning a number of threads to execute the set of functions on the corresponding grid nodes. Theoretically, the relation between the threads and the nodes is a one-to-one mapping. However, because limited resources (e.g. registers) are available on a GPU card, a large number of threads can only handle a portion (say 30 columns) of the grid nodes. The same threads are used again to calculate another portion of the grid after finishing the previous part. It is easy to achieve this assignment of threads using CUDA directives. For example, for a 2D grid containing Nx by Ny nodes, i.e. Nx columns and Ny rows of nodes, we can compute W columns using W*Ny threads. The following CUDA code will accomplish this task.

    dim3 dimGrid(W, 1);       // assign W blocks of threads
    dim3 dimBlock(1, Ny, 1);  // assign Ny threads in each block

In the CUDA programming model [6], a GPU device is usually viewed as a grid containing a large number of equally-shaped blocks, into which the threads are grouped. For the above code, W*Ny threads are grouped into W blocks, each of which contains Ny threads. The parameters dimGrid and dimBlock define how the blocks (threads) align in the grid (block). Each thread block in the grid executes on a multiprocessor, and the threads in the block execute on multiple cores inside that multiprocessor. Such parallel execution of blocks of threads on W columns of data reduces the initial double loops shown in Figure 2 to one small loop as follows.

    for (Xstep = 1; Xstep < Nx+1; Xstep += W)

Because multiple threads concurrently execute the Vmdiff function, serialized execution is needed there: each node updates the diffusion terms of its neighbors in this function, so multiple threads would eventually be writing to the same memory location. Without serialization, this would potentially lead to data being overwritten and incoherence and, as a result, produce incorrect simulation results. To solve this problem, we utilized atomic add operations in the Vmdiff function. Note that normal arithmetic operations are used in the other functions, as they involve no updates to neighboring nodes. Both GPU cards we used in the experiments support the atomic add operation.
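The hazard is the scatter pattern itself: each node adds its contribution into its neighbors' diffusion accumulators, so two concurrent threads can target the same accumulator. The pattern is shown below in plain sequential C with hypothetical names and a 1D grid; on the GPU the "+=" on the shared accumulator is the read-modify-write that must become atomicAdd.

```c
#define N 5

/* Scatter-style diffusion accumulation: node i pushes
 * D*(V[i]-V[j]) into each neighbor j's accumulator. When nodes
 * are processed by concurrent threads, "+=" on diff[] is a
 * read-modify-write on shared data and must be atomic (CUDA
 * atomicAdd) to avoid lost updates; run sequentially here, it
 * is an ordinary add. */
static void accumulate_diffusion(const double V[N], double diff[N], double D) {
    for (int i = 0; i < N; ++i) {
        if (i > 0)     diff[i - 1] += D * (V[i] - V[i - 1]);
        if (i + 1 < N) diff[i + 1] += D * (V[i] - V[i + 1]);
    }
}
```

For a linear voltage profile the contributions from the two sides of each interior node cancel, which gives a quick sanity check on the accumulation.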

IV. RESULTS

A. Environment

We ran our GPGPU-based model on two different GPU cards, and the CPU-based model was run on the respective machines they reside on. The first machine had 4 Intel E5520 2.26GHz Quad Core (total 16 cores) CPUs, 8MB L1 cache and 16GB memory. It also had a Tesla C1060 GPU card, which contains 30 multiprocessors, each having 8 cores (total 240 cores), and 4GB global memory. The second machine had 2 Intel E5530 2.4GHz Quad Core (total 8 cores) CPUs, with 8MB L1 cache and 24GB memory, as well as a Fermi Tesla C2050 GPU card, which has 14 multiprocessors, each with 32 cores (total 448 cores), and 3GB global memory. The CUDA driver version was 3.1 on the Tesla C1060 and 3.2 on the Tesla C2050.

B. Scalability and Performance

We report the scalability and performance results for both point stimulation and electrical rotor activity on the two GPU cards. Both GPUs ran the same code. We use a well-developed CPU implementation [3] as the basis for comparison. While we didn't specifically optimize the CPU code, we feel that our current GPU implementations can be further improved by utilizing shared memory, grouping threads into more blocks, etc. Our GPU implementation is a straightforward port of the CPU implementation.

1. Point stimulation

Table 1 shows the input parameters used for the stimulation of different sizes as well as the runtimes on the two GPU cards. Nx (Ny) in the leftmost column specifies the number of nodes in the X (Y) dimension of a grid. The size of the grids we experimented with varied from 64*64 to 448*448.

Table 1: Point Stimulation Input Parameters and Runtime

    Nx (=Ny)  dx=dy (mm)  dt (msecs)  simulation steps  C1060 (secs)  C2050 (secs)
    64        0.3125      0.025       10,000            22            15
    128       0.15625     0.025       10,000            43            27
    256       0.078125    0.025       10,000            112           89
    320       0.0625      0.0125      20,000            302           259
    384       0.052083    0.001       250,000           4905          3763
    448       0.041322    0.001       250,000           6382          5483

In Table 1, dx and dt are the spatial and temporal parameters of the cardiac model, respectively. The simulated length in the x-dimension is Lx (2cm). "dx", which is equal to dy, is calculated as Lx/Nx, and "simulation steps" is calculated as total/dt (total means the overall simulated time, 250 msecs). We fed the same input to the model running on the CPU and the two GPUs and recorded the execution times for comparison. The last two columns in the table are the runtimes in seconds on the Tesla C1060 GPU and the Tesla C2050 GPU.

In terms of correctness, the model outputs are identical between the original CPU (sequential) implementation and the GPGPU (parallel) implementation. The voltage curve for each node is similar to the one in Figure 1-C. The picture at the top of Figure 3 is a snapshot of the transmembrane potential for sampled nodes in the 2D grid; the grid size is 384*384. The sample interval is one node in both directions of the grid, so the coordinates are less than 192. The bottom graph shows the voltage curve of a node at the final stage of the simulation. The X axis shows that the total simulated time (in msec) is 250 and the Y axis shows the values of the potential voltage (in mV) during the simulation.

Figure 3. GPGPU implementation: results of the model for point stimulation. Top: an image of transmembrane potential. Bottom: transmembrane potential for one node.

Figure 4. Speedups from Tesla C1060 and Tesla C2050 given different input sizes.

Figure 4 shows the speedups we achieved on both the Tesla C1060 and the Tesla C2050 over the sequential version. The X axis is the grid size from 64*64 to 448*448; the Y axis is the relative speedup value. One can see that both cards can yield more than 16X speedups running on large input sizes like 384*384 and 448*448. While the sequential model takes one or two days to finish, both GPU models finish in one or two hours. The larger the input size, the larger the speedups we can get from both GPU cards. Note that, as shown in the last two columns of Table 1, the Fermi-architecture Tesla C2050 runs faster than the Tesla C1060 for all input sizes.

We stop at the grid size of 448*448 because of the limited number of threads that could be launched in a thread block, given the limited resources (e.g. registers) on the GPU cards. For the 448*448 case, we were able to assign 30 blocks of threads to run the code, and each of the 30 blocks contained 448 threads. Our future work will include scaling our model to run with even larger grid sizes.

2. Electrical rotor

For the electrical rotor activity, we ran our model with 2D grid sizes (Nx*Ny) ranging from 256*256 to 448*448. Table 2 displays the input parameters and the run times on the two GPUs for these sizes. The input parameters have the same meaning as in Table 1. Here 500 msecs is the simulated time and all the dt values are 0.025, so the simulations all took 20,000 steps. Lx, the simulated length, is 10cm.

Table 2: Electrical Rotor Simulation Parameters and Runtime

    Nx (=Ny)  dx=dy (mm)  dt (msecs)  simulation steps  C1060 (secs)  C2050 (secs)
    256       0.390625    0.025       20,000            223           173
    320       0.3125      0.025       20,000            332           264
    384       0.260417    0.025       20,000            441           325
    448       0.2232      0.025       20,000            718           465

We again got identical outputs between the CPU and GPU implementations. The final voltage values for all nodes at the end of the simulation are represented graphically at the top of Figure 5, and the voltage curve for one node during the simulation is shown at the bottom of Figure 5. The grid size in the top figure is 384*384: voltage values for all nodes in the grid are displayed. In the bottom figure, the X axis is the time (in msecs) and the Y axis is the voltage (in mV); 500 msec is the last millisecond of the simulation. We have similar curves and graphs for the other grid sizes as well.

Figure 5. GPGPU implementation: results of the model for reentrant activity. Top: an image of transmembrane potential. Bottom: transmembrane potential for one node.

Figure 6 shows the GPU speedups over the CPU from running various problem sizes on the Tesla C1060 and the Tesla C2050. The X axis represents the grid size and the Y axis shows the speedup value. The two cards yield at least 10X speedups for all the input sizes; the program running on a GPU would finish in minutes compared to hours on the CPU. Also, the general trend is that bigger input sizes lead to better speedups. For example, the Tesla C2050 ran 14.5, 15, 17.4 and 16.5 times faster than the CPU with grid sizes of 256*256, 320*320, 384*384 and 448*448, respectively. We can also see from the rightmost columns of Table 2 that the Tesla C2050 GPU outperformed the Tesla C1060 by as much as 50%.

Figure 6. Speedups from the Tesla C1060 and the Tesla C2050 on electrical rotor simulation for different sizes.

V. CONCLUSIONS

In this paper, we ported an existing cardiac arrhythmia model to GPGPUs and significantly reduced the running time. We ran our GPGPU-based simulation on the Tesla C1060 and Tesla C2050 and compared the results to the CPU implementation. We found that the outputs were identical and that the speedups could reach as high as 18X upon point stimulation and 12X on electrical rotor activity. We believe that computational modeling of cardiac electrophysiology can benefit from running on GPGPUs, which are a cost-effective tool for a clinical setting. In the future, we plan to extend our work to run 3D models with realistic geometries.

ACKNOWLEDGMENT

This work was in part supported by a grant from the Institute of Biomedical Engineering at the George Washington University.

REFERENCES

[1] Scarle S. Implications of the Turing Completeness of Reaction-Diffusion Models, Informed by GPGPU Simulations on an Xbox 360: Cardiac Arrhythmias, Re-Entry and the Halting Problem. Computational Biology and Chemistry. 2009;33(4):253-260.
[2] Agladze K, Kay MW, Krinsky V, Sarvazyan N. Interaction between spiral and paced waves in cardiac tissue. Am J Physiol Heart Circ Physiol. 2007;293:H503-13.
[3] Kay MW, Gray RA. Measuring curvature and velocity vector fields for waves of cardiac excitation in 2-D media. IEEE Trans Biomed Eng. 2005;52:50-63.
[4] Drouhard JP, Roberge FA. Revised formulation of the Hodgkin-Huxley representation of the sodium current in cardiac cells. Comput Biomed Res. 1987;20:333-50.
[5] Beeler GW, Reuter H. Reconstruction of the action potential of ventricular myocardial fibres. J Physiol. 1977;268:177-210.
[6] NVIDIA CUDA C Programming Guide. http://www.nvidia.com.
[7] Xu L, Taufer M, Collins S, Vlachos D. Parallelization of Tau-Leap Coarse-Grained Monte Carlo Simulations on GPUs. In Proceedings of the 24th IEEE International Parallel & Distributed Processing Symposium (IPDPS), April 2010, Atlanta, Georgia, USA.
[8] Lionetti FV, McCulloch AD, Baden SB. Source-to-source optimization of CUDA C for GPU Accelerated Cardiac Cell Modeling. In Proceedings of the 16th International Euro-Par Conference on Parallel Processing: Part I (EuroPar'10), Pasqua D'Ambra, Mario Guarracino, and Domenico Talia (Eds.). Springer-Verlag, Berlin, Heidelberg, 38-49.