

Blue Matter: Parallel Biomolecular Simulation for Blue Gene

Blake G. Fitch

Biomolecular Dynamics and Scalable Modeling
Thomas J. Watson Research Center
IBM, Yorktown Heights, NY 10598

http://www.research.ibm.com/bluegene/

October 14, 2003


Outline

• Introduction

• Parallelization of molecular dynamics

• Blue Matter overview

• Decomposition explorations and crude performance models

• 3D-FFT for BG/L (early measurements on hardware)

• Status


Current Status

• The Blue Matter molecular simulation application has run and regressed on early BG/L hardware (leverages extensive testing infrastructure)

• Developed an implementation of a “volumetric” decomposition of the 3D-FFT as an alternative to the typical “slab” decomposition, to enable scaling to large numbers of nodes. The implementation can leverage any uniprocessor 1D-FFT implementation.

• Early results from the MPI version of the 3D-FFT on a large SP (Power4) cluster show scaling superior to that of FFTW on the same platform

• MPI-based implementation of the 3D-FFT and Blue Matter now running on the production software stack (up to 512-node BG/L)

• Active message-based implementation of the 3D-FFT and Blue Matter now running on multi-chip BG/L hardware using both cores

• Tuning and experimentation are just beginning


Blue Gene: A spectrum of possible projects

• cover a range of system sizes, topological complexity

• cover a broad range of scientific questions and impact areas:

– thermodynamics

– folding kinetics

– membranes and membrane-bound systems

– folding-related disease (CF, Alzheimer’s, BSE)

• improve our understanding not just of protein folding but of protein function

[Figure: example target systems spanning the spectrum, labeled 1LE1, 1L2Y, 1EOM, 1EIO, 1JRJ, CFTR, GPCR in membrane, 5CRO, 1POU, 1ENH]


Design Points

• system size 10K-100K atoms with focus on 30K atoms

• biomolecules

• explicit treatment of solvent

• extensive infrastructure for validation and binary regression

• 1K-100K nodes

• modular architecture:

– cellular MPP for computational core

– large AIX/SMP for analysis

– DB2 for experimental metadata

• separation of parallel programming complexity from molecular modeling complexity


What Limits the Scalability of MD?

• Inherent limitations on concurrency:

  – Bonded force evaluation?

    ∗ Represents only a small fraction of the computation; can be distributed moderately well.

  – Real-space non-bond force evaluation?

    ∗ Large fraction of the computation, but good distribution can be achieved using volume or interaction decompositions.

  – Reciprocal-space contribution to force evaluation?

    ∗ Most prevalent methods (particle-mesh) rely on a convolution evaluated with a 3D-FFT, and the FFT involves global communication.

• Load balancing.

• System software overheads (particularly communications software).

(A crude per-time-step cost model is sketched after this list.)
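The "crude performance models" mentioned in the outline are not reproduced on these slides. As a hedged stand-in for the position-globalization option, consistent with the contribution breakdown plotted later (total, Verlet, globalization, FFT, real-space non-bond), the per-time-step cost on p nodes for an N-atom system with an M^3 P3ME mesh can be written as

\[ T_{\mathrm{step}}(p) \;\approx\; \frac{T_{\mathrm{local}}(N)}{p} \;+\; T_{\mathrm{FFT}}(M^{3},\,p) \;+\; T_{\mathrm{glob}}(N,\,p) \]

where T_local collects the bonded, real-space non-bond, and Verlet work on locally owned atoms and scales with node count, T_glob is the position-globalization traffic whose per-node volume is proportional to N independent of p, and T_FFT eventually becomes dominated by hardware latency as messages shrink, which is the limit named on the conclusions slide.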


Real Space Domain Decompositions

• atom

  – commensurate with propagation of dynamics

  – limited ability to distribute work

  – potential to generate strange (global) communications patterns

• volume (see the cell-assignment sketch after this list)

  – commensurate with propagation of dynamics, a 3D mesh machine, and limited-range forces

  – load balancing issues

  – “strict” volume decomposition may conflict with efficient implementation of rigid subunits

• interaction

  – good load-balancing characteristics

  – implementation (Plimpton) available with efficient use of a 2D mesh topology

  – incommensurate with propagation of dynamics
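To make the volume option concrete, here is a minimal sketch, assuming an orthorhombic box, one rectangular cell per node of a 3D process mesh, and hypothetical sizes; this is illustrative only, not Blue Matter's actual assignment code.

// Minimal sketch of a "volume" (cell) decomposition: each node of a
// px x py x pz process mesh owns one rectangular sub-volume of the simulation
// box and the atoms currently inside it.  Hypothetical layout and sizes.
#include <cstdio>
#include <vector>

struct Atom { double x, y, z; };

// Rank of the node whose sub-volume contains atom a, assuming an orthorhombic
// box of extent (Lx, Ly, Lz) with its origin at (0, 0, 0).
int OwnerRank(const Atom& a, double Lx, double Ly, double Lz,
              int px, int py, int pz) {
  int ix = static_cast<int>(a.x / Lx * px) % px;
  int iy = static_cast<int>(a.y / Ly * py) % py;
  int iz = static_cast<int>(a.z / Lz * pz) % pz;
  return (ix * py + iy) * pz + iz;  // row-major rank in the 3D mesh
}

int main() {
  const double Lx = 64.0, Ly = 64.0, Lz = 64.0;  // box edge lengths (hypothetical)
  const int px = 8, py = 8, pz = 8;              // 512-node mesh
  std::vector<Atom> atoms = {{1.5, 2.0, 3.0}, {33.0, 40.0, 10.0}};

  // Bin atoms by owning rank; a real code would keep only the local bin plus a
  // halo of neighbor atoms within the cutoff radius.
  std::vector<std::vector<Atom>> bins(px * py * pz);
  for (const Atom& a : atoms)
    bins[OwnerRank(a, Lx, Ly, Lz, px, py, pz)].push_back(a);

  std::printf("atoms on rank 297: %zu\n", bins[297].size());
  return 0;
}

Because the atom count per cell varies with the local density of the system, a decomposition like this needs the load balancing the slide mentions, and rigid subunits that straddle cell boundaries complicate a "strict" version of it.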


What Is Blue Matter?

• Application platform for the Blue Gene Science program

• Prototyping platform for exploration of application frameworks suitable for cellular-architecture machines

• Blue Matter comprises all of the necessary application components, both those that run on the computational core and those that run on the host

For more details, see: Fitch et al., “Blue Matter, an Application Framework for Molecular Simulation on Blue Gene,” Journal of Parallel and Distributed Computing, Vol. 63, No. 7-8 (July-August 2003), pp. 759-773.


Blue Matter Overview

• Separate the MD program into multiple subpackages (offload function to the host where possible):

– MD core engine (massively parallel, minimal in size)

– Setup programs to set up force field assignments, etc.

– Monitoring and analysis tools to analyze MD trajectories, etc.

• Multiple Force Field Support

– CHARMM force field (done)

– OPLS-AA force field (done)

– AMBER force field (done)

– GROMOS force field (in progress)

– Polarizable Force Field (planned)


Application Overview

[Diagram: the MD kernel runs on the Blue Gene HW/OS platform; setup and post-processing/analysis run on Unix clusters; monitoring and control link the kernel to the host side; the experimental record is stored off-line.]


Blue Matter Dataflow

[Diagram: Blue Matter runtime dataflow. Components: user interface; molecular system setup (CHARMM, IMPACT, AMBER, etc.); XML filter producing the Blue Matter molecular system spec; problem-specification DB2 database with the MD UDF registry and a hardware platform description; generated RTP, DVS, and MSD.cpp files (the latter built by the C++ compiler); the Blue Matter start process and the parallel application kernel on BG/W; a UDP/IP-based raw data management system carrying reliable datagrams; restart (checkpoint) process; online analysis/monitoring/visualization; DVS filter feeding offline analysis; long-term archival storage system; and regression test driver scripts.]


Early Options for Parallel Decomposition Targeting BG/L

• Global force reduction — replication of dynamics propagation

• Globalize positions — double computation of forces, P3ME issues (communication kernels for these first two options are sketched after this list)

• Globalize positions with nearest-neighbor force reduction with approximate volume decomposition — supports P3ME

• Globalize positions with near-neighbor (cutoff radius) force reduction with approximate volume decomposition — supports P3ME, avoids double computation of forces
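For the first two options the per-step communication reduces to a single collective. The sketch below uses MPI as a stand-in (the BG/L implementations also exploit the global tree network and active packets) and assumes every rank owns an equal, contiguous block of atoms; it is not the Blue Matter kernel.

#include <mpi.h>
#include <vector>

// Option 1, "global force reduction": dynamics propagation is replicated, so
// every rank already holds all 3*nAtoms positions.  Each rank evaluates a
// disjoint subset of the force terms into f_partial; one all-reduce then gives
// every rank the total force on every atom, and each rank integrates all atoms.
void ReduceForces(std::vector<double>& f_partial, std::vector<double>& f_total,
                  MPI_Comm comm) {
  MPI_Allreduce(f_partial.data(), f_total.data(),
                static_cast<int>(f_partial.size()), MPI_DOUBLE, MPI_SUM, comm);
}

// Option 2, "globalize positions": each rank propagates only the atoms it owns
// (a contiguous block of nLocal atoms assumed here) and re-broadcasts their
// positions every step; each rank then computes all forces acting on its own
// atoms, which is why pair forces end up being computed twice.
void GlobalizePositions(const std::vector<double>& x_local,  // 3*nLocal doubles
                        std::vector<double>& x_global,       // 3*nAtoms doubles
                        int nLocal, MPI_Comm comm) {
  MPI_Allgather(x_local.data(), 3 * nLocal, MPI_DOUBLE,
                x_global.data(), 3 * nLocal, MPI_DOUBLE, comm);
}

In both cases the per-node message volume is proportional to the total atom count rather than to the atoms per node, which is why the conclusions slide calls the communication time not scalable even though it remains practical for a reasonable range of node counts.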


Molecular Dynamics, 30,000 Atoms

[Plot: MD time-step (seconds, 0.01 to 10, log scale) versus node count (1 to 100,000) for the position-globalization and global-force-reduction decompositions.]


Contributions to MD Time Step (Position Globalization)

[Plot: contributions to time-step duration (cycles, 1 to 10^9) versus node count (1 to 100,000) for the position-globalization decomposition: total, Verlet, globalization, FFT, and real-space non-bond.]


3D-FFT for BG/L

• Requirement: a scalable 3D FFT for meshes ranging from 32^3 to 256^3, as part of the particle-mesh molecular dynamics methods commonly used in biomolecular simulations.

• Design goal: a "volumetric" 3D FFT decomposition with strong-scaling characteristics for mesh sizes of interest (as an alternative to "slab"-based 3D FFT decomposition). (A serial sketch of the pass structure follows below.)

• Sizing: paper-and-pencil sizing indicates that mesh sizes of interest will scale to thousands of nodes.

• Prototyping: "Active Packet" and MPI versions of the volumetric 3D FFT have been implemented.

• Results: the MPI version shows scaling on SP (Power4) superior to that of FFTW; the MPI version on BG/L has run on up to 512 nodes; the Active Packet version is running in the BG/L bringup/test environment and scales well to 32 BG/L nodes.

Eleftheriou et al., “A Volumetric FFT for Blue Gene/L,” to appear in the Proceedings of HiPC 2003.
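The sketch below shows, in serial form, the three-pass structure that the volumetric decomposition parallelizes. Here fft1d stands in for any uniprocessor 1D FFT (per the point above), the array layout (z fastest) is an assumption, and the data reorganizations between passes are what become the row and plane all-to-all exchanges on the BG/L process mesh.

#include <complex>
#include <vector>

using cplx = std::complex<double>;

// Stand-in for any uniprocessor 1D FFT (e.g. the naive radix-2 kernel sketched
// later, or FFTW's 1D transform); assumed, not part of this sketch.
void fft1d(cplx* line, int n);

// Serial reference for an N^3 mesh: 1D FFTs along z, then y, then x.  In the
// parallel volumetric version each node owns an (N/px) x (N/py) x (N/pz) brick
// and the reorganizations between passes become all-to-all exchanges within
// rows and planes of the px x py x pz process mesh.
void Fft3d(std::vector<cplx>& a, int N) {
  std::vector<cplx> line(N);
  auto idx = [N](int x, int y, int z) {
    return (static_cast<size_t>(x) * N + y) * N + z;   // z is fastest-varying
  };

  // Pass 1: transform along z (contiguous pencils in this layout).
  for (int x = 0; x < N; ++x)
    for (int y = 0; y < N; ++y)
      fft1d(&a[idx(x, y, 0)], N);

  // Pass 2: transform along y (gather each strided pencil into a buffer).
  for (int x = 0; x < N; ++x)
    for (int z = 0; z < N; ++z) {
      for (int y = 0; y < N; ++y) line[y] = a[idx(x, y, z)];
      fft1d(line.data(), N);
      for (int y = 0; y < N; ++y) a[idx(x, y, z)] = line[y];
    }

  // Pass 3: transform along x.
  for (int y = 0; y < N; ++y)
    for (int z = 0; z < N; ++z) {
      for (int x = 0; x < N; ++x) line[x] = a[idx(x, y, z)];
      fft1d(line.data(), N);
      for (int x = 0; x < N; ++x) a[idx(x, y, z)] = line[x];
    }
}

A slab decomposition assigns whole planes to tasks and therefore cannot use more than N tasks for an N^3 mesh; avoiding that ceiling is what motivates the volumetric alternative for scaling to thousands of nodes.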


Volumetric 3D FFT on SP (Power4) and BG/L

[Plot: 3D-FFT time (seconds, 0.001 to 10) versus task count (1 to 1000). Curves: 64^3 and 128^3 FFTBGMPI on BG/L (-qarch=440, with tjcw 1D FFT); 64^3 and 128^3 packet layer (-qarch=440d, 444 MHz); 128^3 FFTBG on Power4.]


1D-FFT on BG/L (3D-FFT Building Block)

• Naive 1D-FFT implemented for 64 and 128 points (2-element vector implementation to take advantage of hummer2)

• Using the IBM xlC compiler on Power3 (-O3 -qarch=pwr3) and on BG/L hummer2 (-O3 -qarch=440d):

  – 6643 cycles per 128-point FFT pair on Power3 (34% of peak)

  – 7346 cycles per 128-point FFT pair on BG/L (30% of peak)

• Better performance should be expected from more sophisticated 1D-FFT implementations (such as the work being done at the Technical University of Vienna); a scalar sketch of the naive kernel follows below
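For orientation, a scalar, unoptimized radix-2 kernel of the kind the slide calls "naive" could look like the following. This is an illustrative sketch, not the 2-element vector (hummer2) code whose cycle counts are quoted above, and it matches the fft1d building block assumed in the earlier 3D sketch.

// Naive scalar radix-2 forward FFT for power-of-two n (illustrative only).
#include <cmath>
#include <complex>
#include <utility>

void fft1d(std::complex<double>* a, int n) {
  // Bit-reversal permutation.
  for (int i = 1, j = 0; i < n; ++i) {
    int bit = n >> 1;
    for (; j & bit; bit >>= 1) j ^= bit;
    j ^= bit;
    if (i < j) std::swap(a[i], a[j]);
  }
  // Butterfly passes over sub-transforms of length 2, 4, ..., n.
  const double pi = std::acos(-1.0);
  for (int len = 2; len <= n; len <<= 1) {
    std::complex<double> wlen = std::polar(1.0, -2.0 * pi / len);
    for (int i = 0; i < n; i += len) {
      std::complex<double> w(1.0, 0.0);
      for (int k = 0; k < len / 2; ++k) {
        std::complex<double> u = a[i + k];
        std::complex<double> v = a[i + k + len / 2] * w;
        a[i + k] = u + v;
        a[i + k + len / 2] = u - v;
        w *= wlen;
      }
    }
  }
}

Calling it as fft1d(line, 64) or fft1d(line, 128) covers the two sizes measured above; the measured kernels differ mainly in being written as 2-element vector code for hummer2.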


BG/L Dual Core 3D FFT Flowchart

[Flowchart: timeline of one 3D FFT on a node's two cores (Core 0 and Core 1). Along the time axis each core interleaves three kinds of work. Red: compute the core's half of the node's allocation of 1D FFTs in the current phase. Green: active-packet sends for the row or plane all-to-all. Blue: active-packet receive and dispatch (from the BG/L FIFOs and an internal queue).]


Measured 3D FFT Speedups on BG/L and SP (Power4)

[Plot: speedup versus task count (1 to 1000), with the ideal line. Curves: 64^3 FFTBGMPI on BG/L (-qarch=440, 500 MHz); 128^3 FFTBG on Power4; 64^3 packet layer (-qarch=440d, 444 MHz); 128^3 packet layer (-qarch=440d, 444 MHz).]


32 Node Application Tracing

[Trace on 32 nodes: Z-row all-to-all, 10.7 ms; Y-row all-to-all, 15 ms; 1D-FFTs along Z, 3.45 ms; planar all-to-all, 15 ms; planar all-to-all, 15 ms; 1D-FFTs along X, 3.45 ms; 1D-FFTs along Y, 3.45 ms.]


Regression/Validation

• External regression of intramolecular energies for multiple force fields via the “Brooks suite”

• The main check on validity is the quadratic dependence of the conserved-quantity RMSD on the timestep, over multiple short runs for every combination of features (the expected scaling is written out below)

  – Necessary but often not sufficient

  – Agreement of P3ME with Ewald in the limit of small grid spacing

  – Good temperature and pressure behavior

• Testing “corner” cases (e.g. imaging)

• Binary regression capability is needed due to the experimental hardware/software environment
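The quadratic check reflects the second-order accuracy of the Verlet-type integrator: assuming a symplectic, second-order scheme, the fluctuation of the conserved quantity E should obey, to leading order,

\[ \mathrm{RMSD}\bigl(E(\Delta t)\bigr) \;\approx\; C\,\Delta t^{2} \quad\Longrightarrow\quad \log \mathrm{RMSD}(E) \;\approx\; \log C \;+\; 2\,\log \Delta t \]

so fitting the multiple short runs on a log-log plot should give a slope close to 2, and a deviation flags an error in some force term or integration path for that combination of features.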


4 ns NVE Lipid Run using Blue Matter

[Plot titled "Total Energy for Fully Hydrated SOPE": total energy (kcal/mol, approximately -5548 to -5532) versus simulation time (0 to 4 ns).]


Conclusions from Simple Models

• Raw hardware “speeds and feeds” allow the use of standard MD techniques (P3ME) at node counts in excess of 1000 for problem sizes of interest to the Blue Gene project.

  – No extraordinary algorithmic methods have to be employed based on hardware considerations.

• “Naive” parallelization strategies leveraging the global tree network are practical for a reasonable range of node counts, even though the communication time is not scalable.

• Scalability is limited by the 3D-FFT: hardware latencies dominate at large node counts on fixed-size problems, as message sizes decrease and the number of messages sent per node increases.


Current Status

• The Blue Matter molecular simulation application has run and regressed on early BG/L hardware (leverages extensive testing infrastructure)

• Developed an implementation of a “volumetric” decomposition of the 3D-FFT as an alternative to the typical “slab” decomposition, to enable scaling to large numbers of nodes. The implementation can leverage any uniprocessor 1D-FFT implementation.

• Early results from the MPI version of the 3D-FFT on a large SP (Power4) cluster show scaling superior to that of FFTW on the same platform

• MPI-based implementation of the 3D-FFT and Blue Matter now running on the production software stack (up to 512-node BG/L)

• Active message-based implementation of the 3D-FFT and Blue Matter now running on multi-chip BG/L hardware using both cores

• Tuning and experimentation are just beginning


Acknowledgements

Blue Gene Application/Science team:
Alex Rayshubskiy, Blake Fitch, Maria Eleftheriou, Jed Pitera, Mike Pitman, Frank Suits, Bill Swope, Chris Ward, Yuri Zhestkov, Ruhong Zhou

Blue Gene Hardware and System Software teams, particularly:
Mark Giampapa, Alan Gara, Philip Heidelberger, Burkhart Steinmacher-Burow, Mark Mendell

Collaborators:
Bruce Berne (Columbia), Scott Feller (Wabash), Klaus Gawrisch (NIH), Vijay Pande (Stanford)