© 2000-2003 IBM Corporation
Blue Matter: Parallel Biomolecular Simulation for Blue Gene
Blake G. Fitch
Biomolecular Dynamics and Scalable Modeling
Thomas J. Watson Research Center, IBM, Yorktown Heights, NY 10598
http://www.research.ibm.com/bluegene/
October 14, 2003
Outline
• Introduction
• Parallelization of molecular dynamics
• Blue Matter overview
• Decomposition explorations and crude performance models
• 3D-FFT for BG/L (early measurements on hardware)
• Status
Current Status
• The Blue Matter molecular simulation application has run and regressed on early BG/L hardware (leveraging an extensive testing infrastructure)
• Developed a “volumetric” decomposition of the 3D-FFT as an alternative to the typical “slab” decomposition, enabling scaling to large numbers of nodes. The implementation can leverage any uniprocessor 1D-FFT implementation.
• Early results from the MPI version of the 3D-FFT on a large SP (Power4) cluster show scaling superior to that of FFTW on the same platform
• MPI-based implementations of the 3D-FFT and Blue Matter now run on the production software stack (up to 512-node BG/L)
• Active-message-based implementations of the 3D-FFT and Blue Matter now run on multi-chip BG/L hardware using both cores
• Tuning and experimentation are just beginning
Blue Gene: A spectrum of possible projects
[Figure: example structures — 1LE1, 1L2Y, 1EOM, 1EIO]
• cover a range of system sizes, topological complexity
• cover a broad range of scientific questions and impact areas:
– thermodynamics
– folding kinetics
– membranes and membrane-bound systems
– folding-related disease (CF, Alzheimer’s, BSE)
• improve our understanding not just of protein folding but protein function
[Figure: further examples — 1JRJ, CFTR, GPCR in membrane, 5CRO, 1POU, 1ENH]
Design Points
• system size 10K-100K atoms with focus on 30K atoms
• biomolecules
• explicit treatment of solvent
• extensive infrastructure for validation and binary regression
• 1K-100K nodes
• modular architecture:
– cellular MPP for computational core
– large AIX/SMP for analysis
– DB2 for experimental metadata
• separation of parallel programming complexity from molecular modeling complexity
What Limits the Scalability of MD?
• Inherent limitations on concurrency:
– Bonded force evaluation?
∗ Represents only a small fraction of the computation; can be distributed moderately well.
– Real space non-bond force evaluation?
∗ Large fraction of the computation, but good distribution can be achieved using volume or interaction decompositions.
– Reciprocal space contribution to force evaluation?
∗ The most prevalent methods (particle-mesh) rely on a convolution evaluated using a 3D-FFT, and the FFT involves global communication.
• Load balancing.
• System software overheads (particularly communications software).
Real Space Domain Decompositions
• atom
– commensurate with propagation of dynamics
– limited ability to distribute work
– potential to generate strange (global) communications patterns
• volume
– commensurate with propagation of dynamics, a 3D mesh machine, and limited-range forces
– load balancing issues
– “strict” volume decomposition may conflict with efficient implementation of rigid subunits
• interaction
– good load-balancing characteristics
– implementation (Plimpton) available with efficient use of a 2D mesh topology
– incommensurate with propagation of dynamics
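The volume decomposition above can be sketched in a few lines; this is an illustrative Python/NumPy toy, not Blue Matter code, and the function name and parameters are invented for the example:

```python
import numpy as np

def volume_decompose(positions, box, mesh):
    """Assign each atom to the mesh cell (node) whose sub-volume contains it.

    positions: (N, 3) coordinates in [0, box)
    box:       (3,) simulation box edge lengths
    mesh:      (3,) number of cells per dimension (the node mesh)
    Returns a dict mapping cell index (i, j, k) -> list of atom ids.
    """
    cell = np.asarray(box) / np.asarray(mesh)  # sub-volume edge lengths
    idx = np.floor(np.asarray(positions) / cell).astype(int) % np.asarray(mesh)
    cells = {}
    for atom, (i, j, k) in enumerate(idx):
        cells.setdefault((int(i), int(j), int(k)), []).append(atom)
    return cells

# A 4x4x4 node mesh over a 64-unit box: each node owns a 16-unit sub-volume,
# so forces with a cutoff under 16 units need only nearest-neighbor exchange.
rng = np.random.default_rng(0)
pos = rng.uniform(0.0, 64.0, size=(1000, 3))
cells = volume_decompose(pos, box=(64.0, 64.0, 64.0), mesh=(4, 4, 4))
assert sum(len(v) for v in cells.values()) == 1000
```

The load-balancing issue on the slide is visible here directly: nothing constrains the per-cell atom counts to be equal, so a dense region of the molecular system overloads the node that owns it.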
What Is Blue Matter?
• Application platform for the Blue Gene Science program
• Prototyping platform for exploration of application frameworks suitable for cellular architecture machines
• Blue Matter comprises all of the necessary application components: those that run on the computational core and those that run on the host
For more details, see: Fitch et al., “Blue Matter, an Application Framework for Molecular Simulation on Blue Gene,” Journal of Parallel and Distributed Computing, Vol. 63, Issues 7-8, July-August 2003, pp. 759-773
Blue Matter Overview
• Separate the MD program into multiple subpackages (offload function to the host where possible):
– MD core engine (massively parallel, minimal in size)
– Setup programs to set up force field assignments, etc.
– Monitoring and analysis tools to analyze MD trajectories, etc.
• Multiple Force Field Support
– CHARMM force field (done)
– OPLS-AA force field (done)
– AMBER force field (done)
– GROMOS force field (in progress)
– Polarizable Force Field (planned)
Application Overview
[Diagram: the MD kernel runs on the Blue Gene HW/OS platform; setup and post-processing/analysis run on Unix clusters; monitoring and control connect to an experimental record stored off-line]
Blue Matter Dataflow
[Diagram: the Blue Matter runtime, driven from a user interface. Components: molecular system setup (CHARMM, IMPACT, AMBER, etc.); XML filter; Blue Matter molecular system spec; RTP, DVS, and MSD.cpp files (the last via the C++ compiler); MD UDF registry; Probspec DB2; hardware platform description; Blue Matter start process; Blue Matter parallel application kernel on BG/W; raw data management system (UDP/IP-based, reliable datagrams); restart process (checkpoint); long-term archival storage system; online analysis/monitoring/visualization; regression test driver scripts; DVS filter; offline analysis; visualization]
Early Options for Parallel Decomposition Targeting BG/L
• Global force reduction — replication of dynamics propagation
• Globalize positions — double computation of forces, P3ME issues
• Globalize positions with nearest-neighbor force reduction and approximate volume decomposition — supports P3ME
• Globalize positions with near-neighbor (cutoff radius) force reduction and approximate volume decomposition — supports P3ME, avoids double computation of forces
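The “globalize positions” pattern can be illustrated with a serial Python/NumPy toy (not Blue Matter code; the force law and function names are invented): every node sees the full, globalized position array, evaluates only its slice of the interactions, and the partial forces are summed as in a global reduction.

```python
import numpy as np

def pair_forces(positions, i):
    """Toy repulsive 1/r^2 force on atom i from all other atoms."""
    d = positions - positions[i]      # displacements r_j - r_i
    r2 = np.sum(d * d, axis=1)
    r2[i] = np.inf                    # exclude self-interaction
    return -np.sum(d / r2[:, None] ** 1.5, axis=0)

def md_force_step(positions, n_nodes):
    """One force evaluation under position globalization, simulated serially.

    Each 'node' receives all positions (an all-gather), computes forces for
    its round-robin share of atoms, and the partial force arrays are summed
    (a global force reduction).
    """
    n = len(positions)
    partial = []
    for node in range(n_nodes):
        local = np.zeros_like(positions)
        for i in range(node, n, n_nodes):
            local[i] = pair_forces(positions, i)
        partial.append(local)
    return np.sum(partial, axis=0)    # the global reduction

rng = np.random.default_rng(3)
pos = rng.uniform(0.0, 10.0, size=(64, 3))
f8, f1 = md_force_step(pos, 8), md_force_step(pos, 1)
assert np.allclose(f8, f1)                # decomposition leaves forces unchanged
assert np.allclose(f8.sum(axis=0), 0.0)   # central pair forces sum to zero
```

The communication cost of this scheme is what the plots that follow measure: the all-gather and reduction volumes grow with atom count, not with node count, which is why the approach remains practical only over a limited range of nodes.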
Molecular Dynamics, 30,000 Atoms
[Plot: MD time-step (seconds) vs. node count (1 to 100,000), comparing position globalization against global force reduction]
Contributions to MD Time Step (Position Globalization)
[Plot: contributions to time-step duration (cycles) vs. node count (1 to 100,000) under position globalization: total, Verlet, globalization, FFT, and real-space non-bond]
3D-FFT for BG/L
• Requirement: a scalable 3D FFT for meshes ranging from 32³ to 256³, as part of the particle-mesh molecular dynamics methods commonly used in biomolecular simulations.
• Design goal: a “volumetric” 3D FFT decomposition with strong scaling characteristics for mesh sizes of interest (as an alternative to “slab”-based 3D FFT decomposition).
• Sizing: paper-and-pencil sizing indicates that mesh sizes of interest will scale to thousands of nodes.
• Prototyping: “Active Packet” and MPI versions of the volumetric 3D FFT have been implemented.
• Results: the MPI version shows scaling on the SP (Power4) superior to that of FFTW; the MPI version has run on up to 512 BG/L nodes; the Active Packet version runs in the BG/L bringup/test environment and scales well to 32 BG/L nodes.
Eleftheriou et al., “A Volumetric FFT for Blue Gene/L,” to appear in the Proceedings of HiPC 2003
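The volumetric scheme rests on the separability of the 3D DFT: it factors into three passes of independent 1D FFTs, one axis at a time, with an all-to-all redistribution between passes so that every node holds complete rows for the next axis. A minimal serial NumPy sketch of this computation pattern (the redistributions are noted as comments):

```python
import numpy as np

def fft3d_by_rows(a):
    """3D FFT computed as three passes of independent 1D FFTs, one axis at a
    time -- the pattern behind the volumetric decomposition. On the parallel
    machine, each pass is preceded by an all-to-all that gathers whole rows."""
    a = np.fft.fft(a, axis=0)  # 1D FFTs along x rows
    a = np.fft.fft(a, axis=1)  # (all-to-all) then 1D FFTs along y rows
    a = np.fft.fft(a, axis=2)  # (all-to-all) then 1D FFTs along z rows
    return a

mesh = np.random.default_rng(1).standard_normal((32, 32, 32))
assert np.allclose(fft3d_by_rows(mesh), np.fft.fftn(mesh))
```

Because the mesh is split over a 3D block of nodes rather than over slabs, the number of usable nodes is not capped at one mesh dimension (64 or 128 here), which is what permits scaling to thousands of nodes.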
Volumetric 3D FFT on SP (Power4) and BG/L
[Plot: 3D-FFT time (seconds) vs. task count (1 to 1000); curves: 64³ and 128³ FFTBGMPI on BG/L (-qarch=440, tjcw 1D FFT), 64³ and 128³ packet layer (-qarch=440d, 444 MHz), and 128³ FFTBG on Power4]
1D-FFT on BG/L (3D-FFT Building Block)
• Naive 1D-FFT implemented for 64 and 128 points (2-element vector implementation to take advantage of the hummer2)
• Using the IBM xlC compiler on Power3 (-O3 -qarch=pwr3) and BG/L hummer2 (-O3 -qarch=440d)
– 6643 cycles per 128-point FFT pair on Power3 (34% of peak)
– 7346 cycles per 128-point FFT pair on BG/L (30% of peak)
• Better performance should be expected from more sophisticated 1D-FFT implementations (such as the work being done at the Technical University of Vienna)
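For reference, a naive radix-2 1D FFT of the kind benchmarked above can be sketched in a few lines of Python; the slide's kernel is a vectorized implementation tuned for the double FPU, so this recursion is only illustrative of the algorithm:

```python
import numpy as np

def fft_radix2(x):
    """Naive recursive radix-2 Cooley-Tukey FFT (length must be a power of 2):
    split into even and odd samples, transform each half, then combine with
    twiddle factors."""
    x = np.asarray(x, dtype=complex)
    n = len(x)
    if n == 1:
        return x
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    tw = np.exp(-2j * np.pi * np.arange(n // 2) / n) * odd
    return np.concatenate([even + tw, even - tw])

x = np.random.default_rng(2).standard_normal(128)
assert np.allclose(fft_radix2(x), np.fft.fft(x))
```

A 128-point transform like this performs 7 stages of 64 butterflies each; the cycle counts on the slide correspond to a pair of such transforms, which is how the 2-element vector unit is kept busy.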
BG/L Dual Core 3D FFT Flowchart
[Diagram: a timeline for Core 0 and Core 1, each alternating between computing 1D FFTs, packet sends for the row and plane all-to-alls, and dispatching active packets]
Red: compute one half-node's allocation of 1D FFTs in a phase.
Green: active packet send in a row or plane.
Blue: active packet receive and dispatch (from BG/L FIFOs and an internal queue).
Measured 3D FFT Speedups on BG/L and SP (Power4)
[Plot: speedup vs. task count (1 to 1000) against the ideal line; curves: 64³ FFTBGMPI on BG/L (-qarch=440, 500 MHz), 128³ FFTBG on Power4, and 64³ and 128³ packet layer (-qarch=440d, 444 MHz)]
32 Node Application Tracing
[Trace: 1D-FFTs along X, Y, and Z (3.45 ms each); Z-row all-to-all (10.7 ms); Y-row all-to-all (15 ms); two planar all-to-alls (15 ms each)]
Regression/Validation
• External regression of intramolecular energies for multiple force fields via the “Brooks suite”
• The main check on validity is the quadratic dependence of the conserved-quantity RMSD on timestep, over multiple short runs for every combination of features
– Necessary but often not sufficient
– Agreement of P3ME with Ewald in the limit of small grid spacing
– Good temperature and pressure behavior
• Testing “corner” cases (e.g., imaging)
• A binary regression capability is needed due to the experimental HW/SW environment
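The quadratic-in-timestep check can be demonstrated on a toy system. This sketch (not Blue Matter code) integrates a unit harmonic oscillator with velocity Verlet and verifies that the RMSD of the total energy scales as the square of the timestep:

```python
import numpy as np

def energy_rmsd(dt, steps=2000):
    """RMSD of the total energy of a unit harmonic oscillator (m = k = 1,
    x0 = 1, v0 = 0, so exact energy 0.5) under velocity Verlet."""
    x, v = 1.0, 0.0
    energies = []
    for _ in range(steps):
        a = -x                            # acceleration at the old position
        x += v * dt + 0.5 * a * dt * dt
        v += 0.5 * (a - x) * dt           # average of old and new (-x) accel
        energies.append(0.5 * v * v + 0.5 * x * x)
    e = np.array(energies)
    return float(np.sqrt(np.mean((e - 0.5) ** 2)))

# Halving the timestep should cut the energy RMSD by roughly 4x -- the
# quadratic dependence used as the validity criterion.
r1, r2 = energy_rmsd(0.02), energy_rmsd(0.01)
assert 3.0 < r1 / r2 < 5.0
```

A run whose conserved-quantity RMSD fails to drop quadratically with timestep signals a force or integrator error, which is exactly why the slide calls the check necessary but not sufficient: a bug that preserves a (wrong) conserved quantity would still pass.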
4ns NVE Lipid Run using Blue Matter
[Plot: total energy (kcal/mol) vs. simulation time (0 to 4 ns) for fully hydrated SOPE; total energy stays within roughly -5548 to -5532 kcal/mol]
Conclusions from Simple Models
• Raw hardware “speeds and feeds” allow the use of standard MD techniques (P3ME) at node counts in excess of 1000 for problem sizes of interest to the Blue Gene project.
– No extraordinary algorithmic methods have to be employed based on hardware considerations.
• “Naive” parallelization strategies leveraging the global tree network are practical for a reasonable range of node counts, even though the communication time is not scalable.
• Scalability is limited by the 3D-FFT: hardware latencies dominate at large node counts on fixed-size problems, as message sizes decrease and the number of messages sent per node increases.
Current Status
• The Blue Matter molecular simulation application has run and regressed on early BG/L hardware (leveraging an extensive testing infrastructure)
• Developed a “volumetric” decomposition of the 3D-FFT as an alternative to the typical “slab” decomposition, enabling scaling to large numbers of nodes. The implementation can leverage any uniprocessor 1D-FFT implementation.
• Early results from the MPI version of the 3D-FFT on a large SP (Power4) cluster show scaling superior to that of FFTW on the same platform
• MPI-based implementations of the 3D-FFT and Blue Matter now run on the production software stack (up to 512-node BG/L)
• Active-message-based implementations of the 3D-FFT and Blue Matter now run on multi-chip BG/L hardware using both cores
• Tuning and experimentation are just beginning
Acknowledgements
Blue Gene Application/Science team: Alex Rayshubskiy, Blake Fitch, Maria Eleftheriou, Jed Pitera, Mike Pitman, Frank Suits, Bill Swope, Chris Ward, Yuri Zhestkov, Ruhong Zhou
Blue Gene Hardware and System Software teams, particularly: Mark Giampapa, Alan Gara, Philip Heidelberger, Burkhart Steinmacher-Burow, Mark Mendell
Collaborators: Bruce Berne (Columbia), Scott Feller (Wabash), Klaus Gawrisch (NIH), Vijay Pande (Stanford)