Bringing Up Anton: Taking Co-Design into Production
Joseph A. Bank
September 24, 2010
D. E. Shaw Research
Brief history of Anton
Bringup challenges of Anton
The bringup lessons
Talk Outline
Massively parallel special purpose machine to accelerate Molecular Dynamics (MD) simulations
Custom designed ASICs connected by specialized toroidal network
First ASIC received Q1 2008
512-node machine operational Q4 2008
1-millisecond BPTI MD simulation Q2 2009
512-node machine achieves performance of ~17,000 ns/day for 5DHFR (23,558 atoms) MD simulation
A Brief History of Anton
Application: MD itself is a bit hard to verify
◦ Few simple metrics (energy drift, frms, folds/ms, …)
◦ No one has simulated the time scales of Anton
Algorithm Changes
◦ Gaussian Split Ewald method for non-bonded far interactions
◦ Neutral Territory method for non-bonded near interactions
Architecture
◦ Massively parallel heterogeneous system
   512+ nodes, 13 cores per node, 3 types of cores
   Custom communication primitives
◦ Fixed point instead of floating point
◦ Resource optimized => I/D caches, SRAMs are all tightly constrained
Software
◦ From-scratch MD code base for Anton
◦ Anton simulation preparation framework is complex
   Dynamic code generation
   Specialization to machine size, chemical system, etc.
Summary => Application/Architecture co-design makes bringup uniquely challenging
Bringup Challenges of Anton
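One practical consequence of choosing fixed point over floating point is that integer addition is exact and associative, so a sum of force contributions does not depend on the order in which they arrive. The sketch below illustrates that property only. The scale factor and function names are assumptions made for illustration and do not describe Anton's actual number formats.

```python
import random

# Hypothetical fixed-point format: 20 fractional bits (not Anton's real format).
SCALE = 1 << 20

def to_fixed(x: float) -> int:
    """Quantize a float to a fixed-point integer."""
    return round(x * SCALE)

def sum_fixed(values) -> int:
    """Accumulate fixed-point integers; integer addition is exact."""
    total = 0
    for v in values:
        total += v
    return total

rng = random.Random(0)
forces = [rng.uniform(-1e6, 1e6) for _ in range(10_000)]
shuffled = forces[:]
rng.shuffle(shuffled)

# Same contributions, different arrival order: the fixed-point sums match bit for bit.
fixed_in_order = sum_fixed(to_fixed(f) for f in forces)
fixed_shuffled = sum_fixed(to_fixed(f) for f in shuffled)
print(fixed_in_order == fixed_shuffled)  # True

# Floating-point addition is not associative, so ordering can change the result.
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False
```

In a parallel machine the arrival order of contributions varies with machine size and network timing, which is why order independence is a useful property when debugging.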
“Do your homework”
“Where’s the chip?”
“Repeat yourself, over and over and over”
“Inspector gadget”
“Use your eyes”
“Target practice”
“Trust no one”
Bringup Lesson Outline as Quips
Desmond: Verification of algorithms, develop experience with MD simulation
Pyrite: Verification of fixed point calculation kernels
Detailed architectural simulator
◦ Interface compatible with ASIC design (allowing co-simulation with RTL)
◦ Enabled earliest possible development and testing of complete software stack (embedded code, prep time, etc.)
◦ During bringup the simulator could rerun simulations with much higher visibility of the architectural and software state.
“Do Your Homework”: Preparing for Bringup
Challenge: Anton’s primary designed mode of operation is “SRAM mode” where all data fits in SRAM. This requires a configuration of at least 2x2x2 ASICs. During bringup, ASICs trickled in…
Solution: “DRAM mode”
◦ We spent about 6 man-months of software development on a mode of operation that choreographs paging data into SRAM from DRAM and could perform large chemistry simulations on small Anton configurations (even single ASICs).
◦ DRAM mode was used to test every ASIC individually and at each machine size we have built: 1, 2, 4, 8, 64, 128, 256, 512, 1024, 2048.
“Where’s the chip?”: Dealing with Scarcity
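The idea behind DRAM mode, working through a data set that does not fit in fast memory by paging pieces of it in and out, can be sketched in a few lines. This is a toy illustration under stated assumptions, not Anton's implementation: the 1D integer "positions", the `sram_capacity` parameter, and all function names are hypothetical. It counts pairs within a cutoff while holding at most two tiles in "SRAM" at once, and checks the result against an all-at-once computation.

```python
import random
from itertools import combinations

def within(a: int, b: int, cutoff: int) -> bool:
    return abs(a - b) <= cutoff

def count_pairs_direct(positions, cutoff):
    """Reference result: every unordered pair examined with all data resident."""
    return sum(1 for a, b in combinations(positions, 2) if within(a, b, cutoff))

def count_pairs_paged(dram, cutoff, sram_capacity):
    """Same count, but only two tiles are 'resident' at any time."""
    tile_size = sram_capacity // 2  # two tiles must fit in SRAM together
    if tile_size < 1:
        raise ValueError("sram_capacity must be at least 2")
    tiles = [dram[i:i + tile_size] for i in range(0, len(dram), tile_size)]
    total = 0
    for i, tile_a in enumerate(tiles):
        # Pairs whose members both live in tile_a.
        total += count_pairs_direct(tile_a, cutoff)
        # Pairs spanning tile_a and a later tile; each tile pair is visited once.
        for tile_b in tiles[i + 1:]:
            total += sum(1 for a in tile_a for b in tile_b if within(a, b, cutoff))
    return total

rng = random.Random(1)
atoms = [rng.randrange(0, 10_000) for _ in range(300)]
direct = count_pairs_direct(atoms, cutoff=50)
paged = count_pairs_paged(atoms, cutoff=50, sram_capacity=32)
print(direct == paged)  # True
```

The paged version does the same work in a different order, which is the property that lets a small configuration stand in for a large one during testing.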
Anton and its embedded SW were designed to provide application-level bit-wise reproducibility independent of HW configuration.
◦ Detection: Rerun entire simulations and compare trajectories.
   Primary means of detecting HW/SW bugs during bringup
   Used with “golden” trajectories for suite of tests on every ASIC
   Periodically used to check machine status
◦ Isolation with force comparison: Online checking of redundant force calculation
◦ Generalized isolation with redundancy checker infrastructure: Online piecewise rerunning of simulation with arbitrary logging of lightweight checksums
“Repeat Yourself”: Bit-wise Reproducibility
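The detection and isolation approach described on this slide relies on a simple mechanism: if a run is bit-wise reproducible, a log of per-step checksums can be compared against a golden log to find the first step at which anything went wrong. The sketch below is a minimal illustration with a toy integer update rule standing in for an MD time step. The update rule, the checksum format, and the injected fault are all assumptions for the example and not part of Anton's software.

```python
import hashlib
import struct

def checksum(state) -> str:
    """Lightweight digest of an integer state vector."""
    packed = struct.pack(f"<{len(state)}q", *state)
    return hashlib.sha256(packed).hexdigest()[:16]

def step(state):
    """Toy deterministic integer update standing in for one MD time step."""
    n = len(state)
    return [(state[i] + 3 * state[(i + 1) % n] + 1) % 1_000_003 for i in range(n)]

def run(initial, n_steps, fault_at=None):
    """Run the toy simulation and return the per-step checksum log."""
    state = list(initial)
    log = []
    for t in range(n_steps):
        state = step(state)
        if t == fault_at:
            state[0] ^= 1  # inject a single-bit error, e.g. a bad memory read
        log.append(checksum(state))
    return log

def first_divergence(golden, candidate):
    """Index of the first step whose checksum differs, or None if all match."""
    for t, (g, c) in enumerate(zip(golden, candidate)):
        if g != c:
            return t
    return None

initial = list(range(16))
golden = run(initial, n_steps=200)
print(first_divergence(golden, run(initial, n_steps=200)))               # None
print(first_divergence(golden, run(initial, n_steps=200, fault_at=137)))  # 137
```

Logging checksums of smaller pieces of state (per node, per data structure) narrows the fault in space as well as in time, at the cost of more logging.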
Anton ASICs include a built-in “logic analyzer” that can be configured to capture traces of various hardware signals without perturbing timing.
Extremely useful when it worked.
◦ Limited number of signals could be traced in a single run, often requiring multiple runs
◦ Traces can be “bumped” for other DRAM traffic, so the analyzer was often not useful in DRAM mode simulations
Provided key performance tuning data
Lesson: HW visibility tools are a great investment.
“Inspector Gadget”: Anton’s Logic Analyzer
Many of the most difficult bugs during bringup were initially tracked down by creating custom visualizations that provided key insights.
Favor quick and dirty over beautiful!
Example 1: Force mismatch blast patterns
Example 2: When ions attack
Example 3: Logic Analyzer for optimization/tuning
“Use your eyes”: Visualization for Debugging and Optimization
“Use Your Eyes”: Blast Pattern
“Use your eyes”: When ions attack
“Use your eyes”: Profiling
“Target Practice”: Have concrete milestones
During Anton bringup, it was useful to be very paranoid.
◦ Issues were found in both hardware and software at similar frequency, and our initial guesses were often wrong.
◦ Most engineers have little experience with this phase of a project; as a software developer it takes practice to learn to distrust the hardware.
Best example: SRAMs that would return bad results for some locations less than once an hour.
“Trust No One”: Paranoid Debugging
Application/Architecture Co-design made bringing up Anton extremely challenging
Most important lessons from Anton’s successful bringup
◦ Preparation
◦ Repeatability
◦ Paranoia
Conclusions
End
Molecular Dynamics Simulation (MD)
10^4 to 10^5 atoms in a simulation
Millisecond-scale simulations
◦ Each time step is ~2 fs (2×10^-15 seconds)
◦ Need 5×10^11 time steps
◦ Presently at ~10^8 time steps/day on a cluster with Desmond (Bowers et al., SC 2006)
◦ Simulating 1 ms takes >10 years on a cluster
◦ Needed an architectural jump forward: Anton (Shaw et al., ISCA 2007, CACM 2008, SC09)
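The figures on this slide can be checked with a few lines of arithmetic. The sketch below uses only numbers quoted in the talk: the 2 fs time step, the ~10^8 steps/day cluster rate, and the ~17,000 ns/day Anton rate reported for the 23,558-atom benchmark. The Anton estimate is therefore specific to that system size, and the variable names are illustrative.

```python
FS_PER_MS = 10**12             # 1 ms = 10^12 fs
TIME_STEP_FS = 2               # ~2 fs per MD time step
CLUSTER_STEPS_PER_DAY = 10**8  # cluster rate with Desmond, as quoted
ANTON_NS_PER_DAY = 17_000      # Anton rate quoted for the 23,558-atom benchmark
NS_PER_MS = 10**6              # 1 ms = 10^6 ns

steps_needed = FS_PER_MS // TIME_STEP_FS               # 5 * 10**11 steps
cluster_days = steps_needed // CLUSTER_STEPS_PER_DAY   # 5000 days
cluster_years = cluster_days / 365.25
anton_days = NS_PER_MS / ANTON_NS_PER_DAY

print(f"time steps for 1 ms: {steps_needed:.1e}")      # 5.0e+11
print(f"cluster: {cluster_days} days, about {cluster_years:.1f} years")  # 5000 days, about 13.7 years
print(f"Anton at 17,000 ns/day: about {anton_days:.0f} days")            # about 59 days
```

The cluster estimate of roughly 13.7 years matches the ">10 years" on the slide, and the Anton estimate of about two months is specific to the benchmark rate used here.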
Biomolecular Timescales (seconds)
[Figure: biomolecular timescales axis comparing simulation and experiment. Annotations: hours/days on a workstation; long MD run with Desmond on an InfiniBand cluster (weeks to months); less than a day on Anton; a few months on Anton (longest MD simulation ever run).]
Adapted from Suits (IBM), originally from Chan & Dill (1993)
Compute Interactions on Neutral Territory
[Figure: Traditional Method vs. NT Method, with the NT import regions labeled Tower and Plate]
D. E. Shaw, “A Fast, Scalable Method for the Parallel Evaluation of Distance-Limited Pairwise Particle Interactions”, J. Comput. Chem., 2005
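The core idea of the NT method can be sketched at the level of box indices: the interaction between two atoms is computed in the box that shares its x and y coordinates with one atom (the tower atom) and its z coordinate with the other (the plate atom), which may be a box containing neither atom. The sketch below is a simplified illustration that ignores periodic boundaries, the cutoff radius, and the details of the import regions. The tie-breaking convention for choosing the tower atom is a hypothetical choice made so that each pair maps to exactly one box.

```python
def nt_box(tower_box, plate_box):
    """Box that computes the pair: (x, y) from the tower atom, z from the plate atom."""
    return (tower_box[0], tower_box[1], plate_box[2])

def assign(box_a, box_b):
    """Pick tower/plate roles deterministically so both orderings agree.

    Hypothetical convention for this sketch: the lexicographically smaller
    box index supplies the tower atom.
    """
    tower, plate = sorted([box_a, box_b])
    return nt_box(tower, plate)

a = (0, 0, 0)  # home box of the first atom
b = (1, 0, 1)  # home box of the second atom

print(assign(a, b))                  # (0, 0, 1): neither atom's home box
print(assign(a, b) == assign(b, a))  # True: each pair is computed exactly once
print(assign(a, a))                  # (0, 0, 0): same-box pairs stay at home
```

In the traditional method the interaction would be computed in one of the two atoms' home boxes. Meeting on "neutral territory" instead is what allows each node to import a tower and a plate rather than a larger region around its home box.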
Two computational subsystems connected by communication ring
◦ Hardware datapaths compute over 25 billion interactions/sec
◦ Software runs on 12 cores in the flexible subsystem
6 links for the 3D torus, each 42 Gbps bandwidth, 50 ns chip-to-chip latency
1 host interface link for external I/O, 1 Gbps
2 banks of DDR2-800 DRAM
An Anton ASIC
Anton’s Flexible Subsystem
General-purpose cores are 32-bit Tensilica LX
Remote Access Unit handles multiple parallel DMA to/from 32 KB of local SRAM
Geometry Cores are custom-designed, dual-slot VLIW, quad-word fixed-point SIMD
Kuskin et al., HPCA 2008
Anton 512-node system in NY. 2 of 4 racks shown under construction.
Each rack contains 32 boards
Each board holds 4 Anton nodes
3/9/2009
Anton New York Segment
512 nodes in an 8×8×8 3D torus
can be built out to 4096 nodes in a larger data center
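The numbers on the last few slides fit together: 4 racks of 32 boards with 4 nodes each give 512 nodes, which is an 8×8×8 torus in which every node uses its 6 torus links to reach one neighbor in each direction along each axis. The sketch below works through that arithmetic and the wraparound addressing. The function names and the coordinate convention are assumptions for illustration, not Anton's actual node numbering.

```python
def torus_neighbors(node, dims=(8, 8, 8)):
    """The 6 nearest neighbors of a node in a 3D torus (with wraparound)."""
    neighbors = []
    for axis in range(3):
        for delta in (+1, -1):
            coords = list(node)
            coords[axis] = (coords[axis] + delta) % dims[axis]
            neighbors.append(tuple(coords))
    return neighbors

def torus_hops(a, b, dims=(8, 8, 8)):
    """Minimum number of link hops between two nodes, taking the short way round."""
    hops = 0
    for ai, bi, n in zip(a, b, dims):
        d = abs(ai - bi)
        hops += min(d, n - d)
    return hops

racks, boards_per_rack, nodes_per_board = 4, 32, 4
print(racks * boards_per_rack * nodes_per_board)  # 512
print(8 * 8 * 8)                                  # 512

print(torus_neighbors((0, 0, 0)))
# [(1, 0, 0), (7, 0, 0), (0, 1, 0), (0, 7, 0), (0, 0, 1), (0, 0, 7)]
print(torus_hops((0, 0, 0), (7, 7, 7)))  # 3, thanks to wraparound
print(torus_hops((0, 0, 0), (4, 4, 4)))  # 12, the farthest any two nodes can be
```

The wraparound links keep the worst-case distance at 12 hops for 512 nodes, where a mesh of the same size without wraparound would need up to 21.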