Bringing Up Anton: Taking Co-Design into Production. Joseph A. Bank, D. E. Shaw Research. September 24, 2010.

Page 1: Bringing Up Anton: Taking Co-Design into Production

Bringing Up Anton: Taking Co-Design into Production

Joseph A. Bank
September 24, 2010
D. E. Shaw Research

Page 2: Bringing Up Anton: Taking Co-Design into Production

Talk Outline

Brief history of Anton
Bringup challenges of Anton
The bringup lessons

Page 3: Bringing Up Anton: Taking Co-Design into Production

A Brief History of Anton

Massively parallel special-purpose machine to accelerate molecular dynamics (MD) simulations
Custom-designed ASICs connected by a specialized toroidal network
First ASIC received Q1 2008
512-node machine operational Q4 2008
1-millisecond BPTI MD simulation Q2 2009
512-node machine achieves performance of ~17,000 ns/day for 5DHFR (23,558 atoms) MD simulation

Page 4: Bringing Up Anton: Taking Co-Design into Production

Bringup Challenges of Anton

Application: MD itself is a bit hard to verify
  ◦ Few simple metrics (energy drift, frms, folds/ms, …)
  ◦ No one has simulated the time scales of Anton
Algorithm changes
  ◦ Gaussian Split Ewald method for non-bonded far interactions
  ◦ Neutral Territory method for non-bonded near interactions
Architecture
  ◦ Massively parallel heterogeneous system
      512+ nodes, 13 cores per node, 3 types of cores
      Custom communication primitives
  ◦ Fixed point instead of floating point
  ◦ Resource optimized => I/D caches and SRAMs are all tightly constrained
Software
  ◦ From-scratch MD code base for Anton
  ◦ Anton simulation preparation framework is complex
      Dynamic code generation
      Specialization to machine size, chemical system, etc.
Summary => Application/architecture co-design makes bringup uniquely challenging
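
One property of fixed-point arithmetic that matters for debugging is that integer addition is associative, so a sum of fixed-point values does not depend on the order in which contributions arrive. The sketch below illustrates only that general property. The 24-bit scale, the helper names, and the random test data are assumptions for illustration and are not Anton's number formats. Python integers also do not wrap around, unlike fixed-width hardware registers.

```python
import random

SCALE = 1 << 24  # hypothetical format: 24 fractional bits (not Anton's)


def to_fixed(x: float) -> int:
    """Quantize a float to a fixed-point integer."""
    return round(x * SCALE)


def sum_float(values):
    """Accumulate in floating point, in the order given."""
    total = 0.0
    for v in values:
        total += v
    return total


def sum_fixed(values):
    """Quantize each value, then accumulate as integers."""
    total = 0
    for v in values:
        total += to_fixed(v)
    return total


# Toy "force contributions" spanning several orders of magnitude.
rng = random.Random(0)
forces = [rng.uniform(-1.0, 1.0) * 10 ** rng.randint(-6, 2) for _ in range(10_000)]
shuffled = forces[:]
rng.shuffle(shuffled)

# Integer sums are identical for any ordering of the same contributions.
fixed_matches = sum_fixed(forces) == sum_fixed(shuffled)
# Floating-point sums may differ in the last bits when the order changes.
float_matches = sum_float(forces) == sum_float(shuffled)

print("fixed-point sums identical:", fixed_matches)
print("floating-point sums identical:", float_matches)
```

The fixed-point comparison holds for every ordering. The floating-point comparison is not guaranteed either way, which is why it is printed rather than asserted.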

Page 5: Bringing Up Anton: Taking Co-Design into Production

Bringup Lesson Outline as Quips

“Do your homework”
“Where’s the chip?”
“Repeat yourself, over and over and over”
“Inspector gadget”
“Use your eyes”
“Target practice”
“Trust no one”

Page 6: Bringing Up Anton: Taking Co-Design into Production

“Do Your Homework”: Preparing for Bringup

Desmond: Verification of algorithms, develop experience with MD simulation
Pyrite: Verification of fixed-point calculation kernels
Detailed architectural simulator
  ◦ Interface compatible with ASIC design (allowing co-simulation with RTL)
  ◦ Enabled earliest possible development and testing of complete software stack (embedded code, prep time, etc.)
  ◦ During bringup the simulator could rerun simulations with much higher visibility of the architectural and software state

Page 7: Bringing Up Anton: Taking Co-Design into Production

“Where’s the chip?”: Dealing with Scarcity

Challenge: Anton’s primary designed mode of operation is “SRAM mode”, where all data fits in SRAM. This requires a configuration of at least 2x2x2 ASICs. During bringup, ASICs trickled in…
Solution: “DRAM mode”
  ◦ We spent about 6 man-months of software development on a mode of operation that choreographs paging data into SRAM from DRAM and could perform large chemistry simulations on small Anton configurations (even single ASICs).
  ◦ DRAM mode was used to test every ASIC individually and at each machine size we have built: 1, 2, 4, 8, 64, 128, 256, 512, 1024, 2048.
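
The general idea of working through a data set that does not fit in fast memory can be sketched as follows. This is a toy illustration of paging, not Anton's DRAM mode: the page size, the integer `pair_energy` stand-in, and both function names are assumptions. It shows that visiting pairs page by page yields the same total as the all-resident computation.

```python
from itertools import combinations


def pair_energy(a: int, b: int) -> int:
    """Toy stand-in for a pairwise interaction (integers keep sums exact)."""
    return (a - b) ** 2


def energy_all_resident(positions):
    """'SRAM mode' analogue: every particle is in fast memory at once."""
    return sum(pair_energy(a, b) for a, b in combinations(positions, 2))


def energy_paged(positions, page_size):
    """'DRAM mode' analogue: at most two pages are resident at any time."""
    pages = [positions[i:i + page_size] for i in range(0, len(positions), page_size)]
    total = 0
    for i, page_i in enumerate(pages):
        # Pairs that live entirely inside the resident page.
        total += sum(pair_energy(a, b) for a, b in combinations(page_i, 2))
        # Pairs between this page and each later page, loaded one at a time.
        for page_j in pages[i + 1:]:
            total += sum(pair_energy(a, b) for a in page_i for b in page_j)
    return total


positions = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]
print(energy_all_resident(positions), energy_paged(positions, page_size=4))
```

Because every pair is visited exactly once in either scheme, the two totals agree for any page size. The real difficulty, which this sketch ignores, is scheduling the transfers.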

Page 8: Bringing Up Anton: Taking Co-Design into Production

“Repeat Yourself”: Bit-wise Reproducibility

Anton and its embedded SW were designed to provide application-level bit-wise reproducibility independent of HW configuration.
  ◦ Detection: Rerun entire simulations and compare trajectories.
      Primary means of detecting HW/SW bugs during bringup
      Used with “golden” trajectories for suite of tests on every ASIC
      Periodically used to check machine status
  ◦ Isolation with force comparison: Online checking of redundant force calculation
  ◦ Generalized isolation with redundancy checker infrastructure: Online piecewise rerunning of simulation with arbitrary logging of lightweight checksums
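
A minimal sketch of the detection idea: if a run is bit-wise reproducible, a rerun can be compared against a golden trajectory frame by frame using checksums, and the first mismatching frame localizes the fault in time. Everything here is hypothetical, including the toy integer "simulation", the injected single-bit fault, and the function names. It does not reflect Anton's trajectory format or its checker infrastructure.

```python
import hashlib
import struct


def frame_checksum(frame):
    """Checksum of one frame, given as a list of integer (x, y, z) positions."""
    h = hashlib.sha256()
    for x, y, z in frame:
        h.update(struct.pack("<3q", x, y, z))
    return h.hexdigest()


def first_divergence(golden, rerun):
    """Index of the first frame whose checksum differs, or None if identical."""
    for i, (g, r) in enumerate(zip(golden, rerun)):
        if frame_checksum(g) != frame_checksum(r):
            return i
    if len(golden) != len(rerun):
        return min(len(golden), len(rerun))
    return None


def simulate(n_steps, fault_at=None):
    """Toy deterministic integer 'simulation' with an optional injected fault."""
    pos = [(i, 2 * i, 3 * i) for i in range(4)]
    frames = []
    for step in range(n_steps):
        pos = [(x + 1, y + 2, z + 3) for x, y, z in pos]
        if step == fault_at:
            x, y, z = pos[0]
            pos[0] = (x ^ 1, y, z)  # flip one bit, as a stand-in for a HW error
        frames.append(pos)
    return frames


golden = simulate(10)
print(first_divergence(golden, simulate(10)))              # clean rerun
print(first_divergence(golden, simulate(10, fault_at=6)))  # faulty rerun
```

Logging only checksums keeps the comparison cheap. Once a diverging frame is found, a finer-grained rerun of just that interval can narrow the fault further.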

Page 9: Bringing Up Anton: Taking Co-Design into Production

“Inspector Gadget”: Anton’s Logic Analyzer

Anton ASICs include a built-in “logic analyzer” that can be configured to capture traces of various hardware signals without perturbing timing.
Extremely useful when it worked.
  ◦ Only a limited number of signals could be traced in a single run, often requiring multiple runs
  ◦ Traces can be “bumped” for other DRAM traffic, so it was often not useful in DRAM mode simulations
Provided key performance tuning data
Lesson: HW visibility tools are a great investment.

Page 10: Bringing Up Anton: Taking Co-Design into Production

“Use Your Eyes”: Visualization for Debugging and Optimization

Many of the most difficult bugs during bringup were initially tracked down by creating custom visualizations that provided key insights.
Favor quick and dirty over beautiful!
Example 1: Force mismatch blast patterns
Example 2: When ions attack
Example 3: Logic analyzer for optimization/tuning

Page 11: Bringing Up Anton: Taking Co-Design into Production

“Use Your Eyes”: Blast Pattern

Page 12: Bringing Up Anton: Taking Co-Design into Production

“Use your eyes”: When ions attack

Page 13: Bringing Up Anton: Taking Co-Design into Production

“Use your eyes”: Profiling

Page 14: Bringing Up Anton: Taking Co-Design into Production

“Target Practice”: Have concrete milestones

Page 15: Bringing Up Anton: Taking Co-Design into Production

“Trust No One”: Paranoid Debugging

During Anton bringup, it was useful to be very paranoid.
  ◦ Issues were found in both hardware and software at similar frequency, and our initial guesses were often wrong.
  ◦ Most engineers have little experience with this phase of a project; as a software developer it takes practice to learn to distrust the hardware.
Best example: SRAMs that would return bad results for some locations less than once an hour.

Page 16: Bringing Up Anton: Taking Co-Design into Production

Conclusions

Application/architecture co-design made bringing up Anton extremely challenging
Most important lessons from Anton’s successful bringup
  ◦ Preparation
  ◦ Repeatability
  ◦ Paranoia

Page 17: Bringing Up Anton: Taking Co-Design into Production

End

Page 18: Bringing Up Anton: Taking Co-Design into Production

Molecular Dynamics Simulation (MD)

10^4 to 10^5 atoms in a simulation
Millisecond-scale simulations
  ◦ Each time step is ~2 fs (2x10^-15 seconds)
  ◦ Need 5x10^11 time steps
  ◦ Presently at ~10^8 time steps/day on a cluster with Desmond (Bowers et al., SC 2006)
  ◦ Simulating 1 ms takes >10 years on a cluster
  ◦ Needed an architectural jump forward: Anton (Shaw et al., ISCA 2007, CACM 2008, SC09)
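
The arithmetic behind these figures can be checked directly. The sketch below uses only numbers quoted in this talk: the 2 fs time step, ~10^8 steps/day on a cluster, and the ~17,000 ns/day figure given earlier for the 512-node machine on one specific benchmark. The variable names and the 365.25-day year are assumptions, and the Anton estimate simply extrapolates that one benchmark rate.

```python
# Work in integer femtoseconds so the step count is exact.
TIMESTEP_FS = 2                  # ~2 fs per time step
TARGET_FS = 10 ** 12             # 1 millisecond = 10^12 fs
CLUSTER_STEPS_PER_DAY = 10 ** 8  # ~10^8 time steps/day on a cluster

steps_needed = TARGET_FS // TIMESTEP_FS               # 5 x 10^11 steps
cluster_days = steps_needed // CLUSTER_STEPS_PER_DAY  # 5000 days
cluster_years = cluster_days / 365.25                 # roughly 13.7 years

# Extrapolating the quoted ~17,000 ns/day benchmark rate to 1 ms (10^6 ns).
ANTON_NS_PER_DAY = 17_000
anton_days = 10 ** 6 / ANTON_NS_PER_DAY               # roughly 59 days

print(f"steps needed:  {steps_needed:.1e}")
print(f"cluster time:  {cluster_days} days (~{cluster_years:.1f} years)")
print(f"Anton at 17,000 ns/day: ~{anton_days:.0f} days")
```

The cluster estimate of about 13.7 years agrees with the ">10 years" claim, and the extrapolated Anton figure is on the order of two months.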

Page 19: Bringing Up Anton: Taking Co-Design into Production

[Figure: Biomolecular Timescales (seconds), comparing simulation and experiment. Annotations: hours/days on a workstation; long MD run with Desmond on an Infiniband cluster (weeks to months); less than a day on Anton; a few months on Anton, the longest MD simulation ever run.]

Adapted from Suits (IBM), originally from Chan & Dill (1993)

Page 20: Bringing Up Anton: Taking Co-Design into Production

Compute Interactions on Neutral Territory

[Figure: two panels, “Traditional Method” and “NT Method”, with regions labeled “Tower” and “Plate”.]

D. E. Shaw, “A Fast, Scalable Method for the Parallel Evaluation of Distance-Limited Pairwise Particle Interactions”, J. Comput. Chem., 2005

Page 21: Bringing Up Anton: Taking Co-Design into Production

An Anton ASIC

Two computational subsystems connected by communication ring
  ◦ Hardware datapaths compute over 25 billion interactions/sec
  ◦ Software runs on 12 cores in the flexible subsystem
6 links for the 3D torus, each with 42 Gbps bandwidth and 50 ns chip-to-chip latency
1 host interface link for external I/O, 1 Gbps
2 banks of DDR2-800 DRAM

Page 22: Bringing Up Anton: Taking Co-Design into Production

Anton’s Flexible Subsystem

General-purpose cores are 32-bit Tensilica LX

Remote Access Unit handles multiple parallel DMA to/from 32KB of local SRAM

Geometry Cores are custom-designed, dual-slot VLIW, quad-word fixed-point SIMD

Kuskin et al, HPCA 2008

Page 23: Bringing Up Anton: Taking Co-Design into Production

Anton New York Segment

Anton 512-node system in NY; 2 of 4 racks shown under construction (3/9/2009).
Each rack contains 32 boards
Each board holds 4 Anton nodes
512 nodes in an 8x8x8 3D torus
Can be built out to 4096 nodes in a larger data center
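
To make the 8x8x8 torus concrete, the sketch below computes the six neighbors of a node (one per torus link, with wraparound at the edges) and the minimum hop count between two nodes. The (x, y, z) coordinate scheme and the function names are assumptions for illustration; the talk does not describe how Anton numbers its nodes or routes traffic.

```python
DIMS = (8, 8, 8)  # 8 x 8 x 8 = 512 nodes


def torus_neighbors(node, dims=DIMS):
    """Return the 6 neighbors of a node: +/-1 along each axis, with wraparound."""
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            coords = list(node)
            coords[axis] = (coords[axis] + step) % dims[axis]
            neighbors.append(tuple(coords))
    return neighbors


def torus_hops(a, b, dims=DIMS):
    """Minimum number of link traversals between two nodes on the torus."""
    return sum(min((p - q) % n, (q - p) % n) for p, q, n in zip(a, b, dims))


print(torus_neighbors((0, 0, 0)))
print(torus_hops((0, 0, 0), (7, 7, 7)))  # wraparound makes opposite corners close
print(torus_hops((0, 0, 0), (4, 4, 4)))  # the farthest node from the origin
```

With wraparound links, no node is more than 4 hops away along any axis, so the worst-case distance in an 8x8x8 torus is 12 hops.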