
Sixth International LS-DYNA Users Conference


Slide 1: Teraflops and Beyond

Robert F. Lucas, High Performance Computing Research Department Head

National Energy Research Scientific Computing Center, Lawrence Berkeley National Laboratory

[email protected]

April 11, 2000

Slide 2: What is NERSC?

• The US Department of Energy, Office of Science, supercomputing center for the past 25 years

• Currently located at Lawrence Berkeley National Laboratory in the hills above the University of California, Berkeley campus


Slide 3: Outline

• A Teraflop today

• Ten Teraflops tomorrow!

• A Petaflop someday?

• Conclusions

Slide 4: Path to a Tflop/s at NERSC - Cray C90 Installed in 1991

• Cray C90 installed in December 1991
• Stable high-end production platform for seven years, until 12/31/98


Slide 5: Getting Closer to a Tflop/s - Cray T3E-900 Installed in 1996

The 696-processor T3E-900 is one of the most powerful unclassified supercomputers in the U.S.

A 1997 GAO report judged NERSC to have the best MPP utilization (75% then, over 90% today).

Slide 6: T3E Utilization

[Chart: MPP Charging and Usage, FY98. 30-day moving averages of CPU hours (mcurie, gc0, pierre, pierre free, and lost time) plotted against the machine's maximum CPU hours, with reference lines at 80%, 85%, and 90% of peak; the vertical axis runs from 0 to 18,000.]


Slide 7: We’re Almost to a Tflop/s - IBM SP3 Installed in 1999

• New contract with IBM announced in April 1999
• Phase 1 system delivered in June 1999
• Phase 2 will exceed 3 Tflop/s in Q4 2000

Slide 8: Why IBM?

[Chart: Peak Performance. Total Gflop/s (0 to 3,500) versus quarter (0 through 19), comparing NERSC-3 and NERSC-2.]


Slide 9: What About Applications?

Discipline                                   Computational technology
magnetic fusion                              particle in cell
computational chemistry, material sciences   local density functional
climate research, computational biology      partial diff. equations
QCD                                          Monte Carlo technique
accelerator design                           simulation
particle detection                           searching, pattern matching
combustion                                   PDEs
applied mathematics                          image processing

Slide 10: Good News - 1998 Gordon Bell Prize

First complete application to break the 1 Tflop/s barrier.

Collaborators from DOE’s Grand Challenge on Materials, Methods, Microstructure, and Magnetism.

1024-atom first-principles simulation of metallic magnetism in iron


Slide 11: How About DYNA?

• World record is only 6 Gflop/s, on a Fujitsu VPP 5000
  - VPP 5000 is a “cluster” of large vector nodes
  - Very high memory B/W per node
  - High-B/W crossbar interconnect

• Lots of reasons why we’re not up to 1 Tflop/s yet:
  - Meshes are irregular and dynamically changing
  - Dot products and other mathematical bottlenecks to scalability (see the sketch below)
  - Note: by contrast, the LINPACK benchmark and the LSMS Tflop/s application are dominated by matrix-matrix multiply
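A parallel dot product is the canonical scalability bottleneck: every processor must finish its local work and then join a global reduction before anyone can proceed. A minimal C/MPI sketch of the idea (the function and array names are illustrative, not from the talk):

    #include <mpi.h>

    /* Each rank reduces its local slice, then all ranks synchronize
       in a global sum. The MPI_Allreduce acts like a barrier, and its
       cost is paid on every iteration of a Krylov-type solver, unlike
       matrix-matrix multiply, where nearly all the work stays local. */
    double parallel_dot(const double *x, const double *y, int n_local)
    {
        double local = 0.0, global = 0.0;
        for (int i = 0; i < n_local; i++)
            local += x[i] * y[i];
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);
        return global;
    }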

Slide 12: Let’s Look at Sparse Solvers

• Direct solver dominates the complexity of non-linear and dynamic analysis.

• Multifrontal algorithm is the method of choice (see the sketch below):
  - Dense arithmetic kernels -> maximize speed
  - Easily goes out-of-core -> maximize size

• Why is BCSLIB-Ext world record only 8 Gflop/s?
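The multifrontal method gets its speed by recasting sparse factorization as a sequence of small dense factorizations. A heavily simplified C sketch of the dense kernel inside one frontal matrix (the name, layout, and unpivoted elimination are illustrative; production codes like BCSLIB-Ext block these loops and call BLAS-3 routines):

    /* Partial dense LU elimination of an f-by-f frontal matrix F
       (row-major). The first nelim variables are fully assembled and
       can be eliminated; the trailing block becomes the Schur
       complement, to be assembled into the parent front. */
    void eliminate_front(double *F, int f, int nelim)
    {
        for (int k = 0; k < nelim; k++) {
            double pivot = F[k * f + k];        /* no pivoting in this sketch */
            for (int i = k + 1; i < f; i++)
                F[i * f + k] /= pivot;          /* scale the pivot column (L) */
            for (int i = k + 1; i < f; i++)     /* rank-1 update of the       */
                for (int j = k + 1; j < f; j++) /* trailing submatrix: dense, */
                    F[i * f + j] -= F[i * f + k] * F[k * f + j];
        }
    }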


Slide 13: Load Balance

• Can’t simultaneously load balance:
  - Finite element assembly
  - Factorization
  - Triangular solves

Slide 14: Graph Partitioning

• How do you cut this grid in half?

• Optimal graph partitioning is NP-complete
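Because the optimal cut is NP-complete, practical partitioners rely on heuristics. A toy C sketch of one of the simplest, breadth-first level-set bisection, assuming a CSR adjacency structure (the names xadj/adj follow common convention but are illustrative; real tools use far better multilevel heuristics):

    #include <stdlib.h>

    /* Grow side 0 by breadth-first search from a seed vertex until it
       claims half the vertices; the BFS frontier at that point
       approximates a separator. Neighbors of vertex v are
       adj[xadj[v] .. xadj[v+1]-1]; part[v] is set to 0 or 1. */
    void bfs_bisect(int nv, const int *xadj, const int *adj, int *part)
    {
        int *queue = malloc(nv * sizeof(int));
        for (int v = 0; v < nv; v++) part[v] = 1;  /* all start in side 1 */
        int head = 0, tail = 0, claimed = 1;
        queue[tail++] = 0;                          /* arbitrary seed */
        part[0] = 0;
        while (head < tail && claimed < nv / 2) {
            int v = queue[head++];
            for (int e = xadj[v]; e < xadj[v + 1]; e++) {
                int w = adj[e];
                if (part[w] == 1 && claimed < nv / 2) {
                    part[w] = 0;                    /* claim w for side 0 */
                    queue[tail++] = w;
                    claimed++;
                }
            }
        }
        free(queue);
    }

On a regular grid this produces a reasonable cut; on the irregular, dynamically changing meshes of the previous slide, no cheap heuristic guarantees balanced halves with a small separator.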


Slide 15: Out-of-Core?

• What if you go out-of-core?
  - 10,000,000-equation NASTRAN problems today!

• What system today provides 1 GB/s of standard I/O?

• Triangular solves are completely I/O bound (see the arithmetic below):
  - 2 flops/word for a linear solve
  - 12 flops/word for block-shifted Lanczos
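The arithmetic behind “I/O bound” is stark. At 8 bytes per 64-bit word, even the hypothetical 1 GB/s I/O system moves only 125 Mwords/s, which caps the achievable solve rate (a back-of-the-envelope bound, not a figure from the talk):

\[
\frac{1\ \text{GB/s}}{8\ \text{bytes/word}} = 125\ \text{Mwords/s}
\]
\[
125\ \text{Mwords/s} \times 2\ \tfrac{\text{flops}}{\text{word}} = 250\ \text{Mflop/s},
\qquad
125\ \text{Mwords/s} \times 12\ \tfrac{\text{flops}}{\text{word}} = 1.5\ \text{Gflop/s}.
\]

Either way, a processor capable of hundreds of Mflop/s or more spends most of the triangular solve waiting on disk.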

Slide 16: Scalable Solvers?

• Research codes … one or more of the following usually apply:
  - Don’t exploit symmetry!
  - Don’t dynamically pivot!
  - Don’t go out-of-core!
  - Don’t map to the initial distribution of finite elements!
  - Don’t report performance of the triangular solver!

• Usually benchmarked on trivial, 3D grids!


Slide 17: More Bad News

• The HPC community left electrical engineers behind twenty years ago:
  - SPICE didn’t vectorize effectively

• Mechanical engineers are falling by the wayside now:
  - Not good for the HPC industry
  - NASTRAN was the “killer app” for Crays

• Who’s pushing the envelope?
  - Bigger problems?
  - Automatic optimization?
  - New algorithms?

Slide 18: Outline

• A Teraflop today

• Ten Teraflops tomorrow!

• A Petaflop someday?

• Conclusions


Slide 19: What Will a 10 Tflop/s System Look Like?

• The first ones are already on order:
  - Lawrence Livermore National Laboratory in the US
  - Earth System Simulator in Japan

• Systems are large clusters:
  - SMP nodes in the US
  - Vector nodes in Japan

• Programming model (a sketch follows):
  - OpenMP and/or vectors to maximize node speed
  - MPI for global communication
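A minimal C sketch of that hybrid model, assuming one MPI rank per SMP node with OpenMP threads inside it (the array, its size, and the reduction are illustrative):

    #include <mpi.h>
    #include <omp.h>

    /* Hybrid programming model: OpenMP threads exploit the shared
       memory within an SMP node; MPI handles communication between
       nodes. MPI calls are funneled through the master thread. */
    int main(int argc, char **argv)
    {
        int provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

        enum { N = 1000000 };
        static double x[N];
        double local = 0.0, global = 0.0;

        /* Node-level parallelism: the loop is split across threads. */
        #pragma omp parallel for reduction(+:local)
        for (int i = 0; i < N; i++) {
            x[i] = (double)i;
            local += x[i] * x[i];
        }

        /* Global communication: combine partial sums across nodes. */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }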

Slide 20: A New Business Model?

Transition from vertical to horizontal companies:

Vertical (old) model:
  Companies: SGI, IBM, HP, Sun
  Processors: MIPS, PowerPC, PA-RISC, SPARC
  Systems: Origin, SP, SPP, HPC
  Operating systems: Irix, AIX, Solaris
  Applications: software with MPI
  Distribution: direct sales

Horizontal (new) model:
  Processors: Intel, others
  Systems: SGI, Compaq, HP, Sun, IBM
  Operating systems: Microsoft “NT”, Linux
  Applications: software with MPI
  Distribution: mail order, retail


Slide 21: It’s Already Happening

• Forecast Systems Lab procurement
• Prime contractor is High Performance Technologies, Inc. (HPTi)
• Subcontractor is Compaq

Slide 22: It’s Really Nothing New!

• Remember EDS?
• Integration is how Ross Perot got rich!


Slide 23: Buying From Integrators Worries Me

• Integrators don’t “own” the whole system end-to-end.

• Who takes responsibility for problems?

• Who fixes third party H/W or S/W bugs?

• Who adds new features not demanded by the transaction, database, or Web-server markets?
  - Checkpointing?
  - Low-latency communication?

Slide 24: How Do Clusters of SMPs Perform?

• Good news:
  - Cheap flop/s and memory
  - PCs in LSTC’s classroom are faster than LSTC’s 8-processor Origin 2000 on an explicit problem
• Bad news (see the cost model below):
  - Global communication B/W is low
  - Communication latency is high
  - Net result will be difficulty with implicit solutions
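A conventional first-order model makes the bad news quantitative; the symbols below are the standard ones, not figures from the slides. The time to pass an n-word message is

\[
T(n) = \alpha + \frac{n}{\beta},
\]

where \(\alpha\) is the per-message latency and \(\beta\) the bandwidth. Explicit codes exchange comparatively large boundary messages, amortizing \(\alpha\) over many words; implicit codes perform frequent small global reductions (those dot products again), so their cost is dominated by \(\alpha\), which is orders of magnitude higher on a commodity cluster than on a T3E-class interconnect.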


Slide 25: Good News Example

[Chart: NAS Parallel Benchmark results, BT. Performance in Mflop/s (0 to 1,400) versus number of processors (0 to 25) for a PC cluster and the T3E-900.]

Slide 26: Bad News

[Chart: NAS Parallel Benchmark results, CG. Performance in Mflop/s (0 to 700) versus number of processors (0 to 35) for a PC cluster and the T3E-900.]


Slide 27: Downright Ugly

[Chart: NAS Parallel Benchmark results, FT. Performance in Mflop/s (0 to 1,400) versus number of processors (0 to 35) for a PC cluster and the T3E-900.]

Slide 28: Clusters Increase Power and Space Requirements

[Chart: peak Gflop/s, power in kW, and floor space in ft^2 for the 1992, 1996, and 2000 NERSC systems; the vertical axis runs from 0 to 4,000.]


Slide 29: Eight Tflop/s Floorplan

Slide 30: We’re Not Quite There Yet


Slide 31: Near-term Prediction

• Peak performance and memory will go up
• So will power and floor space
• Most big systems will be procured from integrators
• The space of applications enjoying this will shrink
• The space of applications will certainly change:
  - Materials!
  - Designer drugs?
• Implicit codes will probably not be well served at the very high end

Slide 32: Outline

• A Teraflop today

• Ten Teraflops tomorrow!

• A Petaflop someday?

• Conclusions


Slide 33: Processor in Memory Chips

• Processing in memory is an old idea
• Semiconductor technology now makes it very attractive

[Figure: Mitsubishi M32R chip.]

Slide 34: Rediscovering Vectors

• Couple this with vectors for high B/W

[Figure: UC Berkeley IRAM floorplan - a CPU and I/O block with 8 vector pipes (+ 1 spare), flanked by two memory arrays of 512 Mbits / 64 MBytes each.]


Slide 35: CMOS Petaflop/s Solution

• 100,000 10-Gflop/s vector PIM chips
• 3D mesh interconnect
• O(10^7) operations must be sustained on every clock cycle to avoid an Amdahl bottleneck (see the arithmetic below)
• IBM Blue Gene is an example
• God help the programmers!
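Amdahl’s law makes the bottleneck concrete; the derivation below is the standard one, not from the slides. If a fraction f of the work is serial, the speedup on P-fold parallel hardware is bounded:

\[
S(P) = \frac{1}{f + (1 - f)/P} \;\le\; \frac{1}{f}
\]

With on the order of 10^7 operations needed in flight every cycle, any serial fraction much above f ≈ 10^-7 idles most of the machine: at f = 10^-6, S is capped at one million, a tenth of the available parallelism.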

Slide 36: An Alternate Technology?

[Chart: projected clock rate versus year, 1995-2010, on a log scale from 100 MHz to 1 THz. Superconducting RSFQ logic scales from 10K Josephson junctions at 1.5 um, through 300K JJ at 0.8 um and 3M JJ, toward 30M JJ at 0.4 um and clock rates near 100 GHz, with HTS RSFQ (??) possibly reaching 1 THz; CMOS optical lithography scales from 0.20 um down to 0.05 um in the 1-10 GHz range.]

• Single Flux Quantum (SFQ) logic
• Operates at 4 Kelvin


Slide 37: Hybrid Technology, Multithreaded Architecture

[Diagram: the HTMT storage hierarchy, “data intensive, data-flow driven.” Superconducting high-speed processors (HSPs), each with CRAM and SRAM buffers, connect through the CNET and a Data Vortex network of processor/memory (P/M) nodes to smart main memory. Capacities: L-1 frame buffer 512 MB; L-2 frame buffer 128 GB; smart shared L-3 SRAM-PIM 1 TB; DRAM-PIM 16 TB; HRAM block 3/2 storage 1 PB. Latencies: intra-execution pipeline 10-100 cycles; to CRAM 40-400 cycles; to SRAM 400-1,000 cycles; to DRAM 10,000-40,000 cycles; DRAM to HRAM 1x10^6 to 4x10^6 cycles.]

Slide 38: HTMT Machine Room


Slide 39: HTMT Cross-Section

[Diagram: cross-section of the HTMT installation. Labels include: 4 K and 77 K (50 W) cryogenic stages cooled by helium and nitrogen; fiber/wire interconnects and a cable tray assembly; a hard disk array (40 cabinets); a tape silo array (400 silos); optical amplifiers, a WDM source, and 980 nm pumps (20 cabinets); a front-end computer, server, and console; 220-volt generators; dimensions of 0.5 m and 3 m.]

Slide 40: Outline

• A Teraflop today

• Ten Teraflops tomorrow!

• A Petaflop someday?

• Conclusions


Slide 41: It’s a Volatile Business

• For most of us:
  - PCs will become SMPs
  - Larger, more complicated analyses will be common
• At the very high end:
  - Clusters of SMPs will predominate for the next five years
  - Peak performance will keep going up
  - It’ll get harder and harder to realize
• Long-term future:
  - Architecture will change again
  - Programming model will change again
  - Application base will change again
