
Understanding Application Scaling

NAS Parallel Benchmarks 2.2 on NOW and SGI Origin 2000

Frederick Wong, Rich Martin, Remzi Arpaci-Dusseau, David Wu, and David Culler

{fredwong, rmartin, remzi, davidwu, culler}@CS.Berkeley.EDU

Department of Electrical Engineering and Computer Science

Computer Science Division

University of California, Berkeley

June 15th, 1998

Introduction

The NAS Parallel Benchmarks suite 2.2 (NPB) has been widely used to evaluate modern parallel systems

7 scientific benchmarks that represent the most common computation kernels

NPB is written on top of Message Passing Interface (MPI) for portability

NPB is a Constant Problem Size (CPS) scaling benchmark suite

This study focuses on understanding NPB scaling on both NOW and SGI Origin 2000

Speedup on NOW

[Figure: speedup (0-40) vs. number of nodes (1-32) for lu, mg, and sp, with the ideal linear curve]

Motivation: an early study of NPB shows ideal speedup on NOW!

Scaling as good as the T3D and better than the SP-2

Per-node performance better than the T3D, close to the SP-2
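Speedup curves like the one above are computed as serial time over parallel time. A minimal sketch of that bookkeeping, using made-up timings (not the measured NPB data):

```python
# Speedup under Constant Problem Size scaling: S(p) = T(1) / T(p).
# Ideal speedup on p processors is p. Timings below are hypothetical.

def speedup(t1, tp):
    """Ratio of serial running time t1 to p-processor running time tp."""
    return t1 / tp

# Hypothetical LU-style running times in seconds
t1, t32 = 2500.0, 80.0
print(f"speedup on 32 nodes: {speedup(t1, t32):.2f}x (ideal: 32x)")
```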

Speedup on SGI Origin 2000

[Figure: speedup (0-40) vs. number of nodes (1-32) for lu, mg, and sp, with the ideal linear curve]

Submitted results for Origin 2000 show a spread

Presentation Outline

Hardware Configuration
Time Breakdown of the Applications
Communication Performance
Computation Performance
Conclusion

Hardware Configuration

SGI Origin 2000 (64 nodes)
MIPS R10000 processor, 195 MHz, 32KB/32KB L1 caches
4MB external L2 cache per processor
16GB memory total
MPI performance: 13 µsec one-way latency, 150 MB/s peak bandwidth, half-power point at 8KB message size

Network Of Workstations (NOW)
UltraSPARC I processor, 167 MHz, 16KB/16KB L1 caches
512KB external L2 cache per processor
128 MB memory per processor
MPI performance: 22 µsec one-way latency, 27 MB/s peak bandwidth, half-power point at 4KB message size
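As a rough sanity check on these figures, one can plug them into the simple linear cost model T(n) = latency + n/peak, under which delivered bandwidth reaches half of peak at n = latency × peak. This is only a sketch of that model; the half-power points quoted above (8KB and 4KB) are measured values, and the fact that they exceed the model's prediction is itself a hint that microbenchmark numbers are hard to extrapolate from:

```python
# Half-power message size under a linear cost model T(n) = L + n / BW_peak.
# Delivered bandwidth n / T(n) equals BW_peak / 2 exactly when n = L * BW_peak.
# The measured half-power points (8KB on the Origin 2000, 4KB on NOW) are
# larger than this simple model predicts.

def half_power_bytes(latency_s, peak_bw_bytes_per_s):
    return latency_s * peak_bw_bytes_per_s

origin_n_half = half_power_bytes(13e-6, 150e6)  # ~1950 bytes
now_n_half = half_power_bytes(22e-6, 27e6)      # ~594 bytes
print(f"Origin 2000 model n_1/2: {origin_n_half:.0f} B, NOW model n_1/2: {now_n_half:.0f} B")
```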

Time Breakdown -- LU

Time Breakdown of LU on NOW

[Figure: cumulative, computation, communication, and ideal time (seconds, 0-3000) vs. processors (1-32)]

Black line -- total running time

Analogy: a single-man, 10-sec job ideally requires 5 secs for 2 men; the total amount of work stays 10 secs

More work means communication is needed
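The bookkeeping behind these breakdown plots can be sketched as follows: cumulative time is computation plus communication, and each is compared against the ideal T(1)/p curve. The numbers here are illustrative only, not the measured data:

```python
# Time-breakdown bookkeeping: cumulative = computation + communication,
# compared against ideal scaling T(1) / p. All values are illustrative.

def breakdown(t1, measured):
    """measured: {p: (computation_s, communication_s)} -> list of row dicts."""
    rows = []
    for p in sorted(measured):
        comp, comm = measured[p]
        rows.append({"p": p, "cumulative": comp + comm,
                     "computation": comp, "communication": comm,
                     "ideal": t1 / p})
    return rows

t1 = 2400.0  # hypothetical single-processor time
for row in breakdown(t1, {2: (1250.0, 80.0), 4: (640.0, 70.0)}):
    print(row)
```

Communication cost is what separates the cumulative curve from the ideal one: the "more work" in the analogy above shows up as the communication component.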

Time Breakdown of LU on Origin 2000

[Figure: cumulative, computation, communication, and ideal time (seconds, 0-3000) vs. processors (1-32)]


Time Breakdown -- SP

Time Breakdown on NOW

[Figure: cumulative, computation, communication, and ideal time (seconds, 0-3500) vs. processors (1-25)]

Time Breakdown on SGI

[Figure: cumulative, computation, communication, and ideal time (seconds, 0-3000) vs. processors (1-25)]

Communication Performance

Micro-benchmarks show that the SGI Origin 2000 has better pt2pt comm. performance when compared to NOW

MPI Pt2pt Latency (One-way)

[Figure: one-way latency (µsec) vs. message size (bytes), log-log axes, for Origin 2000 and NOW]

MPI Pt2pt Bandwidth (One-way)

[Figure: one-way bandwidth (MB/sec, 0-160) vs. message size for SGI and NOW, with each system's half-power point marked]

Communication Efficiency

[Figure: communication efficiency (0%-100%) vs. processors (0-40) for NOW-LU, SGI-LU, NOW-SP, and SGI-SP]

Absolute bandwidth delivered is close: SP/32 on NOW -- 215s; SP/32 on SGI -- 289s

Comm. efficiency on SGI achieves only 30% of the potential bandwidth

Protocol tradeoffs are pronounced: hand-shake vs. bulk-send in pt2pt and collective ops
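Communication efficiency here is delivered bandwidth during the run divided by the microbenchmark peak. A sketch with illustrative byte and time figures (only the ~30%-of-peak outcome reflects the slide's SGI observation):

```python
# Communication efficiency: bandwidth delivered during the run divided by
# the microbenchmark peak. The byte count and time below are made up for
# illustration; only the ~30%-of-peak result mirrors the SGI observation.

def comm_efficiency(bytes_moved, comm_time_s, peak_bw_bytes_per_s):
    delivered = bytes_moved / comm_time_s
    return delivered / peak_bw_bytes_per_s

eff = comm_efficiency(bytes_moved=13.5e9, comm_time_s=300.0,
                      peak_bw_bytes_per_s=150e6)
print(f"communication efficiency: {eff:.0%}")
```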

Computation Performance

Relative performance of the benchmarks on a single node is roughly close to the processor performance difference

Single-node running times (seconds):

          LU     SP
SGI     1373   1652
NOW     2469   2807

Both computational CPI and L2 misses change significantly on both platforms when scaled

                        LU    SP
CPI decrease           94%   93%
L2 misses decrease     25%   27%

Recap on CPS Scaling

LU Working Set

[Figure: miss rate (%, 0-14) vs. cache size (1KB-10MB, log scale) for 4-, 8-, 16-, and 32-node runs]

4-processor knee starts at 256KB
8-processor knee starts at 128KB
16-processor knee starts at 64KB
32-processor knee starts at 32KB

Miss rate drops as the global cache grows from 2MB to 4MB

[Figure: miss rate (%, 0-25) vs. cache size (1KB-10MB, log scale) for 4-, 8-, 16-, and 32-node runs]

Cost under scaling: extra work worsens the memory system's performance

SP Working Set

Total memory references on SGI:

the 4-processor run makes 64.38 billion memory references

the 25-processor run makes 72.35 billion memory references

-- a 12.38% increase
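The 12.38% figure follows directly from the two reference counts:

```python
# Relative increase in memory references when scaling SP from 4 to 25
# processors (reference counts from the slide, in billions).
refs_4p, refs_25p = 64.38, 72.35
increase = (refs_25p - refs_4p) / refs_4p
print(f"extra memory references under scaling: {increase:.2%}")  # 12.38%
```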

Cost/Benefit

Conclusion

NPB:
micro-benchmarks are hard to use to predict comm. performance
global cache increases effectively reduce computation time
sequential node architecture is a dominant factor in NPB performance

NOW:
an inexpensive way to go parallel
absolute performance is excellent
MPI on NOW has good scalability and performance
NOW vs. a proprietary system -- detailed instrumentation ability

Speedup cannot tell the whole story; scalability involves:
the interplay of program and machine scaling
delivered comm. performance, not micro-benchmark numbers
complicated memory system performance
