
Page 1: The Politics and Economics of Parallel Computing Performance

The Politics and Economics of Parallel Computing Performance

Allan Snavely
UCSD Computer Science Dept. & SDSC

Page 2: The Politics and Economics of Parallel Computing Performance

San Diego Supercomputer Center
Performance Modeling and Characterization Lab (PMaC)

Computnik

Not many of us (not even me) are old enough to remember Sputnik

But recently U.S. technology received a similar shock

Page 3: The Politics and Economics of Parallel Computing Performance


Japanese Earth Simulator

The world’s most powerful computer

Page 4: The Politics and Economics of Parallel Computing Performance


Top500.org

HIGHLIGHTS FROM THE TOP 10

The Earth Simulator, built by NEC, remains the unchallenged #1 at more than 30 TFlop/s.

The cost is conservatively $500M.

Page 5: The Politics and Economics of Parallel Computing Performance


ASCI Q at Los Alamos is #2 at 13.88 TFlop/s.

The third system ever to exceed the 10 TFlop/s mark is Virginia Tech's X, measured at 10.28 TFlop/s. This cluster is built with the Apple G5 as its building block and is often referred to as the "SuperMac".

The fourth system is also a cluster. The Tungsten cluster at NCSA is a Dell PowerEdge-based system using a Myrinet interconnect. It just missed the 10 TFlop/s mark with a measured 9.82 TFlop/s.

Page 6: The Politics and Economics of Parallel Computing Performance


More top 500

The list of clusters in the TOP10 continues with the upgraded Itanium2-based Hewlett-Packard system, located at DOE's Pacific Northwest National Laboratory, which uses a Quadrics interconnect.

#6 is the first system in the TOP500 based on AMD's Opteron chip. It was installed by Linux Networx at Los Alamos National Laboratory and also uses a Myrinet interconnect.

With the exception of the leading Earth Simulator, all other TOP10 systems are installed in the U.S.

The performance of the #10 system jumped to 6.6 TFlop/s.

Page 7: The Politics and Economics of Parallel Computing Performance


The fine print

But how is performance measured?

Linpack is very compute intensive, not very memory or communications intensive, and it scales perfectly!
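A roofline-style sketch makes the point concrete: a benchmark with high arithmetic intensity runs near peak flops, while a memory-bound code on the same machine does not. The peak and bandwidth figures below are invented for illustration, not numbers from the talk.

```python
# Roofline-style sketch of why a compute-intensive benchmark like Linpack
# overstates delivered performance for memory-bound codes.
# Peak and bandwidth figures are invented for illustration.

def attainable_gflops(peak_gflops, bw_gbs, flops_per_byte):
    """Attainable rate = min(peak compute, bandwidth * arithmetic intensity)."""
    return min(peak_gflops, bw_gbs * flops_per_byte)

# Hypothetical machine: 8 GFlop/s peak, 2 GB/s sustained memory bandwidth.
print(attainable_gflops(8.0, 2.0, 16.0))   # Linpack-like kernel: 8.0 (at peak)
print(attainable_gflops(8.0, 2.0, 0.25))   # memory-bound kernel: 0.5 (~6% of peak)
```

The flops-only ranking sees only the first number; real applications often live at the second.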

Page 8: The Politics and Economics of Parallel Computing Performance


Axiom: You get what you ask for (or what you measure for)

Measures of goodness: macho image; big gas tank; cargo space; drive it offroad; Arnold drives one

Measures of goodness: trendy Euro image; fuel efficiency; parking space; drive it on narrow streets; Herr Schroeder drives one

Page 9: The Politics and Economics of Parallel Computing Performance


HPC Users Forum and metrics

From the beginning we dealt with political issues:
You get what you ask for (Top500 Macho Flops)
Policy makers need a number (Macho Flops)
You measure what makes you look good (Macho Flops)

Technical issues:
Recent reports (HECRTF, SCALES) echo our earlier consensus that time-to-solution (TTS) is the HPC metric
But TTS is complicated and problem dependent (and policy makers need a number)
Is it even technically feasible to encompass TTS in one or a few low-level metrics?

Page 10: The Politics and Economics of Parallel Computing Performance


A science of performance

A model is a calculable explanation of why a {program, application, input, …} tuple performs as it does

It should yield a prediction (a quantifiable objective). Accurate predictions of observable performance points give you some confidence in the methods (for example, to allay fears of perturbation via intrusion).

Performance models embody understanding of the factors that affect performance:
Inform the tuning process (of application and machine)
Guide applications to the best machine
Enable applications-driven architecture design
Extrapolate to the performance of future systems


Page 11: The Politics and Economics of Parallel Computing Performance


Goals for performance modeling tools and methods

Performance should map back to a small set of orthogonal benchmarks

Generation of performance models should be automated, or at least as regular and systematized as possible

Performance models must be time-tractable

Error is acceptable if it is bounded and still allows meeting these objectives

Taking these principles to extremes would allow dynamic, automatic performance improvement via adaptation (this is open research)


Page 12: The Politics and Economics of Parallel Computing Performance


A useful framework

Machine Profiles: characterizations of the rates at which a machine can (or is projected to) carry out fundamental operations, abstracted from any particular application

Application Signatures: detailed summaries of the fundamental operations to be carried out by the application, independent of any particular machine

Combine a Machine Profile and an Application Signature using:

Convolution Methods: algebraic mappings of the Application Signature onto the Machine Profile to arrive at a performance prediction
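A minimal sketch of this framework, with all rates and operation counts invented for illustration: the machine profile holds sustained rates, the application signature holds operation counts, and the convolution maps one onto the other to predict time.

```python
# Minimal sketch of the Machine Profile / Application Signature / Convolution
# framework. All rates and counts below are made-up illustrations.

machine_profile = {            # rates the machine sustains (ops per second)
    "flops": 1.5e9,
    "mem_stride1": 4.0e8,      # stride-1 memory accesses
    "mem_random": 5.0e7,       # random memory accesses
}

app_signature = {              # operation counts the application performs
    "flops": 3.0e9,
    "mem_stride1": 8.0e8,
    "mem_random": 1.0e8,
}

def convolve(profile, signature):
    """Simplest algebraic mapping: predicted time = sum of counts / rates."""
    return sum(signature[op] / profile[op] for op in signature)

predicted_seconds = convolve(machine_profile, app_signature)
print(predicted_seconds)  # 2.0 + 2.0 + 2.0 = 6.0
```

The same signature convolved with a different machine's profile yields that machine's prediction, which is what lets the method compare "apples and oranges."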


Page 13: The Politics and Economics of Parallel Computing Performance


PMaC HPC Benchmark Suite

The goal is to develop means to infer the execution time of full applications at scale from low-level metrics taken on (smaller) prototype systems
To do this in a systematic, even automated way
To be able to compare apples and oranges
To enable wide workload characterizations
To keep the number of metrics compact: add metrics only to increase resolution

Go to the web page www.sdsc.edu/PMaC

Page 14: The Politics and Economics of Parallel Computing Performance


Machine Profiles: Single Processor Component (MAPS)

Machine Profiles are useful for:
revealing the underlying capability of the machine
comparing machines

Machine Profiles are produced by MAPS (Memory Access Pattern Signature), which, along with the rest of the PMaC HPC Benchmark Suite, is available at www.sdsc.edu/PMaC
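To illustrate the kind of measurement a MAPS-style probe makes, here is a toy sketch (not the actual MAPS code, which is a tuned benchmark in the PMaC suite): time a stride-1 sweep versus a random-order sweep of the same array. Python interpreter overheads dominate, so the numbers are only qualitative.

```python
# Toy MAPS-style probe (illustrative only): compare a stride-1 traversal
# with a random-order traversal of the same data to show how access
# pattern changes the sustained memory rate a machine profile records.
import random
import time

N = 1_000_000
data = list(range(N))
stride1 = list(range(N))          # visit order 0, 1, 2, ...
shuffled = stride1[:]
random.shuffle(shuffled)          # same indices, random visit order

def sweep(indices):
    """Sum data[] in the given visit order, returning elapsed seconds."""
    start = time.perf_counter()
    total = 0
    for i in indices:
        total += data[i]
    return time.perf_counter() - start

t_stride1 = sweep(stride1)
t_random = sweep(shuffled)
print(f"stride-1: {t_stride1:.3f}s  random: {t_random:.3f}s")
```

A real probe would run these sweeps in C over working sets sized to each cache level, producing the bandwidth-versus-pattern curves a machine profile needs.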

Page 15: The Politics and Economics of Parallel Computing Performance


Convolutions put the two together: modeling deep memory hierarchies

MetaSim trace collected on a PETSc matrix-vector code on 4 CPUs, with user-supplied memory parameters for PSC's TCSini:

Procedure Name | % Mem. Ref. | Random Ratio | L1 Hit Rate | L2 Hit Rate | Data Set Location in Memory | Memory BW (MAPS)
dgemv_n  | 0.9198 | 0.07 | 93.47 | 93.48 | L1/L2 Cache | 4166.0
dgemv_n  | 0.0271 | 0.00 | 90.33 | 90.39 | Main Memory | 1809.2
dgemv_n  | 0.0232 | 0.00 | 94.81 | 99.89 | L1 Cache    | 5561.3
MatSet.  | 0.0125 | 0.20 | 77.32 | 90.00 | L2 Cache    | 1522.6

Single-processor or per-processor performance combines:
Machine Profile for the processor (Machine A)
Application Signature for the application (App. #1)

The relative "per-processor" performance of App. #1 on Machine A is represented as the MetaSim Number, the sum over the n basic blocks BB_i of MemOps(BB_i) / MemRate(BB_i).
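The MetaSim Number can be sketched directly from the basic-block table above: each block contributes its memory traffic divided by the rate MAPS predicts for where its data set lives. The total traffic figure is invented, and the MAPS bandwidths are assumed to be MB/s; only the fractions and bandwidths come from the table.

```python
# Sketch of the MetaSim Number for the basic-block table above.
# Fractions and bandwidths are from the table; total traffic is invented,
# and MB/s units for the MAPS column are an assumption.

blocks = [  # (fraction of memory refs, MAPS bandwidth)
    (0.9198, 4166.0),   # dgemv_n, L1/L2 cache
    (0.0271, 1809.2),   # dgemv_n, main memory
    (0.0232, 5561.3),   # dgemv_n, L1 cache
    (0.0125, 1522.6),   # MatSet., L2 cache
]

total_bytes = 1.0e9   # hypothetical total memory traffic in bytes

# MetaSim Number = sum over basic blocks of MemOps_i / MemRate_i
metasim = sum(frac * total_bytes / (bw * 1.0e6) for frac, bw in blocks)
print(round(metasim, 3))   # predicted per-processor memory time, seconds
```

Note how the main-memory block, with under 3% of the references, still contributes noticeably because its rate is so much lower than the cache-resident blocks.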

Page 16: The Politics and Economics of Parallel Computing Performance


MetaSim: CPU events convolver; pick simple models to apply to each basic block

Output: 5 different convolutions.
Meta1: Mem. time
Meta2: Mem. time + FP time
Meta3: MAX(Mem. time, FP time)
Meta4: 0.5*Mem. time + 0.5*FP time
Meta5: 0.9*Mem. time + 0.1*FP time
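The five convolutions encode different assumptions about how memory and floating-point work overlap, from no overlap (Meta2) to full overlap (Meta3). Applied to one basic block with invented times:

```python
# The five MetaSim convolutions, applied to one basic block with invented
# memory and floating-point times (seconds).
mem_time, fp_time = 4.0, 1.0

meta1 = mem_time                          # memory time only
meta2 = mem_time + fp_time                # no overlap of memory and FP work
meta3 = max(mem_time, fp_time)            # perfect overlap
meta4 = 0.5 * mem_time + 0.5 * fp_time    # even mix
meta5 = 0.9 * mem_time + 0.1 * fp_time    # memory-dominated mix
print(meta1, meta2, meta3, meta4, meta5)
```

Comparing which convolution best matches measured runtimes tells you how a given processor actually overlaps the two kinds of work.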

Page 17: The Politics and Economics of Parallel Computing Performance


Dimemas: communications events convolver

Simple communication models applied to each communication event
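The kind of simple per-event model Dimemas applies can be sketched as a linear latency-plus-bandwidth cost for each message. The network parameters below are illustrative, not measured values.

```python
# Sketch of a simple per-event communication model of the kind Dimemas
# applies: each point-to-point message costs latency plus size/bandwidth.
# Parameter values are illustrative.

def comm_time(msg_bytes, latency_s, bandwidth_bytes_per_s):
    """Predicted wall-clock time for one communication event."""
    return latency_s + msg_bytes / bandwidth_bytes_per_s

# Hypothetical 1 MB message on a 5-microsecond-latency, 300 MB/s network:
t = comm_time(1_000_000, 5e-6, 300e6)
print(t)
```

Replaying a trace of communication events through this model with one machine's parameters, then another's, predicts how the application's communication time changes across networks.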

Page 18: The Politics and Economics of Parallel Computing Performance


POP results graphically

[Figure 2: Real (R) versus Predicted-by-Model (M) times for the POP x1 benchmark (POP 1.4.3). Y-axis: seconds per simulation day (0 to 120); X-axis: processors (16, 32, 64, 128). Systems shown, each with real and modeled curves: Lemieux, Blue Horizon, Longhorn, Seaborg, X1.]


Page 19: The Politics and Economics of Parallel Computing Performance


Quality of model predictions for POP

Table 2: Real versus Predicted-by-Model wall-clock times for the POP x1 benchmark. (X1 at 16 processors: real time 9.21 seconds, predicted time 9.79 seconds, error 6.3%.)

Blue Horizon:
# of pe's | Real Time (sec) | Predicted Time (sec) | Error
16  | 204.92 | 214.29 | -5%
32  | 115.23 | 118.25 | -3%
64  |  62.64 |  63.03 |  1%
128 |  46.77 |  40.60 | 13%

Lemieux:
# of pe's | Real Time (sec) | Predicted Time (sec) | Error
16  | 125.35 | 125.75 |   0%
32  |  64.02 |  71.49 | -11%
64  |  35.04 |  36.55 |  -4%
128 |  22.76 |  20.35 |  11%

Longhorn:
# of pe's | Real Time (sec) | Predicted Time (sec) | Error
16  |  93.94 |  95.15 |  -1%
32  |  51.38 |  53.30 |  -4%
64  |  27.46 |  24.45 |  11%
128 |  19.65 |  15.99 |  16%

Seaborg:
# of pe's | Real Time (sec) | Predicted Time (sec) | Error
16  | 204.30 | 200.07 |   2%
32  | 108.16 | 123.10 | -14%
64  |  54.07 |  63.19 | -17%
128 |  45.27 |  42.35 |   6%

POP has been ported to a wide variety of computers for eddy-resolving simulations of the world oceans and for climate simulations as the ocean component of coupled climate models.


Page 20: The Politics and Economics of Parallel Computing Performance


Explaining Relative Performance of POP

In Case 1, we model the effect of reducing the bandwidth of BH's network to that of a single rail of the Quadrics switch. There is no discernible performance effect, as the POP x1 problem at this size is not sensitive to a change in peak network bandwidth from 350 MB/s to 269 MB/s.

In Case 2, we model the effect of replacing the Colony switch with the Quadrics switch. There is a significant performance improvement due to the 5 µs latency of the Quadrics switch versus the 19 µs latency of the Colony switch. This is because the barotropic calculations in POP x1 at this size are latency sensitive.

In Case 3, we use Quadrics latency but Colony bandwidth, just for completeness.

In Case 4, we model keeping the Colony switch latencies and bandwidths but replacing the Pwr3 processors and local memory subsystem with Alpha ES45 processors and their memory subsystem. There is a substantial improvement in performance due mainly to the faster memory subsystem of the Alpha. The Alpha can load stride-1 data from its L2 cache at about twice the rate of the Pwr3, and this benefits POP x1 a lot.

The last set of bars shows the values of TCS performance, processor and memory subsystem speed, network bandwidth, and network latency as ratios to BH's values.
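The Case 2 reasoning can be sketched with a linear communication model. The switch parameters (19 µs / 350 MB/s for Colony, 5 µs / 269 MB/s for Quadrics) come from the text above; the message count and size are invented to represent a latency-sensitive phase with many small messages.

```python
# Sketch of the Case 2 reasoning: for a phase dominated by many small
# messages (as in POP's barotropic solver), switch latency matters far
# more than switch bandwidth. Message count and size are invented;
# switch parameters are from the text.

def phase_time(n_msgs, msg_bytes, latency_s, bw_bytes_per_s):
    """Total time for a phase of n point-to-point messages."""
    return n_msgs * (latency_s + msg_bytes / bw_bytes_per_s)

n_msgs, msg_bytes = 100_000, 64   # many tiny messages (hypothetical)

colony = phase_time(n_msgs, msg_bytes, 19e-6, 350e6)    # Colony switch
quadrics = phase_time(n_msgs, msg_bytes, 5e-6, 269e6)   # Quadrics switch

# Despite Quadrics' lower bandwidth, its lower latency dominates here:
print(colony, quadrics)
```

With these numbers the Quadrics phase runs several times faster even though its bandwidth is lower, which is exactly the Case 1 versus Case 2 contrast.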

[Bar chart: POP performance, processor and memory subsystem speed, network bandwidth, and network latency (as ratios to BH, scale 0 to 2.5) for PWR3 + Colony (BH), Cases 1 through 4, and Alpha ES45 + Quadrics (TCS).]

Page 21: The Politics and Economics of Parallel Computing Performance


POP Performance Sensitivity

[Figure 4: POP performance sensitivity for 128-cpu POP x1, plotting 1/execution time against normalized processor performance, network latency, and network bandwidth. Axes are log-scale and normalized to 1; the black quadrilateral represents the {execution time, network bandwidth, network latency, CPU and memory subsystem performance} of BH.]

Page 22: The Politics and Economics of Parallel Computing Performance


Practical uses

DoD HPCMO procurement cycle:
Identify strategic applications
Identify candidate machines
Run PMaC HPC Benchmark Probes on (prototypes of) machines
Use tools to model applications on exemplary inputs
Generate performance expectations
Input to a solver that factors in performance, cost, architectural diversity, and the whim of the program director

DARPA HPCS program:
Help vendors evaluate the performance impacts of proposed architectural features

Page 23: The Politics and Economics of Parallel Computing Performance


Acknowledgments

This work was sponsored in part by the Department of Energy Office of Science through the SciDAC award “High-End Computer System Performance: Science and Engineering”. This work was sponsored in part by the Department of Defense High Performance Computing Modernization Program office through the award “HPC Applications Benchmarking”. This research was sponsored in part by DARPA through the award “HEC Metrics”. This research was supported in part by NSF cooperative agreement ACI-9619020 through computing resources provided by the National Partnership for Advanced Computational Infrastructure at the San Diego Supercomputer Center. Computer time was provided by the Pittsburgh Supercomputing Center, the Texas Advanced Computing Center, Oak Ridge National Laboratory, and ERDC. We would like to thank Francesc Escale of CEPBA for all his help with Dimemas, and Pat Worley for all his help with POP.