
Page 1:

PACT 98

http://www.research.microsoft.com/barc/gbell/pact.ppt

Page 2:

Gordon Bell

Microsoft

What Architectures? Compilers?

Run-time environments? Programming models?

… Any Apps?

Parallel Architectures and Compilation Techniques, Paris, 14 October 1998

Page 3:

Talk plan

Where are we today?

History… predicting the future:
– Ancient
– Strategic Computing Initiative and ASCI
– Bell Prize since 1987
– Apps & architecture taxonomy

Petaflops: when, … how, how much

New ideas: Grid, Globus, Legion

Bonus: Input to Thursday panel

Page 4:

1998: ISVs, buyers, & users?

Technical: supers dying; DSM (and SMPs) trying
– Mainline: user & ISV apps ported to PCs & workstations
– Supers (legacy code) market lives on...
– Vector apps (e.g. ISVs) ported to DSM (& SMP)
– MPI for custom and a few leading-edge ISVs
– Leading-edge, one-of-a-kind apps: clusters of 16, 256, … 1000s built from uni, SMP, or DSM

Commercial: mainframes, SMPs (& DSMs), and clusters are interchangeable (control is the issue)
– Dbase & TP: SMPs compete with mainframes if central control is an issue, else clusters
– Data warehousing: may emerge… just a Dbase
– High-growth web and stream servers: clusters have the advantage

Page 5:

c2000 Architecture Taxonomy

[Diagram] SMP line: Xpt-connected SMPs; Xpt-SMP vector; Xpt-multithread (Tera); “multi”; Xpt-“multi” hybrid; DSM-SCI (commodity); DSM (high bandwidth). Built from commodity “multis” & switches, proprietary “multis” & switches, or proprietary DSMs. Multicomputers aka clusters … MPP: 16-(64)-10K processors. Two of the paths are marked “mainline”.

Page 6:

[Chart] TOP500 Technical Systems by Vendor (sans PC and mainframe clusters), Jun-93 through Jun-98; y-axis: number of systems, 0–500. Vendors: CRI, SGI, IBM, Convex, HP, Sun, TMC, Intel, DEC, Japanese, other.

Page 7:

[Pie charts] Parallelism of Jobs on NCSA Origin Cluster, by # of jobs and by CPU delivered. Job-size bins (# CPUs): 1, 2, 3-4, 5-8, 9-16, 17-32, 33-64, 65-128. 20 weeks of data, March 16 - Aug 2, 1998: 15,028 jobs / 883,777 CPU-hrs.

Page 8:

[Chart] How are users using the Origin Array? CPU-hrs delivered (0–120,000) by # CPUs (1, 2, 3-4, 5-8, 9-16, 17-32, 33-64, 65-128) and by Mem/CPU (MB): 0-64, 64-128, 128-256, 256-384, 384-512, 512+.

Page 9:

National Academic Community Large Project Requests September 1998

Source: National Resource Allocation Committee

Over 5 Million NUs Requested

One NU = One XMP Processor-Hour

[Pie chart segments: Vector, MPP, DSM]

Page 10:

GB's Estimate of Parallelism in Engineering & Scientific Applications

[Chart] y-axis: log(# apps); x-axis: granularity & degree of coupling (comp./comm.), spanning supers, PCs/WSs, and clusters aka MPPs aka multicomputers (scalable multiprocessors); apps range from dusty decks for supers to new or scaled-up apps.

– scalar: 60%
– vector: 15%
– vector & //: 5%
– one-of >> //: 5%
– embarrassingly & perfectly parallel: 15%

Gordon's WAG

Page 11:

Application Taxonomy

Technical:
– General purpose, non-parallelizable codes (PCs have it!)
– Vectorizable
– Vectorizable & //able (supers & small DSMs)
– Hand tuned, one-of
– MPP coarse grain
– MPP embarrassingly // (clusters of PCs...)

Commercial:
– Database
– Database/TP
– Web host
– Stream audio/video

If central control & rich, then IBM or large SMPs; else PC clusters.

Page 12:

[Chart] One-processor performance as % of Linpack, for T90, C90, SPP-2000, SP2-160, Origin195, PCA (y-axis: 0–1800; series: Linpack and apps average). Application areas: CFD, biomolecular, chemistry, materials, QCD. Apps averages shown as % of one-processor Linpack: 22%, 14%, 19%, 33%, 26%, 25%.

Page 13:

[Chart] For T90, C90, SPP, SP2/160, Origin195, PCA (y-axis 0–35): 10-processor Linpack (Gflops); 10-processor apps x10; apps as % of 1-processor Linpack; apps as % of 10-processor Linpack. Gordon's WAG.

Page 14:

Ancient history

Page 15:

Growth in Computational Resources Used for UK Weather Forecasting

[Chart] Log scale, 10 to 10T, 1950–2000. Machines: Leo, Mercury, KDF9, 195, 205, YMP. Total growth: 10^10 in 50 years, i.e. 1.58^50 (about 58%/year sustained).
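A quick check of the growth annotation (my arithmetic, not on the slide): a factor of $10^{10}$ over 50 years is

\[
10^{10/50} = 10^{0.2} \approx 1.58 \ \text{per year}, \qquad \text{so } 1.58^{50} \approx 10^{10}.
\]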

Page 16:

Harvard Mark I aka IBM ASCC

Page 17:

“I think there is a world market for maybe five computers.”

Thomas Watson Senior, Chairman of IBM, 1943

Page 18:

The scientific market is still about that size… 3 computers

When scientific processing was 100% of the industry, it was a good predictor

$3 Billion: 6 vendors, 7 architectures

DOE buys 3 very big ($100-$200 M) machines every 3-4 years

Page 19:

NCSA cluster of 6 x 128-processor SGI Origins

Page 20:

Our Tax Dollars At Work: ASCI for Stockpile Stewardship

– Intel/Sandia: 9000 x 1-node PPro
– LLNL/IBM: 512 x 8 PowerPC (SP2)
– LANL/Cray: ?
– Maui Supercomputer Center: 512 x 1 SP2

Page 21:

“LARC doesn’t need 30,000 words!” --Von Neumann, 1955.

During the review, someone said: “von Neumann was right. 30,000 words was too much IF all the users were as skilled as von Neumann ... for ordinary people, 30,000 was barely enough!” -- Edward Teller, 1995

The memory was approved. Memory solves many problems!

Page 22:

“Parallel processing computer architectures will be in use by 1975.”

Navy Delphi Panel, 1969

Page 23:

“In Dec. 1995 computers with 1,000 processors will do most of the scientific processing.”

Danny Hillis, 1990 (1 paper or 1 company)

Page 24:

The Bell-Hillis Bet: Massive Parallelism in 1995

TMC vs. world-wide supers, compared on three measures: applications, revenue, and petaflops/mo.

Page 25:

Bell-Hillis Bet: wasn’t paid off!

My goal was not necessarily to just win the bet!

Hennessy and Patterson were to evaluate what was really happening…

Wanted to understand degree of MPP progress and programmability

Page 26:

DARPA, 1985 Strategic Computing Initiative (SCI):

“A 50X LISP machine” -- Tom Knight, Symbolics

“A 1,000-node multiprocessor; a teraflops by 1995” -- Gordon Bell, Encore

All of ~20 HPCC projects failed!

Page 27:

SCI (c1980s): Strategic Computing Initiative funded

ATT/Columbia (Non Von), BBN Labs, Bell Labs/Columbia (DADO), CMU Warp (GE & Honeywell), CMU (Production Systems), Encore, ESL, GE (like connection machine), Georgia Tech, Hughes (dataflow), IBM (RP3), MIT/Harris, MIT/Motorola (Dataflow), MIT Lincoln Labs, Princeton (MMMP), Schlumberger (FAIM-1), SDC/Burroughs, SRI (Eazyflow), University of Texas, Thinking Machines (Connection Machine),

Page 28:

Those who gave up their lives in SCI's search for parallelism

Alliant, American Supercomputer, Ametek, AMT, Astronautics, BBN Supercomputer, Biin, CDC (independent of ETA), Cogent, Culler, Cydrome, Denelcor, Elxsi, ETA, Evans & Sutherland Supercomputers, Flexible, Floating Point Systems, Gould/SEL, IPM, Key, Multiflow, Myrias, Pixar, Prisma, SAXPY, SCS, Supertek (part of Cray), Suprenum (German National effort), Stardent (Ardent+Stellar), Supercomputer Systems Inc., Synapse, Vitec, Vitesse, Wavetracer.

Page 29:

Worlton: "Bandwagon Effect"explains massive parallelismBandwagon: A propaganda device by which

the purported acceptance of an idea ...is claimed in order to win further public acceptance.

Pullers: vendors, CS community Pushers: funding bureaucrats & deficit Riders: innovators and early adopters4 flat tires:

training, system software, applications, and "guideposts"

Spectators: most users, 3rd party ISVs

Page 30:

Parallel processing is a constant distance away.

“Our vision ... is a system of millions of hosts… in a loose confederation. Users will have the illusion of a very powerful desktop computer through which they can manipulate objects.”

Grimshaw, Wulf, et al., “Legion”, CACM, Jan. 1997

Page 31:

Progress

"Parallelism is a journey.*"

*Paul Borrill

Page 32:

Let us not forget:

“The purpose of computing is insight, not numbers.”

R. W. Hamming

Page 33:

Progress 1987-1998

Page 34:

[Chart] Bell Prize peak Gflops vs. time, 1986–2000 (log scale, 0.1–1000 Gflops).

Page 35:

Bell Prize: 1000x, 1987–1998

– 1987: Ncube, 1,000 computers; showed that with more memory, apps scaled
– 1987: Cray XMP, 4 proc. @ 200 Mflops/proc
– 1996: Intel, 9,000 proc. @ 200 Mflops/proc
– 1998: 600 RAP Gflops Bell Prize

Parallelism gains:
– 10x in parallelism over Ncube
– 2000x in parallelism over XMP

Spend 2-4x more. Cost effectiveness: 5x (ECL → CMOS; SRAM → DRAM). Moore's Law = 100x. Clock: 2-10x (CMOS-ECL speed cross-over).

Page 36:

No more 1000X/decade. We are now (hopefully) only limited by Moore's Law and not by memory access.

– 1 GF to 10 GF took 2 years
– 10 GF to 100 GF took 3 years
– 100 GF to 1 TF took >5 years

2n+1 or 2^(n-1)+1?
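One way to read the closing quip (my arithmetic, not on the slide): if step $n$ is the $n$-th factor-of-10 climb, the two candidate formulas give

\[
2^{n-1}+1 = 2,\ 3,\ 5 \quad (n = 1,2,3), \qquad 2n+1 = 3,\ 5,\ 7,
\]

so the observed 2, 3, and >5 years fit the exponential form; notably both forms predict about 9 years for the next step ($n = 4$: $2^{3}+1 = 2\cdot4+1 = 9$).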

Page 37:

[Chart] Commercial perf/$: $/tpmC vs. time, Mar-94 through Jun-97 (log scale, $10–$1,000). 250%/year improvement!

Page 38:

[Chart] Commercial performance: tpmC vs. time, Mar-94 through Jun-97 (log scale, 100–100,000). 250%/year improvement!

Page 39:

1998 Observations vs. 1989 Predictions for technical computing

– Got a TFlops PAP 12/1996 vs. predicted 1995. Really impressive progress! (RAP < 1 TF)
– More diversity… results in NO software!
  – Predicted: SIMD, mC; hoped for scalable SMP
  – Got: supers, mCv, mC, SMP, SMP/DSM; SIMD disappeared
– $3B (un-profitable?) industry; 10 platforms
– PCs and workstations diverted users
– MPP apps DID NOT materialize

Page 40:

Observation: CMOS supers replaced ECL in Japan

– 2.2 Gflops vector units have dual use: in traditional mPv supers, and as the basis for computers in mC
– Software apps are present
– A vector processor out-performs n micros for many scientific apps
– It's memory bandwidth, cache prediction, and inter-communication

Page 41:

Observation: price & performance

– Breaking the $30M barrier increases PAP
– Eliminating “state computers” increased prices, but got fewer, more committed suppliers, less variation, and more focus
– Commodity micros (aka Intel) are critical to improvement. DEC, IBM, and Sun are ??
– Conjecture: supers and MPPs may be equally cost-effective despite PAP
  – Memory bandwidth determines performance & price
  – “You get what you pay for” aka “there's no free lunch”

Page 42:

Observation: MPPs 1, Users <1

MPPs, with relatively low-speed micros and lower memory bandwidth, ran over supers, but didn't kill 'em.

Did the U.S. industry enter an abyss?
– Is crying “unfair trade” hypocritical?
– Are users denied tools?
– Are users not “getting with the program”?

Challenge: we must learn to program clusters... (see the sketch below)
– Cache idiosyncrasies
– Limited memory bandwidth
– Long inter-communication delays
– Very large numbers of computers
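As a concrete taste of the inter-communication delays named above, a minimal MPI ping-pong sketch (mpi4py and a 2-rank launch are my assumptions, not part of the slide):

```python
# Minimal MPI ping-pong latency probe. Run: mpiexec -n 2 python pingpong.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
buf = np.zeros(1, dtype='b')   # 1-byte message isolates latency from bandwidth
reps = 1000

comm.Barrier()
t0 = MPI.Wtime()
for _ in range(reps):
    if rank == 0:
        comm.Send(buf, dest=1)
        comm.Recv(buf, source=1)
    else:
        comm.Recv(buf, source=0)
        comm.Send(buf, dest=0)
comm.Barrier()

if rank == 0:
    # Each rep is one round trip; half of that approximates one-way latency.
    print(f"~{(MPI.Wtime() - t0) / reps / 2 * 1e6:.1f} us one-way latency")
```

On a late-1990s cluster interconnect this measures the microseconds-per-message cost that dominates fine-grained parallel codes.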

Page 43:

Strong recommendation: utilize in situ workstations!

– NOW (Berkeley) set the sort record, decrypting
– Grid, Globus, Condor and other projects
– Need a “standard” interface and programming model for clusters using “commodity” platforms & fast switches
– Giga- and tera-bit links and switches allow geo-distributed systems
– Each PC in a computational environment should have an additional 1GB/9GB!

Page 44:

“Petaflops by 2010”

DOE Accelerated Strategic Computing Initiative (ASCI)

Page 45:

DOE's 1997 “PathForward” Accelerated Strategic Computing Initiative (ASCI)

– 1997: 1-2 Tflops, $100M
– 1999-2001: 10-30 Tflops, $200M??
– 2004: 100 Tflops
– 2010: Petaflops

Page 46:

“When is a Petaflops possible? What price?”

Gordon Bell, ACM 1997

– Moore's Law: 100x (but how fast can the clock tick?)
– Increase parallelism 10K → 100K: 10x
– Spend more ($100M → $500M): 5x
– Centralize center or fast network: 3x
– Commoditization (competition): 3x

(The product of these factors is worked below.)
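Multiplying the slide's factors (my arithmetic, assuming they compound independently):

\[
100 \times 10 \times 5 \times 3 \times 3 = 45{,}000,
\]

comfortably more than the 1,000x needed to get from the 1997 teraflops to a petaflops, so several factors can fall short and 2010 still be plausible.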

Page 47:

[Chart] Micros' gains at 20, 40, & 60%/year, 1995–2045 (log scale, 1e6–1e21 ops): 20%/year ⇒ teraops, 40%/year ⇒ petaops, 60%/year ⇒ exaops.
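A small sketch of the compounding behind the chart (the ~1 Gops 1995 baseline is my assumption, not on the slide):

```python
# Compound growth of one micro's performance at the slide's three rates.
BASE_OPS, BASE_YEAR, HORIZON = 1e9, 1995, 2045   # assumed ~1 Gops micro in 1995

def ops(year: int, rate: float) -> float:
    """Ops/s after compounding `rate` growth per year from BASE_YEAR."""
    return BASE_OPS * (1 + rate) ** (year - BASE_YEAR)

for rate in (0.20, 0.40, 0.60):
    print(f"{rate:.0%}/yr -> {ops(HORIZON, rate):.1e} ops/s in {HORIZON}")
# 20% -> ~9e12 (teraops); 40% -> ~2e16 (petaops); 60% -> ~2e19 (exaops),
# matching the chart's three labels.
```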

Page 48:

Processor Limit: the DRAM Gap*

[Chart] Performance, 1980–2000 (log scale, 1–1000): µProc improves 60%/yr (“Moore's Law”); DRAM improves 7%/yr. The processor-memory performance gap grows 50%/year.

– Alpha 21264 full cache miss, in instructions executed: 180 ns / 1.7 ns = 108 clks x 4, or 432 instructions
– Caches in Pentium Pro: 64% of area, 88% of transistors

*Taken from a Patterson-Keeton talk to SIGMOD
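The miss-cost arithmetic from that bullet, spelled out (values from the slide; the rounding to 108/432 is the slide's):

```python
# Cost of one full cache miss, per the slide's Alpha 21264 example.
MISS_NS, CYCLE_NS, ISSUE_WIDTH = 180.0, 1.7, 4

clocks = MISS_NS / CYCLE_NS      # ~106 clocks (the slide rounds to 108)
slots = clocks * ISSUE_WIDTH     # ~424 issue slots (slide: 432 instructions)
print(f"{clocks:.0f} clocks, ~{slots:.0f} instruction slots lost per miss")
```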

Page 49:

Five Scalabilities

– Size scalable: designed from a few components, with no bottlenecks
– Generation scaling: no rewrite/recompile is required across generations of computers
– Reliability scaling
– Geographic scaling: compute anywhere (e.g. multiple sites or in situ workstation sites)
– Problem x machine scalability: the ability of an algorithm or program to exist at a range of sizes that run efficiently on a given, scalable computer

Problem x machine space => run time: problem scale, machine scale (#p), run time; implies speedup and efficiency (defined below).
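For reference, the standard definitions behind that last line (conventional notation; not spelled out on the slide): with $T_p$ the run time on $p$ processors,

\[
S(p) = \frac{T_1}{T_p}, \qquad E(p) = \frac{S(p)}{p}.
\]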

Page 50:

The Law of Massive Parallelism (mine) is based on application scaling:

There exists a problem that can be made sufficiently large such that any network of computers can run it efficiently, given enough memory, searching, & work -- but this problem may be unrelated to any other.

A corollary: any parallel problem can be scaled to run efficiently on an arbitrary network of computers, given enough memory and time… but it may be completely impractical.

– Challenge to theoreticians and tool builders: how well will (or will) an algorithm run?
– Challenge for software and programmers: can a package be scalable & portable? Are there models?
– Challenge to users: do larger scale, faster, longer run times increase problem insight, and not just total flops?
– Challenge to funders: is the cost justified?

Gordon's WAG

Page 51:

Manyflops for manybucks: what are the goals of spending?

– Getting the most flops, independent of how much taxpayers give to spend on computers?
– Building or owning large machines?
– Doing a job (stockpile stewardship)?
– Understanding and publishing about parallelism?
– Making parallelism accessible?
– Forcing other labs to follow?

Page 52:

Petaflops Alternatives c2007-14, from the 1994 DOE Workshop

– SMP: 400 proc. @ 1 Tflops; 400 TB SRAM (250K chips); 1 ps/result… multi-threading (100 x 10-Gflops threads) is likely
– Cluster: 4-40K proc. @ 10-100 Gflops; 400 TB DRAM (60K-100K chips); 10-100 ps/result, cache hierarchy
– Active Memory grid: 400K proc. @ 1 Gflops; 0.8 TB embedded (4K chips)

No definition of storage, network, or programming model.

Page 53:

Or more parallelism… and use installed machines

– 10,000 nodes in 1998, or a 10x increase
– Assume 100K nodes: 10 Gflops / 10 GBy / 100 GB nodes, i.e. low-end c2010 PCs
– Communication is the first problem… use the network
– Programming is still the major barrier
– Will any problems fit it?

Page 54:

Next, short steps

Page 55:

The Alliance LES NT Supercluster

“Supercomputer performance at mail-order prices” -- Jim Gray, Microsoft

– 192 HP 300 MHz + 64 Compaq 333 MHz
– Andrew Chien, CS UIUC --> UCSD
– Rob Pennington, NCSA
– Myrinet network, HPVM, Fast Messages
– Microsoft NT OS, MPI API

Page 56:

2D Navier-Stokes Kernel Performance: preconditioned conjugate gradient method with multi-level additive Schwarz Richardson pre-conditioner, sustaining 7 GF on a 128-proc. NT cluster.

[Chart] Gflops (0–7) vs. processors (0–60) for Origin-DSM, Origin-MPI, NT-MPI, SP2-MPI, T3E-MPI, SPP2000-DSM.

Danesh Tafti, Rob Pennington, NCSA; Andrew Chien (UIUC, UCSD)
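For readers unfamiliar with the method named above, a generic preconditioned-conjugate-gradient sketch (a simple Jacobi preconditioner and toy system stand in for the slide's multi-level additive Schwarz Richardson pre-conditioner):

```python
import numpy as np

def pcg(A, b, M_inv, tol=1e-8, max_iter=200):
    """Solve A x = b for SPD A; M_inv is a callable applying the preconditioner."""
    x = np.zeros_like(b)
    r = b - A @ x                  # residual
    z = M_inv(r)                   # preconditioned residual
    p = z.copy()                   # search direction
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = M_inv(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p  # conjugate update of the direction
        rz = rz_new
    return x

# Toy 1-D Laplacian (SPD tridiagonal) with a Jacobi (diagonal) preconditioner.
n = 100
A = (np.diag(2.0 * np.ones(n))
     + np.diag(-1.0 * np.ones(n - 1), 1)
     + np.diag(-1.0 * np.ones(n - 1), -1))
b = np.ones(n)
d = np.diag(A)
x = pcg(A, b, lambda r: r / d)
print("residual norm:", np.linalg.norm(A @ x - b))
```

On the slide's cluster, each PCG iteration's dot products and matrix-vector products are what get distributed over MPI or DSM.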

Page 57:

The Grid: Blueprint for a New Computing Infrastructure, Ian Foster, Carl Kesselman (Eds), Morgan Kaufmann, 1999

Published July 1998; ISBN 1-55860-475-8

22 chapters by expert authors including: Andrew Chien, Jack Dongarra, Tom DeFanti, Andrew Grimshaw, Roch Guerin, Ken Kennedy, Paul Messina, Cliff Neuman, Jon Postel, Larry Smarr, Rick Stevens, Charlie Catlett, John Toole, and many others

http://www.mkp.com/grids

“A source book for the historyof the future” -- Vint Cerf

Page 58:

The Grid: “dependable, consistent, pervasive access to [high-end] resources”

– Dependable: can provide performance and functionality guarantees
– Consistent: uniform interfaces to a wide variety of resources
– Pervasive: ability to “plug in” from anywhere

Page 59:

Alliance Grid Technology Roadmap: it's just not flops or records/sec

– User interface: Tango, Webflow, Habanero, workbenches, NetMeeting, H.320/323, RealNetworks
– Middleware: Globus, LDAP, QoS, Java, vBNS, Abilene, ActiveX, MREN, clusters
– Compute: Condor, JavaGrande, HPVM/FM, Symera (DCOM), DSM, HPF, MPI, OpenMP, clusters
– Data: ODBC, Emerge (Z39.50), SRB, HDF-5, SANs, svPablo, DMF, XML
– Visualization: Virtual Director, CAVERNsoft, Java3D, SCIRun, Cave5D, VRML

Page 60:

Globus Approach

Focus on architecture issues:
– Propose a set of core services as basic infrastructure
– Use them to construct high-level, domain-specific solutions

Design principles:
– Keep participation cost low
– Enable local control
– Support for adaptation

[Diagram] Applications run over diverse global services, which are built on core Globus services, which sit on the local OS.

Page 61:

Globus Toolkit: Core Services

– Scheduling (Globus Resource Allocation Manager): low-level scheduler API
– Information (Metacomputing Directory Service): uniform access to structure/state information
– Communications (Nexus): multimethod communication + QoS management
– Security (Globus Security Infrastructure): single sign-on, key management
– Health and status (Heartbeat Monitor)
– Remote file access (Global Access to Secondary Storage)

Page 62:

Summary of some beliefs

– The 1000x increase in PAP has not been accompanied by RAP, insight, infrastructure, and use. What was the PACT/$?
– “The PC World Challenge” is to provide commodity, clustered parallelism to commercial and technical communities
– It only comes true if ISVs believe and act
– The Grid etc., using world-wide resources including in situ PCs, is the new idea

Page 63:

PACT 98

http://www.research.microsoft.com/barc/gbell/pact.ppt

Page 64:

The end