CS 258 Parallel Computer Architecture
Lecture 1
Introduction to Parallel Architecture
January 23, 2002
Prof. John D. Kubiatowicz
John Kubiatowicz, Slide 1.2, 1/23/02
Computer Architecture is … "the attributes of a [computing] system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation."
— Amdahl, Blaauw, and Brooks, 1964
[Figure: layered view of software over hardware]
Instruction Set Architecture (ISA)
• Great picture for uniprocessors…
  – Rapidly crumbling, however!
• Can this be true for multiprocessors?
  – Much harder to say.
[Figure: the ISA as the boundary between software and hardware]
What is Parallel Architecture?
• A parallel computer is a collection of processing elements that cooperate to solve large problems fast
• Some broad issues:
  – Models of computation: PRAM? BSP? Sequential Consistency?
  – Resource Allocation:
    » how large a collection?
    » how powerful are the elements?
    » how much memory?
  – Data access, Communication and Synchronization:
    » how do the elements cooperate and communicate?
    » how are data transmitted between processors?
    » what are the abstractions and primitives for cooperation?
  – Performance and Scalability:
    » how does it all translate into performance?
    » how does it scale?
Computer Architecture Topics (252+)
[Figure: uniprocessor design topics arranged around the machine]
• Instruction Set Architecture: Addressing, Protection, Exception Handling
• Pipelining and Instruction-Level Parallelism: Pipelining, Hazard Resolution, Superscalar, Reordering, Prediction, Speculation, Vector, Dynamic Compilation
• Memory Hierarchy: L1 Cache, L2 Cache, DRAM; Coherence, Bandwidth, Latency; Emerging Technologies, Interleaving, Bus protocols
• Input/Output and Storage: Disks, WORM, Tape, RAID
• Network Communication with Other Processors
• VLSI
Computer Architecture Topics (258)
[Figure: processor–memory (P–M) nodes connected by an interconnection network through switches (S)]
• Processor-Memory-Switch building blocks
• Interconnection Network: Topologies, Routing, Bandwidth, Latency, Reliability
• Network Interfaces
• Programming models: Shared Memory, Message Passing, Data Parallelism
• Multiprocessors, Networks and Interconnections: everything in previous slide, but more so!
What will you get out of CS258?
• In-depth understanding of the design and engineering of modern parallel computers
  – technology forces
  – programming models
  – fundamental architectural issues
    » naming, replication, communication, synchronization
  – basic design techniques
    » cache coherence, protocols, networks, pipelining, …
  – methods of evaluation
• From moderate to very large scale
• Across the hardware/software boundary
• Study of REAL parallel processors
  – research papers, white papers
• Natural consequences?
  – Massive Parallelism → Reconfigurable computing?
  – Message Passing Machines → NOW → Peer-to-peer systems?
Will it be worthwhile?
• Absolutely!
  – even though few of you will become PP designers
• The fundamental issues and solutions translate across a wide spectrum of systems.
  – Crisp solutions in the context of parallel machines.
• Pioneered at the thin end of the platform pyramid on the most demanding applications
  – migrate downward with time
• Understand implications for software
• Network-attached storage, MEMS, etc.?
[Figure: platform pyramid — SuperServers, Departmental Servers, Workstations, Personal Computers]
Why Study Parallel Architecture?
Role of a computer architect: to design and engineer the various levels of a computer system to maximize performance and programmability within the limits of technology and cost.
Parallelism:
• Provides alternative to faster clock for performance
• Applies at all levels of system design
• Is a fascinating perspective from which to view architecture
• Is increasingly central in information processing
• How is instruction-level parallelism related to coarse-grained parallelism?
Is Parallel Computing Inevitable?
This was certainly not clear just a few years ago.
Today, however:
• Application demands: our insatiable need for computing cycles
• Technology trends: easier to build
• Architecture trends: better abstractions
• Economics: cost of pushing uniprocessor
• Current trends:
  – Today's microprocessors have multiprocessor support
  – Servers and workstations becoming MP: Sun, SGI, DEC, COMPAQ!…
  – Tomorrow's microprocessors are multiprocessors
Can programmers handle parallelism?
• Humans not as good at parallel programming as they would like to think!
  – Need good model to think of machine
  – Architects pushed on instruction-level parallelism really hard, because it is "transparent"
• Can compilers extract parallelism?
  – Sometimes
• How do programmers manage parallelism?
  – Language to express parallelism?
  – How to schedule varying number of processors?
• Is communication explicit (message passing) or implicit (shared memory)?
  – Are there any ordering constraints on communication?
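The explicit/implicit distinction can be sketched in a few lines of Python (a toy illustration only; real machines implement these models in hardware and runtime systems, and the worker/sender names are made up for this sketch):

```python
import threading
import queue

# Implicit communication (shared memory): threads read and write a common
# variable; the lock supplies the ordering constraint.
total = 0
lock = threading.Lock()

def shared_worker(values):
    global total
    for v in values:
        with lock:            # synchronization is explicit, data movement is not
            total += v

# Explicit communication (message passing): data moves only via send/receive.
q = queue.Queue()

def sender(values):
    for v in values:
        q.put(v)              # explicit "send"
    q.put(None)               # end-of-stream marker

def receiver(result):
    while True:
        v = q.get()           # explicit "receive"
        if v is None:
            break
        result[0] += v

data = list(range(100))

t1 = threading.Thread(target=shared_worker, args=(data,))
t1.start(); t1.join()

result = [0]
ts = threading.Thread(target=sender, args=(data,))
tr = threading.Thread(target=receiver, args=(result,))
ts.start(); tr.start(); ts.join(); tr.join()

print(total, result[0])       # both styles compute the same sum: 4950 4950
```

Both versions produce the same answer; they differ in *how* the elements cooperate, which is exactly the design axis the slide raises.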
Granularity
• Is communication fine- or coarse-grained?
  – Small messages vs. big messages
• Is parallelism fine- or coarse-grained?
  – Small tasks (frequent synchronization) vs. big tasks
• If hardware handles fine-grained parallelism, then easier to get incremental scalability
• Fine-grained communication and parallelism harder than coarse-grained:
  – Harder to build with low overhead
  – Custom communication architectures often needed
• Ultimate coarse-grained communication:
  – GIMPS (Great Internet Mersenne Prime Search)
  – Communication once a month
CS258: Staff
Instructor: Prof. John D. Kubiatowicz
  Office: 673 Soda Hall, 643-6817, kubitron@cs
  Office Hours: Thursday 1:30–3:00 or by appt.
  Class: Wed., Fri., 1:00–2:30pm, 310 Soda Hall
Administrative: Veronique Richard
  Office: 676 Soda Hall, 642-4334, nicou@cs
Web page: http://www.cs/~kubitron/courses/cs258-S02/
  Lectures available online by 11:30AM on the day of lecture
Email: [email protected]
  Signup link on web page (as soon as it is up)
Why Me? The Alewife Multiprocessor
• Cache-coherent shared memory
  – Partially in software!
• User-level message passing
• Rapid context switching
• Asynchronous network
• One node/board
Textbook: Two Leaders in the Field
Text: Parallel Computer Architecture: A Hardware/Software Approach, by David Culler & Jaswinder Singh
Covers a range of topics; we will not necessarily cover them in order.
Lecture Style
• 1-minute review
• 20-minute lecture/discussion
• 5-minute administrative matters
• 25-minute lecture/discussion
• 5-minute break (water, stretch)
• 25-minute lecture/discussion
• Instructor will come to class early & stay after to answer questions
[Figure: audience attention vs. time, with a break at 20 min. and "In Conclusion, …" at the end]
Research Paper Reading
• As graduate students, you are now researchers.
• Most information of importance to you will be in research papers.
• Ability to rapidly scan and understand research papers is key to your success.
• So: you will read lots of papers in this course!
  – Quick 1-paragraph summaries will be due in class
  – Students will take turns discussing papers
• Papers will be scanned and on web page.
How will grading work?
• No TA this term!
• Tentative breakdown:
  – 20% homeworks / paper presentations
  – 30% exam
  – 40% project (teams of 2)
  – 10% participation
Application Trends
• Application demand for performance fuels advances in hardware, which enables new applications, which…
  – Cycle drives exponential increase in microprocessor performance
  – Drives parallel architecture harder
    » most demanding applications
• Programmers willing to work really hard to improve high-end applications
• Need incremental scalability:
  – Need range of system performance with progressively increasing cost
[Figure: cycle between New Applications and More Performance]
Speedup

  Speedup(p processors) = Time(1 processor) / Time(p processors)

• Common mistake:
  – Compare parallel program on 1 processor to parallel program on p processors
• Wrong!
  – Should compare uniprocessor program on 1 processor to parallel program on p processors
• Why? Keeps you honest
  – It is easy to parallelize overhead.
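A minimal sketch of the honest vs. mistaken baseline (all timings below are hypothetical numbers chosen for illustration):

```python
def speedup(t_baseline, t_parallel_p):
    """Speedup(p) = Time(baseline on 1 processor) / Time(parallel on p processors)."""
    return t_baseline / t_parallel_p

# Hypothetical timings (seconds):
t_seq    = 100.0   # tuned sequential program on 1 processor (the honest baseline)
t_par_1  = 120.0   # parallel program run on 1 processor (carries parallel overhead)
t_par_16 = 10.0    # parallel program on 16 processors

print(speedup(t_seq, t_par_16))    # honest speedup: 10.0
print(speedup(t_par_1, t_par_16))  # inflated "speedup": 12.0 (wrong baseline)
```

The second number looks better only because the 1-processor run of the parallel program already pays the overhead; that is the "easy to parallelize overhead" trap.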
Amdahl's Law
Speedup due to enhancement E:

  Speedup(E) = ExTime w/o E / ExTime w/ E = Performance w/ E / Performance w/o E

Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. Then:

  Speedup(E) = 1 / ((1 - F) + F / S)
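The closed form above can be checked numerically (a small sketch; `amdahl_speedup` is an illustrative name, not from the slides):

```python
def amdahl_speedup(fraction, factor):
    """Overall speedup when a fraction F of execution time is
    accelerated by a factor S, per Amdahl's Law."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)

# Example: speed up 80% of the task by 10x.
print(amdahl_speedup(0.8, 10))   # ≈ 3.57 — far below 10x
```

Even a 10x enhancement yields under 4x overall, because the unenhanced 20% of the task dominates.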
Amdahl's Law for parallel programs?

  ExTime_par = ExTime_ser × ((1 - Fraction_par) + Fraction_par / p) + ExTime_overhead(p, stuff)

  Speedup = 1 / ((1 - Fraction_par) + Fraction_par / p + ExTime_overhead(p, stuff) / ExTime_ser)

Best you could ever hope to do:

  Speedup_maximum = 1 / (1 - Fraction_par)

Worse: Overhead may kill your performance!
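These formulas can be exercised directly (a sketch; the 10% overhead figure below is an assumed illustration, expressed as a fraction of sequential time):

```python
def parallel_speedup(frac_par, p, overhead_frac=0.0):
    """Amdahl's Law for a parallel program: frac_par of the sequential
    time parallelizes across p processors; overhead_frac models
    communication/synchronization cost as a fraction of sequential time."""
    return 1.0 / ((1.0 - frac_par) + frac_par / p + overhead_frac)

print(parallel_speedup(0.95, 100))        # ≈ 16.8: the serial 5% dominates
print(parallel_speedup(0.95, 100, 0.10))  # ≈ 6.27: overhead hurts further
print(parallel_speedup(0.95, 10**9))      # ≈ 20: approaching 1/(1 - 0.95)
```

With 100 processors and 95% parallelizable work, speedup is under 17x, and can never exceed 20x no matter how many processors are added.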
Metrics of Performance
[Figure: system levels paired with metrics]
  Application — Answers per month; Operations per second
  Programming Language
  Compiler
  ISA — (millions of) Instructions per second: MIPS; (millions of) (FP) operations per second: MFLOP/s
  Datapath, Control — Megabytes per second
  Function Units — Cycles per second (clock rate)
  Transistors, Wires, Pins
Commercial Computing
• Relies on parallelism for high end
  – Computational power determines scale of business that can be handled
• Databases, online transaction processing, decision support, data mining, data warehousing…
• TPC benchmarks (TPC-C order entry, TPC-D decision support)
  – Explicit scaling criteria provided
  – Size of enterprise scales with size of system
  – Problem size not fixed as p increases
  – Throughput is performance measure (transactions per minute, or tpm)
TPC-C Results for March 1996
[Figure: throughput (tpmC), 0–25,000, vs. number of processors, 0–120, for Tandem Himalaya, DEC Alpha, SGI PowerChallenge, HP PA, IBM PowerPC, and others]
• Parallelism is pervasive
• Small to moderate scale parallelism very important
• Difficult to obtain snapshot to compare across vendor platforms
Scientific Computing Demand
Engineering Computing Demand
• Large parallel machines a mainstay in many industries
  – Petroleum (reservoir analysis)
  – Automotive (crash simulation, drag analysis, combustion efficiency)
  – Aeronautics (airflow analysis, engine efficiency, structural mechanics, electromagnetism)
  – Computer-aided design
  – Pharmaceuticals (molecular modeling)
  – Visualization
    » in all of the above
    » entertainment (films like Toy Story)
    » architecture (walk-throughs and rendering)
  – Financial modeling (yield and derivative analysis)
  – etc.
Applications: Speech and Image Processing
[Figure: processing demands from 1 MIPS to 10 GIPS, 1980–1995 — sub-band speech coding; 200-word isolated speech recognition; speaker verification; CELP speech coding; ISDN-CD stereo receiver; 5,000-word continuous speech recognition; HDTV receiver; CIF video; 1,000-word continuous speech recognition; telephone number recognition]
• Also CAD, databases, …
• 100 processors gets you 10 years, 1000 gets you 20!
Is better parallel arch enough?
• AMBER molecular dynamics simulation program
• Starting point was vector code for Cray-1
• 145 MFLOPS on Cray90; 406 for final version on 128-processor Paragon; 891 on 128-processor Cray T3D
Summary of Application Trends
• Transition to parallel computing has occurred for scientific and engineering computing
• Rapid progress in commercial computing
  – Database and transactions as well as financial
  – Usually smaller scale, but large-scale systems also used
• Desktop also uses multithreaded programs, which are a lot like parallel programs
• Demand for improving throughput on sequential workloads
  – Greatest use of small-scale multiprocessors
• Solid application demand exists and will increase
Technology Trends
[Figure: performance, 0.1–100 (log scale), vs. year 1965–1995 for supercomputers, mainframes, minicomputers, and microprocessors — microprocessors overtaking the rest]
• Today the natural building block is also the fastest!
Can't we just wait for it to get faster?
• Microprocessor performance increases 50%–100% per year
• Transistor count doubles every 3 years
• DRAM size quadruples every 3 years
• Huge investment per generation is carried by huge commodity market
[Figure: integer and FP benchmark performance, 1987–1992, for Sun 4/260, MIPS M/120, MIPS M2000, IBM RS6000/540, HP 9000/750, DEC Alpha]
Technology: A Closer Look
[Figure: processor + cache ($) nodes on an interconnect]
• Basic advance is decreasing feature size (λ)
  – Circuits become either faster or lower in power
• Die size is growing too
  – Clock rate improves roughly proportional to improvement in λ
  – Number of transistors improves like λ² (or faster)
• Performance > 100x per decade
  – clock rate < 10x; rest is transistor count
• How to use more transistors?
  – Parallelism in processing
    » multiple operations per cycle reduces CPI
  – Locality in data access
    » avoids latency and reduces CPI
    » also improves processor utilization
  – Both need resources, so tradeoff
• Fundamental issue is resource distribution, as in uniprocessors
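A quick sanity check of the ">100x per decade" claim, using rates assumed from these slides (roughly 25%/year clock growth, giving just under 10x per decade, and transistor doubling every 3 years; a first-order sketch, not measured data):

```python
def decade_growth(clock_per_year=1.25, transistor_doubling_years=3.0, years=10):
    """Compound clock-rate and transistor-count growth over a decade,
    and their product as a crude proxy for performance growth."""
    clock = clock_per_year ** years
    transistors = 2.0 ** (years / transistor_doubling_years)
    return clock, transistors, clock * transistors

clock, trans, combined = decade_growth()
print(round(clock, 1), round(trans, 1), round(combined, 1))  # ≈ 9.3, 10.1, 93.9
```

Clock rate alone gives under 10x; only combined with the transistor budget (spent on parallelism and locality) does the decade reach roughly 100x.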
Growth Rates
[Figure: clock rate (MHz), 0.1–1,000 (log scale), 1970–2005 — i4004, i8008, i8080, i8086, i80286, i80386, Pentium100, R10000 — growing 30% per year]
[Figure: transistor count, 1,000–100,000,000 (log scale), 1970–2005 — i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000 — growing 40% per year]
Architectural Trends
• Architecture translates technology's gifts into performance and capability
• Resolves the tradeoff between parallelism and locality
  – Current microprocessor: 1/3 compute, 1/3 cache, 1/3 off-chip connect
  – Tradeoffs may change with scale and technology advances
• Understanding microprocessor architectural trends
  => Helps build intuition about design issues of parallel machines
  => Shows fundamental role of parallelism even in "sequential" computers
Phases in "VLSI" Generation
[Figure: transistor count, 1,000–100,000,000 (log scale), 1970–2005, annotated with eras — bit-level parallelism, instruction-level parallelism, thread-level parallelism (?) — data points: i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000]
Architectural Trends
• Greatest trend in VLSI generation is increase in parallelism
  – Up to 1985: bit-level parallelism: 4-bit → 8-bit → 16-bit
    » slows after 32 bit
    » adoption of 64-bit now under way, 128-bit far (not performance issue)
    » great inflection point when 32-bit micro and cache fit on a chip
  – Mid 80s to mid 90s: instruction-level parallelism
    » pipelining and simple instruction sets, + compiler advances (RISC)
    » on-chip caches and functional units => superscalar execution
    » greater sophistication: out-of-order execution, speculation, prediction
      • to deal with control transfer and latency problems
  – Next step: thread-level parallelism? Bit-level parallelism?
How far will ILP go?
[Figure, left: fraction of total cycles (%), 0–30, vs. number of instructions issued, 0–6+; right: speedup, 0–3, vs. instructions issued per cycle, 0–15]
• Infinite resources and fetch bandwidth, perfect branch prediction and renaming
  – real caches and non-zero miss latencies
Thread-Level Parallelism "on board"
• Micro on a chip makes it natural to connect many to shared memory
  – dominates server and enterprise market, moving down to desktop
• Alternative: many PCs sharing one complicated pipe
• Faster processors began to saturate bus, then bus technology advanced
  – today, range of sizes for bus-based systems, desktop to large servers
[Figure: four processors (Proc) sharing a bus to memory (MEM)]
What about Multiprocessor Trends?
[Figure: number of processors, 0–70, vs. year 1984–1998 for bus-based machines — Sequent B8000, Sequent B2100, Symmetry81, Symmetry21, Power, SS690MP 120/140, SGI PowerSeries, SE10, SE30, SE60, SE70, SS10, SS20, SS1000, SS1000E, SC2000, Sun SC2000E, AS2100, AS8400, HP K400, P-Pro, SGI Challenge, SGI PowerChallenge/XL, CRAY CS6400, Sun E6000, Sun E10000]
Bus Bandwidth
[Figure: shared bus bandwidth (MB/s), 10–100,000 (log scale), vs. year 1984–1998 for the same machines — from Sequent B8000 through Sun E10000]
What about Storage Trends?
• Divergence between memory capacity and speed even more pronounced
  – Capacity increased by 1000x from 1980–95, speed only 2x
  – Gigabit DRAM by c. 2000, but gap with processor speed much greater
• Larger memories are slower, while processors get faster
  – Need to transfer more data in parallel
  – Need deeper cache hierarchies
  – How to organize caches?
• Parallelism increases effective size of each level of hierarchy, without increasing access time
• Parallelism and locality within memory systems too
  – New designs fetch many bits within memory chip; follow with fast pipelined transfer across narrower interface
  – Buffer caches most recently accessed data
  – Processor in memory?
• Disks too: parallel disks plus caching
Economics
• Commodity microprocessors not only fast but CHEAP
  – Development costs tens of millions of dollars
  – BUT, many more are sold compared to supercomputers
  – Crucial to take advantage of the investment, and use the commodity building block
• Multiprocessors being pushed by software vendors (e.g. database) as well as hardware vendors
• Standardization makes small, bus-based SMPs commodity
• Desktop: few smaller processors versus one larger one?
• Multiprocessor on a chip?
Can anyone afford high-end MPPs?
• ASCI (Accelerated Strategic Computing Initiative) — ASCI White, built by IBM
  – 12.3 TeraOps, 8192 processors (RS/6000)
  – 6 TB of RAM, 160 TB disk
  – 2 basketball courts in size
  – Program it? Message passing
Consider Scientific Supercomputing
• Proving ground and driver for innovative architecture and techniques
  – Market smaller relative to commercial as MPs become mainstream
  – Dominated by vector machines starting in 70s
  – Microprocessors have made huge gains in floating-point performance
    » high clock rates
    » pipelined floating-point units (e.g., multiply-add every cycle)
    » instruction-level parallelism
    » effective use of caches (e.g., automatic blocking)
  – Plus economics
• Large-scale multiprocessors replace vector supercomputers
Raw Uniprocessor Performance: LINPACK
[Figure: LINPACK MFLOPS, 1–10,000 (log scale), vs. year 1975–2000 — CRAY n=100 and n=1,000 (CRAY 1s, Xmp/14se, Xmp/416, Ymp, C90, T94) vs. micro n=100 and n=1,000 (Sun 4/260, MIPS M/120, MIPS M/2000, IBM RS6000/540, MIPS R4400, HP 9000/750, HP 9000/735, DEC Alpha AXP, DEC Alpha, IBM Power2/990, DEC 8200)]
Raw Parallel Performance: LINPACK
[Figure: LINPACK GFLOPS, 0.1–10,000 (log scale), vs. year 1985–1996 — CRAY peak (Xmp/416(4), Ymp/832(8), C90(16), T932(32)) vs. MPP peak (iPSC/860, nCUBE/2(1024), CM-2, CM-200, Delta, CM-5, Paragon XP/S, T3D, Paragon XP/S MP(1024), Paragon XP/S MP(6768), ASCI Red)]
• Even vector Crays became parallel: X-MP (2–4), Y-MP (8), C-90 (16), T94 (32)
• Since 1993, Cray produces MPPs too (T3D, T3E)
500 Fastest Computers
[Figure: number of systems, 0–350, among the 500 fastest computers, 11/93–11/96, by machine type (PVP, MPP, SMP) — PVP count falling while MPP and SMP counts grow; data labels include 319, 313, 284, 239, 198, 187, 110, 106, 73, 63]
Summary: Why Parallel Architecture?
• Increasingly attractive
  – Economics, technology, architecture, application demand
• Increasingly central and mainstream
• Parallelism exploited at many levels
  – Instruction-level parallelism
  – Multiprocessor servers
  – Large-scale multiprocessors ("MPPs")
• Focus of this class: multiprocessor level of parallelism
• Same story from memory system perspective
  – Increase bandwidth, reduce average latency with many local memories
• Spectrum of parallel architectures make sense
  – Different cost, performance and scalability
Where is Parallel Arch Going?
Old view: divergent architectures, no predictable pattern of growth.
[Figure: Application Software and System Software atop diverging architectures — SIMD, Message Passing, Shared Memory, Dataflow, Systolic Arrays]
• Uncertainty of direction paralyzed parallel software development!
Today
• Extension of “computer architecture” to support communication and cooperation
– Instruction Set Architecture plus Communication Architecture
• Defines – Critical abstractions, boundaries, and primitives (interfaces)– Organizational structures that implement interfaces (hw or sw)
• Compilers, libraries and OS are important bridges today