CS 258 Parallel Computer Architecture
Lecture 1
Introduction to Parallel Architecture
January 23, 2002
Prof. John D. Kubiatowicz
John Kubiatowicz, Slide 1.2, 1/23/02
Computer Architecture is … "the attributes of a [computing] system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation."
— Amdahl, Blaauw, and Brooks, 1964
[Figure: layered view of software over hardware]
Instruction Set Architecture (ISA)
• Great picture for uniprocessors…
  – Rapidly crumbling, however!
• Can this be true for multiprocessors?
  – Much harder to say.
[Figure: the ISA as the boundary between software and hardware]
What is Parallel Architecture?
• A parallel computer is a collection of processing elements that cooperate to solve large problems fast
• Some broad issues:
  – Models of computation: PRAM? BSP? Sequential Consistency?
  – Resource Allocation:
    » how large a collection?
    » how powerful are the elements?
    » how much memory?
  – Data access, Communication and Synchronization:
    » how do the elements cooperate and communicate?
    » how are data transmitted between processors?
    » what are the abstractions and primitives for cooperation?
  – Performance and Scalability:
    » how does it all translate into performance?
    » how does it scale?
Computer Architecture Topics (252+)
[Figure: uniprocessor design topics arranged around the machine]
• Instruction Set Architecture: Addressing, Protection, Exception Handling
• Pipelining and Instruction-Level Parallelism: Pipelining, Hazard Resolution, Superscalar, Reordering, Prediction, Speculation, Vector, Dynamic Compilation
• Memory Hierarchy: L1 Cache, L2 Cache, DRAM; Coherence, Bandwidth, Latency; Emerging Technologies, Interleaving, Bus protocols
• Input/Output and Storage: Disks, WORM, Tape, RAID
• Network Communication with Other Processors
• VLSI
Computer Architecture Topics (258)
[Figure: processor–memory (P–M) nodes connected by an interconnection network through switches (S)]
• Processor-Memory-Switch building blocks
• Interconnection Network: Topologies, Routing, Bandwidth, Latency, Reliability
• Network Interfaces
• Programming models: Shared Memory, Message Passing, Data Parallelism
• Multiprocessors, Networks and Interconnections: everything in previous slide, but more so!
What will you get out of CS258?
• In-depth understanding of the design and engineering of modern parallel computers
  – technology forces
  – programming models
  – fundamental architectural issues
    » naming, replication, communication, synchronization
  – basic design techniques
    » cache coherence, protocols, networks, pipelining, …
  – methods of evaluation
• From moderate to very large scale
• Across the hardware/software boundary
• Study of REAL parallel processors
  – research papers, white papers
• Natural consequences?
  – Massive Parallelism → Reconfigurable computing?
  – Message Passing Machines → NOW → Peer-to-peer systems?
Will it be worthwhile?
• Absolutely!
  – even though few of you will become PP designers
• The fundamental issues and solutions translate across a wide spectrum of systems.
  – Crisp solutions in the context of parallel machines.
• Pioneered at the thin end of the platform pyramid on the most demanding applications
  – migrate downward with time
• Understand implications for software
• Network-attached storage, MEMS, etc.?
[Figure: platform pyramid — SuperServers, Departmental Servers, Workstations, Personal Computers]
Why Study Parallel Architecture?
Role of a computer architect: to design and engineer the various levels of a computer system to maximize performance and programmability within the limits of technology and cost.
Parallelism:
• Provides alternative to faster clock for performance
• Applies at all levels of system design
• Is a fascinating perspective from which to view architecture
• Is increasingly central in information processing
• How is instruction-level parallelism related to coarse-grained parallelism?
Is Parallel Computing Inevitable?
This was certainly not clear just a few years ago.
Today, however:
• Application demands: our insatiable need for computing cycles
• Technology trends: easier to build
• Architecture trends: better abstractions
• Economics: cost of pushing uniprocessor
• Current trends:
  – Today's microprocessors have multiprocessor support
  – Servers and workstations becoming MP: Sun, SGI, DEC, COMPAQ!…
  – Tomorrow's microprocessors are multiprocessors
Can programmers handle parallelism?
• Humans not as good at parallel programming as they would like to think!
  – Need good model to think of machine
  – Architects pushed on instruction-level parallelism really hard, because it is "transparent"
• Can compilers extract parallelism?
  – Sometimes
• How do programmers manage parallelism?
  – Language to express parallelism?
  – How to schedule varying number of processors?
• Is communication explicit (message passing) or implicit (shared memory)?
  – Are there any ordering constraints on communication?
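The explicit/implicit distinction can be sketched in a few lines of Python (a toy illustration only; real machines implement these models in hardware and runtime systems, and the worker/sender names are made up for this sketch):

```python
import threading
import queue

# Implicit communication (shared memory): threads read and write a common
# variable; the lock supplies the ordering constraint.
total = 0
lock = threading.Lock()

def shared_worker(values):
    global total
    for v in values:
        with lock:            # synchronization is explicit, data movement is not
            total += v

# Explicit communication (message passing): data moves only via send/receive.
q = queue.Queue()

def sender(values):
    for v in values:
        q.put(v)              # explicit "send"
    q.put(None)               # end-of-stream marker

def receiver(result):
    while True:
        v = q.get()           # explicit "receive"
        if v is None:
            break
        result[0] += v

data = list(range(100))

t1 = threading.Thread(target=shared_worker, args=(data,))
t1.start(); t1.join()

result = [0]
ts = threading.Thread(target=sender, args=(data,))
tr = threading.Thread(target=receiver, args=(result,))
ts.start(); tr.start(); ts.join(); tr.join()

print(total, result[0])       # both styles compute the same sum: 4950 4950
```

Both versions produce the same answer; they differ in *how* the elements cooperate, which is exactly the design axis the slide raises.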
Granularity
• Is communication fine- or coarse-grained?
  – Small messages vs. big messages
• Is parallelism fine- or coarse-grained?
  – Small tasks (frequent synchronization) vs. big tasks
• If hardware handles fine-grained parallelism, then easier to get incremental scalability
• Fine-grained communication and parallelism harder than coarse-grained:
  – Harder to build with low overhead
  – Custom communication architectures often needed
• Ultimate coarse-grained communication:
  – GIMPS (Great Internet Mersenne Prime Search)
  – Communication once a month
CS258: Staff
Instructor: Prof. John D. Kubiatowicz
  Office: 673 Soda Hall, 643-6817, kubitron@cs
  Office Hours: Thursday 1:30–3:00 or by appt.
  Class: Wed., Fri., 1:00–2:30pm, 310 Soda Hall
Administrative: Veronique Richard
  Office: 676 Soda Hall, 642-4334, nicou@cs
Web page: http://www.cs/~kubitron/courses/cs258-S02/
  Lectures available online by 11:30AM on the day of lecture
Email: [email protected]
  Signup link on web page (as soon as it is up)
Why Me? The Alewife Multiprocessor
• Cache-coherent shared memory
  – Partially in software!
• User-level message passing
• Rapid context switching
• Asynchronous network
• One node/board
Textbook: Two Leaders in the Field
Text: Parallel Computer Architecture: A Hardware/Software Approach, by David Culler & Jaswinder Singh
Covers a range of topics; we will not necessarily cover them in order.
Lecture Style
• 1-minute review
• 20-minute lecture/discussion
• 5-minute administrative matters
• 25-minute lecture/discussion
• 5-minute break (water, stretch)
• 25-minute lecture/discussion
• Instructor will come to class early & stay after to answer questions
[Figure: audience attention vs. time, with a break at 20 min. and "In Conclusion, …" at the end]
Research Paper Reading
• As graduate students, you are now researchers.
• Most information of importance to you will be in research papers.
• Ability to rapidly scan and understand research papers is key to your success.
• So: you will read lots of papers in this course!
  – Quick 1-paragraph summaries will be due in class
  – Students will take turns discussing papers
• Papers will be scanned and on web page.
How will grading work?
• No TA this term!
• Tentative breakdown:
  – 20% homeworks / paper presentations
  – 30% exam
  – 40% project (teams of 2)
  – 10% participation
Application Trends
• Application demand for performance fuels advances in hardware, which enables new applications, which…
  – Cycle drives exponential increase in microprocessor performance
  – Drives parallel architecture harder
    » most demanding applications
• Programmers willing to work really hard to improve high-end applications
• Need incremental scalability:
  – Need range of system performance with progressively increasing cost
[Figure: cycle between New Applications and More Performance]
Speedup

  Speedup(p processors) = Time(1 processor) / Time(p processors)

• Common mistake:
  – Compare parallel program on 1 processor to parallel program on p processors
• Wrong!
  – Should compare uniprocessor program on 1 processor to parallel program on p processors
• Why? Keeps you honest
  – It is easy to parallelize overhead.
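A minimal sketch of the honest vs. mistaken baseline (all timings below are hypothetical numbers chosen for illustration):

```python
def speedup(t_baseline, t_parallel_p):
    """Speedup(p) = Time(baseline on 1 processor) / Time(parallel on p processors)."""
    return t_baseline / t_parallel_p

# Hypothetical timings (seconds):
t_seq    = 100.0   # tuned sequential program on 1 processor (the honest baseline)
t_par_1  = 120.0   # parallel program run on 1 processor (carries parallel overhead)
t_par_16 = 10.0    # parallel program on 16 processors

print(speedup(t_seq, t_par_16))    # honest speedup: 10.0
print(speedup(t_par_1, t_par_16))  # inflated "speedup": 12.0 (wrong baseline)
```

The second number looks better only because the 1-processor run of the parallel program already pays the overhead; that is the "easy to parallelize overhead" trap.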
Amdahl's Law
Speedup due to enhancement E:

  Speedup(E) = ExTime w/o E / ExTime w/ E = Performance w/ E / Performance w/o E

Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. Then:

  Speedup(E) = 1 / ((1 - F) + F / S)
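The closed form above can be checked numerically (a small sketch; `amdahl_speedup` is an illustrative name, not from the slides):

```python
def amdahl_speedup(fraction, factor):
    """Overall speedup when a fraction F of execution time is
    accelerated by a factor S, per Amdahl's Law."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)

# Example: speed up 80% of the task by 10x.
print(amdahl_speedup(0.8, 10))   # ≈ 3.57 — far below 10x
```

Even a 10x enhancement yields under 4x overall, because the unenhanced 20% of the task dominates.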
Amdahl's Law for parallel programs?

  ExTime_par = ExTime_ser × ((1 - Fraction_par) + Fraction_par / p) + ExTime_overhead(p, stuff)

  Speedup = 1 / ((1 - Fraction_par) + Fraction_par / p + ExTime_overhead(p, stuff) / ExTime_ser)

Best you could ever hope to do:

  Speedup_maximum = 1 / (1 - Fraction_par)

Worse: Overhead may kill your performance!
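These formulas can be exercised directly (a sketch; the 10% overhead figure below is an assumed illustration, expressed as a fraction of sequential time):

```python
def parallel_speedup(frac_par, p, overhead_frac=0.0):
    """Amdahl's Law for a parallel program: frac_par of the sequential
    time parallelizes across p processors; overhead_frac models
    communication/synchronization cost as a fraction of sequential time."""
    return 1.0 / ((1.0 - frac_par) + frac_par / p + overhead_frac)

print(parallel_speedup(0.95, 100))        # ≈ 16.8: the serial 5% dominates
print(parallel_speedup(0.95, 100, 0.10))  # ≈ 6.27: overhead hurts further
print(parallel_speedup(0.95, 10**9))      # ≈ 20: approaching 1/(1 - 0.95)
```

With 100 processors and 95% parallelizable work, speedup is under 17x, and can never exceed 20x no matter how many processors are added.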
Metrics of Performance
[Figure: system levels paired with metrics]
  Application — Answers per month; Operations per second
  Programming Language
  Compiler
  ISA — (millions of) Instructions per second: MIPS; (millions of) (FP) operations per second: MFLOP/s
  Datapath, Control — Megabytes per second
  Function Units — Cycles per second (clock rate)
  Transistors, Wires, Pins
Commercial Computing
• Relies on parallelism for high end
  – Computational power determines scale of business that can be handled
• Databases, online transaction processing, decision support, data mining, data warehousing…
• TPC benchmarks (TPC-C order entry, TPC-D decision support)
  – Explicit scaling criteria provided
  – Size of enterprise scales with size of system
  – Problem size not fixed as p increases
  – Throughput is performance measure (transactions per minute, or tpm)
TPC-C Results for March 1996
[Figure: throughput (tpmC), 0–25,000, vs. number of processors, 0–120, for Tandem Himalaya, DEC Alpha, SGI PowerChallenge, HP PA, IBM PowerPC, and others]
• Parallelism is pervasive
• Small to moderate scale parallelism very important
• Difficult to obtain snapshot to compare across vendor platforms
Scientific Computing Demand
Engineering Computing Demand
• Large parallel machines a mainstay in many industries
  – Petroleum (reservoir analysis)
  – Automotive (crash simulation, drag analysis, combustion efficiency)
  – Aeronautics (airflow analysis, engine efficiency, structural mechanics, electromagnetism)
  – Computer-aided design
  – Pharmaceuticals (molecular modeling)
  – Visualization
    » in all of the above
    » entertainment (films like Toy Story)
    » architecture (walk-throughs and rendering)
  – Financial modeling (yield and derivative analysis)
  – etc.
Applications: Speech and Image Processing
[Figure: processing demands from 1 MIPS to 10 GIPS, 1980–1995 — sub-band speech coding; 200-word isolated speech recognition; speaker verification; CELP speech coding; ISDN-CD stereo receiver; 5,000-word continuous speech recognition; HDTV receiver; CIF video; 1,000-word continuous speech recognition; telephone number recognition]
• Also CAD, databases, …
• 100 processors gets you 10 years, 1000 gets you 20!
Is better parallel arch enough?
• AMBER molecular dynamics simulation program
• Starting point was vector code for Cray-1
• 145 MFLOPS on Cray90; 406 for final version on 128-processor Paragon; 891 on 128-processor Cray T3D
Summary of Application Trends
• Transition to parallel computing has occurred for scientific and engineering computing
• Rapid progress in commercial computing
  – Database and transactions as well as financial
  – Usually smaller scale, but large-scale systems also used
• Desktop also uses multithreaded programs, which are a lot like parallel programs
• Demand for improving throughput on sequential workloads
  – Greatest use of small-scale multiprocessors
• Solid application demand exists and will increase
Technology Trends
[Figure: performance, 0.1–100 (log scale), vs. year 1965–1995 for supercomputers, mainframes, minicomputers, and microprocessors — microprocessors overtaking the rest]
• Today the natural building block is also the fastest!
Can't we just wait for it to get faster?
• Microprocessor performance increases 50%–100% per year
• Transistor count doubles every 3 years
• DRAM size quadruples every 3 years
• Huge investment per generation is carried by huge commodity market
[Figure: integer and FP benchmark performance, 1987–1992, for Sun 4/260, MIPS M/120, MIPS M2000, IBM RS6000/540, HP 9000/750, DEC Alpha]
Technology: A Closer Look
[Figure: processor + cache ($) nodes on an interconnect]
• Basic advance is decreasing feature size (λ)
  – Circuits become either faster or lower in power
• Die size is growing too
  – Clock rate improves roughly proportional to improvement in λ
  – Number of transistors improves like λ² (or faster)
• Performance > 100x per decade
  – clock rate < 10x; rest is transistor count
• How to use more transistors?
  – Parallelism in processing
    » multiple operations per cycle reduces CPI
  – Locality in data access
    » avoids latency and reduces CPI
    » also improves processor utilization
  – Both need resources, so tradeoff
• Fundamental issue is resource distribution, as in uniprocessors
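A quick sanity check of the ">100x per decade" claim, using rates assumed from these slides (roughly 25%/year clock growth, giving just under 10x per decade, and transistor doubling every 3 years; a first-order sketch, not measured data):

```python
def decade_growth(clock_per_year=1.25, transistor_doubling_years=3.0, years=10):
    """Compound clock-rate and transistor-count growth over a decade,
    and their product as a crude proxy for performance growth."""
    clock = clock_per_year ** years
    transistors = 2.0 ** (years / transistor_doubling_years)
    return clock, transistors, clock * transistors

clock, trans, combined = decade_growth()
print(round(clock, 1), round(trans, 1), round(combined, 1))  # ≈ 9.3, 10.1, 93.9
```

Clock rate alone gives under 10x; only combined with the transistor budget (spent on parallelism and locality) does the decade reach roughly 100x.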
Growth Rates
[Figure: clock rate (MHz), 0.1–1,000 (log scale), 1970–2005 — i4004, i8008, i8080, i8086, i80286, i80386, Pentium100, R10000 — growing 30% per year]
[Figure: transistor count, 1,000–100,000,000 (log scale), 1970–2005 — i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000 — growing 40% per year]
Architectural Trends
• Architecture translates technology's gifts into performance and capability
• Resolves the tradeoff between parallelism and locality
  – Current microprocessor: 1/3 compute, 1/3 cache, 1/3 off-chip connect
  – Tradeoffs may change with scale and technology advances
• Understanding microprocessor architectural trends
  => Helps build intuition about design issues of parallel machines
  => Shows fundamental role of parallelism even in "sequential" computers
Phases in "VLSI" Generation
[Figure: transistor count, 1,000–100,000,000 (log scale), 1970–2005, annotated with eras — bit-level parallelism, instruction-level parallelism, thread-level parallelism (?) — data points: i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000]
Architectural Trends
• Greatest trend in VLSI generation is increase in parallelism
  – Up to 1985: bit-level parallelism: 4-bit → 8-bit → 16-bit
    » slows after 32 bit
    » adoption of 64-bit now under way, 128-bit far (not performance issue)
    » great inflection point when 32-bit micro and cache fit on a chip
  – Mid 80s to mid 90s: instruction-level parallelism
    » pipelining and simple instruction sets, + compiler advances (RISC)
    » on-chip caches and functional units => superscalar execution
    » greater sophistication: out-of-order execution, speculation, prediction
      • to deal with control transfer and latency problems
  – Next step: thread-level parallelism? Bit-level parallelism?
How far will ILP go?
[Figure, left: fraction of total cycles (%), 0–30, vs. number of instructions issued, 0–6+; right: speedup, 0–3, vs. instructions issued per cycle, 0–15]
• Infinite resources and fetch bandwidth, perfect branch prediction and renaming
  – real caches and non-zero miss latencies
Thread-Level Parallelism "on board"
• Micro on a chip makes it natural to connect many to shared memory
  – dominates server and enterprise market, moving down to desktop
• Alternative: many PCs sharing one complicated pipe
• Faster processors began to saturate bus, then bus technology advanced
  – today, range of sizes for bus-based systems, desktop to large servers
[Figure: four processors (Proc) sharing a bus to memory (MEM)]
What about Multiprocessor Trends?
[Figure: number of processors, 0–70, vs. year 1984–1998 for bus-based machines — Sequent B8000, Sequent B2100, Symmetry81, Symmetry21, Power, SS690MP 120/140, SGI PowerSeries, SE10, SE30, SE60, SE70, SS10, SS20, SS1000, SS1000E, SC2000, Sun SC2000E, AS2100, AS8400, HP K400, P-Pro, SGI Challenge, SGI PowerChallenge/XL, CRAY CS6400, Sun E6000, Sun E10000]
Bus Bandwidth
[Figure: shared bus bandwidth (MB/s), 10–100,000 (log scale), vs. year 1984–1998 for the same machines — from Sequent B8000 through Sun E10000]
What about Storage Trends?
• Divergence between memory capacity and speed even more pronounced
  – Capacity increased by 1000x from 1980–95, speed only 2x
  – Gigabit DRAM by c. 2000, but gap with processor speed much greater
• Larger memories are slower, while processors get faster
  – Need to transfer more data in parallel
  – Need deeper cache hierarchies
  – How to organize caches?
• Parallelism increases effective size of each level of hierarchy, without increasing access time
• Parallelism and locality within memory systems too
  – New designs fetch many bits within memory chip; follow with fast pipelined transfer across narrower interface
  – Buffer caches most recently accessed data
  – Processor in memory?
• Disks too: parallel disks plus caching
Economics
• Commodity microprocessors not only fast but CHEAP
  – Development costs tens of millions of dollars
  – BUT, many more are sold compared to supercomputers
  – Crucial to take advantage of the investment, and use the commodity building block
• Multiprocessors being pushed by software vendors (e.g. database) as well as hardware vendors
• Standardization makes small, bus-based SMPs commodity
• Desktop: few smaller processors versus one larger one?
• Multiprocessor on a chip?
Can anyone afford high-end MPPs?
• ASCI (Accelerated Strategic Computing Initiative) — ASCI White, built by IBM
  – 12.3 TeraOps, 8192 processors (RS/6000)
  – 6 TB of RAM, 160 TB disk
  – 2 basketball courts in size
  – Program it? Message passing
Consider Scientific Supercomputing
• Proving ground and driver for innovative architecture and techniques
  – Market smaller relative to commercial as MPs become mainstream
  – Dominated by vector machines starting in 70s
  – Microprocessors have made huge gains in floating-point performance
    » high clock rates
    » pipelined floating-point units (e.g., multiply-add every cycle)
    » instruction-level parallelism
    » effective use of caches (e.g., automatic blocking)
  – Plus economics
• Large-scale multiprocessors replace vector supercomputers
Raw Uniprocessor Performance: LINPACK
[Figure: LINPACK MFLOPS, 1–10,000 (log scale), vs. year 1975–2000 — CRAY n=100 and n=1,000 (CRAY 1s, Xmp/14se, Xmp/416, Ymp, C90, T94) vs. micro n=100 and n=1,000 (Sun 4/260, MIPS M/120, MIPS M/2000, IBM RS6000/540, MIPS R4400, HP 9000/750, HP 9000/735, DEC Alpha AXP, DEC Alpha, IBM Power2/990, DEC 8200)]
Raw Parallel Performance: LINPACK
[Figure: LINPACK GFLOPS, 0.1–10,000 (log scale), vs. year 1985–1996 — CRAY peak (Xmp/416(4), Ymp/832(8), C90(16), T932(32)) vs. MPP peak (iPSC/860, nCUBE/2(1024), CM-2, CM-200, Delta, CM-5, Paragon XP/S, T3D, Paragon XP/S MP(1024), Paragon XP/S MP(6768), ASCI Red)]
• Even vector Crays became parallel: X-MP (2–4), Y-MP (8), C-90 (16), T94 (32)
• Since 1993, Cray produces MPPs too (T3D, T3E)
500 Fastest Computers
[Figure: number of systems, 0–350, among the 500 fastest computers, 11/93–11/96, by machine type (PVP, MPP, SMP) — PVP count falling while MPP and SMP counts grow; data labels include 319, 313, 284, 239, 198, 187, 110, 106, 73, 63]
Summary: Why Parallel Architecture?
• Increasingly attractive
  – Economics, technology, architecture, application demand
• Increasingly central and mainstream
• Parallelism exploited at many levels
  – Instruction-level parallelism
  – Multiprocessor servers
  – Large-scale multiprocessors ("MPPs")
• Focus of this class: multiprocessor level of parallelism
• Same story from memory system perspective
  – Increase bandwidth, reduce average latency with many local memories
• Spectrum of parallel architectures make sense
  – Different cost, performance and scalability
Where is Parallel Arch Going?
Old view: divergent architectures, no predictable pattern of growth.
[Figure: Application Software and System Software atop diverging architectures — SIMD, Message Passing, Shared Memory, Dataflow, Systolic Arrays]
• Uncertainty of direction paralyzed parallel software development!
Today
• Extension of “computer architecture” to support communication and cooperation
– Instruction Set Architecture plus Communication Architecture
• Defines – Critical abstractions, boundaries, and primitives (interfaces)– Organizational structures that implement interfaces (hw or sw)
• Compilers, libraries and OS are important bridges today