lec jan12 2009
TRANSCRIPT
Introduction, 12th January, 2009
CSL718 : Architecture of High Performance Systems
Anshul Kumar, CSE IITD

Some basic questions
• What is high performance?
  – Rate of computation; time to compute
• Who needs high performance systems?
  – Weather prediction, complex design, scientific computation etc.
  – Everyone needs it.
• How do you achieve high performance?
  – Technology
  – Circuit / logic design
  – Architecture
• How to analyse or evaluate performance?
  – Theoretical models
  – Simulation
  – Experimentation

Execution Time and Clock Period
Pipeline stages: IF  D  RF  EX/AG  M  WB
Instruction execution time = Tinst = CPI * Δt
Program execution time = Tprog = N * Tinst = N * CPI * Δt
where N = number of instructions, CPI = average cycles per instruction, Δt = clock cycle time.
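
A quick numeric sketch of this relation in Python; the values are illustrative assumptions, not figures from the lecture:

# Tprog = N * CPI * dt, with assumed example values
N = 2_000_000      # dynamic instruction count
CPI = 1.5          # average cycles per instruction
dt = 1e-9          # clock cycle time: 1 ns (1 GHz clock)

Tinst = CPI * dt   # average instruction execution time
Tprog = N * Tinst  # program execution time
print(f"Tinst = {Tinst * 1e9:.2f} ns, Tprog = {Tprog * 1e3:.2f} ms")
# -> Tinst = 1.50 ns, Tprog = 3.00 ms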

What influences clock period?
Tprog = N * CPI * Δt
• Technology reduces Δt
• Software reduces N
• Architecture reduces N * CPI * Δt
  – Instruction set architecture (ISA): trade-off between N and CPI * Δt
  – Microarchitecture (μA): trade-off between CPI and Δt
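
A sketch of the ISA trade-off with two hypothetical machines: A uses a denser instruction set (smaller N, larger CPI), B a simpler one (larger N, lower CPI, faster clock). All numbers are assumed for illustration:

def tprog(N, CPI, dt):
    # Tprog = N * CPI * dt
    return N * CPI * dt

# Machine A: denser ISA -> fewer instructions, more cycles per instruction.
# Machine B: simpler ISA -> more instructions, lower CPI and faster clock.
A = tprog(N=1_000_000, CPI=3.0, dt=1.0e-9)
B = tprog(N=1_500_000, CPI=1.2, dt=0.8e-9)
print(f"A: {A * 1e3:.2f} ms, B: {B * 1e3:.2f} ms")  # A: 3.00 ms, B: 1.44 ms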

Relative performance per unit cost

Year   Technology            Perf/cost
1951   Vacuum tube                   1
1965   Transistor                   35
1975   Integrated circuit          900
1995   VLSI                  2,400,000

Increase in workstation performance
[Figure: performance vs year, 1987–1997, scale 0–1200. Machines plotted include MIPS M/120, SUN-4/260, MIPS M2000, IBM RS6000, IBM POWER 100, DEC AXP/500, HP 9000/750, DEC Alpha 4/266, DEC Alpha 5/300, DEC Alpha 5/500 and DEC Alpha 21264/600.]

Growth in DRAM Capacity
[Figure: Kbit capacity (log scale) vs year of introduction, 1976–1996: 16K, 64K, 256K, 1M, 4M, 16M, 64M.]

CPU–Memory Performance Gap
• Semiconductor (random access)
  – Registers (CPU speed)
  – SRAM
  – DRAM
  – FLASH
• Magnetic (slow)
  – FDD
  – HDD
• Optical (very slow; random + sequential access)
  – CD
  – DVD

Memory Hierarchy Principle
[Diagram: levels of memory between the CPU and main memory. Closest to the CPU: smallest size, highest cost/bit, fastest; farthest: biggest, lowest cost/bit, slowest. A memory access is a hit if found at the current level, a miss otherwise.]
• Temporal locality – references repeated in time
• Spatial locality – references repeated in space
  – Special case: sequential locality
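
The hit/miss behaviour is commonly summarised as average memory access time, AMAT = hit time + miss rate * miss penalty. A minimal sketch with assumed numbers:

# AMAT = hit_time + miss_rate * miss_penalty (assumed example values)
hit_time = 1         # cycles on a cache hit
miss_rate = 0.05     # fraction of accesses that miss
miss_penalty = 100   # cycles to fetch from the next, slower level

amat = hit_time + miss_rate * miss_penalty
print(f"AMAT = {amat:.1f} cycles")  # AMAT = 6.0 cycles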

Parallelism : Flynn's Classification
Architecture categories: SISD, SIMD, MISD, MIMD

SISD
[Diagram: one control unit (C) issuing a single instruction stream (IS) to one processor (P), with a single data stream (DS) to memory (M).]

SIMD
[Diagram: one control unit (C) issuing a single instruction stream (IS) to multiple processors (P), each with its own data stream (DS) to memory (M).]

MISD
[Diagram: multiple control units (C), each issuing its own instruction stream (IS) to a processor (P); the processors operate on a common data stream (DS) to and from memory (M).]

MIMD
[Diagram: multiple control units (C), each issuing its own instruction stream (IS) to its own processor (P), each with its own data stream (DS) to memory (M).]

Feng's Classification
[Figure: machines plotted by word length (1–64, horizontal axis) versus bit-slice length (1–16K, vertical axis). Points shown: MPP, STARAN, C.mmP, PDP11, PEPE, IBM370, IlliacIV, CRAY-1.]

Händler's Classification
<K x K', D x D', W x W'>   (control, data, word)
dash (′) → degree of pipelining
• TI-ASC    <1, 4, 64 x 8>
• CDC 6600  <1, 1 x 10, 60> x <10, 1, 12> (I/O)
• C.mmP     <16,1,16> + <1x16,1,16> + <1,16,16>
• PEPE      <1 x 3, 288, 32>
• Cray-1    <1, 12 x 8, 64 x (1 ~ 14)>
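
One way to read the notation in code. Treating the product of all six components as an overall parallelism figure is an assumption of this sketch, not something stated on the slide:

# Händler descriptor <K x K', D x D', W x W'> as a 6-tuple.
# K: control units, D: ALUs per control unit, W: word length;
# the primed (dashed) values are degrees of pipelining at each level.
def parallelism(K, Kp, D, Dp, W, Wp):
    # Assumed reading: overall parallelism = product of all components.
    return K * Kp * D * Dp * W * Wp

ti_asc = parallelism(K=1, Kp=1, D=4, Dp=1, W=64, Wp=8)  # TI-ASC <1, 4, 64 x 8>
print(ti_asc)  # 2048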

Modern Classification
Parallel architectures divide into:
• Data-parallel architectures
• Function-parallel architectures

Data Parallel Architectures
• SIMD processors – multiple processing elements driven by a single instruction stream
• Vector processors – uniprocessors with vector instructions (see the sketch after this list)
• Associative processors – SIMD-like processors with associative memory
• Systolic arrays – application-specific VLSI structures
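
A software analogy for the vector style; NumPy stands in for vector hardware here, purely as an illustration, not as how the machines above are programmed:

import numpy as np

a = np.arange(8, dtype=np.float64)
b = np.ones(8)

# Scalar style: one add per loop iteration, i.e. one operation per element.
c_scalar = np.empty(8)
for i in range(8):
    c_scalar[i] = a[i] + b[i]

# Vector style: a single operation applied to all elements at once.
c_vector = a + b
assert (c_scalar == c_vector).all()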

Function Parallel Architectures
Function-parallel architectures divide by granularity:
• Instruction-level parallel architectures (ILPs)
  – Pipelined processors
  – VLIWs
  – Superscalar processors
• Thread-level parallel architectures
• Process-level parallel architectures (MIMDs)
  – Distributed memory MIMD
  – Shared memory MIMD

Pipelining
Pipeline stages: IF  D  RF  EX/AG  M  WB
• Pipelining gives higher throughput than a simple multicycle design.
• Simple multicycle design: resources are shared across cycles; instructions need not all take the same number of cycles.
An idealised estimate of the gain is sketched below.
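
A minimal sketch, assuming a hazard-free k-stage pipeline in which n instructions complete in k + n - 1 cycles:

# Ideal k-stage pipeline: n instructions take (k + n - 1) cycles,
# versus n * k cycles without pipelining (no hazards assumed).
k, n = 6, 1000  # six stages, as in IF D RF EX/AG M WB
unpipelined = n * k
pipelined = k + n - 1
print(f"speedup = {unpipelined / pipelined:.2f}")  # ~5.97, approaches k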

Limits of Pipelining
• Structural hazards – resource conflicts: two instructions require the same resource in the same cycle
• Data hazards – data dependences: one instruction needs data that is yet to be produced by another instruction
• Control hazards – the decision about the next instruction needs more cycles
The cost of hazards can be folded into an effective CPI, as sketched below.
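
A minimal sketch of that accounting; the stall frequencies and penalties are assumed for illustration:

# effective CPI = ideal CPI + sum(stall frequency * stall penalty)
ideal_cpi = 1.0
data_hazard_freq, data_hazard_penalty = 0.10, 1        # e.g. load-use stalls
control_hazard_freq, control_hazard_penalty = 0.15, 2  # e.g. taken branches

cpi = (ideal_cpi
       + data_hazard_freq * data_hazard_penalty
       + control_hazard_freq * control_hazard_penalty)
print(f"effective CPI = {cpi:.2f}")  # effective CPI = 1.40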

ILP in VLIW processors
[Diagram: a fetch unit reads a single multi-operation instruction from cache/memory; its operations are issued in parallel to multiple functional units (FU) sharing one register file.]

ILP in Superscalar processors
[Diagram: a fetch unit reads multiple instructions from a sequential instruction stream in cache/memory; a decode-and-issue unit dispatches them to multiple functional units (FU) sharing one register file. Legend distinguishes instruction/control paths from data paths. FU = functional unit.]

Superscalar and VLIW processors

Issues in ILP Architectures
[Diagram: functional units (FU) connected to a shared register file.]
• Scalability with increase in number of register ports
• ILP detection – special compilers / special hardware
• Code compatibility
• Code density, instruction encoding
• Maintaining consistency

ILP and Multithreading
[Figure: issue-slot utilisation under ILP, coarse-grained MT, fine-grained MT and SMT; after Hennessy and Patterson.]

Why Process level Parallel Architectures?
Within function-parallel architectures:
• Instruction-level PAs
• Thread-level PAs
• Process-level PAs (MIMDs) – built using general-purpose processors
  – Distributed memory MIMD
  – Shared memory MIMD
Data-parallel architectures form the other branch.

Issues from user's perspective
• Specification / program design
  – explicit parallelism, or
  – implicit parallelism + parallelizing compiler
• Partitioning / mapping to processors
• Scheduling / mapping to time instants
  – static or dynamic
• Communication and synchronization

Parallel programming models
• Concurrent control flow
• Functional or logic programs
• Vector/array operations
• Concurrent tasks/processes/threads/objects
  – with shared variables or message passing (both styles are sketched below)
What is the relationship between programming model and architecture?
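
A minimal sketch of the two coordination styles using Python threads; the structure and names are illustrative only:

import threading
import queue

# Style 1: shared variable, protected by a lock.
counter = 0
lock = threading.Lock()

def worker():
    global counter
    for _ in range(10_000):
        with lock:          # synchronisation on shared state
            counter += 1

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)              # 20000

# Style 2: message passing through a queue, no shared mutable state.
q = queue.Queue()

def producer():
    for i in range(5):
        q.put(i)            # send
    q.put(None)             # end-of-stream marker

def consumer():
    while (msg := q.get()) is not None:
        print("received", msg)  # receive

p, c = threading.Thread(target=producer), threading.Thread(target=consumer)
p.start(); c.start(); p.join(); c.join()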

Issues from architect's perspective
• Coherence problem in shared memory with caches
• Efficient interconnection networks

Shared Memory Multiprocessor
[Diagram: processors (P) and memory modules (M) connected through an interconnection network; a second variant attaches a local memory to each processor; larger systems join such clusters through a global interconnection network with additional memory modules.]

Cache Coherence Problem
Multiple copies of data may exist ⇒ problem of cache coherence.
Options for coherence protocols:
• What action is taken? – invalidate or update (see the toy sketch below)
• Which processors/caches communicate? – snoopy (broadcast) or directory based
• What status is kept for each block?
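
A toy write-invalidate sketch, tracking only valid/invalid copies per block; real protocols such as MSI/MESI keep more states per block, so this is an illustration only:

# Toy write-invalidate snooping: each cache holds {block: value} copies.
caches = [{"X": 5}, {"X": 5}, {}]   # caches 0 and 1 hold copies of X

def write(writer, block, value):
    # Broadcast an invalidate: every other cache drops its copy.
    for i, c in enumerate(caches):
        if i != writer:
            c.pop(block, None)
    caches[writer][block] = value   # writer now has the only valid copy

write(0, "X", 7)
print(caches)  # [{'X': 7}, {}, {}] -- no stale copy of X survives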

Interconnection Networks
• Architectural variations:
  – Topology
  – Direct or indirect (through switches)
  – Static (fixed connections) or dynamic (connections established as required)
  – Routing type (store-and-forward / wormhole; see the latency sketch below)
• Efficiency:
  – Delay
  – Bandwidth
  – Cost
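
Rough, contention-free first-order latency models for the two routing types; the numbers are assumptions:

# Packet of L bytes, channel bandwidth B bytes/cycle, D hops, flit size Lf.
L, B, D, Lf = 1024, 1.0, 8, 8

# Store-and-forward: the whole packet is received and retransmitted per hop.
t_sf = D * (L / B)              # 8192 cycles

# Wormhole: only the header flit pays the per-hop cost; the body pipelines.
t_wh = D * (Lf / B) + L / B     # 1088 cycles
print(t_sf, t_wh)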

Quest for Performance
1946  ENIAC ($0.5M, 18K vacuum tubes, 150 kW): add/sub 5000 per sec, mult 385 per sec, div 40 per sec, sqrt 3 per sec
1962  Atlas (pipelined, integer + FPU): 200 KFLOPS
1962  Burroughs D825 (4 CPUs, 16 memory modules)
1964  CDC 6600 (first supercomputer): multiple FUs, dynamic scheduling
1972  ILLIAC-IV (64 PEs, 4 MFLOPS each)

Fastest Supercomputers (ref www.top500.org)
• IBM's Blue Gene/L at Lawrence Livermore Lab topped the June 2006 list with 280.6 teraflops.
• Japan's Earth Simulator, introduced in 2002, was the fastest at 35.8 teraflops until Blue Gene took over in 2004.
• Japan proposed (2005) a supercomputer 73 times faster than the then-current best. Target: 10 petaflops, budget $800–$900 million, date 2011.
• Tata Sons' EKA entered at the 4th spot in 2007 with 132.8 teraflops.
• Energy efficiency (max 488 MFLOPS/watt) was also listed from June 2008.

June 2008 list (Rank – Computer – Site)
1. Roadrunner - BladeCenter QS22/LS21 Cluster, PowerXCell 8i 3.2 GHz / Opteron DC 1.8 GHz, Voltaire Infiniband, IBM (1026 teraflops) – DOE/NNSA/LANL, United States
2. BlueGene/L - eServer Blue Gene Solution, IBM – DOE/NNSA/LLNL, United States
3. Blue Gene/P Solution, IBM – Argonne National Laboratory, United States
4. Ranger - SunBlade x6420, Opteron Quad 2 GHz, Infiniband, Sun Microsystems – Texas Advanced Computing Center / Univ. of Texas, United States
5. Jaguar - Cray XT4 QuadCore 2.1 GHz, Cray Inc. – DOE/Oak Ridge National Laboratory, United States
6. JUGENE - Blue Gene/P Solution, IBM – Forschungszentrum Juelich (FZJ)
7. Encanto - SGI Altix ICE 8200, Xeon quad core 3.0 GHz, SGI – New Mexico Computing Applications Center (NMCAC), United States
8. EKA - Cluster Platform 3000 BL460c, Xeon 53xx 3 GHz, Infiniband, HP (133 teraflops) – Computational Research Laboratories, TATA SONS, India

Blue Gene Supercomputer
• 32 x 32 x 64 3D torus (65,536 nodes)
• Global reduction tree – max/sum in a few μs
• Fast synchronization across the entire machine within a few μs
• 1,024 Gbps links to a global parallel file system

Blue Gene Supercomputer contd.

Embedded vs GP Computing
• Fixed functionality
• Part of a larger system
• Interacts with the environment
• Real-time requirements
• Power constraints
• Environmental constraints
• Performance cannot be increased simply by increasing the clock frequency

Cradle CT 3616 Architecture

IBM Cell Architecture
• Clock speed: > 4 GHz
• Peak performance (single precision): > 256 GFLOPS
• Peak performance (double precision): > 26 GFLOPS
• SPU registers: 128 x 128b
• Local storage per SPU: 256 KB
• Area: 221 mm²
• Technology: 90 nm SOI
• Total number of transistors: 234M

Books
1. D.A. Patterson, J.L. Hennessy, "Computer Architecture: A Quantitative Approach", Morgan Kaufmann Publishers, 2006.
2. D. Sima, T. Fountain, P. Kacsuk, "Advanced Computer Architectures: A Design Space Approach", Addison Wesley, 1997.
3. M.J. Flynn, "Computer Architecture: Pipelined and Parallel Processor Design", Narosa Publishing House / Jones and Bartlett, 1996.
4. K. Hwang, "Advanced Computer Architecture: Parallelism, Scalability, Programmability", McGraw Hill, 1993.
5. H.G. Cragon, "Memory Systems and Pipelined Processors", Narosa Publishing House / Jones and Bartlett, 1998.
6. D.E. Culler, J.P. Singh and Anoop Gupta, "Parallel Computer Architecture: A Hardware/Software Approach", Harcourt Asia / Morgan Kaufmann Publishers, 2000.