Scaling Computers: Why Streaming Is Interesting
TRANSCRIPT
Streaming Workshop, August 2003
Mark Horowitz, Bill Dally, Christos Kozyrakis, Kunle Olukotun
Computer Systems Laboratory, Stanford University
Processor Model
• User model has been very stable for 30 years
  – Sequentially executes instructions
  – IO operations interact with the outside world
• The model has hidden the scaling of technology
  – Efficiently transformed transistors into performance
  – 8008: 3,500 transistors, ran at 200kHz
  – P4: 42M transistors, runs at 3GHz
  – Performance changed from 0.06 MIPS to >1000 MIPS
• C is a perfect fit to this programming model
  – They grew up together …
Why Talk About Anything Else?
• People hate change
– Programmers especially
• Most applications are written for this computation model
– It is also what the cheapest machines execute
• It is hard enough to make things work in this model
– Parallel programs are much more complex
– Need to focus on increasing productivity of programmers
• Hardware is always getting faster
– Wait 18 months and the machine will be 2x performance
• DARPA funding is not a good answer
– IMHO
Supporting Data
[Plot: SPECint2000 (log scale, 1–1000) vs. year, 1985–2002, for Intel 386/486/Pentium/Pentium 2/Pentium 3/Pentium 4/Itanium, Alpha 21064/21164/21264, Sparc, SuperSparc, Sparc64, MIPS, HP PA, PowerPC, AMD K6, AMD K7]
The World Is About To Change
• Processor performance will not continue to scale
– We will fall off the current performance curve soon
• Many factors will cause this to occur
– VLSI wire issues (global structures are hard to build)
– Insufficient recoverable ILP
– Power
• This performance growth was partially an illusion
Technology Scaling
Scaling CMOS has two direct effects:
• Devices get smaller
– Both transistors and wires
– Get more per square mm
– Generally means they get cheaper
– Enables more complex devices
• Transistors get faster
– So do wires when viewed the ‘right’ way
FO4 Inverter Delay Under Scaling
Gate delay varies linearly with process technology (so far)
• Useful rule of thumb: Dgate ≈ 500ps/µm × Ldrawn at TT
• Issues with being able to continue this scaling
  – Some current technologies are faster (Le < feature size)
[Plot: fanout-of-4 inverter delay (ps, 0–700) vs. technology Ldrawn (1.2µm down to 0.2µm), simulated at TT, 90% Vdd, 125°C; the data track the 500 × Ldrawn line]
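The rule of thumb above can be sketched as a one-line function; the function name is mine, and the node values below are just illustrative examples:

```python
# Hedged sketch of the slide's rule of thumb for FO4 inverter delay:
# Dgate ~= 500 ps/um * Ldrawn (drawn channel length in microns, TT corner).
def fo4_delay_ps(ldrawn_um: float) -> float:
    """Approximate FO4 inverter delay in picoseconds."""
    return 500.0 * ldrawn_um

for node in [0.25, 0.18, 0.13]:
    print(f"{node} um -> ~{fo4_delay_ps(node):.0f} ps")
```

So a 0.18µm process gives roughly a 90ps FO4 delay by this estimate (the later slide notes the formula overestimates the delay for the most recent nodes).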
Fixed Length Wire Scaling
[Plots: semi-global wire capacitance (pF) and resistance (kΩ) for a 1mm-long wire vs. technology Ldrawn (0.25µm down to 0.035µm), under aggressive and conservative scaling]
• R gets quite a bit worse with scaling; C is basically constant
Module Level Wire Scaling
[Plots: semi-global wire capacitance (pF) and resistance (kΩ) for a wire whose length scales with the technology, vs. technology Ldrawn (0.25µm down to 0.035µm), under aggressive and conservative scaling]
• R is basically constant, and C falls linearly with scaling
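The consequence of these two slides can be sketched numerically: a fixed-length wire's distributed-RC (Elmore) delay grows relative to the gate delay as R rises. The R and C values below are illustrative placeholders, not numbers read from the plots:

```python
# Hedged sketch: distributed-RC (Elmore) wire delay t ~= 0.38 * R * C,
# compared against the talk's gate-delay rule of thumb (500 ps/um * Ldrawn).
def wire_delay_ps(r_ohm: float, c_pf: float) -> float:
    return 0.38 * r_ohm * c_pf      # ohms * pF gives picoseconds

def gate_delay_ps(ldrawn_um: float) -> float:
    return 500.0 * ldrawn_um

# A fixed 1mm wire: C stays roughly constant while R rises as the
# cross-section shrinks (illustrative R values, in ohms).
for ldrawn, r in [(0.18, 100.0), (0.07, 300.0)]:
    ratio = wire_delay_ps(r, 0.2) / gate_delay_ps(ldrawn)
    print(f"{ldrawn} um: wire/gate delay ratio = {ratio:.2f}")
```

Even with made-up constants the shape is the point: gates speed up with scaling, a fixed-length wire does not, so the same wire costs more cycles each generation.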
The World is Growing
• The problem associated with wires is really due to complexity
• Diagram shows the logical span you can reach in a cycle
  – It also shows the logical span of a chip
[Diagram: old view — the distance a wire reaches in one cycle covers the logical chip size, so a chip looks small to a wire; new view — a chip looks really big to a wire]
Communication is expensive, even on-chip
Architecture
• Convert transistors to performance
• Use transistors to
  – Exploit parallelism
  – Or create it (speculate)
• Processor generations
  – Simple machine
    • Reuse hardware
  – Pipelined
    • Separate hardware for each stage
  – Super-scalar
    • Multiple-port memories, function units
  – Out-of-order
    • Mega-ports, complex scheduling
  – Speculation
• Each design has more logic to accomplish the same task (but faster)
Architecture Scaling
• Plot of IPC
  – Compiler + IPC
  – 1.5x / generation
  – Until PIII, now falling
• There is a lot of hardware to make this happen
  – Many transistors
  – Lots of power
  – Lots of designers
[Plot: IPC metric (0.00–0.05) vs. date, Jan-85 to Jan-00, for 80386, 80486, Pentium, Pentium II, Pentium III, Pentium 4]
SpecInt/MHz
[Plot: SpecInt/MHz (log scale, 0.01–1.00) vs. year, 1985–2002, for Intel 386/486/Pentium/Pentium 2/Pentium 3/Pentium 4/Itanium, Alpha 21064/21164/21264, Sparc, SuperSparc, Sparc64, MIPS, HP PA, PowerPC, AMD K6, AMD K7]
Clock Frequency Scaling
[Plot: clock frequency in MHz (log scale, 10–10000) vs. year, 1985–2002, for Intel 386/486/Pentium/Pentium 2/Pentium 3/Pentium 4/Itanium, Alpha 21064/21164/21264, Sparc, SuperSparc, Sparc64, MIPS, HP PA, PowerPC, AMD K6, AMD K7]
Gates Per Clock
• Clock speed has been scaling faster than the base technology
• The number of FO4 delays in a cycle has been falling
• The number of gates per cycle decreases 1.4x each generation
• Caused by:
  – Faster circuit families (dynamic logic)
  – Better optimization
  – Better micro-architecture
  – Better adder/memory architectures
• All this generally requires more transistors
[Plot: FO4 inverter delays per cycle (log scale, 10–100) vs. date, Dec-83 to Dec-98, for 80386, 80486, Pentium, Pentium II, with an exponential trend line]
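The FO4-per-cycle metric above is easy to compute from the earlier rule of thumb. A hedged sketch — the 3GHz / 0.13µm point is an illustrative example, and, as the next slide's note warns, the simple formula overestimates the FO4 delay by about 2× for recent nodes, so the true count is correspondingly higher:

```python
# Hedged sketch: FO4 delays per clock cycle, using the talk's simple
# rule of thumb for the FO4 delay (500 ps/um * Ldrawn).
def fo4_per_cycle(freq_ghz: float, ldrawn_um: float) -> float:
    cycle_ps = 1000.0 / freq_ghz   # cycle time in picoseconds
    fo4_ps = 500.0 * ldrawn_um     # rule-of-thumb FO4 delay
    return cycle_ps / fo4_ps

print(f"{fo4_per_cycle(3.0, 0.13):.1f}")  # ~5 by the simple formula
```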
Clock Cycle in ‘FO4’
[Plot: clock cycle in FO4 delays (log scale, 10–100) vs. year, 1985–2001, for Intel 386/486/Pentium/Pentium 2/Pentium 3/Pentium 4/Itanium, Alpha 21064/21164/21264, Sparc, SuperSparc, Sparc64, MIPS, HP PA, PowerPC, AMD K6, AMD K7; the Alpha parts are called out]
Note that the points at around 10 FO4 are not correct. The FO4 for these technologies is about ½ my simple formula.
Gates Per Clock
• Current state-of-the-art machines are at 16 FO4 gates per cycle
  – Historical low values (Cray) were at this level
• Overhead for short-tick machines grows rapidly
  – Power
    • Increases clock power per logic function
  – Latency
    • Flip-flops are already 10–20% of the cycle today
  – Logic reach grows smaller
    • What fits in a cycle (how many bits/gates) decreases
• It is difficult to generate a clock at less than 8 FO4 gates
• Continued scaling of gates/clock will be hard
  – Performance gain from 16 FO4 to 8 FO4 is only 20% anyhow
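One way to see how a 2× frequency gain can shrink to ~20% is a toy model (mine, not from the talk): frequency doubles, but the deeper pipeline loses IPC to hazards. The constants below are illustrative, chosen only to show the shape of the trade-off:

```python
# Hedged toy model: perf ~ freq * IPC, where freq ~ 1 / (FO4 per stage),
# pipeline depth ~ total_logic / (FO4 per stage), and IPC degrades with
# depth as IPC ~ 1 / (1 + c * depth). total_logic_fo4 and c are
# illustrative constants, not measurements.
def perf(fo4_per_stage: float, total_logic_fo4: float = 160.0,
         c: float = 0.2) -> float:
    freq = 1.0 / fo4_per_stage
    depth = total_logic_fo4 / fo4_per_stage
    ipc = 1.0 / (1.0 + c * depth)
    return freq * ipc

speedup = perf(8.0) / perf(16.0)
print(f"{speedup:.2f}")  # ~1.2 with these constants
```

With these (made-up) constants, halving the cycle from 16 FO4 to 8 FO4 yields only a 1.2× speedup, matching the slide's 20% figure in spirit.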
Power
[Plot: power in Watts (log scale, 1–100) vs. year, 1985–2001, for Intel 386/486/Pentium/Pentium 2/Pentium 3/Pentium 4/Itanium, Alpha 21064/21164/21264, Sparc, SuperSparc, Sparc64, MIPS, HP PA, PowerPC, AMD K6, AMD K7]
Complexity and Power – The Dark Side
Two other factors will limit conventional scalar processors:
• Power
  – Current state-of-the-art designs are power limited
  – If you apply all the techniques you know about, the design will dissipate too much power
  – Need to get the most performance for XX Watts
    • Performance improvements must be energy efficient
• Complexity
  – Very few companies can afford the design costs
    • Fewer players every year
The Illusion:
• I remember when processor performance plots used to have two lines: microprocessors and mainframes
  – Mainframes had maxed out and improved at the technology rate
    • 2x every 6 years
  – Unless something changes, micros will be there soon
[Diagram: performance vs. time, with a steep µP curve approaching the shallower mainframe curve]
Jim Smith’s Graph
• Graph Jim Smith presented at ISCA 2000
  – Fastest uniprocessor performance over time
  – Since then, growth has been modest
  – Micros were starting from a slow point, so they have had a large growth rate
Coming Opportunity
• Conventional processor scaling is going to slow down
  – Design costs are enormous
  – Improving IPC is getting harder
  – Improving cycle time is getting harder
• For performance, we need to exploit parallelism
  – EV8, Pentium 4 – SMT
  – Power 4 is an explicit multiprocessor
    • Power 5 is an explicit multiprocessor with SMT
• How do we do this well?
  – Create other programming models
  – Make the models match VLSI constraints
  – Don’t worry about universality
What VLSI Wants You to Build
• Modular machine– Lots of potential compute units, w/ memory
[Diagram: a modular machine — an array of processors with registers, connected by an interconnect]
Making Communication Explicit
• In VLSI, communication is what matters
It is the wires, stupid
• Another way of saying this is:
  – In VLSI, building computation elements is easy
  – Keeping them fed is hard
  – Hence, most of a modern processor stages data
• We want a computation model that
  – Makes communication explicit
  – Provides feedback to the programmer about communication
The “Ideal” VLSI Machine
• Lots of simple compute units
– Units fed by cheap (in energy, area) sources – local regs
– Relatively cheap instruction issue logic
• Memory (FIFOs) to decouple data fetch/execute
– Communication takes time (it is the LAW)
– Need to enable the machine to tolerate latency
• Interconnection network with high-bandwidth
– And as small latency as possible
• Connections to large backing store
– Main memory and disk
• Streams are a programming model that matches this machine
What Stream Programming Means To Me
• Programming model where
  – Applications have large amounts of data parallelism
    • So you don’t have to invent it
  – Communication is made explicit
    • So the programmer can tell what will be expensive
  – It is easy to estimate the performance of applications
• Similar to
  – Vector programming
  – Synchronous dataflow graphs
  – Matlab/Simulink
  – Graphics machines, etc.
It does not bother me that this is not general. It is common enough that we should have an efficient solution for these applications.
Stream Programming Abstraction
• The stream model makes communication / locality explicit
  – Temporaries within kernels and producer-consumer variables between kernels are never written back to memory
  – Most communication is represented by arcs, or memory gather/scatter
  – Locality helps support large numbers of ALUs
[Diagram: a depth-extraction stream graph — input data from Image 0 and Image 1 each flow through two convolve kernels into a SAD kernel, producing a depth map as output. Kernels: operations within a kernel operate on local data. Streams: producer-consumer locality]
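The producer-consumer structure of the diagram can be sketched with Python generators standing in for stream kernels: each kernel consumes one stream and yields another, so intermediate values flow kernel-to-kernel without being materialized in a backing "memory". The kernel bodies are placeholders, not the talk's actual convolve/SAD code:

```python
# Hedged sketch: generators as stream kernels in a producer-consumer
# pipeline. Values move directly between kernels; nothing is written
# back to a memory array between stages.
def convolve(stream, weight):
    for x in stream:
        yield weight * x          # stand-in for a real convolution

def sad(stream_a, stream_b):
    for a, b in zip(stream_a, stream_b):
        yield abs(a - b)          # per-element absolute difference

image0 = iter([1, 2, 3, 4])
image1 = iter([2, 2, 5, 4])
depth = sad(convolve(image0, 2), convolve(image1, 2))
print(list(depth))  # [2, 0, 4, 0]
```

Pulling one element of `depth` pulls exactly one element through each upstream kernel — the software analogue of the locality the slide describes.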
Stream Architecture
• Basically need four parts:
  – Function units (add, multiply)
  – Registers to hold the data
  – DMA channels to load/spill the registers
  – Something to sequence instructions
• For efficient operation
  – Want to decouple load/store from execute
    • Need lots of registers
  – Can optimize registers/memory for different functions
Simple Stream Machine
• Vector machines were early stream machines
  – The scalar processor was the sequencer
  – Two levels of storage for the “registers”: vector registers and main memory
• Vector registers are often a limitation
  – #ports = 3 × #FU
  – #regs = latency
• Graphics machines use this resource model
  – Slice the computation into streams
[Diagram: a register file feeding a lane of function units (FU FU FU FU)]
More Refined Model
• Split the registers into two functions
  – Holding temporaries in the computation
  – Input / output of the kernel
• Temps don’t need to be global for all function units
  – Although it is easier for the compiler if they are
• Localized registers will need move instructions
  – They don’t need to hide the memory latency either
    • Generated from function units
• Input / output registers are only for queuing
  – Number of ports does not need to grow with #FU
  – Very regular access out of these registers
  – Access can take longer
  – Build dense memory for these
Imagine Stream Machine
[Diagram: the Imagine stream machine bandwidth hierarchy — four SDRAMs at 2.1 GB/s feed a Stream Register File at 26 GB/s, which feeds the ALU clusters at 435 GB/s]
Stream Virtual Machine
• Really three threads of control
• The control thread is the master
  – Issues non-blocking instructions for the core and DMA
  – Mechanism for respecting dependencies
    • Explicitly specify what each non-blocking instruction depends on
• DMA and core threads
  – Have a queue of work
  – Run when constraints are satisfied
• Machine-dependent resources
  – Graphics SRF = memory
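The control structure above can be sketched as a tiny scheduler (mine, not the SVM specification): a control thread enqueues non-blocking operations tagged with explicit dependencies, and the core and DMA queues run an operation only once everything it names has completed:

```python
# Hedged sketch of the three-thread control model: issue() is the
# non-blocking control thread; run() plays the core and DMA threads,
# each draining its work queue as declared dependencies are satisfied.
# No deadlock detection -- this is an illustration, not an engine.
from collections import deque

class StreamVM:
    def __init__(self):
        self.queues = {"core": deque(), "dma": deque()}
        self.done = set()
        self.log = []

    def issue(self, unit, name, deps=()):
        """Non-blocking: enqueue the op with its dependency tags."""
        self.queues[unit].append((name, set(deps)))

    def run(self):
        """Drain both queues, respecting the declared dependencies."""
        while any(self.queues.values()):
            for unit, q in self.queues.items():
                if q and q[0][1] <= self.done:   # all deps completed
                    name, _ = q.popleft()
                    self.done.add(name)
                    self.log.append((unit, name))

vm = StreamVM()
vm.issue("dma", "load_a")
vm.issue("core", "kernel1", deps=["load_a"])
vm.issue("dma", "store_b", deps=["kernel1"])
vm.run()
print(vm.log)  # [('dma', 'load_a'), ('core', 'kernel1'), ('dma', 'store_b')]
```

Note the control thread never blocks: `kernel1` and `store_b` are issued before `load_a` runs, and the dependency tags alone enforce the order.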
Many Possible Stream Hardware Implementations
• We think (hope) all will fall into the general SVM model
• Need some parameters that characterize the hardware
  – Less is more in this situation
  – Want to export only the key issues
• The goal is to partition the compilation process into two parts
  – High-level compiler
    • Works on parallelization of the code
    • Data blocking and placement
    • Platform independent
  – Low-level (node) compiler
    • Code scheduling
    • Detailed resource allocation
Streaming Summary
• Uniprocessor scaling will change soon
  – Highly constrained design space
  – Performance gains are generally modest
  – Hard to build any really interesting machines
• VLSI wants you to build parallel, modular machines
  – Streams are a programming model that matches the machine
• The stream programming model is familiar to some programmers
  – It is the model in which they prototype SP applications
  – Easy to explain to someone
• Potentially large improvements possible in Ops/sec and Ops/Joule