Streaming SupercomputerStrawman
Bill Dally, Jung-Ho Ahn, Mattan Erez, Ujval Kapasi, Tim Knight, Ben Serebrin
April 15, 2002
2
Outline• Overview
• Operation
• Stream-ISA
• Kernel-ISA
• Micro-architecture
• Next steps
3
System Overview
StreamProcessor64 FPUs
64GFLOPS
16 xDRDRAM2GBytes
38GBytes/s
20GBytes/s32+32 pairs
Node
On-Board Network
Node2
Node16 Board 2
16 Nodes1K FPUs1TFLOPS32GBytes
Intra-Cabinet Network(passive - wires only)
Board 64
160GBytes/s256+256 pairs
10.5" Teradyne GbX
Board
Cabinet
Inter-Cabinet Network
Cabinet 264 Boards1K Nodes64K FPUs64TFLOPS
2TBytes
E/OO/E
5TBytes/s8K+8K links
Ribbon Fiber
Cabinet 16
Bisection 64TBytes/s
All links 5Gb/s per pair or fiberAll bandwidths are full duplex
Roundtrip memory access latency ~ 500ns = 500 processor cycles
4
Board Overview
• Chips– 16 SS nodes
– 4 router chips – provide 4 independent routing planes on each board
• Ports– 32 to back-plane network
– 16 to global network
Board-levelRouter Chip
SS Node 0
SS Node 15
8
4
Back-planenetwork
Globalnetwork
5
Node Overview
StreamExecution Unit
StreamRegister File
MemorySystem
NetworkInterface
ScalarExecution Unit
texttext
DRDRAMNetwork
6
Node Operation [1]
• MIPS assembly• COP2 instructions encode
stream instructions
• VLIW microcode• Called by instructions in
the scalar program
BrookProgram
KernelFunctions
ScalarProgram(s?)
7
Node Operation [2]Example Execution• Transfer data from DRDRAM and
network into SRF
• Execute Kernel k1
• Execute Kernel k2
• Synchronize (across nodes)
• Store results to DRDRAM and network
• Synchronize
StreamExecution Unit
StreamRegister File
MemorySystem
NetworkInterface
ScalarExecution Unit
texttext
DRDRAMNetwork
8
Stream-ISA Visible StateState which is used by a scalar (MIPS) program:
• MIPS registers– Standard processor registers
– Coprocessor 2 interface registers – the MIPS program’s interface to the SSS
• SRF
• Global memory: all memory on all nodes can be addressed by any node
• Control registers:– Segment registers: implement virtual memory
– Stream descriptor registers & memory descriptor registers: hold parameters such as length, record size, etc.
• Stream cache
9
Stream Instruction Set• MIPS instructions
– Standard processor instructions– COP2 instruction used to issue stream instructions
• Stream instructions– Write control registers– Stream load, store– Stream cache prefetch, invalidate– Kernel load, execute
• More on…– Messaging instructions– Global Synchronization– Exceptions/Interrupts
• Open Issues– COP2 interface– Critical sections for sending stream instructions
10
Memory Model• Address space shared by scalar/stream units
• No data-duplication of global data in local DRAM
• Virtual memory implemented via programmable segment registers– Virtual address: (SegNum, SegOffset)
– Physical address: (NodeNum, LocalOffset)
• Segments can be interleaved across nodes in user-specified amounts.
• Read-mostly stream cache with gang invalidation– MIPS instructions specify which stream elements to
cache
• No HW paging support
• Open Issues– Cache write policy
– Coherence
Node 16
Node 17
Node 18 Node 19
Node 21
Node 22 Node 23
Node 20
Segment Register:NodeBase = 16Size = 256MBBase = 1GBNodeBits = 20-22
Phys: 1G;Virtual: 0
Virtual: 8M
11
Global Synchronization• Minimal global synchronization and communication
mechanisms– Barrier
– Remote update - Fetch-and-add/Compare-and-swap
• Implement all other synchronization primitives in software– General messaging mechanism interrupts scalar processor
– e.g., General-purpose synchronization• For example, end a parallel search as soon as one node finds a match
• Open Issues– Hardware acceleration
• Locking: a table with 1K entries for locked locations
12
Exceptions/Interrupts• Scalar processor handles all interrupts
– e.g., Message received from remote node interrupts scalar processor
• Stream operations delay exceptions until the end of the operation– Examples: divide-by-0 in clusters, invalid address in memory system
– Information about exception saved
– Scalar processor must read this state to figure out nature of exception
• Open Issues– Best way to save info about stream operation exceptions and transfer this to
scalar processor
– Multithreading scalar processor
13
Kernel-ISA Visible StateState which is used by a kernel function:
• Per-cluster:– Local register files
• Per ALU register files
• Cluster condition code registers
• Scratchpad: small indexable register file, used for:– lookup tables
– complex data structures
– register spills
• Per-node:– Microcontroller register file, used for:
• Loop counters
• Data to be broadcast to clusters
• Passing parameters between the MIPS processor and the kernel functions
– Microcode store contains the kernel VLIW instructions.
– Microcontroller condition code registers, for looping and conditional streams
– Microprogram counter
14
Kernel Instruction Set• Kernel microprogram consists of VLIW instructions• Each cluster in the node is sent the same control signals
each cycle (i.e.) they execute in lockstep (SIMD)• Kernel VLIW instructions control:
– Microcontroller units and register files
– Cluster arithmetic units and register files
– Inter-cluster switch
– Transfers of stream data between the clusters and the SRF
• More on…– Conditional Operations
15
Conditionals• Hardware select (analog to C “?:” operator)• Scratchpad (using register indexing)• Conditional Streams• Open Issues
– Assuming that Brook will retain ‘if-statements’• Need to automatically map code to conditional streams or predication
• Requires a method to “split” a kernel
16
Scalar Execution Unit
• Scalar processor is a MIPS (or Tensilica) core with data and instruction caches• Scalar processor issues stream instructions to the stream controller
– Stream Controller stores them in a scoreboard– Instructions issued when
1. all required resources are available2. all inter-instruction dependencies have been satisfied.
• Open Issues– On-chip L2 cache for MIPS Core– Best method to encode dependencies between stream instructions
Processor-MemoryInterface
StreamController
Scalar Processor
Processor-NetworkInterface
SSS Blocks Memory System Network
17
Stream Execution Unit
• Kernel instructions issued by microcontroller– SIMD clusters– VLIW control of ALUs within a cluster
• Open Issues– “Aspect Ratio”– Inter-cluster switch implementation– Local register file organization– FU Mix– Division / Square root support– Integer unit for logical ops
Micro-controller
Inter-cluster Switch
StreamController
SRF
Cluster0
SRF
Cluster15
SRF
Intra-clusterSwitch
RF
RF
FPMul
RF
RF
FPMul
RF
RF
FPAdd
RF
RF
FPAdd
RF
RF
FPDSQ
RF
RFScratch-Pad
RF
SRF
Inter-clusterSwitch
Cluster
18
Stream Register File
Single-portedSRAM
64KW1024 x 2048b
64W/cycle
Arbiter
To/FromArithmeticClusters
Streambuffers
19
Memory System• Memory Unit
– Translates virtual addresses
– Routes requests and replies
– Frames new network message for each external word
• Requestors– Address generators
– Scalar processor
– Stream cache
– Network
• Suppliers– Local DRAM
– Stream cache
– Network
• Open Issues– Multidimensional strides
Memory Unit
StreamCache
DRAMInterface
Memory-NetworkInterface
Processor-MemoryInterface
SRF
DRAM
NetworkInterface
ScalarProcessor
AddressGenerators
ReorderBuffers
20
Network Interface
• Flit-reservation flow-control• Expect to be able to service messages faster than arrival
rate• Open Issues:
– Need to flesh out the details of this module
NetworkInterface
Memory-NetworkInterface
Processor-NetworkInterface
Network:4 channels,5GB/s each
To ScalarProcessor
To ScalarProcessor
Interrupt Line
To MemorySystem
21
Next Steps• SVM will be used to study system-wide architecture
– e.g. System bandwidths, synchronization mechanisms, etc.
• Cycle-accurate simulator (ssim)– Used for node architecture studies– Validate multi-node results– Current emphasis on getting single-node simulations to work– Feedback single-node results (i.e., kernel results) into SVM for quick but fairly
accurate system-level results
• Global architecture studies– Feasibility, Global mechanisms, network topologies
• Node architecture studies– Aspect ratio, FU mix, inter-cluster switch, conditionals, SRF size/bandwidth
• Most are gated on getting apps ported• We can start with Imagine apps for now
– Area/Power estimates
• End-of-quarter project-wide goal is to have Brook apps running on SVM and ssim.– We should be able to conduct architectural experiments at that point