![Page 1: The Raw All Purpose Unit (APU) based on a Tiled-Processor Architecture A Logical Successor to the CPU, GPU and NPU? Anant Agarwal MIT](https://reader036.vdocuments.site/reader036/viewer/2022062511/551789845503463e368b5429/html5/thumbnails/1.jpg)
The Raw All Purpose Unit (APU) based on a Tiled-Processor Architecture
A Logical Successor to the CPU, GPU and NPU?
Anant Agarwal
MIT
http://www.cag.csail.mit.edu/raw
![Page 2: The Raw All Purpose Unit (APU) based on a Tiled-Processor Architecture A Logical Successor to the CPU, GPU and NPU? Anant Agarwal MIT](https://reader036.vdocuments.site/reader036/viewer/2022062511/551789845503463e368b5429/html5/thumbnails/2.jpg)
A tiled processor architecture prototype: the Raw microprocessor
October 02
![Page 3: The Raw All Purpose Unit (APU) based on a Tiled-Processor Architecture A Logical Successor to the CPU, GPU and NPU? Anant Agarwal MIT](https://reader036.vdocuments.site/reader036/viewer/2022062511/551789845503463e368b5429/html5/thumbnails/3.jpg)
Embedded system: 1020 Element Microphone Array
![Page 4: The Raw All Purpose Unit (APU) based on a Tiled-Processor Architecture A Logical Successor to the CPU, GPU and NPU? Anant Agarwal MIT](https://reader036.vdocuments.site/reader036/viewer/2022062511/551789845503463e368b5429/html5/thumbnails/4.jpg)
1020 Node Beamformer Demo
. 2 People moving about and talking
. Track movement of people using camera (vision group) and display on monitor
. Beamformer focuses on speech of one person
. Select another person using mouse
. Beamformer switches focus to the speech of the other person
![Page 5: The Raw All Purpose Unit (APU) based on a Tiled-Processor Architecture A Logical Successor to the CPU, GPU and NPU? Anant Agarwal MIT](https://reader036.vdocuments.site/reader036/viewer/2022062511/551789845503463e368b5429/html5/thumbnails/5.jpg)
The opportunity
20 MIPS CPU, 1987
![Page 6: The Raw All Purpose Unit (APU) based on a Tiled-Processor Architecture A Logical Successor to the CPU, GPU and NPU? Anant Agarwal MIT](https://reader036.vdocuments.site/reader036/viewer/2022062511/551789845503463e368b5429/html5/thumbnails/6.jpg)
2007
The billion transistor chip
The opportunity
![Page 7: The Raw All Purpose Unit (APU) based on a Tiled-Processor Architecture A Logical Successor to the CPU, GPU and NPU? Anant Agarwal MIT](https://reader036.vdocuments.site/reader036/viewer/2022062511/551789845503463e368b5429/html5/thumbnails/7.jpg)
Enables seeking out new application domains
o Redefine our notion of a “general purpose processor”
o Imagine a single-chip handheld that is a speech-driven cellphone, camera, PDA, MP3 player, video engine
o Imagine a single-chip PC that is also a 10G router, wireless access point, graphics engine
o While running the gamut of existing desktop binaries
-- A new versatile processor -- APU*
*Asanovic
![Page 8: The Raw All Purpose Unit (APU) based on a Tiled-Processor Architecture A Logical Successor to the CPU, GPU and NPU? Anant Agarwal MIT](https://reader036.vdocuments.site/reader036/viewer/2022062511/551789845503463e368b5429/html5/thumbnails/8.jpg)
But, where is general purpose computing today?
[Figure: a PC built from an x86 Pentium IV plus ASICs for Graphics (0.25 TFLOPS), Wireless, Ethernet, Sound, Encryption, Other…]
![Page 9: The Raw All Purpose Unit (APU) based on a Tiled-Processor Architecture A Logical Successor to the CPU, GPU and NPU? Anant Agarwal MIT](https://reader036.vdocuments.site/reader036/viewer/2022062511/551789845503463e368b5429/html5/thumbnails/9.jpg)
How does the ASIC do it?
o Lots of ALUs, lots of registers, lots of small memories
o Hand-routed, short wires
o Lower power (everything close by)
o Stream data model for high throughput
But, not general purpose
[Figure: diagram with several small memories labeled mem]
![Page 10: The Raw All Purpose Unit (APU) based on a Tiled-Processor Architecture A Logical Successor to the CPU, GPU and NPU? Anant Agarwal MIT](https://reader036.vdocuments.site/reader036/viewer/2022062511/551789845503463e368b5429/html5/thumbnails/10.jpg)
Our challenge
How to exploit ASIC-like features
o Lots of resources like ALUs and memories
o Application-specific routing of short wires
While being “general purpose”
o Programmable
o And even running ILP-based sequential programs
One Approach: Tiled Processor Architecture (TPA)
![Page 11: The Raw All Purpose Unit (APU) based on a Tiled-Processor Architecture A Logical Successor to the CPU, GPU and NPU? Anant Agarwal MIT](https://reader036.vdocuments.site/reader036/viewer/2022062511/551789845503463e368b5429/html5/thumbnails/11.jpg)
Tiled Processor Architecture (TPA)
[Figure: an array of tiles. Each Tile contains IMEM, DMEM, REG, PC, FPU, ALU, and a SWITCH with its own PC and SMEM]
o Lots of ALUs and regs
o Short, programmable wires
o Lower power
o Programmable. Supports ILP and streams
![Page 12: The Raw All Purpose Unit (APU) based on a Tiled-Processor Architecture A Logical Successor to the CPU, GPU and NPU? Anant Agarwal MIT](https://reader036.vdocuments.site/reader036/viewer/2022062511/551789845503463e368b5429/html5/thumbnails/12.jpg)
A Prototype TPA: The Raw Microprocessor
Software-scheduled interconnects (can use static or dynamic routing, but the compiler determines instruction placement and routes)
[Figure: The Raw Chip, a grid of tiles with attached Disk stream, Video1, RDRAM, and Packet stream. A Raw Tile contains IMEM, DMEM, REG, PC, FPU, ALU, and the Raw Switch (SWITCH, PC, SMEM)]
[Billion transistor IEEE Computer Issue ’97]
![Page 13: The Raw All Purpose Unit (APU) based on a Tiled-Processor Architecture A Logical Successor to the CPU, GPU and NPU? Anant Agarwal MIT](https://reader036.vdocuments.site/reader036/viewer/2022062511/551789845503463e368b5429/html5/thumbnails/13.jpg)
Tight integration of interconnect
[Figure: The Raw Chip and A Raw Tile, with the tile's processor pipeline expanded. Pipeline stage labels: IF, RF, D, A, TL, M1, M2, F, P, E, U, TV, F4, WB. Registers r24, r25, r26, and r27 connect to the input FIFOs from the static router and to the output FIFOs to the static router, alongside a 0-cycle “local bypass network”]
Point-to-point, bypass-integrated, compiler-orchestrated on-chip networks
![Page 14: The Raw All Purpose Unit (APU) based on a Tiled-Processor Architecture A Logical Successor to the CPU, GPU and NPU? Anant Agarwal MIT](https://reader036.vdocuments.site/reader036/viewer/2022062511/551789845503463e368b5429/html5/thumbnails/14.jpg)
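How to “program the wires”
[Figure: Tile 10 and Tile 11, each with a software-controlled crossbar]
Sending tile, processor: fmul r24, r3, r4
Sending tile, switch: route P->E
Receiving tile, switch: route W->P
Receiving tile, processor: fadd r5, r3, r25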
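The four instructions on this slide can be traced with a minimal Python sketch. It assumes a simplified model in which a write to r24 pushes a word onto the FIFO toward the tile's switch and a read of r25 pops the FIFO coming from the switch. The `Tile` class, the `route` helper, and the operand values are hypothetical names and numbers chosen for illustration; they are not part of the Raw toolchain.

```python
from collections import deque

class Tile:
    """One tile: ordinary registers plus FIFOs to and from its switch."""
    def __init__(self):
        self.regs = {}
        self.to_switch = deque()     # filled by writes to r24
        self.from_switch = deque()   # drained by reads of r25

    def read(self, reg):
        # Reading the network-mapped register pops the input FIFO.
        if reg == "r25":
            return self.from_switch.popleft()
        return self.regs[reg]

    def write(self, reg, value):
        # Writing the network-mapped register pushes onto the output FIFO.
        if reg == "r24":
            self.to_switch.append(value)
        else:
            self.regs[reg] = value

def route(src, dst):
    """One switch instruction: move a single word from src to dst."""
    dst.append(src.popleft())

sender, receiver = Tile(), Tile()
link = deque()   # wire from the sender's East port to the receiver's West port

sender.regs.update(r3=2.0, r4=5.0)   # illustrative operand values
receiver.regs.update(r3=1.5)

sender.write("r24", sender.read("r3") * sender.read("r4"))        # fmul r24, r3, r4
route(sender.to_switch, link)                                     # route P->E
route(link, receiver.from_switch)                                 # route W->P
receiver.write("r5", receiver.read("r3") + receiver.read("r25"))  # fadd r5, r3, r25

print(receiver.regs["r5"])   # 11.5
```

The product never touches memory or a register file on its way between tiles: it moves through two switch instructions and is consumed directly as an operand.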
![Page 15: The Raw All Purpose Unit (APU) based on a Tiled-Processor Architecture A Logical Successor to the CPU, GPU and NPU? Anant Agarwal MIT](https://reader036.vdocuments.site/reader036/viewer/2022062511/551789845503463e368b5429/html5/thumbnails/15.jpg)
The result of orchestrating the wires
[Figure: the tiles of one chip shared among a Custom Datapath Pipeline with memories (mem), httpd, a C program (ILP computation), an MPI program, and idle tiles (Zzzz)]
![Page 16: The Raw All Purpose Unit (APU) based on a Tiled-Processor Architecture A Logical Successor to the CPU, GPU and NPU? Anant Agarwal MIT](https://reader036.vdocuments.site/reader036/viewer/2022062511/551789845503463e368b5429/html5/thumbnails/16.jpg)
Perspective
We have replaced
o Bypass paths, ALU-reg bus, FPU-Int. bus, reg-cache bus, cache-mem bus, etc.
With a general, point-to-point, routed interconnect called:
o Scalar operand network (SON)
o Fundamentally new kind of network optimized for both scalar and stream transport
![Page 17: The Raw All Purpose Unit (APU) based on a Tiled-Processor Architecture A Logical Successor to the CPU, GPU and NPU? Anant Agarwal MIT](https://reader036.vdocuments.site/reader036/viewer/2022062511/551789845503463e368b5429/html5/thumbnails/17.jpg)
Programming models and software for tiled processor architectures
o Conventional scalar programs (C, C++, Java). Or, how to do ILP
o Stream programs
![Page 18: The Raw All Purpose Unit (APU) based on a Tiled-Processor Architecture A Logical Successor to the CPU, GPU and NPU? Anant Agarwal MIT](https://reader036.vdocuments.site/reader036/viewer/2022062511/551789845503463e368b5429/html5/thumbnails/18.jpg)
Scalar (ILP) program mapping
E.g., Start with a C program, and several transformations later:
v2.4 = v2
seed.0 = seed
v1.2 = v1
pval1 = seed.0 * 3.0
pval0 = pval1 + 2.0
tmp0.1 = pval0 / 2.0
pval2 = seed.0 * v1.2
tmp1.3 = pval2 + 2.0
pval3 = seed.0 * v2.4
tmp2.5 = pval3 + 2.0
pval5 = seed.0 * 6.0
pval4 = pval5 + 2.0
tmp3.6 = pval4 / 3.0
pval6 = tmp1.3 - tmp2.5
v2.7 = pval6 * 5.0
pval7 = tmp1.3 + tmp2.5
v1.8 = pval7 * 3.0
v0.9 = tmp0.1 - v1.8
v3.10 = tmp3.6 - v2.7
tmp2 = tmp2.5
v1 = v1.8
tmp1 = tmp1.3
v0 = v0.9
tmp0 = tmp0.1
v3 = v3.10
tmp3 = tmp3.6
v2 = v2.7
Lee, Amarasinghe et al, “Space-time scheduling”, ASPLOS ’98
Existing languages will work
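To see how much ILP this straight-line code exposes, the Python sketch below builds the true-dependence chain lengths for the statements on this slide. It assumes every statement takes one time step, which is a simplification for illustration and not the latency model of the Raw compiler. The function name `critical_path` is hypothetical.

```python
import re

# The 27 statements from the slide, in program order.
STATEMENTS = [
    "v2.4 = v2", "seed.0 = seed", "v1.2 = v1",
    "pval1 = seed.0 * 3.0", "pval0 = pval1 + 2.0", "tmp0.1 = pval0 / 2.0",
    "pval2 = seed.0 * v1.2", "tmp1.3 = pval2 + 2.0",
    "pval3 = seed.0 * v2.4", "tmp2.5 = pval3 + 2.0",
    "pval5 = seed.0 * 6.0", "pval4 = pval5 + 2.0", "tmp3.6 = pval4 / 3.0",
    "pval6 = tmp1.3 - tmp2.5", "v2.7 = pval6 * 5.0",
    "pval7 = tmp1.3 + tmp2.5", "v1.8 = pval7 * 3.0",
    "v0.9 = tmp0.1 - v1.8", "v3.10 = tmp3.6 - v2.7",
    "tmp2 = tmp2.5", "v1 = v1.8", "tmp1 = tmp1.3", "v0 = v0.9",
    "tmp0 = tmp0.1", "v3 = v3.10", "tmp3 = tmp3.6", "v2 = v2.7",
]

def critical_path(statements):
    """Length of the longest chain of true dependences, at one step per statement."""
    depth_of = {}   # name -> depth of the statement that last defined it
    longest = 0
    for stmt in statements:
        target, expr = [part.strip() for part in stmt.split("=")]
        # Names start with a letter, so constants such as 3.0 are skipped.
        operands = re.findall(r"[A-Za-z_][\w.]*", expr)
        # Operands with no earlier definition are live-in values at depth 0.
        depth = 1 + max((depth_of.get(op, 0) for op in operands), default=0)
        depth_of[target] = depth
        longest = max(longest, depth)
    return longest

steps = critical_path(STATEMENTS)
print(len(STATEMENTS), "statements, critical path", steps)
print("average parallelism: %.2f" % (len(STATEMENTS) / steps))
```

Under the unit-latency assumption, 27 statements sit on a critical path of 7 steps, so a few tiles working in parallel are enough to approach the dependence-limited schedule.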
![Page 19: The Raw All Purpose Unit (APU) based on a Tiled-Processor Architecture A Logical Successor to the CPU, GPU and NPU? Anant Agarwal MIT](https://reader036.vdocuments.site/reader036/viewer/2022062511/551789845503463e368b5429/html5/thumbnails/19.jpg)
Scalar program mapping
[Figure: the program code from the previous slide shown next to its graph, with one node per statement]
![Page 20: The Raw All Purpose Unit (APU) based on a Tiled-Processor Architecture A Logical Successor to the CPU, GPU and NPU? Anant Agarwal MIT](https://reader036.vdocuments.site/reader036/viewer/2022062511/551789845503463e368b5429/html5/thumbnails/20.jpg)
Program graph clustering
[Figure: the same program graph with its nodes grouped into clusters]
![Page 21: The Raw All Purpose Unit (APU) based on a Tiled-Processor Architecture A Logical Successor to the CPU, GPU and NPU? Anant Agarwal MIT](https://reader036.vdocuments.site/reader036/viewer/2022062511/551789845503463e368b5429/html5/thumbnails/21.jpg)
Placement
Tile1
seed.0=seed
pval1=seed.0*3.0
pval0=pval1+2.0
tmp0.1=pval0/2.0
tmp0=tmp0.1
Tile2
pval5=seed.0*6.0
pval4=pval5+2.0
tmp3.6=pval4/3.0
tmp3=tmp3.6
v3.10=tmp3.6-v2.7
v3=v3.10
Tile3
v2.4=v2
pval3=seed.0*v2.4
tmp2.5=pval3+2.0
tmp2=tmp2.5
pval6=tmp1.3-tmp2.5
v2.7=pval6*5.0
v2=v2.7
Tile4
v1.2=v1
pval2=seed.0*v1.2
tmp1.3=pval2+2.0
tmp1=tmp1.3
pval7=tmp1.3+tmp2.5
v1.8=pval7*3.0
v1=v1.8
v0.9=tmp0.1-v1.8
v0=v0.9
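Once clusters are placed on tiles, every value produced on one tile and consumed on another must travel over the network. The Python sketch below lists those values for the placement on this slide. The assignment of clusters to Tile1 through Tile4 is inferred from the processor code shown on the Routing slide, so treat it as an assumption. The function name `cross_tile_values` is hypothetical.

```python
import re

# Statements per tile; the tile assignment is an inference from the Routing slide.
PLACEMENT = {
    "Tile1": "seed.0=seed pval1=seed.0*3.0 pval0=pval1+2.0 tmp0.1=pval0/2.0 tmp0=tmp0.1",
    "Tile2": "pval5=seed.0*6.0 pval4=pval5+2.0 tmp3.6=pval4/3.0 tmp3=tmp3.6 "
             "v3.10=tmp3.6-v2.7 v3=v3.10",
    "Tile3": "v2.4=v2 pval3=seed.0*v2.4 tmp2.5=pval3+2.0 tmp2=tmp2.5 "
             "pval6=tmp1.3-tmp2.5 v2.7=pval6*5.0 v2=v2.7",
    "Tile4": "v1.2=v1 pval2=seed.0*v1.2 tmp1.3=pval2+2.0 tmp1=tmp1.3 "
             "pval7=tmp1.3+tmp2.5 v1.8=pval7*3.0 v1=v1.8 v0.9=tmp0.1-v1.8 v0=v0.9",
}

def cross_tile_values(placement):
    """Return sorted (value, producer tile, consumer tile) triples."""
    home = {}   # name -> tile whose code defines it
    for tile, code in placement.items():
        for stmt in code.split():
            home[stmt.split("=")[0]] = tile
    edges = set()
    for tile, code in placement.items():
        for stmt in code.split():
            expr = stmt.split("=")[1]
            # Names start with a letter, so constants such as 6.0 are skipped.
            for name in re.findall(r"[A-Za-z_][\w.]*", expr):
                if name in home and home[name] != tile:
                    edges.add((name, home[name], tile))
    return sorted(edges)

for value, src, dst in cross_tile_values(PLACEMENT):
    print(f"{value}: {src} -> {dst}")
```

The five values that cross tiles (seed.0, tmp0.1, tmp1.3, tmp2.5, v2.7) are the ones that need send, recv, and route instructions in the next step.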
![Page 22: The Raw All Purpose Unit (APU) based on a Tiled-Processor Architecture A Logical Successor to the CPU, GPU and NPU? Anant Agarwal MIT](https://reader036.vdocuments.site/reader036/viewer/2022062511/551789845503463e368b5429/html5/thumbnails/22.jpg)
Routing
Tile1
Processor code:
seed.0=seed
send(seed.0)
pval1=seed.0*3.0
pval0=pval1+2.0
tmp0.1=pval0/2.0
send(tmp0.1)
tmp0=tmp0.1
Switch code:
route(t,E,S)
route(t,E)
Tile2
Processor code:
seed.0=recv()
pval5=seed.0*6.0
pval4=pval5+2.0
tmp3.6=pval4/3.0
tmp3=tmp3.6
v2.7=recv()
v3.10=tmp3.6-v2.7
v3=v3.10
Switch code:
route(W,S,t)
route(W,S)
route(S,t)
Tile3
Processor code:
v2.4=v2
seed.0=recv()
pval3=seed.0*v2.4
tmp2.5=pval3+2.0
tmp2=tmp2.5
send(tmp2.5)
tmp1.3=recv()
pval6=tmp1.3-tmp2.5
v2.7=pval6*5.0
send(v2.7)
v2=v2.7
Switch code:
route(N,t)
route(t,E)
route(E,t)
route(t,E)
Tile4
Processor code:
v1.2=v1
seed.0=recv()
pval2=seed.0*v1.2
tmp1.3=pval2+2.0
send(tmp1.3)
tmp1=tmp1.3
tmp2.5=recv()
pval7=tmp1.3+tmp2.5
v1.8=pval7*3.0
v1=v1.8
tmp0.1=recv()
v0.9=tmp0.1-v1.8
v0=v0.9
Switch code:
route(N,t)
route(t,W)
route(W,t)
route(N,t)
route(W,N)
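The processor and switch code on this slide can be executed in a small Python sketch to check that the routes deliver every operand. Several things here are assumptions made for illustration: the 2x2 layout (Tile1 top-left, Tile2 top-right, Tile3 bottom-left, Tile4 bottom-right) is inferred from the route directions, FIFOs are unbounded, each tile already holds the live-in values seed, v1, and v2, and `recv(0)` and `Send` from the slide are read as `recv()` and `send`. The helper names are hypothetical.

```python
import re
from collections import defaultdict, deque

PROC = {
    1: "seed.0=seed send(seed.0) pval1=seed.0*3.0 pval0=pval1+2.0 "
       "tmp0.1=pval0/2.0 send(tmp0.1) tmp0=tmp0.1",
    2: "seed.0=recv() pval5=seed.0*6.0 pval4=pval5+2.0 tmp3.6=pval4/3.0 "
       "tmp3=tmp3.6 v2.7=recv() v3.10=tmp3.6-v2.7 v3=v3.10",
    3: "v2.4=v2 seed.0=recv() pval3=seed.0*v2.4 tmp2.5=pval3+2.0 tmp2=tmp2.5 "
       "send(tmp2.5) tmp1.3=recv() pval6=tmp1.3-tmp2.5 v2.7=pval6*5.0 "
       "send(v2.7) v2=v2.7",
    4: "v1.2=v1 seed.0=recv() pval2=seed.0*v1.2 tmp1.3=pval2+2.0 send(tmp1.3) "
       "tmp1=tmp1.3 tmp2.5=recv() pval7=tmp1.3+tmp2.5 v1.8=pval7*3.0 v1=v1.8 "
       "tmp0.1=recv() v0.9=tmp0.1-v1.8 v0=v0.9",
}
SWITCH = {
    1: "route(t,E,S) route(t,E)",
    2: "route(W,S,t) route(W,S) route(S,t)",
    3: "route(N,t) route(t,E) route(E,t) route(t,E)",
    4: "route(N,t) route(t,W) route(W,t) route(N,t) route(W,N)",
}
POS = {1: (0, 0), 2: (0, 1), 3: (1, 0), 4: (1, 1)}   # assumed (row, col) layout
TILE_AT = {pos: tile for tile, pos in POS.items()}
# direction -> (row step, column step, port on which the neighbor receives)
STEP = {"N": (-1, 0, "S"), "S": (1, 0, "N"), "E": (0, 1, "W"), "W": (0, -1, "E")}

def rename(code):
    # seed.0 -> seed_0 so names are valid Python; constants like 6.0 are untouched.
    return re.sub(r"([A-Za-z]\w*)\.(\d+)", r"\1_\2", code)

def step_processor(tile, stmt, env, fifo):
    """Run one processor statement; return False if it has to stall."""
    if stmt.startswith("send("):
        fifo[tile, "t"].append(env[stmt[5:-1]])
    elif stmt.endswith("=recv()"):
        if not fifo[tile, "recv"]:
            return False
        env[stmt[:-7]] = fifo[tile, "recv"].popleft()
    else:
        target, expr = stmt.split("=")
        env[target] = eval(expr, {"__builtins__": {}}, env)   # arithmetic only
    return True

def step_switch(tile, route, fifo):
    """Run one route(src, dst...) instruction; return False if src is empty."""
    src, *dsts = route.split(",")
    if not fifo[tile, src]:
        return False
    word = fifo[tile, src].popleft()
    for d in dsts:
        if d == "t":
            fifo[tile, "recv"].append(word)
        else:
            dr, dc, port = STEP[d]
            row, col = POS[tile]
            fifo[TILE_AT[row + dr, col + dc], port].append(word)
    return True

def run(live_in):
    procs = {t: rename(code).split() for t, code in PROC.items()}
    switches = {t: re.findall(r"route\(([^)]*)\)", code) for t, code in SWITCH.items()}
    env = {t: dict(live_in) for t in PROC}
    pc, sc = dict.fromkeys(PROC, 0), dict.fromkeys(PROC, 0)
    fifo = defaultdict(deque)   # (tile, port) -> queued words
    progress = True
    while progress:             # stop when no processor or switch can advance
        progress = False
        for t in PROC:
            if pc[t] < len(procs[t]) and step_processor(t, procs[t][pc[t]], env[t], fifo):
                pc[t] += 1
                progress = True
            if sc[t] < len(switches[t]) and step_switch(t, switches[t][sc[t]], fifo):
                sc[t] += 1
                progress = True
    done = all(pc[t] == len(procs[t]) and sc[t] == len(switches[t]) for t in PROC)
    return env, done

env, done = run({"seed": 1.0, "v1": 2.0, "v2": 3.0})
print(done, env[4]["v0"], env[4]["v1"], env[3]["v2"])   # True -24.5 27.0 -5.0
```

With seed=1, v1=2, v2=3 the distributed run finishes without deadlock and produces the same v0, v1, and v2 as executing the original statements in order on one processor.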
![Page 23: The Raw All Purpose Unit (APU) based on a Tiled-Processor Architecture A Logical Successor to the CPU, GPU and NPU? Anant Agarwal MIT](https://reader036.vdocuments.site/reader036/viewer/2022062511/551789845503463e368b5429/html5/thumbnails/23.jpg)
Instruction Scheduling
[Figure: the processor and switch code from the Routing slide arranged along a time axis, in columns labeled Tile1, Tile3, Tile4, Tile2]
![Page 24: The Raw All Purpose Unit (APU) based on a Tiled-Processor Architecture A Logical Successor to the CPU, GPU and NPU? Anant Agarwal MIT](https://reader036.vdocuments.site/reader036/viewer/2022062511/551789845503463e368b5429/html5/thumbnails/24.jpg)
StreamIt: Stream Language and Compiler
e.g., BeamFormer (Amarasinghe et al.)
class BeamFormer extends Pipeline {
  void init(numChannels, numBeams) {
    add(new SplitJoin() {
      void init() {
        setSplitter(Duplicate());
        for (int i=0; i<numChannels; i++) {
          add(new FIR1(N1));
          add(new FIR2(N2));
        }
        setJoiner(RoundRobin());
      }
    });
    add(new SplitJoin() {
      void init() {
        setSplitter(Duplicate());
        for (int i=0; i<numBeams; i++) {
          add(new VectorMult());
          add(new FIR3(N3));
          add(new Magnitude());
          add(new Detect());
        }
        setJoiner(Null());
      }
    });
  }
}
[Figure: stream graph. Splitter, then 12 parallel channels of two FIRFilters each, then Joiner; then Splitter, then 4 parallel beams of Vector Mult, FirFilter, Magnitude, Detector, then Joiner]
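The split/join structure in the BeamFormer code can be mimicked with a few lines of Python. This is a simplified sketch, not the StreamIt API: streams are plain lists, `pipeline`, `splitjoin`, `fir`, and `magnitude` are hypothetical helpers, and only a duplicate splitter with a round-robin joiner is modeled. The taps and input samples are made-up values.

```python
def fir(taps):
    """Sliding-window FIR filter over a finite list of samples."""
    def run(xs):
        n = len(taps)
        return [sum(taps[k] * xs[i + k] for k in range(n))
                for i in range(len(xs) - n + 1)]
    return run

def magnitude(xs):
    return [abs(x) for x in xs]

def pipeline(*stages):
    """Feed the output of each stage into the next one."""
    def run(xs):
        for stage in stages:
            xs = stage(xs)
        return xs
    return run

def splitjoin(branches):
    """Duplicate splitter: every branch sees the whole input.
    Round-robin joiner: take one item from each branch in turn."""
    def run(xs):
        outputs = [branch(list(xs)) for branch in branches]
        return [item for group in zip(*outputs) for item in group]
    return run

# Two channels with different taps, joined and followed by a magnitude stage.
beamformer = pipeline(
    splitjoin([fir([1, 1]), fir([1, -1])]),
    magnitude,
)

print(beamformer([1, 2, 3, 4]))   # [3, 1, 5, 1, 7, 1]
```

Because each branch is an independent filter with explicit input and output streams, a compiler is free to place branches on different tiles and connect them through the network.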
![Page 25: The Raw All Purpose Unit (APU) based on a Tiled-Processor Architecture A Logical Successor to the CPU, GPU and NPU? Anant Agarwal MIT](https://reader036.vdocuments.site/reader036/viewer/2022062511/551789845503463e368b5429/html5/thumbnails/25.jpg)
Raw Beamformer Layout (by hand)
[Figure: tile layout with regions labeled Search, FRM, FIR, and Control, and repeated mem and + units]
![Page 26: The Raw All Purpose Unit (APU) based on a Tiled-Processor Architecture A Logical Successor to the CPU, GPU and NPU? Anant Agarwal MIT](https://reader036.vdocuments.site/reader036/viewer/2022062511/551789845503463e368b5429/html5/thumbnails/26.jpg)
Raw die photo
0.18 micron process, 16 tiles, 425 MHz, 18 Watts (vpenta)
Of course, a custom IC designed by an industrial design team could do much better
![Page 27: The Raw All Purpose Unit (APU) based on a Tiled-Processor Architecture A Logical Successor to the CPU, GPU and NPU? Anant Agarwal MIT](https://reader036.vdocuments.site/reader036/viewer/2022062511/551789845503463e368b5429/html5/thumbnails/27.jpg)
Raw motherboard
![Page 28: The Raw All Purpose Unit (APU) based on a Tiled-Processor Architecture A Logical Successor to the CPU, GPU and NPU? Anant Agarwal MIT](https://reader036.vdocuments.site/reader036/viewer/2022062511/551789845503463e368b5429/html5/thumbnails/28.jpg)
More Experimental Systems
Systems Online or in Pipeline
o Raw Workstation
o Raw-based 1020 Microphone Array
o Raw 802.11a/g wireless system (collab with Engim)
o Raw Gigabit IP router
o Raw graphics system
o Raw supercomputing fabric
![Page 29: The Raw All Purpose Unit (APU) based on a Tiled-Processor Architecture A Logical Successor to the CPU, GPU and NPU? Anant Agarwal MIT](https://reader036.vdocuments.site/reader036/viewer/2022062511/551789845503463e368b5429/html5/thumbnails/29.jpg)
Empirical Evaluation
Compare to P3 implemented in similar technology

| Parameter | Raw (IBM ASIC) | P3 (Intel) |
| --- | --- | --- |
| Litho | 180 nm | 180 nm |
| Process | CMOS 7SF | P858 |
| Metal Layers | Cu 6 | Al 6 |
| FO1 Delay | 23 ps | 11 ps |
| Dielectric k | 4.1 | 3.55 |
| Design Style | Standard Cell SA27E ASIC | Full custom |
| Initial Freq | 425 MHz | 500-733 MHz (use 600) |
| Die Area | 331 mm² | 106 mm² |

Raw #s from cycle-accurate simulator validated against real chip
-- FPGA mem controller in Raw
-- Raw SW i-caching adds 0-30% ovhd
![Page 30: The Raw All Purpose Unit (APU) based on a Tiled-Processor Architecture A Logical Successor to the CPU, GPU and NPU? Anant Agarwal MIT](https://reader036.vdocuments.site/reader036/viewer/2022062511/551789845503463e368b5429/html5/thumbnails/30.jpg)
Performance Results [ISCA’04]
[Figure: Performance (vertical axis) across the Architecture Space (horizontal axis)]
o ~10x parallelism
o ~4x ld/store elim
o ~4x stream mem bw
![Page 31: The Raw All Purpose Unit (APU) based on a Tiled-Processor Architecture A Logical Successor to the CPU, GPU and NPU? Anant Agarwal MIT](https://reader036.vdocuments.site/reader036/viewer/2022062511/551789845503463e368b5429/html5/thumbnails/31.jpg)
Raw IP Router
[Figure: router throughput chart, in Gb/sec]
![Page 32: The Raw All Purpose Unit (APU) based on a Tiled-Processor Architecture A Logical Successor to the CPU, GPU and NPU? Anant Agarwal MIT](https://reader036.vdocuments.site/reader036/viewer/2022062511/551789845503463e368b5429/html5/thumbnails/32.jpg)
VersaBench
Sharing a benchmark set to stress versatility of processors
Categories of programs:
o ILP: Desktop and Scientific
o Streams
o Throughput oriented servers
o Bit-level embedded
www.cag.csail.mit.edu/versabench
![Page 33: The Raw All Purpose Unit (APU) based on a Tiled-Processor Architecture A Logical Successor to the CPU, GPU and NPU? Anant Agarwal MIT](https://reader036.vdocuments.site/reader036/viewer/2022062511/551789845503463e368b5429/html5/thumbnails/33.jpg)
Summary
Raw: single chip for ILP and streaming
Scalar operand network is key to ILP and streams
Can enable the APU
www.cag.csail.mit.edu/raw
![Page 34: The Raw All Purpose Unit (APU) based on a Tiled-Processor Architecture A Logical Successor to the CPU, GPU and NPU? Anant Agarwal MIT](https://reader036.vdocuments.site/reader036/viewer/2022062511/551789845503463e368b5429/html5/thumbnails/34.jpg)