![Page 1: CDE: A Compiler-driven, Dependence- centric, Eager-executing architecture for the billion transistor era Carmelo Acosta Sriram Vajapeyam Alex Ramirez Mateo](https://reader035.vdocuments.site/reader035/viewer/2022062714/56649d3a5503460f94a14432/html5/thumbnails/1.jpg)
CDE: A Compiler-driven, Dependence-centric, Eager-executing architecture for the billion transistor era
Carmelo Acosta
Sriram Vajapeyam
Alex Ramirez
Mateo Valero
UPC-Barcelona
Motivation
• Entering the billion transistor era
  • How to use the available hardware to increase performance
  • Keep cost and complexity under control
  • Obtain a true general-purpose architecture
    • Do not limit high performance to a single application class
• Clustered architectures seem the way to go
  • Avoid excessive dependence on the compiler
  • Avoid impossible communication delays
  • Avoid complex interconnection networks
• Hierarchical program partitioning
  • Both in the compiler and the hardware
Outline
• Motivation
• The CDE architecture
• Hierarchical program partitioning
  • Epochs
    • Selective Eager Execution
  • Dependence clusters
• Hierarchical architecture
  • Epoch Processing Core (EPC)
  • Processing Elements (PE)
• Program execution
• Related work
• Summary and conclusions
The CDE architecture
• The way CDE obtains performance
  • Rely on the compiler for code partitioning
    • Hierarchical program view
    • Matching hierarchical hardware
  • Use both run-time and compile-time speculation to keep the transistors occupied
• How to achieve it
  • The Dependence Cluster (DC) is the basic execution unit
    • Larger than one instruction
      • Larger virtual instruction window
    • Reduces communication
    • Amortizes speculation costs
      • Commit, squash, and redo an entire DC
Hierarchical program partitioning
• Horizontal control epochs
  • Large code segments
    • Loops, functions, hyperblock-like
  • Limit the scope of compiler optimizations
    • Trace scheduling
    • Selective eager execution
• Vertical dependence clusters
  • Chains of dependent instructions
  • Localize communications
Epochs

[ 0] 0x12001e09c: ldq t0, -21056(gp)
[ 1] 0x12001e0a0: beq t0, 0x12001e0e4
[ 2] 0x12001e0a4: ldq t2, 8(t0)
[ 3] 0x12001e0a8: beq t2, 0x12001e0dc
[ 4] 0x12001e0ac: ldq t4, 8(t2)
[ 5] 0x12001e0b0: ldq t4, 8(t4)
[ 6] 0x12001e0b4: xor a0, t4, t4
[ 7] 0x12001e0b8: beq t4, 0x12001e0ec
[ 8] 0x12001e0bc: ldq t2, 16(t2)
[ 9] 0x12001e0c0: beq t2, 0x12001e0dc
[10] 0x12001e0c4: ldq t6, 8(t2)
[11] 0x12001e0c8: ldq t6, 8(t6)
[12] 0x12001e0cc: xor a0, t6, t6
[13] 0x12001e0d0: beq t6, 0x12001e0ec
[14] 0x12001e0d4: ldq t2, 16(t2)
[15] 0x12001e0d8: bne t2, 0x12001e0ac
[16] 0x12001e0dc: ldq t0, 16(t0)
[17] 0x12001e0e0: bne t0, 0x12001e0a4
[18] 0x12001e0e4: ldq v0, 16(a0)
[19] 0x12001e0e8: ret zero, (ra), 1
[20] 0x12001e0ec: ldq t2, 8(t2)
[21] 0x12001e0f0: ldq v0, 16(t2)
[22] 0x12001e0f4: ret zero, (ra), 1
b) SuperScalar code
[Figure: the epoch's instructions [0] to [22] grouped into dependence clusters DC #0 to DC #10.]
c) Control Epoch
NODE *xlygetvalue(NODE *sym)
{
    register NODE *fp,*ep;

    /* check the environment list */
    for (fp = xlenv; fp; fp = cdr(fp))
        for (ep = car(fp); ep; ep = cdr(ep))
            if (sym == car(car(ep)))
                return (cdr(car(ep)));

    /* return the global value */
    return (getvalue(sym));
}
a) Source code
Eager execution
• Traditional trace-scheduling
  • Bet on one direction
  • Optimize frequent case
  • Generate fix-up code for infrequent case
• Eager-execution
  • Remove the branch
  • Optimize each separate case
  • Squash the incorrect trace
[Figure: a hard-to-predict branch handled two ways: optimized trace + fix-up code, versus removing the branch and executing both paths.]
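The eager-execution bullets can be illustrated with a minimal Python sketch. It is a software analogy under stated assumptions, not the CDE hardware mechanism: `eager_execute`, `found`, `advance`, and the dictionary-based state are hypothetical names introduced here. Both paths run on private copies of the state, and once the condition resolves one result is committed and the other is dropped (squashed).

```python
def eager_execute(condition, taken_path, fallthrough_path, state):
    """Run both sides of a branch speculatively, then keep only one.

    Hypothetical helper: each path works on its own copy of the state,
    so the losing path leaves no trace once it is squashed.
    """
    taken_result = taken_path(dict(state))
    fall_result = fallthrough_path(dict(state))
    # The branch outcome is only consulted after both paths have run.
    if condition(state):
        return taken_result   # fall_result is squashed (discarded)
    return fall_result        # taken_result is squashed (discarded)


# Toy version of the inner-loop test `sym == car(car(ep))` (assumed layout).
def found(s):
    s["v0"] = s["value"]      # return path: produce the bound value
    return s


def advance(s):
    s["ep"] = s["next"]       # loop path: move to the next list cell
    return s


state = {"sym": "x", "key": "x", "value": 42, "ep": "cell0", "next": "cell1"}
committed = eager_execute(lambda s: s["sym"] == s["key"], found, advance, state)
```

The point of the sketch is the cost structure: no fix-up code is needed, but the work of the losing path is always wasted.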
Dependence clusters
• Essentially a set of dependent instructions
  • May have dependencies with other DCs in the same Epoch
• The compiler balances
  • Inter-DC dependencies
    • Localize communication within a DC
  • ILP
    • Place independent instructions in a different DC
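A minimal Python sketch of how a compiler pass might form such chains is given here. The greedy rule is an assumption for illustration, not the authors' partitioning algorithm, and `Instr` and `cluster_chains` are hypothetical names. An instruction joins the cluster of the producer of one of its sources only if that producer is currently the last instruction of its cluster; otherwise it opens a new cluster. The input is a subset of the instructions in the Epochs listing, with registers transcribed by hand.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    idx: int                  # index used in the listing, e.g. [16]
    dest: Optional[str]       # register written, or None for branches
    srcs: Tuple[str, ...]     # registers read


def cluster_chains(instrs):
    """Greedy chain builder (illustrative heuristic only)."""
    clusters = []             # each cluster is a list of instruction indices
    producer = {}             # register -> index of its latest writer
    home = {}                 # instruction index -> cluster number
    for ins in instrs:
        target = None
        for reg in ins.srcs:
            p = producer.get(reg)
            # Extend a chain only at its tail, keeping it a linear chain.
            if p is not None and clusters[home[p]][-1] == p:
                target = home[p]
                break
        if target is None:
            clusters.append([])
            target = len(clusters) - 1
        clusters[target].append(ins.idx)
        home[ins.idx] = target
        if ins.dest is not None:
            producer[ins.dest] = ins.idx
    return clusters


subset = [
    Instr(0, "t0", ("gp",)),   # ldq t0, -21056(gp)
    Instr(1, None, ("t0",)),   # beq t0, ...
    Instr(2, "t2", ("t0",)),   # ldq t2, 8(t0)
    Instr(3, None, ("t2",)),   # beq t2, ...
    Instr(16, "t0", ("t0",)),  # ldq t0, 16(t0)
    Instr(17, None, ("t0",)),  # bne t0, ...
    Instr(18, "v0", ("a0",)),  # ldq v0, 16(a0)
    Instr(19, None, ("ra",)),  # ret zero, (ra), 1
]
clusters = cluster_chains(subset)
```

A real partitioner would also have to weigh cluster length against inter-DC traffic, which is the balance the slide attributes to the compiler; this sketch ignores that trade-off.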
Hierarchical architecture partitioning
• Epoch Processing Core
  • Quickly sequences through control epochs
  • Epoch level speculation
• Mesh of MIPS-2000-like Processing Elements
  • Execute individual Dependence Clusters
[Figure: block diagram showing the EPC and the PEs.]
Epoch Processing Core (EPC)
• Fetches and processes epochs one at a time
  • Speculatively branches to the next epoch
    • Epoch level sequencing
    • Epoch level speculation
  • Renames live-ins and live-outs of each epoch
    • Out of order epoch execution
• Dispatches the DCs to the PE grid
  • Coupled with the required data about the epoch
    • Renaming of live-ins and live-outs
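Epoch-level renaming of live-ins and live-outs can be sketched in Python as follows. This is an assumed model borrowed from conventional register renaming, not a description of the actual EPC design; `EpochRenamer` and the `pN` tag format are hypothetical. Each epoch reads its live-ins through the current mapping and receives fresh tags for its live-outs, so a later epoch consumes exactly the values produced by the epoch before it.

```python
from itertools import count


class EpochRenamer:
    """Maps logical registers to tags at epoch granularity (sketch)."""

    def __init__(self):
        self._ids = count()
        self.mapping = {}     # logical register -> current tag

    def _fresh(self):
        return f"p{next(self._ids)}"

    def rename(self, live_in, live_out):
        # Live-ins read the current mapping; unseen registers get a tag
        # standing for their initial architectural value.
        ins = {}
        for reg in live_in:
            if reg not in self.mapping:
                self.mapping[reg] = self._fresh()
            ins[reg] = self.mapping[reg]
        # Live-outs always get fresh tags, so epochs can overlap in time
        # without overwriting each other's values.
        outs = {}
        for reg in live_out:
            self.mapping[reg] = self._fresh()
            outs[reg] = self.mapping[reg]
        return ins, outs


renamer = EpochRenamer()
first_in, first_out = renamer.rename(["gp", "a0"], ["t0"])
second_in, second_out = renamer.rename(["t0"], ["t0", "v0"])
```

In this model the second epoch's live-in `t0` carries the tag of the first epoch's live-out `t0`, which is the information that would have to travel with the DCs when they are dispatched.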
Processing Elements (PE)
• MIPS-2000-like
  • In-order
  • Single-issue
  • Short pipeline
• Local register file
  • Intra-DC dependencies
• Communications manager
  • Inter-DC dependencies
[Figure: five-stage pipeline (F D E M W) with the register file and the communications manager.]
Program execution (Cycle 0)
The EPC fetches, processes, and renames the epoch, then starts its execution.
[Figure: the control epoch with its dependence clusters DC #0 to DC #10, as on the Epochs slide.]
Program execution (Cycle 1)
Initial EPC-PEs communication delay.
[Figure: the EPC and the PE grid, with a cycle axis from 1 to 6.]
Program execution (Cycle 2)
DCs #0, #7 and #8 start execution on their respective PEs.
0-IF
18-IF
19-IF
[Figure: the EPC and the PE grid with DC#0, DC#7 and DC#8 placed on PEs; cycle axis from 1 to 6.]
Program execution (Cycle 3)
Each PE continues its execution as statically scheduled by the compiler.
0-IF 0-ID
18-IF 18-ID
19-IF 19-ID
[Figure: the EPC and the PE grid with DC#0, DC#7 and DC#8 placed on PEs; cycle axis from 1 to 6.]
Program execution (Cycle 4)
DCs #1 and #2 start execution on their respective PEs.
0-IF 0-ID 0-EX
1-IF
2-IF
16-IF
18-IF 18-ID 18-EX
19-IF 19-ID 19-EX
[Figure: the EPC and the PE grid with DC#0, DC#1, DC#2, DC#7 and DC#8 placed on PEs; cycle axis from 1 to 6.]
Program execution (Cycle 5)
DC#0 (0-M) generates register t0, which is bypassed to the next instruction (1-EX) and sent to DCs #1 and #2.
0-IF 0-ID 0-EX 0-M
1-IF 1-ID
2-IF 2-ID
16-IF 16-ID
18-IF 18-ID 18-EX 18-M
19-IF 19-ID 19-EX 19-M
[Figure: the EPC and the PE grid with DC#0, DC#1, DC#2, DC#7 and DC#8 placed on PEs; cycle axis from 1 to 6.]
Program execution (Cycle 6)
DCs #1’ and #2’ (the next instances of DCs #1 and #2) start execution. Register t0 arrives at DCs #1 and #2.
0-IF 0-ID 0-EX 0-M 0-W
1-IF 1-ID 1-EX
2-IF 2-ID 2-EX
3-IF
16-IF 16-ID 16-EX
17-IF
18-IF 18-ID 18-EX 18-M 18-W
19-IF 19-ID 19-EX 19-M 19-W
2-IF
16-IF
[Figure: the EPC and the PE grid with DC#0, DC#1, DC#2, DC#1’, DC#2’, DC#7 and DC#8 placed on PEs; cycle axis from 1 to 6.]
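The stage rows in the walkthrough are consistent with a simple timing rule, sketched here in Python. The rule is inferred from the slides rather than stated by them: every producer in the example is a load whose value exists after its M stage, and a consumer's EX stage comes one cycle later, whether the value is bypassed locally or sent to another PE. The function names and the `first_fetch=2` default are assumptions, and structural hazards inside a PE are ignored.

```python
STAGES = ["IF", "ID", "EX", "M", "W"]
EX = STAGES.index("EX")
M = STAGES.index("M")


def fetch_cycles(order, producer_of, first_fetch=2):
    """Cycle in which each instruction enters IF.

    order: instruction indices, producers listed before consumers.
    producer_of: consumer index -> index of the load that feeds it.
    """
    fetch = {}
    for i in order:
        p = producer_of.get(i)
        if p is None:
            fetch[i] = first_fetch           # no input to wait for
        else:
            # Consumer EX is scheduled one cycle after producer M.
            fetch[i] = fetch[p] + M + 1 - EX
    return fetch


def timeline(i, fetch, now):
    """Stages instruction i has entered by cycle `now`, slide-style."""
    done = max(0, min(now - fetch[i] + 1, len(STAGES)))
    return " ".join(f"{i}-{stage}" for stage in STAGES[:done])


order = [0, 18, 19, 1, 2, 16, 3, 17]
producer_of = {1: 0, 2: 0, 16: 0, 3: 2, 17: 16}
fetch = fetch_cycles(order, producer_of)
rows = [timeline(i, fetch, 6) for i in [0, 1, 2, 3, 16, 17, 18, 19]]
```

Under these assumptions `rows` reproduces the first eight stage rows of the Cycle 6 slide; the two extra rows for DCs #1’ and #2’ would follow from the same rule with instruction 16 as their producer.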
Related Work
• RAW
  • Not hierarchical HW
  • Exploits basic block parallelism
• GPA
  • Grid of ALUs
  • High instruction fetch requirements
  • Exploits hyperblock parallelism
• Multiscalar
  • Horizontal but not vertical code partitioning
  • SuperScalar branch treatment
• ILDP
  • Hardware-only approach
  • Dynamic steering of dependent instructions to PEs
  • Depends on an accumulator-based ISA
• Trace Processors
  • Hardware-only approach
  • Dynamic paths are captured in traces
Implementation considerations
• Low complexity architecture based on regularity
  • Epoch Processing Core
  • Grid of PEs
  • Communication network
• High performance due to far-ahead speculation
  • Large virtual instruction window
• Strong dependence on the compiler
  • Code partitioning, DC communication
  • Epochs limit the scope of optimizations
Solving multiple problems at once
• CDE can also behave in a polymorphic way
• Exploiting ILP
  • Far-ahead speculation through Epoch speculation
• Exploiting TLP
  • Multi-threaded Epoch Processing Core
  • Distribute the PEs among all running threads
• Exploiting DLP
  • No need to re-dispatch a DC to the PEs
  • Simply re-start the DC with new data
Summary and conclusions
• Hierarchical partitioning
  • Epoch speculation keeps the transistors occupied
  • Eager execution works around difficult branches
• DC helps to keep complexity at bay
  • Amortizes cost of speculation (squash, commit)
• Scalable performance with more PEs
  • Increasing wire delays may limit scalability
  • Rely on the compiler to minimize communication
• Design in its initial stages
  • Lots of unanswered questions
    • Especially regarding the memory hierarchy
  • Feedback is welcome!