1 an accelerator based on single-flux quantum circuits for a high-performance reconfigurable...
TRANSCRIPT
1
An Accelerator Based on Single-Flux Quantum Circuits for a High-PerformanceReconfigurable Computer
F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*and K. Murakami*
*Graduate School of Information Science and Electrical Engineering, Kyushu University, Japan
**Institute of Systems, Information Technologies and Nanotechnologies (ISIT), Fukuoka, Japan
E-mail: [email protected], [email protected]
2WAHA 2009Kyushu University
Agenda
Introduction Large-Scale Reconfigurable Data-Path (LSRDP)
General Architecture and Specifications Design Procedure and Tool Chain Preliminary Results Conclusions and Future Work
3WAHA 2009Kyushu University
Introduction
Parallel computer clusters with General-Purpose Processors (GPP)are often used for HPC
Various accelerators are used with GPPs for further performance improvement PowerXcell, GPGPU, GRAPE-DR, ClearSpeed, etc. Small size and low power consumption comparing to processors with similar
performance
http://www.elsa-jp.co.jp/products/hpc/tesla/s1070/index.html
TSUBAME
http://it.nikkei.co.jp/
NVIDIA Tesla S1070
http://www.top500.org/system/9485
Roadrunner with PowerXcell
4WAHA 2009Kyushu University
Single Flux Quantum Large Scale Reconfigurable Data-Path (SFQ-LSRDP)
A large memory bandwidth is demanded in conventional accelerators for high-performance computation
On chip memories are often used to hide memory access latency
Large-Scale Reconfigurable Data-Path (LSRDP): • is introduced as an alternative accelerator• reduces the no. of memory accesses• is implemented by Single-Flux Quantum (SFQ) circuits instead of CMOS circuits• is suitable for high performance scientific computations
5WAHA 2009Kyushu University
Outline of Large-Scale Reconfigurable Data-Path (LSRDP) processor
Features:Data Flow Graphs (DFGs) extracted
from critical calculation parts are directly mapped
Pipeline executionBurst transfer is used for input /output
rearranged data from/to memoryMainMemory
GPP
ORN
: : : :
ORN : Operand Routing Network
...FU FU FUFU
...FU FU FUFU
...FU FU FUFU
LSRDP
: : : ... :SB
SMAC
Scratchpad Memory
Reconfigurable data-path includes:A large number of floating point
Functional Units (FUs)Reconfigurable Operand Routing
Network : ORNDynamic reconfiguration facilitiesStreaming Buffers (SB) for I/O ports Implementation by SFQ circuits
6WAHA 2009Kyushu University
Single-Flux Quantum (SFQ)against CMOS
CMOS issues: high electric power consumption high heat radiation and difficulties in high-density packing memory wall problem which limits the processing speed
SFQ Features: High-speed switching and signal transmission Low power consumption Compact implementation of a system (small area) No cost for latch Suitable for pipeline processing of data stream Serial bit-level processing
7WAHA 2009Kyushu University
CREST-JST (2006~): Low-power,high-performance, reconfigurable processor using single-flux quantum circuits
SFQ-LSRDP
Prof. K. MurakamiDr. K. InoueDr. H. Honda
Dr. F. MehdipourH. Kataoka
Kyushu Univ.Architecture, Compiler
and Applications
Dr. S. Nagasawa et al.
Superconducting Research Lab. (SRL)
SFQ process
Prof. N. Yoshikawa et al.
Yokohama National Univ.SFQ-FPU chip, cell library
Prof. A. Fujimaki et al.
Nagoya Univ.SFQ-RDP chip, cell library,
and wiring
Prof. N. Takagi (Leader) et al.
Nagoya Univ.CAD for logic design and arithmetic circuits
8WAHA 2009Kyushu University
Goals of the Project
Discovering appropriate applications
Developing compiler tools
Developing performance analyzing tools
Designing and Implementing SFQ-LSRDP architecture Designing and Implementing SFQ-LSRDP architecture considering the features and limitations of SFQ circuitsconsidering the features and limitations of SFQ circuits
9
LSRDP General Architecture and Specifications
10WAHA 2009Kyushu University
Parameters Should Be DecidedWithin the LSRDP Design Procedure
Height
PE1 ...
...
...
PEm...
.
.
.
.
.
.
.
.
.
PE2 PE3
ORN
ORN
Width
...
...
Streaming Buffer (SB)
ORN
Operand Routing Network (ORN)
Streaming Buffer (SB)
Maximum Connection Length (MCL)between consecutive rows?
• PE: combination of a Functional Unit (FU) and a data Transfer Unit (TU)
• Reconfiguration mechanism?
(PE, ORN, Immediate data)
Layout: FU types(ADD/SUB and MUL)?
• Core structure a matrix of PEs
Width and Height ?
• On-chip memory configuration?
11WAHA 2009Kyushu University
LSRDP Architecture
Processing Elements FU
implements basic 64-bit double-precision floating point operations including: ADD, SUB and MUL
TU (transfer unit) as a routing resource for transferring datafrom a row to an inconsecutive row
FU TU
FU
TU FU TUTU
FU TUFU
PE including Two components
Four functionalities
12WAHA 2009Kyushu University
Layout Types- Type IW
ORN
ORN
ORN
.
.
.
…A
TM
AT
M
AT
M
AT
M
AT
M
…A
TM
AT
M
AT
M
AT
M
AT
M
…A
TM
AT
M
AT
M
AT
M
AT
M
…A
TM
AT
M
AT
M
AT
M
AT
M
ADD/SUM
MUL
TU
Each PE implements ADD/SUB and MUL
M
A
T
: ADD/SUB
: MUL
: Transfer Unit
H
Flexible but consume a lot of resources
13WAHA 2009Kyushu University
W
ORN
ORN
ORN
.
.
.
…M TA T A T A T M T
…M TA T A T A T M T
…M TA T A T A T M T
…M TA T A T A T M T
Layout Types- Type II (Checkered)
H
Each PE implements ADD/SUB or MUL Each PE implements
ADD/SUB or MUL
ADD/SUM TU MUL TU
14WAHA 2009Kyushu University
W
ORN
ORN
ORN
.
.
.
…M TM T M T M T M T
…A TA T A T A T A T
…M TM T M T M T M T
…A TA T A T A T A T
Layout Types- Type III (Striped)
H
Each PE implements ADD/SUB or MUL
Each PE implements ADD/SUB or MUL
ADD/SUM TU
MUL TU
Type II or III, which one is more efficient?
15WAHA 2009Kyushu University
Maximum Connection Length (MCL)
(i, 0)
(i+1,0)
(i+1,j)
...
...
(i,j)
ORN
...
... ...
...
(i+1,j+L)
Longest ConnectionLength= L
(i,j+2)
(i,j+1)
(i+1,j+2)
(i+1,j+1)
ConnectionLength= 0
ConnectionLength= 2
MCL: maximum horizontal distance between two PEs located in two consecutive rows
16WAHA 2009Kyushu University
An ORN Structure
A. Fujimaki, et al., Demonstration of an SFQ-Based Accelerator Prototype for a High-Performance Computer,” ASC08, 2008.
FPUFPUFPUFPUFPU TTTTT
FPUFPUFPUFPUFPU TT
T
TT
½CB½CB½CB½CB½CB
CB CB CB CBT2 T2
½CB½CB½CB½CB½CB
CB CB CB CBCB
CB CB CB CBCB CB CB CBCBCB
CB CB CB CBT2 T2CB CB CB CBCB
T2 CB T2 CBT2 CB T2 CBCBT2
FPUFPUFPUFPUFPU TTTTT
FPUFPUFPUFPUFPU TT
T
TT
½CB½CB½CB½CB½CB
CB CB CB CBT2 T2
½CB½CB½CB½CB½CB
CB CB CB CBCB
CB CB CB CBCB CB CB CBCBCB
CB CB CB CBT2 T2CB CB CB CBCB
T2 CB T2 CBT2 CB T2 CBCBT2
ORN is consisted of 2-bit shift registers, 1-by-2 and 2-by-2 cross bar switches
FPU
2bit shiftregister
ORN
17WAHA 2009Kyushu University
Dynamic Reconfiguration Mechanism
FU(A op B)
TransferUnit
ImmediateRegister (64b)
ORN
MUX
・・・・・・
ImmediateRegister
・・・・・・
PEInput-AInput-B Input-C
log(2x (2MCL+1)) x 3 [b]
Conf. Reg.[bit]
Three bit-stream lines for dynamic reconfiguration of:• Immediate registers (64bit) in each PE • Selector bits for muxes selecting the input data of FUs• Cross-bar switches in ORNs
18
Design Procedure and Tool Chain
19WAHA 2009Kyushu University
Compiler and Design Flow
Application Code
Hardware-Software Partitioning(manual or automatic)
Critical Parts(h/w part)
Non-Critical Parts(s/w part)
Port positioning
s/w Part Modification
Binary Code(for GPP)
Configurations(for LSRDP)
LSRDP Architecture
Placement
Routing
DFG Genration
Bit-Stream Generation
DFGsDFGs
Analyzing DFG mappingresults
Design Phase
Mapping
• DFGs are manually generated from critical parts of applications• DFG mapping results are used for
• Analyzing LSRDP architecture statistics• Generating LSRDP configuration bit-streams
20WAHA 2009Kyushu University
LSRDP Design Procedure
Choosing a design parameter
Mapping DFGs onto the LSRDP
Obtaining required statistics
Choosing the appropriate value
Analyzing the mapping results
For eachparameter
Appropriate value for each parameter
DFGs & LSRDP HW constraints
21WAHA 2009Kyushu University
Benchmark Applicationsfor Design Procedures
Finite differential method calculation of2nd order partial differential equations 1dim-Heat equation (Heat) 1dim-Vibration equation (Vibration) 2dim-Poisson equation (Poisson)
Quantum chemistry application Recursive parts of Electron Repulsion Integral calculation
(ERI-Rec)
Only ADD/SUB and MUL operations are usedin the critical calculations of all above applications
22WAHA 2009Kyushu University
DFG Extraction- Heat Equation
1-dim. heat equation for T(x,t)
Calculation by Finite DifferenceMethod (FDM)
2
2
( , ) ( , )T x t T x tA
t x
(A is const.)
T(i,j+1)
T(i-1,j) T(i,j) T(i+1,j)
+
*
*
+
D
B
T(i,j+1)
T(i-1,j) T(i,j) T(i+1,j)
+
*
*
+
D
B
),(),(*),(*
),(
11
1
jijiji
ji
txTtxTBtxTD
txT
Basic DFG corresponding to Minimum FDM calculation
Basic DFG can be extended to horizontal and vertical directions to make a larger DFG
23WAHA 2009Kyushu University
Example of extracted DFGs- Heat
Inputs: 32Outputs: 16Operations: 721 Immediates: 364
A huge sample DFG (Heat)
24WAHA 2009Kyushu University
DFG Classification
Class # of FUs# of
Inputs# of
Outputs# of
DFGsHeat (3)Poi (1)Vib (2)Eri (4)
Heat (1)Poi (1)Vib (1)Eri (4)
Heat (2)Poi (1)Vib (2)Eri (5)
Heat (1)Poi (1)Vib (2)Eri (5)
12
12
24
52
19
19
38
64
128
512
1024
> 1024
RDP-S
RDP-M
RDP-L
RDP-XL
Due to broad range of DFG sizesDFGs are classified as S, M, L, XL with respect to their sizeand the number of Input/Output nodes
Totally,24 DFGs are preparedfor benchmark DFG
25WAHA 2009Kyushu University
Mapping DFGs onto LSRDP
Longest connections
Placing DFGnodes on LSRDP
RoutingConnections
Placing IO nodes
Routing Inp/OutConnections
DFG
LSRDPArchitectureDescription
ConfigurationFile
26
Preliminary Results
27WAHA 2009Kyushu University
LSRDP Specifications: Width & Height
# of Input ports
# of Output ports
Width Height
LSRDP-S 19 12 16 16
LSRDP-M 19 12 32 16
LSRDP-L 38 24 64 32
LSRDP Dimensions and the number of Input/Output Ports
28WAHA 2009Kyushu University
LSRDP Specifications: MCL
Needs further MCL optimization
(i, j)
(i+2,j+1)
(i+L,j+1)
(i+1,j+1)
(i,j+1)
MCL = L
・・・
LSRDP MCL(avg/max)
ORN Size-No of Inps (avg/max),
Outs
LSRDP-S 4/8 18/34, 3
LSRDP-M 5/9 22/38, 3
LSRDP-L 5/9 22/34, 3
29WAHA 2009Kyushu University
Analyzing Various LSRDP Layouts
Layout II can be used instead of Layout I to obtain a smaller LSRDP architecture with lower power consumption and implementation cost.
(Except ERI1 DFG which gives better size for Layout III)
Layout SizeI 8x3II 8x3III 8x4I 10x8II 10x8III 10x11I 10x10II 10x12III 15x18I 6x2II 9x3III 6x2I 10x10II 10x10III 15x8
Viration
Poisson
ERI1
ERI2
Heat
Layout I Layout II
30WAHA 2009Kyushu University
LSRDP at One Glance (1/2)
Functional units ADD/SUB, MUL
Layout Type II (checker pattern)
Operations 64-bit floating point
Processing structure Pipelined
PE structure FU, T, FU+T, T+T
LSRDP Size Small Medium Large
No. of inp/out ports 19/12 19/12 38/24
Width/Height 16/16 32/16 64/32
Conf. bit-stream size
Imm. Regs 16*16*64 32*16*64 64*32*64
ORNs 16*BSS(ORN) 32* BSS(ORN) 64*BSS(ORN)
PEs 16*16* 2 32*16*2 64*32* 2
ORN inputs, outputs 22 , 3 26 , 3 26 , 3
Structure Cross-bar switch
Conn. Type One-directional
31WAHA 2009Kyushu University
LSRDP at One Glance (2/2)
Internal memory Type Immediate registers
Size and count 64-bit registers, One reg. for each PE
Communication mechanism Serial
External memory No. of memory modules 16
Date trans. rate 1800Mbps/pin
Overall data trans. rate 24 GB/s
Mem. to LSRDP bus width 64 bit
Channels per module Two
Reconf. mechanism Bit serial configuration through a serial chain
32WAHA 2009Kyushu University
Preliminary Performance Evaluation
Processor type Out-of-order
GPP operating frequency 3.2GHz
Inst. issue width 4 instruction/cc
Inst. decode width 4 instruction/cc
Cache configuration L1 data 64KB(128B Entry, 2way, 2cc)
L1 instruction 64KB(64B Entry, 1way, 1cc)
L2 unified 4MB(128B Entry, 4way, 16cc)
Latency of main memory 300cc
L2 to main memory Bus width 64 Bytes
Freq 800 MHz
LSRDP operating frequency 80 GHz
Reconfiguration Latency 1cc
Latency SPM LSRDP latency 1cc
Latency Main Memory SPM 7500cc
Bandwidth SPMLSRDP Max. 64 * 8 Bytes/cc
Bandwidth Main Memory SPM 102.4GB/sec
Base processor configuration
GPP+LSRDP configuration
GPP : Exec. time measurement by means of a processor simulatorLSRDP : Estimation by performance modeling
33WAHA 2009Kyushu University
Preliminary Performance Evaluation(Heat)
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
Basic Reuse Basic Reuse
Heat (M) Heat (L)
Nor
mal
ized
by
GPP E
xec.
Tim
e
Reconf.
Comm.
Rearrange
Stall
LSRDP
GPP
Data reusing is employed to avoid the need for data rearrangement as well as frequently data retrieval from the scratchpad memory.
Basic: SB onlyReuse: SB + SPM
34WAHA 2009Kyushu University
Preliminary Performance Evaluation (Poisson)
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
poisson(S) poisson(M) poisson(L)
Norm
aliz
ed
by G
PP E
xec. Tim
e
Reconf.
Comm.
Rearrange
Stall
LSRDP
GPP
A small fraction is related to processing time on LSRDP and the main fraction concerns to various overhead times as well as the execution time on GPP
35WAHA 2009Kyushu University
Conclusions & Future Work
A high-performance computer comprising an accelerator (LSRDP) implemented by superconducting circuits was introduced.
24 benchmark Data Flow Graphs (DFGs) were manually generated.
LSRDP micro-architecture is designed based on characteristics of scientific applications via a quantitative approach.
LSRDP is promising for resolving issues originated from CMOS technology as well as achieving considerable performances.
Future Work:
•To achieve higher performance it is required to reduce various overhead costs mainly related to data management part.
•To reduce the implementation cost of LSRDP, we will focus on reducing maximum connection length and ORN size.
36WAHA 2009Kyushu University
Acknowledgement
This research was supportedin part by Core Research for Evolutional Scienceand Technology (CREST) of Japan Scienceand Technology Corporation (JST).
37
Thanks!
Any Questions?
38WAHA 2009Kyushu University
Backup Slides
Backup Slides
39WAHA 2009Kyushu University
SFQ (Single Flux Quantum) Circuit High speed, Low power consumption, and Operating by a different principle from the CMOS
Φ0
L
Ic
Ib
2mV
2ps
Tunneling effect
ジョセフソン接合
超伝導ループ
磁束量子Single Flux QuantumSuperconductivityloop
Josephson junction
40WAHA 2009Kyushu University
Mapping Results
Width Height Total # ofPEs
Extra TUs
(max/avg) (max/avg) (max/avg) (max/avg)
RDP-S 128 19 12 26/14.9 10/6.7 98/51.7 56/23.5
RDP-M 512 19 12 26/17.1 16/9.3 170/77.8 92/37.1
RDP-L 1024 38 24 58/40 24/14.4 730/260.1 428/141.3
RDP-XL > 1024 64 52 122/45.3 25/12.4 1217/350.4 1065/240
# ofFUs
# ofInputs
# ofOutputs
For each class, a lot of extra TUsare needed to map all DFGs
PE types FU
TFU
T
TT
41WAHA 2009Kyushu University
Connection Length Minimization- Results
MCL (ave/max)
RDP-S 4/9
RDP-M 5/9
RDP-L 9.3/19
Final optimized Maximum Connection Length (MCL) results
ORNs should provide the connection length of 9in LSRDP-S/M (MCL= 9). For LSRDP-L, MCL = 19 !!!
⇒ Serious Implementation Cost
(i, j)
(i+2,j+1)
(i+L,j+1)
(i+1,j+1)
(i,j+1)
MCL = L
・・・
Possible to decrease?
42WAHA 2009Kyushu University
Distributions of Connection Lengths
Average Fraction of Connection Lengths inthe RDP/S Maps
79%
10%
4% 3% 2%1%1%0%0%0%
0
1
2
3
4
5
6
7
8
9
Connectionlength
93% of connection lengths are 0 ~ 2Only small fractions of connections results in larger ORNs
43WAHA 2009Kyushu University
Analyzing Various LSRDP Layouts
Layout II can be used instead of Layout I to obtain a smaller LSRDP architecture with lower power consumption and implementation cost as well
• Almost a similar small size values are achievedfor Layout I and IIfor the majority of DFGs (except ERI1 DFG which gives better size for Layout III)
Layout SizeI 8x3II 8x3III 8x4I 10x8II 10x8III 10x11I 10x10II 10x12III 15x18I 6x2II 9x3III 6x2I 10x10II 10x10III 15x8
Viration
Poisson
ERI1
ERI2
Heat
44WAHA 2009Kyushu University
Why only ERI1 DFG is suitable to Layout III ?
RDP-Layout Type II
ADD/SUB
MUL
ADD/SUB
MULADD/SUB
MUL
ADD/SUB
MULADD/SUB
MUL
...
...
...
ADD/SUB
MULADD/SUB
MUL ...
.
.
.
.
.
.
.
.
.
MULADD/SUB
ORN
ORN
ORN
.
.
.
Heat
ERI 1 RDP-Layout Type III
MUL MUL
ADD/SUB
ADD/SUB
ADD/SUB
ADD/SUB
MUL MUL MUL MUL
...
...
...
ADD/SUB
ADD/SUB
ADD/SUB
ADD/SUB
...
.
.
.
.
.
.
.
.
.
MUL MUL
ORN
ORN
ORN
.
.
.
Layout III
Layout II
45WAHA 2009Kyushu University
FU Layout for DIV, SQRT, EXP operations
ORN
: : : :
ORN : Operand Routing Network
...FU FU FUFU
...FU FU FUFU
...FU FU FUFU
ORN
...FU FU FUFU
...FU FU FUFU
DIV
Three timeslarger latency
Where ?
Where should we place different latency FU ?Heterogeneous configuration of FU array ?
16Bits Floating point DIV, SQRT, and EXP Functional unit have been already developed by SFQ current technology.
Pipeline execution basedon ADD and MUL latency
46WAHA 2009Kyushu University
Estimated performance improvement of 2-dim Poisson equation by LSRDP calc.
0
1
2
3
4
5
6
7
8
12.8 102.4 12.8 102.4 12.8 102.4
poisson(S) poisson(M) poisson(L)
Treconf
Ttra
Trearr
Tst
Tcal
Tgpp
Nor
mal
ized
exe
c. ti
me
by
GPP
(3G
Hz)
cal
c.
Main Mem. bandwidth [GByte/sec]
47WAHA 2009Kyushu University
Estimated performance improvement of ERI calculation by LSRDP
0
0.2
0.4
0.6
0.8
1
GPPonly LSRDP
ERI
Treconf
Ttra
Trearr
Tst
Tcal
Tgpp
(3GHz)
48WAHA 2009Kyushu University
Recursive Parts of Electron Repulsion Integral Formula (ERI-Rec)
DFG sizes have already determinedfrom original recursive formula
No. of Operations
No. of
Inputs
No. of
Output
(ps,ss) 9 8 3
(ps,ps) 51 16 9
(pp,ss) 66 14 9
(pp,ps) 252 22 27
(pp,pp) 1004 28 81
49WAHA 2009Kyushu University
What types of software/algorithms are suitable for LSRDP ?
When same calculations have to be calculated repeatedly. LSRDP is used for high throughput accelerator.
Input/Output data size is small compared with the amount of the operations.
small sizeof input
small sizeof output
Large amountof calculations
Xmemory access
LSRDP
50WAHA 2009Kyushu University
Exploration of suitable applicationsfor LSRDP
Application matrix elements calculation
Molecular integral calculations in molecular orbital method Monte Carlo type simulation etc…
Numerical calculation library special function (promising?) differential equation numerical integration matrix operation (difficult ??) Triangular matrix simultaneous equation etc…
Investigating applicability against various applications
51WAHA 2009Kyushu University
Recursive Parts of Electron Repulsion Integral Formula in Molecular Orbital Calc.
# of Inputs : Max. 28# of Outputs : 1 ~ 81
(0) (0) (1)
(0) (0) (1) (1)
(0) (0) (1) (0) (1)
(0) (
( , ) ( , ) ( , )
( , ) ( , ) ( , ) ( , )2
( , ) ( , ) ( , ) ( , ) ( , )2
( , ) ( , )
i i i
abiki k k i k i
aijai j i i j i
i j k k i j
p s ss PA ss ss WP ss ss
Zp s p s QC p s ss WQ p s ss ss ss
Zp p ss PB p s ss WP p s ss ss ss Z ss ss
p p p s QC p p ss
0) (1)
(1) (1)
(0) (0) (1) (1) (1)
(0) (1)
( , )
( , ) ( , )2
( , ) ( , ) ( , ) ( , ) ( , )2
( , ) ( , )2
k i j
abik j jk i
abi j k l k i j k l i j k il j k jl i k
bijbi j i j
WQ p p ss
Zsp ss p s ss
Zp p p p QD p p p s WQ p p p s sp p s p s p s
Zp p ss Z p p ss
(ss,ss)(m) and all coefficientsare given as input
(i,j,k,l = x,y,z): p function has 3 components (as 1dim array)Each DFG has only ADD (SUB) and MUL FUs.
~Up to (pp,pp) Recursive Calculation~
DFG sizes are determined by original calculation algorithm
52WAHA 2009Kyushu University
0
200
400
600
800
1000
0 10 20 30 40 50 60 70
DFG Distribution for each application#
of F
Us
# of Inputs
Poisson (3)
Vibration (7)
Heat (6)
ERI-Rec (8 DFGs)
DFGs have different qualities in terms of the # of FUs, # of Inputs and Outputs
53WAHA 2009Kyushu University
Example of MCL (Heat)
Heat original DFG(I/O: 8/4, FUs: 32) Mapping result
MCL
54WAHA 2009Kyushu University
Example of extracted DFGs (ERI-Rec)
Maximum DFG of ERI-Rec: (pipj,pkpl)
Inputs: 28Outputs: 81FUs: 1004Immediates: 0
Vertical PartitioningInputs: 24Outputs: 1FUs: 108Immediates: 0
55WAHA 2009Kyushu University
Poisson Equation
),(),(),(
2
2
2
2
yxfy
yxu
x
yxu
2D – Poisson Eq.
),(),(),(
),(),(4
),(*)1(
),(
21
)(1
)(
1)(
1)(
)(
)1(
jijin
jin
jin
jin
jin
jin
yxfhyxuyxu
yxuyxu
yxu
yxu
ω is const.
Successive Over Relaxation method
In order to obtain u(n+1) (xi,yj) in the next iteration,current values of five variables i.e. u(n) (xi,yj), u(n) (xi±1,yj), u(n) (xi,yj±1) are needed
Red/Black Gauss Seidel
55
56WAHA 2009Kyushu University
Example of extracted DFGs (Poisson)
Maximum Poisson DFG
Inputs: 32Outputs: 1FUs: 721 Immediates: 364
57WAHA 2009Kyushu University
Performance Evaluation:Simulation Environment
57
GPP
MainMemory
LSRDP
GPP : Exec. time measurement by processor simulatorLSRDP : Estimation by performance modeling
Variable parameters:• Freq. of GPP and LSRDP• Bandwidth between main memory and LSRDP• Latency of reconfiguration time• # of FPUs in LSRDP• Supporting FPU types (Add, Mul, Div, Exp, Sqrt, Error function units are supported)Use streaming buffer in the LSRDP chip
I/O data is sorted in the main memory.
58WAHA 2009Kyushu University
Estimated performance improvementof 1-dim heat equation by LSRDP calc.
0
0.2
0.4
0.6
0.8
1
1.2
OnlyGPP
12.8 25.6 51.2 102.4
LSRDP
Trec
Ttra
Trearr
Tst
Tcal
ETgpp
Main Mem. bandwidth [GByte/sec]
59WAHA 2009Kyushu University
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
12.8 25.6 51.2 102.4
Heat
Trec
Ttra
Trearr
Tst
Tcal
ETgpp
Estimated performance improvement of 1-dim heat equation by LSRDP calc.
Main Mem. bandwidth [GByte/sec]
Nor
mal
ized
exe
c. ti
me
by
GPP
(3G
Hz)
cal
c.
60WAHA 2009Kyushu University
Poisson Red/Black 法におけるDFG の拡大による繰り返し回数の増加
9+4 ノードの入力
中心 1 ノードの出力
SOR 式2 回の繰り返し
4+1 ノードの入力
中心 1 ノードの出力
SOR 式1 回の計算
これに伴い必要な入力数も増加•DFG の拡大により 1 度に計算可能な繰り返し回数が増加
60
61WAHA 2009Kyushu University
Implementation of Heat calculation to LSRDP
Loop j Loop i T(xi,tj) End LoopEnd Loop
Original GPP code
LSRDP ReconfigurationLoop j’ Input Data Rearrangement Loop N LSRDP pipeline exec. (FDM DFG calc.) End Loop Output Data RearrangementEnd Loop
61
LSRDP code
62WAHA 2009Kyushu University
Implementation of Poisson calculation to LSRDP
Loop Iter Loop i loop j u(xi,yj) End Loop End LoopEnd Loop
Original GPP code
LSRDP ReconfigurationLoop Iter’ Input Data rearrangement Loop N LSRDP pipeline exec. (FDM DFG calc.) End Loop Output Data rearrangementEnd Loop
62
LSRDP code
63WAHA 2009Kyushu University
Implementation of ERI-Rec calculation to LSRDP
Loop I,J,K,L LSRDP Reconfiguration Loop contraction Initial Integral Calc. End Loop Input Data rearrangement Loop N LSRDP pipeline calc. (Recursive DFG calc.) End Loop Output Data rearrangement Partial Fock Calc.End Loop
Loop I,J,K,L Loop contraction Initial Integral Calc. Recursive Calc. End Loop Partial Fock Calc.End Loop
Initial Integral Calc.: 1/Sqrt, Exp, Fm(T) are utilized => GPP calculation
Recursive Calc.: only ADD/SUB, MUL => LSRDP calculation
original GPP code LSRDP code
63
64WAHA 2009Kyushu University
Vertical vs. HorizontalDFG Decomposition
Loop N Reconfiguration Loop M LSRDP pipeline calc. End LoopEnd Loop
Original
64
Loop n ( > N) Reconfiguration Loop M LSRDP pipeline calc. End LoopEnd Loop
Vertical Decomp.
Loop N Reconfiguration Loop M 1st LSRDP pipeline calc. End LoopEnd LoopLoop N Reconfiguration Loop M 2nd LSRDP pipeline calc. End LoopEnd Loop
Horizontal Decomp.
65WAHA 2009Kyushu University
Example of extracted DFGs
Maximum DFG of ERI-Rec: (pipj,pkpl)
Inputs: 28Outputs: 81FUs: 1004Immediates: 0
Vertical PartitioningInputs: 24Outputs: 1FUs: 108Immediates: 0
66WAHA 2009Kyushu University
Example of extracted DFGs- Heat
Inputs: 32Outputs: 16Operations: 721 Immediates: 364
A huge sample DFG (Heat)
67WAHA 2009Kyushu University
Performance Evaluation:Simulation Environment
67
GPP
MainMemory
LSRDP
GPP : Exec. time measurement by processor simulatorLSRDP : Estimation by performance modeling
Variable parameters:• Freq. of GPP and LSRDP• Bandwidth between main memory and LSRDP• Latency of reconfiguration time• # of FPUs in LSRDP• Supporting FPU types (Add, Mul, Div, Exp, Sqrt, Error function units are supported)Use streaming buffer in the LSRDP chip
I/O data is sorted in the main memory.
68WAHA 2009Kyushu University
Performance Evaluation:Execution Time Modeling
68
LSRDPCalLSRDP
OverheadLSRDPGPPTOTAL
STTET
TETETET
calT
LSRDPST
Execution time
Calculation time
Stall time
Latency of LSRDP <->Mem For first Input and last output
Sort data + Reconfig.+ Send signal for comm.
+ Stall fromBandwidthreq > Bandwidthmem
Total pipeline depthin the given program +
# of rows of LSRDP(latency of LSRDP)
69WAHA 2009Kyushu University
Layout Types- Type IW
ORN
ORN
ORN
.
.
.
…A
TM
AT
M
AT
M
AT
M
AT
M
…A
TM
AT
M
AT
M
AT
M
AT
M
…A
TM
AT
M
AT
M
AT
M
AT
M
…A
TM
AT
M
AT
M
AT
M
AT
M
Total No. of PEs= W * H
Total Area= W*H* [Area(MUL)+Area(ADD/SUB)+ Area(TU)]+ Area(ORNs)
ADD/SUM
MUL
TU
Each PE implements ADD/SUB and MUL
M
A
T
: ADD/SUB
: MUL
: Transfer Unit
H
70WAHA 2009Kyushu University
W
ORN
ORN
ORN
.
.
.
…M TA T A T A T M T
…M TA T A T A T M T
…M TA T A T A T M T
…M TA T A T A T M T
Layout Types- Type II
H
Each PE implements ADD/SUB or MUL Each PE implements
ADD/SUB or MUL
Total No. of PEs= W * H
Total Area= ½* W*H*[Area(MUL)+Area(TU)]+½*W*H*[Area(ADD/SUB)+Area(TU)]+ Area (ORNs)
ADD/SUM TU MUL TU
71WAHA 2009Kyushu University
W
ORN
ORN
ORN
.
.
.
…M TM T M T M T M T
…A TA T A T A T A T
…M TM T M T M T M T
…M TA T A T A T M T
Layout Types- Type III
H
Each PE implements ADD/SUB or MUL
Each PE implements ADD/SUB or MUL
Total No. of PEs= W * H
Total Area= ½* W*H*[Area(MUL)+Area(TU)]+½*W*H*[Area(ADD/SUB)+Area(TU)]+ Area (ORNs)
ADD/SUM TU
MUL TU
72WAHA 2009Kyushu University
CB Various Functionalities
CB two inputs/outputs four cases are possible reconfigurable
1/2CB one input/ two outputs four cases are possible reconfigurable
00 01 10 11 00 01 10 11
CB ½CB
73WAHA 2009Kyushu University
The number of FPUs is M, the number of Transfer Units (T) is also M;MCL is a maximum connection length if we consider FPUs only =>
½ CB – 2×M T2 – (M+4×MCL+2) CB – (2×MCL+1) ×(4×M-1)
An ORN Structure
FP
UFP
UFP
UFP
UFP
U
T
T
T
T
T
FP
UFP
UFP
UFP
UFP
U
T
T
T
T
T
½CB
½CB
½CB
½CB
½CB
CB
CB
CB
CB
T2
T2
½CB
½CB
½CB
½CB
½CB
CB
CB
CB
CB
CB
CB
CB
CB
CB
CB
CB
CB
CB
CB
CB
CB
CB
CB
CB
T2
T2
CB
CB
CB
CB
CB
T2
CB
T2
CB
T2
CB
T2
CB
CB
T2
CB: 351 JJs
½ CB: 216 JJs
* T2 is a 2-bit shift register
“+”: scalable pipelined easily re-designed for any number of N and M
“–”: large number of Josephson junctions M number of ½ CB and (2×M+1)×MCL number of CB
Reduction of the number of Josephson junctions is
essential!
A. Fujimaki, et al., Demonstration of an SFQ-Based Accelerator Prototype for a High-Performance Computer,” ASC08, 2008.
74WAHA 2009Kyushu University
Reconfiguration Mechanism
Execution
wait
Starting ofExecution
End ofExecution
Starting ofReconfiguration
End ofReconfiguration
idle
Reconfiguration
ORN
Immediate
PE
InitialState
TU
・・・ ・・・
・・・ ・・・
・・・FU
Imm.Reg.
ORN
MUX
FU
TUImm.Reg.
ORN
MUX
FU
TUImm.Reg.
ORN
MUX
FU
TUImm.Reg.
ORN
MUX
・・・ ・・・
・・・ ・・・
・・・MUX
MUX
FU(A op B)
TransferUnit
ImmediateRegister (64b)
ORN
MUX
・・・・・・
ImmediateRegister
・・・・・・
PEInput-AInput-B Input-C
log(2x (2MCL+1)) x 3 [b]
Conf. Reg.[bit]
PE & ORN reconfiguration structure
Reconfiguration in LSRDP
State Diagram
75WAHA 2009Kyushu University
LSRDP Design Procedure
Choosing a design parameterbased on the priority of
parameters
Mapping DFGs onto the LSRDP(placement, routing. I/O
positioning)
Obtaining required statisticsfrom the mapping results
Choosing the appropriate valueof the respective design
parameter
Analyzing the mapping results
Start
76WAHA 2009Kyushu University
SMACSMAC
10TFLOPS SFQ-RDP computer
:...:::
SMAC
SB
ORN
...
ORN
...
: : : :
ORN
...
ORN
FPU SFQ RDP( 32FPU×32chips )(4 GFLOPS /FPU)
4.2 K
SFQ Streaming Buffer( 64Kb×2chips )
CMOSCPU
(1chip)
Memory band width per MCM :256GB/ s(=16GB/s ×16 channels)
1024FPU@MCM(34
chips ) ×4MCM
2TB memory module( FB-DIMM
[DDR3@1333MHz, 128GB]×16 modules )
SFQ 0.5um process
77WAHA 2009Kyushu University
Power Consumption Comparisonfor 10TFLOPS computers
MPU Mem. HD Freezer Air
Conditioning
Total
CMOS RISC (90 nm)
125 kW 12.5 kW 5 KW 43 kW 186 kW
SFQ RDP
(0.5 um)
3W
+3.3 W
250 W 100 W 1 kW 0.1 kW 1.5 kW
78WAHA 2009Kyushu University
Power Consumption / Performance
Performance ( GFlops )
Power Consumption (W)
Power Cons. /Perf.(W/GFlops)
SFQ-RDP ~10K ~1.5K ~0.150.5 um,
Whole system
SFQ-RDP ~10K ~6.3 (MPU) ~0.63*10^(-3) 0.5 um, MPU
GRAPE-DR 512 65 0.12 Chip
CSX600 25 10 0.40 ClearSpeed Chip
Cell 192 (single) 32 0.17 Inside Chip, SPE core
Cell (eDP) 1.33PF(#12960) ? 12960*110W ? Roadrunner
GeForce8800GTX 518 (single) 150 0.29 Chip
SX-9 18.3512 nodes whole system: 15.4MW
CMOS RISC (90nm)
~10K ~186K ~18.6