1 an accelerator based on single-flux quantum circuits for a high-performance reconfigurable...

78
1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue* and K. Murakami* *Graduate School of Information Science and Electrical Engineering, Kyushu University, Japan **Institute of Systems, Information Technologies and Nanotechnologies (ISIT), Fukuoka, Japan E-mail: [email protected] , [email protected]

Upload: simon-thornton

Post on 18-Dec-2015

218 views

Category:

Documents


3 download

TRANSCRIPT

Page 1: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

1

An Accelerator Based on Single-Flux Quantum Circuits for a High-PerformanceReconfigurable Computer

F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*and K. Murakami*

*Graduate School of Information Science and Electrical Engineering, Kyushu University, Japan

**Institute of Systems, Information Technologies and Nanotechnologies (ISIT), Fukuoka, Japan

E-mail: [email protected], [email protected]

Page 2: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

2WAHA 2009Kyushu University

Agenda

Introduction Large-Scale Reconfigurable Data-Path (LSRDP)

General Architecture and Specifications Design Procedure and Tool Chain Preliminary Results Conclusions and Future Work

Page 3: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

3WAHA 2009Kyushu University

Introduction

Parallel computer clusters with General-Purpose Processors (GPP)are often used for HPC

Various accelerators are used with GPPs for further performance improvement PowerXcell, GPGPU, GRAPE-DR, ClearSpeed, etc. Small size and low power consumption comparing to processors with similar

performance

http://www.elsa-jp.co.jp/products/hpc/tesla/s1070/index.html

TSUBAME

http://it.nikkei.co.jp/

NVIDIA Tesla S1070

http://www.top500.org/system/9485

Roadrunner with PowerXcell

Page 4: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

4WAHA 2009Kyushu University

Single Flux Quantum Large Scale Reconfigurable Data-Path (SFQ-LSRDP)

A large memory bandwidth is demanded in conventional accelerators for high-performance computation

On chip memories are often used to hide memory access latency

Large-Scale Reconfigurable Data-Path (LSRDP): • is introduced as an alternative accelerator• reduces the no. of memory accesses• is implemented by Single-Flux Quantum (SFQ) circuits instead of CMOS circuits• is suitable for high performance scientific computations

Page 5: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

5WAHA 2009Kyushu University

Outline of Large-Scale Reconfigurable Data-Path (LSRDP) processor

Features:Data Flow Graphs (DFGs) extracted

from critical calculation parts are directly mapped

Pipeline executionBurst transfer is used for input /output

rearranged data from/to memoryMainMemory

GPP

ORN

: : : :

ORN : Operand Routing Network

...FU FU FUFU

...FU FU FUFU

...FU FU FUFU

LSRDP

: : : ... :SB

SMAC

Scratchpad Memory

Reconfigurable data-path includes:A large number of floating point

Functional Units (FUs)Reconfigurable Operand Routing

Network : ORNDynamic reconfiguration facilitiesStreaming Buffers (SB) for I/O ports Implementation by SFQ circuits

Page 6: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

6WAHA 2009Kyushu University

Single-Flux Quantum (SFQ)against CMOS

CMOS issues: high electric power consumption high heat radiation and difficulties in high-density packing memory wall problem which limits the processing speed

SFQ Features: High-speed switching and signal transmission Low power consumption Compact implementation of a system (small area) No cost for latch Suitable for pipeline processing of data stream Serial bit-level processing

Page 7: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

7WAHA 2009Kyushu University

CREST-JST (2006~): Low-power,high-performance, reconfigurable processor using single-flux quantum circuits

SFQ-LSRDP

Prof. K. MurakamiDr. K. InoueDr. H. Honda

Dr. F. MehdipourH. Kataoka

Kyushu Univ.Architecture, Compiler

and Applications

Dr. S. Nagasawa et al.

Superconducting Research Lab. (SRL)

SFQ process

Prof. N. Yoshikawa et al.

Yokohama National Univ.SFQ-FPU chip, cell library

Prof. A. Fujimaki et al.

Nagoya Univ.SFQ-RDP chip, cell library,

and wiring

Prof. N. Takagi (Leader) et al.

Nagoya Univ.CAD for logic design and arithmetic circuits

Page 8: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

8WAHA 2009Kyushu University

Goals of the Project

Discovering appropriate applications

Developing compiler tools

Developing performance analyzing tools

Designing and Implementing SFQ-LSRDP architecture Designing and Implementing SFQ-LSRDP architecture considering the features and limitations of SFQ circuitsconsidering the features and limitations of SFQ circuits

Page 9: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

9

LSRDP General Architecture and Specifications

Page 10: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

10WAHA 2009Kyushu University

Parameters Should Be DecidedWithin the LSRDP Design Procedure

Height

PE1 ...

...

...

PEm...

.

.

.

.

.

.

.

.

.

PE2 PE3

ORN

ORN

Width

...

...

Streaming Buffer (SB)

ORN

Operand Routing Network (ORN)

Streaming Buffer (SB)

Maximum Connection Length (MCL)between consecutive rows?

• PE: combination of a Functional Unit (FU) and a data Transfer Unit (TU)

• Reconfiguration mechanism?

(PE, ORN, Immediate data)

Layout: FU types(ADD/SUB and MUL)?

• Core structure a matrix of PEs

Width and Height ?

• On-chip memory configuration?

Page 11: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

11WAHA 2009Kyushu University

LSRDP Architecture

Processing Elements FU

implements basic 64-bit double-precision floating point operations including: ADD, SUB and MUL

TU (transfer unit) as a routing resource for transferring datafrom a row to an inconsecutive row

FU TU

FU

TU FU TUTU

FU TUFU

PE including Two components

Four functionalities

Page 12: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

12WAHA 2009Kyushu University

Layout Types- Type IW

ORN

ORN

ORN

.

.

.

…A

TM

AT

M

AT

M

AT

M

AT

M

…A

TM

AT

M

AT

M

AT

M

AT

M

…A

TM

AT

M

AT

M

AT

M

AT

M

…A

TM

AT

M

AT

M

AT

M

AT

M

ADD/SUM

MUL

TU

Each PE implements ADD/SUB and MUL

M

A

T

: ADD/SUB

: MUL

: Transfer Unit

H

Flexible but consume a lot of resources

Page 13: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

13WAHA 2009Kyushu University

W

ORN

ORN

ORN

.

.

.

…M TA T A T A T M T

…M TA T A T A T M T

…M TA T A T A T M T

…M TA T A T A T M T

Layout Types- Type II (Checkered)

H

Each PE implements ADD/SUB or MUL Each PE implements

ADD/SUB or MUL

ADD/SUM TU MUL TU

Page 14: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

14WAHA 2009Kyushu University

W

ORN

ORN

ORN

.

.

.

…M TM T M T M T M T

…A TA T A T A T A T

…M TM T M T M T M T

…A TA T A T A T A T

Layout Types- Type III (Striped)

H

Each PE implements ADD/SUB or MUL

Each PE implements ADD/SUB or MUL

ADD/SUM TU

MUL TU

Type II or III, which one is more efficient?

Page 15: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

15WAHA 2009Kyushu University

Maximum Connection Length (MCL)

(i, 0)

(i+1,0)

(i+1,j)

...

...

(i,j)

ORN

...

... ...

...

(i+1,j+L)

Longest ConnectionLength= L

(i,j+2)

(i,j+1)

(i+1,j+2)

(i+1,j+1)

ConnectionLength= 0

ConnectionLength= 2

MCL: maximum horizontal distance between two PEs located in two consecutive rows

Page 16: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

16WAHA 2009Kyushu University

An ORN Structure

A. Fujimaki, et al., Demonstration of an SFQ-Based Accelerator Prototype for a High-Performance Computer,” ASC08, 2008.

FPUFPUFPUFPUFPU TTTTT

FPUFPUFPUFPUFPU TT

T

TT

½CB½CB½CB½CB½CB

CB CB CB CBT2 T2

½CB½CB½CB½CB½CB

CB CB CB CBCB

CB CB CB CBCB CB CB CBCBCB

CB CB CB CBT2 T2CB CB CB CBCB

T2 CB T2 CBT2 CB T2 CBCBT2

FPUFPUFPUFPUFPU TTTTT

FPUFPUFPUFPUFPU TT

T

TT

½CB½CB½CB½CB½CB

CB CB CB CBT2 T2

½CB½CB½CB½CB½CB

CB CB CB CBCB

CB CB CB CBCB CB CB CBCBCB

CB CB CB CBT2 T2CB CB CB CBCB

T2 CB T2 CBT2 CB T2 CBCBT2

ORN is consisted of 2-bit shift registers, 1-by-2 and 2-by-2 cross bar switches

FPU

2bit shiftregister

ORN

Page 17: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

17WAHA 2009Kyushu University

Dynamic Reconfiguration Mechanism

FU(A op B)

TransferUnit

ImmediateRegister (64b)

ORN

MUX

・・・・・・

ImmediateRegister

・・・・・・

PEInput-AInput-B Input-C

log(2x (2MCL+1)) x 3 [b]

Conf. Reg.[bit]

Three bit-stream lines for dynamic reconfiguration of:• Immediate registers (64bit) in each PE • Selector bits for muxes selecting the input data of FUs• Cross-bar switches in ORNs

Page 18: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

18

Design Procedure and Tool Chain

Page 19: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

19WAHA 2009Kyushu University

Compiler and Design Flow

Application Code

Hardware-Software Partitioning(manual or automatic)

Critical Parts(h/w part)

Non-Critical Parts(s/w part)

Port positioning

s/w Part Modification

Binary Code(for GPP)

Configurations(for LSRDP)

LSRDP Architecture

Placement

Routing

DFG Genration

Bit-Stream Generation

DFGsDFGs

Analyzing DFG mappingresults

Design Phase

Mapping

• DFGs are manually generated from critical parts of applications• DFG mapping results are used for

• Analyzing LSRDP architecture statistics• Generating LSRDP configuration bit-streams

Page 20: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

20WAHA 2009Kyushu University

LSRDP Design Procedure

Choosing a design parameter

Mapping DFGs onto the LSRDP

Obtaining required statistics

Choosing the appropriate value

Analyzing the mapping results

For eachparameter

Appropriate value for each parameter

DFGs & LSRDP HW constraints

Page 21: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

21WAHA 2009Kyushu University

Benchmark Applicationsfor Design Procedures

Finite differential method calculation of2nd order partial differential equations 1dim-Heat equation      (Heat) 1dim-Vibration equation (Vibration) 2dim-Poisson equation (Poisson)

Quantum chemistry application Recursive parts of Electron Repulsion Integral calculation

(ERI-Rec)

Only ADD/SUB and MUL operations are usedin the critical calculations of all above applications

Page 22: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

22WAHA 2009Kyushu University

DFG Extraction- Heat Equation

1-dim. heat equation for T(x,t)

             

Calculation by Finite DifferenceMethod (FDM)

2

2

( , ) ( , )T x t T x tA

t x

(A is const.)

T(i,j+1)

T(i-1,j) T(i,j) T(i+1,j)

+

*

*

+

D

B

T(i,j+1)

T(i-1,j) T(i,j) T(i+1,j)

+

*

*

+

D

B

),(),(*),(*

),(

11

1

jijiji

ji

txTtxTBtxTD

txT

Basic DFG corresponding to Minimum FDM calculation

Basic DFG can be extended to horizontal and vertical directions to make a larger DFG

Page 23: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

23WAHA 2009Kyushu University

Example of extracted DFGs- Heat

Inputs: 32Outputs: 16Operations: 721 Immediates: 364

A huge sample DFG (Heat)

Page 24: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

24WAHA 2009Kyushu University

DFG Classification

Class # of FUs# of

Inputs# of

Outputs# of

DFGsHeat (3)Poi (1)Vib (2)Eri (4)

Heat (1)Poi (1)Vib (1)Eri (4)

Heat (2)Poi (1)Vib (2)Eri (5)

Heat (1)Poi (1)Vib (2)Eri (5)

12

12

24

52

19

19

38

64

128

512

1024

> 1024

RDP-S

RDP-M

RDP-L

RDP-XL

Due to broad range of DFG sizesDFGs are classified as S, M, L, XL with respect to their sizeand the number of Input/Output nodes

Totally,24 DFGs are preparedfor benchmark DFG

Page 25: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

25WAHA 2009Kyushu University

Mapping DFGs onto LSRDP

Longest connections

Placing DFGnodes on LSRDP

RoutingConnections

Placing IO nodes

Routing Inp/OutConnections

DFG

LSRDPArchitectureDescription

ConfigurationFile

Page 26: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

26

Preliminary Results

Page 27: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

27WAHA 2009Kyushu University

LSRDP Specifications: Width & Height

# of Input ports

# of Output ports

Width Height

LSRDP-S 19 12 16 16

LSRDP-M 19 12 32 16

LSRDP-L 38 24 64 32

LSRDP Dimensions and the number of Input/Output Ports

Page 28: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

28WAHA 2009Kyushu University

LSRDP Specifications: MCL

Needs further MCL optimization

(i, j)

(i+2,j+1)

(i+L,j+1)

(i+1,j+1)

(i,j+1)

MCL = L

・・・

LSRDP MCL(avg/max)

ORN Size-No of Inps (avg/max),

Outs

LSRDP-S 4/8 18/34, 3

LSRDP-M 5/9 22/38, 3

LSRDP-L 5/9 22/34, 3

Page 29: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

29WAHA 2009Kyushu University

Analyzing Various LSRDP Layouts

Layout II can be used instead of Layout I to obtain a smaller LSRDP architecture with lower power consumption and implementation cost.

(Except ERI1 DFG which gives better size for Layout III)

Layout SizeI 8x3II 8x3III 8x4I 10x8II 10x8III 10x11I 10x10II 10x12III 15x18I 6x2II 9x3III 6x2I 10x10II 10x10III 15x8

Viration

Poisson

ERI1

ERI2

Heat

Layout I Layout II

Page 30: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

30WAHA 2009Kyushu University

LSRDP at One Glance (1/2)

Functional units ADD/SUB, MUL

Layout Type II (checker pattern)

Operations 64-bit floating point

Processing structure Pipelined

PE structure FU, T, FU+T, T+T

LSRDP Size Small Medium Large

No. of inp/out ports 19/12 19/12 38/24

Width/Height 16/16 32/16 64/32

Conf. bit-stream size

Imm. Regs 16*16*64 32*16*64 64*32*64

ORNs 16*BSS(ORN) 32* BSS(ORN) 64*BSS(ORN)

PEs 16*16* 2 32*16*2 64*32* 2

ORN inputs, outputs 22 , 3 26 , 3 26 , 3

Structure Cross-bar switch

Conn. Type One-directional

Page 31: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

31WAHA 2009Kyushu University

LSRDP at One Glance (2/2)

Internal memory Type Immediate registers

Size and count 64-bit registers, One reg. for each PE

Communication mechanism Serial

External memory No. of memory modules 16

Date trans. rate 1800Mbps/pin

Overall data trans. rate 24 GB/s

Mem. to LSRDP bus width 64 bit

Channels per module Two

Reconf. mechanism Bit serial configuration through a serial chain

Page 32: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

32WAHA 2009Kyushu University

Preliminary Performance Evaluation

Processor type Out-of-order

GPP operating frequency 3.2GHz

Inst. issue width 4 instruction/cc

Inst. decode width 4 instruction/cc

Cache configuration L1 data 64KB(128B Entry, 2way, 2cc)

L1 instruction 64KB(64B Entry, 1way, 1cc)

L2 unified 4MB(128B Entry, 4way, 16cc)

Latency of main memory 300cc

L2 to main memory Bus width 64 Bytes

Freq 800 MHz

LSRDP operating frequency 80 GHz

Reconfiguration Latency 1cc

Latency SPM LSRDP latency 1cc

Latency Main Memory SPM 7500cc

Bandwidth SPMLSRDP Max. 64 * 8 Bytes/cc

Bandwidth Main Memory SPM 102.4GB/sec

Base processor configuration

GPP+LSRDP configuration

GPP : Exec. time measurement by means of a processor simulatorLSRDP : Estimation by performance modeling

Page 33: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

33WAHA 2009Kyushu University

Preliminary Performance Evaluation(Heat)

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

Basic Reuse Basic Reuse

Heat (M) Heat (L)

Nor

mal

ized

by

GPP E

xec.

Tim

e

Reconf.

Comm.

Rearrange

Stall

LSRDP

GPP

Data reusing is employed to avoid the need for data rearrangement as well as frequently data retrieval from the scratchpad memory.

Basic: SB onlyReuse: SB + SPM

Page 34: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

34WAHA 2009Kyushu University

Preliminary Performance Evaluation (Poisson)

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

poisson(S) poisson(M) poisson(L)

Norm

aliz

ed

by G

PP E

xec. Tim

e

Reconf.

Comm.

Rearrange

Stall

LSRDP

GPP

A small fraction is related to processing time on LSRDP and the main fraction concerns to various overhead times as well as the execution time on GPP

Page 35: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

35WAHA 2009Kyushu University

Conclusions & Future Work

A high-performance computer comprising an accelerator (LSRDP) implemented by superconducting circuits was introduced.

24 benchmark Data Flow Graphs (DFGs) were manually generated.

LSRDP micro-architecture is designed based on characteristics of scientific applications via a quantitative approach.

LSRDP is promising for resolving issues originated from CMOS technology as well as achieving considerable performances.

Future Work:

•To achieve higher performance it is required to reduce various overhead costs mainly related to data management part.

•To reduce the implementation cost of LSRDP, we will focus on reducing maximum connection length and ORN size.

Page 36: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

36WAHA 2009Kyushu University

Acknowledgement

This research was supportedin part by Core Research for Evolutional Scienceand Technology (CREST) of Japan Scienceand Technology Corporation (JST).

Page 37: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

37

Thanks!

Any Questions?

Page 38: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

38WAHA 2009Kyushu University

Backup Slides

Backup Slides

Page 39: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

39WAHA 2009Kyushu University

SFQ (Single Flux Quantum) Circuit High speed, Low power consumption, and Operating by a different principle from the CMOS

Φ0

L

Ic

Ib

2mV

2ps

Tunneling effect

ジョセフソン接合

超伝導ループ

磁束量子Single Flux QuantumSuperconductivityloop

Josephson junction

Page 40: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

40WAHA 2009Kyushu University

Mapping Results

Width Height Total # ofPEs

Extra TUs

(max/avg) (max/avg) (max/avg) (max/avg)

RDP-S 128 19 12 26/14.9 10/6.7 98/51.7 56/23.5

RDP-M 512 19 12 26/17.1 16/9.3 170/77.8 92/37.1

RDP-L 1024 38 24 58/40 24/14.4 730/260.1 428/141.3

RDP-XL > 1024 64 52 122/45.3 25/12.4 1217/350.4 1065/240

# ofFUs

# ofInputs

# ofOutputs

For each class, a lot of extra TUsare needed to map all DFGs

PE types FU

TFU

T

TT

Page 41: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

41WAHA 2009Kyushu University

Connection Length Minimization- Results

MCL (ave/max)

RDP-S 4/9

RDP-M 5/9

RDP-L 9.3/19

Final optimized Maximum Connection Length (MCL) results

ORNs should provide the connection length of 9in LSRDP-S/M (MCL= 9). For LSRDP-L, MCL = 19 !!!

⇒ Serious Implementation Cost

(i, j)

(i+2,j+1)

(i+L,j+1)

(i+1,j+1)

(i,j+1)

MCL = L

・・・

Possible to decrease?

Page 42: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

42WAHA 2009Kyushu University

Distributions of Connection Lengths

Average Fraction of Connection Lengths inthe RDP/S Maps

79%

10%

4% 3% 2%1%1%0%0%0%

0

1

2

3

4

5

6

7

8

9

Connectionlength

93% of connection lengths are 0 ~ 2Only small fractions of connections results in larger ORNs

Page 43: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

43WAHA 2009Kyushu University

Analyzing Various LSRDP Layouts

Layout II can be used instead of Layout I to obtain a smaller LSRDP architecture with lower power consumption and implementation cost as well

• Almost a similar small size values are achievedfor Layout I and IIfor the majority of DFGs (except ERI1 DFG which gives better size for Layout III)

Layout SizeI 8x3II 8x3III 8x4I 10x8II 10x8III 10x11I 10x10II 10x12III 15x18I 6x2II 9x3III 6x2I 10x10II 10x10III 15x8

Viration

Poisson

ERI1

ERI2

Heat

Page 44: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

44WAHA 2009Kyushu University

Why only ERI1 DFG is suitable to Layout III ?

RDP-Layout Type II

ADD/SUB

MUL

ADD/SUB

MULADD/SUB

MUL

ADD/SUB

MULADD/SUB

MUL

...

...

...

ADD/SUB

MULADD/SUB

MUL ...

.

.

.

.

.

.

.

.

.

MULADD/SUB

ORN

ORN

ORN

.

.

.

Heat

ERI 1 RDP-Layout Type III

MUL MUL

ADD/SUB

ADD/SUB

ADD/SUB

ADD/SUB

MUL MUL MUL MUL

...

...

...

ADD/SUB

ADD/SUB

ADD/SUB

ADD/SUB

...

.

.

.

.

.

.

.

.

.

MUL MUL

ORN

ORN

ORN

.

.

.

Layout III

Layout II

Page 45: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

45WAHA 2009Kyushu University

FU Layout for DIV, SQRT, EXP operations

ORN

: : : :

ORN : Operand Routing Network

...FU FU FUFU

...FU FU FUFU

...FU FU FUFU

ORN

...FU FU FUFU

...FU FU FUFU

DIV

Three timeslarger latency

Where ?

Where should we place different latency FU ?Heterogeneous configuration of FU array ?

16Bits Floating point DIV, SQRT, and EXP Functional unit have been already developed by SFQ current technology.

Pipeline execution basedon ADD and MUL latency

Page 46: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

46WAHA 2009Kyushu University

Estimated performance improvement of 2-dim Poisson equation by LSRDP calc.

0

1

2

3

4

5

6

7

8

12.8 102.4 12.8 102.4 12.8 102.4

poisson(S) poisson(M) poisson(L)

Treconf

Ttra

Trearr

Tst

Tcal

Tgpp

Nor

mal

ized

exe

c. ti

me

by

GPP

(3G

Hz)

cal

c.

Main Mem. bandwidth [GByte/sec]

Page 47: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

47WAHA 2009Kyushu University

Estimated performance improvement of ERI calculation by LSRDP

0

0.2

0.4

0.6

0.8

1

GPPonly LSRDP

ERI

Treconf

Ttra

Trearr

Tst

Tcal

Tgpp

(3GHz)

Page 48: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

48WAHA 2009Kyushu University

Recursive Parts of Electron Repulsion Integral Formula (ERI-Rec)

DFG sizes have already determinedfrom original recursive formula

No. of Operations

No. of

Inputs

No. of

Output

(ps,ss) 9 8 3

(ps,ps) 51 16 9

(pp,ss) 66 14 9

(pp,ps) 252 22 27

(pp,pp) 1004 28 81

Page 49: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

49WAHA 2009Kyushu University

What types of software/algorithms are suitable for LSRDP ?

When same calculations have to be calculated repeatedly. LSRDP is used for high throughput accelerator.

Input/Output data size is small compared with the amount of the operations.

small sizeof input

small sizeof output

Large amountof calculations

Xmemory access

LSRDP

Page 50: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

50WAHA 2009Kyushu University

Exploration of suitable applicationsfor LSRDP

Application matrix elements calculation

Molecular integral calculations in molecular orbital method Monte Carlo type simulation etc…

Numerical calculation library special function (promising?) differential equation numerical integration matrix operation (difficult ??) Triangular matrix simultaneous equation etc…

Investigating applicability against various applications

Page 51: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

51WAHA 2009Kyushu University

Recursive Parts of Electron Repulsion Integral Formula in Molecular Orbital Calc.

# of Inputs : Max. 28# of Outputs : 1 ~ 81

(0) (0) (1)

(0) (0) (1) (1)

(0) (0) (1) (0) (1)

(0) (

( , ) ( , ) ( , )

( , ) ( , ) ( , ) ( , )2

( , ) ( , ) ( , ) ( , ) ( , )2

( , ) ( , )

i i i

abiki k k i k i

aijai j i i j i

i j k k i j

p s ss PA ss ss WP ss ss

Zp s p s QC p s ss WQ p s ss ss ss

Zp p ss PB p s ss WP p s ss ss ss Z ss ss

p p p s QC p p ss

0) (1)

(1) (1)

(0) (0) (1) (1) (1)

(0) (1)

( , )

( , ) ( , )2

( , ) ( , ) ( , ) ( , ) ( , )2

( , ) ( , )2

k i j

abik j jk i

abi j k l k i j k l i j k il j k jl i k

bijbi j i j

WQ p p ss

Zsp ss p s ss

Zp p p p QD p p p s WQ p p p s sp p s p s p s

Zp p ss Z p p ss

(ss,ss)(m) and all coefficientsare given as input

(i,j,k,l = x,y,z): p function has 3 components (as 1dim array)Each DFG has only ADD (SUB) and MUL FUs.

~Up to (pp,pp) Recursive Calculation~

DFG sizes are determined by original calculation algorithm

Page 52: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

52WAHA 2009Kyushu University

0

200

400

600

800

1000

0 10 20 30 40 50 60 70

DFG Distribution for each application#

of F

Us

# of Inputs

Poisson (3)

Vibration (7)

Heat (6)

ERI-Rec (8 DFGs)

DFGs have different qualities in terms of the # of FUs, # of Inputs and Outputs

Page 53: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

53WAHA 2009Kyushu University

Example of MCL (Heat)

Heat original DFG(I/O: 8/4, FUs: 32) Mapping result

MCL

Page 54: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

54WAHA 2009Kyushu University

Example of extracted DFGs (ERI-Rec)

Maximum DFG of ERI-Rec: (pipj,pkpl)

Inputs: 28Outputs: 81FUs: 1004Immediates: 0

Vertical PartitioningInputs: 24Outputs: 1FUs: 108Immediates: 0

Page 55: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

55WAHA 2009Kyushu University

Poisson Equation

),(),(),(

2

2

2

2

yxfy

yxu

x

yxu

2D – Poisson Eq.

),(),(),(

),(),(4

),(*)1(

),(

21

)(1

)(

1)(

1)(

)(

)1(

jijin

jin

jin

jin

jin

jin

yxfhyxuyxu

yxuyxu

yxu

yxu

ω is const.

Successive Over Relaxation method

In order to obtain u(n+1) (xi,yj) in the next iteration,current values of five variables i.e. u(n) (xi,yj), u(n) (xi±1,yj), u(n) (xi,yj±1) are needed

Red/Black Gauss Seidel

55

Page 56: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

56WAHA 2009Kyushu University

Example of extracted DFGs (Poisson)

Maximum Poisson DFG

Inputs: 32Outputs: 1FUs: 721 Immediates: 364

Page 57: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

57WAHA 2009Kyushu University

Performance Evaluation:Simulation Environment

57

GPP

MainMemory

LSRDP

GPP : Exec. time measurement by processor simulatorLSRDP : Estimation by performance modeling

Variable parameters:• Freq. of GPP and LSRDP• Bandwidth between main memory and LSRDP• Latency of reconfiguration time• # of FPUs in LSRDP• Supporting FPU types (Add, Mul, Div, Exp, Sqrt, Error function units are supported)Use streaming buffer in the LSRDP chip

I/O data is sorted in the main memory.

Page 58: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

58WAHA 2009Kyushu University

Estimated performance improvementof 1-dim heat equation by LSRDP calc.

0

0.2

0.4

0.6

0.8

1

1.2

OnlyGPP

12.8 25.6 51.2 102.4

LSRDP

Trec

Ttra

Trearr

Tst

Tcal

ETgpp

Main Mem. bandwidth [GByte/sec]

Page 59: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

59WAHA 2009Kyushu University

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

12.8 25.6 51.2 102.4

Heat

Trec

Ttra

Trearr

Tst

Tcal

ETgpp

Estimated performance improvement of 1-dim heat equation by LSRDP calc.

Main Mem. bandwidth [GByte/sec]

Nor

mal

ized

exe

c. ti

me

by

GPP

(3G

Hz)

cal

c.

Page 60: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

60WAHA 2009Kyushu University

Poisson Red/Black 法におけるDFG の拡大による繰り返し回数の増加

9+4 ノードの入力

中心 1 ノードの出力

SOR 式2 回の繰り返し

4+1 ノードの入力

中心 1 ノードの出力

SOR 式1 回の計算

これに伴い必要な入力数も増加•DFG の拡大により 1 度に計算可能な繰り返し回数が増加

60

Page 61: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

61WAHA 2009Kyushu University

Implementation of Heat calculation to LSRDP

Loop j   Loop i T(xi,tj) End LoopEnd Loop

Original GPP code

LSRDP ReconfigurationLoop j’   Input Data Rearrangement Loop N LSRDP pipeline exec. (FDM DFG calc.) End Loop   Output Data RearrangementEnd Loop

61

LSRDP code

Page 62: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

62WAHA 2009Kyushu University

Implementation of Poisson calculation to LSRDP

Loop Iter Loop i loop j u(xi,yj) End Loop   End LoopEnd Loop

Original GPP code

LSRDP ReconfigurationLoop Iter’    Input Data rearrangement Loop N LSRDP pipeline exec. (FDM DFG calc.) End Loop   Output Data rearrangementEnd Loop

62

LSRDP code

Page 63: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

63WAHA 2009Kyushu University

Implementation of ERI-Rec calculation to LSRDP

Loop I,J,K,L   LSRDP Reconfiguration  Loop contraction Initial Integral Calc. End Loop   Input Data rearrangement   Loop N   LSRDP pipeline calc.       (Recursive DFG calc.)  End Loop Output Data rearrangement Partial Fock Calc.End Loop

Loop I,J,K,L   Loop contraction Initial Integral Calc. Recursive Calc. End Loop   Partial Fock Calc.End Loop

Initial Integral Calc.: 1/Sqrt, Exp, Fm(T) are utilized => GPP calculation

Recursive Calc.: only ADD/SUB, MUL => LSRDP calculation

original GPP code LSRDP code

63

Page 64: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

64WAHA 2009Kyushu University

Vertical vs. HorizontalDFG Decomposition

Loop N  Reconfiguration  Loop M   LSRDP pipeline calc.     End LoopEnd Loop

Original

64

Loop n ( > N)  Reconfiguration  Loop M   LSRDP pipeline calc.     End LoopEnd Loop

Vertical Decomp.

Loop N  Reconfiguration  Loop M   1st LSRDP pipeline calc.     End LoopEnd LoopLoop N  Reconfiguration  Loop M   2nd LSRDP pipeline calc.     End LoopEnd Loop

Horizontal Decomp.

Page 65: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

65WAHA 2009Kyushu University

Example of extracted DFGs

Maximum DFG of ERI-Rec: (pipj,pkpl)

Inputs: 28Outputs: 81FUs: 1004Immediates: 0

Vertical PartitioningInputs: 24Outputs: 1FUs: 108Immediates: 0

Page 66: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

66WAHA 2009Kyushu University

Example of extracted DFGs- Heat

Inputs: 32Outputs: 16Operations: 721 Immediates: 364

A huge sample DFG (Heat)

Page 67: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

67WAHA 2009Kyushu University

Performance Evaluation:Simulation Environment

67

GPP

MainMemory

LSRDP

GPP : Exec. time measurement by processor simulatorLSRDP : Estimation by performance modeling

Variable parameters:• Freq. of GPP and LSRDP• Bandwidth between main memory and LSRDP• Latency of reconfiguration time• # of FPUs in LSRDP• Supporting FPU types (Add, Mul, Div, Exp, Sqrt, Error function units are supported)Use streaming buffer in the LSRDP chip

I/O data is sorted in the main memory.

Page 68: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

68WAHA 2009Kyushu University

Performance Evaluation:Execution Time Modeling

68

LSRDPCalLSRDP

OverheadLSRDPGPPTOTAL

STTET

TETETET

calT

LSRDPST

Execution time

Calculation time

Stall time

Latency of LSRDP <->Mem For first Input and last output

Sort data + Reconfig.+ Send signal for comm.

+ Stall fromBandwidthreq > Bandwidthmem

Total pipeline depthin the given program +

# of rows of LSRDP(latency of LSRDP)

Page 69: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

69WAHA 2009Kyushu University

Layout Types- Type IW

ORN

ORN

ORN

.

.

.

…A

TM

AT

M

AT

M

AT

M

AT

M

…A

TM

AT

M

AT

M

AT

M

AT

M

…A

TM

AT

M

AT

M

AT

M

AT

M

…A

TM

AT

M

AT

M

AT

M

AT

M

Total No. of PEs= W * H

Total Area= W*H* [Area(MUL)+Area(ADD/SUB)+ Area(TU)]+ Area(ORNs)

ADD/SUM

MUL

TU

Each PE implements ADD/SUB and MUL

M

A

T

: ADD/SUB

: MUL

: Transfer Unit

H

Page 70: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

70WAHA 2009Kyushu University

W

ORN

ORN

ORN

.

.

.

…M TA T A T A T M T

…M TA T A T A T M T

…M TA T A T A T M T

…M TA T A T A T M T

Layout Types- Type II

H

Each PE implements ADD/SUB or MUL Each PE implements

ADD/SUB or MUL

Total No. of PEs= W * H

Total Area= ½* W*H*[Area(MUL)+Area(TU)]+½*W*H*[Area(ADD/SUB)+Area(TU)]+ Area (ORNs)

ADD/SUM TU MUL TU

Page 71: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

71WAHA 2009Kyushu University

W

ORN

ORN

ORN

.

.

.

…M TM T M T M T M T

…A TA T A T A T A T

…M TM T M T M T M T

…M TA T A T A T M T

Layout Types- Type III

H

Each PE implements ADD/SUB or MUL

Each PE implements ADD/SUB or MUL

Total No. of PEs= W * H

Total Area= ½* W*H*[Area(MUL)+Area(TU)]+½*W*H*[Area(ADD/SUB)+Area(TU)]+ Area (ORNs)

ADD/SUM TU

MUL TU

Page 72: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

72WAHA 2009Kyushu University

CB Various Functionalities

CB two inputs/outputs four cases are possible reconfigurable

1/2CB one input/ two outputs four cases are possible reconfigurable

00 01 10 11 00 01 10 11

CB ½CB

Page 73: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

73WAHA 2009Kyushu University

The number of FPUs is M, the number of Transfer Units (T) is also M;MCL is a maximum connection length if we consider FPUs only =>

½ CB – 2×M T2 – (M+4×MCL+2) CB – (2×MCL+1) ×(4×M-1)

An ORN Structure

FP

UFP

UFP

UFP

UFP

U

T

T

T

T

T

FP

UFP

UFP

UFP

UFP

U

T

T

T

T

T

½CB

½CB

½CB

½CB

½CB

CB

CB

CB

CB

T2

T2

½CB

½CB

½CB

½CB

½CB

CB

CB

CB

CB

CB

CB

CB

CB

CB

CB

CB

CB

CB

CB

CB

CB

CB

CB

CB

T2

T2

CB

CB

CB

CB

CB

T2

CB

T2

CB

T2

CB

T2

CB

CB

T2

CB: 351 JJs

½ CB: 216 JJs

* T2 is a 2-bit shift register

“+”: scalable pipelined easily re-designed for any number of N and M

“–”: large number of Josephson junctions M number of ½ CB and (2×M+1)×MCL number of CB

Reduction of the number of Josephson junctions is

essential!

A. Fujimaki, et al., Demonstration of an SFQ-Based Accelerator Prototype for a High-Performance Computer,” ASC08, 2008.

Page 74: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

74WAHA 2009Kyushu University

Reconfiguration Mechanism

Execution

wait

Starting ofExecution

End ofExecution

Starting ofReconfiguration

End ofReconfiguration

idle

Reconfiguration

ORN

Immediate

PE

InitialState

TU

・・・ ・・・

・・・ ・・・

・・・FU

Imm.Reg.

ORN

MUX

FU

TUImm.Reg.

ORN

MUX

FU

TUImm.Reg.

ORN

MUX

FU

TUImm.Reg.

ORN

MUX

・・・ ・・・

・・・ ・・・

・・・MUX

MUX

FU(A op B)

TransferUnit

ImmediateRegister (64b)

ORN

MUX

・・・・・・

ImmediateRegister

・・・・・・

PEInput-AInput-B Input-C

log(2x (2MCL+1)) x 3 [b]

Conf. Reg.[bit]

PE & ORN reconfiguration structure

Reconfiguration in LSRDP

State Diagram

Page 75: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

75WAHA 2009Kyushu University

LSRDP Design Procedure

Choosing a design parameterbased on the priority of

parameters

Mapping DFGs onto the LSRDP(placement, routing. I/O

positioning)

Obtaining required statisticsfrom the mapping results

Choosing the appropriate valueof the respective design

parameter

Analyzing the mapping results

Start

Page 76: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

76WAHA 2009Kyushu University

SMACSMAC

10TFLOPS SFQ-RDP computer

:...:::

SMAC

SB

ORN

...

ORN

...

: : : :

ORN

...

ORN

FPU SFQ RDP( 32FPU×32chips )(4 GFLOPS /FPU)

4.2 K

SFQ Streaming Buffer( 64Kb×2chips )

CMOSCPU

(1chip)

Memory band width per MCM :256GB/ s(=16GB/s ×16 channels)

1024FPU@MCM(34

chips ) ×4MCM

2TB memory module( FB-DIMM

[DDR3@1333MHz, 128GB]×16 modules )

SFQ 0.5um process

Page 77: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

77WAHA 2009Kyushu University

Power Consumption Comparisonfor 10TFLOPS computers

MPU Mem. HD Freezer Air

Conditioning

Total

CMOS RISC (90 nm)

125 kW 12.5 kW 5 KW 43 kW 186 kW

SFQ RDP

(0.5 um)

3W

+3.3 W

250 W 100 W 1 kW 0.1 kW 1.5 kW

Page 78: 1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

78WAHA 2009Kyushu University

Power Consumption / Performance

Performance ( GFlops )

Power Consumption (W)

Power Cons. /Perf.(W/GFlops)

SFQ-RDP ~10K ~1.5K ~0.150.5 um,

Whole system

SFQ-RDP ~10K ~6.3 (MPU) ~0.63*10^(-3) 0.5 um, MPU

GRAPE-DR 512 65 0.12 Chip

CSX600 25 10 0.40 ClearSpeed Chip

Cell 192 (single) 32 0.17 Inside Chip, SPE core

Cell (eDP) 1.33PF(#12960) ? 12960*110W ? Roadrunner

GeForce8800GTX 518 (single) 150 0.29 Chip

SX-9 18.3512 nodes whole system: 15.4MW

CMOS RISC (90nm)

~10K ~186K ~18.6