Architectural optimization techniques

Master EAII Sp. RSEE

Camille Diou

diou@univ-metz.fr

Microprocessor basics

Figure: a CONTROLLER (state machine) and a DATAPATH connected by a BUS. Datapath elements: Arithmetic and Logic Unit (ALU), register file (t1, t2, t3, A, B, C), tristate components (inputs/outputs).

Computation example: S=Ax²+By+C

The CONTROLLER steps the DATAPATH (registers t1, t2, t3, A, B, C) through one register transfer per cycle:

#CYCLES: 1    t1 <- x
#CYCLES: 2    t2 <- y
#CYCLES: 3    t3 <- A.t1
#CYCLES: 4    t3 <- t3.t1
#CYCLES: 5    t2 <- B.t2
#CYCLES: 6    t3 <- t2+t3
#CYCLES: 7    out <- t3+C
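The seven register transfers of the computation example can be replayed in software. The following is only a minimal sketch: the function name `run_datapath` and the dictionary-based register file are illustrative assumptions, not part of the original slides.

```python
def run_datapath(x, y, A, B, C):
    """Replay the seven register transfers, one per controller cycle.

    Returns the value written to `out` and the number of cycles used.
    """
    regs = {"t1": 0, "t2": 0, "t3": 0, "out": 0}
    # Each step is (destination register, function computing the new value).
    steps = [
        ("t1", lambda r: x),                    # cycle 1: t1 <- x
        ("t2", lambda r: y),                    # cycle 2: t2 <- y
        ("t3", lambda r: A * r["t1"]),          # cycle 3: t3 <- A.t1
        ("t3", lambda r: r["t3"] * r["t1"]),    # cycle 4: t3 <- t3.t1
        ("t2", lambda r: B * r["t2"]),          # cycle 5: t2 <- B.t2
        ("t3", lambda r: r["t2"] + r["t3"]),    # cycle 6: t3 <- t2+t3
        ("out", lambda r: r["t3"] + C),         # cycle 7: out <- t3+C
    ]
    for dest, operation in steps:
        regs[dest] = operation(regs)
    return regs["out"], len(steps)
```

For example, `run_datapath(3, 4, 2, 5, 7)` evaluates 2·3² + 5·4 + 7 in seven cycles.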

Execution principle

START -> Fetch next instruction (fetch cycle) -> Execute instruction (execute cycle) -> Fetch next instruction -> … -> HALT

A single accumulator machine

MAR: Memory Address Register
IR: Instruction Register
PC: Program Counter register

Memory: 16M words, 16 bits wide

Figure labels: data flow, control signals, load path, store path, instruction path, address, address operand, branch, opcode, LD, function controls.

Instruction:

Bits 15-14: Opcode
Bits 13-0: Address

Opcode:
00: Load
01: Store
10: Add
11: Branch

Single-address instruction: one of the registers is fixed (the accumulator), so AC is an implicit operand:

AC := AC <operation> Memory(Address)
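The instruction layout (2-bit opcode in bits 15-14, 14-bit address in bits 13-0) can be decoded with a shift and a mask. This is a minimal sketch; the names `OPCODES` and `decode` are illustrative and do not come from the slides.

```python
# Opcode table taken from the instruction format slide.
OPCODES = {0b00: "Load", 0b01: "Store", 0b10: "Add", 0b11: "Branch"}

def decode(instruction):
    """Split a 16-bit instruction word into (operation name, 14-bit address)."""
    opcode = (instruction >> 14) & 0b11   # bits 15-14
    address = instruction & 0x3FFF        # bits 13-0
    return OPCODES[opcode], address
```

Decoding the word 1000110100110011 gives the operation Add with operand address 00110100110011.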

Instruction cycle on the single accumulator machine (example: IR = 1000110100110011, an ADD)

1. Instruction fetch:
- PC is moved into MAR
- Read from memory
- Load instruction into IR

2. Instruction decode:
- Opcode bits go to the FSM (ADD)
- The rest of the bits is the operand address

3. Operand fetch:
- IR<address> -> MAR
- Read data from memory

4. Instruction execute:
- Memory to ALU B
- AC to ALU
- ALU Add
- S to AC

5. Housekeeping:
- Increment PC

Values shown in the figures:
PC = 10110100110011 (10110100110100 after the increment)
IR = 1000110100110011
Operand address = 00110100110011
ALU operands = 0101010101110001 and 0011001101110110
ALU result = 1000100011100111
Bus widths: 14, 14 and 16 bits
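The five steps of the instruction cycle can be sketched as one function operating on a small machine state. This is only an illustration under stated assumptions: the function name `run_instruction`, the dictionary-based state and memory, and the treatment of Branch as an unconditional jump are hypothetical choices, not details given in the slides.

```python
WORD_MASK = 0xFFFF   # 16-bit data words
ADDR_MASK = 0x3FFF   # 14-bit addresses

def run_instruction(state, memory):
    """Execute one instruction; `state` holds PC, AC, IR and MAR."""
    # 1. Instruction fetch: PC -> MAR, read memory, load IR.
    state["MAR"] = state["PC"]
    state["IR"] = memory[state["MAR"]]
    # 2. Instruction decode: opcode to the FSM, remaining bits = operand address.
    opcode = (state["IR"] >> 14) & 0b11
    operand_address = state["IR"] & ADDR_MASK
    if opcode == 0b11:
        # Branch (assumed unconditional here): the address replaces the PC.
        state["PC"] = operand_address
        return state
    # 3. Operand fetch: IR<address> -> MAR.
    state["MAR"] = operand_address
    # 4. Instruction execute.
    if opcode == 0b00:      # Load
        state["AC"] = memory[state["MAR"]]
    elif opcode == 0b01:    # Store
        memory[state["MAR"]] = state["AC"]
    else:                   # Add: AC := AC + Memory(Address)
        state["AC"] = (state["AC"] + memory[state["MAR"]]) & WORD_MASK
    # 5. Housekeeping: increment PC.
    state["PC"] = (state["PC"] + 1) & ADDR_MASK
    return state
```

Loading the figure values (PC = 10110100110011, the ADD instruction at that address, and the two ALU operands as memory word and accumulator) reproduces the ALU result and the incremented PC.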

A simple microprocessor: Architecture

Figure: 16x16 registers, address to memory, data to/from memory, signals to the controller (FSM).

A simple microprocessor: Instruction format

Figure: instruction format table (columns: Instruction, Action).

A simple microprocessor: test program. What will it do?

0000 7C0A ;
0001 8C00 ; LOAD RC, #A
0002 7B04 ; ...
0003 7A0A ; ...
0004 9C7C ; ...
0005 611A ; ...
0006 614B ; ...

Compiler dependency detection for ILP

• Detect data dependency at compile time:
– examples:

c[i]=a[i]+b[i];    potential dependency
d[i]=a[i]+c[j];    c[i] might be c[j]

c[1]=a[i]+b[i];    no dependency
d[i]=a[i]+c[2];    c[1] is never c[2]
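The two examples follow a conservative rule: two array accesses can be declared independent only when both indices are known constants that differ. The sketch below illustrates that rule only; the function name `may_depend` and the convention of passing constants as integers and unknown indices as strings are assumptions made for the illustration.

```python
def may_depend(write_index, read_index):
    """Return True if the written and the read element might be the same.

    Integers stand for compile-time constants, strings for indices whose
    value is unknown at compile time (such as i or j).
    """
    both_constant = isinstance(write_index, int) and isinstance(read_index, int)
    if both_constant:
        # c[1] versus c[2]: the answer is exact.
        return write_index == read_index
    # c[i] versus c[j]: the values are unknown, so assume a dependency.
    return True
```

With this rule, `may_depend("i", "j")` reports a potential dependency, while `may_depend(1, 2)` reports none.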

Systolic ring

Reconfigurable computing: Instruction level parallelism (ILP)

• Superscalar processors must find the dataflow graph at run time
• Reconfigurable architectures construct the dataflow graph at compile time
• No FU limitations
• No control logic overhead
• No window size limitations

Reconfigurable computing: Instruction level parallelism (ILP)

• General-purpose computer:
add r1, r2, r4
add r1, r3, r5
sub r3, r2, r6
add r4, r5, r1
add r5, r6, r2

• RC scheme: the same operations are laid out as a dataflow graph (inputs r1, r2, r3; intermediate results r4, r5, r6)

Question: what is the advantage of RC over superscalar?

Answer: the dataflow graph is constructed at compile time, thus no overhead
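A dataflow graph for the five-instruction example can be derived by placing each instruction one level below the latest producer of its operands. The sketch below assumes the destination register is the last operand (which is what the example program suggests) and uses an illustrative function name, `dataflow_levels`; it is not the tool flow of any particular reconfigurable architecture.

```python
def dataflow_levels(program):
    """Group instructions (op, src1, src2, dst) into dataflow levels.

    Instructions in the same level have no true dependency on each other
    and could be mapped onto parallel functional units.
    """
    produced_at = {}   # register -> level of the instruction that last wrote it
    levels = []
    for instruction in program:
        _, src1, src2, dst = instruction
        # Registers never written by the program are primary inputs (level 0).
        level = 1 + max(produced_at.get(src1, 0), produced_at.get(src2, 0))
        while len(levels) < level:
            levels.append([])
        levels[level - 1].append(instruction)
        produced_at[dst] = level
    return levels

PROGRAM = [
    ("add", "r1", "r2", "r4"),
    ("add", "r1", "r3", "r5"),
    ("sub", "r3", "r2", "r6"),
    ("add", "r4", "r5", "r1"),
    ("add", "r5", "r6", "r2"),
]
```

On `PROGRAM`, the first three instructions land in the first level and the last two in the second, so the five operations need two steps instead of five.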

Reconfigurable computing: Why now?

• Increasing number of transistors
• Complexity and cost of chip design increase fast
• Current computing demands are RC-friendly: desktop and embedded demands are driven NOT by Word or Excel but by multimedia, encryption, filters (dataflow-oriented applications)

RA versus microprocessors

• RA less flexible (like a VLIW with fixed instructions)
• RA provides more (customized) computation elements
• RA can decrease memory traffic
• RA can be tailored for specific algorithms and data types

RA will not replace µP, but complement them

Systolic computing: definition

• A set of simple processing elements with regular and local connections, which takes external inputs and processes them in a predetermined manner (H.T. Kung)

Systolic computing: characteristics of the best RC designs

• Simple PE
• Regular and local interconnect
• Pipeline between PEs
• I/O at boundary

Coarse grain RA model

In abstract: instructions configure both PE and interconnect every cycle
In reality: the required instruction bandwidth / memory is too high, so… COMPROMISE

Communications…

Relationship of communication among processors:
• Shared clock (pipelined)
• Shared registers (VLIW)
• Shared memory (SMM)
• Shared network

Reconfigurable computing

Figure labels: instructions currently in hardware, instructions paged out, actual available hardware.

Finite impulse response filter (FIR)

\[ y(n) = \sum_{i=0}^{N-1} a(i)\, x(n-i-1) \]

3-coefficient filter:

\[ y(n) = a_0\, x(n-1) + a_1\, x(n-2) + a_2\, x(n-3) \]

Figure: delay line (Z-1 elements) whose taps are weighted by the coefficients and summed to produce yn.
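The FIR formula can be evaluated directly, one output at a time. The sketch below follows the indexing y(n) = Σ a(i)·x(n−i−1) and assumes that samples outside the input sequence are zero; the function name `fir` is illustrative.

```python
def fir(coefficients, samples):
    """Direct evaluation of y(n) = sum_i a(i) * x(n - i - 1).

    Samples with an index outside the input sequence are taken as zero.
    Outputs are returned for n = 0 .. len(samples).
    """
    def x(k):
        return samples[k] if 0 <= k < len(samples) else 0

    return [
        sum(a * x(n - i - 1) for i, a in enumerate(coefficients))
        for n in range(len(samples) + 1)
    ]
```

Feeding a unit impulse returns the coefficients delayed by one sample, which is what the x(n−i−1) indexing implies.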

Systolic FIR implementation (MAC unit)

Figure sequence: successive versions of the systolic FIR array:
• Optimize outer loop, preload repeated value
• Optimize outer loop, broadcast common value
• Optimize outer loop, retime to eliminate broadcast

The Systolic Ring

• Coarse-grain architecture
• Multi-mode dynamic reconfiguration
• Scalable, two-dimensional array
• VHDL design
• Designed for SoC integration

Dnode: word-level processing unit

Constitution
• Optimized datapath (16 bits)
• Register file (4x16 bits)
• Hardwired ALU and multiplier

Features
• Complex computations in local mode (FIR, IIR, WT…)
• Low silicon area (0.07 mm², 0.18 µm CMOS process)
• Single-cycle operations (e.g. MAC + register load)

Figure: Dnode block diagram (ALU + MULT, Reg FILE, µinst. input).

Local controller: dynamic reconfiguration at the Dnode level

Constitution
• 8 configuration registers
• 3 different run modes
• 1 programming mode

Operand sources: In(1,2), fifo(1,2), bus, Rp(i,j) (i=1~4, j=1~2)

Figure labels: Decoder, en, ex, Controller (3), mode (2), µinst., ALU + MULT, Reg FILE.

Modes shown in the figure sequence:
• Programming mode: Instruction 0, Instruction 1, Instruction 2 and Instruction 3 are loaded in turn
• Run-mode 1 (fixed): Instruction 0 is applied on every cycle
• Run-mode 2 (dynamic, one-time or loop): Instruction 1, Instruction 2 and Instruction 3 are applied in sequence

Array structure

• Scalable
• Unidirectional communications between neighbours
• Hard to implement a datapath with a greater pipeline depth than the array
• Hard to implement recursive operations

Figure labels: configurable blocks (processing units), switches, main dataflow (unidirectional), BUS (shared resource).

Use of a ring structure

Figure: the array structure of configurable blocks is closed into a RING STRUCTURE.

Use of a bi-dataflow structure

Figure: RING STRUCTURE of configurable blocks with a forward dataflow and a reverse dataflow.

Systolic Ring architecture

Figure: layers of Dnodes (layer n-1, layer n, layer n+1) separated by switches; the forward dataflow goes from one layer to the next. A configuration controller and I/O ports are attached to the ring.

• No complex data routing problems (crossbars…)
• Unidirectional data transfers between adjacent layers (pipeline)
• Performance increases linearly with the number of Dnodes
• Peak computing power: 3200 MIPS @ 200 MHz for a 16-Dnode realization

Dnode modes:
• Local mode: stand-alone
• Global mode: FPGA-like

Switch components:
• Direct FIFO connection for data injection
• BUS connection for RISC communication
• Full connectivity between 2 Dnode layers

Reverse dataflow: feedback pipelines

• Each switch writes computed data into its own feedback pipeline
• Each switch has read ports on the other switches' pipelines
• Easy implementation of various recursive algorithms (IIR, WT…)

Figure: ring of D-Nodes and switches with the feedback pipelines carrying the reverse dataflow.

2-level dynamically reconfigurable architecture:

• Global mode (first level)
– The program which manages the configuration runs on the RISC processor
– The configuration of an entire cluster can be modified at each clock cycle
– The operating layer computes the data coming from the host processor

• Local mode (second level)
– Each Dnode runs its own program of up to 8 instructions

Figure: the CONFIGURATION layer (config controller, RAM holding the MANAGEMENT CODE and the CONFIG) drives the OPERATING layer (Dnodes: ALU + MULT, Reg FILE), which exchanges DATA with the host µP.

8 Dnodes version…

• ST* CMOS process, 0.25 µm and 0.18 µm

             Ring-8 (0.25 µm)    Ring-8 (0.18 µm)    Dnode (0.18 µm)
Area         0.9 mm²             0.7 mm²             0.04 mm²
Frequency    150 MHz             200 MHz             200 MHz

• Low Dnode area: possible to realize 128-Dnode versions…
• Suited as an IP core for SoC

Features:
• Parametrizable core (number of Dnodes)
• Good performance / cost trade-off (Ring-8 @ 200 MHz Systolic Ring system):
– 1600 MIPS (PII @ 450 MHz: 400 MIPS)
– 3 Gb/s bandwidth

*: ST: STMicroelectronics

Assembly-level programming

0000 r:ldl(0,8)    M1:    N1:clr                 N2:clr
0001 r:ldl(1,2)    M2:    N1:clr                 N2:clr
0002 r:dec(0,0)    M1:    N1:add(fifo1,fifo1)    N2:sub(fifo1,fifo1)
0003 r:jnz(1)      M2:    N1:mac(in1)            N2:mac(in2)
0004 r: halt

Columns: RISC instructions, layer selection, Dnodes instructions.

Figure labels: Assembler, File1.bin, File2.m, RAM, FPGA, Prototype, Ring-8, Testbench, Simulator.

FIR filter: edge detection

Convolution mask: [ -1 1 0 ], y(n) = x(n) - x(n-1)

Assembly code:
0000 r:ldl(0,1)    M1:    N1:rst    N2:rst
0001 r:jmp(0)      M1:    N2:sub(fifo,fifo)

Figure labels: input image, output image, timing diagrams, Assembler, File2.m, RAM, Ring-8, Testbench, Simulator.
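The edge-detection filter computes the difference between each sample and the previous one, so flat regions give zero and intensity steps give a non-zero output. A minimal software sketch follows; the function name `edge_detect` and the choice of zero for the sample preceding the first one are assumptions.

```python
def edge_detect(row):
    """Apply y(n) = x(n) - x(n-1) to one row of pixels.

    The sample before the first pixel is assumed to be zero.
    """
    previous = 0
    output = []
    for pixel in row:
        output.append(pixel - previous)
        previous = pixel
    return output
```

On the row `[10, 10, 50, 50, 10]` the non-zero outputs after the first sample mark the two intensity steps.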

Polynomial calculus

• P(x) = a.x + b.x² + c.x³

Sequence executed on one Dnode (ALU + MULT):

x -> reg0                /* load reg0, x */
x.x -> reg1              /* load reg1, x² */
reg0.reg1 -> reg2        /* load reg2, x³ */
a.reg0 -> ACC            /* load ACC, a.x */
b.reg1 + ACC -> ACC      /* load ACC, a.x+b.x² */
c.reg2 + ACC -> ACC      /* load ACC, a.x+b.x²+c.x³ */
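The six-step polynomial sequence maps one-to-one onto ordinary assignments. The sketch below mirrors the register names of the slide (reg0, reg1, reg2, ACC); the function name `dnode_polynomial` is illustrative.

```python
def dnode_polynomial(x, a, b, c):
    """Evaluate P(x) = a.x + b.x^2 + c.x^3 with the six Dnode steps."""
    reg0 = x                  # x -> reg0
    reg1 = x * x              # x.x -> reg1
    reg2 = reg0 * reg1        # reg0.reg1 -> reg2
    acc = a * reg0            # a.reg0 -> ACC
    acc = b * reg1 + acc      # b.reg1 + ACC -> ACC
    acc = c * reg2 + acc      # c.reg2 + ACC -> ACC
    return acc
```

For x = 2, a = 1, b = 2, c = 3 the accumulator goes through 2, then 10, then 34.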

\[ y(n) = \sum_{i=0}^{N-1} a_i\, x(n-i-1) \]

3-coefficient filter:

\[ y(n) = a_0\, x(n-1) + a_1\, x(n-2) + a_2\, x(n-3) \]

Figure: delay line (Z-1 elements) whose taps are weighted by the coefficients and summed to produce yn.

FIR implementation

3 Dnodes / layer architecture use
• Pipelined implementation
• Samples are injected through dedicated lines
• Coefficients loaded during first cycle

Cycle    Injected sample    Values held in the Dnodes (passed on through the feedback)
1        x0                 coefficients a2, a1, a0 loaded
2        x1
3        x2                 a2.x1 ; a2.x0+a1.x1
4        x3                 a2.x2 ; a2.x1+a1.x2 ; a2.x0+a1.x1+a0.x2
5        x4                 a2.x3 ; a2.x2+a1.x3 ; a2.x1+a1.x2+a0.x3
6        x4                 a2.x3 ; a2.x2+a1.x3 ; a2.x1+a1.x2+a0.x3
7        x5                 a2.x4 ; a2.x3+a1.x4 ; a2.x2+a1.x3+a0.x4

OPTIMAL IMPLEMENTATION: 1 SAMPLE / CYCLE
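The partial sums visible in the cycle-by-cycle figures (a2.x1, then a2.x0+a1.x1, then a2.x0+a1.x1+a0.x2) match a pipelined structure in which every sample is sent to three multiply-accumulate units at once and each unit adds its product to the partial sum of its neighbour. The sketch below models that behaviour in software; it is an interpretation of the figures, and the function name `systolic_fir` and the variable names are assumptions.

```python
def systolic_fir(a0, a1, a2, samples):
    """Three-stage pipelined FIR: one sample in, one result out per cycle.

    After sample x_k has been processed:
      stage1 holds a2.x_k
      stage2 holds a2.x_(k-1) + a1.x_k
      output is   a2.x_(k-2) + a1.x_(k-1) + a0.x_k
    """
    stage1 = stage2 = 0
    outputs = []
    for x in samples:
        # Update from the last stage backwards so each stage
        # still sees the previous-cycle value of its neighbour.
        result = stage2 + a0 * x
        stage2 = stage1 + a1 * x
        stage1 = a2 * x
        outputs.append(result)
    return outputs
```

Once the pipeline is filled, every new sample yields one complete three-term result, which is the one-sample-per-cycle rate stated on the slide.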

6-coefficient filter

\[ y(n) = a_0\, x(n-1) + a_1\, x(n-2) + a_2\, x(n-3) + a_3\, x(n-4) + a_4\, x(n-5) + a_5\, x(n-6) \]

Figure: MAC Dnodes holding the coefficients (a5, a4, …), chained through the inter-layer feedback.

Discrete Cosine Transform

• Usually two-dimensional 8x8-point DCT
• Very demanding algorithm…

Figure: original image -> DCT coeff. -> quantization -> quantized coeff. -> coding -> compressed image -> decoding -> inverse quantization -> decompressed image.

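DCT algorithm

Direct transform:

\[ z(k) = \sqrt{\frac{2}{N}}\, \alpha(k) \sum_{n=0}^{N-1} x(n) \cos\!\left( \frac{(2n+1)\,k\,\pi}{2N} \right), \qquad k = 0, 1, \dots, N-1 \]

Inverse transform:

\[ x(n) = \sqrt{\frac{2}{N}} \sum_{k=0}^{N-1} \alpha(k)\, z(k) \cos\!\left( \frac{(2n+1)\,k\,\pi}{2N} \right), \qquad n = 0, 1, \dots, N-1 \]

with

\[ \alpha(k) = \begin{cases} 1/\sqrt{2} & \text{for } k = 0 \\ 1 & \text{else} \end{cases} \]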

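Image
• 64x64 points
• 8x8-pixel blocks (the initial image is split into 64 blocks of 8x8)
• 16-bit coded image

One 8x8 block:

\[ \begin{bmatrix} x_{0,0} & x_{1,0} & \cdots & x_{7,0} \\ x_{0,1} & x_{1,1} & & \vdots \\ \vdots & & \ddots & \\ x_{0,7} & \cdots & & x_{7,7} \end{bmatrix} \]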

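Implementation
• Matrix implementation: z = T(N) x, up to the normalization factors of the transform
• Even / odd frequency decomposition of the DCT algorithm

Even frequencies:

\[ \begin{bmatrix} z_0 \\ z_2 \\ z_4 \\ z_6 \end{bmatrix} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ \beta & \delta & -\delta & -\beta \\ \alpha & -\alpha & -\alpha & \alpha \\ \delta & -\beta & \beta & -\delta \end{bmatrix} \begin{bmatrix} x_0 + x_7 \\ x_1 + x_6 \\ x_2 + x_5 \\ x_3 + x_4 \end{bmatrix} \]

α = cos (π/4)
β = cos (π/8)
δ = sin (π/8)

Odd frequencies:

\[ \begin{bmatrix} z_1 \\ z_3 \\ z_5 \\ z_7 \end{bmatrix} = \begin{bmatrix} \lambda & \gamma & \mu & \nu \\ \gamma & -\nu & -\lambda & -\mu \\ \mu & -\lambda & \nu & \gamma \\ \nu & -\mu & \gamma & -\lambda \end{bmatrix} \begin{bmatrix} x_0 - x_7 \\ x_1 - x_6 \\ x_2 - x_5 \\ x_3 - x_4 \end{bmatrix} \]

λ = cos (π/16)
γ = cos (3π/16)
μ = sin (3π/16)
ν = sin (π/16)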

Coefficients coding
• Fixed point

Example: n=6

α = 0000000000010110    -α = 1111111111101010
β = 0000000000011101    -β = 1111111111100011
δ = 0000000000001100    -δ = 1111111111110100
λ = 0000000000011111    -λ = 1111111111100001
γ = 0000000000011010    -γ = 1111111111100110
μ = 0000000000010001    -μ = 1111111111101111
ν = 0000000000000110    -ν = 1111111111111010
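The listed 16-bit codes are consistent with multiplying each coefficient by 2^5, truncating, and using two's complement for the negative values. The scale factor is inferred from the codes themselves and is therefore an assumption, as are the names `FRACTION_BITS`, `WORD_BITS` and `encode` in this sketch.

```python
import math

FRACTION_BITS = 5   # assumption: inferred from the listed codes
WORD_BITS = 16

def encode(value):
    """Return the 16-bit two's-complement fixed-point code of `value`."""
    magnitude = int(abs(value) * (1 << FRACTION_BITS))   # truncate toward zero
    if value < 0:
        magnitude = (1 << WORD_BITS) - magnitude          # two's complement
    return format(magnitude & ((1 << WORD_BITS) - 1), "016b")

COEFFICIENTS = {
    "alpha": math.cos(math.pi / 4),
    "beta": math.cos(math.pi / 8),
    "delta": math.sin(math.pi / 8),
    "lambda": math.cos(math.pi / 16),
    "gamma": math.cos(3 * math.pi / 16),
    "mu": math.sin(3 * math.pi / 16),
    "nu": math.sin(math.pi / 16),
}
```

Calling `encode(COEFFICIENTS["alpha"])` and `encode(-COEFFICIENTS["alpha"])` gives the two codes of the α row; the other rows can be checked the same way.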

Implementation:
• ADD and SUB on the first Dnode layer: x(n) + x((N-1)-n) and x(n) - x((N-1)-n)
• Multiply-accumulate operations (MAC) on the second Dnode layer: the sums give z0, z2, z4, z6 and the differences give z1, z3, z5, z7
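The two-layer scheme (sums and differences first, multiply-accumulate second) can be checked against the cosine sum of the transform. The sketch below is an illustration under one explicit assumption: the normalization factors are left out, so it computes the plain sums Σ x(n)·cos((2n+1)kπ/16). The function names `dct8_even_odd` and `dct8_direct` are hypothetical.

```python
import math

def dct8_even_odd(x):
    """8-point DCT sums via the even / odd decomposition (no normalization)."""
    a, b, d = math.cos(math.pi / 4), math.cos(math.pi / 8), math.sin(math.pi / 8)
    l, g = math.cos(math.pi / 16), math.cos(3 * math.pi / 16)
    m, v = math.sin(3 * math.pi / 16), math.sin(math.pi / 16)
    # First layer: ADD and SUB of mirrored samples.
    sums = [x[n] + x[7 - n] for n in range(4)]
    diffs = [x[n] - x[7 - n] for n in range(4)]
    even_rows = [[1, 1, 1, 1], [b, d, -d, -b], [a, -a, -a, a], [d, -b, b, -d]]
    odd_rows = [[l, g, m, v], [g, -v, -l, -m], [m, -l, v, g], [v, -m, g, -l]]
    # Second layer: one multiply-accumulate chain per output coefficient.
    z = [0.0] * 8
    for row in range(4):
        z[2 * row] = sum(c * s for c, s in zip(even_rows[row], sums))
        z[2 * row + 1] = sum(c * t for c, t in zip(odd_rows[row], diffs))
    return z

def dct8_direct(x):
    """Reference: direct evaluation of the same unnormalized cosine sums."""
    return [
        sum(x[n] * math.cos((2 * n + 1) * k * math.pi / 16) for n in range(8))
        for k in range(8)
    ]
```

Both functions agree to rounding error, while the decomposed version needs 4 multiplications per coefficient instead of 8.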

Computing…

Dnode instructions used:
M0: N1:add(fifo,fifo) N2:sub(fifo,fifo)
M1: N1:MAC(in1,fifo) N2:MAC(in2,fifo)
M1: N1:clear N2:clear

Step       First layer reads    First layer outputs     Second layer
t=0 n=0    x0, x7
t=1 n=1    x1, x6               x0 + x7, x0 – x7        MAC (λ, …)
t=2 n=2    x2, x5               x1 + x6, x1 – x6        MAC
t=3 n=3    x3, x4               x2 + x5, x2 – x5        MAC
next                            x3 + x4, x3 – x4        MAC (ν, …)
next                            x3 + x4, x3 – x4        clear

Results
– 2 transforms issued every 5 machine cycles
– «Clear» performed during addition
– 20 cycles for 8 samples

Achievable parallelism on an 8-Dnode structure: Ring-8

Figure: two Dnode1 / Dnode2 layer pairs separated by switches, with their configuration controllers; one pair computes the 1D DCT of the 4 first lines, the other the 1D DCT of the 4 last lines.

Overall performances

Processing of one 8x8 block (elements 0,0 to 7,7):

2 partial transforms                               ⇒ 5 cycles
1 line – 8 partial transforms                      ⇒ 20 cycles
4 lines – 32 partial transforms (layers M0, M1)    ⇒ 80 cycles
8 columns – 64 transforms (layers M2, M3)          ⇒ 80 cycles

2D DCT on 8 points: 160 CYCLES
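The cycle counts quoted for the 2D DCT follow from the rate of 2 partial transforms every 5 cycles. The short derivation below is a sketch of that arithmetic; treating the line pass and the column pass as two consecutive phases of 80 cycles each is an assumption drawn from the 160-cycle total.

```latex
\[
\underbrace{\frac{8}{2} \times 5}_{\text{1 line, 8 partial transforms}} = 20 \text{ cycles},
\qquad
\underbrace{4 \times 20}_{\text{4 lines}} = 80 \text{ cycles},
\qquad
\underbrace{80}_{\text{lines}} + \underbrace{80}_{\text{columns}} = 160 \text{ cycles}
\]
```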

Comparisons: execution time (cycles)

• VLIW: CPU64, TM1000, TI 320C60
• Superscalar: Pentium I, Pentium II, NEC V830

Figure: bar chart of the execution time in cycles for CPU64, TM-1000, 320C62, Ring-8, Ring-64, Pentium I, Pentium II and NEC V830.
