Architectural optimization techniques


Master EAII Sp. RSEE

Camille Diou




Microprocessor basics

[Figure: a CONTROLLER (state machine) drives a DATAPATH over a BUS. The datapath holds the Arithmetic and Logic Unit (ALU), the register file (t1, t2, t3, A, B, C) and tristate components on its inputs/outputs.]

Computation example: S = Ax² + By + C

The CONTROLLER steps the DATAPATH (ALU and registers t1, t2, t3, A, B, C) through one register transfer per cycle:

Cycle  Transfer       ALU  Operands  Result
1      t1 <- x        -    x         -
2      t2 <- y        -    y         -
3      t3 <- A.t1     X    A, t1     A.t1
4      t3 <- t3.t1    X    t3, t1    t3.t1
5      t2 <- B.t2     X    t2, B     B.t2
6      t3 <- t2+t3    +    t3, t2    t2+t3
7      out <- t3+C    +    C, t3     t3+C = S

#CYCLES: 7
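The seven transfers above can be replayed in a few lines of Python. This is only a sketch: the function name `run_datapath` and the dictionary used as a register file are illustrative choices, not part of the original slides.

```python
def run_datapath(x, y, A, B, C):
    """Replay the seven register transfers; return (S, number of cycles)."""
    r = {"t1": 0, "t2": 0, "t3": 0, "out": 0}
    # Each entry is (destination register, function computing its new value).
    program = [
        ("t1", lambda: x),                   # t1 <- x
        ("t2", lambda: y),                   # t2 <- y
        ("t3", lambda: A * r["t1"]),         # t3 <- A.t1
        ("t3", lambda: r["t3"] * r["t1"]),   # t3 <- t3.t1
        ("t2", lambda: B * r["t2"]),         # t2 <- B.t2
        ("t3", lambda: r["t2"] + r["t3"]),   # t3 <- t2+t3
        ("out", lambda: r["t3"] + C),        # out <- t3+C
    ]
    cycles = 0
    for dest, compute in program:
        r[dest] = compute()   # one transfer per cycle through the single ALU
        cycles += 1
    return r["out"], cycles
```

With x = 3, y = 4, A = 2, B = 5, C = 7 the function returns (45, 7), i.e. 2·9 + 5·4 + 7 in seven cycles.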



Execution principle

[Figure: START -> Fetch next instruction (fetch cycle) -> Execute instruction (execute cycle) -> back to fetch, until HALT]


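A single-accumulator machine

MAR: Memory Address Register
IR: Instruction Register
PC: Program Counter register

[Figure: block diagram. The Memory (16 bits wide, 16K words, i.e. a 14-bit address) is addressed through MAR, which is loaded either from PC (with its incrementer and branch input) or from the address operand of IR. The opcode field of IR goes to the FSM, which produces the function controls and the LD signals. ACC feeds ALU input A, memory data feeds ALU input B, and the ALU output S goes back to ACC. Three paths are highlighted: load path, store path and instruction path. Data flow and control signals are drawn differently.]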


Instruction:

bits 15-14: Opcode
bits 13-0:  Address

Opcode:
00: Load
01: Store
10: Add
11: Branch

Single-address instruction: one of the registers is fixed (the accumulator), so AC is an implicit operand.

AC := AC <operation> Memory(Address)
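The field split can be written as two bit operations. A minimal sketch, assuming the 16-bit layout given above (opcode in bits 15-14, address in bits 13-0); the names `OPCODES` and `decode` are illustrative.

```python
OPCODES = {0b00: "Load", 0b01: "Store", 0b10: "Add", 0b11: "Branch"}

def decode(word):
    """Split a 16-bit instruction word into (opcode name, 14-bit address)."""
    opcode = (word >> 14) & 0b11   # bits 15-14
    address = word & 0x3FFF        # bits 13-0
    return OPCODES[opcode], address
```

For the word 1000110100110011 used later in the walkthrough, `decode` gives the Add opcode and the address 00110100110011.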


[Figure: the same single-accumulator machine annotated with bus widths: 2 bits for the opcode sent to the FSM, 14 bits for the address paths (PC, MAR, address operand of IR), 16 bits for the data paths (memory, ACC, ALU inputs A and B, ALU output S).]


Execution of one ADD instruction, step by step:

1. Instruction fetch:
- PC is moved into MAR
- Read from memory
- Load instruction into IR

2. Instruction decode:
- Opcode bits go to the FSM (ADD)
- The rest of the bits is the operand address

[Figure: PC = MAR = 10110100110011; the word read from memory and loaded into IR is 1000110100110011]

3. Operand fetch:
- IR<address> -> MAR
- Read data from memory

4. Instruction execute:
- Memory to ALU B
- AC to ALU A
- ALU add
- S to AC

[Figure: MAR = 00110100110011; operand read from memory = 0011001101110110; AC before = 0101010101110001; AC after = 1000100011100111]

5. Housekeeping:
- Increment PC

[Figure: PC goes from 10110100110011 to 10110100110100]
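The five steps can be sketched as a single function that advances the machine by one instruction. This is an illustration under stated assumptions: memory is modelled as a Python dictionary, Branch is taken to be an unconditional jump (the slides do not say whether it is conditional), and the name `step` is hypothetical.

```python
def step(memory, pc, acc):
    """Run one instruction of the single-accumulator machine.

    Returns the new (pc, acc); `memory` maps 14-bit addresses to 16-bit words.
    """
    # 1. Instruction fetch: PC -> MAR, read memory, load IR
    mar = pc
    ir = memory[mar]
    # 2. Instruction decode: opcode to the FSM, the rest is the operand address
    opcode, address = (ir >> 14) & 0b11, ir & 0x3FFF
    if opcode == 0b11:                 # Branch (assumed unconditional)
        return address, acc
    # 3. Operand fetch: IR<address> -> MAR
    mar = address
    # 4. Instruction execute
    if opcode == 0b00:                 # Load
        acc = memory[mar]
    elif opcode == 0b01:               # Store
        memory[mar] = acc
    else:                              # Add: AC and memory through the ALU
        acc = (acc + memory[mar]) & 0xFFFF
    # 5. Housekeeping: increment PC
    return (pc + 1) & 0x3FFF, acc
```

Loading the values shown in the figures (instruction 1000110100110011 at address 10110100110011, operand 0011001101110110, AC = 0101010101110001) reproduces the final state AC = 1000100011100111 and PC = 10110100110100.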


A simple microprocessor: architecture

[Figure: datapath with a 16x16 register file, an address bus to memory, a data bus to/from memory, and lines to the controller (FSM)]


A simple microprocessor: instruction format

[Figure: instruction word layouts, shown as alternative formats ("or"), one of which includes a shift field]


Instruction format

[Table: instruction set, with columns Instruction and Action]


A simple microprocessor: test program

What will it do?

0000 7C0A ;
0001 8C00 ; LOAD RC, #A
0002 7B04 ; ...
0003 7A0A ; ...
0004 9C7C ; ...
0005 611A ; ...
0006 614B ; ...
...


Compiler dependency detection for ILP

• Detect data dependencies at compile time. Examples:

c[i]=a[i]+b[i];    potential dependency:
d[i]=a[i]+c[j];    c[i] might be c[j]

c[1]=a[i]+b[i];    no dependency:
d[i]=a[i]+c[2];    c[1] is never c[2]
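The two examples reduce to one conservative rule: two array accesses are independent only when both indices are known constants and differ. A minimal sketch of that rule, where indices are either Python integers (constants) or strings (symbolic loop variables); the function name `may_depend` is illustrative and real compilers use far more precise tests.

```python
def may_depend(write_index, read_index):
    """Conservative compile-time test: can c[write_index] alias c[read_index]?"""
    both_constant = isinstance(write_index, int) and isinstance(read_index, int)
    if both_constant:
        return write_index == read_index   # c[1] is never c[2]
    return True                            # c[i] might be c[j]
```

`may_depend("i", "j")` is true (potential dependency), while `may_depend(1, 2)` is false, so the two statements of the second example can be scheduled in parallel.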


Reconfigurable computing: instruction-level parallelism (ILP)

• Superscalar processors must find the dataflow graph at run time
• Reconfigurable architectures construct the dataflow graph at compile time
• No FU limitations
• No control logic overhead
• No window size limitations

Systolic ring2

Reconfigurable computing: instruction-level parallelism (ILP)

• General-purpose computer:

add r1, r2, r4
add r1, r3, r5
sub r3, r2, r6
add r4, r5, r1
add r5, r6, r2

• RC scheme:

[Figure: dataflow graph. The input pairs (r1, r2), (r1, r3) and (r3, r2) feed three first-level operators producing r4, r5 and r6; two second-level operators combine them into r1 and r2.]

Question: what is the advantage of RC over superscalar?

Answer: the dataflow graph is constructed at compile time, so there is no run-time overhead.
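The dataflow graph of the five-instruction example can be derived mechanically, which is what a compiler for a reconfigurable target does ahead of time. The sketch below assigns each instruction the earliest level at which its operands are available; it assumes the last register of each instruction is the destination and only tracks true (read-after-write) dependencies. The name `dataflow_levels` is illustrative.

```python
def dataflow_levels(program):
    """Return the dataflow level of each (op, src1, src2, dest) instruction."""
    producer_level = {}   # register -> level of the instruction that last wrote it
    levels = []
    for op, src1, src2, dest in program:
        # Registers not yet written in this block are external inputs (level -1).
        level = 1 + max(producer_level.get(src1, -1), producer_level.get(src2, -1))
        levels.append(level)
        producer_level[dest] = level
    return levels

PROGRAM = [
    ("add", "r1", "r2", "r4"),
    ("add", "r1", "r3", "r5"),
    ("sub", "r3", "r2", "r6"),
    ("add", "r4", "r5", "r1"),
    ("add", "r5", "r6", "r2"),
]
```

`dataflow_levels(PROGRAM)` gives `[0, 0, 0, 1, 1]`: three operators can work side by side in the first level and two in the second, which matches the two-level graph of the RC scheme.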

Reconfigurable computing: why now?

• Increasing number of transistors
• Complexity and cost of chip design increase fast
• Current computing demands are RC-friendly: desktop and embedded demands are driven NOT by Word or Excel but by multimedia, encryption and filters (dataflow-oriented applications)

RA versus microprocessors

• RA is less flexible (like a VLIW with fixed instructions)

but

• RA provides more (customized) computation elements
• RA can decrease memory traffic
• RA can be tailored for specific algorithms and data types

RA will not replace µP, but complement them

Systolic computing: definition

• A set of simple processing elements with regular and local connections, which takes external inputs and processes them in a predetermined manner in a determined fashion (H.T. Kung)

Systolic computing: characteristics of the best RC design

• Simple PE
• Regular and local interconnect
• Pipeline between PEs
• I/O at boundary

Coarse-grain RA model

In abstract: instructions configure both PE and interconnect every cycle

In reality: the instruction bandwidth / memory required is too high, so…

COMPROMISE

Communications…

Relationship of communication among processors:
• Shared clock (pipelined)
• Shared registers (VLIW)
• Shared memory (SMM)
• Shared network

Reconfigurable computing

[Figure: a program larger than the actual available hardware: some instructions are currently in hardware, the others are paged out]

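Finite impulse response filter (FIR)

$$y(n) = \sum_{i=0}^{N-1} a(i)\, x(n-i-1)$$

[Figure: N-tap structure: x_n goes through a chain of Z^-1 delays, each tap is weighted by a coefficient (a_N, a_N-1, a_N-2, …, a_1, a_0) and the products are summed into y_n]

3-coefficient filter:

$$y(n) = a_0\, x(n-1) + a_1\, x(n-2) + a_2\, x(n-3)$$

[Figure: 3-tap structure with coefficients a_2, a_1, a_0 and three Z^-1 delays]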
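As a reference point for the hardware versions that follow, the FIR sum can be evaluated directly in software. A minimal sketch, assuming samples before the start of the sequence are zero; the function name `fir` is illustrative.

```python
def fir(coeffs, x):
    """Direct evaluation of y(n) = sum_i a(i) * x(n - i - 1)."""
    y = []
    for n in range(len(x)):
        acc = 0
        for i, a in enumerate(coeffs):
            k = n - i - 1
            if k >= 0:            # samples before x(0) are taken as zero
                acc += a * x[k]
        y.append(acc)
    return y
```

With coefficients (a0, a1, a2) = (1, 10, 100) and samples 1, 2, 3, 4 this returns [0, 1, 12, 123]; the last value is a0·x(2) + a1·x(1) + a2·x(0).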

Systolic FIR implementation

[Figure: systolic FIR cell built around a MAC unit]



Optimize outer loop: preload repeated value

Optimize outer loop: broadcast common value

Optimize outer loop: retime to eliminate broadcast


The Systolic Ring

• Coarse-grain architecture
• Multi-mode dynamic reconfiguration
• Scalable, two-dimensional array
• VHDL design
• Designed for SoC integration

Dnode: word-level processing unit

Constitution:
• Optimized datapath (16 bits)
• Register file (4x16 bits)
• Hardwired ALU and multiplier

Features:
• Complex computations in local mode (FIR, IIR, WT…)
• Low silicon area (0.07 mm², 0.18 µm CMOS process)
• Single-cycle operations (e.g. MAC + register load)

[Figure: Dnode made of an ALU + MULT block and a register file, driven by a micro-instruction (µinst.)]

Local controller: dynamic reconfiguration at the Dnode level

Constitution:
• 8 configuration registers
• 3 different run modes
• 1 programming mode

[Figure: Dnode (ALU + MULT, Reg FILE, µinst.) with inputs In(1,2), fifo(1,2), bus, Rp(i,j) (i = 1~4, j = 1~2) and output out. The local controller is built from the configuration registers reg0 to reg7, two multiplexers, a decoder and a controller, with the signals ck, en, ex, wait, inhib and mode.]

[Figure sequence: operation of the local controller (same block diagram as above, one frame per clock cycle)]

Programming mode: instructions 0, 1, 2 and 3 are written in turn into the configuration registers (reg0, then reg1, reg2 and reg3), one per clock cycle.

Run-mode 1: Fixed. Instruction 0 is applied to the Dnode at every cycle.

Run-mode 2: Dynamic (one-time or loop). Instructions 1, 2 and 3 are applied on successive cycles.
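The behaviour shown in the frames can be modelled in a few lines. This is a behavioural sketch only, not the VHDL design: the class name `LocalController` and its method names are hypothetical, instructions are represented as plain strings, and only the two run modes illustrated in the frames are modelled.

```python
class LocalController:
    """Behavioural model of a Dnode local controller with 8 configuration registers."""

    N_REGS = 8

    def __init__(self):
        self.regs = [None] * self.N_REGS

    def program(self, instructions):
        """Programming mode: write one instruction per cycle into reg0, reg1, ..."""
        if len(instructions) > self.N_REGS:
            raise ValueError("a local program holds at most 8 instructions")
        for i, instruction in enumerate(instructions):
            self.regs[i] = instruction

    def run_fixed(self, index, cycles):
        """Run-mode 1 (fixed): the same register drives the Dnode every cycle."""
        return [self.regs[index]] * cycles

    def run_dynamic(self, first, last, cycles, loop=True):
        """Run-mode 2 (dynamic): step through regs[first..last], once or in a loop."""
        sequence = self.regs[first:last + 1]
        if loop:
            return [sequence[c % len(sequence)] for c in range(cycles)]
        return sequence[:cycles]
```

For example, after `program(["clr", "add", "sub", "mac"])`, `run_fixed(0, 3)` issues `clr` three times, while `run_dynamic(1, 3, 5)` issues `add, sub, mac, add, sub`.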

Array structure

• Scalable
• Unidirectional communications between neighbours
• Hard to implement a datapath with a greater pipeline depth than the array
• Hard to implement recursive operations

[Figure: array of configurable blocks (processing units and switches) between INPUTS and OUTPUTS. The main dataflow is unidirectional; the BUS is a shared resource.]

Use of a ring structure

[Figure: the array structure, with its configurable blocks between INPUTS and OUTPUTS, is closed onto itself to form a RING STRUCTURE]

Use of a bi-dataflow structure

[Figure: the ring structure carries a forward dataflow and a reverse dataflow]

Systolic Ring architecture

[Figure: ring of Dnode layers (layer n, layer n+1, …) separated by switches; each switch has I/O ports; the forward dataflow runs around the ring]

Peak performance: 3200 MIPS @ 200 MHz (16-Dnode version)

Systolic Ring architecture: forward dataflow

• No complex data routing problems (crossbars…)
• Unidirectional data transfers between adjacent layers (pipeline)
• Performance increases linearly with the number of Dnodes
• Provides 3200 MIPS @ 200 MHz of computing power for a 16-Dnode realization

Dnode modes:
• Local mode: stand-alone
• Global mode: FPGA-like

Switch components:
• Direct FIFO connection for data injection
• BUS connection for RISC communication
• Full connectivity between 2 Dnode layers

[Figure: ring of Dnode layers (layer n-1, layer n, layer n+1) separated by switches with I/O ports, with the configuration controller at the centre and the forward dataflow running around the ring]

Reverse dataflow: feedback pipelines

• Each switch writes computed data into its own feedback pipeline
• Each switch has read ports on the other switches' pipelines
• Easy implementation of various recursive algorithms (IIR, WT…)

[Figure: ring of Dnode layers and switches; the feedback pipelines carry the reverse dataflow from each switch to the others]

2-level dynamically reconfigurable architecture:

• Global mode (first level)
  - The program which manages the configuration runs on the RISC processor
  - The configuration of an entire cluster can be modified at each clock cycle
  - The operating layer computes the data coming from the host processor

• Local mode (second level)
  - Each Dnode runs its own program of up to 8 instructions

[Figure: the host µP and its RAM send DATA to the OPERATING layer, made of Dnodes (ALU + MULT, Reg FILE, inputs A and B, output S). The CONFIGURATION layer, a configuration controller running the MANAGEMENT CODE from its RAM, sends the CONFIG to the Dnodes.]

8-Dnode version…

• ST* CMOS process, 0.25 µm & 0.18 µm

            Ring-8 (0.25 µm)   Ring-8 (0.18 µm)   Dnode (0.18 µm)
Area        0.9 mm²            0.7 mm²            0.04 mm²
Frequency   150 MHz            200 MHz            200 MHz

• Low Dnode area: possible to realize 128-Dnode versions…
• Suited as an IP core for SoC

*: ST: STMicroelectronics

Features:
• Parameterizable core (number of Dnodes)
• Good performance/cost tradeoff (Ring-8 @ 200 MHz Systolic Ring system):
  - 1600 MIPS (PII @ 450 MHz: 400 MIPS)
  - 3 Gb/s bandwidth
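Both MIPS figures quoted for the ring are consistent with one operation per Dnode per clock cycle. The short check below makes that arithmetic explicit; the one-operation-per-cycle assumption is an inference from the quoted numbers, and the function name `peak_mips` is illustrative.

```python
def peak_mips(n_dnodes, clock_mhz, ops_per_cycle=1):
    """Peak throughput in MIPS, assuming each Dnode issues ops_per_cycle per clock."""
    return n_dnodes * clock_mhz * ops_per_cycle
```

`peak_mips(8, 200)` gives 1600 (Ring-8) and `peak_mips(16, 200)` gives 3200 (16-Dnode version), in line with the linear scaling with the number of Dnodes claimed for the architecture.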

Assembly-level programming

Addr  RISC instructions   Layer selection   Dnodes instructions
0000  r:ldl(0,8)          M1:               N1:clr               N2:clr
0001  r:ldl(1,2)          M2:               N1:clr               N2:clr
0002  r:dec(0,0)          M1:               N1:add(fifo1,fifo1)  N2:sub(fifo1,fifo1)
0003  r:jnz(1)            M2:               N1:mac(in1)          N2:mac(in2)
0004  r: halt

[Figure: tool flow linking the assembler, its output files (File1.bin, File2.m), the simulator and testbench, and the prototype (RAM + FPGA implementing the Ring-8)]

Image source: http://www.testequipmentdepot.com/hp/oscilloscopes/picpages/hp54622dpic.htm

FIR filter: edge detection

Convolution mask: [ -1 1 0 ], i.e. y_n = x_n - x_(n-1)

Assembly code:

0000 r:ldl(0,1) M1: N1:rst N2:rst
0001 r:jmp(0)   M1: N2:sub(fifo,fifo)

[Figure: input image, output image and timing diagrams obtained with the assembler, the simulator/testbench and the Ring-8]
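The effect of the mask on one image row is easy to check in software. A minimal sketch of y_n = x_n - x_(n-1), assuming the sample before the first pixel is zero; the function name `edge_detect` is illustrative.

```python
def edge_detect(row):
    """Apply y_n = x_n - x_(n-1) along one row of pixels."""
    out = []
    previous = 0   # assumption: the sample before the first pixel is 0
    for pixel in row:
        out.append(pixel - previous)
        previous = pixel
    return out
```

On the row 10, 10, 50, 50, 10 the output is 10, 0, 40, 0, -40: flat areas give zero and each intensity step gives a peak whose sign indicates the direction of the edge.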

Polynomial calculus

• P(x) = a.x + b.x² + c.x³

Micro-program run by one Dnode (ALU + MULT):

x -> reg0                /* load reg0,x */
x.x -> reg1              /* load reg1,x² */
reg0.reg1 -> reg2        /* load reg2,x³ */
a.reg0 -> ACC            /* load ACC,a.x */
b.reg1 + ACC -> ACC      /* load ACC,a.x+b.x² */
c.reg2 + ACC -> ACC      /* load ACC,a.x+b.x²+c.x³ */

[Figure sequence: contents of the registers and of the ALU + MULT inputs and output at each step: x, then x², x³, a.x, a.x+b.x² and finally a.x+b.x²+c.x³]
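The six micro-operations translate one-to-one into software, which gives a quick way to check the schedule. A sketch only: the function name `poly_dnode` is illustrative and Python variables stand in for reg0, reg1, reg2 and ACC.

```python
def poly_dnode(x, a, b, c):
    """Evaluate P(x) = a.x + b.x^2 + c.x^3 with the Dnode micro-program."""
    reg0 = x                  # load reg0, x
    reg1 = x * x              # load reg1, x^2
    reg2 = reg0 * reg1        # load reg2, x^3
    acc = a * reg0            # load ACC, a.x
    acc = b * reg1 + acc      # MAC: ACC = a.x + b.x^2
    acc = c * reg2 + acc      # MAC: ACC = a.x + b.x^2 + c.x^3
    return acc
```

`poly_dnode(2, 1, 3, 5)` returns 54, i.e. 1·2 + 3·4 + 5·8.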

FIR filter

$$y(n) = \sum_{i=0}^{N-1} a_i\, x(n-i-1)$$

[Figure: N-tap structure: x_n goes through a chain of Z^-1 delays, the taps are weighted by a_0, a_1, a_2, …, a_N-1, a_N and summed into y_n]

3-coefficient filter:

$$y(n) = a_0\, x(n-1) + a_1\, x(n-2) + a_2\, x(n-3)$$

[Figure: 3-tap structure with coefficients a_2, a_1, a_0 and three Z^-1 delays]

FIR implementation

Use of a 3 Dnodes / layer architecture:
• Pipelined implementation
• Samples are injected through dedicated lines
• Coefficients (a2, a1, a0) are loaded during the first cycle

At each cycle the same sample is sent to the three Dnodes, and each partial sum moves to the next Dnode through the feedback path. The table gives the value computed by each Dnode during the cycle; it is available on its output at the next cycle.

Cycle  Sample  Dnode 1 (a2)  Dnode 2 (a1, MAC)  Dnode 3 (a0, MAC)
1      x0      a2.x0         -                  -
2      x1      a2.x1         a2.x0+a1.x1        -
3      x2      a2.x2         a2.x1+a1.x2        a2.x0+a1.x1+a0.x2
4      x3      a2.x3         a2.x2+a1.x3        a2.x1+a1.x2+a0.x3
5      x4      a2.x4         a2.x3+a1.x4        a2.x2+a1.x3+a0.x4

OPTIMAL IMPLEMENTATION: 1 SAMPLE / CYCLE
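The cycle table can be reproduced by a small simulation of the three pipelined Dnodes. A behavioural sketch, assuming the partial-sum registers start at zero (so the first two outputs are start-up values built from incomplete sums); the function name `systolic_fir3` is illustrative.

```python
def systolic_fir3(a0, a1, a2, samples):
    """Simulate three pipelined Dnodes; return the value Dnode 3 computes each cycle."""
    s1 = 0   # registered output of Dnode 1 (a2 . x)
    s2 = 0   # registered output of Dnode 2 (partial sum)
    outputs = []
    for x in samples:              # one sample per cycle, sent to all three Dnodes
        s3 = s2 + a0 * x           # Dnode 3: MAC on the partial sum from Dnode 2
        next_s2 = s1 + a1 * x      # Dnode 2: MAC on the product from Dnode 1
        next_s1 = a2 * x           # Dnode 1: multiply only
        outputs.append(s3)
        s1, s2 = next_s1, next_s2  # all registers update on the clock edge
    return outputs
```

With (a0, a1, a2) = (1, 10, 100) and samples 1, 2, 3, 4 the outputs are 1, 12, 123, 234. From the third cycle on, each output is a complete a2.x(k-2) + a1.x(k-1) + a0.x(k), one per cycle.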

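6-coefficient filter

$$y(n) = a_0\,x(n-1) + a_1\,x(n-2) + a_2\,x(n-3) + a_3\,x(n-4) + a_4\,x(n-5) + a_5\,x(n-6)$$

[Figure: six Dnodes holding a5, a4, a3, a2, a1 and a0 (MAC units from a4 onwards) spread over successive layers; x_n is injected, y_n comes out, and the partial sum travels through the inter-layer feedback]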

Discrete Cosine Transform

• Usually a two-dimensional 8x8-point DCT
• Very demanding algorithm…

[Figure: coding chain: original image -> DCT -> DCT coefficients -> quantization -> quantized coefficients (compressed image). Decoding chain: compressed image -> inverse quantization -> iDCT -> decompressed image.]

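DCT algorithm

Direct transform:

$$z_k = \sqrt{\frac{2}{N}}\;\alpha(k) \sum_{n=0}^{N-1} x_n \cos\!\left(\frac{(2n+1)\,k\,\pi}{2N}\right), \qquad k = 0, 1, \dots, N-1$$

Inverse transform:

$$x_n = \sqrt{\frac{2}{N}} \sum_{k=0}^{N-1} \alpha(k)\, z_k \cos\!\left(\frac{(2n+1)\,k\,\pi}{2N}\right), \qquad n = 0, 1, \dots, N-1$$

with

$$\alpha(k) = \begin{cases} 1/\sqrt{2} & \text{for } k = 0 \\ 1 & \text{otherwise} \end{cases}$$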

Image

• 64x64 points
• 8x8-pixel blocks
• 16-bit coded image

[Figure: initial 64x64 image divided into 64 blocks of 8x8 pixels; each block is a matrix of samples x(0,0) … x(7,7)]

Implementation

• Matrix implementation
• Even / odd frequency decomposition of the DCT algorithm

Even coefficients:

$$\begin{bmatrix} z_0 \\ z_2 \\ z_4 \\ z_6 \end{bmatrix} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ \beta & \delta & -\delta & -\beta \\ \alpha & -\alpha & -\alpha & \alpha \\ \delta & -\beta & \beta & -\delta \end{bmatrix} \begin{bmatrix} x_0 + x_7 \\ x_1 + x_6 \\ x_2 + x_5 \\ x_3 + x_4 \end{bmatrix}$$

with α = cos(π/4), β = cos(π/8), δ = sin(π/8)

Odd coefficients:

$$\begin{bmatrix} z_1 \\ z_3 \\ z_5 \\ z_7 \end{bmatrix} = \begin{bmatrix} \lambda & \gamma & \mu & \nu \\ \gamma & -\nu & -\lambda & -\mu \\ \mu & -\lambda & \nu & \gamma \\ \nu & -\mu & \gamma & -\lambda \end{bmatrix} \begin{bmatrix} x_0 - x_7 \\ x_1 - x_6 \\ x_2 - x_5 \\ x_3 - x_4 \end{bmatrix}$$

with λ = cos(π/16), γ = cos(3π/16), μ = sin(3π/16), ν = sin(π/16)

In matrix form:

$$z = \sqrt{\frac{2}{N}}\; T(N)\, x$$
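The even/odd decomposition can be checked numerically against the cosine sum it comes from. The sketch below works on the unnormalized 8-point transform (it leaves out the global scale factor and α(k)), builds the two 4x4 matrices from the constants α, β, δ, λ, γ, μ, ν defined for this decomposition, and compares with a direct evaluation; the function names are illustrative.

```python
import math

def dct8_even_odd(x):
    """Unnormalized 8-point DCT through the even/odd 4x4 matrix products."""
    a = math.cos(math.pi / 4)
    b, d = math.cos(math.pi / 8), math.sin(math.pi / 8)
    l, g = math.cos(math.pi / 16), math.cos(3 * math.pi / 16)
    m, v = math.sin(3 * math.pi / 16), math.sin(math.pi / 16)
    even = [[1, 1, 1, 1], [b, d, -d, -b], [a, -a, -a, a], [d, -b, b, -d]]
    odd = [[l, g, m, v], [g, -v, -l, -m], [m, -l, v, g], [v, -m, g, -l]]
    sums = [x[n] + x[7 - n] for n in range(4)]    # first layer: additions
    diffs = [x[n] - x[7 - n] for n in range(4)]   # first layer: subtractions
    z = [0.0] * 8
    for r in range(4):                            # second layer: MAC operations
        z[2 * r] = sum(c * s for c, s in zip(even[r], sums))
        z[2 * r + 1] = sum(c * t for c, t in zip(odd[r], diffs))
    return z

def dct8_direct(x):
    """Reference: z_k = sum_n x_n cos((2n+1) k pi / 16), without normalization."""
    return [sum(x[n] * math.cos((2 * n + 1) * k * math.pi / 16) for n in range(8))
            for k in range(8)]
```

Each output now needs 4 multiply-accumulate operations instead of 8, at the cost of 8 additions/subtractions shared by all outputs, which is what makes the two-layer mapping (ADD/SUB layer, then MAC layer) attractive.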

Coefficient coding

• Fixed point (16-bit two's complement words)

$$z_k = \sqrt{\frac{2}{N}}\;\alpha(k) \sum_{n=0}^{N-1} x_n \cos\!\left(\frac{(2n+1)\,k\,\pi}{2N}\right)$$

Example: n = 6

Coefficient   Positive value       Negative value
α             0000000000010110     1111111111101010
β             0000000000011101     1111111111100011
δ             0000000000001100     1111111111110100
λ             0000000000011111     1111111111100001
γ             0000000000011010     1111111111100110
μ             0000000000010001     1111111111101111
ν             0000000000000110     1111111111111010
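The binary words in the table correspond to the cosine values scaled by 32 and truncated, then stored as 16-bit two's complement. The scale factor (5 fractional bits) is inferred from the listed values rather than stated on the slide, so the sketch below should be read with that assumption in mind; the names `FRACTION_BITS` and `to_fixed16` are illustrative.

```python
import math

FRACTION_BITS = 5   # assumption: inferred from the coded values, not stated in the source

def to_fixed16(value):
    """Return the 16-bit two's complement coding of value * 2**FRACTION_BITS (truncated)."""
    scaled = int(value * (1 << FRACTION_BITS))   # int() truncates toward zero
    return format(scaled & 0xFFFF, "016b")       # wrap negatives to two's complement
```

For instance `to_fixed16(math.cos(math.pi / 4))` gives `0000000000010110` (22) and `to_fixed16(-math.cos(math.pi / 4))` gives `1111111111101010` (-22).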

Implementation:
• ADD and SUB on the first Dnode layer
• Multiply-accumulate operations (MAC) on the second Dnode layer

[Figure: two layers of two Dnodes with their configuration. First layer: Dnode1 computes x_n + x_(N-1)-n and Dnode2 computes x_n - x_(N-1)-n. Second layer: two MAC Dnodes receive these sums and differences together with the coefficients; one produces z0, z2, z4, z6 and the other z1, z3, z5, z7.]

Computing…

[Figure sequence: the two-layer structure computing z0 (first row of the even matrix) and z1 (first row of the odd matrix), one frame per cycle]

t=0, n=0   M0: N1:add(fifo,fifo) N2:sub(fifo,fifo)
           Layer 1 reads x0 and x7.
t=1, n=1   M1: N1:MAC(in1,fifo) N2:MAC(in2,fifo)
           Layer 1 outputs x0+x7 and x0-x7 and reads x1 and x6; layer 2 accumulates them with coefficients 1 and λ.
t=2, n=2   Layer 1 outputs x1+x6 and x1-x6 and reads x2 and x5; layer 2 coefficients: 1 and γ.
t=3, n=3   Layer 1 outputs x2+x5 and x2-x5 and reads x3 and x4; layer 2 coefficients: 1 and μ.
Next cycle Layer 1 outputs x3+x4 and x3-x4; layer 2 coefficients: 1 and ν.
Last cycle M1: N1:clear N2:clear
           Layer 2 delivers z0 and z1 and its accumulators are cleared.

Results
– 2 transforms issued every 5 machine cycles
– "Clear" performed during addition
– 20 cycles for 8 samples

Achievable parallelism on an 8-Dnode structure: Ring-8

[Figure: the four layers M0, M1, M2 and M3, each with two Dnodes (Dnode1, Dnode2), separated by switches and their configuration. One pair of layers computes the 1D DCT of the 4 first lines, the other pair the 1D DCT of the 4 last lines.]

Overall performances

[Figure: 8x8 block of transformed coefficients z'(0,0) … z'(7,7), filled progressively]

• 2 partial transforms ⇒ 5 cycles
• 1 line, 8 partial transforms ⇒ 20 cycles
• 4 lines, 32 partial transforms ⇒ 80 cycles (layers M0, M1)
• 8 columns, 64 transforms ⇒ 80 cycles (layers M2, M3)

2D DCT on 8 points: 160 CYCLES
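The 160-cycle total follows from the per-step figures. The sketch below spells out the arithmetic, assuming (as the Ring-8 mapping suggests) that two pairs of layers work in parallel, each on half of the lines, and that the column pass costs the same as the line pass; the function name `dct2d_cycles` and its parameters are illustrative.

```python
def dct2d_cycles(block=8, cycles_per_issue=5, coeffs_per_issue=2, parallel_pairs=2):
    """Cycle count of an 8x8 2D DCT on the Ring-8, from the per-step figures."""
    # One line of `block` coefficients: 2 coefficients every 5 cycles -> 20 cycles
    cycles_per_line = (block // coeffs_per_issue) * cycles_per_issue
    # Each pair of layers handles block / parallel_pairs lines -> 80 cycles per pass
    cycles_per_pass = (block // parallel_pairs) * cycles_per_line
    # Row pass followed by column pass -> 160 cycles
    return 2 * cycles_per_pass
```

With the default parameters the function returns 160.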


Comparisons: execution time (cycles)

VLIW: CPU64, TM1000, TI 320C60
Superscalar: Pentium I, Pentium II, NEC V830

[Figure: bar chart of the number of cycles (axis from 0 to 400) for CPU64, TM-1000, 320C62, Ring-8, Ring-64, Pentium I, Pentium II and NEC V830]
