techniques d’optimisation architecturale

109
1 DIOU Camille Master EAII Sp. RSEE Camille Diou [email protected] Techniques d’optimisation architecturale

Upload: sameermdani

Post on 08-Apr-2016

TRANSCRIPT

Page 1: Techniques d’optimisation architecturale

1DIOUCamille

Master EAII Sp. RSEE

Camille Diou

[email protected]

Techniques d’optimisation architecturale (Architectural optimization techniques)

Page 2: Techniques d’optimisation architecturale

[Figure: basic microprocessor organization. A CONTROLLER (state machine) drives a DATAPATH over a BUS. The datapath contains an Arithmetic and Logic Unit (ALU), a register file (t1, t2, t3, A, B, C) and tristate components (inputs/outputs).]

1. Microprocessor basics

Page 3: Techniques d’optimisation architecturale

Computation example: S = Ax² + By + C

The CONTROLLER issues one register transfer per cycle to the DATAPATH (ALU plus registers t1, t2, t3, A, B, C):

t1 <- x

t2 <- y

t3 <- A.t1

t3 <- t3.t1

t2 <- B.t2

t3 <- t2+t3

out <- t3+C

Microprocessor basics1
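The register-transfer sequence above can be replayed in software. The sketch below is only an illustration: the function name `run_microprogram` and the dictionary used as a register file are assumptions, not part of the slides.

```python
def run_microprogram(x, y, A, B, C):
    """Replay the seven register transfers, one per clock cycle."""
    regs = {"t1": 0, "t2": 0, "t3": 0, "A": A, "B": B, "C": C}
    program = [
        ("t1", lambda r: x),                    # t1 <- x
        ("t2", lambda r: y),                    # t2 <- y
        ("t3", lambda r: r["A"] * r["t1"]),     # t3 <- A.t1
        ("t3", lambda r: r["t3"] * r["t1"]),    # t3 <- t3.t1
        ("t2", lambda r: r["B"] * r["t2"]),     # t2 <- B.t2
        ("t3", lambda r: r["t2"] + r["t3"]),    # t3 <- t2+t3
        ("out", lambda r: r["t3"] + r["C"]),    # out <- t3+C
    ]
    cycles = 0
    for dest, operation in program:
        regs[dest] = operation(regs)   # the single ALU is used once per cycle
        cycles += 1
    return regs["out"], cycles


result, cycles = run_microprogram(x=3, y=4, A=2, B=5, C=7)
print(result, cycles)
```

With a single ALU, the seven transfers take seven cycles, whatever the operand values.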

Page 4: Techniques d’optimisation architecturale

Computation example: S = Ax² + By + C (pages 4 to 10 animate the same slide, one clock cycle per page)

#CYCLES   Register transfer   ALU operation   ALU inputs   ALU output
1         t1 <- x             (load)          x
2         t2 <- y             (load)          y
3         t3 <- A.t1          X               A, t1        A.t1
4         t3 <- t3.t1         X               t3, t1       t3.t1
5         t2 <- B.t2          X               t2, B        B.t2
6         t3 <- t2+t3         +               t3, t2       t2+t3
7         out <- t3+C         +               C, t3        t3+C = S

Page 11: Techniques d’optimisation architecturale

Execution principle

[Figure: START -> Fetch next instruction (fetch cycle) -> Execute instruction (execute cycle) -> HALT, with a loop from the execute cycle back to the fetch cycle.]

Page 12: Techniques d’optimisation architecturale

A single-accumulator machine

MAR: Memory Address Register

IR: Instruction Register

PC: Program Counter register

[Figure: block diagram. Data flow: memory (16 bits wide, labeled 16M words; note that a 14-bit address field reaches 2^14 = 16K words), MAR, PC with incrementer (incr), IR split into opcode and address operand, accumulator ACC, ALU with inputs A, B and output S. Paths: load path, store path, instruction path, branch. Control signals: the FSM receives the opcode and drives the LD and function controls.]

Page 13: Techniques d’optimisation architecturale

Instruction (16 bits):

bits 15 to 14: Opcode

bits 13 to 0: Address

Opcode:

00: Load

01: Store

10: Add

11: Branch

Single-address instruction: one of the registers is fixed (the accumulator), so AC is an implicit operand.

AC:= AC <operation> Memory(Address)

Microprocessor basics1
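The field layout above (2-bit opcode in bits 15 to 14, 14-bit address in bits 13 to 0) can be decoded with two shifts and masks. This is a minimal sketch; the names `OPCODES` and `decode` are illustrative assumptions.

```python
# Opcode table taken from the slide; the names are illustrative.
OPCODES = {0b00: "Load", 0b01: "Store", 0b10: "Add", 0b11: "Branch"}


def decode(instruction):
    """Split a 16-bit instruction word into (operation, address)."""
    opcode = (instruction >> 14) & 0b11   # bits 15..14
    address = instruction & 0x3FFF        # bits 13..0
    return OPCODES[opcode], address


operation, address = decode(0b1000110100110011)
print(operation, format(address, "014b"))
```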

Page 14: Techniques d’optimisation architecturale

MAR: Memory Address Register

IR: Instruction Register

PC: Program Counter register

[Figure: the single-accumulator machine annotated with bus widths: 2 bits for the opcode sent to the FSM, 14 bits on the address paths (PC, MAR, address operand), 16 bits on the data paths (memory, ALU, ACC). Memory is 16 bits wide, labeled 16M words.]

Page 15: Techniques d’optimisation architecturale

1. Instruction fetch:

- PC is moved into MAR

- Read from memory

- Load instruction into IR

2. Instruction decode:

- Opcode bits to FSM (ADD)

- The rest of the bits is the operand address

Values shown in the figure: PC = MAR = 10110100110011; instruction read from memory and loaded into IR = 1000110100110011 (opcode 10, address 00110100110011).

Page 16: Techniques d’optimisation architecturale

3. Operand fetch:

- Address field of IR -> MAR

- Read data from memory

4. Instruction execute:

- Memory to ALU input B

- AC to ALU input A

- ALU add

- S to AC

Values shown in the figure: PC = 10110100110011; IR = 1000110100110011; MAR = 00110100110011; memory operand = 0101010101110001; AC = 0011001101110110; ALU output S = 1000100011100111, written back to AC.

Page 17: Techniques d’optimisation architecturale

5. Housekeeping:

- Increment PC

Values shown in the figure: PC = 10110100110100 after the increment; IR = 1000110100110011; AC = 1000100011100111.

Microprocessor basics1
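The five phases (fetch, decode, operand fetch, execute, housekeeping) can be sketched as one function that advances the machine by one instruction. The code is an illustration under stated assumptions: memory is modelled as a Python dictionary, `step` is a hypothetical name, Branch is treated as an unconditional jump, and the adder wraps on 16 bits; none of these details is given by the slides.

```python
ADDR_MASK = 0x3FFF   # 14-bit address field
WORD_MASK = 0xFFFF   # 16-bit data words


def step(memory, pc, acc):
    """Run one instruction of the single-accumulator machine; return (pc, acc)."""
    # 1. Instruction fetch: PC -> MAR, read memory, load IR
    mar = pc
    ir = memory[mar]
    # 2. Instruction decode: opcode to the FSM, remaining bits are the operand address
    opcode = ir >> 14
    address = ir & ADDR_MASK
    # 3. and 4. Operand fetch and execute
    if opcode == 0b00:                                   # Load
        acc = memory[address]
    elif opcode == 0b01:                                 # Store
        memory[address] = acc
    elif opcode == 0b10:                                 # Add
        acc = (acc + memory[address]) & WORD_MASK        # assumed 16-bit wrap-around
    else:                                                # Branch (assumed unconditional)
        return address, acc
    # 5. Housekeeping: increment PC
    return (pc + 1) & ADDR_MASK, acc


memory = {
    0b10110100110011: 0b1000110100110011,   # Add instruction from the slides
    0b00110100110011: 0b0101010101110001,   # operand from the slides
}
pc, acc = step(memory, pc=0b10110100110011, acc=0b0011001101110110)
print(format(pc, "014b"), format(acc, "016b"))
```

Run on the bit patterns of the slides, the sketch reproduces the accumulator and program-counter values shown in the figures.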

Page 18: Techniques d’optimisation architecturale

A simple microprocessor: architecture

[Figure: datapath with 16 x 16 registers; address to memory; data to/from memory; two connections to the controller (FSM).]

Page 19: Techniques d’optimisation architecturale

A simple microprocessor: instruction format

[Pages 19 to 21: instruction-format tables not recovered. Surviving labels: shift, or; surviving column headings: Instruction format, Instruction, Action.]

Page 22: Techniques d’optimisation architecturale

A simple microprocessor: test program. What will it do?

0000 7C0A ;

0001 8C00 ; LOAD RC, #A

0002 7B04 ; ...

0003 7A0A ; ...

0004 9C7C ; ...

0005 611A ; ...

0006 614B ; ...

...

Page 23: Techniques d’optimisation architecturale

Compiler dependency detection for ILP

• Detect data dependencies at compile time. Examples:

c[i]=a[i]+b[i];
d[i]=a[i]+c[j];
Potential dependency: c[i] might be c[j].

c[1]=a[i]+b[i];
d[i]=a[i]+c[2];
No dependency: c[1] is never c[2].

Microprocessor basics1
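The two examples above amount to a conservative aliasing test on array subscripts. The following sketch illustrates the idea only; the function `may_alias` and the convention of passing constant subscripts as integers and symbolic ones as strings are assumptions made for this example, far simpler than what a real compiler does.

```python
def may_alias(index_a, index_b):
    """Conservative compile-time test: can two subscripts of the same array be equal?

    Constant subscripts are ints; symbolic subscripts (loop variables) are strings.
    """
    if isinstance(index_a, int) and isinstance(index_b, int):
        return index_a == index_b      # both known: the answer is exact
    return True                        # at least one is symbolic: assume they may collide


print(may_alias("i", "j"))   # c[i] written, c[j] read: potential dependency
print(may_alias(1, 2))       # c[1] written, c[2] read: no dependency
```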

Page 24: Techniques d’optimisation architecturale

Reconfigurable computing: instruction-level parallelism (ILP)

• Superscalar processors must find the dataflow graph at run time

• Reconfigurable architectures construct the dataflow graph at compile time

• No FU limitations

• No control logic overhead

• No window size limitations

2. Systolic ring

Page 25: Techniques d’optimisation architecturale

Reconfigurable computing: instruction-level parallelism (ILP)

• General-purpose computer:

add r1, r2, r4
add r1, r3, r5
sub r3, r2, r6
add r4, r5, r1
add r5, r6, r2

• RC scheme:

[Figure: dataflow graph. Three operators take the register pairs (r1, r2), (r1, r3) and (r3, r2) and produce r4, r5, r6; two further operators combine these results into r1 and r2.]

Question: what is the advantage of RC over superscalar?

Answer: the dataflow graph is constructed at compile time, so there is no run-time overhead.

Systolic ring2
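The dataflow graph of the five-instruction example can be derived mechanically by tracking which instruction last wrote each register. The sketch below assumes that the destination is the last operand (the slide does not say so explicitly) and tracks only true read-after-write dependencies; `PROGRAM` and `schedule_levels` are illustrative names.

```python
# (operation, source 1, source 2, destination); destination-last is an assumption
PROGRAM = [
    ("add", "r1", "r2", "r4"),
    ("add", "r1", "r3", "r5"),
    ("sub", "r3", "r2", "r6"),
    ("add", "r4", "r5", "r1"),
    ("add", "r5", "r6", "r2"),
]


def schedule_levels(program):
    """Return, for each instruction, its depth in the dataflow graph."""
    last_writer = {}   # register name -> index of the instruction that last wrote it
    levels = []
    for index, (_, src1, src2, dest) in enumerate(program):
        producers = [last_writer[s] for s in (src1, src2) if s in last_writer]
        # An instruction sits one level below its deepest producer.
        levels.append(1 + max((levels[p] for p in producers), default=-1))
        last_writer[dest] = index
    return levels


print(schedule_levels(PROGRAM))
```

Instructions that share a level have no dependency between them and can be mapped onto operators working in parallel.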

Page 26: Techniques d’optimisation architecturale

Reconfigurable computing: why now?

• Increasing number of transistors

• Complexity and cost of chip design increase fast

• Current computing demands are RC friendly:

Desktop and embedded demand is driven NOT by Word or Excel but by multimedia, encryption and filters (dataflow-oriented applications)

Page 27: Techniques d’optimisation architecturale

RA versus microprocessors

• RA is less flexible (like a VLIW with fixed instructions)

but

• RA provides more (customized) computation elements

• RA can decrease memory traffic

• RA can be tailored for specific algorithms and data types

RA will not replace µP, but complement them

Page 28: Techniques d’optimisation architecturale

Systolic computing: definition

• A set of simple processing elements with regular and local connections, which takes external inputs and processes them in a predetermined manner

H.T. Kung

Page 29: Techniques d’optimisation architecturale

Systolic computing: characteristics of the best RC design

• Simple PE

• Regular and local interconnect

• Pipeline between PEs

• I/O at boundary

Page 30: Techniques d’optimisation architecturale

Coarse grain RA model

In abstract: instructions configure both PE and interconnect every cycle

In reality: the required instruction bandwidth / memory is too high, so…

COMPROMISE

Page 31: Techniques d’optimisation architecturale

Communications…

Relationship of communication among processors:

• Shared clock (pipelined)

• Shared registers (VLIW)

• Shared memory (SMM)

• Shared network

Page 32: Techniques d’optimisation architecturale

Reconfigurable computing

[Figure: a program whose instructions are partly "currently in hardware" and partly "paged out", mapped onto the actual available hardware.]

Page 33: Techniques d’optimisation architecturale

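Finite impulse response filter (FIR)

$$y(n) = \sum_{i=0}^{N-1} a(i)\, x(n-i-1)$$

[Figure: tapped delay line. The input x_n goes through a chain of Z^{-1} delays; the taps are weighted by a_0, a_1, ..., a_{N-2}, a_{N-1}, a_N and summed to give y_n.]

3-coefficient filter:

$$y(n) = a_0\, x(n-1) + a_1\, x(n-2) + a_2\, x(n-3)$$

[Figure: the same structure with three delays and coefficients a_0, a_1, a_2.]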

Systolic ring2
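As a software reference for the hardware mappings that follow, the FIR sum can be evaluated directly. This is a minimal sketch assuming the form y(n) = sum of a(i).x(n-i-1) for i = 0..N-1 and zero input before the first sample; the function name `fir` is illustrative.

```python
def fir(coeffs, samples):
    """Direct evaluation of y(n) = sum_{i=0}^{N-1} a(i) * x(n-i-1).

    Samples before x(0) are taken as zero (an assumption for the start-up).
    """
    outputs = []
    for n in range(len(samples)):
        acc = 0
        for i, a in enumerate(coeffs):
            k = n - i - 1            # index of the delayed sample
            if k >= 0:
                acc += a * samples[k]
        outputs.append(acc)
    return outputs


print(fir([1, 2, 3], [1, 2, 3, 4]))
```

Each output costs N multiply-accumulate operations when computed sequentially, which is what the systolic mapping spreads over N processing elements.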

Page 34: Techniques d’optimisation architecturale

Systolic FIR implementation

[Pages 34 to 46: sequence of figures not recovered. Surviving captions:]

• Page 34: MAC unit

• Page 41: Optimize outer loop, preload repeated value

• Page 42: Optimize outer loop, broadcast common value

• Page 43: Optimize outer loop, retime to eliminate broadcast

Page 47: Techniques d’optimisation architecturale

The Systolic Ring

• Coarse-grain architecture

• Multi-mode dynamic reconfiguration

• Scalable, two-dimensional array

• VHDL design

• Designed for SoC integration

Page 48: Techniques d’optimisation architecturale

Dnode: word-level processing unit

Constitution

• Optimized datapath (16 bits)

• Register file (4 x 16 bits)

• Hardwired ALU and multiplier

Features

• Complex computations in local mode (FIR, IIR, WT…)

• Low silicon area (0.07 mm², 0.18 µm CMOS process)

• Single-cycle operations (e.g. MAC + register load)

[Figure: Dnode made of an ALU + MULT block and a register file, driven by a micro-instruction (µinst.).]

Page 49: Techniques d’optimisation architecturale

Local controller: dynamic reconfiguration at the Dnode level

Constitution

• 8 configuration registers

• 3 different run modes

• 1 programming mode

[Figure: Dnode (ALU + MULT, register file, output "out") with operand sources In(1,2), fifo(1,2), bus and Rp(i,j) (i = 1~4, j = 1~2). Its micro-instruction (µinst.) comes through multiplexers from configuration registers reg0 to reg7. Other labels: decoder (en, ex), controller, mode, inhib, wait, ck.]

Page 50: Techniques d’optimisation architecturale

Programming mode

[Pages 50 to 54 animate the local-controller figure of the previous slide, one clock (clk) per page. The successive frames are labeled Instruction 0, Instruction 1, Instruction 2 and Instruction 3, with the registers reg1, reg2 and reg3 highlighted in turn.]

Page 55: Techniques d’optimisation architecturale

Run-mode 1: Fixed

[Pages 55 to 58 animate the local-controller figure. Every frame is labeled Instruction 0.]

Run-mode 2: Dynamic (one-time or loop)

[Pages 59 to 61 animate the same figure. The successive frames are labeled Instruction 1, Instruction 2 and Instruction 3.]

Systolic ring2
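The behaviour suggested by the frames above (a small instruction store that is first programmed, then read either at a fixed slot or in sequence) can be modelled in a few lines. This is only a behavioural sketch: the class `LocalController`, its method names and the exact sequencing are assumptions, and the third run mode, which the surviving frames do not show, is not modelled.

```python
class LocalController:
    """Behavioural model of a Dnode local controller with 8 configuration registers."""

    def __init__(self, size=8):
        self.size = size
        self.registers = [None] * size
        self.length = 0                      # number of programmed instructions

    def program(self, instructions):
        """Programming mode: store the micro-instructions in successive registers."""
        if len(instructions) > self.size:
            raise ValueError("at most %d instructions fit" % self.size)
        for slot, instruction in enumerate(instructions):
            self.registers[slot] = instruction
        self.length = len(instructions)

    def run_fixed(self, cycles, slot=0):
        """Run-mode 'Fixed': the same register is issued every cycle."""
        return [self.registers[slot]] * cycles

    def run_dynamic(self, cycles, loop=True):
        """Run-mode 'Dynamic': registers are issued in sequence, once or in a loop."""
        if self.length == 0:
            return []
        issued = []
        for cycle in range(cycles):
            if loop:
                issued.append(self.registers[cycle % self.length])
            elif cycle < self.length:
                issued.append(self.registers[cycle])
        return issued


controller = LocalController()
controller.program(["instr0", "instr1", "instr2", "instr3"])
print(controller.run_fixed(3))
print(controller.run_dynamic(6))
```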

Page 62: Techniques d’optimisation architecturale

Array structure

• Scalable

• Unidirectional communications between neighbours

• Hard to implement a datapath with greater pipeline depth than the array

• Hard to implement recursive operations

[Figure: array of configurable blocks (processing units) separated by switches, between INPUTS and OUTPUTS. Main dataflow is unidirectional; the BUS is a shared resource.]

Page 63: Techniques d’optimisation architecturale

Use of a ring structure

Array structure:

• Unidirectional communications between neighbours

• Hard to implement a datapath with greater pipeline depth than the array

• Hard to implement recursive operations

[Figure: the array of configurable blocks and switches (INPUTS, OUTPUTS, unidirectional main dataflow, shared BUS) is closed into a RING STRUCTURE.]

Page 64: Techniques d’optimisation architecturale

Use of a bi-dataflow structure

Array structure:

• Unidirectional communications between neighbours

• Hard to implement a datapath with greater pipeline depth than the array

• Hard to implement recursive operations

[Figure: RING STRUCTURE of configurable blocks and switches (INPUTS, OUTPUTS, shared BUS) with a forward dataflow and a reverse dataflow.]

Page 65: Techniques d’optimisation architecturale

Systolic Ring architecture

Peak power: 3200 MIPS @ 200 MHz (16-Dnode version)

[Figure: ring of Dnode layers (layer n, layer n+1, two Dnodes per layer) separated by switches. Each switch has I/O ports. The forward dataflow goes around the ring from layer to layer.]

Page 66: Techniques d’optimisation architecturale

Systolic Ring architecture

Forward dataflow:

• No complex data routing problems (crossbars…)

• Unidirectional data transfers between adjacent layers (pipeline)

• Performance increases linearly with the number of Dnodes

• Provides 3200 MIPS @ 200 MHz of computing power for a 16-Dnode realization

Dnode local mode: stand-alone

Dnode global mode: FPGA-like

Switch components:

• Direct FIFO connection for data injection

• BUS connection for RISC communication

• Full connectivity between 2 Dnode layers

[Figure: ring of Dnode layers (layer n-1, layer n, layer n+1) separated by switches with I/O ports, a configuration controller in the centre, and the forward dataflow around the ring.]

Page 67: Techniques d’optimisation architecturale

Reverse dataflow

Feedback pipelines:

• Each switch writes computed data into its own feedback pipeline

• Each switch has read ports on the other switches' pipelines

• Easy implementation of various recursive algorithms (IIR, WT…)

[Pages 67 to 69 animate the same figure: ring of Dnode layers and switches, with the feedback pipelines running in the reverse direction.]

Page 70: Techniques d’optimisation architecturale

2-level dynamically reconfigurable architecture:

• Global mode (first level)

The program which manages the configuration runs on the RISC processor.

The configuration of an entire cluster can be modified at each clock cycle.

The operating layer computes the data coming from the host processor.

• Local mode (second level)

Each Dnode runs its own program of up to 8 instructions.

[Figure: the host µP and RAM (management code, data) feed the config controller of the CONFIGURATION layer, which sends CONFIG words to the Dnodes (ALU + MULT, register file, inputs A and B, output S) of the OPERATING layer. Steps are numbered 1, 2, 3.]

Page 71: Techniques d’optimisation architecturale

8-Dnode version…

• ST* CMOS process, 0.25 µm and 0.18 µm

             Ring-8, 0.25 µm   Ring-8, 0.18 µm   Dnode, 0.18 µm
Area         0.9 mm²           0.7 mm²           0.04 mm²
Frequency    150 MHz           200 MHz           200 MHz

• Low Dnode area: possible to realize 128-Dnode versions…

• Suited as an IP core for SoC

*: ST: STMicroelectronics

Features:

• Parametrizable core (number of Dnodes)

• Good performance / cost tradeoff (Ring-8 @ 200 MHz Systolic Ring system):

• 1600 MIPS (PII @ 450 MHz: 400 MIPS)

• 3 Gb/s bandwidth

Systolic ring2
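The peak figures quoted in this deck (1600 MIPS for Ring-8 and 3200 MIPS for the 16-Dnode version, both at 200 MHz) are consistent with a simple model in which every Dnode completes one operation per clock cycle. That one-operation-per-cycle reading is an assumption made here to connect the numbers, not a statement from the slides:

```latex
\text{MIPS}_{\text{peak}} = N_{\text{Dnodes}} \times f_{\text{MHz}},
\qquad 8 \times 200 = 1600,
\qquad 16 \times 200 = 3200
```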

Page 72: Techniques d’optimisation architecturale

Assembly-level programming

Address   RISC instruction   Layer selection   Dnode instructions
0000      r:ldl(0,8)         M1:               N1:clr                N2:clr
0001      r:ldl(1,2)         M2:               N1:clr                N2:clr
0002      r:dec(0,0)         M1:               N1:add(fifo1,fifo1)   N2:sub(fifo1,fifo1)
0003      r:jnz(1)           M2:               N1:mac(in1)           N2:mac(in2)
0004      r: halt

[Figure: tool flow. The assembler produces File1.bin and File2.m. Other labels: RAM, FPGA prototype, Ring-8, RAM, testbench, simulator.]

Page 73: Techniques d’optimisation architecturale

FIR filter: edge detection

Convolution mask: [ -1 1 0 ], i.e. y_n = x_n - x_{n-1}

Assembly code:

0000 r:ldl(0,1) M1: N1:rst N2:rst
0001 r:jmp(0) M1: N2:sub(fifo,fifo)

[Figure: input image, output image and timing diagrams. Tool flow: assembler, File2.m, testbench, simulator, Ring-8, RAM.]

Systolic ring2
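The edge-detection filter y_n = x_n - x_{n-1} is a first difference along each image row. A minimal software sketch follows; the function name `edge_detect` and the choice of a zero sample before the first pixel are assumptions.

```python
def edge_detect(row):
    """First difference y(n) = x(n) - x(n-1) along one image row.

    The sample before the first pixel is taken as 0 (a border assumption).
    """
    previous = 0
    output = []
    for pixel in row:
        output.append(pixel - previous)
        previous = pixel
    return output


# Flat regions give 0; a step in brightness gives a large positive or negative value.
print(edge_detect([10, 10, 10, 200, 200, 10]))
```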

Page 74: Techniques d’optimisation architecturale

Polynomial calculus

• P(x) = a.x + b.x² + c.x³

Dnode program (ALU + MULT):

x -> reg0                  /* load reg0,x */
x.x -> reg1                /* load reg1,x² */
reg0.reg1 -> reg2          /* load reg2,x³ */
a.reg0 -> ACC              /* load ACC,a.x */
b.reg1 + ACC -> ACC        /* load ACC,a.x+b.x² */
c.reg2 + ACC -> ACC        /* load ACC,a.x+b.x²+c.x³ */

[Pages 74 to 78 animate the execution: x is loaded, then x², x³, a.x, a.x+b.x² and a.x+b.x²+c.x³ appear in turn at the ALU + MULT output.]

Systolic ring2
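The Dnode program for P(x) = a.x + b.x² + c.x³ maps directly onto six register transfers. The sketch below replays them in software; the function name `dnode_polynomial` and the list used as register file are illustrative assumptions.

```python
def dnode_polynomial(x, a, b, c):
    """Replay the six-step Dnode program for P(x) = a.x + b.x^2 + c.x^3."""
    reg = [0, 0, 0]
    reg[0] = x                    # x -> reg0
    reg[1] = x * x                # x.x -> reg1        (x^2)
    reg[2] = reg[0] * reg[1]      # reg0.reg1 -> reg2  (x^3)
    acc = a * reg[0]              # a.reg0 -> ACC
    acc = b * reg[1] + acc        # b.reg1 + ACC -> ACC (multiply-accumulate)
    acc = c * reg[2] + acc        # c.reg2 + ACC -> ACC (multiply-accumulate)
    return acc


print(dnode_polynomial(x=2, a=1, b=2, c=3))
```

Each step uses one multiplication and at most one addition, which matches a single ALU + MULT operation per cycle.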

Page 79: Techniques d’optimisation architecturale

Finite impulse response filter (FIR)

$$y(n) = \sum_{i=0}^{N-1} a_i\, x(n-i-1)$$

[Figure: tapped delay line. The input x_n goes through a chain of Z^{-1} delays; the taps are weighted by a_0, a_1, a_2, ..., a_{N-1}, a_N and summed to give y_n.]

3-coefficient filter:

$$y(n) = a_0\, x(n-1) + a_1\, x(n-2) + a_2\, x(n-3)$$

Page 81: Techniques d’optimisation architecturale

FIR implementation

Use of a 3 Dnodes / layer architecture:

• Pipelined implementation

• Samples are injected through dedicated lines

• Coefficients are loaded during the first cycle

• Partial sums move from one Dnode to the next through the feedback path

[Pages 81 to 87 animate one clock cycle per page. The same sample is injected into all three Dnodes; Dnode 1 holds a2, Dnode 2 (MAC) holds a1, Dnode 3 (MAC) holds a0.]

Cycle   Injected sample   Dnode 1 output   Dnode 2 output   Dnode 3 output
1       x0                (coefficients a2, a1, a0 being loaded)
2       x1                a2.x0
3       x2                a2.x1            a2.x0+a1.x1
4       x3                a2.x2            a2.x1+a1.x2      a2.x0+a1.x1+a0.x2
5       x4                a2.x3            a2.x2+a1.x3      a2.x1+a1.x2+a0.x3
6       x4                a2.x3            a2.x2+a1.x3      a2.x1+a1.x2+a0.x3
7       x5                a2.x4            a2.x3+a1.x4      a2.x2+a1.x3+a0.x4

(The frame labeled Cycle 6 shows the same values as Cycle 5 in the transcript.)

OPTIMAL IMPLEMENTATION: 1 SAMPLE / CYCLE

Systolic ring2
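The cycle-by-cycle behaviour of the three-Dnode pipeline can be checked with a small register-level simulation. The sketch assumes that all Dnodes update at the same clock edge from the previous register contents and that each sample is broadcast to the three Dnodes; `systolic_fir` is an illustrative name, and empty pipeline stages are represented by `None`.

```python
def systolic_fir(a0, a1, a2, samples):
    """Simulate the 3-Dnode pipelined FIR; return the register contents after each cycle."""
    r1 = r2 = r3 = None            # output registers of Dnode 1, 2, 3 (empty at start)
    trace = []
    for x in samples:
        # All Dnodes compute from the values latched at the previous cycle.
        new_r3 = r2 + a0 * x if r2 is not None else None   # Dnode 3: MAC with a0
        new_r2 = r1 + a1 * x if r1 is not None else None   # Dnode 2: MAC with a1
        new_r1 = a2 * x                                    # Dnode 1: multiply by a2
        r1, r2, r3 = new_r1, new_r2, new_r3
        trace.append((r1, r2, r3))
    return trace


for registers in systolic_fir(1, 2, 3, [1, 2, 3, 4]):
    print(registers)
```

Once the pipeline is full (after three samples), Dnode 3 delivers one complete output a2.x(k) + a1.x(k+1) + a0.x(k+2) per cycle.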

Page 88: Techniques d’optimisation architecturale

6-coefficient filter

$$y(n) = a_0\, x(n-1) + a_1\, x(n-2) + a_2\, x(n-3) + a_3\, x(n-4) + a_4\, x(n-5) + a_5\, x(n-6)$$

[Figure: six Dnodes holding a5, a4, a3, a2, a1, a0, the last five labeled MAC. The input x_n is injected, partial sums pass between layers through the inter-layer feedback, and the last Dnode delivers y_n.]

Page 89: Techniques d’optimisation architecturale

Discrete Cosine Transform

• Usually a two-dimensional 8x8-point DCT

• Very demanding algorithm…

[Figure: coding chain: original image -> DCT -> DCT coefficients -> quantization -> quantized coefficients -> compressed image. Decoding chain: compressed image -> inverse quantization -> iDCT -> decompressed image.]

Page 90: Techniques d’optimisation architecturale

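DCT algorithm

Direct transform

\[ z_k = \sqrt{\frac{2}{N}}\,\alpha(k) \sum_{n=0}^{N-1} x_n \cos\!\left(\frac{(2n+1)\,k\,\pi}{2N}\right), \qquad k = 0, 1, \ldots, N-1 \]

Inverse transform

\[ x_n = \sqrt{\frac{2}{N}} \sum_{k=0}^{N-1} \alpha(k)\, z_k \cos\!\left(\frac{(2n+1)\,k\,\pi}{2N}\right), \qquad n = 0, 1, \ldots, N-1 \]

with

\[ \alpha(k) = \begin{cases} 1/\sqrt{2} & \text{for } k = 0 \\ 1 & \text{otherwise} \end{cases} \]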

Page 91: Techniques d’optimisation architecturale

Image
• 64x64 points
• 8x8-pixel blocks
• 16-bit coded image

[Figure: initial 64 x 64 image divided into 64 blocks of 8x8 pixels.]

\[ \begin{bmatrix} x_{0,0} & x_{0,1} & \cdots & x_{0,7} \\ x_{1,0} & x_{1,1} & \cdots & \cdot \\ \vdots & \vdots & \ddots & \vdots \\ x_{7,0} & \cdot & \cdots & x_{7,7} \end{bmatrix} \]

Page 92: Techniques d’optimisation architecturale

Implementation
• Matrix implementation
• Even / odd frequency decomposition of the DCT algorithm

\[ z = \sqrt{\frac{2}{N}}\; T(N)\, x \]

Even frequencies:

\[ \begin{bmatrix} z_0 \\ z_2 \\ z_4 \\ z_6 \end{bmatrix} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ \beta & \delta & -\delta & -\beta \\ \alpha & -\alpha & -\alpha & \alpha \\ \delta & -\beta & \beta & -\delta \end{bmatrix} \begin{bmatrix} x_0 + x_7 \\ x_1 + x_6 \\ x_2 + x_5 \\ x_3 + x_4 \end{bmatrix} \]

α = cos (π/4)
β = cos (π/8)
δ = sin (π/8)

Odd frequencies:

\[ \begin{bmatrix} z_1 \\ z_3 \\ z_5 \\ z_7 \end{bmatrix} = \begin{bmatrix} \lambda & \gamma & \mu & \nu \\ \gamma & -\nu & -\lambda & -\mu \\ \mu & -\lambda & \nu & \gamma \\ \nu & -\mu & \gamma & -\lambda \end{bmatrix} \begin{bmatrix} x_0 - x_7 \\ x_1 - x_6 \\ x_2 - x_5 \\ x_3 - x_4 \end{bmatrix} \]

λ = cos (π/16)
γ = cos (3π/16)
μ = sin (3π/16)
ν = sin (π/16)
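As a sanity check of the even / odd decomposition, the two 4x4 products can be compared numerically with the direct DCT definition. The following Python sketch is an illustration only: the function names are hypothetical, it assumes N = 8 and a sqrt(2/N) scale factor, and it applies the α(0) = 1/√2 factor to z0 separately because the first matrix row is written with ones.

```python
import math

N = 8
ALPHA = math.cos(math.pi / 4)
BETA, DELTA = math.cos(math.pi / 8), math.sin(math.pi / 8)
LAMBDA, GAMMA = math.cos(math.pi / 16), math.cos(3 * math.pi / 16)
MU, NU = math.sin(3 * math.pi / 16), math.sin(math.pi / 16)

EVEN = [  # rows give z0, z2, z4, z6 from the sums x[n] + x[7-n]
    [1, 1, 1, 1],
    [BETA, DELTA, -DELTA, -BETA],
    [ALPHA, -ALPHA, -ALPHA, ALPHA],
    [DELTA, -BETA, BETA, -DELTA],
]
ODD = [  # rows give z1, z3, z5, z7 from the differences x[n] - x[7-n]
    [LAMBDA, GAMMA, MU, NU],
    [GAMMA, -NU, -LAMBDA, -MU],
    [MU, -LAMBDA, NU, GAMMA],
    [NU, -MU, GAMMA, -LAMBDA],
]


def dct8_direct(x):
    """Reference: direct evaluation of the DCT definition for N = 8."""
    out = []
    for k in range(N):
        a_k = 1 / math.sqrt(2) if k == 0 else 1.0
        s = sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                for n in range(N))
        out.append(math.sqrt(2 / N) * a_k * s)
    return out


def dct8_even_odd(x):
    """Same transform computed with the two 4x4 matrix products."""
    sums = [x[n] + x[7 - n] for n in range(4)]
    diffs = [x[n] - x[7 - n] for n in range(4)]
    z = [0.0] * N
    for r in range(4):
        z[2 * r] = sum(c * s for c, s in zip(EVEN[r], sums))
        z[2 * r + 1] = sum(c * d for c, d in zip(ODD[r], diffs))
    z[0] *= 1 / math.sqrt(2)       # alpha(0), applied to z0 only
    return [math.sqrt(2 / N) * v for v in z]


if __name__ == "__main__":
    samples = [1, 2, 3, 4, 5, 6, 7, 8]
    for direct, fast in zip(dct8_direct(samples), dct8_even_odd(samples)):
        print(round(direct, 6), round(fast, 6))
```

The decomposition needs 8 additions or subtractions followed by 32 multiply-accumulate operations, against 64 multiply-accumulate operations for the full 8x8 product.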

Page 93: Techniques d’optimisation architecturale

Coefficients coding
• Fixed point

Example: n = 6

α = 0000000000010110    -α = 1111111111101010
β = 0000000000011101    -β = 1111111111100011
δ = 0000000000001100    -δ = 1111111111110100
λ = 0000000000011111    -λ = 1111111111100001
γ = 0000000000011010    -γ = 1111111111100110
μ = 0000000000010001    -μ = 1111111111101111
ν = 0000000000000110    -ν = 1111111111111010

\[ z_k = \sqrt{\frac{2}{N}}\,\alpha(k) \sum_{n=0}^{N-1} x_n \cos\!\left(\frac{(2n+1)\,k\,\pi}{2N}\right) \]
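The positive codes listed above match the coefficient values multiplied by 2^5 and truncated, stored on 16 bits, with negative values in two's complement. The scale factor is inferred from the listed values, not stated on the slide, so the Python sketch below should be read as an assumption-based illustration; `encode` and `to_bits` are hypothetical helper names.

```python
import math

FRACTION_BITS = 5    # assumption: inferred from the listed codes (scale 2**5)
WORD_BITS = 16


def encode(value):
    """Truncate a positive coefficient to FRACTION_BITS fractional bits."""
    return int(value * (1 << FRACTION_BITS))


def to_bits(integer):
    """Return the 16-bit two's complement bit string of an integer."""
    return format(integer & ((1 << WORD_BITS) - 1), "0{}b".format(WORD_BITS))


COEFFS = {
    "alpha": math.cos(math.pi / 4),
    "beta": math.cos(math.pi / 8),
    "delta": math.sin(math.pi / 8),
    "lambda": math.cos(math.pi / 16),
    "gamma": math.cos(3 * math.pi / 16),
    "mu": math.sin(3 * math.pi / 16),
    "nu": math.sin(math.pi / 16),
}

if __name__ == "__main__":
    for name, value in COEFFS.items():
        code = encode(value)
        # positive code, then its negation in two's complement
        print(name, to_bits(code), to_bits(-code))
```

Truncation keeps every coefficient magnitude at or below its exact value, at the cost of a quantization error of up to 1/32 per coefficient.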

Page 94: Techniques d’optimisation architecturale

Implementation:
• ADD and SUB on the first Dnode layer
• Multiply-accumulate operations (MAC) on the second Dnode layer

[Figure: two layers of two Dnodes (Dnode1, Dnode2), each with its Config block. The first layer receives xn and x(N-1)-n and computes xn + x(N-1)-n and xn - x(N-1)-n. The second layer performs the MAC operations with the coefficients and outputs z0, z2, z4, z6 and z1, z3, z5, z7.]

Page 95: Techniques d’optimisation architecturale

Computing… t=0 n=0

M0: N1: add(fifo,fifo)  N2: sub(fifo,fifo)

[Figure: the first layer is configured as an adder (+) and a subtractor (-); inputs x0 and x7 enter both Dnodes.]

Page 96: Techniques d’optimisation architecturale

Computing… t=1 n=1

M1: N1: MAC(in1,fifo)  N2: MAC(in2,fifo)

[Figure: the first layer outputs x0 + x7 and x0 - x7 and receives x1 and x6. The second layer is configured as two MACs, with operands (1, x) and (λ, x).]

Page 97: Techniques d’optimisation architecturale

Computing… t=2 n=2

[Figure: the first layer outputs x1 + x6 and x1 - x6 and receives x2 and x5. The two MACs of the second layer use operands (1, x) and (γ, x).]

Page 98: Techniques d’optimisation architecturale

Computing… t=3 n=3

[Figure: the first layer outputs x2 + x5 and x2 - x5 and receives x3 and x4. The two MACs of the second layer use operands (1, x) and (μ, x).]

Page 99: Techniques d’optimisation architecturale

Computing… t=4 n=4

[Figure: the first layer outputs x3 + x4 and x3 - x4. The two MACs of the second layer use operands (1, x) and (ν, x).]

Page 100: Techniques d’optimisation architecturale

Computing… t=5 n=0

M1: N1: clear  N2: clear

[Figure: the second layer delivers z0 and z1, then its two accumulators are cleared.]

Page 101: Techniques d’optimisation architecturale

Results
• 2 transforms issued every 5 machine cycles
• "Clear" performed during addition

20 cycles for 8 samples
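The 5-cycle schedule walked through on the previous slides can be modelled at cycle level. The Python sketch below is a simplified illustration, not the actual Dnode configuration code: the function name `dnode_pair_schedule` and the trace format are hypothetical, and it assumes that the MAC layer consumes in cycle t the sum and difference produced by the add/sub layer in cycle t-1.

```python
def dnode_pair_schedule(x, even_row, odd_row):
    """Cycle-level model of one add/sub layer feeding one MAC layer.

    x        : the 8 input samples
    even_row : 4 coefficients applied to the sums x[n] + x[7-n]
    odd_row  : 4 coefficients applied to the differences x[n] - x[7-n]
    Returns (even result, odd result, trace of operations per cycle).
    """
    acc_even = acc_odd = 0.0       # MAC accumulators, cleared between pairs
    latched = None                 # (sum, difference) from the first layer
    trace = []
    for t in range(5):
        ops = []
        if latched is not None:    # second layer: MAC on last cycle's values
            n = t - 1
            acc_even += even_row[n] * latched[0]
            acc_odd += odd_row[n] * latched[1]
            ops.append("MAC n=%d" % n)
        if t < 4:                  # first layer: add / sub on a new pair
            latched = (x[t] + x[7 - t], x[t] - x[7 - t])
            ops.append("ADD/SUB n=%d" % t)
        else:
            latched = None
        trace.append((t, ops))
    return acc_even, acc_odd, trace


if __name__ == "__main__":
    z_even, z_odd, trace = dnode_pair_schedule(
        [1, 2, 3, 4, 5, 6, 7, 8], [1, 1, 1, 1], [1, 2, 3, 4])
    for cycle in trace:
        print(cycle)
    print(z_even, z_odd)
```

Each call produces one even and one odd coefficient in 5 cycles. Running it for the four row pairs (z0, z1), (z2, z3), (z4, z5) and (z6, z7) gives the 4 x 5 = 20 cycles quoted for 8 samples.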

Page 102: Techniques d’optimisation architecturale

Achievable parallelism on an 8-Dnode structure: Ring-8

[Figure: Ring-8 with four layers M0 to M3 of two Dnodes (Dnode1, Dnode2), each layer with its Config block and Switch. The 1D DCT of the first 4 lines and the 1D DCT of the last 4 lines run in parallel.]

Page 103: Techniques d’optimisation architecturale

Overall performance

2 partial transforms ⇒ 5 cycles

[Figure: 8x8 matrix of intermediate coefficients z'(0,0) … z'(7,7).]

Page 104: Techniques d’optimisation architecturale

Overall performance

1 line - 8 partial transforms ⇒ 20 cycles

Page 105: Techniques d’optimisation architecturale

Overall performance

4 lines - 32 partial transforms ⇒ 80 cycles (M0, M1)

Page 106: Techniques d’optimisation architecturale

Overall performance

4 lines - 32 partial transforms ⇒ 80 cycles (M0, M1)

Page 107: Techniques d’optimisation architecturale

Overall performance

8 columns - 64 transforms ⇒ 80 cycles (M2, M3)

Page 108: Techniques d’optimisation architecturale

Overall performance

2D DCT on 8 points:

160 CYCLES
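The cycle counts quoted on the performance slides follow from simple arithmetic. The Python sketch below recomputes them under one stated assumption: on Ring-8, two 1D DCTs run in parallel (as suggested by the "4 first lines" / "4 last lines" split), so each pass over 8 lines or 8 columns costs half of the sequential time. The constant and function names are hypothetical.

```python
CYCLES_PER_PAIR = 5      # 2 coefficients every 5 cycles (from the slides)
PAIRS_PER_1D_DCT = 4     # (z0, z1), (z2, z3), (z4, z5), (z6, z7)
PARALLEL_UNITS = 2       # assumption: Ring-8 processes two 1D DCTs at once


def dct_1d_cycles():
    """Cycles for one 8-point 1D DCT on one add/sub + MAC Dnode group."""
    return CYCLES_PER_PAIR * PAIRS_PER_1D_DCT


def dct_2d_cycles(size=8):
    """Cycles for a size x size 2D DCT: one pass on lines, one on columns."""
    pass_cycles = size * dct_1d_cycles() // PARALLEL_UNITS
    return 2 * pass_cycles


if __name__ == "__main__":
    print(dct_1d_cycles())   # cycles for one line
    print(dct_2d_cycles())   # cycles for the full 8x8 block
```

Under these assumptions one line costs 20 cycles, each pass costs 80 cycles, and the full 2D transform costs 160 cycles, which agrees with the figures on the slides.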

Page 109: Techniques d’optimisation architecturale

Comparisons: execution time (cycles)

VLIW: CPU64, TM1000, TI 320C60
Superscalar: Pentium I, Pentium II, NEC V830

[Figure: bar chart of the number of cycles (axis from 0 to 400) for CPU64, TM-1000, 320C62, Ring-8, Ring-64, Pentium I, Pentium II and NEC V830, grouped as VLIW and superscalar processors.]