Architectural optimization techniques

Post on 08-Apr-2016


Architectural optimization techniques

Camille Diou

diou@univ-metz.fr

Master EAII Sp. RSEE

1. Microprocessor basics

[Figure: microprocessor organisation with a CONTROLLER (state machine), a DATAPATH (register file t1, t2, t3, A, B, C and an arithmetic and logic unit, ALU), a BUS, and tristate components on the inputs and outputs.]

Computation example: S = Ax² + By + C

The controller sequences seven register transfers on the datapath, one per cycle:

Cycle  Transfer       ALU operation  ALU operands  Result
1      t1 <- x        (load)         x             t1 = x
2      t2 <- y        (load)         y             t2 = y
3      t3 <- A.t1     multiply       A, t1         t3 = A.x
4      t3 <- t3.t1    multiply       t3, t1        t3 = A.x²
5      t2 <- B.t2     multiply       t2, B         t2 = B.y
6      t3 <- t2+t3    add            t3, t2        t3 = A.x² + B.y
7      out <- t3+C    add            C, t3         S = A.x² + B.y + C
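The register-transfer sequence for S = Ax² + By + C can be replayed in software. The following is a minimal sketch, not the author's tool: the function name `run_microprogram` and the dictionary used as a register file are assumptions made for illustration. It applies one transfer per cycle and records the register contents after each one.

```python
def run_microprogram(x, y, A, B, C):
    """Replay the seven register transfers of the slide, one per cycle."""
    regs = {"t1": 0, "t2": 0, "t3": 0}

    # Each entry: (transfer as written on the slide, action on the register file).
    steps = [
        ("t1 <- x",     lambda r: r.update(t1=x)),
        ("t2 <- y",     lambda r: r.update(t2=y)),
        ("t3 <- A.t1",  lambda r: r.update(t3=A * r["t1"])),
        ("t3 <- t3.t1", lambda r: r.update(t3=r["t3"] * r["t1"])),
        ("t2 <- B.t2",  lambda r: r.update(t2=B * r["t2"])),
        ("t3 <- t2+t3", lambda r: r.update(t3=r["t2"] + r["t3"])),
        ("out <- t3+C", lambda r: r.update(out=r["t3"] + C)),
    ]

    trace = []
    for cycle, (text, action) in enumerate(steps, start=1):
        action(regs)
        trace.append((cycle, text, dict(regs)))
    return regs["out"], trace


if __name__ == "__main__":
    result, trace = run_microprogram(x=3, y=4, A=2, B=5, C=7)
    for cycle, text, regs in trace:
        print(cycle, text, regs)
    print("S =", result)
```

The trace has seven entries, which matches the #CYCLES counter on the slides: with a single ALU, every multiplication and addition costs one cycle.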

Execution principle

[Figure: START -> fetch next instruction (fetch cycle) -> execute instruction (execute cycle) -> back to fetch, until HALT.]

A single accumulator machine

MAR: Memory Address Register
IR: Instruction Register
PC: Program Counter register

[Figure: block diagram. A memory (16 bits wide, 16M words) is addressed through MAR. IR holds the opcode and the address operand; the opcode goes to the FSM, which issues the function controls. PC has an increment (incr) input and a branch input. The accumulator ACC feeds the ALU (inputs A and B, output S). Three paths are drawn over the data flow and control signals: load path, store path and instruction path.]

Instruction format

Bits 15-14: opcode
Bits 13-0: address

Opcode:
00: Load
01: Store
10: Add
11: Branch

Single-address instruction: one of the registers is fixed (the accumulator), so AC is an implicit operand:

AC := AC <operation> Memory(Address)
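The 16-bit instruction word (2-bit opcode in bits 15-14, 14-bit address in bits 13-0) can be packed and unpacked with shifts and masks. This is a small sketch; the helper names `encode` and `decode` are assumptions chosen for illustration.

```python
# Opcode table from the slide.
OPCODES = {0b00: "Load", 0b01: "Store", 0b10: "Add", 0b11: "Branch"}
ADDRESS_BITS = 14
ADDRESS_MASK = (1 << ADDRESS_BITS) - 1  # 0x3FFF


def decode(word):
    """Split a 16-bit instruction word into (mnemonic, address)."""
    opcode = (word >> ADDRESS_BITS) & 0b11
    return OPCODES[opcode], word & ADDRESS_MASK


def encode(mnemonic, address):
    """Build a 16-bit instruction word from a mnemonic and a 14-bit address."""
    if not 0 <= address <= ADDRESS_MASK:
        raise ValueError("address does not fit in 14 bits")
    opcode = {name: bits for bits, name in OPCODES.items()}[mnemonic]
    return (opcode << ADDRESS_BITS) | address


if __name__ == "__main__":
    mnemonic, address = decode(0b1000110100110011)
    print(mnemonic, format(address, "014b"))
```

A 14-bit address field distinguishes 2^14 = 16384 memory words.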

Executing an instruction on the single accumulator machine

The figure follows one Add instruction through the machine (datapath widths: 2-bit opcode, 14-bit addresses, 16-bit data).

1. Instruction fetch:
- PC is moved into MAR
- Read from memory
- Load instruction into IR

2. Instruction decode:
- Opcode bits go to the FSM (here ADD)
- The rest of the bits is the operand address

3. Operand fetch:
- IR<address> -> MAR
- Read data from memory

4. Instruction execute:
- Memory to ALU B
- AC to ALU A
- ALU add
- S to AC

5. Housekeeping:
- Increment PC

Values shown in the figure:
PC = 10110100110011
IR = 1000110100110011 (opcode 10, address 00110100110011)
Memory operand (ALU B) = 0101010101110001
AC (ALU A) = 0011001101110110
S = 1000100011100111
PC after increment = 10110100110100
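The five steps (fetch, decode, operand fetch, execute, housekeeping) can be sketched as a small simulator. This is an illustrative model under stated assumptions, not the author's design: the class name `AccumulatorMachine`, the sparse dictionary used as memory, and the Branch behaviour (PC takes the address, no increment) are all hypothetical. The example reuses the binary values that appear on the slides.

```python
class AccumulatorMachine:
    """Hypothetical software model of the single accumulator machine."""

    ADDRESS_MASK = 0x3FFF  # 14-bit address field
    WORD_MASK = 0xFFFF     # 16-bit data words

    def __init__(self, memory, pc=0, acc=0):
        self.memory = dict(memory)  # sparse memory: address -> 16-bit word
        self.pc = pc
        self.acc = acc
        self.mar = 0
        self.ir = 0

    def step(self):
        # 1. Instruction fetch: PC -> MAR, read memory, load IR.
        self.mar = self.pc
        self.ir = self.memory[self.mar]
        # 2. Instruction decode: opcode to the FSM, rest is the operand address.
        opcode = self.ir >> 14
        address = self.ir & self.ADDRESS_MASK
        if opcode == 0b11:
            # Branch (assumed semantics): PC takes the address field.
            self.pc = address
            return
        # 3. Operand fetch: IR<address> -> MAR.
        self.mar = address
        # 4. Instruction execute.
        if opcode == 0b00:    # Load
            self.acc = self.memory[self.mar]
        elif opcode == 0b01:  # Store
            self.memory[self.mar] = self.acc
        else:                 # Add: memory to ALU B, AC to ALU A, S to AC
            self.acc = (self.acc + self.memory[self.mar]) & self.WORD_MASK
        # 5. Housekeeping: increment PC.
        self.pc = (self.pc + 1) & self.ADDRESS_MASK


if __name__ == "__main__":
    machine = AccumulatorMachine(
        memory={
            0b10110100110011: 0b1000110100110011,  # Add instruction
            0b00110100110011: 0b0101010101110001,  # operand
        },
        pc=0b10110100110011,
        acc=0b0011001101110110,
    )
    machine.step()
    print(format(machine.acc, "016b"), format(machine.pc, "014b"))
```

With these inputs the accumulator ends at 1000100011100111 and the PC at 10110100110100, the same bit patterns as in the figure.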

A simple microprocessor: architecture

[Figure: datapath with a 16x16 register file, an address output to memory, a data port to/from memory, and two sets of signals to the controller (FSM).]

A simple microprocessor: instruction format

[Figures: instruction formats, and a table with the columns Instruction format, Instruction and Action.]

A simple microprocessor: test program

What will it do?

0000 7C0A ;
0001 8C00 ; LOAD RC, #A
0002 7B04 ; ...
0003 7A0A ; ...
0004 9C7C ; ...
0005 611A ; ...
0006 614B ; ...
...

Compiler dependency detection for ILP

• Detect data dependencies at compile time. Examples:

c[i]=a[i]+b[i];
d[i]=a[i]+c[j];
Potential dependency: c[i] might be c[j].

c[1]=a[i]+b[i];
d[i]=a[i]+c[2];
No dependency: c[1] is never c[2].
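The two cases above amount to a conservative aliasing test on array subscripts. The sketch below is a deliberately simple illustration, not a real compiler analysis: the function name `may_depend` and the convention that constant subscripts are Python integers while symbolic ones are strings are assumptions.

```python
def may_depend(write_index, read_index):
    """Return False only when the two subscripts can be proven different.

    Constant subscripts are given as int, symbolic ones (i, j, ...) as str.
    """
    if isinstance(write_index, int) and isinstance(read_index, int):
        # Two constants: the elements coincide only if the constants are equal.
        return write_index == read_index
    # At least one symbolic subscript: the values are unknown at compile time,
    # so a dependency has to be assumed.
    return True


if __name__ == "__main__":
    print(may_depend("i", "j"))  # c[i] written, c[j] read
    print(may_depend(1, 2))      # c[1] written, c[2] read
```

Only the second pair can be scheduled in parallel without further information about i and j.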

2. Systolic ring

Reconfigurable computing: instruction level parallelism (ILP)

• Superscalar processors must find the dataflow graph at run time
• Reconfigurable architectures construct the dataflow graph at compile time
• No functional unit (FU) limitations
• No control logic overhead
• No window size limitations

Reconfigurable computing: instruction level parallelism (ILP)

• General purpose computer:

add r1, r2, r4
add r1, r3, r5
sub r3, r2, r6
add r4, r5, r1
add r5, r6, r2

• RC scheme:

[Figure: dataflow graph with two levels. Level 1 computes r4, r5 and r6 from r1, r2 and r3; level 2 computes r1 and r2 from r4, r5 and r6.]

Question: what is the advantage of RC over superscalar?

Answer: the dataflow graph is constructed at compile time, so there is no run-time overhead.
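The levels of the dataflow graph can be derived from the five instructions by tracking which instruction last wrote each register. This is a minimal sketch under two assumptions: the last operand of each instruction is the destination (as the figure suggests), and only read-after-write dependencies matter, since in a spatial implementation each result is a wire rather than a reused register. The function name `dataflow_levels` is hypothetical.

```python
# (operation, source 1, source 2, destination), destination last by assumption.
PROGRAM = [
    ("add", "r1", "r2", "r4"),
    ("add", "r1", "r3", "r5"),
    ("sub", "r3", "r2", "r6"),
    ("add", "r4", "r5", "r1"),
    ("add", "r5", "r6", "r2"),
]


def dataflow_levels(program):
    """Return, for each instruction, its depth in the dataflow graph."""
    last_writer = {}  # register name -> index of the instruction that wrote it
    levels = []
    for index, (_, src1, src2, dest) in enumerate(program):
        producers = [last_writer[s] for s in (src1, src2) if s in last_writer]
        # An instruction sits one level below its deepest producer.
        levels.append(1 + max((levels[p] for p in producers), default=-1))
        last_writer[dest] = index
    return levels


if __name__ == "__main__":
    print(dataflow_levels(PROGRAM))
```

The first three instructions land on level 0 and the last two on level 1, so the five operations need two steps when enough operators are available.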

Reconfigurable computing: why now?

• Increasing number of transistors
• Complexity and cost of chip design increase fast
• Current computing demands are RC friendly: desktop and embedded demands are driven NOT by Word or Excel but by multimedia, encryption and filters (dataflow-oriented applications)

RA versus microprocessors

• RA is less flexible (like a VLIW with fixed instructions)

but

• RA provides more (customized) computation elements
• RA can decrease memory traffic
• RA can be tailored for specific algorithms and data types

RA will not replace µP, but complement them

Systolic computing: definition

• A set of simple processing elements with regular and local connections, which takes external inputs and processes them in a predetermined manner (H.T. Kung)

Systolic computing: characteristics of the best RC design

• Simple PE
• Regular and local interconnect
• Pipeline between PEs
• I/O at boundary

Coarse grain RA model

In the abstract: instructions configure both PE and interconnect every cycle

In reality: the instruction bandwidth / memory required is too high, so…

COMPROMISE

Communications…

Relationship of communication among processors:
• Shared clock (pipelined)
• Shared registers (VLIW)
• Shared memory (SMM)
• Shared network

Reconfigurable computing

[Figure: a program compared with the actual available hardware; part of its instructions is currently in hardware, the rest is paged out.]

Finite impulse response (FIR) filter

$$y(n) = \sum_{i=0}^{N-1} a_i \, x(n-i-1)$$

[Figure: delay line of Z^-1 elements fed by x_n, with taps a_0, a_1, ..., a_{N-2}, a_{N-1}, a_N summed into y_n; a second panel shows the three-tap case a_0, a_1, a_2.]

3-coefficient filter:

$$y(n) = a_0 \, x(n-1) + a_1 \, x(n-2) + a_2 \, x(n-3)$$
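As a reference point for the hardware mappings, the FIR sum y(n) = a_0·x(n-1) + a_1·x(n-2) + ... can be evaluated directly in software. This sketch assumes that samples before the start of the sequence are zero; the function name `fir` is chosen for illustration.

```python
def fir(coeffs, samples):
    """Direct evaluation of y(n) = sum_i a_i * x(n - i - 1), with x(k) = 0 for k < 0."""
    outputs = []
    for n in range(len(samples)):
        acc = 0
        for i, a in enumerate(coeffs):
            k = n - i - 1  # index of the sample seen by tap i
            if k >= 0:
                acc += a * samples[k]
        outputs.append(acc)
    return outputs


if __name__ == "__main__":
    # Feeding a unit impulse returns the coefficients, delayed by one sample.
    print(fir([1, 2, 3], [1, 0, 0, 0, 0]))
```

Each output costs one multiplication and one addition per coefficient, which is why a multiply-accumulate (MAC) unit per tap is the natural hardware building block.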

Systolic FIR implementation

[Figures: step-by-step construction of the systolic FIR structure from multiply-accumulate (MAC) units. Captions of the optimization steps:]
• Optimize outer loop, preload repeated value
• Optimize outer loop, broadcast common value
• Optimize outer loop, retime to eliminate broadcast

The Systolic Ring

• Coarse grain architecture
• Multi-mode dynamic reconfiguration
• Scalable, two-dimensional array
• VHDL design
• Designed for SoC integration

Dnode: word-level processing unit

Constitution
• Optimized datapath (16 bits)
• Register file (4x16 bits)
• Hardwired ALU and multiplier

Features
• Complex computations in local mode (FIR, IIR, WT…)
• Low silicon area (0.07 mm², 0.18 µm CMOS process)
• Single-cycle operations (e.g. MAC + register load)

[Figure: Dnode with ALU + MULT, register file and microinstruction (µinst.) input.]

Local controller: dynamic reconfiguration at the Dnode level

Constitution
• 8 configuration registers
• 3 different run modes
• 1 programming mode

[Figure: the Dnode (ALU + MULT, register file, output "out") takes its operands from In(1,2), fifo(1,2), the bus or Rp(i,j) (i = 1~4, j = 1~2). Its microinstruction (µinst.) comes from the local controller: eight configuration registers reg0 to reg7, multiplexers, a decoder, and a controller block with the signals mode, inhib, wait and clk.]

The animation shows the following modes:

• Programming mode: instructions 0, 1, 2 and 3 are written one after the other into the configuration registers
• Run-mode 1, fixed: instruction 0 is applied at every clock cycle
• Run-mode 2, dynamic (one-time or loop): the stored instructions are applied in sequence (instructions 1, 2 and 3 on the slides)
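The programming mode and the two run modes shown on the slides can be mimicked with a very small model. This is a hypothetical sketch, not the actual controller logic: the class name `LocalController`, its method names, the string microinstructions and the choice to start the dynamic sequence at the first stored instruction are all assumptions. The third run mode is not described in the extracted text and is not modelled.

```python
class LocalController:
    """Hypothetical model of a Dnode local controller with 8 configuration registers."""

    NUM_REGISTERS = 8

    def __init__(self):
        self.registers = []

    def program(self, microinstructions):
        """Programming mode: store the microinstructions in reg0, reg1, ..."""
        if len(microinstructions) > self.NUM_REGISTERS:
            raise ValueError("at most 8 microinstructions fit in the local controller")
        self.registers = list(microinstructions)

    def run_fixed(self, cycles, index=0):
        """Fixed run mode: the same register drives the Dnode at every cycle."""
        return [self.registers[index]] * cycles

    def run_dynamic(self, cycles, loop=True):
        """Dynamic run mode: step through the stored microinstructions."""
        issued = []
        for cycle in range(cycles):
            if cycle >= len(self.registers) and not loop:
                break  # one-time: stop after the last stored microinstruction
            issued.append(self.registers[cycle % len(self.registers)])
        return issued


if __name__ == "__main__":
    controller = LocalController()
    controller.program(["mac", "add", "sub", "clr"])  # placeholder microinstructions
    print(controller.run_fixed(3))
    print(controller.run_dynamic(6, loop=True))
    print(controller.run_dynamic(6, loop=False))
```

Because the sequence is stored next to the Dnode, changing the operation from one cycle to the next needs no traffic from the global configuration controller.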

Array structure

[Figure: array of configurable blocks (processing units and switches) between INPUTS and OUTPUTS. The main dataflow is unidirectional; the bus is a shared resource.]

• Scalable
• Unidirectional communications between neighbours
• Hard to implement a datapath with greater pipeline depth than the array
• Hard to implement recursive operations

Use of a ring structure

[Figure: the array is closed into a RING STRUCTURE.]

Use of a bi-dataflow structure

[Figure: ring structure with a forward dataflow and a reverse dataflow.]

Systolic Ring architecture

[Figure: ring of Dnode layers separated by switches; each switch has I/O ports. The forward dataflow goes from layer n to layer n+1.]

Peak power: 3200 MIPS @ 200 MHz (16-Dnode version)

Systolic Ring architecture: forward dataflow

• No complex data routing problems (crossbars…)
• Unidirectional data transfers between adjacent layers (pipeline)
• Linear performance increase with the number of Dnodes
• Provides 3200 MIPS @ 200 MHz of computing power for a 16-Dnode realization

Dnode modes:
• Local mode: stand-alone
• Global mode: FPGA-like

Switch components:
• Direct FIFO connection for data injection
• BUS connection for RISC communication
• Full connectivity between 2 Dnode layers

[Figure: ring of D-Node layers (layer n-1, layer n, layer n+1) separated by switches with I/O ports, a central configuration controller, and the forward dataflow direction.]

Reverse dataflow: feedback pipelines

• Each switch writes computed data into its own feedback pipeline
• Each switch has read ports on the other switches' pipelines
• Easy implementation of various recursive algorithms (IIR, WT…)

[Figure: ring of D-Node layers and switches, with the feedback pipelines carrying the reverse dataflow.]

2-level dynamically reconfigurable architecture

• Global mode (first level)
  - The program which manages the configuration runs on the RISC processor
  - The configuration of an entire cluster can be modified at each clock cycle
  - The operating layer computes the data coming from the host processor

• Local mode (second level)
  - Each Dnode runs its own program of up to 8 instructions

[Figure: a host µP sends DATA to the OPERATING layer of Dnodes (ALU + MULT, register file, inputs A and B, output S); the CONFIGURATION layer holds the config controller with its RAM and MANAGEMENT CODE, and sends CONFIG to the Dnodes.]

8-Dnode version

• ST* CMOS process, 0.25 µm and 0.18 µm

            Ring-8 (0.25 µm)   Ring-8 (0.18 µm)   Dnode (0.18 µm)
Area        0.9 mm²            0.7 mm²            0.04 mm²
Frequency   150 MHz            200 MHz            200 MHz

• Low Dnode area: possible to realize 128-Dnode versions…
• Suited as an IP core for SoC

Features:
• Parametrizable core (number of Dnodes)
• Good performance / cost tradeoff (Ring-8 @ 200 MHz Systolic Ring system):
  - 1600 MIPS (PII @ 450 MHz: 400 MIPS)
  - 3 Gb/s bandwidth

*: ST: STMicroelectronics

Assembly-level programming

Each line holds a RISC instruction, a layer selection and the Dnode instructions:

Addr  RISC instruction  Layer  Dnode instructions
0000  r:ldl(0,8)        M1:    N1:clr               N2:clr
0001  r:ldl(1,2)        M2:    N1:clr               N2:clr
0002  r:dec(0,0)        M1:    N1:add(fifo1,fifo1)  N2:sub(fifo1,fifo1)
0003  r:jnz(1)          M2:    N1:mac(in1)          N2:mac(in2)
0004  r: halt

[Figure: tool flow from the assembler to File1.bin and File2.m, targeting an FPGA prototype with RAM and a simulator testbench of the Ring-8 with RAM.]

FIR filter: edge detection

Convolution mask: [ -1 1 0 ], that is y_n = x_n - x_{n-1}

Assembly code:

0000 r:ldl(0,1) M1: N1:rst N2:rst
0001 r:jmp(0)   M1: N2:sub(fifo,fifo)

[Figures: input image, output image and timing diagrams, obtained with the assembler, File2.m and the simulator testbench of the Ring-8 with RAM.]
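The edge detector y_n = x_n - x_{n-1} is a first difference along each image row. The sketch below shows the arithmetic only, not the Ring-8 program; the function name `edge_detect` and the choice x_{-1} = 0 for the first pixel are assumptions.

```python
def edge_detect(row):
    """First difference along one image row: y[n] = x[n] - x[n-1], with x[-1] taken as 0."""
    outputs = []
    previous = 0
    for pixel in row:
        outputs.append(pixel - previous)
        previous = pixel
    return outputs


if __name__ == "__main__":
    # A flat dark region, a bright band, then dark again.
    print(edge_detect([0, 0, 10, 10, 10, 0]))
```

Flat regions give zero, while each intensity step gives a positive or negative spike, which is what makes the vertical edges stand out in the output image.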

Polynomial calculus

• P(x) = a.x + b.x² + c.x³

Microinstruction sequence on one Dnode (ALU + MULT):

x -> reg0                 /* load reg0, x */
x.x -> reg1               /* load reg1, x² */
reg0.reg1 -> reg2         /* load reg2, x³ */
a.reg0 -> ACC             /* load ACC, a.x */
b.reg1 + ACC -> ACC       /* load ACC, a.x + b.x² */
c.reg2 + ACC -> ACC       /* load ACC, a.x + b.x² + c.x³ */
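The six microinstructions map one-to-one onto assignments, which makes the sequence easy to check. In this sketch the function name `dnode_polynomial` and the use of plain Python variables for reg0, reg1, reg2 and ACC are assumptions for illustration.

```python
def dnode_polynomial(x, a, b, c):
    """Evaluate P(x) = a.x + b.x^2 + c.x^3 with the register sequence of the slide."""
    reg0 = x               # load reg0, x
    reg1 = x * x           # load reg1, x^2
    reg2 = reg0 * reg1     # load reg2, x^3
    acc = a * reg0         # load ACC, a.x
    acc = b * reg1 + acc   # multiply-accumulate: a.x + b.x^2
    acc = c * reg2 + acc   # multiply-accumulate: a.x + b.x^2 + c.x^3
    return acc


if __name__ == "__main__":
    print(dnode_polynomial(x=2, a=1, b=1, c=1))
```

The last two steps are multiply-accumulate operations, the kind of single-cycle operation the Dnode provides.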

Finite impulse response (FIR) filter

$$y(n) = \sum_{i=0}^{N-1} a_i \, x(n-i-1)$$

[Figure: delay line of Z^-1 elements fed by x_n, with taps a_0, a_1, a_2, ..., a_{N-1}, a_N summed into y_n.]

3-coefficient filter:

$$y(n) = a_0 \, x(n-1) + a_1 \, x(n-2) + a_2 \, x(n-3)$$

FIR implementation

Use of a 3-Dnodes-per-layer architecture:
• Pipelined implementation
• Samples are injected through dedicated lines
• Coefficients loaded during the first cycle

Cycle  Sample injected  Dnode holding a2  Dnode holding a1 (MAC)  Dnode holding a0 (MAC)
1      x0               loads a2          loads a1                loads a0
2      x1               a2.x0
3      x2               a2.x1             a2.x0+a1.x1
4      x3               a2.x2             a2.x1+a1.x2             a2.x0+a1.x1+a0.x2
5      x4               a2.x3             a2.x2+a1.x3             a2.x1+a1.x2+a0.x3

The slides for cycles 6 and 7 continue the same pattern (cycle 7 shows x5 injected and a2.x2+a1.x3+a0.x4 at the output). The figure labels the path between the Dnodes "Feedback".

OPTIMAL IMPLEMENTATION: 1 SAMPLE / CYCLE
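The cycle-by-cycle values on these slides can be reproduced with three registers that are updated together on every clock. This is a behavioural sketch under assumptions, not the Ring-8 configuration: the function name `systolic_fir3`, the register names `p1` and `p2`, and the zero initial state are hypothetical. Each incoming sample is broadcast to the three multiply-accumulate stages, and each partial sum moves one stage further per cycle.

```python
def systolic_fir3(a0, a1, a2, samples):
    """Three-stage pipelined FIR: one sample in and one result out per cycle."""
    p1 = 0  # partial sum after the a2 stage
    p2 = 0  # partial sum after the a1 stage
    outputs = []
    for x in samples:
        # The right-hand sides all use the register values of the previous cycle,
        # as in synchronous hardware.
        p1, p2, out = a2 * x, p1 + a1 * x, p2 + a0 * x
        outputs.append(out)
    return outputs


if __name__ == "__main__":
    xs = [1, 2, 3, 4, 5]
    ys = systolic_fir3(2, 3, 5, xs)
    print(ys)
    # From the third sample on, out = a2.x[k-2] + a1.x[k-1] + a0.x[k].
    print([5 * xs[k - 2] + 3 * xs[k - 1] + 2 * xs[k] for k in range(2, len(xs))])
```

After a two-cycle fill, every clock delivers a complete three-tap result, which is the 1 sample / cycle rate stated on the slide.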

6-coefficient filter

$$y(n) = a_0 \, x(n-1) + a_1 \, x(n-2) + a_2 \, x(n-3) + a_3 \, x(n-4) + a_4 \, x(n-5) + a_5 \, x(n-6)$$

[Figure: six MAC Dnodes holding a5, a4, a3, a2, a1 and a0, fed by x_n and producing y_n, with inter-layer feedback.]

Discrete Cosine Transform

• Usually a two-dimensional 8x8-point DCT
• Very demanding algorithm…

[Figure: coding chain: original image -> DCT -> DCT coefficients -> quantization -> quantized coefficients (compressed image). Decoding chain: compressed image -> inverse quantization -> iDCT -> decompressed image.]

DCT algorithm

Direct transform:

$$z_k = \sqrt{\frac{2}{N}} \, \alpha(k) \sum_{n=0}^{N-1} x_n \cos\left(\frac{(2n+1)k\pi}{2N}\right), \quad k = 0, 1, \ldots, N-1$$

Inverse transform:

$$x_n = \sqrt{\frac{2}{N}} \sum_{k=0}^{N-1} \alpha(k) \, z_k \cos\left(\frac{(2n+1)k\pi}{2N}\right), \quad n = 0, 1, \ldots, N-1$$

with $\alpha(k) = 1/\sqrt{2}$ for $k = 0$, and $\alpha(k) = 1$ otherwise.

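Image

• 64x64 points
• 8x8-pixel blocks
• 16-bit coded image

[Figure: initial image of 64 x 64 points divided into 64 blocks of 8x8; each block is a matrix of samples:]

$$\begin{bmatrix} x_{0,0} & x_{0,1} & \cdots & x_{0,7} \\ x_{1,0} & x_{1,1} & \cdots & \vdots \\ \vdots & \vdots & \ddots & \vdots \\ x_{7,0} & \cdots & \cdots & x_{7,7} \end{bmatrix}$$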

Implementation

• Matrix implementation
• Even / odd frequency decomposition of the DCT algorithm

Even coefficients:

$$\begin{bmatrix} z_0 \\ z_2 \\ z_4 \\ z_6 \end{bmatrix} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ \beta & \delta & -\delta & -\beta \\ \alpha & -\alpha & -\alpha & \alpha \\ \delta & -\beta & \beta & -\delta \end{bmatrix} \begin{bmatrix} x_0 + x_7 \\ x_1 + x_6 \\ x_2 + x_5 \\ x_3 + x_4 \end{bmatrix}$$

with α = cos(π/4), β = cos(π/8), δ = sin(π/8)

Odd coefficients:

$$\begin{bmatrix} z_1 \\ z_3 \\ z_5 \\ z_7 \end{bmatrix} = \begin{bmatrix} \lambda & \gamma & \mu & \nu \\ \gamma & -\nu & -\lambda & -\mu \\ \mu & -\lambda & \nu & \gamma \\ \nu & -\mu & \gamma & -\lambda \end{bmatrix} \begin{bmatrix} x_0 - x_7 \\ x_1 - x_6 \\ x_2 - x_5 \\ x_3 - x_4 \end{bmatrix}$$

with λ = cos(π/16), γ = cos(3π/16), μ = sin(3π/16), ν = sin(π/16)

In matrix form:

$$z = \sqrt{\frac{2}{N}} \, T(N) \, x$$
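The even / odd decomposition can be checked numerically against the plain cosine sum. The sketch below is an illustration under stated assumptions: both functions compute the unnormalised 8-point transform (no scale factor and no α(k) weighting), and the function names are hypothetical. The two 4x4 matrices use the symbols α, β, δ, λ, γ, μ, ν as defined on the slide.

```python
import math


def dct8_direct(x):
    """Unnormalised 8-point DCT: z_k = sum_n x_n cos((2n+1) k pi / 16)."""
    return [
        sum(x[n] * math.cos((2 * n + 1) * k * math.pi / 16) for n in range(8))
        for k in range(8)
    ]


def dct8_even_odd(x):
    """Same transform through sums/differences and two 4x4 matrices."""
    al, be, de = math.cos(math.pi / 4), math.cos(math.pi / 8), math.sin(math.pi / 8)
    la, ga = math.cos(math.pi / 16), math.cos(3 * math.pi / 16)
    mu, nu = math.sin(3 * math.pi / 16), math.sin(math.pi / 16)

    sums = [x[n] + x[7 - n] for n in range(4)]   # first layer: ADD
    diffs = [x[n] - x[7 - n] for n in range(4)]  # first layer: SUB

    even = [[1, 1, 1, 1], [be, de, -de, -be], [al, -al, -al, al], [de, -be, be, -de]]
    odd = [[la, ga, mu, nu], [ga, -nu, -la, -mu], [mu, -la, nu, ga], [nu, -mu, ga, -la]]

    z = [0.0] * 8
    for row in range(4):
        # Second layer: one multiply-accumulate chain per output coefficient.
        z[2 * row] = sum(c * s for c, s in zip(even[row], sums))
        z[2 * row + 1] = sum(c * d for c, d in zip(odd[row], diffs))
    return z


if __name__ == "__main__":
    block = [1, 2, 3, 4, 5, 6, 7, 8]
    for direct, fast in zip(dct8_direct(block), dct8_even_odd(block)):
        print(round(direct, 6), round(fast, 6))
```

The decomposition replaces 64 multiplications by 32 per 8-point transform, at the cost of 8 additions or subtractions in the first layer.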

Coefficients coding

• Fixed point
• Example: n = 6

Coefficient  Positive value     Negative value
α            0000000000010110   1111111111101010
β            0000000000011101   1111111111100011
δ            0000000000001100   1111111111110100
λ            0000000000011111   1111111111100001
γ            0000000000011010   1111111111100101
μ            0000000000010001   1111111111101111
ν            0000000000000110   1111111111111010
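The positive codes listed for the coefficients are consistent with truncating each coefficient multiplied by 2^5 and storing it in 16 bits, with negative values in two's complement. That reading is an assumption inferred from the numbers, not something the slide states, and the function name `to_fixed` is hypothetical. Under this rule the code for -γ comes out as 1111111111100110, whereas the slide lists 1111111111100101; all other entries match.

```python
import math

FRACTION_BITS = 5  # assumed scale factor 2**5, inferred from the listed values
WORD_BITS = 16


def to_fixed(value):
    """Return the 16-bit two's complement code of value * 2**FRACTION_BITS, truncated."""
    magnitude = int(abs(value) * (1 << FRACTION_BITS))  # drop the fractional part
    signed = magnitude if value >= 0 else -magnitude
    return format(signed & ((1 << WORD_BITS) - 1), "016b")


if __name__ == "__main__":
    coefficients = {
        "alpha": math.cos(math.pi / 4),
        "beta": math.cos(math.pi / 8),
        "delta": math.sin(math.pi / 8),
        "lambda": math.cos(math.pi / 16),
        "gamma": math.cos(3 * math.pi / 16),
        "mu": math.sin(3 * math.pi / 16),
        "nu": math.sin(math.pi / 16),
    }
    for name, value in coefficients.items():
        print(name, to_fixed(value), to_fixed(-value))
```

With only five fractional bits the coefficients carry a relative error of a few percent, which is a deliberate trade of accuracy for a narrow multiplier.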

DCT implementation on two Dnode layers

• ADD and SUB on the first Dnode layer
• Multiply-accumulate operations (MAC) on the second Dnode layer

[Figure: the first layer (Dnode1: +, Dnode2: -) receives x_n and x_(N-1)-n and outputs x_n + x_(N-1)-n and x_n - x_(N-1)-n. The second layer (two MAC Dnodes) receives these values and the coefficients, and produces z0, z2, z4, z6 and z1, z3, z5, z7.]

Computing z0 and z1:

t  n  Configuration                            First layer inputs  First layer outputs  Coefficients (second layer)
0  0  M0: N1:add(fifo,fifo) N2:sub(fifo,fifo)  x0, x7
1  1  M1: N1:MAC(in1,fifo) N2:MAC(in2,fifo)    x1, x6              x0 + x7, x0 - x7     1, λ
2  2                                           x2, x5              x1 + x6, x1 - x6     1, γ
3  3                                           x3, x4              x2 + x5, x2 - x5     1, μ
4  4                                                               x3 + x4, x3 - x4     1, ν
5  0  M1: N1:clear N2:clear                                                             (z0 and z1 are output)

Results
• 2 transforms issued every 5 machine cycles
• "Clear" performed during addition
• 20 cycles for 8 samples

Achievable parallelism on an 8-Dnode structure: Ring-8

[Figure: Ring-8 with four layers M0 to M3, each with two Dnodes (Dnode1, Dnode2), separated by switches and driven by Config blocks. One part of the ring computes the 1D DCT of the 4 first lines, the other part the 1D DCT of the 4 last lines.]

Overall performances

Step                              Layers   Cycles
2 partial transforms                       5
1 line, 8 partial transforms               20
4 lines, 32 partial transforms    M0, M1   80
8 columns, 64 transforms          M2, M3   80

2D DCT on 8 points: 160 CYCLES

[Figure: the 8x8 block of transformed coefficients z'_{0,0} ... z'_{7,7}, filled in as the computation proceeds.]

Comparisons: execution time (cycles)

[Figure: bar chart of the number of cycles (axis from 0 to 400) for CPU64, TM-1000, 320C62, Ring-8, Ring-64, Pentium I, Pentium II and NEC V830.]

VLIW: CPU64, TM1000, TI 320C60
Superscalar: Pentium I, Pentium II, NEC V830