1HJ94, Spring 2004, Ingrid Verbauwhede, lecture 9

Architecture exploration Lecture 9

Ingrid Verbauwhede

Departement Elektrotechniek, afdeling ESAT/COSIC


Motivation

• Architecture exploration

• Specification: MATLAB, SPW, C/C++, Java

• Floating point

• Fixed point

• Algorithm transformations

• Architecture alternatives

Bit parallel (Bit serial)

– ASIC, special purpose (Art Designer)
– Retargetable coprocessor (Target Compiler Technologies)
– DSP extensions to RISC (Gezel, Tensilica)
– DSP processors (TI TMS320C54x, TMS320C55x, ADI Blackfin, etc.)


References

• The origins:
– E.A. Lee, “Programmable DSP Processors,” Part I, IEEE ASSP Magazine, October 1988, pp. 4-19.
– Part II, IEEE ASSP Magazine, January 1989, pp. 4-14.

• Continue on this:
– I. Verbauwhede, C. Nicol, “Low power DSP's for wireless communications,” 2000 International Symposium on Low Power Electronics and Design (ISLPED), July 2000.
– I. Verbauwhede, P. Schaumont, C. Piguet, B. Kienhuis, “Architectures and design techniques for energy efficient embedded DSP and multimedia,” 2004 Design Automation and Test in Europe (DATE 2004).


Today

• SOC components (continued)
– DSP processors
– VLIW processors

• Design of SOC itself


DSP Processors

• Today’s general purpose, assembly coded DSP: 100 MOPS, 250 mW, $40
• Low cost, low power DSPs (mobile terminals): 200-1000 MOPS, < 100 mW, $10. Highly optimized, domain specific processors.
• High performance DSPs (infrastructure): 1-10 GOPS, 1-5 watts, < $50. Compiler friendly, VLIW type of DSP processors.


DSP processors

• Last lecture: DSP = domain specific processor
– Highly optimized for wireless communication
– EVERY component of the processor:
   • Datapath = MAC
   • Memory = Harvard or Modified Harvard
   • Address arithmetic: indirect, modulo, bit reverse (FFT)
   • Control: CISC with specialized instruction set
– Example of FIR calculation

• Today:
– Pipeline specifics of DSP processors
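The modulo and bit-reverse address modes from the last-lecture recap can be illustrated with a minimal sketch. The function names `modulo_next` and `bit_reverse` are hypothetical and only model the arithmetic an address generation unit would do; they are not taken from any particular DSP.

```python
def modulo_next(ptr, step, base, length):
    """Post-modify a pointer inside a circular buffer [base, base + length)."""
    return base + (ptr - base + step) % length


def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of index (FFT input/output reordering)."""
    reversed_index = 0
    for _ in range(bits):
        reversed_index = (reversed_index << 1) | (index & 1)
        index >>= 1
    return reversed_index


# Walking a 4-entry circular buffer that starts at address 4.
ptr = 4
visited = []
for _ in range(6):
    visited.append(ptr)
    ptr = modulo_next(ptr, 1, base=4, length=4)
print(visited)

# Access order for an 8-point FFT.
print([bit_reverse(i, 3) for i in range(8)])
```

Modulo addressing lets a FIR delay line wrap around without explicit compare instructions; bit-reversed addressing produces the shuffled access order of a radix-2 FFT.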



Pipelining:

Pipeline stages, in order over time: Fetch, Decode, Memory Access, Execute

• Fetch = fetch instruction
• Decode = decode instruction
• Memory access = address generation and read operands
• Execute = perform operation
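As a small illustration of how instructions overlap in these four stages, the sketch below builds a reservation table for an ideal pipeline without stalls. The helper name `pipeline_table` and the instruction labels are assumptions made for the example.

```python
STAGES = ["Fetch", "Decode", "Memory access", "Execute"]


def pipeline_table(instructions):
    """Return {cycle: {stage: instruction}} for an ideal pipeline (no stalls)."""
    table = {}
    for issue_cycle, instr in enumerate(instructions):
        for offset, stage in enumerate(STAGES):
            # Instruction i enters stage k in cycle i + k.
            table.setdefault(issue_cycle + offset, {})[stage] = instr
    return table


table = pipeline_table(["i1", "i2", "i3"])
for cycle in sorted(table):
    print(cycle, table[cycle])
```

In cycle 3 the first instruction executes while the second reads its operands and the third is being decoded.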


Pipelining

How does the pipeline appear to the programmer? Lee’s paper (Part II) discusses 3 variations (the difference is often blurry):
• interlocking
• time stationary coding
• data stationary coding

Trade-off between efficiency and “ease-of-use”

Interlocking: the instructions appear as if executed one after another


Interlocking on C10

Instruction sequence: LT, MPY, LTD, MPY, . . .

[Figure: reservation table over instruction cycles with rows PMEM (LT, MPY, LTD, MPY), DMEM (data, coef1, data, coef2), MPY and ALU.]


Interlocking on C2x

The programmer does not know the pipeline. If an access conflict occurs, the hardware will “stall” and finish one (part of an) instruction before finishing a second part.

RPTK 49
MACD

[Figure: reservation table with rows PMEM (RPTK, MACD, coef1, coef2, coef3, . . .), DMEM (data1, data2, . . .), MPY and ALU.]


Time stationary

An instruction specifies “one instruction cycle”, so it specifies all that occurs in parallel.

Examples:

Motorola:
MAC X0,Y0,A   X:(R0)+,X0   Y:(R4)-,Y0
(multiply-accumulate of values read from memory in the previous cycle)

Lucent 16x:
a0 = a0 + p, p = x * y, y = *r0++, x = *pt++
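The effect of time stationary coding can be seen by simulating the Lucent-style instruction cycle by cycle. This is a minimal behavioural sketch, assuming that every right-hand side reads the register values from before the cycle; the function name `time_stationary_mac` and the two extra drain cycles are illustrative choices, not taken from the DSP16 documentation.

```python
def time_stationary_mac(data, coef):
    """Repeat 'a0 = a0 + p, p = x * y, y = *r0++, x = *pt++' and return a0."""
    a0 = p = x = y = 0
    n = len(data)
    # n cycles to load all operands, plus 2 cycles to drain the pipeline.
    for cycle in range(n + 2):
        # All four operations use the OLD register values (parallel update).
        new_a0 = a0 + p
        new_p = x * y
        new_y = data[cycle] if cycle < n else 0
        new_x = coef[cycle] if cycle < n else 0
        a0, p, x, y = new_a0, new_p, new_x, new_y
    return a0


result = time_stationary_mac([1, 2, 3], [4, 5, 6])
print(result)
```

Each instruction works on three different samples at once: it accumulates the product of the sample two cycles back, multiplies the previous sample, and loads the next one.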


Data stationary

Time stationary: working on different samples in one instruction.
Data stationary: describes what happens with one input data from start to end.

Example (Lode):

*r3++ = a0 += a2 * *r2++;
(read from memory with pointer register r2, multiply with a2, add to a0 and store back in a0, store the result in memory with pointer r3, post-modify r2 and r3)

Pipeline stages: Fetch, Decode, Read, Execute, Write
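The semantics of the data stationary example can be written out step by step. The sketch below is a behavioural model only: memory is a Python list, the register file is a dictionary, and the function name `data_stationary_step` is hypothetical.

```python
def data_stationary_step(mem, regs):
    """Model of '*r3++ = a0 += a2 * *r2++' for one input sample."""
    operand = mem[regs["r2"]]           # read from memory with pointer r2
    regs["a0"] += regs["a2"] * operand  # multiply with a2, accumulate in a0
    mem[regs["r3"]] = regs["a0"]        # store the result with pointer r3
    regs["r2"] += 1                     # post-modify both pointers
    regs["r3"] += 1


mem = [3, 5, 0, 0]
regs = {"r2": 0, "r3": 2, "a0": 1, "a2": 2}
data_stationary_step(mem, regs)
data_stationary_step(mem, regs)
print(mem, regs["a0"])
```

One instruction follows a single sample from the memory read to the write-back, even though the hardware spreads these actions over several pipeline stages.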



Control & Pipeline for DSP’s

RISC: load/store machine, memory access with load/store instructions (DLX, MIPS, D10V)

Pipeline: Fetch, Decode, Execute (execution / address generation), Memory Access (memory access / branch), Write Back

Excellent for complex decision making!

DSP: register-memory architecture (TI, Lucent, HX, Lode)

Pipeline: Fetch, Decode, Memory access, Execution, Write Back

Excellent for number crunching!


Pipeline RISC compared to DSP

RISC example:

r0 = *p0;       // load data
a0 = a0 + r0;   // execute

Pipeline: Fetch, Decode, Execute, Memory Access

Too expensive for DSP

DSP: memory intensive applications:

Pipeline: Fetch, Decode, Memory Access, Execute

Penalty: data dependent branch is expensive


Application domain: wireless communications

[Figure: block diagram of a cellular handset. RF board: receiver, transmit, synthesizer, PA, TCXO. Baseband board: digital ASIC, microprocessor, DSP, external memories, analog ASIC, audio codec, power supply, battery pack, keypad and display.]


Performance requirements: digital cellular phone

[Figure: receive chain: RF receive, demodulation, channel decoder, speech decoder. Transmit chain: speech encoder, channel encoder, modulation, RF send. The blocks are grouped under the labels Communication and Application.]

Goal: Minimum “MIPS” to get the job done.


Application Domain: compute intensive functions

Source encoder/decoder = speech coders
Advanced vocoders for improved speech quality & higher capacity. Example: ACELP derivatives for GSM and IS136A

• Digital filtering (FIR, IIR)

• Vector quantization, code book search (square distance computation)

Channel encoder/decoder = error correcting
Complex wireless modems:

• Galois field arithmetic

• Convolution coders based on Viterbi trellis search

• Turbo coders


Compute intensive functions: evolution of DSP’s

Simple FIR example

Square distance for speech processing

Speed-up of FIR example

Viterbi acceleration for communication algorithms

Evolution of DSPs follows these examples



The Viterbi Decoding (Introduction)

• Error Correcting Decoding Algorithm for Convolutional Code
• Trellis Representation
• Maximum Likelihood Decoding Algorithm
• GSM System


Convolutional Code (ex. Wyner-Ash Code)

• Generator matrix G(D) = [ 1  1+D ]
• Input sequence u(D) = 1, 1, 0, 1, 0, …
• Output sequence c(D) = u(D)G(D) = 11, 10, 01, 11, 01, …

[Figure: encoder with one delay element D.]
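The output sequence of this example can be reproduced with a few lines of code. The sketch assumes the encoder starts with its delay element at 0; the function name `encode_wyner_ash` is made up for the illustration.

```python
def encode_wyner_ash(bits):
    """Rate 1/2 encoder for G(D) = [1, 1+D]: outputs (u_t, u_t XOR u_{t-1})."""
    previous = 0  # content of the single delay element D, assumed to start at 0
    code = []
    for u in bits:
        code.append(f"{u}{u ^ previous}")
        previous = u
    return code


codeword = encode_wyner_ash([1, 1, 0, 1, 0])
print(codeword)
```

The first output bit is the input itself (the "1" in G(D)); the second is the modulo-2 sum of the current and the previous input bit (the "1+D").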



Constraint length K and Rate

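• v = 1, K = 2, 2 states

• Rate = 1/2: one input bit generates two coded output bits.

[Figure: encoder with one delay element D and its state diagram with states 0 and 1. Transitions are labelled input,output: 0,00 and 1,11 leave state 0; 0,01 and 1,10 leave state 1.]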


Trellis Representation

• Example G(D) = [ 1+D^2  1+D+D^2 ], v = 2, K = 3, 4 states

• Instead of writing a State Diagram, use a trellis

[Figure: encoder with two delay elements D, and the trellis for t = 0 to 4 over the states S00, S10, S01 and S11; each branch is labelled with its two output bits.]


Efficiency of Viterbi decoding

• Identifies the path through the trellis
– Selects a survivor path for each state by calculating the Hamming distance

• The total number of paths grows exponentially with the number of states
– As K increases, H/W complexity increases exponentially, but the error rate decreases


Viterbi Decoding Algorithm (1)

• Assume N = 7 blocks

Block               1    2    3    4    5    6    7
Information Data    0    1    1    0    1    0    0
Convolution Codes   00   11   10   10   00   01   11
Error Sequence      00   01   10   00   00   10   00
Received Data       00   10   00   10   00   11   11

The information data ends with the tail bits.

[Figure: trellis for t = 0 to 7 over the states S00, S10, S01 and S11, with each branch labelled by its two output bits.]


• Calculate Hamming Distance (Choose smaller one)

[Figure: trellis of the example for t = 0 to 7, annotated with the accumulated Hamming distance at each state for the first time steps.]



• Selecting the Optimal Path

[Figure: trellis of the example for t = 0 to 7 with the accumulated Hamming distance at every state; the survivor path with the smallest distance is selected.]
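The complete decoding example (N = 7 blocks, G(D) = [1+D^2, 1+D+D^2]) can be checked with a small hard-decision Viterbi decoder. This is a sketch under stated assumptions: the state is stored as the integer 2*u(t-1) + u(t-2), the encoder starts in S00, and because of the tail bits the traceback starts from S00. The names `branch`, `encode` and `viterbi_decode` are hypothetical.

```python
def branch(state, u):
    """Return (next_state, output bits) for input bit u in the given state."""
    a, b = state >> 1, state & 1          # a = u(t-1), b = u(t-2)
    return (u << 1) | a, (u ^ b, u ^ a ^ b)


def encode(bits):
    state, code = 0, []
    for u in bits:
        state, out = branch(state, u)
        code.append(out)
    return code


def viterbi_decode(received):
    """Return (decoded bits, final path metric) using Hamming distances."""
    inf = float("inf")
    metric = [0, inf, inf, inf]           # start in S00
    history = []
    for r in received:
        new_metric, survivor = [inf] * 4, [None] * 4
        for state in range(4):
            if metric[state] == inf:
                continue
            for u in (0, 1):
                nxt, out = branch(state, u)
                dist = metric[state] + (out[0] ^ r[0]) + (out[1] ^ r[1])
                if dist < new_metric[nxt]:  # add-compare-select
                    new_metric[nxt], survivor[nxt] = dist, (state, u)
        metric = new_metric
        history.append(survivor)
    state, bits = 0, []                   # tail bits force the end state S00
    for survivor in reversed(history):
        state, u = survivor[state]
        bits.append(u)
    return bits[::-1], metric[0]


info = [0, 1, 1, 0, 1, 0, 0]
received = [(0, 0), (1, 0), (0, 0), (1, 0), (0, 0), (1, 1), (1, 1)]
decoded, distance = viterbi_decode(received)
print(decoded, distance)
```

With the received data of the example, the decoder recovers the information data even though three of the fourteen received bits are in error.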



Traceback

• We cannot wait for the end of sequence for some applications

• The amount of “delay” is called the traceback depth L_D.
– Larger L_D gives better performance, but needs more memory and complexity


Viterbi in GSM

• Full-rate speech channel: 22.8 kbps, Rate = 1/2, K = 5

• Half-rate speech channel: 11.4 kbps, Rate = 1/3, K = 7


Required Performance


Compute Intensive function 2: Viterbi

[Figure: Viterbi butterfly. Old states i and i + s/2 connect to new states 2i and 2i+1, with branch metrics +a, -a, -a, +a.]

i = state index
s = number of states = $2^{k-1}$
w = decoding window

Basic equations:

$d(2i) = \min \{\, d(i) + a,\; d(i + s/2) - a \,\}$
$d(2i + 1) = \min \{\, d(i) - a,\; d(i + s/2) + a \,\}$

IS-95: k = 8, w = 192, corresponds to $2^{7} \times 192 \times$ (cycles for one ACS)

Basic algorithm in Viterbi channel decoders, modified version in turbo decoders.

Key operation: Add-Compare-Select (ACS)
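The butterfly equations translate directly into one add-compare-select pass over all states. The sketch below assumes one branch metric a per butterfly, passed in as a list, and records a decision bit of 1 when the path from state i + s/2 wins; the name `acs_step` and this decision convention are assumptions for the illustration.

```python
def acs_step(d, a):
    """One trellis step: d holds s path metrics, a holds s/2 branch metrics."""
    s = len(d)
    half = s // 2
    new_d = [0] * s
    decisions = [0] * s
    for i in range(half):
        upper, lower = d[i], d[i + half]
        # d(2i) = min{ d(i) + a, d(i + s/2) - a }
        c0, c1 = upper + a[i], lower - a[i]
        new_d[2 * i], decisions[2 * i] = min(c0, c1), int(c1 < c0)
        # d(2i + 1) = min{ d(i) - a, d(i + s/2) + a }
        c0, c1 = upper - a[i], lower + a[i]
        new_d[2 * i + 1], decisions[2 * i + 1] = min(c0, c1), int(c1 < c0)
    return new_d, decisions


metrics, bits = acs_step([0, 3, 1, 2], [1, -2])
print(metrics, bits)
```

Each butterfly needs two additions, two subtractions, two comparisons and two selections, which is why DSPs add dedicated hardware for this operation.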



Viterbi on Atmel’s Lode

Two MAC units & ALU: Add-Compare-Select

• DMAC operates as dual add/subtract unit

• ALU finds minimum

• Shortest distance saved

• Path indicator saved

• 4 cycles / butterfly

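[Figure: datapath. Data buses DB0(16) and DB1(16) feed MAC0 and MAC1, which add the branch metrics µ1 and µ2 to the path metrics Γ1 and Γ2 using accumulators A0 and A1; the ALU computes Min(), the new Γ is kept in an accumulator (A2, A3) and the decision bit goes to memory.]

Γ = min [(Γ1 + µ1), (Γ2 + µ2)]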


Viterbi on TI C54x

ALU and CSSU: Add-Compare-Select

• ALU splits in 16 bit halves

• ACC splits in half

• Shortest distance saved

• CSSU compares halves

• Path indicator saved

• 4 cycles / butterfly

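[Figure: datapath. Data buses DB0(16) and DB1(16) and TREG feed the ALU, which adds µ1 and µ2 to Γ1 and Γ2 in the two halves of the accumulator; the CSSU compares the halves (MSW/LSW select), Γ goes over data bus EB to memory and the decision bit goes to the TRN register.]

Γ = min [(Γ1 + µ1), (Γ2 + µ2)]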


Viterbi on LU DSP16210

GSM (K=5, 16 states)

do 8 {
   a0=a4+y   a1=a5-y   *r3++=a0h
   a2=a4-y   a3=a5+y   *r5++=a2h
   a0=cmp1(a1,a0)   yh=*r0   r0=r1+j   j=k   k=*pt1++
   a2=cmp1(a3,a2)   a4_5h=*pt0++
}

• Hardware support for Viterbi algorithm:
– ACS calculations are efficient
– Minimal overhead

• 4 cycles per butterfly
– 32 cycles per GSM timeslot.

• Comparison functions store ACS decision bits:

[Figure: successive a0=cmp1(a1,a0) and a2=cmp1(a3,a2) operations shift decision bits into AR0; results are written to memory.]

Courtesy: Gareth Hughes, Bell Labs Australia


BUT: DSP Software Development

• Complex DSP architecture not amenable to compiler technology

• Algorithms are modeled in high level language (e.g. C++)

• Solutions are implemented and debugged in hand-optimized assembler - large development effort with minimal tool support

[Figure: development flow from HLL algorithmic model to prototype code to production code in hand coded assembler, with an optimize & debug loop.]

Long, frustrating time to market

Fragile legacy code

Widely used in handhelds, but basestations are changing to VLIW


2G Basestation Baseband Processing

• Multiple DSPs used for baseband processing.
• RISC Microcontroller for timing, framing, I/O control
• Software upgradable over the network
• DSPs dominate cost and power consumption

[Figure: Tx/Rx baseband processing board for 2-carrier GSM basestation. AFEs, multiple DSPs with RAM for channel equalization, channel de/coding and encryption on the Tx and Rx paths, a RISC microcontroller, I/O blocks, an ASIC and a T1/E1 interface.]

Future trend: integrate baseband processing, low cost Pico BTS


Compiler Driven VLIW

Large orthogonal register set, regular interconnect

[Figure: data memory, register array and interconnect feeding the execution units ex1 (alu), ex2 (alu), ex3 (mpy), ex4 (ld/st), . . ., exn (ld/st).]

Instruction format: cond/branch | ex1 | ex2 | ex3 | ….. | exn

Atomic RISC-like operations => heavily pipelined, high freq. clock


Explicitly Parallel Instruction Computing

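Execution Clusters

[Figure: data memory shared by two clusters, each with its own register array and interconnect; execution units ex1 (alu), ex2 (alu), ex3 (ld/st), ex4 (alu), ex5 (mpy), ex6 (ld/st).]

Execution Sets

[Figure: a fetch set with the bit pattern 1 1 1 0 1 0 1 0, which marks how the fetched instructions are grouped into execution sets.]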


Texas Instruments ‘C6201

[Figure: block diagram. Program Memory (16K x 32) with 256-bit fetch, Instruction Dispatch & Decode, Register Bank A (16 x 32) and Register Bank B (16 x 32), each with ALU, shift, mpy and add units, and Data Memory (32K x 16).]

• 8-way VLIW with two execution clusters
• 256 bit (8x32) instruction fetch with variable length execute set
• Each 32 bit instruction individually predicated
• 11 stage pipeline
• 1600 MIPS, 400 MMACs @ 200 MHz


FIR Filter on TI ‘C6x

loop:

ldw .d1t1 *a4++,a5

|| ldw .d2t2 *b4++,b5

||[b0] sub .s2 b0,1,b0

||[b0] b .s1 loop

|| mpy .m1x a5,b5,a6

|| mpyh .m2x a5,b5,b6

|| add .l1 a7,a6,a7

|| add .l2 b7,b6,b7

Hand-coded assembly: 32-tap FIR filter

• Outer Loop: 23 cycles, 180 bytes
– 1 cycle in inner loop

• All 8 exec units used in inner loop: maximum efficiency
– 2 MACs per cycle

Assembly syntax more difficult to learn.
Hard to get full use of all 8 execution units at once.
Software pipelining difficult to implement, and requires longer prolog/epilog (larger code size).
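What the single-cycle inner loop computes can be modelled in a few lines. This is a behavioural sketch only, assuming an even number of taps: it mirrors the two multiply/add pairs (mpy with add on the A side, mpyh with add on the B side) with two independent accumulators, and ignores packing of two 16-bit values per word, pipelining and branch delay. The name `fir_dual_mac` is hypothetical.

```python
def fir_dual_mac(samples, coefs):
    """Dot product with two accumulators, two multiply-accumulates per step."""
    assert len(samples) == len(coefs) and len(samples) % 2 == 0
    acc_a = acc_b = 0
    for i in range(0, len(samples), 2):
        acc_a += samples[i] * coefs[i]          # models mpy  + add on side A
        acc_b += samples[i + 1] * coefs[i + 1]  # models mpyh + add on side B
    return acc_a + acc_b                        # final sum outside the loop


output = fir_dual_mac([1, 2, 3, 4], [5, 6, 7, 8])
print(output)
```

Splitting the sum over two accumulators removes the dependency between consecutive MACs, so both multipliers can be kept busy every cycle; a 32-tap filter then needs 16 iterations of the inner loop.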

Courtesy: Gareth Hughes: Bell Labs Australia


Viterbi on TI ‘C6x

LOOP:
   [b1]  b     .s1  LOOP
|| [b1]  sub   .s2  b1,1,b1
|| [!a2] sth   .d1  b12,*+a6[8]
|| [!a2] add   .d2  b0,b14,b14
||       cmpgt .l1  a11,a10,a1
||       cmpgt .l2  b11,b10,b0
||       mpy   .m1x 1,b5,a4

   [a2]  sub   .s1  a2,1,a2
|| [!a2] sth   .d1  a12,*a6++
|| [a1]  add   .s2  2,b0,b0
|| [b0]  mpy   .m2  1,b11,b12
||       mpy   .m1  1,a10,a12
||       sub   .l2x a7,b5,b10
||       ldh   .d2  *++b9,b5

         shl   .s2  b14,2,b14
|| [a1]  mpy   .m1  1,a11,a12
||       add   .s1  a7,a4,a10
||       sub   .l1x b13,a4,a11
||       add   .l2  b13,b5,b11
||       mpy   .m2  1,b10,b12
||       ldh   .d2  *b4++[2],a7
||       ldh   .d1  *a5++[2],b13
; end of LOOP

[Table: utilization of execution units in Viterbi decoder. Rows are the eight units (.D1, .D2, .M1, .M2, .L1, .L2, .S1, .S2), columns are cycles 0 to 31; the 3-cycle 2-ACS inner loop is repeated x 8.]

• 16-state Viterbi decoder for GSM from TI WWW site: ftp://ftp.ti.com/pub/tms320bbs/c62xfiles/vitgsm.asm
– 3 cycles per butterfly
– 32 cycles per GSM timeslot (8 butterflies)
– MPY instructions used to move data


Lucent / Motorola Star*Core SC140

• 6-way VLIW with 128 bit (8x16) instruction fetch
• Prefix instructions for high performance without sacrificing code density
• Each execution set (parallel instructions + prefix) predicated
• 5 stage pipeline
• 1800 MIPS, 1200 MMACs @ 300 MHz

[Figure: block diagram. Program / Data Memory, Program Sequencer, Instruction Dispatcher, Address Registers (27) with two AAUs, and Data Registers (16) with four units that each contain MAC, ALU and BFU.]


Viterbi on Star*Core

• Hardware support for Viterbi algorithm:
– max2vit instruction
– vsl instruction

• 1 cycle per butterfly through software-pipelining

• Decision bits are manually stored using the Viterbi Shift Left (VSL) instruction:

GSM (K=5, 16 states)

[ move.2l (r0)+,d0:d1   move.2l (r1)+,d1:d2 ]
[ add2 d0,d4   sub2 d6,d2
  sub2 d4,d0   add2 d2,d6 ]
[ max2vit d4,d2   max2vit d0,d6 ]
[ vsl.4w d2:d6:d1:d3,(r2)+n0
  vsl.4f d2:d6:d1:d3,(r3)+n0 ]

[Figure: dataflow of max2vit d4,d2 / max2vit d0,d6 and vsl.4w d2:d6:d1:d3,(r2)+n0 through SR, D1, D3, D2 and D6 (decisions and path metrics); results are written to memory, repeated x 4.]

Courtesy: Gareth Hughes, Bell Labs Australia


SOC


Energy-Efficient SoC are distributed

[‘Under the Hood’, EET, D. Carey, 9/5/02]

T-Mobile PocketPC Phone:

• TI baseband DSP
• HTC interface ASIC
• TI power management
• Intel 32Mb Flash
• Intel 128Mb Flash
• Winbond 128Mb SDRAM
• TI RF synth
• TI RF TX/RX
• Conexant power amp
• Intel StrongArm
• Sony LCD interface
• Sony 240x320 color LCD
• Philips audio codec
• Touchscreen, SIM, MMIC, expansion


PalmPilot i705: architecture tuned to application

• Display
• AD7873 digitizer
• Motorola DragonBall
• 8M SDRAM
• 4M FLASH
• FPGA
• Philips USB
• Maxim transceivers
• Agere POM baseband
• Motorola transceiver
• RF Micro power amp
• Maxim control
• Driver
• Memory card slot


Energy-flexibility trade-off

[Figure: axes Power and Cost; design points labelled General Purpose, Fixed, Platform, Application and ASIC; an open region is marked “???”.]


Also general purpose architectures become heterogeneous.

[Figure: Xilinx device combining an IBM PowerPC® RISC CPU, synchronous dual-port RAM, SelectIO-Ultra™, SystemIO™ & XCITE™, Conexant 3.125 Gb serial links and XtremeDSP™. Source: Xilinx webpage]


Question

• Energy and flexibility are opposite demands!
• How to navigate in this jungle?
• 3D design space:
– Computational abstraction level
– Reconfigurable feature
– Binding rate

• Next question: how to map (or compile) an application onto such an architecture?


Flexibility (1) - Abstraction level


• Instruction set level = “programmable”

• CLB level = “reconfigurable”


Flexibility (2) - Reconfigurable feature

• Basic components:

Reconfigurable feature         Computation                  Storage            Communication
Systems                        Number & type of processes   Memory hierarchy   Interconnect network
Instruction set architecture   Custom instructions          Register set       Size of address/data bus
Micro-architecture             Execution unit type          Register file      Cross-bar, busses
Implementation                 CLB                          RAM details        Switches, muxes


Flexibility (3) - Binding rate

Compare processing to binding
• Configurable (“compile-time”)
• Re-configurable
• Dynamic reconfigurable (“adaptive”)


SOC architecture: RINGS

[Figure: RINGS SOC architecture. CPU, RF, memory and domain-specific engines (networking, video engine, baseband processing, signal processing DSP) connected by a reconfigurable interconnect. Each domain is annotated with its abstraction levels, for example standard, algorithm, architecture, µarchitecture, circuit, hardware, software and assembly.]


Instruction set extension

• Instruction set extension
• Register mapped
• Tightly coupled
• Experiment: DFT

1000 iterations   SW on embedded proc.   SW with HW datapath   Improvement
Energy            67.6 mJ                5.76 mJ               12.5 times


Co-processor

• Memory mapped
• Loosely coupled
• Local memory
• Experiment: AES

175 iterations   SW on emb. proc.   SW with HW datapath   Improvement
Energy           89.2 mJ            13.5 mJ               25 times



Independent IP

• Loosely coupled
• Network on chip connected (routers)
• Flexible interconnect
• Experiment: TCP/IP checksum

100 packets   SW on emb. proc.   HW datapath   Improvement
Energy        17.0 mJ            0.20 mJ       84 times


Example: The Security Pyramid

[Figure: security pyramid with the abstraction levels Protocol (identification, confidentiality, integrity), Algorithm (Kasumi, Rijndael, RC4, MD5, …), Architecture (Java, JCA, JVM; CPU, crypto, MEM), Micro-Architecture and Circuit (D, Q, CLK, Vcc).]


Example: AES Coprocessor

[Figure: AES coprocessor core with input FSM, processing FSM and output FSM around the Encrypt and Key Schedule datapaths; instruction, handshake and round key signals, with bus widths 16 and 256. [DAC 2002]]


Design options: AES acceleration: Gbits/Joule

AES 128bit key, 128bit data   Throughput       Power     Figure of Merit (Gb/s/W)
0.18µm CMOS                   2 Gbits/sec      56 mW     35.7 (1/1)
FPGA [1]                      1.32 Gbit/sec    490 mW    2.7 (1/11)
Asm, Pentium III [2]          648 Mbits/sec    41.4 W    0.015 (1/1900)
C, Emb. Sparc [3]             133 Kbits/sec    120 mW    0.0011 (1/33000)
Java, Emb. Sparc [4]          450 bits/sec     120 mW    0.0000037 (1/9.600.000)

[1] Amphion CS5230 on Virtex2 + Xilinx Virtex2 Power Estimator
[2] Helger Lipmaa PIII assembly handcoded + Intel Pentium III (1.13 GHz) Datasheet
[3] gcc, 1 mW/MHz @ 120 MHz Sparc, assumes 0.25 µm CMOS
[4] Java on KVM (Sun J2ME, non-JIT) on 1 mW/MHz @ 120 MHz Sparc, assumes 0.25 µm CMOS
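The figure of merit is throughput divided by power, and Gb/s/W is the same unit as Gbit per joule. The sketch below recomputes it from the throughput and power values of the comparison; the function name `figure_of_merit` and the dictionary layout are illustrative assumptions.

```python
def figure_of_merit(throughput_bits_per_s, power_watts):
    """Throughput per watt in Gb/s/W, which equals Gbit per joule."""
    return throughput_bits_per_s / 1e9 / power_watts


# (throughput in bit/s, power in W), values as listed in the comparison
designs = {
    "0.18um CMOS": (2e9, 56e-3),
    "FPGA": (1.32e9, 490e-3),
    "Asm, Pentium III": (648e6, 41.4),
    "C, emb. Sparc": (133e3, 120e-3),
    "Java, emb. Sparc": (450, 120e-3),
}

for name, (throughput, power) in designs.items():
    print(f"{name:18s} {figure_of_merit(throughput, power):.7f} Gb/s/W")
```

The spread covers roughly seven orders of magnitude between the dedicated CMOS implementation and interpreted Java on an embedded processor.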



Conclusion

Applications mapped onto architectures, with design methods = Low Power!

