

HJ94, Spring 2006, Ingrid Verbauwhede, lecture 9

Domain specific processors Lecture 9

Ingrid Verbauwhede

Departement Elektrotechniek, afdeling ESAT/COSIC



Overview

• Lecture 1: what is a system-on-chip
• Lecture 2: terminology for the different steps
• Lecture 3: models of computation
• Lecture 4: two MOC’s: SDFG & control flow
• Lecture 5: control flow & FIR example
• Lecture 6: fixed point refinement
• Lecture 7: architecture exploration
• Lecture 8: DSP Processors
• Lecture 9 – DSP: Domain specific processors



HJ94 goal: Skiing down a mountain

[Figure: design flow from the Specification (SPW, Matlab, C++) down through Algorithm Transformations (pipelining, unrolling), Memory Transformations and Optimizations (loop merging, compaction) and Floating-point to Fixed-point refinement (40 bit accumulator) to the target architectures: ASIC, Special Purpose, Retargetable coprocessor, DSP processor, DSP-RISC, RISC]

• DSP = one class of domain specific processors


References

• The origins:
  – E.A. Lee, “Programmable DSP Processors,” Part I, IEEE ASSP Magazine, October 1988, pp. 4-19.
  – Part II, IEEE ASSP Magazine, January 1989, pp. 4-14.
• Good overview:
  – P. Lapsley, J. Bier, A. Shoham, E.A. Lee, “DSP Processor Fundamentals: Architectures and Features,” IEEE Press, 1998.
• Domain specific processor:
  – I. Verbauwhede, “Low Power DSPs”, Chapter 19 in Low Power Electronics and Design, edited by Christian Piguet, CRC Press, 2005.
• Other domains: security and cryptography, wireless communications



DSP processors

• Last lecture: DSP = domain specific processor
  – Highly optimized for wireless communication
  – EVERY component of the processor:
    • Datapath = MAC
    • Memory = Harvard or Modified Harvard
    • Address arithmetic: indirect, modulo, bit reverse (FFT)
    • Control: CISC with specialized instruction set
  – Example of FIR calculation
• Today:
  – More domain specific processors
  – Type of co-processors


Application domain: wireless communications

[Figure: block diagram of a cellular phone. RF Board: Receiver, Transmit, Synthesizer, PA, TCXO. Baseband board: External Memories, Digital ASIC, MicroProcessor, DSP, Analog ASIC, Audio Codec, Power Supply, Battery Pack, plus display (“No network”) and keypad]

DSP is an example of a “domain specific” processor



Performance requirements: digital cellular phone

[Figure: receive chain RF Receive, Demodulation, Channel decoder, Speech decoder; transmit chain Speech encoder, Channel encoder, Modulation, RF Send. The chains span a Communication part and an Application part]

Goal: Minimum “MIPS” to get the job done.


Application Domain: compute intensive functions

Source encoder/decoder = speech coders
Advanced vocoders for improved speech quality & higher capacity.
Example: ACELP derivatives for GSM and IS136A
• Digital filtering (FIR, IIR)
• Vector quantization, code book search (square distance computation)

Channel encoder/decoder = error correcting
Complex wireless modems:
• Galois field arithmetic
• Convolution coders based on Viterbi trellis search
• Turbo coders



Compute intensive functions: evolution of DSP’s

Simple FIR example

Square distance for speech processing

Speed-up of FIR example

Viterbi acceleration for communication algorithms

Evolution of DSPs follows these examples


The Viterbi Decoding (Introduction)

• Error Correcting Decoding Algorithm for Convolutional Code
• Trellis Representation
• Maximum Likelihood Decoding Algorithm
• GSM System



Convolutional Code (ex. Wyner-Ash Code)

• Generator matrix G(D) = [ 1   1+D ]
• Input sequence u(D) = 1, 1, 0, 1, 0, …
• Output sequence c(D) = u(D)G(D) = 11, 10, 01, 11, 01, …

[Figure: encoder with one delay element D]
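The output sequence above can be reproduced with a few lines of Python. This is a minimal sketch, assuming the delay element starts at 0; the function name `encode_wyner_ash` is chosen here for illustration and does not come from the lecture.

```python
def encode_wyner_ash(bits):
    """Rate-1/2 encoder for G(D) = [1, 1+D], delay element initially 0."""
    prev = 0                      # content of the single delay element D
    out = []
    for u in bits:
        c1 = u                    # first output bit: u(D) * 1
        c2 = u ^ prev             # second output bit: u(D) * (1 + D), addition mod 2
        out.append(f"{c1}{c2}")
        prev = u                  # shift the input into the delay element
    return out


print(encode_wyner_ash([1, 1, 0, 1, 0]))   # ['11', '10', '01', '11', '01']
```

Each input bit produces two output bits, which is the rate 1/2 discussed on the next slide.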


Constraint length K and Rate

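• v = 1, K = 2, 2 states

• Rate = 1/2, one input bit generates two coded output bits.

[Figure: encoder with one delay element D and its state diagram with states 0 and 1; the transitions are labeled input,output: 0,00  1,10  1,11  0,01]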



Trellis Representation

• Example G(D) = [ 1+D^2   1+D+D^2 ], v = 2, K = 3, 4 states

• Instead of drawing a state diagram, the states are unrolled over time into a trellis.

[Figure: encoder with two delay elements (D, D) and its trellis for t = 0 to 4 over the states S00, S10, S01, S11; the branches are labeled with the output pairs 00, 11, 10, 01]


Efficiency of Viterbi decoding

• Identifies the path through the trellis
  --- Selects a survivor path for each state by calculating the Hamming distance

• The total number of paths grows exponentially with the number of states
  --- As K increases, H/W complexity increases exponentially, but the error rate decreases



Viterbi Decoding Algorithm (1)

• Assume N = 7 blocks

[Figure: trellis over the states S00, S10, S01, S11 for t = 0 to 7, with the branches labeled by their output pairs]

t                    1    2    3    4    5    6    7
Information Data     0    1    1    0    1    0    0
Convolution Codes   00   11   10   10   00   01   11
Error Sequence      00   01   10   00   00   10   00
Received Data       00   10   00   10   00   11   11

Tail Bit: the trailing zeros of the information data


[Figure: the same trellis (states S00, S10, S01, S11), annotated with the accumulated Hamming distances of the first steps]

• Calculate Hamming Distance (Choose smaller one)




• Selecting the Optimal Path


[Figure: the same trellis (states S00, S10, S01, S11) for t = 0 to 7, annotated with the accumulated Hamming distance at every state; the selected path ends in S00 with a total distance of 3]
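The three steps (build the trellis, accumulate Hamming distances while keeping the smaller one, select the optimal path) can be sketched in Python for the example code G(D) = [ 1+D^2   1+D+D^2 ]. This is an illustrative hard-decision decoder, not the lecture's implementation: the names `step`, `encode` and `viterbi` and the packing of the state as an integer are assumptions made here, and the traceback starts from state S00 because the example ends with tail bits.

```python
INF = float("inf")


def step(state, u):
    """One encoder step. state = 2*u[t-1] + u[t-2]; returns (next_state, (c1, c2))."""
    u1, u2 = state >> 1, state & 1
    c1 = u ^ u2                  # generator 1 + D^2
    c2 = u ^ u1 ^ u2             # generator 1 + D + D^2
    return (u << 1) | u1, (c1, c2)


def encode(bits):
    state, out = 0, []
    for u in bits:
        state, c = step(state, u)
        out.append(c)
    return out


def viterbi(received):
    """Return (decoded bits, final path metric) for a list of received bit pairs."""
    metric = [0, INF, INF, INF]              # the encoder starts in state S00
    history = []                             # per step: survivor (prev_state, input) per state
    for r in received:
        new_metric = [INF] * 4
        choice = [None] * 4
        for s in range(4):
            if metric[s] == INF:
                continue
            for u in (0, 1):
                ns, c = step(s, u)
                # add: accumulated distance plus Hamming distance of this branch
                d = metric[s] + (c[0] ^ r[0]) + (c[1] ^ r[1])
                # compare and select: keep the smaller one
                if d < new_metric[ns]:
                    new_metric[ns] = d
                    choice[ns] = (s, u)
        metric = new_metric
        history.append(choice)
    # traceback from S00, which the tail bits force the encoder into
    state, decoded = 0, []
    for choice in reversed(history):
        state, u = choice[state]
        decoded.append(u)
    return decoded[::-1], metric[0]


info = [0, 1, 1, 0, 1, 0, 0]
received = [(0, 0), (1, 0), (0, 0), (1, 0), (0, 0), (1, 1), (1, 1)]
print(encode(info))        # the Convolution Codes row: 00 11 10 10 00 01 11
print(viterbi(received))   # ([0, 1, 1, 0, 1, 0, 0], 3)
```

For the received data of the example, the decoder recovers the information data despite the three bit errors, and the final metric 3 equals the number of errors in the error sequence.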


Traceback

• For some applications we cannot wait for the end of the sequence

• The amount of “delay” is called the traceback depth L_D.

--- Larger L_D gives better performance, but needs more memory and complexity



Viterbi in GSM

• Full-rate speech channel: 22.8 kbps, Rate = 1/2, K = 5

• Half-rate speech channel: 11.4 kbps, Rate = 1/3, K = 7
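To relate these constraint lengths to decoder effort, the sketch below assumes the usual relation of 2^(K-1) trellis states (consistent with the K = 3, 4 states example earlier) and two states per butterfly. The helper name `trellis_size` is hypothetical.

```python
def trellis_size(K):
    """Return (number of states, butterflies per trellis stage) for constraint length K."""
    states = 2 ** (K - 1)        # assumption: K - 1 delay elements in the encoder
    return states, states // 2   # each butterfly updates two states


print(trellis_size(5))   # full-rate: (16, 8)
print(trellis_size(7))   # half-rate: (64, 32)
```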


Required Performance



Compute Intensive function 2: Viterbi

[Figure: Viterbi butterfly. The old states i and i + s/2 connect to the new states 2i and 2i+1 through branches with metrics +a, -a, -a, +a]

i = state index
s = # of states = 2^(k-1)
w = decoding window

Basic equations:

d(2i) = min { d(i) + a, d(i + s/2) - a }
d(2i + 1) = min { d(i) - a, d(i + s/2) + a }

IS-95: k = 8, w = 192, corresponds to 2^7 x 192 x (cycles for one ACS)

Basic algorithm in Viterbi channel decoders, modified version in turbo decoders.

Key operation: Add-Compare-Select (ACS)
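The two basic equations translate directly into one trellis stage of add-compare-select. The sketch below is an illustration under assumptions: it uses one branch metric `a[i]` per butterfly (the slide writes a single symbol a), and it records a decision bit per new state; the name `acs_stage` is made up here.

```python
def acs_stage(d, a):
    """One trellis stage of ACS.

    d: path metrics of the s old states.
    a: branch metric for each of the s/2 butterflies.
    Returns (new path metrics, decision bits).
    """
    s = len(d)
    half = s // 2
    new_d = [0] * s
    decision = [0] * s           # 0: survivor from state i, 1: survivor from state i + s/2
    for i in range(half):
        upper, lower = d[i], d[i + half]
        # d(2i) = min { d(i) + a, d(i + s/2) - a }
        cand = (upper + a[i], lower - a[i])
        decision[2 * i] = int(cand[1] < cand[0])
        new_d[2 * i] = min(cand)
        # d(2i + 1) = min { d(i) - a, d(i + s/2) + a }
        cand = (upper - a[i], lower + a[i])
        decision[2 * i + 1] = int(cand[1] < cand[0])
        new_d[2 * i + 1] = min(cand)
    return new_d, decision


print(acs_stage([0, 5, 3, 2], [1, 2]))   # ([1, -1, 0, 3], [0, 0, 1, 0])
```

The decision bits are what a decoder stores per stage for the traceback; the processors on the following slides differ mainly in how many cycles this loop body costs.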


Viterbi on Atmel’s Lode

Two MAC units & ALU: Add-Compare-Select

• DMAC operates as dual add/subtract unit

• ALU finds minimum

• Shortest distance saved

• Path indicator saved

• 4 cycles / butterfly

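[Figure: datapath with the data buses DB0(16) and DB1(16), MAC0 and MAC1 adding µ1 and µ2 to Γ1 and Γ2 (accumulators A0, A1), the ALU computing Min() into Γ (A2, A3), and the decision bit written to memory]

Γ = min [(Γ1 + µ1), (Γ2 + µ2)]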



Viterbi on TI C54x

ALU and CSSU: Add-Compare-Select

• ALU splits in 16 bit halves

• ACC splits in half

• Shortest distance saved

• CSSU compares halves

• Path indicator saved

• 4 cycles / butterfly

[Figure: datapath with the data buses DB0(16) and DB1(16), TREG, the split ALU adding µ1 and µ2 to Γ1 and Γ2 in the Accumulator, the Comp unit with MSW/LSW Select producing Γ, the decision bit shifted into the TRN reg, and data bus EB to memory]

Γ = min [(Γ1 + µ1), (Γ2 + µ2)]


BUT: DSP Software Development

• Complex DSP architecture not amenable to compiler technology

• Algorithms are modeled in high level language (e.g. C++)

• Solutions are implemented and debugged in hand-optimized assembler - large development effort with minimal tool support

[Figure: development flow from the HLL algorithmic model to prototype code to production code; the last step is hand coded assembler with optimize & debug]

Long, frustrating time to market

Fragile legacy code

Widely used in handhelds, but basestations are changing to VLIW



2G Basestation Baseband Processing

• Multiple DSPs used for baseband processing
• RISC Microcontroller for timing, framing, I/O control
• Software upgradable over the network
• DSPs dominate cost and power consumption

[Figure: Tx/Rx baseband processing board for 2-carrier GSM basestation, with AFEs, multiple DSPs, an ASIC, RAM, a RISC Micro Controller and I/O to T1/E1; functions shown are Channel Equalization, Channel De/coding and Encryption]

Future trend: integrate baseband processing for a low cost Pico BTS


Compiler Driven VLIW

Large orthogonal register set, regular interconnect

[Figure: data memory and a register array connected through an interconnect to the execution units ex1 (alu), ex2 (alu), ex3 (mpy), ex4 (ld/st), ..., exn (ld/st)]

Instruction format: cond/branch | ex1 | ex2 | ex3 | ….. | exn

Atomic RISC-like operations => heavily pipelined, high freq. clock



Explicitly Parallel Instruction Computing

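[Figure: Execution Clusters. Data memory feeds clusters that each have their own register array and interconnect, with the execution units ex1 (alu), ex2 (alu), ex3 (ld/st), ex4 (alu), ex5 (mpy), ex6 (ld/st)]

[Figure: Execution Sets. A fetch set with the bit pattern 1 1 1 0 1 0 1 0 and the resulting exec. sets]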


Texas Instruments ‘C6201

[Figure: two execution clusters, each with ALU, shift, mpy and add units, on Register Bank A (16 x 32) and Register Bank B (16 x 32); Instruction Dispatch & Decode with a 256 bit fetch from Program Memory (16K x 32); Data Memory (32K x 16)]

• 8-way VLIW with two execution clusters
• 256 bit (8x32) instruction fetch with variable length execute set
• Each 32 bit instruction individually predicated
• 11 stage pipeline
• 1600 MIPS, 400 MMACs @ 200 MHz
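The peak numbers follow from the issue width, the number of multipliers and the clock. A small sketch of that arithmetic, assuming one instruction per execution unit per cycle and one MAC per multiplier per cycle (the helper name `peak_rates` is hypothetical):

```python
def peak_rates(issue_width, mac_units, clock_mhz):
    """Return (peak MIPS, peak MMACs) assuming every unit is busy in every cycle."""
    return issue_width * clock_mhz, mac_units * clock_mhz


print(peak_rates(8, 2, 200))   # 8-way VLIW with two multipliers at 200 MHz: (1600, 400)
```

These are upper bounds; the code examples on the following slides show how hard it is to keep all units busy.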



FIR Filter on TI ‘C6x

loop:

ldw .d1t1 *a4++,a5

|| ldw .d2t2 *b4++,b5

||[b0] sub .s2 b0,1,b0

||[b0] b .s1 loop

|| mpy .m1x a5,b5,a6

|| mpyh .m2x a5,b5,b6

|| add .l1 a7,a6,a7

|| add .l2 b7,b6,b7

Hand-coded assembly: 32-tap FIR filter

• Outer Loop: 23 cycles, 180 bytes
  – 1 cycle in inner loop
• All 8 exec units used in inner loop - maximum efficiency
  – 2 MACs per cycle

Assembly syntax more difficult to learn.
Hard to get full use of all 8 execution units at once.
Software pipelining difficult to implement, and requires longer prolog/epilog (larger code size).
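As a functional reference for the inner loop, the sketch below mirrors its two multiply-accumulate chains: each loaded word holds two 16-bit values, `mpy` multiplies the low halves and `mpyh` the high halves, and two accumulators are updated in parallel. It ignores the software pipelining, and the final addition of the two accumulators is an assumption about what happens after the loop. The name `fir_dual_mac` is made up for illustration.

```python
def fir_dual_mac(x_pairs, h_pairs):
    """Dot product of samples and coefficients stored as (low, high) 16-bit pairs."""
    acc_a = 0                    # plays the role of a7
    acc_b = 0                    # plays the role of b7
    for (x_lo, x_hi), (h_lo, h_hi) in zip(x_pairs, h_pairs):
        acc_a += x_lo * h_lo     # mpy: product of the low halves
        acc_b += x_hi * h_hi     # mpyh: product of the high halves
    return acc_a + acc_b         # assumed final combination outside the loop


print(fir_dual_mac([(1, 2), (3, 4)], [(5, 6), (7, 8)]))   # 70
```

Each loop iteration performs two MACs, which corresponds to the 2 MACs per cycle quoted above.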

Courtesy: Gareth Hughes: Bell Labs Australia


Viterbi on TI ‘C6x

Cycle 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30

.D1 STH new 8 LDH old0 STH new 0 STH new 8 LDH old0 STH new 0 STH new 8 LDH old0 STH new 0 STH new 8 LDH old0 STH new 0 STH new 8 LDH sd1 STH m[2] STH m[3]

.D2 ADD tr LDH old1 LDH mj ADD tr LDH old1 LDH mj ADD tr LDH old1 LDH mj ADD tr LDH old1 LDH mj SUB m LDH sd0 STH m[5] STH m[4]

.M1 MPY mj *MPY b0 MPY a0 MPY mj *MPY b0 MPY a0 MPY mj *MPY b0 MPY a0 MPY mj *MPY b0 MPY a0

.M2 MPY a8 *MPY b8 MPY a8 *MPY b8 MPY a8 *MPY b8 MPY a8 *MPY b8

.L1 CMPGT t0 SUB b0 CMPGT t0 SUB b0 CMPGT t0 SUB b0 CMPGT t0 SUB b0 ADD m0 SUB -m0

.L2 CMPGT t8 ADD b8 SUB a8 CMPGT t8 ADD b8 SUB a8 CMPGT t8 ADD b8 SUB a8 CMPGT t8 ADD b8 SUB a8 SUB old SUB -m1 SUB m1 SUB I

.S1 B JLOOP ADD a0 SUB k B JLOOP ADD a0 SUB k B JLOOP ADD a0 SUB k B JLOOP ADD a0 SUB k

.S2 SUB j SHL tr *ADD t0,t8 SUB j SHL tr *ADD t0,t8 SUB j SHL tr *ADD t0,t8 SUB j SHL tr *ADD t0,t8 ADD tr B JLOOP MVK j

Cycle 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31

.D1 STH new 0 STH new 8 LDH old0 STH new 0 STH new 8 LDH old0 STH new 0 STH new 8 LDH old0 STH new 0 STH new 8 LDH old0 STH new 0 STH m[0] STH m[1] LDH old1

.D2 LDH mj ADD tr LDH old1 LDH mj ADD tr LDH old1 LDH mj ADD tr LDH old1 LDH mj ADD tr LDH old1 STH trans STH m[1] STH m[6] LDH old0

.M1 MPY a0 MPY mj *MPY b0 MPY a0 MPY mj *MPY b0 MPY a0 MPY mj *MPY b0 MPY a0 MPY mj *MPY b0

.M2 *MPY b8 MPY a8 *MPY b8 MPY a8 *MPY b8 MPY a8 *MPY b8 MPY a8 MPY mj

.L1 CMPGT t0 SUB b0 CMPGT t0 SUB b0 CMPGT t0 SUB b0 CMPGT t0 SUB b0 SUB new ADD old ADD SP

.L2 SUB a8 CMPGT t8 ADD b8 SUB a8 CMPGT t8 ADD b8 SUB a8 CMPGT t8 ADD b8 SUB a8 CMPGT t8 ADD b8

.S1 SUB k B JLOOP ADD a0 SUB k B JLOOP ADD a0 SUB k B JLOOP ADD a0 SUB k B JLOOP ADD a0 MVK k

.S2 *ADD t0,t8 SUB j SHL tr *ADD t0,t8 SUB j SHL tr *ADD t0,t8 SUB j SHL tr *ADD t0,t8 SUB j SHL tr B JLOOP

Utilization of execution units in Viterbi decoder

• 16-state Viterbi decoder for GSM from TI WWW site: ftp://ftp.ti.com/pub/tms320bbs/c62xfiles/vitgsm.asm
  – 3 cycles per butterfly
  – 32 cycles per GSM timeslot (8 butterflies)
  – MPY instructions used to move data



Viterbi on TI ‘C6x

3-cycle 2-ACS Inner-Loop (x 8)

LOOP:
   [b1]  b     .s1   LOOP
|| [b1]  sub   .s2   b1,1,b1
|| [!a2] sth   .d1   b12,*+a6[8]
|| [!a2] add   .d2   b0,b14,b14
||       cmpgt .l1   a11,a10,a1
||       cmpgt .l2   b11,b10,b0
||       mpy   .m1x  1,b5,a4

   [a2]  sub   .s1   a2,1,a2
|| [!a2] sth   .d1   a12,*a6++
|| [a1]  add   .s2   2,b0,b0
|| [b0]  mpy   .m2   1,b11,b12
||       mpy   .m1   1,a10,a12
||       sub   .l2x  a7,b5,b10
||       ldh   .d2   *++b9,b5

         shl   .s2   b14,2,b14
|| [a1]  mpy   .m1   1,a11,a12
||       add   .s1   a7,a4,a10
||       sub   .l1x  b13,a4,a11
||       add   .l2   b13,b5,b11
||       mpy   .m2   1,b10,b12
||       ldh   .d2   *b4++[2],a7
||       ldh   .d1   *a5++[2],b13
; end of LOOP


Lucent / Motorola Star*Core SC140

• 6-way VLIW with 128 bit (8x16) instruction fetch
• Prefix instructions for high performance without sacrificing code density
• Each execution set (parallel instructions + prefix) predicated
• 5 stage pipeline
• 1800 MIPS, 1200 MMACs @ 300 MHz

[Figure: Program / Data Memory, Program Sequencer and Instruction Dispatcher; Address Registers (27) with two AAUs; Data Registers (16) with four execution units, each containing MAC, ALU and BFU]



Viterbi on Star*Core

• Hardware support for Viterbi algorithm:
  – max2vit instruction
  – vsl instruction

• 1 cycle per butterfly through software-pipelining

• Decision bits are manually stored using the Viterbi Shift Left (VSL) instruction:

GSM (K=5, 16 states)
[ move.2l (r0)+,d0:d1    move.2l (r1)+,d1:d2 ]
[ add2 d0,d4             sub2 d6,d2
  sub2 d4,d0             add2 d2,d6 ]
[ max2vit d4,d2          max2vit d0,d6 ]
[ vsl.4w d2:d6:d1:d3,(r2)+n0
  vsl.4f d2:d6:d1:d3,(r3)+n0 ]

[Figure: data flow of max2vit d4,d2 / max2vit d0,d6 and vsl.4w d2:d6:d1:d3,(r2)+n0 through the status register SR and the registers D1, D3, D2, D6 (decisions and path metrics), with the results written to memory; repeated x 4]

Courtesy: Gareth Hughes: Bell Labs Australia


SOC



Energy-efficient SoCs are distributed

[‘Under the Hood’, EET, D. Carey, 9/5/02]

[Figure: T-Mobile PocketPC Phone. Components: TI Baseband DSP, HTC Interface ASIC, TI Power Management, Intel 32Mb Flash, Intel 128Mb Flash, Winbond 128Mb SDRAM, TI RF Synth, TI RF TX/RX, Conexant Power Amp, Intel StrongArm, Sony LCD Interface, Sony 240x320 color LCD, Philips Audio Codec, Touchscreen, SIM, MMIC Expansion]


[Figure: PalmPilot i705. Components: Display, AD7873 Digitizer, Motorola DragonBall, 8M SDRAM, 4M FLASH, FPGA, Philips USB, Maxim Transceivers, Agere POM Baseband, Motorola Transceiver, RF Micro Power amp, Maxim Control Driver, Memory Card Slot]

Architecture tuned to application



OMAP 2420 platform (TI)


OMAP 2420 features

• Intended for high-volume wireless handset manufacturers

• Application processor “all-in-one entertainment”
• Supports all wireless standards
• Dual core ARM11 (330MHz) + DSP C55x (220MHz)
• 2D/3D graphics accelerator at 2 Mega-Polygon/s for gaming applications
• Image & Video accelerator for 4 Megapixel cameras and 30 frame/s VGA video support
• 5Mbits SRAM to boost streaming media performance




Energy-flexibility trade-off

[Figure: Power and Cost plotted over the range of solutions General Purpose, Platform, Application, Fixed, ASIC, with “???” marking the open region]



General purpose architectures also become heterogeneous.

[Figure: Xilinx FPGA with IBM PowerPC® RISC CPU, Synchronous Dual-Port RAM, SelectIO-Ultra™, SystemIO™ & XCITE™, Conexant 3.125Gb Serial, XtremeDSP™]

Source: Xilinx webpage


Question

• Energy and flexibility are opposite demands!
• How to navigate in this jungle?
• 3D design space:
  – Computational Abstraction Level
  – Reconfigurable feature
  – Binding rate
• Next question: how to map (or compile) an application onto such an architecture?



Flexibility (1) - Abstraction level


• Instruction set level = “programmable”

• CLB level = “reconfigurable”


Flexibility (2) - Reconfigurable feature

• Basic components: Computation, Storage, Communication

Reconfigurable feature at each abstraction level:

                               Computation                  Storage            Communication
Systems                        Number & type of processes   Memory hierarchy   Interconnect network
Instruction set Architecture   Custom instructions          Register set       Size address/data bus
Micro-architecture             Execution unit type          Register file      Cross-bar, Busses
Implementation                 CLB                          RAM details        Switches, Muxes




Flexibility (3) - Binding rate

Compare processing to binding:
• Configurable (“compile-time”)
• Re-configurable
• Dynamic reconfigurable (“adaptive”)


SOC architecture: RINGS

[Figure: RINGS architecture. Domain-specific engines (CPU, RF, Baseband Processing, Video Engine, DSP) and MEMORY connected by a Reconfigurable Interconnect. Each domain (Networking, Video, Signal Proc) has its own hardware/software stack of abstraction levels, such as Standard, Algorithm, Architecture, µArchitecture, Circuit, or Networking, Medium access, Baseband Proc, or Algorithm, Architecture, µArchitecture, Assembly]



Instruction set extension

• Instruction set extension
• Register mapped
• Tightly coupled
• Experiment: DFT

1000 iterations   SW on embedded proc.   SW with HW datapath   Improvement
Energy            67.6 mJ                5.76 mJ               12.5 times


Co-processor

• Memory mapped
• Loosely coupled
• Experiment: AES

[Figure: co-processor with Local Memory]

175 iterations   SW on emb. proc.   SW with HW datapath   Improvement
Energy           89.2 mJ            13.5 mJ               25 times



Independent IP

• Loosely coupled
• Network on chip connected
• Flexible interconnect
• Experiment: TCP/IP checksum

[Figure: independent IP block attached to the network on chip through routers]

100 packets   SW on emb. proc.   HW datapath   Improvement
Energy        17.0 mJ            0.20 mJ       84 times


Example: The Security Pyramid

[Figure: security pyramid with the abstraction levels Protocol (Identification, Confidentiality, Integrity), Algorithm (Kasumi, Rijndael, RC4, MD5, …), Architecture (Java, JVM, JCA, CPU, Crypto, MEM), Micro-Architecture and Circuit (D, Q, CLK, Vcc)]



Example: AES Coprocessor

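[Figure: AES coprocessor CORE with Input FSM, Proc FSM and Output FSM, Encrypt and Key Schedule blocks connected by the roundkey, an instruction input, handshake signals, and data paths of width 16 and 256]

[DAC 2002]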


Design options: AES acceleration: Gbits/Joule

AES 128bit key, 128bit data   Throughput       Power    Figure of Merit (Gb/s/W)
0.18µm CMOS                   2 Gbits/sec      56 mW    35.7 (1/1)
FPGA [1]                      1.32 Gbit/sec    490 mW   2.7 (1/11)
Asm Pentium III [2]           648 Mbits/sec    41.4 W   0.015 (1/1900)
C Emb. Sparc [3]              133 Kbits/sec    120 mW   0.0011 (1/33000)
Java [4] Emb. Sparc           450 bits/sec     120 mW   0.0000037 (1/9.600.000)

[1] Amphion CS5230 on Virtex2 + Xilinx Virtex2 Power Estimator
[2] Helger Lipmaa PIII assembly handcoded + Intel Pentium III (1.13 GHz) Datasheet
[3] gcc, 1 mW/MHz @ 120 MHz Sparc – assumes 0.25 µm CMOS
[4] Java on KVM (Sun J2ME, non-JIT) on 1 mW/MHz @ 120 MHz Sparc – assumes 0.25 µm CMOS
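The figure of merit is throughput divided by power, and Gb/s per W is the same unit as Gbits per Joule. A short sketch that recomputes it from the throughput and power figures of the slide; the function name `figure_of_merit` and the dictionary layout are illustrative assumptions.

```python
def figure_of_merit(throughput_bps, power_w):
    """Energy efficiency in Gb/s/W, which equals Gbits per Joule."""
    return throughput_bps / power_w / 1e9


# (throughput in bits/sec, power in W), taken from the slide
designs = {
    "0.18um CMOS":     (2e9, 0.056),
    "FPGA":            (1.32e9, 0.490),
    "Asm Pentium III": (648e6, 41.4),
    "C Emb. Sparc":    (133e3, 0.120),
    "Java Emb. Sparc": (450, 0.120),
}

for name, (throughput, power) in designs.items():
    print(f"{name:16s} {figure_of_merit(throughput, power):.7f} Gb/s/W")
```

The results agree with the listed figures of merit up to rounding and span about seven orders of magnitude between the dedicated CMOS datapath and Java on an embedded processor.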



Conclusion

[Figure: Applications mapped onto Architectures, together with Design Methods = Low Power!]
