handset architectures

RICE UNIVERSITY Handset architectures Sridhar Rajagopal [email protected] http://www.ece.rice.edu/~sridhar ASICs Programmable The support for this work in part by Nokia, TI and NSF is gratefully acknowledged


Post on 07-Jan-2016


Page 1: Handset architectures

RICE UNIVERSITY

Handset architectures

Sridhar Rajagopal

[email protected]

http://www.ece.rice.edu/~sridhar

ASICs Programmable

The support for this work in part by Nokia, TI and NSF is gratefully acknowledged

Page 2: Handset architectures

2G handsets

ASIC for compute-intensive operations (spreading etc.)

DSP for most of the baseband

Microcontroller for higher layers

Source: Evolving Cellular Handset Architectures but a Continuing, Insatiable Desire for DSP MIPS, M. L. McMahan, TI Report SPRA650, March 2000

Page 3: Handset architectures

Proposed 3G handsets

Increased number of co-processors, as DSPs are unable to do most of the baseband

[Figure: proposed 3G handset block diagrams from TI and VIA]

Sources:
  DSP for the third generation wireless communications, U. Ko, M. McMahan and E. Auslander, International Conference on Computer Design, 1999, pp. 516-520
  Introduction to W-CDMA SoC design approach, H. Chen, VIA Technologies, August 2002, www.itpilot.org.tw/provisional/910802/INTRODUCTION%20TO%20WCDMA%20SOC%20.PDF

Page 4: Handset architectures

Motivation

How does this scale?
  Do we need a DSP or should we build ASICs?

If ASICs, how to build better ASICs?
If programmable, how to build better DSPs?
If both, how do we mix them better?

Answers depend on:
  Level of programmability needed
  Area-time-power architecture tradeoffs

Page 5: Handset architectures

Rice innovations for ASICs and DSPs

ASICs: On-line arithmetic for dynamic truncation

Programmable: Scalable Wireless Application-specific Processors (SWAPs)

Mix and match: Hybrid SWAPs (H-SWAPs)

[Figure: design spectrum from ASICs to programmable]

Page 6: Handset architectures

Outline

On-line arithmetic for dynamic truncation

SWAPs

H-SWAPs

Page 7: Handset architectures

ASIC designs

Finite precision arithmetic
  Faster
  Low power
  Low area

How to keep finite precision bounded:
  Saturation
  Truncation

Page 8: Handset architectures

Keeping precision bounded

Example of truncation
  Multiplication by the step size in gradient descent
  Sign detection

Example of saturation
  Avoiding overflows
  When the probability of useful MSBs is low

[Figure: bit patterns of a fixed-point word. Truncation (MSBs important) drops low-order bits; saturation (LSBs important) drops high-order bits.]
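The difference between the two operations can be sketched on plain integers. This is only an illustration under stated assumptions: the function names `truncate_lsbs` and `saturate`, and the use of signed two's-complement ranges, are choices made here and do not come from the slides.

```python
def truncate_lsbs(x, drop):
    """Discard the `drop` least significant bits, keeping the MSBs.

    Python's >> is an arithmetic shift, so the sign of x is preserved.
    """
    return x >> drop


def saturate(x, bits):
    """Clamp x to the signed two's-complement range of a `bits`-bit word.

    In-range values keep all their LSBs; out-of-range values are pinned
    to the nearest representable limit instead of wrapping around.
    """
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    return max(lo, min(hi, x))
```

Truncation is enough for sign detection because the sign lives in the MSBs, while saturation protects an accumulator whose high-order bits are rarely needed.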

Page 9: Handset architectures

Dynamic precision requirements

Precision needs change with algorithms, SNR
  Adapt hardware dynamically to save power
  25-35% power reduction possible

Dynamic saturation vs. dynamic truncation
  Dynamic saturation: easy as LSBs first; no error; throughput benefits
  Dynamic truncation: difficult; significant error; no throughput benefits

Page 10: Handset architectures

On-line arithmetic for dynamic truncation

Works Most Significant Digit First

Natural way of truncation

Digit-serial dynamic truncation

Redundant number system: error only in the least significant digit (LSD)

Throughput benefits as digit-serial

Page 11: Handset architectures

Example for sign detection

[Figure: timing diagrams for detecting the sign of a sum of products ai * bi, computed as multiplication followed by tree addition (level 1 through result). Panels:
  (a) Truncated conventional arithmetic: the sign is determined only at the end, with idle stages (pipeline bubbles)
  (b) On-line arithmetic with full precision
  (c) Dynamically truncated on-line arithmetic (without truncation error)
  (d) Dynamically truncated on-line arithmetic (2 MSDs): sign determined early. Stop!
Timing labels in the figure: tCONV-MF, log(d), tOL-MF, tOL, d*tOL, deff*tOL]
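Why most-significant-digit-first processing allows an early stop can be sketched with a radix-2 signed-digit fraction. The slides only say "redundant number system"; the digit set {-1, 0, 1}, the radix, and the helper names `sd_value` and `sign_msd_first` are assumptions made for this illustration, and the sketch does not model the on-line multiplier or adder tree itself.

```python
from fractions import Fraction


def sd_value(digits):
    """Exact value of a radix-2 signed-digit fraction given MSD first.

    digits[i] is in {-1, 0, 1} and carries weight 2**-(i + 1).
    """
    return sum(Fraction(d, 2 ** (i + 1)) for i, d in enumerate(digits))


def sign_msd_first(digits):
    """Return (sign, digits_examined), stopping at the first non-zero digit.

    The remaining digits can contribute strictly less than the weight of
    the digit just seen, so they cannot flip the sign.
    """
    for n, d in enumerate(digits, start=1):
        if d != 0:
            return (1 if d > 0 else -1), n
    return 0, len(digits)
```

For example, `sign_msd_first([0, 1, -1, -1, 0, 1])` stops after two digits, and truncating any such number after k digits changes its value by less than 2**-k.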

Page 12: Handset architectures

Throughput comparisons

[Plot: time required (gate delays, log scale from 10^1 to 10^3) vs. input precision (0 to 35 bits). Curves: truncated conventional; truncated conventional with CSA; on-line (full precision); truncated on-line (2 MSDs); truncated on-line (MSD)]

Page 13: Handset architectures

Area comparisons

[Plot: area required (gates, log scale from 10^2 to 10^6) vs. input precision (0 to 35 bits). Curves: truncated conventional; truncated conventional with CSA; on-line (full precision); truncated on-line (2 MSDs); truncated on-line (MSD)]

Page 14: Handset architectures

ASIC design conclusion

Details: Predrag

Using on-line arithmetic for dynamic truncation and conventional arithmetic for dynamic saturation, one can design efficient ASICs for handsets.

Page 15: Handset architectures

Outline

On-line arithmetic for dynamic truncation

SWAPs

H-SWAPs

Page 16: Handset architectures

Programmable architectures

Current DSPs
  Not enough functional units (FUs)

Cannot extend to more FUs
  Limited Instruction Level Parallelism (ILP)
  Cannot support more registers (register area increases quadratically with FUs)
  Compilers: difficult to find ILP as FUs increase

Page 17: Handset architectures

Solution

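Exploit data parallelism (DP)
  Lots available in wireless algorithms

Example:

for (i = 0; i < 1024; i++)
{
    a[i] = b[i] + c[i];   /* ILP: the add and the multiply below are independent */
    d[i] = b[i] * c[i];
}                         /* DP: all 1024 iterations are independent of each other */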

Page 18: Handset architectures

DSP vs. SWAPs

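[Figure: DSP (1 cluster): a single cluster of functional units (+++***) attached to internal memory, exploiting ILP only. SWAPs (max. clusters): the same cluster replicated many times on a shared internal memory, exploiting ILP within each cluster and DP across clusters.]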

Page 19: Handset architectures

SWAPs trade-offs

Same internal memory size as DSPs
  Dependent on application, not architecture

Needs more area to support more functional units
  Area is not a constraint (power is)

Varying levels of DP in applications
  Needs reconfiguration!!
  Need to turn off unused clusters

More parallelism -> lower clock frequency -> lower voltage -> low power (CV^2 f + leakage) in spite of larger area
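The last trade-off can be put into numbers with a toy model of the dynamic term CV^2 f. Everything beyond that formula is an assumption made here for illustration: capacitance growing linearly with the cluster count, the clock dropping as 1/clusters for a fixed workload, the supply voltage tracking the clock linearly down to a floor `v_min`, and leakage left out. The names `dynamic_power` and `swap_power` are hypothetical.

```python
def dynamic_power(c_eff, v, f):
    """Dynamic switching power, P = C * V^2 * f (leakage is ignored)."""
    return c_eff * v ** 2 * f


def swap_power(n_clusters, c_cluster=1.0, v_nom=1.0, f_nom=1.0, v_min=0.5):
    """Relative power of n_clusters clusters doing a fixed amount of work.

    Assumed model, not taken from the slides: capacitance scales with the
    number of clusters, the clock scales as 1 / n_clusters, and the supply
    voltage follows the clock linearly until it hits the floor v_min.
    """
    f = f_nom / n_clusters
    v = max(v_min, v_nom * f / f_nom)
    return dynamic_power(n_clusters * c_cluster, v, f)
```

In this model the added capacitance cancels against the lower clock, so the saving comes entirely from the V^2 term and stops once the voltage floor is reached; real numbers would also need the leakage term the slide mentions.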

Page 20: Handset architectures

Example: Viterbi Decoding

Add-Compare-Select (ACS): trellis interconnect
  Re-order for exploiting DP

Traceback: sequential
  Use Register Exchange (RE)

Exploiting DP in programmable architecture implies:
  Re-order ACS
  Re-order RE

Page 21: Handset architectures

Re-ordering for parallel Viterbi

[Figure: 16-state trellis stage drawn two ways.
  a. Trellis: states X(0) through X(15) in natural order in both columns
  b. Shuffled trellis: one column lists X(0) through X(15) in natural order; the other lists the even states X(0), X(2), ..., X(14) followed by the odd states X(1), X(3), ..., X(15)]
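A minimal sketch of add-compare-select with register exchange and an even/odd shuffle is given below. It is not the authors' implementation: the small K = 3 code with generators 7 and 5 (octal), the hard-decision Hamming metric, the state convention (newest input bit in the LSB), and all function names are assumptions chosen to keep the example short. With this convention the independent butterflies naturally produce all even states and then all odd states, which is then un-shuffled into natural order.

```python
def parity(x):
    return bin(x).count("1") & 1


def outputs(state, u, gens):
    """Coded bits emitted when input bit u enters the encoder in `state`."""
    reg = (state << 1) | u
    return tuple(parity(reg & g) for g in gens)


def encode(bits, K=3, gens=(0b111, 0b101)):
    mask = (1 << (K - 1)) - 1
    state, out = 0, []
    for u in list(bits) + [0] * (K - 1):   # K-1 zeros return the encoder to state 0
        out.append(outputs(state, u, gens))
        state = ((state << 1) | u) & mask
    return out


def viterbi_decode(symbols, K=3, gens=(0b111, 0b101)):
    n_states = 1 << (K - 1)
    half = n_states // 2
    metric = [0.0] + [float("inf")] * (n_states - 1)   # encoder starts in state 0
    paths = [[] for _ in range(n_states)]   # register exchange: one survivor per state
    for rx in symbols:
        def acs(p, u):
            # Add: extend both predecessors (p and p + half) with input bit u.
            cands = []
            for s in (p, p + half):
                dist = sum(a != b for a, b in zip(outputs(s, u, gens), rx))
                cands.append((metric[s] + dist, s))
            # Compare and select; the survivor's register is copied and extended.
            best, src = min(cands)
            return best, paths[src] + [u]

        # Each acs() call is independent of the others (data parallel).
        evens = [acs(p, 0) for p in range(half)]   # new states 0, 2, 4, ...
        odds = [acs(p, 1) for p in range(half)]    # new states 1, 3, 5, ...
        merged = [None] * n_states
        merged[0::2], merged[1::2] = evens, odds   # undo the even/odd shuffle
        metric = [m for m, _ in merged]
        paths = [p for _, p in merged]
    survivor = paths[0]                            # terminated code ends in state 0
    return survivor[:len(survivor) - (K - 1)]
```

Because each state carries its own survivor register, no sequential traceback pass is needed; the cost is copying registers at every stage.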

Page 22: Handset architectures

Viterbi reconfiguration

Packet 1: constraint length 7 (16 clusters)

Packet 2: constraint length 9 (64 clusters)

Packet 3: constraint length 5 (4 clusters)

DP: clusters that are not needed can be turned OFF
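The cluster counts on this slide are consistent with a fixed number of trellis states per cluster: a constraint length K code has 2^(K-1) states, and 64, 256 and 16 states map to 16, 64 and 4 clusters. The sketch below encodes that reading. The ratio of 4 states per cluster is inferred from those three numbers, the 64-cluster maximum is assumed from the largest case shown, and `clusters_needed` is a hypothetical helper.

```python
STATES_PER_CLUSTER = 4   # inferred: K = 7 has 64 states and uses 16 clusters


def clusters_needed(constraint_length, max_clusters=64):
    """Return (active_clusters, clusters_that_can_be_turned_off).

    max_clusters=64 is an assumption taken from the largest case on the slide.
    """
    states = 1 << (constraint_length - 1)
    active = max(1, states // STATES_PER_CLUSTER)
    if active > max_clusters:
        raise ValueError("constraint length needs more clusters than available")
    return active, max_clusters - active
```

For the three packets this gives 16, 64 and 4 active clusters, leaving 48, 0 and 60 clusters idle respectively.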

Page 23: Handset architectures

[Figure: execution timeline showing kernels (computation) and memory accesses for three consecutive packets:
  64-bit Packet 1: rate 1/2, constraint length 7
  64-bit Packet 2: rate 1/2, constraint length 9
  64-bit Packet 3: rate 1/2, constraint length 5]

Page 24: Handset architectures

Viterbi decoding: rate 1/2 at 128 Kbps = 10 MHz

[Plot: frequency needed to attain real-time (in MHz, log scale from 10^0 to 10^3) vs. number of clusters (log scale from 10^0 to 10^2). Curves: actual K = 9, actual K = 7, actual K = 5, each for regular code and reconfigurable code]

Page 25: Handset architectures

Viterbi decoding: Comparisons

[Plot: Viterbi decoding comparison on log-log axes (10^0 to 10^3) for actual K = 9, K = 7 and K = 5. Labels in the figure: DSP C64x (w/o co-proc); DSP (RE); SWAP; DP; FPGA; Virtex II FPGA*; task pipelining; dedicated interconnect; 128 KHz (1 bit/cycle)]

*VITURBO: A reconfigurable architecture for Viterbi and Turbo decoding, M. Vaya, J. R. Cavallaro, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2003, Hong Kong

Page 26: Handset architectures

Salient features of this solution

Any constraint length: 10 MHz at 128 Kbps

Same code for all constraint lengths
  No need to re-compile or load another code, as long as the parallelism/cluster ratio is constant

Exploiting parallelism at 3 levels for real-time:
  Instruction Level Parallelism (DSP)
  Subword Parallelism (DSP)
  Data Parallelism (SWAP)

Page 27: Handset architectures

Problems

Suitable for handsets? Not yet!

Still too general
  Not low power enough!!!

No special customization for the application
  Except for a fixed-point architecture
  Generic instruction set
  Generic ALUs (though they can be powered down)
  Generic inter-cluster communication network

Page 28: Handset architectures

Outline

On-line arithmetic for dynamic truncation

SWAPs

Hybrid SWAPs (H-SWAPs)

Page 29: Handset architectures

H-SWAPs

Trade Data Parallelism for Task Pipelining

Customize each mini-SWAP

[Figure: three organizations side by side.
  SWAPs (max. clusters and reconfigure): many identical +++*** clusters on a shared internal memory, full DP
  Mini-SWAP (limit clusters): a small number of clusters, limited DP
  H-SWAPs (collection of customized mini-SWAPs): several mini-SWAPs, each with limited DP and its own functional-unit mix (+++*, ++*, ++++)]

Page 30: Handset architectures

Work in progress

How to trade-off task vs. data parallelism?

Power estimation for SWAPs (actual numbers)

Comparisons with ASIC solutions in terms of area-time-power

Evaluation of specialized inter-cluster communication

Specialized instructions (ACS) and arithmetic units (on-line)

I am looking for jobs!!!