a system solution for high- performance, low power sdr yuan lin 1, hyunseok lee 1, yoav harel 1,...

A System Solution for High- Performance, Low Power SDR

Yuan Lin1, Hyunseok Lee1, Yoav Harel1, Mark Woh1,

Scott Mahlke1, Trevor Mudge1 and Krisztian Flautner2

1Advanced Computer Architecture Laboratory

University of Michigan2ARM, Ltd.

Advanced Computer Architecture LaboratoryUniversity of Michigan

2

SDR Design Challenges:

Hardware design challenges High computational throughput (~40 Gops) Low power consumption (~200mW) Meet real-time requirements

DSP programming support System-level development

Inter-algorithm communication Algorithm-level development

Efficient DSP representations

SDR Benchmark Design & Analysis


4

W-CDMA Protocol: 2Mbps

Frontend

LPF-Tx scrambler spreader InterleaverChannelencoder

LPF-Rx

searcher

descrambler despreader combiner

descrambler despreader

...

modulator

demodulator

deinteleaverChanneldecoder

(turbo/viterbi)

Upper layersTransmitter

Receiver


5

W-CDMA Characteristics

Plenty of vector parallelism 8 & 16-bit DSP algorithms Multiplication is not dominant No floating-point operation Small instruction/data memory Has periodic real-time tasks


6

802.11a Protocol: 24Mbps

Frontend

LPF-TxQAM

MappingInterleaver

Channelencoder

LPF-Rx

FFT

Frequency, TimeSynchronization

modulator

demodulator

deinteleaverChanneldecoder(viterbi)


Receiver

IFFTPreambleInsertion

ChannelEstimation

FrequencyEqualization

QAMdemapping


7

802.11a Characteristics

Similar to W-CDMA Plenty of vector parallelism No floating-point operation Small instruction/data memory

Different from W-CDMA Mostly 16-bit DSP algorithms Multiplication is more dominant No periodic real-time tasks

SDR Processor Architecture Design


9

System Architecture Design Tradeoffs

Fine grainprocessors

Inter-processorNetwork

PE

Coarse grainprocessors

Uniprocessoralu

alu

alu

alu

alu

alu

alu

alu

alu

alu

alu

alu

alu

alu

alu

alu

alu

alu

alu

alu

alu

alu

alu

alu

alu

alu

alu

alu

alu

alu

alu

alu

PEalu alu alu alu

PEalu alu alu alu

PEalu alu alu alu

PEalu alu alu alu


PEalu

PEalu

PEalu

PEalu

PEalu

PEalu

PEalu

PEalu

PEalu

PEalu

PEalu

PEalu

PEalu

PEalu

PEalu

PEalu

Amortized fetch

Intra-processorCommunication

Inter-processor Communication


10

System Architecture Design Tradeoffs

128x1

64x2

32x4

16x8

8x16

4x32 2x64 1x1281x256

(50% Utilization)

0

50

100

150

200

250

300

0 50 100 150 200 250 300

SIMD width (arithmetic units)

MO

PS

/mW

Intra-processor shuffle networkcritical delay path

ALU dominated critical delay path

critical delay path

Number of Processing Elements x SIMD width For W-CDMA 2Mbps (51.2GOP/sec) 90nm 1V @400MHz

Fine grain processors


PEalu

PEalu

PEalu

PEalu

PEalu

PEalu

PEalu

PEalu

PEalu

PEalu

PEalu

PEalu

PEalu

PEalu

PEalu

PEalu

PE

Uniprocessor

alu

alu

alu

alu

alu

alu

alu

alu

alu

alu

alu

alu

alu

alu

alu

alu


Coarse grain processorsPE

alu alu alu alu

PEalu alu alu alu

PEalu alu alu alu

PEalu alu alu alu


12

System Architecture Design

Interconnect

GlobalMemory

SIMDMemory

SIMD ALU

SIMD RF

ScalarMemory

ScalarALU

ScalarRF

SoCInterface

ScalarPipeline

SIMDPipeline

SoCInterface

GlobalMemory

SoCInterface

MEM MEM

Memory

ALU

SoCInterface

Controller

PE

SIMDMemory

SIMD ALU

SIMD RF

ScalarMemory

ScalarALU

ScalarRF

SoCInterface

ScalarPipeline

SIMDPipeline

PE

SIMDMemory

SIMD ALU

SIMD RF

ScalarMemory

ScalarALU

ScalarRF

SoCInterface

ScalarPipeline

SIMDPipeline

PE


13

PE Design (Area < 1mm2 Power<50mW)

SIMDMemory

4 KB

ScalarMemory

4KB

16x8bitRegFile

16x8bitRegFile

16x8bitRegFile

16x8bitRegFile

DataShuffle

Network

8bit ALU

8bit ALU

8bit ALU

8bit ALU

16x16bitRegFile

16bit ALU

DMA

SIMD Unit

Scalar Unit

PE

8bit

8bit

8bit

8bit

8bit

8bit

8bit

8bit

8bit

8bit

8bit

8bit

32x8bit

16bit 16bit 16bit

16bit


14

Mapping DSP Algorithms: Filters

Z-1 Z-1 Z-1

b0b1b2b3

In

Out

Out

Inb

z-1

z-1spread Vin, Sin

shift z, z, upmac z, Vin, Sin


15

SIMDMemory

4 KB

ScalarMemory

4KB

16x8bitRegFile

16x8bitRegFile

16x8bitRegFile

16x8bitRegFile

DataShuffle

Network

16x16bitRegFile

DMA

SIMD Unit

Scalar Unit

PE

8bit

8bit

8bit

8bit

8bit

8bit

8bit

8bit

8bit

8bit

8bit

8bit

32x8bit

16bit 16bit 16bit

16bit

8bit ALU

8bit ALU

8bit ALU

8bit ALU

16bit ALU

Mapping DSP Algorithm: Filter

spread Vin, Sin

shift z, z, upmac z, Vin, Sin


16

Efficient Design

Wide SIMD width Small register file with minimum ports Small memories Narrow system BUS Data-path optimized for 8bits Vector shuffle reduce memory ports


18

2 LPF-Rx182 MopsMisc. Control

1 Mops

Searcher200 Mops

Deinterleaver16 Mops

PowerControl15 Kops

PN CodeTX/RX

23 Mops

Turbo Decoder324 Mops

Buffer(1360 Bytes)

Buffer(1280 Bytes) Buffer

(2560 Bytes)FIFO Queue(12.5 KBytes)

Buffer(10 Bytes)

Buffer(20 KBytes)

Buffer(20 KBytes)

Buffer(1024 Bytes)

ARM PE PE PE PEGlobal

Memory

Buffer(1024 Bytes)

WCDMA Receiver WCDMA Transmitter

4 LPF-Rx307 Mops

Scrambler8.6 Mops

Spreader4.7 Mops

Turbo Encoder2 Mops

Interleaver2 Mops

Descrambler22.5 Mops

Despreader11.3 Mops

Combiner3 Mops

Frontend


LPF-Rx

searcher



...

modulator

demodulator


(turbo/viterbi)


Receiver

Channeldecoder

(turbo/viterbi)

4 LPF-Rx307 Mops

Scrambler8.6 Mops

Spreader4.7 Mops

Turbo Encoder2 Mops

Interleaver2 Mops


Searcher200 Mops

2 LPF-Rx182 Mops


Despreader11.3 Mops

Combiner3 Mops

PowerControl15 Kops


PN CodeTX/RX

23 Mops

Misc. Control1 Mops



...LPF-Rx deinteleaver

searcher

Channelencoder

InterleaverspreaderscramblerLPF-Tx

Buffer(1280 Bytes)

Buffer(1360 Bytes)

Buffer(2560 Bytes)

Buffer(1024 Bytes) FIFO Queue

(12.5 KBytes)

Buffer(20 KBytes)

Buffer(20 KBytes)Buffer

(10 Bytes)

Buffer(1024 Bytes)


19

2 LPF-Rx182 MopsMisc. Control

1 Mops

Searcher200 Mops


PowerControl15 Kops

PN CodeTX/RX

23 Mops


Buffer(1360 Bytes)


(2560 Bytes)FIFO Queue(12.5 KBytes)

Buffer(10 Bytes)

Buffer(20 KBytes)

Buffer(20 KBytes)

Buffer(1024 Bytes)


Memory

Buffer(1024 Bytes)

WCDMA Receiver WCDMA Transmitter

4 LPF-Rx307 Mops

Scrambler8.6 Mops

Spreader4.7 Mops

Turbo Encoder2 Mops

Interleaver2 Mops


Despreader11.3 Mops

Combiner3 Mops

Frontend


LPF-Rx

searcher



...

modulator

demodulator


(turbo/viterbi)


Receiver

Channelencoder

InterleaverspreaderscramblerLPF-Tx

Channeldecoder

(turbo/viterbi)deinteleaver

searcher



...LPF-Rx

4 LPF-Rx307 Mops

Scrambler8.6 Mops

Spreader4.7 Mops

Turbo Encoder2 Mops

Interleaver2 Mops


Searcher200 Mops

2 LPF-Rx182 Mops


Despreader11.3 Mops

Combiner3 Mops

PowerControl15 Kops


PN CodeTX/RX

23 Mops

Misc. Control1 Mops

MACFrontend A/D

MAC

Frontend D/A

Buffer(1280 Bytes)

Buffer(1360 Bytes)

Buffer(2560 Bytes)

Buffer(1024 Bytes) FIFO Queue

(12.5 KBytes)

Buffer(20 KBytes)

Buffer(20 KBytes)Buffer

(10 Bytes)

Buffer(1024 Bytes)


20

802.11a PE Mapping

FIR (Rx)320 Mops

Misc. Control80 Mops

FFT120 Mops


Buffer(2048 Bytes)


(30 KBytes)

Buffer(30 KBytes)


Memory

Interleaver60 Mops

Freq. Eq.120 Mops

QAM Demod2 Mops

FIR (Tx)320 Mops

Buffer(2048 Bytes)

Channel Enc.20 Mops

Buffer(22048 Bytes)

IFFT120 Mops

Buffer(1024 Bytes)

QAM Mod2 Mops

Viterbi Dec.398 Mops

Buffer(2048 Bytes)

Preamble Ins.50 Mops

Buffer(30 KBytes)

Buffer(30 KBytes)

Interplator250 Mops

Buffer(2048 Bytes)

Sync.20 Mops

Buffer(1024 Bytes)

802.11a Receiver

802.11a Transmitter


21

Power Results

0

50

100

150

200

250

300

350

400

PE Mem

ory

PE Reg

ister

File

PE ALU

PE Oth

ersARM

Glob

al M

emor

y

Inte

r-pro

cess

or Bus

Oth

ersTota

l

Po

wer

(m

W)

WCDMA

802.11a

Configuration 4 PEs, 1 ARM (Cortex M3) controller Global scratchpad memory (64Kb) 90nm (1V @ 400 MHZ), Synthesized conservatively


22

Area Results

0

0.5

1

1.5

2

2.5

3

3.5

mm

^2 (9

0nm

)

SDR Programming Language Support


24

Software Development Flow

Matlab-Simulink/CFloating Point

Algorithm Prototyping

TimingRequirements

C/SPEXFixed Point

Algorithm KernelImplementations

IO/ControlCode

C/SPEXSystem

Description

Timing/MachineIndependent

asm code

TimingRequirements

MemoryAccessPatterns

SystemDescriptionasm code

Controllerasmcode

PEasmcode

c

ProgrammerDefined

CompilerGenerated

Algorithm-levelDesign Flow

System-LevelDesign Flow


25

SPEX (Signal Processing EXtension)

Implemented as a library extension to C System-level development

Support concurrent DSP kernel function definitions Channel variables for inter-kernel communications

Algorithm-level development Native vector & matrix variables Explicit DSP variable attribute definition Native vector & matrix operations


26

SPEX Overview

SPEX

DSP variableextension

DSP kernelextension

Scalarobject

SIMDobject

DSP variableobjects

DSPvariable

attributes

DSP variable

operations

ModeBit-

widthType

Arith. &logicop.

Predi-cation

Permu-tation

Concurrentvariableobject

Concurrentvariable

operations

Kernelobject

Channelobject

Concurrentkernel

mangement

Channelcommunication

operations


27

SPEX Example Code: Viterbi ACSConcurrent DSP kernel definitions

void* acs(void*) { /* variable declaration */ saturated char<64> metrics1, metrics2; saturated char<64> states; saturated char<64> t1, t2;

while (!viterbi.stop()) { /* receiving data from BMC */ metrics1 = bmc_to_acs.receive(); metrics2 = bmc_to_acs.receive();

/* add */ metrics1 += states; metrics2 += states;

/* compare and select */ t1 = (metrics1(0,2,62),metrics2(0,2,62)); t2 = (metrics1(1,2,63),metrics2(1,2,63)); states(t1<t2) = t1; states(t1>=t2) = t2;

/* sending data to TB */ acs_to_tb.send(states); }}


28

SPEX Example Code: Viterbi ACS

Native SIMD variable definition with explicit attributes

SPEX variable supports1. saturated/overflow2. various variable bit-width3. vector & matrices







29


Inter-kernel communication through channel operations

Channel types:1. FIFO queue2. Broadcast queue3. Sync/control channel4. Random-read FIFO queue







30


SPEX vector operations Supports(Matlab-like C code)

1. SIMD arithmetic operations

2. SIMD permutation 3. SIMD predication







31

Summary

- Hardware & software solutions for SDR- Hardware

- 4 dual-issue asymmetric SIMD processing elements- Consumes 200~300mW for 90nm- Meets the performance requirements for WCDMA &

802.11a

- Software- SPEX provides efficient DSP algorithm and system

implementation


32

Questions?

a system solution for high- performance, low power sdr yuan lin 1, hyunseok lee 1, yoav harel 1,...

Documents

mbps slide

sdr design challenges

system solution

pe design area

wcdma protocol

wcdma characteristics

soc interface system

saturated char metrics1