1HJ94, Spring 2004, Ingrid Verbauwhede, lecture 9
Architecture exploration Lecture 9
Ingrid Verbauwhede
Departement Elektrotechniek, afdeling ESAT/COSIC
Motivation
• Architecture exploration
• Specification: MATLAB, SPW, C/C++, Java
• Floating point
• Fixed point
• Algorithm transformations
• Architecture alternatives
– Bit parallel (bit serial)
– Special-purpose ASIC (Art Designer)
– Retargetable coprocessor (Target Compiler Technologies)
– DSP extensions to RISC (Gezel, Tensilica)
– DSP processors (TI TMS320C54x, TMS320C55x, ADI Blackfin, etc.)
References
• The origins:
– E.A. Lee, “Programmable DSP Processors,” Part I, IEEE ASSP Magazine, October 1988, pp. 4-19.
– Part II, IEEE ASSP Magazine, January 1989, pp. 4-14.
• Continuing on this:
– I. Verbauwhede, C. Nicol, “Low power DSP's for wireless communications,” 2000 International Symposium on Low Power Electronics and Design (ISLPED), July 2000.
– I. Verbauwhede, P. Schaumont, C. Piguet, B. Kienhuis, “Architectures and design techniques for energy efficient embedded DSP and multimedia,” 2004 Design Automation and Test in Europe (DATE 2004).
Today
• SOC components (continued)
– DSP processors
– VLIW processors
• Design of the SOC itself
DSP Processors
• Today's general-purpose, assembly-coded DSP: 100 MOPS, 250 mW, $40
• Low-cost, low-power DSPs (mobile terminals): 200-1000 MOPS, < 100 mW, $10. Highly optimized, domain-specific processors.
• High-performance DSPs (infrastructure): 1-10 GOPS, 1-5 watts, < $50. Compiler-friendly, VLIW type of DSP processors.
DSP processors
• Last lecture: DSP = domain-specific processor
– Highly optimized for wireless communication
– EVERY component of the processor:
  • Datapath = MAC
  • Memory = Harvard or modified Harvard
  • Address arithmetic: indirect, modulo, bit reverse (FFT)
  • Control: CISC with specialized instruction set
– Example of FIR calculation
• Today:
– Pipeline specifics of DSP processors
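As a reminder of last lecture's FIR example, the sketch below shows how a MAC datapath and modulo address arithmetic work together. It is a minimal illustration in Python, not code for any particular DSP; the function name `fir_circular` and the buffer layout are assumptions made here.

```python
def fir_circular(samples, coeffs):
    """FIR filter using a circular delay line (modulo addressing)."""
    n_taps = len(coeffs)
    delay = [0] * n_taps       # circular buffer holding the last n_taps inputs
    head = 0                   # index of the newest sample
    out = []
    for x in samples:
        head = (head + 1) % n_taps    # modulo address update
        delay[head] = x               # overwrite the oldest sample
        acc = 0
        idx = head
        for c in coeffs:              # one multiply-accumulate (MAC) per tap
            acc += c * delay[idx]
            idx = (idx - 1) % n_taps  # walk back through older samples
        out.append(acc)
    return out


# An impulse at the input reproduces the coefficients at the output.
print(fir_circular([1, 0, 0, 0], [3, 2, 1]))
```

With modulo addressing no sample is ever moved in memory: only the pointer wraps around, which is why DSPs provide it in the address generation hardware.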
Pipelining:
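[Figure: pipeline diagram. Three successive instructions each pass through the stages Fetch, Decode, Memory Access, Execute, each instruction starting one cycle after the previous one; time runs along the horizontal axis.]
Fetch = fetch instruction
Decode = decode instruction
Memory access = address generation and read operands
Execute = perform operation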
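The overlap of instructions in such a four-stage pipeline can be sketched as a small table generator. This is only an illustration: the stage names follow the slide, while `pipeline_schedule` and the printed layout are assumptions made here.

```python
STAGES = ["Fetch", "Decode", "MemAccess", "Execute"]


def pipeline_schedule(n_instructions, stages=STAGES):
    """Return rows[i][cycle] = stage occupied by instruction i, or '' if idle."""
    n_cycles = n_instructions + len(stages) - 1
    rows = []
    for i in range(n_instructions):
        row = [""] * n_cycles
        for s, name in enumerate(stages):
            row[i + s] = name      # instruction i enters stage s in cycle i + s
        rows.append(row)
    return rows


for i, row in enumerate(pipeline_schedule(3)):
    print(f"instr {i}: " + " ".join(f"{cell:<9}" for cell in row))
```

Three instructions finish in 6 cycles instead of 12, because in steady state every stage works on a different instruction.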
Pipelining
How does the pipeline appear to the programmer?
Lee's paper (Part II) discusses 3 variations (the difference is often blurry):
• interlocking
• time stationary coding
• data stationary coding
Trade-off between efficiency and “ease-of-use”.
Interlocking: the instructions appear as if executed one after another.
Interlocking on C10
[Figure: pipeline diagram (Fetch, Decode, Memory Access, Execute) for the instruction sequence LT, MPY, LTD, MPY, together with the reservation table over instruction cycles, with rows PMEM (LT, MPY, LTD, MPY, ...), DMEM (data, coef1, data, coef2), ALU, and MPY.]
Interlocking on C2x
The programmer does not know the pipeline. If an access conflict occurs, the hardware will “stall” and finish one (part of an) instruction before finishing a second part.
Example code:
RPTK 49
MACD
[Figure: reservation table over instruction cycles, with rows PMEM (RPTK, MACD, coef1, coef2, coef3, ...), DMEM (data1, data2, ...), ALU, and MPY.]
Time stationary
An instruction specifies “one instruction cycle”, so it specifies everything that occurs in parallel in that cycle.
[Figure: four overlapping instructions, each passing through Fetch, Decode, Memory Access, Execute.]
Examples:
Motorola:
MAC X0, Y0, A   X:(R0)+, X0   Y:(R4)-, Y0
(multiply-accumulate of the values read from memory in the previous cycle)
Lucent 16x:
a0 = a0 + p, p = x * y, y = *r0++, x = *pt++
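The Lucent-style instruction above can be modeled in a few lines to show what “time stationary” means for the programmer. This is a hedged sketch: the register names follow the slide, but the function `time_stationary_mac`, the zero padding, and the drain count are assumptions made here.

```python
def time_stationary_mac(data, coeffs):
    """Repeat 'a0 = a0 + p, p = x * y, y = *r0++, x = *pt++' to sum data[i] * coeffs[i]."""
    n = len(data)
    mem_y = list(data) + [0, 0]     # padding so the drain cycles read zeros
    mem_x = list(coeffs) + [0, 0]
    a0 = p = x = y = 0
    r0 = pt = 0
    for _ in range(n + 2):          # n loads plus 2 cycles to drain the pipeline
        # All right-hand sides use the register values from BEFORE this cycle,
        # so the accumulate, the multiply, and the loads work on different samples.
        a0, p, y, x = a0 + p, x * y, mem_y[r0], mem_x[pt]
        r0 += 1
        pt += 1
    return a0


print(time_stationary_mac([1, 2, 3], [4, 5, 6]))
```

The programmer has to account for the two extra cycles: the product of the values loaded in cycle t is formed in cycle t+1 and accumulated in cycle t+2.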
Data stationary
Time stationary: one instruction works on different samples.
Data stationary: one instruction describes what happens with one input data item from start to end.
Example (Lode):
*r3++ = a0 += a2 * *r2++;
(read from memory with pointer register r2, multiply with a2, add to a0 and store back in a0, store the result in memory with pointer r3, post-modify r2 and r3)
[Figure: pipeline stages Fetch, Decode, Read, Execute, Write.]
Control & Pipeline for DSPs
RISC: load/store machine, memory access with load/store instructions (DLX, MIPS, D10V)
[Figure: RISC pipeline: Fetch, Decode, Execute (execution / address generation), Memory Access (memory access / branch), Write Back.]
Excellent for complex decision making!
DSP: register-memory architecture (TI, Lucent, HX, Lode)
[Figure: DSP pipeline: Fetch, Decode, Memory Access, Execute, Write Back.]
Excellent for number crunching!
Pipeline RISC compared to DSP
RISC example:
r0 = *p0;       // load data
a0 = a0 + r0;   // execute
[Figure: three overlapping RISC instructions with stages Fetch, Decode, Execute, Memory Access.]
Too expensive for DSP
DSP: memory-intensive applications:
[Figure: four overlapping DSP instructions with stages Fetch, Decode, Memory Access, Execute.]
Penalty: a data-dependent branch is expensive
Application domain: wireless communications
[Figure: block diagram of a cellular handset. RF board: receiver, transmitter, synthesizer, PA, TCXO. Baseband board: DSP, microprocessor, digital ASIC, analog ASIC, audio codec, external memories. Also shown: battery pack, power supply, keypad, and display.]
Performance requirements: digital cellular phone
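[Figure: signal chain of a digital cellular phone. Receive path: RF receive, demodulation, channel decoder, speech decoder. Send path: speech encoder, channel encoder, modulation, RF send. The blocks are grouped into a communication part and an application part.]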
Goal: Minimum “MIPS” to get the job done.
Application Domain: compute intensive functions
Source encoder/decoder = speech coders
Advanced vocoders for improved speech quality & higher capacity. Example: ACELP derivatives for GSM and IS136A
• Digital filtering (FIR, IIR)
• Vector quantization, code book search (square distance computation)
Channel encoder/decoder = error correcting
Complex wireless modems:
• Galois field arithmetic
• Convolution coders based on Viterbi trellis search
• Turbo coders
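The square distance computation behind a code book search can be sketched as follows. This is an illustrative exhaustive search, not the search procedure of any specific vocoder; the name `nearest_codeword` and the toy code book are assumptions made here.

```python
def nearest_codeword(vector, codebook):
    """Return (index, squared distance) of the codeword closest to vector."""
    best_index, best_dist = -1, float("inf")
    for index, codeword in enumerate(codebook):
        # Square distance: one subtract, multiply, and accumulate per element.
        dist = sum((v - c) ** 2 for v, c in zip(vector, codeword))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index, best_dist


codebook = [(0, 0), (1, 1), (4, 4)]
print(nearest_codeword((1, 2), codebook))
```

The inner loop is again a multiply-accumulate, which is why this function drove the next step in the evolution of DSP datapaths.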
Compute intensive functions: evolution of DSP’s
Simple FIR example
Square distance for speech processing
Speed-up of FIR example
Viterbi acceleration for communication algorithms
Evolution of DSPs follows these examples
The Viterbi Decoding (Introduction)
• Error-correcting decoding algorithm for convolutional codes
• Trellis representation
• Maximum likelihood decoding algorithm
• GSM system
Convolutional Code (ex. Wyner-Ash Code)
• Generator matrix G(D) = [ 1  1+D ]
• Input sequence u(D) = 1, 1, 0, 1, 0, …
• Output sequence c(D) = u(D)G(D) = 11, 10, 01, 11, 01, …
[Figure: encoder circuit with one delay element D.]
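The encoder for G(D) = [ 1  1+D ] is small enough to check by hand or with a few lines of code. The sketch below assumes the delay element starts at 0; the function name `encode_wyner_ash` is chosen here for illustration.

```python
def encode_wyner_ash(bits):
    """Rate-1/2 encoder for G(D) = [1, 1+D]: outputs (u, u XOR previous u)."""
    prev = 0                            # content of the delay element D
    out = []
    for u in bits:
        out.append(f"{u}{u ^ prev}")    # first bit: u, second bit: u + D*u (mod 2)
        prev = u
    return out


# Reproduces the output sequence of the slide for u = 1, 1, 0, 1, 0.
print(encode_wyner_ash([1, 1, 0, 1, 0]))
```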
Constraint length K and Rate
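• v = 1, K = 2, 2 states
• Rate = 1/2: one input bit generates two coded output bits.
[Figure: encoder with one delay element D and its two-state diagram, with transitions labeled input,output: state 0 to 0 on 0,00; state 0 to 1 on 1,11; state 1 to 1 on 1,10; state 1 to 0 on 0,01.]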
Trellis Representation
• Example: G(D) = [ 1+D^2  1+D+D^2 ], v = 2, K = 3, 4 states
• Instead of drawing a state diagram, the state transitions are unrolled over time into a trellis.
[Figure: encoder with two delay elements D, and the trellis for t = 0 to 4 with states S00, S10, S01, S11; the branches are labeled with their output bits 00, 11, 10, 01.]
Efficiency of Viterbi decoding
• Identifies the path through the trellis
--- Selects a survivor path for each state by calculating the Hamming distance
• The total number of paths grows exponentially with the number of states
--- As K increases, hardware complexity increases exponentially, but the error rate decreases
Viterbi Decoding Algorithm (1)
• Assume N = 7 blocks
[Figure: trellis with states S00, S10, S01, S11 for t = 0 to 7; the branches are labeled with their output bits.]
t                   0   1   2   3   4   5   6
Information data    0   1   1   0   1   0   0
Convolution codes   00  11  10  10  00  01  11
Error sequence      00  01  10  00  00  10  00
Received data       00  10  00  10  00  11  11
(The tail bit is marked at the end of the information data.)
Viterbi Decoding Algorithm (2)
• Calculate the Hamming distance (choose the smaller one)
[Figure: the same trellis for t = 0 to 7, with accumulated Hamming distances annotated at the states of the first stages. The data table (information data, convolution codes, error sequence, received data) is repeated unchanged.]
Viterbi Decoding Algorithm (3)
• Selecting the Optimal Path
[Figure: the complete trellis for t = 0 to 7, with the accumulated Hamming distance annotated at each state and the optimal path marked. The data table (information data, convolution codes, error sequence, received data) is repeated unchanged.]
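The three steps (build the trellis, accumulate Hamming distances, select the optimal path) can be tried on the received data of this example. The sketch below is a plain hard-decision Viterbi decoder for G(D) = [ 1+D^2  1+D+D^2 ], assuming the encoder starts and ends in state S00; the function names and the state encoding (u[t-1], u[t-2]) are choices made here, not taken from the slides.

```python
def branch(state, u):
    """One trellis branch for G(D) = [1+D^2, 1+D+D^2]; state = (u[t-1], u[t-2])."""
    a, b = state
    return (u, a), (u ^ b, u ^ a ^ b)     # next state, two output bits


def viterbi_decode(received):
    """Hard-decision Viterbi decoding; returns (decoded bits, final path metric)."""
    states = [(0, 0), (1, 0), (0, 1), (1, 1)]
    inf = float("inf")
    metric = {s: (0 if s == (0, 0) else inf) for s in states}
    history = []                           # per step: new state -> (old state, input bit)
    for r in received:
        new_metric = {s: inf for s in states}
        choice = {}
        for s in states:
            if metric[s] == inf:
                continue                   # state not reachable yet
            for u in (0, 1):
                nxt, out = branch(s, u)
                d = metric[s] + (out[0] ^ r[0]) + (out[1] ^ r[1])   # add
                if d < new_metric[nxt]:                            # compare
                    new_metric[nxt] = d                            # select
                    choice[nxt] = (s, u)
        metric = new_metric
        history.append(choice)
    state, bits = (0, 0), []
    for choice in reversed(history):       # traceback from the final state S00
        state, u = choice[state]
        bits.append(u)
    return bits[::-1], metric[(0, 0)]


received = [(0, 0), (1, 0), (0, 0), (1, 0), (0, 0), (1, 1), (1, 1)]
print(viterbi_decode(received))
```

For the received data of the example, the decoder recovers the information data 0 1 1 0 1 0 0 with a final path metric of 3, which equals the number of bit errors in the error sequence.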
Traceback
• For some applications we cannot wait for the end of the sequence
• The amount of “delay” is called the traceback depth LD.
--- Larger LD gives better performance, but needs more memory and complexity
Viterbi in GSM
• Full-rate speech channel, 22.8 kbps: Rate = 1/2, K = 5
• Half-rate speech channel, 11.4 kbps: Rate = 1/3, K = 7
Required Performance
Compute Intensive function 2: Viterbi
Basic algorithm in Viterbi channel decoders, modified version in turbo decoders.
Key operation: Add-Compare-Select (ACS)
[Figure: Viterbi butterfly. Old states i and i + s/2 connect to new states 2i and 2i+1, with branch metrics +a and -a.]
i = state index
s = number of states = 2^(k-1)
w = decoding window
Basic equations:
d(2i) = min { d(i) + a, d(i + s/2) - a }
d(2i + 1) = min { d(i) - a, d(i + s/2) + a }
IS-95: k = 8, w = 192, corresponds to 2^7 x 192 x (cycles for one ACS)
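The butterfly equations map directly onto an add, a compare, and a select. The sketch below is an illustration in Python with one branch metric a per butterfly; the function names and the convention that decision bit 1 means “the path from state i + s/2 won” are assumptions made here.

```python
def acs_butterfly(d, i, a):
    """Update new states 2i and 2i+1 from old states i and i + s/2 (s = len(d))."""
    s = len(d)
    lo, hi = d[i], d[i + s // 2]
    cand0 = (lo + a, hi - a)            # add: candidates for d(2i)
    cand1 = (lo - a, hi + a)            # add: candidates for d(2i+1)
    bit0 = int(cand0[1] < cand0[0])     # compare: 1 if the path from i + s/2 is shorter
    bit1 = int(cand1[1] < cand1[0])
    return min(cand0), min(cand1), bit0, bit1    # select


def acs_step(d, branch_metrics):
    """One trellis step: s/2 butterflies, one branch metric a per butterfly."""
    s = len(d)
    new_d, decisions = [0] * s, [0] * s
    for i in range(s // 2):
        m0, m1, b0, b1 = acs_butterfly(d, i, branch_metrics[i])
        new_d[2 * i], new_d[2 * i + 1] = m0, m1
        decisions[2 * i], decisions[2 * i + 1] = b0, b1
    return new_d, decisions


print(acs_step([5, 1, 0, 3], [2, 1]))
```

Each butterfly needs two additions, two subtractions, two comparisons, and the storage of two metrics and two decision bits, which is what the dedicated hardware on the following slides accelerates.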
Viterbi on Atmel’s Lode
Two MAC units & ALU: Add-Compare-Select
• DMAC operates as dual add/subtract unit
• ALU finds minimum
• Shortest distance saved
• Path indicator saved
• 4 cycles / butterfly
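[Figure: Lode datapath for ACS. Data buses DB0(16) and DB1(16) feed MAC0 and MAC1, which add the branch metrics µ1 and µ2 to the path metrics Γ1 and Γ2 (accumulators A0, A1); the ALU computes Min() into A2/A3 and sends the decision bit to memory.]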
Γ = min [(Γ1 + µ1), (Γ2 + µ2)]
Viterbi on TIC54x
ALU and CSSU: Add-Compare-Select
• ALU splits in 16 bit halves
• ACC splits in half
• Shortest distance saved
• CSSU compares halves
• Path indicator saved
• 4 cycles / butterfly
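[Figure: C54x datapath for ACS. Data buses DB0(16) and DB1(16) and TREG feed the ALU, which adds µ1 and µ2 to Γ1 and Γ2 in the two accumulator halves; the CSSU (compare, MSW/LSW select) produces Γ, the decision bit goes to the TRN register, and the result goes to memory over data bus EB.]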
Γ = min [(Γ1 + µ1), (Γ2 + µ2)]
Viterbi on LU DSP16210
GSM (K=5, 16 states)
do 8 {
  a0=a4+y   a1=a5-y   *r3++=a0h
  a2=a4-y   a3=a5+y   *r5++=a2h
  a0=cmp1(a1,a0)   yh=*r0   r0=r1+j   j=k   k=*pt1++
  a2=cmp1(a3,a2)   a4_5h=*pt0++
}
• Hardware support for Viterbi algorithm:
– ACS calculations are efficient
– Minimal overhead
• 4 cycles per butterfly
– 32 cycles per GSM timeslot.
• Comparison functions store ACS decision bits:
[Figure: the comparisons a0=cmp1(a1,a0) and a2=cmp1(a3,a2) shift their decision bits into AR0; the results are written to memory.]
Courtesy: Gareth Hughes, Bell Labs Australia
BUT: DSP Software Development
• Complex DSP architecture not amenable to compiler technology
• Algorithms are modeled in high level language (e.g. C++)
• Solutions are implemented and debugged in hand-optimized assembler - large development effort with minimal tool support
[Figure: development flow from HLL algorithmic model to prototype code to production code, with hand-coded assembler and an optimize & debug loop.]
Long, frustrating time to market
Fragile legacy code
Widely used in handhelds, but basestations are changing to VLIW
2G Basestation Baseband Processing
• Multiple DSPs used for baseband processing
• RISC microcontroller for timing, framing, I/O control
• Software upgradable over the network
• DSPs dominate cost and power consumption
[Figure: Tx/Rx baseband processing board for a 2-carrier GSM basestation: DSPs for channel equalization, channel de/coding, and encryption on the Tx and Rx paths, a RISC microcontroller, RAM, I/O, an I/O ASIC, AFEs, and a T1/E1 interface.]
Future trend: integrate baseband processing, low-cost Pico BTS
Compiler Driven VLIW
Large orthogonal register set, regular interconnect
[Figure: data memory and register array connected through an interconnect to the execution units ex1 (alu), ex2 (alu), ex3 (mpy), ex4 (ld/st), ..., exn (ld/st).]
Instruction format: cond/branch | ex1 | ex2 | ex3 | ... | exn
Atomic RISC-like operations => heavily pipelined, high freq. clock
Explicitly Parallel Instruction Computing
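[Figure: execution clusters. Data memory feeds two clusters, each with its own register array and interconnect; the execution units are ex1 (alu), ex2 (alu), ex3 (ld/st), ex4 (alu), ex5 (mpy), ex6 (ld/st).]
[Figure: execution sets. A bit pattern (example: 1 1 1 0 1 0 1 0) divides one fetch set into execution sets.]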
Texas Instruments ‘C6201
[Figure: block diagram. Program memory (16K x 32) with a 256-bit path into instruction dispatch & decode; two clusters, each with ALU, shift, mpy, and add units and a register bank (Bank A and Bank B, 16 x 32 each); data memory (32K x 16).]
• 8-way VLIW with two execution clusters
• 256-bit (8 x 32) instruction fetch with variable-length execute set
• Each 32-bit instruction individually predicated
• 11-stage pipeline
• 1600 MIPS, 400 MMACs @ 200 MHz
FIR Filter on TI ‘C6x
loop:
ldw .d1t1 *a4++,a5
|| ldw .d2t2 *b4++,b5
||[b0] sub .s2 b0,1,b0
||[b0] b .s1 loop
|| mpy .m1x a5,b5,a6
|| mpyh .m2x a5,b5,b6
|| add .l1 a7,a6,a7
|| add .l2 b7,b6,b7
Hand-coded assembly: 32-tap FIR filter
• Outer loop: 23 cycles, 180 bytes
– 1 cycle in inner loop
• All 8 execution units used in inner loop: maximum efficiency
– 2 MACs per cycle
• Assembly syntax more difficult to learn.
• Hard to get full use of all 8 execution units at once.
• Software pipelining is difficult to implement and requires a longer prolog/epilog (larger code size).
Courtesy: Gareth Hughes: Bell Labs Australia
Viterbi on TI ‘C6x
LOOP:
   [b1]  b     .s1   LOOP
|| [b1]  sub   .s2   b1,1,b1
|| [!a2] sth   .d1   b12,*+a6[8]
|| [!a2] add   .d2   b0,b14,b14
||       cmpgt .l1   a11,a10,a1
||       cmpgt .l2   b11,b10,b0
||       mpy   .m1x  1,b5,a4

   [a2]  sub   .s1   a2,1,a2
|| [!a2] sth   .d1   a12,*a6++
|| [a1]  add   .s2   2,b0,b0
|| [b0]  mpy   .m2   1,b11,b12
||       mpy   .m1   1,a10,a12
||       sub   .l2x  a7,b5,b10
||       ldh   .d2   *++b9,b5

         shl   .s2   b14,2,b14
|| [a1]  mpy   .m1   1,a11,a12
||       add   .s1   a7,a4,a10
||       sub   .l1x  b13,a4,a11
||       add   .l2   b13,b5,b11
||       mpy   .m2   1,b10,b12
||       ldh   .d2   *b4++[2],a7
||       ldh   .d1   *a5++[2],b13
; end of LOOP
[Figure: utilization of execution units in the Viterbi decoder: occupancy of .D1, .D2, .M1, .M2, .L1, .L2, .S1, and .S2 over cycles 0 to 31, showing the 3-cycle, 2-ACS inner loop repeated 8 times.]
• 16-state Viterbi decoder for GSM from the TI WWW site: ftp://ftp.ti.com/pub/tms320bbs/c62xfiles/vitgsm.asm
– 3 cycles per butterfly
– 32 cycles per GSM timeslot (8 butterflies)
– MPY instructions used to move data
Lucent / Motorola Star*Core SC140
• 6-way VLIW with 128-bit (8 x 16) instruction fetch
• Prefix instructions for high performance without sacrificing code density
• Each execution set (parallel instructions + prefix) predicated
• 5-stage pipeline
• 1800 MIPS, 1200 MMACs @ 300 MHz
[Figure: block diagram. Program / data memory, program sequencer, instruction dispatcher, address registers (27) with two AAUs, and data registers (16) with four units, each containing a MAC, an ALU, and a BFU.]
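The peak figures quoted for the two VLIW processors follow from issue width, number of multiply units, and clock frequency. The short check below is a sketch: the MAC unit counts (2 for the 'C6201, 4 for the SC140) are read from the block diagrams and are an assumption of this example, as is the helper name `peak_rates`.

```python
def peak_rates(issue_width, mac_units, clock_mhz):
    """Return (peak MIPS, peak MMACs) for a VLIW issuing issue_width operations per cycle."""
    return issue_width * clock_mhz, mac_units * clock_mhz


print(peak_rates(8, 2, 200))   # TI 'C6201: 8-way, 2 multipliers, 200 MHz
print(peak_rates(6, 4, 300))   # Star*Core SC140: 6-way, 4 MACs, 300 MHz
```

These are peak numbers; they are reached only when the compiler or programmer fills every slot of every execution set.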
Viterbi on Star*Core
• Hardware support for Viterbi algorithm:
– max2vit instruction
– vsl instruction
• 1 cycle per butterfly through software pipelining
• Decision bits are manually stored using the Viterbi Shift Left (VSL) instruction:
GSM (K=5, 16 states)
[ move.2l (r0)+,d0:d1   move.2l (r1)+,d1:d2 ]
[ add2 d0,d4   sub2 d6,d2
  sub2 d4,d0   add2 d2,d6 ]
[ max2vit d4,d2   max2vit d0,d6 ]
[ vsl.4w d2:d6:d1:d3,(r2)+n0
  vsl.4f d2:d6:d1:d3,(r3)+n0 ]
x 4
[Figure: max2vit d4,d2 and max2vit d0,d6 record their decisions in SR; vsl.4w d2:d6:d1:d3,(r2)+n0 writes the path metrics from D1, D3, D2, D6 and the decisions to memory.]
Courtesy: Gareth Hughes, Bell Labs Australia
SOC
Energy-efficient SoCs are distributed
[‘Under the Hood’, EET, D. Carey, 9/5/02]
T-Mobile PocketPC Phone:
• TI baseband DSP
• HTC interface ASIC
• TI power management
• Intel 32 Mb flash
• Intel 128 Mb flash
• Winbond 128 Mb SDRAM
• TI RF synthesizer
• TI RF TX/RX
• Conexant power amplifier
• Intel StrongARM
• Sony LCD interface
• Sony 240x320 color LCD
• Philips audio codec
• Touchscreen, SIM, MMIC expansion
Architecture tuned to application: PalmPilot i705
• Display, AD7873 digitizer
• Motorola DragonBall
• 8M SDRAM, 4M flash
• FPGA
• Philips USB
• Maxim transceivers
• Agere POM baseband
• Motorola transceiver
• RF Micro power amp
• Maxim control, driver
• Memory card slot
Energy-flexibility trade-off
[Figure: power and cost versus flexibility, ranging from general purpose to fixed, application-specific ASIC; the platform region in between is marked “???”.]
General-purpose architectures also become heterogeneous.
[Figure: Xilinx FPGA with embedded blocks: IBM PowerPC® RISC CPU, synchronous dual-port RAM, SelectIO-Ultra™, SystemIO™ & XCITE™, Conexant 3.125 Gb serial, XtremeDSP™. Source: Xilinx webpage]
Question
• Energy and flexibility are opposite demands!
• How to navigate in this jungle?
• 3D design space:
– Computational abstraction level
– Reconfigurable feature
– Binding rate
• Next question: how to map (or compile) an application onto such an architecture?
Flexibility (1) - Abstraction level
Computational Abstraction Level
• Instruction set level = “programmable”
• CLB level = “reconfigurable”
Flexibility (2) - Reconfigurable feature
• Basic components: computation, storage, communication
Abstraction level              Computation                  Storage            Communication
Implementation                 CLB                          RAM details        Switches, muxes
Micro-architecture             Execution unit type          Register file      Cross-bar, busses
Instruction set architecture   Custom instructions          Register set       Size of address/data bus
Systems                        Number & type of processes   Memory hierarchy   Interconnect network
(Axes: reconfigurable feature versus computational abstraction level.)
Flexibility (3) - Binding rate
Binding rate: compare processing to binding
• Configurable (“compile-time”)
• Re-configurable
• Dynamic reconfigurable (“adaptive”)
SOC architecture: RINGS
[Figure: RINGS architecture. Domain-specific engines (CPU, DSP, RF, baseband processing, video engine, memory) are connected by a reconfigurable interconnect. Each application domain (networking, video, signal processing, medium access / baseband processing) is drawn as a ring of abstraction levels such as standard, algorithm, architecture, µarchitecture, circuit, software, and assembly.]
Instruction set extension
• Instruction set extension
• Register mapped
• Tightly coupled
• Experiment: DFT
Energy for 1000 iterations:
  SW on embedded proc.: 67.6 mJ
  SW with HW datapath: 5.76 mJ
  Improvement: 12.5 times
Co-processor
• Memory mapped
• Loosely coupled
• Experiment: AES
[Figure: co-processor with local memory.]
Energy for 175 iterations:
  SW on emb. proc.: 89.2 mJ
  SW with HW datapath: 13.5 mJ
  Improvement: 25 times
Independent IP
• Loosely coupled
• Network-on-chip connected
• Flexible interconnect
• Experiment: TCP/IP checksum
[Figure: IP blocks connected through routers.]
Energy for 100 packets:
  SW on emb. proc.: 17.0 mJ
  HW datapath: 0.20 mJ
  Improvement: 84 times
Example: The Security Pyramid
[Figure: the security pyramid, from top to bottom: protocol (identification, confidentiality, integrity), algorithm (Kasumi, Rijndael, RC4, MD5, …), architecture (Java, JVM, JCA, CPU, crypto, MEM), micro-architecture, circuit (D, Q, CLK, Vcc).]
Example: AES Coprocessor
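[Figure: AES coprocessor core with input FSM, processing FSM, and output FSM around the encrypt datapath and the key schedule; signals include instruction, round key, and handshake, with 16-bit and 256-bit paths indicated.]
[DAC 2002]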
Design options: AES acceleration: Gbits/Joule
AES, 128-bit key, 128-bit data
Platform                 Throughput       Power     Figure of merit (Gb/s/W)
0.18 µm CMOS             2 Gbits/sec      56 mW     35.7 (1/1)
FPGA [1]                 1.32 Gbit/sec    490 mW    2.7 (1/11)
Asm, Pentium III [2]     648 Mbits/sec    41.4 W    0.015 (1/1900)
C, emb. Sparc [3]        133 Kbits/sec    120 mW    0.0011 (1/33000)
Java [4], emb. Sparc     450 bits/sec     120 mW    0.0000037 (1/9.600.000)
[1] Amphion CS5230 on Virtex2 + Xilinx Virtex2 Power Estimator
[2] Helger Lipmaa PIII assembly, hand-coded + Intel Pentium III (1.13 GHz) datasheet
[3] gcc, 1 mW/MHz @ 120 MHz Sparc, assumes 0.25 µm CMOS
[4] Java on KVM (Sun J2ME, non-JIT) on 1 mW/MHz @ 120 MHz Sparc, assumes 0.25 µm CMOS
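The figure of merit is throughput divided by power, which is the same as Gbits per Joule. The sketch below recomputes it from the throughput and power values of the slide; the dictionary layout and the function name `figure_of_merit` are choices made for this example.

```python
# (throughput in bits/sec, power in watts), values as quoted on the slide
designs = {
    "0.18 um CMOS": (2e9, 56e-3),
    "FPGA": (1.32e9, 490e-3),
    "Asm, Pentium III": (648e6, 41.4),
    "C, emb. Sparc": (133e3, 120e-3),
    "Java, emb. Sparc": (450, 120e-3),
}


def figure_of_merit(throughput_bps, power_w):
    """Gb/s per watt, which equals Gbits per Joule."""
    return throughput_bps / power_w / 1e9


for name, (throughput, power) in designs.items():
    print(f"{name:<18} {figure_of_merit(throughput, power):.7f} Gb/s/W")
```

The computed values agree with the figure-of-merit column up to rounding, and they span about seven orders of magnitude between dedicated hardware and interpreted Java on an embedded processor.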
Conclusion
Applications mapped onto architectures, with design methods = Low Power!