Processor Architectures and Program Mapping
Application domain specific processors (ADSP or ASIP)
5kk10, TU/e
Henk Corporaal, Jef van Meerbergen, Bart Mesman
04/19/23 Processor Architectures and Program Mapping H. Corporaal, J. van Meerbergen, and B. Mesman
Application domain specific processors (ADSP or ASIP)
[Figure: flexibility versus efficiency axis, with DSP, programmable CPU, programmable DSP, application domain specific processor and application specific processor placed along it.]
An ADSP:
• takes a well-defined application domain as a starting point
• exploits characteristics of the domain (computation kernels)
• is still programmable within the domain
e.g. MPEG2 coding uses an 8*8 DCT transform; DECT, GSM, etc.
[Figure: an application domain mapped to an implementation, once via a general-purpose (GP) processor and once via an ADSP.]
• performance: clock speed + ILP; ILP + tuning to domain
• flexible development (new applications)
• cost effective (high volume)
Problems:
• specification: manual design
• design time and effort: large effort => synthesized cores
Part | Description | Clock (MHz) | Size (gates) | ROM (Kbyte) | RAM (Kbyte)
Speech Components
ADPCM | Full duplex ITU-T G.726 compliant and 40 kbit/s speech-compression encoder/decoder. | 4 | 5,100 | 1.3 | 0.128
ADPCM-16 | Full duplex 16 Channel ITU-T G.726 compliant 16, 24, 32 and 40 kbit/s speech-compression encoder/decoder. | 32 | 10,200 | 1.3 | 2.048
IW-ASR Speech Recognition | Template-based speaker-dependent, isolated-word automatic speech recognition. | 1.3 | 9,000 | 6 | approx. 1 kbyte/word
G.723.1 | Low bit-rate ITU-T G.723.1 compliant speech-compression at 6.3 kbit/s; can be combined with G.723.1A. | 20 | 24,000 | 22 | 2.3
G.723.1A | Extended version of G.723.1 to reduce bit rate by a silence compression scheme. Uses voice activity detection and comfort-noise generation. Fully compliant with Annex A of speech-compression standard CODEC G.723.1. Yields no additional hardware cost. | 20 | 24,000 | 22 | 2.3
Speech Synthesis | Phrase-concatenated speech synthesis. | Depends on compression requirements
Telecommunications
Echo Cancellation | High-performance echo-cancellation and suppression processor. | 4 | 6,000 | 2.80 | 0.15
DTMF | Full-duplex DTMF transceiver. | 2 | 4,000 | 1.00 | 0.15
Caller-ID | On-hook and off-hook caller line identification. Includes DTMF and V.23. | 3 | 6,000 | 2.10 | 0.15
Reed-Solomon | Full-duplex Reed-Solomon codec. | | 7,000 | 3.75 | 0.15
Viterbi Decoder | Configurable rate, code and constraint-length. Configurable traceback depth. Supports soft & hard decision making. Supports code puncturing. | (depending on throughput) | 5,000 to 9,000 | --- | ---
V.23 modem | ITU-T V.23 compliant 1200 baud FSK modem. | | 6,000 | 0.80 | 0.15
Other
Pink Noise Generator | Low-ripple pink noise filter with filter characteristic of -3 ± 0.08 dB per octave over the bandwidth 20 Hz to 20 kHz. | | 4,000 | 0.10 | 0.10
CCIR 656/601 | Digital video converter: CCIR to raw-video data and vice versa. | | 1,500 | none | none
Source: www.adelantetech.com
Outline
• design process
• retargetable code generation (problem statement)
• ADSP/VLIW architectures (Mistral 2 / A|RT designer)
• instructive demo (Adelante)
• application examples
• low power aspects (Mistral 2 / A|RT designer)
• discussion
• conclusion
Design process
[Figure: exploration flow. Application(s) and a processor model (parameters, instance; e.g. VLIW with shared RFs) feed SW (code generation) and HW design. Code generation gives estimations of cycles/alg and occupation; HW design gives estimations of nsec/cycle, area and power/instr. Decision points "OK?" and "more appl.?" (yes/no) lead to "go to phase 2".]
3 phases:
1. exploration
2. hw design (layout) + processing
3. design appl. sw
Fast, accurate and early feedback
Problem statement
A compiler is retargetable if it can generate code for a ‘new’ processor architecture specified in a machine description file.
A guarded register transfer pattern (GRTP) is a register transfer pattern (RTP) together with the control bits of the instruction word that control the RTP.
e.g. a := b + c | instr = xxxx0101
GRTPs contain all inter-RT-conflict information.
Instruction set extraction (ISE) is the process of generating all possible GRTPs for a specific processor.
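As a rough illustration of how GRTPs can carry conflict information, the sketch below (Python, not part of the original slides) stores each GRTP as a register transfer plus a bit pattern over '0', '1' and 'x' (don't care). Two GRTPs are treated as combinable in one instruction word when no bit position demands both 0 and 1. The names `GRTP` and `compatible` and the second and third patterns are hypothetical; only `a := b + c | instr = xxxx0101` comes from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GRTP:
    """A register transfer pattern guarded by instruction-word bits."""
    transfer: str  # e.g. "a := b + c"
    bits: str      # one char per instruction bit: '0', '1' or 'x' (don't care)


def compatible(p: GRTP, q: GRTP) -> bool:
    """True if both patterns can be encoded in the same instruction word."""
    if len(p.bits) != len(q.bits):
        raise ValueError("patterns must cover the same instruction word")
    # A conflict exists only where one pattern needs '0' and the other '1'.
    return all(a == b or "x" in (a, b) for a, b in zip(p.bits, q.bits))


add = GRTP("a := b + c", "xxxx0101")   # taken from the slide
store = GRTP("m := a", "1xxxxxxx")     # hypothetical pattern
sub = GRTP("a := b - c", "xxxx0001")   # hypothetical pattern

print(compatible(add, store))  # the two patterns constrain disjoint bits
print(compatible(add, sub))    # the two patterns disagree on one opcode bit
```

Under this simplification a code generator only has to compare bit patterns to decide whether two register transfers may be scheduled in the same cycle.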
Problem statement
[Figure: code generation flow. Algorithm spec, FE, CDFG, Code Generation, Machine code; Processor spec (instance), ISE, GRTP, with the GRTPs feeding Code Generation.]
Note: in ch 4 this is part of the code generator.
Example: Simple processor [Leupers]
[Figure: datapath with PC (incremented by +1), instruction memory IM, RAM, register REG, input Inp and output outp. Instruction word I.(20:0) with fields I.(20:13), I.(12:5), I.(4), I.(3:2) and I.(1:0).]
Example: Simple processor [Leupers]
Instruction | Instruction bits (20..0)
PC := PC + 1 | xxxxxxxxxxxxxxxxxxxxx
REG := Inp | xxxxxxxxxxxxxxxxx011x
REG := IM[PC].(20..13) | xxxxxxxxxxxxxxxxx001x
REG := RAM[IM[PC].(12..5)] | xxxxxxxxxxxxxxxxx1x1x
REG := REG - Inp | xxxxxxxxxxxxxxxxx0101
REG := REG - IM[PC].(20..13) | xxxxxxxxxxxxxxxxx0001
REG := REG - RAM[IM[PC].(12..5)] | xxxxxxxxxxxxxxxxx1x01
REG := REG + Inp | xxxxxxxxxxxxxxxxx0100
REG := REG + IM[PC].(20..13) | xxxxxxxxxxxxxxxxx0000
REG := REG + RAM[IM[PC].(12..5)] | xxxxxxxxxxxxxxxxx1x00
RAM[IM[PC].(12..5)] := REG | xxxxxxxxxxxxxxxx1xxxx
outp := REG | xxxxxxxxxxxxxxxxxxxxx
RAM_NOP | xxxxxxxxxxxxxxxx0xxxx
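To make the table concrete, here is a small Python sketch (not from the slides) that lists which register transfers a given 21-bit instruction word enables. The low five bits of each pattern are transcribed from the table above; the bracket notation such as `IM[PC]` and the names `TABLE`, `matches` and `decode` are assumptions made for this illustration.

```python
PREFIX = "x" * 16  # bits 20..5 are don't-care in every pattern of the table

# Bits 4..0 of each pattern, in the order of the table.
TABLE = {
    "PC := PC + 1": "xxxxx",
    "REG := Inp": "x011x",
    "REG := IM[PC].(20..13)": "x001x",
    "REG := RAM[IM[PC].(12..5)]": "x1x1x",
    "REG := REG - Inp": "x0101",
    "REG := REG - IM[PC].(20..13)": "x0001",
    "REG := REG - RAM[IM[PC].(12..5)]": "x1x01",
    "REG := REG + Inp": "x0100",
    "REG := REG + IM[PC].(20..13)": "x0000",
    "REG := REG + RAM[IM[PC].(12..5)]": "x1x00",
    "RAM[IM[PC].(12..5)] := REG": "1xxxx",
    "outp := REG": "xxxxx",
    "RAM_NOP": "0xxxx",
}


def matches(word, pattern):
    """Check a 21-bit instruction word against a 0/1/x pattern (bit 20 first)."""
    if not 0 <= word < 1 << 21:
        raise ValueError("instruction word must fit in 21 bits")
    bits = format(word, "021b")
    return all(p in ("x", b) for p, b in zip(pattern, bits))


def decode(word):
    """Return all register transfers whose guard bits match the word."""
    return [rt for rt, low in TABLE.items() if matches(word, PREFIX + low)]


# Low bits 10101: RAM write enabled, source Inp, operation subtract.
for rt in decode(0b10101):
    print(rt)
```

Several transfers match one word because the unconditional ones (PC increment, output) are active in every cycle, while bit 4 independently selects between the RAM store and RAM_NOP.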
ASIP/VLIW architectures
A|RT designer template as an example (= set of rules, a model)
Differences with the VLIW processors of ch. 4:
1. Parallel FUs
• ASUs = complex application-specific FUs (beyond subword parallelism), e.g. biquad, median, DCT etc.
• larger grain size, more heterogeneous, more pipelines
2. Rfiles
• many Rfiles (>5 vs. 1 or 2)
• limited # ports (3 vs. 15)
• limited size (<16 vs. 128)
3. Issue slots
• all in parallel vs. 5
[Figure: distributed VLIW template. Functional units FU1 to FU4, each with its own register files (RF1 to RF8), instruction registers IR1 to IR4, instruction memory, control, and flags.]
ASIP/VLIW architectures
Additional characteristics of the A|RT designer template
• interconnect network: busses + input multiplexers
– mux control is part of the instruction
– control can change every clock cycle
– network can be incomplete
– busses can be merged
• memories are modeled as FUs
– separate data in and data out
– 2 inputs (data in and address) and 1 output
• each FU can generate one or more flags
• instruction format (per issue slot): read address RF 1 | write address RF 1 | read address RF 2 | write address RF 2 | mux 1 | mux 2 | control FU | output drivers
ASIP/VLIW architectures: example
[Figure: datapath with an ALU and a MAC, register files RF1 to RF4, busses bus1 and bus2, and input multiplexers mux 2 and mux 3. The instruction word shows the fields read RF1, write RF1, read RF2, write RF2, ALU instr., mux, read RF4, write RF4, read RF3, write RF3, MAC instr., with bit positions 0, 9, 10 and 19 marked.]
ASIP/VLIW architectures: example
GRTP | Instruction bits (19..0)
RF1 = ALU (RF1, RF2) | x c c c c x x c c c x x x x x x x x x x
RF2 = ALU (RF1, RF2) | x c x c c c c c c c x x x x x x x x x x
RF3 = ALU (RF1, RF2) | x c x c c x x c c c c x x c c x x x x x
RF3 = MAC (RF3, RF4) | x x x x x x x x x x c c c c c c x c c c
RF4 = MAC (RF3, RF4) | x x x x x x x x x x x c c x x c c c c c
RF2 = MAC (RF3, RF4) | c x x x x c c x x x x c c x x c x c c c
ASIP/VLIW architectures: design flow
[Figure: design flow with the blocks Algorithm spec, RTs, Datapath synthesis, Controller synthesis, and Estimations (area, power, timing), followed by an "OK?" decision; on "no", change pragmas and iterate.]
Example RT: RF1 : x = RF2 : y, RF3 : z | ALU = ADD, Inmux = bus2
Example pragmas:
assign ( a+b, ALU, fu_alu1)
assign ( a+_, ALU, fu_alu2)
assign ( _+_, ALU, fu_alu3)
VLIW makes relatively simple code selection possible
ASIP/VLIW architectures: feedback
• architecture view
• life-time analysis
• resource load
• bus load
• cycle-count
Outline
• design process
• retargetable code generation (problem statement)
• ASIP/VLIW architectures (Mistral 2 / A|RT designer)
• instructive demo (Adelante)
• application examples
• low power aspects (Mistral 2 / A|RT designer)
• discussion
• conclusion
Application examples: adaptive filter
[Figure: adaptive filter with coefficients c0, c1, ..., c63 set by a control unit; signals x, y, e and r, with a subtraction node.]
Minimizes the difference between x and e (reference signal)
Many applications are possible
• echo cancelling for TV: e = flyback signal (known without echoes)
• automatic equalization of cables in data transmission
• acoustic echo cancelling
Application examples: adaptive filter
[Figure: the adaptive filter (coefficients c0 to c63, control unit, signals x, y, e, r) in a speaker/microphone setup, with the labels speech, speech + noise, and noise.]
Application examples: adaptive filter
[Figure: the same adaptive filter used in a hearing aid, with the labels noise (e.g. radio), speech + noise, and speech.]
Application examples: adaptive filter
[Figure: signal-flow graph of the 64-tap filter. Delay line (Z-1) producing x[n], x[n-1], ..., x[n-i], ..., x[n-63]; coefficient blocks A0, A1, ..., Ai, ..., An with coefficients c0, c1, ..., ci, ..., c63; partial sums S0[n], S1[n], ..., Si[n], ..., S63[n]; signals e[n], ê[n], r[n], t[n] and the step size mu.]
Application examples: adaptive filter
[Figure: detail of coefficient block Ai, with a multiplier, an adder and a delay (Z-1); signals Ci[n], Ci[n-1], x[n-i] and t[n].]
Application examples: adaptive filter
#define mu 0.1
#define WORD num<32,12>
func main ( input, e : WORD ) r : WORD =
begin
  sum [ 0 ] = WORD ( 0 )
  x = input
  t = WORD ( r @ 1 * WORD ( mu ) )
  (i : 0 .. 63) ::
  begin
    c [ i ] = c [ i ] @ 1 + WORD ( t * x @ i )
    s [ i ] = WORD ( x @ i * c [ i ] @ 1 )
    sum [ i+1 ] = sum [ i ] + s [ i ]
  end
  ehat = sum [ 64 ]
  r = e - ehat
end
[Figure: data-flow graph of the loop body, with two multiplications and two additions operating on x@i, t, c[i]@1, sum[i] and sum[i+1], plus read (r) and write (w) accesses.]
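For readers unfamiliar with the specification language, the following Python sketch mirrors the listing above one sample at a time. It assumes that `v @ k` denotes the value of `v` from `k` samples earlier and that all delayed values start at zero; it uses floating point instead of the `num<32,12>` fixed-point type, so it is an approximation of the behaviour, not a bit-true model. The names `make_filter` and `step` are hypothetical.

```python
from collections import deque

N_TAPS = 64
MU = 0.1


def make_filter(n_taps=N_TAPS, mu=MU):
    """Return a per-sample step function with its own filter state."""
    c = [0.0] * n_taps                              # coefficients c[i]
    x_hist = deque([0.0] * n_taps, maxlen=n_taps)   # x_hist[i] plays the role of x@i
    state = {"r_prev": 0.0}                         # r@1

    def step(x_in, e):
        x_hist.appendleft(x_in)        # newest sample becomes x@0
        t = state["r_prev"] * mu       # t = r@1 * mu
        ehat = 0.0
        for i in range(n_taps):
            c_prev = c[i]                      # c[i]@1
            c[i] = c_prev + t * x_hist[i]      # coefficient update
            ehat += x_hist[i] * c_prev         # s[i], accumulated into sum
        r = e - ehat
        state["r_prev"] = r
        return r

    return step


f = make_filter()
for x_in, e in [(1.0, 2.0), (1.0, 0.0), (1.0, 0.0)]:
    print(f(x_in, e))
```

Each sample costs 64 coefficient updates and 64 multiply-accumulates, which is the workload the implementations on the following slides distribute over their functional units.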
Application examples: adaptive filter, implementation 1
[Figure: datapath with RAM, ALU, ROM, MULT and ACU connected through bus1 and bus2.]
266 clock cycles, 1.1 mm2
Application examples: adaptive filter, implementation 2
[Figure: datapath with RAM, ALU, ROM and ACU connected through bus1 and bus2 (no multiplier).]
2250 clock cycles, 0.7 mm2
Application examples: adaptive filter, implementation 3
[Figure: datapath with RAM1, ACU1, ALU, MULT, RAM2, ROM and ACU2.]
202 clock cycles, 1.4 mm2
[Figure: plot of clock cycles (ticks at 1000 and 2000) versus area in mm2 (ticks at 1 and 2).]
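The three implementations trade area against cycle count. As a minimal sketch of how such design points can be compared, the Python below checks which points are Pareto-optimal (no other point is at least as good in both cycles and area and better in one). The cycle and area figures are the ones quoted on the slides; the function names and the extra "hypothetical" point are assumptions for illustration.

```python
# (name, clock cycles, area in mm2) as quoted for the three implementations
POINTS = [
    ("implementation 1", 266, 1.1),
    ("implementation 2", 2250, 0.7),
    ("implementation 3", 202, 1.4),
]


def dominated(point, others):
    """True if some other point is no worse in both metrics and better in one."""
    _, cyc, area = point
    return any(
        c <= cyc and a <= area and (c < cyc or a < area)
        for _, c, a in others
    )


def pareto_front(points):
    """Names of the points that are not dominated by any other point."""
    return [p[0] for p in points if not dominated(p, points)]


print(pareto_front(POINTS))
# A made-up design that is slower and larger than implementation 1 drops out:
print(pareto_front(POINTS + [("hypothetical", 300, 1.2)]))
```

All three slide designs remain on the front, so the choice between them depends on the cycle budget and the area budget of the application.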
Outline
• design process
• retargetable code generation (problem statement)
• ADSP/VLIW architectures (Mistral 2 / A|RT designer)
• instructive demo (Adelante)
• application examples
• low power aspects (Mistral 2 / A|RT designer)
• discussion
• conclusion
Low power aspects
• Estimation
[Figure: Mistral2 working on an implementation-independent design database, with an estimation database (area, speed, power) and the architecture as inputs.]
EXU | ACTIVITY | AREA | POWER
alu_1 | 20% | 261 | 105
acs_asu_1 | 83% | 2382 | 3816
or_asu_1 | 10% | 611 | 122
romctrl_1 | 16% | 65 | 21
acu_1 | 36% | 294 | 205
ipb_1 | 20% | 107 | 43
opb_1 | 11% | 163 | 35
ctrl | | 1864 | 3597
total | | 5747 | 7944
GSM viterbi decoder : default solution
Cycle count: 13750
EXU | ACTIV | AREA | POWER
alu_1 | 96% | 3469 | 46196
romctrl_1 | 48% | 39 | 259
acu_1 | 26% | 327 | 1209
ipb_1 | 5% | 131 | 105
opb_1 | 23% | 1804 | 5801
ctrl | | 9821 | 135035
total | | 15591 | 188605
• controller responsible for 70% of power consumption
– maximum resource-sharing
– heavy decision-making: “main” loop with 16 metrics-computations per iteration
• EXU numbers include registers for local storage
GSM viterbi decoder : no loop-folding
Cycle count: 14247
EXU | ACTIV | AREA | POWER
alu_1 | 92% | 3411 | 45073
romctrl_1 | 45% | 39 | 255
acu_1 | 25% | 294 | 1087
ipb_1 | 5% | 107 | 86
opb_1 | 22% | 1661 | 5340
ctrl | | 4919 | 70087
total | | 10431 | 121928
• area down by 33%
• power down by 35%
• next step: reduce # of program-steps with second ALU
GSM viterbi decoder : 2 ALUs
Cycle count: 9739
EXU | ACTIV | AREA | POWER
alu_1 | 69% | 1797 | 12248
alu_2 | 65% | 1393 | 8916
romctrl_1 | 67% | 39 | 255
acu_1 | 37% | 294 | 1087
ipb_1 | 8% | 149 | 119
opb_1 | 33% | 2136 | 6871
ctrl | | 8957 | 87235
total | | 14766 | 116731
• cycle count down 30%
• area up 42%
• power down by 5%
• next step: introduce ASU to reduce ALU-load
GSM viterbi decoder : 1 x ACS-ASU
Cycle count: 1930
EXU | ACTIV | AREA | POWER
alu_1 | 20% | 261 | 105
acs_asu_1 | 83% | 2382 | 3816
or_asu_1 | 10% | 611 | 122
romctrl_1 | 16% | 65 | 21
acu_1 | 36% | 294 | 205
ipb_1 | 20% | 107 | 43
opb_1 | 11% | 163 | 35
ctrl | | 1864 | 3597
total | | 5747 | 7944
The ACS-ASU implements:
func ACS ( M1, M2, d ) MS, MS8 =
begin
  MS = if ( M1+d > M2-d ) -> ( M1+d ) || ( M2-d ) fi;
  MS8 = if ( M1-d > M2+d ) -> ( M1-d ) || ( M2+d ) fi;
end;
• cycle count down 5X
• power down 20X !
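The ACS (add-compare-select) function is small enough to restate in Python. The sketch below is only a behavioural paraphrase of the listing above: each conditional expression selects the larger of two candidate metrics, which is written here with `max`. The function name `acs` and the sample values are illustrative.

```python
def acs(m1, m2, d):
    """Add-compare-select: return (MS, MS8) for path metrics m1, m2 and branch metric d."""
    ms = max(m1 + d, m2 - d)    # MS  = larger of M1+d and M2-d
    ms8 = max(m1 - d, m2 + d)   # MS8 = larger of M1-d and M2+d
    return ms, ms8


print(acs(3, 5, 1))
print(acs(10, 2, 3))
```

On a plain ALU this takes four additions or subtractions plus two compare-and-select steps per call; folding them into one ASU operation is consistent with the drop in ALU activity and cycle count reported on this slide.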
GSM viterbi decoder : 4 x ACS-ASU
Cycle count: 425
EXU | ACTIV | AREA | POWER
alu_1 | 94% | 243 | 97
acs_asu_1 | 95% | 1041 | 420
acs_asu_2 | 95% | 1041 | 420
acs_asu_3 | 95% | 1041 | 420
acs_asu_4 | 95% | 1041 | 420
split_asu_1 | 47% | 90 | 18
or_asu_1 | 47% | 592 | 118
romctrl_1 | 28% | 48 | 6
acu_1 | 98% | 212 | 85
ipb_1 | 23% | 60 | 6
opb_1 | 50% | 369 | 80
ctrl | | 1306 | 555
total | | 7084 | 2645
• cycle count down another 5X
• area up 23%
• power down another 3X !
GSM viterbi example : summary
[Figure: Mistral2 bar chart of power, area and cycles (vertical axis 0 to 20000) for the five solutions: default, loop, 2 ALU, 1 ACS, 4 ACS. Annotation: 72x !]
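The step-by-step gains can be recomputed from the totals in the preceding tables. The Python sketch below does this; it assumes that the standalone number on each Viterbi slide (13750, 14247, 9739, 1930, 425) is the cycle count, which the slides do not label explicitly. Names such as `SOLUTIONS` and `change` are hypothetical.

```python
# (solution, cycles, total area, total power) from the five Viterbi slides.
# Assumption: the unlabeled number on each slide is the cycle count.
SOLUTIONS = [
    ("default", 13750, 15591, 188605),
    ("no loop-folding", 14247, 10431, 121928),
    ("2 ALUs", 9739, 14766, 116731),
    ("1 x ACS-ASU", 1930, 5747, 7944),
    ("4 x ACS-ASU", 425, 7084, 2645),
]


def change(prev, cur):
    """Return (cycles, area, power) ratios prev/cur; a value above 1 is a reduction."""
    return tuple(p / c for p, c in zip(prev[1:], cur[1:]))


for prev, cur in zip(SOLUTIONS, SOLUTIONS[1:]):
    cyc, area, power = change(prev, cur)
    print(f"{prev[0]} -> {cur[0]}: cycles /{cyc:.2f}, area /{area:.2f}, power /{power:.2f}")

overall = change(SOLUTIONS[0], SOLUTIONS[-1])
print(f"default -> 4 x ACS-ASU: cycles /{overall[0]:.1f}, power /{overall[2]:.1f}")
```

With these totals the overall power ratio comes out at roughly 71, close to the 72x annotation on the chart, which presumably refers to power.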
Discussion: phase 3
[Figure: two-part flow. Exploration phase: application(s) and processor model feed HW design and SW (code generation), with the decisions "OK?" and "more appl.?"; the result is "freeze processor model". Application software development (constraint driven compilation): application(s) go through SW (code generation) against the frozen processor model until "OK?" is answered yes.]
Discussion: problems with VLIWs
code size and instruction bandwidth
• code compaction = reduce code size after scheduling
– possible compaction ratio? e.g. p0 = 0.9 and p1 = 0.1: information content (entropy) = - Σ pi log2 pi = 0.47, so the maximum compression factor is about 2
• control parallelism during scheduling = switch between different processor models (10% of code = 90% of runtime)
• architecture
– reduce the number of control bits for operand addresses, e.g. 128 reg (TM) -> 28 bits/issue slot for addresses only => use stacks and fifos
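The two numbers on this slide can be checked with a few lines of Python. The sketch computes the binary entropy for p0 = 0.9, p1 = 0.1 and the resulting bound on the compression factor, then the address bits for a 128-entry register file. The reading that 28 bits corresponds to four operand addresses per issue slot is an assumption derived from 28 / 7, not something the slide states.

```python
import math


def entropy(probs):
    """Shannon entropy in bits per symbol: -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


h = entropy([0.9, 0.1])
max_compression = 1.0 / h  # one raw bit carries h bits of information
print(f"entropy = {h:.2f} bit per instruction bit")
print(f"maximum compression factor = {max_compression:.2f}")

# Register addressing cost: 128 registers need log2(128) bits per address.
bits_per_address = math.ceil(math.log2(128))
addresses_per_slot = 28 // bits_per_address  # assumption: 28 bits = 4 addresses
print(bits_per_address, addresses_per_slot)
```

The bound of roughly 2 shows why compaction alone does not solve the code size problem and why the slide also looks at the architecture itself.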
Discussion: clustered VLIW architectures
[Figure: VLIW with register files RF1 to RF4, functional units FU1 to FU4, instruction registers IR1 to IR4, instruction memory, control, and flags.]
[Figure: the same register files RF1 to RF4 and functional units FU1 to FU4 arranged as a clustered VLIW.]
Conclusions
• ASIPs provide efficient solutions for well-defined application domains (2 orders of magnitude higher efficiency).
• The methodology is interesting for IP creation.
• The key problem is retargetable compilation.
• A (distributed) VLIW model is a good compromise between HW and SW.
• Although an automatic process can generate a default solution, the process usually is interactive and iterative for efficiency reasons. The key is fast and accurate feedback.
Imagine assignment
• For the coming 3 weeks:
– Install the tools (VisualC package will be sent by mail)
– Read the beginners’ guide
– Experiment with the compiler on a few examples
• http://www.ics.ele.tue.nl/~hfatemi/5kk10/
• Further information on Imagine:
– www.cva.stanford.edu/projects/imagine/