Parameterized Embedded Systems Platforms
Frank Vahid
Students: Tony Givargis, Roman Lysecky, Susan Cotterell
Dept. of Computer Science and Engineering
University of California, Riverside
Member, Center for Embedded Computer Systems, UC Irvine
The Dalton Project
Supported by: NSF, NEC
2
Outline
• Introduction
• Parameterized SOC platforms
• Exploring parameter configurations
• Future direction: self-optimizing platforms
• Conclusions
3
Introduction
• Advent of system-on-a-chip
[Figure: a board carrying separate microprocessor, memory, peripheral, and FPGA ICs, contrasted with a single IC containing a microprocessor core (aka “IP”) and peripheral cores]
4
System-on-a-chip (SOC)
5
The Productivity Gap
[ITRS99]
6
Programmable Platforms (ITRS99)
• Pre-fabricated IC, synthesizable HDL, or both
  – “reference designs” (VLSI), “silicon platforms” (Philips), “fig chips” (Vahid/Givargis99)
[Figure: programmable platform with microprocessor, cache, memory, DMA, and bridge on the system bus; FPGA and peripherals on the peripheral bus]
7
Targeted to Embedded Systems
• May drive future architecture design [Patterson98]
• Varied power/performance/size constraints
  – Programmable platforms must adapt
8
Adapting platforms to constraints
• One solution: Architectural Parameters
[Figure: the programmable platform (microprocessor, cache, memory, DMA, and bridge on the system bus; FPGA and peripherals on the peripheral bus) shown with two applications, Application1 and Application2, each a main() containing a while loop and each paired with its own cache]
9
Related work
• Pleiades project [Rabaey97]
• VLSI’s Velocity
• Microprocessor + FPGA
• Philips’ Y-Chart approach
• Microcontrollers
[Figure: Y-chart with Architecture, Applications, Mapping, Analysis, and Numbers; “Our focus” is marked on the chart]
10
Outline
• Introduction
• Parameterized SOC platforms
• Exploring parameter configurations
• Future direction: self-optimizing platforms
• Conclusions
11
Basic parameters -- cache
[Figure: programmable platform (microprocessor, cache, memory, DMA, and bridge on the system bus; FPGA and peripherals on the peripheral bus) with the cache singled out]
Parameterized Systems-on-a-chip
12
Basic parameters -- cache
• Associativity
• Cache Size
• Line Size
[Figure: two-way cache lookup. The address is split into Tag, Index, and Offset; each way holds V, T, and D fields; two comparators (==) drive a Mux that selects the Data]
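How the three cache parameters set the widths of the Tag, Index, and Offset fields can be sketched as follows. This is a minimal illustration, assuming byte addressing and power-of-two parameter values; the function name `split_address` is hypothetical and does not come from the slides.

```python
def split_address(addr, cache_size, line_size, associativity):
    """Split a byte address into (tag, index, offset) for a set-associative cache.

    Assumes cache_size, line_size, and associativity are powers of two
    (an assumption of this sketch, not something the slides state).
    """
    num_sets = cache_size // (line_size * associativity)
    if num_sets < 1:
        raise ValueError("cache too small for this line size and associativity")
    offset_bits = line_size.bit_length() - 1   # log2(line_size)
    index_bits = num_sets.bit_length() - 1     # log2(num_sets)
    offset = addr & (line_size - 1)
    index = (addr >> offset_bits) & (num_sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


# 1 KB cache, 16-byte lines, 2-way: 32 sets, 4 offset bits, 5 index bits
print(split_address(0x1234, 1024, 16, 2))  # (9, 3, 4)
```

Doubling the associativity at a fixed cache size halves the number of sets, so one bit moves from the index to the tag; doubling the line size moves one bit from the index to the offset.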
13
Basic parameters -- bus
[Figure: programmable platform (microprocessor, cache, memory, DMA, and bridge on the system bus; FPGA and peripherals on the peripheral bus)]
14
Basic parameters -- Bus
• Change Bus Width [Givargis98]
[Figure: two versions of a bus, with capacitances C1 and C2, where C1 > C2; one version places Mux/Demux blocks at each end of the bus]
15
Basic parameters -- Bus
• Encode data to reduce switching (Bus Invert) [Stan95]
[Figure: a bus with an Encoder/Decoder at each end and an extra invert_ctrl line. Binary encoding: 01001011 followed by 10010110, Hamming dist = 6. Bus-invert encoding: 01001011 with invert_ctrl = 0 followed by 01101001 with invert_ctrl = 1, Hamming dist = 3]
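The bus-invert example on this slide can be reproduced with a short sketch. It assumes the usual bus-invert rule (send the complement when sending the word as-is would toggle more than half of the lines, counting invert_ctrl); the function name `bus_invert_step` is hypothetical.

```python
def bus_invert_step(prev_bus, prev_invert, data, width):
    """Return (bus_value, invert_ctrl, toggles) for one transfer.

    prev_bus and data are integers that fit in `width` bits.
    """
    mask = (1 << width) - 1
    # Lines that would toggle if `data` were sent unmodified (invert_ctrl -> 0)
    distance = bin((prev_bus ^ data) & mask).count("1") + prev_invert
    if distance > width / 2:
        bus, invert = (~data) & mask, 1     # send the complement
    else:
        bus, invert = data & mask, 0        # send as-is
    toggles = bin(prev_bus ^ bus).count("1") + (prev_invert ^ invert)
    return bus, invert, toggles


bus, invert, toggles = bus_invert_step(0b01001011, 0, 0b10010110, 8)
print(f"{bus:08b} {invert} {toggles}")  # 01101001 1 3
```

Sent unmodified, the second word would toggle 6 of the 8 data lines; the encoded transfer toggles 2 data lines plus invert_ctrl, matching the slide's Hamming distance of 3.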
16
Parameter definitions
• Parameter
  – An architectural feature that can be varied, with a small set of possible values, without changing the application’s essential functionality.
• Configuration
  – A selection of a particular value for every architecture parameter
• Static vs. dynamic parameter
  – Static: Value is set before fabricating the IC.
  – Dynamic: Value is set after fabricating the IC.
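The parameter and configuration definitions can be illustrated with a small sketch. The values are taken from the cache and bus rows of the tradeoffs-experiment table later in this deck, restricted here to one cache and one bus for brevity (the experiment itself uses two caches and two buses); the dictionary keys and the function name are hypothetical.

```python
from itertools import product

# Each parameter has a small set of possible values
parameters = {
    "cache_size": [128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768],
    "line_size": [8, 16, 32],
    "associativity": [2, 4, 8],
    "bus_width": [4, 8, 16, 32],
    "bus_invert": [False, True],
}


def configurations(params):
    """Yield every configuration: one selected value for every parameter."""
    names = list(params)
    for values in product(*(params[name] for name in names)):
        yield dict(zip(names, values))


configs = list(configurations(parameters))
print(len(configs))   # 9 * 3 * 3 * 4 * 2 = 648
print(configs[0])
```

The count grows multiplicatively with each added parameter, which is why fast per-configuration evaluation matters for exploration.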
17
Potential tradeoffs experiment [ICCAD99]
[Figure: platform with microprocessor, I-cache, D-cache, memory, DMA, and bridge on the system bus; FPGA and peripherals on the peripheral bus]

Parameters           Possible values
I-cache
  Size               32k, 16k, 8k, 4k, 2k, 1k, 512, 256, 128
  Line               8, 16, 32
  Associativity      2, 4, 8
D-cache
  Size               32k, 16k, 8k, 4k, 2k, 1k, 512, 256, 128
  Line               8, 16, 32
  Associativity      2, 4, 8
Mp-c bus
  Data bus width     4, 8, 16, 32
  Data bus invert    on or off
Sys. bus
  Data bus width     4, 8, 16, 32
  Data bus invert    on or off
18
Potential tradeoffs experiment [ICCAD99]
• Cache: Dinero [Edler, Hill]
• ISS: [Tiwari96]
[Figure: a C program runs on an instruction set simulator, cache simulator, memory simulator, and bus simulator, modeling the microprocessor, cache, and memory; each simulator reports power, and the values are summed into total power]
19
Potential tradeoffs experiment
• Computed power for all 45,568 configurations
  – For each of four C applications
  – Used microprocessor, cache, and bus simulators (1 wk CPU)
[Figure: plot of the tradeoff between performance and power]
• X-axis: execution time (sec)
• Y-axis: power (watt)
20
Potential tradeoffs experiment
Narrower bus required a larger cache size
• Bus: 8-1/32-1; I: 32k, 8, 8; D: 16k, 8, 16; .995 sec, 3.4 W, 30K
• Bus: 16-1/32-1; I: 16k, 8, 16; D: 32k, 8, 8; .389 sec, 11.4 W, 21kG
• Bus: 32-1/32-0; I: 16k, 4, 4; D: 16k, 4, 4; .086 sec, 43.6 W, 20kG
21
Potential tradeoffs experiment
• Performance varied by 11x
• Power varied by 13x
• Area varied by 1x
• Energy consumption varied by 2x
[Figure: bar chart of the variation in performance, power, area, and energy]
22
Potential tradeoffs experiment
• Bus: 8-1/4-0; I: 1k, 2, 4; D: 512, 2, 4; 5 ms, .02 W, 18kG
• Bus: 16-1/32-1; I: 1k, 4, 4; D: 512, 4, 8; 3 ms, .07 W, 17kG
• Bus: 32-1/32-1; I: 1k, 4, 4; D: 512, 4, 8; 2 ms, .19 W, 15kG
23
Potential tradeoffs experiment
• Performance varied by 2.5x
• Power varied by 9.5x
• Area varied by 1x
• Energy consumption varied by 4x
[Figure: bar chart of the variation in performance, power, area, and energy]
24
Potential tradeoffs experiment
• How much variation in total system power and performance can we obtain just by varying the cache and bus parameters?
  – 9 to 14x improvement in power/performance
• How interdependent are these two types of parameters?
  – Fixing cache parameter values, then selecting bus parameter values, results in non-optimal solutions
25
Many more parameters possible
• Some examples include:
  – Code compression (Henkel/Wolf)
  – Address bus encoding
  – Multiple levels of memory hierarchy
  – CPU parameters (e.g., voltage scale, DP width)
  – Peripheral core parameters (our current focus)
  – Fertile research area
• Can yield even larger tradeoffs if we:
  – Create parameter-aware compiler
  – Adapt OS?
26
Outline
• Introduction
• Parameterized SOC platforms
• Exploring parameter configurations
• Future direction: self-optimizing platforms
• Conclusions
27
Exploring parameter configurations
• Low-level simulation
  – Gate-level simulation
    • Far too slow, days per configuration
  – RT-level simulation
    • Still slow, hours per configuration
• Our approach
  – System-level simulation
    • Minutes per configuration
  – System-level trace simulation
    • Seconds per configuration
  – System-level trace analysis
    • Milliseconds per configuration
28
Evaluation by gate-level simulation
Exploring Parameter Configurations
• Capture each core in HDL, synthesize, simulate
• Hours (often tens) per configuration
[Figure: the programmable platform goes through HDL synthesis and HDL simulation to produce total power, with a reconfigure loop back to the platform]
29
Evaluation by system-level simulation
• Minutes-per-configuration
• Contrast with hours-per-config.
[Figure: a C program and trace generator drive OO models of the platform: instruction set, cache, memory, bus, DMA, bridge, and peripheral simulators. Each simulator reports power, summed into total power, with a reconfigure loop]
30
Evaluation by trace-simulation
• Note that the cache simulator is non-functional
• Same approach for others
  – Get traces from small # of system simulations
• Seconds-per-configuration
[Figure: a C program and trace generator produce instruction, address, and bus traces; non-functional OO models (instruction, cache, memory, bus, DMA, bridge, and peripheral trace simulators) each report power, summed into total power, with a reconfigure loop]
31
System simulation vs. trace simulation
[Figure: system simulation executes the system-level model to obtain power for each parameter evaluation; trace simulation executes the system-level model to obtain traces, then runs trace simulators (uP, DMA, UART) to obtain power for each parameter evaluation]
32
Evaluation by trace-analysis
• Further speedup
  – statistically-characterize traces
  – Still only small # of system simulations
• Milliseconds-per-configuration
[Figure: a C program and trace generator produce instruction, address, and bus statistics; trace analyzers based on equations (instruction, cache, memory, DMA, bridge, and peripheral) each report power, summed into total power, with a reconfigure loop]
33
Trace-analysis approach for cache
• Given a trace of memory refs
• Cache parameters
  • Size (S)
  • Line/block-size (L)
  • Associativity (A)
• Compute # of misses (N)

$f(S_{Min}, L_{Min}, A_{Min}) = N_1$
$f(S_{Max}, L_{Min}, A_{Min}) = N_2$
$f(S_{Min}, L_{Max}, A_{Min}) = N_3$
$f(S_{Min}, L_{Min}, A_{Max}) = N_4$
$f(S_{Max}, L_{Max}, A_{Min}) = N_5$
$f(S_{Max}, L_{Min}, A_{Max}) = N_6$

[Figure: # of misses (N) versus size (S)]
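Obtaining the miss count N for a given (S, L, A) from a trace of memory references can be sketched with a small set-associative cache model. This is an illustration only: LRU replacement, the synthetic trace, and the function name `count_misses` are assumptions (the experiments in this deck use the Dinero cache simulator), and the assignment of corner points to N1 through N6 follows one reading of the slide.

```python
from collections import OrderedDict


def count_misses(trace, size, line, assoc):
    """Count misses for a byte-address trace on a set-associative cache.

    size and line are in bytes. LRU replacement is assumed here.
    """
    num_sets = size // (line * assoc)
    if num_sets < 1:
        raise ValueError("cache too small for this line size and associativity")
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in trace:
        block = addr // line
        ways = sets[block % num_sets]
        tag = block // num_sets
        if tag in ways:
            ways.move_to_end(tag)           # hit: mark as most recently used
        else:
            misses += 1
            if len(ways) >= assoc:
                ways.popitem(last=False)    # evict the least recently used line
            ways[tag] = True
    return misses


# Synthetic trace (made up for illustration)
trace = [(i * 24) % 4096 for i in range(2000)]

S_MIN, S_MAX = 128, 32 * 1024
L_MIN, L_MAX = 8, 32
A_MIN, A_MAX = 2, 8
points = {
    "N1": (S_MIN, L_MIN, A_MIN),
    "N2": (S_MAX, L_MIN, A_MIN),
    "N3": (S_MIN, L_MAX, A_MIN),
    "N4": (S_MIN, L_MIN, A_MAX),
    "N5": (S_MAX, L_MAX, A_MIN),
    "N6": (S_MAX, L_MIN, A_MAX),
}
misses_at = {name: count_misses(trace, *p) for name, p in points.items()}
print(misses_at)
```

Only these six simulations are needed per trace; the trace-analysis equations then estimate the miss count at other parameter values from them.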
34
Trace-analysis approach for cache
$s = \mathrm{normalize}(S)$
$l = \mathrm{normalize}(L)$
$a = \mathrm{normalize}(A)$
[Equations (1): $f(S, L, A)$ is expressed through three terms. $t_1$ is a function of $s$ and the miss counts $N_{max,min,min}$ and $N_{min,min,min}$; $t_2$ is a function of $l$ and ratios of miss counts; $t_3$ is a function of $a$ and ratios of the miss counts $N_{max,min,min}$, $N_{max,min,max}$, $N_{min,min,min}$, and $N_{min,min,max}$. Subscripts give the S, L, A setting]
35
Trace-analysis approach for cache
• Capture improvements obtainable by:
  – changing line-size at small/large values of cache-size
  – changing associativity at small/large values of cache-size

$R_1 = N_1/N_3, \quad R_2 = N_1/N_4$
$R_3 = N_2/N_5, \quad R_4 = N_2/N_6$
[Equations (1): $f(S, L, A)$ is expressed through $t_1$ (a function of $s$, $N_1$, and $N_2$), $t_2$ (a function of $l$, $R_1$, and $R_3$), and $t_3$ (a function of $a$, $R_2$, and $R_4$), where $s$, $l$, $a$ are $S$, $L$, $A$ normalized using their Min and Max values with exponents $i$, $j$, $k$]
36
Trace-analysis approach for bus
$P_{bus} = \frac{1}{2} \cdot C \cdot \frac{n}{k} \cdot k \cdot m$
where $C$ is the capacitance, $n/k$ the number of transfers per item, $k$ the bus width, $m$ the items/second, and $1/2$ the random-data factor
Trace-analysis approach for bus
• Bus equation:
  • m items/second (denotes the traffic N on the bus)
  • n bits/item
  • k bit wide bus
  • bus-invert encoding
  • random data assumption
[Equation: $P_{bus}$ under bus-invert encoding, in terms of $C$, $m$, $n$, and $k$]
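The effect of the random data assumption on switching can be made concrete with a short calculation. This is an independent sketch, not a transcription of the slide's equation: it assumes each transferred word is uniformly random, so the Hamming distance to the previous word is binomially distributed, and it assumes the usual bus-invert rule. The function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def expected_toggles_binary(k):
    """Expected line toggles per transfer on a k-bit bus with random data."""
    return Fraction(k, 2)


def expected_toggles_bus_invert(k):
    """Expected toggles per transfer (k data lines plus invert_ctrl), random data.

    Assumes the previous invert_ctrl value is 0; by symmetry the result is
    the same when it is 1.
    """
    total = Fraction(0)
    for d in range(k + 1):                       # d = Hamming distance to next word
        # Invert when more than half the lines would toggle: k - d data
        # toggles plus one toggle on invert_ctrl.
        cost = d if d <= k / 2 else k + 1 - d
        total += Fraction(comb(k, d), 2 ** k) * cost
    return total


for k in (4, 8, 16, 32):
    print(k, float(expected_toggles_binary(k)), float(expected_toggles_bus_invert(k)))
```

For an 8-bit bus this gives 837/256, about 3.27 toggles per transfer, against 4 for plain binary encoding. Multiplying the expected toggles per transfer by the transfers per second and the switched capacitance per line gives a power estimate without simulating the bus trace.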
38
Trace-analysis experiments
[Figure: architecture with CPU, I-cache, D-cache, Bus A, Bus B, memory, bridge, and a peripheral bus with Peripheral 1, Peripheral 2, through Peripheral n]
• Cache parameters
  – size: 128, 256, 512, 1k, 2k, 4k, 8k, 16k, 32k
  – assoc: 2, 4, 8
  – line: 8, 16, 32
• Bus Parameters
  – width: 4, 8, 16, 32
  – code: binary/bus-invert
• Analyzed 45K sets exhaustively for each of 4 examples.
39
Experiment Results
• Diesel application’s performance
• Blue (light-gray) is system-simulation-based
• Red (dark-gray) is trace-analysis-based
• 4% error, 320x faster
[Figure: bar chart of execution time (sec) for Conf 0 through Conf 9]
40
Experiment Results
• Diesel application’s energy consumption
• Blue (light-gray) is obtained using full simulation
• Red (dark-gray) is obtained using our equations
• 2% error, 420x faster
[Figure: bar chart of energy (micro-Joules) for Conf 0 through Conf 9]
41
Experiment Results
• CKey application’s performance
• Blue (light-gray) is obtained using full simulation
• Red (dark-gray) is obtained using our equations
• 8% error, 125x faster
[Figure: bar chart of execution time (sec) for Conf 0 through Conf 9]
42
Experiment Results
• CKey application’s energy consumption
• Blue (light-gray) is obtained using full simulation
• Red (dark-gray) is obtained using our equations
• 3% error, 125x faster
[Figure: bar chart of energy (milli-Joules) for Conf 0 through Conf 9]
43
Experiment Results
• 125 - 400x speedup
• 1-18% absolute error (power & performance)
• 2% average power error
[Figure: time (hours), Sim vs. Eq., for 3d-image, mpeg, ckey, and diesel]
[Figure: power error (%) for 3d-image, mpeg, ckey, and diesel]
44
Techniques for general cores
• Earlier experiments were for uP/cache/bus
• System simulation for other cores (ISSS’00)
  – Isolate “instructions” in system-level model
  – Gate-level simulation per instruction
  – Back-annotate system-level model’s instructions
  – Similar to technique for microprocessors, but:
    • Must consider “power modes”
45
Trace approach for general cores
[Figure: executing the system-level model produces traces; trace simulators (uP, DMA, UART) turn the traces into power for each parameter evaluation]

Full trace:
  Reset     --
  Quantize  P1,P2,…,P64
  IDCT      P1,P2,…,P64
  Quantize  P1,P2,…,P64
  IDCT      P1,P2,…,P64

Reduced trace with characterized data:
  Reset     --
  Quantize  .80
  IDCT      .72
  Quantize  .93
  IDCT      .63

Reduced trace with instructions only:
  Reset     --
  Quantize  --
  IDCT      --
  Quantize  --
  IDCT      --

Reduced trace with instruction frequencies:
  Reset     *1
  Quantize  *2
  IDCT      *2
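The most compact of these reductions, the trace with instruction frequencies, can be sketched as a counting step followed by a weighted sum. The instruction names come from the slide; the per-instruction energy values are hypothetical placeholders (in the approach described earlier they would come from gate-level simulation per instruction), and the function names are illustrative.

```python
from collections import Counter

# Instruction sequence from the slide's full trace, data operands dropped
full_trace = ["Reset", "Quantize", "IDCT", "Quantize", "IDCT"]


def reduce_to_frequencies(trace):
    """Collapse a trace into instruction -> occurrence count."""
    return Counter(trace)


def estimate_energy(freqs, energy_per_instruction):
    """Sum count * per-instruction energy over all instructions."""
    return sum(count * energy_per_instruction[name] for name, count in freqs.items())


# Hypothetical per-instruction energies in arbitrary units
energy_per_instruction = {"Reset": 1.0, "Quantize": 5.0, "IDCT": 8.0}

freqs = reduce_to_frequencies(full_trace)
print(dict(freqs))                                      # {'Reset': 1, 'Quantize': 2, 'IDCT': 2}
print(estimate_energy(freqs, energy_per_instruction))   # 27.0
```

Dropping the data operands is what makes the trace small and fast to evaluate; it also removes the data-dependent part of the power, so some loss of accuracy is expected relative to the full trace.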
46
Experiments with general cores: JPEG

Pixel size   Trace file size (Kb)       CPU time for power evaluation (sec)
(bits)       ftrc   rtrc_cd   rtrc_i    gate     sys   ftrc   rtrc_cd   rtrc_i
10           32     3.6       0.5       290000   48    26     4.9       4.6
12           39     3.6       0.5       330000   49    27     5.1       4.6
Average speedup:                                 6K    12K    62K       67K

Pixel size   gate   ftrc           rtrc_cd        rtrc_i
(bits)       mJ     mJ    error    mJ    error    mJ    error
10           420    443   5%       451   7%       491   17%
12           531    569   7%       576   8%       632   19%
Average error:            6%             7.5%           18%
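The error and speedup figures in the JPEG tables are consistent with the usual definitions: error relative to the gate-level result, and speedup as gate-level CPU time divided by the method's CPU time. A minimal sketch under that assumption, using the 10-bit pixel-size row (the function names are illustrative):

```python
def percent_error(estimate, reference):
    """Absolute error of an estimate relative to a reference, in percent."""
    return abs(estimate - reference) / reference * 100.0


def speedup(reference_time, method_time):
    """How many times faster a method is than the reference."""
    return reference_time / method_time


# 10-bit row: gate-level 420 mJ; ftrc 443 mJ; rtrc_i 491 mJ
print(round(percent_error(443, 420)))   # 5
print(round(percent_error(491, 420)))   # 17

# 10-bit row: gate-level 290000 sec; ftrc 26 sec
print(round(speedup(290000, 26)))       # 11154
```

Averaging the per-row speedups over both pixel sizes gives roughly the 12K figure the table reports for ftrc.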
47
Experiments with general cores: UART

Buffer   Trace file size (Kb)   CPU time for power evaluation (sec)
size     ftrc   rtrc_i          gate     sys    ftrc   rtrc_i
2        7.2    3.1             123000   22.8   17.1   10.9
4        6.7    2.6             145000   21.4   16.9   10.8
8        6.4    2.3             155000   21.3   16.3   10.8
16       6.3    2.2             164000   21.7   15.8   10.9
Average speedup:                         6800   8900   13500

Buffer   gate    ftrc             rtrc_i
size     mJ      mJ      error    mJ      error
2        76.0    79.1    4.1%     80.0    5.2%
4        81.2    84.9    4.6%     84.3    3.8%
8        97.1    99.8    2.8%     100.0   3.0%
16       113.0   115     1.8%     115.0   1.8%
Average error:           3.3%             3.5%
48
Outline
• Introduction
• Parameterized SOC platforms
• Exploring parameter configurations
• Future direction: self-optimizing platforms
• Conclusions
49
Future directions
• Earlier work
  – used software on workstation to explore parameter configurations
• “Self-optimizing” platform
  – Can we build the exploration ability into the platform itself?
  – Transparent to the user
    • Ease of use, more accurate metrics, wider acceptance
  – “Embedded CAD”
[Figure: earlier work: exploration sw on the workstation sends a configuration to the platform. Self-optimizing: the workstation sends a regular binary to a platform that has the exploration ability built in]
50
Conclusions
• Parameters can improve usefulness of programmable platforms
  – by adapting platform to particular application and to power/performance constraints
• Good tradeoff range even for basic parameters
• Fast and accurate evaluation seems possible
• Much work remains
  – More parameters
  – Better exploration
  – Self-optimizing platforms