OutlineOutline
Problem statementProblem statement Simulation and embedded system designSimulation and embedded system design
functional simulationfunctional simulation performance simulationperformance simulation
POLIS implementationPOLIS implementation partitioning examplepartitioning example
implementation simulationimplementation simulation software-orientedsoftware-oriented hardware-centrichardware-centric
SummarySummary
Embedded Heterogeneous Embedded Heterogeneous SystemSystem
A
Ddigital
downconv
900Mhz70Mhz 10.7Mhz
SH
SAW FILTER
RF IF 40Ms/sec - 540ks/sec
ViterbiEqls.
270.8ks/sec
demodandsync
BB
phone
bookkeypad
intfc
protocolcontrol
de-intl&
decoder
RPE-LTPspeechdecoder
speechquality
enhancement
voicerecognition
Logic
analog digital
phonebookDMA
S/P
DSP core
uC core
RAM & ROM
GSM Phone
Modeling andModeling andSimulation TechniquesSimulation Techniques
execute C code on a workstation
execute object code using
instruction level model instruction cycle
accurate clock cycle accurate
HDL model of the processor RTL gate level netlist
phase accurate model
pin accurate model
fully functional model
spice model of the processor
in-circuit emulators
real hardware
bus functional model
some sort of C code
behavioral HDL code
RTL HDL code discrete event clock cycle
based
gate level netlist
real hardware
some sort of C code
frequency domain model
spice model
real hardware
RF IF
uC
DSPor
ASIC
SourceModels
SmartModels
DesignWare
Hardware Modeler
Problem StatementProblem Statement
To model the behavior of a combined To model the behavior of a combined hardware and software system based on hardware and software system based on models of the behavior of the hardware models of the behavior of the hardware and software componentsand software components
Usually requires trading offUsually requires trading off accuracyaccuracy throughputthroughput convenienceconvenience
using the right abstractions for the taskusing the right abstractions for the task
Function and Interface Function and Interface AbstractionAbstraction
what is going on inside a subsystem
how do two subsystems communicate
How much visibility do I need to have into
function abstraction
interface abstraction
? ?? ?
OutlineOutline
Problem statementProblem statement Simulation and embedded system designSimulation and embedded system design
functional simulationfunctional simulation performance simulationperformance simulation
POLIS implementationPOLIS implementation partitioning examplepartitioning example
implementation simulationimplementation simulation software-orientedsoftware-oriented hardware-centrichardware-centric
SummarySummary
Cycle-based,logic simul.
Design flowDesign flow
Behaviorcapture
Mapping(partitioning)
Architecturecapture
Functionalsimul.
Architecturesimul.
Performancesimul.
Synthesis,coding
for (i=0;i<8;i++) idctrow(i); for (i=0;i<8;i++) idctcol(i);
Functional modelFunctional model
Variable-length dec.
Flowsplitting
IDCT
Motioncomp.
Imagememory
Videooutput
MPEG 2decoder
Stream in Video out
Functional simulationFunctional simulation Timeless (not really co-simulation…)Timeless (not really co-simulation…) Algorithm exploration, functional Algorithm exploration, functional
debugging, virtual prototypingdebugging, virtual prototyping Different formal models: Different formal models:
control-dominated: CSP, EFSMs, DE, etc.control-dominated: CSP, EFSMs, DE, etc.
event-based: Bones, StateCharts, etc.event-based: Bones, StateCharts, etc. data dominated: DF networksdata dominated: DF networks
token-based: Cossap, SPW, etc.token-based: Cossap, SPW, etc. Single process network (Ptolemy-style)Single process network (Ptolemy-style)
HW and SW Modeling HW and SW Modeling TechniquesTechniques
throughput
accu
racy
progmemory
procHW HW
HW
SW&
procHW
HWSW
hardware centrichardware centric
software orientedsoftware oriented
functionalfunctional
Cycle-based,logic simul.
Design flowDesign flow
Behaviorcapture
Mapping(partitioning)
Architecturecapture
Functionalsimul.
Architecturesimul.
Performancesimul.
Synthesis,coding
Architecture modelArchitecture model
AbstractAbstract model for mapping model for mapping no detailed wiring (busses, serial links, etc.)no detailed wiring (busses, serial links, etc.) black-box components (ASICs, micro-black-box components (ASICs, micro-
controllers, DSPs, memories, etc.)controllers, DSPs, memories, etc.) Later refined to a detailed designLater refined to a detailed design
implement communicationimplement communication refine interfacesrefine interfaces
Cycle-based,logic simul.
Design flowDesign flow
Behaviorcapture
Mapping(partitioning)
Architecturecapture
Functionalsimul.
Architecturesimul.
Performancesimul.
Synthesis,coding
MappingMapping
Associates functional units with Associates functional units with architectural unitsarchitectural units
Performs HW/SW partitioningPerforms HW/SW partitioning Associates functional communication Associates functional communication
with resources (buffers, busses, serial with resources (buffers, busses, serial links, etc.)links, etc.)
Provides Provides estimatesestimates of performance of a of performance of a given function on a given architectural given function on a given architectural unitunit
MappingMapping
CPU(10 SPEC)
Memory(2Mb) bus
10Mb/s
Flowsplitting
IDCT
Motioncomp.
Imagememory
Videooutput
ASIC(160Mops)
Variable-length dec.
Stream in Video out
MappingMapping
CPU(10 SPEC)
Memory(2Mb) bus
10Mb/s
Flowsplitting
IDCT
Motioncomp.
Imagememory
Videooutput
ASIC(160Mops)
Variable-length dec.
Stream in Video out
MappingMapping
CPU(10 SPEC)
Memory(2Mb) bus
10Mb/s
Flowsplitting
IDCT
Motioncomp.
Imagememory
Videooutput
ASIC(160Mops)
Variable-length dec.
Stream in Video out
MappingMapping
CPU(10 SPEC)
Memory(2Mb) bus
10Mb/s
Flowsplitting
IDCT
Motioncomp.
Imagememory
Videooutput
ASIC(160Mops)
Variable-length dec.
Stream in Video out
Another mappingAnother mapping
CPU(10 SPEC)
Memory(2Mb) bus
10Mb/s
Flowsplitting
IDCT
Motioncomp.
Imagememory
Videooutput
Variable-length dec.
Stream in Video out
ASIC(160Mops)
Performance simulationPerformance simulation
Analyzes performance of behavior on Analyzes performance of behavior on given architecturegiven architecture
Architectural model provides:Architectural model provides: performance estimatesperformance estimates resource constraintsresource constraints
CPU schedulingCPU scheduling bus arbitration policybus arbitration policy (abstract) cache modeling(abstract) cache modeling
OutlineOutline
Problem statementProblem statement Simulation and embedded system designSimulation and embedded system design
functional simulationfunctional simulation performance simulationperformance simulation
POLIS implementationPOLIS implementation partitioning examplepartitioning example
implementation simulationimplementation simulation software-orientedsoftware-oriented hardware-centrichardware-centric
SummarySummary
Performance simulation Performance simulation prototypeprototype
PolisPolis High level specificationHigh level specification Unified model for both hardware and software Unified model for both hardware and software
(CFSMs)(CFSMs) Code generation and estimationCode generation and estimation Hardware synthesisHardware synthesis
The POLIS design flowThe POLIS design flow
Graphical EFSMGraphical EFSM ESTERELESTEREL ................................
CompilersCompilers
CFSMsPartitioningPartitioning
Sw SynthesisSw Synthesis
FormalFormal
VerificationVerification
Sw Code + Sw Code +
RTOSRTOSLogic NetlistLogic NetlistSimulationSimulation
Hw SynthesisHw SynthesisIntfc SynthesisIntfc Synthesis
PrototypePrototype
Performance simulation Performance simulation in POLISin POLIS
Based on synthesized software timing estimatesBased on synthesized software timing estimates Generate C code for both hardware and software Generate C code for both hardware and software
componentscomponents each statement is labeled with the estimated running each statement is labeled with the estimated running
timetime accumulate delays during execution of software accumulate delays during execution of software
componentscomponents use cycle based simulation for hardware componentsuse cycle based simulation for hardware components
Very fastVery fast The system is compiled on the host machineThe system is compiled on the host machine Don’t need a model of the processorDon’t need a model of the processor
Performance estimationPerformance estimation
Current practice: manual guess or detailed Current practice: manual guess or detailed simulationsimulation
Software: need restrictions on coding styleSoftware: need restrictions on coding style still requires lots of user inputs still requires lots of user inputs can be automated if software is synthesized can be automated if software is synthesized
Hardware: need RTL specification or fast Hardware: need RTL specification or fast behavioral synthesisbehavioral synthesis
Communication: need communication Communication: need communication mapping (and refinement)mapping (and refinement)
Performance estimationPerformance estimation
4040
26264141 6363
14
18 9
a := a a := a + 1+ 1
a := 0a := 0
detect(c)detect(c)a<a<
bb
BEGINBEGIN
ENDEND
FF
TT
TT FF
emit(y)emit(y)
L0: if (detect_c) goto L1; else goto end;
L1: if (a<b) goto L3; else goto L2;
L2: a = a + 1; goto L4;
L3: a = 0;
L4: emit(y);
end: clean_up();
CFSM scheduling policyCFSM scheduling policy
Concurrency and shared resourcesConcurrency and shared resources hardware components are concurrenthardware components are concurrent only one software component can be executed at any only one software component can be executed at any
timetime if a software component receives an event,if a software component receives an event,
but the simulated processor is busy, but the simulated processor is busy,
its execution is postponedits execution is postponed need a scheduler to choose among all enabled need a scheduler to choose among all enabled
software processessoftware processes Discrete event simulation with time-stamped Discrete event simulation with time-stamped
events is used to synchronize HW and SW events is used to synchronize HW and SW
Car dashboard exampleCar dashboard example
FRCFRC
TimerTimer
OdoOdo
BeltBelt
SpeedSpeed
RPMRPM CrossdispCrossdisp
CrossdispCrossdisp
CrossdispCrossdispFuelFuel
fuelfuel
key, beltkey, belt
clockclock
wheelwheel
engineengine
fuel_dispfuel_disp
speed_dispspeed_disp
RPM_dispRPM_disp
odo_dispodo_disp
Timing generatorsData processingPWM drivers
Trade-off evaluationTrade-off evaluation
Interactively changed simulation Interactively changed simulation parametersparameters
Define different aspects:Define different aspects: Implementation of each CFSMImplementation of each CFSM CPU and clock frequencyCPU and clock frequency SchedulerScheduler
May be inherited in hierarchyMay be inherited in hierarchy Automatically transmitted to following Automatically transmitted to following
synthesis stepssynthesis steps
Trade-off evaluationTrade-off evaluation
Hw/Sw implementation and partitioningHw/Sw implementation and partitioning meeting timing constraintsmeeting timing constraints trade-offs: speed, code size, chip areatrade-offs: speed, code size, chip area
Scheduling policiesScheduling policies Round RobinRound Robin Pre-emptive and non pre-emptivePre-emptive and non pre-emptive
CPU selectionCPU selection MC 68HC11, MC 68332, MIPS R3000MC 68HC11, MC 68332, MIPS R3000
Don’t need to recompile the systemDon’t need to recompile the system
Trade-off evaluationTrade-off evaluation
Input patternsInput patterns Impulse, clock, randomImpulse, clock, random Sliders, buttonsSliders, buttons Waveforms from fileWaveforms from file
Monitoring the systemMonitoring the system Processor utilization and task scheduling chartsProcessor utilization and task scheduling charts Missed deadlinesMissed deadlines Cost of implementationCost of implementation Internal valuesInternal values
Performance evaluationPerformance evaluation
Targetproc.
MHz Part. Wheel &Engine
Missed
68HC11 4 1 260-8000 0
68HC11 1 1 180-6000 900
68HC11 4 2 50-3000 5000
68332 20 3 50-3000 30
68332 40 3 260-8000 20
Partition 1: HW={Timing unit, PWM drivers) SW={Data Processing}Partition 2: HW={Timing unit} SW={Data processing, PWM drivers}Partition 3: HW={} SW={Timing unit, Data processing, PWM drivers}
Simulation speedSimulation speed
Targetproc.
MHz Part. Graph. cycles/sec
68HC11 4 3 No < 20000
68HC11 4 2 No 650000
68HC11 4 1 No 550000
68HC11 4 2 Yes 100000
Partition 1: HW={Timing unit, PWM drivers) SW={Data Processing}Partition 2: HW={Timing unit} SW={Data processing, PWM drivers}Partition 3: HW={} SW={Timing unit, Data processing, PWM drivers}
Cycle-based,logic simul.
Design flowDesign flow
Behaviorcapture
Mapping(partitioning)
Architecturecapture
Functionalsimul.
Architecturesimul.
Performancesimul.
Synthesis,coding
Architecture refinementArchitecture refinement
CPU(10 SPEC)
ASIC(160 Mops)
Memory(2Mb) bus
10Mb/s
486(66 MHz)
Memory(2Mb) PCI bus
132 Mb/s
ASIC(30 Kgates66 MHz)
Architecture simulationArchitecture simulation
Performed on refined model of the Performed on refined model of the architecture, architecture, ignoring behaviorignoring behavior
Simulation verifies Simulation verifies interface correctnessinterface correctness bus-functional model for processors, with bus-functional model for processors, with
random address and data generationrandom address and data generation
(Logic Modeling, etc.)(Logic Modeling, etc.) cycle-based or logic simulation for the hardware cycle-based or logic simulation for the hardware
interfacesinterfaces Interface refinement simulationInterface refinement simulation
(Rowson et al., Hines et al.)(Rowson et al., Hines et al.)
HW and SW Modeling HW and SW Modeling TechniquesTechniques
throughput
accu
racy
progmemory
procHW HW
HW
SW&
procHW
HWSW
hardware centrichardware centric
software orientedsoftware oriented
functionalfunctional
Bus Functional ModelBus Functional Model
command interpreter command interpreter executing memory bus executing memory bus access cyclesaccess cycles
drives all relevant drives all relevant processor pinsprocessor pins
does not implement does not implement the complete the complete instruction set of the instruction set of the processorprocessor
cmd> WRT val,addrcmd> WRT R2,addrcmd> RD addr,R1cmd> ...
C/C++
HDLlanguage
and/orsimulatorbinding
adrdatardwrt
HW
SW&
procHW
Refined mappingRefined mapping
486(66 MHz)
Memory(2Mb) PCI bus
132 Mb/s
Flowsplitting
IDCT
Motioncomp.
Imagememory
Videooutput
ASIC(30Kgates66 MHz)
Variable-length dec.
Stream in Video out
Cycle-based,logic simul.
Design flowDesign flow
Behaviorcapture
Mapping(partitioning)
Architecturecapture
Functionalsimul.
Architecturesimul.
Performancesimul.
Synthesis,coding
Synthesis and codingSynthesis and coding
Implement functional specification on Implement functional specification on architecturearchitecture
Ideally should be automated Ideally should be automated (e.g. if function for HW is specified as RTL…)(e.g. if function for HW is specified as RTL…)
In practice generally a mix of hand In practice generally a mix of hand design and synthesisdesign and synthesis
OutlineOutline
Problem statementProblem statement Simulation and embedded system designSimulation and embedded system design
functional simulationfunctional simulation performance simulationperformance simulation
POLIS implementationPOLIS implementation partitioning examplepartitioning example
implementation simulationimplementation simulation software-orientedsoftware-oriented hardware-centrichardware-centric
SummarySummary
Implementation simulationImplementation simulation
Necessary because:Necessary because: need detailed performance analysisneed detailed performance analysis hand design is error-pronehand design is error-prone
Most errors should have been caught Most errors should have been caught beforebefore
How to use test vectors and compare How to use test vectors and compare results between results between different abstraction different abstraction levels levels ??
HW and SW Modeling HW and SW Modeling TechniquesTechniques
throughput
accu
racy
progmemory
procHW HW
HW
SW&
procHW
HWSW
hardware centrichardware centric
software orientedsoftware oriented
functionalfunctional
Software OrientedSoftware Oriented
software software executing on aexecuting on amodel of the model of the processorprocessor combine program and data memory with
the processor into a single model
implement instruction fetch, decode, andexecution cycle in opcode interpreter in software
HW
SW&
procHW
Clock Cycle AccurateClock Cycle AccurateInstruction Set Architecture ModelInstruction Set Architecture Model
clock cycle accurate
ISA model
C/C++
HDLlanguage
and/orsimulatorbinding
model derived/generated fromthe instruction set description
ADD R1,R2,R3
{ A1.OP = 0x01
A1.RA = R1
A1.RB = R2
... }
{ R3<- ADD(R2,R1) }
ADD R1,R2,R3
{ A1.OP = 0x01
A1.RA = R1
A1.RB = R2
... }
{ R3<- ADD(R2,R1) }
HW
SW&
procHW
Clock Cycle AccurateClock Cycle AccurateInstruction Set Architecture ModelInstruction Set Architecture Model
hardware devices hardware devices hanginghangingoff the memory busoff the memory bus memory read and write memory read and write
cycles are executed cycles are executed when specific memory when specific memory addresses are being addresses are being referencedreferenced
clock cycle accurate
ISA model
C/C++
HDLlanguage
and/orsimulatorbinding
adrdatardwrt
HW
SW&
procHW
Instruction Cycle AccurateInstruction Cycle AccurateInstruction Set Architecture ModelInstruction Set Architecture Model
same basic idea as same basic idea as Clock Cycle Clock Cycle Accurate ISA ModelAccurate ISA Model processor status can only be observed at processor status can only be observed at
instruction cycle boundariesinstruction cycle boundaries usually no notion of processor pins, usually no notion of processor pins,
besides serial and parallel portsbesides serial and parallel ports can be enhanced to drive all pins, can be enhanced to drive all pins,
including the memory interfaceincluding the memory interface Clock Cycle Accurate ISA ModelClock Cycle Accurate ISA Model
HW
SW&
procHW
Virtual ProcessorVirtual Processor
C/C++ code is compiled and executed C/C++ code is compiled and executed on a workstationon a workstation
accuracy of numeric results may be an accuracy of numeric results may be an issueissue
Hardware devices hanging off the Hardware devices hanging off the memory busmemory bus
additional function calls to execute additional function calls to execute memory read and write cycles when memory read and write cycles when specific memory addresses are being specific memory addresses are being referenced can be inserted in the original referenced can be inserted in the original softwaresoftware
need estimation and synchronization with need estimation and synchronization with hardware simulation to model software hardware simulation to model software timingtiming
HW
SW&
procHW
HW and SW Modeling HW and SW Modeling TechniquesTechniques
throughput
accu
racy
progmemory
procHW HW
hardware centrichardware centric
HW
SW&
procHW
software orientedsoftware oriented
HWSW
functionalfunctional
Hardware-CentricHardware-Centric
Possible descriptions of the Possible descriptions of the hardwarehardware
actual hardwareactual hardware gate level netlistgate level netlist clock cycle accurate architectural clock cycle accurate architectural
descriptiondescription
hardware executing a softwareimage stored in memory
progmemory
procHW HW
progmemory
procHW HWActual HardwareActual Hardware
processor is plugged into a device processor is plugged into a device connected to a workstationconnected to a workstation
discrete event simulator just sees discrete event simulator just sees another componentanother component
software image is loaded into memory software image is loaded into memory and clock is appliedand clock is applied
workstation running discrete event simulator
XPLORERMS-3X00
ModelSource
XPLORERMS-3X00
ModelSource
Hardware Modeler
progmemory
procHW HWGate Level NetlistGate Level Netlist
exact timing of the signals at all input exact timing of the signals at all input and output pins and output pins
full visibility of all internal signals and full visibility of all internal signals and storagestorage
gate level netlist executed in a discreteevent verification environment
Clock Cycle-AccurateClock Cycle-AccurateArchitectural DescriptionArchitectural Description
model delivers at each clock model delivers at each clock edgeedgea set of a set of stablestable output signals output signals givengivena set of a set of stablestable input signals input signals
visibility of internal signalsvisibility of internal signalsand storage depends onand storage depends ondegree of model refinementdegree of model refinement
architectural description executing ina cycle based verification environment
progmemory
procHW HW
SummarySummary
Functional modelVERIFY
FUNCTIONALITY
Implementation modelVERIFY
ABSTRACTIONS
Performance modelVERIFY
PERFORMANCE
Architecture modelVERIFY
INTERFACES
ConclusionsConclusions
Abstraction key to speedupAbstraction key to speedup Separate behavior (functionality), Separate behavior (functionality),
communication and timingcommunication and timing Architecture refinement essential for Architecture refinement essential for
true co-design and fast co-simulationtrue co-design and fast co-simulation Software and hardware synthesis helps Software and hardware synthesis helps
performance estimationperformance estimation Rapid prototyping may be the ultimate Rapid prototyping may be the ultimate
co-simulation tool...co-simulation tool...