virtualized system development · why use virtual systems? “because hardware is no fun” because...
TRANSCRIPT
Virtualized System Development
Dr. Jakob Engblom
Virtutech
Virtuali-what?
For Uppsala University RT Course, Copyright Virtutech 2009, 2009-11-12
Virtualization
• Making a computer program behave like a computer for the purpose of running software
• Mechanisms for making some computer resource less subject to physical constraints
• Virtual memory, disk virtualization, virtual machines, …
• Very hip in the data center world currently
Simulation
• A piece of software that simulates something
• Used to perform experiments and gain insight not possible in the real world
• Control of and insight into the internals of the process are big advantages
• For computing, often associated with being slow
• Physics, computers, weather, games, …
Emulation
• A piece of software that mimics some computer software or hardware
• Focused on the execution of existing software
• Terminal emulators, OS emulators, console emulators
Virtutech Simics is really doing a bit of all of these, and it has been called all three
What is a Virtual Platform?
A piece of software running on a regular PC, server, or workstation
Functionally identical to the target hardware
Runs the same software as the physical hardware system
Virtual Platform
Virtutech Core Technology
Model any electronic system on a PC or workstation
‒ Simics is a software program, no hardware required
Run the exact same software as the physical target (complete binary)
Run it fast (100s of MIPS)
Model any target system
‒ Networks, SoCs, boards, ASICs, ... no limits
For the benefit of software developers and hardware providers
Enables process change in software development
Figure: the layers of a Simics setup, from top to bottom:
‒ User application code
‒ Middleware and libraries
‒ Target operating system(s)
‒ Virtual target hardware (typically an embedded or real-time control computer system)
‒ Simics
‒ Host operating system
‒ Host hardware
Why use Virtual Systems?
“Because hardware is no fun”
Because Hardware Is...
• Not yet available
• Flaky prototype stage
• Not available anymore
Photo: Computer History Museum
Photo: Freescale
Because Hardware Is...
• Inconvenient
• Dangerous
• Inaccessible
Photo: ESA
Photo: www.mil.se, Bromma Conquip
Because Hardware Is...
• Impractical in scale
• Limited
• Inflexible
Virtual Platform Advantages and Features
Example: Early Hardware
Figure: two project timelines.
‒ Without a simulator: hardware design and production, then hardware-dependent software development, then hardware/software integration and test
‒ With a simulator: simulator development and hardware-dependent software development run alongside hardware design and production; hardware/software integration and test follows the first successful power-on and boot
Reduced project time-to-ship using simulated hardware
Handy Features of Simulation
Checkpointing
‒ Store current state; pick up and continue later
‒ Position workload once, use many times
‒ Spread a system state to multiple developers
‒ Package error reports (see demo)
‒ Can checkpoint an entire network of machines
‒ Key for repeating executions
Determinism/Repeatability
‒ Same initial state gives same execution
‒ Repeat the same execution any number of times
‒ Investigate a problem time after time
‒ For multiprocessor systems & network systems
‒ Very useful for complex systems, where repeatable runs otherwise do not happen
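The link between checkpointing and determinism can be sketched with a toy machine. This is a minimal illustration, not Simics code: `ToyMachine`, its `checkpoint`/`restore` methods, and the state update rule are all made up for the example. The only assumption that matters is that every piece of state, including the random number generator, lives inside the simulated machine.

```python
import copy
import random


class ToyMachine:
    """A tiny deterministic 'target': all state lives inside the object."""

    def __init__(self, seed):
        self.rng = random.Random(seed)  # private RNG, never the global one
        self.cycle = 0
        self.acc = 0

    def step(self):
        # Arbitrary state update standing in for one simulated cycle.
        self.acc = (self.acc * 31 + self.rng.randrange(256)) % 65521
        self.cycle += 1

    def run(self, cycles):
        for _ in range(cycles):
            self.step()

    def checkpoint(self):
        # Store the complete current state.
        return copy.deepcopy(self.__dict__)

    def restore(self, saved):
        # Pick up from a stored state; the saved copy stays reusable.
        self.__dict__ = copy.deepcopy(saved)


m = ToyMachine(seed=1)
m.run(1000)               # position the workload once
saved = m.checkpoint()
m.run(500)
first = (m.cycle, m.acc)
m.restore(saved)          # use the same starting point again
m.run(500)
second = (m.cycle, m.acc)
print(first == second)
```

Because nothing outside the machine influences `step`, the two runs from the same checkpoint end in the same state.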
Checkpointing in Simics
Save checkpoint
Restore to same machine, same model version
Restore to different host machine, same model version
Restore to same machine, updated model version (bug fix)
Restore to an updated and upgraded version of the model
Handy Features of Simulation
Visibility (insight without intrusion)
‒ All state can be observed
‒ All events can be traced and logged
Controllability
‒ Any part of machine or state can be changed
‒ Fault injection
Virtual time
‒ Time is completely virtual
‒ Global synchronization across all machines in a network
‒ Global stop across all processors in a multiprocessor
Example system insight: Load over time
Convenience: Loading Flash
Figure: a binary (bin) loaded into FLASH, through a flash programmer and in Simics.
Much shorter turn-around time for changing target software setup
No risk of “bricking” a target
Hardware Availability
Wide Availability
Virtual system is “just software”
Trivial to copy
Trivial to distribute
Each engineer can have a custom hardware system at their desk
Scalable
No physical supply limit
‒ Any number of each type of board
‒ Any type of system in “infinite” supply
A virtual system can be big or small by simple software (re)configuration
Handy Features of Simulation
Configurability
‒ Any parameter of system can be changed
Sandboxing
‒ Allows investigating “nasty code”
‒ Simulated machine completely isolated
‒ Networks can be isolated
‒ Simics undetectable by malware
Complete hardware simulation, no virtualization tricks
Reverse execution
‒ Roll back execution to previous state
‒ Reverse breakpoints
‒ Investigate details of program errors
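One common way to get reverse execution from a deterministic simulator is to keep periodic snapshots and replay forward from the nearest one. The sketch below shows that general technique on a toy machine; it is not a description of how Simics implements it, and `Counter`, `ReverseDebugger`, and the snapshot interval are hypothetical names and values.

```python
class Counter:
    """Toy deterministic machine: the next state depends only on the current state."""

    def __init__(self):
        self.step_no = 0
        self.value = 1

    def step(self):
        self.value = (self.value * 3 + self.step_no) % 1009
        self.step_no += 1


class ReverseDebugger:
    """Reverse execution by restoring a snapshot and replaying forward."""

    def __init__(self, machine, interval=100):
        self.m = machine
        self.interval = interval
        self.snapshots = {0: (machine.step_no, machine.value)}

    def forward(self, n):
        for _ in range(n):
            self.m.step()
            if self.m.step_no % self.interval == 0:
                self.snapshots[self.m.step_no] = (self.m.step_no, self.m.value)

    def goto(self, target):
        # Restore the latest snapshot at or before the target, then replay.
        base = max(s for s in self.snapshots if s <= target)
        self.m.step_no, self.m.value = self.snapshots[base]
        while self.m.step_no < target:
            self.m.step()

    def reverse_step(self):
        if self.m.step_no > 0:
            self.goto(self.m.step_no - 1)


# Reference run that records every state, used only to check the result.
ref = Counter()
history = [ref.value]
for _ in range(250):
    ref.step()
    history.append(ref.value)

dbg = ReverseDebugger(Counter())
dbg.forward(250)
dbg.reverse_step()        # "go back in time" by one step
print(dbg.m.step_no, dbg.m.value == history[249])
```

The replay only reproduces the earlier state because the machine is deterministic, which is why determinism and reverse execution appear together in the feature list.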
Virtualization = Infinite Longevity
The host machines available change over time...
‒ Today: host hardware 32-bit PC, host operating system Windows, Simics for x86/win
‒ Tomorrow: host hardware 64-bit PC, host operating system Linux, Simics for AMD64/linux
‒ Future: host hardware X, host operating system Y, Simics for X/Y
...but the simulated target hardware stays the same
‒ PPC 750fx Card, target OS, and applications in every case
And the target software keeps working
System-Level Features
• Checkpoint and restore
• Multicore, processor, board
• Real-world connections
• Repeatable fault injection on any system component
• Mixed endianness, word sizes, heterogeneity
• Scripting
con0.wait-for-string "$"
con0.record-start
con0.input "./ptest.elf 5\n"
con0.wait-for-string "."
$r = con0.record-stop
if ($r == "fail.") {
    echo "test failed"
}
Simics Debugging Features
• Synchronous stop for entire system
• Determinism and repeatability
• Reverse execution
• Unlimited and powerful breakpoints
• Trace anything
• Insight into all devices
break -x 0x0000->0x1F00
break-io uart0
break-exception int13
What Types of Systems Can be Simulated?
Levels of simulated systems, with examples:
Complete Systems & Networks
• Satellite, telecom network, backbone net
Racks of Boards & Backplanes
• Telecom rack, avionics bay, blade server
Complete Boards
• MPC8548CDS, MPC8572DS
Devices & Buses
• PCI, PCI-X, RapidIO, Custom ASICs
SoC Devices
• Freescale QorIQ P4080, MPC8572E, MPC8548E
Processor & Memory
• Processor cores such as e300, e500mc, e600
Debugging with Virtual Platforms
Three Steps of Debugging
1. Provoking errors
‒ Forcing the system to a state where things break
2. Reproducing errors
‒ Recreating a provoked error reliably
3. Locating the source of errors
‒ Investigating the program flow and data
‒ Depends on success in reproduction
A simulator can help with all three steps
Repeatability and Reverse Debugging
Repeat any run trivially
‒ No need to rerun and hope for bug to recur
Stop & go back in time
‒ Instead of rerunning program from start
‒ Breakpoints & watchpoints backwards in time
‒ Investigate exactly what happened this time
This control and reliable repeatability is very powerful for parallel code!
On hardware, only some runs reproduce an error
On virtual hardware, debugging is much easier
Code is not just about CPUs
On a modern SoC, the processor cores are just one part of the system
Much application functionality is implemented by using special accelerators... and you need to debug their interaction with the processors & software
Divide-by-zero in OS Kernel
Operating-system kernel crash in virtual model
‒ Divide-by-zero right in the kernel
‒ Algorithm to determine and compensate for clock skew
‒ Division by difference in time between two processors
Virtual model had zero clock skew = provoked error
‒ Could have happened on a real system
‒ Just not very likely
‒ Typical rare problem in the field
‒ Essentially testing a rare corner case in system state
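The failure mode is easy to reproduce in miniature. The functions below are hypothetical stand-ins for the kernel's skew compensation, written only to show why a model with exactly zero clock skew hits a path that real hardware almost never does; the formula and names are assumptions, not the actual kernel code.

```python
def skew_correction(t_cpu0, t_cpu1, drift):
    """Hypothetical compensation term: divides by the time difference between two processors."""
    return drift / (t_cpu1 - t_cpu0)


def safe_skew_correction(t_cpu0, t_cpu1, drift):
    """Same calculation, but with the zero-skew corner case handled."""
    delta = t_cpu1 - t_cpu0
    if delta == 0:
        return 0.0  # identical clocks: nothing to compensate
    return drift / delta


# On real hardware the two clocks almost always differ slightly.
print(skew_correction(1000, 1004, 8))

# In the virtual model both processors read exactly the same time.
try:
    skew_correction(1000, 1000, 8)
    crashed = False
except ZeroDivisionError:
    crashed = True
print(crashed)
```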
Race Condition in Serial Driver
The problem:
‒ Dual-core MPC8641D machine
‒ Changed clock frequency from 800 to 833 MHz
‒ OS froze on startup – quite unexpectedly
Investigation:
‒ Only happened at 832.9 to 833.3 MHz
‒ Determinism: 100% reproduction of error trivial
‒ Time control: single-step code feasible
‒ Insight: look at complete system state, log interrupts, check the call stack at the point of the freeze, check lock state
What we found:
‒ An interrupt service routine attempted to take a lock, before re-enabling interrupts. In the case that froze, the lock was already taken when the service routine was entered, and with no interrupts enabled there was no way for it to be released.
Custom Scripts to Debug Software
Scripting is a great tool for debugging
Simics debugger is fully scriptable
Available information:
‒ Looking up where functions, variables are in memory
‒ Mapping of instruction addresses to functions and code lines
‒ Tracking executing processes
‒ Switching processor debug context as target operating system executes
Also known as “OS Awareness”
OS awareness and debug scripting examples
Output example: Analyzing a multithreaded queue
[bp] Thread 1401, writing variable done with value 1.
At rule30_packet_queue_signal_done, line 62
Prev. state: Done: 0 Empty: 0 Full: 0 Tail: 0 Head: 99 Elems: 1
[bp] Thread 1396, writing variable empty with value 1.
At rule30_packet_queue_get, line 152
Prev. state: Done: 1 Empty: 0 Full: 0 Tail: 0 Head: 0 Elems: 0
[bp] Thread 1396, writing variable full with value 0.
At rule30_packet_queue_get, line 153
Prev. state: Done: 1 Empty: 1 Full: 0 Tail: 0 Head: 0 Elems: 0
Code behind the above
def general_breakpoint_handle(target, user_arg, context, bpno, memop):
    ...
    cpu = SIM_get_mem_op_initiator(memop)
    tid = process_tracker.iface.tracker.active_trackee(cpu)
    pc = cpu.iface.processor_info.get_program_counter()
    now = cpu.iface.cycle.get_cycle_count(cpu)
    value = SIM_get_mem_op_value_cpu(memop)
    (file, line, func) = context.symtable.source_at[pc]
    done = read_32b_variable(pq_done)
    empty = read_32b_variable(pq_empty)
    full = read_32b_variable(pq_full)
    head = read_32b_variable(pq_head)
    tail = read_32b_variable(pq_tail)
Source: http://www.virtutech.com/whitepapers/fixing-intermittent-multicore-bug-simics-807.html
The Disk Corruption
Distributed fault-tolerant file system got corrupted
‒ Not shared-memory machine, all boards single-processor connected by several networks
‒ Intermittent error
‒ Error seen as a composite state across multiple disks
‒ Months spent chasing it on physical hardware
Simics solution:
‒ Reproduce corruption in Simics model of target
‒ Pin-point time when it happens, by interval halving
‒ Around the critical time, take periodic snapshots of disks
‒ Check consistency of disk states in offline scripts
Result:
‒ Found the precise instruction causing the problem
‒ Could capture the network traffic pattern causing issue
‒ Communicated the complete setup to the file system creator, allowing the root cause to be fixed
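The interval-halving step can be sketched as a binary search over simulated time. It assumes a repeatable run, so that "is the disk state corrupt at time t?" always gives the same answer, and that the corruption persists once it has happened. `first_bad_time`, `is_corrupt_at`, and the time values are hypothetical; in practice the predicate would restore a checkpoint, run to time t, and check the disk images.

```python
def first_bad_time(is_corrupt_at, lo, hi):
    """Earliest time t in (lo, hi] at which is_corrupt_at(t) is True.

    Requires is_corrupt_at(lo) == False, is_corrupt_at(hi) == True,
    and that corruption, once present, stays present.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_corrupt_at(mid):
            hi = mid      # corruption already present: look earlier
        else:
            lo = mid      # still consistent: look later
    return hi


# Stand-in for "run the deterministic simulation to time t and check the disks".
CORRUPTION_STARTS = 73_421
probes = []


def is_corrupt_at(t):
    probes.append(t)
    return t >= CORRUPTION_STARTS


found = first_bad_time(is_corrupt_at, 0, 1_000_000)
print(found, len(probes))
```

Each probe halves the remaining interval, so a million time units need at most about twenty runs instead of a linear scan.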
Computer System Simulation Technology
Embedded Computer System
Figure: an embedded computer system and its surroundings.
‒ Software stack: BootROM, drivers, HAL; operating system; middleware, libraries; applications
‒ Surroundings: communications networks, controlled environment, human user interface
Simulating Embedded Computer System
Figure: the embedded computer system (software stack of BootROM, drivers, HAL, operating system, middleware, libraries, and applications) with its communications networks, controlled environment, and human user interface.
Simulation: “fake” one or more of the system pieces to enable work on other pieces. Some parts may be physical, while others are virtual.
Each piece has its own simulation issues and specialized simulation tools
We focus on running the software, and simulating the networks. Other parts are done using integrations with other companies’ tools.
Simics Architecture
Figure: Simics architecture. Blocks shown:
‒ Target machine(s): processors, memory, devices, networks and IO, running target boot code, target hardware drivers, target operating system, middleware, and user programs
‒ Model library: device, network, CPU core, SoC, and interconnect models, plus config scripts
‒ Simics core: target ISA decoder, interpreter, JIT compiler, VMP, event queue and time, multithreading and scaling, memory, configuration management, core services API
‒ Features: inspection, control, scripting, GUI, built-in debugger
‒ External world connections: Ethernet, serial, keyboard, mouse, ext. debuggers, ...
‒ Modeling languages: DML, C, C++, Python, Simics script, SystemC
Full-System Simulation
Detail level determines speed
‒ The more detail, the slower the simulation
Abstraction: timing precision, implementation details
Functionality must always be correct!
Simulation detail level        Typical slowdown   Approximate speed in “MIPS”   Time to simulate one real-world minute
Gate-level simulation          1000000            0.002                         2 years
Cycle-accurate simulation      10000              0.2                           7 days
Cycle-approximate simulation   500                4                             8 hours
Fast functional simulation     5                  400                           5 minutes
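The columns of the table are tied together by simple arithmetic: one real-world minute takes "slowdown" minutes to simulate, and speed times slowdown is constant. The sketch below recomputes them. The 2000 MIPS baseline is an assumption inferred from the table itself (0.002 × 1000000 = 400 × 5 = 2000), not a figure stated on the slide.

```python
NATIVE_MIPS = 2000  # assumption: baseline implied by speed x slowdown in the table

SLOWDOWN = {
    "Gate-level simulation": 1_000_000,
    "Cycle-accurate simulation": 10_000,
    "Cycle-approximate simulation": 500,
    "Fast functional simulation": 5,
}


def simulated_speed_mips(slowdown):
    """Target instructions per second (in millions) at a given slowdown."""
    return NATIVE_MIPS / slowdown


def host_minutes_per_target_minute(slowdown):
    """Host minutes needed to simulate one real-world minute."""
    return float(slowdown)


for level, slowdown in SLOWDOWN.items():
    minutes = host_minutes_per_target_minute(slowdown)
    print(f"{level}: {simulated_speed_mips(slowdown):g} MIPS, "
          f"{minutes / 60:.1f} hours = {minutes / 1440:.1f} days")
```

The slide's "2 years", "7 days", and "8 hours" are these values rounded: 1000000 minutes is about 694 days, 10000 minutes is about 6.9 days, and 500 minutes is about 8.3 hours.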
Cardinal Rule of Simulation
Figure: scope of modeled system versus units of the simulation, with examples ranging from quarks, atoms, and planets to galaxies and the universe.
Reasonable to simulate: scope proportional to abstraction
The Art of Fast Simulation
“Know when to bluff”
‒ You are in a poker game against the software
It is an art to implement just enough to fool the software, but not more
‒ Details cost dev time and execution speed
‒ Implement the what and not the how
‒ Do work in largest possible units
Entire Ethernet packets
DMA in a single step
‒ “Transaction-level modeling”
Simics Modeling Level: Processor
Instruction-set simulation (ISS)
Complete and correct processor functionality
‒ Endianness, word lengths, arithmetic results
‒ All instruction semantics bit-correct vs real machine
‒ Supervisor-mode & user-mode
‒ Runs the complete target instruction set
Including AltiVec, SSE, 3DNow!, VIS, etc. extensions
‒ All accessible values represented
User-level registers
Supervisor-level registers
Model-specific registers, ASIs, debug registers, etc.
Memory-management unit
Timing abstracted
‒ Fixed execution time per instruction
‒ No cache model in default mode
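A toy instruction-set simulator shows what "functionality correct, timing abstracted" means in code. The four-instruction ISA, the register count, and the cost of one cycle per instruction are all invented for this sketch; only the idea of a fixed execution time per instruction comes from the slide.

```python
CYCLES_PER_INSTRUCTION = 1  # fixed execution time per instruction (illustrative value)


def run(program, max_steps=10_000):
    """Interpret a toy program; return (registers, elapsed cycles)."""
    regs = [0] * 4
    pc = 0
    cycles = 0
    while pc < len(program) and max_steps > 0:
        op, a, b = program[pc]
        if op == "li":        # load immediate: regs[a] = b
            regs[a] = b
            pc += 1
        elif op == "add":     # regs[a] += regs[b], wrapping to 32 bits
            regs[a] = (regs[a] + regs[b]) & 0xFFFFFFFF
            pc += 1
        elif op == "subi":    # regs[a] -= b, wrapping to 32 bits
            regs[a] = (regs[a] - b) & 0xFFFFFFFF
            pc += 1
        elif op == "bnez":    # branch to instruction b if regs[a] != 0
            pc = b if regs[a] != 0 else pc + 1
        else:
            raise ValueError(f"unknown instruction: {op}")
        cycles += CYCLES_PER_INSTRUCTION   # no pipeline, no cache model
        max_steps -= 1
    return regs, cycles


program = [
    ("li", 0, 0),     # r0 = 0 (running sum)
    ("li", 1, 5),     # r1 = 5 (loop counter)
    ("add", 0, 1),    # r0 += r1
    ("subi", 1, 1),   # r1 -= 1
    ("bnez", 1, 2),   # loop while r1 != 0
]
regs, cycles = run(program)
print(regs[0], cycles)
```

The arithmetic results are exact, while the cycle count is simply the number of instructions executed.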
Some things have to be modeled…
Correct endianness Incorrect endianness
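Why endianness has to be modeled: the same four bytes in memory mean different numbers depending on byte order, so a model with the wrong endianness hands the target software wrong data. A minimal illustration using Python's standard `struct` module:

```python
import struct

raw = bytes([0x12, 0x34, 0x56, 0x78])   # four bytes as stored in target memory

big = struct.unpack(">I", raw)[0]       # read as a big-endian 32-bit word
little = struct.unpack("<I", raw)[0]    # read as a little-endian 32-bit word

print(hex(big), hex(little))
```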
Simics Model of Memory Bus
Figure: typical hardware structure of a generic modern SoC.
‒ Core0, Core1, and Core2, each with an L1 cache, a shared L2 cache, and a coherency module
‒ Fast interconnect for high-bandwidth devices
‒ Bridge to a slower interconnect for other devices
‒ Devices and memory shown: two DDR memory controllers with DDR SDRAM, flash memory controller with flash memory, UART, timer, I2C, Ethernet, accelerator, RapidIO
Simics Model of Memory Bus
Figure: the generic SoC as modeled in Simics, with Core0, Core1, and Core2 connected to a memory map that leads to the devices and memory.
Simics does not model cache timing & coherency protocol
Memory traffic goes directly to the memory store; memory controllers, bridges, etc. are modeled only for their effect on the configuration of the memory map
A single global memory map directly routes memory accesses to devices or memory
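A single global memory map can be sketched as a list of address ranges, each pointing at the object that handles accesses in that range. This is an illustration of the idea only: `MemoryMap`, `Ram`, the base addresses, and the `read`/`write` interface are hypothetical and not the Simics API.

```python
class Ram:
    """Plain byte storage; also used here to stand in for flash."""

    def __init__(self, size):
        self.data = bytearray(size)

    def read(self, offset):
        return self.data[offset]

    def write(self, offset, value):
        self.data[offset] = value & 0xFF


class MemoryMap:
    """Routes each access straight to the target mapped at that address."""

    def __init__(self):
        self.regions = []   # (base, size, target)

    def map(self, base, size, target):
        self.regions.append((base, size, target))

    def _lookup(self, addr):
        for base, size, target in self.regions:
            if base <= addr < base + size:
                return target, addr - base
        raise LookupError(f"access outside memory map: {addr:#x}")

    def read(self, addr):
        target, offset = self._lookup(addr)
        return target.read(offset)

    def write(self, addr, value):
        target, offset = self._lookup(addr)
        target.write(offset, value)


flash = Ram(0x1000)
ram = Ram(0x1000)
mm = MemoryMap()
mm.map(0x0000_0000, 0x1000, flash)   # hypothetical base addresses
mm.map(0x1000_0000, 0x1000, ram)

mm.write(0x1000_0010, 0xAB)          # no cache, bus, or bridge on the path
print(hex(mm.read(0x1000_0010)))
```

In this picture a bridge or memory controller only matters when the software reprograms it, which changes the entries in `regions`.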
Simics Modeling Level: Devices
Hardware modeled as a set of devices
‒ Memory map of machine (as seen by processor)
‒ At the programming register level
Model the program-visible behavior
‒ Configuration registers
‒ Control registers
‒ Data transmitted & received
Transaction-level modeling
‒ Reads, writes, DMA transfers, network packets
Reactive, passive models
‒ Only execute code when a transaction occurs
ASICs & FPGAs
‒ Model programming interface behavior
‒ Not detailed implementation
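A register-level device model in this style might look like the sketch below: a made-up UART with a control, a status, and a data register. The register layout, bit meanings, and class name are hypothetical. The point is that the model only runs code when a read or write transaction arrives, and that it models what the driver can see rather than how the hardware shifts bits out.

```python
class ToyUart:
    """Programmer's-view model of a hypothetical UART."""

    CTRL, STATUS, DATA = 0x0, 0x4, 0x8   # register offsets (made up)
    CTRL_ENABLE = 0x1
    STATUS_TX_READY = 0x1

    def __init__(self):
        self.ctrl = 0
        self.sent = bytearray()          # data "transmitted" so far

    def read(self, offset):
        if offset == self.CTRL:
            return self.ctrl
        if offset == self.STATUS:
            # Transmission is instantaneous in this model, so always ready.
            return self.STATUS_TX_READY
        return 0

    def write(self, offset, value):
        if offset == self.CTRL:
            self.ctrl = value & 0xFF
        elif offset == self.DATA and self.ctrl & self.CTRL_ENABLE:
            # One transaction moves the whole character; no bit-level timing.
            self.sent.append(value & 0xFF)


uart = ToyUart()
uart.write(ToyUart.DATA, ord("x"))              # ignored: device not enabled
uart.write(ToyUart.CTRL, ToyUart.CTRL_ENABLE)
for ch in b"Hi":
    uart.write(ToyUart.DATA, ch)
print(bytes(uart.sent))
```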
Simics Modeling Level: Networks
Interfaced using “real” network devices
Networks modeled at message level
‒ Entire messages (packets, frames, ...) delivered as a unit
Hardware addressing used
‒ Ethernet MAC
‒ Does not care about higher-level protocols
‒ Ethernet allows IPv4, IPv6, TCP, UDP, SCP, ICMP, ...
Any topology or addressing scheme
‒ Broadcast, unicast, switched, point-to-point, etc.
Perfect network by default
‒ Introduce latencies
‒ Introduce bandwidth limits
‒ Introduce faults
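A message-level link can be sketched as a queue of whole frames, addressed by MAC and delivered after a configurable latency. `ToyLink`, its methods, and the `drop_next` fault hook are hypothetical names for this illustration; the payload is opaque, so the link does not care which higher-level protocol is inside.

```python
import heapq

BROADCAST = "ff:ff:ff:ff:ff:ff"


class ToyLink:
    """Delivers entire frames as units, in virtual time."""

    def __init__(self, latency=0):
        self.latency = latency     # 0 gives the "perfect network" default
        self.inbox = {}            # MAC address -> list of (time, payload)
        self.in_flight = []        # heap of (deliver_time, seq, src, dst, payload)
        self.seq = 0
        self.drop_next = False     # fault injection: lose the next frame

    def attach(self, mac):
        self.inbox[mac] = []

    def send(self, now, src, dst, payload):
        if self.drop_next:
            self.drop_next = False
            return
        heapq.heappush(self.in_flight,
                       (now + self.latency, self.seq, src, dst, payload))
        self.seq += 1

    def advance(self, now):
        """Deliver every frame whose delivery time has been reached."""
        while self.in_flight and self.in_flight[0][0] <= now:
            when, _, src, dst, payload = heapq.heappop(self.in_flight)
            if dst == BROADCAST:
                receivers = [m for m in self.inbox if m != src]
            else:
                receivers = [dst]
            for mac in receivers:
                if mac in self.inbox:
                    self.inbox[mac].append((when, payload))


A, B = "00:00:00:00:00:01", "00:00:00:00:00:02"
link = ToyLink(latency=5)
link.attach(A)
link.attach(B)

link.send(0, A, B, b"ping")
link.advance(4)
early = list(link.inbox[B])        # nothing yet: latency not elapsed
link.advance(5)

link.drop_next = True              # inject a fault
link.send(6, A, B, b"lost")
link.advance(100)
received = link.inbox[B]
print(early, received)
```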
Key Technology: Temporal Decoupling
Figure: three simulated boards connected by a network. Each board carries cores running RTOS, middleware (MW), and software (SW), plus RAM, FLASH or disk, and peripherals such as UART, Ethernet, PCIe, USB, SATA, RTC, and PIC.
‒ Simulation progress, cycle-by-cycle interleave: the simulator switches between cores at every step
‒ Simulation progress, temporal decoupling: each core runs a longer time slice before the simulator switches to the next
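The trade-off behind temporal decoupling can be sketched with a round-robin scheduler that lets each core run a time quantum before switching. The function below is a toy model with made-up names; it counts how often the simulator switches cores and how far apart in virtual time the cores can get.

```python
def run_round_robin(num_cores, total_cycles, quantum):
    """Simulate num_cores for total_cycles each, quantum cycles at a time.

    Returns (number of core switches, largest time difference between cores).
    """
    done = [0] * num_cores
    switches = 0
    max_skew = 0
    while min(done) < total_cycles:
        for core in range(num_cores):
            n = min(quantum, total_cycles - done[core])
            if n == 0:
                continue
            done[core] += n            # run this core for one time slice
            switches += 1
            max_skew = max(max_skew, max(done) - min(done))
    return switches, max_skew


interleaved = run_round_robin(num_cores=3, total_cycles=12, quantum=1)
decoupled = run_round_robin(num_cores=3, total_cycles=12, quantum=4)
print(interleaved, decoupled)
```

A longer quantum means fewer switches, which is where the speed comes from, at the price of cores drifting up to one quantum apart in virtual time.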
Key Technology: Hypersimulation
Fast-forward the simulation when a processor is idle
‒ Detect nothing to run for a while
‒ Relies on Simics isolating hardware models from the outside world
Performance effect drastic
‒ Effectively removes idle processors from the simulation of parallel systems
‒ Can simulate at 100s of GHz
Examples of idle:
‒ Processor stop/power-down until an interrupt happens
x86 “halt” instruction
PPC nap mode
Etc.
‒ Instruction sequence with known outcome and no side effects
PPC “bdnz 0”, which loops COUNT times
Patterns for operating-system idle loops, if not using power-down or other obvious idle instructions
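The effect of hypersimulation can be sketched with an event queue: when the processor is idle, virtual time jumps directly to the next pending event instead of being stepped cycle by cycle. The function and event format below are hypothetical. The jump is safe only because every future event is known to the simulator, which is why the technique relies on the models being isolated from the outside world.

```python
import heapq


def run(events, end_time, hypersim):
    """Simulate a mostly idle processor.

    events: list of (time, work_cycles) interrupts that wake the processor.
    Returns (busy_cycles, simulator_steps).
    """
    queue = list(events)
    heapq.heapify(queue)
    now = 0
    busy = 0
    steps = 0
    while now < end_time:
        if queue and queue[0][0] <= now:
            _, work = heapq.heappop(queue)
            busy += work
            steps += work          # real work is simulated cycle by cycle
            now += work
        elif hypersim:
            # Idle: fast-forward to the next event, or to the end of the run.
            now = min(queue[0][0], end_time) if queue else end_time
            steps += 1
        else:
            now += 1               # idle loop simulated one cycle at a time
            steps += 1
    return busy, steps


events = [(100, 10), (5000, 20)]
plain = run(events, end_time=10_000, hypersim=False)
fast = run(events, end_time=10_000, hypersim=True)
print(plain, fast)
```

Both runs do the same target work; only the number of simulator steps differs.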
Key Technology: Multithreading
Figure: target simulation speed and total simulator work for three cases on a host workstation: a simple system, a complex system in single-threaded Simics, and a complex system with Simics Accelerator. Values as extracted: 25% and 1.0; 100% and 4.0; 100% and 1.0.
Multithreading the Simulator
Figure: the three-board networked system simulated in parallel, with the boards spread over Thread 1, Thread 2, and Thread 3. Simulation progress, parallel execution.
We need to synchronize the threads every once in a while, to maintain semantics: no more apart than they would be under sequential execution of the same system
Load balancing is a new limiter of simulation performance, never an issue in a single-threaded simulation
Speed Impact
Figure: Computation-Intense Benchmark on P4080 Model. Relative speed versus time quantum length (10 to 1000000). Series: Single P4080; With second P4080 idling; With second P4080 multithreaded; Overhead of second P4080 idling.
Temporal Decoupling Affects Execution
Linux boot on MPC8572E
‒ Contains a spin-lock loop algorithm that is affected by temporal decoupling
‒ But still, things work out fine with a long time slice
Figure: wall time and total target instructions versus time quantum length (10 to 1500000).
Fastest execution at 10000, but not the least number of target instructions
SIMULATIONS DO NOT NEED TO BE COMPLETE
Virtual Platforms Evolve with your Hardware
Start with simple board: CPU/SoC + memory
Add details as the hardware design evolves
Virtual platform does not need to be complete until the end
Enables hardware/software co-design
‒ Develop software as soon as hardware is designed
Fast feedback at each iteration
Reduces time to market, improves product quality
Figure: physical hardware and virtual hardware evolving side by side
Dummy Devices
Many devices lack interesting behavior
‒ From the perspective of the software for a particular system
Only affect low-level system timing
Not used by current software setup
No interesting effects from using them
‒ Replace with dummies that do nothing
But do not give “access out of memory” errors
Examples: memory timing setup, performance counters, error detection registers, ...
‒ Note that you can add them later if effects are needed
= Simulation runs faster, takes less time to create
Stubs
Not all parts of a system need to be modeled
Replace by stubs
‒ Find an appropriate (narrow) interface to cut at
‒ Replace complete model with its behavior
Stubs Example
Figure: a rack in which a control card (main processor SoC) and a line card (interface processor and six DSPs) are connected by the rack back-plane. Shown twice, once per stubbing choice:
For simulation of rack management, stub out the DSPs on the line cards
For testing control-plane algorithms, stub out the entire line card
Simulating Networks in Simics: Mixing Abstractions
Figure: what can be attached to a Simics network link simulation:
‒ System under test, fully simulated (simulated HW, OS, application)
‒ Other fully simulated nodes on the simulated network
‒ Rest-of-network model: simplified behavioral simulation of other nodes, based on network I/O
‒ Traffic generator or network tester: dedicated test system that injects packets and checks the replies
‒ Real-network connection to physical HW (OS, application) and real-world network test equipment
‒ Instrumentation module
MIXING ABSTRACTION LEVELS
Abstraction Levels for Virtual Platforms
Level | Meaning | System timing | Timing points for memory transaction | Memory transaction delay | Arbitration, contention, bus bandwidth | Temporal decoupling | Simulation driver | Max obs. speed
UT | Untimed; SystemC, C, CSP | no | none | none | No | no | Events | 1 MTrans
ST | Software timing: Simics, Qemu, Mambo | yes | 1 | Zero | No | yes | Processor | 1000s MIPS
LT | SystemC TLM-2.0 Loosely Timed | yes | 2 | Loose | No | Optional | Processor | 100s MIPS
AT | SystemC TLM-2.0 Approximately Timed | yes | 4 (2 phases) | Approximate | Yes | Not meaningful | Processor or Devices | 10 MIPS
CC | Clock-cycle | yes | Each clock | Accurate | Yes | no | Clock | 100 Kcycles
RTL | Register-transfer level | yes | The truth | The truth | Yes | no | Hardware clocks | n/a
System Simulation Use Cases
System-on-Chip Design
‒ Focus on hardware designer needs
‒ Architecture exploration
‒ Sizing, performance, optimization of hardware
Fidelity to target is primary driver for models
‒ Timing
‒ Bandwidth
‒ Latency
‒ Bus structure
All components are equals
UT, LT, AT, CC
Software Development
‒ Focus on software developer needs
‒ Execute large workloads
‒ Debug code
Speed of execution is the primary driver for model
‒ Abstract as far as possible
‒ Approximate timing
Work from the processor outwards
Clear difference between processors and other devices
ST, LT
Example Measurement: Execution Time
Performance of HW acceleration vs SW implementation
‒ Including the overhead of Linux device drivers
‒ Different driver modes and different hardware latencies
‒ Packet lengths from 8 to 256 bits
‒ Total of 400 runs, 200 billion target instructions
‒ ST mode, ideal memory
Figure: execution time (axis from 0 to 2000000) versus packet length (8 to 256 in steps of 8). Series: bit, char, hw-1, hw-10, hw-100, hw-1000, hw-10000, hw-mmap-1, hw-mmap-10, hw-mmap-100, hw-mmap-1000, hwo-1.
Example Measurement: Jitter
Same setup as previous slide
Shows impact of OS on result latencies (aka jitter)
‒ Note how different modes have different susceptibility to jitter
Figure: Maximum Compared to Average Time. Maximum execution time divided by average execution time (axis from 0 to 9) versus packet length (8 to 256 in steps of 8). Series: char, hw-100, hw-1000, hw-10000, hw-mmap-100, hw-mmap-1000.
Example Measurement: Memory Speed Impact
Shows value of memory latency simulation in Simics, “LT” mode
Parallel processing benchmark
‒ Shared memory restricted to single access and high latencies
‒ Testing two different transfer modes, 1 packet and 4 packets per transmission
‒ Scalability quite different
Figure: Scaling as Worker Nodes are Added. Performance relative to one worker node (axis from 0 to 10) versus number of worker nodes (1 to 9). Series: perfect memory; 100 cycles, single port; 200 cycles, single port; 500 cycles, single port; and the same four memory setups with 4 packets/trans.
Hybrid Simulation
Mix fast functional (ST) and clock-cycle-accurate (CC) models
‒ Two models of each device: ST, CC
‒ Use fast simulation to get to interesting places
‒ Zoom in selectively using detailed models
‒ Switch mode using a checkpoint: repeatable, archived, restartable
Additional Simics implementation mode: support detailed models
‒ You need a more complicated bus model
‒ Allows out-of-order transactions and timed buses
‒ Supports full bus hierarchy, not just a streamlined memory map
‒ A single simulation kernel for the entire system
First product: the Freescale QorIQ P4080
‒ Functional fast models from Virtutech
‒ Detailed models from Freescale (internal engineering)
Hybrid: What it Means
Mix temporally: change from functional to detailed simulation when workload reaches interesting point
‒ Use fast mode to position in reasonable time
Mix spatially: combine fast and detailed models in the same simulation setup
‒ Leverage fast models to get complete system
‒ Speed up simulation by only simulating what is relevant
Figure (temporal mix): a timeline of functional simulation that drops into detailed mode at interesting points
Figure (spatial mix): a virtual board containing a virtual model of a new SoC, with a packet processing pipeline in detail and the rest of the virtual platform functional. Blocks shown: CPUs, pattern matching, crypto, TCP offload, buffer memories, flow control, Ethernet, timer, interrupt, MemCtrl, UART, RT clock, RAM, FLASH.
Another Option: Co-Simulation with Detailed System
Integrate a complete subsystem from a CC (or AT) simulator
‒ Fully-detailed subsystem simulation
‒ Multiple CC models with a CC interconnect between them
‒ Use existing proven CC bus models
‒ Transactor between the worlds, typically bus-to-bus
‒ Let the Simics system run as today, pass transactions in and out of the CC subsystem
Dual simulation kernels
‒ Automatic reuse of second simulator
Figure: Simics (ST CPUs, ST device, and ST memory on the Simics bus model) connected through a transactor and a CC simulator bridge to a CC simulator (CC CPU, CC DSP, and CC devices on a CC bus model).
Another Option: Individual CC models
Individual CC models inside a Simics ST framework
‒ Bus is Simics bus model
‒ Transactor to convert from ST to CC traffic, locally for each model
‒ Transactor creates local clocks
‒ Will not get correct bus contention, arbitration, etc.
Main point: to check the internal behavior of a device model
‒ Maybe run the RTL vs Driver
Figure: a CC device inside a wrapper device model with a transactor, attached to the Simics bus model alongside ST CPUs, an ST device, and ST memory.
Questions?
Work for Us!
Master's thesis projects (exjobb) are available!
Spares
Locking Test Program
Test program
‒ 2 threads
‒ 1000000 iterations
‒ MPC8641D virtual target
All locking disciplines
Time quantum 10-10000
Notes:
‒ Locking overhead visible
‒ Lock contention visible
‒ Only proper locking varies in execution time
On real hardware:
‒ no << fake << proper
‒ Same relation seen in simulation
Figure: execution time (axis from 0 to 1.2) versus time quantum (10 to 10000) for no locking, fake locking, and proper locking.
Lock contention execution time effect clearly visible
Locking Test Program: Find Race
Seriously broken program with an unprotected variable access in a tight loop
Test on single-core and dual-core setups
‒ Range of frequencies
‒ Test program run 20 times on each setup
‒ Count percentage of runs triggering race
Results:
‒ Race always triggers in dual-core mode
‒ Triggers around 10% in single-core mode
‒ Higher clock = less chance to trigger in single-core
Simplified timing does not hide the race, this was run on standard Simics
Figure: percentage of runs triggering race (0% to 100%) versus clock frequency (1 to 10000 MHz), for 1 CPU and 2 CPUs.
Simulator shows the difference between single-core and multi-core setups in bug aggressiveness
Units of Simulation
Processor Cores
‒ The CPUs running code
‒ Special case to gain performance, simulated using ISS, JIT, API, etc. – buy or borrow!
‒ Comparatively limited in variants, compared to devices
‒ Typically provided by tool vendors
Devices
‒ Anything that the system contains that does things and that is not a user-programmable CPU
‒ Written by tool vendors and their users
‒ Examples: Timers, interrupt controllers, ADC, DAC, network interfaces, I2C controllers, serial ports, LEDs, displays, media accelerators, pattern matchers, table lookup engines, memory controllers, ...
Memories
‒ RAM, ROM, FLASH, EEPROM, ...
‒ Store code and data
‒ Usually special simulation case for performance reasons, closely integrated with processor core simulators
‒ Typically provided by tool vendors
Interconnects
‒ Connecting devices, chips, boards, cabinets, systems together
‒ I2C, Serial, Ethernet, PCI, PCIe, RapidIO, ATM, CAN, FireWire, USB, MIL-STD-1553, MII, VME, HyperTransport, memory bus, ...
‒ Typically provided by tool vendors
Virtual Platform Block Diagram
Figure: block diagram of a virtual MPC8572E board.
‒ MPC8572E SoC: Core0 and Core1 (each running BSP, OS, and apps), L2$ / SRAM, ECM, two DDR memory controllers, EBC, four eTSECs, FEC, PCIe, RapidIO, DUART, I2C, SEC, TLU, Deflate, PME, PIC, DMA
‒ On the board: DDR SDRAM, flash, PHYs
‒ Links to the outside: Ethernet link, serial link, RapidIO link, PCIe link, I2C link
‒ Legend: device, memory, processor, interconnect