TRANSCRIPT
-
The gem5 Simulator
ISCA 2011
Brad Beckmann (AMD Research), Nathan Binkert (HP Labs), Ali Saidi (ARM, Inc.), Joel Hestness (University of Texas, Austin), Gabe Black (Google, Inc.), Korey Sewell (University of Michigan, Ann Arbor), Derek Hower (University of Wisconsin, Madison)
June 5th, 2011
1
-
Welcome!
We're glad you're here! The gem5 simulator has been a multi-year effort. A wide variety of institutions have participated
This tutorial is for you. Please ask questions! Don't save them for the break! We intend the focus to be audience-driven
2
-
Tutorial Goals and Timeline
Tutorial goals: introduce you to the gem5 simulator; answer your development questions
Two halves:
8:30-noon: overview of the simulator, features, components, and simple examples
After lunch: birds-of-a-feather sessions and informal discussions of simulator internals
3
-
Outline
1 Introduction to gem5
2 Basics
3 Debugging
4 Checkpointing and Fastforwarding
5 Break
6 Multiple Architecture Support
7 CPU Modeling
8 Ruby Memory System
9 Wrap-Up
4
-
Introduction to gem5
Introduction to gem5
Brad Beckmann
AMD Research
5
-
Introduction to gem5
What is gem5? The best parts of M5 The best parts of GEMS
Overall goals, design principles, and capabilities
6
-
What is gem5?
The combination of M5 and GEMS into a new simulator
Google scholar statistics M5 (IEEE Micro, CAECW): 440 citations GEMS (CAN): 588 citations
Best aspects of both glued together M5: CPU models, ISAs, I/O devices, infrastructure GEMS (essentially Ruby): cache coherence protocols,
interconnect models
7
-
What else is new?
Many other things have changed since previous tutorials beyond GEMS+M5
Some of the highlights: the world's most popular ISAs: ARM and x86; the In-Order CPU model; new documentation
Overall, gem5 offers a broad set of capabilities
8
-
Android on ARM FS
9
-
64 Processor Linux on x86 FS
10
-
What gem5 is Not
A hardware design language Higher level for design space exploration, simulation speed
A restrictive environment Just C++ and Python with an event queue and a bunch of
APIs you can choose to ignore
Finished! Always room for improvement . . .
11
-
What We Would Like gem5 to Be
Something that spares you the pain we've been through; a community resource
Modular enough to localize changes Contribute back, and spare others some pain
A path to reproducible/comparable results A common platform for evaluating ideas
Let us know how we can help you contribute Public wiki is up at http://www.gem5.org Please submit patches and additional features Ability to add modules with EXTRAS= The more active the community is, the more successful gem5
will be!
12
-
Two Views of gem5
View #1 A framework for event-driven simulation
Events, objects, statistics, configuration
View #2 A collection of predefined object models
CPUs, caches, busses, devices, etc.
This tutorial focuses on #2 You may find #1 useful even if #2 is not
At least three other simulators have been created using #1
13
-
Main Goals
Overall goal: an open-source community tool focused on architectural modeling
Flexibility: multiple CPU models across the speed vs. accuracy spectrum; two execution modes: System-call Emulation & Full-System; two memory system models: Classic & Ruby; once you learn it, you can apply it to a wide range of investigations
Availability: for both academic and corporate researchers; no dependence on proprietary code; BSD license
Collaboration: combined effort of many with different specialties; active community leveraging collaborative technologies
14
-
Key Features
Pervasive object-oriented design Provides modularity, flexibility Significantly leverages inheritance e.g. SimObject
Python integration Powerful front-end interface Provides initialization, configuration, & simulation control
Domain-Specific Languages ISA DSL: defines ISA semantics Cache Coherence DSL (a.k.a. SLICC): defines coherence logic
Standard interfaces: Ports and MessageBuffers
15
-
Capabilities
Execution modes: System-call Emulation (SE) & Full-System (FS)
ISAs: Alpha, ARM, MIPS, Power, SPARC, x86
CPU models: AtomicSimple, TimingSimple, InOrder, and O3
Cache coherence protocols: broadcast-based, directories, etc.
Interconnection networks: Simple & Garnet (Princeton, MIT)
Devices: NICs, IDE controller, etc.
Multiple systems: communicate over TCP/IP
16
-
Cross-Product Matrix
Processor: CPU model (Atomic Simple, Timing Simple, InOrder, O3) × system mode (SE, FS)
Memory system: Classic, Ruby (Simple network), Ruby (Garnet network)
17
-
Outline
1 Introduction to gem5
2 Basics
3 Debugging
4 Checkpointing and Fastforwarding
5 Break
6 Multiple Architecture Support
7 CPU Modeling
8 Ruby Memory System
9 Wrap-Up
18
-
Basics
Basics
Nate Binkert
HP Labs
19
-
Basics
Compiling gem5 Running gem5 Very brief overview of a few key concepts:
Objects Events Modes Ports Stats
20
-
Building Executables
Platforms: Linux, BSD, MacOS, Solaris, etc.; little-endian machines (some architectures support big endian); 64-bit machines help a lot
Tools:
GCC/G++ 3.4.6+ (most frequently tested with 4.2-4.5)
Python 2.4+
SCons 0.98.1+ (we generally test versions 0.98.5 and 1.2.0); http://www.scons.org
SWIG 1.3.31+; http://www.swig.org
21
-
Compile Targets
build/<config>/<binary>
Configs: by convention, usually <ISA>_<mode>
ALPHA_SE (Alpha syscall emulation)
ALPHA_FS (Alpha full system)
Other ISAs: ARM, MIPS, POWER, SPARC, X86
Sometimes followed by a Ruby protocol: ALPHA_SE_MOESI_hammer
You can define your own configs
Binaries:
gem5.debug: debug build, symbols, tracing, asserts
gem5.opt: optimized build, symbols, tracing, asserts
gem5.fast: optimized build, no debugging, no symbols, no tracing, no assertions
gem5.prof: gem5.fast + profiling support
22
-
Sample Compile
blue% scons build/X86_FS/gem5.opt
scons: Reading SConscript files ...
Checking for leading underscore in global variables...no
Checking for C header file Python.h... yes
Checking for C library pthread... yes
Reading /n/blue/z/binkert/work/m5/incoming/src/mem/ruby/SConsopts
Reading /n/blue/z/binkert/work/m5/incoming/src/mem/protocol/SConsopts
Reading /n/blue/z/binkert/work/m5/incoming/src/arch/arm/SConsopts
Building in /n/blue/z/binkert/work/m5/incoming/build/X86_FS
Variables file /n/blue/z/binkert/work/m5/incoming/build/variables/X86_FS not found, using defaults in /n/blue/z/binkert/work/m5/incoming/build_opts/X86_FS
scons: done reading SConscript files.
scons: Building targets ...
[ CXX] X86_FS/sim/main.cc -> .o
[ CXX] X86_FS/sim/async.cc -> .o
[ CXX] X86_FS/sim/core.cc -> .o
[ TRACING] -> X86_FS/debug/Event.hh
Defining FAST_ALLOC_STATS as 0 in build/X86_FS/config/fast_alloc_stats.hh.
Defining FORCE_FAST_ALLOC as 0 in build/X86_FS/config/force_fast_alloc.hh.
Defining NO_FAST_ALLOC as 0 in build/X86_FS/config/no_fast_alloc.hh.
[ CXX] X86_FS/sim/debug.cc -> .o
[ TRACING] -> X86_FS/debug/Config.hh
[ CXX] X86_FS/sim/eventq.cc -> .o
[ CXX] X86_FS/sim/init.cc -> .o
[ TRACING] -> X86_FS/debug/TimeSync.hh
[SO PARAM] Root -> X86_FS/params/Root.hh
[SO PARAM] SimObject -> X86_FS/params/SimObject.hh
...
23
-
Running Simulations
maize% ./build/ARM_FS/gem5.opt --help
Usage
=====
gem5.opt [gem5 options] script.py [script options]

gem5 is copyrighted software; use the --copyright option for details.

Options
=======
--version            show program's version number and exit
--help, -h           show this help message and exit
--build-info, -B     Show build information
--copyright, -C      Show full copyright information
--readme, -R         Show the readme
--outdir=DIR, -d DIR Set the output directory to DIR [Default: /tmp/m5out]
--redirect-stdout, -r  Redirect stdout (& stderr, without -e) to file
--redirect-stderr, -e  Redirect stderr to file
--stdout-file=FILE   Filename for -r redirection [Default: simout]
--stderr-file=FILE   Filename for -e redirection [Default: simerr]
--interactive, -i    Invoke the interactive interpreter after running the script
--pdb                Invoke the python debugger before running the script
--path=PATH[:PATH], -p PATH[:PATH]
                     Prepend PATH to the system path when invoking the script
--quiet, -q          Reduce verbosity
--verbose, -v        Increase verbosity
24
-
Running Simulations (cont)
Statistics Options
------------------
--stats-file=FILE    Sets the output file for statistics [Default: stats.txt]

Configuration Options
---------------------
--dump-config=FILE   Dump configuration output file [Default: config.ini]

Debugging Options
-----------------
--debug-break=TIME[,TIME]
                     Cycle to create a breakpoint
--debug-help         Print help on trace flags
--debug-flags=FLAG[,FLAG]
                     Sets the flags for tracing (-FLAG disables a flag)
--remote-gdb-port=REMOTE_GDB_PORT
                     Remote gdb base port (set to 0 to disable listening)

Trace Options
-------------
--trace-start=TIME   Start tracing at TIME (must be in ticks)
--trace-file=FILE    Sets the output file for tracing [Default: cout]
--trace-ignore=EXPR  Ignore EXPR sim objects

Help Options
------------
--list-sim-objects   List all built-in SimObjects, their params and default values
25
-
Sample Run
maize% ./build/ARM_SE/gem5.opt configs/example/se.py
gem5 Simulator System. http://gem5.org
gem5 is copyrighted software; use the --copyright option for details.

gem5 compiled Jun 2 2011 17:39:30
gem5 started Jun 3 2011 14:48:20
gem5 executing on maize
command line: ./build/ARM_SE/gem5.opt configs/example/se.py
Global frequency set at 1000000000000 ticks per second
0: system.remote_gdb.listener: listening for remote gdb #0 on port 7000
**** REAL SIMULATION ****
info: Entering event queue @ 0. Starting simulation...
Hello world!
hack: be nice to actually delete the event here
Exiting @ tick 3350000 because target called exit()
26
-
Modes
gem5 has two fundamental modes Full system (FS)
For booting operating systems Models bare hardware, including devices Interrupts, exceptions, privileged instructions, fault handlers
Syscall emulation (SE) For running individual applications, or set of applications on
MP/SMT Models user-visible ISA plus common system calls System calls emulated, typ. by calling host OS Simplified address translation model, no scheduling
Selected via compile-time option Vast majority of code is unchanged, though
27
-
Objects
Everything you care about is an object (C++/Python) Derived from SimObject base class
Common code for creation, configuration parameters, naming,checkpointing, etc.
Uniform method-based APIs for object types CPUs, caches, memory, etc. Plug-compatibility across implementations
Functional vs. detailed CPU Conventional vs. indirect-index cache
Easy replication: cores, multiple systems, . . .
28
-
Events
Standard event queue timing model Global logical time in ticks No fixed relation to real time
Normally picoseconds in our examples Objects schedule their own events
Flexibility for detail vs. performance trade-offs E.g., a CPU typically schedules event at regular intervals
Every cycle or every n picoseconds; won't schedule itself if stalled/idle
29
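The self-scheduling pattern described above can be sketched with a toy event queue. This is plain Python purely for illustration; gem5's real event queue is C++, and the names here are made up:

```python
import heapq

class EventQueue:
    """Toy global event queue keyed by tick (loosely modeling gem5's)."""
    def __init__(self):
        self.cur_tick = 0
        self._heap = []
        self._seq = 0  # tie-breaker so same-tick events pop in FIFO order

    def schedule(self, tick, callback):
        assert tick >= self.cur_tick, "cannot schedule in the past"
        heapq.heappush(self._heap, (tick, self._seq, callback))
        self._seq += 1

    def run(self):
        while self._heap:
            tick, _, callback = heapq.heappop(self._heap)
            self.cur_tick = tick  # logical time jumps to the next event
            callback()

# A "CPU" that reschedules itself every 500 ticks (e.g. a 2 GHz clock
# at 1 ps per tick), stopping after four cycles.
eq = EventQueue()
ticks_seen = []

def cpu_tick():
    ticks_seen.append(eq.cur_tick)
    if len(ticks_seen) < 4:
        eq.schedule(eq.cur_tick + 500, cpu_tick)

eq.schedule(0, cpu_tick)
eq.run()
print(ticks_seen)  # [0, 500, 1000, 1500]
```

Note that time only advances as events are dequeued; a stalled or idle object simply stops rescheduling itself, which is exactly the flexibility/performance trade-off the slide describes.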
-
Ports src/mem/port.{hh,cc}
Method for connecting MemObjects together
Each MemObject subclass has its own Port subclass(es), specialized to forward packets to the appropriate methods of the MemObject subclass
Each pair of MemObjects is connected via a pair of Ports (peers)
Function pairs pass packets across ports: sendTiming() on one port calls recvTiming() on its peer
Result: class-specific handling with arbitrary connections and only a single virtual function call
30
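A minimal sketch of the peer-port idea, in illustrative Python rather than gem5's actual C++ MemObject/Port API; the class and method names below are made up to mirror the sendTiming()/recvTiming() pairing:

```python
class Port:
    """Toy port: each port has one peer; a send on one side invokes
    the receive method of the peer port's owner."""
    def __init__(self, owner):
        self.owner = owner
        self.peer = None

    def connect(self, other):
        # Ports are always connected in pairs (peers).
        self.peer, other.peer = other, self

    def send_timing(self, packet):
        # Forward to the owner of the peer port.
        return self.peer.owner.recv_timing(self.peer, packet)

class Cache:
    def __init__(self):
        self.cpu_side = Port(self)
        self.received = []
    def recv_timing(self, port, packet):
        self.received.append(packet)  # cache-specific handling goes here
        return True

class CPU:
    def __init__(self):
        self.dcache_port = Port(self)
    def recv_timing(self, port, packet):
        return True  # would handle the response packet

cpu, cache = CPU(), Cache()
cpu.dcache_port.connect(cache.cpu_side)
cpu.dcache_port.send_timing("ReadReq 0x1000")
print(cache.received)  # ['ReadReq 0x1000']
```

Any object with a port can be wired to any other, yet each object's receive method stays specific to its own class, which is the point the slide makes about arbitrary connections with a single virtual call.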
-
Access Modes
Three access modes: Functional, Atomic, Timing
Selected by choosing the function on the initial Port: sendFunctional(), sendAtomic(), sendTiming()
Functional mode:
Just make it happen; used for loading binaries, debugging, etc.
Accesses happen instantaneously, updating data everywhere in the hierarchy
If devices contain queues of packets, they must be scanned and updated as well
31
-
Access Modes (cont'd)
Atomic mode: Requests complete before sendAtomic() returns Models state changes (cache fills, coherence, etc.) Returns approx. latency w/o contention or queuing delay Used for fast simulation, fast forwarding, or warming caches
Timing mode: Models all timing/queuing in the memory system Split transaction
sendTiming() just initiates send of request to target Target later calls sendTiming() to send response packet
Atomic and Timing accesses cannot coexist in a system
32
-
Statistics
Scalar, Average, Vector, Formula, Histogram, Distribution, Vector Distribution
33
-
Statistics Example hh file
class MySimObject : public SimObject
{
  private:
    Stats::Scalar txBytes;
    Stats::Formula txBandwidth;
    Stats::Vector syscall;

  public:
    void regStats();
};
34
-
Statistics Example cc file
txBytes
    .name(name() + ".txBytes")
    .desc("Bytes Transmitted")
    .prereq(txBytes);

txBandwidth
    .name(name() + ".txBandwidth")
    .desc("Transmit Bandwidth (bits/s)")
    .precision(0);

txBandwidth = txBytes * Stats::constant(8) / simSeconds;

syscall
    .init(SystemCalls::Number)
    .name(name() + ".syscall")
    .desc("number of syscalls executed")
    .flags(total | pdf | nozero | nonan);
35
-
Statistics Output
client.tsunami.etherdev.txBandwidth       4302720
client.tsunami.etherdev.txBytes             13446
server.tsunami.etherdev.txBandwidth    4684921600
server.tsunami.etherdev.txBytes          14640380
sim_seconds                              0.025000
server.cpu.kern.syscall                       492
server.cpu.kern.syscall_1                     189     38.41%     38.41%
server.cpu.kern.syscall_2                     249     50.61%     89.02%
server.cpu.kern.syscall_3                      54     10.98%    100.00%
36
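As a quick sanity check (not gem5 code), the txBandwidth Formula from the previous slide (txBytes * 8 / sim_seconds) reproduces the client NIC numbers in this output:

```python
# Numbers taken from the client.tsunami.etherdev stats above.
tx_bytes = 13446        # client.tsunami.etherdev.txBytes
sim_seconds = 0.025     # sim_seconds
tx_bandwidth = tx_bytes * 8 / sim_seconds  # bits per second
print(round(tx_bandwidth))  # 4302720
```

Formula stats like this are evaluated at dump time from their operand stats, which is why only the operands need to be updated during simulation.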
-
Outline
1 Introduction to gem5
2 Basics
3 Debugging
4 Checkpointing and Fastforwarding
5 Break
6 Multiple Architecture Support
7 CPU Modeling
8 Ruby Memory System
9 Wrap-Up
37
-
Debugging
Debugging
Ali Saidi
ARM Research & Development
38
-
Debugging Facilities
Tracing Instruction Tracing Diffing Traces
Using gdb to debug gem5 Debugging C++ and gdb-callable functions Remote Debugging
Python Debugging
Pipeline Viewer
39
-
Tracing/Debugging src/base/trace.*
printf() is a nice debugging tool
Keep good printfs for tracing
Lots of debug output is a very good thing
Example flags: Fetch, Decode, Ethernet, Exec, TLB, DMA, Bus, Cache,
Loader, O3CPUAll, etc. Print out all flags with --debug-help option
40
-
Enabling Tracing
Selecting flags: --debug-flags=Cache,Bus --debug-flags=Exec,-ExecTicks
Selecting destination: --trace-file=my_trace.out --trace-file=my_trace.out.gz
Selecting start: --trace-start=23000000
./build/ARM_FS/gem5.opt --debug-flags=Cache,Bus --trace-start=2400 configs/example/fs.py
41
-
Adding Debugging
Print statements put in source code; we encourage you to add them to your models, or to contribute ones you find particularly useful
Macros remove them from the gem5.fast and gem5.prof binaries, so you must be using gem5.debug or gem5.opt to get any output
Adding an extra tracing statement: DPRINTF(Flag, "normal printf %s\n", arguments);
Adding a new debug flag (in a SConscript): DebugFlag('MyNewFlag')
42
-
Instruction Tracing src/sim/insttracer.hh
Separate from the general debug/trace facility But both are enabled the same way
Per-instruction records populated as instruction executes Start with PC and mnemonic Add argument and result values as they become known
Printed to trace when instruction completes Flags for printing cycle, symbolic addresses, etc.
4000: sys.cpu : @sym+776 : add r3, r3, #8 : IntAlu : D=0x00008358
4500: sys.cpu : @sym+780 : sub r3, r3, r7 : IntAlu : D=0x40000000
5000: sys.cpu : @sym+784 : add r5, r5, r3 : IntAlu : D=0x000173cc
5500: sys.cpu : @sym+788 : add r6, r6, r3 : IntAlu : D=0x00017400
6000: sys.cpu : @sym+792.0 : addi_uop r34, r5, #0 : IntAlu : D=0x000173cc
6500: sys.cpu : @sym+792.1 : ldr_uop r3, [r34, #0] : MemRead : D=0x000f0000 A=0x173cc
7000: sys.cpu : @sym+792.2 : ldr_uop r4, [r34, #4] : MemRead : D=0x000f0000 A=0x173d0
7500: sys.cpu : @sym+796 : and r4, r4, r9 : IntAlu : D=0x000f0000
8000: sys.cpu : @sym+800 : teqs r3, r4 : IntAlu : D=0x00000001
43
-
Using GDB with gem5
Several gem5 functions designed to be called from GDB:
schedBreakCycle() (also with --debug-break)
setDebugFlag()/clearDebugFlag()
dumpDebugStatus()
eventqDump()
SimObject::find()
takeCheckpoint()
44
-
Using GDB with gem5
gdb --args ./build/ARM_FS/gem5.opt configs/example/fs.py
GNU gdb Fedora (6.8-37.el5)
...
(gdb) b main
Breakpoint 1 at 0x4090b0: file build/ARM_FS/sim/main.cc, line 40.
(gdb) run
Breakpoint 1, main (argc=2, argv=0x7fffa59725f8) at build/ARM_FS/sim/main.cc:40
40      main(int argc, char **argv)
(gdb) call schedBreakCycle(1000000)
(gdb) continue
Continuing.
gem5 Simulator System...
0: system.remote_gdb.listener: listening for remote gdb #0 on port 7000
**** REAL SIMULATION ****
info: Entering event queue @ 0. Starting simulation...

Program received signal SIGTRAP, Trace/breakpoint trap.
0x0000003ccb6306f7 in kill () from /lib64/libc.so.6
(gdb) p _curTick
$1 = 1000000
45
-
Using GDB with gem5
(gdb) call setDebugFlag("Exec")
(gdb) call schedBreakCycle(1001000)
(gdb) continue
Continuing.
1000000: system.cpu T0 : @_stext+148.1 : addi_uop r0, r0, #4 : IntAlu : D=0x0000000000004c30
1000500: system.cpu T0 : @_stext+152 : teqs r0, r6 : IntAlu : D=0x0000000000000000

Program received signal SIGTRAP, Trace/breakpoint trap.
0x0000003ccb6306f7 in kill () from /lib64/libc.so.6
(gdb) print SimObject::find("system.cpu")
$2 = (SimObject *) 0x19cba130
(gdb) print (BaseCPU*)SimObject::find("system.cpu")
$3 = (BaseCPU *) 0x19cba130
(gdb) p $3->instCnt
$4 = 431
(gdb) call clearDebugFlag("Exec")
(gdb) call takeCheckpoint(0)
(gdb) call schedBreakCycle(1001500)
(gdb) continue
Continuing.
Writing checkpoint
info: Entering event queue @ 1001001. Starting simulation...

Program received signal SIGTRAP, Trace/breakpoint trap.
0x0000003ccb6306f7 in kill () from /lib64/libc.so.6
(gdb)
46
-
Diffing Traces util/{rundiff,tracediff}
Often useful to compare traces from two simulations Find where known good and modified simulators diverge
Standard diff works only on files (not pipes) ...but you really don't want to run to completion
util/rundiff: Perl script for diffing two pipes on the fly
util/tracediff: handy wrapper for using rundiff to compare gem5 outputs; tracediff "a/gem5.opt|b/gem5.opt" --debug-flags=Exec compares instruction traces from two builds of gem5
See comments for details
47
-
Advanced Trace Diffing
Sometimes if you run into a nasty bug it's hard to compare apples-to-apples traces
Different cycle counts, different code paths from interrupts/timers
Some mechanisms that can help:
-ExecTicks: don't print out ticks
-ExecKernel: don't print out kernel code
-ExecUser: don't print out user code
ExecAsid: print out the ASID of the currently running process
Statetrace: PTRACE program that runs the binary on a real system and compares it cycle-by-cycle to gem5; supports ARM, x86, SPARC; see the wiki for more information
48
-
Remote Debugging
./build/ARM_FS/gem5.opt configs/example/fs.py
gem5 Simulator System
...
command line: ./build/ARM_FS/gem5.opt configs/example/fs.py
Global frequency set at 1000000000000 ticks per second
info: kernel located at: /chips/pd/randd/dist/binaries/vmlinux.arm
Listening for system connection on port 5900
Listening for system connection on port 3456
0: system.remote_gdb.listener: listening for remote gdb #0 on port 7000
info: Entering event queue @ 0. Starting simulation...
Remote gdb connection listening on port 7000
49
-
Remote Debugging
GNU gdb (Sourcery G++ Lite 2010.09-50) 7.2.50.20100908-cvs
Copyright (C) 2010 Free Software Foundation, Inc.
...
(gdb) symbol-file /dist/binaries/vmlinux.arm
Reading symbols from //dist/binaries/vmlinux.arm...done.
(gdb) set remote Z-packet on
(gdb) set tdesc filename arm-with-neon.xml
(gdb) target remote 127.0.0.1:7000
Remote debugging using 127.0.0.1:7000
cache_init_objs (cachep=0xc7c00240, flags=3351249472) at mm/slab.c:2658
(gdb) step
sighand_ctor (data=0xc7ead060) at kernel/fork.c:1467
(gdb) info registers
r0             0xc7ead060   -940912544
r1             0x520        1312
r2             0xc002f1e4   -1073548828
r3             0xc7ead060   -940912544
r4             0x0          0
r5             0xc7ead020   -940912608
r6             0x0          0
r7             0xc7ead03c   -940912580
r8             0xc7c034a0   -943704928
r9             0x100100     1048832
r10            0xc7c0cee0   -943665440
r11            0x200200     2097664
r12            0xc0000000   -1073741824
sp             0xc7c29e28   0xc7c29e28
lr             0xc008ed98   -1073156712
pc             0xc002f1e4   0xc002f1e4
cpsr           0x13         19
50
-
Python Debugging
It is possible to drop into the Python interpreter (-i flag)
This currently happens after the script file is run; if you want to do this before objects are instantiated, remove them from the script
It is possible to drop into the Python debugger (--pdb flag)
Occurs just before your script is invoked; lets you use the debugger to debug your script code
The code that enables this is in src/python/m5/main.py, at the bottom of the main function
You can copy the mechanism directly into your scripts if it is in the wrong place for your needs:
import pdb
pdb.set_trace()
51
-
O3 Pipeline Viewer
Use --debug-flags=O3PipeView and util/o3-pipeview.py
52
-
Outline
1 Introduction to gem5
2 Basics
3 Debugging
4 Checkpointing and Fastforwarding
5 Break
6 Multiple Architecture Support
7 CPU Modeling
8 Ruby Memory System
9 Wrap-Up
53
-
Checkpointing and Fastforwarding
Checkpointing and Fastforwarding
Joel Hestness
University of Texas, Austin
54
-
Checkpointing and Fastforwarding
Idea is simple: snapshot relevant system state; restore it later and/or into different CPUs or configurations
Provides flexibility: test numerous different system configurations from the exact same point in the benchmark; avoid re-simulating up to that point; avoid the non-determinism inherent in different configurations
55
-
Checkpointing and Fastforwarding
Outline: Constraints Checkpointing demo Checkpointing internals Instrumenting a benchmark Fastforwarding internals Fastforwarding demo
56
-
Checkpointing and Fastforwarding
Constraints: Original simulation and test simulations must have
Same ISA Same number of cores Same memory size
Usually run original sim with atomic (functional) CPUs
57
-
Checkpointing DEMO!
Starting simulation
58
-
Checkpointing DEMO!
Starting simulation
59
-
Checkpointing DEMO!
Simulated system is running
60
-
Checkpointing DEMO!
Another terminal to control simulated system
61
-
Checkpointing DEMO!
Attach to simulated system
62
-
Checkpointing DEMO!
Simulated system has booted to shell
63
-
Checkpointing DEMO!
Run a quick application
64
-
Checkpointing DEMO!
Run a quick application
65
-
Checkpointing DEMO!
Drop a checkpoint
66
-
Checkpointing DEMO!
Exit simulation
67
-
Checkpointing DEMO!
Exit simulation
68
-
Checkpointing DEMO!
Restore from checkpoint into different simulated system
69
-
Checkpointing DEMO!
Simulated system is running
70
-
Checkpointing DEMO!
Attach to simulated system
71
-
Checkpointing DEMO!
Run a quick application
72
-
Checkpointing DEMO!
Slower execution: detailed v. functional simulation
73
-
Checkpointing DEMO!
Exit simulation
74
-
Checkpointing Output
cpt.6967183789500/
m5.cpt: state of system components
system.disk?.image.cow: modified state of disk(s)
system.physmem.physmem: state of memory
75
-
Specifying State to Checkpoint
To checkpoint a piece of state, serialize it To restore that state, unserialize it
void
serialize(std::ostream &os)
{
    SERIALIZE_ARRAY(interrupts, NumInterruptLevels);
    SERIALIZE_SCALAR(intstatus);
}

void
unserialize(Checkpoint *cp, const std::string &section)
{
    UNSERIALIZE_ARRAY(interrupts, NumInterruptLevels);
    UNSERIALIZE_SCALAR(intstatus);
}
76
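The serialize/unserialize pairing above can be mimicked with a toy checkpoint, using a Python dict in place of the checkpoint file; the section name and fields here are made up for illustration, not gem5's actual format:

```python
# Toy analogue of SERIALIZE_SCALAR / SERIALIZE_ARRAY: each object writes
# named values into its own checkpoint section, then reads them back by name.
def serialize(obj, cpt, section):
    cpt[section] = {"intstatus": obj["intstatus"],
                    "interrupts": list(obj["interrupts"])}

def unserialize(obj, cpt, section):
    obj["intstatus"] = cpt[section]["intstatus"]
    obj["interrupts"] = list(cpt[section]["interrupts"])

dev = {"intstatus": 0x5, "interrupts": [0, 1, 0]}
cpt = {}
serialize(dev, cpt, "system.intctrl")   # take the checkpoint

restored = {}
unserialize(restored, cpt, "system.intctrl")  # restore later
print(restored == dev)  # True
```

Because state is keyed by object name and field name, the restoring simulation only needs matching objects, not an identical binary, which is what makes restoring into a differently configured system possible.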
-
Checkpointing functionality status
Classic memory model: Does not save state of caches
Ruby memory model: Can save state of caches
77
-
Instrumenting a Benchmark
Copy files from ./util/m5/ into the source tree: m5op.h, m5ops.h, and the appropriate assembly file: m5op_<ISA>.S
Include m5op.h in source code that should take a checkpoint
#include "m5op.h"
...
// Take checkpoint in code
m5_checkpoint(0, 0);

1st param: number of ticks in the future to schedule the checkpoint
2nd param: number of ticks between checkpoints (periodic)
Compile and link against assembly file
78
-
Checkpointing Functionality in Progress
Current limitation: cache warm-up
1. Take periodic checkpoints throughout execution
2. Inspect statistics for interesting sections (think SimPoints)
3. Choose interesting sections
4. Create memory access traces for cache warm-up
5. Restore from checkpoint:
   1. Start simulated system
   2. Warm up caches from trace
   3. Restore the rest of state
   4. Begin execution
79
-
Fastforwarding
Setup: Specify sets of CPUs
cpu_class = AtomicSimpleCPU
switch_cpu_class = DerivO3CPU
test_sys.cpu = [cpu_class(cpu_id=i) for i in xrange(np)]
switch_cpus = [switch_cpu_class(defer_registration=True, cpu_id=(np+i))
               for i in xrange(np)]
switch_cpu_list = [(test_sys.cpu[i], switch_cpus[i]) for i in xrange(np)]
80
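The switch_cpu_list pairing drives the eventual CPU swap. A toy version of that hand-off is sketched below in plain Python; gem5 performs the real swap internally, and the class, field, and function names here are made up for illustration:

```python
class ToyCPU:
    """Stand-in for a CPU model; only architectural state matters here."""
    def __init__(self, name, detailed):
        self.name = name
        self.detailed = detailed   # fast functional vs. detailed timing
        self.active = not detailed # fast CPUs start active
        self.pc = 0                # architectural state to hand off
        self.regs = [0] * 4

def switch_cpus(pairs):
    """For each (old, new) pair, copy architectural state across and
    deactivate the old CPU, mirroring a fast-forward switchover."""
    for old, new in pairs:
        new.pc = old.pc
        new.regs = list(old.regs)
        old.active, new.active = False, True

np = 2
fast = [ToyCPU("cpu%d" % i, detailed=False) for i in range(np)]
slow = [ToyCPU("switch_cpu%d" % i, detailed=True) for i in range(np)]

fast[0].pc = 0x8000            # some progress made during fast-forwarding
switch_cpus(list(zip(fast, slow)))
print(slow[0].pc == 0x8000 and slow[0].detailed)  # True
```

The key idea matches the slide: the detailed CPUs are created up front but deferred, so at switch time only architectural state moves, and simulation continues on the slower, more accurate models.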
-
Fastforwarding DEMO!
Starting simulation
81
-
Fastforwarding DEMO!
Starting simulation
82
-
Fastforwarding DEMO!
Simulated system is running
83
-
Fastforwarding DEMO!
Another terminal to control simulated system
84
-
Fastforwarding DEMO!
Attach to simulated system
85
-
Fastforwarding DEMO!
Simulated system has booted to shell
86
-
Fastforwarding DEMO!
Run a quick application
87
-
Fastforwarding DEMO!
Switch from functional to detailed CPUs
89
-
Fastforwarding DEMO!
Run a quick application
91
-
Fastforwarding DEMO!
Slower execution: detailed v. functional simulation
92
-
Fastforwarding DEMO!
Exit simulation
93
-
Outline
1 Introduction to gem5
2 Basics
3 Debugging
4 Checkpointing and Fastforwarding
5 Break
6 Multiple Architecture Support
7 CPU Modeling
8 Ruby Memory System
9 Wrap-Up
94
-
Break
Break
95
-
Outline
1 Introduction to gem5
2 Basics
3 Debugging
4 Checkpointing and Fastforwarding
5 Break
6 Multiple Architecture Support
7 CPU Modeling
8 Ruby Memory System
9 Wrap-Up
96
-
Multiple Architecture Support
Multiple Architecture Support
Gabe Black
Google, Inc.
97
-
Overview
Tour of the ISAs
Parts of an ISA
Decoding and instructions
98
-
ISA Support
Full-System & Syscall Emulation: Alpha, ARM, SPARC, x86
Syscall Emulation only: MIPS, POWER
99
-
Alpha
Alpha 21264 including the BWX, MVI, FIX, and CIX extensions
21164 PAL code
Syscall Emulation
  Linux or Tru64 binaries
  Simple Atomic, Simple Timing, In-Order, Out-of-Order CPU models
Full system
  Linux or FreeBSD
  Simple Atomic, Simple Timing, In-Order, Out-of-Order CPU models
  Four cores in a normal Tsunami system
  Also gem5 "big Tsunami" support: 64 cores
    Custom PAL code and kernel patches required
100
-
ARM
ARMv7-A, Thumb, Thumb2, MP, VFPv3, NEON
Doesn't (yet) include TrustZone, ThumbEE, Virtualization, LPAE
Syscall Emulation
  EABI Linux binaries - no OABI
  Simple Atomic, Simple Timing, Out-of-Order CPU models
Full system
  Linux or Android
  Simple Atomic, Simple Timing, Out-of-Order CPU models
  Four cores in a normal ARM RealView system
  No kernel patches required
  Also supports frame buffer, and control via VNC
    Can run X11, Android, web browsers, etc.
101
-
MIPS
32-bit, little endian
Syscall Emulation
  Linux binaries
  Simple Atomic, Simple Timing, In-Order, Out-of-Order CPU models
Full system
  Significant progress, but not actively developed
102
-
POWER
POWER ISA v2.06 B Book, 32-bit, little endian
Most instructions available, but some FP missing; no vector support
Syscall Emulation
  Linux binaries
  Simple Atomic, Simple Timing, Out-of-Order CPU models
Full system
  No current plans
103
-
SPARC
UltraSPARC Architecture 2005
Syscall Emulation
  Linux or Solaris binaries
  Simple Atomic, Simple Timing, Out-of-Order CPU models
Full system
  Solaris
  Single core of an UltraSPARC T1 (Niagara) processor
  Simple Atomic CPU model only
  Significant progress on MP, but not actively developed
104
-
x86
Generic x86 CPU w/ 64-bit, 3DNow, & SSE extensions
Effort focused on modern features
  No x87 floating point; compile 32-bit with -msse2
  No Windows support any time soon
Syscall Emulation
  Linux binaries
  Simple Atomic, Simple Timing, Out-of-Order CPU models
Full system
  Linux
  Simple Atomic, Simple Timing CPU models
  MP support
105
-
Parts of an ISA
Parameterization
  Number of registers
  Endianness
  Page size
Specialized objects
  TLBs
  Faults
  Control state
  Interrupt controller
Instructions
  Instructions themselves
  Decoding mechanism
106
-
Instruction decode process
[Diagram: instruction bytes from memory feed a predecoder (along with context) to form an ExtMachInst; the decoder then turns that into either a StaticInst or a macroop, which expands into microops.]
107
-
ISA Description Language src/arch/isa_parser.py, src/arch/*/isa/*
Custom domain-specific language
  Defines decoding & behavior of ISA
  Generates C++ code
    Scads of StaticInst subclasses
    decodeInst() function
      Maps machine instruction to StaticInst instance
    Multiple scads of execute() methods
      Cross-product of CPU models and StaticInst subclasses
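The generated decodeInst() is essentially a big switch over opcode bitfields that returns a StaticInst whose execute() carries the instruction's semantics. A loose Python illustration of that mapping (simplified Alpha-style bitfield layout; the class and function names here are invented, not gem5's generated code):

```python
# Hypothetical sketch: decode a machine word into a StaticInst-like
# object, then run its execute() method against a register file.

class StaticInst:
    def execute(self, regs):
        raise NotImplementedError

class Lda(StaticInst):
    """lda Ra, disp(Rb): Ra = Rb + disp"""
    def __init__(self, ra, rb, disp):
        self.ra, self.rb, self.disp = ra, rb, disp
    def execute(self, regs):
        regs[self.ra] = regs[self.rb] + self.disp

def decode_inst(machine_inst):
    # Extract bitfields (Alpha-style memory-format layout, simplified)
    opcode = (machine_inst >> 26) & 0x3f
    ra = (machine_inst >> 21) & 0x1f
    rb = (machine_inst >> 16) & 0x1f
    disp = machine_inst & 0xffff
    if opcode == 0x08:
        return Lda(ra, rb, disp)
    raise ValueError("unknown opcode %#x" % opcode)

regs = [0] * 32
regs[2] = 100
inst = (0x08 << 26) | (1 << 21) | (2 << 16) | 8   # lda r1, 8(r2)
decode_inst(inst).execute(regs)
assert regs[1] == 108
```

In gem5 the ISA parser emits the whole decode tree and one execute() variant per CPU model automatically; this sketch only shows the shape of the result.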
108
-
Definitions etc.
def bitfield OPCODE ;
def bitfield RA ;
def bitfield RB ;
def bitfield INTFUNC ;   // function code
def bitfield RC < 4: 0>; // dest reg

def operands {{
    Ra: (IntReg, uq, PALMODE ? AlphaISA::reg_redir[RA] : RA, IsInteger, 1),
    Rb: (IntReg, uq, PALMODE ? AlphaISA::reg_redir[RB] : RB, IsInteger, 2),
    Rc: (IntReg, uq, PALMODE ? AlphaISA::reg_redir[RC] : RC, IsInteger, 3),
    Fa: (FloatReg, df, FA, IsFloating, 1),
    Fb: (FloatReg, df, FB, IsFloating, 2),
    Fc: (FloatReg, df, FC, IsFloating, 3),
}}

def format LoadAddress(code) {{
    // Python code here...
}}

def format IntegerOperate(code) {{
    // Python code here...
}}
109
-
Instruction Decode and Semantics
decode OPCODE {
    format LoadAddress {
        0x08: lda({{ Ra = Rb + disp; }});
        0x09: ldah({{ Ra = Rb + (disp << 16); }});
    }
}
-
Microcode
def macroop MOVS_E_M_M {
    and t0, rcx, rcx, flags=(EZF,), dataSize=asz
    br label("end"), flags=(CEZF,)
    # Find the constant we need to either add or subtract from rdi
    ruflag t0, 10
    movi t3, t3, dsz, flags=(CEZF,), dataSize=asz
    subi t4, t0, dsz, dataSize=asz
    mov t3, t3, t4, flags=(nCEZF,), dataSize=asz
topOfLoop:
    ld t1, seg, [1, t0, rsi]
    st t1, es, [1, t0, rdi]
    subi rcx, rcx, 1, flags=(EZF,), dataSize=asz
    add rdi, rdi, t3, dataSize=asz
    add rsi, rsi, t3, dataSize=asz
    br label("topOfLoop"), flags=(nCEZF,)
end:
    fault "NoFault"
};
111
-
Key Features
Very compact representation
  Most instructions take 1 line of C code
  Alpha: 3437 lines of ISA description -> 39K lines of C++
    15K generic decode, 12K for each of 2 CPU models
Characteristics auto-extracted from C code
  source, dest regs; func unit class; etc.
  execute() code customized for CPU models
Thoroughly documented (for us, anyway)
  See wiki pages
112
-
Outline
1 Introduction to gem5
2 Basics
3 Debugging
4 Checkpointing and Fastforwarding
5 Break
6 Multiple Architecture Support
7 CPU Modeling
8 Ruby Memory System
9 Wrap-Up
113
-
CPU Modeling
CPU Modeling
Korey Sewell
University of Michigan, Ann Arbor
114
-
Overview
High Level View Supported CPU Models
AtomicSimpleCPU TimingSimpleCPU InOrderCPU O3CPU
CPU Model Internals Parameters Time Buffers Key Interfaces
115
-
CPU Models - System Level View
CPU models are designed to be hot pluggable with arbitrary ISAs and memory systems
116
-
Supported CPU Models src/cpu/*.hh,cc
Simple CPUs
  Model a single-thread, 1 CPI machine
  Two types: AtomicSimpleCPU and TimingSimpleCPU
  Common uses:
    Fast, functional simulation: 2.9 million and 1.2 million instructions per second, respectively, on the twolf benchmark
    Warming up caches
    Studies that do not require detailed CPU modeling
Detailed CPUs
  Parameterizable pipeline models w/ SMT support
  Two types: InOrderCPU and O3CPU
  Execute-in-Execute, detailed modeling
    Slower than Simple CPUs: 200K instructions per second on the twolf benchmark
  Model the timing for each pipeline stage
    Forces both timing and execution of simulation to be accurate
    Important for coherence, I/O, multiprocessor studies, etc.
117
-
AtomicSimpleCPU src/cpu/simple/atomic/*.hh,cc
On every CPU tick(), perform all necessary operations for an instruction
Memory accesses are atomic
Fastest functional simulation
118
-
TimingSimpleCPU src/cpu/simple/timing/*.hh,cc
Memory accesses use the timing path
CPU waits until the memory access returns
Fast, provides some level of timing
119
-
InOrder CPU Model src/cpu/inorder/*.hh,cc
Detailed in-order CPU
InOrder is a new feature of the gem5 simulator
Default 5-stage pipeline
  Fetch, Decode, Execute, Memory, Writeback
120
-
InOrder CPU Model src/cpu/inorder/*.hh,cc
Detailed in-order CPU
Default 5-stage pipeline
  Fetch, Decode, Execute, Memory, Writeback
Key Resources
  CacheUnit, ExecutionUnit, BranchPredictor, etc.
Key Parameters
  Pipeline stages, hardware threads
Implementation: customizable set of pipeline components
  Pipeline stages interact with the Resource Pool
  Pipeline defined through instruction schedules
    Each instruction type defines what resources it needs in a particular stage
    If an instruction can't complete all its resource requests in one stage, it blocks the pipeline
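The instruction-schedule idea can be sketched loosely in Python (hypothetical resource and stage names, not gem5's actual classes): each instruction type maps to per-stage resource requests, and a stage blocks when a request cannot be satisfied that cycle.

```python
# Hypothetical sketch of InOrderCPU-style instruction schedules:
# each instruction type lists the resources it needs in each stage,
# and a stage blocks when a needed resource is unavailable.

SCHEDULES = {
    "load": {"Fetch": ["CacheUnit"], "Execute": ["AGU"], "Memory": ["CacheUnit"]},
    "add":  {"Fetch": ["CacheUnit"], "Execute": ["ExecutionUnit"]},
}

def stage_can_proceed(inst_type, stage, free_resources):
    """True iff every resource the instruction needs in this stage is free."""
    needed = SCHEDULES[inst_type].get(stage, [])
    return all(r in free_resources for r in needed)

# An 'add' in Execute needs the ExecutionUnit...
assert stage_can_proceed("add", "Execute", {"ExecutionUnit", "AGU"})
# ...and blocks the pipeline when it is busy.
assert not stage_can_proceed("add", "Execute", {"AGU"})
```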
121
-
O3 CPU Model src/cpu/o3/*.hh,cc
Detailed out-of-order CPU
Default 7-stage pipeline
  Fetch, Decode, Rename, IEW, Commit (IEW = Issue, Execute, and Writeback)
  Model a varying number of pipeline stages by changing the delays between stages (e.g. fetchToDecodeDelay)
Key Resources
  Physical Register (PR) File, IQ, LSQ, ROB, Functional Unit (FU) Pool
Key Parameters
  Interstage pipeline delays, hardware threads, IQ/LSQ/ROB/PR entries, FU delays
Other Key Features
  Support for CISC decoding (e.g. x86)
  Renaming with a Physical Register (PR) File
  Functional units with varying latencies
  Branch prediction
  Memory dependence prediction
122
-
CPU Model Internals src/cpu/*
A key reason the CPU models are hot pluggable in gem5 is that the CPUs share common components and interfaces within the simulator
Parameter definition
Shared components
  Branch predictors, TLBs, ISA decoding, interrupt handlers
TimeBuffer-based communication
External interfaces
  System: ThreadContext
  ISA: StaticInst and DynInst
  Memory: Ports, {send/recv}Timing
123
-
CPU Internals - Parameters src/cpu/{simple,inorder,o3}/*.py
Parameters are defined in a *.py file in each CPU's directory
  e.g. the contents of src/cpu/inorder/InOrderCPU.py are shown below:

class InOrderCPU(BaseCPU):
    type = 'InOrderCPU'
    ...
    cachePorts = Param.Unsigned(2, "Cache Ports")
    stageWidth = Param.Unsigned(4, "Stage width")
    ...
    icache_port = Port("Instruction Port")
    dcache_port = Port("Data Port")
    ...
    predType = Param.String("tournament", "Branch predictor type (local, tournament)")

Use in your configuration scripts:

...
cpu = InOrderCPU()
cpu.stageWidth = 2
...
124
-
CPU Internals - Time Buffers src/base/timebuf.hh
Similar to queues
  Are advance()d each CPU cycle
Each pipeline stage places information into the time buffer
  The next stage reads from the time buffer by indexing into the appropriate cycle
Used for both forwards and backwards communication
  Avoids unrealistic interaction between pipeline stages
Time buffer class is templated
  Its template parameter is the communication struct between stages
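The time-buffer mechanics can be sketched in Python (a rough analogue of src/base/timebuf.hh, not its actual interface): a circular buffer of per-cycle entries, advanced once per CPU cycle and indexed relative to "now".

```python
# Hypothetical sketch of a gem5-style time buffer: entries are indexed
# relative to the current cycle (negative = past, positive = future),
# and advance() slides the window forward one cycle.

class TimeBuffer:
    def __init__(self, past, future):
        self.past, self.future = past, future
        self.size = past + future + 1
        self.slots = [None] * self.size
        self.base = 0                     # physical index of relative index 0

    def _phys(self, rel):
        assert -self.past <= rel <= self.future
        return (self.base + rel) % self.size

    def __getitem__(self, rel):
        return self.slots[self._phys(rel)]

    def __setitem__(self, rel, value):
        self.slots[self._phys(rel)] = value

    def advance(self):
        # Slide the window one cycle forward and clear the slot that
        # becomes the newest future entry.
        self.base = (self.base + 1) % self.size
        self.slots[self._phys(self.future)] = None

# Fetch writes into the "next cycle" slot...
buf = TimeBuffer(past=2, future=2)
buf[1] = "fetched insts"
buf.advance()
# ...and Decode sees it one cycle later at relative index 0.
assert buf[0] == "fetched insts"
```

This indexing discipline is what prevents same-cycle interaction between stages: a stage can only see what was written for this cycle or earlier.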
125
-
Time Buffer Communication
Demonstrated on an out-of-order pipeline (red is a time buffer)
[Diagram: Fetch -> Decode -> Rename/Issue -> Execute/Writeback -> Commit, with time buffers between stages and a backwards communication path from later stages to earlier ones]
126
-
CPU Interfaces - ThreadContext src/cpu/thread_context.hh
Interface for accessing the total architectural state of a single thread
  PC, register values, etc.
Used to obtain pointers to key classes
  CPU, process, system, ITB, DTB, etc.
Abstract base class
  Each CPU model must implement its own derived ThreadContext
127
-
CPU Interfaces - StaticInst Class src/cpu/static_inst.{hh,cc}
Represents a decoded instruction
  Has classifications of the inst
  Corresponds to the binary machine inst
  Only has static information
Has all the methods needed to execute an instruction
  Tells which regs are source and dest
  Contains the execute() function
  ISA parser generates execute() for all insts
128
-
CPU Interfaces - DynInst Class src/cpu/base_dyn_inst.{hh,cc}
Dynamic version of StaticInst
  Used to hold extra information for detailed CPU models
BaseDynInst
  Holds PC, results, branch prediction status
  Interface for TLB translations
InOrderDynInst - src/cpu/inorder/dyn_inst.{hh,cc}
  Holds the current status of an instruction's requests to resources
  Manages each instruction's pipeline schedule
O3DynInst - src/cpu/o3/dyn_inst.{hh,cc}
  Holds status of renamed registers
  Interfaces to the IQ, LSQ, and ROB
129
-
Outline
1 Introduction to gem5
2 Basics
3 Debugging
4 Checkpointing and Fastforwarding
5 Break
6 Multiple Architecture Support
7 CPU Modeling
8 Ruby Memory System
9 Wrap-Up
130
-
Ruby Memory System
Ruby Memory System
Derek Hower
University of Wisconsin, Madison
131
-
Outline
Feature Overview
Rich Configuration
Rapid Prototyping
SLICC
Modular & Detailed Components
Lifetime of a Ruby memory request
132
-
Feature Overview
Flexible Memory System
Rich configuration - "Just run it"
  Simulate combinations of caches, coherence, interconnect, etc.
Rapid prototyping - "Just create it"
  Domain-Specific Language (SLICC) for coherence protocols
  Modular components
Detailed statistics
  e.g., request size/type distribution, state transition frequencies, etc.
Detailed component simulation
  Network (fixed/flexible pipeline and simple)
  Caches (pluggable replacement policies)
  Memory (DDR2)
133
-
Rich Configuration - Just run it
Can build many different memory systems
  CMPs, SMPs, SCMPs
  1/2/3-level caches
  Pt2Pt/Torus/Mesh topologies
  MESI/MOESI coherence
Each component is individually configurable
  Build heterogeneous cache architectures (new)
  Adjust cache sizes, bandwidth, link latencies, etc.
Get research started without modifying code!
134
-
Configuration Examples
1 8-core CMP, 2-level, MESI protocol, 32K L1s, 8MB 8-banked L2, crossbar interconnect
   scons build/ALPHA_FS/gem5.opt PROTOCOL=MESI_CMP_directory RUBY=True
   ./build/ALPHA_FS/gem5.opt configs/example/ruby_fs.py -n 8 --l1i_size=32kB --l1d_size=32kB --l2_size=8MB --num-l2caches=8 --topology=Crossbar --timing
2 64-socket SMP, 2-level on-chip caches, MOESI protocol, 32K L1s, 8MB L2 per chip, mesh interconnect
   scons build/ALPHA_FS/gem5.opt PROTOCOL=MOESI_CMP_directory RUBY=True
   ./build/ALPHA_FS/gem5.opt configs/example/ruby_fs.py -n 64 --l1i_size=32kB --l1d_size=32kB --l2_size=512MB --num-l2caches=64 --topology=Mesh --timing
Many other configuration options
Protocols only work with specific architectures (see wiki)
135
-
Rapid Prototyping - Just create it
Modular construction
  Coherence controller (SLICC)
  Cache (C++)
  Replacement policy (C++)
  DRAM (C++)
  Topology (Python)
  Network implementation (C++)
Debugging support
136
-
SLICC: Specification Language for Implementing Cache Coherence
Domain-specific language
  Syntactically similar to C/C++
  Like HDLs, constrains operations to be hardware-like (e.g., no loops)
Two generation targets
  C++ for simulation
    Coherence controller object
  HTML for documentation
Table-driven specification (State x Event -> Actions & next state)
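The table-driven idea can be illustrated outside SLICC too. The following Python fragment is a hypothetical sketch (the action names and transient state "IM" are illustrative, loosely modeled on a simple MI protocol, not taken from MI_example's actual tables): a dictionary maps (state, event) to the actions to run and the next state.

```python
# Hypothetical sketch of SLICC's table-driven structure for a simple
# MI-style cache controller: (state, event) -> (actions, next state).

TRANSITIONS = {
    ("I", "Store"):    (["issueRequest(GETX)"], "IM"),          # IM: transient, awaiting data
    ("IM", "Data"):    (["writeDataToCache", "hitCallback"], "M"),
    ("M", "Fwd_GETX"): (["sendDataToRequestor", "deallocate"], "I"),
}

def do_transition(state, event, log):
    try:
        actions, next_state = TRANSITIONS[(state, event)]
    except KeyError:
        raise RuntimeError(f"no transition defined for {state} x {event}")
    log.extend(actions)   # in SLICC the actions run atomically here
    return next_state

log = []
assert do_transition("I", "Store", log) == "IM"
assert do_transition("IM", "Data", log) == "M"
assert "hitCallback" in log
```

SLICC compiles exactly this kind of table into a C++ controller object, and renders the same table as HTML documentation.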
137
-
SLICC Protocol Structure
Collection of machines, e.g.
  L1 Controller
  L2 Controller
  DRAM Controller
Machines are connected through network ports (different than MemPorts)
Network can be an arbitrary topology
138
-
Machine Structure
Machines are (logically) per-block
Consist of:
  Ports - interface to the world
  States - both stable and transient
  Events - triggered by incoming messages
  Transitions - old state x event -> new state
  Actions - occur atomically during transition, e.g., send/receive messages from the network
139
-
MI Example
[Diagram: CPUs attach to L1 Caches, which connect to Directories over a pt-to-pt interconnect]
Single-level coherence protocol
2 controller types: Cache + Directory
Cache Controller
  2 stable states: Modified (a.k.a. Valid), Invalid
Directory Controller [Not Shown]
  2 stable states: Modified (present in a cache), Valid
3 virtual networks (request, response, forward)
See src/mem/ruby/protocols/MI_example.*
140
-
MI Example - L1 Cache Controller: Machine structure
Machine pseudo-code:

machine(L1Cache, "MI Example L1 Cache")
    : Sequencer * sequencer,  // parameters to the machine object (set at initialization)
      CacheMemory * cacheMemory,
      int cache_response_latency = 12,
      int issue_latency = 2
{
    // 3 virtual channels to/from network + connection to CPU
    // M, I, & Load, Store, etc.
    // e.g., RequestMessage::GETX -> Fwd_GETX
    // e.g., issueRequest
    // e.g., I x Store -> M
}
141
-
MI Example - L1 Cache Controller: Defining a machine interface
Interface to the network

// MessageBuffers - opaque C++ communication queues
MessageBuffer requestFromCache, network=To, virtual_network=0, ordered=true;
MessageBuffer responseFromCache, network=To, virtual_network=1, ordered=true;
MessageBuffer forwardToCache, network=From, virtual_network=2, ordered=true;
MessageBuffer responseToCache, network=From, virtual_network=1, ordered=true;

// out_port - map request type to outgoing message buffer
out_port(requestNetwork_out, RequestMsg, requestFromCache);
out_port(responseNetwork_out, ResponseMsg, responseFromCache);

// in_port - map request type to incoming message buffer
// and produce code to accept incoming messages
in_port(forwardRequestNetwork_in, RequestMsg, forwardToCache) { ... }
in_port(responseNetwork_in, ResponseMsg, responseToCache) { ... }

Interface to a CPU

// The other end of mandatoryQueue attaches to the Sequencer
MessageBuffer mandatoryQueue, ordered=false;
in_port(mandatoryQueue_in, RubyRequest, mandatoryQueue, desc=...) { ... }
// There is no corresponding out_port - handled with hitCallback
142
-
MI Example - L1 Cache Controller
Declaring States
State Declaration
// STATES
state_declaration(State, desc="Cache states") {
    // Stable States
    I, AccessPermission:Invalid, desc="Not Present/Invalid";
    M, AccessPermission:Read_Write, desc="Modified";

    // Transient States
    II, AccessPermission:Busy, desc="Not Present/Invalid, issued PUT";
    MI, AccessPermission:Busy, desc="Modified, issued PUT";
    MII, AccessPermission:Busy, desc="Modified, issued PUTX, received nack";
    IS, AccessPermission:Busy, desc="Issued request for LOAD/IFETCH";
    IM, AccessPermission:Busy, desc="Issued request for STORE/ATOMIC";
}
143
-
MI Example - L1 Cache Controller
Declaring Events
Event Declaration
// EVENTS
enumeration(Event, desc="Cache events") {
    // From processor
    Load, desc="Load request from processor";
    Ifetch, desc="Ifetch request from processor";
    Store, desc="Store request from processor";

    // From network (directory)
    Data, desc="Data from network";
    Fwd_GETX, desc="Forward from network";
    Inv, desc="Invalidate request from dir";
    Writeback_Ack, desc="Ack from the directory for a writeback";
    Writeback_Nack, desc="Nack from the directory for a writeback";

    // Internally generated
    Replacement, desc="Replace a block";
}
144
-
MI Example - L1 Cache Controller
Mapping messages to events

Mapping occurs in the in_port declaration.
- peek(in_port, message_type): sets the variable in_msg to the message at the head of the in_port queue
- trigger(Event, address): queues the given event for the given address, selecting a transition
Event mapping
in_port(forwardRequestNetwork_in, RequestMsg, forwardToCache) {
    if (forwardRequestNetwork_in.isReady()) {
        peek(forwardRequestNetwork_in, RequestMsg) {
            if (in_msg.Type == CoherenceRequestType:GETX) {
                trigger(Event:Fwd_GETX, in_msg.Address);
            }
            ...
        }
    }
}
145
-
MI Example - L1 Cache Controller
Defining Transitions
transition(Starting State(s), Event, [EndingState]) [ { Actions } ]
Transition sequence for new Store request
transition(I, Store, IM) {
    v_allocateTBE;          // allocate TBE (a.k.a. MSHR) on transition to transient state
    i_allocateL1CacheBlock;
    a_issueRequest;
    m_popMandatoryQueue;
}

transition(IM, Data, M) {
    u_writeDataToCache;
    s_store_hit;
    w_deallocateTBE;        // deallocate TBE on transition back to stable state
    n_popResponseQueue;
}
...
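Read as a lookup keyed on (state, event), the transitions above behave like a dispatch table. A minimal Python sketch of that idea (the action names come from the slide; the dispatch machinery itself is illustrative, not gem5 code):

```python
# Illustrative (state, event) -> transition dispatch in the style of SLICC.
# Action names mirror the MI example above; everything else is a sketch.
TRANSITIONS = {
    ("I", "Store"): ("IM", ["v_allocateTBE", "i_allocateL1CacheBlock",
                            "a_issueRequest", "m_popMandatoryQueue"]),
    ("IM", "Data"): ("M", ["u_writeDataToCache", "s_store_hit",
                           "w_deallocateTBE", "n_popResponseQueue"]),
}

def trigger(state, event):
    """Match (state, event) to a transition; return (next_state, actions)."""
    if (state, event) not in TRANSITIONS:
        raise RuntimeError(f"no transition for event {event} in state {state}")
    return TRANSITIONS[(state, event)]

state = "I"
state, _ = trigger(state, "Store")    # I  --Store--> IM (issue request)
state, acts = trigger(state, "Data")  # IM --Data-->  M  (data returns)
print(state)  # M
```

An unmatched (state, event) pair is a protocol bug, which is why the sketch raises rather than silently ignoring it.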
146
-
MI Example - L1 Cache Controller
Defining Actions

action(name, abbrev, [desc]) { implementation }

Two special functions are available inside an action:
- peek(in_port, message_type) { use in_msg }: assigns in_msg to the message at the head of the port
- enqueue(out_port, message_type, [options]) { set out_msg }: enqueues out_msg on out_port

The special variable address is also available inside an action block; it is set to the address associated with the event that caused the calling transition.
Example Action Definition
action(e_sendData, "e", desc="Send data from cache to requestor") {
    peek(forwardRequestNetwork_in, RequestMsg) {
        enqueue(responseNetwork_out, ResponseMsg, latency=cache_response_latency) {
            out_msg.Address := address;
            out_msg.Type := CoherenceResponseType:DATA;
            out_msg.Sender := machineID;
            out_msg.Destination.add(in_msg.Requestor); // uses in_msg set by peek
            out_msg.DataBlk := cacheMemory[address].DataBlk;
            out_msg.MessageSize := MessageSizeType:Response_Data;
        }
    }
}
147
-
MI Example - L1 Cache Controller
Transition Table

[Table: the complete (state, event) transition table for the MI L1 cache controller]
148
-
MI Example
Connecting SLICC Machines with a Topology

Creating the Topology - Not in SLICC
src/mem/ruby/network/topologies/Pt2Pt.py
# Returns a SimObject for a Pt2Pt topology
def makeTopology(nodes, options, IntLink, ExtLink, Router):
    # Create an individual router for each controller (node),
    # and connect them (ext_links)
    routers = [Router(router_id=i) for i in range(len(nodes))]
    ext_links = [ExtLink(link_id=i, ext_node=n, int_node=routers[i])
                 for (i, n) in enumerate(nodes)]
    link_count = len(nodes)

    # Connect routers all-to-all (int_links)
    int_links = []
    for i in xrange(len(nodes)):
        for j in xrange(len(nodes)):
            if (i != j):
                link_count += 1
                int_links.append(IntLink(link_id=link_count,
                                         node_a=routers[i],
                                         node_b=routers[j]))

    # Return Pt2Pt Topology SimObject
    return Pt2Pt(ext_links=ext_links,
                 int_links=int_links,
                 routers=routers)
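One consequence of this all-to-all wiring: n controllers yield n external links but n*(n-1) internal links, which grows quadratically. A standalone check of that count (plain Python, no gem5 SimObjects; pt2pt_link_counts is a name invented here):

```python
def pt2pt_link_counts(n):
    """Link counts for a Pt2Pt topology built as in makeTopology above:
    one external link per node, one internal link per ordered pair of
    distinct routers."""
    ext_links = n
    int_links = sum(1 for i in range(n) for j in range(n) if i != j)
    return ext_links, int_links

print(pt2pt_link_counts(4))  # (4, 12)
```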
149
-
Using C++ Objects in SLICC

SLICC can be arbitrarily extended with C++ objects, e.g., to interface with a new message filter.

Steps:
1. Create the class in C++
2. Declare its interface in SLICC with structure, external=yes
3. Initialize the object in the machine
4. Use it!

Extending SLICC

// MessageFilter.h
class MessageFilter {
  public:
    MessageFilter(int param1);

    // returns 1 if message should be filtered
    int filter(RequestMsg msg);
};

// MessageFilter.cc
int MessageFilter::filter(RequestMsg msg)
{
    ...
    return 0;
}

// MI_example-cache.sm
structure(MessageFilter, external=yes) {
    int filter(RequestMsg);
};

MessageFilter requestFilter, constructor_hack=param;

action(af_allocateUnlessFiltered, af) {
    if (requestFilter.filter(in_msg) != 1) {
        cacheMemory.allocate(address, new Entry);
    }
}
150
-
Detailed Component Simulation: Caches
Set-Associative Caches
- Each CacheMemory object represents one bank of cache
- Configurable bit select for indexing
- Modular replacement policy: tree-based pseudo-LRU or LRU

See src/mem/ruby/system/CacheMemory.hh
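Tree-based pseudo-LRU approximates true LRU with one bit per internal node of a binary tree over the ways, so a 4-way set needs only 3 bits instead of a full recency ordering. A sketch of the policy (illustrative only; gem5's CacheMemory has its own implementation):

```python
class TreePLRU:
    """Pseudo-LRU over num_ways (a power of two), one bit per tree node.
    Illustrative sketch, not gem5's CacheMemory code."""
    def __init__(self, num_ways):
        self.num_ways = num_ways
        self.bits = [0] * (num_ways - 1)  # internal nodes of a binary tree

    def touch(self, way):
        # Walk root-to-leaf toward the accessed way, pointing each
        # node's bit away from it (so the victim search avoids it).
        node, lo, hi = 0, 0, self.num_ways
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if way < mid:
                self.bits[node] = 1            # victim search goes right
                node, hi = 2 * node + 1, mid
            else:
                self.bits[node] = 0            # victim search goes left
                node, lo = 2 * node + 2, mid

    def victim(self):
        # Follow the bits: 0 = go left, 1 = go right.
        node, lo, hi = 0, 0, self.num_ways
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if self.bits[node] == 0:
                node, hi = 2 * node + 1, mid
            else:
                node, lo = 2 * node + 2, mid
        return lo

plru = TreePLRU(4)
for way in (0, 1, 2, 3):
    plru.touch(way)
print(plru.victim())  # 0: way 0 is the (approximately) least recently used
```

After touching ways 0 through 3 in order, the bits lead back to way 0, matching true LRU in this case; the approximation diverges from exact LRU only under some access interleavings.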
151
-
Detailed Component Simulation: Memory
The memory controller models a single-channel DDR2 controller:
- Implements a closed-page policy
- Configurable ranks, tCAS, refresh, etc.

See src/mem/ruby/system/MemoryController.hh
152
-
Detailed Component Simulation: Network

Simple Network
- Idealized routers: fixed latency, no internal resources
- Does model link bandwidth

Garnet Network
- Detailed routers: both fixed and flexible pipeline models
- From Princeton, MIT
See src/mem/ruby/network/*
153
-
Ruby Debugging Support
- Random testing support: stresses the protocol by inserting random timing delays
- Support for coherence transition tracing
- Frequent assertions
- Deadlock detection
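The random-testing idea can be sketched outside gem5: drive random loads and stores at random addresses and check every load against a golden copy of memory; Ruby's tester additionally randomizes message timing to expose protocol races. A toy version (all names here are invented for illustration):

```python
import random

def random_mem_test(steps=1000, num_addrs=16, seed=0):
    """Toy version of a Ruby-style random tester: issue random loads and
    stores against a memory model and check every load against a golden
    reference copy. Real Ruby testing also randomizes message timing."""
    rng = random.Random(seed)
    golden = {}                 # reference memory
    mem_under_test = {}         # stand-in for the simulated memory system
    for _ in range(steps):
        addr = rng.randrange(num_addrs)
        if rng.random() < 0.5:  # store: update both copies
            val = rng.randrange(256)
            golden[addr] = val
            mem_under_test[addr] = val
        else:                   # load: compare against the golden copy
            assert mem_under_test.get(addr) == golden.get(addr), \
                f"coherence bug at addr {addr}"
    return True

print(random_mem_test())  # True
```

In a real run the "memory under test" would be the simulated cache hierarchy, and a mismatch pinpoints the address and transition trace where the protocol lost or corrupted data.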
154
-
Lifetime of a Ruby Memory Request
1. The request enters through RubyPort::recvTiming, is converted to a RubyRequest, and passed to the Sequencer.
2. The request enters the SLICC controllers through Sequencer::makeRequest via the mandatoryQueue.
3. A message on the mandatoryQueue triggers an event in the L1 controller.
4. Until the request is completed:
   1. The (Event, State) pair is matched to a transition.
   2. Actions in the matched transition (optionally) send messages to the network and allocate a TBE.
   3. Responses from the network trigger more events.
5. The last event causes an action that calls Sequencer::hitCallback and deallocates the TBE.
6. The RubyRequest is converted back into a Packet and sent to the RubyPort.
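Condensed into code, these steps look like a small driver loop. This Python sketch uses stand-in names for the gem5 C++ objects (RubyPort, Sequencer, the SLICC controller) and reuses the MI example's I -> IM -> M store path:

```python
# Hypothetical condensation of the request lifetime. All names are
# stand-ins for the real gem5 C++ objects; the I -> IM -> M store
# path mirrors the MI example earlier in this section.
TRANSITIONS = {("I", "Store"): "IM", ("IM", "Data"): "M"}

def ruby_request_lifetime(packet):
    trace = []
    ruby_request = dict(packet)              # 1. Packet -> RubyRequest (RubyPort::recvTiming)
    trace.append("makeRequest")              # 2. Sequencer::makeRequest -> mandatoryQueue
    state, pending_events = "I", ["Store", "Data"]
    for event in pending_events:             # 3-4. each message triggers an event;
        state = TRANSITIONS[(state, event)]  #      (state, event) selects a transition
        trace.append(f"{event}->{state}")
    trace.append("hitCallback")              # 5. final action calls Sequencer::hitCallback
    return ruby_request, trace               # 6. RubyRequest converted back to a Packet

req, trace = ruby_request_lifetime({"addr": 0x40, "type": "Store"})
print(trace)  # ['makeRequest', 'Store->IM', 'Data->M', 'hitCallback']
```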
155
-
Outline
1 Introduction to gem5
2 Basics
3 Debugging
4 Checkpointing and Fastforwarding
5 Break
6 Multiple Architecture Support
7 CPU Modeling
8 Ruby Memory System
9 Wrap-Up
156
-
Wrap-Up
Brad Beckmann
AMD Research
157
-
Summary
Reviewed the basics:
- High-level features
- Debugging
- Checkpointing

Highlighted new aspects:
- ISA changes: x86 & ARM
- InOrder CPU model
- Ruby memory system

Upcoming Computer Architecture News (CAN) article:
- Summarizes goals, features, and capabilities
- Please cite it if you use gem5

Overall, gem5 has a wide range of capabilities... but not all combinations currently work.
158
-
Cross-Product Table
[Table: CPU models (Atomic Simple, Timing Simple, InOrder, O3), each in SE and FS system modes, crossed against the memory systems (Classic, Ruby with Simple network, Ruby with Garnet network); shading marks each combination's position on the speed/accuracy spectrum]

Spectrum of choices (light = speed, dark = accuracy)
159
-
Matrix Examples: Alpha and x86

Alpha

[Table: the cross-product matrix above, marking the CPU model / system mode / memory system combinations available for Alpha]

x86

[Table: the same cross-product matrix, marking the combinations available for x86]