the gem5 simulatordist.gem5.org/dist/tutorials/isca_pres_2011.pdfthe gem5 simulator isca 2011 brad...
TRANSCRIPT
-
The gem5 SimulatorISCA 2011
Brad Beckmann1 Nathan Binkert2 Ali Saidi3 Joel Hestness4
Gabe Black5 Korey Sewell6 Derek Hower7
1 AMD Research 2 HP Labs 3 ARM, Inc. 4 University of Texas, Austin5 Google, Inc. 6 University of Michigan, Ann Arbor
7 University of Wisconsin, Madison
June 5th, 2011
1
-
Welcome!
• We’re glad you’re here!• The gem5 simulator has been multi-year effort• A wide variety of institutions have participated
• This tutorial is for you• Please ask questions! Don’t save them for the break!• We intend the focus to be audience driven
2
-
Tutorial Goals and Timeline
• Tutorial goals• Introduce you to the gem5 simulator• Answer your development questions
• Two halves• 8:30-noon: Overview of the simulator, features, components,
and simple examples• after lunch: Birds of a feather sessions and informal
discussions of simulator internals
3
-
Outline
1 Introduction to gem5
2 Basics
3 Debugging
4 Checkpointing and Fastforwarding
5 Break
6 Multiple Architecture Support
7 CPU Modeling
8 Ruby Memory System
9 Wrap-Up
4
-
Outline
1 Introduction to gem5
2 Basics
3 Debugging
4 Checkpointing and Fastforwarding
5 Break
6 Multiple Architecture Support
7 CPU Modeling
8 Ruby Memory System
9 Wrap-Up
4
-
Outline
1 Introduction to gem5
2 Basics
3 Debugging
4 Checkpointing and Fastforwarding
5 Break
6 Multiple Architecture Support
7 CPU Modeling
8 Ruby Memory System
9 Wrap-Up
4
-
Outline
1 Introduction to gem5
2 Basics
3 Debugging
4 Checkpointing and Fastforwarding
5 Break
6 Multiple Architecture Support
7 CPU Modeling
8 Ruby Memory System
9 Wrap-Up
4
-
Outline
1 Introduction to gem5
2 Basics
3 Debugging
4 Checkpointing and Fastforwarding
5 Break
6 Multiple Architecture Support
7 CPU Modeling
8 Ruby Memory System
9 Wrap-Up
4
-
Outline
1 Introduction to gem5
2 Basics
3 Debugging
4 Checkpointing and Fastforwarding
5 Break
6 Multiple Architecture Support
7 CPU Modeling
8 Ruby Memory System
9 Wrap-Up
4
-
Outline
1 Introduction to gem5
2 Basics
3 Debugging
4 Checkpointing and Fastforwarding
5 Break
6 Multiple Architecture Support
7 CPU Modeling
8 Ruby Memory System
9 Wrap-Up
4
-
Outline
1 Introduction to gem5
2 Basics
3 Debugging
4 Checkpointing and Fastforwarding
5 Break
6 Multiple Architecture Support
7 CPU Modeling
8 Ruby Memory System
9 Wrap-Up
4
-
Outline
1 Introduction to gem5
2 Basics
3 Debugging
4 Checkpointing and Fastforwarding
5 Break
6 Multiple Architecture Support
7 CPU Modeling
8 Ruby Memory System
9 Wrap-Up
4
-
Introduction to gem5
Introduction to gem5
Brad Beckmann
AMD Research
5
-
Introduction to gem5
• What is gem5?• The best parts of M5• The best parts of GEMS
• Overall goals, design principles, and capabilities
6
-
What is gem5?
• The combination of M5 and GEMS into a new simulator
• Google scholar statistics• M5 (IEEE Micro, CAECW): 440 citations• GEMS (CAN): 588 citations
• Best aspects of both glued together• M5: CPU models, ISAs, I/O devices, infrastructure• GEMS (essentially Ruby): cache coherence protocols,
interconnect models
7
-
What else is new?
• Many other things have changed since previous tutorialsbeyond GEMS+M5
• Some of the highlights:• The world’s most popular ISAs: ARM and x86• The In-order CPU model• New documentation
• Overall gem5 has a high degree of capabilities
8
-
Android on ARM FS
9
android.mp4Media File (video/mp4)
-
64 Processor Linux on x86 FS
10
-
What gem5 is Not
• A hardware design language• Higher level for design space exploration, simulation speed
• A restrictive environment• Just C++ and Python with an event queue and a bunch of
APIs you can choose to ignore
• Finished!• Always room for improvement . . .
11
-
What We Would Like gem5 to Be
• Something that spares you the pain we’ve been through• A community resource
• Modular enough to localize changes• Contribute back, and spare others some pain
• A path to reproducible/comparable results• A common platform for evaluating ideas
• Let us know how we can help you contribute• Public wiki is up at http://www.gem5.org• Please submit patches and additional features• Ability to add modules with EXTRAS=• The more active the community is, the more successful gem5
will be!
12
-
Two Views of gem5
View #1• A framework for event-driven simulation
• Events, objects, statistics, configuration
View #2• A collection of predefined object models
• CPUs, caches, busses, devices, etc.
• This tutorial focuses on #2• You may find #1 useful even if #2 is not
• At least three other “simulators” have been created using #1
13
-
Main GoalsOverall Goal: Open source community tool focused onarchitectural modeling
• Flexibility• Multiple CPU models across the speed vs. accuracy spectrum• Two execution modes: System-call Emulation & Full-system• Two memory system models: Classic & Ruby• Once you learn it, you can apply to a wide-range of
investigations• Availability
• For both academic and corporate researchers• No dependence on proprietary code• BSD license
• Collaboration• Combined effort of many with different specialties• Active community leveraging collaborative technologies
14
-
Key Features
• Pervasive object-oriented design• Provides modularity, flexibility• Significantly leverages inheritance e.g. SimObject
• Python integration• Powerful front-end interface• Provides initialization, configuration, & simulation control
• Domain-Specific Languages• ISA DSL: defines ISA semantics• Cache Coherence DSL (a.k.a.SLICC): defines coherence logic
• Standard interfaces: Ports and MessageBuffers
15
-
Capabilities
• Execution modes: System-call Emulation (SE) &Full-System (FS)
• ISAs: Alpha, ARM, MIPS, Power, SPARC, x86• CPU models: AtomicSimple, TimingSimple, InOrder, and O3• Cache coherence protocols: broadcast-based, directories,
etc.• Interconnection networks: Simple & Garnet (Princeton,
MIT)• Devices: NICs, IDE controller, etc.• Multiple systems: communicate over TCP/IP
16
-
Cross-Product Matrix
Processor Memory System
CPU Model System Mode Classic RubySimple Garnet
Atomic Simple SEFS
Timing Simple SEFS
InOrder SEFS
O3 SEFS
17
-
Outline
1 Introduction to gem5
2 Basics
3 Debugging
4 Checkpointing and Fastforwarding
5 Break
6 Multiple Architecture Support
7 CPU Modeling
8 Ruby Memory System
9 Wrap-Up
18
-
Basics
Basics
Nate Binkert
HP Labs
19
-
Basics
• Compiling gem5• Running gem5• Very brief overview of a few key concepts:
• Objects• Events• Modes• Ports• Stats
20
-
Building Executables
• Platforms• Linux, BSD, MacOS, Solaris, etc.• Little endian machines
• Some architectures support big endian• 64-bit machines help a lot
• Tools• GCC/G++ 3.4.6+
• Most frequently tested with 4.2-4.5• Python 2.4+• SCons 0.98.1+
• We generally test versions 0.98.5 and 1.2.0• http://www.scons.org
• SWIG 1.3.31+• http://www.swig.org
21
-
Compile Targets
• build//• configs
• By convention, usually _• ALPHA_SE (Alpha syscall emulation)• ALPHA_FS (Alpha full system)• Other ISAs: ARM, MIPS, POWER, SPARC, X86• Sometimes followed by Ruby protocol:• ALPHA_SE_MOESI_hammer• You can define your own configs
• binary• gem5.debug – debug build, symbols, tracing, assert• gem5.opt – optimized build, symbols, tracing, assert• gem5.fast – optimized build, no debugging, no symbols, no
tracing, no assertions• gem5.prof – gem5.fast + profiling support
22
-
Sample Compile
blue% scons build/X86_FS/gem5.optscons: Reading SConscript files ...Checking for leading underscore in global variables...noChecking for C header file Python.h... yesChecking for C library pthread... yes
Reading /n/blue/z/binkert/work/m5/incoming/src/mem/ruby/SConsoptsReading /n/blue/z/binkert/work/m5/incoming/src/mem/protocol/SConsoptsReading /n/blue/z/binkert/work/m5/incoming/src/arch/arm/SConsopts
Building in /n/blue/z/binkert/work/m5/incoming/build/X86_FSVariables file /n/blue/z/binkert/work/m5/incoming/build/variables/X86_FS not found,
using defaults in /n/blue/z/binkert/work/m5/incoming/build_opts/X86_FSscons: done reading SConscript files.scons: Building targets ...[ CXX] X86_FS/sim/main.cc -> .o[ CXX] X86_FS/sim/async.cc -> .o[ CXX] X86_FS/sim/core.cc -> .o[ TRACING] -> X86_FS/debug/Event.hhDefining FAST_ALLOC_STATS as 0 in build/X86_FS/config/fast_alloc_stats.hh.Defining FORCE_FAST_ALLOC as 0 in build/X86_FS/config/force_fast_alloc.hh.Defining NO_FAST_ALLOC as 0 in build/X86_FS/config/no_fast_alloc.hh.[ CXX] X86_FS/sim/debug.cc -> .o[ TRACING] -> X86_FS/debug/Config.hh[ CXX] X86_FS/sim/eventq.cc -> .o[ CXX] X86_FS/sim/init.cc -> .o[ TRACING] -> X86_FS/debug/TimeSync.hh[SO PARAM] Root -> X86_FS/params/Root.hh[SO PARAM] SimObject -> X86_FS/params/SimObject.hh...
23
-
Running Simulations
maize% ./build/ARM_FS/gem5.opt --helpUsage=====
gem5.opt [gem5 options] script.py [script options]
gem5 is copyrighted software; use the --copyright option for details.
Options=======--version show program’s version number and exit--help, -h show this help message and exit--build-info, -B Show build information--copyright, -C Show full copyright information--readme, -R Show the readme--outdir=DIR, -d DIR Set the output directory to DIR [Default: /tmp/m5out]--redirect-stdout, -r Redirect stdout (& stderr, without -e) to file--redirect-stderr, -e Redirect stderr to file--stdout-file=FILE Filename for -r redirection [Default: simout]--stderr-file=FILE Filename for -e redirection [Default: simerr]--interactive, -i Invoke the interactive interpreter after running the script--pdb Invoke the python debugger before running the script--path=PATH[:PATH], -p PATH[:PATH]
Prepend PATH to the system path when invoking the script--quiet, -q Reduce verbosity--verbose, -v Increase verbosity
24
-
Running Simulations (cont)
Statistics Options--------------------stats-file=FILE Sets the output file for statistics [Default:
stats.txt]
Configuration Options-----------------------dump-config=FILE Dump configuration output file [Default: config.ini]
Debugging Options-------------------debug-break=TIME[,TIME]
Cycle to create a breakpoint--debug-help Print help on trace flags--debug-flags=FLAG[,FLAG]
Sets the flags for tracing (-FLAG disables a flag)--remote-gdb-port=REMOTE_GDB_PORT
Remote gdb base port (set to 0 to disable listening)
Trace Options---------------trace-start=TIME Start tracing at TIME (must be in ticks)--trace-file=FILE Sets the output file for tracing [Default: cout]--trace-ignore=EXPR Ignore EXPR sim objects
Help Options--------------list-sim-objects List all built-in SimObjects, their params and default
values
25
-
Sample Run
maize% ./build/ARM_SE/gem5.opt configs/example/se.pygem5 Simulator System. http://gem5.orggem5 is copyrighted software; use the --copyright option for details.
gem5 compiled Jun 2 2011 17:39:30gem5 started Jun 3 2011 14:48:20gem5 executing on maizecommand line: ./build/ARM_SE/gem5.opt configs/example/se.pyGlobal frequency set at 1000000000000 ticks per second0: system.remote_gdb.listener: listening for remote gdb #0 on port 7000
**** REAL SIMULATION ****info: Entering event queue @ 0. Starting simulation...Hello world!hack: be nice to actually delete the event hereExiting @ tick 3350000 because target called exit()
26
-
Modes
• gem5 has two fundamental modes• Full system (FS)
• For booting operating systems• Models bare hardware, including devices• Interrupts, exceptions, privileged instructions, fault handlers
• Syscall emulation (SE)• For running individual applications, or set of applications on
MP/SMT• Models user-visible ISA plus common system calls• System calls emulated, typ. by calling host OS• Simplified address translation model, no scheduling
• Selected via compile-time option• Vast majority of code is unchanged, though
27
-
Objects
• Everything you care about is an object (C++/Python)• Derived from SimObject base class
• Common code for creation, configuration parameters, naming,checkpointing, etc.
• Uniform method-based APIs for object types• CPUs, caches, memory, etc.• Plug-compatibility across implementations
• Functional vs. detailed CPU• Conventional vs. indirect-index cache
• Easy replication: cores, multiple systems, . . .
28
-
Events
• Standard event queue timing model• Global logical time in “ticks”• No fixed relation to real time
• Normally picoseconds in our examples• Objects schedule their own events
• Flexibility for detail vs. performance trade-offs• E.g., a CPU typically schedules event at regular intervals
• Every cycle or every n picoseconds• Won’t schedule self if stalled/idle
29
-
Ports src/mem/port.{hh,cc}
• Method for connecting MemObjects together• Each MemObject subclass has its own Port subclass(es)
• Specialized to forward packets to appropriate methods ofMemObject subclass
• Each pair of MemObjects is connected via a pair of Ports(“peers”)
• Function pairs pass packets across ports• sendTiming() on one port calls recvTiming() on peer
• Result: class-specific handling with arbitrary connections andonly a single virtual function call
30
-
Access Modes
• Three access modes: Functional, Atomic, Timing• Selected by choosing function on initial Port:
• sendFunctional(), sendAtomic(), sendTiming()• Functional mode:
• Just “make it happen”• Used for loading binaries, debugging, etc.• Accesses happen instantaneously updating data everywhere
in the hierarchy• If devices contain queues of packets they must be scanned
and updated as well
31
-
Access Modes (cont’d)
• Atomic mode:• Requests complete before sendAtomic() returns• Models state changes (cache fills, coherence, etc.)• Returns approx. latency w/o contention or queuing delay• Used for fast simulation, fast forwarding, or warming caches
• Timing mode:• Models all timing/queuing in the memory system• Split transaction
• sendTiming() just initiates send of request to target• Target later calls sendTiming() to send response packet
• Atomic and Timing accesses can not coexist in system
32
-
Statistics
• Scalar• Average• Vector• Formula• Histogram• Distribution• Vector Distribution
33
-
Statistics Example – hh file
class MySimObject : public SimObject{
private:Stats::Scalar txBytes;Stats::Formula txBandwidth;Stats::Vector syscall;
public:void regStats();
};
34
-
Statistics Example – cc file
txBytes.name(name() + ".txBytes").desc("Bytes Transmitted").prereq(txBytes);
txBandwidth.name(name() + ".txBandwidth").desc("Transmit Bandwidth (bits/s)").precision(0);
txBandwidth = txBytes * Stats::constant(8) / simSeconds;
syscall.init(SystemCalls ::Number).name(name() + ".syscall").desc("number of syscalls executed").flags(total | pdf | nozero | nonan);
35
-
Statistics Output
client.tsunami.etherdev.txBandwidth 4302720client.tsunami.etherdev.txBytes 13446server.tsunami.etherdev.txBandwidth 4684921600server.tsunami.etherdev.txBytes 14640380sim_seconds 0.025000server.cpu.kern.syscall 492server.cpu.kern.syscall_1 189 38.41% 38.41%server.cpu.kern.syscall_2 249 50.61% 89.02%server.cpu.kern.syscall_3 54 10.98% 100.00%
36
-
Outline
1 Introduction to gem5
2 Basics
3 Debugging
4 Checkpointing and Fastforwarding
5 Break
6 Multiple Architecture Support
7 CPU Modeling
8 Ruby Memory System
9 Wrap-Up
37
-
Debugging
Debugging
Ali Saidi
ARM Research & Development
38
-
Debugging Facilities
• Tracing• Instruction Tracing• Diffing Traces
• Using gdb to debug gem5• Debugging C++ and gdb-callable functions• Remote Debugging
• Python Debugging
• Pipeline Viewer
39
-
Tracing/Debugging src/base/trace.*
• printf() is a nice debugging tool
• Keep good printfs for tracing
• Lots of debug output is a very good thing
• Example flags:• Fetch, Decode, Ethernet, Exec, TLB, DMA, Bus, Cache,
Loader, O3CPUAll, etc.• Print out all flags with --debug-help option
40
-
Enabling Tracing
• Selecting flags:• --debug-flags=Cache,Bus• --debug-flags=Exec,-ExecTicks
• Selecting destination:• --trace-file=my_trace.out• --trace-file=my_trace.out.gz
• Selecting start:• --trace-start=23000000
./build/ARM_FS/gem5.opt --debug-flags=Cache,Bus--trace-start=2400 configs/example/fs.py
41
-
Adding Debuging
• Print statement put in source code• Encourage you to add ones to your models or contribute ones
you find particularly useful• Macros remove them for gem5.fast or gem5.prof binaries
• So you must be using gem5.debug or gem5.opt to get anyoutput
• Adding an extra tracing statement:• DPRINTF(Flag, “normal printf %s\n”,“arguments”);
• Adding a new debug flags (in a SConscript):• DebugFlag(’MyNewFlag’)
42
-
Instruction Tracing src/sim/insttracer.hh
• Separate from the general debug/trace facility• But both are enabled the same way
• Per-instruction records populated as instruction executes• Start with PC and mnemonic• Add argument and result values as they become known
• Printed to trace when instruction completes• Flags for printing cycle, symbolic addresses, etc.
4000: sys.cpu : @sym+776 : add r3, r3, #8 : IntAlu : D=0x000083584500: sys.cpu : @sym+780 : sub r3, r3, r7 : IntAlu : D=0x400000005000: sys.cpu : @sym+784 : add r5, r5, r3 : IntAlu : D=0x000173cc5500: sys.cpu : @sym+788 : add r6, r6, r3 : IntAlu : D=0x000174006000: sys.cpu : @sym+792.0 : addi_uop r34, r5, #0 : IntAlu : D=0x000173cc6500: sys.cpu : @sym+792.1 : ldr_uop r3, [r34, #0] : MemRead : D=0x000f0000 A=0x173cc7000: sys.cpu : @sym+792.2 : ldr_uop r4, [r34, #4] : MemRead : D=0x000f0000 A=0x173d07500: sys.cpu : @sym+796 : and r4, r4, r9 : IntAlu : D=0x000f00008000: sys.cpu : @sym+800 : teqs r3, r4 : IntAlu : D=0x00000001
43
-
Using GDB with gem5
• Several gem5 functions designed to be called from GDB:• schedBreakCycle() – also with --debug-break• setDebugFlag()/clearDebugFlag()• dumpDebugStatus()• eventqDump()• SimObject::find()• takeCheckpoint()
44
-
Using GDB with gem5
gdb --args ./build/ARM_FS/gem5.opt configs/example/fs.pyGNU gdb Fedora (6.8-37.el5)...
(gdb) b mainBreakpoint 1 at 0x4090b0: file build/ARM_FS/sim/main.cc, line 40.(gdb) run
Breakpoint 1, main (argc=2, argv=0x7fffa59725f8) at build/ARM_FS/sim/main.cc:4040main(int argc, char **argv)
(gdb) call schedBreakCycle(1000000)(gdb) continueContinuing.
gem5 Simulator System...0: system.remote_gdb.listener: listening for remote gdb #0 on port 7000
**** REAL SIMULATION ****info: Entering event queue @ 0. Starting simulation...
Program received signal SIGTRAP, Trace/breakpoint trap.0x0000003ccb6306f7 in kill () from /lib64/libc.so.6
(gdb) p _curTick$1 = 1000000
45
aliHighlight
-
Using GDB with gem5
gdb --args ./build/ARM_FS/gem5.opt configs/example/fs.pyGNU gdb Fedora (6.8-37.el5)...
(gdb) b mainBreakpoint 1 at 0x4090b0: file build/ARM_FS/sim/main.cc, line 40.(gdb) run
Breakpoint 1, main (argc=2, argv=0x7fffa59725f8) at build/ARM_FS/sim/main.cc:4040main(int argc, char **argv)
(gdb) call schedBreakCycle(1000000)(gdb) continueContinuing.
gem5 Simulator System...0: system.remote_gdb.listener: listening for remote gdb #0 on port 7000
**** REAL SIMULATION ****info: Entering event queue @ 0. Starting simulation...
Program received signal SIGTRAP, Trace/breakpoint trap.0x0000003ccb6306f7 in kill () from /lib64/libc.so.6
(gdb) p _curTick$1 = 1000000
45
aliHighlight
-
Using GDB with gem5
gdb --args ./build/ARM_FS/gem5.opt configs/example/fs.pyGNU gdb Fedora (6.8-37.el5)...
(gdb) b mainBreakpoint 1 at 0x4090b0: file build/ARM_FS/sim/main.cc, line 40.(gdb) run
Breakpoint 1, main (argc=2, argv=0x7fffa59725f8) at build/ARM_FS/sim/main.cc:4040main(int argc, char **argv)
(gdb) call schedBreakCycle(1000000)(gdb) continueContinuing.
gem5 Simulator System...0: system.remote_gdb.listener: listening for remote gdb #0 on port 7000
**** REAL SIMULATION ****info: Entering event queue @ 0. Starting simulation...
Program received signal SIGTRAP, Trace/breakpoint trap.0x0000003ccb6306f7 in kill () from /lib64/libc.so.6
(gdb) p _curTick$1 = 1000000
45
-
Using GDB with gem5
gdb --args ./build/ARM_FS/gem5.opt configs/example/fs.pyGNU gdb Fedora (6.8-37.el5)...
(gdb) b mainBreakpoint 1 at 0x4090b0: file build/ARM_FS/sim/main.cc, line 40.(gdb) run
Breakpoint 1, main (argc=2, argv=0x7fffa59725f8) at build/ARM_FS/sim/main.cc:4040main(int argc, char **argv)
(gdb) call schedBreakCycle(1000000)(gdb) continueContinuing.
gem5 Simulator System...0: system.remote_gdb.listener: listening for remote gdb #0 on port 7000
**** REAL SIMULATION ****info: Entering event queue @ 0. Starting simulation...
Program received signal SIGTRAP, Trace/breakpoint trap.0x0000003ccb6306f7 in kill () from /lib64/libc.so.6
(gdb) p _curTick$1 = 1000000
45
aliHighlight
-
Using GDB with gem5
gdb --args ./build/ARM_FS/gem5.opt configs/example/fs.pyGNU gdb Fedora (6.8-37.el5)...
(gdb) b mainBreakpoint 1 at 0x4090b0: file build/ARM_FS/sim/main.cc, line 40.(gdb) run
Breakpoint 1, main (argc=2, argv=0x7fffa59725f8) at build/ARM_FS/sim/main.cc:4040main(int argc, char **argv)
(gdb) call schedBreakCycle(1000000)(gdb) continueContinuing.
gem5 Simulator System...0: system.remote_gdb.listener: listening for remote gdb #0 on port 7000
**** REAL SIMULATION ****info: Entering event queue @ 0. Starting simulation...
Program received signal SIGTRAP, Trace/breakpoint trap.0x0000003ccb6306f7 in kill () from /lib64/libc.so.6
(gdb) p _curTick$1 = 1000000
45
aliHighlight
-
Using GDB with gem5
gdb --args ./build/ARM_FS/gem5.opt configs/example/fs.pyGNU gdb Fedora (6.8-37.el5)...
(gdb) b mainBreakpoint 1 at 0x4090b0: file build/ARM_FS/sim/main.cc, line 40.(gdb) run
Breakpoint 1, main (argc=2, argv=0x7fffa59725f8) at build/ARM_FS/sim/main.cc:4040main(int argc, char **argv)
(gdb) call schedBreakCycle(1000000)(gdb) continueContinuing.
gem5 Simulator System...0: system.remote_gdb.listener: listening for remote gdb #0 on port 7000
**** REAL SIMULATION ****info: Entering event queue @ 0. Starting simulation...
Program received signal SIGTRAP, Trace/breakpoint trap.0x0000003ccb6306f7 in kill () from /lib64/libc.so.6
(gdb) p _curTick$1 = 1000000
45
aliHighlight
-
Using GDB with gem5
(gdb) call setDebugFlag("Exec")(gdb) call schedBreakCycle(1001000)(gdb) continueContinuing.
1000000: system.cpu T0 : @_stext+148. 1 : addi_uop r0, r0, #4 : IntAlu : D=0x0000000000004c301000500: system.cpu T0 : @_stext+152 : teqs r0, r6 : IntAlu : D=0x0000000000000000
Program received signal SIGTRAP, Trace/breakpoint trap.0x0000003ccb6306f7 in kill () from /lib64/libc.so.6
(gdb) print SimObject::find("system.cpu")$2 = (SimObject *) 0x19cba130(gdb) print (BaseCPU*)SimObject::find("system.cpu")$3 = (BaseCPU *) 0x19cba130(gdb) p $3->instCnt$4 = 431
(gdb) call clearDebugFlag("Exec")(gdb) call takeCheckpoint(0)(gdb) call schedBreakCycle(1001500)(gdb) continueContinuing.Writing checkpointinfo: Entering event queue @ 1001001. Starting simulation...
Program received signal SIGTRAP, Trace/breakpoint trap.0x0000003ccb6306f7 in kill () from /lib64/libc.so.6(gdb)
46
aliHighlight
aliHighlight
-
Using GDB with gem5
(gdb) call setDebugFlag("Exec")(gdb) call schedBreakCycle(1001000)(gdb) continueContinuing.
1000000: system.cpu T0 : @_stext+148. 1 : addi_uop r0, r0, #4 : IntAlu : D=0x0000000000004c301000500: system.cpu T0 : @_stext+152 : teqs r0, r6 : IntAlu : D=0x0000000000000000
Program received signal SIGTRAP, Trace/breakpoint trap.0x0000003ccb6306f7 in kill () from /lib64/libc.so.6
(gdb) print SimObject::find("system.cpu")$2 = (SimObject *) 0x19cba130(gdb) print (BaseCPU*)SimObject::find("system.cpu")$3 = (BaseCPU *) 0x19cba130(gdb) p $3->instCnt$4 = 431
(gdb) call clearDebugFlag("Exec")(gdb) call takeCheckpoint(0)(gdb) call schedBreakCycle(1001500)(gdb) continueContinuing.Writing checkpointinfo: Entering event queue @ 1001001. Starting simulation...
Program received signal SIGTRAP, Trace/breakpoint trap.0x0000003ccb6306f7 in kill () from /lib64/libc.so.6(gdb)
46
aliHighlight
-
Using GDB with gem5
(gdb) call setDebugFlag("Exec")(gdb) call schedBreakCycle(1001000)(gdb) continueContinuing.
1000000: system.cpu T0 : @_stext+148. 1 : addi_uop r0, r0, #4 : IntAlu : D=0x0000000000004c301000500: system.cpu T0 : @_stext+152 : teqs r0, r6 : IntAlu : D=0x0000000000000000
Program received signal SIGTRAP, Trace/breakpoint trap.0x0000003ccb6306f7 in kill () from /lib64/libc.so.6
(gdb) print SimObject::find("system.cpu")$2 = (SimObject *) 0x19cba130(gdb) print (BaseCPU*)SimObject::find("system.cpu")$3 = (BaseCPU *) 0x19cba130(gdb) p $3->instCnt$4 = 431
(gdb) call clearDebugFlag("Exec")(gdb) call takeCheckpoint(0)(gdb) call schedBreakCycle(1001500)(gdb) continueContinuing.Writing checkpointinfo: Entering event queue @ 1001001. Starting simulation...
Program received signal SIGTRAP, Trace/breakpoint trap.0x0000003ccb6306f7 in kill () from /lib64/libc.so.6(gdb)
46
aliHighlight
aliHighlight
aliHighlight
-
Using GDB with gem5
(gdb) call setDebugFlag("Exec")(gdb) call schedBreakCycle(1001000)(gdb) continueContinuing.
1000000: system.cpu T0 : @_stext+148. 1 : addi_uop r0, r0, #4 : IntAlu : D=0x0000000000004c301000500: system.cpu T0 : @_stext+152 : teqs r0, r6 : IntAlu : D=0x0000000000000000
Program received signal SIGTRAP, Trace/breakpoint trap.0x0000003ccb6306f7 in kill () from /lib64/libc.so.6
(gdb) print SimObject::find("system.cpu")$2 = (SimObject *) 0x19cba130(gdb) print (BaseCPU*)SimObject::find("system.cpu")$3 = (BaseCPU *) 0x19cba130(gdb) p $3->instCnt$4 = 431
(gdb) call clearDebugFlag("Exec")(gdb) call takeCheckpoint(0)(gdb) call schedBreakCycle(1001500)(gdb) continueContinuing.Writing checkpointinfo: Entering event queue @ 1001001. Starting simulation...
Program received signal SIGTRAP, Trace/breakpoint trap.0x0000003ccb6306f7 in kill () from /lib64/libc.so.6(gdb)
46
aliHighlight
aliHighlight
aliHighlight
-
Diffing Traces util/{rundiff,tracediff}
• Often useful to compare traces from two simulations• Find where known good and modified simulators diverge
• Standard diff works only on files (not pipes)• ...but you really don’t want to run to completion
• util/rundiff• Perl script for diffing two pipes on the fly
• util/tracediff• Handy wrapper for using rundiff to compare gem5 outputs• tracediff "a/gem5.opt|b/gem5.opt"--debug-flags=Exec compares instruction traces from twobuilds of gem5
• See comments for details
47
-
Advanced Trace Diffing
• Sometimes if you run into a nasty bug it’s hard to compareapples-to-apples traces
• Different cycle counts, different code paths frominterrupts/timers
• Some mechanisms that can help:• -ExecTicks don’t print out ticks• -ExecKernel don’t print out kernel code• -ExecUser don’t print out user code• ExecAsid print out ASID of currently running process
• State trace• PTRACE program that runs binary on real system compares
cycle-by-cycle to gem5• Supports ARM, x86, SPARC• See wiki for more information
48
-
Remote Debugging
./build/ARM_FS/gem5.opt configs/example/fs.pygem5 Simulator System
...command line: ./build/ARM_FS/gem5.opt configs/example/fs.pyGlobal frequency set at 1000000000000 ticks per secondinfo: kernel located at: /chips/pd/randd/dist/binaries/vmlinux.armListening for system connection on port 5900Listening for system connection on port 34560: system.remote_gdb.listener: listening for remote gdb #0 on port 7000info: Entering event queue @ 0. Starting simulation...
Remote gdb connection listening on port 7000
49
aliHighlight
-
Remote Debugging
GNU gdb (Sourcery G++ Lite 2010.09-50) 7.2.50.20100908-cvsCopyright (C) 2010 Free Software Foundation, Inc....(gdb) symbol-file /dist/binaries/vmlinux.armReading symbols from //dist/binaries/vmlinux.arm...done.(gdb) set remote Z-packet on(gdb) set tdesc filename arm-with-neon.xml(gdb) target remote 127.0.0.1:7000Remote debugging using 127.0.0.1:7000cache_init_objs (cachep=0xc7c00240, flags=3351249472) at mm/slab.c:2658(gdb) stepsighand_ctor (data=0xc7ead060) at kernel/fork.c:1467(gdb) info registers
r0 0xc7ead060-940912544r1 0x5201312r2 0xc002f1e4-1073548828r3 0xc7ead060-940912544r4 0x00r5 0xc7ead020-940912608r6 0x00r7 0xc7ead03c-940912580r8 0xc7c034a0-943704928r9 0x1001001048832r10 0xc7c0cee0-943665440r11 0x2002002097664r12 0xc0000000-1073741824sp 0xc7c29e280xc7c29e28lr 0xc008ed98-1073156712pc 0xc002f1e40xc002f1e4 cpsr 0x1319
50
aliHighlight
aliHighlight
aliHighlight
aliHighlight
aliHighlight
aliHighlight
-
Python Debugging
• It is possible to drop into the python interpreter (-i flag)• This currently happens after the script file is run• If you want to do this before objects are instantiated, remove
them from script• It is possible to drop into the python debugger (--pdb flag)
• Occurs just before your script is invoked• Lets you use the debugger to debug your script code
• Code that enables this stuff is in src/python/m5/main.py• At the bottom of the main function• Can copy the mechanism directly into your scripts, if in the
wrong place for you needs• import pdb• pdb.set_trace()
51
-
O3 Pipeline ViewerUse --debug-flags=O3PipeView and util/o3-pipeview.py
52
-
Outline
1 Introduction to gem5
2 Basics
3 Debugging
4 Checkpointing and Fastforwarding
5 Break
6 Multiple Architecture Support
7 CPU Modeling
8 Ruby Memory System
9 Wrap-Up
53
-
Checkpointing and Fastforwarding
Checkpointing and Fastforwarding
Joel Hestness
University of Texas, Austin
54
-
Checkpointing and Fastforwarding
• Idea is simple:• Snapshot of relevant system state• Restore it later and/or in different CPUs, configurations
• Provides flexibility:• Test numerous different systems configurations• Exact same point in the benchmark• Avoid re-simulating up to that point• Avoid non-determinism inherent with different configurations
55
-
Checkpointing and Fastforwarding
• Outline:• Constraints• Checkpointing demo• Checkpointing internals• Instrumenting a benchmark• Fastforwarding internals• Fastforwarding demo
56
-
Checkpointing and Fastforwarding
• Constraints:• Original simulation and test simulations must have
• Same ISA• Same number of cores• Same memory size
• Usually run original sim with atomic (functional) CPUs
57
-
Checkpointing DEMO!
Starting simulation
58
-
Checkpointing DEMO!
Starting simulation
59
-
Checkpointing DEMO!
Simulated system is running
60
-
Checkpointing DEMO!
Another terminal to control simulated system
61
-
Checkpointing DEMO!
Attach to simulated system
62
-
Checkpointing DEMO!
Simulated system has booted to shell
63
-
Checkpointing DEMO!
Run a quick application
64
-
Checkpointing DEMO!
Run a quick application
65
-
Checkpointing DEMO!
Drop a checkpoint
66
-
Checkpointing DEMO!
Exit simulation
67
-
Checkpointing DEMO!
Exit simulation
68
-
Checkpointing DEMO!
Restore from checkpoint into different simulated system
69
-
Checkpointing DEMO!
Simulated system is running
70
-
Checkpointing DEMO!
Attach to simulated system
71
-
Checkpointing DEMO!
Run a quick application
72
-
Checkpointing DEMO!
Slower execution: detailed v. functional simulation
73
-
Checkpointing DEMO!
Exit simulation
74
-
Checkpointing Output
• cpt.6967183789500/• m5.cpt: State of system components• system.disk?.image.cow: Modified state of disk(s)• system.physmem.physmem: State of memory
75
-
Specifying State to Checkpoint
• To checkpoint a piece of state, serialize it• To restore that state, unserialize it
voidserialize(std::ostream &os){
SERIALIZE_ARRAY(interrupts, NumInterruptLevels);SERIALIZE_SCALAR(intstatus);
}
voidunserialize(Checkpoint *cp, const std::string §ion){
UNSERIALIZE_ARRAY(interrupts, NumInterruptLevels);UNSERIALIZE_SCALAR(intstatus);
}
76
-
Checkpointing functionality status
• Classic memory model:• Does not save state of caches
• Ruby memory model:• Can save state of caches
77
-
Instrumenting a Benchmark
• Copy files from ./util/m5/ into source tree:• m5op.h• m5ops.h• Appropriate assembly file: m5op_.S
• Include m5op.h in source code that should take a checkpoint
#include "m5op.h"
...
// Take checkpoint in codem5_checkpoint(0,0);
• 1st param: no. ticks in future to schedule the checkpoint• 2nd param: no. ticks between checkpoints (periodic)
• Compile and link against assembly file
78
-
Checkpointing Functionality in Progress
Current limitation: cache warm-up1 Take periodic checkpoints throughout execution2 Inspect statistics for interesting sections (think Simpoints)3 Choose interesting sections4 Create memory access traces for cache warm-up5 Restore from checkpoint:
1 Start simulated system2 Warm up caches from trace3 Restore the rest of state4 Begin execution
79
-
Fastforwarding
Setup:• Specify sets of CPUs
cpu_class = AtomicSimpleCPUswitch_cpu_class = DerivO3CPUtest_sys.cpu = [cpu_class(cpu_id=i) for i in xrange(np)]switch_cpus = [switch_cpu_class(defer_registration=True, cpu_id=(np+i))
for i in xrange(np)]switch_cpu_list = [(testsys.cpu[i], switch_cpus[i]) for i in xrange(np)]
80
-
Fastforwarding DEMO!
Starting simulation
81
-
Fastforwarding DEMO!
Starting simulation
82
-
Fastforwarding DEMO!
Simulated system is running
83
-
Fastforwarding DEMO!
Another terminal to control simulated system
84
-
Fastforwarding DEMO!
Attach to simulated system
85
-
Fastforwarding DEMO!
Simulated system has booted to shell
86
-
Fastforwarding DEMO!
Run a quick application
87
-
Fastforwarding DEMO!
Run a quick application
88
-
Fastforwarding DEMO!
Switch from functional to detailed CPUs
89
-
Fastforwarding DEMO!
Switch from functional to detailed CPUs
90
-
Fastforwarding DEMO!
Run a quick application
91
-
Fastforwarding DEMO!
Slower execution: detailed v. functional simulation
92
-
Fastforwarding DEMO!
Exit simulation
93
-
Outline
1 Introduction to gem5
2 Basics
3 Debugging
4 Checkpointing and Fastforwarding
5 Break
6 Multiple Architecture Support
7 CPU Modeling
8 Ruby Memory System
9 Wrap-Up
94
-
Break
Break
95
-
Outline
1 Introduction to gem5
2 Basics
3 Debugging
4 Checkpointing and Fastforwarding
5 Break
6 Multiple Architecture Support
7 CPU Modeling
8 Ruby Memory System
9 Wrap-Up
96
-
Multiple Architecture Support
Multiple Architecture Support
Gabe Black
Google, Inc.
97
-
Overview
• Tour of the ISAs• Parts of an ISA• Decoding and instructions
98
-
ISA Support
• Full-System & Syscall Emulation• Alpha• ARM• SPARC• x86
• Syscall Emulation• MIPS• POWER
99
-
Alpha
• Alpha 21264 including the BWX, MVI, FIX, and CIXA• 21164 PAL code.• Syscall Emulation
• Linux or Tru64 binaries• Simple Atomic, Simple Timing, In-Order, Out-of-Order CPU
models• Full system
• Linux or FreeBSD• Simple Atomic, Simple Timing, In-Order, Out-of-Order CPU
models• Four-cores in a normal Tsunami system• Also gem5 big Tsunami support 64 cores
• Custom PAL code and kernel patches required
100
-
ARM• ARMv7-A, Thumb, Thumb2, MP, VFPv3, NEON
• Doesn’t (yet) include TrustZone, ThumbEE, Virtualization,LPAE
• Syscall Emulation• EABI Linux binaries - no OABI• Simple Atomic, Simple Timing, Out-of-Order CPU models
• Full system• Linux or Android• Simple Atomic, Simple Timing, Out-of-Order CPU models• Four-cores in a normal ARM RealView system• No kernel patches required• Also supports frame buffer, and control via VNC
• Can run X11, Android, Web browsers, etc
101
-
ARM
101
-
MIPS
• 32 bit little endian• Syscall Emulation
• Linux binaries• Simple Atomic, Simple Timing, In-Order, Out-of-Order CPU
models• Full system
• Significant progress, but not actively developed
102
-
POWER
• POWER ISA v2.06 B Book, 32-bit, little endian• Most instructions available, but some FP missing; no vector
support• Syscall Emulation
• Linux binaries• Simple Atomic, Simple Timing, Out-of-Order CPU models
• Full system• No current plans
103
-
SPARC
• UltraSPARC Architecture 2005• Syscall Emulation
• Linux or Solaris binaries• Simple Atomic, Simple Timing, Out-of-Order CPU models
• Full system• Solaris• Single core of a UltraSPARC T1 (Niagara) processor• Simple Atomic CPU model only• Significant progress on MP, but not actively developed
104
-
x86
• Generic x86 CPU w/ 64 bit, 3DNow, & SSE extensions• Effort focused on modern features• No x87 floating point. Compile 32 bit with -msse2.• No Windows support any time soon.• Syscall Emulation
• Linux binaries• Simple Atomic, Simple Timing, Out-of-Order CPU models
• Full system• Linux• Simple Atomic, Simple Timing CPU models• MP support
105
-
Parts of an ISA
• Parameterization• Number of registers• Endianness• Page size
• Specialized objects• TLBs• Faults• Control state• Interrupt controller
• Instructions• Instructions themselves• Decoding mechanism
106
-
Instruction decode process
Memory
Byte Byte Byte Byte ByteByte
Predecoder Context
ExtMachInst
Decoder
StaticInst Macroop
Microop Microop
Or
107
-
ISA Description Languagesrc/arch/isa_parser.py, src/arch/*/isa/*
• Custom domain-specific language• Defines decoding & behavior of ISA• Generates C++ code
• Scads of StaticInst subclasses• decodeInst () function
• Maps machine instruction to StaticInst instance• Multiple scads of execute() methods
• Cross-product of CPU models and StaticInst subclasses
108
-
Definitions etc.
def bitfield OPCODE ;def bitfield RA ;def bitfield RB ;def bitfield INTFUNC ; // function codedef bitfield RC < 4: 0>; // dest reg
def operands {{’Ra’: (’IntReg’, ’uq’, ’PALMODE ? AlphaISA::reg_redir[RA] : RA’,
’IsInteger’, 1),’Rb’: (’IntReg’, ’uq’, ’PALMODE ? AlphaISA::reg_redir[RB] : RB’,
’IsInteger’, 2),’Rc’: (’IntReg’, ’uq’, ’PALMODE ? AlphaISA::reg_redir[RC] : RC’,
’IsInteger’, 3),’Fa’: (’FloatReg’, ’df’, ’FA’, ’IsFloating’, 1),’Fb’: (’FloatReg’, ’df’, ’FB’, ’IsFloating’, 2),’Fc’: (’FloatReg’, ’df’, ’FC’, ’IsFloating’, 3),
}}
def format LoadAddress(code) {{// Python code here...
}}
def format IntegerOperate(code) {{// Python code here...
}}
109
-
Instruction Decode and Semantics
decode OPCODE {format LoadAddress {
0x08: lda ({{ Ra = Rb + disp; }});0x09: ldah ({{ Ra = Rb + (disp
-
Microcode
def macroop MOVS_E_M_M {and t0, rcx, rcx, flags=(EZF,), dataSize=aszbr label("end"), flags=(CEZF,)# Find the constant we need to either add or subtract from rdiruflag t0, 10movi t3, t3, dsz, flags=(CEZF,), dataSize=aszsubi t4, t0, dsz, dataSize=aszmov t3, t3, t4, flags=(nCEZF,), dataSize=asz
topOfLoop:ld t1, seg, [1, t0, rsi]st t1, es, [1, t0, rdi]
subi rcx, rcx, 1, flags=(EZF,), dataSize=aszadd rdi, rdi, t3, dataSize=aszadd rsi, rsi, t3, dataSize=aszbr label("topOfLoop"), flags=(nCEZF,)
end:fault "NoFault"
};
111
-
Key Features
• Very compact representation• Most instructions take 1 line of C code• Alpha: 3437 lines of isa description→ 39K lines of C++
• ∼15K generic decode, ∼12K for each of 2 CPU models• Characteristics auto-extracted from C
• source, dest regs; func unit class; etc.• execute() code customized for CPU models
• Thoroughly documented (for us, anyway)• See wiki pages
112
-
Outline
1 Introduction to gem5
2 Basics
3 Debugging
4 Checkpointing and Fastforwarding
5 Break
6 Multiple Architecture Support
7 CPU Modeling
8 Ruby Memory System
9 Wrap-Up
113
-
CPU Modeling
CPU Modeling
Korey Sewell
University of Michigan, Ann Arbor
114
-
Overview
• High Level View• Supported CPU Models
• AtomicSimpleCPU• TimingSimpleCPU• InOrderCPU• O3CPU
• CPU Model Internals• Parameters• Time Buffers• Key Interfaces
115
-
CPU Models - System Level View
CPU Models are designed to be “hot pluggable” with arbitraryISAs and Memory Systems
116
-
Supported CPU Models src/cpu/*.hh,cc• Simple CPUs
• Models Single-Thread 1 CPI Machine
• Two Types: AtomicSimpleCPU and TimingSimpleCPU• Common Uses:
• Fast, Functional Simulation: 2.9 million and 1.2 millioninstructions per second on the “twolf” benchmark
• Warming Up Caches• Studies that do not require detailed CPU modeling
• Detailed CPUs• Parameterizable Pipeline Models w/SMT support• Two Types: InOrderCPU and O3CPU• “Execute in Execute”, detailed modeling• Slower than SimpleCPUs: 200K instructions per second on
the “twolf” benchmark• Models the timing for each pipeline stage• Forces both timing and execution of simulation to be accurate• Important for Coherence, I/O, Multiprocessor Studies, etc.
117
-
Supported CPU Models src/cpu/*.hh,cc• Simple CPUs
• Models Single-Thread 1 CPI Machine• Two Types: AtomicSimpleCPU and TimingSimpleCPU
• Common Uses:• Fast, Functional Simulation: 2.9 million and 1.2 million
instructions per second on the “twolf” benchmark• Warming Up Caches• Studies that do not require detailed CPU modeling
• Detailed CPUs• Parameterizable Pipeline Models w/SMT support• Two Types: InOrderCPU and O3CPU• “Execute in Execute”, detailed modeling• Slower than SimpleCPUs: 200K instructions per second on
the “twolf” benchmark• Models the timing for each pipeline stage• Forces both timing and execution of simulation to be accurate• Important for Coherence, I/O, Multiprocessor Studies, etc.
117
-
Supported CPU Models src/cpu/*.hh,cc• Simple CPUs
• Models Single-Thread 1 CPI Machine• Two Types: AtomicSimpleCPU and TimingSimpleCPU• Common Uses:
• Fast, Functional Simulation: 2.9 million and 1.2 millioninstructions per second on the “twolf” benchmark
• Warming Up Caches• Studies that do not require detailed CPU modeling
• Detailed CPUs• Parameterizable Pipeline Models w/SMT support• Two Types: InOrderCPU and O3CPU• “Execute in Execute”, detailed modeling• Slower than SimpleCPUs: 200K instructions per second on
the “twolf” benchmark• Models the timing for each pipeline stage• Forces both timing and execution of simulation to be accurate• Important for Coherence, I/O, Multiprocessor Studies, etc.
117
-
Supported CPU Models src/cpu/*.hh,cc• Simple CPUs
• Models Single-Thread 1 CPI Machine• Two Types: AtomicSimpleCPU and TimingSimpleCPU• Common Uses:
• Fast, Functional Simulation: 2.9 million and 1.2 millioninstructions per second on the “twolf” benchmark
• Warming Up Caches• Studies that do not require detailed CPU modeling
• Detailed CPUs• Parameterizable Pipeline Models w/SMT support
• Two Types: InOrderCPU and O3CPU• “Execute in Execute”, detailed modeling• Slower than SimpleCPUs: 200K instructions per second on
the “twolf” benchmark• Models the timing for each pipeline stage• Forces both timing and execution of simulation to be accurate• Important for Coherence, I/O, Multiprocessor Studies, etc.
117
-
Supported CPU Models src/cpu/*.hh,cc• Simple CPUs
• Models Single-Thread 1 CPI Machine• Two Types: AtomicSimpleCPU and TimingSimpleCPU• Common Uses:
• Fast, Functional Simulation: 2.9 million and 1.2 millioninstructions per second on the “twolf” benchmark
• Warming Up Caches• Studies that do not require detailed CPU modeling
• Detailed CPUs• Parameterizable Pipeline Models w/SMT support• Two Types: InOrderCPU and O3CPU
• “Execute in Execute”, detailed modeling• Slower than SimpleCPUs: 200K instructions per second on
the “twolf” benchmark• Models the timing for each pipeline stage• Forces both timing and execution of simulation to be accurate• Important for Coherence, I/O, Multiprocessor Studies, etc.
117
-
Supported CPU Models src/cpu/*.hh,cc• Simple CPUs
• Models Single-Thread 1 CPI Machine• Two Types: AtomicSimpleCPU and TimingSimpleCPU• Common Uses:
• Fast, Functional Simulation: 2.9 million and 1.2 millioninstructions per second on the “twolf” benchmark
• Warming Up Caches• Studies that do not require detailed CPU modeling
• Detailed CPUs• Parameterizable Pipeline Models w/SMT support• Two Types: InOrderCPU and O3CPU• “Execute in Execute”, detailed modeling• Slower than SimpleCPUs: 200K instructions per second on
the “twolf” benchmark• Models the timing for each pipeline stage• Forces both timing and execution of simulation to be accurate• Important for Coherence, I/O, Multiprocessor Studies, etc.
117
-
AtomicSimpleCPU src/cpu/simple/atomic/*.hh,cc
• On every CPU “tick()”,perform all necessaryoperations for an instruction
• Memory accesses areatomic
• Fastest functional simulation
118
-
TimingSimpleCPU src/cpu/simple/timing/*.hh,cc
• Memory accesses usetiming path
• CPU waits until memoryaccess returns
• Fast, provides some level oftiming
119
-
InOrder CPU Model src/cpu/inorder/*.hh,cc• Detailed in-order CPU• InOrder is a new feature to the gem5 Simulator
• Default 5-stage pipeline• Fetch, Decode, Execute, Memory, Writeback
120
-
InOrder CPU Model src/cpu/inorder/*.hh,cc• Detailed in-order CPU• InOrder is a new feature to the gem5 Simulator
• Default 5-stage pipeline• Fetch, Decode, Execute, Memory, Writeback
120
-
InOrder CPU Model src/cpu/inorder/*.hh,cc
• Detailed in-order CPU• Default 5-stage pipeline
• Fetch, Decode, Execute, Memory, Writeback
• Key Resources• CacheUnit, ExecutionUnit, BranchPredictor, etc.
• Key Parameters• Pipeline Stages, Hardware Threads
• Implementation: Customizable Set of Pipeline Components• Pipeline stages interact with Resource Pool• Pipeline defined through Instruction Schedules
• Each instruction type defines what resources they need in aparticular stage
• If an instruction can’t complete all it’s resource requests in onestage, it blocks the pipeline
121
-
InOrder CPU Model src/cpu/inorder/*.hh,cc
• Detailed in-order CPU• Default 5-stage pipeline
• Fetch, Decode, Execute, Memory, Writeback• Key Resources
• CacheUnit, ExecutionUnit, BranchPredictor, etc.
• Key Parameters• Pipeline Stages, Hardware Threads
• Implementation: Customizable Set of Pipeline Components• Pipeline stages interact with Resource Pool• Pipeline defined through Instruction Schedules
• Each instruction type defines what resources they need in aparticular stage
• If an instruction can’t complete all it’s resource requests in onestage, it blocks the pipeline
121
-
InOrder CPU Model src/cpu/inorder/*.hh,cc
• Detailed in-order CPU• Default 5-stage pipeline
• Fetch, Decode, Execute, Memory, Writeback• Key Resources
• CacheUnit, ExecutionUnit, BranchPredictor, etc.• Key Parameters
• Pipeline Stages, Hardware Threads
• Implementation: Customizable Set of Pipeline Components• Pipeline stages interact with Resource Pool• Pipeline defined through Instruction Schedules
• Each instruction type defines what resources they need in aparticular stage
• If an instruction can’t complete all it’s resource requests in onestage, it blocks the pipeline
121
-
InOrder CPU Model src/cpu/inorder/*.hh,cc
• Detailed in-order CPU• Default 5-stage pipeline
• Fetch, Decode, Execute, Memory, Writeback• Key Resources
• CacheUnit, ExecutionUnit, BranchPredictor, etc.• Key Parameters
• Pipeline Stages, Hardware Threads• Implementation: Customizable Set of Pipeline Components
• Pipeline stages interact with Resource Pool• Pipeline defined through Instruction Schedules
• Each instruction type defines what resources they need in aparticular stage
• If an instruction can’t complete all it’s resource requests in onestage, it blocks the pipeline
121
-
O3 CPU Model src/cpu/o3/*.hh,cc• Detailed out-of-order CPU
• Default 7-stage pipeline• Fetch, Decode, Rename, IEW,Commit• IEW Issue, Execute, and Writeback
• Model varying amount of pipeline stages by changing delaysbetween pipeline stages (e.g. fetchToDecodeDelay)
• Key Resources• Physical Register (PR) File, IQ, LSQ, ROB, Functional Unit
(FU) Pool• Key Parameters
• Interstage pipeline delays, Hardware threads, IQ/LSQ/ROB/PRentries, FU Delays
• Other Key Features• Support for CISC decoding (e.g. x86)• Renaming with a Physical Register (PR) File• Functional units with varying latencies• Branch Prediction• Memory dependence prediction
122
-
O3 CPU Model src/cpu/o3/*.hh,cc• Detailed out-of-order CPU
• Default 7-stage pipeline• Fetch, Decode, Rename, IEW,Commit• IEW Issue, Execute, and Writeback• Model varying amount of pipeline stages by changing delays
between pipeline stages (e.g. fetchToDecodeDelay)
• Key Resources• Physical Register (PR) File, IQ, LSQ, ROB, Functional Unit
(FU) Pool• Key Parameters
• Interstage pipeline delays, Hardware threads, IQ/LSQ/ROB/PRentries, FU Delays
• Other Key Features• Support for CISC decoding (e.g. x86)• Renaming with a Physical Register (PR) File• Functional units with varying latencies• Branch Prediction• Memory dependence prediction
122
-
O3 CPU Model src/cpu/o3/*.hh,cc• Detailed out-of-order CPU
• Default 7-stage pipeline• Fetch, Decode, Rename, IEW,Commit• IEW Issue, Execute, and Writeback• Model varying amount of pipeline stages by changing delays
between pipeline stages (e.g. fetchToDecodeDelay)• Key Resources
• Physical Register (PR) File, IQ, LSQ, ROB, Functional Unit(FU) Pool
• Key Parameters• Interstage pipeline delays, Hardware threads, IQ/LSQ/ROB/PR
entries, FU Delays• Other Key Features
• Support for CISC decoding (e.g. x86)• Renaming with a Physical Register (PR) File• Functional units with varying latencies• Branch Prediction• Memory dependence prediction
122
-
O3 CPU Model src/cpu/o3/*.hh,cc• Detailed out-of-order CPU
• Default 7-stage pipeline• Fetch, Decode, Rename, IEW,Commit• IEW Issue, Execute, and Writeback• Model varying amount of pipeline stages by changing delays
between pipeline stages (e.g. fetchToDecodeDelay)• Key Resources
• Physical Register (PR) File, IQ, LSQ, ROB, Functional Unit(FU) Pool
• Key Parameters• Interstage pipeline delays, Hardware threads, IQ/LSQ/ROB/PR
entries, FU Delays
• Other Key Features• Support for CISC decoding (e.g. x86)• Renaming with a Physical Register (PR) File• Functional units with varying latencies• Branch Prediction• Memory dependence prediction
122
-
O3 CPU Model src/cpu/o3/*.hh,cc• Detailed out-of-order CPU
• Default 7-stage pipeline• Fetch, Decode, Rename, IEW,Commit• IEW Issue, Execute, and Writeback• Model varying amount of pipeline stages by changing delays
between pipeline stages (e.g. fetchToDecodeDelay)• Key Resources
• Physical Register (PR) File, IQ, LSQ, ROB, Functional Unit(FU) Pool
• Key Parameters• Interstage pipeline delays, Hardware threads, IQ/LSQ/ROB/PR
entries, FU Delays• Other Key Features
• Support for CISC decoding (e.g. x86)• Renaming with a Physical Register (PR) File• Functional units with varying latencies• Branch Prediction• Memory dependence prediction
122
-
CPU Model Internals src/cpu/*
• A key reason that the CPU Models are “hot pluggable” intogem5 is that the CPUs share common components andinterfaces within the simulator
• Parameter Definition• Shared Components
• Branch Predictors, TLBs, ISA decoding, Interrupt Handlers• TimeBuffer-Based Communication• External Interfaces
• System: ThreadContext• ISA: StaticInst and DynInst• Memory: Ports, {send/recv}Timing
123
-
CPU Model Internals src/cpu/*
• A key reason that the CPU Models are “hot pluggable” intogem5 is that the CPUs share common components andinterfaces within the simulator• Parameter Definition
• Shared Components• Branch Predictors, TLBs, ISA decoding, Interrupt Handlers
• TimeBuffer-Based Communication• External Interfaces
• System: ThreadContext• ISA: StaticInst and DynInst• Memory: Ports, {send/recv}Timing
123
-
CPU Model Internals src/cpu/*
• A key reason that the CPU Models are “hot pluggable” intogem5 is that the CPUs share common components andinterfaces within the simulator• Parameter Definition• Shared Components
• Branch Predictors, TLBs, ISA decoding, Interrupt Handlers
• TimeBuffer-Based Communication• External Interfaces
• System: ThreadContext• ISA: StaticInst and DynInst• Memory: Ports, {send/recv}Timing
123
-
CPU Model Internals src/cpu/*
• A key reason that the CPU Models are “hot pluggable” intogem5 is that the CPUs share common components andinterfaces within the simulator• Parameter Definition• Shared Components
• Branch Predictors, TLBs, ISA decoding, Interrupt Handlers• TimeBuffer-Based Communication
• External Interfaces• System: ThreadContext• ISA: StaticInst and DynInst• Memory: Ports, {send/recv}Timing
123
-
CPU Model Internals src/cpu/*
• A key reason that the CPU Models are “hot pluggable” intogem5 is that the CPUs share common components andinterfaces within the simulator• Parameter Definition• Shared Components
• Branch Predictors, TLBs, ISA decoding, Interrupt Handlers• TimeBuffer-Based Communication• External Interfaces
• System: ThreadContext• ISA: StaticInst and DynInst• Memory: Ports, {send/recv}Timing
123
-
CPU Internals - Parameterssrc/cpu/{simple/inorder/o3}*.py
• Parameters are defined in a *.py in each CPU’s directory• e.g. The contents of src/cpu/inorder/InOrderCPU.py are shown
below:
class InOrderCPU(BaseCPU):type = ’InOrderCPU’...cachePorts = Param.Unsigned(2, "Cache Ports")stageWidth = Param.Unsigned(4, "Stage width")...icache_port = Port("Instruction Port")dcache_port = Port("Data Port")...predType = Param.String("tournament", "Branch predictor type (’local’, ’tournament’)")
• Use in your configuration scripts
...cpu = InOrderCPU()cpu.stageWidth = 2...
124
-
CPU Internals - Parameterssrc/cpu/{simple/inorder/o3}*.py
• Parameters are defined in a *.py in each CPU’s directory• e.g. The contents of src/cpu/inorder/InOrderCPU.py are shown
below:
class InOrderCPU(BaseCPU):type = ’InOrderCPU’...cachePorts = Param.Unsigned(2, "Cache Ports")stageWidth = Param.Unsigned(4, "Stage width")...icache_port = Port("Instruction Port")dcache_port = Port("Data Port")...predType = Param.String("tournament", "Branch predictor type (’local’, ’tournament’)")
• Use in your configuration scripts
...cpu = InOrderCPU()cpu.stageWidth = 2...
124
-
CPU Internals - Time Buffers src/base/timebuf.hh
• Similar to queues• Are advance()’d each CPU cycle
• Each pipeline stage places information into time buffer• Next stage reads from time buffer by indexing into appropriate
cycle• Used for both forwards and backwards communication
• Avoids unrealistic interaction between pipeline stages• Time buffer class is templated
• Its template parameter is the communication struct betweenstages
125
-
CPU Internals - Time Buffers src/base/timebuf.hh
• Similar to queues• Are advance()’d each CPU cycle
• Each pipeline stage places information into time buffer• Next stage reads from time buffer by indexing into appropriate
cycle
• Used for both forwards and backwards communication• Avoids unrealistic interaction between pipeline stages
• Time buffer class is templated• Its template parameter is the communication struct between
stages
125
-
CPU Internals - Time Buffers src/base/timebuf.hh
• Similar to queues• Are advance()’d each CPU cycle
• Each pipeline stage places information into time buffer• Next stage reads from time buffer by indexing into appropriate
cycle• Used for both forwards and backwards communication
• Avoids unrealistic interaction between pipeline stages
• Time buffer class is templated• Its template parameter is the communication struct between
stages
125
-
CPU Internals - Time Buffers src/base/timebuf.hh
• Similar to queues• Are advance()’d each CPU cycle
• Each pipeline stage places information into time buffer• Next stage reads from time buffer by indexing into appropriate
cycle• Used for both forwards and backwards communication
• Avoids unrealistic interaction between pipeline stages• Time buffer class is templated
• Its template parameter is the communication struct betweenstages
125
-
Time Buffer Communication
• Demonstrated on out-of-order pipeline ...• Red is a time buffer
Fetch Decode RenameIssue
ExecuteWriteback
Commit
Backwards Communication
126
-
CPU Interfaces - ThreadContextsrc/cpu/thread_context.hh
• Interface for accessing total architectural state of a singlethread• PC, register values, etc.
• Used to obtain pointers to key classes• CPU, process, system, ITB, DTB, etc.
• Abstract base class• Each CPU model must implement its own derived
ThreadContext
127
-
CPU Interfaces - ThreadContextsrc/cpu/thread_context.hh
• Interface for accessing total architectural state of a singlethread• PC, register values, etc.
• Used to obtain pointers to key classes• CPU, process, system, ITB, DTB, etc.
• Abstract base class• Each CPU model must implement its own derived
ThreadContext
127
-
CPU Interfaces - ThreadContextsrc/cpu/thread_context.hh
• Interface for accessing total architectural state of a singlethread• PC, register values, etc.
• Used to obtain pointers to key classes• CPU, process, system, ITB, DTB, etc.
• Abstract base class• Each CPU model must implement its own derived
ThreadContext
127
-
CPU Interfaces - StaticInst Classsrc/cpu/static_inst.{hh,cc}
• Represents a decoded instruction• Has classifications of the inst• Corresponds to the binary machine inst• Only has static information
• Has all the methods needed to execute an instruction• Tells which regs are source and dest• Contains the execute() function• ISA parser generates execute() for all insts
128
-
CPU Interfaces - DynInst Classsrc/cpu/base_dyn_inst.{hh,cc}
• Dynamic version of StaticInst• Used to hold extra information detailed CPU models
• BaseDynInst• Holds PC, Results, Branch Prediction Status• Interface for TLB translations
• InOrderDynInst - src/cpu/inorder/dyn_inst.{hh,cc}• Holds current status of an instruction’s request to a resource• Manages each instruction’s pipeline schedule
• O3DynInst - src/cpu/o3/dyn_inst.{hh,cc}• Holds Status of Renamed Registers• Interfaces to the IQ, LSQ and ROB
129
-
CPU Interfaces - DynInst Classsrc/cpu/base_dyn_inst.{hh,cc}
• Dynamic version of StaticInst• Used to hold extra information detailed CPU models
• BaseDynInst• Holds PC, Results, Branch Prediction Status• Interface for TLB translations
• InOrderDynInst - src/cpu/inorder/dyn_inst.{hh,cc}• Holds current status of an instruction’s request to a resource• Manages each instruction’s pipeline schedule
• O3DynInst - src/cpu/o3/dyn_inst.{hh,cc}• Holds Status of Renamed Registers• Interfaces to the IQ, LSQ and ROB
129
-
CPU Interfaces - DynInst Classsrc/cpu/base_dyn_inst.{hh,cc}
• Dynamic version of StaticInst• Used to hold extra information detailed CPU models
• BaseDynInst• Holds PC, Results, Branch Prediction Status• Interface for TLB translations
• InOrderDynInst - src/cpu/inorder/dyn_inst.{hh,cc}• Holds current status of an instruction’s request to a resource• Manages each instruction’s pipeline schedule
• O3DynInst - src/cpu/o3/dyn_inst.{hh,cc}• Holds Status of Renamed Registers• Interfaces to the IQ, LSQ and ROB
129
-
CPU Interfaces - DynInst Classsrc/cpu/base_dyn_inst.{hh,cc}
• Dynamic version of StaticInst• Used to hold extra information detailed CPU models
• BaseDynInst• Holds PC, Results, Branch Prediction Status• Interface for TLB translations
• InOrderDynInst - src/cpu/inorder/dyn_inst.{hh,cc}• Holds current status of an instruction’s request to a resource• Manages each instruction’s pipeline schedule
• O3DynInst - src/cpu/o3/dyn_inst.{hh,cc}• Holds Status of Renamed Registers• Interfaces to the IQ, LSQ and ROB
129
-
Outline
1 Introduction to gem5
2 Basics
3 Debugging
4 Checkpointing and Fastforwarding
5 Break
6 Multiple Architecture Support
7 CPU Modeling
8 Ruby Memory System
9 Wrap-Up
130
-
Ruby Memory System
Ruby Memory System
Derek Hower
University of Wisconsin, Madison
131
-
Outline
• Feature Overview• Rich Configuration• Rapid Prototyping
• SLICC• Modular & Detailed Components
• Lifetime of a Ruby memory request
132
-
Feature Overview
• Flexible Memory System• Rich configuration - Just run it
• Simulate combinations of caches, coherence, interconnect,etc...
• Rapid prototyping - Just create it• Domain-Specific Language (SLICC) for coherence protocols• Modular components
• Detailed statistics• e.g., Request size/type distribution, state transition
frequencies, etc...• Detailed component simulation
• Network (fixed/flexible pipeline and simple)• Caches (Pluggable replacement policies)• Memory (DDR2)
133
-
Feature Overview
• Flexible Memory System• Rich configuration - Just run it
• Simulate combinations of caches, coherence, interconnect,etc...
• Rapid prototyping - Just create it• Domain-Specific Language (SLICC) for coherence protocols• Modular components
• Detailed statistics• e.g., Request size/type distribution, state transition
frequencies, etc...• Detailed component simulation
• Network (fixed/flexible pipeline and simple)• Caches (Pluggable replacement policies)• Memory (DDR2)
133
-
Feature Overview
• Flexible Memory System• Rich configuration - Just run it
• Simulate combinations of caches, coherence, interconnect,etc...
• Rapid prototyping - Just create it• Domain-Specific Language (SLICC) for coherence protocols• Modular components
• Detailed statistics• e.g., Request size/type distribution, state transition
frequencies, etc...
• Detailed component simulation• Network (fixed/flexible pipeline and simple)• Caches (Pluggable replacement policies)• Memory (DDR2)
133
-
Feature Overview
• Flexible Memory System• Rich configuration - Just run it
• Simulate combinations of caches, coherence, interconnect,etc...
• Rapid prototyping - Just create it• Domain-Specific Language (SLICC) for coherence protocols• Modular components
• Detailed statistics• e.g., Request size/type distribution, state transition
frequencies, etc...• Detailed component simulation
• Network (fixed/flexible pipeline and simple)• Caches (Pluggable replacement policies)• Memory (DDR2)
133
-
Rich Configuration - Just run it
• Can build many different memory systems• CMPs, SMPs, SCMPs• 1/2/3 level caches• Pt2Pt/Torus/Mesh Topologies• MESI/MOESI coherence
• Each components is individually configurable• Build heterogeneous cache architectures (new)• Adjust cache sizes, bandwidth, link latencies, etc...
• Get research started without modifying code!
134
-
Configuration Examples
1 8 core CMP, 2-Level, MESI protocol, 32K L1s, 8MB 8-bankedL2s, crossbar interconnect• scons build/ALPHA_FS/gem5.opt PROTOCOL=MESI_CMP_directory RUBY=True• ./build/ALPHA_FS/gem5.opt configs/example/ruby_fs.py -n 8 --l1i_size=32kB
--l1d_size=32kB --l2_size=8MB --num-l2caches=8 --topology=Crossbar --timing
2 64 socket SMP, 2-Level on-chip Caches, MOESI protocol,32K L1s, 8MB L2 per chip, mesh interconnect• scons build/ALPHA_FS/gem5.opt PROTOCOL=MOESI_CMP_directory RUBY=True• ./build/ALPHA_FS/m5.opt configs/example/ruby_fs.py -n 64 --l1i_size=32kB
--l1d_size=32kB --l2_size=512MB --num-l2caches=64 --topology=Mesh --timing
• Many other configuration options• Protocols only work with specific architectures (see wiki)
135
-
Rapid Prototyping - Just create it
• Modular construction• Coherence controller (SLICC)• Cache (C++)
• Replacement Policy (C++)• DRAM (C++)• Topology (Python)• Network implementation (C++)
• Debugging support
136
-
SLICC: Specification Language forImplementing Cache Coherence
• Domain-Specific Language• Syntatically similar to C/C++• Like HDLs, constrains operations to be hardware-like (e.g., no
loops)• Two generation targets
• C++ for simulation• Coherence controller object
• HTML for documentation• Table-driven specification (State x Event -> Actions & next
state)
137
-
SLICC Protocol Structure
• Collection of Machines, e.g.• L1 Controller• L2 Controller• DRAM Controller
• Machines are connectedthrough network ports(different than MemPorts)
• Network can be an arbitrarytopology
138
-
Machine Structure
• Machines are (logically) per-block• Consist of:
• Ports - Interface to the world• States - Both stable and transient• Events - Triggered by incoming messages• Transitions - Old state x Event -> New state• Actions - Occur atomically during transition, e.g.,
Send/receive messages from network139
-
MI Example...Directory Directory
Pt-to-Pt Interconnect
...L1 Cache L1 Cache
CPU CPU
• Single-level coherence protocol• 2 controller types – Cache + Directory
• Cache Controller• 2 stable states: Modified (a.k.a. Valid), Invalid
• Directory Controller [Not Shown]• 2 stable states: Modified (Present in cache), Valid
• 3 virtual networks (request, response, forward)• See src/mem/ruby/protocols/MI_example.*
140
-
MI Example - L1 Cache ControllerMachine structure
Machine Pseduo-Codemachine(L1Cache, "MI Example L1 Cache’’)
: Sequencer * sequencer, // parameters to the machine object (set at initialization)CacheMemory * cacheMemory,int cache_response_latency = 12,int issue_latency = 2
{ // 3 virtual channels to/from network + connection to CPU
// M,I, & Load,Store, etc.
// e.g., RequestMessage::GETX -> Fwd_GETX
// e.g., issueRequest
// e.g., I x Store -> M}
141
-
MI Example - L1 Cache ControllerDefining a machine interface
Interface to the network
// MessageBuffers - opaque C++ communication queuesMessageBuffer requestFromCache, network=”To”, virtual_network=”0”, ordered=”true”;MessageBuffer responseFromCache, network=”To”, virtual_network=”1”, ordered=”true”;MessageBuffer forwardToCache, network=”From”, virtual_network=”2”, ordered=”true”;MessageBuffer responseToCache, network=”From”, virtual_network=”1”, ordered=”true”;
// out_port - map request type to outgoing message bufferout_port(requestNetwork_out, RequestMsg, requestFromCache);out_port(responseNetwork_out, ResponseMsg, responseFromCache);
// in_port - map request type to incomming message buffer// and produce code to accept incomming messagesin_port(forwardRequestNetwork_in, RequestMsg, forwardToCache) { ... }in_port(responseNetwork_in, ResponseMsg, responseToCache) { ... }
Interface to a CPU
// The other end of mandatoryQueue attaches to SequencerMessageBuffer mandatoryQueue, ordered=”false”;in_port(mandatoryQueue_in, RubyRequest, mandatoryQueue, desc=”...”) { ... }// There is no corresponing out_port - handled with hitCallback
142
-
MI Example - L1 Cache ControllerDefining a machine interface
Interface to the network
// MessageBuffers - opaque C++ communication queuesMessageBuffer requestFromCache, network=”To”, virtual_network=”0”, ordered=”true”;MessageBuffer responseFromCache, network=”To”, virtual_network=”1”, ordered=”true”;MessageBuffer forwardToCache, network=”From”, virtual_network=”2”, ordered=”true”;MessageBuffer responseToCache, network=”From”, virtual_network=”1”, ordered=”true”;
// out_port - map request type to outgoing message bufferout_port(requestNetwork_out, RequestMsg, requestFromCache);out_port(responseNetwork_out, ResponseMsg, responseFromCache);
// in_port - map request type to incomming message buffer// and produce code to accept incomming messagesin_port(forwardRequestNetwork_in, RequestMsg, forwardToCache) { ... }in_port(responseNetwork_in, ResponseMsg, responseToCache) { ... }
Interface to a CPU
// The other end of mandatoryQueue attaches to SequencerMessageBuffer mandatoryQueue, ordered=”false”;in_port(mandatoryQueue_in, RubyRequest, mandatoryQueue, desc=”...”) { ... }// There is no corresponing out_port - handled with hitCallback
142
-
MI Example - L1 Cache ControllerDefining a machine interface
Interface to the network
// MessageBuffers - opaque C++ communication queuesMessageBuffer requestFromCache, network=”To”, virtual_network=”0”, ordered=”true”;MessageBuffer responseFromCache, network=”To”, virtual_network=”1”, ordered=”true”;MessageBuffer forwardToCache, network=”From”, virtual_network=”2”, ordered=”true”;MessageBuffer responseToCache, network=”From”, virtual_network=”1”, ordered=”true”;
// out_port - map request type to outgoing message bufferout_port(requestNetwork_out, RequestMsg, requestFromCache);out_port(responseNetwork_out, ResponseMsg, responseFromCache);
// in_port - map request type to incomming message buffer// and produce code to accept incomming messagesin_port(forwardRequestNetwork_in, RequestMsg, forwardToCache) { ... }in_port(responseNetwork_in, ResponseMsg, responseToCache) { ... }
Interface to a CPU
// The other end of mandatoryQueue attaches to SequencerMessageBuffer mandatoryQueue, ordered=”false”;in_port(mandatoryQueue_in, RubyRequest, mandatoryQueue, desc=”...”) { ... }// There is no corresponing out_port - handled with hitCallback
142
-
MI Example - L1 Cache ControllerDefining a machine interface
Interface to the network
// MessageBuffers - opaque C++ communication queuesMessageBuffer requestFromCache, network=”To”, virtual_network=”0”, ordered=”true”;MessageBuffer responseFromCache, network=”To”, virtual_network=”1”, ordered=”true”;MessageBuffer forwardToCache, network=”From”, virtual_network=”2”, ordered=”true”;MessageBuffer responseToCache, network=”From”, virtual_network=”1”, ordered=”true”;
// out_port - map request type to outgoing message bufferout_port(requestNetwork_out, RequestMsg, requestFromCache);out_port(responseNetwork_out, ResponseMsg, responseFromCache);
// in_port - map request type to incomming message buffer// and produce code to accept incomming messagesin_port(forwardRequestNetwork_in, RequestMsg, forwardToCache) { ... }in_port(responseNetwork_in, ResponseMsg, responseToCache) { ... }
Interface to a CPU
// The other end of mandatoryQueue attaches to SequencerMessageBuffer mandatoryQueue, ordered=”false”;in_port(mandatoryQueue_in, RubyRequest, mandatoryQueue, desc=”...”) { ... }// There is no corresponing out_port - handled with hitCallback
142
-
MI Example - L1 Cache ControllerDeclaring States
State Declaration
// STATESstate_declaration(State, desc="Cache states") {
// Stable StatesI, AccessPermission:Invalid, desc="Not Present/Invalid";M, AccessPermission:Read_Write, desc="Modified";
// Transient StatesII, AccessPermission:Busy, desc="Not Present/Invalid, issued PUT";MI, AccessPermission:Busy, desc="Modified, issued PUT";MII, AccessPermission:Busy, desc="Modified, issued PUTX, received nack";IS, AccessPermission:Busy, desc="Issued request for LOAD/IFETCH";IM, AccessPermission:Busy, desc="Issued request for STORE/ATOMIC";
}
143
-
MI Example - L1 Cache ControllerDeclaring Events
Event Declaration
// EVENTSenumeration(Event, desc="Cache events") {
// from processorLoad, desc="Load request from processor";Ifetch, desc="Ifetch request from processor";Store, desc="Store request from processor";
// From network (directory)Data, desc="Data from network";Fwd_GETX, desc="Forward from network";Inv, desc="Invalidate request from dir";Writeback_Ack, desc="Ack from the directory for a writeback";Writeback_Nack, desc="Nack from the directory for a writeback";
// Internally generatedReplacement, desc="Replace a block";
}
144
-
MI Example - L1 Cache ControllerMapping messages to events
• Mapping occurs in in_port declaration.• peek(in_port, message_type)
• Sets variable in_msg to head of in_port queue.• trigger(Event, address)
Event mapping
in_port(forwardRequestNetwork_in, RequestMsg, forwardToCache) {if (forwardRequestNetwork_in.isReady()) {
peek(forwardRequestNetwork_in, RequestMsg) {if (in_msg.Type == CoherenceRequestType:GETX) {
trigger(Event:Fwd_GETX, in_msg.Address);}...
}}
}
145
-
MI Example - L1 Cache ControllerDefining Transitions
• transition(Starting State(s), Event, [EndingState]) [ { Actions } ]
Transition sequence for new Store request
transition(I, Store, IM) {v_allocateTBE; // allocate TBE (a.k.a. MSHR) on transition to transient statei_allocateL1CacheBlock;a_issueRequest;m_popMandatoryQueue;
}
transition(IM, Data, M) {u_writeDataToCache;s_store_hit;w_deallocateTBE; // deallocate TBE on transition back to stable staten_popResponseQueue;
}...
146
-
MI Example - L1 Cache ControllerDefining Actions
• action(name, abbrev, [desc]) { implementation }• Two special functions available in action
• peek(in_port, message_type) { use in_msg }• assigns in_msg to message at head of port
• enqueue(out_port, message_type, [options]) {set out_msg }• enqueues out_msg on out_port
• Special variable address is available inside an action block• Set to the address associated with the event that caused the
calling transition
Example Action Definition
action(e_sendData, "e", desc="Send data from cache to requestor") {peek(forwardRequestNetwork_in, RequestMsg) {
enqueue(responseNetwork_out, ResponseMsg, latency=cache_response_latency) {out_msg.Address := address;out_msg.Type := CoherenceResponseType:DATA;out_msg.Sender := machineID;out_msg.Destination.add(in_msg.Requestor); // uses in_msg set by peekout_msg.DataBlk := cacheMemory[address].DataBlk;out_msg.MessageSize := MessageSizeType:Response_Data;
}}
}
147
-
MI Example - L1 Cache ControllerTransition Table
148
-
MI ExampleConnecting SLICC Machines with a Topology
Creating the Topology – Not In SLICCsrc/mem/ruby/network/topologies/Pt2Pt.py
# returns a SimObject for for a Pt2Pt Topologydef makeTopology(nodes, options, IntLink, ExtLink, Router):
# Create an individual router for each controller (node),# and connect them (ext_links)routers = [Router(router_id=i) for i in range(len(nodes))]ext_links = [ExtLink(link_id=i, ext_node=n, int_node=routers[i])
for (i, n) in enumerate(nodes)]link_count = len(nodes)
# Connect routers all-to-all (int_links)int_links = []for i in xrange(len(nodes)):
for j in xrange(len(nodes)):if (i != j):
link_count += 1int_links.append(IntLink(link_id=link_count,
node_a=routers[i],node_b=routers[j]))
# Return Pt2Pt Topology SimObjectreturn Pt2Pt(ext_links=ext_links,
int_links=int_links,routers=routers)
149
-
Using C++ Objects in SLICC• SLICC can be arbitrarily extended with C++ objects
• e.g., Interface with a new message filter• Steps:
• Create class in C++• Declare interface in SLICC with structure, external=”yes”• Initialize object in machine• Use!
Extending SLICC
// MessageFilter.hclass MessageFilter {public:
MessageFilter(int param1);
// returns 1 if message should be filteredint filter(RequestMsg msg);
};
// MessageFilter.ccint MessageFilter::filter(RequestMsg msg){
...return 0;
}
// MI_example-cache.smstructure(MessageFilter, external=”yes”) {
int filter(RequestMsg);};
MessageFilter requestFilter,constructor_hack=”param”;
action(af_allocateUnlessFiltered, “af”) {if (requestFilter.filter(in_msg) != 1) {
cacheMemory.allocate(address, new Entry);}
}
150
-
Using C++ Objects in SLICC• SLICC can be arbitrarily extended with C++ objects
• e.g., Interface with a new message filter• Steps:
• Create class in C++• Declare interface in SLICC with structure, external=”yes”• Initialize object in machine• Use!
Extending SLICC
// MessageFilter.hclass MessageFilter {public:
MessageFilter(int param1);
// returns 1 if message should be filteredint filter(RequestMsg msg);
};
// MessageFilter.ccint MessageFilter::filter(RequestMsg msg){
...return 0;
}
// MI_example-cache.smstructure(MessageFilter, external=”yes”) {
int filter(RequestMsg);};
MessageFilter requestFilter,constructor_hack=”param”;
action(af_allocateUnlessFiltered, “af”) {if (requestFilter.filter(in_msg) != 1) {
cacheMemory.allocate(address, new Entry);}
}
150
-
Using C++ Objects in SLICC• SLICC can be arbitrarily extended with C++ objects
• e.g., Interface with a new message filter• Steps:
• Create class in C++• Declare interface in SLICC with structure, external=”yes”• Initialize object in machine• Use!
Extending SLICC
// MessageFilter.hclass MessageFilter {public:
MessageFilter(int param1);
// returns 1 if message should be filteredint filter(RequestMsg msg);
};
// MessageFilter.ccint MessageFilter::filter(RequestMsg msg){
...return 0;
}
// MI_example-cache.smstructure(MessageFilter, external=”yes”) {
int filter(RequestMsg);};
MessageFilter requestFilter,constructor_hack=”param”;
action(af_allocateUnlessFiltered, “af”) {if (requestFilter.filter(in_msg) != 1) {
cacheMemory.allocate(address, new Entry);}
}
150
-
Using C++ Objects in SLICC• SLICC can be arbitrarily extended with C++ objects
• e.g., Interface with a new message filter• Steps:
• Create class in C++• Declare interface in SLICC with structure, external=”yes”• Initialize object in machine• Use!
Extending SLICC
// MessageFilter.hclass MessageFilter {public:
MessageFilter(int param1);
// returns 1 if message should be filteredint filter(RequestMsg msg);
};
// MessageFilter.ccint MessageFilter::filter(RequestMsg msg){
...return 0;
}
// MI_example-cache.smstructure(MessageFilter, external=”yes”) {
int filter(RequestMsg);};
MessageFilter requestFilter,constructor_hack=”param”;
action(af_allocateUnlessFiltered, “af”) {if (requestFilter.filter(in_msg) != 1) {
cacheMemory.allocate(address, new Entry);}
}
150
-
Detailed Component Simulation: Caches
• Set-Associative Caches• Each CacheMemory object represents one bank of cache• Configurable bit select for indexing• Modular replacement policy
• Tree-based pseudo-LRU• LRU
• See src/mem/ruby/system/CacheMemory.hh
151
-
Detailed Component Simulation: Memory
• Memory controller models a single channel DDR2 controller• Implements closed-page policy• Can configure ranks, tCAS, refresh, etc..• See src/mem/ruby/system/MemoryController.hh
152
-
Detailed Component Simulation: Network• Simple Network
• Idealized routers - fixed latency, no internal resources• Does model link bandwidth
• Garnet Network• Detailed routers - both fixed and flexible pipeline model• From Princeton, MIT
•• See src/mem/ruby/network/*
153
-
Ruby Debugging Support
• Random testing support• Stresses protocol by inserting random timing delays
• Support for coherence transition tracing• Frequent assertions• Deadlock detection
154
-
Lifetime of a Ruby Memory Request
1 Request enters through RubyPort::recvTiming, isconverted to RubyRequest, and passed to Sequencer.
2 Request enters SLICC controllers throughSequencer::makeRequest via mandatoryQueue.
3 Message on mandatoryQueue triggers an event in L1Controller.
4 Until request is completed:1 (Event, State) is matched to a transition.
2 Actions