a detailed on -chip network model inside a full-system simulator€¦ · full-state directory...
TRANSCRIPT
Garnet2.0: A Detailed On-Chip
Network Model Inside a Full-System Simulator
Tushar KrishnaAssistant Professor
School of ECE and CSGeorgia Institute of Technology
gem5 workshopARM Research Summit
September 11, [email protected]
http://synergy.ece.gatech.edu/tools/garnet
Overviewu What are Networks-On-Chip (NoCs)?
u Why model NoCs accurately
u Garnet2.0u Configuration
u Topology
u Routing
u Flow-Control
u Router Microarchitecture
u System Integrationu Network Interface
u Network Parameters
u Running Garnet with Ruby
u Ruby Garnet Standalone
u Output Stats
u Extensions and FAQs
September 11, 2017Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology 2
Overviewu What are Networks-On-Chip (NoCs)?
u Why model NoCs accurately
u Garnet2.0u Configuration
u Topology
u Routing
u Flow-Control
u Router Microarchitecture
u System Integrationu Network Interface
u Network Parameters
u Running Garnet with Ruby
u Ruby Garnet Standalone
u Output Stats
u Extensions and FAQs
September 11, 2017Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology 3
Networks-on-Chip
September 11, 2017Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology 4
Introduction to NoCs
September 11, 2017Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology 5
Core + L1$
Core + L1$
Core + L1$
Core + L1$
Core + L1$
Core + L1$
L2$ L2$ L2$ L2$ L2$ L2$
On-Chip NetworkReq
Fwd
Rsp
LD (A) R1A
Modern NoCs
September 11, 2017Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology 6
L3$/Directory
L2$
L1D$
Core
L1I$ Network
Interface
Router
“Tile”
Network Interface converts cache messages (ctrl or data) into packets.
Packets get broken down into one or more flits depending on NoC link width
Network Architecture
uTopology
uRouting
uFlow Control
uRouter Microarchitecture
September 11, 2017Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology 7
Topology:How to connect nodes with links
September 11, 2017Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology 8
~Road Network
Routing: Which path should a message take
September 11, 2017Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology 9
~Series of road segments from source to destination
Flow Control:When does a message stop/proceed
September 11, 2017Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology 10
~Traffic Signals / Stop signs at end of each road segment
Router Microarchitecture: How to build the routers
September 11, 2017Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology 11
~Design of traffic intersection (number of lanes, algorithm for turning red/green)
Overviewu What are Networks-On-Chip (NoCs)?
u Why model NoCs accurately
u Garnet2.0u Configuration
u Topology
u Routing
u Flow-Control
u Router Microarchitecture
u System Integrationu Network Interface
u Network Parameters
u Running Garnet with Ruby
u Ruby Garnet Standalone
u Output Stats
u Extensions and FAQs
September 11, 2017Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology 12
Why model NoCs accurately?
September 11, 2017Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology 13
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
BaselineNoC
(IntelSCC)
FANOUT+FANINNoC
(Krishnaetal,MICRO2011)
SMARTNoC
(Krishnaetal,HPCA2013)
Normalize
dRu
ntim
e
64-coreCMPwithdifferentNoCMicroarchitectures
Full-stateDirectory HyperTransport TokenCoherence
0
0.2
0.4
0.6
0.8
1
1.2
BaselineNoC(IntelSCC)
SMARTNoC(Krishnaetal,HPCA2013)
Norm
alize
dRu
ntime
64-coreCMPwithdifferentNoCMicroarchitectures
SharedL2 PrivateL2
Case Study I: Directory vs. Broadcast Protocols
Case Study II: Private vs. Shared L2
64-core CMP running PARSEC workloads in full-system gem5. Average runtime plotted.
Different NoC Microarchitectures may lead to different microarchitectural decisions
and new design optimization opportunities
Overviewu What are Networks-On-Chip (NoCs)?
u Why model NoCs accurately
u Garnet2.0u Configuration
u Topology
u Routing
u Flow-Control
u Router Microarchitecture
u System Integrationu Network Interface
u Network Parameters
u Running Garnet with Ruby
u Ruby Garnet Standalone
u Output Stats
u Extensions and FAQs
September 11, 2017Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology 14
Garnet2.0u Detailed NoC Model
uCurrently part of Ruby Memory System in gem5uOriginal version (5-stage pipeline) released in 2009
uDeveloped by Niket Agarwal (currently at Google) and myself
uNew version (1-stage pipeline, more configurability) released in 2016
u Resourcesu Source: src/mem/ruby/network/garnet2.0u gem5 wiki page: www.gem5.org/garnet2.0uDev patches + practice labs:
http://synergy.ece.gatech.edu/garnetSeptember 11, 2017Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology 15
Topology
September 11, 2017Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology 16
Core + L1$
Core + L1$
Core + L1$
Core + L1$
Core + L1$
Core + L1$
L2$
L2$ L2$ L2$
L2$ L2$
Dir Dir
Dir Dir
DMA
Step 1: Instantiate routers and connect to controllersvia “Ext Links”
Step 2: Instantiate “Int Links” and connect to routers in desired topology
Each topology is a python file in configs/topologies/
This is an example of MeshDirCorners_XY.py
ExtLink(bi-directional)
IntLink(uni-directional)
R
R
R
R
R
R
Topology Configurable Parametersu Router
u router_latency (in cycles)u Can be set per router
u Defined in src/mem/ruby/network/BasicRouter.py
u Linku link_latency (in cycles)
u Can be set per link
u Defined in src/mem/ruby/network/BasicLink.py
u weight (i.e., link weight)u To bias routing algorithm [later slides]
u src_outport (string) and dst_inport (string)u Port direction (e.g., “East”)
u Helps with readability of Config file + Adaptive routing algorithms
u bw_multiplier (value)u Used by Ruby’s simple network model, NOT by Garnet
u Link bandwidth is set inside Garnet via the ni_flit_size parameter [later slides]
September 11, 2017Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology 17
Default Topologies Supported
uPt2Pt
uCrossbar
uMeshuMesh_XY
uMesh_westfirst
uMeshDirCorners_XY
uCluster
September 11, 2017Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology 18
Routingu Routing table
uAutomatically populated based on topologyuAll messages use shortest path
u In case of multiple options, the path with the smaller weight is chosen
uDeterministic Routing
u CustomuUsers can leverage outport/inport direction names
associated with each port to implement custom algorithms (say adaptive)
September 11, 2017Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology 19
Deadlock Avoidanceu Deadlock: A condition in which a set of agents wait
indefinitely trying to acquire a set of resources
u Packet A holds buffer u (in 1) and wants buffer v (in 2)u Packet B holds buffer v (in 2) and wants buffer w (in 3)u Packet C holds buffer w (in 3) and wants buffer x (in 0)u Packet D holds buffer x (in 0) and wants buffer u (in 1)
September 11, 2017Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology 20
0 1
23
A
BC
Du
vw
x
Deadlock-free Routing Algorithmsu Deadlocks may occur if the turns taken form a cycle
u Removing some turns can make the routing algorithm deadlock free
September 11, 2017Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology 21
West-First Turn ModelXY Model
Deadlock-free Routing Algorithms
uAssign weights to bias which links used first (to ensure no cyclic dependence)
September 11, 2017Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology 22
SA
DC
DA DB
SB SC
XY Routing: Always go X first, then Y
1 1
1 1
1 1
1 1
1 1
1 1
2
2
2
2
2
2
2
2
2
2
2
2
Mesh_XY
Deadlock-free Routing Algorithms
uAssign weights to bias which links used first (to ensure no cyclic dependence)
September 11, 2017Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology 23
SA
DC
DA DB
SB SC
West-first Routing: Go W first, then N/S
2 2
2 2
2 2
1 1
1 1
1 1
2
2
2
2
2
2
2
2
2
2
2
2
Mesh_westfirst
Flow Controlu Virtual Channels
u Coherence Protocol requires certain number of virtual networks / message classes to avoid protocol deadlocksu This is the minimum number of VCs required
u Within each vnet, there can be more than one VC for boosting network performanceu In Garnet, only one packet can use a VC inside a router at a time
u VCs in vnets carrying control messages are 1-flit deep
u VCs in vnets carrying data (cacheline) messages are 4-5 flit deep
u Creditsu Each VC conveys its buffer availability by sending credits
to its upstream router
September 11, 2017Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology 24
Conventional VC Router Microarch
September 11, 2017Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology 25
Crossbar SwitchInput Buffers
VC 1VC 2
VC n
VA: VC AllocationInput VCs arbitrate for “output” VCs (Input VCs at next router)
SA: Switch AllocationInput ports arbitrate for output ports
Input Buffers
VC 0VC 1
VC n
BW VA ST LT
VC Allocator
SW Allocator
BW: Buffer Write
ST: Switch Traversal
LT: Link Traversal
RC: Route Compute
RC
Route Compute
SA BR
BR: Buffer Read
VN0
VN 1
FLIT
Single-Cycle Router Implementation
September 11, 2017Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology 26
Router.ccàwakeup()
InputUnit.ccàwakeup()
* Buffer Write* Route Compute
OutputUnit.ccàwakeup()
* Rcv Credit* Update State
VirtualChannel.cc
OutVCState.cc
CrossbarSwitch.ccàwakeup()
* Switch Traversal: Push winner flits on link queue* Schedule output link wakeup for next cycle
* Arbitrate outports* VC Allocate* send credit
(schedule creditlinkwakeup next cycle)
* Buffer Read (send flit to switch)
SwitchAllocator.ccàwakeup()
has_free_vc()
select_free_vc() CreditLink.cc
NetworkLink.cc
NetworkLink.cc
CreditLink.cc
1 23
4
VC 1VN0
VC 2VC 3
VC 0
VN1
* Arbitrate inports
For multi-cycle router, add delay in flit
before it can do SA
Overviewu What are Networks-On-Chip (NoCs)?
u Why model NoCs accurately
u Garnet2.0u Configuration
u Topology
u Routing
u Flow-Control
u Router Microarchitecture
u System Integrationu Network Interface
u Network Parameters
u Running Garnet with Ruby
u Ruby Garnet Standalone
u Output Stats
u Extensions and FAQs
September 11, 2017Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology 27
System Organization
September 11, 2017Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology 28
RouterRouter
NetworkInterface
NetworkLink
CreditLink
GarnetNetwork.cc
Ruby Memory System
Coherence Protocol
Stats
Interface is “MessageBuffers
” from Ruby
Network Interface
September 11, 2017Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology 29
Core + L1$
Core + L1$
Core + L1$
Core + L1$
L2$
L2$
L2$
L2$
Dir
Dir
DMA R
R
R
R
NI
NI
NI
NI
NI
NINI
NI
NINI
NI
Ingress
Cach
e Co
ntro
ller
To/From Router
Flitisize
Egress
VC Select
credit<VN, msg>
VC 0VC 1
VN0
VC 2VC 3
VN1
Msg_size decides number of flits
Can add additional info in flit
NetworkInterface.ccàwakeup()
Garnet Configurable Parametersu ni_flit_size
u Default = 16B (128b) à 1-flit ctrl, 5-flit datau This sets the bandwidth of each physical link
u vcs_per_vnetu Total VCs in each inport = num_vnets * vcs_per_vnet
u buffers_per_data_vcu Default = 4
u buffers_per_ctrl_vcu Default = 1
u routing_algorithmu Weight-based table or custom
u Defined in: src/mem/ruby/network/garnet2.0/GarnetNetwork.py
September 11, 2017Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology 30
Command Line Parameters
September 11, 2017Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology 31
garnet2.0/GarnetNetwork.py
BasicRouter.pyBasicLink.py
Network.py
src/mem/ruby/network/
Definitions and Default Values
configs/
- Overrides default values in Ruby- All parameters in in these .py files can be specified from command line
network/Network.py
ruby/Ruby.py common/Options.py
example/garnet_synth_test.py
Running Garnet with Ruby
uBuild Ruby Coherence Protocol
uProtocol determines number of message classes/virtual networks required
u Invoke garnet2.0 from command line with appropriate network parameters
September 11, 2017Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology 32
scons build/X86_MOESI_hammer/gem5.opt PROTOCOL=MOESI_hammer
./build/X86_MOESI_hammer/gem5.opt configs/example/fs.py--network=garnet2.0 --topology=Mesh_XY ...
Running Garnet Standaloneu Build the Garnet_standalone protocol
u Dummy protocol just for traffic injection via Garnet Synthetic Traffic tester (next slide)
u 3 Virtual Networks: vnet 0 and vnet 1 inject ctrl (1-flit) packets, vnet 2 injects data (5-flit) packets
u Run user-specified synthetic traffic
September 11, 2017Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology 33
scons build/NULL/gem5.debug PROTOCOL=Garnet_Standalone
./build/NULL/gem5.debug configs/example/garnet_synth_traffic.py--network=garnet2.0 --topology=... \--synthetic=uniform_random \--injectionrate=0.01 \--num-packets-max=100 \
Garnet Synthetic Trafficu Dummy CPU Model [only works with Garnet_standalone protocol]
u src/cpu/testers/garnet_synthetic_traffic
u Inject packets for user-specified traffic pattern at user-specified injection rate
u uniform_random
u tornado
u bit_complement
u bit_reverse
u bit_rotation
u neighbor
u shuffle
u transpose
u Ability to inject continuously or fixed number of packets, and/or for fixed time, and/or from fixed source and/or to fixed destination in one/more vnets
u Heavy customization very useful for debugging
u All parameters described here: http://www.gem5.org/Garnet_Synthetic_Traffic
September 11, 2017Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology 34
Overviewu What are Networks-On-Chip (NoCs)?
u Why model NoCs accurately
u Garnet2.0u Configuration
u Topology
u Routing
u Flow-Control
u Router Microarchitecture
u System Integrationu Network Interface
u Network Parameters
u Running Garnet with Ruby
u Ruby Garnet Standalone
u Output Stats
u Extensions and FAQs
September 11, 2017Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology 35
Sample Simulation
September 11, 2017Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology 36
./build/NULL/gem5.debug configs/example/garnet_synth_traffic.py--topology=Mesh_XY --num-cpus=16 --num-dirs=16 --mesh-rows=4
m5out/config.ini
Output Stats
September 11, 2017Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology 37
m5out/stats.txt
Packet and Flit stats (#injected, #received, queueing latency at NIC, and
network latency) per VN and overall
More stats can be added via GarnetNetwork.cc
Overviewu What are Networks-On-Chip (NoCs)?
u Why model NoCs accurately
u Garnet2.0u Configuration
u Topology
u Routing
u Flow-Control
u Router Microarchitecture
u System Integrationu Network Interface
u Network Parameters
u Running Garnet with Ruby
u Ruby Garnet Standalone
u Output Stats
u Extensions and FAQs
September 11, 2017Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology 38
Extensions and FAQsu System Level Modeling
u How can I integrate Garnet into my own simulator (or with the gem5 classic memory system)?u This should not be too hard - would just require some changes to the NI code on how
it receives its inputs. If anyone wants to try it, I would love to give pointers.
u How do I print a network trace?u You can add some code to the NI. I have a patch on my website for reference.
u Can garnet read a network trace?u You can run Garnet in a standalone mode, and have it inject traffic from a trace
instead of a fixed synthetic pattern. Alternately, try to use cpu/testers/traffic_gen
u Does Garnet report NoC area and power numbers?u The output of Garnet can be fed into DSENT (Sun et al, NOCS 2012) which is present
in ext/dsent. We will add an automatic stats.txt parser for it soon.
u Can we model multiple clock domains?u The entire Ruby memory system (including Garnet inside it) is one clock domain. To
mimic a multi-clock domain design, you can schedule wakeups of slow routers intelligently at some multiples of the clock rather than every cycle.
September 11, 2017Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology 39
Extensions and FAQsu Topology
u How do we model multiple BW links in Garnet?u Inherently, that would require support in the router to manage multiple flit
sizes. Instead, you can add multiple links between the same nodes if you want to model higher bandwidth. If they have the same weight, Garnet will randomly send over the two.
u Can we model a heterogeneous CPU-GPU system?u Yes. The current AMD GPU model models a cluster of CPUs connected to a GPU.
u Can we model indirect networks such as Clos?u Yes, there can be additional routers that are not connected to any controller
and act purely as switches.
u Can we model large-scale HPC networks?u Garnet can model any sized network. You can run 256-node standalone
simulations easily. However, beyond that gem5 cannot instantiate more directories (which it uses as destination nodes). I have a patch to run 1024-node synthetic sims on my website. But these run quite slowly.
September 11, 2017Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology 40
Extensions and FAQsu Routing
u How do we model an adaptive routing algorithm?u If you want to use internal NoC metrics (such as number of credits at an
output port) for making routing decisions, do not use table-based routing. Instead, set the routing-algorithm to custom, and implement your own routing function inside RoutingUnit.cc. See outportComputeXY() for reference.
u Flow Controlu Can we implement alternate deadlock avoidance schemes (such as
escape VCs or dateline)?u You can update the vc selection scheme inside SwitchAllocator to control
which VCs get allocated.
u Microarchitectureu Can we model variable number of VCs in each router?
u Currently the codebase is very tied to the global vcs_per_vnet parameter. If you want to model variable number of VCs, one hack could be to have everyone instantiate the same number of VCs, but modify the VC select to never allocate certain VCs
September 11, 2017Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology 41
Conclusionsu Garnet2.0 is an open-source research vehicle
u Use it and contribute to it!u Being actively maintained by the following people
u Georgia Tech: Tushar Krishnau AMD Research: Brad Beckmann, Onur Kayiran, Jieming Yin, Matt
Porembasu If you have any questions, email on gem5-users or gem5-dev mailing lists
u Would love to have it integrated with the classic memory model
u Useful Resources:u www.gem5.org/Interconnection_Networku www.gem5.org/garnet2.0u http://synergy.ece.gatech.edu/garnet
September 11, 2017Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology 42
Thank you!