RAMP Gold : An FPGA-based Architecture Simulator for
Multiprocessors
Zhangxi Tan, Andrew Waterman, David Patterson, Krste Asanovic
Parallel Computing Lab, EECS, UC Berkeley
March 2010
Outline
Overview
RAMP Gold HW Architecture and Implementation
RAMP Gold Software Infrastructure
Usage Case and Live Demo
Future work
Overview
Purpose of RAMP Gold:
An FPGA-based simulator for shared-memory multicore targets, built for Parlab
Use cases: architecture, OS, and application research
Highlights of RAMP Gold:
Works on a $750 Xilinx XUP V5 board
Written in SystemVerilog; no special CAD tools required, works with standard FPGA CAD flows (Synplify/ISE/ModelSim)
Two orders of magnitude faster than Simics+GEMS
Runtime-configurable parameters without resynthesis
Full RTL verification environment and software infrastructure
BSD and GNU licenses
Simulation Jargon
Target vs. host:
Target: the system/architecture being simulated, e.g. a SPARC V8 CMP
Host: the platform on which the simulator runs, e.g. FPGAs
Functional model vs. timing model:
Functional: computes each instruction's result
Timing: determines how long each instruction takes
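The functional/timing split can be sketched in C (a software analogy only; the record fields and latencies below are invented for illustration and are not RAMP Gold's actual interface): the functional model computes what each instruction does, while the timing model separately decides how many target cycles it costs.

```c
#include <stdint.h>

/* Illustrative instruction record: the functional model fills in the
 * result; the timing model looks only at the opcode class. */
typedef enum { OP_ADD, OP_LOAD } opclass_t;
typedef struct { opclass_t op; uint32_t a, b, result; } insn_t;

/* Functional model: compute the architectural result. */
static void functional_step(insn_t *i) {
    i->result = (i->op == OP_ADD) ? i->a + i->b : i->a; /* LOAD modeled as pass-through */
}

/* Timing model: charge target cycles (latencies made up for illustration). */
static uint64_t timing_step(const insn_t *i) {
    return (i->op == OP_LOAD) ? 10 : 1; /* loads cost 10 cycles, ALU ops 1 */
}

/* Decoupled simulation loop: the two models advance together, but
 * neither computes the other's job. */
uint64_t simulate(insn_t *prog, int n) {
    uint64_t target_cycles = 0;
    for (int k = 0; k < n; k++) {
        functional_step(&prog[k]);              /* what the instruction does */
        target_cycles += timing_step(&prog[k]); /* how long it takes */
    }
    return target_cycles;
}
```

Because the two halves only share the instruction record, either side can be swapped out (say, a more detailed memory timing model) without touching the other, which is the point of the decoupling.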
RAMP Gold Overall Setup
[Diagram: functional models and timing models, each with their own state, plus target memory, all on a single Xilinx Virtex-5/6 FPGA; a frontend app server (a Linux PC) connects over Ethernet.]
Both functional and timing models are on the FPGA
App server: control and syscall/I/O service
Target Machine Template
64-core SPARC V8 shared-memory machine
Configurable two-level cache plus multichannel DRAM
[Diagram: CPUs, each with an L1 I$ and D$, connect through an interconnect to four L2 banks, each with its own DRAM channel.]
RAMP Gold Performance vs. Simics
PARSEC parallel benchmarks running on a research OS
>250x faster than a full-system software simulator for a 64-core multiprocessor target
[Chart: speedup over Simics (geometric mean) vs. number of target cores:
Cores:                                         4    8   16   32   64
Functional only:                               2    3    5   10   34
Functional + cache/memory (g-cache):           6   10   21   44  106
Functional + cache/memory + coherency (GEMS):  7   15   36   69  263]
Outline
Overview
RAMP Gold HW Architecture and Implementation
RAMP Gold Software Infrastructure
Usage Case and Live Demo
Future work
RAMP Gold Model Key Concepts
Decoupled functional/timing models, both in hardware:
Enables many FPGA-fabric-friendly optimizations
Increases modeling efficiency and module reuse
Host multithreading of both functional and timing models:
Hides emulation latencies and improves resource utilization
Time-multiplexing effects are patched by the timing model
[Diagram: a functional-model pipeline with architectural state, alongside a timing-model pipeline with timing state.]
Host multithreading example: simulating four independent CPUs
[Diagram: a single host functional pipeline (host I$, decode, register file access, ALU, host D$) holds per-thread copies of the PC and GPRs for four target CPUs (CPU0-CPU3); a thread-select stage picks which target CPU issues each host cycle.]
With four threads, the pipeline interleaves instructions so that one target cycle spans four host cycles: host cycles 0-3 issue instruction i0 for the four CPUs (target cycle 0), host cycles 4-7 issue i1 (target cycle 1), and so on.
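The interleaving above can be sketched in C (the hardware thread-select stage paraphrased in software; the state and names are illustrative): one host pipeline services four target CPUs round-robin, so each target cycle costs four host cycles.

```c
#include <stdint.h>

#define NTHREADS 4  /* four target CPUs share one host pipeline */

/* Per-target-CPU architectural state (just a PC here, for illustration). */
static uint32_t pc[NTHREADS];

/* One host cycle: thread select picks threads round-robin, and the
 * selected target CPU advances by one instruction.  Because consecutive
 * host cycles belong to different threads, a thread's next instruction
 * never overlaps its previous one in the host pipeline, which is what
 * hides emulation latencies. */
static void host_cycle(uint64_t host_cycle_no) {
    unsigned tid = host_cycle_no % NTHREADS; /* thread select */
    pc[tid] += 4;                            /* "execute" one instruction */
}

/* n target cycles cost n * NTHREADS host cycles. */
uint64_t run_target_cycles(unsigned n) {
    uint64_t host;
    for (host = 0; host < (uint64_t)n * NTHREADS; host++)
        host_cycle(host);
    return host;
}
```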
Functional Model
Full SPARC V8 support (FP, MMU, I/O)
Passes the SPARC V8 certification test
Runs Linux and a research OS
[Pipeline diagram: Fetch (per-thread host I$ and host ITLB), Microcode ROM / Decode, Register File Access (per-thread architectural register files, x64), MMU (per-thread host DTLB), and Exception/Write Back (per-thread I/O devices, x64), plus an integer ALU and a pipelined FPU. The memory side has a 16 KB unified host D$ with a 64-entry MSHR behind a DDR2 memory controller driving a 225 MHz / 2 GB SODIMM. Architectural state is replicated x64, and signals flow to and from the timing model.]
Timing Model
Simple CPU timing but a detailed memory timing model (i.e. every instruction takes one cycle except loads/stores)
Cache models: only tags are stored, in BRAMs
Runtime-configurable parameters: associativity, size, line size, number of banks, latency, etc.
Models the three Cs but not the fourth (coherence support coming soon)
DRAM model: bandwidth-delay pipe with optional QoS
[Diagram: per-thread L1 I$ and D$ tags are checked against timing-model configuration registers; a thread scheduler, target cycle count, and scoreboard drive the CPU timing model; four banked L2 timing models (tags plus an MSHR per bank) feed four DRAM channel timing models with QoS; configuration arrives from the functional model's I/O bus.]
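The tag-only cache idea can be sketched in C (a direct-mapped software analogy with invented field names; the real models are set-associative and keep their tags in BRAMs): the model stores tags but no data, because a timing model only needs to answer hit-or-miss and charge a latency, and the parameters are plain runtime variables rather than synthesis-time constants.

```c
#include <stdint.h>
#include <string.h>

/* Tag-only cache timing model sketch: no data array, just tags. */
typedef struct {
    uint32_t tags[1024];        /* tag store (BRAM in hardware) */
    uint8_t  valid[1024];
    unsigned n_sets;            /* runtime-configurable, power of 2, <= 1024 */
    unsigned line_bytes;        /* runtime-configurable, power of 2 */
    unsigned hit_lat, miss_lat; /* runtime-configurable latencies */
} tag_cache_t;

void cache_init(tag_cache_t *c, unsigned sets, unsigned line,
                unsigned hit, unsigned miss) {
    memset(c, 0, sizeof *c);
    c->n_sets = sets; c->line_bytes = line;
    c->hit_lat = hit; c->miss_lat = miss;
}

/* Look up an address; on a miss, install the tag (allocate-on-miss).
 * Returns the latency this access is charged. */
unsigned cache_access(tag_cache_t *c, uint32_t addr) {
    uint32_t line = addr / c->line_bytes;
    uint32_t set  = line % c->n_sets;
    uint32_t tag  = line / c->n_sets;
    if (c->valid[set] && c->tags[set] == tag)
        return c->hit_lat;                   /* hit */
    c->valid[set] = 1; c->tags[set] = tag;   /* miss: refill the tag only */
    return c->miss_lat;
}
```

Because only the parameters change between experiments, reconfiguring the target cache needs no resynthesis, which is the property the slide emphasizes.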
Debugging and Simulation Configuration
[Diagram: the frontend app server connects over Gigabit Ethernet to a frontend link (RX/TX) on the FPGA, which reaches the timing model control, a microcode injector for the functional model, the performance counters in the timing model, and host DRAM (32-bit @ 90 MHz).]
Frontend app server:
Reliable Gigabit Ethernet connection to the FPGA
Periodically polls the simulator to serve I/O requests
Transparent to the target (no side effect on simulated timing)
64-bit hardware performance counters collect runtime stats:
657 counters in the timing model + 10 host counters
Can be read by either target apps or the app server
Ring interconnect for counters (easy to add and remove)
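The counter ring can be sketched in C (the node layout and names are invented for illustration): each node either answers a read request or forwards it to the next hop, so adding or removing a counter only splices the ring instead of rewiring a central mux.

```c
#include <stdint.h>

/* One performance counter sitting on the ring. */
typedef struct ctr_node {
    unsigned id;            /* counter address on the ring */
    uint64_t value;         /* 64-bit counter value */
    struct ctr_node *next;  /* next hop on the ring */
} ctr_node;

/* A read request circulates until the addressed node answers, or the
 * request comes back to its origin unanswered. Returns 1 on success. */
int ring_read(ctr_node *start, unsigned id, uint64_t *out) {
    ctr_node *n = start;
    do {
        if (n->id == id) { *out = n->value; return 1; } /* this node answers */
        n = n->next;                                    /* otherwise forward */
    } while (n != start);
    return 0; /* no such counter on the ring */
}
```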
Host Performance
Timing synchronization is the largest overhead
The tiny host caches/TLBs are not on the performance-critical path
Host DRAM bandwidth is not a problem (<15% utilization)
[Chart: percentage of host execution time for blackscholes, bodytrack, fluidanimate, streamcluster, swaption, and x264, broken down into retired instructions, misc., FPU, microcode, host TLB misses, host D$ misses, host I$ misses, timing synchronization, and idle time due to target stalls.]
Implementation
Single FPGA: 64 cores @ 90 MHz, 2 GB DDR2 SODIMM
~2-hour CAD turnaround time on a mid-range workstation
BRAM-bound, but with logic resources left to fit more pipelines
Outline
Overview
RAMP Gold HW Architecture and Implementation
RAMP Gold Software Infrastructure
Usage Case and Live Demo
Future work
Software Tools
SPARC cross-compiler with binutils/gcc/glibc:
Supports most POSIX programs
Static and dynamic linking support
Built from GNU GCC (4.3.2)
Full software and HW debugging suite:
Low-cost XUP boards sometimes do not work out of the box
FPGA CAD tools are very bad
Target Software
Proxy kernel: a single-protection-domain application host
Runs programs statically linked against glibc
Forwards I/O system calls to the x86/Linux host PC
Presents a simple "hard threads" API for multithreaded programs
Very easy to modify
ROS: UCB's manycore research OS
Provides multiprogramming support
Sufficiently POSIX-compliant to run many programs
Much easier to modify than Linux
Runs on more than 64 cores
Infrastructure
[Diagram: app source files (.S or .c) pass through the GNU SPARC V8 compiler/linker with a customized linker script (.lds) to produce ELF binaries. The binaries run on any of three backends, each connected to the frontend test server by frontend links: a Verilog simulation backend (RTL source files/netlist in .sv/.v under Modelsim SE/Questasim, with host dynamic simulation libraries (.so), the Xilinx Unisim library, and a SystemVerilog DPI interface, producing simulation logs); a functional simulation backend (the C-Gold functional simulator module, producing reference results); and a HW backend (the FPGA target, Xilinx XUP-V5 or BEE3, producing HW state dumps). An optional checker compares the outputs. The frontend test server is a Linux machine that handles I/Os and syscalls, using libbfd and a C disassembler implementation.]
Case Studies
Parallel application studies for software programmers
Parallel OS work for system researchers
Adding hardware performance counters for advanced debugging
Microarchitecture studies: adding features and modifying existing timing models
Adding new instructions: changing the functional model
Appserver 101
Appserver command-line options:
Usage: sparc_app [-f<conf>] [-p<nprocs>] [-s] <htif> <kernel> [binary] [args]
Platform memory tests:
App server memory test: sparc_app -p64 hw memtest none
Proxy kernel memory test (stress test): sparc_app -p64 hw path/kernel.ramp path/memtest
For Application Programmers
Main usage scenario: use the runtime-configurable timing model without any FPGA hardware change
Use "hard threads" to write a parallel "hello world" program running on the proxy kernel
Compile the program using the cross toolchain:
sparc-ros-gcc -o hello hello.cpp -lhart
Measure performance using the performance counters:
sparc_app -s1 -p64 hw kernel.ramp hello
Change the target machine configuration on the fly and rerun the experiment:
edit the file 'appserver.conf'
For OS Developers
Similar usage model to that of application programmers
The proxy kernel is a good place to start learning the bootstrapping process
ROS is a fully functional kernel
Demo: boot the ROS kernel using the appserver:
sparc_app -p64 -fappserver_ros.conf hw your_kernel none
Adding Hardware Performance Counters
Two types of counter interface:
Global counter: <EN>
Local (per-core) counter: <TID, EN>
Modify the Verilog file to add more counters on the ring:

perfctr_io #(.NLOCAL(num_of_local), .NGLOBAL(num_of_global)) gen_tm_counter(
    .gclk, .rst,
    .bus_out(io_out), .bus_in(io_in), .bus_sel(), // I/O bus interface
    .global_inc(global_counter_inc),
    .local_inc(local_counter_inc),
    .local_tid(local_counter_tid));

Modify the app server to support more counters:
Add your counter definition in 'TestAppServer/perfcnt.h'
Adding Features to Timing Models
Timing models are much simpler than functional models: ~1,000 LoC vs. ~35,000 LoC
Example 1: changing the cache replacement policy
Example 2: adding memory QoS
Lee et al., "Globally-Synchronized Frames for Guaranteed Quality-of-Service in On-Chip Networks", ISCA'08
~100 lines of code added to the timing model
A new DRAM model
Several memory-mapped registers added on the functional I/O bus for configuration
Adding New Instructions
Adding instructions to a feed-through pipeline is straightforward
FPU instructions were added as "new" instructions within a week, including a new register file, decode, exception/commit, and microcode
Example: adding new atomic instructions through microcode
Four global scratchpad registers (not visible to the programmer) in the main integer register file provide temporary storage
Two write ports support scratchpad-register updates alongside architectural register changes
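The scratchpad-register idea can be mirrored in C (an illustrative paraphrase of a SWAP-style microcode sequence; the register names and uop ordering are hypothetical): SWAP exchanges a register with a memory word, and the hidden scratch registers carry the address and the old value between uops.

```c
#include <stdint.h>

/* Scratchpad registers: live in the integer register file in hardware,
 * but are not architecturally visible to the programmer. */
enum { SCRATCH_0, SCRATCH_1, NSCRATCH };
static uint32_t scratch[NSCRATCH];

/* SWAP as a microcode sequence over word-indexed memory: returns the old
 * memory value into the destination register and stores rd to memory. */
uint32_t swap_microcode(uint32_t *mem, uint32_t addr_word, uint32_t rd) {
    scratch[SCRATCH_0] = addr_word;               /* uop: save effective address */
    scratch[SCRATCH_1] = mem[scratch[SCRATCH_0]]; /* uop: load old value         */
    mem[scratch[SCRATCH_0]] = rd;                 /* uop: store rd to memory     */
    return scratch[SCRATCH_1];                    /* uop: old value -> rd        */
}
```

In hardware the whole sequence runs without an intervening instruction from the same thread, which is what makes it atomic from the target's point of view.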
Steps for Adding Instructions
Add the proper decoding logic in function "decode_dsp_add_logic" of "regacc_dma.sv"
Update the writeback/exception stage in "exception_dma.sv" to trap to microcode:
Edit function "decode_microcode_mode" to trap to microcode
Edit function "rd_gen" to write the address to scratch register 0 and load the data into scratch register 1
Edit the microcode ROM 'Microcode.sv':

//----------SWAP*----------
9: begin
    uco.uend = '0; uco.cwp_rs1 = '0; uco.cwp_rd = '0;
    uco.inst = {LDST, 5'b0, ST, REGADDR_SCRATCH_0 | UCI_MASK, 1'b1, 13'b0};
end
10: begin
    uco.uend = '1; uco.cwp_rs1 = '0; uco.cwp_rd = '0;
    uco.inst = {FMT3, 5'b0, IADD, REGADDR_SCRATCH_1 | UCI_MASK, 1'b1, 13'b0};
end
Future Work
Cache coherence models (soon)
Realistic interconnect model (soon)
Better CPU core model (next major version)
Support for other ISAs (next major version)
Further References
Research papers:
Usage case: "A Case for FAME: FPGA Architecture Model Execution", ISCA'10
RAMP Gold design: "RAMP Gold: An FPGA-based Architecture Simulator for Multiprocessors", DAC'10
Beta release: http://sites.google.com/site/rampgold
Backup Slides
Functional/Timing Model Interface

// FM -> TM
typedef struct {
    bit valid;         // timing token between FM and TM
    bit [5:0] tid;     // thread ID
    bit run;           // CPU state
    bit replay;        // this instruction needs to be replayed by the FM
    bit retired;       // retiring an instruction
    bit [31:0] inst;   // the instruction that was retired
    bit [31:0] paddr;  // load/store physical address
    bit [31:0] npc;    // PC of next fetched insn
} tm_cpu_ctrl_token_type;

// TM -> FM
typedef struct {
    bit valid;         // timing token between FM and TM
    bit [5:0] tid;     // thread ID
    bit run;           // run bit
} tm2cpu_token_type;
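The token exchange can be mirrored in C (the structs follow the SystemVerilog definitions above, but the stall policy in the toy timing model below is invented purely for illustration): the FM reports each retired instruction in a token, and the TM replies with a run bit that throttles that thread.

```c
#include <stdint.h>
#include <stdbool.h>

/* C mirror of the FM -> TM token. */
typedef struct {
    bool valid, run, replay, retired;
    uint8_t tid;                 /* thread ID */
    uint32_t inst, paddr, npc;   /* retired instruction, phys addr, next PC */
} fm2tm_token;

/* C mirror of the TM -> FM token. */
typedef struct {
    bool valid, run;
    uint8_t tid;
} tm2fm_token;

/* Toy timing model: after a thread retires a memory operation (nonzero
 * paddr), stall it (run = 0) for the next token, then let it run again. */
tm2fm_token timing_model(const fm2tm_token *t, bool *stalled) {
    tm2fm_token r = { .valid = true, .run = true, .tid = t->tid };
    if (*stalled) {
        r.run = false;       /* charge the pending stall */
        *stalled = false;
    } else if (t->retired && t->paddr != 0) {
        *stalled = true;     /* memory op seen: stall next time */
    }
    return r;
}
```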