RAMP Gold : An FPGA-based Architecture Simulator for
Multiprocessors
Zhangxi Tan, Andrew Waterman, David Patterson, Krste Asanovic
Parallel Computing Lab, EECS, UC Berkeley
March 2010
Outline
Overview
RAMP Gold HW Architecture and Implementation
RAMP Gold Software Infrastructure
Usage Case and Live Demo
Future work
Overview
Purpose of RAMP Gold:
An FPGA-based simulator for shared-memory multicore targets, built for Parlab
Use cases: architecture, OS, and application research
Highlights of RAMP Gold:
Works on a $750 Xilinx XUP V5 board
Written in SystemVerilog; no special CAD tools required, works with standard FPGA CAD flows (Synplify/ISE/ModelSim)
Two orders of magnitude faster than Simics+GEMS
Runtime-configurable parameters without resynthesis
Full RTL verification environment and software infrastructure
BSD and GNU licenses
Simulation Jargon
Target vs. host:
Target: the system/architecture being simulated, e.g. a SPARC V8 CMP
Host: the platform on which the simulator runs, e.g. FPGAs
Functional model vs. timing model:
Functional: computes each instruction's result
Timing: determines how long each instruction takes
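The functional/timing split can be sketched in C (a software analogy only; the record fields and latencies below are invented for illustration and are not RAMP Gold's actual interface): the functional model computes what each instruction does, while the timing model separately decides how many target cycles it costs.

```c
#include <stdint.h>

/* Illustrative instruction record: the functional model fills in the
 * result; the timing model looks only at the opcode class. */
typedef enum { OP_ADD, OP_LOAD } opclass_t;
typedef struct { opclass_t op; uint32_t a, b, result; } insn_t;

/* Functional model: compute the architectural result. */
static void functional_step(insn_t *i) {
    i->result = (i->op == OP_ADD) ? i->a + i->b : i->a; /* LOAD modeled as pass-through */
}

/* Timing model: charge target cycles (latencies made up for illustration). */
static uint64_t timing_step(const insn_t *i) {
    return (i->op == OP_LOAD) ? 10 : 1; /* loads cost 10 cycles, ALU ops 1 */
}

/* Decoupled simulation loop: the two models advance together, but
 * neither computes the other's job. */
uint64_t simulate(insn_t *prog, int n) {
    uint64_t target_cycles = 0;
    for (int k = 0; k < n; k++) {
        functional_step(&prog[k]);              /* what the instruction does */
        target_cycles += timing_step(&prog[k]); /* how long it takes */
    }
    return target_cycles;
}
```

Because the two halves only share the instruction record, either side can be swapped out (say, a more detailed memory timing model) without touching the other, which is the point of the decoupling.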
RAMP Gold Overall Setup
[Diagram: functional models and timing models, each with their own state, plus target memory, all on a single Xilinx Virtex-5/6 FPGA; a frontend app server (a Linux PC) connects over Ethernet.]
Both functional and timing models are on the FPGA
App server: control and syscall/I/O service
Target Machine Template
64-core SPARC V8 shared-memory machine
Configurable two-level cache plus multichannel DRAM
[Diagram: CPUs, each with an L1 I$ and D$, connect through an interconnect to four L2 banks, each with its own DRAM channel.]
RAMP Gold Performance vs. Simics
PARSEC parallel benchmarks running on a research OS
>250x faster than a full-system software simulator for a 64-core multiprocessor target
[Chart: speedup over Simics (geometric mean) vs. number of target cores:
Cores:                                         4    8   16   32   64
Functional only:                               2    3    5   10   34
Functional + cache/memory (g-cache):           6   10   21   44  106
Functional + cache/memory + coherency (GEMS):  7   15   36   69  263]
Outline
Overview
RAMP Gold HW Architecture and Implementation
RAMP Gold Software Infrastructure
Usage Case and Live Demo
Future work
RAMP Gold Model Key Concepts
Decoupled functional/timing models, both in hardware:
Enables many FPGA-fabric-friendly optimizations
Increases modeling efficiency and module reuse
Host multithreading of both functional and timing models:
Hides emulation latencies and improves resource utilization
Time-multiplexing effects are patched by the timing model
[Diagram: a functional-model pipeline with architectural state, alongside a timing-model pipeline with timing state.]
Host multithreading example: simulating four independent CPUs
[Diagram: a single host functional pipeline (host I$, decode, register file access, ALU, host D$) holds per-thread copies of the PC and GPRs for four target CPUs (CPU0-CPU3); a thread-select stage picks which target CPU issues each host cycle.]
With four threads, the pipeline interleaves instructions so that one target cycle spans four host cycles: host cycles 0-3 issue instruction i0 for the four CPUs (target cycle 0), host cycles 4-7 issue i1 (target cycle 1), and so on.
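The interleaving above can be sketched in C (the hardware thread-select stage paraphrased in software; the state and names are illustrative): one host pipeline services four target CPUs round-robin, so each target cycle costs four host cycles.

```c
#include <stdint.h>

#define NTHREADS 4  /* four target CPUs share one host pipeline */

/* Per-target-CPU architectural state (just a PC here, for illustration). */
static uint32_t pc[NTHREADS];

/* One host cycle: thread select picks threads round-robin, and the
 * selected target CPU advances by one instruction.  Because consecutive
 * host cycles belong to different threads, a thread's next instruction
 * never overlaps its previous one in the host pipeline, which is what
 * hides emulation latencies. */
static void host_cycle(uint64_t host_cycle_no) {
    unsigned tid = host_cycle_no % NTHREADS; /* thread select */
    pc[tid] += 4;                            /* "execute" one instruction */
}

/* n target cycles cost n * NTHREADS host cycles. */
uint64_t run_target_cycles(unsigned n) {
    uint64_t host;
    for (host = 0; host < (uint64_t)n * NTHREADS; host++)
        host_cycle(host);
    return host;
}
```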
Functional Model
Full SPARC V8 support (FP, MMU, I/O)
Passes the SPARC V8 certification test
Runs Linux and a research OS
[Pipeline diagram: Fetch (per-thread host I$ and host ITLB), Microcode ROM / Decode, Register File Access (per-thread architectural register files, x64), MMU (per-thread host DTLB), and Exception/Write Back (per-thread I/O devices, x64), plus an integer ALU and a pipelined FPU. The memory side has a 16 KB unified host D$ with a 64-entry MSHR behind a DDR2 memory controller driving a 225 MHz / 2 GB SODIMM. Architectural state is replicated x64, and signals flow to and from the timing model.]
Timing Model
Simple CPU timing but a detailed memory timing model (i.e. every instruction takes one cycle except loads/stores)
Cache models: only tags are stored, in BRAMs
Runtime-configurable parameters: associativity, size, line size, number of banks, latency, etc.
Models the three Cs but not the fourth (coherence support coming soon)
DRAM model: bandwidth-delay pipe with optional QoS
[Diagram: per-thread L1 I$ and D$ tags are checked against timing-model configuration registers; a thread scheduler, target cycle count, and scoreboard drive the CPU timing model; four banked L2 timing models (tags plus an MSHR per bank) feed four DRAM channel timing models with QoS; configuration arrives from the functional model's I/O bus.]
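The tag-only cache idea can be sketched in C (a direct-mapped software analogy with invented field names; the real models are set-associative and keep their tags in BRAMs): the model stores tags but no data, because a timing model only needs to answer hit-or-miss and charge a latency, and the parameters are plain runtime variables rather than synthesis-time constants.

```c
#include <stdint.h>
#include <string.h>

/* Tag-only cache timing model sketch: no data array, just tags. */
typedef struct {
    uint32_t tags[1024];        /* tag store (BRAM in hardware) */
    uint8_t  valid[1024];
    unsigned n_sets;            /* runtime-configurable, power of 2, <= 1024 */
    unsigned line_bytes;        /* runtime-configurable, power of 2 */
    unsigned hit_lat, miss_lat; /* runtime-configurable latencies */
} tag_cache_t;

void cache_init(tag_cache_t *c, unsigned sets, unsigned line,
                unsigned hit, unsigned miss) {
    memset(c, 0, sizeof *c);
    c->n_sets = sets; c->line_bytes = line;
    c->hit_lat = hit; c->miss_lat = miss;
}

/* Look up an address; on a miss, install the tag (allocate-on-miss).
 * Returns the latency this access is charged. */
unsigned cache_access(tag_cache_t *c, uint32_t addr) {
    uint32_t line = addr / c->line_bytes;
    uint32_t set  = line % c->n_sets;
    uint32_t tag  = line / c->n_sets;
    if (c->valid[set] && c->tags[set] == tag)
        return c->hit_lat;                   /* hit */
    c->valid[set] = 1; c->tags[set] = tag;   /* miss: refill the tag only */
    return c->miss_lat;
}
```

Because only the parameters change between experiments, reconfiguring the target cache needs no resynthesis, which is the property the slide emphasizes.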
Debugging and Simulation Configuration
[Diagram: the frontend app server connects over Gigabit Ethernet to a frontend link (RX/TX) on the FPGA, which reaches the timing model control, a microcode injector for the functional model, the performance counters in the timing model, and host DRAM (32-bit @ 90 MHz).]
Frontend app server:
Reliable Gigabit Ethernet connection to the FPGA
Periodically polls the simulator to serve I/O requests
Transparent to the target (no side effect on simulated timing)
64-bit hardware performance counters collect runtime stats:
657 counters in the timing model + 10 host counters
Can be read by either target apps or the app server
Ring interconnect for counters (easy to add and remove)
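The counter ring can be sketched in C (the node layout and names are invented for illustration): each node either answers a read request or forwards it to the next hop, so adding or removing a counter only splices the ring instead of rewiring a central mux.

```c
#include <stdint.h>

/* One performance counter sitting on the ring. */
typedef struct ctr_node {
    unsigned id;            /* counter address on the ring */
    uint64_t value;         /* 64-bit counter value */
    struct ctr_node *next;  /* next hop on the ring */
} ctr_node;

/* A read request circulates until the addressed node answers, or the
 * request comes back to its origin unanswered. Returns 1 on success. */
int ring_read(ctr_node *start, unsigned id, uint64_t *out) {
    ctr_node *n = start;
    do {
        if (n->id == id) { *out = n->value; return 1; } /* this node answers */
        n = n->next;                                    /* otherwise forward */
    } while (n != start);
    return 0; /* no such counter on the ring */
}
```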
Host Performance
Timing synchronization is the largest overhead
The tiny host caches/TLBs are not on the performance-critical path
Host DRAM bandwidth is not a problem (<15% utilization)
[Chart: percentage of host execution time for blackscholes, bodytrack, fluidanimate, streamcluster, swaption, and x264, broken down into retired instructions, misc., FPU, microcode, host TLB misses, host D$ misses, host I$ misses, timing synchronization, and idle time due to target stalls.]
Implementation
Single FPGA: 64 cores @ 90 MHz, 2 GB DDR2 SODIMM
~2-hour CAD turnaround time on a mid-range workstation
BRAM-bound, but with logic resources left to fit more pipelines
Outline
Overview
RAMP Gold HW Architecture and Implementation
RAMP Gold Software Infrastructure
Usage Case and Live Demo
Future work
Software Tools
SPARC cross-compiler with binutils/gcc/glibc:
Supports most POSIX programs
Static and dynamic linking support
Built from GNU GCC (4.3.2)
Full software and HW debugging suite:
Low-cost XUP boards sometimes do not work out of the box
FPGA CAD tools are very bad
Target Software
Proxy kernel: a single-protection-domain application host
Runs programs statically linked against glibc
Forwards I/O system calls to the x86/Linux host PC
Presents a simple "hard threads" API for multithreaded programs
Very easy to modify
ROS: UCB's manycore research OS
Provides multiprogramming support
Sufficiently POSIX-compliant to run many programs
Much easier to modify than Linux
Runs on more than 64 cores
Infrastructure
[Diagram: app source files (.S or .c) pass through the GNU SPARC V8 compiler/linker with a customized linker script (.lds) to produce ELF binaries. The binaries run on any of three backends, each connected to the frontend test server by frontend links: a Verilog simulation backend (RTL source files/netlist in .sv/.v under Modelsim SE/Questasim, with host dynamic simulation libraries (.so), the Xilinx Unisim library, and a SystemVerilog DPI interface, producing simulation logs); a functional simulation backend (the C-Gold functional simulator module, producing reference results); and a HW backend (the FPGA target, Xilinx XUP-V5 or BEE3, producing HW state dumps). An optional checker compares the outputs. The frontend test server is a Linux machine that handles I/Os and syscalls, using libbfd and a C disassembler implementation.]
Case Studies
Parallel application studies for software programmers
Parallel OS work for system researchers
Adding hardware performance counters for advanced debugging
Microarchitecture studies: adding features and modifying existing timing models
Adding new instructions: changing the functional model
Appserver 101
Appserver command-line options:
Usage: sparc_app [-f<conf>] [-p<nprocs>] [-s] <htif> <kernel> [binary] [args]
Platform memory tests:
App server memory test: sparc_app -p64 hw memtest none
Proxy kernel memory test (stress test): sparc_app -p64 hw path/kernel.ramp path/memtest
For Application Programmers
Main usage scenario: use the runtime-configurable timing model without any FPGA hardware change
Use "hard threads" to write a parallel "hello world" program running on the proxy kernel
Compile the program using the cross toolchain:
sparc-ros-gcc -o hello hello.cpp -lhart
Measure performance using the performance counters:
sparc_app -s1 -p64 hw kernel.ramp hello
Change the target machine configuration on the fly and rerun the experiment:
edit the file 'appserver.conf'
For OS Developers
Similar usage model to that of application programmers
The proxy kernel is a good place to start learning the bootstrapping process
ROS is a fully functional kernel
Demo: boot the ROS kernel using the appserver:
sparc_app -p64 -fappserver_ros.conf hw your_kernel none
Adding Hardware Performance Counters
Two types of counter interface:
Global counter: <EN>
Local (per-core) counter: <TID, EN>
Modify the Verilog file to add more counters on the ring:

perfctr_io #(.NLOCAL(num_of_local), .NGLOBAL(num_of_global)) gen_tm_counter(
    .gclk, .rst,
    .bus_out(io_out), .bus_in(io_in), .bus_sel(), // I/O bus interface
    .global_inc(global_counter_inc),
    .local_inc(local_counter_inc),
    .local_tid(local_counter_tid));

Modify the app server to support more counters:
Add your counter definition in 'TestAppServer/perfcnt.h'
Adding Features to Timing Models
Timing models are much simpler than functional models: ~1,000 LoC vs. ~35,000 LoC
Example 1: changing the cache replacement policy
Example 2: adding memory QoS
Lee et al., "Globally-Synchronized Frames for Guaranteed Quality-of-Service in On-Chip Networks", ISCA'08
~100 lines of code added to the timing model
A new DRAM model
Several memory-mapped registers added on the functional I/O bus for configuration
Adding New Instructions
Adding instructions to a feed-through pipeline is straightforward
FPU instructions were added as "new" instructions within a week, including a new register file, decode, exception/commit, and microcode
Example: adding new atomic instructions through microcode
Four global scratchpad registers (not visible to the programmer) in the main integer register file provide temporary storage
Two write ports support scratchpad-register updates alongside architectural register changes
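The scratchpad-register idea can be mirrored in C (an illustrative paraphrase of a SWAP-style microcode sequence; the register names and uop ordering are hypothetical): SWAP exchanges a register with a memory word, and the hidden scratch registers carry the address and the old value between uops.

```c
#include <stdint.h>

/* Scratchpad registers: live in the integer register file in hardware,
 * but are not architecturally visible to the programmer. */
enum { SCRATCH_0, SCRATCH_1, NSCRATCH };
static uint32_t scratch[NSCRATCH];

/* SWAP as a microcode sequence over word-indexed memory: returns the old
 * memory value into the destination register and stores rd to memory. */
uint32_t swap_microcode(uint32_t *mem, uint32_t addr_word, uint32_t rd) {
    scratch[SCRATCH_0] = addr_word;               /* uop: save effective address */
    scratch[SCRATCH_1] = mem[scratch[SCRATCH_0]]; /* uop: load old value         */
    mem[scratch[SCRATCH_0]] = rd;                 /* uop: store rd to memory     */
    return scratch[SCRATCH_1];                    /* uop: old value -> rd        */
}
```

In hardware the whole sequence runs without an intervening instruction from the same thread, which is what makes it atomic from the target's point of view.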
Steps for Adding Instructions
Add the proper decoding logic in function "decode_dsp_add_logic" of "regacc_dma.sv"
Update the writeback/exception stage in "exception_dma.sv" to trap to microcode:
Edit function "decode_microcode_mode" to trap to microcode
Edit function "rd_gen" to write the address to scratch register 0 and load the data into scratch register 1
Edit the microcode ROM 'Microcode.sv':

//----------SWAP*----------
9: begin
    uco.uend = '0; uco.cwp_rs1 = '0; uco.cwp_rd = '0;
    uco.inst = {LDST, 5'b0, ST, REGADDR_SCRATCH_0 | UCI_MASK, 1'b1, 13'b0};
end
10: begin
    uco.uend = '1; uco.cwp_rs1 = '0; uco.cwp_rd = '0;
    uco.inst = {FMT3, 5'b0, IADD, REGADDR_SCRATCH_1 | UCI_MASK, 1'b1, 13'b0};
end
Future Work
Cache coherence models (soon)
Realistic interconnect model (soon)
Better CPU core model (next major version)
Support for other ISAs (next major version)
Further References
Research papers:
Usage case: "A Case for FAME: FPGA Architecture Model Execution", ISCA'10
RAMP Gold design: "RAMP Gold: An FPGA-based Architecture Simulator for Multiprocessors", DAC'10
Beta release: http://sites.google.com/site/rampgold
Backup Slides
Functional/Timing Model Interface

// FM -> TM
typedef struct {
    bit valid;         // timing token between FM and TM
    bit [5:0] tid;     // thread ID
    bit run;           // CPU state
    bit replay;        // this instruction needs to be replayed by the FM
    bit retired;       // retiring an instruction
    bit [31:0] inst;   // the instruction that was retired
    bit [31:0] paddr;  // load/store physical address
    bit [31:0] npc;    // PC of next fetched insn
} tm_cpu_ctrl_token_type;

// TM -> FM
typedef struct {
    bit valid;         // timing token between FM and TM
    bit [5:0] tid;     // thread ID
    bit run;           // run bit
} tm2cpu_token_type;
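The token exchange can be mirrored in C (the structs follow the SystemVerilog definitions above, but the stall policy in the toy timing model below is invented purely for illustration): the FM reports each retired instruction in a token, and the TM replies with a run bit that throttles that thread.

```c
#include <stdint.h>
#include <stdbool.h>

/* C mirror of the FM -> TM token. */
typedef struct {
    bool valid, run, replay, retired;
    uint8_t tid;                 /* thread ID */
    uint32_t inst, paddr, npc;   /* retired instruction, phys addr, next PC */
} fm2tm_token;

/* C mirror of the TM -> FM token. */
typedef struct {
    bool valid, run;
    uint8_t tid;
} tm2fm_token;

/* Toy timing model: after a thread retires a memory operation (nonzero
 * paddr), stall it (run = 0) for the next token, then let it run again. */
tm2fm_token timing_model(const fm2tm_token *t, bool *stalled) {
    tm2fm_token r = { .valid = true, .run = true, .tid = t->tid };
    if (*stalled) {
        r.run = false;       /* charge the pending stall */
        *stalled = false;
    } else if (t->retired && t->paddr != 0) {
        *stalled = true;     /* memory op seen: stall next time */
    }
    return r;
}
```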