EECS Electrical Engineering and Computer Sciences, Berkeley Par Lab
PARALLEL COMPUTING LABORATORY
A Case for FAME:FPGA Architecture Model Execution
Zhangxi Tan, Andrew Waterman,Henry Cook, Sarah Bird,
Krste Asanovic, David PattersonThe Parallel Computing Lab, UC Berkeley
ISCA ’10
A Brief History of Time
- Hardware prototyping initially popular for architects
- Prototyping each point in a design space is expensive
- Simulators became a popular, cost-effective alternative
  - Software Architecture Model Execution (SAME) simulators most popular
  - SAME performance scaled with uniprocessor performance
The Multicore Revolution
- Abrupt change to multicore architectures
- HW and SW systems larger, more complex
  - Timing-dependent nondeterminism
  - Dynamic code generation
  - Automatic tuning of app kernels
- We need more simulation cycles than ever
The Multicore Simulation Gap
- As the number of cores increases exponentially, time to model a target cycle increases accordingly
- SAME is difficult to parallelize because of cycle-by-cycle interactions
  - Relaxed simulation synchronization may not work
- Must bridge the simulation gap
One Decade of SAME
           Median Instructions Simulated/Benchmark | Median #Cores | Median Instructions Simulated/Core
ISCA 1998: 267M | 1 | 267M
ISCA 2008: 825M | 16 | 100M

Effect: dramatically shorter (~10 ms) simulation runs
FAME: FPGA Architecture Model Execution
- The SAME approach provides inadequate simulation throughput and latency
- Need a fundamentally new strategy to maximize useful experiments per day
  - Want the flexibility of SAME and the performance of hardware
- Ours: FPGA Architecture Model Execution (FAME) (cf. SAME, Software Architecture Model Execution)
- Why FPGAs?
  - FPGA capacity scaling with Moore's Law; can now fit a few cores on a die
  - Highly concurrent programming model with cheap synchronization
Non-FAME: FPGA Computers
- FPGA computers: using FPGAs to build a production computer
- RAMP Blue (UCB 2006)
  - 1008 MicroBlaze cores
  - No MMU, message passing only
  - Requires lots of hardware: 21 BEE2 boards (a full rack) / 84 FPGAs
  - RTL directly mapped to FPGA; time-consuming to modify
- Cool, useful, but not a flexible simulator
FAME: System Simulators in FPGAs

[Figure: two target systems. Target System A: cores with private I$/D$ sharing an L2$/interconnect to DRAM. Target System B: cores with private I$/D$ and per-core L2$ banks to DRAM. Both are modeled by one host system, the FAME simulator.]
A Vast FAME Design Space
- FAME design space even larger than SAME's
- Three dimensions of FAME simulators:
  - Direct or Decoupled: does one host cycle model one target cycle?
  - Full RTL or Abstract RTL?
  - Host Single-threaded or Host Multi-threaded?
- See paper for a FAME taxonomy!
FAME Dimension 1: Direct vs. Decoupled
- Direct FAME: compile target RTL to FPGA
  - Problem: common ASIC structures map poorly to FPGAs
  - Solution: resource-efficient multi-cycle FPGA mapping
- Decoupled FAME: decouple host cycles from target cycles
  - Full RTL still modeled, so timing accuracy still guaranteed
[Figure: a target register file with four read ports (Rd1-Rd4) and two write ports, realized on the host as a register file with two read ports and one write port, sequenced over multiple host cycles by an FSM.]
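The multi-cycle mapping in this figure can be sketched in software. This is a hypothetical illustration (the class and method names are invented here, not RAMP's code): a host register file with only two physical read ports emulates a four-read-port target register file by spending two host cycles per target cycle, as the FSM would.

```python
class DecoupledRegfile:
    """Host regfile with 2 read ports emulating a 4-read-port target regfile."""

    def __init__(self, nregs=32):
        self.regs = [0] * nregs

    def write(self, r, val):
        self.regs[r] = val

    def target_read(self, r1, r2, r3, r4):
        """One target cycle of reads, costing two host cycles."""
        # Host cycle 1: the two physical read ports serve Rd1 and Rd2.
        rd1, rd2 = self.regs[r1], self.regs[r2]
        # Host cycle 2: the FSM reuses the same two ports for Rd3 and Rd4.
        rd3, rd4 = self.regs[r3], self.regs[r4]
        host_cycles = 2  # decoupled: host cycles no longer equal target cycles
        return (rd1, rd2, rd3, rd4), host_cycles
```

Because all four reads observe the same register state, the target's RTL semantics are preserved; only host time is spent, which is why decoupled FAME keeps timing accuracy.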
FAME Dimension 2: Full RTL vs. Abstract RTL
- Decoupled FAME models the full RTL of the target machine
  - Don't have full RTL in the initial design phase
  - Full RTL is too much work for design space exploration
- Abstract FAME: model the target RTL at a high level
  - For example, split timing and functional models (à la SAME)
  - Also enables runtime parameterization: run different simulations without re-synthesizing the design
- Advantages of Abstract FAME come at a cost: model verification
  - Timing of the abstract model is not guaranteed to match the target machine
[Figure: abstraction splits the target RTL into a functional model and a timing model.]
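As a toy illustration of that split (all names, latencies, and sizes below are invented for the sketch, not RAMP Gold's actual models): a functional model computes which cache block each access touches, while a separate timing model charges target cycles. The latencies and cache size are runtime parameters, so changing them requires no re-synthesis.

```python
def simulate_cache(accesses, hit_latency=1, miss_latency=20, num_lines=8):
    """Abstract timing model of a direct-mapped cache, runtime-parameterized."""
    tags = [None] * num_lines           # timing-model state
    target_cycles = 0
    for addr in accesses:
        tag = addr // 64                # functional model: which 64B block
        line = tag % num_lines          # direct-mapped index
        if tags[line] == tag:           # timing model: charge a hit or a miss
            target_cycles += hit_latency
        else:
            target_cycles += miss_latency
            tags[line] = tag            # fill the line
    return target_cycles
```

Rerunning with, say, `miss_latency=50` models a different memory system at zero rebuild cost. That convenience is exactly what the verification cost pays for: nothing guarantees these charges match the real target RTL.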
FAME Dimension 3: Single- or Multi-threaded Host
- Problem: can't fit a big manycore on an FPGA, even abstracted
- Problem: long host latencies reduce utilization
- Solution: host multithreading
[Figure: a multithreaded emulation engine on the FPGA models target CPUs 1-4 with a single hardware pipeline holding multiple copies of CPU state (per-core PCs and GPRs, shared I$ and D$).]
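Host multithreading can be sketched as follows (a toy one-instruction ISA and invented names, purely illustrative): one host pipeline keeps per-core copies of architectural state and issues an instruction from a different target core each host cycle, round-robin, which hides long host latencies.

```python
def run_host_cycles(n_host_cycles, n_cores=4):
    """One host pipeline advances a different target core each host cycle."""
    program = [("addi", 1)] * 16              # shared toy program
    # Replicated per-core architectural state, like the figure's PCs and GPRs.
    ctxs = [{"pc": 0, "acc": 0} for _ in range(n_cores)]
    schedule = []                             # which core ran each host cycle
    for cycle in range(n_host_cycles):
        tid = cycle % n_cores                 # round-robin thread select
        ctx = ctxs[tid]
        op, imm = program[ctx["pc"]]          # fetch at this core's own PC
        if op == "addi":
            ctx["acc"] += imm                 # execute on this core's state
        ctx["pc"] += 1                        # commit; core waits for its turn
        schedule.append(tid)
    return ctxs, schedule
```

After 8 host cycles with 4 cores, each core has advanced exactly 2 target instructions; the pipeline stays busy even when one core's next step has a long host latency (e.g., a host DRAM access).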
Metrics besides Cycles: Power, Area, Cycle Time
- FAME simulators determine how many cycles a program takes to run
- Computing power/area/cycle time: SAME old story
  - Push target RTL through a VLSI flow
  - Analytical or empirical models
- Collecting event stats for model inputs is much faster than with SAME
RAMP Gold: A Multithreaded FAME Simulator
- Rapid, accurate simulation of manycore architectural ideas using FPGAs
- Initial version models 64 cores of SPARC v8 with a shared memory system on a $750 board
- Hardware FPU, MMU; boots an OS

                  Cost | Performance (MIPS) | Simulations per day
Simics (SAME):    $2,000 | 0.1 - 1 | 1
RAMP Gold (FAME): $2,000 + $750 | 50 - 100 | 100
RAMP Gold Target Machine

[Figure: 64 SPARC V8 cores, each with private I$ and D$, connected through a shared L2$/interconnect to DRAM.]
RAMP Gold Model
[Figure: a functional model pipeline with its architectural state, coupled to a timing model pipeline with its timing state; the target is the 64-core system above.]

- SPARC V8 ISA; one-socket manycore target
- Split functional/timing model, both in hardware
  - Functional model: executes the ISA
  - Timing model: captures pipeline timing detail
- Host multithreading of both functional and timing models
- Functional-first, timing-directed
- Built for Xilinx Virtex-5 systems

[RAMP Gold, DAC '10]
Case Study: Manycore OS Resource Allocation
- Spatial resource allocation in a manycore system is hard
  - Combinatorial explosion in number of apps and number of resources
- Idea: use predictive models of app performance to make it easier on the OS
  - HW partitioning for performance isolation (so models still work when apps run together)
- Problem: evaluating the effectiveness of the resulting scheduling decisions requires running hundreds of schedules for billions of cycles each
  - Simulation-bound: 8.3 CPU-years for Simics!
- See paper for app modeling strategy details
Case Study: Manycore OS Resource Allocation
[Chart: normalized runtime (0-4) of the worst, chosen, and best schedules for the Synthetic Only and PARSEC Small workloads.]
Case Study: Manycore OS Resource Allocation
The technique appears to perform very well for synthetic or reduced-input workloads, but is lackluster in reality!
[Chart: normalized runtime (0-4) of the worst, chosen, and best schedules for the Synthetic Only, PARSEC Small, and PARSEC Large workloads.]
RAMP Gold Performance: FAME (RAMP Gold) vs. SAME (Simics)
- PARSEC parallel benchmarks, large input sets
- >250x faster than a full-system simulator for a 64-core target system

[Chart: geometric-mean speedup over Simics (up to ~300x) vs. number of target cores (4-64), for three Simics configurations: functional only; functional + cache/memory (g-cache); functional + cache/memory + coherency (GEMS).]
Researcher Productivity is Inversely Proportional to Latency
- Simulation latency is even more important than throughput
  - How long before the experimenter gets feedback?
  - How many experimenter-days are wasted if there was an error in the experimental setup?

      Median Latency (days) | Maximum Latency (days)
FAME: 0.04 | 0.12
SAME: 7.50 | 33.20
Fallacy: FAME is too hard
- FAME simulators are more complex, but not greatly so
  - Efficient, complete SAME simulators are also quite complex
- Most experiments only need to change the timing model
  - RAMP Gold's timing model is only 1000 lines of SystemVerilog
  - Modeled Globally Synchronized Frames [Lee08] in 3 hours and 100 LOC
- Corollary fallacy: architects don't need to write RTL
  - We design hardware; we shouldn't be scared of HDL
Fallacy: FAME Costs Too Much
- Running SAME on the cloud (EC2) is much more expensive!
  - FAME: 5 XUP boards at $750 each; $0.10 per kWh
  - SAME: EC2 Medium-High instances at $0.17 per hour

      Runtime (hours) | Cost for first experiment | Cost for next experiment | Carbon offset (trees)
FAME: 257 | $3,750 | $10 | 0.1
SAME: 73,000 | $12,500 | $12,500 | 55.0

- Are architects good stewards of the environment? SAME uses the energy of 45 seconds of the Gulf oil spill!
Fallacy: statistical sampling will save us
- Sampling may not make sense for multiprocessors
  - Timing is now architecturally visible
  - May be OK for transactional workloads
- Even if sampling is appropriate, runtime is dominated by functional warming => still need FAME
  - FAME simulator ProtoFlex (CMU) was originally designed for this purpose
- Parallel programs of the future will likely be dynamically adaptive and auto-tuned, which may render sampling useless
Challenge: Simulator Debug Loop can be Longer
- It takes 2 hours to push RAMP Gold through the CAD tools
- Software RTL simulation to debug the simulator is also very slow
- The SAME debug loop is only minutes long
- But the sheer speed of FAME eases some tasks
  - Try debugging and porting a complex parallel program in SAME
Challenge: FPGA CAD Tools
- Compared to ASIC tools, FPGA tools are immature
  - Encountered 84 formally-tracked bugs while developing RAMP Gold
  - Including several in the formal verification tools!
- By far FAME's biggest barrier (help us, industry!)
- On the bright side, the more people using FAME, the better
When should Architects still use SAME?
- SAME is still appropriate in some situations:
  - Pure functional simulation
  - ISA design
  - Uniprocessor pipeline design
- FAME is necessary for manycore research with modern applications
Conclusions
- FAME uses FPGAs to build simulators, not computers
- FAME works, it's fast, and we're using it
- SAME doesn't cut it, so use FAME!
- Thanks to the entire RAMP community for contributions to the FAME methodology
- Thanks to NSF, DARPA, Xilinx, SPARC International, IBM, Microsoft, Intel, and UC Discovery for funding support
- RAMP Gold source code is available: http://ramp.eecs.berkeley.edu/gold