EECS Electrical Engineering and Computer Sciences, Berkeley Par Lab
PARALLEL COMPUTING LABORATORY
A Case for FAME:FPGA Architecture Model Execution
Zhangxi Tan, Andrew Waterman,Henry Cook, Sarah Bird,
Krste Asanovic, David PattersonThe Parallel Computing Lab, UC Berkeley
ISCA ’10
A Brief History of Time
- Hardware prototyping initially popular for architects
- Prototyping each point in a design space is expensive
- Simulators became a popular, cost-effective alternative
  - Software Architecture Model Execution (SAME) simulators most popular
  - SAME performance scaled with uniprocessor performance
The Multicore Revolution
- Abrupt change to multicore architectures
- HW and SW systems larger, more complex
  - Timing-dependent nondeterminism
  - Dynamic code generation
  - Automatic tuning of app kernels
- We need more simulation cycles than ever
The Multicore Simulation Gap
- As the number of cores increases exponentially, time to model a target cycle increases accordingly
- SAME is difficult to parallelize because of cycle-by-cycle interactions
  - Relaxed simulation synchronization may not work
- Must bridge the simulation gap
One Decade of SAME
           Median Instructions Simulated/Benchmark | Median #Cores | Median Instructions Simulated/Core
ISCA 1998: 267M | 1 | 267M
ISCA 2008: 825M | 16 | 100M

Effect: dramatically shorter (~10 ms) simulation runs
FAME: FPGA Architecture Model Execution
- The SAME approach provides inadequate simulation throughput and latency
- Need a fundamentally new strategy to maximize useful experiments per day
  - Want the flexibility of SAME and the performance of hardware
- Ours: FPGA Architecture Model Execution (FAME) (cf. SAME, Software Architecture Model Execution)
- Why FPGAs?
  - FPGA capacity scaling with Moore's Law; can now fit a few cores on a die
  - Highly concurrent programming model with cheap synchronization
Non-FAME: FPGA Computers
- FPGA computers: using FPGAs to build a production computer
- RAMP Blue (UCB 2006)
  - 1008 MicroBlaze cores
  - No MMU, message passing only
  - Requires lots of hardware: 21 BEE2 boards (a full rack) / 84 FPGAs
  - RTL directly mapped to FPGA; time-consuming to modify
- Cool, useful, but not a flexible simulator
FAME: System Simulators in FPGAs

[Figure: two target systems. Target System A: cores with private I$/D$ sharing an L2$/interconnect to DRAM. Target System B: cores with private I$/D$ and per-core L2$ banks to DRAM. Both are modeled by one host system, the FAME simulator.]
A Vast FAME Design Space
- FAME design space even larger than SAME's
- Three dimensions of FAME simulators:
  - Direct or Decoupled: does one host cycle model one target cycle?
  - Full RTL or Abstract RTL?
  - Host Single-threaded or Host Multi-threaded?
- See paper for a FAME taxonomy!
FAME Dimension 1: Direct vs. Decoupled
- Direct FAME: compile target RTL to FPGA
  - Problem: common ASIC structures map poorly to FPGAs
  - Solution: resource-efficient multi-cycle FPGA mapping
- Decoupled FAME: decouple host cycles from target cycles
  - Full RTL still modeled, so timing accuracy still guaranteed
[Figure: a target register file with four read ports (Rd1-Rd4) and two write ports, realized on the host as a register file with two read ports and one write port, sequenced over multiple host cycles by an FSM.]
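The multi-cycle mapping in this figure can be sketched in software. This is a hypothetical illustration (the class and method names are invented here, not RAMP's code): a host register file with only two physical read ports emulates a four-read-port target register file by spending two host cycles per target cycle, as the FSM would.

```python
class DecoupledRegfile:
    """Host regfile with 2 read ports emulating a 4-read-port target regfile."""

    def __init__(self, nregs=32):
        self.regs = [0] * nregs

    def write(self, r, val):
        self.regs[r] = val

    def target_read(self, r1, r2, r3, r4):
        """One target cycle of reads, costing two host cycles."""
        # Host cycle 1: the two physical read ports serve Rd1 and Rd2.
        rd1, rd2 = self.regs[r1], self.regs[r2]
        # Host cycle 2: the FSM reuses the same two ports for Rd3 and Rd4.
        rd3, rd4 = self.regs[r3], self.regs[r4]
        host_cycles = 2  # decoupled: host cycles no longer equal target cycles
        return (rd1, rd2, rd3, rd4), host_cycles
```

Because all four reads observe the same register state, the target's RTL semantics are preserved; only host time is spent, which is why decoupled FAME keeps timing accuracy.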
FAME Dimension 2: Full RTL vs. Abstract RTL
- Decoupled FAME models the full RTL of the target machine
  - Don't have full RTL in the initial design phase
  - Full RTL is too much work for design space exploration
- Abstract FAME: model the target RTL at a high level
  - For example, split timing and functional models (à la SAME)
  - Also enables runtime parameterization: run different simulations without re-synthesizing the design
- Advantages of Abstract FAME come at a cost: model verification
  - Timing of the abstract model is not guaranteed to match the target machine
[Figure: abstraction splits the target RTL into a functional model and a timing model.]
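As a toy illustration of that split (all names, latencies, and sizes below are invented for the sketch, not RAMP Gold's actual models): a functional model computes which cache block each access touches, while a separate timing model charges target cycles. The latencies and cache size are runtime parameters, so changing them requires no re-synthesis.

```python
def simulate_cache(accesses, hit_latency=1, miss_latency=20, num_lines=8):
    """Abstract timing model of a direct-mapped cache, runtime-parameterized."""
    tags = [None] * num_lines           # timing-model state
    target_cycles = 0
    for addr in accesses:
        tag = addr // 64                # functional model: which 64B block
        line = tag % num_lines          # direct-mapped index
        if tags[line] == tag:           # timing model: charge a hit or a miss
            target_cycles += hit_latency
        else:
            target_cycles += miss_latency
            tags[line] = tag            # fill the line
    return target_cycles
```

Rerunning with, say, `miss_latency=50` models a different memory system at zero rebuild cost. That convenience is exactly what the verification cost pays for: nothing guarantees these charges match the real target RTL.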
FAME Dimension 3: Single- or Multi-threaded Host
- Problem: can't fit a big manycore on an FPGA, even abstracted
- Problem: long host latencies reduce utilization
- Solution: host multithreading
[Figure: a multithreaded emulation engine on the FPGA models target CPUs 1-4 with a single hardware pipeline holding multiple copies of CPU state (per-core PCs and GPRs, shared I$ and D$).]
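Host multithreading can be sketched as follows (a toy one-instruction ISA and invented names, purely illustrative): one host pipeline keeps per-core copies of architectural state and issues an instruction from a different target core each host cycle, round-robin, which hides long host latencies.

```python
def run_host_cycles(n_host_cycles, n_cores=4):
    """One host pipeline advances a different target core each host cycle."""
    program = [("addi", 1)] * 16              # shared toy program
    # Replicated per-core architectural state, like the figure's PCs and GPRs.
    ctxs = [{"pc": 0, "acc": 0} for _ in range(n_cores)]
    schedule = []                             # which core ran each host cycle
    for cycle in range(n_host_cycles):
        tid = cycle % n_cores                 # round-robin thread select
        ctx = ctxs[tid]
        op, imm = program[ctx["pc"]]          # fetch at this core's own PC
        if op == "addi":
            ctx["acc"] += imm                 # execute on this core's state
        ctx["pc"] += 1                        # commit; core waits for its turn
        schedule.append(tid)
    return ctxs, schedule
```

After 8 host cycles with 4 cores, each core has advanced exactly 2 target instructions; the pipeline stays busy even when one core's next step has a long host latency (e.g., a host DRAM access).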
Metrics besides Cycles: Power, Area, Cycle Time
- FAME simulators determine how many cycles a program takes to run
- Computing power/area/cycle time: SAME old story
  - Push target RTL through a VLSI flow
  - Analytical or empirical models
- Collecting event stats for model inputs is much faster than with SAME
RAMP Gold: A Multithreaded FAME Simulator
- Rapid, accurate simulation of manycore architectural ideas using FPGAs
- Initial version models 64 cores of SPARC v8 with a shared memory system on a $750 board
- Hardware FPU, MMU; boots an OS

                  Cost | Performance (MIPS) | Simulations per day
Simics (SAME):    $2,000 | 0.1 - 1 | 1
RAMP Gold (FAME): $2,000 + $750 | 50 - 100 | 100
RAMP Gold Target Machine

[Figure: 64 SPARC V8 cores, each with private I$ and D$, connected through a shared L2$/interconnect to DRAM.]
RAMP Gold Model
[Figure: a functional model pipeline with its architectural state, coupled to a timing model pipeline with its timing state; the target is the 64-core system above.]

- SPARC V8 ISA; one-socket manycore target
- Split functional/timing model, both in hardware
  - Functional model: executes the ISA
  - Timing model: captures pipeline timing detail
- Host multithreading of both functional and timing models
- Functional-first, timing-directed
- Built for Xilinx Virtex-5 systems

[RAMP Gold, DAC '10]
Case Study: Manycore OS Resource Allocation
- Spatial resource allocation in a manycore system is hard
  - Combinatorial explosion in number of apps and number of resources
- Idea: use predictive models of app performance to make it easier on the OS
  - HW partitioning for performance isolation (so models still work when apps run together)
- Problem: evaluating the effectiveness of the resulting scheduling decisions requires running hundreds of schedules for billions of cycles each
  - Simulation-bound: 8.3 CPU-years for Simics!
- See paper for app modeling strategy details
Case Study: Manycore OS Resource Allocation
[Chart: normalized runtime (0-4) of the worst, chosen, and best schedules for the Synthetic Only and PARSEC Small workloads.]
Case Study: Manycore OS Resource Allocation
The technique appears to perform very well for synthetic or reduced-input workloads, but is lackluster in reality!
[Chart: normalized runtime (0-4) of the worst, chosen, and best schedules for the Synthetic Only, PARSEC Small, and PARSEC Large workloads.]
RAMP Gold Performance: FAME (RAMP Gold) vs. SAME (Simics)
- PARSEC parallel benchmarks, large input sets
- >250x faster than a full-system simulator for a 64-core target system

[Chart: geometric-mean speedup over Simics (up to ~300x) vs. number of target cores (4-64), for three Simics configurations: functional only; functional + cache/memory (g-cache); functional + cache/memory + coherency (GEMS).]
Researcher Productivity is Inversely Proportional to Latency
- Simulation latency is even more important than throughput
  - How long before the experimenter gets feedback?
  - How many experimenter-days are wasted if there was an error in the experimental setup?

      Median Latency (days) | Maximum Latency (days)
FAME: 0.04 | 0.12
SAME: 7.50 | 33.20
Fallacy: FAME is too hard
- FAME simulators are more complex, but not greatly so
  - Efficient, complete SAME simulators are also quite complex
- Most experiments only need to change the timing model
  - RAMP Gold's timing model is only 1000 lines of SystemVerilog
  - Modeled Globally Synchronized Frames [Lee08] in 3 hours and 100 LOC
- Corollary fallacy: architects don't need to write RTL
  - We design hardware; we shouldn't be scared of HDL
Fallacy: FAME Costs Too Much
- Running SAME on the cloud (EC2) is much more expensive!
  - FAME: 5 XUP boards at $750 each; $0.10 per kWh
  - SAME: EC2 Medium-High instances at $0.17 per hour

      Runtime (hours) | Cost for first experiment | Cost for next experiment | Carbon offset (trees)
FAME: 257 | $3,750 | $10 | 0.1
SAME: 73,000 | $12,500 | $12,500 | 55.0

- Are architects good stewards of the environment? SAME uses the energy of 45 seconds of the Gulf oil spill!
Fallacy: statistical sampling will save us
- Sampling may not make sense for multiprocessors
  - Timing is now architecturally visible
  - May be OK for transactional workloads
- Even if sampling is appropriate, runtime is dominated by functional warming => still need FAME
  - FAME simulator ProtoFlex (CMU) was originally designed for this purpose
- Parallel programs of the future will likely be dynamically adaptive and auto-tuned, which may render sampling useless
Challenge: Simulator Debug Loop can be Longer
- It takes 2 hours to push RAMP Gold through the CAD tools
- Software RTL simulation to debug the simulator is also very slow
- The SAME debug loop is only minutes long
- But the sheer speed of FAME eases some tasks
  - Try debugging and porting a complex parallel program in SAME
Challenge: FPGA CAD Tools
- Compared to ASIC tools, FPGA tools are immature
  - Encountered 84 formally-tracked bugs while developing RAMP Gold
  - Including several in the formal verification tools!
- By far FAME's biggest barrier (help us, industry!)
- On the bright side, the more people using FAME, the better
When should Architects still use SAME?
- SAME is still appropriate in some situations:
  - Pure functional simulation
  - ISA design
  - Uniprocessor pipeline design
- FAME is necessary for manycore research with modern applications
Conclusions
- FAME uses FPGAs to build simulators, not computers
- FAME works, it's fast, and we're using it
- SAME doesn't cut it, so use FAME!
- Thanks to the entire RAMP community for contributions to the FAME methodology
- Thanks to NSF, DARPA, Xilinx, SPARC International, IBM, Microsoft, Intel, and UC Discovery for funding support
- RAMP Gold source code is available: http://ramp.eecs.berkeley.edu/gold