Computer Architecture Lab at
ProtoFlex: Status Update and Design Experiences
Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak Falsafi, Ken Mai
{echung, enurvita, jhoe, babak, kenmai}@ece.cmu.edu
PROTOFLEX
Our work in this area has been supported in part by NSF, IBM, Intel, and Xilinx.
Full-system Functional Simulation
• Effective substitute for real (or non-existent) HW
– Can boot OS, run commercial apps
– Important in SW research & computer architecture
• But too slow for large-scale MP studies
– Multicore won’t help existing tools
– Is a serious challenge for large-MP (1000-way) simulation
REVIEW
Alternative: FPGA-based simulation
• Only 10x slower in clock freq than custom HW
• But FPGAs harder to use than software
– Simulating large-MP (100- to 1000-way) can’t be done trivially
– Full-system support needs devices + entire ISA
The “build-all” strategy in FPGAs = significant effort + resources
[Figure: a full-system target built entirely in FPGAs. Labels: CPUs, memory, PCI bus, Ethernet controller, graphics card, I/O MMU controller, disks, DMA controller, IRQ controller, terminal, SCSI controller.]
Reducing complexity w/ virtualization
Hybrid Full-System Simulation
• Making multiple physical resources appear as a single logical resource
• Only frequent behaviors hosted in FPGA. Relegate infrequent to SW.
[Figure: target full-system behaviors split across host resources: frequent on the FPGA, infrequent in software.]
Virtualized MP Simulation
• Making a single physical resource appear as multiple logical resources
• Logical CPUs multiplexed onto fewer physical CPUs.
[Figure: several target CPUs mapped onto host resources consisting of 1 FPGA CPU.]
Outline
• Hybrid Full-System Simulation
• Virtualized Multiprocessor Simulation
• BlueSPARC Implementation
• Design Experiences
• Future Work
Hybrid Full-System Simulation
• 3 ways to map target component to hybrid simulation host: (1) FPGA-only, (2) simulation-only, (3) transplantable
• CPUs can fall back to SW by “transplanting” between hosts
– Only common-case instructions/behaviors implemented in FPGA
– Remaining behaviors relegated to SW (turns out to include many of the complex ones)
[Figure: software-only simulation versus hybrid simulation. In software-only simulation, the CPUs, memory, MMU, Fibre, graphics, NIC, PCI, terminal, and SCSI all live on the software full-system simulator host. In hybrid simulation, an FPGA host is added alongside the software host, and an I/O instruction triggers a CPU transplant between the two.]
Transplants reduce full-system design effort
Outline
• Hybrid Full-System Simulation
• Virtualized Multiprocessor Simulation
• BlueSPARC Implementation
• Design Experiences
• Future Work
Virtualized Multiprocessor Simulation
• Problem: large-scale simulation configurations challenging to implement in FPGAs using structurally-accurate approaches
• Structural accuracy: 1-to-1 mapping between target and host CPUs
– Pros: fastest possible solution, only 10x slower than real HW
– Cons: difficult to build for large-scale configs (e.g., >100-way)
[Figure: # processors in target model vs. # host processors implemented in FPGA. The 1-to-1 line is marked "10x slower than real HW".]
Virtualized Multiprocessor Simulation
Advantages:
• Decouple logical target system size from FPGA host size
• Scale FPGA host as-needed to deliver required performance
• High target-to-host ratio (TH) simplifies/consolidates HW (e.g., fewer # nodes in cache coherence, interconnect)
• Host interleaving: multiplex target processors onto fewer # FPGA-hosted processors
[Figure: # processors in target model vs. # host “engines” implemented in FPGA. The 4-to-1 line is marked "40x slower than real HW".]
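The 1-to-1 (10x) and 4-to-1 (40x) data points above are consistent with a simple back-of-the-envelope model. The sketch below assumes that per-CPU slowdown grows linearly with the TH ratio from the 10x baseline; the function names and the linear model are illustrative assumptions, not something the slides state.

```python
import math

BASE_SLOWDOWN = 10.0  # 1-to-1 mapping: ~10x slower than real HW (from the slides)


def per_cpu_slowdown(th_ratio: int) -> float:
    """Estimated slowdown of each target CPU versus real HW.

    Assumption: one engine's cycles are shared evenly among the
    th_ratio target CPUs interleaved onto it.
    """
    if th_ratio < 1:
        raise ValueError("TH ratio must be at least 1")
    return BASE_SLOWDOWN * th_ratio


def engines_needed(target_cpus: int, th_ratio: int) -> int:
    """Number of FPGA host engines needed to cover every target CPU."""
    return math.ceil(target_cpus / th_ratio)


for th in (1, 4, 16):
    print(f"{th}-to-1: ~{per_cpu_slowdown(th):.0f}x slower, "
          f"{engines_needed(16, th)} engine(s) for a 16-CPU target")
```

Under this model, raising the TH ratio trades simulation speed for a smaller FPGA host, which is the decoupling the slide describes.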
What’s inside an FPGA host processor?
• An “engine” that architecturally executes multiple contexts
– Existing multithreaded designs are good candidates
– Choice is influenced by TH ratio (target-to-host ratio)
• We propose an interleaved pipeline (e.g., TERA-style)
– Best suited for high TH ratio
– Switch in new CPU context on each cycle
– Simple, efficient design w/ no stalling or forwarding
– Long-latency tolerance (e.g., cache miss, transplants)
– Coherence is “free” between CPUs mapped onto same engine
[Figure: several target CPUs multiplexed onto one host CPU.]
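The interleaving idea can be illustrated with a small software model. This is a minimal sketch, not the BlueSPARC design: it assumes a round-robin selector that issues one instruction per cycle and skips any context still waiting on a long-latency event. `Context`, `run_interleaved`, and `miss_pcs` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Context:
    cpu_id: int
    pc: int = 0
    ready_at: int = 0  # first cycle at which this context may issue again


def run_interleaved(contexts, cycles, miss_pcs=None, miss_latency=20):
    """Issue at most one instruction per cycle, rotating over ready contexts.

    miss_pcs maps cpu_id -> set of PCs that incur a long-latency event
    (a stand-in for a cache miss or a transplant). Returns the cpu_id
    issued on each cycle, or None for a cycle where no context was ready.
    """
    miss_pcs = miss_pcs or {}
    n = len(contexts)
    start = 0
    trace = []
    for cycle in range(cycles):
        issued = None
        for offset in range(n):
            idx = (start + offset) % n
            if contexts[idx].ready_at <= cycle:
                issued = idx
                break
        if issued is None:
            trace.append(None)
            continue
        ctx = contexts[issued]
        trace.append(ctx.cpu_id)
        if ctx.pc in miss_pcs.get(ctx.cpu_id, ()):
            # Park only this context; the others keep the pipeline busy.
            ctx.ready_at = cycle + miss_latency
        ctx.pc += 1
        start = (issued + 1) % n
    return trace


print(run_interleaved([Context(i) for i in range(4)], 8))
```

Because consecutive cycles belong to different contexts, a parked context costs the engine nothing as long as at least one other context is ready, which is the long-latency tolerance listed above.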
Outline
• Hybrid Full-System Simulation
• Virtualized Multiprocessor Simulation
• BlueSPARC Implementation
• Design Experiences
• Future Work
Implementation: BlueSPARC simulator
Target: 16-CPU shared-memory UltraSPARC III server (SunFire 3800)
[Figure: the target's CPUs, memory, MMU, DMA, graphics, NIC, SCSI, terminal, and PCI are mapped onto two hosts. BEE2 platform (Xilinx XCV2P70): interleaved pipeline with 16 CPU contexts, PowerPC, DDR2 memory. Simics (PC): simulated I/O devices.]
BlueSPARC Simulator (continued)
Processing nodes: 16 64-bit UltraSPARC III contexts; 14-stage instruction-interleaved pipeline
L1 caches: split I/D, 64KB, 64B, direct-mapped, writeback; non-blocking loads/stores; 16-entry MSHR, 4-entry store buffer
Clock frequency: 90MHz on Xilinx V2P70
Main memory: 4GB total
Resources (Xilinx V2P70): 33,508 LUTs (50%), 222 BRAMs (67%) w/o stats+debug; 43,206 LUTs (65%), 238 BRAMs (72%)
Instrumentation: all internal state fully traceable; attachable to FPGA-based CMP cache simulator*
EDA tools: Xilinx EDK 9.2i, Bluespec System Verilog
Statistics: 25K lines Bluespec, 511 rules, 89 module types
Checkpointing: fully compatible with Simics checkpoints; can load AND generate checkpoints
BlueSPARC host microarchitecture
[Figure: 14-stage host pipeline, with stages grouped as 1 | 2, 3 | 4, 5 | 6 | 7 | 8 | 9, 10, 11 | 12, 13 | 14. Labeled units: Context Selector; PC and state (x16); 64KB I-cache; I-TLB, 16-entry direct-mapped (x16); I-TLB, 128-entry 2-way (x16); Decode; RegFile; ALU1; ALU2; 64KB D-cache; D-TLB, 16-entry fully-associative (x16); D-TLB, 512-entry 2-way (x16); Trap Unit; Writeback Unit; Assist Unit; Transplant Unit; PowerPC405 (transplant service processor). Legend: normal pipeline stage, multi-context state, transplant support unit.]
64-bit ISA, SW-visible MMU, complex memory => high # of pipeline stages
Hybrid host partitioning choices
BlueSPARC (FPGA), on-chip
• add/sub/shift/logical
• multiply/divide
• register windows
• 38/103 SPARC ASIs
• interprocessor x-calls
• device interrupts
• I-/D-MMU + tlb miss
• Loads/stores/atomics
• VIS block memory
Micro-transplant (on-chip simulation, PowerPC405)
• 65/103 SPARC ASIs
• VIS I/II multimedia
• FP add/sub/mul/div + traps
• FP/INT conversion
• trap on integer arithmetic
• alignment
• fixed-point arithmetic
• tlb/cache diagnostics
• tlb demap
Transplant (off-chip simulation, Simics on PC)
• PCI bus
• ISP2200 Fibre Channel
• I21152 PCI bridge
• IRQ bus
• Fibre Channel SCSI disk/cdrom
• Text Console
• SBBC PCI device
• Serengeti I/O PROM
• Cheerio-hme NIC
• SCSI bus
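The three-way partitioning above amounts to a dispatch decision per behavior. The sketch below is illustrative only: the behavior keys are hypothetical labels derived from the slide's lists (only a subset is shown), and the rule that anything unrecognized falls through to the off-chip software simulator is an assumption of this sketch, not a documented BlueSPARC policy.

```python
from enum import Enum


class Host(Enum):
    FPGA = "BlueSPARC pipeline (FPGA)"
    MICRO_TRANSPLANT = "micro-transplant (PowerPC405, on-chip)"
    TRANSPLANT = "transplant (Simics on PC, off-chip)"


# Hypothetical keys for a subset of the behaviors listed on the slide.
FPGA_BEHAVIORS = {
    "add", "sub", "shift", "logical", "multiply", "divide",
    "register_window", "load", "store", "atomic", "tlb_miss",
}
MICRO_TRANSPLANT_BEHAVIORS = {
    "fp_add", "fp_sub", "fp_mul", "fp_div", "fp_int_convert",
    "vis_multimedia", "tlb_demap", "cache_diagnostic",
}


def select_host(behavior: str) -> Host:
    """Pick the cheapest host that implements the behavior."""
    if behavior in FPGA_BEHAVIORS:
        return Host.FPGA
    if behavior in MICRO_TRANSPLANT_BEHAVIORS:
        return Host.MICRO_TRANSPLANT
    # Assumed default: devices and everything else go to full software.
    return Host.TRANSPLANT


for b in ("add", "fp_div", "pci_bus_access"):
    print(f"{b}: {select_host(b).value}")
```

Ordering the checks from cheapest to most expensive host mirrors the slide's intent: frequent behaviors stay in hardware, and only rare ones pay the transplant cost.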
Performance
• Perf comparable to Simics-fast
• 39x speedup on average over Simics-trace
[Figure: bar chart of MIPS (axis 0 to 70) for oracle, bzip2, crafty, gcc, gzip, parser, vortex, and the average, comparing BlueSPARC (90MHz), Simics-fast (2.0GHz C2Duo), and Simics-trace. One bar carries the data label 1.18.]
Outline
• Hybrid Full-System Simulation
• Virtualized Multiprocessor Simulation
• BlueSPARC Implementation
• Design Experiences
• Future Work
Design experiences
2007 Timeline
• January-February
– Initial virtualization ideas
– Analysis + simulation of interleaving
– ISA profiling of apps for hybrid partitioning
– Initial specifications for host pipeline
• March
– Simics API wrappers + software experiments
• April-November
– BlueSPARC RTL development
– Validation tools
• November-December
– Host performance instrumentation and writeup*
* To appear in FPGA’08
Design experiences (cont)
• What was important:
– Developing effective validation strategies (more on next slide)
– Existing reference model (Simics) to study and compare against
– Efficient mapping of state to FPGA resources (e.g., 16 PCs mapped to 16-bit LUT-based distributed RAM)
– Coping with long Xilinx builds by easing up on timing constraints
– “Judicious” Bluespec
• What was NOT important:
– Meeting 100MHz timing for every Xilinx build (i.e., deep pipelining)
– Implementing every function as efficiently/fast as possible
Validation
• THE most challenging aspect of this project
• Strategies used
– Auto-generated torture tests + hand-written test cases
– Auto-ported test cases from OpenSPARC T1 framework to UltraSPARC III
– Validated single-threaded + multithreaded ISA execution against Simics (both in Verilog simulations and in FPGA)
– Flight data recorder for non-deterministic interleaving of CPUs
– Batched Verilog simulations w/ varying parameters
– Validated non-blocking memory system with “shadow” flat memories during Verilog simulation (caught self-modifying code bugs)
– > 200 synthesizable assertions to Chipscope
– Built-in deadlock/error detectors
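The “shadow” flat memory strategy can be sketched in software as follows. This is a minimal illustration under stated assumptions: every committed store is mirrored into a flat map, and every load value returned by the memory system under test is compared against it. The class and method names are hypothetical; the actual checks ran inside Verilog simulation.

```python
class ShadowMismatch(Exception):
    """Raised when the memory system returns data the shadow copy disagrees with."""


class ShadowMemoryChecker:
    """Flat reference memory that mirrors every committed store."""

    def __init__(self):
        self._flat = {}  # address -> last committed value

    def record_store(self, addr: int, value: int) -> None:
        self._flat[addr] = value

    def check_load(self, addr: int, observed: int) -> None:
        if addr not in self._flat:
            return  # never written in this run: nothing to compare against
        expected = self._flat[addr]
        if observed != expected:
            raise ShadowMismatch(
                f"load @0x{addr:x}: got 0x{observed:x}, expected 0x{expected:x}"
            )


checker = ShadowMemoryChecker()
checker.record_store(0x1000, 0xCAFE)
checker.check_load(0x1000, 0xCAFE)  # passes silently
```

A flat map has no caches, MSHRs, or store buffers to get wrong, so any disagreement points at the non-blocking memory system rather than at the checker.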
In retrospect…
• What I would have done differently to begin with
– Write entire USIII functional model myself in software first
– Take more advantage of Verilog PLI for validation (interface to C)
– Don’t over-engineer HDL
– Don’t upgrade tools unless necessary (e.g., trial license runs out)
– Validation infrastructure w/ batching capabilities (do earlier!)
– Automated “binary search” tool for bug hunting
– Re-write DDR2 Async FIFOs without BRAMs
– Fast memory checkpoint loader (3GB images per run = 25 min)
– Simple, correct >> Fast, buggy
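One way such a “binary search” tool could work is sketched below. It assumes a hypothetical predicate `matches(n)` that reruns the design for `n` instructions from a checkpoint and reports whether its state still agrees with the reference model, and it assumes that once the state diverges it stays diverged. Neither assumption comes from the slides.

```python
def first_divergence(matches, lo: int, hi: int) -> int:
    """Return the smallest n in (lo, hi] for which matches(n) is False.

    Requires matches(lo) to be True and matches(hi) to be False, and
    assumes divergence is monotone (no later re-convergence).
    """
    if not matches(lo) or matches(hi):
        raise ValueError("need matches(lo) == True and matches(hi) == False")
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if matches(mid):
            lo = mid  # still correct at mid: the bug is later
        else:
            hi = mid  # already wrong at mid: the bug is at or before mid
    return hi


# Toy stand-in for "compare FPGA state against Simics after n instructions".
print(first_divergence(lambda n: n < 1234, 0, 1_000_000))
```

Each probe halves the search window, so locating one bad instruction in a million-instruction run takes about 20 reruns instead of a linear scan.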
Future Work
• Scalability
– Burden-of-proof for 1000-way simulation?
– Investigate cache-coherence/interconnect mechanisms for combining multiple interleaved pipelines
• Virtualization design spaces
– On-chip storage virtualization (e.g., architectural state)
– Memory + disk capacity (e.g., HW-based demand paging?)
– Virtualizing instrumentation (e.g., paging functional cache tags)
• Fast instrumentation tools
– Understanding systems at multiple levels of abstraction (beyond ISA)
– Validation+analysis: beyond ISA, how to sanity-check app+sys behavior?
BlueSPARC Demo on BEE2
• Demo application
– On-Line Transaction Processing benchmark (TPC-C) in Oracle
– Runs in Solaris 8 (unmodified binary)
– FPGA + Memory directly loaded from Simics checkpoint
[Figure: annotated BEE2 platform. Callouts: 4 DDR2 controllers + 4 GB memory; Ethernet (to Simics on PC); Virtex-II Pro 70 (PowerPC & BlueSPARC); RS232 (debugging).]
Conclusion
• “Build-all” simulation approach in FPGAs is challenging
• Two virtualization techniques for reducing complexity
– Hybrid: attain full-system support by deferring rare behaviors to SW
– Virtualized MP: decouples target system size from host size
• BlueSPARC proof-of-concept
– Models 16-cpu UltraSPARC III server
– Comparable perf to Simics-fast, 39x on avg faster than Simics-trace
• Thanks! Questions? [email protected]
• PROTOFLEX (http://www.ece.cmu.edu/~simflex/protoflex.html)