FPGA-based Fast, Cycle-Accurate Full System Simulators
Derek Chiou, Huzefa Sanjeliwala, Dam Sunwoo, John Xu and Nikhil Patil
University of Texas at Austin
Wouldn’t it be nice to have a simulator that is
Fast: 10M cycles per second, fast enough to run real datasets to completion
Accurate: produces cycle-accurate numbers
Complete: runs real operating systems and applications
Transparent: can see everything in the processor, with no performance hit
Inexpensive: need thousands
Usable: quick changes, easy to see performance
Software?
Software-based simulators inherently cannot achieve this speed and be cycle-accurate at the same time
  A 128-entry, fully-associative TLB at the limit requires 128 load and compare operations
  Arbitration requires first looking across multiple bidders
There are lots of these structures in a complex processor!
  Thousands to tens of thousands of events
Even with perfect parallelism, need a lot of CPUs
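To make the TLB cost concrete, here is a minimal Python sketch of one lookup in a software model of a fully-associative TLB. The names (`SoftwareTLB`, `compares`) are illustrative assumptions, not taken from any particular simulator. Hardware compares all 128 tags in parallel; the software loop has to visit them one at a time.

```python
class SoftwareTLB:
    """Fully-associative TLB modeled in software (hypothetical names)."""

    def __init__(self, entries=128):
        # Each slot holds (virtual page number, physical page number) or None.
        self.slots = [None] * entries
        self.compares = 0  # load/compare operations performed so far

    def insert(self, index, vpn, ppn):
        self.slots[index] = (vpn, ppn)

    def lookup(self, vpn):
        """Return the physical page number for vpn, or None on a miss."""
        for slot in self.slots:
            self.compares += 1  # one load + compare per entry visited
            if slot is not None and slot[0] == vpn:
                return slot[1]
        return None


tlb = SoftwareTLB()
tlb.insert(127, vpn=0x42, ppn=0x99)  # worst case: the match is in the last entry
hit = tlb.lookup(0x42)               # visits all 128 entries
miss = tlb.lookup(0x43)              # a miss also visits all 128 entries
print(hit, miss, tlb.compares)
```

A worst-case hit and a miss each cost 128 sequential compares here, and that is for a single structure handling a single access.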
Hardware
Clearly, hardware is necessary
Reconfigurability (read: FPGAs) is required for flexibility
But how?
Full Implementation?
Take RTL code, compile for FPGA
Implementing a full system in an FPGA is prohibitively large
  Shih-Lin Lu’s group has a single original Pentium (586, 3.1M transistors) in the largest Xilinx FPGA
  Emulate Pentium M in a single FPGA? 140M transistors
Instead, what about
  Accurately (to cycle resolution) simulating its behavior
  Running real, unmodified applications and OS
  With full visibility at full speed?
If execution speeds are reasonable, do I care?
Can I Partition the Problem?
A 64b adder is way too big to be implemented as a single monolithic entity
But I can implement 64 1b adders very easily, with very little state and complexity
Partitioning is good if possible
But how to partition?
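As an illustration of the adder example (not from the slides), the sketch below builds a 64-bit add out of 64 chained copies of a 1-bit full adder. The function names `full_adder` and `ripple_add` are assumptions chosen for this example.

```python
def full_adder(a, b, carry_in):
    """One 1-bit full adder: returns (sum_bit, carry_out)."""
    sum_bit = a ^ b ^ carry_in
    carry_out = (a & b) | (carry_in & (a ^ b))
    return sum_bit, carry_out


def ripple_add(x, y, width=64):
    """Add two width-bit integers using `width` chained 1-bit adders."""
    result = 0
    carry = 0
    for i in range(width):
        bit, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= bit << i
    return result, carry  # carry is the final carry-out


total, carry_out = ripple_add(2**64 - 1, 1)  # wraps around to 0 with a carry-out
print(total, carry_out)
```

Each 1-bit piece has three inputs and two outputs, so it is trivial on its own; the only coupling between pieces is the carry signal.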
Classic Partitioning
On module boundary
  Caches, memories, ALUs, processors, memory controllers
Partitioning doesn’t save state or complexity, but enables the design to be partitioned over multiple FPGAs and software
Problems?
[Figure: pipelined processor datapath: instruction cache/memory, GPR file, immediate extend, ALU, data cache/memory, and bypass paths]
Functional/Timing Partition
Functional model simulates ISA
Timing model simulates micro-architecture
Asim and SimpleScalar are written like this
  Software
  One processor
  Lots of interaction between functional and timing
  Intended to avoid rollback of any component
Put timing model in FPGA???
  Parallel component executed in hardware!
UT FAST Partitioning
On ISA/micro-architecture boundary (ISA + FPGA)
Instruction trace generated by ISA simulator (e.g., Bochs, Simics)
  Fast, full system, but no timing information (could be hardware!!!)
What do we need to simulate in the timing model?
[Figure: pipelined processor datapath fed by an instruction trace: instruction memory, GPR file, immediate extend, ALU, data memory, and bypass paths]
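The ISA/micro-architecture split can be sketched in a few lines of Python. This is a toy under stated assumptions, not the FAST implementation: the three-instruction ISA, the `TraceEntry` record format, and the single load-use stall rule are all invented for illustration. The point is that the functional model computes values but knows nothing about cycles, while the timing model counts cycles from the trace alone and never touches data.

```python
from collections import namedtuple

# One trace record per executed instruction (hypothetical format).
TraceEntry = namedtuple("TraceEntry", "pc op dest srcs")


def functional_model(program, regs):
    """Execute a toy ISA and yield a trace; knows nothing about timing."""
    pc = 0
    while pc < len(program):
        op, dest, a, b = program[pc]
        if op == "addi":      # dest = regs[a] + immediate b
            regs[dest] = regs[a] + b
            srcs = (a,)
        elif op == "add":     # dest = regs[a] + regs[b]
            regs[dest] = regs[a] + regs[b]
            srcs = (a, b)
        elif op == "load":    # toy load: dest = constant b, address register a
            regs[dest] = b
            srcs = (a,)
        else:
            raise ValueError(op)
        yield TraceEntry(pc, op, dest, srcs)
        pc += 1


def timing_model(trace, load_use_penalty=1):
    """Count cycles from the trace alone; never computes data values."""
    cycles = 0
    prev = None
    for entry in trace:
        cycles += 1  # one issue slot per instruction
        if prev is not None and prev.op == "load" and prev.dest in entry.srcs:
            cycles += load_use_penalty  # stall on a load-use hazard
        prev = entry
    return cycles


program = [
    ("addi", 1, 0, 5),  # r1 = r0 + 5
    ("load", 2, 1, 7),  # r2 = 7 (toy load)
    ("add", 3, 2, 1),   # r3 = r2 + r1, depends on the load
]
regs = [0] * 4
cycles = timing_model(functional_model(program, regs))
print(regs[3], cycles)
```

Because the only link between the two halves is the trace, either half could be replaced (for example, moving the timing model into an FPGA) without changing the other.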
UT FAST Complex Processors
Straight pipelines are easy; what about
Caches/TLBs?
  Keep tags, pass address (virtual and physical if necessary)
  Hits, misses determined but don’t need data
Superscalar (multiple issue)?
  “Fetch and issue” multiple instructions, assuming they meet boundary constraints
  Multiple “functional units”
  Reservation stations
  Reorder buffer
  Pipeline control along with instructions
NO DATAPATH!!!
Timing model speed almost unimportant!
  Multi-cycle memories to create more ports
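The "keep tags, no data" idea for caches can be sketched as follows. This is a minimal illustration assuming a direct-mapped cache with made-up sizes; the class name `TagOnlyCache` and its fields are hypothetical. The model stores one tag per line and no data array, yet it still classifies every access as a hit or a miss.

```python
class TagOnlyCache:
    """Direct-mapped cache timing model that stores tags but no data."""

    def __init__(self, num_lines=4, line_bytes=16):
        self.num_lines = num_lines
        self.line_bytes = line_bytes
        self.tags = [None] * num_lines  # one tag per line; no data array
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        """Return True on a hit, False on a miss; fill the line on a miss."""
        line_addr = addr // self.line_bytes
        index = line_addr % self.num_lines
        tag = line_addr // self.num_lines
        if self.tags[index] == tag:
            self.hits += 1
            return True
        self.tags[index] = tag  # replace whatever was in this line
        self.misses += 1
        return False


cache = TagOnlyCache()
# 0x00 and 0x04 share a line; 0x40 maps to the same index with a different tag.
results = [cache.access(a) for a in (0x00, 0x04, 0x40, 0x00)]
print(results, cache.hits, cache.misses)
```

The address stream comes from the trace, so the timing model gets correct hit/miss behavior while holding only a small fraction of the state a real cache would need.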
[Figure: superscalar timing model block diagram: instruction stream, I-Fetch/I-Cache, I-Decode, GPR and FPR rename and read stages separated by delay stages, ALU, branch, load/store and FPU units, D-Cache, reorder buffer, BIU, memory controller, memory, disk, network]
Example of Complication: Branch Prediction
Must process mis-speculated instructions in timing model
  Implement BP in timing model
  Timing model forces ISA simulator to mis-speculate
  Rollback, restore
    Requires support from ISA simulator
Branch predictor predictor in ISA simulator?
BP only works in processor if it’s fairly accurate
  FAST simulators take advantage of the fact that most of the time the micro-architecture is on the right path
  Most complexity (BP, parallelism) can be handled this way
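The mis-speculate, rollback, restore sequence can be sketched as below. Everything here is an assumption made for illustration: the two-instruction toy ISA, the `checkpoint`/`restore` interface, the always-not-taken predictor, and the fixed flush penalty. It is not the interface of Bochs or of the FAST simulators. For simplicity the wrong-path instructions are only recorded, where a real timing model would also let them occupy pipeline resources.

```python
class FunctionalModel:
    """Toy ISA simulator with checkpoint/restore (hypothetical interface)."""

    def __init__(self, program):
        self.program = program
        self.state = {"pc": 0, "acc": 0}

    def checkpoint(self):
        return dict(self.state)  # copy of the architectural state

    def restore(self, saved):
        self.state = saved

    def step(self):
        """Execute one instruction; return (pc, op, next pc)."""
        pc = self.state["pc"]
        op, arg = self.program[pc]
        next_pc = pc + 1
        if op == "add":
            self.state["acc"] += arg
        elif op == "bnez" and self.state["acc"] != 0:
            next_pc = arg  # branch taken
        self.state["pc"] = next_pc
        return pc, op, next_pc


def simulate(func, wrong_path_len=2, flush_penalty=3):
    """Timing model with an always-not-taken predictor (an assumption)."""
    cycles = 0
    wrong_path = []
    while func.state["pc"] < len(func.program):
        pc, op, actual_next = func.step()
        cycles += 1
        predicted_next = pc + 1
        if op == "bnez" and predicted_next != actual_next:
            saved = func.checkpoint()          # state on the correct path
            func.state["pc"] = predicted_next  # force mis-speculation
            for _ in range(wrong_path_len):
                if func.state["pc"] >= len(func.program):
                    break
                wrong_path.append(func.step()[0])
            func.restore(saved)                # roll back wrong-path effects
            cycles += flush_penalty
    return cycles, wrong_path


program = [
    ("add", 1),    # 0: acc = 1
    ("bnez", 4),   # 1: taken, but the predictor says not taken
    ("add", 100),  # 2: wrong path
    ("add", 100),  # 3: wrong path
    ("add", 10),   # 4: branch target
]
func = FunctionalModel(program)
cycles, wrong_path = simulate(func)
print(cycles, wrong_path, func.state["acc"])
```

The checkpoint is taken only when a misprediction is detected, so correctly predicted branches cost the functional model nothing extra, which is the common case the slide relies on.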
Status & Conclusions
1MHz to 100MHz, cycle-accurate, full-system, multiprocessor simulator
  Well, not quite that fast right now, but we are using an embedded 300MHz PowerPC 405 to simplify
x86, boots Linux and Windows, targeting 80486 to Pentium D-like and beyond (Dam Sunwoo, Nikhil Patil)
  Bochs functional model (looking at much faster models)
  Heavily modified for instruction trace and rollback
Branch-predicted superscalar model almost done in Bluespec and Verilog (John Xu, Huzefa Sanjeliwala)
  Have straight pipeline 486 model with TLBs and caches
Statistics gathered in hardware
  Very little, if any, probe effect
Tools to semi-automate micro-architectural and ISA-level exploration
  Orthogonality of models makes both simpler