
FPGA-based Fast, Cycle-Accurate Full System Simulators

Derek Chiou, Huzefa Sanjeliwala, Dam Sunwoo, John Xu and Nikhil Patil

University of Texas at Austin

Wouldn’t it be nice to have a simulator that is

Fast: 10M cycles per second, fast enough to run real datasets to completion
Accurate: produces cycle-accurate numbers
Complete: runs real operating systems and applications
Transparent: can see everything in the processor, with no performance hit
Inexpensive: we need thousands of them
Usable: quick changes, easy to see performance

Software?

Software-based simulators inherently cannot achieve this speed and be cycle-accurate at the same time
  A 128-entry, fully-associative TLB requires, at the limit, 128 load-and-compare operations per lookup
  Arbitration requires first looking across multiple bidders
There are lots of these structures in a complex processor!
  Thousands to tens of thousands of events
Even with perfect parallelism, that needs a lot of CPUs
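To make the TLB cost concrete, here is a minimal Python sketch of how a software simulator might model a fully-associative lookup. The entry layout, the function name `tlb_lookup`, and the address values are illustrative assumptions, not taken from the slides; only the 128-entry size comes from the text. Hardware does all the comparisons at once, while software walks the entries one by one.

```python
TLB_ENTRIES = 128  # entry count taken from the slide


def tlb_lookup(tlb, vpn):
    """Search every entry in turn; return (ppn or None, compare count).

    Each loop iteration is one load-and-compare, which is the work
    the slide says software must do serially.
    """
    compares = 0
    for entry_vpn, entry_ppn in tlb:
        compares += 1
        if entry_vpn == vpn:
            return entry_ppn, compares
    return None, compares


# Hypothetical TLB contents: virtual page n maps to physical page n + 0x1000.
tlb = [(vpn, vpn + 0x1000) for vpn in range(TLB_ENTRIES)]

hit_ppn, hit_cost = tlb_lookup(tlb, 127)     # match in the last entry
miss_ppn, miss_cost = tlb_lookup(tlb, 9999)  # no match at all
print(hit_cost, miss_cost)
```

Both the worst-case hit and the miss cost 128 sequential steps for a single lookup in a single simulated structure.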

Hardware

Clearly, hardware is necessary
Reconfigurability (read: FPGAs) is required for flexibility
But how?

Full Implementation?

Take RTL code, compile for FPGA
Implementing a full system in an FPGA is prohibitively large
  Shih-Lien Lu’s group has a single original Pentium (586, 3.1M transistors) in the largest Xilinx FPGA
  Emulate a Pentium M in a single FPGA? 140M transistors
Instead, what about
  Accurately (to cycle resolution) simulating its behavior
  Running real, unmodified applications and OS
  With full visibility, at full speed?
If execution speeds are reasonable, do I care?


Can I Partition the Problem?

A 64b adder is way too big to be implemented as a single monolithic entity
But I can implement 64 1b adders very easily, with very little state and complexity
Partitioning is good if possible
But how to partition?
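The adder analogy can be sketched in a few lines of Python: a 64-bit add built by chaining 64 one-bit full adders (a ripple-carry adder). The function names are illustrative assumptions; the point is only that each partition is tiny and carries a single bit of state to its neighbor.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum bit, carry out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_add(x, y, width=64):
    """Add two width-bit values by chaining one-bit full adders."""
    carry = 0
    result = 0
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry


total, carry_out = ripple_add(123456789, 987654321)
wrapped, overflow = ripple_add(2**64 - 1, 1)
print(total, carry_out, wrapped, overflow)
```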

Classic Partitioning

On module boundaries
  Caches, memories, ALUs, processors, memory controllers
Partitioning doesn’t save state or complexity, but enables the design to be partitioned over multiple FPGAs and software
Problems?

[Figure: pipelined processor datapath, showing PC, instruction cache/memory, GPR file, immediate extend, ALU, data cache/memory, pipeline registers (IR, A, B, Y, MD1, MD2), and bypass paths]

Functional/Timing Partition

Functional model simulates the ISA
Timing model simulates the micro-architecture
Asim and SimpleScalar are written like this
  Software
  One processor
  Lots of interaction between functional and timing models
  Intended to avoid rollback of any component
Put the timing model in an FPGA???
  Parallel component executed in hardware!

UT FAST Partitioning

Partition on the ISA/micro-architecture boundary (ISA + FPGA)
Instruction trace generated by an ISA simulator (e.g., Bochs, Simics)
  Fast, full system, but no timing information (could be hardware!!!)
What do we need to simulate in the timing model?
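As a rough illustration of this split, the Python sketch below pairs a toy functional model, which computes real values and emits an instruction trace, with a toy timing model that sees only the trace and counts cycles. The three-instruction toy ISA, the `TraceEntry` fields, and the one-cycle load-use stall rule are all assumptions made for the example, not details of the FAST design.

```python
from collections import namedtuple

# Hypothetical trace record: no data values, only what timing needs.
TraceEntry = namedtuple("TraceEntry", "pc op dst srcs")


def functional_model(program, regs, mem):
    """Execute a straight-line toy program and yield a trace entry per instruction."""
    for pc, (op, dst, a, b) in enumerate(program):
        if op == "addi":            # regs[dst] = regs[a] + immediate b
            regs[dst] = regs[a] + b
            srcs = (a,)
        elif op == "add":           # regs[dst] = regs[a] + regs[b]
            regs[dst] = regs[a] + regs[b]
            srcs = (a, b)
        elif op == "load":          # regs[dst] = mem[regs[a] + offset b]
            regs[dst] = mem[regs[a] + b]
            srcs = (a,)
        else:
            raise ValueError(op)
        yield TraceEntry(pc, op, dst, srcs)


def timing_model(trace):
    """Count cycles for an assumed in-order pipeline: one cycle per
    instruction plus one stall when an instruction reads the destination
    of the load immediately before it. Never touches data values."""
    cycles = 0
    prev = None
    for entry in trace:
        cycles += 1
        if prev is not None and prev.op == "load" and prev.dst in entry.srcs:
            cycles += 1
        prev = entry
    return cycles


program = [
    ("addi", 1, 0, 4),   # r1 = r0 + 4
    ("load", 2, 1, 0),   # r2 = mem[r1 + 0]
    ("add", 3, 2, 1),    # r3 = r2 + r1 (uses the load result: stall)
    ("add", 4, 3, 3),    # r4 = r3 + r3
]
regs = [0] * 8
mem = {4: 10}
cycles = timing_model(functional_model(program, regs, mem))
print(cycles, regs[4])
```

The timing model only needs register names and opcodes to find the hazard; the values live entirely on the functional side.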

[Figure: pipelined processor datapath fed by an instruction trace, with instruction memory, GPR file, immediate extend, ALU, data memory, and bypass paths]

UT FAST Complex Processors

Straight pipelines are easy; what about
Caches/TLBs?
  Keep tags, pass address (virtual and physical if necessary)
  Hits and misses are determined, but the data is not needed
Superscalar (multiple issue)?
  “Fetch and issue” multiple instructions, assuming they meet boundary constraints
  Multiple “functional units”
  Reservation stations
  Reorder buffer
  Pipeline control along with instructions
NO DATAPATH!!!
Timing model speed almost unimportant!
  Multi-cycle memories to create more ports
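The "keep tags, no data" idea for caches can be sketched as follows. This is a minimal Python model of a direct-mapped cache that stores only tags; the class name, the direct-mapped organization, and the geometry are assumptions chosen for brevity, since the slides do not specify the cache structure.

```python
class TagOnlyCache:
    """Direct-mapped cache model that stores tags but no data.

    It can answer hit or miss for any address, which is all a timing
    model needs; the values themselves stay in the functional model.
    """

    def __init__(self, num_sets=64, line_bytes=32):
        self.num_sets = num_sets
        self.line_bytes = line_bytes
        self.tags = [None] * num_sets
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        """Return True on a hit; on a miss, install the new tag."""
        line = addr // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        if self.tags[index] == tag:
            self.hits += 1
            return True
        self.tags[index] = tag
        self.misses += 1
        return False


cache = TagOnlyCache(num_sets=4, line_bytes=16)
# 0x00 and 0x04 share a line; 0x40 maps to the same set and evicts it.
results = [cache.access(a) for a in (0x00, 0x04, 0x40, 0x00)]
print(results, cache.hits, cache.misses)
```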

[Figure: block diagram of the timing model: instruction stream into I-Fetch/I-Cache and I-Decode; GPR and FPR rename and read stages separated by delay elements; functional units (ALU, ALU, Br, Ldst, Ldst, FPU, FPU); D-Cache; reorder buffer; BIU; memory controller; memory, disk, and network]

Example of Complication: Branch Prediction

Must process mis-speculated instructions in the timing model
  Implement BP in the timing model
  Timing model forces the ISA simulator to mis-speculate
  Rollback, restore
  Requires support from the ISA simulator
Branch predictor predictor in the ISA simulator?
  BP only works in a processor if it’s fairly accurate
FAST simulators take advantage of the fact that, most of the time, the micro-architecture is on the right path
  Most complexity (BP, parallelism) can be handled this way
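One way to picture the mis-speculate and rollback sequence is the simplified Python sketch below. The toy functional model, its `checkpoint`/`restore` methods, the two-instruction toy ISA, and the fixed wrong-path length are all hypothetical; the real interaction between the ISA simulator and the timing model is more involved than this single loop.

```python
class ToyFunctionalModel:
    """Hypothetical ISA simulator with checkpoint and rollback support."""

    def __init__(self, program):
        self.program = program
        self.pc = 0
        self.acc = 0  # single accumulator register

    def done(self):
        return self.pc >= len(self.program)

    def checkpoint(self):
        return (self.pc, self.acc)

    def restore(self, snap):
        self.pc, self.acc = snap

    def step(self):
        """Execute one instruction; return (pc, op, arg) of that instruction."""
        pc = self.pc
        op, arg = self.program[pc]
        if op == "add":
            self.acc += arg
            self.pc = pc + 1
        elif op == "bnez":  # branch to arg if acc is nonzero
            self.pc = arg if self.acc != 0 else pc + 1
        else:
            raise ValueError(op)
        return pc, op, arg


def simulate(fm, predict_taken, wrong_path_len=2):
    """Timing-model side: on a mispredict, force the functional model
    down the predicted (wrong) path, then roll it back."""
    log = []
    while not fm.done():
        pc, op, arg = fm.step()
        log.append(("right", pc))
        if op == "bnez":
            predicted = arg if predict_taken(pc) else pc + 1
            if predicted != fm.pc:            # fm.pc is the actual next pc
                snap = fm.checkpoint()
                fm.pc = predicted             # mis-speculate
                for _ in range(wrong_path_len):
                    if fm.done():
                        break
                    log.append(("wrong", fm.step()[0]))
                fm.restore(snap)              # rollback, restore
    return log


program = [("add", 1), ("bnez", 4), ("add", 10), ("add", 10), ("add", 100)]
fm = ToyFunctionalModel(program)
log = simulate(fm, predict_taken=lambda pc: False)  # always predict not-taken
print(log, fm.acc)
```

The wrong-path instructions appear in the log so the timing model can account for them, but their architectural effects are discarded by the restore.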

Status & Conclusions

1MHz to 100MHz, cycle-accurate, full-system, multiprocessor simulator
  Well, not quite that fast right now, but we are using an embedded 300MHz PowerPC 405 to simplify
x86, boots Linux and Windows, targeting 80486 to Pentium D-like and beyond (Dam Sunwoo, Nikhil Patil)
  Bochs functional model (looking at much faster models)
  Heavily modified for instruction trace and rollback
Branch-predicted superscalar model almost done in Bluespec and Verilog (John Xu, Huzefa Sanjeliwala)
  Have straight-pipeline 486 model with TLBs and caches
Statistics gathered in hardware
  Very little if any probe effect
Tools to semi-automate micro-architectural and ISA-level exploration
  Orthogonality of models makes both simpler
