1/36
by Martin Labrecque
How to Fake 1000 Registers
Oehmke, Binkert, Mudge, Reinhardt
to appear in Nov @ MICRO 2005
2/36
Outline
● Motivation:
– Observations on registers
● Idea
– Virtual Context Architecture
● Evaluation in 2 types of applications
3/36
Some definitions
● Activation record:
Data structure {
● variables belonging to one particular scope (e.g. a procedure body)
● links to other activation records
};
Synonyms: "data frame", "stack frame"
● Context:
– Activation record of a thread of execution
A register is only meaningful to the current activation record
4/36
Key observation
● Virtual Memory:
– From the ISA standpoint: each process has an 'infinite' amount of memory available
– Memory is managed in caches, RAM and disk
– Memory is context free
● This is not true for registers
– Limited resource
Need to virtualize registers
5/36
How registers are used
● Compiler:
– Source code: variables
– IR: virtual registers
– Register allocation produces the binary: logical registers
● Pipeline:
– Binary: logical registers
– Decode/Rename produces the data path: physical registers
6/36
Registers are useful
● Can't get rid of registers:
– Efficient address encoding in instructions
– Unambiguous data dependences
– Efficient integration in the micro-architecture
7/36
Dawn of a New Idea
Attach a memory address to the content of the register!
8/36
Virtualizing registers
9/36
Mapping registers to memory
● Registers are virtualized because they hold the content of a memory location
● 2 options
– At register allocation, map compiler virtual registers to memory
● Memory to memory operations
● Doesn't make use of ISA registers
– Map ISA registers to memory
● Key Idea of the Virtual Context Architecture
10/36
Programming the VCA
● Where are the registers mapped in memory?
● The Stack Pointer is the Reference
– Allows memory to be 'allocated' dynamically
– Efficient way of passing parameters to a function
– Need some architectural support to address with offsets to the stack pointer
11/36
Renaming
● To get the register memory address, combine:
– the source/destination register index of the binary program
– base pointer (stack pointer)
● ISA register index → register memory address → physical register
12/36
Register memory address → physical reg.
● The address = base pointer + offset
● Exploit locality of the addresses to compress the number of bits in the conversion, low probability of capacity miss
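The address computation on the two slides above can be sketched as follows. This is only an illustration: the 8-byte slot size and the names `REG_BYTES` and `register_address` are assumptions made for the example, not details taken from the paper.

```python
REG_BYTES = 8  # assumed size of one register slot in memory (hypothetical)


def register_address(base_pointer: int, reg_index: int) -> int:
    """Memory address backing logical register `reg_index` in the current context.

    Follows the slide's rule: address = base pointer + offset, where the
    offset is assumed here to be the register index scaled by the slot size.
    """
    return base_pointer + reg_index * REG_BYTES


# The same ISA register index in two different contexts (for example a
# caller and a callee with different base pointers) resolves to two
# different memory addresses, so both values can exist at the same time.
caller_r5 = register_address(0x7000, 5)
callee_r5 = register_address(0x6F00, 5)
```

Because nearby registers of one context map to nearby addresses, the rename hardware can presumably work on a compressed form of the address, which is the locality argument made on the slide.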
13/36
Register File is a Cache
● Hardware controlled cache
● An instruction requires its source operands and destination register to execute
What happens on a “cache” miss?
We need some hardware control!
14/36
Some additional HW
● Each register has 3 new attributes:
1) A reference count:
● Incremented when instruction using it goes through rename
● Decremented when instruction is committed
● Non zero value means that register cannot be reallocated to other logical registers
● Guarantees correct instruction execution
15/36
Some additional HW (ctnd)
2) A 'committed' bit
● Valid, non speculative value
3) A 'dirty' bit
● Value more up-to-date than memory
• Using those attributes, a state machine controls which registers are available or not
• Branch recovery works by having a duplicate renaming table containing the committed architectural state
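A minimal sketch of the three per-register attributes, to make their roles concrete. The class and method names are hypothetical, and the transition rules (in particular, setting both bits when a writing instruction commits) are a simplified assumption; the actual state machine is defined in the paper.

```python
from dataclasses import dataclass


@dataclass
class PhysReg:
    ref_count: int = 0       # in-flight instructions that reference this register
    committed: bool = False  # holds a valid, non-speculative value
    dirty: bool = False      # value is more up to date than memory

    def on_rename(self) -> None:
        # An instruction using this register goes through rename.
        self.ref_count += 1

    def on_commit(self, wrote_value: bool) -> None:
        # The instruction commits; assumption: a committed write makes the
        # value both architectural and newer than its copy in memory.
        self.ref_count -= 1
        if wrote_value:
            self.committed = True
            self.dirty = True

    def can_reallocate(self) -> bool:
        # Non-zero count means the register is pinned.
        return self.ref_count == 0

    def needs_spill(self) -> bool:
        # Assumption: only committed, dirty values are written back on reuse.
        return self.committed and self.dirty
```

A register that is pinned by an in-flight instruction can never be chosen as a victim, which is what guarantees that every renamed instruction still finds its operands when it executes.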
16/36
Source operand to physical register conversion
17/36
Destination logical register to physical register conversion
18/36
Allocation of an entry for destination register
● Replacement policy in rename table
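The cache-like behaviour of the rename table (hit, miss, fill, spill, pinned entries) can be sketched as below. This is a deliberately simplified model under stated assumptions: the replacement policy is taken to be least-recently-used among unpinned entries, the dirty bit is set at rename rather than at commit, and each address maps to a single physical register, which ignores that real renaming allocates a fresh physical register per destination write. `RenameCache` and all its fields are hypothetical names.

```python
from collections import OrderedDict


class RenameCache:
    """Toy model: register memory address -> physical register, with spills and fills."""

    def __init__(self, num_phys: int):
        self.free = list(range(num_phys))  # unallocated physical registers
        self.table = OrderedDict()         # addr -> [phys, ref_count, dirty]
        self.values = {}                   # phys -> value held in the register file
        self.memory = {}                   # backing store: addr -> value
        self.spills = 0
        self.fills = 0

    def _evict(self) -> int:
        # Assumed policy: oldest entry whose reference count is zero.
        for addr, (phys, refs, dirty) in list(self.table.items()):
            if refs == 0:
                if dirty:
                    self.memory[addr] = self.values.get(phys, 0)
                    self.spills += 1
                del self.table[addr]
                return phys
        raise RuntimeError("all physical registers are pinned; rename must stall")

    def lookup(self, addr: int, is_dest: bool) -> int:
        if addr in self.table:
            self.table.move_to_end(addr)   # hit: refresh recency
        else:
            phys = self.free.pop() if self.free else self._evict()
            if not is_dest:                # a source operand must be filled
                self.values[phys] = self.memory.get(addr, 0)
                self.fills += 1
            self.table[addr] = [phys, 0, False]
        entry = self.table[addr]
        entry[1] += 1                      # pin until the instruction commits
        if is_dest:
            entry[2] = True
        return entry[0]

    def commit(self, addr: int) -> None:
        self.table[addr][1] -= 1
```

With two physical registers, touching a third address after the first two have committed evicts the oldest entry and spills it only if it was written; a destination miss needs no fill because the old value is about to be overwritten.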
19/36
Pipeline modifications
● Changes in the renaming
● ATSQ: architectural state transfer queue
– Adds to the queue upon fills and spills
– Has priority on the instruction to execute
– Addresses for fills and spills are pre-calculated
– No memory disambiguation required
– No data dependences
20/36
Outline
● Motivation:
– Observations on registers
● Idea
– Virtual Context Architecture
● Evaluation in 2 types of applications
– Baseline & Methodology
– Register windows w/ results
– SMT w/ results
– Combined register windows + SMT
21/36
Baseline machine
22/36
More on methodology
● Uses SimPoints to find representative simulation intervals
● SPEC CPU 2000
● Baseline doesn't have register windows
– (Alpha’s register remapping with issue queues)
● Window overflow/underflow: 10 cycles
23/36
Applications
● Register windows
● Multithreading
http://en.wikipedia.org/wiki/Register_window
http://www.sics.se/~psm/sparcstack.html
24/36
Register Windows
● Global register allocation
– How many registers should we reserve for the current procedure versus the rest of the program?
– SPARC example:
● usually contains as many as 128 GPRs
● At any point only 32 are available:
– 8 global, 8 params in, 8 params out, 8 local values
– Up to 32 windows
– Windows changed by an instruction usually along with 'call' and 'return'
– Partial overlap: 'params out' of caller are 'params in' of callee
– Also used in Itanium (variable sized window)
– Alternative is e.g.: renaming with reservation stations
Save some memory (stack) traffic on function calls
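The partial overlap between windows can be illustrated with a small index calculation. This is a sketch of the general overlapping-window idea only: the window count, the layout of the register file, and the choice that a call moves to the next higher window number are assumptions for the example and do not reproduce SPARC's actual register numbering.

```python
NUM_WINDOWS = 8       # hypothetical window count for this example
REGS_PER_WINDOW = 16  # each window adds 8 locals + 8 outs; ins are shared


def physical_index(window: int, kind: str, n: int) -> int:
    """Map (window, register class, 0..7) to an index in the physical file."""
    if kind == "global":
        return n  # globals are shared by every window
    offset = {"in": 0, "local": 8, "out": 16}[kind]
    wrapped = (REGS_PER_WINDOW * window + offset + n) % (REGS_PER_WINDOW * NUM_WINDOWS)
    return 8 + wrapped


# The caller's 'params out' and the callee's 'params in' are the same
# physical registers, so arguments are passed without any copy.
caller_out3 = physical_index(0, "out", 3)
callee_in3 = physical_index(1, "in", 3)
```

The modulo makes the file circular, which is where overflow and underflow come from: once the call depth exceeds the number of windows, the oldest window must be saved to memory before it can be reused.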
25/36
Register Windows Caveats
● Problem:
– Overflow of windows: call depth too deep
– Underflow of window: need to restore a window from memory
● Solution
– Operating system handler
– typical scheme saves and restores windows
– VCA handles registers individually
Performance Advantage of the Register Stack in Intel® Itanium™ Processors
26/36
Register windows evaluation
● ‘Ideal’: fills and spills are free
● VCA is especially good with few registers
● Close to ideal at 256 registers
● VCA 4% faster than baseline @ 256 regs
● Fewer registers means fewer in-flight instructions and less branch misprediction increase
● For others decrease
27/36
Single data cache port experiment
● Normalized to 2-port baseline
● 7% faster than baseline @ 256 regs
● 0.5% slower than ideal @ 256 regs
28/36
2nd App: multi-threading
29/36
SMT: simultaneous multi-threading
● Lots of replicated resources (larger register file)
● VCA: renaming table is not replicated, only base thread pointer
● VCA:
– # of in-flight instructions determines number of registers required
– not # of threads
30/36
SMT: 2 and 4 threads
● Normalized to single thread baseline 256 regs (not shown)
● @ 192 regs, VCA 2T is 97% of baseline @ 320 regs (baseline is at 88%)
● @192 regs, VCA 4T is at 98.7% of baseline @448 regs
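The register counts quoted on this slide can be checked with a few lines of arithmetic. The figures are the ones from the slides; reading the constant step between baseline sizes as per-thread replicated state is an interpretation, not something the slides state, and the helper name is hypothetical.

```python
# Register file sizes quoted on the slides, keyed by thread count.
BASELINE_REGS = {1: 256, 2: 320, 4: 448}
VCA_REGS = 192  # VCA size used for both the 2T and 4T comparisons


def regs_per_extra_thread(sizes: dict) -> set:
    """Growth of the baseline register file per additional thread."""
    (t0, r0), *rest = sorted(sizes.items())
    return {(r - r0) // (t - t0) for t, r in rest}


step = regs_per_extra_thread(BASELINE_REGS)
ratio_4t = BASELINE_REGS[4] / VCA_REGS
```

The baseline grows by a fixed amount per extra thread while the VCA size stays at 192, which matches the earlier point that VCA's register requirement follows the number of in-flight instructions rather than the number of threads.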
31/36
Combined SMT w/ register windows
● Normalized to single thread baseline @ 256 regs
● VCA 4T: 98% of peak performance @ 192 regs
32/36
SMT + register windows
● Register window reduces cache accesses while SMT increases them
● VCA 4T non-windowed @ 192 regs reaches 98% of baseline performance but still has 24% more cache accesses; adding windows brings cache accesses to 5% below baseline
33/36
VCA summarized
● unifies support for both multiple independent threads and register windowing within each thread;
● backwards compatible with existing ISAs at the application level for multithreaded contexts;
● requires only minimal ISA changes for register windowing;
● requires no changes to the physical register file design and the performance-critical schedule/execute/writeback loop;
● builds on existing rename logic to map logical registers to physical registers and handles register cache misses in the decode/rename stages;
34/36
VCA summarized (ctnd)
● completely decouples physical register file size from the number of logical registers by using memory as a backing store, rather than another larger register file;
● does not involve speculation or prediction, avoiding the need for recovery mechanisms.
35/36
Conclusions
● A VCA-based implementation of register windows in an out-of-order processor reduces execution time by 4% while reducing data cache accesses by nearly 20% compared to a non-windowed machine, with an even larger performance advantage over a conventional register-window implementation.
● VCA's data cache traffic reduction is large enough that it can achieve the same performance with one cache port as an otherwise similar conventional machine would with two cache ports.
36/36
Conclusions (ctnd)
● VCA is also able to manage thread contexts efficiently, enabling effective implementation of simultaneous multithreading (SMT) using as few as half the registers of a standard architecture.
● VCA allows SMT to be combined with register windows with no additional physical registers.
● A 4-thread VCA machine with 192 registers can achieve higher performance than a conventional non-windowed SMT machine with twice as many registers.