alpha 21264 microarchitecture onur/aditya 11/6/2001

Alpha 21264 Microarchitecture

Onur/Aditya

11/6/2001

Key Features of 21264

• Introduced in Feb 98 at 500 MHz
• 15M transistors, 2.2V 0.35-micron 6-metal-layer CMOS process
• Implements 64-bit Alpha ISA
• Out-of-order execution (unlike 21164)
• 4-wide fetch (like 21164)
• Max 6 inst/cycle execution bandwidth
• 7-stage pipeline
• Hybrid two-level branch prediction (tournament predictor)
• Clustered integer pipeline
• 80 in-flight instructions

Overview of the Presentation

• Overview of 21264 pipeline

• Fetch and Branch Prediction mechanism

• Register Renaming in 21264

• Clustering

• Memory System

21264 Pipelines

Source: Microprocessor Report, 10/28/96

Pipeline Structure

Source: IEEE Micro, March-April 1999

Instruction Fetch Mechanism

• Two features:
  – Line and way prediction
  – Branch prediction
• The line-way predictor predicts the line and way of the I-cache that will be accessed in the next cycle.
• Line-way prediction takes the branch predictor out of the critical fetch loop.
• On cache fills, the line predictor value at each line points to the next sequential fetch line.
• The line predictor is later trained by the branch predictor.
• In effect, the line-way predictor is similar to a very fast BTB.
• The line predictor's prediction is verified in stage 1 (instruction slot). If the line-way prediction is incorrect, the slot stage is flushed and the PC generated from the branch predictor information is used to redirect fetch.
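The line-predictor behavior above can be sketched in Python. This is a minimal illustration, not the 21264 implementation: it ignores way prediction, and the names `LinePredictor` and `fetch` are hypothetical. It also assumes the predictor is retrained immediately on every disagreement.

```python
class LinePredictor:
    """Toy next-line predictor: one predicted next-line index per I-cache line."""

    def __init__(self, num_lines):
        self.num_lines = num_lines
        # After a cache fill, each entry points to the next sequential line.
        self.next_line = [(i + 1) % num_lines for i in range(num_lines)]

    def predict(self, line):
        return self.next_line[line]

    def train(self, line, actual_next):
        # Assumption: retrain immediately whenever the branch predictor disagrees.
        self.next_line[line] = actual_next


def fetch(lp, line, branch_predictor_next):
    """Return (next line to fetch, True if the slot stage had to be flushed)."""
    guess = lp.predict(line)
    if guess != branch_predictor_next:
        lp.train(line, branch_predictor_next)
        return branch_predictor_next, True
    return guess, False
```

For example, a line that ends in a taken branch mispredicts once (sequential guess), is retrained, and then predicts correctly on the next visit.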

Fetch - Line and Way Prediction


Branch Prediction Mechanism

• Hybrid branch predictor
• Global predictor
  – Good for inter-correlated branches.
  – Indexed by the global path history register (T/NT status of the last 12 branches)
  – 4K-entry table of 2-bit counters
• Local predictor
  – Good for self-correlated branches.
  – 10 bits of the PC index a per-address local history table, which in turn indexes a 1K-entry table of 3-bit counters.
  – Aliasing among branches is a problem.
• Choice predictor
  – Decides which predictor to use.
  – Indexed by the global path history register
  – 4K-entry table of 2-bit counters
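The three tables above can be modeled in a few lines of Python. This is a hedged sketch that uses the table sizes from the slide. The initial counter values, the choice of PC bits for the local index, and the rule of training the chooser only when the two components disagree are assumptions for illustration.

```python
class TournamentPredictor:
    def __init__(self):
        self.global_history = 0           # outcomes of the last 12 branches
        self.global_ctr = [1] * 4096      # 2-bit counters (0..3), >= 2 means taken
        self.choice_ctr = [1] * 4096      # 2-bit counters, >= 2 means "use global"
        self.local_history = [0] * 1024   # 10-bit per-address histories
        self.local_ctr = [3] * 1024       # 3-bit counters (0..7), >= 4 means taken

    def _local_index(self, pc):
        # Assumption: drop the 2 byte-offset bits, keep the next 10 PC bits.
        return (pc >> 2) & 0x3FF

    def predict(self, pc):
        g = self.global_ctr[self.global_history] >= 2
        l = self.local_ctr[self.local_history[self._local_index(pc)]] >= 4
        use_global = self.choice_ctr[self.global_history] >= 2
        return g if use_global else l

    def update(self, pc, taken):
        li = self._local_index(pc)
        lh, gh = self.local_history[li], self.global_history
        g = self.global_ctr[gh] >= 2
        l = self.local_ctr[lh] >= 4
        if g != l:
            # Train the chooser toward whichever component was right.
            delta = 1 if g == taken else -1
            self.choice_ctr[gh] = min(3, max(0, self.choice_ctr[gh] + delta))
        step = 1 if taken else -1
        self.global_ctr[gh] = min(3, max(0, self.global_ctr[gh] + step))
        self.local_ctr[lh] = min(7, max(0, self.local_ctr[lh] + step))
        # Shift the outcome into both histories (10 and 12 bits wide).
        self.local_history[li] = ((lh << 1) | int(taken)) & 0x3FF
        self.global_history = ((gh << 1) | int(taken)) & 0xFFF
```

Once the histories have warmed up, a strictly alternating taken/not-taken branch is predicted correctly, because each history pattern maps to its own counter.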

Branch Prediction Mechanism

• Minimum branch penalty: 7 cycles
• Typical branch penalty: 11+ cycles (IQ delay)
• 48K bits of target addresses stored in I-cache
• 32-entry return address stack
• Predictor tables are reset on a context switch


Instruction Slotting

• Check the line predictor's prediction
• The branch predictor compares the next cache index it generates with the one generated by the line predictor
• Determine the subclusters that integer instructions will go to
• Some subclusters are specialized, which imposes resource constraints
• Perform load balancing on subclusters

Register Renaming

• 31 integer and 31 FP architectural registers
• 41 integer and 41 FP extra physical registers
• Uses a merged rename and architectural register file, one for integer and one for FP
• The same physical register holds the result of an instruction before and after commit
• No separate architectural register file (no data copying on commit)
• The register map table stores the current mappings of architectural registers.
• A map silo contains the old mappings of up to 20 previous decode cycles (used in case of misprediction)

Register Renaming Logic

• On decoding an instruction:
  – Search the map CAMs for the source registers
  – Find the physical registers currently containing the values of the architectural source registers
  – Access the free physical register list
  – Map the free physical register found to the architectural destination register

Source: Presentation by R. Kessler, August 1998.
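The decode-time steps can be sketched as follows. This is a simplified model under stated assumptions: the map is a plain indexed table rather than a CAM, architectural register r starts out in physical register r, and the class name `RenameMap` is hypothetical.

```python
from collections import deque


class RenameMap:
    def __init__(self, num_arch=31, num_extra=41):
        # Assumption: architectural register r initially lives in physical register r.
        self.map = list(range(num_arch))
        self.free = deque(range(num_arch, num_arch + num_extra))

    def rename(self, dest, srcs):
        """Return (physical sources, new physical dest, previous physical dest)."""
        # Look up sources first, so an instruction reading its own
        # destination register sees the old mapping.
        phys_srcs = [self.map[s] for s in srcs]
        if not self.free:
            raise RuntimeError("no free physical register: rename must stall")
        new_phys = self.free.popleft()
        old_phys = self.map[dest]
        self.map[dest] = new_phys
        return phys_srcs, new_phys, old_phys
```

Renaming `r1 <- r2 op r3` and then `r2 <- r1 op r1` shows the second instruction picking up the new physical register for `r1`, which is how true dependences are preserved while false ones disappear.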

Register Renaming Logic

• On completing an instruction:
  – Write the result into the physical destination register
  – Mark the physical destination register as valid in the register scoreboard
  – Broadcast results to issue queue entries
  – The physical destination register number is broadcast as the tag
• On committing an instruction:
  – Mark the physical destination register as committed
  – Free the physical register that corresponds to the old mapping of the same architectural register
• On a misprediction/exception:
  – Roll back the map state to what it was when the exception-causing instruction was renamed
  – To be able to do this, instructions must be associated with map entries
  – This is done using inums. Each instruction is given a unique 8-bit identifier during register mapping
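Commit and rollback can be illustrated with a small model. This is only a sketch: the silo here is a per-instruction list (the slide describes it per decode cycle), register counts are shrunk for readability, and all names are hypothetical. Rollback is assumed to restore the map as it was just before the offending instruction was renamed.

```python
from collections import deque


class RenameWithSilo:
    def __init__(self, num_arch=4, num_phys=8):
        self.map = list(range(num_arch))
        self.free = deque(range(num_arch, num_phys))
        self.silo = []        # entries: (inum, arch dest, old phys, new phys)
        self.next_inum = 0

    def rename(self, dest):
        inum = self.next_inum
        self.next_inum = (self.next_inum + 1) % 256   # 8-bit inum
        new = self.free.popleft()
        self.silo.append((inum, dest, self.map[dest], new))
        self.map[dest] = new
        return inum

    def commit_oldest(self):
        # Commit frees the register that held the *previous* mapping.
        _, _, old, _ = self.silo.pop(0)
        self.free.append(old)

    def rollback(self, bad_inum):
        # Undo mappings youngest-first, up to and including bad_inum.
        while self.silo:
            inum, dest, old, new = self.silo.pop()
            self.map[dest] = old
            self.free.appendleft(new)
            if inum == bad_inum:
                break
```

Note that no register data moves in either operation: commit only recycles a register number, and rollback only rewrites map entries.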

Physical Register States

• 4 states
• Initially, n architectural registers are in the AR state.
• The rest are Available.
• When an instruction with a destination register is issued, one of the available registers is allocated as a rename buffer (RB).
• When the instruction finishes execution, the state is set to valid.
• On instruction commit, the state is set to AR and the old AR mapping is reclaimed.

Source: Sima, D. The Design Space of Register Renaming Techniques. IEEE Micro, September/October 2000.
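The four states form a small state machine, sketched below. The event names and the two "squash" transitions (returning a speculative register to the free pool on a misprediction) are assumptions added for illustration.

```python
from enum import Enum


class RegState(Enum):
    AVAILABLE = "available"
    RENAME_BUFFER = "allocated, result not yet written"
    VALID = "result written, not yet committed"
    ARCHITECTURAL = "architectural (AR)"


TRANSITIONS = {
    (RegState.AVAILABLE, "allocate"): RegState.RENAME_BUFFER,
    (RegState.RENAME_BUFFER, "finish"): RegState.VALID,
    (RegState.VALID, "commit"): RegState.ARCHITECTURAL,
    (RegState.ARCHITECTURAL, "reclaim"): RegState.AVAILABLE,
    # Assumed: speculative registers are released when their instruction is squashed.
    (RegState.RENAME_BUFFER, "squash"): RegState.AVAILABLE,
    (RegState.VALID, "squash"): RegState.AVAILABLE,
}


def step(state, event):
    """Return the next state, or raise ValueError for an illegal transition."""
    key = (state, event)
    if key not in TRANSITIONS:
        raise ValueError(f"illegal transition: {event} in state {state.name}")
    return TRANSITIONS[key]
```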

Integer Issue Queues - Clustering

• 20 entries, maximum 4 issued per cycle
• Two arbiters pick the instructions that will issue (one for the upper subclusters, one for the lower subclusters)
• Each queue entry asserts a request to the arbiter when it contains an instruction that can be executed by the subcluster (if operand values are available within that subcluster)
• 4 request signals (U0, U1, L0, L1)
• Arbiters choose between simultaneous requesters of a subcluster based on the age of the request
• Older instructions are given priority
• Each arbiter picks 2 of the possible 20 requesters for service
• A given instruction can request only upper or lower subclusters (load balancing based on the assignment done by Stage 1)
• Subcluster assignment is static (Stage 1)
• Cluster selection on issue is dynamic (Stage 2)
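The age-based selection can be sketched as below. This simplified model collapses each pair of request signals into one "U" or "L" request per entry and ignores which cluster (0 or 1) is finally chosen. The dictionary keys are hypothetical, and a lower `age` means an older instruction.

```python
def pick_oldest(queue, pipe, picks=2):
    """One arbiter: choose up to `picks` ready entries for `pipe`, oldest first."""
    requesters = [e for e in queue if e["ready"] and e["pipe"] == pipe]
    return sorted(requesters, key=lambda e: e["age"])[:picks]


def issue_cycle(queue):
    """Run the upper and lower arbiters for one cycle; remove issued entries."""
    issued = pick_oldest(queue, "U") + pick_oldest(queue, "L")
    for entry in issued:
        queue.remove(entry)
    return issued
```

With three ready upper requesters and one ready lower requester, one cycle issues the two oldest upper instructions plus the lower one, for a total of three of the possible four.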

Integer/FP Execution Pipes

• Integer cluster communication latency: 1 cycle
• Advantage of clustering:
  – Fewer read/write ports to the register file
  – Register file will not be a cycle time limiter
• FP issue queue:
  – 15 entries
  – 2 inst/cycle


Memory References

• Load Queue
  – Reorder buffer for loads
  – 32 entries, in-order
  – Maintains state of loads issued but not yet retired
• Store Queue
  – Reorder buffer for stores
  – 32 entries, in-order
  – Maintains state of stores issued but not yet written to the data cache
  – Holds data associated with store instructions
  – Forwards data to younger matching loads
• Miss Address File
  – Holds physical addresses associated with pending L1 cache misses (instruction or data)
  – Maximum 8 misses to off-chip memory system

Load/Store Ordering

• New memory references check their address and age against older references.
• For example, when a store issues:
  – The LDQ compares the store address to the addresses of younger loads (CAM search)
  – If the older store issues to the same memory address as a younger load that has already issued, the LDQ squashes the load and initiates recovery
• When a load is ready to issue:
  – The STQ compares the load address to the addresses of older stores
  – If a match is found:
    • If the store data is available, the STQ forwards the data
    • Else load issue is delayed until the store data becomes available
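Both checks can be sketched with plain lists standing in for the CAM searches. Ages are program-order sequence numbers (larger means younger), and the tuple layouts and function names are assumptions made for the example. A load is checked against stores that precede it in program order; a store is checked against loads that follow it.

```python
def store_issues(store_age, store_addr, load_queue):
    """Return ages of younger, already-executed loads to the same address.

    These loads read stale data and must be squashed.
    load_queue: list of (age, addr, executed).
    """
    return [age for age, addr, executed in load_queue
            if executed and age > store_age and addr == store_addr]


def load_issues(load_age, load_addr, store_queue):
    """Return ("forward", data), ("wait", None), or ("cache", None).

    store_queue: list of (age, addr, data); data is None until it is available.
    """
    older = [s for s in store_queue if s[0] < load_age and s[1] == load_addr]
    if not older:
        return ("cache", None)
    # The most recent older store to this address supplies the value.
    _, _, data = max(older, key=lambda s: s[0])
    return ("forward", data) if data is not None else ("wait", None)
```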

Load/Store Ordering

• When a load is ready to issue:
  – If an older store exists in the STQ with an unknown address:
    • Predict that the ready load will not access the same memory location, unless this load was incorrectly ordered before (check the load wait table)
  – Exposes more ILP if the prediction is correct
  – In case of misprediction:
    • Minimum 14-cycle penalty
    • Initiate recovery: the load and all subsequent instructions are squashed and re-executed
    • Mark the load in the load wait table so that it will wait for all older stores to compute their addresses next time around
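The load wait table acts as a one-bit-per-entry dependence predictor, sketched below. The table size, the PC hashing, and the `clear` method (to stop stale entries from delaying loads forever) are assumptions for illustration, not details taken from the slides.

```python
class LoadWaitTable:
    def __init__(self, entries=1024):
        # Assumed size; one wait bit per entry, indexed by load PC.
        self.entries = entries
        self.wait = [False] * entries

    def _index(self, pc):
        return (pc >> 2) % self.entries

    def may_issue_early(self, pc, older_store_addr_unknown):
        """A load is held back only if it has misbehaved before AND an
        older store address is still unknown."""
        return not (older_store_addr_unknown and self.wait[self._index(pc)])

    def record_violation(self, pc):
        # Called during recovery after the load was squashed by an older store.
        self.wait[self._index(pc)] = True

    def clear(self):
        self.wait = [False] * self.entries
```

A load therefore speculates freely the first time, pays the recovery penalty once if it was wrong, and waits on later executions.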

Load/Store Ordering Example


Features of Memory System

• Data cache
  – 64 KB, 2-way, virtually indexed, physically tagged (translation in parallel with access)
  – Write-back, read/write allocate
  – 64-byte block size + ECC bits
  – Prevents synonyms by not allowing different virtual addresses corresponding to the same physical address to co-exist in the cache
  – Load hit/miss prediction to minimize load-use latency (data cache access is 3 cycles after the issue queue + 1 cycle to get the hit/miss signal to the issue queue)
• Victim Buffer (victim address and data files)
  – Contains evicted L1 (data and instruction) and L2 cache lines
  – 8 entries, serial access
• Off-chip L2 cache
  – Minimum data cache miss latency 13 cycles
  – Up to 16 MB
  – Dedicated access to L2 cache
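A short calculation shows why synonyms are a concern for the data cache geometry given above. The 8 KB page size is an assumption (the Alpha base page size) and is not stated on the slides.

```python
CACHE_BYTES = 64 * 1024
WAYS = 2
BLOCK_BYTES = 64
PAGE_BYTES = 8 * 1024   # assumption: 8 KB base page

sets = CACHE_BYTES // (WAYS * BLOCK_BYTES)       # 512 sets
offset_bits = BLOCK_BYTES.bit_length() - 1       # 6 block-offset bits
index_bits = sets.bit_length() - 1               # 9 set-index bits
page_offset_bits = PAGE_BYTES.bit_length() - 1   # 13 untranslated bits

# Index bits that lie above the page offset come from the virtual page number.
virtual_index_bits = max(0, offset_bits + index_bits - page_offset_bits)  # 2


def set_index(vaddr):
    """Set selected by a virtual address."""
    return (vaddr >> offset_bits) & (sets - 1)
```

With 2 index bits taken from the virtual page number, two virtual aliases of one physical line (for example `0x0000` and `0x2000`) select different sets, so up to 4 sets could hold the same physical line unless the hardware prevents it.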

Overall System Diagram


The Processor Itself

References

• R.E. Kessler. The Alpha 21264 Microprocessor. IEEE Micro. March/April 1999.
• D. Leibholz and R. Razdan. The Alpha 21264: A 500 MHz Out-of-order Execution Microprocessor. COMPCON97, 1997.
• Compaq Computer Corporation. Alpha 21264/EV6 Hardware Reference Manual.
• R. Kessler, E. McLellan, and D. Webb. The Alpha 21264 Microprocessor Architecture. International Conference on Computer Design. October 1998.
• B.A. Gieseke et al. A 600 MHz Superscalar RISC Microprocessor with Out-of-order Execution. International Solid State Circuits Conference. 1997.
• L. Gwennap. Digital 21264 Sets New Standard. Microprocessor Report. October 28, 1996.
• Dezso Sima. The Design Space of Register Renaming Techniques. IEEE Micro. September/October 2000.
• P.E. Gronowski et al. High Performance Microprocessor Design. IEEE Journal of Solid State Circuits. May 1998.
