Runtime Processor Power Monitoring
VICE VERSA Talk, 07/18/2003
Canturk ISCI
2
INTRODUCTION
Runtime processor power
  Measurement with HW
  Estimation with performance counters
CPU unit power breakdowns
Runtime verification
Processor thermal modeling
Power phase behavior of programs
Mapping between power behavior and program structure
3
THE BIG PICTURE
[Diagram blocks: Performance Monitoring, Real Power Measurement, Power Modeling, Thermal Modeling, Power Phases, Program Profiling, Program Structure]
Bottom line…
To estimate component power & temperature breakdowns for P4 at runtime…
To analyze how power phase behavior relates to program structure
4
Agenda
Performance Monitoring
  P4 Performance Counters
  Performance Reader LKM
Real Power Measurement
  P4 Power Measurement Setup
  Examples
Power Modeling
  P4 Power Model
  Model + Measurement Sync Setup, Verification
Thermal Modeling
  Brief Thermal Model Intro
  Ppro Thermal Model
  Results
5
Bonus Material
Power Phase Behavior
  Similarity Based on Power Vectors
  Identifying similar program regions
Profiling Execution Flow
  Sampling process' execution
  "PCsampler" LKM
Program Structure
  Execution vs. Code space
  Power Phases → Exec. Phases <OR VICE VERSA>
6
Performance Monitoring
Related Work
Performance Monitoring
  P4 Performance Counters
  Performance Reader LKM
Real Power Measurement
  P4 Power Measurement Setup
  Examples
Power Modeling
  P4 Power Model
  Model + Measurement Sync Setup, Verification
Thermal Modeling
  Refined Thermal Model
  Ex: Ppro Thermal Model
7
Live CPU Performance Monitoring with Hardware Counters
Most CPUs have hardware performance counters
P4 Performance Monitoring HW:
  18 Event Counters
  18 Counter Configuration Control Registers (CCCRs)
    Configure how to count
  45 Event Selection Control Registers (ESCRs)
    Configure what to count
  Additional Control Registers
8
Our Event-Counter: Performance Reader
Performance Reader implemented as Linux Loadable Kernel Module (LKM)
Implements 6 syscalls:
  select_events()
  reset_event_counter()
  start_event_counter()
  stop_event_counter()
  get_event_counts()
  set_replay_MSRs()
User Level Interface:
  Defines the events
  Starts counters
  Stops counters
  Reads counters & TSC
Event Types:
  59 event classes
  100s of events to count
9
Performance Reader: Example Validation
L1_Dcache benchmark
Controls cache hit behavior
Vary hit rate from 0-100%
Validated against measured cache events
10
Processor Power Measurement
11
P4 Power Measurement Setup
Clamp ammeter on 12V lines on measured CPU
  1mV/Adc conversion
DMM reading clamp voltages
Voltage readings via RS232 to logging machine
Serial Reader (PowerMeter) and PowerPlotter on the logging machine
  Convert to power vs. time window
12
PowerPlotter: Example
[Figure: PowerPlotter power vs. time trace, annotated with the segments Initialization, Benchmark Execution, "Fast", "Branch exercise" (Taken rate: 1), "High-Low", "L1Dcache" with array size 1/100 of L1, "L1Dcache" with array size x25 of L1 (~L2), and "L1Dcache" with array size x4 of L2]
13
SPEC Power Examples
Different programs show very different power characteristics
Timescale of interest can be huge => inaccessible via simulation
[Figure: Spec GCC (O3) with specrun -a run; power [W], 0-80, vs. time (s), 0-200]
[Figure: Spec VPR (O3) with specrun -a run; power [W], 0-60, vs. time (s), 0-500]
14
Processor Power Modeling
15
P4 POWER MODEL
Define Components
  Define components (e.g. L1 cache, BPU, Regs, etc.) whose powers we'll model, from the annotated layout
Define Events
  Determine the combination of P4 events that best represents component accesses
Performance Monitoring
  Gather counter info with minimal power overhead and program interruption
Power Modeling
  Convert counter info into component power breakdowns
Real Power Measurement
  Verify total power against measured processor power
16
Defining Components
17
Defining Events: Access Rates
We determined 24 events to approximate access rates for 22 components
Used several heuristics to represent each access rate
Examples:
Used 15 counters & 4 rotations to collect all event data
18
Access Rates: Component Powers
"Performance Counter based Access Rate estimations are used as proxy for max component power weighting together with microarchitectural details in order to estimate processor sub-unit powers"
EX: Trace cache delivers 3 uops/cycle in deliver mode and 1 uop/cycle in build mode:
  Power(TC) = [Access-Rate(TC)/3 + Access-Rate(ID)] x MaxPower(TC) + Non-gated TC CLK power
Total power is computed as the sum of all 22 component powers + measured idle power (8W):
  Total Power = Sum over i = 1..22 of Power(Ci) + Idle Power
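A minimal sketch of the two expressions on this slide, in Python. The function and parameter names are hypothetical and the numeric inputs in the usage note are made up for illustration; only the trace-cache expression and the 8W idle power come from the slide.

```python
# Hypothetical helper names; only the formulas and the 8 W idle figure are from the slide.
IDLE_POWER_W = 8.0  # measured idle power quoted on the slide


def trace_cache_power(tc_rate, id_rate, max_power_w, nongated_clk_w):
    """Power(TC) = [AccessRate(TC)/3 + AccessRate(ID)] x MaxPower(TC) + non-gated TC clock power."""
    return (tc_rate / 3.0 + id_rate) * max_power_w + nongated_clk_w


def total_power(component_powers_w, idle_w=IDLE_POWER_W):
    """Sum of the modeled component powers plus the measured idle power."""
    return sum(component_powers_w) + idle_w
```

For example, trace_cache_power(3.0, 0.0, 4.0, 0.5) evaluates a trace cache that delivers at its full 3 uops/cycle rate with an assumed 4W maximum power and 0.5W of non-gated clock power.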
19
Experiment Setup – Recall:
Clamp ammeter on 12V lines on measured CPU
  1mV/Adc conversion
DMM reading clamp voltages
Voltage readings via RS232 to logging machine
Serial Reader (PowerMeter) and PowerPlotter on the logging machine
  Convert to power vs. time window
20
Experiment Setup
Voltage readings via RS232 to logging machine
1mV/Adc conversion
21
Experiment Setup
POWERCLIENT
POWERSERVER
Component access rates over ethernet
Voltage readings via RS232 to logging machine
  1mV/Adc conversion
Convert voltage to measured power
Convert access rates to modeled powers
Sync together in time window
22
Tuning Benchmarks
[Figure: measured vs. modeled power for the tuning benchmarks "Fast", "Branch exercise" (Taken rate: 1), "High-Low" and "L1Dcache" (Hit Rate: 0.1)]
23
Component Breakdowns
[Figure: component breakdowns for "branch_exercise"; colors for 4 CPU subsystems; legend entries include Issue - Retire and Execution]
24
Benchmark Power Breakdowns
[Figure annotations:]
High issue, exec. & branch power
High L2 Cache Power
High L1 Cache Power
High Bus Power
25
SPEC2000 Results
VPR Elaboration: (Integer benchmark)
  2 runs: 1st Placement, 2nd Route
  1st run has much more stable power, 2nd is more variable
  Placement has a higher miss rate than route <L2 power>
  Significant FPE power due to x87_SIMD_moves
Twolf Elaboration: (Integer benchmark)
  Several loop computations traversing memory <High Memory Power>
  Although ~const. total power, component powers have slight gradients
Equake Elaboration: (FP benchmark)
  Initialization and computation phases
  FP intensive mesh computation phase
  Initialization with highly complex IA32 instructions
26
Average SPEC Total Powers
1st set: Overall, 2nd set: Non-idle power
Average difference between measurement and estimation: 3W
Worst case: Equake (5.8W)
27
Stdev of SPEC Total Powers
1st set: Overall, 2nd set: Non-idle power
Average difference: 2W
Worst case: Vortex (3.5W)
28
Desktop Applications
We aim to track low power utilizations as well.
Desktop applications are usually low power with intermittent power bursts
3 applications, with common operations such as open/close application, web, streaming media, text editing, save to disk, statistical computations.
29
Thermal Model
30
THERMAL MODELING: A Basic Model
Based on lumped R-C model from packaging
Built upon power modeling
  Sampled component powers
  Respective component areas
  Physical processor parameters
  Packaging
  Heat transfer
[Figure: DIE with blocks Blk_i, Blk_j, Blk_k, Blk_l; each block is a thermal node with temperature Tb, resistance Rth, capacitance Cth and power P (e.g. Tb,i, Rth,i, Cth,i, Pi); the HEATSINK node has Th, Rth,h, Cth,h and power Pi+Pj+Pk+Pl]
31
Physical Structure vs. Thermal Model
Physical structure: Die, Package, Heat Spreader, Thermal Grease, Heatsink, Ambient Airflow
[Figure: thermal model nodes: Tdie,i with Rdie,i, Cdie,i and power Pi; Tp,i with Rp,i, Cp,i; Tspr with Rspr, Cspr; Th with Rh, Ch; coupling resistances R_grXspr and R_hXA; ambient temperature TA; total power Ptotal]
32
Simulation Outputs
Thermal nodes updated every t ~ 20ms
Component temperatures build up to ~350K in ~5hrs
Theatsink moves very slowly, as expected
[Figure: Pentium Pro Thermal Simulation; temperature (C), 0-80, at startup and after 5 hours, for Ambient, Heatsink, Heat Spreader, Decode, Issue, Reorder, DMem, IMem, FUs, Other]
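A rough sketch of how a single lumped R-C thermal node could be advanced every 20ms. This is not the talk's thermal model: the forward-Euler update, the function name and all parameter values are assumptions chosen for illustration, using the standard relation C dT/dt = P - (T - Tref)/R.

```python
# Illustrative only: one lumped R-C node, forward-Euler update (assumed scheme).
def step_temperature(temp, power_w, r_th, c_th, t_ref, dt=0.02):
    """Advance one node by dt seconds.

    temp    : current node temperature
    power_w : power dissipated in the node [W]
    r_th    : thermal resistance to the reference node [K/W]
    c_th    : thermal capacitance of the node [J/K]
    t_ref   : temperature of the node it conducts to (e.g. heatsink or ambient)
    """
    return temp + dt * (power_w - (temp - t_ref) / r_th) / c_th
```

With constant power the node settles at t_ref + power_w * r_th, and the product r_th * c_th sets how slowly it gets there, which is consistent with the heatsink node moving much more slowly than the die blocks.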
33
Power Phase Behavior
34
Power Vectors for Similarity
Similar to basic block vectors
Use component power vector samples to represent program phases
Consider Manhattan distance between 2 vectors as the 'measure of dissimilarity' between the corresponding execution points
Construct a similarity matrix to represent similarity among all pairs of execution points
Each entry in the similarity matrix:
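A minimal sketch of the construction described on this slide, assuming each matrix entry is simply the Manhattan distance between two component power vectors. The slide's own per-entry formula did not survive in this transcript, so any normalization it applied is not reproduced here, and the function names are hypothetical.

```python
# Sketch under the assumption that entry (i, j) is the raw Manhattan distance.
def manhattan(u, v):
    """Sum of absolute component-wise differences between two power vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def similarity_matrix(vectors):
    """Pairwise dissimilarity between all sampled execution points."""
    n = len(vectors)
    return [[manhattan(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
```

Small entries mark pairs of execution points with similar component power breakdowns; the diagonal is always zero and the matrix is symmetric.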
35
Gcc & Gzip Similarity Matrices
[Figure: Gcc Total Power; measured power and counter-estimated power, power [Watts], 0-60, vs. time (s), 0-250]
[Figure: Gcc Component Power Breakdowns; L1 cache, Trace Cache and Retire power [Watts], 0-6, vs. time (s), 0-250]
[Figure: Gzip Total Power; measured power and counter-estimated power, power [Watts], 0-60, vs. time (s), 44-440]
[Figure: Gzip Component Power Breakdowns; L1 cache, INT Exec and Retire power [Watts], 0-6, vs. time (s), 44-444]
Gcc Elaboration:
  Highly variable power
  Almost identical power behavior at 30, 50, 180s.
  Although 88s, 110s, 140s, 210s and 230s show similar total power; 88, 210 and 230 share higher similarity.
Gzip Elaboration:
  Much more regular power behavior
  Spurious similarities such as 100-150s and 200-280s are distinguished by the similarity analysis
36
Generating representative vectors
Gzip has ~1000 power vectors
Cluster vectors based on similarity
Could we represent power behavior with reasonable accuracy, with a small number of 'signature' vectors?
Ex: 26 representative vectors with "Thresholding Algorithm":
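The slide does not spell out the "Thresholding Algorithm", so the following is only one plausible reading, labeled as an assumption: a greedy pass that starts a new representative whenever a power vector is farther than a chosen Manhattan-distance threshold from every existing representative. The function name and the threshold value are hypothetical.

```python
# Assumed greedy threshold clustering; not confirmed as the talk's algorithm.
def pick_representatives(vectors, threshold):
    """Return (representatives, labels), where labels[i] indexes the representative of vectors[i]."""
    reps = []
    labels = []
    for v in vectors:
        for idx, r in enumerate(reps):
            # Manhattan distance to an existing representative
            if sum(abs(a - b) for a, b in zip(v, r)) <= threshold:
                labels.append(idx)
                break
        else:
            # No representative is close enough: this vector starts a new group
            reps.append(list(v))
            labels.append(len(reps) - 1)
    return reps, labels
```

A larger threshold yields fewer signature vectors at the cost of accuracy, which is the trade-off the slide's question points at.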
37
Program Execution Profile
38
Program Execution Profile
Sample program flow simultaneously with power
  Our LKM implementation: "PCsampler"
Generate code space similarity in parallel with power space similarity
Relative comparisons for: complexity, accuracy, applicability, etc.
39
EOP
40
Where is all this useful?
Measurement/Modeling for microarchitectural details
Compiler level power
SW power profiling
Power Aware OS
Dynamic power/thermal/μarch. configuration
  Dynamic memory allocation, process cruise control, etc.
Demonstrates modern processor power
Need for speed! Long timescales, thermal constants
Identify program phases w/o knowledge of code, no basic block info whatsoever
Program signatures for detailed simulation, say: "power points rather than simpoints"
41
42
Counter Access Heuristics
1) BUS CONTROL:
  No 3rd level cache: BSQ allocations ~ IOQ allocations
  Metric 1: Bus accesses from all agents
    Event: IOQ_allocation
      Counts various types of bus transactions
      Should account for BSQ as well
      Access based rather than duration
    Mask: Default req. type, all read (128B) and write (64B) types, include OWN, OTHER and PREFETCH
  Metric 2: Bus Utilization (the % of time the bus is utilized)
    Event: FSB_data_activity
      Counts DataReaDY and DataBuSY events on bus
    Mask: Count when processor or other agents drive/read/reserve the bus
    Expression: FSB_data_activity x BusRatio / Clocks Elapsed
      To account for clock ratios
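The bus utilization expression above can be sketched as a small function. The name and the example numbers are hypothetical; the expression itself is the one on the slide, where the bus ratio scales bus-clock event counts into core clock cycles.

```python
# Hypothetical helper; the expression is FSB_data_activity x BusRatio / Clocks Elapsed.
def bus_utilization(fsb_data_activity, bus_ratio, clocks_elapsed):
    """Fraction of elapsed core clocks during which the bus was active."""
    if clocks_elapsed <= 0:
        raise ValueError("clocks_elapsed must be positive")
    return fsb_data_activity * bus_ratio / clocks_elapsed
```

For instance, with an assumed bus ratio of 14, 1000 counted bus events over 56000 core clocks gives a utilization of 0.25.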
43
Counter Access Heuristics
2) L2 Cache:
  Metric: 2nd level cache references
    Event: BSQ_cache_reference
      Counts cache ref-s as seen by bus unit
    Mask:
      All MESI read misses (LD & RFO)
      2nd level WR misses
3) 2nd Level BPU:
  Metric 1: Instructions fetched from L2 (predict)
    Event: ITLB_Reference
      Counts ITLB translations
    Mask: All hits, misses & UC hits
  Metric 2: Branches retired (history update)
    Event: branch_retired
      Counts branches retired
    Mask: Count all Taken/NT/Predicted/MissP
44
Counter Access Heuristics
4) ITLB & I-Fetch:
etc………
10) FP Execution:
  Metric: FP instructions executed
    event1: packed_SP_uop
      counts packed single precision uops
    event2: packed_DP_uop
      counts packed double precision uops
    event3: scalar_SP_uop
      counts scalar single precision uops
    event4: scalar_DP_uop
      counts scalar double precision uops
    event5: 64bit_MMX_uop
      counts MMX uops with 64bit SIMD operands
    event6: 128bit_MMX_uop
      counts integer SSE2 uops with 128bit SIMD operands
    event7: x87_FP_UOP
      counts x87 FP uops
    event8: x87_SIMD_moves_uop
      counts x87, FP, MMX, SSE, SSE2 ld/st/mov uops
45
46
P4 Details
Karelian.ee:
  P4 – 1.4GHz
  0.18 μm, C4-FC-PGA-423
  Heatsink: Folded Fin
  M6, Al interconnect
  Die Size: 217 mm2
  Package Size: 5.34cm x 5.17cm
  Power: Idle/typ./max = ??/51.8/71W
  D$1 & T$1 / L2: 8K & 12K uops / 256K
  Voltage: 1.7/1.75V
47
P4 Details
1st LKM: <LKM_CPUinfo & UserLevel_CPUinfo>
  Implements syscall: getCPUinfo()
  Gathers CPU info from:
    /asm/processor.h
    Intel control registers (CR4)
    CPUID instruction
  Reveals:
    Debug Store mechanism exists for PEBS
    TSC exists
    MSRs implemented
    We can read/write performance counters
EX:
  karelian (P4, Willamette): UserLevel_CPUinfo
  viale (P4, Northwood): UserLevel_CPUinfo
48
P4 Detector - Counter Clusters
[Figure labels: P4 Components, EVENTS, Event Detectors, Event Counters, 4 bit wide bus]
49
Counters, ESCRs & CCCRs
Simplified Recipe:
1. Select event to count
2. Select a counter (also defines CCCR)
3. Select an ESCR
4. Set ESCR fields
5. Set CCCR fields
6. Enable CCCR
50
Counting Mechanisms
Counting Types
  Non-retirement: events occur any time during execution
  At-Retirement: events at the retirement of instruction
    Can count BOGUS vs NBOGUS, tag uops to count, etc.
Mechanisms:
  Front end tagging (e.g. LD/ST retired)
  Execution tagging (e.g. packed_DP_retired)
  Replay tagging (e.g. L1 misses)
  No tags (e.g. uops retired)
Also: Event Counting | IEBS | PEBS
51
At Retirement Counting Terminology
BOGUS/NBOGUS (speculative)
Tagging (count uops that encounter event)
Replay (Data speculation)
52
Verifying Counter Reader
1) L1Dcache_exercise:
  Uses pointer assignment
  L1 = 8K, L2 = 256K
  Array Size = (L1 Size / Hit Rate)
    e.g. for 10% hit rate: 80K → 20K entries
    Array Size < L2 size
  Array elements ← PRBS of array indices
  Bench loop: new index ← array[old index]
  However, gcc puts 5 LDs in the bench loop
    4 static → hit rate ~ 100%
    1 our load → our desired hit rate
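A sketch of the array sizing and index-chasing pattern described above. Python will not reproduce cache behavior; this only illustrates the arithmetic and the access pattern. The 4-byte entry size is inferred from "80K → 20K entries", Python's random module stands in for the PRBS generator, and the function names are hypothetical.

```python
import random

L1_BYTES = 8 * 1024   # L1 = 8K, from the slide
ENTRY_BYTES = 4       # assumption: 80K bytes -> 20K entries implies 4-byte entries


def array_entries(hit_rate, l1_bytes=L1_BYTES, entry_bytes=ENTRY_BYTES):
    """Array Size = L1 Size / Hit Rate, expressed as a number of entries."""
    return int(round(l1_bytes / hit_rate)) // entry_bytes


def build_chase(n, seed=0):
    """Shuffle indices into one cycle (Sattolo's algorithm) so array[i] holds the next index."""
    rng = random.Random(seed)
    arr = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i guarantees a single cycle
        arr[i], arr[j] = arr[j], arr[i]
    return arr


def chase(arr, steps, start=0):
    """Bench loop: new index <- array[old index]."""
    idx = start
    for _ in range(steps):
        idx = arr[idx]
    return idx
```

For a 10% desired hit rate, array_entries(0.1) gives 20480 entries (80K bytes), which stays below the 256K L2 as the slide requires.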
53
…Verifying Counter Reader
1) L1Dcache_exercise results:
[Figure: L1Dcache Experiment; acquired rates, -20% to 120%, vs. desired hit rate, 0.04 to 1000; series: Acquired L1 Hit Rate, Our L1 Hit Rate From L2 Accesses]
Ex: L1Dcache_exercise, Hit Rate = 0.25
54
…Verifying Counter Reader
2) branch_exercise:
  Uses random number comparison
  Assigns 400K PRBS array outside bench loop
    To avoid rand() instructions in bench loop
  Bench loop:
    Compares array index to threshold
    Threshold = RAND_MAX*TakenRate
    Repeats 1000 times, reseeding each time
  However gcc adds 2 more branches into bench loop:
    Loop exit condition (Prediction ~ 100%)
    Unconditional JMP (Prediction ~ 100%)
  Our Branch's Expected Mispredict Rate: ~ (0.5 - |TakenRate – 0.5|)
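The threshold and expected mispredict rate above can be checked with a short sketch. The RAND_MAX value, the "value below threshold means taken" convention, and the function names are assumptions made for illustration; the two formulas are the ones on the slide.

```python
import random

RAND_MAX = 2**31 - 1  # assumed value; the slide only names RAND_MAX


def threshold(taken_rate, rand_max=RAND_MAX):
    """Threshold = RAND_MAX * TakenRate."""
    return rand_max * taken_rate


def expected_mispredict_rate(taken_rate):
    """Slide's approximation: 0.5 - |TakenRate - 0.5|."""
    return 0.5 - abs(taken_rate - 0.5)


def observed_taken_rate(values, taken_rate, rand_max=RAND_MAX):
    """Fraction of array values that fall below the threshold (assumed to mean 'taken')."""
    thr = threshold(taken_rate, rand_max)
    return sum(1 for v in values if v < thr) / len(values)
```

The approximation peaks at 0.5 for a taken rate of 0.5 and falls to 0 at taken rates of 0 and 1, where the branch is perfectly predictable.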
55
…Verifying Counter Reader
2) branch_exercise results:
[Figure: Branch Prediction Experiment; acquired rates, 0% to 120%, vs. desired taken rate, 0 to 1; series: Approximated Mispredict Rate, Our Branch's Taken Rate]
Ex: branch_exercise, Taken Rate = 0.5
56
MEASUREMENT Method
Select power lines that reflect CPU power
  P4 uses 12 V lines
Clamp the current probe over the 12V lines
  1mV/Adc conversion
Connect the clamp into DMM
Send voltage reading over serial
Log the voltage readings
Convert to instantaneous power as: 12 x Vsample x 1000
Log power values
Plot power values
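The conversion step above follows from the clamp's 1mV/A output: dividing the sampled voltage by 0.001 V/A gives amperes, and multiplying by the 12V supply gives watts, which is the 12 x Vsample x 1000 expression. A minimal sketch, with hypothetical names:

```python
SUPPLY_VOLTS = 12.0          # CPU power is drawn from the 12 V lines
CLAMP_VOLTS_PER_AMP = 0.001  # 1 mV per ampere (DC) from the current clamp


def sample_to_power(v_sample):
    """Convert one DMM voltage reading [V] to instantaneous CPU power [W]."""
    current_a = v_sample / CLAMP_VOLTS_PER_AMP
    return SUPPLY_VOLTS * current_a
```

A 4mV reading therefore corresponds to about 4A on the 12V lines, or roughly 48W.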
57
MEASUREMENT Tools
Poll serial port ~20ms
  Quicker is overkill, slower overlooks
Compute running average sample every t you select
  Easier to sync with Power Model
PowerMeter:
  Convert voltage reading to power and log
  P = 12 x Vread x 1000
PowerPlotter:
  Plot power samples over sliding time window
  100 s history with 1000 samples (t = 100ms)
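A sketch of the averaging and sliding-window bookkeeping described above, assuming raw readings arrive about every 20ms and are averaged into one sample per 100ms. The class name and interface are hypothetical; the actual PowerMeter and PowerPlotter tools are not shown in the talk.

```python
from collections import deque


class PowerWindow:
    """Keeps the most recent averaged power samples for plotting."""

    def __init__(self, history_s=100.0, period_s=0.1):
        # 100 s of history at one sample per 100 ms -> 1000 samples
        self.samples = deque(maxlen=int(round(history_s / period_s)))

    def add(self, raw_powers):
        """Average the raw readings collected in one period and store the result."""
        avg = sum(raw_powers) / len(raw_powers)
        self.samples.append(avg)  # the oldest sample drops out once the window is full
        return avg
```

With a 20ms polling interval, each call to add() would typically receive about five raw readings.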
58
Current Probe
Fluke i410
Uses Hall voltage to measure current and convert to voltage: 1mV / Adc
Range: 0.5 – 400A
Accuracy: 3.5% + 0.5A
Generated voltage is fed to DMM
Compared against the Ppro Amoeba shunt setup for verification
59
DMM
Agilent 34401A
Measurement Motive:
  We should sample as quickly as possible (grep case)
Measurement Setup:
  Fast 4 digit, Autozero OFF, Display OFF
  From [8], 1000 readings/s (x150 faster than fast 6 digit)
Serial Interface:
  From [9], 55 ASCII readings/s
  Polling serial port faster than 20ms is overkill
60
P4 Power Lines
Which power lines should we cut / clamp?
[5] shows the power lines:
  1 - CPU power connector
  13 - System power connector
  P1 13 & P2 1
[6],[7] say P4 uses 12V lines for CPU, rather than 5V lines
Both P1 & P2 have 12, 5 and 3.3 V lines
I run branch_exercise (takenRate=1) and gzip_static to obtain the current variation on the lines
61
Current on Power Lines
[Figures: current I [A] vs. time (s), roughly 0-80 s, one plot per line group:
  Connector P1 line 7 (12V)
  Connector P1 lines 1,3,,6,18,19,20,22 (5V)
  Connector lines 11,12,23 (3.3V)
  Connector P2 line 1 (3.3V)
  Connector P2 line 14 (5V)
  Connector P2 line 3 (12V)
  Connector P2 line 7 (12V)
  Connector P2 line 9 (5V)]
Reveals ALL 3 12V lines' currents follow CPU activity
All add to CPU Power!
62
About the ripples
Add ripple stuff here…
63
P4 Architecture vs Layout
Components to Model:
1) Bus Control
2) L2 Cache
3) 2nd Level BPU
4) ITLB & Ifetch
5) L1 Cache
6) MOB
7) Mem Control
8) DTLB
9) Int EXE
10) FP EXE
11) Int RF
12) FP RF
13) Decode
14) Trace $
15) 1st Level BPU
16) Microcode ROM
17) Allocation
18) Rename
19) Inst-n Qs
20) Schedule
21) Inst-n Qs
22) Retirement
64
Counter Rotations
65
EOP