It’s all about latency
Henk Neefs
Dept. of Electronics and Information Systems (ELIS)
University of Gent
Overview
• Introduction of processor model
• Show importance of latency
• Techniques to handle latency
• Quantify memory latency effect
• Why consider optical interconnects?
• Latency of an optical interconnect
• Conclusions
Out-of-order processor pipeline
[Figure: out-of-order pipeline with I-cache, fetch, decode, rename, instruction window, execution units (LD/ST, INT), 'future' register file, in-order retirement and architectural register file]
Branch latency
[Figure: the pipeline with a branch (BR) execution unit next to LD/ST and INT, and an instruction schedule over time (ADD, OR, ST, XOR, LD, BR, ...) marking the branch latency]
Eliminate branch latency
• By prediction: predict the outcome of the branch => eliminates the dependency (with a high probability)
• By predication: convert the control dependency to a data dependency => eliminates the control dependency
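To make "by prediction" concrete, here is a minimal sketch of a 2-bit saturating-counter predictor. This is one common textbook scheme chosen for illustration; the slides do not say which predictor the processor model uses, and the names `TwoBitPredictor` and `count_mispredictions` are hypothetical.

```python
class TwoBitPredictor:
    """Counter states 0 and 1 predict not-taken; states 2 and 3 predict taken."""

    def __init__(self, state=2):
        # Starting in "weakly taken" is an arbitrary choice for this sketch.
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(outcomes):
    """Run the predictor over a sequence of branch outcomes (True = taken)."""
    predictor = TwoBitPredictor()
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


# A loop branch that is taken 9 times and falls through once on loop exit.
loop_branch = [True] * 9 + [False]
print(count_mispredictions(loop_branch))  # prints 1
```

Only the final loop exit is mispredicted here, so the branch dependency is removed for the other nine iterations.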
Load latency
Source loop (pointer chasing):
while (pointer != 0) pointer = pointer.next;
Loop as executed:
Loop: LD R1, R1(32)
      BNE R1, Loop
load latency = 2 cycles, branch latency = 1 cycle
CPI = 2 cycles/2 instructions = 1 cycle/instruction
[Figure: execution-unit schedule in cycles for the LD and BNE instructions of successive iterations]
When longer load latency
• When L1-cache misses and L2-cache hits:
  load latency = 2+6 cycles, branch latency = 1 cycle
  CPI = 8 cycles/2 instructions = 4 cycles/instruction
• When L2-cache misses and main memory hits:
  load latency = 2+6+60 cycles
  CPI = 34 cycles/instruction
[Figure: execution-unit schedule in cycles for successive LD and BNE instructions]
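The three CPI figures follow from one observation: each LD needs the result of the previous LD, so one loop iteration (2 instructions) takes one load latency. A small sketch of that arithmetic, assuming the branch fully overlaps with the load (the function name `pointer_chase_cpi` is hypothetical):

```python
def pointer_chase_cpi(load_latency_cycles, instructions_per_iteration=2):
    """CPI of the LD/BNE loop when every LD depends on the previous LD.

    Assumes the loop is limited only by the load-to-load dependency,
    so one iteration costs exactly one load latency.
    """
    return load_latency_cycles / instructions_per_iteration


cases = {
    "L1 hit": 2,
    "L1 miss, L2 hit": 2 + 6,
    "L2 miss, main memory hit": 2 + 6 + 60,
}
for name, latency in cases.items():
    cpi = pointer_chase_cpi(latency)
    print(f"{name}: load latency = {latency} cycles, CPI = {cpi:g}")
```

This reproduces the slide values of 1, 4 and 34 cycles per instruction.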
Memory hierarchy
register file/execution units -> L1 cache -> L2 cache -> main memory -> hard drive
(storage capacity and latency increase along this chain)
L1 cache latency
[Figure: IPC (0 to 12) versus instruction window size (0 to 300 instructions), with curves for latency = 2, latency = 3 and latency = 4]
IPC = Instructions Per clock Cycle, 1 GHz processor, SPEC95 programs
Main memory latency
[Figure: IPC (3 to 3.6) versus main memory latency (0 to 100 ns)]
IPC = Instructions Per clock Cycle, 1 GHz processor, SPEC95 programs
Performance and latency
Interconnect type               Sensitivity of performance to latency decrease (% per ns)
Processor core/register file    39
Processor/L1-cache              19
L1-cache/L2-cache               3.0
L2-cache/main memory            0.18
performance change = sensitivity * load latency change
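As a worked example of the formula above, the sketch below applies the table values to a given latency decrease. It assumes the linear relation holds over the whole change, which is only reasonable for small changes; the names `SENSITIVITY_PERCENT_PER_NS` and `performance_change_percent` are hypothetical.

```python
# Sensitivities from the table, in % performance per ns of latency decrease.
SENSITIVITY_PERCENT_PER_NS = {
    "Processor core/register file": 39.0,
    "Processor/L1-cache": 19.0,
    "L1-cache/L2-cache": 3.0,
    "L2-cache/main memory": 0.18,
}


def performance_change_percent(interconnect, latency_decrease_ns):
    """performance change = sensitivity * load latency change.

    A negative latency_decrease_ns (a latency increase) gives a loss.
    """
    return SENSITIVITY_PERCENT_PER_NS[interconnect] * latency_decrease_ns


# The same 1 ns decrease matters very differently at each level.
for name in SENSITIVITY_PERCENT_PER_NS:
    change = performance_change_percent(name, 1.0)
    print(f"{name}: {change:.2f} % for 1 ns less latency")
```

Under this model, 10 ns less main memory latency gives about 1.8 % more performance, while 1 ns less latency between processor and L1-cache gives 19 %.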
Increase performance by
• eliminating/reducing load latency:
  – By prefetching: predict the next miss and fetch the data to e.g. L1-cache
  – By address prediction: address known earlier => load executed earlier => data early in register file
• or reducing sensitivity to load latency:
  – by fine-grain multithreading
Some prefetch techniques
• Stride prefetching: search for a pattern with constant stride,
  e.g. walking through a matrix (row- or column-order)
  Example: 20 31 42 53 64 (stride: 11)
• Markov prefetching: recurring patterns of misses
  Example (miss history, prediction): 10 110 15 12 100 … ...
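Both techniques can be sketched in a few lines. The code below is an illustration under assumptions, not the prefetchers evaluated in the slides: `predict_stride` requires the last two strides to agree, and `MarkovPredictor` is a first-order table that predicts the most frequent successor of the last miss. Both names are hypothetical, and the Markov miss sequence in the usage is made up for the example.

```python
from collections import Counter, defaultdict


def predict_stride(addresses, confirmations=2):
    """Return the next address if the last `confirmations` strides are equal."""
    if len(addresses) < confirmations + 1:
        return None
    recent = addresses[-(confirmations + 1):]
    strides = {b - a for a, b in zip(recent, recent[1:])}
    if len(strides) == 1:
        return addresses[-1] + strides.pop()
    return None  # no constant stride found


class MarkovPredictor:
    """First-order Markov table: miss address -> counts of following misses."""

    def __init__(self):
        self.successors = defaultdict(Counter)
        self.last_miss = None

    def record_miss(self, address):
        if self.last_miss is not None:
            self.successors[self.last_miss][address] += 1
        self.last_miss = address

    def predict(self):
        followers = self.successors.get(self.last_miss)
        if not followers:
            return None
        return followers.most_common(1)[0][0]


# Stride example from the slide: 20 31 42 53 64 has stride 11.
print(predict_stride([20, 31, 42, 53, 64]))  # prints 75

# Hypothetical recurring miss pattern for the Markov predictor.
markov = MarkovPredictor()
for miss in [10, 110, 15, 10, 110, 15, 10]:
    markov.record_miss(miss)
print(markov.predict())  # prints 110
```

The stride predictor needs no history table but only catches regular walks; the Markov table also catches irregular but recurring miss sequences, at the cost of storing past misses.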
Stride prefetching
[Figure: IPC (4.9 to 5.2) versus main memory latency (70 to 90 ns), with prefetching and without prefetching]
IPC = Instructions Per clock Cycle, 1 GHz processor, program: compress
Prefetching and sensitivity
Factors by which "performance sensitivity to latency" increases with stride prefetching:
• to L1-prefetching: 1.6 (L1-cache/L2-cache), 4.1 (L2-cache/main memory)
• to L2-prefetching: 2.5
Latency is important: generalization to other processor architectures
Consider schedule of program:
[Figure: program schedule over time]
Present in every program execution:
• Latency of instruction execution
• Latency of communication
=> latency is important whatever the processor architecture
Optical interconnects (OI)
• Mature components:
  – Vertical-Cavity Surface-Emitting Lasers (VCSELs)
  – Light Emitting Diodes (LEDs)
• Very high bandwidths
• Are replacing electronic interconnects in telecom and networks
• Useful for short inter-chip and even intra-chip interconnects?
OI in processor context
• At levels close to the processor core, latency is very important
  => the latency of OI determines how far OI penetrates into the memory hierarchy
• What is the latency of an optical interconnect?
An optical link
Total latency = buffer latency + VCSEL/LED latency + time of flight + receiver latency
[Figure: link chain: buffer/modulation/bias -> LED/VCSEL -> fiber or light conductor -> receiver diode -> transimpedance amplifier]
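The latency sum can be written out as a short sketch. The time of flight is computed from the link length and an assumed refractive index of 1.5 (a typical value for glass; the slides do not state one). The buffer, emitter and receiver values in the usage are placeholders, not measurements from the slides, and the function names are hypothetical.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def time_of_flight_ns(length_m, refractive_index=1.5):
    """Propagation delay through the light conductor, in ns.

    refractive_index=1.5 is an assumption for this sketch.
    """
    return length_m * refractive_index / SPEED_OF_LIGHT_M_PER_S * 1e9


def total_link_latency_ns(buffer_ns, emitter_ns, receiver_ns,
                          length_m, refractive_index=1.5):
    """Total latency = buffer + VCSEL/LED + time of flight + receiver."""
    tof_ns = time_of_flight_ns(length_m, refractive_index)
    return buffer_ns + emitter_ns + tof_ns + receiver_ns


# A 10 cm link; the three 1.0 ns component values are placeholders only.
print(f"time of flight: {time_of_flight_ns(0.10):.2f} ns")
print(f"total: {total_link_latency_ns(1.0, 1.0, 1.0, 0.10):.2f} ns")
```

With these assumptions a 10 cm link contributes about 0.5 ns of time of flight, so for short links the electrical and opto-electronic components, not the flight time, tend to dominate the total.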
VCSEL characteristics
• A small semiconductor laser
• Carrier density should be high enough for lasing action
[Figure: optical output (0 to 2 mW) and carrier density versus current (0 to 3 mA)]
Total VCSEL link latency consists of
• Buffer latency
• Parasitic capacitances and series resistances of VCSEL and pads
• Threshold carrier density build-up
• From low optical output to final optical output (intrinsic latency)
• Time of flight (TOF)
• Receiver latency
Total optical link latency
[Figure: stacked bars of latency (0 to 7 ns) at 1 mW for LED and VCSEL links, each with 0.6 µm and 0.25 µm CMOS; components: buffer, parasitics, threshold, intrinsic, receiver, TOF (10 cm)]
Latency as function of power
[Figure: latency (0 to 8 ns) versus optical output power (0 to 6 mW) for LED (0.6 µm), VCSEL (0.6 µm), LED (0.25 µm) and VCSEL (0.25 µm)]
Conclusions
• When combining performance sensitivity and optical latency we conclude:
  – optical interconnects are feasible to main memory and for multiprocessors
  – for interconnects close to the processor core, optical interconnects have too high a latency with present (telecom) devices, drivers and receivers
  => but an evolution to lower-latency devices, drivers and receivers is now taking place...
For more information on the presented results: Henk Neefs, Latentiebeheersing in processors, PhD, Universiteit Gent, January 2000
www.elis.rug.ac.be/~neefs