Anshul Kumar, CSE IITD
CSL718: Pipelined Processors
Pipeline Timings
15th Jan, 2009

Pipelined Processors
Parallel architectures
• Data-parallel
• Function-parallel
– Instr level (ILP): pipelined processors, VLIWs, superscalar processors
– Thread level
– Process level
Intel’s terminology:
• intra ILP
• inter ILP

Ideal Pipelining
[Figure: an instruction of duration Tinst divided into S stages]

Determining Clock Period
[Figure: Reg → Comb → Reg driven by a common Clock of period Δt; P is the delay through Comb]
Δt ≥ P, where P = propagation delay
Δt = Pmax, where Pmax = max propagation delay

Ideal Pipelining
[Figure: Tinst divided into S stages]
Pmax = Tinst / S
Δt = Tinst / S
Effective CPI = 1
Effective time per inst Teff = CPI * Δt = 1 * Tinst / S

Pipelining with hazards
[Figure: Tinst divided into S stages]
Frequency of interruptions = b
Δt = Tinst / S
CPI = 1 + (S - 1) * b
Teff = (1 + (S - 1) * b) * Tinst / S
[Plot: Teff vs. S (Tinst = 10), for S = 1 to 10; curves for b = .2, b = .1 and b = .05]
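The curves in this plot follow from the hazard formula Teff = (1 + (S - 1) * b) * Tinst / S. A minimal Python sketch that tabulates them is given here; the function name `teff_hazards` is an assumption made for illustration and does not come from the lecture.

```python
def teff_hazards(s, b, t_inst=10.0):
    """Effective time per instruction: Teff = (1 + (S - 1) * b) * Tinst / S."""
    cpi = 1 + (s - 1) * b      # effective CPI with interruption frequency b
    delta_t = t_inst / s       # ideal clock period, no clocking overhead
    return cpi * delta_t


if __name__ == "__main__":
    for b in (0.2, 0.1, 0.05):
        row = [round(teff_hazards(s, b), 2) for s in range(1, 11)]
        print(f"b = {b}: {row}")
```

With no clocking overhead, Teff keeps falling as S grows and approaches b * Tinst, so this model alone never yields a finite optimum.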

A more realistic view
[Figure: Reg → Comb → Reg with a common Clock; besides the propagation delay P, the clock period must also cover the register output delay, the register setup time and the clock skew]

Clocking Overhead
• Fixed overhead c
– Setup time
– Output delay
• Variable overhead (stretching factor) k
– Clock skew
Δt = Pmax + k * Pmax + c = (1 + k) * Tinst / S + c
Teff = [1 + (S - 1) * b] * [(1 + k) * Tinst / S + c]
[Plot: Teff vs. S (Tinst = 10, c = 1, k = .1), for S = 1 to 15; curves for b = .2, b = .1 and b = .05]
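The effect of clocking overhead on these curves can be explored numerically. The sketch below evaluates Teff = [1 + (S - 1) * b] * [(1 + k) * Tinst / S + c] with the plot's parameters and searches the integer stage counts for the smallest value. The helper names `teff_overhead` and `best_stage_count` are illustrative assumptions, and the brute-force search is only one simple way to locate the minimum.

```python
def teff_overhead(s, b, t_inst=10.0, c=1.0, k=0.1):
    """Teff = [1 + (S - 1) * b] * [(1 + k) * Tinst / S + c]."""
    cpi = 1 + (s - 1) * b
    delta_t = (1 + k) * t_inst / s + c   # stretched stage delay plus fixed overhead
    return cpi * delta_t


def best_stage_count(b, max_s=15, **params):
    """Integer S in 1..max_s that minimises Teff (brute-force search)."""
    return min(range(1, max_s + 1), key=lambda s: teff_overhead(s, b, **params))


if __name__ == "__main__":
    for b in (0.2, 0.1, 0.05):
        s = best_stage_count(b)
        print(f"b = {b}: best S = {s}, Teff = {teff_overhead(s, b):.3f}")
```

Unlike the overhead-free model, each curve now has a minimum, because the fixed overhead c is paid once per stage.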

Pipelining with Clocking Overhead
Teff = [1 + (S - 1) * b] * [(1 + k) * Tinst / S + c]
Sopt = √[(1 - b) * (1 + k) * Tinst / (b * c)]
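The expression for Sopt can be obtained by treating S as a continuous variable and setting the derivative of Teff to zero. The slide does not show the steps; one way to write them out is:

```latex
\begin{align*}
T_{\mathrm{eff}}(S) &= \bigl[1 + (S-1)\,b\bigr]\left[\frac{(1+k)\,T_{\mathrm{inst}}}{S} + c\right] \\
 &= \frac{(1-b)(1+k)\,T_{\mathrm{inst}}}{S} + b\,c\,S + (1-b)\,c + b\,(1+k)\,T_{\mathrm{inst}} \\
\frac{dT_{\mathrm{eff}}}{dS} &= -\frac{(1-b)(1+k)\,T_{\mathrm{inst}}}{S^{2}} + b\,c = 0 \\
S_{\mathrm{opt}} &= \sqrt{\frac{(1-b)(1+k)\,T_{\mathrm{inst}}}{b\,c}}
\end{align*}
```

Only the 1/S term and the term linear in S depend on S, so the optimum balances the per-stage overhead b * c * S against the shrinking stage delay.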

Partitioning instruction into cycles with non-uniform stage times
One action - one pipeline stage => large quantization overhead
Multiple actions per stage?
Multiple stages per action?

Example
Action        Time
PC - MAR      4 ns
Cache Dir     6 ns
Cache Data    10 ns
Data - IR     3 ns
Decode        6+6 ns
Gen Addr      9 ns
Addr - MAR    3 ns
Cache Dir     6 ns
Cache Data    10 ns
Data - ALU    3 ns
Execute       7+7+8 ns
Put Away      2 ns

Optimal Pipelining
Tinst = 4+6+10+3+12+9+3+6+10+3+22+2 = 90 ns
b = 0.2, c = 4 ns, k = 5%
Sopt = √[(1 - b) * (1 + k) * Tinst / (b * c)] = 9.7 ⇒ 9
Pmax = 10 ns
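The value of Sopt on this slide can be checked directly from the formula. A minimal sketch using the slide's parameters (the function name `optimal_stages` is an assumption for illustration):

```python
import math


def optimal_stages(t_inst, b, c, k):
    """Sopt = sqrt((1 - b) * (1 + k) * Tinst / (b * c))."""
    return math.sqrt((1 - b) * (1 + k) * t_inst / (b * c))


if __name__ == "__main__":
    s_opt = optimal_stages(t_inst=90.0, b=0.2, c=4.0, k=0.05)
    print(f"Sopt = {s_opt:.2f}")   # the slide quotes this value as 9.7
```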

Example
[Figure: the action sequence of the first Example slide, grouped into pipeline stages]
Pmax = 10 ns
S = 10
Δt = 14.5 ns
S * Δt = 145 ns

Example
[Figure: the action sequence of the first Example slide, grouped into pipeline stages]
Pmax = 13 ns
S = 9
Δt = 17.65 ns
S * Δt = 159 ns

Example
[Figure: the action sequence of the first Example slide, grouped into pipeline stages]
Pmax = 20 ns
S = 5
Δt = 25 ns
S * Δt = 125 ns
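The slides do not state how the actions were grouped into stages. The stage counts S = 10, 9 and 5 are, however, consistent with a simple greedy packing of consecutive actions under a stage limit Pmax. The sketch below makes two assumptions: the order of actions follows the sum given for Tinst, and Decode (6+6) and Execute (7+7+8) may be split only at the "+" boundaries. The names `greedy_partition` and `clock_period` are illustrative.

```python
# Sub-action delays in ns, in execution order (assumed from the Tinst sum).
# Decode (6+6) and Execute (7+7+8) are split at their "+" boundaries.
ACTIONS_NS = [4, 6, 10, 3, 6, 6, 9, 3, 6, 10, 3, 7, 7, 8, 2]


def greedy_partition(delays, p_max):
    """Pack consecutive delays into stages whose sum never exceeds p_max."""
    stages, current = [], []
    for d in delays:
        if d > p_max:
            raise ValueError(f"action of {d} ns cannot fit in a {p_max} ns stage")
        if current and sum(current) + d > p_max:
            stages.append(current)   # close the stage and start a new one
            current = []
        current.append(d)
    if current:
        stages.append(current)
    return stages


def clock_period(p_max, c=4.0, k=0.05):
    """Clock period (1 + k) * Pmax + c, with c and k from the Optimal Pipelining slide."""
    return (1 + k) * p_max + c


if __name__ == "__main__":
    for p_max in (10, 13, 20):
        stages = greedy_partition(ACTIONS_NS, p_max)
        print(p_max, len(stages), round(clock_period(p_max), 2), stages)
```

Greedy packing is used here only because it reproduces the slide's stage counts; it is not claimed to be the method used in the lecture.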

Comparison
S     Pmax (ns)   Δt (ns)   S * Δt (ns)   Teff (ns)
9     13          17.65     159           45.89
10    10          14.50     145           40.60
5     20          25.00     125           45.00
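The rows of this table can be recomputed from the earlier formulas with b = 0.2, c = 4 ns and k = 5%. A short sketch follows; the function names are assumptions for illustration.

```python
def clock_period(p_max, c=4.0, k=0.05):
    """Clock period (1 + k) * Pmax + c."""
    return (1 + k) * p_max + c


def teff(s, p_max, b=0.2, c=4.0, k=0.05):
    """Teff = [1 + (S - 1) * b] * clock period."""
    return (1 + (s - 1) * b) * clock_period(p_max, c, k)


if __name__ == "__main__":
    print(" S  Pmax     dt    S*dt   Teff")
    for s, p_max in ((9, 13), (10, 10), (5, 20)):
        dt = clock_period(p_max)
        print(f"{s:>2} {p_max:>5} {dt:>6.2f} {s * dt:>7.2f} {teff(s, p_max):>6.2f}")
```

The product S * Δt for S = 9 evaluates to 158.85 ns, which the table rounds to 159.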

Cycle Quantization
Delays are not integral multiples of the clock period
Total overhead = clocking overhead + quantization overhead
Δt ≥ Tinst / S + c (ignoring k)
∴ S * Δt ≥ Tinst + S * c
Quantization overhead = S * (Δt - c) - Tinst
This reduces as the clock period becomes small
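Applied to the three partitionings of the example (Tinst = 90 ns, c = 4 ns), the two overhead terms can be separated as follows. As on the slide, k is ignored, so Δt - c equals Pmax. The function names are illustrative assumptions.

```python
T_INST = 90.0  # ns, total instruction time from the Optimal Pipelining slide


def quantization_overhead(s, p_max, t_inst=T_INST):
    """S * (Δt - c) - Tinst with k ignored, i.e. S * Pmax - Tinst."""
    return s * p_max - t_inst


def clocking_overhead(s, c=4.0):
    """Fixed clocking overhead accumulated over S stages: S * c."""
    return s * c


if __name__ == "__main__":
    for s, p_max in ((9, 13), (10, 10), (5, 20)):
        q = quantization_overhead(s, p_max)
        print(f"S = {s}: quantization = {q} ns, clocking = {clocking_overhead(s)} ns")
```

The S = 9 partitioning has the largest quantization overhead of the three, which is consistent with it having the largest S * Δt in the Comparison table.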

Other Timing Approaches
• Self Timed Circuits
– No centralized free running clock
– An operation begins as soon as its inputs are available, that is, all its predecessors have completed
– Higher speed, lower power consumption
• Wave Pipelining
– Omit inter-stage registers
– Reduced clocking overhead

Conventional vs Wave Pipelining
Conventional Pipeline
• Registers separate adjoining stages
• Clock period > max prop delay
• Inter-stage data stored in registers
Wave Pipeline
• No registers between adjoining stages
• Clock period less than max prop delay
• Waves of data propagate through combinational network (effectively, data is stored in the combinational circuit delay!)

No pipelining
[Figure: combinational logic between two registers (Reg) on a common Clock, with input X; timing diagram labels X, X', Y]

Conventional pipelining
[Figure: logic divided into stages by registers (Reg) on a common Clock, with input X; timing diagram labels X, X', Y, Y', Z, Z', W]

Wave pipelining
[Figure: combinational logic between two registers (Reg) on a common Clock, with input X and no intermediate registers; timing diagram labels X, Z', W]

Timing
[Figure: Reg → Comb ckt → Reg with input X, output Y and a common Clock; timing diagram of X and Y]
T ≥ p + s
where p = propagation delay, s = set-up time, T = clock period

Timing with clock skew
[Figure: Reg → Comb ckt → Reg with input X, output Y and a common Clock; timing diagram showing p, s, T and a skew of δ at each clock edge]
Clock skew = ±δ
T ≥ p + s + 2δ

Variation in propagation delay
• Different delays in different paths
• Delay variation due to process / temperature / power variations
• Data-dependent delay variations

Timing for wave pipelining
[Figure: Reg → Comb ckt → Reg with input X, output Y and a common Clock; timing diagram showing pmin, pmax, the spread Δp between them, clock skew ±δ and clock period T]
T ≥ Δp + s + 4δ

Timing for wave pipelining (expanded view)
[Figure: timing diagram of X and Y showing (n-1) T, nT, pmin, pmax, Δp and T]
pmin ≥ (n-1) T + 2δ
nT ≥ pmax + s + 2δ
⇒ T ≥ Δp + s + 4δ
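The final bound follows from the two constraints by eliminating n. A short derivation, assuming Δp denotes pmax - pmin as the timing diagram suggests:

```latex
\begin{align*}
nT &\ge p_{\max} + s + 2\delta \\
(n-1)\,T &\le p_{\min} - 2\delta \\
\text{subtracting:}\quad T = nT - (n-1)\,T &\ge (p_{\max} - p_{\min}) + s + 4\delta = \Delta p + s + 4\delta
\end{align*}
```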

Comparison
Conventional Pipeline
T ≥ pmax/n + s + 2δ (plus cycle quantization overhead)
nT ≥ pmax + ns + 2nδ
Wave Pipeline
T ≥ Δp + s + 4δ
nT ≥ pmax + s + 2δ
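The two lower bounds on T can be compared numerically. The sketch below uses hypothetical delay values chosen only for illustration; they do not come from the lecture, and the function names are likewise assumptions.

```python
def conventional_min_period(p_max, n, s, delta):
    """T >= pmax / n + s + 2δ (cycle quantization overhead not included)."""
    return p_max / n + s + 2 * delta


def wave_min_period(p_max, p_min, s, delta):
    """T >= Δp + s + 4δ, with Δp = pmax - pmin."""
    return (p_max - p_min) + s + 4 * delta


if __name__ == "__main__":
    # Hypothetical numbers in ns, for illustration only.
    p_max, p_min, s, delta, n = 40.0, 36.0, 1.0, 0.5, 4
    print("conventional:", conventional_min_period(p_max, n, s, delta))
    print("wave:", wave_min_period(p_max, p_min, s, delta))
```

The wave-pipelined bound depends on the delay spread Δp rather than on pmax itself, so it is attractive only when the path delays are closely balanced.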

Problems with wave pipelining
• Need to balance delays
• Narrow range of clock frequencies
• Control difficult
• Not very suitable for non-linear pipelines

References
1. M.J. Flynn, "Computer Architecture: Pipelined and Parallel Processor Design", Narosa Publishing House / Jones and Bartlett, 1996.
2. Wayne P. Burleson, Maciej Ciesielski, Fabian Klass, and Wentai Liu, "Wave-Pipelining: A Tutorial and Research Survey", IEEE Trans. on VLSI Systems, vol. 6, no. 3, September 1998, pp. 464-474.