Post on 18-Dec-2015
Detailed look at the TigerSHARC pipeline
Cycle counting for the IALU version of the DC_Removal algorithm
DC_Removal algorithm performance 2 / 28
To be tackled today
Expected and actual cycle count for the J-IALU version of the DC_Removal algorithm
Understanding why the stalls occur and how to fix them
Differences between the first time into a function (cache empty) and the second time into the function
Set up time
In principle 1 cycle / instruction
2 + 4 instructions
First key element – Sum Loop – Order (N)
4 + N * 5 instructions
Second key element – Shift Loop – Order (log2N)
1 + 2 * log2N instructions
Third key element – FIFO circular buffer – Order (N)
Update outgoing parameters: 6 instructions
Update FIFO: 3 + 6 * N instructions
Function return: 2 instructions
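The three key elements can be pictured with a small sketch. The slides do not show the DC_Removal source, so everything here is an assumption made for illustration: the function name `dc_removal_sketch`, a single channel, a power-of-two FIFO length, and the order of the steps. It only mirrors the costs quoted above: a sum loop of order N, a one-bit-at-a-time shift loop of order log2N, and a FIFO update that moves every element (order N).

```python
def dc_removal_sketch(sample, fifo):
    """Hypothetical single-channel DC removal; not the original TigerSHARC code."""
    n = len(fifo)                      # assumed to be a power of two
    fifo[n - 1] = sample               # insert the new value into the buffer

    total = 0
    for value in fifo:                 # sum loop: order N
        total += value

    average = total
    for _ in range(n.bit_length() - 1):  # shift loop: log2(N) one-bit shifts
        average >>= 1                  # arithmetic shift right, like ASHIFTR

    output = sample - average          # remove the DC estimate from the sample

    for i in range(n - 1):             # FIFO update: order N, every element moves
        fifo[i] = fifo[i + 1]
    return output
```

With a constant input the estimated DC level equals the input, so the output settles at zero, which is a quick sanity check on the sketch.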
Using the “Pipeline Viewer”
Available with the TigerSHARC simulator ONLY
VIEW | Debug Windows | Pipeline viewer
F1 to F4 – instruction fetch unit pipeline
PD, D, I -- Integer ALU pipeline
A, EX1, EX2 – Compute Block pipeline
Pipeline symbols
Control - click
A – Abort
B – Bubble
H – BTB Hit (Jumps)
S – Stall
W – Wait
X – Illegal fetch (F1 – F4)
X – Illegal instruction (PD – E2)
Time in theory
Set up pointers to buffers     2
Insert values into buffers     4
SUM LOOP                       4 + N * 5
SHIFT LOOP                     1 + 2 * log2N
Update outgoing parameters     6
Update FIFO                    3 + 6 * N
Function return                2
---------------------------
Total                          22 + 11 N + 2 log2N
N = 128 – instructions = 1444
1444 cycles + 1100 delay cycles
C++ debug mode – 9500 cycles?
Note: other tests were executed before this test, which means “cache filled”
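The theoretical count can be checked by adding up the per-section instruction counts. The sketch below is only arithmetic on the figures from the slide; the function name `theoretical_instructions` and the dictionary keys are hypothetical labels, and N is assumed to be a power of two so that log2N is an integer.

```python
def theoretical_instructions(n):
    """Instruction count per section, using the per-slide figures."""
    log2n = n.bit_length() - 1         # exact log2 for a power-of-two N
    sections = {
        "set up pointers to buffers": 2,
        "insert values into buffers": 4,
        "sum loop": 4 + n * 5,
        "shift loop": 1 + 2 * log2n,
        "update outgoing parameters": 6,
        "update FIFO": 3 + 6 * n,
        "function return": 2,
    }
    return sum(sections.values())      # equals 22 + 11 N + 2 log2N


n = 128
instructions = theoretical_instructions(n)   # 1444 for N = 128
measured = instructions + 1100               # plus the observed delay cycles
print(instructions, measured)
```

At one cycle per instruction this gives 1444 cycles for N = 128, so the 1100 observed delay cycles add roughly 76% on top of the theoretical figure.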
Test environment
Examine the pipeline the 2nd time around the loop
“Caches filled”?
Set up time
Expected: 2 + 4 instructions
Actual: 2 + 4 instructions + 2 stalls
Why not 4 stalls?
First time round sum loop
Expected 9 instructions
LC0 load – 3 stalls
Each memory fetch – 4 stalls
Actual 9 + 11 stalls
Other times around the loop
Expected 5 instructions
Each memory fetch – 4 stalls
Actual 5 + 8 stalls
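The per-pass figures can be extended to a whole-loop estimate. The sketch below assumes that every pass after the first behaves like the one examined (5 instructions plus 8 stalls) and that the cache stays filled; the function name `sum_loop_cycles` is hypothetical. It is an estimate built from the slide numbers, not a measurement.

```python
def sum_loop_cycles(n):
    """Estimate sum-loop cost: first pass 9 + 11 stalls, later passes 5 + 8 stalls."""
    instructions = 9 + (n - 1) * 5     # same as 4 + N * 5
    stalls = 11 + (n - 1) * 8          # LC0 load stalls only on the first pass
    return instructions, stalls, instructions + stalls


instructions, stalls, total = sum_loop_cycles(128)
print(instructions, stalls, total)
```

Under these assumptions the sum loop alone accounts for about 1027 stall cycles when N = 128, which is most of the 1100 delay cycles seen for the whole function.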
Shift Loop – 1st time around
Expected 3 instructions
No stalls on LC0 load?
4 stalls on ASHIFTR
BTB hit followed by 5 aborts
Exercise 1
Based on knowledge to this point, determine the expected stalls during the last piece of code – the FIFO buffer operation
Third key element – FIFO circular buffer – Order (N)
Update outgoing parameters: 6 instructions
Update FIFO: 3 + 6 * N instructions
Function return: 2 instructions
What happens if the cache is not full, i.e. the first time the function is called?
Was 2 + 2 stalls in loop
Now 11 + 12 stalls in loop
First time function called
2nd time around the loop
Ditto 3rd, 4th, 5th, 6th, 7th, 8th times
9th time around the loop
Ditto 17th, 25th, 33rd, 41st, 49th
What is happening?
With the cache filled, memory read accesses require 4 cycles
Unfilled, the first one requires “12 cycles”, then the next 7 require 4 cycles
Total guess: is the extra time associated with doing extra reads to fill the cache?
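The observed pattern (one slow read followed by seven fast ones, repeating on the 9th, 17th, 25th, ... pass) can be modelled as a simple fill of 8 reads at a time. This is a sketch of the guess above, not a description of the real memory system: the group size of 8, the 12-cycle and 4-cycle costs, and the function names are taken from or invented around the observations on these slides.

```python
def slow_passes(n_reads, group=8):
    """Loop passes (1-based) expected to pay the slow first read of a group."""
    return list(range(1, n_reads + 1, group))


def read_cycles(n_reads, cache_filled, group=8, slow=12, fast=4):
    """Total read cost under the assumed fill pattern."""
    if cache_filled:
        return n_reads * fast
    n_slow = len(slow_passes(n_reads, group))
    return n_slow * slow + (n_reads - n_slow) * fast


print(slow_passes(49))                 # 1, 9, 17, 25, 33, 41, 49
print(read_cycles(128, cache_filled=True), read_cycles(128, cache_filled=False))
```

For 128 reads this model gives 512 cycles with the cache filled and 640 with it empty, so the first call would pay about 128 extra cycles for the reads in one loop, if the guess is right.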
Tackled today
Expected and actual cycle count for the J-IALU version of the DC_Removal algorithm
Understanding why the stalls occur and how to fix them
Differences between the first time into a function (cache empty) and the second time into the function
Further unknowns – how memory operations really work