Lab 2 Ideas
Various forms of non-optimized FIR code
Demonstrate progress on Lab. 1
1.5% of term mark is associated with demonstrating progress on developing C++ code (FIR) and associated tests.
Demonstrate at the start of the Lab.
Speed IIR -- stage 5 M. Smith, ECE, University of Calgary, Canada 2 / 2804/21/23
Lab. notes
Some information and suggestions about Lab. 2 in the laboratory notes.
More information here
Only minor code changes are needed once the first ASM FIR is running
Keep old versions for reference and possible reanalysis as you learn more
Lab. 2 – Preparation for Lab. 3, where we optimize code
Step 1 – Generate ASM tests based on C++ tests from Lab. 1
Step 2 – Convert 2 C++ routines into assembly code and test
FIR_ONLINE_ASM(Xin, Yout, FIRcoeffs, FIR_N)
    For all data points – call FIR_ONLINE_ASM( )
    Time C++ code calling FIR_ONLINE_ASM with one loop
FIR_OFFLINE_ASM(XinArray, YoutArray, M, FirCoeffs, FIR_N)
    Time FIR_OFFLINE_ASM with double zero-overhead loop
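The two routines listed above could be prototyped in C++ first, so that the ASM tests have a reference to compare against. What follows is a minimal sketch under assumptions: the slides give only the routine names and argument lists, so the argument meanings (Xin as one new sample, Yout as a pointer to one output, M as the number of samples), the internal delay line, the MAX_TAPS limit and the _CPP names are all hypothetical.

```cpp
#include <cstddef>

// Hypothetical upper bound on the number of taps; not from the slides.
constexpr std::size_t MAX_TAPS = 64;

// Online FIR: one new input sample in, one output sample out.
// Assumption: the routine keeps its own delay line between calls.
void FIR_ONLINE_CPP(float Xin, float* Yout,
                    const float* FIRcoeffs, std::size_t FIR_N) {
    static float delay[MAX_TAPS] = {0.0f};
    if (FIR_N == 0 || FIR_N > MAX_TAPS) {
        *Yout = 0.0f;  // nothing sensible to compute
        return;
    }
    // Shift the delay line and put the newest sample at index 0.
    for (std::size_t i = FIR_N - 1; i > 0; --i) {
        delay[i] = delay[i - 1];
    }
    delay[0] = Xin;
    // Single loop over the taps.
    float sum = 0.0f;
    for (std::size_t i = 0; i < FIR_N; ++i) {
        sum += FIRcoeffs[i] * delay[i];
    }
    *Yout = sum;
}

// Offline FIR: the whole input array is available at once.
// Outer loop over the M samples, inner loop over the taps.
// Samples before the start of the array are treated as zero.
void FIR_OFFLINE_CPP(const float* XinArray, float* YoutArray, std::size_t M,
                     const float* FirCoeffs, std::size_t FIR_N) {
    for (std::size_t n = 0; n < M; ++n) {
        float sum = 0.0f;
        for (std::size_t k = 0; k < FIR_N && k <= n; ++k) {
            sum += FirCoeffs[k] * XinArray[n - k];
        }
        YoutArray[n] = sum;
    }
}
```

Feeding the same samples one at a time to the online version and all at once to the offline version should give matching outputs, which is one way to cross-check the two ASM routines.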
Lab. 2 – no optimization
However, as I mentioned before
Write the code with “planned to optimize” in mind
Get the ASM code to work in the best way you can
Then “prepare for optimization” – called “refactoring for speed” – see next slides
Version 1 and 2 – no parallel code
For (I = 0 to N-1, I++)
    read data[I]; J-Bus
    read coeff[I]; J-Bus
    multiply X-COMPUTE
    add X-COMPUTE
END_FOR
Time with software and then hardware loop – leave both code versions behind
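The loop above maps directly onto a C++ multiply-accumulate loop. The following is only a sketch of that mapping: the function name fir_mac is hypothetical, and the pointer post-increment is used as the C++ analogue of stepping through both arrays one element per iteration.

```cpp
#include <cstddef>

// One FIR output as a plain multiply-accumulate loop.
// Each statement in the body corresponds to one line of the pseudocode.
float fir_mac(const float* data, const float* coeff, std::size_t N) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < N; ++i) {
        float d = *data++;      // read data[I], then advance the pointer
        float c = *coeff++;     // read coeff[I], then advance the pointer
        float product = d * c;  // multiply
        sum += product;         // add
    }
    return sum;
}
```

Keeping the four steps as separate statements makes it easier to line the C++ up against the assembly version when comparing results.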
Expected part of report
Use post-modify addressing
Test code for N = 32
Time code for N large to minimize timing errors – getting into / out of timing code
Calculate the theoretical time for loop
    Number of instructions plus number of stalls
    Show stalls in code
Expect theory time = actual time within 1%
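The theory-versus-measurement comparison described above can be written down as a small helper. This is a sketch only: the function names are hypothetical, and the per-iteration instruction and stall counts have to come from your own inspection of the assembly code.

```cpp
#include <cmath>
#include <cstdint>

// Theoretical loop time in cycles: every iteration costs its
// instruction lines plus its stall cycles.
std::uint64_t theory_cycles(std::uint64_t N,
                            std::uint64_t instructions_per_iteration,
                            std::uint64_t stalls_per_iteration) {
    return N * (instructions_per_iteration + stalls_per_iteration);
}

// True when the measured time is within `tolerance` (a fraction,
// 0.01 = 1%) of the theoretical time.
bool within_tolerance(std::uint64_t theory, std::uint64_t measured,
                      double tolerance = 0.01) {
    if (theory == 0) {
        return measured == 0;
    }
    double diff = std::fabs(static_cast<double>(measured) -
                            static_cast<double>(theory));
    return diff <= tolerance * static_cast<double>(theory);
}
```

The fixed cost of getting into and out of the timing code is a constant number of cycles, so its share of the measurement shrinks as N grows. That is why the 1% target is more realistic for large N than for N = 32.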
Version 2 – no parallel code
Put hardware loop jump with add – why not
For (I = 0 to N-1, I++)
    read data[I]; J-Bus
    read coeff[I]; J-Bus
    MEMORY NO_OP?
    multiply X-COMPUTE
    COMPUTE NO_OP
    add X-COMPUTE
END_FOR
Time = N * 6
loop jump // add
Version 3 – no parallel code
Unroll loop
For (I = 0 to N-1, I += 2) -- N a multiple of 2
    read data[I]; J-Bus
    read coeff[I]; J-Bus
    MEMORY NO_OP?
    multiply X-COMPUTE
    COMPUTE NO_OP
    add X-COMPUTE
    read data[I + 1]; J-Bus
    read coeff[I + 1]; J-Bus
    MEMORY NO_OP?
    multiply X-COMPUTE
    COMPUTE NO_OP
    add X-COMPUTE
END_FOR
Time = N / 2 * 12
loop jump // add
No speed difference expected
Version 3A – no parallel code
Unroll loop but with “extra temporary registers” to prepare for making parallel later
For (I = 0 to N-1, I += 2) -- N a multiple of 2
    read data[I]; J-Bus
    read coeff[I]; J-Bus
    MEMORY NO_OP?
    multiply X-COMPUTE
    COMPUTE NO_OP
    add X-COMPUTE
    read data[I + 1]; J-Bus
    read coeff[I + 1]; J-Bus
    MEMORY NO_OP?
    multiply X-COMPUTE
    COMPUTE NO_OP
    add X-COMPUTE
END_FOR
Time = N / 2 * 12
loop jump // add
No speed difference expected
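In C++ terms, the unrolled loop with extra temporaries might look like the sketch below. The function name and variable names are hypothetical. The point it illustrates is that the second half of the body uses its own temporaries (d1, c1, p1) instead of reusing those of the first half, so the two halves do not depend on each other except through the running sum.

```cpp
#include <cstddef>

// FIR sum with the loop unrolled by two.
// Assumption: N is a multiple of 2, as the slide requires.
float fir_unrolled2(const float* data, const float* coeff, std::size_t N) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < N; i += 2) {
        // First half of the unrolled body, with its own temporaries.
        float d0 = data[i];       // read data[I]
        float c0 = coeff[i];      // read coeff[I]
        float p0 = d0 * c0;       // multiply
        sum += p0;                // add

        // Second half: separate temporaries, nothing from the
        // first half is overwritten.
        float d1 = data[i + 1];   // read data[I + 1]
        float c1 = coeff[i + 1];  // read coeff[I + 1]
        float p1 = d1 * c1;       // multiply
        sum += p1;                // add
    }
    return sum;
}
```

The result is the same as for the simple loop. Only the register usage changes, which is what leaves room for overlapping the two halves in a later version.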
Version 4 – no parallel code
Shift to using K-bus for coeff[ ]
For (I = 0 to N-1, I += 2) -- N a multiple of 2
    read data[I]; J-Bus
    read coeff[I]; K-Bus
    MEMORY NO_OP?
    multiply X-COMPUTE
    COMPUTE NO_OP
    add X-COMPUTE
    read data[I + 1]; J-Bus
    read coeff[I + 1]; K-Bus
    MEMORY NO_OP?
    multiply X-COMPUTE
    COMPUTE NO_OP
    add X-COMPUTE
END_FOR
Time = N / 2 * 12
loop jump // add
No speed difference expected
Version 4A – no parallel execution – J and K-bus access same memory block
For (I = 0 to N-1, I += 2) -- N a multiple of 2
    read data[I]; J-Bus , read coeff[I]; K-Bus
    MEMORY NO_OP?
    MEMORY NO_OP?
    multiply X-COMPUTE
    COMPUTE NO_OP
    add X-COMPUTE
    read data[I + 1]; J-Bus , read coeff[I + 1]; K-Bus
    MEMORY NO_OP?
    MEMORY NO_OP?
    multiply X-COMPUTE
    COMPUTE NO_OP
    add X-COMPUTE
END_FOR
Time = N / 2 * 12
loop jump // add
No speed difference expected
Expected cache issues
For (I = 0 to N-1, I++)
    read data[I]; J-Bus
    read coeff[I]; J-Bus
    multiply X-COMPUTE
    add X-COMPUTE
END_FOR
Time with software and then hardware loop – leave both code versions behind
First time into loop – cache on
For (I = 0 to N-1, I++)
    read data[I]; J-Bus
    read coeff[I]; J-Bus
    multiply X-COMPUTE
    add X-COMPUTE
END_FOR
Time = 2 * N * Mtime-not cached + 2 * N + N * # of stalls
Second time into loop – cache on
For (I = 0 to N-1, I++)
    read data[I]; J-Bus
    read coeff[I]; J-Bus
    multiply X-COMPUTE
    add X-COMPUTE
END_FOR
Time = N * time data fetch + N * time coeff fetch + 2 * N + N * stalls
Second time into loop – cache on
For (I = 0 to N-1, I++)
    read data[I]; J-Bus
    read coeff[I]; J-Bus
    multiply X-COMPUTE
    add X-COMPUTE
END_FOR
Time = (N – 1) * Mtime-cached + Mtime-(cache flush, cache reload)    (data fetch)
     + N * Mtime-cached    (coeffs)
     + 2 * N + N * stalls
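The first-pass and second-pass timing estimates can be evaluated side by side with a few lines of C++. This is a sketch under assumptions: the struct and function names are hypothetical, all times are in cycles, and the numeric values you plug in must come from the hardware manual or from your own measurements.

```cpp
// Memory and stall costs, all in cycles. Field names are illustrative.
struct LoopTiming {
    double m_not_cached;  // read that is not yet in the cache
    double m_cached;      // read that is already in the cache
    double m_reload;      // read that needs a cache flush and reload
    double stalls;        // stall cycles per loop iteration
};

// First time into the loop: every data and coefficient read
// goes to memory (2 * N uncached reads).
double first_pass_time(double N, const LoopTiming& t) {
    return 2 * N * t.m_not_cached + 2 * N + N * t.stalls;
}

// Second time into the loop: N - 1 cached data reads, one data read
// that needs a flush and reload, and N cached coefficient reads.
double second_pass_time(double N, const LoopTiming& t) {
    double data_fetch = (N - 1) * t.m_cached + t.m_reload;
    double coeff_fetch = N * t.m_cached;
    return data_fetch + coeff_fetch + 2 * N + N * t.stalls;
}
```

Comparing the two results for the same N gives a quick feel for how much of the measured time is memory access rather than multiply and add.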
Different types of memory timing for read operations
Read from external memory
Read from external memory + cache store
Read from internal memory
Read from internal memory + cache store
Read from cache
Note – what happens if the cache is full?
First time into loop – cache on
But now processor doing quad fetches into cache
For (I = 0 to N-1, I++)
    read data[I]; J-Bus
    read coeff[I]; J-Bus
    multiply X-COMPUTE
    add X-COMPUTE
END_FOR
Time = 2 * (N / 4) * Mtime-not cached + 2 * (3 * N / 4) * Mtime-cached + 2 * N + N * # of stalls
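As a worked version of the quad-fetch estimate, the sketch below evaluates the formula directly. The function name and parameter names are hypothetical, times are in cycles, and N is assumed to be a multiple of 4 so that the quarter and three-quarter splits are exact.

```cpp
// First time into the loop when each uncached read brings four
// words into the cache: one read in four goes to memory, the
// other three come from the cache.
double first_pass_quad_fetch_time(double N, double m_not_cached,
                                  double m_cached, double stalls) {
    double uncached_reads = 2 * (N / 4);    // data and coefficients
    double cached_reads = 2 * (3 * N / 4);  // data and coefficients
    return uncached_reads * m_not_cached +
           cached_reads * m_cached +
           2 * N +          // multiply and add
           N * stalls;
}
```

With illustrative costs of 10 cycles for an uncached read, 1 cycle for a cached read and 2 stalls per iteration, N = 32 gives 336 cycles, compared with 768 cycles when every first-pass read is uncached.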
Note
The hardware is doing quad fetches into cache
You ARE NOT doing quad fetches in your code
So why would that help?
Typical Cache behaviour
True for TigerSHARC – don’t know!
You issue Memory read request
Processor sends 2 memory read requests
    One to true memory
    One to cache
If cache replies “I have that value” then the value is fetched from cache and the Memory read request is aborted
Typical Cache behaviour
True for TigerSHARC – don’t know!
You issue Memory read request
Processor sends 2 memory read requests
    One to true memory
    One to cache
If cache replies “No value” then the value is fetched from memory, stored in cache and sent to the user.
No rule says that memory has to give only one value to the cache
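The effect of memory handing several values to the cache on one miss can be seen with a toy model. This is a sketch of generic cache behaviour, not of the TigerSHARC cache: the class name is hypothetical, the cache is read-only and has unlimited capacity, and the line size is a parameter.

```cpp
#include <cstddef>
#include <unordered_set>

// Toy read-only cache: a miss loads the whole aligned line of
// `words_per_line` words, so neighbouring addresses hit afterwards.
// Assumption: unlimited capacity, so no line is ever thrown away.
class LineFillCache {
public:
    explicit LineFillCache(std::size_t words_per_line)
        : words_per_line_(words_per_line == 0 ? 1 : words_per_line) {}

    // Returns true on a hit, false on a miss (which loads the line).
    bool read(std::size_t word_address) {
        std::size_t line = word_address / words_per_line_;
        if (lines_.count(line) != 0) {
            ++hits_;
            return true;
        }
        lines_.insert(line);
        ++misses_;
        return false;
    }

    std::size_t hits() const { return hits_; }
    std::size_t misses() const { return misses_; }

private:
    std::size_t words_per_line_;
    std::unordered_set<std::size_t> lines_;  // line numbers held in cache
    std::size_t hits_ = 0;
    std::size_t misses_ = 0;
};
```

Reading 32 consecutive words one at a time through a cache with 4-word lines gives 8 misses and 24 hits, even though the code never asks for more than one word per read.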
What if cache is full?
Expected behaviour
One existing cache line is thrown away
    Least used – random
Write operations can change cache
If the cache line being thrown away has changed, then that value must be written to memory before the cache line is changed
Does that happen in parallel with user code – depends on algorithm characteristics
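The full-cache behaviour described above can be tried out with a small model. This is a sketch under assumptions: it uses a least-recently-used replacement policy (one of the options named on the slide), works on whole line numbers, and the class name is hypothetical. It says nothing about what the TigerSHARC cache actually does.

```cpp
#include <cstddef>
#include <list>
#include <unordered_map>

// Toy write-back cache with least-recently-used replacement.
class WriteBackCache {
public:
    explicit WriteBackCache(std::size_t capacity_lines)
        : capacity_(capacity_lines == 0 ? 1 : capacity_lines) {}

    void read(std::size_t line) { touch(line, false); }
    void write(std::size_t line) { touch(line, true); }

    std::size_t evictions() const { return evictions_; }
    std::size_t write_backs() const { return write_backs_; }

private:
    struct Entry {
        std::size_t line;
        bool dirty;  // true once the line has been written
    };

    void touch(std::size_t line, bool is_write) {
        auto found = index_.find(line);
        if (found != index_.end()) {
            // Hit: mark dirty on a write and move to the front.
            found->second->dirty = found->second->dirty || is_write;
            order_.splice(order_.begin(), order_, found->second);
            return;
        }
        if (order_.size() == capacity_) {
            // Cache full: throw away the least recently used line.
            const Entry& victim = order_.back();
            if (victim.dirty) {
                ++write_backs_;  // changed line is written to memory first
            }
            index_.erase(victim.line);
            order_.pop_back();
            ++evictions_;
        }
        order_.push_front(Entry{line, is_write});
        index_[line] = order_.begin();
    }

    std::size_t capacity_;
    std::list<Entry> order_;  // front = most recently used
    std::unordered_map<std::size_t, std::list<Entry>::iterator> index_;
    std::size_t evictions_ = 0;
    std::size_t write_backs_ = 0;
};
```

With a capacity of two lines, writing line 0 and then reading lines 1, 2 and 3 causes two evictions but only one write-back, because only line 0 was changed while it was in the cache.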
See TigerSHARC hardware manual for cache details
If the timing behaviour is not what you are expecting – then work out why.
In your report explain your analysis
Final part of Lab. 2 – Version 4B
Run assembly code timing tests with data placed in dm memory and FIR coefficients placed in pm memory by compiler
Will only need a name change of version 4 to meet prototype change
FIR_ASM(*data, *fir, N)
FIR_ASM(dm *data, dm *fir, N)
FIR_ASM(dm *data, pm *fir, N) – Version 4B C++ prototype