Lab 2 Ideas

Various forms of non-optimized FIR code

Demonstrate progress on Lab. 1

1.5% of term mark is associated with demonstrating progress on developing C++ code (FIR) and associated tests.

Demonstrate at the start of the Lab.

Speed IIR -- stage 5. M. Smith, ECE, University of Calgary, Canada. 04/21/23

Lab. notes

Some information and suggestions about Lab. 2 in the laboratory notes.

More information here

Only minor code changes are needed once the first ASM FIR is running

Keep old versions for reference and possible reanalysis as you learn more


Lab. 2 – Preparation for Lab. 3 where we optimize code

Step 1 – Generate ASM tests based on C++ tests from Lab. 1

Step 2 – Convert 2 C++ routines into assembly code and test

FIR_ONLINE_ASM(Xin, Yout, FIRcoeffs, FIR_N)

For all data points – call FIR_ONLINE_ASM( )

Time C++ code calling FIR_ONLINE_ASM with one loop

FIR_OFFLINE_ASM(XinArray, YoutArray, M, FirCoeffs, FIR_N)

Time FIR_OFFLINE_ASM with double zero-overhead loop
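
As a point of comparison for the two assembly routines, the difference between the online and offline forms can be sketched in C++. This is only a reference model under stated assumptions: the names fir_online_ref and fir_offline_ref, the use of std::vector, and the caller-owned delay line are illustrative choices, not the Lab. 1 code or the real ASM prototypes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical "online" reference: one input sample in, one output sample out.
// The delay line is owned by the caller (an assumption made for this sketch).
float fir_online_ref(float xin, std::vector<float>& delay,
                     const std::vector<float>& coeffs) {
    assert(!coeffs.empty() && delay.size() == coeffs.size());
    // Shift the delay line by one place and store the newest sample at index 0.
    for (std::size_t i = delay.size() - 1; i > 0; --i) {
        delay[i] = delay[i - 1];
    }
    delay[0] = xin;
    // One loop: multiply-accumulate over the FIR_N taps.
    float sum = 0.0f;
    for (std::size_t i = 0; i < coeffs.size(); ++i) {
        sum += delay[i] * coeffs[i];
    }
    return sum;
}

// Hypothetical "offline" reference: the loop over all data points moves inside
// the routine, giving the two nested loops mentioned on the slide.
std::vector<float> fir_offline_ref(const std::vector<float>& xin,
                                   const std::vector<float>& coeffs) {
    std::vector<float> delay(coeffs.size(), 0.0f);
    std::vector<float> yout;
    yout.reserve(xin.size());
    for (float x : xin) {
        yout.push_back(fir_online_ref(x, delay, coeffs));
    }
    return yout;
}
```

Feeding an impulse (1, 0, 0, ...) through either form returns the coefficients in order, which makes a convenient first test for both ASM versions.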


Lab. 2 – no optimization

However, as I mentioned before

Write the code with “planned to optimize” in mind

Get the ASM code to work in the best way you can

Then “prepare for optimization” – called “refactoring for speed” – see next slides


Version 1 and 2 – no parallel code

For (I = 0 to N-1, I++)
    read data[i]; J-Bus
    read coeff[i]; J-Bus
    multiply X-COMPUTE
    add X-COMPUTE
END_FOR

Time with software and then hardware loop – leave both code versions behind
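
The loop above, written out in C++ with one statement per pseudocode step, might look like the sketch below. The function name fir_sum_v1 is hypothetical, and the pointer post-increments are only a C++ analogue of stepping through memory one element per read; the sketch says nothing about how the TigerSHARC buses or compute blocks are scheduled.

```cpp
#include <cstddef>

// Hypothetical Version 1 model: two reads, one multiply, one add per pass.
float fir_sum_v1(const float* data, const float* coeff, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        const float d = *data++;      // read data[i], then step to the next element
        const float c = *coeff++;     // read coeff[i], then step to the next element
        const float product = d * c;  // multiply
        sum += product;               // add
    }
    return sum;
}
```

Keeping the C++ this close to the pseudocode makes it easier to check each assembly line against a known-good step.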


Expected part of report

Use post-modify addressing

Test code for N = 32

Time code for N large to minimize timing errors – getting into / out of timing code

Calculate the theoretical time for loop

Number of instructions plus number of stalls

Show stalls in code

Expect theory time = actual time within 1%
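
The comparison of theory against measurement can be scripted so that every version is checked the same way. The sketch below assumes the theoretical time is simply passes times (instructions plus stalls) per pass; the helper names and the cycle counts used in the example are illustrative, not measured values.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Theoretical loop time: passes * (instructions per pass + stalls per pass).
std::uint64_t theoretical_loop_cycles(std::uint64_t passes,
                                      std::uint64_t instructions_per_pass,
                                      std::uint64_t stalls_per_pass) {
    return passes * (instructions_per_pass + stalls_per_pass);
}

// Difference between measured and theoretical time, as a percentage of theory.
double percent_difference(std::uint64_t theory_cycles,
                          std::uint64_t measured_cycles) {
    assert(theory_cycles > 0);
    const double theory = static_cast<double>(theory_cycles);
    const double measured = static_cast<double>(measured_cycles);
    return 100.0 * std::fabs(measured - theory) / theory;
}

// True when the measurement agrees with theory to within 1%.
bool within_one_percent(std::uint64_t theory_cycles,
                        std::uint64_t measured_cycles) {
    return percent_difference(theory_cycles, measured_cycles) <= 1.0;
}
```

For example, with 1024 passes, 4 instructions and 2 stalls per pass, theory gives 6144 cycles; a measurement of 6180 cycles would pass the 1% check, while 6300 would not and would need explaining in the report.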


Version 2 – no parallel code

Put hardware loop jump with add – why not

For (I = 0 to N-1, I++)
    read data[i]; J-Bus
    read coeff[i]; J-Bus
    MEMORY NO_OP?
    multiply X-COMPUTE
    COMPUTE NO_OP
    add X-COMPUTE
END_FOR

Time = N * 6

loop jump // add


Version 3 – no parallel code

Unroll loop

For (I = 0 to N-1, I += 2) -- N factor of 2
    read data[i]; J-Bus
    read coeff[i]; J-Bus
    MEMORY NO_OP?
    multiply X-COMPUTE
    COMPUTE NO_OP
    add X-COMPUTE
    read data[I + 1]; J-Bus
    read coeff[I + 1]; J-Bus
    MEMORY NO_OP?
    multiply X-COMPUTE
    COMPUTE NO_OP
    add X-COMPUTE
END_FOR

Time = N / 2 * 12

loop jump // add

No speed difference expected
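
A C++ analogue of the unrolled loop is sketched below, assuming N is even as the slide requires. The name fir_sum_unrolled is hypothetical. The same two temporaries are reused for both taps, so the second half of each pass cannot start until the first half has finished with them.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical Version 3 model: loop unrolled by two, temporaries reused.
float fir_sum_unrolled(const float* data, const float* coeff, std::size_t n) {
    assert(n % 2 == 0);  // N must be a multiple of 2
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; i += 2) {
        float d = data[i];    // first tap
        float c = coeff[i];
        sum += d * c;
        d = data[i + 1];      // second tap, same temporaries
        c = coeff[i + 1];
        sum += d * c;
    }
    return sum;
}
```

The work per tap is unchanged, which matches the slide's expectation of no speed difference: half as many passes, each twice as long.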


Version 3A – no parallel code

Unroll loop but with “extra temporary registers” to prepare for making parallel later

For (I = 0 to N-1, I += 2) -- N factor of 2
    read data[i]; J-Bus
    read coeff[i]; J-Bus
    MEMORY NO_OP?
    multiply X-COMPUTE
    COMPUTE NO_OP
    add X-COMPUTE
    read data[I + 1]; J-Bus
    read coeff[I + 1]; J-Bus
    MEMORY NO_OP?
    multiply X-COMPUTE
    COMPUTE NO_OP
    add X-COMPUTE
END_FOR

Time = N / 2 * 12

loop jump // add

No speed difference expected
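
The idea of extra temporaries can be illustrated in C++ by giving each tap of the unrolled pass its own variables. This is a sketch with hypothetical names (fir_sum_unrolled_temps, d0, c0, d1, c1, p0, p1); in assembly these would be separate registers. Once the two taps no longer share storage, their reads and multiplies have no dependency on each other, which is what later allows them to be overlapped.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical Version 3A model: unrolled by two with separate temporaries.
float fir_sum_unrolled_temps(const float* data, const float* coeff,
                             std::size_t n) {
    assert(n % 2 == 0);  // N must be a multiple of 2
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; i += 2) {
        const float d0 = data[i];       // tap i uses d0, c0, p0
        const float c0 = coeff[i];
        const float d1 = data[i + 1];   // tap i + 1 uses d1, c1, p1
        const float c1 = coeff[i + 1];
        const float p0 = d0 * c0;
        const float p1 = d1 * c1;
        sum += p0;                      // additions stay in the original order
        sum += p1;
    }
    return sum;
}
```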


Version 4 – no parallel code

Shift to using K-bus for coeff[ ]

For (I = 0 to N-1, I += 2) -- N factor of 2
    read data[i]; J-Bus
    read coeff[i]; K-Bus
    MEMORY NO_OP?
    multiply X-COMPUTE
    COMPUTE NO_OP
    add X-COMPUTE
    read data[I + 1]; J-Bus
    read coeff[I + 1]; K-Bus
    MEMORY NO_OP?
    multiply X-COMPUTE
    COMPUTE NO_OP
    add X-COMPUTE
END_FOR

Time = N / 2 * 12

loop jump // add

No speed difference expected


Version 4A – no parallel execution – J and K-bus access same memory block

For (I = 0 to N-1, I += 2) -- N factor of 2
    read data[i]; J-Bus , read coeff[i]; K-Bus
    MEMORY NO_OP?
    MEMORY NO_OP?
    multiply X-COMPUTE
    COMPUTE NO_OP
    add X-COMPUTE
    read data[I + 1]; J-Bus , read coeff[I + 1]; K-Bus
    MEMORY NO_OP?
    MEMORY NO_OP?
    multiply X-COMPUTE
    COMPUTE NO_OP
    add X-COMPUTE
END_FOR

Time = N / 2 * 12

loop jump // add

No speed difference expected


Expected cache issues

For (I = 0 to N-1, I++)
    read data[i]; J-Bus
    read coeff[i]; J-Bus
END_FOR

Time with software and then hardware loop – leave both code versions behind


First time into loop – cache on



END_FOR

Time = 2 * N * Mtime(not cached) + 2 * N + N * (# of stalls)


Second time into loop – cache on

For (I = 0 to N-1, I++)



END_FOR

Time = N * time data fetch + N * coeff fetch + 2 * N + N * stalls


Second time into loop – cache on

For (I = 0 to N-1, I++)



END_FOR

Time = (N – 1) * Mtime(cached) + Mtime(cache flush, cache reload)   -- data fetch
     + N * Mtime(cached)   -- coeffs
     + 2 * N + N * stalls
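
The first-pass and second-pass formulas can be put side by side in a small cost model, which makes it easy to see how much of the first pass is spent on uncached reads. All cycle counts in the struct are placeholders to be replaced with values from the hardware manual or from measurement; the names LoopCosts, first_pass_cycles and second_pass_cycles are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative per-read costs in cycles (assumed values, not TigerSHARC data).
struct LoopCosts {
    std::uint64_t not_cached;  // Mtime for a read that is not in the cache
    std::uint64_t cached;      // Mtime for a read served from the cache
    std::uint64_t reload;      // Mtime for the one read needing a flush and reload
    std::uint64_t stalls;      // stalls per pass of the loop
};

// First time into loop: 2 * N * Mtime(not cached) + 2 * N + N * stalls.
std::uint64_t first_pass_cycles(std::uint64_t n, const LoopCosts& c) {
    return 2 * n * c.not_cached + 2 * n + n * c.stalls;
}

// Second time into loop: data costs (N - 1) cached reads plus one reload,
// coefficients cost N cached reads, then 2 * N + N * stalls as before.
std::uint64_t second_pass_cycles(std::uint64_t n, const LoopCosts& c) {
    assert(n > 0);
    const std::uint64_t data_fetch = (n - 1) * c.cached + c.reload;
    const std::uint64_t coeff_fetch = n * c.cached;
    return data_fetch + coeff_fetch + 2 * n + n * c.stalls;
}
```

With the made-up costs {10, 1, 12, 2} and N = 32, the model gives 768 cycles for the first pass and 203 for the second. The real ratio depends entirely on the actual memory timings.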


Different types of memory timing for read operations

Read from external memory

Read from external memory + cache store

Read from internal memory

Read from internal memory + cache store

Read from cache

Note – what happens if the cache is full


First time into loop – cache on

But now processor doing quad fetches into cache



END_FOR

Time = 2 * (N / 4) * Mtime(not cached) + 2 * (3 * N / 4) * Mtime(cached) + 2 * N + N * (# of stalls)
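
The quad-fetch formula can be evaluated with the sketch below. It assumes N is a multiple of 4 and that exactly one read in every four misses the cache while the other three hit, as the formula implies; the function name and the cycle counts in the example are illustrative only.

```cpp
#include <cassert>
#include <cstdint>

// First pass with quad fetches into cache:
// 2 * (N / 4) * Mtime(not cached) + 2 * (3 * N / 4) * Mtime(cached)
// + 2 * N + N * stalls.
std::uint64_t first_pass_quad_cycles(std::uint64_t n,
                                     std::uint64_t not_cached,
                                     std::uint64_t cached,
                                     std::uint64_t stalls) {
    assert(n % 4 == 0);  // assumption: whole quads only
    const std::uint64_t misses = 2 * (n / 4);      // data and coeffs
    const std::uint64_t hits = 2 * (3 * n / 4);    // remaining reads
    return misses * not_cached + hits * cached + 2 * n + n * stalls;
}
```

With N = 32, an uncached read of 10 cycles, a cached read of 1 cycle and 2 stalls per pass (all made-up numbers), the result is 336 cycles, compared with 768 from the earlier formula in which every first-pass read is uncached.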


Note

The hardware is doing quad fetches into cache

You ARE NOT doing quad fetches in your code

So why would that help


Typical Cache behaviour

True for TigerSHARC – don’t know!

You issue Memory read request

Processor sends 2 memory read requests

One to true memory

One to cache

If cache replies “I have that value” then the value is fetched from cache and the Memory read request is aborted



If cache replies “No value” then the value is fetched from memory, stored in cache and sent to the user.

No rule says that memory has to give only one value to the cache


What if cache is full?

Expected behaviour

One existing cache line is thrown away

Least used – random

Write operations can change cache

If the cache line being thrown away has changed, then that value must be written to memory before the cache line is replaced

Does that happen in parallel with user code – depends on algorithm characteristics
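
The eviction and write-back behaviour described above can be explored with a toy simulation. The sketch below is a generic, fully associative cache with least-recently-used replacement and one address per line; the class name ToyCache and every design choice in it are assumptions made for illustration, and it is not a model of the TigerSHARC cache.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>

// Generic toy cache: fully associative, least-recently-used eviction,
// changed (dirty) lines are written back when they are thrown away.
class ToyCache {
public:
    explicit ToyCache(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    void read(std::uint32_t address) { access(address, false); }
    void write(std::uint32_t address) { access(address, true); }

    std::size_t hits = 0;
    std::size_t misses = 0;
    std::size_t write_backs = 0;

private:
    struct Line {
        std::uint32_t address;
        bool dirty;
    };

    void access(std::uint32_t address, bool is_write) {
        for (auto it = lines_.begin(); it != lines_.end(); ++it) {
            if (it->address == address) {
                ++hits;
                it->dirty = it->dirty || is_write;
                lines_.splice(lines_.begin(), lines_, it);  // most recently used
                return;
            }
        }
        ++misses;
        if (lines_.size() == capacity_) {
            if (lines_.back().dirty) {
                ++write_backs;  // changed line goes to memory before reuse
            }
            lines_.pop_back();  // throw away the least recently used line
        }
        lines_.push_front(Line{address, is_write});
    }

    std::size_t capacity_;
    std::list<Line> lines_;  // front = most recently used
};
```

Running a short access pattern through a two-line cache shows a clean line being dropped silently while a written line costs an extra write-back, which is the kind of effect to look for when measured timings disagree with theory.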


See TigerSHARC hardware manual for cache details

If the timing behaviour is not what you are expecting – then work out why.

In your report, explain your analysis


Final part of Lab. 2 – Version 4B

Run assembly code timing tests with data placed in dm memory and FIR coefficients placed in pm memory by compiler

Will only need a name change of version 4 to meet prototype change

FIR_ASM(*data, *fir, N)

FIR_ASM(dm *data, dm *fir, N)

FIR_ASM(dm *data, pm *fir, N) – Version 4B C++ prototype

