Page 1: Lab 2 Ideas

Lab 2 Ideas

Various forms of non-optimized FIR code

Page 2: Lab 2 Ideas

Demonstrate progress on Lab. 1

1.5% of the term mark is associated with demonstrating progress on developing C++ code (FIR) and associated tests.

Demonstrate at the start of the Lab.

Page 3: Lab 2 Ideas

Lab. notes

Some information and suggestions about Lab. 2 in the laboratory notes.

More information here

Minor changes in the code will be needed once the first asm FIR is running

Keep old versions for reference and possible reanalysis as you learn more

Page 4: Lab 2 Ideas

Lab. 2 – Preparation for Lab. 3 where we optimize code

Step 1 – Generate ASM tests based on C++ tests from Lab. 1

Step 2 – Convert 2 C++ routines into assembly code and test

FIR_ONLINE_ASM(Xin, Yout, FIRcoeffs, FIR_N)
  For all data points – call FIR_ONLINE_ASM( )
  Time C++ code calling FIR_ONLINE_ASM with one loop

FIR_OFFLINE_ASM(XinArray, YoutArray, M, FirCoeffs, FIR_N)
  Time FIR_OFFLINE_ASM with double zero-overhead loop
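
A minimal C++ sketch of how these two timing tests might be organised. Everything except the routine names and argument order (taken from this slide) is an assumption: the float/int parameter types, the array names, and read_cycle_counter(), a hypothetical placeholder for whatever cycle-counting mechanism you use.

#include <cstdio>

// Prototypes from this slide; the parameter types are assumed
extern "C" void FIR_ONLINE_ASM(float Xin, float *Yout, float *FIRcoeffs, int FIR_N);
extern "C" void FIR_OFFLINE_ASM(float *XinArray, float *YoutArray, int M,
                                float *FirCoeffs, int FIR_N);

// Stub for a hypothetical cycle counter -- replace with the real mechanism on your target
static unsigned long long read_cycle_counter(void)
{
    return 0;
}

void TimeOnlineFIR(float *Xin, float *Yout, int M, float *coeffs, int FIR_N)
{
    unsigned long long start = read_cycle_counter();
    for (int i = 0; i < M; i++)                      // one C++ loop over all data points
        FIR_ONLINE_ASM(Xin[i], &Yout[i], coeffs, FIR_N);
    unsigned long long stop = read_cycle_counter();
    std::printf("FIR_ONLINE_ASM: %llu cycles for %d samples\n", stop - start, M);
}

void TimeOfflineFIR(float *Xin, float *Yout, int M, float *coeffs, int FIR_N)
{
    unsigned long long start = read_cycle_counter();
    FIR_OFFLINE_ASM(Xin, Yout, M, coeffs, FIR_N);    // loops over all M samples internally
    unsigned long long stop = read_cycle_counter();
    std::printf("FIR_OFFLINE_ASM: %llu cycles for %d samples\n", stop - start, M);
}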

Page 5: Lab 2 Ideas

Lab. 2 – no optimization

However, as I mentioned before:
  Write the code with “planned to optimize” in mind
  Get the ASM code to work in the best way you can
  Then “prepare for optimization” – called “refactoring for speed” – see the next slides

Page 6: Lab 2 Ideas

Version 1 and 2 – no parallel code

For (I = 0 to N-1, I++)
  read data[i]     -- J-Bus
  read coeff[i]    -- J-Bus
  multiply         -- X-COMPUTE
  add              -- X-COMPUTE
END_FOR

Time with software and then hardware loop – leave both code versions behind
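
In C++ terms this loop is just the standard FIR inner product. A minimal reference version (the function and array names are assumptions, not the lab's required code):

// Reference, non-optimized FIR inner loop matching the pseudocode above.
// Each iteration does two memory reads (data and coeff), one multiply and one
// add -- the asm version only adds the choice of bus and loop hardware.
float fir_reference(const float *data, const float *coeff, int N)
{
    float sum = 0.0f;
    for (int i = 0; i < N; i++)        // software loop; also time a hardware-loop version
        sum += data[i] * coeff[i];     // multiply, then add (both X-COMPUTE in the asm)
    return sum;
}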

Page 7: Lab 2 Ideas

Expected part of report

  Use post-modify addressing
  Test code for N = 32
  Time code for N large to minimize timing errors – getting into / out of the timing code
  Calculate the theoretical time for the loop: number of instructions plus number of stalls
  Show stalls in the code
  Expect theory time = actual time within 1%
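
As an illustration of the “theory time” calculation, the sketch below plugs assumed per-iteration counts into the instructions-plus-stalls formula. Every number is a placeholder to be replaced by the counts from your own loop.

#include <cstdio>

int main()
{
    const int N              = 32;   // tap count for the N = 32 test
    const int instr_per_tap  = 4;    // read data, read coeff, multiply, add (assumed)
    const int stalls_per_tap = 2;    // memory / compute NO_OPs you identify (assumed)
    const int overhead       = 10;   // getting into / out of the timing code (assumed)

    const int theory_cycles = N * (instr_per_tap + stalls_per_tap) + overhead;
    std::printf("theoretical loop time = %d cycles\n", theory_cycles);
    // Expect theory time = measured time to within about 1% once N is large.
    return 0;
}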

Page 8: Lab 2 Ideas

Version 2 – no parallel code
Put hardware loop jump with add – why not?

For (I = 0 to N-1, I++)                  -- Time = N * 6
  read data[i]       -- J-Bus
  read coeff[i]      -- J-Bus
  loop jump // add   -- MEMORY NO_OP?
  multiply           -- X-COMPUTE, COMPUTE NO_OP
  add                -- X-COMPUTE
END_FOR

Page 9: Lab 2 Ideas

Version 3 – no parallel code
Unroll loop

For (I = 0 to N-1, I += 2)               -- N a factor of 2; Time = N / 2 * 12
  read data[i]        -- J-Bus
  read coeff[i]       -- J-Bus
  loop jump // add    -- MEMORY NO_OP?
  multiply            -- X-COMPUTE, COMPUTE NO_OP
  add                 -- X-COMPUTE
  read data[I + 1]    -- J-Bus
  read coeff[I + 1]   -- J-Bus, MEMORY NO_OP?
  multiply            -- X-COMPUTE, COMPUTE NO_OP
  add                 -- X-COMPUTE
END_FOR

No speed difference expected

Page 10: Lab 2 Ideas

Version 3A – no parallel code
Unroll loop but with “extra temporary registers” to prepare for making it parallel later

For (I = 0 to N-1, I += 2)               -- N a factor of 2; Time = N / 2 * 12
  read data[i]        -- J-Bus
  read coeff[i]       -- J-Bus
  loop jump // add    -- MEMORY NO_OP?
  multiply            -- X-COMPUTE, COMPUTE NO_OP
  add                 -- X-COMPUTE
  read data[I + 1]    -- J-Bus
  read coeff[I + 1]   -- J-Bus, MEMORY NO_OP?
  multiply            -- X-COMPUTE, COMPUTE NO_OP
  add                 -- X-COMPUTE
END_FOR

No speed difference expected
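
A C++ sketch of the Version 3A idea: unroll by two and keep the two halves in separate temporaries so they have no serial dependence and can later be issued in parallel. The function and variable names are assumptions.

// Unrolled-by-two FIR loop with separate temporaries (the Version 3A structure).
// sum0 and sum1 are independent, so the even and odd halves of each iteration
// no longer depend on each other -- the property that later allows parallel issue.
float fir_unrolled2(const float *data, const float *coeff, int N)
{
    float sum0 = 0.0f;                 // even-index products
    float sum1 = 0.0f;                 // odd-index products
    for (int i = 0; i < N; i += 2)     // N assumed to be a multiple of 2
    {
        sum0 += data[i]     * coeff[i];
        sum1 += data[i + 1] * coeff[i + 1];
    }
    return sum0 + sum1;                // combine the partial sums once, after the loop
}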

Page 11: Lab 2 Ideas

Version 4 – no parallel code
Shift to using K-bus for coeff[ ]

For (I = 0 to N-1, I += 2)               -- N a factor of 2; Time = N / 2 * 12
  read data[i]        -- J-Bus
  read coeff[i]       -- K-Bus
  loop jump // add    -- MEMORY NO_OP?
  multiply            -- X-COMPUTE, COMPUTE NO_OP
  add                 -- X-COMPUTE
  read data[I + 1]    -- J-Bus
  read coeff[I + 1]   -- K-Bus, MEMORY NO_OP?
  multiply            -- X-COMPUTE, COMPUTE NO_OP
  add                 -- X-COMPUTE
END_FOR

No speed difference expected

Page 12: Lab 2 Ideas

Version 4A – no parallel execution – J and K-bus access the same memory block

For (I = 0 to N-1, I += 2)                         -- N a factor of 2; Time = N / 2 * 12
  read data[i], read coeff[i]          -- J-Bus, K-Bus; MEMORY NO_OP? MEMORY NO_OP?
  multiply                             -- X-COMPUTE, COMPUTE NO_OP
  add                                  -- X-COMPUTE
  loop jump // add
  read data[I + 1], read coeff[I + 1]  -- J-Bus, K-Bus; MEMORY NO_OP? MEMORY NO_OP?
  multiply                             -- X-COMPUTE, COMPUTE NO_OP
  add                                  -- X-COMPUTE
END_FOR

No speed difference expected

Page 13: Lab 2 Ideas

Expected cache issues

For (I = 0 to N-1, I++)
  read data[i]     -- J-Bus
  read coeff[i]    -- J-Bus
  multiply         -- X-COMPUTE
  add              -- X-COMPUTE
END_FOR

Time with software and then hardware loop – leave both code versions behind

Page 14: Lab 2 Ideas

First time into loop – cache on

For (I = 0 to N-1, I++)
  read data[i]     -- J-Bus
  read coeff[i]    -- J-Bus
  multiply         -- X-COMPUTE
  add              -- X-COMPUTE
END_FOR

Time = 2 * N * Mtime_not_cached + 2 * N + N * (# of stalls)

Page 15: Lab 2 Ideas

Second time into loop – cache on

For (I = 0 to N-1, I++)
  read data[i]     -- J-Bus
  read coeff[i]    -- J-Bus
  multiply         -- X-COMPUTE
  add              -- X-COMPUTE
END_FOR

Time = N * (time data fetch) + N * (time coeff fetch) + 2 * N + N * stalls

Page 16: Lab 2 Ideas

Second time into loop – cache on

For (I = 0 to N-1, I++)
  read data[i]     -- J-Bus
  read coeff[i]    -- J-Bus
  multiply         -- X-COMPUTE
  add              -- X-COMPUTE
END_FOR

Time = (N – 1) * Mtime_cached + Mtime_(cache flush, cache reload)   -- data fetch
     + N * Mtime_cached                                             -- coeff fetch
     + 2 * N + N * stalls

Page 17: Lab 2 Ideas

Different types of memory timing for read operations

  Read from external memory
  Read from external memory + cache store
  Read from internal memory
  Read from internal memory + cache store
  Read from cache

Note – what happens if the cache is full?

Page 18: Lab 2 Ideas

First time into loop – cache on
But now the processor is doing quad fetches into cache

For (I = 0 to N-1, I++)
  read data[i]     -- J-Bus
  read coeff[i]    -- J-Bus
  multiply         -- X-COMPUTE
  add              -- X-COMPUTE
END_FOR

Time = 2 * (N / 4) * Mtime_not_cached + 2 * (3N / 4) * Mtime_cached + 2 * N + N * (# of stalls)
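
The sketch below just turns the timing expressions from the last few slides into code so the first-pass estimates (with and without quad fetches) and the fully cached second pass can be compared side by side. Every Mtime and stall value is an assumed placeholder, not a TigerSHARC number.

#include <cstdio>

int main()
{
    const int N              = 1024; // samples processed in the timed loop (assumed)
    const int Mtime_uncached = 10;   // cycles for a read that misses the cache (assumed)
    const int Mtime_cached   = 1;    // cycles for a read that hits the cache (assumed)
    const int stalls_per_tap = 2;    // NO_OPs per iteration (assumed)

    // First pass, no quad fetch: all 2N reads go out to slow memory.
    const int first_pass = 2 * N * Mtime_uncached + 2 * N + N * stalls_per_tap;

    // First pass with quad fetches: only every 4th read misses; the other three
    // quarters hit the line the hardware has already pulled into the cache.
    const int first_pass_quad = 2 * (N / 4) * Mtime_uncached
                              + 2 * (3 * N / 4) * Mtime_cached
                              + 2 * N + N * stalls_per_tap;

    // Second pass: both data and coefficient reads hit the cache.
    const int second_pass = 2 * N * Mtime_cached + 2 * N + N * stalls_per_tap;

    std::printf("first pass            %d cycles\n", first_pass);
    std::printf("first pass (quad)     %d cycles\n", first_pass_quad);
    std::printf("second pass (cached)  %d cycles\n", second_pass);
    return 0;
}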

Page 19: Lab 2 Ideas

Note

The hardware is doing quad fetches into cache

You ARE NOT doing quad fetches in your code

So why would that help?

Page 20: Lab 2 Ideas

Typical cache behaviour (true for the TigerSHARC? – don’t know!)

You issue a memory read request; the processor sends 2 memory read requests:
  One to true memory
  One to the cache

If the cache replies “I have that value”, then the value is fetched from the cache and the memory read request is aborted.

Page 21: Lab 2 Ideas

Typical cache behaviour (true for the TigerSHARC? – don’t know!)

You issue a memory read request; the processor sends 2 memory read requests:
  One to true memory
  One to the cache

If the cache replies “no value”, then the value is fetched from memory, stored in the cache and sent to the user.

There is no rule that says memory has to give only one value to the cache.

Page 22: Lab 2 Ideas

What if the cache is full? Expected behaviour:
  One existing cache line is thrown away (least used, or random)
  Write operations can change the cache
  If the cache line being thrown away has changed, then that value must be written back to memory before the cache line is replaced
  Does that happen in parallel with user code? – depends on algorithm characteristics
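
A toy C++ model of the behaviour described on the last three slides: a read probes the cache, a hit cancels the memory access, a miss fills a line, and a dirty line is written back before it is replaced. The size, direct-mapped placement and write policy are assumptions for illustration, not the TigerSHARC’s actual design.

#include <cstdio>

const int LINES = 8;                       // assumed (tiny) cache size

struct Line { bool valid = false, dirty = false; int tag = 0; int value = 0; };
Line cache_lines[LINES];
int  memory[1024];                         // "true" memory

int cache_read(int addr)
{
    Line &line = cache_lines[addr % LINES];
    if (line.valid && line.tag == addr)
        return line.value;                 // hit: the memory request would be aborted
    if (line.valid && line.dirty)
        memory[line.tag] = line.value;     // write the changed line back before replacing it
    line.valid = true;  line.dirty = false;
    line.tag   = addr;  line.value = memory[addr];   // miss: fetch from memory, store in cache
    return line.value;
}

void cache_write(int addr, int value)
{
    Line &line = cache_lines[addr % LINES];
    if (line.valid && line.dirty && line.tag != addr)
        memory[line.tag] = line.value;     // evicting a changed line: write it back first
    line.valid = true;  line.dirty = true; // write operations change the cache
    line.tag   = addr;  line.value = value;
}

int main()
{
    memory[100] = 42;
    std::printf("%d\n", cache_read(100));  // miss, line filled from memory
    std::printf("%d\n", cache_read(100));  // hit, memory access avoided
    cache_write(108, 7);                   // maps to the same line, so it replaces it
    std::printf("%d\n", cache_read(100));  // miss again after the eviction
    return 0;
}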

Page 23: Lab 2 Ideas

See the TigerSHARC hardware manual for cache details.

If the timing behaviour is not what you are expecting, then work out why. In your report, explain your analysis.

Page 24: Lab 2 Ideas

Final part of Lab. 2 – Version 4B

Run assembly code timing tests with the data placed in dm memory and the FIR coefficients placed in pm memory by the compiler.

Will only need a name change of Version 4 to meet the prototype change:

  FIR_ASM(*data, *fir, N)
  FIR_ASM(dm *data, dm *fir, N)
  FIR_ASM(dm *data, pm *fir, N)    -- Version 4B C++ prototype
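
A hedged sketch of how the Version 4B declarations might look in the C++ test file, assuming float data, an int tap count, and the dm / pm pointer qualifiers shown above. Check your compiler manual for the exact qualifier syntax; the _V4 / _V4B name suffixes are only illustrative.

// Version 4 – both arrays in dm memory (parameter types assumed)
extern "C" float FIR_ASM_V4(dm float *data, dm float *fir, int N);

// Version 4B – same routine renamed; only the coefficient pointer moves to pm memory
extern "C" float FIR_ASM_V4B(dm float *data, pm float *fir, int N);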
