
1

Averaging Filter
Comparing performance of C++ and ‘our’ ASM

Example of program development on SHARC using C++ and assembly

Planned for Tuesday 7th October afternoon
Practical examples handled in Lab 1

2

Demo (uTTCOS) and Test (E-UNIT) configurations

[Figure: block diagrams of the two configurations]
Demo (uTTCOS): True A/D -> DMA CHANNEL -> True leftChannel_In -> Audio ISR with Filter (YOUR SOFTWARE) -> True leftChannel_Out -> DMA CHANNEL -> True D/A
Test (E-UNIT): Test InAudio array -> Mock leftChannel_In -> Filter (YOUR SOFTWARE) -> Mock leftChannel_Out -> Test OutAudio array, with Mock TransmitA2D and MOCK ReceiveD2A standing in for the audio devices

Test
• Set up InAudio[ ]
• Set up Expected[ ]
• In loop { call Filter to produce OutAudio[ ] }
• Compare Expected[ ] and OutAudio[ ]
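The test steps listed for the E-UNIT configuration can be sketched in plain C++ as follows. This is only an illustration under stated assumptions: `Filter` here is a hypothetical stand-in that simply copies its input, and `RunFilterTest` is an invented helper name, not part of E-UNIT or the course library.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical stand-in for the filter under test: a plain copy.
float Filter(float in) { return in; }

// Calls Filter once per input sample to produce outAudio, then reports
// whether every output sample matches expected within the tolerance.
bool RunFilterTest(const float* inAudio, const float* expected,
                   float* outAudio, std::size_t count, float tolerance) {
    for (std::size_t i = 0; i < count; ++i) {
        outAudio[i] = Filter(inAudio[i]);   // in loop: produce OutAudio[ ]
    }
    for (std::size_t i = 0; i < count; ++i) {
        if (std::fabs(outAudio[i] - expected[i]) > tolerance) {
            return false;                   // Expected[ ] and OutAudio[ ] differ
        }
    }
    return true;
}
```

Keeping the input, expected and output arrays separate means the same helper can later be pointed at a real filter without changing the comparison code.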

3

Mock Device Registers “satisfy linker”
CCES says “inconsistent” definition

• Poor mock – we move values in Audio Device registers by hand

• Can we “MOCK” Receive_ADC_Samples?
– Typical industrial testing approach, needed when hardware is “NOT-YET-DEVELOPED”

4

Better Simulation

• What if the algorithm is “by mistake” still doing Left_Out = Left_In (Copy)? With a constant input we would get the same answer

• Currently “LeftChannel_In1” is a fixed constant – making it difficult for us to check whether our algorithm would work for more complex signals

• So we could start testing the algorithm validity (not its speed) by changing LeftChannel_In1 through “mocking” the “ReceiveA2D( )” and “TransmitD2A( )” audio devices
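One way to picture such a mock is a small host-side class that hands out scripted input samples and records whatever is sent to the output. The sketch below is an assumption made for illustration: the class name `MockAudioDevice`, the member signatures and the use of `std::vector` are not taken from the course library.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical mock audio device (names and signatures are illustrative only).
class MockAudioDevice {
public:
    explicit MockAudioDevice(std::vector<float> input)
        : input_(std::move(input)) {}

    // Stands in for the A/D read: returns the next scripted sample,
    // or 0 once the scripted input has been used up.
    float ReceiveA2D() {
        return next_ < input_.size() ? input_[next_++] : 0.0f;
    }

    // Stands in for the D/A write: records the sample for later checking.
    void TransmitD2A(float sample) { output_.push_back(sample); }

    const std::vector<float>& Output() const { return output_; }

private:
    std::vector<float> input_;
    std::size_t next_ = 0;
    std::vector<float> output_;
};
```

Feeding a ramp such as 1, 2, 3 instead of a constant makes a “copy by mistake” visible, because a copy returns the ramp unchanged while an average does not.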

5

Using ‘MockDevice.c’ loads (RECAP)
What do we do about ‘Receive_ADC_Samples( )’?

• These ‘mock’ routines satisfy a linker requirement for a function we don’t use. When they need to become more detailed, worry about it later (WAIL).

6

Mocked device inside Assign1Library
Can be used during Lab 1 -- 4

MADE PRIVATE (FIXED)

GOOD OR BAD IDEA?

VARIETY OF ALGORITHMS TESTED

7

Use GUI to add new test group for Averaging code – 3 styles of tests (RECAP)

8

Testing
• Test that it works
• Test that it meets real-time performance
– Measure ms / sample for 1 channel = Time-1CH
– Require 20 ms > 8 * Time-1CH
• Move code onto Resource chart
– Determine theoretical best time if all optimizations could be found
• Test to determine real cycle count: cycles / tap / sample
• Examine CPP .lst file (.i or .is) or your ASM file to determine expected cycle count
– Work out why there is a difference between theory and real
– Looking for accuracy of better than 1 cycle in 1000
– Assume 1 cycle per instruction except jumps and memory accesses and movement of I registers to memory – or any other delay we find common
• Be able to move the theoretical calculation to other processor architectures (timings) for MidTerm 1 on Thursday 23rd Oct

9

Theoretical Analysis

• We expect our theoretical analysis to be as fast as or faster than what the C++ optimized code takes

• We are not using any C++ DSP extensions, so expect efficient rather than optimized code

• Is 816 cycles per sample processed by Average Filter the speed we would expect based on our understanding of the processor architecture?

10

Expectations

• First instruction after a jump takes 3 cycles to finish executing

• After that, 1 instruction, all things being equal, takes 1 cycle

• 1 cycle for a read, write, add, multiply
• D? cycles for a division

11

Averaging Filter with Loop
Theoretical Analysis

• Fetch N values from memory -- N cycles
• Perform N add operations -- N cycles
• Go round the sum for-loop -- N * FLC cycles
– Where FLC is # instructions to handle For-Loop-Control – includes all overheads of jumping during for-loop
• Exit for loop (done once) -- EFL cycles
• Do division -- D cycles
• Return a value from function -- RV cycles
• Enter and exit Average routine -- EER cycles

AVERAGE_FILTER_TIME = N(1 + 1 + FLC) + EFL + D + RV + EER cycles

VERY BIG DEFECT IN ANALYSIS FOUND LATER: ACTUAL THEORETICAL TIME IS TWICE AS LARGE AS THIS
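The first-estimate formula can be written as a small function so that different guesses for the overheads are easy to compare. This is a sketch of the formula exactly as stated, including its known defect. The struct and function names are invented here, and any numbers passed in are placeholders, not measured SHARC timings.

```cpp
// Per-item cycle costs used by the first-estimate formula.
struct CycleCosts {
    unsigned flc;  // For-Loop-Control cycles per loop iteration
    unsigned efl;  // exit the for loop (done once)
    unsigned d;    // division
    unsigned rv;   // return a value from the function
    unsigned eer;  // enter and exit the Average routine
};

// AVERAGE_FILTER_TIME = N(1 + 1 + FLC) + EFL + D + RV + EER
constexpr unsigned AverageFilterTime(unsigned n, const CycleCosts& c) {
    return n * (1 + 1 + c.flc) + c.efl + c.d + c.rv + c.eer;
}
```

For example, with N = 64 and placeholder costs FLC = 5, EFL = 3, D = 10, RV = 1, EER = 20, the function gives 64 * 7 + 34 = 482 cycles.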

12

Modify tests so can handle both CPP and ASM versions (Cut-and-paste)

• Not the timing that’s the problem at this moment

• It’s ‘does the ASM and CPP code work’ at all!

13

Check what function needs developing

• Fix compiler error with prototype in ‘Assign1.h’

• Linker error message says ‘wrong prototype’ (NM)

14

Check to see if can run the Tests that call ASM code without crashing

C++ prototype: extern "C" void Function(void)

15

Getting the same constants in an include file working in both CPP and ASM

• Use this type of syntax in ‘Assign1.h’
– Conditional code generation

• And in assembly code files
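The slide's own header text is not reproduced here, so the following is only a sketch of what a shared header could look like. It assumes the assembler runs a C-style preprocessor over its source files, and it uses the standard `__cplusplus` macro as the guard. The constant name `FILTER_LENGTH` and the routine name `AverageFilterASM` are hypothetical.

```cpp
// Sketch of a shared header (contents are assumptions, not the course file).
#ifndef ASSIGN1_H
#define ASSIGN1_H

// A plain preprocessor constant is visible to the C++ compiler and to any
// assembler that runs the C preprocessor over its source.
#define FILTER_LENGTH 4

#ifdef __cplusplus
// Only the C++ compiler sees this part; an assembler would not understand
// C++ prototypes. extern "C" stops C++ name mangling so the linker can
// match the plain label used in the assembly file.
extern "C" float AverageFilterASM(float newValue);   // hypothetical name
extern "C" void Function(void);
#endif

#endif  // ASSIGN1_H
```

An assembly file that includes the same header would then see only the `#define` lines, so both languages share one definition of the constant.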

16

Initial testing done with small N
N = 4 (as can work out expected result)

• Write the test – C++ code expected to pass

– 3.3 is EXACTLY (N – 1) / N of 4.4 when N is 4

17

Look for ‘one out’ (off-by-one) error in loops
Common DSP mistake

• Remember to fix error in ASM ‘pseudo code’

18

Initial testing done with small N
N = 8 (as can work out expected result)

• Write the test – C++ code expected to pass
– ASM code MUST fail test – otherwise test is poor
– Must fail as there is no ASM code to allow a pass to occur. This is the TEST of the TEST

Now have 4 tests passing rather than 3, including ASM test

INDICATES BAD TEST – WHY?

19

Improved test. Don’t allow ‘old correct value’ in output from C++ test

Defect might have been identified by reversing test order
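The slides do not show the test code itself, so the following is an assumed illustration of how a do-nothing routine can pass by accident and how clearing the old value prevents it. All names (`g_filterOutput`, `FilterCPP`, `FilterASMStub`, `kSentinel`) are hypothetical.

```cpp
#include <cmath>

float g_filterOutput = 0.0f;       // hypothetical shared output location

void FilterCPP()     { g_filterOutput = 3.3f; }   // stands in for working C++ code
void FilterASMStub() { /* no ASM code yet: output left untouched */ }

const float kSentinel = -999.0f;   // a value no correct output should equal

// Poor check: trusts whatever is already in the output location.
bool PoorCheck(void (*filter)(), float expected) {
    filter();
    return std::fabs(g_filterOutput - expected) < 1e-5f;
}

// Improved check: overwrite the old value first, so a routine that does
// nothing cannot inherit a correct answer from an earlier test.
bool ImprovedCheck(void (*filter)(), float expected) {
    g_filterOutput = kSentinel;
    filter();
    return std::fabs(g_filterOutput - expected) < 1e-5f;
}
```

Running `PoorCheck` on the C++ routine and then on the stub lets the stub pass, which is the bad-test symptom. `ImprovedCheck` makes the stub fail regardless of test order.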

20

What registers can we use in assembly?

• Don’t use without performing save operations immediately and recover operations later
• Otherwise C and C++ will crash
• These are okay to use in assembly

21

Here’s the full software loop structure
Note the formatting for easy code review (Required)

Each time around the loop – 9 cycles for control, not the 5 we thought

22

dm(2, I4) versus dm(I4, 2)
dm(M4, I4) versus dm(I4, M4)

• Both instructions use the ‘eye’ 4 index register (volatile)
• dm(2, I4) – is a pre-modify memory operation
– The 2 is before the I4 – hence pre something
– I4 points to a memory location
– dm(2, I4) means access the memory location at (I4 + 2)
  • ADD IS NOT performed in parallel with other operations?
– LEAVE value in index register I4 unchanged
– Used in array addressing
• dm(I4, 2) – is a post-modify memory operation
– The 2 is after the I4 – hence post something
– I4 points to a memory location
– dm(I4, 2) means access the memory location at (I4)
– MODIFY value in index register by 2
  • DO I4 = I4 + 2 AFTER USING I4 (ADD in parallel with other operations?)
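In C++ pointer terms the two addressing modes behave roughly as follows. This is an analogy only, with `pt` playing the role of I4. The function names are invented for this note and say nothing about the cycle cost on the real processor.

```cpp
// Pre-modify analogue of dm(2, I4): use address (pt + 2), leave pt unchanged.
float ReadPreModify(const float* pt) {
    return *(pt + 2);
}

// Post-modify analogue of dm(I4, 2): use address pt, then advance pt by 2.
float ReadPostModify(const float*& pt) {
    float value = *pt;
    pt += 2;
    return value;
}
```

With `pt` pointing at the start of {10, 11, 12, 13}, `ReadPreModify` returns 12 and leaves `pt` where it was, while `ReadPostModify` returns 10 and leaves `pt` pointing at the 12.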

24

Other bits of code needed

25

Add assembly language ‘externs’ to ‘Assign1.h’

• Still have not coded the division – fake it by hard-coding * 1/4

• Must be an easier way to code the memory accesses
– Yes – use post-increment operation using pointers and not using array indexing

26

Code fails -- Most likely place to look for defects is in loop operations

Forgot to set loopCounter = 0 and loopMax to N when we added code for the new loops

27

Try persuading the “assembler” to pre-calculate F3 = (1.0 / N) at ‘compile time’, not ‘run-time’

Code should now work for N = 64 – so can compare timing with C code

28

If we believe the tests, then calculation accuracy is lower (5E-06 for larger N)

Despite lousy ASM code, we are already beating the compiler in ‘debug’ mode (around 2N)

29

Before optimizing, we need to add a few more tests to check code valid

Uses sum of N integers: N (N + 1) / 2

Accuracy now set to 1E-5
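A sketch of such a test, under the assumption that the filter is fed the values 1, 2, ..., N so that the expected average is N (N + 1) / 2 divided by N. `AverageOf` is a hypothetical reference routine standing in for the code under test. The tolerance is the 1E-5 quoted above.

```cpp
#include <cmath>
#include <cstddef>

const std::size_t N = 64;

// Hypothetical reference: average of the values in a buffer.
float AverageOf(const float* values, std::size_t count) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < count; ++i) {
        sum += values[i];
    }
    return sum / static_cast<float>(count);
}

// Fills a buffer with 1, 2, ..., N and compares the average with the
// closed form N (N + 1) / 2 divided by N.
bool SumOfIntegersTest() {
    float values[N];
    for (std::size_t i = 0; i < N; ++i) {
        values[i] = static_cast<float>(i + 1);
    }
    const float expectedSum = static_cast<float>(N * (N + 1)) / 2.0f;
    const float expected = expectedSum / static_cast<float>(N);
    return std::fabs(AverageOf(values, N) - expected) < 1e-5f;
}
```

Because every input value is different, this kind of test also catches loops that skip or repeat a sample, which a constant input would hide.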

30

Use post-modify address mode
sum = sum + *pt++; (N = 64)

• ASM was 2400 cycles (N = 64), is now 2208
– Expect improvement of N = 64 cycles (2 instead of 3 instructions)
– Get (2400 – 2208) = 192, which is exactly 3 * N = 192 faster

2 cycle stall till M4 ready to use?

31

dm(2, I4) versus dm(I4, 2)
dm(M4, I4) versus dm(I4, M4)

• Both instructions use the ‘eye’ 4 index register
• dm(2, I4) – is a pre-modify memory operation
– The 2 is before the I4 – hence pre something
– I4 points to a memory location
– dm(2, I4) means access the memory location at (I4 + 2)
– LEAVE value in index register I4 unchanged
– Used in array addressing
• dm(I4, 2) – is a post-modify memory operation
– The 2 is after the I4 – hence post something
– I4 points to a memory location
– dm(I4, 2) means access the memory location at (I4)
– MODIFY value in index register by 2 (I4 = I4 + 2 AFTER USE)

• POST MODIFY OFFERS OPPORTUNITY FOR PROCESSOR ARCHITECTURE TO DO ADD IN PARALLEL WITH OTHER PIPELINE STAGES

32

Using pre-modify and post-modify addressing – replace 6 instructions by 2

Expect 4 * N faster (256)
Was 2208, is 1704 = 504 cycles faster
Close to N * 8 faster!

33

Need to force “C++” to optimize

• Our ASM code 1704 cycles
• Optimized “C” 205 cycles
– 1500 cycles faster or roughly N * 23.5 cycles faster
• FIFO loop (63 reads / 63 writes) + sum loop (64 reads + 64 adds) = 256
• Loop control = 2 * 64 * 9 + into / out of subroutine 20 + other 10 = 1182
– Our ASM = 1468 + 236 unaccounted for (N * 3.7 or nearly N * 4)

CONCLUSION
We have a lot more to learn about using the processor architecture correctly in order to get HIGH SPEED DSP CODE

NOTE: COMPILER ASSUMES GENERAL DSP CODE CHARACTERISTICS

WE KNOW MORE, so should be able to write faster code (if we need to)
