Averaging Filter
Comparing performance of C++ and ‘our’ ASM

Example of program development on SHARC using C++ and assembly

Planned for Tuesday 7th October afternoon
Practical examples handled in Lab 1

Upload: belinda-jackson

Post on 03-Jan-2016



Demo (uTTCOS) and Test (E-UNIT) configurations

[Figure: two signal paths, each passing through YOUR SOFTWARE]

• Demo: True A/D, DMA channel, True leftChannel_In, Audio ISR with Filter, True leftChannel_Out, DMA channel, True D/A
• Test: Test InAudio array, Mock TransmitA2D, Mock leftChannel_In, Filter, Mock leftChannel_Out, Mock ReceiveD2A, Test OutAudio array

Test
• Set up InAudio[ ]
• Set up Expected[ ]
• In loop { call Filter to produce OutAudio[ ] }
• Compare Expected[ ] and OutAudio[ ]
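The test flow above can be sketched in C++ roughly as follows. This is a minimal sketch and not the E-UNIT code used in the lab: `FilterFn`, `copyFilter`, `runFilterTest` and the tolerance parameter are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical filter signature: one input sample in, one output sample out.
using FilterFn = float (*)(float);

// Placeholder filter used only to exercise the harness (Left_Out = Left_In).
float copyFilter(float in) { return in; }

// Run the filter over InAudio[], collect OutAudio[], compare with Expected[].
bool runFilterTest(FilterFn filter,
                   const std::vector<float>& inAudio,
                   const std::vector<float>& expected,
                   float tolerance) {
    assert(inAudio.size() == expected.size());
    std::vector<float> outAudio;
    outAudio.reserve(inAudio.size());
    for (float sample : inAudio) {            // in loop: call Filter
        outAudio.push_back(filter(sample));   // to produce OutAudio[]
    }
    for (std::size_t i = 0; i < expected.size(); ++i) {
        if (std::fabs(outAudio[i] - expected[i]) > tolerance) {
            return false;                     // Expected[] and OutAudio[] differ
        }
    }
    return true;
}
```

Swapping `copyFilter` for the real filter routine leaves the harness unchanged, which is the point of keeping the set-up, loop and compare steps separate.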


Mock Device Registers “satisfy linker”
CCES says “inconsistent” definition

• Poor mock – we move values in Audio Device registers by hand
• Can we “MOCK” Receive_ADC_Samples?
– Typical industrial testing approach needed when hardware is “NOT-YET-DEVELOPED”


Better Simulation

• What if the algorithm is “by mistake” still doing Left_Out = Left_In (copy)? Then we would get the same answer

• Currently “LeftChannel_In1” is a fixed constant – making it difficult for us to check whether our algorithm would work for more complex signals

• So we could start testing the algorithm's validity (not its speed) by changing LeftChannel_In1, by “mocking” the “ReceiveA2D( )” and “TransmitD2A( )” audio devices
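A small C++ sketch shows why a constant input is a weak test: a copy and an N-point average give the same output for a constant signal but differ for a ramp. The helper names `averageAt` and `copyAt` are hypothetical and stand in for whatever the mocked device path feeds to the filter.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Average of the most recent n samples ending at index i.
float averageAt(const std::vector<float>& in, std::size_t i, std::size_t n) {
    assert(n > 0 && i + 1 >= n && i < in.size());
    float sum = 0.0f;
    for (std::size_t k = 0; k < n; ++k) {
        sum += in[i - k];
    }
    return sum / static_cast<float>(n);
}

// The "by mistake" algorithm: Left_Out = Left_In.
float copyAt(const std::vector<float>& in, std::size_t i) {
    assert(i < in.size());
    return in[i];
}
```

With the constant input {2, 2, 2, 2} both functions return 2 at index 3, so the test cannot tell them apart. With the ramp {1, 2, 3, 4} the average is 2.5 and the copy is 4.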


Using ‘MockDevice.c’ loads (RECAP)
What do we do about ‘Receive_ADC_Samples( )’?

• These ‘mock’ routines satisfy a linker requirement for a function we don’t use. When they need to become more detailed, worry about it later (WAIL).


Mocked device inside Assign1Library
Can be used during Lab 1 -- 4

MADE PRIVATE (FIXED)

GOOD OR BAD IDEA?

VARIETY OF ALGORITHMS TESTED


Use GUI to add new test group for Averaging code – 3 styles of tests (RECAP)


Testing
• Test that it works
• Test that it meets real time performance
– Measure ms / Sample for 1 channel = Time-1CH
– Require 20 ms > 8 * Time-1CH
• Move code onto Resource chart
– Determine theoretical best time if all optimizations could be found
• Test to determine real cycle count: Cycle / Tap / Sample
• Examine CPP .lst file (.i or .is) or your ASM file to determine expected cycle count
– Work out why there is a difference between theory and real
– Looking at accuracy of better than 1 cycle in 1000
– Assume 1 cycle per instruction except jumps and memory accesses and movement of I registers to memory – or any other delay we find common
• Be able to move the theoretical calculation to other processor architectures (timings) for MidTerm 1 on Thursday 23rd Oct
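The two numeric checks in this list can be written down as a short C++ sketch. The function names `meetsRealTime` and `cyclesPerTap` are hypothetical; the time arguments only need to share one unit.

```cpp
#include <cassert>

// Real-time requirement: budget > channels * time for one channel.
bool meetsRealTime(double timePerChannel, int channels, double budget) {
    assert(channels > 0);
    return budget > channels * timePerChannel;
}

// Normalised figure of merit: cycles per tap per sample.
double cyclesPerTap(double cyclesPerSample, int taps) {
    assert(taps > 0);
    return cyclesPerSample / taps;
}
```

For example, a Time-1CH of 2 units passes the 8-channel check against a budget of 20 (16 < 20), while a Time-1CH of 3 units fails it (24 > 20).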


Theoretical Analysis

• We expect our theoretical analysis to be as fast as, or faster than, what the C++ optimized code takes

• We are not using any C++ DSP extensions, so we expect efficient rather than optimized code

• Is 816 cycles per sample processed by Average Filter the speed we would expect based on our understanding of the processor architecture?


Expectations

• First instruction after a jump takes 3 cycles to finish executing

• After that, each instruction, all things being equal, takes 1 cycle

• 1 cycle for a read, write, add, multiply
• D? cycles for a division


Averaging Filter with Loop
Theoretical Analysis

• Fetch N values from memory -- N cycles
• Perform N add operations -- N cycles
• Go round the sum for-loop -- N * FLC cycles
– Where FLC is # instructions to handle For-Loop-Control – includes all overheads of jumping during for-loop
• Exit for loop (done once) -- EFL cycles
• Do division -- D cycles
• Return a value from function -- RV cycles
• Enter and exit Average routine -- EER cycles

AVERAGE_FILTER_TIME = N * (1 + 1 + FLC) + EFL + D + RV + EER cycles

VERY BIG DEFECT IN ANALYSIS FOUND LATER: ACTUAL THEORETICAL TIME IS TWICE AS LARGE AS THIS
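The estimate can be turned into a small C++ helper so that different values of the cost terms can be tried. This is a sketch of the formula exactly as stated on this slide (so it shares the defect noted above); the struct `LoopCosts` and the function name are hypothetical.

```cpp
#include <cassert>

// Cycle-cost terms named on the slide; the values are supplied by the caller.
struct LoopCosts {
    int flc;  // For-Loop-Control cycles per pass
    int efl;  // Exit-for-loop cycles (done once)
    int d;    // Division cycles
    int rv;   // Return-value cycles
    int eer;  // Enter-and-exit-routine cycles
};

// AVERAGE_FILTER_TIME = N * (1 + 1 + FLC) + EFL + D + RV + EER
int averageFilterCycles(int n, const LoopCosts& c) {
    assert(n > 0);
    return n * (1 + 1 + c.flc) + c.efl + c.d + c.rv + c.eer;
}
```

With purely illustrative (not measured) values FLC = 5, EFL = 3, D = 10, RV = 2, EER = 20 and N = 64, the estimate is 64 * 7 + 35 = 483 cycles.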


Modify tests so they can handle both CPP and ASM versions (cut-and-paste)

• It’s not the timing that’s the problem at this moment

• It’s whether the ASM and CPP code work at all!


Check what function needs developing

• Fix compiler error with prototype in ‘Assign1.h’

• Linker error message says ‘wrong prototype’ (NM)


Check to see if we can run the Tests that call ASM code without crashing

C++ prototype
extern "C" void Function(void)
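As a sketch of what the prototype buys us: `extern "C"` gives the routine an unmangled C-linkage name, so a routine written in assembly can satisfy the linker. In the C++ below a stand-in body replaces the assembly file so the example links on its own; `callCount` and `smokeTestFunction` are hypothetical helpers.

```cpp
#include <cassert>

// C linkage: the C++ compiler does not mangle this name.
extern "C" void Function(void);

// Stand-in definition so this sketch links by itself. On the target,
// the body would come from the assembly source file instead.
static int callCount = 0;
extern "C" void Function(void) { ++callCount; }

// Smoke test: call the routine and check that control comes back.
void smokeTestFunction() {
    const int before = callCount;
    Function();
    assert(callCount == before + 1);
}
```

A test like this checks nothing about the filter result; it only confirms that the call and return work before any real code is written.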


Getting the same constants in an include file working in both CPP and ASM

• Use this type of syntax in ‘Assign1.h’
– Conditional code generation

• And in assembly code files
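One common way to share a header is sketched below. It assumes the assembler runs a C-style preprocessor over its source and relies on the standard `__cplusplus` macro, which only a C++ compiler defines. The constant `FILTER_LENGTH` and the routine name `AverageFilterASM` are hypothetical, and the guard macro actually used in ‘Assign1.h’ may differ.

```cpp
// --- sketch of a shared header such as Assign1.h ---
// Plain #define constants are understood by the C++ preprocessor and by
// an assembler that runs a C-style preprocessor.
#define FILTER_LENGTH 4

// Prototypes are C++ syntax, so hide them from the assembler.
#ifdef __cplusplus
extern "C" float AverageFilterASM(float newValue);  // hypothetical name
#endif
// --- end of header sketch ---

// Both languages can now size their buffers from the same constant.
float fifoBuffer[FILTER_LENGTH];
```

Keeping the constant in one place means a change of filter length cannot leave the C++ and ASM versions out of step.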


Initial testing done with small N
N = 4 (as can work out expected result)

• Write the test – C++ code expected to pass

– 3.3 is EXACTLY (N – 1) / N of 4.4 when N is 4
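One reading of this test, sketched in C++: the filter keeps a FIFO of the last N samples, starting at zero. After three inputs of 4.4 the FIFO holds three copies of 4.4 and one zero, so the average is 3 * 4.4 / 4 = 3.3. The struct `AveragingFilter` is a hypothetical reference version, not the lab code.

```cpp
#include <cstddef>

constexpr std::size_t N = 4;   // small N so the result can be worked out by hand

// Hypothetical reference filter: shift the FIFO, insert the new sample,
// then return the mean of the N stored samples.
struct AveragingFilter {
    float fifo[N] = {};        // starts zero-filled

    float process(float newValue) {
        for (std::size_t i = N - 1; i > 0; --i) {   // N - 1 reads, N - 1 writes
            fifo[i] = fifo[i - 1];
        }
        fifo[0] = newValue;
        float sum = 0.0f;
        for (std::size_t i = 0; i < N; ++i) {       // N reads, N adds
            sum += fifo[i];
        }
        return sum / static_cast<float>(N);
    }
};
```

A fourth input of 4.4 fills the FIFO and the output settles at 4.4, within floating point accuracy.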


Look for ‘one out error’ in loops
Common DSP mistake

• Remember to fix error in ASM ‘pseudo code’


Initial testing done with small N
N = 8 (as can work out expected result)

• Write the test – C++ code expected to pass
– ASM code MUST fail test – otherwise test is poor
– Must fail as there is no ASM code to allow a pass to occur. This is the TEST of the TEST

Now have 4 tests passing rather than 3, including ASM test

INDICATES BAD TEST – WHY?


Improved test. Don’t allow ‘old correct value’ in output from C++ test

Defect might have been identified by reversing test order
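The defect can be reproduced in a few lines of C++. This is a sketch under assumptions: `outAudio` stands for the shared output location, `averageASMStub` for the not-yet-written ASM routine, and the sentinel value is arbitrary. All of these names are hypothetical.

```cpp
#include <cmath>

float outAudio = 0.0f;   // shared output location

// Working C++ version: average of four inputs.
void averageCPP(const float* in) {
    outAudio = (in[0] + in[1] + in[2] + in[3]) * 0.25f;
}

// ASM version not written yet: the stub leaves outAudio untouched.
void averageASMStub(const float* /*in*/) {}

// Poor check: trusts whatever is already in outAudio.
bool poorCheck(void (*filter)(const float*), const float* in, float expected) {
    filter(in);
    return std::fabs(outAudio - expected) < 1e-5f;
}

// Improved check: overwrite the output with a sentinel before the call,
// so an old correct value cannot leak into this test.
bool improvedCheck(void (*filter)(const float*), const float* in, float expected) {
    outAudio = -12345.0f;   // arbitrary sentinel, never a valid result here
    filter(in);
    return std::fabs(outAudio - expected) < 1e-5f;
}
```

Running the poor check on the C++ version and then on the stub passes both times, because the stub inherits the C++ result. The improved check fails the stub as it should.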


What registers can we use in assembly?

• Don’t use without performing save immediately and later recover operations.

• Otherwise C and C++ will crash

• These are okay to use in assembly


Here’s the full software loop structure
Note the formatting for easy code review (Required)

Each time around loop – 9 cycles for control

Not the 5 we thought


dm(2, I4) versus dm(I4, 2)
dm(M4, I4) versus dm(I4, M4)

• Both instructions use the ‘eye’ 4 index register (volatile)
• dm(2, I4) – is a pre-modify memory operation
– The 2 is before the I4 – hence pre something
– I4 points to a memory location
– dm(2, I4) means access the memory location at (I4 + 2)
– ADD IS NOT performed in parallel with other operations?
– LEAVE value in index register I4 unchanged
– Used in array addressing
• dm(I4, 2) – is a post-modify memory operation
– The 2 is after the I4 – hence post something
– I4 points to a memory location
– dm(I4, 2) means access the memory location at (I4)
– MODIFY value in index register by 2
– DO I4 = I4 + 2 AFTER USING I4 (ADD in parallel with other operations?)
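The difference between the two addressing modes can be mimicked with C++ pointers. This is an analogy only, with hypothetical function names; it shows which address is read and whether the pointer changes, and says nothing about cycle counts.

```cpp
// C++ analogue of dm(2, I4): read at (pointer + offset), pointer unchanged.
float preModifyRead(const float* i4, int offset) {
    return *(i4 + offset);
}

// C++ analogue of dm(I4, 2): read at the pointer, then advance the pointer.
float postModifyRead(const float*& i4, int step) {
    float value = *i4;
    i4 += step;
    return value;
}
```

With data {10, 11, 12, 13, 14, 15} and the pointer at the start, the pre-modify read with offset 2 returns 12 and leaves the pointer alone. The post-modify read with step 2 returns 10 and leaves the pointer on the 12.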


Other bits of code needed


Add assembly language ‘externs’ to ‘Assign1.h’

• Still have not coded the division – fake it by hard-coding * 1/4

• Must be an easier way to code memory
– Yes – use post increment operation using pointers and not using array indexing


Code fails -- Most likely place to look for defects is in loop operations

Forgot to set loopCounter = 0 and loopMax to N when we added code for the new loops


Try persuading the “assembler” to pre-calculate F3 = (1.0 / N) at ‘compile time’, not ‘run-time’

Code should now work for N = 64 – so can compare timing with C code
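The same idea on the C++ side can be sketched with `constexpr`, which asks the compiler to evaluate the reciprocal at build time. This is a C++ analogue and not the assembler syntax the slide refers to; `ONE_OVER_N` and `scaleSum` are hypothetical names.

```cpp
constexpr int N = 64;                    // filter length used for the timing runs
constexpr float ONE_OVER_N = 1.0f / N;   // evaluated by the compiler, not at run time

// 1/64 is a power of two, so it is exactly representable in float.
static_assert(ONE_OVER_N == 0.015625f, "reciprocal computed at compile time");

// Scale a finished sum by the precomputed reciprocal instead of dividing.
float scaleSum(float sum) {
    return sum * ONE_OVER_N;
}
```

The run-time cost is then one multiply, which fits the 1 cycle expectation listed earlier, instead of a division.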


If we believe tests then calculation accuracy is lower (5E-06 for larger N)

Despite lousy ASM code we are already beating the compiler in ‘debug’ mode (around 2N)


Before optimizing, we need to add a few more tests to check the code is valid

Uses sum of N integers: N (N + 1) / 2

Accuracy now set to 1E-5
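A sketch of such a test in C++: fill the buffer with 1, 2, ..., N, so the sum is N (N + 1) / 2 and the expected average is (N + 1) / 2. The helper names are hypothetical and `averageOf` stands in for the filter under test.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Mean of all values in the buffer (stand-in for the filter's sum loop).
float averageOf(const std::vector<float>& values) {
    assert(!values.empty());
    float sum = 0.0f;
    for (float v : values) {
        sum += v;
    }
    return sum / static_cast<float>(values.size());
}

// Fill with 1..n and compare against the closed-form average (n + 1) / 2.
bool rampAverageMatches(int n, float accuracy) {
    assert(n > 0);
    std::vector<float> values;
    for (int i = 1; i <= n; ++i) {
        values.push_back(static_cast<float>(i));
    }
    const float expected = static_cast<float>(n + 1) / 2.0f;
    return std::fabs(averageOf(values) - expected) <= accuracy;
}
```

For N = 64 the sum is 2080 and the expected average is 32.5. Because every input differs, this test also catches a loop that reads one element too few or too many.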


Use post-modify address mode
sum = sum + *pt++; (N = 64)

• ASM was 2400 cycles (N = 64), is now 2208
– Expect improvement of N = 64 cycles (2 instead of 3 instructions)
– Get (2400 – 2208) = 192, which is exactly 3 * N = 192 cycles faster

2 cycle stall till M4 ready to use?


dm(2, I4) versus dm(I4, 2)
dm(M4, I4) versus dm(I4, M4)

• Both instructions use the ‘eye’ 4 index register
• dm(2, I4) – is a pre-modify memory operation
– The 2 is before the I4 – hence pre something
– I4 points to a memory location
– dm(2, I4) means access the memory location at (I4 + 2)
– LEAVE value in index register I4 unchanged
– Used in array addressing
• dm(I4, 2) – is a post-modify memory operation
– The 2 is after the I4 – hence post something
– I4 points to a memory location
– dm(I4, 2) means access the memory location at (I4)
– MODIFY value in index register by 2 (I4 = I4 + 2 AFTER USE)
• POST MODIFY OFFERS OPPORTUNITY FOR PROCESSOR ARCHITECTURE TO DO ADD IN PARALLEL WITH OTHER PIPELINE STAGES


Using pre-modify and post-modify addressing – replace 6 instructions by 2

Expect 4 * N faster (256)
Was 2208, is 1704: about 500 cycles faster
Close to N * 8 faster!


Need to force “C++” to optimize

• Our asm code 1704 cycles
• Optimized “C” 205 cycles

– 1500 cycles faster or roughly N * 23.5 cycles faster
• FIFO loop (63 reads + 63 writes) + sum loop (64 reads + 64 adds) = 254
• Loop control = 2 * 64 * 9 + into / out of subroutine 20 + other 10 = 1182
– Our ASM = 1436 + 268 unaccounted for (N * 4.2 or roughly N * 4)

CONCLUSION
We have a lot more to learn about using the processor architecture correctly in order to get HIGH SPEED DSP CODE

NOTE: COMPILER ASSUMES GENERAL DSP CODE CHARACTERISTICS

WE KNOW MORE, so should be able to write faster code (if we need to)