automatic application profiling lecture 22. today what parts of the code are slow? amdahls law how...

25
Automatic Application Automatic Application Profiling Profiling Lecture 22

Upload: lee-moody

Post on 18-Jan-2018

218 views

Category:

Documents


0 download

DESCRIPTION

Amdahl’s Law Gene Amdahl - “Optimize the common case” Double the speed of ¼ of the program: Quadruple the speed of ¼ of the program:

TRANSCRIPT

Page 1: Automatic Application Profiling Lecture 22. Today  What parts of the code are slow? Amdahls law How to get the processor to tell us whats taking the

Automatic Application ProfilingAutomatic Application Profiling

Lecture 22

Page 2: Automatic Application Profiling Lecture 22. Today  What parts of the code are slow? Amdahls law How to get the processor to tell us whats taking the

Today – What parts of the code are slow?Today – What parts of the code are slow?• Amdahl’s law

• How to get the processor to tell us what’s taking the most time – Statistical program counter sampling

Page 3: Automatic Application Profiling Lecture 22. Today  What parts of the code are slow? Amdahls law How to get the processor to tell us whats taking the

Amdahl’s LawAmdahl’s Law

• Gene Amdahl - “Optimize the common case”• Double the speed of ¼ of the program:

• Quadruple the speed of ¼ of the program:

Enhanced

EnhancedNormal Speedup

FractionFraction

Speedup

1

14.1

225.075.0

1

Speedup

23.1

425.075.0

1

Speedup

Page 4: Automatic Application Profiling Lecture 22. Today  What parts of the code are slow? Amdahls law How to get the processor to tell us whats taking the

Work on the Slow PartWork on the Slow Part

• Double the speed of ¾ of the program:

• Quadruple the speed of ¾ of the program:

Enhanced

EnhancedNormal Speedup

FractionFraction

Speedup

1

6.125.0

275.0

1

Speedup

29.225.0

475.0

1

Speedup

Page 5: Automatic Application Profiling Lecture 22. Today  What parts of the code are slow? Amdahls law How to get the processor to tell us whats taking the

How Do We Find the Slow PartsHow Do We Find the Slow Parts• Option A: Measure the amount of time each region takes to execute

– Codeunsigned long times[NUM_FUNCTIONS];

void my_function(int i) { t_start = get_ticks(); /* function body */ times[THIS_FUNCTION_NUM] += get_ticks() - t_start;}

• Pros– Exact. Can get single cycle accuracy if needed.

• Cons– Tedious. Must add code to each function to be monitored.– Need access to source code, which may be a problem for library functions.

Page 6: Automatic Application Profiling Lecture 22. Today  What parts of the code are slow? Amdahls law How to get the processor to tell us whats taking the

Program Counter SamplingProgram Counter Sampling• Option B: Periodically examine the PC to see what’s running

– Result shows average fraction of time spent executing a region of code• Supporting data structure: table of region information

– Starting and ending addresses: defined before sampling

– Execution counts:updated during sampling

• Sampling– Use a timer to interrupt application periodically– Within ISR

• Read PC off of stack• Examine table of region addresses to determine currently executing region N• Increment entry N of execution count table• Also increment a total number of ticks variable (to reveal out-of-range PC

instances)– At end

• Provide execution count table to user (via file, serial port, debugger, etc.)

typedef struct { char Name[PROFILE_NAME_SIZE]; unsigned long Start, End; unsigned Count;} PROFILE_T;

Page 7: Automatic Application Profiling Lecture 22. Today  What parts of the code are slow? Amdahls law How to get the processor to tell us whats taking the

Configure Timer to Generate Periodic InterruptConfigure Timer to Generate Periodic Interrupt

• Call this function with desired sampling frequency in Hz

void Init_Profiling(unsigned samp_freq) { unsigned long divider; // set up timer A0 to interrupt at samp_freq ta0mr = 0x00; divider = ((unsigned long)MAIN_CLOCK)/samp_freq; if (divider > 0x0ffffl) ta0 = 0x0ffff; else ta0 = (unsigned) (divider & 0x0ffff); DISABLE_IRQ; ta0ic = 1; ENABLE_IRQ; ta0s = 1; }

Page 8: Automatic Application Profiling Lecture 22. Today  What parts of the code are slow? Amdahls law How to get the processor to tell us whats taking the

Profiler Interrupt Service RoutineProfiler Interrupt Service Routine

• Don’t forget to register ISR in vector table!

#pragma INTERRUPT/B profile_intrvoid profile_intr(void) { unsigned char PC_H; unsigned int PC_ML; unsigned long PC; unsigned char i; /* Get PC from stack */ _asm("mov.w 2[FB], $$[FB]", PC_ML); _asm("mov.b 5[FB], $$[FB]", PC_H); PC = PC_H; PC <<= 16; PC += PC_ML; profile_ticks++; /* look up function in table and inc. counter */ for (i=0; i<NUM_PROFILE_REGIONS; i++) { if ((PC >= profiles[i].Start) && (PC <= profiles[i].End)) { profiles[i].Count++; return; } }}

Page 9: Automatic Application Profiling Lecture 22. Today  What parts of the code are slow? Amdahls law How to get the processor to tell us whats taking the

Configure Profile Table with Region InformationConfigure Profile Table with Region Information

• Where do we find region addresses? Next page.

PROFILE_T profiles[NUM_PROFILE_REGIONS] = { {"main.c", UL 0x0f0100, UL 0x0f0299, 0}, {"profile.c", UL 0x0f029a, UL 0x0f037f, 0}, {"skp26.c", UL 0xf0380, UL 0xf0613, 0}, {"skp_lcd.c", UL 0x0f0614, UL 0x0f0917, 0}, {"library", UL 0x0f0918, UL 0x0f290d, 0}, {"", UL 0, UL 0, 0}, {"", UL 0, UL 0, 0}, {"", UL 0, UL 0, 0}, };

PROFILE_T profiles[NUM_PROFILE_REGIONS] = { {"main", UL main, UL (main+0x0199), 0}, {"LCD_Erase", UL LCD_Erase_FB, UL (LCD_Erase_FB+0x0240), 0}, {"sin", UL sin, UL (sin+0x0138), 0}, {"LCD_Plot_in_FB", UL LCD_Plot_in_FB, UL (LCD_Plot_in_FB+0x02e3), 0}, {"LCD_Display_FB", UL LCD_Display_FB, UL (LCD_Display_FB+0x0302), 0}, {"DisplayDelay", UL DisplayDelay, UL (DisplayDelay+0x01ad), 0}, {"LCD_write", UL LCD_write, UL (LCD_write+0x017f), 0}, {"", UL 0, UL 0, 0} };

To profile modules (source files)

To profile functions

Page 10: Automatic Application Profiling Lecture 22. Today  What parts of the code are slow? Amdahls law How to get the processor to tell us whats taking the

Finding the Region AddressesFinding the Region Addresses• Get module addresses from linker’s map file (in debug directory)

• Get function lengths (if needed) from .LST file– Second column is start address of each assembly instruction– Subtract function’s first address from its last address to find length

506 ;## # FUNCTION LCD_Erase_FB507 ;## # FRAME AUTO (y) size 2, offset -4508 ;## # FRAME AUTO (x) size 2, offset -2509 ;## # ARG Size(0) Auto Size(4) Context Size(5)510 511 .align512 ;## # C_SRC : void LCD_Erase_FB(void) {513 .glb _LCD_Erase_FB514 00212 _LCD_Erase_FB:515 00212 7CF204 enter #04H516 ;## # C_SRC : for (x=0; x<8; x++)517 00215 D90BFE mov.w #0000H,-2[FB] ; x P(etc.)539 ;## # C_SRC : }540 0023F 7DF2 exitd541 00241 E7:

program REL CODE 0F00FF 000000 NCRT0_26SKP REL CODE 0F0100 00019A 2 MAIN REL CODE 0F029A 0000E5 2 PROFILE REL CODE 0F0380 000293 2 SKP26 REL CODE 0F0614 000303 2 SKP_LCD REL CODE 0F0918 00020F 2 _F4DIV REL CODE 0F0B28 00008D 2 _F4TOF8 REL CODE 0F0BB6 000A65 2 _F8ADD

Page 11: Automatic Application Profiling Lecture 22. Today  What parts of the code are slow? Amdahls law How to get the processor to tell us whats taking the

Overview of Profiling ApproachOverview of Profiling Approach• Start with the big picture (rough details) and use that to

determine where to look next• Profiling sequence

– module-level (file-level)– function-level within the most common module– basic block-level within the most common function

Page 12: Automatic Application Profiling Lecture 22. Today  What parts of the code are slow? Amdahls law How to get the processor to tell us whats taking the

Detailed Steps to using profile.c/hDetailed Steps to using profile.c/h• Enable list file creation for each C source file

– HEW: Options -> Renesas M16C Standard Toolchain• C Tab -> Category: List. Check –dS and –dSL boxes

• Fill in array profiles with region addresses (e.g. names of functions), dummy lengths, and zero counts

• Compile– May need to add function prototypes if profiles table is declared before the functions

are• Update profiles array with correct starting and ending addresses• Recompile• Run• Examine profiles after running long enough

Page 13: Automatic Application Profiling Lecture 22. Today  What parts of the code are slow? Amdahls law How to get the processor to tell us whats taking the

How Long is Enough?How Long is Enough?• Complex statistical question

– The statistician I asked said “it depends” and changed the subject• So, run it until the digits you care about stop changing• Example: Module-level profiling

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

10 100 1000 10000 100000

Number of Samples

Frac

tion

of P

rogr

am T

ime

mainprofileskp26skp_lcdlibrary

Page 14: Automatic Application Profiling Lecture 22. Today  What parts of the code are slow? Amdahls law How to get the processor to tell us whats taking the

Where does the Lab 4 Skeleton Spend its Time?Where does the Lab 4 Skeleton Spend its Time?

We know there are delay loops executed every time the MCU writes to the LCD, but let’s verify how bad they are

• Start with modules• Then look at functions in module• Then look at basic blocks within function

DisplayString(LCD_LINE1," Lab #4 "); DisplayString(LCD_LINE2," Starter"); GRN_LED = LED_ON;

while (1) { for (f=6.0; f>0.0; f -= 0.4) { LCD_Erase_FB(); for (i=0; i<DISP_WIDTH_PIXELS; i++)

LCD_Plot_in_FB((unsigned char)i, (unsigned char) (3.5*(sin(i/f)+1.0)), 1);

LCD_Display_FB(LCD_LINE1); } }

Page 15: Automatic Application Profiling Lecture 22. Today  What parts of the code are slow? Amdahls law How to get the processor to tell us whats taking the

Step One: Profile ModulesStep One: Profile Modules

• Define a profile region per module, and one for all the library functions

# SECTION ATR TYPE START LENGTH ALIGN MODULENAMEprogram REL CODE 0F00FF 000000 NCRT0_26SKP REL CODE 0F0100 00019A 2 MAIN REL CODE 0F029A 0000E5 2 PROFILE REL CODE 0F0380 000293 2 SKP26 REL CODE 0F0614 000303 2 SKP_LCD REL CODE 0F0918 00020F 2 _F4DIV REL CODE 0F0B28 00008D 2 _F4TOF8 REL CODE 0F0BB6 000A65 2 _F8ADD REL CODE 0F161C 0000A7 2 _F8LE REL CODE 0F16C4 000069 2 _F8LTOR REL CODE 0F172E 0002DE 2 _F8MUL REL CODE 0F1A0C 0000BA 2 _F8TOF4 REL CODE 0F1AC6 000025 2 _F8TOI4U REL CODE 0F1AEC 000192 2 _FTOL REL CODE 0F1C7E 00004D 2 _I4DIVU REL CODE 0F1CCC 000022 2 _I4TOF4 REL CODE 0F1CEE 0000FD 2 _LTOF REL CODE 0F1DEC 000138 2 SIN REL CODE 0F1F24 00035D 2 TAN REL CODE 0F2282 000058 2 _F4LTOR REL CODE 0F22DA 000060 2 _F4RTOL REL CODE 0F233A 0002E4 2 _F8DIV REL CODE 0F261E 00007C 2 _F8EQ REL CODE 0F269A 0000A7 2 _F8LT REL CODE 0F2742 00007C 2 _F8NE REL CODE 0F27BE 000066 2 _F8RTOL REL CODE 0F2824 00002E 2 _F8SUB REL CODE 0F2852 000025 2 _F8TOI4 REL CODE 0F2878 000073 2 _I4MOD REL CODE 0F28EC 000022 2 _I4TOF8

Page 16: Automatic Application Profiling Lecture 22. Today  What parts of the code are slow? Amdahls law How to get the processor to tell us whats taking the

Step One: Profile ModulesStep One: Profile Modules

PROFILE_T profiles[NUM_PROFILE_REGIONS] = { {"main.c", UL 0x0f0100, UL 0x0f0299, 0}, {"profile.c", UL 0x0f029a, UL 0x0f037f, 0}, {"skp26.c", UL 0x0f0380, UL 0x0f0613, 0}, {"skp_lcd.c", UL 0x0f0614, UL 0x0f0917, 0}, {"library", UL 0x0f0918, UL 0x0f290d, 0}, {"", UL 0, UL 0, 0}, {"", UL 0, UL 0, 0}, {"", UL 0, UL 0, 0}, };

Page 17: Automatic Application Profiling Lecture 22. Today  What parts of the code are slow? Amdahls law How to get the processor to tell us whats taking the

Step 1 ResultsStep 1 Results

• Surprise! The LCD functions aren’t taking up most of the processor’s time! The library functions are instead

Execution Time per Module

main

profile

skp26

skp_lcd

library

other

Count Timemain 70 0.21%profile 0 0.00%skp26 0 0.00%skp_lcd 9642 29.11%library 23415 70.68%other 1 0.00%

Page 18: Automatic Application Profiling Lecture 22. Today  What parts of the code are slow? Amdahls law How to get the processor to tell us whats taking the

Step Two: Profile LibraryStep Two: Profile Library

• We only have eight entries in our table, so let’s split up the library into eight regions of about three functions each

# SECTION ATR TYPE START LENGTH ALIGN MODULENAMEprogram REL CODE 0F00FF 000000 NCRT0_26SKP REL CODE 0F0100 00019A 2 MAIN REL CODE 0F029A 0000E5 2 PROFILE REL CODE 0F0380 000293 2 SKP26 REL CODE 0F0614 000303 2 SKP_LCD REL CODE 0F0918 00020F 2 _F4DIV REL CODE 0F0B28 00008D 2 _F4TOF8 REL CODE 0F0BB6 000A65 2 _F8ADD REL CODE 0F161C 0000A7 2 _F8LE REL CODE 0F16C4 000069 2 _F8LTOR REL CODE 0F172E 0002DE 2 _F8MUL REL CODE 0F1A0C 0000BA 2 _F8TOF4 REL CODE 0F1AC6 000025 2 _F8TOI4U REL CODE 0F1AEC 000192 2 _FTOL REL CODE 0F1C7E 00004D 2 _I4DIVU REL CODE 0F1CCC 000022 2 _I4TOF4 REL CODE 0F1CEE 0000FD 2 _LTOF REL CODE 0F1DEC 000138 2 SIN REL CODE 0F1F24 00035D 2 TAN REL CODE 0F2282 000058 2 _F4LTOR REL CODE 0F22DA 000060 2 _F4RTOL REL CODE 0F233A 0002E4 2 _F8DIV REL CODE 0F261E 00007C 2 _F8EQ REL CODE 0F269A 0000A7 2 _F8LT REL CODE 0F2742 00007C 2 _F8NE REL CODE 0F27BE 000066 2 _F8RTOL REL CODE 0F2824 00002E 2 _F8SUB REL CODE 0F2852 000025 2 _F8TOI4 REL CODE 0F2878 000073 2 _I4MOD REL CODE 0F28EC 000022 2 _I4TOF8

lib 1

lib 2

lib 3lib 4

lib 5

lib 6

lib 7

lib 8

Page 19: Automatic Application Profiling Lecture 22. Today  What parts of the code are slow? Amdahls law How to get the processor to tell us whats taking the

Step Two: Profile LibraryStep Two: Profile LibraryPROFILE_T profiles[NUM_PROFILE_REGIONS] = { {"lib 1", UL 0x0f0918, UL 0x0f161b, 0}, {"lib 2", UL 0x0f161c, UL 0x0f1a0b, 0}, {"lib 3", UL 0x0f1a0c, UL 0x0f1c7d, 0}, {"lib 4", UL 0x0f1c7e, UL 0x0f1deb, 0}, {"lib 5", UL 0x0f1dec, UL 0x0f22d9, 0}, {"lib 6", UL 0x0f22da, UL 0x0f2699, 0}, {"lib 7", UL 0x0f269a, UL 0x0f2823, 0}, {"lib 8", UL 0x0f2824, UL 0x0f290d, 0}};

Page 20: Automatic Application Profiling Lecture 22. Today  What parts of the code are slow? Amdahls law How to get the processor to tell us whats taking the

Step Two ResultsStep Two Results

• Functions in group lib 6 are taking the most time, followed by lib 4 and lib 1

Execution Time per Library Function Group

lib 1

lib 2

lib 3

lib 4

lib 5

lib 6

lib 7

lib 8

other

Count Timelib 1 3079 10.22%lib 2 1796 5.96%lib 3 865 2.87%lib 4 3092 10.26%lib 5 687 2.28%lib 6 11044 36.65%lib 7 131 0.43%lib 8 444 1.47%other 8999 29.86%

Page 21: Automatic Application Profiling Lecture 22. Today  What parts of the code are slow? Amdahls law How to get the processor to tell us whats taking the

Step Three: Profile Top Library FunctionsStep Three: Profile Top Library Functions

• Examine the nine functions in these three groups, grouping two functions together

# SECTION ATR TYPE START LENGTH ALIGN MODULENAMEprogram REL CODE 0F00FF 000000 NCRT0_26SKP REL CODE 0F0100 00019A 2 MAIN REL CODE 0F029A 0000E5 2 PROFILE REL CODE 0F0380 000293 2 SKP26 REL CODE 0F0614 000303 2 SKP_LCD REL CODE 0F0918 00020F 2 _F4DIV REL CODE 0F0B28 00008D 2 _F4TOF8 REL CODE 0F0BB6 000A65 2 _F8ADD REL CODE 0F161C 0000A7 2 _F8LE REL CODE 0F16C4 000069 2 _F8LTOR REL CODE 0F172E 0002DE 2 _F8MUL REL CODE 0F1A0C 0000BA 2 _F8TOF4 REL CODE 0F1AC6 000025 2 _F8TOI4U REL CODE 0F1AEC 000192 2 _FTOL REL CODE 0F1C7E 00004D 2 _I4DIVU REL CODE 0F1CCC 000022 2 _I4TOF4 REL CODE 0F1CEE 0000FD 2 _LTOF REL CODE 0F1DEC 000138 2 SIN REL CODE 0F1F24 00035D 2 TAN REL CODE 0F2282 000058 2 _F4LTOR REL CODE 0F22DA 000060 2 _F4RTOL REL CODE 0F233A 0002E4 2 _F8DIV REL CODE 0F261E 00007C 2 _F8EQ REL CODE 0F269A 0000A7 2 _F8LT REL CODE 0F2742 00007C 2 _F8NE REL CODE 0F27BE 000066 2 _F8RTOL REL CODE 0F2824 00002E 2 _F8SUB REL CODE 0F2852 000025 2 _F8TOI4 REL CODE 0F2878 000073 2 _I4MOD REL CODE 0F28EC 000022 2 _I4TOF8

Page 22: Automatic Application Profiling Lecture 22. Today  What parts of the code are slow? Amdahls law How to get the processor to tell us whats taking the

Step Three: Profile Top Library FunctionsStep Three: Profile Top Library Functions

PROFILE_T profiles[NUM_PROFILE_REGIONS] = { {"_F4DIV", UL 0x0f0918, UL 0x0f0b27, 0}, {"_F4TOF8+_F8ADD", UL 0x0f0b28, UL 0x0f161b, 0}, {"_I4DIVU", UL 0x0f1c7e, UL 0x0f1ccb, 0}, {"_I4TOF4", UL 0x0f1ccc, UL 0x0f1ced, 0}, {"_LTOF", UL 0x0f1cee, UL 0x0f1deb, 0}, {"_F4RTOL", UL 0x0f22da, UL 0x0f2339, 0}, {"_F8DIV", UL 0x0f233a, UL 0x0f261d, 0}, {"_F8EQ", UL 0x0f261e, UL 0x0f2699, 0}};

Page 23: Automatic Application Profiling Lecture 22. Today  What parts of the code are slow? Amdahls law How to get the processor to tell us whats taking the

Step Three ResultsStep Three Results

• Most time spent in double precision floating point divide• 3.5*(sin(i/f)+1.0) is culprit. Avoid floating point when possible

Execution Time per Library Function

_F4DIV_F4TOF8+_F8ADD_I4DIVU_I4TOF4_LTOF_F4RTOL_F8DIV_F8EQother

Count Time_F4DIV 482 1.83%_F4TOF8+_F8ADD2413 9.14%_I4DIVU 0 0.00%_I4TOF4 18 0.07%_LTOF 2748 10.41%_F4RTOL 36 0.14%_F8DIV 9418 35.67%_F8EQ 0 0.00%other 11291 42.76%

26406 57.24%

Page 24: Automatic Application Profiling Lecture 22. Today  What parts of the code are slow? Amdahls law How to get the processor to tell us whats taking the

Disadvantages of SamplingDisadvantages of Sampling• Sampling is inexact - not guaranteed to get everything that runs

– Code which disables interrupts (e.g. ISRs, OS code) is not measured– Rarely executed code may be missed– Takes time for numbers to settle down– Profile changes based on mode of program

• If manually creating table, user needs to update address table with each code change

Page 25: Automatic Application Profiling Lecture 22. Today  What parts of the code are slow? Amdahls law How to get the processor to tell us whats taking the

Implementing InstrumentationImplementing Instrumentation• Tedious to do manually for large programs, so automate• Have compiler instrument code for you

– gcc and other compilers support profiling using a command line switch– They provide a tool to process the output file to determine how much time

each function takes

• Can also modify the binary (after compilation): Atom from DEC’s Western Research Lab, Etch, EEL– Tool processes binary files to run your instrumentation procedures for each

procedure, basic block, or instruction

• For the M16C, what would be best?– Create a program which reads the map file and creates a C file declaring our

profiles array with correct region names and addresses• Probably easiest to use a scripting language: sed, awk, perl, (f)lex• Probably not enough memory to instrument all functions, so must be selective