Automatic Application Profiling
Lecture 22
Today – What parts of the code are slow?
• Amdahl’s law
• How to get the processor to tell us what’s taking the most time
  – Statistical program counter sampling
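Amdahl’s Law
• Gene Amdahl - “Optimize the common case”

$$\text{Speedup} = \frac{1}{\text{Fraction}_{\text{Normal}} + \dfrac{\text{Fraction}_{\text{Enhanced}}}{\text{Speedup}_{\text{Enhanced}}}}$$

• Double the speed of ¼ of the program:

$$\text{Speedup} = \frac{1}{0.75 + \dfrac{0.25}{2}} = 1.14$$

• Quadruple the speed of ¼ of the program:

$$\text{Speedup} = \frac{1}{0.75 + \dfrac{0.25}{4}} = 1.23$$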
Work on the Slow Part
• Double the speed of ¾ of the program:

$$\text{Speedup} = \frac{1}{0.25 + \dfrac{0.75}{2}} = 1.6$$

• Quadruple the speed of ¾ of the program:

$$\text{Speedup} = \frac{1}{0.25 + \dfrac{0.75}{4}} = 2.29$$
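The four results above can be checked with a few lines of C. This is a minimal sketch; the function name amdahl_speedup and its parameter names are chosen here for illustration and do not come from the lecture code.

```c
/* Overall speedup from Amdahl's Law.
 * fraction_enhanced: fraction of run time that is sped up (0.0 to 1.0)
 * speedup_enhanced:  how many times faster that part becomes (> 0)
 * Both names are illustrative assumptions, not part of profile.c/h. */
double amdahl_speedup(double fraction_enhanced, double speedup_enhanced)
{
    double fraction_normal = 1.0 - fraction_enhanced;
    return 1.0 / (fraction_normal + fraction_enhanced / speedup_enhanced);
}
```

For example, amdahl_speedup(0.25, 4.0) is about 1.23 while amdahl_speedup(0.75, 2.0) is 1.6, so a modest improvement to the large part beats a large improvement to the small part.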
How Do We Find the Slow Parts?
• Option A: Measure the amount of time each region takes to execute
  – Code

    unsigned long times[NUM_FUNCTIONS];

    void my_function(int i) {
        unsigned long t_start = get_ticks();    /* timestamp on entry */
        /* function body */
        times[THIS_FUNCTION_NUM] += get_ticks() - t_start;    /* accumulate elapsed ticks */
    }

• Pros
  – Exact. Can get single cycle accuracy if needed.
• Cons
  – Tedious. Must add code to each function to be monitored.
  – Need access to source code, which may be a problem for library functions.
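A hosted version of Option A might look like the sketch below. It assumes the standard clock() call as a stand-in for get_ticks(); on the target a free-running timer would be read instead. The function work(), the calls[] array and the FUNC_WORK index are hypothetical names added for the example.

```c
#include <time.h>

enum { FUNC_WORK, NUM_FUNCTIONS };      /* hypothetical function IDs */

unsigned long times[NUM_FUNCTIONS];     /* accumulated ticks per function */
unsigned long calls[NUM_FUNCTIONS];     /* number of calls per function */
volatile unsigned long sink;            /* keeps the loop from being optimized away */

/* Stand-in tick source (assumption): clock() on a hosted system. */
static unsigned long get_ticks(void)
{
    return (unsigned long)clock();
}

/* Example of a manually instrumented function. */
void work(int n)
{
    unsigned long t_start = get_ticks();
    int i;

    for (i = 0; i < n; i++)             /* function body */
        sink += (unsigned long)i;

    times[FUNC_WORK] += get_ticks() - t_start;
    calls[FUNC_WORK]++;
}
```

Dividing times[FUNC_WORK] by calls[FUNC_WORK] gives an average cost per call, at the price of editing every function that is to be measured.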
Program Counter Sampling
• Option B: Periodically examine the PC to see what’s running
  – Result shows average fraction of time spent executing a region of code
• Supporting data structure: table of region information
  – Starting and ending addresses: defined before sampling
  – Execution counts: updated during sampling
• Sampling
  – Use a timer to interrupt application periodically
  – Within ISR
    • Read PC off of stack
    • Examine table of region addresses to determine currently executing region N
    • Increment entry N of execution count table
    • Also increment a total number of ticks variable (to reveal out-of-range PC instances)
  – At end
    • Provide execution count table to user (via file, serial port, debugger, etc.)

    typedef struct {
        char Name[PROFILE_NAME_SIZE];
        unsigned long Start, End;
        unsigned Count;
    } PROFILE_T;
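The per-sample bookkeeping described above can be tried on a PC before moving it into an interrupt handler. The sketch below assumes a helper profile_sample() that receives the sampled PC as an argument; the two regions and their addresses are made up for illustration.

```c
#define PROFILE_NAME_SIZE   16
#define NUM_PROFILE_REGIONS 2

typedef struct {
    char Name[PROFILE_NAME_SIZE];
    unsigned long Start, End;
    unsigned Count;
} PROFILE_T;

/* Hypothetical regions; real addresses come from the linker map file. */
PROFILE_T profiles[NUM_PROFILE_REGIONS] = {
    {"region_a", 0x1000UL, 0x1fffUL, 0},
    {"region_b", 0x2000UL, 0x2fffUL, 0},
};

unsigned long profile_ticks;    /* total samples, including out-of-range PCs */

/* Record one PC sample: count the tick, then credit the matching region. */
void profile_sample(unsigned long pc)
{
    int i;

    profile_ticks++;
    for (i = 0; i < NUM_PROFILE_REGIONS; i++) {
        if (pc >= profiles[i].Start && pc <= profiles[i].End) {
            profiles[i].Count++;
            return;
        }
    }
    /* No region matched: only profile_ticks was incremented. */
}
```

The difference between profile_ticks and the sum of all Count fields is the number of samples that fell outside every region.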
Configure Timer to Generate Periodic Interrupt
• Call this function with desired sampling frequency in Hz

    void Init_Profiling(unsigned samp_freq) {
        unsigned long divider;

        // set up timer A0 to interrupt at samp_freq
        ta0mr = 0x00;
        divider = ((unsigned long)MAIN_CLOCK)/samp_freq;
        if (divider > 0x0ffffL)
            ta0 = 0x0ffff;
        else
            ta0 = (unsigned) (divider & 0x0ffff);
        DISABLE_IRQ;
        ta0ic = 1;
        ENABLE_IRQ;
        ta0s = 1;
    }
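The divider arithmetic in Init_Profiling can be checked off-target. The sketch below isolates it in a hypothetical helper, timer_divider(), and takes the clock frequency as a parameter because the value of MAIN_CLOCK is not given in the slides. The zero-frequency guard is an addition for the sketch, not part of the original function.

```c
/* Reload value for a 16-bit timer that should fire samp_freq times per
 * second when clocked at main_clock Hz. Values that do not fit in 16 bits
 * are clamped to 0xffff, as in Init_Profiling. */
unsigned timer_divider(unsigned long main_clock, unsigned samp_freq)
{
    unsigned long divider;

    if (samp_freq == 0)                 /* guard added here: avoid dividing by zero */
        return 0xffffu;

    divider = main_clock / samp_freq;
    if (divider > 0xffffUL)
        return 0xffffu;                 /* requested rate is too slow for 16 bits */
    return (unsigned)divider;
}
```

Assuming a 24 MHz clock purely as an example, a request for 1000 Hz yields 24000, while a request for 100 Hz would need 240000 and is clamped to 0xffff, so the actual sampling rate ends up higher than requested.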
Profiler Interrupt Service Routine
• Don’t forget to register ISR in vector table!

    #pragma INTERRUPT/B profile_intr
    void profile_intr(void) {
        unsigned char PC_H;
        unsigned int PC_ML;
        unsigned long PC;
        unsigned char i;

        /* Get PC from stack */
        _asm("mov.w 2[FB], $$[FB]", PC_ML);
        _asm("mov.b 5[FB], $$[FB]", PC_H);
        PC = PC_H;
        PC <<= 16;
        PC += PC_ML;

        profile_ticks++;

        /* look up function in table and inc. counter */
        for (i=0; i<NUM_PROFILE_REGIONS; i++) {
            if ((PC >= profiles[i].Start) && (PC <= profiles[i].End)) {
                profiles[i].Count++;
                return;
            }
        }
    }
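The ISR builds the full PC from two pieces read off the stack: a 16-bit low/middle word and a separate high byte. The same combination is shown below in portable C with fixed-width types so it can be checked on a PC; the helper name combine_pc is an assumption made for this sketch.

```c
#include <stdint.h>

/* Join the high byte and the low/middle word of a saved PC.
 * Equivalent to the ISR's "PC = PC_H; PC <<= 16; PC += PC_ML;"
 * because the two parts occupy different bit positions. */
uint32_t combine_pc(uint8_t pc_h, uint16_t pc_ml)
{
    return ((uint32_t)pc_h << 16) | pc_ml;
}
```

With pc_h = 0x0F and pc_ml = 0x0212 the result is 0x0F0212.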
Configure Profile Table with Region Information
• Where do we find region addresses? Next page.

To profile modules (source files)

    PROFILE_T profiles[NUM_PROFILE_REGIONS] = {
        {"main.c",    UL 0x0f0100, UL 0x0f0299, 0},
        {"profile.c", UL 0x0f029a, UL 0x0f037f, 0},
        {"skp26.c",   UL 0x0f0380, UL 0x0f0613, 0},
        {"skp_lcd.c", UL 0x0f0614, UL 0x0f0917, 0},
        {"library",   UL 0x0f0918, UL 0x0f290d, 0},
        {"", UL 0, UL 0, 0},
        {"", UL 0, UL 0, 0},
        {"", UL 0, UL 0, 0},
    };

To profile functions

    PROFILE_T profiles[NUM_PROFILE_REGIONS] = {
        {"main",           UL main,           UL (main+0x0199),           0},
        {"LCD_Erase",      UL LCD_Erase_FB,   UL (LCD_Erase_FB+0x0240),   0},
        {"sin",            UL sin,            UL (sin+0x0138),            0},
        {"LCD_Plot_in_FB", UL LCD_Plot_in_FB, UL (LCD_Plot_in_FB+0x02e3), 0},
        {"LCD_Display_FB", UL LCD_Display_FB, UL (LCD_Display_FB+0x0302), 0},
        {"DisplayDelay",   UL DisplayDelay,   UL (DisplayDelay+0x01ad),   0},
        {"LCD_write",      UL LCD_write,      UL (LCD_write+0x017f),      0},
        {"", UL 0, UL 0, 0}
    };
Finding the Region Addresses
• Get module addresses from linker’s map file (in debug directory)

    # SECTION  ATR  TYPE  START   LENGTH  ALIGN  MODULENAME
    program    REL  CODE  0F00FF  000000         NCRT0_26SKP
               REL  CODE  0F0100  00019A  2      MAIN
               REL  CODE  0F029A  0000E5  2      PROFILE
               REL  CODE  0F0380  000293  2      SKP26
               REL  CODE  0F0614  000303  2      SKP_LCD
               REL  CODE  0F0918  00020F  2      _F4DIV
               REL  CODE  0F0B28  00008D  2      _F4TOF8
               REL  CODE  0F0BB6  000A65  2      _F8ADD

• Get function lengths (if needed) from .LST file
  – Second column is start address of each assembly instruction
  – Subtract function’s first address from its last address to find length

    506               ;## # FUNCTION LCD_Erase_FB
    507               ;## # FRAME AUTO (y) size 2, offset -4
    508               ;## # FRAME AUTO (x) size 2, offset -2
    509               ;## # ARG Size(0) Auto Size(4) Context Size(5)
    510
    511               .align
    512               ;## # C_SRC : void LCD_Erase_FB(void) {
    513               .glb _LCD_Erase_FB
    514 00212         _LCD_Erase_FB:
    515 00212 7CF204  enter #04H
    516               ;## # C_SRC : for (x=0; x<8; x++)
    517 00215 D90BFE  mov.w #0000H,-2[FB] ; x P
    (etc.)
    539               ;## # C_SRC : }
    540 0023F 7DF2    exitd
    541 00241         E7:
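Turning map file entries into table entries is simple hexadecimal arithmetic, sketched below. Both helper names are assumptions made for this example. The module table in these slides appears to use the second form (one byte before the next module's start), which also covers any alignment padding between modules.

```c
/* Last address of a region given its START and LENGTH from the map file. */
unsigned long end_from_length(unsigned long start, unsigned long length)
{
    return start + length - 1;
}

/* Last address of a region given the START of the region that follows it. */
unsigned long end_from_next(unsigned long next_start)
{
    return next_start - 1;
}
```

For MAIN, 0x0F0100 + 0x19A - 1 = 0x0F0299 either way. For PROFILE the two differ by one byte: START plus LENGTH gives 0x0F037E, while SKP26's START minus one gives 0x0F037F.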
Overview of Profiling Approach
• Start with the big picture (rough details) and use that to determine where to look next
• Profiling sequence
  – module-level (file-level)
  – function-level within the most frequently sampled module
  – basic block-level within the most frequently sampled function
Detailed Steps to using profile.c/h
• Enable list file creation for each C source file
  – HEW: Options -> Renesas M16C Standard Toolchain
    • C Tab -> Category: List. Check -dS and -dSL boxes
• Fill in array profiles with region addresses (e.g. names of functions), dummy lengths, and zero counts
• Compile
  – May need to add function prototypes if profiles table is declared before the functions are
• Update profiles array with correct starting and ending addresses
• Recompile
• Run
• Examine profiles after running long enough
How Long is Enough?
• Complex statistical question
  – The statistician I asked said “it depends” and changed the subject
• So, run it until the digits you care about stop changing
• Example: Module-level profiling

[Figure: fraction of program time (0% to 100%) versus number of samples (10 to 100000, logarithmic axis) for main, profile, skp26, skp_lcd and library]
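One way to make "run it until the digits you care about stop changing" concrete is to compare the region fractions at two checkpoints. The sketch below is only one possible stopping rule; the function name, its parameters and the tolerance are assumptions, not part of profile.c/h.

```c
/* Return 1 if every region's fraction of total samples changed by no more
 * than tol between two checkpoints, 0 otherwise.
 * prev/cur: per-region counts at the earlier and later checkpoint
 * prev_total/cur_total: total samples at each checkpoint
 * n: number of regions, tol: allowed change (0.005 means half a percent) */
int fractions_settled(const unsigned long *prev, unsigned long prev_total,
                      const unsigned long *cur, unsigned long cur_total,
                      int n, double tol)
{
    int i;

    if (prev_total == 0 || cur_total == 0)
        return 0;                       /* nothing to compare yet */

    for (i = 0; i < n; i++) {
        double before = (double)prev[i] / (double)prev_total;
        double after  = (double)cur[i] / (double)cur_total;
        double change = after > before ? after - before : before - after;

        if (change > tol)
            return 0;
    }
    return 1;
}
```

A tighter tolerance keeps more digits stable but needs more samples before the check passes.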
Where does the Lab 4 Skeleton Spend its Time?
We know there are delay loops executed every time the MCU writes to the LCD, but let’s verify how bad they are
• Start with modules
• Then look at functions in module
• Then look at basic blocks within function

    DisplayString(LCD_LINE1," Lab #4 ");
    DisplayString(LCD_LINE2," Starter");
    GRN_LED = LED_ON;

    while (1) {
        for (f=6.0; f>0.0; f -= 0.4) {
            LCD_Erase_FB();
            for (i=0; i<DISP_WIDTH_PIXELS; i++)
                LCD_Plot_in_FB((unsigned char)i, (unsigned char) (3.5*(sin(i/f)+1.0)), 1);
            LCD_Display_FB(LCD_LINE1);
        }
    }
Step One: Profile Modules
• Define a profile region per module, and one for all the library functions
    # SECTION  ATR  TYPE  START   LENGTH  ALIGN  MODULENAME
    program    REL  CODE  0F00FF  000000         NCRT0_26SKP
               REL  CODE  0F0100  00019A  2      MAIN
               REL  CODE  0F029A  0000E5  2      PROFILE
               REL  CODE  0F0380  000293  2      SKP26
               REL  CODE  0F0614  000303  2      SKP_LCD
               REL  CODE  0F0918  00020F  2      _F4DIV
               REL  CODE  0F0B28  00008D  2      _F4TOF8
               REL  CODE  0F0BB6  000A65  2      _F8ADD
               REL  CODE  0F161C  0000A7  2      _F8LE
               REL  CODE  0F16C4  000069  2      _F8LTOR
               REL  CODE  0F172E  0002DE  2      _F8MUL
               REL  CODE  0F1A0C  0000BA  2      _F8TOF4
               REL  CODE  0F1AC6  000025  2      _F8TOI4U
               REL  CODE  0F1AEC  000192  2      _FTOL
               REL  CODE  0F1C7E  00004D  2      _I4DIVU
               REL  CODE  0F1CCC  000022  2      _I4TOF4
               REL  CODE  0F1CEE  0000FD  2      _LTOF
               REL  CODE  0F1DEC  000138  2      SIN
               REL  CODE  0F1F24  00035D  2      TAN
               REL  CODE  0F2282  000058  2      _F4LTOR
               REL  CODE  0F22DA  000060  2      _F4RTOL
               REL  CODE  0F233A  0002E4  2      _F8DIV
               REL  CODE  0F261E  00007C  2      _F8EQ
               REL  CODE  0F269A  0000A7  2      _F8LT
               REL  CODE  0F2742  00007C  2      _F8NE
               REL  CODE  0F27BE  000066  2      _F8RTOL
               REL  CODE  0F2824  00002E  2      _F8SUB
               REL  CODE  0F2852  000025  2      _F8TOI4
               REL  CODE  0F2878  000073  2      _I4MOD
               REL  CODE  0F28EC  000022  2      _I4TOF8
    PROFILE_T profiles[NUM_PROFILE_REGIONS] = {
        {"main.c",    UL 0x0f0100, UL 0x0f0299, 0},
        {"profile.c", UL 0x0f029a, UL 0x0f037f, 0},
        {"skp26.c",   UL 0x0f0380, UL 0x0f0613, 0},
        {"skp_lcd.c", UL 0x0f0614, UL 0x0f0917, 0},
        {"library",   UL 0x0f0918, UL 0x0f290d, 0},
        {"", UL 0, UL 0, 0},
        {"", UL 0, UL 0, 0},
        {"", UL 0, UL 0, 0},
    };
Step One Results
• Surprise! The LCD functions aren’t taking up most of the processor’s time. The library functions are instead (70.68% of samples, versus 29.11% for skp_lcd).
Execution Time per Module

    Region     Count    Time
    main          70    0.21%
    profile        0    0.00%
    skp26          0    0.00%
    skp_lcd     9642   29.11%
    library    23415   70.68%
    other          1    0.00%

[Figure: chart of execution time per module (main, profile, skp26, skp_lcd, library, other)]
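The Time column and the "other" row follow directly from the counts and the total tick count. The sketch below shows one way to compute and print them; the helper names are assumptions, and the total of 33128 ticks used in the usage note is simply the sum of the counts in the table above.

```c
#include <stdio.h>

#define PROFILE_NAME_SIZE 16

typedef struct {
    char Name[PROFILE_NAME_SIZE];
    unsigned long Start, End;
    unsigned Count;
} PROFILE_T;

/* Samples that fell outside every region: total ticks minus all counts. */
unsigned long other_count(const PROFILE_T *table, int n, unsigned long ticks)
{
    unsigned long sum = 0;
    int i;

    for (i = 0; i < n; i++)
        sum += table[i].Count;
    return ticks - sum;
}

/* Share of all samples, in percent. */
double percent_of(unsigned long count, unsigned long ticks)
{
    return ticks ? 100.0 * (double)count / (double)ticks : 0.0;
}

/* Print one row per region plus the "other" row. */
void print_report(const PROFILE_T *table, int n, unsigned long ticks)
{
    unsigned long other = other_count(table, n, ticks);
    int i;

    for (i = 0; i < n; i++)
        printf("%-10s %6u %6.2f%%\n", table[i].Name, table[i].Count,
               percent_of(table[i].Count, ticks));
    printf("%-10s %6lu %6.2f%%\n", "other", other, percent_of(other, ticks));
}
```

With the five module counts above and 33128 total ticks, other_count() returns 1 and percent_of(23415, 33128) is about 70.68.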
Step Two: Profile Library
• We only have eight entries in our table, so let’s split up the library into eight regions of about three functions each
(Same linker map file as in Step One, with the library modules marked off in address order as groups lib 1 through lib 8.)

    lib 1: _F4DIV, _F4TOF8, _F8ADD
    lib 2: _F8LE, _F8LTOR, _F8MUL
    lib 3: _F8TOF4, _F8TOI4U, _FTOL
    lib 4: _I4DIVU, _I4TOF4, _LTOF
    lib 5: SIN, TAN, _F4LTOR
    lib 6: _F4RTOL, _F8DIV, _F8EQ
    lib 7: _F8LT, _F8NE, _F8RTOL
    lib 8: _F8SUB, _F8TOI4, _I4MOD, _I4TOF8
    PROFILE_T profiles[NUM_PROFILE_REGIONS] = {
        {"lib 1", UL 0x0f0918, UL 0x0f161b, 0},
        {"lib 2", UL 0x0f161c, UL 0x0f1a0b, 0},
        {"lib 3", UL 0x0f1a0c, UL 0x0f1c7d, 0},
        {"lib 4", UL 0x0f1c7e, UL 0x0f1deb, 0},
        {"lib 5", UL 0x0f1dec, UL 0x0f22d9, 0},
        {"lib 6", UL 0x0f22da, UL 0x0f2699, 0},
        {"lib 7", UL 0x0f269a, UL 0x0f2823, 0},
        {"lib 8", UL 0x0f2824, UL 0x0f290d, 0}
    };
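The eight address ranges can be derived mechanically from the library modules' start addresses in the map file. The sketch below is one way to do that split, not necessarily how the table was originally produced: three modules per group, with the last group absorbing the remainder. The names RANGE_T, module_start and split_library are assumptions for this example.

```c
#define NUM_MODULES 25
#define NUM_GROUPS  8

typedef struct {
    unsigned long Start, End;
} RANGE_T;

/* START addresses of the 25 library modules, copied from the map file. */
static const unsigned long module_start[NUM_MODULES] = {
    0x0F0918UL, 0x0F0B28UL, 0x0F0BB6UL, 0x0F161CUL, 0x0F16C4UL,
    0x0F172EUL, 0x0F1A0CUL, 0x0F1AC6UL, 0x0F1AECUL, 0x0F1C7EUL,
    0x0F1CCCUL, 0x0F1CEEUL, 0x0F1DECUL, 0x0F1F24UL, 0x0F2282UL,
    0x0F22DAUL, 0x0F233AUL, 0x0F261EUL, 0x0F269AUL, 0x0F2742UL,
    0x0F27BEUL, 0x0F2824UL, 0x0F2852UL, 0x0F2878UL, 0x0F28ECUL
};

/* Last library address: _I4TOF8 START + LENGTH - 1 = 0x0F28EC + 0x22 - 1. */
static const unsigned long library_end = 0x0F290DUL;

/* Fill groups[] with NUM_GROUPS contiguous address ranges. */
void split_library(RANGE_T groups[NUM_GROUPS])
{
    int per_group = NUM_MODULES / NUM_GROUPS;   /* 3 modules per group */
    int g;

    for (g = 0; g < NUM_GROUPS; g++) {
        int first = g * per_group;

        groups[g].Start = module_start[first];
        if (g == NUM_GROUPS - 1)
            groups[g].End = library_end;        /* last group takes the rest */
        else
            groups[g].End = module_start[first + per_group] - 1;
    }
}
```

Run on the addresses above, this yields the same Start and End values as the lib 1 through lib 8 entries in the table.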
Step Two Results
• Functions in group lib 6 are taking the most time, followed by lib 4 and lib 1
Execution Time per Library Function Group

    Region   Count    Time
    lib 1     3079   10.22%
    lib 2     1796    5.96%
    lib 3      865    2.87%
    lib 4     3092   10.26%
    lib 5      687    2.28%
    lib 6    11044   36.65%
    lib 7      131    0.43%
    lib 8      444    1.47%
    other     8999   29.86%

[Figure: chart of execution time per library function group (lib 1 through lib 8, other)]
Step Three: Profile Top Library Functions
• Examine the nine functions in these three groups (lib 6, lib 4 and lib 1). The table has only eight entries, so two functions (_F4TOF8 and _F8ADD) share one region.
    PROFILE_T profiles[NUM_PROFILE_REGIONS] = {
        {"_F4DIV",         UL 0x0f0918, UL 0x0f0b27, 0},
        {"_F4TOF8+_F8ADD", UL 0x0f0b28, UL 0x0f161b, 0},
        {"_I4DIVU",        UL 0x0f1c7e, UL 0x0f1ccb, 0},
        {"_I4TOF4",        UL 0x0f1ccc, UL 0x0f1ced, 0},
        {"_LTOF",          UL 0x0f1cee, UL 0x0f1deb, 0},
        {"_F4RTOL",        UL 0x0f22da, UL 0x0f2339, 0},
        {"_F8DIV",         UL 0x0f233a, UL 0x0f261d, 0},
        {"_F8EQ",          UL 0x0f261e, UL 0x0f2699, 0}
    };
Step Three Results
• Most time is spent in the double-precision floating-point divide (_F8DIV)
• 3.5*(sin(i/f)+1.0) is the culprit. Avoid floating point when possible
Execution Time per Library Function

    Region            Count    Time
    _F4DIV              482    1.83%
    _F4TOF8+_F8ADD     2413    9.14%
    _I4DIVU               0    0.00%
    _I4TOF4              18    0.07%
    _LTOF              2748   10.41%
    _F4RTOL              36    0.14%
    _F8DIV             9418   35.67%
    _F8EQ                 0    0.00%
    other             11291   42.76%
    Total             26406   (the eight profiled regions account for 57.24%)

[Figure: chart of execution time per library function (_F4DIV, _F4TOF8+_F8ADD, _I4DIVU, _I4TOF4, _LTOF, _F4RTOL, _F8DIV, _F8EQ, other)]
Disadvantages of Sampling
• Sampling is inexact - not guaranteed to get everything that runs
  – Code which disables interrupts (e.g. ISRs, OS code) is not measured
  – Rarely executed code may be missed
  – Takes time for numbers to settle down
  – Profile changes based on mode of program
• If manually creating table, user needs to update address table with each code change
Implementing Instrumentation
• Tedious to do manually for large programs, so automate
• Have compiler instrument code for you
  – gcc and other compilers support profiling using a command line switch
  – They provide a tool to process the output file to determine how much time each function takes
• Can also modify the binary (after compilation): Atom from DEC’s Western Research Lab, Etch, EEL
  – Tool processes binary files to run your instrumentation procedures for each procedure, basic block, or instruction
• For the M16C, what would be best?
  – Create a program which reads the map file and creates a C file declaring our profiles array with correct region names and addresses
    • Probably easiest to use a scripting language: sed, awk, perl, (f)lex
    • Probably not enough memory to instrument all functions, so must be selective
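The map-file-to-table idea can be sketched as follows. The slides suggest a scripting language; C is used here only to stay consistent with the rest of the code. The sketch assumes map lines shaped like the excerpts shown earlier (ATR "REL", TYPE "CODE", then START, LENGTH, ALIGN and MODULENAME), and both function names are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Parse one map file line of the assumed form
 *   [section] REL CODE <start hex> <length hex> <align> <module>
 * Returns 1 and fills the outputs on success, 0 for any other line
 * (including zero-length entries and lines without an ALIGN column). */
int parse_map_line(const char *line, unsigned long *start,
                   unsigned long *length, char module[32])
{
    const char *p = strstr(line, "REL");
    int align;

    if (p == NULL)
        return 0;
    if (sscanf(p, "REL CODE %lx %lx %d %31s", start, length, &align, module) != 4)
        return 0;
    return *length > 0;
}

/* Write one initializer line for the profiles array into out. */
void format_profile_entry(char *out, size_t out_size, const char *module,
                          unsigned long start, unsigned long length)
{
    snprintf(out, out_size, "{\"%s\", UL 0x%06lx, UL 0x%06lx, 0},",
             module, start, start + length - 1);
}
```

A driver would read the map file line by line, call parse_map_line() on each line, and write the formatted entries between the opening and closing lines of the profiles declaration. Since the table is small, it would also need a rule for which modules to keep or merge.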