From C to Machine Code/Assembler
Outline
• Project 1
• Finish the last lecture
• C to Assembly
Project 1 – Results
Anagram, FlipCount
correct: 34/48
Nqueens
correct: 42/48
correct: 41/48
Project 1 – Meeting the Master
• Set for the meeting?
– You should have already contacted the master
– You should have already sent your code & write-up
– Do you have a meeting scheduled yet?
• Prepare for the meeting!
– “Opening statement”
• Look for insights to help with the returnin
– See if you can come close to the best performance
– Make sure you understand the sharing policy
Bentley’s Rules
• Modifying Data • Modifying Code
– Loop Rules – Logic Rules – Procedure Rules – Expression Rules
• Compile‐time Initialization • Common Subexpression Elimination • Pairing Computation
– Parallelism Rules
© Saman Amarasinghe 2009
Compile-Time Initialization
• If a value is a constant, make it a compile-time constant.
– Saves the effort of calculation
– Allows value inlining
– Creates more optimization opportunities
Example
    Before:
        vol = 2 * pi() * r * r;
    After:
        #define PI 3.14159265358979
        #define R 12
        ……..
        vol = 2 * PI * R * R;
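A minimal sketch of the same idea as compilable C. The function names `vol_runtime` and `vol_constant` are hypothetical and only serve to contrast the two forms; whether a given compiler actually folds the constant expression depends on the compiler and flags.

```c
/* Compile-time constants, named as on the slide. */
#define PI 3.14159265358979
#define R  12

/* Hypothetical runtime form: the operands are only known when the
   function is called, so the multiplications happen at run time. */
double vol_runtime(double pi, double r) {
    return 2 * pi * r * r;
}

/* Constant form: every operand is a literal, so the compiler is free
   to evaluate 2 * PI * R * R once, at compile time. */
double vol_constant(void) {
    return 2 * PI * R * R;
}
```

Both functions return the same value; the difference is only in when the arithmetic can be done.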
Common Subexpression Elimination
• If the same expression is evaluated twice, do it only once
• Viability?
– Expression has no side effects
– The expression value does not change between the evaluations
– The cost of keeping a copy is amortized by the complexity of the expression
– Too complicated for the compiler to do it automatically
Example
    Before:
        x = sin(a) * sin(a);
    After:
        double tmp;
        …
        tmp = sin(a);
        x = tmp * tmp;
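The effect can be made visible with a small sketch. Here `expensive` is a hypothetical stand-in for a costly pure function such as `sin`, and `evaluation_count` is instrumentation added only to count calls; it is not part of the technique.

```c
/* Instrumentation only: counts how often expensive() runs. */
int evaluation_count = 0;

/* Hypothetical stand-in for a costly function with no other side effects. */
double expensive(double a) {
    evaluation_count++;
    return a * a + 1.0;
}

/* Before: the common subexpression is evaluated twice. */
double square_twice(double a) {
    return expensive(a) * expensive(a);
}

/* After: the value is computed once and kept in a temporary. */
double square_once(double a) {
    double tmp = expensive(a);
    return tmp * tmp;
}
```

Both versions return the same result; the second one calls `expensive` once instead of twice.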
Pairing Computation
• If two similar functions are called with the same arguments close to each other on many occasions, combine them.
– Reduces call overhead
– Makes it possible to share the computation cost
– Opens up more optimization possibilities
Example
    Before:
        x = r * cos(a);
        y = r * sin(a);
    After:
        typedef struct twodouble {
            double d1;
            double d2;
        } twodouble;
        ….
        twodouble dd;
        dd = sincos(a);
        x = r * dd.d1;
        y = r * dd.d2;
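The same rule applies outside trigonometry. As a sketch that avoids the math library, the hypothetical functions below pair a minimum and a maximum search: `array_min` and `array_max` each walk the array, while `array_minmax` shares one traversal. The type `minmax` and all function names are illustrative assumptions, and every function assumes `n >= 1`.

```c
#include <stddef.h>

/* Hypothetical pair type, analogous to twodouble on the slide. */
typedef struct { int min; int max; } minmax;

/* Separate calls: two full passes over the array. */
int array_min(const int *x, size_t n) {
    int m = x[0];
    for (size_t i = 1; i < n; i++)
        if (x[i] < m) m = x[i];
    return m;
}

int array_max(const int *x, size_t n) {
    int m = x[0];
    for (size_t i = 1; i < n; i++)
        if (x[i] > m) m = x[i];
    return m;
}

/* Paired call: one pass, one call, shared loop overhead. */
minmax array_minmax(const int *x, size_t n) {
    minmax r = { x[0], x[0] };
    for (size_t i = 1; i < n; i++) {
        if (x[i] < r.min) r.min = x[i];
        if (x[i] > r.max) r.max = x[i];
    }
    return r;
}
```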
Bentley’s Rules
• Modifying Data • Modifying Code
– Loop Rules – Logic Rules – Procedure Rules – Expression Rules – Parallelism Rules
• Exploit Implicit Parallelism • Exploit Inner Loop Parallelism • Exploit Coarse Grain Parallelism • Extra computation to create parallelism
Implicit Parallelism
• Reduce the loop carried dependences so that “software pipelining” can execute a compact schedule without stalls.
• Example:
    Before:
        xmax = MININT;
        for(i=0; i < N; i++)
            if(X[i] > xmax) xmax = X[i];
    After:
        xmax1 = MININT;
        xmax2 = MININT;
        for(i=0; i < N-1; i += 2) {
            if(X[i] > xmax1) xmax1 = X[i];
            if(X[i+1] > xmax2) xmax2 = X[i+1];
        }
        if((i < N) && (X[i] > xmax1)) xmax1 = X[i];
        xmax = (xmax1 > xmax2)?xmax1:xmax2;
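A compilable sketch of the two-accumulator version, assuming `INT_MIN` plays the role of `MININT`. The function name `max_two_chains` is hypothetical. The two running maxima do not depend on each other, so their compare/update chains can overlap in the pipeline.

```c
#include <limits.h>
#include <stddef.h>

/* Maximum of X[0..N-1] using two independent dependence chains.
   Returns INT_MIN for an empty array. */
int max_two_chains(const int *X, size_t N) {
    int xmax1 = INT_MIN;
    int xmax2 = INT_MIN;
    size_t i;
    for (i = 0; i + 1 < N; i += 2) {
        if (X[i] > xmax1) xmax1 = X[i];         /* chain 1: even indices */
        if (X[i + 1] > xmax2) xmax2 = X[i + 1]; /* chain 2: odd indices  */
    }
    if (i < N && X[i] > xmax1) xmax1 = X[i];    /* leftover element when N is odd */
    return (xmax1 > xmax2) ? xmax1 : xmax2;
}
```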
Example 2: summing a linked list
    curr = head;
    tot = 0;
    while(curr != NULL) {
        tot = tot + curr->val;
        curr = curr->next;
    }
    return tot;
Also see Rule A.1.a Data Structure Augmentation
With each node augmented by a nextnext pointer that skips one node:
    curr = head;
    if(curr == NULL) return 0;
    tot1 = 0;
    tot2 = 0;
    while(curr != NULL && curr->next != NULL) {
        tot1 = tot1 + curr->val;
        tot2 = tot2 + curr->next->val;
        curr = curr->nextnext;
    }
    if(curr)
        tot1 = tot1 + curr->val;
    return tot1 + tot2;
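A self-contained sketch of the augmented list. The struct layout, the helper `link_nodes`, and the name `sum_two_chains` are assumptions made for illustration; the loop tests `curr` before `curr->next` so that lists of even length, where `nextnext` ends in NULL, are handled safely.

```c
#include <stddef.h>

/* List node augmented with a pointer that skips one node. */
struct node {
    int val;
    struct node *next;
    struct node *nextnext;
};

/* Hypothetical helper: links an array of n nodes into a list and
   fills in both pointer fields. Returns the head (NULL if n == 0). */
struct node *link_nodes(struct node *nodes, size_t n) {
    for (size_t i = 0; i < n; i++) {
        nodes[i].next = (i + 1 < n) ? &nodes[i + 1] : NULL;
        nodes[i].nextnext = (i + 2 < n) ? &nodes[i + 2] : NULL;
    }
    return n ? nodes : NULL;
}

/* Sums the list with two independent accumulators. */
int sum_two_chains(const struct node *head) {
    const struct node *curr = head;
    int tot1 = 0;
    int tot2 = 0;
    while (curr != NULL && curr->next != NULL) {
        tot1 += curr->val;        /* chain 1 */
        tot2 += curr->next->val;  /* chain 2 */
        curr = curr->nextnext;
    }
    if (curr != NULL)
        tot1 += curr->val;        /* last node of an odd-length list */
    return tot1 + tot2;
}
```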
Exploit Inner Loop Parallelism
• Facilitate inner loop vectorization (for SSE type instructions)
• How? By gingerly guiding the compiler to do so
– Iterative process: look at why the loop is not vectorized and fix those issues
– Most of the rules above can be used to simplify the loop so that the compiler can vectorize it
Exploit Coarse Grain Parallelism
• Outer loop parallelism (doall and doacross loops)
• Task parallelism
• Ideal for multicores
• You need to do the parallelization yourself (covered in later lectures)
Extra Computation to Create Parallelism
• In many cases doing a little more work (or using a slower algorithm) can turn a sequential program into a parallel one. Parallel execution may amortize the cost.
• Example:
    Before:
        double tot;
        tot = 0;
        for(i = 0; i < N; i++)
            for(j = 0; j < N; j++)
                tot = tot + A[i][j];
    After:
        double tot;
        double tottmp[N];
        for(i = 0; i < N; i++)
            tottmp[i] = 0;
        for(i = 0; i < N; i++) { //parallelizable
            double tmp = 0;
            for(j = 0; j < N; j++)
                tmp = tmp + A[i][j];
            tottmp[i] = tottmp[i] + tmp;
        }
        tot = 0;
        for(i = 0; i < N; i++)
            tot = tot + tottmp[i];
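As a compilable sketch of the restructured loop, with a hypothetical fixed size `ROWS` and function name `sum_by_rows`: each outer iteration writes only its own `tottmp[i]`, so the outer loop carries no dependence and could be distributed across cores. The code below still runs sequentially; it only shows the shape that makes parallel execution possible.

```c
#define ROWS 4  /* hypothetical matrix dimension for the sketch */

double sum_by_rows(double A[ROWS][ROWS]) {
    double tottmp[ROWS];
    double tot = 0;

    /* Each iteration touches only tottmp[i]: no loop-carried dependence. */
    for (int i = 0; i < ROWS; i++) {
        double tmp = 0;
        for (int j = 0; j < ROWS; j++)
            tmp += A[i][j];
        tottmp[i] = tmp;
    }

    /* Extra work: a short sequential pass combines the partial sums. */
    for (int i = 0; i < ROWS; i++)
        tot += tottmp[i];
    return tot;
}
```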
Bentley’s Rules
• A Modifying Data
  1. Space for Time
     a. Data Structure Augmentation
     b. Storing Precomputed Results
     c. Caching
     d. Lazy Evaluation
  2. Time for Space
     a. Packing/Compression
     b. Interpreters
  3. Space and Time
     a. SIMD
• B Modifying Code
  1. Loop Rules
     a. Loop Invariant Code Motion
     b. Sentinel Loop Exit Test
     c. Loop Elimination by Unrolling
     d. Partial Loop Unrolling
     e. Loop fusion
     f. Eliminate wasted iterations
  2. Logic Rules
     a. Exploit Algebraic Identities
     b. Short Circuit Monotone functions
     c. Reordering tests
     d. Precompute Logic Functions
     e. Boolean Variable Elimination
  3. Procedure Rules
     a. Collapse Procedure Hierarchies
     b. Coroutines
     c. Tail Recursion Elimination
  4. Expression Rules
     a. Compile-time Initialization
     b. Common Subexpression Elimination
     c. Pairing Computation
  5. Parallelism Rules
     a. Exploit Implicit Parallelism
     b. Exploit Inner Loop Parallelism
     c. Exploit Coarse Grain Parallelism
     d. Extra computation to create parallelism
FROM C TO MACHINE CODE/ASSEMBLER
Generic Machine Model
• Processor – Registers – Functional units (arithmetic, logical operations) – Instruction execution and coordination – Generates memory accesses (instructions and data)
• Memory Hierarchy – Registers – 1st level cache – 2nd level cache – Main memory – Instruction versus data caches
• Executes a stream of instructions
Three Layers
• Source Code (C, C++, Java) • Machine Code • Hardware
• Today’s topic: Source Code (C) ‐> Machine Code • Responsibility of compiler • Goals:
– Understand how compiler implements C constructs using x86 constructs
– Can read machine code (assembler form) – Can hack machine code generated by compiler – Can write own machine code from scratch if necessary
Assembly example
                               .section .rodata
    .LC0:
    0000 6572726F7200          .string "error"
                               .text
                               .globl fact
    fact:
    0000 55                    pushq %rbp
    0001 4889E5                movq %rsp, %rbp
    0004 4883EC10              subq $16, %rsp
    0008 897DFC                movl %edi, -4(%rbp)
    000b 837DFC00              cmpl $0, -4(%rbp)
    000f 7911                  jns .L2
    0011 BF00000000            movl $.LC0, %edi
    0016 B800000000            movl $0, %eax
    001b E800000000            call printf
    0020 EB22                  jmp .L3
    .L2:
    0022 837DFC00              cmpl $0, -4(%rbp)
    0026 7509                  jne .L4
    0028 C745F801000000        movl $1, -8(%rbp)
    002f EB13                  jmp .L3
    .L4:
    0031 8B7DFC                movl -4(%rbp), %edi
    0034 FFCF                  decl %edi
    0036 E800000000            call fact
    003b 0FAF45FC              imull -4(%rbp), %eax
    003f 8945F8                movl %eax, -8(%rbp)
    0042 EB00                  jmp .L1
    .L3:
    0044 8B45F8                movl -8(%rbp), %eax
    0047 C9                    leave
    0048 C3                    ret
X86-64 Machine Model
• Flat 64 Bit Address Space
– (bytes, words, doublewords, quadwords, double quadwords)
• 64 bit registers
– RAX, RBX, RCX, RDX, RSI, RDI, RSP, RBP, R8-R15
• 32 bit registers
– EAX, EBX, ECX, EDX, ESI, EDI, ESP, EBP, R8D-R15D
– Aliased with bottom 32 bits of corresponding 64 bit registers
• 16 bit registers
– AX, BX, CX, DX, SI, DI, SP, BP, R8W-R15W
– Aliased with bottom 16 bits of corresponding 32 bit registers
• 8 bit registers (look in the manual for these)
• Status Flags (CF - carry, ZF - zero, SF - sign, OF - overflow)
• RIP (instruction pointer)
Arithmetic and Logic Unit
• Performs most of the data operations
• Has the form:
    OP <oprnd1>, <oprnd2>
– <oprnd2> = <oprnd2> OP <oprnd1>
  or
    OP <oprnd1>
• Operands are:
– Immediate Value $25
– Register %rax
– Memory 4(%rbp)
• Operations are:
– Arithmetic operations (add, sub, imul)
– Logical operations (and, sal)
– Unary operations (inc, dec)
Control
• Unconditional Branches
– Fetch the next instruction from a different location
– Unconditional jump to an address
    jmp .L32
– Unconditional jump to an address in a register
    jmp *%rax
– To handle procedure calls
    call fact
    call *%r11
Control
• All arithmetic operations update the condition codes (rFLAGS)
• Compare explicitly sets the rFLAGS
– cmp $0, %rax
• Conditional jumps on the rFLAGS
– Jxx .L32
– Examples:
• JO Jump if overflow
• JC Jump if carry
• JAE Jump if above or equal
• JZ Jump if zero
• JNE Jump if not equal
Registers
[Figure: x86-64 register set. General-purpose registers (GPRs) RAX, RBX, RCX, RDX, RBP, RSI, RDI, RSP, R8-R15, bits 63-0; multimedia extension and floating-point registers MM0/ST0 to MM7/ST7, bits 63-0; flags register EFLAGS, bits 31-0; instruction pointer RIP, bits 63-0; Streaming SIMD Extension (SSE) registers XMM0-XMM15, bits 127-0.]
Figure by MIT OpenCourseWare.
    Bits 63-0   Bits 31-0   Bits 15-0   Bits 15-8   Bits 7-0
    %rax        %eax        %ax         %ah         %al
    %rbx        %ebx        %bx         %bh         %bl
    %rcx        %ecx        %cx         %ch         %cl
    %rdx        %edx        %dx         %dh         %dl
    %rsi        %esi        %si                     %sil
    %rdi        %edi        %di                     %dil
    %rbp        %ebp        %bp                     %bpl
    %rsp        %esp        %sp                     %spl
    %r8         %r8d        %r8w                    %r8b
    %r9         %r9d        %r9w                    %r9b
    %r10        %r10d       %r10w                   %r10b
    %r11        %r11d       %r11w                   %r11b
    %r12        %r12d       %r12w                   %r12b
    %r13        %r13d       %r13w                   %r13b
    %r14        %r14d       %r14w                   %r14b
    %r15        %r15d       %r15w                   %r15b
Registers – Calling Convention
    %rax    Return value
    %rbx    Callee saved
    %rcx    4th argument
    %rdx    3rd argument
    %rsi    2nd argument
    %rdi    1st argument
    %rbp    Callee saved
    %rsp    Stack pointer
    %r8     5th argument
    %r9     6th argument
    %r10    Caller saved
    %r11    Caller saved (used for linking)
    %r12    Callee saved
    %r13    Callee saved
    %r14    Callee saved
    %r15    Callee saved
Figure by MIT OpenCourseWare.
Memory
• Flat Address Space – composed of words
– byte addressable
• Need to store – Program
– Local variables – Global variables and data
– Stack
– Heap
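Memory
[Figure: memory layout between address 0x0 and 0x800 0000 0000. An unmapped region starts at 0x0; Text (the program) starts at 0x40 0000; Data (globals / read-only data) follows; the dynamic regions are the Heap and the Stack.]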
Read-Only Data
• All Read-Only data in the text segment
• Integers
– uses load immediate
• Strings
– uses the .string macro

    .section .text
    .globl main
    main:
        enter $0, $0
        movq  $5, x(%rip)
        mov   x(%rip), %rsi
        mov   $.msg, %rdi
        call  printf
        leave
        ret
    .msg:
        .string "Five: %d\n"
Global Variables
• Allocation: Uses the .comm directive
• Uses PC relative addressing
– %rip is the current instruction address
– x(%rip) adds to %rip the offset from the current instruction location to the space for x in the data segment
– Creates easily relocatable binaries

    .section .text
    .globl main
    main:
        enter $0, $0
        movq  $5, x(%rip)
        mov   x(%rip), %rsi
        mov   $.msg, %rdi
        call  printf
        leave
        ret
    .comm x, 8

.comm name, size, alignment
The .comm directive allocates storage in the data section. The storage is referenced by the identifier name. Size is measured in bytes and must be a positive integer. Name cannot be predefined. Alignment is optional. If alignment is specified, the address of name is aligned to a multiple of alignment.
The Stack
• Grows from top to bottom
• Call frames keep procedure specific information
• Calling convention

    8*n+16(%rbp)    argument n
    …
    16(%rbp)        argument 7
    8(%rbp)         Return address
    0(%rbp)         Previous %rbp
    -8(%rbp)        local 0
    …
    -8*m-8(%rbp)    local m
                    (variable size)
    0(%rsp)

Offsets at 16(%rbp) and above lie in the previous (caller's) frame; the locals below %rbp lie in the current frame.
Procedure Linkages
Standard procedure linkage (procedure p calls procedure q)
    procedure p: prolog, pre-call, post-return, epilog
    procedure q: prolog, epilog
Procedure has
• standard prolog
• standard epilog
Each call involves a
• pre-call sequence
• post-return sequence
X86‐64 Calling Convention
• RSP points to procedure call stack in memory – call instruction pushes RIP on stack, jumps to call target operand (address of procedure)
– ret instruction pops RIP from stack, returns to caller
– stack grows down
• Software conventions – Caller‐save registers (r10, r11)
– Callee‐save registers (rbx, rbp, r12‐r15)
Stack
• Calling: Caller
– Assume %rcx is live and is caller save
– Call foo(A, B, C, D, E, F, G, H, I)
• A to I are at -8(%rbp) to -72(%rbp)

    push %rcx
    push -72(%rbp)
    push -64(%rbp)
    push -56(%rbp)
    mov  -48(%rbp), %r9
    mov  -40(%rbp), %r8
    mov  -32(%rbp), %rcx
    mov  -24(%rbp), %rdx
    mov  -16(%rbp), %rsi
    mov  -8(%rbp), %rdi
    call foo

[Figure: caller's stack after the call, from high to low addresses: return address, previous frame pointer (rbp points here), callee saved registers, local variables, stack temporaries, dynamic area, caller saved registers, argument 9, argument 8, argument 7, return address (rsp points here).]
Stack
• Calling: Callee
– Assume %rbx is used in the function and is callee save
– Assume 40 bytes are required for locals

    foo:
        push %rbp
        mov  %rsp, %rbp
        sub  $48, %rsp
        mov  %rbx, -8(%rbp)

  The push/mov/sub sequence can be replaced by:
        enter $48, $0
Stack
• Arguments
• Call foo(A, B, C, D, E, F, G, H, I)
– Passed in by pushing before the call
    push -72(%rbp)
    push -64(%rbp)
    push -56(%rbp)
    mov  -48(%rbp), %r9
    mov  -40(%rbp), %r8
    mov  -32(%rbp), %rcx
    mov  -24(%rbp), %rdx
    mov  -16(%rbp), %rsi
    mov  -8(%rbp), %rdi
    call foo
– Access A to F via registers
• or put them in local memory
– Access rest using 16+xx(%rbp)
    mov 16(%rbp), %rax
    mov 24(%rbp), %r10
Stack
• Locals and Temporaries
– Calculate the size and allocate space on the stack
    sub $48, %rsp
  or
    enter $48, $0
– Access using -8-xx(%rbp)
    mov -28(%rbp), %r10
    mov %r11, -20(%rbp)
Stack
• Returning Callee
– Assume the return value is the first temporary
– Restore the callee saved register
– Put the return value in %rax
– Tear down the call stack

    mov -8(%rbp), %rbx
    mov -16(%rbp), %rax
    mov %rbp, %rsp
    pop %rbp
    ret

  The mov %rbp, %rsp and pop %rbp pair can be replaced by:
    leave
Stack
• Returning Caller
– Assume the return value goes to the first temporary
– Restore the stack to reclaim the argument space
– Restore the caller save registers

    call foo
    add  $24, %rsp
    pop  %rcx
    mov  %rax, 8(%rbp)
    …
X86-64 Addressing Modes
• Immediate – addl $-1, %edi
• Register – addl %ebx, %eax
• Address – addq (%rdi), %rax
• Base+(Index*Scale)+Displacement
– Base, Index are values in registers
– Scale is 1, 2, 4, or 8
– Displacement is 8, 16, or 32 bit value
– addq (%rdi,%rdx,8), %rax
• RIP + 32 bit Displacement – movl x(%rip), %eax
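The Base+(Index*Scale)+Displacement mode is plain address arithmetic, which can be sketched in C. The function `effective_address` is a hypothetical helper that spells out the computation the hardware performs for an operand such as `(%rdi,%rdx,8)`; it is not an API from any library.

```c
#include <stddef.h>
#include <stdint.h>

/* Computes base + index * scale + disp in bytes, the address that an
   operand like disp(base,index,scale) refers to. */
const void *effective_address(const void *base, size_t index,
                              size_t scale, ptrdiff_t disp) {
    return (const char *)base + index * scale + disp;
}

/* For an array of 8-byte elements, a[i] lives at (a, i, 8) with
   displacement 0, which is why addq (%rdi,%rdx,8), %rax reads a[i]. */
uint64_t load_element(const uint64_t *a, size_t i) {
    return *(const uint64_t *)effective_address(a, i, sizeof(uint64_t), 0);
}
```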
Translating Expressions
    int compute1() {
        int x, y, z;
        x = 34; y = 7; z = 45;
        return (x + y) | z;
    }

        movl $45, %eax
        ret

    int compute2(int x, int y, int z) {
        return (x + y) | z;
    }

        # parameter 1: %edi
        # parameter 2: %esi
        # parameter 3: %edx
        addl %esi, %edi
        orl  %edx, %edi
        movl %edi, %eax
        ret

    int x, y, z;
    int compute3() {
        return (x + y) | z;
    }

        movl x(%rip), %eax
        addl y(%rip), %eax
        orl  z(%rip), %eax
        ret
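The constant in compute1's assembly can be checked by hand: 34 + 7 = 41, and 41 | 45 = 45 because every bit set in 41 (binary 101001) is also set in 45 (binary 101101). A short sketch that a C11 compiler verifies at compile time:

```c
/* The compiler can fold this whole body into the constant 45. */
int compute1(void) {
    int x, y, z;
    x = 34; y = 7; z = 45;
    return (x + y) | z;
}

/* With parameters, the add and the or must happen at run time. */
int compute2(int x, int y, int z) {
    return (x + y) | z;
}

/* Compile-time check of the folded value. */
_Static_assert(((34 + 7) | 45) == 45, "compute1 folds to 45");
```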
Adding Up Integers
    uint64_t add64(uint64_t n) {
        uint64_t count = 0;
        uint64_t i;
        for (i = 0; i < n; i++) {
            count += i;
        }
        return count;
    }

icc -O0 (no optimization)
        pushq %rbp                (base pointer)
        movq  %rsp, %rbp          (new base ptr)
        subq  $32, %rsp           (stack frame)
        movq  %rdi, -16(%rbp)     (store n)
        xorl  %eax, %eax          (count = 0)
        movq  %rax, -32(%rbp)     (store count)
        movq  %rax, -24(%rbp)     (i = 0)
        movq  -16(%rbp), %rax     (read n)
        movq  -24(%rbp), %rdx     (read i)
        cmpq  %rax, %rdx          (i < n)
        jae   ..B2.4
    ..B2.3:
        movq  -24(%rbp), %rax     (read i)
        addq  -32(%rbp), %rax     (count += i)
        movq  %rax, -32(%rbp)     (store count)
        addq  $1, -24(%rbp)       (i++)
        movq  -16(%rbp), %rax     (read n)
        movq  -24(%rbp), %rdx     (read i)
        cmpq  %rax, %rdx          (i < n)
        jb    ..B2.3
    ..B2.4:
        movq  -32(%rbp), %rax     (return count)
        leave                     (undo base pointer stuff)
        ret
running time n = 400, iters = 2000000 is 1419 ms
Naïve Compilation Strategy
• Local variables
– Call frame (rsp points to bottom, rbp to top)
– Each local variable stored in call frame
• while loop pattern
    while (p) { c; }
  goes to
    <instructions to evaluate p>
    j<p false> loopExitLabel
    loopLabel:
    <instructions for c>
    <instructions to evaluate p>
    j<p true> loopLabel
    loopExitLabel:
• for loop pattern
– for (initcode; p; nextcode) { code } goes to
– initcode; while (p) { code; nextcode }
• If then else pattern
    if (p) { ctrue; } else { cfalse; }
  goes to
    <instructions to evaluate p>
    j<p false> elseLabel
    <instructions for ctrue>
    j endLabel
    elseLabel:
    <instructions for cfalse>
    endLabel:
• expression evaluation
    uint64_t add64(uint64_t n) { uint64_t count = 0; uint64_t i; for (i = 0; i < n; i++) count += i; return count; }

icc -O1 (some optimization)
        movq  %rdi, %rcx     (read n)
        xorl  %eax, %eax     (count = 0)
        xorl  %edx, %edx     (i = 0)
        testq %rcx, %rcx     (n <= 0)
        jbe   ..B17.5
    ..B17.3:
        addq  %rdx, %rax     (count += i)
        addq  $1, %rdx       (i++)
        cmpq  %rcx, %rdx     (i < n)
        jb    ..B17.3
    ..B17.5:
        ret                  (return count)
running time n = 400, iters = 2000000 is 485 ms
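Both listings follow the same loop shape: test the condition once before entering, then test again at the bottom of the body. That shape can be written in C with `goto`, which maps almost one-to-one onto the jumps. The function names `add64_structured` and `add64_lowered` are hypothetical, and the label names echo the while-loop pattern.

```c
#include <stdint.h>

/* Structured form, as in the source program. */
uint64_t add64_structured(uint64_t n) {
    uint64_t count = 0;
    for (uint64_t i = 0; i < n; i++)
        count += i;
    return count;
}

/* Lowered form: one conditional jump guards entry, a second one at the
   bottom of the body closes the loop. */
uint64_t add64_lowered(uint64_t n) {
    uint64_t count = 0;
    uint64_t i = 0;
    if (!(i < n)) goto loopExitLabel;   /* j<p false> loopExitLabel */
loopLabel:
    count += i;                         /* body */
    i++;                                /* nextcode */
    if (i < n) goto loopLabel;          /* j<p true> loopLabel */
loopExitLabel:
    return count;
}
```

Placing the test at the bottom means each iteration executes one conditional jump rather than a conditional jump plus an unconditional jump back to the top.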
Basic Optimization Concepts
• Allocate values in registers (eliminates excess traffic to/from memory)
• Optimize naïve procedure call linkage
• Why not always optimize?
– Complicates the compiler (optimizers sometimes have bugs)
– Complicates debugging (not sure where values are, computations are sometimes reordered)
Implementing Arrays
• Arrays are just blocks of memory – Static array – allocated by linker/loader
– Dynamic array – allocated by malloc
– Local array – allocated on stack
• Array/pointer equivalence
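Array/pointer equivalence means that `a[i]` and `*(a + i)` name the same memory location. A small sketch, with hypothetical function names, that makes the scaling by element size explicit:

```c
#include <stddef.h>
#include <stdint.h>

/* Index form: the compiler scales i by sizeof(uint64_t). */
uint64_t get_by_index(const uint64_t *a, size_t i) {
    return a[i];
}

/* Pointer form: a + i advances by i elements, not by i bytes. */
uint64_t get_by_pointer(const uint64_t *a, size_t i) {
    return *(a + i);
}

/* Byte distance between element 0 and element i. */
ptrdiff_t byte_offset(const uint64_t *a, size_t i) {
    return (const char *)(a + i) - (const char *)a;
}
```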
Building An Array
uint64_t * makeArrayList64(uint64_t x) {
int i;
uint64_t *a = (uint64_t *) malloc(sizeof(uint64_t) * x);
for (i = 0; i < x; i++) {
a[i] = i;
}
return a;
}
Adding Up Array Elements
uint64_t addArrayList64(uint64_t *a, uint64_t n) {
uint64_t c = 0;
for (uint64_t i = 0; i < n; i++) {
c += a[i];
}
return c;
}
    uint64_t addArrayList64(uint64_t *a, int n) { uint64_t c = 0; for (int i = 0; i < n; i++) c += a[i]; return c; }

    addArrayList64:
        # parameter 1: %rdi
        # parameter 2: %esi
        movslq %esi, %rcx               (n)
        xorl   %eax, %eax               (c = 0)
        xorl   %edx, %edx               (i = 0)
        testq  %rcx, %rcx               (n <= 0)
        jle    ..B15.5
    ..B15.3:
        addq   (%rdi,%rdx,8), %rax      (c += a[i])
        addq   $1, %rdx                 (i++)
        cmpq   %rcx, %rdx               (i < n)
        jl     ..B15.3
    ..B15.5:
        ret                             (return c)
running time n = 400, iters = 2000000 is 378 ms (unoptimized 2231 ms)
Different Kinds of Arrays
int a[200]; /* global */
    int f(int *b /* whatever */) {
        int c[200]; /* local */
        int count;
        int i = 0;
        for (i = 0; i < 200; i++) {
            c[i] = a[i]+b[i];
        }
        for (i = 0; i < 200; i++) {
            count += c[i];
        }
        return count;
    }
        subq $808, %rsp             (allocating c on stack)
        xorl %ecx, %ecx             (a and c offset)
        xorl %edx, %edx             (b offset)
    ..B1.2:
        movl a(%rcx), %eax          (a[i])
        addl (%rdx,%rdi), %eax      (a[i]+b[i])
        addq $4, %rdx
        movl %eax, (%rsp,%rcx)      (c[i] = a[i]+b[i])
        addq $4, %rcx
        cmpq $800, %rcx
        jl   ..B1.2
        xorl %edx, %edx
    ..B1.4:
        addl (%rsp,%rdx,4), %eax
        addq $1, %rdx
        cmpq $200, %rdx
        jl   ..B1.4
        addq $808, %rsp
        ret
More Arrays
    int x[N];
    int y[N];
    int z[N];
    void add() {
        int i = 0;
        for (i = 0; i < N; i++) {
            x[i] = y[i] + z[i];
        }
    }

N = 4000, Iters = 2000000 (8,000,000,000 + ops)
• icc -O0: 23783 ms
• icc -O1: 6319 ms
• icc -O2: 3681 ms
• Factor of 6.5 speedup
icc -O0
        pushq  %rbp
        movq   %rsp, %rbp
        subq   $16, %rsp
        xorl   %eax, %eax
        movl   %eax, -16(%rbp)
        movl   %eax, -16(%rbp)
        movl   -16(%rbp), %eax
        cmpl   $4000, %eax
        jge    ..B1.4
    ..B1.3:
        movl   -16(%rbp), %eax
        movslq %eax, %rax
        movl   -16(%rbp), %edx
        movslq %edx, %rdx
        movl   z(,%rdx,4), %edx
        addl   y(,%rax,4), %edx
        movl   -16(%rbp), %eax
        movslq %eax, %rax
        movl   %edx, x(,%rax,4)
        addl   $1, -16(%rbp)
        movl   -16(%rbp), %eax
        cmpl   $4000, %eax
        jl     ..B1.3
    ..B1.4:
        leave
        ret
icc –O1
int x[N];
int y[N];
int z[N];
void add() {
int i = 0;
for (i = 0; i < N; i++) {
x[i] = y[i] + z[i];
}
}
xorl %edx, %edx
..B1.2:
movl y(%rdx), %eax
addl z(%rdx), %eax
movl %eax, x(%rdx)
addq $4, %rdx
cmpq $16000, %rdx
jl ..B1.2
ret
icc -O2
        xorl   %eax, %eax
    ..B1.2:
        movdqa y(%rax), %xmm0
        paddd  z(%rax), %xmm0
        movdqa 16+y(%rax), %xmm1
        paddd  16+z(%rax), %xmm1
        movdqa 32+y(%rax), %xmm2
        paddd  32+z(%rax), %xmm2
        movdqa 48+y(%rax), %xmm3
        paddd  48+z(%rax), %xmm3
        movdqa 64+y(%rax), %xmm4
        paddd  64+z(%rax), %xmm4
        movdqa 80+y(%rax), %xmm5
        paddd  80+z(%rax), %xmm5
        movdqa 96+y(%rax), %xmm6
        paddd  96+z(%rax), %xmm6
        movdqa 112+y(%rax), %xmm7
        paddd  112+z(%rax), %xmm7
        movdqa %xmm0, x(%rax)
        movdqa %xmm1, 16+x(%rax)
        movdqa %xmm2, 32+x(%rax)
        movdqa %xmm3, 48+x(%rax)
        movdqa %xmm4, 64+x(%rax)
        movdqa %xmm5, 80+x(%rax)
        movdqa %xmm6, 96+x(%rax)
        movdqa %xmm7, 112+x(%rax)
        addq   $128, %rax
        cmpq   $16000, %rax
        jl     ..B1.2
    ..B1.3:
        ret
XMM Stuff
• SIMD instructions – operate on (small) vectors
• 16 128-bit XMM registers
– 2 64-bit values
– 4 32-bit values
• Instructions operate on multiple values
– movdqa y(%rax), %xmm0 (moves 4 32-bit ints to %xmm0)
– paddd z(%rax), %xmm0 (adds 4 32-bit ints in z to corresponding 4 32-bit ints in %xmm0)
Even More Arrays
    char x[N];
    char y[N];
    char z[N];
    void add() {
        int i = 0;
        for (i = 0; i < N; i++) {
            x[i] = y[i] + z[i];
        }
    }

N = 16000, Iters = 2000000 (32,000,000,000 + ops)
• icc -O0: 89550 ms (versus 23783 ms for 4000 ints)
• icc -O1: 21440 ms (versus 6319 ms for 4000 ints)
• icc -O2: 3635 ms (versus 3681 ms for 4000 ints)
• Factor of 24 speedup (versus 6.5 speedup)
Implementing Structs
• Structs are just blocks of memory –struct { char x; int i; double d; } s;
• Fields stored next to each other
• Alignment issues
• Like arrays, can have static, dynamic, local structs
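The alignment issue can be observed with `offsetof`. The sketch below names the struct `sample` (a hypothetical tag; the slide's struct is anonymous) and reports how many padding bytes the compiler inserts before each field. The exact numbers depend on the platform ABI.

```c
#include <stddef.h>

/* Same field order as the struct on the slide. */
struct sample {
    char x;
    int i;
    double d;
};

/* Padding bytes inserted between x and i so that i is aligned. */
size_t padding_before_i(void) {
    return offsetof(struct sample, i) - sizeof(char);
}

/* Padding bytes inserted between i and d so that d is aligned. */
size_t padding_before_d(void) {
    return offsetof(struct sample, d)
         - (offsetof(struct sample, i) + sizeof(int));
}
```

On a typical x86-64 Linux ABI one would expect the fields at offsets 0, 4, and 8 with a total size of 16 bytes, that is, three bytes of padding after x and none before d; other platforms may differ.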
Building A Linked List
    typedef struct list64 {
        uint64_t value;
        struct list64 *next;
    } node64;

    struct list64 * makeLinkedList64(int l) {
        int i;
        struct list64 *n;
        struct list64 *f;
        n = 0;
        f = 0;
        for (i = l-1; i >= 0; i--) {
            n = (struct list64 *) malloc(sizeof(struct list64));
            n->next = f;
            n->value = i;
            f = n;
        }
        return f;
    }
Assembly for Building Linked List
        pushq  %r12              (callee save)
        pushq  %rbp              (callee save)
        pushq  %rsi              (callee save)
        xorl   %r12d, %r12d      (f = 0)
        movslq %edi, %rbp        (l)
        addq   $-1, %rbp         (i = l-1)
        testq  %rbp, %rbp        (i >= 0)
        jl     ..B14.6
    ..B14.3:
        movl   $16, %edi
        call   malloc            (malloc(16))
    ..B14.4:
        movq   %r12, 8(%rax)     (n->next = f)
        movq   %rbp, (%rax)      (n->value = i)
        movq   %rax, %r12        (f = n)
        addq   $-1, %rbp         (i = i - 1)
        testq  %rbp, %rbp        (i >= 0)
        jge    ..B14.3
    ..B14.6:
        movq   %r12, %rax        (return f)
        popq   %rcx
        popq   %rbp
        popq   %r12
        ret
Counting Elements in Linked List
    typedef struct list64 {
        uint64_t value;
        struct list64 *next;
    } node64;

    uint64_t addLinkedList64(struct list64 *n) {
        uint64_t count = 0;
        struct list64 *l;
        for (l = n; l != 0; l = l->next) {
            count = count + l->value;
        }
        return count;
    }
xorl %eax, %eax (count = 0)
testq %rdi, %rdi (l == 0)
je ..B13.5
..B13.3:
addq (%rdi), %rax (count = count +l->value)
movq 8(%rdi), %rdi (l = l-> next)
testq %rdi, %rdi (l == 0)
jne ..B13.3
ret
running time n = 400, iters = 2000000 is 885 ms (unoptimized 3321 ms)
Procedure Calls, Recursion
int fib(int n) {
if (n == 0) return 1;
if (n == 1) return 1; return (fib(n-1) + fib(n-2));
}
int fib(int n) { if (n == 0) return 1; if (n == 1) return 1; return (fib(n‐1) + fib(n‐2)); }
        pushq %rbp
        pushq %rbx
        pushq %rsi
        movl  %edi, %ebp
        testl %ebp, %ebp
        je    ..B2.4
        cmpl  $1, %ebp
        je    ..B2.4
        lea   -1(%rbp), %edi
        call  fib
        movl  %eax, %ebx
        addl  $-2, %ebp
        movl  %ebp, %edi
        call  fib
        addl  %eax, %ebx
        movl  %ebx, %eax
        popq  %rcx
        popq  %rbx
        popq  %rbp
        ret
    ..B2.4:
        movl  $1, %eax
        popq  %rcx
        popq  %rbx
        popq  %rbp
        ret
fib(43) running time 7231‐7401 ms
int fib(int n) { if (n == 0) return 1; if (n == 1) return 1; return (fib(n‐1) + fib(n‐2)); }
        pushq %rbp
        pushq %rbx
        pushq %rsi
        testl %edi, %edi
        je    ..B2.4
        cmpl  $1, %edi
        je    ..B2.4
        movl  %edi, %ebp
        lea   -1(%rbp), %edi
        call  fib
        movl  %eax, %ebx
        addl  $-2, %ebp
        movl  %ebp, %edi
        call  fib
        addl  %eax, %ebx
        movl  %ebx, %eax
        popq  %rcx
        popq  %rbx
        popq  %rbp
        ret
    ..B2.4:
        movl  $1, %eax
        popq  %rcx
        popq  %rbx
        popq  %rbp
        ret
fib(43) running time 5502‐5533 ms
int fib(int n) { if (n == 0) return 1; if (n == 1) return 1; return (fib(n‐1) + fib(n‐2)); }
        pushq %rbp
        pushq %rbx
        pushq %rsi
        testl %edi, %edi
        je    ..B2.4
        cmpl  $1, %edi
        je    ..B2.4
        movl  %edi, %ebp
        lea   -1(%rbp), %edi
        call  fib
        movl  %eax, %ebx
        addl  $-2, %ebp
        movl  %ebp, %edi
        call  fib
        addl  %ebx, %eax
        popq  %rcx
        popq  %rbx
        popq  %rbp
        ret
    ..B2.4:
        movl  $1, %eax
        popq  %rcx
        popq  %rbx
        popq  %rbp
        ret
fib(43) running time 5519‐5539 ms
int fib(int n) { if (n == 0) return 1; if (n == 1) return 1; return (fib(n‐1) + fib(n‐2)); }
        pushq %rbp
        pushq %rbx
        testl %edi, %edi
        je    ..B2.4
        cmpl  $1, %edi
        je    ..B2.4
        movl  %edi, %ebp
        lea   -1(%rbp), %edi
        call  fib
        movl  %eax, %ebx
        addl  $-2, %ebp
        movl  %ebp, %edi
        call  fib
        addl  %ebx, %eax
        popq  %rbx
        popq  %rbp
        ret
    ..B2.4:
        movl  $1, %eax
        popq  %rbx
        popq  %rbp
        ret
fib(43) running time 5184-5195 ms
int fib(int n) { if (n == 0) return 1; if (n == 1) return 1; return (fib(n‐1) + fib(n‐2)); }
        pushq %rbp
        pushq %rbx
        testl %edi, %edi
        je    ..B2.4
        cmpl  $1, %edi
        je    ..B2.4
        addl  $-1, %edi
        movl  %edi, %ebp
        call  fib
        movl  %eax, %ebx
        addl  $-1, %ebp
        movl  %ebp, %edi
        call  fib
        addl  %ebx, %eax
        popq  %rbx
        popq  %rbp
        ret
    ..B2.4:
        movl  $1, %eax
        popq  %rbx
        popq  %rbp
        ret
fib(43) running time 5151‐5175 ms
int fib(int n) { if (n == 0) return 1; if (n == 1) return 1; return (fib(n‐1) + fib(n‐2)); }
        testl %edi, %edi
        je    ..B2.4
        cmpl  $1, %edi
        je    ..B2.4
        addl  $-1, %edi
        pushq %rdi
        call  fib
        popq  %rdi
        pushq %rax
        addl  $-1, %edi
        call  fib
        popq  %rbi
        addl  %ebi, %eax
        ret
    ..B2.4:
        movl  $1, %eax
        ret
fib(43) running time 5004‐5020 ms
int fib(int n) { if (n == 0) return 1; if (n == 1) return 1; return (fib(n‐1) + fib(n‐2)); }
        cmpl  $2, %edi
        je    ..B2.4
        addl  $-1, %edi
        pushq %rdi
        call  fib
        popq  %rdi
        pushq %rax
        addl  $-1, %edi
        call  fib
        popq  %rbi
        addl  %ebi, %eax
        ret
    ..B2.4:
        movl  $1, %eax
        ret
fib(43) running time 4539‐4696 ms
Summary Mapping (so far)
C -> X86
• Parameters -> Passed in registers (then on call stack if too many)
• Procedure call/return -> Call stack stored in memory, grows down
– return address on stack
– return value in register (RAX)
– Caller vs. callee save registers
• Local variables -> Registers or Activation Record
• Arrays and Structures -> Blocks of memory (static, dynamic, local)
More Summary Mapping (so far)
C -> X86
• Arithmetic Expressions -> Sequences of x86 instructions
• Control flow (loops, if then else) -> Evaluate condition, set status flags, jump appropriately
• Procedures -> Instructions in memory
More C/C++ Constructs
• Arrays of structs – how to lay out?
• Function pointers
• Bit fields in arrays
• Objects, virtual function tables
• Memory management techniques
What You Should Know
• Understand mapping from C to x86
• C state to x86 state
– Variables (locals, globals, parameters) in registers or memory
– Arrays and structs as blocks of memory
• Dynamic vs static allocation
• Heap vs stack allocation (both dynamic allocation)
– Pointers as memory addresses
• C computation to x86 computation
– Expression evaluation as sequence of instructions
• Operand addressing modes (register, memory), computations
• Vector instructions and associated registers
– Control flow (while, for, if then else) patterns with jump instructions
• Procedure call linkage
– Call stack, stack pointer (rsp), frame pointer (rbp)
– call, ret instructions
– Parameters in registers or stack, return value in register
– Caller and callee save registers
• Be able to read assembler that compiler generates
MIT OpenCourseWare
http://ocw.mit.edu
6.172 Performance Engineering of Software Systems
Fall 2009
For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.