
From C to Machine Code/Assembler

Outline

• Project 1

• Finish the last lecture

• C to Assembly

Project 1 – Results

Anagram

FlipCount

correct: 34/48

Nqueens

correct: 42/48

correct: 41/48

Project 1 – Meeting the Master

• Set for the meeting?
– You should have already contacted the master
– You should have already sent your code & write-up
– Do you have a meeting scheduled yet?

• Prepare for the meeting!
– “Opening statement”

• Look for insights to help with the returnin
– See if you can come close to the best performance
– Make sure you understand the sharing policy

Bentley’s Rules

• Modifying Data • Modifying Code

– Loop Rules – Logic Rules – Procedure Rules – Expression Rules

• Compile‐time Initialization • Common Subexpression Elimination • Pairing Computation

– Parallelism Rules

© Saman Amarasinghe 2009

Compile-Time Initialization

• If a value is a constant, make it a compile-time constant.
– Save the effort of calculation
– Allow value inlining
– More optimization opportunities

Example

Before:
……..
vol = 2 * pi() * r * r;

After:
#define PI 3.14159265358979
#define R 12
……..
vol = 2 * PI * R * R;
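A minimal C sketch of the rule above, assuming the slide's formula. The functions `pi()`, `vol_runtime()` and `vol_constant()`, and the names `TABLE_SIZE` and `VOL`, are hypothetical and only serve to contrast the two forms.

```c
#include <assert.h>

/* Compile-time constants, as on the slide. */
#define PI 3.14159265358979
#define R  12

/* Hypothetical run-time version, for contrast only: pi() is a call, r a variable. */
double pi(void) { return 3.14159265358979; }
double vol_runtime(double r) { return 2 * pi() * r * r; }

/* Every operand is a constant, so the compiler may fold the product to one literal. */
double vol_constant(void) { return 2 * PI * R * R; }

/* Constant expressions are also allowed where a run-time value is not. */
enum { TABLE_SIZE = R * R };          /* usable as an array size or case label */
const double VOL = 2 * PI * R * R;    /* usable as a static initializer */
static_assert(TABLE_SIZE == 144, "R * R is evaluated by the compiler");
```

Whether the run-time version is actually slower depends on the compiler; with inlining enabled it may be folded as well, so this is an illustration of the idea rather than a measurement.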


Common Subexpression Elimination

• If the same expression is evaluated twice, do it only once
• Viability?
– Expression has no side effects
– The expression value does not change between the evaluations
– The cost of keeping a copy is amortized by the complexity of the expression
– Too complicated for the compiler to do it automatically

Example

Before:
x = sin(a) * sin(a);

After:
double tmp;
…
tmp = sin(a);
x = tmp * tmp;
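The same transformation can be sketched in self-contained C. Here `costly()` is a hypothetical stand-in for an expensive, side-effect-free function such as sin(a); the call counter `eval_calls` is instrumentation added only to make the saving visible and is not part of the slide.

```c
#include <assert.h>

/* Instrumentation only: counts how often the costly function runs. */
unsigned long eval_calls = 0;

/* Hypothetical expensive function: a few Taylor terms of sin(a). */
double costly(double a) {
    eval_calls++;
    double term = a, sum = a;
    for (int k = 1; k <= 8; k++) {
        term *= -a * a / ((2 * k) * (2 * k + 1));
        sum += term;
    }
    return sum;
}

/* Before: the same subexpression is evaluated twice. */
double square_twice(double a) { return costly(a) * costly(a); }

/* After: evaluate once and keep the value in a temporary. */
double square_once(double a) {
    double tmp = costly(a);
    return tmp * tmp;
}

/* Usage sketch: same result, half as many evaluations. */
void cse_demo(void) {
    eval_calls = 0;
    double x1 = square_twice(0.5);
    unsigned long calls_before = eval_calls;
    eval_calls = 0;
    double x2 = square_once(0.5);
    assert(x1 == x2 && calls_before == 2 && eval_calls == 1);
}
```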


Pairing Computation

• If two similar functions are called with the same arguments close to each other on many occasions, combine them.
– Reduce call overhead
– Possibility of sharing the computation cost
– More optimization possibilities

Example

Before:
x = r * cos(a);
y = r * sin(a);

After:
typedef struct twodouble {
    double d1;
    double d2;
} twodouble;
….
twodouble dd;
dd = sincos(a);
x = r * dd.d1;
y = r * dd.d2;
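A runnable sketch of the same rule, using a different pair of functions so that it needs no math library: minimum and maximum of an array. The names `int_range`, `array_min`, `array_max` and `array_minmax` are hypothetical; the paired version plays the role of the slide's sincos.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result pair, analogous to the slide's twodouble. */
typedef struct { int lo; int hi; } int_range;

/* Separate functions: two calls and two passes over the same data. */
int array_min(const int *x, size_t n) {
    assert(n > 0);
    int m = x[0];
    for (size_t i = 1; i < n; i++) if (x[i] < m) m = x[i];
    return m;
}

int array_max(const int *x, size_t n) {
    assert(n > 0);
    int m = x[0];
    for (size_t i = 1; i < n; i++) if (x[i] > m) m = x[i];
    return m;
}

/* Paired: one call, one pass; loop overhead and loads are shared. */
int_range array_minmax(const int *x, size_t n) {
    assert(n > 0);
    int_range r = { x[0], x[0] };
    for (size_t i = 1; i < n; i++) {
        if (x[i] < r.lo) r.lo = x[i];
        if (x[i] > r.hi) r.hi = x[i];
    }
    return r;
}

/* Usage sketch: the paired call agrees with the two separate calls. */
void pairing_demo(void) {
    const int v[] = { 3, -1, 7, 2 };
    int_range r = array_minmax(v, 4);
    assert(r.lo == array_min(v, 4) && r.hi == array_max(v, 4));
}
```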

Bentley’s Rules

• Modifying Data • Modifying Code

– Loop Rules – Logic Rules – Procedure Rules – Expression Rules – Parallelism Rules

• Exploit Implicit Parallelism • Exploit Inner Loop Parallelism • Exploit Coarse Grain Parallelism • Extra computation to create parallelism


Implicit Parallelism

• Reduce the loop carried dependences so that “software pipelining” can execute a compact schedule without stalls.

• Example:

Before:
xmax = MININT;
for(i=0; i < N; i++)
    if(X[i] > xmax) xmax = X[i];

After:
xmax1 = MININT;
xmax2 = MININT;
for(i=0; i < N - 1; i += 2) {
    if(X[i] > xmax1) xmax1 = X[i];
    if(X[i+1] > xmax2) xmax2 = X[i+1];
}
if((i < N) && (X[i] > xmax1)) xmax1 = X[i];
xmax = (xmax1 > xmax2) ? xmax1 : xmax2;


Example 2

Before:
curr = head;
tot = 0;
while(curr != NULL) {
    tot = tot + curr->val;
    curr = curr->next;
}
return tot;

After (each node also stores a nextnext pointer; also see Rule A.1.a Data Structure Augmentation):
curr = head;
if(curr == NULL) return 0;
tot1 = 0;
tot2 = 0;
while(curr != NULL && curr->next != NULL) {
    tot1 = tot1 + curr->val;
    tot2 = tot2 + curr->next->val;
    curr = curr->nextnext;
}
if(curr)
    tot1 = tot1 + curr->val;
return tot1 + tot2;
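A self-contained sketch of the augmented list, assuming a node layout with a hypothetical `nextnext` field. The helper names `link_nextnext`, `sum_pairs` and `sum_first_n` are illustrative only. The loop tests `curr` as well as `curr->next`, so lists of even length terminate cleanly.

```c
#include <assert.h>
#include <stddef.h>

typedef struct node {
    long val;
    struct node *next;
    struct node *nextnext;   /* augmentation: points two nodes ahead */
} node;

/* Fill in nextnext for every node (extra work done when the list is built). */
void link_nextnext(node *head) {
    for (node *c = head; c != NULL; c = c->next)
        c->nextnext = c->next ? c->next->next : NULL;
}

/* Two independent accumulators shorten the loop-carried dependence chain. */
long sum_pairs(const node *head) {
    const node *curr = head;
    long tot1 = 0, tot2 = 0;
    while (curr != NULL && curr->next != NULL) {
        tot1 += curr->val;
        tot2 += curr->next->val;
        curr = curr->nextnext;
    }
    if (curr != NULL) tot1 += curr->val;   /* odd length: one node left over */
    return tot1 + tot2;
}

/* Hypothetical helper: builds the list 1..n in a local array (n <= 16) and sums it. */
long sum_first_n(int n) {
    node pool[16];
    assert(n >= 0 && n <= 16);
    for (int i = 0; i < n; i++) {
        pool[i].val = i + 1;
        pool[i].next = (i + 1 < n) ? &pool[i + 1] : NULL;
        pool[i].nextnext = NULL;
    }
    link_nextnext(n ? pool : NULL);
    return sum_pairs(n ? pool : NULL);
}
```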

Exploit Inner Loop Parallelism

• Facilitate inner loop vectorization (for SSE type instructions)

• How? By gingerly guiding the compiler to do so
– Iterative process: look at why the loop is not vectorized and fix those issues
– Most of the rules above can be used to simplify the loop so that the compiler can vectorize it


Exploit Coarse Grain Parallelism

• Outer loop parallelism (doall and doacross loops)

• Task parallelism

• Ideal for multicores

• You need to do the parallelization yourself (covered in later lectures)


Extra Computation to Create Parallelism

• In many cases doing a little more work (or a slower algorithm) can make a sequential program a parallel one. Parallel execution may amortize the cost.
• Example:

Before:
double tot;
tot = 0;
for(i = 0; i < N; i++)
    for(j = 0; j < N; j++)
        tot = tot + A[i][j];

After:
double tot;
double tottmp[N];
for(i = 0; i < N; i++)
    tottmp[i] = 0;
for(i = 0; i < N; i++) { //parallelizable
    double tmp = 0;
    for(j = 0; j < N; j++)
        tmp = tmp + A[i][j];
    tottmp[i] = tottmp[i] + tmp;
}
tot = 0;
for(i = 0; i < N; i++)
    tot = tot + tottmp[i];

Bentley’s Rules

• A Modifying Data

1. Space for Time a. Data Structure Augmentation b. Storing Precomputed Results c. Caching d. Lazy Evaluation

2. Time for Space a. Packing/Compression b. Interpreters

3. Space and Time a. SIMD

• B Modifying Code

1. Loop Rules

a. Loop Invariant Code Motion b. Sentinel Loop Exit Test c. Loop Elimination by Unrolling d. Partial Loop Unrolling e. Loop fusion f. Eliminate wasted iterations

2. Logic Rules a. Exploit Algebraic Identities b. Short Circuit Monotone functions c. Reordering tests d. Precompute Logic Functions e. Boolean Variable Elimination

3. Procedure Rules a. Collapse Procedure Hierarchies b. Coroutines c. Tail Recursion Elimination

4. Expression Rules a. Compile‐time Initialization b. Common Subexpression Elimination c. Pairing Computation

5. Parallelism Rules a. Exploit Implicit Parallelism b. Exploit Inner Loop Parallelism c. Exploit Coarse Grain Parallelism d. Extra computation to create parallelism


FROM C TO MACHINE CODE/ASSEMBLER

Generic Machine Model

• Processor – Registers – Functional units (arithmetic, logical operations) – Instruction execution and coordination – Generates memory accesses (instructions and data)

• Memory Hierarchy – Registers – 1st level cache – 2nd level cache – Main memory – Instruction versus data caches

• Executes a stream of instructions

Three Layers

• Source Code (C, C++, Java) • Machine Code • Hardware

• Today’s topic: Source Code (C) ‐> Machine Code • Responsibility of compiler • Goals:

– Understand how compiler implements C constructs using x86 constructs

– Can read machine code (assembler form) – Can hack machine code generated by compiler – Can write own machine code from scratch if necessary

Assembly example

        .section .rodata
.LC0:
0000 6572726F7200      .string "error"
        .text
        .globl fact
fact:
0000 55                pushq %rbp
0001 4889E5            movq %rsp, %rbp
0004 4883EC10          subq $16, %rsp
0008 897DFC            movl %edi, -4(%rbp)
000b 837DFC00          cmpl $0, -4(%rbp)
000f 7911              jns .L2
0011 BF00000000        movl $.LC0, %edi
0016 B800000000        movl $0, %eax
001b E800000000        call printf
0020 EB22              jmp .L3
.L2:
0022 837DFC00          cmpl $0, -4(%rbp)
0026 7509              jne .L4
0028 C745F801000000    movl $1, -8(%rbp)
002f EB13              jmp .L3
.L4:
0031 8B7DFC            movl -4(%rbp), %edi
0034 FFCF              decl %edi
0036 E800000000        call fact
003b 0FAF45FC          imull -4(%rbp), %eax
003f 8945F8            movl %eax, -8(%rbp)
0042 EB00              jmp .L1
.L3:
0044 8B45F8            movl -8(%rbp), %eax
0047 C9                leave
0048 C3                ret

X86-64 Machine Model

• Flat 64 bit address space
– Data sizes: bytes, words, doublewords, quadwords, double quadwords
• 64 bit registers
– RAX, RBX, RCX, RDX, RSI, RDI, RSP, RBP, R8-R15
• 32 bit registers
– EAX, EBX, ECX, EDX, ESI, EDI, ESP, EBP, R8D-R15D
– Aliased with bottom 32 bits of corresponding 64 bit registers
• 16 bit registers
– AX, BX, CX, DX, SI, DI, SP, BP, R8W-R15W
– Aliased with bottom 16 bits of corresponding 32 bit registers
• 8 bit registers (look in the manual for these)
• Status flags (CF - carry, ZF - zero, SF - sign, OF - overflow)
• RIP (instruction pointer)

Arithmetic and Logic Unit

• Performs most of the data operations
• Has the form:
OP <oprnd1>, <oprnd2>
– <oprnd2> = <oprnd2> OP <oprnd1>
or
OP <oprnd1>

• Operands are:
– Immediate value: $25
– Register: %rax
– Memory: 4(%rbp)

• Operations are:
– Arithmetic operations (add, sub, imul)
– Logical operations (and, sal)
– Unary operations (inc, dec)

Control

• Unconditional Branches
– Fetch the next instruction from a different location
– Unconditional jump to an address
jmp .L32
– Unconditional jump to an address in a register
jmp *%rax
– To handle procedure calls
call fact
call *%r11

Control

• All arithmetic operations update the condition codes (rFLAGS)
• Compare explicitly sets the rFLAGS
– cmp $0, %rax
• Conditional jumps on the rFLAGS
– Jxx .L32
– Jxx 4(%rbp)
– Examples:
  • JO Jump if overflow
  • JC Jump if carry
  • JAE Jump if above or equal
  • JZ Jump if zero
  • JNE Jump if not equal

Registers

Figure: x86-64 register set.
• General-Purpose Registers (GPRs), bits 63-0: RAX, RBX, RCX, RDX, RBP, RSI, RDI, RSP, R8-R15
• Multimedia Extension and Floating-Point Registers, bits 63-0: MM0/ST0-MM7/ST7
• Flags Register, bits 31-0: EFLAGS
• Instruction Pointer, bits 63-0: RIP
• Streaming SIMD Extension (SSE) Registers, bits 127-0: XMM0-XMM15

Figure by MIT OpenCourseWare.

Registers – Calling Convention

64 bit   32 bit   16 bit   8 bit       Use
%rax     %eax     %ax      %ah %al     Return value
%rbx     %ebx     %bx      %bh %bl     Callee saved
%rcx     %ecx     %cx      %ch %cl     4th argument
%rdx     %edx     %dx      %dh %dl     3rd argument
%rsi     %esi     %si      %sil        2nd argument
%rdi     %edi     %di      %dil        1st argument
%rbp     %ebp     %bp      %bpl        Callee saved
%rsp     %esp     %sp      %spl        Stack pointer
%r8      %r8d     %r8w     %r8b        5th argument
%r9      %r9d     %r9w     %r9b        6th argument
%r10     %r10d    %r10w    %r10b       Callee saved
%r11     %r11d    %r11w    %r11b       Used for linking
%r12     %r12d    %r12w    %r12b       Unused for C
%r13     %r13d    %r13w    %r13b       Callee saved
%r14     %r14d    %r14w    %r14b       Callee saved
%r15     %r15d    %r15w    %r15b       Callee saved

Figure by MIT OpenCourseWare.
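The aliasing of the narrower register names can be modelled in plain C with masks and shifts. This is only a sketch: the `view_*` and `write_*` helpers are hypothetical names, and the zero-extension on 32 bit writes is a property of x86-64 that the figure itself does not show.

```c
#include <assert.h>
#include <stdint.h>

/* Reading a narrower name returns low-order bits of the 64 bit register. */
uint32_t view_eax(uint64_t rax) { return (uint32_t)rax; }         /* bits 31..0 */
uint16_t view_ax(uint64_t rax)  { return (uint16_t)rax; }         /* bits 15..0 */
uint8_t  view_al(uint64_t rax)  { return (uint8_t)rax; }          /* bits 7..0  */
uint8_t  view_ah(uint64_t rax)  { return (uint8_t)(rax >> 8); }   /* bits 15..8 */

/* Writing %eax clears bits 63..32 of %rax (32 bit results are zero-extended). */
uint64_t write_eax(uint64_t rax, uint32_t v) { (void)rax; return v; }

/* Writing %ax or %al leaves the remaining upper bits of %rax unchanged. */
uint64_t write_ax(uint64_t rax, uint16_t v) { return (rax & ~UINT64_C(0xFFFF)) | v; }
uint64_t write_al(uint64_t rax, uint8_t v)  { return (rax & ~UINT64_C(0xFF)) | v; }

/* Usage sketch with one sample register value. */
void register_views_demo(void) {
    uint64_t rax = UINT64_C(0x1122334455667788);
    assert(view_eax(rax) == 0x55667788u);
    assert(view_ax(rax) == 0x7788u);
    assert(view_ah(rax) == 0x77u && view_al(rax) == 0x88u);
    assert(write_eax(rax, 1) == 1);
    assert(write_ax(rax, 0xABCD) == UINT64_C(0x112233445566ABCD));
    assert(write_al(rax, 0xEE) == UINT64_C(0x11223344556677EE));
}
```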

Memory

• Flat Address Space
– composed of words
– byte addressable

• Need to store
– Program
– Local variables
– Global variables and data
– Stack
– Heap

Memory

Figure: address-space layout, listed from high to low addresses.
0x800 0000 0000
Dynamic: Stack, Heap
Data: Globals / Read-only data
Text: Program
0x40 0000
Unmapped
0x0

Read‐Only Data

• All Read‐Only data in the text segment

• Integers – uses load immediate

• Strings – uses the .string macro

        .section .text
        .globl main
main:
        enter $0, $0
        movq  $5, x(%rip)
        mov   x(%rip), %rsi
        mov   $.msg, %rdi
        call  printf
        leave
        ret
.msg:
        .string "Five: %d\n"

Global Variables

• Allocation: uses the .comm directive
• Uses PC relative addressing
– %rip is the current instruction address
– x(%rip) adds to %rip the offset from the current instruction location to the space for x in the data segment
– Creates easily relocatable binaries

        .section .text
        .globl main
main:
        enter $0, $0
        movq  $5, x(%rip)
        mov   x(%rip), %rsi
        mov   $.msg, %rdi
        call  printf
        leave
        ret

        .comm x, 8

.comm name, size, alignment
The .comm directive allocates storage in the data section. The storage is referenced by the identifier name. Size is measured in bytes and must be a positive integer. Name cannot be predefined. Alignment is optional. If alignment is specified, the address of name is aligned to a multiple of alignment.

The Stack

• Grows from top to bottom

• Call frames keep procedure specific information

• Calling convention

8*n+16(%rbp)    argument n
…
16(%rbp)        argument 7
8(%rbp)         return address
0(%rbp)         previous %rbp
-8(%rbp)        local 0
…
-8*m-8(%rbp)    local m
0(%rsp)         variable size

Frame labels in the figure: Previous, Current.

Procedure Linkages

Standard procedure linkage

procedure p: prolog, pre-call, post-return, epilog
procedure q: prolog, epilog

Procedure has
• standard prolog
• standard epilog

Each call involves a
• pre-call sequence
• post-return sequence

X86‐64 Calling Convention

• RSP points to procedure call stack in memory – call instruction pushes RIP on stack, jumps to call target operand (address of procedure)

– ret instruction pops RIP from stack, returns to caller

– stack grows down

• Software conventions – Caller‐save registers (r10, r11)

– Callee‐save registers (rbx, rbp, r12‐r15)

Stack

• Calling: Caller
– Assume %rcx is live and is caller save
– Call foo(A, B, C, D, E, F, G, H, I)
  • A to I are at -8(%rbp) to -72(%rbp)

push %rcx
push -72(%rbp)
push -64(%rbp)
push -56(%rbp)
mov -48(%rbp), %r9
mov -40(%rbp), %r8
mov -32(%rbp), %rcx
mov -24(%rbp), %rdx
mov -16(%rbp), %rsi
mov -8(%rbp), %rdi
call foo

Figure: caller's stack after the call, from rbp down to rsp: return address, previous frame pointer, callee saved registers, local variables, stack temporaries, dynamic area, caller saved registers, argument 9, argument 8, argument 7, return address.

Stack

• Calling: Callee
– Assume %rbx is used in the function and is callee save
– Assume 40 bytes are required for locals

foo:
push %rbp
mov %rsp, %rbp
sub $48, %rsp
mov %rbx, -8(%rbp)

(enter $48, $0 can replace the push/mov/sub sequence)

Stack

• Arguments
• Call foo(A, B, C, D, E, F, G, H, I)
– Passed in by pushing before the call

push -72(%rbp)
push -64(%rbp)
push -56(%rbp)
mov -48(%rbp), %r9
mov -40(%rbp), %r8
mov -32(%rbp), %rcx
mov -24(%rbp), %rdx
mov -16(%rbp), %rsi
mov -8(%rbp), %rdi
call foo

– Access A to F via registers
  • or put them in local memory
– Access rest using 16+xx(%rbp)

mov 16(%rbp), %rax
mov 24(%rbp), %r10

Stack

• Locals and Temporaries
– Calculate the size and allocate space on the stack

sub $48, %rsp
or
enter $48, 0

– Access using -8-xx(%rbp)

mov -28(%rbp), %r10
mov %r11, -20(%rbp)

Stack

• Returning: Callee
– Assume the return value is the first temporary
– Restore the callee saved register
– Put the return value in %rax
– Tear down the call stack

mov -8(%rbp), %rbx
mov -16(%rbp), %rax
mov %rbp, %rsp
pop %rbp
ret

(leave can replace the mov %rbp, %rsp and pop %rbp pair)

Stack

• Returning: Caller
– Assume the return value goes to the first temporary
– Restore the stack to reclaim the argument space
– Restore the caller save registers

call foo
add $24, %rsp
pop %rcx
mov %rax, 8(%rbp)
…

X86‐64 Addressing Modes

• Immediate – addl $-1, %edi
• Register – addl %ebx, %eax
• Address – addq (%rdi), %rax
• Base+(Index*Scale)+Displacement
– Base, Index are values in registers
– Scale is 1, 2, 4, or 8
– Displacement is 8, 16, or 32 bit value
– addq (%rdi,%rdx,8), %rax
• RIP + 32 bit Displacement – movl x(%rip), %eax
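The Base+(Index*Scale)+Displacement form can be written out as arithmetic. The sketch below assumes a flat address space in which pointers convert to integers in the obvious way; `effective_address` and `addressing_demo` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Effective address = base + index * scale + displacement. */
uint64_t effective_address(uint64_t base, uint64_t index, unsigned scale, int32_t disp) {
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    return base + index * scale + (uint64_t)(int64_t)disp;   /* disp is sign-extended */
}

/* Usage sketch: (%rdi,%rdx,8) names the same location as a[i] for 8 byte elements. */
void addressing_demo(void) {
    uint64_t a[8] = { 0 };
    uint64_t base = (uint64_t)(uintptr_t)a;
    for (uint64_t i = 0; i < 8; i++)
        assert(effective_address(base, i, 8, 0) == (uint64_t)(uintptr_t)&a[i]);
    /* -8(%reg) style access: base register plus a negative displacement */
    assert(effective_address(base + 8, 0, 1, -8) == base);
}
```

This is why array indexing with element sizes of 1, 2, 4 or 8 bytes usually needs no separate multiply instruction.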

Translating Expressions

int compute1() {
    int x, y, z;
    x = 34; y = 7; z = 45;
    return (x + y) | z;
}

movl $45, %eax
ret

int compute2(int x, int y, int z) {
    return (x + y) | z;
}

# parameter 1: %edi
# parameter 2: %esi
# parameter 3: %edx
addl %esi, %edi
orl %edx, %edi
movl %edi, %eax
ret

int x, y, z;
int compute3() {
    return (x + y) | z;
}

movl x(%rip), %eax
addl y(%rip), %eax
orl z(%rip), %eax
ret

Adding Up Integers

uint64_t add64(uint64_t n) {
    uint64_t count = 0;
    uint64_t i;
    for (i = 0; i < n; i++) {
        count += i;
    }
    return count;
}

icc -O0 (no optimization)

pushq %rbp               (base pointer)
movq %rsp, %rbp          (new base ptr)
subq $32, %rsp           (stack frame)
movl %edi, -16(%rbp)     (store n)
xorl %rax, %rax          (count = 0)
movq %rax, -32(%rbp)     (store count)
movq %rax, -24(%rbp)     (i = 0)
movl -16(%rbp), %rax     (read n)
movq -24(%rbp), %rdx     (read i)
cmpq %rax, %rdx          (i < n)
jae ..B2.4
..B2.3:
movq -24(%rbp), %rax     (read i)
addq -32(%rbp), %rax     (count += i)
movq %rax, -32(%rbp)     (store count)
addq $1, -24(%rbp)       (i++)
movl -16(%rbp), %rax     (read n)
movq -24(%rbp), %rdx     (read i)
cmpq %rax, %rdx          (i < n)
jb ..B2.3
..B2.4:
movq -32(%rbp), %rax     (return count)
leave                    (undo base pointer stuff)
ret

running time n = 400, iters = 2000000 is 1419 ms

Naïve Compilation Strategy

• Local variables
– Call frame (rsp points to bottom, rbp to top)
– Each local variable stored in call frame
• while loop pattern
• for loop pattern
– for (initcode; p; nextcode) { code } goes to
– initcode; while (p) { code; nextcode }
• If then else pattern
• Expression evaluation

while (p) { c; }

<instructions to evaluate p>
j<p false> loopExitLabel
loopLabel:
<instructions for c>
<instructions to evaluate p>
j<p true> loopLabel
loopExitLabel:

if (p) { ctrue; } else { cfalse; }

<instructions to evaluate p>
j<p false> elseLabel
<instructions for ctrue>
j endLabel
elseLabel:
<instructions for cfalse>
endLabel:

uint64_t add64(uint64_t n) {
    uint64_t count = 0;
    uint64_t i;
    for (i = 0; i < n; i++)
        count += i;
    return count;
}

icc -O1 (some optimization)

movl %edi, %rcx      (read n)
xorl %eax, %eax      (count = 0)
xorl %edx, %edx      (i = 0)
testq %rcx, %rcx     (n <= 0)
jbe ..B17.5
..B17.3:
addq %rdx, %rax      (count += i)
addq $1, %rdx        (i++)
cmpq %rcx, %rdx      (i < n)
jb ..B17.3
..B17.5:
ret                  (return count)

running time n = 400, iters = 2000000 is 485 ms
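The loop shape in both listings (test once before the loop, then test again at the bottom and jump back) is the while-loop pattern of the naïve compilation strategy. It can be sketched in C with explicit labels and goto. `add64_lowered` and `lowering_demo` are hypothetical names; the labels mirror loopLabel and loopExitLabel from the pattern.

```c
#include <assert.h>
#include <stdint.h>

/* Source form, as on the slide. */
uint64_t add64(uint64_t n) {
    uint64_t count = 0;
    uint64_t i;
    for (i = 0; i < n; i++)
        count += i;
    return count;
}

/* Same function with the for loop lowered by hand to compare-and-jump form. */
uint64_t add64_lowered(uint64_t n) {
    uint64_t count = 0;
    uint64_t i = 0;                    /* initcode */
    if (!(i < n)) goto loopExitLabel;  /* evaluate p, jump if false */
loopLabel:
    count += i;                        /* code */
    i++;                               /* nextcode */
    if (i < n) goto loopLabel;         /* evaluate p, jump if true */
loopExitLabel:
    return count;
}

/* Usage sketch: both forms agree for small inputs. */
void lowering_demo(void) {
    for (uint64_t n = 0; n < 50; n++)
        assert(add64(n) == add64_lowered(n));
}
```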

Basic Optimization Concepts

• Allocate values in registers (eliminates excess traffic to/from memory)

• Optimize naïve procedure call linkage

• Why not always optimize?
– Complicates compiler (sometimes have bugs)
– Complicates debugging (not sure where values are) (sometimes reorders computations)

Implementing Arrays

• Arrays are just blocks of memory – Static array – allocated by linker/loader

– Dynamic array – allocated by malloc

– Local array – allocated on stack

• Array/pointer equivalence
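A short sketch of the three kinds of arrays and of array/pointer equivalence. All names here (`global_arr`, `sum_index`, `sum_pointer`, `arrays_demo`) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

uint64_t global_arr[4] = { 10, 20, 30, 40 };   /* static: placed by linker/loader */

/* Indexing form: a[i]. */
uint64_t sum_index(const uint64_t *a, size_t n) {
    assert(a != NULL || n == 0);
    uint64_t c = 0;
    for (size_t i = 0; i < n; i++) c += a[i];
    return c;
}

/* Pointer form: *p walks the same block of memory. */
uint64_t sum_pointer(const uint64_t *a, size_t n) {
    uint64_t c = 0;
    for (const uint64_t *p = a; p != a + n; p++) c += *p;
    return c;
}

/* Returns 1 if the equivalences hold, 0 if malloc failed or a check did not hold. */
int arrays_demo(void) {
    uint64_t local_arr[4] = { 1, 2, 3, 4 };              /* local: on the stack */
    uint64_t *heap_arr = malloc(4 * sizeof *heap_arr);   /* dynamic: from malloc */
    if (heap_arr == NULL) return 0;
    for (size_t i = 0; i < 4; i++)
        *(heap_arr + i) = local_arr[i] + global_arr[i];  /* *(a + i) is a[i] */
    int ok = &heap_arr[2] == heap_arr + 2
          && sum_index(heap_arr, 4) == sum_pointer(heap_arr, 4)
          && sum_index(heap_arr, 4) == 110;
    free(heap_arr);
    return ok;
}
```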

Building An Array

uint64_t * makeArrayList64(uint64_t x) {

int i;

uint64_t *a = (uint64_t *) malloc(sizeof(uint64_t) * x);

for (i = 0; i < x; i++) {

a[i] = i;

}

return a;

}

Adding Up Array Elements

uint64_t addArrayList64(uint64_t *a, uint64_t n) {

uint64_t c = 0;

for (uint64_t i = 0; i < n; i++) {

c += a[i];

}

return c;

}

uint64_t addArrayList64(uint64_t *a, int n) {
    uint64_t c = 0;
    for (int i = 0; i < n; i++)
        c += a[i];
    return c;
}

addArrayList64:
# parameter 1: %rdi
# parameter 2: %esi
movslq %esi, %rcx               (n)
xorl %eax, %eax                 (c = 0)
xorl %edx, %edx                 (i = 0)
testq %rcx, %rcx                (n <= 0)
jle ..B15.5
..B15.3:
addq (%rdi,%rdx,8), %rax        (c += a[i])
addq $1, %rdx                   (i++)
cmpq %rcx, %rdx                 (i < n)
jl ..B15.3
..B15.5:
ret                             (return c)

running time n = 400, iters = 2000000 is 378 ms (unoptimized 2231 ms)

Different Kinds of Arrays

int a[200]; /* global */

int f(int *b /* whatever */) {
    int c[200]; /* local */
    int count;
    int i = 0;
    for (i = 0; i < 200; i++) {
        c[i] = a[i]+b[i];
    }
    for (i = 0; i < 200; i++) {
        count += c[i];
    }
    return count;
}

subq $808, %rsp              (allocating c on stack)
xorl %ecx, %ecx              (a and c offset)
xorl %edx, %edx              (b offset)
..B1.2:
movl a(%rcx), %eax           (a[i])
addl (%rdx,%rdi), %eax       (a[i]+b[i])
addq $4, %rdx
movl %eax, (%rsp,%rcx)       (c[i] = a[i]+b[i])
addq $4, %rcx
cmpq $800, %rcx
jl ..B1.2
xorl %edx, %edx
..B1.4:
addl (%rsp,%rdx,4), %eax
addq $1, %rdx
cmpq $200, %rdx
jl ..B1.4
addq $808, %rsp
ret

More Arrays

int x[N];
int y[N];
int z[N];
void add() {
    int i = 0;
    for (i = 0; i < N; i++) {
        x[i] = y[i] + z[i];
    }
}

N = 4000, Iters = 2000000 (8,000,000,000 + ops)

• icc -O0: 23783 ms
• icc -O1: 6319 ms
• icc -O2: 3681 ms
• Factor of 6.5 speedup

icc -O0

pushq %rbp
movq %rsp, %rbp
subq $16, %rsp
xorl %eax, %eax
movl %eax, -16(%rbp)
movl %eax, -16(%rbp)
movl -16(%rbp), %eax
cmpl $4000, %eax
jge ..B1.4
..B1.3:
movl -16(%rbp), %eax
movslq %eax, %rax
movl -16(%rbp), %edx
movslq %edx, %rdx
movl z(,%rdx,4), %edx
addl y(,%rax,4), %edx
movl -16(%rbp), %eax
movslq %eax, %rax
movl %edx, x(,%rax,4)
addl $1, -16(%rbp)
movl -16(%rbp), %eax
cmpl $4000, %eax
jl ..B1.3
..B1.4:
leave
ret

icc -O1

xorl %edx, %edx
..B1.2:
movl y(%rdx), %eax
addl z(%rdx), %eax
movl %eax, x(%rdx)
addq $4, %rdx
cmpq $16000, %rdx
jl ..B1.2
ret

icc -O2

xorl %eax, %eax
..B1.2:
movdqa y(%rax), %xmm0
paddd z(%rax), %xmm0
movdqa 16+y(%rax), %xmm1
paddd 16+z(%rax), %xmm1
movdqa 32+y(%rax), %xmm2
paddd 32+z(%rax), %xmm2
movdqa 48+y(%rax), %xmm3
paddd 48+z(%rax), %xmm3
movdqa 64+y(%rax), %xmm4
paddd 64+z(%rax), %xmm4
movdqa 80+y(%rax), %xmm5
paddd 80+z(%rax), %xmm5
movdqa 96+y(%rax), %xmm6
paddd 96+z(%rax), %xmm6
movdqa 112+y(%rax), %xmm7
paddd 112+z(%rax), %xmm7
movdqa %xmm0, x(%rax)
movdqa %xmm1, 16+x(%rax)
movdqa %xmm2, 32+x(%rax)
movdqa %xmm3, 48+x(%rax)
movdqa %xmm4, 64+x(%rax)
movdqa %xmm5, 80+x(%rax)
movdqa %xmm6, 96+x(%rax)
movdqa %xmm7, 112+x(%rax)
addq $128, %rax
cmpq $16000, %rax
jl ..B1.2
..B1.3:
ret

XMM Stuff

• SIMD instructions – operate on (small) vectors
• 16 128-bit XMM registers
– 2 64-bit values
– 4 32-bit values
• Instructions operate on multiple values
– movdqa y(%rax), %xmm0 (moves 4 32-bit ints to %xmm0)
– paddd z(%rax), %xmm0 (adds 4 32-bit ints in z to corresponding 4 32-bit ints in %xmm0)
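What movdqa and paddd do to four 32-bit lanes can be modelled in portable scalar C. This is a model of the semantics, not vector code: `vec4i`, `load4`, `add4`, `store4` and `add_arrays` are hypothetical names, and a real compiler would use the XMM instructions directly.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one XMM register holding four 32-bit integers. */
typedef struct { int32_t lane[4]; } vec4i;

/* Models movdqa mem, %xmm: load four consecutive ints. */
vec4i load4(const int32_t *p) {
    vec4i v;
    for (int k = 0; k < 4; k++) v.lane[k] = p[k];
    return v;
}

/* Models paddd: lane-wise add; each lane wraps around independently. */
vec4i add4(vec4i a, vec4i b) {
    vec4i r;
    for (int k = 0; k < 4; k++)
        r.lane[k] = (int32_t)((uint32_t)a.lane[k] + (uint32_t)b.lane[k]);
    return r;
}

/* Models movdqa %xmm, mem: store four consecutive ints. */
void store4(int32_t *p, vec4i v) {
    for (int k = 0; k < 4; k++) p[k] = v.lane[k];
}

/* x[i] = y[i] + z[i], four elements per step, as in the vectorized loop. */
void add_arrays(int32_t *x, const int32_t *y, const int32_t *z, int n) {
    assert(n % 4 == 0);
    for (int i = 0; i < n; i += 4)
        store4(x + i, add4(load4(y + i), load4(z + i)));
}

/* Usage sketch on eight elements. */
void simd_model_demo(void) {
    int32_t x[8], y[8], z[8];
    for (int i = 0; i < 8; i++) { y[i] = i; z[i] = 10 * i; }
    add_arrays(x, y, z, 8);
    for (int i = 0; i < 8; i++) assert(x[i] == 11 * i);
}
```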

Even More Arrays

char x[N];
char y[N];
char z[N];
void add() {
    int i = 0;
    for (i = 0; i < N; i++) {
        x[i] = y[i] + z[i];
    }
}

N = 16000, Iters = 2000000 (32,000,000,000 + ops)

• icc -O0: 89550 ms (versus 23783 ms for 4000 ints)
• Factor of 24 speedup (versus 6.5 speedup)

Implementing Structs

• Structs are just blocks of memory –struct { char x; int i; double d; } s;

• Fields stored next to each other

• Alignment issues

• Like arrays, can have static, dynamic, local structs
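The alignment issue can be inspected with offsetof. The sketch below uses the struct from the slide under the hypothetical tag `example`; the checks hold on any conforming compiler, while the concrete numbers depend on the ABI (on typical x86-64 Linux the offsets are 0, 4 and 8 and the size is 16).

```c
#include <assert.h>
#include <stddef.h>

/* The struct from the slide, given a tag so it can be named. */
struct example { char x; int i; double d; };

/* Bytes of padding the compiler inserted between x and i. */
size_t padding_after_x(void) {
    return offsetof(struct example, i) - sizeof(char);
}

/* Usage sketch: fields are stored in order, each at a suitably aligned offset. */
void struct_layout_demo(void) {
    struct example s = { 'a', 1, 2.0 };
    assert((char *)&s.x == (char *)&s);
    assert((char *)&s.i == (char *)&s + offsetof(struct example, i));
    assert(offsetof(struct example, i) % _Alignof(int) == 0);
    assert(offsetof(struct example, d) % _Alignof(double) == 0);
    assert(sizeof(struct example) >= sizeof(char) + sizeof(int) + sizeof(double));
}
```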

Building A Linked List

typedef struct list64 {
    uint64_t value;
    struct list64 *next;
} node64;

struct list64 * makeLinkedList64(int l) {
    int i;
    struct list64 *n;
    struct list64 *f;
    n = 0;
    f = 0;
    for (i = l-1; i >= 0; i--) {
        n = (struct list64 *) malloc(sizeof(struct list64));
        n->next = f;
        n->value = i;
        f = n;
    }
    return f;
}

Assembly for Building Linked List

pushq %r12               (callee save)
pushq %rbp               (callee save)
pushq %rsi               (callee save)
xorl %r12d, %r12d        (f = 0)
movslq %edi, %rbp        (l)
addq $-1, %rbp           (i = l-1)
testq %rbp, %rbp         (i >= 0)
jl ..B14.6
..B14.3:
movl $16, %edi           (malloc(16))
call malloc
..B14.4:
movq %r12, 8(%rax)       (n->next = f)
movq %rbp, (%rax)        (n->value = i)
movq %rax, %r12          (f = n)
addq $-1, %rbp           (i = i - 1)
testq %rbp, %rbp         (i >= 0)
jge ..B14.3
..B14.6:
movq %r12, %rax          (return f)
popq %rcx
popq %rbp
popq %r12
ret

Counting Elements in Linked List

typedef struct list64 {
    uint64_t value;
    struct list64 *next;
} node64;

uint64_t addLinkedList64(struct list64 *n) {
    uint64_t count = 0;
    struct list64 *l;
    for (l = n; l != 0; l = l->next) {
        count = count + l->value;
    }
    return count;
}

xorl %eax, %eax          (count = 0)
testq %rdi, %rdi         (l == 0)
je ..B13.5
..B13.3:
addq (%rdi), %rax        (count = count + l->value)
movq 8(%rdi), %rdi       (l = l->next)
testq %rdi, %rdi         (l == 0)
jne ..B13.3
..B13.5:
ret

running time n = 400, iters = 2000000 is 885 ms (unoptimized 3321 ms)

Procedure Calls, Recursion

int fib(int n) {
    if (n == 0) return 1;
    if (n == 1) return 1;
    return (fib(n-1) + fib(n-2));
}

pushq %rbp
pushq %rbx
pushq %rsi
movl %edi, %ebp
testl %ebp, %ebp
je ..B2.4
cmpl $1, %ebp
je ..B2.4
lea -1(%rbp), %edi
call fib
movl %eax, %ebx
addl $-2, %ebp
movl %ebp, %edi
call fib
addl %eax, %ebx
movl %ebx, %eax
popq %rcx
popq %rbx
popq %rbp
ret
..B2.4:
movl $1, %eax
popq %rcx
popq %rbx
popq %rbp
ret

fib(43) running time 7231-7401 ms


Successive hand-modified versions of the fib assembly follow. Lines lost from the slides are marked with ….

Version 2:

pushq %rbp
pushq %rbx
pushq %rsi
testl %edi, %edi
je ..B2.4
cmpl $1, %edi
je ..B2.4
movl %edi, %ebp
lea …
…
addl %eax, %ebx
movl %ebx, %eax
popq %rcx
popq %rbx
popq %rbp
ret
..B2.4:
movl $1, %eax
popq %rcx
…

Version 3:

pushq %rbp
pushq %rbx
pushq %rsi
testl %edi, %edi
je ..B2.4
cmpl $1, %edi
je ..B2.4
movl %edi, %ebp
lea …
…
addl %ebx, %eax
popq %rcx
popq %rbx
popq %rbp
ret
..B2.4:
movl $1, %eax
popq %rcx
…

Version 4:

pushq %rbp
pushq %rbx
testl %edi, %edi
je ..B2.4
cmpl $1, %edi
je ..B2.4
movl %edi, %ebp
lea -1(%rbp), %edi
call fib
movl %eax, %ebx
addl $-2, %ebp
movl %ebp, %edi
call fib
addl %ebx, %eax
popq %rbx
popq %rbp
ret
..B2.4:
movl $1, %eax
popq %rbx
popq %rbp
ret

Version 5:

pushq %rbp
pushq %rbx
testl %edi, %edi
je ..B2.4
cmpl $1, %edi
je ..B2.4
addl $-1, %edi
movl %edi, %ebp
call fib
movl %eax, %ebx
addl $-1, %ebp
movl %ebp, %edi
call fib
addl %ebx, %eax
popq %rbx
popq %rbp
ret
..B2.4:
movl $1, %eax
popq %rbx
popq %rbp
ret

Version 6:

testl %edi, %edi
je ..B2.4
cmpl $1, %edi
je ..B2.4
addl $-1, %edi
pushq %rdi
call fib
popq %rdi
pushq %rax
addl $-1, %edi
call fib
popq %rbi
addl %ebi, %eax
ret
..B2.4:
movl $1, %eax
ret

Version 7:

cmpl $2, %edi
je ..B2.4
addl $-1, %edi
pushq %rdi
call fib
popq %rdi
pushq %rax
addl $-1, %edi
call fib
popq %rbi
addl %ebi, %eax
ret
..B2.4:
movl $1, %eax
ret

Summary Mapping (so far)

• C: Parameters
  X86: Passed in registers (then on call stack if too many)

• C: Procedure call/return
  X86: Call stack stored in memory, grows down
– return address on stack
– return value in register (RAX)
– Caller vs. callee save registers

• C: Local variables
  X86: Registers or Activation Record

• C: Arrays and Structures
  X86: Blocks of memory (static, dynamic, local)

More Summary Mapping (so far)

• C: Arithmetic Expressions
  X86: Sequences of x86 instructions

• C: Control flow (Loops, If then else)
  X86: Evaluate condition, set status flags; jump appropriately

• C: Procedures
  X86: Instructions in memory

More C/C++ Constructs

• Arrays of structs – how to lay out?

• Function pointers

• Bit fields in arrays

• Objects, virtual function tables

• Memory management techniques

What You Should Know

• Understand mapping from C to x86
• C state to x86 state
– Variables (locals, globals, parameters) in registers or memory
– Arrays and structs as blocks of memory
  • Dynamic vs static allocation
  • Heap vs stack allocation (both dynamic allocation)
– Pointers as memory addresses
• C computation to x86 computation
– Expression evaluation as sequence of instructions
  • Operand addressing modes (register, memory), computations
  • Vector instructions and associated registers
– Control flow (while, for, if then else) patterns with jump instructions
• Procedure call linkage
– Call stack, stack pointer (rsp), frame pointer (rbp)
– call, ret instructions
– Parameters in registers or stack, return value in register
– Caller and callee save registers
• Be able to read assembler that compiler generates

MIT OpenCourseWare
http://ocw.mit.edu

6.172 Performance Engineering of Software Systems
Fall 2009

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
