from c to machine code/assembler · procedure rules a. collapse procedure hierarchies b. coroutines...

From C to Machine Code/Assembler


Page 1: From C to Machine Code/Assembler · Procedure Rules a. Collapse Procedure Hierarchies b. Coroutines c. Tail Recursion Elimination ... • All arithmetic operations update the condition

From C to Machine Code/Assembler


Outline

• Project 1

• Finish the last lecture

• C to Assembly


Project 1 – Results

Anagram, FlipCount

correct: 34/48

Nqueens

correct: 42/48

correct: 41/48


Project 1 – Meeting the Master

• Set for the meeting?
  – You should have already contacted the master
  – You should have already sent your code & write-up
  – Do you have a meeting scheduled yet?

• Prepare for the meeting!
  – “Opening statement”

• Look for insights to help with the returnin
  – See if you can come close to the best performance
  – Make sure you understand the sharing policy


Bentley’s Rules

• Modifying Data
• Modifying Code
  – Loop Rules
  – Logic Rules
  – Procedure Rules
  – Expression Rules
    • Compile-time Initialization
    • Common Subexpression Elimination
    • Pairing Computation
  – Parallelism Rules

© Saman Amarasinghe 2009


Compile-Time Initialization

• If a value is a constant, make it a compile-time constant.
  – Save the effort of calculation
  – Allow value inlining
  – More optimization opportunities

Example

Before:
    ……..
    vol = 2 * pi() * r * r;

After:
    #define PI 3.14159265358979
    #define R 12
    ……..
    vol = 2 * PI * R * R;
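The rule can be sketched in C as below. This is only an illustration: pi_at_runtime() is a hypothetical stand-in for the slide's pi() call, and vol_runtime and vol_constant are names invented here.

```c
#include <assert.h>

#define PI 3.14159265358979
#define R  12

/* Hypothetical run-time helper standing in for the slide's pi(). */
static double pi_at_runtime(void) { return 3.14159265358979; }

/* Before: the value is obtained through a call each time it is needed. */
double vol_runtime(double r) {
    assert(r >= 0.0);
    return 2 * pi_at_runtime() * r * r;
}

/* After: every operand is a compile-time constant, so the compiler is
   free to fold the whole expression into a single literal. */
double vol_constant(void) {
    return 2 * PI * R * R;
}
```

Both functions compute the same value for r = 12; the second gives the compiler nothing left to compute at run time.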


Common Subexpression Elimination

• If the same expression is evaluated twice, do it only once

• Viability?
  – Expression has no side effects
  – The expression value does not change between the evaluations
  – The cost of keeping a copy is amortized by the complexity of the expression
  – Too complicated for the compiler to do it automatically

Example

Before:
    x = sin(a) * sin(a);

After:
    double tmp;
    tmp = sin(a);
    x = tmp * tmp;
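A minimal sketch of the same transformation, assuming a hypothetical pure function expensive() in place of sin() so that the example does not depend on the math library:

```c
#include <assert.h>

/* Hypothetical stand-in for sin(): no side effects, and costly enough
   that evaluating it twice is wasteful. */
static double expensive(double a) {
    double acc = 0.0;
    for (int k = 1; k <= 1000; k++)
        acc += a / k;
    return acc;
}

/* Before: the same subexpression is evaluated twice. */
double square_twice(double a) {
    return expensive(a) * expensive(a);
}

/* After: evaluate once and keep the value in a temporary. */
double square_once(double a) {
    double tmp = expensive(a);
    return tmp * tmp;
}
```

Because expensive() has no side effects and its argument does not change between the two uses, both versions return the same value.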


Pairing Computation

• If two similar functions are called with the same arguments close to each other on many occasions, combine them.
  – Reduce call overhead
  – Possibility of sharing the computation cost
  – More optimization possibilities

Example

Before:
    x = r * cos(a);
    y = r * sin(a);

After:
    typedef struct twodouble {
      double d1;
      double d2;
    } twodouble;
    ….
    twodouble dd;
    dd = sincos(a);
    x = r * dd.d1;
    y = r * dd.d2;
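The C standard library already contains one instance of this rule: div() from <stdlib.h> returns quotient and remainder together in a div_t struct. The sketch below uses it instead of the slide's sincos(); the function names split_separate and split_paired are invented for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Before: two separate operations on the same arguments. */
void split_separate(int total_seconds, int *minutes, int *seconds) {
    *minutes = total_seconds / 60;
    *seconds = total_seconds % 60;
}

/* After: one paired call returns both results in a struct, in the same
   spirit as the slide's twodouble returned by sincos(). */
void split_paired(int total_seconds, int *minutes, int *seconds) {
    div_t qr = div(total_seconds, 60);
    *minutes = qr.quot;
    *seconds = qr.rem;
}
```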


Bentley’s Rules

• Modifying Data
• Modifying Code
  – Loop Rules
  – Logic Rules
  – Procedure Rules
  – Expression Rules
  – Parallelism Rules
    • Exploit Implicit Parallelism
    • Exploit Inner Loop Parallelism
    • Exploit Coarse Grain Parallelism
    • Extra computation to create parallelism


Implicit Parallelism

• Reduce the loop-carried dependences so that “software pipelining” can execute a compact schedule without stalls.

• Example:

Before:
    xmax = MININT;
    for(i=0; i < N; i++)
      if(X[i] > xmax) xmax = X[i];

After:
    xmax1 = MININT;
    xmax2 = MININT;
    for(i=0; i < N - 1; i += 2) {
      if(X[i] > xmax1) xmax1 = X[i];
      if(X[i+1] > xmax2) xmax2 = X[i+1];
    }
    if((i < N) && (X[i] > xmax1)) xmax1 = X[i];
    xmax = (xmax1 > xmax2) ? xmax1 : xmax2;
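A self-contained version of the two-accumulator maximum, as a sketch. It assumes INT_MIN from <limits.h> plays the role of the slide's MININT; the name max_two_chains is invented here.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Maximum of x[0..n-1] using two independent dependence chains. */
int max_two_chains(const int *x, size_t n) {
    assert(x != NULL || n == 0);
    int xmax1 = INT_MIN, xmax2 = INT_MIN;
    size_t i;
    /* Even and odd elements update different accumulators, so the two
       compare-and-update chains do not depend on each other. */
    for (i = 0; i + 1 < n; i += 2) {
        if (x[i] > xmax1) xmax1 = x[i];
        if (x[i + 1] > xmax2) xmax2 = x[i + 1];
    }
    /* Leftover element when n is odd. */
    if (i < n && x[i] > xmax1) xmax1 = x[i];
    return (xmax1 > xmax2) ? xmax1 : xmax2;
}
```

The loop bound is written as i + 1 < n rather than i < n - 1 so that an unsigned n of zero does not wrap around.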


Example 2

[Figure: a linked list whose nodes are connected by next pointers; in the second version each node also carries a nextnext pointer.]

Before:
    curr = head;
    tot = 0;
    while(curr != NULL) {
      tot = tot + curr->val;
      curr = curr->next;
    }
    return tot;

Also see Rule A.1.a Data Structure Augmentation

After:
    curr = head;
    if(curr == NULL) return 0;
    tot1 = 0;
    tot2 = 0;
    while(curr != NULL && curr->next != NULL) {
      tot1 = tot1 + curr->val;
      tot2 = tot2 + curr->next->val;
      curr = curr->nextnext;
    }
    if(curr)
      tot1 = tot1 + curr->val;
    return tot1 + tot2;
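The augmented list can be sketched as follows. The struct layout and the helper link_nextnext() are assumptions made for illustration; the slides only show the traversal.

```c
#include <assert.h>
#include <stddef.h>

struct node {
    int val;
    struct node *next;
    struct node *nextnext;   /* augmentation: skips one node */
};

/* Fill in nextnext for every node of an existing list. */
void link_nextnext(struct node *head) {
    for (struct node *p = head; p != NULL; p = p->next)
        p->nextnext = (p->next != NULL) ? p->next->next : NULL;
}

/* Sum the list with two independent accumulators. */
int sum_two_chains(const struct node *head) {
    int tot1 = 0, tot2 = 0;
    const struct node *curr = head;
    /* curr is tested too, because nextnext is NULL at the end of an
       even-length list. */
    while (curr != NULL && curr->next != NULL) {
        tot1 += curr->val;
        tot2 += curr->next->val;
        curr = curr->nextnext;
    }
    if (curr != NULL)
        tot1 += curr->val;
    return tot1 + tot2;
}
```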


Exploit Inner Loop Parallelism

• Facilitate inner loop vectorization (for SSE type instructions)

• How? By gingerly guiding the compiler to do so
  – An iterative process: look at why the loop is not vectorized and fix those issues
  – Most of the rules above can be used to simplify the loop so that the compiler can vectorize it


Exploit Coarse Grain Parallelism

• Outer loop parallelism (doall and doacross loops)

• Task parallelism

• Ideal for multicores

• You need to do the parallelization yourself (covered in later lectures)


Extra Computation to Create Parallelism

• In many cases doing a little more work (or a slower algorithm) can make a sequential program a parallel one. Parallel execution may amortize the cost.

• Example:

Before:
    double tot;
    tot = 0;
    for(i = 0; i < N; i++)
      for(j = 0; j < N; j++)
        tot = tot + A[i][j];

After:
    double tot;
    double tottmp[N];
    for(i = 0; i < N; i++)
      tottmp[i] = 0;
    for(i = 0; i < N; i++) { //parallelizable
      double tmp = 0;
      for(j = 0; j < N; j++)
        tmp = tmp + A[i][j];
      tottmp[i] = tottmp[i] + tmp;
    }
    tot = 0;
    for(i = 0; i < N; i++)
      tot = tot + tottmp[i];
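A compact, runnable sketch of the row-sum version. The fixed size N = 4 and the name total_by_rows are assumptions for the example; the code is still sequential and only shows that the outer iterations no longer share an accumulator.

```c
#include <assert.h>

#define N 4

/* Sum of all elements, computed as independent per-row partial sums
   followed by a short sequential reduction. */
double total_by_rows(double a[N][N]) {
    double tottmp[N];
    for (int i = 0; i < N; i++) {   /* iterations are independent */
        double tmp = 0.0;
        for (int j = 0; j < N; j++)
            tmp += a[i][j];
        tottmp[i] = tmp;
    }
    double tot = 0.0;               /* the extra work: N more additions */
    for (int i = 0; i < N; i++)
        tot += tottmp[i];
    return tot;
}
```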


Bentley’s Rules

A Modifying Data
  1. Space for Time
     a. Data Structure Augmentation
     b. Storing Precomputed Results
     c. Caching
     d. Lazy Evaluation
  2. Time for Space
     a. Packing/Compression
     b. Interpreters
  3. Space and Time
     a. SIMD

B Modifying Code
  1. Loop Rules
     a. Loop Invariant Code Motion
     b. Sentinel Loop Exit Test
     c. Loop Elimination by Unrolling
     d. Partial Loop Unrolling
     e. Loop Fusion
     f. Eliminate Wasted Iterations
  2. Logic Rules
     a. Exploit Algebraic Identities
     b. Short Circuit Monotone Functions
     c. Reordering Tests
     d. Precompute Logic Functions
     e. Boolean Variable Elimination
  3. Procedure Rules
     a. Collapse Procedure Hierarchies
     b. Coroutines
     c. Tail Recursion Elimination
  4. Expression Rules
     a. Compile-time Initialization
     b. Common Subexpression Elimination
     c. Pairing Computation
  5. Parallelism Rules
     a. Exploit Implicit Parallelism
     b. Exploit Inner Loop Parallelism
     c. Exploit Coarse Grain Parallelism
     d. Extra Computation to Create Parallelism


FROM C TO MACHINE CODE/ASSEMBLER


Generic Machine Model

• Processor
  – Registers
  – Functional units (arithmetic, logical operations)
  – Instruction execution and coordination
  – Generates memory accesses (instructions and data)

• Memory Hierarchy
  – Registers
  – 1st level cache
  – 2nd level cache
  – Main memory
  – Instruction versus data caches

• Executes a stream of instructions


Three Layers

• Source Code (C, C++, Java)
• Machine Code
• Hardware

• Today’s topic: Source Code (C) -> Machine Code
• Responsibility of compiler
• Goals:
  – Understand how compiler implements C constructs using x86 constructs
  – Can read machine code (assembler form)
  – Can hack machine code generated by compiler
  – Can write own machine code from scratch if necessary


Assembly example

                          .section .rodata
.LC0:
0000 6572726F7200         .string "error"
                          .text
                          .globl fact
fact:
0000 55                   pushq %rbp
0001 4889E5               movq %rsp, %rbp
0004 4883EC10             subq $16, %rsp
0008 897DFC               movl %edi, -4(%rbp)
000b 837DFC00             cmpl $0, -4(%rbp)
000f 7911                 jns .L2
0011 BF00000000           movl $.LC0, %edi
0016 B800000000           movl $0, %eax
001b E800000000           call printf
0020 EB22                 jmp .L3
.L2:
0022 837DFC00             cmpl $0, -4(%rbp)
0026 7509                 jne .L4
0028 C745F801000000       movl $1, -8(%rbp)
002f EB13                 jmp .L3
.L4:
0031 8B7DFC               movl -4(%rbp), %edi
0034 FFCF                 decl %edi
0036 E800000000           call fact
003b 0FAF45FC             imull -4(%rbp), %eax
003f 8945F8               movl %eax, -8(%rbp)
0042 EB00                 jmp .L1
.L3:
0044 8B45F8               movl -8(%rbp), %eax
0047 C9                   leave
0048 C3                   ret
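The slides do not show the C source for this listing. Reading the assembly back (a negative argument prints "error", zero yields 1, anything else multiplies n by fact(n - 1)), a plausible reconstruction is sketched below. The value returned on the error path is an assumption: the listing simply returns whatever is in the result slot.

```c
#include <assert.h>
#include <stdio.h>

/* Reconstructed from the listing; not the original source. */
int fact(int n) {
    if (n < 0) {              /* cmpl $0, -4(%rbp) ; jns .L2 */
        printf("error");      /* movl $.LC0, %edi ; call printf */
        return 0;             /* assumption: result unspecified in the listing */
    }
    if (n == 0)               /* cmpl $0, -4(%rbp) ; jne .L4 */
        return 1;             /* movl $1, -8(%rbp) */
    return n * fact(n - 1);   /* decl %edi ; call fact ; imull -4(%rbp), %eax */
}
```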


X86-64 Machine Model

• Flat 64 bit address space
  – (bytes, words, doublewords, quadwords, double quadwords)
• 64 bit registers
  – RAX, RBX, RCX, RDX, RSI, RDI, RSP, RBP, R8-R15
• 32 bit registers
  – EAX, EBX, ECX, EDX, ESI, EDI, ESP, EBP, R8D-R15D
  – Aliased with bottom 32 bits of corresponding 64 bit registers
• 16 bit registers
  – AX, BX, CX, DX, SI, DI, SP, BP, R8W-R15W
  – Aliased with bottom 16 bits of corresponding 32 bit registers
• 8 bit registers (look in the manual for these)
• Status flags (CF - carry, ZF - zero, SF - sign, OF - overflow)
• RIP (instruction pointer)


Arithmetic and Logic Unit

• Performs most of the data operations
• Has the form:
    OP <oprnd1>, <oprnd2>
    – <oprnd2> = <oprnd2> OP <oprnd1>
  or
    OP <oprnd1>
• Operands are:
  – Immediate value: $25
  – Register: %rax
  – Memory: 4(%rbp)
• Operations are:
  – Arithmetic operations (add, sub, imul)
  – Logical operations (and, sal)
  – Unary operations (inc, dec)


Control

• Unconditional branches
  – Fetch the next instruction from a different location
  – Unconditional jump to an address
      jmp .L32
  – Unconditional jump to an address in a register
      jmp *%rax
  – To handle procedure calls
      call fact
      call *%r11


Control

• All arithmetic operations update the condition codes (rFLAGS)
• Compare explicitly sets the rFLAGS
  – cmp $0, %rax
• Conditional jumps on the rFLAGS
  – Jxx .L32
  – Jxx 4(%rbp)
  – Examples:
    • JO   Jump if overflow
    • JC   Jump if carry
    • JAE  Jump if above or equal
    • JZ   Jump if zero
    • JNE  Jump if not equal


Registers

[Figure: the x86-64 register file]
• General-purpose registers (GPRs), bits 63-0: RAX, RBX, RCX, RDX, RBP, RSI, RDI, RSP, R8, R9, R10, R11, R12, R13, R14, R15
• Multimedia extension and floating-point registers, bits 63-0: MM0/ST0, MM1/ST1, MM2/ST2, MM3/ST3, MM4/ST4, MM5/ST5, MM6/ST6, MM7/ST7
• Flags register, bits 31-0: EFLAGS
• Instruction pointer, bits 63-0: RIP
• Streaming SIMD Extension (SSE) registers, bits 127-0: XMM0 to XMM15

Figure by MIT OpenCourseWare.


Registers – Calling Convention

Bits 63-0   Bits 31-0   Bits 15-0   Bits 15-8 / 7-0   Use
%rax        %eax        %ax         %ah / %al         Return value
%rbx        %ebx        %bx         %bh / %bl         Callee saved
%rcx        %ecx        %cx         %ch / %cl         4th argument
%rdx        %edx        %dx         %dh / %dl         3rd argument
%rsi        %esi        %si         %sil              2nd argument
%rdi        %edi        %di         %dil              1st argument
%rbp        %ebp        %bp         %bpl              Callee saved
%rsp        %esp        %sp         %spl              Stack pointer
%r8         %r8d        %r8w        %r8b              5th argument
%r9         %r9d        %r9w        %r9b              6th argument
%r10        %r10d       %r10w       %r10b             Callee saved
%r11        %r11d       %r11w       %r11b             Used for linking
%r12        %r12d       %r12w       %r12b             Unused for C
%r13        %r13d       %r13w       %r13b             Callee saved
%r14        %r14d       %r14w       %r14b             Callee saved
%r15        %r15d       %r15w       %r15b             Callee saved

Figure by MIT OpenCourseWare.


Memory

• Flat address space
  – composed of words
  – byte addressable

• Need to store
  – Program
  – Local variables
  – Global variables and data
  – Stack
  – Heap


Memory

[Figure: memory layout. Regions shown: Stack and Heap (dynamic), Data (globals/read-only data), Text (program), Unmapped. Address marks: 0x800 0000 0000, 0x40 0000, 0x0.]


Read‐Only Data

• All read-only data in the text segment
• Integers
  – uses load immediate
• Strings
  – uses the .string macro

    .section .text
    .globl main
main:
    enter $0, $0
    movq $5, x(%rip)
    mov x(%rip), %rsi
    mov $.msg, %rdi
    call printf
    leave
    ret
.msg:
    .string "Five: %d\n"


Global Variables

• Allocation: uses the .comm directive
• Uses PC-relative addressing
  – %rip is the current instruction address
  – x(%rip) adds the offset from the current instruction location to the space for x in the data segment to %rip
  – Creates easily relocatable binaries

    .section .text
    .globl main
main:
    enter $0, $0
    movq $5, x(%rip)
    mov x(%rip), %rsi
    mov $.msg, %rdi
    call printf
    leave
    ret

    .comm x, 8

.comm name, size, alignment
The .comm directive allocates storage in the data section. The storage is referenced by the identifier name. Size is measured in bytes and must be a positive integer. Name cannot be predefined. Alignment is optional. If alignment is specified, the address of name is aligned to a multiple of alignment.


The Stack

• Grows from top to bottom
• Call frames keep procedure specific information
• Calling convention

Previous frame:
    8*n+16(%rbp)    argument n
    …
    16(%rbp)        argument 7

Current frame:
    8(%rbp)         return address
    0(%rbp)         previous %rbp
    -8(%rbp)        local 0
    …
    -8*m-8(%rbp)    local m
    0(%rsp)         variable size


Procedure Linkages

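Standard procedure linkage

Procedure has
• standard prolog
• standard epilog

Each call involves a
• pre-call sequence
• post-return sequence

[Figure: procedure p consists of prolog, pre-call, post-return, and epilog; the call made between pre-call and post-return enters procedure q, which consists of prolog and epilog.]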


X86‐64 Calling Convention

• RSP points to procedure call stack in memory – call instruction pushes RIP on stack, jumps to call target operand (address of procedure)

– ret instruction pops RIP from stack, returns to caller

– stack grows down

• Software conventions – Caller‐save registers (r10, r11)

– Callee‐save registers (rbx, rbp, r12‐r15)


Stack

• Calling: Caller
  – Assume %rcx is live and is caller save
  – Call foo(A, B, C, D, E, F, G, H, I)
    • A to I are at -8(%rbp) to -72(%rbp)

    push %rcx
    push -72(%rbp)
    push -64(%rbp)
    push -56(%rbp)
    mov -48(%rbp), %r9
    mov -40(%rbp), %r8
    mov -32(%rbp), %rcx
    mov -24(%rbp), %rdx
    mov -16(%rbp), %rsi
    mov -8(%rbp), %rdi
    call foo

[Figure: the caller's stack frame, top to bottom: return address, previous frame pointer (rbp), callee saved registers, local variables, stack temporaries, dynamic area, caller saved registers, argument 9, argument 8, argument 7, return address (rsp).]


Stack

• Calling: Callee
  – Assume %rbx is used in the function and is callee save
  – Assume 40 bytes are required for locals

foo:
    push %rbp
    mov %rsp, %rbp
    sub $48, %rsp
    mov %rbx, -8(%rbp)

  The first three instructions can be replaced by:

    enter $48, $0

[Figure: stack layout of the caller's frame followed by the callee's new frame (previous frame pointer, callee saved registers, local variables, stack temporaries, dynamic area), with rbp and rsp marking the current frame.]


Stack

• Arguments
• Call foo(A, B, C, D, E, F, G, H, I)
  – Passed in by pushing before the call

    push -72(%rbp)
    push -64(%rbp)
    push -56(%rbp)
    mov -48(%rbp), %r9
    mov -40(%rbp), %r8
    mov -32(%rbp), %rcx
    mov -24(%rbp), %rdx
    mov -16(%rbp), %rsi
    mov -8(%rbp), %rdi
    call foo

  – Access A to F via registers
    • or put them in local memory
  – Access rest using 16+xx(%rbp)

    mov 16(%rbp), %rax
    mov 24(%rbp), %r10

[Figure: stack layout of the caller and callee frames, with rbp and rsp marking the current frame.]


Stack

• Locals and Temporaries
  – Calculate the size and allocate space on the stack

    sub $48, %rsp
  or
    enter $48, $0

  – Access using -8-xx(%rbp)

    mov -28(%rbp), %r10
    mov %r11, -20(%rbp)

[Figure: stack layout of the caller and callee frames, with rbp and rsp marking the current frame.]


Stack

• Returning: Callee
  – Assume the return value is the first temporary
  – Restore the callee saved register
  – Put the return value in %rax
  – Tear down the call stack

    mov -8(%rbp), %rbx
    mov -16(%rbp), %rax
    mov %rbp, %rsp
    pop %rbp
    ret

  The mov %rbp, %rsp and pop %rbp pair can be replaced by:

    leave

[Figure: stack layout of the caller and callee frames, with rbp and rsp marking the current frame.]


Stack

• Returning: Caller
  – Assume the return value goes to the first temporary
  – Restore the stack to reclaim the argument space
  – Restore the caller save registers

    call foo
    add $24, %rsp
    pop %rcx
    mov %rax, 8(%rbp)

[Figure: the caller's stack frame after the return, with rbp and rsp marking the current frame.]


X86‐64 Addressing Modes

• Immediate – addl $-1, %edi
• Register – addl %ebx, %eax
• Address – addq (%rdi), %rax
• Base+(Index*Scale)+Displacement
  – Base, Index are values in registers
  – Scale is 1, 2, 4, or 8
  – Displacement is 8, 16, or 32 bit value
  – addq (%rdi,%rdx,8), %rax
• RIP + 32 bit Displacement – movl x(%rip), %eax
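The Base+(Index*Scale) mode is what an array access compiles to. As a rough sketch, the function below spells out the address arithmetic behind addq (%rdi,%rdx,8), %rax in C, with base playing the role of %rdi, index of %rdx, and sizeof(uint64_t) = 8 as the scale. The function names are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Address of a[index], computed by hand as base + index * scale. */
const uint64_t *element_address(const uint64_t *base, size_t index) {
    const unsigned char *bytes = (const unsigned char *)base;
    return (const uint64_t *)(bytes + index * sizeof(uint64_t));
}

/* acc += a[i], the C-level meaning of addq (%rdi,%rdx,8), %rax. */
uint64_t add_element(uint64_t acc, const uint64_t *a, size_t i) {
    return acc + *element_address(a, i);
}
```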


Translating Expressions

int compute1() {
  int x, y, z;
  x = 34; y = 7; z = 45;
  return (x + y) | z;
}

    movl $45, %eax
    ret

int compute2(int x, int y, int z) {
  return (x + y) | z;
}

    # parameter 1: %edi
    # parameter 2: %esi
    # parameter 3: %edx
    addl %esi, %edi
    orl %edx, %edi
    movl %edi, %eax
    ret

int x, y, z;
int compute3() {
  return (x + y) | z;
}

    movl x(%rip), %eax
    addl y(%rip), %eax
    orl z(%rip), %eax
    ret
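The constant 45 in the compute1 listing can be checked by hand: 34 + 7 = 41 = 0b101001, 45 = 0b101101, and their bitwise OR is 0b101101 = 45. The sketch below lets the compiler confirm the same folding with a C11 _Static_assert; it is an illustration added here, not part of the slides.

```c
#include <assert.h>

/* Evaluated entirely at compile time, as in the compute1 listing. */
_Static_assert(((34 + 7) | 45) == 45, "constant folding of compute1");

int compute1(void) {
    int x, y, z;
    x = 34; y = 7; z = 45;
    return (x + y) | z;
}

/* With parameters the values are unknown at compile time, so the
   add and or must be executed at run time. */
int compute2(int x, int y, int z) {
    return (x + y) | z;
}
```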


Adding Up Integers

uint64_t add64(uint64_t n) {
  uint64_t count = 0;
  uint64_t i;
  for (i = 0; i < n; i++) {
    count += i;
  }
  return count;
}


uint64_t add64(uint64_t n) {
  uint64_t count = 0;
  uint64_t i;
  for (i = 0; i < n; i++)
    count += i;
  return count;
}

icc -O0 (no optimization)

    pushq %rbp               (base pointer)
    movq %rsp, %rbp          (new base ptr)
    subq $32, %rsp           (stack frame)
    movl %edi, -16(%rbp)     (store n)
    xorl %rax, %rax          (count = 0)
    movq %rax, -32(%rbp)     (store count)
    movq %rax, -24(%rbp)     (i = 0)
    movl -16(%rbp), %rax     (read n)
    movq -24(%rbp), %rdx     (read i)
    cmpq %rax, %rdx          (i < n)
    jae ..B2.4
..B2.3:
    movq -24(%rbp), %rax     (read i)
    addq -32(%rbp), %rax     (count += i)
    movq %rax, -32(%rbp)     (store count)
    addq $1, -24(%rbp)       (i++)
    movl -16(%rbp), %rax     (read n)
    movq -24(%rbp), %rdx     (read i)
    cmpq %rax, %rdx          (i < n)
    jb ..B2.3
..B2.4:
    movq -32(%rbp), %rax     (return count)
    leave                    (undo base pointer stuff)
    ret

running time n = 400, iters = 2000000 is 1419 ms


Naïve Compilation Strategy

• Local variables
  – Call frame (rsp points to bottom, rbp to top)
  – Each local variable stored in call frame
• while loop pattern
• for loop pattern
  – for (initcode; p; nextcode) { code } goes to
  – initcode; while (p) { code; nextcode }
• If then else pattern
• Expression evaluation


while (p) { c; }

    <instructions to evaluate p>
    j<p false> loopExitLabel
loopLabel:
    <instructions for c>
    <instructions to evaluate p>
    j<p true> loopLabel
loopExitLabel:

if (p) { ctrue; } else { cfalse; }

    <instructions to evaluate p>
    j<p false> elseLabel
    <instructions for ctrue>
    j endLabel
elseLabel:
    <instructions for cfalse>
endLabel:
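The loop pattern can be imitated in C itself with goto, which makes the shape of the generated code easier to see. The sketch below lowers the add64 loop by hand: the for loop is first read as initcode; while (p) { code; nextcode }, and the while is then expanded into one test before the loop and one at the bottom. Function and label names are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Reference: the for loop as written in the source. */
uint64_t sum_for(uint64_t n) {
    uint64_t count = 0;
    for (uint64_t i = 0; i < n; i++)
        count += i;
    return count;
}

/* The same loop after the naive lowering. */
uint64_t sum_lowered(uint64_t n) {
    uint64_t count = 0;
    uint64_t i = 0;                 /* initcode */
    if (!(i < n)) goto loop_exit;   /* j<p false> loopExitLabel */
loop:
    count += i;                     /* code */
    i++;                            /* nextcode */
    if (i < n) goto loop;           /* j<p true> loopLabel */
loop_exit:
    return count;
}
```

Testing the condition twice, once before the loop and once at the bottom, keeps a single conditional jump per iteration.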


uint64_t add64(uint64_t n) {
  uint64_t count = 0;
  uint64_t i;
  for (i = 0; i < n; i++)
    count += i;
  return count;
}

icc -O1 (some optimization)

    movl %edi, %rcx          (read n)
    xorl %eax, %eax          (count = 0)
    xorl %edx, %edx          (i = 0)
    testq %rcx, %rcx         (n <= 0)
    jbe ..B17.5
..B17.3:
    addq %rdx, %rax          (count += i)
    addq $1, %rdx            (i++)
    cmpq %rcx, %rdx          (i < n)
    jb ..B17.3
..B17.5:
    ret                      (return count)

running time n = 400, iters = 2000000 is 485 ms


Basic Optimization Concepts

• Allocate values in registers (eliminates excess traffic to/from memory)

• Optimize naïve procedure call linkage

• Why not always optimize?
  – Complicates compiler (sometimes have bugs)
  – Complicates debugging (not sure where values are) (sometimes reorders computations)


Implementing Arrays

• Arrays are just blocks of memory – Static array – allocated by linker/loader

– Dynamic array – allocated by malloc

– Local array – allocated on stack

• Array/pointer equivalence


Building An Array

uint64_t * makeArrayList64(uint64_t x) {

int i;

uint64_t *a = (uint64_t *) malloc(sizeof(uint64_t) * x);

for (i = 0; i < x; i++) {

a[i] = i;

}

return a;

}


Adding Up Array Elements

uint64_t addArrayList64(uint64_t *a, uint64_t n) {

uint64_t c = 0;

for (uint64_t i = 0; i < n; i++) {

c += a[i];

}

return c;

}


uint64_t addArrayList64(uint64_t *a, int n) {
  uint64_t c = 0;
  for (int i = 0; i < n; i++)
    c += a[i];
  return c;
}

addArrayList64:
    # parameter 1: %rdi
    # parameter 2: %esi
    movslq %esi, %rcx              (n)
    xorl %eax, %eax                (c = 0)
    xorl %edx, %edx                (i = 0)
    testq %rcx, %rcx               (n <= 0)
    jle ..B15.5
..B15.3:
    addq (%rdi,%rdx,8), %rax       (c += a[i])
    addq $1, %rdx                  (i++)
    cmpq %rcx, %rdx                (i < n)
    jl ..B15.3
..B15.5:
    ret                            (return c)

running time n = 400, iters = 2000000 is 378 ms (unoptimized 2231 ms)


Different Kinds of Arrays

int a[200]; /* global */

int f(int *b /* whatever */) {
  int c[200]; /* local */
  int count;
  int i = 0;
  for (i = 0; i < 200; i++) {
    c[i] = a[i]+b[i];
  }
  for (i = 0; i < 200; i++) {
    count += c[i];
  }
  return count;
}

    subq $808, %rsp                (allocating c on stack)
    xorl %ecx, %ecx                (a and c offset)
    xorl %edx, %edx                (b offset)
..B1.2:
    movl a(%rcx), %eax             (a[i])
    addl (%rdx,%rdi), %eax         (a[i]+b[i])
    addq $4, %rdx
    movl %eax, (%rsp,%rcx)         (c[i] = a[i]+b[i])
    addq $4, %rcx
    cmpq $800, %rcx
    jl ..B1.2
    xorl %edx, %edx
..B1.4:
    addl (%rsp,%rdx,4), %eax
    addq $1, %rdx
    cmpq $200, %rdx
    jl ..B1.4
    addq $808, %rsp
    ret


More Arrays

int x[N];
int y[N];
int z[N];
void add() {
  int i = 0;
  for (i = 0; i < N; i++) {
    x[i] = y[i] + z[i];
  }
}

N = 4000, Iters = 2000000 (8,000,000,000 + ops)

• icc -O0: 23783 ms
• icc -O1: 6319 ms
• icc -O2: 3681 ms
• Factor of 6.5 speedup


icc -O0

    pushq %rbp
    movq %rsp, %rbp
    subq $16, %rsp
    xorl %eax, %eax
    movl %eax, -16(%rbp)
    movl %eax, -16(%rbp)
    movl -16(%rbp), %eax
    cmpl $4000, %eax
    jge ..B1.4
..B1.3:
    movl -16(%rbp), %eax
    movslq %eax, %rax
    movl -16(%rbp), %edx
    movslq %edx, %rdx
    movl z(,%rdx,4), %edx
    addl y(,%rax,4), %edx
    movl -16(%rbp), %eax
    movslq %eax, %rax
    movl %edx, x(,%rax,4)
    addl $1, -16(%rbp)
    movl -16(%rbp), %eax
    cmpl $4000, %eax
    jl ..B1.3
..B1.4:
    leave
    ret


icc -O1

int x[N];

int y[N];

int z[N];

void add() {

int i = 0;

for (i = 0; i < N; i++) {

x[i] = y[i] + z[i];

}

}

xorl %edx, %edx

..B1.2:

movl y(%rdx), %eax

addl z(%rdx), %eax

movl %eax, x(%rdx)

addq $4, %rdx

cmpq $16000, %rdx

jl ..B1.2

ret


icc -O2

    xorl %eax, %eax
..B1.2:
    movdqa y(%rax), %xmm0
    paddd z(%rax), %xmm0
    movdqa 16+y(%rax), %xmm1
    paddd 16+z(%rax), %xmm1
    movdqa 32+y(%rax), %xmm2
    paddd 32+z(%rax), %xmm2
    movdqa 48+y(%rax), %xmm3
    paddd 48+z(%rax), %xmm3
    movdqa 64+y(%rax), %xmm4
    paddd 64+z(%rax), %xmm4
    movdqa 80+y(%rax), %xmm5
    paddd 80+z(%rax), %xmm5
    movdqa 96+y(%rax), %xmm6
    paddd 96+z(%rax), %xmm6
    movdqa 112+y(%rax), %xmm7
    paddd 112+z(%rax), %xmm7
    movdqa %xmm0, x(%rax)
    movdqa %xmm1, 16+x(%rax)
    movdqa %xmm2, 32+x(%rax)
    movdqa %xmm3, 48+x(%rax)
    movdqa %xmm4, 64+x(%rax)
    movdqa %xmm5, 80+x(%rax)
    movdqa %xmm6, 96+x(%rax)
    movdqa %xmm7, 112+x(%rax)
    addq $128, %rax
    cmpq $16000, %rax
    jl ..B1.2
..B1.3:
    ret

Page 58: From C to Machine Code/Assembler

XMM Stuff

• SIMD instructions – operate on (small) vectors
• 16 128-bit XMM registers, each holding
 – 2 64-bit values, or
 – 4 32-bit values
• Instructions operate on multiple values
 – movdqa y(%rax), %xmm0 (moves 4 32-bit ints from y to %xmm0)
 – paddd z(%rax), %xmm0 (adds 4 32-bit ints in z to corresponding 4 32-bit ints in %xmm0)
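The same load, add, store pattern can be written by hand with the SSE2 intrinsics from emmintrin.h. This is a minimal sketch rather than what icc generates: add_vec4 is a hypothetical name, the length is assumed to be a multiple of 4, and the unaligned load/store intrinsics are used so the sketch works on any buffer (movdqa in the listing is the aligned form and needs 16-byte-aligned addresses). A scalar fallback is included for non-x86 targets.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics: __m128i, _mm_add_epi32, ... */
#endif

/* Adds n 32-bit ints, four per step. Assumes n is a multiple of 4. */
void add_vec4(int32_t *x, const int32_t *y, const int32_t *z, int n) {
    assert(n % 4 == 0);
#if defined(__SSE2__)
    for (int i = 0; i < n; i += 4) {
        /* Unaligned 128-bit loads (movdqu); movdqa is the aligned variant. */
        __m128i a = _mm_loadu_si128((const __m128i *)(y + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(z + i));
        /* _mm_add_epi32 corresponds to paddd: four 32-bit adds at once. */
        _mm_storeu_si128((__m128i *)(x + i), _mm_add_epi32(a, b));
    }
#else
    /* Portable fallback producing the same result without SSE2. */
    for (int i = 0; i < n; i++) x[i] = y[i] + z[i];
#endif
}
```

In practice the compiler's auto-vectorizer usually produces this kind of code on its own at –O2, as the listing shows; intrinsics are mainly useful when it does not.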

Page 59: From C to Machine Code/Assembler

Even More Arrays

char x[N];
char y[N];
char z[N];

void add() {
    int i = 0;
    for (i = 0; i < N; i++) {
        x[i] = y[i] + z[i];
    }
}

N = 16000, Iters = 2000000 (32,000,000,000 add operations)

• icc –O0: 89550 ms (versus 23783 ms for 4000 ints)

• icc –O1: 21440 ms (versus 6319 ms for 4000 ints)

• icc –O2: 3635 ms (versus 3681 ms for 4000 ints)

• Factor of 24 speedup (versus 6.5 speedup)
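One plausible reading of the nearly identical –O2 times: 32,000,000,000 operations over 2,000,000 iterations is 16000 chars per call, and 16000 chars occupy the same 16000 bytes as 4000 4-byte ints. A loop that adds one full 128-bit register per instruction therefore issues the same number of vector adds in both cases. The arithmetic can be sketched as follows; lanes_per_xmm and vector_adds are hypothetical helper names, and the model ignores unrolling and memory effects.

```c
#include <assert.h>
#include <stddef.h>

enum { XMM_BYTES = 16 };  /* one XMM register holds 128 bits */

/* Number of elements of the given size that fit in one XMM register. */
size_t lanes_per_xmm(size_t elem_size) {
    assert(elem_size > 0 && XMM_BYTES % elem_size == 0);
    return XMM_BYTES / elem_size;
}

/* Number of full-width vector adds needed for n_elems elements (rounded up). */
size_t vector_adds(size_t n_elems, size_t elem_size) {
    size_t lanes = lanes_per_xmm(elem_size);
    return (n_elems + lanes - 1) / lanes;
}
```

Under this model, vector_adds(16000, 1) and vector_adds(4000, 4) are both 1000, while the scalar versions differ by a factor of four in loop iterations.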

Page 60: From C to Machine Code/Assembler

Implementing Structs

• Structs are just blocks of memory
 – struct { char x; int i; double d; } s;

• Fields stored next to each other

• Alignment issues

• Like arrays, can have static, dynamic, local structs
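The alignment point can be made concrete by asking the compiler where it placed each field. This is a small sketch, assuming the struct from the slide is given the hypothetical tag s so that offsetof can name it; print_layout is likewise a made-up helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Same fields as the slide's struct; the tag "s" is added for illustration. */
struct s { char x; int i; double d; };

/* Prints where each field starts and the total size including padding. */
void print_layout(void) {
    /* Each field must start on a multiple of its own alignment. */
    assert(offsetof(struct s, i) % _Alignof(int) == 0);
    assert(offsetof(struct s, d) % _Alignof(double) == 0);
    printf("x at %zu, i at %zu, d at %zu, sizeof = %zu\n",
           offsetof(struct s, x), offsetof(struct s, i),
           offsetof(struct s, d), sizeof(struct s));
}
```

On a typical x86-64 Linux build the offsets come out as 0, 4 and 8 with a total size of 16 bytes, so three bytes of padding follow x. The exact padding is implementation-defined, which is why the sketch queries it instead of hard-coding it.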

Page 61: From C to Machine Code/Assembler

Building A Linked List

#include <stdint.h>
#include <stdlib.h>

typedef struct list64 {
    uint64_t value;
    struct list64 *next;
} node64;

struct list64 * makeLinkedList64(int l) {
    int i;
    struct list64 *n;
    struct list64 *f;
    n = 0;
    f = 0;
    for (i = l-1; i >= 0; i--) {
        n = (struct list64 *) malloc(sizeof(struct list64));
        n->next = f;
        n->value = i;
        f = n;
    }
    return f;
}

Page 62: From C to Machine Code/Assembler

Assembly for Building Linked List

pushq %r12 (callee save)
pushq %rbp (callee save)
pushq %rsi (callee save)
xorl %r12d, %r12d (f = 0)
movslq %edi, %rbp (l)
addq $-1, %rbp (i = l-1)
testq %rbp, %rbp (i >= 0)
jl ..B14.6
..B14.3:
movl $16, %edi (malloc(16))
call malloc
..B14.4:
movq %r12, 8(%rax) (n->next = f)
movq %rbp, (%rax) (n->value = i)
movq %rax, %r12 (f = n)
addq $-1, %rbp (i = i - 1)
testq %rbp, %rbp (i >= 0)
jge ..B14.3
..B14.6:
movq %r12, %rax (return f)
popq %rcx
popq %rbp
popq %r12
ret

Page 63: From C to Machine Code/Assembler

Counting Elements in Linked List

typedef struct list64 {
    uint64_t value;
    struct list64 *next;
} node64;

uint64_t addLinkedList64(struct list64 *n) {
    uint64_t count = 0;
    struct list64 *l;
    for (l = n; l != 0; l=l->next) {
        count = count + l->value;
    }
    return count;
}

Page 64: From C to Machine Code/Assembler

typedef struct list64 {
    uint64_t value;
    struct list64 *next;
} node64;

uint64_t addLinkedList64(struct list64 *n) {
    uint64_t count = 0;
    struct list64 *l;
    for (l = n; l != 0; l=l->next)
        count = count + l->value;
    return count;
}

xorl %eax, %eax (count = 0)

testq %rdi, %rdi (l == 0)

je ..B13.5

..B13.3:

addq (%rdi), %rax (count = count +l->value)

movq 8(%rdi), %rdi (l = l-> next)

testq %rdi, %rdi (l == 0)

jne ..B13.3

..B13.5:

ret

Running time for n = 400, iters = 2000000: 885 ms (unoptimized: 3321 ms)
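Putting the two linked-list slides together gives a small self-contained program fragment. This is a sketch with hypothetical names (make_list, sum_list, free_list); unlike the slide code it checks the result of malloc and frees the nodes, and the offsets in the comments assume x86-64, matching the (%rdi) and 8(%rdi) operands in the listing.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct list64 {
    uint64_t value;        /* offset 0 on x86-64: read as (%rdi) */
    struct list64 *next;   /* offset 8 on x86-64: read as 8(%rdi) */
} node64;

/* Releases every node of the list. */
void free_list(node64 *head) {
    while (head != NULL) {
        node64 *next = head->next;
        free(head);
        head = next;
    }
}

/* Builds the list 0, 1, ..., l-1 by pushing nodes at the front.
   Returns NULL for l == 0 or if an allocation fails. */
node64 *make_list(int l) {
    assert(l >= 0);
    node64 *f = NULL;
    for (int i = l - 1; i >= 0; i--) {
        node64 *n = malloc(sizeof *n);
        if (n == NULL) { free_list(f); return NULL; }
        n->next = f;
        n->value = (uint64_t)i;
        f = n;
    }
    return f;
}

/* Sums the value fields, following next pointers until NULL. */
uint64_t sum_list(const node64 *n) {
    uint64_t count = 0;
    for (const node64 *l = n; l != NULL; l = l->next) count += l->value;
    return count;
}
```

For a list built with make_list(l), sum_list returns l(l-1)/2, which gives a quick way to check that traversal visits every node.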

Page 65: From C to Machine Code/Assembler

Procedure Calls, Recursion

int fib(int n) {
    if (n == 0) return 1;
    if (n == 1) return 1;
    return (fib(n-1) + fib(n-2));
}

Page 66: From C to Machine Code/Assembler

int fib(int n) {
    if (n == 0) return 1;
    if (n == 1) return 1;
    return (fib(n-1) + fib(n-2));
}

pushq %rbp
pushq %rbx
pushq %rsi
movl %edi, %ebp
testl %ebp, %ebp
je ..B2.4
cmpl $1, %ebp
je ..B2.4
lea -1(%rbp), %edi
call fib
movl %eax, %ebx
addl $-2, %ebp
movl %ebp, %edi
call fib
addl %eax, %ebx
movl %ebx, %eax
popq %rcx
popq %rbx
popq %rbp
ret
..B2.4:
movl $1, %eax
popq %rcx
popq %rbx
popq %rbp
ret

fib(43) running time: 7231–7401 ms

Page 67: From C to Machine Code/Assembler

int fib(int n) {
    if (n == 0) return 1;
    if (n == 1) return 1;
    return (fib(n-1) + fib(n-2));
}

pushq %rbp
pushq %rbx
pushq %rsi
testl %edi, %edi
je ..B2.4
cmpl $1, %edi
je ..B2.4
movl %edi, %ebp
lea -1(%rbp), %edi
call fib
movl %eax, %ebx
addl $-2, %ebp
movl %ebp, %edi
call fib
addl %eax, %ebx
movl %ebx, %eax
popq %rcx
popq %rbx
popq %rbp
ret
..B2.4:
movl $1, %eax
popq %rcx
popq %rbx
popq %rbp
ret

fib(43) running time: 5502–5533 ms

Page 68: From C to Machine Code/Assembler

int fib(int n) {
    if (n == 0) return 1;
    if (n == 1) return 1;
    return (fib(n-1) + fib(n-2));
}

pushq %rbp
pushq %rbx
pushq %rsi
testl %edi, %edi
je ..B2.4
cmpl $1, %edi
je ..B2.4
movl %edi, %ebp
lea -1(%rbp), %edi
call fib
movl %eax, %ebx
addl $-2, %ebp
movl %ebp, %edi
call fib
addl %ebx, %eax
popq %rcx
popq %rbx
popq %rbp
ret
..B2.4:
movl $1, %eax
popq %rcx
popq %rbx
popq %rbp
ret

fib(43) running time: 5519–5539 ms

Page 69: From C to Machine Code/Assembler

int fib(int n) {
    if (n == 0) return 1;
    if (n == 1) return 1;
    return (fib(n-1) + fib(n-2));
}

pushq %rbp
pushq %rbx
testl %edi, %edi
je ..B2.4
cmpl $1, %edi
je ..B2.4
movl %edi, %ebp
lea -1(%rbp), %edi
call fib
movl %eax, %ebx
addl $-2, %ebp
movl %ebp, %edi
call fib
addl %ebx, %eax
popq %rbx
popq %rbp
ret
..B2.4:
movl $1, %eax
popq %rbx
popq %rbp
ret

fib(43) running time: 5184–5195 ms

Page 70: From C to Machine Code/Assembler

int fib(int n) {
    if (n == 0) return 1;
    if (n == 1) return 1;
    return (fib(n-1) + fib(n-2));
}

pushq %rbp
pushq %rbx
testl %edi, %edi
je ..B2.4
cmpl $1, %edi
je ..B2.4
addl $-1, %edi
movl %edi, %ebp
call fib
movl %eax, %ebx
addl $-1, %ebp
movl %ebp, %edi
call fib
addl %ebx, %eax
popq %rbx
popq %rbp
ret
..B2.4:
movl $1, %eax
popq %rbx
popq %rbp
ret

fib(43) running time: 5151–5175 ms

Page 71: From C to Machine Code/Assembler

int fib(int n) {
    if (n == 0) return 1;
    if (n == 1) return 1;
    return (fib(n-1) + fib(n-2));
}

testl %edi, %edi
je ..B2.4
cmpl $1, %edi
je ..B2.4
addl $-1, %edi
pushq %rdi
call fib
popq %rdi
pushq %rax
addl $-1, %edi
call fib
popq %rbi
addl %ebi, %eax
ret
..B2.4:
movl $1, %eax
ret

fib(43) running time: 5004–5020 ms

Page 72: From C to Machine Code/Assembler

int fib(int n) {
    if (n == 0) return 1;
    if (n == 1) return 1;
    return (fib(n-1) + fib(n-2));
}

cmpl $2, %edi
je ..B2.4
addl $-1, %edi
pushq %rdi
call fib
popq %rdi
pushq %rax
addl $-1, %edi
call fib
popq %rbi
addl %ebi, %eax
ret
..B2.4:
movl $1, %eax
ret

fib(43) running time: 4539–4696 ms
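The last listing compares n against 2 once instead of testing 0 and 1 separately. The slides do not show matching C for that step, so the following is only a guess at a source-level analogue: fib_one_test is a hypothetical name, and the sketch assumes the single comparison is meant as n < 2 for non-negative n.

```c
#include <assert.h>

/* Version from the slides: two separate base-case tests. */
int fib(int n) {
    if (n == 0) return 1;
    if (n == 1) return 1;
    return fib(n - 1) + fib(n - 2);
}

/* Hypothetical variant with one base-case comparison.
   Gives the same results as fib() for every n >= 0. */
int fib_one_test(int n) {
    assert(n >= 0);
    if (n < 2) return 1;
    return fib_one_test(n - 1) + fib_one_test(n - 2);
}
```

Both versions still make an exponential number of calls, so a change like this trims the work per call without changing how many calls there are, which is consistent with the modest gains reported on these slides.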

Page 73: From C to Machine Code/Assembler

Summary Mapping (so far)

C → x86

• Parameters → passed in registers (then on the call stack if there are too many)
• Procedure call/return → call stack stored in memory, grows down
 – Return address on stack
 – Return value in register (RAX)
 – Caller- vs. callee-save registers
• Local variables → registers or activation record
• Arrays and structures → blocks of memory (static, dynamic, local)
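To make the parameter rule concrete, here is a function with more integer arguments than there are argument registers. The register names in the comments come from the System V AMD64 calling convention used on Linux; that platform is an assumption, since the slides only say "passed in registers". sum8 and check_sum8 are hypothetical names.

```c
#include <assert.h>

/* Comments give where each argument arrives on entry under the
   System V AMD64 convention (assumed platform: x86-64 Linux). */
long sum8(long a,  /* %rdi */
          long b,  /* %rsi */
          long c,  /* %rdx */
          long d,  /* %rcx */
          long e,  /* %r8  */
          long f,  /* %r9  */
          long g,  /* on the stack, above the return address */
          long h)  /* on the stack */
{
    long partial = a + b + c + d;      /* local: register or stack slot */
    return partial + e + f + g + h;    /* result is returned in %rax */
}

/* Usage check: 1 + 2 + ... + 8 = 36. */
void check_sum8(void) {
    assert(sum8(1, 2, 3, 4, 5, 6, 7, 8) == 36);
}
```

Compiling a function like this to assembly and reading the prologue is a quick way to confirm which arguments the compiler actually takes from registers and which from the stack.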

Page 74: From C to Machine Code/Assembler

More Summary Mapping (so far)

C → x86

• Arithmetic expressions → sequences of x86 instructions
• Control flow (loops, if then else) → evaluate condition, set status flags, then jump appropriately
• Procedures → instructions in memory
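The control-flow mapping can be mimicked in C itself by replacing the loop with explicit tests and gotos. The sketch below follows the shape of the icc –O0 listing earlier in the deck (a guard test before the loop, then the body, increment, and a test at the bottom); sum_to, sum_to_goto and check_sums are hypothetical names.

```c
#include <assert.h>

/* Structured version: sum of 0 .. n-1. */
int sum_to(int n) {
    int s = 0;
    for (int i = 0; i < n; i++) s += i;
    return s;
}

/* Same loop written as compare-and-jump, one C line per step. */
int sum_to_goto(int n) {
    int s = 0;
    int i = 0;
    if (i >= n) goto done;   /* like: cmpl ... ; jge ..B1.4 */
body:
    s += i;                  /* loop body */
    i++;                     /* like: addl $1, ... */
    if (i < n) goto body;    /* like: cmpl ... ; jl ..B1.3 */
done:
    return s;
}

/* Usage check: both versions agree, including for empty loops. */
void check_sums(void) {
    for (int n = -1; n <= 10; n++) assert(sum_to(n) == sum_to_goto(n));
}
```

Each if/goto pair corresponds to one compare instruction that sets the status flags followed by one conditional jump that reads them.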

Page 75: From C to Machine Code/Assembler

More C/C++ Constructs

• Arrays of structs – how to lay out?

• Function pointers

• Bit fields in arrays

• Objects, virtual function tables

• Memory management techniques
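For the first question, two common layouts are an array of structs and a struct of arrays. The slide only poses the question, so this is one illustration of the trade-off rather than the lecture's answer; struct point, struct points, COUNT and the stride helpers are all hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define COUNT 4  /* illustrative element count */

/* Array of structs: the fields of one element sit next to each other. */
struct point { int x; int y; };
struct point aos[COUNT];

/* Struct of arrays: all x values are contiguous, then all y values. */
struct points { int x[COUNT]; int y[COUNT]; };
struct points soa;

/* Byte distance between consecutive x values in each layout. */
size_t aos_x_stride(void) {
    return (size_t)((char *)&aos[1].x - (char *)&aos[0].x);
}
size_t soa_x_stride(void) {
    return (size_t)((char *)&soa.x[1] - (char *)&soa.x[0]);
}

/* Usage check: AoS steps by a whole struct, SoA by a single int. */
void check_strides(void) {
    assert(aos_x_stride() == sizeof(struct point));
    assert(soa_x_stride() == sizeof(int));
}
```

A loop that touches only x reads contiguous memory in the second layout, which tends to suit the vector loads shown on the XMM slides; code that uses x and y of the same element together may prefer the first.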

Page 76: From C to Machine Code/Assembler

What You Should Know

• Understand mapping from C to x86
• C state to x86 state
 – Variables (locals, globals, parameters) in registers or memory
 – Arrays and structs as blocks of memory
  • Dynamic vs static allocation
  • Heap vs stack allocation (both dynamic allocation)
 – Pointers as memory addresses
• C computation to x86 computation
 – Expression evaluation as sequence of instructions
  • Operand addressing modes (register, memory), computations
  • Vector instructions and associated registers
 – Control flow (while, for, if then else) patterns with jump instructions
• Procedure call linkage
 – Call stack, stack pointer (rsp), frame pointer (rbp)
 – call, ret instructions
 – Parameters in registers or stack, return value in register
 – Caller and callee save registers
• Be able to read the assembler that the compiler generates

Page 77: From C to Machine Code/Assembler

MIT OpenCourseWare
http://ocw.mit.edu

6.172 Performance Engineering of Software Systems
Fall 2009

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.