
Carnegie Mellon


x87-Based Support for Floating-Point
x86-64 Instruction Set Architecture

198:231 Introduction to Computer Organization, Lecture 8

Instructor:

Nicole Hynes




x87-Based Support for Floating-Point

Intel floating-point architecture: a historical perspective

x87 FPU refers to the floating-point unit of Intel’s line of x86 processors.

Early Intel processors (8086, 80286, i386) had separate off-chip floating-point units (called 8087, 80287, i387, respectively).

The 8087 was the first FPU to implement a draft of what became the IEEE 754 floating-point standard.

Starting with the i486, the FPU was integrated with the IA32 CPU chip.

With the introduction of SSE2, the second generation of Intel’s SSE (Streaming SIMD Extensions), in the Pentium 4 (2000), it became possible to implement floating-point operations using SSE instructions.

However, when targeting IA32, many compilers, including gcc, still generate x87 floating-point code by default.

This lecture gives only an overview of the x87 FPU. For more information, see Chapter 8 of the IA-32 Intel Architecture Software Developer’s Manual on the course website.




x87 FPU registers

Eight 80-bit data registers

Treated as a shallow stack.

Instructions implicitly push values onto, or pop values off, the FPU stack.

%st(0) (or simply %st) is the top of the stack.

%st(i) is the i-th register “below” %st(0) in the stack.

Warning: When more than eight values are pushed onto the stack, the ones at the bottom simply disappear.

[Figure: the eight stack registers %st(7) through %st(0), with %st(0) marked as the top of stack (TOS); each register is 80 bits wide.]



x87 FPU registers

Floating-point values are stored in IEEE extended-precision FP format (80 bits).

FP operations are performed in extended-precision.

Single-precision (32-bit) and double-precision (64-bit) values are converted to/from extended precision FP format.


Value represented: ±1.fraction × 2^exponent

Layout (80 bits): s (1 bit), b_exp (15 bits), frac (64 bits, the significand)

Field    # Bits    Value    Remarks

s        1         0 if number is positive; 1 if negative

b_exp    15        exponent + bias, where bias = 2^(15-1) - 1 = 2^14 - 1 = 16,383    biased exponent

frac     64        entire significand 1.fraction    no hidden bit
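To make the field layout concrete, the following C sketch extracts the three fields from a 10-byte extended-precision value. It assumes the bytes are stored little-endian, as in x86 memory; the names x87_fields and x87_decode are illustrative and not part of any library.

```c
#include <stdint.h>

#define X87_BIAS 16383  /* 2^(15-1) - 1 */

/* Hypothetical container for the three fields of an 80-bit value. */
struct x87_fields {
    int      sign;      /* s: 0 = positive, 1 = negative */
    int      exponent;  /* unbiased exponent: b_exp - bias */
    uint64_t frac;      /* 64-bit significand, including the explicit leading 1 */
};

/* Decode 10 bytes (assumed little-endian, as in x86 memory) into fields. */
struct x87_fields x87_decode(const unsigned char bytes[10])
{
    struct x87_fields f;
    uint64_t frac = 0;
    unsigned top;
    int i;

    for (i = 7; i >= 0; i--)                     /* bytes 0..7: significand */
        frac = (frac << 8) | bytes[i];
    top = ((unsigned)bytes[9] << 8) | bytes[8];  /* bytes 8..9: s and b_exp */

    f.sign = (int)((top >> 15) & 1);
    f.exponent = (int)(top & 0x7FFF) - X87_BIAS;
    f.frac = frac;
    return f;
}
```

Under these assumptions, the value 1.0 decodes to s = 0, an unbiased exponent of 0 (b_exp = 16,383), and frac = 0x8000000000000000, where the top bit of frac is the explicit leading 1.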



x87-Based Support for Floating-Point

Floating-point load instruction

Pushes a value on top of the FPU stack – value loaded will be in %st(0).

Value to be loaded must be a memory operand specified by an IA32 addressing mode.

Can load a single-precision, double-precision, or an integer value – automatically converted to extended-precision format.

Instruction fld %st(i) duplicates a stack value: it pushes a copy of %st(i) on top of the stack.

Instruction Source format Source location

flds addr Single-precision M4[addr]

fldl addr Double-precision M8[addr]

fldt addr Extended-precision M10[addr]

fildl addr Integer M4[addr]

fld %st(i) Extended-precision %st(i)

Mk[addr] means the k-byte value starting at addr



x87-Based Support for Floating-Point

Floating-point store instruction

Stores value on top of FPU stack into memory at specified address addr; automatically converted to destination format.

Each instruction has “popping” and “non-popping” versions; e.g.,

fstps addr: pops top of FPU stack and stores popped value into memory at address addr.

fsts addr: stores – but does not pop – value on top of FPU stack into memory at address addr.

Instruction fst{p} %st(i) copies value on top of stack to %st(i); also has popping and non-popping versions.

Instruction Destination format Destination location

fst{p}s addr Single-precision M4[addr]

fst{p}l addr Double-precision M8[addr]

fst{p}t addr Extended-precision M10[addr]

fist{p}l addr Integer M4[addr]

fst{p} %st(i) Extended-precision %st(i)
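The automatic conversion on a store can lose precision when the destination format is narrower than the value on the stack. The following C sketch models that effect, using a cast to float as a stand-in for fsts and double as a stand-in for the 80-bit register; the function name is illustrative.

```c
/* Model of a single-precision store followed by a reload.
   Assumption: double stands in for the 80-bit %st(0). */
double store_single_and_reload(double st0)
{
    volatile float mem = (float)st0;  /* like fsts: rounds to 32 bits   */
    return (double)mem;               /* like flds: widening is exact   */
}
```

A value such as 0.5 is exactly representable in single precision and survives the round trip unchanged, while 0.1 does not.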




Basic floating-point arithmetic instructions

Instruction Effect

fldz Pushes 0.0 onto FPU stack

fld1 Pushes 1.0 onto FPU stack

fabs %st(0) ← |%st(0)|

fchs %st(0) ← -%st(0)

fsqrt %st(0) ← sqrt(%st(0))

fadd Operand 1 + Operand 2

fsub Operand 1 – Operand 2

fsubr Operand 2 – Operand 1

fmul Operand 1 × Operand 2

fdiv Operand 1 / Operand 2

fdivr Operand 2 / Operand 1

fabs, fchs, and fsqrt operate on the single operand %st(0).

fadd, fsub, fsubr, fmul, fdiv, and fdivr take two operands and have many variants.




Illustration: variants of fsub instruction

Computes Destination ← Operand1 – Operand2

Similar variants for fadd, fmul, and fdiv

Instruction Operand 1 Operand 2 Format Destination Pop %st?

fsubs addr %st(0) M4[addr] Single %st(0) No

fsubl addr %st(0) M8[addr] Double %st(0) No

fsubt addr %st(0) M10[addr] Extended %st(0) No

fisubl addr %st(0) M4[addr] Integer %st(0) No

fsub %st(i),%st %st(i) %st(0) Extended %st(0) No

fsub %st,%st(i) %st(0) %st(i) Extended %st(i) No

fsubp %st,%st(i) %st(0) %st(i) Extended %st(i) Yes

fsubp %st(0) %st(1) Extended %st(1) Yes




Example

Compute x = (a-b)*(-b+c) using x87 FPU instructions.

Assume all variables are declared in the data segment as double.
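One possible instruction sequence for this computation can be traced with a small C model of the FPU stack. This is a sketch under stated assumptions, not necessarily the sequence shown in lecture: the array st, the helpers fpu_push and fpu_pop, and the function compute_x are hypothetical, and double stands in for the 80-bit registers. Each line is annotated with the x87 instruction it models.

```c
#define FPU_DEPTH 8

static double st[FPU_DEPTH];   /* st[0] models %st(0), st[1] models %st(1), ... */

/* Push: every entry moves one slot down; the old st[7] is lost. */
static void fpu_push(double v)
{
    int i;
    for (i = FPU_DEPTH - 1; i > 0; i--)
        st[i] = st[i - 1];
    st[0] = v;
}

/* Pop: return the top and move every entry one slot up. */
static double fpu_pop(void)
{
    double top = st[0];
    int i;
    for (i = 0; i < FPU_DEPTH - 1; i++)
        st[i] = st[i + 1];
    return top;
}

double compute_x(double a, double b, double c)
{
    fpu_push(a);            /* fldl  a : %st(0) = a                      */
    st[0] = st[0] - b;      /* fsubl b : %st(0) = a - b                  */
    fpu_push(b);            /* fldl  b : %st(0) = b, %st(1) = a - b      */
    st[0] = -st[0];         /* fchs    : %st(0) = -b                     */
    st[0] = st[0] + c;      /* faddl c : %st(0) = -b + c                 */
    st[1] = st[0] * st[1];  /* fmulp   : %st(1) = %st(0) * %st(1), ...   */
    (void)fpu_pop();        /*           ... then pop                    */
    return fpu_pop();       /* fstpl x : pop the result into x           */
}
```

With a = 5, b = 3, and c = 10, the model leaves (5-3)*(-3+10) = 14 as the stored result, and the stack is back to its starting depth.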




Using floating-point data in procedures

With IA32, floating-point arguments are passed to the called procedure on the stack, just as integer arguments are.

Each parameter of type float requires 4 bytes of stack space, while each parameter of type double requires 8.

For functions whose return values are of type float or double, the result is returned on the top of the floating-point register stack in extended-precision format.

As an example, consider the following function:

double funct(double a, float x, double b, int i)

{

return a*x - b/i;

}




Using floating-point data in procedures, cont.

Parameters a, x, b, and i are at byte offsets 8, 16, 20, and 28 relative to %ebp, respectively.

[Figure: body of the generated code and the resulting stack values.]
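The offsets 8, 16, 20, and 28 follow from the argument sizes given earlier (8 bytes per double, 4 per float or int). A minimal C sketch of that arithmetic, assuming arguments are laid out back to back starting at 8(%ebp); the function name is illustrative.

```c
#include <stddef.h>

/* Fill offsets[i] with the %ebp-relative offset of argument i.
   Assumption: 0(%ebp) holds the saved %ebp and 4(%ebp) the return
   address, so the first argument starts at 8(%ebp). */
void ia32_arg_offsets(const size_t sizes[], size_t n, size_t offsets[])
{
    size_t next = 8;
    size_t i;
    for (i = 0; i < n; i++) {
        offsets[i] = next;
        next += sizes[i];
    }
}
```

For funct, the sizes are 8, 4, 8, and 4 bytes for a, x, b, and i, which yields the offsets quoted above.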




Other features of x87 FPU

Has floating-point compare and test instructions that set the condition codes in the EFLAGS register.

These can be used to perform conditional branches, just as in the integer case.

Has many other floating-point instructions that perform trigonometric, logarithmic, exponential, and scaling operations.

For details, see Chapter 8, IA-32 Intel Architecture Software Developer’s Manual on the C & Assembly section of the course website.



Intel’s 64-Bit

Intel Attempted Radical Shift from IA32 to IA64

Totally different architecture (Itanium) - VLIW

Executes IA32 code only as legacy

Performance disappointing

AMD Stepped in with Evolutionary Solution

x86-64 (now called “AMD64”)

Intel Felt Obligated to Focus on IA64

Hard to admit mistake or that AMD is better

2004: Intel Announces EM64T extension to IA32

Extended Memory 64-bit Technology

Almost identical to x86-64!

All but low-end x86 processors support x86-64

But, lots of code still runs in 32-bit mode



Overview of x86-64

Main Features

Pointers and long integers are 64 bits long. Integer arithmetic operations support 8, 16, 32, and 64-bit data types.

The set of general-purpose registers is expanded from 8 to 16.

Much of the program state is held in registers rather than on the stack.

Integer and pointer procedure arguments (up to 6) are passed via registers. Some procedures do not need to access the stack at all.

Conditional operations are implemented using conditional move instructions when possible, yielding better performance than traditional branching code.

Floating-point operations are implemented using a register-oriented instruction set, rather than the stack-based approach supported by IA32.

The program counter is named %rip (instead of %eip in IA32).



x86-64 Integer Registers

Twice the number of registers; each 64 bits.

Accessible as 8, 16, 32, or 64 bits.

Make %ebp/%rbp general purpose.

64-bit    32-bit
%rax      %eax
%rbx      %ebx
%rcx      %ecx
%rdx      %edx
%rsi      %esi
%rdi      %edi
%rsp      %esp
%rbp      %ebp
%r8       %r8d
%r9       %r9d
%r10      %r10d
%r11      %r11d
%r12      %r12d
%r13      %r13d
%r14      %r14d
%r15      %r15d



x86-64 Integer Registers

Description

The number of registers has been doubled to 16. The new registers are numbered 8-15.

All registers are 64 bits long. The 64-bit extensions of the IA32 registers are named %rax, %rcx, %rdx, %rbx, %rsi, %rdi, %rsp, and %rbp. The new registers are named %r8–%r15.

The low-order 32 bits of each register can be accessed directly. This gives us the familiar registers from IA32: %eax, %ecx, %edx, %ebx, %esi, %edi, %esp, and %ebp, as well as eight new 32-bit registers: %r8d–%r15d.

The low-order 16 bits of each register can be accessed directly, as is the case for IA32. The word-size versions of the new registers are named %r8w–%r15w.

The low-order 8 bits of each register can be accessed directly. This is true in IA32 only for the first 4 registers (%al, %cl, %dl, %bl). The byte-size versions of the other IA32 registers are named %sil, %dil, %spl, and %bpl. The byte-size versions of the new registers are named %r8b–%r15b.



Instructions

Adds a new integer data type: quad word - 8 bytes (64 bits)

suffix q

C long int and pointer variables map to quad word

Arithmetic and logical instructions extended to quad word: movl ➙ movq

addl ➙ addq

sall ➙ salq

etc.

32-bit instructions that generate 32-bit results set the upper 32 bits of the destination register to 0.

Example: addl
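The zeroing of the upper bits can be modeled in C by treating a uint64_t as the full register and a uint32_t as its low half. This is only a sketch of the rule stated above; model_addl and model_addq are hypothetical names.

```c
#include <stdint.h>

/* Model of addl src, dest on a 64-bit register: the 32-bit sum replaces
   the low half and the upper 32 bits of the register become 0. */
uint64_t model_addl(uint64_t dest_reg, uint32_t src)
{
    uint32_t low = (uint32_t)dest_reg;      /* the 32-bit view of the register */
    return (uint64_t)(uint32_t)(low + src); /* zero-extended 32-bit result     */
}

/* Model of addq src, dest: a full 64-bit add, for comparison. */
uint64_t model_addq(uint64_t dest_reg, uint64_t src)
{
    return dest_reg + src;
}
```

Adding 1 to a register holding 0xFFFFFFFF gives 0 with the 32-bit model, since the carry out of bit 31 is discarded, but 0x100000000 with the 64-bit model.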



Data Representations: IA32 + x86-64

C declaration    Intel data type    Assembly code suffix    IA32 size (bytes)    x86-64 size (bytes)

char Byte b 1 1

short Word w 2 2

int Double word l 4 4

long int Quad word q 4 8

long long int Quad word q 8 8

char * (pointer) Quad word q 4 8

float Single precision s 4 4

double Double precision d 8 8

long double Extended precision t 10/12 10/16



64-bit Data Movement Instructions

Instruction Description Effect

movabsq imm, reg Move absolute quad word reg ← imm

movq src,dest Move quad word dest ← src

movsbq src,dest Move sign-extended byte dest ← SignExtend(src)

movswq src,dest Move sign-extended word dest ← SignExtend(src)

movslq src,dest Move sign-extended long dest ← SignExtend(src)

movzbq src,dest Move zero-extended byte dest ← ZeroExtend(src)

movzwq src,dest Move zero-extended word dest ← ZeroExtend(src)

pushq src Push quad word %rsp ← %rsp – 8; M[%rsp] ← src

popq dest Pop quad word dest ← M[%rsp]; %rsp ← %rsp + 8
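The difference between the sign-extending and zero-extending moves in the table can be sketched with C integer conversions. The function names are illustrative, and the narrowing casts assume the two's-complement wraparound that gcc provides.

```c
#include <stdint.h>

/* Model of movsbq: copy the byte's sign bit into the upper 56 bits. */
int64_t model_movsbq(uint8_t src)
{
    return (int64_t)(int8_t)src;
}

/* Model of movzbq: fill the upper 56 bits with zeros. */
uint64_t model_movzbq(uint8_t src)
{
    return (uint64_t)src;
}

/* Model of movslq: copy bit 31 into the upper 32 bits. */
int64_t model_movslq(uint32_t src)
{
    return (int64_t)(int32_t)src;
}
```

The byte 0x80 becomes -128 under sign extension and 128 under zero extension, so the choice of instruction depends on whether the source is a signed or an unsigned type.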



64-bit Arithmetic and Logical Instructions


leaq src,dest Load effective address dest← &src

incq dest Increment dest ← dest + 1

decq dest Decrement dest ← dest – 1

negq dest Negate dest ← –dest

notq dest Complement dest ← ~dest

addq src,dest Add dest ← dest + src

subq src,dest Subtract dest ← dest – src

imulq src,dest Multiply dest ← dest * src

xorq src,dest Exclusive or dest ← dest ^ src

orq src,dest Or dest ← dest | src

andq src,dest And dest ← dest & src

salq k,dest Left shift dest ← dest << k

shlq k,dest Left shift (same as salq) dest ← dest << k

sarq k,dest Arithmetic right shift dest ← dest >> k (sign bit replicated)

shrq k,dest Logical right shift dest ← dest >> k (zero-filled)
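The two right shifts differ only in what fills the vacated high bits. A C sketch of the distinction follows; the function names are illustrative, k is assumed to be in the range 0 to 63, and the signed case relies on gcc's behavior of shifting negative values arithmetically (the C standard leaves that implementation-defined).

```c
#include <stdint.h>

/* Model of sarq: vacated bits are filled with copies of the sign bit.
   Assumption: the compiler implements >> on signed values arithmetically. */
int64_t model_sarq(int64_t dest, unsigned k)
{
    return dest >> k;
}

/* Model of shrq: vacated bits are filled with zeros. */
uint64_t model_shrq(uint64_t dest, unsigned k)
{
    return dest >> k;
}
```

Shifting -16 right by 2 gives -4 arithmetically, but the same bit pattern shifted logically gives the large positive value 0x3FFFFFFFFFFFFFFC.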



Special 64-Bit Arithmetic Instructions


imulq src Signed full multiply R[%rdx]:R[%rax] ← src × R[%rax]

mulq src Unsigned full multiply R[%rdx]:R[%rax] ← src × R[%rax]

cltq Convert %eax to quad word R[%rax] ← SignExtend(R[%eax])

cqto Convert %rax to oct word R[%rdx]:R[%rax] ← SignExtend(R[%rax])

idivq src Signed divide R[%rdx] ← R[%rdx]:R[%rax] mod src (remainder); R[%rax] ← R[%rdx]:R[%rax] ÷ src (quotient)

divq src Unsigned divide R[%rdx] ← R[%rdx]:R[%rax] mod src (remainder); R[%rax] ← R[%rdx]:R[%rax] ÷ src (quotient)
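The full-multiply instructions produce a 128-bit product split across two 64-bit registers. The C sketch below models the unsigned case (mulq) portably by working on 32-bit halves; the function name and the hi/lo output parameters, which stand in for %rdx and %rax, are illustrative.

```c
#include <stdint.h>

/* Model of mulq: compute the 128-bit product a * b and return it as
   hi:lo, playing the roles of R[%rdx]:R[%rax]. */
void model_mulq(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    uint64_t a_lo = a & 0xFFFFFFFFu, a_hi = a >> 32;
    uint64_t b_lo = b & 0xFFFFFFFFu, b_hi = b >> 32;

    uint64_t p0 = a_lo * b_lo;   /* contributes to bits 0..63   */
    uint64_t p1 = a_lo * b_hi;   /* contributes to bits 32..95  */
    uint64_t p2 = a_hi * b_lo;   /* contributes to bits 32..95  */
    uint64_t p3 = a_hi * b_hi;   /* contributes to bits 64..127 */

    /* Sum of everything that lands in bits 32..63, with carries kept. */
    uint64_t mid = (p0 >> 32) + (p1 & 0xFFFFFFFFu) + (p2 & 0xFFFFFFFFu);

    *lo = (mid << 32) | (p0 & 0xFFFFFFFFu);
    *hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
}
```

Multiplying 0xFFFFFFFFFFFFFFFF by 2 gives hi = 1 and lo = 0xFFFFFFFFFFFFFFFE, a product that would not fit in a single 64-bit register.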

Compare and Test Quad Word


cmpq src2, src1 Compare quad word Compute src1 – src2 and set condition codes based on result

testq src2, src1 Test quad word Compute src1 & src2 and set condition codes based on result



64-Bit Call and Return Instructions

Instruction pointer (aka program counter) is now a 64-bit register %rip (instead of %eip in IA32).

callq acts like call: it pushes the return address (the address of the instruction following the call, held in %rip) onto the stack, then jumps to the address given by its argument.

retq acts like ret: it pops the return address from the stack into %rip, so the next instruction executed is the one at that address.

When using gcc to compile to x86-64 architecture, call or ret (without the suffix q) will be interpreted as callq and retq.


Instruction       Description         Effect

callq label       Procedure call      pushq %rip; jmp label

callq *operand    Procedure call      pushq %rip; jmp *operand

retq              Return from call    popq %rip



Conditional Move Instructions

Copy source to destination when the move condition is true (based on condition codes)

src = memory or register; dest = register

Works for 16, 32, or 64-bit operands – no suffix required (length of operand inferred from size of destination register)

Instruction Synonym Description Move condition

cmove src, dest cmovz Move if equal/zero ZF

cmovne src, dest cmovnz Move if not equal/not zero ~ZF

cmovs src, dest Move if negative SF

cmovns src, dest Move if nonnegative ~SF

cmovg src, dest cmovnle Move if greater (signed >) ~(SF ^ OF) & ~ZF

cmovge src, dest cmovnl Move if greater or equal (signed >=) ~(SF ^ OF)

cmovl src, dest cmovnge Move if less (signed <) SF ^ OF

cmovle src, dest cmovng Move if less or equal (signed <=) (SF ^ OF) | ZF

cmova src, dest cmovnbe Move if above (unsigned >) ~CF & ~ZF

cmovae src, dest cmovnb Move if above or equal (unsigned >=) ~CF

cmovb src, dest cmovnae Move if below (unsigned <) CF

cmovbe src, dest cmovna Move if below or equal (unsigned <=) CF | ZF
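The move conditions in the table can be checked with a small C model of the flags that a quad-word compare sets. This is a sketch: the struct and function names are hypothetical, and only four of the conditions are shown.

```c
#include <stdint.h>

struct flags { int zf, sf, cf, of; };

/* Flags after cmpq src2, src1, which computes src1 - src2. */
struct flags model_cmpq(int64_t src2, int64_t src1)
{
    uint64_t a = (uint64_t)src1, b = (uint64_t)src2;
    uint64_t r = a - b;                       /* wraps modulo 2^64 */
    struct flags f;
    f.zf = (r == 0);
    f.sf = (int)(r >> 63);                    /* sign bit of the result */
    f.cf = (a < b);                           /* unsigned borrow        */
    f.of = (int)(((a ^ b) & (a ^ r)) >> 63);  /* signed overflow        */
    return f;
}

int cond_l(struct flags f) { return f.sf ^ f.of; }             /* cmovl */
int cond_g(struct flags f) { return !(f.sf ^ f.of) && !f.zf; } /* cmovg */
int cond_b(struct flags f) { return f.cf; }                    /* cmovb */
int cond_a(struct flags f) { return !f.cf && !f.zf; }          /* cmova */
```

Comparing -1 against 1 shows why both families exist: the signed condition "less" holds, while the unsigned condition "above" also holds because -1 has the bit pattern of the largest unsigned value.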



Assembling and Running x86-64 Assembly Lang. Programs

Recall: to assemble an IA32 assembly language program:

% gcc -m32 -o prog prog_IA32.s

The -m32 flag tells gcc that the program is in IA32 assembly language.

To assemble an x86-64 assembly language program:

% gcc -m64 -o prog prog_x86-64.s

The -m64 flag tells gcc that the program is in x86-64 assembly language.

On clam, the -m64 flag may be omitted because gcc assumes x86-64 assembly language by default.

Run the executable program as usual:

% ./prog



Example: IA32 vs. x86-64 Assembly

Consider the following C program:

/* sample4.c */
#include <stdio.h>

long int a = 15, b = 10, c = 20;
long int max;

int main () {
    max = a > b ? a : b;
    max = c > max ? c : max;
    printf("max = %ld\n", max);
    return 0;
}

Translation to IA-32 assembly language:

/* sample4_IA32.s */

.data

.align 4

a: .long 15

b: .long 10

c: .long 20

.comm max,4,4

.section .rodata

.LC0: .string "max = %ld\n"




Translation to IA-32 assembly language, cont.

.text

.globl main

main:

pushl %ebp # prolog

movl %esp, %ebp

movl b, %eax # %eax = b

movl a, %edx # %edx = a

cmpl %edx, %eax # compare b:a

jge .L2 # if b >= a go to .L2

movl %edx, %eax # b < a: %eax = a

.L2:

movl %eax, max # max = maximum{a,b}

movl max, %eax # %eax = max

movl c, %edx # %edx = c

cmpl %edx, %eax # compare max:c

jge .L3 # if max >= c go to .L3

movl %edx, %eax # max < c: %eax = c

.L3:

movl %eax, max # max = maximum{max, c}

pushl %eax # push printf parms onto stack

pushl $.LC0

call printf # call printf

addl $8, %esp # deallocate parms from stack

movl $0, %eax # return 0

leave # epilog

ret




Translation to x86-64 assembly language:

/* sample4_x86-64.s */

.data

.align 8

a: .quad 15

b: .quad 10

c: .quad 20

.comm max,8,8

.section .rodata

.LC0: .string "max = %ld\n"

Note: long int is a quad word (8 bytes) in x86-64.




Translation to x86-64 assembly language, cont.

.text

.globl main

main:

pushq %rbp # prolog

movq %rsp, %rbp

movq b, %rdx # %rdx = b

movq a, %rax # %rax = a

cmpq %rax, %rdx # compare b:a

cmovge %rdx, %rax # if (b >= a) %rax = b

movq %rax, max # max = maximum{a, b}

movq max, %rdx # %rdx = max

movq c, %rax # %rax = c

cmpq %rax, %rdx # compare max:c

cmovge %rdx, %rax # if (max >= c) %rax = max

movq %rax, max # max = maximum{max, c}

movq max, %rax # %rax = max

movq %rax, %rsi # %rsi = second parm

movl $.LC0, %edi # %rdi = first parm

movl $0, %eax # %eax = 0

callq printf # call printf

movl $0, %eax # return 0

movq %rbp, %rsp # epilog

popq %rbp

retq

Note: parameters to printf are passed via registers.

Note the use of conditional move instructions.



x86-64 Register Usage Conventions

Register    Usage
%rax        Return value
%rbx        Callee saved
%rcx        Parameter #4
%rdx        Parameter #3
%rsi        Parameter #2
%rdi        Parameter #1
%rsp        Stack pointer
%rbp        Callee saved
%r8         Parameter #5
%r9         Parameter #6
%r10        Caller saved
%r11        Caller saved
%r12        Callee saved
%r13        Callee saved
%r14        Callee saved
%r15        Callee saved



x86-64 Parameter Passing

First six integral parameters are passed via registers

If more than 6 integral parameters, pass the rest via stack

These registers can also be used as caller-saved registers

Integral return value passed via %rax

%eax if 32 bits; %ax if 16 bits; %al if 8 bits

Can also be used as a caller-saved register

Operand size    Parameter number
(bits)          1       2       3       4       5       6

64              %rdi    %rsi    %rdx    %rcx    %r8     %r9

32              %edi    %esi    %edx    %ecx    %r8d    %r9d

16              %di     %si     %dx     %cx     %r8w    %r9w

8               %dil    %sil    %dl     %cl     %r8b    %r9b
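The register table can be expressed as a small lookup in C, which is a convenient way to check which register name applies to a given parameter position and operand size. The function param_register is a hypothetical helper, not part of any toolchain.

```c
#include <stddef.h>

/* Return the register holding integral parameter `param` (1..6) when it
   is `bits` wide (8, 16, 32, or 64). Returns NULL for parameters beyond
   the sixth (passed on the stack) or for an unsupported size. */
const char *param_register(int param, int bits)
{
    static const char *const regs[4][6] = {
        { "%rdi", "%rsi", "%rdx", "%rcx", "%r8",  "%r9"  },  /* 64 bits */
        { "%edi", "%esi", "%edx", "%ecx", "%r8d", "%r9d" },  /* 32 bits */
        { "%di",  "%si",  "%dx",  "%cx",  "%r8w", "%r9w" },  /* 16 bits */
        { "%dil", "%sil", "%dl",  "%cl",  "%r8b", "%r9b" },  /*  8 bits */
    };
    int row;

    switch (bits) {
    case 64: row = 0; break;
    case 32: row = 1; break;
    case 16: row = 2; break;
    case 8:  row = 3; break;
    default: return NULL;
    }
    if (param < 1 || param > 6)
        return NULL;
    return regs[row][param - 1];
}
```

For example, a short passed as the third parameter is found in %dx, and a char passed as the fourth is found in %cl.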



Example: x86-64 Parameter Passing

Consider the following function:

long proc( long a, int b, short c, char d,
           long *pa, int *pb, short *pc, char *pd )
{
    *pa += a;
    *pb += b;
    *pc += c;
    *pd += d;
    return (a+b+c+d);
}

Translation to x86-64 assembly language:

proc:

/* parameters passed as follows:

a in %rdi (64 bits)

b in %esi (32 bits)

c in %dx (16 bits)

d in %cl (8 bits)

pa in %r8 (64 bits)

pb in %r9 (64 bits)

pc via stack (64 bits)

pd via stack (64 bits)

*/



Example: x86-64 Parameter Passing

Translation to x86-64 assembly language, cont.

pushq %rbp # prolog

movq %rsp, %rbp

movq 16(%rbp), %r10 # %r10 = pc in 16(%rbp)

movq 24(%rbp), %rax # %rax = pd in 24(%rbp)

addq %rdi, (%r8) # *pa += a

addl %esi, (%r9) # *pb += b

addw %dx, (%r10) # *pc += c

addb %cl, (%rax) # *pd += d

movslq %esi, %rax # %rax = sign-extend(%esi) = b

addq %rdi, %rax # %rax = a + b

movswq %dx, %rdx # %rdx = sign-extend(%dx) = c

addq %rdx, %rax # %rax = a + b + c

movsbq %cl, %rcx # %rcx = sign-extend(%cl) = d

addq %rcx, %rax # %rax = a + b + c + d = return value

movq %rbp, %rsp # epilog

popq %rbp

retq



x86-64 Procedure Summary

Heavy use of registers

Parameter passing

More temporaries since more registers

Minimal use of stack

Sometimes none

Allocate/deallocate entire block

Many tricky optimizations

What kind of stack frame to use

Various allocation techniques
