x87-based support for floating-point x86-64 instruction ... 8-x87 fpu an… · x87-based support...
TRANSCRIPT
Carnegie Mellon
1
Lecture 8 198:231 Intro to Computer Organization
x87-Based Support for Floating-Point x86-64 Instruction Set Architecture 198:231 Introduction to Computer Organization Lecture 8
Instructor:
Nicole Hynes
Carnegie Mellon
2
Lecture 8 198:231 Intro to Computer Organization
x87-Based Support for Floating-Point
Intel floating-point architecture: a historical perspective x87 FPU refers to the floating-point unit of Intel’s line of x86 processors.
Early Intel processors (8086, 80286, i386) had separate off-chip floating-point units (called 8087, 80287, i387, respectively).
8087 – first FPU to implement the IEEE 754 floating-point standard.
Starting with the i486, the FPU was integrated with the IA32 CPU chip.
With the introduction of Intel’s SSE (Streaming SIMD Extensions) in the Pentium 4 (2000), it became possible to implement floating-point operations using SSE instructions.
However, many compilers, including gcc, still generate x87 floating-point code by default.
Will only give an overview of the x87 FPU. For more information consult the following in the course website: Chap. 8, IA-32 Intel Architecture Software Developer’s Manual
X87-Based Support for Floating-Point
Carnegie Mellon
3
Lecture 8 198:231 Intro to Computer Organization
x87-Based Support for Floating-Point
x87 FPU registers Eight 80-bit data registers
Treated as a shallow stack.
Instructions implicitly push values onto, or pop values off, the FPU stack.
%st(0)(or simply %st) is the top of the stack.
%st(i) is i-th register “below” %st(0) in the stack.
Warning: When more than eight values are pushed onto the stack, the ones at the bottom simply disappear.
%st(7)
%st(6)
%st(5)
%st(4)
%st(3)
%st(2)
%st(1)
%st(0) (TOS)
80 bits
Carnegie Mellon
4
Lecture 8 198:231 Intro to Computer Organization
x87 FPU registers Floating-point values stored in IEEE extended-precision FP format (80 bits)
FP operations are performed in extended-precision.
Single-precision (32-bit) and double-precision (64-bit) values are converted to/from extended precision FP format.
x87-Based Support for Floating-Point
±1.fraction ×2exponent
b_exp s frac 80 bits
64 15 1
significand
1
Field # Bits Value Remarks
s 1 0 if number is positive; 1 if negative
b_exp 15 exponent + bias, where bias = 215-1 -1 = 214-1 = 16,383
biased exponent
frac 64 entire significand 1.fraction no hidden bit
Carnegie Mellon
5
Lecture 8 198:231 Intro to Computer Organization
x87-Based Support for Floating-Point Floating-point load instruction
Pushes a value on top of the FPU stack – value loaded will be in %st(0).
Value to be loaded must be a memory operand specified by an IA32 addressing mode.
Can load a single-precision, double-precision, or an integer value – automatically converted to extended-precision format.
Instruction fld %st(i)duplicates a stack value – pushes a copy of %st(i)on top of stack.
Instruction Source format Source location
flds addr Single-precision M4[addr]
fldl addr Double-precision M8[addr]
fldt addr Extended-precision M10[addr]
fildl addr Integer M4[addr]
fld %st(i) Extended-precision %st(i)
Mk[addr] means the k-byte value starting at addr
Carnegie Mellon
6
Lecture 8 198:231 Intro to Computer Organization
x87-Based Support for Floating-Point Floating-point store instruction
Stores value on top of FPU stack into memory at specified address addr; automatically converted to destination format.
Each instruction has “popping” and “non-popping” versions; e.g.,
fstps addr: pops top of FPU stack and stores popped value into memory at address addr.
fsts addr: stores – but does not pop – value on top of FPU stack into memory at address addr.
Instruction fst{p} %st(i) copies value on top of stack to %st(i); also has popping and non-popping versions.
Instruction Destination format Destination location
fst{p}s addr Single-precision M4[addr]
fst{p}l addr Double-precision M8[addr]
fst{p}t addr Extended-precision M10[addr]
fist{p}l addr Integer M4[addr]
fst{p} %st(i) Extended-precision %st(i)
Carnegie Mellon
7
Lecture 8 198:231 Intro to Computer Organization
x87-Based Support for Floating-Point
Basic floating-point arithmetic instructions
Instruction Effect
fldz Pushes 0.0 onto FPU stack
fld1 Pushes 1.0 onto FPU stack
fabs %st(0) ← |%st(0)|
fchs %st(0) ← -%st(0)
fsqrt %st(0) ← sqrt(%st(0))
fadd Operand 1 + Operand 2
fsub Operand 1 – Operand 2
fsubr Operand 2 – Operand 1
fmul Operand 1 × Operand 2
fdiv Operand 1 / Operand 2
fdivr Operand 2 / Operand 1
single operand %st(0)
two operands; many variants
Carnegie Mellon
8
Lecture 8 198:231 Intro to Computer Organization
x87-Based Support for Floating-Point
Illustration: variants of fsub instruction
Computes Destination ← Operand1 – Operand2
Similar variants for fadd, fmul, and fdiv
Instruction Operand 1 Operand 2 Format Destination Pop %st?
fsubs addr %st(0) M4[addr] Single %st(0) No
fsubl addr %st(0) M8[addr] Double %st(0) No
fsubt addr %st(0) M10[addr] Extended %st(0) No
fisubl addr %st(0) M4[addr] Integer %st(0) No
fsub %st(i),%st %st(i) %st(0) Extended %st(0) No
fsub %st,%st(i) %st(0) %st(i) Extended %st(i) No
fsubp %st,%st(i) %st(0) %st(i) Extended %st(i) Yes
fsubp %st(0) %st(1) Extended %st(1) Yes
Carnegie Mellon
9
Lecture 8 198:231 Intro to Computer Organization
x87-Based Support for Floating-Point
Example Compute x = (a-b)*(-b+c)using x87 FPU instructions.
Assume all variables are declared in the data segment as double.
Carnegie Mellon
10
Lecture 8 198:231 Intro to Computer Organization
x87-Based Support for Floating-Point
Using floating-point data in procedures With IA32, floating-point arguments are passed to a calling
procedure on the stack, just as are integer arguments.
Each parameter of type float requires 4 bytes of stack space, while each parameter of type double requires 8.
For functions whose return values are of type float or double, the result is returned on the top of the floating-point register stack in extended-precision format.
As an example, consider the following function:
double funct(double a, float x, double b, int i)
{
return a*x - b/i;
}
Carnegie Mellon
11
Lecture 8 198:231 Intro to Computer Organization
x87-Based Support for Floating-Point
Using floating-point data in procedures, cont. Parameters a, x, b, and i will be at byte offsets 8, 16, 20, and 28
relative to %ebp, respectively. Body of the generated code, and the resulting stack values are as
follows.
Carnegie Mellon
12
Lecture 8 198:231 Intro to Computer Organization
x87-Based Support for Floating-Point
Other features of x87 FPU Has floating-point compare and test instructions that set the
condition codes in the EFLAGS register.
Can be used to perform conditional branches just like to integer case.
Has many other floating-point instructions that perform trigonometric, logarithmic, exponential, and scaling operations.
For details, see Chapter 8, IA-32 Intel Architecture Software Developer’s Manual on the C & Assembly section of the course website.
Carnegie Mellon
13
Lecture 8 198:231 Intro to Computer Organization
Intel’s 64-Bit Intel Attempted Radical Shift from IA32 to IA64
Totally different architecture (Itanium) - VLIW
Executes IA32 code only as legacy
Performance disappointing
AMD Stepped in with Evolutionary Solution x86-64 (now called “AMD64”)
Intel Felt Obligated to Focus on IA64 Hard to admit mistake or that AMD is better
2004: Intel Announces EM64T extension to IA32 Extended Memory 64-bit Technology
Almost identical to x86-64!
All but low-end x86 processors support x86-64 But, lots of code still runs in 32-bit mode
Carnegie Mellon
14
Lecture 8 198:231 Intro to Computer Organization
Overview of x86-64
Main Features Pointers and long integers are 64 bits long. Integer arithmetic
operations support 8, 16, 32, and 64-bit data types.
The set of general-purpose registers is expanded from 8 to 16.
Much of the program state is held in registers rather than on the stack.
Integer and pointer procedure arguments (up to 6) are passed via registers. Some procedures do not need to access the stack at all.
Conditional operations are implemented using conditional move instructions when possible, yielding better performance than traditional branching code.
Floating-point operations are implemented using a register-oriented instruction set, rather than the stack-based approach supported by IA32.
The program counter is named %rip (instead of %eip in IA32).
Carnegie Mellon
15
Lecture 8 198:231 Intro to Computer Organization
Twice the number of registers; each 64 bits. Accessible as 8, 16, 32, 64 bits Make %ebp/%rbp general purpose
%rsp
%eax
%ebx
%ecx
%edx
%esi
%edi
%esp
%ebp
%r8d
%r9d
%r10d
%r11d
%r12d
%r13d
%r14d
%r15d
%r8
%r9
%r10
%r11
%r12
%r13
%r14
%r15
%rax
%rbx
%rcx
%rdx
%rsi
%rdi
%rbp
x86-64 Integer Registers
Carnegie Mellon
16
Lecture 8 198:231 Intro to Computer Organization
x86-64 Integer Registers
Description The number of registers has been doubled to 16. The new registers are
numbered 8-15.
All registers are 64 bits long. The 64-bit extensions of the IA32 registers are named %rax, %rcx, %rdx, %rbx, %rsi, %rdi, %rsp, and %rbp. The new registers are named %r8–%r15.
The low-order 32 bits of each register can be accessed directly. This gives us the familiar registers from IA32: %eax, %ecx, %edx, %ebx, %esi, %edi, %esp, and %ebp, as well as eight new 32-bit registers: %r8d–%r15d.
The low-order 16 bits of each register can be accessed directly, as is the case for IA32. The word-size versions of the new registers are named %r8w–%r15w.
The low-order 8 bits of each register can be accessed directly. This is true in IA32 only for the first 4 registers (%al, %cl, %dl, %bl). The byte-size versions of the other IA32 registers are named %sil, %dil, %spl, and %bpl. The byte-size versions of the new registers are named %r8b–%r15b.
Carnegie Mellon
17
Lecture 8 198:231 Intro to Computer Organization
Instructions
Adds a new integer data type: quad word - 8 bytes (64 bits)
suffix q
C long int and pointer variables map to quad word
Arithmetic and logical instructions extended to quad word: movl ➙ movq
addl ➙ addq
sall ➙ salq
etc.
For 32-bit instructions that generate 32-bit results: Set higher order bits of destination register to 0
Example: addl
Carnegie Mellon
18
Lecture 8 198:231 Intro to Computer Organization
Data Representations: IA32 + x86-64
C declaration Intel data type Assembly
code suffix IA32 size (bytes)
x86-64 size (bytes)
char Byte b 1 1
short Word w 2 2
int Double word l 4 4
long int Quad word q 4 8
long long int Quad word q 8 8
char * (pointer) Quad word q 4 8
float Single precision s 4 4
double Double precision d 8 8
long double Extended precision t 10/12 10/16
Carnegie Mellon
19
Lecture 8 198:231 Intro to Computer Organization
64-bit Data Movement Instructions
Instruction Description Effect
movabsq imm, reg Move absolute quad word reg ← imm
movq src,dest Move quad word dest ← src
movsbq src,dest Move sign-extended byte dest ← SignExtend(src)
movswq src,dest Move sign-extended word dest ← SignExtend(src)
movslq src,dest Move sign-extended long dest ← SignExtend(src)
movzbq src,dest Move zero-extended byte dest ← ZeroExtend(src)
movzwq src,dest Move zero-extended word dest ← ZeroExtend(src)
pushq src Push quad word %rsp ← %rsp – 8; M[%rsp] ← src
popq dest Pop quad word dest ← M[%rsp]; %rsp ← %rsp + 8
Carnegie Mellon
20
Lecture 8 198:231 Intro to Computer Organization
64-bit Arithmetic and Logical Instructions
Instruction Description Effect
leaq src,dest Load effective address dest← &src
incq dest Increment dest ← dest + 1
decq dest Decrement dest ← dest – 1
negq dest Negate dest ← –dest
notq dest Complement dest ← ~dest
addq src,dest Add dest ← dest + src
subq src,dest Subtract dest ← dest – src
imulq src,dest Multiply dest ← dest * src
xorq src,dest Exclusive or dest ← dest ^ src
orq src,dest Or dest ← dest | src
andq src,dest And dest ← dest & src
salq k,dest Left shift dest ← dest << k
shlq k,dest Left shift (same as salq) dest ← dest << k
sarq k,dest Arithmetic right shift dest ← dest >> k
shrq k,dest Logical right shift dest ← dest >> k
Carnegie Mellon
21
Lecture 8 198:231 Intro to Computer Organization
Special 64-Bit Arithmetic Instructions
Instruction Description Effect
imulq src Signed full multiply R[%rdx]:R[%rax] ← src × R[%rax]
mulq src Unsigned full multiply R[%rdx]:R[%rax] ← src × R[%rax]
cltq Convert %eax to quad word R[%rax] ← SignExtend(R[%eax])
cqto Convet %rax to oct word R[%rdx]:R[%rax] ← SignExtend(R[%eax])
idivq src Signed divide R[%rdx] ← R[%rdx]:R[%rax] mod src (rem) R[%rax] ← R[%rdx]:R[%rax] ÷ S (quotient)
divq src Unsigned divide R[%rdx] ← R[%rdx]:R[%rax] mod src (rem) R[%rax] ← R[%rdx]:R[%rax] ÷ S (quotient)
Compare and Test Quad Word
Instruction Description Effect
cmpq src2, src1 Compare quad word Compute src1 – src2 and set condition codes based on result
testq src2, src1 Test quad word Compute src1 & src2 and set condition codes based on result
Carnegie Mellon
22
Lecture 8 198:231 Intro to Computer Organization
64-Bit Call and Return Instructions
Instruction pointer (aka program counter) is now a 64-bit register %rip (instead of %eip in IA32).
Callq acts like call: it pushes %rip (which contains return address) onto stack then jumps to address given by argument.
Retq acts like ret: it pops %rip (which should contain the return address of caller) from the stack, so next instruction to be executed is at the address stored in %rip.
When using gcc to compile to x86-64 architecture, call or ret (without the suffix q) will be interpreted as callq and retq.
Instruction Description Effect
callq label Procedure call pushq %rip
jmp label
callq *operand Procedure call pushq %rip
jmp *operand
retq Return from call popq %rip
Carnegie Mellon
23
Lecture 8 198:231 Intro to Computer Organization
Conditional Move Instructions
Copy source to destination when the move condition is true (based on condition codes)
src = memory or register; dest = register
Works for 16, 32, or 64-bit operands – no suffix required (length of operand inferred from size of destination register)
Instruction Synonym Description Move condition
cmove src, dest cmovz Move if equal/zero ZF
cmovne src, dest cmovnz Move is not equal/not zero ~ZF
cmovs src, dest Move if egative SF
cmovns src, dest Move if nonnegative ~SF
cmovg src, dest cmovnle Move if greater (signed >) ~(SF^OF) & ~ZF
cmovege src, dest cmovnl Move if greater or equal (signed >=) ~(SF^OF)
cmovl src, dest cmovnge Move if less (signed <) SF^OF
cmovle src, dest cmovng Move if less or equal (signed <=) (SF^OF) | ZF
cmova src, dest cmovnbe Move if above (unsigned >) ~CF & ~ZF
cmovae src, dest cmovnb Move if above or equal (unsigned >=) ~CF
cmovb src, dest cmovnae Move if below (unsigned <) CF
cmovbe src, dest cmovna Move if below or equal (unsigned <=) CF | ZF
Carnegie Mellon
24
Lecture 8 198:231 Intro to Computer Organization
Assembling and Running x86-64 Assembly Lang. Programs
Recall: to assemble an IA32 assembly language program:
-m32 flag tells gcc that program is in IA32 assembly language
To assemble an x86-64 assembly language program:
-m64 flag tells gcc that program is in x86-64 assembly language
On clam, may omit –m64 flag because by default gcc assumes program is in x86-64 assembly language
Run the executable program as usual:
% gcc –m32 –o prog prog_IA32.s
% ./prog
% gcc –m64 –o prog prog_IA32.s
Carnegie Mellon
25
Lecture 8 198:231 Intro to Computer Organization
Example: IA32 vs. x86-64 Assembly
Consider the following C program:
Translation to IA-32 assembly language:
/* sample4.c */
#include <stdio.h>
long int a = 15, b = 10, c = 20;
long int max;
int main () {
max = a > b ? a : b;
max = c > max ? c : max;
printf("max = %ld\n", max);
return 0;
}
/* sample4_IA32.s */
.data
.align 4
a: .long 15
b: .long 10
c: .long 20
.comm max,4,4
.section .rodata
.LC0: .string "max = %ld\n"
Carnegie Mellon
26
Lecture 8 198:231 Intro to Computer Organization
Example: IA32 vs. x86-64 Assembly
Translation to IA-32 assembly language, cont.
.text
.globl main
main:
pushl %ebp # prolog
movl %esp, %ebp
movl b, %eax # %eax = b
movl a, %edx # %edx = a
cmpl %edx, %eax # compare b:a
jge .L2 # if b >= a go to .L2
movl %edx, %eax # b < a: %eax = a
.L2:
movl %eax, max # max = maximum{a,b}
movl max, %eax # %eax = max
movl c, %edx # %edx = c
cmpl %edx, %eax # compare max:c
jge .L3 # if max >= c go to .L3
movl %edx, %eax # max < c: %eax = c
.L3:
movl %eax, max # max = maximum{max, c}
pushl %eax # push printf parms onto stack
pushl $.LC0
call printf # call printf
addl $8, %esp # deallocate parms from stack
movl $0, %eax # return 0
leave # epilog
ret
Carnegie Mellon
27
Lecture 8 198:231 Intro to Computer Organization
Example: IA32 vs. x86-64 Assembly
Translation to x86-64 assembly language:
/* sample4_x86-64.s */
.data
.align 8
a: .quad 15
b: .quad 10
c: .quad 20
.comm max,8,8
.section .rodata
.LC0: .string "max = %ld\n"
.text
.globl main
long int is quad word (8 bytes) in
x86-84
Carnegie Mellon
28
Lecture 8 198:231 Intro to Computer Organization
Example: IA32 vs. x86-64 Assembly
Translation to x86-64 assembly language, cont.
.text
.globl main
main:
pushq %rbp # prolog
movq %rsp, %rbp
movq b, %rdx # %rdx = b
movq a, %rax # %rax = a
cmpq %rax, %rdx # compare b:a
cmovge %rdx, %rax # if (b >= a) %rax = b
movq %rax, max # max = maximum{a, b}
movq max, %rdx # %rdx = max
movq c, %rax # %rax = c
cmpq %rax, %rdx # compare max:c
cmovge %rdx, %rax # if (max >= c) %rax = max
movq %rax, max # max = maximum{max, c}
movq max, %rax # %rax = max
movq %rax, %rsi # %rsi = second parm
movl $.LC0, %edi # %rdi = first parm
movl $0, %eax # %eax = 0
callq printf # call printf
movl $0, %eax # return 0
movq %rbp, %rsp # epilog
popq %rbp
retq
parameters to printf passed
via registers!
Note use of conditional
move instructions
Carnegie Mellon
29
Lecture 8 198:231 Intro to Computer Organization
x86-64 Register Usage Conventions
%rax
%rbx
%rcx
%rdx
%rsi
%rdi
%rsp
%rbp
%r8
%r9
%r10
%r11
%r12
%r13
%r14
%r15 Callee saved Callee saved
Callee saved
Callee saved
Callee saved
Caller saved
Callee saved
Stack pointer
Caller Saved
Return value
Parameter #4
Parameter #1
Parameter #3
Parameter #2
Parameter #6
Parameter #5
Carnegie Mellon
30
Lecture 8 198:231 Intro to Computer Organization
x86-64 Parameter Passing
First six integral parameters are passed via registers If more than 6 integral parameters, pass the rest via stack
These registers can also be used as caller-saved registers
Integral return value passed via %rax %eax if 32 bits; %ax if 16 bits; %al if 8 bits
Can also be used as a caller-saved register
Operand size
Parameter Number
(bits) 1 2 3 4 5 6
64 %rdi %rsi %rdx %rcx %r8 %r9
32 %edi %esi %edx %ecx %r8d %r9d
16 %di %si %dx %cx %r8w %r9w
8 %dil %sil %dl %cl %r8b %r9b
Carnegie Mellon
31
Lecture 8 198:231 Intro to Computer Organization
Example: x86-64 Parameter Passing
Consider the following function:
Translation to x86-64 assembly language:
long proc( long a, int b, short c, char d,
long *pa, int *pb, short *pc, char *pd )
{
*pa += a;
*pb += b;
*pc += c;
*pd += d;
return (a+b+c+d);
}
proc:
/* parameters passed as follows:
a in %rdi (64 bits)
b in %esi (32 bits)
c in %dx (16 bits)
d in %cl (8 bits)
pa in %r8 (64 bits)
pb in %r9 (64 bits)
pc via stack (64 bits)
pd via stack (64 bits)
*/
Carnegie Mellon
32
Lecture 8 198:231 Intro to Computer Organization
Example: x86-64 Parameter Passing
Translation to x86-64 assembly language, cont.
pushq %rbp # prolog
movq %rsp, %rbp
movq 16(%rbp), %r10 # %r10 = pc in 16(%rbp)
movq 24(%rsp), %rax # %rax = pd in 24(%rbp)
addq %rdi, (%r8) # *pa += a
addl %esi, (%r9) # *pb += b
addw %dx, (%r10) # *pc += c
addb %cl, (%rax) # *pd += d
movslq %esi, %rax # %rax = sign-extend(%esi) = b
addq %rdi, %rax # %rax = a + b
movswq %dx, %rdx # %rdx = sign-extend(%dx) = c
addq %rdx, %rax # %rax = a + b + c
movsbq %cl, %rcx # %rcx = sign-extend(%cl) = d
addq %rcx, %rax # %rax = a + b + c + d = return value
movq %rbp, %rsp # epilog
popq %rbp
retq
Carnegie Mellon
33
Lecture 8 198:231 Intro to Computer Organization
x86-64 Procedure Summary
Heavy use of registers Parameter passing
More temporaries since more registers
Minimal use of stack Sometimes none
Allocate/deallocate entire block
Many tricky optimizations What kind of stack frame to use
Various allocation techniques