characteristics of amd barcelona floating point execution
TRANSCRIPT
Characteristics of AMD Barcelona Floating Point Execution
Cray User Group
Ben Bales and Richard Barrett
Computer Science and
Mathematics Division
Oak Ridge National Laboratory
Managed by UT-Battelle for theU. S. Department of Energy
Cray User Group
Atlanta, GA
May 7, 2009
Subjects
•Definitions
•Barcelona Overview
•Branch Predictor + Conditional Moves
•Software Pipelining
Managed by UT-Battelle for theU. S. Department of Energy
•Software Pipelining
•Vector Operations
•How to Take Advantage of Given Optimizations
SIMD and Vector
1 2 3 4
1 2 3 4
1 2 3 4
1 2 3 4
+ +
SIMD Vector
Managed by UT-Battelle for theU. S. Department of Energy
1 2 3 4
1 2 3 4
1 2 3 4
1 2 3 4
= =
Software Pipelining
In-Order SchedulerMultiply
Multiply
Multiply
Add
Multiply
Subtract
Lookahead
Out-of-Order SchedulerMultiply
Multiply
Multiply
Add
Multiply
Subtract
Lookahead
Managed by UT-Battelle for theU. S. Department of Energy
Subtract
Add
Add
Add
Subtract
Add
Add
Add
Software Pipelining references attempts to manually
guarantee the existence of superscalar code within
the Lookahead window.
Barcelona OverviewThree level cache architecture
•L1/L2 – High speed computational caches (256 and 128bits
per cycle respectively)
SIMD Instruction Set
Managed by UT-Battelle for theU. S. Department of Energy
SIMD Instruction Set
•Limited “Vector” operations
•Three SIMD pipelines
This presentation focuses on single precision
Branches
Four sub-topics
•SSE conditional moves
•Integer comparators
•Branch prediction mechanism
Managed by UT-Battelle for theU. S. Department of Energy
•Branch prediction mechanism
•Case study: CORDIC Arctangent
SSE Conditional Moves
Branches that can be expressed as moves
conditional on simple floating point comparisons
SSE friendly
if(a < b) {
Managed by UT-Battelle for theU. S. Department of Energy
if(a < b) {
sgn = 1.0;
} else {
sgn = -1.0;
}
ans = c + sgn * d;
if(a < b) {
ans = c + d;
} else {
ans = c – d;
}
Integer Comparison Unit
As recommended in the AMD 10h optimization guide,
floating point comparisons can be performed in
integer units (chart taken from guide):
Comparison Against Zero
Managed by UT-Battelle for theU. S. Department of Energy
Comparison Against Zero
Integer Comparison Replaces
if(*((unsigned int *)&f) > 0x8000000U) if(f < 0.0f)
if(*((int *)&f) <= 0) if(f <= 0.0f)
if(*((int *)&f) > 0) if(f > 0.0f)
if(*((unsigned int *)&f) <= 0x8000000U) if(f >= 0.0f)
Branch Predictor
Dynamic branches predicted based on 2-bit Global
History Bimodal Counters
Value: 0 1 2 3
Action: No
Branch
No
Branch
Branch Branch
Managed by UT-Battelle for theU. S. Department of Energy
Branch Branch
Counters updated as branch conditionals evaluated
Branch Predictor cont.
Heuristic improved by addressing GHBC with Global
Branch History
Global Branch History GHBC
0 1 0 3
1 0 1 0
Managed by UT-Battelle for theU. S. Department of Energy
1 0 1 0
Unfortunately, loops waste half of the available bits!
Branch Predictor cont.
Dots – Instruction address
Red – Loop branch bits
… 0 1 0 1 0 1 1 1 1 1 0 1
Managed by UT-Battelle for theU. S. Department of Energy
Green – Legitimate branch bits
Unrolling loops can help improve the Legitimate/Loop
ratio and make the GHBCs work better
Branch Benchmarks
10
12
14
16
Clo
ck C
ycle
s/B
ran
ch
Branch Speed Comparison
Branch
Block Branch
Managed by UT-Battelle for theU. S. Department of Energy
0
2
4
6
8
Random Non-Random
Clo
ck C
ycle
s/B
ran
ch
Block Branch
Int Branch
Int Block Branch
SSE Conditional Move (x1)
SSE Conditional Move (x4)
CORDIC Arctangent
•CORDIC algorithms popular for implementing trig
functions with limited hardware
•Basic algorithm is SSE friendly, but includes a tricky
conditional move
Managed by UT-Battelle for theU. S. Department of Energy
conditional move
CORDIC Arctangent (SIMD)for(i = 0; i < N; i++) {
if(y < 0.0f) {
s = 1.0f;
} else {
s = -1.0f;
}
n_x = x – s * y * powf(2.0f, -(float)i); //powf and atanf are
n_y = y + s * x * powf(2.0f, -(float)i); //implemented in
Managed by UT-Battelle for theU. S. Department of Energy
n_y = y + s * x * powf(2.0f, -(float)i); //implemented in
n_z = z – s * atanf(powf(2.0f, -(float)i); //lookup tables
x = n_x;
y = n_y;
z = n_z;
}
CORDIC Arctangent (Vector)for(i = 0; i < N; i++) {
if(y < 0.0f) {
s = 1.0f;
} else {
s = -1.0f;
}
n_x = x + (-s) * y * powf(2.0f, -(float)i) ;
n_y = y + s * x * powf(2.0f, -(float)i) ;
Managed by UT-Battelle for theU. S. Department of Energy
n_y = y + s * x * powf(2.0f, -(float)i) ;
n_z = z + (-s) * 1.0 * atanf(powf(2.0f, -(float)i) ;
x = n_x;
y = n_y;
z = n_z;
}
CORDIC Performance
15.00
20.00
25.00
Clo
ck C
ycle
s/Ite
rati
on
Cordic Contional Performance
Single Element, Parallel
Managed by UT-Battelle for theU. S. Department of Energy
0.00
5.00
10.00
Conditional Moves Floating Point Branches Integer Branches
Clo
ck C
ycle
s/Ite
rati
on
Single Element, Parallel
Single Element, Vector
Pipelined, Vector
CORDIC Performance Cont.
6.00
7.00
8.00
9.00
Clo
ck C
ycle
s/Ite
rati
on
Cordic Conditional Performance cont.
Managed by UT-Battelle for theU. S. Department of Energy
0.00
1.00
2.00
3.00
4.00
5.00
Clo
ck C
ycle
s/Ite
rati
on
Pipelined, Vector
Full SIMD, Parallel
Full SIMD, Pipelined, Parallel
Software Pipelining
•Out-of-Order scheduling is built around the idea of
automatically pipelining code
•Out-of-Order scheduler is effective while multiple
independent blocks of code appear in its look ahead
Managed by UT-Battelle for theU. S. Department of Energy
independent blocks of code appear in its look ahead
window
•Questions:
•How far ahead does the scheduler look?
•How effective is “effective”?
Out-of-Order Scheduler
13
14
15
16
17
18
19
GF
OP
S
Blocked vs. Shuffled Instructions
Blocked
Shuffled
Managed by UT-Battelle for theU. S. Department of Energy
Memory Dependent Tests
Blocked Shuffled Gain
15.10GFLOPS 16.00GFLOPS 5.62%
10
11
12
0 20 40 60 80 100 120
Instruction Block Size
Software Pipelining
•Scheduler is quite effective as long as instructions
are available in its window
•Instructions easily made “available” by two
techniques:
Managed by UT-Battelle for theU. S. Department of Energy
techniques:
•Unrolling loops
•Staging loops
Software Pipelining
Stage 1A
Stage 2A
Stage 1B
Stage 2B
…
Stage Stage Stage Stage …
Original
Unrolled
Managed by UT-Battelle for theU. S. Department of Energy
Stage 1A
Stage 1B
Stage 2A
Stage 2B
…
Stage 1A
Stage 1B
…Stage
2AStage
2B…
Staged
Software Pipelining
12.00
14.00
16.00
18.00
20.00
Clo
ck C
ycle
s/V
alu
e
Sine Performance
Staged, Blocked, Pipelined
Staged, Shuffled, Pipelined
Managed by UT-Battelle for theU. S. Department of Energy
0.00
2.00
4.00
6.00
8.00
10.00
Clo
ck C
ycle
s/V
alu
e
Non-Staged, Blocked, Pipelined
Non-Staged, Shuffled, Pipelined
Staged, Non-Pipelined
Non-Staged, Non-Pipelined
Vector Optimizations
•Some codes pre-shuffle values to make vector
operations fit better to SIMD instruction sets
•For multiplication on complex numbers we can trade
a dependency and a shuffle for a load
Managed by UT-Battelle for theU. S. Department of Energy
a dependency and a shuffle for a load
Vector Optimizations
1.4
1.45
1.5
1.55
1.6
Av
era
ge C
lock C
ycle
s/C
om
ple
x O
p
Effects of Buffering (Barcelona)
Managed by UT-Battelle for theU. S. Department of Energy
1.2
1.25
1.3
1.35
1.4
0 2000 4000 6000 8000 10000 12000
Av
era
ge C
lock C
ycle
s/C
om
ple
x O
p
Buffer Block Size (floats)
Regular
Buffered
Vector Optimizations
2.300
2.500
2.700
2.900
Av
era
ge C
lock C
ycle
s/C
om
ple
x O
p
Effects of Buffering (Athlon)
Managed by UT-Battelle for theU. S. Department of Energy
1.700
1.900
2.100
2.300
0 2000 4000 6000 8000 10000 12000
Av
era
ge C
lock C
ycle
s/C
om
ple
x O
p
Buffer Block Size (floats)
Regular
Buffered
Vector Optimizations
20
30
40
Perc
en
t G
ain
of
Bu
ffere
d I
mp
lem
en
tati
on
Comparison of Buffering Importance
Managed by UT-Battelle for theU. S. Department of Energy
-20
-10
0
10
0 2000 4000 6000 8000 10000 12000
Perc
en
t G
ain
of
Bu
ffere
d I
mp
lem
en
tati
on
Buffer Block Size (floats)
Barcelona
Athlon
How to Use This Stuff
•All these tests were written in C with the help of the
Intel Assembly Intrinsics
•Unfortunately, the intrinsics do not work well on
PGI or Pathscale compilers (they do work in GNU,
Managed by UT-Battelle for theU. S. Department of Energy
PGI or Pathscale compilers (they do work in GNU,
Intel, and Microsoft ones)
•Intrinsics, while better than assembly, are clunky
at best
How to Use This Stufffor(i = 0; i < ITERS; i++) {
n_two = _mm_set1_ps(-2.0f);
m = (__m128)_mm_shuffle_epi32((__m128i)v1, 0x0);
m = _mm_cmpgt_ps(m, zero);
n_two = _mm_and_ps(n_two, m);
n_two = _mm_add_ps(n_two, p_one);
Managed by UT-Battelle for theU. S. Department of Energy
n_v = (__m128)_mm_shuffle_epi32((__m128i)v1, 0xF1);
r1 = _mm_load_ps(&lin_lookup[i * 4]);
n_v = _mm_mul_ps(n_v, r1);
n_v = _mm_mul_ps(n_two, n_v);
v1 = _mm_add_ps(v1, n_v);
}
Conclusions
•Conditionals
•Unroll slow conditional loops
•Use integer comparisons
•Code for conditional moves
Managed by UT-Battelle for theU. S. Department of Energy
•Code for conditional moves
•Pipeline operations explicitly
•Separate staged computations
•Don’t worry about vector operations
Square Root/Reciprocal Time
15
20
25
Clo
ck C
ycl
es/
Op
era
tio
n
Square Root/Reciprocal Time
Managed by UT-Battelle for theU. S. Department of Energy
0
5
10
GNU RCP GNU SQRT SSE DIV RCP SSE RCP SSE SQRT
Clo
ck C
ycl
es/
Op
era
tio
n
Barcelona
Athlon