Characteristics of AMD Barcelona Floating Point Execution

Cray User Group

Ben Bales and Richard Barrett

Computer Science and Mathematics Division

Oak Ridge National Laboratory

Managed by UT-Battelle for the U.S. Department of Energy

Atlanta, GA

May 7, 2009

Subjects

•Definitions

•Barcelona Overview

•Branch Predictor + Conditional Moves

•Software Pipelining


•Vector Operations

•How to Take Advantage of Given Optimizations

SIMD and Vector

[Figure: two diagrams, labeled SIMD and Vector, each combining four-element operands (1 2 3 4) with + to produce a result.]

Software Pipelining

[Figure: an instruction stream of Multiply, Add, and Subtract operations shown under an In-Order Scheduler and under an Out-of-Order Scheduler, each with its Lookahead window marked.]

Software pipelining refers to manually arranging code so that independent instructions, which can issue superscalar, are guaranteed to be present within the Lookahead window.

Barcelona Overview

Three-level cache architecture

•L1/L2 – High-speed computational caches (256 and 128 bits per cycle, respectively)

SIMD Instruction Set

•Limited “Vector” operations

•Three SIMD pipelines

This presentation focuses on single precision

Branches

Four sub-topics

•SSE conditional moves

•Integer comparators

•Branch prediction mechanism


•Case study: CORDIC Arctangent

SSE Conditional Moves

Branches that can be expressed as moves conditional on simple floating point comparisons are SSE friendly.

SSE friendly form:

if(a < b) {
    sgn = 1.0;
} else {
    sgn = -1.0;
}
ans = c + sgn * d;

Equivalent branching form:

if(a < b) {
    ans = c + d;
} else {
    ans = c - d;
}
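The conditional above can be sketched with SSE intrinsics as a compare that produces a mask, followed by a bitwise select. This is a minimal illustration, not the code used for the benchmarks in these slides; the function names `select_add_sub` and `select_add_sub_array` are hypothetical, and the array length is assumed to be a multiple of 4.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* For each of the four lanes: ans = (a < b) ? c + d : c - d, with no branch. */
static __m128 select_add_sub(__m128 a, __m128 b, __m128 c, __m128 d)
{
    __m128 mask = _mm_cmplt_ps(a, b);   /* all-ones in lanes where a < b */
    __m128 sum  = _mm_add_ps(c, d);     /* result if the condition holds */
    __m128 diff = _mm_sub_ps(c, d);     /* result otherwise */
    /* (mask & sum) | (~mask & diff) picks one result per lane */
    return _mm_or_ps(_mm_and_ps(mask, sum), _mm_andnot_ps(mask, diff));
}

/* Hypothetical array wrapper; n is assumed to be a multiple of 4. */
static void select_add_sub_array(const float *a, const float *b,
                                 const float *c, const float *d,
                                 float *ans, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 r = select_add_sub(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i),
                                  _mm_loadu_ps(c + i), _mm_loadu_ps(d + i));
        _mm_storeu_ps(ans + i, r);
    }
}
```

Both candidate results are computed for every lane, so the cost is fixed regardless of how predictable the condition is.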

Integer Comparison Unit

As recommended in the AMD Family 10h optimization guide, floating point comparisons can be performed in the integer units (chart taken from the guide):

Comparison Against Zero

Integer Comparison                          Replaces
if(*((unsigned int *)&f) > 0x80000000U)     if(f < 0.0f)
if(*((int *)&f) <= 0)                       if(f <= 0.0f)
if(*((int *)&f) > 0)                        if(f > 0.0f)
if(*((unsigned int *)&f) <= 0x80000000U)    if(f >= 0.0f)
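The comparisons in the chart can be wrapped in small helpers. The sketch below is one way to write them in portable C, using `memcpy` instead of the pointer cast to stay clear of strict-aliasing problems; the helper names are hypothetical, and a 32-bit IEEE 754 `float` is assumed.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret the bits of a float as an unsigned 32-bit integer. */
static uint32_t float_bits_u(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Reinterpret the bits of a float as a signed 32-bit integer. */
static int32_t float_bits_s(float f)
{
    int32_t s;
    memcpy(&s, &f, sizeof s);
    return s;
}

/* Integer-unit replacements for comparisons against zero. */
static int lt_zero(float f) { return float_bits_u(f) >  0x80000000U; } /* f <  0.0f */
static int le_zero(float f) { return float_bits_s(f) <= 0; }           /* f <= 0.0f */
static int gt_zero(float f) { return float_bits_s(f) >  0; }           /* f >  0.0f */
static int ge_zero(float f) { return float_bits_u(f) <= 0x80000000U; } /* f >= 0.0f */
```

These helpers agree with the floating point comparisons for ordinary values and for both signed zeros, but not for NaN inputs: a NaN with a clear sign bit makes `gt_zero` return 1, whereas `f > 0.0f` is false.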

Branch Predictor

Dynamic branches predicted based on 2-bit Global

History Bimodal Counters

Value:    0            1            2         3
Action:   No Branch    No Branch    Branch    Branch

Counters updated as branch conditionals evaluated
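A 2-bit counter of this kind can be modeled in a few lines. The sketch below assumes the usual saturating update (increment on a taken branch, decrement on a not-taken branch), which the slides do not spell out; the names `counter_predict` and `counter_update` are hypothetical.

```c
/* A 2-bit counter holding a value in 0..3. */
typedef unsigned char counter2;

/* Values 0 and 1 predict "no branch"; values 2 and 3 predict "branch". */
static int counter_predict(counter2 c)
{
    return c >= 2;
}

/* Assumed update rule: saturating increment when the branch is taken,
   saturating decrement when it is not. */
static counter2 counter_update(counter2 c, int taken)
{
    if (taken) {
        return (counter2)(c < 3 ? c + 1 : 3);
    }
    return (counter2)(c > 0 ? c - 1 : 0);
}
```

With this rule a counter sitting at 3 needs two consecutive not-taken outcomes before its prediction flips, so a single surprise does not disturb a strongly biased branch.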

Branch Predictor cont.

Heuristic improved by addressing GHBC with Global

Branch History

[Figure: a Global Branch History bit pattern used as the address into the table of GHBC values.]

Unfortunately, loops waste half of the available bits!
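To make the addressing scheme concrete, here is a small software model in which the most recent branch outcomes select which 2-bit counter is consulted. It is only a sketch: the 4-bit history width, the initial counter value, and every name in it are assumptions made for illustration, not details of the Barcelona hardware.

```c
#include <stdint.h>

#define HISTORY_BITS 4u                      /* assumed width, for illustration */
#define TABLE_SIZE   (1u << HISTORY_BITS)

typedef struct {
    uint8_t  counters[TABLE_SIZE];  /* 2-bit counters, values 0..3 */
    uint32_t history;               /* most recent outcome in bit 0 */
} ghbc_predictor;

static void ghbc_init(ghbc_predictor *p)
{
    for (unsigned i = 0; i < TABLE_SIZE; i++) {
        p->counters[i] = 1;         /* start at "weakly no branch" */
    }
    p->history = 0;
}

/* Predict using the counter addressed by the recent branch history. */
static int ghbc_predict(const ghbc_predictor *p)
{
    return p->counters[p->history & (TABLE_SIZE - 1u)] >= 2;
}

/* Update the addressed counter, then shift the outcome into the history. */
static void ghbc_update(ghbc_predictor *p, int taken)
{
    uint8_t *c = &p->counters[p->history & (TABLE_SIZE - 1u)];
    if (taken) {
        if (*c < 3) (*c)++;
    } else {
        if (*c > 0) (*c)--;
    }
    p->history = (p->history << 1) | (taken ? 1u : 0u);
}

/* Count mispredictions over a sequence of outcomes (1 = taken). */
static int ghbc_count_misses(const int *outcomes, int n)
{
    ghbc_predictor p;
    int misses = 0;
    ghbc_init(&p);
    for (int i = 0; i < n; i++) {
        if (ghbc_predict(&p) != (outcomes[i] != 0)) misses++;
        ghbc_update(&p, outcomes[i]);
    }
    return misses;
}
```

In this model a strictly alternating pattern is learned after a short warm-up, which a single counter cannot do. It also shows the cost noted above: every loop branch that lands in the history takes a bit that could otherwise describe the branch of interest.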

Branch Predictor cont.

[Figure: an instruction address (dots) followed by the history bits … 0 1 0 1 0 1 1 1 1 1 0 1, with loop branch bits marked in red and legitimate branch bits marked in green.]

Unrolling loops can help improve the Legitimate/Loop

ratio and make the GHBCs work better

Branch Benchmarks

[Figure: Branch Speed Comparison. Y axis: clock cycles per branch (0 to 16). X axis: Random and Non-Random. Series: Branch, Block Branch, Int Branch, Int Block Branch, SSE Conditional Move (x1), SSE Conditional Move (x4).]

CORDIC Arctangent

•CORDIC algorithms popular for implementing trig

functions with limited hardware

•Basic algorithm is SSE friendly, but includes a tricky

conditional move


CORDIC Arctangent (SIMD)

for(i = 0; i < N; i++) {
    if(y < 0.0f) {
        s = 1.0f;
    } else {
        s = -1.0f;
    }
    // powf and atanf are implemented in lookup tables
    n_x = x - s * y * powf(2.0f, -(float)i);
    n_y = y + s * x * powf(2.0f, -(float)i);
    n_z = z - s * atanf(powf(2.0f, -(float)i));
    x = n_x;
    y = n_y;
    z = n_z;
}

CORDIC Arctangent (Vector)

for(i = 0; i < N; i++) {
    if(y < 0.0f) {
        s = 1.0f;
    } else {
        s = -1.0f;
    }
    // Every update has the same shape: a + sign * b * c
    n_x = x + (-s) * y * powf(2.0f, -(float)i);
    n_y = y + s * x * powf(2.0f, -(float)i);
    n_z = z + (-s) * 1.0f * atanf(powf(2.0f, -(float)i));
    x = n_x;
    y = n_y;
    z = n_z;
}
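For reference, the loop above can be turned into a small self-contained scalar routine. The sketch below is an illustration under stated assumptions rather than the benchmarked code: the name `cordic_atan` is hypothetical, the iteration count is fixed at 16, the input is assumed to satisfy x > 0, and the arctangent lookup table is written out by hand while the power of two is tracked by halving a scale factor.

```c
#define CORDIC_ITERS 16

/* Lookup table of atan(2^-i) for i = 0..15, in radians. */
static const float cordic_atan_table[CORDIC_ITERS] = {
    0.7853981634f, 0.4636476090f, 0.2449786631f, 0.1243549945f,
    0.0624188100f, 0.0312398334f, 0.0156237286f, 0.0078123411f,
    0.0039062301f, 0.0019531225f, 0.0009765622f, 0.0004882812f,
    0.0002441406f, 0.0001220703f, 0.0000610352f, 0.0000305176f
};

/* Hypothetical helper: approximates atan(y / x) for x > 0 by rotating
   the vector (x, y) toward the x axis and accumulating the angle in z. */
static float cordic_atan(float y, float x)
{
    float z = 0.0f;
    float scale = 1.0f;                 /* holds 2^-i */
    for (int i = 0; i < CORDIC_ITERS; i++) {
        float s = (y < 0.0f) ? 1.0f : -1.0f;
        float n_x = x - s * y * scale;
        float n_y = y + s * x * scale;
        float n_z = z - s * cordic_atan_table[i];
        x = n_x;
        y = n_y;
        z = n_z;
        scale *= 0.5f;
    }
    return z;
}
```

For example, `cordic_atan(1.0f, 1.0f)` approaches pi/4. With 16 iterations the remaining angle error is bounded by roughly the last table entry, before floating point rounding is taken into account.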

CORDIC Performance

[Figure: CORDIC Conditional Performance. Y axis: clock cycles per iteration (0.00 to 25.00). Groups: Conditional Moves, Floating Point Branches, Integer Branches. Series: Single Element, Parallel; Single Element, Vector; Pipelined, Vector.]

CORDIC Performance Cont.

[Figure: CORDIC Conditional Performance cont. Y axis: clock cycles per iteration (0.00 to 9.00). Series: Pipelined, Vector; Full SIMD, Parallel; Full SIMD, Pipelined, Parallel.]

Software Pipelining

•Out-of-Order scheduling is built around the idea of

automatically pipelining code

•Out-of-Order scheduler is effective while multiple

independent blocks of code appear in its look ahead


window

•Questions:

•How far ahead does the scheduler look?

•How effective is “effective”?

Out-of-Order Scheduler

[Figure: Blocked vs. Shuffled Instructions. Y axis: GFLOPS (10 to 19). X axis: instruction block size (0 to 120). Series: Blocked, Shuffled.]

Memory Dependent Tests

Blocked         Shuffled        Gain
15.10 GFLOPS    16.00 GFLOPS    5.62%

Software Pipelining

•Scheduler is quite effective as long as instructions

are available in its window

•Instructions easily made “available” by two

techniques:


•Unrolling loops

•Staging loops

Software Pipelining

Original:   Stage 1A, Stage 2A, Stage 1B, Stage 2B, …

Unrolled:   Stage 1A, Stage 1B, Stage 2A, Stage 2B, …

Staged:     Stage 1A, Stage 1B, …, Stage 2A, Stage 2B, …
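The three orderings can be written out for a toy two-stage computation. The sketch below is illustrative only: `stage1` and `stage2` are hypothetical stand-ins for real work, the unroll factor of two is arbitrary, and the staged version assumes the caller supplies a temporary buffer of `n` floats.

```c
#include <stddef.h>

/* Hypothetical two-stage kernel used only for illustration. */
static float stage1(float v) { return v * v; }
static float stage2(float t) { return t + 1.0f; }

/* Original: 1A, 2A, 1B, 2B, ... each element finishes before the next starts. */
static void run_original(const float *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        float t = stage1(in[i]);
        out[i] = stage2(t);
    }
}

/* Unrolled by two: 1A, 1B, 2A, 2B, ... independent work sits side by side. */
static void run_unrolled(const float *in, float *out, size_t n)
{
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        float ta = stage1(in[i]);
        float tb = stage1(in[i + 1]);
        out[i]     = stage2(ta);
        out[i + 1] = stage2(tb);
    }
    for (; i < n; i++) {                /* leftover element when n is odd */
        out[i] = stage2(stage1(in[i]));
    }
}

/* Staged: all of stage 1, then all of stage 2; tmp must hold n floats. */
static void run_staged(const float *in, float *tmp, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        tmp[i] = stage1(in[i]);
    }
    for (size_t i = 0; i < n; i++) {
        out[i] = stage2(tmp[i]);
    }
}
```

All three produce the same results. They differ in which independent instructions end up near each other in the instruction stream, and the staged form pays for its separation with extra loads and stores through the temporary buffer.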

Software Pipelining

[Figure: Sine Performance. Y axis: clock cycles per value (0.00 to 20.00). Series: Staged, Blocked, Pipelined; Staged, Shuffled, Pipelined; Non-Staged, Blocked, Pipelined; Non-Staged, Shuffled, Pipelined; Staged, Non-Pipelined; Non-Staged, Non-Pipelined.]

Vector Optimizations

•Some codes pre-shuffle values to make vector

operations fit better to SIMD instruction sets

•For multiplication on complex numbers we can trade

a dependency and a shuffle for a load
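One possible reading of this trade is sketched below. It multiplies two complex numbers per SSE register stored as [re0, im0, re1, im1]. The regular version swaps the real and imaginary parts of one operand with a shuffle, while the buffered version loads a copy that was swapped ahead of time. The data layout, the function names, and the idea that this matches the benchmarked code are all assumptions.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Shared arithmetic: b_swapped holds b with re and im exchanged,
   i.e. [im0, re0, im1, re1]. */
static __m128 cmul_core(__m128 a, __m128 b, __m128 b_swapped)
{
    const __m128 sign = _mm_set_ps(1.0f, -1.0f, 1.0f, -1.0f); /* lanes 3..0 */
    __m128 a_re = _mm_shuffle_ps(a, a, _MM_SHUFFLE(2, 2, 0, 0)); /* [ar0 ar0 ar1 ar1] */
    __m128 a_im = _mm_shuffle_ps(a, a, _MM_SHUFFLE(3, 3, 1, 1)); /* [ai0 ai0 ai1 ai1] */
    __m128 t1 = _mm_mul_ps(a_re, b);          /* [ar*br, ar*bi, ...] */
    __m128 t2 = _mm_mul_ps(a_im, b_swapped);  /* [ai*bi, ai*br, ...] */
    /* [ar*br - ai*bi, ar*bi + ai*br, ...] */
    return _mm_add_ps(t1, _mm_mul_ps(sign, t2));
}

/* Regular: the swapped copy of b is produced by a shuffle that depends on the load of b. */
static void cmul_regular(const float *a, const float *b, float *out)
{
    __m128 vb = _mm_loadu_ps(b);
    __m128 vb_swapped = _mm_shuffle_ps(vb, vb, _MM_SHUFFLE(2, 3, 0, 1));
    _mm_storeu_ps(out, cmul_core(_mm_loadu_ps(a), vb, vb_swapped));
}

/* Buffered: the swapped copy was stored earlier, so it is just another load. */
static void cmul_buffered(const float *a, const float *b,
                          const float *b_swapped, float *out)
{
    _mm_storeu_ps(out, cmul_core(_mm_loadu_ps(a), _mm_loadu_ps(b),
                                 _mm_loadu_ps(b_swapped)));
}
```

In the buffered form the swapped operand no longer waits on the shuffle of `vb`, but it costs a second load and the memory to hold the pre-shuffled copy.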



[Figure: Effects of Buffering (Barcelona). Y axis: average clock cycles per complex op (1.2 to 1.6). X axis: buffer block size in floats (0 to 12000). Series: Regular, Buffered.]


[Figure: Effects of Buffering (Athlon). Y axis: average clock cycles per complex op (1.700 to 2.900). X axis ticks: 0 to 12000. Series: Regular, Buffered.]


[Figure: Comparison of Buffering Importance. Y axis: percent gain of buffered implementation (-20 to 40). X axis ticks: 0 to 12000. Series: Barcelona, Athlon.]

How to Use This Stuff

•All these tests were written in C with the help of the

Intel Assembly Intrinsics

•Unfortunately, the intrinsics do not work well on

PGI or Pathscale compilers (they do work in GNU,


Intel, and Microsoft ones)

•Intrinsics, while better than assembly, are clunky

at best

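How to Use This Stuff

for(i = 0; i < ITERS; i++) {
    // Build the sign: m is all-ones where element 0 of v1 is > 0
    // (zero is assumed to hold 0.0f in every lane)
    n_two = _mm_set1_ps(-2.0f);
    m = (__m128)_mm_shuffle_epi32((__m128i)v1, 0x0);
    m = _mm_cmpgt_ps(m, zero);
    n_two = _mm_and_ps(n_two, m);
    // n_two is now -2 or 0; adding p_one (assumed to hold 1.0f) gives -1 or +1
    n_two = _mm_add_ps(n_two, p_one);

    // Rearrange the operands, scale by the lookup table entry, apply the sign
    n_v = (__m128)_mm_shuffle_epi32((__m128i)v1, 0xF1);
    r1 = _mm_load_ps(&lin_lookup[i * 4]);
    n_v = _mm_mul_ps(n_v, r1);
    n_v = _mm_mul_ps(n_two, n_v);
    v1 = _mm_add_ps(v1, n_v);
}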

Conclusions

•Conditionals

•Unroll slow conditional loops

•Use integer comparisons

•Code for conditional moves


•Pipeline operations explicitly

•Separate staged computations

•Don’t worry about vector operations

Square Root/Reciprocal Time

[Figure: Square Root/Reciprocal Time. Y axis: clock cycles per operation (0 to 25). Categories: GNU RCP, GNU SQRT, SSE DIV RCP, SSE RCP, SSE SQRT. Series: Barcelona, Athlon.]
