characteristics of amd barcelona floating point execution

30
Characteristics of AMD Barcelona Floating Point Execution Cray User Group Ben Bales and Richard Barrett Computer Science and Mathematics Division Oak Ridge National Laboratory Managed by UT-Battelle for the U. S. Department of Energy Cray User Group Atlanta, GA May 7, 2009

Upload: others

Post on 13-Jan-2022

7 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Characteristics of AMD Barcelona Floating Point Execution

Characteristics of AMD Barcelona Floating Point Execution

Cray User Group

Ben Bales and Richard Barrett

Computer Science and

Mathematics Division

Oak Ridge National Laboratory

Managed by UT-Battelle for theU. S. Department of Energy

Cray User Group

Atlanta, GA

May 7, 2009

Page 2: Characteristics of AMD Barcelona Floating Point Execution

Subjects

•Definitions

•Barcelona Overview

•Branch Predictor + Conditional Moves

•Software Pipelining

Managed by UT-Battelle for theU. S. Department of Energy

•Software Pipelining

•Vector Operations

•How to Take Advantage of Given Optimizations

Page 3: Characteristics of AMD Barcelona Floating Point Execution

SIMD and Vector

1 2 3 4

1 2 3 4

1 2 3 4

1 2 3 4

+ +

SIMD Vector

Managed by UT-Battelle for theU. S. Department of Energy

1 2 3 4

1 2 3 4

1 2 3 4

1 2 3 4

= =

Page 4: Characteristics of AMD Barcelona Floating Point Execution

Software Pipelining

In-Order SchedulerMultiply

Multiply

Multiply

Add

Multiply

Subtract

Lookahead

Out-of-Order SchedulerMultiply

Multiply

Multiply

Add

Multiply

Subtract

Lookahead

Managed by UT-Battelle for theU. S. Department of Energy

Subtract

Add

Add

Add

Subtract

Add

Add

Add

Software Pipelining references attempts to manually

guarantee the existence of superscalar code within

the Lookahead window.

Page 5: Characteristics of AMD Barcelona Floating Point Execution

Barcelona OverviewThree level cache architecture

•L1/L2 – High speed computational caches (256 and 128bits

per cycle respectively)

SIMD Instruction Set

Managed by UT-Battelle for theU. S. Department of Energy

SIMD Instruction Set

•Limited “Vector” operations

•Three SIMD pipelines

This presentation focuses on single precision

Page 6: Characteristics of AMD Barcelona Floating Point Execution

Branches

Four sub-topics

•SSE conditional moves

•Integer comparators

•Branch prediction mechanism

Managed by UT-Battelle for theU. S. Department of Energy

•Branch prediction mechanism

•Case study: CORDIC Arctangent

Page 7: Characteristics of AMD Barcelona Floating Point Execution

SSE Conditional Moves

Branches that can be expressed as moves

conditional on simple floating point comparisons

SSE friendly

if(a < b) {

Managed by UT-Battelle for theU. S. Department of Energy

if(a < b) {

sgn = 1.0;

} else {

sgn = -1.0;

}

ans = c + sgn * d;

if(a < b) {

ans = c + d;

} else {

ans = c – d;

}

Page 8: Characteristics of AMD Barcelona Floating Point Execution

Integer Comparison Unit

As recommended in the AMD 10h optimization guide,

floating point comparisons can be performed in

integer units (chart taken from guide):

Comparison Against Zero

Managed by UT-Battelle for theU. S. Department of Energy

Comparison Against Zero

Integer Comparison Replaces

if(*((unsigned int *)&f) > 0x8000000U) if(f < 0.0f)

if(*((int *)&f) <= 0) if(f <= 0.0f)

if(*((int *)&f) > 0) if(f > 0.0f)

if(*((unsigned int *)&f) <= 0x8000000U) if(f >= 0.0f)

Page 9: Characteristics of AMD Barcelona Floating Point Execution

Branch Predictor

Dynamic branches predicted based on 2-bit Global

History Bimodal Counters

Value: 0 1 2 3

Action: No

Branch

No

Branch

Branch Branch

Managed by UT-Battelle for theU. S. Department of Energy

Branch Branch

Counters updated as branch conditionals evaluated

Page 10: Characteristics of AMD Barcelona Floating Point Execution

Branch Predictor cont.

Heuristic improved by addressing GHBC with Global

Branch History

Global Branch History GHBC

0 1 0 3

1 0 1 0

Managed by UT-Battelle for theU. S. Department of Energy

1 0 1 0

Unfortunately, loops waste half of the available bits!

Page 11: Characteristics of AMD Barcelona Floating Point Execution

Branch Predictor cont.

Dots – Instruction address

Red – Loop branch bits

… 0 1 0 1 0 1 1 1 1 1 0 1

Managed by UT-Battelle for theU. S. Department of Energy

Green – Legitimate branch bits

Unrolling loops can help improve the Legitimate/Loop

ratio and make the GHBCs work better

Page 12: Characteristics of AMD Barcelona Floating Point Execution

Branch Benchmarks

10

12

14

16

Clo

ck C

ycle

s/B

ran

ch

Branch Speed Comparison

Branch

Block Branch

Managed by UT-Battelle for theU. S. Department of Energy

0

2

4

6

8

Random Non-Random

Clo

ck C

ycle

s/B

ran

ch

Block Branch

Int Branch

Int Block Branch

SSE Conditional Move (x1)

SSE Conditional Move (x4)

Page 13: Characteristics of AMD Barcelona Floating Point Execution

CORDIC Arctangent

•CORDIC algorithms popular for implementing trig

functions with limited hardware

•Basic algorithm is SSE friendly, but includes a tricky

conditional move

Managed by UT-Battelle for theU. S. Department of Energy

conditional move

Page 14: Characteristics of AMD Barcelona Floating Point Execution

CORDIC Arctangent (SIMD)for(i = 0; i < N; i++) {

if(y < 0.0f) {

s = 1.0f;

} else {

s = -1.0f;

}

n_x = x – s * y * powf(2.0f, -(float)i); //powf and atanf are

n_y = y + s * x * powf(2.0f, -(float)i); //implemented in

Managed by UT-Battelle for theU. S. Department of Energy

n_y = y + s * x * powf(2.0f, -(float)i); //implemented in

n_z = z – s * atanf(powf(2.0f, -(float)i); //lookup tables

x = n_x;

y = n_y;

z = n_z;

}

Page 15: Characteristics of AMD Barcelona Floating Point Execution

CORDIC Arctangent (Vector)for(i = 0; i < N; i++) {

if(y < 0.0f) {

s = 1.0f;

} else {

s = -1.0f;

}

n_x = x + (-s) * y * powf(2.0f, -(float)i) ;

n_y = y + s * x * powf(2.0f, -(float)i) ;

Managed by UT-Battelle for theU. S. Department of Energy

n_y = y + s * x * powf(2.0f, -(float)i) ;

n_z = z + (-s) * 1.0 * atanf(powf(2.0f, -(float)i) ;

x = n_x;

y = n_y;

z = n_z;

}

Page 16: Characteristics of AMD Barcelona Floating Point Execution

CORDIC Performance

15.00

20.00

25.00

Clo

ck C

ycle

s/Ite

rati

on

Cordic Contional Performance

Single Element, Parallel

Managed by UT-Battelle for theU. S. Department of Energy

0.00

5.00

10.00

Conditional Moves Floating Point Branches Integer Branches

Clo

ck C

ycle

s/Ite

rati

on

Single Element, Parallel

Single Element, Vector

Pipelined, Vector

Page 17: Characteristics of AMD Barcelona Floating Point Execution

CORDIC Performance Cont.

6.00

7.00

8.00

9.00

Clo

ck C

ycle

s/Ite

rati

on

Cordic Conditional Performance cont.

Managed by UT-Battelle for theU. S. Department of Energy

0.00

1.00

2.00

3.00

4.00

5.00

Clo

ck C

ycle

s/Ite

rati

on

Pipelined, Vector

Full SIMD, Parallel

Full SIMD, Pipelined, Parallel

Page 18: Characteristics of AMD Barcelona Floating Point Execution

Software Pipelining

•Out-of-Order scheduling is built around the idea of

automatically pipelining code

•Out-of-Order scheduler is effective while multiple

independent blocks of code appear in its look ahead

Managed by UT-Battelle for theU. S. Department of Energy

independent blocks of code appear in its look ahead

window

•Questions:

•How far ahead does the scheduler look?

•How effective is “effective”?

Page 19: Characteristics of AMD Barcelona Floating Point Execution

Out-of-Order Scheduler

13

14

15

16

17

18

19

GF

OP

S

Blocked vs. Shuffled Instructions

Blocked

Shuffled

Managed by UT-Battelle for theU. S. Department of Energy

Memory Dependent Tests

Blocked Shuffled Gain

15.10GFLOPS 16.00GFLOPS 5.62%

10

11

12

0 20 40 60 80 100 120

Instruction Block Size

Page 20: Characteristics of AMD Barcelona Floating Point Execution

Software Pipelining

•Scheduler is quite effective as long as instructions

are available in its window

•Instructions easily made “available” by two

techniques:

Managed by UT-Battelle for theU. S. Department of Energy

techniques:

•Unrolling loops

•Staging loops

Page 21: Characteristics of AMD Barcelona Floating Point Execution

Software Pipelining

Stage 1A

Stage 2A

Stage 1B

Stage 2B

Stage Stage Stage Stage …

Original

Unrolled

Managed by UT-Battelle for theU. S. Department of Energy

Stage 1A

Stage 1B

Stage 2A

Stage 2B

Stage 1A

Stage 1B

…Stage

2AStage

2B…

Staged

Page 22: Characteristics of AMD Barcelona Floating Point Execution

Software Pipelining

12.00

14.00

16.00

18.00

20.00

Clo

ck C

ycle

s/V

alu

e

Sine Performance

Staged, Blocked, Pipelined

Staged, Shuffled, Pipelined

Managed by UT-Battelle for theU. S. Department of Energy

0.00

2.00

4.00

6.00

8.00

10.00

Clo

ck C

ycle

s/V

alu

e

Non-Staged, Blocked, Pipelined

Non-Staged, Shuffled, Pipelined

Staged, Non-Pipelined

Non-Staged, Non-Pipelined

Page 23: Characteristics of AMD Barcelona Floating Point Execution

Vector Optimizations

•Some codes pre-shuffle values to make vector

operations fit better to SIMD instruction sets

•For multiplication on complex numbers we can trade

a dependency and a shuffle for a load

Managed by UT-Battelle for theU. S. Department of Energy

a dependency and a shuffle for a load

Page 24: Characteristics of AMD Barcelona Floating Point Execution

Vector Optimizations

1.4

1.45

1.5

1.55

1.6

Av

era

ge C

lock C

ycle

s/C

om

ple

x O

p

Effects of Buffering (Barcelona)

Managed by UT-Battelle for theU. S. Department of Energy

1.2

1.25

1.3

1.35

1.4

0 2000 4000 6000 8000 10000 12000

Av

era

ge C

lock C

ycle

s/C

om

ple

x O

p

Buffer Block Size (floats)

Regular

Buffered

Page 25: Characteristics of AMD Barcelona Floating Point Execution

Vector Optimizations

2.300

2.500

2.700

2.900

Av

era

ge C

lock C

ycle

s/C

om

ple

x O

p

Effects of Buffering (Athlon)

Managed by UT-Battelle for theU. S. Department of Energy

1.700

1.900

2.100

2.300

0 2000 4000 6000 8000 10000 12000

Av

era

ge C

lock C

ycle

s/C

om

ple

x O

p

Buffer Block Size (floats)

Regular

Buffered

Page 26: Characteristics of AMD Barcelona Floating Point Execution

Vector Optimizations

20

30

40

Perc

en

t G

ain

of

Bu

ffere

d I

mp

lem

en

tati

on

Comparison of Buffering Importance

Managed by UT-Battelle for theU. S. Department of Energy

-20

-10

0

10

0 2000 4000 6000 8000 10000 12000

Perc

en

t G

ain

of

Bu

ffere

d I

mp

lem

en

tati

on

Buffer Block Size (floats)

Barcelona

Athlon

Page 27: Characteristics of AMD Barcelona Floating Point Execution

How to Use This Stuff

•All these tests were written in C with the help of the

Intel Assembly Intrinsics

•Unfortunately, the intrinsics do not work well on

PGI or Pathscale compilers (they do work in GNU,

Managed by UT-Battelle for theU. S. Department of Energy

PGI or Pathscale compilers (they do work in GNU,

Intel, and Microsoft ones)

•Intrinsics, while better than assembly, are clunky

at best

Page 28: Characteristics of AMD Barcelona Floating Point Execution

How to Use This Stufffor(i = 0; i < ITERS; i++) {

n_two = _mm_set1_ps(-2.0f);

m = (__m128)_mm_shuffle_epi32((__m128i)v1, 0x0);

m = _mm_cmpgt_ps(m, zero);

n_two = _mm_and_ps(n_two, m);

n_two = _mm_add_ps(n_two, p_one);

Managed by UT-Battelle for theU. S. Department of Energy

n_v = (__m128)_mm_shuffle_epi32((__m128i)v1, 0xF1);

r1 = _mm_load_ps(&lin_lookup[i * 4]);

n_v = _mm_mul_ps(n_v, r1);

n_v = _mm_mul_ps(n_two, n_v);

v1 = _mm_add_ps(v1, n_v);

}

Page 29: Characteristics of AMD Barcelona Floating Point Execution

Conclusions

•Conditionals

•Unroll slow conditional loops

•Use integer comparisons

•Code for conditional moves

Managed by UT-Battelle for theU. S. Department of Energy

•Code for conditional moves

•Pipeline operations explicitly

•Separate staged computations

•Don’t worry about vector operations

Page 30: Characteristics of AMD Barcelona Floating Point Execution

Square Root/Reciprocal Time

15

20

25

Clo

ck C

ycl

es/

Op

era

tio

n

Square Root/Reciprocal Time

Managed by UT-Battelle for theU. S. Department of Energy

0

5

10

GNU RCP GNU SQRT SSE DIV RCP SSE RCP SSE SQRT

Clo

ck C

ycl

es/

Op

era

tio

n

Barcelona

Athlon