A Case Study on Optimizing Accurate Half Precision Average

Kenny Peou 1,2,3 ([email protected]), Alan Kelly 1, Joel Falcou 1,2,3

NUMSCALE 1, LRI 2 and Université Paris-Saclay 3

24/09/2018



Introduction

Floating Point Format

Parallelization

Performance Considerations

Calculating the Average

Conclusion

2 / 32 Introduction A Case Study on Optimizing Accurate Half Precision Average

A Very Average Presentation

• Average is a fundamental component of some ML algorithms: k-means, meanshift, average pooling

• FP16 hardware support incoming: Pascal GPU, ARM SVE

• FP16 precision imposes serious limitations

• Performance is important

3 / 32 Introduction A Case Study on Optimizing Accurate Half Precision Average


Floating Point Format - Specifications

[Figure: bit layout: sign, exponent, mantissa; a half value is 1.mantissa · 2^(exponent − 15).]

Precision   Exponent   Mantissa   Max            Min
Half        5          10         65504          6.1 · 10^-5
Single      8          23         3.4 · 10^38    1.2 · 10^-38
Double      11         52         1.8 · 10^308   2.2 · 10^-308

4 / 32 Floating Point Format A Case Study on Optimizing Accurate Half Precision Average

Limitations

• Overflow - too big: 65504 + 32 = ∞; 256 · 256 = ∞

• Underflow/Subnormals - too small: 1 ÷ 66000 = 0; 0.0625 ÷ 64992 = 9.53674e-07

• Exponent (mis)alignment - juuuuuuust wrong: 2048 + 1 = 2048; 2048 + 3.5 = 2050

• Limited hardware support for FP16: emulated via the half C++ library

5 / 32 Floating Point Format A Case Study on Optimizing Accurate Half Precision Average


Unit of Least Precision (ULP)

[Figure: the ULP bit is the lowest bit of the mantissa.]

Base Value   Half ULP   Float ULP   Double ULP
2^-10        2^-20      2^-33       2^-62
1            2^-10      2^-23       2^-52
2^10         1          2^-13       2^-42

Table: Value of 1 ULP at different base values for half, single, and double precisions.

6 / 32 Floating Point Format A Case Study on Optimizing Accurate Half Precision Average

Parallelization (SIMD)

Single Instruction Single Data: each instruction performs one addition.

Single Instruction Multiple Data: one instruction performs the addition on a whole register of elements at once.

[Figure: animation of element-wise additions performed one at a time (scalar) versus all lanes at once (SIMD).]

SIMD Technologies

• GPU - CUDA, OpenCL

• Multithread - OpenMP, MPI

• CPU Vectorization - AVX, NEON, PPC
  - No communication/transfer
  - Fixed register sizes
  - Not portable (without NSIMD)

7 / 32 Parallelization A Case Study on Optimizing Accurate Half Precision Average

Vectorization - NSIMD

// AVX
__m256 a = _mm256_load_ps(&(A[i]));
__m256 b = _mm256_load_ps(&(B[i]));
__m256 c = _mm256_add_ps(a, b);
_mm256_store_ps(&(C[i]), c);

// NEON
float32x4_t a = vld1q_f32(&(A[i]));
float32x4_t b = vld1q_f32(&(B[i]));
float32x4_t c = vaddq_f32(a, b);
vst1q_f32(&(C[i]), c);

// VMX
vector float a = vec_ld(0, &(A[i]));
vector float b = vec_ld(0, &(B[i]));
vector float c = vec_add(a, b);
vec_st(c, 0, &(C[i]));

// NSIMD (All architectures)
nsimd::pack<float> b = nsimd::load(&(B[i]));
nsimd::pack<float> c = nsimd::load(&(C[i]));
nsimd::pack<float> a = b * c;
nsimd::store(&(A[i]), a);

8 / 32 Parallelization A Case Study on Optimizing Accurate Half Precision Average

Performance Considerations

� Number of Operations/Instructions

� Instruction Latency

� Data Types (SIMD consideration)

Operation        Integer   Float   SIMD Float   SIMD Double
Load/Store       1         1       1            1
Addition         1         3       3            3
Subtraction      1         3       3            3
Multiplication   3         5       5            5
Division         22        10      10           10

Table: Minimum cycles required to perform basic arithmetic operations on an Intel Haswell CPU. (Agner Fog)

9 / 32 Performance Considerations A Case Study on Optimizing Accurate Half Precision Average

Calculating the Average

Seems simple, right?

Avg(X) = (1/N) · Σ_{i=1}^{N} x_i

Not with half precision!

10 / 32 Calculating the Average A Case Study on Optimizing Accurate Half Precision Average

Precision Benchmarks

Input         Definition                                       Parameter         N
Sequential    Input[i] = s · i                                 s = 1             100, 1000, 10000
                                                               s = 0.001         100, 1000, 10000
Fixed Ratio   Input[i] = i ÷ 2 (i even), i ÷ (2 · r) (i odd)   r = N/2           100, 1000, 10000
Fixed Diff    Input[i] = i ÷ 2 (i even), (i ÷ 2) + d (i odd)   d = N/2           100, 1000, 10000
Random        Input[i] = rand()                                [0, 1], [0, 10]   1000
Fixed         Input[i] = c                                     c = 10            1000, 10000, 1000000
Repeating     Input[i] = 10 + (i%3)                            10 - 12           300, 3000, 30000
Image1                                                                           2073600
Image2                                                                           2073600

11 / 32 Calculating the Average A Case Study on Optimizing Accurate Half Precision Average

Performance Benchmarks

Processor Used     Clock Speed   RAM
Intel i7-4790S     3.20GHz       16GB
Power8 8348-21C    2.061GHz      64GB
ARM Cortex-A57r1   1.91GHz       4GB

Benchmark   Compiler Used   Compilation Flags
Scalar      All of below    -O3
SSE         GCC 6.3.1       -O3 -msse4.2
AVX         GCC 6.3.1       -O3 -mavx2
NEON        GCC 5.4.1       -O3 -march=armv8-a+simd
Altivec     GCC 6.3.0       -O3 -maltivec

Performance measured using float instead of half

12 / 32 Calculating the Average A Case Study on Optimizing Accurate Half Precision Average

Naive Average

function NaiveAverage(Array)
    sum = 0
    for a in Array do
        sum += a
    end for
    avg = sum / length(Array)
end function

• Rounding errors get worse and worse

• Susceptible to overflow

• Computationally simple

13 / 32 Calculating the Average A Case Study on Optimizing Accurate Half Precision Average

Naive Average

Input         Parameter   N         Naive
Sequential    s = 1       100       9
                          1000      fail
                          10000     fail
              s = 0.001   100       16
                          1000      198
                          10000     fail
Fixed Ratio   r = N/2     100       1
                          1000      fail
                          10000     fail
Fixed Diff    d = N/2     100       13
                          1000      fail
                          10000     fail
Random        [0, 1]      1000      271
              [0, 10]     1000      fail
Fixed         c = 10      1000      152
                          10000     1070
                          1000000   1277
Repeating     10 - 12     300       17
                          3000      709
                          30000     1998
Image1                    2073600   fail
Image2                    2073600   1761

14 / 32 Calculating the Average A Case Study on Optimizing Accurate Half Precision Average

Naive Average

[Figure: "Speedup for All Architectures" - Naive: Scalar 1, SSE 3.96, AVX 7.05, Altivec 3.96, NEON 2.6.]

15 / 32 Calculating the Average A Case Study on Optimizing Accurate Half Precision Average

Average Using Kahan Sum

function KahanAverage(Array)
    sum = 0
    rem = 0
    for a in Array do
        y = a - rem
        t = sum + y
        rem = (t - sum) - y
        sum = t
    end for
    avg = sum / length(Array)
end function

• Mostly compensates rounding errors

• Still susceptible to overflow and misalignments

16 / 32 Calculating the Average A Case Study on Optimizing Accurate Half Precision Average

Average Using Kahan Sum

Input         Parameter   N         Naive   Kahan
Sequential    s = 1       100       9       1
                          1000      fail    fail
                          10000     fail    fail
              s = 0.001   100       16      1
                          1000      198     1
                          10000     fail    fail
Fixed Ratio   r = N/2     100       1       1
                          1000      fail    fail
                          10000     fail    fail
Fixed Diff    d = N/2     100       13      1
                          1000      fail    fail
                          10000     fail    fail
Random        [0, 1]      1000      271     0
              [0, 10]     1000      fail    fail
Fixed         c = 10      1000      152     0
                          10000     1070    fail
                          1000000   1277    fail
Repeating     10 - 12     300       17      0
                          3000      709     1
                          30000     1998    fail
Image1                    2073600   fail    fail
Image2                    2073600   1761    391

17 / 32 Calculating the Average A Case Study on Optimizing Accurate Half Precision Average

Average Using Kahan Sum

[Figure: "Speedup for All Architectures" - Naive: Scalar 1, SSE 3.96, AVX 7.05, Altivec 3.96, NEON 2.6; Kahan: Scalar 0.24, SSE 1, AVX 1.97, Altivec 1, NEON 0.99.]

18 / 32 Calculating the Average A Case Study on Optimizing Accurate Half Precision Average

Iterative Average

function IterativeAverage(Array)
    avg = 0
    for i in 1 → length(Array) do
        avg += (Array[i] - avg) / i
    end for
end function

• Removes risk of overflow

• Misalignment, underflow, and subnormals get progressively worse

lim_{i→∞} (x_i − Avg_i) / i = 0

19 / 32 Calculating the Average A Case Study on Optimizing Accurate Half Precision Average

Iterative Average

Input         Parameter   N         Naive   Kahan   Iterative
Sequential    s = 1       100       9       1       0
                          1000      fail    fail    0
                          10000     fail    fail    993
              s = 0.001   100       16      1       20
                          1000      198     1       249
                          10000     fail    fail    1486
Fixed Ratio   r = N/2     100       1       1       17
                          1000      fail    fail    3
                          10000     fail    fail    994
Fixed Diff    d = N/2     100       13      1       23
                          1000      fail    fail    227
                          10000     fail    fail    737
Random        [0, 1]      1000      271     0       249
              [0, 10]     1000      fail    fail    261
Fixed         c = 10      1000      152     0       0
                          10000     1070    fail    0
                          1000000   1277    fail    0
Repeating     10 - 12     300       17      0       74
                          3000      709     1       128
                          30000     1998    fail    128
Image1                    2073600   fail    fail    1037
Image2                    2073600   1761    391     1701

20 / 32 Calculating the Average A Case Study on Optimizing Accurate Half Precision Average

Iterative Average

[Figure: "Speedup for All Architectures" - Naive: Scalar 1, SSE 3.96, AVX 7.05, Altivec 3.96, NEON 2.6; Kahan: Scalar 0.24, SSE 1, AVX 1.97, Altivec 1, NEON 0.99; Iterative: Scalar 0.16, SSE 0.63, AVX 0.89, Altivec 0.58, NEON 0.58.]

21 / 32 Calculating the Average A Case Study on Optimizing Accurate Half Precision Average

Increased Precision

function UpcastAverage(Array)
    sum = (upcast) 0
    for a in Array do
        sum += (upcast) a
    end for
    avg = sum / length(Array)
end function

• Same fundamental problems as the naive average

• Limitations less problematic

22 / 32 Calculating the Average A Case Study on Optimizing Accurate Half Precision Average

Increased Precision

Input         Parameter   N         Naive   Kahan   Iterative   Upcast
Sequential    s = 1       100       9       1       0           0
                          1000      fail    fail    0           0
                          10000     fail    fail    993         0
              s = 0.001   100       16      1       20          0
                          1000      198     1       249         0
                          10000     fail    fail    1486        0
Fixed Ratio   r = N/2     100       1       1       17          0
                          1000      fail    fail    3           0
                          10000     fail    fail    994         0
Fixed Diff    d = N/2     100       13      1       23          0
                          1000      fail    fail    227         0
                          10000     fail    fail    737         0
Random        [0, 1]      1000      271     0       249         0
              [0, 10]     1000      fail    fail    261         0
Fixed         c = 10      1000      152     0       0           0
                          10000     1070    fail    0           0
                          1000000   1277    fail    0           0
Repeating     10 - 12     300       17      0       74          0
                          3000      709     1       128         0
                          30000     1998    fail    128         0
Image1                    2073600   fail    fail    1037        1
Image2                    2073600   1761    391     1701        12

23 / 32 Calculating the Average A Case Study on Optimizing Accurate Half Precision Average

Increased Precision

[Figure: "Speedup for All Architectures" - Naive: Scalar 1, SSE 3.96, AVX 7.05, Altivec 3.96, NEON 2.6; Kahan: Scalar 0.24, SSE 1, AVX 1.97, Altivec 1, NEON 0.99; Iterative: Scalar 0.16, SSE 0.63, AVX 0.89, Altivec 0.58, NEON 0.58; Upcast: Scalar 1, SSE 22.37, AVX 21.58.]

24 / 32 Calculating the Average A Case Study on Optimizing Accurate Half Precision Average

Cascading Average

0     1     2     3     4     5     6     7
   0.5         2.5         4.5         6.5
         1.5                     5.5
                   3.5

• Pairwise operations

• No accumulator = no overflow

• Rounding errors from input variation

• ≤ 2N operations

25 / 32 Calculating the Average A Case Study on Optimizing Accurate Half Precision Average

Cascading Average

function CascadingAverage(Array)
    n = length(Array)
    if n == 1 then
        return Array[0]
    else if n == 2 then
        return (Array[0] + Array[1]) / 2
    else
        return (CascadingAverage(Array[0:n/2]) + CascadingAverage(Array[n/2:n])) / 2
    end if
end function

26 / 32 Calculating the Average A Case Study on Optimizing Accurate Half Precision Average

Cascading Average

Input         Parameter   N         Naive   Kahan   Iterative   Upcast   Cascade
Sequential    s = 1       100       9       1       0           0        0
                          1000      fail    fail    0           0        0
                          10000     fail    fail    993         0        1
              s = 0.001   100       16      1       20          0        2
                          1000      198     1       249         0        4
                          10000     fail    fail    1486        0        4
Fixed Ratio   r = N/2     100       1       1       17          0        0
                          1000      fail    fail    3           0        0
                          10000     fail    fail    994         0        1
Fixed Diff    d = N/2     100       13      1       23          0        0
                          1000      fail    fail    227         0        0
                          10000     fail    fail    737         0        0
Random        [0, 1]      1000      271     0       249         0        3
              [0, 10]     1000      fail    fail    261         0        4
Fixed         c = 10      1000      152     0       0           0        0
                          10000     1070    fail    0           0        0
                          1000000   1277    fail    0           0        0
Repeating     10 - 12     300       17      0       74          0        1
                          3000      709     1       128         0        1
                          30000     1998    fail    128         0        1
Image1                    2073600   fail    fail    1037        1        3
Image2                    2073600   1761    391     1701        12       7

27 / 32 Calculating the Average A Case Study on Optimizing Accurate Half Precision Average

Cascading Average

[Figure: "Speedup for All Architectures"]

Method      Scalar   SSE     AVX     Altivec   NEON
Naive       1        3.96    7.05    3.96      2.6
Kahan       0.24     1       1.97    1         0.99
Iterative   0.16     0.63    0.89    0.58      0.58
Upcast      1        22.37   21.58   -         -
Cascading   0.28     1.04    2.21    1.21      0.9

28 / 32 Calculating the Average A Case Study on Optimizing Accurate Half Precision Average

Performance - Overall

[Figure: speedup versus input size (2^10 to 2^28) for the Scalar, SSE, AVX, NEON, and Altivec benchmarks, with one curve each for the Naive, Kahan, Iterative, Cascade, and Upcast methods.]

29 / 32 Calculating the Average A Case Study on Optimizing Accurate Half Precision Average

Analysis

• Sum-then-divide methods overflow easily

• The iterative average fails for large N

• The Kahan sum (with divide when necessary) is the most precise when it works

• The naive method is fast enough to increase precision and still be faster (with SIMD)

• The Cascading Average is never a bad choice thanks to its robustness

30 / 32 Calculating the Average A Case Study on Optimizing Accurate Half Precision Average

Conclusion

• Numerical precision is complicated

• Performance requires effort

• Even simple things can need research

Future Work

• A similar study using 8-bit floats

• A similar study with exotic numerical formats

• Mixing algorithms intelligently

• Looking into other "solved" algorithms under half precision

31 / 32 Conclusion A Case Study on Optimizing Accurate Half Precision Average

THE END

32 / 32 A Case Study on Optimizing Accurate Half Precision Average

Full Precision Results

Input         Parameter   N         Naive   Kahan   Iterative   Upcast   Cascade
Sequential    s = 1       100       9       1       0           0        0
                          1000      fail    fail    0           0        0
                          10000     fail    fail    993         0        1
              s = 0.001   100       16      1       20          0        2
                          1000      198     1       249         0        4
                          10000     fail    fail    1486        0        4
              s = -1      100       9       1       0           0        0
                          1000      fail    fail    0           0        0
                          10000     fail    fail    993         0        1
Fixed Ratio   r = N/2     100       1       1       17          0        0
                          1000      fail    fail    3           0        0
                          10000     fail    fail    994         0        1
Fixed Diff    d = N/2     100       13      1       23          0        0
                          1000      fail    fail    227         0        0
                          10000     fail    fail    737         0        0
Random        [0, 1]      1000      271     0       249         0        3
              [0, 10]     1000      fail    fail    261         0        4
              [0, 100]    1000      fail    fail    246         0        2
              [0, 1000]   1000      fail    fail    253         0        3
Fixed         c = 10      1000      152     0       0           0        0
                          10000     1070    fail    0           0        0
                          100000    1259    fail    0           0        0
                          1000000   1277    fail    0           0        0
Repeating     10 - 12     300       17      0       74          0        1
                          3000      709     1       128         0        1
                          30000     1998    fail    128         0        1
                          300000    1401    fail    128         0        1
Image1                    2073600   fail    fail    1037        1        3
Image2                    2073600   1761    391     1701        12       7

1 / 1 A Case Study on Optimizing Accurate Half Precision Average