
Page 1


A fast implementation of matrix-matrix product in double-double precision on NVIDIA C2050 and application to semidefinite programming

Nakata Maho∗† ([email protected]∗), Yasuyoshi Takao††, Noda Shigeho†, Himeno Ryutaro†

RIKEN, Advanced Center for Computing and Communication†, JFE Tech††

International Conference on Networking and Computing, 2012/12/5 @ Okinawa, 14:45-15:15


Page 2

Overview

Introduction of this research in one slide.

Importance of high-precision arithmetic.

The double-double precision: a cheap and easy solution for quadruple precision, and its details.

Matrix-matrix multiplication (Rgemm) in MPACK (high-precision version of BLAS and LAPACK).

Implementation of a fast Rgemm on the C2050 GPU: 150 times faster than the CPU.

Application: acceleration of the semidefinite programming solver “SDPA-DD”: 10 times faster than the CPU.

Summary.


Page 3

Introduction of this research in one slide.

Matrix-matrix multiplication in double-double precision on the NVIDIA C2050 GPU: GPU = CPU × 150, peak performance 26 GFLOPS.

[Figure: Rgemm performance (GFLOPS, 0-25) vs. matrix dimension (0-6000), kernel and total curves for the four combinations QuadMul-Sloppy/QuadMul-FMA × QuadAdd-Cray/QuadAdd-IEEE.]

+ Application: semidefinite programming, GPU = CPU × 10.



Page 5

More accuracy is needed toward peta- and exa-scale computing

Exa-scale computing: on the order of 10^23 FLOPs for just one week of calculation!!!

Scientific computing may suffer from accuracy loss.



Page 8

More accuracy is needed toward peta- and exa-scale computing

Iterative methods in double precision sometimes do not even converge. [Hasegawa 2007]



Page 10

More accuracy is needed toward peta- and exa-scale computing

Semidefinite programming (SDP): the condition number diverges at the optimum. Therefore, it may be very hard to obtain an accurate solution. [Nakata et al. 2008], [Nakata 2009], [Waki-Nakata-Muramatsu]

[Figure: the 1-norm and the estimated 1-norm condition number of the Schur complement matrix vs. iteration count (0-90); the condition number grows from ~1 to ~1e+20 as the optimum is approached.]



Page 14

Accelerating high-precision operations on the GPU is a good idea

Double-double precision is a cheap and fast solution for high precision:

accurate enough for many purposes: almost as accurate as quadruple precision.
fast: each operation takes only 8-24 double-precision operations.
operation intensive: requires little memory bandwidth relative to FLOPS.

Implementing on a GPU is a good idea:

fast: 515 GFLOPS on an NVIDIA C2050 vs. 100-200 GFLOPS on CPUs.
cheap: NVIDIA C2050 ~$2000; workstation: $5000-$10000.
does not require complex operations: suitable for GPUs.



Page 22

The double-double precision: handy and easy quadruple precision

“754-2008 IEEE Standard for Floating-Point Arithmetic”

The binary64 (aka double precision) format has 16 significant decimal digits.

Widely used and very fast. Core i7 920: ~40 GFLOPS; RADEON HD7970: ~1000 GFLOPS; K computer: over 10 PFLOPS.

Rounding error may occur for every arithmetic operation.


Page 23

The double-double precision: handy and easy quadruple precision

The double-double precision number a is expressed by two double precision numbers ahi, alo:

a = (ahi, alo).
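In code, a double-double value is simply a pair of doubles. A minimal C++ sketch (our naming; the QD library introduced later calls this type dd_real):

struct dd {
    double hi;  // leading component: the double nearest to the value
    double lo;  // trailing component: the leftover error, |lo| <= 0.5 ulp(hi)
};
// The represented value is hi + lo, giving a 106-bit significand,
// roughly 32 significant decimal digits.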


Page 24

The double-double precision: handy and easy quadruple precision

Knuth’s Theorem

Error-free transformation of two floating-point numbers a, b:

a + b = (a ⊕ b) + e

where ⊕ is addition including rounding error, + is exact addition, and e is a floating-point number.

We can evaluate the rounding error exactly for addition!


Page 25

The double-double precision: handy and easy quadruple precision

Dekker’s Theorem

Error-free transformation of two floating-point numbers a, b:

a × b = (a ⊗ b) + e

where ⊗ is multiplication including rounding error, × is exact multiplication, and e is a floating-point number.

We can evaluate the rounding error exactly for multiplication!


Page 26

The double-double precision: handy and easy quadruple precision

Based on Knuth’s Theorem, we can define “Quick-Two-Sum(a, b)”, where a, b are floating-point numbers and ⊕, ⊖ are operators including rounding error. When |a| ≥ |b|, we can calculate s = a ⊕ b and e = a + b − (a ⊕ b) exactly in three operations.

Quick-Two-Sum(a, b):
1. s ← a ⊕ b
2. e ← b ⊖ (s ⊖ a)
3. return (s, e)

(s, e) = Quick-Two-Sum(a, b)
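A minimal C++ sketch of Quick-Two-Sum (our naming, not from the slides). It must be compiled with value-changing optimizations disabled (no -ffast-math), since the error term relies on exact IEEE rounding:

#include <cstdio>

struct twofold { double s, e; };  // rounded result plus exact error

// Quick-Two-Sum: valid when |a| >= |b|; exactly 3 floating-point operations.
static twofold quick_two_sum(double a, double b) {
    double s = a + b;        // s = a (+) b with rounding
    double e = b - (s - a);  // the rounding error, recovered exactly
    return {s, e};
}

int main() {
    twofold r = quick_two_sum(1.0, 1e-20);      // 1e-20 is lost in s ...
    std::printf("s = %g, e = %g\n", r.s, r.e);  // ... but preserved in e
    return 0;
}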


Page 27

The double-double precision: handy and easy quadruple precision

Based on Knuth’s Theorem, we can define “Two-Sum(a, b)”, where a, b are floating-point numbers and ⊕, ⊖ are operators including rounding error. Without any condition on |a|, |b|, we can calculate s = a ⊕ b and e = a + b − (a ⊕ b) exactly in six operations.

Two-Sum(a, b):
1. s ← a ⊕ b
2. v ← s ⊖ a
3. e ← (a ⊖ (s ⊖ v)) ⊕ (b ⊖ v)
4. return (s, e)

(s, e) = Two-Sum(a, b)
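The corresponding C++ sketch (again our naming; same compilation caveat as above):

#include <cstdio>

struct twofold { double s, e; };

// Two-Sum: branch-free error-free addition; 6 floating-point operations,
// with no precondition on the magnitudes of a and b.
static twofold two_sum(double a, double b) {
    double s = a + b;
    double v = s - a;
    double e = (a - (s - v)) + (b - v);
    return {s, e};
}

int main() {
    twofold r = two_sum(1e16, 1.0);             // 1.0 is lost in s ...
    std::printf("s = %g, e = %g\n", r.s, r.e);  // ... s = 1e16, e = 1
    return 0;
}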


Page 28

The double-double precision: handy and easy quadruple precision

Basics: Dekker’s Theorem. There exists an algorithm which calculates s = a ⊗ b and e = a × b − (a ⊗ b), where ⊗ is the multiplication operator with rounding error, using the following “Split(a)” in four operations and “Two-Prod(a, b)” in 17 operations.

Split(a):
1. t ← (2^27 + 1) ⊗ a
2. ahi ← t ⊖ (t ⊖ a)
3. alo ← a ⊖ ahi
4. return (ahi, alo)

Two-Prod(a, b):
1. p ← a ⊗ b
2. (ahi, alo) ← Split(a)
3. (bhi, blo) ← Split(b)
4. e ← ((ahi ⊗ bhi ⊖ p) ⊕ ahi ⊗ blo ⊕ alo ⊗ bhi) ⊕ alo ⊗ blo
5. return (p, e)

(p, e) = Two-Prod(a, b)
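A C++ sketch of Split and Two-Prod (our naming). Floating-point contraction must be off (e.g. -ffp-contract=off), otherwise the compiler may fuse the products into FMAs and change the error term:

#include <cstdio>

struct twofold { double s, e; };

// Split a 53-bit double into two 26-bit halves; 4 operations.
static twofold split(double a) {
    double t = 134217729.0 * a;  // 134217729 = 2^27 + 1
    double hi = t - (t - a);
    double lo = a - hi;
    return {hi, lo};
}

// Two-Prod: p = fl(a*b) and the exact error e = a*b - p; 17 operations.
static twofold two_prod(double a, double b) {
    double p = a * b;
    twofold A = split(a), B = split(b);
    double e = ((A.s * B.s - p) + A.s * B.e + A.e * B.s) + A.e * B.e;
    return {p, e};
}

int main() {
    twofold r = two_prod(1.0 + 1e-8, 1.0 - 1e-8);
    std::printf("p = %.17g, e = %.17g\n", r.s, r.e);
    return 0;
}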


Page 29

The double-double precision: handy and easy quadruple precision

Addition in double-double arithmetic can be done in 20 FLOPS by the following “QuadAdd-IEEE”.

QuadAdd-IEEE(a, b):
1. (shi, ehi) = Two-Sum(ahi, bhi)
2. (slo, elo) = Two-Sum(alo, blo)
3. ehi = ehi ⊕ slo
4. (shi, ehi) = Quick-Two-Sum(shi, ehi)
5. ehi = ehi ⊕ elo
6. (chi, clo) = Quick-Two-Sum(shi, ehi)
7. return c = (chi, clo)
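A C++ sketch of QuadAdd-IEEE built from the two error-free transformations above (our naming; this mirrors the IEEE-style addition of the QD library):

#include <cstdio>

struct dd { double hi, lo; };
struct twofold { double s, e; };

static twofold two_sum(double a, double b) {
    double s = a + b, v = s - a;
    return {s, (a - (s - v)) + (b - v)};
}
static twofold quick_two_sum(double a, double b) {
    double s = a + b;
    return {s, b - (s - a)};
}

// QuadAdd-IEEE: 6 + 6 + 1 + 3 + 1 + 3 = 20 FLOPS.
static dd quadadd_ieee(dd a, dd b) {
    twofold s = two_sum(a.hi, b.hi);    // high parts
    twofold t = two_sum(a.lo, b.lo);    // low parts
    double e = s.e + t.s;
    twofold u = quick_two_sum(s.s, e);  // renormalize
    e = u.e + t.e;
    twofold c = quick_two_sum(u.s, e);  // renormalize again
    return {c.s, c.e};
}

int main() {
    dd a = {1.0, 1e-17}, b = {1e-16, 0.0};
    dd c = quadadd_ieee(a, b);
    std::printf("hi = %.17g, lo = %.17g\n", c.hi, c.lo);
    return 0;
}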


Page 30

The double-double precision: handy and easy quadruple precision

Multiplication in double-double arithmetic can be done in 24 FLOPS by the following “QuadMul”.

QuadMul(a, b):
1. (phi, plo) = Two-Prod(ahi, bhi)
2. plo = plo ⊕ (ahi ⊗ blo ⊕ alo ⊗ bhi)
3. (chi, clo) = Quick-Two-Sum(phi, plo)
4. return c = (chi, clo)
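A C++ sketch of QuadMul on top of the earlier Two-Prod and Quick-Two-Sum (our naming; compile with floating-point contraction off):

#include <cstdio>

struct dd { double hi, lo; };
struct twofold { double s, e; };

static twofold quick_two_sum(double a, double b) {
    double s = a + b;
    return {s, b - (s - a)};
}
static twofold split(double a) {
    double t = 134217729.0 * a;  // 2^27 + 1
    double hi = t - (t - a);
    return {hi, a - hi};
}
static twofold two_prod(double a, double b) {
    double p = a * b;
    twofold A = split(a), B = split(b);
    return {p, ((A.s * B.s - p) + A.s * B.e + A.e * B.s) + A.e * B.e};
}

// QuadMul: 17 (Two-Prod) + 4 (cross terms) + 3 (Quick-Two-Sum) = 24 FLOPS.
static dd quadmul(dd a, dd b) {
    twofold p = two_prod(a.hi, b.hi);
    double lo = p.e + (a.hi * b.lo + a.lo * b.hi);  // cross terms
    twofold c = quick_two_sum(p.s, lo);
    return {c.s, c.e};
}

int main() {
    dd a = {3.0, 0.0}, b = {1.0 / 3.0, 0.0};  // crude demo values
    dd c = quadmul(a, b);
    std::printf("hi = %.17g, lo = %.17g\n", c.hi, c.lo);
    return 0;
}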


Page 31

The double-double precision: handy and easy quadruple precision

The FMA (fused multiply-add) instruction calculates

a × b + c

in one operation: it computes a × b + c exactly, then rounds once to double precision.


Page 32

The double-double precision: handy and easy quadruple precision

Faster: using the FMA instruction, Two-Prod becomes 3 operations (17 without FMA), and QuadMul(-FMA) can be done in only 10 operations (24 without FMA).

Two-Prod-FMA(a, b):
1. p ← a ⊗ b
2. e ← FMA(a × b − p)
3. return (p, e)
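In C++ this is a one-liner via std::fma, which performs a*b+c with a single rounding (a hardware instruction on the C2050; on CPUs without FMA hardware, std::fma may fall back to slow software emulation):

#include <cmath>
#include <cstdio>

struct twofold { double s, e; };

// Two-Prod-FMA: the error of a*b costs one fused operation.
static twofold two_prod_fma(double a, double b) {
    double p = a * b;
    double e = std::fma(a, b, -p);  // exactly a*b - p
    return {p, e};
}

int main() {
    twofold r = two_prod_fma(1.0 + 1e-8, 1.0 - 1e-8);
    std::printf("p = %.17g, e = %.17g\n", r.s, r.e);
    return 0;
}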


Page 33

The double-double precision: handy and easy quadruple precision

Faster: lower-accuracy operations.

QuadAdd-Cray(a, b):
1. (chi, clo) = Two-Sum(ahi, bhi)
2. clo = clo ⊕ (alo ⊕ blo)
3. (chi, clo) = Quick-Two-Sum(chi, clo)
4. return c = (chi, clo)

QuadMul-Sloppy(a, b):
1. p = ahi ⊗ blo
2. q = alo ⊗ bhi
3. t = p ⊕ q
4. chi = FMA(ahi × bhi + t)
5. e = FMA(ahi × bhi − chi)
6. clo = e ⊕ t
7. return c = (chi, clo)
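Hedged C++ sketches of both shortcuts (our naming, following the steps above):

#include <cmath>
#include <cstdio>

struct dd { double hi, lo; };
struct twofold { double s, e; };

static twofold two_sum(double a, double b) {
    double s = a + b, v = s - a;
    return {s, (a - (s - v)) + (b - v)};
}
static twofold quick_two_sum(double a, double b) {
    double s = a + b;
    return {s, b - (s - a)};
}

// QuadAdd-Cray: 11 FLOPS; drops the error term of the low-part addition.
static dd quadadd_cray(dd a, dd b) {
    twofold c = two_sum(a.hi, b.hi);
    double lo = c.e + (a.lo + b.lo);
    twofold r = quick_two_sum(c.s, lo);
    return {r.s, r.e};
}

// QuadMul-Sloppy: 8 FLOPS with FMA counted as 2; slightly less accurate.
static dd quadmul_sloppy(dd a, dd b) {
    double p = a.hi * b.lo;
    double q = a.lo * b.hi;
    double t = p + q;
    double chi = std::fma(a.hi, b.hi, t);
    double e = std::fma(a.hi, b.hi, -chi);
    return {chi, e + t};
}

int main() {
    dd a = {1.0, 1e-17}, b = {2.0, 0.0};
    dd s = quadadd_cray(a, b), m = quadmul_sloppy(a, b);
    std::printf("sum: %g %g  prod: %g %g\n", s.hi, s.lo, m.hi, m.lo);
    return 0;
}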


Page 34

The double-double precision: handy and easy quadruple precision

Summary: operation count of each double-double arithmetic primitive.

Algorithm            # of operations
Quick-Two-Sum        3
Two-Sum              6
Split                4
Two-Prod             17
Two-Prod-FMA         3*
QuadAdd-IEEE         20
QuadAdd-Cray         11
QuadMul              24
QuadMul-FMA          10*
QuadMul-FMA-Sloppy   8*

* FMA counted as 2 FLOPS. We used QuadAdd-IEEE and QuadMul-FMA when not explicitly stated otherwise.


Page 35

The double-double precision: handy and easy quadruple precision

The QD library. Features: C++ classes; the double-double precision type is “dd_real”. Free software. Authors: Yozo Hida, Xiaoye S. Li, David H. Bailey.

Download: http://crd.lbl.gov/~dhbailey/mpdist/

Paper: http://crd.lbl.gov/~dhbailey/dhbpapers/arith15.pdf

Yozo Hida, Xiaoye S. Li, David H. Bailey, “Quad-Double Arithmetic: Algorithms, Implementation, and Application”, Technical Report LBNL-46996, Lawrence Berkeley National Laboratory, 2000.


Page 36

Implementation on GPU and performance evaluation

We accelerated the matrix-matrix multiplication routine called “Rgemm”. Prototype of Rgemm:

void Rgemm(const char *transa, const char *transb,
           mpackint m, mpackint n, mpackint k, dd_real alpha,
           dd_real * A, mpackint lda, dd_real * B, mpackint ldb,
           dd_real beta, dd_real * C, mpackint ldc)

“MPACK” by M. Nakata: a multiple-precision version of BLAS and LAPACK (the de facto standard linear algebra packages).

http://mplapack.sourceforge.net/

“Rgemm” corresponds to “dgemm” and “sgemm” of BLAS.
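For orientation, a hedged usage sketch of Rgemm computing C ← αAB + βC in double-double precision. The header name <mblas_dd.h> is an assumption following MPACK conventions and should be checked against the installed version:

#include <mblas_dd.h>  // assumed MPACK double-double BLAS header

int main() {
    mpackint n = 100;
    dd_real *A = new dd_real[n * n];
    dd_real *B = new dd_real[n * n];
    dd_real *C = new dd_real[n * n];
    for (mpackint i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 1.0; C[i] = 0.0; }
    dd_real alpha = 1.0, beta = 0.0;
    // Column-major storage, no transposes, leading dimensions = n.
    Rgemm("N", "N", n, n, n, alpha, A, n, B, n, beta, C, n);
    // Every entry of C should now be exactly n = 100 in double-double.
    delete[] A; delete[] B; delete[] C;
    return 0;
}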


Page 37

Implementation on GPU and performance evaluation

Related work

D. Mukunoki and D. Takahashi: Implementation of double-double matrix-matrix multiplication on GPU, HPCS, pp. 148-156 (2011). → Matrix size must be a multiple of 64; slower than our implementation.

Nakasato, N.: “A Fast GEMM Implementation On a Cypress GPU”, Performance Modeling, Benchmark and Simulation of High Performance Computing Systems, Louisiana, USA, 2010. → Matrix size must be a multiple of 64; faster than our implementation.

Neither implementation is practical → we implemented for general use.


Page 38

Implementation on GPU and evaluation

NVIDIA C2050 Architecture


Page 39

Implementation on GPU and evaluation

Block algorithm: we divide the matrices into small blocks of size b_K, b_M, b_N. We used b_M = b_K = 16 and b_N = 64.


Page 40

Implementation on GPU and evaluation

Basic algorithm:
1. Transfer the A, B, C matrices from CPU memory to GPU global memory.
2. Blocking: Ab of 16 × 16 and Bb of 16 × 64 are most efficient.
3. Apply thread blocks of 16 × 16 = 256 threads to the blocks: the (i, j)-th thread of a thread block works on the i-th row of Ab and the j, j+16, j+32, j+48-th columns of Bb (four columns at the same time).


Page 41

Implementation on GPU and evaluation

Operation of each thread in detail (see the serial sketch below):
1. Multiply beta into c0, c1, c2, c3 of the C matrix, which correspond to the i-th row of Ab and the j, j+16, j+32, j+48-th columns of Bb.
2. Read the first blocks Ab and Bb from global memory into shared memory; each thread of the block reads its own elements.
3. Calculate the inner products of the row vector a_i of Ab with the columns b_j, b_j+16, b_j+32, b_j+48 of Bb, as p0, p1, p2, p3.
4. Update c0, c1, c2, c3, e.g. c0 ← c0 + α p0.
5. Read the next blocks Ab, Bb and repeat steps 3-4 until no further blocks are available.
6. Update the C matrix with c0, c1, c2, c3.
7. Finally, transfer the C matrix from GPU global memory back to the CPU.
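A serial C++ sketch of this access pattern (our reconstruction of the indexing only, with plain double standing in for dd_real and the shared-memory staging omitted); the two outer loops play the role of the 16 × 16 threads of one thread block:

#include <cstddef>
#include <cstdio>

const int bM = 16, bK = 16;  // tile sizes from the slides (bN = 4 * 16 = 64)

// Work of one thread block on the C tile at (row0, col0).
// A: m x k, B: k x n, C: m x n, column-major; edge handling omitted.
void tile_gemm(const double *A, const double *B, double *C,
               int m, int k, int row0, int col0,
               double alpha, double beta) {
    for (int i = 0; i < bM; i++)         // "thread index" i
        for (int j = 0; j < bK; j++) {   // "thread index" j
            double c[4];
            for (int q = 0; q < 4; q++)  // step 1: scale by beta
                c[q] = beta * C[(row0 + i) + (std::size_t)(col0 + j + 16 * q) * m];
            for (int kb = 0; kb < k; kb += bK) {   // steps 2 and 5: next tiles
                double p[4] = {0.0, 0.0, 0.0, 0.0};
                for (int kk = 0; kk < bK; kk++) {  // step 3: inner products
                    double a = A[(row0 + i) + (std::size_t)(kb + kk) * m];
                    for (int q = 0; q < 4; q++)
                        p[q] += a * B[(kb + kk) + (std::size_t)(col0 + j + 16 * q) * k];
                }
                for (int q = 0; q < 4; q++)        // step 4: c += alpha * p
                    c[q] += alpha * p[q];
            }
            for (int q = 0; q < 4; q++)  // step 6: write back
                C[(row0 + i) + (std::size_t)(col0 + j + 16 * q) * m] = c[q];
        }
}

int main() {
    static double A[16 * 32], B[32 * 64], C[16 * 64];
    for (int i = 0; i < 16 * 32; i++) A[i] = 1.0;
    for (int i = 0; i < 32 * 64; i++) B[i] = 1.0;
    tile_gemm(A, B, C, 16, 32, 0, 0, 1.0, 0.0);
    std::printf("C[0] = %g\n", C[0]);  // 32
    return 0;
}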


Page 42

Implementation on GPU and evaluation

The performance of the matrix-matrix operation in double-double precision for square matrices (m = n = k), varying m. The maximum kernel performance was 16.4 GFLOPS, and 16.1 GFLOPS with the CPU-GPU transfer included.

[Figure: GFLOPS (0-16) vs. dimension (0-6000), NN-Kernel and NN-Total curves.]


Page 43

Implementation on GPU and evaluation

The performance of the matrix-matrix operation in double-double precision with matrix transposes, for square matrices (m = n = k), varying m. No performance loss with matrix transposes is observed.

[Figure: GFLOPS (0-16) vs. dimension (0-6000) for the NN, NT, TN, TT variants, kernel and total; the curves essentially coincide.]


Page 44

Implementation on GPU and evaluation

We observed no performance loss with matrix transposes; the reason is that we use texture memory.

Global memory and texture memory are essentially the same memory.

However, with texture memory the performance loss is small even without coalesced memory access.

Also, it is relatively easy to hide the latency of memory transfer in double-double precision, since the arithmetic is operation intensive (cf. QuadAdd-IEEE requires 20 FLOPS, QuadMul-FMA requires 10 FLOPS).


Page 45

Implementation on GPU and evaluation

“Pointer redirecting”, from “Accelerating GPU kernels for dense linear algebra”, Rajib Nath, Stanimire Tomov, and Jack Dongarra.

A large performance loss (~35%) is observed for matrix sizes that are not multiples of 64.


Page 46

Implementation on GPU and evaluation

“Pointer redirecting”, from “Accelerating GPU kernels for dense linear algebra”, Rajib Nath, Stanimire Tomov, and Jack Dongarra.

Simple algorithm: if a pointer goes out of the block, return the value at the nearest edge instead.

Very simple program. Small amount of performance loss. Breakthrough!!
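A minimal C++ illustration of the idea (ours, with array indexing standing in for pointer arithmetic): reads that would fall past the matrix edge are redirected to the nearest valid element, so every thread block can execute a full tile for any matrix size, and the redundant results are simply not written back:

#include <algorithm>
#include <cstddef>

// Redirected load from an m x n column-major matrix.
inline double load_redirected(const double *A, int m, int n, int i, int j) {
    int ii = std::min(i, m - 1);  // clamp to the bottom edge
    int jj = std::min(j, n - 1);  // clamp to the right edge
    return A[ii + (std::size_t)jj * m];
}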


Page 47

Implementation on GPU and evaluation

The performance loss was reduced from 35% to 6%!

[Figure: GFLOPS (14.6-16.4) vs. dimension (2050-2250), kernel and total; only small dips remain at sizes that are not multiples of 64.]


Page 48

Implementation on GPU and evaluation

Performance varied by only 0.1% over repeated runs.

[Figure: total GFLOPS (≈15.5535-15.5575) over the 10th-100th measurement.]


Page 49

Implementation on GPU and evaluation

Using less accurate operations, we attained 26.4 GFLOPS.

[Figure: GFLOPS (0-25) vs. dimension (0-6000), kernel and total curves for the four QuadMul-Sloppy/QuadMul-FMA × QuadAdd-Cray/QuadAdd-IEEE combinations; QuadMul-Sloppy with QuadAdd-Cray is fastest.]


Page 50

Implementation on GPU and evaluation

Using less accurate operations, we attained 26.4 GFLOPS. “CPU” denotes results measured on a Xeon 3470 with DDR3-1066.

Algorithm                                Performance
QuadAdd-Cray, QuadMul-Sloppy kernel      26.4 GFLOPS
QuadAdd-Cray, QuadMul-Sloppy total       25.7 GFLOPS
QuadAdd-Cray, QuadMul kernel             23.0 GFLOPS
QuadAdd-Cray, QuadMul total              22.4 GFLOPS
QuadAdd-IEEE, QuadMul-Sloppy kernel      18.1 GFLOPS
QuadAdd-IEEE, QuadMul-Sloppy total       17.8 GFLOPS
QuadAdd-IEEE, QuadMul kernel             16.4 GFLOPS
QuadAdd-IEEE, QuadMul total              16.1 GFLOPS
QuadAdd-IEEE, QuadMul CPU                100 MFLOPS
QuadAdd-IEEE, QuadMul OpenMP CPU         400 MFLOPS


Page 51

Implementation on GPU and evaluation

16.1 GFLOPS total (16.4 GFLOPS kernel) = 92.4% (or 46.2%) of the estimated peak performance (QuadAdd-IEEE, QuadMul-FMA), depending on the estimate, as follows.

Average FLOPs per double-double operation: QuadAdd-IEEE takes 20 operations and QuadMul-FMA takes 10, and in Rgemm the same number of multiplication and addition operations appear:

(20 + 10 − 1)/2 = 14.5

The approximate theoretical peak should be

515 GFLOPS / 14.5 = 35.5 GFLOPS

However, on the C2050 the peak performance assumes full use of FMA, and our calculation is not that case, thus

515 GFLOPS / 14.5 / 2 = 17.8 GFLOPS
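The same estimate as a compact worked calculation (LaTeX; 515 GFLOPS is the C2050 double-precision peak quoted earlier, and the 16.4 GFLOPS kernel figure is compared against both estimates):

\begin{align*}
c_{\mathrm{avg}} &= \frac{20 + 10 - 1}{2} = 14.5 \ \text{FLOPs per double-double operation},\\
P_{\mathrm{FMA}} &= \frac{515}{14.5} \approx 35.5\ \text{GFLOPS}, \qquad
P_{\mathrm{no\,FMA}} = \frac{515}{14.5 \times 2} \approx 17.8\ \text{GFLOPS},\\
\frac{16.4}{35.5} &\approx 46.2\%, \qquad
\frac{16.4}{17.8} \approx 92.4\%.
\end{align*}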


Page 52

Application: ×10 acceleration of the semidefinite programming solver “SDPA-DD”.

Application


Page 53

Application: ×10 acceleration of the semidefinite programming solver “SDPA-DD”.

Semidefinite programming:

Primal: min A0 • X, s.t. Ai • X = bi (i = 1, 2, ..., m), X ⪰ 0.

Dual: max Σi bi zi, s.t. Σi Ai zi + Y = A0, Y ⪰ 0 (sums over i = 1, ..., m).

Here the Ai are n × n symmetric matrices, X is an n × n symmetric variable matrix, b = (b1, ..., bm) is an m-dimensional vector, Y is an n × n symmetric variable matrix, and X • Y := Σ Xij Yij. X ⪰ 0 means X is positive semidefinite: all eigenvalues are larger than or equal to 0.
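The same primal-dual pair in display form (LaTeX), for readability:

\begin{align*}
\text{(P)}\quad & \min_{X} \; A_0 \bullet X
  & \text{s.t.}\;& A_i \bullet X = b_i \;\ (i = 1, \dots, m), \quad X \succeq 0, \\
\text{(D)}\quad & \max_{z,\,Y} \; \sum_{i=1}^{m} b_i z_i
  & \text{s.t.}\;& \sum_{i=1}^{m} A_i z_i + Y = A_0, \quad Y \succeq 0,
\end{align*}

where $X \bullet Y := \sum_{i,j} X_{ij} Y_{ij}$.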


Page 54

Application: ×10 acceleration of the semidefinite programming solver “SDPA-DD”.

Nature of optimality.

Theorem (complementary slackness theorem)

Let (X∗, Y∗, z∗) be a feasible interior point of the primal and dual SDPs. Then a necessary and sufficient condition for the optimality of (X∗, Y∗, z∗) is:

X∗ • Y∗ = 0.


Page 55

Application: ×10 acceleration of the semidefinite programming solver “SDPA-DD”.

When X∗, Y∗ are optimal,

X∗ • Y∗ = 0.

Then

rank X∗ + rank Y∗ ≤ n    (1)

also follows. At least one of X∗, Y∗ is singular; usually both X∗ and Y∗ are singular → unstable and/or less accurate at the optimum.


Page 56

How to solve an SDP: the primal-dual interior-point path-following method

The world's best implementations, SDPA and SDPARA, are available from the SDPA group led by Prof. Fujisawa.

Step 0: Set the initial point x0, X0, Y0 with X0 ≻ 0, Y0 ≻ 0. Let h = 0 and choose a parameter γ ∈ (0, 1).

Step 1: Calculate the Schur complement matrix B (m × m, symmetric):

Bij = ((X^h)^{-1} Fi Y^h) • Fj

Step 2: Solve the linear equation B dx = r and calculate dX, dY from the solution dx; we obtain the next step (dx, dX, dY).

Step 3: Determine the step size α keeping positive semidefiniteness of the matrices: α = max{α ∈ [0, 1] : X^h + α dX ⪰ 0, Y^h + α dY ⪰ 0}.

Step 4: Update the current point: (x^{h+1}, X^{h+1}, Y^{h+1}) = (x^h, X^h, Y^h) + γα(dx, dX, dY).

Step 5: If (x^{h+1}, X^{h+1}, Y^{h+1}) satisfies the stopping requirements, the iteration ends. Otherwise, go back to Step 1 with h = h + 1.


Page 57

The Schur complement matrix becomes singular

B is called the “Schur complement matrix”. We solve the linear equation B dx = r to determine the next step. This linear equation becomes singular!

Multiple-precision arithmetic is needed for accurate solutions!

[Figure: the 1-norm and the estimated 1-norm condition number of the Schur complement matrix vs. iteration count (0-90), as on Page 10; the condition number diverges toward ~1e+20.]


Page 58

Application: ×10 acceleration of the semidefinite programming solver “SDPA-DD”.

Benchmark results for larger problems from SDPLIB (a problem archive). CPU: Xeon 3470, DDR3-1066.

Problem     CPU (sec)   GPU (sec)   Acceleration
equalG51    6531.9      573.2       11.4
gpp500-1    902.0       72.2        12.5
gpp500-4    638.0       74.8        8.5
maxG32      36284.4     4373.1      8.3
maxG55      521575.4    53413.1     9.8
mcp500-4    539.1       65.2        8.3
qpG11       16114.7     1408.0      11.4
qpG51       39678.9     3299.2      12.0
ss30        310.7       138.6       2.2
theta5      3250.0      239.8       13.6
theta6      9028.2      623.6       14.5
thetaG51    49161.5     4870.4      10.1


Page 59

Summary

http://mplapack.sourceforge.net/

Matrix-matrix multiplication in double-double precision on the NVIDIA C2050 GPU: GPU = CPU × 150, peak performance 26 GFLOPS.

[Figure: Rgemm performance (GFLOPS, 0-25) vs. dimension (0-6000), kernel and total curves for the four QuadMul/QuadAdd combinations, as on Page 3.]
