
A Class of Hybrid LAPACK Algorithms for Multicore and GPU Architectures

MAGMA QR, 1 GPU, All Available Cores

Mitch Horton, Stanimire Tomov, Jack Dongarra Innovative Computing Laboratory

University of Tennessee

20 July, 2011

Outline

1) Motivation
2) Algorithm Description
3) Algorithm Tuning
4) Algorithm Optimization
5) Results
6) What's Next
7) Power Efficiency

Motivation

Moore’s Law

The number of transistors that can be placed inexpensively on an integrated circuit doubles approximately every two years. This trend has continued for more than half a century and is expected to continue until 2015 or 2020 or later.

Wikipedia

Kepler: 6,000,000,000 transistors (2011)

Motivation

May’s Law

Software efficiency halves every 18 months, compensating Moore's Law.

Wikipedia

Motivation

Nothing you can't spell will ever work.

Will Rogers

Sourcebook of Parallel Computing

Dongarra, Foster, Fox

Algorithm Description

MAGMA QR, 1 GPU, All Available Cores

Example configuration: 3360 x 3360 matrix, 2 x 6 cores, NB = 128, IB = 12, OB = 128

Components:

- QUARK LAPACK Factorization (6 cores, Q)
- Sequential LAPACK Update (6 cores, P)
- GPU Update
- MAGMA 1.0+: Optimized Panel Factorization (6 cores)
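The boxes above sketch how each step of the factorization is divided: an NB-wide panel is factorized on Q CPU cores as QUARK tasks (with inner blocking IB), while the trailing matrix is updated partly on the remaining P CPU cores with sequential LAPACK and partly on the GPU in OB-wide pieces. The CPU-only sketch below shows the underlying blocked QR loop using plain LAPACKE calls; it is an illustration of the loop structure under those assumptions, not MAGMA's implementation, and blocked_qr is a hypothetical name.

```c
#include <lapacke.h>

/* CPU-only sketch of the blocked QR structure behind the hybrid algorithm.
 * Assumes column-major storage, m >= n, and tau of length >= n.
 * In the hybrid code the panel step would run on Q cores under QUARK and the
 * trailing update would be split between P CPU cores and the GPU. */
void blocked_qr(int m, int n, double *A, int lda, double *tau, int NB)
{
    for (int j = 0; j < n; j += NB) {
        int jb = (n - j < NB) ? n - j : NB;   /* width of the current panel */

        /* Panel factorization of the (m-j) x jb panel at A(j, j).
         * Hybrid version: Q CPU cores, QUARK tasks, inner block size IB. */
        LAPACKE_dgeqrf(LAPACK_COL_MAJOR, m - j, jb,
                       &A[(size_t)j * lda + j], lda, &tau[j]);

        /* Trailing-matrix update: apply the panel's Q^T to columns j+jb..n-1.
         * Hybrid version: split between P CPU cores (sequential LAPACK) and
         * the GPU in OB-wide pieces. */
        if (j + jb < n)
            LAPACKE_dormqr(LAPACK_COL_MAJOR, 'L', 'T',
                           m - j, n - j - jb, jb,
                           &A[(size_t)j * lda + j], lda, &tau[j],
                           &A[(size_t)(j + jb) * lda + j], lda);
    }
}
```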

Algorithm Tuning Nightmare

Tuning parameters: Q, P, NB, IB, OB

Example problem sizes: 3360 x 3360, 5920 x 5920, 20000 x 20000

Figures: tuning parameters Q, P, NB, OB, IB plotted against matrix size (800 to 18080), double precision; one plot for 8 x 6 cores, one for 2 x 6 cores.
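One blunt way to attack this tuning space is an exhaustive sweep over the five parameters for each matrix size of interest. The sketch below assumes a hypothetical benchmarking hook, hybrid_qr_gflops(), that runs one hybrid factorization with a given (Q, P, NB, IB, OB) and returns its Gflop/s rate; the parameter ranges are illustrative and are not the ones behind the plots above.

```c
#include <stdio.h>

/* Hypothetical hook: run one hybrid QR factorization of an n x n matrix with
 * the given parameters and return the measured Gflop/s.  This stub is a
 * placeholder; substitute a real timed call to the code being tuned. */
static double hybrid_qr_gflops(int n, int Q, int P, int NB, int IB, int OB)
{
    (void)n; (void)Q; (void)P; (void)NB; (void)IB; (void)OB;
    return 0.0;
}

/* Exhaustive sweep over (Q, P, NB, IB, OB) for one matrix size. */
static void tune(int n, int ncores)
{
    double best = -1.0;
    int bQ = 0, bP = 0, bNB = 0, bIB = 0, bOB = 0;

    for (int Q = 1; Q < ncores; Q++)               /* cores for panel factorization     */
    for (int P = 1; P <= ncores - Q; P++)          /* cores for the CPU trailing update */
    for (int NB = 64; NB <= 256; NB += 64)         /* outer block size                  */
    for (int IB = 8; IB <= 32; IB += 4)            /* inner block size                  */
    for (int OB = 64; OB <= 256; OB += 64) {       /* GPU update block size             */
        double g = hybrid_qr_gflops(n, Q, P, NB, IB, OB);
        if (g > best) {
            best = g; bQ = Q; bP = P; bNB = NB; bIB = IB; bOB = OB;
        }
    }
    printf("n=%5d  best %.1f Gflop/s  Q=%d P=%d NB=%d IB=%d OB=%d\n",
           n, best, bQ, bP, bNB, bIB, bOB);
}

int main(void)
{
    int sizes[] = { 3360, 5920, 20000 };
    for (int i = 0; i < 3; i++)
        tune(sizes[i], 12);                        /* 12 = 2 x 6 cores */
    return 0;
}
```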

5920 x 128, 4 cores, IB = 8

Figure: Performance of multicore QR factorization on tall, skinny matrices, comparing different algorithms (Gflop/s vs. matrix size).

12 Cores (2 x 6-cores) 2.8 GHz X5660, 23 GB, 270 Gflop/s peak [keeneland]; Tesla M2070, 1.1 GHz, 5.4 GB, 1.03 Tflop/s peak

Single Node, Single GPU

Curves: Recursive, Left Looking Execution, Left Looking Insertion, Parallel MKL

Algorithm Optimization

Figures: performance for 800 x 64 (6 cores, IB = 12) and 23840 x 192 (2 cores, IB = 16), single precision.

Curves: Left Looking Insertion, Left Looking Execution, Recursive
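The curves above compare several ways of factorizing the tall, skinny panels on the CPU cores, including a recursive variant. As one concrete illustration of recursion on a panel, the sketch below gives a textbook recursive QR (split the columns in half, factor the left half, apply its reflectors to the right half, recurse on what remains) built only from standard LAPACKE calls; it is a minimal sketch of the general technique, not the implementation measured above, and rec_panel_qr is a hypothetical name.

```c
#include <lapacke.h>

/* Recursive QR factorization of an m x n panel (m >= n), column-major,
 * reflectors stored below the diagonal of A and scalar factors in tau
 * (length n).  Sketch of the recursive technique, not the code measured
 * in the plots above. */
static void rec_panel_qr(lapack_int m, lapack_int n,
                         double *A, lapack_int lda, double *tau)
{
    if (n <= 8) {                       /* small base case: plain LAPACK QR */
        LAPACKE_dgeqrf(LAPACK_COL_MAJOR, m, n, A, lda, tau);
        return;
    }
    lapack_int n1 = n / 2, n2 = n - n1;

    /* Factor the left half of the panel recursively: A1 = Q1 * R11. */
    rec_panel_qr(m, n1, A, lda, tau);

    /* Apply Q1^T to the right half: A2 := Q1^T * A2. */
    LAPACKE_dormqr(LAPACK_COL_MAJOR, 'L', 'T', m, n2, n1,
                   A, lda, tau, A + (size_t)n1 * lda, lda);

    /* Factor the trailing (m - n1) x n2 block of the right half recursively. */
    rec_panel_qr(m - n1, n2, A + (size_t)n1 * lda + n1, lda, tau + n1);
}
```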

Results

Figure: Performance of MAGMA QR with 1 GPU and all available cores, comparing precisions (Gflop/s vs. matrix size).

48 Cores (8 x 6-cores), 2.8 GHz AMD Opteron 8439 SE, 129 GB, peak 1080 Gflop/s [ig]; 1 GeForce GTX 480, 1.401 GHz clock, theoretical peak: 1.401 * 2 * 480 = 1.34496 Tflop/s

numactl --interleave=all

Curves: Single, Double, Complex Single, Complex Double

Results

Figure: Performance of MAGMA QR with 1 GPU and all available cores, comparing precisions (Gflop/s vs. matrix size).

12 Cores (2 x 6-cores) 2.8 GHz X5660, 23 GB, 270 Gflop/s peak [keeneland]; Tesla M2070, 1.1 GHz, 5.4 GB, 1.03 Tflop/s peak

Single Node, Single GPU

Curves: Double, Single, Complex Single, Complex Double

Results

Figure: Performance of MAGMA QR with 1 GPU and all available cores, double precision, comparing against MAGMA 1.0 and MKL (Gflop/s vs. matrix size).

12 Cores (2 x 6-cores) 2.8 GHz X5660, 23 GB, 270 Gflop/s peak [keeneland]; Tesla M2070, 1.1 GHz, 5.4 GB, 1.03 Tflop/s peak

Single Node, Single GPU

Curves: 1 GPU, 2 Sockets, New Approach; 1 GPU, 1 Socket, MAGMA 1.0; 1 GPU, 1 Core, MAGMA 1.0; 0 GPUs, 1 Socket, MKL


Results

Figure: Performance of MAGMA LU with 1 GPU and all available cores, single precision, comparing against MAGMA 1.0 and MKL (Gflop/s vs. matrix size).

12 Cores (2 x 6-cores) 2.8 GHz X5660, 23 GB, 270 Gflop/s peak [keeneland]; Tesla M2070, 1.1 GHz, 5.4 GB, 1.03 Tflop/s peak

Single Node, Single GPU

Curves: 1 GPU, All Cores; 1 GPU, 1 Socket, MAGMA 1.0; 0 GPUs, 2 Sockets, MKL

What's Next

Power Efficiency

Math is free. Transistors are free. Power is expensive. Performance Per Watt = Performance.

Jen-Hsun (Jensen) Huang, CEO of NVIDIA

Power Efficiency

Peak performance of any system is essentially limited by the amount of power it can draw and the amount of heat it can dissipate. Consequently, performance per watt of a GPU design translates directly into peak performance of a system that uses that design.

While performance per watt is useful, absolute power requirements are also important. Claims of improved performance per watt may be used to mask increasing power demands. For instance, though newer generation GPU architectures may provide better performance per watt, continued performance increases can negate the gains in efficiency, and the GPUs continue to consume large amounts of power.

Wikipedia


Power Efficiency

A Google engineer has warned that if the performance per watt of today's computers doesn't improve, the electrical costs of running them could end up far greater than the initial hardware price tag.

"If performance per watt is to remain constant over the next few years, power costs could easily overtake hardware costs, possibly by a large margin," Luiz Andre Barroso, who previously designed processors for Digital Equipment Corp., said in a September paper.

Google recently unveiled a major new datacenter site in a remote part of Oregon, where power costs are a fraction of those at Google's home base in Silicon Valley.


Power Efficiency

This has nothing to do with being "green." Every system and subsystem has to fit within some power budget.

Brough Turner 2007


Power Efficiency

"You want a battery on this device and that device that lasts three or four days? I do, too. Well, if we have much more high-performing systems at much lower watts that will trickle down into your cell phone and this recorder and everything else. So, if I can get a processor that runs on 5 watts, as opposed to 100 watts or whatever . . . Voila, batteries are going to last a lot longer ... “

ORNL scientific computing chief Jeff Nichols


Questions