
A Class of Hybrid LAPACK Algorithms for Multicore and GPU Architectures

MAGMA QR, 1 GPU, All Available Cores

Mitch Horton, Stanimire Tomov, Jack Dongarra Innovative Computing Laboratory

University of Tennessee

20 July, 2011

Outline

1) Motivation
2) Algorithm Description
3) Algorithm Tuning
4) Algorithm Optimization
5) Results
6) What's Next
7) Power Efficiency

Motivation

Moore’s Law

The number of transistors that can be placed inexpensively on an integrated circuit doubles approximately every two years. This trend has continued for more than half a century and is expected to continue until 2015 or 2020 or later.

Wikipedia

Kepler: 6,000,000,000 transistors (2011)

Motivation

May’s Law

Software efficiency halves every 18 months, compensating Moore's Law.

Wikipedia

Motivation

Nothing you can't spell will ever work.

Will Rogers

Sourcebook of Parallel Computing

Dongarra, Foster, Fox

Algorithm Description

MAGMA QR, 1 GPU, All Available Cores

Example configuration: 3360 x 3360 matrix, 2 x 6 cores, NB = 128, IB = 12, OB = 128

Components:

- QUARK LAPACK Factorization (6 cores, Q)
- Sequential LAPACK Update (6 cores, P)
- GPU Update
- MAGMA 1.0+: Optimized Panel Factorization (6 cores)
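The boxes above sketch how each step of the factorization is divided: an NB-wide panel is factorized on Q CPU cores as QUARK tasks (with inner blocking IB), while the trailing matrix is updated partly on the remaining P CPU cores with sequential LAPACK and partly on the GPU in OB-wide pieces. The CPU-only sketch below shows the underlying blocked QR loop using plain LAPACKE calls; it is an illustration of the loop structure under those assumptions, not MAGMA's implementation, and blocked_qr is a hypothetical name.

```c
#include <lapacke.h>

/* CPU-only sketch of the blocked QR structure behind the hybrid algorithm.
 * Assumes column-major storage, m >= n, and tau of length >= n.
 * In the hybrid code the panel step would run on Q cores under QUARK and the
 * trailing update would be split between P CPU cores and the GPU. */
void blocked_qr(int m, int n, double *A, int lda, double *tau, int NB)
{
    for (int j = 0; j < n; j += NB) {
        int jb = (n - j < NB) ? n - j : NB;   /* width of the current panel */

        /* Panel factorization of the (m-j) x jb panel at A(j, j).
         * Hybrid version: Q CPU cores, QUARK tasks, inner block size IB. */
        LAPACKE_dgeqrf(LAPACK_COL_MAJOR, m - j, jb,
                       &A[(size_t)j * lda + j], lda, &tau[j]);

        /* Trailing-matrix update: apply the panel's Q^T to columns j+jb..n-1.
         * Hybrid version: split between P CPU cores (sequential LAPACK) and
         * the GPU in OB-wide pieces. */
        if (j + jb < n)
            LAPACKE_dormqr(LAPACK_COL_MAJOR, 'L', 'T',
                           m - j, n - j - jb, jb,
                           &A[(size_t)j * lda + j], lda, &tau[j],
                           &A[(size_t)(j + jb) * lda + j], lda);
    }
}
```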

Algorithm Tuning Nightmare

Tuning parameters: Q, P, NB, IB, OB

Example problem sizes: 3360 x 3360, 5920 x 5920, 20000 x 20000

Figures: tuning parameters Q, P, NB, OB, IB plotted against matrix size (800 to 18080), double precision; one plot for 8 x 6 cores, one for 2 x 6 cores.
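One blunt way to attack this tuning space is an exhaustive sweep over the five parameters for each matrix size of interest. The sketch below assumes a hypothetical benchmarking hook, hybrid_qr_gflops(), that runs one hybrid factorization with a given (Q, P, NB, IB, OB) and returns its Gflop/s rate; the parameter ranges are illustrative and are not the ones behind the plots above.

```c
#include <stdio.h>

/* Hypothetical hook: run one hybrid QR factorization of an n x n matrix with
 * the given parameters and return the measured Gflop/s.  This stub is a
 * placeholder; substitute a real timed call to the code being tuned. */
static double hybrid_qr_gflops(int n, int Q, int P, int NB, int IB, int OB)
{
    (void)n; (void)Q; (void)P; (void)NB; (void)IB; (void)OB;
    return 0.0;
}

/* Exhaustive sweep over (Q, P, NB, IB, OB) for one matrix size. */
static void tune(int n, int ncores)
{
    double best = -1.0;
    int bQ = 0, bP = 0, bNB = 0, bIB = 0, bOB = 0;

    for (int Q = 1; Q < ncores; Q++)               /* cores for panel factorization     */
    for (int P = 1; P <= ncores - Q; P++)          /* cores for the CPU trailing update */
    for (int NB = 64; NB <= 256; NB += 64)         /* outer block size                  */
    for (int IB = 8; IB <= 32; IB += 4)            /* inner block size                  */
    for (int OB = 64; OB <= 256; OB += 64) {       /* GPU update block size             */
        double g = hybrid_qr_gflops(n, Q, P, NB, IB, OB);
        if (g > best) {
            best = g; bQ = Q; bP = P; bNB = NB; bIB = IB; bOB = OB;
        }
    }
    printf("n=%5d  best %.1f Gflop/s  Q=%d P=%d NB=%d IB=%d OB=%d\n",
           n, best, bQ, bP, bNB, bIB, bOB);
}

int main(void)
{
    int sizes[] = { 3360, 5920, 20000 };
    for (int i = 0; i < 3; i++)
        tune(sizes[i], 12);                        /* 12 = 2 x 6 cores */
    return 0;
}
```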

5920 x 128, 4 cores, IB = 8

Figure: Performance of multicore QR factorization on tall, skinny matrices, comparing different algorithms (Gflop/s vs. matrix size).

12 Cores (2 x 6-cores) 2.8 GHz X5660, 23 GB, 270 Gflop/s peak [keeneland]; Tesla M2070, 1.1 GHz, 5.4 GB, 1.03 Tflop/s peak

Single Node, Single GPU

Curves: Recursive, Left Looking Execution, Left Looking Insertion, Parallel MKL

Algorithm Optimization

Figures: performance for 800 x 64 (6 cores, IB = 12) and 23840 x 192 (2 cores, IB = 16), single precision.

Curves: Left Looking Insertion, Left Looking Execution, Recursive
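The curves above compare several ways of factorizing the tall, skinny panels on the CPU cores, including a recursive variant. As one concrete illustration of recursion on a panel, the sketch below gives a textbook recursive QR (split the columns in half, factor the left half, apply its reflectors to the right half, recurse on what remains) built only from standard LAPACKE calls; it is a minimal sketch of the general technique, not the implementation measured above, and rec_panel_qr is a hypothetical name.

```c
#include <lapacke.h>

/* Recursive QR factorization of an m x n panel (m >= n), column-major,
 * reflectors stored below the diagonal of A and scalar factors in tau
 * (length n).  Sketch of the recursive technique, not the code measured
 * in the plots above. */
static void rec_panel_qr(lapack_int m, lapack_int n,
                         double *A, lapack_int lda, double *tau)
{
    if (n <= 8) {                       /* small base case: plain LAPACK QR */
        LAPACKE_dgeqrf(LAPACK_COL_MAJOR, m, n, A, lda, tau);
        return;
    }
    lapack_int n1 = n / 2, n2 = n - n1;

    /* Factor the left half of the panel recursively: A1 = Q1 * R11. */
    rec_panel_qr(m, n1, A, lda, tau);

    /* Apply Q1^T to the right half: A2 := Q1^T * A2. */
    LAPACKE_dormqr(LAPACK_COL_MAJOR, 'L', 'T', m, n2, n1,
                   A, lda, tau, A + (size_t)n1 * lda, lda);

    /* Factor the trailing (m - n1) x n2 block of the right half recursively. */
    rec_panel_qr(m - n1, n2, A + (size_t)n1 * lda + n1, lda, tau + n1);
}
```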

Results

Figure: Performance of MAGMA QR with 1 GPU and all available cores, comparing precisions (Gflop/s vs. matrix size).

48 Cores (8 x 6-cores), 2.8 GHz AMD Opteron 8439 SE, 129 GB, peak 1080 Gflop/s [ig]; 1 GeForce GTX 480, 1.401 GHz clock, theoretical peak: 1.401 * 2 * 480 = 1.34496 Tflop/s

numactl --interleave=all

Curves: Single, Double, Complex Single, Complex Double

Results

Figure: Performance of MAGMA QR with 1 GPU and all available cores, comparing precisions (Gflop/s vs. matrix size).

12 Cores (2 x 6-cores) 2.8 GHz X5660, 23 GB, 270 Gflop/s peak [keeneland]; Tesla M2070, 1.1 GHz, 5.4 GB, 1.03 Tflop/s peak

Single Node, Single GPU

Curves: Double, Single, Complex Single, Complex Double

Results

Figure: Performance of MAGMA QR with 1 GPU and all available cores, double precision, comparing against MAGMA 1.0 and MKL (Gflop/s vs. matrix size).

12 Cores (2 x 6-cores) 2.8 GHz X5660, 23 GB, 270 Gflop/s peak [keeneland]; Tesla M2070, 1.1 GHz, 5.4 GB, 1.03 Tflop/s peak

Single Node, Single GPU

Curves: 1 GPU, 2 Sockets, New Approach; 1 GPU, 1 Socket, MAGMA 1.0; 1 GPU, 1 Core, MAGMA 1.0; 0 GPUs, 1 Socket, MKL


Results

Figure: Performance of MAGMA LU with 1 GPU and all available cores, single precision, comparing against MAGMA 1.0 and MKL (Gflop/s vs. matrix size).

12 Cores (2 x 6-cores) 2.8 GHz X5660, 23 GB, 270 Gflop/s peak [keeneland]; Tesla M2070, 1.1 GHz, 5.4 GB, 1.03 Tflop/s peak

Single Node, Single GPU

Curves: 1 GPU, All Cores; 1 GPU, 1 Socket, MAGMA 1.0; 0 GPUs, 2 Sockets, MKL

What's Next

Power Efficiency

Math is free. Transistors are free. Power is expensive. Performance Per Watt = Performance.

Jen-Hsun (Jensen) Huang, CEO of NVIDIA

Power Efficiency

Peak performance of any system is essentially limited by the amount of power it can draw and the amount of heat it can dissipate. Consequently, performance per watt of a GPU design translates directly into peak performance of a system that uses that design.

While performance per watt is useful, absolute power requirements are also important. Claims of improved performance per watt may be used to mask increasing power demands. For instance, though newer generation GPU architectures may provide better performance per watt, continued performance increases can negate the gains in efficiency, and the GPUs continue to consume large amounts of power.

Wikipedia


Power Efficiency

A Google engineer has warned that if the performance per watt of today's computers doesn't improve, the electrical costs of running them could end up far greater than the initial hardware price tag.

"If performance per watt is to remain constant over the next few years, power costs could easily overtake hardware costs, possibly by a large margin," Luiz Andre Barroso, who previously designed processors for Digital Equipment Corp., said in a September paper.

Google recently unveiled a major new datacenter site in a remote part of Oregon, where power costs are a fraction of those at Google's home base in Silicon Valley.


Power Efficiency

This has nothing to do with being "green." Every system and subsystem has to fit within some power budget.

Brough Turner 2007


Power Efficiency

"You want a battery on this device and that device that lasts three or four days? I do, too. Well, if we have much more high-performing systems at much lower watts that will trickle down into your cell phone and this recorder and everything else. So, if I can get a processor that runs on 5 watts, as opposed to 100 watts or whatever . . . Voila, batteries are going to last a lot longer ... “

ORNL scientific computing chief Jeff Nichols


Questions