
Page 1:

Understanding the Efficiency of GPU Algorithms for Matrix-Matrix Multiplication

Kayvon Fatahalian, Jeremy Sugerman, Pat Hanrahan
Stanford University, August 30, 2004

Page 2:

Motivation: Harness GPU Performance

[Chart: relative performance (0-6) of P4 3.4GHz, 6800 Ultra, and X800 XT PE, comparing peak FLOPS and memory bandwidth]

Page 3:

Streaming Computation on GPUs

GPUs accelerate streaming numerical algorithms:

- Data parallelism
- High ratio of arithmetic to data access
- Little data reuse

[Diagram: input elements flow through a kernel function (shader) to output elements]
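The slides target fragment shaders; as an illustration of the same streaming model in a form readers can run today, here is a minimal CUDA analogue (CUDA postdates this talk): one thread per element, a pure per-element function, no reuse of fetched data.

```cuda
#include <cuda_runtime.h>

// One thread per element: read an input element, apply the kernel
// function, write an output element. No fetched data is reused,
// matching the streaming model on this slide.
__global__ void stream_kernel(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * in[i] + 1.0f;  // arbitrary per-element math
}

// Launch over n elements (d_in/d_out are device pointers):
//   stream_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
```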

Page 4:

Streaming Computation on GPUs

- Level 1 BLAS operations: Buck et al. [2004]
- Fluid solvers: Krüger & Westermann [2003], Bolz et al. [2003]
- Image processing: Apple Corp. [2004], McCormick et al. [2004]
- Segmentation: Sherbondy et al. [2003]
- Database operations: Govindaraju et al. [2004]
- Data clustering: Hall et al. [2004]

Page 5:

Dense Matrix Multiplication

- Abundant data parallelism
- Regular data access (no branching)
- High ratio of computation to data access

[Diagram: C = A * B]
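To make that ratio concrete (standard operation counting, not stated on the slide): multiplying two n x n matrices performs about 2n^3 floating-point operations while touching only 3n^2 matrix elements (A, B, and C), so arithmetic intensity grows linearly with n:

```latex
\frac{\text{flops}}{\text{words}} \approx \frac{2n^3}{3n^2} = \frac{2n}{3}
```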

Page 6:

Dense Matrix Multiplication

- Widely used computational kernel
- Building block for the LAPACK library

Page 7:

Matrix Multiplication on GPUs

- Larsen & McAllister [2001]
- Moravansky [2003]
- Hall et al. [2003]

Limited analysis of performance

Page 8:

Overview

- GPU Implementations
- Results
- Analysis: Why GPUs are slow
- Ways to Make GPUs Better

Page 9:

CPU-Based Approaches

High-performance matrix multiplication algorithms are cache-aware.

Partition the computation into submatrix multiplications:

- Load input submatrices into cache
- Multiply submatrices
- Store output submatrix to memory

[Diagram: C = A * B]
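A minimal C sketch of this blocking strategy (block size, layout, and names are illustrative, not from the slides):

```cuda
// Host-side C sketch (valid CUDA host code). BS is a tuning
// parameter chosen so three BS x BS blocks fit in cache; n is
// assumed divisible by BS, and C must be zero-initialized.
#define BS 64

void matmul_blocked(int n, const float *A, const float *B, float *C) {
    for (int ii = 0; ii < n; ii += BS)
        for (int jj = 0; jj < n; jj += BS)
            for (int kk = 0; kk < n; kk += BS)
                // C[ii.., jj..] += A[ii.., kk..] * B[kk.., jj..];
                // each loaded block element is reused BS times
                // from cache before being evicted.
                for (int i = ii; i < ii + BS; i++)
                    for (int k = kk; k < kk + BS; k++) {
                        float a = A[i * n + k];
                        for (int j = jj; j < jj + BS; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```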

Page 10:

Method 1: Column Packed (CP)

Larsen & McAllister [SC2001], Moravansky [2003]

- 4 elements stored per texel (xyzw)
- Inner loop: 4x4-matrix by 4-vector multiplications

[Diagram: C = A * B with column elements packed 4-wide into texels]
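A sketch of the CP inner loop in CUDA terms (the original used fragment shaders; the float4 layout below, with four consecutive column elements per texel, is an assumption for illustration):

```cuda
#include <cuda_runtime.h>

// Hypothetical CP layout: each float4 texel holds 4 consecutive
// elements of one column, so an n x n matrix is (n/4) x n texels
// indexed [row_group * n + column]. One thread produces one texel
// of C (4 elements of a column), accumulating 4x4-block by
// 4-vector products, as on this slide.
__global__ void matmul_cp(const float4 *A, const float4 *B,
                          float4 *C, int n) {
    int col  = blockIdx.x * blockDim.x + threadIdx.x;  // column of C
    int row4 = blockIdx.y * blockDim.y + threadIdx.y;  // row group
    if (col >= n || row4 >= n / 4) return;

    float4 acc = make_float4(0.f, 0.f, 0.f, 0.f);
    for (int k = 0; k < n; k += 4) {
        float4 b  = B[(k / 4) * n + col];   // B[k..k+3][col]
        float4 a0 = A[row4 * n + k];        // columns k..k+3 of the
        float4 a1 = A[row4 * n + k + 1];    // 4x4 block of A at rows
        float4 a2 = A[row4 * n + k + 2];    // 4*row4 .. 4*row4+3
        float4 a3 = A[row4 * n + k + 3];
        acc.x += a0.x * b.x + a1.x * b.y + a2.x * b.z + a3.x * b.w;
        acc.y += a0.y * b.x + a1.y * b.y + a2.y * b.z + a3.y * b.w;
        acc.z += a0.z * b.x + a1.z * b.y + a2.z * b.z + a3.z * b.w;
        acc.w += a0.w * b.x + a1.w * b.y + a2.w * b.z + a3.w * b.w;
    }
    C[row4 * n + col] = acc;
}
```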

Page 11:

Method 2: Submatrix Packed (SP)

Hall et al. [2003]

- 2x2 submatrix stored per texel (x y / z w)
- Inner loop: 2x2 by 2x2 submatrix multiplications

[Diagram: C = A * B with 2x2 submatrices packed into texels]
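The corresponding SP inner loop, again as a hypothetical CUDA analogue of the original fragment shader:

```cuda
#include <cuda_runtime.h>

// Hypothetical SP layout: each float4 holds one 2x2 submatrix as
// (x y / z w), so an n x n matrix is m x m texels with m = n/2.
// One thread produces one 2x2 block of C by accumulating
// 2x2-by-2x2 block products, as on this slide.
__global__ void matmul_sp(const float4 *A, const float4 *B,
                          float4 *C, int n) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // block column
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // block row
    int m = n / 2;
    if (i >= m || j >= m) return;

    float4 acc = make_float4(0.f, 0.f, 0.f, 0.f);
    for (int k = 0; k < m; k++) {
        float4 a = A[i * m + k];  // block A_ik
        float4 b = B[k * m + j];  // block B_kj
        acc.x += a.x * b.x + a.y * b.z;
        acc.y += a.x * b.y + a.y * b.w;
        acc.z += a.z * b.x + a.w * b.z;
        acc.w += a.z * b.y + a.w * b.w;
    }
    C[i * m + j] = acc;
}
```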

Page 12:

Alternative Approaches Ineffective

- Varied mapping into texture memory
- Altered rasterization order with geometry
  - Single quad most effective
- Utilized multiple outputs
- Varied amount of loop unrolling
  - Column packed: unroll maximally
  - Submatrix packed: unroll 128 times

Page 13:

Performance Results

CPU: Pentium 4 3GHz, 512KB L2 cache

- 12 GFLOPS peak compute
- 44.1 GB/sec cache bandwidth
- Using the sgemm routine from the ATLAS package

GPUs:

- NVIDIA: GeForce FX 5900 Ultra, GeForce 6800 Ultra
- ATI: Radeon 9800 XT, Radeon X800 XT PE (prerelease, 500MHz memory / 500MHz core clock)

Page 14:

Previous Generation GPUs

[Chart: multiplication of 1024x1024 matrices on P4 3GHz, 5900 Ultra, and 9800 XT; left axis GFLOPS (0-12), right axis bandwidth in GB/sec (0-30)]

Page 15:

Current Generation GPUs

[Chart: multiplication of 1024x1024 matrices on P4 3GHz, 6800 Ultra, and X800 XT PE; left axis GFLOPS (0-12), right axis bandwidth in GB/sec (0-30)]

Page 16:

Fragment Processor Data Paths

[Diagram: data flows from the L2 through the texture unit and L1 texture cache into the fragment processor, then out to the frame buffer]

Page 17:

GPU Microbenchmarks

[Chart: peak arithmetic rate in GFLOPS (0-70) for 5900 Ultra, 6800 Ultra, 9800 XT, and X800 XT PE]

Page 18:

GPU Microbenchmarks

[Chart: observed bandwidth in GB/sec (0-30) for 5900 Ultra, 6800 Ultra, 9800 XT, and X800 XT PE; bars for cache bandwidth and sequential bandwidth]

Page 19:

Fragment Processor Data Paths

[Diagram: the Page 16 data path, annotated. The path through the texture unit is high bandwidth (sized for texture filtering); the path from the L1 texture cache into the fragment processor is low bandwidth (1 float/clock). The fragment processor executes 1 four-wide MAD per clock.]

The fragment processor consumes data at 8X the rate the texture path provides it!
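One way to read the 8X figure (this operand accounting is an assumption, taking both multiplicand vectors of each MAD to come from texture): the processor can consume two 4-wide operands per clock while the cache path delivers one float per clock:

```latex
\frac{2 \times 4 \ \text{floats/clock consumed}}{1 \ \text{float/clock delivered}} = 8
```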

Page 20:

Datapaths Designed for Shading

[Diagram: the same data path, annotated for shading workloads: texture filtering performs an 8-to-1 reduction in the amount of data, and the fragment processor receives 4 components per clock at 8 bits per component, a 2-to-1 ratio of compute to bandwidth.]

- Texture units filter (reduce) data
- Shaders use interpolated values & constants

Page 21:

Compute and Bandwidth Efficiency

[Chart: percentage of peak (0-100) achieved for compute and bandwidth by 5900 Ultra, 6800 Ultra, 9800 XT, X800 XT PE, and P4 3GHz]

GPU algorithms are severely bandwidth limited!

Page 22:

Minimize Texture Fetches

Block in the shader register file:

- Would need 8x8 submatrices to run at peak rates
- Limited to 4x4 submatrices by available outputs
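The standard blocking argument behind these numbers (not spelled out on the slide): with b x b register blocking, each fetched operand is reused b times, so texture fetches per math operation fall as 1/b, and closing the 8X gap from Page 19 would require b = 8:

```latex
\text{fetches per MAD} \propto \frac{1}{b}
\quad\Rightarrow\quad
b = 8 \text{ to close an } 8\times \text{ bandwidth gap}
```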

Page 23:

Improvement 1: Widen Datapath

The fragment processor receives cached data more quickly.

- Expect performance to improve linearly with the increase in bandwidth
- Need ~4X improvement to achieve peak performance
- But the L2 may no longer be able to fill the L1

Page 24:

Improvement 2: Larger Scratch Space

- Requires a large number of registers
- Needs a large number of output values
- Reduces texture bandwidth requirements
- Performance increases linearly with the dimension of the submatrices
- Increases the amount of per-pixel state: storage increases as the square of the submatrix dimension
- Requires 16X the space of the SP method for peak performance
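The 16X figure follows from the quadratic growth in per-fragment state: moving from SP's 2x2 output blocks to the 8x8 blocks needed for peak multiplies storage by

```latex
\left(\frac{8}{2}\right)^2 = 16
```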

Page 25:

Summary

GPU algorithms for matrix-matrix multiplication run inefficiently:

- The best algorithms achieve below 20% of peak performance
- They saturate the data path between the texture and FP units

Cache-aware software blocking strategies do not improve performance:

- They cannot exploit data reuse
- Hardware limits algorithm efficiency

Page 26:

Summary

Hardware changes are required to improve efficiency:

- Widen the path between texture and the register file
- Output a large number of values from shaders

Improved efficiency would make GPUs a powerful platform for a broader class of numerical algorithms.

Page 27:

Acknowledgements

Thanks to Ian Buck, Mike Houston, Sean Treichler, Nick Triantos, Steve Morein

Support from ATI, NVIDIA, DARPA, IBM, SONY

Rambus Stanford Graduate Fellowship

Stanford School of Engineering Fellowship

Page 28:

Questions?