TRANSCRIPT
Understanding the Efficiency of GPU Algorithms for Matrix-Matrix Multiplication
Kayvon Fatahalian, Jeremy Sugerman, Pat Hanrahan
Stanford University
August 30, 2004
Motivation: Harness GPU Performance

[Bar chart: relative performance (peak FLOPS and memory bandwidth) of the P4 3.4GHz, 6800 Ultra, and X800 XT PE]
Streaming Computation on GPUs

GPUs accelerate streaming numerical algorithms
- Data parallelism
- High ratio of arithmetic to data access
- Little data reuse

[Diagram: a kernel function (shader) maps input elements to output elements]
Streaming Computation on GPUs

- Level 1 BLAS operations: Buck et al. [2004]
- Fluid solvers: Krüger & Westermann [2003], Bolz et al. [2003]
- Image processing: Apple Corp. [2004], McCormick et al. [2004]
- Segmentation: Sherbondy et al. [2003]
- Database operations: Govindaraju et al. [2004]
- Data clustering: Hall et al. [2004]
Dense Matrix Multiplication

C = A * B

- Abundant data parallelism
- Regular data access (no branching)
- High ratio of computation to data access
Dense Matrix Multiplication

- Widely used computational kernel
- Building block for the LAPACK library
Matrix Multiplication on GPUs

- Larsen & McAllister [2001]
- Moravansky [2003]
- Hall et al. [2003]

Limited analysis of performance to date
Overview

- GPU Implementations
- Results
- Analysis: Why GPUs are slow
- Ways to Make GPUs Better
CPU-Based Approaches

High-performance matrix multiplication algorithms are cache-aware

Partition the computation into submatrix multiplications:
1. Load input submatrices into cache
2. Multiply submatrices
3. Store the output submatrix to memory

C = A * B
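The blocked strategy above can be sketched in NumPy (the block size and function name are illustrative, not taken from ATLAS or the talk):

```python
import numpy as np

# Cache-aware blocking sketch: partition C = A @ B into b x b submatrix
# products so each input block is reused b times while it stays in cache.
def blocked_matmul(A, B, b=64):
    n = A.shape[0]  # assume square n x n matrices with b dividing n
    C = np.zeros((n, n), dtype=A.dtype)
    for i in range(0, n, b):
        for j in range(0, n, b):
            for k in range(0, n, b):
                # load two input submatrices, multiply, accumulate
                C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
    return C
```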
Method 1: Column Packed (CP)

Larsen & McAllister [SC2001]; Moravansky [2003]

C = A * B

- 4 elements (x, y, z, w) stored per texel
- Inner loop: 4x4-matrix by 4-vector multiplications
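The CP layout can be simulated in NumPy (an illustrative model, not shader code; the function name is hypothetical):

```python
import numpy as np

# Column-packed (CP) simulation: each "texel" holds 4 consecutive
# elements of a column, and the inner loop accumulates
# 4x4-matrix-by-4-vector products into one 4-wide output texel.
def cp_multiply(A, B):
    n = A.shape[0]  # assume square matrices, n divisible by 4
    C = np.zeros((n, n), dtype=A.dtype)
    for j in range(n):             # one output column at a time
        for i in range(0, n, 4):   # one 4-wide texel of C per fragment
            acc = np.zeros(4, dtype=A.dtype)
            for k in range(0, n, 4):
                # 4x4 block of A times a 4-element slice of B's column j
                acc += A[i:i+4, k:k+4] @ B[k:k+4, j]
            C[i:i+4, j] = acc
    return C
```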
Method 2: Submatrix Packed (SP)

Hall et al. [2003]

C = A * B

- 2x2 submatrix (x, y, z, w) stored per texel
- Inner loop: 2x2 by 2x2 submatrix multiplications
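The SP layout can likewise be simulated in NumPy (illustrative, not shader code):

```python
import numpy as np

# Submatrix-packed (SP) simulation: each "texel" holds a 2x2 submatrix
# in its (x, y, z, w) components, and each output texel accumulates
# 2x2-by-2x2 submatrix products.
def sp_multiply(A, B):
    n = A.shape[0]  # assume square matrices with even n
    # pack: texel [i, j] is the 2x2 block at rows 2i..2i+1, cols 2j..2j+1
    Ap = A.reshape(n // 2, 2, n // 2, 2).transpose(0, 2, 1, 3)
    Bp = B.reshape(n // 2, 2, n // 2, 2).transpose(0, 2, 1, 3)
    Cp = np.zeros_like(Ap)
    for i in range(n // 2):
        for j in range(n // 2):
            for k in range(n // 2):      # one 2x2-by-2x2 product per step
                Cp[i, j] += Ap[i, k] @ Bp[k, j]
    # unpack the 2x2 blocks back into an n x n matrix
    return Cp.transpose(0, 2, 1, 3).reshape(n, n)
```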
Alternative Approaches Ineffective

- Varied mapping into texture memory
- Altered rasterization order with geometry
  - Single quad most effective
- Utilized multiple outputs
- Varied amount of loop unrolling
  - Column packed: unroll maximally
  - Submatrix packed: unroll 128 times
Performance Results

CPU: Pentium 4 3GHz, 512KB L2 cache
- 12 GFLOPS peak compute
- 44.1 GB/sec cache bandwidth
- Using the sgemm routine from the ATLAS package

GPUs:
- NVIDIA GeForce 5900 Ultra, GeForce 6800 Ultra
- ATI Radeon 9800 XT, Radeon X800 XT PE (prerelease, 500MHz mem / 500MHz core clock)
Previous Generation GPUs

[Bar chart: GFLOPS and bandwidth (GB/sec) for multiplication of 1024x1024 matrices on the P4 3GHz, 5900 Ultra, and 9800 XT]
Current Generation GPUs

[Bar chart: GFLOPS and bandwidth (GB/sec) for multiplication of 1024x1024 matrices on the P4 3GHz, 6800 Ultra, and X800 XT PE]
Fragment Processor Data Paths

[Diagram: data flows from the L2 cache through the texture unit and L1 texture cache into the fragment processor, then out to the frame buffer]
GPU Microbenchmarks

Peak Arithmetic Rate

[Bar chart: peak GFLOPS for the 5900 Ultra, 6800 Ultra, 9800 XT, and X800 XT PE]
GPU Microbenchmarks

Observed Bandwidth

[Bar chart: cache bandwidth and sequential bandwidth (GB/sec) for the 5900 Ultra, 6800 Ultra, 9800 XT, and X800 XT PE]
Fragment Processor Data Paths

[Diagram, annotated: the texture unit reads from L2 at high bandwidth (texture filtering); the L1 texture cache feeds the fragment processor at low bandwidth (1 float/clock); the fragment processor issues 1 4-wide MAD/clock]

The fragment processor consumes data at 8X the rate the texture path provides it!
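One reading of the 8X figure, as a back-of-the-envelope check (illustrative arithmetic, not from the slides themselves):

```python
# A 4-wide MAD per clock can consume two fresh 4-float operands,
# i.e. 8 floats/clock, while the L1-to-fragment-processor path
# delivers about 1 float/clock -- an 8:1 imbalance.
floats_consumed = 2 * 4    # two 4-wide operands per MAD per clock
floats_delivered = 1       # observed cache-path delivery rate
print(floats_consumed / floats_delivered)  # -> 8.0
```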
Datapaths Designed for Shading

[Diagram: the texture unit performs an 8-to-1 reduction in the amount of data, delivering 4 components per clock; with 8-bit components this gives a 2-to-1 ratio of compute to bandwidth]

- Texture units filter (reduce) data
- Shaders use interpolated values and constants
Compute and Bandwidth Efficiency

[Bar chart: percentage of peak compute and bandwidth achieved on the 5900 Ultra, 6800 Ultra, 9800 XT, X800 XT PE, and P4 3GHz]

GPU algorithms are severely bandwidth limited!
Minimize Texture Fetches

Block in the shader register file
- Would need 8x8 submatrices to run at peak rates
- Limited to 4x4 submatrices by available outputs
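The 8x8 figure follows from a short arithmetic-intensity argument (an illustrative reconstruction, not spelled out on the slides):

```python
# Each step of a b x b blocked multiply fetches two b x b input blocks
# (2*b**2 floats) and performs 2*b**3 flops, i.e. b flops per float
# fetched. With the ALUs able to issue roughly 8 flops per float the
# texture path delivers, peak rates require b = 8.
def flops_per_float(b):
    return (2 * b**3) / (2 * b**2)  # = b

print(flops_per_float(4))  # 4x4 blocks: 4.0, half the needed intensity
print(flops_per_float(8))  # 8x8 blocks: 8.0, matches the datapath ratio
```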
Improvement 1: Widen Datapath

The fragment processor receives cached data more quickly
- Expect performance to improve linearly with the increase in bandwidth
- Need ~4X improvement to achieve peak performance
- But the L2 may no longer be able to fill the L1
Improvement 2: Larger Scratch Space

- Requires a large number of registers
- Needs a large number of output values
- Reduces texture bandwidth requirements
- Performance increases linearly with the dimension of the submatrices
- Increases the amount of per-pixel state
  - Storage increases as the square of the submatrix dimension
  - Requires 16X the space of the SP method for peak performance
Summary

GPU algorithms for matrix-matrix multiplication run inefficiently
- Best algorithms achieve below 20% of peak performance
- They saturate the data path between the texture and FP units

Cache-aware software blocking strategies do not improve performance
- Cannot exploit data reuse
- Hardware limits algorithm efficiency
Summary

Hardware changes are required to improve efficiency
- Widen the path between texture and the register file
- Output a large number of values from shaders

Improved efficiency would make GPUs a powerful platform for a broader class of numerical algorithms
Acknowledgements
Thanks to Ian Buck, Mike Houston, Sean Treichler, Nick Triantos, Steve Morein
Support from ATI, NVIDIA, DARPA, IBM, SONY
Rambus Stanford Graduate Fellowship
Stanford School of Engineering Fellowship
Questions?