TRANSCRIPT
Understanding the Efficiency of GPU Algorithms for Matrix-Matrix Multiplication
Kayvon Fatahalian, Jeremy Sugerman, Pat Hanrahan
Stanford University
August 30, 2004
Motivation: Harness GPU Performance

[Bar chart: relative performance (peak FLOPS and memory bandwidth) of the P4 3.4GHz, 6800 Ultra, and X800 XT PE]
Streaming Computation on GPUs

GPUs accelerate streaming numerical algorithms
- Data parallelism
- High ratio of arithmetic to data access
- Little data reuse

[Diagram: a kernel function (shader) maps input elements to output elements]
Streaming Computation on GPUs

- Level 1 BLAS operations: Buck et al. [2004]
- Fluid solvers: Krüger & Westermann [2003], Bolz et al. [2003]
- Image processing: Apple Corp. [2004], McCormick et al. [2004]
- Segmentation: Sherbondy et al. [2003]
- Database operations: Govindaraju et al. [2004]
- Data clustering: Hall et al. [2004]
Dense Matrix Multiplication

C = A * B

- Abundant data parallelism
- Regular data access (no branching)
- High ratio of computation to data access
Dense Matrix Multiplication

- Widely used computational kernel
- Building block for the LAPACK library
Matrix Multiplication on GPUs

- Larsen & McAllister [2001]
- Moravansky [2003]
- Hall et al. [2003]

Limited analysis of performance to date
Overview

- GPU Implementations
- Results
- Analysis: Why GPUs are slow
- Ways to Make GPUs Better
CPU-Based Approaches

High-performance matrix multiplication algorithms are cache-aware

Partition the computation into submatrix multiplications:
1. Load input submatrices into cache
2. Multiply submatrices
3. Store the output submatrix to memory

C = A * B
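The blocked strategy above can be sketched in NumPy (the block size and function name are illustrative, not taken from ATLAS or the talk):

```python
import numpy as np

# Cache-aware blocking sketch: partition C = A @ B into b x b submatrix
# products so each input block is reused b times while it stays in cache.
def blocked_matmul(A, B, b=64):
    n = A.shape[0]  # assume square n x n matrices with b dividing n
    C = np.zeros((n, n), dtype=A.dtype)
    for i in range(0, n, b):
        for j in range(0, n, b):
            for k in range(0, n, b):
                # load two input submatrices, multiply, accumulate
                C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
    return C
```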
Method 1: Column Packed (CP)

Larsen & McAllister [SC2001]; Moravansky [2003]

C = A * B

- 4 elements (x, y, z, w) stored per texel
- Inner loop: 4x4-matrix by 4-vector multiplications
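The CP layout can be simulated in NumPy (an illustrative model, not shader code; the function name is hypothetical):

```python
import numpy as np

# Column-packed (CP) simulation: each "texel" holds 4 consecutive
# elements of a column, and the inner loop accumulates
# 4x4-matrix-by-4-vector products into one 4-wide output texel.
def cp_multiply(A, B):
    n = A.shape[0]  # assume square matrices, n divisible by 4
    C = np.zeros((n, n), dtype=A.dtype)
    for j in range(n):             # one output column at a time
        for i in range(0, n, 4):   # one 4-wide texel of C per fragment
            acc = np.zeros(4, dtype=A.dtype)
            for k in range(0, n, 4):
                # 4x4 block of A times a 4-element slice of B's column j
                acc += A[i:i+4, k:k+4] @ B[k:k+4, j]
            C[i:i+4, j] = acc
    return C
```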
Method 2: Submatrix Packed (SP)

Hall et al. [2003]

C = A * B

- 2x2 submatrix (x, y, z, w) stored per texel
- Inner loop: 2x2 by 2x2 submatrix multiplications
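The SP layout can likewise be simulated in NumPy (illustrative, not shader code):

```python
import numpy as np

# Submatrix-packed (SP) simulation: each "texel" holds a 2x2 submatrix
# in its (x, y, z, w) components, and each output texel accumulates
# 2x2-by-2x2 submatrix products.
def sp_multiply(A, B):
    n = A.shape[0]  # assume square matrices with even n
    # pack: texel [i, j] is the 2x2 block at rows 2i..2i+1, cols 2j..2j+1
    Ap = A.reshape(n // 2, 2, n // 2, 2).transpose(0, 2, 1, 3)
    Bp = B.reshape(n // 2, 2, n // 2, 2).transpose(0, 2, 1, 3)
    Cp = np.zeros_like(Ap)
    for i in range(n // 2):
        for j in range(n // 2):
            for k in range(n // 2):      # one 2x2-by-2x2 product per step
                Cp[i, j] += Ap[i, k] @ Bp[k, j]
    # unpack the 2x2 blocks back into an n x n matrix
    return Cp.transpose(0, 2, 1, 3).reshape(n, n)
```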
Alternative Approaches Ineffective

- Varied mapping into texture memory
- Altered rasterization order with geometry
  - Single quad most effective
- Utilized multiple outputs
- Varied amount of loop unrolling
  - Column packed: unroll maximally
  - Submatrix packed: unroll 128 times
Performance Results

CPU: Pentium 4 3GHz, 512KB L2 cache
- 12 GFLOPS peak compute
- 44.1 GB/sec cache bandwidth
- Using the sgemm routine from the ATLAS package

GPUs:
- NVIDIA GeForce 5900 Ultra, GeForce 6800 Ultra
- ATI Radeon 9800 XT, Radeon X800 XT PE (prerelease, 500MHz mem / 500MHz core clock)
Previous Generation GPUs

[Bar chart: GFLOPS and bandwidth (GB/sec) for multiplication of 1024x1024 matrices on the P4 3GHz, 5900 Ultra, and 9800 XT]
Current Generation GPUs

[Bar chart: GFLOPS and bandwidth (GB/sec) for multiplication of 1024x1024 matrices on the P4 3GHz, 6800 Ultra, and X800 XT PE]
Fragment Processor Data Paths

[Diagram: data flows from the L2 cache through the texture unit and L1 texture cache into the fragment processor, then out to the frame buffer]
GPU Microbenchmarks

Peak Arithmetic Rate

[Bar chart: peak GFLOPS for the 5900 Ultra, 6800 Ultra, 9800 XT, and X800 XT PE]
GPU Microbenchmarks

Observed Bandwidth

[Bar chart: cache bandwidth and sequential bandwidth (GB/sec) for the 5900 Ultra, 6800 Ultra, 9800 XT, and X800 XT PE]
Fragment Processor Data Paths

[Diagram, annotated: the texture unit reads from L2 at high bandwidth (texture filtering); the L1 texture cache feeds the fragment processor at low bandwidth (1 float/clock); the fragment processor issues 1 4-wide MAD/clock]

The fragment processor consumes data at 8X the rate the texture path provides it!
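One reading of the 8X figure, as a back-of-the-envelope check (illustrative arithmetic, not from the slides themselves):

```python
# A 4-wide MAD per clock can consume two fresh 4-float operands,
# i.e. 8 floats/clock, while the L1-to-fragment-processor path
# delivers about 1 float/clock -- an 8:1 imbalance.
floats_consumed = 2 * 4    # two 4-wide operands per MAD per clock
floats_delivered = 1       # observed cache-path delivery rate
print(floats_consumed / floats_delivered)  # -> 8.0
```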
Datapaths Designed for Shading

[Diagram: the texture unit performs an 8-to-1 reduction in the amount of data, delivering 4 components per clock; with 8-bit components this gives a 2-to-1 ratio of compute to bandwidth]

- Texture units filter (reduce) data
- Shaders use interpolated values and constants
Compute and Bandwidth Efficiency

[Bar chart: percentage of peak compute and bandwidth achieved on the 5900 Ultra, 6800 Ultra, 9800 XT, X800 XT PE, and P4 3GHz]

GPU algorithms are severely bandwidth limited!
Minimize Texture Fetches

Block in the shader register file
- Would need 8x8 submatrices to run at peak rates
- Limited to 4x4 submatrices by available outputs
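The 8x8 figure follows from a short arithmetic-intensity argument (an illustrative reconstruction, not spelled out on the slides):

```python
# Each step of a b x b blocked multiply fetches two b x b input blocks
# (2*b**2 floats) and performs 2*b**3 flops, i.e. b flops per float
# fetched. With the ALUs able to issue roughly 8 flops per float the
# texture path delivers, peak rates require b = 8.
def flops_per_float(b):
    return (2 * b**3) / (2 * b**2)  # = b

print(flops_per_float(4))  # 4x4 blocks: 4.0, half the needed intensity
print(flops_per_float(8))  # 8x8 blocks: 8.0, matches the datapath ratio
```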
Improvement 1: Widen Datapath

The fragment processor receives cached data more quickly
- Expect performance to improve linearly with the increase in bandwidth
- Need ~4X improvement to achieve peak performance
- But the L2 may no longer be able to fill the L1
Improvement 2: Larger Scratch Space

- Requires a large number of registers
- Needs a large number of output values
- Reduces texture bandwidth requirements
- Performance increases linearly with the dimension of the submatrices
- Increases the amount of per-pixel state
  - Storage increases as the square of the submatrix dimension
  - Requires 16X the space of the SP method for peak performance
Summary

GPU algorithms for matrix-matrix multiplication run inefficiently
- Best algorithms achieve below 20% of peak performance
- They saturate the data path between the texture and FP units

Cache-aware software blocking strategies do not improve performance
- Cannot exploit data reuse
- Hardware limits algorithm efficiency
Summary

Hardware changes are required to improve efficiency
- Widen the path between texture and the register file
- Output a large number of values from shaders

Improved efficiency would make GPUs a powerful platform for a broader class of numerical algorithms
Acknowledgements
Thanks to Ian Buck, Mike Houston, Sean Treichler, Nick Triantos, Steve Morein
Support from ATI, NVIDIA, DARPA, IBM, SONY
Rambus Stanford Graduate Fellowship
Stanford School of Engineering Fellowship
Questions?