Exploiting the Graphics Hardware to Solve Two Compute-Intensive Problems
Sheetal Lahabar and P. J. Narayanan
Center for Visual Information Technology,
IIIT - Hyderabad
General-Purpose Computation on GPUs
Why GPGPU?
Computational power: Pentium 4: 12 GFLOPS; GTX 280: 1 TFLOPS
High performance growth, faster than Moore's law: CPU 1.4x, GPU 1.7x-2.3x per year
Disparity in performance: CPU (caches and branch prediction) vs. GPU (arithmetic intensity)
Flexible and precise: programmability, high-level language support
Economics: the gaming market
The Problem: GPUs are difficult to use
GPUs are designed for and driven by graphics
The programming model is unusual and tied to graphics
The environment is tightly constrained
The underlying architectures are inherently parallel, rapidly evolving, and largely secret
You can't simply “port” code written for the CPU!
Mapping Computations to the GPU
Data-parallel processing: the GPU architecture is ALU-heavy
Performance depends on arithmetic intensity = computation / bandwidth ratio
Hide memory latency with more computation
GPU architecture
Singular Value Decomposition on the GPU using CUDA
Proceedings of the IEEE International Parallel & Distributed Processing Symposium (IPDPS '09), 25-29 May 2009, Rome, Italy
Problem Statement
SVD of a matrix A (m x n), m > n: A = U Σ V^T
U and V are orthogonal and Σ is a diagonal matrix
Motivation
SVD has many applications in image processing, pattern recognition, etc.
High computational complexity; GPUs have high computing power (teraflop performance)
Exploit the GPU for high performance
Related Work
Ma et al. implemented two-sided rotation Jacobi on a 2-million-gate FPGA (2006)
Yamamoto et al. proposed a method on the CSX600 (2007), but only for large rectangular matrices
Bobda et al. proposed an implementation on a distributed reconfigurable system (2001)
Zhang Shu et al. implemented one-sided Jacobi, which works only for small matrices
Bondhugula et al. proposed a hybrid GPU implementation using frame buffer objects
Methods
SVD algorithms: Golub-Reinsch (bidiagonalization and diagonalization); Hestenes algorithm (Jacobi)
Golub-Reinsch method: simple and compact, maps well to the GPU, popular in numerical libraries
Golub-Reinsch algorithm
Bidiagonalization: a series of Householder transformations
Diagonalization: implicitly shifted QR iterations
Overall algorithm:
  B ← Q^T A P        (bidiagonalization of A to B)
  Σ ← X^T B Y        (diagonalization of B to Σ)
  U ← Q X, V^T ← (P Y)^T   (compute the orthogonal matrices U and V^T)
Complexity: O(mn^2) for m > n
Bidiagonalization
B = Q^T A P, with Q^T and P initialized to identity matrices
Simple bidiagonalization, i-th update:
  A(i+1:m, i+1:n) = A(i+1:m, i+1:n) − u_i f(u_i, v_i) − f(v_i) v_i
  Q^T(i:m, 1:m) = Q^T(i:m, 1:m) − f(Q, u_i) u_i
  P(1:n, i:n) = P(1:n, i:n) − f(P, v_i) v_i
Contd… Many reads and writes, so use block updates
Divide the matrix into n/L blocks; eliminate L rows and columns at once
n/L block transformations in total
The i-th block transformation (e.g. L = 3) updates the trailing
A(iL+1:m, iL+1:n), Q(1:m, iL+1:m) and P^T(iL+1:n, 1:n)
Update using BLAS operations, as in the sketch below
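For concreteness, a minimal sketch of one trailing block update with cuBLAS. The single-precision call matches the CUBLAS usage described later, but the buffer names and the assumption that the block's L Householder vectors from the u side have been pre-accumulated into dU and dW are illustrative, not the paper's code (the v-side updates are handled symmetrically).

#include <cublas_v2.h>

// Hypothetical sketch: apply one block transformation to the trailing
// submatrix, A_t <- A_t - dU * dW^T. The L rank-1 Householder updates are
// batched into a single rank-L SGEMM, which is why block updates run as
// Level 3 BLAS instead of many Level 2 calls.
void trailingBlockUpdate(cublasHandle_t h,
                         float *dA, int ldA,   // trailing submatrix, m_t x n_t
                         const float *dU,      // accumulated u vectors, m_t x L
                         const float *dW,      // accumulated w vectors, n_t x L
                         int m_t, int n_t, int L)
{
    const float minusOne = -1.0f, one = 1.0f;
    cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_T,
                m_t, n_t, L,
                &minusOne, dU, m_t,   // op(A) = dU,   m_t x L
                dW, n_t,              // op(B) = dW^T, L x n_t
                &one, dA, ldA);       // A_t is overwritten in place
}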
Contd… Final bidiagonal matrix B = Q^T A P
Store the L u_i's and v_i's; additional space complexity O(mL)
Partial bidiagonalization computes only B
Challenges
Iterative algorithm; repeated data transfers
High precision requirements
Irregular data access
Matrix size affects performance

Bidiagonalization on GPU
Block updates require Level 3 BLAS; single-precision CUBLAS functions are used
CUBLAS gives high performance even for smaller dimensions; matrix dimensions are kept multiples of 32
Operations stay on data local to the GPU; expensive GPU-CPU transfers are avoided
Contd…
In-place bidiagonalization; efficient GPU implementation
The bidiagonal matrix is copied to the CPU
Diagonalization
Implicitly shifted QR algorithm: Σ = X^T B Y, with X and Y initialized to identity matrices
[Figure: Givens rotations sweep the band of B between indexes k1 and k2, which change from iteration to iteration]
Diagonalization: apply the implicitly shifted QR algorithm
In every iteration, until convergence:
  Find the matrix indexes k1 and k2
  Apply Givens rotations on B
  Store coefficient vectors (C1, S1) and (C2, S2) of length k2 − k1
  Transform k2 − k1 + 1 rows of Y^T using (C1, S1)
  Transform k2 − k1 + 1 columns of X using (C2, S2)
Contd… Forward transformation on Y^T, using coefficient vectors (C1, S1):

for (j = k1; j < k2; j++) {
    r = Y^T(j, 1:n)                                       // save row j before overwriting
    Y^T(j, 1:n)   = f(r, Y^T(j+1, 1:n), C1(j-k1+1), S1(j-k1+1))
    Y^T(j+1, 1:n) = g(r, Y^T(j+1, 1:n), C1(j-k1+1), S1(j-k1+1))
}
Diagonalization on GPU
Hybrid algorithm: the Givens rotations modify B on the CPU; the coefficient vectors are transferred to the GPU
Row transformations: transform k2 − k1 + 1 rows of Y^T and X^T on the GPU
Contd… A row element depends on the next or previous row element
A row is divided into blocks of blockDim.x elements (B1 … Bk … Bn)
Contd…
The kernel modifies k2 − k1 + 1 rows, looping over k2 − k1 rows
Two rows are kept in shared memory at a time
Requires k2 − k1 + 1 coefficient vectors, which are copied to shared memory
Efficient division of rows; each thread works independently (a sketch follows)
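A minimal sketch of such a row-rotation kernel, assuming the standard Givens form for f and g (new row j = c·row j + s·row j+1; new row j+1 = c·row j+1 − s·row j) and a row-major Y^T; for brevity it reads coefficients from global memory rather than staging rows and coefficients in shared memory as the slides describe.

// Hypothetical sketch, not the paper's kernel: each thread owns one column of
// YT and applies the chain of k2-k1 Givens rotations down that column, so
// threads never interact and neighbouring threads touch neighbouring addresses.
__global__ void applyRowRotations(float *YT, int n, int k1, int k2,
                                  const float *C1, const float *S1)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= n) return;

    float cur = YT[k1 * n + col];                // element (k1, col), row-major
    for (int j = k1; j < k2; ++j) {
        float nxt = YT[(j + 1) * n + col];
        float c = C1[j - k1], s = S1[j - k1];
        YT[j * n + col] = c * cur + s * nxt;     // f: new row j
        cur = c * nxt - s * cur;                 // g: becomes new row j+1
    }
    YT[k2 * n + col] = cur;                      // flush the last rotated row
}

A launch such as applyRowRotations<<<(n + 255)/256, 256>>>(dYT, n, k1, k2, dC1, dS1) transforms all k2 − k1 + 1 rows in one pass.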
Orthogonal matrices
CUBLAS matrix multiplication for U and V^T
Good performance even for small matrices
Results
Intel 2.66 GHz Dual Core CPU used; speedups on an NVIDIA GTX 280:
3-8x over MKL LAPACK, 3-60x over MATLAB
Contd… The CPU outperforms for smaller matrices; speedup increases with matrix size
Contd… SVD timing for rectangular matrices (m = 8K): speedup increases with the varying dimension n
Contd… SVD of up to 14K x 14K on a Tesla S1070 takes 76 minutes on the GPU
A 10K x 10K SVD takes 4.5 hours on the CPU but 25.6 minutes on the GPU
Contd…
Yamamoto achieved a speedup of 4 on the CSX600, and only for very large matrices
Bobda et al. report 17 hours for a 10^6 x 10^6 matrix
Bondhugula et al. report only the partial bidiagonalization time
Timing for Partial Bidiagonalization
Speedup: 1.5-16.5x over Intel MKL; the CPU outperforms for small matrices
Timing comparable to Bondhugula et al., e.g. 11 secs on a GTX 280 vs. 19 secs on a 7900
Time in secs:

Size        Bidiag., GTX 280   Partial Bidiag., GTX 280   Partial Bidiag., Intel MKL
512 x 512   0.57               0.37                       0.14
1K x 1K     2.40               1.06                       3.81
2K x 2K     14.40              4.60                       47.9
4K x 4K     92.70              21.8                       361.8
Timing for Diagonalization
Speedup: 1.5-18x over Intel MKL
Maximum occupancy: 83%; data coalescing achieved
Performance increases with matrix size; performs well even for small matrices
Time in secs:

Size        Diag., GTX 280   Diag., Intel MKL
512 x 512   0.38             0.54
2K x 2K     5.14             49.1
4K x 4K     20               354
8K x 2K     8.2              100
Limitations
Limited double precision support; high performance penalty for using it
Discrepancy due to reduced precision (measured for m = 3K, n = 3K):
  Max singular value discrepancy = 0.013%
  Average discrepancy < 0.00005% for the singular values
  Average discrepancy < 0.001% for U and V^T
Matrix size limited by device memory
SVD on GPU using CUDA: Summary
SVD algorithm on the GPU exploits the GPU's parallelism; high performance achieved
Bidiagonalization using CUBLAS; hybrid algorithm for diagonalization
Error due to low precision < 0.001%
Handles the SVD of very large matrices
Ray Tracing Parametric Patches on GPU
Problem Statement
Directly ray trace parametric patches: exact point of intersection, high visual quality, fewer artifacts
Fast preprocessing, low memory requirement, better rendering
Motivation
Parametric patches describe 3D geometrical figures and are the foundation of most CAD systems
Ray tracing them is a computationally expensive process
GPUs have high computational power (1 TFLOPS); exploit the graphics hardware
Bezier patch
16 control points; better continuity properties; compact
Difficult to render directly, so usually tessellated to polygons
Patch equation: Q(u, v) = [u^3 u^2 u 1] P [v^3 v^2 v 1]^T
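As a concrete reading of the patch equation, a sketch of evaluating one coordinate of Q(u, v); here P is assumed to be the 4x4 matrix of basis-transformed control-point coefficients for that coordinate (a representation choice of this sketch, not necessarily the paper's layout).

// Hypothetical sketch: Q(u,v) = [u^3 u^2 u 1] P [v^3 v^2 v 1]^T for one
// coordinate, i.e. a row vector times a 4x4 matrix times a column vector.
__host__ __device__ double evalPatch(const double P[4][4], double u, double v)
{
    double U[4] = { u*u*u, u*u, u, 1.0 };
    double V[4] = { v*v*v, v*v, v, 1.0 };
    double q = 0.0;
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            q += U[i] * P[i][j] * V[j];
    return q;
}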
Methods
Uniformly refine on the fly: expensive tests to avoid recursion, approximates with triangles, rendering artifacts
Find the exact hit point of a ray with a patch: high computational complexity, prone to numerical errors
Related Work
Toth's algorithm (1985): applies multivariate Newton iteration; dependent on the calculation of interval extensions; numerical errors
Manocha and Krishnan's method (1993): algebraic-pruning based; eigenvalue formulation of the problem; does not map well to the GPU
Kajiya's method (1982): finds the roots of an 18-degree polynomial; maps well to parallel architectures
Kajiya’s algorithm
v - Intersect a and bu - gcd(a,b)
Rl0
l1a
b P
Advantages
Finds the exact point of intersection; uses a robust root-finding procedure
No memory overhead required
Requires double precision arithmetic
Able to trace secondary rays
On the downside, computationally expensive; but suitable for parallel implementation on the GPU
Overview of the ray tracing algorithm (every frame)
Create BVH (CPU)
Compute plane equations (GPU)
Traverse BVH for all pixels/rays (GPU)
For all intersections:
  Compute the 18-degree polynomials (GPU)
  Find the roots of the polynomials (GPU)
  Compute the GCD of the bicubic polynomials (GPU)
  Compute point and normal (GPU)
Spawn secondary rays (GPU)
Accumulate shading data recursively and render
Preprocessing: Compute Plane Equations
M + N planes represent M x N rays
Each thread computes one plane equation, using the frustum corner information
Device occupancy: 100%
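A sketch of how such a kernel might look; the frustum parameterization (a corner point plus a per-row step and an in-row direction) and every name here are assumptions for illustration.

struct Vec3 { float x, y, z; };

__device__ Vec3 vsub(Vec3 a, Vec3 b) { return {a.x-b.x, a.y-b.y, a.z-b.z}; }
__device__ Vec3 vcross(Vec3 a, Vec3 b) {
    return {a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x};
}
__device__ float vdot(Vec3 a, Vec3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

// Hypothetical sketch: thread i builds the plane through the eye and pixel
// row i of the image plane (normal n, offset d, so n·x + d = 0). M row planes
// plus N column planes represent all MxN rays; each ray is the intersection
// of its row plane with its column plane.
__global__ void rowPlaneKernel(float4 *planes, Vec3 eye, Vec3 corner,
                               Vec3 rowStep, Vec3 rowDir, int M)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= M) return;
    Vec3 p = {corner.x + i*rowStep.x, corner.y + i*rowStep.y, corner.z + i*rowStep.z};
    Vec3 n = vcross(vsub(p, eye), rowDir);   // plane contains eye, p, and the row
    planes[i] = make_float4(n.x, n.y, n.z, -vdot(n, eye));
}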
BVH traversal on the GPU
Create the BVH on the CPU, traverse it depth first on the GPU
Invoke traverse, scan, and rearrange kernels (see the sketch below)
Store the data for all Num_Intersect intersections
Device occupancy: 100%
[Figure: traverse counts the patches hit per pixel; scan computes the prefix sum of the counts; rearrange writes each (pixel_x, pixel_y, patch_ID) record to its offset]
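A minimal sketch of the scan/compaction step, written here with Thrust for brevity (an assumption of this sketch; the paper describes its own traverse/scan/rearrange kernels).

#include <thrust/device_vector.h>
#include <thrust/scan.h>

// Hypothetical sketch: d_count[p] holds the number of patches the ray of
// pixel p intersects (output of the traverse kernel, omitted). An exclusive
// prefix sum turns counts into offsets, so the rearrange kernel can write the
// i-th (pixel_x, pixel_y, patch_ID) record of pixel p to d_offset[p] + i.
// d_offset must be sized like d_count.
int compactOffsets(thrust::device_vector<int>& d_count,
                   thrust::device_vector<int>& d_offset)
{
    thrust::exclusive_scan(d_count.begin(), d_count.end(), d_offset.begin());
    return d_offset.back() + d_count.back();   // total Num_Intersect
}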
Computing the 18-degree polynomial: intersection of a and b
32 A and B coefficients per intersection
bezout kernel: evaluate R = [a b c; b d e; c e f] as a function of v; det R = a(df − e^2) − b(bf − ce) + c(be − cd), computed as 6 degree-6 entries, 6 degree-12 products and 3 degree-18 terms
grid = Num_Intersect/16, threads = 21*16
[Figure: per tile of 16 intersections, 21*16, 13*16 and 19*21 threads are active in successive phases]
Contd…
The configuration uses resources well
Avoids uncoalesced reads and writes (row-major layout)
Reduced divergence
Device occupancy: 69%; performance limited by registers
Finding the polynomial roots
18 roots found using Laguerre's method: guarantees convergence; iterative and cubically convergent
Each thread evaluates one intersection; grid = Num_Intersect/64, threads = 64
The kernel loop is driven from the CPU:

while (i < 18)
    call <laguerre> kernel, which finds the i-th root x_i
    call <deflate> kernel, which deflates the polynomial by x_i
end

Iteration update: x_i = x_i − g(p(x), p'(x))
Each invocation finds one root per intersection in the block; the count of real v roots is stored in d_countv
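For orientation, a textbook Laguerre step (which also needs p''; the slides abstract the update as g(p(x), p'(x))), written with thrust::complex since the roots are complex in general. This is a generic sketch, not the paper's kernel.

#include <thrust/complex.h>
typedef thrust::complex<double> cplx;

// Hypothetical sketch of one Laguerre iteration for a degree-n polynomial
// with coefficients c[0..n] (c[n] leading). Cubically convergent and, in
// practice, convergent from almost any start, which is why it is robust.
__device__ cplx laguerreStep(const cplx *c, int n, cplx x)
{
    cplx p = c[n], dp = 0.0, ddp = 0.0;
    for (int k = n - 1; k >= 0; --k) {   // Horner: p, p', p''/2 together
        ddp = ddp * x + dp;
        dp  = dp  * x + p;
        p   = p   * x + c[k];
    }
    ddp = ddp * 2.0;
    if (thrust::abs(p) == 0.0) return x;              // x already a root
    cplx G = dp / p;
    cplx H = G * G - ddp / p;
    cplx rad = thrust::sqrt((n - 1.0) * ((double)n * H - G * G));
    cplx d1 = G + rad, d2 = G - rad;
    cplx den = (thrust::abs(d1) > thrust::abs(d2)) ? d1 : d2;  // stable branch
    return x - (double)n / den;
}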
Contd…
Splitting into two kernels reduces register usage
Avoids uncoalesced reads and writes (row-major data layout)
Device occupancy: laguerre kernel 25%, deflate kernel 50%
Performance limited by the use of double registers, complex arithmetic, and shared memory (polynomial coefficients are transferred repeatedly)
Compute the GCD of the bicubic polynomials: u = GCD(a, b)
Euclidean algorithm; the real v count is read from d_countv
Each thread evaluates one intersection; grid = Num_Intersect/64, threads = 64
Contd…
Update d_countu for each real (u, v) pair
Device occupancy: 25%
Performance limited by double registers and shared memory (the A and B coefficients are read repeatedly)
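For orientation, one remainder step of the Euclidean algorithm as it applies here: once a real root v is fixed, a(u, v) and b(u, v) are polynomials of degree at most 3 in u, and repeated remainders shrink the degree until the common factor (hence the common root u) remains. The tolerance and names below are assumptions.

// Hypothetical sketch: rem = num mod den for polynomials in u with
// coefficients indexed by power (degree <= 3 here). Returns deg(rem).
__device__ int polyMod(const double *num, int dn,
                       const double *den, int dd,
                       double *rem)
{
    double r[4];
    for (int i = 0; i <= dn; ++i) r[i] = num[i];
    for (int k = dn; k >= dd; --k) {              // long division, high to low
        double q = r[k] / den[dd];
        for (int j = 0; j <= dd; ++j)
            r[k - dd + j] -= q * den[j];
    }
    int dr = dd - 1;
    while (dr > 0 && fabs(r[dr]) < 1e-12) --dr;   // drop near-zero leading terms
    for (int i = 0; i <= dr; ++i) rem[i] = r[i];
    return dr;
}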
Compute (x, y, z) and the normal n
Use the parametric patch equation; the real (u, v) count is read from d_countu
Each thread processes one intersection; grid = Num_Intersect/64, threads = 64
Device occupancy: 25%
Performance limited by double registers, shared memory, and repeated patch data transfer
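A sketch of the point/normal step under the same representation assumed earlier (Px, Py, Pz are the per-coordinate 4x4 coefficient matrices, hypothetical names): the true normal is the cross product of the patch's partial derivatives.

// Hypothetical sketch: evaluate [row] P [col]^T for one coordinate.
__device__ double evalForm(const double P[4][4], const double U[4], const double V[4])
{
    double q = 0.0;
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            q += U[i] * P[i][j] * V[j];
    return q;
}

// n = Q_u x Q_v, with derivative basis rows [3u^2 2u 1 0] and [3v^2 2v 1 0].
__device__ void patchNormal(const double Px[4][4], const double Py[4][4],
                            const double Pz[4][4], double u, double v,
                            double n[3])
{
    double U[4]  = { u*u*u, u*u, u, 1.0 },  V[4]  = { v*v*v, v*v, v, 1.0 };
    double Du[4] = { 3*u*u, 2*u, 1.0, 0.0 }, Dv[4] = { 3*v*v, 2*v, 1.0, 0.0 };
    double Qu[3] = { evalForm(Px, Du, V), evalForm(Py, Du, V), evalForm(Pz, Du, V) };
    double Qv[3] = { evalForm(Px, U, Dv), evalForm(Py, U, Dv), evalForm(Pz, U, Dv) };
    n[0] = Qu[1]*Qv[2] - Qu[2]*Qv[1];
    n[1] = Qu[2]*Qv[0] - Qu[0]*Qv[2];
    n[2] = Qu[0]*Qv[1] - Qu[1]*Qv[0];
}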
Challenges
High computational complexity; requires higher precision
Repeated data transfer from device memory to the kernels
Irregular data access
Robust root finding needs complex arithmetic
High memory requirements
Optimizations
Keep computations independent (one thread per pixel); disadvantage: no coherence
Avoid unnecessary computations by using SAH (surface area heuristics) when building the BVH
Arrange data to reduce the workload
Secondary rays
Secondary (shadow and reflection) rays are spawned at each hit
Two orthogonal planes are selected per secondary ray; find its real point of intersection
A shadow ray determines whether its point of origin is in shadow
Compute the final color recursively with the standard illumination equation
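The slides leave the equation implicit; a common Phong-style form with the recursive reflection term would be (an assumption, not necessarily the paper's exact model):

I = k_a I_a + \sum_{l\ \text{unshadowed}} \left[ k_d\, (\mathbf{N}\cdot\mathbf{L}_l) + k_s\, (\mathbf{R}_l\cdot\mathbf{V})^{\alpha} \right] I_l + k_r\, I_{\text{reflection}}

where the shadow rays decide which lights count as unshadowed and I_reflection is the color returned recursively by the reflection ray.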
Memory requirements and Bandwidth
Memory requirements:
  64 doubles per patch: patch coefficients
  Plane equations stored at screen resolution
  Per ray-patch intersection (double precision): 480 bytes
    32 x 8 bytes: bicubic polynomials
    19 x 8 bytes: polynomial roots
    3 x 4 bytes: patch ID and pixel location
    60 bytes: additional flags
    (256 + 152 + 12 + 60 = 480 bytes)
Memory bandwidth: patch coefficients are read repeatedly in the laguerre kernel, which incurs a performance penalty
Strengths
Facilitates direct ray tracing of dynamic patches
Divides into independent tasks
Low branch divergence and high memory access coherence
Time taken is linear in the number of intersections
No additional overhead incurred for secondary rays
Contd…
Performance can be predicted from scene complexity
Can be sped up with multiple GPUs
Reducing the number of intersections boosts performance
Limitations
Ray tracing performance is bounded by memory usage: 480 bytes per ray-patch intersection limits the number of intersections processed at once
Double precision performance: fewer GFLOPS than single precision
Limited shared memory forces repeated data transfers, which increases memory traffic and reduces performance
Contd…
Batch processing solves the memory usage problem
Newer GPUs have improved double precision performance, up to 4x
Modern GPUs have more shared memory available
Results: on GTX 280 (times in secs; per-intersection time in microseconds)

Model       Intersections  Patch/Ray  BVH traversal  Poly. formation  Solve poly.  GCD, (x,y,z), n  Per frame  Per intersection (μs)
Teapot-P    54389          2.01       0.004          0.019            0.175        0.013            0.211      3.8
Teapot-S    29626          2.32       0.003          0.012            0.111        0.010            0.136      4.5
Teapot-R    41096          3.21       0.004          0.031            0.143        0.011            0.189      4.6
Bigguy-P    114048         3.23       0.007          0.043            0.352        0.015            0.417      3.6
Bigguy-S    114112         3.47       0.007          0.048            0.350        0.015            0.420      3.7
Bigguy-R    143040         4.34       0.008          0.104            0.480        0.022            0.614      4.3
Killeroo-P  127040         1.43       0.010          0.050            0.390        0.016            0.466      3.7
Killeroo-S  138240         1.72       0.011          0.061            0.420        0.016            0.508      3.7
Killeroo-R  146432         1.82       0.013          0.105            0.446        0.022            0.586      4.0
Kernel split timing
Finding the roots dominates: 82% of the time on average
BVH traversal takes negligible time
The percentage split is constant across primary and secondary rays
Device occupancy: 25-100%
[Figure: Y axis shows (Model, Ray Type) tuples; Teapot (T), Bigguy (B), Killeroo (K); Primary (P), Shadow (S), Reflected (R)]
Preliminary results on Fermi (GTX 480)

Model       Intersections  Per frame (s)  Per frame (s)  Per intersection (μs)  Per intersection (μs)  Speedup
                           Fermi 480      GTX 280        Fermi 480              GTX 280
Teapot-P    54389          0.071          0.211          1.30                   3.8                    2.94
Teapot-S    29626          0.041          0.136          1.38                   4.5                    3.28
Teapot-R    41096          0.057          0.189          1.38                   4.6                    3.30
Bigguy-P    114048         0.147          0.417          1.28                   3.6                    2.83
Bigguy-S    114112         0.148          0.420          1.29                   3.7                    2.83
Bigguy-R    143040         0.190          0.614          1.32                   4.3                    3.23
Killeroo-P  127040         0.164          0.466          1.29                   3.7                    2.83
Killeroo-S  138240         0.179          0.508          1.29                   3.7                    2.83
Killeroo-R  146432         0.195          0.586          1.33                   4.0                    2.99
Average time per intersection
3.7 μs per intersection on the GTX 280; 1.4 μs on the GTX 480
No overhead incurred for secondary rays; lets us predict performance
[Figure: X axis shows (Model, Ray Type) tuples; Teapot (T), Bigguy (B), Killeroo (K); Primary (P), Shadow (S), Reflection (R)]
Comparison to CPU
First direct ray tracing implementation on the GPU
Scales linearly with the number of intersections
Near-interactive rates; promises interactivity
Outperforms the CPU: 340x on the GTX 280, 990x on the GTX 480
(Speedup measured for the GTX 280 against a MATLAB implementation on an AMD dual-core processor)
[Rendered results:]
Teapot (32 patches) with reflection rays
Teapot (32 patches) with shadow and reflection rays
Bigguy (3570 patches) with shadow rays
Killeroo (11532 patches) with shadow rays
Multiple objects with shadow and reflection rays
Ray tracing parametric patches on the GPU: Summary
Finds exact points of intersection; per-pixel shading using the true normal
Renders highly accurate models; quality not affected by zooming
Able to trace secondary rays
Suitable for parallel and pipelined execution
Near-interactive performance; large speedup over the CPU
Contd…
An alternative to subdivision approaches
Suitable for multi-GPU implementation
Easily extended to other parametric models
Future Work
SVD and ray tracing on multiple GPUs
Addressing larger SVDs; using double precision for the SVD
Adapting ray tracing to new-generation architectures (Fermi)
Extending ray tracing to dynamic models
Thank you