Exploiting the Graphics Hardware to Solve Two Compute-Intensive Problems
Sheetal Lahabar and P. J. Narayanan
Center for Visual Information Technology,
IIIT - Hyderabad
General-Purpose Computation on GPUs
Why GPGPU?
Computational power: Pentium 4: 12 GFLOPS; GTX 280: 1 TFLOPS
High performance growth, faster than Moore's law: CPU 1.4x, GPU 1.7x-2.3x per year
Disparity in performance: CPU (caches and branch prediction) vs. GPU (arithmetic intensity)
Flexible and precise: programmability, high-level language support
Economics: the gaming market
The Problem: GPUs are difficult to use
GPUs are designed for and driven by graphics
The programming model is unusual and tied to graphics
The environment is tightly constrained
The underlying architectures are inherently parallel, rapidly evolving, and largely secret
You can't simply “port” code written for the CPU!
Mapping Computations to the GPU
Data-parallel processing: the GPU architecture is ALU-heavy
Performance depends on arithmetic intensity = computation / bandwidth ratio
Hide memory latency with more computation
GPU architecture
Singular Value Decomposition on the GPU using CUDA
Proceedings of the IEEE International Parallel & Distributed Processing Symposium (IPDPS '09), 25-29 May 2009, Rome, Italy
Problem Statement
SVD of a matrix A (m x n), m > n: A = U Σ V^T
U and V are orthogonal and Σ is a diagonal matrix
Motivation
SVD has many applications in image processing, pattern recognition, etc.
High computational complexity; GPUs have high computing power (teraflop performance)
Exploit the GPU for high performance
Related Work
Ma et al. implemented two-sided rotation Jacobi on a 2-million-gate FPGA (2006)
Yamamoto et al. proposed a method on the CSX600 (2007), but only for large rectangular matrices
Bobda et al. proposed an implementation on a distributed reconfigurable system (2001)
Zhang Shu et al. implemented one-sided Jacobi, which works only for small matrices
Bondhugula et al. proposed a hybrid GPU implementation using frame buffer objects
Methods
SVD algorithms: Golub-Reinsch (bidiagonalization and diagonalization); Hestenes algorithm (Jacobi)
Golub-Reinsch method: simple and compact, maps well to the GPU, popular in numerical libraries
Golub-Reinsch algorithm
Bidiagonalization: a series of Householder transformations
Diagonalization: implicitly shifted QR iterations
Overall algorithm:
  B ← Q^T A P        (bidiagonalization of A to B)
  Σ ← X^T B Y        (diagonalization of B to Σ)
  U ← Q X, V^T ← (P Y)^T   (compute the orthogonal matrices U and V^T)
Complexity: O(mn^2) for m > n
Bidiagonalization
B = Q^T A P, with Q^T and P initialized to identity matrices
Simple bidiagonalization, i-th update:
  A(i+1:m, i+1:n) = A(i+1:m, i+1:n) − u_i f(u_i, v_i) − f(v_i) v_i
  Q^T(i:m, 1:m) = Q^T(i:m, 1:m) − f(Q, u_i) u_i
  P(1:n, i:n) = P(1:n, i:n) − f(P, v_i) v_i
Contd… Many reads and writes, so use block updates
Divide the matrix into n/L blocks; eliminate L rows and columns at once
n/L block transformations in total
The i-th block transformation (e.g. L = 3) updates the trailing
A(iL+1:m, iL+1:n), Q(1:m, iL+1:m) and P^T(iL+1:n, 1:n)
Update using BLAS operations, as in the sketch below
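For concreteness, a minimal sketch of one trailing block update with cuBLAS. The single-precision call matches the CUBLAS usage described later, but the buffer names and the assumption that the block's L Householder vectors from the u side have been pre-accumulated into dU and dW are illustrative, not the paper's code (the v-side updates are handled symmetrically).

#include <cublas_v2.h>

// Hypothetical sketch: apply one block transformation to the trailing
// submatrix, A_t <- A_t - dU * dW^T. The L rank-1 Householder updates are
// batched into a single rank-L SGEMM, which is why block updates run as
// Level 3 BLAS instead of many Level 2 calls.
void trailingBlockUpdate(cublasHandle_t h,
                         float *dA, int ldA,   // trailing submatrix, m_t x n_t
                         const float *dU,      // accumulated u vectors, m_t x L
                         const float *dW,      // accumulated w vectors, n_t x L
                         int m_t, int n_t, int L)
{
    const float minusOne = -1.0f, one = 1.0f;
    cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_T,
                m_t, n_t, L,
                &minusOne, dU, m_t,   // op(A) = dU,   m_t x L
                dW, n_t,              // op(B) = dW^T, L x n_t
                &one, dA, ldA);       // A_t is overwritten in place
}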
Contd… Final bidiagonal matrix B = Q^T A P
Store the L u_i's and v_i's; additional space complexity O(mL)
Partial bidiagonalization computes only B
Challenges
Iterative algorithm; repeated data transfers
High precision requirements
Irregular data access
Matrix size affects performance

Bidiagonalization on GPU
Block updates require Level 3 BLAS; single-precision CUBLAS functions are used
CUBLAS gives high performance even for smaller dimensions; matrix dimensions are kept multiples of 32
Operations stay on data local to the GPU; expensive GPU-CPU transfers are avoided
Contd…
In-place bidiagonalization; efficient GPU implementation
The bidiagonal matrix is copied to the CPU
Diagonalization
Implicitly shifted QR algorithm: Σ = X^T B Y, with X and Y initialized to identity matrices
[Figure: Givens rotations sweep the band of B between indexes k1 and k2, which change from iteration to iteration]
Diagonalization: apply the implicitly shifted QR algorithm
In every iteration, until convergence:
  Find the matrix indexes k1 and k2
  Apply Givens rotations on B
  Store coefficient vectors (C1, S1) and (C2, S2) of length k2 − k1
  Transform k2 − k1 + 1 rows of Y^T using (C1, S1)
  Transform k2 − k1 + 1 columns of X using (C2, S2)
Contd… Forward transformation on Y^T, using coefficient vectors (C1, S1):

for (j = k1; j < k2; j++) {
    r = Y^T(j, 1:n)                                       // save row j before overwriting
    Y^T(j, 1:n)   = f(r, Y^T(j+1, 1:n), C1(j-k1+1), S1(j-k1+1))
    Y^T(j+1, 1:n) = g(r, Y^T(j+1, 1:n), C1(j-k1+1), S1(j-k1+1))
}
Diagonalization on GPU
Hybrid algorithm: the Givens rotations modify B on the CPU; the coefficient vectors are transferred to the GPU
Row transformations: transform k2 − k1 + 1 rows of Y^T and X^T on the GPU
Contd… A row element depends on the next or previous row element
A row is divided into blocks of blockDim.x elements (B1 … Bk … Bn)
Contd…
The kernel modifies k2 − k1 + 1 rows, looping over k2 − k1 rows
Two rows are kept in shared memory at a time
Requires k2 − k1 + 1 coefficient vectors, which are copied to shared memory
Efficient division of rows; each thread works independently (a sketch follows)
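A minimal sketch of such a row-rotation kernel, assuming the standard Givens form for f and g (new row j = c·row j + s·row j+1; new row j+1 = c·row j+1 − s·row j) and a row-major Y^T; for brevity it reads coefficients from global memory rather than staging rows and coefficients in shared memory as the slides describe.

// Hypothetical sketch, not the paper's kernel: each thread owns one column of
// YT and applies the chain of k2-k1 Givens rotations down that column, so
// threads never interact and neighbouring threads touch neighbouring addresses.
__global__ void applyRowRotations(float *YT, int n, int k1, int k2,
                                  const float *C1, const float *S1)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= n) return;

    float cur = YT[k1 * n + col];                // element (k1, col), row-major
    for (int j = k1; j < k2; ++j) {
        float nxt = YT[(j + 1) * n + col];
        float c = C1[j - k1], s = S1[j - k1];
        YT[j * n + col] = c * cur + s * nxt;     // f: new row j
        cur = c * nxt - s * cur;                 // g: becomes new row j+1
    }
    YT[k2 * n + col] = cur;                      // flush the last rotated row
}

A launch such as applyRowRotations<<<(n + 255)/256, 256>>>(dYT, n, k1, k2, dC1, dS1) transforms all k2 − k1 + 1 rows in one pass.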
Orthogonal matrices
CUBLAS matrix multiplication for U and V^T
Good performance even for small matrices
Results
Intel 2.66 GHz Dual Core CPU used; speedups on an NVIDIA GTX 280:
3-8x over MKL LAPACK, 3-60x over MATLAB
Contd… The CPU outperforms for smaller matrices; speedup increases with matrix size
Contd… SVD timing for rectangular matrices (m = 8K): speedup increases with the varying dimension n
Contd… SVD of up to 14K x 14K on a Tesla S1070 takes 76 minutes on the GPU
A 10K x 10K SVD takes 4.5 hours on the CPU but 25.6 minutes on the GPU
Contd…
Yamamoto achieved a speedup of 4 on the CSX600, and only for very large matrices
Bobda et al. report 17 hours for a 10^6 x 10^6 matrix
Bondhugula et al. report only the partial bidiagonalization time
Timing for Partial Bidiagonalization
Speedup: 1.5-16.5x over Intel MKL; the CPU outperforms for small matrices
Timing comparable to Bondhugula et al., e.g. 11 secs on a GTX 280 vs. 19 secs on a 7900
Time in secs:

Size        Bidiag., GTX 280   Partial Bidiag., GTX 280   Partial Bidiag., Intel MKL
512 x 512   0.57               0.37                       0.14
1K x 1K     2.40               1.06                       3.81
2K x 2K     14.40              4.60                       47.9
4K x 4K     92.70              21.8                       361.8
Timing for Diagonalization
Speedup: 1.5-18x over Intel MKL
Maximum occupancy: 83%; data coalescing achieved
Performance increases with matrix size; performs well even for small matrices
Time in secs:

Size        Diag., GTX 280   Diag., Intel MKL
512 x 512   0.38             0.54
2K x 2K     5.14             49.1
4K x 4K     20               354
8K x 2K     8.2              100
Limitations
Limited double precision support; high performance penalty for using it
Discrepancy due to reduced precision (measured for m = 3K, n = 3K):
  Max singular value discrepancy = 0.013%
  Average discrepancy < 0.00005% for the singular values
  Average discrepancy < 0.001% for U and V^T
Matrix size limited by device memory
SVD on GPU using CUDA: Summary
SVD algorithm on the GPU exploits the GPU's parallelism; high performance achieved
Bidiagonalization using CUBLAS; hybrid algorithm for diagonalization
Error due to low precision < 0.001%
Handles the SVD of very large matrices
Ray Tracing Parametric Patches on GPU
Problem Statement
Directly ray trace parametric patches: exact point of intersection, high visual quality, fewer artifacts
Fast preprocessing, low memory requirement, better rendering
Motivation
Parametric patches describe 3D geometrical figures and are the foundation of most CAD systems
Ray tracing them is a computationally expensive process
GPUs have high computational power (1 TFLOPS); exploit the graphics hardware
Bezier patch
16 control points; better continuity properties; compact
Difficult to render directly, so usually tessellated to polygons
Patch equation: Q(u, v) = [u^3 u^2 u 1] P [v^3 v^2 v 1]^T
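As a concrete reading of the patch equation, a sketch of evaluating one coordinate of Q(u, v); here P is assumed to be the 4x4 matrix of basis-transformed control-point coefficients for that coordinate (a representation choice of this sketch, not necessarily the paper's layout).

// Hypothetical sketch: Q(u,v) = [u^3 u^2 u 1] P [v^3 v^2 v 1]^T for one
// coordinate, i.e. a row vector times a 4x4 matrix times a column vector.
__host__ __device__ double evalPatch(const double P[4][4], double u, double v)
{
    double U[4] = { u*u*u, u*u, u, 1.0 };
    double V[4] = { v*v*v, v*v, v, 1.0 };
    double q = 0.0;
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            q += U[i] * P[i][j] * V[j];
    return q;
}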
Methods
Uniformly refine on the fly: expensive tests to avoid recursion, approximates with triangles, rendering artifacts
Find the exact hit point of a ray with a patch: high computational complexity, prone to numerical errors
Related Work
Toth's algorithm (1985): applies multivariate Newton iteration; dependent on the calculation of interval extensions; numerical errors
Manocha and Krishnan's method (1993): algebraic-pruning based; eigenvalue formulation of the problem; does not map well to the GPU
Kajiya's method (1982): finds the roots of an 18-degree polynomial; maps well to parallel architectures
Kajiya’s algorithm
v - Intersect a and bu - gcd(a,b)
Rl0
l1a
b P
Advantages
Finds the exact point of intersection; uses a robust root-finding procedure
No memory overhead required
Requires double precision arithmetic
Able to trace secondary rays
On the downside, computationally expensive; but suitable for parallel implementation on the GPU
Overview of the ray tracing algorithm (every frame)
Create BVH (CPU)
Compute plane equations (GPU)
Traverse BVH for all pixels/rays (GPU)
For all intersections:
  Compute the 18-degree polynomials (GPU)
  Find the roots of the polynomials (GPU)
  Compute the GCD of the bicubic polynomials (GPU)
  Compute point and normal (GPU)
Spawn secondary rays (GPU)
Accumulate shading data recursively and render
Preprocessing: Compute Plane Equations
M + N planes represent M x N rays
Each thread computes one plane equation, using the frustum corner information
Device occupancy: 100%
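A sketch of how such a kernel might look; the frustum parameterization (a corner point plus a per-row step and an in-row direction) and every name here are assumptions for illustration.

struct Vec3 { float x, y, z; };

__device__ Vec3 vsub(Vec3 a, Vec3 b) { return {a.x-b.x, a.y-b.y, a.z-b.z}; }
__device__ Vec3 vcross(Vec3 a, Vec3 b) {
    return {a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x};
}
__device__ float vdot(Vec3 a, Vec3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

// Hypothetical sketch: thread i builds the plane through the eye and pixel
// row i of the image plane (normal n, offset d, so n·x + d = 0). M row planes
// plus N column planes represent all MxN rays; each ray is the intersection
// of its row plane with its column plane.
__global__ void rowPlaneKernel(float4 *planes, Vec3 eye, Vec3 corner,
                               Vec3 rowStep, Vec3 rowDir, int M)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= M) return;
    Vec3 p = {corner.x + i*rowStep.x, corner.y + i*rowStep.y, corner.z + i*rowStep.z};
    Vec3 n = vcross(vsub(p, eye), rowDir);   // plane contains eye, p, and the row
    planes[i] = make_float4(n.x, n.y, n.z, -vdot(n, eye));
}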
BVH traversal on the GPU
Create the BVH on the CPU, traverse it depth first on the GPU
Invoke traverse, scan, and rearrange kernels (see the sketch below)
Store the data for all Num_Intersect intersections
Device occupancy: 100%
[Figure: traverse counts the patches hit per pixel; scan computes the prefix sum of the counts; rearrange writes each (pixel_x, pixel_y, patch_ID) record to its offset]
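A minimal sketch of the scan/compaction step, written here with Thrust for brevity (an assumption of this sketch; the paper describes its own traverse/scan/rearrange kernels).

#include <thrust/device_vector.h>
#include <thrust/scan.h>

// Hypothetical sketch: d_count[p] holds the number of patches the ray of
// pixel p intersects (output of the traverse kernel, omitted). An exclusive
// prefix sum turns counts into offsets, so the rearrange kernel can write the
// i-th (pixel_x, pixel_y, patch_ID) record of pixel p to d_offset[p] + i.
// d_offset must be sized like d_count.
int compactOffsets(thrust::device_vector<int>& d_count,
                   thrust::device_vector<int>& d_offset)
{
    thrust::exclusive_scan(d_count.begin(), d_count.end(), d_offset.begin());
    return d_offset.back() + d_count.back();   // total Num_Intersect
}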
Computing the 18-degree polynomial: intersection of a and b
32 A and B coefficients per intersection
bezout kernel: evaluate R = [a b c; b d e; c e f] as a function of v; det R = a(df − e^2) − b(bf − ce) + c(be − cd), computed as 6 degree-6 entries, 6 degree-12 products and 3 degree-18 terms
grid = Num_Intersect/16, threads = 21*16
[Figure: per tile of 16 intersections, 21*16, 13*16 and 19*21 threads are active in successive phases]
Contd…
The configuration uses resources well
Avoids uncoalesced reads and writes (row-major layout)
Reduced divergence
Device occupancy: 69%; performance limited by registers
Finding the polynomial roots
18 roots found using Laguerre's method: guarantees convergence; iterative and cubically convergent
Each thread evaluates one intersection; grid = Num_Intersect/64, threads = 64
The kernel loop is driven from the CPU:

while (i < 18)
    call <laguerre> kernel, which finds the i-th root x_i
    call <deflate> kernel, which deflates the polynomial by x_i
end

Iteration update: x_i = x_i − g(p(x), p'(x))
Each invocation finds one root per intersection in the block; the count of real v roots is stored in d_countv
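For orientation, a textbook Laguerre step (which also needs p''; the slides abstract the update as g(p(x), p'(x))), written with thrust::complex since the roots are complex in general. This is a generic sketch, not the paper's kernel.

#include <thrust/complex.h>
typedef thrust::complex<double> cplx;

// Hypothetical sketch of one Laguerre iteration for a degree-n polynomial
// with coefficients c[0..n] (c[n] leading). Cubically convergent and, in
// practice, convergent from almost any start, which is why it is robust.
__device__ cplx laguerreStep(const cplx *c, int n, cplx x)
{
    cplx p = c[n], dp = 0.0, ddp = 0.0;
    for (int k = n - 1; k >= 0; --k) {   // Horner: p, p', p''/2 together
        ddp = ddp * x + dp;
        dp  = dp  * x + p;
        p   = p   * x + c[k];
    }
    ddp = ddp * 2.0;
    if (thrust::abs(p) == 0.0) return x;              // x already a root
    cplx G = dp / p;
    cplx H = G * G - ddp / p;
    cplx rad = thrust::sqrt((n - 1.0) * ((double)n * H - G * G));
    cplx d1 = G + rad, d2 = G - rad;
    cplx den = (thrust::abs(d1) > thrust::abs(d2)) ? d1 : d2;  // stable branch
    return x - (double)n / den;
}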
Contd…
Splitting into two kernels reduces register usage
Avoids uncoalesced reads and writes (row-major data layout)
Device occupancy: laguerre kernel 25%, deflate kernel 50%
Performance limited by the use of double registers, complex arithmetic, and shared memory (polynomial coefficients are transferred repeatedly)
Compute the GCD of the bicubic polynomials: u = GCD(a, b)
Euclidean algorithm; the real v count is read from d_countv
Each thread evaluates one intersection; grid = Num_Intersect/64, threads = 64
Contd…
Update d_countu for each real (u, v) pair
Device occupancy: 25%
Performance limited by double registers and shared memory (the A and B coefficients are read repeatedly)
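For orientation, one remainder step of the Euclidean algorithm as it applies here: once a real root v is fixed, a(u, v) and b(u, v) are polynomials of degree at most 3 in u, and repeated remainders shrink the degree until the common factor (hence the common root u) remains. The tolerance and names below are assumptions.

// Hypothetical sketch: rem = num mod den for polynomials in u with
// coefficients indexed by power (degree <= 3 here). Returns deg(rem).
__device__ int polyMod(const double *num, int dn,
                       const double *den, int dd,
                       double *rem)
{
    double r[4];
    for (int i = 0; i <= dn; ++i) r[i] = num[i];
    for (int k = dn; k >= dd; --k) {              // long division, high to low
        double q = r[k] / den[dd];
        for (int j = 0; j <= dd; ++j)
            r[k - dd + j] -= q * den[j];
    }
    int dr = dd - 1;
    while (dr > 0 && fabs(r[dr]) < 1e-12) --dr;   // drop near-zero leading terms
    for (int i = 0; i <= dr; ++i) rem[i] = r[i];
    return dr;
}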
Compute (x, y, z) and the normal n
Use the parametric patch equation; the real (u, v) count is read from d_countu
Each thread processes one intersection; grid = Num_Intersect/64, threads = 64
Device occupancy: 25%
Performance limited by double registers, shared memory, and repeated patch data transfer
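A sketch of the point/normal step under the same representation assumed earlier (Px, Py, Pz are the per-coordinate 4x4 coefficient matrices, hypothetical names): the true normal is the cross product of the patch's partial derivatives.

// Hypothetical sketch: evaluate [row] P [col]^T for one coordinate.
__device__ double evalForm(const double P[4][4], const double U[4], const double V[4])
{
    double q = 0.0;
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            q += U[i] * P[i][j] * V[j];
    return q;
}

// n = Q_u x Q_v, with derivative basis rows [3u^2 2u 1 0] and [3v^2 2v 1 0].
__device__ void patchNormal(const double Px[4][4], const double Py[4][4],
                            const double Pz[4][4], double u, double v,
                            double n[3])
{
    double U[4]  = { u*u*u, u*u, u, 1.0 },  V[4]  = { v*v*v, v*v, v, 1.0 };
    double Du[4] = { 3*u*u, 2*u, 1.0, 0.0 }, Dv[4] = { 3*v*v, 2*v, 1.0, 0.0 };
    double Qu[3] = { evalForm(Px, Du, V), evalForm(Py, Du, V), evalForm(Pz, Du, V) };
    double Qv[3] = { evalForm(Px, U, Dv), evalForm(Py, U, Dv), evalForm(Pz, U, Dv) };
    n[0] = Qu[1]*Qv[2] - Qu[2]*Qv[1];
    n[1] = Qu[2]*Qv[0] - Qu[0]*Qv[2];
    n[2] = Qu[0]*Qv[1] - Qu[1]*Qv[0];
}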
Challenges
High computational complexity; requires higher precision
Repeated data transfer from device memory to the kernels
Irregular data access
Robust root finding needs complex arithmetic
High memory requirements
Optimizations
Keep computations independent (one thread per pixel); disadvantage: no coherence
Avoid unnecessary computations by using SAH (surface area heuristics) when building the BVH
Arrange data to reduce the workload
Secondary rays
Secondary (shadow and reflection) rays are spawned at each hit
Two orthogonal planes are selected per secondary ray; find its real point of intersection
A shadow ray determines whether its point of origin is in shadow
Compute the final color recursively with the standard illumination equation
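The slides leave the equation implicit; a common Phong-style form with the recursive reflection term would be (an assumption, not necessarily the paper's exact model):

I = k_a I_a + \sum_{l\ \text{unshadowed}} \left[ k_d\, (\mathbf{N}\cdot\mathbf{L}_l) + k_s\, (\mathbf{R}_l\cdot\mathbf{V})^{\alpha} \right] I_l + k_r\, I_{\text{reflection}}

where the shadow rays decide which lights count as unshadowed and I_reflection is the color returned recursively by the reflection ray.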
Memory requirements and Bandwidth
Memory requirements:
  64 doubles per patch: patch coefficients
  Plane equations stored at screen resolution
  Per ray-patch intersection (double precision): 480 bytes
    32 x 8 bytes: bicubic polynomials
    19 x 8 bytes: polynomial roots
    3 x 4 bytes: patch ID and pixel location
    60 bytes: additional flags
    (256 + 152 + 12 + 60 = 480 bytes)
Memory bandwidth: patch coefficients are read repeatedly in the laguerre kernel, which incurs a performance penalty
Strengths
Facilitates direct ray tracing of dynamic patches
Divides into independent tasks
Low branch divergence and high memory access coherence
Time taken is linear in the number of intersections
No additional overhead incurred for secondary rays
Contd…
Performance can be predicted from scene complexity
Can be sped up with multiple GPUs
Reducing the number of intersections boosts performance
Limitations
Ray tracing performance is bounded by memory usage: 480 bytes per ray-patch intersection limits the number of intersections processed at once
Double precision performance: fewer GFLOPS than single precision
Limited shared memory forces repeated data transfers, which increases memory traffic and reduces performance
Contd…
Batch processing solves the memory usage problem
Newer GPUs have improved double precision performance, up to 4x
Modern GPUs have more shared memory available
Results: on GTX 280 (times in secs; per-intersection time in microseconds)

Model       Intersections  Patch/Ray  BVH traversal  Poly. formation  Solve poly.  GCD, (x,y,z), n  Per frame  Per intersection (μs)
Teapot-P    54389          2.01       0.004          0.019            0.175        0.013            0.211      3.8
Teapot-S    29626          2.32       0.003          0.012            0.111        0.010            0.136      4.5
Teapot-R    41096          3.21       0.004          0.031            0.143        0.011            0.189      4.6
Bigguy-P    114048         3.23       0.007          0.043            0.352        0.015            0.417      3.6
Bigguy-S    114112         3.47       0.007          0.048            0.350        0.015            0.420      3.7
Bigguy-R    143040         4.34       0.008          0.104            0.480        0.022            0.614      4.3
Killeroo-P  127040         1.43       0.010          0.050            0.390        0.016            0.466      3.7
Killeroo-S  138240         1.72       0.011          0.061            0.420        0.016            0.508      3.7
Killeroo-R  146432         1.82       0.013          0.105            0.446        0.022            0.586      4.0
Kernel split timing
Finding the roots dominates: 82% of the time on average
BVH traversal takes negligible time
The percentage split is constant across primary and secondary rays
Device occupancy: 25-100%
[Figure: Y axis shows (Model, Ray Type) tuples; Teapot (T), Bigguy (B), Killeroo (K); Primary (P), Shadow (S), Reflected (R)]
Preliminary results on Fermi (GTX 480)

Model       Intersections  Per frame (s)  Per frame (s)  Per intersection (μs)  Per intersection (μs)  Speedup
                           Fermi 480      GTX 280        Fermi 480              GTX 280
Teapot-P    54389          0.071          0.211          1.30                   3.8                    2.94
Teapot-S    29626          0.041          0.136          1.38                   4.5                    3.28
Teapot-R    41096          0.057          0.189          1.38                   4.6                    3.30
Bigguy-P    114048         0.147          0.417          1.28                   3.6                    2.83
Bigguy-S    114112         0.148          0.420          1.29                   3.7                    2.83
Bigguy-R    143040         0.190          0.614          1.32                   4.3                    3.23
Killeroo-P  127040         0.164          0.466          1.29                   3.7                    2.83
Killeroo-S  138240         0.179          0.508          1.29                   3.7                    2.83
Killeroo-R  146432         0.195          0.586          1.33                   4.0                    2.99
Average time per intersection
3.7 μs per intersection on the GTX 280; 1.4 μs on the GTX 480
No overhead incurred for secondary rays; lets us predict performance
[Figure: X axis shows (Model, Ray Type) tuples; Teapot (T), Bigguy (B), Killeroo (K); Primary (P), Shadow (S), Reflection (R)]
Comparison to CPU
First direct ray tracing implementation on the GPU
Scales linearly with the number of intersections
Near-interactive rates; promises interactivity
Outperforms the CPU: 340x on the GTX 280, 990x on the GTX 480
(Speedup measured for the GTX 280 against a MATLAB implementation on an AMD dual-core processor)
[Rendered results:]
Teapot (32 patches) with reflection rays
Teapot (32 patches) with shadow and reflection rays
Bigguy (3570 patches) with shadow rays
Killeroo (11532 patches) with shadow rays
Multiple objects with shadow and reflection rays
Ray tracing parametric patches on the GPU: Summary
Finds exact points of intersection; per-pixel shading using the true normal
Renders highly accurate models; quality not affected by zooming
Able to trace secondary rays
Suitable for parallel and pipelined execution
Near-interactive performance; large speedup over the CPU
Contd…
An alternative to subdivision approaches
Suitable for multi-GPU implementation
Easily extended to other parametric models
Future Work
SVD and ray tracing on multiple GPUs
Addressing larger SVDs; using double precision for the SVD
Adapting ray tracing to new-generation architectures (Fermi)
Extending ray tracing to dynamic models
Thank you