The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
LU-GPU: Efficient Algorithms for Solving Dense Linear Systems on Graphics Hardware
Nico Galoppo, Naga K. Govindaraju, Michael Henson, Dinesh Manocha
http://gamma.cs.unc.edu/LU-GPU
Goals
Demonstrate advantages of mapping linear algebra routines to graphics hardware:
  Performance
  Growth rate
LAPACK-compliant set of linear algebra algorithms on graphics hardware
Outline
LU Decomposition & Related Work
The potential of GPUs
LU-GPU algorithm
Results
Conclusions & Ongoing Work
LU decomposition
Sequence of row eliminations:
  Scale and add: A(i,j) = A(i,j) - A(i,k) A(k,j)
  Input data mapping: 2 distinct memory regions
  No data dependencies within a row elimination
Pivoting
  Pointer-swap vs. data copy
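The scale-and-add step above can be sketched on the CPU as a plain triple loop. This is a minimal illustration without pivoting, not the LU-GPU implementation; the function name `lu_no_pivot` and the packed storage of L and U in one matrix are assumptions made for the example. For a fixed k, every (i, j) update reads only row k and column k, which is why the slide can say there are no data dependencies within a row elimination.

```python
def lu_no_pivot(a):
    """Return a copy of square matrix `a` overwritten with its LU factors.

    The strict lower triangle holds the multipliers of L (unit diagonal
    implied); the diagonal and upper triangle hold U. No pivoting is done,
    so a zero on the diagonal raises ZeroDivisionError.
    """
    n = len(a)
    a = [row[:] for row in a]  # work on a copy, leave the input untouched
    for k in range(n - 1):
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]  # scale: multiplier stored in place of A(i,k)
            for j in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]  # add: A(i,j) -= A(i,k) A(k,j)
    return a


if __name__ == "__main__":
    print(lu_no_pivot([[4.0, 3.0], [6.0, 3.0]]))
```

Here L = [[1, 0], [1.5, 1]] and U = [[4, 3], [0, -1.5]], and their product reproduces the input.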
LU decomposition
Theoretical complexity (partial pivoting): (2/3) n^3 + O(n^2)
Performance <-> Architecture
  Order of operations
  Memory access (latency)
  Memory bandwidth
Commodity CPUs
LINPACK Benchmark:
Intel Pentium 4, 3.06 GHz: 2.88 GFLOPs/s
(Jack Dongarra, Oct 2005)
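As a rough sanity check, the (2/3) n^3 operation count from the LU decomposition slide can be combined with the LINPACK rate quoted here to estimate a factorization time. The helper names below are hypothetical, the O(n^2) term is ignored, and a sustained benchmark rate is only an approximation of what any particular matrix size achieves.

```python
def lu_flops(n):
    """Leading-order floating point operation count for LU of an n x n matrix."""
    return 2.0 * n ** 3 / 3.0


def estimated_seconds(n, gflops):
    """Time to factor an n x n matrix at a sustained rate of `gflops` GFLOP/s."""
    return lu_flops(n) / (gflops * 1e9)


if __name__ == "__main__":
    # 2.88 GFLOP/s is the Pentium 4 LINPACK figure quoted on this slide.
    for n in (1000, 2000, 3500):
        print(n, round(estimated_seconds(n, 2.88), 2), "s")
```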
Streaming architectures
Specialized hardware
  High bandwidth/compute ratio
Merrimac [Erez04]
  Molecular modeling: 38 GFLOPs vs. 2.7 GFLOPs (P4)
  $1,000/node
Imagine [Ahn04]
  10.46 GFLOPs/s on QR-decomposition
Research
CPU vs. GPU
Pentium EE 840
  3.2 GHz Dual Core
  230M Transistors
  90 nm process
  206 mm2
  2 x 1MB Cache
  25.6 GFLOPs
  Price: $1,040
GeForce 7800 GTX
  430 MHz
  302M Transistors
  110 nm process
  326 mm2
  512MB onboard memory
  313 GFLOPs (shader)
  1.3 TFLOPs (total)
  Price: $450
CPU vs. GPU (Henry Moreton: NVIDIA, Aug. 2005)
                      PEE 840   7800GTX   GPU/CPU
Graphics GFLOPs       25.6      1300      50.8
Shader GFLOPs         25.6      313       12.2
Die area (mm2)        206       326       1.6
Die area normalized   206       218       1.1
Transistors (M)       230       302       1.3
Power (W)             130       65        0.5
GFLOPS/mm2            0.1       6.0       47.9
GFLOPS/tr             0.1       4.3       38.7
GFLOPS/W              0.2       20.0      101.6
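The derived rows of this comparison can be recomputed from the raw figures. The sketch below is an assumption-laden reading of the table, not something the slides state: it takes the efficiency rows to use the "Graphics GFLOPs" figure, and it assumes the normalized die area rescales the 110 nm GPU die to the CPU's 90 nm process by the square of the feature-size ratio, which happens to reproduce the 218 mm2 entry. Dictionary keys and function names are illustrative.

```python
CPU = {"gflops": 25.6, "area_mm2": 206.0, "process_nm": 90.0,
       "transistors_m": 230.0, "power_w": 130.0}
GPU = {"gflops": 1300.0, "area_mm2": 326.0, "process_nm": 110.0,
       "transistors_m": 302.0, "power_w": 65.0}


def normalized_area(chip, target_nm=90.0):
    # Assumption: die area scales with the square of the process feature size.
    return chip["area_mm2"] * (target_nm / chip["process_nm"]) ** 2


def efficiency(chip):
    """GFLOPS per normalized mm2, per million transistors, and per watt."""
    return {
        "per_mm2": chip["gflops"] / normalized_area(chip),
        "per_mtr": chip["gflops"] / chip["transistors_m"],
        "per_watt": chip["gflops"] / chip["power_w"],
    }


if __name__ == "__main__":
    cpu_eff, gpu_eff = efficiency(CPU), efficiency(GPU)
    for key in cpu_eff:
        print(key, round(cpu_eff[key], 1), round(gpu_eff[key], 1),
              round(gpu_eff[key] / cpu_eff[key], 1))
```

The recomputed ratios agree with the table to within rounding.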
CPU vs. GPU: Bandwidth
[Diagram: CPU (3 GHz) with 2 x 1 MB cache, system memory (2+ GB), AGP memory (512 MB), PCI-E bus (4 GB/s), video memory (512 MB), GPU (500 MHz). Bandwidth labels: 35.2 GB/s and 6.4 GB/s.]
Bandwidth
Large, high-bandwidth memory: 512 MB video memory vs. 2 MB L2 cache on CPUs
High memory-to-compute clock ratio: reduces memory stalls
Graphics pipeline
vertex: programmable vertex processing (fp32)
polygon: polygon setup, culling, rasterization
pixel: programmable per-pixel math (fp32)
texture: per-pixel texture, fp16 blending
image: Z-buf, fp16 blending, anti-alias (MRT)
memory
Stream processor (non-graphics) (David Kirk, NVIDIA, May ’05)
data: programmable MIMD processing (fp32)
setup/rasterizer: lists, SIMD “rasterization”
data: programmable SIMD processing (fp32)
data: data fetch, fp16 blending
data: predicated write, fp16 blend, multiple output
memory
Potential of graphics processors
Commodity horsepower
  Parallel computation
  Bandwidth
Programmable graphics pipeline
  Stream processor
Exploit large growth rate
Exploiting technology moving faster than Moore’s law
[Chart: CPU growth rate vs. GPU growth rate. Source: Anselmo Lastra]
General purpose computing on GPUs
Physical Simulation
  Fluid Flow [Fan et al. 2004]
  FEM [Rumpf and Strzodka 2001]
  Cloud Dynamics [Harris et al. 2003]
Sparse Linear Algebra
  Operators [Krüger and Westermann 2003]
  Iterative Solvers [Bolz et al. 2003]
General purpose computing on GPUs
Matrix-Matrix Multiplication
  Fixed graphics pipeline, fixed-point arithmetic [Larsen and McAllister 2001]
  Floating-point (SP) [Fatahalian et al. 2004]
High-level API
  BrookGPU [Buck et al. 2004]
  Sh [McCool et al. 2004]
Motivation for LU-GPU
LU decomposition maps well:
  Stream program
  Few data dependencies
Pivoting
  Parallel pivot search
  Exploit large memory bandwidth
GPU based algorithms
Data representation
Algorithm mapping
Data representation
Texture mapping hardware: Input data mapping
Data representation
Matrix elements -> 2D texture memory
One-to-one mapping
Texture memory = on-board memory
  Exploit bandwidth
  Avoid CPU-GPU data transfer
GPU based algorithms
Data representation
Algorithm mapping
  Stream computation
  Input data mapping
  Fast row swaps
Algorithm mapping
Texture mapping hardware: Input data mapping
Stream computation
Rasterize quadrilaterals
  Generates computation stream
  Invokes SIMD units
  Rasterization simulates blocking
Rasterization pass = row elimination
Alternating memory regions
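The "alternating memory regions" idea can be illustrated with a CPU sketch in which each pass reads only from a source buffer and writes only to a destination buffer, then the two are swapped. This mirrors the usual GPU constraint that a pass should not read the surface it is rendering to. It is a simplified model under stated assumptions: the function names are hypothetical, every element is rewritten on every pass (a real implementation would presumably rasterize only the region that changes), and scaling and updating are fused into one pass.

```python
def elimination_pass(src, dst, k):
    """One pass: eliminate column k, reading only `src` and writing only `dst`."""
    n = len(src)
    for i in range(n):
        for j in range(n):
            if i > k and j == k:
                dst[i][j] = src[i][k] / src[k][k]  # multiplier for row i
            elif i > k and j > k:
                dst[i][j] = src[i][j] - (src[i][k] / src[k][k]) * src[k][j]
            else:
                dst[i][j] = src[i][j]  # unchanged region, copied through
    return dst


def lu_ping_pong(a):
    """LU without pivoting using two buffers that swap roles after each pass."""
    n = len(a)
    front = [row[:] for row in a]
    back = [[0.0] * n for _ in range(n)]
    for k in range(n - 1):
        elimination_pass(front, back, k)
        front, back = back, front  # the output of this pass feeds the next
    return front


if __name__ == "__main__":
    print(lu_ping_pong([[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]]))
```

Because each output element depends only on the source buffer, the inner two loops could run in any order or in parallel.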
Input data mapping
Texture mapping hardware: Input data mapping
Input data mapping
Dedicated texture mapping hardware
Traditionally for color interpolation
Map input matrix elements to output elements
Eliminates computation of memory locations
25% performance improvement
Pivoting
Main issues:
Pivot search
Row/column swapping
Pivoting
Texture mapping hardware: Input data mapping
Partial pivoting
Fast row swap
  Data copy: mapped rasterization
  Texture mapping hardware
  High memory bandwidth
Improvement over pointer swapping
[Diagram: Input -> texture mapping hardware -> Output]
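The difference between a data-copy row swap and a pointer swap can be shown in a small CPU sketch of LU with partial pivoting. The function name `lu_partial_pivot` and the returned permutation list are assumptions for illustration. The swap below exchanges the contents of the two rows, which is the analogue of the data copy on the slide; exchanging the row references instead (`a[k], a[p] = a[p], a[k]`) would be the pointer-swap analogue. On a CPU the pointer swap is the cheap option; the slide's claim is that on the GPU the copy is the better choice.

```python
def lu_partial_pivot(a):
    """LU with partial pivoting; returns (packed LU factors, row permutation)."""
    n = len(a)
    a = [row[:] for row in a]
    perm = list(range(n))
    for k in range(n - 1):
        # Pivot search: largest magnitude in column k, on or below the diagonal.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[p][k] == 0.0:
            raise ValueError("matrix is singular to working precision")
        if p != k:
            # Data copy: exchange the contents of rows k and p element-wise.
            a[k][:], a[p][:] = a[p][:], a[k][:]
            perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]
            for j in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]
    return a, perm


if __name__ == "__main__":
    print(lu_partial_pivot([[1.0, 2.0], [3.0, 4.0]]))
```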
Full pivoting
Fast column/row swap
Parallel pivot search
  Divide and conquer approach
[Figure panels: Partial pivoting, Full pivoting]
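A divide-and-conquer pivot search can be sketched as a pairwise reduction: each level compares neighbouring candidates and keeps the larger magnitude, so the number of levels grows with log2 of the number of candidates and all comparisons within a level are independent. The function names are hypothetical and this one-dimensional pairing is an assumption; the slides do not say how the reduction is laid out on the GPU.

```python
def reduce_max_abs(candidates):
    """Pairwise reduction over (key, value) pairs; returns the largest |value|."""
    level = list(candidates)
    if not level:
        raise ValueError("no candidates to search")
    while len(level) > 1:
        # Every pair in this level can be compared independently of the others.
        level = [max(level[s:s + 2], key=lambda kv: abs(kv[1]))
                 for s in range(0, len(level), 2)]
    return level[0]


def full_pivot_search(a, k):
    """Search the trailing submatrix a[k:, k:] for the full-pivoting pivot."""
    n = len(a)
    cells = [((i, j), a[i][j]) for i in range(k, n) for j in range(k, n)]
    return reduce_max_abs(cells)


if __name__ == "__main__":
    m = [[1.0, -9.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]
    print(full_pivot_search(m, 0))
    print(full_pivot_search(m, 1))
```

Partial pivoting is the special case where only column k of the trailing submatrix is searched.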
Benchmarks
GPU          SIMD units   Core clock   Memory   Memory clock
6800 GT      12           350 MHz      256 MB   900 MHz
6800 Ultra   16           425 MHz      256 MB   1100 MHz
7800 GTX     24           430 MHz      256 MB   1200 MHz
Commodity CPU
  3.4 GHz Pentium IV with Hyper-Threading, 1 MB L2 cache
  LAPACK sgetrf() (blocked algorithm, ATLAS library)
  LAPACK sgetc2() (SSE-optimized IMKL library)
Results: No pivoting
[Plot: time (s) vs. matrix size N (1000 to 3500). Series: ATLAS GETRF (Partial Pivot), Ultra 6800 LU (no pivoting), GT 6800 LU (no pivoting), 7800 LU (no pivoting).]
Results: Partial pivoting
[Plot: time (s) vs. matrix size N (500 to 3500). Series: ATLAS GETRF (Partial Pivot), GT 6800 Partial Pivot, Ultra 6800 Partial Pivot, 7800 Partial Pivot.]
Results: Full Pivoting
[Plot: time (s) vs. matrix size N (500 to 3500). Series: Ultra 6800 Full Pivot, LAPACK sgetc2 (IMKL), 7800 Full Pivot.]
Results: Number of computational units
[Plot: time (s) vs. matrix size N (500 to 4000) on the 6800 Ultra (no pivoting), with one curve each for 4, 8, 12, and 16 computational units. Annotations: (Jun 2003), (Mar 2004).]
GPU-CPU data transfer overhead
[Plot: time (s) vs. matrix size N (500 to 3500). Series: GT 6800 Partial Pivot, Ultra 6800 Partial Pivot, 7800 Partial Pivot, CPU-GPU transfer.]
Bandwidth efficiency
[Plot: bandwidth usage (GB/s) vs. matrix size N (500 to 4000). Series: 6800 Ultra, 6800 GT.]
6800 Ultra peak bandwidth: 35.2 GB/s
6800 GT peak bandwidth: 28.8 GB/s
Faster than Moore’s law
[Plot: time (s) vs. matrix size N (500 to 3500). Series: ATLAS GETRF (Partial Pivot), GT 6800 Partial Pivot, Ultra 6800 Partial Pivot, 7800 Partial Pivot. Annotations: (Mar 2004), (Jun 2005).]
Application: Fluid Simulation
Solve parallel sub-problems
  N=2048
  Diagonally-dominant
  No pivoting required
15% faster than ATLAS on Pentium IV 3.06 GHz
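Pivoting can be skipped for these sub-problems because a strictly diagonally dominant matrix is guaranteed to have an LU factorization without row exchanges. A minimal check for that property is sketched below; the function name is hypothetical, and the slide does not say whether the fluid-simulation matrices are dominant by rows, by columns, strictly, or weakly, so the row-wise strict test here is an assumption.

```python
def is_strictly_diagonally_dominant(a):
    """True if |a[i][i]| exceeds the sum of the other magnitudes in every row."""
    return all(
        abs(row[i]) > sum(abs(x) for j, x in enumerate(row) if j != i)
        for i, row in enumerate(a)
    )


if __name__ == "__main__":
    print(is_strictly_diagonally_dominant([[4.0, 1.0, 1.0],
                                           [1.0, 5.0, 2.0],
                                           [0.0, 1.0, 3.0]]))
    print(is_strictly_diagonally_dominant([[1.0, 2.0], [3.0, 4.0]]))
```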
Limitations
Maximum matrix size: 4096x4096
  Block-partitioned LU decomposition
Precision
  Single precision floating point
  Not 100% IEEE floating point compliant
CPU-GPU data transfer overhead
  Small matrices
Graphics hardware advancements
Improved floating point bandwidth
  4 component vs. single component
Floating point blending
  Use of non-programmable TFLOPs
Conclusions
Algorithm mapped to graphics pipeline
Novel mapping of row operations to rasterization
Stream computation
Blocking
Conclusions
Optimized with GPU architecture
Input data mapping
Fast pivoting
Conclusions
Performance: comparable to industry-standard libraries
Relatively small development effort
GPUs are useful co-processors: scientific computations
Many applications
Conclusions
LU-GPU Open Source library available:
http://gamma.cs.unc.edu/LUGPULIB/
Ongoing work
Sorting on GPUs
Linear algebra: GPU-LAPACK / QR decomposition
Sorting on GPUs
Goal: Utilize the high parallelism and memory bandwidth of GPUs for fast sorting
[Govindaraju et al., SIGMOD05]
GPUSort: Open Source library [http://gamma.cs.unc.edu/GPUSORT/]
Sorting on GPUs
6 times faster than Quicksort on 3.4 GHz Pentium IV PC!
Linear algebra
LAPACK-compliant library for GPUs
QR-decomposition in development (LAPACK SGEQRF)
Acknowledgements
Army Research Office
DARPA
National Science Foundation
Office of Naval Research
RDECOM
Intel Corporation
NVIDIA Corporation
UNC GAMMA Group
Thank you
For questions or comments:
http://gamma.cs.unc.edu/
http://gamma.cs.unc.edu/LUGPULIB/