stream compaction for deferred shading · 2012. 10. 22. · our contribution •shading requests...
Stream Compaction for Deferred Shading
Jared Hoberock*
Victor Lu
Yuntao Jia
John C. Hart
UPCRC
University of Illinois
(*now at NVIDIA)
Our Contribution
• Shading requests from rasterization are spatially coherent
• Less so when shading is deferred until after rasterization
• Shading requests from ray tracers are spatially incoherent
• Neighboring processes need to run completely different shaders
• Shading requests can be deferred and batch processed
• SIMD processing of incoherent shading batches suffers from control flow divergence
• Is it worth clustering shading requests into coherent batches to avoid SIMD divergence?
Previous Work
• Memory Coherence for Out-of-Core Processing [Pharr et al. 1997]
• Encouraged memory coherence within intersection jobs whereas we encourage instruction coherence within shading jobs
• Ray-Hierarchy Traversal
• Månsson et al. [2007] measured divergence
• Wald et al. [2007] simulated compaction to avoid divergence
• Dynamic Warp Formation [Fung et al. 2007]
• Local re-ordering hardware vs. global re-ordering software
• Load Balancing [Aila & Laine 2009]
• Ray tracing is a scheduling problem
Data Parallel Architectures
• MIMD “cores”
• Each core has its own instruction counter
• Cell:8, GT200:30, LRB:32
• SIMD vector processors
• Lanes share same instruction counter
• Cell:4, GT200:8, LRB:16
• Programmer may see even wider degree of SIMD parallelism
• NVIDIA’s 32-wide “warps”
[Figure: a core and its SIMD lanes]
SIMD Divergence: Conceptual
• Test X: TFFT TFTF
• Execute A or B: ABBA ABAB
SIMD Divergence: Actual
• Mask on X: TFFT TFTF
• Execute both A and B: AAAA AAAA, BBBB BBBB
• Per-lane outcome: ABBA ABAB
SIMD Divergence: Measurement
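• Mask on X: TFFT TFTF
• Execute both A and B: AAAA AAAA, BBBB BBBB
• Per-lane outcome: ABBA ABAB
$$
\text{Efficiency} = \frac{\text{Useful Work}}{\text{Total Effort}} = \frac{\#A\,|A| + \#B\,|B|}{(\#A + \#B)\,(|A| + |B|)} = \frac{\sum_{i=1}^{n} n_i\,|A_i|}{N \sum_{i=1}^{n} \min(n_i, 1)\,|A_i|}
$$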
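The efficiency measure can be evaluated directly for the eight-lane example. The sketch below is illustrative only. It assumes that n_i counts the lanes in a SIMD group that need shader i, that |A_i| is the cost of shader i, and that N is the number of lanes in the group. The function name `simd_efficiency` and the cost values are hypothetical, not taken from the talk.

```python
from collections import Counter


def simd_efficiency(lane_shaders, cost):
    """Useful work divided by total effort for one SIMD group.

    lane_shaders: shader id needed by each lane, e.g. "ABBA".
    cost: dict mapping shader id -> cost |A_i| (hypothetical units).
    """
    counts = Counter(lane_shaders)      # n_i: lanes that need shader i
    width = len(lane_shaders)           # N: lanes in the group
    useful = sum(n * cost[s] for s, n in counts.items())
    # Every shader that appears at least once occupies all N lanes,
    # which is the min(n_i, 1) factor in the denominator.
    total = width * sum(cost[s] for s in counts)
    return useful / total


unit_cost = {"A": 1.0, "B": 1.0}
print(simd_efficiency("ABBAABAB", unit_cost))  # 0.5
print(simd_efficiency("AAAA", unit_cost))      # 1.0
```

With equal costs, a group that mixes A and B lanes is 50% efficient no matter how the lanes are split, while a coherent group reaches 100%.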
Shading Efficiency in a Path Tracer
• 1st Hit: 98% Efficient
• 2nd Hit: 56% Efficient
• 3rd Hit: 52% Efficient
• 4th Hit: 54% Efficient
Recovering Coherence
• Test X: TFFT TFTF
• Compact: TTTT FFFF
• Evaluate A or B: AAAA BBBB
• Scatter result: ABBA ABAB
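The four stages (test, compact, evaluate, scatter) can be traced on the same eight lanes. This is a minimal sequential sketch, assuming the test outcome is already known per lane. The helper name `shade_with_compaction` and the stand-in shaders are hypothetical.

```python
def shade_with_compaction(tests, shade_a, shade_b):
    """Test X, compact, evaluate A or B, then scatter results back."""
    # Compact: lane indices grouped by test outcome, true lanes first.
    true_lanes = [i for i, t in enumerate(tests) if t]
    false_lanes = [i for i, t in enumerate(tests) if not t]
    order = true_lanes + false_lanes
    # Evaluate: each contiguous run of the compacted order needs one shader.
    compacted = [shade_a(i) if tests[i] else shade_b(i) for i in order]
    # Scatter: write every result back to the lane it came from.
    results = [None] * len(tests)
    for slot, lane in enumerate(order):
        results[lane] = compacted[slot]
    return order, results


tests = [c == "T" for c in "TFFTTFTF"]
order, results = shade_with_compaction(tests, lambda i: "A", lambda i: "B")
print("".join("T" if tests[i] else "F" for i in order))  # TTTTFFFF
print("".join(results))                                  # ABBAABAB
```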
Recovering Coherence
$$
\text{Efficiency} = \frac{(A \mid B)}{(A \mid B) + \text{compact}}
$$
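A small numeric reading of this ratio, assuming (A | B) stands for the time spent running coherent shading and "compact" for the time spent reordering. The timings are hypothetical and only illustrate how the overhead is amortized.

```python
def compacted_efficiency(shade_time, compact_time):
    """Efficiency = (A | B) / ((A | B) + compact)."""
    return shade_time / (shade_time + compact_time)


# Hypothetical timings in milliseconds, not measurements from the talk.
print(compacted_efficiency(9.0, 1.0))  # 0.9: expensive shaders hide the overhead
print(compacted_efficiency(1.0, 1.0))  # 0.5: cheap shaders do not
```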
Stream Compaction
• Compacting disorganized input
1. Select orange token: 0001 1000 1000 0010
2. Prefix Sum: 1110 1111 2222 4433
3. Scatter
• Scan: M*O(N) [Sengupta et al. 2007]
• Radix Sort: log(M)*O(N) [Satish et al. 2009]
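The three steps can be sketched sequentially as follows. This is not the CUDPP implementation; it only shows the data flow, with an exclusive prefix sum supplying each selected element's output slot. The token string is an assumed reconstruction: the bit rows on the slide appear to match this scan when each four-digit group is read right to left. `group_by_shader` is a hypothetical helper showing why scan-based grouping costs one O(N) pass per shader, M passes in total.

```python
from itertools import accumulate


def compact(items, predicate):
    """Stream compaction: select, exclusive prefix sum, scatter."""
    flags = [1 if predicate(x) else 0 for x in items]       # 1. select
    inclusive = list(accumulate(flags))
    offsets = [s - f for s, f in zip(inclusive, flags)]     # 2. exclusive prefix sum
    out = [None] * (inclusive[-1] if inclusive else 0)
    for x, f, dst in zip(items, flags, offsets):            # 3. scatter
        if f:
            out[dst] = x
    return flags, offsets, out


def group_by_shader(jobs, shaders):
    """One compaction pass per shader: M passes of O(N) work each."""
    grouped = []
    for s in shaders:
        grouped += compact(jobs, lambda j, s=s: j == s)[2]
    return grouped


tokens = list("obbbbbbobbbobobb")    # 'o' marks an orange token (assumed input)
flags, offsets, out = compact(tokens, lambda t: t == "o")
print("".join(map(str, flags)))      # 1000000100010100
print("".join(map(str, offsets)))    # 0111111122223344
print("".join(group_by_shader(list("ABBAABAB"), "AB")))  # AAAABBBB
```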
Shader Scheduling
• Implicit Serialization (Big Switch)
• Let hardware schedule
• Explicit Serialization
• Run only jobs with the same shader at a time
• Compact + Implicit/Explicit Serialization
• Radix Sort + Implicit/Explicit Serialization
• Local Bitonic Sort + Implicit Serialization
• Local to a CUDA thread block
• Global loads coalesce
Implicit serialization (big switch):
    forall j in jobs in SIMD do
        switch j do
            case s1: execute(s1)
            …
            case sM: execute(sM)
Explicit serialization:
    forall s in shaders do
        mask = select(s, jobs)
        forall (m, j) in (mask, jobs) in SIMD do
            if m then execute(s)
• Implemented in CUDA on G80-class hardware
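Both schedules can be emulated sequentially to see how input order changes the number of shader executions per SIMD group. This is a simplified sketch, not the CUDA implementation: the group width of 4, the "passes" counter, and the stand-in shaders are assumptions. The model counts one pass per (group, shader) pair that has an active lane, so it shows the effect of ordering only and cannot distinguish the two schedules from each other.

```python
def run_implicit(jobs, shaders, width=4):
    """Big switch: every SIMD group runs each shader that any of its lanes needs."""
    results, passes = [None] * len(jobs), 0
    for start in range(0, len(jobs), width):
        lanes = range(start, min(start + width, len(jobs)))
        for s in sorted({jobs[i] for i in lanes}):  # divergent cases run one after another
            passes += 1
            for i in lanes:
                if jobs[i] == s:
                    results[i] = shaders[s](i)
    return results, passes


def run_explicit(jobs, shaders, width=4):
    """One sweep per shader; lanes whose job uses another shader are masked off."""
    results, passes = [None] * len(jobs), 0
    for s in shaders:
        mask = [j == s for j in jobs]               # mask = select(s, jobs)
        for start in range(0, len(jobs), width):
            lanes = range(start, min(start + width, len(jobs)))
            if any(mask[i] for i in lanes):         # fully masked groups do no shading
                passes += 1
            for i in lanes:
                if mask[i]:
                    results[i] = shaders[s](i)
    return results, passes


# Hypothetical shaders that just tag the job index with the shader name.
shaders = {"A": lambda i: f"A{i}", "B": lambda i: f"B{i}"}
for jobs in ("ABBAABAB", "AAAABBBB"):               # unordered vs. compacted input
    r_imp, p_imp = run_implicit(jobs, shaders)
    r_exp, p_exp = run_explicit(jobs, shaders)
    assert r_imp == r_exp                           # both schedules shade identically
    print(jobs, p_imp, p_exp)
```

In this model the unordered batch needs four group passes and the compacted batch needs two.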
Results
3 simple shaders
19% slower on GX2
5% slower on GTX+
[Figure: unscheduled vs. scheduled]
Results
6 simple shaders
14% faster on GX2
38% faster on GTX+
Results
2 simple & 2 procedural shaders
2.4x faster on GX2
2.7x faster on GTX+
Results
3 simple & 2 procedural shaders
3.2x faster on GX2
3.5x faster on GTX+
Results
11 moderate shaders
23% faster on GX2
51% faster on GTX+
Scaling
| | 9800GX2 | 9800GTX+ |
|---|---|---|
| “Cores” | 2x16 (we used 16) | 16 |
| Processor Clock | 1.50 GHz | 1.84 GHz (23%) |
| Memory Clock | 1 GHz | 1.1 GHz (10%) |
| Bandwidth | 64 GB/s | 70.4 GB/s (10%) |
| Bus Width | 2x256 bit (we used 256) | 256 bit |

• Difference between processor clock scaling and memory bandwidth scaling enhances benefits of shader compaction
• Compaction further leverages increased processor speed
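The percentages in the comparison follow from the listed clocks and bandwidths. The short check below recomputes them; the dictionary keys and the helper `percent_increase` are names chosen here for illustration.

```python
# Figures as listed for the two cards.
gx2 = {"processor_clock_ghz": 1.50, "memory_clock_ghz": 1.0, "bandwidth_gbs": 64.0}
gtx_plus = {"processor_clock_ghz": 1.84, "memory_clock_ghz": 1.1, "bandwidth_gbs": 70.4}


def percent_increase(old, new):
    """Relative increase from old to new, rounded to a whole percent."""
    return round((new / old - 1.0) * 100)


increase = {key: percent_increase(gx2[key], gtx_plus[key]) for key in gx2}
print(increase)
# {'processor_clock_ghz': 23, 'memory_clock_ghz': 10, 'bandwidth_gbs': 10}
```

The processor clock grows more than twice as much as the memory system, which is the gap the scaling argument relies on.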
Analysis
• Coherent shading time is always smaller
• But cost of overhead not always worth it for simple cases
• Shader Complexity
• Simple – No improvement, but little penalty
• Procedural – Large improvements
• Implicit versus Explicit Serialization
• With compaction, explicit almost always wins
• Large penalties for explicit with unordered input
• Local compaction was never successful
• Too much local data movement
• Limited working set size
Conclusions
• Global stream compaction is almost always a win
• Surprising positive results for our toy scenes
• Production renderers will require stream compaction to be tractable in large scenes of arbitrary shading complexity
Future work
• Data sensitive scheduling to avoid memory divergence
• Hybrid shader batch approaches
• Scheduling in both space and time
Acknowledgments
• Thanks to Shubho Sengupta & Mark Harris for making their fast CUDA compaction primitives available in CUDPP, and Nathan Bell for GPU radix sort
• This work was funded by Intel & Microsoft as part of the Illinois Universal Parallel Computing Research Center