Wisconsin Applied Computing Center
JPEG-GPU: A GPGPU IMPLEMENTATION OF JPEG CORE CODING SYSTEMS
Ang Li
University of Wisconsin-Madison
NVIDIA GTC 2013 2
Outline
• Brief Background
• Implementation
• Evaluation
• Conclusion
3/20/2013
Background
• JPEG Encoding
• Parallelism Seeking
  • Pre-processing: Color Conversion
  • Block Encoding/Decoding
Implementation
• Step 1 – Find target functions
  • Encode: encode_mcu_huff, encode_one_block, emit_bits_s
  • Decode: decode_mcu_DC_first, decode_mcu_DC_refine
• Profiling to find other functions
  • Using GPROF
  • Encode: rgb_ycc_convert
  • Decode: ycc_rgb_convert
  • Each takes slightly less than half of the total execution time of encoding/decoding, respectively
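To illustrate why the color conversion routines are natural parallelization targets, here is a minimal sketch of an RGB to YCbCr conversion using the standard JFIF equations. It is not libjpeg's rgb_ycc_convert (which uses precomputed fixed-point tables); the names rgb_to_ycc_sketch and clamp_u8 are hypothetical. The point is only that every pixel is computed independently of every other pixel.

```c
#include <stddef.h>
#include <stdint.h>

/* Round a floating-point sample and clamp it to the 8-bit range [0, 255]. */
static uint8_t clamp_u8(double v)
{
    if (v < 0.0) return 0;
    if (v > 255.0) return 255;
    return (uint8_t)(v + 0.5);
}

/* Hypothetical sketch: convert n interleaved RGB pixels to interleaved
 * YCbCr using the JFIF equations. No iteration reads or writes data that
 * another iteration touches, so the loop can be split across threads. */
void rgb_to_ycc_sketch(const uint8_t *rgb, uint8_t *ycc, size_t n)
{
    long i;
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (i = 0; i < (long)n; i++) {
        double r = rgb[3 * i + 0];
        double g = rgb[3 * i + 1];
        double b = rgb[3 * i + 2];

        ycc[3 * i + 0] = clamp_u8( 0.299    * r + 0.587    * g + 0.114    * b);
        ycc[3 * i + 1] = clamp_u8(-0.168736 * r - 0.331264 * g + 0.5      * b + 128.0);
        ycc[3 * i + 2] = clamp_u8( 0.5      * r - 0.418688 * g - 0.081312 * b + 128.0);
    }
}
```

Compiled without OpenMP support, the pragma is skipped and the loop simply runs sequentially with the same result.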
Implementation – Cont’d
• Step 2 – Parallelize with CUDA
  • First, implement in OpenMP to make sure the code is understood correctly
    • E.g., in encode_one_block, emit_bits_s changes the state of the system => running it in parallel with multiple threads will lead to incorrect results!
  • Second, make a baseline GPGPU implementation of all critical functions
  • Third, optimize the GPGPU implementations
    • Using constant memory
for (k = 1; k <= Se; k++) {
  …
  if (! emit_bits_s(…))
    return FALSE;
  …
  if (! emit_bits_s(…))
    return FALSE;
  …
  if (! emit_bits_s(…))
    return FALSE;
  …
}
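The loop above cannot be parallelized naively because each emit_bits_s call appends to a shared bit buffer. The following is a simplified sketch of such a bit emitter, not libjpeg's actual code: the bit_state structure and emit_bits_sketch are hypothetical, sizes are assumed to be 1 to 16 bits, and JPEG's 0xFF byte stuffing is omitted. It shows that the output depends on the order of the calls, which is exactly what concurrent threads would fail to preserve.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified encoder state (not libjpeg's structure). */
typedef struct {
    uint8_t  out[64];     /* output bytes produced so far */
    size_t   out_len;     /* number of valid bytes in out */
    uint32_t put_buffer;  /* pending bits, right-aligned */
    int      put_bits;    /* number of pending bits (0..7 between calls) */
} bit_state;

/* Append the low `size` bits of `code` to the stream (1 <= size <= 16).
 * Returns 1 on success, 0 if the output array is full. Every call reads
 * and updates put_buffer, put_bits and out_len, so calls must run in
 * program order. */
int emit_bits_sketch(bit_state *s, unsigned code, int size)
{
    s->put_buffer = (s->put_buffer << size) | (code & ((1u << size) - 1u));
    s->put_bits += size;

    /* Flush every complete byte, most significant bits first. */
    while (s->put_bits >= 8) {
        if (s->out_len >= sizeof s->out)
            return 0;
        s->out[s->out_len++] = (uint8_t)(s->put_buffer >> (s->put_bits - 8));
        s->put_bits -= 8;
    }
    return 1;
}
```

Emitting the 3-bit code 101 followed by the 5-bit code 11111 yields the byte 0xBF, while the reverse order yields 0xFD, so two threads racing on the same state would produce a corrupt stream even if each individual update were atomic.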
Evaluation
• Evaluation Environment
  • CPU: Intel Nehalem Xeon E5520 2.26GHz processor
  • GPU: Tesla K20c
• Picture used
  • My favorite picture
  • Compressing: 1280 x 768 pixels
  • Decompressing: the output of the compression step
• Correctness checked by "diff"
Evaluation – Cont’d
             Sequential   OpenMP   GPGPU Base   GPGPU Optimized
Compress     2.886        2.648    14.700       22.412
Decompress   2.420        2.200    14.616       21.507
• Timings are in milliseconds, averaged over 10 executions
• Four threads are forked for the OpenMP implementation
• For both GPU implementations, configurations are tuned for best performance
• Results discussion
  • OpenMP is fastest. GPGPU degrades the performance, and the `optimized' version degrades it further (due to serialized constant memory accesses).
  • Observations after hacking the code:
    • Each kernel launch deals with at most 250 elements, which is too fine-grained.
    • Kernel launch is expensive (allocation & copying the data)
    • Using OpenMP is always going to be better off as long as there is enough parallelism & loop iterations are not extremely trivial.
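To put the timings in relative terms, the small helper below (a hypothetical name, not part of the original work) divides each variant's time by the sequential time. With the numbers from the table, OpenMP runs at roughly 0.92x (compress) and 0.91x (decompress) of the sequential time, the baseline GPGPU version takes about 5.1x and 6.0x as long, and the `optimized' GPGPU version about 7.8x and 8.9x as long.

```c
#include <stdio.h>

/* Ratio of a variant's time to the sequential time.
 * Values above 1 mean the variant is slower than sequential. */
double time_ratio(double t_variant_ms, double t_sequential_ms)
{
    return t_variant_ms / t_sequential_ms;
}

/* Print the ratios for the timings reported in the table (milliseconds). */
void print_ratios(void)
{
    const char  *names[3]      = {"OpenMP", "GPGPU Base", "GPGPU Optimized"};
    const double compress[4]   = {2.886, 2.648, 14.700, 22.412};
    const double decompress[4] = {2.420, 2.200, 14.616, 21.507};
    int i;

    for (i = 0; i < 3; i++) {
        printf("%-16s compress %.2fx  decompress %.2fx\n",
               names[i],
               time_ratio(compress[i + 1], compress[0]),
               time_ratio(decompress[i + 1], decompress[0]));
    }
}
```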
Conclusion
• For the JPEG encoding/decoding core system, GPGPU basically degrades performance.
• Coarser-grained parallelism is required.
• OpenMP acceleration can be easily applied to gain some performance.
Thank you.
Ang Li <[email protected]>