
Wisconsin Applied Computing Center

JPEG-GPU: A GPGPU IMPLEMENTATION OF JPEG CORE CODING SYSTEMS

Ang Li
University of Wisconsin-Madison

NVIDIA GTC 2013

Outline

• Brief Introduction to the Background

• Implementation

• Evaluation

• Conclusion

3/20/2013


Background

• JPEG Encoding

• Parallelism Seeking
  • Pre-processing: Color Conversion
  • Block Encoding/Decoding


Implementation

• Step 1 – Find target functions
  • Encode: encode_mcu_huff, encode_one_block, emit_bits_s
  • Decode: decode_mcu_DC_first, decode_mcu_DC_refine
  • Profiling to find other functions
    • Using GPROF
    • Encode: rgb_ycc_convert
    • Decode: ycc_rgb_convert
    • Each takes slightly less than half of the total execution time of encoding/decoding
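Color conversion is a natural target because every pixel is converted independently. The following is a minimal sketch of that idea, not the libjpeg routine itself: it assumes interleaved 8-bit RGB input, uses the floating-point JFIF full-range coefficients, and the names rgb_to_ycc and clamp_u8 are hypothetical. The OpenMP pragma is ignored when the file is compiled without OpenMP support.

```c
#include <stddef.h>
#include <stdint.h>

/* Clamp a float to the 0..255 range and round to the nearest integer. */
static uint8_t clamp_u8(float v)
{
    if (v < 0.0f) return 0;
    if (v > 255.0f) return 255;
    return (uint8_t)(v + 0.5f);
}

/* Convert n interleaved RGB pixels to interleaved YCbCr (JFIF, full range).
 * Iteration i reads and writes only pixel i, so the iterations are
 * independent and the loop can be split across threads. */
void rgb_to_ycc(const uint8_t *rgb, uint8_t *ycc, size_t n)
{
    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++) {
        float r = rgb[3 * i];
        float g = rgb[3 * i + 1];
        float b = rgb[3 * i + 2];
        ycc[3 * i]     = clamp_u8( 0.299f    * r + 0.587f    * g + 0.114f    * b);
        ycc[3 * i + 1] = clamp_u8(-0.168736f * r - 0.331264f * g + 0.5f      * b + 128.0f);
        ycc[3 * i + 2] = clamp_u8( 0.5f      * r - 0.418688f * g - 0.081312f * b + 128.0f);
    }
}
```

Because no iteration depends on another, the same loop body maps directly onto one GPU thread per pixel; whether that pays off depends on how much data each launch processes.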


Implementation – Cont’d

• Step 2 – Parallelize with CUDA
  • First, implement in OpenMP to make sure the code is understood correctly
    • E.g., in encode_one_block, emit_bits_s changes the state of the system => running it in parallel on multiple threads will lead to incorrect results!
  • Second, make a baseline GPGPU implementation of all critical functions
  • Third, optimize the GPGPU implementations
    • Using constant memory


for (k = 1; k <= Se; k++) {
  …
  if (! emit_bits_s(…))
    return FALSE;
  …
  if (! emit_bits_s(…))
    return FALSE;
  …
  if (! emit_bits_s(…))
    return FALSE;
  …
}
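To illustrate why a call such as emit_bits_s carries state from one iteration to the next, here is a simplified bit writer. It is a sketch under assumptions, not the libjpeg code: the bit_writer struct and emit_bits function are hypothetical, and JPEG byte stuffing after 0xFF bytes is deliberately left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical writer state shared by all calls. */
typedef struct {
    uint32_t put_buffer; /* bits not yet written out, right-aligned */
    int      put_bits;   /* number of valid bits in put_buffer (0..7 between calls) */
    uint8_t *out;        /* destination byte buffer (caller-provided) */
    size_t   pos;        /* next byte index to write in out */
} bit_writer;

/* Append the low `size` bits of `code` to the stream (1 <= size <= 16).
 * Where these bits land depends on put_bits and pos, which were set by
 * every earlier call, so the calls must run in order. */
void emit_bits(bit_writer *w, unsigned code, int size)
{
    w->put_buffer = (w->put_buffer << size) | (code & ((1u << size) - 1u));
    w->put_bits += size;
    while (w->put_bits >= 8) {
        w->put_bits -= 8;
        w->out[w->pos++] = (uint8_t)(w->put_buffer >> w->put_bits);
    }
    /* Keep only the bits that are still pending. */
    w->put_buffer &= (1u << w->put_bits) - 1u;
}
```

Since the codes have variable length, the output position of the k-th code is the sum of the lengths of all codes before it. Running the loop iterations concurrently without first resolving those positions would interleave or overwrite bits.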


Evaluation

• Evaluation Environment
  • CPU: Intel Nehalem Xeon E5520 2.26GHz processor
  • GPU: Tesla K20c
• Picture used
  • My favorite picture
  • Compressing: 1280 x 768 pixels
  • Decompressing: the output of the compression step
• Correctness checked by "diff"


Evaluation – Cont’d

Time (ms)     Sequential   OpenMP   GPGPU Base   GPGPU Optimized
Compress           2.886    2.648       14.700            22.412
Decompress         2.420    2.200       14.616            21.507


• Timings are in milliseconds, averaged over 10 executions
• Four threads are forked for the OpenMP implementation
• For both GPU implementations, configurations are tuned for best performance
• Results discussion
  • OpenMP is fastest. GPGPU basically degrades the performance, while the `optimized' version degrades it more (due to serialized constant memory accesses).
  • Observations after hacking the code:
    • Each kernel launch deals with at most 250 elements, which is too fine-grained.
    • Kernel launch is expensive (allocation & copying the data)
  • Using OpenMP is always going to be better off as long as there is enough parallelism & loop iterations are not extremely trivial.
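The reported timings can be turned into ratios against the sequential version to make the comparison concrete. The small helper below is only a sketch: the timing values are copied from the table, while the timing_row struct and the function names are hypothetical.

```c
#include <stdio.h>

/* One row of the timing table, all values in milliseconds. */
typedef struct {
    const char *name;
    double seq, omp, gpu_base, gpu_opt;
} timing_row;

/* Values copied from the evaluation table. */
static const timing_row rows[2] = {
    {"Compress",   2.886, 2.648, 14.700, 22.412},
    {"Decompress", 2.420, 2.200, 14.616, 21.507},
};

/* Time of a variant relative to the sequential time.
 * Below 1 means faster than sequential, above 1 means slower. */
double relative_time(double variant_ms, double sequential_ms)
{
    return variant_ms / sequential_ms;
}

/* Print the relative time of each variant for both rows. */
void print_relative_times(void)
{
    for (int i = 0; i < 2; i++) {
        printf("%-10s OpenMP %.2fx  GPGPU base %.2fx  GPGPU optimized %.2fx\n",
               rows[i].name,
               relative_time(rows[i].omp, rows[i].seq),
               relative_time(rows[i].gpu_base, rows[i].seq),
               relative_time(rows[i].gpu_opt, rows[i].seq));
    }
}
```

With these numbers, OpenMP takes roughly 0.92x and 0.91x of the sequential time, the baseline GPGPU version roughly 5.1x and 6.0x, and the optimized GPGPU version roughly 7.8x and 8.9x, for compression and decompression respectively.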


Conclusion

• For the JPEG encoding/decoding core system, GPGPU basically degrades the performance.

• Coarser-grained parallelism is required.

• OpenMP acceleration can be easily applied to gain some performance.


Thank you.

Ang Li <[email protected]>
