Parallel Implementation of the 2-D Discrete Wavelet Transform

Discrete Wavelet Transform (DWT)

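\[
\begin{bmatrix} L \\ H \end{bmatrix}
=
\begin{bmatrix} G_1^{(o)} & G_1^{(e)} \\ G_0^{(o)} & G_0^{(e)} \end{bmatrix}
\begin{bmatrix} L \\ H \end{bmatrix}
\]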

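\[
\begin{bmatrix} L \\ H \end{bmatrix}
=
\left\{ \prod_k
\begin{bmatrix} 1 & U^{(k)} \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ P^{(k)} & 1 \end{bmatrix}
\right\}
\begin{bmatrix} L \\ H \end{bmatrix}
\]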
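A minimal sketch of the lifting product above for a 1-D signal, written in Python. It assumes the CDF 9/7 wavelet named in the Results section with its commonly published lifting coefficients, two-tap predict and update filters, periodic boundary extension, and no final scaling step. The function and constant names are illustrative and do not come from the poster.

```python
# (P, U) coefficient pairs for k = 0, 1 (CDF 9/7; the scaling step is omitted).
CDF97_PAIRS = [
    (-1.586134342, -0.05298011854),
    (0.8829110762, 0.4435068522),
]


def lift_forward(signal, pairs=CDF97_PAIRS):
    """Split into even (L) and odd (H) samples and apply each lifting pair."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    low = list(signal[0::2])
    high = list(signal[1::2])
    n = len(low)
    for p, u in pairs:
        # Predict, matrix [1 0; P 1]: H[i] += p * (L[i] + L[i+1])
        high = [high[i] + p * (low[i] + low[(i + 1) % n]) for i in range(n)]
        # Update, matrix [1 U; 0 1]: L[i] += u * (H[i-1] + H[i])
        low = [low[i] + u * (high[i - 1] + high[i]) for i in range(n)]
    return low, high


def lift_inverse(low, high, pairs=CDF97_PAIRS):
    """Undo the lifting steps in reverse order with opposite signs."""
    low, high = list(low), list(high)
    n = len(low)
    for p, u in reversed(pairs):
        low = [low[i] - u * (high[i - 1] + high[i]) for i in range(n)]
        high = [high[i] - p * (low[i] + low[(i + 1) % n]) for i in range(n)]
    out = [0.0] * (2 * n)
    out[0::2] = low
    out[1::2] = high
    return out
```

Because every lifting step only adds a filtered copy of one channel to the other, `lift_inverse(*lift_forward(x))` should return `x` up to rounding error.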

2-D DWT

\[
x = \begin{bmatrix} LL & HL & LH & HH \end{bmatrix}^T
\]

\[
y = \begin{bmatrix} LL & HL & LH & HH \end{bmatrix}^T
\]

Existing Approach: Separable Convolution

\[
y = N^V \,||\, N^H \,||\, x
\]

Existing Approach: Separable Lifting

[Diagram: lifting steps $T^H_P$, $S^H_U$, $T^V_P$, $S^V_U$]

\[
y = S^V_U \,||\, S^H_U \,||\, T^V_P \,||\, T^H_P \,||\, x
\]
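The four steps of the separable scheme can be sketched on the four polyphase components as follows. This is an illustration only. It assumes that H and V denote horizontal and vertical filtering, that T_P and S_U are the predict and update steps, and that the steps are applied right to left as written in the formula. It uses a single lifting pair with CDF 5/3 coefficients (p = -1/2, u = 1/4) instead of the CDF 9/7 wavelet from the results, periodic boundaries, and one possible assignment of row and column parity to LL, HL, LH, HH. All helper names are hypothetical.

```python
def split(image):
    """Polyphase split: LL = even row/even column, HL = even row/odd column,
    LH = odd row/even column, HH = odd row/odd column (assumed convention)."""
    ll = [row[0::2] for row in image[0::2]]
    hl = [row[1::2] for row in image[0::2]]
    lh = [row[0::2] for row in image[1::2]]
    hh = [row[1::2] for row in image[1::2]]
    return ll, hl, lh, hh


def pair(a, dr, dc):
    """a[r][c] + a[r+dr][c+dc] with periodic wrap-around."""
    rows, cols = len(a), len(a[0])
    return [[a[r][c] + a[(r + dr) % rows][(c + dc) % cols]
             for c in range(cols)] for r in range(rows)]


def add(base, weight, other):
    """Element-wise base + weight * other."""
    return [[b + weight * o for b, o in zip(brow, orow)]
            for brow, orow in zip(base, other)]


def separable_lifting(ll, hl, lh, hh, p=-0.5, u=0.25):
    # Step 1, T^H_P: horizontal predict.
    hl = add(hl, p, pair(ll, 0, 1))
    hh = add(hh, p, pair(lh, 0, 1))
    # Step 2, T^V_P: vertical predict.
    lh = add(lh, p, pair(ll, 1, 0))
    hh = add(hh, p, pair(hl, 1, 0))
    # Step 3, S^H_U: horizontal update.
    ll = add(ll, u, pair(hl, 0, -1))
    lh = add(lh, u, pair(hh, 0, -1))
    # Step 4, S^V_U: vertical update.
    ll = add(ll, u, pair(lh, -1, 0))
    hl = add(hl, u, pair(hh, -1, 0))
    return ll, hl, lh, hh
```

A call such as `separable_lifting(*split(image))` expects an image with even width and height. Each of the four steps reads values written by an earlier step, which is why a parallel implementation has to finish one step before starting the next.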

Our Approach: Non-Separable Lifting

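[Diagram: lifting steps $T^H_{P_0}$, $T^V_{P_0}$, $S^H_{U_0}$, $S^V_{U_0}$, $T_{P_1}$, $S_{U_1}$]

\[
y = S^V_{U_0} S^H_{U_0} S_{U_1} \,||\, T^V_{P_0} T^H_{P_0} T_{P_1} \,||\, x
\]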
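One way to read the two-group structure of this formula is that all predict operations are merged into a single spatial pass and all update operations into a second one, if the || symbol is taken as the separator between passes. The sketch below shows that idea for a single lifting pair by fusing a horizontal and a vertical step into one 2-D step. It does not reproduce the poster's split into P0/P1 and U0/U1, and it assumes CDF 5/3 coefficients (p = -1/2, u = 1/4), periodic boundaries, and the same hypothetical component layout and helper names as any other sketch in this text would need to define for itself.

```python
def pair(a, dr, dc):
    """a[r][c] + a[r+dr][c+dc] with periodic wrap-around."""
    rows, cols = len(a), len(a[0])
    return [[a[r][c] + a[(r + dr) % rows][(c + dc) % cols]
             for c in range(cols)] for r in range(rows)]


def add(base, *terms):
    """Element-wise base + sum(weight * array) over (weight, array) terms."""
    return [[base[r][c] + sum(w * a[r][c] for w, a in terms)
             for c in range(len(base[0]))] for r in range(len(base))]


def nonseparable_lifting(ll, hl, lh, hh, p=-0.5, u=0.25):
    # Pass 1: fused predict. Every output depends only on the pass inputs,
    # so all three components can be computed without an intermediate step.
    new_hl = add(hl, (p, pair(ll, 0, 1)))
    new_lh = add(lh, (p, pair(ll, 1, 0)))
    new_hh = add(hh,
                 (p, pair(lh, 0, 1)),
                 (p, pair(hl, 1, 0)),
                 (p * p, pair(pair(ll, 0, 1), 1, 0)))  # 2x2 neighbourhood of LL
    hl, lh, hh = new_hl, new_lh, new_hh
    # Pass 2: fused update, again using only the values from pass 1.
    new_ll = add(ll,
                 (u, pair(hl, 0, -1)),
                 (u, pair(lh, -1, 0)),
                 (u * u, pair(pair(hh, 0, -1), -1, 0)))  # 2x2 neighbourhood of HH
    new_hl = add(hl, (u, pair(hh, -1, 0)))
    new_lh = add(lh, (u, pair(hh, 0, -1)))
    return new_ll, new_hl, new_lh, hh
```

The fused passes trade extra arithmetic (the 2x2 terms) for fewer dependent steps, which matters on architectures where each step boundary forces a synchronization or a round trip through off-chip memory.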

Architectures

                        pixel shader    OpenCL/CUDA    CPU
input/output            off chip        off chip       off chip
intermediate results    off chip        on chip        on chip
  in registers          no              yes            no
on chip memory          no              32–96 KiB      3–35 MiB
concurrent threads      thousands       thousands      2–112
4-tuples / thread       1               1–4            thousands
view                    global          block-based    block-based
block size              global          64²            512²

Results

CDF 9/7 Wavelet

[Figure: two plots of throughput in GB/s against image size (100 kpel, 1 Mpel, 10 Mpel, 100 Mpel). Panels: OpenCL (AMD 6970) and Pixel Shader (NVIDIA Titan X). Legend: separable lifting, separable polyconvolution, separable convolution, non-separable lifting, non-separable polyconvolution, non-separable convolution.]

Future Work

CPU

blade055: 2 sockets × 28 cores = 56 CPUs, Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz, cache: 32k/256k/35M

vidte: 2 sockets × 12 cores = 24 CPUs, Intel(R) Xeon(R) CPU X5680 @ 3.33GHz, cache: 32k/256k/12M

UV2000: 14 sockets × 8 cores = 112 cores, Intel(R) Xeon(R) CPU E5-4627 v2 @ 3.30GHz, cache: 32k/256k/16M


Engineering