Parallel Implementation of the 2-D Discrete Wavelet Transform

Upload: david-barina

Post on 20-Feb-2017

Page 1: Parallel Implementation of the 2-D Discrete Wavelet Transform
Page 2: Parallel Implementation of the 2-D Discrete Wavelet Transform

Discrete Wavelet Transform (DWT)

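\[
\begin{bmatrix} L \\ H \end{bmatrix}
=
\begin{bmatrix} G_1^{(o)} & G_1^{(e)} \\ G_0^{(o)} & G_0^{(e)} \end{bmatrix}
\begin{bmatrix} L \\ H \end{bmatrix}
\]

\[
\begin{bmatrix} L \\ H \end{bmatrix}
=
\left\{ \prod_k
\begin{bmatrix} 1 & U^{(k)} \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ P^{(k)} & 1 \end{bmatrix}
\right\}
\begin{bmatrix} L \\ H \end{bmatrix}
\]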
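The lifting factorization on this slide can be illustrated in one dimension. The sketch below is a minimal illustration, not the authors' implementation: it assumes the CDF 9/7 wavelet with the commonly cited Daubechies-Sweldens lifting coefficients, two predict/update pairs, periodic boundary extension, an even-length signal, and no final scaling step. The function names are hypothetical.

```python
# CDF 9/7 lifting coefficients (assumed; final scaling step omitted).
ALPHA, BETA = -1.586134342, -0.05298011854
GAMMA, DELTA = 0.8829110762, 0.4435068522

def predict(low, high, p):
    """Apply [1 0; P 1]: H[n] += p * (L[n] + L[n+1]), periodic extension."""
    n = len(low)
    return [high[i] + p * (low[i] + low[(i + 1) % n]) for i in range(n)]

def update(low, high, u):
    """Apply [1 U; 0 1]: L[n] += u * (H[n-1] + H[n]), periodic extension."""
    return [low[i] + u * (high[i - 1] + high[i]) for i in range(len(high))]

def forward(signal):
    """Split into even (L) and odd (H) samples, then run the lifting pairs."""
    low, high = signal[0::2], signal[1::2]
    for p, u in ((ALPHA, BETA), (GAMMA, DELTA)):
        high = predict(low, high, p)
        low = update(low, high, u)
    return low, high

def inverse(low, high):
    """Undo the lifting pairs in reverse order with negated coefficients."""
    for p, u in ((GAMMA, DELTA), (ALPHA, BETA)):
        low = update(low, high, -u)
        high = predict(low, high, -p)
    signal = [0.0] * (2 * len(low))
    signal[0::2], signal[1::2] = low, high
    return signal
```

Each lifting step is inverted by repeating it with the opposite sign, so `inverse(*forward(x))` recovers `x` up to floating-point rounding.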

Page 3: Parallel Implementation of the 2-D Discrete Wavelet Transform

2-D DWT

\[
x = \begin{bmatrix} LL & HL & LH & HH \end{bmatrix}^{T}
\]

\[
y = \begin{bmatrix} LL & HL & LH & HH \end{bmatrix}^{T}
\]

Page 4: Parallel Implementation of the 2-D Discrete Wavelet Transform

Existing Approach: Separable Convolution

\[
y = N^{V} \,\|\, N^{H} \,\|\, x
\]
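A separable convolution scheme of this kind can be sketched as a horizontal filtering pass followed by a vertical one, reading N^H and N^V as the horizontal and vertical steps. The example below is only an illustration under stated assumptions: it uses the CDF 5/3 analysis filters (chosen for their short, exact taps), periodic extension, even image dimensions, and a quadrant output layout with LL in the top-left corner. None of these choices are taken from the slides.

```python
# CDF 5/3 analysis filters (assumed for illustration).
LOW = [-1 / 8, 2 / 8, 6 / 8, 2 / 8, -1 / 8]   # centred on even samples
HIGH = [-1 / 2, 1.0, -1 / 2]                  # centred on odd samples

def analyze_1d(row):
    """Filter and downsample one row; returns low-pass half then high-pass half."""
    n = len(row)
    low = [sum(c * row[(2 * i + k - 2) % n] for k, c in enumerate(LOW))
           for i in range(n // 2)]
    high = [sum(c * row[(2 * i + k) % n] for k, c in enumerate(HIGH))
            for i in range(n // 2)]
    return low + high

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

def dwt2_separable_convolution(image):
    """Horizontal pass over all rows, then vertical pass over all columns."""
    horizontal = [analyze_1d(row) for row in image]
    # The vertical pass needs every row of the horizontal result,
    # so the two passes cannot overlap.
    return transpose([analyze_1d(col) for col in transpose(horizontal)])
```

For a constant image, the low-pass taps sum to one and the high-pass taps sum to zero, so only the top-left (LL) quadrant is non-zero.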

Page 5: Parallel Implementation of the 2-D Discrete Wavelet Transform

Existing Approach: Separable Lifting

Operators: $T^{H}_{P}$, $S^{H}_{U}$, $T^{V}_{P}$, $S^{V}_{U}$

\[
y = S^{V}_{U} \,\|\, S^{H}_{U} \,\|\, T^{V}_{P} \,\|\, T^{H}_{P} \,\|\, x
\]
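The four-step sequence above can be sketched on the four polyphase components of an image. This is a minimal illustration under assumptions not stated on the slide: a single predict/update pair with CDF 5/3 coefficients (P = -1/2, U = 1/4), periodic extension, even image dimensions, and the naming LL = even rows and even columns, HL = even rows and odd columns, LH = odd rows and even columns, HH = odd rows and odd columns. The operators are applied right to left, and treating each step boundary as a synchronization point is likewise an assumption.

```python
P, U = -0.5, 0.25  # CDF 5/3 predict and update coefficients (assumed)

def split(image):
    """Polyphase split into four components (naming is an assumption)."""
    return {"LL": [r[0::2] for r in image[0::2]], "HL": [r[1::2] for r in image[0::2]],
            "LH": [r[0::2] for r in image[1::2]], "HH": [r[1::2] for r in image[1::2]]}

def lift_h(dst, src, coef, shift):
    """dst[y][x] += coef * (src[y][x] + src[y][x + shift]), periodic."""
    w = len(src[0])
    return [[d + coef * (s[x] + s[(x + shift) % w]) for x, d in enumerate(drow)]
            for drow, s in zip(dst, src)]

def lift_v(dst, src, coef, shift):
    """dst[y][x] += coef * (src[y][x] + src[y + shift][x]), periodic."""
    h = len(src)
    return [[d + coef * (src[y][x] + src[(y + shift) % h][x]) for x, d in enumerate(drow)]
            for y, drow in enumerate(dst)]

def predict_h(c):  # horizontal predict
    return {**c, "HL": lift_h(c["HL"], c["LL"], P, 1), "HH": lift_h(c["HH"], c["LH"], P, 1)}

def predict_v(c):  # vertical predict
    return {**c, "LH": lift_v(c["LH"], c["LL"], P, 1), "HH": lift_v(c["HH"], c["HL"], P, 1)}

def update_h(c):   # horizontal update
    return {**c, "LL": lift_h(c["LL"], c["HL"], U, -1), "LH": lift_h(c["LH"], c["HH"], U, -1)}

def update_v(c):   # vertical update
    return {**c, "LL": lift_v(c["LL"], c["LH"], U, -1), "HL": lift_v(c["HL"], c["HH"], U, -1)}

def dwt2_separable_lifting(image):
    c = split(image)
    # Rightmost operator first; each step completes before the next starts.
    for step in (predict_h, predict_v, update_h, update_v):
        c = step(c)
    return c
```

Because horizontal and vertical steps act along different axes, this ordering gives the same result as transforming all rows first and then all columns.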

Page 6: Parallel Implementation of the 2-D Discrete Wavelet Transform

Our Approach: Non-Separable Lifting

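Operators: $T^{H}_{P_0}$, $T^{V}_{P_0}$, $S^{H}_{U_0}$, $S^{V}_{U_0}$, $T_{P_1}$, $S_{U_1}$

\[
y = S^{V}_{U_0} \, S^{H}_{U_0} \, S_{U_1} \,\|\, T^{V}_{P_0} \, T^{H}_{P_0} \, T_{P_1} \,\|\, x
\]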
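One way to see why fewer ∣∣ separators appear here than in the separable lifting expression is to fuse the horizontal and vertical predict steps into a single step that reads only the input components. The sketch below is an assumption-laden illustration, not the authors' formulation: it uses the CDF 5/3 predict coefficient, periodic extension, and hypothetical function names, and it does not claim to reproduce the exact split into the slide's $P_0$ and $P_1$ operators.

```python
P = -0.5  # CDF 5/3 predict coefficient (assumed)

def ph(m):
    """Horizontal predict operator: P * (m[y][x] + m[y][x+1]), periodic."""
    w = len(m[0])
    return [[P * (row[x] + row[(x + 1) % w]) for x in range(w)] for row in m]

def pv(m):
    """Vertical predict operator: P * (m[y][x] + m[y+1][x]), periodic."""
    h = len(m)
    return [[P * (m[y][x] + m[(y + 1) % h][x]) for x in range(len(m[0]))]
            for y in range(h)]

def add(*matrices):
    """Element-wise sum of equally sized matrices."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*matrices)]

def predict_separable(ll, hl, lh, hh):
    """Two steps: the vertical step reads results of the horizontal step."""
    hl1, hh1 = add(hl, ph(ll)), add(hh, ph(lh))       # horizontal predict
    # The vertical step below must wait for hl1 and hh1.
    lh1, hh2 = add(lh, pv(ll)), add(hh1, pv(hl1))     # vertical predict
    return ll, hl1, lh1, hh2

def predict_fused(ll, hl, lh, hh):
    """One step: every output is computed directly from the inputs."""
    return (ll,
            add(hl, ph(ll)),
            add(lh, pv(ll)),
            add(hh, ph(lh), pv(hl), pv(ph(ll))))      # includes the 2-D term
```

The fused version trades the intermediate synchronization for one extra two-dimensional term, `pv(ph(ll))`, in the HH component.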

Page 7: Parallel Implementation of the 2-D Discrete Wavelet Transform

Architectures

                        pixel shader    OpenCL/CUDA    CPU
input/output            off chip        off chip       off chip
intermediate results    off chip        on chip        on chip
  - in registers        no              yes            no
on chip memory          no              32–96 KiB      3–35 MiB
concurrent threads      thousands       thousands      2–112
4-tuples / thread       1               1–4            thousands
view                    global          block-based    block-based
block size              global          64^2           512^2

Page 8: Parallel Implementation of the 2-D Discrete Wavelet Transform

Results
CDF 9/7 Wavelet

[Figure: two plots, y-axis in GB/s, x-axis image size from 100 kpel to 100 Mpel.
Panels: OpenCL (AMD 6970); Pixel Shader (NVIDIA Titan X).
Series: separable lifting, separable polyconvolution, separable convolution, non-separable lifting, non-separable polyconvolution, non-separable convolution.]

Page 9: Parallel Implementation of the 2-D Discrete Wavelet Transform

Future Work
CPU

blade055: 2 sockets × 28 cores = 56 CPUs
  Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
  cache: 32k/256k/35M

vidte: 2 sockets × 12 cores = 24 CPUs
  Intel(R) Xeon(R) CPU X5680 @ 3.33GHz
  cache: 32k/256k/12M

UV2000: 14 sockets × 8 cores = 112 cores
  Intel(R) Xeon(R) CPU E5-4627 v2 @ 3.30GHz
  cache: 32k/256k/16M