Parallel Implementation of the 2-D Discrete Wavelet Transform
TRANSCRIPT
Discrete Wavelet Transform (DWT)
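Polyphase (convolution) form:

\[
\begin{bmatrix} L \\ H \end{bmatrix}
=
\begin{bmatrix} G_1^{(o)} & G_1^{(e)} \\ G_0^{(o)} & G_0^{(e)} \end{bmatrix}
\begin{bmatrix} L \\ H \end{bmatrix}
\]

Lifting factorization:

\[
\begin{bmatrix} L \\ H \end{bmatrix}
=
\left\{ \prod_k
\begin{bmatrix} 1 & U^{(k)} \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ P^{(k)} & 1 \end{bmatrix}
\right\}
\begin{bmatrix} L \\ H \end{bmatrix}
\]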
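The lifting factorization can be sketched for a single predict/update pair as below. This is a minimal illustration, not the slides' implementation. The coefficients P = -1/2 and U = 1/4 (CDF 5/3), the periodic boundary handling, and the names `lift_forward` and `lift_inverse` are all assumptions made for the example. The signal length is assumed to be even.

```python
import numpy as np

def lift_forward(signal, predict=-0.5, update=0.25):
    """One predict/update pair on a 1-D signal (assumed CDF 5/3, periodic)."""
    x = np.asarray(signal, dtype=float)
    low, high = x[0::2].copy(), x[1::2].copy()   # L = even samples, H = odd samples
    # [1 0; P 1]: H <- H + P * (L[n] + L[n+1])
    high += predict * (low + np.roll(low, -1))
    # [1 U; 0 1]: L <- L + U * (H[n-1] + H[n])
    low += update * (np.roll(high, 1) + high)
    return low, high

def lift_inverse(low, high, predict=-0.5, update=0.25):
    """Undo the two steps in reverse order with opposite signs."""
    low = low - update * (np.roll(high, 1) + high)
    high = high - predict * (low + np.roll(low, -1))
    out = np.empty(low.size + high.size)
    out[0::2], out[1::2] = low, high
    return out
```

Each step only adds a filtered copy of one channel to the other, so it is inverted by subtracting the same term. That is why the factorization reconstructs the input exactly regardless of the coefficients chosen.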
2-D DWT
\[
x = \begin{bmatrix} LL & HL & LH & HH \end{bmatrix}^{T}
\qquad
y = \begin{bmatrix} LL & HL & LH & HH \end{bmatrix}^{T}
\]
Existing Approach: Separable Convolution
\[
y = N^{V} \,\big|\, N^{H} \,\big|\, x
\]
Existing Approach: Separable Lifting
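Steps: $T^{H}_{P}$, $S^{H}_{U}$, $T^{V}_{P}$, $S^{V}_{U}$

\[
y = S^{V}_{U} \,\big|\, S^{H}_{U} \,\big|\, T^{V}_{P} \,\big|\, T^{H}_{P} \,\big|\, x
\]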
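The separable scheme can be sketched as four passes over the polyphase components, applied right to left as in the formula. The sketch rests on several assumptions: CDF 5/3 coefficients with one predict/update pair, periodic boundaries, and a component layout where LL holds even rows and even columns, HL even rows and odd columns, LH odd rows and even columns, and HH odd rows and odd columns. The function names are hypothetical.

```python
import numpy as np

P, U = -0.5, 0.25  # assumed CDF 5/3 lifting coefficients

def split(image):
    """Split an even-sized image into the four polyphase components."""
    a = np.asarray(image, dtype=float)
    return (a[0::2, 0::2].copy(), a[0::2, 1::2].copy(),
            a[1::2, 0::2].copy(), a[1::2, 1::2].copy())

def predict(low, high, axis):
    # high + P * (low[n] + low[n+1]) along the given axis, periodic
    return high + P * (low + np.roll(low, -1, axis=axis))

def update(low, high, axis):
    # low + U * (high[n-1] + high[n]) along the given axis, periodic
    return low + U * (np.roll(high, 1, axis=axis) + high)

def dwt2_separable(image):
    LL, HL, LH, HH = split(image)
    HL, HH = predict(LL, HL, 1), predict(LH, HH, 1)   # T^H_P
    LH, HH = predict(LL, LH, 0), predict(HL, HH, 0)   # T^V_P
    LL, LH = update(LL, HL, 1), update(LH, HH, 1)     # S^H_U
    LL, HL = update(LL, LH, 0), update(HL, HH, 0)     # S^V_U
    return LL, HL, LH, HH
```

Each pass reads results written by the previous one. If the vertical bars in the formula are read as synchronization points (an assumption; the slide does not define them), this one-level transform needs four of them.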
Our Approach: Non-Separable Lifting
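Steps: $T^{H}_{P_0}$, $T^{V}_{P_0}$, $S^{H}_{U_0}$, $S^{V}_{U_0}$, $T_{P_1}$, $S_{U_1}$

\[
y = S^{V}_{U_0} S^{H}_{U_0} S_{U_1} \,\big|\, T^{V}_{P_0} T^{H}_{P_0} T_{P_1} \,\big|\, x
\]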
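The idea of grouping horizontal and vertical steps can be illustrated on the predict stage alone. The sketch below checks numerically that a horizontal predict followed by a vertical predict equals a single fused step that reads only the original components. It assumes a CDF 5/3 predict coefficient and periodic boundaries, and it does not reproduce the split into the $P_0$ and $P_1$ parts shown on the slide. All function names are hypothetical.

```python
import numpy as np

P = -0.5  # assumed CDF 5/3 predict coefficient

def ph(a):
    """Horizontal predict filter: P * (a[n] + a[n+1]) along columns, periodic."""
    return P * (a + np.roll(a, -1, axis=1))

def pv(a):
    """Vertical predict filter: P * (a[n] + a[n+1]) along rows, periodic."""
    return P * (a + np.roll(a, -1, axis=0))

def predict_two_steps(LL, HL, LH, HH):
    """Separable form: horizontal predict, then vertical predict on its result."""
    HL1, HH1 = HL + ph(LL), HH + ph(LH)       # horizontal step
    LH2, HH2 = LH + pv(LL), HH1 + pv(HL1)     # vertical step reads HL1 and HH1
    return LL, HL1, LH2, HH2

def predict_fused(LL, HL, LH, HH):
    """Non-separable form: one step that reads only the original inputs."""
    return (LL,
            HL + ph(LL),
            LH + pv(LL),
            HH + ph(LH) + pv(HL) + pv(ph(LL)))  # cross term is a genuinely 2-D filter
```

The fused step removes the dependency between the two passes at the cost of the extra 2-D cross term in the HH component.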
Architectures
                       pixel shader   OpenCL/CUDA   CPU
input/output           off chip       off chip      off chip
intermediate results   off chip       on chip       on chip
  - in registers       no             yes           no
on chip memory         no             32–96 KiB     3–35 MiB
concurrent threads     thousands      thousands     2–112
4-tuples / thread      1              1–4           thousands
view                   global         block-based   block-based
block size             global         64²           512²
Results
CDF 9/7 Wavelet
[Figure: throughput in GB/s versus image size, from 100 kpel to 100 Mpel. Panels: OpenCL (AMD 6970); Pixel Shader (NVIDIA Titan X). Curves: separable lifting, separable polyconvolution, separable convolution, non-separable lifting, non-separable polyconvolution, non-separable convolution.]
Future Work
CPU
blade055: 2 sockets × 28 cores = 56 CPUs, Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz, cache: 32k/256k/35M
vidte: 2 sockets × 12 cores = 24 CPUs, Intel(R) Xeon(R) CPU X5680 @ 3.33GHz, cache: 32k/256k/12M
UV2000: 14 sockets × 8 cores = 112 cores, Intel(R) Xeon(R) CPU E5-4627 v2 @ 3.30GHz, cache: 32k/256k/16M