Fast Compressive Sensing MRI Reconstruction using multi-GPU system
Tran Minh Quan ([email protected]), Jeong ([email protected])
High-Performance Visual Computing Laboratory, School of Electrical and Computer Engineering,
Ulsan National Institute of Science and Technology (UNIST)
UNIST-gil 50 (100 Banyeon-ri), Eonyang-eup, Ulju-gun, Ulsan Metropolitan City, Republic of Korea, 689-798
Talk Overview
• Introduction
– 2D Dynamic Compressive Sensing MRI (CS MRI)
• Split Bregman (SB) Method
– Total Variation
• Fast 2D Discrete Wavelet Transform (DWT) with mixed-band
• Results on a single GPU system
• Results on a multi GPU system
• Q&A
2D Dynamic MRI (2.5D MRI)

Cardiac MRI: http://www.youtube.com/watch?v=G4dFVeP9Vdo
Perfusion MRI
(Figure: CS MRI vs. zero-filling reconstruction via IFFT2)
MRI Reconstruction

Traditional MRI: f = Ku, where K = RF (R: sampling mask, F: 2D Fourier transform).
Acquisition is VERY SLOW; reconstruction by IFFT2 is VERY FAST.
Motivation
• Why do we use sparse sampling?
– It greatly reduces the scanning time (~16x): from ~20-40 minutes down to ~1-3 minutes
• Why do we use GPUs?
– To speed up the reconstruction time
CSMRI Problem

• Lustig et al.
– J(u) = ||W_xy u||_1
• Goldstein et al.
– J(u) = ||∇_xyz u||_1
• Our method
– J(u) = ||W_xy u||_1 + ||∇_xy u||_1 + ||∇_z u||_1

=> an ℓ1 minimization problem:
min_u J(u)  s.t.  Σ_i ||K u_i − f_i||^2 < μ
(x, y: spatial axes; z: temporal axis)
Proposed SB CSMRI Algorithm

J(u) = ||∇_xy u||_1 + ||∇_z u||_1 + ||W_xy u||_1

• Initialize
– u^0 = RF^{-1}f and d_x^0 = d_y^0 = d_z^0 = w^0 = 0
• While ||u^k − u^{k-1}||_2 > tol
– Sub-optimization problem:
u^k = min_u (μ/2)||Ku − f||^2 + (λ/2)||d^k − ∇u − b^k||^2 + (γ/2)||w^k − Wu − b_w^k||^2
– Update Bregman distances (smoothing/thresholding):
d_x^{k+1} = max(s_xy^k − 1/λ_x, 0) · (∇_x u^k + b_x^k) / s_xy^k
d_y^{k+1} = max(s_xy^k − 1/λ_y, 0) · (∇_y u^k + b_y^k) / s_xy^k
d_z^{k+1} = max(s_z^k − 1/λ_z, 0) · (∇_z u^k + b_z^k) / s_z^k
w^{k+1} = shrink(W u^{k+1} + b_w^k, 1/γ)
– Update Bregman variables:
b_x^{k+1} = b_x^k + ∇_x u^{k+1} − d_x^{k+1}
b_y^{k+1} = b_y^k + ∇_y u^{k+1} − d_y^{k+1}
b_z^{k+1} = b_z^k + ∇_z u^{k+1} − d_z^{k+1}
b_w^{k+1} = b_w^k + W u^{k+1} − w^{k+1}
• End

Ref: Goldstein and Osher, "The Split Bregman Method for L1-Regularized Problems," SIAM J. Imaging Sciences, 2009
Building Blocks
• Iterative solver
– Gradient, Laplacian operators
– Using Finite Difference Method
• Discrete Fourier transform
– CUFFT
• Discrete Wavelet transform
– Fast GPU mixed-band algorithm
2D Wavelet Transform: Traditional Approach
(Figure: separable convolution of the image with the filter bank)
2D Wavelet Transform: Traditional vs. Mixed-band
2D Wavelet Transform with Mixed-band (1)

Haar 2x2 → Haar 4x4 → Haar 8x8

M = [ a  b ; c  d ]
W = (1/√2) [ +1 +1 ; +1 −1 ]
G = W · M · W^T
G = (1/2) [ a+b+c+d   a−b+c−d ;  a+b−c−d   a−b−c+d ]
2D Wavelet Transform with Mixed-band (2)

encode_8 kernel / decode_8 kernel
Why do we choose block size 8x8?
Optimize 2D Haar Wavelet (1)

__global__
void __encode_8(float2* src, float2* dst,
                int nRows, int nCols, int iRows, int iCols) {
    // Read an 8x8 block from global memory to shared memory
    ...
    __syncthreads();
    float2 a, b, c, d; // Registers: each thread holds its own copies
    // First Haar 2x2 pass: every thread reads (shared-memory broadcast)
    // the four values of the 2x2 block it belongs to
    a = sMem[(tid.y & (~1)) + 0][(tid.x & (~1)) + 0];
    b = sMem[(tid.y & (~1)) + 0][(tid.x & (~1)) + 1];
    c = sMem[(tid.y & (~1)) + 1][(tid.x & (~1)) + 0];
    d = sMem[(tid.y & (~1)) + 1][(tid.x & (~1)) + 1];
    __syncthreads();
    // Divergent branches: each thread writes its own output
    // (float2 arithmetic assumes helper operator overloads)
    if     (((tid.y & 1) == 0) && ((tid.x & 1) == 0))
        sMem[tid.y][tid.x] = 0.5f * (a + b + c + d);
    else if(((tid.y & 1) == 0) && ((tid.x & 1) == 1))
        sMem[tid.y][tid.x] = 0.5f * (a - b + c - d);
    else if(((tid.y & 1) == 1) && ((tid.x & 1) == 0))
        sMem[tid.y][tid.x] = 0.5f * (a + b - c - d);
    else // ((tid.y & 1) == 1) && ((tid.x & 1) == 1)
        sMem[tid.y][tid.x] = 0.5f * (a - b - c + d);
    __syncthreads();
    // Second Haar 2x2 pass
    ...
    // Third Haar 2x2 pass
    ...
}

Haar 2x2 (broadcast reads, divergent writes):
G = (1/2) [ a+b+c+d   a−b+c−d ;  a+b−c−d   a−b−c+d ]
Optimize 2D Haar Wavelet (2)

__global__
void __encode_8(float2* src, float2* dst,
                int nRows, int nCols, int iRows, int iCols) {
    // Read an 8x8 block from global memory to shared memory
    ...
    __syncthreads();
    float2 a, b, c, d; // Registers: each thread holds its own copies
    // First Haar 2x2 pass: no divergent branches; the per-thread sign
    // of each term is selected with switchSign instead
    a = sMem[(tid.y & (~1)) + 0][(tid.x & (~1)) + 0];
    b = sMem[(tid.y & (~1)) + 0][(tid.x & (~1)) + 1];
    c = sMem[(tid.y & (~1)) + 1][(tid.x & (~1)) + 0];
    d = sMem[(tid.y & (~1)) + 1][(tid.x & (~1)) + 1];
    switchSign((((tid.y >> 0) & 1) & 0) ^ (((tid.x >> 0) & 1) & 0), &a);
    switchSign((((tid.y >> 0) & 1) & 0) ^ (((tid.x >> 0) & 1) & 1), &b);
    switchSign((((tid.y >> 0) & 1) & 1) ^ (((tid.x >> 0) & 1) & 0), &c);
    switchSign((((tid.y >> 0) & 1) & 1) ^ (((tid.x >> 0) & 1) & 1), &d);
    sMem[tid.y][tid.x] = 0.5f * (a + b + c + d);
    __syncthreads();
    // Second Haar 2x2 pass
    ...
    // Third Haar 2x2 pass
    ...
}

__device__ void switchSign(unsigned int intSign, float2* number) {
    *(number) *= intSign ? (-1.0f) : (1.0f);
}
H_2 = (1/2) [ +1 +1 +1 +1 ; +1 −1 +1 −1 ; +1 +1 −1 −1 ; +1 −1 −1 +1 ]

Recursive representation:
H_d = (1/√2) [ +H_{d−1}  +H_{d−1} ; +H_{d−1}  −H_{d−1} ]

Synthetic representation:
H^d_{i,j} = (1/√(2^d)) · (−1)^(i ⋆ j), where ⋆ is the bitwise dot product
Example: (−1)^(3 ⋆ 2) = (−1)^((1,1)·(1,0)) = (−1)^(1+0) = −1
Optimize 2D Haar Wavelet (3)

Multiply with -1 or +1:

__device__ void switchSign(unsigned int intSign, float2* number) {
    *(number) *= intSign ? (-1.0f) : (1.0f);
}

Casting the sign bit:

__device__ void switchSign(unsigned int intSign, float2* number) {
    *((long long int*)number) ^= intSign ? 0x8000000080000000 : 0x0000000000000000;
}

A float2 holds a complex number y = m + i*n as two 32-bit floats; bit 31 of float2.x and bit 31 of float2.y are the IEEE-754 sign bits, so XOR-ing the 64-bit word with 0x8000000080000000 flips both signs at once.

Image size 512 x 512 (unit: milliseconds):

Operation           | Shared Memory           | Multiply with -1 or +1   | Casting sign bit
                    | 1 lvl   3 lvls  9 lvls  | 1 lvl   3 lvls  9 lvls   | 1 lvl  3 lvls  9 lvls
2D Forward Wavelet  | 0.0868  0.1305  0.1530  | 0.0595  0.0887  0.10112  | N/A    N/A     0.0786
2D Inverse Wavelet  | 0.0868  0.1304  0.1548  | 0.0591  0.0884  0.1003   | N/A    N/A     0.0791
Comparison on Lenna (512x512): Full Decomposition, 9 levels

Filter scheme (GPU): 1.602 milliseconds
Lifting scheme (GPU): 2.140 milliseconds
Mixed-band (GPU): 0.079 milliseconds (~20x faster)
Putting Everything Together on 1 GPU

min_u J(u)  s.t.  Σ_i ||K u_i − f_i||^2 < μ
J(u) = ||∇_xy u||_1 + ||∇_z u||_1 + ||W_xy u||_1
Results of 2D Dynamic MRI: Flank Tumor Dataset (256 slices)

Sampling rates: 1/1, 1/4, 1/8, 1/10, 1/12, 1/16
Image size: 128x128
Performance of the CSMRI reconstruction (in milliseconds), image size 128x128:

Operations                                     | 32 slices | 128 slices | 256 slices
Inverse Differentiation                        | 100.282   | 201.595    | 343.003
Compute Right Hand Side                        | 36.680    | 108.137    | 205.442
Sub Optimization Problem (Modified Richardson) | 358.850   | 1155.242   | 2215.770
Forward Differentiation                        | 36.949    | 110.496    | 201.461
Shrinkage                                      | 44.487    | 134.755    | 247.534
Update Bregman Parameter                       | 20.680    | 62.328     | 123.949
Update Kspace                                  | 3.431     | 11.162     | 28.699
Total                                          | 601.360   | 1783.715   | 3365.858
MultiGPU System Information

[quantm@node001 ~]$ lspci -tv
-+- Advanced Micro Devices [AMD] nee ATI RD890 Northbridge (dual slot (2x8) PCI-e GFX Hydra part)
 |  +-+- 00.0 NVIDIA Corporation Tesla M2090
 |  |  \- 00.0 NVIDIA Corporation Tesla M2090
 |  \-+- 00.0 NVIDIA Corporation Tesla M2090
 |     \- 00.0 NVIDIA Corporation Tesla M2090
 \- Advanced Micro Devices [AMD] nee ATI RD890 Northbridge (dual slot (2x8) PCI-e GFX Hydra part)
    +-+- 00.0 NVIDIA Corporation Tesla M2090
    |  \- 00.0 NVIDIA Corporation Tesla M2090
    \-+- 00.0 NVIDIA Corporation Tesla M2090
       \- 00.0 NVIDIA Corporation Tesla M2090

[quantm@node002 ~]$ lspci -tv
(same topology: two RD890 northbridges, each hosting four Tesla M2090 GPUs)

The two nodes communicate via MPI.
MultiGPU Implementation (1)

J(u) = ||∇_xy u||_1 + ||∇_z u||_1 + ||W_xy u||_1

Host threads managed with OpenMP; inter-GPU halo exchange with cudaMemcpyPeer (peer-to-peer, ~6.1 GB/s).
Ref: Paulius M., "Implementing 3D Finite Difference code on GPUs," GTC 2009
MultiGPU Implementation (2)

J(u) = ||∇_xy u||_1 + ||∇_z u||_1 + ||W_xy u||_1

Host threads managed with OpenMP; inter-GPU halo exchange with cudaMemcpyPeer (peer-to-peer, ~6.1 GB/s).
Ref: Paulius M., "Implementing 3D Finite Difference code on GPUs," GTC 2009
MultiGPU Implementation (3)

J(u) = ||∇_xy u||_1 + ||∇_z u||_1 + ||W_xy u||_1

Host threads managed with OpenMP; inter-GPU halo exchange with cudaMemcpyPeer (peer-to-peer, ~6.1 GB/s).
Ref: Paulius M., "Implementing 3D Finite Difference code on GPUs," GTC 2009
Performance on multiple GPUs (unit: milliseconds; one column per GPU, "Sync" is the synchronized total):

1, 2, and 4 GPUs:

Operations               | 1 GPU    | 2 GPUs            | 4 GPUs
Inverse Differentiation  | 343.003  | 177.845  177.776  | 90.904  99.422  99.754  90.758
Compute Right Hand Side  | 205.442  | 106.035  105.928  | 54.907  58.670  58.697  54.824
Sub Optimization Problem | 2215.770 | 1138.010 1139.320 | 589.568 639.172 639.672 589.089
Forward Differentiation  | 201.461  | 94.802   94.817   | 48.354  54.063  54.164  48.367
Shrinkage                | 247.534  | 130.654  132.631  | 67.723  72.759  72.795  68.217
Update Bregman Parameter | 123.949  | 63.997   64.034   | 33.238  35.642  35.666  33.193
Update Kspace            | 28.699   | 14.990   14.993   | 8.273   8.492   8.485   8.298
Data Transfer            | 0.000    | 23.055   30.817   | 23.427  67.837  60.455  24.104
Total                    | 3365.858 | 1749.387 1760.316 (Sync: 1765.128) | 916.394 1036.058 1029.687 916.849 (Sync: 1052.241)

8 GPUs:

Operations               | per-GPU times
Inverse Differentiation  | 47.291  55.626  55.994  55.743  55.942  55.820  55.830  47.227
Compute Right Hand Side  | 29.441  33.274  33.294  33.336  33.331  33.291  33.285  29.416
Sub Optimization Problem | 315.112 364.920 365.107 364.852 364.849 364.749 364.855 315.064
Forward Differentiation  | 24.620  30.027  30.151  30.086  30.009  30.074  30.087  24.590
Shrinkage                | 36.980  41.683  41.679  41.700  41.646  41.667  41.619  36.864
Update Bregman Parameter | 18.016  20.377  20.349  20.422  20.312  20.316  20.320  18.006
Update Kspace            | 4.566   4.944   4.946   5.182   5.188   4.943   4.945   4.593
Data Transfer            | 41.246  115.281 131.557 117.197 123.288 141.518 119.851 46.849
Total                    | 517.272 666.130 683.077 668.517 674.566 692.378 670.794 522.610 (Sync: 696.834)
Scalability

(Figure: total CSMRI reconstruction time in milliseconds vs. number of GPUs, with measured speedups of 1x, 1.7x, 2.9x, and 4.3x.)
Conclusion
• Summary
– Split Bregman Formulation for dynamic CSMRI
– 2D DWT on the GPU using mixed-band algorithm
– Multi-GPU implementation using P2P communication
• Acknowledgements
– Thanks to HyungJoon Cho and SoHyun Han for data and discussion.
– Funding from NRF Grant #2012R1A1A1039929
Thank you