Fast Compressive Sensing MRI Reconstruction using multi-GPU system
Tran Minh Quan ([email protected]), Jeong ([email protected])
High-Performance Visual Computing Laboratory, School of Electrical and Computer Engineering,
Ulsan National Institute of Science and Technology (UNIST)
UNIST-gil 50 (100 Banyeon-ri), Eonyang-eup, Ulju-gun, Ulsan Metropolitan City, Republic of Korea, 689-798
Talk Overview
• Introduction
– 2D Dynamic Compressive Sensing MRI (CS MRI)
• Split Bregman (SB) Method
– Total Variation
• Fast 2D Discrete Wavelet Transform (DWT) with mixed-band
• Results on a single GPU system
• Results on a multi GPU system
• Q&A
2D Dynamic MRI (2.5D MRI)

Cardiac MRI: http://www.youtube.com/watch?v=G4dFVeP9Vdo
Perfusion MRI
(Figure: CS MRI vs. zero-filling reconstruction via IFFT2)
MRI Reconstruction

Traditional MRI: f = Ku, where K = RF (R: sampling mask, F: 2D Fourier transform).
Acquisition is VERY SLOW; reconstruction by IFFT2 is VERY FAST.
Motivation
• Why do we use sparse sampling?
– It greatly reduces the scanning time (~16x): from ~20-40 minutes down to ~1-3 minutes
• Why do we use GPUs?
– To speed up the reconstruction time
CSMRI Problem

• Lustig et al.
– J(u) = ||W_xy u||_1
• Goldstein et al.
– J(u) = ||∇_xyz u||_1
• Our method
– J(u) = ||W_xy u||_1 + ||∇_xy u||_1 + ||∇_z u||_1

=> an ℓ1 minimization problem:
min_u J(u)  s.t.  Σ_i ||K u_i − f_i||^2 < μ
(x, y: spatial axes; z: temporal axis)
Proposed SB CSMRI Algorithm

J(u) = ||∇_xy u||_1 + ||∇_z u||_1 + ||W_xy u||_1

• Initialize
– u^0 = RF^{-1}f and d_x^0 = d_y^0 = d_z^0 = w^0 = 0
• While ||u^k − u^{k-1}||_2 > tol
– Sub-optimization problem:
u^k = min_u (μ/2)||Ku − f||^2 + (λ/2)||d^k − ∇u − b^k||^2 + (γ/2)||w^k − Wu − b_w^k||^2
– Update Bregman distances (smoothing/thresholding):
d_x^{k+1} = max(s_xy^k − 1/λ_x, 0) · (∇_x u^k + b_x^k) / s_xy^k
d_y^{k+1} = max(s_xy^k − 1/λ_y, 0) · (∇_y u^k + b_y^k) / s_xy^k
d_z^{k+1} = max(s_z^k − 1/λ_z, 0) · (∇_z u^k + b_z^k) / s_z^k
w^{k+1} = shrink(W u^{k+1} + b_w^k, 1/γ)
– Update Bregman variables:
b_x^{k+1} = b_x^k + ∇_x u^{k+1} − d_x^{k+1}
b_y^{k+1} = b_y^k + ∇_y u^{k+1} − d_y^{k+1}
b_z^{k+1} = b_z^k + ∇_z u^{k+1} − d_z^{k+1}
b_w^{k+1} = b_w^k + W u^{k+1} − w^{k+1}
• End

Ref: Goldstein and Osher, "The Split Bregman Method for L1-Regularized Problems," SIAM J. Imaging Sciences, 2009
Building Blocks
• Iterative solver
– Gradient, Laplacian operators
– Using Finite Difference Method
• Discrete Fourier transform
– CUFFT
• Discrete Wavelet transform
– Fast GPU mixed-band algorithm
2D Wavelet Transform: Traditional Approach
(Figure: separable convolution of the image with the filter bank)
2D Wavelet Transform: Traditional vs. Mixed-band
2D Wavelet Transform with Mixed-band (1)

Haar 2x2 → Haar 4x4 → Haar 8x8

M = [ a  b ; c  d ]
W = (1/√2) [ +1 +1 ; +1 −1 ]
G = W · M · W^T
G = (1/2) [ a+b+c+d   a−b+c−d ;  a+b−c−d   a−b−c+d ]
2D Wavelet Transform with Mixed-band (2)

encode_8 kernel / decode_8 kernel
Why do we choose block size 8x8?
Optimize 2D Haar Wavelet (1)

__global__
void __encode_8(float2* src, float2* dst,
                int nRows, int nCols, int iRows, int iCols) {
    // Read an 8x8 block from global memory to shared memory
    ...
    __syncthreads();
    float2 a, b, c, d; // Registers: each thread holds its own copies
    // First Haar 2x2 pass: every thread reads (shared-memory broadcast)
    // the four values of the 2x2 block it belongs to
    a = sMem[(tid.y & (~1)) + 0][(tid.x & (~1)) + 0];
    b = sMem[(tid.y & (~1)) + 0][(tid.x & (~1)) + 1];
    c = sMem[(tid.y & (~1)) + 1][(tid.x & (~1)) + 0];
    d = sMem[(tid.y & (~1)) + 1][(tid.x & (~1)) + 1];
    __syncthreads();
    // Divergent branches: each thread writes its own output
    // (float2 arithmetic assumes helper operator overloads)
    if     (((tid.y & 1) == 0) && ((tid.x & 1) == 0))
        sMem[tid.y][tid.x] = 0.5f * (a + b + c + d);
    else if(((tid.y & 1) == 0) && ((tid.x & 1) == 1))
        sMem[tid.y][tid.x] = 0.5f * (a - b + c - d);
    else if(((tid.y & 1) == 1) && ((tid.x & 1) == 0))
        sMem[tid.y][tid.x] = 0.5f * (a + b - c - d);
    else // ((tid.y & 1) == 1) && ((tid.x & 1) == 1)
        sMem[tid.y][tid.x] = 0.5f * (a - b - c + d);
    __syncthreads();
    // Second Haar 2x2 pass
    ...
    // Third Haar 2x2 pass
    ...
}

Haar 2x2 (broadcast reads, divergent writes):
G = (1/2) [ a+b+c+d   a−b+c−d ;  a+b−c−d   a−b−c+d ]
Optimize 2D Haar Wavelet (2)

__global__
void __encode_8(float2* src, float2* dst,
                int nRows, int nCols, int iRows, int iCols) {
    // Read an 8x8 block from global memory to shared memory
    ...
    __syncthreads();
    float2 a, b, c, d; // Registers: each thread holds its own copies
    // First Haar 2x2 pass: no divergent branches; the per-thread sign
    // of each term is selected with switchSign instead
    a = sMem[(tid.y & (~1)) + 0][(tid.x & (~1)) + 0];
    b = sMem[(tid.y & (~1)) + 0][(tid.x & (~1)) + 1];
    c = sMem[(tid.y & (~1)) + 1][(tid.x & (~1)) + 0];
    d = sMem[(tid.y & (~1)) + 1][(tid.x & (~1)) + 1];
    switchSign((((tid.y >> 0) & 1) & 0) ^ (((tid.x >> 0) & 1) & 0), &a);
    switchSign((((tid.y >> 0) & 1) & 0) ^ (((tid.x >> 0) & 1) & 1), &b);
    switchSign((((tid.y >> 0) & 1) & 1) ^ (((tid.x >> 0) & 1) & 0), &c);
    switchSign((((tid.y >> 0) & 1) & 1) ^ (((tid.x >> 0) & 1) & 1), &d);
    sMem[tid.y][tid.x] = 0.5f * (a + b + c + d);
    __syncthreads();
    // Second Haar 2x2 pass
    ...
    // Third Haar 2x2 pass
    ...
}

__device__ void switchSign(unsigned int intSign, float2* number) {
    *(number) *= intSign ? (-1.0f) : (1.0f);
}
H_2 = (1/2) [ +1 +1 +1 +1 ; +1 −1 +1 −1 ; +1 +1 −1 −1 ; +1 −1 −1 +1 ]

Recursive representation:
H_d = (1/√2) [ +H_{d−1}  +H_{d−1} ; +H_{d−1}  −H_{d−1} ]

Synthetic representation:
H^d_{i,j} = (1/√(2^d)) · (−1)^(i ⋆ j), where ⋆ is the bitwise dot product
Example: (−1)^(3 ⋆ 2) = (−1)^((1,1)·(1,0)) = (−1)^(1+0) = −1
Optimize 2D Haar Wavelet (3)

Multiply with -1 or +1:

__device__ void switchSign(unsigned int intSign, float2* number) {
    *(number) *= intSign ? (-1.0f) : (1.0f);
}

Casting the sign bit:

__device__ void switchSign(unsigned int intSign, float2* number) {
    *((long long int*)number) ^= intSign ? 0x8000000080000000 : 0x0000000000000000;
}

A float2 holds a complex number y = m + i*n as two 32-bit floats; bit 31 of float2.x and bit 31 of float2.y are the IEEE-754 sign bits, so XOR-ing the 64-bit word with 0x8000000080000000 flips both signs at once.

Image size 512 x 512 (unit: milliseconds):

Operation           | Shared Memory           | Multiply with -1 or +1   | Casting sign bit
                    | 1 lvl   3 lvls  9 lvls  | 1 lvl   3 lvls  9 lvls   | 1 lvl  3 lvls  9 lvls
2D Forward Wavelet  | 0.0868  0.1305  0.1530  | 0.0595  0.0887  0.10112  | N/A    N/A     0.0786
2D Inverse Wavelet  | 0.0868  0.1304  0.1548  | 0.0591  0.0884  0.1003   | N/A    N/A     0.0791
Comparison on Lenna (512x512): Full Decomposition, 9 levels

Filter scheme (GPU): 1.602 milliseconds
Lifting scheme (GPU): 2.140 milliseconds
Mixed-band (GPU): 0.079 milliseconds (~20x faster)
Putting Everything Together on 1 GPU

min_u J(u)  s.t.  Σ_i ||K u_i − f_i||^2 < μ
J(u) = ||∇_xy u||_1 + ||∇_z u||_1 + ||W_xy u||_1
Results of 2D Dynamic MRI: Flank Tumor Dataset (256 slices)

Sampling rates: 1/1, 1/4, 1/8, 1/10, 1/12, 1/16
Image size: 128x128
Performance of the CSMRI reconstruction (in milliseconds), image size 128x128:

Operations                                     | 32 slices | 128 slices | 256 slices
Inverse Differentiation                        | 100.282   | 201.595    | 343.003
Compute Right Hand Side                        | 36.680    | 108.137    | 205.442
Sub Optimization Problem (Modified Richardson) | 358.850   | 1155.242   | 2215.770
Forward Differentiation                        | 36.949    | 110.496    | 201.461
Shrinkage                                      | 44.487    | 134.755    | 247.534
Update Bregman Parameter                       | 20.680    | 62.328     | 123.949
Update Kspace                                  | 3.431     | 11.162     | 28.699
Total                                          | 601.360   | 1783.715   | 3365.858
MultiGPU System Information

[quantm@node001 ~]$ lspci -tv
-+- Advanced Micro Devices [AMD] nee ATI RD890 Northbridge (dual slot (2x8) PCI-e GFX Hydra part)
 |  +-+- 00.0 NVIDIA Corporation Tesla M2090
 |  |  \- 00.0 NVIDIA Corporation Tesla M2090
 |  \-+- 00.0 NVIDIA Corporation Tesla M2090
 |     \- 00.0 NVIDIA Corporation Tesla M2090
 \- Advanced Micro Devices [AMD] nee ATI RD890 Northbridge (dual slot (2x8) PCI-e GFX Hydra part)
    +-+- 00.0 NVIDIA Corporation Tesla M2090
    |  \- 00.0 NVIDIA Corporation Tesla M2090
    \-+- 00.0 NVIDIA Corporation Tesla M2090
       \- 00.0 NVIDIA Corporation Tesla M2090

[quantm@node002 ~]$ lspci -tv
(same topology: two RD890 northbridges, each hosting four Tesla M2090 GPUs)

The two nodes communicate via MPI.
MultiGPU Implementation (1)

J(u) = ||∇_xy u||_1 + ||∇_z u||_1 + ||W_xy u||_1

Host threads managed with OpenMP; inter-GPU halo exchange with cudaMemcpyPeer (peer-to-peer, ~6.1 GB/s).
Ref: Paulius M., "Implementing 3D Finite Difference code on GPUs," GTC 2009
MultiGPU Implementation (2)

J(u) = ||∇_xy u||_1 + ||∇_z u||_1 + ||W_xy u||_1

Host threads managed with OpenMP; inter-GPU halo exchange with cudaMemcpyPeer (peer-to-peer, ~6.1 GB/s).
Ref: Paulius M., "Implementing 3D Finite Difference code on GPUs," GTC 2009
MultiGPU Implementation (3)

J(u) = ||∇_xy u||_1 + ||∇_z u||_1 + ||W_xy u||_1

Host threads managed with OpenMP; inter-GPU halo exchange with cudaMemcpyPeer (peer-to-peer, ~6.1 GB/s).
Ref: Paulius M., "Implementing 3D Finite Difference code on GPUs," GTC 2009
Performance on multiple GPUs (unit: milliseconds; one column per GPU, "Sync" is the synchronized total):

1, 2, and 4 GPUs:

Operations               | 1 GPU    | 2 GPUs            | 4 GPUs
Inverse Differentiation  | 343.003  | 177.845  177.776  | 90.904  99.422  99.754  90.758
Compute Right Hand Side  | 205.442  | 106.035  105.928  | 54.907  58.670  58.697  54.824
Sub Optimization Problem | 2215.770 | 1138.010 1139.320 | 589.568 639.172 639.672 589.089
Forward Differentiation  | 201.461  | 94.802   94.817   | 48.354  54.063  54.164  48.367
Shrinkage                | 247.534  | 130.654  132.631  | 67.723  72.759  72.795  68.217
Update Bregman Parameter | 123.949  | 63.997   64.034   | 33.238  35.642  35.666  33.193
Update Kspace            | 28.699   | 14.990   14.993   | 8.273   8.492   8.485   8.298
Data Transfer            | 0.000    | 23.055   30.817   | 23.427  67.837  60.455  24.104
Total                    | 3365.858 | 1749.387 1760.316 (Sync: 1765.128) | 916.394 1036.058 1029.687 916.849 (Sync: 1052.241)

8 GPUs:

Operations               | per-GPU times
Inverse Differentiation  | 47.291  55.626  55.994  55.743  55.942  55.820  55.830  47.227
Compute Right Hand Side  | 29.441  33.274  33.294  33.336  33.331  33.291  33.285  29.416
Sub Optimization Problem | 315.112 364.920 365.107 364.852 364.849 364.749 364.855 315.064
Forward Differentiation  | 24.620  30.027  30.151  30.086  30.009  30.074  30.087  24.590
Shrinkage                | 36.980  41.683  41.679  41.700  41.646  41.667  41.619  36.864
Update Bregman Parameter | 18.016  20.377  20.349  20.422  20.312  20.316  20.320  18.006
Update Kspace            | 4.566   4.944   4.946   5.182   5.188   4.943   4.945   4.593
Data Transfer            | 41.246  115.281 131.557 117.197 123.288 141.518 119.851 46.849
Total                    | 517.272 666.130 683.077 668.517 674.566 692.378 670.794 522.610 (Sync: 696.834)
Scalability

(Figure: total CSMRI reconstruction time in milliseconds vs. number of GPUs, with measured speedups of 1x, 1.7x, 2.9x, and 4.3x.)
Conclusion
• Summary
– Split Bregman Formulation for dynamic CSMRI
– 2D DWT on the GPU using mixed-band algorithm
– Multi-GPU implementation using P2P communication
• Acknowledgements
– Thanks to HyungJoon Cho and SoHyun Han for data and discussion.
– Funding from NRF Grant #2012R1A1A1039929
Thank you