Transcript

Accelerating Proton Computed Tomography with GPUs

Thomas D. Uram, Argonne Leadership Computing Facility
Michael E. Papka, Argonne Leadership Computing Facility, Northern Illinois University
Nicholas T. Karonis, Northern Illinois University, Argonne National Laboratory

Argonne Leadership Computing Facility - Thomas D. Uram ([email protected])

Overview
‣ Proton computed tomography (pCT) is an alternative to X-ray based CAT scans, which promises several medical benefits at the cost of being significantly more computationally expensive
‣ We designed a 60-node GPU cluster to meet the computational challenge

‣ Computed tomography
‣ Benefits of proton computed tomography
‣ Computational problem description
‣ CPU/GPU performance comparison


What is Computed Tomography?
‣ CAT (or CT) scans are well-known
‣ CAT == "computerized axial tomography"
‣ CAT scans are used to reconstruct the density distribution within a volume, typically in medical imaging
‣ CAT scans are conducted with photons (X-rays)

‣ What is Proton Computed Tomography?
  • A reconstruction technique similar to X-ray computed tomography, conducted with protons instead of photons


Why Proton Computed Tomography?
‣ 13 million people are diagnosed with cancer each year worldwide
‣ 2.6 million of them are candidates for proton therapy treatment
‣ Proton therapy involves depositing protons at precise locations within a tumor site where they irradiate the target tissue
‣ The protons emit lower radiation as they travel through the body until they reach the target, where they emit a burst of radiation (the Bragg peak)
  • Healthy tissue beyond the tumor site receives nominally no radiation
‣ It is crucially important to precisely identify the tumor site
  • To ensure that cancerous tissue is destroyed
  • To avoid damaging healthy tissue surrounding the tumor, especially in sensitive areas
‣ Proton therapy treatment planning is currently performed using X-ray imaging
  • Photons and protons interact with intermediate material differently
  • Conversion between photon/proton modalities involves a systematic range error of 3-5%

Image source: Wikipedia

Proton computed tomography
‣ Our goal is to reconstruct the volume of an adult human head in under 10 minutes
‣ Protons are directed through two frontal planes, the target volume, two backing planes, and finally a calorimeter
‣ The system measures the position and angle of incidence of each proton at entry and exit, and the energy loss

Figure labels:
Final System (in black): 4 tracking planes with XY Si detectors; calorimeter with 64 end-on CsI crystals
Planned Scaled Prototype (in red): 4 planes of XY Si detectors (2 X-SSDs and 2 Y-SSDs per plane); 8 CsI crystal bars
Calorimeter: Each bar corresponds to a 5cm x 5cm CsI crystal, read out by a photodiode
Tracking Plane: Each large square corresponds to one double-sided or two single-sided 9cm x 9cm SSDs

Problem Description
‣ Proton source, detector planes, and calorimeter mounted on a rotating gantry, as in familiar X-ray CT configurations
‣ Data collected over a full rotation of the gantry, 180 samples (every 2 degrees)
‣ Initial detector designed to image a human head (nominally a 25cm cube)
‣ From the physics domain, and so that each voxel is sufficiently represented in the resulting system matrix, we estimate that the reconstruction requires a volume of 256x256x36 (2,359,296 ≈ 2.4M) voxels and 2 billion protons in total
‣ For each proton, we track 11 values:
  • [x,y,z] at entry
  • [x,y,z] at exit
  • angle at entry and exit
  • input and output energy
  • gantry rotation angle
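The problem-size figures on this slide can be checked with a few lines of Python. This is only a sketch: the `ProtonRecord` class and its field names are hypothetical (the slide lists the 11 values but not how they are stored), the two angles at entry and at exit are an assumed reading of "angle at entry and exit" that makes the count come to 11, and the 4-byte-per-value storage size is an assumption used purely for a rough input-size estimate.

```python
from dataclasses import dataclass, fields

# Reconstruction grid quoted on the slide: 256 x 256 x 36 voxels.
GRID = (256, 256, 36)
voxel_count = GRID[0] * GRID[1] * GRID[2]

# One projection every 2 degrees over a full gantry rotation.
projections = 360 // 2


@dataclass
class ProtonRecord:
    """Hypothetical layout of the 11 tracked values for one proton."""
    entry_x: float
    entry_y: float
    entry_z: float
    exit_x: float
    exit_y: float
    exit_z: float
    entry_angle_a: float   # assumed: two angles describe the entry direction
    entry_angle_b: float
    exit_angle_a: float    # assumed: two angles describe the exit direction
    exit_angle_b: float
    gantry_angle: float


@dataclass
class ProtonEnergy:
    """Hypothetical companion record; kept separate only for readability."""
    energy_in: float
    energy_out: float


# The slide counts 11 values per proton; one grouping that reaches 11 is
# 3 (entry position) + 3 (exit position) + 2 (angles) + 2 (energies) + 1 (gantry).
values_per_proton = 3 + 3 + 2 + 2 + 1

# Assumption: each value stored as a 4-byte float.
total_protons = 2_000_000_000
raw_input_bytes = total_protons * values_per_proton * 4

print(voxel_count, projections, values_per_proton, raw_input_bytes / 1e9, "GB")
```

Under the 4-byte assumption the raw proton list alone is on the order of tens of gigabytes, before any per-proton voxel paths are computed.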


Baseline execution times
‣ Began with serial code that took more than 7 hours to process 131M protons
‣ Parallelized with MPI to use multiple CPUs
‣ Established baseline execution times

Phase                      Execution time (seconds)
Setup                      128.2
Most Likely Path (MLP)     1278.5
Linear solver (CARP)       664.9
Overall execution time     2072.0

1 billion protons, 60 nodes, CPU only


MLP (Most Likely Path)
‣ In contrast with X-ray computed tomography, in which the particles traverse the volume in straight lines, in pCT the protons are scattered by the material as they travel through the volume
‣ MLP computes the path integral of the protons through the material based on their known entry and exit locations and angles and the energy loss
‣ The proton paths are discretized as the voxels touched while traversing the volume
‣ Path integral calculations are independent and parallelize at the level of protons (but are inherently sequential within each path)
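To make the discretization step concrete, here is a small 2D Python sketch. It does not reproduce the MLP formula, which the slides do not give; as a stand-in it bends a cubic Hermite curve to match the entry and exit positions and directions, then records the voxels that sampled points fall into. The function names, the sampling density, and the example coordinates are all illustrative assumptions.

```python
import math


def hermite_path(entry, exit_, entry_angle, exit_angle, samples=1000):
    """Stand-in curved path (NOT the MLP formula): a 2D cubic Hermite curve
    that matches the entry/exit positions and direction angles."""
    (x0, y0), (x1, y1) = entry, exit_
    chord = math.hypot(x1 - x0, y1 - y0)
    # Tangent vectors, scaled by the chord length.
    t0 = (chord * math.cos(entry_angle), chord * math.sin(entry_angle))
    t1 = (chord * math.cos(exit_angle), chord * math.sin(exit_angle))
    points = []
    for i in range(samples + 1):
        s = i / samples
        h00 = 2 * s**3 - 3 * s**2 + 1
        h10 = s**3 - 2 * s**2 + s
        h01 = -2 * s**3 + 3 * s**2
        h11 = s**3 - s**2
        points.append((h00 * x0 + h10 * t0[0] + h01 * x1 + h11 * t1[0],
                       h00 * y0 + h10 * t0[1] + h01 * y1 + h11 * t1[1]))
    return points


def touched_voxels(points, voxel_size, grid):
    """Ordered, de-duplicated (ix, iy) indices of voxels hit by the samples.
    Point sampling is approximate; it can miss voxels clipped at a corner."""
    seen, ordered = set(), []
    for x, y in points:
        ix, iy = int(x // voxel_size), int(y // voxel_size)
        inside = 0 <= ix < grid[0] and 0 <= iy < grid[1]
        if inside and (ix, iy) not in seen:
            seen.add((ix, iy))
            ordered.append((ix, iy))
    return ordered


# Example: a 25 cm wide slice on a 256 x 256 grid (values chosen for illustration).
voxel_size = 25.0 / 256
path = hermite_path((0.0, 12.5), (25.0, 13.0), 0.02, -0.01)
voxels = touched_voxels(path, voxel_size, (256, 256))
print(len(voxels), voxels[:3])
```

Each proton's voxel list depends only on that proton's measurements, which is why the work parallelizes per proton, while the walk along one path proceeds step by step.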


Linear solver (CARP)
‣ The result of MLP is a system of equations relating each proton's touched voxels to the relative stopping power (roughly, the energy loss)
‣ We began the project with a CPU implementation of the row-action based sparse iterative solver CARP (component averaged row projections)
‣ CARP decomposes the matrix into row blocks, one block per processor, and iterates to satisfactory convergence:
  • Performs a Jacobi-like iteration sequentially through the rows to produce a per-block solution vector
  • Averages the per-block solution vectors (in component-wise fashion)
  • Redistributes the solution vector x to all processors
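The three CARP steps can be sketched in plain Python on a toy dense system. This is an illustration under assumptions, not the project's code: the per-block pass is written here as Kaczmarz-style row projections (the way CARP is commonly described; the slide calls it "Jacobi-like"), the blocks are processed in a loop rather than on separate processors, and the function names and the tiny test system are made up for the example.

```python
def row_projection_sweep(rows, rhs, x, relax=1.0):
    """One sequential pass over a block's rows, starting from the shared x."""
    y = list(x)
    for a, b in zip(rows, rhs):
        norm_sq = sum(v * v for v in a)
        if norm_sq == 0.0:
            continue
        residual = b - sum(ai * yi for ai, yi in zip(a, y))
        step = relax * residual / norm_sq
        for j, aj in enumerate(a):
            if aj != 0.0:
                y[j] += step * aj
    return y


def carp(blocks, n, iterations=500):
    """blocks: list of (rows, rhs) pairs, one per simulated processor."""
    # touches[k][j] is True when block k has a nonzero in column j.
    touches = [[any(r[j] != 0.0 for r in rows) for j in range(n)]
               for rows, _ in blocks]
    x = [0.0] * n
    for _ in range(iterations):
        # Step 1: each block sweeps its own rows from the current x.
        partials = [row_projection_sweep(rows, rhs, x) for rows, rhs in blocks]
        # Step 2: average each component over the blocks that touch it.
        new_x = list(x)
        for j in range(n):
            owners = [p[j] for p, t in zip(partials, touches) if t[j]]
            if owners:
                new_x[j] = sum(owners) / len(owners)
        # Step 3: the averaged vector becomes every block's next starting point.
        x = new_x
    return x


# Toy consistent system with solution [1, 2, 3], split into two row blocks.
block_a = ([[2.0, 1.0, 0.0], [0.0, 1.0, 1.0]], [4.0, 5.0])
block_b = ([[1.0, 0.0, 1.0], [1.0, 1.0, 1.0]], [4.0, 6.0])
solution = carp([block_a, block_b], n=3)
print([round(v, 4) for v in solution])
```

The sequential dependence inside `row_projection_sweep` is the part that limits fine-grained parallelism: each row update reads the result of the previous one.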


Hardware: Gaea GPU cluster at Northern Illinois University
‣ 60 compute nodes
‣ Node configuration
  • 2x Intel X5650 CPUs (12 cores per node)
  • 2x NVIDIA M2070 GPUs
  • 72GB RAM
  • QDR Infiniband


Data decomposition
‣ 2.1B protons / 60 nodes ≈ 35M protons per node
‣ 2 GPUs per node -> 17M protons per GPU
‣ The maximum voxels per proton is ~364
‣ 17M protons x 364 voxels x 4 bytes/voxel = 25GB data per GPU
  • Larger than available M2070 GPU memory of 6GB
‣ High watermark memory requirement on cluster is 3TB (aggregate)
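The sizing arithmetic on this slide can be reproduced directly. The sketch below uses only the slide's numbers; the minimum batch count is an added back-of-the-envelope figure that assumes decimal gigabytes and that the whole 6GB is available for voxel data, which a real run would not have.

```python
import math

total_protons = 2.1e9
nodes = 60
gpus_per_node = 2
max_voxels_per_proton = 364
bytes_per_voxel = 4
gpu_memory_bytes = 6e9  # M2070, decimal GB assumed

protons_per_node = total_protons / nodes             # about 35M
protons_per_gpu = protons_per_node / gpus_per_node   # 17.5M (slide rounds to 17M)
bytes_per_proton = max_voxels_per_proton * bytes_per_voxel
bytes_per_gpu = protons_per_gpu * bytes_per_proton
aggregate_bytes = bytes_per_gpu * nodes * gpus_per_node

# Lower bound on batches per GPU if every byte of device memory held voxel data.
min_batches = math.ceil(bytes_per_gpu / gpu_memory_bytes)

print(f"{protons_per_node / 1e6:.1f}M protons per node")
print(f"{bytes_per_gpu / 1e9:.1f} GB per GPU")
print(f"{aggregate_bytes / 1e12:.2f} TB aggregate")
print(f"at least {min_batches} batches per GPU")
```

Because the padded worst case exceeds device memory several times over, the per-GPU data has to be staged in batches rather than copied once.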


MLP (Most Likely Path) CUDA implementation
‣ MLP involves calculating the path integral of the protons
‣ Initial implementation assigns a thread per proton
‣ Per-GPU proton data is larger than GPU memory on M2070
‣ Stage batches of protons to GPU
‣ MLP was ported to the GPU, with multiple variants
  • gpu struct: Direct port of CPU-based code using structured proton/voxel data
  • gpu flat memory: Flat memory space with per-proton padded voxel arrays
  • gpu flat memory + overlap: Streaming computation to overlap compute and host-device transfers
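The "flat memory" idea and the batching step can be illustrated on the host side in Python. This is a sketch of the data layout only, not the CUDA code: the `PAD` sentinel, the function names, and the tiny example paths are hypothetical, and a real run would use a stride near the slide's maximum of about 364 voxels per proton.

```python
from array import array

PAD = -1  # hypothetical sentinel marking unused slots in a padded row


def flatten_padded(paths, stride):
    """Pack variable-length voxel lists into one flat integer array.
    Proton p owns slots [p * stride, (p + 1) * stride)."""
    flat = array("i", [PAD]) * (len(paths) * stride)
    for p, voxels in enumerate(paths):
        if len(voxels) > stride:
            raise ValueError("path longer than stride")
        base = p * stride
        for k, v in enumerate(voxels):
            flat[base + k] = v
    return flat


def batches(n_protons, batch_size):
    """Yield [start, end) proton ranges small enough to stage one at a time."""
    for start in range(0, n_protons, batch_size):
        yield start, min(start + batch_size, n_protons)


# Three protons with different path lengths, padded to a stride of 4.
paths = [[5, 6, 7], [9], []]
flat = flatten_padded(paths, stride=4)
print(list(flat))

# Voxel k of proton p is found by arithmetic alone, with no pointer chasing.
p, k = 1, 0
print(flat[p * 4 + k])

print(list(batches(10, 4)))
```

A single contiguous buffer with a fixed stride can be copied to the device in one transfer per batch and indexed by thread id, at the cost of the padding; splitting the work into batches is also what makes it possible to copy one batch while another is being computed.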


MLP (Most Likely Path) CUDA implementation (26M protons, 2 GPUs)

Implementation               Execution time (seconds)   Speedup
cpu                          598.7                      -
gpu_struct                   77.6                       7.7x
gpu_flat_memory              55.5                       10.8x
gpu_flat_memory + overlap    53.0                       11.3x


Linear solver (CARP) CUDA implementation (26M protons, 2 GPUs)
‣ CARP ported directly from CPU code
‣ Per-node row-block data larger than GPU memory; batch process
‣ Further subdivide per-node row-block into row-blocks per streaming multiprocessor

Implementation    Execution time (seconds)   Speedup
cpu               161.0                      -
gpu               139.3                      1.16x

‣ Limited speedup in GPU implementation, because:
  • row-action based solver constrains parallel granularity
  • scattered memory accesses constrain performance, as is typical of sparse matrix operations


Performance at scale

Phase                      Execution time (seconds)
Setup                      22.3
Most Likely Path (MLP)     151.0
Linear solver (CARP)       265.5
Overall execution time     438.8

Initial goal was to complete in <600s (10 mins)

2 billion protons, 60 nodes, 12 CPU cores/node, 2 GPUs/node


Further work: CARP Hybrid CPU/GPU
‣ Assign row blocks to CPU and GPU simultaneously
‣ Weighted work distribution based on initial performance measurements

Implementation    Execution time (seconds)   Speedup
cpu               161.0                      -
gpu               139.3                      1.16x
hybrid            102.3                      1.57x

2 billion protons, 60 nodes, 12 cores/node, 2 GPUs/node
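One simple way to weight the work split is in proportion to measured throughput. The sketch below assumes that rule; the slides say only that the distribution was weighted by initial measurements, so the exact scheme, the function names, and the 1000-row example are illustrative. The timings are the cpu and gpu rows of the table above.

```python
def split_rows(total_rows, cpu_seconds, gpu_seconds):
    """Split rows in proportion to throughput, taken as 1 / measured time."""
    cpu_rate = 1.0 / cpu_seconds
    gpu_rate = 1.0 / gpu_seconds
    gpu_rows = round(total_rows * gpu_rate / (cpu_rate + gpu_rate))
    return total_rows - gpu_rows, gpu_rows


def ideal_hybrid_seconds(cpu_seconds, gpu_seconds):
    """Time if both devices ran concurrently with perfect balance, no overhead."""
    return 1.0 / (1.0 / cpu_seconds + 1.0 / gpu_seconds)


cpu_rows, gpu_rows = split_rows(1000, cpu_seconds=161.0, gpu_seconds=139.3)
ideal = ideal_hybrid_seconds(161.0, 139.3)
print(cpu_rows, gpu_rows, round(ideal, 1))
```

With these timings the GPU would take slightly more than half of the rows, and the overhead-free bound is about 75 seconds. The measured hybrid time of 102.3 seconds is above that bound, as expected once coordination and data movement are included.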


Future work
‣ Integrate alternative linear solvers to improve performance (AmgX, cuSPARSE, PETSc)
‣ Consider alternate data decompositions to improve cache locality
  • volume slab per streaming multiprocessor
  • volume wedge per streaming multiprocessor
‣ Measure performance on next-generation GPUs
  • K80 for greater performance
  • Jetson/TK1 for greater performance/watt
‣ Experiment with GPU cloud platforms (Amazon cloud)


Acknowledgements
Nicholas T. Karonis, Northern Illinois University (NIU) and Argonne National Laboratory (ANL)
Michael E. Papka, NIU and ANL
Caesar Ordoñez, NIU
Eric Olson, ANL
Kirk Duffin, NIU
Venkat Vishwanath, ANL

US Department of Defense contract number W81XWH-10-1-0170 sponsored this work.
