GPU Cluster for Scientific Computing and Large-Scale Simulation

Zhe Fan, Feng Qiu, Arie Kaufman, Suzanne Yoakum-Stover
Center for Visual Computing and Department of Computer Science, Stony Brook University

http://www.cs.sunysb.edu/~vislab/projects/gpgpu/GPU_Cluster/GPU_Cluster.html

Stony Brook Visual Computing Cluster

• GPU Cluster

• 35 nodes with nVIDIA GeForce FX 5800 Ultra

• Gigabit Ethernet

• 70 Pentium Xeon 2.4GHz CPUs

• 35 VolumePro 1000

• 9 HP Sepia-2A with ServerNet II

LBM on the GPU

Application: large-scale CFD simulations using the Lattice Boltzmann Model (LBM)

LBM Computation (sketched in code below this list):

• Particles stream along lattice links

• Particles collide when they meet at a site
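For concreteness, here is a minimal NumPy sketch of one LBM stream-and-collide step, using the standard two-dimensional D2Q9 BGK model. The poster's simulation is three-dimensional (e.g. D3Q19); D2Q9 is chosen here only for brevity, and all names and parameter values are illustrative:

```python
import numpy as np

# D2Q9 lattice: 9 link directions and their standard weights.
C = np.array([[0,0],[1,0],[0,1],[-1,0],[0,-1],[1,1],[-1,1],[-1,-1],[1,-1]])
W = np.array([4/9] + [1/9]*4 + [1/36]*4)
OMEGA = 1.0  # BGK relaxation rate (illustrative)

def lbm_step(f):
    """One stream-and-collide update; f has shape (9, ny, nx)."""
    # Streaming: distributions move one site along their lattice links
    # (np.roll gives periodic boundaries, enough for a sketch).
    for i, (cx, cy) in enumerate(C):
        f[i] = np.roll(f[i], shift=(cy, cx), axis=(0, 1))
    # Macroscopic density and velocity at each site.
    rho = f.sum(axis=0)
    u = np.tensordot(C.T, f, axes=1) / rho          # shape (2, ny, nx)
    # Collision: relax toward the local equilibrium distribution.
    cu = np.tensordot(C, u, axes=1)                 # c_i . u, shape (9, ny, nx)
    usq = (u ** 2).sum(axis=0)
    feq = W[:, None, None] * rho * (1 + 3*cu + 4.5*cu**2 - 1.5*usq)
    return f - OMEGA * (f - feq)

f = np.ones((9, 64, 64)) * W[:, None, None]         # uniform initial state
f = lbm_step(f)
```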

Map to GPU (packing sketched below this list):

• Pack 3D lattice states into a series of 2D textures

• Update the lattice with fragment programs
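The poster does not spell out the packing, but a common scheme from this era tiles the Z-slices of the 3D lattice side by side in a larger 2D texture (a "flat" 3D texture) so a fragment program can update every site in one pass. A hypothetical sketch of that address arithmetic:

```python
def lattice_to_texel(x, y, z, nx, ny, tiles_per_row):
    """Map a 3D lattice site (x, y, z) to a 2D texel in a slice atlas.

    Each z-slice (nx x ny sites) becomes one tile in the 2D texture,
    tiles_per_row tiles across; a fragment program would apply the
    inverse mapping to fetch a site's neighbors in adjacent slices.
    """
    tile_x = (z % tiles_per_row) * nx
    tile_y = (z // tiles_per_row) * ny
    return tile_x + x, tile_y + y

# Example: site (5, 3, 10) of an 80^3 lattice packed 8 slices per row.
print(lattice_to_texel(5, 3, 10, 80, 80, 8))   # -> (165, 83)
```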

Scale up LBM to the GPU Cluster

• Each GPU computes a sub-lattice

• Particles stream out of the sub-lattice (an MPI exchange sketch follows these steps):

1. Gather particle distributions in a texture

2. Read out from the GPU in a single operation

3. Transfer through Gigabit Ethernet (MPI)

4. Write into neighboring GPU nodes
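A minimal sketch of the MPI leg of this pipeline (step 3), assuming mpi4py and a 1D decomposition with periodic neighbors. Buffer sizes and names are illustrative; in the actual system the send buffers hold the distributions read back from the GPU in step 2:

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left, right = (rank - 1) % size, (rank + 1) % size   # periodic neighbors

# Stand-ins for the GPU read-back: distributions that streamed out of
# this node's sub-lattice through each face.
out_left = np.random.rand(80 * 80 * 5).astype(np.float32)
out_right = np.random.rand(80 * 80 * 5).astype(np.float32)
in_left = np.empty_like(out_right)
in_right = np.empty_like(out_left)

# Exchange boundary distributions with both neighbors.
comm.Sendrecv(out_right, dest=right, recvbuf=in_left, source=left)
comm.Sendrecv(out_left, dest=left, recvbuf=in_right, source=right)
# in_left / in_right would now be written into the neighbor-facing
# boundary of this node's GPU textures (step 4).
```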

• Network performance optimization (overlap sketched below this list):

1. Conduct network transfers while computing

2. Schedule to reduce the likelihood of interruption

3. Simplify the connection pattern
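One common way to realize item 1, sketched with non-blocking MPI calls (illustrative, not the poster's actual scheduling code): post the boundary transfer, update interior lattice sites while it is in flight, and make only the boundary sites wait for the network:

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left, right = (rank - 1) % size, (rank + 1) % size

send = np.random.rand(32000).astype(np.float32)   # boundary data to ship
recv = np.empty_like(send)

# 1. Start the transfer, then compute while it is in flight.
reqs = [comm.Isend(send, dest=right), comm.Irecv(recv, source=left)]

interior = np.zeros((78, 78, 78), dtype=np.float32)
interior += 1.0        # stand-in for updating the interior lattice sites

# 2. Only the boundary sites must wait for the network to finish.
MPI.Request.Waitall(reqs)
# ... now fold `recv` into the boundary sites of the sub-lattice ...
```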

Times Square Area of NYC

Flow Streamlines

• 0.31 second / step on 30 GPUs

• 4.6 times faster than the software version on 30 CPUs

Acknowledgements

• NSF CCR0306438

• Department of Homeland Security, Environmental Measurements Laboratory

• HP

• Terarecon

GPU Cluster / CPU Cluster Speedup

• Each node computes an 80 x 80 x 80 sub-lattice

• GeForce FX 5800 Ultra / Pentium Xeon 2.4GHz

Dispersion Plume

• 1.66 km x 1.13 km

• 91 blocks

• 851 buildings

• 480 x 400 x 80 lattice
