GPU Cluster for Scientific Computing and Large-Scale Simulation

Zhe Fan, Feng Qiu, Arie Kaufman, Suzanne Yoakum-Stover
Center for Visual Computing and Department of Computer Science, Stony Brook University

http://www.cs.sunysb.edu/~vislab/projects/gpgpu/GPU_Cluster/GPU_Cluster.html

Stony Brook Visual Computing Cluster

• GPU Cluster

• 35 nodes with nVIDIA GeForce FX 5800 Ultra

• Gigabit Ethernet

• 70 Pentium Xeon 2.4GHz CPUs

• 35 VolumePro 1000

• 9 HP Sepia-2A with ServerNet II

LBM on the GPU

Application: large-scale CFD simulations using the Lattice Boltzmann Model (LBM)

LBM Computation (sketched in code below this list):

• Particles stream along lattice links

• Particles collide when they meet at a site
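For concreteness, here is a minimal NumPy sketch of one LBM stream-and-collide step, using the standard two-dimensional D2Q9 BGK model. The poster's simulation is three-dimensional (e.g. D3Q19); D2Q9 is chosen here only for brevity, and all names and parameter values are illustrative:

```python
import numpy as np

# D2Q9 lattice: 9 link directions and their standard weights.
C = np.array([[0,0],[1,0],[0,1],[-1,0],[0,-1],[1,1],[-1,1],[-1,-1],[1,-1]])
W = np.array([4/9] + [1/9]*4 + [1/36]*4)
OMEGA = 1.0  # BGK relaxation rate (illustrative)

def lbm_step(f):
    """One stream-and-collide update; f has shape (9, ny, nx)."""
    # Streaming: distributions move one site along their lattice links
    # (np.roll gives periodic boundaries, enough for a sketch).
    for i, (cx, cy) in enumerate(C):
        f[i] = np.roll(f[i], shift=(cy, cx), axis=(0, 1))
    # Macroscopic density and velocity at each site.
    rho = f.sum(axis=0)
    u = np.tensordot(C.T, f, axes=1) / rho          # shape (2, ny, nx)
    # Collision: relax toward the local equilibrium distribution.
    cu = np.tensordot(C, u, axes=1)                 # c_i . u, shape (9, ny, nx)
    usq = (u ** 2).sum(axis=0)
    feq = W[:, None, None] * rho * (1 + 3*cu + 4.5*cu**2 - 1.5*usq)
    return f - OMEGA * (f - feq)

f = np.ones((9, 64, 64)) * W[:, None, None]         # uniform initial state
f = lbm_step(f)
```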

Map to GPU (packing sketched below this list):

• Pack 3D lattice states into a series of 2D textures

• Update the lattice with fragment programs
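The poster does not spell out the packing, but a common scheme from this era tiles the Z-slices of the 3D lattice side by side in a larger 2D texture (a "flat" 3D texture) so a fragment program can update every site in one pass. A hypothetical sketch of that address arithmetic:

```python
def lattice_to_texel(x, y, z, nx, ny, tiles_per_row):
    """Map a 3D lattice site (x, y, z) to a 2D texel in a slice atlas.

    Each z-slice (nx x ny sites) becomes one tile in the 2D texture,
    tiles_per_row tiles across; a fragment program would apply the
    inverse mapping to fetch a site's neighbors in adjacent slices.
    """
    tile_x = (z % tiles_per_row) * nx
    tile_y = (z // tiles_per_row) * ny
    return tile_x + x, tile_y + y

# Example: site (5, 3, 10) of an 80^3 lattice packed 8 slices per row.
print(lattice_to_texel(5, 3, 10, 80, 80, 8))   # -> (165, 83)
```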

Scale up LBM to the GPU Cluster

• Each GPU computes a sub-lattice

• Particles stream out of the sub-lattice (an MPI exchange sketch follows these steps):

1. Gather particle distributions in a texture

2. Read out from the GPU in a single operation

3. Transfer through Gigabit Ethernet (MPI)

4. Write into neighboring GPU nodes
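A minimal sketch of the MPI leg of this pipeline (step 3), assuming mpi4py and a 1D decomposition with periodic neighbors. Buffer sizes and names are illustrative; in the actual system the send buffers hold the distributions read back from the GPU in step 2:

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left, right = (rank - 1) % size, (rank + 1) % size   # periodic neighbors

# Stand-ins for the GPU read-back: distributions that streamed out of
# this node's sub-lattice through each face.
out_left = np.random.rand(80 * 80 * 5).astype(np.float32)
out_right = np.random.rand(80 * 80 * 5).astype(np.float32)
in_left = np.empty_like(out_right)
in_right = np.empty_like(out_left)

# Exchange boundary distributions with both neighbors.
comm.Sendrecv(out_right, dest=right, recvbuf=in_left, source=left)
comm.Sendrecv(out_left, dest=left, recvbuf=in_right, source=right)
# in_left / in_right would now be written into the neighbor-facing
# boundary of this node's GPU textures (step 4).
```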

• Network performance optimization (overlap sketched below this list):

1. Conduct network transfers while computing

2. Schedule to reduce the likelihood of interruption

3. Simplify the connection pattern
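One common way to realize item 1, sketched with non-blocking MPI calls (illustrative, not the poster's actual scheduling code): post the boundary transfer, update interior lattice sites while it is in flight, and make only the boundary sites wait for the network:

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left, right = (rank - 1) % size, (rank + 1) % size

send = np.random.rand(32000).astype(np.float32)   # boundary data to ship
recv = np.empty_like(send)

# 1. Start the transfer, then compute while it is in flight.
reqs = [comm.Isend(send, dest=right), comm.Irecv(recv, source=left)]

interior = np.zeros((78, 78, 78), dtype=np.float32)
interior += 1.0        # stand-in for updating the interior lattice sites

# 2. Only the boundary sites must wait for the network to finish.
MPI.Request.Waitall(reqs)
# ... now fold `recv` into the boundary sites of the sub-lattice ...
```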

Times Square Area of NYC

Flow Streamlines

• 0.31 second / step on 30 GPUs

• 4.6 times faster than the software version on 30 CPUs

Acknowledgements

• NSF CCR0306438

• Department of Homeland Security, Environmental Measurements Laboratory

• HP

• Terarecon

GPU Cluster / CPU Cluster Speedup

• Each node computes an 80 x 80 x 80 sub-lattice

• GeForce FX 5800 Ultra / Pentium Xeon 2.4GHz

Dispersion Plume

• 1.66 km x 1.13 km

• 91 blocks

• 851 buildings

• 480 x 400 x 80 lattice
