www.vacet.org

Brad Whitlock

October 14, 2009

Porting VisIt to BG/P

Overview

• Objectives
• Building 3rd party libraries
• Building VisIt
• Running VisIt on BG/P
• Improvements
• Impact
• Future work

Objectives

• Port VisIt to IBM’s BlueGene/P platform so that VisIt can run on LLNL’s Dawn and, eventually, Sequoia
– Dawn is a 500-teraflop IBM BG/P system with 36,864 nodes and 147,456 cores
– Each node has four 850 MHz PowerPC cores and 4 GB of memory
– Compute nodes run the CNK operating system
– Code must be cross-compiled for CNK
• Identify weaknesses in VisIt that prevent it from scaling to tens or hundreds of thousands of processors
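
The Dawn figures quoted on this slide are consistent with one another, as a quick calculation shows. The snippet below is only an illustrative check; the variable names are made up for this example, and it assumes that 4 GB means 4 × 1024 MB.

```python
# Sanity check of the Dawn figures quoted on this slide.
NODES = 36_864                  # nodes in Dawn
CORES_PER_NODE = 4              # 850 MHz PowerPC cores per node
MEMORY_PER_NODE_MB = 4 * 1024   # 4 GB per node, assuming 1 GB = 1024 MB

total_cores = NODES * CORES_PER_NODE
memory_per_core_mb = MEMORY_PER_NODE_MB // CORES_PER_NODE

print(total_cores)          # 147456
print(memory_per_core_mb)   # 1024
```

Assuming one process per core, each process therefore has at most about 1 GB to work with, before the operating system and libraries take their share.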

Building 3rd party libraries

• Built all libraries on the login nodes for the regular Linux PowerPC version of VisIt
– Ran into runtime problems using the xlC compiler, so reverted to g++ for the time being
• Cross-compiled all libraries for CNK
• No support for this platform in VisIt’s 3rd party libraries, so special builds were required
• Mesa was built unmangled and without X11
• VTK was tricky to build
– No OpenGL, so VTK was built with Mesa as its OpenGL
– No X11, so a custom render window was created
– Used a CMake toolchain file

Building VisIt

• No X11, so graphical components can’t be built for CNK (don’t build the GUI)
• Added a new --enable-engine-only build mode to VisIt’s build system that builds only the compute engine and its plugins
• VisIt always used to require mangled Mesa
– This support had to become conditional on VTK having mangled Mesa support

Running VisIt on Dawn

• Dawn uses mpirun to start VisIt on compute nodes
– Minor differences required environment variables to be exported via the mpirun command, which could be handled via a host profile in VisIt
• VisIt ran at 1k, 2k, 4k, 8k, and 16k nodes
• VisIt ran with 1 and 4 trillion zone datasets (June 2009)
• Encountered scaling problems early
– Launch time was slow because each processor read the plugin directory to obtain plugin information
– VisIt commands were sent from rank 0 to the other ranks 1 KB at a time until the whole message had been sent
– The non-spinning bcast substitute used for sending commands relied on point-to-point communication that performed poorly at scale
– Certain metadata consumed too much memory (each processor has only ~700 MB)
– The synchronization step for SR (scalable rendering) mode used slow point-to-point communication

Improvements

• Broadcast plugin information from rank 0 to the other ranks, improving plugin loading time 9x
• Broadcast VisIt commands from rank 0 in a single chunk instead of 1 KB at a time
• Use the standard bcast in the engine main loop instead of the poorly performing non-spinning substitute, which was geared towards shared nodes
• Switched to an alternate metadata representation to free up most of the available memory for calculations
• Mark Miller replaced the SR mode synchronization step with a much faster version that reduced the time from 20 minutes to 2 seconds
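
To illustrate why sending commands in a single chunk helps, the toy model below counts collective calls for both patterns. It is a sketch, not VisIt's code: `CountingComm`, `send_chunked`, and `send_single` are hypothetical names, no real MPI library is used, and broadcasting the length before the payload is an assumption about how a single-chunk send might be arranged.

```python
class CountingComm:
    """Hypothetical stand-in for an MPI communicator that only counts broadcasts."""

    def __init__(self):
        self.bcast_calls = 0

    def bcast(self, data):
        # A real communicator would deliver `data` to every rank;
        # this toy version just records that a collective call happened.
        self.bcast_calls += 1
        return data


def send_chunked(comm, command, chunk_bytes=1024):
    """Old pattern: broadcast the command 1 KB at a time."""
    received = bytearray()
    for start in range(0, len(command), chunk_bytes):
        received += comm.bcast(command[start:start + chunk_bytes])
    return bytes(received)


def send_single(comm, command):
    """New pattern (assumed layout): broadcast the length, then the whole command."""
    length = comm.bcast(len(command))
    payload = comm.bcast(command)
    assert len(payload) == length
    return payload


command = b"x" * (10 * 1024 + 1)   # a made-up command just over 10 KB

old = CountingComm()
new = CountingComm()
assert send_chunked(old, command) == command
assert send_single(new, command) == command

print(old.bcast_calls)  # 11
print(new.bcast_calls)  # 2
```

In this model the number of collective calls grows with the command size under the chunked pattern and stays constant under the single-chunk pattern. Each collective involves every rank, so the difference matters more as the processor count rises.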

Impact

• So far this project’s impact has been small for customers
– They do not yet run on Dawn
– They might not notice small improvements at today’s everyday processor counts (<2k)
• At higher processor counts (>4k), the optimizations added by this work prevent bottlenecks in the compute engine, improving scalability

Future work

• Resolve load problems with the xlC compiler so we can use the best optimizations, including BG/P’s dual FPUs
• Improve the 3rd party library build process for BG/P by adding support to the build_visit script
• Continue profiling plots and improving performance
• Reduce memory usage where possible
• Investigate I/O patterns and attempt optimizations
