Running Petascale Seismic Applications on AWS

82nd EAGE Conference & Exhibition 2020

8-11 June 2020, Amsterdam, The Netherlands

Let's talk about a widespread myth in the HPC community: that large-scale production simulations are by far best run on finely tuned, specialized, on-premises data center resources equipped with a key HPC attribute, a low-latency interconnect. The joint work described in this article, between Petrobras, Atrio and AWS, shows that even multi-petaflop simulation workloads can perform on the cloud just as well as they do on special-purpose on-premises data center resources.

Having previously demonstrated[1] that the same Reverse Time Migration (RTM) GPGPU+MPI binary executable can run on-premises and on the three major cloud providers (AWS, Azure and GCP) using hybrid HPC solutions provided by Atrio, the natural next step was to scale the cloud execution to a production-level problem size. AWS gave us access to computing resources at the required scale for this experiment, and the Amazon FSx for Lustre service helped make these results achievable.

The RTM workload scales easily to hundreds of GPUs, and it is commonly used by oil and gas companies to produce subsurface images that help position oil wells correctly. The workload must process thousands of independent experiments, called shots, in which signals reflected from subsurface layers are recorded; these shots can be processed independently in parallel. A 3D production shot typically does not fit in the memory available on a single GPU, so multiple GPUs must be used to process each shot.

We used an RTM domain decomposition and process placement scheme that fits the processing of each shot inside a single multi-GPU node. This approach lets the heavy communication demanded by the finite-difference method used by RTM run over the high-speed NVLink interconnect between intra-node GPUs. For this RTM workload, groups of four GPUs process each shot; Figure 1 shows the resulting communication pattern over the NVLink high-speed connection.
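As a rough illustration of this decomposition and placement idea, the sketch below (hypothetical code, not the actual Petrobras/Atrio RTM source) groups every four consecutive MPI ranks into a shot group with its own communicator and binds each rank to one GPU of the node, so that intra-group traffic can stay on NVLink.

/*
 * Hypothetical sketch of the four-GPUs-per-shot placement described above.
 * Names and the rank-to-GPU mapping are illustrative; a production run would
 * also consult the node's NVLink topology (the widest-path idea of [4])
 * when choosing devices.
 */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    const int GPUS_PER_SHOT = 4;   /* group size used in this work             */
    const int GPUS_PER_NODE = 8;   /* p3.16xlarge exposes 8 V100s per instance */

    int shot_group = world_rank / GPUS_PER_SHOT;   /* which shot this rank works on  */
    int group_pos  = world_rank % GPUS_PER_SHOT;   /* position inside the group      */
    int local_gpu  = world_rank % GPUS_PER_NODE;   /* device index on this node,
                                                      assuming ranks are placed
                                                      node by node                   */

    /* Give each shot group its own communicator so halo exchanges and
     * reductions never leave the group (and hence never leave the node). */
    MPI_Comm shot_comm;
    MPI_Comm_split(MPI_COMM_WORLD, shot_group, group_pos, &shot_comm);

    /* Bind this rank to one of the node's GPUs. */
    cudaSetDevice(local_gpu);

    printf("rank %d -> shot group %d, local GPU %d\n",
           world_rank, shot_group, local_gpu);

    MPI_Comm_free(&shot_comm);
    MPI_Finalize();
    return 0;
}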

Figure 1. GPU communication pattern used by the RTM finite-difference method inside a group of four GPUs in a single node.

We ran the RTM workload by launching an Amazon Web Services (AWS) on-demand compute cluster with 320 Tesla V100 GPUs (p3.16xlarge instances) and an ephemeral 100 TB file system using Amazon FSx for Lustre (FSx Lustre). This cluster provides more than 5 PFLOPS of theoretical peak, single-precision performance. The application uses a CUDA-aware MPI to handle the communication between GPUs. Our MPI+GPU process placement scheme with optimal GPU topology[4] allows all MPI communication between GPUs to use NVLink in this case. Figure 2 shows that we achieved a sustained bi-directional bandwidth close to 160 Gb/s between GPUs using the MPI_Sendrecv function.
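As a minimal sketch of the kind of exchange measured in Figure 2, assuming a CUDA-aware MPI so that device pointers can be handed straight to MPI_Sendrecv (the buffer size and neighbour pairing below are illustrative, not taken from the production code):

/*
 * Minimal CUDA-aware MPI_Sendrecv exchange between two GPUs of a shot group.
 * With a CUDA-aware MPI the device pointers are passed directly to MPI, which
 * can move the data over NVLink without staging it through host memory.
 */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    cudaSetDevice(rank % 8);                 /* 8 GPUs per p3.16xlarge node     */

    const int halo_elems = 1 << 24;          /* 64 MiB of floats, illustrative  */
    float *d_send, *d_recv;
    cudaMalloc((void **)&d_send, halo_elems * sizeof(float));
    cudaMalloc((void **)&d_recv, halo_elems * sizeof(float));

    /* Pair up neighbouring ranks (0<->1, 2<->3, ...) and swap one halo. */
    int neighbour = (rank % 2 == 0) ? rank + 1 : rank - 1;
    if (neighbour < size)
        MPI_Sendrecv(d_send, halo_elems, MPI_FLOAT, neighbour, 0,
                     d_recv, halo_elems, MPI_FLOAT, neighbour, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(d_send);
    cudaFree(d_recv);
    MPI_Finalize();
    return 0;
}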

Figure 2. Average MPI_Sendrecv bi-directional bandwidth between GPUs.

The 320 GPUs process a lot of data very quickly, so a high-performance parallel file system is required to keep GPU utilization high and not leave the GPUs waiting for I/O operations. For this workload we created a 100 TB FSx Lustre file system, which is designed so that each TB of capacity delivers 200 MB/s of sustained I/O bandwidth[2]; our 100 TB file system therefore has an expected aggregate performance of 20 GB/s. Figure 3 shows the write performance (keep in mind that only one process out of every four performs I/O).
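A back-of-the-envelope sketch of this sizing, together with the one-writer-per-group pattern mentioned above, is shown below; the rank-selection logic and helper names are assumptions for illustration only.

/*
 * Illustrative sizing of the FSx for Lustre file system and election of one
 * I/O rank per four-GPU shot group (hypothetical helper, not production code).
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* FSx for Lustre scratch throughput scales with capacity[2]:
     * 100 TB x 200 MB/s per TB = 20,000 MB/s = 20 GB/s aggregate. */
    const double capacity_tb  = 100.0;
    const double mbs_per_tb   = 200.0;
    const double aggregate_gb = capacity_tb * mbs_per_tb / 1000.0;

    /* One writer per group of four GPUs: a 320-GPU run presents at most
     * 320 / 4 = 80 concurrent writers to the file system. */
    const int GPUS_PER_SHOT = 4;
    int is_io_rank = (rank % GPUS_PER_SHOT == 0);

    if (rank == 0)
        printf("expected aggregate bandwidth: %.0f GB/s\n", aggregate_gb);

    if (is_io_rank) {
        /* ...gather the group's image slice and write it to the Lustre
         *    file system (e.g. with POSIX I/O or MPI-IO)... */
    }

    MPI_Finalize();
    return 0;
}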

Figure 3. Average write performance per MPI process.

As Figure 3 shows, every process that performs I/O achieved a write throughput close to 5 Gb/s, which is the per-process connection limit[3] to the FSx Lustre file system, so our Lustre setup is more than adequate for this workload. The performance data was collected using Atrio Hybrid HPC tools. The on-premises cluster used by Petrobras has the more expensive 32 GB version of the V100 GPU and an InfiniBand-connected Lustre file system. Since the on-premises cluster has more memory per GPU, we used only two GPUs per group to process a shot there. Our large-scale execution of a 3D RTM benchmark on public cloud resources achieved 99.33% of the on-premises performance per GPU, using the less expensive 16 GB version of the V100 and no InfiniBand; our cloud approach delivers enough bandwidth for the communication between GPUs. We were unable to make a fair comparison with on-premises Lustre performance, since we used only 8 GPUs on-premises versus 320 on the cloud, and the on-premises Lustre file system was simultaneously serving other jobs.
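The different group sizes follow from simple memory arithmetic, sketched below; the grid dimensions and the number of resident arrays are invented for illustration, since the real footprint depends on the survey and on the RTM implementation.

/*
 * Illustrative arithmetic for choosing the number of GPUs per shot from the
 * shot's memory footprint and the per-GPU memory (16 GB V100s on the cloud,
 * 32 GB on-premises).  All grid numbers below are made up.
 */
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Hypothetical 3D shot: grid points times arrays kept resident per cell. */
    const double nx = 1600.0, ny = 1600.0, nz = 1200.0;
    const double arrays_per_cell = 5.0;          /* wavefields + model terms    */
    const double gib = nx * ny * nz * arrays_per_cell * 4.0 /* bytes per float */
                       / (1024.0 * 1024.0 * 1024.0);

    const double v100_16gb = 16.0;               /* cloud (p3.16xlarge)         */
    const double v100_32gb = 32.0;               /* on-premises                 */

    printf("shot footprint        : ~%.0f GiB\n", gib);
    printf("GPUs per shot (cloud) : %d\n", (int)ceil(gib / v100_16gb)); /* 4 */
    printf("GPUs per shot (prem)  : %d\n", (int)ceil(gib / v100_32gb)); /* 2 */
    return 0;
}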

Future work includes remotely mounting the cloud parallel file system onto on-premises nodes to transparently manage data transfers to and from the cloud.

Acknowledgements

We thank Petrobras for this research opportunity.

References

[1] Souza Filho, P., Sardinha, A., Ávila, C., Azambuja, A., Sierra, F., De Paula, D., Vecino, M., Silva, L. and Ji, N., Seismic Processing with Hybrid HPC. Fourth EAGE Workshop on High Performance Computing for Upstream, 2019.

[2] Amazon FSx for Lustre Reference. https://docs.aws.amazon.com/fsx/latest/LustreGuide/performance.html

[3] The Floodgates Are Open – Increased Network Bandwidth for EC2 Instances. https://aws.amazon.com/pt/blogs/aws/the-floodgates-are-open-increased-network-bandwidth-for-ec2-instances/

[4] Widest path problem. https://en.wikipedia.org/wiki/Widest_path_problem