gpu-accelerated power for supercomputing · kkrnano scaling exploration •target system with...
TRANSCRIPT
![Page 1: GPU-accelerated POWER for Supercomputing · KKRnano Scaling Exploration •Target system with performance ~6 PFlop/s •Requires ~2100 nodes with 2x POWER8 and 2x K40 GPUs •Minimal](https://reader033.vdocuments.site/reader033/viewer/2022051608/603d48a690c50509fd71a647/html5/thumbnails/1.jpg)
© 2016 OpenPOWER Foundation
GPU-accelerated POWER for Supercomputing
D. Pleiter (Jülich Supercomputing Centre)
![Page 2: GPU-accelerated POWER for Supercomputing · KKRnano Scaling Exploration •Target system with performance ~6 PFlop/s •Requires ~2100 nodes with 2x POWER8 and 2x K40 GPUs •Minimal](https://reader033.vdocuments.site/reader033/viewer/2022051608/603d48a690c50509fd71a647/html5/thumbnails/2.jpg)
Introduction
• Massively-parallel compute devices are becoming commonly used in supercomputers
• E.g., GPUs
• Significant changes for OpenPOWER server architectures like Minsky
• More GPUs per CPU socket • High-speed interconnect CPU-GPU and GPU-GPU • Better support for data migration between compute devices
• Challenge: Application enablement • Efficient exploitation, maintaining scalability
© 2015 OpenPOWER Foundation 2
![Page 3: GPU-accelerated POWER for Supercomputing · KKRnano Scaling Exploration •Target system with performance ~6 PFlop/s •Requires ~2100 nodes with 2x POWER8 and 2x K40 GPUs •Minimal](https://reader033.vdocuments.site/reader033/viewer/2022051608/603d48a690c50509fd71a647/html5/thumbnails/3.jpg)
Approach to Application Enablement
• Application vs. HPC expert • Create common understanding of the application
• Process • Perform anamnesis
• Define goals and define constraints • Create mini-applications
• Easy to modify, simplified version of the application • Implement and evaluate proof-of-concepts
• Proof benefits of specific porting strategies • Performance modelling
• Create understanding for architectural requirements
© 2015 OpenPOWER Foundation 3
![Page 4: GPU-accelerated POWER for Supercomputing · KKRnano Scaling Exploration •Target system with performance ~6 PFlop/s •Requires ~2100 nodes with 2x POWER8 and 2x K40 GPUs •Minimal](https://reader033.vdocuments.site/reader033/viewer/2022051608/603d48a690c50509fd71a647/html5/thumbnails/4.jpg)
Applications: KKRnano
• Materials science application based on Density Functional Theory (DFT) method
• High scalability due to truncation of long-range interactions → linear scaling in number of atoms
• Performance characteristics • Most time spent in iterative solver
• Dense matrix-matrix multiplications dominate performance (AI ≥ 4)
• Implementation properties: Fortran, MPI, OpenMP
• Exascale needed for simulating systems with O(106) atoms
© 2015 OpenPOWER Foundation 4
![Page 5: GPU-accelerated POWER for Supercomputing · KKRnano Scaling Exploration •Target system with performance ~6 PFlop/s •Requires ~2100 nodes with 2x POWER8 and 2x K40 GPUs •Minimal](https://reader033.vdocuments.site/reader033/viewer/2022051608/603d48a690c50509fd71a647/html5/thumbnails/5.jpg)
Applications: B-CALM
• Based on FDTD = Finite Difference Time Domain
• Method for electro-magnetic calculations
• Example usages: Analysis of dispersive media
• Information technology: Development of optical interconnects
• Energy technology: Research on photo-electric cells
© 2015 OpenPOWER Foundation 5
[P. Wahl et al., 2012]
![Page 6: GPU-accelerated POWER for Supercomputing · KKRnano Scaling Exploration •Target system with performance ~6 PFlop/s •Requires ~2100 nodes with 2x POWER8 and 2x K40 GPUs •Minimal](https://reader033.vdocuments.site/reader033/viewer/2022051608/603d48a690c50509fd71a647/html5/thumbnails/6.jpg)
Focus on Scalability: High-Q Club
• Eligible members: Applications that demonstrated scalability up to 28 Blue Gene/Q racks, i.e. 458,752 cores
• Example: KKRnano
© 2015 OpenPOWER Foundation 6
![Page 7: GPU-accelerated POWER for Supercomputing · KKRnano Scaling Exploration •Target system with performance ~6 PFlop/s •Requires ~2100 nodes with 2x POWER8 and 2x K40 GPUs •Minimal](https://reader033.vdocuments.site/reader033/viewer/2022051608/603d48a690c50509fd71a647/html5/thumbnails/7.jpg)
Research Questions
• Research questions • How well can application exploit architecture?
• How could architecture optimized for application?
• Methodology based on performance models to enable comparison with architectural parameters
• Support implementation decisions
• Enable understanding of optimal performance
• Allow for hypothetical tuning of hardware parameters
© 2015 OpenPOWER Foundation 7
![Page 8: GPU-accelerated POWER for Supercomputing · KKRnano Scaling Exploration •Target system with performance ~6 PFlop/s •Requires ~2100 nodes with 2x POWER8 and 2x K40 GPUs •Minimal](https://reader033.vdocuments.site/reader033/viewer/2022051608/603d48a690c50509fd71a647/html5/thumbnails/8.jpg)
B-CALM on OpenPOWER
• No specific porting efforts towards POWER + GPU required • Main kernels had been ported to GPU already
• Scalability challenge: efficient overlap of communication and computation
© 2015 OpenPOWER Foundation 8
![Page 9: GPU-accelerated POWER for Supercomputing · KKRnano Scaling Exploration •Target system with performance ~6 PFlop/s •Requires ~2100 nodes with 2x POWER8 and 2x K40 GPUs •Minimal](https://reader033.vdocuments.site/reader033/viewer/2022051608/603d48a690c50509fd71a647/html5/thumbnails/9.jpg)
B-CALM Results
• Ansatz: Model kernel execution time as function of information exchanged between GPU and its memory
• Measurements performed on POWER8 server with K40 GPUs
• Results for different choices of Lx = Ly
• Different number of MPI ranks
© 2015 OpenPOWER Foundation 9
[P. Baumeister et al., 2015]
![Page 10: GPU-accelerated POWER for Supercomputing · KKRnano Scaling Exploration •Target system with performance ~6 PFlop/s •Requires ~2100 nodes with 2x POWER8 and 2x K40 GPUs •Minimal](https://reader033.vdocuments.site/reader033/viewer/2022051608/603d48a690c50509fd71a647/html5/thumbnails/10.jpg)
Exploring B-CALM Scaling Limits
• Ansatz: Model time needed for communication as function of exchanged information
• Balance condition: Perfect overlap of computation and communication → Relation between number of MPI ranks and network bandwidth
© 2015 OpenPOWER Foundation 10
[P. Baumeister et al., 2015]
![Page 11: GPU-accelerated POWER for Supercomputing · KKRnano Scaling Exploration •Target system with performance ~6 PFlop/s •Requires ~2100 nodes with 2x POWER8 and 2x K40 GPUs •Minimal](https://reader033.vdocuments.site/reader033/viewer/2022051608/603d48a690c50509fd71a647/html5/thumbnails/11.jpg)
Porting KKRnano to OpenPOWER
• Porting strategy • Main application executed on
POWER processor
• Dedicated implementation of solver for POWER8 or GPU
• Performance limits • POWER8: compute-limited
• Reach 365 GFlop/s
• K40: memory-bandwidth limited • Reach 152 GByte/s
© 2015 OpenPOWER Foundation 11
[P. Baumeister et al., 2016]
![Page 12: GPU-accelerated POWER for Supercomputing · KKRnano Scaling Exploration •Target system with performance ~6 PFlop/s •Requires ~2100 nodes with 2x POWER8 and 2x K40 GPUs •Minimal](https://reader033.vdocuments.site/reader033/viewer/2022051608/603d48a690c50509fd71a647/html5/thumbnails/12.jpg)
KKRnano Scaling Exploration
• Target system with performance ~6 PFlop/s • Requires ~2100 nodes with 2x POWER8 and 2x K40 GPUs
• Minimal problem size determination • Need >20 atoms to saturate single node resources
• Need >42,000 atoms on target system
• Communication time determined from performance modelling approach
• Find time for communication << time for computation
• No communication during solver execution
© 2016 OpenPOWER Foundation 12
![Page 13: GPU-accelerated POWER for Supercomputing · KKRnano Scaling Exploration •Target system with performance ~6 PFlop/s •Requires ~2100 nodes with 2x POWER8 and 2x K40 GPUs •Minimal](https://reader033.vdocuments.site/reader033/viewer/2022051608/603d48a690c50509fd71a647/html5/thumbnails/13.jpg)
KKRnano Power Efficiency Analysis
© 2016 OpenPOWER Foundation 13
POWER8: 2.5 nJ/Flop
K40: 1.95 nJ/Flop
→ 22% gain
![Page 14: GPU-accelerated POWER for Supercomputing · KKRnano Scaling Exploration •Target system with performance ~6 PFlop/s •Requires ~2100 nodes with 2x POWER8 and 2x K40 GPUs •Minimal](https://reader033.vdocuments.site/reader033/viewer/2022051608/603d48a690c50509fd71a647/html5/thumbnails/14.jpg)
Summary and Conclusions
• Opportunity of new OpenPOWER architectures: broaden number of applications that can exploit GPU
• High-bandwidth interconnect • System support for data transport
• Current systems suitable for “classical” compute-intensive applications
• DFT-based application KKRnano • FDTD-based application B-CALM
• High scalability is achievable • Small number of fat nodes reduces node-scaling challenge
© 2016 OpenPOWER Foundation 14