Jan Jacob Institute of Applied Physics and Advanced Microstructure Research Center
GPU-based High-Performance Simulations for Spintronics
Jan Jacob1, Darren Schmidt2, Qing Ruan2, Lothar Wenzel2, Vivek Amin3, and Jairo Sinova3
1University of Hamburg, Institute of Applied Physics, Hamburg, Germany
2National Instruments, Austin, TX, USA
3Texas A&M University, Department of Physics and Astronomy, College Station, TX, USA
NVIDIA GPU Technology Conference, San Jose, CA, USA, May 14-17, 2012
Overview
Introduction to the Underlying Physics
The Basic Algorithm
Optimizations
Benchmarks
Multicore-CPUs
NVIDIA Tesla GPUs
March 2012 GTC 2012, San Jose, May 14-17 2012
The Physics
Transport of Charge / Spin / Heat, etc. through a Scattering Region can be described by Schrödinger's Equation HΨ = EΨ
More complex structures: solve numerically in a tight-binding model (nearest-neighbor interactions only)
The Hamiltonian H becomes a matrix
Commonly used approach: Green's Function Method
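To illustrate how the Hamiltonian becomes a matrix in a nearest-neighbor tight-binding model, here is a minimal NumPy sketch. It is not the authors' LabVIEW implementation. The function name `strip_hamiltonian`, the square-lattice strip geometry, the hopping `t`, and the zero on-site energy are assumptions made for illustration.

```python
import numpy as np

def strip_hamiltonian(width, length, t=1.0, onsite=0.0):
    """Dense nearest-neighbor tight-binding Hamiltonian of a width x length strip.

    Site (x, y) maps to index x * width + y, so each group of `width`
    consecutive rows is one transverse slice of the strip.
    """
    n = width * length
    H = np.zeros((n, n))
    for x in range(length):
        for y in range(width):
            i = x * width + y
            H[i, i] = onsite
            if y + 1 < width:        # neighbor inside the same slice
                H[i, i + 1] = H[i + 1, i] = -t
            if x + 1 < length:       # neighbor in the next slice
                H[i, i + width] = H[i + width, i] = -t
    return H

H = strip_hamiltonian(width=3, length=4)
print(H.shape)                 # (12, 12)
print(np.allclose(H, H.T))     # the matrix is symmetric (Hermitian)
```

With this site ordering the matrix is block-tridiagonal: one block per slice on the diagonal and one coupling block between neighboring slices.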
More details
Conductivity can be obtained by:
The scattering matrix can be determined by:
Discretized version:
Differential Operator -> Matrix Operator
Definition of derivatives:
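The formulas from this slide are not preserved in the transcript. As a hedged reminder, the standard textbook forms (not necessarily the exact expressions shown on the slide) are the Landauer relation between conductance and the transmission amplitudes, and the finite-difference second derivative that turns the kinetic term into a nearest-neighbor matrix. The symbols $G_{\mathrm{cond}}$ (conductance per spin channel), $t_{nm}$ (transmission amplitude from mode $m$ to mode $n$), $a$ (lattice spacing), and $t$ (hopping energy) are notation assumed here.

```latex
G_{\mathrm{cond}} = \frac{e^2}{h} \sum_{n,m} |t_{nm}|^2,
\qquad
\left.\frac{d^2\Psi}{dx^2}\right|_{x_j} \approx \frac{\Psi_{j+1} - 2\Psi_j + \Psi_{j-1}}{a^2},
\qquad
t = \frac{\hbar^2}{2 m a^2}
```

With this discretization, $-\frac{\hbar^2}{2m}\frac{d^2}{dx^2}$ becomes a matrix with $2t$ on the diagonal and $-t$ on the first off-diagonals.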
More details 2
Green's function of the system (including self-energy term to describe the leads):
Transmission is then obtained by:
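The expressions themselves are missing from the transcript. In the standard two-terminal Green's function formalism they usually take the following form, given here as a hedged reference and not as a reproduction of the slide. $\Sigma_L$ and $\Sigma_R$ denote the self-energies of the left and right leads, and $I$ is the identity matrix.

```latex
G(E) = \left[ E\,I - H - \Sigma_L - \Sigma_R \right]^{-1},
\qquad
\Gamma_{L,R} = i \left( \Sigma_{L,R} - \Sigma_{L,R}^{\dagger} \right),
\qquad
T(E) = \mathrm{Tr}\left[ \Gamma_L \, G \, \Gamma_R \, G^{\dagger} \right]
```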
The Algorithm
1. The Hamiltonian Matrix defines the System
2. The Transverse Modes define the Occupied States
3. The Self-Energies describe the Contact Leads
4. The Green's Function of the System describes its Scattering Properties
5. The Γ Matrices connect the Leads to the System
6. The Transmission and Reflection are obtained by Multiplying the G and Γ Matrices
1. Define Hamiltonian Matrix (User Input)
2. Obtain Transverse Modes (Calculate Eigensystem)
3. Obtain Self-Energies (Scalar Operations)
4. Obtain Green's Function (Matrix Inversion)
5. Obtain Γ (Scalar Operations)
6. Obtain Transmission (Matrix Multiplication)
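The six steps can be sketched end to end for a small test case. The following is a minimal NumPy illustration of the basic (unoptimized) algorithm, not the authors' LabVIEW code. It assumes a clean square-lattice strip with hopping `t`, zero on-site energy, and leads of the same width as the scattering region. The function name `transmission` and all parameter names are hypothetical.

```python
import numpy as np

def transmission(E, width, length, t=1.0):
    # 1. Hamiltonian: block-tridiagonal, one block per transverse slice
    h0 = -t * (np.eye(width, k=1) + np.eye(width, k=-1))
    chain = np.eye(length, k=1) + np.eye(length, k=-1)
    H = np.kron(np.eye(length), h0) - t * np.kron(chain, np.eye(width))

    # 2. Transverse modes: eigensystem of a single slice
    eps, chi = np.linalg.eigh(h0)

    # 3. Self-energies: one scalar -t * exp(i k a) per mode,
    #    with cos(k a) = (eps - E) / (2 t)
    c = (eps - E) / (2 * t)
    lam = np.where(
        np.abs(c) <= 1,
        c + 1j * np.sqrt(np.maximum(1 - c**2, 0)),            # propagating mode
        c - np.sign(c) * np.sqrt(np.maximum(c**2 - 1, 0)),    # evanescent mode
    )
    sigma_slice = (chi * (-t * lam)) @ chi.T
    n = width * length
    sigma_L = np.zeros((n, n), dtype=complex)
    sigma_R = np.zeros((n, n), dtype=complex)
    sigma_L[:width, :width] = sigma_slice
    sigma_R[-width:, -width:] = sigma_slice

    # 4. Green's function: dense matrix inversion (the expensive step)
    G = np.linalg.inv(E * np.eye(n) - H - sigma_L - sigma_R)

    # 5. Gamma matrices
    gamma_L = 1j * (sigma_L - sigma_L.conj().T)
    gamma_R = 1j * (sigma_R - sigma_R.conj().T)

    # 6. Transmission: matrix multiplication and trace
    return np.trace(gamma_L @ G @ gamma_R @ G.conj().T).real

print(round(transmission(0.0, width=3, length=4), 6))   # three open modes at E = 0
print(round(transmission(2.5, width=3, length=4), 6))   # one open mode at E = 2.5
```

For a clean strip the transmission equals the number of propagating transverse modes at the chosen energy, which makes this a convenient sanity check before disorder or spin-dependent terms are added to the Hamiltonian.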
Optimizations
Step 1 requires creating large matrices: memory issues!
This can be reduced by exploiting the extreme sparsity of the matrices
Creating only small blocks of the large matrices right when they are needed is even more efficient
Step 4 requires inverting these large matrices: computational issues
This is the main issue of the algorithm and will be addressed in detail on the next slides
1. Define Hamiltonian Matrix (User Input)
2. Obtain Transverse Modes (Calculate Eigensystem)
3. Obtain Self-Energies (Scalar Operations)
4. Obtain Green's Function (Matrix Inversion)
5. Obtain Γ (Scalar Operations)
6. Obtain Transmission (Matrix Multiplication)
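The memory argument for Step 1 can be made concrete with a little arithmetic. The sketch below compares the storage of the full matrix with that of its blocks and shows a generator that builds one slice at a time. The function names, the complex128 element size of 16 bytes, and the clean square-lattice strip are assumptions for illustration, not details taken from the talk.

```python
import numpy as np

def dense_bytes(width, length, itemsize=16):
    """Bytes needed for the full (width*length) x (width*length) matrix."""
    return (width * length) ** 2 * itemsize

def block_bytes(width, length, itemsize=16):
    """Bytes needed for all diagonal blocks plus both sets of coupling blocks."""
    return (length + 2 * (length - 1)) * width ** 2 * itemsize

def slice_blocks(E, width, length, t=1.0):
    """Yield (diagonal block, coupling to next slice) of E*I - H on demand."""
    h0 = -t * (np.eye(width, k=1) + np.eye(width, k=-1))
    for x in range(length):
        diag = E * np.eye(width) - h0
        coupling = t * np.eye(width) if x + 1 < length else None
        yield diag, coupling

print(dense_bytes(100, 1000) / 1e9)   # 160.0 (GB for the full matrix)
print(block_bytes(100, 1000) / 1e9)   # 0.47968 (GB for the blocks only)
```

When the blocks are generated on demand, as in `slice_blocks`, only one or two blocks need to be resident at any time, so even the block storage above is an upper bound.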
Our View Of The Computational Map
[Figure: map of Problem Size versus Cycle Time (Maximum Allowed), with cycle-time marks at 1 ms, 10 ms, 100 ms, and 1 s; regions labeled FPGA, CPU, GPU, RT-GPU, and CPU or GPU; annotation: Power vs. $$$]
Algorithm 0.
Algorithm 1.
1. Use optimized, multicore-ready inversion and multiplication algorithms
Intel MKL wrapped into LabVIEW via High-Performance Analysis Library (beta)
Algorithm 2.
2. Use sparsity of the matrices
PARDISO direct sparse linear solver
Optimizations for the matrix inversion
3. Use block-tridiagonal solver
Roll-your-own
Algorithm 3.
3. Use block-tridiagonal solver
Roll-your-own
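The slides do not spell out the authors' own block-tridiagonal solver. As an illustration of the general idea, here is a minimal sketch of a block Thomas algorithm (forward elimination followed by back substitution) in NumPy. The function name `block_tridiag_solve`, the list-of-blocks layout, and the random test matrix are assumptions made for this example.

```python
import numpy as np

def block_tridiag_solve(diag, lower, upper, rhs):
    """Solve A X = B for a block-tridiagonal matrix A.

    diag[i]  : block A[i, i]
    lower[i] : block A[i + 1, i]
    upper[i] : block A[i, i + 1]
    rhs[i]   : block row i of B
    """
    n = len(diag)
    d = [diag[0].copy()]
    b = [rhs[0].copy()]
    for i in range(1, n):                        # forward elimination
        w = lower[i - 1] @ np.linalg.inv(d[i - 1])
        d.append(diag[i] - w @ upper[i - 1])
        b.append(rhs[i] - w @ b[i - 1])
    x = [None] * n
    x[-1] = np.linalg.solve(d[-1], b[-1])
    for i in range(n - 2, -1, -1):               # back substitution
        x[i] = np.linalg.solve(d[i], b[i] - upper[i] @ x[i + 1])
    return x

# Check against a dense solve on a small, well-conditioned random example
rng = np.random.default_rng(0)
m, n = 3, 5
diag = [rng.normal(size=(m, m)) + 20 * np.eye(m) for _ in range(n)]
lower = [rng.normal(size=(m, m)) for _ in range(n - 1)]
upper = [rng.normal(size=(m, m)) for _ in range(n - 1)]
rhs = [rng.normal(size=(m, 2)) for _ in range(n)]

A = np.zeros((m * n, m * n))
for i in range(n):
    A[i*m:(i+1)*m, i*m:(i+1)*m] = diag[i]
for i in range(n - 1):
    A[(i+1)*m:(i+2)*m, i*m:(i+1)*m] = lower[i]
    A[i*m:(i+1)*m, (i+1)*m:(i+2)*m] = upper[i]

X = np.vstack(block_tridiag_solve(diag, lower, upper, rhs))
print(np.allclose(X, np.linalg.solve(A, np.vstack(rhs))))
```

The solver only ever inverts or factorizes blocks of the slice size, so the cost grows linearly with the number of slices instead of cubically with the full matrix dimension.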
Optimizations for the matrix inversion
4. Exploit the fact that the result does not require the full inverse matrix
Improved block-tridiagonal solver that calculates only the necessary blocks
Algorithm 4.
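Because the Γ matrices are non-zero only in the first and last slice, the transmission needs just the corner block of the Green's function that connects those two slices. The sketch below shows one common way to obtain that block recursively without forming the full inverse. It is a hedged illustration, not the authors' implementation, and the name `corner_block` and the block layout are assumptions.

```python
import numpy as np

def corner_block(diag, lower, upper):
    """Return block [n-1, 0] of inv(A) for block-tridiagonal A.

    diag[i] = A[i, i], lower[i] = A[i + 1, i], upper[i] = A[i, i + 1].
    Only two slice-sized blocks are kept in memory at any time.
    """
    g = np.linalg.inv(diag[0])   # last diagonal block of the inverse so far
    p = g                        # block [i, 0] of the inverse so far
    for i in range(1, len(diag)):
        g = np.linalg.inv(diag[i] - lower[i - 1] @ g @ upper[i - 1])
        p = -g @ lower[i - 1] @ p
    return p

# Check against the corner of a dense inverse on a random complex example
rng = np.random.default_rng(1)
m, n = 3, 6
diag = [rng.normal(size=(m, m)) + 1j * rng.normal(size=(m, m)) + 20 * np.eye(m)
        for _ in range(n)]
lower = [rng.normal(size=(m, m)) for _ in range(n - 1)]
upper = [rng.normal(size=(m, m)) for _ in range(n - 1)]

A = np.zeros((m * n, m * n), dtype=complex)
for i in range(n):
    A[i*m:(i+1)*m, i*m:(i+1)*m] = diag[i]
for i in range(n - 1):
    A[(i+1)*m:(i+2)*m, i*m:(i+1)*m] = lower[i]
    A[i*m:(i+1)*m, (i+1)*m:(i+2)*m] = upper[i]

G_corner = corner_block(diag, lower, upper)
print(np.allclose(G_corner, np.linalg.inv(A)[-m:, :m]))
```

With A = E·I − H − Σ, the transmission then follows from this single block as Tr[Γ_R G_corner Γ_L G_corner†], where Γ_L and Γ_R are restricted to the first and last slice.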
Optimizations for the matrix inversion
1. Use optimized, multicore-ready inversion and multiplication algorithms
Intel MKL wrapped into LabVIEW via High-Performance Analysis Library (beta)
2. Use sparsity of the matrices
PARDISO direct sparse linear solver
3. Use block-tridiagonal solver
Roll-your-own
4. Exploit the fact that the result does not require the full inverse matrix
Improved block-tridiagonal solver that calculates only the necessary blocks
5. Implement a highly parallel version of the improved block-tridiagonal solver
Pipelining
Algorithm 5.
Transfer of the Final Algorithm to GPUs
LabVIEW GPU Analysis Toolkit (alpha; public beta release soon) provides CUDA Functionality in LabVIEW (Wrapper)
The Algorithm is so Memory-Efficient that the whole Problem can be uploaded to the GPU (low time losses due to data transfer between CPU and GPU)
Further improvement of performance expected
Benchmark - Environment
IBM iDataPlex M360 computing server
2x Intel 6-core Xeon X5650 @2.67 GHz
48 GB RAM
2 NVIDIA Tesla M2070 GPUs with 3 GB RAM
Windows Server 2008 Enterprise
LabVIEW 2011 64-Bit
High-Performance Analysis Library Toolkit (64-bit beta)
GPU Analysis Toolkit (alpha)
Benchmarks – CPU
Benchmarks – CPU & GPU
Summary - NEGF
The well-known and commonly used Non-Equilibrium Green's Function approach for Simulations of Transport in Nanostructures can be significantly optimized
Its implementation on Multicore-CPUs as well as GPUs has been demonstrated with significant speed-up compared to the basic algorithm
The presented Basic Algorithm for 2D Transport of Charges can analogously be expanded to 3D and additional degrees of freedom
Current Project – multi-GPU stabilized transfer-matrix algorithm
Similar and very flexible algorithm to compute transport properties of nanostructures
Main part of the algorithm:
Calculation of C2 (per iteration)
Calculation of D2 (per iteration)
Calculation of C1 (per iteration)
Calculation of D1 (per iteration)
GPU 1
GPU 2
Thank you for your attention!
Financial support by the German Science Foundation DFG via
Research Training Group 1286 “Functional Metal-Semiconductor Hybrid-Systems”
and DFG-Project Me916/11-1 “Spin-filter cascades in InAs heterostructures”,
by the Free and Hanseatic City of Hamburg via the Excellence Cluster “Nanospintronics”,
by the Office of Naval Research via ONR-N00014110780,
and by the National Science Foundation via NSF-MRSEC DMR-0820414, NSF-DMR-1105512, NHARP