
Jan Jacob, Institute of Applied Physics and Advanced Microstructure Research Center

GPU-based High-Performance Simulations for Spintronics

Jan Jacob¹, Darren Schmidt², Qing Ruan², Lothar Wenzel², Vivek Amin³, and Jairo Sinova³

¹University of Hamburg, Institute of Applied Physics, Hamburg, Germany

²National Instruments, Austin, TX, USA

³Texas A&M University, Department of Physics and Astronomy, College Station, TX, USA

NVIDIA GPU Technology Conference, San Jose, CA, USA, May 14-17, 2012


Overview

Introduction to the Underlying Physics

The Basic Algorithm

Optimizations

Benchmarks

Multicore-CPUs

NVIDIA Tesla GPUs

March 2012 GTC 2012, San Jose, May 14-17 2012



The Physics



Transport of Charge / Spin / Heat, etc. through a Scattering Region can be described by Schrödinger’s Equation HΨ = EΨ

More complex structures: numerically solve in a tight-binding model (only nearest neighbor interaction)

Hamiltonian H becomes a matrix

Commonly used approach: Green’s Function Method


More details

Conductivity can be obtained by:

The scattering matrix can be determined by:

Discretized version:

Differential Operator -> Matrix Operator

Definition of derivatives:
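The step from differential operator to matrix operator can be illustrated with a central-difference second derivative. The following is a minimal sketch in Python with NumPy, not the formula shown on the slide; the function names, the hard-wall boundaries, and the hopping parameter t = ħ²/(2ma²) are assumptions made for illustration.

```python
import numpy as np

def second_derivative_matrix(n, a=1.0):
    """Central-difference matrix for d^2/dx^2 on n grid points with spacing a."""
    off = np.ones(n - 1)
    return (np.diag(-2.0 * np.ones(n)) + np.diag(off, 1) + np.diag(off, -1)) / a**2

def tight_binding_hamiltonian(n, t=1.0, potential=None):
    """H = -t a^2 d^2/dx^2 + V with t = hbar^2 / (2 m a^2), hard-wall boundaries."""
    v = np.zeros(n) if potential is None else np.asarray(potential, dtype=float)
    # With a = 1, -t * D2 puts 2t on the diagonal and -t on both off-diagonals,
    # which is the nearest-neighbour (tight-binding) form of the Hamiltonian.
    return -t * second_derivative_matrix(n) + np.diag(v)
```

Applied to samples of x², the matrix returns 2 at every interior grid point, as expected for a second derivative.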




More details 2

Green’s function of the system (including self-energy term to describe the leads):

Transmission is then obtained by:
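The formulas of these two slides are not preserved in this text. For orientation, the standard textbook expressions of the Green's function method are given below. They are assumed here and may differ in notation from the slides: Σ_L and Σ_R denote the self-energies of the left and right lead, and the last line is the Landauer conductance per spin channel at the Fermi energy E_F.

```latex
\begin{aligned}
G(E) &= \left[ E\,I - H - \Sigma_L - \Sigma_R \right]^{-1} \\
\Gamma_{L,R} &= i\left( \Sigma_{L,R} - \Sigma_{L,R}^{\dagger} \right) \\
T(E) &= \operatorname{Tr}\left[ \Gamma_L \, G \, \Gamma_R \, G^{\dagger} \right] \\
\mathcal{G} &= \frac{e^2}{h}\, T(E_F)
\end{aligned}
```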




The Algorithm

1. The Hamiltonian Matrix defines the System

2. The Transverse Modes define the Occupied States

3. The Self-Energies describe the Contact Leads

4. The Green’s Function of the System describes its Scattering Properties

5. The Γ Matrices connect the Leads to the System

6. The Transmission and Reflection are obtained by Multiplying the G and Γ Matrices.

1. Define Hamiltonian Matrix (User Input)

2. Obtain Transverse Modes (Calculate Eigensystem)

3. Obtain Self-Energies (Scalar Operations)

4. Obtain Green’s Function (Matrix Inversion)

5. Obtain Γ (Scalar Operations)

6. Obtain Transmission (Matrix Multiplication)
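The six steps can be sketched for a clean 2D strip using dense matrices. This is a minimal illustration in Python with NumPy, not the authors' LabVIEW implementation; the lattice layout, the hopping t, the on-site energy 4t, the lattice-lead self-energy, and all function names are assumptions.

```python
import numpy as np

def strip_hamiltonian(width, length, t=1.0):
    """Step 1: nearest-neighbour Hamiltonian of a width x length strip."""
    n = width * length
    H = np.zeros((n, n), dtype=complex)
    for x in range(length):
        for y in range(width):
            i = x * width + y
            H[i, i] = 4.0 * t
            if y + 1 < width:                 # transverse neighbour
                H[i, i + 1] = H[i + 1, i] = -t
            if x + 1 < length:                # longitudinal neighbour
                H[i, i + width] = H[i + width, i] = -t
    return H

def lead_self_energy(energy, width, t=1.0):
    """Steps 2-3: transverse modes, then one scalar self-energy per mode."""
    h_y = 2.0 * t * np.eye(width) - t * (np.eye(width, k=1) + np.eye(width, k=-1))
    eps, chi = np.linalg.eigh(h_y)            # step 2: eigensystem
    x = 1.0 - (energy - eps) / (2.0 * t)      # cos(k a) of each mode
    lam = x + 1j * np.sqrt(1.0 - x**2 + 0j)   # exp(i k a)
    lam = np.where(x < -1.0, 1.0 / lam, lam)  # evanescent modes must decay
    return (chi * (-t * lam)) @ chi.T         # step 3, back in the site basis

def transmission(energy, width, length, t=1.0):
    n = width * length
    H = strip_hamiltonian(width, length, t)
    sigma = lead_self_energy(energy, width, t)
    sigma_l = np.zeros((n, n), dtype=complex)
    sigma_r = np.zeros((n, n), dtype=complex)
    sigma_l[:width, :width] = sigma           # left lead couples to first slice
    sigma_r[-width:, -width:] = sigma         # right lead couples to last slice
    # Step 4: Green's function by dense matrix inversion.
    G = np.linalg.inv(energy * np.eye(n) - H - sigma_l - sigma_r)
    # Step 5: Gamma matrices.
    gamma_l = 1j * (sigma_l - sigma_l.conj().T)
    gamma_r = 1j * (sigma_r - sigma_r.conj().T)
    # Step 6: transmission by matrix multiplication.
    return np.trace(gamma_l @ G @ gamma_r @ G.conj().T).real
```

For a clean strip the transmission should equal the number of propagating transverse modes; with these assumptions, `transmission(2.5, 3, 5)` should come out close to 2.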




Optimizations

Step 1 requires creating large matrices → memory issues!

Can be reduced by exploiting the extreme sparsity of the matrices

Creating only small blocks of the large matrices, right when they are needed, is even more efficient

Step 4 requires inverting these large matrices → computational issues

Main issue of the algorithm; addressed in detail on the next slides
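The memory argument can be made concrete with a small sketch. It assumes a 2D lattice stored slice by slice, complex double precision (16 bytes per entry), and hypothetical function names; none of this is taken from the talk itself.

```python
import numpy as np

def slice_block(width, t=1.0, potential=None):
    """Diagonal block of one transverse slice (width x width), built on request."""
    v = np.zeros(width) if potential is None else np.asarray(potential, dtype=float)
    return (np.diag(4.0 * t + v)
            - t * (np.eye(width, k=1) + np.eye(width, k=-1))).astype(complex)

def coupling_block(width, t=1.0):
    """Block that couples neighbouring slices; identical for every slice pair."""
    return -t * np.eye(width, dtype=complex)

def storage_bytes(width, length, itemsize=16):
    """Bytes for the dense matrix versus one diagonal plus one coupling block."""
    dense = (width * length) ** 2 * itemsize
    on_demand = 2 * width ** 2 * itemsize
    return dense, on_demand
```

For a hypothetical lattice of 100 x 1000 sites, `storage_bytes(100, 1000)` gives 160 GB for the dense matrix but only 320 kB for the two blocks that are live at any one time.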





Our View Of The Computational Map

[Figure: map of problem size versus maximum allowed cycle time (1 ms, 10 ms, 100 ms, 1 s), with regions labeled FPGA, CPU, GPU, RT-GPU, and "CPU or GPU"; annotated "Power vs. $$$".]



Algorithm 0.




Algorithm 1.



1. Use optimized, multicore-ready inversion and multiplication algorithms

Intel MKL wrapped into LabVIEW via High-Performance Analysis Library (beta)


Algorithm 2.



2. Use sparsity of the matrices

PARDISO direct sparse linear solver


Optimizations for the matrix inversion

3. Use block-tridiagonal solver

Roll-your-own
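The talk does not spell out the authors' own solver. One common way to write a block-tridiagonal solver is block Gaussian elimination (the block Thomas algorithm), sketched below in Python with NumPy. The function name and the list-of-blocks layout are assumptions; the sketch does no pivoting between blocks.

```python
import numpy as np

def block_tridiag_solve(diag, lower, upper, rhs):
    """Solve A X = rhs for block-tridiagonal A.

    diag[i]  = A[i, i]      (n blocks)
    lower[i] = A[i+1, i]    (n - 1 blocks)
    upper[i] = A[i, i+1]    (n - 1 blocks)
    rhs[i]   = block row i of the right-hand side
    """
    n = len(diag)
    d = [diag[0].copy()]
    b = [rhs[0].copy()]
    # Forward sweep: eliminate the sub-diagonal blocks one slice at a time.
    for i in range(1, n):
        f = lower[i - 1] @ np.linalg.inv(d[i - 1])
        d.append(diag[i] - f @ upper[i - 1])
        b.append(rhs[i] - f @ b[i - 1])
    # Backward sweep: substitute from the last block row upwards.
    x = [None] * n
    x[-1] = np.linalg.solve(d[-1], b[-1])
    for i in range(n - 2, -1, -1):
        x[i] = np.linalg.solve(d[i], b[i] - upper[i] @ x[i + 1])
    return x
```

Only matrices of the block size are ever inverted, so for n blocks of size m the cost grows roughly as n·m³ instead of (n·m)³ for a dense inversion.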




Algorithm 3.






Optimizations for the matrix inversion

4. Make use of the fact that the result does not require the full inverse matrix

Improved block-tridiagonal solver that calculates only the necessary blocks
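As an illustration of computing only the necessary blocks: the transmission needs just the block of the Green's function that connects the first slice to the last. A recursive sweep, sketched below in Python with NumPy, builds that single block from slice-sized matrices. This is a textbook recursive Green's function scheme for a clean 2D strip, assumed here for illustration and not necessarily the authors' improved solver; all names and lattice parameters are hypothetical.

```python
import numpy as np

def lead_self_energy(energy, width, t=1.0):
    """Self-energy of a semi-infinite lattice lead, built from transverse modes."""
    h_y = 2.0 * t * np.eye(width) - t * (np.eye(width, k=1) + np.eye(width, k=-1))
    eps, chi = np.linalg.eigh(h_y)
    x = 1.0 - (energy - eps) / (2.0 * t)      # cos(k a) of each mode
    lam = x + 1j * np.sqrt(1.0 - x**2 + 0j)   # exp(i k a)
    lam = np.where(x < -1.0, 1.0 / lam, lam)  # evanescent modes must decay
    return (chi * (-t * lam)) @ chi.T

def transmission_rgf(energy, width, length, t=1.0):
    """Transmission from the single block G[N, 1]; no large matrix is formed."""
    sigma = lead_self_energy(energy, width, t)
    gamma = 1j * (sigma - sigma.conj().T)
    hop = -t * np.eye(width, dtype=complex)   # Hamiltonian block between slices
    h_slice = 4.0 * t * np.eye(width) - t * (np.eye(width, k=1) + np.eye(width, k=-1))
    g = None   # Green's function of the last slice attached so far
    p = None   # block connecting the newest slice to the first slice
    for x in range(length):
        a = energy * np.eye(width) - h_slice
        if x == 0:
            a = a - sigma                     # left lead
        if x == length - 1:
            a = a - sigma                     # right lead
        if x == 0:
            g = np.linalg.inv(a)
            p = g
        else:
            g = np.linalg.inv(a - hop @ g @ hop.conj().T)
            p = g @ hop @ p
    return np.trace(gamma @ p @ gamma @ p.conj().T).real
```

Memory use is independent of the strip length, so long systems such as `transmission_rgf(2.5, 3, 200)` stay cheap; for this clean strip the result should again be close to the number of propagating modes.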




Algorithm 4.




Optimizations for the matrix inversion

1. Use optimized, multicore-ready inversion and multiplication algorithms

Intel MKL wrapped into LabVIEW via High-Performance Analysis Library (beta)

2. Use sparsity of the matrices

PARDISO direct sparse linear solver

3. Use block-tridiagonal solver

Roll-your-own

4. Make use of the fact that the result does not require the full inverse matrix

Improved block-tridiagonal solver that calculates only the necessary blocks

5. Implement a highly parallel version of the improved block-tridiagonal solver

Pipelining




Algorithm 5.




Transfer of the Final Algorithm to GPUs

LabVIEW GPU Analysis Toolkit (alpha; public beta release soon) provides CUDA Functionality in LabVIEW (Wrapper)

The Algorithm is so Memory-Efficient that the whole Problem can be uploaded to the GPU (little time lost to data transfer between CPU and GPU)

Further improvement of performance expected
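Whether a problem fits on the GPU can be estimated with simple arithmetic. The sketch below assumes that a block-wise solver keeps a fixed number of slice-sized complex blocks resident at once. The count of six blocks is a placeholder chosen for illustration, not a figure from the talk, and it ignores workspace needed by library routines.

```python
import math

def max_block_width(memory_bytes, resident_blocks=6, itemsize=16):
    """Largest W such that resident_blocks complex W x W blocks fit in memory."""
    return math.isqrt(memory_bytes // (resident_blocks * itemsize))

def fits_on_gpu(width, memory_bytes, resident_blocks=6, itemsize=16):
    """True if resident_blocks blocks of size width x width fit in memory."""
    return resident_blocks * width * width * itemsize <= memory_bytes
```

Under these assumptions, 3 GB of GPU memory would hold blocks up to a transverse width of 5792 sites, independent of the length of the system.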




Benchmark - Environment

IBM iDataPlex M360 computing server

2x Intel 6-core Xeon X5650 @2.67 GHz

48 GB RAM

2 NVIDIA Tesla M2070 GPUs with 3 GB RAM

Windows Server 2008 Enterprise

LabVIEW 2011 64-Bit

High-Performance Analysis Library Toolkit (64-bit beta)

GPU Analysis Toolkit (alpha)




Benchmarks – CPU




Benchmarks – CPU & GPU




Summary - NEGF

The well-known and commonly used Non-Equilibrium Green’s Function approach for Simulations of Transport in Nanostructures can be significantly optimized

Its implementation on Multicore-CPUs as well as GPUs has been demonstrated with significant speed-up compared to the basic algorithm

The presented Basic Algorithm for 2D Transport of Charges can analogously be extended to 3D and to additional degrees of freedom




Current Project – multi-GPU stabilized transfer-matrix algorithm

Similar and very flexible algorithm to compute transport properties of nanostructures

Main part of the algorithm:




[Diagram: the per-iteration calculations of C1, D1, C2, and D2, distributed across GPU 1 and GPU 2.]


Thank you for your attention!

Financial support by the German Science Foundation DFG via

Research Training Group 1286 “Functional Metal-Semiconductor Hybrid-Systems”

and DFG-Project Me916/11-1 “Spin-filter cascades in InAs heterostructures”,

by the Free and Hanseatic City of Hamburg via the Excellence Cluster “Nanospintronics”,

by the Office of Naval Research via ONR-N00014110780,

and by the National Science Foundation via NSF-MRSEC DMR-0820414, NSF-DMR-1105512, NHARP

