an overview of - crosslight...

An Overview of GPU Accelerated Crosslight FDTD

Crosslight Software Inc.

Performance Acceleration Methods Implemented in Crosslight FDTD (CLFDTD)

MPI parallelization Multi-cores and

multi-CPUs

PC cluster

GPU acceleration Highly optimized for Nvidia’s latest GPU

architectures (Fermi and Kepler)

Developed on CUDA environment

Nvidia Kepler GK110

MPI Parallel Efficiency of Multi-core CPU

0.0

0.5

1.0

1.5

2.0

2.5

3.0

3.5

4.0

4.5

0 1 2 3 4 5 6 7

Acce

lera

tio

n

# of processors

Comparison of Parallel Efficiency Between Two Different Generation of Core i7

Core i7 860

Core i7 3930K

Parallel efficiency of multi-core CPU is bounded by memory bandwidth (~51GB/s) when handling bandwidth limited application such as FDTD.

GPU Card Used in Our GPU Benchmark

Manufacture NvidiaModel GeForce GTX670Architecture GK104 (Kepler)Memory 4GB GDDR5Memory Bandwidth 192 (GB/sec.)FLOPS 2.5 TFLOPSTDP 170 WattsPrice 400 USD

GPU Implementation

Transfer all the data from CPU to GPU only at once at initialization step.

Most of the FDTD routines are processed at GPU side so that data transfer between CPU and GPU through slow PCI-express bus is kept to a minimum.

Tuning CUDA kernel for recent GPU architecture (Fermi and Kepler)

Transfer all the data at beginning

Retrieve results after FDTD finished

Processed by thousands of cores

GPU side

CPU side

Benchmark Tests of GPU Version of Crosslight FDTD

3D Lens structure

CIS3D CMOS image sensor

3D Lens Structure

t=3000 t=2000 t=4000

t=5000

FDTD grid size 250x365x50 = 4,562,500Boundary condition PBC on X and Z

CPML on Y (16 layers)Material Silicon

9-poles Lorentz dispersionSimulation steps 5000

Pictures show time evolution of Ez. Non-dispersive setting is used for the ease of seeing lens effect.

Dielectric structure

Benchmark Results on 3D Lens FDTD Part I. (Non-dispersive material case)

0

500

1000

1500

2000

2500

3000

CPU(Single

process)

CPU(5process)

GPU

Simulation time (sec.)

0

10

20

30

40

50

60

CPU (Singleprocess)

CPU(5process)

GPU

Acceleration

In non-dispersive material case, GTX670 is 59 times faster than single process of Core i7 3930K

59x

Benchmark Results on 3D Lens FDTD Part II. (Dispersive material case)

0

1000

2000

3000

4000

5000

6000

7000

CPU(Single

process)

CPU(5process)

GPU


0

5

10

15

20

25

30

35

40

CPU (Singleprocess)

CPU(5process)

GPU

Acceleration

In dispersive material case, GTX670 is 37 times faster than single process of Core i7 3930K

37x

Summary on 3D Lens FDTD Benchmark

Non-dispersive dielectric slab case marked 59x acceleration in compared to single process of Core i7 3930K.

In dispersive case, i.e. material is represented by 9-poles Lorentz dispersion function, 37x acceleration is marked.

CIS3D Structure

FDTD grid size 101x501x101 = 5,110,707Boundary condition PBC on X and Z

CPML on Y (16 layers)8 materials used Silicon (9-poles Lorentzian)

Aluminum (5-poles Lorentzian)Color filter (25-poles Lorentzian)SiO2, SiN, Photores, poly-Si, W

Simulation steps 5000

3D CMOS Image Sensor

t=100 t=300 t=500 t=700 t=900

t=1100 t=1300 t=1500

Micro-lens

Metal interconnects

Benchmark Results on CIS3D FDTD Part I. (Non-dispersive material case)

0

500

1000

1500

2000

2500

3000

3500

CPU(Single

process)

CPU(5process)

GPU


0

10

20

30

40

50

60

70

CPU (Singleprocess)

CPU(5process)

GPU

Acceleration

In non-dispersive material case, GTX670 is 66 times faster than single process of Core i7 3930K

66x

Benchmark Results on CIS3D FDTD Part II. (Dispersive material case)

0

1000

2000

3000

4000

5000

6000

7000

8000

CPU(Single

process)

CPU(5process)

GPU


0

5

10

15

20

25

30

35

40

45

CPU (Singleprocess)

CPU(5process)

GPU

Acceleration

In dispersive material case, GTX670 is 42 times faster than single process of Core i7 3930K

42x

Summary on CIS3D FDTD Benchmark

Non-dispersive case marked 66x acceleration in compared to single process of Core i7 3930K.

42x acceleration is marked for dispersive case.

GPU acceleration is even better than that of simple 3D lens structure.

Summary

GPU version of Crosslight FDTD achieved 66x GPU acceleration for 3D realistic device structure.

Our benchmark results show that cheap GPU card now offers computation speed comparable to a cluster of 10 Core i7 3930K PCs, which corresponds to 60 cores.

an overview of - crosslight...

Documents