an overview of - crosslight...

15
An Overview of GPU Accelerated Crosslight FDTD Crosslight Software Inc.

Upload: others

Post on 08-Oct-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: An Overview of - Crosslight Softwarecrosslight.com/wp-content/uploads/2013/11/Crosslight_GPU_FDTD.pdfGPU Implementation Transfer all the data from CPU to GPU only at once at initialization

An Overview of GPU Accelerated Crosslight FDTD

Crosslight Software Inc.

Page 2: An Overview of - Crosslight Softwarecrosslight.com/wp-content/uploads/2013/11/Crosslight_GPU_FDTD.pdfGPU Implementation Transfer all the data from CPU to GPU only at once at initialization

Performance Acceleration Methods Implemented in Crosslight FDTD (CLFDTD)

MPI parallelization Multi-cores and

multi-CPUs

PC cluster

GPU acceleration Highly optimized for Nvidia’s latest GPU

architectures (Fermi and Kepler)

Developed on CUDA environment

Nvidia Kepler GK110

Page 3: An Overview of - Crosslight Softwarecrosslight.com/wp-content/uploads/2013/11/Crosslight_GPU_FDTD.pdfGPU Implementation Transfer all the data from CPU to GPU only at once at initialization

MPI Parallel Efficiency of Multi-core CPU

0.0

0.5

1.0

1.5

2.0

2.5

3.0

3.5

4.0

4.5

0 1 2 3 4 5 6 7

Acce

lera

tio

n

# of processors

Comparison of Parallel Efficiency Between Two Different Generation of Core i7

Core i7 860

Core i7 3930K

Parallel efficiency of multi-core CPU is bounded by memory bandwidth (~51GB/s) when handling bandwidth limited application such as FDTD.

Page 4: An Overview of - Crosslight Softwarecrosslight.com/wp-content/uploads/2013/11/Crosslight_GPU_FDTD.pdfGPU Implementation Transfer all the data from CPU to GPU only at once at initialization

GPU Card Used in Our GPU Benchmark

Manufacture NvidiaModel GeForce GTX670Architecture GK104 (Kepler)Memory 4GB GDDR5Memory Bandwidth 192 (GB/sec.)FLOPS 2.5 TFLOPSTDP 170 WattsPrice 400 USD

Page 5: An Overview of - Crosslight Softwarecrosslight.com/wp-content/uploads/2013/11/Crosslight_GPU_FDTD.pdfGPU Implementation Transfer all the data from CPU to GPU only at once at initialization

GPU Implementation

Transfer all the data from CPU to GPU only at once at initialization step.

Most of the FDTD routines are processed at GPU side so that data transfer between CPU and GPU through slow PCI-express bus is kept to a minimum.

Tuning CUDA kernel for recent GPU architecture (Fermi and Kepler)

Transfer all the data at beginning

Retrieve results after FDTD finished

Processed by thousands of cores

GPU side

CPU side

Page 6: An Overview of - Crosslight Softwarecrosslight.com/wp-content/uploads/2013/11/Crosslight_GPU_FDTD.pdfGPU Implementation Transfer all the data from CPU to GPU only at once at initialization

Benchmark Tests of GPU Version of Crosslight FDTD

3D Lens structure

CIS3D CMOS image sensor

Page 7: An Overview of - Crosslight Softwarecrosslight.com/wp-content/uploads/2013/11/Crosslight_GPU_FDTD.pdfGPU Implementation Transfer all the data from CPU to GPU only at once at initialization

3D Lens Structure

t=3000 t=2000 t=4000

t=5000

FDTD grid size 250x365x50 = 4,562,500Boundary condition PBC on X and Z

CPML on Y (16 layers)Material Silicon

9-poles Lorentz dispersionSimulation steps 5000

Pictures show time evolution of Ez. Non-dispersive setting is used for the ease of seeing lens effect.

Dielectric structure

Page 8: An Overview of - Crosslight Softwarecrosslight.com/wp-content/uploads/2013/11/Crosslight_GPU_FDTD.pdfGPU Implementation Transfer all the data from CPU to GPU only at once at initialization

Benchmark Results on 3D Lens FDTD Part I. (Non-dispersive material case)

0

500

1000

1500

2000

2500

3000

CPU(Single

process)

CPU(5process)

GPU

Simulation time (sec.)

0

10

20

30

40

50

60

CPU (Singleprocess)

CPU(5process)

GPU

Acceleration

In non-dispersive material case, GTX670 is 59 times faster than single process of Core i7 3930K

59x

Page 9: An Overview of - Crosslight Softwarecrosslight.com/wp-content/uploads/2013/11/Crosslight_GPU_FDTD.pdfGPU Implementation Transfer all the data from CPU to GPU only at once at initialization

Benchmark Results on 3D Lens FDTD Part II. (Dispersive material case)

0

1000

2000

3000

4000

5000

6000

7000

CPU(Single

process)

CPU(5process)

GPU

Simulation time (sec.)

0

5

10

15

20

25

30

35

40

CPU (Singleprocess)

CPU(5process)

GPU

Acceleration

In dispersive material case, GTX670 is 37 times faster than single process of Core i7 3930K

37x

Page 10: An Overview of - Crosslight Softwarecrosslight.com/wp-content/uploads/2013/11/Crosslight_GPU_FDTD.pdfGPU Implementation Transfer all the data from CPU to GPU only at once at initialization

Summary on 3D Lens FDTD Benchmark

Non-dispersive dielectric slab case marked 59x acceleration in compared to single process of Core i7 3930K.

In dispersive case, i.e. material is represented by 9-poles Lorentz dispersion function, 37x acceleration is marked.

Page 11: An Overview of - Crosslight Softwarecrosslight.com/wp-content/uploads/2013/11/Crosslight_GPU_FDTD.pdfGPU Implementation Transfer all the data from CPU to GPU only at once at initialization

CIS3D Structure

FDTD grid size 101x501x101 = 5,110,707Boundary condition PBC on X and Z

CPML on Y (16 layers)8 materials used Silicon (9-poles Lorentzian)

Aluminum (5-poles Lorentzian)Color filter (25-poles Lorentzian)SiO2, SiN, Photores, poly-Si, W

Simulation steps 5000

3D CMOS Image Sensor

t=100 t=300 t=500 t=700 t=900

t=1100 t=1300 t=1500

Micro-lens

Metal interconnects

Page 12: An Overview of - Crosslight Softwarecrosslight.com/wp-content/uploads/2013/11/Crosslight_GPU_FDTD.pdfGPU Implementation Transfer all the data from CPU to GPU only at once at initialization

Benchmark Results on CIS3D FDTD Part I. (Non-dispersive material case)

0

500

1000

1500

2000

2500

3000

3500

CPU(Single

process)

CPU(5process)

GPU

Simulation time (sec.)

0

10

20

30

40

50

60

70

CPU (Singleprocess)

CPU(5process)

GPU

Acceleration

In non-dispersive material case, GTX670 is 66 times faster than single process of Core i7 3930K

66x

Page 13: An Overview of - Crosslight Softwarecrosslight.com/wp-content/uploads/2013/11/Crosslight_GPU_FDTD.pdfGPU Implementation Transfer all the data from CPU to GPU only at once at initialization

Benchmark Results on CIS3D FDTD Part II. (Dispersive material case)

0

1000

2000

3000

4000

5000

6000

7000

8000

CPU(Single

process)

CPU(5process)

GPU

Simulation time (sec.)

0

5

10

15

20

25

30

35

40

45

CPU (Singleprocess)

CPU(5process)

GPU

Acceleration

In dispersive material case, GTX670 is 42 times faster than single process of Core i7 3930K

42x

Page 14: An Overview of - Crosslight Softwarecrosslight.com/wp-content/uploads/2013/11/Crosslight_GPU_FDTD.pdfGPU Implementation Transfer all the data from CPU to GPU only at once at initialization

Summary on CIS3D FDTD Benchmark

Non-dispersive case marked 66x acceleration in compared to single process of Core i7 3930K.

42x acceleration is marked for dispersive case.

GPU acceleration is even better than that of simple 3D lens structure.

Page 15: An Overview of - Crosslight Softwarecrosslight.com/wp-content/uploads/2013/11/Crosslight_GPU_FDTD.pdfGPU Implementation Transfer all the data from CPU to GPU only at once at initialization

Summary

GPU version of Crosslight FDTD achieved 66x GPU acceleration for 3D realistic device structure.

Our benchmark results show that cheap GPU card now offers computation speed comparable to a cluster of 10 Core i7 3930K PCs, which corresponds to 60 cores.