implementation and optimization of fdtd kernels by using cache-aware time-skewing algorithms

IMPLEMENTATION AND OPTIMIZATION OF FDTD KERNELS BY USING CACHE-AWARE TIME-SKEWING ALGORITHMS THESIS PRESENTATION 1 SERHAN OZBEY WARSAW UNIVERSITY OF TECHNOLOGY INSTITUTE OF TELECOMMUNICATIONS 16/03/2017

Upload: serhan-oezbey

Post on 23-Jan-2018


TRANSCRIPT

IMPLEMENTATION AND OPTIMIZATION OF FDTD KERNELS BY USING CACHE-AWARE TIME-SKEWING ALGORITHMS

THESIS PRESENTATION

SERHAN OZBEY WARSAW UNIVERSITY OF TECHNOLOGY INSTITUTE OF TELECOMMUNICATIONS 16/03/2017

ABSTRACT

The main goal of this thesis was to implement and optimize cache-aware time-skewing algorithms for FDTD kernels in order to reduce cache misses and processor idle time.

Electromagnetic simulations need large-scale discretization of space and a large number of computations.

An efficient memory access pattern is therefore essential.

A naive implementation of the FDTD method is a kernel of nested loops that reads and writes memory to calculate the EM fields.

By exploiting the data dependencies and locality of the FDTD kernel and making better use of the memory hierarchy, the processor's idle time can be reduced.

FDTD execution time can be long if the nested loops do not traverse the iteration space in an order that uses the data dependencies efficiently.

This idle time can be reduced by skewing and blocking the time and space domains, so that loop iterations follow the data dependencies and make better use of the fast CPU cache memories.

TOPICS

1. INTRODUCTION

2. LITERATURE REVIEW

3. METHODOLOGY

4. RESULTS AND DISCUSSION

5. CONCLUSIONS


INTRODUCTION

Sustainable and reliable telecommunication networks demand efficient and durable network components. This requires modelling and producing devices that cope well with the electromagnetic disturbances that affect their performance.

Factors such as electromagnetic radiation and scattering should be considered through electromagnetic modelling of the devices, which simulates how they interact with natural conditions and with the materials present in the environment.

INTRODUCTION

Computational electromagnetics (electromagnetic modeling) is the process of modeling the interaction of electromagnetic fields with physical objects and the environment. It requires solving Maxwell's equations, which yield the electric and magnetic fields for given boundary conditions and constitutive relations.

Using computationally efficient approximations to Maxwell's equations, it is used to calculate:

antenna performance

electromagnetic compatibility

radar cross section

electromagnetic wave propagation when not in free space.

INTRODUCTION

Computational electromagnetics has been the answer for electromagnetic simulations using the latest available technology. Many methods now exist in this domain, such as integral-form Maxwell's equation solvers like MoM and differential-form solvers like FEM and FDTD.

To achieve high detail and accuracy, these solvers need a very fine discretization of space and time.

Memory must therefore be used efficiently, so that spatial and temporal data are exchanged quickly while the field values are calculated from Maxwell's equations up to the end of the given time.

INTRODUCTION

• FDTD, a numerical analysis technique widely used in computational electromagnetics, belongs to the general class of grid-based differential numerical modeling methods. The time-dependent Maxwell's equations (in partial differential form) are discretized using central-difference approximations to the space and time partial derivatives.
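As a reminder of what a central-difference approximation looks like, here it is for a generic field quantity $f$ sampled with spacing $\Delta x$ and time step $\Delta t$. The symbols are generic and are not notation taken from the thesis.

```latex
\frac{\partial f}{\partial x}(x, t) \approx
  \frac{f\left(x + \tfrac{\Delta x}{2},\, t\right) - f\left(x - \tfrac{\Delta x}{2},\, t\right)}{\Delta x},
\qquad
\frac{\partial f}{\partial t}(x, t) \approx
  \frac{f\left(x,\, t + \tfrac{\Delta t}{2}\right) - f\left(x,\, t - \tfrac{\Delta t}{2}\right)}{\Delta t}
```

Both approximations are second-order accurate, with error terms of order $\Delta x^2$ and $\Delta t^2$ respectively.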


FDTD METHOD

Solves Maxwell's equations in the time domain.

Each frame (one time iteration of the code) can be saved to build a movie.

An electric field changing at a particular point induces a curling (circulating) magnetic field.

Likewise, a changing magnetic field induces a curling electric field.

This leads to a leapfrog scheme of calculations, as shown in the figure on the right-hand side.

FDTD METHOD

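for t in 0 to NT-1
    # update E from the neighbouring H values
    for i in 1 to N-1
        E[i] = k1*E[i] + k2*(H[i] - H[i-1])
    end for
    # update H from the E values just computed
    # (reading E[i+1] at i = N-1 assumes E has an entry at index N)
    for i in 1 to N-1
        H[i] += E[i] - E[i+1]
    end for
end for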

A naïve 1D FDTD algorithm.

It updates all N field values at each of the NT time steps.
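The pseudocode above can be made runnable as the following minimal Python sketch. The function name `fdtd_1d_naive`, the choice of an `E` array with one more entry than `H` (so that `E[i+1]` stays in range), and the coefficient values in the usage lines are assumptions made for illustration, not details taken from the thesis.

```python
def fdtd_1d_naive(E, H, NT, k1, k2):
    """Naive 1D FDTD: sweep the whole grid once per time step, in place.

    Assumption: len(E) == len(H) + 1, so E[i + 1] is valid for i = N - 1.
    E[0], E[N] and H[0] act as fixed boundary values and are never updated.
    """
    N = len(H)
    for _ in range(NT):
        for i in range(1, N):      # E update reads H from the previous step
            E[i] = k1 * E[i] + k2 * (H[i] - H[i - 1])
        for i in range(1, N):      # H update reads the E values just written
            H[i] += E[i] - E[i + 1]
    return E, H


# Hypothetical example: a unit pulse in E on a small grid, two time steps.
E, H = fdtd_1d_naive([0.0, 0.0, 1.0, 0.0, 0.0], [0.0] * 4, NT=2, k1=1.0, k2=-0.5)
print(E)  # [0.0, 0.5, 0.0, 0.5, 0.0]
```

Each time step streams through both arrays completely, so for grids larger than the cache every value is fetched from main memory again at every step.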


INTRODUCTION

• FDTD remains a challenging task for the computers and devices running it because of its high demands on computational power and memory bandwidth.

• Programs cannot fully exploit the growth in processor power that follows Moore's Law, as processors spend more than 80% of their time waiting for data to process or to be received from main memory.

INTRODUCTION

• Stencil codes such as FDTD kernels contain nested loops that force the processor to perform many memory reads and writes, because problem sizes are generally too big to fit inside the largest cache of the processor.

• The defining feature of stencil codes is that each data element is related to its neighbours.

• In FDTD kernels, this relation holds between the E-fields and H-fields: as a result of Maxwell's equations, each element depends on elements that are nearby in space and time.

A data dependency graph showing how elements at different points in space and time depend on each other's computations, as given by the FDTD formula.

Values that can be computed from a tile after some values are loaded initially.
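The idea in the two figure captions above can be sketched in a few lines of Python. The sketch assumes the dependencies of the naive 1D kernel (a new `E[i]` reads `E[i]`, `H[i]`, `H[i-1]`; a new `H[i]` reads `H[i]` and the new `E[i]`, `E[i+1]`) and a loaded window `E[a..b]`, `H[a..b]`. The function name and the window are hypothetical, and physical boundary cells are ignored.

```python
def computable_per_level(a, b):
    """Count the E and H values that can be advanced at each successive time
    level using only the values E[a..b] and H[a..b] loaded at the start."""
    E_have = set(range(a, b + 1))
    H_have = set(range(a, b + 1))
    counts = []
    while True:
        # A new E[i] needs the current E[i], H[i] and H[i-1].
        E_new = {i for i in E_have if i in H_have and i - 1 in H_have}
        # A new H[i] needs the current H[i] and the new E[i], E[i+1].
        H_new = {i for i in H_have if i in E_new and i + 1 in E_new}
        if not E_new and not H_new:
            break
        counts.append((len(E_new), len(H_new)))
        E_have, H_have = E_new, H_new
    return counts


print(computable_per_level(0, 7))  # [(7, 6), (5, 4), (3, 2), (1, 0)]
```

In this example, 16 loaded values allow 28 updates before any new data is needed, and the computable region shrinks by one cell on each side per time step.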


As programs cannot fully exploit the growth in processor power that follows Moore's Law, one factor is becoming more and more important: how well the algorithm takes advantage of the memory hierarchy, that is, its memory performance.

Memory access speed is very important in modern microprocessors. For this reason, the work focuses on cache memory hierarchies and on making the most of effective cache replacement methods to

reduce cache miss rates

improve locality of data

enable fast data access between processor and memory through effective cache usage.

INTRODUCTION

Cache-aware time-skewing algorithms take advantage of explicitly specified details of the processor they run on. Because the algorithm stores data together in the same block, the processor's memory page size and cache line size should be included in the algorithm.

This is vital because the algorithm takes advantage of the processor's cache behavior: its main objective is to minimize the movement of memory pages in the processor's cache.
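One way such processor details could enter the algorithm is through the choice of tile width. The sketch below is an assumption for illustration only and is not the sizing rule used in the thesis: the function name, the `fill` fraction, and the cache sizes in the usage line are all hypothetical.

```python
def tile_width(cache_bytes, line_bytes, value_bytes=8, arrays=2, fill=0.5):
    """Pick a tile width (in grid cells) so that the tile's share of all
    arrays fits in a fraction `fill` of the cache, rounded down to a whole
    number of cache lines. Never returns less than one cache line of cells."""
    cells_per_line = line_bytes // value_bytes
    budget = int(cache_bytes * fill) // (arrays * value_bytes)
    return max(cells_per_line, (budget // cells_per_line) * cells_per_line)


# Hypothetical 32 KiB cache with 64-byte lines, E and H stored as doubles.
print(tile_width(32 * 1024, 64))  # 1024
```

Leaving part of the cache unused (`fill` below 1) is a common precaution against conflict misses and against other data competing for the same cache.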

The objectives focus on loop tiling, time skewing, and reducing CPU stalls with data locality optimizations. A significant rise in performance is expected as a result of these optimization steps.
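A minimal Python sketch of how loop tiling and time skewing can be combined for the 1D kernel is given below. It is one possible formulation under stated assumptions, not the thesis code: the function names, the tile width `B`, the time-block length `TB`, and the test sizes are hypothetical, and `E` is assumed to have one more entry than `H`. Python will not show the cache benefit, so the sketch only demonstrates that the reordered traversal produces the same result as the naive sweep.

```python
import random


def fdtd_naive(E, H, NT, k1, k2):
    """Reference: one full sweep of E, then of H, per time step."""
    N = len(H)                     # assumption: len(E) == N + 1
    for _ in range(NT):
        for i in range(1, N):
            E[i] = k1 * E[i] + k2 * (H[i] - H[i - 1])
        for i in range(1, N):
            H[i] += E[i] - E[i + 1]


def fdtd_skewed(E, H, NT, k1, k2, B, TB):
    """Same updates, reordered into space tiles of width B that advance
    up to TB time steps before moving on to the next tile.

    The tile window shifts one cell to the left per time step (the skew),
    so every value a tile reads is either inside the tile or was already
    brought to the right time level by a tile to its left.
    """
    N = len(H)
    for t0 in range(0, NT, TB):                # time blocks
        steps = min(TB, NT - t0)
        # Extra tiles past the right edge cover the left-shifted windows.
        for x0 in range(1, N + steps, B):      # space tiles, left to right
            for tt in range(steps):            # time steps inside the tile
                for i in range(max(1, x0 - tt), min(N, x0 + B - tt)):
                    E[i] = k1 * E[i] + k2 * (H[i] - H[i - 1])
                # H lags E by one cell so that E[i + 1] is already updated.
                for i in range(max(1, x0 - tt - 1), min(N, x0 + B - tt - 1)):
                    H[i] += E[i] - E[i + 1]


# Hypothetical sizes and coefficients, chosen only to exercise the code.
random.seed(0)
N, NT = 37, 11
E0 = [random.random() for _ in range(N + 1)]
H0 = [random.random() for _ in range(N)]

E_ref, H_ref = E0[:], H0[:]
fdtd_naive(E_ref, H_ref, NT, 1.0, -0.5)
E_sk, H_sk = E0[:], H0[:]
fdtd_skewed(E_sk, H_sk, NT, 1.0, -0.5, B=5, TB=4)
print(E_sk == E_ref and H_sk == H_ref)
```

In a compiled language, `B` would be chosen so that a tile stays resident in cache while it is advanced through `TB` time steps, so each value is loaded from main memory once per time block rather than once per time step.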


INTRODUCTION


FDTD solvers demand expensive hardware with parallelism features to run smoothly and accurately.

Our objective was to extend previous research that proposed ways to address these demands.

The main objective of this thesis is to achieve better reliability, cache usage and execution time for FDTD codes, so that they solve the given problems smoothly and accurately, while also taking into account the physics and engineering aspects of the problem, which previous research has lacked.

Previously known work on code optimizations such as loop blocking, cache-aware algorithms and time-skewing techniques is extended and presented in detail as a contribution, instead of including only implicit information.

LITERATURE REVIEW

FDTD method

References for understanding the problem and implementation of theory to code

Changes and proposals for new FDTD techniques

Solving FDTD problems for extreme conditions and specific problems

Photonics, biomedicine

Solving the Schrödinger equation with a generalized FDTD approach

Different software implementations, such as V2D.

LITERATURE REVIEW

Memory hierarchy and the "memory wall"

Referring to important concepts of memory management and optimizations such as

Memory hierarchy

‘Memory wall’ term

Von Neumann bottleneck

Roofline model

Memory mountain


LITERATURE REVIEW

Stencil codes and data dependencies

Definition and types of stencils

Approximating problem into stencil code

Methodology of determination of data dependencies

Other terms such as: Parallelism, GPU

Locality optimizations

Understanding the ‘Principle of locality’

Important terms related to locality features of codes (machine balance, computer balance, scalable locality)

Studies of different code optimization algorithms

METHODOLOGY

Research design

Code generation and validation

Dependence and loop iteration analysis

Finding optimal tiling and skewing

Methodological assumptions

METHODOLOGY

Instrumentation

Hardware

Software

Computer Benchmark

Data Processing and Analysis


DATA PROCESSING AND ANALYSIS

Example


Example

METHODOLOGY


RESULTS AND DISCUSSIONS


Generation and validation of codes

1D-FDTD

1D-FDTD


1D-FDTD


1D-FDTD


RESULTS AND DISCUSSIONS


2D-FDTD

RESULTS AND DISCUSSIONS


RESULTS AND DISCUSSIONS


Outputs and Discussion

Summarizing, for both 1D FDTD and 2D FDTD:

Cache profiling

Execution time

Data types and Programming Languages

Compiler optimizations

Future work

RESULTS AND DISCUSSIONS

CONCLUSIONS

Computational electromagnetics has gained much more importance with the improvements in, and demands of, related technologies such as antenna design, biomedicine and wireless communications.

A good software implementation is a must for a highly memory- and computation-intensive code kernel such as FDTD.

In this thesis, previous work from the literature was extended, and the improvements from software optimizations such as loop blocking, cache-aware algorithms and time-skewing were demonstrated for 1D and 2D FDTD kernels.

CONCLUSIONS

The differences between the naive FDTD codes and the applied algorithms were shown in the results for the 1D and 2D cases.

The results indicate that applying time-skewing algorithms, in the way done in this thesis, increases the total number of data references but gives a much better cache hit rate than the other codes.

The benefit of time-skewing is much more visible in the 2D code in terms of cache misses.

Run-time graphs and improved L1 and L3 cache miss rates for the 1D and 2D cases were achieved and demonstrated in the results.

Line-by-line cache misses are explained throughout the thesis.