Optimization of a parallel 3d-FFT with non-blocking collective operations

Torsten Hoefler, Chair of Computer Architecture, Technical University of Chemnitz
Gilles Zérah, Département de Physique Théorique et Appliquée, Commissariat à l’Énergie Atomique/DAM

3rd International ABINIT Developer Workshop
Liège, Belgium
29th January 2007

Torsten Hoefler, Gilles Zérah Overlap in 3d-FFTs

1 Introduction
  Short introduction to non-blocking collectives

2 Three dimensional FFTs
  Traditional parallel 3d-FFT
  Parallel 3d-FFT with maximum overlap
  Parallel cache optimized 3d-FFT with partial overlap

3 Implementation in ABINIT
  Avoidance of the transformation of zeroes
  Autotuning of parameters
  Preliminary Performance Results

Non-blocking Collective Operations

Advantages - Overlap
- leverage hardware parallelism (e.g. InfiniBand™)
- overlap similar to non-blocking point-to-point

Usage?
- extension to MPI-2
- “mixture” between non-blocking point-to-point and collectives
- uses MPI_Requests and MPI_Test/MPI_Wait

    MPI_Ibcast(buf1, p, MPI_INT, 0, comm, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);

Availability
- Prototype LibNBC: requires ANSI C and MPI-2
- LibNBC download and documentation: http://www.unixer.de/NBC

Documentation
- T. Hoefler, J. Squyres, W. Rehm, and A. Lumsdaine: A Case for Non-Blocking Collective Operations. In Frontiers of High Performance Computing and Networking, pages 155-164, Springer Berlin, ISBN: 978-3-540-49860-5, Dec. 2006
- T. Hoefler, J. Squyres, G. Bosilca, G. Fagg, A. Lumsdaine, and W. Rehm: Non-Blocking Collective Operations for MPI-2. Open Systems Lab, Indiana University. Presented in Bloomington, IN, USA, Aug. 2006

Performance Benefits?

[Figure: MPI_Alltoall latency on the “tantale” cluster@CEA. x-axis: Datasize (0 to 60000); y-axis: Time in microseconds (0 to 12000); curves: MPI Latency, NBC Latency (max. overlap)]

Domain Decomposition

Discretized 3D Domain (FFT-Box)
[Figure: the FFT box as a 3D grid with axes x, y, z]

Distributed 3D Domain
[Figure: the FFT box with axes x, y, z, distributed among processes 0, 1 and 2]
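The distribution step can be illustrated with a minimal sketch that splits n planes of the FFT box as evenly as possible among p processes. The name `block_range` and the rule that the lowest ranks take the leftover planes are assumptions made for illustration; the slides do not state which axis is distributed or how remainders are handled.

```c
#include <assert.h>

/* Hypothetical helper (not part of ABINIT or LibNBC): contiguous block
 * distribution of n planes over p processes. Ranks below n % p own one
 * extra plane. The first owned plane is written to *first and the
 * number of owned planes to *count. */
void block_range(int n, int p, int rank, int *first, int *count)
{
    assert(n >= 0 && p > 0 && rank >= 0 && rank < p);
    int base = n / p;    /* planes every rank owns */
    int extra = n % p;   /* leftover planes, one each for the lowest ranks */
    *count = base + (rank < extra ? 1 : 0);
    *first = rank * base + (rank < extra ? rank : extra);
}
```

For the 3x3x3 grid on three processes used in the later figures, each rank would own exactly one plane under this scheme.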

1D Transformation

1D Transformation in y Direction
[Figure: FFT box with axes x, y, z; lines along y are transformed]

1D Transformation in z Direction
[Figure: FFT box with axes x, y, z; lines along z are transformed]

1D Transformation in x Direction
[Figure: FFT box with axes x, y, z; lines along x are transformed]
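The three passes (y, then z, then x) can be sketched serially as follows. This is only an illustration under stated assumptions: it uses a naive 4-point DFT instead of a real FFT, a 4x4x4 grid so that the twiddle factors are exact, and a memory layout `g[(z*GRID_N + y)*GRID_N + x]` chosen for the example. None of the names come from ABINIT.

```c
#include <stddef.h>

#define GRID_N 4  /* grid points per axis; 4 keeps the twiddle factors exact */

typedef struct { double re, im; } cplx;

/* exp(-2*pi*i*k/4) for k = 0..3, i.e. 1, -i, -1, i */
static const cplx W4[GRID_N] = { {1, 0}, {0, -1}, {-1, 0}, {0, 1} };

/* Naive in-place DFT of GRID_N elements spaced `stride` apart. */
void dft1d(cplx *x, size_t stride)
{
    cplx out[GRID_N];
    for (int k = 0; k < GRID_N; k++) {
        out[k].re = 0.0;
        out[k].im = 0.0;
        for (int j = 0; j < GRID_N; j++) {
            cplx w = W4[(j * k) % GRID_N];
            cplx v = x[j * stride];
            out[k].re += v.re * w.re - v.im * w.im;
            out[k].im += v.re * w.im + v.im * w.re;
        }
    }
    for (int k = 0; k < GRID_N; k++)
        x[k * stride] = out[k];
}

/* 3D DFT as three sweeps of 1D transforms: y, then z, then x. */
void dft3d(cplx *g)
{
    for (int z = 0; z < GRID_N; z++)       /* y direction, stride GRID_N */
        for (int x = 0; x < GRID_N; x++)
            dft1d(&g[(z * GRID_N) * GRID_N + x], GRID_N);
    for (int y = 0; y < GRID_N; y++)       /* z direction, stride GRID_N^2 */
        for (int x = 0; x < GRID_N; x++)
            dft1d(&g[y * GRID_N + x], (size_t)GRID_N * GRID_N);
    for (int z = 0; z < GRID_N; z++)       /* x direction, stride 1 */
        for (int y = 0; y < GRID_N; y++)
            dft1d(&g[(z * GRID_N + y) * GRID_N], 1);
}
```

In the parallel version, one of these sweeps runs along the distributed axis, which is why an all-to-all exchange is needed before it.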

Non-blocking 3D-FFT

Derivation from “normal” implementation
- distribution identical to “normal” 3D-FFT
- first FFT in y direction and local data transpose

Design Goals to Minimize Communication Overhead
- start communication as early as possible
- achieve maximum overlap time

Solution
- start NBC_Ialltoall as soon as first xz-plane is ready
- calculate next xz-plane
- start next communication accordingly ...
- collect multiple xz-planes (A2A data size)
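Collecting several xz-planes before each NBC_Ialltoall trades fewer, larger messages against a later start of communication. The bookkeeping can be sketched as below; the function names and the parameter `tile` (planes collected per communication) are hypothetical and only illustrate the arithmetic.

```c
#include <stddef.h>

/* Number of all-to-all operations when `tile` planes are collected
 * before each communication is started. */
int num_alltoalls(int planes, int tile)
{
    return (planes + tile - 1) / tile;   /* ceiling division */
}

/* Planes carried by communication number i (0-based). The last one is
 * smaller when `tile` does not divide `planes`. */
int planes_in_alltoall(int planes, int tile, int i)
{
    int left = planes - i * tile;
    if (left <= 0)
        return 0;
    return left < tile ? left : tile;
}

/* Bytes handed to one all-to-all, given the size of a single plane. */
size_t alltoall_bytes(int planes_sent, size_t bytes_per_plane)
{
    return (size_t)planes_sent * bytes_per_plane;
}
```

As an illustrative example, 8 planes with `tile = 4` give two communications of four planes each, while `tile = 3` gives three communications carrying 3, 3 and 2 planes.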

Transformation in z Direction

Data already transformed in y direction
[Figure: 3x3x3 grid with axes x, y, z; 1 block = 1 double value]

Transform first xz plane in z direction in parallel
[Figure: 3x3x3 grid; pattern means that data was transformed in y and z direction]

Start NBC_Ialltoall of first xz plane and transform second plane
[Figure: 3x3x3 grid; cyan color means that data is communicated in the background]

Start NBC_Ialltoall of second xz plane and transform third plane
[Figure: 3x3x3 grid; data of two planes is not accessible due to communication]

Start communication of the third plane and ...
[Figure: 3x3x3 grid; we need the first xz plane to go on ...]

Transformation in x Direction

... so NBC_Wait for the first NBC_Ialltoall!
[Figure: 3x3x3 grid] and transform first plane (pattern means xyz transformed)

Wait and transform second xz plane
[Figure: 3x3x3 grid; first plane’s data can now be accessed]

Wait and transform last xz plane
[Figure: 3x3x3 grid, fully transformed]
done! → 1 complete 1D-FFT overlaps a communication

Parameters and Problems

Tile factor
- number of z-planes to gather before NBC_Ialltoall is started
- very performance critical!
- not easily predictable

Window size and MPI_Test interval
- window size = number of outstanding communications
- not very performance critical → fine-tuning
- MPI_Test progresses the internal state of MPI
- unnecessary in a threaded NBC implementation (future)

Problems?
- NOT cache friendly :-(
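The window size can be pictured as a small ring of request slots: a new communication may only start when fewer than `size` are outstanding, otherwise the oldest one has to be completed first. The sketch below only simulates that bookkeeping. The type `comm_window` and all function names are hypothetical, and the comments mark where NBC_Ialltoall and NBC_Wait would be called in a real code.

```c
/* Hypothetical bookkeeping for a window of outstanding communications. */
typedef struct {
    int size;       /* window size = max. outstanding communications */
    int started;    /* communications started so far */
    int completed;  /* communications completed (waited for) so far */
} comm_window;

/* Id of the communication that must complete before a new one may
 * start, or -1 if a slot is still free. */
int window_must_wait(const comm_window *w)
{
    return (w->started - w->completed >= w->size) ? w->completed : -1;
}

/* Mark the oldest outstanding communication as completed. */
void window_complete_oldest(comm_window *w)
{
    w->completed++;
}

/* Start a new communication; returns the request slot it occupies. */
int window_start(comm_window *w)
{
    int slot = w->started % w->size;
    w->started++;
    return slot;
}

/* Push ncomm communications through the window and count how often the
 * oldest one had to be completed before a new one could start. */
int window_simulate(int ncomm, int size)
{
    comm_window w = { size, 0, 0 };
    int forced_waits = 0;
    for (int i = 0; i < ncomm; i++) {
        if (window_must_wait(&w) >= 0) {
            window_complete_oldest(&w);  /* stands in for NBC_Wait */
            forced_waits++;
        }
        window_start(&w);                /* stands in for NBC_Ialltoall */
    }
    return forced_waits;
}
```

A forced wait is not necessarily a stall: if the oldest communication has already finished in the background, the wait returns immediately, which may be one reason the slides call this parameter less critical than the tile factor.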

3D-FFT Benchmark Results (small input)

[Figure, left panel: Speedup vs. Nodes (0 to 35); curves: ideal, NBC, MPI]
[Figure, right panel: Communication Overhead (%) vs. Nodes (0 to 35); curves: NBC, MPI]

“tantale”@CEA: 128 2 GHz Quad Opteron 844 nodes
Interconnect: InfiniBand™
System size 128x128x128 (1 CPU ≈ 0.75 s)

Cache optimal implementation

Cache optimality by transforming in y and z plane by plane (in cache)!
[Figure: FFT box with axes x, y, z]
→ we need all yz-planes before we can start the x transform :-(

Applying Non-blocking collectives

Pipelined communication
- retain plane-by-plane transform
- simple pipelined scheme
- start A2A of plane as soon as it is transformed
- wait for all before x transform
- A2A overlapped with computation of remaining planes
- last A2A blocks (immediate wait :-( )

Issues
- less overlap potential
- plane coalescing to adjust data size
- new parameter: “pipeline depth” (number of A2As)
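Why the last A2A cannot be hidden can be seen in a toy timing model. The sketch assumes idealized conditions that the slides do not claim: every plane takes the same time `t_fft` to transform, every all-to-all takes `t_a2a`, communications run one after another, and overlap costs no CPU time. Both function names are hypothetical.

```c
/* Time at which the last all-to-all finishes when plane i is ready at
 * (i + 1) * t_fft and its all-to-all starts as soon as the plane is
 * ready and the previous all-to-all is done. */
double pipelined_finish(int planes, double t_fft, double t_a2a)
{
    double comm_done = 0.0;
    for (int i = 0; i < planes; i++) {
        double ready = (i + 1) * t_fft;
        double start = ready > comm_done ? ready : comm_done;
        comm_done = start + t_a2a;
    }
    return comm_done;
}

/* Same work without overlap: each transform is followed by a blocking
 * all-to-all. */
double blocking_finish(int planes, double t_fft, double t_a2a)
{
    return planes * (t_fft + t_a2a);
}
```

With 4 planes, `t_fft = 2` and `t_a2a = 1`, the model gives 9 time units instead of 12: three all-to-alls are hidden behind computation, and only the last one, started after the final plane is transformed, remains fully exposed.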

3D-FFT Benchmark Results (small input)

System
- “tantale”@CEA
- 2 GHz Quad Opteron
- InfiniBand™

Parameters
- 128x128x128
- 16 CPUs, 4 nodes
- 1 CPU ≈ 28 s
- 8 planes/proc
- 16 kB/plane

[Figure: Time (s) vs. coalesced planes (0 = original impl.; 1, 2, 4); panels: FORW, BACK]

Avoidance of the transformation of zeroes

ABINIT Implementation
- changed routines back, forw, back_wf and forw_wf
- some minor changes to others (input params ...)

The routines back_wf and forw_wf
- avoid transformation of zeroes
- less computation and less communication
- changed communication (boxcut=2):
  - forw_wf: nz/p planes, each has nx/2 · ny/(2 · p) doubles
  - back_wf: nz/(2 · p) planes, each has nx/2 · ny/p doubles

New Parameters
- all routines have different numbers of planes → three parameters
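The plane counts and sizes for boxcut=2 follow directly from the two formulas above. The sketch below just evaluates them; the struct and function names are hypothetical, and nx, ny and nz are assumed to be divisible by 2p so that integer division is exact.

```c
/* Hypothetical description of what one process communicates. */
typedef struct {
    int planes;             /* planes per process */
    int doubles_per_plane;  /* doubles in each plane */
} a2a_shape;

/* forw_wf: nz/p planes, each with nx/2 * ny/(2p) doubles. */
a2a_shape forw_wf_shape(int nx, int ny, int nz, int p)
{
    a2a_shape s = { nz / p, (nx / 2) * (ny / (2 * p)) };
    return s;
}

/* back_wf: nz/(2p) planes, each with nx/2 * ny/p doubles. */
a2a_shape back_wf_shape(int nx, int ny, int nz, int p)
{
    a2a_shape s = { nz / (2 * p), (nx / 2) * (ny / p) };
    return s;
}
```

For a 128x128x128 box on 16 processes this gives 8 planes of 256 doubles for forw_wf and 4 planes of 512 doubles for back_wf. The total volume is the same, but the number of planes differs, which is consistent with each routine needing its own planes parameter.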

Autotuning of parameters

Three new input parameters
- fftplanes_fourdp, fftplanes_forw_wf, and fftplanes_back_wf
- default = 0 → standard MPI implementation
- performance critical
- complicated to determine by hand

Autotuning
- automatically determine them at runtime
- each planes parameter is benchmarked (after a warmup round)
- fastest is chosen automatically
- relatively accurate but problems with jitter
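The selection step of such an autotuner might look like the sketch below. It is not the ABINIT implementation: the function name `pick_fastest` and the layout of the timing table are hypothetical, and taking the minimum over several repetitions is one common way to dampen jitter that is assumed here, not something the slides describe.

```c
#include <stddef.h>

/* times[c * reps + r] holds the measured time of candidate c in
 * repetition r (warm-up rounds already discarded by the caller).
 * Returns the index of the candidate with the smallest minimum time.
 * Requires ncand >= 1 and reps >= 1. */
size_t pick_fastest(const double *times, size_t ncand, size_t reps)
{
    size_t best = 0;
    double best_time = 0.0;
    for (size_t c = 0; c < ncand; c++) {
        double t = times[c * reps];
        for (size_t r = 1; r < reps; r++)
            if (times[c * reps + r] < t)
                t = times[c * reps + r];   /* keep the least disturbed run */
        if (c == 0 || t < best_time) {
            best = c;
            best_time = t;
        }
    }
    return best;
}
```

The returned index would then be mapped back to the corresponding candidate value of a planes parameter such as fftplanes_fourdp.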

Microbenchmarks

[Figure: Time (s) vs. coalesced planes (0 = original impl.; 1, 2, 4); panels: FORW, BACK, FORW_WF, BACK_WF]

ABINIT - Si, 60 bands, 128³ FFT

[Figure: Time (ms) for FOURDP, FORW_WF and BACK_WF, each with parameter settings 0 and −1; panels: npband x nfft = 4x8, npband x nfft = 2x16. TOTAL: 41.1 s → 37.1 s, 39.5 s → 23.6 s]
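As a worked reading of the two TOTAL pairs reported with this figure (41.1 s → 37.1 s and 39.5 s → 23.6 s), the relative time savings and the corresponding speedup factors are approximately:

```latex
\[
\frac{41.1 - 37.1}{41.1} \approx 9.7\,\%, \qquad
\frac{41.1}{37.1} \approx 1.11
\]
\[
\frac{39.5 - 23.6}{39.5} \approx 40.3\,\%, \qquad
\frac{39.5}{23.6} \approx 1.67
\]
```

These numbers are derived only from the totals quoted on the slide; which process layout each pair belongs to is not unambiguous in the extracted text.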

Conclusions & Future Work

Conclusions
- applying NBC requires some effort
- NBC can improve parallel efficiency
- cache usage vs. overlap potential

Future Work
- tune FFT further (reduce serial overhead)
- better automatic parameter assessment (?)
- parallel model for 3d-FFT
- use NBC for parallel orthogonalization
- apply NBC at higher level (LOBPCG?)

Discussion

ABINIT patch (soon): http://www.unixer.de/research/abinit/

Thanks to the CEA/DAM for support of this work and you for your attention!
