& c0 8g & ' aa a #a # c ) c % # )' k # )' kia) &a ) %cc c ... · 5th...
TRANSCRIPT
C 0 8 G � AA A� A C C�� K K� IA A C� C� �
C A�,C G�- C C
1 �- � C -
1 LIC , � L IC
2 C G A C J G L
0 � G � G �
�
����������
• G K �G � E8 G 8G E �C �G � 1/– , � � �GC� GE G � C �GC� GG � �D E CEA 8 �C � �G E G� E8 G 8G E
– �G �A � E8 G 8G E �- G � 5 � GE 8G C
• 2ECI �G � AD A G G C � �C �, � CE� E � E– G C � CE�8 CC �G � L �C � C8 �A GE 8
– C � G � 5 � GE 8G C
�
� � �
• 2 7�8 � F 7 � �7 E 7 �. 7 �37 �/ �/ 57 � �2 7� F 7 � �5 8 E 7 � � 7�4 0E4 4 � 7 �2 7�. 7 �/4 4 7 �1 E �3 �
�E 4 7� � � 4 7
• 2 7� 75 � F 7 � �7 E 7 �. 7 �37 �/ �/ 57 � �2 7� F 7 � �5 8 E 7 � � 7�4 7 7 7� 7 �2 7�. 7 �/4 4 7 �1 E �
3 � �E 4 7� � � 4 7
� �
�
5thlooparoundmicro-kernel
+=
nC nC
A Bj Cj
4thlooparoundmicro-kernel
+=kC
kC
Ap Bp
Cj
+=mC mC Bp Ci Ai
3rdlooparoundmicro-kernel
+=
nR nR Ai Bp Ci
registers
L3cachemainmemory
L1cacheL2cache
2ndlooparoundmicro-kernel
1stlooparoundmicro-kernel
+=mR mR
micro-kernel
+= 1
1
*Field G. Van Zee, and Tyler M. Smith. “Implementing high-performance complex matrix multiplication." In ACM Transactions on MathematicalSoftware (TOMS).
� (
� �2 Roktaek Lim et al.
1 Introduction
p-loop
m
n
C
+=
kb
· · ·
A
kb
...
B0
Bk�1
B
i-loop
+=...
mb
kb
...
kb
n
eBA0,p
jr-loop
· · · += mb
kb
eA · · ·
nr
kbeB0
ir-loop
+=
kb
mrbAbC bB kb
nr
2 Roktaek Lim et al.
ALGORITHM 1: Matrix-matrix multiplication algorithm.
for p = 0, . . . , k � 1 in steps of kb do
Pack Bp into eB;
for i = 0, . . . ,m� 1 in steps of mb do
Pack Ai,p into eA;
for jr = 0, . . . , n� 1 in steps of nr dofor ir = 0, . . . ,mb � 1 in steps of mr do
bA = eAir;
bB = eBjr;
bC += bA⇥ bB;
Update C using bC;
endend
endend
p-loop
m
n
C
+=
kb
· · ·
A
kb
...
B0
Bk�1
B
i-loop
+=...
mb
kb
...
kb
n
eBA0,p
jr-loop
· · · += mb
kb
eA · · ·
nr
kbeB0
ir-loop
+=
kb
mrbAbC bB kb
nr
2�2 � � �
• 3 5� � � � �1/1 3• � 18 5� � 3� �1/1 3��������
/ 5� 3�1 � � 2/ 5�!"• 5� 3� 3/2� 3 �1 3�����������
2 3� � 32� 3/2 5���������31 5 � � 3�)
�
� �
L1 prefetching
Manual loop unrolling
�
� �10 Roktaek Lim et al.
Performace of DGEMM for each (mr,n
r) when m=n=k=2400
(31,8) (15,16) (7,32) (16,8) (8,16) (30,8)(m
r,n
r)
0
5
10
15
20
25
30
35
GF
LO
PS
without unrolling#pragma unroll (2)#pragma unroll (3)#pragma unroll (4)#pragma unroll (5)#pragma unroll (6)#pragma unroll ( )
Fig. 3 DGEMM performances using Algorithm 1 for the various (mr, nr) when m = n =k = 2, 400.
best performance of our implementation is obtained, when (mr, nr) = (31, 8)and the i-loop in the inner kernel is unrolled by 3. Note that the perfor-mance of DGEMM with (mr, nr, kb,mb) = (31, 8, 480, 124) is better than thatof DGEMM with (mr, nr, kb,mb) = (30, 8, 480, 120). We choose (mr, nr) =(31, 8) for register block sizes.
On conventional architectures, kb is chosen to fit bB into the L1 cache. How-ever, it does not seem that L1 cache blocking is crucial for the implementationof DGEMM on the KNL. For example, the performance of DGEMM withmr = 15, nr = 16, kb = 480, and mb = 135 is better than that of DGEMMwith mr = 15, nr = 16, kb = 240, and mb = 240, even though the size ofbB with nr = 16 and kb = 480 is larger than the size of the L1 cache on theKNL. This indicates that if kb is chosen to fit bB into the L1 cache, kb maynot be large enough to amortize the cost of updating mr ⇥ nr block of Con the KNL. It is not necessary that elements of bC remain in the L1 cacheduring the computation of bA⇥ bB. The inner kernel requires 2mrnnkb FLOPSand mrnr to store the results from the registers to the main memory. Thus,we have to choose kb as large as possible to amortize the cost of updatingmr ⇥ nr block of C. We conduct performance experiments to choose kb andmb for (mr, nr) = (31, 8) by varying mb from 31 to 124 and kb from 480 to1,640 while m, n and k remain fixed at 2,400 and the cache and register blocksizes satisfy (5). The prefetch distances for bA and bB are also determined foreach (kb,mb). The best performance of our implementation is obtained, when(mb, kb) = (31, 1620). The performance experiments for choosing the prefetchdistance for bA and bB are conducted for each pair of (mb, kb). The experimentresults for (mb, kb) = (31, 1620) is given in Figure 4. The prefetch distancefor bA and bB are chosen to be 12mr and 32nr, respectively. Figure 5 showsthe performance comparison between DGEMM using the Intel MKL and ourimplementation for the di↵erent matrix sizes on the KNL. We vary the size
Title Suppressed Due to Excessive Length 9
0 1000 2000 3000 4000 5000 6000
m=n=k
0
5
10
15
20
25
30
35
40
GF
LO
PS
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1Performances of MKL's DGEMM
MKL(# of threads = 1)MKL(# of threads = 2)MKL(# of threads = 3)MKL(# of threads = 4)
Fig. 2 Performances of MKL’s DGEMM with # of threads = 1, 2, 3, and 4.
Table 1 kb, mb and prefetch distances for DGEMM.
(mr, nr) (31,8) (15,16) (7,32) (16,8) (8,16) (30,8)# of vector registers 32 32 32 17 18 31kb 480 400 400 480 480 480mb 124 135 126 112 112 120
L1 distance for bA 10mr 20mr 20mr 20mr 20mr 10mr
L1 distance for bB 28nr 16nr 12nr 36nr 24nr 28nr
Size of ( eA+ bB + bC) 497 KB 474 KB 496 KB 451 KB 481 KB 481 KB
and k remain fixed at 2,400 in the performance experiments. The prefetchdistances for bA and bB are also determined by the performance experiments.To examine DGEMM performances between the full use of vector registersand the use of half of vector registers, performance experiments are conductedfor the register block sizes (mr, nr) = (16, 8) and (mr, nr) = (8, 16). Smithet al. [10] used (mr, nr, kb,mb) = (30, 8, 240, 120) for the implementation ofDGEMM on the KNC. They chose kb = 240 so that at least two slivers ofbA and two bB must be in the L1 cache. The reason is that four threads on acore share the L1 cache. Since the KNL can reach the maximum performancewith one thread, kb should be 480. We test the performance of DGEMM with(mr, nr, kb,mb) = (30, 8, 480, 120) for comparison. The cache block sizes andthe prefetch distances for bA and bB are provided for each (mr, nr) in Table 1.Figure 3 shows that the performances of the matrix-matrix multiplication foreach (mr, nr) when m = n = k = 2, 400.
We can observe that good performance results are obtained with the fulluse of available vector registers when (mr, nr) = (31, 8) or (15, 16). Thus, it isrecommended to use 32 vector registers for register blocking on the KNL. Theperformance of the matrix-matrix multiplication as shown in Figure 3. The
≤ 512 KB (half of the size of the L2 cache)
�
� �
0 2000 4000 6000 8000 10000m=n=k
0
5
10
15
20
25
30
35
40
GFL
OPS
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1DGEMM performance comparison on the KNL
MKL(# of threads = 1)(mr,nr,mb,kb)=(31,8,31,1620)(mr,nr,mb,kb)=(31,8,124,480)
�
� � �2 Roktaek Lim et al.
ALGORITHM 1: Matrix-matrix multiplication algorithm.
for p = 0, . . . , k � 1 in steps of kb do
Pack Bp into eB;
for i = 0, . . . ,m� 1 in steps of mb do
Pack Ai,p into eA;
for jr = 0, . . . , n� 1 in steps of nr dofor ir = 0, . . . ,mb � 1 in steps of mr do
bA = eAir;
bB = eBjr;
bC += bA⇥ bB;
Update C using bC;
endend
endend
p-loop
m
n
C
+=
kb
· · ·
A
kb
...
B0
Bk�1
B
i-loop
+=...
mb
kb
...
kb
n
eBA0,p
jr-loop
· · · += mb
kb
eA · · ·
nr
kbeB0
ir-loop
+=
kb
mrbAbC bB kb
nr
!/!#
$/$%
*Tyler Smith et al. “Anatomy of High Performance Many-Threaded Matrix Multiplication” In ACM Transactions on Mathematical Software (TOMS).
�
� �
i-loop
jr-loop
�
� � � �
• � , � / 8� � �8/ 1 �
!"#" + #"%& + !&%& ×8*+,-. ≤ 0123 .
• � . � 11 1�/ 1 /
2× !"#" + #"%& + !&%& ×8*+,-. ≤ 51278• ,� .� 8� � . �8 � /1 � / � . �8 �
9: � � � . �/ � 1
!"#" + 2#"%& + 2!&%& ×8*+,-. ≤ 51278
�OLN MK �
�
• 2 D � A A� � //� A� �- .�A AE� �A E 8� E � A8� D
• ,1 / �DGA � AH D A A �H D �E G 8� �GE 8� � H 8� � H D 8� �D A � A8�8 E D A � D 8E
KMP_HOT_TEAMS_MODE = 1KMP_HOT_TEAMS_MAX_LEVEL = 2
� 8 �
�
�31 �. �. � 1�/ � .
�
�����������
OMP_PLACES = coresOMP_PROC_BIND = spread, spread
!" � � / � 4 � � 1 1 � 8�
#"!" + 2!"&' + 2#'&' ×8+,-./ ≤ 51234
Title Suppressed Due to Excessive Length 3
i-loop
core-0
A0,p
core-1 core-2 core-3 core-4 core-5
A1,p
core-6 core-7
jr-loop
· · ·
core-0 core-1
A0,p
core-2 core-3
A0,p
core-4 core-5
A1,p
core-6 core-7
A1,p
· · ·
i-loop
core-0
A0,p
core-1
A1,p
· · ·core-16
A16,p
core-17 core-18 core-19
· · ·core-66 core-67
jr-loop
core-0
A0,p
core-1
A1,p
· · ·core-16
A16,p
core-17
A0,p
core-18 core-19
A0,p
· · ·core-66 core-67
A16,p
�
, � , , �
KMP_AFFINITY = scatter
!" � � 5 / � �5 � 1 1 � 58
2× %"!" + !"'( + %('( ×8+,-./ ≤ 51234
Title Suppressed Due to Excessive Length 3
i-loop
core-0
A0,p
core-1 core-2 core-3 core-4 core-5
A1,p
core-6 core-7
jr-loop
· · ·
core-0 core-1
A0,p
core-2 core-3
A0,p
core-4 core-5
A1,p
core-6 core-7
A1,p
· · ·
i-loop
core-0
A0,p
core-1
A1,p
· · ·core-16
A16,p
core-17 core-18 core-19
· · ·core-66 core-67
jr-loop
core-0
A0,p
core-1
A1,p
· · ·core-16
A16,p
core-17
A0,p
core-18 core-19
A0,p
· · ·core-66 core-67
A16,p
�
� �
0 100 200 300 400 500 600kb
0
500
1000
1500
2000
2500
GFLOPS
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1DGEMM performance comparison for the Intel Xeon Phi Processor 7210
omp_places=coreskmp_affinity=scatter
0 100 200 300 400 500 600kb
0
500
1000
1500
2000
2500
3000
GFLOPS
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1DGEMM performance comparison for the Intel Xeon Phi Processor 7250
omp_places=coreskmp_affinity=scatter
�
�������
• E � AD A G G C �C � �I G �2D C �G � 1/– G C � CE�8 CC �G � K �C �7 C8 �A GE 8
– C �I G � 5 � GE 8G C
• E CEA 8 � G– 2D G
– - G �2D E G A � 7E E
� � � � !