& c0 8g & ' aa a #a # c ) c % # )' k # )' kia) &a ) %cc c ... · 5th...

C 0 8 G � AA A� A C C�� K K� IA A C� C� �

C A�,C G�- C C

1 �- � C -

1 LIC , � L IC

2 C G A C J G L

0 � G � G �

�

��

• G K �G � E8 G 8G E �C �G � 1/– , � � �GC� GE G � C �GC� GG � �D E CEA 8 �C � �G E G� E8 G 8G E

– �G �A � E8 G 8G E �- G � 5 � GE 8G C

• 2ECI �G � AD A G G C � �C �, � CE� E � E– G C � CE�8 CC �G � L �C � C8 �A GE 8

– C � G � 5 � GE 8G C

�

� � �

• 2 7�8 � F 7 � �7 E 7 �. 7 �37 �/ �/ 57 � �2 7� F 7 � �5 8 E 7 � � 7�4 0E4 4 � 7 �2 7�. 7 �/4 4 7 �1 E �3 �

�E 4 7� � � 4 7

• 2 7� 75 � F 7 � �7 E 7 �. 7 �37 �/ �/ 57 � �2 7� F 7 � �5 8 E 7 � � 7�4 7 7 7� 7 �2 7�. 7 �/4 4 7 �1 E �

3 � �E 4 7� � � 4 7

� �

�

5thlooparoundmicro-kernel

+=

nC nC

A Bj Cj

4thlooparoundmicro-kernel

+=kC

kC

Ap Bp

Cj

+=mC mC Bp Ci Ai

3rdlooparoundmicro-kernel

+=

nR nR Ai Bp Ci

registers

L3cachemainmemory

L1cacheL2cache

2ndlooparoundmicro-kernel

1stlooparoundmicro-kernel

+=mR mR

micro-kernel

+= 1

1

*Field G. Van Zee, and Tyler M. Smith. “Implementing high-performance complex matrix multiplication." In ACM Transactions on MathematicalSoftware (TOMS).

� (

� �2 Roktaek Lim et al.

1 Introduction

p-loop

m

n

C

+=

kb

· · ·

A

kb

...

B0

Bk�1

B

i-loop

+=...

mb

kb

...

kb

n

eBA0,p

jr-loop

· · · += mb

kb

eA · · ·

nr

kbeB0

ir-loop

+=

kb

mrbAbC bB kb

nr

2 Roktaek Lim et al.

ALGORITHM 1: Matrix-matrix multiplication algorithm.

for p = 0, . . . , k � 1 in steps of kb do

Pack Bp into eB;

for i = 0, . . . ,m� 1 in steps of mb do

Pack Ai,p into eA;

for jr = 0, . . . , n� 1 in steps of nr dofor ir = 0, . . . ,mb � 1 in steps of mr do

bA = eAir;

bB = eBjr;

bC += bA⇥ bB;

Update C using bC;

endend

endend

p-loop

m

n

C

+=

kb

· · ·

A

kb

...

B0

Bk�1

B

i-loop

+=...

mb

kb

...

kb

n

eBA0,p

jr-loop

· · · += mb

kb

eA · · ·

nr

kbeB0

ir-loop

+=

kb

mrbAbC bB kb

nr

2�2 � � �

• 3 5� � � � �1/1 3• � 18 5� � 3� �1/1 3��

/ 5� 3�1 � � 2/ 5�!"• 5� 3� 3/2� 3 �1 3��

2 3� � 32� 3/2 5��31 5 � � 3�)

�

� �

L1 prefetching

Manual loop unrolling

�

� �10 Roktaek Lim et al.

Performace of DGEMM for each (mr,n

r) when m=n=k=2400

(31,8) (15,16) (7,32) (16,8) (8,16) (30,8)(m

r,n

r)

0

5

10

15

20

25

30

35

GF

LO

PS

without unrolling#pragma unroll (2)#pragma unroll (3)#pragma unroll (4)#pragma unroll (5)#pragma unroll (6)#pragma unroll ( )

Fig. 3 DGEMM performances using Algorithm 1 for the various (mr, nr) when m = n =k = 2, 400.

best performance of our implementation is obtained, when (mr, nr) = (31, 8)and the i-loop in the inner kernel is unrolled by 3. Note that the perfor-mance of DGEMM with (mr, nr, kb,mb) = (31, 8, 480, 124) is better than thatof DGEMM with (mr, nr, kb,mb) = (30, 8, 480, 120). We choose (mr, nr) =(31, 8) for register block sizes.

On conventional architectures, kb is chosen to fit bB into the L1 cache. How-ever, it does not seem that L1 cache blocking is crucial for the implementationof DGEMM on the KNL. For example, the performance of DGEMM withmr = 15, nr = 16, kb = 480, and mb = 135 is better than that of DGEMMwith mr = 15, nr = 16, kb = 240, and mb = 240, even though the size ofbB with nr = 16 and kb = 480 is larger than the size of the L1 cache on theKNL. This indicates that if kb is chosen to fit bB into the L1 cache, kb maynot be large enough to amortize the cost of updating mr ⇥ nr block of Con the KNL. It is not necessary that elements of bC remain in the L1 cacheduring the computation of bA⇥ bB. The inner kernel requires 2mrnnkb FLOPSand mrnr to store the results from the registers to the main memory. Thus,we have to choose kb as large as possible to amortize the cost of updatingmr ⇥ nr block of C. We conduct performance experiments to choose kb andmb for (mr, nr) = (31, 8) by varying mb from 31 to 124 and kb from 480 to1,640 while m, n and k remain fixed at 2,400 and the cache and register blocksizes satisfy (5). The prefetch distances for bA and bB are also determined foreach (kb,mb). The best performance of our implementation is obtained, when(mb, kb) = (31, 1620). The performance experiments for choosing the prefetchdistance for bA and bB are conducted for each pair of (mb, kb). The experimentresults for (mb, kb) = (31, 1620) is given in Figure 4. The prefetch distancefor bA and bB are chosen to be 12mr and 32nr, respectively. Figure 5 showsthe performance comparison between DGEMM using the Intel MKL and ourimplementation for the di↵erent matrix sizes on the KNL. We vary the size

Title Suppressed Due to Excessive Length 9

0 1000 2000 3000 4000 5000 6000

m=n=k

0

5

10

15

20

25

30

35

40

GF

LO

PS

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1Performances of MKL's DGEMM

MKL(# of threads = 1)MKL(# of threads = 2)MKL(# of threads = 3)MKL(# of threads = 4)

Fig. 2 Performances of MKL’s DGEMM with # of threads = 1, 2, 3, and 4.

Table 1 kb, mb and prefetch distances for DGEMM.

(mr, nr) (31,8) (15,16) (7,32) (16,8) (8,16) (30,8)# of vector registers 32 32 32 17 18 31kb 480 400 400 480 480 480mb 124 135 126 112 112 120

L1 distance for bA 10mr 20mr 20mr 20mr 20mr 10mr

L1 distance for bB 28nr 16nr 12nr 36nr 24nr 28nr

Size of ( eA+ bB + bC) 497 KB 474 KB 496 KB 451 KB 481 KB 481 KB

and k remain fixed at 2,400 in the performance experiments. The prefetchdistances for bA and bB are also determined by the performance experiments.To examine DGEMM performances between the full use of vector registersand the use of half of vector registers, performance experiments are conductedfor the register block sizes (mr, nr) = (16, 8) and (mr, nr) = (8, 16). Smithet al. [10] used (mr, nr, kb,mb) = (30, 8, 240, 120) for the implementation ofDGEMM on the KNC. They chose kb = 240 so that at least two slivers ofbA and two bB must be in the L1 cache. The reason is that four threads on acore share the L1 cache. Since the KNL can reach the maximum performancewith one thread, kb should be 480. We test the performance of DGEMM with(mr, nr, kb,mb) = (30, 8, 480, 120) for comparison. The cache block sizes andthe prefetch distances for bA and bB are provided for each (mr, nr) in Table 1.Figure 3 shows that the performances of the matrix-matrix multiplication foreach (mr, nr) when m = n = k = 2, 400.

We can observe that good performance results are obtained with the fulluse of available vector registers when (mr, nr) = (31, 8) or (15, 16). Thus, it isrecommended to use 32 vector registers for register blocking on the KNL. Theperformance of the matrix-matrix multiplication as shown in Figure 3. The

≤ 512 KB (half of the size of the L2 cache)

�

� �

0 2000 4000 6000 8000 10000m=n=k

0

5

10

15

20

25

30

35

40

GFL

OPS

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1DGEMM performance comparison on the KNL

MKL(# of threads = 1)(mr,nr,mb,kb)=(31,8,31,1620)(mr,nr,mb,kb)=(31,8,124,480)

�

� � �2 Roktaek Lim et al.

ALGORITHM 1: Matrix-matrix multiplication algorithm.

for p = 0, . . . , k � 1 in steps of kb do

Pack Bp into eB;

for i = 0, . . . ,m� 1 in steps of mb do

Pack Ai,p into eA;

for jr = 0, . . . , n� 1 in steps of nr dofor ir = 0, . . . ,mb � 1 in steps of mr do

bA = eAir;

bB = eBjr;

bC += bA⇥ bB;

Update C using bC;

endend

endend

p-loop

m

n

C

+=

kb

· · ·

A

kb

...

B0

Bk�1

B

i-loop

+=...

mb

kb

...

kb

n

eBA0,p

jr-loop

· · · += mb

kb

eA · · ·

nr

kbeB0

ir-loop

+=

kb

mrbAbC bB kb

nr

!/!#

$/$%

*Tyler Smith et al. “Anatomy of High Performance Many-Threaded Matrix Multiplication” In ACM Transactions on Mathematical Software (TOMS).

�

� �

i-loop

jr-loop

�

� � � �

• � , � / 8� � �8/ 1 �

!"#" + #"%& + !&%& ×8*+,-. ≤ 0123 .

• � . � 11 1�/ 1 /

2× !"#" + #"%& + !&%& ×8*+,-. ≤ 51278• ,� .� 8� � . �8 � /1 � / � . �8 �

9: � � � . �/ � 1

!"#" + 2#"%& + 2!&%& ×8*+,-. ≤ 51278

�OLN MK �

�

• 2 D � A A� � //� A� �- .�A AE� �A E 8� E � A8� D

• ,1 / �DGA � AH D A A �H D �E G 8� �GE 8� � H 8� � H D 8� �D A � A8�8 E D A � D 8E

KMP_HOT_TEAMS_MODE = 1KMP_HOT_TEAMS_MAX_LEVEL = 2

� 8 �

�

�31 �. �. � 1�/ � .

�

��

OMP_PLACES = coresOMP_PROC_BIND = spread, spread

!" � � / � 4 � � 1 1 � 8�

#"!" + 2!"&' + 2#'&' ×8+,-./ ≤ 51234


i-loop

core-0

A0,p

core-1 core-2 core-3 core-4 core-5

A1,p

core-6 core-7

jr-loop

· · ·

core-0 core-1

A0,p

core-2 core-3

A0,p

core-4 core-5

A1,p

core-6 core-7

A1,p

· · ·

i-loop

core-0

A0,p

core-1

A1,p

· · ·core-16

A16,p

core-17 core-18 core-19

· · ·core-66 core-67

jr-loop

core-0

A0,p

core-1

A1,p

· · ·core-16

A16,p

core-17

A0,p

core-18 core-19

A0,p


A16,p

�

, � , , �

KMP_AFFINITY = scatter

!" � � 5 / � �5 � 1 1 � 58

2× %"!" + !"'( + %('( ×8+,-./ ≤ 51234


i-loop

core-0

A0,p

core-1 core-2 core-3 core-4 core-5

A1,p

core-6 core-7

jr-loop

· · ·

core-0 core-1

A0,p

core-2 core-3

A0,p

core-4 core-5

A1,p

core-6 core-7

A1,p

· · ·

i-loop

core-0

A0,p

core-1

A1,p

· · ·core-16

A16,p

core-17 core-18 core-19


jr-loop

core-0

A0,p

core-1

A1,p

· · ·core-16

A16,p

core-17

A0,p

core-18 core-19

A0,p


A16,p

�

� �

0 100 200 300 400 500 600kb

0

500

1000

1500

2000

2500

GFLOPS

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1DGEMM performance comparison for the Intel Xeon Phi Processor 7210

omp_places=coreskmp_affinity=scatter

0 100 200 300 400 500 600kb

0

500

1000

1500

2000

2500

3000

GFLOPS

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1DGEMM performance comparison for the Intel Xeon Phi Processor 7250

omp_places=coreskmp_affinity=scatter

�

��

• E � AD A G G C �C � �I G �2D C �G � 1/– G C � CE�8 CC �G � K �C �7 C8 �A GE 8

– C �I G � 5 � GE 8G C

• E CEA 8 � G– 2D G

– - G �2D E G A � 7E E

� � � � !

& c0 8g & ' aa a #a # c ) c % # )' k # )' kia) &a ) %cc c ... · 5th...

Documents