


Devin A. Matthews
Institute for Computational Engineering and Sciences (ICES), The University of Texas at Austin

High-Performance Tensor Contraction without BLAS

Tensor computations – in particular tensor contraction (TC) – are important kernels in many scientific computing applications (SCAs). Due to the fundamental similarity of TC to matrix multiplication (MM) and to the availability of optimized implementations such as the BLAS, tensor operations have traditionally been implemented in terms of BLAS operations, incurring both a performance and a storage overhead. Instead, we implement TC using the much more flexible BLIS framework, which allows for reshaping of the tensor to be fused with internal partitioning and packing operations, requiring no explicit reshaping operations or additional workspace. This implementation, TBLIS, achieves performance approaching that of MM, and in some cases considerably higher than that of traditional TC. Our implementation also supports multithreading using an approach identical to that used for MM in BLIS, with similar performance characteristics. The complexity of managing tensor-to-matrix transformations is also handled automatically in our approach, greatly simplifying its use in SCAs.

References

• Goto approach: K. Goto and R. A. van de Geijn, ACM Trans. Math. Softw. 34, 3, Article 12 (2008).
• BLIS: github.com/flame/blis; shpc.ices.utexas.edu/publications.html
• TBLIS: github.com/devinamatthews/tblis; D. Matthews, arXiv:1607.00291.
• GETT (independent "native" TC algorithm): P. Springer and P. Bientinesi, arXiv:1607.00145.

[Figure: the five loops of the BLIS/Goto algorithm for matrix multiplication C += A·B. The 5th, 4th, and 3rd loops around the microkernel partition C, A, and B with cache blocksizes n_C, k_C, and m_C, packing the blocks A_i and B_p into contiguous buffers Ã_i and B̃_p. The 2nd and 1st loops partition with register blocksizes n_R and m_R, and the m_R × n_R microkernel accumulates over the k_C dimension. The operands are staged through main memory, the L3, L2, and L1 caches, and the registers.]
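The loop structure in the figure can be written down compactly. Below is a minimal, unoptimized C++ sketch of the five loops with packing; the blocksizes are illustrative stand-ins for BLIS's tuned parameters, the microkernel is plain C++ rather than assembly, and for brevity all dimensions are assumed to divide the blocksizes evenly.

#include <vector>

// A minimal sketch of the five-loop Goto/BLIS structure for C += A*B
// (column-major doubles). Blocksizes stand in for the tuned n_C, k_C,
// m_C, m_R, n_R; all dimensions are assumed to divide them evenly.
constexpr int NC = 512, KC = 256, MC = 96, MR = 4, NR = 4;

// Microkernel: k_C rank-1 updates on an MR x NR tile of C, streaming
// contiguous MR- and NR-vectors from the packed buffers.
void ukernel(int kc, const double* Ap, const double* Bp, double* C, int ldc) {
    for (int k = 0; k < kc; k++)
        for (int j = 0; j < NR; j++)
            for (int i = 0; i < MR; i++)
                C[i + j*ldc] += Ap[k*MR + i] * Bp[k*NR + j];
}

// Pack an mc x kc block of A into MR-row "micropanels" so that the
// microkernel reads it with unit stride.
void pack_A(int mc, int kc, const double* A, int lda, double* Ap) {
    for (int i = 0; i < mc; i += MR)
        for (int k = 0; k < kc; k++)
            for (int ir = 0; ir < MR; ir++)
                *Ap++ = A[(i + ir) + k*lda];
}

// Pack a kc x nc block of B into NR-column micropanels.
void pack_B(int kc, int nc, const double* B, int ldb, double* Bp) {
    for (int j = 0; j < nc; j += NR)
        for (int k = 0; k < kc; k++)
            for (int jr = 0; jr < NR; jr++)
                *Bp++ = B[k + (j + jr)*ldb];
}

void gemm(int m, int n, int k, const double* A, int lda,
          const double* B, int ldb, double* C, int ldc) {
    std::vector<double> Ap(MC*KC), Bp(KC*NC);
    for (int jc = 0; jc < n; jc += NC)            // 5th loop (n_C)
    for (int pc = 0; pc < k; pc += KC) {          // 4th loop (k_C)
        pack_B(KC, NC, &B[pc + jc*ldb], ldb, Bp.data());
        for (int ic = 0; ic < m; ic += MC) {      // 3rd loop (m_C)
            pack_A(MC, KC, &A[ic + pc*lda], lda, Ap.data());
            for (int jr = 0; jr < NC; jr += NR)   // 2nd loop (n_R)
            for (int ir = 0; ir < MC; ir += MR)   // 1st loop (m_R)
                ukernel(KC, &Ap[ir*KC], &Bp[jr*KC],
                        &C[(ic + ir) + (jc + jr)*ldc], ldc);
        }
    }
}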

[Figure: the 1st loop and microkernel as adapted by TBLIS. The tensor's physical layout is treated as a matrix-like logical layout during partitioning; a tensor-to-matrix packing kernel produces exactly the same packed internal layouts Ã_i and B̃_p as for matrix multiplication, and the output block C_i receives a (possibly) scattered update. The microkernel itself is unchanged.]
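As a sketch of how the tensor-aware packing might look: the kernel below consumes the block-scatter metadata described just below (rscat/rbs for rows, cscat for columns); the function and parameter names are hypothetical, not TBLIS's actual API. Regular blocks are copied with a single constant stride, exactly like a matrix; irregular blocks fall back to gathering element by element.

#include <cstdint>

// Sketch of a tensor-to-matrix packing kernel (the "Pack A_i" step) driven
// by block-scatter metadata; names are hypothetical. MR is the register
// blocksize (mc is assumed divisible by MR for brevity); rscat/rbs are the
// row scatter and block-stride vectors, cscat the column scatter vector.
constexpr int MR = 4;

void pack_panel(int mc, int kc, const double* T,
                const int64_t* rscat, const int64_t* rbs,
                const int64_t* cscat, double* Ap) {
    for (int i = 0; i < mc; i += MR) {
        int64_t bs = rbs[i / MR];
        for (int k = 0; k < kc; k++) {
            if (bs != 0) {
                // Regular block: the MR rows are a constant stride apart,
                // so this is the same strided copy a matrix would use.
                const double* p = T + rscat[i] + cscat[k];
                for (int ir = 0; ir < MR; ir++) *Ap++ = p[ir*bs];
            } else {
                // Irregular block: gather each element via its row offset.
                for (int ir = 0; ir < MR; ir++)
                    *Ap++ = T[rscat[i + ir] + cscat[k]];
            }
        }
    }
}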

Example (the block-scatter matrix; see "Tensors as Matrices" below): the tensor T_abcde with lengths 6×3×2×3×4 is viewed as the 18×24 matrix M_(cdb)(ae).

• The "a" dimension is stride-1; the other dimensions have increasing strides (6, 18, 36, 108).
• Moving along a column (the "cdb" dimension) has the stride of "c" (6×3 = 18) most of the time. Blocking lets us use this constant stride when possible, and a scattered algorithm otherwise.
• Moving along a row (the "ae" dimension) is similarly stride-1 most of the time (i.e., M is "row-major").
• rscat_M and cscat_M store the offset of each position in the rows and columns; rbs_M and cbs_M store the stride for each block, or zero for irregular blocks.

For this example, with register blocks of 6 rows and 4 columns:

rscat_M = (0, 18, 36, 54, 72, 90, 6, 24, 42, 60, 78, 96, 12, 30, 48, 66, 84, 102); rbs_M = (18, 18, 18)
cscat_M = (0…5, 108…113, 216…221, 324…329); cbs_M = (1, 0, 1, 1, 0, 1)
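The scatter and block-stride vectors can be built directly from the lengths and strides of the grouped dimensions. A self-contained C++ sketch (an illustration of the data structure, not TBLIS's actual code):

#include <algorithm>
#include <cstdint>
#include <vector>

// Block-scatter metadata for one dimension group of a tensor (names
// hypothetical). 'len' and 'stride' list the group's dimensions,
// fastest-varying first; 'br' is the register blocksize (MR or NR).
struct BlockScatter {
    std::vector<int64_t> scat; // offset of every row/column position
    std::vector<int64_t> bs;   // common stride per block, or 0 if irregular
};

BlockScatter make_block_scatter(const std::vector<int64_t>& len,
                                const std::vector<int64_t>& stride, int br) {
    BlockScatter m;
    // Enumerate all index combinations, fastest dimension first,
    // accumulating each position's linear offset into the tensor.
    std::vector<int64_t> idx(len.size(), 0);
    for (;;) {
        int64_t off = 0;
        for (size_t d = 0; d < len.size(); d++) off += idx[d]*stride[d];
        m.scat.push_back(off);
        size_t d = 0;
        while (d < len.size() && ++idx[d] == len[d]) idx[d++] = 0;
        if (d == len.size()) break;
    }
    // A block's stride is the common difference of consecutive offsets
    // within it, or 0 if the differences are not all equal.
    for (size_t b = 0; b < m.scat.size(); b += br) {
        size_t end = std::min(b + (size_t)br, m.scat.size());
        int64_t s = (end - b > 1) ? m.scat[b+1] - m.scat[b] : 1;
        for (size_t i = b + 1; i + 1 < end; i++)
            if (m.scat[i+1] - m.scat[i] != s) { s = 0; break; }
        m.bs.push_back(s);
    }
    return m;
}

// For T_abcde (lengths 6,3,2,3,4; strides 1,6,18,36,108), the rows (c,d,b)
// use len = {2,3,3}, stride = {18,36,6}, br = 6, reproducing
// rscat = (0, 18, 36, 54, 72, 90, 6, ...) and rbs = (18, 18, 18).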

Tensor Contraction

Tensor contraction:

    C_ijk = Σ_mn A_kmni · B_njm

is essentially a generalization of matrix multiplication:

    C_ij = Σ_k A_ki · B_kj

for n-dimensional arrays (tensors). The most common current methods for computing tensor contractions are:

• Transpose-transpose-DGEMM-transpose (TTDT), which reshapes (transposes) the tensors into matrices and calls a single large GEMM. For example, t^ae_im = Σ_ck V^mk_ec · W^ak_ic becomes A = B · C with A = t_(ai)(em), B = W_(ai)(ck), and C = V_(ck)(em). Drawbacks:
  • Large memory overhead.
  • Transpose does not parallelize well.
  • Difficult to optimize out transposes.

• Loop-over-GEMM (LoG), which uses explicit loops over a small GEMM operation: [t^am]_ei += [V^mk]_ec · [W^ak]_ic ∀ i, e, c, i.e. loops over i, e, and c around a GEMM (a sketch follows below). Drawbacks:
  • Low data reuse.
  • Poor cache behavior.
  • Inner kernel is not always GEMM; it may be level 1 or 2 BLAS.

A high-performance "native" algorithm is ideal.
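For concreteness, a minimal LoG sketch for the contraction above using cblas_dgemm. The tensor layouts (first index unit stride: t[a,e,i,m], V[m,k,e,c], W[a,k,i,c]) and the function name are assumptions for illustration.

#include <cstddef>
#include <cblas.h>

// LoG for t^ae_im = Σ_ck V^mk_ec · W^ak_ic. For each (i, e, c), the a x m
// slice of t accumulates W(:,:,i,c) * V(:,:,e,c)^T; GEMM contracts only k,
// and the sum over c rides on beta = 1.
void contract_log(int A, int E, int I, int M, int K, int C,
                  const double* V, const double* W, double* t) {
    for (int i = 0; i < I; i++)
    for (int e = 0; e < E; e++)
    for (int c = 0; c < C; c++) {
        const double* Wp = W + (std::size_t)i*A*K + (std::size_t)c*A*K*I;
        const double* Vp = V + (std::size_t)e*M*K + (std::size_t)c*M*K*E;
        double*       tp = t + (std::size_t)e*A + (std::size_t)i*A*E;
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                    A, M, K,              // a x m result, contract over k
                    1.0, Wp, A,           // W slice: a x k, ld = A
                    Vp, M,                // V slice: m x k, used transposed
                    (c == 0) ? 0.0 : 1.0, // accumulate the sum over c
                    tp, A*E*I);           // t slice: a x m, ld = A*E*I
    }
}

When some of the looped or contracted dimensions have length 1, the inner call degenerates toward level 1 or 2 BLAS, which is exactly the data-reuse problem noted above.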

"Native" (TBLIS):

    t^ae_im = Σ_ck V^mk_ec · W^ak_ic

is computed directly, with performance nearly identical to GEMM. The tensors are partitioned and packed in place (see the TBLIS packing figure above), so no transposed copies and no outer loops are needed.
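For reference, the same contraction written as a plain loop nest; this is only the definition that TTDT, LoG, and the native algorithm all compute, not the TBLIS algorithm itself. Layouts are as in the LoG sketch above.

#include <cstddef>

// Reference semantics for t^ae_im = Σ_ck V^mk_ec · W^ak_ic with layouts
// t[a,e,i,m], V[m,k,e,c], W[a,k,i,c] (first index unit stride).
void contract_ref(int A, int E, int I, int M, int K, int C,
                  const double* V, const double* W, double* t) {
    for (int m = 0; m < M; m++)
    for (int i = 0; i < I; i++)
    for (int e = 0; e < E; e++)
    for (int a = 0; a < A; a++) {
        double sum = 0.0;
        for (int c = 0; c < C; c++)
        for (int k = 0; k < K; k++)
            sum += V[m + (std::size_t)k*M + (std::size_t)e*M*K
                       + (std::size_t)c*M*K*E]
                 * W[a + (std::size_t)k*A + (std::size_t)i*A*K
                       + (std::size_t)c*A*K*I];
        t[a + (std::size_t)e*A + (std::size_t)i*A*E
            + (std::size_t)m*A*E*I] = sum;
    }
}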

Leveraging the BLIS Framework

BLIS extends the high-performance Goto approach to matrix multiplication, as used in GotoBLAS and OpenBLAS, by adding two partitioning loops in plain C and only requiring a small microkernel to be highly optimized in assembly.

TBLIS uses this five-loop framework to perform tensor contraction by logically treating the tensor as a matrix during partitioning, requiring extra tensor-aware steps only during packing of the input tensors and updating of the output tensor (the packing and scattered-update steps in the figures above). The high-performance microkernel and cache blocking parameters are used unchanged, and the overall overhead is small.

Threading is simple to implement and can be added hierarchically to four out of the five loops, leading to high scalability.

Tensors as Matrices: The Block-Scatter Matrix

BLIS partitions the input matrices into successively smaller blocks, eventually reaching the small, fixed-size microkernel, such as 4×4, 8×4, or 6×8. These same "register blocksizes" are used during packing.

TBLIS uses this fact to encode tensor blocks as regular matrix blocks whenever possible, so that no overhead is incurred. Irregular blocks, which do not have a constant row or column stride, are handled specially by storing the offset of each row and column individually (see the block-scatter example above).

The tensor dimensions can also be reordered to ensure unit-stride access in at least two of the tensors (see the next panel for the effects of non-unit-stride access).

Results and Conclusions

Random "square" tensor contractions with TTDT and TBLIS, compared to matrix multiplication: even for "square" cases, where overhead is minimized, TTDT incurs as much as a 50% penalty, while TBLIS stays consistently within 5-10%. TBLIS is also highly scalable. Since TBLIS uses the BLIS microkernels and blocking parameters, performance improvements in BLIS, as well as ports to new architectures, automatically carry over to TBLIS.

Speedup of TBLIS over TTDT for the GETT benchmark (arXiv:1607.00145), which spans memory-bound contractions (left) to compute-bound contractions (right): TTDT is efficient for compute-bound problems because transposition is only O(n^2), but the TBLIS speedup increases sharply for more memory-bound problems. The left-most contractions have non-unit-stride access during packing, which hinders TBLIS; this overhead can be overcome by using a 3D packing kernel (future work).

[Plot: GFLOPS/s per core vs. approximate M=N=K for MKL, BLIS, TBLIS, and TTDT on a Xeon E5-2690 v3, 1 core (sizes up to 5000) and 12 cores (sizes up to 2000).]

[Plot: speedup of TBLIS over TTDT for the 24 GETT benchmark contractions (abcde-efbad-cf, abcde-efcad-bf, abcd-dbea-ec, abcde-ecbfa-fd, abcd-deca-be, abc-bda-dc, abcd-ebad-ce, abcdef-dega-gfbc, abcdef-dfgb-geac, abcdef-degb-gfac, abcdef-degc-gfab, abc-dca-bd, abcd-ea-ebcd, abcd-eb-aecd, abcd-ec-abed, abc-adec-ebd, ab-cad-dcb, ab-acd-dbc, abc-acd-db, abc-adc-bd, ab-ac-cb, abcd-aebf-fdec, abcd-eafd-fbec, abcd-aebf-dfce) on a Xeon E5-2690 v3; the axis runs to 5× for 1 core and to 20× for 12 cores.]
