


Devin A. Matthews
Institute for Computational Engineering and Sciences (ICES), The University of Texas at Austin

High-Performance Tensor Contraction without BLAS

Tensor computations – in particular tensor contraction (TC) – are important kernels in many scientific computing applications (SCAs). Due to the fundamental similarity of TC to matrix multiplication (MM) and to the availability of optimized implementations such as the BLAS, tensor operations have traditionally been implemented in terms of BLAS operations, incurring both a performance and a storage overhead. Instead, we implement TC using the much more flexible BLIS framework, which allows for reshaping of the tensor to be fused with internal partitioning and packing operations, requiring no explicit reshaping operations or additional workspace. This implementation, TBLIS, achieves performance approaching that of MM, and in some cases considerably higher than that of traditional TC. Our implementation also supports multithreading using an approach identical to that used for MM in BLIS, with similar performance characteristics. The complexity of managing tensor-to-matrix transformations is also handled automatically in our approach, greatly simplifying its use in SCAs.

References

• Goto approach: K. Goto and R. A. van de Geijn, ACM Trans. Math. Softw. 34, 3, Article 12 (2008).
• BLIS: github.com/flame/blis; shpc.ices.utexas.edu/publications.html
• TBLIS: github.com/devinamatthews/tblis; D. Matthews, arXiv:1607.00291.
• GETT (independent "native" TC algorithm): P. Springer and P. Bientinesi, arXiv:1607.00145.

[Figure: the five loops of the BLIS/Goto algorithm for matrix multiplication C += A·B. The 5th, 4th, and 3rd loops around the microkernel partition C, A, and B with cache blocksizes n_C, k_C, and m_C, packing the blocks A_i and B_p into contiguous buffers Ã_i and B̃_p. The 2nd and 1st loops partition with register blocksizes n_R and m_R, and the m_R × n_R microkernel accumulates over the k_C dimension. The operands are staged through main memory, the L3, L2, and L1 caches, and the registers.]
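The loop structure in the figure can be written down compactly. Below is a minimal, unoptimized C++ sketch of the five loops with packing; the blocksizes are illustrative stand-ins for BLIS's tuned parameters, the microkernel is plain C++ rather than assembly, and for brevity all dimensions are assumed to divide the blocksizes evenly.

#include <vector>

// A minimal sketch of the five-loop Goto/BLIS structure for C += A*B
// (column-major doubles). Blocksizes stand in for the tuned n_C, k_C,
// m_C, m_R, n_R; all dimensions are assumed to divide them evenly.
constexpr int NC = 512, KC = 256, MC = 96, MR = 4, NR = 4;

// Microkernel: k_C rank-1 updates on an MR x NR tile of C, streaming
// contiguous MR- and NR-vectors from the packed buffers.
void ukernel(int kc, const double* Ap, const double* Bp, double* C, int ldc) {
    for (int k = 0; k < kc; k++)
        for (int j = 0; j < NR; j++)
            for (int i = 0; i < MR; i++)
                C[i + j*ldc] += Ap[k*MR + i] * Bp[k*NR + j];
}

// Pack an mc x kc block of A into MR-row "micropanels" so that the
// microkernel reads it with unit stride.
void pack_A(int mc, int kc, const double* A, int lda, double* Ap) {
    for (int i = 0; i < mc; i += MR)
        for (int k = 0; k < kc; k++)
            for (int ir = 0; ir < MR; ir++)
                *Ap++ = A[(i + ir) + k*lda];
}

// Pack a kc x nc block of B into NR-column micropanels.
void pack_B(int kc, int nc, const double* B, int ldb, double* Bp) {
    for (int j = 0; j < nc; j += NR)
        for (int k = 0; k < kc; k++)
            for (int jr = 0; jr < NR; jr++)
                *Bp++ = B[k + (j + jr)*ldb];
}

void gemm(int m, int n, int k, const double* A, int lda,
          const double* B, int ldb, double* C, int ldc) {
    std::vector<double> Ap(MC*KC), Bp(KC*NC);
    for (int jc = 0; jc < n; jc += NC)            // 5th loop (n_C)
    for (int pc = 0; pc < k; pc += KC) {          // 4th loop (k_C)
        pack_B(KC, NC, &B[pc + jc*ldb], ldb, Bp.data());
        for (int ic = 0; ic < m; ic += MC) {      // 3rd loop (m_C)
            pack_A(MC, KC, &A[ic + pc*lda], lda, Ap.data());
            for (int jr = 0; jr < NC; jr += NR)   // 2nd loop (n_R)
            for (int ir = 0; ir < MC; ir += MR)   // 1st loop (m_R)
                ukernel(KC, &Ap[ir*KC], &Bp[jr*KC],
                        &C[(ic + ir) + (jc + jr)*ldc], ldc);
        }
    }
}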

[Figure: the 1st loop and microkernel as adapted by TBLIS. The tensor's physical layout is treated as a matrix-like logical layout during partitioning; a tensor-to-matrix packing kernel produces exactly the same packed internal layouts Ã_i and B̃_p as for matrix multiplication, and the output block C_i receives a (possibly) scattered update. The microkernel itself is unchanged.]
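As a sketch of how the tensor-aware packing might look: the kernel below consumes the block-scatter metadata described just below (rscat/rbs for rows, cscat for columns); the function and parameter names are hypothetical, not TBLIS's actual API. Regular blocks are copied with a single constant stride, exactly like a matrix; irregular blocks fall back to gathering element by element.

#include <cstdint>

// Sketch of a tensor-to-matrix packing kernel (the "Pack A_i" step) driven
// by block-scatter metadata; names are hypothetical. MR is the register
// blocksize (mc is assumed divisible by MR for brevity); rscat/rbs are the
// row scatter and block-stride vectors, cscat the column scatter vector.
constexpr int MR = 4;

void pack_panel(int mc, int kc, const double* T,
                const int64_t* rscat, const int64_t* rbs,
                const int64_t* cscat, double* Ap) {
    for (int i = 0; i < mc; i += MR) {
        int64_t bs = rbs[i / MR];
        for (int k = 0; k < kc; k++) {
            if (bs != 0) {
                // Regular block: the MR rows are a constant stride apart,
                // so this is the same strided copy a matrix would use.
                const double* p = T + rscat[i] + cscat[k];
                for (int ir = 0; ir < MR; ir++) *Ap++ = p[ir*bs];
            } else {
                // Irregular block: gather each element via its row offset.
                for (int ir = 0; ir < MR; ir++)
                    *Ap++ = T[rscat[i + ir] + cscat[k]];
            }
        }
    }
}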

Example (the block-scatter matrix; see "Tensors as Matrices" below): the tensor T_abcde with lengths 6×3×2×3×4 is viewed as the 18×24 matrix M_(cdb)(ae).

• The "a" dimension is stride-1; the other dimensions have increasing strides (6, 18, 36, 108).
• Moving along a column (the "cdb" dimension) has the stride of "c" (6×3 = 18) most of the time. Blocking lets us use this constant stride when possible, and a scattered algorithm otherwise.
• Moving along a row (the "ae" dimension) is similarly stride-1 most of the time (i.e., M is "row-major").
• rscat_M and cscat_M store the offset of each position in the rows and columns; rbs_M and cbs_M store the stride for each block, or zero for irregular blocks.

For this example, with register blocks of 6 rows and 4 columns:

rscat_M = (0, 18, 36, 54, 72, 90, 6, 24, 42, 60, 78, 96, 12, 30, 48, 66, 84, 102); rbs_M = (18, 18, 18)
cscat_M = (0…5, 108…113, 216…221, 324…329); cbs_M = (1, 0, 1, 1, 0, 1)
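The scatter and block-stride vectors can be built directly from the lengths and strides of the grouped dimensions. A self-contained C++ sketch (an illustration of the data structure, not TBLIS's actual code):

#include <algorithm>
#include <cstdint>
#include <vector>

// Block-scatter metadata for one dimension group of a tensor (names
// hypothetical). 'len' and 'stride' list the group's dimensions,
// fastest-varying first; 'br' is the register blocksize (MR or NR).
struct BlockScatter {
    std::vector<int64_t> scat; // offset of every row/column position
    std::vector<int64_t> bs;   // common stride per block, or 0 if irregular
};

BlockScatter make_block_scatter(const std::vector<int64_t>& len,
                                const std::vector<int64_t>& stride, int br) {
    BlockScatter m;
    // Enumerate all index combinations, fastest dimension first,
    // accumulating each position's linear offset into the tensor.
    std::vector<int64_t> idx(len.size(), 0);
    for (;;) {
        int64_t off = 0;
        for (size_t d = 0; d < len.size(); d++) off += idx[d]*stride[d];
        m.scat.push_back(off);
        size_t d = 0;
        while (d < len.size() && ++idx[d] == len[d]) idx[d++] = 0;
        if (d == len.size()) break;
    }
    // A block's stride is the common difference of consecutive offsets
    // within it, or 0 if the differences are not all equal.
    for (size_t b = 0; b < m.scat.size(); b += br) {
        size_t end = std::min(b + (size_t)br, m.scat.size());
        int64_t s = (end - b > 1) ? m.scat[b+1] - m.scat[b] : 1;
        for (size_t i = b + 1; i + 1 < end; i++)
            if (m.scat[i+1] - m.scat[i] != s) { s = 0; break; }
        m.bs.push_back(s);
    }
    return m;
}

// For T_abcde (lengths 6,3,2,3,4; strides 1,6,18,36,108), the rows (c,d,b)
// use len = {2,3,3}, stride = {18,36,6}, br = 6, reproducing
// rscat = (0, 18, 36, 54, 72, 90, 6, ...) and rbs = (18, 18, 18).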

Tensor Contraction

Tensor contraction:

    C_ijk = Σ_mn A_kmni · B_njm

is essentially a generalization of matrix multiplication:

    C_ij = Σ_k A_ki · B_kj

for n-dimensional arrays (tensors). The most common current methods for computing tensor contractions are:

• Transpose-transpose-DGEMM-transpose (TTDT), which reshapes (transposes) the tensors into matrices and calls a single large GEMM. For example, t^ae_im = Σ_ck V^mk_ec · W^ak_ic becomes A = B · C with A = t_(ai)(em), B = W_(ai)(ck), and C = V_(ck)(em). Drawbacks:
  • Large memory overhead.
  • Transpose does not parallelize well.
  • Difficult to optimize out transposes.

• Loop-over-GEMM (LoG), which uses explicit loops over a small GEMM operation: [t^am]_ei += [V^mk]_ec · [W^ak]_ic ∀ i, e, c, i.e. loops over i, e, and c around a GEMM (a sketch follows below). Drawbacks:
  • Low data reuse.
  • Poor cache behavior.
  • Inner kernel is not always GEMM; it may be level 1 or 2 BLAS.

A high-performance "native" algorithm is ideal.
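For concreteness, a minimal LoG sketch for the contraction above using cblas_dgemm. The tensor layouts (first index unit stride: t[a,e,i,m], V[m,k,e,c], W[a,k,i,c]) and the function name are assumptions for illustration.

#include <cstddef>
#include <cblas.h>

// LoG for t^ae_im = Σ_ck V^mk_ec · W^ak_ic. For each (i, e, c), the a x m
// slice of t accumulates W(:,:,i,c) * V(:,:,e,c)^T; GEMM contracts only k,
// and the sum over c rides on beta = 1.
void contract_log(int A, int E, int I, int M, int K, int C,
                  const double* V, const double* W, double* t) {
    for (int i = 0; i < I; i++)
    for (int e = 0; e < E; e++)
    for (int c = 0; c < C; c++) {
        const double* Wp = W + (std::size_t)i*A*K + (std::size_t)c*A*K*I;
        const double* Vp = V + (std::size_t)e*M*K + (std::size_t)c*M*K*E;
        double*       tp = t + (std::size_t)e*A + (std::size_t)i*A*E;
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                    A, M, K,              // a x m result, contract over k
                    1.0, Wp, A,           // W slice: a x k, ld = A
                    Vp, M,                // V slice: m x k, used transposed
                    (c == 0) ? 0.0 : 1.0, // accumulate the sum over c
                    tp, A*E*I);           // t slice: a x m, ld = A*E*I
    }
}

When some of the looped or contracted dimensions have length 1, the inner call degenerates toward level 1 or 2 BLAS, which is exactly the data-reuse problem noted above.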

"Native" (TBLIS):

    t^ae_im = Σ_ck V^mk_ec · W^ak_ic

is computed directly, with performance nearly identical to GEMM. The tensors are partitioned and packed in place (see the TBLIS packing figure above), so no transposed copies and no outer loops are needed.
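For reference, the same contraction written as a plain loop nest; this is only the definition that TTDT, LoG, and the native algorithm all compute, not the TBLIS algorithm itself. Layouts are as in the LoG sketch above.

#include <cstddef>

// Reference semantics for t^ae_im = Σ_ck V^mk_ec · W^ak_ic with layouts
// t[a,e,i,m], V[m,k,e,c], W[a,k,i,c] (first index unit stride).
void contract_ref(int A, int E, int I, int M, int K, int C,
                  const double* V, const double* W, double* t) {
    for (int m = 0; m < M; m++)
    for (int i = 0; i < I; i++)
    for (int e = 0; e < E; e++)
    for (int a = 0; a < A; a++) {
        double sum = 0.0;
        for (int c = 0; c < C; c++)
        for (int k = 0; k < K; k++)
            sum += V[m + (std::size_t)k*M + (std::size_t)e*M*K
                       + (std::size_t)c*M*K*E]
                 * W[a + (std::size_t)k*A + (std::size_t)i*A*K
                       + (std::size_t)c*A*K*I];
        t[a + (std::size_t)e*A + (std::size_t)i*A*E
            + (std::size_t)m*A*E*I] = sum;
    }
}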

Leveraging the BLIS Framework

BLIS extends the high-performance Goto approach to matrix multiplication, as used in GotoBLAS and OpenBLAS, by adding two partitioning loops in plain C and only requiring a small microkernel to be highly optimized in assembly.

TBLIS uses this five-loop framework to perform tensor contraction by logically treating the tensor as a matrix during partitioning, requiring extra tensor-aware steps only during packing of the input tensors and updating of the output tensor (the packing and scattered-update steps in the figures above). The high-performance microkernel and cache blocking parameters are used unchanged, and the overall overhead is small.

Threading is simple to implement and can be added hierarchically to four out of the five loops, leading to high scalability.

Tensors as Matrices: The Block-Scatter Matrix

BLIS partitions the input matrices into successively smaller blocks, eventually reaching the small, fixed-size microkernel, such as 4×4, 8×4, or 6×8. These same "register blocksizes" are used during packing.

TBLIS uses this fact to encode tensor blocks as regular matrix blocks whenever possible, so that no overhead is incurred. Irregular blocks, which do not have a constant row or column stride, are handled specially by storing the offset of each row and column individually (see the block-scatter example above).

The tensor dimensions can also be reordered to ensure unit-stride access in at least two of the tensors (see the next panel for the effects of non-unit-stride access).

Results and Conclusions

Random "square" tensor contractions with TTDT and TBLIS, compared to matrix multiplication: even for "square" cases, where overhead is minimized, TTDT incurs as much as a 50% penalty, while TBLIS stays consistently within 5-10%. TBLIS is also highly scalable. Since TBLIS uses the BLIS microkernels and blocking parameters, performance improvements in BLIS, as well as ports to new architectures, automatically carry over to TBLIS.

Speedup of TBLIS over TTDT for the GETT benchmark (arXiv:1607.00145), which spans memory-bound contractions (left) to compute-bound contractions (right): TTDT is efficient for compute-bound problems because transposition is only O(n^2), but the TBLIS speedup increases sharply for more memory-bound problems. The left-most contractions have non-unit-stride access during packing, which hinders TBLIS; this overhead can be overcome by using a 3D packing kernel (future work).

[Plot: GFLOPS/s per core vs. approximate M=N=K for MKL, BLIS, TBLIS, and TTDT on a Xeon E5-2690 v3, 1 core (sizes up to 5000) and 12 cores (sizes up to 2000).]

[Plot: speedup of TBLIS over TTDT for the 24 GETT benchmark contractions (abcde-efbad-cf, abcde-efcad-bf, abcd-dbea-ec, abcde-ecbfa-fd, abcd-deca-be, abc-bda-dc, abcd-ebad-ce, abcdef-dega-gfbc, abcdef-dfgb-geac, abcdef-degb-gfac, abcdef-degc-gfab, abc-dca-bd, abcd-ea-ebcd, abcd-eb-aecd, abcd-ec-abed, abc-adec-ebd, ab-cad-dcb, ab-acd-dbc, abc-acd-db, abc-adc-bd, ab-ac-cb, abcd-aebf-fdec, abcd-eafd-fbec, abcd-aebf-dfce) on a Xeon E5-2690 v3; the axis runs to 5× for 1 core and to 20× for 12 cores.]
