Matrix multiplication implemented in data flow technology

Aleksandar Milinković
Belgrade University, School of Electrical Engineering



Introduction

Big data poses a problem
The computing paradigm needs to change
Data flow instead of control flow
Achieved by constructing a graph
Graph nodes (vertices) perform the computations
Each node is one deep pipeline


Dataflow computation

Dependencies are resolved at compile time
No new dependencies are created
The whole mechanism is a deep pipeline
Pipeline levels perform computations in parallel
Data flow produces one result per cycle
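The "one result per cycle" behaviour can be illustrated with a small software model. This is only a sketch: `simulate_pipeline` and its stage functions are hypothetical names, and the model assumes one register per pipeline level and one new input per cycle. It shows that after an initial fill of `depth` cycles, each further cycle delivers one result.

```python
def simulate_pipeline(values, stages):
    """Push `values` through `stages` (one function per pipeline level).

    Assumption: one new input enters per cycle and every value moves
    exactly one level per cycle. Returns (results, cycles).
    """
    depth = len(stages)
    slots = [None] * depth  # slots[i] holds the output of level i
    results, cycles, fed = [], 0, 0
    total = len(values)
    while len(results) < total:
        cycles += 1
        # Update from the last level backwards so each value advances once.
        for i in range(depth - 1, 0, -1):
            prev = slots[i - 1]
            slots[i] = None if prev is None else stages[i](prev)
        if fed < total:
            slots[0] = stages[0](values[fed])
            fed += 1
        else:
            slots[0] = None  # pipeline is draining
        if slots[-1] is not None:
            results.append(slots[-1])
    return results, cycles


stages = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
results, cycles = simulate_pipeline([1, 2, 3, 4], stages)
print(results, cycles)  # 4 inputs, depth 3: 3 + 4 - 1 = 6 cycles
```

In this model the total cycle count is `depth + len(values) - 1`, so for long input streams the cost per result approaches one cycle.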


Matrix multiplication

Data flow doesn't suit all situations
However, it is applicable in a lot of cases:
• Partial differential equations
• 3D finite differences
• Finite element method
• Problems in bioinformatics, etc.
Most of them contain matrix multiplications
Goal: realization on FPGA, using data flow
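For reference, the operation being accelerated can be written as a plain triple loop. This is a generic textbook baseline, not the code that was run on the Intel machine; the name `matmul_reference` is illustrative.

```python
def matmul_reference(x, y):
    """Textbook matrix product of x (n x k) and y (k x m), given as lists of rows."""
    n, k, m = len(x), len(y), len(y[0])
    z = [[0] * m for _ in range(n)]
    for i in range(n):          # row of X
        for j in range(m):      # column of Y
            acc = 0
            for p in range(k):  # one multiply-add per step
                acc += x[i][p] * y[p][j]
            z[i][j] = acc
    return z


print(matmul_reference([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

For square n x n matrices the innermost statement runs n * n * n times, which is the cost the dataflow designs spread across parallel hardware.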


Project realizations

Two solutions:
Maximal utilization of the on-chip matrix part
• Matrices with small dimensions
• Matrices with large dimensions
Multiplication using parallel pipelines


Good chip utilization A

[Figure: n x n matrices X (elements X00 ... X(n-1)(n-1)) and Y (elements y00 ... y(n-1)(n-1)), with the labels "FMem capacity", "Pipe 0" and "Pipe 1"]

A set of columns stays on the chip until it is fully used
Every pipe calculates 48 sums at a time
Equivalent to 2 processors with 48 cores
Additional parallelization is possible
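One way to read this scheme: a block of columns of Y stays resident (standing in for on-chip FMem) while rows of X stream past it, and each of the 2 pipes keeps 48 running sums. The Python sketch below follows that reading. The function name, the mapping of sums to columns and the loop order are assumptions made for illustration, not details of the original design.

```python
def matmul_column_blocks(x, y, pipes=2, sums_per_pipe=48):
    """Multiply x (n x k) by y (k x m), one resident column block of y at a time."""
    n_rows, inner, n_cols = len(x), len(y), len(y[0])
    block = pipes * sums_per_pipe  # columns held "on chip" together
    z = [[0] * n_cols for _ in range(n_rows)]
    for c0 in range(0, n_cols, block):
        cols = range(c0, min(c0 + block, n_cols))
        # Load the column block once; it is reused for every row of x.
        on_chip = [[y[k][c] for k in range(inner)] for c in cols]
        for i in range(n_rows):  # stream rows of x past the resident columns
            row = x[i]
            for p in range(pipes):
                first = p * sums_per_pipe
                lanes = on_chip[first:first + sums_per_pipe]
                sums = [0] * len(lanes)  # up to sums_per_pipe running sums
                for k in range(inner):
                    xv = row[k]
                    for lane, col in enumerate(lanes):
                        sums[lane] += xv * col[k]
                for lane, s in enumerate(sums):
                    z[i][c0 + first + lane] = s
    return z


print(matmul_column_blocks([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Each element of the resident block is read from the large memory once and then reused for every row of X, which is the sense in which the on-chip part is "fully used" in this model.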




Good chip utilization A: chip utilization and acceleration

LUTs: 195345/297600 (65.64%)
FFs: 290689/595200 (48.83%)
BRAMs: 778/1064 (73.12%)
DSPs: 996/2016 (49.40%)

Matrix: 2304 x 2304
Intel: 42.5 s
MAX3: 2.38 s

Acceleration at kernel clock 75 MHz: ≈ 18x


Good chip utilization B

[Figure: n x n matrices X (elements X00 ... X(n-1)(n-1)) and Y (elements y00 ... y(n-1)(n-1)), with the labels "FMem capacity", "Pipe 0" and "Pipe 1"]

Part of matrix Y is on the chip during computation
Each pipe calculates 48 sums at a time
Equivalent to 2 processors with 48 cores




Good chip utilization B: chip utilization and acceleration

LUTs: 201237/297600 (67.62%)
FFs: 302742/595200 (50.86%)
BRAMs: 782/1064 (73.50%)
DSPs: 1021/2016 (50.64%)

Matrix: 2304 x 2304
Intel: 42.5 s
MAX3: 2.38 s

Matrix: 4608 x 4608
Intel: 1034 s
MAX3: 58.41 s

Acceleration at kernel clock 75 MHz: ≈ 18x


Multiple parallel pipelines

[Figure: n x n matrices X (elements X00 ... X(n-1)(n-1)) and Y (elements y00 ... y(n-1)(n-1)) feeding Pipe 0, Pipe 1, Pipe 2, ..., Pipe 46, Pipe 47]

Matrices are kept exclusively in the large memory
Each pipe calculates one sum at a time
Equivalent to 48 processors with one core
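The parallel-pipeline variant can be modelled in software as 48 independent accumulators, each responsible for one output element, advancing in lockstep. This is a sketch under that assumption; the function name and the way output elements are assigned to pipes are illustrative, not taken from the original implementation.

```python
def matmul_parallel_pipes(x, y, pipes=48):
    """Multiply x (n x k) by y (k x m); each pipe accumulates one output element.

    Returns (z, batches), where batches counts the groups of up to
    `pipes` output elements that were computed side by side.
    """
    n_rows, inner, n_cols = len(x), len(y), len(y[0])
    z = [[0] * n_cols for _ in range(n_rows)]
    outputs = [(i, j) for i in range(n_rows) for j in range(n_cols)]
    batches = 0
    for start in range(0, len(outputs), pipes):
        batch = outputs[start:start + pipes]  # one output element per pipe
        sums = [0] * len(batch)
        for k in range(inner):  # all pipes take one multiply-add step together
            for p, (i, j) in enumerate(batch):
                # Operands come straight from the full matrices ("big memory").
                sums[p] += x[i][k] * y[k][j]
        for p, (i, j) in enumerate(batch):
            z[i][j] = sums[p]
        batches += 1
    return z, batches


z, batches = matmul_parallel_pipes([[1, 2], [3, 4]], [[5, 6], [7, 8]], pipes=3)
print(z, batches)
```

Unlike the on-chip variants, nothing is cached between batches in this model, which matches the lower BRAM usage reported for this design.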




Multiple parallel pipelines: chip utilization and acceleration

LUTs: 166328/297600 (55.89%)
FFs: 248047/595200 (41.67%)
BRAMs: 430/1064 (40.41%)
DSPs: 489/2016 (24.26%)

Matrix: 2304 x 2304
Intel: 42.5 s
MAX3: 4.08 s

Matrix: 4608 x 4608
Intel: 1034 s
MAX3: 98.48 s

Acceleration at kernel clock 150 MHz: > 10x


Comparison of solutions

First solution:
• Good chip utilization
• Shorter execution time
• Drawback: matrices up to 8 GB

Second solution:
• Matrices up to 12 GB
• Drawback: longer execution time
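The acceleration figures quoted on the result slides follow directly from the measured times. The short script below recomputes them; the timings are the ones reported in the slides, while the dictionary layout and the `speedup` helper are only illustrative.

```python
# (design, matrix size) -> (Intel time in s, MAX3 time in s), as reported
MEASUREMENTS = {
    ("good chip utilization", 2304): (42.5, 2.38),
    ("good chip utilization", 4608): (1034.0, 58.41),
    ("parallel pipelines", 2304): (42.5, 4.08),
    ("parallel pipelines", 4608): (1034.0, 98.48),
}


def speedup(intel_s, max3_s):
    """Ratio of CPU time to dataflow time."""
    return intel_s / max3_s


for (design, size), (intel_s, max3_s) in MEASUREMENTS.items():
    print(f"{design:>22} {size} x {size}: {speedup(intel_s, max3_s):.1f}x")
```

The ratios come out at roughly 18x for the first solution and a little over 10x for the second, at both matrix sizes.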


Conclusions

Matrix multiplication is an operation with complexity O(n^3)
Part of the complexity is moved from time to space
That produces acceleration (shorter execution time)
Achieved by applying data flow technology
Developed using the tool chain from Maxeler Technologies
Calculations are an order of magnitude faster than on Intel Xeon
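The "time to space" trade can be made concrete with an idealized cost model. It assumes, as a simplification not stated in the slides, that every parallel sum performs one multiply-add per clock cycle and that memory never stalls the pipelines. Here P is the number of sums computed in parallel and f is the kernel clock.

```latex
% Idealized model: n^3 multiply-adds spread over P parallel sums at clock f
T_{\text{ideal}} \approx \frac{n^{3}}{P \, f}

% Good chip utilization: P = 2 \cdot 48 = 96, f = 75\,\text{MHz}, n = 2304
T_{\text{ideal}} \approx \frac{2304^{3}}{96 \cdot 75 \cdot 10^{6}\,\text{s}^{-1}} \approx 1.70\,\text{s}

% Parallel pipelines: P = 48, f = 150\,\text{MHz}, n = 2304
T_{\text{ideal}} \approx \frac{2304^{3}}{48 \cdot 150 \cdot 10^{6}\,\text{s}^{-1}} \approx 1.70\,\text{s}
```

Both designs have the same bound under this model. The measured times (2.38 s and 4.08 s) are higher, as expected once data transfer and pipeline fill are included.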

