Aleksandar Milinković, Belgrade University, School of Electrical Engineering
Matrix multiplication implemented in data flow technology

Introduction
• Problem with big data
• Need to change the computing paradigm
• Data flow instead of control flow
• Achieved by constructing a graph
• Graph nodes (vertices) perform computations
• Each node is one deep pipeline

Dataflow computation
• Dependencies are resolved at compile time
• No new dependencies are created at run time
• The whole mechanism is a deep pipeline
• Pipeline levels perform computations in parallel
• Data flow produces one result per cycle
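
The "one result per cycle" behaviour can be illustrated with a small software simulation. This is only a sketch, not the hardware design from the talk: the function `run_pipeline` and the three example stages are hypothetical names chosen for illustration, and `None` is used to mark an empty pipeline register.

```python
from collections import deque


def run_pipeline(inputs, stages):
    """Simulate a pipeline in which every stage advances once per cycle.

    Returns a list of (cycle, value) pairs, one for each result that
    leaves the last stage. None marks an empty register, so None must
    not be used as a data value in this sketch.
    """
    depth = len(stages)
    regs = [None] * depth      # regs[i] is the output register of stage i
    pending = deque(inputs)
    results = []
    cycle = 0
    while pending or any(v is not None for v in regs[:-1]):
        cycle += 1
        # Update from the last stage to the first, so that each stage
        # consumes what its predecessor produced in the previous cycle.
        for i in range(depth - 1, 0, -1):
            prev = regs[i - 1]
            regs[i] = stages[i](prev) if prev is not None else None
        regs[0] = stages[0](pending.popleft()) if pending else None
        if regs[-1] is not None:
            results.append((cycle, regs[-1]))
    return results


# Hypothetical three-level pipeline: square, add one, double.
example_stages = [lambda v: v * v, lambda v: v + 1, lambda v: v * 2]
example_results = run_pipeline([1, 2, 3, 4], example_stages)
```

After the pipeline has filled (here, three cycles for three levels), a new result leaves it in every cycle, which is the property the slide refers to.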

Matrix multiplication
• Data flow does not suit all situations
• However, it is applicable in a lot of cases:
  • Partial differential equations
  • 3D finite differences
  • Finite element method
  • Problems in bioinformatics, etc.
• Most of them contain matrix multiplications
• Goal: realization on an FPGA, using data flow

Project realizations
• Two solutions:
  • Maximal utilization of the on-chip matrix part
    • Matrices with small dimensions
    • Matrices with large dimensions
  • Multiplication using parallel pipelines

Good chip utilization A
[Figure: n x n matrices X (elements X00 ... X(n-1)(n-1)) and y (elements y00 ... y(n-1)(n-1)); labels: FMem capacity, Pipe 0, Pipe 1]
• A set of columns stays on the chip until it is fully used
• Every pipe calculates 48 sums at a time
• Equivalent to 2 processors with 48 cores each
• Additional parallelization is possible
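
The column-reuse idea can be sketched in software as follows. This is an illustrative sketch under stated assumptions, not the kernel from the talk: it assumes the product being computed is X·y, that the resident set of columns belongs to y, and that rows of X are streamed past it. The function name `multiply_column_blocks` and its parameters are hypothetical; the values 48 and 2 come from the slide.

```python
SUMS_PER_PIPE = 48   # from the slide: every pipe calculates 48 sums at a time
NUM_PIPES = 2        # from the figure labels: Pipe 0 and Pipe 1


def multiply_column_blocks(x, y, sums_per_pipe=SUMS_PER_PIPE, num_pipes=NUM_PIPES):
    """Multiply square matrices x and y (lists of lists), one column block at a time.

    A block of columns of y plays the role of the data held on chip; it is
    reused for every row of x before the next block is loaded.
    """
    n = len(x)
    block = sums_per_pipe * num_pipes
    result = [[0] * n for _ in range(n)]
    for col0 in range(0, n, block):
        cols = range(col0, min(col0 + block, n))
        # Stand-in for the on-chip copy of the current set of columns.
        on_chip = [[y[k][j] for j in cols] for k in range(n)]
        for i in range(n):                 # stream every row of x
            sums = [0] * len(cols)         # one running sum per resident column
            for k in range(n):
                x_ik = x[i][k]
                row_k = on_chip[k]
                # In hardware these updates would happen side by side;
                # here they are simply executed one after another.
                for c in range(len(cols)):
                    sums[c] += x_ik * row_k[c]
            for c, j in enumerate(cols):
                result[i][j] = sums[c]
    return result
```

Each block of columns is loaded once and used for all n rows before it is replaced, which is the sense in which the columns stay on the chip "until they are fully used".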

Good chip utilization A: chip utilization and acceleration
• LUTs: 195345 / 297600 (65.64%)
• FFs: 290689 / 595200 (48.83%)
• BRAMs: 778 / 1064 (73.12%)
• DSPs: 996 / 2016 (49.40%)
• Matrix 2304 x 2304: Intel 42.5 s, MAX3 2.38 s
• Acceleration at kernel clock 75 MHz: ≈ 18x

Good chip utilization B
[Figure: n x n matrices X (elements X00 ... X(n-1)(n-1)) and y (elements y00 ... y(n-1)(n-1)); labels: FMem capacity, Pipe 0, Pipe 1]
• Part of matrix Y is on the chip during computation
• Each pipe calculates 48 sums at a time
• Equivalent to 2 processors with 48 cores each

Good chip utilization B: chip utilization and acceleration
• LUTs: 201237 / 297600 (67.62%)
• FFs: 302742 / 595200 (50.86%)
• BRAMs: 782 / 1064 (73.50%)
• DSPs: 1021 / 2016 (50.64%)
• Matrix 2304 x 2304: Intel 42.5 s, MAX3 2.38 s
• Matrix 4608 x 4608: Intel 1034 s, MAX3 58.41 s
• Acceleration at kernel clock 75 MHz: ≈ 18x

Multiple parallel pipelines
[Figure: n x n matrices X (elements X00 ... X(n-1)(n-1)) and y (elements y00 ... y(n-1)(n-1)); labels: Pipe 0, Pipe 1, Pipe 2, ..., Pipe 46, Pipe 47]
• Matrices are kept exclusively in the large memory
• Each pipe calculates one sum at a time
• Equivalent to 48 processors with one core each
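
A software sketch of the "one sum per pipe" arrangement is given below. It is an illustration only: the slide does not say how output elements are assigned to pipes, so handing out consecutive result elements in batches of 48 is an assumption, and `multiply_parallel_pipes` is a hypothetical name. The product is again assumed to be X·y.

```python
def multiply_parallel_pipes(x, y, num_pipes=48):
    """Multiply square matrices x and y with num_pipes independent dot products per round.

    Returns (result, rounds), where rounds counts how many batches of
    num_pipes output elements were needed.
    """
    n = len(x)
    result = [[0] * n for _ in range(n)]
    outputs = [(i, j) for i in range(n) for j in range(n)]
    rounds = 0
    for start in range(0, len(outputs), num_pipes):
        batch = outputs[start:start + num_pipes]
        # Each (i, j) in the batch stands for one pipe computing one sum.
        # In hardware the pipes would run side by side; here they run in turn.
        for i, j in batch:
            result[i][j] = sum(x[i][k] * y[k][j] for k in range(n))
        rounds += 1
    return result, rounds
```

With 48 pipes, an n x n result needs about n·n/48 rounds, each of which walks through n products per pipe.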

Multiple parallel pipelines: chip utilization and acceleration
• LUTs: 166328 / 297600 (55.89%)
• FFs: 248047 / 595200 (41.67%)
• BRAMs: 430 / 1064 (40.41%)
• DSPs: 489 / 2016 (24.26%)
• Matrix 2304 x 2304: Intel 42.5 s, MAX3 4.08 s
• Matrix 4608 x 4608: Intel 1034 s, MAX3 98.48 s
• Acceleration at kernel clock 150 MHz: > 10x

Comparison of solutions
• First solution:
  • Good chip utilization
  • Shorter execution time
  • Drawback: matrices up to 8 GB
• Second solution:
  • Matrices up to 12 GB
  • Drawback: longer execution time
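
The acceleration figures quoted for the two solutions can be recomputed from the execution times reported on the slides. The snippet below is a small arithmetic check; the dictionary keys and the helper name `speedup` are labels introduced here for illustration, while the timings are the ones stated in the presentation.

```python
# (Intel seconds, MAX3 seconds), as reported on the slides.
TIMINGS = {
    ("good chip utilization", 2304): (42.5, 2.38),
    ("good chip utilization", 4608): (1034.0, 58.41),
    ("parallel pipelines", 2304): (42.5, 4.08),
    ("parallel pipelines", 4608): (1034.0, 98.48),
}


def speedup(intel_seconds, max3_seconds):
    """Ratio of the Intel execution time to the MAX3 execution time."""
    return intel_seconds / max3_seconds


SPEEDUPS = {key: speedup(*times) for key, times in TIMINGS.items()}

for (solution, size), value in SPEEDUPS.items():
    print(f"{solution}, {size} x {size}: {value:.1f}x")
```

The ratios come out close to 18x for the first solution and a little above 10x for the second, in line with the figures on the slides.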

Conclusions
• Matrix multiplication is an operation with complexity O(n³)
• Part of the complexity is moved from time to space
• That produces acceleration (shorter execution time)
• Achieved by applying data flow technology
• Developed using the tool chain from Maxeler Technologies
• Calculations are an order of magnitude faster than on an Intel Xeon
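
The trade of time for space can be made concrete with a rough, idealized estimate. The sketch below assumes the textbook algorithm with n³ multiply-add steps and assumes that every parallel sum accepts one multiply-add per clock cycle; it ignores data transfer, pipeline fill, and any other overhead, so it is only a lower bound and not a model of the actual design. The names `ideal_cycles` and `ideal_seconds` are hypothetical; the figures 2 x 48 sums at 75 MHz and 48 x 1 sums at 150 MHz are taken from the slides.

```python
def ideal_cycles(n, parallel_sums):
    """Cycles needed for n**3 multiply-adds when parallel_sums of them run per cycle.

    Assumption: perfect utilization, no stalls, no transfer overhead.
    """
    total = n ** 3
    return -(-total // parallel_sums)   # ceiling division


def ideal_seconds(n, parallel_sums, clock_hz):
    """Idealized lower bound on execution time under the same assumptions."""
    return ideal_cycles(n, parallel_sums) / clock_hz


# First solution: 2 pipes x 48 sums, 75 MHz kernel clock.
bound_first = ideal_seconds(2304, 2 * 48, 75e6)
# Second solution: 48 pipes x 1 sum, 150 MHz kernel clock.
bound_second = ideal_seconds(2304, 48 * 1, 150e6)
```

Under these assumptions both configurations give a bound of about 1.7 s for a 2304 x 2304 matrix. The measured times reported in the slides (2.38 s and 4.08 s) lie above it, as expected for a bound that leaves out all overheads.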