Matrix multiplication implemented in data flow technology

Aleksandar Milinković, Belgrade University, School of Electrical Engineering

Post on 23-Feb-2016

Introduction

Problem with big data
Need to change the computing paradigm
Data flow instead of control flow
Achieved by constructing a graph
Graph nodes (vertices) perform computations
Each node is one deep pipeline

Dataflow computation

Dependencies are resolved at compile time
No new dependencies are created
The whole mechanism is in a deep pipeline
Pipeline levels perform computations in parallel
Data flow produces one result per cycle
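The "one result per cycle" behaviour can be illustrated with a small software model. This is only a sketch under assumptions: `simulate_pipeline`, `EMPTY`, and the three example stages are hypothetical names chosen for illustration, and real hardware pipelines are far deeper. The model shows that after the pipeline fills, every cycle delivers one result, so n inputs through d stages take d + n - 1 cycles.

```python
EMPTY = object()  # marks a pipeline register that holds no value this cycle


def simulate_pipeline(inputs, stages):
    """Feed one input per cycle through a chain of stage functions.

    Returns (results, cycles). All stages work in the same cycle, each on
    the value its predecessor produced in the previous cycle.
    """
    if not stages:
        raise ValueError("the pipeline needs at least one stage")
    inputs = list(inputs)
    feed = iter(inputs)
    regs = [EMPTY] * len(stages)  # regs[i] is the output register of stage i
    results = []
    cycles = 0
    while len(results) < len(inputs):
        cycles += 1
        new_regs = []
        for i, stage in enumerate(stages):
            # Stage 0 reads a fresh input; later stages read the previous register.
            operand = next(feed, EMPTY) if i == 0 else regs[i - 1]
            new_regs.append(EMPTY if operand is EMPTY else stage(operand))
        regs = new_regs
        if regs[-1] is not EMPTY:
            results.append(regs[-1])  # one finished result leaves per cycle
    return results, cycles


# Hypothetical three-level pipeline computing (x + 1) * 2 - 3.
stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
results, cycles = simulate_pipeline(range(5), stages)
print(results, cycles)
```

With 5 inputs and 3 stages the model needs 3 + 5 - 1 = 7 cycles: 3 to fill the pipeline, then one per remaining result.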

Matrix multiplication

Data flow doesn’t suit all situations
However, it is applicable in many cases:
• Partial differential equations
• 3D finite differences
• Finite element method
• Problems in bioinformatics, etc.
Most of them contain matrix multiplications
Goal: realization on FPGA, using data flow

Project realizations

Two solutions:
Maximal utilization of on-chip matrix part
• Matrices with small dimensions
• Matrices with large dimensions
Multiplication using parallel pipelines

Good chip utilization A

[Figure: n x n matrices X (elements X00 ... X(n-1)(n-1)) and Y (elements y00 ... y(n-1)(n-1)); labels: FMem capacity, Pipe 0, Pipe 1]

Set of columns on the chip until they are fully used
Every pipe calculates 48 sums at a time
Equivalent to 2 processors with 48 cores
Additional parallelization possible
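The scheme on this slide can be modelled in plain Python. This is an illustrative sketch, not the Maxeler kernel: the function name, the parameters `pipes` and `sums_per_pipe`, and the way resident columns are split between the pipes are assumptions. Only the figures of 2 pipes and 48 sums per pipe come from the slide.

```python
def multiply_with_onchip_columns(X, Y, pipes=2, sums_per_pipe=48):
    """Multiply X (n x k) by Y (k x m), keeping one set of Y columns resident.

    A set of pipes * sums_per_pipe columns of Y is "loaded on chip" and
    reused for every row of X before the next set is loaded.
    """
    n, k, m = len(X), len(Y), len(Y[0])
    width = pipes * sums_per_pipe  # columns of Y resident at once (assumption)
    Z = [[0] * m for _ in range(n)]
    for c0 in range(0, m, width):
        cols = range(c0, min(c0 + width, m))
        ncols = len(cols)
        # The resident block stays in place until it is fully used.
        block = [[Y[r][c] for c in cols] for r in range(k)]
        for i in range(n):  # stream X row by row past the resident block
            acc = [0] * ncols  # one running sum per resident column
            for r in range(k):
                x = X[i][r]
                # In hardware these updates would happen in parallel;
                # here each pipe handles its own slice of sums in turn.
                for p in range(pipes):
                    for s in range(sums_per_pipe):
                        j = p * sums_per_pipe + s
                        if j < ncols:
                            acc[j] += x * block[r][j]
            Z[i][c0:c0 + ncols] = acc
    return Z


X = [[1, 2, 3], [4, 5, 6]]
Y = [[1, 0, 2, 1, 3], [0, 1, 1, 2, 0], [2, 1, 0, 1, 1]]
# Tiny configuration (2 pipes, 1 sum each) so several column sets are needed.
print(multiply_with_onchip_columns(X, Y, pipes=2, sums_per_pipe=1))
```

The point of the model is the loop order: each set of columns is read from the large memory once and reused for all rows of X.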


Good chip utilization A
Chip utilization and acceleration

LUTs: 195345/297600 (65.64%)
FFs: 290689/595200 (48.83%)
BRAMs: 778/1064 (73.12%)
DSPs: 996/2016 (49.40%)

Matrix: 2304 x 2304
Intel: 42.5 s
MAX3: 2.38 s

Acceleration at kernel clock 75 MHz: ≈ 18x
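The percentages and the acceleration factor on this slide follow directly from the raw numbers. The short check below recomputes them. The variable names are illustrative, and the inputs are copied from the slide.

```python
# Resource usage reported on the slide: (used, available).
resources = {
    "LUTs": (195345, 297600),
    "FFs": (290689, 595200),
    "BRAMs": (778, 1064),
    "DSPs": (996, 2016),
}

# Utilization in percent for each resource type.
utilization = {name: 100 * used / total for name, (used, total) in resources.items()}

# Execution times for the 2304 x 2304 multiplication, in seconds.
intel_seconds = 42.5
max3_seconds = 2.38
speedup = intel_seconds / max3_seconds

for name, percent in utilization.items():
    print(f"{name}: {percent:.2f}%")
print(f"speedup: {speedup:.2f}x")
```

The ratio 42.5 / 2.38 is about 17.9, which the slide rounds to 18x. The recomputed FF utilization rounds to 48.84%, so the 48.83% on the slide appears to be truncated rather than rounded.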

Good chip utilization B

[Figure: n x n matrices X (elements X00 ... X(n-1)(n-1)) and Y (elements y00 ... y(n-1)(n-1)); labels: FMem capacity, Pipe 0, Pipe 1]

Part of matrix Y is on chip during computation
Each pipe calculates 48 sums at a time
Equivalent to 2 processors with 48 cores


Good chip utilization B
Chip utilization and acceleration

LUTs: 201237/297600 (67.62%)
FFs: 302742/595200 (50.86%)
BRAMs: 782/1064 (73.50%)
DSPs: 1021/2016 (50.64%)

Matrix: 2304 x 2304
Intel: 42.5 s
MAX3: 2.38 s

Acceleration at kernel clock 75 MHz: ≈ 18x

Matrix: 4608 x 4608
Intel: 1034 s
MAX3: 58.41 s

Multiple parallel pipelines

[Figure: n x n matrices X (elements X00 ... X(n-1)(n-1)) and Y (elements y00 ... y(n-1)(n-1)); labels: Pipe 0, Pipe 1, Pipe 2, ..., Pipe 46, Pipe 47]

Matrices are exclusively in a big memory
Each pipe calculates one sum at a time
Equivalent to 48 processors with one core
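A plain Python model of this second scheme is sketched below. It is a sketch under assumptions, not the actual kernel: the function name, the `pipes` parameter, and the row-major order in which output elements are handed to pipes are all hypothetical. The slide states only that there are 48 pipes and that each computes one sum at a time.

```python
def multiply_with_parallel_pipes(X, Y, pipes=48):
    """Multiply X (n x k) by Y (k x m) with `pipes` independent accumulators.

    Each pipe owns exactly one output element at a time and accumulates
    its dot product while operands stream in from the large memory.
    """
    n, k, m = len(X), len(Y), len(Y[0])
    Z = [[0] * m for _ in range(n)]
    outputs = [(i, j) for i in range(n) for j in range(m)]
    for start in range(0, len(outputs), pipes):
        batch = outputs[start:start + pipes]  # one output element per pipe
        sums = [0] * len(batch)
        for r in range(k):
            # All pipes advance in lockstep: one multiply-add each per step.
            for p, (i, j) in enumerate(batch):
                sums[p] += X[i][r] * Y[r][j]
        for p, (i, j) in enumerate(batch):
            Z[i][j] = sums[p]
    return Z


X = [[1, 2, 3], [4, 5, 6]]
Y = [[1, 0, 2, 1, 3], [0, 1, 1, 2, 0], [2, 1, 0, 1, 1]]
# Four pipes instead of 48 so the small example needs several batches.
print(multiply_with_parallel_pipes(X, Y, pipes=4))
```

Unlike the on-chip variant, nothing is kept resident here, so every operand is fetched from the large memory each time it is needed.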


Multiple parallel pipelines
Chip utilization and acceleration

LUTs: 166328/297600 (55.89%)
FFs: 248047/595200 (41.67%)
BRAMs: 430/1064 (40.41%)
DSPs: 489/2016 (24.26%)

Matrix: 2304 x 2304
Intel: 42.5 s
MAX3: 4.08 s

Acceleration at kernel clock 150 MHz: > 10x

Matrix: 4608 x 4608
Intel: 1034 s
MAX3: 98.48 s
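To compare the reported timings on a common scale, the sketch below converts each of them into multiply-adds per second. It assumes the classical algorithm, which performs n^3 multiply-adds for n x n matrices. The slides give only the O(n^3) complexity, so that exact operation count is an assumption, as are the labels and helper names used here.

```python
def macs_per_second(n, seconds):
    """Effective multiply-adds per second, assuming n**3 multiply-adds in total."""
    return n ** 3 / seconds


# (label, matrix dimension, reported seconds), copied from the slides.
measurements = [
    ("Intel", 2304, 42.5),
    ("MAX3, good chip utilization", 2304, 2.38),
    ("MAX3, parallel pipelines", 2304, 4.08),
    ("Intel", 4608, 1034.0),
    ("MAX3, good chip utilization", 4608, 58.41),
    ("MAX3, parallel pipelines", 4608, 98.48),
]

for label, n, seconds in measurements:
    rate = macs_per_second(n, seconds)
    print(f"{label:28s} n={n}: {rate / 1e9:.3f} G multiply-adds/s")

# Speedup over the Intel run for each MAX3 measurement of the same size.
intel = {n: s for label, n, s in measurements if label == "Intel"}
speedups = {
    (label, n): intel[n] / s for label, n, s in measurements if label != "Intel"
}
```

The speedups computed this way stay close to the slide figures for both matrix sizes: roughly 18x for the on-chip solution and somewhat above 10x for the parallel pipelines.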

Comparison of solutions

First solution:
• Good chip utilization
• Shorter execution time
• Drawback: matrices up to 8 GB
Second solution:
• Matrices up to 12 GB
• Drawback: longer execution time

Conclusions

Matrix multiplication is an operation with complexity O(n^3)
Part of the complexity is moved from time to space
That produces acceleration (shorter execution time)
Achieved by applying data flow technology
Developed using the tool chain from Maxeler Technologies
Calculations an order of magnitude faster than on Intel Xeon
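The time-for-space trade can be written as a rough estimate. The formula below is a back-of-the-envelope sketch, not a result from the presentation. It assumes that each "sum" receives one multiply-add per kernel clock cycle and that data transfer and pipeline fill are ignored. The symbols P and f are introduced here for illustration.

```latex
% Assumption (not stated on the slides): P multiply-adds complete per
% kernel clock cycle, and memory transfer time is ignored.
\[
  T_{\text{ideal}} \approx \frac{n^{3}}{P \cdot f}
\]
% P = number of sums updated per cycle, f = kernel clock frequency.
% Example with the first solution's figures: 2 pipes x 48 sums at 75 MHz.
\[
  n = 2304,\quad P = 2 \cdot 48 = 96,\quad f = 75\,\text{MHz}
  \;\Rightarrow\;
  T_{\text{ideal}} \approx \frac{2304^{3}}{96 \cdot 75 \cdot 10^{6}\,\text{s}^{-1}} \approx 1.7\,\text{s}
\]
```

Under these assumptions the estimate is of the same order as the 2.38 s reported for the first solution. Increasing P, which means spending more chip area, shortens the time proportionally.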
