
The LINPACK Benchmark on a Multi-Core Multi-FPGA System

by

Emanuel Ramalho

Supervisor: Prof. Paul Chow

University of Toronto

Electrical and Computer Engineering Department

October 1st, 2008

Outline

Motivation

LINPACK Algorithm

Parallelizing LINPACK

Results

Conclusions

Future Work

Motivation

The LINPACK Benchmark is used to rank the Top500 computers in the world

Can FPGAs compete?

Objective

To see how well a multi-core multi-FPGA system performs when compared to a processor

FPGA

Disadvantage: Much lower clock rate

Advantage: The entire implementation may be done in hardware

LINPACK Algorithm

Solves a system of linear equations by calling two routines: DGEFA and DGESL

Ax=b

DGEFA: LU factorization with partial pivoting: A = LUP

Ax = LUx = b

DGESL: Solves the system using the LU factorization:

Ly = b, Ux = y
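The two-stage solve can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's code: the names `lu_factor` and `lu_solve` are hypothetical stand-ins for DGEFA and DGESL, the matrix is a plain list of rows, and whole rows are swapped during pivoting (an assumption made for brevity).

```python
def lu_factor(a):
    """Factor the square matrix a in place with partial pivoting.

    After the call, the upper triangle of a holds U and the strict lower
    triangle holds the multipliers of L (unit diagonal implied).
    Returns the list of pivot rows chosen at each step.
    """
    n = len(a)
    ipvt = []
    for k in range(n - 1):
        # Partial pivoting: row with the largest |a[i][k]| for i >= k.
        pivot = max(range(k, n), key=lambda i: abs(a[i][k]))
        ipvt.append(pivot)
        if a[pivot][k] == 0.0:
            raise ZeroDivisionError("matrix is singular")
        a[k], a[pivot] = a[pivot], a[k]  # swap whole rows (assumption)
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]  # multiplier, stored in L's position
            for j in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]
    return ipvt


def lu_solve(a, ipvt, b):
    """Solve Ax = b from the factors produced by lu_factor."""
    n = len(a)
    x = list(b)
    # Apply the same row swaps to the right-hand side.
    for k, pivot in enumerate(ipvt):
        x[k], x[pivot] = x[pivot], x[k]
    # Forward substitution: Ly = b (L has a unit diagonal).
    for i in range(n):
        for j in range(i):
            x[i] -= a[i][j] * x[j]
    # Back substitution: Ux = y.
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, n):
            x[i] -= a[i][j] * x[j]
        x[i] /= a[i][i]
    return x


A = [[2.0, 1.0, 1.0], [4.0, -6.0, 0.0], [-2.0, 7.0, 2.0]]
b = [5.0, -2.0, 9.0]
x = lu_solve(A, lu_factor(A), b)
print(x)
```

Factoring once and reusing the factors for each right-hand side is what makes the split into two routines worthwhile.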

LINPACK1 vs. HPL

LINPACK1: Single processor; Uses Level 1 BLAS; Slower; Low complexity

HPL: Multiple processors; Uses Level 3 BLAS; Faster; High complexity

FPGA Implementation: BLAS3 performs faster on processors (due to locality of reference). FPGAs do not take advantage of BLAS3, so LINPACK1 is chosen

LINPACK Pseudo-Code

1. Random generation of matrix A and vector b

2. Execute DGEFA routine (A=LU); IDAMAX, DSCAL and DAXPY are executed here

3. Execute DGESL routine (LUx=b)

4. Verify the result using residual calculation

Performance is measured from 2. to 3. (inclusive)
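As a rough sketch of how the timed steps turn into a MFLOPS figure, the snippet below uses the conventional LINPACK operation count of 2/3 n^3 + 2 n^2 floating-point operations for an n x n system. The function names and the 2 ms timing are hypothetical, chosen only for illustration.

```python
def linpack_flops(n):
    """Conventional LINPACK operation count for an n x n system."""
    return (2.0 / 3.0) * n**3 + 2.0 * n**2


def mflops(n, seconds):
    """Performance in MFLOPS given the time spent in steps 2 and 3."""
    return linpack_flops(n) / seconds / 1e6


# Hypothetical timing: a 100 x 100 problem factored and solved in 2 ms.
print(round(mflops(100, 0.002), 1))
```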

How is this going to be parallelized?

Parallelizing LINPACK

Find focus of parallelization: DGEFA

[Chart: share of LINPACK execution time. DGEFA: 95%, DGESL: 5%]

DGEFA Analysis

Inside DGEFA: IDAMAX, DSCAL and DAXPY

DAXPY is the main computation

[Chart: share of DGEFA execution time. DAXPY: 90%, IDAMAX: 5%, DSCAL: 5%]

TMD-MPI

TMD-MPI is a lightweight implementation of MPI (the Message Passing Interface)

TMD-MPE is a hardware implementation of TMD-MPI's main functionality (SEND and RECV)

[Figure: TMD-MPE and TMD-MPI attached to the network. Labels: FSL, NetIf, PLB Bus, TMD-MPE, HW, MPI, Network]

DGEFA Parallelization

1. Generate matrix A and vector b (main rank)

2. (MPI) Matrix distribution

3. Perform DGEFA (main loop)
• Perform IDAMAX and DSCAL
• (MPI) Broadcast scaled column and pivot
• Perform loop that contains DAXPY

4. (MPI) Matrix gather (main rank)

5. Perform DGESL

6. Calculate residual
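The flow above can be mimicked in a single Python process to see which rank does what. This is only a sketch under stated assumptions: columns are dealt to ranks cyclically (the slides do not confirm the exact distribution), the MPI broadcast is modeled by passing a copied list, and all function names are hypothetical.

```python
def distribute(a, nranks):
    """Deal the columns of a to ranks cyclically (assumed distribution)."""
    n = len(a)
    return [{j: [a[i][j] for i in range(n)] for j in range(r, n, nranks)}
            for r in range(nranks)]


def parallel_dgefa(local, n):
    """Column-distributed factorization; local[r] maps column index to data."""
    nranks = len(local)
    ipvt = []
    for k in range(n - 1):
        col = local[k % nranks][k]  # only the owner of column k works here
        # IDAMAX: pivot search on the owner's column.
        pivot = max(range(k, n), key=lambda i: abs(col[i]))
        ipvt.append(pivot)
        if col[pivot] == 0.0:
            raise ZeroDivisionError("matrix is singular")
        col[k], col[pivot] = col[pivot], col[k]
        # DSCAL: scale the sub-column by t = -1/pivot.
        t = -1.0 / col[k]
        for i in range(k + 1, n):
            col[i] *= t
        # "Broadcast": every rank receives the pivot and the scaled column.
        scaled = col[k + 1:]
        for rank_cols in local:
            for j, cj in rank_cols.items():
                if j <= k:
                    continue
                cj[k], cj[pivot] = cj[pivot], cj[k]
                tj = cj[k]
                # DAXPY on the locally owned column j.
                for i in range(k + 1, n):
                    cj[i] += tj * scaled[i - k - 1]
    return ipvt


def gather(local, n):
    """Reassemble the full matrix on the main rank."""
    a = [[0.0] * n for _ in range(n)]
    for rank_cols in local:
        for j, cj in rank_cols.items():
            for i in range(n):
                a[i][j] = cj[i]
    return a


A = [[2.0, 1.0, 1.0], [4.0, -6.0, 0.0], [-2.0, 7.0, 2.0]]
local = distribute(A, nranks=2)
ipvt = parallel_dgefa(local, n=3)
print(ipvt, gather(local, 3))
```

In each iteration only the owner of column k runs IDAMAX and DSCAL, while every rank runs DAXPY on its own columns, which is why the broadcast sits on the critical path.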

LINPACK Engine

[Figure: LINPACK Engine block diagram. Labels: To On-Chip Network, TMD-MPE, Command FSLs, Data FSLs, LINPACK Engine, Main FSM, MPE Header FSM, BLAS1 Engine, RAM, Control Signals, Data]

BLAS1 Engine

Performs IDAMAX, DSCAL and DAXPY

IDAMAX: Finds the element of v1 with the largest absolute value and returns its index

DSCAL: Performs v2 = α·v1

DAXPY: Calculates v3 = α·v1 + v2
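For reference, the three operations can be written as small Python functions. This is a sketch that follows the v1, v2, v3 notation of the slide and returns new lists; the actual BLAS routines update their arguments in place and take stride parameters, which are omitted here.

```python
def idamax(v1):
    """Index of the element with the largest absolute value (first on ties)."""
    return max(range(len(v1)), key=lambda i: abs(v1[i]))


def dscal(alpha, v1):
    """v2 = alpha * v1"""
    return [alpha * x for x in v1]


def daxpy(alpha, v1, v2):
    """v3 = alpha * v1 + v2"""
    return [alpha * x + y for x, y in zip(v1, v2)]


print(idamax([1.0, -7.0, 3.0]))
print(dscal(2.0, [1.0, 2.0]))
print(daxpy(2.0, [1.0, 2.0], [10.0, 20.0]))
```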

Hardware - BEE2 Board

Device Utilization (XC2VP70)

About 34% is dedicated to the network

Cores             4-Input LUTs   Number of Occurrences   ~Total 4-Input LUTs   Total (%)
LINPACK Engine    4360           6                       26160                 40
TMD-MPE *         896            6                       5376                  8
NetIf *           579            11                      6369                  10
PLB-MPE *         2685           1                       2685                  4
FSLs *            44             154                     6776                  10
FSL2IC *          349            4                       1396                  2

* Network cores
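The 34% network share can be checked from the per-core numbers. The sketch below assumes that the network cores are TMD-MPE, NetIf, PLB-MPE, FSLs and FSL2IC (inferred from the slide's "network cores" label) and that the device has 66176 4-input LUTs, the figure given in the synthesis report among the additional slides.

```python
TOTAL_LUTS = 66176  # 4-input LUTs on the XC2VP70, per the synthesis report

# name: (LUTs per instance, number of instances, counted as network core?)
cores = {
    "LINPACK Engine": (4360, 6, False),
    "TMD-MPE": (896, 6, True),
    "NetIf": (579, 11, True),
    "PLB-MPE": (2685, 1, True),
    "FSLs": (44, 154, True),
    "FSL2IC": (349, 4, True),
}

network_luts = sum(luts * count for luts, count, net in cores.values() if net)
engine_luts = sum(luts * count for luts, count, net in cores.values() if not net)

print(network_luts, round(100 * network_luts / TOTAL_LUTS))  # 22602 34
print(engine_luts, round(100 * engine_luts / TOTAL_LUTS))    # 26160 40
```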

Methods of Analysis

Method 1 – Simulation: ModelSim waveform

Method 2 – PPC Timer: Timing the C code running on the PPC

Method 3 – TMD-Profiler: Using an external profiler to analyze the engines

Processor vs FPGA

Most important portion is DGEFA

DGEFA Benchmark with n = 100

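Processor's performance = 315 MFLOPS

Performance – 1 Engine: 123 MFLOPS

$\text{Speedup} = \frac{\text{Performance}_{\text{Engine}}}{\text{Performance}_{\text{Processor}}} = 0.39$

Performance – FPGA (6 Engines): 379 MFLOPS

$\text{Speedup} = \frac{\text{Performance}_{\text{FPGA}}}{\text{Performance}_{\text{Processor}}} = 1.20$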

Engines Speedup

[Figure: SpeedUp - Fixed size for LINPACK 100x100. X axis: # Engines (1 to 8, spanning FPGA 1 and FPGA 2). Y axis: SpeedUp (0.0 to 9.0). Series: Ideal, SpeedUp]

Problem

Engines' computation time is being surpassed by either communication or idle time

TMD-Profiler can be used to track the problem

For 8 Engines:

Computation: 40.6%, Sends: 8.7%, Receives: 46.7%

[Figure: TMD-Profiler trace showing the IDAMAX & DSCAL, Broadcast and DAXPY phases. Legend: SEND, RECV, COMP]

Scaled Problem Size

[Figure: SpeedUp - Scaled size for LINPACK 80x80xNp. X axis: # Engines (Np) (0 to 9, spanning FPGA 1 and FPGA 2). Y axis: SpeedUp (0.0 to 9.0). Series: Ideal, SpeedUp]

Why “super” speedup?

As the matrix grows, the column size also increases

Since each engine holds exactly the same amount of data, the number of columns per engine decreases

[Figure: the same amount of data processed as fewer, longer columns (Case 1) or more, shorter columns (Case 2), with one latency penalty per column. Case 1 = 2 x Latency + 20; Case 2 = 4 x Latency + 20]
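The two totals can be reproduced with a simple cost model. One reading of the figure, assumed here, is that each column passed through the pipeline pays a fixed start-up latency plus one cycle per element, and that the 20 elements are arranged as 2 columns of 10 in Case 1 and 4 columns of 5 in Case 2. The function name and the latency value of 10 cycles are hypothetical.

```python
def daxpy_cycles(columns, elements_per_column, latency):
    """Cycles to process all columns: one latency per column plus one cycle per element."""
    return columns * latency + columns * elements_per_column


LATENCY = 10  # hypothetical pipeline latency in cycles

case1 = daxpy_cycles(columns=2, elements_per_column=10, latency=LATENCY)
case2 = daxpy_cycles(columns=4, elements_per_column=5, latency=LATENCY)
print(case1, case2)  # 40 60
```

With the same 20 elements of work, the arrangement with fewer, longer columns pays the latency penalty fewer times, so efficiency rises as the matrix grows.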

New Speedup

With matrix size of 195 x 195

Performance of 6 engines (one FPGA): 628 MFLOPS

Performance of one processor: 324 MFLOPS

Speedup of FPGA over processor is 1.94x
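The speedup figures in this talk are plain performance ratios, as the short check below shows using the MFLOPS values quoted on the slides. The parallel-efficiency line is an extra derived quantity, not a number stated in the talk, and the function names are hypothetical.

```python
def speedup(performance, processor_performance):
    """Ratio of a measured performance to the processor's performance."""
    return performance / processor_performance


def parallel_efficiency(parallel_perf, single_perf, engines):
    """Fraction of ideal linear scaling actually achieved."""
    return parallel_perf / (engines * single_perf)


print(round(speedup(123, 315), 2))  # 1 engine vs processor, n = 100: 0.39
print(round(speedup(379, 315), 2))  # 6 engines vs processor, n = 100: 1.2
print(round(speedup(628, 324), 2))  # 6 engines vs processor, n = 195: 1.94
print(round(parallel_efficiency(379, 123, 6), 2))  # derived, n = 100: 0.51
```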

Newer Technology

Max theoretical peak performance of an engine on the Virtex-II Pro (V2Pro) is 200 MFLOPS

Newer FPGAs are larger and faster

Estimated peak performance of a network of 20 engines on a Virtex 5 LX330: 4000 MFLOPS

Theoretical speedup, compared to a processor, is 11.4x

Compared to HPL, estimated speedup is 4.4x

Scaling to Larger Systems

LINPACK is meant to run in large multi-processor systems

Computer networks suffer from high latency

The tighter coupling and lighter protocol used in this FPGA system have potential to scale

Conclusions

TMD-MPE was used to parallelize the LINPACK Hardware Engine

Disadvantage: expensive in terms of device utilization

Advantage: higher flexibility

Max speedup of engines over a processor is 1.9x

Newer FPGAs have better chances of outperforming processors (est. 4000 MFLOPS for Virtex 5 LX330)

Multi-FPGA systems have good scalability potential due to low latencies

Future Work

Include DDR memory

Improve the broadcast method (e.g., switch to a tree approach)

Optimize DAXPY flow

Replicate DAXPY flow inside each engine

Explore newer technologies and scalability
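To illustrate why a tree broadcast is attractive, the sketch below compares the number of communication steps in two schedules. It assumes the current broadcast sends from the root to each rank in turn (the slides do not state this) and uses a binomial tree as one possible tree approach. Function names are hypothetical.

```python
def linear_broadcast(nranks, root=0):
    """Root sends to every other rank, one send per step (assumed baseline)."""
    return [[(root, dst)] for dst in range(nranks) if dst != root]


def tree_broadcast(nranks):
    """Binomial tree rooted at rank 0: the set of holders doubles each round."""
    rounds = []
    have = 1  # ranks 0 .. have-1 already hold the data
    while have < nranks:
        rounds.append([(src, src + have)
                       for src in range(have) if src + have < nranks])
        have *= 2
    return rounds


print(len(linear_broadcast(8)))  # 7
print(len(tree_broadcast(8)))    # 3
for step in tree_broadcast(8):
    print(step)
```

The step count drops from Np - 1 to about log2(Np), at the cost of every rank having to forward the column as well as receive it.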

Thank You (Questions?)

Additional Slides

DGEFA Code

/* dgefa(*A[][], *ipvt[]) */
for (k = 0 : n-2)                              /* loop k */
    pivot = idamax(A[k][k]) + k;               /* loop idamax */
    ipvt[k] = pivot;
    if (A[pivot][k] != 0)
        t = -1/(A[pivot][k]);
        swap(&A[pivot][k], &A[k][k]);
        dscal(&A[k+1][k], t);                  /* loop dscal */
        for (j = k+1 : n-1)                    /* loop j */
            t = A[pivot][j];
            swap(&A[pivot][j], &A[k][j]);
            daxpy(&A[k+1][j], A[k+1][k], t);   /* loop daxpy */

idamax, dscal and daxpy are BLAS 1 functions

Most of the time is spent doing the loop that contains daxpy (loop j)

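MPE Protocol

[Figure: MPE packet format. Header word (Ctrl bit = 1) with the fields Opcode, Src/Dest Rank and Message Size (NDW); bit positions marked at 31, 30, 29, 22, 21 and 0. Following words (Ctrl bit = 0): Tag, then Data-word (0), Data-word (1), ..., Data-word (NDW-1)]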

LINPACK Report

Device utilization summary:
---------------------------

Selected Device : 2vp70ff1704-7

Number of Slices: 2970 out of 33088 8%
Number of Slice Flip Flops: 3282 out of 66176 4%
Number of 4 input LUTs: 4360 out of 66176 6%
    Number used as logic: 3714
    Number used as Shift registers: 134
    Number used as RAMs: 512
Number of BRAMs: 18 out of 328 5%
Number of MULT18X18s: 7 out of 328 2%

Timing Summary:
---------------
Speed Grade: -7

Minimum period: 9.259ns (Maximum Frequency: 108.003MHz)
Minimum input arrival time before clock: 4.930ns
Maximum output required time after clock: 6.853ns
Maximum combinational path delay: 2.043ns

Opcode TAG

[Figure: layout of the opcode tag word. Fields: Opcode, Nr of Columns, Column Size, MPI Size; bit ranges marked 31-28, 27-20, 19-12 and 11-0]

Matrix Distribution

Considering an n x n matrix and 3 ranks

[Figure: matrix columns 0, 1, 2, 3, 4, 5, ..., n-3, n-2, n-1, each marked as assigned to Rank 0, Rank 1 or Rank 2]

Processor vs. LINPACK Engine

Whole LINPACK Benchmark with n = 100

Performance (MFLOPS)

Processor: 319 MFLOPS

LINPACK Engine: 164 MFLOPS

$\text{Speedup} = \frac{\text{Performance}_{\text{Engine}}}{\text{Performance}_{\text{Processor}}} = 0.51$

IDAMAX

FLOPS

$N_{\text{FLOPs}} = \frac{2}{3}n^3 + 2n^2$

16 Engines
