The LINPACK Benchmark on a Multi-Core Multi-FPGA System by Emanuel Ramalho Supervisor: Prof. Paul Chow University of Toronto Electrical and Computer Engineering Department October 1st, 2008


The LINPACK Benchmark on a Multi-Core Multi-FPGA System

by

Emanuel Ramalho

Supervisor: Prof. Paul Chow

University of Toronto

Electrical and Computer Engineering Department

October 1st, 2008

Outline

Motivation

LINPACK Algorithm

Parallelizing LINPACK

Results

Conclusions

Future Work

Motivation

The LINPACK Benchmark is used to rank the Top500 computers in the world

Can FPGAs compete?

Objective

To see how well a multi-core multi-FPGA system performs compared to a processor

FPGA

Disadvantage: much lower clock rate

Advantage: the entire implementation may be done in hardware

LINPACK Algorithm

Solves a system of linear equations, Ax = b, by calling two routines: DGEFA and DGESL

DGEFA: LU factorization with partial pivoting: A = LUP

Ax = LUx = b

DGESL: Solves the system using the LU factorization:

Ly = b, then Ux = y
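The two triangular solves performed by DGESL can be sketched as follows. This is a minimal illustration that assumes L and U are stored as separate dense matrices and that no row exchange is needed; the function names and the 2x2 example values are hypothetical and are not taken from the slides.

```python
def forward_substitution(L, b):
    """Solve L y = b, where L is lower triangular with a nonzero diagonal."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        s = sum(L[i][j] * y[j] for j in range(i))
        y[i] = (b[i] - s) / L[i][i]
    return y


def back_substitution(U, y):
    """Solve U x = y, where U is upper triangular with a nonzero diagonal."""
    n = len(y)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (y[i] - s) / U[i][i]
    return x


# Hypothetical example: A = L U = [[4, 2], [2, 4]], b = [10, 14]
L = [[1.0, 0.0], [0.5, 1.0]]
U = [[4.0, 2.0], [0.0, 3.0]]
b = [10.0, 14.0]
y = forward_substitution(L, b)   # first solve L y = b
x = back_substitution(U, y)      # then solve U x = y
```

Each solve touches every stored entry of one triangle once, so this phase costs on the order of n^2 operations, against n^3 for the factorization.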

LINPACK1 vs. HPL

LINPACK1: single processor, uses Level 1 BLAS, slower, low complexity

HPL: multiple processors, uses Level 3 BLAS, faster, high complexity

FPGA Implementation

BLAS3 performs faster on processors (due to locality of reference)

FPGAs do not take advantage of BLAS3, so LINPACK1 is chosen

LINPACK Pseudo-Code

1. Random generation of matrix A and vector b

2. Execute DGEFA routine (A=LU)

   • IDAMAX, DSCAL and DAXPY are executed here

3. Execute DGESL routine (LUx=b)

4. Verify the result using residual calculation

Performance is measured from step 2 to step 3 (inclusive)

How is this going to be parallelized?
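The four steps can be sketched end to end as below. This is a serial illustration only: `solve` is a hypothetical stand-in that uses plain Gaussian elimination with partial pivoting instead of the actual DGEFA and DGESL routines, and the residual is a simple maximum of |Ax - b| instead of the normalized residual reported by the real benchmark.

```python
import random
import time


def solve(A, b):
    """Gaussian elimination with partial pivoting, then back substitution.

    Hypothetical stand-in for DGEFA + DGESL; A and b are left unmodified.
    """
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix [A | b]
    for k in range(n):
        # Partial pivoting: move the largest |entry| of column k into row k.
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def run_benchmark(n, seed=0):
    rng = random.Random(seed)
    # Step 1: random matrix A and vector b.
    A = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    b = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    # Steps 2 and 3: only the factor-and-solve phase is timed.
    start = time.perf_counter()
    x = solve(A, b)
    elapsed = time.perf_counter() - start
    # Step 4: residual check, max |Ax - b| over all rows.
    residual = max(
        abs(sum(A[i][j] * x[j] for j in range(n)) - b[i]) for i in range(n)
    )
    return elapsed, residual


elapsed, residual = run_benchmark(20)
```

Generation and verification sit outside the timed region, matching the measurement window stated on the slide.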

Parallelizing LINPACK

Find focus of parallelization: DGEFA

[Pie chart of execution time: DGEFA 95%, DGESL 5%]

DGEFA Analysis

Inside DGEFA: IDAMAX, DSCAL and DAXPY

DAXPY is the main computation

[Pie chart of DGEFA execution time: DAXPY 90%, IDAMAX 5%, DSCAL 5%]
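Combining the two breakdowns gives a rough idea of how much of the whole run DAXPY accounts for. The worked figures below assume the charts are fractions of run time that can simply be multiplied (95% of the run in DGEFA, 90% of DGEFA in DAXPY). The Amdahl-style bound is an illustration added here, not a result from the slides; it ignores communication cost and assumes only DAXPY is spread across engines.

```latex
\[
f_{\text{DAXPY}} = 0.95 \times 0.90 = 0.855,
\qquad
S_{\max} = \frac{1}{1 - f_{\text{DAXPY}}} = \frac{1}{0.145} \approx 6.9
\]
```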

TMD-MPI

TMD-MPI is a lightweight implementation of the MPI protocol (Message Passing Interface)

TMD-MPE is a hardware implementation of TMD-MPI's main functionality (SEND and RECV)

[Block diagram; labels: FSL, NetIf, PLB Bus, TMD-MPE, HW, MPI, Network]

DGEFA Parallelization

Generate matrix A and vector b (main rank)

(MPI) Matrix distribution

Perform DGEFA (main loop)

    Perform IDAMAX and DSCAL

    (MPI) Broadcast scaled column and pivot

    Perform loop that contains DAXPY

(MPI) Matrix gather (main rank)

Perform DGESL

Calculate residual
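The main loop above can be simulated on a single machine to show which rank does what. This sketch makes several assumptions that the slides do not confirm: columns are dealt to ranks cyclically (column j on rank j mod nranks), the matrix is stored column by column, and the MPI broadcast is replaced by a plain Python tuple handed to every simulated rank. The names `owner` and `parallel_dgefa` are hypothetical.

```python
def owner(j, nranks):
    """Assumed column-cyclic mapping: column j lives on rank j mod nranks."""
    return j % nranks


def parallel_dgefa(cols, nranks):
    """Simulate the distributed DGEFA main loop; cols[j] is column j of A.

    The columns are factored in place and the pivot vector is returned.
    """
    n = len(cols)
    # Each simulated rank only knows the indices of the columns it owns.
    local = [[j for j in range(n) if owner(j, nranks) == r] for r in range(nranks)]
    ipvt = list(range(n))
    for k in range(n - 1):
        col = cols[k]  # held by rank owner(k, nranks)
        # IDAMAX on the owning rank: largest |entry| at or below the diagonal.
        pivot = max(range(k, n), key=lambda i: abs(col[i]))
        ipvt[k] = pivot
        if col[pivot] == 0.0:
            continue  # zero pivot: nothing to eliminate in this column
        col[k], col[pivot] = col[pivot], col[k]
        # DSCAL on the owning rank: store the negated multipliers.
        t = -1.0 / col[k]
        for i in range(k + 1, n):
            col[i] *= t
        # "Broadcast" of the pivot index and the scaled column.
        message = (pivot, col[k + 1:])
        for r in range(nranks):
            piv, mult = message
            # Every rank updates only its own columns to the right of k.
            for j in local[r]:
                if j <= k:
                    continue
                c = cols[j]
                c[k], c[piv] = c[piv], c[k]
                tj = c[k]
                for i in range(k + 1, n):  # DAXPY on the sub-column
                    c[i] += tj * mult[i - k - 1]
    return ipvt


# A = [[2, 1], [4, 3]] stored as columns
A_cols = [[2.0, 4.0], [1.0, 3.0]]
ipvt = parallel_dgefa(A_cols, nranks=2)
```

Only the inner column updates run concurrently in a real system; IDAMAX and DSCAL happen on one rank per iteration while the other ranks wait for the broadcast.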

LINPACK Engine

[Block diagram; labels: To Network On-Chip, TMD-MPE, Command FSLs, Data FSLs, LINPACK Engine, Main FSM, MPE Header FSM, BLAS1 Engine, RAM, Control Signals, Data]

BLAS1 Engine

Performs IDAMAX, DSCAL and DAXPY

IDAMAX

Finds the element of v1 with the largest absolute value and returns its index

DSCAL

Performs v2 = α·v1

DAXPY

Calculates v3 = α·v1 + v2
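The three BLAS1 operations handled by the engine can be written in a few lines of software. This is a minimal sketch on Python lists that follows the slide notation (separate input and output vectors); the reference BLAS routines update their arguments in place and take stride parameters, which are omitted here.

```python
def idamax(v1):
    """Return the index of the first element with the largest absolute value."""
    return max(range(len(v1)), key=lambda i: abs(v1[i]))


def dscal(alpha, v1):
    """Return v2 = alpha * v1."""
    return [alpha * x for x in v1]


def daxpy(alpha, v1, v2):
    """Return v3 = alpha * v1 + v2, element by element."""
    return [alpha * x + y for x, y in zip(v1, v2)]


v1 = [1.0, -4.0, 2.0]
v2 = [0.5, 0.5, 0.5]
index = idamax(v1)             # position of -4.0
scaled = dscal(2.0, v1)        # every element doubled
combined = daxpy(2.0, v1, v2)  # doubled v1 plus v2
```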


Hardware - BEE2 Board

Device Utilization (XC2VP70)

About 34% is dedicated to the network

Cores            4-Input LUTs   Number of Occurrences   ~Total 4-Input LUTs   Total (%)
LINPACK Engine   4360           6                       26160                 40
TMD-MPE          896            6                       5376                  8
NetIf            579            11                      6369                  10
PLB-MPE          2685           1                       2685                  4
FSLs             44             154                     6776                  10
FSL2IC           349            4                       1396                  2

Network cores: TMD-MPE, NetIf, PLB-MPE, FSLs and FSL2IC
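The totals and the 34% network share can be rechecked from the per-core figures. The sketch below assumes the percentages are taken against the 66176 four-input LUTs that the synthesis report in the additional slides lists for this device; the dictionary layout and helper names are hypothetical.

```python
TOTAL_LUTS = 66176  # 4-input LUTs on the device, from the synthesis report

# core name: (LUTs per instance, number of instances, is it a network core?)
cores = {
    "LINPACK Engine": (4360, 6, False),
    "TMD-MPE": (896, 6, True),
    "NetIf": (579, 11, True),
    "PLB-MPE": (2685, 1, True),
    "FSLs": (44, 154, True),
    "FSL2IC": (349, 4, True),
}


def total_luts(name):
    """LUTs used by all instances of one core."""
    per_instance, count, _ = cores[name]
    return per_instance * count


def percent(luts):
    """Share of the device's 4-input LUTs, in percent."""
    return 100.0 * luts / TOTAL_LUTS


network_luts = sum(
    total_luts(name) for name, (_, _, is_network) in cores.items() if is_network
)
network_percent = round(percent(network_luts))
engine_percent = round(percent(total_luts("LINPACK Engine")))
```

Under that assumption, the five network cores together account for 22602 LUTs, which rounds to the 34% quoted on the slide.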

Methods of Analysis

Method 1 – Simulation: ModelSim waveform

Method 2 – PPC Timer: measuring elapsed time from the C code running on the PPC

Method 3 – TMD-Profiler: using an external profiler to analyze the engines

Processor vs FPGA

Most important portion is DGEFA

DGEFA Benchmark with n = 100

Processor's performance = 315 MFLOPS

Performance – 1 Engine: 123 MFLOPS

\[ \text{Speedup} = \frac{\text{Performance}_{\text{Engine}}}{\text{Performance}_{\text{Processor}}} = 0.39 \]

Performance – FPGA (6 Engines): 379 MFLOPS

\[ \text{Speedup} = \frac{\text{Performance}_{\text{FPGA}}}{\text{Performance}_{\text{Processor}}} = 1.20 \]
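The same three measurements also show how well six engines scale relative to one. The sketch below recomputes the two speedups from the slide and adds a parallel-efficiency figure; that efficiency metric is an addition made here for illustration and does not appear in the slides.

```python
processor_mflops = 315.0    # DGEFA, n = 100, processor
one_engine_mflops = 123.0   # DGEFA, n = 100, one engine
six_engines_mflops = 379.0  # DGEFA, n = 100, one FPGA with 6 engines


def speedup(performance, baseline):
    """Ratio of two performance figures."""
    return performance / baseline


engine_vs_processor = speedup(one_engine_mflops, processor_mflops)
fpga_vs_processor = speedup(six_engines_mflops, processor_mflops)

# How much faster 6 engines are than 1, and the fraction of the ideal 6x.
scaling = speedup(six_engines_mflops, one_engine_mflops)
efficiency = scaling / 6
```

Six engines deliver roughly three times the throughput of one, so about half of the ideal scaling is lost at this problem size.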

Engines Speedup

[Plot: SpeedUp - Fixed size for LINPACK 100x100. X axis: # Engines (1 to 8), with regions labeled FPGA 1 and FPGA 2; Y axis: SpeedUp (0.0 to 9.0); series: Ideal, SpeedUp]

Problem

For 8 Engines:

Computation   Sends   Receives
40.6%         8.7%    46.7%

The engines' computation time is being surpassed by communication or idle time

TMD-Profiler can be used to track the problem

TMD-Profiler

[Profiler timeline; phases: IDAMAX & DSCAL, Broadcast, DAXPY; legend: SEND, RECV, COMP]

Scaled Problem Size

[Plot: SpeedUp - Scaled size for LINPACK 80x80xNp. X axis: # Engines (Np), 0 to 9, with regions labeled FPGA 1 and FPGA 2; Y axis: SpeedUp (0.0 to 9.0); series: Ideal, SpeedUp]

Why “super” speedup?

As the matrix grows, the column size also increases

Since each engine has exactly the same amount of data, the number of columns decreases

[Diagram: Case 1 pays the latency twice, total = 2 x Latency + 20; Case 2 pays the latency four times, total = 4 x Latency + 20]
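The two cases suggest a simple cost model: one pipeline latency per column plus one cycle per element. The sketch below is an interpretation of the diagram, not something stated in the slides; the column lengths (2 columns of 10 elements versus 4 columns of 5) and the latency value are hypothetical, chosen only so that both cases process 20 elements.

```python
def pipeline_time(num_columns, column_length, latency):
    """Assumed cost model: one latency per column plus one cycle per element."""
    return num_columns * latency + num_columns * column_length


LATENCY = 15  # hypothetical pipeline latency, in cycles

# Same 20 elements, split into fewer long columns or more short ones.
case_1 = pipeline_time(num_columns=2, column_length=10, latency=LATENCY)
case_2 = pipeline_time(num_columns=4, column_length=5, latency=LATENCY)
```

With the amount of data held constant, fewer and longer columns pay the latency fewer times, which is consistent with the better-than-ideal speedup seen for the scaled problem.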

New Speedup

With matrix size of 195 x 195

Performance of 6 engines (one FPGA): 628 MFLOPS

Performance of one processor: 324 MFLOPS

Speedup of FPGA over processor is 1.94x

Newer Technology

Maximum theoretical peak performance of one engine in the V2Pro is 200 MFLOPS

Newer FPGAs are larger and faster

Estimated peak performance for a network of 20 engines on a Virtex 5 LX330: 4000 MFLOPS

Theoretical speedup, compared to a processor, is 11.4x

Compared to HPL, estimated speedup is 4.4x


Scaling to Larger Systems

LINPACK is meant to run in large multi-processor systems

Computer networks suffer from high latency

The tighter coupling and lighter protocol used in this FPGA system have potential to scale

Conclusions

TMD-MPE was used to parallelize the LINPACK Hardware Engine

Disadvantage: expensive in terms of device utilization

Advantage: higher flexibility

Max speedup of engines over a processor is 1.9x

Newer FPGAs have better chances of outperforming processors (est. 4000 MFLOPS for Virtex 5 LX330)

Multi-FPGA systems have good scalability potential due to low latencies


Future Work

Include DDR memory

Improve broadcast method (e.g. to tree approach)

Optimize DAXPY flow

Replicate DAXPY flow inside each engine

Explore newer technologies and scalability

Thank You (Questions?)


Additional Slides

DGEFA Code

/* dgefa(*A[][], *ipvt[]) */
for (k = 0 : n-2)                                   (loop k)
    pivot = idamax(A[k][k]) + k;                    (loop idamax)
    ipvt[k] = pivot;
    if (A[pivot][k] != 0)
        t = -1/(A[pivot][k]);
        swap(&A[pivot][k], &A[k][k]);
        dscal(&A[k+1][k], t);                       (loop dscal)
        for (j = k+1 : n-1)                         (loop j)
            t = A[pivot][j];
            swap(&A[pivot][j], &A[k][j]);
            daxpy(&A[k+1][j], A[k+1][k], t);        (loop daxpy)

idamax, dscal and daxpy are the BLAS 1 functions

Most of the time is spent in the loop that contains daxpy (loop j)

MPE Protocol

[Packet format diagram. First word (Ctrl bit = 1): fields Opcode, Message Size (NDW) and Src/Dest Rank, with bit boundaries marked at 31, 30, 29, 22, 21 and 0. Following words (Ctrl bit = 0): Tag, Data-word (0), Data-word (1), ..., Data-word (NDW-1)]

LINPACK Report

Device utilization summary:
---------------------------
Selected Device : 2vp70ff1704-7

Number of Slices:                     2970 out of 33088    8%
Number of Slice Flip Flops:           3282 out of 66176    4%
Number of 4 input LUTs:               4360 out of 66176    6%
    Number used as logic:             3714
    Number used as Shift registers:   134
    Number used as RAMs:              512
Number of BRAMs:                      18 out of 328        5%
Number of MULT18X18s:                 7 out of 328         2%

Timing Summary:
---------------
Speed Grade: -7

Minimum period: 9.259ns (Maximum Frequency: 108.003MHz)
Minimum input arrival time before clock: 4.930ns
Maximum output required time after clock: 6.853ns
Maximum combinational path delay: 2.043ns

Opcode TAG

[Tag layout diagram. Fields: MPI Size, Column Size, Opcode, Nr of Columns; bit boundaries marked at 31, 28, 27, 20, 19, 12, 11 and 0]

Matrix Distribution

Considering an n x n matrix and 3 ranks

[Diagram: columns 0, 1, 2, 3, 4, 5, ..., n-3, n-2, n-1, each assigned to Rank 0, Rank 1 or Rank 2]

Processor vs. LINPACK Engine

Whole LINPACK Benchmark with n = 100

Performance (MFLOPS)

Processor: 319 MFLOPS

LINPACK Engine: 164 MFLOPS

\[ \text{Speedup} = \frac{\text{Performance}_{\text{Engine}}}{\text{Performance}_{\text{Processor}}} = 0.51 \]


IDAMAX


DSCAL


DAXPY

FLOPS

\[ N_{\text{FLOPs}} = \frac{2}{3} n^{3} + 2 n^{2} \]
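The operation count turns a measured run time into a MFLOPS figure. The sketch below applies the standard LINPACK count of (2/3)n^3 + 2n^2 operations; the function names and the 2 ms elapsed time are hypothetical values chosen for illustration, not measurements from the slides.

```python
def linpack_flops(n):
    """Nominal operation count for factoring and solving an n x n system."""
    return (2.0 / 3.0) * n**3 + 2.0 * n**2


def mflops(n, seconds):
    """Millions of floating-point operations per second for one run."""
    return linpack_flops(n) / seconds / 1e6


ops = linpack_flops(100)            # about 686,667 operations for n = 100
rate = mflops(100, seconds=0.002)   # hypothetical 2 ms run
```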


16 Engines