The LINPACK Benchmark on a Multi-Core Multi-FPGA System
by
Emanuel Ramalho
Supervisor: Prof. Paul Chow
University of Toronto
Electrical and Computer Engineering Department
October 1st, 2008
Outline
Motivation
LINPACK Algorithm
Parallelizing LINPACK
Results
Conclusions
Future Work
Motivation
The LINPACK Benchmark is used to rank the Top500 computers in the world.
Can FPGAs compete?

Objective
To see how well a multi-core multi-FPGA system performs when compared to a processor.

FPGA
Disadvantage: much lower clock rate
Advantage: the entire implementation can be done in hardware
LINPACK Algorithm
Solves a system of linear equations, Ax = b, by calling two routines: DGEFA and DGESL.
DGEFA: LU factorization with partial pivoting: PA = LU
DGESL: Solves the factored system (Ax = LUx = b) by two triangular solves:
Ly = b, then Ux = y
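To make the two triangular solves concrete, here is a minimal C sketch of what DGESL computes, assuming a unit-lower-triangular L stored below the diagonal and ignoring the pivot bookkeeping the real routine carries:

/* Sketch of DGESL's two triangular solves. A holds L (unit diagonal,
   below) and U (on and above the diagonal) as left by the factorization;
   pivot handling omitted for clarity. The solution overwrites b. */
void dgesl_sketch(double **A, double *b, int n)
{
    /* Forward substitution: solve L y = b */
    for (int j = 0; j < n - 1; j++)
        for (int i = j + 1; i < n; i++)
            b[i] -= A[i][j] * b[j];

    /* Back substitution: solve U x = y */
    for (int j = n - 1; j >= 0; j--) {
        b[j] /= A[j][j];
        for (int i = 0; i < j; i++)
            b[i] -= A[i][j] * b[j];
    }
}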
LINPACK1 vs. HPL
LINPACK1: single processor, uses Level 1 BLAS, slower, low complexity
HPL: multiple processors, uses Level 3 BLAS, faster, high complexity

FPGA Implementation
BLAS3 performs faster on processors due to locality of reference (cache reuse); FPGAs do not gain from BLAS3, so LINPACK1 is chosen.
LINPACK Pseudo-Code
1. Random generation of matrix A and vector b
2. Execute DGEFA routine (A = LU); IDAMAX, DSCAL, and DAXPY are executed here
3. Execute DGESL routine (LUx = b)
4. Verify the result using a residual calculation
Performance is measured from step 2 to step 3 (inclusive); see the driver sketch below.
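A minimal C sketch of this flow and of where the clock starts and stops (names such as run_linpack are illustrative, not the benchmark's own):

#include <time.h>

extern void dgefa(double **A, int n, int *ipvt);            /* step 2 */
extern void dgesl(double **A, int n, int *ipvt, double *b); /* step 3 */

/* Returns MFLOPS for one timed run; steps 1 and 4 (generation and
   residual check) happen outside the timed region. */
double run_linpack(double **A, double *b, int *ipvt, int n)
{
    clock_t t0 = clock();
    dgefa(A, n, ipvt);                  /* A = LU        */
    dgesl(A, n, ipvt, b);               /* solve LUx = b */
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    double nflops = 2.0*n*n*n/3.0 + 2.0*n*n;  /* standard LINPACK count */
    return nflops / secs / 1e6;
}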
How is this going to be parallelized?

Parallelizing LINPACK
Find the focus of parallelization: DGEFA accounts for about 95% of the execution time, DGESL for about 5%.
DGEFA Analysis
Inside DGEFA: IDAMAX, DSCAL, and DAXPY. DAXPY is the main computation, at about 90% of DGEFA's time; IDAMAX and DSCAL take about 5% each.
TMD-MPI
TMD-MPI is a lightweight implementation of the MPI (Message Passing Interface) protocol.
TMD-MPE is a hardware implementation of TMD-MPI's main functionality (SEND and RECV).
(Diagram: hardware engines attach to the on-chip MPI network through the TMD-MPE over FSLs; a NetIf and the PLB bus connect processors to the same MPI network.)
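Because TMD-MPI keeps standard MPI semantics for the subset it implements, rank-to-rank code looks like ordinary MPI; the sketch below uses the standard API as a stand-in (the exact TMD-MPI call set is an assumption here):

#include <mpi.h>

/* Rank 0 sends a column of doubles to rank 1 using the SEND/RECV pair
   that TMD-MPE accelerates in hardware. Standard MPI calls shown. */
void exchange_column(double *col, int len, int rank)
{
    if (rank == 0)
        MPI_Send(col, len, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(col, len, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
}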
DGEFA Parallelization
1. Generate matrix A and vector b (main rank)
2. (MPI) Matrix distribution
3. Perform DGEFA (main loop):
   - Perform IDAMAX and DSCAL
   - (MPI) Broadcast scaled column and pivot
   - Perform the loop that contains DAXPY
4. (MPI) Matrix gather (main rank)
5. Perform DGESL
6. Calculate residual
A sketch of the main loop follows.
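A hedged C/MPI sketch of step 3, assuming the cyclic column ownership shown in the matrix-distribution slide; scale_pivot_column and daxpy_update are hypothetical helpers standing in for the IDAMAX/DSCAL and DAXPY stages:

#include <mpi.h>

#define MAXN 1024

void scale_pivot_column(double **cols, int k, int n, double *out);        /* hypothetical */
void daxpy_update(double **cols, int j, int k, int n, const double *piv); /* hypothetical */

/* Parallel DGEFA outline: the owner of column k scales it, every rank
   receives it by broadcast, and each rank updates only its own columns. */
void pdgefa_sketch(double **cols, int n, int rank, int nranks)
{
    double piv[MAXN + 1];  /* scaled column plus pivot index */
    for (int k = 0; k < n - 1; k++) {
        int owner = k % nranks;
        if (rank == owner)
            scale_pivot_column(cols, k, n, piv);   /* IDAMAX + DSCAL */
        MPI_Bcast(piv, n - k, MPI_DOUBLE, owner, MPI_COMM_WORLD);
        for (int j = k + 1; j < n; j++)            /* DAXPY on owned cols */
            if (j % nranks == rank)
                daxpy_update(cols, j, k, n, piv);
    }
}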
LINPACK Engine
(Block diagram: the LINPACK Engine attaches to the network-on-chip through the TMD-MPE over command and data FSLs; inside the engine, a Main FSM issues control signals to a BLAS1 Engine, an MPE Header FSM, and local data RAM.)
BLAS1 Engine
Performs IDAMAX, DSCAL, and DAXPY:
IDAMAX: finds max(|v1|) and returns its index
DSCAL: performs v2 = α · v1
DAXPY: calculates v3 = α · v1 + v2
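For reference, straightforward C versions of the three kernels (unit stride assumed; the real BLAS routines also take increment arguments):

#include <math.h>

/* idamax: index of the element with the largest absolute value */
int idamax(int n, const double *v)
{
    int imax = 0;
    for (int i = 1; i < n; i++)
        if (fabs(v[i]) > fabs(v[imax]))
            imax = i;
    return imax;
}

/* dscal: v = alpha * v, in place */
void dscal(int n, double alpha, double *v)
{
    for (int i = 0; i < n; i++)
        v[i] *= alpha;
}

/* daxpy: y = alpha * x + y */
void daxpy(int n, double alpha, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] += alpha * x[i];
}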
Hardware - BEE2 Board
Device Utilization (XC2VP70)
About 34% is dedicated to the network (network cores: TMD-MPE, NetIf, PLB-MPE, FSLs, FSL2IC).

Core             4-Input LUTs   Occurrences   ~Total 4-Input LUTs   Total (%)
LINPACK Engine   4360           6             26160                 40
TMD-MPE          896            6             5376                  8
NetIf            579            11            6369                  10
PLB-MPE          2685           1             2685                  4
FSLs             44             154           6776                  10
FSL2IC           349            4             1396                  2
Methods of Analysis
Method 1 – Simulation: ModelSim waveform
Method 2 – PPC Timer: counting time through the C code on the PowerPC
Method 3 – TMD-Profiler: using an external profiler to analyze the engines
Processor vs. FPGA
The most important portion is DGEFA. DGEFA benchmark with n = 100:
Processor performance: 315 MFLOPS
Performance, 1 engine: 123 MFLOPS
  Speedup = Performance_Engine / Performance_Processor = 123 / 315 = 0.39
Performance, FPGA (6 engines): 379 MFLOPS
  Speedup = Performance_FPGA / Performance_Processor = 379 / 315 = 1.20
Engines Speedup
(Plot: speedup vs. number of engines, 1 to 8, across FPGA 1 and FPGA 2, for fixed-size LINPACK 100x100, with the ideal speedup line shown for comparison.)
Problem
For 8 engines: computation 40.6%, sends 8.7%, receives 46.7%.
The engines' computation time is being surpassed by communication or idle time.
TMD-Profiler can be used to track the problem.
(Profiler trace: SEND, RECV, and COMP activity across the IDAMAX & DSCAL, broadcast, and DAXPY phases.)
Scaled Problem Size
(Plot: speedup vs. number of engines Np, 1 to 8, across FPGA 1 and FPGA 2, for scaled-size LINPACK 80x80xNp, with the ideal speedup line for comparison; the measured speedup exceeds the ideal line.)
Why "super" speedup?
As the matrix grows, the size of each column also grows; since each engine holds exactly the same amount of data, the number of columns per engine decreases.
(Illustration: for the same 20 data elements, two long columns cost 2 x Latency + 20 cycles to broadcast, while four short columns cost 4 x Latency + 20; fewer, longer columns amortize the per-message latency.)
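The arithmetic in the illustration reduces to a one-line cost model (an interpretation of the slide: one fixed latency per column message plus one cycle per element):

/* Broadcast cost: ncols messages, each paying the fixed latency, plus
   one cycle per element. With 20 elements total: 2 columns of 10 ->
   2*L + 20; 4 columns of 5 -> 4*L + 20. */
unsigned bcast_cost(unsigned ncols, unsigned elems_per_col, unsigned latency)
{
    return ncols * latency + ncols * elems_per_col;
}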
New Speedup
With a matrix size of 195 x 195:
Performance of 6 engines (one FPGA): 628 MFLOPS
Performance of one processor: 324 MFLOPS
Speedup of the FPGA over the processor: 628 / 324 = 1.94x
Newer Technology
The maximum theoretical peak performance of an engine in the Virtex-II Pro is 200 MFLOPS.
Newer FPGAs are larger and faster.
Estimated peak performance for a 20-engine network on a Virtex 5 LX330: 4000 MFLOPS.
The theoretical speedup over a processor is 11.4x; compared to HPL, the estimated speedup is 4.4x.
Scaling to Larger Systems
LINPACK is meant to run on large multi-processor systems.
Conventional computer networks suffer from high latency.
The tighter coupling and lighter protocol used in this FPGA system give it the potential to scale.
Conclusions
TMD-MPE was used to parallelize LINPACK as a hardware engine.
Disadvantage: expensive in terms of device utilization. Advantage: higher flexibility.
The maximum speedup of the engines over a processor is 1.9x.
Newer FPGAs have better chances of outperforming processors (est. 4000 MFLOPS for the Virtex 5 LX330).
Multi-FPGA systems have good scalability potential due to low latencies.
Future Work
Include DDR memory
Improve broadcast method (e.g., a tree-based approach)
Optimize DAXPY flow
Replicate DAXPY flow inside each engine
Explore newer technologies and scalability
Thank You (Questions?)
Additional Slides
DGEFA Code

/* dgefa(A, ipvt): LU factorization with partial pivoting (simplified;
   column vectors assumed contiguous, as in the Fortran original) */
for (k = 0; k < n - 1; k++) {                          /* loop k      */
    pivot = idamax(n - k, &A[k][k]) + k;               /* loop idamax */
    ipvt[k] = pivot;
    if (A[pivot][k] != 0) {
        t = -1.0 / A[pivot][k];
        swap(&A[pivot][k], &A[k][k]);                  /* bring pivot up */
        dscal(n - k - 1, t, &A[k+1][k]);               /* loop dscal  */
        for (j = k + 1; j < n; j++) {                  /* loop j      */
            t = A[pivot][j];
            swap(&A[pivot][j], &A[k][j]);
            daxpy(n - k - 1, t, &A[k+1][k], &A[k+1][j]); /* loop daxpy */
        }
    }
}

BLAS 1 Functions: IDAMAX, DSCAL, and DAXPY. Most of the time is spent in the inner (daxpy) loop.
MPE Protocol
Message format: a 32-bit header word carrying a control bit, the opcode, the source/destination rank, and the message size (NDW, the number of data words); followed by a tag word and then data-words 0 through NDW-1.
LINPACK Report
Device utilization summary:
Selected Device: 2vp70ff1704-7
Number of Slices: 2970 out of 33088 (8%)
Number of Slice Flip Flops: 3282 out of 66176 (4%)
Number of 4 input LUTs: 4360 out of 66176 (6%)
  Number used as logic: 3714
  Number used as Shift registers: 134
  Number used as RAMs: 512
Number of BRAMs: 18 out of 328 (5%)
Number of MULT18X18s: 7 out of 328 (2%)

Timing Summary:
Speed Grade: -7
Minimum period: 9.259 ns (Maximum Frequency: 108.003 MHz)
Minimum input arrival time before clock: 4.930 ns
Maximum output required time after clock: 6.853 ns
Maximum combinational path delay: 2.043 ns
(Figure: LINPACK command word carried in the MPE Tag field: a 32-bit word with Opcode, MPI Size, Column Size, and Number of Columns fields packed into the bit ranges 31:28, 27:20, 19:12, and 11:0.)
Matrix Distribution
Considering an n x n matrix and 3 ranks, columns 0, 1, 2, 3, 4, 5, ..., n-3, n-2, n-1 are assigned to the ranks in round-robin order: rank 0, rank 1, rank 2, rank 0, and so on.
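The round-robin assignment reduces to a modulo (a sketch consistent with the figure):

/* Cyclic column distribution: with 3 ranks, columns 0,3,6,... go to
   rank 0; 1,4,7,... to rank 1; 2,5,8,... to rank 2. */
int column_owner(int j, int nranks)
{
    return j % nranks;
}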
Processor vs. LINPACK Engine
Whole LINPACK Benchmark with n = 100:
Processor: 319 MFLOPS; LINPACK Engine: 164 MFLOPS
Speedup = Performance_Engine / Performance_Processor = 164 / 319 = 0.51
FLOPs
The operations counted are those in IDAMAX, DSCAL, and DAXPY:
N_FLOPs = (2/3) n^3 + 2 n^2
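Worked for the benchmark size used above (a direct evaluation of the formula): for n = 100, N_FLOPs = (2/3)·100^3 + 2·100^2 = 666,667 + 20,000 ≈ 6.87 × 10^5, so the processor's 319 MFLOPS corresponds to roughly 2.2 ms per solve.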
16 Engines