A Profiler for a Multi-Core Multi-FPGA System, by Daniel Nunes. Supervisor: Professor Paul Chow. September 30th, 2008. University of Toronto, Electrical and Computer Engineering Department.

Page 1: A Profiler for a Multi-Core Multi-FPGA System

A Profiler for a Multi-Core Multi-FPGA System

by

Daniel Nunes

Supervisor:

Professor Paul Chow

September 30th, 2008

University of Toronto

Electrical and Computer Engineering Department

Page 2: A Profiler for a Multi-Core Multi-FPGA System

Overview

Background

Profiling Model

The Profiler

Case Studies

Conclusions

Future Work

Page 3: A Profiler for a Multi-Core Multi-FPGA System

How Do We Program This System?

Let's look at what traditional clusters use and try to port it to this type of machine

[Figure: four User FPGAs and one Ctrl FPGA]

Page 4: A Profiler for a Multi-Core Multi-FPGA System

Traditional Clusters

MPI is a de facto standard for parallel HPC

MPI can also be used to program a cluster of FPGAs

Page 5: A Profiler for a Multi-Core Multi-FPGA System

The TMD

Heterogeneous multi-core multi-FPGA system developed at UofT

Uses message passing (TMD-MPI)

Page 6: A Profiler for a Multi-Core Multi-FPGA System

TMD-MPI

Subset of the MPI standard

Allows independence between the application and the hardware

TMD-MPI functionality is also implemented in hardware (TMD-MPE)

Page 7: A Profiler for a Multi-Core Multi-FPGA System

TMD-MPI – Rendezvous Protocol

This implementation uses the Rendezvous protocol, a synchronous communication mode

[Figure: message sequence of the protocol: Request to Send, Acknowledge, Data]
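The three-message handshake above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function name, the one-time-unit cost per message, and the `receiver_ready_at` parameter are hypothetical and are not taken from TMD-MPI.

```python
def rendezvous_transfer(receiver_ready_at):
    """Simulate one rendezvous send.

    Assumption: every message takes one time unit to cross the network.
    Returns (event log, time at which the data reaches the receiver).
    """
    log = []
    t = 0
    log.append((t, "sender", "Request to Send"))
    t += 1                         # the request arrives at the receiver
    t = max(t, receiver_ready_at)  # the receiver answers only once it is ready
    log.append((t, "receiver", "Acknowledge"))
    t += 1                         # the acknowledge arrives at the sender
    log.append((t, "sender", "Data"))
    t += 1                         # the data arrives at the receiver
    return log, t


if __name__ == "__main__":
    for ready in (0, 10):
        events, done = rendezvous_transfer(ready)
        print(f"receiver ready at {ready}: data delivered at {done}")
        for time, who, what in events:
            print(f"  t={time:2d} {who:8s} {what}")
```

In this model the sender cannot emit Data before the Acknowledge comes back, so a late receiver delays the sender, which is what makes the mode synchronous.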

Page 8: A Profiler for a Multi-Core Multi-FPGA System

The TMD Implementation on BEE2 Boards

[Figure: BEE2 board with four User FPGAs and one Ctrl FPGA, each with a NoC, hosting PPC and MB processors]

Page 9: A Profiler for a Multi-Core Multi-FPGA System

How Do We Profile This System?

Let's look at how it is done in traditional clusters and try to adapt it to hardware

Page 10: A Profiler for a Multi-Core Multi-FPGA System

MPICH - MPE

Collects information from MPI calls and user-defined states through embedded calls

Includes a tool to view all log files (Jumpshot)

Page 11: A Profiler for a Multi-Core Multi-FPGA System

Goals Of This Work

Implement a hardware profiler capable of extracting the same data as the MPE

Make it less intrusive

Make it compatible with the API used by MPE

Make it compatible with Jumpshot

Page 12: A Profiler for a Multi-Core Multi-FPGA System

Tracers

The Profiler interacts with the computation elements through tracers that register important events

TMD-MPE requires two tracers due to its parallel nature

[Figure: a PPC with a Processor's Computation Tracer and a hardware engine with an Engine's Computation Tracer; each is attached to a TMD-MPE that has a Receive Tracer and a Send Tracer]

Page 13: A Profiler for a Multi-Core Multi-FPGA System

Tracers - Hardware Engine Computation

[Figure: Tracer for Hardware Engine, with blocks labeled R0, MUX, and Cycle Counter; data paths are 32 bits wide]

Page 14: A Profiler for a Multi-Core Multi-FPGA System

Tracers - TMD-MPE

[Figure: Tracer for TMD-MPE, with blocks labeled R0 to R4, MPE Data Reg, three MUXes, Cycle Counter, and TMD-MPE; data paths are 32 bits wide]

Page 15: A Profiler for a Multi-Core Multi-FPGA System

Tracers – Processors Computation

[Figure: Tracer for PowerPC/MicroBlaze, with blocks labeled Register Bank (9 x 32 bits), Register Bank (5 x 32 bits), two Stacks, MPI Calls States, User Defined States, MUX, Cycle Counter, and PPC; data paths are 32 bits wide]
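The slide does not describe how the tracer uses its stacks. One plausible reading, offered here purely as an assumption, is that a stack lets states nest (an MPI call inside a user-defined state) while a cycle counter timestamps each entry and exit. The class and method names below are hypothetical.

```python
class ProcessorTracer:
    """Toy software model of a tracer that timestamps nested states."""

    def __init__(self):
        self.cycle = 0      # stands in for the hardware cycle counter
        self.stack = []     # currently open states, innermost last
        self.records = []   # closed states as (name, start_cycle, end_cycle)

    def tick(self, cycles=1):
        """Advance the cycle counter."""
        self.cycle += cycles

    def enter(self, state):
        """Open a state and remember the cycle at which it started."""
        self.stack.append((state, self.cycle))

    def leave(self):
        """Close the innermost open state and record its interval."""
        state, start = self.stack.pop()
        self.records.append((state, start, self.cycle))


if __name__ == "__main__":
    tracer = ProcessorTracer()
    tracer.enter("user: solve")   # user-defined state
    tracer.tick(5)
    tracer.enter("MPI_Send")      # MPI call nested inside the user state
    tracer.tick(3)
    tracer.leave()
    tracer.tick(2)
    tracer.leave()
    for record in tracer.records:
        print(record)
```

Each record is an interval of the kind a timeline viewer draws as a coloured bar.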

Page 16: A Profiler for a Multi-Core Multi-FPGA System

Profiler’s Network

[Figure: on the User FPGA, several Tracers feed a Gather block; the Gather block feeds the Collector on the Control FPGA, which is attached to DDR]

Page 17: A Profiler for a Multi-Core Multi-FPGA System

Synchronization

Synchronization within the same board: release the reset of the cycle counters simultaneously

Synchronization between boards: periodic exchange of messages between the root board and all other boards
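The slide does not say how the exchanged messages are used. A common way to align two free-running counters from a message round trip is the offset estimate used by protocols such as NTP, which assumes the delay is the same in both directions. The sketch below shows that technique only as an illustration; it is not claimed to be the method used in this profiler, and all names are hypothetical.

```python
def estimate_offset(root_send, remote_recv, remote_send, root_recv):
    """Estimate (remote counter - root counter) from one round trip.

    root_send, root_recv: root board's counter when it sent the request and
    received the reply. remote_recv, remote_send: remote board's counter when
    it received the request and sent the reply.
    Assumption: the network delay is equal in both directions.
    """
    return ((remote_recv - root_send) + (remote_send - root_recv)) / 2


def to_root_time(remote_timestamp, offset):
    """Translate a remote timestamp into the root board's time base."""
    return remote_timestamp - offset


if __name__ == "__main__":
    # Hypothetical numbers: remote counter is 100 cycles ahead, 7-cycle delay.
    offset = estimate_offset(1000, 1107, 1110, 1017)
    print("estimated offset:", offset)
    print("remote 1200 in root time:", to_root_time(1200, offset))
```

Repeating the exchange periodically would let the estimate follow any drift between the boards' clocks.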

Page 18: A Profiler for a Multi-Core Multi-FPGA System

Profiler's Flow

Collect Data, Dump to Host, Convert to CLOG2, Convert to SLOG2, Visualize with Jumpshot

[Figure: flow diagram of the steps above, with stages labeled Back End, Front End, and After Execution]

Page 19: A Profiler for a Multi-Core Multi-FPGA System

Case Studies

Barrier: Sequential vs Binary Tree

TMD-MPE - Unexpected Message Queue: Unexpected Message Queue addressable by rank

The Heat Equation: Blocking Calls vs Non-Blocking Calls

LINPACK Benchmark: 16 Node System Calculating an LU Decomposition of a Matrix

Page 20: A Profiler for a Multi-Core Multi-FPGA System

Barrier

Synchronization call – No node will advance until all nodes have reached the barrier

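[Figure: eight nodes (0 to 7) arranged as a binary tree with levels 0; 1, 2; 3, 4, 5, 6; 7, next to a flat arrangement with node 0 above nodes 1 to 7]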
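The two arrangements can be compared with a rough cost model. This sketch assumes that every message costs one time unit and that a node handles one message at a time; these are illustrative assumptions, not measurements from the TMD, and the function names are hypothetical.

```python
def children(node, n, flat):
    """Children of `node`: all other nodes under the root (flat) or a binary heap layout."""
    if flat:
        return list(range(1, n)) if node == 0 else []
    return [c for c in (2 * node + 1, 2 * node + 2) if c < n]


def gather_time(node, n, flat):
    """Time at which `node` has heard from every node in its subtree."""
    t = 0
    ready_times = sorted(gather_time(c, n, flat) for c in children(node, n, flat))
    for ready in ready_times:
        t = max(t, ready) + 1   # receive one message at a time
    return t


def release_time(node, start, n, flat):
    """Latest time at which any node in the subtree is released."""
    latest = start
    for k, child in enumerate(children(node, n, flat), start=1):
        latest = max(latest, release_time(child, start + k, n, flat))
    return latest


def barrier_time(n, flat):
    """Total time for all n nodes to report to node 0 and be released."""
    return release_time(0, gather_time(0, n, flat), n, flat)


if __name__ == "__main__":
    print("sequential :", barrier_time(8, flat=True))
    print("binary tree:", barrier_time(8, flat=False))
```

Under this model, eight nodes need 14 time units with the sequential scheme and 8 with the binary tree, because the tree lets different nodes exchange messages in parallel.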

Page 21: A Profiler for a Multi-Core Multi-FPGA System

Barrier Implemented Sequentially

[Figure legend: Send, Receive]

Page 22: A Profiler for a Multi-Core Multi-FPGA System

Barrier Implemented as a Binary Tree

[Figure legend: Send, Receive]

Page 23: A Profiler for a Multi-Core Multi-FPGA System

TMD-MPE – Unexpected Messages Queue

All requests to send that arrive at a node before it issues an MPI_RECV are kept in this queue.
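The difference between a queue that must be searched and one that is addressable by rank can be sketched as follows. Both classes are hypothetical software models for illustration; they are not the TMD-MPE hardware design, and matching on source rank only is a simplifying assumption.

```python
from collections import deque


class SearchedQueue:
    """Unexpected-message queue that is searched from the head."""

    def __init__(self):
        self.items = deque()

    def push(self, src_rank, tag):
        self.items.append((src_rank, tag))

    def match(self, src_rank):
        """Return (entry, entries examined); remove the entry if found."""
        for i, entry in enumerate(self.items):
            if entry[0] == src_rank:
                del self.items[i]   # later entries move up (reorganization)
                return entry, i + 1
        return None, len(self.items)


class RankAddressedQueue:
    """Unexpected-message queue with one slot per source rank."""

    def __init__(self, num_ranks):
        self.slots = [deque() for _ in range(num_ranks)]

    def push(self, src_rank, tag):
        self.slots[src_rank].append((src_rank, tag))

    def match(self, src_rank):
        """Return (entry, slots examined); only one slot is ever looked at."""
        slot = self.slots[src_rank]
        return (slot.popleft() if slot else None), 1


if __name__ == "__main__":
    searched, addressed = SearchedQueue(), RankAddressedQueue(num_ranks=4)
    for rank in (3, 2, 1):
        searched.push(rank, tag=0)
        addressed.push(rank, tag=0)
    print("searched :", searched.match(1))
    print("addressed:", addressed.match(1))
```

The searched queue examines more entries the deeper the match sits, while the rank-addressed queue goes straight to the slot of the requested source.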

Page 24: A Profiler for a Multi-Core Multi-FPGA System

TMD-MPE – Unexpected Messages Queue

[Figure legend: Send, Receive, Queue Search and Reorganization]

Page 25: A Profiler for a Multi-Core Multi-FPGA System

TMD-MPE – Unexpected Messages Queue

[Figure legend: Send, Receive, Queue Search and Reorganization]

Page 26: A Profiler for a Multi-Core Multi-FPGA System

TMD-MPE – Unexpected Messages Queue

[Figure legend: Send, Receive]

Page 27: A Profiler for a Multi-Core Multi-FPGA System

The Heat Equation Application

Partial differential equation that describes the temperature change over time

$$v_{i,j} = \frac{u_{i+1,j} + u_{i-1,j} + u_{i,j+1} + u_{i,j-1}}{4}$$

$$(u_{i,j} - v_{i,j})^2$$
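A single-process sketch of this kind of update: each interior point is replaced by the average of its four neighbours, and the squared differences between the old and new grids are summed. Using that sum as a stopping test, as well as the function names, the tolerance, and the fixed-boundary grid, are assumptions made for illustration.

```python
def jacobi_step(u):
    """Return (v, change) where v averages the four neighbours of each interior point."""
    rows, cols = len(u), len(u[0])
    v = [row[:] for row in u]          # boundary values are copied unchanged
    change = 0.0
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            v[i][j] = (u[i + 1][j] + u[i - 1][j] + u[i][j + 1] + u[i][j - 1]) / 4
            change += (u[i][j] - v[i][j]) ** 2
    return v, change


def solve(u, tol=1e-9, max_iters=10_000):
    """Iterate until the summed squared change drops below tol."""
    for iteration in range(1, max_iters + 1):
        u, change = jacobi_step(u)
        if change < tol:
            return u, iteration
    return u, max_iters


if __name__ == "__main__":
    # Hypothetical 5x5 plate: top edge held at 100, other edges at 0.
    grid = [[100.0] * 5] + [[0.0] * 5 for _ in range(4)]
    result, iterations = solve(grid)
    print("iterations:", iterations)
    for row in result:
        print(" ".join(f"{x:6.2f}" for x in row))
```

In a parallel version each rank would own a block of the grid and exchange its edge values with its neighbours on every iteration, which is where the choice between blocking and non-blocking calls matters.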

Page 28: A Profiler for a Multi-Core Multi-FPGA System

The Heat Equation Application

Page 29: A Profiler for a Multi-Core Multi-FPGA System

The Heat Equation Application

[Figure legend: Send, Receive, Computation]

Page 30: A Profiler for a Multi-Core Multi-FPGA System

The Heat Equation Application

[Figure legend: Send, Receive, Computation]

Page 31: A Profiler for a Multi-Core Multi-FPGA System

The LINPACK Benchmark

Solves a system of linear equations

LU factorization with partial pivoting

$$PA = LU$$
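A minimal sequential sketch of LU factorization with partial pivoting, which produces a row permutation together with L and U such that the permuted rows of A equal the product LU. This is a textbook formulation for illustration, not the parallel code profiled in the case study; the function names are hypothetical.

```python
def lu_partial_pivoting(a):
    """Return (perm, l, u) such that row perm[i] of a equals row i of l times u."""
    n = len(a)
    u = [list(map(float, row)) for row in a]
    l = [[0.0] * n for _ in range(n)]
    perm = list(range(n))
    for k in range(n):
        # Partial pivoting: pick the row with the largest entry in column k.
        p = max(range(k, n), key=lambda r: abs(u[r][k]))
        if u[p][k] == 0.0:
            raise ValueError("matrix is singular")
        u[k], u[p] = u[p], u[k]
        l[k], l[p] = l[p], l[k]
        perm[k], perm[p] = perm[p], perm[k]
        # Eliminate column k below the pivot, storing the multipliers in l.
        for r in range(k + 1, n):
            l[r][k] = u[r][k] / u[k][k]
            for c in range(k, n):
                u[r][c] -= l[r][k] * u[k][c]
    for i in range(n):
        l[i][i] = 1.0
    return perm, l, u


def matmul(x, y):
    """Plain matrix product of two square lists of lists."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    a = [[2, 1, 1], [4, 3, 3], [8, 7, 9]]
    perm, l, u = lu_partial_pivoting(a)
    print("row order:", perm)
    print("L*U:", matmul(l, u))
```

Once the factors are known, a linear system is solved with one forward substitution through L and one back substitution through U.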

Page 32: A Profiler for a Multi-Core Multi-FPGA System

The LINPACK Benchmark

[Figure: matrix columns 0, 1, 2, 3, 4, 5, ..., n-3, n-2, n-1, marked as assigned to Rank 0, Rank 1, or Rank 2]

Page 33: A Profiler for a Multi-Core Multi-FPGA System

The LINPACK Benchmark

[Figure legend: Send, Receive, Computation]

Page 34: A Profiler for a Multi-Core Multi-FPGA System

The LINPACK Benchmark

[Figure legend: Send, Receive, Computation]

Page 35: A Profiler for a Multi-Core Multi-FPGA System

Profiler’s Overhead

Block                                       LUTs        Flip-Flops   BRAMs
Collector                                   3856 (5%)   1279 (1%)    0 (0%)
Gather                                      187 (0%)    53 (0%)      0 (0%)
Engine Computation Tracer                   396 (0%)    701 (1%)     0 (0%)
TMD-MPE Tracer                              526 (0%)    1000 (1%)    0 (0%)
Processors Computation Tracer without MPE   1196 (1%)   1521 (2%)    0 (0%)
Processors Computation Tracer with MPE      855 (1%)    1200 (1%)    0 (0%)

Page 36: A Profiler for a Multi-Core Multi-FPGA System

Conclusions

All major features of the MPE were implemented

The profiler was successfully used to study the behavior of the applications

Less intrusive

More events available to profile

Can profile network components

Compatible with existing profiling software environments

Page 37: A Profiler for a Multi-Core Multi-FPGA System

Future Work

Reduce the footprint of the profiler’s hardware blocks.

Profile the MicroBlaze and PowerPC in a non-intrusive way.

Allow real-time profiling

Page 38: A Profiler for a Multi-Core Multi-FPGA System

Thank You (Questions?)

Page 39: A Profiler for a Multi-Core Multi-FPGA System

The TMD (2)

[Figure: block diagram with two Computation Nodes (a PPC and a Hardware Engine, each with a TMD-MPE) and two Off-Chip Communications Nodes (InterChip and XAUI), attached through FSL links and a Network Interface to the on-chip network]

Page 40: A Profiler for a Multi-Core Multi-FPGA System

Profiler (2)

[Figure: Processor Profiler Architecture and Engine Profiler Architecture. In both, a TMD-MPE is paired with Tracer RX and Tracer TX, the computation with Tracer Comp, each tracer takes input from the Cycle Counter, and the output goes to the Gather. Other labeled blocks: PPC, PLB, DCR, DCR2FSL Bridge, GPIO]

Page 41: A Profiler for a Multi-Core Multi-FPGA System

Profiler (1)

[Figure: Board 0 to Board N connected through a Switch. Labeled blocks on a board: User FPGA 1 to User FPGA 4 with PPC, μB, Gather, Cycle Counter, IC, and the on-chip network; the Control FPGA with the Collector, DDR, and XAUI]

Page 42: A Profiler for a Multi-Core Multi-FPGA System

Profiler (2)

[Figure: Processor Profiler Architecture and Engine Profiler Architecture. In both, a TMD-MPE is paired with Tracer RX and Tracer TX, the computation with Tracer Comp, each tracer takes input from the Cycle Counter, and the output goes to the Gather. Other labeled blocks: PPC, PLB, DCR, DCR2FSL Bridge, GPIO]

Page 43: A Profiler for a Multi-Core Multi-FPGA System

Hardware Profiling Benefits

Less intrusive

More events available to profile

Can profile network components

Compatible with existing profiling software environments

Page 44: A Profiler for a Multi-Core Multi-FPGA System

MPE PROTOCOL

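[Figure: MPE packet format. The first word has the control bit set to 1 and carries the Opcode, Message Size (NDW), and Src/Dest Rank fields, with bit boundaries marked at 31, 30, 29, 22, 21, and 0. The following words have the control bit set to 0: Tag, then Data-word (0), Data-word (1), ..., Data-word (NDW-1)]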