RSSI – 7/9/2008, HPRC Computing Models

Computing Models for

FPGA-Based Accelerators*

Martin Herbordt Tom VanCourt Yongfeng Gu

Josh Model Bharat Sukhwani Matt Chiu

Computer Architecture and Automated Design Laboratory

Department of Electrical and Computer Engineering

Boston University

http://www.bu.edu/caadlab

*This work is supported in part by the U.S. NIH/NCRR and NSF, and by MIT Lincoln Lab


The Promise of HPRC* …

*Trimberger/Xilinx, FPL07


Reality Check …

Lately … (just as everything seems to be going great for FPGAs)

Reported speed-ups have become more modest, with low single

digits frequently being reported …

Why? Some hypotheses

– More ambitious applications

• Large codes in established systems

• True HPC: large, complex, data types

– More realistic reporting

• end-to-end numbers

• production reference codes

– More “ambitious” development tools

– “Broader” developer base

– FPGA stagnation for two generations (4 years)

• Smaller chips (relative to microprocessors)

• Relatively fewer “hard” components: Block RAMs, multipliers


One key to successful HPRC application

development …

Use an appropriate computing model*

when formulating a problem#

*neither programming language substitute nor HDL

Computing Model ≡

an abstraction of a target machine used to

ease application development

# “One of the key challenges addressed by the ACS program was Problem Formulation”

-- Dr. J. Muñoz, ACS Program Manager, RSSI 2008


A good computing model* …

• Ignores machine details

• Ignores programming language details

• Expresses enough of the underlying machine to enable

the user to develop/choose an optimal algorithm (for a

given application)

• Enables some amount of portability

*see, e.g., L. Snyder 1986 Annual Review of Computer Science


The more machine detail …

… that is expressed in the computing model:

• the greater the potential performance

• the less the portability

• the more experience required to use effectively


Historically (ca. 1990) …

“If we only had the right computing model, then

we could port programs among all of our

different parallel computers.”

-- theme of several parallel computing conferences


Nowadays …

“We need to develop the right computing model so

that programmers can use multicore efficiently”

[without having to learn anything new].


Serial Architecture Computing Model

von Neumann machine

• Single thread

• Random access memory


Parallel Architecture Computing Models

Three (and combinations) are in common use:

1. multiple threads with shared address space

2. multiple threads with message passing

3. single thread with data-parallel constructs


Good (and bad) Computing Models

Just about any programming model can map to any non-trivial computer

(C pointers to FPGAs?? Parallel LISP to SIMD? It's been done!)

Good models (w.r.t. a target architecture):

1. Is it convenient to use (in comparison with programming)?

– Inconvenient to use microcode, VHDL, etc.

2. Do constructs map efficiently to the target architecture?

– Inefficient: mapping functional parallelism to a SIMD architecture

3. Can critical target machine features be expressed?

– Can't be expressed: for a multicore architecture, parallel independent functions in the data-parallel model


A great question …

(that is beyond the scope of this talk):

Is it possible to support a “universal”

computing model?

In which we can create applications that port among target

architectures, where:

– the programmer effort is the same as for one version of the software

– the target architectures are unrestricted

– the performance is optimal for all target architectures

– there is no constraint on application domain


Outline

1. Computing Models

2. FPGA Basics – functional models

3. Things that FPGAs do really well –

FPGA Computing Models

4. Sample application mappings

5. What this says about how FPGA-based

computing can advance


What’s a good computing model?


What’s a good model?


A Basic FPGA Computing Model

Historically, FPGAs were a configurable “bag of gates”

Trimberger/Xilinx, FPL07


Nowadays, “bag of computer parts” is more accurate

Is no longer sufficient …

Trimberger/Xilinx, FPL07


Should also account for board-level …

… especially memory hierarchy!


and the system interface …

Bhatt/Intel, FPL07


• Millions of gate equivalents & connections

• ~500 ALU equivalents

• ~1,000 small on-chip caches

– Total on-chip memory ~16MB

• Several off-chip caches

– Capacity for 512b data transfers per cycle (8x64b)

• Several high-performance I/O streams

• Host w/ simple interface, e.g., FSB

FPGA Functional Model


Another Candidate Model



We know we’ve succeeded when …

… we’ve restructured the problem into something we know works really well on FPGAs

Effective computing models:

Streaming

Associative computing – broadcast, compare tags

HW structures – FIFOs, priority queues, systolic arrays

Cellular automata, SIMD PEs, Vector processing

Highly parallel (possibly complex) memory access

Overlapped parallel structures

Also assumed:

– Explicit memory control, e.g., to swap working sets

– High-bandwidth I/O


Some Observations

• Most architectures have a single preferred model; FPGAs have many

• FPGA models are surrogates for the component they replace


Model: Streaming

Ex: DSP replacement

Characteristics:

• Pass streams of data through a series of arithmetic units

• Iterative streaming computation with data beginning and

ending in Block RAMs


Typical Streaming Scenarios

1. On-line signal/video processing

– Stream originates from I/O

– Stream processed with computational

filters

2. Complex computation of large array

– Stream originates in memory

– Stream processed with pipelined

instantiation of computation


Model: Associative Computing

Ex: SIMD array replacement

Characteristic operations:

• Broadcast query/data

• Tag check

• Collective response

• Reduction of responses

Krikelis: Associative String Processor
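
The four characteristic operations can be sketched in software. This is a minimal illustration only: the `PE` record, the `associative_query` name, and the choice of `max` as the reduction are assumptions made for the example, not details from the talk. On an FPGA every tag check would happen in the same cycle rather than in a loop.

```python
from dataclasses import dataclass

@dataclass
class PE:
    """Hypothetical processing element: a stored tag plus a payload value."""
    tag: int
    value: float

def associative_query(pes, query_tag):
    # Broadcast + tag check: every PE compares its own tag with the query.
    responders = [pe for pe in pes if pe.tag == query_tag]
    # Collective response: how many PEs matched.
    count = len(responders)
    # Reduction of responses: here, the maximum payload among responders.
    best = max((pe.value for pe in responders), default=None)
    return count, best

pes = [PE(3, 1.5), PE(7, 0.2), PE(3, 4.0), PE(9, 2.2)]
print(associative_query(pes, 3))
```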


Typical Associative Scenario

Query/response

Example: Optimization with successive approximation

Class: Successive approximation

• Scoring function: $F_i(x)$

• Initial state: $X_0$

• Next state selection: $X_{i+1} = NS[ F_1(X_i), F_2(X_i), … ]$

[Figure: starting from X0, state Xi feeds scoring units F1 … F5, whose outputs feed Xi+1 = NextState[ F1, F2, F3, … ]]


Model: MSI HW Structures

Ex: ASIC replacement

[Figure: array of PEs holding A[0] … A[L], with input B[i], output C[k], and an Init_A control]

Characteristics:

Standard HW versions of common data structures –

• FIFOs

• Priority queues

• Systolic arrays


Typical HW structure scenario

• Wherever HW instantiation does not have an

immediate SW analog

Example: Find palindromes in a character string

[Figure: palindrome-finding structure around a gap. Stages: character comparison (=), length summation (+, Len=1 … Len=4), threshold detection (T1? … T4?), and length reporting through a priority encoder that outputs the maximal palindrome length]
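
A serial sketch of the same four stages may help in reading the figure. It assumes plain character equality, a zero-length gap (the two halves are adjacent), and a hypothetical `threshold` parameter; none of these choices is taken from the talk, and the hardware version would evaluate all comparisons in parallel.

```python
def palindrome_half_length(s, gap):
    """Count matching character pairs outward from the gap before s[gap]."""
    length = 0
    left, right = gap - 1, gap
    # Character comparison + length summation: stop at the first mismatch.
    while left >= 0 and right < len(s) and s[left] == s[right]:
        length += 1
        left -= 1
        right += 1
    return length

def find_palindromes(s, threshold=2):
    """Report (gap position, palindrome) wherever the half-length meets the threshold."""
    hits = []
    for gap in range(1, len(s)):
        n = palindrome_half_length(s, gap)
        if n >= threshold:                      # threshold detection
            hits.append((gap, s[gap - n: gap + n]))  # length reporting
    return hits

print(find_palindromes("xabbay"))
```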


Model: Highly Parallel Memory Access

“The advantage of having an MPP is having lots of memory pipes”

Characteristics:

• Source and sink up to 2000 operands per cycle

• Possible complex access patterns

Example: access vectors in all possible 3-way combinations

– Divide n objects into subsets of size m …

– … so that every size-3 subset is in just one size-m subset

[Figure: Vector Data Memory (VDM) feeding DPS0 … DPS83, with inputs Y and X1-9 (m = 9)]


Model: Functional Parallelism

Characteristics: (just what it says)

• potentially expensive computations can be hidden completely

– random # generation

– coordinate transformations

Example: rigid molecule docking

• Access original molecule grid in rotated order

– Express (i,j,k) in (x,y,z) basis: $i = (x_i, y_i, z_i)$, $j = (x_j, y_j, z_j)$, $k = (x_k, y_k, z_k)$

– Traverse (i,j,k) index space

– Find (x,y,z) from (i,j,k):

$$\begin{pmatrix} x_i & x_j & x_k \\ y_i & y_j & y_k \\ z_i & z_j & z_k \end{pmatrix} \begin{pmatrix} i \\ j \\ k \end{pmatrix} = \begin{pmatrix} x \\ y \\ z \end{pmatrix}$$

– Round and range check

• Pipelined, parallel computation gives ~0 ns overhead for rotation

[Figure: molecule voxel rotation, systolic 3D correlation array, and data reduction filter]
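
The index transformation can be sketched as follows. The function name `rotate_index`, the cubic grid of side `size`, and the example basis are assumptions for illustration; the talk only specifies the matrix product followed by a round and range check.

```python
def rotate_index(ijk, basis, size):
    """Map grid index (i, j, k) to a voxel (x, y, z) of the original grid.

    basis holds the i, j, and k axes expressed in the (x, y, z) basis.
    Returns None when the rounded voxel falls outside a size^3 grid.
    """
    i, j, k = ijk
    (xi, yi, zi), (xj, yj, zj), (xk, yk, zk) = basis
    x = xi * i + xj * j + xk * k
    y = yi * i + yj * j + yk * k
    z = zi * i + zj * j + zk * k
    voxel = (round(x), round(y), round(z))      # round
    if all(0 <= c < size for c in voxel):       # range check
        return voxel
    return None

# Example basis: a 90-degree rotation about the z axis.
rot90z = ((0, 1, 0), (-1, 0, 0), (0, 0, 1))
print(rotate_index((2, 0, 1), rot90z, size=4))
```

In hardware the three dot products are independent, which is why the rotation can be pipelined alongside the memory traversal.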



BLAST*

• For a biological sequence (DNA, Protein) and a database

of such sequences, find the database sequences that are

most biologically relevant to the query sequence.

*FCCM06, ParCo07


Sequence Alignment – Basics

Example: GCGATCT versus an entire database

• Each character-character match (G-G, C-G, etc.) is scored

independently with a scoring matrix

• An alignment is a possible way for sequences to match (char-char)

• To score an alignment (evaluate a single ScoreSequence) …

• Simple algorithm to find the maximal ungapped local alignment over all possible ungapped alignments (i.e., N ScoreSequences) …

• Complexity of gapped alignments is potentially unbounded

# Find maximal local alignment of all ungapped alignments

# Find max cumulative score with cut-off = 0

# Complexity = O(MN)

Traverse Database – Foreach Alignment

Generate ScoreSequence

Do SimpleScoring
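
The pseudocode above can be sketched in runnable form. The match/mismatch scores of +1/-1 stand in for a real scoring matrix, and only alignments where the query fully overlaps the database are considered; both are simplifying assumptions, as are the function names.

```python
def score_sequence(query, db, offset, match=1, mismatch=-1):
    """Per-character scores for the ungapped alignment starting at db[offset]."""
    return [match if q == db[offset + n] else mismatch
            for n, q in enumerate(query)]

def simple_scoring(scores):
    """Maximum cumulative score with cut-off = 0 (best local run)."""
    best = running = 0
    for s in scores:
        running = max(0, running + s)   # cut-off: never carry a negative sum
        best = max(best, running)
    return best

def best_ungapped(query, db):
    """Traverse the database: score every alignment, return (score, offset)."""
    return max((simple_scoring(score_sequence(query, db, off)), off)
               for off in range(len(db) - len(query) + 1))

print(best_ungapped("GCGATCT", "AAGCGATCTAA"))
```

Each alignment costs O(M) and there are O(N) of them, which gives the O(MN) complexity quoted above.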


Gapped alignments (DP-based methods)

• Create query/database tableau:

• Traverse the tableau with a Dynamic Programming algorithm

Score each grid cell (i,j): $S_{i,j}$ is computed with a recurrence over the neighboring cells (i-1,j-1), (i-1,j), and (i,j-1)

• Complexity: O(MN)

[Figure: query/database tableau; 1 marks a character match. Vertical axis: Query (length M). Horizontal axis: Database Sequence (length N)]

      G  C  G  A  T  C  T
  G   1  0  1  0  0  0  0
  C   0  1  0  0  0  1  0
  A   0  0  0  1  0  0  0
  T   0  0  0  0  1  0  1
  T   0  0  0  0  1  0  1
  T   0  0  0  0  1  0  1
  A   0  0  0  1  0  0  0

Parallel to main diagonal – match/mismatch

Vertical or horizontal – indel

Example gapped alignment:

  GCGATCT-
  GC-ATTTA
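
As an illustration of a DP traversal of the tableau, here is a sketch using the widely known local-alignment (Smith-Waterman style) recurrence with a linear indel penalty. The scores +1/-1/-1 and the function name are assumptions; the talk's own recurrence and scoring matrix may differ.

```python
def local_alignment_score(query, db, match=1, mismatch=-1, indel=-1):
    """Best gapped local alignment score over an (M+1) x (N+1) tableau."""
    M, N = len(query), len(db)
    S = [[0] * (N + 1) for _ in range(M + 1)]
    best = 0
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            # Diagonal move: match or mismatch.
            diag = S[i - 1][j - 1] + (match if query[i - 1] == db[j - 1] else mismatch)
            # Vertical or horizontal move: indel.
            S[i][j] = max(0, diag, S[i - 1][j] + indel, S[i][j - 1] + indel)
            best = max(best, S[i][j])
    return best

print(local_alignment_score("GCGATCT", "GCGATCT"))
```

The two nested loops visit every cell once, matching the O(MN) complexity stated for the DP method.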


An even better way (BLAST)

The BLAST heuristic …

1. Look for small clusters of matches on main diagonal

2. Try to extend those (and only those) clusters

3. Try to merge those extended clusters

– e.g. using DP methods on regions of interest

• Complexity: $O(N) + O(M^2)$ with $M \ll N$

[Figure: three query/database tableaus illustrating steps 1, 2, and 3]

Systolic HW Implementation of DP/ASM

DP processing follows main diagonal …

… leading to a wavefront dependency (A), which is easily computed with a linear array (B).

• Complexity with M cells: O(N)

[Figure: (A) query/database tableau with cells labeled 0 … N so that cells on the same anti-diagonal share a label; (B) linear array with the database streaming past the query]


What’s hard about HW BLAST?

• Random access into multi-GB database for extensions

• The serial version is already O(N)

• HW DP is already O(N) and handles gaps!

[Figure: query/database tableau for step 2]


An Observation

DP and BLAST are duals of each other …

DP processes M alignments simultaneously;

processing is perpendicular to main diagonal

BLAST processes 1 extension at a time;

processing is parallel to main diagonal

DP HW advances one db character per cycle

BLAST HW advances one db character per cycle??

[Figure: two query/database tableaus contrasting the DP and BLAST processing directions]


TreeBLAST – Optimize the HW

Operation:

• Query string held in place,

database streams over it

• On each cycle (alignment), one

ScoreSequence generated

• ScoreSequences evaluated

systolically by the tree structure

TreeBLAST

# In a single cycle

Dimension 1: Foreach Alignment

generate ScoreSequence

# In log2(M) cycles for each ScoreSequence

# process log2(M) ScoreSequences

# simultaneously

Dimension 2: Foreach ScoreSequence

use tree structure to generate local alignment

# Time Complexity = O(N)

# Area Complexity = O(M)

[Figure: TreeBLAST datapath. Database characters stream past the query string; each cycle yields a ScoreSequence (e.g., 8, -2, -3, -3, -3, -1, 8, -2) that feeds leaf nodes and then internal nodes of the tree, which produce the local alignment score]
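
One way a tree can evaluate a ScoreSequence is the standard associative combine for the maximum-scoring run, sketched below. The four-field node contents (total, best prefix, best suffix, best run) are an assumption chosen for illustration; the talk does not spell out what the leaf and internal nodes hold.

```python
def leaf(score):
    """Leaf node: (total, best prefix, best suffix, best run), cut-off = 0."""
    p = max(score, 0)
    return (score, p, p, p)

def combine(left, right):
    """Internal node: merge the summaries of two adjacent spans."""
    lt, lp, ls, lb = left
    rt, rp, rs, rb = right
    return (lt + rt,
            max(lp, lt + rp),        # best run starting at the left edge
            max(rs, rt + ls),        # best run ending at the right edge
            max(lb, rb, ls + rp))    # best run anywhere, possibly spanning the join

def tree_score(scores):
    """Reduce a ScoreSequence level by level, as a log2(M)-deep tree would."""
    if not scores:
        return 0
    level = [leaf(s) for s in scores]
    while len(level) > 1:
        nxt = [combine(level[n], level[n + 1])
               for n in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])    # odd node passes through unchanged
        level = nxt
    return level[0][3]

print(tree_score([8, -2, -3, -3, -3, -1, 8, -2]))
```

Because each tree level can be a pipeline stage, a new ScoreSequence can enter every cycle while earlier ones are still being reduced.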


FPGA BLAST Summary

Key Methods:

• Streaming

• 2D Systolic array

• Custom pipeline

• Thousands of comparisons per cycle

Performance – Time to stream a database through an FPGA

• Average of 4-5 parallel streams

• 200 MHz

• ~1 GB/sec


Time-Step Driven Molecular Dynamics*

MD – An iterative application of Newtonian mechanics to

ensembles of atoms and molecules

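Runs in phases:

• Force update: initially $O(n^2)$, done on coprocessor

• Motion update (Verlet): generally $O(n)$, done on host

Many forces typically computed,

$$F^{total} = F^{bond} + F^{angle} + F^{torsion} + F^{H} + F^{non\text{-}bonded}$$

but complexity lies in the non-bonded, spatially extended forces: van der Waals (LJ) and Coulombic (C)

$$F_i^{LJ} = \sum_{j \ne i} \frac{\epsilon_{ab}}{\sigma_{ab}^{2}} \left\{ 12 \left( \frac{\sigma_{ab}}{|r_{ji}|} \right)^{14} - 6 \left( \frac{\sigma_{ab}}{|r_{ji}|} \right)^{8} \right\} \vec{r}_{ji}$$

$$F_i^{C} = q_i \sum_{j \ne i} \frac{q_j}{|r_{ji}|^{3}} \, \vec{r}_{ji}$$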

*FPL05,FPL06,ParCo08
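
A direct all-pairs evaluation of a 12-6 Lennard-Jones force can be sketched as below. It assumes a single particle type (one `eps` and one `sigma`), takes r_ji to point from particle j to particle i, and uses a hypothetical function name. Note that only r² is needed, never a square root.

```python
def lj_force(i, positions, eps=1.0, sigma=1.0):
    """Lennard-Jones force on particle i from all other particles (O(n) per particle)."""
    force = [0.0, 0.0, 0.0]
    for j, pj in enumerate(positions):
        if j == i:
            continue
        r_ji = [a - b for a, b in zip(positions[i], pj)]  # assumed: from j to i
        r2 = sum(c * c for c in r_ji)
        s2 = sigma * sigma / r2                           # (sigma / r)^2
        # 12 (sigma/r)^14 - 6 (sigma/r)^8, written in powers of (sigma/r)^2
        scale = (eps / sigma ** 2) * (12.0 * s2 ** 7 - 6.0 * s2 ** 4)
        for axis in range(3):
            force[axis] += scale * r_ji[axis]
    return force

print(lj_force(0, [(0, 0, 0), (1, 0, 0)]))
```

Calling this for every particle gives the O(n²) cost that motivates moving the force update to the coprocessor.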


Make Short-Range Forces O(N)

with Cell Lists

Observation:

• Typical volume to be simulated = (100 Å)³

• Typical LJ cut-off radius = 10 Å

Therefore, for all-to-all $O(N^2)$ computation,

most work is wasted

Solution:

Partition space into “cells,” each roughly the size

of the cut-off

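Compute forces on P only w.r.t. particles in adjacent cells.

– Issue: shape of cell – spherical would be more efficient, but cubic is easier to control

– Issue: size of cell – smaller cells mean fewer useless force computations, but more difficult control. The limit is where the cell is the atom itself.

[Figure: particle P in a grid of cells]

$$F_i^{LJ} = \sum_{j \ne i} \frac{\epsilon_{ab}}{\sigma_{ab}^{2}} \left\{ 12 \left( \frac{\sigma_{ab}}{|r_{ji}|} \right)^{14} - 6 \left( \frac{\sigma_{ab}}{|r_{ji}|} \right)^{8} \right\} \vec{r}_{ji}$$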
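
A minimal cell-list sketch, assuming cubic cells whose side equals the cut-off and no periodic boundary conditions (both simplifications, and the function names are illustrative): each particle is binned once, and candidate partners are drawn only from the 27 cells around it.

```python
from collections import defaultdict
from itertools import product

def build_cells(positions, cutoff):
    """Bin particle indices into cubic cells with side length = cutoff."""
    cells = defaultdict(list)
    for idx, (x, y, z) in enumerate(positions):
        cells[(int(x // cutoff), int(y // cutoff), int(z // cutoff))].append(idx)
    return cells

def neighbor_pairs(positions, cutoff):
    """All pairs (a, b), a < b, closer than cutoff, found via adjacent cells only."""
    cells = build_cells(positions, cutoff)
    pairs = set()
    for (cx, cy, cz), members in cells.items():
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            # .get avoids creating empty cells while iterating.
            others = cells.get((cx + dx, cy + dy, cz + dz), ())
            for a in members:
                for b in others:
                    if a < b:
                        d2 = sum((p - q) ** 2
                                 for p, q in zip(positions[a], positions[b]))
                        if d2 <= cutoff ** 2:
                            pairs.add((a, b))
    return pairs

pts = [(0, 0, 0), (5, 0, 0), (25, 0, 0), (9, 9, 9), (12, 0, 0)]
print(sorted(neighbor_pairs(pts, 10)))
```

With roughly constant particle density, each cell holds a bounded number of particles, so the total work grows as O(N) rather than O(N²).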


Short-Range Force Computation

Problem:

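– Compute force equations such as

$$F_i^{LJ} = \sum_{j \ne i} \frac{\epsilon_{ab}}{\sigma_{ab}^{2}} \left\{ 12 \left( \frac{\sigma_{ab}}{|r_{ji}|} \right)^{14} - 6 \left( \frac{\sigma_{ab}}{|r_{ji}|} \right)^{8} \right\} \vec{r}_{ji}$$

Difficulty:

– It requires expensive division operations for $r^{-x}$.

Method: Use table look-up with interpolation, but on individual terms ($r^{-4}$, $r^{-7}$)

Also used for the short-range component of Coulombic; three tables are needed, plus further computation

$$f(x) = C_0 + C_1 (x-a) + C_2 (x-a)^2 + C_3 (x-a)^3 + \dots + C_M (x-a)^M + o(x^M)$$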


Interpolation Pipeline with Semi-FP

• $r^{-x}$ interpolation pipeline

• ‘a’ is the starting point of an interval

• Input x = r² is split into fields, e.g., 00001111001100: Section (format) | Interval (a) | Offset (x-a)

• Find most significant 1 to: get format, extract a, extract (x-a)

• Coefficient memories, addressed by format and a, supply C3, C2, C1, C0

• Pipeline stages (Horner's rule):

  C3*(x-a)

  C3*(x-a)+C2

  (C3*(x-a)+C2)*(x-a)

  (C3*(x-a)+C2)*(x-a)+C1

  ((C3*(x-a)+C2)*(x-a)+C1)*(x-a)

  ((C3*(x-a)+C2)*(x-a)+C1)*(x-a)+C0

• Output: $r^{-14}$, $r^{-8}$, or $r^{-3}$

$$f_{section}(x) = \sum_{i=0}^{M} C_i (x-a)^i$$

[Figure: plot of $r^{-x}$ versus r², marking an interval start a, a point x, and the offset x-a]
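
The lookup-plus-Horner scheme can be sketched in floating point. Everything specific here is an assumption made for illustration: power-of-two sections located with `math.frexp` (standing in for the leading-one detector), 64 equal intervals per section, cubic Taylor coefficients, and the function names. The actual pipeline uses a semi-floating-point format not reproduced here.

```python
import math

def taylor_coeffs(p, a, order=3):
    """Coefficients C0..C_order of x**(-p) expanded about x = a."""
    coeffs, c = [], a ** (-p)
    for n in range(order + 1):
        coeffs.append(c)
        c = c * (-(p + n)) / ((n + 1) * a)   # ratio of successive Taylor terms
    return coeffs

def build_table(p, sections, intervals=64, order=3):
    """Coefficient memory: one entry per (section, interval)."""
    table = {}
    for s in sections:
        for k in range(intervals):
            a = 2.0 ** s * (1 + k / intervals)   # start of the interval
            table[(s, k)] = (a, taylor_coeffs(p, a, order))
    return table

def lookup(table, x, intervals=64):
    """Approximate x**(-p): find section and interval, then apply Horner's rule."""
    s = math.frexp(x)[1] - 1                     # position of the leading one
    k = int((x / 2.0 ** s - 1.0) * intervals)    # interval within the section
    a, coeffs = table[(s, k)]
    d = x - a                                    # offset (x - a)
    acc = 0.0
    for c in reversed(coeffs):                   # ((C3*d + C2)*d + C1)*d + C0
        acc = acc * d + c
    return acc

table = build_table(7, sections=range(0, 3))     # x = r^2 in [1, 8), so x^-7 = r^-14
print(lookup(table, 2.7), 2.7 ** -7)
```

The design point is that one multiplier-adder pair per pipeline stage replaces a divider, at the cost of block RAM for the coefficient tables.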


[Figure: Force Pipeline Array. Host memory connects over the bus to the position/type and acceleration memories and caches. Each force pipeline takes positions through a boundary condition check, distance squared (r²), and cutoff check; it then extracts format, a, and (x-a) and evaluates ((C3*(x-a)+C2)*(x-a)+C1)*(x-a)+C0 to obtain r^-14, r^-8, and r^-3. These are combined with r² into the Lennard-Jones force, the short-range part of the CL force, and a pseudo force]


FPGA MD Summary

Key Method – cast as a streaming problem

• Very deep (70 stage) pipeline

• Multiple (2-8) pipelines

• Optimize interpolation w.r.t. FPGA architecture

• Explicit control of off-chip cell swaps to maintain

constant stream flow


Discrete Event Simulation of MD*

• Simulation with simplified models

• Approximate forces with barriers and square wells

• Classic discrete event simulation

*FPL07, FCCM08


An Alternative ...

Only update particle state when

“something happens”

• “Something happens” = a discrete event

• Advantage: DMD runs $10^5$ to $10^9$ times faster than traditional MD

• Disadvantage: Laws of physics are continuous


DMD step-wise force approximation

[Figure: potential-versus-distance plots of the step-wise approximations: covalent bond, hard sphere, single-well, multi-well]


Discrete Event Simulation

• Simulation proceeds as a series of

discrete element-wise interactions

– NOT time-step driven

• Seen in simulations of …

– Circuits

– Networks

– Traffic

– Systems Biology

– Combat

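[Figure: event loop. A time-ordered event queue (arbitrary insertions and deletions) sends events to the event processor, which sends new state info to the system state; the event predictor (& remover) receives state info and returns events & invalidations to the queue]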
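
A software sketch of the queue, processor, and predictor loop is shown below. It uses a binary heap with lazy deletion to handle invalidations; the class and function names, and the `process_event` callback that returns new events plus invalidated event ids, are assumptions for illustration and not the hardware design described in the talk.

```python
import heapq

class EventQueue:
    """Time-ordered queue with insertion and lazy invalidation."""
    def __init__(self):
        self._heap = []
        self._cancelled = set()
        self._next_id = 0

    def insert(self, time, payload):
        eid = self._next_id
        self._next_id += 1
        heapq.heappush(self._heap, (time, eid, payload))  # eid breaks time ties
        return eid

    def invalidate(self, eid):
        self._cancelled.add(eid)

    def pop(self):
        """Return the earliest valid event, or None when the queue is empty."""
        while self._heap:
            time, eid, payload = heapq.heappop(self._heap)
            if eid in self._cancelled:
                self._cancelled.discard(eid)
                continue
            return time, eid, payload
        return None

def run(queue, state, process_event):
    """Commit events in time order; the callback predicts and invalidates."""
    item = queue.pop()
    while item is not None:
        time, _eid, payload = item
        new_events, invalid_ids = process_event(state, time, payload)
        for eid in invalid_ids:
            queue.invalidate(eid)
        for t, p in new_events:
            queue.insert(t, p)
        item = queue.pop()
    return state
```

A usage example: insert three events, invalidate one, and let the first event spawn another.

```python
q = EventQueue()
q.insert(3.0, "c")
q.insert(1.0, "a")
dead = q.insert(2.0, "b")
q.invalidate(dead)

def handler(state, time, payload):
    state.append((time, payload))
    return ([(1.5, "spawned")], []) if payload == "a" else ([], [])

print(run(q, [], handler))
```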


Overview - Dataflow

Main idea: DMD in one big pipeline

• Events processed with a throughput of one event per cycle

• Therefore, in a single cycle:

• State is updated (event is committed)

• Invalidations are processed

• New events are inserted – up to four are possible

[Figure: DMD pipeline. Event predictor units and the collider generate new event insertions and invalidations for the on-chip event priority queue, which is backed by an off-chip event heap for stall-inducing insertions; events flow to commit, which updates state]


DMD Summary

Key Methods:

• Associative processing: broadcast, compare, etc.

• Standard HW components: priority queue, etc.

Performance –

• 200x – 400x for small to medium sized models



Improving HPRC application development

Language support for

– Streams (common)

– Associative operators (less common)

– Complex memory interleaving (less common)

Libraries with support for

– Common HW functions (some, but low-level)

User knowledge & experience

– FPGA models, applying models (not HW design!)


Questions?
