RSSI – 7/9/2008 HPRC Computing Models
Computing Models for
FPGA-Based Accelerators*
Martin Herbordt Tom VanCourt Yongfeng Gu
Josh Model Bharat Sukhwani Matt Chiu
Computer Architecture and Automated Design Laboratory
Department of Electrical and Computer Engineering
Boston University
http://www.bu.edu/caadlab
*This work is supported in part by the U.S. NIH/NCRR and NSF, and by MIT Lincoln Lab
The Promise of HPRC* …
*Trimberger/Xilinx, FPL07
Reality Check …
Lately … (just as everything seems to be going great for FPGAs)
Reported speed-ups have become more modest, with low single
digits frequently being reported …
Why? Some hypotheses
– More ambitious applications
• Large codes in established systems
• True HPC: large, complex, data types
– More realistic reporting
• end-to-end numbers
• production reference codes
– More “ambitious” development tools
– “Broader” developer base
– FPGA stagnation for two generations (4 years)
• Smaller chips (relative to microprocessors)
• Relatively fewer “hard” components: Block RAMs, multipliers
One key to successful HPRC application
development …
Use an appropriate computing model*
when formulating a problem#
*neither programming language substitute nor HDL
Computing Model ≡
an abstraction of a target machine used to
ease application development
# “One of the key challenges addressed by the ACS program was Problem Formulation”
-- Dr. J. Muñoz, ACS Program Manager, RSSI 2008
A good computing model* …
• Ignores machine details
• Ignores programming language details
• Expresses enough of the underlying machine to enable
the user to develop/choose an optimal algorithm (for a
given application)
• Enables some amount of portability
*see, e.g., L. Snyder 1986 Annual Review of Computer Science
The more machine detail …
… that is expressed in the computing model:
• the greater the potential performance
• the less the portability
• the more experience required to use effectively
Historically (ca. 1990) …
“If we only had the right computing model, then
we could port programs among all of our
different parallel computers.”
-- theme of several parallel computing conferences
Nowadays …
“We need to develop the right computing model so
that programmers can use multicore efficiently”
[without having to learn anything new].
Serial Architecture Computing Model
von Neumann machine
• Single thread
• Random access memory
Parallel Architecture Computing Models
Three (and combinations) are in common use:
1. multiple threads with shared address space
2. multiple threads with message passing
3. single thread with data-parallel constructs
Good (and bad) Computing Models
Just about any programming model can map to any non-trivial computer
(C pointers to FPGAs? Parallel LISP to SIMD? It's been done!)
Good models (w.r.t. a target architecture):
1. Is it convenient to use (in comparison with programming)?
– Inconvenient: microcode, VHDL, etc.
2. Do constructs map efficiently to the target architecture?
– Inefficient: mapping functional parallelism to a SIMD architecture
3. Can critical target machine features be expressed?
– Cannot be expressed: parallel independent functions on a multi-core
architecture, when using the data-parallel model
A great question …
(that is beyond the scope of this talk):
Is it possible to support a “universal”
computing model?
In which we can create applications that port among target
architectures, where:
– the programmer effort is the same as for one version of the software
– the target architectures are unrestricted
– the performance is optimal for all target architectures
– there is no constraint on application domain
Outline
1. Computing Models
2. FPGA Basics – functional models
3. Things that FPGAs do really well –
FPGA Computing Models
4. Sample application mappings
5. What this says about how FPGA-based
computing can advance
What’s a good computing model?
A Basic FPGA Computing Model
Historically, FPGAs were a configurable “bag of gates”
Trimberger/Xilinx, FPL07
… but “bag of gates” is no longer sufficient.
Nowadays, “bag of computer parts” is more accurate
Trimberger/Xilinx, FPL07
Should also account for board-level …
… especially memory hierarchy!
and the system interface …
Bhatt/Intel, FPL07
FPGA Functional Model
• Millions of gate equivalents & connections
• ~500 ALU equivalents
• ~1,000 small on-chip caches
– Total on-chip memory ~16MB
• Several off-chip caches
– Capacity for 512b data transfers per cycle (8x64b)
• Several high-performance I/O streams
• Host w/ simple interface, e.g., FSB
Another Candidate Model
Outline
1. Computing Models
2. FPGA Basics – functional models
3. Things that FPGAs do really well –
FPGA Computing Models
4. Sample application mappings
5. What this says about how FPGA-based
computing can advance
We know we’ve succeeded when …
… we’ve restructured the problem into something we know
works really well on FPGAs
Effective computing models:
Streaming
Associative computing – broadcast, compare tags
HW structures – FIFOs, priority queues, systolic arrays
Cellular automata, SIMD PEs, Vector processing
Highly parallel (possibly complex) memory access
Overlapped parallel structures
Also assumed:
– Explicit memory control, e.g., to swap working sets
– High-bandwidth I/O
Some Observations
• Most architectures have a single preferred model; FPGAs have many
• FPGA models are surrogates for the components they replace
Model: Streaming
Ex: DSP replacement
Characteristics:
• Pass streams of data through a series of arithmetic units
• Iterative streaming computation with data beginning and
ending in Block RAMs
Typical Streaming Scenarios
1. On-line signal/video processing
– Stream originates from I/O
– Stream processed with computational
filters
2. Complex computation of large array
– Stream originates in memory
– Stream processed with pipelined
instantiation of computation
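The streaming scenarios above can be sketched in software with chained generators, where each generator plays the role of one pipeline stage. This is only an illustration under assumed details: the two-tap moving-average filter, the gain stage, and the names `fir_stage`, `scale_stage`, and `run_pipeline` are hypothetical and do not come from the talk.

```python
from collections import deque
from typing import Iterable, Iterator, List, Sequence


def fir_stage(stream: Iterable[float], taps: Sequence[float]) -> Iterator[float]:
    """One pipeline stage: emit a filtered sample for every input sample."""
    window = deque([0.0] * len(taps), maxlen=len(taps))
    for sample in stream:
        window.appendleft(sample)  # newest sample enters the delay line
        yield sum(t * w for t, w in zip(taps, window))


def scale_stage(stream: Iterable[float], gain: float) -> Iterator[float]:
    """A second stage chained behind the first, as in a hardware pipeline."""
    for sample in stream:
        yield gain * sample


def run_pipeline(samples: Iterable[float]) -> List[float]:
    """Stream samples (from I/O or memory) through both stages."""
    filtered = fir_stage(samples, taps=[0.5, 0.5])  # assumed 2-tap moving average
    return list(scale_stage(filtered, gain=2.0))
```

No stage ever holds the whole data set; each sample flows through every stage in turn, which is the property the streaming model relies on.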
Model: Associative Computing
Ex: SIMD array replacement
Characteristic operations:
• Broadcast query/data
• Tag check
• Collective response
• Reduction of responses
Krikelis: Associative String Processor
Typical Associative Scenario
Query/response
Example: Optimization with successive approximation
Class: Successive approximation
• Scoring function (TBS): F_i(x)
• Initial state (TBS): X_0
• Next state selection (TBS): X_{i+1} = NS[ F_1(X_i), F_2(X_i), … ]
[Figure: the state (X_0, then X_i) is broadcast to scoring units F1 … F5; their responses are reduced by X_{i+1} = NextState[ F1, F2, F3, … ]]
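The broadcast / score / reduce loop of this scenario can be sketched as follows. The slide leaves the scoring function, initial state, and next-state selection unspecified (TBS), so everything concrete here is an assumption: each scoring unit is modeled as evaluating one fixed offset from the broadcast state, and the reduction simply keeps the best-scoring candidate.

```python
from typing import Callable, Sequence


def successive_approximation(
    score: Callable[[float], float],
    x0: float,
    offsets: Sequence[float],
    max_iters: int = 100,
) -> float:
    """Hill-climb by broadcasting the state to several scoring units."""
    x = x0
    for _ in range(max_iters):
        # Broadcast: every unit receives the same state x and scores its
        # own candidate (the hypothetical analogue of F_k(X_i)).
        responses = [(score(x + d), x + d) for d in offsets]
        # Reduction of responses: keep the best-scoring candidate.
        best_score, best_x = max(responses)
        if best_score <= score(x):
            break  # no unit reports an improvement
        x = best_x  # next-state selection
    return x
```

In hardware the list comprehension would be a set of units working in the same cycle; the `max` stands in for the collective response and reduction network.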
Model: MSI HW Structures
Ex: ASIC replacement
[Figure: linear systolic array of PEs; signal labels A[0] … A[L-1], A[L], A[k], B[i], C[k], Init_A]
Characteristics:
Standard HW versions of common data structures –
• FIFOs
• Priority queues
• Systolic arrays
Typical HW structure scenario
• Wherever HW instantiation does not have an
immediate SW analog
Example: Find palindromes in a character string
[Figure: palindrome-finding hardware around a gap in the string. Stages: character comparison (=), length summation (+), threshold detection (T1? … T4?, for Len=1 … Len=4), and length reporting through a priority encoder that outputs the maximal palindrome length]
Model: Highly Parallel Memory Access
“The advantage of having an MPP is having lots of memory pipes”
Characteristics:
• Source and sink up to 2000 operands per cycle
• Possibly complex access patterns
Example: access vectors in all possible 3-way combinations
• Divide n objects into subsets of size m …
• … so that every size-3 subset is in just one size-m subset
[Figure: Vector Data Memory (VDM) supplying X1-9 and Y (m = 9) to units DPS0 … DPS83]
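As a rough software analogue of the 3-way access pattern, the sketch below enumerates every index triple over m vectors. For m = 9 this gives C(9,3) = 84 triples, which is consistent with the 84 units (DPS0 … DPS83) in the figure, although reading the figure that way is an inference. The function names are hypothetical, and the hardware would supply these operands in parallel rather than in a loop.

```python
from itertools import combinations
from math import comb
from typing import Callable, List, Sequence, Tuple


def triple_schedule(m: int) -> List[Tuple[int, int, int]]:
    """All index triples (a, b, c) with a < b < c over m vectors."""
    return list(combinations(range(m), 3))


def process_triples(vectors: Sequence, op: Callable) -> list:
    """Apply op to every 3-way combination of the given vectors."""
    return [
        op(vectors[a], vectors[b], vectors[c])
        for a, b, c in triple_schedule(len(vectors))
    ]


# Number of triples for the m = 9 case shown on the slide.
TRIPLES_FOR_M9 = comb(9, 3)
```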
Model: Functional Parallelism
Example: rigid molecule docking
Characteristics: (just what it says)
• Potentially expensive computations can be hidden completely
– random # generation
– coordinate transformations
• Access original molecule grid in rotated order
– Express (i,j,k) in (x,y,z) basis:
i = (x_i, y_i, z_i), j = (x_j, y_j, z_j), k = (x_k, y_k, z_k)
– Traverse (i,j,k) index space
– Find (x,y,z) from (i,j,k):
\[ \begin{pmatrix} x_i & x_j & x_k \\ y_i & y_j & y_k \\ z_i & z_j & z_k \end{pmatrix} \begin{pmatrix} i \\ j \\ k \end{pmatrix} = \begin{pmatrix} x \\ y \\ z \end{pmatrix} \]
– Round and range check
• Pipelined, parallel computation gives ~0 ns overhead for rotation
[Figure: molecule voxel rotation → systolic 3D correlation array → data reduction filter]
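The index-to-coordinate step can be written out as a small function: multiply the (i,j,k) index by the matrix whose columns are the rotated basis vectors, round, and range-check. This is a sketch under assumptions: the cubic grid of side `grid_size`, the function name, and the use of Python's `round` (which rounds halves to even) are not taken from the talk.

```python
from typing import Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def rotated_voxel(
    index: Tuple[int, int, int],
    basis: Sequence[Vec3],
    grid_size: int,
) -> Optional[Tuple[int, int, int]]:
    """Map an (i, j, k) index to the (x, y, z) voxel of the original grid.

    basis holds the i, j, k unit vectors expressed in the (x, y, z) basis.
    Returns None when the rounded position falls outside the grid.
    """
    i, j, k = index
    (xi, yi, zi), (xj, yj, zj), (xk, yk, zk) = basis
    x = xi * i + xj * j + xk * k
    y = yi * i + yj * j + yk * k
    z = zi * i + zj * j + zk * k
    voxel = (round(x), round(y), round(z))  # round to the nearest voxel
    if all(0 <= c < grid_size for c in voxel):  # range check
        return voxel
    return None
```

Because each output depends only on the current index, the multiply-adds can be pipelined and run alongside the correlation, which is how the cost of rotation can be hidden.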
Outline
1. Computing Models
2. FPGA Basics – functional models
3. Things that FPGAs do really well –
FPGA Computing Models
4. Sample application mappings
5. What this says about how FPGA-based
computing can advance
BLAST*
• For a biological sequence (DNA, Protein) and a database
of such sequences, find the database sequences that are
most biologically relevant to the query sequence.
*FCCM06, ParCo07
Sequence Alignment – Basics
Example: GCGATCT versus an entire database
• Each character-character match (G-G, C-G, etc.) is scored
independently with a scoring matrix
• An alignment is a possible way for sequences to match (char-char)
• To score an alignment (evaluate a single ScoreSequence) …
    # Find max cumulative score with cut-off = 0
• Simple algorithm to find the maximal ungapped local alignment over all
possible ungapped alignments (i.e., N ScoreSequences) …
    # Find maximal local alignment of all ungapped alignments
    # Complexity = O(MN)
    Traverse Database – Foreach Alignment
        Generate ScoreSequence
        Do SimpleScoring
• Complexity of gapped alignments is potentially unbounded
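The ungapped procedure above can be sketched in a few lines. The match/mismatch scores of +1/-1 are an assumption standing in for a real scoring matrix, and the function names are hypothetical. The sketch enumerates every diagonal offset, including partial overlaps at the ends of the database.

```python
from typing import Sequence


def simple_scoring(score_sequence: Sequence[int]) -> int:
    """Max cumulative score with cut-off = 0: the running sum is reset
    to zero whenever it would drop below zero."""
    best = running = 0
    for s in score_sequence:
        running = max(0, running + s)
        best = max(best, running)
    return best


def best_ungapped_alignment(
    query: str, database: str, match: int = 1, mismatch: int = -1
) -> int:
    """Traverse the database: one ScoreSequence per alignment (diagonal)."""
    m, n = len(query), len(database)
    best = 0
    for offset in range(-(m - 1), n):
        score_sequence = [
            match if query[q] == database[q + offset] else mismatch
            for q in range(m)
            if 0 <= q + offset < n
        ]
        best = max(best, simple_scoring(score_sequence))
    return best
```

Each of the roughly N alignments costs O(M) to score, which gives the O(MN) total stated above.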
Gapped alignments (DP-based methods)
• Create query/database tableau (rows: query, length M; columns: database sequence, length N):
      G C G A T C T
    G 1 0 1 0 0 0 0
    C 0 1 0 0 0 1 0
    A 0 0 0 1 0 0 0
    T 0 0 0 0 1 0 1
    T 0 0 0 0 1 0 1
    T 0 0 0 0 1 0 1
    A 0 0 0 1 0 0 0
• Traverse the tableau with a dynamic programming algorithm. The score S_{i,j} of each grid cell (i,j) is computed with a recurrence over its neighboring cells:
– Parallel to main diagonal – match/mismatch
– Vertical or horizontal – indel
• Example gapped alignment:
    GCGATCT-
    GC-ATTTA
• Complexity: O(MN)
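The transcript does not preserve the recurrence itself, so the sketch below uses a common Smith-Waterman-style local alignment recurrence with a linear gap penalty as a stand-in. The scores (match +1, mismatch -1, indel -1) and the function name are assumptions, not the values used in the talk.

```python
def local_alignment_score(
    query: str,
    database: str,
    match: int = 1,
    mismatch: int = -1,
    indel: int = -1,
) -> int:
    """Best local alignment score over an M x N dynamic programming tableau."""
    m, n = len(query), len(database)
    # S[i][j] is the best score of an alignment ending at query[i-1], database[j-1].
    S = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            step = match if query[i - 1] == database[j - 1] else mismatch
            S[i][j] = max(
                0,                        # start a new local alignment
                S[i - 1][j - 1] + step,   # diagonal move: match/mismatch
                S[i - 1][j] + indel,      # vertical move: indel
                S[i][j - 1] + indel,      # horizontal move: indel
            )
            best = max(best, S[i][j])
    return best
```

Every cell is visited once and does constant work, hence the O(MN) complexity. Each cell depends only on its upper, left, and upper-left neighbors.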
An even better way (BLAST)
The BLAST heuristic …
1. Look for small clusters of matches on main diagonal
2. Try to extend those (and only those) clusters
3. Try to merge those extended clusters
– e.g. using DP methods on regions of interest
• Complexity: O(N) + O(M^2) with M << N
[Figures 1–3: query-versus-database tableaus illustrating steps 1, 2, and 3]
Systolic HW Implementation of DP/ASM
DP processing follows main diagonal …
… leading to a wavefront dependency (A),
which is easily computed with a linear array (B).
• Complexity with M cells: O(N)
[Figure: (A) query-versus-database tableau with cells labeled by wavefront step 0, 1, 2, 3, … N-2, N-1, N; (B) the same steps mapped onto a linear array]
What’s hard about HW BLAST?
• Random access into multi-GB database for extensions
• The serial version is already O(N)
• HW DP is already O(N) and handles gaps!
An Observation
DP and BLAST are duals of each other …
• DP processes M alignments simultaneously;
processing is perpendicular to main diagonal
• BLAST processes 1 extension at a time;
processing is parallel to main diagonal
• DP HW advances one db character per cycle
• BLAST HW advances one db character per cycle??
[Figures: query-versus-database tableaus showing the DP and BLAST processing directions]
TreeBLAST – Optimize the HW
Operation:
• Query string held in place,
database streams over it
• On each cycle (alignment), one
ScoreSequence generated
• ScoreSequences evaluated
systolically by the tree structure
TreeBLAST
# In a single cycle
Dimension 1: Foreach Alignment
generate ScoreSequence
# In log2(M) cycles for each ScoreSequence
# process log2(M) ScoreSequences
# simultaneously
Dimension 2: Foreach ScoreSequence
use tree structure to generate local alignment
# Time Complexity = O(N)
# Area Complexity = O(M)
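[Figure: TreeBLAST structure. The query string is compared against the database to produce a ScoreSequence (e.g., 8, -2, -3, -3, -3, -1, 8, -2), which enters four leaf nodes, then two internal nodes, then a root internal node that outputs the local alignment score]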
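One way a tree can evaluate a ScoreSequence is sketched below. The slides do not give the node design, so this uses a standard tree reduction for the maximum cumulative score with cut-off 0: each node summarizes its span with four numbers, and two summaries combine in constant time. The `Node` fields and function names are assumptions, not the authors' implementation.

```python
from typing import List, NamedTuple, Sequence


class Node(NamedTuple):
    total: int   # sum of all scores in the span
    prefix: int  # best sum of a run starting at the left edge (>= 0)
    suffix: int  # best sum of a run ending at the right edge (>= 0)
    best: int    # best sum of any run inside the span (>= 0)


def leaf(score: int) -> Node:
    p = max(0, score)
    return Node(score, p, p, p)


def combine(left: Node, right: Node) -> Node:
    return Node(
        left.total + right.total,
        max(left.prefix, left.total + right.prefix),
        max(right.suffix, right.total + left.suffix),
        max(left.best, right.best, left.suffix + right.prefix),
    )


def tree_score(score_sequence: Sequence[int]) -> int:
    """Reduce the ScoreSequence level by level, as a tree would."""
    level: List[Node] = [leaf(s) for s in score_sequence]
    while len(level) > 1:
        nxt = [combine(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # an odd node passes through unchanged
        level = nxt
    return level[0].best if level else 0
```

In this software version each pass of the `while` loop corresponds to one tree level. A hardware tree could hold a different ScoreSequence at every level, which would give about log2(M) cycles of latency with one new result per cycle.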
FPGA BLAST Summary
Key Methods:
• Streaming
• 2D Systolic array
• Custom pipeline
• Thousands of comparisons per cycle
Performance – Time to stream a database through an FPGA
• Average of 4-5 parallel streams
• 200 MHz
• ~1GB/sec
Time-Step Driven Molecular Dynamics*
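MD – An iterative application of Newtonian mechanics to
ensembles of atoms and molecules
Runs in phases: force update, then motion update (Verlet)
Many forces typically computed:
\[ F^{total} = F^{bond} + F^{angle} + F^{torsion} + F^{H} + F^{non\text{-}bonded} \]
• All terms except the non-bonded: generally O(n), done on host
• Non-bonded term: initially O(n^2), done on coprocessor
But complexity lies in the non-bonded, spatially extended forces:
van der Waals (LJ) and Coulombic (C)
\[ \mathbf{F}_i^{LJ} = \sum_{j \neq i} \frac{\epsilon_{ab}}{\sigma_{ab}^{2}} \left\{ 12 \left( \frac{\sigma_{ab}}{|r_{ji}|} \right)^{14} - 6 \left( \frac{\sigma_{ab}}{|r_{ji}|} \right)^{8} \right\} \mathbf{r}_{ji} \]
\[ \mathbf{F}_i^{C} = q_i \sum_{j \neq i} \frac{q_j}{|r_{ji}|^{3}} \mathbf{r}_{ji} \]
*FPL05, FPL06, ParCo08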
Make Short-Range Forces O(N)
with Cell Lists
Observation:
• Typical volume to be simulated = 100 Å^3
• Typical LJ cut-off radius = 10 Å
Therefore, for all-to-all O(N^2) computation,
most work is wasted
Solution:
Partition space into “cells,” each roughly the size
of the cut-off
Compute forces on P only w.r.t. particles in
adjacent cells.
– Issue: shape of cell – spherical would be more efficient,
but cubic is easier to control
– Issue: size of cell – smaller cells mean fewer useless force
computations, but more difficult control. Limit is where the
cell is the atom itself.
\[ \mathbf{F}_i^{LJ} = \sum_{j \neq i} \frac{\epsilon_{ab}}{\sigma_{ab}^{2}} \left\{ 12 \left( \frac{\sigma_{ab}}{|r_{ji}|} \right)^{14} - 6 \left( \frac{\sigma_{ab}}{|r_{ji}|} \right)^{8} \right\} \mathbf{r}_{ji} \]
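A minimal cell-list sketch is shown below: particles are binned into cubic cells with side equal to the cut-off, and each particle is compared only against particles in its own and the 26 adjacent cells. It assumes a non-periodic box and plain Python tuples for positions; the function names are hypothetical.

```python
from collections import defaultdict
from itertools import product
from typing import Dict, List, Sequence, Set, Tuple

Vec3 = Tuple[float, float, float]
Cell = Tuple[int, int, int]


def build_cells(positions: Sequence[Vec3], cutoff: float) -> Dict[Cell, List[int]]:
    """Bin particle indices into cubic cells whose side equals the cut-off."""
    cells: Dict[Cell, List[int]] = defaultdict(list)
    for idx, (x, y, z) in enumerate(positions):
        cells[(int(x // cutoff), int(y // cutoff), int(z // cutoff))].append(idx)
    return cells


def neighbor_pairs(positions: Sequence[Vec3], cutoff: float) -> Set[Tuple[int, int]]:
    """Pairs (i, j), i < j, within the cut-off, found via adjacent cells only."""
    cells = build_cells(positions, cutoff)
    pairs: Set[Tuple[int, int]] = set()
    for (cx, cy, cz), members in cells.items():
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            # .get avoids creating empty cells while iterating
            for j in cells.get((cx + dx, cy + dy, cz + dz), ()):
                for i in members:
                    if i < j:
                        d2 = sum((a - b) ** 2 for a, b in zip(positions[i], positions[j]))
                        if d2 <= cutoff * cutoff:
                            pairs.add((i, j))
    return pairs
```

With roughly uniform density, each particle is compared against a bounded number of neighbors, so the total work grows as O(N) rather than O(N^2).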
Short-Range Force Computation
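Problem:
– Compute force equations such as
\[ \mathbf{F}_i^{LJ} = \sum_{j \neq i} \frac{\epsilon_{ab}}{\sigma_{ab}^{2}} \left\{ 12 \left( \frac{\sigma_{ab}}{|r_{ji}|} \right)^{14} - 6 \left( \frac{\sigma_{ab}}{|r_{ji}|} \right)^{8} \right\} \mathbf{r}_{ji} \]
Difficulty:
– It requires expensive division operations for r^{-x}.
Method: Use table look-up with interpolation, but on individual terms (r^{-4}, r^{-7})
– Also used for the short-range component of the Coulombic force:
three tables are needed, plus further computation
\[ f(x) = C_0 + C_1 (x-a) + C_2 (x-a)^2 + C_3 (x-a)^3 + \cdots + C_M (x-a)^M + o\big((x-a)^M\big) \]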
Interpolation Pipeline with Semi-FP
• r^{-x} interpolation pipeline; ‘a’ is the starting point of an interval
• Input x = r^2 in semi-FP form, e.g., 00001111001100, split into fields:
section (format) | interval (a) | offset (x-a)
• Find the most significant 1 to:
– get format
– extract a
– extract (x-a)
• Pipeline stages (coefficients C3 … C0 are read from coefficient memories addressed by format and a):
1. C3*(x-a)
2. C3*(x-a)+C2
3. (C3*(x-a)+C2)*(x-a)
4. (C3*(x-a)+C2)*(x-a)+C1
5. ((C3*(x-a)+C2)*(x-a)+C1)*(x-a)
6. ((C3*(x-a)+C2)*(x-a)+C1)*(x-a)+C0
• Output: r^{-14}, r^{-8}, or r^{-3}
\[ f_{section}(x) = \sum_{i=0}^{M} C_i (x-a)^i \]
[Figure: plot of r^{-x} versus r^2, marking the interval start a, the input x, and the offset x-a]
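The table look-up and the Horner-style stages above can be sketched in floating point as follows. Several details are assumptions made for illustration: `math.frexp` stands in for the "find most significant 1" logic, each power-of-two section is split into 64 equal intervals, and the coefficients are Taylor coefficients computed on the fly instead of being read from a precomputed coefficient memory. The actual design may fit its coefficients differently.

```python
import math
from typing import List


def taylor_coeffs(a: float, p: float, order: int = 3) -> List[float]:
    """C_i = f^(i)(a) / i! for f(x) = x**p, i = 0 .. order."""
    coeffs = [a ** p]
    for i in range(order):
        # C_{i+1} = C_i * (p - i) / ((i + 1) * a)
        coeffs.append(coeffs[-1] * (p - i) / ((i + 1) * a))
    return coeffs


def interpolate_power(x: float, p: float, intervals_per_section: int = 64) -> float:
    """Approximate x**p (x = r^2 > 0) with a per-interval cubic."""
    _, e = math.frexp(x)              # x = m * 2**e with 0.5 <= m < 1
    base = math.ldexp(1.0, e - 1)     # section start: largest power of 2 <= x
    width = base / intervals_per_section
    a = base + int((x - base) / width) * width  # interval start
    c0, c1, c2, c3 = taylor_coeffs(a, p)        # hardware: coefficient memory
    d = x - a                                   # offset within the interval
    return ((c3 * d + c2) * d + c1) * d + c0    # Horner form, as in the pipeline
```

For example, `interpolate_power(r2, -7.0)` approximates r^{-14} from r^2 without any division in the evaluation step. Sizing sections by powers of two keeps the relative width of every interval the same across the whole range.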
Force Pipeline Array
[Figure: force pipeline array. Block labels: host memory, bus, Pos/Type memory, acceleration memory, Pos/Type cache, acceleration cache; within each pipeline: distance squared (r^2), boundary condition check, cutoff check, extract format, a, (x-a), polynomial evaluation ((C3*(x-a)+C2)*(x-a)+C1)*(x-a)+C0 producing r^{-14}, r^{-8}, r^{-3}; outputs: Lennard-Jones force, short-range part of CL force, pseudo force]
FPGA MD Summary
Key Method – cast as a streaming problem
• Very deep (70 stage) pipeline
• Multiple (2-8) pipelines
• Optimize interpolation w.r.t. FPGA architecture
• Explicit control of off-chip cell swaps to maintain
constant stream flow
Discrete Event Simulation of MD*
• Simulation with simplified models
• Approximate forces with barriers and square wells
• Classic discrete event simulation
*FPL07, FCCM08
An Alternative ...
Only update particle state when
“something happens”
• “Something happens” = a discrete event
• Advantage: DMD runs 10^5 to 10^9 times faster
than traditional MD
• Disadvantage: laws of physics are continuous
DMD step-wise force approximation
[Figure: four step-wise potential-versus-distance profiles: covalent bond, hard sphere, single-well, multi-well]
Discrete Event Simulation
• Simulation proceeds as a series of
discrete element-wise interactions
– NOT time-step driven
• Seen in simulations of …
– Circuits
– Networks
– Traffic
– Systems Biology
– Combat
[Figure: discrete event simulation loop. Blocks: time-ordered event queue (arbitrary insertions and deletions), event processor, system state, event predictor (& remover). Labeled paths: events, new state info, state info, events & invalidations]
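The event loop in this slide can be sketched with a binary heap as the time-ordered queue. Arbitrary deletions are handled here by lazy invalidation, where stale events are skipped when popped; this is a common software technique and not necessarily how the hardware queue removes events. The class, method, and callback names are hypothetical.

```python
import heapq
from itertools import count
from typing import Any, Callable, List, Optional, Set, Tuple


class EventQueue:
    """Time-ordered event queue with insertion and lazy invalidation."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, int, Any]] = []
        self._ids = count()
        self._live: Set[int] = set()

    def insert(self, time: float, payload: Any) -> int:
        eid = next(self._ids)  # unique id also breaks ties between equal times
        heapq.heappush(self._heap, (time, eid, payload))
        self._live.add(eid)
        return eid

    def invalidate(self, eid: int) -> None:
        self._live.discard(eid)  # the entry is dropped when it reaches the top

    def pop(self) -> Optional[Tuple[float, Any]]:
        while self._heap:
            time, eid, payload = heapq.heappop(self._heap)
            if eid in self._live:
                self._live.remove(eid)
                return time, payload
        return None


def simulate(queue: EventQueue, process: Callable, until: float) -> list:
    """process(time, payload) returns (new_events, invalidated_ids)."""
    committed = []
    while (item := queue.pop()) is not None:
        time, payload = item
        if time > until:
            break
        committed.append((time, payload))           # event processor commits
        new_events, stale_ids = process(time, payload)  # predictor (& remover)
        for eid in stale_ids:
            queue.invalidate(eid)
        for t, p in new_events:
            queue.insert(t, p)
    return committed
```

The `process` callback plays the role of the event processor plus predictor: it updates whatever system state the caller keeps and returns the newly predicted events and the ids of events that are no longer valid.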
Overview - Dataflow
Main idea: DMD in one big pipeline
• Events processed with a throughput of one event per cycle
• Therefore, in a single cycle:
• State is updated (event is committed)
• Invalidations are processed
• New events are inserted – up to four are possible
[Figure: DMD pipeline. Blocks: commit, event predictor units, collider, on-chip event priority queue, off-chip event heap. Labeled paths: event flow, update state, new event insertions, stall-inducing insertions, invalidations]
DMD Summary
Key Methods:
• Associative processing: broadcast, compare, etc.
• Standard HW components: priority queue, etc.
Performance –
• 200x – 400x for small to medium sized models
Outline
1. Computing Models
2. FPGA Basics – functional models
3. Things that FPGAs do really well –
FPGA Computing Models
4. Sample application mappings
5. What this says about how FPGA-based
computing can advance
Improving HPRC application development
Language support for
– Streams (common)
– Associative operators (less common)
– Complex memory interleaving (less common)
Libraries with support for
– Common HW functions (some, but low-level)
User knowledge & experience
– FPGA models, applying models (not HW design!)
Questions?