Wormhole RTR FPGAs with Distributed Configuration
Decompression
CSE-670 Final Project Presentation
A Joint Project Presentation by:
Ali Mustafa Zaidi
Mustafa Imran Ali
Introduction
“An FPGA configuration architecture supporting distributed control and fast context switching.”
Aim of Joint Project: Explore the potential for dynamically reconfiguring FPGAs by adapting the WRTR approach.
Enable high-speed reconfiguration using optimized logic blocks and distributed configuration decompression techniques.
Study focuses on datapath-oriented FPGAs.
Project Methodology
In-depth study of issues.
Definition of basic architecture models.
Definition of area models (for estimation of relative overhead w.r.t. other RTR schemes).
Design of reconfigurable systems around the designed architecture models.
Identification of FPGA resource-allocation methods for distributing FPGA area between multiple applications/hosts (at runtime, compile time, or system design time).
Selection of benchmarks for testing and simulation of the various approaches.
Experimentation with WRTR systems and a corresponding PRTR system without distributed configuration (i.e. a single host/application) to compare baseline performance: evaluate resource utilization, normalized reconfiguration overhead, etc.
Experimentation with all systems with distributed configurations (i.e. multiple hosts/applications); the PRTR system's configuration port is time-multiplexed between the various applications: evaluate resource utilization, normalized reconfiguration overhead, etc.
Configuration Issues with Ultra-high Density FPGAs
FPGA densities are rising dramatically with process-technology improvements.
Configuration time for the serial method is becoming prohibitively large.
FPGAs are increasingly used as compute engines, implementing data-intensive portions of applications directly in hardware.
Lack of an efficient method for dynamic reconfiguration of large FPGAs leads to inefficient utilization of the available resources.
Scalability Issues with Multi-context RTR
Concept: while one plane is operating, configure the other planes serially. Latency is hidden by overlapping configuration with computation.
As FPGA size, and thus configuration time, grows, multi-context becomes less effective at hiding latency.
Configuring more bits in parallel for each context is only a stop-gap solution: only so many pins can be dedicated to configuration.
Overheads in the multi-context approach:
The number of SRAM cells used for configuration grows linearly with the number of contexts.
Multiplexing circuitry associated with each configurable unit.
Global, low-skew context-select wires.
Scalability Issues with Partial RTR
Concept: the configuration memory is addressable like standard random-access memory.
Overheads in the PRTR approach:
Long global cell-select and data busses are required: area overhead, plus issues with single-cycle signal transmission as wires grow relative to logic.
Vertical and horizontal decoding circuitry represents a centralized control resource; it can be accessed only sequentially by different user applications (one app at a time). Potential for underutilization of hardware as FPGA density increases.
One solution could be to design the RAM as a multi-ported memory, but RAM area increases quadratically with the number of ports, and only so many dedicated configuration ports can be provided. Not a long-term scalable solution.
What is Wormhole RTR?
WRTR is a method for reconfiguring a configurable device in an entirely distributed fashion.
Routing and configuration handled at local instead of global level.
Advertised Benefits of WRTR
WRTR is a distributed paradigm: it allows different parts of the same resource to be independently configured simultaneously.
Dramatically increases the configuration bandwidth of a device.
Lack of a centralized controller means fewer single-point failures that can lead to total system failure (e.g. a broken configuration pin), and increased resilience: routing around faults improves chip yields.
Distributed control provides scalability and eliminates the configuration bottleneck.
Origins of Wormhole RTR
Concept developed in the late 90s at Virginia Tech.
Intended as a method of rapidly creating and modifying 'custom computational pathways' using a distributed control scheme.
Essence of the WRTR concept: independent, self-steering streams. Streams carried both programming information and operand data, and interacted with the architecture to perform computation.
(see DIAGRAM)
Programming information configures both the pathway of the stream through the system and the operations performed by computational elements along the path.
Heterogeneity of architectures is supported by these streams.
Runtime determination of the stream's path is possible, allowing allocation of resources as they become available.
Adapting WRTR for Conventional FPGAs
Our aims:
To achieve fast, parallel reconfiguration with minimum area overhead and minimum constraints imposed on the underlying FPGA architecture.
The configuration architecture is completely decoupled from the FPGA architecture.
The WRTR model is used as inspiration for developing a new paradigm for dynamic reconfiguration; it is not necessary to follow the WRTR method to the letter.
Issues Associated with Using WRTR for Conventional FPGAs
The original WRTR was intended for coarse-grained dataflow architectures with localized communications; thus operand data was appended to the streams immediately after the programming header.
In conventional FPGAs, dataflow patterns are unrelated to configuration flow, and there is no restriction localizing communications. Therefore wormhole routing is used only for configuration (it cannot be used for data).
Issues Associated with Using WRTR for Conventional FPGAs
The original model was intended to establish linear pipelines through the system, which makes run-time direction determination feasible.
For conventional FPGAs, however, the implemented functions have arbitrary structures; the configuration stream cannot change direction arbitrarily (i.e. its path is fixed at compile time).
Issues Associated with Using WRTR for Conventional FPGAs
Due to the need for a large number of configuration ports, I/O ports must be shared/multiplexed; thus active circuits may need to be stalled to load configurations. This should not be a severe issue for high-performance-computing-oriented tasks.
WRTR should impose minimum constraints on the underlying FPGA architecture: the constraints applicable in an FPGA with WRTR are the same as those for any PRTR architecture.
A Possible System Architecture for a WRTR FPGA
Many configuration/IO ports, divided between multiple host processors. (See Diagram)
Internally, the FPGA is divided into partitions, usable by each of the hosts. Partition boundaries may be determined at system design time, or at runtime based on the requirements of each host at any given time.
The Various WRTR Models Derived
Our aim was to devise a high-speed, distributed configuration model with all the benefits of the original WRTR concept, but with minimum overhead.
To this end, three models have been devised:
Basic: with full internal routing.
Second: with perimeter-only routing.
Third: packetized, or parallel, configuration streams, with no internal routing.
Basic WRTR Model: with Internal Routing
Each configurable block or "tile" is accompanied by a simple configuration stream router. (See Diagram)
Overhead scales linearly with FPGA size.
Expected issues with this model: a complicated router, arbitration overhead and prioritization, potential for deadlock conditions, etc. May be restricted to coarser-grained designs. Without data routing, do we really need internal routing?
Second: WRTR with Perimeter-only Routing
The primary requirement for achieving parallel configuration is multiple input ports; internal routing is not mandatory.
So why not restrict routing to the chip boundary? (See Diagram)
Overhead scaling is improved (similar to the PRTR model).
Highlights: finer configuration granularity is achievable; significantly lower overheads as FPGA sizes grow (ratio of perimeter to area).
Issues: longer time required to reach parts of the FPGA as FPGA size grows; reduced configuration parallelism because of potentially greater arbitration delays at the boundary routers.
Third: Packet-based Distribution of Configuration
One solution to the increased boundary-arbitration issues: use packets instead of streams. (See Diagram)
A single configuration from each application is generated as a stream (a worm), as in the previous models.
Before entering the device, configuration packets from the different streams are grouped according to their target rows.
Third: Packet-based Distribution of Configuration
Benefit: no need at all for routers in the fabric itself.
Drawbacks: increased overhead on the host system; implies a centralized external controller, or a limited crossbar interconnect within the FPGA. Parallel reconfiguration is still possible, but with limited multitasking.
This model may be considered for embedded systems with low configuration parallelism but high resource requirements.
Basic Area Model
A baseline model is required to identify the overheads associated with each RTR model.
Basic building block: (See Diagram) a basic array of SRAM cells, configured by an arbitrary number of scan chains.
Assumptions for a fair comparison of overhead: each RTR model studied has exactly the same amount of logic resources to configure (rows and columns), and each model can be configured at exactly the same granularity.
The given array of logic resources (see Diagram) has an area equivalent to a serially configurable FPGA.
AREA of Basic Model = (A * B) * (x * y)
The PRTR FPGA Area Model
Please See Diagram
Configuration granularity is decided by A and B.
AREA = Area of Basic Model + Overheads
Overheads = Area of ‘log2(x)-to-x’ row-select decoder + Area of ‘log2(y)-to-y’ n-bit column de-multiplexer + (Area of one n-bit bus) * y
The Basic WRTR FPGA Area Model
Please See Diagram
AREA = Area of Basic Model + Overheads
Overheads = (Area of one n-bit bus) * 2x + (Area of one n-bit bus) * 2y + (Area of one 4-D router block) * (x * y)
The Perimeter-Routing WRTR FPGA Area Model
Please See Diagram
AREA = Area of Basic Model + Overheads
Overheads = (Area of one n-bit bus) * 2x + (Area of one n-bit bus) * 2y + (Area of one 3-D router block) * [2(x + y) – 1]
This model can also be made one-dimensional to further reduce overheads (other constraints will apply).
The Packet-based WRTR FPGA Area Model
Please See Diagram
AREA = Area of Basic Model + Overheads
Overheads = (Area of one n-bit bus) * 2x + (Area of one n-bit bus) * 2y
Additional overheads may appear in the host system.
This model can also be made one-dimensional to further reduce overheads (other constraints will apply).
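The four area relations can be compared numerically. A minimal sketch follows; all unit costs (bus, decoder, de-multiplexer, and router areas) are hypothetical placeholders, since the slides leave these quantities abstract:

```python
# Relative-area sketch of the configuration models. A*B is the SRAM cells
# per tile, x*y the tile grid; the unit costs passed in are invented.

def basic_area(A, B, x, y):
    # Serially configurable baseline: x*y tiles of A*B SRAM cells each.
    return (A * B) * (x * y)

def prtr_overhead(x, y, dec, demux, bus):
    # Row-select decoder + column de-multiplexer + one n-bit bus per column.
    return dec + demux + bus * y

def wrtr_internal_overhead(x, y, bus, router4d):
    # Busses on all four edges + one 4-D router per tile.
    return bus * 2 * x + bus * 2 * y + router4d * (x * y)

def wrtr_perimeter_overhead(x, y, bus, router3d):
    # Busses on all four edges + 3-D routers on the perimeter only.
    return bus * 2 * x + bus * 2 * y + router3d * (2 * (x + y) - 1)

def wrtr_packet_overhead(x, y, bus):
    # Busses on all four edges; no routers in the fabric at all.
    return bus * 2 * x + bus * 2 * y

# Internal-routing overhead grows with x*y; perimeter routing only with x+y:
print(wrtr_internal_overhead(16, 16, bus=1, router4d=4))   # 1088
print(wrtr_perimeter_overhead(16, 16, bus=1, router3d=3))  # 253
```

The printout illustrates the scaling argument made in the slides: as the grid grows, the per-tile router term dominates the internal-routing model, while the perimeter and packet models grow only linearly in x + y.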
Parameters Defined and Their Impact
The number of busses (x and y): the number of busses varies with reconfiguration granularity. For fixed logic capacity, A and B increase with decreasing x and y, i.e. coarser granularity.
Impact of coarser granularity: reduced overhead, but reduced reconfiguration flexibility and increased reconfiguration time per block. Thus it is better to have finer granularity.
The width of the busses (n bits): the smaller the width, the smaller the overhead (for a fixed number of busses), but the longer the reconfiguration times.
Parameters Defined and Their Impact
It is possible to achieve finer granularity without increasing overhead, at the cost of bus width (and hence reconfiguration time per block).
Impact of coarse-grained vs. fine-grained configurability on methods of handling hazards in the underlying FPGA fabric: coarse-grained configuration places minimum constraints on the FPGA architecture, while fine-grained reconfiguration is subject to all issues associated with partial RTR systems.
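The granularity/bus-width trade-off can be made concrete with a back-of-envelope sketch; the numbers below are illustrative assumptions, not figures from the slides:

```python
def cycles_per_block(A, B, n):
    # Cycles to stream one block's A*B configuration bits over an n-bit bus.
    return -(-(A * B) // n)  # ceiling division

# For fixed total logic capacity, halving the number of busses (coarser
# granularity) doubles the bits per block and the time to configure it:
print(cycles_per_block(A=4, B=4, n=8))   # fine granularity: 2 cycles
print(cycles_per_block(A=8, B=4, n=8))   # coarser granularity: 4 cycles
print(cycles_per_block(A=8, B=4, n=16))  # a wider bus recovers the time: 2
```

This mirrors the point above: bus width n can buy back the per-block reconfiguration time that coarser granularity costs, at the price of bus-area overhead.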
Approaches to Router Design
Active routing mechanism: similar to conventional networks; routing of streams depends on the stream-specified destination as well as network metrics (e.g. congestion, deadlock). Hazards and conflicts may be dealt with at run time. Significantly complicated routing logic is required; most likely restricted to very coarse-grained systems.
Passive routing mechanism: routing of streams depends only on the stream-specified direction; hazards and conflicts are avoided by compile-time optimization.
We have selected the passive routing mechanism for our WRTR models.
Passive Router Details
Must be able to handle streams from four different directions. Streams from different directions are stalled only if there is a conflict in the outgoing direction.
Includes mechanisms for stalling streams in case of conflict: detecting and applying back-pressure.
Routing circuitry for one port is defined (see Diagram); for a 4-D router this design is replicated four times, and for a 3-D router three times.
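The stall-on-conflict behaviour can be sketched in software. This is an illustrative model of a single arbitration cycle, with an assumed fixed-priority order; it is not the router circuit itself:

```python
# Passive routing sketch: each worm header carries a compile-time output
# direction. If two worms request the same output port, one is granted and
# the others are stalled (back-pressure). Priority order is an assumption.

def route_cycle(requests):
    """requests: {input_port: requested_output_direction} for one cycle.
    Returns (granted, stalled)."""
    granted, stalled = {}, []
    for in_port in sorted(requests):      # fixed-priority arbitration
        out_dir = requests[in_port]
        if out_dir in granted.values():
            stalled.append(in_port)       # output conflict: stall this worm
        else:
            granted[in_port] = out_dir    # forward this worm's flit
    return granted, stalled

# Two worms (from N and S) both want the East port; only one proceeds:
print(route_cycle({'N': 'E', 'S': 'E', 'W': 'S'}))
# → ({'N': 'E', 'W': 'S'}, ['S'])
```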
Utilizing Variable Length Configuration Mechanisms
Support Hardware and Logic Block Issues
Configuration Overhead
Configuration data is huge for large FPGAs; it has to be reduced in order to achieve fast context switching.
Configuration Overhead Minimization
Initial pointers in this direction:
Variable-length configurations
Default configuration on power-up
Variable-Length Configurations
Ideally:
Change only the minimum number of bits for a new configuration.
Use short configurations for frequently used configurations.
Start from a default configuration and change the minimum number of bits to reconfigure to a new state.
Hurdles:
Logic blocks always require full configurations to be specified, so configuration sizes cannot be varied.
Knowing only what to change requires keeping track of what was configured before: a difficult issue when multiple dynamic applications are switching.
A default power-up configuration can hardly be useful for all application cases.
Configuration Overhead Minimization
How to do it? Remove redundancy in the configuration data, i.e. "compact" the contents of the configuration stream.
Result: this minimizes the information that must be conveyed during configuration or reconfiguration.
Configuration compression: applying some form of compression to the configuration data stream.
Configuration Decompression Approaches
Centralized approach: decompress the configuration stream at the boundary of the FPGA.
Distributed approach (a new paradigm): decompress the stream at the boundary of the logic blocks or logic clusters.
Centralized Approach
Advantages:
Requires hardware only at the boundary of the device where the configuration data enters.
Significant reduction in configuration size can be achieved; run-length coding and Lempel-Ziv based compression have been used.
Examples:
Atmel 6000 series
Zhiyuan Li, Scott Hauck, "Configuration Compression for Virtex FPGAs", IEEE Symposium on FPGAs for Custom Computing Machines, 2001.
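As a concrete, if simplified, illustration of the centralized style, run-length coding over the raw bitstream can be sketched as below; the cited Atmel and Virtex schemes are more elaborate than this:

```python
def rle_encode(bits):
    # Collapse runs of identical configuration bits into (bit, count) pairs.
    runs = []
    for b in bits:
        if runs and runs[-1][0] == b:
            runs[-1][1] += 1
        else:
            runs.append([b, 1])
    return [tuple(r) for r in runs]

def rle_decode(runs):
    # Expand (bit, count) pairs back into the original bitstream.
    return ''.join(b * n for b, n in runs)

stream = '0' * 12 + '1' * 3 + '0' * 9        # sparse configuration data
packed = rle_encode(stream)
print(packed)                                 # [('0', 12), ('1', 3), ('0', 9)]
assert rle_decode(packed) == stream
```

Run-length coding works well here precisely because configuration bitstreams of sparse designs contain long runs of identical (mostly default) bits.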
Centralized Approach
Limitations:
More efficient variable-length coding is not easy to use because of the large number of symbol possibilities.
It is difficult to quantify symbols in the configuration stream of heterogeneous devices, which can have different types of blocks.
Decentralized Approach
Advantages:
Decompressing at the logic-block boundary enables configurations to be easily symbolized, and hence variable-length coding (VLC) to be used.
In other words, we know exactly what we are coding, so Huffman-like codes can be used based on the frequency of configuration occurrences.
Also has advantages specific to Wormhole RTR (discussed next).
Decentralized Approach
Limitations:
The decompression hardware has to be replicated.
Optimality issue: over how much programmable logic area should the decompression hardware be amortized? In other words, the granularity of the logic area should be determined for an optimal cost/benefit ratio.
Suitability of the Decentralized Approach to WRTR
If worms are decompressed at the device boundary, greater internal worm lengths result, leading to more arbitration issues and worm blockages.
The decentralized approach thus favors shorter worm lengths, letting parallel worms traverse the fabric with fewer blockages.
Variable-Length Configuration
Overall idea:
Frequently used configurations of a logic block should have small codes.
Variable-length coding such as Huffman coding can be adapted.
Configuration Frequency Analysis
How to decide upon the frequency?
Hardwired: by the designer, through benchmark analysis?
Generic: done by the software generating the configuration stream?
Hardwired determination will be inferior: no large benefit is gained, due to variations in applications.
Software that generates the configuration can optimally identify a given number of frequently used configurations according to the set of applications to be executed.
Code determination should therefore be done by the software generating the configurations, for optimal codes.
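The code-determination step assigned to software is essentially classical Huffman construction over configuration-occurrence counts. A minimal sketch; the configuration names and frequencies are invented for illustration:

```python
import heapq

def huffman_codes(freqs):
    """Build a prefix code from {configuration_symbol: occurrence_count}."""
    heap = [[count, i, {sym: ''}] for i, (sym, count) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)                       # tie-breaker so dicts never compare
    while len(heap) > 1:
        w0, _, c0 = heapq.heappop(heap)   # two least-frequent subtrees
        w1, _, c1 = heapq.heappop(heap)
        merged = {s: '0' + c for s, c in c0.items()}
        merged.update({s: '1' + c for s, c in c1.items()})
        heapq.heappush(heap, [w0 + w1, tie, merged])
        tie += 1
    return heap[0][2]

# Hypothetical logic-block configurations and their measured frequencies:
codes = huffman_codes({'ADD': 8, 'SUB': 4, 'MUL': 2, 'RND': 1})
print(codes)   # the most frequent configuration gets the shortest code
```

With these counts the most frequent configuration receives a 1-bit code and the rarest a 3-bit code, which is exactly the "short codes for frequent configurations" idea above.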
Decoding Hardware Approaches
Huffman-coding the configurations:
A hardwired Huffman decoder
An adaptive decoder (the code table can be changed)
Using a table of frequently used configurations and an address decoder
Huffman-coding the table addresses:
Static coding
Adaptive coding
Decoding Hardware Features
Static Huffman decoders: lower compression.
Coding the configurations directly: requires a very wide decoder.
Using an address decoder only: reduced hardware but less compression (fixed-size codes).
Coding the decoder inputs: requires a relatively smaller Huffman decoder.
Some Points to Note
The decompression approach is decoupled from any specific logic-block architecture, though certain logic blocks will favor more compression (discussed later).
Not every possible configuration will be coded. In particular, random-logic portions will require all bits to be transmitted; a special code will prefix a random-logic configuration to identify it for separate handling.
Logic Block Selection
High-level issues:
Should be datapath-oriented.
Efficient support for random-logic implementation.
High functionality with minimum configuration bits, to support dense implementation with reduced reconfiguration overhead.
Well-defined datapath functionality (configuration), to aid in quantifying the frequently-used-configuration idea.
Chosen High-Functionality Logic Block
A Low-Functionality Logic Block
Logic Block Considerations
High-functionality blocks are good for datapath implementations; low-functionality blocks give less dense datapath implementations.
Low-functionality blocks have fewer configuration bits per block, and vice versa.
The memory cost of storing frequently used configurations depends on the size of the configurations stored, so the decoder hardware overhead will be lower for low-functionality blocks.
Logic Block Considerations
What about the configuration-time overhead? Less dense functionality means more blocks to configure, which leads to longer configuration streams.
Logic Block Considerations
Assumption: random logic does not benefit from one block type or the other.
Consequence: with high-functionality blocks, datapath-oriented designs require fewer blocks to configure and achieve even higher compression, but random-logic implementations incur larger overhead than with low-functionality blocks.
Logic Block Considerations
Conclusions on the logic-block issue: proper logic-block selection for a particular application affects both decoder hardware size and configuration compression ratio.
Since random logic is not compressed, using high-functionality blocks for less datapath-oriented applications will result in high, underutilized decoder overhead, plus less compression and longer configuration streams.
Huffman Decoder Hardware
Basic Huffman decoding hardware is sequential in nature: variable input rate, variable output rate.
Huffman Hardware
Sequential decoding is not suitable for WRTR: the worm would have to be stalled, negating the benefit of fast reconfiguration.
The hardware should be able to process N bits at a time, where N = bus width.
This requires a constant-input-rate architecture with a variable number of codes processed per cycle.
Constant-Input-Rate PLA-based Architecture
Input rate: K bits/cycle; the PLA performs the table-lookup process.
• Input bits: determine one unique path along the Huffman tree.
• Next state: fed back to the PLA input to indicate the final residing state.
• Indicator: the number of symbols decoded in the cycle.
The constant-input-rate PLA-based architecture for the VLC decoder.
PLA Based Architecture
Ref: “VLSI Designs for High Speed Huffman Decoders”, Shih-Fu Chang and David G. Messerschmitt.
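The table-lookup idea behind the PLA architecture can be prototyped in software: precompute, for every (Huffman-tree state, K-bit input chunk), the next state and the symbols decoded, then consume K bits per simulated cycle. The code table below is invented for illustration, the code is assumed to be a complete prefix code, and the stream length is assumed to be a multiple of K:

```python
def build_table(codes, K):
    # codes: {symbol: prefix-free bit string}; assumed complete, so every
    # leftover after greedy decoding is a proper prefix of some code (a state).
    decode = {c: s for s, c in codes.items()}
    states = {''} | {c[:i] for c in codes.values() for i in range(1, len(c))}
    table = {}
    for state in states:
        for v in range(2 ** K):
            chunk = format(v, '0{}b'.format(K))
            buf, out = state + chunk, []
            while True:                   # greedily strip complete codes
                hit = next((c for c in decode if buf.startswith(c)), None)
                if hit is None:
                    break
                out.append(decode[hit])
                buf = buf[len(hit):]
            table[(state, chunk)] = (buf, out)   # next state + decoded symbols
    return table

def decode_stream(bits, table, K):
    state, symbols = '', []
    for i in range(0, len(bits), K):      # constant input rate: K bits/cycle
        state, out = table[(state, bits[i:i + K])]
        symbols += out                    # variable number of symbols per cycle
    return symbols

codes = {'ADD': '1', 'SUB': '01', 'MUL': '000', 'RND': '001'}
table = build_table(codes, K=3)
print(decode_stream('101000', table, K=3))   # ['ADD', 'SUB', 'MUL']
```

Each table lookup here plays the role of one PLA cycle: the leftover bits are the fed-back "next state", and the length of the output list is the "indicator" of how many symbols were decoded that cycle.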
Decoder Model: FSM Implementation
The FSM can be implemented as a ROM or a PLA; the PLA offers lower complexity and high speed.
Implementation Results
The area is a function of the number of inputs and outputs, along with the input rate.
Hardware Area Estimates
The number of inputs and outputs depends upon the maximum code length and the symbol sizes.
Typically, for a 16-entry table: code sizes range from 1 to 8 bits, and the symbol size equals the decoder input width, i.e. 4 bits.
Handling Multiple Symbol Outputs
More than one code may be decoded per cycle, with a maximum of N codes per cycle.
To take full advantage of parallel decoding, multiple configuration chains can be employed.
A counter can be used to cycle between the chains, outputting one configuration per chain, with a maximum of N chains.
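The chain-cycling counter amounts to round-robin distribution of the decoded configurations; a minimal sketch of the assumed behaviour:

```python
def distribute(symbols, N):
    # A counter cycles through N parallel configuration chains, pushing one
    # decoded configuration into each chain in turn.
    chains = [[] for _ in range(N)]
    for i, sym in enumerate(symbols):
        chains[i % N].append(sym)
    return chains

# Hypothetical decoded configurations spread across two chains:
print(distribute(['c0', 'c1', 'c2', 'c3', 'c4'], N=2))
# → [['c0', 'c2', 'c4'], ['c1', 'c3']]
```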
Decoder Hardware
For parallel decoding of N codes with M-bit decoded symbols:
Huffman decoder (discussed before)
N M-bit decoders
An N-port ROM table
N parallel configuration chains
Concluding Points
The hardware overhead of the Huffman decoding mechanism discussed has to be incorporated into the area model presented earlier.
Empirical determination of reconfiguration speed-up versus area overhead remains to be done for a sampling of benchmarks.