Comparison of Single-Event Effect Mitigation Methods using
Design Impact and Application Performance Metrics
Ian Troxel
SEAKR Engineering, Inc.
Centennial, CO
Military and Aerospace Programmable Logic Devices (MAPLD) Conference
NASA Goddard Space Flight Center in Greenbelt, MD
August 31 - September 3, 2009
2/20Troxel MAPLD 2009 Comparison of Single-Event Effect Mitigation Methods Using…
Motivation
o Flexible, multiuse payloads sought to limit NRE in space payload processors
o High-performance, SRAM-based FPGAs frequently required to meet mission requirements but require radiation mitigation to achieve fault tolerance goals
o Mitigation methods are application dependent
• SWAP constraints
• Processing performance
• Reliability requirements
• Design schedule
• Type of data and peripherals
• Latency constraints
o Optimum designs may use several methods
o SBIR Phase 1 topic compared mitigation methods for a particular application within an AFRL mission
[Figure: Processing Performance per unit of SWAP plotted against Level of Effort and against Reliability]
Background
o SEAKR’s Application Independent Processor (AIP) formed baseline system along with other component options for onboard processor
o AIP Processing Features
• Mixture of processor and I/O capability
– Reconfigurable Computer Board(s)
– Xilinx® Virtex®-4 FPGAs
– COTS PowerPC®-based SBC(s)
– Gigabit Ethernet and Spacewire
– Mezzanine cards for custom features
• Reconfigurable on-orbit
• Flexible, scalable architecture
• Adaptable fault tolerance
o Mitigation analysis focused on Xilinx Virtex 4 series
o Application analysis included Xilinx, Actel® and microprocessor options
[Figure: Baseline AIP Flight Unit]
AIP System Architecture
o Reconfigurable Computer Board(s)
• Xilinx Virtex-4, high-speed memory and SERDES backplane
[Figure: RCC Board Architecture. Blocks labeled: three Xilinx V4 coprocessors, three high-speed memories, three high-speed mezzanines, I/O, high-speed SERDES, a PCI-PCI bridge / configuration block with configuration memory, PCI, cPCI, and a high-speed serial network]
Radiation Effect Mitigation Techniques
| Technique | Description and Comments |
| --- | --- |
| Chip-level Redundancy | Replicate FPGA devices and all I/O. External voting or devices vote each other. Single device failures masked if triplicated. Use of internal features allowed. |
| Full Internal Logic Redundancy | Replicate all logic within a design and vote on intermediate signals. Does not require external voter but imposes limitations (no special features, area penalty). Not all SEFI modes can be addressed. |
| Continuous Read-back and Scrub | Read configuration memory, compare to “golden” standard, and correct discrepancies. Unobtrusive and application-independent method; complements other methods. Potentially allows corrupt data propagation. |
| Application-based Fault Tolerance | Augment critical data with checksums to determine if error occurred. Highly application-specific. |
| Data Replay | Buffer input data and provide a mechanism to temporally replicate computations to detect error. |
| Selective Logic Redundancy | Selectively replicate via circuit analysis to trade area for robustness. Benefits of internal logic redundancy with reduction in restrictions and area penalties. |
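As a rough illustration of the voting step that the chip-level and internal logic redundancy techniques rely on, the sketch below takes a bitwise 2-of-3 majority over three replicated words. The function names and the 32-bit example value are assumptions made for illustration and are not taken from the AIP design.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority of three replicated words."""
    return (a & b) | (a & c) | (b & c)


def vote_and_flag(a: int, b: int, c: int) -> tuple[int, bool]:
    """Return the voted word and whether the replicas disagreed."""
    voted = majority_vote(a, b, c)
    mismatch = not (a == b == c)
    return voted, mismatch


golden = 0xDEADBEEF
upset = golden ^ (1 << 7)   # hypothetical single bit flip in one replica
word, mismatch = vote_and_flag(golden, upset, golden)
print(hex(word), mismatch)  # prints: 0xdeadbeef True
```

A single upset replica is masked, but the same bit flipped in two replicas would win the vote, which is one reason voting is usually paired with a repair mechanism such as scrubbing.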
Characterization Metrics
| Metric | Description and Comments |
| --- | --- |
| Design Overhead | Increase in needed resources as compared to the original design. |
| External Voter | Does the technique require an external voter? |
| Non-standard Device Use | Does the technique allow the use of non-standard devices within the FPGA (e.g. BRAM, DSP blocks, etc.)? Restrictions enumerated. |
| Development Ease | Difficulty to incorporate into design. (5=easiest to 1=difficult) |
| Performance Impact | What performance impacts does the technique impose on the design? Quantitative values provided where possible. |
| Other System Impacts | Does the technique impact other aspects of the external system (besides requiring a voter) such as need for additional buffer space? |
| Error Coverage | What error types does the technique correct (e.g. SEU, SET, SEFI, etc.)? Error types enumerated. |
| Degree of Robustness | How many faults can the technique detect and correct? Does fault locality affect the technique’s robustness? (5=full coverage, 1=none) |
| Timeliness of Fault Correction | How fast can errors be detected by the technique? Latent errors? Timeframe within which the technique can detect and correct faults includes a collection of robustness considerations. |
Technique Evaluation
o All techniques require external resources and/or the use of additional internal resources. DR does not require additional internal resources.
o External voters are required in all external schemes and may be used in internal schemes depending on implementation.
o Internal redundancy schemes often do not allow for the use of special internal components such as DSP blocks.
| Metric | CLR | FILR | RBAS | ABFT | DR | SILR |
| --- | --- | --- | --- | --- | --- | --- |
| Overhead | N times devices and I/O | More than N times internal resources | Support devices and config. buffer | Additional device resources | Support devices | Less than N times internal resources |
| External Voter | Yes | Sometimes | Yes | Sometimes | Yes | Sometimes |
| Allow Special Components | Yes | Sometimes | Yes | Yes | Yes | Sometimes |
Legend: CLR=Chip-Level Redundancy, FILR=Full Internal Logic Redundancy, RBAS=Read-back & Scrub, ABFT=Application-based Fault Tolerance, DR=Data Replay, SILR=Selective Internal Logic Redundancy
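The configuration buffer listed as RBAS overhead holds the "golden" copy that read-back data is compared against. A minimal sketch of one scrub pass follows, assuming configuration memory can be modeled as a list of integer frames; the frame model and the `scrub_pass` name are hypothetical and do not represent any vendor configuration interface.

```python
def scrub_pass(config_frames: list[int], golden_frames: list[int]) -> list[int]:
    """Compare each read-back frame to the golden copy and repair mismatches.

    Returns the indices of the frames that were corrected.
    """
    corrected = []
    for index, (live, golden) in enumerate(zip(config_frames, golden_frames)):
        if live != golden:
            config_frames[index] = golden  # rewrite the frame from the golden copy
            corrected.append(index)
    return corrected


golden_frames = [0x11, 0x22, 0x33, 0x44]
config_frames = list(golden_frames)
config_frames[2] ^= 0x08                   # simulate an upset in frame 2
repaired = scrub_pass(config_frames, golden_frames)
print(repaired, config_frames == golden_frames)  # prints: [2] True
```

Between the upset and the next pass the corrupted frame remains active, which is why scrubbing alone can let corrupt data propagate and why the detection latency depends on the scrub rate.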
Technique Evaluation (2)
o ABFT and SILR require substantial application development and an intimate knowledge of the application and design
o Design performance is typically limited by the voter; DR reduces system throughput but has no impact on logic usage
o All designs besides DR require additional logic resources
| Metric | CLR | FILR | RBAS | ABFT | DR | SILR |
| --- | --- | --- | --- | --- | --- | --- |
| Ease of Development | 4 | 3 | 5 | 2 | 4 | 1.5 |
| Performance Impact | Limited to speed of voter | Limited to speed of internal voters | Virtually none | Application dependent | System throughput reduced | Limited to speed of internal voters |
| Other System Impacts | N times the cost and SWaP | Logic reduced by >N times | Config. buffer | Detailed analysis | Data buffer | Logic reduced by <N times, analysis |
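ABFT depends on knowing the application well enough to find a checksum that survives the computation. One textbook instance, shown here only as a sketch and not as the method used in this study, protects a matrix-vector product with a column-sum checksum row. The matrix, vector, and function names are hypothetical, and integer data is assumed so the comparison is exact.

```python
def matvec(matrix: list[list[int]], vector: list[int]) -> list[int]:
    """Plain matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def checksum_row(matrix: list[list[int]]) -> list[int]:
    """Column sums of the matrix, carried as one extra checksum row."""
    return [sum(column) for column in zip(*matrix)]


def result_is_consistent(matrix, vector, result) -> bool:
    """The sum of the result elements must equal checksum_row . vector."""
    expected = sum(c * v for c, v in zip(checksum_row(matrix), vector))
    return sum(result) == expected


A = [[1, 2], [3, 4], [5, 6]]
x = [7, 8]
y = matvec(A, x)
ok_before = result_is_consistent(A, x, y)
y[1] ^= 0x10                      # simulate an upset in one result element
ok_after = result_is_consistent(A, x, y)
print(ok_before, ok_after)        # prints: True False
```

With floating-point data the equality test would need a tolerance, which is part of the application-specific analysis the slide refers to.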
Technique Evaluation (3)
o CLR and DR are the only two approaches that ensure full coverage
o Other approaches may not catch all errors before they propagate
o FILR and SILR provide immediate error detection (if covered)
o CLR and DR require a processing interval to detect an error (external)
o The timeliness of RBAS and ABFT varies with implementation
| Metric | CLR | FILR | RBAS | ABFT | DR | SILR |
| --- | --- | --- | --- | --- | --- | --- |
| Error Coverage | Full SEU, SET and SEFI | Full SEU, partial SET and SEFI | Partial SEU and SEFI and no SET | Full SEU, partial SET and SEFI | Full SEU, SET and SEFI | Partial SEU, SET and SEFI |
| Robustness | 5 | 4.5 | 2 | 4.5 | 5 | 3 |
| Timeliness of Fault Correction | One processing interval | Immediate | Varies based on scrub rate | Varies (application specific) | N times processing interval | Immediate if covered |
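DR's detection delay comes from pushing the same buffered input through the logic more than once and comparing the results. The sketch below assumes a hypothetical `process` callable stands in for the FPGA computation and injects a transient fault on the first pass only; all names are illustrative.

```python
from typing import Callable, Sequence


def replay_and_compare(process: Callable[[Sequence[int]], int],
                       buffered_input: Sequence[int],
                       passes: int = 2) -> tuple[int, bool]:
    """Run `process` on the same buffered input `passes` times.

    Returns the first result and a flag that is True when any pass disagrees.
    """
    results = [process(buffered_input) for _ in range(passes)]
    error_detected = any(r != results[0] for r in results[1:])
    return results[0], error_detected


def make_faulty_sum(fail_on_call: int) -> Callable[[Sequence[int]], int]:
    """Hypothetical computation that returns a corrupted sum on one call."""
    calls = {"count": 0}

    def faulty_sum(data: Sequence[int]) -> int:
        calls["count"] += 1
        total = sum(data)
        return total ^ 1 if calls["count"] == fail_on_call else total

    return faulty_sum


frame = [3, 1, 4, 1, 5]
clean = replay_and_compare(sum, frame)
upset = replay_and_compare(make_faulty_sum(fail_on_call=1), frame)
print(clean, upset)  # prints: (14, False) (15, True)
```

Two passes only detect a disagreement; a third pass would be needed to vote on a corrected value. Either way, throughput is divided by the number of passes, and no extra logic is replicated.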
Comparison Summary
| Technique | Pro | Con |
| --- | --- | --- |
| CLR | All errors detectable; straightforward to implement | N times devices and external voter; error detect delay of one interval |
| FILR | Immediate fault detection | >N times logic and internal voters |
| RBAS | No performance impact; straightforward to implement | Support devices and config. buffer; high potential for error propagation |
| ABFT | Application dependent | Application dependent |
| DR | No additional logic resources; special structures allowed; all errors detectable | Halves throughput; external voter required but no impact; error detect delay of one interval |
| SILR | Immediate fault detection (for covered faults); less logic overhead than FILR | Special structures not allowed; partial fault coverage; detailed design analysis required |
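The two numeric scores from the evaluation tables (Ease of Development and Robustness, both on a 1 to 5 scale) can be combined into a simple weighted ranking as a first-pass screen. The weights below are hypothetical and the sketch ignores the qualitative columns (overhead, voter, special components), so it is an illustration of the arithmetic rather than a selection method from the study.

```python
# Scores copied from the Technique Evaluation tables.
EASE = {"CLR": 4, "FILR": 3, "RBAS": 5, "ABFT": 2, "DR": 4, "SILR": 1.5}
ROBUSTNESS = {"CLR": 5, "FILR": 4.5, "RBAS": 2, "ABFT": 4.5, "DR": 5, "SILR": 3}


def rank(ease_weight: float, robustness_weight: float) -> list[tuple[str, float]]:
    """Sort techniques by a weighted sum of the two scores, best first."""
    scores = {
        name: ease_weight * EASE[name] + robustness_weight * ROBUSTNESS[name]
        for name in EASE
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical weighting that favors robustness over development ease.
for name, score in rank(ease_weight=0.3, robustness_weight=0.7):
    print(f"{name}: {score:.2f}")
```

With this weighting CLR and DR tie at the top, which matches their shared full-coverage rating; a different mission weighting would reorder the list.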
Recommendations
| Technique | Recommendations |
| --- | --- |
| CLR | Appropriate when board area and device cost are less of a concern than performance and robustness |
| FILR | Use instead of CLR when board space is at a premium and a given design can fit within 1/N the number of logic cells on the FPGA |
| RBAS | Appropriate for all designs, combined with other techniques |
| ABFT | Use when the application is well understood and other options are not feasible |
| DR | Appropriate when board area and part count are at a premium and full error coverage is required; however, not when latency/throughput cannot be sacrificed |
| SILR | Use this technique over full internal logic redundancy when board space is at a premium and a given design cannot fit within 1/N the number of logic cells on the FPGA. Do not use if the design is not well understood or if hand placement is not an option. |
o “Best technique” is application- and mission-dependent and must be further investigated for each application
Application Analysis Setup
o Two system types examined for the space application
• Two types of architectures used for comparison
• FPGAs and microprocessors included in the analysis
[Figure: two candidate architectures, Internal Processor Resources and External Processor Resources. Labels in each: Sensors, FPAs, Processor, Superframe, Data Product]
Processor Options
| Processor Type | Distinguishing Features |
| --- | --- |
| Virtex-4 LX200 FPGA; Virtex-5 FX130 FPGA | Fine-grained processing; high performance; SEU tolerant (with mitigation) |
| RTAX-2000 FPGA; RTAX-4000 FPGA | Fine-grained processing; limited performance; SEU immune |
| 603e (350nm) 1-core; 750FX (130nm) 1-core; 7448 (90nm) 1-core; 8641 (90nm) 2-core | Coarse-grained processing; commercial performance; SEU tolerant (mitigation) |
| LEON 3FT™ (250nm) 1-core; Rad750® (150nm) 1-core; MAESTRO (90nm) 49-core | RHBD or process; improved SEU tolerance |
Analysis Assumptions
o Preliminary candidate systems constructed to meet algorithm and system requirements
• Focused on meeting memory and I/O performance requirements
• Two versions of application studied with three data rates
• Fault tolerance, radiation susceptibility, and cost examined as well
o Key assumptions
• Both algorithm versions focus on front-end sensor processing
• Nominal processor and FPGA performance capabilities have been de-rated based on typical performance achieved
• Only the highest speed processor interface is considered
• FPGAs can support at most 2 DDR interfaces along with a local bus and any required sensor connections
• Xilinx DRAM interfaces are four times faster than Actel versions
FPGA Processor Analysis (1)
o Internal option is severely memory limited
o Using only internal resources is impractical due to limited FPGA resources
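Internal architecture

| Version | Limit | Rate | Virtex-4 LX200 | Virtex-5 FX130 | RTAX2000 | RTAX4000 | DRAM |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Basic | Memory | Rate 1 | 74 | 42 | 1518 | 810 | 0 |
| Basic | Memory | Rate 2 | 148 | 84 | 3035 | 1619 | 0 |
| Basic | Memory | Rate 3 | 592 | 334 | 12137 | 6473 | 0 |
| Basic | I/O | Rate 1 | 1 | 1 | 3 | 2 | 0 |
| Basic | I/O | Rate 2 | 2 | 2 | 5 | 4 | 0 |
| Basic | I/O | Rate 3 | 5 | 5 | 17 | 13 | 0 |
| Advanced | Memory | Rate 1 | 185 | 105 | 3793 | 2023 | 0 |
| Advanced | Memory | Rate 2 | 370 | 209 | 7586 | 4046 | 0 |
| Advanced | Memory | Rate 3 | 1480 | 835 | 30341 | 16182 | 0 |
| Advanced | I/O | Rate 1 | 1 | 1 | 3 | 2 | 0 |
| Advanced | I/O | Rate 2 | 2 | 2 | 5 | 4 | 0 |
| Advanced | I/O | Rate 3 | 5 | 5 | 17 | 13 | 0 |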
FPGA Processor Analysis (2)
o Design feasible using external memory resources but is I/O limited
o Designs must be implemented to achieve actual part count
o Radiation mitigation for Virtex would triple the required number of devices
o Actel I/O performance limits the application when scaling data rate
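External architecture

| Version | Limit | Rate | Virtex-4 LX200 | Virtex-5 FX130 | RTAX2000 | RTAX4000 | Xilinx DRAM | Actel DRAM |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Basic | Memory | Rate 1 | 1 | 1 | 2 | 2 | 2 | 2 |
| Basic | Memory | Rate 2 | 1 | 1 | 2 | 2 | 2 | 2 |
| Basic | Memory | Rate 3 | 2 | 1 | 3 | 3 | 4 | 4 |
| Basic | I/O | Rate 1 | 1 | 1 | 4 | 3 | 2 | 6 |
| Basic | I/O | Rate 2 | 2 | 1 | 12 | 8 | 4 | 16 |
| Basic | I/O | Rate 3 | 6 | 6 | 28 | 22 | 12 | 44 |
| Advanced | Memory | Rate 1 | 1 | 1 | 2 | 2 | 2 | 2 |
| Advanced | Memory | Rate 2 | 1 | 1 | 2 | 2 | 2 | 2 |
| Advanced | Memory | Rate 3 | 2 | 2 | 5 | 4 | 8 | 8 |
| Advanced | I/O | Rate 1 | 1 | 1 | 7 | 6 | 2 | 12 |
| Advanced | I/O | Rate 2 | 2 | 2 | 23 | 17 | 4 | 34 |
| Advanced | I/O | Rate 3 | 7 | 7 | 34 | 26 | 14 | 52 |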
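The statement that mitigation triples the Virtex count can be applied directly to the Basic Version I/O rows of the external-architecture table. The sketch below assumes the table entries are device counts, that triplication (N = 3) is applied to the Virtex parts only, and that the RTAX parts need no replication because they are listed as SEU immune. The dictionary and function names are hypothetical.

```python
# Basic Version, I/O-limited device counts for Rates 1-3 (external architecture).
BASIC_IO_DEVICES = {
    "Virtex-4 LX200": [1, 2, 6],
    "Virtex-5 FX130": [1, 1, 6],
    "RTAX2000": [4, 12, 28],
    "RTAX4000": [3, 8, 22],
}
# Assumption: only the SRAM-based Virtex parts are replicated.
NEEDS_TRIPLICATION = {"Virtex-4 LX200", "Virtex-5 FX130"}


def mitigated_counts(redundancy: int = 3) -> dict[str, list[int]]:
    """Scale the unmitigated counts by the redundancy factor where required."""
    return {
        device: [n * redundancy if device in NEEDS_TRIPLICATION else n
                 for n in counts]
        for device, counts in BASIC_IO_DEVICES.items()
    }


for device, counts in mitigated_counts().items():
    print(f"{device:15s} rates 1-3: {counts}")
```

Under these assumptions the Virtex-4 needs 3, 6 and 18 devices against 3, 8 and 22 for the RTAX4000, so the two families stay close at the lower rates and separate at Rate 3.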
Microprocessor Analysis
o Microprocessors assumed to have DRAM attached (i.e. external architecture) and half of theoretical bandwidth assumed for I/O
o Processors are I/O limited even with RapidIO assumed for those capable
o Advanced algorithms would require substantially more memory capacity
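| Version | Limit | Rate | 603e | 750FX | 7448 | 8641D | LEON 3FT | RAD750 | MAESTRO |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Basic | Memory | Rate 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Basic | Memory | Rate 2 | 2 | 1 | 1 | 1 | 1 | 3 | 1 |
| Basic | Memory | Rate 3 | 7 | 3 | 2 | 2 | 3 | 9 | 1 |
| Basic | I/O | Rate 1 | 3222 | 31 | 4 | 2 | 18 | 467 | 1 |
| Basic | I/O | Rate 2 | 6443 | 62 | 7 | 4 | 35 | 758 | 2 |
| Basic | I/O | Rate 3 | 25770 | 245 | 26 | 13 | 139 | 3890 | 7 |
| Advanced | Memory | Rate 1 | 2 | 1 | 1 | 1 | 1 | 3 | 1 |
| Advanced | Memory | Rate 2 | 4 | 1 | 1 | 1 | 1 | 6 | 1 |
| Advanced | Memory | Rate 3 | 16 | 4 | 4 | 2 | 4 | 21 | 2 |
| Advanced | I/O | Rate 1 | 3234 | 31 | 4 | 2 | 18 | 471 | 1 |
| Advanced | I/O | Rate 2 | 6468 | 62 | 7 | 4 | 35 | 764 | 2 |
| Advanced | I/O | Rate 3 | 25871 | 245 | 26 | 13 | 140 | 3925 | 7 |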
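The "I/O limited" observation can be checked against the Basic Version rows of the microprocessor table by taking, for each processor, the larger of the memory-limited and I/O-limited counts. Treating the required part count as that maximum is an assumption of this sketch, and the names are hypothetical.

```python
PROCESSORS = ["603e", "750FX", "7448", "8641D", "LEON 3FT", "RAD750", "MAESTRO"]

# Basic Version counts keyed by data rate, in the column order above.
BASIC_MEMORY = {
    1: [1, 1, 1, 1, 1, 1, 1],
    2: [2, 1, 1, 1, 1, 3, 1],
    3: [7, 3, 2, 2, 3, 9, 1],
}
BASIC_IO = {
    1: [3222, 31, 4, 2, 18, 467, 1],
    2: [6443, 62, 7, 4, 35, 758, 2],
    3: [25770, 245, 26, 13, 139, 3890, 7],
}


def required_parts(rate: int) -> dict[str, tuple[int, str]]:
    """Part count per processor and which limit sets it (ties count as I/O)."""
    result = {}
    for name, mem, io in zip(PROCESSORS, BASIC_MEMORY[rate], BASIC_IO[rate]):
        limit = "I/O" if io >= mem else "memory"
        result[name] = (max(mem, io), limit)
    return result


for rate in (1, 2, 3):
    print(f"Rate {rate}: {required_parts(rate)}")
```

For every processor and rate in these rows the I/O count is at least the memory count, so I/O sets the part count throughout.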
Analysis Summary
o Data Rate 1:
• Higher-end processors with high-speed serial I/O are viable options
• Xilinx and Actel roughly tie in memory and I/O performance when radiation mitigation added
o Data Rate 2:
• Microprocessors become less attractive than FPGAs
• Xilinx and Actel devices are still roughly equivalent
o Data Rate 3:
• Xilinx devices appear more attractive than Actel due to improved memory bandwidth (i.e. fewer DRAMs required)
o Note:
• Algorithm implementation and sizing/timing analysis on logic designs required to complete the analysis (need to include processing capability)
• Processors would likely become a factor when incorporating additional algorithms that are non-deterministic or require coarse-grained processing further down the application chain
Conclusions
o Radiation effect mitigation techniques examined
• Characteristics of several techniques compared
• Choice is application-specific and combination of methods often included
o Application performance requirements examined
• FPGA and microprocessor memory bandwidth and I/O capabilities examined due to the application being I/O intensive
• For this application, microprocessors become less attractive than FPGAs as data rates are increased
• Xilinx devices more attractive due to improved memory bandwidth
o Future Work
• Implement algorithms to include processing capability in the analysis
• Include additional application processing steps
Contact Information
Dr. Ian Troxel
Future Systems Architect
• 303-784-7673 [email protected]
SEAKR Engineering, Inc.
6221 South Racine Circle
Centennial, CO 80111-6427
main: 303 790 8499
fax: 303 790 8720
web: http://www.SEAKR.com