8/3/2019 Cieslewski UF FTWorkshop Final2
Advanced Space Computing with System-Level Fault Tolerance
Grzegorz Cieslewski, Adam Jacobs,
Chris Conger, Alan D. George
ECE Dept., University of Florida
NSF CHREC Center
-
Outline
Overview
NASA Dependable Multiprocessor
Reconfigurable Fault Tolerance (RFT)
Space Applications
Novel Computing Platforms
RapidIO
Conclusions
-
Overview
What is advanced space computing?
  New concepts, methods, and technologies to enable and deploy high-performance computing in space for an increasing variety of missions and applications
Why is advanced space computing vital?
  On-board data processing
    Downlink bandwidth to Earth is extremely limited
    Sensor data rates, resolutions, and modes are dramatically increasing
    Remote data processing from Earth is no longer viable
    Must process sensor data where it is captured, then downlink results
  On-board autonomous processing & control
    Remote control from Earth is often not viable
    Propagation delays and bandwidth limits are insurmountable
    Space vehicles and space-delivered vehicles require autonomy
    Autonomy requires high-speed computing for decision-making
Why is it difficult to achieve?
  Cannot simply strap a Cray to a rocket!
  Hazardous radiation environment in space
  Platforms with limited power, weight, size, cooling, etc.
  Traditional space processing technologies (RadHard) are severely limited
  Potential for long mission times with diverse set of needs
  Need powerful yet adaptive technologies
  Must ensure high levels of reliability and availability
-
Taxonomy of Fault Tolerance
First, let us define various possible modes/methods of providing fault tolerance (FT)
  Many other options beyond simply throwing triple-modular redundancy (TMR) at the problem
  Software FT vs. hardware FT concepts largely similar, differences only at implementation level
  Radiation-hardening not listed, falls under prevention as opposed to detection or correction
[Figure: FT techniques arranged by capability (Detect; Correct or Mask)]
  FT-HLL: Fault-Tolerant HLL (e.g. MPI)
  CED: Concurrent Error Detection
  SCP: Self-Checking Pairs
  ABFT: Algorithm-Based Fault-Tolerance
  ECC: Error Correction Codes
  NVP: N-Version Programming
  BR: Byzantine Resilience
  CR: Checkpointing & Roll-back
  SIFT: Software-Implemented Fault Tolerance
  NMR: N-Modular Redundancy
Temporal and spatial variants possible for many techniques
Most of these FT modes are currently being used at UF
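To make the detect vs. correct/mask distinction concrete, here is a minimal Python sketch of two taxonomy entries: N-modular redundancy (a majority vote that masks a faulty replica) and a self-checking pair (a comparison that only detects a fault). The names `nmr_vote` and `self_checking_pair` are invented for this illustration and do not come from the slides; calling the same function twice on one processor corresponds to the temporal variant.

```python
from collections import Counter


def nmr_vote(results):
    """Return the strict-majority value among replica results.

    Results must be hashable. Raises ValueError when no value holds a
    strict majority, i.e. the fault is detected but cannot be masked.
    """
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise ValueError("no majority among replicas")
    return value


def self_checking_pair(compute, x):
    """Run the same computation twice and compare (detection only)."""
    first, second = compute(x), compute(x)
    if first != second:
        raise RuntimeError("self-checking pair mismatch")
    return first
```

With three replicas (TMR), `nmr_vote([7, 7, 9])` masks the single faulty result, while three mutually different results can only be reported as an error.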
-
NASA/Honeywell/UF Project
1st Space Supercomputer
  Funded by NASA NMP
  In-situ sensor processing
  Autonomous control
  Speedups of 100 to 1000
First fault-tolerant, parallel, reconfigurable computer for space
Infrastructure for fault-tolerant, high-speed computing in space
  Robust system services
  Fault-tolerant MPI services
  Application services
  FPGA services
  Standard design framework
  Transparent API to resources for earth & space scientists
[Figure: NASA Dependable Multiprocessor (DM) block diagram, labeled "Reconfigurable Cluster Computer". System Controller A (RHPPC), System Controller B, and Data Processors #1 ... #N (PPC, FPGA) are connected by High-Speed Network A and High-Speed Network B; spacecraft interfaces (Spacecraft I/F) and a mission-specific spacecraft interface connect to mission-specific devices and instruments.]
-
Dependable Multiprocessor
DM System Architecture
  Dual system controllers
    Redundant radiation-hardened PPC boards
    Monitor data processors' health and communicate with spacecraft
  Data processing engines
    High-performance, low-power COTS SBCs running Linux
    PowerPC with AltiVec capabilities
    Optional FPGA co-processor for additional performance
    Scalable to 20 data processing nodes
  Redundant interconnect
    Dual GigE connections
    Automatically switch networks when error is detected
DM Middleware (DMM)
  FT System Services
    Manages status and health of multiple concurrent jobs
  FT Embedded MPI (FEMPI)
    Lightweight subset of MPI
    Allows fault recovery without restarting an entire parallel application
  Application & FPGA Services
    Commonly used libraries such as ATLAS, FFTW, GSL
    Simplified, generic API for FPGA usage through USURP*
  High-Availability Middleware
    Framework used to enable health monitoring of cluster
* USURP is a standardized interface specification for RC platforms, developed by researchers at UF
-
DMM Components
Mission Manager (MM)
  Controls high-level job deployment
  Facilitates replication of lower-level jobs
    Spatial or temporal replication
  Automatically compares and validates outputs
  Monitors real-time deadlines
  Enables roll-forward / roll-back when faults occur
Job Manager (JM)
  Controls low-level job deployment and scheduling across system
FT Manager (FTM)
  Manages low-level system faults (node crash, job crash)
JM Agent (JMA)
  Deploys and monitors programs on given node
  Provides application heartbeat to system controller
Mass Data Store (MDS)
  Provides reliable centralized data services
  Enables reliable checkpointing
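The roll-back behavior and checkpointing mentioned above can be pictured with a small Python sketch. It is only an illustration under stated assumptions: `CheckpointStore` is an in-memory stand-in invented here (it is not the MDS interface), and a detected fault is modeled as a `RuntimeError` raised by the work function.

```python
import copy


class CheckpointStore:
    """In-memory stand-in for a reliable store (hypothetical, not the MDS API)."""

    def __init__(self):
        self._saved = None

    def save(self, state):
        self._saved = copy.deepcopy(state)

    def load(self):
        return copy.deepcopy(self._saved)


def run_with_rollback(items, step, store, interval=2, max_retries=3):
    """Fold `step` over `items`, checkpointing every `interval` items.

    If `step` raises RuntimeError (a detected fault), roll back to the
    last checkpoint and redo the work done since then.
    """
    state = {"index": 0, "acc": 0}
    store.save(state)  # initial checkpoint
    retries = 0
    while state["index"] < len(items):
        try:
            state["acc"] = step(state["acc"], items[state["index"]])
            state["index"] += 1
            if state["index"] % interval == 0:
                store.save(state)
        except RuntimeError:
            retries += 1
            if retries > max_retries:
                raise
            state = store.load()  # roll back to the last good state
    return state["acc"]
```

A shorter checkpoint interval reduces the work repeated after a fault but increases the number of saves, which is the usual trade-off in this kind of scheme.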
[Figure: DMM software stack. Hardened System (Hardened Processor): COTS OS and Drivers, Reliable Messaging Middleware, JM, FTM. COTS Data Processors (COTS Processor): COTS OS and Drivers, Reliable Messaging Middleware, JMA, ASL, FCL, FEMPI, MPI Application Process. A COTS Packet-Switched Network connects the two. Also shown: Mission Manager, Mission-Specific Parameters.]
Legend: JM = Job Manager; JMA = Job Manager Agent; FTM = Fault Tolerance Manager; FEMPI = Fault-Tolerant Embedded MPI; ASL = Application Services Library; FCL = FPGA Coprocessor Library
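The application heartbeat that the JM Agent provides to the system controller can be thought of as a timeout check on the controller side. The sketch below is an assumption-laden illustration: the class `HeartbeatMonitor`, its method names, and the timeout policy are invented here and are not taken from the DM middleware.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per node and flags nodes that go silent."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._last_seen = {}

    def beat(self, node_id):
        """Record a heartbeat from `node_id` at the current time."""
        self._last_seen[node_id] = self._clock()

    def failed_nodes(self):
        """Return the sorted ids of nodes whose last heartbeat is too old."""
        now = self._clock()
        return sorted(
            node for node, seen in self._last_seen.items()
            if now - seen > self.timeout_s
        )
```

In a real system the list of silent nodes would presumably be handed to a fault manager, which decides whether to restart the job or the node.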
-
Algorithm-Based Fault Tolerance
Commonly refers to matrix coding method that is preserved through certain linear algebra operations
  Matrix and vector multiply
  Discrete Fourier Transform
  Discrete Wavelet Transform
  Matrix decomposition: C = AB (LU, QR, Cholesky)
  Matrix inversion
Used to detect errors in these operations, and in certain cases allows for error correction
ABFT algorithms integrate with DM through Application Services API
An improved method of using ABFT on the 2D-FFT and SAR has been researched at UF
  Uses Hamming encoding
  Low overhead due to ABFT
Important aspects of ABFT currently under investigation at UF
  Round-off analysis
  Coverage analysis
  Code types
  Encoding and decoding strategies
  Overhead
[Figure: Computation Flow of Fault-tolerant 2D-FFT; diagram label: Fault-tolerant Partial Transform]
[Figure: Experimental Overhead of Fault-tolerant RDP vs. a Fault-intolerant Version. X-axis: Image Size [N x N], 128 to 4096; Y-axis: Overhead Incurred, 15% to 95%; series: Error Free, With Error.]
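As a hedged illustration of a matrix code that is preserved through a linear algebra operation, the sketch below uses the classic row/column checksum scheme for matrix multiplication (due to Huang and Abraham). It is not the Hamming-encoded 2D-FFT method mentioned on this slide, whose details are not given here. All function names are invented for the example, and the tolerance `tol` is an arbitrary stand-in for the round-off threshold that a real round-off analysis would supply.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_column_checksum(a):
    """Append a row holding the sum of each column of `a`."""
    return a + [[sum(col) for col in zip(*a)]]


def add_row_checksum(b):
    """Append to each row of `b` the sum of that row."""
    return [row + [sum(row)] for row in b]


def locate_errors(c_full, tol=1e-9):
    """Return (bad_rows, bad_cols) of the data part whose checksums disagree."""
    n_rows, n_cols = len(c_full) - 1, len(c_full[0]) - 1
    bad_rows = [
        i for i in range(n_rows)
        if abs(sum(c_full[i][:n_cols]) - c_full[i][n_cols]) > tol
    ]
    bad_cols = [
        j for j in range(n_cols)
        if abs(sum(c_full[i][j] for i in range(n_rows)) - c_full[n_rows][j]) > tol
    ]
    return bad_rows, bad_cols


def correct_single_error(c_full):
    """Repair one corrupted data element in place; return True if repaired."""
    bad_rows, bad_cols = locate_errors(c_full)
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        n_cols = len(c_full[0]) - 1
        others = sum(c_full[i][:n_cols]) - c_full[i][j]
        c_full[i][j] = c_full[i][n_cols] - others
        return True
    return False
```

Multiplying the column-checksummed `A` by the row-checksummed `B` yields a product that carries both checksums, so a single corrupted element shows up as exactly one inconsistent row and one inconsistent column, which is what allows it to be located and corrected.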
-
Source Code Transformations
Most science applications are inherently non-fault-tolerant
  Requires SIFT framework to improve reliability
Possible to immunize programs against most errors by transforming application source code
  Less overhead
  More control over FT techniques
  Compiler-independent
  Integrates with DM system through Application Services API
Custom source-to-source (S2S) transformation tool is currently under development at UF
  Accepts C source files as inputs
  Generates fault-tolerant versions
  Uses fine-grain NMR-type of approach to provide improved reliability and dependability
  Provides means of control flow checking (CFC) through software
  Minimizes number of undetected errors
Transformation options to be supported by the tool
  Variable replication
  Function replication
  Memory duplication / memory checking
  Synchronization intervals
  Condition evaluation
    Post-evaluation verification
    Evaluation using replicated variables
  Block protection
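The effect of variable replication and of condition evaluation using replicated variables can be sketched by hand. The example below is written in Python purely for brevity; the tool described above operates on C, and its actual output is not shown in the slides. The function names, the `FaultDetected` exception, and the `fault_at` hook (which simulates a bit flip in one replica) are all hypothetical.

```python
class FaultDetected(Exception):
    """Raised when replicated copies of a value disagree."""


def original(values, limit):
    """Untransformed code: sum the values that are below `limit`."""
    total = 0
    for v in values:
        if v < limit:
            total += v
    return total


def transformed(values, limit, fault_at=None):
    """Hand-written 'transformed' version keeping two copies of each variable."""
    total_0 = total_1 = 0
    limit_0 = limit_1 = limit
    for i, v in enumerate(values):
        # Condition evaluated once per replica; the branch is taken only
        # when both evaluations agree.
        cond_0 = v < limit_0
        cond_1 = v < limit_1
        if cond_0 != cond_1:
            raise FaultDetected("condition replicas disagree")
        if cond_0:
            total_0 += v
            total_1 += v
        if i == fault_at:
            total_1 ^= 1  # simulated single-bit upset in one replica
        # Synchronization point: compare the replicas after each iteration.
        if total_0 != total_1:
            raise FaultDetected("variable replicas disagree")
    return total_0
```

Comparing after every iteration corresponds to the shortest possible synchronization interval; checking less often would lower the overhead at the cost of later detection.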
-
Reconfigurable Fault Tolerance
GOAL
  Research how to take advantage of reconfigurable nature of FPGAs, to provide dynamically-adaptive fault tolerance in RC systems
  Leverage partial reconfiguration (PR) where advantageous
  Explore virtual architectures to enable PR and reconfigurable fault tolerance (RFT)
MOTIVATION
  Why go with fixed/static FT, when performance & reliability can be tuned as needed?
  Environmentally-aware & adaptive computing is wave of future
  Achieving power savings and/or performance improvement, without sacrificing reliability
CHALLENGES
  Limitations in concepts and tools, open-ended problem requires innovative solutions
  Conventional methods typically based upon radiation-hardened components and/or fault masking via chip-level TMR
  Highly-custom nature of FPGA architectures in different systems and apps makes defining a common approach to FT difficult
[Figure: Satellite orbits, passing through the Van Allen radiation belt]
-
Reconfigurable FT
Virtual Architecture for RFT
  Novel concept of adaptable component-level protection (ACP)
  Common components within VA:
    Adaptable protection frame, largely module/design-independent (see figure)
    Error Status Register (ESR) for system-level error tracking/handling
    Re-synchronization controller or interfaces, for state saving and restoration
    Configuration controller, two options:
      Internal configuration through ICAP
      External configuration controller
  Benefits of internal protection:
    Early error detection and handling = faster recovery
    Redundancy can be changed into parallelism
    PR can be leveraged to provide uninterrupted operation of non-failed components
  Challenges of internal protection:
    Impossible to eliminate single points of failure, may still need higher-level (external) detection and handling
    Stronger possibility of fault/error going unnoticed
    Single-event functional interrupts (SEFI) are major concern
[Figure: VA concept diagram. An FPGA containing an Adaptable Component-level Protection frame with sockets for modules. Example socket configurations are labeled "4 parallel, single", "2 parallel, SCP", "no parallel, TMR", and "no parallel, SCP"; some sockets are left blank in the last two.]
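The trade between redundancy and parallelism in the four labeled configurations can be tabulated in a few lines of Python. This is a reading of the figure labels under the assumption of four module sockets; the `ProtectionMode` class, the `choose_mode` policy, and the detect/mask rules (two copies detect, three copies mask) are illustrative choices, not part of the presented architecture.

```python
from dataclasses import dataclass

SOCKETS = 4  # assumption: the protection frame offers four module sockets


@dataclass(frozen=True)
class ProtectionMode:
    name: str
    modules: int   # distinct modules running in parallel
    replicas: int  # copies of each module (1 = single, 2 = SCP, 3 = TMR)

    @property
    def sockets_used(self):
        return self.modules * self.replicas

    @property
    def detects(self):
        return self.replicas >= 2  # a mismatch between copies reveals a fault

    @property
    def masks(self):
        return self.replicas >= 3  # a majority vote hides one faulty copy


MODES = [
    ProtectionMode("4 parallel, single", modules=4, replicas=1),
    ProtectionMode("2 parallel, SCP", modules=2, replicas=2),
    ProtectionMode("no parallel, TMR", modules=1, replicas=3),
    ProtectionMode("no parallel, SCP", modules=1, replicas=2),
]


def choose_mode(need_detection, need_masking):
    """Pick the mode with the most parallel modules that meets the FT need."""
    candidates = [
        m for m in MODES
        if (m.detects or not need_detection) and (m.masks or not need_masking)
    ]
    return max(candidates, key=lambda m: m.modules)
```

A system that senses a benign environment could call `choose_mode(False, False)` to run four modules at once, then fall back to a protected mode when conditions worsen.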
-
Space Applications
Synthetic Aperture Radar (SAR)
  Used to form high-resolution images of Earth's surface from moving platform in space
  Patch-based processing with significant amount of overlap between patch boundaries
  Parallelizable on multiple levels of granularity, possible without need for any inter-processor communication (one patch per node)
  2-dimensional data set, can range in size from several hundred Megabytes to Gigabytes
  Data set not significantly reduced through course of application
  Highly amenable to ABFT; possible application for the Dependable Multiprocessor project
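Patch-based decomposition with overlap can be sketched as an index calculation along one dimension of the data set. The function name, parameters, and the clipping of the final patch are assumptions made for this example; the slides do not specify patch sizes or how edges are handled.

```python
def overlapping_patches(length, patch, overlap):
    """Return (start, stop) index pairs that cover range(length).

    Consecutive patches share `overlap` samples; the last patch is
    clipped to the end of the data.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    if not 0 <= overlap < patch:
        raise ValueError("overlap must satisfy 0 <= overlap < patch")
    step = patch - overlap
    bounds = []
    start = 0
    while True:
        stop = min(start + patch, length)
        bounds.append((start, stop))
        if stop == length:
            break
        start += step
    return bounds
```

Because each patch carries its own overlap region, every `(start, stop)` pair can be handed to a different node and processed without inter-processor communication, at the cost of some redundant computation in the shared samples.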
-
Space Applications
Hyperspectral Imaging (HSI)
  Uses traditional beamforming techniques to perform coarse-grained classification on hyperspectral images
  Adjustable to enable real-time processing
  Mostly embarrassingly parallel, exception being weight computation (shown in red in the figure)
  3-dimensional data set, reduced through course of application
  Auto-correlation sample matrix (ACSM) calculation and beamforming (detection) amenable to ABFT
  Suggest NMR for weight computation (weight)
  Parallel and multi-FPGA decompositions explored
-
Space Applications
Cosmic Ray Elimination
  Uses image processing techniques to remove artifacts caused by cosmic rays
  Image shows pre- and post-processed versions of a Hubble Telescope observation
  Images are highly parallelizable, with minimal communication necessary
  Main computation: median filtering
  Fault-tolerant median filter developed
  Other portions of algorithm replicated by hand or S2S translator
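For readers unfamiliar with the main computation, here is a minimal 3x3 median filter in Python. It is the plain, non-fault-tolerant filter; the slides do not describe how the fault-tolerant variant works, so none is attempted. The border handling (using only in-bounds neighbors) and the use of `median_low` to keep original pixel values are choices made for this sketch.

```python
from statistics import median_low


def median_filter_3x3(image):
    """Apply a 3x3 median filter to a list-of-rows image.

    Border pixels use only the neighbors that fall inside the image.
    """
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [
                image[rr][cc]
                for rr in range(max(r - 1, 0), min(r + 2, rows))
                for cc in range(max(c - 1, 0), min(c + 2, cols))
            ]
            out[r][c] = median_low(window)
    return out
```

An isolated bright pixel, such as a cosmic ray hit, is replaced by the median of its neighborhood, and each output pixel depends only on a small window, which is why the image splits easily across nodes.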
Other aerospace-related application kernels
  Space-Time Adaptive Processing (STAP)
  Ground Moving Target Indicator (GMTI)
  Airborne LIDAR
  Digital Down Conversion (DDC)
  PDF Estimation
-
Novel Computing Platforms
Fixed multi-core (FMC) devices
  Cell
    Heterogeneous, vector compute engine, 3.2 GHz clock rate, ~70 W max. power consumption
  GPU
    Homogeneous, many (e.g. 100+) stream processors, ~1.5 GHz clock rate, ~120 W max. power consumption
Reconfigurable multi-core (RMC) devices
  Field-Programmable Object Array (FPOA)
    Heterogeneous, coarse-grained processing elements, 1 GHz clock rate, ~35 W max. power consumption
  Field-Programmable Gate Array (FPGA)
    Heterogeneous, fine-grained processing elements, max. clock rate ~500 MHz, achievable clock rate varies, ~30 W max. power consumption
  Tilera
    Homogeneous, coarse-grained processing elements (64 32-bit MIPS-like processors on-chip), ~750 MHz clock rate, ~30 W max. power consumption
  Element CXi
    Heterogeneous, coarse-grained processing elements, 200 MHz clock rate, ~1 W max. power consumption
Cell processor block diagram - http://www.research.ibm.com/journal/rd/494/kahle.html
FPOA architecture - http://www.mathstar.com/Architecture.php
-
RC: Vital Technology for Space
HPEC devices featured here; similar results vs. 65nm Xeon, 90nm GPU, etc. (see RSSI08).
Results excerpted from pending presentation from CHREC-UF site for HPEC08 Workshop.
Versatility in space missions (adapts as needs demand)
  Fixed archs. burdened with fixed choices, limited tradeoffs
Performance in space missions (speed, power, size, etc.)
  e.g. Computational density per Watt (CDW) device metric
  FPGAs far exceed FMC devices (CPU, Cell, GPU, etc.)
Notes on the metric inputs:
  Parallel Operations: scales up to max. # of adds and mults (# of adds = # of mults) possible
  Achievable Frequency: lowest frequency after PAR of DSP & logic-only impls. of add & mult comp. cores [FPGA]
  Power: scales linearly with resource util.; max. power reduced by ratio of achievable freq. to max. freq. [FPGA]
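One plausible reading of how these three quantities combine into computational density per Watt is sketched below. The slide does not state the formula, so the definition used here (operations per second divided by power) is an assumption, as are the function names. The utilization, achievable frequency, and operation count in the usage note are made-up illustration values, not measured results.

```python
def fpga_power_w(max_power_w, utilization, achievable_hz, max_hz):
    """Estimate FPGA power from the slide's notes.

    Power scales linearly with resource utilization (0..1), and the
    maximum power is reduced by the ratio of achievable to maximum
    clock frequency.
    """
    return max_power_w * utilization * (achievable_hz / max_hz)


def computational_density_per_watt(parallel_ops, achievable_hz, power_w):
    """Assumed definition: operations per second per watt."""
    return parallel_ops * achievable_hz / power_w
```

For example, a hypothetical device with 30 W maximum power and a 500 MHz maximum clock, running 100 parallel operations at 250 MHz with half of its resources in use, would be estimated at `fpga_power_w(30, 0.5, 250e6, 500e6)` watts, and that figure then feeds `computational_density_per_watt`.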
-
RapidIO
High-speed embedded system interconnect, replacement for bus-based backplanes
Parallel and serial variants, serial is wave of future
Multiple programming models
Research with RapidIO at UF
  Simulative research studying capability of RapidIO-based computing platforms to support space-based radar (SBR) processing
  Custom testbed designed and built, for verification of simulation models & experimentation with RapidIO & FPGAs
[Figure: plot titled "256 Pulses, 6 Beams, 1 Engine per Task per FPGA: 64k Ranges". X-axis: Time (ms), 0 to 2048; Y-axis: SDRAM Utilization (%), 0 to 100.]
[Figure captions: Experimental logic analyzer measurements; Visualization of simulated GMTI application progress; Trace files]
-
Conclusions
Fault tolerance for space should be more than RadHard components & spatial TMR designs
  Fixed worst-case designs extremely limited in perf/Watt
  Instead, many FT methods & modes can be exploited
  Adaptive systems that react to environmental changes
  COTS featured inside critical performance path
  RadHard for FT management, outside critical perf. path
UF active on many space-related FT issues
  NASA Dependable Multiprocessor, CHREC RFT F4-08
  Modes: SIFT, ABFT, S2S, RFT, FEMPI, CR, CED, etc.
  Devices: PPC/AV, FPGA, FPOA, Tilera, ElementCXi, etc.
  Space apps: HSI, SAR, LIDAR, GMTI, CRE, et al.
-
2009 IEEE Aerospace Conference
Track 7.12 Dependable Software for High Performance Embedded Computing Platforms
  Transient error detection and recovery techniques
  Compiler-based fault-tolerant techniques
  Algorithm-based fault-tolerant techniques
  Tools and techniques for designing reliable software
  SIFT management frameworks
  Software dependability analysis
  Adaptive fault-tolerant techniques
  FT applications
Track Chairs
  Richard Linderman [email protected]
  Grzegorz Cieslewski [email protected]
Dates
  Abstract Submissions Due: July 1st, 2008
  Paper Submissions Due: November 2nd, 2008