Fault-Tolerant Design for Long-Life Deep Space Missions
Yiğit Kültür
2006702835
Contents
Introduction
Fault-Tolerant System Considerations and Techniques
Historical Perspective
Future Approach
Conclusion
Introduction
Recently, Mars has been at the focal point of astronomical attention because it will play a key role in humanity's expansion into deep space.
Future Mars transportation will require reliable operation over a lifespan of years, unlike:
The Space Shuttle, which requires operation over months
The Space Station, which is close enough to the Earth for maintenance logistics
Introduction
The long operating period associated with deep space missions demands:
Innovative fault-tolerant technology development
Application of advanced redundancy techniques
To enable Mars exploration, safety, reliability and autonomy must be improved.
A new technology plan is needed to guide the development of next-generation fault-tolerant computing technology.
Fault Tolerant System Considerations
Traditionally, avionic systems have achieved fault tolerance through redundancy management.
A redundancy management technique:
Detects and isolates a failure
Performs hardware reconfiguration
A combination of self-monitoring and cross-comparison strategies leads to comprehensive fault coverage at reduced risk and cost.
Fault Tolerant System Considerations
Primary Flight Control System (PFCS) Baseline Requirements
Mission reliability: 0.95 success probability at 10 years with no repair
Throughput: 100 million instructions per second (MIPS)
Expandable I/O: 100 Mbits/sec
Expandable memory: 1 GByte
Mass storage capacity: 1 Terabyte
Cycle rate: 100 Hz
Hardware N-fail operation
Low life-cycle cost
Low power and mass
Radiation tolerance
Building-block approach (look for existing solutions to parts of the problem and combine those solutions)
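To get a feel for the mission reliability requirement, the sketch below converts "0.95 success probability at 10 years with no repair" into an equivalent system failure rate. It assumes a constant failure rate, so that R(t) = exp(-λt); that model, the function name and the hours-per-year constant are illustrative assumptions and do not come from the PFCS requirements themselves.

```python
import math

HOURS_PER_YEAR = 8766  # 365.25 days * 24 h (assumed conversion)

def max_failure_rate(reliability: float, mission_hours: float) -> float:
    """Largest constant failure rate (per hour) that still meets the target.

    Assumes the exponential model R(t) = exp(-lambda * t), which is an
    illustrative simplification, not part of the stated requirements.
    """
    return -math.log(reliability) / mission_hours

mission_hours = 10 * HOURS_PER_YEAR
lam = max_failure_rate(0.95, mission_hours)
mtbf_years = 1 / lam / HOURS_PER_YEAR

print(f"max system failure rate: {lam:.2e} per hour")
print(f"equivalent MTBF: {mtbf_years:.0f} years")
```

Under this assumption the whole system may fail at most about 5.9e-7 times per hour, which corresponds to a mean time between failures of roughly 195 years. A figure of that size suggests why the requirement is approached through redundancy rather than through component quality alone.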
Fault Tolerant Techniques for Mars Applications
Ultra-reliable systems for long-life applications such as human Mars exploration are required to withstand:
Permanent faults
Transient (temporary) faults
Intermittent (not continuous) faults
Timing faults
Latent (hidden) faults
Worst-case fault scenarios with a lower probability of occurrence
Fault Tolerant Techniques for Mars Applications
Distributed architectures are more suitable for long-life space applications:
Function integration
Parallel computation
Graceful performance growth
Selective technology upgrade
Appropriate levels of function reliability
Graceful degradation of system capabilities in the presence of faults
Efficient use of hardware resources
Historical Perspective
Long-Life Unmanned Redundant Systems
Viking, Voyager, Galileo
Historical Perspective
Safety Critical High Reliability Systems
Columbia, Challenger, Discovery, Atlantis, Endeavour
Long-Life Unmanned Redundant Systems
Viking
Viking is an instance of the pre-1970 Thermoelectric Outer Planets Spacecraft (TOPS) concept.
It was the first spacecraft to use a computer as a fault manager that attempts to reconfigure the spacecraft and restore it to an operational configuration.
The fundamental strategy was to switch power on and off to various alternative subsystems until either the built-in fault monitoring indicated that operation was restored or, for faults in the communication chain, commands from the Earth were detected.
There was no real-time masking of faults, so if a fault occurred during a maneuver, an incorrect maneuver would be performed.
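The power-switching strategy described for Viking can be sketched as a simple search over alternative configurations. This is only an illustration of the idea: the function, the callbacks and the unit names (`receiver_A`, `receiver_B`) are hypothetical and are not taken from the Viking flight software.

```python
from typing import Callable, Iterable, Optional

def restore_by_power_switching(
    configurations: Iterable[str],
    apply_power: Callable[[str], None],
    is_healthy: Callable[[], bool],
) -> Optional[str]:
    """Power up each alternative in turn until the monitor reports health.

    Returns the configuration that restored operation, or None if every
    alternative was tried without success.
    """
    for config in configurations:
        apply_power(config)   # switch power to this alternative
        if is_healthy():      # built-in fault monitor, or ground command seen
            return config
    return None

# Toy scenario (hypothetical): only the B-side receiver still works.
active = {"unit": None}

def apply_power(config: str) -> None:
    active["unit"] = config

def is_healthy() -> bool:
    return active["unit"] == "receiver_B"

chosen = restore_by_power_switching(["receiver_A", "receiver_B"], apply_power, is_healthy)
print(chosen)  # receiver_B
```

The loop makes the limitation on the slide visible: while the search is running nothing masks the fault, so any maneuver in progress proceeds with whatever output the faulty configuration produces.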
Viking Fault-Tolerant Architecture
CCS: Command Computer Subsystem
FDS: Flight Data Subsystem
Long-Life Unmanned Redundant Systems
Voyager
Like Viking, Voyager is an instance of the pre-1970 Thermoelectric Outer Planets Spacecraft (TOPS) concept.
It improves on Viking only in limited ways, such as the addition of a pair of separate computers for attitude and articulation control.
Both spacecraft used standby redundancy. The standby spares were cross-strapped so that either unit could be switched in to communicate with the other units.
Cross-strapping and switching allowed reconfiguration around failed components, either automatically or by ground command.
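A small enumeration can show why cross-strapping matters. The sketch assumes three subsystems, each with an A unit and a B unit, and compares a cross-strapped design (any A or B unit can work with any other) against two fixed strings (all-A or all-B). The subsystem names are borrowed from the Voyager architecture legend; the failure scenario and the fully cross-strapped topology are assumptions made for illustration only.

```python
from itertools import product

SUBSYSTEMS = ("CCS", "FDS", "AACS")

def usable_configurations(failed: set[str], cross_strapped: bool) -> list[tuple[str, ...]]:
    """List the unit combinations that avoid every failed unit."""
    if cross_strapped:
        # Any A/B choice per subsystem is allowed: 2**3 candidates.
        candidates = product(*[(f"{s}-A", f"{s}-B") for s in SUBSYSTEMS])
    else:
        # Only the all-A string or the all-B string can be selected.
        candidates = (tuple(f"{s}-{side}" for s in SUBSYSTEMS) for side in "AB")
    return [c for c in candidates if not failed.intersection(c)]

# Hypothetical scenario: one failure on each side, in different subsystems.
failed = {"CCS-A", "FDS-B"}
with_cross = usable_configurations(failed, cross_strapped=True)
without_cross = usable_configurations(failed, cross_strapped=False)

print(len(with_cross))     # 2
print(len(without_cross))  # 0
```

With two failures in different subsystems, the fixed-string design has no working configuration left, while the cross-strapped design can still pair the surviving units.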
Voyager Fault-Tolerant Architecture
CCS: Command Computer Subsystem
FDS: Flight Data Subsystem
AACS: Attitude and Articulation Control Subsystem
Long-Life Unmanned Redundant Systems
Galileo
The Galileo mission is a follow-on to the Voyager Jupiter fly-by mission.
The Galileo design borrows heavily from the Voyager experience.
Block redundancy (each subsystem is duplicated as a complete block, so that a spare block can take over from a failed one) is used throughout the subsystems.
All subsystems except the CDS operate as active/standby pairs.
The CDS uses active redundancy, in which each block can issue independent commands or both blocks can operate in parallel on the same critical activity.
Galileo Fault-Tolerant Architecture
CDS: Command and Data Subsystem
AACS: Attitude and Articulation Control Subsystem
Long-Life Unmanned Redundant Systems
Galileo
The major departure from the Voyager architecture is the extensive use of microprocessors and the consequent use of a bus-oriented architecture to facilitate communication among them.
Galileo's on-board fault detection software is designed to alleviate the effects and symptoms of faults rather than to pinpoint the exact faults.
Fault identification and isolation are performed through ground intervention.
Galileo Fault-Tolerant Architecture
CDS: Command and Data Subsystem
AACS: Attitude and Articulation Control Subsystem
Safety Critical High Reliability Systems
Shuttles
Operational differences from planetary probes:
The goal is to be absolutely certain that no fault propagates to the effectors during a relatively short operation cycle
Rather than relying on fault monitors to interrupt processing and trigger a reconfiguration, several redundant strings are powered on and operated in parallel
Safety Critical High Reliability Systems
Shuttles
Conceptual Shuttle Orbiter Fault-Tolerant Architecture
Voting occurs both in the General Purpose Computers (GPCs) and at the final effectors.
Voting is a much more brute-force technique than fault monitoring: it requires more hardware but provides greater fault coverage.
It is much better suited to real-time, safety-critical maneuver control than a reconfiguration-oriented strategy as in Viking, Voyager and Galileo.
GPC: General Purpose Computer
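A minimal sketch of the voting idea: several redundant strings compute the same command in parallel and a voter passes on only the value that a strict majority agrees on. Exact-match voting and the function name are assumptions chosen for illustration; the slides do not describe how the Shuttle voters are actually implemented.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence

def majority_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value produced by a strict majority of channels, else None."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None

# Four redundant strings, one of which produces a wrong command.
masked = majority_vote([12, 12, 99, 12])
# A two-against-two split has no strict majority.
split = majority_vote([1, 2, 1, 2])

print(masked)  # 12
print(split)   # None
```

The faulty output is masked in the same cycle in which it occurs, without any interruption for reconfiguration, which is the property that makes voting attractive for maneuver control.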
Mars Advanced Fault Tolerant Computing Approach
Future Manned Mars Missions
Parallel-Hybrid Redundancy will be the basis for future long-life deep space missions:
It combines the attractive features of parallel processing and redundant computation
Computational elements can be arranged to provide high throughput, ultra-high reliability, or a combination of the two, depending on the mission phase
Mars Advanced Fault Tolerant Computing Approach
Future Manned Mars Missions
Parallel-Hybrid Redundancy was first used in 1979, when the Fault Tolerant Multi-Processor (FTMP) was designed and built:
FTMP used a conventional shared-memory multiprocessor architecture
Each virtual processor consisted of three real processors working as a triad to provide real-time fault masking
Upon detection of a fault in a processor, the faulty unit is replaced from a pool of spares
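The triad-plus-spares scheme can be sketched as follows: three processors run the same task, a vote masks a single wrong result, and the dissenting processor is swapped for a unit from the spare pool. The class, the processor names and the exact-match vote are hypothetical simplifications and are not the FTMP design itself.

```python
from collections import Counter
from typing import Callable

class Triad:
    """Three processors acting as one virtual processor, backed by spares."""

    def __init__(self, members: list[str], spares: list[str]) -> None:
        self.members = list(members)
        self.spares = list(spares)

    def step(self, compute: Callable[[str], int]) -> int:
        """Run one cycle, mask a single fault, and replace the dissenter."""
        results = {pid: compute(pid) for pid in self.members}
        voted, count = Counter(results.values()).most_common(1)[0]
        if count < 2:
            raise RuntimeError("no majority among triad members")
        for i, pid in enumerate(self.members):
            if results[pid] != voted and self.spares:
                self.members[i] = self.spares.pop(0)  # reconfigure around the fault
        return voted

# Hypothetical scenario: processor P2 returns a wrong value.
faulty = {"P2"}

def compute(pid: str) -> int:
    return -1 if pid in faulty else 42

triad = Triad(["P1", "P2", "P3"], ["P4", "P5"])
result = triad.step(compute)

print(result)         # 42
print(triad.members)  # ['P1', 'P4', 'P3']
```

The vote supplies the real-time masking and the spare pool supplies the long-life reconfiguration, which is the combination the term hybrid redundancy refers to.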
Mars Advanced Fault Tolerant Computing Approach
Future Manned Mars Missions
Parallel-Hybrid Redundancy as implemented in FTMP had certain drawbacks:
It was not explicitly designed to meet the rigorous requirements of Byzantine resilience (the correctly functioning components of a Byzantine fault-tolerant system reach the same group decisions regardless of what the Byzantine-faulty components do), which is necessary to provide:
  Coverage of random hardware faults
  Ultra-high reliability
  Ease of validation
It lacked ease of expandability because of the redundant bus connections between processors and main memory
It did not support mixed redundancy because processors were arranged to work in triads regardless of the criticality of the application
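The "rigorous requirements" of Byzantine resilience are usually stated as the classical bounds for agreement without authenticated messages: to tolerate f simultaneous Byzantine faults, a system needs at least 3f + 1 fault containment regions, 2f + 1 disjoint communication paths between them, and f + 1 rounds of message exchange. The sketch below only tabulates these textbook bounds; the names are illustrative and nothing here is specific to FTMP.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ByzantineRequirements:
    fault_containment_regions: int
    disjoint_paths: int
    exchange_rounds: int

def byzantine_requirements(f: int) -> ByzantineRequirements:
    """Minimum resources needed to tolerate f simultaneous Byzantine faults."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return ByzantineRequirements(3 * f + 1, 2 * f + 1, f + 1)

def tolerable_faults(regions: int) -> int:
    """Largest f such that 3f + 1 <= regions."""
    return max((regions - 1) // 3, 0)

one_fault = byzantine_requirements(1)
print(one_fault.fault_containment_regions)  # 4
print(one_fault.disjoint_paths)             # 3
print(one_fault.exchange_rounds)            # 2
print(tolerable_faults(3))                  # 0
```

The last line shows why a plain triad falls short: three regions can mask a fault by voting, but they cannot guarantee agreement when the faulty member sends different values to different peers.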
Mars Advanced Fault Tolerant Computing Approach
Future Manned Mars Missions
FTPP Architecture
To remedy the deficiencies of FTMP, a new architecture called the Fault Tolerant Parallel Processor (FTPP) was conceived.
It meets all the requirements for coverage of random hardware faults.
FTPP will be the basis of fault tolerance for future manned Mars missions.
Mars Advanced Fault Tolerant Computing Approach
Features of FTPP – Parallel Processing
FTPP Architecture
Parallel processing is provided by:
40 Processing Elements (PEs) in 5 Fault Containment Regions (FCRs)
2 Input/Output Controllers (IOCs) per FCR
Mars Advanced Fault Tolerant Computing Approach
Features of FTPP – Scalable Performance
FTPP Architecture
Increasing the number of PEs in a single cluster creates a communication bottleneck in the Network Elements (NEs).
FTPP relies on a hierarchical approach to scaling performance: clusters are assembled via the IOCs.
Mars Advanced Fault Tolerant Computing Approach
Features of FTPP – Mixed Redundancy
FTPP Architecture
Most fault-tolerant computers are designed to operate in a redundant mode only, which wastes resources on non-critical tasks.
FTPP allows the processing elements to be configured as:
Simplex: non-critical tasks
Triplex: tasks that require real-time fault masking
Quadruplex or higher: when two or more sequential faults must be tolerated in a small time window without the benefit of reconfiguration
In the figure: 4 quads, 3 triplexes, 15 simplexes
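As a quick consistency check, the configuration in the figure can be added up: each quad uses four PEs, each triplex three and each simplex one. The dictionary and function names below are illustrative only.

```python
# Number of PEs consumed by one virtual group of each kind.
REDUNDANCY_LEVEL = {"simplex": 1, "triplex": 3, "quadruplex": 4}

def processing_elements_used(groups: dict[str, int]) -> int:
    """Total PEs needed for the given number of groups of each kind."""
    return sum(REDUNDANCY_LEVEL[kind] * n for kind, n in groups.items())

figure_configuration = {"quadruplex": 4, "triplex": 3, "simplex": 15}
used = processing_elements_used(figure_configuration)
virtual_groups = sum(figure_configuration.values())

print(used)            # 40
print(virtual_groups)  # 22
```

The total of 40 matches the 40 PEs of the cluster, so the figure shows a fully populated machine presenting 22 virtual processors of mixed criticality.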
Mars Advanced Fault Tolerant Computing Approach
Features of FTPP – Dynamic Reconfiguration
FTPP Architecture
A mission consists of several phases, such as launch, ascent, cruise from Earth orbit to Mars, Mars orbit injection and Mars landing.
The required throughput, latency, iteration rate and criticality vary over a wide range from phase to phase, so the architecture must be flexible.
Reconfiguration from high throughput to high reliability:
  Three PEs operating as independent simplex elements can be synchronized to run the same task (S2, S3, S13)
Replacing failed members:
  A simplex in the same FCR as the failed member is synchronized with the non-failed members of the virtual group (if channel A of Q1 fails, S2, S7 or S12 can replace it)
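The replacement rule can be sketched as a lookup: find an idle simplex that sits in the same FCR as the failed member and put it in the failed member's place. Only the fact that S2, S7 and S12 can stand in for channel A of Q1 comes from the slide; the member names such as `Q1-A`, the placement of S3 and the fallback to a degraded group are assumptions made for illustration.

```python
from typing import Optional

def find_replacement(failed: str, fcr_of: dict[str, str], simplexes: list[str]) -> Optional[str]:
    """Return the first idle simplex located in the failed member's FCR."""
    target_fcr = fcr_of[failed]
    for pe in simplexes:
        if fcr_of[pe] == target_fcr:
            return pe
    return None

def replace_member(group: list[str], failed: str,
                   fcr_of: dict[str, str], simplexes: list[str]) -> list[str]:
    """Return the group with the failed member swapped out, or dropped if no spare exists."""
    spare = find_replacement(failed, fcr_of, simplexes)
    if spare is None:
        return [m for m in group if m != failed]  # assumed fallback: degraded group
    simplexes.remove(spare)                       # the spare is no longer idle
    return [spare if m == failed else m for m in group]

# Hypothetical layout: S2, S7 and S12 share FCR "A" with channel A of Q1.
fcr_of = {"Q1-A": "A", "Q1-B": "B", "Q1-C": "C", "Q1-D": "D",
          "S2": "A", "S7": "A", "S12": "A", "S3": "B"}
idle = ["S3", "S2", "S7", "S12"]
q1 = replace_member(["Q1-A", "Q1-B", "Q1-C", "Q1-D"], "Q1-A", fcr_of, idle)

print(q1)    # ['S2', 'Q1-B', 'Q1-C', 'Q1-D']
print(idle)  # ['S3', 'S7', 'S12']
```

Choosing the spare from the same FCR keeps every member of the virtual group in a different fault containment region, so one region failure can still take out at most one member.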
Mars Advanced Fault Tolerant Computing Approach
Features of FTPP – Low Fault Tolerance Overhead
FTPP Architecture
Frequent fault-tolerance-related functions, such as fault/error detection, error masking (voting) and synchronization, are implemented in the Network Element.
Less frequent functions, such as identification of faulty modules, reconfiguration and reintegration, are implemented in software that executes on the PEs.
Each NE services 8 PEs.
Mars Advanced Fault Tolerant Computing Approach
Features of FTPP – Open Architecture
FTPP Architecture
FTPP provides an open architecture for both hardware and software, including:
Processors
I/O modules
Fiber optic links
Operating systems
Mars Advanced Fault Tolerant Computing Approach
Features of FTPP – Small Physical Size
FTPP Architecture
The key to meeting the weight, volume and power requirements is the packaging technology.
Multi-Chip Modules (MCMs) will be used:
An NE on a single MCM of less than 4 cm²
Conclusion
Future manned deep space missions will require reliable operation over years and real-time masking of critical faults.
Current approaches are not sufficient, and a new fault-tolerant approach is needed.
FTPP is a powerful candidate for the spacecraft that will take humans to Mars.
References
A.L. Benjamin and J.H. Lala, "Advanced fault tolerant computing for future manned space missions," 16th AIAA/IEEE Digital Avionics Systems Conference (DASC), vol. 2, 26-30 Oct. 1997, pp. 8.5-26 to 8.5-32.
NASA website, Computers in Spaceflight: The NASA Experience, http://www.hq.nasa.gov/office/pao/History/computers/Ch6-2.html
NASA Jet Propulsion Laboratory website, Voyager: The Interstellar Mission, http://voyager.jpl.nasa.gov/spacecraft/index.html