
Fault-Tolerant Design for Long-Life Deep Space Missions

Yiğit Kültür

2006702835

Contents

Introduction
Fault-Tolerant System Considerations and Techniques
Historical Perspective
Future Approach
Conclusion

Introduction

Recently, Mars has been a focal point of astronomical attention because it will play a key role in humanity’s expansion into deep space

Future Mars transportation will require reliable operation over a lifespan of years, unlike:
The Space Shuttle, which requires operation over months
The Space Station, which is close enough to Earth for maintenance logistics

Introduction

The long operation periods associated with deep space missions demand:
Innovative fault-tolerant technology development
Application of advanced redundancy techniques

To enable Mars exploration, safety, reliability and autonomy must be improved

A new technology plan is needed to guide the development of next-generation fault-tolerant computing technology

Fault Tolerant System Considerations

Traditionally, avionic systems achieved fault-tolerance through redundancy management

A redundancy management technique:
Detects and isolates a failure
Performs hardware reconfiguration

A combination of self-monitoring and cross-comparison strategies leads to comprehensive fault coverage at reduced risk and cost
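As a rough illustration of how self-monitoring and cross-comparison complement each other, the Python sketch below checks a set of redundant channels, isolates the faulty ones and drops them from the active set. The `Channel` structure, the channel names and the tolerance value are assumptions made for this example and are not taken from any flight system.

```python
from dataclasses import dataclass
from statistics import median

@dataclass
class Channel:
    name: str
    value: float        # output computed by this redundant channel
    self_test_ok: bool  # result of the channel's built-in self-monitoring

def isolate_faulty(channels, tolerance=0.01):
    """Return the names of channels flagged by either strategy."""
    # Self-monitoring: each channel reports its own health.
    faulty = {c.name for c in channels if not c.self_test_ok}
    # Cross-comparison: compare each remaining channel with the median
    # of the channels that passed their self-test.
    healthy = [c for c in channels if c.name not in faulty]
    if len(healthy) >= 3:
        reference = median(c.value for c in healthy)
        faulty |= {c.name for c in healthy
                   if abs(c.value - reference) > tolerance}
    return sorted(faulty)

def reconfigure(channels, faulty):
    """Hardware reconfiguration, reduced here to dropping faulty channels."""
    return [c for c in channels if c.name not in faulty]

channels = [
    Channel("A", 1.000, True),
    Channel("B", 1.001, True),
    Channel("C", 1.750, True),   # passes self-test but disagrees with peers
    Channel("D", 1.000, False),  # fails its own self-test
]
faulty = isolate_faulty(channels)
active = reconfigure(channels, faulty)
print(faulty, [c.name for c in active])
```

Channel D is caught by self-monitoring alone and channel C only by cross-comparison, which is why the two strategies together give broader coverage than either one.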

Fault Tolerant System Considerations

Primary Flight Control System (PFCS) Baseline Requirements:
Mission reliability: 0.95 success probability at 10 years with no repair
Throughput: 100 million instructions per second (MIPS)
Expandable I/O: 100 Mbit/s
Expandable memory: 1 GByte
Mass storage capacity: 1 Terabyte
Cycle rate: 100 Hz
Hardware N-fail operation
Low life-cycle cost
Low power and mass
Radiation tolerance
Building block approach (look for existing solutions to parts of the problem and combine them)
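To give a feel for what the mission reliability requirement implies, the sketch below converts a 0.95 success probability at 10 years into an equivalent constant failure rate. The exponential model R(t) = exp(-rate * t) and the 8766-hour year are assumptions of this example; the requirement itself does not state a failure model.

```python
import math

HOURS_PER_YEAR = 8766  # assumption: average year length including leap years

def failure_rate_for(reliability, years):
    """Constant failure rate (per hour) that yields `reliability` after
    `years`, assuming the exponential model R(t) = exp(-rate * t)."""
    return -math.log(reliability) / (years * HOURS_PER_YEAR)

def reliability_at(rate, years):
    """Reliability after `years` for a constant failure rate (per hour)."""
    return math.exp(-rate * years * HOURS_PER_YEAR)

rate = failure_rate_for(0.95, 10)
print(f"system failure rate budget: {rate:.2e} per hour")
print(f"equivalent MTTF: {1 / rate / HOURS_PER_YEAR:.0f} years")
print(f"check, R(10 years) = {reliability_at(rate, 10):.2f}")
```

Under these assumptions the whole system is allowed roughly 6e-7 failures per hour with no repair, which is a budget for the complete system and helps explain the emphasis on redundancy.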

Fault Tolerant Techniques for Mars Applications

Ultra-reliable systems for long-life applications such as human Mars exploration are required to tolerate:
Permanent faults
Transient (temporary) faults
Intermittent (not continuous) faults
Timing faults
Latent (hidden) faults
Worst-case fault scenarios with a lower probability of occurrence

Fault Tolerant Techniques for Mars Applications

Distributed architectures are more suitable for long-life space applications because they offer:
Function integration
Parallel computation
Graceful performance growth
Selective technology upgrade
Appropriate levels of function reliability
Graceful degradation of system capabilities in the presence of faults
Efficient use of hardware resources

Long-Life Unmanned Redundant Systems

Historical Perspective

Viking, Voyager, Galileo

Historical Perspective

Safety Critical High Reliability Systems

Columbia, Challenger, Discovery, Atlantis, Endeavour

Long-Life Unmanned Redundant Systems

Viking

Viking is an instance of the pre-1970 Thermoelectric Outer Planets Spacecraft (TOPS) concept

This spacecraft was the first to use a computer as a fault manager that attempts to reconfigure and restore the spacecraft to an operational configuration

The fundamental strategy was to switch power on and off to various alternative subsystems until either the built-in fault monitoring indicated that operation was restored or, in the case of faults in the communication chain, commands from Earth were detected

There was no real-time masking of faults, so if a fault occurred during a maneuver, an incorrect maneuver would have been performed
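The power-switching strategy can be sketched as a simple search over alternative units. This is only an illustration of the idea: the subsystem names, the unit names and the `operation_restored` check are hypothetical and do not describe the actual Viking software.

```python
from itertools import product

def recover(options, operation_restored):
    """Try combinations of alternative units until the check passes.

    options: dict mapping a subsystem name to its alternative units.
    operation_restored: callable that takes a configuration dict and returns
        True when the fault monitor (or, for the communication chain,
        detection of a command from Earth) reports success.
    """
    names = list(options)
    for choice in product(*(options[n] for n in names)):
        config = dict(zip(names, choice))  # power these units on, others off
        if operation_restored(config):
            return config
    return None  # no working configuration found

# Hypothetical communication chain with one failed receiver.
options = {"receiver": ["RX-1", "RX-2"], "antenna": ["high-gain", "low-gain"]}
failed = {"RX-1"}

def earth_command_detected(config):
    return not (set(config.values()) & failed)

print(recover(options, earth_command_detected))
```

The search only runs after a fault has been noticed and takes time to complete, which reflects why this style of recovery cannot mask a fault that occurs in the middle of a maneuver.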

Viking Fault-Tolerant Architecture

CCS: Command Computer Subsystem

FDS: Flight Data Subsystem

Long-Life Unmanned Redundant Systems

Voyager

Like Viking, Voyager is an instance of the pre-1970 Thermoelectric Outer Planets Spacecraft (TOPS) concept

It improves on Viking only in limited ways, such as the addition of a pair of separate computers for attitude and articulation control

Both spacecraft used standby redundancy

The standby spares were cross-strapped so that either unit could be switched in to communicate with the other units

Cross-strapping and switching allowed reconfiguration around failed components, either automatically or by ground command

Voyager Fault-Tolerant Architecture

CCS: Command Computer Subsystem

FDS: Flight Data Subsystem

AACS: Attitude and Articulation Control Subsystem

Long-Life Unmanned Redundant Systems

Galileo

The Galileo mission is a follow-on to the Voyager Jupiter fly-by mission

The Galileo design borrows heavily from the Voyager experience

Block redundancy (whole functional blocks are duplicated so that a spare block can take over from a failed one) is used throughout the subsystems

All subsystems except the CDS operate as active/standby pairs

The CDS uses active redundancy, in which each block can issue independent commands or the blocks can operate in parallel on the same critical activity

Galileo Fault-Tolerant Architecture

CDS: Command and Data Subsystem


Long-Life Unmanned Redundant Systems

Galileo

The major departure from the Voyager architecture is the extensive use of microprocessors and the consequent use of a bus-oriented architecture to facilitate communication among them

Galileo's on-board fault detection software is designed to alleviate the effects and symptoms of faults rather than to pinpoint the exact faults

Fault identification and isolation are performed through ground intervention

Galileo Fault-Tolerant Architecture

CDS: Command and Data Subsystem


Safety Critical High Reliability Systems

Shuttles

Operational differences from planetary probes:
The system must be absolutely certain that no fault propagates to the effectors during a relatively short operation cycle
Rather than relying on fault monitors to interrupt processing and trigger a reconfiguration, several redundant strings are powered on and operated in parallel

Safety Critical High Reliability Systems

Shuttles

Conceptual Shuttle Orbiter Fault-Tolerant Architecture

Voting occurs both in the General Purpose Computers (GPCs) and at the final effectors

Voting is much more brute-force than fault monitoring: it requires more hardware but also provides greater fault coverage

It is much better suited to real-time, safety-critical maneuver control than a reconfiguration-oriented strategy such as that of Viking, Voyager and Galileo
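A minimal sketch of majority voting over redundant outputs is given below. It illustrates only the general principle of masking a fault in real time; the function name, the command values and the exact-match comparison are assumptions of this example, not the Shuttle's actual voting logic.

```python
from collections import Counter

def majority_vote(outputs):
    """Return (voted_value, indices_of_disagreeing_channels).

    Raises ValueError when no value has a strict majority.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among redundant channels")
    dissenters = [i for i, out in enumerate(outputs) if out != value]
    return value, dissenters

# Four hypothetical command words for one actuator; channel 2 is faulty.
commands = [0x1A2B, 0x1A2B, 0x0000, 0x1A2B]
voted, suspects = majority_vote(commands)
print(hex(voted), suspects)
```

The faulty output never reaches the effector, and no reconfiguration is needed before the correct command is issued. The cost is that every channel must be powered and computing all the time.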

GPC: General Purpose Computer

Mars Advanced Fault Tolerant Computing Approach

Future Manned Mars Missions

Parallel-Hybrid Redundancy will be the basis for future long-life deep space missions:
It combines the attractive features of parallel processing and redundant computation
Computational elements can be arranged to provide high throughput, ultra-high reliability or a combination of the two, depending on the mission phase


Parallel-Hybrid Redundancy was first used in 1979, when the Fault Tolerant Multi-Processor (FTMP) was designed and built:
FTMP used a conventional shared-memory multiprocessor architecture
Each virtual processor consisted of three real processors working as a triad to provide real-time fault masking
Upon detection of a fault in a processor, the faulty unit was replaced from a pool of spares
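The triad-plus-spares idea can be sketched as follows. This is a simplified model under stated assumptions: the `Triad` class, the processor names and the `compute` callback are hypothetical, and the real FTMP performed these functions in hardware and system software.

```python
class Triad:
    """Three real processors presented as one virtual processor."""

    def __init__(self, members, spares):
        self.members = list(members)  # exactly three processor ids
        self.spares = spares          # shared pool of spare processor ids

    def step(self, compute):
        """Run `compute` on every member, mask a single fault by voting,
        then replace the disagreeing member from the spare pool."""
        results = [compute(p) for p in self.members]
        voted = max(set(results), key=results.count)  # most common result
        if results.count(voted) < 2:
            raise RuntimeError("more than one member disagrees")
        for i, result in enumerate(results):
            if result != voted and self.spares:
                self.members[i] = self.spares.pop(0)
        return voted

def compute(processor):
    # Hypothetical workload: processor P2 is faulty and returns a wrong value.
    return 41 if processor == "P2" else 42

spares = ["P7", "P8"]
triad = Triad(["P1", "P2", "P3"], spares)
result = triad.step(compute)
print(result, triad.members, spares)
```

The vote masks the fault immediately, and the replacement restores the triad so that a later second fault can also be tolerated.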

Mars Advanced Fault Tolerant Computing Approach

Future Manned Mars Missions

Parallel-Hybrid Redundancy, as implemented in FTMP, had certain drawbacks:

It was not explicitly designed to meet the rigorous requirements of Byzantine resilience (the correctly functioning components of a Byzantine fault-tolerant system reach the same group decisions regardless of what the Byzantine-faulty components do), which is necessary to provide coverage of random hardware faults, ultra-high reliability and ease of validation

It lacked ease of expandability because of the redundant bus connections between processors and main memory

It did not support mixed redundancy because processors were arranged to work in triads regardless of the criticality of the application
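For orientation, the sketch below lists the minimum resources usually quoted for Byzantine resilience. These bounds come from the general theory of Byzantine agreement, not from these slides, and the function and key names are made up for this example.

```python
def byzantine_requirements(f):
    """Minimum resources to tolerate `f` simultaneous Byzantine faults,
    according to the classical results on Byzantine agreement."""
    return {
        "fault_containment_regions": 3 * f + 1,  # independent participants
        "disjoint_links_per_region": 2 * f + 1,  # connectivity to the others
        "exchange_rounds": f + 1,                # rounds of message exchange
    }

print(byzantine_requirements(1))
```

For a single Byzantine fault this gives at least four fault containment regions, three disjoint links per region and two rounds of exchange, in addition to synchronization among the regions.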


FTPP Architecture

To solve the deficiencies of FTMP, a new architecture called the Fault Tolerant Parallel Processor (FTPP) was conceived

It meets all the requirements for covering random hardware faults

FTPP will be the basis of fault tolerance for future manned Mars missions

Mars Advanced Fault Tolerant Computing Approach

Features of FTPP – Parallel Processing

FTPP Architecture

Parallel processing is provided by:
40 Processing Elements (PEs) in 5 Fault Containment Regions (FCRs)
2 Input/Output Controllers (IOCs) per FCR

Mars Advanced Fault Tolerant Computing Approach

Features of FTPP – Scalable Performance

FTPP Architecture

Increasing the number of PEs in a single cluster creates a communication bottleneck in the Network Elements (NEs)

FTPP relies on a hierarchical approach to scaling performance, assembling clusters via the IOCs

Mars Advanced Fault Tolerant Computing Approach

Features of FTPP – Mixed Redundancy

FTPP Architecture

Most fault tolerant computers are designed to operate in a redundant mode only, which wastes resources on non-critical tasks

FTPP allows the processing elements to be configured as:
Simplex: non-critical tasks
Triplex: tasks that require real-time fault masking
Quadruplex or higher: when two or more sequential faults must be tolerated in a small time window without the benefit of reconfiguration

In the figure:
4 quads
3 triplexes
15 simplexes
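The group counts from the figure add up to exactly 40 PEs (4 × 4 + 3 × 3 + 15 × 1). The sketch below places these virtual groups on 5 FCRs of 8 PEs each so that the members of a group always sit in different FCRs. The greedy placement rule and the FCR numbering are assumptions made for illustration; the actual assignment in the figure may differ.

```python
NUM_FCRS = 5     # fault containment regions in the cluster
PES_PER_FCR = 8  # 40 processing elements in total

def allocate(groups):
    """Place each virtual group so that its members sit in distinct FCRs.

    groups: list of (name, redundancy_level) pairs.
    Returns a dict mapping group name to the list of FCR indices used.
    """
    free = [PES_PER_FCR] * NUM_FCRS
    placement = {}
    # Place the most redundant groups first; they are the hardest to fit.
    for name, level in sorted(groups, key=lambda g: -g[1]):
        # Prefer the FCRs that still have the most free PEs.
        fcrs = sorted(range(NUM_FCRS), key=lambda i: -free[i])[:level]
        if len(fcrs) < level or any(free[i] == 0 for i in fcrs):
            raise ValueError(f"cannot place {name}")
        for i in fcrs:
            free[i] -= 1
        placement[name] = sorted(fcrs)
    return placement

groups = ([(f"Q{i}", 4) for i in range(1, 5)]
          + [(f"T{i}", 3) for i in range(1, 4)]
          + [(f"S{i}", 1) for i in range(1, 16)])
placement = allocate(groups)
used = sum(len(fcrs) for fcrs in placement.values())
print(used, placement["Q1"], placement["T1"])
```

Keeping group members in separate FCRs is what allows a fault in one region to be outvoted by the members in the other regions.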

Mars Advanced Fault Tolerant Computing Approach

Features of FTPP – Dynamic Reconfiguration

FTPP Architecture

The mission consists of several phases, such as launch, ascent, cruise from Earth orbit to Mars, Mars orbit injection and Mars landing

For each phase the throughput, latency, iteration rate and criticality requirements change over a wide range, so the architecture must be flexible

Reconfiguration from high throughput to high reliability:
3 PEs that are operating as independent simplex elements can be synchronized to run the same task (S2, S3, S13)

Replacing failed members:
A simplex in the same FCR as the failed member is synchronized with the non-failed members of the virtual group (if Channel A of Q1 fails, S2, S7 or S12 can replace it)
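The replacement rule can be sketched as a lookup for a simplex in the same FCR as the failed member. The data layout is an assumption of this example: "Channel A" is taken to mean the group member located in FCR A, and the placement of S1 is hypothetical; only S2, S7 and S12 being able to replace Channel A of Q1 comes from the slide.

```python
def replace_failed(group, failed_fcr, simplexes):
    """Swap the failed member of a virtual group for a simplex in the same FCR.

    group: dict mapping an FCR label to the PE serving as the member there.
    simplexes: dict mapping a simplex name to the FCR label it resides in.
    Returns the name of the simplex that was drafted, or None.
    """
    candidate = next(
        (name for name, fcr in simplexes.items() if fcr == failed_fcr), None)
    if candidate is not None:
        del simplexes[candidate]       # the PE gives up its simplex task
        group[failed_fcr] = candidate  # and is synchronized with survivors
    return candidate

q1 = {"A": "Q1-A", "B": "Q1-B", "C": "Q1-C", "D": "Q1-D"}
simplexes = {"S1": "B", "S2": "A", "S7": "A", "S12": "A"}
drafted = replace_failed(q1, "A", simplexes)
print(drafted, q1["A"], sorted(simplexes))
```

Choosing a replacement from the same FCR keeps each member of the group in a different fault containment region after the repair.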

Mars Advanced Fault Tolerant Computing Approach

Features of FTPP – Low Fault Tolerance Overhead

FTPP Architecture

Frequent fault-tolerance-related functions such as fault/error detection, error masking (voting) and synchronization are implemented in the Network Element

Less frequent functions such as identification of faulty modules, reconfiguration and reintegration are implemented in software that executes on the PEs

Each NE services 8 PEs

Mars Advanced Fault Tolerant Computing Approach

Features of FTPP – Open Architecture

FTPP Architecture

FTPP provides an open architecture for both hardware and software, including:
Processors
I/O modules
Fiber optic links
Operating systems

Mars Advanced Fault Tolerant Computing Approach

Features of FTPP – Small Physical Size

FTPP Architecture

The key to meeting the weight, volume and power requirements is the packaging technology

Multi-Chip Modules (MCMs) will be used:
An NE fits on a single MCM of less than 4 cm²

Conclusion

Future manned deep space missions will require reliable operation over years and real-time masking of critical faults

Current approaches are not sufficient, and a new fault-tolerant approach is needed

FTPP is a strong candidate for the spacecraft that will take humans to Mars

References

Benjamin, A.L.; Lala, J.H. Advanced fault tolerant computing for future manned space missions. 16th AIAA/IEEE Digital Avionics Systems Conference (DASC), 26-30 Oct. 1997, vol. 2, pp. 8.5-26 to 8.5-32

NASA website. Computers in Spaceflight: The NASA Experience. http://www.hq.nasa.gov/office/pao/History/computers/Ch6-2.html

NASA Jet Propulsion Laboratory website. Voyager: The Interstellar Mission. http://voyager.jpl.nasa.gov/spacecraft/index.html
