Credit: STS-112 Shuttle Crew, NASA
Fault-Tolerance Verification of the Fluids and Combustion Facility of
the International Space Station
Raquel S. Whittlesey-Harris and Mikhail Nesterenko
presented by Sylvie Delaët, Université Paris Sud
2
Outline
• Introduction to FCF and Project Motivation
– space environment description
– applying stabilization to FCF
– using model checking in stabilization verification
• Architecture & Operation
• FCF SPIN Model
• Experiments
• Impact & Future Work
3
The Fluids and Combustion Facility
• Two racks
– Combustion Integrated Rack (CIR)
• Facilities for combustion science experiments
– Multi-user Droplet Combustion Apparatus
– Fluids Integrated Rack (FIR)
• Facilities for fluid physics experiments
– Light Microscopy Module
• Permanent installation onboard the International Space Station (ISS) US laboratory module
[Figure: the Combustion Integrated Rack and the Fluids Integrated Rack. Caption: The Fluids and Combustion Facility (FCF) is a modular, multi-user microgravity research facility being developed for the ISS Destiny module. Labels: World-Class Combustion Science; Ground-Breaking Fluid Physics Research; 1-g and µ-g application areas: Spacecraft Fire Safety, Terrestrial Fire Safety & Fire Prevention, Pollution Control & Increased Fuel Efficiency, Human Health, Manufacturing Processes, Space Systems Fluids Management.]
4
Why Fault-Tolerance for FCF
• Adverse environment
– harsh acceleration forces
• launch (3-g) and re-entry (1.5-g)
– microgravity (µg) vibrations
• e.g., orbital maneuvers, experimental vibrations
– radiation
• South Atlantic Anomaly
• Protection of life, equipment – care must be taken to prevent contamination of ISS and experiment environments
• Limited access
– crew time limited
• currently no more than 1.5 hours per month
– experiment access via Telescience
• available approximately 30% of the time
5
Why Self-Stabilization
• Faults are numerous and unpredictable in nature and effect, resources are limited, safety is critical
• FCF specification
– requires FCF to tolerate a single component failure regardless of cause
– stricter requirements in future
• A system is self-stabilizing if, starting from an arbitrary state, it is guaranteed to arrive at a legitimate state and behave correctly afterwards
– a fault may take the system into an arbitrary state
– self-stabilization guarantees recovery regardless of fault cause
Self-stabilization is well-suited for FCF fault-tolerance design
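To make the definition above concrete, here is a minimal Python sketch of a classic self-stabilizing algorithm, Dijkstra's K-state token ring. It is not part of the FCF design; it is swapped in only to illustrate convergence from an arbitrary state. The function names, the ring size, and the random scheduler are assumptions made for this illustration.

```python
import random

def privileged(state):
    """Indices of machines whose guard is true (they hold a 'token')."""
    n = len(state)
    result = [0] if state[0] == state[n - 1] else []
    result += [i for i in range(1, n) if state[i] != state[i - 1]]
    return result

def legitimate(state):
    """A state is legitimate when exactly one machine is privileged."""
    return len(privileged(state)) == 1

def step(state, k, i):
    """Fire machine i and return the resulting state."""
    new = list(state)
    new[i] = (state[0] + 1) % k if i == 0 else state[i - 1]
    return new

def stabilize(state, k, rng, max_steps=1000):
    """Fire one randomly chosen privileged machine per step.

    Returns the number of steps taken to reach a legitimate state,
    or None if max_steps is exhausted first.
    """
    for steps in range(max_steps + 1):
        if legitimate(state):
            return steps
        state = step(state, k, rng.choice(privileged(state)))
    return None

if __name__ == "__main__":
    rng = random.Random(0)
    faulty = [rng.randrange(5) for _ in range(5)]  # arbitrary state left by a fault
    print(faulty, "->", stabilize(faulty, 5, rng), "steps to a legitimate state")
```

The ring uses K equal to the number of machines, which is the condition under which Dijkstra's algorithm is known to converge when one machine fires at a time. No step of the algorithm needs to know which fault produced the starting state.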
6
Why Use Model Checking
• Traditionally self-stabilization is proven analytically:
– determine invariant guaranteeing correct behavior
– show that system starting from arbitrary states eventually satisfies this invariant
• Complex practical systems such as FCF have a large number of possible states and special cases
– analytical proofs for such systems are
• difficult to construct
• cumbersome and thus suspect
• Model checker
– automates state space checking and verifies desired properties such as stabilization
– especially effective if the state space is finite as in case of FCF
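As a rough illustration of what a model checker automates, the sketch below exhaustively explores every state of a tiny system and checks the two parts of a stabilization proof: closure (legitimate states only lead to legitimate states) and convergence (no execution can stay among illegitimate states forever). The system is a three-machine Dijkstra token ring, a stand-in chosen for its small state space rather than anything taken from the FCF model; all names and sizes are assumptions.

```python
from itertools import product

N, K = 3, 3  # assumed toy sizes: 3 machines, 3 values each (27 states)

def privileged(s):
    """Machines whose guard is enabled in state s."""
    p = [0] if s[0] == s[-1] else []
    return p + [i for i in range(1, len(s)) if s[i] != s[i - 1]]

def successors(s):
    """Every state reachable in one step, for any scheduler choice."""
    result = []
    for i in privileged(s):
        t = list(s)
        t[i] = (s[0] + 1) % K if i == 0 else s[i - 1]
        result.append(tuple(t))
    return result

def legitimate(s):
    return len(privileged(s)) == 1

def check_stabilization():
    """Return (deadlock_free, closure, convergence) over the full state space."""
    states = list(product(range(K), repeat=N))
    deadlock_free = all(successors(s) for s in states)
    closure = all(legitimate(t)
                  for s in states if legitimate(s)
                  for t in successors(s))

    # Convergence: the graph restricted to illegitimate states has no cycle,
    # so every execution must eventually leave it.
    color = {}  # 1 = on the current DFS path, 2 = fully explored

    def has_cycle(s):
        color[s] = 1
        for t in successors(s):
            if legitimate(t):
                continue
            if color.get(t) == 1:
                return True
            if t not in color and has_cycle(t):
                return True
        color[s] = 2
        return False

    convergence = not any(has_cycle(s) for s in states
                          if not legitimate(s) and s not in color)
    return deadlock_free, closure, convergence

if __name__ == "__main__":
    print(check_stabilization())
```

Because the search starts from every state rather than from a designated initial state, it covers any state a fault could leave behind. This only works when the state space is finite and small enough to enumerate.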
7
Outline
• Introduction to FCF and Project Motivation
• Architecture & Operation
– Hardware design
– Operation
• FCF SPIN Model
• Experiments
• Impact & Future Work
8
FCF Architecture Overview
• FCF contains two racks (FIR and CIR)
• Each rack contains several independent components
– the components may have processing, sensing and storage capacity
– the components communicate through multiple networks (Copper Ethernet, Fiber Optic, CANBus, etc.)
• The main component of the rack (IOP)
– runs a real-time embedded OS: VxWorks
– houses the Rack Manager, the main control program of the rack
– communicates with ISS and ground control
– if necessary, controls processing components of the other rack
9
Combustion Integrated Rack (CIR)
[Figure: labeled diagram of the CIR. Labels:]
• International Standard Payload Rack (ISPR)
• Rack Closure Door
• Optics Bench
• Optics Bench Slides
• Combustion Chamber
• Experiment Specific Chamber Insert
• Fuel/Oxidizer Management Assembly (FOMA)
– Gas Distribution
– Exhaust Vent
• FOMA Control Unit (FCU)
• Input/Output Processor (IOP)
• Electrical Power Control Unit (EPCU)
• Image Processing and Storage Unit (IPSU-A)
• Common IPSU (2)
• PI Avionics
• Laptop Computer
• SAMS RTS
• Active Rack Isolation Subsystem (ARIS)
• Environmental Control (ECS)
– Air Thermal Control
– Water Thermal Control
– Fire Detection & Suppression
– Gas Interfaces (GN2, VES, VRS)
• Science Diagnostics
– Color Camera
– Illumination Package
– Low Light Level (2 Units)
– High Bit Depth Multi-Spectral
– High Frame Rate/High Resolution
– OR Experiment Specific Diagnostics
10
FCF Operation
• Each component is in one of several states
– e.g., initialization, safed, off-nominal
• State transitions
– must follow the rack rules: all components must be in a legitimate state
• e.g., op-idle, safed, off
• Out-of-tolerance conditions
– nine selected, representing a critical sampling of all types
• e.g., rack door is open while powered-on
• Rack manager actions
– seven actions in response to out-of-tolerance conditions
• e.g., power off all hazardous components
[Figure: rack state diagram. State labels: Initialization, Safed (S), Operational (OP), Off-Nominal, plus the transitional states "OP to S" and "Off-Nom to S". The sub-state labels Maintenance (M), Experiment (E), Idle (I), Mixed, and Uplink/Downlink (UD) appear twice. Transition labels: power on, power off, entry, success, error, safed cmd, operational cmd, Idle cmd, UD cmd, maintenance cmd to all packages, idle cmd to all packages, experiment cmd to all packages, synchronize package states, unsynchronize package states.]
11
FCF Operation Example
• Power-on – rack manager initiates power on of the IPSU
• Component initialization
– component determines it is IPSU, initializes state
– IPSU performs power-on self test (health check of internal systems)
– upon successful completion, IPSU transitions to op-idle, starts monitoring its health & status, communicating with IOP, and sending telemetry
• Fault processing
– Rack manager finds one component off-nominal and requests all components to transition to op-idle; components receive the command and transition to op-idle
• Component power-down
– Rack manager determines that, due to the fault, it needs to power down the system and requests all components to enter safed; after saving state information, the IPSU powers down
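The IPSU walkthrough above can be read as a small state machine. The following Python sketch is one possible encoding under stated assumptions: the event names are hypothetical labels, the transition from a failed self test to off-nominal is a guess suggested by the "error" arrows in the state diagram, and ignoring events that do not apply in the current state is a simplification. It is not the flight software or the authors' model.

```python
from enum import Enum

class State(Enum):
    OFF = "off"
    INITIALIZATION = "initialization"
    OP_IDLE = "op-idle"
    OFF_NOMINAL = "off-nominal"
    SAFED = "safed"

# (current state, event) -> next state. Event names are hypothetical.
TRANSITIONS = {
    (State.OFF, "power_on"): State.INITIALIZATION,
    (State.INITIALIZATION, "self_test_ok"): State.OP_IDLE,
    (State.INITIALIZATION, "self_test_failed"): State.OFF_NOMINAL,  # assumed
    (State.OP_IDLE, "idle_cmd"): State.OP_IDLE,
    (State.OFF_NOMINAL, "idle_cmd"): State.OP_IDLE,
    (State.OP_IDLE, "safed_cmd"): State.SAFED,
    (State.OFF_NOMINAL, "safed_cmd"): State.SAFED,
    (State.SAFED, "power_off"): State.OFF,
}

def run(events, state=State.OFF):
    """Apply events in order and return the list of visited states.

    Events with no entry for the current state leave it unchanged.
    """
    trace = [state]
    for event in events:
        state = TRANSITIONS.get((state, event), state)
        trace.append(state)
    return trace

if __name__ == "__main__":
    example = ["power_on", "self_test_ok", "idle_cmd", "safed_cmd", "power_off"]
    print([s.value for s in run(example)])
```

Keeping the transitions in a table rather than in nested conditionals makes it easy to compare them against the rack rules one entry at a time.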
12
Outline
• Introduction to FCF and Project Motivation
• Architecture & Operation
• FCF SPIN Model
– Component model
– Fault injector
– Verification predicates
• Experiments
• Impact & Future Work
13
Component Model
• Used the SPIN model checker
– programmed a model of the operation of FCF in PROMELA, SPIN's input language
• Each component is modeled as several PROMELA processes
– implement the main component functionality
– run in parallel
– functionality
• Command Handler
• State Manager
• Power On/Power Off
• Rack manager is modeled as a set of PROMELA processes providing additional functionality
– Health monitoring
– Action handlers
– Utilities
[Figure: two UML diagrams with «include» and «extend» relationships. The first shows Rack Manager, Command Handler, State Manager, Initialization, Power On, Power Off, and Main; the second shows the same elements plus Action Handlers, Health Monitor, and Utility Processes.]
14
Fault Injector
• Single PROMELA process
• Introduces two types of faults
– arbitrary state transitions
• e.g., op-idle from op-experiment
– Out-of-tolerance conditions
• e.g., rack door open
• The fault injections are not coordinated between components: injector may introduce faults in multiple components simultaneously
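The effect of an uncoordinated injector on the size of the state space can be sketched in a few lines of Python. This is an illustration only: the state names are taken from the slides, but the function, the `max_faulty` limit, and the treatment of the rack door as a single boolean are assumptions, and the real injector is a PROMELA process.

```python
from itertools import combinations, product

COMPONENT_STATES = ("off", "safed", "op-idle", "op-experiment", "off-nominal")

def inject_faults(rack, door_open, max_faulty=None):
    """Yield every (rack, door_open) pair the injector could produce.

    rack maps a component name to its current state. Any subset of up to
    max_faulty components (default: all of them) may jump to any other
    state, and the rack door may additionally be found open.
    """
    names = sorted(rack)
    limit = len(names) if max_faulty is None else max_faulty
    for count in range(limit + 1):
        for hit in combinations(names, count):
            # Each hit component moves to a state other than its current one.
            choices = [[s for s in COMPONENT_STATES if s != rack[c]] for c in hit]
            for new_states in product(*choices):
                faulty = dict(rack)
                faulty.update(zip(hit, new_states))
                for door in sorted({door_open, True}):
                    yield dict(faulty), door

if __name__ == "__main__":
    rack = {"IPSU": "op-idle", "FCU": "op-experiment"}
    print(len(list(inject_faults(rack, False, max_faulty=1))), "single-fault outcomes")
    print(len(list(inject_faults(rack, False))), "outcomes with simultaneous faults")
```

For two components this gives 18 outcomes when at most one component is hit and 50 when both may be hit at once, which shows how quickly simultaneous faults enlarge the space the verifier has to cover.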
15
Verification Predicates
• verified nine critical predicates (three examples are below)
• predicates expressed as Linear Temporal Logic (LTL) formulae
• components are in a safe state upon the rack manager entering off-nominal
• components terminate operations and enter safe state upon discovery that communications has been lost with the rack manager (IOP)
• rack manager powers down all hazardous items upon detection that the rack door is open
[Three LTL formulae; the temporal and logical operators did not survive extraction. The surviving symbols show that the first formula relates p to per-component propositions for IPSU_1 through IPSU_6, FCU, FSAP, and PIP; the second relates z to t for each of the same components; the third relates l to m.]
• where: l – rack door open; m – hazardous items shutdown; p – IOP off-nominal; q – idle; r – safe; s – good_off; t – bad_off; z – lost communications with IOP
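One plausible reading of the three example predicates, written as LTL response properties ("always, if the trigger holds, then eventually the reaction holds"). The operators are an assumption, since they are not legible in this transcript, and so is the exact set of disjuncts in the first formula; C stands for the component set {IPSU_1, ..., IPSU_6, FCU, FSAP, PIP}. The symbols \Box and \Diamond require the amssymb package.

```latex
\begin{align*}
&\Box\Bigl(p \rightarrow \Diamond \bigwedge_{c \in C} \bigl(q_c \lor r_c \lor s_c \lor t_c\bigr)\Bigr) \\
&\Box\bigl(z_c \rightarrow \Diamond\, t_c\bigr) \quad \text{for each } c \in C \\
&\Box\bigl(l \rightarrow \Diamond\, m\bigr)
\end{align*}
```

Read in order: once the IOP is off-nominal, every component eventually reaches one of its safe states; a component that has lost communications with the IOP eventually shuts itself off; an open rack door eventually leads to shutdown of the hazardous items.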
16
Outline
• Introduction to FCF and Project Motivation
• Architecture & Operation
• FCF SPIN Model
• Experiments
– Simulation
– Verification
• Impact & Future Work
17
Experiment Phases
• Simulation
– Design and implement a model of the FCF in PROMELA
– Debug the model in the simulator
– Add fault injector
– Further debugging
• Verification
– Verify combined model in the SPIN verifier
18
Simulation
• Simulation
– Interactive, guided and randomized execution of the FCF model
– Used SPIN simulation tool
• Objective
– Debug model
• Possible to rerun exact iteration from previous execution
• Determine correct operation of the model
• Outcome
– 100 executions with different seeds
• Executed different paths and scenarios
• Provided some assurance of the stability of the model
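The role of the seed in randomized simulation can be shown with a short Python sketch: the same seed replays exactly the same execution, while different seeds explore different paths. The state list and the uniformly random choice of next state are assumptions for illustration and do not reflect the transition rules of the FCF model or SPIN's internals.

```python
import random

STATES = ("initialization", "op-idle", "op-experiment", "off-nominal", "safed")

def simulate(seed, steps=20):
    """Return one randomized execution of a toy model as a list of states."""
    rng = random.Random(seed)  # private generator, so runs do not interfere
    state = "initialization"
    trace = [state]
    for _ in range(steps):
        state = rng.choice(STATES)  # stand-in for a nondeterministic choice
        trace.append(state)
    return trace

if __name__ == "__main__":
    assert simulate(7) == simulate(7)  # same seed, same execution
    distinct = {tuple(simulate(seed)) for seed in range(100)}
    print(len(distinct), "distinct executions from 100 seeds")
```

Recording the seed alongside a failing run is what makes it possible to rerun the exact execution while debugging.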
19
Verification
• Verification: exhaustive exploration of the model's state space and verification of the predicates
– Note: state space includes every possible fault and fault combination
– guarantees that the model satisfies the verified predicates
• Outcome
– Verified no invalid end states or acceptance cycles in the model
• deadlock, never-ending loop, etc.
– Verified against all selected predicates
20
Outline
• Introduction to FCF and Project Motivation
• Architecture & Operation
• FCF SPIN Model
• Experiments
• Impact & Future Work
21
Impact
• Fluids and Combustion Facility
– found two errors, which were corrected in the actual design
– added assurance of the soundness of the design
– proposed and verified design modifications intended to increase robustness in future versions
• Self-Stabilization
– first known application of model checking verification to a deterministic self-stabilizing system
– demonstrated the power of self-stabilization as an approach to fault-tolerance design of a practical system in a harsh, fault-prone environment
• Personal
– after publishing this research the first author secured a position at Boeing Research where she currently works on the fault-tolerance verification of real-time systems
22
More Info and Future Work
Extended version of the ADSN article is available as a KSU technical report TR-KSU-CS-2005-02
http://www.cs.kent.edu/techreps/TR-KSU-CS-2005-02.pdf
Future work
• Extend tolerance properties and design changes
– implement crash-failure tolerance (e.g., the IOP)
• IOP failover
• inter-rack control of power
• IOP-awareness for components
– more detailed implementation
• introduce real-time properties
– e.g., verify against timing constraints
• Devise ways to verify the conformance of the SPIN model to the actual system