organizational failure (lscits engd 2012)


Upload: ian-sommerville

Post on 13-Jan-2015


DESCRIPTION

Discusses the organizational issues that affect systems failure

TRANSCRIPT

Page 1: Organizational Failure (LSCITS EngD 2012)

Organisational Failure, York EngD Course in LSCITS, 2012 Slide 1

Organisational Failure

Prof Ian Sommerville

Video link


Organisational failure

• Why and how organisational factors can contribute to system failures


Why do organisations matter?

• Organisations have multiple, inter-related, potentially conflicting goals:

– Efficient resource utilisation

– Timely delivery of products/services

– Customer satisfaction

– Owner satisfaction

– Regulatory compliance

– Safety and dependability

– Maintenance of reputation/brand

– Future development


Decision making

• Organisational decision making involves taking all of these into account

– Inevitably, this sometimes means making compromises that affect the safety and dependability of a system

• These compromises lead to vulnerabilities and hazards that may then compromise the safety or dependability of the system

• In complex organisations, there are competing priorities in different parts of the organisation

– Shifting power and authority in an organisation affects decision making

– May be deliberate lack of communications across the organisation


NASA Challenger disaster

• Space shuttle exploded shortly after take-off

• The cause was the failure of rubber seals (O-rings) that allowed hot gas to escape and make contact with fuel tanks which then exploded

• Subsequent enquiry showed that O-ring failure was due to brittleness at low temperatures

• Arguably, decision makers were complacent because

– Redundant (primary and secondary) O-rings in the system

– Damage to primary O-rings had been tolerated in previous launches


Organisational failure

• Engineers were concerned about launching in low temperatures and advised against launch

• But goals other than safety and dependability took precedence and engineers were overruled

– ‘Owner’ satisfaction

• already several delays to flight

– Future planning

• NASA wanted a success to support budget negotiations

– Resource utilisation

• Reluctance to address known problem with O-rings because of costs


Normal accidents

• Developed by Charles Perrow who conducted a study of a nuclear accident in the USA (Three Mile Island)

• Official conclusion was that the problems were due to “human error”

• Perrow disagreed with this and argued that failures are ‘normal’ and inevitable in complex systems which have:

– Interactive complexity

• The presence of unfamiliar, unplanned and unexpected sequences of events in a system that are not visible or immediately comprehensible

– Tight coupling

• The presence of interdependent components.

• Tight coupling will make a system more prone to cascading errors.
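The cascading effect of tight coupling can be illustrated with a minimal sketch. It is not taken from Perrow's work: the component names and the dependency maps are hypothetical, and the model pessimistically assumes that a failure always propagates to every directly dependent component.

```python
from collections import deque


def cascade(dependents, initial_failure):
    """Return the set of components that fail when `initial_failure` fails.

    `dependents` maps each component to the components that fail if it
    fails (an assumed worst-case model with no buffers or slack).
    """
    failed = {initial_failure}
    queue = deque([initial_failure])
    while queue:
        current = queue.popleft()
        for nxt in dependents.get(current, ()):
            if nxt not in failed:
                failed.add(nxt)
                queue.append(nxt)
    return failed


# Hypothetical tightly coupled system: each component feeds the next directly.
tight = {"pump": ["cooling"], "cooling": ["reactor"], "reactor": ["turbine"]}

# Same components, but a buffer decouples the cooling system from the pump.
loose = {"pump": [], "cooling": ["reactor"], "reactor": ["turbine"]}

print(sorted(cascade(tight, "pump")))
print(sorted(cascade(loose, "pump")))
```

In the tightly coupled map a single pump failure reaches every component, while in the decoupled map it stays local.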


Redundancy

• The use of redundancy is a fundamental technique in achieving system safety

– Primary and secondary O-rings on space shuttle

– Quintuple redundancy in the Airbus flight control system (FCS)

• Failure of primary system can be tolerated

• Perrow argues that redundancy can decrease rather than increase safety:

– Increases complexity and coupling in the system

– Provides reassurance that system faults can be tolerated, which may encourage complacency
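One way to see why coupling can erode the benefit of redundancy is a rough calculation. The sketch below uses a simple beta-factor style approximation, which is an assumption introduced here and not part of the lecture: `p` is the failure probability of one unit and `beta` is the assumed fraction of failures that share a common cause (for example, both O-rings being exposed to the same low temperature).

```python
def pair_failure_probability(p, beta):
    """Approximate probability that both units of a redundant pair fail.

    p    -- failure probability of a single unit
    beta -- assumed fraction of failures with a common cause
            (beta = 0 means the two units fail independently)
    """
    independent = ((1 - beta) * p) ** 2  # both fail for unrelated reasons
    common = beta * p                    # one shared cause takes out both
    return common + independent


# Illustrative numbers only; they are not measured data.
print(pair_failure_probability(0.01, 0.0))
print(pair_failure_probability(0.01, 0.1))
```

With these illustrative numbers, a 10% common-cause fraction makes the redundant pair roughly ten times more likely to fail than the independent case suggests.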


Failures or successes

• Normal accident theory is based on extensive studies of system failures

• It argues that failure is systemic and an inherent characteristic of the system itself

• Alternative perspective is based on studies of success

– Why are there some areas that are apparently complex (e.g. air traffic management) where failures are relatively uncommon?

• Led to the notion of high-reliability organisations


Failure-free organisations?

• High-reliability organisation (HRO) researchers disagree that complex, highly interdependent systems will inevitably have accidents

– They believe organisations are able to compensate for technical shortcomings through their methods of operation. In essence, they argue that organisations can be ‘failure free’.

• Based on studies of ‘reliable’ organisations

– Aircraft carriers

– Air traffic control

– Nuclear power stations

– Intensive care units


Aircraft carrier flight operations


Nuclear powered carriers

• Complex systems

– Carriers are 24 storeys high and carry enough fuel for 15 years, with 2,000 telephones and 3,360 compartments and spaces

– Multiple software intensive systems (command systems, aircraft software)

– Dangerous objects (aircraft, fuel, and explosives) in close proximity.

– Aircraft taking off and landing in 48-60 second intervals.

– 6000 crew. Several different kinds of aircraft, multiple squadrons.

– All work interdependently and must be coordinated.


Nuclear powered carriers

• High risk

– Nuclear reactor accidents

– Fire, flooding, grounding, collision

– Fuel and weapons explosions

– Mistaken identification of friends and foes

– High risks both to crew and a much larger public

• High reliability

– Low “crunch rates”

– Comparatively few major accidents

• High reliability achieved through organisational design


High Reliability Organisations

• High Reliability Organisations (HROs) have particular qualities

– Reliability takes precedence over efficiency

– Preoccupation with failure, not success

– Share the big picture

– Focus on details

– Migrate decisions


Reliability over Efficiency

– Reliability comes before efficiency but cannot replace it

– Decisions are made on the grounds of reliability first and then efficiency

– Efficiency initiatives are treated with scepticism

– Managers regularly talk to staff to familiarise themselves with how staff do their work and why. This stops managers focusing only on figures.

– Organisations develop safety measures as well as financial measures, and include these in employee evaluations

– Organisations assign value to the avoidance of accidents

– High redundancy despite cost

– Cautious actions when necessary despite cost


Preoccupation with Failure

• HROs recognise that:

– Workers need to be heedful of the possibility of failure

– Failures are normal but accidents should be avoided

– There can be unexpected failure modes, even in common activities

• HROs address failure by:

– Constant training of all people (simulations, apprenticing, practice)

– Using incident reporting

– Designing in extensive redundancy

– Maintaining contingencies for critical operations

– Requiring proof that something is safe, rather than proof that it is unsafe


Carrier operations

– There is constant tracking of issues around malfunctioning, defective and substandard equipment.

• They act on these by training crew how to overcome problems and pressuring vendors to make improvements

– Extensive redundancy (overlapping jobs, multiple channels and centres of communications, spare parts, multiple sources for decision making).

• Example: if an aircraft’s landing gear warning light comes on, the spotter, commander and pilot all work together to establish what the issue is.

– Multiple contingencies are maintained

• Example: There will always be multiple options for how to land the plane (or for the pilot to escape).


Sharing the Big Picture

• HROs recognise that:

– If people are narrowly focused they will act only in their own interest

– People need to maintain awareness of other people and events around the organisation

• HROs

– Train people broadly

– Educate people about overarching objectives, and set statements of purpose

– Give people access to information on what is happening elsewhere

– Clearly specify how people and teams fit into the whole


Reluctance to Simplify

• HROs are reluctant to simplify

• All organisations have to simplify and abstract, to filter out unnecessary information (particularly for getting “big pictures”)

• However, HROs

– Use labels and categories as little as possible, as they stop you from looking further into details and events

– Continually rework labels and categories

– Listen to wisdom, but with scepticism

– Focus on information that does not fit or that disconfirms desires, rather than on information that supports expectations


Migration of decision making

• HROs migrate decision making as far down the organisation as possible

– Decisions are not made by one central authority

• HROs recognise:

– Decisions need to be made where there is expertise

– Decisions often need to be made quickly

– People must be trained in making decisions and given the right resources to do so

– Skills and legitimacy must be spread through the organisation, and people must be trusted


HROs and Normal Accidents

• HRO theory is sometimes presented as conflicting with Normal Accidents

– HRO proponents may argue that accidents are not ‘normal’

– Leveson critiques work on HROs, arguing that it is not based on studies of tightly coupled systems

• Arguably, an HRO is an organisation that has taken active steps to:

– reduce coupling and

– reduce interactions

• Once that has been achieved, the driver for HROs is perhaps a strong ‘safety culture’ to promote safety across the organisation


Organisational vulnerabilities

• Organisational vulnerabilities are characteristics of an organisation that weaken defensive layers and so may lead to system failure.

• Examples of organisational vulnerabilities

– Over-reliance on process to achieve safety/dependability

– Responsibility failures

– Weak safety/dependability culture

– Under-resourcing of safety


Over-reliance on process

• Quality standards such as ISO 9000 place great emphasis on process and process assurance

– Implication of these standards is that process is paramount

• This tends to promote a belief that focusing on process is the way to achieve safety and dependability

• However, processes are never isolated and have to be enacted in a dynamic context

• Sometimes necessary to deviate from the ‘normal’ process to achieve safety and dependability


Responsibility failures

• System failures are often a consequence of responsibility failures

– Unassigned responsibility

– Misassigned responsibility

– Misunderstood responsibility

– Duplicated responsibilities

– Responsibility overload

– Responsibility fragility

• Responsibility failures may be a consequence of poor communications and/or under-resourcing
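Some of these failure types can be checked mechanically once responsibilities are written down. The following is a minimal sketch under stated assumptions: the function name, the data layout, the example names and the `max_load` threshold are all hypothetical, and only three of the listed failure types (unassigned, duplicated, overload) are covered.

```python
from collections import defaultdict


def check_responsibilities(responsibilities, assignments, max_load=3):
    """Flag unassigned, duplicated and overloaded responsibilities.

    responsibilities -- iterable of responsibility names
    assignments      -- list of (person, responsibility) pairs
    max_load         -- assumed overload threshold (hypothetical value)
    """
    holders = defaultdict(set)  # responsibility -> people holding it
    load = defaultdict(set)     # person -> responsibilities held
    for person, resp in assignments:
        holders[resp].add(person)
        load[person].add(resp)
    return {
        "unassigned": sorted(r for r in responsibilities if not holders[r]),
        "duplicated": sorted(r for r in responsibilities if len(holders[r]) > 1),
        "overloaded": sorted(p for p, rs in load.items() if len(rs) > max_load),
    }


# Hypothetical example data.
responsibilities = ["approve launch", "inspect seals", "report incidents"]
assignments = [
    ("alice", "approve launch"),
    ("bob", "approve launch"),
    ("alice", "inspect seals"),
]
report = check_responsibilities(responsibilities, assignments, max_load=1)
print(report)
```

Misassigned and misunderstood responsibilities depend on people's knowledge and intentions, so a check of this kind cannot detect them.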


Organisational culture

• “The way that we do things around here”

• Culture may conflict with public statements of priorities

– “The patient comes first”

– “Safety is our goal”

• Investment banking

– High risk, high reward

– Lack of regulation or weak compliance with regulations

– Large-scale failures


Safety culture

• Some organisations have developed a strong safety culture where safety is seen as a priority by all members of the organisation

• Safety culture (UK HSE)

– “The product of individual and group values, attitudes, perceptions, competencies, and patterns of behaviour that determine the commitment to, and the style and proficiency of, an organization’s health and safety management”


Safety culture (Reason)


Safety maturity


Under-resourcing

• If operations are under-resourced then safety and dependability are often sacrificed

• Organisational priorities focus on optimising resource utilisation to continue service delivery

– Safety and dependability may be seen as an avoidable overhead

• Example

– Cleaning services in hospital outsourced to save money

– Competitive tender

– Under-resourced so quality of service reduced

• Consequent increase in hospital-acquired infections


Complex systems

• Complexity = Coupling + Interaction

• Lesson for LSCITS

– Increasing complexity will lead to unpredictable system failure

– Strive to build LSITS (large-scale IT systems) rather than LSCITS (large-scale complex IT systems)

• Improve safety by

– Reducing coupling

– Reducing interactions

• Redundancy may not improve safety as it increases complexity in the system

• Address problems at organisational as well as the system level


Key points

• Organisational decisions, influenced by structure and culture, often have a major impact on safety and dependability

• Normal Accident Theory postulates that accidents are inevitable in complex, tightly coupled systems

• High-reliability organisations seek to achieve safety through a set of practices intended to reduce failures

• Organisational vulnerabilities include over-reliance on process, responsibility failures, poor safety culture and under-resourcing