EMS Outages Analysis and Lessons Learned
Sam Chanoski, Director, Situation Awareness and Event Analysis
SPP RE Spring Workshop
March 11, 2015
RELIABILITY | ACCOUNTABILITY2
Agenda
• Introduction - Why This is Important
• Cause Analysis Trends
• Outage Attribute Trends and Analysis
• Lessons Learned
• Q & A
The Character of Harms
[Diagram: harms plotted by severity (vertical axis) against frequency (horizontal axis); labels: Inverse Cost-Benefit, Significance Threshold, Learn and Reduce, Avoid]
“Pick important problems and fix them”
Dr. Malcolm Sparrow, John F. Kennedy School of Government
Harms and Resilience
[Diagram: severity (vertical axis) versus frequency (horizontal axis); the two annotations below point to different regions of the plot]
What we learn, find and fix here…
...improves resilience when we get to here.
EMS Outage Harm Disaggregation
• Category 2b - Complete loss of SCADA, control or monitoring functionality for 30 minutes or more
• Category 1h - Loss of monitoring or control, at a control center, such that it significantly affects the entity’s ability to make operating decisions for 30 continuous minutes or more. Examples include, but are not limited to, the following:
  - Loss of operator ability to remotely monitor or control Bulk Electric System (BES) elements, or both
  - Loss of communications from SCADA RTUs
  - Unavailability of ICCP links reducing BES visibility
  - Loss of the ability to remotely monitor and control generating units via AGC
  - Unacceptable State Estimator or Contingency Analysis solutions
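
The two category definitions above share a 30-minute threshold and differ in whether the loss is complete or partial. A minimal Python sketch of that distinction follows. The `EmsEvent` record, its field names, and the `categorize` function are hypothetical and exist only for illustration; real categorization follows the full ERO Event Analysis Process definitions and involves judgment that a three-field record cannot capture.

```python
from dataclasses import dataclass
from typing import Optional

THRESHOLD_MINUTES = 30  # both category definitions use a 30-minute threshold


@dataclass
class EmsEvent:
    """Hypothetical event record; field names are illustrative only."""
    complete_loss: bool       # SCADA, control or monitoring functionality lost entirely
    affects_decisions: bool   # ability to make operating decisions significantly affected
    duration_minutes: float   # continuous duration of the loss


def categorize(event: EmsEvent) -> Optional[str]:
    """Return "2b", "1h", or None under the simplified reading described above."""
    if event.duration_minutes < THRESHOLD_MINUTES:
        return None           # below the reporting threshold in both definitions
    if event.complete_loss:
        return "2b"           # complete loss for 30 minutes or more
    if event.affects_decisions:
        return "1h"           # partial loss that significantly affects decisions
    return None
```

For example, `categorize(EmsEvent(complete_loss=False, affects_decisions=True, duration_minutes=45))` returns `"1h"` in this sketch.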
Performance, Regulation, and Excellence
[Diagram labels: Excellence; Normal Performance; Practical Minimum Acceptable; Regulatory Minimum Acceptable; Regulatory Craft; EA, Info Sharing; CMEP; Forums, Trades; Slope ≈ resiliency metric?]
Top Root Causes
[Bar chart: number of events (vertical axis 0 to 10) for each of the top root causes listed below; LTA = less than adequate]
• Information to determine cause LTA (AZ)
• Testing of Design/Installation LTA (A1B4C02)
• Software Failure (A2B6C07)
• Insufficient Job Scoping (A4B3C08)
• Inadequate Risk Assessment of Change (A4B5C04)
Top Contributing Causes
[Bar chart: number of events (vertical axis 0 to 40) by contributing cause code]
Cause codes on the horizontal axis, in chart order: A2B6C07, A1B2C01, A4B5C03, A2B7C04, A2B6C01, A1B4C02, A2B7C01, A4B5C05, A2B3C03, A4B5C04, A1B2C08, A2B3C02, A5B2C08, A2B3C01, AXB1, AXB2, A3B1C01, A3B2C01, A3B2C05, A3B3C01, A4B1C08, A4B5C13, A5B3C01, A5B4C01, AZB3C02, A2B2C01, A4B3C08, A5B1C03, A7B1C02, AX, A1B2C05, A1B5C02, A2B7C02, A3B1C02, A3B1C06, A3B3C03, A3B3C04, A4B1C09, A4B2, A4B2C08, A4B3, A4B3C09, A4B5C09, A6B3C02, A7B1, A7B3
Top contributing causes:
• Software failure (A2B6C07)
• Design output scope LTA (A1B2C01)
• Inadequate vendor support of change (A4B5C03)
• Undesired operation of coordinated systems (A2B7C04)
• Defective or failed equipment (A2B6C01)
• Testing of Design/Installation LTA (A1B4C02)
• Communication path LTA (A2B7C01)
• System interactions not considered or identified (A4B5C05)
• Post maintenance/post-modification testing LTA (A2B3C03)
• Inadequate Risk Assessment of Change (A4B5C04)
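
The cause codes on this slide are hierarchical: an A node, optionally followed by a B node and a C node (for example AZ, A4B2, A2B6C07). As a rough sketch, the Python below splits a code into those three levels and groups the ten named contributing causes by their A node. The regular expression is an assumption inferred from the codes visible on the slide, not a published grammar, and the tally counts distinct codes rather than events, because the per-code event counts did not survive in the chart.

```python
import re
from collections import Counter

# The ten contributing cause codes named on the slide.
TOP_CONTRIBUTING = [
    "A2B6C07", "A1B2C01", "A4B5C03", "A2B7C04", "A2B6C01",
    "A1B4C02", "A2B7C01", "A4B5C05", "A2B3C03", "A4B5C04",
]

# Assumed shape: an A node, then optional B and C nodes.
CODE_PATTERN = re.compile(r"^(A[0-9A-Z])(B\d+)?(C\d+)?$")


def split_code(code):
    """Split a cause code into (A, B, C) nodes; missing levels are None."""
    match = CODE_PATTERN.match(code)
    if match is None:
        raise ValueError(f"unrecognized cause code: {code!r}")
    return match.groups()


def count_by_a_node(codes):
    """Count how many of the given codes fall under each A node."""
    return Counter(split_code(code)[0] for code in codes)


print(count_by_a_node(TOP_CONTRIBUTING).most_common())
# [('A2', 5), ('A4', 3), ('A1', 2)]
```

Grouping at the A node is only one possible roll-up; grouping at the B node (for example A4B5) would separate the change-management causes from the other A4 codes.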
Characteristics of Complete EMS Outages (Oct 1, 2013 – Dec 31, 2014)
[Bar chart: number of events (vertical axis 0 to 30) by outage attribute]
• Scheduled maintenance activity occurring: Yes 19, No 8
• CIP related activity: Yes 3, No 24
• Weekday: Yes 22, No 5
Complete Outage Time of the Day (Oct 1, 2013 – Dec 31, 2014)
[Bar chart: number of events (vertical axis 0 to 6) in two-hour bins from 0:00 to 24:00; series: Scheduled Maintenance in progress, Outage Unforeseen]
Analysis of Complete Restoration Times (Oct 1, 2013 – Dec 31, 2014)
[Chart: complete outage restoration time in minutes (vertical axis 0 to 250) for each of 27 events, with the mean complete outage restoration time shown for reference]
Mean Complete Outage Restoration Time = 64 Minutes
2b Outage Restoration Time vs. Category vs. Event Count
[Combined chart by outage event category: number of events (axis 0 to 14) and restoration time in minutes (axis 0 to 160)]
Event counts by category, reading the data labels in chart order: EMS Hardware 2, EMS Software 12, Facilities 6, Network Infrastructure 6, Human Performance 1
Characteristics of Partial EMS Events (Oct 1, 2013 – Dec 31, 2014)
[Bar chart: number of events (vertical axis 0 to 80) by outage attribute]
• Scheduled maintenance activity occurring: Yes 24, No 51
• CIP related activity: Yes 3, No 72
• Weekday: Yes 50, No 25
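
The Yes/No counts on the two "Characteristics" slides are easier to compare as percentages, since the complete-outage counts sum to 27 events and the partial-event counts to 75. The short Python sketch below does that arithmetic. It assumes the Yes/No data labels pair with the attributes in the order they appear on the charts; the dictionary keys are illustrative names, not terms from the presentation.

```python
# (yes, no) counts read from the two "Characteristics" slides.
COMPLETE = {"scheduled_maintenance": (19, 8), "cip_related": (3, 24), "weekday": (22, 5)}
PARTIAL = {"scheduled_maintenance": (24, 51), "cip_related": (3, 72), "weekday": (50, 25)}


def yes_share(counts):
    """Percentage of events answering "Yes" for each attribute, to one decimal."""
    return {name: round(100 * yes / (yes + no), 1) for name, (yes, no) in counts.items()}


print(yes_share(COMPLETE))
# {'scheduled_maintenance': 70.4, 'cip_related': 11.1, 'weekday': 81.5}
print(yes_share(PARTIAL))
# {'scheduled_maintenance': 32.0, 'cip_related': 4.0, 'weekday': 66.7}
```

Under that pairing, scheduled maintenance was in progress during roughly 70% of complete outages but only about a third of partial events.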
Partial Outage Time of the Day (Oct 1, 2013 – Dec 31, 2014)
[Bar chart: number of events (vertical axis 0 to 14) in two-hour bins from 0:00 to 24:00; series: Scheduled Maintenance in progress, Outage Unforeseen]
Analysis of Partial Restoration Times (Oct 1, 2013 – Dec 31, 2014)
[Chart: total partial outage restoration time in minutes (vertical axis 0 to 250) for each event, with the mean partial outage restoration time shown for reference]
Mean Partial Outage Restoration Time = 71 Minutes
1h Outage Restoration Time vs. Category vs. Event Count
[Combined chart by category: mean outage restoration time in minutes (axis 0 to 160) and event count (axis 0 to 30). Categories: EMS Hardware; EMS Software - Ex Vendor; EMS Software - HP; EMS Software - Network Model; EMS Software - Vendor Bug; Facilities; Network Infrastructure. Data labels as extracted: 6, 45, 13, 25, 5, 17]
Trend of the Outage Restoration Time
[Chart: mean complete and mean partial outage restoration time in minutes (vertical axis 0 to 80) by year, 2010 to 2014]
Introduction
• Energy Management Systems (EMS) are extremely reliable
• EMS outages increase the risk to the real-time reliability of the grid
• EMS outages have NOT had an adverse impact on reliability
• 113 complete outages (Category 2b) reported through December 31, 2014; 24 event analyses in progress
• 84 partial outages (Category 1h) reported through December 31, 2014; 57 event analyses in progress
• 109 entities reported either a 1h or a 2b or both
  - 49 experiencing multiple outages
• Several noticeable themes…
Common Themes for Complete Outages
• Facilities maintenance affecting EMS
  - Planned UPS work led to multiple events
  - Planned telecommunication facilities work led to multiple events
• Software failures
  - Bugs in network infrastructure firmware
  - Latent bugs in network encryption software
  - Bug in vendor front end code
  - Incorrect scripts
• Configuration
  - Firewall
  - Improper VLANs
  - Switch
  - Memory allocation
  - File system mounts
Common Themes for Partial Outages
• Software bugs
  - Contingency Analysis code
  - ICCP code
  - AGC code
  - Initialization routines
• Modeling of external network
  - Inadequate modeling of external facilities before equivalence
  - Incorrect statuses
  - Incorrect tap positions
• Individual Human Performance and Organizational Challenges
  - Set tuning parameters
  - Turn on automatic SE/CA runs
  - Alarm for failed SE/CA runs
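
One theme above is alarming when State Estimator or Contingency Analysis (SE/CA) runs fail. A minimal Python sketch of such a staleness check follows. The function name, its parameters, and the 30-minute default are assumptions chosen for illustration (the default simply mirrors the 30-minute threshold in the Category 1h definition); an actual EMS would use its vendor's own alarm facilities and an operator-chosen limit.

```python
from datetime import datetime, timedelta
from typing import Optional


def se_alarm_needed(last_good_solution: Optional[datetime],
                    now: datetime,
                    max_age: timedelta = timedelta(minutes=30)) -> bool:
    """Return True when there is no valid solution or the last one is too old.

    last_good_solution: time of the most recent acceptable SE/CA solution,
    or None if no run has solved (hypothetical input).
    """
    if last_good_solution is None:
        return True
    return now - last_good_solution > max_age


# Usage: a solution from 45 minutes ago exceeds the assumed 30-minute limit.
now = datetime(2015, 3, 11, 12, 0)
print(se_alarm_needed(now - timedelta(minutes=45), now))  # True
```

Checking the age of the last good solution, rather than only the status of the latest run, also catches the case where automatic runs were never turned back on.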
Common Themes for Partial Outages
• External Vendor or Contractor
  - Leased lines from AT&T, Verizon, etc.
  - SONET ring maintenance
• Hardware failure
  - Network interface cards (NICs)
  - Inverters
  - Routers
• Failover Testing
  - Incorrect procedures
  - Missing files
  - Corrupted executables
Strengths to Sustain
• Communicating with neighboring operators and Reliability Coordinator
• Decision making processes to man substations during an outage
• Increased vigilance during outages, partial monitoring from neighbors and RC
• Improving causal analysis of events
• Improving Event Analysis report quality
• Voluntary reporting, helping everyone get better
Opportunities for Improvement
• Change management, particularly in the context of job scoping
• Use of EMS test systems (aka sandbox, QA, dev)
• Network infrastructure testing
• Routine failover testing
• Vendor relationship
• Communication – internal and external
• Task scoping and job aids
Related NERC Lessons Learned
• LL20141201 Control System Network Switch Failure
• LL20140902 Loss of EMS Dispatch Workstation Functionality Due to NTP Time Synchronization Device Misconfiguration
• LL20140901 Redundant Network Interface Cards on EMS
• LL20140802 Loss of EMS Monitoring and Control Functionality for More Than 30 Minutes
• LL20140604 Loss of SCADA Due to Memory Resources Being Fully Utilized
• LL20131003 Failure of EMS While Performing Database Update
• LL20131002 SCADA Failure Resulting in Reduced Monitoring Functionality
Related NERC Lessons Learned
• LL20130802 Indistinguishable Screens During a Database Update Led to Loss of SCADA Monitoring and Control
• LL20130801 Inappropriate System Privileges Caused Loss of SCADA Monitoring
• LL20130204 Failure of EMS Due to Over-Utilization of Disk Storage
• LL20130203 SCADA Failure Resulting in Loss of Monitoring Functionality
• LL20130202 EMS Loss of Operators’ User Interface Application
• LL20130201 EMS System Outage and Effects on System Operations
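
Two of the lessons listed above (LL20130204 and LL20140604) involve storage or memory being fully used. As a hedged illustration of the kind of headroom check those titles suggest, the Python sketch below classifies disk utilization against warning and critical levels. The function names and the 80% and 90% thresholds are assumptions made for this example and are not taken from the lessons learned documents; `shutil.disk_usage` is the standard-library call used to read the figures.

```python
import shutil


def usage_percent(used: int, total: int) -> float:
    """Utilization as a percentage of total capacity."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * used / total


def classify(percent: float, warn_at: float = 80.0, critical_at: float = 90.0) -> str:
    """Map a utilization percentage to "ok", "warning", or "critical"."""
    if percent >= critical_at:
        return "critical"
    if percent >= warn_at:
        return "warning"
    return "ok"


def disk_status(path: str) -> str:
    """Classify utilization of the file system that contains the given path."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return classify(usage_percent(usage.used, usage.total))


print(disk_status("."))  # result depends on the machine running the check
```

Alerting at a warning level well below full gives staff time to act before an application fails, which is the common thread in both lesson titles.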
Additional Information
• ERO Event Analysis Program (http://www.nerc.com/pa/rrm/ea/Pages/EA-Program.aspx)
• NERC Lessons Learned (http://www.nerc.com/pa/rrm/ea/Pages/Lessons-Learned.aspx)
• NERC EMS conference presentations (http://www.nerc.com/pa/rrm/Resources/Pages/Conferences-and-Workshops.aspx)
Sam Chanoski
Director, Situation Awareness & Event Analysis
Office (404) 446-9706
Cell (404) 904-2480