EMS Outages and Lessons Learned: QSE

2014 ERCOT Operations Training Seminar, Texas Reliability Entity
Jagan Mandavilli, Bob Collins, Mark Henry

Page 1: EMS Outages and Lessons  Learned QSE

2014 ERCOT Operations Training Seminar
Texas Reliability Entity

Jagan Mandavilli, Bob Collins, Mark Henry

EMS Outages and Lessons Learned
QSE

Page 2: EMS Outages and Lessons  Learned QSE

Objectives

Upon completing this course of instruction, you will:

● Recognize the typical causes and failure modes for Energy Management Systems (EMS) and tools
● Identify the importance of some of the tools QSEs use
● Identify the EMS applications critical to your operation
● Recognize the QSE operator’s role in identifying problems and reporting EMS failures
● Identify the components of the procedures for operation of the system during EMS failures

Page 3: EMS Outages and Lessons  Learned QSE

Content

● EMS Failures
  o Communication and Control (EMS) Failures
    • Inter-Control Center Communications Protocol (ICCP) failures
    • Remote Terminal Unit (RTU) issues
  o EMS Applications Failures
    • Automatic Generation Control (AGC) Failure
    • SCADA failures
  o Backup Control center operation
  o Loss of Operator User Interface
  o EMS failures due to database updates
  o Training and Live EMS Screens on same display
● Analysis of Restorations
● Contributing & Root causes with examples
● Common themes with examples

Page 4: EMS Outages and Lessons  Learned QSE

Definitions

● SCADA – Supervisory Control And Data Acquisition
● EMS – Energy Management System
● AGC – Automatic Generation Control
● LFC – Load Frequency Control
● ICCP – Inter-Control Center Communications Protocol
● RTU – Remote Terminal Unit
● EAS – Event Analysis Subcommittee
● EMSTF – Energy Management Systems Task Force
● SCED – Security Constrained Economic Dispatch

Page 5: EMS Outages and Lessons  Learned QSE

Tools and Their Importance

● SCADA
● AGC/LFC
● ICCP
● SCED

Page 6: EMS Outages and Lessons  Learned QSE

ERCOT EMS Overview

Page 7: EMS Outages and Lessons  Learned QSE

EMS Reliability

● EMS are extremely reliable
● Extremely high industry-wide availability
● Systems usually have redundancy
● Multiple systems are common, with on-the-fly failover
● Backup centers, sometimes manned
● Communications circuits on highly redundant ring networks
● Data handling has built-in error detection and correction
● Support staff available 24 x 7

Page 8: EMS Outages and Lessons  Learned QSE

What do EMS Problems Look Like?

● Trends flatline
● Data no longer updates
● Color changes
● Alarms
● Strange application results
● Lockup of applications
● Loss of Visibility

Page 9: EMS Outages and Lessons  Learned QSE
Page 10: EMS Outages and Lessons  Learned QSE

NERC EMS Failure Event Analysis

● NERC and personnel examined events
  o 81 Category 2b events (Oct 26, 2010 – Sep 3, 2013) reported
  o 64 events thoroughly analyzed and reviewed
  o 54 entities reporting, 20 entities experiencing multiple outages
  o Restoration time for partial outages: 18 to 411 min
  o Restoration time for complete outages: 12 to 253 min
  o Vendor diagnostic failures – Software & Hardware Issues
  o Several noticeable themes

Page 11: EMS Outages and Lessons  Learned QSE

NERC Lessons Learned from EMS Events #1

● Remote Terminal Units Not on DC Sources
  o The power supply to an RTU for a High Voltage Direct Current (HVDC) converter station was not designed to be fed from station batteries, resulting in a loss of the RTU when all AC feeds to the substation were lost due to an event.
● Lesson Learned
  o While the availability of multiple AC sources provides a deep degree of reliability for RTUs, entities should evaluate the practicality and feasibility of powering RTUs needed for control, situation awareness, system restoration and/or post analysis from the station batteries.

Operator Training Seminar 2014

Page 12: EMS Outages and Lessons  Learned QSE

NERC Lessons Learned from EMS Events #2

● EMS System Outage and Effects on System Operations
  o An entity’s EMS began to lose data necessary for visibility of portions of its transmission network, causing functionality and/or solution interruptions for some of its EMS operational tools. No loss of load occurred during this event, and it was quickly determined not to be a cyber security event.
● Lessons Learned
  o All entities should have a procedure such as “Conservative Operations” which provides possible steps they may have to take to ensure reliability. Training should be conducted routinely on all procedures, especially those related to low-probability, high-impact events, regardless of how often the procedures are used.

Page 13: EMS Outages and Lessons  Learned QSE

NERC Lessons Learned from EMS Events #3

● EMS Loss of Operators User Interface Application
  o A control center experienced a loss of control and monitoring functionality of the EMS due to the loss of the operator’s user interface application between its primary EMS computer/host server and the system operator consoles.
● Lessons Learned
  o Create a ‘save case’ of settings before and after any change to the system is made. The ‘save case’ will aid in supplying the necessary documentation needed to perform comparisons.
  o Analyze EMS performance on a periodic basis and evaluate if the system is meeting the needs as designed and intended.
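The 'save case' comparison in Lesson #3 can be illustrated with a minimal Python sketch. It assumes a save case can be exported as a flat mapping of setting names to values. The function name and the setting names are hypothetical and do not come from any EMS vendor's tooling.

```python
from typing import Any


def diff_save_cases(before: dict[str, Any], after: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {setting: (old, new)} for every setting that differs.

    A setting that is absent from one snapshot is reported with None on
    that side, so a real value of None cannot be told apart from "missing".
    """
    changes = {}
    for key in sorted(before.keys() | after.keys()):
        old, new = before.get(key), after.get(key)
        if old != new:
            changes[key] = (old, new)
    return changes


# Hypothetical setting names and values, for illustration only.
before = {"scada.db_mode": "remote", "alarm.limit_mw": 50}
after = {"scada.db_mode": "local", "alarm.limit_mw": 50, "ui.timeout_s": 30}

changes = diff_save_cases(before, after)
for name, (old, new) in changes.items():
    print(f"{name}: {old!r} -> {new!r}")
```

Keeping both snapshots and the resulting difference with the change record gives support staff a starting point when a problem appears after a change.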

Page 14: EMS Outages and Lessons  Learned QSE

NERC Lessons Learned from EMS Events #4

● SCADA Failure Resulting in Loss of Monitoring Function
  o A Transmission Owner (TO)’s control center experienced a SCADA failure which resulted in a loss of monitoring functionality for more than thirty minutes.
● Lessons Learned
  o It is beneficial that Transmission Operators (TOP) and TOs install a “heartbeat monitor” alarm to detect stale or stagnant data.
  o A periodic evaluation of the mismatch thresholds should be conducted for state estimator alarming specific to each operating area, such that it will allow for the optimum sensitivity while minimizing false mismatch alarms.
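One way to picture the "heartbeat monitor" idea in Lesson #4 is a check that raises a flag when a value has not changed for longer than expected. The Python sketch below is illustrative only: the class name and the 10-second threshold are assumptions. A production heartbeat would more likely watch a dedicated point that the source is expected to change on every scan, since a healthy analog value can legitimately hold steady.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StaleDataMonitor:
    """Flags a data point as stale when its value has not changed for too long."""

    max_unchanged_s: float                   # assumed threshold; tune per point
    _last_value: Optional[float] = None
    _last_change_t: Optional[float] = None

    def update(self, value: float, now: float) -> bool:
        """Record a sample taken at time `now` (seconds); return True if stale."""
        if self._last_change_t is None or value != self._last_value:
            # First sample, or the value moved: restart the unchanged timer.
            self._last_value = value
            self._last_change_t = now
            return False
        return (now - self._last_change_t) > self.max_unchanged_s


monitor = StaleDataMonitor(max_unchanged_s=10.0)
# (time in seconds, frequency in Hz); the value stops moving after t = 4 s.
samples = [(0.0, 60.01), (4.0, 60.02), (8.0, 60.02), (16.0, 60.02)]
flags = [monitor.update(value, t) for t, value in samples]
print(flags)
```

The last sample is flagged because the value has been unchanged for 12 seconds, which exceeds the assumed 10-second limit.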

Page 15: EMS Outages and Lessons  Learned QSE

NERC Lessons Learned from EMS Events #5

● Failure of EMS Due to Over-Utilization of Disk Storage
  o Loss of control functionality due to the hard disk on the SCADA server being fully utilized.
● Lessons Learned
  o SCADA equipment monitoring should include monitoring of hard disk storage utilization. Purging processes need to be set up to perform periodic cleanup of disk space.
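The disk monitoring recommended in Lesson #5 can be sketched in a few lines of Python using the standard library call `shutil.disk_usage`. The 80% and 90% thresholds and the function names are assumptions for illustration, not values taken from the NERC lesson.

```python
import shutil


def disk_utilization(path: str) -> float:
    """Return the fraction (0.0 to 1.0) of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def disk_alarm_level(utilization: float, warn: float = 0.80, critical: float = 0.90) -> str:
    """Map a utilization fraction to an alarm level using assumed thresholds."""
    if utilization >= critical:
        return "CRITICAL"
    if utilization >= warn:
        return "WARNING"
    return "OK"


# Check the filesystem that holds the current working directory.
print(disk_alarm_level(disk_utilization(".")))
```

Alarming well before the disk is full leaves time for the purge process, or for support staff, to free space before applications are affected.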

Page 16: EMS Outages and Lessons  Learned QSE

NERC Lessons Learned from EMS Events #6

● Indistinguishable Screens during a Database Update Led to Loss of SCADA Monitoring and Control
  o During a planned database update and failover, an EMS Operations Analyst inadvertently changed an online SCADA server database mode from “remote” (online) to “local” (local offline copy), which caused a loss of SCADA monitoring and control of Bulk Electric System (BES) facilities.
● Lessons Learned
  o Changing the database mode on a server is not recommended. A future release of EMS software should eliminate the ability to switch database modes on a server.

Page 17: EMS Outages and Lessons  Learned QSE

NERC Lessons Learned from EMS Events #7

● Inappropriate System Privileges Causes Loss of SCADA Monitoring
  o An entity experienced a loss of SCADA telemetry, specifically a loss of the channel status indicators, for 76% of its transmission system. This problem occurred during the implementation of a scheduled SCADA database update that caused one of the front-end processors to be in an abnormal state. An incorrect command was used to remedy the situation, which resulted in the channel status indicators being set to a failed state.
● Lessons Learned
  o Entities should consider:
    • Reviewing the training with respect to change management to ensure that it includes a checklist of steps required; and
    • Educating SCADA support staff on global impact of commands on the entire SCADA system.

Page 18: EMS Outages and Lessons  Learned QSE

NERC Lessons Learned from EMS Events #8

● Loss of EMS – IT Communications Disabled
  o Transmission System Operators lost ability to authenticate to the EMS system, resulting in a loss of monitoring and control functionality for more than 30 minutes.
● Lessons Learned
  o EMS network design should include, where possible, a redundant local authentication server on the same internal network as the primary local authentication server.
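The redundancy recommended in Lesson #8 amounts to trying a second authentication server when the first cannot be reached. The Python sketch below shows only that control flow. The servers are modeled as plain callables, and every name and credential in it is hypothetical. It does not represent any real authentication product or protocol.

```python
from typing import Callable, Sequence

# A server is modeled as a callable that returns True/False for a credential
# check and raises ConnectionError when it cannot be reached.
Authenticator = Callable[[str, str], bool]


def authenticate_with_failover(servers: Sequence[Authenticator], user: str, password: str) -> bool:
    """Try each server in order; move to the next one only on a connection error."""
    last_error = None
    for server in servers:
        try:
            return server(user, password)
        except ConnectionError as exc:
            last_error = exc
    raise ConnectionError("no authentication server reachable") from last_error


def primary_down(user: str, password: str) -> bool:
    raise ConnectionError("primary unreachable")


def backup(user: str, password: str) -> bool:
    # Illustrative credential check only.
    return (user, password) == ("operator1", "example-only")


result = authenticate_with_failover([primary_down, backup], "operator1", "example-only")
print(result)
```

A rejected credential is returned as a normal answer and does not trigger failover, so the redundant server is used only when the primary is unavailable.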

Page 19: EMS Outages and Lessons  Learned QSE

NERC Lessons Learned from EMS Events #9

● SCADA Failure Resulting in Reduced Monitoring Functionality
  o An entity’s primary control center SCADA Management Platform (SMP) servers became unresponsive, which resulted in a partial loss of monitoring and control functions for more than 30 minutes. Because this loss of functionality was a result of a conflict between security software configuration changes and core operating system functions, a cyber-security event was quickly ruled out, and no loss of load occurred during this event.
● Lessons Learned
  o Registered entities should consider a “multi-site hosting” configuration. This configuration provides flexibility and convenience for rapid recovery capability of EMS and SCADA functions.

Page 20: EMS Outages and Lessons  Learned QSE

NERC Lessons Learned from EMS Events #10

● Failure of Energy Management System While Performing Database Update
  o There was a failure of EMS while performing a database update.
● Lessons Learned
  o When the EMS was purchased, the vulnerability of an integrated system architecture was unknown. To eliminate this now-exposed vulnerability, it is recommended that functional separation of the PCC from the ACC be implemented.

Page 21: EMS Outages and Lessons  Learned QSE

Number of Reports

October 26, 2010 – September 3, 2013

[Bar chart: number of reports per quarter, 2010 Q4 through 2013 Q3; vertical axis 0 to 12.]

Page 22: EMS Outages and Lessons  Learned QSE

Characteristics of EMS Outages

[Bar chart: Number of Events by outage characteristic; vertical axis 0 to 90.]

Page 23: EMS Outages and Lessons  Learned QSE

Root Causes by Category

● A1 – Design/Engineering: 16%
● A2 – Equipment/Material: 25%
● A3 – Individual Human Performance: 2%
● A4 – Management/Organization: 30%
● A5 – Communication: 5%
● A6 – Training: 2%
● AZ – Information LTA: 20%

Page 24: EMS Outages and Lessons  Learned QSE

Contributing Causes by Category

● A1 – Design/Engineering: 18%
● A2 – Equipment/Material: 32%
● A3 – Individual Human Performance: 9%
● A4 – Management/Organization: 28%
● A5 – Communication: 6%
● A7 – Other: 2%
● AX – Overall Configuration: 5%

Page 25: EMS Outages and Lessons  Learned QSE

Top Root/Contributing Causes (in order)

● Software Failure (A2B6C07)
● Design output scope LTA (A1B2C01)
● Inadequate vendor support of change (A4B5C03)
● Testing of Design/Installation LTA (A1B4C02)
● Defective or failed part (A2B6C01)
● System Interactions not considered (A4B5C05)
● Inadequate risk assessment of change (A4B5C04)
● Insufficient Job scoping (A4B3C08)
● Post Modification Testing LTA (A2B3C03)
● Inspection/Testing LTA (A2B3C02)
● Attention given to wrong issues (A3B3C01)
● Untimely corrective actions to known issue (A4B1C08)

Page 26: EMS Outages and Lessons  Learned QSE

Common Themes

1. Software Failures
2. Software Configuration/Installation/Maintenance
3. Hardware Failures
4. Hardware Configuration/Installation/Maintenance
5. Failover Testing Weaknesses
6. Testing Inadequacies

Page 27: EMS Outages and Lessons  Learned QSE

Software Failures – What is Affected?

● Application Software Bug/Defect
  o Base System – Alarms/Health Check/Syncing etc.
  o Front End Processing
  o Supervisory Control Applications (SCADA)
  o AGC
  o ICCP
  o User Interface (UI)
  o Relational Database Management Systems (RDBMS)
  o Build Process Scripts
  o Miscellaneous Scripts
● Communication Equipment Firmware/Software Bug/Defect
  o RTUs
  o Switches
  o Modems
  o Routers
  o Firewalls
● Operating System Software Bug/Defect
  o Unix/Linux/Windows

Page 28: EMS Outages and Lessons  Learned QSE

Hardware Failures

● Application Servers/Nodes
  o Network Interface cards
  o Server hard drive control board
  o Aux Power regulator control
● Communication Equipment
  o RTU
  o Switches
  o Routers
  o Firewalls
  o Fiber Optic Cables
  o Time source
● Power Sources
  o Uninterruptible Power Supply (UPS)
  o External Generators
  o Power Cables

Page 29: EMS Outages and Lessons  Learned QSE

Failover Testing Weaknesses

● Improper settings preventing the failover
● Improper procedure to failover
● System setup issues preventing failover
● Improper patch management between primary/spare/backup servers
● Primary server issues reflected on spare/backup as well – No Isolation
● Improper failover configuration settings
● Improper network device configuration settings for failover
● Design requirements not considering failovers

Page 30: EMS Outages and Lessons  Learned QSE

Testing Inadequacies

● Inadequate testing
● Improper procedures to test
● Incomplete scope
● Not engaging all the parties involved

Page 31: EMS Outages and Lessons  Learned QSE

Software and Hardware Categories and Restoration Times

[Chart: Mean Outage Restoration Time (Mins), axis 0 to 160, and Event Count, axis 0 to 25, for each category: Hardware C/I/M, Hardware Failure - Com, Hardware Failure - Power, Hardware Failure - Server, Software Failure - App, Software Failure - Com, Software C/I/M.]

Page 32: EMS Outages and Lessons  Learned QSE

Historical Failure Restoration Data

[Chart: restoration time in minutes (axis 0 to 450) by event number (1 to 77). Series: Complete Outage Restoration Time, Partial Outage Restoration Time, Mean Complete Outage Restoration Time, Mean Partial Outage Restoration Time, Mean Outage Restoration Time.]

● Mean Complete Outage Restoration Time: 56 Minutes
● Mean Partial Outage Restoration Time: 43 Minutes
● Mean Total Outage Restoration Time: 99 Minutes

Page 33: EMS Outages and Lessons  Learned QSE

Lessons Learned

● Publish information about problems and solutions

● NERC continues review of events with a working group of stakeholders and Regional personnel

● Situational Awareness workshop held in June 2013 with future workshops planned

● Dialogue with vendors to inform and improve

Page 34: EMS Outages and Lessons  Learned QSE

Reporting Requirements – NERC Standard EOP-004-2

● Complete loss of voice communication capability affecting a Bulk Electric System (BES) control center for 30 continuous minutes or more (same as Category 2a of EAP)

● Complete loss of monitoring capability affecting a BES control center for 30 continuous minutes or more such that analysis capability (i.e., State Estimator or Contingency Analysis) is rendered inoperable (similar to Category 2b of EAP)

● Report to ERCOT, TRE, NERC and DOE per TRE web link: http://www.texasre.org/Reliability/EOP-004disturbancereports/Pages/Default.aspx
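The "30 continuous minutes or more" criterion above can be checked mechanically once the loss and restoration times are logged. The Python sketch below covers only that duration test. Whether the loss was complete, and whether analysis capability was rendered inoperable, remain judgments for the operator and the standard itself. The function and constant names are hypothetical.

```python
from datetime import datetime, timedelta

# Duration threshold quoted in the reporting criteria above.
REPORTING_THRESHOLD = timedelta(minutes=30)


def meets_duration_threshold(lost_at: datetime, restored_at: datetime) -> bool:
    """Return True if a continuous loss lasted 30 minutes or more."""
    if restored_at < lost_at:
        raise ValueError("restoration time precedes loss time")
    return (restored_at - lost_at) >= REPORTING_THRESHOLD


# Illustrative timestamps only.
long_outage = meets_duration_threshold(datetime(2014, 3, 1, 14, 5), datetime(2014, 3, 1, 14, 40))
short_outage = meets_duration_threshold(datetime(2014, 3, 1, 14, 5), datetime(2014, 3, 1, 14, 30))
print(long_outage, short_outage)
```

The first example lasts 35 minutes and meets the threshold. The second lasts 25 minutes and does not.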

Page 35: EMS Outages and Lessons  Learned QSE

Reporting Requirements – NERC Events Analysis

● Category 1f - Unplanned evacuation from a control center facility with Bulk Power System (BPS) SCADA functionality for 30 minutes or more
● Category 1h - Loss of monitoring or control, at a control center, such that it significantly affects the entity’s ability to make operating decisions for 30 continuous minutes or more. Examples include, but are not limited to, the following:
  o Loss of operator ability to remotely monitor, control BES elements, or both
  o Loss of communications from SCADA RTUs
  o Unavailability of ICCP links reducing BES visibility
  o Loss of the ability to remotely monitor and control generating units via AGC
  o Unacceptable State Estimator or Contingency Analysis solutions

Page 36: EMS Outages and Lessons  Learned QSE

What Can Operators Do?

● Watch for failures and unexpected situations
● Determine the criticality and impact to the reliability of the grid
● Promptly report the failures
● Log the date/time of the failure, a description of alarms/events, and the time of system/function restoration
● Expect the EMS failure and prepare to react
● Have the necessary backup procedures in place and be familiar with them
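The logging items above (date/time of failure, description of alarms/events, time of restoration) map naturally onto a small record. The Python sketch below is one possible shape for such a record. The class name, field names, and example entry are assumptions and are not drawn from any ERCOT or QSE logging system.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class EmsFailureLogEntry:
    """One operator log entry for an EMS failure."""

    failed_at: datetime                      # date/time the failure was noticed
    description: str                         # alarms and events observed
    restored_at: Optional[datetime] = None   # filled in when the function returns

    def restoration_minutes(self) -> Optional[float]:
        """Minutes from failure to restoration, or None while still out."""
        if self.restored_at is None:
            return None
        return (self.restored_at - self.failed_at).total_seconds() / 60


# Illustrative entry only.
entry = EmsFailureLogEntry(
    failed_at=datetime(2014, 3, 1, 9, 12),
    description="ICCP link alarms; telemetry no longer updating",
)
print(entry.restoration_minutes())
entry.restored_at = datetime(2014, 3, 1, 9, 55)
print(entry.restoration_minutes())
```

Recording both timestamps at the time of the event makes it straightforward to state the outage duration later, when a report is prepared.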

Page 37: EMS Outages and Lessons  Learned QSE

ERCOT Procedures

● Failover procedure
● Loss of AGC – Operation Guides
● Constant Frequency operation
● Loss of ICCP

Page 38: EMS Outages and Lessons  Learned QSE

Real Time Operating Procedure - Section 3.3 System Failures

Monitor Frequency for the Loss of EMS or Site Failover

The ability to view an adequate Frequency source may be limited during a site-failover, database load, or if AGC is temporarily unavailable. To view the System Frequency during these conditions, you may monitor the following sources:
o ERCOT Control Room digital wall frequency displays
o PI ProcessBook → ERCOT → TrueTime Frequency (Taylor) and/or
o PI ProcessBook → ERCOT → TrueTime Frequency (Bastrop)

It may be necessary to reload the PI ProcessBook “ERCOT Main Summary” display to show the historical data.

Page 39: EMS Outages and Lessons  Learned QSE

Real Time Operating Procedure - Section 3.3 System Failures

EMMS (LFC and RLC/SCED) Failure

● Regulation, RRS, UDBP, BP, EBP, and manual offset not functioning.

LFC (AGC):
● AGC is SUSPENDED or PAUSED,
● “Last ACE crossing zero” time on the Generation Area Status page is not updating,
● AGC operation adversely impacts the reliability of the Interconnection,
● SCED and EMS are not functioning,
  o Problem cannot be resolved quickly

REFERENCE Display: EMP Applications>Generation Area Status>Nodal Operational Status>Resource Limits Data

DETERMINE:
● Which QSE has ample capacity to place on constant frequency Control;

THEN:
● Direct the selected QSE to go on constant frequency
● As time permits, issue an electronic VDI
  o Choose “OPERATE AT CONSTANT FREQUENCY” as the Instruction Type from QSE Level
● Place ERCOT AGC into “Monitor” mode

Page 40: EMS Outages and Lessons  Learned QSE

Generating Unit Operations During Complete Loss of Communications

Excerpt from 2014 DRAFT NERC guideline for units without voice or data links to their QSE but able to generate and monitor frequency

Page 41: EMS Outages and Lessons  Learned QSE

Draft NERC Reliability Guideline – Frequency Chart for the ERCOT Region

Page 42: EMS Outages and Lessons  Learned QSE

References

● ERCOT Nodal Protocols, Sect 3.10
● ERCOT Nodal Operating Guides, Sect 7
● ERCOT State Estimator Standards
● ERCOT Telemetry Standards
● ERCOT Operating Procedure Manual, Shift Supervisor Desk, Sect 10
● NERC Events Analysis Process
● NERC Standard EOP-004-2
● NERC EMS Task Force
● DRAFT NERC Reliability Guideline: Generating Unit Operations During Complete Loss of Communications

Page 43: EMS Outages and Lessons  Learned QSE

Credits

Much of the information contained in this presentation was previously published by the North American Electric Reliability Corporation (NERC) in a variety of publications. It is the result of an extensive review of actual power system events over a 2-year period by the EMS Event Task Force.

Questions?

Page 44: EMS Outages and Lessons  Learned QSE

EXAM

Please turn your iClicker on and answer each of the following questions.

Page 45: EMS Outages and Lessons  Learned QSE

1. Which of the following Operator tools can lead to EMS failures?

a) SCADA
b) ICCP
c) AGC
d) All of the above

Page 46: EMS Outages and Lessons  Learned QSE

2. What is the top root/contributing cause of EMS failures?

a) Inadequate vendor support
b) Hardware failure
c) Inadequate testing
d) Software failure

Page 47: EMS Outages and Lessons  Learned QSE

3. What action should ERCOT take for the loss of their LFC?

a) Monitor frequency and hope for the best
b) Place a large QSE on constant frequency
c) OOME Up units
d) RUC units off line

Page 48: EMS Outages and Lessons  Learned QSE

4. Which of the following steps should an Operator take during an EMS failure?

a) Promptly report the failures
b) Determine the criticality and impact of the failure to the reliability of the grid
c) Log the date/time of the failure
d) Implement backup procedures
e) All of the above

Page 49: EMS Outages and Lessons  Learned QSE

5. What is the NERC Standard that requires reporting of EMS failures?

a) NERC Events Analysis Process
b) NERC Standard TOP-001-1
c) NERC Standard EOP-004-2
d) NERC EMS Task Force