Cornell Information Technology Root Cause Analysis
Exchange E-Mail Response
Root Cause Analysis
Service Disruption:
From: October 17, 2011
To: 12:30 AM November 8, 2011
Executive Summary
Cornell's Exchange email system suffered three weeks of increasingly poor response. Microsoft flew a
field engineer to Ithaca to assist in diagnosing the problem. Working with Cornell staff, he determined
that the root cause of the problem was a feature of the network interfaces on the Exchange servers that
disrupted communication within the cluster. This triggered a second bug in the Microsoft cluster
software that caused lengthy delays in resuming cluster operation following the disruption. Once
both of these were addressed, response time returned to normal levels. During the investigation, a
number of other potential causes were identified and eliminated. In the end, these other factors were
only minor contributors to pushing an unstable system over the edge.
Timeline
Beginning October 17, users of CIT's Exchange email system saw increasingly poor response. CIT
staff identified and eliminated several apparent contributions to the problem, but ultimately came to
an impasse. Paradoxically, while it initially appeared to be a resource load issue, adding additional
resources to the cluster made the problem worse. In reviewing the timeline, it is now apparent that
the increasing size of the cluster, as servers were moved from the Exchange 2007 cluster to the
Exchange 2010 cluster, caused the network interface errors to reach a critical level.
In the first two weeks, a number of factors were identified that appeared to cause the problem. These
factors included a set of bad antivirus signatures coinciding with a malware storm, power
management settings that reduced CPU clock speeds on the servers, and an Exchange 2010 feature
that caused many more mailboxes to be opened than previously. Each seemed at the time to be an
isolated problem, and rectifying them provided temporary relief.
The problem was escalated to Microsoft, who flew in a Field Engineer on the evening of Wednesday,
November 2, to help us diagnose the problem. The following sections detail the troubleshooting stages.
Network Load Suspected
Each Exchange database server (MBX) has two network interfaces that it uses to connect via the Tier 1
and 2 networks to the Client Access Servers (CAS), and to the other MBX servers in the cluster. It has
a third that connects to the Tier 3 network for cluster heartbeats and a fourth that connects to the
Backup network.
It was hypothesized that Exchange database replication traffic was overwhelming the client traffic on
the Tier 1 and 2 interfaces, causing poor response time. The Microsoft Field Engineer said that this is
seen in some large installations of Exchange. He defined large as more than 5,000 users, while we
have 20,000.
Exchange settings were changed to route replication traffic over the Tier 3 network, which resulted in
some improvement in performance. This was later determined to have only a tertiary effect on
the overall problem.
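
For reference, moving replication onto a dedicated network in Exchange 2010 is done through the
DAG network settings. The sketch below shows the general shape of the change in the Exchange
Management Shell; the DAG and network names are placeholders, not the actual object names from
Cornell's environment.

    # Hypothetical object names; substitute the real DAG and network names.
    # Show the networks the DAG knows about and whether each carries replication.
    Get-DatabaseAvailabilityGroupNetwork -Identity "DAG1" |
        Format-List Name,Subnets,ReplicationEnabled

    # Carry replication on the Tier 3 network only...
    Set-DatabaseAvailabilityGroupNetwork -Identity "DAG1\Tier3Net" -ReplicationEnabled $true
    # ...and keep it off the client-facing network.
    Set-DatabaseAvailabilityGroupNetwork -Identity "DAG1\Tier1Net" -ReplicationEnabled $false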
Clustering Issue
The servers continued to report communication errors to their peer servers in the cluster. The
physical network was examined for errors or capacity issues, but none were found. The Field
Engineer escalated the case to Microsoft internal resources, and eventually engaged a Network
Engineer in Austin. This Engineer identified three unreleased hot fixes for the cluster service that
appeared relevant to the issue. These fixes addressed issues where cluster timeouts were too short to
allow a cluster of our size to restabilize following a transient error, and problems in electing a new
cluster manager. Those fixes were applied to the cluster on Thursday night, and made a large
improvement in the stability of the system.
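
The hotfix contents are Microsoft internals, but the heartbeat tolerances they interact with are
ordinary, inspectable cluster properties. As an illustrative sketch only (these commands show the
knobs involved, not what the hotfixes changed), a Windows Server 2008 R2 failover cluster's
heartbeat settings can be read and adjusted from PowerShell:

    # Requires the Failover Clustering module (Windows Server 2008 R2).
    Import-Module FailoverClusters

    # Heartbeat interval (ms) and how many missed heartbeats are tolerated
    # before a node is considered unreachable.
    Get-Cluster | Format-List SameSubnetDelay,SameSubnetThreshold,CrossSubnetDelay,CrossSubnetThreshold

    # Example only: tolerate more missed heartbeats within a subnet.
    (Get-Cluster).SameSubnetThreshold = 10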
The issue was related to the number of machines in the cluster. No problems had been observed
during the first phase of the 2007 to 2010 migration, when there were only two pairs of mailbox
servers in production. Some problems were observed when the third pair was added during the third
week of September, and more frequent problems were observed when the fourth pair was added
mid-October.
Our analysis now indicates that this was the secondary root cause of the problems. However, the
improvement was sufficient that it was believed that the problem was addressed, and the Microsoft
field engineer departed at noon on Friday.
Problem with New CAS Servers
Reports of connection problems on Friday morning appeared to be geographically clustered. Around
noon, a connectivity problem between the load balancer and the new CAS servers was suspected.
The new servers were removed from rotation. All units that had reported problems confirmed that this
resolved their connectivity issues. Analysis now indicates that it was again the combination of the
network interface errors and the size of the cluster causing the problem.
Network Interface Issues
Communication errors remained present in the cluster, even though the problem of improper
responses to those errors by the cluster software had been addressed. While response time seemed
improved on Friday, by Monday it was apparent that it was still seriously degraded. CIT staff re-
engaged with the network engineer in Austin, who worked through logs. He first identified a
problem with the standby network adaptor invoking power saving mode. This appeared to take the
primary adaptor offline momentarily because the system software considered the pair of adaptors a
team. Turning off power management again reduced the magnitude of the problem.
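
On Windows Server 2008 R2 this corresponds to the "Allow the computer to turn off this device to
save power" option on each adapter's Power Management tab. A hedged sketch of the commonly
used registry equivalent follows; the 0007 subkey is a hypothetical adapter index, and the value
should be verified against the driver in use.

    # {4D36E972-...} is the network adapter device class; each 00xx subkey is one adapter.
    # PnPCapabilities = 0x18 (24) clears "Allow the computer to turn off this device
    # to save power". The adapter (or server) must be restarted to take effect.
    reg add "HKLM\SYSTEM\CurrentControlSet\Control\Class\{4D36E972-E325-11CE-BFC1-08002BE10318}\0007" /v PnPCapabilities /t REG_DWORD /d 24 /f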
Finally, using network traces, the engineer identified that heartbeat packets were being corrupted in
transit between the systems, and identified a TCP offloading feature of the Broadcom network
adaptors as a probable cause. This feature, called NetDMA, was turned off between midnight and
1am on Tuesday morning. Ever since that change, the cluster has been highly responsive, and the
CPU utilization on all servers has dropped to normal levels.
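
NetDMA is a global TCP/IP setting on Windows Server 2008 R2, so its state can be checked and the
change made with netsh; a minimal sketch follows (the production change itself went through CIT's
change process as artf36210, listed in the timeline below):

    # Show the global TCP settings, including the current NetDMA state.
    netsh int tcp show global

    # Disable NetDMA; the setting takes effect after a reboot.
    netsh int tcp set global netdma=disabled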
Postscript: Network Outage and Load Balancer Problems
After the cluster was healthy, a major network outage struck CCC. This affected both load balancers,
including the one in Rhodes Hall. People were unable to reach the Exchange system, and after the
network was restored, the load balancer did not re-establish contact with Exchange. Manual
intervention was required. It is mentioned in the timeline because of its proximity to the other
problems; it is entirely unrelated.
Specific Services and Configuration Items Impacted
Exchange
Actions Taken to Resolve Incident
Event Date/Time Action
Migrations, 2007
to 2010
9/5/11
12:00 AM
CCAB CHANGE REQUEST: artf35047: Exchange Mailbox Moves
CHANGE DESCRIPTION: Migrate Exchange Mailboxes from
Exchange 2007 to 2010.
TEST PROCEDURE: Process is already underway with select
groups of users.
NOTE: Migrations completed October 2.
MBXA Added to
Cluster
9/20/11 Work was fast-tracked because of issues with the dual 2007/2010
Exchange environment, as decided in a meeting with Ted and Dave
earlier in the month. The CCAB covering the event was
inadvertently not filed.
MBXB Added to
Cluster
10/14/11 Work was fast-tracked because of issues with the dual 2007/2010
Exchange environment, as decided in a meeting with Ted and Dave
earlier in the month. The CCAB covering the event was
inadvertently not filed.
Communication 10/17/11
10:01 AM
CIT posts an alert for Exchange Performance issues
(http://www.cit.cornell.edu/services/alert.cfm?id=1500; see Appendix III).
Communication 10/17/11
11:46 AM
CIT posts an alert for the usps.com malware attack
(http://www.cit.cornell.edu/services/alert.cfm?id=1503; see Appendix III).
Communication 10/17/11
9:23 PM
CIT posts an update to the Exchange Performance issue
(http://www.cit.cornell.edu/services/alert.cfm?id=1500; see
Appendix III).
Communication/
Description of
Service
Disruption
10/18/11
1:56 PM
E-Mail sent to net-admin-l:
Please pass this message along to individuals you support who
may have been affected by the issues described below. (This
message has been sent to Net-Admin-L and ITMC-L.)
On Monday, October 17, from approximately 8:15 am to 4:15 pm,
the Exchange email and calendar system experienced
performance problems related to load, and some individuals
reported unstable connections, slow response, reduced functionality, and error messages.
It appears that Macintosh clients and BlackBerry devices were
most seriously impacted. A few Outlook Web App connections
may also have been affected, and response times for Windows
Outlook were slow at times.
The apparent cause was a significantly higher than normal load
triggered by the receipt of tens of thousands of virus-laden
messages. Cornell's perimeter anti-virus/anti-spam defenses kept most of the virus-laden messages from reaching Exchange, but
the ones that got through triggered Exchange 2010's own anti-
virus defense, which affected overall performance. We are also
investigating whether a virus engine update played a role.
The virus-laden messages were from a forged usps.com address,
so one defensive step was to temporarily block all usps.com mail
until legitimate mail could be distinguished from forged mail.
This block resulted in approximately 6 legitimate messages
being returned to the sender with a "blacklisted" alert.
At this time, Exchange load is back to typical levels, so we
believe individuals should no longer be seeing performance
issues. We are investigating why this event had the effects that it
did, and what, if anything, could be adjusted in Exchange 2010.
Communication 10/20/11
10:30 AM
CIT posts an alert for a 5-minute load spike on Exchange servers. The issue is closed at 4:13 PM.
(http://www.cit.cornell.edu/services/alert.cfm?id=1511, see Appendix III).
Communication 10/20/11
3:54 PM
CIT posts an alert for Exchange authentication errors which had
been reported during the previous half hour.
(http://www.cit.cornell.edu/services/alert.cfm?id=1512, see
Appendix III)
Power
Management
Settings Turned
Off
10/24/11
11:00 AM
CCAB CHANGE REQUEST: artf35973 : Change Power
Management Settings on Exchange
CHANGE DESCRIPTION: The Exchange servers are currently
subject to power management, which is causing the CPU clock
frequencies to be lowered. This is not a recommended setting for
Exchange and is a contributing factor to the Exchange performance
issues we have been seeing. This change will set the
processors to the maximum power settings.
TEST PROCEDURE: Standard Operating system setting.
BACKOUT STRATEGY: Revert to current settings
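
As a hedged sketch of this change: on Windows Server 2008 R2, "maximum power settings"
corresponds to activating the built-in High Performance power plan, which keeps the CPU clocks
from being stepped down.

    # List available power plans and show which is active.
    powercfg /list

    # Activate the built-in High Performance plan (alias SCHEME_MIN).
    powercfg /setactive scheme_min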
Communication 10/28/11
1:34 PM
CIT posts an alert for Exchange performance which was affected
from 10:00 AM to 12:30 PM due to an issue with the server farm
switch that morning.
(http://www.cit.cornell.edu/services/alert.cfm?id=1526; see Appendix III).
Change Request 10/29/11
5:00 AM
CCAB CHANGE REQUEST: artf36042 : Modify Network settings
on Exchange 2010.
CHANGE DESCRIPTION: Turn off chimney and rss offloading on
the network adapters on the Exchange 2010 mailbox servers. This
change is recommended by Microsoft to resolve some database
replication problems we have been experiencing in the Exchange
2010 environment.
This change will be occurring between 5:00am and 7:00am on Sat
for the CCC data center servers followed by 5:00am and 7:00am on
Sunday for the Rhodes data center. The actual time to complete
this task should only be around 15 minutes.
Users will not see any downtime as we will always have an
available copy of the Exchange databases.
TEST PROCEDURE: These settings have been recommended by
Microsoft.
BACKOUT STRATEGY: revert to the current settings.
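
For reference, chimney and RSS offloading are global TCP settings that can be changed with netsh;
a minimal sketch of the change this request describes:

    # Disable TCP Chimney offload and receive-side scaling globally.
    netsh int tcp set global chimney=disabled
    netsh int tcp set global rss=disabled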
Communication 10/31/11
8:46 AM
CIT posts an alert regarding reports that users cannot access
Outlook Web Access (OWA).
(http://www.cit.cornell.edu/services/alert.cfm?id=1529; see
Appendix III).
Communication 10/31/11
8:54 AM
CIT posts an alert regarding Exchange email and calendar.
(http://www.cit.cornell.edu/services/alert.cfm?id=1530; see
Appendix III). Updates are provided throughout the day and the
cause was believed to be primarily a feature of Exchange Server
2010 SP2 that became apparent when a script for Exchange Group Accounts was run on the weekend of October 29.
Communication/
Description of
Service
Disruption
10/31/11
7:25 PM
E-Mail sent to net-admin-l:
The Exchange performance problems the morning of October 31
have been traced to Outlook 2007/2010 on Windows attempting to
connect to all mailboxes to which each user had access. Fixing this
problem may have disconnected previously connected shared
mailboxes.
Affected individuals may need to re-add the Exchange Group Accounts and other mailboxes they want to see in Outlook.
To make Exchange Group Accounts visible again:
http://www.cit.cornell.edu/services/ega/howto/config.cfm
To make other shared mailboxes visible again:
http://www.cit.cornell.edu/services/outlook/howto/email/email-
view-shared.cfm
CIT is adjusting the scripts for Exchange Group Accounts and filing
an issue report with Microsoft.
--------
DETAILS
The performance problems are believed to have been primarily
caused by a feature introduced by Microsoft in Exchange Server
2010 SP 2. The feature's effects were not apparent until scripts for
Exchange Group Accounts (which had been in place for two years)
were run the weekend of October 29.
The feature causes Outlook to automount all mailboxes to which an
individual has full access. This behavior creates a huge load on
both CornellAD and Exchange. It was primarily responsible for the
Exchange performance problems the morning of October 31, as
thousands of additional mailbox connections, in aggregate, were
made. For individuals with full access to many Exchange Group
Accounts or other mailboxes, the start-up time for Outlook may
have taken several minutes.
These automounted mailboxes supplanted the accounts that individuals had previously added to Outlook, so when the
automounting was stopped, those mailboxes disappeared from
view in Outlook. Re-adding the affected mailboxes resolves the
issue for individuals.
We apologize for the inconvenience this issue has caused, and
appreciate your patience and assistance in helping individuals
restore their Outlook views.
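
For context on the script adjustment mentioned above: the automount behavior is driven by the
AutoMapping flag on full-access mailbox permissions, so a provisioning script can grant access
without triggering it. A hedged sketch, with placeholder account names:

    # Placeholder names; grants full access without Outlook automapping the mailbox.
    Add-MailboxPermission -Identity "ega-example" -User "netid-example" `
        -AccessRights FullAccess -AutoMapping $false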
Communication 11/1/11
12:02 PM
CIT posts an alert regarding performance issues to Exchange email
and calendar. (See http://www.cit.cornell.edu/services/alert.cfm?id=1533; Appendix
III).
This alert remains open until 9:53 AM on November 10. During
this period, approximately 28 updates are provided to this alert.
Microsoft
Engineer on Site
11/2/11
9:00 PM
See the narrative description above for the detailed timeline of
events. See also Appendix V.
Change Request 11/2/11
2:00 PM
CCAB CHANGE REQUEST: artf36126 (Emergency: Urgent Service
Request) Add Exchange Client Access Server to Load Balancer
CHANGE DESCRIPTION: Add 4 additional client access servers
to the load balancer configuration so they may later be enabled.
This will provide additional client access capacity to our exchange
environment and should reduce the slow downs and dropped
connections users are currently experiencing.
TEST PROCEDURE: This is the same procedure used for the
current client access servers that have been in production for
several months.
BACKOUT STRATEGY: Revert to previous configuration.
PATCH
APPLIED TO
SYSTEM
11/3/11 Thursday night, November 3, a patch was applied to the systems.
Communication: Wednesday evening, November 2, Microsoft flew
in a field engineer. With his help, we first identified a network
bottleneck, which reduced but did not eliminate the problem.
Digging deeper, a bug was identified in Microsoft's clustering
software that caused the cluster to believe that it was in failure
mode, and caused the active mailboxes to flip repeatedly between
the redundant Exchange systems in Rhodes and CCC. Since this
behavior was related to the number of machines in the cluster, we inadvertently worsened the problem by adding capacity.
CLIENT
ACCESS
SERVERS
REMOVED
FROM SYSTEM
11/4/11 Friday morning, November 4, pockets of connectivity problems led
to discovering that a few of the ten Client Access Servers were not
responding to connections; they were removed from the pool. At
this time we believe that we have resolved the problems.
Change Request 11/8/11
12:00 AM
CCAB CHANGE REQUEST: artf36210 (Emergency: Urgent Service
Request) : Disable NetDMA on Exchange Mailbox Servers
CHANGE DESCRIPTION: Microsoft recommends that we disable NetDMA (a feature of the network adapters) on the Exchange
Mailbox Servers. NetDMA can cause timing problems with the
cluster communications and is a contributing factor to the issues
we have been encountering with Exchange. The change requires a
reboot of the mailbox servers. This process will be done one server
at a time so users should not see any additional downtime as a
result of this update. These recommendations come out of a
Severity 1 case we have open with Microsoft regarding the
Exchange performance issues.
Metrics (See Appendix I)

Detection Time (Detection - Incident Occurrence): Indefinite. Difficult to determine when to designate as beginning time (i.e., Incident Occurrence).
Response Time (Diagnosis - Detection): 21 days
Repair Time (Recovery - Diagnosis): 30 minutes
Recovery Time (Restore - Recovery): 0
Time to Repair (Recovery - Incident Occurrence): ~21 days
Root Cause of Incident - The Reason(s) for the Service Disruption
The primary cause of this incident was:
A network interface feature to offload network processing from the CPU caused corruption in
heartbeat packets. This caused the cluster to believe communication had been lost and
commence failover negotiations. This setting is the default for Windows Server 2008 R2.
The problem was made worse by:
An unpublished bug in Microsoft clustering software reacted inappropriately to the missed
heartbeat packets, flipping the active cluster node back and forth between Rhodes and CCC.
Two additional factors contributed to triggering the problem, but were not by themselves a problem:
Power management at the network interface level turned off power to the backup network
path, causing interrupted communications. This is the default setting for Windows Server
2008 R2.
Replication network traffic combined with client traffic to increase load on network interfaces
on the Exchange database servers.
Issues
There was no guidance from Microsoft on avoiding the NetDMA features, despite internal knowledge
that there had been problems in cluster environments with these features. There was also no
information available to customers or first-level Microsoft engineers on the cluster problems resolved
by their patches. Both of these have been addressed by Microsoft since our incident. Their
knowledgebase article is attached as Appendix VI.
Recommendations

The following work is identified as important for hardening the Exchange system against failures
and improving the performance under load (addressed by Action Items 1-5):
- CIT is re-certifying the four client access servers that were briefly placed in service, and will
add their capacity to the service within the next few weeks.
- The work to have the cornell.edu DNS name resolve to the domain controllers is expected to
remove some timeout issues with Exchange management commands.
- CIT is planning on re-enabling the Forefront antivirus protection on the Exchange servers. This
was disabled as a trouble-shooting step, but was found not to be a contributing factor.
- CIT is planning on re-enabling the host-based firewalls that were turned off as a
trouble-shooting step.
- CIT is adding additional network interfaces to the Exchange servers, to provide separate paths
for replication and user network traffic.

Provide for faster resolution of critical problems by upgrading our Microsoft support contract to
provide immediate access to Level 3 engineers. (Action Item 6)

Communicate Root Cause findings to campus and give an opportunity for input; investigate general
options for open forums or reviews of major service disruptions with strategic customers and
users. (Action Items 9, 10)

Perform a yearly risk assessment and health check of our Active Directory and Exchange systems by
an outside vendor. (Action Item 7)

Provide a checklist for Service Owners of tasks to complete with respect to communication and
notification during Service Disruptions. (Action Item 8)

Consider additional options for when the big red button should be pushed for similar
incidents/problems in the future. (Action Item 8)
Action Plan

1. Re-certify the four client access servers that were briefly placed in service, and add their
capacity to the service within the next few weeks. (Infrastructure Division - Messaging; complete
by 11/27/2011)
2. Modify DNS so that the cornell.edu DNS name resolves to the domain controllers, which is
expected to remove some timeout issues with Exchange management commands. (Infrastructure
Division - Identity Management; complete by 11/20/2011)
3. Re-enable the Forefront antivirus protection on the Exchange servers. This was disabled as a
trouble-shooting step, but was found not to be a contributing factor. (Infrastructure Division -
Messaging; complete by 12/4/2011)
4. Re-enable the host-based firewalls that were turned off as a trouble-shooting step.
(Infrastructure Division - Messaging/Systems Admins; complete by 11/27/2011)
5. Add additional network interfaces to the Exchange servers, to provide separate paths for
replication and user network traffic. (Infrastructure Division - Messaging/Systems; complete by
1/22/2012)
6. Upgrade our Microsoft support contract to provide immediate access to Level 3 engineers.
(Infrastructure Division - Messaging; complete by 1/31/2012)
7. Perform a yearly risk assessment and health check of our Active Directory and Exchange systems
by an outside vendor. (Infrastructure Division - Messaging; complete by 3/1/2012)
8. Update CIT Process and Procedure 2007-002 regarding Sev 1 incidents. Provide awareness to
Service Owners and others. (CIT Process Improvement - Jim Haustein; complete by 1/31/2012)
9. Schedule an Exchange SIG to have a review of the incident as one of the topics. (Infrastructure
Division - Messaging; complete by 12/31/2011)
10. Investigate general options for open forums or reviews of major service disruptions with
strategic customers and users. (CIT Process Improvement - Jim Haustein; complete by 2/28/2012)
Approvals:
___/s/ James R. Haustein________ __ (Submitter) 12/6/2011
Jim Haustein
___e-mail to Jim Haustein___ __ (Director, Infrastructure) 12/7/2011
Dave Vernon
APPENDIX I: DESCRIPTION OF METRICS
Incident Occurrence - when an incident occurs.
Detection - when IT is made aware of the issue.
Diagnosis - when the diagnosis to determine the underlying cause of the incident has been completed.
Repair - when the incident has been repaired.
Recovery - when component recovery has been completed.
Restoration - when normal business operations resume.
[Figure: incident lifecycle, from one Incident Occurrence to the next, marking Detection, Diagnosis, Repair, Recovery, and Restoration, with the derived intervals Detection Time, Response Time, Repair Time, Recovery Time, Time to Repair (downtime), Time Between Failures (uptime), and Time Between Incidents.]
APPENDIX II: TIMELINE OF EVENTS
APPENDIX III: COMMUNICATIONS
Performance: Exchange Performance Issues
Date: Oct 17, 2011, 10:01 AM
Duration: Unknown
Status: Closed
Description:
We are currently investigating reported connection issues with Exchange. The symptoms include refused
connections when sending and/or receiving mail. The connections recover after a minute but the problem
appears to recur occasionally. This is affecting client access only; we are not seeing delivery issues at this
time. After the connection recovers, mail is sent from and to the client.
Timeline:
10/19/2011 11:32 AM: The Exchange servers have been running normally since 4:15pm on Monday.
10/17/2011 09:23 PM: The CIT Exchange Admins report that system performance is much improved this
evening.
They are still investigating this problem and continue to monitor the issue.
Affected Services:
Exchange
Performance: Messaging Malware Attack Block: usps.com
Date: Oct 17, 2011, 11:46 AM
Duration: Unknown
Status: Closed
Description:
This morning CIT Messaging staff blocked a large-scale mail attack purporting to be from addresses at
usps.com, carrying malware that could infect client machines. Since it is impossible to distinguish these
forged addresses from legitimate usps.com addresses, no mail from usps.com is currently getting through.
This action was necessary to protect the Cornell mail system and other IT systems from the attack.
Timeline:
10/17/2011 07:40 PM: The complete block of any email with a @usps.com address has been lifted. We have
isolated the appropriate information and we are blocking solely on that. Initially, due to the volume and
variants of the infected email, it seemed prudent to block all @usps.com traffic, even though almost all of
it was already being blocked by our normal systems. We apologize for any inconvenience this may have caused.
10/17/2011 04:58 PM: The CIT Exchange Admins are still investigating this
problem and continue to monitor the issue.
10/17/2011 11:49 AM: We will restore incoming mail from legitimate usps.com addresses as soon as we have a
way to do so.
Performance: Problems With Exchange This Morning
Date: Oct 20, 2011, 10:30 AM
Duration: Unknown
Status: Closed
Description:
There was a five-minute load spike on some of the Exchange servers this morning, causing momentary slowness
and denied connections. It appears from reports that some email programs did not recover gracefully from that
incident. We recommend quitting and restarting your email program if you are experiencing problems.
Timeline:
10/20/2011 04:13 PM: This problem has been resolved.
10/20/2011 12:48 PM: Exchange experienced a momentary period of slowness and denied connections this
morning.
Affected Services:
Exchange
Performance: Exchange Authentication Errors
Date: Oct 20, 2011, 03:25 PM
Duration: Unknown
Status: Closed
Description:
An issue occurred where some users were unable to authenticate to Exchange. This problem occurred between
3:25pm and 3:30pm. One of our client access servers was unable to authenticate users against Active Directory.
We removed the server from service while we investigate and correct the problem. All connections should now
have re-established on the remaining client access servers. In some cases users may have to re-start their clients.
Timeline:
10/21/2011 03:54 PM: This issue has now been resolved.
10/20/2011 04:16 PM: An issue occurred where some users were unable to authenticate to Exchange.
Affected Services:
Exchange
Performance: Exchange Service
Date: Oct 28, 2011, 01:30 PM
Duration: Until 10/28/2011 at 2:00 PM
Status: Closed
Description:
Due to the network issue this morning, the Exchange system's performance was affected from 10am to
approximately 12:30pm today (10/28). To improve performance, we had split the databases up such that half
were primary in Rhodes and half in CCC. The network issue caused databases to fail over, and all the databases
ended up on one side instead of being split. Once usage rose high enough, performance suffered.
Timeline:
10/28/2011 01:34 PM: The databases have been split out again and all appears to be well. The Exchange 2007
servers are being rebuilt as Exchange 2010 servers, which will increase our overall capacity to better handle
these sorts of situations.
Affected Services:
Exchange
Unplanned Outage: Outlook Web Access Service (OWA)
Date: Oct 31, 2011, 08:44 AM
Duration: Unknown
Status: Closed
Description:
CIT has received reports that users are unable to access the Outlook Web Access (OWA) service.
Timeline:
10/31/2011 08:46 AM: We are currently investigating this problem and will notify you with updates on this
situation.
Affected Services:
Outlook Web Access
Unplanned Outage: Exchange Email and Calendar
Date: Oct 31, 2011, 08:52 AM
Duration: Unknown
Status: Closed
Description:
The Exchange performance problems the morning of October 31 have been traced to Outlook 2007/2010 on
Windows attempting to connect to all mailboxes to which each user had access. Individuals may need to re-add
the Exchange Group Accounts they want to see in Outlook (see
http://www.cit.cornell.edu/services/ega/howto/config.cfm ).
Timeline:
10/31/2011 05:50 PM: The performance problems are believed to be primarily caused by a feature introduced by
Microsoft in Exchange Server 2010 SP 2. The feature's effects were not apparent until a script for Exchange
Group Accounts was run the weekend of October 29.
The feature causes Outlook to automount all mailboxes to which an individual has full access. The result was
that start-up time for Outlook may have taken several minutes for some individuals. When the automounting
was stopped, the accounts appeared to disappear from Outlook. Re-adding the affected accounts resolves the
issue for individuals.
CIT is adjusting the scripts for Exchange Group Accounts and filing an issue report with Microsoft.
10/31/2011 10:10 AM: The load spike has abated this morning. Exchange staff are continuing to work on
monitoring the system and addressing the root cause.
10/31/2011 09:10 AM: Some Exchange users are experiencing slow Exchange response or difficulty connecting to
Exchange. One of the mailbox servers is experiencing a heavy load spike at this time. Exchange admins are
working on determining the source of the load and taking measures to address it.
10/31/2011 08:54 AM: We are currently investigating this problem and will notify you with updates on this
situation.
Affected Services:
Exchange
Performance: Exchange Email and Calendar
Date: Nov 1, 2011, 12:00 PM
Duration: Unknown
Status: Closed
Description:
For the past several days, Cornell's Exchange email and calendar services have had performance issues. Re-
establishing stable service levels is CIT's highest priority. Please bear with us as we continue working on the
problem.
Timeline:
11/10/2011 09:53 AM: The immediate issues with Exchange have been resolved. Over the next several weeks,
additional changes will be made to increase Exchange's ability to handle normal growth in load over time
and load associated with traffic spikes.
A notice to all Exchange users will be sent later today.
Please report any issues with Exchange email or calendar to the CIT HelpDesk (255-8990), noting your email
client and OS, and the location from which you observe the problem.
11/08/2011 04:57 PM: Our assessment of today's experience with the campus Exchange service is that the fixes
applied yesterday and early this morning have addressed performance issues seen over the past several days.
We have been working on what appear to be pockets of client issues remaining for a limited number of users.
We will keep this alert open, however, until more time has elapsed and we can be certain there are no more
infrastructure issues remaining. If you have an open ticket with the CIT Help Desk, please update us with your
current status. If you see any renewed or continuing problems, please report those to the Help Desk with details
including your client and OS, and the location from which you observe the problem.
Unfortunately, there was a network outage in the CIT data center this afternoon that impacted Exchange access
from about 1:00 to 2:00 PM. During the outage connections were refused. Some clients required a restart before
they were able to connect once the network was restored, so some users may have seen problems after 2:00 PM.
11/08/2011 10:15 AM: After making the recommended changes to the Exchange network configuration, which
were complete by 1am today, the Exchange team has seen no recurrence of the server errors that indicate this
problem. Spot checks with the community have indicated, in general, much improved performance this
morning. If you have an open ticket with the CIT Help Desk, please update us with your current status. If you
see any renewed or continuing problems, please report those to the Help Desk with details including your client
and OS, and the location from which you observe the problem.
11/07/2011 10:25 PM: We have received some isolated reports of continued problems following the configuration
change this afternoon around 4:00 PM, although we've seen a reduction in server-side errors. Microsoft has
recommended an additional change to the server configuration, which we are implementing between 12:00
midnight and 12:15 AM on Tuesday. The change requires rebooting the servers but we do not anticipate a
service disruption. If you experienced problems today described in an earlier update (see list below) and
continue to see them Tuesday morning, please report them to us.
Known symptoms are: sporadic slow or failed logins, failure to send messages, and slow operations (spinning
hourglass or beach ball, depending on the client system).
11/07/2011 06:33 PM: Microsoft has recommended that NetDMA be disabled in the Exchange cluster because it
is a contributing factor to Cornell's Exchange issues. From 12 midnight to 12:15 am on Tuesday, November 8,
CIT will restart the Exchange mailbox servers to disable NetDMA. This work will be done one server at a time.
No outage is expected.
11/07/2011 04:34 PM: CIT has made some changes to network settings on the Exchange cluster at Microsoft's
recommendation.
We are monitoring the performance to determine the effects of this change.
11/07/2011 03:49 PM: CIT and Microsoft experts are still diagnosing the cause of cluster communication failures.
They are currently analyzing network traces for further information on anomalies identified in the review of
Exchange data.
11/07/2011 01:39 PM: CIT staff continue to gather log data for Microsoft engineers to identify the source of the
problem, which appears to continue to be in the cluster communications layer.
Resolving the issues with Exchange remains the highest priority for both CIT and Microsoft.
The main symptoms are sporadic slow or failed logins, failure to send messages, and slow operations (spinning
hourglass or beach ball, depending on the client system). These have appeared a number of times throughout
the morning, with a larger interruption from noon to 1pm for users hosted on one of the four mailbox servers.
The server became non-responsive and required a reboot.
At this point, we have collected the data we need on client problems. If we need additional data to be reported,
a request will be posted here.
11/07/2011 01:12 PM: That database server is now online again. The start time was about 12:30, so it was a half
hour from that time.
11/07/2011 12:51 PM: One of the Exchange database servers (out of four) went offline and unmounted the
mailbox databases. Exchange staff are working to get the databases back online. This problem does appear to be
related to the ongoing issue. Expected time to restore the service is 30 minutes.
11/07/2011 09:30 AM: While the patches that were applied to the Exchange cluster on Friday greatly reduced the
rate of errors, it's now apparent that some level of errors still persists. The Exchange team remains engaged
with Microsoft to locate the source of these problems. Symptoms include timeouts in connection, refused
connections, and errors in using OWA. If you receive these errors, please wait for a short time and retry the
operation. The patch applied on Friday makes recovery from such problems much more rapid than before.
11/04/2011 04:45 PM: If people are still seeing problems with their email or calendar, as a first step, they should
quit and restart their email client, and give it some time to catch up. In a few cases, it may be necessary to reboot
their system. If problems persist, they should contact the CIT HelpDesk with these details: problem description,
date and times the problem has occurred, and the operating system and email client being used. Having issues
reported is critical.
TIME LINE OF ACTIONS TAKEN
Early on, CIT staff identified and eliminated several apparent contributions to the problem, but ultimately came
to an impasse. Paradoxically, adding additional resources to the cluster made the problem worse.
Wednesday evening, November 2, Microsoft flew in a field engineer. With his help, we first identified a
network bottleneck, which reduced but did not eliminate the problem. Digging deeper, a bug was identified in
Microsoft's clustering software that caused the cluster to believe that it was in failure mode, and caused the
active mailboxes to flip repeatedly between the redundant Exchange systems in Rhodes and CCC. Since this
behavior was related to the number of machines in the cluster, we inadvertently worsened the problem by
adding capacity.
Thursday night, November 3, a patch was applied to the systems, and all the server-side problems were eliminated.
Friday morning, November 4, pockets of connectivity problems led to discovering that a few of the ten Client
Access Servers were not responding to connections; they were removed from the pool. At this time we believe
that we have resolved the problems.
11/04/2011 01:41 PM: The root cause of recent Exchange problems has been addressed with hot fixes and
reconfiguration of network traffic accomplished last night. Nonetheless, a subset of campus users experienced
problems with the service today related to:
A brief load spike at 9:00 AM this morning. This resulted in the temporary inability to connect to Exchange for
some users. We are still investigating this event.
A new problem was introduced with the addition of client access server capacity. These servers were not
handling connections properly so we have eliminated them from the rotation. We have been working directly
with the IT staff in the units impacted and believe that removing these servers has resolved those cases. We will
continue to monitor reports until we are certain that no access issues remain.
11/04/2011 12:24 PM: Overall, Exchange performance is much improved. However, we are still receiving reports
from a subset of users who are having trouble connecting to their accounts. We are working with the Microsoft
engineer to diagnose these cases and solve them.
11/04/2011 08:49 AM: CIT staff with the Microsoft engineer who has been assisting us this week have applied
patches to the cluster service supporting the Exchange system. These patches have eliminated the network
errors and subsequent database restarts that have caused the extremely poor performance this week. At this
time the Exchange service appears much healthier. Some email programs may have become confused when the
Exchange system became unresponsive. If problems persist, we recommend that you quit and restart your email
programs, and contact the CIT Help Desk if problems continue after that.
11/03/2011 09:23 PM: Technical staff working on Exchange performance issues have applied a patch to the server
cluster to address a bug that was causing communication failures. This should improve stability and allow the
reconfiguration work to proceed.
11/03/2011 07:14 PM: Exchange mailboxes may be temporarily unavailable due to a cluster communications
problem; we expect this condition to last for less than 30 minutes.
11/03/2011 04:09 PM: We are still working on reconfiguring the network path for Exchange communications to
better distribute the traffic. We have engaged additional Microsoft resources over the phone to expedite
resolution of issues we've encountered with this change.
11/03/2011 02:20 PM: We are still working with the Microsoft engineer to accomplish the reconfiguration
referenced in the last communication. Although we initially anticipated that work would be completed around
1 PM, we now expect it will take several more hours. We expect these changes will result in a stable service very
soon after they are completed but we will continue to take incremental steps to increase capacity to better
accommodate future unplanned events.
11/03/2011 10:56 AM: Between now and approximately 1 PM we will be making configuration changes to the
Exchange environment to improve performance. The changes themselves are not expected to impact the user
community. However, until these changes are complete we may see events similar to those we've experienced
over the past several days that result in access issues for users. Such an event did occur this morning at 10 AM.
It affected a significant number of users whose mailboxes live on the affected server. Those users would have
experienced performance issues or the momentary inability to connect to their Exchange accounts.
We anticipate that very soon after we complete the configuration changes users will see the improvement in
service performance.
11/03/2011 07:30 AM: Working in concert with the Microsoft engineer last evening we have made configuration
changes to alleviate Exchange performance issues. Measures included client access network reconfiguration,
changes to the replication configuration, and deploying four additional client access servers. While we believe
we have determined the root cause of these issues we will continue to analyze performance data to confirm.
11/02/2011 03:29 PM: CIT continues to work on resolving the Exchange performance issues. Additional servers
will be added to Exchange tonight (November 2) to spread the load.
Problems with the replication service are being investigated, including determining whether a Microsoft patch
would resolve them.
A Microsoft engineer will be on site tonight (November 2), and CIT will be taking additional measures based on
those recommendations.
11/02/2011 08:55 AM: CIT is continuing to work on solutions to the Exchange performance issues. Our next step
is to address a communications problem between the two halves of the Exchange cluster. We are also working
to add another Exchange 2010 server as soon as tonight. In our test environment, we will be assessing a newly
released Microsoft patch that contains fixes for some of the problems we have been seeing.
11/01/2011 07:22 PM: Exchange performance has been stabilized for the moment. Some Microsoft-recommended
changes to the Active Directory Domain Controllers were implemented, as well as monitors that will capture
diagnostic information if the problems return tomorrow during periods of high load.
We also have a fourth Exchange database server ready to go into production, which will give us 33% more
capacity to deal with load issues. A fifth server will be added in another week. These will have a gradual effect
as user mailboxes migrate transparently onto them.
11/01/2011 05:10 PM: CIT understands the importance of email and calendar for your work, and we realize we
have fallen short of your expectations. We are working hard to regain those service levels. We have been
working with Microsoft and others to understand what is causing these problems.
So far the causes have been elusive, appearing at times to be a high CPU load causing poor response time, and
at other times seeming to be an intermittent network problem. Several apparent causes have been addressed,
including anti-virus updates, network adapter offload settings, power management settings, and the mailbox
automounting setting. Please bear with us as we continue working on the problem.
11/01/2011 04:06 PM: Exchange Admins are actively working with Microsoft to resolve the problem swiftly.
Additional information will be posted as it becomes available.
11/01/2011 02:30 PM: CIT is still receiving reports that some users are unable to access their Exchange email.
CIT is still investigating and will provide further updates.
11/01/2011 12:02 PM: We are currently investigating this problem and will notify you with updates on this
situation.
Affected Services:
Exchange
APPENDIX IV: CCAB SERVICE DISRUPTION REPORTS
The CCAB Service Disruption reports below were completed in conjunction with the Exchange
service disruption described in this document.
artf35310 9/15/2011 9:30 PM
(Thursday)
9/15/2011 11:59
PM (Thursday)
Exchange [4236] 2 mailbox DBs on mbcx outage: mailbox databases 19, 22, and the public folder
database did not mount after patching last night. It appears possible that this was an early
symptom of the communications problem.
artf35362 9/19/2011 8:00 AM
(Monday)
9/19/2011 1:30 PM
(Monday)
Exchange [4236] Exchange slow response
times : Longer than
anticipated run times for a
large set of Exchange 2010
migrations coincided with a
failed backup run that
restarted at the same time.
The two activities, neither of
which could be halted,
combined to slow response
time down for client access
to Exchange.
artf35567 9/26/2011 7:00 AM
(Monday)
9/26/2011 7:00 PM
(Monday)
E-Mail Routing
[3979]
Exchange connections
hanging : Connections
began to hang on two new
Client Access Servers placed
into production on Sunday.
The problem was resolved
when the new servers were
removed from service.
Only a fraction of Exchange
users were affected, and
only certain clients had
problems.
No cause of the problem has
yet been determined.
-
7/28/2019 CIT ExchangeRootCauseAnalysis 20111207
25/36
Cornell Information Technology Root Cause Analysis
CIT Root Cause Analysis 25
artf35912 10/17/2011 8:15
AM (Monday)
10/17/2011 4:15
PM (Monday)
Exchange [4236] Exchange Performance --
malware attack: Exchange
experienced slow response
and dropped client
connections after receiving a
large attack of malware
messages. This did not
affect mail delivery, only client access. There may
have been some interaction
with a set of virus
definitions in effect that day
on the Exchange anti-virus
engine. Anti-virus
signatures are automatically
delivered several times per
day by Microsoft.
artf36074 10/28/2011 10:00 AM (Friday) 10/28/2011 12:30 PM (Friday) Exchange [4236] Exchange
performance slowdown: Due to the network issue this morning, the Exchange system's performance
was affected. To improve performance, we had split the databases up such that half were primary
in Rhodes and half in CCC. The network issue caused databases to fail over, and all the databases
ended up on one side instead of being split. Once usage rose high enough, performance suffered.
The databases have been
split out again and all
appears to be well. The
Exchange 2007 servers are
being rebuilt as Exchange
2010 servers which will
increase our overall capacity
to better handle these sorts
of situations.
artf36151 10/31/2011 7:00
AM (Monday)
10/31/2011 3:00
PM (Monday)
Exchange [4236] Outlook
automapping/Exchange
performance: A new 'feature' with Exchange 2010
is that Outlook 2007/2010
will automatically open *all*
mailboxes to which the user
has full access permission.
All EGAs and resources
grant those permissions to
their owners. This only
took effect when the
permissions for a specific EGA or resource were updated; however, a
maintenance script over the
weekend updated
permissions on all EGAs.
This resulted in many more
connections to mailboxes on Monday morning,
contributing to ongoing
performance problems.
The automatic mounts were
removed late in the
morning. An unexpected
side effect of this was that a
previously manually
mounted mailbox that was
overridden by the automatic
mount of the same mailbox
was subsequently forgotten.
People reported they had 'lost access' to shared
mailboxes, when they had
in fact simply been
disconnected. The remedy
was for them to reopen the
shared mailbox.
artf36226 11/01/2011 12:00
AM (Tuesday)
11/07/2011 11:59
PM (Monday)
Exchange [4236] Exchange Performance
Problems : Severe
performance problems
affected Exchange during
the time. The underlying
symptom was that the cluster repeatedly lost and
re-established quorum. The
cause appeared to be
communications problems
between the cluster nodes.
A Microsoft engineer came
onsite to assist in diagnosis.
A number of steps were
taken to eliminate the
problems, listed from the
apparently most important
contributing cause through
lesser contributors:
- Turned off NetDMA on all
network adapters. This was
causing corrupted heartbeat
packets.
- Applied three hotfixes
from Microsoft that
improved the cluster
resiliency to network errors
- Turned off power
management on the
network adapters. (The
failover NICs were trying to
go to sleep.)
- Ensured that replication
traffic does not use the same
NIC as MAPI traffic to the
CAS servers.
- Turned off power management on the CPUs.
artf36231 11/08/2011 12:37
PM (Tuesday)
11/08/2011 12:52
PM (Tuesday)
Campus Area
Network [2208]
Server Farm network
disruption : The network
switch sfcdist1-1-6600 failed
at 12:37 and was restored to
service at 12:52. A network
issue on tier3 prevented the
firewalls from failing over
properly and the extra tier
had no connectivity during
this same interval. A
second switch, sfc1-1-5400, also had no connectivity, and some single-attached servers were affected.
artf36227 11/08/2011 12:52
PM (Tuesday)
11/08/2011 2:00
PM (Tuesday)
Exchange [4236] Exchange affected by
network outage : Exchange
access was affected by the
network switch outage.
After the end of the outage,
the load balancer did not reestablish connections to
the CAS servers. Services
needed to be stopped and
started on the CAS servers
before the load balancer
would restart the
connections.
We had many reports that
client programs also
required a stop/start or
reboot before they would let
go of their previous
connection to Exchange via the load balancer.
APPENDIX V: MICROSOFT FINAL REPORT
Mail from Microsoft Engineer to CIT team:
From: John Chappelle
Sent: Tuesday, November 15, 2011 4:40 PM
To: [email protected]
Cc: Gregg Koop; MSSolve Case Email; Gregg Koop
Subject: [REG:111100371705359] Exchange 2010 SP1 | Experiencing two databases where the issue is happening frequently.
Bill,
I am writing to check on your DAG today, and I am also including a summary of our troubleshooting efforts on
this case.
When we first started, we observed an issue with the cluster losing quorum and the copy queue length
changing to a very large number. This was the result of a cluster disconnect. We installed three patches
(KB2549472, KB2549448, and KB2552040) to allow nodes to join properly when they go offline, as well as to correct
an issue with the cluster not regrouping properly following a communication failure. This alleviated the issue
for a period of time, although it seems likely at this point that it was really the reboots that brought the cluster
back together. Those patches are still important to the proper operation of the cluster, and we recommend them
for any 2008 R2 cluster that experiences any quorum issues at all.
We saw the issue crop up again the next week, and this time we brought in both a Cluster engineer and one of
our Networking engineers. From their analysis, we found in the cluster logs:
00001124.00001e84::2011/11/07-19:36:12.823 INFO [CONNECT] 169.254.7.84:~3343~ from local 169.254.2.231:~0~:
Established connection to remote endpoint 169.254.7.84:~3343~.
00001124.00001e84::2011/11/07-19:36:12.823 INFO [Reconnector-MBXB-01] Successfully established a new
connection.
00001124.00001e84::2011/11/07-19:36:12.823 INFO [SV] Route local (169.254.2.231:~43912~) to remote MBXB-01
(169.254.7.84:~3343~) exists. Forwarding to alternate path.
00001124.00001e84::2011/11/07-19:36:12.823 INFO [SV] Securing route from (169.254.2.231:~43912~) to remote
MBXB-01 (169.254.7.84:~3343~).
00001124.00001e84::2011/11/07-19:36:12.823 INFO [SV] Got a new outgoing stream to MBXB-01 at
169.254.7.84:~3343~
00001124.00001e84::2011/11/07-19:36:12.823 INFO [SV] Authentication and authorization were successful
00001124.00001e84::2011/11/07-19:36:12.838 INFO [SV] Security Handshake successful while obtaining
SecurityContext for NetFT driver
00001124.00001e84::2011/11/07-19:36:12.838 ERR [CORE] mscs::Reconnector::ConnectionEstablished:
HrError(0x8009030f)' because of 'Signature Verification Failed'
00001124.00001e84::2011/11/07-19:36:12.838 WARN [Reconnector-MBXB-01] Failed to handle new connection with error ERROR_SYSTEM_POWERSTATE_COMPLEX_TRANSITION(783), ignoring connection.
In addition, we saw simultaneous TCP Resets that were unexpected. We know this because the remote node in
the conversation continued to attempt communication after the resets:
2060 54 0 14:36:12.8425000 13:36:12 07-Nov-11 14.4811462 0.0000191 {TCP:41, IPv4:33} 169.254.2.231 169.254.7.84
TCP TCP:Flags=...A.R.., SrcPort=43912, DstPort=3343, PayloadLen=0, Seq=3063920255, Ack=2252985581, Win=0
(scale factor 0x8) = 0
2061 86 32 14:36:12.8425199 13:36:12 07-Nov-11 14.4811661 0.0000199 {TCP:42, IPv4:33} 169.254.7.84
169.254.2.231 TCP TCP:Flags=...AP..., SrcPort=3343, DstPort=43912, PayloadLen=32, Seq=2252985581 -
2252985613, Ack=3063920254, Win=514
2062 54 0 14:36:12.8425356 13:36:12 07-Nov-11 14.4811818 0.0000157 {TCP:42, IPv4:33} 169.254.2.231 169.254.7.84
TCP TCP:Flags=.....R.., SrcPort=43912, DstPort=3343, PayloadLen=0, Seq=3063920254, Ack=3063920254, Win=02063 54 0 14:36:12.8429705 13:36:12 07-Nov-11 14.4816167 0.0004349 {TCP:43, IPv4:33} 169.254.7.84 169.254.2.231
TCP TCP:Flags=...A...., SrcPort=3343, DstPort=43912, PayloadLen=0, Seq=2252985613, Ack=3063920255, Win=514
This POWERSTATE event and the resets led us to examine the NICs on the server, where we found the
power save functions were enabled. We disabled those, and both the POWERSTATE and TCP Reset issues
abated immediately.
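For reference, the NIC power-save setting behind the POWERSTATE events lives in the driver properties ("Allow the computer to turn off this device to save power"). Below is a minimal sketch of scripting the same change, assuming the conventional PnPCapabilities registry value under the network adapter class key (0x18 disables power management); this is illustrative only, not the procedure used on the case, and should be verified against the specific Broadcom driver:

import winreg

# GUID of the network adapter device class in the registry.
NET_CLASS = (r"SYSTEM\CurrentControlSet\Control\Class"
             r"\{4D36E972-E325-11CE-BFC1-08002BE10318}")

def disable_nic_power_saving():
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, NET_CLASS) as cls:
        i = 0
        while True:
            try:
                sub = winreg.EnumKey(cls, i)      # "0000", "0001", ...
            except OSError:
                break                             # all subkeys enumerated
            i += 1
            if not sub.isdigit():
                continue                          # skip e.g. "Properties"
            with winreg.OpenKey(cls, sub, 0, winreg.KEY_SET_VALUE) as nic:
                # 0x18 = do not allow the OS to power the adapter down.
                winreg.SetValueEx(nic, "PnPCapabilities", 0,
                                  winreg.REG_DWORD, 0x18)

if __name__ == "__main__":
    disable_nic_power_saving()   # run elevated; takes effect after a reboot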
Our Cluster engineer also researched the NetDMA settings and determined that they should be disabled, so we
turned off NetDMA along with the power save settings.
As a side note, I received the information on the Broadcom driver versions, and I am looking around to see if
there is a known issue with them.
Thank you,
John Chappelle
Senior Support Escalation Engineer
469-775-5153
M-F 0900-1800 Central
My manager:
Melissa Stroud
Follow-up email identifying NetDMA as a primary cause:
From: William Effinger [mailto:[email protected]]
Sent: Friday, November 18, 2011 10:43 AM
To: William T Holmes
Cc: Gregg Koop; John Chappelle
Subject: [REG:111100371705359] Exchange 2010 SP1 | Experiencing two databases where the issue is happening frequently
Bill, John asked me to give you a shout with a writeup of my notes.
Looking in your cluster log:
Node MBXD-02
14744 000015d0.000025c0::2011/11/07-17:34:50.725 INFO [GUM] Node 2: Processing RequestLock 7:1242
14745 000015d0.00002ad8::2011/11/07-17:34:50.725 INFO [GUM] Node 2: Processing GrantLock to 7 (sent by 1 gumid: 80208)
14746 000015d0.00001718::2011/11/07-17:35:01.349 WARN [PULLER MBXA-02] ReadObject failed with HrError(0x8009030f)' because of 'Signature Verification Failed'
14747 000015d0.00001718::2011/11/07-17:35:01.349 ERR [NODE] Node 2: Connection to Node 6 is broken. Reason HrError(0x8009030f)' because of 'Signature Verification Failed'
14748 000015d0.00001718::2011/11/07-17:35:01.349 WARN [NODE] Node 2: Initiating reconnect with n6.
14749 000015d0.00001718::2011/11/07-17:35:01.349 INFO [MQ-MBXA-02] Pausing
14750 000015d0.000018b0::2011/11/07-17:35:01.349 INFO [Reconnector-MBXA-02] Reconnector from epoch 7 to epoch 8 waited 00.000 so far.
14751 000015d0.000018b0::2011/11/07-17:35:01.349 INFO [CONNECT] 169.254.6.224:~3343~ from local 169.254.2.172:~0~: Established connection to remote endpoint 169.254.6.224:~3343~.
14752 000015d0.000018b0::2011/11/07-17:35:01.349 INFO [Reconnector-MBXA-02] Successfully established a new connection.
14753 000015d0.000018b0::2011/11/07-17:35:01.349 INFO [SV] Route local (169.254.2.172:~14524~) to remote MBXA-02 (169.254.6.224:~3343~) exists. Forwarding to alternate path.
14754 000015d0.000018b0::2011/11/07-17:35:01.349 INFO [SV] Securing route from (169.254.2.172:~14524~) to remote MBXA-02 (169.254.6.224:~3343~).
14755 000015d0.000018b0::2011/11/07-17:35:01.349 INFO [SV] Got a new outgoing stream to MBXA-02 at 169.254.6.224:~3343~
14756 000015d0.000025c0::2011/11/07-17:35:01.349 WARN [PULLER MBXB-01] ReadObject failed with HrError(0x8009030f)' because of 'Signature Verification Failed'
14757 000015d0.000025c0::2011/11/07-17:35:01.349 ERR [NODE] Node 2: Connection to Node 7 is broken. Reason HrError(0x8009030f)' because of 'Signature Verification Failed'
14758 000015d0.000025c0::2011/11/07-17:35:01.349 WARN [NODE] Node 2: Initiating reconnect with n7.
14759 000015d0.000025c0::2011/11/07-17:35:01.349 INFO [MQ-MBXB-01] Pausing
15063 000015d0.00001614::2011/11/07-17:35:47.681 INFO [GUM] Node 2: Processing GrantLock to 1 (sent by 4 gumid: 80222)
15064 000015d0.00004628::2011/11/07-17:35:51.035 INFO [GUM] Node 2: Processing RequestLock 7:1246
15065 000015d0.00003964::2011/11/07-17:35:51.035 INFO [GUM] Node 2: Processing GrantLock to 7 (sent by 1 gumid: 80223)
15066 000015d0.00003f7c::2011/11/07-17:36:02.704 WARN [PULLER MBXA-02] ReadObject failed with HrError(0x8009030f)' because of 'Signature Verification Failed'
15067 000015d0.00003f7c::2011/11/07-17:36:02.704 ERR [NODE] Node 2: Connection to Node 6 is broken. Reason HrError(0x8009030f)' because of 'Signature Verification Failed'
15068 000015d0.00003f7c::2011/11/07-17:36:02.704 WARN [NODE] Node 2: Initiating reconnect with n6.
15069 000015d0.00003f7c::2011/11/07-17:36:02.704 INFO [MQ-MBXA-02] Pausing
15070 000015d0.00003a78::2011/11/07-17:36:02.704 INFO [Reconnector-MBXA-02] Reconnector from epoch 10 to epoch 11 waited 00.000 so far.
15071 000015d0.00004628::2011/11/07-17:36:02.704 WARN [PULLER MBXB-01] ReadObject failed with HrError(0x8009030f)' because of 'Signature Verification Failed'
15072 000015d0.00004628::2011/11/07-17:36:02.704 ERR [NODE] Node 2: Connection to Node 7 is broken. Reason HrError(0x8009030f)' because of 'Signature Verification Failed'
15073 000015d0.00004628::2011/11/07-17:36:02.704 WARN [NODE] Node 2: Initiating reconnect with n7.
15074 000015d0.00004628::2011/11/07-17:36:02.704 INFO [MQ-MBXB-01] Pausing
SEC_E_MESSAGE_ALTERED
The message or signature supplied for verification has been altered. (0x8009030f)
Doing research with our internal knowledge base, I can see that the 'Signature Verification Failed' case is caused by one of two reasons: the Receive Side Scaling and Network Direct Memory Access features in Windows Server 2008. As you had already turned off RSS, we disabled NetDMA.
Info on this tech:
http://technet.microsoft.com/sk-sk/magazine/2007.01.cableguy(en-us).aspx
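When searching a full cluster.log for this signature, it helps to tally the broken-connection events per peer node to see which links flap most. A minimal sketch (not from the case notes; the cluster.log path is an assumption) that counts them and tolerates the space-stripped formatting seen in the excerpt above:

import re
from collections import Counter

# Matches "Connection to Node N is broken", with or without spaces.
BREAK = re.compile(r"Connection\s*to\s*Node\s*(\d+)\s*is\s*broken", re.IGNORECASE)

def count_breaks(path):
    counts = Counter()
    with open(path, errors="replace") as log:
        for line in log:
            # Normalize away spaces so both spaced and fused lines match.
            if "SignatureVerificationFailed" in line.replace(" ", ""):
                m = BREAK.search(line)
                if m:
                    counts["Node " + m.group(1)] += 1
    return counts

if __name__ == "__main__":
    for node, n in count_breaks("cluster.log").most_common():
        print(node, "broke", n, "times")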
How to turn off RSS & NetDMA:
http://support.microsoft.com/?id=951037
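For scripting the same change, below is a minimal sketch that sets the two registry values KB951037 documents for these features (EnableRSS, and EnableTCPA for NetDMA) under the Tcpip parameters key; this is illustrative only, not the exact procedure from the case, and requires elevation plus a reboot for the NetDMA change:

import winreg

# TCP/IP parameters key; EnableTCPA (NetDMA) and EnableRSS are the
# values documented in KB951037.
TCPIP = r"SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"

with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, TCPIP, 0,
                    winreg.KEY_SET_VALUE) as key:
    winreg.SetValueEx(key, "EnableTCPA", 0, winreg.REG_DWORD, 0)  # NetDMA off
    winreg.SetValueEx(key, "EnableRSS", 0, winreg.REG_DWORD, 0)   # RSS off

RSS can equivalently be toggled from the command line with "netsh int tcp set global rss=disabled"; NetDMA is controlled through the registry value.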
Best Regards,
William Effinger | MCP | MCSA | MCSE | MCTS | MCITP EA
Office Hours: Monday - Friday | 7a - 4p | EST
Phone: 980.776.8887 | Email: [email protected] | Blog: http://blogs.technet.com/askcore/
Alternative Contact Information: local country phone number found here: http://support.microsoft.com/globalenglish, Extension 1168887
APPENDIX VI: MICROSOFT KNOWLEDGEBASE ARTICLE
In a post mortem discussion with Microsoft, CIT staff pointed out the lack of information available
that would have allowed us to prevent this problem or diagnose it once it occurred. In response,
Microsoft published the following article:
(http://blogs.technet.com/b/exchange/archive/2011/11/20/recommended-windows-hotfix-for-database-availability-groups-running-windows-server-2008-r2.aspx)
Recommended Windows Hotfix for Database Availability Groups running Windows Server 2008 R2
Scott Schnoll [MSFT] 20 Nov 2011 7:41 AM
In early August of this year, the Windows SE team released the following Knowledge Base (KB) article and accompanying software hotfix regarding an issue in Windows Server 2008 R2 failover clusters:
KB2550886 - A transient communication failure causes a Windows Server 2008 R2 failover cluster to stop
working
This hotfix is strongly recommended for all database availability groups that are stretched across multiple datacenters. For DAGs that are not stretched across multiple datacenters, this hotfix is good to have, as well.
The article describes a race condition and cluster database deadlock issue that can occur when a Windows Failover cluster encounters a transient communication failure. There is a race condition within the reconnection logic of cluster nodes that manifests itself when the cluster has communication failures. When this occurs, it will cause the cluster database to hang, resulting in quorum loss in the failover cluster.
As described on TechNet, a database availability group (DAG) relies on specific cluster functionality, including
the cluster database. In order for a DAG to be able to operate and provide high availability, the cluster and the
cluster database must also be operating properly.
Microsoft has encountered scenarios in which a transient network failure occurs (a failure of network communications for about 60 seconds) and, as a result, the entire cluster is deadlocked and all databases within the DAG are dismounted. Since it is not very easy to determine which cluster node is actually deadlocked, if a failover cluster deadlocks as a result of the reconnect logic race, the only available course of action is to restart all members of the entire cluster to resolve the deadlock condition.
The problem typically manifests itself in the form of cluster quorum loss due to an asymmetric communication failure (when two nodes cannot communicate with each other but can still communicate with other nodes). If there are delays among other nodes in the receiving of cluster regroup messages from the cluster's Global Update Manager (GUM), regroup messages can end up being received in an unexpected order. When that happens, the cluster loses quorum instead of invoking the expected behavior, which is to remove one of the nodes that experienced the initial communication failure from the cluster.
Generally, this bug manifests when there is asymmetric latency (for example, where half of the DAG members have latency of 1 ms, while the other half of the DAG members have 30 ms latency) for two cluster nodes that discover a broken connection between the pair. If the first node detects a connection loss well before the second node, a race condition can occur:
The first node will initiate a reconnect of the stream between the two nodes. This will cause the second node to add the new stream to its data.
Adding the new stream tears down the old stream and sets its failure handler to ignore. In the failure case, the old stream is the failed stream that has not been detected yet.
When the connection break is detected on the second node, the second node will initiate a reconnect sequence of its own. If the connection break is detected in the proper race window, the failed stream's failure handler will be set to ignore, and the reconnect process will not initiate a reconnect. It will, however, issue a pause for the send queue, which stops messages from being sent between the nodes.
When the messages are stopped, this prevents GUM from operating correctly and forces a cluster restart.
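To make the race concrete, here is a toy, single-threaded illustration; all class and handler names are invented for the sketch and are not the actual cluster service code:

# Toy model of the reconnect race described above.

class Stream:
    def __init__(self):
        self.on_failure = "reconnect"    # a detected break normally reconnects

class Node:
    def __init__(self, name):
        self.name = name
        self.stream = Stream()
        self.send_queue_paused = False

    def accept_reconnect_from_peer(self):
        # The faster peer reconnected first: adopt the new stream and tear
        # down the old one, setting its failure handler to "ignore".
        self.stream.on_failure = "ignore"
        self.stream = Stream()

    def detect_break(self, broken_stream):
        # The slower node finally notices the break on the old stream.
        self.send_queue_paused = True             # sends pause during recovery
        if broken_stream.on_failure == "reconnect":
            self.send_queue_paused = False        # normal path: reconnect
        # If the handler was set to "ignore" inside the race window, no
        # reconnect happens and the send queue stays paused, so GUM traffic
        # stops and only restarting the cluster members clears the deadlock.

slow = Node("MBXD-02")
old = slow.stream
slow.accept_reconnect_from_peer()   # the other node won the race
slow.detect_break(old)              # late detection hits the "ignore" handler
print("send queue paused:", slow.send_queue_paused)   # True -> deadlock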
If this issue does occur, the consequences are very bad for DAGs. As a result, we recommend that you deploy this hotfix to all of your Mailbox servers that are members of a DAG, especially if the DAG is stretched across datacenters. This hotfix can also benefit environments running Exchange 2007 Single Copy Clusters and Cluster Continuous Replication environments.
In addition to fixing the issue described above, KB2550886 also includes other important Windows Server 2008
R2 hotfixes that are also recommended for DAGs:
http://support.microsoft.com/kb/2549472 - Cluster node cannot rejoin the cluster after the node is restarted or removed from the cluster in Windows Server 2008 R2
http://support.microsoft.com/kb/2549448 - Cluster service still uses the default time-out value after you configure the regroup time-out setting in Windows Server 2008 R2
http://support.microsoft.com/kb/2552040 - A Windows Server 2008 R2 failover cluster loses quorum when an asymmetric communication failure occurs
Comments
William Holmes 21 Nov 2011 9:59 AM # This helpful article comes about 3 weeks too late. We experienced this issue and have in fact installed the hotfixes. In addition to these fixes, you may want to examine other aspects of your networking recommendations. For instance: support.microsoft.com/.../951037 - the features mentioned in this KB all contributed to triggering the problems that the hotfixes address. Disabling the features mentioned improved the stability and responsiveness of our entire Exchange Organization.
daliu 21 Nov 2011 5:53 PM # I take it from the KBs these are "Windows" clustering hotfixes & therefore won't be rolled up into Exchange 2010 SP2 later this year, correct?
Marcus L 22 Nov 2011 2:14 AM # This is a question for William Holmes: when you say "Disabling the features mentioned improved stability", which features exactly, all of them?
Martijn 22 Nov 2011 4:33 AM # Will this info be part of the Installation Guide Template - DAG Member? Then it would be clear which hotfixes to install along with the latest Windows 2008 R2 & Exchange 2010 Service Packs and Update Rollups.
Rob A 22 Nov 2011 7:17 AM # MSFT needs to update ExBPA so that we don't have to comb through articles like this for obscure fixes and optimizations. ExBPA makes life easier for us and for PSS. I don't think I have seen an update for ExBPA in a very long time.
Brian Day [MSFT] 22 Nov 2011 8:12 AM # @Rob A, ExBPA updates are released in Service Packs and Update Rollups. If you want to make sure you have the latest ExBPA ruleset in place, then install the latest SP and rollup on the machine you are running ExBPA from.
Eugene 22 Nov 2011 9:33 AM # In our environment, using the latest drivers and firmware available for IBM x3550 M2 servers, we can only stabilize a high-throughput server by disabling NetDMA, in each and every case.
Eugene 22 Nov 2011 9:34 AM # In fact, IBM has documented recommendations for many of their products to disable NetDMA. But since our drivers are the latest available, you'd think we could expect a feature so heavily recommended by Microsoft perf. tuning guides to fundamentally work, which it fundamentally doesn't. www-304.ibm.com/.../docview.wss
Serhad MAKBULOĞLU 23 Nov 2011 1:46 AM # Thanks.
andy 25 Nov 2011 1:03 PM # Tried to request the hotfix but got the below: "The system is currently unavailable. Please try back later, or contact support if you want immediate assistance." When will the hotfix be available from WSUS? We need some quality assurance from Microsoft in order to get it approved on the production environment.
William Holmes 25 Nov 2011 7:49 PM # For Marcus: Yes, all of them. NetDMA in particular seems to have caused cluster communications to be disrupted. This in turn caused a number of Exchange problems, as might be expected.
APPENDIX VII: MICROSOFT CLOSEOUT
From: Gregg Koop
Subject: Recent Exchange/Broadcom case
Date: November 22, 2011 3:13:27 PM EST
To: Chuck Boeheim, Andrea Beesing, William T Holmes
Hi everyone,
I am in the process of closing out your case and classifying this as a bug (Broadcom or otherwise) so that you don't get charged the hours against your contract.
Is there anything else you need from the engineers assigned to this case?
Otherwise, is it OK to close this out?
Thank you.
Kind regards,
Gregg Koop
Sr. Technical Account Manager, MCTS, MBA, PMP, Six Sigma Black Belt
Microsoft US Public Sector Services - State and Local Government & Education
[email protected] office: (732) 476-5581 cell: (908) 391-5656