Analyze a SVC / Storwize metro / global mirror performance problem

Version: V1.0

Date: 08/18/2015

Author: [email protected]

BVQ – Business Volume Qualicision® Web: www.bvq-software.com WIKI: bvqwiki.sva.de

Copyright by:

SVA – System Vertrieb Alexander GmbH Borsigstraße 14 65205 Wiesbaden-Nordenstadt Web: www.sva.de

Content

1. Description of the performance problem solved in this document
2. Step 0: Proof that the problem is detectable in the storage system
3. Step 1: perform a quick health check
3.1. Node CPU Check and Node Port Check
3.2. Global cache check
4. Step 2: A deeper look into the failing VDisk
5. Step 3: Go deeper from VDisk to VDisk copy
5.1. Check the latency of the VDisk copies
5.2. Check lower cache partition for overflow
6. Step 4: Going upwards and looking into the upper cache
7. Step 5: Remote copy analysis
7.1. Check the VDisk on the remote copy target side
7.2. Check the connection between the two clusters
8. BVQ Web Pages

This document can be downloaded from this page: http://bvqwiki.sva.de/x/HQCuAQ

1. Description of the performance problem solved in this document

A latency problem was reported for VDisk CA-CL1-Disk04-N at 02/05/15 8:09. The environment consists of two clusters connected with Metro Mirror.

This document has two aims. The first is to show how we found the root cause of this problem in the link between the two clusters. The second is to describe how the BVQ structured performance problem analysis method led us to that root cause. It demonstrates that successful analysis work needs a structured method and a tool that supports this method and delivers the needed technical insight.

Our concept is that everybody should be able to conduct a performance analysis. This is important because the level of service is lowered day by day, and small customers in particular rely more and more on their own skills or on the skills of their partners. This is a common problem occurring at all vendors!

When the BVQ structured performance problem analysis method is used in combination with the information available in the BVQ Library, it becomes very easy for customers to detect the root causes of at least 80% of all typical performance problems by themselves. Solving a problem becomes easier once the root causes are uncovered. This can also be a perfect opportunity for partners to use BVQ in their customer service. Nothing is more impressive than presenting problem solving skills to customers. Just contact us: http://tinyurl.com/CALL-BVQ

The following gives some information about the BVQ structured performance analysis method, which makes it much easier to find performance problems inside storage systems.

What is the BVQ structured performance problem analysis method?

• A step by step approach to quickly identify the root causes of performance problems

• The method was developed by the BVQ team and incorporates the experience of hundreds of solved performance problems

• Avoids poking around in the dark when searching for the cause of a problem

• The BVQ User Interface is aligned to support this analysis method

• You can read more about this method in the BVQ Library

The idea of the BVQ structured performance problem analysis is to track the problem down along the data path until we either reach the root cause or the problem is no longer visible. At each storage layer we have these decision points:

• Problem is visible in this layer --> go to the next deeper layer and check whether the problem is visible there
--> if not, check whether a root cause can be found in this layer
--> if no root cause can be found here, go to the next upper layer

• Problem is not visible in this layer --> the problem should be found in the layer before

The analysis normally starts where the symptom of the problem becomes visible, which is usually the VDisk layer. An experienced performance analyst knows shortcuts and will probably use other routes, but the goal at the end of the analysis is the same: to find the layer where the problem occurs.
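The decision points above can be sketched in a few lines of Python. This is only an illustration of the walk through the layers, not BVQ code: the layer list, the function name `locate_root_cause` and the two callbacks are assumptions made for this example, and in practice the answers come from an analyst looking at the charts.

```python
from typing import Callable, Optional

# Assumed, simplified layer order: top of the data path first.
LAYERS = [
    "remote copy",
    "upper cache",
    "VDisk",
    "VDisk copy / lower cache",
    "MDisk",
    "drive",
]


def locate_root_cause(
    start: str,
    symptom_visible: Callable[[str], bool],
    root_cause_found: Callable[[str], bool],
) -> Optional[str]:
    """Walk the layers as described by the decision points (hypothetical helper)."""
    i = LAYERS.index(start)
    # Go deeper as long as the next deeper layer still shows the symptom.
    while i + 1 < len(LAYERS) and symptom_visible(LAYERS[i + 1]):
        i += 1
    # Check the deepest affected layer first, then move upwards layer by layer.
    while i >= 0:
        if root_cause_found(LAYERS[i]):
            return LAYERS[i]
        i -= 1
    return None
```

Starting at "VDisk" with no symptom in the deeper layers and a root cause only in "remote copy", the walk moves upwards and returns "remote copy", which mirrors the route taken in this document.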

It may sound odd that the first step of the structured performance analysis is always to prove that the problem exists in the storage system at all. We do this by looking up the problematic disk in the given time frame and checking for high latencies. We then try to find out whether the problem is a peak problem or an overload problem, whether it affects read or write operations, and whether it is caused by the SAN connections or by the host. We also check whether there are overload problems in the nodes.

BVQ Library references (access for customers and partners)

• The step by step analysis approach for a VDisk performance problem

2. Step 0: Proof that the problem is detectable in the storage system

Latency problem was reported for VDisk CA-CL1-Disk04-N at 02/05/15 8:09

The first thing we should always do is prove that the problem can be tracked down in the storage system at all. This is the easiest part of the analysis: we use the given time frame and the volume name to look up whether we can find latency issues there. The first valuable pieces of information here are:

• Is it a peak latency or do we have an overload situation?

• Is only this one volume affected or do we see the same peaks in more volumes?

Pict.1: A sudden latency peak of 400 ms without any obvious reason usually indicates a problem that you have to look for in the environment of the caches or the infrastructure. This latency peak is the starting point of our analysis example.
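The two questions above (peak or overload, one volume or many) can be illustrated with a small Python sketch. The 20 ms threshold, the 50% fraction and the function names are assumptions chosen for this example only; they are not values taken from BVQ or from this case.

```python
from typing import Dict, List, Sequence


def classify_latency(
    samples_ms: Sequence[float],
    threshold_ms: float = 20.0,      # assumed limit for a "high" latency
    overload_fraction: float = 0.5,  # assumed share of high samples for "overload"
) -> str:
    """Return 'none', 'peak' or 'overload' for one volume's latency samples."""
    high = [s > threshold_ms for s in samples_ms]
    if not any(high):
        return "none"
    fraction = sum(high) / len(samples_ms)
    return "overload" if fraction >= overload_fraction else "peak"


def affected_volumes(
    latency_by_volume: Dict[str, Sequence[float]],
    threshold_ms: float = 20.0,
) -> List[str]:
    """List the volumes that exceed the threshold at least once."""
    return sorted(
        name
        for name, samples in latency_by_volume.items()
        if samples and max(samples) > threshold_ms
    )
```

A series that is low except for a single 400 ms sample is classified as a peak, while a series that stays above the threshold most of the time is classified as an overload.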

BVQ Library references (access for customers and partners)

• How to verify whether the problem exists in the storage

• Different kinds of performance problems (peak, overload, long lasting latency)

3. Step 1: perform a quick health check

It is recommended to carry out a quick health check of the nodes before the analysis to exclude an internal problem in the node (CPU, caches, node ports). An issue here would lead the analysis in another direction. Since the health check is a documented standard procedure, the individual results are not explained further here. https://customercenter.sva.de/home/x/UAogAw

BVQ Library references (access for customers and partners)

• Quick storage system health check

Summary of the results:

3.1. Node CPU Check and Node Port Check

It is a common mistake to think that the average CPU load provides significant information. A CPU consists of single cores, and every overloaded core can be a reason for a performance problem because the cores serve different storage partitions within the system. For example, one heavily loaded core at about 80% and another at 20% load have an inconspicuous average of 50%, and with this the problem would be missed.
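The arithmetic of this example is easy to reproduce. The following Python sketch uses the two load values from the text; the 70% limit and the function name `overloaded_cores` are assumptions for illustration, not a documented threshold.

```python
from typing import List, Sequence


def overloaded_cores(core_load_pct: Sequence[float], limit_pct: float = 70.0) -> List[int]:
    """Return the indexes of cores at or above the (assumed) load limit."""
    return [i for i, load in enumerate(core_load_pct) if load >= limit_pct]


loads = [80.0, 20.0]                # the two cores from the example above
average = sum(loads) / len(loads)   # 50.0, which looks harmless
hot = overloaded_cores(loads)       # [0], the core the average hides
```

Checking each core against the limit finds the loaded core, while the average alone would not raise any suspicion.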

The second picture, the Node Port Check, reviews overloaded ports, buffer credit wait times and SAN errors.

• No overload in CPU cores in the period in question

• Relatively high load in one core two hours later (orange box)

• Buffer credit waits always below 1% and uncritical

• No SAN errors found

• Higher port load on one port two hours later

Pict.2: CPU cores are shown in the upper picture, node ports in the lower picture. Both pictures show no issue at the time the performance problem happened (red box). However, it is remarkable that approximately two hours after the problem occurred, high values suddenly become visible (orange box). Later we will see that the recovery phase was responsible for this. In the orange time frame one CPU core shows high values because a part of the storage had to be recovered.

3.2. Global cache check

Upper and lower cache are busy within normal limits. Two statements about this result:

a. The upper cache would react if compression contributed to the latency. This system does not use compression, so an upper cache latency would be an event that perhaps indicates a bug in the code.

b. The lower cache would indicate a performance issue in the storage backend or a volume which is overloading the cache with long lasting heavy write operations.

Both global caches show no abnormalities. The results are taken as helpful information which now can be used during the next steps.

• No overload situation in upper cache in the considered time period

• No overload situation in lower cache in the considered time period

• Higher usage of upper cache two hours later (recovery phase)

• The lower cache does not show any recovery phase. Interestingly, the recovery does not happen below the VDisks!

Pict.3: Upper cache (picture one), lower cache (picture two). No issues when the problem happened but the upper cache shows some reaction in the orange box where we found indicators for a recovery phase. The lower cache is not affected at all and shows a medium write load in the system.

4. Step 2: A deeper look into the failing VDisk

The VDisk is not really failing, but it shows very high latencies that disturb the dependent systems. We have to take a deeper look into the volume to find out whether we can exclude some of the typical error causes.

BVQ Library references (access for customers and partners)

• Further VDisk problem investigation - collect information

We select the highest latency peak with the mouse and use the right mouse button menu to start the readout of this measurement point.

We find the following intermediate results:

• The problem is a write problem not a read problem

o The highest measured mean write latency was 441 ms (points to storage backend and/or lower caches)

o The peak latency seen at the servers was 7974 ms (this is enough to kill servers and processes)

• The transfer latencies to the host are normal to low

o This is not a problem influenced by the connected server. So we can exclude the server as being part of the problem.

o This is also not a communication problem between server and storage.

Pict.4: A deeper look into the volume uncovers very high write latencies but no issue with the server the volume belongs to. We now know that we have a write problem and that we can exclude the server as the cause of the problem, because the transfer time of 0.16 ms is within normal limits.
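The reasoning in this step can be written down as a small triage function. The sketch below is a hypothetical illustration in Python: the class, the function and both limits are assumptions, and the read latency in the usage note is a placeholder because the document does not state it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MeasurementPoint:
    read_latency_ms: float
    write_latency_ms: float
    host_transfer_latency_ms: float


def triage(
    point: MeasurementPoint,
    latency_limit_ms: float = 20.0,   # assumed limit for read/write latency
    transfer_limit_ms: float = 1.0,   # assumed limit for host transfer latency
) -> List[str]:
    """Summarize what one measurement point tells us (illustrative only)."""
    findings = []
    if point.write_latency_ms > latency_limit_ms:
        findings.append("write problem")
    if point.read_latency_ms > latency_limit_ms:
        findings.append("read problem")
    if point.host_transfer_latency_ms > transfer_limit_ms:
        findings.append("host or host connection suspect")
    else:
        findings.append("host and host connection excluded")
    return findings
```

With the values from this case, `triage(MeasurementPoint(1.0, 441.0, 0.16))` reports a write problem and excludes the host, where 1.0 ms is an assumed, unremarkable read latency.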

5. Step 3: Go deeper from VDisk to VDisk copy

Looking into the VDisk layer, we found the symptoms of the problem and gained some additional knowledge. Since the problem exists in the storage system, we go one step deeper to find out whether there are traces of the problem in the VDisk copies. In V7.3 the VDisk copies are the more interesting part compared to the VDisk itself, because the VDisk copy is where the lower cache is located. Comparing the IO before and after the lower cache gives very detailed information about how VDisks are hitting the backend storage.

BVQ Library references (access for customers and partners)

• VDisk copy and lower cache analysis

Shows how BVQ helps you drill down easily to the right objects

How to start the VDisk copy analysis from this point

BVQ helps you drill down or drill up easily between storage layers. You just use the right mouse button on a measurement point and select the level you want to go to. BVQ always keeps an eye on the topology to lead you to the right objects.

This is very helpful, for example, in an Easy Tier environment. When drilling down from a VDisk, SSDs are only displayed if the VDisk was using SSDs at the chosen point in time. Thus, analysis steps are no longer distorted and it becomes easier to reach the right outcome.

There is another analysis method called 'backtracking' which is based on this characteristic. This method is used to find VDisks causing peak behavior inside the storage system (node ports, MDisks, CPU cores...). Backtracking is performed from the suspicious peak to the VDisks layer to find out which VDisk is causing the peak.

Pict.5: Drilling up and drilling down is easy with BVQ. By using the right mouse button it is possible to drill up or drill down between all available layers in the storage system. One of the most exceptional abilities of BVQ is that these activities are guided by the topology. So when drilling down from a VDisk to the MDisk layer, it only shows MDisks that are really used by the VDisk instead of presenting all MDisks of the managed disk group. There is also excellent guidance when drilling up from a latency peak inside the system to the VDisk layer to find the volumes causing the peak. This is named 'backtracking'.
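The idea of topology-guided navigation can be shown with a tiny Python sketch. It is not how BVQ implements it; the data structure (extent counts per VDisk and MDisk), the object names and both functions are assumptions for illustration.

```python
from typing import Dict, List

# Hypothetical topology: VDisk name -> {MDisk name: number of extents used}.
Topology = Dict[str, Dict[str, int]]


def drill_down(vdisk: str, topology: Topology) -> List[str]:
    """Return only the MDisks that really hold extents of this VDisk."""
    return sorted(m for m, extents in topology.get(vdisk, {}).items() if extents > 0)


def backtrack(mdisk: str, topology: Topology) -> List[str]:
    """Return the VDisks that use this MDisk, i.e. the candidates for a peak seen there."""
    return sorted(v for v, usage in topology.items() if usage.get(mdisk, 0) > 0)


topology: Topology = {
    "vdisk_a": {"mdisk0": 10, "mdisk1": 0},
    "vdisk_b": {"mdisk1": 5},
}
```

Here `drill_down("vdisk_a", topology)` lists only mdisk0, and `backtrack("mdisk1", topology)` narrows a peak on mdisk1 down to vdisk_b. Real backtracking would additionally compare the load of the candidate VDisks at the time of the peak.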

Checking the VDisk copy layer includes two steps. First we look into the latency of the VDisk copy, then we check the fill level of the lower cache, which is part of the VDisk copy layer.

When opening the VDisk copy we find two objects in the options panel which have been automatically selected by BVQ based on the topology.

5.1. Check the latency of the VDisk copies

We check the latencies of the VDisk copies in read and write mode.

In this case we do not find higher latencies in the VDisk copies. So we assume that we will not find the root cause of this problem in the VDisk copies or any deeper layer.

No high R/W latencies found in VDisk copies

The VDisk copy check is not finished yet. The lower cache also belongs to the VDisk copies and we have to check this too because the problem can also be brought in by another VDisk filling up the cache.

Pict.6: No higher latencies in the VDisk copy layer. This means that the latency problem is not caused by any layer deeper than VDisk. We also have to check the cache partition of the VDisk to find whether the problem had been caused by cache overflow.

5.2. Check lower cache partition for overflow

There is a second possible cause of problems in the VDisk copy layer: an overflow of the lower cache. Mostly this is caused by another VDisk or even a group of VDisks filling up the cache.

Using the right mouse button menu, BVQ automatically opens the MDisk group cache partitions of both VDisk copies.

In this situation we do not find a problem in the cache partitions. So the second element of the VDisk copies does not show any symptoms of the problem either.

No lower cache problem found during the period in question

High cache load situation found two hours later (recovery phase)

Pict.7: No problem in the partition cache (red box). We have a much higher cache utilization when the storage system recovers from the problem (orange box).

The results of the VDisk copy analysis make it necessary to look above the VDisk layer:

• No high R/W latencies found in VDisk copies

• No lower cache problem found during the period in question

• High cache load situation found 2 hours later (recovery phase)

No symptom of the problem was visible in the VDisk copy layer. So the symptoms only reach the VDisk layer and do not go further into the MDisk or drive layer. Now we have to go in the opposite direction, looking into the upper cache and then into the remote copy layer, which is located above the upper cache.

6. Step 4: Going upwards and looking into the upper cache

It is very unlikely that we will find upper cache problems because this customer does not use compression. A problem here could then only point to an overload from another volume (very rare for the upper cache) or to a code fault (these appeared in V7.3 levels). The results are:

No upper cache problem found during the period in question

High cache load situation found 2 hours later (recovery phase)

Pict.8: At the time the problem happened, the upper cache load is low (red box). We see the recovery phase in the orange box.

So far we have found that the symptoms of the problem are only visible in the VDisks. The VDisk copies, the lower cache and the upper cache do not show any symptoms, so the backend storage and overload situations brought in from other volumes can be excluded as root causes.

The only layer left is the remote copy layer. There are two possible causes for the problem here:

• The target system has a performance problem. The VDisk latency associated with our disk is too high.

• There is a problem in the connection of the two storage systems

7. Step 5: Remote copy analysis

Because we did not find the problem in or below the upper cache, we have to look into the remote copy layer. To do this, the BVQ Copy Services Package needs to be installed. The possible source of the problem is either an issue in the communication line between the clusters or a VDisk latency problem on the target side.

BVQ Library references (access for customers and partners)

• Analyze SVC / Storwize copy services

7.1. Check the VDisk on the remote copy target side

The BVQ Treemap allows us to visualize all three available copies of the VDisk:

o Primary and secondary copy are the two VDisk copies of the source VDisk

o Remote copy is the target VDisk on the other cluster

Pict.9: This Treemap shows all three copies of the VDisk with performance information. The disks are connected via the remote copy relationship object, which is the same for all copies. The remote copy handles less traffic (write traffic only), which is expressed by the smaller size of this object in the Treemap. Please do not be confused by the equal performance of VDisk and VDisk copy. Performance differences here are only visible on the VDisk copy layer.

We now start an analysis of the source and the target VDisk and find that the remote VDisk is performing very well, with a latency of no more than 1.02 ms. So the only cause left for the problem is the communication line between the clusters.

Target VDisk is performing well with a latency of only 1.02 ms.

Pict.10: This picture shows target and source VDisk in one analysis screen. We see the very high latency of the source disk and, in the same time period, the normal latency of the target disk. This screen shows without any doubt that the target side of the VDisk is not the cause of the problem.
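The elimination in this step follows a simple rule that can be sketched in Python. The function name and the 20 ms limit are assumptions for illustration, and the rule only holds once the local layers (VDisk copies, lower cache, upper cache) have already been excluded as shown in the previous steps.

```python
def remote_copy_suspect(
    source_write_ms: float,
    target_write_ms: float,
    limit_ms: float = 20.0,  # assumed limit for a "high" write latency
) -> str:
    """Decide which side of the remote copy to examine next (illustrative only)."""
    if source_write_ms <= limit_ms:
        return "no problem on source"
    if target_write_ms > limit_ms:
        return "target system"
    return "inter-cluster link"
```

With the values from this case, a source write latency of 441 ms and a target latency of 1.02 ms, the function returns "inter-cluster link".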

7.2. Check the connection between the two clusters

We can start an analysis of the cluster-to-cluster connections. Here we find the root cause of the problem in a disturbed connection between the two clusters. We have a time period of more than one hour where the lines show very high latencies.

Latency problem in the connection between cluster CA_SVC and storage cluster NY_SVC. This is finally the root cause of the problem.
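A period of sustained high latency on a link, such as the one described here, can be found in a series of samples with a short Python sketch. The function name, the sample format and the limit are assumptions; the data in the usage note is synthetic and not taken from this case.

```python
from datetime import datetime, timedelta
from typing import List, Sequence, Tuple

Sample = Tuple[datetime, float]  # (timestamp, link latency in ms)


def high_latency_windows(
    samples: Sequence[Sample], limit_ms: float
) -> List[Tuple[datetime, datetime]]:
    """Return (first, last) timestamps of each run of samples above the limit.

    The samples are expected to be sorted by time.
    """
    windows = []
    start = last = None
    for timestamp, latency_ms in samples:
        if latency_ms > limit_ms:
            if start is None:
                start = timestamp
            last = timestamp
        elif start is not None:
            windows.append((start, last))
            start = None
    if start is not None:
        windows.append((start, last))
    return windows
```

For synthetic samples taken every five minutes, where the latency is high from the third to the sixteenth sample, the function returns one window of 65 minutes, which is the kind of period longer than one hour that is visible on the cluster-to-cluster connection.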

Pict.11: Here we see the root cause of the latency problem. The FC connection between the two systems had a very high latency peak. Now that we know the definitive place where the problem started, we can take action to solve the problem and avoid future situations like this. The following picture shows what a connection should look like when it is working correctly.

8. BVQ Web Pages

BVQ in the www

• BVQ Website
http://www.bvq-software.com/ (English)
http://www.bvq-software.de/ (German)
http://bvqwiki.sva.de (technical wiki with download)

English home page

German home page

BVQ library (customers and partners)

BVQ partner information

Technical white papers, videos and examples

Download BVQ software

Download BVQ offline scanner

Start a BVQ demo installation

Contact us!

If you are interested in BVQ, a demo or a performance analysis, please contact [email protected].

If you are an IBM Business Partner with SVC or Storwize customer installations and you want to sell BVQ, please contact [email protected].

BVQ is a product from SVA System Vertrieb Alexander GmbH

SVA – System Vertrieb Alexander GmbH

Borsigstraße 14 65205 Wiesbaden Germany

Web: http://www.sva.de