Practical Strategies for Optimizing Data Storage in IBM Spectrum Virtualize Environments
Brett Allison, Director of Technical Services, Americas


TRANSCRIPT


Practical Strategies for Optimizing Data Storage in IBM Spectrum Virtualize Environments
Brett Allison, Director of Technical Services, Americas

#

Have you ever been stumped by a performance issue in your IBM Spectrum Virtualize environment? If so, this session is for you! It is designed to help users understand how to approach, manage and correct performance issues in an IBM Spectrum Virtualize environment by examining several real-life examples, including bullies, downstream congestion, compression, optimization for IBM flash storage and, lastly, HDS VSP back-end configuration. We will examine the findings, discuss the recommendations and, where applicable, demonstrate through quantitative before-and-after comparisons how the changes that were implemented improved front-end response time.

SVC running Spectrum Virtualize

[Diagram: servers connect through switches to the SVC, which connects through switches to the back-end storage]

#

IBM Spectrum Virtualize software virtualizes your block storage environment, allowing you to manage it from a single pane of glass. In this diagram you can see the SVC running Spectrum Virtualize software residing between the servers and the storage. Spectrum Virtualize runs on a variety of IBM hardware, including the SVC, V7000, V3500, V3700 and V9000.


What Case Studies Will Be Covered?
Dealing with Bullies
Downstream Congestion
Is Compression Worth It?
Optimizing SVC for IBM Flash
Optimizing HDS behind SVC

#

Today we will be examining several case studies, including Dealing with Bullies, Downstream Congestion, Is Compression Worth It?, Optimizing SVC for IBM Flash, and Optimizing HDS behind SVC.

Dealing with Bullies

Do you remember these bullies?

#

Do you remember these characters? Anyone remember the name of the bully in The Karate Kid? Johnny Lawrence; a pen to whoever answers it correctly. Anyone remember the name of the bully in Back to the Future? Biff.

A bully is someone who, through force or intimidation, does what they want regardless of those around them. In IT, a bully describes one system or application in a shared environment that impacts the experience of another system or application. I'm going to demonstrate how to identify bullies in your IBM SVC Spectrum Virtualize environment and quantify the improvement in the health of the environment when they are absent.

In this next slide I will show you the SVC Front-end Dashboard. The SVC Front-end Dashboard identifies the key availability risk indicators for the systems running Spectrum Virtualize software and is based on storage pool front-end metrics. Each row in the dashboard represents a different system running Spectrum Virtualize, and the columns along the top show the different risk indicators. The rating is simply a way of measuring the intensity of the risk, with 0 being no risk and 3 indicating significant risk. By hovering over each bubble you can see the ratings and tooltips for each metric. Now let's take a look at that dashboard.
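To make the 0-to-3 rating scale concrete, here is a minimal Python sketch of how a measured metric could be mapped onto such a scale. The thresholds, the linear interpolation, and the example values are illustrative assumptions, not IntelliMagic Vision's actual rating algorithm.

```python
# Illustrative only: the warning/critical thresholds and the linear scaling
# are assumptions meant to mirror the 0-3 rating idea described above.
def rate_metric(value: float, warning: float, critical: float) -> float:
    """Map a measured value onto a 0-3 risk rating (0 = no risk, 3 = significant risk)."""
    if value <= warning:
        return 0.0
    if value >= critical:
        return 3.0
    return 3.0 * (value - warning) / (critical - warning)

# Hypothetical front-end read response time thresholds in milliseconds.
print(rate_metric(4.2, warning=2.0, critical=10.0))   # ~0.83 -> some risk
print(rate_metric(12.0, warning=2.0, critical=10.0))  # 3.0  -> significant risk
```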


SVC Front-End Dashboard [rating: 3.00] for all Storage Pools by Serial. Rating based on DSS Storage Pool data using DSS Thresholds.

#

In this dashboard you can see that SVC001 is experiencing significant issues, as indicated by the big red dots. We want more detail, though, so we are going to drill down to the individual storage pools associated with SVC001. The Storage Pool Dashboard that I'm going to show highlights the storage pools with the most risk.

SVC Storage Pool Front-End Dashboard [rating: 3.00] for Serial 'SVC001' by Storage Pool. Rating based on DSS Storage Pool data using DSS Thresholds.

#

Several of the storage pools have risks, but we want to look specifically at the pool EP-FLASH_3, which resides entirely on an IBM Flash system. By clicking on the storage pool EP-FLASH_3 and drilling down to the mini-line charts, we can better determine where the performance problem is located. These charts leverage the same rating system for the key risk indicators as the previous dashboards.

SVC Storage Pool mini-charts

#

The overall front-end response time, front-end write response time, and FW Bypass I/Os are red, indicating exceptions. We can ignore the FW Bypass I/Os because the write cache for this pool was disabled during this period. The front-end read response time has a warning and a rating of 0.28, and the overall front-end response time has a rating of 0.56. These are all metrics that indicate saturation of the SVC front-end or its cache.

To verify that the back-end performance is good, we need to check the performance between the SVC and the Flash system. We do this by clicking on the Managed Disks report set and selecting the External Read Response Time chart. Once we have this chart in view, we can click on SVC001 to drill down to the individual pools, which I will show in the next slide.


External Read Response (ms) [rating: 0.00] for Serial 'SVC001', for Storage Pool 'EP-FLASH_3'. Rating based on Managed Disks data using DSS Thresholds.

#

In this chart we see that the external read response time for EP-FLASH_3 is less than 1.0 ms; therefore, we can safely conclude that the performance issues are not with the Flash system but rather with the front-end of the SVC.

The next step is to determine what is driving the workload to the pool EP-FLASH_3. We do this by selecting the Storage Pool report set and selecting the Throughput chart. From the throughput chart we will drill down on the storage pools for SVC001, which I will demonstrate in the next slide.


Read and Write Throughput (MB/s) for Serial 'SVC001' by Storage Pool

#

Now that we have the storage pool throughput in view, we can drill down to the volumes on EP-FLASH_3 to see the top 100 volumes for the storage pool.

Read and Write Throughput (MB/s) (top 100) for Serial 'SVC001', for Storage Pool 'EP-FLASH_3' by Volume Label

#

What stands out in this chart is the steady throughput for the volumes ending in 125, 202, 124 and 57. Now we want to know which systems these volumes are associated with, so I am going to click Identify so we can see which system these volumes relate to.

Identify the System

#

It turns out that the volumes relate to hosts with the prefix ABAAIX. Working with the AIX administrator, we determined that these volumes support an Oracle database. We then worked with the Oracle DBA to determine the purpose and activity of these volumes. It was found that these volumes contained Oracle tables, and the query being executed was hitting tables without indexes. The DBAs used this information to add indexes. The rest of this case study shows what happened next.

Now let's fast-forward a week so that we can measure the system with the issue resolved. In the next chart we will take a look at the volume throughput for the EP-FLASH_3 storage pool on 9/28, after the issue was resolved.


Read and Write Throughput (MB/s) (top 100) for Serial 'SVC001', for Storage Pool 'EP-FLASH_3' by Volume Label

#

What we see in this chart is that the workload profile changed dramatically: the workload from those top four volumes simply disappears. So now the question is, did the reduction in throughput improve the front-end response times?

In this next chart we will compare the response times for 9/28, when the indexes were added, with the response times from 9/23. It clearly shows that removing the workload from the bully volumes resulted in a significant improvement in the front-end response times.

Front-End Read Response (ms) [rating: 0.19 / 0.00] for Serial 'SVC001', for Storage Pool 'EP-FLASH_3'. Rating based on DSS Storage Pool data using DSS Thresholds. Change: -18.45%; absolute change: -0.45 ms

#

The chart shows that the front-end response time improved by an average of 0.45 ms, or 18.45%. Additionally, the risk improved from a warning to no risk. In this demo I showed you how to quickly identify the bullies in the environment and how, once they were dealt with, the overall response time and risk for the entire environment improved.
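As a quick sanity check on those figures, the quoted 0.45 ms absolute improvement and 18.45% relative improvement imply a baseline of roughly 2.4 ms; the baseline in this small Python check is inferred from those two numbers, not read from the chart.

```python
# The baseline is inferred from the reported -0.45 ms / -18.45% change;
# it is not a value taken directly from the chart.
before_ms = 0.45 / 0.1845                      # ~2.44 ms implied average before the fix
after_ms = before_ms - 0.45                    # ~1.99 ms after the indexes were added
pct_change = (after_ms - before_ms) / before_ms * 100
print(f"before={before_ms:.2f} ms, after={after_ms:.2f} ms, change={pct_change:.2f}%")
```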

And That's How You Deal with a Bully

#

Just like Daniel and Marty dealt with their bullies, we have to properly manage bullies in a shared environment. If they cannot behave (i.e., optimize queries and add indexes), then you have to clip their wings (SVC I/O governance) or give them a dedicated environment.
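If governance is the route you take, the Spectrum Virtualize CLI can cap a volume's throughput. The sketch below shows how such a limit might be applied over SSH; the cluster address, user, volume name, and the exact chvdisk flags are assumptions to verify against your own code level before use.

```python
# Hedged sketch: cap the bandwidth of a bully volume via the Spectrum
# Virtualize CLI over SSH. Cluster address, credentials, volume name, and
# the chvdisk -rate/-unitmb syntax are assumptions; confirm them first.
import subprocess

CLUSTER = "svc001.example.com"     # hypothetical cluster address
VOLUME = "ABAAIX_ora_data_125"     # hypothetical bully volume name
LIMIT_MBPS = 200                   # example bandwidth cap in MB/s

cmd = ["ssh", f"superuser@{CLUSTER}",
       f"svctask chvdisk -rate {LIMIT_MBPS} -unitmb {VOLUME}"]
subprocess.run(cmd, check=True)    # raises CalledProcessError if the CLI rejects it
```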

Dealing with Downstream Congestion

#

Does anyone feel like this represents their morning commute?

What is causing the downstream congestion? With IBM SVC, for some workloads the congestion downstream impacts the host response time.

SVC Node Usage [rating: 1.22] by Serial. Rating based on Host Adapters data using DSS Thresholds.

#

In this view you see the IntelliMagic Vision Node Performance Dashboard for IBM SVC. This dashboard provides an overview of the performance health of your IBM SVC environment based on ratings of the key risk indicators. Along the Y-axis it shows the different SVC clusters, and along the X-axis it displays the key risk indicators. Blue dots represent relative I/O busyness, green dots represent areas of low availability risk, yellow dots represent some availability risk, and red indicates significant risk to availability due to performance constraints.

As you can see, there are three different SVC clusters: EAST_SVC001, SOUTH_SVC001 and WEST_SVC001. For this exercise we will focus on the issues with SOUTH_SVC001, where we have significant risk as indicated by Front-end Read Response, Read Hit Percentage and Front-end Write Response time. By selecting the Time charts in the drill-down options above the chart and clicking anywhere on the row of metrics for SOUTH_SVC001, we are taken to the key risk indicators over time.

SVC Node Performance Summary

#

This set of charts is great for identifying whether there is any correlation between the various key risk indicators. Within each chart there is a line for each node. Most of the charts are pretty busy, so in this case it may be hard to see correlation. The outline colors for the rated charts follow from the dashboard. Notice the red for the Node Read Response Time chart, and within that chart, notice the grey line. By clicking on the Node Read Response Time chart we bring it into view and identify the node with the high response time.


Node Read Response Time (ms) [rating: 1.71] for Serial 'SOUTH_SVC001' by HA Name. Rating based on Host Adapters data using DSS Thresholds.

#

Notice that in the legend there are several SVC nodes in bold, each with a rating value. The higher the rating value, the higher the availability risk. In this case, SOUTH_SVC_75RWGNA_N7, which will be referred to simply as Node7 throughout the case study, has the highest rating at 1.71. To see which volumes are affected, we click on the line in the chart or in the legend representing Node7.

Front-End Read Response (ms) (top 30) [rating: 1.08] for Serial 'SOUTH_SVC001', for HA Name 'SOUTH_SVC_75RWGNA_N7' by Volume Label. Rating based on Volume data using DSS Thresholds.

#

This chart shows the front-end read response time for the top 30 volumes on Node7. Only two volumes show in bold in the legend. The volume labeled VOL-00000006, henceforth referred to as Volume 6, has a rating of 1.08. It has the highest rating and the highest risk of performance impact, and it is the biggest reason that Node7 has a high overall rating for front-end read response time. By selecting Isolate in the drill-down and clicking on Volume 6 in the chart, you will see a new chart with a single line come into view, which I will show in the next slide.

Front-End Read Response (ms) [rating: 1.08] for Serial 'SOUTH_SVC001', for HA Name 'SOUTH_SVC_75RWGNA_N7', for Volume Label 'VOL-0000000000006'. Rating based on Volume data using DSS Thresholds.

#

It is easy to see why this volume has a high rating. The read response time for this volume is greater than 100 ms at times. To better understand the root cause, I have customized the chart and added the read miss I/Os per second.

There is a very tight correlation between the high response time and the read miss rate. This means that the majority of the response time is related to back-end service time and queueing. To better understand what is happening on the back-end, we should identify the storage pool by clicking on the Identify drill-down.
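For readers who want to quantify that relationship rather than eyeball it, a minimal sketch follows; the interval samples are hypothetical stand-ins for the exported response time and read miss series for Volume 6.

```python
# Hypothetical interval samples standing in for the exported series; in
# practice you would pull these two columns from your performance tool.
import statistics

read_resp_ms   = [4.1, 6.3, 95.2, 110.4, 8.7, 102.9, 5.5]
read_miss_iops = [120, 180, 2400, 2600, 210, 2500, 150]

# Pearson correlation (statistics.correlation requires Python 3.10+).
corr = statistics.correlation(read_resp_ms, read_miss_iops)
print(f"Pearson correlation: {corr:.2f}")   # close to 1.0 -> miss-driven response time
```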


Volume Properties, Hosts, and Pools

#

Looking at the volume properties, we see that Volume 6 belongs to South_DSS_TIER2_01. Let's take a look at the back-end performance for this storage pool.


SVC Managed Volumes mini-charts

#

As you can see in this set of multi-charts for South_DSS_TIER2_01, there is a lot of queueing and high response times for this storage pool. The queue time is greater than 100 ms at times. This explains the high front-end read response time.

Much like an accident miles ahead on the freeway can cause slow-downs where you are, a downstream bottleneck on the SVC back-end storage can cause high response times on the front-end. This is most acutely observed when the workload is predominantly read misses.
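A simple way to spot this pattern in exported data is to flag the managed disks whose external queue time dominates their external response time. The sample values and the thresholds in this sketch are assumptions for illustration only.

```python
# Hedged sketch: flag managed disks where back-end queueing dominates,
# the "downstream congestion" pattern described above. Sample values and
# the 10 ms / 50% thresholds are illustrative assumptions.
mdisks = {
    "mdisk_tier2_01": {"service_ms": 12.0, "queue_ms": 105.0},
    "mdisk_tier2_02": {"service_ms": 11.5, "queue_ms": 3.0},
    "mdisk_flash_01": {"service_ms": 0.8,  "queue_ms": 0.1},
}

for name, m in mdisks.items():
    total = m["service_ms"] + m["queue_ms"]
    if m["queue_ms"] > 10.0 and m["queue_ms"] / total > 0.5:
        print(f"{name}: queueing dominates ({m['queue_ms']:.1f} of {total:.1f} ms)")
```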


How Do You Deal With Back-end Congestion?
Add more lanes (mdisks, disk drives)
Add faster lanes (i.e., Flash); most helpful for small random reads
Slow certain workloads down (I/O limits)
Add an HOV lane (I/O Priority Queueing); currently only available on DS8000, but IBM has requested that it be standardized in SCSI T10

#

Is Compression Worth It?

#

Compression is meant to reduce the amount of space required to contain something. Remember the expressions "You can't get something for nothing" and "nothing in life is free"? Compression is similar. Just like this man has chosen to wear a very tight-fitting compression shirt to manage his volume, SVC is able to fit more bits into a smaller space with compression. The cost is CPU overhead. On newer hardware models this is most often not an issue, but I will show an example where the cost outweighed the benefit.

In this next slide I'm going to show you an SVC Node Dashboard. The SVC Node Dashboard identifies the key availability risk indicators for the systems running Spectrum Virtualize software and is based on SVC node metrics. Each row in the dashboard represents a different system running Spectrum Virtualize, and the columns along the top show the different risk indicators. The rating is a way of measuring the intensity of the risk, with 0 being no risk and 3 indicating significant risk. By hovering over each bubble you can see the ratings and tooltips for each metric.


SVC Node Usage [rating: 3.00] by Serial. Rating based on Host Adapters data using DSS Thresholds.

#

In this SVC Node Dashboard you can see that SVC001 is experiencing significant issues, as indicated by the big red dots. We want more detail, though, so we are going to drill down to the individual nodes associated with SVC001, which I will show in the next chart.


SVC Node Performance Summary

#

Several of the nodes have risks in Node Read Response Time, Node Utilization, Node Write Response Time and Node Fast Write Bypasses. Let's look at the node utilization and see which node has the highest utilization.


Node Utilization (%) [rating: 2.40] for Serial 'SVC001' by HA Name. Rating based on Host Adapters data using DSS Thresholds.

#

With the node utilization chart in view, you can see that the line associated with Node_3 is higher than any of the others. It has a rating of 2.27. Node_4, its partner node, also has high node utilization, but for now we will focus on Node_3. We want to know what is driving the load on Node_3, so we will take a look at the node throughput.

Node Throughput (MB/s) for Serial 'SVC001' by HA Name

#

The throughput for each node within SVC001 is displayed. We now want to see what volumes are associated with Node_3, so we will drill down on Node_3's preferred volumes.

Read and Write Throughput (MB/s) (top 30) for Serial 'SVC001', for HA Name 'Node3_174163' by Volume Label

#

The volumes with the most throughput are listed first in the legend. We need to understand whether these volumes are compressed, because very busy volumes are not good candidates for compression.

Further investigation found that these were indeed compressed volumes, and they were migrated to uncompressed volumes. Let's take a look at how removing the busy compressed volumes significantly improved the performance of Node_3 in terms of both CPU utilization and overall front-end response time.
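The check itself is simple to script: cross-reference per-volume throughput with the compression attribute and flag busy compressed volumes. The field names and the busy threshold below are assumptions, loosely modeled on lsvdisk-style output rather than any specific export format.

```python
# Hedged sketch: flag busy compressed volumes as candidates for migration
# to uncompressed. Field names and the 300 MB/s threshold are assumptions.
volumes = [
    {"name": "vol_ora_01", "compressed": True,  "mbps": 540.0},
    {"name": "vol_fs_02",  "compressed": True,  "mbps": 12.0},
    {"name": "vol_db_03",  "compressed": False, "mbps": 610.0},
]

BUSY_MBPS = 300.0   # example threshold for "very busy"

for v in volumes:
    if v["compressed"] and v["mbps"] >= BUSY_MBPS:
        print(f"{v['name']}: busy compressed volume -- consider migrating to uncompressed")
```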


Node Utilization (%) [rating: 2.04 / 1.02] for Serial 'SVC001', for HA Name 'Node3_174163'. Rating based on Host Adapters data using DSS Thresholds. Change: -37.45%; absolute change: -24.28%

#

The chart on the left is the node utilization prior to the removal of the highly active compressed volumes. The chart on the right is for a period after the compressed volumes were removed. The node utilization for Node_3 is 24.28% lower on average without the active compressed volumes than it was with them. Now let's take a look to see if this improved the front-end read response time.

Front-End Read Response (ms) [rating: 0.01 / 0.00] for Serial 'SVC001', for HA Name 'Node3_174163'. Rating based on Host Adapters data using DSS Thresholds. Change: -19.14%; absolute change: -0.67 ms

#

This chart shows the front-end read response time for Node_3. The chart on the left is for the period when the active compressed volumes resided on Node_3, and the chart on the right shows a period after the volumes were removed. The response time improved by 19.14%, with an absolute change of 0.67 ms.

0.67 ms may not seem like a big number, but in performance tuning, any time you can get a nearly 20% improvement in response time with minimal effort, it should be considered a victory.

Was the Compression Worth the Space Saved?
In this case it was not
Look for high node utilization
Examine the active volumes on busy nodes: are they compressed?
If active volumes are compressed, they may not be the best candidates for compression

#


Optimizing IBM Flash Behind SVC

[Diagram: Unix and VMware host servers connect through the fabric to the SVC and its back-end storage]

#

In this case study we will discuss the optimization of IBM Flash behind SVC.

IBM Flash systems are fast. To keep them moving at flash speeds, it is important to understand the best way to configure them. One of the hardest things to do when measuring configuration changes is quantifying the performance impact of an issue. With IntelliMagic Vision we will show how to easily compare performance with and without the SVC cache enabled, and the impact on the SVC front-end write response time.

SVC Normal Write Cache Operations

[Diagram: a host write arrives over the HBAs at one virtualization node's cache, is mirrored to the partner node's cache, and is later de-staged to back-end storage]

#

With the SVC write cache enabled, during normal write cache operations a host write will enter an SVC node's cache and then be sent to the partner node. Upon completion of the write by the partner node, the host receives an acknowledgement that the write has completed. The write is then scheduled for de-stage to the back-end storage, or, in the case of data that needs to be replicated, it will be scheduled for replication as well.

SVC Write Cache Disabled Operations

[Diagram: with the write cache disabled, a host write bypasses both nodes' caches and passes through the SVC directly to back-end storage]

#

When the write cache is disabled for a volume on the SVC, the write will not be placed in the SVC cache, nor will it be mirrored to the partner node. It will pass through the SVC and go directly to the back-end storage controller. In reality there is coordination between the SVC, the host and the back-end that I did not include in the diagram in order to keep it manageable. It has been suggested that, to maximize throughput for large sequential operations on SVC volumes residing entirely on IBM Flash systems, you should consider disabling the write cache. In this case study we will look at whether this was true for the workload we examined.
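For completeness, here is a hedged sketch of how the cache setting might be flipped per volume from a script so the two configurations can be compared. The chvdisk -cache readwrite/none syntax is from memory, and the cluster and volume names are hypothetical; confirm against your Spectrum Virtualize code level before running anything.

```python
# Hedged sketch: toggle the SVC write cache for a volume over SSH.
# Cluster address, volume name, and the chvdisk -cache syntax are assumptions.
import subprocess

CLUSTER = "svc001.example.com"   # hypothetical cluster address
VOLUME = "EP_FLASH_3_vol01"      # hypothetical volume in the EP-FLASH_3 pool

def set_write_cache(volume: str, enabled: bool) -> None:
    mode = "readwrite" if enabled else "none"
    cmd = ["ssh", f"superuser@{CLUSTER}", f"svctask chvdisk -cache {mode} {volume}"]
    subprocess.run(cmd, check=True)

set_write_cache(VOLUME, enabled=True)   # the measurements below favour cache enabled
```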

Front-End Write Response (ms) [rating: 0.67 / 2.41] for Serial 'SVC001', for Storage Pool 'EP-FLASH_3'. Rating based on DSS Storage Pool data using DSS Thresholds. Change: 94.38%; absolute change: 1.87 ms

Cache Disabled / Cache Enabled

#

This chart shows the front-end write response time for SVC001 for the storage pool EP-FLASH_3. The chart on the left represents the time period when the write cache was enabled, and the chart on the right represents the time period when the write cache was disabled. The average write response time increased by 95.21%, a real increase of 1.88 ms, when the SVC write cache was disabled. Disabling the write cache was very detrimental to the write response time. In this brief case study we demonstrated how disabling the SVC write cache can be detrimental to the performance of your SVC environment.

HDS Behind SVC: Sometimes Modifications Are Required to Achieve the Desired Results

#

Does anyone know the name of this vehicle and what movie it was from? That's right, it was a DeLorean from Back to the Future. Significant modifications were made to allow it to travel through time.

Just as the stock configuration does not always achieve the desired results, sometimes the back-end storage configuration has to be tweaked to get the best performance from your Spectrum Virtualize environment. In this case we will look at an HDS VSP residing behind IBM SVC.

SVC running Spectrum Virtualize

[Diagram: servers connect through switches to the SVC, which connects through switches to the HDS back-end storage]

#

The configuration of the back-end is very important and can affect front-end performance. For the HDS VSP/G family, it is essential to configure the HDS host connections correctly to optimize performance. In this example, we will take a look at an IBM SVC using an HDS VSP as the storage back-end.

In the next screen we will look at the SVC Front-end Dashboard, which identifies the key availability risk indicators for the systems running Spectrum Virtualize software and is based on storage pool front-end metrics. Each row in the dashboard represents a different system running Spectrum Virtualize, and the columns along the top show the different risk indicators. The rating is a way of measuring the intensity of the risk, with 0 being no risk and 3 indicating significant risk. By hovering over each bubble you can see the ratings and tooltips for each metric.

SVC Front-End Dashboard [rating: 0.04] for all Storage Pools by Serial. Rating based on DSS Storage Pool data using DSS Thresholds.

#

This SVC Front-End Dashboard shows that all is well. The SVC Front-end Dashboard shows the measured performance between the SVC and the attached hosts. Since this view provides the performance perspective of the hosts attached to the SVC, it is the most important report. While the SVC Front-end Dashboard may show that everything is healthy, there may be issues downstream on the attached storage that could, under the right workloads, cause performance issues on the front-end.

In this next slide we will take a look at the SVC Back-end Dashboard.

SVC Back-End Dashboard [rating: 0.90] by Serial. Rating based on Managed Disks data using DSS Thresholds.

#

The SVC Back-end Dashboard reveals the health of the SVC back-end storage. Ideally, the external response times should be low and there should be no queueing. Notice that this dashboard shows a red bubble for the External Read Response Time, External Queue Time and External Read Queue Time for SVC01. On the other hand, SVC02 is having no issues. There is definitely a significant amount of read queueing. Let's take a look at the External Read Queue Time in ms.

External Read Queue Time (ms) [rating: 0.90] for Serial 'SVC01'. Rating based on Managed Disks data using DSS Thresholds.

#

This chart shows the External Read Queue Time for SVC01. There is significant queueing happening at the SVC level; let's see if the queueing is happening on all the managed disks. We will look at that in the next chart.


External Read Queue Time (ms) (top 50) [rating: 0.83] for Serial 'SVC01' by Managed Disk Name. Rating based on Managed Disks data using DSS Thresholds.

#

This chart shows the top 50 managed disks associated with this SVC across all the storage pools. In the legend you can see that all the managed disks that are in bold, those having an issue, also have the name SSD in them. This is really interesting: the fastest devices with the lowest service time have the highest queueing. All of the SSD devices are LUNs carved from SSD RAID groups on a VSP.

The next step in this investigation was to determine how the back-end VSP system's host entries were configured for the SVC host group on the HDS VSP.

It turned out that the VSP host entries for the SVC were set to Standard, which is the default; they should have been set to Windows. Once this was changed, the queueing cleared up.

I hope you enjoyed this case study, and remember: the default is not always the best choice!

[email protected]

#

I hope you enjoyed learning about these SVC/Spectrum Virtualize case studies as much as I enjoyed discovering them for the first time. For a no-obligation assessment of your environment, please contact me at [email protected].
