

    EMC Navisphere Analyzer: A Case Study

    May 2001


No part of this publication may be reproduced or distributed in any form or by any means, or stored in a database or retrieval system, without the prior written consent of EMC Corporation. The information contained in this document is subject to change without notice. EMC Corporation assumes no responsibility for any errors that may appear. All computer software programs, including but not limited to microcode, described in this document are furnished under a license, and may be used or copied only in accordance with the terms of such license. EMC either owns or has the right to license the computer software programs described in this document. EMC Corporation retains all rights, title and interest in the computer software programs. EMC Corporation makes no warranties, expressed or implied, by operation of law or otherwise, relating to this document, the products, or the computer software programs described herein. EMC CORPORATION DISCLAIMS ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. In no event shall EMC Corporation be liable for (a) incidental, indirect, special, or consequential damages or (b) any damages whatsoever resulting from the loss of use, data, or profits, arising out of this document, even if advised of the possibility of such damages.

EMC², EMC, CLARiiON, and Navisphere are registered trademarks and "where information lives" is a trademark of EMC Corporation. All other brands or products may be trademarks or registered trademarks of their respective holders.

Copyright 2001 EMC Corporation. All rights reserved.

C844


Table of Contents

EMC Navisphere Analyzer: A Case Study
Executive Summary
Introduction
Customer's Problem
Conclusions of the Analysis
Summary
Appendix 1


Executive Summary

This white paper describes the functionality of EMC Navisphere Analyzer through the discussion of a customer case study. The case study presents the functionality available with Navisphere Analyzer, the problem that the customer experienced, the types of data that were collected, and the methodology that was used to resolve the problem using Analyzer.

As more and more companies rely on CLARiiON products, the need increases to be able to quickly determine the following:

- The array is being used efficiently
- The array is working properly
- Sufficient resources exist on the array for normal day-to-day operations, as well as potential growth

Navisphere Analyzer software is a host-based performance analysis tool that is intended to be used as a microscope to examine specific data in as much detail as necessary, to determine the cause behind a bottleneck and/or a performance issue. Once the cause has been isolated, Analyzer is of further assistance in helping to assess whether fine-tuning parameters of the array will solve the problem or whether hardware components, such as cache memory or disks, need to be added.

Analyzer can be used to continuously monitor and analyze performance. This is most helpful in determining how to fine-tune array performance for maximum utilization. Alternatively, it can be used to analyze data collected earlier. Data can be collected automatically from selected arrays, LUNs, or storage processors. The user can specify when to record data, from which hosts to gather data, and where the data should be stored. Collecting historical data of this type is helpful in determining the cause of lingering performance problems. Finally, the user can compare real-time data to data recorded previously to help analyze performance issues, as the sketch following this paragraph illustrates.
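Analyzer performs this recording and comparison through its own interface; the following Python sketch is only a hypothetical illustration of the archive-and-compare workflow described above. The function sample_lun_utilization() is a made-up stand-in for whatever actually collects the metric.

```python
import csv
import time
from statistics import mean

def sample_lun_utilization(lun):
    """Hypothetical stand-in for the tool's data collection."""
    raise NotImplementedError("supply a real collector here")

def record(lun, archive_path, interval_s=60, samples=120):
    """Append timestamped utilization samples to a CSV archive."""
    with open(archive_path, "a", newline="") as f:
        writer = csv.writer(f)
        for _ in range(samples):
            writer.writerow([time.time(), lun, sample_lun_utilization(lun)])
            time.sleep(interval_s)

def deviation_from_baseline(current_pct, archive_path, lun):
    """Compare a real-time reading against the archived average."""
    with open(archive_path, newline="") as f:
        history = [float(u) for _, l, u in csv.reader(f) if l == lun]
    return current_pct - mean(history)  # positive = busier than usual
```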


Introduction

A typical problem for a system administrator to encounter is a complaint by one or more departments that the performance of an array drastically changes from time to time and that it seems unrelated to what that department is doing at the time. That is, the department's usage of the array has not changed significantly.

The administrator will usually try to gather information about when the problem occurred to see if he or she can determine what else was going on at the time. This is usually difficult to do because it's rare that every department remembers precisely what they were doing, when.

Navisphere Analyzer permits the administrator to collect data over different blocks of time and then analyze that data to see if there is any hint about the underlying causes of the problem.

For many problems, looking at the utilization of different components of the array is usually sufficient to quickly narrow down the basis of the problem. That is, first looking in general at the utilization of each LUN. Then, if a particular LUN's utilization is high, looking in more and more detail at the performance characteristics of that particular LUN.

When using Navisphere Analyzer for such an investigation, the Basic data types (see Appendix 1) are typically quite sufficient for the analysis of most performance problems. That is, utilization of the LUN, storage processor, and/or disks can help give a specific direction to pursue in researching a performance issue.

To illustrate this point, a case study is presented here in which Navisphere Analyzer was used to determine the underlying cause of a performance problem.


Customer's Problem

The customer is a major CLARiiON account who was experiencing a severe performance problem each month when a large sales report was run. The configuration consisted of two FC5700 CLARiiON arrays with a combined storage of two terabytes. The arrays were configured as RAID 5.

While the arrays performed well most of the time, when a particular large sales report was executed, the performance of the arrays was severely affected.

The system administrator used Navisphere Analyzer to first look at the utilization of all of the LUNs on the array. Data was recorded from 07:59 to 10:06 to overlap with the running of the sales report.

Figure 1 is a printout of the utilization of the LUNs. It is clear that LUN 0x02 is close to 100 percent utilization. Its average utilization is at 90 percent and the latest utilization was close to 100 percent. This would obviously affect the performance of the array in general.

    Figure 1. LUN Utilization Report
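As a rough illustration of this first screening pass (not Analyzer's own logic; the sample data and the 90 percent cutoff below are assumptions), one might flag any LUN whose average or latest utilization is near saturation:

```python
# Hypothetical per-LUN utilization samples, in percent busy.
lun_utilization = {
    "0x00": [35, 40, 38],
    "0x01": [55, 60, 52],
    "0x02": [82, 90, 98],   # the over-utilized LUN from Figure 1
}

THRESHOLD = 90  # assumed cutoff for "needs investigation"

for lun, samples in lun_utilization.items():
    avg = sum(samples) / len(samples)
    if avg >= THRESHOLD or samples[-1] >= THRESHOLD:
        print(f"LUN {lun}: average {avg:.0f}%, latest {samples[-1]}% -> investigate")
```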


The next step was to look in detail at the utilization for that LUN. Figure 2 shows that shortly after the sales report started to run at 08:00, the utilization for the LUN reached 100 percent and more or less stayed there for the duration of the report. This makes it very clear that this LUN is over-utilized.

Figure 2. Report showing the utilization of the LUN in detail


The next step that the administrator took was to look at the utilization of the storage processor. This data is shown in the lower portion of Figure 3. It is clear that the storage processor is not being over-utilized; that is, the storage processor is not the limiting factor in this performance issue, because its utilization is only around 40 percent.

    Figure 3. Report showing LUN and storage processor utilization combined


The next step was to look at the queue length of the storage processor. This is shown in Figure 4. It is clear that, again, the storage processor is not the issue, because its queue length is around 5, which is the number of disks that constitute the LUN.

    Figure 4. LUN utilization and storage processor queue length


The queue length for the LUN itself was then examined; it is shown in Figure 5. This queue length reveals the location of the problem, because it is close to 20, which exceeds the number of disks that constitute the LUN. It is clear that this is the location of the bottleneck. A small sketch of this rule of thumb follows Figure 5.

Figure 5. LUN utilization and LUN queue length
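The rule of thumb at work in Figures 4 and 5 can be stated compactly: a queue length at or below the number of disks backing the LUN keeps every spindle busy without backlog, while a queue that is a multiple of the disk count means requests are piling up. A minimal sketch (the disk count of five comes from this case study):

```python
def queue_verdict(component, queue_length, disks_in_lun):
    """Classify a queue length against the number of disks in the LUN."""
    ratio = queue_length / disks_in_lun
    if ratio <= 1.0:
        return f"{component}: queue {queue_length} <= {disks_in_lun} disks -> not the bottleneck"
    return f"{component}: queue {queue_length} is {ratio:.0f}x the disk count -> bottleneck"

print(queue_verdict("storage processor", 5, 5))   # Figure 4
print(queue_verdict("LUN", 20, 5))                # Figure 5
```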


Conclusions of the Analysis

First, looking at the overall utilization report, it was clear that the LUN was being over-utilized. What wasn't clear, however, was whether the issue was the load on the storage processor or whether it was due to a lack of disks that constitute the LUN.

The next step looked specifically at the LUN while the sales report was being run. It was obvious looking at that report (Figure 2) that while the report was running, the LUN was over-utilized.

Next, the utilization of the storage processor was examined. When the data was examined, it was clear that the storage processor was not being bogged down, because its utilization was less than 50 percent.

The queue lengths for both the storage processor and the LUN were examined. The queue length for the storage processor was less than or equal to the number of disks that constitute the LUN, and therefore, the storage processor was not the performance bottleneck.

The queue length for the LUN, however, was close to 20, which is four times the number of disks in the LUN. That is, the load on the LUN is four times higher than it can handle.

The conclusion that was drawn from this data is that the number of disks on the system should be increased. Once this was done, the problem was fixed.

Summary

Navisphere Analyzer was used to determine the cause of a performance problem. Starting at the level of the LUN, reports were used to move closer and closer to the problem. It was concluded, using a few very clear reports, that the LUN in question needed more disks behind it. Additional disks were added and the problem was resolved.


Appendix 1

Navisphere Analyzer collects and analyzes data on the following performance properties:

- Basic (for disk, storage processor, and LUN)

Utilization: The fraction of a certain observation period that the system component is busy serving incoming requests. An SP or disk that shows 100 percent (or close to 100 percent) utilization is a system bottleneck, since an increase in the overall workload will not affect the component throughput; the component has reached its saturation point. Since a LUN is considered busy if any of its disks are busy, LUN utilization usually represents a pessimistic view. That is, a high LUN utilization value does not necessarily indicate that the LUN is approaching its maximum capacity.
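In other words, utilization is simply busy time divided by the observation period; a worked example with assumed numbers:

```python
# Utilization as defined above: busy time / observation period.
busy_ms = 54_000      # assumed time spent serving requests
period_ms = 60_000    # one-minute observation window
print(f"{busy_ms / period_ms:.0%}")   # 90% -- nearing saturation
```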

Queue length: The average number of requests within a certain time interval waiting to be served by the component, including the one in service. An (average) queue length of zero indicates an idle system. If three requests arrive at an empty service center at the same time, only one of them can be served immediately; the other two must wait in the queue, resulting in a queue length of three.

Response time (ms): The average time, in milliseconds, required for one request to pass through a system component, including its waiting time. The higher the queue length for a component, the more requests are waiting in its queue, thus increasing the average response time of a single request. For a given workload, queue length and response time are directly proportional.

Total bandwidth (MB/s): The average amount of data, in megabytes, that is passed through a system component per second. Larger requests usually result in a higher total bandwidth than smaller requests. Total bandwidth includes both read and write requests.

Total throughput (IO/s): The average number of requests that pass through a system component per second. Since smaller requests need a shorter time for this, they usually result in a higher total throughput than larger requests do. Total throughput includes both read and write requests.
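Total bandwidth and total throughput are linked by the average request size, which is why large requests favor bandwidth and small requests favor throughput. A worked example with assumed numbers:

```python
# MB/s = (IO/s) * (KB per request) / 1024
throughput_io_s = 2048   # assumed request rate
avg_request_kb = 8       # assumed average request size
print(throughput_io_s * avg_request_kb / 1024)   # 16.0 MB/s
```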


- Workload

Maximum outstanding requests (storage processor): The largest number of commands on the storage processor at one time since statistics logging was enabled. This value measures the biggest burst of requests sent to this storage processor at a time.

Maximum request count (LUN): The largest number of requests queued to this LUN at one time since statistics logging was enabled. This value also indicates the worst instantaneous response time due to the maximum number of waiting requests.

Maximum requests in queue (disk): The maximum number of requests waiting to be serviced by this specific disk since statistics logging was enabled.

Read/write throughput (IO/s; disk, storage processor, LUN): The average number of reads or writes, respectively, passed through a component per second. Since smaller requests need less processing time, they usually result in a higher read or write throughput than larger requests.

Read/write size (KB; disk, storage processor, LUN): The average read or write size, respectively, in kilobytes. This number indicates whether the workload is oriented more toward throughput (I/Os per second) or bandwidth (MB per second).

Read/write bandwidth (MB/s; disk, storage processor, LUN): The average number of megabytes read or written, respectively, that was passed through a component per second. Large requests usually result in a higher bandwidth than smaller ones.

- Read cache

Used prefetches (percent; LUN): This measure is an indication of prefetching efficiency. To improve read bandwidth, two consecutive requests trigger prefetching, thereby filling the read cache with data before it is requested. Thus sequential requests will receive the data from the read cache instead of from the disks, which results in a lower response time and higher throughput. As the percentage of sequential requests rises, so does the percentage of used prefetches.

Hit ratio (LUN): The fraction of read requests served from both read and write caches vs. the number of read requests to this LUN. The higher the ratio, the better the read performance.

Miss rate (LUN): The rate of read requests that could not be satisfied by the storage processor cache and therefore required a disk access.

Hit rate (LUN): The number of read requests per second that were satisfied by either the write or read cache. A read cache hit occurs when recently accessed data is referenced while it is still in the cache.
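The read-cache ratios and rates follow directly from the definitions above; a worked example with made-up counters:

```python
# Read-cache arithmetic implied by the definitions above.
reads_total = 5000        # read requests in the sample interval (assumed)
reads_from_cache = 4200   # served from read or write cache (assumed)
interval_s = 10

hit_ratio = reads_from_cache / reads_total                 # 0.84
hit_rate = reads_from_cache / interval_s                   # 420 hits/s
miss_rate = (reads_total - reads_from_cache) / interval_s  # 80 disk reads/s
print(hit_ratio, hit_rate, miss_rate)
```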

- Write cache

Miss rate (LUN): The number of write requests per second that could not be satisfied by the write cache, since the data was not currently in the cache from a previous disk access.

Hit ratio (LUN): The fraction of write requests served from the write cache vs. the total number of write requests to this LUN. The higher the ratio, the better the write performance.

Hit rate (LUN): The number of write requests per second that were satisfied by the write cache, since they had been referenced before and not yet flushed to the disks. Write cache hits occur when recently accessed data is referenced again while it is still in the write cache.

Flush ratio (storage processor): The fraction of the number of flush operations performed vs. the number of write requests. A flush operation is a write of a portion of the cache to make room for incoming write data. Since the ratio is a measure of back-end activity vs. front-end activity, a lower number indicates better performance.
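Because the flush ratio divides back-end flush operations by front-end write requests, a well-absorbed write load shows a small value; a worked example with assumed counters:

```python
# Flush ratio: back-end flush operations vs. front-end write requests.
flushes = 150           # assumed flush operations in the interval
write_requests = 3000   # assumed incoming write requests
print(f"{flushes / write_requests:.2f}")  # 0.05 -- the cache absorbs the load
```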

Dirty page percentage (percent; storage processor): The percentage of cache pages owned by this storage processor that were modified since they were last read from, or written to, disk. In an optimal environment, the dirty page percentage will not exceed the high watermark for a long period.

- Block flush rate

Forced flush rate (LUN): Number of times per second the cache had to flush pages to disk to free space for incoming write requests. Forced flushes indicate that the incoming workload is higher than the back-end workload. A relatively high number over a long period of time suggests that you spread the load over more disks.


High watermark flush on rate (storage processor): Number of times, since the last sample, that the number of modified pages in the write cache reached the high watermark. The higher the number, the greater the write workload coming from the host.

Idle flush on rate (storage processor): Number of times, since the last sample, that the write cache started flushing dirty pages to disk due to a given idle period. Idle flushes indicate a low workload.

Low watermark flush off rate (storage processor): Number of times, since the last sample, that the number of modified pages in the write cache reached the low watermark, at which point the storage processor stops flushing the cache. The higher the number, the greater the write workload coming from the host. This number should be close to the high watermark flush on number.

Flush rate (storage processor): Number of times per second that the write cache performed a flush operation. A flush operation is a write of a portion of the cache for any reason; it includes forced flushes, flushes resulting from the high watermark, and flushes from an idle state. This value indicates back-end workload.

- Miscellaneous

Average seek distance (disk): Average seek distance in gigabytes. Longer seek distances result in longer seek times and therefore higher response times. Defragmentation might help to reduce seek distances.

Disk crossing percentage (LUN): Percentage of requests that require I/O to at least two disks vs. the total number of server requests. A disk crossing may involve more than two disks; that is, more than two stripe element crossings. Disk crossings relate to the LUN stripe element size. Generally, a low value is needed for good performance.

Disk crossing rate (LUN): Indicates how many back-end requests per second used an average of at least two disks. Disk crossings are counted for read and write requests. Generally, a low value is needed for good performance.
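Whether a request crosses a stripe element boundary depends only on its starting offset and size relative to the element size. The following hypothetical check illustrates the geometry (Analyzer reports crossings as a percentage; it does not expose such a function, and the 64 KB element size is an assumption):

```python
STRIPE_ELEMENT_KB = 64   # assumed stripe element size

def elements_touched(start_kb, size_kb):
    """Number of stripe elements a request spans; > 1 means a disk crossing."""
    first = start_kb // STRIPE_ELEMENT_KB
    last = (start_kb + size_kb - 1) // STRIPE_ELEMENT_KB
    return last - first + 1

print(elements_touched(60, 16))   # 2 -> crosses into a second disk
print(elements_touched(0, 32))    # 1 -> no crossing
```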

Average busy queue length (disk, storage processor): Average number of requests waiting for a busy system component to be serviced, including the request that is currently in service. Since the queue length is counted only when the component is busy, the value indicates the frequency variation (burst frequency) of incoming requests. The higher the value, the bigger the burst, and the longer the average response time at this component.

Service time (disk, storage processor, LUN): Time, in milliseconds, a request spent being serviced by a component. It does not include time waiting in a queue. Service time is mainly a property of the system component. However, larger I/Os take longer and therefore usually result in lower throughput (I/Os) but better bandwidth (MB/s).
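Service time and queueing delay together make up the response time reported for a component; a worked example with assumed numbers:

```python
# Response time = time waiting in the queue + time being serviced.
service_ms = 8.0    # assumed per-request service time
wait_ms = 24.0      # assumed time spent queued
print(wait_ms + service_ms)   # 32.0 ms response time
```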

- SnapView

Reads from snapshot cache: The number of reads during this session that have resulted in a read from the snapshot cache rather than reading from the source LUN.

Reads from snapshot LUN: The number of read requests on SnapView during this snapshot session.

Reads from snapshot source LUN: The number of reads during this snapshot session from the source LUN. It is calculated as the difference between the total reads in the session and the reads from cache.

Writes to snapshot source LUN: The number of writes during this snapshot session to the source LUN (on the pertinent storage processor).

Writes to snapshot cache: The number of writes to the source LUN this session that triggered a copy-on-write operation (the first write to each snapshot cache chunk region).

Writes larger than cache chunk size: The number of writes to the source LUN during this session that were larger than the chunk size (they resulted in multiple writes to the cache).

Cache chunks used in snapshot session: The number of chunks that this session has used.
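The SnapView counters above all fall out of simple copy-on-write bookkeeping: the first write to a chunk copies it to the snapshot cache, and a write larger than the chunk size touches several chunks. The following is a hypothetical model of that accounting, not SnapView code; the 64 KB chunk size is an assumption:

```python
CHUNK_KB = 64           # assumed snapshot cache chunk size
copied_chunks = set()   # "cache chunks used in snapshot session"

def write_to_source(start_kb, size_kb):
    """Return how many chunks this write copied to the snapshot cache."""
    first = start_kb // CHUNK_KB
    last = (start_kb + size_kb - 1) // CHUNK_KB
    new = set(range(first, last + 1)) - copied_chunks
    copied_chunks.update(new)   # copy-on-write only for chunks not yet saved
    return len(new)

print(write_to_source(0, 128))   # 2 -- a write larger than the chunk size
print(write_to_source(32, 16))   # 0 -- that chunk was already copied
print(len(copied_chunks))        # 2 chunks used in the session
```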