
ECS
Version 3.4

Monitoring Guide
302-999-903

01

September 2019

Copyright © 2018-2019 Dell Inc. or its subsidiaries. All rights reserved.

Dell believes the information in this publication is accurate as of its publication date. The information is subject to change without notice.

THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS-IS.” DELL MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. USE, COPYING, AND DISTRIBUTION OF ANY DELL SOFTWARE DESCRIBED IN THIS PUBLICATION REQUIRES AN APPLICABLE SOFTWARE LICENSE.

Dell Technologies, Dell, EMC, Dell EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries. Other trademarks may be the property of their respective owners. Published in the USA.

Dell EMC
Hopkinton, Massachusetts 01748-9103
1-508-435-1000 In North America 1-866-464-7381
www.DellEMC.com


CONTENTS

Figures

Tables

Welcome to ECS

Chapter 1  Monitoring Basics
    View the ECS Portal Dashboard
        Upper-right menu bar
        View requests
        View capacity utilization
        View performance
        View storage efficiency
        View geo monitoring
        View node and disk health
        View alerts
    Using monitoring pages
        Table navigation
        Filter by date and time
        History
        Export icon

Chapter 2  Monitoring ECS
    Monitor metering data
        Metering data
    Monitor capacity utilization
        Capacity forecast
        Monitor capacity
        Monitor used capacity
        Monitor garbage collection data
        Monitor erasure encoding
        Monitor CAS processing
    Monitor system health
        Monitor hardware health
        Monitor process health
        Monitor node rebalancing status
    Monitor transactions
        Monitor requests
        Monitor performance
    Monitor recovery status
    Monitor disk bandwidth
    Introduction to geo-replication monitoring
        Monitor geo replication: Rate and Chunks
        Monitor geo replication: Recovery Point Objective (RPO)
        Monitor geo replication: Failover Processing
        Monitor geo replication: Bootstrap Processing
    Cloud hosted VDC monitoring
        Cloud topology
        Cloud replication traffic

Chapter 3  Monitoring Events: Audits and Alerts
    About event monitoring
    Monitor audit data
    Audit messages
    Monitor alerts
    Alert policy
        New alert policy
    Acknowledge all alerts
    Alert messages

Chapter 4  Advanced Monitoring
    Advanced Monitoring
        View Advanced Monitoring Dashboards
        Share Advanced Monitoring Dashboards
    Flux API
        List of metrics for performance-related data
    Dashboard API's to be deprecated or changed in the next release

Chapter 5  Examining Service Logs
    ECS service logs


FIGURES

1  Upper-right menu bar
2  Navigating with breadcrumbs
3  Refresh icon
4  Open Filter panel with date and time range selections
5  History chart with active cursor
6  Export icons


TABLES

1  Bucket and namespace metering
2  Capacity utilization: VDC
3  Capacity utilization: storage pool
4  Capacity utilization: node
5  Capacity utilization: disk
6  Used capacity
7  Garbage collection: garbage detected
8  Garbage collection: capacity reclaimed
9  Erasure encoding metrics
10  CAS processing metrics
11  VDC, node, and process health metrics
12  ECS processes
13  Request metrics
14  Network traffic metrics
15  Recovery metrics
16  Disk Bandwidth metrics
17  Rate and Chunks columns
18  RPO columns
19  Failover columns
20  Bootstrap Processing columns
21  Replication traffic by VDC
22  Replication traffic by replication group
23  ECS audit messages
24  Alert types
25  ESRS dial home types
26  ECS Object alert messages
27  ECS fabric alert messages
28  Secure Remote Services alert messages
29  Advanced monitoring dashboards
30  Advanced monitoring dashboard fields
31  Metrics for performance-related data
32  API - Remove
33  API - Change
34  API - No change


Welcome to ECS

ECS provides a complete software-defined cloud storage platform that supports the storage, manipulation, and analysis of unstructured data on a massive scale on commodity hardware. ECS can be deployed as a turnkey storage appliance or as a software product that can be installed on qualified commodity servers and disks. ECS offers all the cost advantages of commodity infrastructure with the enterprise reliability, availability, and serviceability of traditional arrays.

The ECS online documentation comprises the following guides:

• Administration Guide

• Monitoring Guide

• Data Access Guide

• Hardware Guide

Administration Guide

The Administration Guide supports the initial configuration of ECS and the provisioning of storage to meet requirements for availability and data replication. Also, it supports the ongoing management of tenants and users, and the creation and configuration of buckets.

Monitoring Guide

The Monitoring Guide supports the ECS administrator's use of the ECS Portal to monitor the health and performance of ECS and to view its capacity utilization.

Data Access Guide

The Data Access Guide describes the protocols that are supported by ECS for user access to ECS object storage. In addition to the S3, EMC Atmos, OpenStack Swift, and Centera (CAS) object APIs, it introduces the ECS Management API, which can be used to configure ECS before user access, and details the use of ECS as a Hadoop Filesystem (HDFS) and the integration of ECS HDFS with a Hadoop cluster.

Hardware Guide

The Hardware Guide describes the supported hardware configurations and upgrade paths and details the rack cabling requirements.

PDF versions of these online guides and links to other PDFs, such as the ECS Security Configuration Guide and the ECS Release Notes, are available from support.emc.com (https://support.emc.com/search/?text=ecs).



CHAPTER 1

Monitoring Basics

• View the ECS Portal Dashboard
• Using monitoring pages


View the ECS Portal Dashboard

The ECS Portal Dashboard provides critical information about the ECS processes on the VDC you are currently logged in to.

The Dashboard is the first page you see after you log in. The title of each panel (box) links to the portal monitoring page that shows more detail for the monitoring area.

Upper-right menu bar

The upper-right menu bar appears on each ECS Portal page.

Figure 1 Upper-right menu bar

Menu items include the following icons and menus:

1. The Alert icon displays a number that indicates how many unacknowledged alerts are pending for the current VDC. The number displays 99+ if there are more than 99 alerts. You can click the Alert icon to see the Alert menu, which shows the five most recent alerts for the current VDC.

2. The Help icon brings up the online documentation for the current portal page.

3. The Guide icon brings up the Getting Started Task Checklist.

4. The VDC menu displays the name of the current VDC. If your AD or LDAP credentials allow you to access more than one VDC, you can switch the portal view to the other VDCs without entering your credentials.

5. The User menu displays the current user and allows you to log out. The User menu displays the last login time for the user.

View requests

The Requests panel displays the total requests, successful requests, and failed requests.

Failed requests are organized by system error and user error. User failures are typically HTTP 400 errors. System failures are typically HTTP 500 errors. Click Requests to see more request metrics.

Request statistics do not include replication traffic.
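As a rough illustration of the split described above, the following Python sketch groups HTTP status codes the way the Requests panel summarizes them. It assumes that "HTTP 400 errors" refers to the whole 4xx class and "HTTP 500 errors" to the whole 5xx class. The function names are hypothetical and are not part of ECS.

```python
from collections import Counter


def classify_request(status):
    """Map an HTTP status code to the category used in this sketch."""
    if 400 <= status <= 499:
        return "user_error"      # assumption: any 4xx counts as a user error
    if 500 <= status <= 599:
        return "system_error"    # assumption: any 5xx counts as a system error
    return "success"


def summarize_requests(statuses):
    """Return total, successful, and failed counts for a list of status codes."""
    statuses = list(statuses)
    counts = Counter(classify_request(s) for s in statuses)
    return {
        "total": len(statuses),
        "successful": counts["success"],
        "failed": counts["user_error"] + counts["system_error"],
        "user_errors": counts["user_error"],
        "system_errors": counts["system_error"],
    }
```

Because request statistics exclude replication traffic, any status codes fed into such a summary should come from client requests only.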

View capacity utilization

The Capacity Utilization panel displays the total, used, available, reserved, and percent full capacity.

Capacity amounts are shown in gibibytes (GiB) and tebibytes (TiB). One GiB is approximately equal to 1.074 gigabytes (GB). One TiB is approximately equal to 1.1 terabytes (TB).
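The conversion factors above follow from the unit definitions (1 GiB = 2^30 bytes, 1 GB = 10^9 bytes; 1 TiB = 2^40 bytes, 1 TB = 10^12 bytes). A minimal Python sketch, with hypothetical helper names, for converting portal values to decimal units:

```python
GIB_BYTES = 2 ** 30   # bytes in one gibibyte
TIB_BYTES = 2 ** 40   # bytes in one tebibyte


def gib_to_gb(gib):
    """Convert gibibytes to decimal gigabytes."""
    return gib * GIB_BYTES / 10 ** 9


def tib_to_tb(tib):
    """Convert tebibytes to decimal terabytes."""
    return tib * TIB_BYTES / 10 ** 12
```

For example, a pool reported as 500 TiB holds roughly 549.8 TB in decimal units, which matters when comparing portal figures with drive vendor capacities.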

The Used capacity indicates the amount of capacity that is in use. Click Capacity Utilization to see more capacity metrics.

The capacity metrics are available in the left menu.



View performance

The Performance panel displays how network read and write operations are currently performing, and the average read/write performance statistics over the last 24 hours for the VDC.

Click Performance to see more comprehensive performance metrics.

View storage efficiency

The Storage Efficiency panel displays the efficiency of the erasure coding (EC) process.

The chart shows the progress of the current EC process, and the other values show the total amount of data that is subject to EC, the amount of EC data waiting for the EC process, and the current rate of the EC process. Click Storage Efficiency to see more storage efficiency metrics.
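The three values on this panel can be combined into a progress percentage and a rough completion estimate. The sketch below is an assumption about how such figures relate, not the portal's documented calculation: it treats progress as the share of EC-eligible data that is no longer waiting, and the estimate as pending data divided by the current rate. All names are hypothetical.

```python
def ec_progress(total_bytes, pending_bytes, rate_bytes_per_sec):
    """Return (percent_complete, eta_seconds).

    eta_seconds is None when data is pending but the rate is zero,
    because no estimate can be made in that case.
    """
    if total_bytes <= 0:
        return 100.0, 0.0          # nothing is subject to EC
    done = max(total_bytes - pending_bytes, 0)
    percent = 100.0 * done / total_bytes
    if pending_bytes <= 0:
        return percent, 0.0        # nothing left to encode
    if rate_bytes_per_sec <= 0:
        return percent, None       # stalled: no meaningful estimate
    return percent, pending_bytes / rate_bytes_per_sec
```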

View geo monitoring

The Geo Monitoring panel displays how much data from the local VDC is waiting for geo-replication, and the rate of the replication.

Recovery Point Objective (RPO) refers to the point in time in the past to which you can recover. The value is the oldest data at risk of being lost if a local VDC fails before replication is complete. Failover Progress shows the progress of any active failover that is occurring in the federation involving the local VDC. Bootstrap Progress shows the progress of any active process to add a new VDC to the federation. Click Geo Monitoring to see more geo-replication metrics.
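To make the RPO definition concrete: if the RPO is the age of the oldest data still waiting for replication, it can be derived from the write times of the pending items. The following Python sketch illustrates that reading under the assumption that per-item write timestamps are available; the function name and input shape are hypothetical and do not describe an ECS interface.

```python
from datetime import datetime, timedelta, timezone


def recovery_point_objective(pending_write_times, now):
    """Return the age of the oldest write that is not yet replicated.

    pending_write_times: datetimes of writes still waiting for replication.
    Returns a zero timedelta when nothing is pending.
    """
    pending = list(pending_write_times)
    if not pending:
        return timedelta(0)
    return now - min(pending)
```

With writes pending from 5 and 42 minutes ago, the result is 42 minutes: everything written in the last 42 minutes is at risk if the local VDC fails now.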

View node and disk health

The Node & Disks panel displays the health status of disks and nodes.

A green check mark beside the node or disk number indicates the number of nodes or disks in good health. A red x indicates bad health. Click Node & Disks to see more hardware health metrics. If the number of bad disks or nodes is a number other than zero, clicking on the count takes you to the corresponding Hardware Health tab (Offline Disks or Offline Nodes) on the System Health page.

View alerts

The Alerts panel displays a count of critical alerts and errors.

Click Alerts to see the full list of current alerts. Any Critical or Error alerts are linked to the Alerts tab on the Events page where only the alerts with a severity of Critical or Error are filtered and displayed.

Using monitoring pages

This section introduces the basic techniques for using monitoring pages in the ECS Portal.

The ECS Portal monitoring pages share a set of common interactions, as described in the following sections.

Table navigation

Highlighted text in a table row indicates a link to a detail display. Selecting the link drills down to the next level of detail. On drill-down displays, a path string shows your current location in the sequence of drill-down displays. This path string is called a breadcrumb trail or breadcrumbs for short. Selecting any highlighted breadcrumb jumps up to the associated display.



Figure 2 Navigating with breadcrumbs

On some monitoring displays, you can force a table to refresh with the latest data by clicking the Refresh icon.



Figure 3 Refresh icon

Filter by date and time

The standard monitoring filter enables you to narrow results by date and time. It is available on several monitoring pages. Some pages have more filter types, which are described on those pages.

You can select a predefined Date Time Range value (in hours, weeks, or months) or select Custom to specify a From and To date and time. For the To value, you can select the current time. After selecting a Date Time Range, click Apply. The Filter panel closes and the page content updates. When closed, the Filter panel shows a summary of the applied filter settings and provides a Clear Filter command and a Refresh symbol.

If you want the Filter panel to stay open, click the Pin icon before you click Apply.

Figure 4 Open Filter panel with date and time range selections

When the table has the Current filter applied, the latest values are displayed. When the table has a date-time range filter applied, it displays the average value over that period.
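The difference between the two filter modes can be sketched in a few lines of Python. This is only an illustration: it assumes the average is a plain arithmetic mean of the samples in the range, which the guide does not specify, and the function name and sample format are hypothetical.

```python
from statistics import mean


def displayed_value(samples, start=None, end=None):
    """Return the value a table cell would show for a metric.

    samples: iterable of (timestamp, value) pairs.
    With no range (the Current filter), return the latest value.
    With a range, return the average of the values inside it.
    Return None when there is no matching data.
    """
    samples = sorted(samples)
    if start is None and end is None:
        return samples[-1][1] if samples else None
    in_range = [value for ts, value in samples
                if (start is None or ts >= start) and (end is None or ts <= end)]
    return mean(in_range) if in_range else None
```

A practical consequence is that a short spike can dominate the Current view yet barely move the value shown for a long date-time range.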

History

When you select a History button, all available charts for that row are displayed below the table. You can hover over a chart from left to right to see a vertical line that helps you find a specific date-time point on the chart. A pop-up display shows the value and timestamp for that point.

The date-time scale is determined by the filter setting that has been configured. When the Current filter is selected, the charts show data from the last 24 hours. History data is kept for 60 days.



Figure 5 History chart with active cursor

In the history charts, when the Current filter is selected and no historical data is available, No Data displays.

Export icon

The Export icon enables you to export data from all the monitoring tables and graphs to PDF, DOC, Excel, and CSV formats for later consumption. To select the format and export the data, use the Export icon in the upper right of the menu bar on each table and graph.

The exported data can be used to get a longer-term view of capacity usage and consumption trends.

Figure 6 Export icons
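One way to use exported data for trend analysis is to load a CSV export and compute the average daily growth between the first and last samples. The sketch below uses only the Python standard library. The column names Time and Used, and the ISO 8601 timestamp format, are assumptions for illustration; check the header row of an actual export and adjust them.

```python
import csv
import io
from datetime import datetime


def daily_growth(csv_text, time_col="Time", used_col="Used"):
    """Return average growth per day between the first and last samples.

    Column names are hypothetical; pass the names found in the real export.
    Returns None when fewer than two samples (or no elapsed time) exist.
    """
    rows = csv.DictReader(io.StringIO(csv_text))
    points = sorted(
        (datetime.fromisoformat(row[time_col]), float(row[used_col]))
        for row in rows
    )
    if len(points) < 2:
        return None
    (t0, u0), (t1, u1) = points[0], points[-1]
    days = (t1 - t0).total_seconds() / 86400
    return (u1 - u0) / days if days > 0 else None
```

Because the portal keeps history for a limited time, exporting regularly and concatenating the files is what makes a longer view possible.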



CHAPTER 2

Monitoring ECS

• Monitor metering data
• Monitor capacity utilization
• Monitor system health
• Monitor transactions
• Monitor recovery status
• Monitor disk bandwidth
• Introduction to geo-replication monitoring
• Cloud hosted VDC monitoring


Monitor metering data

You can display metering data for namespaces, or buckets within namespaces, for a specified time period.

About this task

The available metering data is detailed in the Metering data section.

Using the ECS Management REST API, you can retrieve data programmatically with custom clients. The ECS Management REST API Reference is available at http://doc.isilon.com/ECS/3.3/API/.

Procedure

1. In the ECS Portal, select Monitor > Metering.

2. From the Date Time Range menu, select the period for which you want to see the metering data. Select Current to view the current metering data. Select Custom to specify a custom date-time range.

Metering is not a real-time reporting activity but is performed as a background process, and some delay in reporting can occur. The longest delay is about 15 minutes. However, where the system is under heavy load, or is unstable, longer delays can be seen. If you are encountering longer delays, contact ECS Customer Support.

If you select Custom, use the From and To calendars to choose the time period for which data will be displayed.

Metering data is kept for 30 days.

Note: The Current filter displays the latest available values. A date-time range filter displays average values over the specified range.

3. Select the namespace for which you want to display metering data. To narrow the list of namespaces, type the first few letters of the target namespace and click the magnifying glass icon.

If you are a Namespace Administrator, you will only be able to select your namespace.

4. Click the + icon next to each namespace for which you want to see object data.

5. To see the data for a particular bucket, click the + icon next to each bucket for which you want to see data.

To narrow the list of buckets, type the first few letters of the target bucket and click the magnifying glass icon.

If you do not specify a bucket, the object metering data will be the totals for all buckets in the namespace.

6. Click Apply to display the metering data for the selected namespace and bucket for the specified time period.

Note: While all buckets in a geo-federation can be selected in metering, metering information cannot be retrieved for a selected bucket that does not belong to a replication group that includes the VDC you are logged in to. In this case, after a wait, the bucket is listed as No data. To get the metering information for the bucket, log in to the VDC that owns the bucket or any VDC that is part of the replication group to which the bucket belongs.

Depending on the Date Time Range selected, the attributes that are displayed on the Metering page may change. If the Current option is selected, only the Namespace, Buckets, Bucket Tags, Total MPU Parts, Total MPU Size, Total Size, Object Count, and Last Updated attributes are displayed in the table. If Custom or any other time range is chosen, the Namespace, Buckets, Bucket Tags, Total MPU Parts, Total MPU Size, Total Size, Object Count, Objects Created, Objects Deleted, Write Traffic, and Read Traffic attributes are displayed in the table and the Last Updated attribute is not displayed.
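For programmatic retrieval through the ECS Management REST API, a custom client mainly needs to build an authenticated HTTPS request. The Python sketch below only constructs the request and does not send it. The port (4443), the billing paths, and the X-SDS-AUTH-TOKEN header are assumptions based on common ECS Management API usage and are not stated in this guide; verify each of them against the API Reference before relying on this. The function names are hypothetical.

```python
import urllib.parse
import urllib.request


def billing_info_url(host, namespace, bucket=None, port=4443):
    """Build the URL for namespace or bucket metering (paths are assumptions)."""
    ns = urllib.parse.quote(namespace, safe="")
    if bucket is None:
        path = f"/object/billing/namespace/{ns}/info"
    else:
        bkt = urllib.parse.quote(bucket, safe="")
        path = f"/object/billing/buckets/{ns}/{bkt}/info"
    return f"https://{host}:{port}{path}"


def billing_info_request(host, token, namespace, bucket=None):
    """Return an unsent request carrying a previously obtained auth token."""
    request = urllib.request.Request(billing_info_url(host, namespace, bucket))
    request.add_header("X-SDS-AUTH-TOKEN", token)  # assumed token header name
    request.add_header("Accept", "application/json")
    return request
```

Sending the request (for example with urllib.request.urlopen) additionally requires a valid token from a login call and a TLS configuration that trusts the ECS certificate, both of which are outside this sketch.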

Metering data

Object metering data for a specified namespace, or a specified bucket within a namespace, can be obtained for a defined time period at the ECS Portal Monitor > Metering page.

The metering information that is provided is shown in the following table:

Table 1 Bucket and namespace metering

Attribute | Description
Namespace | Namespace selected.
Buckets | Bucket selected for which the metering data applies. If blank, the data is for all buckets in the namespace.
Bucket Tags | Lists any name=value bucket tags associated with the bucket.
Total MPU Parts | The number of MPU parts that have been created and not used as part of a complete MPU operation.
Total MPU Size | The total disk size occupied by MPU parts that have been created and not used as part of a complete MPU operation.
Total Size | Total size of the objects that are stored in the selected namespace or bucket at the end time that is specified in the filter. If the size is less than 1 GB, then the portal displays 0 GB.
Object Count | Number of objects that are associated with the selected namespace or bucket at the end time that is specified in the filter.
Last Updated | If the Current filter is selected, Last Updated displays the time until which metering data can be considered consistent. This can help you determine any delay in reported metering stats. The metering stats may include some data on the operations that are performed after the last updated time.
Objects Created | Number of objects that are created in the selected namespace or bucket in the time period.
Objects Deleted | Number of objects that are deleted from the selected namespace or bucket in the time period.
Write Traffic | Total of incoming object data (writes) for the selected namespace or bucket during the specified period. Values are displayed in a size unit that is based on the size of the data.
Read Traffic | Total of outgoing object data (reads) for the selected namespace or bucket during the specified period. Values are displayed in a size unit that is based on the size of the data.

Note: When you perform an update operation on an object, the metering service reports the object overwrite as both Objects Created and Objects Deleted. Objects Deleted is incremented because of the expected overwrite behavior of an object. However, no object is deleted.



Note: Metering is not a real-time reporting activity but is performed as a background process, and some delay in reporting can occur. The longest delay is about 15 minutes. However, where the system is under heavy load, or is unstable, longer delays can be seen. If you are encountering longer delays, contact ECS Customer Support.

Note: When there are many concurrent requests, ECS metering can ignore some requests so that they do not impact system performance. Hence, the Write Traffic value can show less than the actual Write bandwidth.
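The Last Updated attribute and the 15-minute figure in the notes above can be combined into a simple staleness check. The Python sketch below is illustrative only: the helper names are hypothetical, and treating 15 minutes as a hard threshold is an assumption, since the guide describes it as an approximate longest delay.

```python
from datetime import datetime, timedelta, timezone

# Approximate longest reporting delay stated in this guide.
EXPECTED_MAX_DELAY = timedelta(minutes=15)


def metering_lag(last_updated, now):
    """Return how far the metering data trails the current time."""
    return now - last_updated


def is_metering_delayed(last_updated, now, threshold=EXPECTED_MAX_DELAY):
    """Return True when the lag exceeds the expected maximum delay."""
    return metering_lag(last_updated, now) > threshold
```

A check like this can help decide whether an unexpectedly low Object Count is a reporting delay or a real change before contacting ECS Customer Support.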

Monitor capacity utilization

You can monitor capacity utilization from the ECS Portal Monitor > Capacity Utilization page. You can monitor the capacity utilization of storage pools, nodes, and the entire VDC.

The Capacity Utilization page has the following tabs:

• Capacity: View summary data about the total, used, available, and reserved storage capacity of storage pools and nodes

• Used Capacity: View data about the used capacity for the VDC and storage pools

• Garbage Collection: View data about garbage detected, recovered capacity, capacity that is pending reclamation, and capacity that cannot be reclaimed

• Erasure Encoding: View erasure-encoded data in a local storage pool, data that is pending erasure encoding, and the current erasure encoding rate and estimated completion time

• CAS Processing: View garbage data collection for CAS (Content Addressable Storage) buckets.

Tables showing capacity usage data display in each of the tabs. You can drill down into the nodes and to individual disks by selecting the appropriate link in each table. Each row has an associated History display that enables you to see how the data has changed over time. To graphically display how capacity has changed over time, select History for the storage pool, node, or disk that you are interested in. History data is kept for 30 days.

See Using monitoring pages for information about navigating the tables.

Capacity forecast

You can use the Capacity tab to monitor when the capacity is expected to reach 50% and 80%. The capacity forecast is based on the current usage pattern, shown as 1-day, 7-day, and 30-day usage trends. Capacity forecast data is shown for the entire VDC, for an individual storage pool, or for nodes.

Note: A capacity ETA shown as N/A could be due to the following reasons:

1. There is not enough historical data for a forecast. At least two data points (1 hour apart) are required. This can happen shortly after the ECS system is deployed. Click the History button at the VDC, storage pool, or node level to verify.

2. If capacity has already passed the intended target, the ETA is set to 0.

3. The used capacity shows a downward trend for the specified time (for example, 7 days). Click the History button or get the history through the dashboard API to verify.

To see the capacity forecast data from the ECS Portal, select Monitor > Capacity Utilization > Capacity. The Capacity tab is the default.

To see the data about total capacity, used capacity, and available capacity, click History.

Capacity Forecast is calculated based on the total capacity and used capacity.
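The forecast rules above can be illustrated with a simple linear extrapolation. The guide does not document the actual forecasting algorithm, so the Python sketch below is an assumption: it draws a straight line through the first and last samples of the chosen trend window and mirrors the N/A and zero cases from the note. The function name and the sample format are hypothetical.

```python
def forecast_eta_days(history, total_capacity, target_fraction):
    """Estimate days until used capacity reaches a target share of total.

    history: list of (day_offset, used_capacity) samples, oldest first.
    Returns None (shown as N/A) with fewer than two samples or when the
    trend is flat or downward, and 0.0 when the target is already passed.
    """
    if len(history) < 2:
        return None                      # not enough historical data
    target = total_capacity * target_fraction
    (t0, u0), (t1, u1) = history[0], history[-1]
    if u1 >= target:
        return 0.0                       # target already reached
    if t1 <= t0:
        return None                      # no elapsed time to measure a trend
    rate = (u1 - u0) / (t1 - t0)         # capacity consumed per day
    if rate <= 0:
        return None                      # flat or downward trend
    return (target - u1) / rate
```

For example, a VDC with 1000 TiB total that grew from 100 TiB to 170 TiB used over 7 days consumes 10 TiB per day, so it would reach 50% in about 33 days under this linear assumption.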



Monitor capacity

You can use the Capacity tab to view capacity utilization data for:

• VDC (see VDC capacity utilization)

• Storage Pools (see Storage pool capacity utilization)

• Nodes (see Node capacity utilization)

• Disks (see Disk capacity utilization)

• Used Capacity (see Monitor used capacity)

You can view summary storage usage data about total, used, available, and reserved storage capacity for storage pools and nodes.

Reserved capacity is the approximately 10 percent of the total capacity that is reserved for failure handling and for performing erasure encoding or XOR operations. Reserved capacity is not available for writing new data.
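The relationship between total, used, available, and reserved capacity can be sketched as simple arithmetic. The Python example below assumes an exact 10 percent reserve (the guide says approximately), assumes that available capacity is total minus used and includes the reserve, and assumes percent full is used divided by total. The names are hypothetical.

```python
RESERVED_FRACTION = 0.10   # assumption: the guide says "approximately 10 percent"


def capacity_breakdown(total, used, reserved_fraction=RESERVED_FRACTION):
    """Return reserved, available, writable, and percent-full figures.

    'available' includes the reserve; 'writable' is what remains for new
    data once the reserve is set aside.
    """
    reserved = total * reserved_fraction
    available = total - used
    writable = max(available - reserved, 0.0)
    percent_full = 100.0 * used / total if total else 0.0
    return {
        "reserved": reserved,
        "available": available,
        "writable": writable,
        "percent_full": percent_full,
    }
```

With 1000 TiB total and 400 TiB used, this gives 600 TiB available, of which about 100 TiB is reserved and about 500 TiB can accept new data.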

The tab opens with the Storage Pools capacity table displayed. To view capacity data for individual nodes, click the appropriate link in the Nodes (Online) column to display the Nodes table. Click the appropriate link in the Disks (Online) column to view capacity data for individual disks.

You can display average values over a selected date-time range or over a custom time range using the Filter drop-down menu. The Current filter displays the latest available values and is the default filter value.

When the table has the Date Time Range filter set to Current (the default setting), the table displays the latest values and the history graphs display values over the last 24-hour period. When the table has a Date Time Range filter applied (other than Current), it displays the average value over that period.

VDC capacity utilization

Table 2 Capacity utilization: VDC


VDC Name of the VDC.

Per 1 Day Trend 50% Forecasts when VDC capacity utilization is expected to reach 50%, based on the 1-day usage trend.

Per 7 Day Trend 50% Forecasts when VDC capacity utilization is expected to reach 50%, based on the 7-day usage trend.

Per 1 Day Trend 80% Forecasts when VDC capacity utilization is expected to reach 80%, based on the 1-day usage trend.



Total Total capacity of the VDC that is online. This is the total of the capacity that is already used and the capacity still free for allocation.

Used Used online capacity in the VDC.


Available (Reserved) Online capacity available for use, including the approximately 10% of the total capacity that is reserved for failure handling and for performing erasure encoding or XOR operations. If the Current filter is applied, Available (Reserved) displays. If a filter other than Current is applied, only Available displays.

Actions History provides a graphic display of the data. If the Current filter (default) is selected, the History button displays total, used, and available capacity for the last 24 hours. History data is kept for 60 days.

Storage pool capacity utilization

Table 3 Capacity utilization: storage pool


Storage Pool Name of the storage pool.

Nodes (Online) Number of nodes in the storage pool followed by the number of those nodes online. Click this number to open Node capacity utilization.

Online Nodes with Sufficient Disk Space Number of online nodes that have sufficient disk space to accept new data. If too many disks are too full to accept new data, the performance of the system may be impacted. Note: Does not appear if a filter other than Current is applied.

Disks (Online) Number of disks in the storage pool followed by the number of those disks that are online.

Total Total capacity of the storage pool that is online. This is the total of the capacity that is already used and the capacity still free for allocation.

Used Used online capacity in the storage pool.


Online capacity available for use, including the approximately 10% of the total capacity that is reserved for failure handling and for performing erasure encoding or XOR operations.


Node capacity utilization

Table 4 Capacity utilization: node


Nodes Fully qualified domain name (FQDN) of the node.

Disks (Online) Number of disks that are associated with the node followed by the number of those disks that are online. Click the disk number to open Disk capacity utilization.


Total Total online capacity provided by the online disks within the node. This is the total of the capacity that is already used and the capacity still free for allocation.

Used Online capacity used within the node.


Remaining online capacity available in the node including reserved capacity.

Offline Total capacity of the node that is offline. Displays only if the Current filter is applied.

Online Status Indicates whether the node is online or offline. A check mark indicates that the node status is Good.


Disk capacity utilization

Table 5 Capacity utilization: disk


Disks Disk identifier.

Total Total capacity provided by the disk.

Used Capacity used on the disk.

Available Remaining capacity available on the disk.

Online Status Indicates whether the disk is online or offline. A check mark indicates that the disk status is Good.


Monitor used capacity

You can use the Used Capacity tab to view the used storage capacity for the current VDC and for each storage pool in the VDC.

Table 6 Used capacity

Storage use Description

User Data The capacity that is used for the repository chunks representing data uploaded by ECS users.


System Metadata The capacity that is used by the ECS processes that track and describe the data in the system.

Protection Overhead The combined overhead of triple mirroring and erasure coding for all user data, system metadata, and geo data protection chunks protected locally.

Geo Cache The capacity used to cache chunks that are accessed locally but not stored locally.

Geo Copy The capacity that is used for Geo-replication chunks stored on the current VDC.

Garbage The capacity used by data that is no longer in use.

Storage usage is shown as color-coded bars, one color for the current VDC, and a different color for its storage pools. Tool tips for each colored bar correspond to the status information in the numeric status line.

Monitor garbage collection data

You can use the Garbage Collection tab to monitor garbage collection data for the entire VDC or for individual storage pools. Use the Virtual Data Center drop-down menu to select the storage type: Virtual Data Center or Storage Pool. Virtual Data Center is the default.

Garbage collection is enabled by default at installation. Contact your customer support representative to disable or re-enable this feature.

The Garbage Collection page has the following subtabs:

• Garbage Detected: View summary garbage collection data.

• Capacity Reclaimed: View data about storage capacity reclaimed by the garbage collection process.

Garbage Detected

Click the Virtual Data Center drop-down menu to view garbage detection data for the entire VDC or individual storage pools.

Table 7 Garbage collection: garbage detected


Storage Type The VDC or storage pool for which to view garbage collection data.

Total Garbage Detected The amount of reclaimable storage capacity detected on the system.

Capacity Reclaimed The amount of storage capacity reclaimed by the garbage collection process.

Capacity Pending Reclamation The amount of storage capacity that is identified as reclaimable but not reclaimed yet.

UnReclaimable Garbage The amount of storage capacity that cannot be reclaimed currently.

Capacity Reclaimed

Click the Filter button to set a filter for the reclamation data by VDC or storage pool over a date/time range.


Table 8 Garbage collection: capacity reclaimed


Storage Type The VDC or storage pool for which to view capacity reclaimed data.

Capacity Reclaimed The amount of storage capacity recovered following garbage collection.

User Data Reclaimed The amount of user data recovered.

System Metadata Reclaimed The amount of system metadata recovered.

Actions History provides a graphic display of the data. If the Current filter (default) is selected, the History button displays the total reclaimed capacity for the last 24 hours. History data is kept for 60 days.

Monitor erasure encoding

You can use the Erasure Encoding tab to monitor the total user data and erasure encoded data in a local storage pool. It also shows the current encoding rate and the estimated completion time.

You can display average values over a selected date-time range or over a custom time range using the Filter drop-down menu. The Current filter displays the latest available values and is the default filter value.

Table 9 Erasure encoding metrics

Column Description

Storage Pool The storage pools from the current VDC.

Total Coding Data The total logical size of all data chunks in the storage pool that are subject to erasure encoding.

Total Coded Data The total logical size of all erasure-encoded chunks in the storage pool.

Coded (%) The percent of data in the storage pool that is erasure encoded. Percent values display with three decimal places in the history chart for accurate plotting. Percent values display with two decimal places in the table, consistent with the format of the other values in the table.

Coding Rate The rate at which any current data waiting for erasure encoding is being processed.

Est. Time to Complete The estimated completion time extrapolated from the current erasure encoding rate.

Actions History provides a graphic display of the total coding data, total coded data, percent of data coded, and coding rate per second. History data is kept for 60 days.

If the Current filter is selected, History displays default history for the last 24 hours.
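
The relationship between the columns in Table 9 can be sketched as follows. This is an illustration only: it assumes that Coded (%) is Total Coded Data divided by Total Coding Data, and that the estimate is the remaining data divided by the current rate. The guide does not state the exact formulas, and the function names are hypothetical.

```python
from typing import Optional


def coded_percent(total_coding_gb: float, total_coded_gb: float) -> float:
    """Percent of data subject to erasure encoding that is already encoded."""
    if total_coding_gb <= 0:
        return 0.0
    return 100.0 * total_coded_gb / total_coding_gb


def est_seconds_to_complete(total_coding_gb: float, total_coded_gb: float,
                            rate_gb_per_s: float) -> Optional[float]:
    """Extrapolate the completion time from the current coding rate."""
    remaining_gb = max(total_coding_gb - total_coded_gb, 0.0)
    if remaining_gb == 0:
        return 0.0
    if rate_gb_per_s <= 0:
        return None  # no progress, so no estimate in this sketch
    return remaining_gb / rate_gb_per_s


pct = coded_percent(800.0, 600.0)
print(f"{pct:.2f}")  # table-style formatting, two decimal places
print(f"{pct:.3f}")  # chart-style formatting, three decimal places
```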

Monitor CAS processing

You can use the CAS Processing tab to monitor unused CAS (Content Addressable Storage) objects in CAS buckets within a selected namespace over a specified time range. The unused CAS objects that are monitored by ECS include unreferenced blobs and expired reflections.


In Centera terminology, there are three types of CAS objects: blob, clip, and reflection.

• Blob: CAS data objects are called blobs (binary large objects). Blobs store data. Blobs can be referenced by data objects of a different type called clips. A blob is referenced by its Content Address (CA), which is stored in the Content Description File (CDF) that references the blob. The logical combination of a CDF and a blob is called a clip. The hash of a CDF is the Clip-ID. There can be multiple clips for the same blob with different CDFs (different metadata but the same user data, single instance storage). When blobs are not referenced by live clips, these unreferenced blobs become garbage data.

• C-Clip: Combination of a CDF and its related blobs.

• Reflection: CDF of a deleted C-Clip. A reflection is created after the deletion of a C-Clip and provides an audit trail for each deleted C-Clip. Reflections may have expiration times. (If there is no configured expiration time for a reflection, the reflection is never deleted.)

Click the Filter drop-down menu to select a namespace containing CAS buckets and to set a date/time range to view the number and size of unreferenced blobs and expired reflections in CAS buckets.

Important: For ECS systems with existing CAS data that upgrade to 3.2.1, there is a CAS garbage data bootstrap process that is automatically triggered post upgrade. The bootstrap process builds necessary references over the existing CAS data and can require a significant amount of time depending on the amount of existing CAS data. During the bootstrap process, the unreferenced blob and reflection values will not change on the CAS Processing page. For example, you might see zero for the Unreferenced Blob Data Detected and Unreferenced Blobs Detected values. The values will not change until after the bootstrap process is complete. If you see that the values do not change over an extended period, call customer support.

When you search for a namespace (using the Search... option at the bottom of the list of namespaces in the Namespace drop-down field), the search functionality is based on prefixes only. For example, a search for fin returns finance-namespace-dev, while a search for dev returns nothing.
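
The prefix-only behavior can be illustrated with a short sketch. It is not the portal's implementation. The function name `search_namespaces` and the second namespace name are hypothetical, and case-sensitive matching is an assumption.

```python
from typing import Iterable, List


def search_namespaces(namespaces: Iterable[str], term: str) -> List[str]:
    """Return namespaces whose names start with term (prefix match only)."""
    return [name for name in namespaces if name.startswith(term)]


names = ["finance-namespace-dev", "hr-namespace-prod"]
print(search_namespaces(names, "fin"))  # ['finance-namespace-dev']
print(search_namespaces(names, "dev"))  # []
```

Because only the start of the name is compared, a term that appears in the middle or at the end of a namespace name produces no match.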

Table 10 CAS processing metrics


Bucket The name of the bucket containing CAS data.

Unreferenced Blob Data Detected The amount of unreferenced blob data in the bucket (in bytes).

Unreferenced Blobs Detected The number of unreferenced blobs in the bucket.

Reflection Data Detected The amount of reflection data in the bucket (in bytes).

Reflections Expired The number of expired reflections in the bucket.

Actions History provides a graphic display of the unreferenced blob and reflection data. If the Current filter (default) is selected, the History button displays the data for the last 24 hours. History data is kept for 60 days.

Monitor system health

You can monitor system health from the ECS Portal Monitor > System Health page.

The System Health page has the following tabs:

• Hardware Health: View data about the status of nodes and disks.

• Process Health: View data about the status of the NIC, CPU, and memory.

• Node Rebalancing: View data about the status of node rebalancing operations.

Monitor hardware health

You can use the Hardware Health tab to obtain the health of disks and nodes.

About this task

The Hardware Health tab is accessed from the ECS Portal at Monitor > System Health > Hardware Health. The following states describe node health:

• Good: The node is in normal operating condition.

• Suspect: Either the node is transitioning from good to bad because of decreasing hardware metrics, or there is a problem with a lower-level hardware component, or the hardware is not detectable by the system because of connectivity problems.

• Bad: The node needs replacement.

Disk states have the following meanings:

• Good: The system is reading from and writing to the disk.

• Suspect: The system no longer writes to the disk but reads from it. Swarms of suspect disks are likely caused by connectivity problems at a node. These disks transition back to Good when the connectivity issues clear up.

• Bad: The system neither reads from nor writes to the disk. Replace the disk. Once a disk has been identified as bad by the ECS system, it cannot be reused anywhere in the ECS system. Because of ECS data protection, when a disk fails, copies of the data that was once on the disk are re-created on other disks in the system. A bad disk only represents a loss of capacity to the system, not a loss of data. When the disk is replaced, the new disk does not have data restored to it. It becomes raw capacity for the system.

• Missing: The disk is a known disk that is unreachable. The disk may be transitioning between states, disconnected, or pulled.

• Removed: The disk is one that the system has completed recovery on and removed from the storage engine's list of valid disks. The history of all removed disks is displayed in the ECS UI.

• Not Accessible: If a node is not accessible, then all its disks have this status. It indicates that the actual status of the disk is not available to ECS.
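
The disk states listed above can be summarized in a small lookup, shown here as a sketch. The enum and helper names are hypothetical and do not correspond to an ECS API. Only the three states whose read and write behavior the guide spells out are mapped.

```python
from enum import Enum


class DiskState(Enum):
    GOOD = "Good"
    SUSPECT = "Suspect"
    BAD = "Bad"
    MISSING = "Missing"
    REMOVED = "Removed"
    NOT_ACCESSIBLE = "Not Accessible"


# (reads allowed, writes allowed) for the states whose I/O behavior the
# guide describes. The other states are left out on purpose.
IO_BEHAVIOR = {
    DiskState.GOOD: (True, True),
    DiskState.SUSPECT: (True, False),
    DiskState.BAD: (False, False),
}


def needs_replacement(state: DiskState) -> bool:
    """Only Bad disks are described as requiring replacement."""
    return state is DiskState.BAD
```

A monitoring script built on such a mapping could, for example, flag Suspect disks for a connectivity check and Bad disks for replacement.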

Note: The Current filter displays the latest available values. A date-time range filter displays average values over the specified range. Value data is kept for 60 days.

Procedure

1. Select Monitor > System Health and select the Hardware Health tab.

By default, the Offline Nodes subtab displays. This table may be empty if all nodes are online. Similarly, the Offline Disks subtab may be empty if all disks are online.

2. Select the Offline Nodes and Offline Disks subtabs to view a summary.

3. Select the All Nodes and Disks subtab to drill down to nodes and disks.

4. Click the node name to drill down to its disk health page.

Note: The Slot Info value always matches the physical slot ID in ECS U-Series, C-Series, and D-Series Appliances. This makes Slot Info useful for quickly locating a disk during disk replacement service. Some Certified Hardware installations with ECS Software may not report useful or reliable data for Slot Info.


Monitor process health

You can use the Process Health tab to obtain metrics that can help assess the health of the VDC, node, or node process.

About this task

The Process Health tab is accessed from the ECS Portal at Monitor > System Health > Process Health.


Table 11 VDC, node, and process health metrics

Metric label Level Description

Avg. NIC Bandwidth VDC and Node Average bandwidth of the network interface controller hardware used by the selected VDC or node.

Avg. CPU Usage (%) VDC and Node Average percentage of the CPU hardware used by the selected VDC or node.

Avg. Memory Usage VDC and Node Average usage of the aggregate memory available to the VDC or node.

Relative NIC (%) VDC and Node Percentage of the available bandwidth of the network interface controller hardware used by the selected VDC or node.

Relative Memory (%) VDC and Node Percentage of the memory used relative to the memory available to the selected VDC or node.

CPU (%) Process Percentage of the node's CPU used by the process. The list of processes tracked is not the complete list of processes running on the node. Therefore, the sum of the CPU used by the processes is not equal to the CPU usage shown for the node.

Memory Usage Process The memory used by the process.

Relative Memory (%) Process Percentage of the memory used relative to the memory available to the process.

Avg. # Thread Process Average number of threads used by the process.

Last Restart Process The last time the process restarted on the node.

Actions All History provides a graphic display of the data.

If the Current filter is selected, History displays default history for the last 24 hours.

Table 12 ECS processes

Process Description

Blob Service (blobsvc) Manages the following tables: Object (OB), Listing (LS), and RepoChunk Reference (RR).


Chunk Manager (cm) Manages the following tables: Chunk (CT), Btree Reference (BR). Provides the logic to handle various events based on the chunk's current state and decide which state to transition to next.

Directory Table Query (dtquery) Provides REST APIs to get Directory Table (DT) details.

GeoReceiver (georeceiver) Receives requests for chunks in the current VDC that are not owned by the current VDC (secondary chunks). It then requests Chunk Manager to start an operation to track the copy chunk creation and select three replicas. The GeoReceiver process then writes the datastream to the three instances. On successful completion, it directs Chunk Manager to commit the copy chunk.

Head Service (headsvc) Manages object head protocols: S3, OpenStack Swift, EMC Atmos, CAS, and HDFS.

Metering (metering) Manages the following tables: Metering Aggregate (MA) and Metering Raw (MR).

Object Control Service (objcontrolsvc) Provides REST APIs for configuring the ECS cluster, managing ECS resources, and monitoring the system.

Provision Service (provisionsvc) Manages the provisioning of storage resources and user access. It handles user management, authorization, and authentication for all provisioning requests, resource management, and multi-tenancy support.

Resource Service (resourcesvc) Manages the Resource Table (RT), which handles replication groups, buckets, users, namespace information, and so on.

Record Manager (rm) Manages PR (Partition Record) table (journal region).

Storage Service Manager (ssm) Manages the Storage Space (SS) table, which contains disk block usage and disk-to-chunk mapping. Interacts with one or more Storage Servers and manages the active/free chunks on the corresponding servers. Directs I/O operations to the disks.

Statistics Service (statsvc) Tracks various information on storage processes. These statistics can be used to monitor the system.

VNest (vnest) Provides distributed synchronization and group services. A subset of data nodes will be group members responsible for serving the key/value requests. VNest services running on other nodes will listen for configuration updates and be ready to be added to the group.

Procedure

1. Locate the table row for the target VDC.

2. To drill down to a table with rows for each node in the VDC, select the VDC name.

3. To drill down to a table with rows for each process running on the node, select a node endpoint.

4. To display data graphically, select the History button for the target VDC, node, or process.


Monitor node rebalancing status

Use the Node Rebalancing tab to monitor the status of data rebalancing operations when nodes are added to, or removed from, a cluster. Node rebalancing is enabled by default at installation. Contact your customer support representative to disable or re-enable this feature.

Before you begin

Access the Node Rebalancing tab from the ECS Portal at Monitor > System Health > Node Rebalancing. Amounts are shown in bytes (B).


A series of interactive graphs shows the amount of data rebalanced, the amount of data pending rebalancing, and the rate of rebalancing in bytes over time.

Node rebalancing works only for new nodes that are added to the cluster.

Rebalance data Description

Data Rebalanced Amount of data that has been rebalanced.

Pending Rebalancing Amount of data that is in the rebalance queue but has not been rebalanced yet.

Rate of Rebalance (per day) The incremental amount of data that was rebalanced during a specific time period. The default time period is one day.

Monitor transactions

You can monitor requests and network performance for VDCs and nodes from the Monitor > Transactions page.

The Transactions page has two tabs:

• Requests: monitor data requests and failure rates for VDCs and nodes.

• Performance: monitor network performance for VDCs and nodes.

Monitor requests

You can use the Requests tab to monitor network traffic.

About this task

The Requests tab is accessed from the ECS Portal at Monitor > Transactions > Requests and provides information about request rates and errors from the ECS object heads (S3, OpenStack Swift, EMC Atmos, CAS, and so on). Information is available at the VDC and node level.

Table 13 Request metrics

Metric label Description

VDC The name of the VDC. Click to drill down to request metrics by node.

Successful Requests The number of data requests from all object heads that were successfully completed.


System Failures The number of data requests from all object heads classified as system failures. System failures are failed requests associated with hardware or service errors (typically an HTTP error code of 5xx).

User Failures The number of data requests from all object heads classified as user failures. User failures are known error types originating from the object heads (typically an HTTP error code of 4xx).

Failures % Rate The percentage of failures in the VDC or node.


Total Requests The number of requests for the VDC.

Total Failures The number of failed requests for the VDC.


Code An HTTP error code.

Type The type of failure: System or User.

Head An ECS object head.

Failures with this code The number of failures associated with the specified HTTP code.

Failures % Rate The percent of failures associated with the HTTP error code.
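
The failure categories in Table 13 can be sketched as a simple classification by HTTP status code. This is an approximation: the guide says system and user failures are typically 5xx and 4xx codes, so the strict ranges below, the treatment of every other code as a success, and the function names are assumptions for illustration.

```python
from collections import Counter
from typing import Iterable


def classify(status: int) -> str:
    """Map an HTTP status code to the request categories in Table 13."""
    if 500 <= status <= 599:
        return "system_failure"
    if 400 <= status <= 499:
        return "user_failure"
    return "success"


def failure_rate_pct(statuses: Iterable[int]) -> float:
    """Percentage of requests that failed (system plus user failures)."""
    counts = Counter(classify(s) for s in statuses)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    failures = counts["system_failure"] + counts["user_failure"]
    return 100.0 * failures / total
```

For four requests with status codes 200, 200, 404, and 500, this sketch reports one user failure, one system failure, and a failure rate of 50 percent.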

Procedure

1. Select Monitor > Transactions and select the Requests tab.

2. Locate the target VDC name.

3. To show data for a node, select the VDC name to drill down to the nodes table and then select a node to show complete data for that node.

4. To apply a filter or sort the table, click on a column name.

Note: The Current filter displays the values for the last 24 hours. A date-time range filter displays the total request values over the specified range.

Monitor performance

You can use the Performance tab to obtain network traffic metrics at the VDC or the individual node level.

About this task

The Performance tab is accessed from the ECS Portal at Monitor > Transactions > Performance.

Note: The Current filter is selected by default and displays the latest available values. A Date Time Range filter displays average values over a selected range or over a custom time range.


Table 14 Network traffic metrics


VDC The name of the VDC. Click to see the performance for each node.

Read Latency (ms) Average latency for reads in milliseconds. The read latency value is calculated as time to first byte. The latency value indicates the request processing time on the node; it does not include data transfer.

Write Latency (ms) Average latency for writes in milliseconds. The write latency value is calculated as the time from last byte to transaction complete. The latency value indicates the request processing time on the node; it does not include data transfer.

Read Bandwidth Bandwidth for reads.

Write Bandwidth Bandwidth for writes.

Read Transactions (per second) Read transactions per second.

Write Transactions (per second) Write transactions per second.

Actions History provides a graphic display of the data.

If the Current filter is selected, the History button displays default history for the last 24 hours.

Procedure

1. Select Monitor > Transactions > Performance.

2. Locate the target VDC name.

3. To drill down to the nodes display, select the VDC name.

4. To display the performance data graphically, select the History button for the target VDC or node.

Monitor recovery status

You can use the Recovery Status page to monitor the data recovered by the system.

About this task

The Recovery Status page is accessed from the ECS Portal at Monitor > Recovery Status. Recovery is the process of rebuilding data after any local condition that results in bad data (chunks). This table includes one row for each storage pool in the local VDC.

Note: The Current filter displays the latest available values. A date-time range filter displays average values over the specified range.

Table 15 Recovery metrics

Column Description

Storage Pool The storage pools for the current VDC.

Amount of Data to be Recovered With the Current filter selected, this is the logical size of the data yet to be recovered.


When a historical period is selected as the filter, Amount of Data to be Recovered is the average amount of data pending recovery during the selected period of time.

For example, if the first hourly snapshot of the data showed 400 GB of data to be recovered in a historical time period and every other snapshot showed 0 GB waiting to be recovered, the value of this field would be 400 GB divided by the total number of hourly snapshots in the period.

Recovery Rate Rate at which data is being recovered in the specified storage pool.

Time to Completion Estimated time to complete the recovery, extrapolated from the current recovery rate.

Actions History provides a graphical display of the data.
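
The averaging example given for Amount of Data to be Recovered can be worked through in a few lines. The 24-hour period length is a hypothetical choice for illustration, and the function name is not part of ECS.

```python
from typing import Sequence


def average_pending_gb(hourly_snapshots_gb: Sequence[float]) -> float:
    """Average of the hourly 'data to be recovered' snapshots in a period."""
    if not hourly_snapshots_gb:
        return 0.0
    return sum(hourly_snapshots_gb) / len(hourly_snapshots_gb)


# A 24-hour period: 400 GB in the first snapshot and 0 GB in the rest.
snapshots = [400.0] + [0.0] * 23
print(average_pending_gb(snapshots))
```

Over 24 hourly snapshots, the reported value is 400 GB divided by 24, or roughly 16.7 GB, even though the backlog was 400 GB at the start of the period.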


Procedure

1. Select Monitor > Recovery Status.

2. Locate the table row for the target storage pool.

3. To show the recovery status history graph, select the History button.

Monitor disk bandwidth

You can use the Disk Bandwidth page to monitor the disk usage metrics at the VDC or individual node level.

About this task

The Disk Bandwidth page is accessed from the ECS Portal at Monitor > Disk Bandwidth. There is one row for read and another for write for each VDC or node. By default, the History charts show data for the last 24 hours.


Table 16 Disk Bandwidth metrics


VDC The VDC that the bandwidth data relates to.

Read or Write Indicates whether the row describes read data or write data.

Nodes The number of nodes in the VDC. You can click on the nodes number to see the disk bandwidth metrics for each node. There is no Nodes column when you have drilled down into the Nodes display for a VDC.

Total Total disk bandwidth used for either read or write operations.

Hardware Recovery Rate at which disk bandwidth is used to recover data after a hardware failure.


Erasure Encoding Rate at which disk bandwidth is used in system erasure coding operations.

XOR Rate at which disk bandwidth is used in the system's XOR data protection operations. Note that XOR operations occur for systems with three or more sites (VDCs).

Consistency Checker Rate at which disk bandwidth is used to check for inconsistencies between protected data and its replicas.

Geo Rate at which disk bandwidth is used to support geo replication operations.

User Traffic Rate at which disk bandwidth is used by object users.

Actions History provides a graphic display of the data.


Procedure

1. Select Monitor > Disk Bandwidth.

2. Locate the target VDC name and either the Read or Write table row for that VDC.

3. To show data for nodes, select Nodes Count to drill down to a table with rows for the nodes in the VDC.

4. To display the disk bandwidth history charts, select the History button for the VDC or node.

Introduction to geo-replication monitoring

You can use the Geo Replication page to monitor the replication of data across the VDCs that make up a replication group.

The Geo Replication page is accessed from the ECS Portal at Monitor > Geo Replication and provides four tabs:

• Rate and Chunks

• Recovery Point Objective (RPO)

• Failover Processing

• Bootstrap Processing

Monitor geo replication: Rate and Chunks

You can use the Rate and Chunks tab to obtain metrics about the network traffic for geo-replication and the chunks waiting for replication by a replication group or remote VDC.

The Rate and Chunks tab is accessed from the ECS Portal at Monitor > Geo Replication > Rate and Chunks.


Table 17 Rate and Chunks columns

Column Description

Replication Group Lists the replication groups of which this VDC is a member. Click a replication group to see a table of remote VDCs in the replication group and their statistics. Click the Replication Groups link above the table to return to the default view.

Write Traffic The current rate of writes to all remote VDCs or an individual remote VDC in the replication group.

Read Traffic The current rate of reads to all remote VDCs or an individual remote VDC in the replication group.

User Data Pending Replication The total logical size of user data waiting for replication for the replication group or remote VDC.

Metadata Pending Replication The total logical size of metadata waiting for replication for the replication group or remote VDC.

Data Pending XOR The total logical size of all data waiting to be processed by the XOR compression algorithm in the local VDC for the replication group or remote VDC.

Monitor geo replication: Recovery Point Objective (RPO)

You can use the RPO tab to view the recovery point objective for a replication group and its remote VDCs. The RPO refers to the point in time in the past to which you can recover. The value presented is the oldest data at risk of being lost if a local VDC fails before replication is complete.

The RPO tab is accessed from the ECS Portal at Monitor > Geo Replication > RPO.

Table 18 RPO columns

Column Description

Remote Replication Group\Remote VDC At the VDC level, lists all remote replication groups of which the local VDC is a member. At the replication group level, this column lists the remote VDCs in the replication group.

Overall RPO The recent time period for which data might be lost in the event of a local VDC failure.
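
One way to picture the RPO value is as the age of the oldest write that has not yet been replicated. The sketch below follows that reading. The guide does not describe how ECS computes the figure, so the function name `overall_rpo` and its inputs are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable


def overall_rpo(unreplicated_write_times: Iterable[datetime],
                now: datetime) -> timedelta:
    """Age of the oldest write that has not been replicated yet."""
    pending = list(unreplicated_write_times)
    if not pending:
        return timedelta(0)  # nothing is at risk
    return now - min(pending)


now = datetime(2019, 9, 1, 12, 0, tzinfo=timezone.utc)
pending_writes = [now - timedelta(minutes=5), now - timedelta(minutes=42)]
print(overall_rpo(pending_writes, now))  # 0:42:00
```

In this example, data written in the last 42 minutes could be lost if the local VDC failed at that moment.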

Monitor geo replication: Failover Processing

You can use the Failover Processing tab to view the metrics on the process to re-replicate data following permanent failure of a remote VDC.

The Failover Processing tab is accessed from the ECS Portal at Monitor > Geo Replication > Failover Processing.

Table 19 Failover columns

Field Description

Replication Group Lists the replication groups that the local VDC is a member of.

Failed VDC Identifies a failed VDC that is part of the replication group.


User Data Pending Re-replication When a VDC fails, user data chunks replicated to the failed VDC have to be re-replicated to a different VDC. This field reports the logical size of all user data (repository) chunks awaiting re-replication to a different VDC.

Metadata Pending Re-replication When a VDC fails, metadata chunks replicated to the failed VDC have to be re-replicated to a different VDC. This field reports the logical size of all metadata chunks awaiting re-replication to a different VDC.

Data Pending XOR Decoding Shows the count and total logical size of chunks waiting to be retrieved by the XOR compression scheme.

Failover State • BLIND_REPLAY_DONE

• REPLICATION_CHECK_DONE: The process that makes sure that all replication chunks are in an acceptable state and replication has completed successfully.

• CONSISTENCY_CHECK_DONE: The process that makes sure that all system metadata is fully consistent with other replicated data and has completed successfully.

• ZONE_SYNC_DONE: The synchronization of the failed VDC has completed successfully.

• ZONE_BOOTSTRAP_DONE: The bootstrap process on the failed VDC has completed successfully.

• ZONE_FAILOVER_DONE: The failover process has completed successfully.

Failover Progress A percentage indicator for the overall status of the failover process.
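
As a rough illustration of how the Failover State column could be consumed by a script, the sketch below lists the replication groups whose failover has not yet reached ZONE_FAILOVER_DONE. Only the state names come from Table 19. The row dictionaries, their keys, and both function names are hypothetical and do not correspond to any documented ECS API.

```python
# State names are taken from Table 19; the row layout is a hypothetical example.
FAILOVER_STATES = {
    "BLIND_REPLAY_DONE",
    "REPLICATION_CHECK_DONE",
    "CONSISTENCY_CHECK_DONE",
    "ZONE_SYNC_DONE",
    "ZONE_BOOTSTRAP_DONE",
    "ZONE_FAILOVER_DONE",
}


def failover_complete(row):
    """Return True when the row reports ZONE_FAILOVER_DONE."""
    state = row["failover_state"]
    if state not in FAILOVER_STATES:
        raise ValueError(f"unknown failover state: {state}")
    return state == "ZONE_FAILOVER_DONE"


def pending_failovers(rows):
    """List (replication group, failed VDC) pairs that are not yet complete."""
    return [
        (row["replication_group"], row["failed_vdc"])
        for row in rows
        if not failover_complete(row)
    ]


# Hypothetical rows, as they might be transcribed from the Failover Processing tab.
rows = [
    {"replication_group": "rg1", "failed_vdc": "vdc2", "failover_state": "ZONE_SYNC_DONE"},
    {"replication_group": "rg2", "failed_vdc": "vdc2", "failover_state": "ZONE_FAILOVER_DONE"},
]
print(pending_failovers(rows))
```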

Monitor geo replication: Bootstrap Processing

You can use the Bootstrap Processing tab to monitor the copying of user data and metadata to a VDC that has been added to a replication group.

The Bootstrap Processing tab is accessed from the ECS Portal at Monitor > Geo Replication > Bootstrap Processing.

Table 20 Bootstrap Processing columns

Column Description

Replication Group This column provides the list of replication groups of which the local VDC is a member and that are adding new VDCs. Each row provides metrics for the specified replication group.

Added VDC The VDC being added to the specified replication group.

User Data Pending Replication The logical size of all user data (repository) chunks waiting for replication to the new VDC.

Metadata Pending Replication The logical size of all system metadata waiting for replication to the new VDC.

Bootstrap State The bootstrap state. Can be:

• BTreeScan

• ReplicateBTree

• ReplicateBTreeMarker

• ReplicateJournal

• Done

Bootstrap Progress (%) The completion percent of the entire bootstrap process.

Cloud hosted VDC monitoring

ECS can identify whether a site is hosted or on-premise, and the ECS Management REST API can retrieve information about the utilization and performance of hosted sites.

Where an ECS system includes a hosted site, the ECS Portal displays a top-level Cloud menu that enables administrators to see how the hosted sites are used as part of replication groups and to view the traffic to and from the hosted site in terms of bandwidth utilization and latency. The portal also shows the traffic to and from on-premise sites to allow comparison with hosted-site traffic.

The Cloud menu is not shown if the ECS system uses only on-premise sites.

Cloud topology

You can use the Cloud topology summary information to see how the ECS system is making use of hosted VDCs.

The Cloud > Topology page shows the hosted VDCs that are part of an ECS federated system, and shows the relationship between the hosted VDC and any on-premise VDCs.

Cloud Hosted VDCs

The Cloud Hosted VDCs table shows the hosted VDCs that are present in the ECS system. Currently ECS supports a single hosted site.

Related On-Premise VDCs

The Related On-Premise VDCs table shows the on-premise VDCs that are part of the ECS federation.

Related Replication Groups

The Related Replication Groups table shows the replication groups that contain a storage pool contributed by a selected hosted VDC. The hosted VDC is selected in the Cloud Hosted VDCs table.

A primary use case for a hosted VDC is the Passive configuration, in which the hosted VDC provides a site for replication data but cannot be used as an active site by users. However, where the active operation of the hosted VDC is allowed, the hosted VDC can be included in replication groups where the type is Active.

The table shows the replication group type and the VDC storage pools that are contributing to the replication group, at least one of which will be a hosted VDC.

Cloud replication traffic

You can use the cloud replication traffic information to see the performance of hosted VDCs and compare it with that of on-premise VDCs.

The Cloud > Replication page shows replication traffic by VDC and by replication group.


Virtual Data Centers

The Virtual Data Centers tab shows each VDC, whether hosted or on-premise, and provides aggregated traffic figures for all replication groups associated with a VDC.

Table 21 Replication traffic by VDC


Read Latency The average latency in milliseconds for reads from all replication groups associated with the selected VDC.

Write Latency The average latency in milliseconds for writes to all replication groups associated with the selected VDC.

Read Bandwidth The bandwidth utilized by reads from all replication groups associated with the selected VDC.

Write Bandwidth The bandwidth utilized by writes to all replication groups associated with the selected VDC.

Replication Groups

The Replication Groups tab shows each replication group and provides traffic data for a VDC for each replication group that it contributes to. A VDC might have a storage pool that is in more than one replication group, and this display allows you to see the traffic associated with each replication group.

Table 22 Replication traffic by replication group


Read Latency The average latency in milliseconds for reads from the selected VDC that relate to the specified replication group.

Write Latency The average latency in milliseconds for writes to the selected VDC that relate to the specified replication group.

Read Bandwidth The bandwidth utilized by reads from the selected VDC that relate to the specified replication group.

Write Bandwidth The bandwidth utilized by writes to the selected VDC that relate to the specified replication group.
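
The comparison between hosted and on-premise VDCs that these tables support can be sketched in a few lines. The example below averages the Read Latency and Write Latency figures per site type. The row structure, the site_type labels, and the sample values are hypothetical and serve only to illustrate the comparison. They are not output from ECS.

```python
from statistics import mean


def mean_latency_by_site_type(vdc_rows):
    """Average read and write latency (ms) for each site type."""
    grouped = {}
    for row in vdc_rows:
        grouped.setdefault(row["site_type"], []).append(row)
    return {
        site_type: {
            "read_latency_ms": mean(r["read_latency_ms"] for r in rows),
            "write_latency_ms": mean(r["write_latency_ms"] for r in rows),
        }
        for site_type, rows in grouped.items()
    }


# Hypothetical figures, as they might be read from the Virtual Data Centers tab.
vdc_rows = [
    {"vdc": "vdc1", "site_type": "on-premise", "read_latency_ms": 10.0, "write_latency_ms": 14.0},
    {"vdc": "vdc2", "site_type": "on-premise", "read_latency_ms": 20.0, "write_latency_ms": 16.0},
    {"vdc": "vdc3", "site_type": "hosted", "read_latency_ms": 45.0, "write_latency_ms": 60.0},
]
summary = mean_latency_by_site_type(vdc_rows)
print(summary)
```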

CHAPTER 3

Monitoring Events: Audits and Alerts

• About event monitoring
• Monitor audit data
• Audit messages
• Monitor alerts
• Alert policy
• Acknowledge all alerts
• Alert messages


About event monitoring

You can view the available event monitoring messages (audit and alert) from the ECS Portal.

The Monitor > Events page has two tabs:

• Audit: All activity by users working with the portal, the ECS REST APIs, and the ECS CLI. Other audit types include upgrade activities.

• Alerts: Alerts raised by the ECS system.

Event data available through the ECS Portal is limited to 30 days. If you need to keep event data for longer periods, consider using the ViPR SRM product.

Monitor audit data

Use the Monitor > Events > Audit tab to view and manage audit data.

About this task

See the List of audit messages.

Procedure

1. Select the Audit tab.

2. Optionally, select Filter.

3. Specify a Date Time Range and adjust the From and To fields and time fields. When creating a custom date-time range, select Current Time to use the current date and time as the end of your range.

4. Select a Namespace.

5. Click Apply.

Note: The newest audit messages appear at the top of the table.

Audit messages

List of the audit messages generated by ECS.

Table 23 ECS audit messages

Service Audit item Audit message

Alert sent_alert Alert \"${alertMessage}\" with symptom code ${symptomCode} triggered

Auth Provider new_authentication_provider_added New authentication provider ${resourceId} added

Auth Provider authentication_provider_deleted Authentication provider ${resourceId} deleted

Auth Provider authentication_provider_updated Existing Authentication provider ${resourceId} updated

Bucket bucket_created Bucket ${resourceId} has been created

Bucket bucket_deleted Bucket ${resourceId} has been deleted

Bucket bucket_updated Bucket ${resourceId} has been updated

Bucket bucket_ACL_set Bucket ${resourceId} ACLs have changed



Bucket bucket_owner_changed Owner of ${resourceId} bucket has changed

Bucket bucket_versioning_set Versioning has been enabled on ${resourceId} bucket

Bucket bucket_versioning_unset Versioning has been suspended on ${resourceId} bucket

Bucket bucket_versioning_source_set Bucket ${resourceId} versioning source set

Bucket bucket_metadata_set Metadata on ${resourceId} bucket has been changed

Bucket bucket_head_metadata_set Bucket ${resourceId} head metadata set

Bucket bucket_expiration_policy_set Bucket ${resourceId} expiration policy has updated

Bucket bucket_expiration_policy_deleted Bucket ${resourceId} expiration policy has been deleted

Bucket bucket_cors_config_set Bucket ${resourceId} CORS rules have been changed

Bucket bucket_cors_config_deleted Bucket ${resourceId} CORS rules have been deleted

Bucket notification_size_exceeded_on_bucket Notification size has been exceeded on ${resourceId} bucket

Bucket block_size_exceeded_on_bucket Block size has been exceeded on ${resourceId} bucket

Bucket bucket_set_quota Bucket ${resourceId} quota has been updated with notification size as ${notificationSize} and block size as ${blockSize}

Bucket bucket_policy_created Bucket ${resourceId} policy has been created

Bucket bucket_policy_updated Bucket ${resourceId} policy has been updated

Bucket bucket_policy_deleted Bucket ${resourceId} policy has been deleted

Cluster cluster_set Cluster id ${resourceId} has been set

Fabric InstallerServiceOperation[kind=INSTALLER_SERVICE_OPERATION,host=${hostName},timestamp=${timestamp},operationType=${operation},args=${arguments of operation},status=SUCCEEDED,fqdn=${fqdn of host},version=${installer version}]

Fabric NodeMaintenanceMode[kind=NodeMaintenanceMode,timestamp=${timestamp},agentId=${agendId},fqdn=${fqdn},status=${MaintenanceStatus}]

License user_added_license License ${resourceId} has been added

License managed_capacity_exceeded Managed capacity has exceeded licensed ${resourceId} capacity





License license_expired License ${resourceId} has expired

Local user domain_group_mapping_created Domain group ${resourceId} to ${roles} role(s) mapping is added

Local user domain_group_mapping_created_no_roles Domain group ${resourceId} without role mappings is added

Local user domain_group_mapping_updated Domain group ${resourceId} roles mapping is changed to ${roles} role(s)

Local user domain_group_mapping_updated_no_roles All roles of domain group ${resourceId} mapping have been removed

Local user domain_user_mapping_created Domain user ${resourceId} to ${roles} role(s) mapping is added

Local user domain_user_mapping_created_no_roles Domain user ${resourceId} without role mappings is added

Local user domain_user_mapping_deleted Domain user ${resourceId} mapping is removed

Local user domain_user_mapping_updated Domain user ${resourceId} role mapping is changed to ${roles} role(s)

Local user domain_user_mapping_updated_no_roles All roles of domain user ${resourceId} mapping have been removed

Local user local_user_created Management user ${resourceId} with ${roles} role(s) has been created

Local user local_user_created_no_roles Management user ${resourceId} without roles has been created

Local user local_user_deleted Management user ${resourceId} has been deleted

Local user local_user_password_changed Credential of management user ${resourceId} has changed

Local user local_user_updated Roles of management user ${resourceId} have been changed to ${roles}

Local user local_user_roles_updated_no_roles All roles of management user ${resourceId} have been removed

Locked vdc_lock_successful VDC lock was successful

Locked vdc_lock_failed VDC lock failed

Locked node_lock_successful Lock successful for node ${resourceId}

Locked node_lock_failed Lock failed for node ${resourceId}

Locked node_unlock_successful Unlock successful for node ${resourceId}

Locked node_unlock_failed Unlock failed for node ${resourceId}

Login login_successful User ${resourceId} logged in successfully

Login login_failed User ${resourceId} failed to login

Login user_token_logout User logged out token ${resourceId}





Login user_logout All user tokens have logged out

Namespace block_size_exceeded_on_namespace Block size has been exceeded on ${resourceId} namespace

Namespace namespace_admin_group_mappings_updated Namespace ${resourceId} admin group mappings updated to following groups: ${groups}

Namespace namespace_admin_group_mappings_updated_no_groups Namespace ${resourceId} admin groups mappings updated to an empty list

Namespace namespace_admin_user_mappings_updated Namespace ${resourceId} admin mappings updated to following users: ${admins}

Namespace namespace_admin_user_mappings_updated_no_admins Namespace ${resourceId} admin mappings updated to an empty list

Namespace namespace_created Namespace ${resourceId} has been created

Namespace namespace_deleted Namespace ${resourceId} has been deleted

Namespace namespace_updated Namespace ${resourceId} has been updated

Namespace notification_size_exceeded_on_namespace Notification size has been exceeded on ${resourceId} namespace

NFS ugmapping_created ${type} mapping ${ugMappingName} --> ${resourceId} has been created

NFS ugmapping_deleted ${type} mapping ${ugMappingName} --> ${resourceId} has been deleted

NFS export_created Export with export path ${exportPath} has been created

NFS export_deleted Export with export path ${exportPath} has been deleted

NFS export_updated Export with export path ${exportPath} has been updated

ReplicationGroup replication_group_created Replication Group ${resourceId} has been created

ReplicationGroup replication_group_updated Replication Group ${resourceId} has been updated

Security command_exec_insufficient_permission Attempt to execute a command ${command} from ${host} without right permissions

SNMP snmp_v2_target_created SNMP target ${snmpTarget} with Community '${community}' is added

SNMP snmp_v3_target_created SNMP target ${snmpTarget} with Username '${username}', Authentication(${authProtocol}) and Privacy(${privProtocol})

SNMP snmp_target_deleted SNMP target ${snmpTarget} is deleted

SNMP snmp_engineid_updated SNMP agent EngineID is set to ${engineId}





SNMP snmp_v2_target_updated SNMP target ${oldSnmpTarget} is updated as ${newSnmpTarget} with Community string ${community}

SNMP snmp_v3_target_updated SNMP target ${oldSnmpTarget} is updated as ${newSnmpTarget} with Username ${username}, Authentication(${authProtocol}) and Privacy(${privProtocol})

Storage Pool storage_pool_created Storage Pool ${resourceId} has been created

Storage Pool storage_pool_deleted Storage Pool ${resourceId} has been deleted

Storage Pool storage_pool_updated Storage Pool ${resourceId} has been updated

Syslog syslog_server_added Syslog server ${protocol}://${host}:${port} with severity ${severity} is added into the configuration

Syslog syslog_server_updated Syslog server ${old_protocol}://${old_host}:${old_port} is updated to ${protocol}://${host}:${port} with severity ${severity} in the configuration

Syslog syslog_server_deleted Syslog server ${protocol}://${host}:${port} is removed from the configuration

Transformation transformation_created_message Transformation created

Transformation transformation_updated_message Transformation updated

Transformation transformation_pre_check_started_message Transformation precheck started

Transformation transformation_enumeration_started_message Transformation enumeration started

Transformation transformation_indexing_started_message Transformation indexing started

Transformation transformation_migration_started_message Transformation migration started

Transformation transformation_recovery_migration_started_message Transformation recovery migration started

Transformation transformation_reconciliation_started_message Transformation reconciliation started

Transformation transformation_sources_updated_message Transformation sources updated

Transformation transformation_deleted_message Transformation deleted

Transformation transformation_retried_message Transformation %s retried

Transformation transformation_canceled_message Transformation %s canceled

Transformation transformation_profile_mappings_updated_message Transformation profile mappings updated

User change_password_failed User ${resourceId} failed to change password, reason: ${reason}

User user_created Object user ${resourceId} has been created

User user_deleted Object user ${resourceId} has been deleted

User user_set_password New password has been set for object user ${resourceId}

User user_delete_password Password has been deleted for object user ${resourceId}

User user_set_metadata New metadata has been set for object user ${resourceId}

User user_locked Object user ${resourceId} has been locked

User user_unlocked Object user ${resourceId} has been unlocked

User user_set_user_tag User Tag has been set for object user ${resourceId}

User user_delete_user_tag User Tag has been deleted for object user ${resourceId}
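
The ${name} placeholders in Table 23 are filled in with concrete values when an audit message is generated. How ECS does this internally is not described here. The minimal sketch below only illustrates the substitution, using Python's string.Template, which happens to use the same ${name} syntax. The AUDIT_TEMPLATES dictionary holds three messages copied from the table, and render_audit_message is a hypothetical helper, not an ECS function.

```python
from string import Template

# Three message templates copied from Table 23, keyed by audit item.
AUDIT_TEMPLATES = {
    "bucket_created": "Bucket ${resourceId} has been created",
    "login_failed": "User ${resourceId} failed to login",
    "change_password_failed": (
        "User ${resourceId} failed to change password, reason: ${reason}"
    ),
}


def render_audit_message(audit_item, **values):
    """Fill the placeholders of one audit message template.

    Raises KeyError if the audit item is unknown or a placeholder has no value.
    """
    return Template(AUDIT_TEMPLATES[audit_item]).substitute(**values)


print(render_audit_message("bucket_created", resourceId="bucket1"))
print(render_audit_message("change_password_failed", resourceId="user1", reason="expired"))
```

The Transformation messages that contain %s use a different placeholder style and are not covered by this sketch.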

Monitor alerts

You can use the Monitor > Events > Alerts tab to view and manage system alerts.

About this task

See the list of alert messages.

Alert message Severity labels have the following meanings:

• Critical: Messages about conditions that require immediate attention

• Error: Messages about error conditions that report either a physical failure or a software failure

• Warning: Messages about less than optimal conditions

• Info: Routine status messages

Procedure

1. Select Alerts.

2. Optionally, click Filter.

3. Select your filters. The alerts filter adds filtering by Severity and Type, and an option to Show Acknowledged Alerts, which retains the display of an alert even after it is acknowledged by the user. When creating a custom date-time range, select Current Time to use the current date and time as the end of your range.

Alert types must be entered exactly as described in the following table:



Table 24 Alert types

Alert Type (type exactly as shown) Description

Fabric Raised when system issues are detected.

Geo Raised for geo-replication alerts.

License Raised for license, capacity, or capacity entitlement exceeded alerts.

Notify Raised for miscellaneous alerts.

Quota Raised when soft or hard quota limits are exceeded (SoftQuotaLimitExceeded or HardQuotaLimitExceeded) for a bucket or for a namespace.

RPO Raised when the recovery point objective (RPO) is greater than the RPO threshold.

Capacity Alerting Raised when the remaining capacity of the storage pool reaches a set threshold.

Capacity License Threshold Raised if the system capacity is greater than the licensed capacity.

CHUNK_NOT_FOUND Raised when chunk data is not found.

DTSTATUS_RECENT_FAILURE Raised when the status of a data table is bad.

Table 25 ESRS dial home types

Alert Type (type exactly as shown) Description

TestDialHome Raised to test that ESRS connections can be established and that the call home functionality works.

4. Select a Namespace.

5. Click Apply.

6. Next to each event, click the Acknowledge Alert button to acknowledge and dismiss the message. Messages that have previously been acknowledged will display when the Show Acknowledged Alerts filter option is selected, but the Acknowledge Alert button will not be displayed for these rows.

7. You can click the Description of an alert, when it is formatted as a link, to be taken to a relevant page in the portal.
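
The filtering behavior described in this procedure can be sketched as follows. This is a minimal illustration that assumes alerts are available as dictionaries with type, severity, and acknowledged keys. That layout and the filter_alerts function are hypothetical. The type names are the ones listed in Table 24 and Table 25, and the sketch rejects any type that does not match one of them exactly.

```python
# Alert types copied from Table 24 and Table 25; matching is exact.
ALERT_TYPES = {
    "Fabric", "Geo", "License", "Notify", "Quota", "RPO",
    "Capacity Alerting", "Capacity License Threshold",
    "CHUNK_NOT_FOUND", "DTSTATUS_RECENT_FAILURE", "TestDialHome",
}


def filter_alerts(alerts, alert_type=None, severity=None, show_acknowledged=False):
    """Return the alerts that match the optional type and severity filters.

    Acknowledged alerts are hidden unless show_acknowledged is True.
    """
    if alert_type is not None and alert_type not in ALERT_TYPES:
        raise ValueError(f"alert type must be entered exactly as listed: {alert_type!r}")
    selected = []
    for alert in alerts:
        if alert_type is not None and alert["type"] != alert_type:
            continue
        if severity is not None and alert["severity"] != severity:
            continue
        if alert["acknowledged"] and not show_acknowledged:
            continue
        selected.append(alert)
    return selected


# Hypothetical alert records.
alerts = [
    {"type": "Quota", "severity": "Warning", "acknowledged": False},
    {"type": "Quota", "severity": "Error", "acknowledged": True},
    {"type": "RPO", "severity": "Warning", "acknowledged": False},
]
print(filter_alerts(alerts, alert_type="Quota"))
```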

Alert policy

Alert policies are created to alert about metrics, and are triggered when the specified conditions are met. Alert policies are created per VDC.

You can use the Settings > Alerts Policy page to view alert policies.

There are two types of alert policy:

System alert policies

• System alert policies are precreated and exist in ECS during deployment.

• All the metrics have an associated system alert policy.

• System alert policies cannot be updated or deleted.

• System alert policies can be enabled or disabled.

• Alerts are sent to the UI and all channels (SNMP, SYSLOG, and Secure Remote Services).

User-defined alert policies

• You can create user-defined alert policies for the required metrics.

• Alerts are sent to the UI and customer channels (SNMP and SYSLOG).

New alert policy

You can use the Settings > Alerts Policy > New Alert Policy tab to create user-defined alert policies.

Procedure

1. Select New Alert Policy.

2. Give a unique policy name.

3. Use the metric type drop-down menu to select a metric type.

Metric Type is a grouping of statistics. It consists of:

• Btree Statistics

• CAS GC Statistics

• Geo Replication Statistics

• Metering Statistics

• Garbage Collection Statistics

• EKM

4. Use the metric name drop-down menu to select a metric name.

5. Select level.

a. To inspect metrics at the node level, select Node.

b. To inspect metrics at the VDC level, select VDC.

6. Select polling interval.

Polling Interval determines how frequently data should be checked. Each polling interval gives one data point, which is compared against the specified condition. When the condition is met, an alert is triggered.

7. Select instances.

Instances describe how many data points to check and how many should match the specified conditions to trigger an alert.

For metrics where historical data is not available, only the latest data is used.

8. Select conditions.

You can set the threshold values and alert type with Conditions.

The alert can be a Warning Alert, an Error Alert, or a Critical Alert.

9. To add more conditions with multiple thresholds and with different alert levels, select Add Condition.

10. Click Save.
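
How polling interval, instances, and conditions interact can be sketched as follows. This is an illustration under stated assumptions, not the ECS implementation: each entry in data_points stands for one polling interval, check_last and must_match model the two Instances values, and reporting the most severe level whose condition is met is an assumption made here for the sake of the example. All names are hypothetical.

```python
import operator

# Comparison operators a condition may use (an assumed set for this sketch).
OPERATORS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}

# Assumption: when several conditions are met, report the most severe level.
LEVELS_BY_SEVERITY = ("Critical", "Error", "Warning")


def evaluate_policy(data_points, conditions, check_last, must_match):
    """Return the alert level triggered by the latest data points, or None.

    data_points: one metric value per polling interval, oldest first.
    check_last:  how many of the most recent data points to inspect.
    must_match:  how many of those must satisfy a condition to trigger it.
    """
    window = data_points[-check_last:]
    for level in LEVELS_BY_SEVERITY:
        for condition in conditions:
            if condition["level"] != level:
                continue
            compare = OPERATORS[condition["op"]]
            matches = sum(1 for value in window if compare(value, condition["threshold"]))
            if matches >= must_match:
                return level
    return None


# Hypothetical policy with three thresholds on a percentage metric.
conditions = [
    {"level": "Warning", "op": ">", "threshold": 80},
    {"level": "Error", "op": ">", "threshold": 90},
    {"level": "Critical", "op": ">", "threshold": 96},
]
result = evaluate_policy([70, 85, 92, 95, 97], conditions, check_last=3, must_match=2)
print(result)
```

In this example the last three data points are 92, 95, and 97. Only one of them exceeds the Critical threshold, but all three exceed the Error threshold, so the sketch reports an Error alert.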



Acknowledge all alerts

Alerts can be acknowledged individually or in bulk using the Acknowledge All Alerts button. You can choose to acknowledge all the alerts or acknowledge a subset of the alerts using filters.

About this task

You can use the Monitor > Events > Alerts tab to acknowledge alerts.

Procedure

1. To acknowledge all alerts, click the Acknowledge All Alerts button.

a. To acknowledge a subset of all alerts, use the table filter to filter by a combination of date and time, severity, type, or namespace, and then click Acknowledge All Alerts.

The bulk alert acknowledgment process runs in the background and may take a few minutes to complete. Only one bulk alert acknowledgment can be processed at a time.

2. On the confirmation pop-up screen, click OK to initiate acknowledgment, or click Cancel to exit without acknowledgment.

Clicking Acknowledge All Alerts initiates a background task to acknowledge all the matching alerts. The response shows whether the task was initiated successfully or failed.

To keep a record of the acknowledge all alerts request, a new informational alert of type Bulk Alert Ack is generated after the acknowledgment completes. To see it, clear the filter and manually refresh the table.

Alert messages

List of the alert messages that ECS uses.

Alert message Severity labels have the following meanings:

• Critical: Messages about conditions that require immediate attention

• Error: Messages about error conditions that report either a physical failure or a software failure

• Warning: Messages about less than optimal conditions

• Info: Routine status messages

Table 26 ECS Object alert messages

Alert Severity Symptom code Sent to... Message Description Action

Btree chunk level GC

Warning 1321 Portal, API, Secure Remote Services, SNMP Trap, Syslog

System metadata garbage reclamation throughput is too slow to catch up with garbage detection.

Event trigger source
Example: Reclaimed Btree Garbage is less than 10% of the remaining BTree garbage as BTree GC is slow at Chunk reclamation.

This condition has persisted for last 7 days, leading to creation of this alert.

Derived from formula: Full_Garbage > 1TB, and Garbage_Detected_Rate - Garbage_Chunk_Reclaim_Rate > 100GB

Contact ECS Remote Support.

Btree disk level GC

Capacity free-up throughput is too slow to catch up with system metadata garbage reclamation.

Event trigger source
Example: Reclaimed Btree Garbage is less than 10% of the Full garbage, as BTree GC is slow at disk level reclamation.

This condition has persisted for last 7 days, leading to creation of this alert.

Derived from formula: if Garbage_Pending_Delete > 1TB, and Garbage_Chunk_Reclaim_Rate - Garbage_Capacity_Reclaim_Rate > 100GB

Contact ECS Remote Support.

Btree partial GC

Partial GC for system metadata is too slow.

Event trigger source
Example: Rate of Btree Partial GC conversion to full Garbage is less than 10% of the Partial GC eligible for Conversion.

Btree partial GC works too slow to convert partial garbage into full garbage.

Derived from formula: If Partial_Eligible_Garbage > 1TB, and Partial_To_Full_Convert_Rate < 100GB

Bucket hard quota

Error 1006 Portal, API, SNMP Trap, Syslog

HardQuotaLimitExceeded: bucket {bucket_name}

Bucket soft quota

Warning 1008 Portal, API, SNMP Trap, Syslog

SoftQuotaLimitExceeded: bucket {bucket_name}

Capacity alerting

Warning, Error, Critical

1111, 1112, 1113

Portal, API, SNMP Trap, Syslog

Storage pool {Storage pool} has {id}% remaining capacity meeting threshold of {id}%.

The severity of the alert depends on how close the remaining storage pool capacity is to reaching the configured threshold. Capacity alerting is not set by default: set capacity alerts to receive them. You can set them by editing an existing storage pool or when you create a storage pool.

Capacity exceeded threshold

Used Capacity of the VDC exceeded configured threshold, current usage is {usage}%.

The configured threshold is set at 80% of the Used Capacity of the VDC by default.

CAUTION If the used capacity reaches 90%, you cannot write or modify object data.

Contact ECS Remote Support representative to determine the appropriate solution.

Capacity license threshold

Error 997 Portal, API, Secure Remote Services, Trap, Syslog

Licensed Capacity Entitlement Exceeded Event

The capacity of the system is greater than was licensed.

Chunk not found

Error 1004 Portal, API, Secure Remote Services, SNMP Trap, Syslog

chunkId {chunkId} not found






CPU Usage Percent

Warning, Error, Critical

4001, 4002, 4003

CPU usage is ${inspectorValue}% crosses threshold ${thresholdValue}%

If CPU usage percent crosses the threshold specified then the alert is triggered.

Disabled CAS GC

Info, Warning, Error, Critical

1316, 1317, 1318, 1319

Portal, API, Secure Remote Services, SNMP Trap, Syslog

CAS Processing is paused.

CAS GC is Content Addressable Storage Garbage Collection.

CAS GC is disabled.

Contact ECS Remote Support to ensure that it should stay enabled.

DT init failure

There are more than {numbers} DTs failed or DT stats check failed in last {number} rounds of DT status check.

DT is a directory table. The default value is set at 8 DTs for this alert to trigger.

EKM Server Certificate Expiry

Warning, Error

1361, 1362

Portal, API, Secure Remote Services, SNMP Trap, Syslog

The server certificate for EKM server expires in 30 days. Renew the certificate.

The server certificate for EKM server expires in 7 days. Renew the certificate.

EKM Server Connection Status

Warning, Error

1369, 1370

The EKM server is not responding. Ensure that the server is connected.

First Byte Latency For Read

Warning, Error, Critical

4009, 4010, 4011

First Byte Latency for Read is ${inspectorValue}ms crosses threshold ${thresholdValue}ms

If TTFB for read latency crosses the threshold specified then the alert is triggered.






Last Byte Latency For Write

Warning, Error, Critical

4003, 4014, 4015

Last Byte Latency for Write is ${inspectorValue}ms crosses threshold ${thresholdValue}ms

If TTLB for write latency crosses the threshold specified then the alert is triggered.

License expiration

Info 998 Portal, API, Secure Remote Services, SNMP Trap, Syslog

Expiration event

License registration

Info 100 Portal, API, Secure Remote Services, SNMP Trap, Syslog

Registration Event

Memory outside Btree writes cache

For cm process memory of X bytes is allocated outside Btree write cache on node <Node IP>.

Metering read latency

Warning, Error, Critical

1205, 1206, 1207

Read latency is 300 millisecond, crosses threshold 250 millisecond.

Metering write latency

Warning, Error, Critical

1205, 1206, 1207

Write latency is 300 millisecond, crosses threshold 250 millisecond.









Monitoring Health

Critical

4016, 4017, 4018

Data recorded in TSDB is lagging by {thresholdValue} mins on node x.x.x.x

Namespace hard quota

HardQuotaLimitExceeded: Namespace {namespace}

Namespace soft quota

SoftQuotaLimitExceeded: Namespace {namespace}

Notification Any Any

User-defined message.

Custom message that is defined and provided by the user.

Process memory table free space percent

Memory table size for blob process is X % less than the specified threshold of Y % on <node IP>.

Repo chunk level GC

User garbage collection throughput is too slow to catch up with garbage detection.

Event trigger source
Example: Repo Chunk reclamation rate is less than 10% of the remaining garbage.

Derived from formula: Full_Garbage > 10TB, and Garbage_Detected_Rate - Garbage_Chunk_Reclaim_Rate > 100GB


Repo disk level GC

Warning 1337 Portal, API, Secure Remote Services, SNMP Trap, Syslog

Capacity free-up throughput is too slow to catch up with user garbage collection.

Event trigger source
Example: Repo disk level GC reclamation rate is less than 10% of Garbage pending delete at disk level.

Derived from formula: If Garbage_Pending_Delete > 10TB, and Garbage_Chunk_Reclaim_Rate - Garbage_Capacity_Reclaim_Rate > 100GB

Contact ECS Remote Support.

Repo partial GC

Partial GC for user garbage is too slow.

Event trigger source
Example: Repo partial GC works too slow to convert partial garbage into full garbage.

Derived from formula: If Partial_Eligible_Garbage > 10TB, and Partial_To_Full_Convert_Rate < 100GB

RPO

Warning 1012 Portal, API, Secure Remote Services, Trap, Syslog

RPO for replication group {RG} is {HH} hour {SS} seconds greater than {HH} hour threshold set.

The recovery point objective (RPO) is greater than the RPO threshold. The default value is one hour.

Slow CAS GC Object Cleanup

Info, Warning, Error, Critical

1312, 1313, 1314, 1315

CAS Processing object cleanup speed is slow.

CAS GC cleanup tasks are lagging.






Slow CAS GC Reference Collection

Info, Warning, Error, Critical

1308, 1309, 1310, 1311

CAS Processing reference collection speed is slow.

CAS GC reference collection tasks are lagging.

Slow Journal Parsing

Info, Warning, Error, Critical

1304, 1305, 1306, 1307

Journal parsing speed is slow.

Journal parsing speed is slow.

Space Usage Percent

Warning, Error, Critical

4005, 4006, 4007

Portal, API, SNMP Trap, Syslog

Disk space usage is ${inspectorValue}% crosses threshold ${thresholdValue}%

If Disk usage percent crosses the threshold specified then the alert is triggered.

GC Status

Warning 1345 Portal, API, Secure Remote Services, SNMP Trap, Syslog

Space reclamation for user data/system metadata is disabled.

Make sure it is disabled for temporary purpose, and re-enable it when ready.

VDC in TSO

Critical 1007 Portal, API, SNMP Trap, Syslog

Site {vdc} is marked as temporarily unavailable.

TSO is a temporary site outage.
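
The "Derived from formula" entries in Table 26 are simple threshold tests, and they can be read as the predicates sketched below. Several details are assumptions made for illustration: sizes and rates are expressed in bytes using decimal units, the period over which the rates are measured is not stated in the table, and the requirement that the condition persist for 7 days is not modeled. The function names are hypothetical.

```python
TB = 10**12  # assumption: decimal terabyte, in bytes
GB = 10**9   # assumption: decimal gigabyte, in bytes


def backlog_outpaces_reclaim(backlog, inflow_rate, outflow_rate, backlog_threshold):
    """Generic form of the chunk level and disk level GC formulas.

    True when the backlog exceeds its threshold and garbage arrives more than
    100GB faster than it is reclaimed.
    """
    return backlog > backlog_threshold and (inflow_rate - outflow_rate) > 100 * GB


def partial_conversion_too_slow(partial_eligible_garbage, convert_rate, eligible_threshold):
    """Form of the partial GC formulas: large eligible backlog, low conversion rate."""
    return partial_eligible_garbage > eligible_threshold and convert_rate < 100 * GB


# Btree chunk level GC: Full_Garbage > 1TB and
# Garbage_Detected_Rate - Garbage_Chunk_Reclaim_Rate > 100GB
btree_chunk = backlog_outpaces_reclaim(2 * TB, 500 * GB, 350 * GB, backlog_threshold=1 * TB)

# Repo chunk level GC uses the same test with a 10TB backlog threshold.
repo_chunk = backlog_outpaces_reclaim(2 * TB, 500 * GB, 350 * GB, backlog_threshold=10 * TB)

# Btree partial GC: Partial_Eligible_Garbage > 1TB and
# Partial_To_Full_Convert_Rate < 100GB
btree_partial = partial_conversion_too_slow(3 * TB, 40 * GB, eligible_threshold=1 * TB)

print(btree_chunk, repo_chunk, btree_partial)
```

With the same sample figures, the Btree test is met while the Repo test is not, because the Repo formulas use a 10TB backlog threshold instead of 1TB.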

Table 27 ECS fabric alert messages



Disk added Info 2019 Portal, API, SNMP Trap, Syslog

Disk {diskSerialNumber} on node {fqdn} was added.

Disk was added.

Disk failure Critical 2002 Portal, API, SNMP Trap, Syslog, Secure Remote Services

Disk {diskSerialNumber} on node {fqdn} has failed.

Health of the disk changed to BAD.

Disk good Info 2025 Portal, API, SNMP Trap, Syslog

Disk {diskSerialNumber} on node {fqdn} was revived.

Disk was revived.

Disk mounted

Info 2035 Portal, API, SNMP Trap, Syslog

Disk {diskSerialNumber} on node {fqdn} has mounted.

Disk was mounted.

Disk removed

Disk {diskSerialNumber} on node {fqdn} was removed.

Disk was removed.

Disk suspect

Error 2003 Portal, API, SNMP Trap, Syslog, Secure Remote Services

Disk {diskSerialNumber} on node {fqdn} has suspected.

Health of the disk changed to SUSPECT.

Disk unmounted

Disk {diskSerialNumber} on node {fqdn} has unmounted.

Disk was unmounted.

Docker container configuration failure

Critical 2022 Portal, API, SNMP Trap, Syslog, Secure Remote Services

Container {containerName} configuration has failed on node {fqdn} with exit code {exitCode} {happenedOn}.

Configure script returned nonzero exit code. The configure script is provided by object and called by fabric on object container start-up. It is only applicable for the object container.

Docker container paused

Container {containerName} has paused on node {fqdn}.

Container paused

Docker container running

Container {containerName} is up on node {fqdn}.

Container moved to running state.






Dockercontainerstopped


Container{containerName}has stopped onnode {fqdn}.

Container stopped

Eventscannot bedelivered.


Events cannot bedelivered through{SMTP|ESRS}and lost.

Verify configuration ofthe channel for whichthe alert is.

Firewall health is BAD or SUSPECT

BAD

SUSPECT

2051

2052


Firewall health is BAD! {reason}

Firewall health is SUSPECT! {reason}

Rules or ip sets do not exist, system firewall is off, ip tables or ip set utils do not exist

Rules or ip sets do not exist, trying to recover

Fabric agent failure Critical 2013 Portal, API, SNMP Trap, Syslog

FabricAgent has failed on node {fqdn}.

Fabric agent health is bad.

Fabric agent suspect

FabricAgent has suspected on node {fqdn}.

Fabric agent health is suspect.

Net interface health down

Net interface {$netInterfaceName}[ on node $FQDN] is down[ with IP address $IP].

Fabric's net interface is down.

Net interface health up Info 2024 Portal, API, SNMP Trap, Syslog, Secure Remote Services

Net interface {$netInterfaceName}[ on node $FQDN] is up[ with IP address $IP].

Fabric's net interface is up.

Net interface permanent down Critical 2026 Portal, API, Secure Remote Services

Net interface {$netInterfaceName}[ on node $FQDN] is permanently down[ with IP address $IP].

Net interface is down for at least 10 minutes.

Net interface IP address updated

Net interface's {netInterfaceName} IP address on node {fqdn} was changed to {newIpAddress}.

Fabric's net interface IP address changed.

Node failure Critical 2006 Portal, API, SNMP Trap, Syslog, Secure Remote Services

Node {fqdn} has failed.

Node is not reachable for 30 minutes.

Node suspect Error 2007 Portal, API, SNMP Trap, Syslog, Secure Remote Services

Node {fqdn} has suspected.

Node is not reachable for 15 minutes.

Node up Info 2018 Portal, API, SNMP Trap, Syslog

Node {fqdn} is up.

Node moved to 'up' state after it was down for at least 15 minutes.

Root filesystem filling on node

WARNING

CRITICAL

2039

2042

Portal, API, SNMP Trap, Syslog, Secure Remote Services

Thresholds exceeded, usable space on root fs <BYTES> are less than threshold for <LEVEL> level on node <NODE>

Free space between 15G and 10G triggers a Warning alert.

Less than 10G of free space results in a Critical alert.

Slot permanent down

Container {containerName} is permanently down on node {fqdn}.

Container stopped/paused or not started at all for at least 10 minutes.

Service failure Critical 2011 Portal, API, Syslog, Secure Remote Services

Service Health Failure Event

Service failed.

Service suspect Error 2012 Portal, API, Syslog, Secure Remote Services

Service Health Suspect event

Service health is suspect.



Table 28 Secure Remote Services alert messages


Sent to... Description

TestDialHome N/A TestDialHome Secure Remote Services

Tests that Secure Remote Services connections can be established and that the call home functionality works.



CHAPTER 4

Advanced Monitoring

l Advanced Monitoring

l Flux API

l Dashboard APIs to be deprecated or changed in the next release


Advanced Monitoring

Advanced Monitoring dashboards provide critical information about the ECS processes on the VDC you are logged in to. The dashboards are based on a time series database and are provided by Grafana, a well-known open-source time series analytics platform.

For basic details on navigating Grafana dashboards, refer to the Grafana documentation at https://grafana.com/docs/guides/getting_started/.

l View Advanced Monitoring Dashboards

l Share Advanced Monitoring Dashboards

View Advanced Monitoring Dashboards

To view the advanced monitoring dashboards in the ECS Portal, select Advanced Monitoring. The Data Access Performance - Overview dashboard is the default.

Table 29 Advanced monitoring dashboards

Dashboard Description

Data Access Performance - Overview

You can use the Data Access Performance - Overview dashboard to monitor VDC data.

Data Access Performance - by Namespaces

You can use the Data Access Performance - by Namespaces dashboard to monitor performance data for an individual namespace or a group of namespaces.

Data Access Performance - by Nodes

You can use the Data Access Performance - by Nodes dashboard to see performance data for an individual node or a group of nodes in a VDC.

Data Access Performance - by Protocols

You can use the Data Access Performance - by Protocols dashboard to see performance data for each supported protocol (S3, ATMOS, SWIFT, etc.) or a set of protocols.

Table 30 Advanced monitoring dashboard fields

Dashboard Field Description

All Related dashboards

Allows you to switch to other dashboards in the access performance group, with the selected time.

All Transaction Summary

Lists the total Successful requests, System Failures, User Failures, and Failure % Rate for the selected VDCs, namespaces, nodes, or protocols.

All Successful requests

The number of data requests that were successfully completed.

All System Failures

The number of data requests that failed due to hardware or service errors (typically an HTTP error code of 5xx).

All User Failures

The number of data requests from all object heads that are classified as user failures. User failures are known error types originating from the object heads (typically an HTTP error code of 4xx).

All Failure % Rate

The percentage of failures for the VDC, namespace, nodes, or protocols.

All TPS (success/failure)

Rate of successful requests and failures per second.

Data Access Performance - Overview

Data Access Performance - by Nodes

Data Access Performance - by Protocols

Bandwidth (read/write)

Data access bandwidth of successful requests per second.

All Failed Requests/s by error type (user/system)

Rate of failed requests per second, split by error type (user/system).

Latency Latency of read/write requests.

Successful request drill down

Displays the rate of successful requests per second, by method, node, and protocol.

Data Access Performance - by Nodes

Data Access Performance - Overview

Successful Requests/s by Method

Rate of successful requests per second, by method.

All Successful Requests/s by Node

Rate of successful requests per second, by node.

Successful Requests/s by Protocol

Rate of successful requests per second, by protocol.

Failures drill down

Displays the rate of failed requests per second, by method, node, and protocol.

Failed Requests/s by Method

Rate of failed requests per second, by method.

All Failed Requests/s by Node

Rate of failed requests per second, by node.

Failed Requests/s by Protocol

Rate of failed requests per second, by protocol.

Failed Requests/s by error code

Rate of failed requests per second, by error code.

Data Access Performance - by Nodes

Data Access Performance - by Namespaces

Compare TPS of successful requests

Select multiple nodes and compare rates of successful requests per second.

Data Access Performance - by Namespaces

Compare TPS of failed requests

Select multiple nodes and compare rates of failed requests per second, by error type (user/system).

Data Access Performance - by Nodes

Data Access Performance - by Protocols

Compare read bandwidth

Select multiple nodes and compare data access bandwidth (read) of successful requests per second.

Compare write bandwidth

Select multiple nodes and compare data access bandwidth (write) of successful requests per second.

Compare read latency

Select multiple nodes and compare latency of read requests.

Compare write latency

Select multiple nodes and compare latency of write requests.

Data Access Performance - by Nodes

Compare rate of failed requests/s

Select multiple nodes and compare rates of failed requests per second, split by error type (user/system).

Data Access Performance - by Namespaces

Request drill down by nodes

Rate of requests per second, split by node.

View mode

Procedure

1. To view a dashboard in view mode, click the title of the dashboard, for example TPS (success/failure), and then click View.

The dashboard opens in view (full-screen) mode.

2. Click the Back to dashboard icon to return to the dashboards view.

Export CSV

Procedure

1. To export the dashboard data to .csv format, click the title of the dashboard, for example TPS (success/failure), and then click More > Export CSV.

The Export CSV window opens.

You can customize the CSV output by modifying the Mode and Date Time Format settings and by selecting or clearing the Excel CSV Dialect option.

2. Click Export > Save to save the dashboard data in .csv format to your local storage.

View Advanced Monitoring Dashboards - Overview

The Data Access Performance - Overview dashboard is the default.

In the Data Access Performance - Overview dashboard, you can monitor the following for all nodes in the VDC:

l TPS (success/failure)

l Bandwidth (read/write)

l Failed Requests/s by error type (user/system)

l Latency

l Successful Requests/s by Method

l Successful Requests/s by Node

l Successful Requests/s by Protocol

l Failed Requests/s by Method

l Failed Requests/s by Node

l Failed Requests/s by Protocol

l Failed Requests/s by error code

To view the Data Access Performance - Overview dashboard in the ECS Portal, select Advanced Monitoring.

Click Successful requests drill down to see the successful requests by method, node, and protocol.

Click Failures drill down to see the failed requests by method, node, protocol, and error code.

Click Related dashboards to view the other dashboards, with the selected time.

View Advanced Monitoring Dashboards - by Namespaces

In the Data Access Performance - by Namespaces dashboard, you can monitor the following for namespaces:





l Compare TPS of successful requests

l Compare TPS of failed requests

To view the Data Access Performance - by Namespaces dashboard in the ECS Portal, select Advanced Monitoring > Related dashboards > Data Access Performance - by Namespaces.

Data for all namespaces is visible in the default view. To select a namespace, click the legend parameter for the namespace below the graph.

Requests drill down by nodes shows the successful and failed requests by node.

Compare: select multiple namespaces compares the TPS of successful and failed requests.

View Advanced Monitoring Dashboards - by Nodes

In the Data Access Performance - by Nodes dashboard, you can monitor the following for nodes in a VDC:




l Latency

l Successful Requests/s by Method


l Successful Requests/s by Protocol

l Failed Requests/s by Method


l Failed Requests/s by Protocol

l Failed Requests/s by error code



l Compare read bandwidth

l Compare write bandwidth

l Compare read latency

l Compare write latency

To view the Data Access Performance - by Nodes dashboard in the ECS Portal, select Advanced Monitoring > Related dashboards > Data Access Performance - by Nodes.

Data for all nodes is visible in the default view. To select data for a node, click the legend parameter for the node below the graph.

Successful requests drill down shows the successful requests by method, node, and protocol.

Failures drill down shows the failed requests by method, node, protocol, and error code.

Compare: select multiple namespaces compares the TPS of successful and failed requests, read/write bandwidth, and read/write latency.

View Advanced Monitoring Dashboards - by Protocols

In the Data Access Performance - by Protocols dashboard, you can monitor the following for each protocol:




l Latency





l Compare read bandwidth

l Compare write bandwidth

l Compare read latency

l Compare write latency

To view the Data Access Performance - by Protocols dashboard in the ECS Portal, select Advanced Monitoring > Related dashboards > Data Access Performance - by Protocols.

Data for all protocols is visible in the default view. To select data for a protocol, click the legend parameter for the protocol below the graph.

Requests drill down by nodes shows the successful and failed requests by node.

Compare: select multiple namespaces compares the TPS of successful and failed requests, read/write bandwidth, and read/write latency.

Share Advanced Monitoring Dashboards

The Share dashboard icon enables you to create a direct link to the dashboard or panel, share a snapshot of an interactive dashboard publicly, and export the dashboard to a JSON file.

For procedures on sharing the dashboard link, dashboard snapshot, and dashboard as a JSON file, refer to the Grafana documentation at https://grafana.com/docs/v5.3/reference/sharing/#share-dashboard.

Flux API

The Flux API enables you to retrieve time series database data by sending REST queries using curl. You can get raw data from the fluxd service in a way similar to using the Dashboard API. You must first get a token, and then provide the token in the requests.

Procedure

1. Use curl https://<ip>:4443/login -k -u "user:passwd" -v to get the security token.


2. Enter the token in the request header.

l The query is provided in the request body.

l Nginx validates the token in authsvc and sends a proxy request to fluxd on the local host.

l Fluxd handles the query and returns the results in JSON or CSV format.

Example of Flux API request:

curl -k -H "X-SDS-AUTH-TOKEN: xxxx" -XPOST \
  --data-urlencode 'query=from(bucket: "monitoring_main") |> filter(fn: (r) => r._measurement == "statDataHead_performance_internal_transactions") |> range(start: -30m)' \
  'https://10.249.230.55:4443/dashboard/v2/query'
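The two steps of the procedure can also be sketched in Python with only the standard library. This is a minimal sketch rather than an ECS client: it builds the login request and the query request but does not send them. The helper names build_login_request and build_flux_request, and the host and credentials shown, are illustrative assumptions.

```python
import base64
import urllib.parse
import urllib.request


def build_login_request(host, user, password):
    """Build the GET /login request that returns the security token."""
    credentials = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        f"https://{host}:4443/login",
        headers={"Authorization": f"Basic {credentials}"},
        method="GET",
    )


def build_flux_request(host, token, flux_query):
    """Build the POST request that carries the Flux query in its body."""
    body = urllib.parse.urlencode({"query": flux_query}).encode()
    return urllib.request.Request(
        f"https://{host}:4443/dashboard/v2/query",
        data=body,
        headers={"X-SDS-AUTH-TOKEN": token},
        method="POST",
    )


QUERY = (
    'from(bucket: "monitoring_main") '
    "|> filter(fn: (r) => r._measurement == "
    '"statDataHead_performance_internal_transactions") '
    "|> range(start: -30m)"
)

# Placeholder host, credentials, and token (assumptions for illustration).
login = build_login_request("10.249.230.55", "user", "passwd")
query = build_flux_request("10.249.230.55", "xxxx", QUERY)
print(login.get_method(), login.full_url)
print(query.get_method(), query.full_url)
```

To send either request, pass it to urllib.request.urlopen. The curl -k option skips certificate verification, so an equivalent call would need an SSL context configured the same way, which is only advisable in a lab environment.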

Example of Dashboard API output:

{
  "_links": {
    "self": { "href": "/dashboard/zones/localzone" },
    "storagepools": { "href": "/dashboard/zones/localzone/storagepools" },
    "nodes": { "href": "/dashboard/zones/localzone/nodes" },
    "replicationgroups": { "href": "/dashboard/zones/localzone/replicationgroups" },
    "rglinksFailed": { "href": "/dashboard/zones/localzone/rglinksFailed" },
    "rglinksBootstrap": { "href": "/dashboard/zones/localzone/rglinksBootstrap" }
  },
  "apiChange": "1",
  "name": "vdc1",
  "numNodes": 16,
  ...
  "nodeCpuUtilizationAvg": [
    { "t": "12345678", "Percent": 10 },
    { "t": "23435455", "Percent": 43 },
    { "t": "55433455", "Percent": 39 }
  ],
  ...
  "diskSpaceAllocatedCurrent": [
    { "t": "12345678", "Bytes": 10000 },
    { "t": "23456789", "Bytes": 12000 }
  ]
}

Example of Flux API JSON output:

{ [
  {"table":"0", "_start":"2019-03-06T10:30:00Z", "_stop":"2019-03-07T11:15:00Z", "_time":"2019-03-06T10:30:00Z", "_value":95.30918181366027, "_field":"usage_idle", "_measurement":"cpu", "cpu":"cpu-total", "host":"layton-ivory.ecs.lab.emc.com", "node_id":"1f2815b3-b340-45ce-b863-de8f46e8b691", "tag":"system"},
  {"table":"0", "_start":"2019-03-06T10:30:00Z", "_stop":"2019-03-07T11:15:00Z", "_time":"2019-03-06T10:35:00Z", "_value":95.52097124715358, "_field":"usage_idle", "_measurement":"cpu", "cpu":"cpu-total", "host":"layton-ivory.ecs.lab.emc.com", "node_id":"1f2815b3-b340-45ce-b863-de8f46e8b691", "tag":"system"},
  {"table":"1", "_start":"2019-03-06T10:30:00Z", "_stop":"2019-03-07T11:15:00Z", "_time":"2019-03-06T10:30:00Z", "_value":85.41386518615308, "_field":"usage_idle", "_measurement":"cpu", "cpu":"cpu-total", "host":"lehi-ivory.ecs.lab.emc.com", "node_id":"48e607ef-2e81-4f8b-b9e2-b61b45ef2240", "tag":"system"},
  {"table":"1", "_start":"2019-03-06T10:30:00Z", "_stop":"2019-03-07T11:15:00Z", "_time":"2019-03-06T10:35:00Z", "_value":67.13489651306735, "_field":"usage_idle", "_measurement":"cpu", "cpu":"cpu-total", "host":"lehi-ivory.ecs.lab.emc.com", "node_id":"48e607ef-2e81-4f8b-b9e2-b61b45ef2240", "tag":"system"}
]}

Example of Flux API CSV output:

#datatype,string,long,dateTime:RFC3339,dateTime:RFC3339,dateTime:RFC3339,double,string,string,string,string,string,string
#group,false,false,true,true,false,false,true,true,true,true,true,true
#default,_result,,,,,,,,,,,
,result,table,_start,_stop,_time,_value,_field,_measurement,cpu,host,node_id,tag
,,0,2019-03-06T10:30:00Z,2019-03-07T11:15:00Z,2019-03-06T10:30:00Z,95.30918181366027,usage_idle,cpu,cpu-total,dallas-straw,1f2815b3-b340-45ce-b863-de8f46e8b691,system
,,0,2019-03-06T10:30:00Z,2019-03-07T11:15:00Z,2019-03-06T10:35:00Z,95.52097124715358,usage_idle,cpu,cpu-total,dallas-straw,1f2815b3-b340-45ce-b863-de8f46e8b691,system
,,0,2019-03-06T10:30:00Z,2019-03-07T11:15:00Z,2019-03-06T10:40:00Z,85.41386518615308,usage_idle,cpu,cpu-total,dallas-straw,48e607ef-2e81-4f8b-b9e2-b61b45ef2240,system
,,0,2019-03-06T10:30:00Z,2019-03-07T11:15:00Z,2019-03-06T10:45:00Z,67.13489651306735,usage_idle,cpu,cpu-total,dallas-straw,48e607ef-2e81-4f8b-b9e2-b61b45ef2240,system
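A client that receives the CSV form needs to separate the annotation rows (those starting with #) from the header and data rows. The following is a minimal sketch of one way to do that with the Python csv module. It assumes that the response has one record per line, as is usual for CSV. The function name parse_flux_csv is hypothetical, and the sample reuses the first two data rows from the example above.

```python
import csv
import io

SAMPLE = """\
#datatype,string,long,dateTime:RFC3339,dateTime:RFC3339,dateTime:RFC3339,double,string,string,string,string,string,string
#group,false,false,true,true,false,false,true,true,true,true,true,true
#default,_result,,,,,,,,,,,
,result,table,_start,_stop,_time,_value,_field,_measurement,cpu,host,node_id,tag
,,0,2019-03-06T10:30:00Z,2019-03-07T11:15:00Z,2019-03-06T10:30:00Z,95.30918181366027,usage_idle,cpu,cpu-total,dallas-straw,1f2815b3-b340-45ce-b863-de8f46e8b691,system
,,0,2019-03-06T10:30:00Z,2019-03-07T11:15:00Z,2019-03-06T10:35:00Z,95.52097124715358,usage_idle,cpu,cpu-total,dallas-straw,1f2815b3-b340-45ce-b863-de8f46e8b691,system
"""


def parse_flux_csv(text):
    """Return the data rows as dicts, skipping '#' annotation rows."""
    lines = [ln for ln in text.splitlines() if ln and not ln.startswith("#")]
    reader = csv.DictReader(io.StringIO("\n".join(lines)))
    rows = []
    for row in reader:
        row.pop("", None)  # drop the empty leading annotation column
        row["_value"] = float(row["_value"])
        rows.append(row)
    return rows


rows = parse_flux_csv(SAMPLE)
for row in rows:
    print(row["_time"], row["host"], row["_value"])
```

The sketch converts only _value to a number. A fuller client could use the #datatype row to convert every column.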

Note: The Flux query language that is supported by the Flux API is a subset of the operations that are supported by the InfluxDB Flux language v0.12. For more details, refer to Get started with Flux at https://docs.influxdata.com/flux/v0.12/introduction/getting-started.

Functions that are enabled for queries through the Flux API:

l from https://docs.influxdata.com/flux/v0.12/functions/inputs/from/

l filter https://docs.influxdata.com/flux/v0.12/functions/transformations/filter/

l range https://docs.influxdata.com/flux/v0.12/functions/transformations/range/

l last https://docs.influxdata.com/flux/v0.12/functions/transformations/selectors/last/

l first https://docs.influxdata.com/flux/v0.12/functions/transformations/selectors/first/

l limit https://docs.influxdata.com/flux/v0.12/functions/transformations/limit/

l drop https://docs.influxdata.com/flux/v0.12/functions/transformations/drop/

l keep https://docs.influxdata.com/flux/v0.12/functions/transformations/keep/


Example of Flux API query:

from(bucket: "monitoring_main")
  |> filter(fn: (r) => r._measurement == "statDataHead_performance_internal_transactions")
  |> range(start: -30m)
  |> keep(columns: ["_time", "_value", "host"])
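Because only a subset of Flux functions is enabled, it can help to check a query before sending it. The sketch below splits a query on the pipe-forward operator and reports any function that is not in the list given in this guide. The helper names are hypothetical, and the check is deliberately simple: it would misread a |> that appears inside a string literal.

```python
import re

# Functions listed in this guide as enabled for the Flux API.
SUPPORTED = {"from", "filter", "range", "last", "first", "limit", "drop", "keep"}


def pipeline_functions(query):
    """Return the name of the function called at each pipeline stage."""
    names = []
    for stage in query.split("|>"):
        stage = stage.strip()
        match = re.match(r"([A-Za-z_][A-Za-z0-9_]*)\s*\(", stage)
        if match is None:
            raise ValueError(f"not a function call: {stage!r}")
        names.append(match.group(1))
    return names


def unsupported_functions(query):
    """Return the pipeline functions that are outside the supported subset."""
    return [name for name in pipeline_functions(query) if name not in SUPPORTED]


QUERY = (
    'from(bucket: "monitoring_main") '
    "|> filter(fn: (r) => r._measurement == "
    '"statDataHead_performance_internal_transactions") '
    "|> range(start: -30m) "
    '|> keep(columns: ["_time", "_value", "host"])'
)

print(pipeline_functions(QUERY))
print(unsupported_functions(QUERY + " |> mean()"))
```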

List of metrics for performance-related data

Table 31 Metrics for performance-related data

Tag Field reference

host Name of data node

node_id ID of data node

tag Internal, set to dashboard

process Internal, set to statDataHead

head Type of protocol

namespace Name of namespace

method Protocol-specific request method (GET, POST, READ, WRITE)

Note: When a measurement has tags, it is possible to query the TSDB for a subset of the data. For example, measurements with the head tag provide information for each protocol independently.
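As an illustration of querying a subset by tag, the sketch below builds a Flux filter that matches one measurement and exact tag values. The helper names are hypothetical, and the tag value "S3" is an assumption about how the head tag is stored. Values are inserted without escaping, so the sketch is suitable only for trusted input.

```python
def tag_filter(measurement, tags):
    """Build a Flux filter() stage for one measurement and exact tag values."""
    conditions = [f'r._measurement == "{measurement}"']
    for name, value in sorted(tags.items()):
        conditions.append(f'r.{name} == "{value}"')
    return "filter(fn: (r) => " + " and ".join(conditions) + ")"


def tagged_query(bucket, measurement, tags, start="-30m"):
    """Build a complete query that uses only from, filter, and range."""
    return (
        f'from(bucket: "{bucket}") '
        f"|> {tag_filter(measurement, tags)} "
        f"|> range(start: {start})"
    )


# "S3" is an assumed tag value, used here only for illustration.
print(
    tagged_query(
        "monitoring_main",
        "statDataHead_performance_internal_transactions_head",
        {"head": "S3"},
    )
)
```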

Database monitoring_main

Performance metrics in this database are raw. Each metric is split by data node, and all measurements have the host and node_id tags.

Most integer fields are increasing counters, that is, values that increase over time. Increasing counters restart from zero after the data head service restarts.
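A consumer of these raw counters usually wants a rate, and has to cope with the restart from zero. The following is a minimal sketch of one common approach, not the method used by the ECS dashboards: a sample lower than its predecessor is treated as a restart, and the new value is taken as the increase since the restart. The function name and the sample values are hypothetical.

```python
def counter_rates(samples):
    """Convert (timestamp_seconds, counter_value) samples to per-second rates.

    A sample lower than the previous one is treated as a counter restart,
    so the new value itself is taken as the increase since the restart.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0:
            continue  # skip duplicate or out-of-order timestamps
        increase = v1 - v0 if v1 >= v0 else v1
        rates.append((t1, increase / elapsed))
    return rates


# Hypothetical samples: the counter restarts between t=120 and t=180.
samples = [(0, 100), (60, 160), (120, 220), (180, 30), (240, 90)]
print(counter_rates(samples))
# prints [(60, 1.0), (120, 1.0), (180, 0.5), (240, 1.0)]
```

Requests counted between the last sample before the restart and the restart itself are lost, so the rate for that interval is a lower bound.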

Measurement: statDataHead_performance_internal_error
Tags: host, node_id, process, tag
Fields: system_errors (integer), user_errors (integer)

Measurement: statDataHead_performance_internal_error_code
Tags: code, host, node_id, process, tag
Fields: error_counter (integer)

Measurement: statDataHead_performance_internal_error_head
Tags: head, host, node_id, process, tag
Fields: system_errors (integer), user_errors (integer)

Measurement: statDataHead_performance_internal_error_head_namespace
Tags: head, host, namespace, node_id, process, tag
Fields: system_errors (integer), user_errors (integer)

Measurement: statDataHead_performance_internal_latency
Tags: host, id, node_id, process, tag
Fields: +Inf (integer), 0.0 (integer), 1.0 (integer), 111.6295328521717 (integer), 12461.15260479408 (integer), 23.183877401213103 (integer), 2588.0054039994393 (integer), 4.814963904455889 (integer), 537.4921713544796 (integer), 59999.999999999985 (integer)

Measurement: statDataHead_performance_internal_latency_head
Tags: head, host, id, node_id, process, tag
Fields: +Inf (integer), 0.0 (integer), 1.0 (integer), 111.6295328521717 (integer), 12461.15260479408 (integer), 23.183877401213103 (integer), 2588.0054039994393 (integer), 4.814963904455889 (integer), 537.4921713544796 (integer), 59999.999999999985 (integer)

Measurement: statDataHead_performance_internal_throughput
Tags: host, node_id, process, tag
Fields: total_read_requests_size (integer), total_write_requests_size (integer)

Measurement: statDataHead_performance_internal_throughput_head
Tags: head, host, node_id, process, tag
Fields: total_read_requests_size (integer), total_write_requests_size (integer)

Measurement: statDataHead_performance_internal_transactions
Tags: host, node_id, process, tag
Fields: failed_request_counter (integer), succeed_request_counter (integer)

Measurement: statDataHead_performance_internal_transactions_head
Tags: head, host, node_id, process, tag
Fields: failed_request_counter (integer), succeed_request_counter (integer)

Measurement: statDataHead_performance_internal_transactions_head_namespace
Tags: head, host, namespace, node_id, process, tag
Fields: failed_request_counter (integer), succeed_request_counter (integer)

Measurement: statDataHead_performance_internal_transactions_method
Tags: host, method, node_id, process, tag
Fields: failed_request_counter (integer), succeed_request_counter (integer)

Database monitoring_vdc

Performance metrics in this database are values calculated over the whole VDC, without reference to a particular data node.

Most of the values are:

l Rates (number of requests per second) - for all measurements whose names do not end with _delta

l Delta values (increase of a counter from the previous timestamp) - for all measurements whose names end with _delta

Measurement: cq_performance_error
Tags: none
Fields: system_errors (float), user_errors (float)

Measurement: cq_performance_error_code
Tags: code
Fields: error_counter (float)

Measurement: cq_performance_error_delta
Tags: none
Fields: system_errors_i (integer), user_errors_i (integer)

Measurement: cq_performance_error_head
Tags: head
Fields: system_errors (float), user_errors (float)

Measurement: cq_performance_error_head_delta
Tags: head
Fields: system_errors_i (integer), user_errors_i (integer)

Measurement: cq_performance_error_ns
Tags: namespace
Fields: system_errors (float), user_errors (float)

Measurement: cq_performance_error_ns_delta
Tags: namespace
Fields: system_errors_i (integer), user_errors_i (integer)

Measurement: cq_performance_latency
Tags: id
Fields: p50 (float), p99 (float)

Measurement: cq_performance_latency_head
Tags: head, id
Fields: p50 (float), p99 (float)

Measurement: cq_performance_throughput
Tags: none
Fields: total_read_requests_size (float), total_write_requests_size (float)

Measurement: cq_performance_throughput_head
Tags: head
Fields: total_read_requests_size (float), total_write_requests_size (float)

Measurement: cq_performance_transaction
Tags: none
Fields: failed_request_counter (float), succeed_request_counter (float)

Measurement: cq_performance_transaction_delta
Tags: none
Fields: failed_request_counter_i (integer), succeed_request_counter_i (integer)

Measurement: cq_performance_transaction_head
Tags: head
Fields: failed_request_counter (float), succeed_request_counter (float)

Measurement: cq_performance_transaction_head_delta
Tags: head
Fields: failed_request_counter_i (integer), succeed_request_counter_i (integer)

Measurement: cq_performance_transaction_method
Tags: method
Fields: failed_request_counter (float), succeed_request_counter (float)

Measurement: cq_performance_transaction_ns
Tags: namespace
Fields: failed_request_counter (float), succeed_request_counter (float)

Measurement: cq_performance_transaction_ns_delta
Tags: namespace
Fields: failed_request_counter_i (integer), succeed_request_counter_i (integer)
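The delta measurements lend themselves to simple derived values. As a sketch, a failure percentage for one interval can be computed from the failed_request_counter_i and succeed_request_counter_i fields. Whether the Failure % Rate panel is computed in exactly this way is an assumption, and the sample point below is hypothetical.

```python
def failure_percent(failed, succeeded):
    """Return failed requests as a percentage of all requests."""
    total = failed + succeeded
    if total == 0:
        return 0.0  # no traffic in the interval
    return 100.0 * failed / total


# Hypothetical delta values for one interval.
point = {"failed_request_counter_i": 5, "succeed_request_counter_i": 195}
print(
    failure_percent(
        point["failed_request_counter_i"],
        point["succeed_request_counter_i"],
    )
)
# prints 2.5
```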

Dashboard APIs to be deprecated or changed in the next release

The dashboard APIs listed below will be deprecated or changed in the next major release of ECS. New or replacement APIs are listed for reference. You are advised to make any needed adjustments to your use of these APIs in anticipation of the next release.

API to be removed

The following table lists the APIs that will be removed in a future release:

Table 32 API - Remove

API Name Syntax Description

Get Process GET /dashboard/processes/{id} Gets the process instance details.

Get Node Processes GET /dashboard/nodes/{id}/processes Gets the details of processes in the node.

Get Hosted Zone GET /dashboard/zones/hostedzone Gets the hosted VDC details.

Get Zone GET /dashboard/zones/{id} Gets the hosted VDC details.

Get Hosted Zone Replication Groups GET /dashboard/zones/hostedzone/replicationgroups Gets the hosted VDC replication groups details.

API to be changed

The following table lists the APIs that will be changed in a future release:

Table 33 API - Change

Get Local Zone GET /dashboard/zones/localzone Gets the local VDC details.

Get Local Zone Nodes GET /dashboard/zones/localzone/nodes Gets the local VDC node details.

Get Node GET /dashboard/nodes/{id} Gets the node instance details.

Get Storage Pool Nodes GET /dashboard/storagepools/{id}/nodes Gets the details of nodes in the storage pool.

The following data will be removed from these APIs:

l nodeCpuUtilization*, nodeMemoryUtilizationBytes*, nodeMemoryUtilization*

l nodeNicBandwidth*, nodeNicReceivedBandwidth*, nodeNicTransmittedBandwidth*

l nodeNicUtilization*, nodeNicReceivedUtilization*, nodeNicTransmittedUtilization*

l capacityRebalanceEnabled, capacityRebalanced, capacityPendingRebalancing

l capacityRebalancedAvg, capacityRebalanceRate, capacityPendingRebalancingAvg

l transactionReadLatency, transactionWriteLatency, transactionReadBandwidth, transactionWriteBandwidth

l transactionReadTransactionsPerSec, transactionWriteTransactionsPerSec, transactionErrors.*

l diskReadBandwidthTotal, diskWriteBandwidthTotal, diskReadBandwidthEc, diskWriteBandwidthEc

l diskReadBandwidthCc, diskWriteBandwidthCc, diskReadBandwidthRecovery, diskWriteBandwidthRecovery

l diskReadBandwidthGeo, diskWriteBandwidthGeo, diskReadBandwidthUser

l diskWriteBandwidthUser, diskReadBandwidthXor, diskWriteBandwidthXor
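To prepare a client for this change, it can be useful to find which keys of a current dashboard API response fall under the names listed above. The sketch below matches response keys against those names with fnmatch. It treats the trailing * as "any suffix" and reads transactionErrors.* the same way, which is an assumption. The function name and the sample response are illustrative.

```python
from fnmatch import fnmatchcase

# Names transcribed from the list above; "*" is read as "any suffix".
REMOVED_PATTERNS = [
    "nodeCpuUtilization*", "nodeMemoryUtilizationBytes*", "nodeMemoryUtilization*",
    "nodeNicBandwidth*", "nodeNicReceivedBandwidth*", "nodeNicTransmittedBandwidth*",
    "nodeNicUtilization*", "nodeNicReceivedUtilization*", "nodeNicTransmittedUtilization*",
    "capacityRebalanceEnabled", "capacityRebalanced", "capacityPendingRebalancing",
    "capacityRebalancedAvg", "capacityRebalanceRate", "capacityPendingRebalancingAvg",
    "transactionReadLatency", "transactionWriteLatency",
    "transactionReadBandwidth", "transactionWriteBandwidth",
    "transactionReadTransactionsPerSec", "transactionWriteTransactionsPerSec",
    "transactionErrors*",  # assumption: ".*" in the list means any suffix
    "diskReadBandwidthTotal", "diskWriteBandwidthTotal",
    "diskReadBandwidthEc", "diskWriteBandwidthEc",
    "diskReadBandwidthCc", "diskWriteBandwidthCc",
    "diskReadBandwidthRecovery", "diskWriteBandwidthRecovery",
    "diskReadBandwidthGeo", "diskWriteBandwidthGeo",
    "diskReadBandwidthUser", "diskWriteBandwidthUser",
    "diskReadBandwidthXor", "diskWriteBandwidthXor",
]


def removed_keys(response):
    """Return the top-level response keys that match a removed name."""
    return sorted(
        key
        for key in response
        if any(fnmatchcase(key, pattern) for pattern in REMOVED_PATTERNS)
    )


# Abbreviated, illustrative response in the shape of the earlier example.
response = {
    "name": "vdc1",
    "numNodes": 16,
    "nodeCpuUtilizationAvg": [{"t": "12345678", "Percent": 10}],
    "diskSpaceAllocatedCurrent": [{"t": "12345678", "Bytes": 10000}],
    "transactionReadLatency": [],
}
print(removed_keys(response))
# prints ['nodeCpuUtilizationAvg', 'transactionReadLatency']
```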

API to stay without change

The following table lists the APIs that will not be changed:

Table 34 API - No change


Get Local Zone Storage Pools GET /dashboard/zones/localzone/storagepools Gets the local VDC storage pool details.

Get Local Zone Replication Groups GET /dashboard/zones/localzone/replicationgroups Gets the local VDC replication groups details.

Get Local Zone Replication Group Failed Links GET /dashboard/zones/localzone/rglinksFailed Gets the local VDC replication group failed links details.

Get Local Zone Disks GET /dashboard/zones/localzone/disks Gets the local VDC disks details.



Get Storage Pool GET /dashboard/storagepools/{id} Gets the storage pool details.

Get Disk GET /dashboard/disks/{id} Gets the disk instance details.

Get Node Disks GET /dashboard/nodes/{id}/disks Gets the details of disks in the node.

Get Local Zone Replication Group Bootstrap Links GET /dashboard/zones/localzone/rglinksBootstrap Gets the local VDC replication group bootstrap links details.

Get Replication Group GET /dashboard/replicationgroups/{id} Gets the replication group instance details.

Get RG Link GET /dashboard/rglinks/{id} Gets the replication group link instance details.

Get Replication Group Links GET /dashboard/replicationgroups/{id}/rglinks Gets the replication group instance associated link details.

Get Replication Group Data Table GET /dashboard/datatables/{id} Gets the datatable details.

Get Replication Group Data Tables GET /dashboard/replicationgroups/{id}/datatables Gets the details of the datatables in the replication group.

Get Cas Gc Buckets GET /dashboard/zones/localzone/cas Gets the local VDC CAS GC buckets details.

Get Cas Gc Bucket GET /dashboard/cas/{id} Gets the CAS GC bucket instance details.

Flux API for deprecated dashboard API

You can retrieve analogues of the metrics that are currently available through the dashboard API and that will be deprecated in a future release. The Flux API will provide the metrics that are used to build the new Advanced Monitoring dashboards. Node-level and process-level metrics will be available as raw metrics. In ECS 3.4, not all of the metrics that are planned for deprecation in a future version are available through the Flux API. The metrics that are available in ECS 3.4 are listed here:

Processes statistics

l Dashboard API: GET /dashboard/nodes/{id}/processes

l Flux API

Database: monitoring_op

Measurement: procstat

Fields: memory_rss, cpu_usage, and num_threads

Tags: host (hostname, FQDN), node_id (host ID), and process_name. The valid process names are:

n blobsvc

n cm

n coordinatorsvc

n dataheadsvc

n dtquery

n ecsportalsvc

n eventsvc

n georeceiver

n metering

n objcontrolsvc

n resourcesvc

n transformsvc

n vnest

n fluxd

n influxd

n throttler

n grafana-server

n dockerd

n fabric-agent

n fabric-lifecycle

n fabric-registry

n fabric-zookeeper
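A query for one process can be assembled from the names above. The sketch below assumes that the monitoring_op database is addressed as a bucket of the same name, mirroring the monitoring_main example earlier in this chapter. The function name procstat_query is hypothetical.

```python
# Process names transcribed from the list above.
VALID_PROCESSES = {
    "blobsvc", "cm", "coordinatorsvc", "dataheadsvc", "dtquery",
    "ecsportalsvc", "eventsvc", "georeceiver", "metering", "objcontrolsvc",
    "resourcesvc", "transformsvc", "vnest", "fluxd", "influxd", "throttler",
    "grafana-server", "dockerd", "fabric-agent", "fabric-lifecycle",
    "fabric-registry", "fabric-zookeeper",
}


def procstat_query(process_name, start="-30m"):
    """Build a Flux query for the procstat measurement of one process."""
    if process_name not in VALID_PROCESSES:
        raise ValueError(f"unknown process name: {process_name!r}")
    return (
        # Assumption: the bucket name equals the database name.
        'from(bucket: "monitoring_op") '
        '|> filter(fn: (r) => r._measurement == "procstat" '
        f'and r.process_name == "{process_name}") '
        f"|> range(start: {start})"
    )


print(procstat_query("dataheadsvc"))
```

The returned rows carry the memory_rss, cpu_usage, and num_threads fields in the _field column, so a client can separate them after retrieval.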

Nodes statistics

l Dashboard API: GET /dashboard/nodes/{id}

l Flux API

Database: monitoring_op

Measurement: cpu

Fields: usage_idle

Tags:

n host - hostname (fqdn)

n node_id - host id

Measurement: mem

Fields: free - free memory on host (bytes)

Tags:

n host - hostname (fqdn)

n node_id - host id
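The dashboard API reports CPU utilization, whereas the cpu measurement exposes usage_idle. Assuming that usage_idle is a percentage between 0 and 100 (the sample values in the Flux output examples, such as 95.3, suggest this), utilization can be derived as in the following sketch. The function name is hypothetical.

```python
def cpu_utilization_percent(usage_idle):
    """Convert the cpu usage_idle field (percent idle) to percent busy."""
    if not 0.0 <= usage_idle <= 100.0:
        raise ValueError(f"usage_idle out of range: {usage_idle}")
    return 100.0 - usage_idle


# Value taken from the Flux API JSON output example in this chapter.
print(round(cpu_utilization_percent(95.30918181366027), 2))
# prints 4.69
```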

Performance statistics

l Dashboard API

n GET /dashboard/nodes/{id}

n GET /dashboard/zones/localzone


n GET /dashboard/zones/localzone/nodes

l Flux API: See List of metrics for performance-related data section for details.


CHAPTER 5

Examining Service Logs

l ECS service logs


ECS service logs

Describes the location and content of ECS service logs.

You can access ECS service logs directly through an SSH session on a node. Change to the following directory: /opt/emc/caspian/fabric/agent/services/object/main/log. You can also access the logs from the Service Console. The following logs are provided:

Note: The emcservice user cannot access service logs. When the node is locked using the platform lockdown feature, a user cannot access service logs. Only an administrator who has permission to access the node can access the logs.

l authsvc.log: Records information from the authentication service

l blobsvc*.log: Records aspects of the binary large object (BLOB) service

l cassvc*.log: Records aspects of the CAS service

l coordinatorsvc.log: Records information from the coordinator service

l ecsportalsvc.log: Records information from the ECS Portal service

l eventsvc*.log: Records aspects of the event service. This information is available in the ECS Portal at Monitor > Events

l hdfssvc*.log: Records aspects of the HDFS service

l objcontrolsvc.log: Records information from the object service

l objheadsvc*.log: Records aspects of the various object heads supported by the object service

l provisionsvc*.log: Records aspects of the ECS provisioning service

l resourcesvc*.log: Records information that is related to global resources such as namespaces, buckets, and object users

l dataheadsvc-access.log: Records aspects of the object heads supported by the object service, the file service supported by HDFS, and the CAS service
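When working through a directory listing, the file name patterns above can be used to identify which service wrote a given log. The sketch below encodes the list as patterns and matches a file name against them. The function name describe_log and the example file name blobsvc-error.log are hypothetical.

```python
from fnmatch import fnmatchcase

# Directory given in this section.
LOG_DIR = "/opt/emc/caspian/fabric/agent/services/object/main/log"

# Patterns and short descriptions transcribed from the list above.
LOG_PATTERNS = [
    ("authsvc.log", "authentication service"),
    ("blobsvc*.log", "BLOB service"),
    ("cassvc*.log", "CAS service"),
    ("coordinatorsvc.log", "coordinator service"),
    ("ecsportalsvc.log", "ECS Portal service"),
    ("eventsvc*.log", "event service"),
    ("hdfssvc*.log", "HDFS service"),
    ("objcontrolsvc.log", "object service"),
    ("objheadsvc*.log", "object heads"),
    ("provisionsvc*.log", "ECS provisioning service"),
    ("resourcesvc*.log", "global resources"),
    ("dataheadsvc-access.log", "object head, HDFS, and CAS access"),
]


def describe_log(filename):
    """Return the description for a log file name, or None if unknown."""
    for pattern, description in LOG_PATTERNS:
        if fnmatchcase(filename, pattern):
            return description
    return None


# "blobsvc-error.log" is a made-up file name used only for illustration.
for name in ["authsvc.log", "blobsvc-error.log", "unknown.log"]:
    print(name, "->", describe_log(name))
```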
