ECS
Version 3.4
Monitoring Guide
302-999-903
01
September 2019
Copyright © 2018-2019 Dell Inc. or its subsidiaries. All rights reserved.
Dell believes the information in this publication is accurate as of its publication date. The information is subject to change without notice.
THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS-IS.” DELL MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND
WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. USE, COPYING, AND DISTRIBUTION OF ANY DELL SOFTWARE DESCRIBED
IN THIS PUBLICATION REQUIRES AN APPLICABLE SOFTWARE LICENSE.
Dell Technologies, Dell, EMC, Dell EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries. Other trademarks may be the property
of their respective owners. Published in the USA.
Dell EMC
Hopkinton, Massachusetts 01748-9103
1-508-435-1000 In North America 1-866-464-7381
www.DellEMC.com
CONTENTS

Figures
Tables
Welcome to ECS

Chapter 1 Monitoring Basics
    View the ECS Portal Dashboard
        Upper-right menu bar
        View requests
        View capacity utilization
        View performance
        View storage efficiency
        View geo monitoring
        View node and disk health
        View alerts
    Using monitoring pages
        Table navigation
        Filter by date and time
        History
        Export icon

Chapter 2 Monitoring ECS
    Monitor metering data
        Metering data
    Monitor capacity utilization
        Capacity forecast
        Monitor capacity
        Monitor used capacity
        Monitor garbage collection data
        Monitor erasure encoding
        Monitor CAS processing
    Monitor system health
        Monitor hardware health
        Monitor process health
        Monitor node rebalancing status
    Monitor transactions
        Monitor requests
        Monitor performance
    Monitor recovery status
    Monitor disk bandwidth
    Introduction to geo-replication monitoring
        Monitor geo replication: Rate and Chunks
        Monitor geo replication: Recovery Point Objective (RPO)
        Monitor geo replication: Failover Processing
        Monitor geo replication: Bootstrap Processing
    Cloud hosted VDC monitoring
        Cloud topology
        Cloud replication traffic

Chapter 3 Monitoring Events: Audits and Alerts
    About event monitoring
    Monitor audit data
    Audit messages
    Monitor alerts
    Alert policy
        New alert policy
    Acknowledge all alerts
    Alert messages

Chapter 4 Advanced Monitoring
    Advanced Monitoring
        View Advanced Monitoring Dashboards
        Share Advanced Monitoring Dashboards
    Flux API
        List of metrics for performance-related data
    Dashboard API's to be deprecated or changed in the next release

Chapter 5 Examining Service Logs
    ECS service logs
FIGURES

1 Upper-right menu bar
2 Navigating with breadcrumbs
3 Refresh icon
4 Open Filter panel with date and time range selections
5 History chart with active cursor
6 Export icons
TABLES

1 Bucket and namespace metering
2 Capacity utilization: VDC
3 Capacity utilization: storage pool
4 Capacity utilization: node
5 Capacity utilization: disk
6 Used capacity
7 Garbage collection: garbage detected
8 Garbage collection: capacity reclaimed
9 Erasure encoding metrics
10 CAS processing metrics
11 VDC, node, and process health metrics
12 ECS processes
13 Request metrics
14 Network traffic metrics
15 Recovery metrics
16 Disk Bandwidth metrics
17 Rate and Chunks columns
18 RPO columns
19 Failover columns
20 Bootstrap Processing columns
21 Replication traffic by VDC
22 Replication traffic by replication group
23 ECS audit messages
24 Alert types
25 ESRS dial home types
26 ECS Object alert messages
27 ECS fabric alert messages
28 Secure Remote Services alert messages
29 Advanced monitoring dashboards
30 Advanced monitoring dashboard fields
31 Metrics for performance-related data
32 API - Remove
33 API - Change
34 API - No change
Welcome to ECS
ECS provides a complete software-defined cloud storage platform that supports the storage, manipulation, and analysis of unstructured data on a massive scale on commodity hardware. ECS can be deployed as a turnkey storage appliance or as a software product that can be installed on qualified commodity servers and disks. ECS offers all the cost advantages of commodity infrastructure with the enterprise reliability, availability, and serviceability of traditional arrays.
The ECS online documentation comprises the following guides:
• Administration Guide
• Monitoring Guide
• Data Access Guide
• Hardware Guide
Administration Guide
The Administration Guide supports the initial configuration of ECS and the provisioning of storage to meet requirements for availability and data replication. Also, it supports the ongoing management of tenants and users, and the creation and configuration of buckets.
Monitoring Guide
The Monitoring Guide supports the ECS administrator's use of the ECS Portal to monitor the health and performance of ECS and to view its capacity utilization.
Data Access Guide
The Data Access Guide describes the protocols that are supported by ECS for user access to ECS object storage. In addition to the S3, EMC Atmos, OpenStack Swift, and Centera (CAS) object APIs, it introduces the ECS Management API, which can be used to configure ECS before user access, and details the use of ECS as a Hadoop Filesystem (HDFS) and the integration of ECS HDFS with a Hadoop cluster.
Hardware Guide
The Hardware Guide describes the supported hardware configurations and upgrade paths and details the rack cabling requirements.
PDF versions of these online guides and links to other PDFs, such as the ECS Security Configuration Guide and the ECS Release Notes, are available from support.emc.com.
CHAPTER 1
Monitoring Basics
• View the ECS Portal Dashboard
• Using monitoring pages
View the ECS Portal Dashboard
The ECS Portal Dashboard provides critical information about the ECS processes on the VDC you are currently logged in to.
The Dashboard is the first page you see after you log in. The title of each panel (box) links to the portal monitoring page that shows more detail for the monitoring area.
Upper-right menu bar
The upper-right menu bar appears on each ECS Portal page.
Figure 1 Upper-right menu bar
Menu items include the following icons and menus:
1. The Alert icon displays a number that indicates how many unacknowledged alerts are pending for the current VDC. The number displays 99+ if there are more than 99 alerts. You can click the Alert icon to see the Alert menu, which shows the five most recent alerts for the current VDC.
2. The Help icon brings up the online documentation for the current portal page.
3. The Guide icon brings up the Getting Started Task Checklist.
4. The VDC menu displays the name of the current VDC. If your AD or LDAP credentials allow you to access more than one VDC, you can switch the portal view to the other VDCs without entering your credentials.
5. The User menu displays the current user and allows you to log out. The User menu displays the last login time for the user.
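The Alert icon behavior described in item 1 can be sketched in a few lines. This is an illustration only, not portal code: the function names and the `timestamp` field are hypothetical, and only the 99+ cap and the five-alert menu come from the text above.

```python
def alert_badge(unacknowledged: int) -> str:
    """Return the label shown on the Alert icon for a given alert count."""
    if unacknowledged < 0:
        raise ValueError("alert count cannot be negative")
    # The guide states that the number displays 99+ above 99 alerts.
    return "99+" if unacknowledged > 99 else str(unacknowledged)


def alert_menu(alerts: list, limit: int = 5) -> list:
    """Return the most recent alerts, newest first (hypothetical 'timestamp' key)."""
    return sorted(alerts, key=lambda a: a["timestamp"], reverse=True)[:limit]
```

For example, `alert_badge(250)` yields the capped label, while `alert_menu` trims a longer list to the five newest entries.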
View requests
The Requests panel displays the total requests, successful requests, and failed requests.
Failed requests are organized by system error and user error. User failures are typically HTTP 400 errors. System failures are typically HTTP 500 errors. Click Requests to see more request metrics.
Request statistics do not include replication traffic.
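The split between user errors and system errors can be illustrated with a small classifier. This is a sketch under assumptions: it treats every 4xx status as a user error and every 5xx status as a system error, which the guide only describes as typical, and the function names are hypothetical.

```python
from collections import Counter


def classify_status(status: int) -> str:
    """Map an HTTP status code to a Requests-panel category."""
    if 400 <= status <= 499:
        return "user_error"    # assumption: any 4xx counts as a user failure
    if 500 <= status <= 599:
        return "system_error"  # assumption: any 5xx counts as a system failure
    return "successful"


def summarize_requests(statuses: list) -> dict:
    """Produce total, successful, and failed counts like the Requests panel."""
    counts = Counter(classify_status(s) for s in statuses)
    return {
        "total": len(statuses),
        "successful": counts["successful"],
        "failed": counts["user_error"] + counts["system_error"],
        "user_error": counts["user_error"],
        "system_error": counts["system_error"],
    }
```

Because request statistics exclude replication traffic, a log used as input for a comparison like this would need replication requests filtered out first.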
View capacity utilization
The Capacity Utilization panel displays the total, used, available, reserved, and percent full capacity.
Capacity amounts are shown in gibibytes (GiB) and tebibytes (TiB). One GiB is approximately equal to 1.074 gigabytes (GB). One TiB is approximately equal to 1.1 terabytes (TB).
The Used capacity indicates the amount of capacity that is in use. Click Capacity Utilization to see more capacity metrics.
The capacity metrics are available in the left menu.
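The binary-to-decimal unit conversion quoted above follows directly from the unit definitions (1 GiB = 2^30 bytes, 1 TiB = 2^40 bytes). A short sketch, with hypothetical function names, for converting portal values when comparing them with decimal figures from other tools:

```python
GIB_BYTES = 2 ** 30  # bytes in one gibibyte
TIB_BYTES = 2 ** 40  # bytes in one tebibyte


def gib_to_gb(gib: float) -> float:
    """Convert gibibytes to decimal gigabytes (10**9 bytes)."""
    return gib * GIB_BYTES / 10 ** 9


def tib_to_tb(tib: float) -> float:
    """Convert tebibytes to decimal terabytes (10**12 bytes)."""
    return tib * TIB_BYTES / 10 ** 12


print(round(gib_to_gb(1), 3))  # 1.074
print(round(tib_to_tb(1), 1))  # 1.1
```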
View performance
The Performance panel displays how network read and write operations are currently performing, and the average read/write performance statistics over the last 24 hours for the VDC.
Click Performance to see more comprehensive performance metrics.
View storage efficiency
The Storage Efficiency panel displays the efficiency of the erasure coding (EC) process.
The chart shows the progress of the current EC process, and the other values show the total amount of data that is subject to EC, the amount of EC data waiting for the EC process, and the current rate of the EC process. Click Storage Efficiency to see more storage efficiency metrics.
View geo monitoring
The Geo Monitoring panel displays how much data from the local VDC is waiting for geo-replication, and the rate of the replication.
Recovery Point Objective (RPO) refers to the point in time in the past to which you can recover. The value is the oldest data at risk of being lost if a local VDC fails before replication is complete. Failover Progress shows the progress of any active failover that is occurring in the federation involving the local VDC. Bootstrap Progress shows the progress of any active process to add a new VDC to the federation. Click Geo Monitoring to see more geo-replication metrics.
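One way to read the RPO definition is as the age of the oldest write that has not yet been replicated. The sketch below illustrates that reading only; it is not how ECS computes the value, and the function name and input list are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def recovery_point_objective(pending_write_times: list, now: datetime) -> timedelta:
    """Return the age of the oldest write still waiting for geo-replication.

    An empty list means nothing is pending, so no data is at risk.
    """
    if not pending_write_times:
        return timedelta(0)
    return now - min(pending_write_times)


now = datetime(2019, 9, 1, 12, 0, tzinfo=timezone.utc)
pending = [now - timedelta(minutes=m) for m in (5, 42, 1)]
print(recovery_point_objective(pending, now))  # 0:42:00
```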
View node and disk health
The Node & Disks panel displays the health status of disks and nodes.
A green check mark beside the node or disk number indicates the number of nodes or disks in good health. A red x indicates bad health. Click Node & Disks to see more hardware health metrics. If the number of bad disks or nodes is a number other than zero, clicking on the count takes you to the corresponding Hardware Health tab (Offline Disks or Offline Nodes) on the System Health page.
View alerts
The Alerts panel displays a count of critical alerts and errors.
Click Alerts to see the full list of current alerts. Any Critical or Error alerts are linked to the Alerts tab on the Events page where only the alerts with a severity of Critical or Error are filtered and displayed.
Using monitoring pages
Introduces the basic techniques for using monitoring pages in the ECS Portal.
The ECS Portal monitoring pages share a set of common interactions as described in the following sections:
Table navigation
Highlighted text in a table row indicates a link to a detail display. Selecting the link drills down to the next level of detail. On drill-down displays, a path string shows your current location in the sequence of drill-down displays. This path string is called a breadcrumb trail or breadcrumbs for short. Selecting any highlighted breadcrumb jumps up to the associated display.
Figure 2 Navigating with breadcrumbs
On some monitoring displays, you can force a table to refresh with the latest data by clicking the Refresh icon.
Figure 3 Refresh icon
Filter by date and time
The standard monitoring filter enables you to narrow results by date and time. It is available on several monitoring pages. Some pages have more filter types, described on those pages.
You can select a Date Time Range predefined value (in hours, weeks, or months) or select Custom to specify a From and To date and time. For the To value, you can select the current time. After selecting a Date Time Range, click Apply. The Filter panel closes and the page content updates. When closed, the Filter panel shows a summary of the applied filter settings and provides a Clear Filter command and a Refresh symbol.
If you want the Filter panel to stay open, click the Pin icon before you click Apply.
Figure 4 Open Filter panel with date and time range selections
When the table has the Current filter applied, the latest values are displayed. When the table has a date-time range filter applied, it displays the average value over that period.
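The difference between the Current filter and a date-time range filter can be shown with a small sketch: the former reports the latest sample, the latter the mean of the samples in the range. The sample layout and function name are assumptions for illustration, not a description of the portal's internals.

```python
from datetime import datetime, timedelta
from statistics import fmean


def displayed_value(samples, start=None, end=None):
    """Return the value a table cell would show for (timestamp, value) samples.

    With no range (the Current filter), return the latest value.
    With a range, return the average of the values inside it, or None if empty.
    """
    if not samples:
        return None
    if start is None and end is None:
        return max(samples, key=lambda s: s[0])[1]
    in_range = [value for ts, value in samples if start <= ts <= end]
    return fmean(in_range) if in_range else None


t0 = datetime(2019, 9, 1)
samples = [(t0, 10.0), (t0 + timedelta(hours=1), 20.0), (t0 + timedelta(hours=2), 60.0)]
print(displayed_value(samples))                                # 60.0
print(displayed_value(samples, t0, t0 + timedelta(hours=1)))   # 15.0
```

This is why a value read with a date-time range filter can differ noticeably from the value shown under Current for the same row.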
History
When you select a History button, all available charts for that row are displayed below the table. You can hover over a chart from left to right to see a vertical line that helps you find a specific date-time point on the chart. A pop-up display shows the value and timestamp for that point.
The date-time scale is determined by the filter setting that has been configured. When the Current filter is selected, the charts show data from the last 24 hours. History data is kept for 60 days.
Figure 5 History chart with active cursor
In the history charts, when the Current filter is selected, if there is no available historical data, No Data displays.
Export icon
The Export icon enables you to export data from all the monitoring tables and graphs to PDF, DOC, Excel, and CSV formats for later consumption. To select the format and export the data, use the export icon in the upper right of the menu bar on each table and graph.
The exported data can be used to get a longer term view on capacity usage and consumption trends.
Figure 6 Export icons
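As an illustration of using exported data for trend analysis, the sketch below reads CSV text and computes average daily growth in used capacity. The column names `timestamp` and `used_tib` are hypothetical; the actual headers in an ECS export may differ and should be checked against a real file.

```python
import csv
import io
from datetime import datetime


def daily_growth(csv_text: str, time_col: str = "timestamp", used_col: str = "used_tib") -> float:
    """Return average growth per day between the first and last exported rows."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if len(rows) < 2:
        raise ValueError("need at least two rows to compute a trend")
    points = sorted((datetime.fromisoformat(r[time_col]), float(r[used_col])) for r in rows)
    (first_time, first_used), (last_time, last_used) = points[0], points[-1]
    days = (last_time - first_time).total_seconds() / 86400
    if days == 0:
        raise ValueError("rows must span a non-zero time range")
    return (last_used - first_used) / days


export = (
    "timestamp,used_tib\n"
    "2019-09-01T00:00:00,100.0\n"
    "2019-09-06T00:00:00,110.0\n"
    "2019-09-11T00:00:00,120.0\n"
)
print(daily_growth(export))  # 2.0
```

Appending exports over several months gives a longer view than the portal retains in its own history.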
CHAPTER 2
Monitoring ECS
• Monitor metering data
• Monitor capacity utilization
• Monitor system health
• Monitor transactions
• Monitor recovery status
• Monitor disk bandwidth
• Introduction to geo-replication monitoring
• Cloud hosted VDC monitoring
Monitor metering data
You can display metering data for namespaces, or buckets within namespaces, for a specified time period.
About this task
The available metering data is detailed in Metering data on page 19.
Using the ECS Management REST API you can retrieve data programmatically with custom clients. The ECS Management REST API Reference is provided here.
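A custom client might start from a request like the one sketched below. Everything specific here is an assumption to verify against the ECS Management REST API Reference: the management port 4443, the `/object/billing/namespace/{namespace}/info` path, and the `X-SDS-AUTH-TOKEN` header are recalled conventions, not details stated in this guide. The function only builds the request and does not contact a server.

```python
import urllib.request
from urllib.parse import quote


def billing_info_request(host: str, namespace: str, token: str, port: int = 4443) -> urllib.request.Request:
    """Build (but do not send) a GET request for namespace metering data.

    Assumed, not confirmed by this guide: port, URL path, and auth header name.
    """
    url = f"https://{host}:{port}/object/billing/namespace/{quote(namespace, safe='')}/info"
    headers = {
        "X-SDS-AUTH-TOKEN": token,  # assumed name of the session token header
        "Accept": "application/json",
    }
    return urllib.request.Request(url, headers=headers, method="GET")


request = billing_info_request("ecs.example.com", "ns1", "example-token")
print(request.full_url)
```

Sending the request would use `urllib.request.urlopen(request)` with a TLS context appropriate for the deployment; that step is omitted because it depends on the site's certificates and login flow.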
Procedure
1. In the ECS Portal, select Monitor > Metering.
2. From the Date Time Range menu, select the period for which you want to see the metering data. Select Current to view the current metering data. Select Custom to specify a custom date-time range.
Metering is not a real-time reporting activity but is performed as a background process, and some delay in reporting can occur. The longest delay is about 15 minutes. However, where the system is under heavy load, or is unstable, longer delays can be seen. If you are encountering longer delays, contact ECS Customer Support.
If you select Custom, use the From and To calendars to choose the time period for which data will be displayed.
Metering data is kept for 30 days.
Note: The Current filter displays the latest available values. A date-time range filter displays average values over the specified range.
3. Select the namespace for which you want to display metering data. To narrow the list of namespaces, type the first few letters of the target namespace and click the magnifying glass icon.
If you are a Namespace Administrator, you will only be able to select your namespace.
4. Click the + icon next to each namespace you want to see object data for.
5. To see the data for a particular bucket, click the + icon next to each bucket for which you want to see data.
To narrow the list of buckets, type the first few letters of the target bucket and click the magnifying glass icon.
If you do not specify a bucket, the object metering data will be the totals for all buckets in the namespace.
6. Click Apply to display the metering data for the selected namespace and bucket for the specified time period.
Note: While all buckets in a geo-federation can be selected in metering, if a selected bucket is not associated with a replication group to which the VDC that you are logged in to belongs, metering information cannot be retrieved for that bucket. In this case, after a wait, the bucket is listed as No data. To get the metering information for the bucket, log in to the VDC that owns the bucket or any VDC that is part of the replication group to which the bucket belongs.
Depending on the Date Time Range selected, the attributes that are displayed on the Metering page may change. If the Current option is selected, only the Namespace, Buckets, Bucket Tags, Total MPU Parts, Total MPU Size, Total Size, Object Count, and Last Updated attributes are displayed in the table. If Custom or any other time range is chosen, the Namespace, Buckets, Bucket Tags, Total MPU Parts, Total MPU Size, Total Size, Object Count, Objects Created, Objects Deleted, Write Traffic, and Read Traffic attributes are displayed in the table and the Last Updated attribute is not displayed.
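The attribute rules in the note above can be restated as a small lookup, which may help when scripting checks against exported metering tables. The function name is hypothetical; the column names are taken from the note.

```python
COMMON_COLUMNS = [
    "Namespace", "Buckets", "Bucket Tags", "Total MPU Parts",
    "Total MPU Size", "Total Size", "Object Count",
]


def metering_columns(date_time_range: str) -> list:
    """Return the Metering page columns shown for a Date Time Range selection."""
    if date_time_range == "Current":
        # Current adds Last Updated and omits the per-period counters.
        return COMMON_COLUMNS + ["Last Updated"]
    # Custom or any predefined range adds the per-period counters instead.
    return COMMON_COLUMNS + ["Objects Created", "Objects Deleted", "Write Traffic", "Read Traffic"]
```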
Metering data
Object metering data for a specified namespace, or a specified bucket within a namespace, can be obtained for a defined time period at the ECS Portal Monitor > Metering page.
The metering information that is provided is shown in the following table:
Table 1 Bucket and namespace metering
Attribute | Description
Namespace | Namespace selected.
Buckets | Bucket selected for which the metering data applies. If blank, the data is for all buckets in the namespace.
Bucket Tags | Lists any name=value bucket tags associated with the bucket.
Total MPU Parts | The number of MPU parts that have been created and not used as part of a complete MPU operation.
Total MPU Size | The total disk size occupied by MPU parts that have been created and not used as part of a complete MPU operation.
Total Size | Total size of the objects that are stored in the selected namespace or bucket at the end time that is specified in the filter. If the size is less than 1 GB, then the portal displays 0 GB.
Object Count | Number of objects that are associated with the selected namespace or bucket at the end time that is specified in the filter.
Last Updated | If the Current filter is selected, Last Updated displays the time until which metering data can be considered consistent. This can help you determine any delay in reported metering stats. The metering stats may include some data on the operations that are performed after the last updated time.
Objects Created | Number of objects that are created in the selected namespace or bucket in the time period.
Objects Deleted | Number of objects that are deleted from the selected namespace or bucket in the time period.
Write Traffic | Total of incoming object data (writes) for the selected namespace or bucket during the specified period. Values are displayed in a size unit that is based on the size of the data.
Read Traffic | Total of outgoing object data (reads) for the selected namespace or bucket during the specified period. Values are displayed in a size unit that is based on the size of the data.
Note: When you perform an update operation on an object, the metering service reports the object overwrite under both Objects Created and Objects Deleted. Objects Deleted is shown because of the expected overwrite behavior of an object. However, no object is deleted.
Note: Metering is not a real-time reporting activity but is performed as a background process, and some delay in reporting can occur. The longest delay is about 15 minutes. However, where the system is under heavy load, or is unstable, longer delays can be seen. If you are encountering longer delays, contact ECS Customer Support.
Note: When there are many concurrent requests, ECS metering can ignore some requests so that they do not impact system performance. Hence, the Write Traffic value can show less than the actual write bandwidth.
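The overwrite note above means that Objects Created and Objects Deleted cannot be read as pure creations and deletions. The toy model below illustrates the accounting; it is not ECS code, and the assumption that Object Count stays unchanged on overwrite is inferred from the statement that no object is actually deleted.

```python
from dataclasses import dataclass


@dataclass
class MeteringCounters:
    """Toy model of the metering counters for one bucket."""
    objects_created: int = 0
    objects_deleted: int = 0
    object_count: int = 0

    def create(self) -> None:
        self.objects_created += 1
        self.object_count += 1

    def overwrite(self) -> None:
        # An update is reported as one create plus one delete,
        # while the number of stored objects stays the same (assumption).
        self.objects_created += 1
        self.objects_deleted += 1

    def delete(self) -> None:
        self.objects_deleted += 1
        self.object_count -= 1


counters = MeteringCounters()
counters.create()
counters.create()
counters.overwrite()
print(counters.objects_created, counters.objects_deleted, counters.object_count)  # 3 1 2
```

In this model the net change, Objects Created minus Objects Deleted, still matches the change in Object Count, which is a useful sanity check when overwrites are frequent.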
Monitor capacity utilization
You can monitor capacity utilization from the ECS Portal Monitor > Capacity Utilization page. You can monitor the capacity utilization of storage pools, nodes, and the entire VDC.
The Capacity Utilization page has the following tabs:
• Capacity: View summary data about the total, used, available, and reserved storage capacity of storage pools and nodes
• Used Capacity: View data about the used capacity for the VDC and storage pools
• Garbage Collection: View data about garbage detected, recovered capacity, capacity that is pending reclamation, and capacity that cannot be reclaimed
• Erasure Encoding: View erasure-encoded data in a local storage pool, data that is pending erasure encoding, and the current erasure encoding rate and estimated completion time
• CAS Processing: View garbage data collection for CAS (Content Addressable Storage) buckets.
Tables showing capacity usage data display in each of the tabs. You can drill down into the nodes and to individual disks by selecting the appropriate link in each table. Each row has an associated History display that enables you to see how the data has changed over time. To graphically display how capacity has changed over time, select History for the storage pool, node, or disk that you are interested in. History data is kept for 30 days.
See Using monitoring pages for information about navigating the tables.
Capacity forecast
You can use the Capacity tab to monitor when the capacity is expected to reach 50% and 80%. The capacity forecast is based on the current usage pattern, shown as 1-day, 7-day, and 30-day usage trends. Capacity Forecast data is shown for the entire VDC, for an individual storage pool, or for nodes.
Note: The capacity ETA shown as N/A could be due to the following reasons:
1. There is not enough historical data for a forecast. At least two data points (1 hour apart) are required. This can happen when the ECS system is first deployed. Click the History button at the VDC, storage pool, or node level to verify.
2. If the capacity has passed the intended target, the ETA is set to 0.
3. The used capacity shows a downward trend for the specified time (for example, 7 days). Click the History button or get the history through the dashboard API to verify.
To see the capacity forecast data from the ECS Portal, select Monitor > Capacity Utilization > Capacity. The Capacity tab is the default.
To see the data about total capacity, used capacity, and available capacity, click History.
Capacity Forecast is calculated based on the total capacity and used capacity.
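The guide does not document the forecasting algorithm. As a rough illustration only, the sketch below assumes a simple linear extrapolation between the first and last samples of a trend window and reproduces the N/A and zero-ETA cases listed in the note above. The function name and input layout are hypothetical, and treating a flat trend as N/A is an additional assumption.

```python
from datetime import datetime, timedelta


def capacity_eta(history, total, target_pct):
    """Estimate the time until used capacity reaches target_pct of total.

    history: (timestamp, used) samples covering the trend window.
    Returns a timedelta, or None where the portal would show N/A.
    """
    if len(history) < 2:
        return None  # not enough historical data
    points = sorted(history)
    (first_time, first_used), (last_time, last_used) = points[0], points[-1]
    if last_time - first_time < timedelta(hours=1):
        return None  # data points must be at least 1 hour apart
    target = total * target_pct / 100
    if last_used >= target:
        return timedelta(0)  # capacity already passed the target
    rate = (last_used - first_used) / (last_time - first_time).total_seconds()
    if rate <= 0:
        return None  # downward (or flat, by assumption) trend
    return timedelta(seconds=(target - last_used) / rate)


start = datetime(2019, 9, 1)
one_day = [(start, 10.0), (start + timedelta(days=1), 20.0)]
print(capacity_eta(one_day, total=100.0, target_pct=50))
```

With 10 units of growth per day and 30 units remaining to the 50% mark, the estimate in this example is about three days.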
Monitor capacity
You can use the Capacity tab to view capacity utilization data for:
• VDC (VDC capacity utilization on page 21)
• Storage Pools (Storage pool capacity utilization on page 22)
• Nodes (Node capacity utilization on page 22)
• Disks (Disk capacity utilization on page 23)
• Used Capacity (Monitor used capacity on page 23)
You can view summary storage usage data about total, used, available, and reserved storage capacity for storage pools and nodes.
Reserved capacity is the approximately 10 percent of the total capacity that is reserved for failure handling and for performing erasure encoding or XOR operations. Reserved capacity is not available for writing new data.
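The relationship between total, used, available, and reserved capacity can be illustrated with simple arithmetic. This sketch assumes exactly 10% is reserved (the guide says approximately) and that the available figure includes the reserved portion, as the Available (Reserved) attribute describes; the function and key names are hypothetical.

```python
def capacity_breakdown(total: float, used: float, reserved_fraction: float = 0.10) -> dict:
    """Split capacity into the figures shown on the Capacity tab (same unit in and out)."""
    if total <= 0 or not 0 <= used <= total:
        raise ValueError("expected 0 <= used <= total and total > 0")
    reserved = total * reserved_fraction   # assumed to be exactly 10% of total
    available = total - used               # includes the reserved portion
    writable = max(available - reserved, 0.0)  # space left for new data
    return {
        "reserved": reserved,
        "available": available,
        "writable": writable,
        "percent_full": 100 * used / total,
    }


print(capacity_breakdown(total=1000.0, used=400.0))
```

In this example a pool that looks 60% free has closer to 50% of its total capacity left for new writes, because the reserved portion cannot be used for new data.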
The tab opens with the Storage Pools capacity table displayed. To view capacity data for individual nodes, click the appropriate link in the Nodes (Online) column to display the Nodes table. Click the appropriate link in the Disks (Online) column to view capacity data for individual disks.
You can display average values over a selected date-time range or over a custom time range using the Filter drop-down menu. The Current filter displays the latest available values and is the default filter value.
When the table has the Date Time Range filter set to Current (the default setting), the table displays the latest values and the history graphs display values over the last 24-hour period. When the table has a Date Time Range filter applied (other than Current), it displays the average value over that period.
VDC capacity utilization
Table 2 Capacity utilization: VDC
Attribute Description
VDC Name of the VDC.
Per 1 Day Trend 50% Forecasts when VDC capacity is expected to reach 50%, based on the 1-day usage trend.
Per 7 Day Trend 50% Forecasts when VDC capacity is expected to reach 50%, based on the 7-day usage trend.
Per 30 Day Trend 50% Forecasts when VDC capacity is expected to reach 50%, based on the 30-day usage trend.
Per 1 Day Trend 80% Forecasts when VDC capacity is expected to reach 80%, based on the 1-day usage trend.
Per 7 Day Trend 80% Forecasts when VDC capacity is expected to reach 80%, based on the 7-day usage trend.
Per 30 Day Trend 80% Forecasts when VDC capacity is expected to reach 80%, based on the 30-day usage trend.
Total Total capacity of the VDC that is online. This is the sum of the capacity that is already used and the capacity still free for allocation.
Used Used online capacity in the VDC.
Available (Reserved) Online capacity available for use, including the approximately 10% of the total capacity that is reserved for failure handling and for performing erasure encoding or XOR operations. If the Current filter is applied, Available (Reserved) displays. If a filter other than Current is applied, only Available displays.
Actions History provides a graphic display of the data. If the Current filter (default) is selected, the History button displays total, used, and available capacity for the last 24 hours. History data is kept for 60 days.
Storage pool capacity utilization
Table 3 Capacity utilization: storage pool
Attribute Description
Storage Pool Name of the storage pool.
Nodes (Online) Number of nodes in the storage pool followed by the number of those nodes that are online. Click this number to open the Node capacity utilization table.
Online Nodes with Sufficient Disk Space Number of online nodes that have sufficient disk space to accept new data. If too many disks are too full to accept new data, the performance of the system may be impacted. Note: Does not appear if a filter other than Current is applied.
Disks (Online) Number of disks in the storage pool followed by the number of those disks that are online.
Total Total capacity of the storage pool that is online. This is the sum of the capacity that is already used and the capacity still free for allocation.
Used Used online capacity in the storage pool.
Available (Reserved) Online capacity available for use, including the approximately 10% of the total capacity that is reserved for failure handling and for performing erasure encoding or XOR operations. If the Current filter is applied, Available (Reserved) displays. If a filter other than Current is applied, only Available displays.
Actions History provides a graphic display of the data. If the Current filter (default) is selected, the History button displays total, used, and available capacity for the last 24 hours. History data is kept for 60 days.
Node capacity utilization
Table 4 Capacity utilization: node
Attribute Description
Nodes Fully qualified domain name (FQDN) of the node.
Disks (Online) Number of disks that are associated with the node followed by the number of those disks that are online. Click the disk number to open the Disk capacity utilization table.
Total Total online capacity provided by the online disks within the node. This is the sum of the capacity that is already used and the capacity still free for allocation.
Used Online capacity used within the node.
Available (Reserved) Remaining online capacity available in the node, including reserved capacity. If the Current filter is applied, Available (Reserved) displays. If a filter other than Current is applied, only Available displays.
Offline Total capacity of the node that is offline. Displays only if the Current filter is applied.
Online Status Indicates whether the node is online or offline. A check mark indicates that the node status is Good.
Actions History provides a graphic display of the data. If the Current filter (default) is selected, the History button displays total, used, and available capacity for the last 24 hours. History data is kept for 60 days.
Disk capacity utilization
Table 5 Capacity utilization: disk
Attribute Description
Disks Disk identifier.
Total Total capacity provided by the disk.
Used Capacity used on the disk.
Available Remaining capacity available on the disk.
Online Status Indicates whether the disk is online or offline. A check mark indicates that the disk status is Good.
Actions History provides a graphic display of the data. If the Current filter (default) is selected, the History button displays total, used, and available capacity for the last 24 hours. History data is kept for 60 days.
Monitor used capacity
You can use the Used Capacity tab to view the used storage capacity for the current VDC and for each storage pool in the VDC.
Table 6 Used capacity
Storage use Description
User Data The capacity that is used for the repository chunks representing data uploaded by ECS users.
System Metadata The capacity that is used by the ECS processes that track and describe the data in the system.
Protection Overhead The combined overhead of triple mirroring and erasure coding for all user data, system metadata, and geo data protection chunks protected locally.
Geo Cache The capacity used to cache chunks that are accessed locally but not stored locally.
Geo Copy The capacity that is used for geo-replication chunks stored on the current VDC.
Garbage The capacity used by data that is no longer in use.
Storage usage is shown as color-coded bars, one color for the current VDC and a different color for its storage pools. Tool tips for each colored bar correspond to the status information in the numeric status line.
Monitor garbage collection data
You can use the Garbage Collection tab to monitor garbage collection data for the entire VDC or for individual storage pools. Use the Virtual Data Center drop-down menu to select the storage type: Virtual Data Center or Storage Pool. Virtual Data Center is the default.
Garbage collection is enabled by default at installation. Contact your customer support representative to disable or re-enable this feature.
The Garbage Collection page has the following subtabs:
• Garbage Detected: View summary garbage collection data.
• Capacity Reclaimed: View data about storage capacity reclaimed by the garbage collection process.
Garbage Detected
Click the Virtual Data Center drop-down menu to view garbage detection data for the entire VDC or individual storage pools.
Table 7 Garbage collection: garbage detected
Attribute Description
Storage Type The VDC or storage pool for which to view garbage collection data.
Total Garbage Detected The amount of reclaimable storage capacity detected on the system.
Capacity Reclaimed The amount of storage capacity reclaimed by the garbage collection process.
Capacity Pending Reclamation The amount of storage capacity that is identified as reclaimable but not yet reclaimed.
UnReclaimable Garbage The amount of storage capacity that cannot be reclaimed currently.
Capacity Reclaimed
Click the Filter button to set a filter for the reclamation data by VDC or storage pool over a date/time range.
Table 8 Garbage collection: capacity reclaimed
Attribute Description
Storage Type The VDC or storage pool for which to view capacity reclaimed data.
Capacity Reclaimed The amount of storage capacity recovered following garbage collection.
User Data Reclaimed The amount of user data recovered.
System Metadata Reclaimed The amount of system metadata recovered.
Actions History provides a graphic display of the data. If the Current filter (default) is selected, the History button displays the total reclaimed capacity for the last 24 hours. History data is kept for 60 days.
Monitor erasure encoding
You can use the Erasure Encoding tab to monitor the total user data and erasure encoded data in a local storage pool. It also shows the current encoding rate and the estimated completion time.
You can display average values over a selected date-time range or over a custom time range using the Filter drop-down menu. The Current filter displays the latest available values and is the default filter value.
Table 9 Erasure encoding metrics
Column Description
Storage Pool The storage pools from the current VDC.
Total Coding Data The total logical size of all data chunks in the storage pool that are subject to erasure encoding.
Total Coded Data The total logical size of all erasure-encoded chunks in the storage pool.
Coded (%) The percentage of data in the storage pool that is erasure encoded. Percent values display with three decimal places in the history chart for accurate plotting. Percent values display with two decimal places in the table, consistent with the format of the other values in the table.
Coding Rate The rate at which any current data waiting for erasure encoding is being processed.
Est. Time to Complete The estimated completion time extrapolated from the current erasure encoding rate.
Actions History provides a graphic display of the total coding data, total coded data, percent of data coded, and coding rate per second. History data is kept for 60 days. If the Current filter is selected, History displays default history for the last 24 hours.
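The relationship between the erasure encoding columns can be sketched as follows. This is an illustration under stated assumptions: that Coded (%) is Total Coded Data divided by Total Coding Data, and that the estimated time is the remaining data divided by the current coding rate. The guide does not give the exact formulas, and the function name encoding_progress is hypothetical.

```python
from typing import Optional, Tuple


def encoding_progress(total_coding: float, total_coded: float,
                      rate_per_second: float
                      ) -> Tuple[Optional[float], Optional[float]]:
    """Return (coded percent, estimated seconds to complete).

    Sizes share one unit (for example bytes); the rate is that unit per second.
    None means the value cannot be computed from the inputs.
    """
    if total_coding <= 0:
        return None, None  # nothing is subject to erasure encoding
    coded_pct = 100.0 * total_coded / total_coding
    remaining = max(total_coding - total_coded, 0.0)
    if remaining == 0:
        return coded_pct, 0.0
    if rate_per_second <= 0:
        return coded_pct, None  # no progress, so no estimate
    return coded_pct, remaining / rate_per_second


if __name__ == "__main__":
    pct, eta = encoding_progress(1000.0, 750.0, 50.0)
    # Two decimal places as in the table, three as in the history chart.
    print(f"table: {pct:.2f}%  chart: {pct:.3f}%  eta: {eta} s")
```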
Monitor CAS processing
You can use the CAS Processing tab to monitor unused CAS (Content Addressable Storage) objects in CAS buckets within a selected namespace over a specified time range. The unused CAS objects that are monitored by ECS include unreferenced blobs and expired reflections.
In Centera terminology, there are three types of CAS objects: blob, clip, and reflection.
• Blob: CAS data objects are called blobs (binary large objects). Blobs store data and can be referenced by data objects of a different type called clips. A blob is referenced by its Content Address (CA), which is stored in the Content Description File (CDF) that references the blob. The logical combination of a CDF and a blob is called a clip, and the hash of a CDF is the Clip-ID. There can be multiple clips for the same blob with different CDFs (different metadata but the same user data, that is, single-instance storage). When blobs are not referenced by live clips, these unreferenced blobs become garbage data.
• C-Clip: Combination of a CDF and its related blobs.
• Reflection: CDF of a deleted C-Clip. A reflection is created after the deletion of a C-Clip and provides an audit trail for each deleted C-Clip. Reflections may have expiration times. If there is no configured expiration time for a reflection, the reflection is never deleted.
Click the Filter drop-down menu to select a namespace containing CAS buckets and to set a date/time range to view the number and size of unreferenced blobs and expired reflections in CAS buckets.
Important: For ECS systems with existing CAS data that upgrade to 3.2.1, a CAS garbage data bootstrap process is automatically triggered after the upgrade. The bootstrap process builds the necessary references over the existing CAS data and can require a significant amount of time, depending on the amount of existing CAS data. During the bootstrap process, the unreferenced blob and reflection values do not change on the CAS Processing page. For example, you might see zero for the Unreferenced Blob Data Detected and Unreferenced Blobs Detected values. The values do not change until the bootstrap process is complete. If the values do not change over an extended period, call customer support.
When you search for a namespace (using the Search... option at the bottom of the list of namespaces in the Namespace drop-down field), the search functionality is based on prefixes only. For example, a search for fin returns finance-namespace-dev, while a search for dev returns nothing.
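The prefix-only behavior of the namespace search can be mirrored in a few lines. This sketch only imitates the matching rule described above; the function name is hypothetical, and case-sensitive matching is an assumption because the guide does not say how case is handled.

```python
from typing import Iterable, List


def search_namespaces(names: Iterable[str], query: str) -> List[str]:
    """Return the namespaces whose names start with the query string."""
    return [name for name in names if name.startswith(query)]


if __name__ == "__main__":
    namespaces = ["finance-namespace-dev", "hr-namespace-dev"]
    print(search_namespaces(namespaces, "fin"))  # prefix match
    print(search_namespaces(namespaces, "dev"))  # substring only, no match
```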
Table 10 CAS processing metrics
Attribute Description
Bucket The name of the bucket containing CAS data.
Unreferenced Blob Data Detected The amount of unreferenced blob data in the bucket (in bytes).
Unreferenced Blobs Detected The number of unreferenced blobs in the bucket.
Reflection Data Detected The amount of reflection data in the bucket (in bytes).
Reflections Expired The number of expired reflections in the bucket.
Actions History provides a graphic display of the unreferenced blob and reflection data. If the Current filter (default) is selected, the History button displays the data for the last 24 hours. History data is kept for 60 days.
Monitor system health
You can monitor system health from the ECS Portal Monitor > System Health page.
The System Health page has the following tabs:
• Hardware Health: View data about the status of nodes and disks.
• Process Health: View data about the status of the NIC, CPU, and memory.
• Node Rebalancing: View data about the status of node rebalancing operations.
Monitor hardware health
You can use the Hardware Health tab to obtain the health of disks and nodes.
About this task
The Hardware Health tab is accessed from the ECS Portal at Monitor > System Health > Hardware Health. The following states describe node health:
• Good: The node is in normal operating condition.
• Suspect: Either the node is transitioning from good to bad because of decreasing hardware metrics, or there is a problem with a lower-level hardware component, or the hardware is not detectable by the system because of connectivity problems.
• Bad: The node needs replacement.
Disk states have the following meanings:
• Good: The system is reading from and writing to the disk.
• Suspect: The system no longer writes to the disk but still reads from it. Swarms of suspect disks are likely caused by connectivity problems at a node. These disks transition back to Good when the connectivity issues clear up.
• Bad: The system neither reads from nor writes to the disk. Replace the disk. Once a disk has been identified as bad by the ECS system, it cannot be reused anywhere in the ECS system. Because of ECS data protection, when a disk fails, copies of the data that was once on the disk are re-created on other disks in the system. A bad disk represents only a loss of capacity to the system, not a loss of data. When the disk is replaced, the new disk does not have data restored to it. It becomes raw capacity for the system.
• Missing: The disk is a known disk that is unreachable. The disk may be transitioning between states, disconnected, or pulled.
• Removed: The system has completed recovery on the disk and removed it from the storage engine's list of valid disks. The history of all removed disks is displayed in the ECS UI.
• Not Accessible: If a node is not accessible, then all its disks have this status. It indicates that the actual status of the disk is not available to ECS.
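The disk states above can be summarized as a small lookup of what the system does with a disk in each state. This is a reading aid built only from the descriptions in the list; the enum and function names are hypothetical and are not ECS API identifiers. States for which the text does not describe read or write behavior return None.

```python
from enum import Enum
from typing import Optional, Tuple


class DiskState(Enum):
    GOOD = "Good"
    SUSPECT = "Suspect"
    BAD = "Bad"
    MISSING = "Missing"
    REMOVED = "Removed"
    NOT_ACCESSIBLE = "Not Accessible"


# (system reads from disk, system writes to disk), as described in the text.
_IO_BEHAVIOR = {
    DiskState.GOOD: (True, True),
    DiskState.SUSPECT: (True, False),
    DiskState.BAD: (False, False),
}


def io_behavior(state: DiskState) -> Optional[Tuple[bool, bool]]:
    """Return (reads, writes), or None where the text does not specify."""
    return _IO_BEHAVIOR.get(state)


def needs_replacement(state: DiskState) -> bool:
    """Only the Bad state carries an explicit instruction to replace the disk."""
    return state is DiskState.BAD


if __name__ == "__main__":
    for state in DiskState:
        print(state.value, io_behavior(state), needs_replacement(state))
```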
Note: The Current filter displays the latest available values. A date-time range filter displays average values over the specified range. Value data is kept for 60 days.
Procedure
1. Select Monitor > System Health and select the Hardware Health tab.
By default, the Offline Nodes subtab displays. This table may be empty if all nodes are online. Similarly, the Offline Disks subtab may be empty if all disks are online.
2. Select the Offline Nodes and Offline Disks subtabs to view a summary.
3. Select the All Nodes and Disks subtab to drill down to nodes and disks.
4. Click the node name to drill down to its disk health page.
Note: The Slot Info value always matches the physical slot ID in ECS U-Series, C-Series, and D-Series Appliances. This makes Slot Info useful for quickly locating a disk during disk replacement service. Some Certified Hardware installations with ECS Software may not report useful or reliable data for Slot Info.
Monitor process health
You can use the Process Health tab to obtain metrics that can help assess the health of the VDC, node, or node process.
About this task
The Process Health tab is accessed from the ECS Portal at Monitor > System Health > Process Health.
Note: The Current filter displays the latest available values. A date-time range filter displays average values over the specified range. Value data is kept for 60 days.
Table 11 VDC, node, and process health metrics
Metric label Level Description
Avg. NIC Bandwidth VDC and Node Average bandwidth of the network interface controller hardware used by the selected VDC or node.
Avg. CPU Usage (%) VDC and Node Average percentage of the CPU hardware used by the selected VDC or node.
Avg. Memory Usage VDC and Node Average usage of the aggregate memory available to the VDC or node.
Relative NIC (%) VDC and Node Percentage of the available bandwidth of the network interface controller hardware used by the selected VDC or node.
Relative Memory (%) VDC and Node Percentage of the memory used relative to the memory available to the selected VDC or node.
CPU (%) Process Percentage of the node's CPU used by the process. The list of processes tracked is not the complete list of processes running on the node. Therefore, the sum of the CPU used by the processes is not equal to the CPU usage shown for the node.
Memory Usage Process The memory used by the process.
Relative Memory (%) Process Percentage of the memory used relative to the memory available to the process.
Avg. # Thread Process Average number of threads used by the process.
Last Restart Process The last time the process restarted on the node.
Actions All History provides a graphic display of the data. If the Current filter is selected, History displays default history for the last 24 hours.
Table 12 ECS processes
Process Description
Blob Service (blobsvc) Manages the following tables: Object (OB), Listing (LS), and Repo Chunk Reference (RR).
Chunk Manager (cm) Manages the following tables: Chunk (CT) and Btree Reference (BR). Provides the logic to handle various events based on the chunk's current state and decide which state to transition to next.
Directory Table Query (dtquery) Provides REST APIs to get Directory Table (DT) details.
GeoReceiver (georeceiver) Receives requests for chunks in the current VDC that are not owned by the current VDC (secondary chunks). It then requests Chunk Manager to start an operation to track the copy chunk creation and select three replicas. The GeoReceiver process then writes the datastream to the three instances. On successful completion, it directs Chunk Manager to commit the copy chunk.
Head Service (headsvc) Manages object head protocols: S3, OpenStack Swift, EMC Atmos, CAS, and HDFS.
Metering (metering) Manages the following tables: Metering Aggregate (MA) and Metering Raw (MR).
Object Control Service (objcontrolsvc) Provides REST APIs for configuring the ECS cluster, managing ECS resources, and monitoring the system.
Provision Service (provisionsvc) Manages the provisioning of storage resources and user access. It handles user management, authorization, and authentication for all provisioning requests, resource management, and multi-tenancy support.
Resource Service (resourcesvc) Manages the Resource Table (RT), which handles replication groups, buckets, users, namespace information, and so on.
Record Manager (rm) Manages the Partition Record (PR) table (journal region).
Storage Service Manager (ssm) Manages the Storage Space (SS) table, which contains disk block usage and disk-to-chunk mapping. Interacts with one or more Storage Servers and manages the active/free chunks on the corresponding servers. Directs I/O operations to the disks.
Statistics Service (statsvc) Tracks various information on storage processes. These statistics can be used to monitor the system.
VNest (vnest) Provides distributed synchronization and group services. A subset of data nodes are group members responsible for serving the key/value requests. VNest services running on other nodes listen for configuration updates and are ready to be added to the group.
Procedure
1. Locate the table row for the target VDC.
2. To drill down to a table with rows for each node in the VDC, select the VDC name.
3. To drill down to a table with rows for each process running on the node, select a node endpoint.
4. To display data graphically, select the History button for the target VDC, node, or process.
Monitor node rebalancing status
Use the Node Rebalancing tab to monitor the status of data rebalancing operations when nodes are added to, or removed from, a cluster. Node rebalancing is enabled by default at installation. Contact your customer support representative to disable or re-enable this feature.
Before you begin
Access the Node Rebalancing tab from the ECS Portal at Monitor > System Health > Node Rebalancing. Amounts are shown in bytes (B).
Note: The Current filter displays the latest available values. A date-time range filter displays average values over the specified range. Value data is kept for 60 days.
A series of interactive graphs shows the amount of data rebalanced, the amount pending rebalancing, and the rate of rebalancing in bytes over time.
Node rebalancing works only for new nodes that are added to the cluster.
Rebalance data Description
Data Rebalanced Amount of data that has been rebalanced.
Pending Rebalancing Amount of data that is in the rebalance queue but has not been rebalanced yet.
Rate of Rebalance (per day) The incremental amount of data that was rebalanced during a specific time period. The default time period is one day.
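One way to read the Rate of Rebalance value is as the difference between consecutive readings of Data Rebalanced. The sketch below assumes one cumulative reading per period (one per day by default); how ECS actually samples the data is not described in this guide, and the function name is hypothetical.

```python
from typing import List, Sequence


def rebalance_rates(cumulative_bytes: Sequence[int]) -> List[int]:
    """Return the incremental bytes rebalanced between consecutive readings."""
    return [later - earlier
            for earlier, later in zip(cumulative_bytes, cumulative_bytes[1:])]


if __name__ == "__main__":
    # Hypothetical daily readings of Data Rebalanced, in bytes.
    print(rebalance_rates([0, 100, 250, 250]))
```

A run of zeros in the result while Pending Rebalancing is still nonzero would suggest that rebalancing has stalled.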
Monitor transactions
You can monitor requests and network performance for VDCs and nodes from the Monitor > Transactions page.
The Transactions page has two tabs:
• Requests: Monitor data requests and failure rates for VDCs and nodes.
• Performance: Monitor network performance for VDCs and nodes.
Monitor requests
You can use the Requests tab to monitor network traffic.
About this task
The Requests tab is accessed from the ECS Portal at Monitor > Transactions > Requests and provides information about request rates and errors from the ECS object heads (S3, OpenStack Swift, EMC Atmos, CAS, and so on). Information is available at the VDC and node level.
Table 13 Request metrics
Metric label Description
VDC The name of the VDC. Click to drill down to request metrics by node.
Successful Requests The number of data requests from all object heads that were successfully completed.
System Failures The number of data requests from all object heads classified as system failures. System failures are failed requests associated with hardware or service errors (typically an HTTP error code of 5xx).
User Failures The number of data requests from all object heads classified as user failures. User failures are known error types originating from the object heads (typically an HTTP error code of 4xx).
Failures % Rate The percentage of failures in the VDC or node.
Metric label Description
Total Requests The number of requests for the VDC.
Total Failures The number of failed requests for the VDC.
Metric label Description
Code An HTTP error code.
Type The type of failure: System or User.
Head An ECS object head.
Failures with this code The number of failures associated with the specified HTTP code.
Failures % Rate The percent of failures associated with the HTTP error code.
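The request metrics above can be approximated from a list of HTTP status codes. This sketch assumes that every 5xx code counts as a system failure and every 4xx code as a user failure, which the tables describe only as the typical case, and that Failures % Rate is failed requests divided by total requests. The function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional


def classify(status: int) -> Optional[str]:
    """Map an HTTP status code to a failure type, or None for non-failures."""
    if 500 <= status <= 599:
        return "System"
    if 400 <= status <= 499:
        return "User"
    return None


def summarize(statuses: Iterable[int]) -> dict:
    """Aggregate request counts and the overall failure percentage."""
    codes = list(statuses)
    kinds = Counter(kind for kind in map(classify, codes) if kind)
    total = len(codes)
    failures = kinds["System"] + kinds["User"]
    return {
        "total_requests": total,
        "successful": total - failures,
        "system_failures": kinds["System"],
        "user_failures": kinds["User"],
        "failures_pct": 100.0 * failures / total if total else 0.0,
    }


if __name__ == "__main__":
    print(summarize([200, 200, 404, 500]))
```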
Procedure
1. Select Monitor > Transactions and select the Requests tab.
2. Locate the target VDC name.
3. To show data for a node, select the VDC name to drill down to the nodes table and then select a node to show complete data for that node.
4. To apply a filter or sort the table, click on a column name.
Note: The Current filter displays the values for the last 24 hours. A date-time range filter displays the total request values over the specified range.
Monitor performance
You can use the Performance tab to obtain network traffic metrics at the VDC or the individual node level.
About this task
The Performance tab is accessed from the ECS Portal at Monitor > Transactions > Performance.
Note: The Current filter is selected by default and displays the latest available values. A Date Time Range filter displays average values over a selected range or over a custom time range.
Table 14 Network traffic metrics
Metric label Description
VDC The name of the VDC. Click to see the performance for each node.
Read Latency (ms) Average latency for reads in milliseconds. The read latency value is calculated as time to first byte. The latency value indicates the request processing time on the node; it does not include data transfer.
Write Latency (ms) Average latency for writes in milliseconds. The write latency value is calculated as the time from last byte to transaction complete. The latency value indicates the request processing time on the node; it does not include data transfer.
Read Bandwidth Bandwidth for reads.
Write Bandwidth Bandwidth for writes.
Read Transactions (per second) Read transactions per second.
Write Transactions (per second) Write transactions per second.
Actions History provides a graphic display of the data. If the Current filter is selected, the History button displays default history for the last 24 hours.
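The two latency definitions in the table measure different intervals of a request, which the following sketch makes explicit. The timestamp field names are hypothetical, and treating the arrival of the request at the node as the start of a read is an assumption; the table states only that read latency is time to first byte and write latency runs from last byte to transaction complete.

```python
from dataclasses import dataclass


@dataclass
class ReadTiming:
    request_received_ms: float  # request arrives at the node
    first_byte_sent_ms: float   # first byte of the response leaves the node
    last_byte_sent_ms: float    # data transfer ends (not part of latency)


@dataclass
class WriteTiming:
    request_received_ms: float      # upload starts (not part of latency)
    last_byte_received_ms: float    # last byte of the payload arrives
    transaction_complete_ms: float  # node finishes processing the write


def read_latency_ms(t: ReadTiming) -> float:
    """Time to first byte; excludes the transfer that follows."""
    return t.first_byte_sent_ms - t.request_received_ms


def write_latency_ms(t: WriteTiming) -> float:
    """Time from last byte to completion; excludes the preceding transfer."""
    return t.transaction_complete_ms - t.last_byte_received_ms


if __name__ == "__main__":
    print(read_latency_ms(ReadTiming(0.0, 12.0, 500.0)))
    print(write_latency_ms(WriteTiming(0.0, 800.0, 815.0)))
```

Because transfer time is excluded in both cases, a large object moved over a slow link can show low latency here while still taking a long time end to end.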
Procedure
1. Select Monitor > Transactions > Performance.
2. Locate the target VDC name.
3. To drill down to the nodes display, select the VDC name.
4. To display the performance data graphically, select the History button for the target VDC or node.
Monitor recovery status
You can use the Recovery Status page to monitor the data recovered by the system.
About this task
The Recovery Status page is accessed from the ECS Portal at Monitor > Recovery Status. Recovery is the process of rebuilding data after any local condition that results in bad data (chunks). This table includes one row for each storage pool in the local VDC.
Note: The Current filter displays the latest available values. A date-time range filter displays average values over the specified range.
Table 15 Recovery metrics
Column Description
Storage Pool The storage pools for the current VDC.
Amount of Data to be Recovered With the Current filter selected, this is the logical size of the data yet to be recovered. When a historical period is selected as the filter, Amount of Data to be Recovered is the average amount of data pending recovery during the selected period of time. For example, if the first hourly snapshot of the data showed 400 GB of data to be recovered in a historical time period and every other snapshot showed 0 GB waiting to be recovered, the value of this field would be 400 GB divided by the total number of hourly snapshots in the period.
Recovery Rate Rate at which data is being recovered in the specified storage pool.
Time to Completion Estimated time to complete the recovery, extrapolated from the current recovery rate.
Actions History provides a graphical display of the data. If the Current filter is selected, the History button displays default history for the last 24 hours.
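The 400 GB example for Amount of Data to be Recovered can be worked through directly. The sketch below assumes a 24-hour historical period with one snapshot per hour; the period length and the function name are chosen for illustration only.

```python
from typing import Sequence


def average_pending_gb(hourly_snapshots_gb: Sequence[float]) -> float:
    """Average the data pending recovery across all snapshots in the period."""
    if not hourly_snapshots_gb:
        raise ValueError("at least one snapshot is required")
    return sum(hourly_snapshots_gb) / len(hourly_snapshots_gb)


if __name__ == "__main__":
    # First hourly snapshot shows 400 GB pending; the other 23 show 0 GB.
    snapshots = [400.0] + [0.0] * 23
    print(f"{average_pending_gb(snapshots):.2f} GB")
```

A short recovery burst therefore appears as a small average when the filter covers a long period, so a low value under a historical filter does not mean that little data was recovered.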
Procedure
1. Select Monitor > Recovery Status.
2. Locate the table row for the target storage pool.
3. To show the recovery status history graph, select the History button.
Monitor disk bandwidth
You can use the Disk Bandwidth page to monitor the disk usage metrics at the VDC or individual node level.
About this task
The Disk Bandwidth page is accessed from the ECS Portal at Monitor > Disk Bandwidth. There is one row for read and another for write for each VDC or node. By default, the History charts show data for the last 24 hours.
Note: The Current filter displays the latest available values. A date-time range filter displays average values over the specified range.
Table 16 Disk Bandwidth metrics
Metric label Description
VDC The VDC that the bandwidth data relates to.
Read or Write Indicates whether the row describes read data or write data.
Nodes The number of nodes in the VDC. You can click the nodes number to see the disk bandwidth metrics for each node. There is no Nodes column when you have drilled down into the Nodes display for a VDC.
Total Total disk bandwidth used for either read or write operations.
Hardware Recovery Rate at which disk bandwidth is used to recover data after a hardware failure.
Erasure Encoding Rate at which disk bandwidth is used in system erasure coding operations.
XOR Rate at which disk bandwidth is used in the system's XOR data protection operations. Note that XOR operations occur for systems with three or more sites (VDCs).
Consistency Checker Rate at which disk bandwidth is used to check for inconsistencies between protected data and its replicas.
Geo Rate at which disk bandwidth is used to support geo-replication operations.
User Traffic Rate at which disk bandwidth is used by object users.
Actions History provides a graphic display of the data. If the Current filter is selected, the History button displays default history for the last 24 hours.
Procedure
1. Select Monitor > Disk Bandwidth.
2. Locate the target VDC name and either the Read or Write table row for that VDC.
3. To show data for nodes, select Nodes Count to drill down to a table with rows for the nodes in the VDC.
4. To display the disk bandwidth history charts, select the History button for the VDC or node.
Introduction to geo-replication monitoring
You can use the Geo Replication page to monitor the replication of data across the VDCs that make up a replication group.
The Geo Replication page is accessed from the ECS Portal at Monitor > Geo Replication and provides four tabs:
• Rate and Chunks
• Recovery Point Objective (RPO)
• Failover Processing
• Bootstrap Processing
Monitor geo replication: Rate and Chunks
You can use the Rate and Chunks tab to obtain metrics about the network traffic for geo-replication and the chunks waiting for replication by a replication group or remote VDC.
The Rate and Chunks tab is accessed from the ECS Portal at Monitor > Geo Replication > Rate and Chunks.
Table 17 Rate and Chunks columns
Column Description
Replication Group Lists the replication groups of which this VDC is a member. Click a replication group to see a table of remote VDCs in the replication group and their statistics. Click the Replication Groups link above the table to return to the default view.
Write Traffic The current rate of writes to all remote VDCs or an individual remote VDC in the replication group.
Read Traffic The current rate of reads to all remote VDCs or an individual remote VDC in the replication group.
User Data Pending Replication The total logical size of user data waiting for replication for the replication group or remote VDC.
Metadata Pending Replication The total logical size of metadata waiting for replication for the replication group or remote VDC.
Data Pending XOR The total logical size of all data waiting to be processed by the XOR compression algorithm in the local VDC for the replication group or remote VDC.
Monitor geo replication: Recovery Point Objective (RPO)
You can use the RPO tab to view the recovery point objective for a replication group and its remote VDCs. The RPO refers to the point in time in the past to which you can recover. The value presented is the oldest data at risk of being lost if a local VDC fails before replication is complete.
The RPO tab is accessed from the ECS Portal at Monitor > Geo Replication > RPO.
Table 18 RPO columns
Column | Description
Remote Replication Group\Remote VDC | At the VDC level, lists all remote replication groups of which the local VDC is a member. At the replication group level, this column lists the remote VDCs in the replication group.
Overall RPO | The recent time period for which data might be lost in the event of a local VDC failure.
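One way to read the Overall RPO value is as the age of the oldest data that has not yet been replicated. The following Python sketch illustrates that reading only; the function name `overall_rpo` and its inputs are hypothetical and are not part of ECS or its API.

```python
from datetime import datetime, timedelta

def overall_rpo(now, pending_write_times):
    """Return the age of the oldest write still awaiting replication.

    `pending_write_times` is a hypothetical list of datetimes, one per
    write that has not yet been replicated to a remote VDC.
    """
    if not pending_write_times:
        return timedelta(0)  # nothing pending, so no data is at risk
    return now - min(pending_write_times)

now = datetime(2019, 9, 1, 12, 0, 0)
pending = [datetime(2019, 9, 1, 11, 58, 30), datetime(2019, 9, 1, 11, 59, 45)]
print(overall_rpo(now, pending))  # prints 0:01:30
```

In this example the oldest unreplicated write is 90 seconds old, so up to 90 seconds of data would be at risk if the local VDC failed at that moment.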
Monitor geo replication: Failover Processing
You can use the Failover Processing tab to view the metrics on the process to rereplicate data following permanent failure of a remote VDC.
The Failover Processing tab is accessed from the ECS Portal at Monitor > Geo Replication > Failover Processing.
Table 19 Failover columns
Field | Description
Replication Group | Lists the replication groups that the local VDC is a member of.
Failed VDC | Identifies a failed VDC that is part of the replication group.
User Data Pending Re-replication | When a VDC fails, user data chunks replicated to the failed VDC have to be re-replicated to a different VDC. This field reports the logical size of all user data (repository) chunks waiting for re-replication to a different VDC.
Metadata Pending Re-replication | When a VDC fails, metadata chunks replicated to the failed VDC have to be re-replicated to a different VDC. This field reports the logical size of all metadata chunks waiting for re-replication to a different VDC.
Data Pending XOR Decoding | Shows the count and total logical size of chunks waiting to be retrieved by the XOR compression scheme.
Failover State | The failover state. Can be:
• BLIND_REPLAY_DONE
• REPLICATION_CHECK_DONE: The process that makes sure that all replication chunks are in an acceptable state and replication has completed successfully.
• CONSISTENCY_CHECK_DONE: The process that makes sure that all system metadata is fully consistent with other replicated data and has completed successfully.
• ZONE_SYNC_DONE: The synchronization of the failed VDC has completed successfully.
• ZONE_BOOTSTRAP_DONE: The bootstrap process on the failed VDC has completed successfully.
• ZONE_FAILOVER_DONE: The failover process has completed successfully.
Failover Progress | A percentage indicator for the overall status of the failover process.
Monitor geo replication: Bootstrap Processing
You can use the Bootstrap Processing tab to monitor the copying of user data and metadata to a VDC that has been added to a replication group.
The Bootstrap Processing tab is accessed from the ECS Portal at Monitor > Geo Replication > Bootstrap Processing.
Table 20 Bootstrap Processing columns
Column | Description
Replication Group | This column provides the list of replication groups of which the local VDC is a member and that are adding new VDCs. Each row provides metrics for the specified replication group.
Added VDC | The VDC being added to the specified replication group.
User Data Pending Replication | The logical size of all user data (repository) chunks waiting for replication to the new VDC.
Metadata Pending Replication | The logical size of all system metadata waiting for replication to the new VDC.
Bootstrap State | The bootstrap state. Can be:
• BTreeScan
• ReplicateBTree
• ReplicateBTreeMarker
• ReplicateJournal
• Done
Bootstrap Progress (%) | The completion percent of the entire bootstrap process.
Cloud hosted VDC monitoring
ECS provides support for identifying when a site is hosted or on-premise, and the ECS Management REST API provides support for retrieving information about the utilization and performance of hosted sites.
Where an ECS system includes a hosted site, the ECS Portal displays a top-level Cloud menu that enables administrators to see how the hosted sites are used as part of replication groups and to view the traffic to and from the hosted site in terms of bandwidth utilization and latency. The portal displays also show the traffic to and from on-premise sites to allow comparison with hosted site traffic.
The Cloud menu is not shown if the ECS system uses only on-premise sites.
Cloud topology
You can use the Cloud topology summary information to see how the ECS system is making use of hosted VDCs.
The Cloud > Topology page shows the hosted VDCs that are part of an ECS federated system, and shows the relationship between the hosted VDC and any on-premise VDCs.
Cloud Hosted VDCs
The Cloud Hosted VDCs table shows the hosted VDCs that are present in the ECS system. Currently ECS supports a single hosted site.
Related On-Premise VDCs
The Related On-Premise VDCs table shows the on-premise VDCs that are part of the ECS federation.
Related Replication Groups
The Related Replication Groups table shows the replication groups that contain a storage pool contributed by a selected hosted VDC. The hosted VDC is selected in the Cloud Hosted VDCs table.
A primary use case for a hosted VDC is the Passive configuration, in which the hosted VDC provides a site for replication data but cannot be used as an active site by users. However, where the active operation of the hosted VDC is allowed, the hosted VDC can be included in replication groups where the type is Active.
The table shows the replication group type and the VDC storage pools that are contributing to the replication group, at least one of which will be from a hosted VDC.
Cloud replication traffic
You can use the cloud replication traffic information to see the performance of hosted VDCs and compare it with that of on-premise VDCs.
The Cloud > Replication page shows replication traffic by VDC and by replication group.
Note: The Current filter displays the latest available values. A date-time range filter displays average values over the specified range.
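The difference between the two filters in the note can be sketched in Python as follows. This is only an illustration under assumptions: the sample list, the function names `current_value` and `range_average`, and the latency figures are hypothetical and do not come from ECS.

```python
from datetime import datetime
from statistics import mean

def current_value(samples):
    """Return the value of the most recent sample (the Current filter)."""
    return max(samples, key=lambda sample: sample[0])[1]

def range_average(samples, start, end):
    """Return the mean of the values sampled within [start, end], or None."""
    values = [value for when, value in samples if start <= when <= end]
    return mean(values) if values else None

# Hypothetical write-latency samples in milliseconds.
samples = [
    (datetime(2019, 9, 1, 10, 0), 12.0),
    (datetime(2019, 9, 1, 11, 0), 18.0),
    (datetime(2019, 9, 1, 12, 0), 30.0),
]
print(current_value(samples))  # prints 30.0
print(range_average(samples,
                    datetime(2019, 9, 1, 10, 0),
                    datetime(2019, 9, 1, 11, 0)))  # prints 15.0
```

A spike in the latest sample shows up immediately under Current, while a date-time range smooths it into the average.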
Virtual Data Centers
The Virtual Data Centers tab shows each VDC, whether hosted or on-premise, and provides aggregated traffic figures for all replication groups associated with a VDC.
Table 21 Replication traffic by VDC
Attribute | Description
Read Latency | The average latency in milliseconds for reads from all replication groups associated with the selected VDC.
Write Latency | The average latency in milliseconds for writes to all replication groups associated with the selected VDC.
Read Bandwidth | The bandwidth utilized by reads from all replication groups associated with the selected VDC.
Write Bandwidth | The bandwidth utilized by writes to all replication groups associated with the selected VDC.
Replication Groups
The Replication Groups tab shows each replication group and provides traffic data for a VDC for each replication group that it contributes to. A VDC might have a storage pool that is in more than one replication group, and this display allows you to see the traffic associated with each replication group.
Table 22 Replication traffic by replication group
Attribute | Description
Read Latency | The average latency in milliseconds for reads from the selected VDC that relate to the specified replication group.
Write Latency | The average latency in milliseconds for writes to the selected VDC that relate to the specified replication group.
Read Bandwidth | The bandwidth utilized by reads from the selected VDC that relate to the specified replication group.
Write Bandwidth | The bandwidth utilized by writes to the selected VDC that relate to the specified replication group.
CHAPTER 3
Monitoring Events: Audits and Alerts
• About event monitoring
• Monitor audit data
• Audit messages
• Monitor alerts
• Alert policy
• Acknowledge all alerts
• Alert messages
About event monitoring
You can view the available event monitoring messages (audit and alert) from the ECS Portal.
The Monitor > Events page has two tabs:
• Audit: All activity by users working with the portal, the ECS REST APIs, and the ECS CLI. Other audit types include upgrade activities.
• Alerts: Alerts raised by the ECS system.
Event data through the ECS Portal is limited to 30 days. If you need to keep event data for longer periods, consider using the ViPR SRM product.
Monitor audit data
Use the Monitor > Events > Audit tab to view and manage audit data.
About this task
See the List of audit messages.
Procedure
1. Select the Audit tab.
2. Optionally, select Filter.
3. Specify a Date Time Range and adjust the From and To fields and time fields. When creating a custom date-time range, select Current Time to use the current date and time as the end of your range.
4. Select a Namespace.
5. Click Apply.
Note: The newest audit messages appear at the top of the table.
Audit messages
List of the audit messages generated by ECS.
Table 23 ECS audit messages
Service | Audit item | Audit message
Alert | sent_alert | Alert "${alertMessage}" with symptom code ${symptomCode} triggered
Auth Provider | new_authentication_provider_added | New authentication provider ${resourceId} added
Auth Provider | authentication_provider_deleted | Authentication provider ${resourceId} deleted
Auth Provider | authentication_provider_updated | Existing Authentication provider ${resourceId} updated
Bucket | bucket_created | Bucket ${resourceId} has been created
Bucket | bucket_deleted | Bucket ${resourceId} has been deleted
Bucket | bucket_updated | Bucket ${resourceId} has been updated
Bucket | bucket_ACL_set | Bucket ${resourceId} ACLs have changed
Bucket | bucket_owner_changed | Owner of ${resourceId} bucket has changed
Bucket | bucket_versioning_set | Versioning has been enabled on ${resourceId} bucket
Bucket | bucket_versioning_unset | Versioning has been suspended on ${resourceId} bucket
Bucket | bucket_versioning_source_set | Bucket ${resourceId} versioning source set
Bucket | bucket_metadata_set | Metadata on ${resourceId} bucket has been changed
Bucket | bucket_head_metadata_set | Bucket ${resourceId} head metadata set
Bucket | bucket_expiration_policy_set | Bucket ${resourceId} expiration policy has updated
Bucket | bucket_expiration_policy_deleted | Bucket ${resourceId} expiration policy has been deleted
Bucket | bucket_cors_config_set | Bucket ${resourceId} CORS rules have been changed
Bucket | bucket_cors_config_deleted | Bucket ${resourceId} CORS rules have been deleted
Bucket | notification_size_exceeded_on_bucket | Notification size has been exceeded on ${resourceId} bucket
Bucket | block_size_exceeded_on_bucket | Block size has been exceeded on ${resourceId} bucket
Bucket | bucket_set_quota | Bucket ${resourceId} quota has been updated with notification size as ${notificationSize} and block size as ${blockSize}
Bucket | bucket_policy_created | Bucket ${resourceId} policy has been created
Bucket | bucket_policy_updated | Bucket ${resourceId} policy has been updated
Bucket | bucket_policy_deleted | Bucket ${resourceId} policy has been deleted
Cluster | cluster_set | Cluster id ${resourceId} has been set
Fabric | | InstallerServiceOperation[kind=INSTALLER_SERVICE_OPERATION,host=${hostName},timestamp=${timestamp},operationType=${operation},args=${arguments of operation},status=SUCCEEDED,fqdn=${fqdn of host},version=${installer version}]
Fabric | | NodeMaintenanceMode[kind=NodeMaintenanceMode,timestamp=${timestamp},agentId=${agendId},fqdn=${fqdn},status=${MaintenanceStatus}]
License | user_added_license | License ${resourceId} has been added
License | managed_capacity_exceeded | Managed capacity has exceeded licensed ${resourceId} capacity
License | license_expired | License ${resourceId} has expired
Local user | domain_group_mapping_created | Domain group ${resourceId} to ${roles} role(s) mapping is added
Local user | domain_group_mapping_created_no_roles | Domain group ${resourceId} without role mappings is added
Local user | domain_group_mapping_updated | Domain group ${resourceId} roles mapping is changed to ${roles} role(s)
Local user | domain_group_mapping_updated_no_roles | All roles of domain group ${resourceId} mapping have been removed
Local user | domain_user_mapping_created | Domain user ${resourceId} to ${roles} role(s) mapping is added
Local user | domain_user_mapping_created_no_roles | Domain user ${resourceId} without role mappings is added
Local user | domain_user_mapping_deleted | Domain user ${resourceId} mapping is removed
Local user | domain_user_mapping_updated | Domain user ${resourceId} role mapping is changed to ${roles} role(s)
Local user | domain_user_mapping_updated_no_roles | All roles of domain user ${resourceId} mapping have been removed
Local user | local_user_created | Management user ${resourceId} with ${roles} role(s) has been created
Local user | local_user_created_no_roles | Management user ${resourceId} without roles has been created
Local user | local_user_deleted | Management user ${resourceId} has been deleted
Local user | local_user_password_changed | Credential of management user ${resourceId} has changed
Local user | local_user_updated | Roles of management user ${resourceId} have been changed to ${roles}
Local user | local_user_roles_updated_no_roles | All roles of management user ${resourceId} have been removed
Locked | vdc_lock_successful | VDC lock was successful
Locked | vdc_lock_failed | VDC lock failed
Locked | node_lock_successful | Lock successful for node ${resourceId}
Locked | node_lock_failed | Lock failed for node ${resourceId}
Locked | node_unlock_successful | Unlock successful for node ${resourceId}
Locked | node_unlock_failed | Unlock failed for node ${resourceId}
Login | login_successful | User ${resourceId} logged in successfully
Login | login_failed | User ${resourceId} failed to login
Login | user_token_logout | User logged out token ${resourceId}
Login | user_logout | All user tokens have logged out
Namespace | block_size_exceeded_on_namespace | Block size has been exceeded on ${resourceId} namespace
Namespace | namespace_admin_group_mappings_updated | Namespace ${resourceId} admin group mappings updated to following groups: ${groups}
Namespace | namespace_admin_group_mappings_updated_no_groups | Namespace ${resourceId} admin groups mappings updated to an empty list
Namespace | namespace_admin_user_mappings_updated | Namespace ${resourceId} admin mappings updated to following users: ${admins}
Namespace | namespace_admin_user_mappings_updated_no_admins | Namespace ${resourceId} admin mappings updated to an empty list
Namespace | namespace_created | Namespace ${resourceId} has been created
Namespace | namespace_deleted | Namespace ${resourceId} has been deleted
Namespace | namespace_updated | Namespace ${resourceId} has been updated
Namespace | notification_size_exceeded_on_namespace | Notification size has been exceeded on ${resourceId} namespace
NFS | ugmapping_created | ${type} mapping ${ugMappingName} --> ${resourceId} has been created
NFS | ugmapping_deleted | ${type} mapping ${ugMappingName} --> ${resourceId} has been deleted
NFS | export_created | Export with export path ${exportPath} has been created
NFS | export_deleted | Export with export path ${exportPath} has been deleted
NFS | export_updated | Export with export path ${exportPath} has been updated
Replication Group | replication_group_created | Replication Group ${resourceId} has been created
Replication Group | replication_group_updated | Replication Group ${resourceId} has been updated
Security | command_exec_insufficient_permission | Attempt to execute a command ${command} from ${host} without right permissions
SNMP | snmp_v2_target_created | SNMP target ${snmpTarget} with Community '${community}' is added
SNMP | snmp_v3_target_created | SNMP target ${snmpTarget} with Username '${username}', Authentication(${authProtocol}) and Privacy(${privProtocol})
SNMP | snmp_target_deleted | SNMP target ${snmpTarget} is deleted
SNMP | snmp_engineid_updated | SNMP agent EngineID is set to ${engineId}
SNMP | snmp_v2_target_updated | SNMP target ${oldSnmpTarget} is updated as ${newSnmpTarget} with Community string ${community}
SNMP | snmp_v3_target_updated | SNMP target ${oldSnmpTarget} is updated as ${newSnmpTarget} with Username ${username}, Authentication(${authProtocol}) and Privacy(${privProtocol})
Storage Pool | storage_pool_created | Storage Pool ${resourceId} has been created
Storage Pool | storage_pool_deleted | Storage Pool ${resourceId} has been deleted
Storage Pool | storage_pool_updated | Storage Pool ${resourceId} has been updated
Syslog | syslog_server_added | Syslog server ${protocol}://${host}:${port} with severity ${severity} is added into the configuration
Syslog | syslog_server_updated | Syslog server ${old_protocol}://${old_host}:${old_port} is updated to ${protocol}://${host}:${port} with severity ${severity} in the configuration
Syslog | syslog_server_deleted | Syslog server ${protocol}://${host}:${port} is removed from the configuration
Transformation | transformation_created_message | Transformation created
Transformation | transformation_updated_message | Transformation updated
Transformation | transformation_pre_check_started_message | Transformation precheck started
Transformation | transformation_enumeration_started_message | Transformation enumeration started
Transformation | transformation_indexing_started_message | Transformation indexing started
Transformation | transformation_migration_started_message | Transformation migration started
Transformation | transformation_recovery_migration_started_message | Transformation recovery migration started
Transformation | transformation_reconciliation_started_message | Transformation reconciliation started
Transformation | transformation_sources_updated_message | Transformation sources updated
Transformation | transformation_deleted_message | Transformation deleted
Transformation | transformation_retried_message | Transformation %s retried
Transformation | transformation_canceled_message | Transformation %s canceled
Transformation | transformation_profile_mappings_updated_message | Transformation profile mappings updated
User | change_password_failed | User ${resourceId} failed to change password, reason: ${reason}
User | user_created | Object user ${resourceId} has been created
User | user_deleted | Object user ${resourceId} has been deleted
User | user_set_password | New password has been set for object user ${resourceId}
User | user_delete_password | Password has been deleted for object user ${resourceId}
User | user_set_metadata | New metadata has been set for object user ${resourceId}
User | user_locked | Object user ${resourceId} has been locked
User | user_unlocked | Object user ${resourceId} has been unlocked
User | user_set_user_tag | User Tag has been set for object user ${resourceId}
User | user_delete_user_tag | User Tag has been deleted for object user ${resourceId}
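The audit messages in Table 23 are templates in which `${...}` placeholders stand for values filled in at run time. As an illustration of how such a template expands, the following Python sketch uses the standard library `string.Template` class, which happens to share the `${name}` syntax. The `AUDIT_TEMPLATES` dictionary and the `render_audit` helper are hypothetical; ECS renders these messages internally, and this is not its implementation.

```python
from string import Template

# A few message templates copied from Table 23, keyed by audit item.
AUDIT_TEMPLATES = {
    "bucket_created": "Bucket ${resourceId} has been created",
    "login_failed": "User ${resourceId} failed to login",
    "bucket_set_quota": (
        "Bucket ${resourceId} quota has been updated with notification "
        "size as ${notificationSize} and block size as ${blockSize}"
    ),
}

def render_audit(audit_item, **fields):
    """Fill in the placeholders of one audit message template.

    safe_substitute leaves any placeholder without a value untouched
    instead of raising an error.
    """
    return Template(AUDIT_TEMPLATES[audit_item]).safe_substitute(fields)

print(render_audit("bucket_created", resourceId="logs"))
# prints: Bucket logs has been created
```

When searching exported audit data, the fixed text around the placeholders (for example "has been created") is the part that stays constant across events.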
Monitor alerts
You can use the Monitor > Events > Alerts tab to view and manage system alerts.
About this task
See the list of alert messages.
Alert message Severity labels have the following meanings:
• Critical: Messages about conditions that require immediate attention
• Error: Messages about error conditions that report either a physical failure or a software failure
• Warning: Messages about less than optimal conditions
• Info: Routine status messages
Procedure
1. Select Alerts.
2. Optionally, click Filter.
3. Select your filters. The alerts filter adds filtering by Severity and Type, and an option to Show Acknowledged Alerts, which retains the display of an alert even after it is acknowledged by the user. When creating a custom date-time range, select Current Time to use the current date and time as the end of your range.
Alert types must be entered exactly as described in the following table:
Table 24 Alert types
Alert Type (type exactly as shown) | Description
Fabric | Raised when system issues are detected.
Geo | Raised for geo-replication alerts.
License | Raised for license, capacity, or capacity entitlement exceeded alerts.
Notify | Raised for miscellaneous alerts.
Quota | Raised when soft or hard quota limits are exceeded (SoftQuotaLimitExceeded or HardQuotaLimitExceeded) for a bucket or for a namespace.
RPO | Raised when the recovery point objective (RPO) is greater than the RPO threshold.
Capacity Alerting | Raised when the remaining capacity of the storage pool reaches a set threshold.
Capacity License Threshold | Raised if the system capacity is greater than the licensed capacity.
CHUNK_NOT_FOUND | Raised when chunk data is not found.
DTSTATUS_RECENT_FAILURE | Raised when the status of a data table is bad.
Table 25 ESRS dial home types
Alert Type (type exactly as shown) | Description
TestDialHome | Raised to test that ESRS connections can be established and that the call home functionality works.
4. Select a Namespace.
5. Click Apply.
6. Next to each event, click the Acknowledge Alert button to acknowledge and dismiss the message. Messages that have previously been acknowledged are displayed when the Show Acknowledged Alerts filter option is selected, but the Acknowledge Alert button is not displayed for these rows.
7. You can click the Description of an alert, when it is formatted as a link, to be taken to a relevant page in the portal.
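The filtering described in step 3 can be pictured with a small Python sketch. It assumes a hypothetical in-memory list of alerts represented as dictionaries; the `filter_alerts` function and the dictionary keys are illustrative only and do not reflect the portal's implementation. The type names are copied from Table 24 and Table 25 and are compared exactly, including case.

```python
# Alert type names as listed in Table 24 and Table 25.
ALERT_TYPES = {
    "Fabric", "Geo", "License", "Notify", "Quota", "RPO",
    "Capacity Alerting", "Capacity License Threshold",
    "CHUNK_NOT_FOUND", "DTSTATUS_RECENT_FAILURE", "TestDialHome",
}

def filter_alerts(alerts, severity=None, alert_type=None, show_acknowledged=False):
    """Return the alerts that pass the selected filters.

    `alerts` is a hypothetical list of dicts with "severity", "type",
    and "acknowledged" keys.
    """
    if alert_type is not None and alert_type not in ALERT_TYPES:
        raise ValueError(f"unknown alert type: {alert_type!r}")
    result = []
    for alert in alerts:
        if alert["acknowledged"] and not show_acknowledged:
            continue  # acknowledged alerts are hidden unless requested
        if severity is not None and alert["severity"] != severity:
            continue
        if alert_type is not None and alert["type"] != alert_type:
            continue
        result.append(alert)
    return result

alerts = [
    {"severity": "Error", "type": "Quota", "acknowledged": False},
    {"severity": "Warning", "type": "RPO", "acknowledged": True},
    {"severity": "Warning", "type": "Quota", "acknowledged": False},
]
print(len(filter_alerts(alerts, alert_type="Quota")))  # prints 2
```

Rejecting a misspelled type such as "quota" mirrors the requirement that alert types be entered exactly as shown.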
Alert policy
Alert policies are created to alert about metrics, and are triggered when the specified conditions are met. Alert policies are created per VDC.
You can use the Settings > Alerts Policy page to view alert policies.
There are two types of alert policy:
System alert policies
• System alert policies are precreated and exist in ECS during deployment.
• All the metrics have an associated system alert policy.
• System alert policies cannot be updated or deleted.
• System alert policies can be enabled or disabled.
• Alerts are sent to the UI and all channels (SNMP, SYSLOG, and Secure Remote Services).
User-defined alert policies
• You can create user-defined alert policies for the required metrics.
• Alerts are sent to the UI and customer channels (SNMP and SYSLOG).
New alert policy
You can use the Settings > Alerts Policy > New Alert Policy tab to create user-defined alert policies.
Procedure
1. Select New Alert Policy.
2. Give a unique policy name.
3. Use the metric type drop-down menu to select a metric type.
Metric Type is a grouping of statistics. It consists of:
• Btree Statistics
• CAS GC Statistics
• Geo Replication Statistics
• Metering Statistics
• Garbage Collection Statistics
• EKM
4. Use the metric name drop-down menu to select a metric name.
5. Select level.
a. To inspect metrics at the node level, select Node.
b. To inspect metrics at the VDC level, select VDC.
6. Select polling interval.
Polling Interval determines how frequently data is checked. Each polling interval gives one data point, which is compared against the specified condition. When the condition is met, an alert is triggered.
7. Select instances.
Instances describe how many data points to check and how many of them must match the specified conditions to trigger an alert.
For metrics where historical data is not available, only the latest data is used.
8. Select conditions.
You can set the threshold values and alert type with Conditions.
The alerts can be a Warning Alert, an Error Alert, or a Critical Alert.
9. To add more conditions with multiple thresholds and with different alert levels, select Add Condition.
10. Click Save.
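The interaction of polling interval, instances, and conditions in steps 6 to 9 can be sketched in Python. This is a minimal illustration under stated assumptions: the `evaluate_policy` function, its parameters, and the rule of reporting the most severe matching condition are hypothetical and are not taken from ECS.

```python
import operator

# Comparison operators a condition may use (an assumption for this sketch).
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}
SEVERITY_ORDER = ["Warning", "Error", "Critical"]

def evaluate_policy(data_points, conditions, instances, matches_required):
    """Return the most severe alert level triggered, or None.

    data_points: polled values, oldest first, one per polling interval.
    conditions: list of (severity, operator symbol, threshold) tuples.
    instances: how many of the latest data points to check.
    matches_required: how many checked points must meet a condition.
    """
    window = data_points[-instances:]
    triggered = None
    for severity, op, threshold in conditions:
        hits = sum(1 for value in window if OPS[op](value, threshold))
        if hits >= matches_required:
            if (triggered is None or
                    SEVERITY_ORDER.index(severity) > SEVERITY_ORDER.index(triggered)):
                triggered = severity
    return triggered

# Hypothetical CPU usage percentages from five polling intervals.
cpu = [50, 82, 91, 85, 96]
conditions = [("Warning", ">", 80), ("Error", ">", 90), ("Critical", ">", 95)]
print(evaluate_policy(cpu, conditions, instances=5, matches_required=3))  # prints Warning
```

Here four of the five points exceed 80 but only two exceed 90, so requiring three matches yields a Warning. Requiring fewer matches makes the policy more sensitive to short spikes.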
Acknowledge all alerts
Alerts can be acknowledged individually or in bulk using the Acknowledge All Alerts button. You can choose to acknowledge all the alerts or acknowledge a subset of the alerts using filters.
About this task
You can use the Monitor > Events > Alerts tab to acknowledge alerts.
Procedure
1. To acknowledge all alerts, click the Acknowledge All Alerts button.
a. To acknowledge a subset of all alerts, use the table filter to filter by a combination of date and time, severity, type, or namespace, and then click Acknowledge All Alerts.
The bulk alert acknowledgment process runs in the background and may take a few minutes to complete. Only one bulk alert acknowledgment can be processed at a time.
2. On the confirmation pop-up screen, click OK to initiate acknowledgment, or click Cancel to exit without acknowledgment.
Clicking Acknowledge All Alerts initiates a background task to acknowledge all the matching alerts. The response indicates whether the task was initiated successfully or failed.
To keep a record of the acknowledge all alerts request, a new informational alert of type Bulk Alert Ack is generated after the acknowledgment completes. Clear the filter and manually refresh the table.
Alert messages
List of the alert messages that ECS uses.
Alert message Severity labels have the following meanings:
• Critical: Messages about conditions that require immediate attention
• Error: Messages about error conditions that report either a physical failure or a software failure
• Warning: Messages about less than optimal conditions
• Info: Routine status messages
Table 26 ECS Object alert messages
Alert | Severity | Symptom code | Sent to... | Message | Description | Action
Btree chunk level GC | Warning | 1321 | Portal, API, Secure Remote Services, SNMP Trap, Syslog | System metadata garbage reclamation throughput is too slow to catch up with garbage detection. | Event trigger source example: Reclaimed Btree garbage is less than 10% of the remaining BTree garbage, as BTree GC is slow at chunk reclamation. This condition has persisted for the last 7 days, leading to creation of this alert. Derived from formula: Full_Garbage > 1TB, and Garbage_Detected_Rate - Garbage_Chunk_Reclaim_Rate > 100GB | Contact ECS Remote Support.
Btree disk level GC | Warning | 1325 | Portal, API, Secure Remote Services, SNMP Trap, Syslog | Capacity free-up throughput is too slow to catch up with system metadata garbage reclamation. | Event trigger source example: Reclaimed Btree garbage is less than 10% of the full garbage, as BTree GC is slow at disk level reclamation. This condition has persisted for the last 7 days, leading to creation of this alert. Derived from formula: if Garbage_Pending_Delete > 1TB, and Garbage_Chunk_Reclaim_Rate - Garbage_Capacity_Reclaim_Rate > 100GB | Contact ECS Remote Support.
Btree partial GC | Warning | 1329 | Portal, API, Secure Remote Services, SNMP Trap, Syslog | Partial GC for system metadata is too slow. | Event trigger source example: Rate of Btree partial GC conversion to full garbage is less than 10% of the partial GC eligible for conversion. Btree partial GC works too slowly to convert partial garbage into full garbage. This condition has persisted for the last 7 days, leading to creation of this alert. Derived from formula: if Partial_Eligible_Garbage > 1TB, and Partial_To_Full_Convert_Rate < 100GB | Contact ECS Remote Support.
Bucket hard quota | Error | 1006 | Portal, API, SNMP Trap, Syslog | HardQuotaLimitExceeded: bucket {bucket_name} | |
Bucket soft quota | Warning | 1008 | Portal, API, SNMP Trap, Syslog | SoftQuotaLimitExceeded: bucket {bucket_name} | |
Capacity alerting | Warning, Error, Critical | 1111, 1112, 1113 | Portal, API, SNMP Trap, Syslog | Storage pool {Storage pool} has {id}% remaining capacity meeting threshold of {id}%. | The severity of the alert depends on how close the remaining storage pool capacity is to reaching the configured threshold. Capacity alerting is not set by default: set capacity alerts to receive them. You can set them by editing an existing storage pool or when you create a storage pool. |
Capacity exceeded threshold | Warning | 1100 | Portal, API, Secure Remote Services, SNMP Trap, Syslog | Used Capacity of the VDC exceeded configured threshold, current usage is {usage}%. | The configured threshold is set at 80% of the Used Capacity of the VDC by default. CAUTION: If the used capacity reaches 90%, you cannot write or modify object data. | Contact ECS Remote Support representative to determine the appropriate solution.
Capacity license threshold | Error | 997 | Portal, API, Secure Remote Services, SNMP Trap, Syslog | Licensed Capacity Entitlement Exceeded Event | The capacity of the system is greater than was licensed. |
Chunk not found | Error | 1004 | Portal, API, Secure Remote Services, SNMP Trap, Syslog | chunkId {chunkId} not found | |
CPU Usage Percent | Warning, Error, Critical | 4001, 4002, 4003 | Portal, API, SNMP Trap, Syslog | CPU usage is ${inspectorValue}% crosses threshold ${thresholdValue}% | If CPU usage percent crosses the threshold specified then the alert is triggered. |
Disabled CAS GC | Info, Warning, Error, Critical | 1316, 1317, 1318, 1319 | Portal, API, Secure Remote Services, SNMP Trap, Syslog | CAS Processing is paused. | CAS GC is Content Addressable Storage Garbage Collection. CAS GC is disabled. | Contact ECS Remote Support to ensure that it should stay enabled.
DT init failure | Error | 3001 | Portal, API, Secure Remote Services, SNMP Trap, Syslog | There are more than {numbers} DTs failed or DT stats check failed in last {number} rounds of DT status check. | DT is a directory table. The default value is set at 8 DTs for this alert to trigger. |
EKM Server Certificate Expiry | Warning, Error | 1361, 1362 | Portal, API, Secure Remote Services, SNMP Trap, Syslog | The server certificate for EKM server expires in 30 days. Renew the certificate. / The server certificate for EKM server expires in 7 days. Renew the certificate. | |
EKM Server Connection Status | Warning, Error | 1369, 1370 | Portal, API, Secure Remote Services, SNMP Trap, Syslog | The EKM server is not responding. Ensure that the server is connected. | |
First Byte Latency For Read | Warning, Error, Critical | 4009, 4010, 4011 | Portal, API, SNMP Trap, Syslog | First Byte Latency for Read is ${inspectorValue}ms crosses threshold ${thresholdValue}ms | If TTFB for read latency crosses the threshold specified then the alert is triggered. |
Last Byte Latency For Write | Warning, Error, Critical | 4003, 4014, 4015 | Portal, API, SNMP Trap, Syslog | Last Byte Latency for Write is ${inspectorValue}ms crosses threshold ${thresholdValue}ms | If TTLB for write latency crosses the threshold specified then the alert is triggered. |
License expiration | Info | 998 | Portal, API, Secure Remote Services, SNMP Trap, Syslog | Expiration event | |
License registration | Info | 100 | Portal, API, Secure Remote Services, SNMP Trap, Syslog | Registration Event | |
Memory outside Btree writes cache | Warning | 1349 | Portal, API, Secure Remote Services, SNMP Trap, Syslog | For cm process memory of X bytes is allocated outside Btree write cache on node <Node IP>. | |
Metering read latency | Warning, Error, Critical | 1205, 1206, 1207 | Portal, API, Secure Remote Services, SNMP Trap, Syslog | Read latency is 300 millisecond, crosses threshold 250 millisecond. / Read latency is 505 millisecond, crosses threshold 500 millisecond. / Read latency is 1050 millisecond, crosses threshold 1000 millisecond. | | Contact ECS Remote Support.
Metering write latency | Warning, Error, Critical | 1205, 1206, 1207 | Portal, API, Secure Remote Services, SNMP Trap, Syslog | Write latency is 300 millisecond, crosses threshold 250 millisecond. / Write latency is 555 millisecond, crosses threshold 500 millisecond. / Write latency is 1500 millisecond, crosses threshold 1000 millisecond. | | Contact ECS Remote Support.
MonitoringHealth
Critical 4016
4017
4018
Portal, API,SecureRemoteServices,SNMP Trap,Syslog
Data recorded inTSDB is laggingby{thresholdValue}mins on nodex.x.x.x
Namespacehard quota
Error 1005 Portal, API,SNMP Trap,Syslog
HardQuotaLimitExceeded:Namespace{namespace}
Namespacesoft quota
Warning 1009 Portal, API,SNMP Trap,Syslog
SoftQuotaLimitExceeded:Namespace{namespace}
Notification Any Any User-definedmessage.
Custom message thatis defined andprovided by the user.
Processmemorytable freespacepercent
Error 1354 Portal, API,SecureRemoteServices,SNMP Trap,Syslog
Memory table sizefor blob process isX % less than thespecifiedthreshold of Y %on <node IP>.
Contact ECSRemote Support.
Repo chunklevel GC
Warning 1333 Portal, API,SecureRemoteServices,SNMP Trap,Syslog
User garbagecollectionthroughput is tooslow to catch upwith garbagedetection.
Event trigger sourceExample: Repo Chunkreclamation rate is lessthan 10% of theremaining garbage.
This condition haspersisted for last 7days, leading tocreation of this alert.
Derived from formula:Full_Garbage > 10TB,andGarbage_Detected_Rate -Garbage_Chunk_Reclaim_Rate > 100GB
Contact ECSRemote Support.
Repo disk level GC | Warning | 1337 | Portal, API, Secure Remote Services, SNMP Trap, Syslog | Capacity free-up throughput is too slow to catch up with user garbage collection. | Event trigger source. Example: Repo disk level GC reclamation rate is less than 10 % of Garbage pending delete at disk level. This condition has persisted for last 7 days, leading to creation of this alert. Derived from formula: If Garbage_Pending_Delete > 10TB, and Garbage_Chunk_Reclaim_Rate - Garbage_Capacity_Reclaim_Rate > 100GB | Contact ECS Remote Support.
Repo partial GC | Warning | 1341 | Portal, API, Secure Remote Services, SNMP Trap, Syslog | Partial GC for user garbage is too slow. | Event trigger source. Example: Repo Partial repo GC works too slow to convert partial garbage into full garbage. This condition has persisted for last 7 days, leading to creation of this alert. Derived from formula: If Partial_Eligible_Garbage > 10TB, and Partial_To_Full_Convert_Rate < 100GB | Contact ECS Remote Support.
RPO | Warning | 1012 | Portal, API, Secure Remote Services, SNMP Trap, Syslog | RPO for replication group {RG} is {HH} hour {SS} seconds greater than {HH} hour threshold set. | The recovery point objective (RPO) is greater than the RPO threshold. The default value is one hour. |
Slow CAS GC Object Cleanup | Info / Warning / Error / Critical | 1312 / 1313 / 1314 / 1315 | Portal, API, Secure Remote Services, SNMP Trap, Syslog | CAS Processing object cleanup speed is slow. | CAS GC cleanup tasks are lagging. |
Slow CAS GC Reference Collection | Info / Warning / Error / Critical | 1308 / 1309 / 1310 / 1311 | Portal, API, Secure Remote Services, SNMP Trap, Syslog | CAS Processing reference collection speed is slow. | CAS GC reference collection tasks are lagging. |
Slow Journal Parsing | Info / Warning / Error / Critical | 1304 / 1305 / 1306 / 1307 | Portal, API, Secure Remote Services, SNMP Trap, Syslog | Journal parsing speed is slow. | Journal parsing speed is slow. |
Space Usage Percent | Warning / Error / Critical | 4005 / 4006 / 4007 | Portal, API, SNMP Trap, Syslog | Disk space usage is ${inspectorValue}% crosses threshold ${thresholdValue}% | If Disk usage percent crosses the threshold specified then the alert is triggered. |
GC Status | Warning | 1345 | Portal, API, Secure Remote Services, SNMP Trap, Syslog | Space reclamation for user data/system metadata is disabled. | Make sure it is disabled for temporary purpose, and re-enable it when ready. | Contact ECS Remote Support.
VDC in TSO | Critical | 1007 | Portal, API, SNMP Trap, Syslog | Site {vdc} is marked as temporarily unavailable. | TSO is a temporary site outage. |
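The three garbage collection alerts in Table 26 (Repo chunk level GC, Repo disk level GC, Repo partial GC) are each described by a threshold formula. The following Python sketch restates those formulas so that they can be compared side by side. It is an illustration only, not ECS code: the class and function names are hypothetical, the decimal interpretation of TB and GB is an assumption, the time window over which the rates are measured is not stated in this guide, and the 7-day persistence condition is not modeled.

```python
from dataclasses import dataclass

# Assumption: decimal units. The guide does not say whether TB/GB are decimal or binary.
TB = 10**12
GB = 10**9


@dataclass
class GcStats:
    """Hypothetical container for the quantities named in the Table 26 formulas (bytes)."""
    full_garbage: float
    garbage_detected_rate: float
    garbage_chunk_reclaim_rate: float
    garbage_pending_delete: float
    garbage_capacity_reclaim_rate: float
    partial_eligible_garbage: float
    partial_to_full_convert_rate: float


def gc_alert_conditions(s: GcStats) -> dict[str, bool]:
    """Evaluate the three formulas once; persistence over 7 days is not checked here."""
    return {
        # Full_Garbage > 10TB and detection outpaces chunk reclamation by more than 100GB
        "repo_chunk_level_gc": (
            s.full_garbage > 10 * TB
            and s.garbage_detected_rate - s.garbage_chunk_reclaim_rate > 100 * GB
        ),
        # Garbage_Pending_Delete > 10TB and chunk reclamation outpaces capacity reclamation
        "repo_disk_level_gc": (
            s.garbage_pending_delete > 10 * TB
            and s.garbage_chunk_reclaim_rate - s.garbage_capacity_reclaim_rate > 100 * GB
        ),
        # Partial_Eligible_Garbage > 10TB and partial-to-full conversion below 100GB
        "repo_partial_gc": (
            s.partial_eligible_garbage > 10 * TB
            and s.partial_to_full_convert_rate < 100 * GB
        ),
    }
```

In this sketch a condition that evaluates to True only indicates that the formula holds for one sample; according to the table, the alert is created after the condition has persisted for 7 days.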
Table 27 ECS fabric alert messages
Alert | Severity | Symptom code | Sent to | Message | Description | Action
Disk added | Info | 2019 | Portal, API, SNMP Trap, Syslog | Disk {diskSerialNumber} on node {fqdn} was added. | Disk was added. |
Disk failure | Critical | 2002 | Portal, API, SNMP Trap, Syslog, Secure Remote Services | Disk {diskSerialNumber} on node {fqdn} has failed. | Health of the disk changed to BAD. |
Disk good | Info | 2025 | Portal, API, SNMP Trap, Syslog | Disk {diskSerialNumber} on node {fqdn} was revived. | Disk was revived. |
Disk mounted | Info | 2035 | Portal, API, SNMP Trap, Syslog | Disk {diskSerialNumber} on node {fqdn} has mounted. | Disk was mounted. |
Disk removed | Info | 2020 | Portal, API, SNMP Trap, Syslog | Disk {diskSerialNumber} on node {fqdn} was removed. | Disk was removed. |
Disk suspect | Error | 2003 | Portal, API, SNMP Trap, Syslog, Secure Remote Services | Disk {diskSerialNumber} on node {fqdn} has suspected. | Health of the disk changed to SUSPECT. |
Disk unmounted | Warning | 2036 | Portal, API, SNMP Trap, Syslog | Disk {diskSerialNumber} on node {fqdn} has unmounted. | Disk was unmounted. |
Docker container configuration failure | Critical | 2022 | Portal, API, SNMP Trap, Syslog, Secure Remote Services | Container {containerName} configuration has failed on node {fqdn} with exit code {exitCode} {happenedOn}. | Configure script returned nonzero exit code. The configure script is provided by object and called by fabric on object container start-up. It is only applicable for the object container. |
Docker container paused | Warning | 2017 | Portal, API, SNMP Trap, Syslog | Container {containerName} has paused on node {fqdn}. | Container paused |
Docker container running | Info | 2016 | Portal, API, SNMP Trap, Syslog | Container {containerName} is up on node {fqdn}. | Container moved to running state. |
Docker container stopped | Error | 2015 | Portal, API, SNMP Trap, Syslog | Container {containerName} has stopped on node {fqdn}. | Container stopped |
Events cannot be delivered. | Error | 2038 | Portal, API, Secure Remote Services, SNMP Trap, Syslog | Events cannot be delivered through {SMTP|ESRS} and lost. | Verify configuration of the channel for which the alert is. |
Firewall health is BAD or SUSPECT | BAD / SUSPECT | 2051 / 2052 | Portal, API, Secure Remote Services, SNMP Trap, Syslog | Firewall health is BAD! {reason} / Firewall health is SUSPECT! {reason} | Rules or ip sets do not exist, system firewall is off, ip tables or ip set utils do not exist / Rules or ip sets do not exist, trying to recover |
Fabric agent failure | Critical | 2013 | Portal, API, SNMP Trap, Syslog | FabricAgent has failed on node {fqdn}. | Fabric agent health is bad. |
Fabric agent suspect | Error | 2014 | Portal, API, SNMP Trap, Syslog | FabricAgent has suspected on node {fqdn}. | Fabric agent health is suspect. |
Net interface health down | Critical | 2023 | Portal, API, SNMP Trap, Syslog, Secure Remote Services | Net interface {$netInterfaceName}[ on node $FQDN] is down[ with IP address $IP]. | Fabric's net interface is down. |
Net interface health up | Info | 2024 | Portal, API, SNMP Trap, Syslog, Secure Remote Services | Net interface {$netInterfaceName}[ on node $FQDN] is up[ with IP address $IP]. | Fabric's net interface is up. |
Net interface permanent down | Critical | 2026 | Portal, API, Secure Remote Services | Net interface {$netInterfaceName}[ on node $FQDN] is permanently down[ with IP address $IP]. | Net interface is down for at least 10 minutes. |
Net interface IP address updated | Info | 2027 | Portal, API, SNMP Trap, Syslog | Net interface's {netInterfaceName} IP address on node {fqdn} was changed to {newIpAddress}. | Fabric's net interface IP address changed |
Node failure | Critical | 2006 | Portal, API, SNMP Trap, Syslog, Secure Remote Services | Node {fqdn} has failed. | Node is not reachable for 30 minutes. |
Node suspect | Error | 2007 | Portal, API, SNMP Trap, Syslog, Secure Remote Services | Node {fqdn} has suspected. | Node is not reachable for 15 minutes. |
Node up | Info | 2018 | Portal, API, SNMP Trap, Syslog | Node {fqdn} is up. | Node moved to 'up' state after it was down for at least 15 minutes. |
Root file system filling on node | WARNING / CRITICAL | 2039 / 2042 | Portal, API, SNMP Trap, Syslog, Secure Remote Services | Thresholds exceeded, usable space on root fs <BYTES> are less than threshold for <LEVEL> level on node <NODE> | Threshold between 15G and 10G triggers warning. / Threshold Less than 10G of free space results in Critical alert. |
Slot permanent down | Critical | 2021 | Portal, API, SNMP Trap, Syslog, Secure Remote Services | Container {containerName} is permanently down on node {fqdn}. | Container stopped/paused or not started at all for at least 10 minutes |
Service failure | Critical | 2011 | Portal, API, Syslog, Secure Remote Services | Service Health Failure Event | Service failed |
Service suspect | Error | 2012 | Portal, API, Syslog, Secure Remote Services | Service Health Suspect event | Service health is suspect. |
Table 28 Secure Remote Services alert messages
Alert | Severity | Symptom code | Sent to | Description
TestDialHome | N/A | TestDialHome | Secure Remote Services | Tests that Secure Remote Services connections can be established and that the call home functionality works.
CHAPTER 4
Advanced Monitoring
• Advanced Monitoring
• Flux API
• Dashboard APIs to be deprecated or changed in the next release
Advanced Monitoring
Advanced Monitoring dashboards provide critical information about the ECS processes on the VDC you are logged in to. The dashboards are based on a time series database and are provided by Grafana, a well-known open-source time series analytics platform.
Refer to the Grafana documentation for basic details of navigating Grafana dashboards.
• View Advanced Monitoring Dashboards
• Share Advanced Monitoring Dashboards
View Advanced Monitoring Dashboards
To view the advanced monitoring dashboards in the ECS Portal, select Advanced Monitoring. The Data Access Performance - Overview dashboard is the default.
Table 29 Advanced monitoring dashboards
Dashboard | Description
Data Access Performance - Overview | You can use the Data Access Performance - Overview dashboard to monitor VDC data.
Data Access Performance - by Namespaces | You can use the Data Access Performance - by Namespaces dashboard to monitor performance data for an individual namespace or a group of namespaces.
Data Access Performance - by Nodes | You can use the Data Access Performance - by Nodes dashboard to see performance data for an individual node or a group of nodes in a VDC.
Data Access Performance - by Protocols | You can use the Data Access Performance - by Protocols dashboard to see performance data for each supported protocol (S3, ATMOS, SWIFT, etc.) or a set of protocols.
Table 30 Advanced monitoring dashboard fields
Dashboard | Field | Description
All | Related dashboards | Allows you to switch to other dashboards in the access performance group, with the selected time.
All | Transaction Summary | Lists the total Successful requests, System Failures, User Failures, and Failure % Rate for the selected VDCs, namespaces, nodes, or protocols.
All | Successful requests | The number of data requests that were successfully completed.
All | System Failures | The number of data requests that failed due to hardware or service errors. System failures are failed requests that are associated with hardware or service errors (typically an HTTP error code of 5xx).
All | User Failures | The number of data requests from all object heads that are classified as user failures. User failures are known error types originating from the object heads (typically an HTTP error code of 4xx).
All | Failure % Rate | The percentage of failures for the VDC, namespace, nodes, or protocols.
All | TPS (success/failure) | Rate of successful requests and failures per second.
Data Access Performance - Overview; Data Access Performance - by Nodes; Data Access Performance - by Protocols | Bandwidth (read/write) | Data access bandwidth of successful requests per second.
All | Failed Requests/s by error type (user/system) | Rate of failed requests per second, split by error type (user/system).
Data Access Performance - Overview; Data Access Performance - by Nodes; Data Access Performance - by Protocols | Latency | Latency of read/write requests.
Data Access Performance - Overview; Data Access Performance - by Nodes | Successful request drill down | Displays the rate of successful requests per second, by method, node, and protocol.
Data Access Performance - by Nodes; Data Access Performance - Overview | Successful Requests/s by Method | Rate of successful requests per second, by method.
All | Successful Requests/s by Node | Rate of successful requests per second, by node.
Data Access Performance - by Nodes; Data Access Performance - Overview | Successful Requests/s by Protocol | Rate of successful requests per second, by protocol.
Data Access Performance - Overview; Data Access Performance - by Nodes | Failures drill down | Displays the rate of failed requests per second, by method, node, and protocol.
Data Access Performance - by Nodes; Data Access Performance - Overview | Failed Requests/s by Method | Rate of failed requests per second, by method.
All | Failed Requests/s by Node | Rate of failed requests per second, by node.
Data Access Performance - by Nodes; Data Access Performance - Overview | Failed Requests/s by Protocol | Rate of failed requests per second, by protocol.
Data Access Performance - by Nodes; Data Access Performance - Overview | Failed Requests/s by error code | Rate of failed requests per second, by error code.
Data Access Performance - by Nodes; Data Access Performance - by Namespaces; Data Access Performance - by Protocols | Compare TPS of successful requests | Select multiple nodes and compare rates of successful requests per second.
Data Access Performance - by Namespaces | Compare TPS of failed requests | Select multiple nodes and compare rates of failed requests per second, by error type (user/system).
Data Access Performance - by Nodes; Data Access Performance - by Protocols | Compare read bandwidth | Select multiple nodes and compare data access bandwidth (read) of successful requests per second.
Data Access Performance - by Nodes; Data Access Performance - by Protocols | Compare write bandwidth | Select multiple nodes and compare data access bandwidth (write) of successful requests per second.
Data Access Performance - by Nodes; Data Access Performance - by Protocols | Compare read latency | Select multiple nodes and compare latency of read requests.
Data Access Performance - by Nodes; Data Access Performance - by Protocols | Compare write latency | Select multiple nodes and compare latency of write requests.
Data Access Performance - by Nodes; Data Access Performance - by Protocols | Compare rate of failed requests/s | Select multiple nodes and compare rates of failed requests per second, split by error type (user/system).
Data Access Performance - by Namespaces | Request drill down by nodes | Rate of requests per second, split by node.
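Table 30 distinguishes system failures (typically HTTP 5xx) from user failures (typically HTTP 4xx) and reports a Failure % Rate. The following Python sketch shows one way to read those definitions. The function names are hypothetical, and the exact formula ECS uses for Failure % Rate is not given in this guide; the sketch assumes it is failures divided by all requests.

```python
def classify_request(status_code: int) -> str:
    """Classify an HTTP status code using the 'typical' ranges from Table 30."""
    if 500 <= status_code <= 599:
        return "system"   # hardware or service errors
    if 400 <= status_code <= 499:
        return "user"     # known error types originating from the object heads
    return "success"      # assumption: everything else counts as successful


def failure_rate_percent(successes: int, failures: int) -> float:
    """Assumed formula: failures as a percentage of all requests."""
    total = successes + failures
    if total == 0:
        return 0.0
    return 100.0 * failures / total
```

For example, under this assumption 90 successful requests and 10 failed requests give a failure rate of 10 percent.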
View mode
Procedure
1. To view a dashboard in view mode, click the title of a dashboard (for example, TPS (success/failure)) > View.
The dashboard opens in view mode, which is a full-screen mode.
2. Click the Back to dashboard icon to return to the dashboards view.
Export CSV
Procedure
1. To export the dashboard data to .csv format, click the title of a dashboard (for example, TPS (success/failure)) > More > Export CSV.
The Export CSV window opens.
You can customize the CSV output by modifying the Mode and Date Time Format attributes and by selecting or clearing the Excel CSV Dialect attribute.
2. Click Export > Save to save the dashboard data in .csv format to your local storage.
View Advanced Monitoring Dashboards - Overview
The Data Access Performance - Overview dashboard is the default.
In the Data Access Performance - Overview dashboard, you can monitor for all nodes in the VDC:
• TPS (success/failure)
• Bandwidth (read/write)
• Failed Requests/s by error type (user/system)
• Latency
• Successful Requests/s by Method
• Successful Requests/s by Node
• Successful Requests/s by Protocol
• Failed Requests/s by Method
• Failed Requests/s by Node
• Failed Requests/s by Protocol
• Failed Requests/s by error code
To view the Data Access Performance - Overview dashboard in the ECS Portal, select Advanced Monitoring.
Click Successful requests drill down to see the successful requests by all the methods, nodes, and protocols.
Click Failures drill down to see the failed requests by all the methods, nodes, protocols, and error codes.
Click Related dashboards to view the other dashboards, with the selected time.
View Advanced Monitoring Dashboards - by Namespaces
In the Data Access Performance - by Namespaces dashboard, you can monitor for namespaces:
• TPS (success/failure)
• Failed Requests/s by error type (user/system)
• Successful Requests/s by Node
• Failed Requests/s by Node
• Compare TPS of successful requests
• Compare TPS of failed requests
To view the Data Access Performance - by Namespaces dashboard in the ECS Portal, select Advanced Monitoring > Related dashboards > Data Access Performance - by Namespaces.
Data for all the namespaces is visible in the default view. To select a namespace, click the legend parameter for the namespace below the graph.
Requests drill down by nodes shows the successful and failed requests by node.
Compare: select multiple namespaces to compare the TPS of successful and failed requests.
View Advanced Monitoring Dashboards - by Nodes
In the Data Access Performance - by Nodes dashboard, you can monitor for nodes in a VDC:
• TPS (success/failure)
• Bandwidth (read/write)
• Failed Requests/s by error type (user/system)
• Latency
• Successful Requests/s by Method
• Successful Requests/s by Node
• Successful Requests/s by Protocol
• Failed Requests/s by Method
• Failed Requests/s by Node
• Failed Requests/s by Protocol
• Failed Requests/s by error code
• Compare TPS of successful requests
• Compare TPS of failed requests
• Compare read bandwidth
• Compare write bandwidth
• Compare read latency
• Compare write latency
To view the Data Access Performance - by Nodes dashboard in the ECS Portal, select Advanced Monitoring > Related dashboards > Data Access Performance - by Nodes.
Data for all the nodes is visible in the default view. To select data for a node, click the legend parameter for the node below the graph.
Successful requests drill down shows the successful requests by method, node, and protocol.
Failures drill down shows the failed requests by method, node, protocol, and error code.
Compare: select multiple nodes to compare the TPS of successful and failed requests, read/write bandwidth, and read/write latency.
View Advanced Monitoring Dashboards - by Protocols
In the Data Access Performance - by Protocols dashboard, based on the protocol, you can monitor:
• TPS (success/failure)
• Bandwidth (read/write)
• Failed Requests/s by error type (user/system)
• Latency
• Successful Requests/s by Node
• Failed Requests/s by Node
• Compare TPS of successful requests
• Compare TPS of failed requests
• Compare read bandwidth
• Compare write bandwidth
• Compare read latency
• Compare write latency
To view the Data Access Performance - by Protocols dashboard in the ECS Portal, select Advanced Monitoring > Related dashboards > Data Access Performance - by Protocols.
Data for all the protocols is visible in the default view. To select data for a protocol, click the legend parameter for the protocol below the graph.
Requests drill down by nodes shows the successful and failed requests by node.
Compare: select multiple protocols to compare the TPS of successful and failed requests, read/write bandwidth, and read/write latency.
Share Advanced Monitoring Dashboards
The Share dashboard icon enables you to create a direct link to the dashboard or panel, share a snapshot of an interactive dashboard publicly, and export the dashboard to a JSON file.
For procedures on sharing the dashboard link, dashboard snapshot, and dashboard as a JSON file, refer to the Grafana documentation.
Flux API
The Flux API enables you to retrieve time series database data by sending REST queries, for example with curl. You can get raw data from the fluxd service in a way similar to using the Dashboard API. You must first get a token, and then provide the token in the requests.
Procedure
1. Use curl https://<ip>:4443/login -k -u "user:passwd" -v to get the security token.
2. Enter the token in the request header.
• The query is provided in the request body.
• Nginx validates the token in authsvc, and sends a proxy request to fluxd on the local host.
• Fluxd handles the query, and returns results in JSON or CSV format.
Example of Flux API request:
curl -k -H "X-SDS-AUTH-TOKEN: xxxx" -XPOST --data-urlencode 'query=from(bucket: "monitoring_main") |> filter(fn: (r) => r._measurement == "statDataHead_performance_internal_transactions") |> range(start: -30m)' 'https://10.249.230.55:4443/dashboard/v2/query'
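The same request can be built from a script. The following Python sketch mirrors the curl command above using only the standard library: it URL-encodes the query into the request body and puts the token in the X-SDS-AUTH-TOKEN header. The function names, the host value, and the unverified TLS context (the counterpart of curl -k) are assumptions made for illustration. Obtaining the token from the /login call is left out here.

```python
import ssl
import urllib.parse
import urllib.request


def build_flux_request(host: str, token: str, flux_query: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the query endpoint shown in the curl example."""
    # Equivalent of curl --data-urlencode 'query=...'
    body = urllib.parse.urlencode({"query": flux_query}).encode("ascii")
    return urllib.request.Request(
        url=f"https://{host}:4443/dashboard/v2/query",
        data=body,
        headers={"X-SDS-AUTH-TOKEN": token},
        method="POST",
    )


def send_flux_request(req: urllib.request.Request) -> str:
    """Send the request without certificate verification, like curl -k."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(req, context=ctx) as resp:
        return resp.read().decode("utf-8")
```

Skipping certificate verification is only appropriate where curl -k would be; in other environments, pass a context that trusts the ECS certificate instead.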
Example of Dashboard API output:
{
  "_links": {
    "self": { "href": "/dashboard/zones/localzone" },
    "storagepools": { "href": "/dashboard/zones/localzone/storagepools" },
    "nodes": { "href": "/dashboard/zones/localzone/nodes" },
    "replicationgroups": { "href": "/dashboard/zones/localzone/replicationgroups" },
    "rglinksFailed": { "href": "/dashboard/zones/localzone/rglinksFailed" },
    "rglinksBootstrap": { "href": "/dashboard/zones/localzone/rglinksBootstrap" }
  },
  "apiChange": "1",
  "name": "vdc1",
  "numNodes": 16,
  ...
  "nodeCpuUtilizationAvg": [
    {"t": "12345678", "Percent": 10},
    {"t": "23435455", "Percent": 43},
    {"t": "55433455", "Percent": 39}
  ],
  ...
  "diskSpaceAllocatedCurrent": [
    {"t": "12345678", "Bytes": 10000},
    {"t": "23456789", "Bytes": 12000}
  ]
}
Example of Flux API JSON output:
{ [
  {"table":"0", "_start":"2019-03-06T10:30:00Z", "_stop":"2019-03-07T11:15:00Z", "_time":"2019-03-06T10:30:00Z", "_value":95.30918181366027, "_field":"usage_idle", "_measurement":"cpu", "cpu":"cpu-total", "host":"layton-ivory.ecs.lab.emc.com", "node_id":"1f2815b3-b340-45ce-b863-de8f46e8b691", "tag":"system"},
  {"table":"0", "_start":"2019-03-06T10:30:00Z", "_stop":"2019-03-07T11:15:00Z", "_time":"2019-03-06T10:35:00Z", "_value":95.52097124715358, "_field":"usage_idle", "_measurement":"cpu", "cpu":"cpu-total", "host":"layton-ivory.ecs.lab.emc.com", "node_id":"1f2815b3-b340-45ce-b863-de8f46e8b691", "tag":"system"},
  {"table":"1", "_start":"2019-03-06T10:30:00Z", "_stop":"2019-03-07T11:15:00Z", "_time":"2019-03-06T10:30:00Z", "_value":85.41386518615308, "_field":"usage_idle", "_measurement":"cpu", "cpu":"cpu-total", "host":"lehi-ivory.ecs.lab.emc.com", "node_id":"48e607ef-2e81-4f8b-b9e2-b61b45ef2240", "tag":"system"},
  {"table":"1", "_start":"2019-03-06T10:30:00Z", "_stop":"2019-03-07T11:15:00Z", "_time":"2019-03-06T10:35:00Z", "_value":67.13489651306735, "_field":"usage_idle", "_measurement":"cpu", "cpu":"cpu-total", "host":"lehi-ivory.ecs.lab.emc.com", "node_id":"48e607ef-2e81-4f8b-b9e2-b61b45ef2240", "tag":"system"}
] }
Example of Flux API CSV output:
#datatype,string,long,dateTime:RFC3339,dateTime:RFC3339,dateTime:RFC3339,double,string,string,string,string,string,string
#group,false,false,true,true,false,false,true,true,true,true,true,true
#default,_result,,,,,,,,,,,
,result,table,_start,_stop,_time,_value,_field,_measurement,cpu,host,node_id,tag
,,0,2019-03-06T10:30:00Z,2019-03-07T11:15:00Z,2019-03-06T10:30:00Z,95.30918181366027,usage_idle,cpu,cpu-total,dallas-straw,1f2815b3-b340-45ce-b863-de8f46e8b691,system
,,0,2019-03-06T10:30:00Z,2019-03-07T11:15:00Z,2019-03-06T10:35:00Z,95.52097124715358,usage_idle,cpu,cpu-total,dallas-straw,1f2815b3-b340-45ce-b863-de8f46e8b691,system
,,0,2019-03-06T10:30:00Z,2019-03-07T11:15:00Z,2019-03-06T10:40:00Z,85.41386518615308,usage_idle,cpu,cpu-total,dallas-straw,48e607ef-2e81-4f8b-b9e2-b61b45ef2240,system
,,0,2019-03-06T10:30:00Z,2019-03-07T11:15:00Z,2019-03-06T10:45:00Z,67.13489651306735,usage_idle,cpu,cpu-total,dallas-straw,48e607ef-2e81-4f8b-b9e2-b61b45ef2240,system
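The CSV output carries annotation rows that start with # (datatype, group, default), followed by a header row and data rows whose first column is empty. The following Python sketch shows one way to turn such output into a list of dictionaries. The function name is hypothetical, values are kept as strings rather than converted according to the #datatype row, and the sketch assumes a single header row per response.

```python
import csv
import io


def parse_flux_csv(text: str) -> list[dict[str, str]]:
    """Parse annotated CSV: skip '#' annotation rows, use the next row as the header."""
    rows = [
        row
        for row in csv.reader(io.StringIO(text))
        if row and not row[0].startswith("#")  # drop blank lines and annotations
    ]
    if not rows:
        return []
    header, *data = rows
    names = header[1:]  # the first column is the (empty) annotation column
    return [dict(zip(names, row[1:])) for row in data]
```

Each returned dictionary maps a column name such as _time, _value, or host to its value for one record.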
Note: The Flux query language that the Flux API supports is a subset of the operations supported by the InfluxDB Flux language v0.12. Refer to Get started with Flux for more details.
Functions that queries run through the Flux API can use:
• from https://docs.influxdata.com/flux/v0.12/functions/inputs/from/
• filter https://docs.influxdata.com/flux/v0.12/functions/transformations/filter/
• range https://docs.influxdata.com/flux/v0.12/functions/transformations/range/
• last https://docs.influxdata.com/flux/v0.12/functions/transformations/selectors/last/
• first https://docs.influxdata.com/flux/v0.12/functions/transformations/selectors/first/
• limit https://docs.influxdata.com/flux/v0.12/functions/transformations/limit/
• drop https://docs.influxdata.com/flux/v0.12/functions/transformations/drop/
• keep https://docs.influxdata.com/flux/v0.12/functions/transformations/keep/
Example of Flux API query:
from(bucket: "monitoring_main")
|> filter(fn: (r) => r._measurement == "statDataHead_performance_internal_transactions")
|> range(start: -30m)
|> keep(columns: ["_time", "_value", "host"])
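Because only a subset of Flux functions is supported, it can help to check a query before sending it. The following Python sketch is a naive check, not a Flux parser: it looks at the first call in the query and at each call that follows a |> pipe, and reports any function name outside the supported list given above. The function name and the regular expression are assumptions made for illustration.

```python
import re

# Functions listed in this guide as supported by the Flux API.
SUPPORTED_FUNCTIONS = {"from", "filter", "range", "last", "first", "limit", "drop", "keep"}

# Matches a function name at the start of the query or directly after a |> pipe.
_PIPELINE_CALL = re.compile(r"(?:^|\|>)\s*([A-Za-z_]\w*)\s*\(")


def unsupported_functions(flux_query: str) -> list[str]:
    """Return the pipeline function names that are not in the supported list."""
    names = _PIPELINE_CALL.findall(flux_query)
    return sorted(set(names) - SUPPORTED_FUNCTIONS)
```

An empty result means every pipeline stage found by this simple pattern is in the supported list; it does not validate arguments or nested expressions.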
List of metrics for performance-related data
Table 31 Metrics for performance-related data
Tag | Field reference
host | Name of data node
node_id | ID of data node
tag | Internal, set to dashboard
process | Internal, set to statDataHead
head | Type of protocol
namespace | Name of namespace
method | Protocol-specific request method (GET, POST, READ, WRITE)
Note: When a measurement has tags, it is possible to query the TSDB for a subset of data. For example, measurements with the tag head provide information for each protocol independently.
Database monitoring_main
Performance metrics in this database are raw and split by data node; all have host and node_id tags.
Most of the integer fields are increasing counters, that is, values that increase over time. Increasing counters restart from zero after the data head service restarts.
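Because the raw fields are increasing counters that restart from zero, turning them into per-second rates requires handling the restart. The following Python sketch shows one common approach: when a sample is lower than the previous one, the counter is assumed to have restarted and the new value is taken as the increase. The function name and the sample format are hypothetical, and any increase that happened just before the restart is lost.

```python
def counter_rates(samples: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Convert (epoch_seconds, counter_value) samples, sorted by time, into per-second rates."""
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0:
            continue  # skip duplicate or out-of-order timestamps
        # A drop in value is treated as a restart from zero.
        increase = v1 - v0 if v1 >= v0 else v1
        rates.append((t1, increase / elapsed))
    return rates
```

For example, the samples (0, 100), (60, 160), (120, 30), (180, 90) contain one restart between the second and third sample, and the sketch reports rates of 1.0, 0.5, and 1.0 requests per second.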
Measurement: statDataHead_performance_internal_error
Tags: host, node_id, process, tag
Fields: system_errors (integer), user_errors (integer)

Measurement: statDataHead_performance_internal_error_code
Tags: code, host, node_id, process, tag
Fields: error_counter (integer)

Measurement: statDataHead_performance_internal_error_head
Tags: head, host, node_id, process, tag
Fields: system_errors (integer), user_errors (integer)

Measurement: statDataHead_performance_internal_error_head_namespace
Tags: head, host, namespace, node_id, process, tag
Fields: system_errors (integer), user_errors (integer)

Measurement: statDataHead_performance_internal_latency
Tags: host, id, node_id, process, tag
Fields: +Inf (integer), 0.0 (integer), 1.0 (integer), 111.6295328521717 (integer), 12461.15260479408 (integer), 23.183877401213103 (integer), 2588.0054039994393 (integer), 4.814963904455889 (integer), 537.4921713544796 (integer), 59999.999999999985 (integer)

Measurement: statDataHead_performance_internal_latency_head
Tags: head, host, id, node_id, process, tag
Fields: +Inf (integer), 0.0 (integer), 1.0 (integer), 111.6295328521717 (integer), 12461.15260479408 (integer), 23.183877401213103 (integer), 2588.0054039994393 (integer), 4.814963904455889 (integer), 537.4921713544796 (integer), 59999.999999999985 (integer)

Measurement: statDataHead_performance_internal_throughput
Tags: host, node_id, process, tag
Fields: total_read_requests_size (integer), total_write_requests_size (integer)

Measurement: statDataHead_performance_internal_throughput_head
Tags: head, host, node_id, process, tag
Fields: total_read_requests_size (integer), total_write_requests_size (integer)

Measurement: statDataHead_performance_internal_transactions
Tags: host, node_id, process, tag
Fields: failed_request_counter (integer), succeed_request_counter (integer)

Measurement: statDataHead_performance_internal_transactions_head
Tags: head, host, node_id, process, tag
Fields: failed_request_counter (integer), succeed_request_counter (integer)

Measurement: statDataHead_performance_internal_transactions_head_namespace
Tags: head, host, namespace, node_id, process, tag
Fields: failed_request_counter (integer), succeed_request_counter (integer)

Measurement: statDataHead_performance_internal_transactions_method
Tags: host, method, node_id, process, tag
Fields: failed_request_counter (integer), succeed_request_counter (integer)
Database monitoring_vdc
Performance metrics in this database are values calculated over the whole VDC, without reference to a particular data node.
Most of the values are:
• Rates (number of requests per second): for all measurements not ending in _delta
• Delta values (increase of a counter from the previous timestamp): for all measurements ending in _delta
Measurement: cq_performance_error
Tags: none
Fields: system_errors (float), user_errors (float)

Measurement: cq_performance_error_code
Tags: code
Fields: error_counter (float)

Measurement: cq_performance_error_delta
Tags: none
Fields: system_errors_i (integer), user_errors_i (integer)

Measurement: cq_performance_error_head
Tags: head
Fields: system_errors (float), user_errors (float)

Measurement: cq_performance_error_head_delta
Tags: head
Fields: system_errors_i (integer), user_errors_i (integer)

Measurement: cq_performance_error_ns
Tags: namespace
Fields: system_errors (float), user_errors (float)

Measurement: cq_performance_error_ns_delta
Tags: namespace
Fields: system_errors_i (integer), user_errors_i (integer)

Measurement: cq_performance_latency
Tags: id
Fields: p50 (float), p99 (float)

Measurement: cq_performance_latency_head
Tags: head, id
Fields: p50 (float), p99 (float)

Measurement: cq_performance_throughput
Tags: none
Fields: total_read_requests_size (float), total_write_requests_size (float)

Measurement: cq_performance_throughput_head
Tags: head
Fields: total_read_requests_size (float), total_write_requests_size (float)

Measurement: cq_performance_transaction
Tags: none
Fields: failed_request_counter (float), succeed_request_counter (float)

Measurement: cq_performance_transaction_delta
Tags: none
Fields: failed_request_counter_i (integer), succeed_request_counter_i (integer)

Measurement: cq_performance_transaction_head
Tags: head
Fields: failed_request_counter (float), succeed_request_counter (float)

Measurement: cq_performance_transaction_head_delta
Tags: head
Fields: failed_request_counter_i (integer), succeed_request_counter_i (integer)

Measurement: cq_performance_transaction_method
Tags: method
Fields: failed_request_counter (float), succeed_request_counter (float)

Measurement: cq_performance_transaction_ns
Tags: namespace
Fields: failed_request_counter (float), succeed_request_counter (float)

Measurement: cq_performance_transaction_ns_delta
Tags: namespace
Fields: failed_request_counter_i (integer), succeed_request_counter_i (integer)
Dashboard APIs to be deprecated or changed in the next release
The dashboard APIs listed below will be deprecated or changed in the next major release of ECS. New or replacement APIs are listed for reference. You are advised to adjust your use of these APIs in anticipation of the next release.
API to be removed
The following table lists the APIs that will be removed in a future release:
Table 32 API - Remove
API Name | Syntax | Description
Get Process | GET /dashboard/processes/{id} | Gets the process instance details.
Get Node Processes | GET /dashboard/nodes/{id}/processes | Gets the details of processes in the node.
Get Hosted Zone | GET /dashboard/zones/hostedzone | Gets the hosted VDC details.
Get Zone | GET /dashboard/zones/{id} | Gets the hosted VDC details.
Get Hosted Zone Replication Groups | GET /dashboard/zones/hostedzone/replicationgroups | Gets the hosted VDC replication groups details.
API to be changed
The following table lists the APIs that will be changed in a future release:
Table 33 API - Change
API Name | Syntax | Description
Get Local Zone | GET /dashboard/zones/localzone | Gets the local VDC details.
Get Local Zone Nodes | GET /dashboard/zones/localzone/nodes | Gets the local VDC node details.
Get Node | GET /dashboard/nodes/{id} | Gets the node instance details.
Get Storage Pool Nodes | GET /dashboard/storagepools/{id}/nodes | Gets the details of nodes in the storage pool.
The following data will be removed from these APIs:
• nodeCpuUtilization*, nodeMemoryUtilizationBytes*, nodeMemoryUtilization*
• nodeNicBandwidth*, nodeNicReceivedBandwidth*, nodeNicTransmittedBandwidth*
• nodeNicUtilization*, nodeNicReceivedUtilization*, nodeNicTransmittedUtilization*
• capacityRebalanceEnabled, capacityRebalanced, capacityPendingRebalancing
• capacityRebalancedAvg, capacityRebalanceRate, capacityPendingRebalancingAvg
• transactionReadLatency, transactionWriteLatency, transactionReadBandwidth, transactionWriteBandwidth
• transactionReadTransactionsPerSec, transactionWriteTransactionsPerSec, transactionErrors.*
• diskReadBandwidthTotal, diskWriteBandwidthTotal, diskReadBandwidthEc, diskWriteBandwidthEc
• diskReadBandwidthCc, diskWriteBandwidthCc, diskReadBandwidthRecovery, diskWriteBandwidthRecovery
• diskReadBandwidthGeo, diskWriteBandwidthGeo, diskReadBandwidthUser
• diskWriteBandwidthUser, diskReadBandwidthXor, diskWriteBandwidthXor
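To find out whether a client still depends on data that is slated for removal, you could scan a saved dashboard API response for the field names listed above. The following Python sketch is one way to do that. It assumes that the trailing * in the list denotes a family of names sharing a prefix, treats transactionErrors.* as the prefix transactionErrors, and inspects top-level keys only; the sample payload is hypothetical.

```python
from fnmatch import fnmatchcase

# Patterns copied from the list above; "*" is read as "any suffix" (an assumption).
REMOVED_FIELD_PATTERNS = [
    "nodeCpuUtilization*", "nodeMemoryUtilizationBytes*", "nodeMemoryUtilization*",
    "nodeNicBandwidth*", "nodeNicReceivedBandwidth*", "nodeNicTransmittedBandwidth*",
    "nodeNicUtilization*", "nodeNicReceivedUtilization*", "nodeNicTransmittedUtilization*",
    "capacityRebalanceEnabled", "capacityRebalanced", "capacityPendingRebalancing",
    "capacityRebalancedAvg", "capacityRebalanceRate", "capacityPendingRebalancingAvg",
    "transactionReadLatency", "transactionWriteLatency",
    "transactionReadBandwidth", "transactionWriteBandwidth",
    "transactionReadTransactionsPerSec", "transactionWriteTransactionsPerSec",
    "transactionErrors*",
    "diskReadBandwidthTotal", "diskWriteBandwidthTotal",
    "diskReadBandwidthEc", "diskWriteBandwidthEc",
    "diskReadBandwidthCc", "diskWriteBandwidthCc",
    "diskReadBandwidthRecovery", "diskWriteBandwidthRecovery",
    "diskReadBandwidthGeo", "diskWriteBandwidthGeo",
    "diskReadBandwidthUser", "diskWriteBandwidthUser",
    "diskReadBandwidthXor", "diskWriteBandwidthXor",
]


def removed_fields(payload: dict) -> list:
    """Return the sorted top-level keys of payload that match a removed field."""
    return sorted(
        key for key in payload
        if any(fnmatchcase(key, pattern) for pattern in REMOVED_FIELD_PATTERNS)
    )


# Hypothetical response fragment; real responses may nest these fields deeper.
sample = {"displayName": "node1", "nodeCpuUtilizationAvg": 12.5, "diskReadBandwidthTotal": []}
print(removed_fields(sample))
```

fnmatchcase is used instead of fnmatch so that matching stays case-sensitive on every platform.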
API to stay without change
The following table lists the APIs that will not be changed:
Table 34 API - No change
API Name | Syntax | Description
Get Local Zone Storage Pools | GET /dashboard/zones/localzone/storagepools | Gets the local VDC storage pool details.
Get Local Zone Replication Groups | GET /dashboard/zones/localzone/replicationgroups | Gets the local VDC replication groups details.
Get Local Zone Replication Group Failed Links | GET /dashboard/zones/localzone/rglinksFailed | Gets the local VDC replication group failed links details.
Get Local Zone Disks | GET /dashboard/zones/localzone/disks | Gets the local VDC disks details.
Get Storage Pool | GET /dashboard/storagepools/{id} | Gets the storage pool details.
Get Disk | GET /dashboard/disks/{id} | Gets the disk instance details.
Get Node Disks | GET /dashboard/nodes/{id}/disks | Gets the details of disks in the node.
Get Local Zone Replication Group Bootstrap Links | GET /dashboard/zones/localzone/rglinksBootstrap | Gets the local VDC replication group bootstrap links details.
Get Replication Group | GET /dashboard/replicationgroups/{id} | Gets the replication group instance details.
Get RG Link | GET /dashboard/rglinks/{id} | Gets the replication group link instance details.
Get Replication Group Links | GET /dashboard/replicationgroups/{id}/rglinks | Gets the replication group instance associated link details.
Get Replication Group Data Table | GET /dashboard/datatables/{id} | Gets the datatable details.
Get Replication Group Data Tables | GET /dashboard/replicationgroups/{id}/datatables | Gets the details of the datatables in the replication group.
Get Cas Gc Buckets | GET /dashboard/zones/localzone/cas | Gets the local VDC CAS GC buckets details.
Get Cas Gc Bucket | GET /dashboard/cas/{id} | Gets the CAS GC bucket instance details.
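When auditing scripts that call the dashboard API, it can help to look up each request path against Tables 32 to 34. The following Python sketch encodes the three tables as a lookup and is only an illustration. The function name and status labels are made up for this example, and the node address in the usage line is hypothetical. Literal paths such as /dashboard/zones/localzone are checked before {id} templates so that they are not mistaken for /dashboard/zones/{id}.

```python
import re

# Path templates from Tables 32 (remove), 33 (change), and 34 (no change).
STATUS_BY_TEMPLATE = {
    "/dashboard/processes/{id}": "remove",
    "/dashboard/nodes/{id}/processes": "remove",
    "/dashboard/zones/hostedzone": "remove",
    "/dashboard/zones/{id}": "remove",
    "/dashboard/zones/hostedzone/replicationgroups": "remove",
    "/dashboard/zones/localzone": "change",
    "/dashboard/zones/localzone/nodes": "change",
    "/dashboard/nodes/{id}": "change",
    "/dashboard/storagepools/{id}/nodes": "change",
    "/dashboard/zones/localzone/storagepools": "no change",
    "/dashboard/zones/localzone/replicationgroups": "no change",
    "/dashboard/zones/localzone/rglinksFailed": "no change",
    "/dashboard/zones/localzone/disks": "no change",
    "/dashboard/storagepools/{id}": "no change",
    "/dashboard/disks/{id}": "no change",
    "/dashboard/nodes/{id}/disks": "no change",
    "/dashboard/zones/localzone/rglinksBootstrap": "no change",
    "/dashboard/replicationgroups/{id}": "no change",
    "/dashboard/rglinks/{id}": "no change",
    "/dashboard/replicationgroups/{id}/rglinks": "no change",
    "/dashboard/datatables/{id}": "no change",
    "/dashboard/replicationgroups/{id}/datatables": "no change",
    "/dashboard/zones/localzone/cas": "no change",
    "/dashboard/cas/{id}": "no change",
}


def _template_to_regex(template: str) -> "re.Pattern":
    """Turn a template into a regex in which {id} matches one path segment."""
    parts = ["[^/]+" if part == "{id}" else re.escape(part) for part in template.split("/")]
    return re.compile("/".join(parts))


def api_status(path: str) -> str:
    """Return 'remove', 'change', 'no change', or 'unknown' for a request path."""
    path = path.rstrip("/")
    if path in STATUS_BY_TEMPLATE:  # literal paths win over {id} templates
        return STATUS_BY_TEMPLATE[path]
    for template, status in STATUS_BY_TEMPLATE.items():
        if "{id}" in template and _template_to_regex(template).fullmatch(path):
            return status
    return "unknown"


print(api_status("/dashboard/nodes/10.0.0.1/processes"))
```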
Flux API for deprecated dashboard API
You can retrieve analogues of the metrics that are currently available through the dashboard API, which will be deprecated in a future release. The Flux API will provide access to the metrics used to build the new Advanced Monitoring dashboards. Node-level or process-level metrics will be available as raw metrics. In ECS 3.4, not all of the metrics that are planned for deprecation in a future version are available through the Flux API. The metrics that are available in ECS 3.4 are listed here:
Processes statistics
• Dashboard API: GET /dashboard/nodes/{id}/processes
• Flux API
  Database: monitoring_op
  Measurement: procstat
  Fields: memory_rss, cpu_usage, and num_threads
  Tags: host (hostname, FQDN), node_id (host ID), and process_name. The valid process names are:
  ▪ blobsvc
  ▪ cm
  ▪ coordinatorsvc
  ▪ dataheadsvc
  ▪ dtquery
  ▪ ecsportalsvc
  ▪ eventsvc
  ▪ georeceiver
  ▪ metering
  ▪ objcontrolsvc
  ▪ resourcesvc
  ▪ transformsvc
  ▪ vnest
  ▪ fluxd
  ▪ influxd
  ▪ throttler
  ▪ grafana-server
  ▪ dockerd
  ▪ fabric-agent
  ▪ fabric-lifecycle
  ▪ fabric-registry
  ▪ fabric-zookeeper
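The database, measurement, fields, and tags above can be combined into a Flux query. The following Python sketch only builds the query text. It assumes that the monitoring_op database name can be used directly as the Flux bucket name, and it does not show how the query is submitted to ECS. The function name and the default time range are illustrative.

```python
VALID_FIELDS = {"memory_rss", "cpu_usage", "num_threads"}


def procstat_query(field, process_name, start="-1h", bucket="monitoring_op"):
    """Build a Flux query for one procstat field of one process."""
    if field not in VALID_FIELDS:
        raise ValueError(f"unknown procstat field: {field!r}")
    return "\n".join([
        f'from(bucket: "{bucket}")',  # assumption: database name doubles as bucket
        f"  |> range(start: {start})",
        '  |> filter(fn: (r) => r._measurement == "procstat")',
        f'  |> filter(fn: (r) => r._field == "{field}")',
        f'  |> filter(fn: (r) => r.process_name == "{process_name}")',
    ])


print(procstat_query("cpu_usage", "blobsvc"))
```

Adding a further filter on r.host or r.node_id would narrow the result to a single node.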
Nodes statistics
• Dashboard API: GET /dashboard/nodes/{id}
• Flux API
  Database: monitoring_op
  Measurement: cpu
  Fields: usage_idle
  Tags:
  ▪ host - hostname (FQDN)
  ▪ node_id - host ID
  Measurement: mem
  Fields: free - free memory on host (bytes)
  Tags:
  ▪ host - hostname (FQDN)
  ▪ node_id - host ID
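The cpu measurement exposes idle time rather than utilization, so a client that previously read a CPU utilization value from the dashboard API needs a small conversion. The following Python sketch assumes that usage_idle is a percentage between 0 and 100, as it is in the Telegraf cpu input plugin; this guide does not state the unit, so verify it against your own data. The function name is illustrative.

```python
def cpu_utilization_percent(usage_idle: float) -> float:
    """Convert an idle percentage (assumed 0-100) into a utilization percentage."""
    if not 0.0 <= usage_idle <= 100.0:
        raise ValueError(f"usage_idle out of range: {usage_idle}")
    return 100.0 - usage_idle


# Hypothetical sample: a node that is 87.5% idle.
print(cpu_utilization_percent(87.5))
```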
Performance statistics
• Dashboard API
  ▪ GET /dashboard/nodes/{id}
  ▪ GET /dashboard/zones/localzone
  ▪ GET /dashboard/zones/localzone/nodes
• Flux API: See the List of metrics for performance-related data section for details.
CHAPTER 5
Examining Service Logs
• ECS service logs
ECS service logs
Describes the location and content of ECS service logs.
You can access ECS service logs directly through an SSH session on a node. Change to the following directory: /opt/emc/caspian/fabric/agent/services/object/main/log. You can also access the logs from the Service Console.
Note: The emcservice user cannot access service logs. When the node is locked using the platform lockdown feature, a user cannot access service logs. Only an administrator who has permission to access the node can access the logs.
The following logs are provided:
• authsvc.log: Records information from the authentication service.
• blobsvc*.log: Records aspects of the binary large object (BLOB) service.
• cassvc*.log: Records aspects of the CAS service.
• coordinatorsvc.log: Records information from the coordinator service.
• ecsportalsvc.log: Records information from the ECS Portal service.
• eventsvc*.log: Records aspects of the event service. This information is available in the ECS Portal at Monitor > Events.
• hdfssvc*.log: Records aspects of the HDFS service.
• objcontrolsvc.log: Records information from the object service.
• objheadsvc*.log: Records aspects of the various object heads supported by the object service.
• provisionsvc*.log: Records aspects of the ECS provisioning service.
• resourcesvc*.log: Records information that is related to global resources such as namespaces, buckets, and object users.
• dataheadsvc-access.log: Records aspects of the object heads supported by the object service, the file service supported by HDFS, and the CAS service.
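The file name patterns in the list above can be used to label the files found in the log directory. The following Python sketch is a minimal illustration and is not part of ECS. The function names are made up, the descriptions are shortened from the list, and the example file name blobsvc-error.log is hypothetical.

```python
from fnmatch import fnmatchcase
from pathlib import Path

LOG_DIR = Path("/opt/emc/caspian/fabric/agent/services/object/main/log")

# File name patterns and shortened descriptions from the list above.
LOG_PATTERNS = {
    "authsvc.log": "authentication service",
    "blobsvc*.log": "binary large object (BLOB) service",
    "cassvc*.log": "CAS service",
    "coordinatorsvc.log": "coordinator service",
    "ecsportalsvc.log": "ECS Portal service",
    "eventsvc*.log": "event service",
    "hdfssvc*.log": "HDFS service",
    "objcontrolsvc.log": "object service",
    "objheadsvc*.log": "object heads",
    "provisionsvc*.log": "ECS provisioning service",
    "resourcesvc*.log": "global resources",
    "dataheadsvc-access.log": "object head, HDFS file service, and CAS access",
}


def describe_log(filename: str):
    """Return the description for a log file name, or None if it is not listed."""
    for pattern, description in LOG_PATTERNS.items():
        if fnmatchcase(filename, pattern):
            return description
    return None


def list_known_logs(directory: Path = LOG_DIR) -> dict:
    """Map each listed log file in directory to its description."""
    if not directory.is_dir():  # for example, when not running on an ECS node
        return {}
    return {
        entry.name: describe_log(entry.name)
        for entry in sorted(directory.iterdir())
        if entry.is_file() and describe_log(entry.name)
    }


print(describe_log("blobsvc-error.log"))
```

list_known_logs returns an empty dictionary when the directory is missing, so the sketch can be run away from a node without raising an error.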