Best Practices: Managing a Healthy Couchbase Server Deployment: Couchbase Connect 2014

Managing a Couchbase Deployment

Justin Michaels | Couchbase

justin@couchbase.com

What does a Healthy Cluster look like

What does an Unhealthy Cluster look like

Key System Metrics

Health-Check tools Couchbase provides

Agenda

A Healthy Cluster

Healthy Couchbase Cluster

‘Active vBuckets’ count across all the servers should be equal to “1024”

‘Replica vBuckets’ count across all the servers should be equal to “1024 * <number of replicas configured>”
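The two counts above can be checked mechanically. A minimal sketch, assuming the per-node vBucket counts have already been collected into plain lists; the function name and input shape are hypothetical, not a Couchbase API:

```python
TOTAL_VBUCKETS = 1024  # per-bucket vBucket count stated on the slide


def vbucket_counts_healthy(active_per_node, replica_per_node, num_replicas):
    """Return True when cluster-wide vBucket totals match the expected values.

    active_per_node / replica_per_node: per-node counts (hypothetical input).
    num_replicas: number of replicas configured for the bucket.
    """
    expected_active = TOTAL_VBUCKETS
    expected_replica = TOTAL_VBUCKETS * num_replicas
    return (sum(active_per_node) == expected_active
            and sum(replica_per_node) == expected_replica)


# Four nodes, one replica, evenly balanced.
print(vbucket_counts_healthy([256] * 4, [256] * 4, num_replicas=1))
# One node is short ten replica vBuckets.
print(vbucket_counts_healthy([256] * 4, [256, 256, 256, 246], num_replicas=1))
```

The first call reports a healthy layout and the second does not, because the replica total falls short of 1024 * 1.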

Healthy Couchbase Cluster

‘Cache Miss Ratio’ and ‘Disk Reads per Sec’ across all the servers should be equal to “0” (depending on resident ratio!)

Items in ‘TAP Queues’ (2.x) or ‘DCP Queues’ (3.x) should be “0”

An Unhealthy Cluster

Unhealthy Couchbase Cluster

‘Memory used’ equal to the ‘High Water Mark’ means active items are being evicted from RAM

If ‘Drain rate’ is much lower than ‘Fill rate’, then the disk queue will hit TAP/DCP back-off
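Both conditions are simple comparisons between two stats. A rough sketch, assuming the four values have already been sampled; `lag_factor` is an illustrative assumption for what counts as "much lower", not a Couchbase setting:

```python
def unhealthy_signals(mem_used, high_water_mark, drain_rate, fill_rate,
                      lag_factor=0.5):
    """Return warning strings for the two conditions described on the slide.

    lag_factor (assumption): drain is treated as "much lower" than fill
    when drain_rate < lag_factor * fill_rate.
    """
    warnings = []
    if mem_used >= high_water_mark:
        warnings.append("memory used at high water mark: items are being evicted")
    if fill_rate > 0 and drain_rate < lag_factor * fill_rate:
        warnings.append("drain rate far below fill rate: disk queue will grow")
    return warnings


# Memory is below the mark, but the disk drains at a tenth of the fill rate.
for line in unhealthy_signals(mem_used=700, high_water_mark=850,
                              drain_rate=1_000, fill_rate=10_000):
    print(line)
```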

Unhealthy Couchbase Cluster

‘TMP Out Of Memory’ means memory usage is at or above 90% of the bucket memory quota

‘Disk Reads per Sec’ and ‘Cache Miss Ratio’ growing into the hundreds or thousands means there is no more memory capacity
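The 90% figure can be turned into an early-warning check. A minimal sketch using integer arithmetic to avoid rounding at the boundary; the function name is hypothetical and the inputs are assumed to be in bytes:

```python
TEMP_OOM_PERCENT = 90  # "at or above 90% of bucket memory quota", per the slide


def at_temp_oom_level(mem_used_bytes, bucket_quota_bytes):
    """True when memory used is at or above 90% of the bucket quota."""
    if bucket_quota_bytes <= 0:
        raise ValueError("bucket quota must be positive")
    # Compare cross-multiplied integers instead of dividing.
    return mem_used_bytes * 100 >= bucket_quota_bytes * TEMP_OOM_PERCENT


quota = 4 * 1024 ** 3  # a 4 GiB bucket quota, chosen only for illustration
print(at_temp_oom_level(3 * 1024 ** 3, quota))       # 75% of quota
print(at_temp_oom_level(int(3.7 * 1024 ** 3), quota))  # about 92% of quota
```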

Key Metrics

Perry Krug http://blog.couchbase.com/how-many-nodes-part-4-monitoring-sizing

Justin Michaels http://blog.couchbase.com/monitoring-couchbase-cluster

"cache miss ratio"

Goal: Low as possible

Many customers run with <1%, some are upwards of 10-20% with the understanding that the performance is not as good but that it is okay for their application.

Relationship between "memory used" and the "high watermark"

Memory used reaching the high watermark is OK, but never for any sustained period of time.

Sustained time at the high watermark is a "red flag" that you need more RAM: add another node.
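Both metrics on this slide reduce to small calculations. A sketch under two assumptions: the cache miss ratio is taken as disk fetches divided by gets, and "sustained" is modelled as a run of consecutive samples whose length (`min_consecutive`) is a hypothetical parameter you would tune:

```python
def cache_miss_ratio(disk_fetches, gets):
    """Percentage of reads that had to go to disk rather than RAM."""
    return 0.0 if gets == 0 else 100.0 * disk_fetches / gets


def sustained_at_high_water_mark(mem_used_samples, high_water_mark,
                                 min_consecutive=5):
    """True if memory used stayed at or above the mark for
    min_consecutive samples in a row."""
    run = 0
    for sample in mem_used_samples:
        run = run + 1 if sample >= high_water_mark else 0
        if run >= min_consecutive:
            return True
    return False


print(cache_miss_ratio(disk_fetches=5, gets=1000))  # 0.5
# A brief touch of the mark is fine; a long run is the red flag.
print(sustained_at_high_water_mark([80, 100, 85, 100, 90], 100, min_consecutive=3))
print(sustained_at_high_water_mark([80, 100, 101, 100, 90], 100, min_consecutive=3))
```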

Key Metrics

"disk write queue" (TAP or DCP)

Goal: Peaks approaching 1 million items per node

Peaks and valleys getting higher over time indicate that you need more IO, and likely more nodes to provide it.

Will grow and shrink over time but is a good indication of available IO capacity.

"temp OOM"

Should always be 0 unless you are explicitly expecting it (for example, in a bulk load scenario)
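"Peaks getting higher over time" can be detected by comparing the peak of each time window with the next. A minimal sketch, assuming disk write queue samples arrive as a flat list; the window size and function names are hypothetical:

```python
def window_peaks(samples, window):
    """Maximum of each consecutive window of `window` samples."""
    return [max(samples[i:i + window]) for i in range(0, len(samples), window)]


def peaks_rising(samples, window):
    """True if every window's peak is higher than the previous window's."""
    peaks = window_peaks(samples, window)
    return len(peaks) >= 2 and all(a < b for a, b in zip(peaks, peaks[1:]))


# Queue depth sampled over three windows of three samples each.
queue = [10, 50, 20, 15, 80, 30, 25, 120, 40]
print(window_peaks(queue, 3))  # [50, 80, 120]
print(peaks_rising(queue, 3))  # True
```

A queue that grows and shrinks around a stable peak returns False here, which matches the slide's point that growth and shrinkage alone are normal.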

Key Metrics

"fragmentation" (docs and views)

Goal: Under 2x%

Will grow and shrink

Problematic if constantly higher than the compaction threshold: that indicates compaction is not keeping up, or is not running at all for some reason, and may lead to out-of-disk-space issues.

Note: Monitor disk usage outside of Couchbase and make sure that it doesn't reach a critical level (90%; it shouldn't ever get to that).
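A rough sketch of both checks. The fragmentation formula (wasted space as a share of file size) and the 30% threshold in the usage example are assumptions for illustration; disk usage is read with the standard library's `shutil.disk_usage`, independently of Couchbase:

```python
import shutil


def fragmentation_pct(data_size, file_size):
    """Share of the on-disk file not occupied by live data (assumed formula)."""
    if file_size <= 0:
        return 0.0
    return 100.0 * (file_size - data_size) / file_size


def compaction_falling_behind(frag_samples, threshold_pct):
    """True if fragmentation is above the threshold in every sample."""
    return bool(frag_samples) and all(s > threshold_pct for s in frag_samples)


def disk_used_pct(path):
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


print(fragmentation_pct(data_size=60, file_size=100))  # 40.0
print(compaction_falling_behind([35, 40, 38], threshold_pct=30))  # True
# Point this at the Couchbase data directory; "." is only a placeholder.
print(disk_used_pct(".") >= 90.0)
```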

Key Metrics

"outbound XDCR mutations"

An indication of how many outstanding writes are waiting to be sent to the destination cluster.

This will likely always be nonzero under load, so it is hard to say what a "good" threshold is, but you should understand it and monitor it so that it doesn't leave your expected range.

"items" in the "TAP queues" section

If this is above 0, it's an indication that some items are not replicated between nodes and therefore are at risk of data loss if that node fails.

It's extremely unlikely for this to happen during steady state but if there is a network slowdown or disruption this queue will grow and should be an immediate sign that something is wrong.
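These two rules differ in kind: the replication queue has a fixed healthy value (0), while outbound XDCR mutations need a range you define yourself. A minimal sketch; `xdcr_expected_max` is a hypothetical, deployment-specific limit and the function is not part of any Couchbase tool:

```python
def replication_alerts(replication_queue_items, xdcr_outbound, xdcr_expected_max):
    """Return alert strings for intra-cluster and cross-cluster replication."""
    alerts = []
    if replication_queue_items > 0:
        alerts.append("replication queue above 0: some items are not yet replicated")
    if xdcr_outbound > xdcr_expected_max:
        alerts.append("outbound XDCR mutations above expected range")
    return alerts


# Steady state: nothing queued between nodes, XDCR backlog within range.
print(replication_alerts(0, xdcr_outbound=1_200, xdcr_expected_max=5_000))
# Network disruption: both queues are backing up.
print(replication_alerts(350, xdcr_outbound=42_000, xdcr_expected_max=5_000))
```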

Key Metrics

Sample Couchbase Cluster

Couchbase Monitoring

Real-time traffic graphs

Per bucket, per node and aggregate statistics

Monitor inter-node traffic

REST API accessible (Extend existing monitoring system)

Couchbase Monitoring WebUI

External systems can access all statistics from Couchbase's REST API

External systems are in a good position to take into account components that are outside the scope of Couchbase Server.

Whether a network switch that the Couchbase cluster depends on is failing.

Whether the shared storage supporting the cluster is functioning.

Whether routes to nodes in the cluster are healthy.
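An external monitoring system would typically poll the REST API and pick out the stats it cares about. A sketch using only the Python standard library; it assumes the per-bucket stats endpoint on the admin port (8091) and a response shaped like `{"op": {"samples": {name: [values...]}}}`. Verify the path, response shape and stat names against your server version; the host name below is a placeholder:

```python
import base64
import json
import urllib.request


def bucket_stats_url(host, bucket, port=8091):
    """Build the per-bucket stats URL (path assumed; verify for your version)."""
    return f"http://{host}:{port}/pools/default/buckets/{bucket}/stats"


def latest_samples(stats_json, names):
    """Pick the most recent sample for each requested stat that is present."""
    samples = stats_json["op"]["samples"]
    return {n: samples[n][-1] for n in names if samples.get(n)}


def fetch_bucket_stats(host, bucket, user, password, timeout=5):
    """Fetch the stats document with HTTP basic auth (needs a live cluster)."""
    req = urllib.request.Request(bucket_stats_url(host, bucket))
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


# Offline demonstration with a hand-written response of the assumed shape.
fake = {"op": {"samples": {"mem_used": [1, 2, 3], "disk_write_queue": [0, 7]}}}
print(bucket_stats_url("cb1.example.com", "default"))
print(latest_samples(fake, ["mem_used", "disk_write_queue", "not_present"]))
```

Keeping the parsing separate from the HTTP call lets the same `latest_samples` output feed whatever checks the external system already runs for switches, storage and routes.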

External Monitoring

Some options (I’m sure there are more)

Health Check Tool

What is the CBHealthChecker Tool

… Insert sample syntax used …

Web based report

ALERTING user on issues where immediate action is required.

Input to cluster health.

Important buckets statistics

Important stats on each Node

WARNING indicators to point out issues that need to be addressed before they become a problem.

Sample CBHealthChecker

Couchbase Admin UI – Cluster Overview

Cluster Overview Page

Cluster RAM

Cluster DISK Usage

Buckets Deployed in Cluster

Servers Deployed in Cluster

Cluster Rebalance Progress Indicator

Couchbase Admin UI – Server Nodes

Server Page Node

List Active Servers

Expand Individual Servers

Servers Ready for Rebalance

Couchbase Admin UI – Server Nodes

Additional Server Details

Keys transferred, Keys yet to be transferred

Rebalance Progress Indicator in-detail

Memory Utilization on this server

Disk Utilization on this server

Couchbase Admin UI – Monitoring System

Monitoring Stats per Bucket on entire Cluster

120+ Stats collected from entire Cluster

View stats by aggregate

Click the ellipse to view this stat on a per-server basis

Tooltip provides a description and the stats used to calculate it

Couchbase Admin UI – Logging System

Log Event Page

Logs all events occurring on the cluster

With timestamp

Server where event occurred

vBucket Resources

Active State Replica State Pending State Total

Active vBuckets

Replica vBuckets

Disk Queues

Active State Replica State Pending State Total

Tap Queues

Replication Queue Rebalance Queues Client Queues Total

XDCR Stats

Outgoing XDCR stats section displays information about the XDCR operations that are supporting cross datacenter replication from the current cluster to a destination cluster.

Incoming XDCR stats section displays information about the XDCR operations that are coming into the current cluster from a remote cluster.

Sample Healthy Report

Categories to jump to

Time periods

Expanding this provides detailed sizing info

Cluster-wide stats analyzed

Sample Un-Healthy Report

User needs to take action

Details about what action
