Best Practices: Managing a Healthy Couchbase Server Deployment: Couchbase Connect 2014

Post on 25-May-2015

DESCRIPTION

“...An ounce of prevention is worth a pound of cure…” Using customer use cases, we’ll review how to identify a healthy cluster and spot issues before they become problems for your application. The talk will cover Couchbase monitoring, explain what the key statistics mean, and show how to interpret them to identify common problems.

TRANSCRIPT

Managing a Couchbase Deployment

Justin Michaels | Couchbase

justin@couchbase.com

©2014 Couchbase, Inc. 2

Agenda

What does a Healthy Cluster look like

What does an Unhealthy Cluster look like

Key System Metrics

Health-Check tools Couchbase provides

A Healthy Cluster

Healthy Couchbase Cluster

‘Active vBuckets’ count across all the servers should be equal to “1024”

‘Replica vBuckets’ count across all the servers should be equal to “1024 * <number of replicas configured>”
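
The two counts above reduce to simple arithmetic, so they are easy to assert in a monitoring script. The following is a minimal sketch, assuming the per-node vBucket counts have already been collected into plain dictionaries. The function name, the dictionary keys `active` and `replica`, and the sample numbers are hypothetical and are not Couchbase stat names.

```python
TOTAL_VBUCKETS = 1024  # expected active vBucket count, as stated on the slide


def vbucket_counts_healthy(per_node_counts, num_replicas):
    """Return True when the cluster-wide vBucket totals match the expected values.

    per_node_counts: list of dicts such as {"active": 256, "replica": 256}
    (hypothetical shape; fill it from whatever stats source you use).
    """
    active_total = sum(node["active"] for node in per_node_counts)
    replica_total = sum(node["replica"] for node in per_node_counts)
    return (active_total == TOTAL_VBUCKETS
            and replica_total == TOTAL_VBUCKETS * num_replicas)


# Four nodes with one replica configured: 4 * 256 = 1024 active and 1024 replica.
nodes = [{"active": 256, "replica": 256} for _ in range(4)]
print(vbucket_counts_healthy(nodes, num_replicas=1))
```

If a node drops out and its vBuckets are not yet promoted or rebuilt, one of the two totals falls short and the check returns False.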

Healthy Couchbase Cluster

‘Cache Miss Ratio’ and ‘Disk Reads per Sec’ across all the servers should be equal to “0” (depending on resident ratio!)

Items in the ‘TAP Queue’ (2.x) or the ‘DCP Queues’ (3.x) should be “0” in steady state

An Unhealthy Cluster

Unhealthy Couchbase Cluster

When ‘Memory used’ reaches the ‘High Water Mark’, active items are being evicted from RAM

If the ‘Drain rate’ is much lower than the ‘Fill rate’, the Disk Queue will grow until it hits TAP/DCP back-off

Unhealthy Couchbase Cluster

‘TMP Out Of Memory’ errors mean memory usage is at or above 90% of the bucket memory quota

‘Disk Reads per Sec’ and ‘Cache Miss Ratio’ climbing into the hundreds or thousands mean there is no more memory capacity
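
The warning signs on the two slides above can be folded into a single check over one sample of bucket statistics. This is a rough sketch under stated assumptions: the function name, its parameter names, and the `drain_lag_factor` cutoff used to decide what "much lower" means are all hypothetical, and the right cutoff depends on your workload.

```python
def unhealthy_signs(mem_used, high_water_mark, drain_rate, fill_rate,
                    temp_oom_per_sec, drain_lag_factor=0.5):
    """Return a list of warning strings for one sample of bucket stats.

    drain_lag_factor is an assumed cutoff for "much lower": the drain rate
    is flagged when it is below this fraction of the fill rate.
    """
    signs = []
    if mem_used >= high_water_mark:
        signs.append("memory used at or above high water mark")
    if fill_rate > 0 and drain_rate < drain_lag_factor * fill_rate:
        signs.append("drain rate much lower than fill rate")
    if temp_oom_per_sec > 0:
        signs.append("temp OOM errors being returned")
    return signs


# Memory sits at the high water mark and the disk drains at a tenth of the fill rate.
signs = unhealthy_signs(mem_used=850, high_water_mark=850,
                        drain_rate=100, fill_rate=1000, temp_oom_per_sec=0)
print(signs)
```

Returning a list rather than a single boolean keeps each symptom visible, since the remedies differ (more RAM versus more IO capacity).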

Key Metrics

Perry Krug http://blog.couchbase.com/how-many-nodes-part-4-monitoring-sizing

Justin Michaels http://blog.couchbase.com/monitoring-couchbase-cluster

"cache miss ratio"

Goal: As low as possible

Many customers run with <1%; some run upwards of 10-20%, understanding that performance is not as good but is acceptable for their application.

Watch the relationship between "memory used" and the "high watermark".

Memory used reaching the high watermark is okay, but never for any sustained period of time.

A sustained period at the high watermark is a "red flag" that you need more RAM: add another node.
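
One way to tell a brief touch of the high watermark from a sustained one is to look at the longest run of consecutive samples at or above it. This is a minimal sketch; the function names, the `max_run` tolerance, and the sample values are assumptions for illustration, and a sensible tolerance depends on your sampling interval.

```python
def longest_run_at_high_watermark(mem_used_samples, high_water_mark):
    """Return the longest run of consecutive samples at or above the mark."""
    longest = current = 0
    for value in mem_used_samples:
        if value >= high_water_mark:
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # the run is broken as soon as memory drops below the mark
    return longest


def sustained_at_high_watermark(mem_used_samples, high_water_mark, max_run=3):
    """True when memory stayed at the mark for more than max_run samples in a row."""
    return longest_run_at_high_watermark(mem_used_samples, high_water_mark) > max_run


# A short touch (2 samples) followed by a sustained stretch (5 samples).
samples = [70, 85, 86, 70, 85, 85, 85, 85, 85]
print(longest_run_at_high_watermark(samples, high_water_mark=85))
print(sustained_at_high_watermark(samples, high_water_mark=85))
```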

Key Metrics

"disk write queue" (TAP or DCP)

Goal: Peaks approaching 1 million items per node

If the peaks and valleys get higher over time, that indicates you need more IO capacity, and likely more nodes to provide it.

The queue will grow and shrink over time, but it is a good indication of available IO capacity.
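
Because the queue naturally grows and shrinks, a single reading says little; the trend of the peaks is what matters. The sketch below shows one simple way to test for rising peaks, by splitting the samples into fixed windows and comparing each window's maximum with the previous one. The function names, the window size, and the sample values are hypothetical.

```python
def window_peaks(queue_samples, window):
    """Split the samples into consecutive windows and return each window's peak."""
    return [max(queue_samples[i:i + window])
            for i in range(0, len(queue_samples), window)]


def peaks_rising(queue_samples, window):
    """True when every window's peak is higher than the previous window's peak."""
    peaks = window_peaks(queue_samples, window)
    return len(peaks) > 1 and all(b > a for a, b in zip(peaks, peaks[1:]))


# Disk write queue sizes sampled over three periods of three samples each.
rising = [10, 200, 50, 20, 300, 80, 40, 450, 90]
steady = [10, 200, 50, 20, 190, 80, 40, 200, 90]
print(window_peaks(rising, window=3))
print(peaks_rising(rising, window=3), peaks_rising(steady, window=3))
```

A strictly-increasing test is deliberately crude; a production check would more likely fit a slope over a longer history.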

"temp OOM"

Should always be 0 unless you are explicitly expecting it (for example, in a bulk load scenario)

Key Metrics

"fragmentation" (docs and views)

Goal: Under 2x%

Will grow and shrink

Problematic if constantly higher than the compaction threshold: that indicates compaction is not keeping up, or is not running at all for some reason, and it may lead to out-of-disk-space issues.

Note: Monitor disk usage outside of Couchbase and make sure that it doesn't reach a critical level (90%; it shouldn't ever get to that).
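
Both checks on this slide can be sketched in a few lines. The fragmentation formula used here, the share of the on-disk file not occupied by live data, is an assumption about how the statistic is derived rather than something stated in the talk, and `compaction_threshold` must be set to whatever your cluster is configured with. The disk check uses Python's standard `shutil.disk_usage`, which runs outside of Couchbase as the note suggests. All function names are hypothetical.

```python
import shutil


def fragmentation_percent(disk_size, data_size):
    """Assumed definition: percentage of the file on disk that is not live data."""
    if disk_size <= 0:
        return 0.0
    return (disk_size - data_size) / disk_size * 100.0


def compaction_falling_behind(fragmentation_samples, compaction_threshold):
    """True when every sample is above the threshold, i.e. constantly higher."""
    return bool(fragmentation_samples) and all(
        sample > compaction_threshold for sample in fragmentation_samples)


def disk_used_percent(path):
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100.0


print(fragmentation_percent(disk_size=1000, data_size=750))
print(compaction_falling_behind([35.0, 40.0, 38.0], compaction_threshold=30.0))
print(disk_used_percent(".") >= 90.0)  # critical level named on the slide
```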

Key Metrics

"outbound XDCR mutations"

An indication of how many outstanding writes are waiting to be sent to the destination cluster.

This will likely always have some value under load, so it's hard to say what a "good" threshold is, but it's something you should understand and monitor so that it doesn't go outside your expected range.

"items" in the "TAP queues" section

If this is above 0, it's an indication that some items are not replicated between nodes and therefore are at risk of data loss if that node fails.  

It's extremely unlikely for this to happen during steady state but if there is a network slowdown or disruption this queue will grow and should be an immediate sign that something is wrong.

Key Metrics

Sample Couchbase Cluster

Couchbase Monitoring

Real-time traffic graphs

Per bucket, per node and aggregate statistics

Monitor inter-node traffic

REST API accessible (Extend existing monitoring system)

Couchbase Monitoring WebUI

External systems can access all statistics from Couchbase's REST API

External systems are in a good position to take into account components that are outside the scope of Couchbase Server.

A network switch is failing, and the Couchbase cluster depends on that switch.

Whether shared storage supporting the cluster is functioning.

Whether routes to nodes in the cluster are healthy.
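
As an illustration of how an external system might pull statistics over REST, the sketch below requests the per-bucket stats resource on the administration port and extracts the most recent sample of a few statistics. Treat the details as assumptions to verify against your server version: the `/pools/default/buckets/<bucket>/stats` path, port 8091, the `op` and `samples` keys in the response, and the stat names in the example payload. The host, bucket, and function names are hypothetical.

```python
import base64
import json
import urllib.parse
import urllib.request


def bucket_stats_url(host, bucket, port=8091):
    """Build the (assumed) REST URL for a bucket's statistics."""
    return "http://{}:{}/pools/default/buckets/{}/stats".format(
        host, port, urllib.parse.quote(bucket, safe=""))


def fetch_bucket_stats(host, bucket, user, password, port=8091, timeout=10):
    """Fetch the stats document using HTTP basic authentication."""
    request = urllib.request.Request(bucket_stats_url(host, bucket, port))
    token = base64.b64encode("{}:{}".format(user, password).encode("utf-8"))
    request.add_header("Authorization", "Basic " + token.decode("ascii"))
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def latest_samples(stats_payload, stat_names):
    """Return the newest value of each requested stat that has samples."""
    samples = stats_payload.get("op", {}).get("samples", {})
    return {name: samples[name][-1] for name in stat_names if samples.get(name)}


# Canned payload in the assumed response shape, so the parsing runs offline.
payload = {"op": {"samples": {"mem_used": [1, 2, 3], "ep_queue_size": [10, 0]}}}
print(bucket_stats_url("node1", "my bucket"))
print(latest_samples(payload, ["mem_used", "ep_queue_size", "missing"]))
```

Keeping the parsing separate from the HTTP call lets the same extraction feed whichever external monitoring system is already in place.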

External Monitoring

Some options (I’m sure there are more)

Health Check Tool

What is the CBHealthChecker Tool

… Insert sample syntax used …

Web based report

ALERTING user on issues where immediate action is required.

Input to cluster health.

Important buckets statistics

Important stats on each Node

WARNING indicators to point out issues that need to be addressed before they become a problem.

Sample CBHealthChecker

Couchbase Admin UI – Cluster Overview

Cluster Overview Page

Cluster RAM Usage

Cluster DISK Usage

Buckets Deployed in Cluster

Servers Deployed in Cluster

Cluster Rebalance Progress Indicator

Couchbase Admin UI – Server Nodes

Server Page

Node List

Active Servers

Expand Individual Servers

Servers Ready for Rebalance

Couchbase Admin UI – Server Nodes

Additional Server Details

Keys transferred, Keys yet to be transferred

Rebalance Progress Indicator in-detail

Memory Utilization on this server

Disk Utilization on this server

Couchbase Admin UI – Monitoring System

Monitoring Stats per Bucket on entire Cluster

120+ Stats collected from entire Cluster

View stats by aggregate

Click the ellipse to view this stat on a per-server basis

Tooltip provides description and stats used for calculating

Couchbase Admin UI – Logging System

Log Event Page

Logs all events occurring on the cluster

With timestamp

Server where event occurred

vBucket Resources

Active State Replica State Pending State Total

Active vBuckets

Replica vBuckets

Disk Queues

Active State Replica State Pending State Total

Tap Queues

Replication Queue Rebalance Queues Client Queues Total

XDCR Stats

Outgoing XDCR stats section displays information about the XDCR operations that are supporting cross datacenter replication from the current cluster to a destination cluster.

Incoming XDCR stats section displays information about the XDCR operations that are coming in to the current cluster from a remote cluster.

Sample Healthy Report

Categories to jump to

Time periods

Expanding this provides detailed sizing info

Cluster-wide stats analyzed

Sample Un-Healthy Report

User needs to take action

Details about what action
