Ultra-High Availability & Disaster Recovery with Couchbase Server: Couchbase Connect 2014
Ultra-High Availability & Disaster Recovery with Couchbase Server
Anil Kumar Product Management, Couchbase
©2014 Couchbase, Inc. 2
About Me
Anil Kumar, Product Manager, Couchbase
anil@couchbase.com
@anilkumar1129
Part I - High Availability: single node architecture; local data redundancy; rebalance and failover; node recovery
Part II - Disaster Recovery: business continuity for "mission-critical" applications; geo redundancy; backup-restore for worst-case scenarios
Demo
Q & A
High-Availability & Disaster Recovery
Part I - High Availability
Couchbase Server – Single Node Architecture
[Diagram: single-node architecture]
Cluster Manager (Erlang/OTP):
On each node: heartbeat, process monitor, global singleton supervisor, configuration manager
One per cluster: rebalance orchestrator, node health monitor, vBucket state and replication manager
HTTP REST management API / Web UI (HTTP 8091), Erlang port mapper (4369), distributed Erlang (21100-21199)
Data Manager:
Memcached with Couchbase EP Engine (11210, Memcapable 2.0) and Moxi (11211, Memcapable 1.0)
Storage interface and persistence layer
Query Engine (8092, Query API)
The single-node architecture is the foundation of the high availability architecture
No Single Point of Failure (SPOF)
Easy scalability
Intra-Cluster Replication – Data Redundancy
RAM to RAM replication
Max of 4 copies of data in a Cluster
Bandwidth optimized through de-duplication ('de-dup') of items
Intra-cluster replication is the process of replicating data on multiple servers within a cluster in order to provide data redundancy.
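Under the hood, Couchbase hashes each key to one of 1,024 vBuckets and a cluster map assigns every vBucket an active node plus replica nodes. The following is a minimal sketch of that idea; the round-robin placement and function names are illustrative assumptions, not the server's actual internals.

```python
import zlib

NUM_VBUCKETS = 1024  # Couchbase Server uses 1,024 vBuckets per bucket


def vbucket_for_key(key: str) -> int:
    # Hash the key (CRC32, simplified here) and take it modulo the vBucket count
    return zlib.crc32(key.encode("utf-8")) % NUM_VBUCKETS


def build_vbucket_map(nodes, num_replicas=1):
    """Assign each vBucket an active node and num_replicas replica nodes.
    Round-robin placement is a simplification of the server's algorithm."""
    vbucket_map = []
    for vb in range(NUM_VBUCKETS):
        chain = [nodes[(vb + i) % len(nodes)] for i in range(num_replicas + 1)]
        vbucket_map.append(chain)  # chain[0] is active, the rest are replicas
    return vbucket_map


nodes = ["node1", "node2", "node3", "node4"]
vb_map = build_vbucket_map(nodes, num_replicas=1)
vb = vbucket_for_key("user::1234")
print("vBucket:", vb, "active:", vb_map[vb][0], "replica:", vb_map[vb][1])
```

Because every client computes the same hash against the same cluster map, any node can be reached directly with no routing tier, which is what removes the single point of failure.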
Write Operation – Data Redundancy
[Diagram: write operation on a single node]
The app server writes a document into the managed cache; the replication queue sends it memory-to-memory to other nodes, and the disk queue persists it to disk.
(New in 3.0) Database Change Protocol (DCP) – Data Redundancy
DCP is the new streaming replication protocol in Couchbase Server 3.0
High-performance, stream-based protocol
Better resumability after network blips and failures
Powers Intra Cluster Replication
Powers Cross Datacenter Replication
Powers Incremental Backup & Restore
Up to 150x Improvement on ReplicateTo latency from 2.5 to 3.0
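DCP's resumability comes from streaming mutations per vBucket with monotonically increasing sequence numbers: a consumer remembers the last sequence number it applied and, after a blip, asks the producer to resume from there. A toy simulation of that mechanism (not the real Couchbase wire protocol):

```python
class DCPProducer:
    """Toy stand-in for a DCP stream source: an append-only mutation log
    keyed by sequence number."""
    def __init__(self):
        self.log = []          # list of (seqno, key, value)
        self.next_seqno = 1

    def mutate(self, key, value):
        self.log.append((self.next_seqno, key, value))
        self.next_seqno += 1

    def stream_from(self, start_seqno):
        # Resumability: replay only mutations after the consumer's last seqno
        return [m for m in self.log if m[0] > start_seqno]


class DCPConsumer:
    def __init__(self):
        self.last_seqno = 0
        self.data = {}

    def apply(self, mutations):
        for seqno, key, value in mutations:
            self.data[key] = value
            self.last_seqno = seqno


producer = DCPProducer()
consumer = DCPConsumer()
producer.mutate("a", 1)
producer.mutate("b", 2)
consumer.apply(producer.stream_from(consumer.last_seqno))
# ...connection blip; a new mutation lands while the consumer is offline...
producer.mutate("a", 3)
consumer.apply(producer.stream_from(consumer.last_seqno))  # resumes, no full resync
```

Because only the missed tail of the log is replayed, reconnecting after a failure is cheap, which is why the same protocol can power intra-cluster replication, XDCR, and incremental backup.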
(New in 3.0) Auto Tuning Shared Thread Pool - Durability
Efficient auto-tuning engine: detects and allocates threads based on hardware resources, and pools threads for best resource utilization
Improved latency across the board: faster rebalance, faster node reactivation, faster durability with writes & PersistTo
Up to 3x better PersistTo latency from 2.5 to 3.0
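The general idea of the shared pool, sizing one worker pool from detected hardware and letting all subsystems share it, can be sketched with Python's standard library; the sizing formula here is purely illustrative, not the server's tuning logic.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def make_shared_pool():
    # Detect hardware resources and size the shared pool accordingly,
    # in the spirit of 3.0's auto-tuning (this formula is an assumption)
    cpus = os.cpu_count() or 1
    return ThreadPoolExecutor(max_workers=max(4, cpus * 2))


# All subsystems share one pool instead of each spawning its own threads,
# which is what improves overall resource utilization
pool = make_shared_pool()
futures = [pool.submit(pow, n, 2) for n in range(8)]
results = [f.result() for f in futures]
print(results)
```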
Rebalance Operation in Couchbase Server – Data Availability
Rebalance redistributes data partitions around the cluster: when adding nodes, when removing nodes, and when nodes have failed over
Aim is to bring cluster back to optimal health
Data-partitions are moved between nodes automatically
Rebalance happens on an active cluster: you can expand or shrink without pausing your application, and client libraries automatically handle the rebalance and redistribute their requests accordingly
Up to 2x Faster Rebalance under Load between 3.0 and 2.5.1
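Conceptually, rebalance recomputes the partition-to-node map for the new topology and then moves only the partitions whose owner changed, while clients keep working against the live map. A simplified sketch (round-robin placement is an illustrative stand-in for the server's algorithm):

```python
def assign(partitions, nodes):
    # Round-robin ownership: a stand-in for the server's placement algorithm
    return {p: nodes[p % len(nodes)] for p in partitions}


def plan_rebalance(old_map, new_map):
    # Only partitions whose owner changed need to move; the cluster stays
    # online and clients are redirected as each move completes
    return {p: (old_map[p], new_map[p])
            for p in old_map if old_map[p] != new_map[p]}


partitions = range(12)
old_map = assign(partitions, ["n1", "n2", "n3"])
new_map = assign(partitions, ["n1", "n2", "n3", "n4"])  # a node is added
moves = plan_rebalance(old_map, new_map)
print(f"{len(moves)} of {len(old_map)} partitions move")
```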
Failover in Couchbase Server - Fault-tolerance
Failover switches over to the replicas for a given database: gracefully under node maintenance, or immediately under auto-failover
Manual failover for node maintenance can be triggered through the Admin UI, REST API, or CLI
Automatic failover for unplanned outages and system failures can be configured through the Admin UI, REST API, or CLI, with constraints in place to avoid "split-brain" scenarios and false positives:
30-second delay with multiple heartbeat "pings"; clusters of at least 3 nodes; only one node down at a time
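The safeguards above can be read as a simple decision function. This sketch mirrors the constraints listed on the slide; the function name and parameters are illustrative, not the server's source.

```python
def should_auto_failover(cluster_size, down_nodes, seconds_unresponsive,
                         timeout=30):
    """Sketch of the auto-failover safeguards described above."""
    if cluster_size < 3:                    # need at least 3 nodes
        return False
    if len(down_nodes) != 1:                # only one node down at a time
        return False
    return seconds_unresponsive >= timeout  # 30s of missed heartbeats


print(should_auto_failover(4, ["n3"], 31))          # fails over
print(should_auto_failover(2, ["n2"], 31))          # too few nodes
print(should_auto_failover(4, ["n2", "n3"], 31))    # two nodes down: do nothing
print(should_auto_failover(4, ["n3"], 10))          # still within the delay
```

The two-node case is refused because with only two nodes neither side can tell whether it is the survivor or the partitioned half, which is exactly the split-brain risk the constraint avoids.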
Automatic Failover – “In action”
[Diagram: five-server cluster; each server holds active shards and replica shards, and App Servers 1 and 2 reach them through the Couchbase client library's cluster map]
App servers accessing shards
Requests to Server 3 fail
Cluster detects the server failed
Promotes replicas of its shards to active
Updates the cluster map
Requests for those docs now go to the appropriate server
Typically a rebalance would follow
Node Recovery – Bring Cluster back to Capacity
A failed-over node can be re-added back to the cluster
Full recovery – Add back as a fresh node
(New in 3.0) Delta Node recovery – Add back failed node incrementally into the cluster without having to rebuild the full node.
Hundreds-of-times reduction in time to re-add a node from 2.5 to 3.0
Rack-Zone Awareness – Rack-Zone Availability
[Diagram: servers 1-3 on Rack 1, servers 4-6 on Rack 2, servers 7-9 on Rack 3]
Grouping of servers into server groups so that each group is on a physically separate rack
Ensures that replica data partitions are not on the same rack as the primary partitions
Servers 1, 2, 3 on Rack 1
Servers 4, 5, 6 on Rack 2
Servers 7, 8, 9 on Rack 3
Cluster has 2 replicas (3 copies of data)
This is a balanced configuration
If an entire server rack fails, data is still available
If an entire cloud zone or region fails, data is still available
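The placement rule behind rack-zone awareness, never put a replica in the same server group as its active copy, can be sketched as follows. The layout and function are illustrative; the real server balances replicas across groups with a more sophisticated algorithm.

```python
def place_replicas(active_node, server_groups, num_replicas=2):
    """Choose replica nodes from racks other than the active node's rack.
    server_groups maps rack name -> list of nodes (illustrative layout)."""
    active_rack = next(rack for rack, nodes in server_groups.items()
                       if active_node in nodes)
    # Spread the replicas across the remaining racks, one per rack
    other_racks = [rack for rack in server_groups if rack != active_rack]
    return [server_groups[rack][0] for rack in other_racks[:num_replicas]]


racks = {
    "rack1": ["s1", "s2", "s3"],
    "rack2": ["s4", "s5", "s6"],
    "rack3": ["s7", "s8", "s9"],
}
replicas = place_replicas("s2", racks, num_replicas=2)
print(replicas)  # both replicas avoid rack1, where the active copy lives
```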
Couchbase Server provides statistics at multiple levels throughout the cluster, used for regular monitoring, capacity planning, and identifying performance characteristics.
Email alerts can be enabled so you are notified when a significant error occurs on your Couchbase Server cluster.
Monitoring & Alerting
Part II – Disaster Recovery
Cross Datacenter Replication (XDCR)
Unidirectional Replication
Hot spare / Disaster Recovery
Development/Testing copies
Bidirectional Replication
Datacenter Locality
Multiple Active Masters
Cross Datacenter Replication (XDCR) using DCP
Continuously replicates data from a source cluster to remote clusters, which may be spread across geographies
Supports unidirectional and bidirectional operation
Application can read and write from both clusters (active – active replication)
Automatically handles node addition and removal
Simplified Administration via Admin UI, REST, and CLI
(New in 3.0) Pause and resume XDCR replication
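Bidirectional (active-active) replication needs deterministic conflict resolution so that both clusters converge on the same winner regardless of the direction a conflicting pair travels. Couchbase resolves conflicts on document revision metadata ("most updates wins"); the sketch below approximates that by comparing (revision, CAS) tuples, a simplification of the server's actual tiebreak chain.

```python
def resolve_conflict(local_doc, remote_doc):
    """'Most updates wins': the document with the higher revision number
    wins; CAS breaks ties. Illustrative, not the server's exact rules."""
    local_key = (local_doc["rev"], local_doc["cas"])
    remote_key = (remote_doc["rev"], remote_doc["cas"])
    return local_doc if local_key >= remote_key else remote_doc


a = {"rev": 3, "cas": 100, "value": "written in DC-east"}
b = {"rev": 5, "cas": 90, "value": "written in DC-west"}
winner = resolve_conflict(a, b)
print(winner["value"])  # the more-updated revision wins
```

The key property is symmetry: `resolve_conflict(a, b)` and `resolve_conflict(b, a)` pick the same document, so both clusters end up identical after replication settles.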
Cross Datacenter Replication (XDCR) – Memory based using DCP
[Diagram: write operation with XDCR]
The app server writes a document into the managed cache; the replication queue replicates it memory-to-memory within the cluster, the disk queue persists it to disk, and (new in 3.0) the XDCR queue streams it memory-to-memory to the remote cluster.
Up to 4x better XDCR latency between clusters from 2.5 to 3.0
Backup & Restore – Oops Case
The cbbackup tool provides backup for a running cluster: the entire cluster across all buckets, a single node across all buckets, or a single node for a single bucket; it supports remote or local access
(New in 3.0) Incremental backups, differential or cumulative: back up only the data that changed since the last backup, minimizing resource and time consumption and enabling more frequent backups
Restore the cluster to a point in time from a differential or cumulative backup
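The difference between the backup modes is easiest to see in a toy model: a full backup copies everything, a differential captures only what changed since the previous backup, and restore replays the full backup plus each increment in order. This is a conceptual simulation, not the cbbackup tool itself.

```python
def full_backup(bucket):
    # Full backup: a complete copy of the bucket
    return dict(bucket)


def differential_backup(bucket, last_backup):
    # Differential: only items changed since the previous backup
    return {k: v for k, v in bucket.items() if last_backup.get(k) != v}


def restore(full, increments):
    # Point-in-time restore: replay the full backup, then each increment
    state = dict(full)
    for inc in increments:
        state.update(inc)
    return state


bucket = {"a": 1, "b": 2}
base = full_backup(bucket)
bucket["a"] = 10            # mutations after the full backup
bucket["c"] = 3
diff1 = differential_backup(bucket, base)
restored = restore(base, [diff1])
print(restored == bucket)
```

A cumulative backup would instead diff against the last *full* backup, so restoring needs only the full backup plus the latest cumulative increment, trading larger increments for a simpler restore chain.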
Backup & Restore
Demo !!!
Visit – Alex Ma (Deep-dive into XDCR & Rack-Zone Awareness)
Visit – Cihan (Couchbase on Azure)
Visit – Kirk (Tuning Couchbase Server)
Related Talks
DOWNLOAD COUCHBASE SERVER 3.0
www.couchbase.com/download
& give us feedback…
Q & A
Anil Kumar
Product Management, Couchbase
anil@couchbase.com
@anilkumar1129