Ultra-High Availability & Disaster Recovery with Couchbase Server: Couchbase Connect 2014
Ultra-High Availability & Disaster Recovery with Couchbase Server
Anil Kumar Product Management, Couchbase
©2014 Couchbase, Inc. 2
About Me
Anil Kumar, Product Manager, Couchbase
anil@couchbase.com
@anilkumar1129
Part I - High Availability: single node architecture; local data redundancy; rebalance and failover; node recovery
Part II - Disaster Recovery: business continuity for "mission-critical" applications; geo redundancy; backup-restore for worst-case scenarios
Demo
Q & A
High-Availability & Disaster Recovery
Part I - High Availability
Couchbase Server – Single Node Architecture
[Diagram: single-node architecture]
Cluster Manager (Erlang/OTP):
On each node: heartbeat, process monitor, global singleton supervisor, configuration manager
One per cluster: rebalance orchestrator, node health monitor, vBucket state and replication manager
HTTP REST management API / Web UI (HTTP 8091), Erlang port mapper (4369), distributed Erlang (21100-21199)
Data Manager:
Memcached with Couchbase EP Engine (11210, Memcapable 2.0) and Moxi (11211, Memcapable 1.0)
Storage interface and persistence layer
Query Engine (8092, Query API)
The single-node architecture is the foundation of the high availability architecture
No Single Point of Failure (SPOF)
Easy scalability
Intra-Cluster Replication – Data Redundancy
RAM to RAM replication
Max of 4 copies of data in a Cluster
Bandwidth optimized through de-duplication ('de-dup') of items
Intra-cluster replication is the process of replicating data on multiple servers within a cluster in order to provide data redundancy.
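Under the hood, Couchbase hashes each key to one of 1,024 vBuckets and a cluster map assigns every vBucket an active node plus replica nodes. The following is a minimal sketch of that idea; the round-robin placement and function names are illustrative assumptions, not the server's actual internals.

```python
import zlib

NUM_VBUCKETS = 1024  # Couchbase Server uses 1,024 vBuckets per bucket


def vbucket_for_key(key: str) -> int:
    # Hash the key (CRC32, simplified here) and take it modulo the vBucket count
    return zlib.crc32(key.encode("utf-8")) % NUM_VBUCKETS


def build_vbucket_map(nodes, num_replicas=1):
    """Assign each vBucket an active node and num_replicas replica nodes.
    Round-robin placement is a simplification of the server's algorithm."""
    vbucket_map = []
    for vb in range(NUM_VBUCKETS):
        chain = [nodes[(vb + i) % len(nodes)] for i in range(num_replicas + 1)]
        vbucket_map.append(chain)  # chain[0] is active, the rest are replicas
    return vbucket_map


nodes = ["node1", "node2", "node3", "node4"]
vb_map = build_vbucket_map(nodes, num_replicas=1)
vb = vbucket_for_key("user::1234")
print("vBucket:", vb, "active:", vb_map[vb][0], "replica:", vb_map[vb][1])
```

Because every client computes the same hash against the same cluster map, any node can be reached directly with no routing tier, which is what removes the single point of failure.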
Write Operation – Data Redundancy
[Diagram: write operation on a single node]
The app server writes a document into the managed cache; the replication queue sends it memory-to-memory to other nodes, and the disk queue persists it to disk.
(New in 3.0) Database Change Protocol (DCP) – Data Redundancy
DCP is the new streaming replication protocol in Couchbase Server 3.0
High-performance, stream-based protocol
Better resumability after network blips and failures
Powers Intra Cluster Replication
Powers Cross Datacenter Replication
Powers Incremental Backup & Restore
Up to 150x Improvement on ReplicateTo latency from 2.5 to 3.0
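DCP's resumability comes from streaming mutations per vBucket with monotonically increasing sequence numbers: a consumer remembers the last sequence number it applied and, after a blip, asks the producer to resume from there. A toy simulation of that mechanism (not the real Couchbase wire protocol):

```python
class DCPProducer:
    """Toy stand-in for a DCP stream source: an append-only mutation log
    keyed by sequence number."""
    def __init__(self):
        self.log = []          # list of (seqno, key, value)
        self.next_seqno = 1

    def mutate(self, key, value):
        self.log.append((self.next_seqno, key, value))
        self.next_seqno += 1

    def stream_from(self, start_seqno):
        # Resumability: replay only mutations after the consumer's last seqno
        return [m for m in self.log if m[0] > start_seqno]


class DCPConsumer:
    def __init__(self):
        self.last_seqno = 0
        self.data = {}

    def apply(self, mutations):
        for seqno, key, value in mutations:
            self.data[key] = value
            self.last_seqno = seqno


producer = DCPProducer()
consumer = DCPConsumer()
producer.mutate("a", 1)
producer.mutate("b", 2)
consumer.apply(producer.stream_from(consumer.last_seqno))
# ...connection blip; a new mutation lands while the consumer is offline...
producer.mutate("a", 3)
consumer.apply(producer.stream_from(consumer.last_seqno))  # resumes, no full resync
```

Because only the missed tail of the log is replayed, reconnecting after a failure is cheap, which is why the same protocol can power intra-cluster replication, XDCR, and incremental backup.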
(New in 3.0) Auto Tuning Shared Thread Pool - Durability
Efficient auto-tuning engine: detects and allocates threads based on hardware resources, and pools threads for best resource utilization
Improved latency across the board: faster rebalance, faster node reactivation, faster durability with writes & PersistTo
Up to 3x better PersistTo latency from 2.5 to 3.0
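The general idea of the shared pool, sizing one worker pool from detected hardware and letting all subsystems share it, can be sketched with Python's standard library; the sizing formula here is purely illustrative, not the server's tuning logic.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def make_shared_pool():
    # Detect hardware resources and size the shared pool accordingly,
    # in the spirit of 3.0's auto-tuning (this formula is an assumption)
    cpus = os.cpu_count() or 1
    return ThreadPoolExecutor(max_workers=max(4, cpus * 2))


# All subsystems share one pool instead of each spawning its own threads,
# which is what improves overall resource utilization
pool = make_shared_pool()
futures = [pool.submit(pow, n, 2) for n in range(8)]
results = [f.result() for f in futures]
print(results)
```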
Rebalance Operation in Couchbase Server – Data Availability
Rebalance redistributes data partitions around the cluster: when adding nodes, when removing nodes, and when nodes have failed over
Aim is to bring cluster back to optimal health
Data-partitions are moved between nodes automatically
Rebalance happens on an active cluster: you can expand or shrink without pausing your application, and client libraries automatically handle the rebalance and redistribute their requests accordingly
Up to 2x Faster Rebalance under Load between 3.0 and 2.5.1
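Conceptually, rebalance recomputes the partition-to-node map for the new topology and then moves only the partitions whose owner changed, while clients keep working against the live map. A simplified sketch (round-robin placement is an illustrative stand-in for the server's algorithm):

```python
def assign(partitions, nodes):
    # Round-robin ownership: a stand-in for the server's placement algorithm
    return {p: nodes[p % len(nodes)] for p in partitions}


def plan_rebalance(old_map, new_map):
    # Only partitions whose owner changed need to move; the cluster stays
    # online and clients are redirected as each move completes
    return {p: (old_map[p], new_map[p])
            for p in old_map if old_map[p] != new_map[p]}


partitions = range(12)
old_map = assign(partitions, ["n1", "n2", "n3"])
new_map = assign(partitions, ["n1", "n2", "n3", "n4"])  # a node is added
moves = plan_rebalance(old_map, new_map)
print(f"{len(moves)} of {len(old_map)} partitions move")
```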
Failover in Couchbase Server - Fault-tolerance
Failover switches over to the replicas for a given database: gracefully under node maintenance, or immediately under auto-failover
Manual failover for node maintenance can be triggered through the Admin UI, REST API, or CLI
Automatic failover for unplanned outages and system failures can be configured through the Admin UI, REST API, or CLI, with constraints in place to avoid "split-brain" scenarios and false positives:
30-second delay with multiple heartbeat "pings"; clusters of at least 3 nodes; only one node down at a time
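The safeguards above can be read as a simple decision function. This sketch mirrors the constraints listed on the slide; the function name and parameters are illustrative, not the server's source.

```python
def should_auto_failover(cluster_size, down_nodes, seconds_unresponsive,
                         timeout=30):
    """Sketch of the auto-failover safeguards described above."""
    if cluster_size < 3:                    # need at least 3 nodes
        return False
    if len(down_nodes) != 1:                # only one node down at a time
        return False
    return seconds_unresponsive >= timeout  # 30s of missed heartbeats


print(should_auto_failover(4, ["n3"], 31))          # fails over
print(should_auto_failover(2, ["n2"], 31))          # too few nodes
print(should_auto_failover(4, ["n2", "n3"], 31))    # two nodes down: do nothing
print(should_auto_failover(4, ["n3"], 10))          # still within the delay
```

The two-node case is refused because with only two nodes neither side can tell whether it is the survivor or the partitioned half, which is exactly the split-brain risk the constraint avoids.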
Automatic Failover – “In action”
[Diagram: five-server cluster; each server holds active shards and replica shards, and App Servers 1 and 2 reach them through the Couchbase client library's cluster map]
App servers accessing shards
Requests to Server 3 fail
Cluster detects the server failed
Promotes replicas of its shards to active
Updates the cluster map
Requests for those docs now go to the appropriate server
Typically a rebalance would follow
Node Recovery – Bring Cluster back to Capacity
A failed-over node can be re-added back to the cluster
Full recovery – Add back as a fresh node
(New in 3.0) Delta Node recovery – Add back failed node incrementally into the cluster without having to rebuild the full node.
Hundreds-of-times reduction in time to re-add a node from 2.5 to 3.0
Rack-Zone Awareness – Rack-Zone Availability
[Diagram: servers 1-3 on Rack 1, servers 4-6 on Rack 2, servers 7-9 on Rack 3]
Grouping of servers into server groups so that each group is on a physically separate rack
Ensures that replica data partitions are not on the same rack as the primary partitions
Servers 1, 2, 3 on Rack 1
Servers 4, 5, 6 on Rack 2
Servers 7, 8, 9 on Rack 3
Cluster has 2 replicas (3 copies of data)
This is a balanced configuration
If an entire server rack fails, data is still available
If an entire cloud zone or region fails, data is still available
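The placement rule behind rack-zone awareness, never put a replica in the same server group as its active copy, can be sketched as follows. The layout and function are illustrative; the real server balances replicas across groups with a more sophisticated algorithm.

```python
def place_replicas(active_node, server_groups, num_replicas=2):
    """Choose replica nodes from racks other than the active node's rack.
    server_groups maps rack name -> list of nodes (illustrative layout)."""
    active_rack = next(rack for rack, nodes in server_groups.items()
                       if active_node in nodes)
    # Spread the replicas across the remaining racks, one per rack
    other_racks = [rack for rack in server_groups if rack != active_rack]
    return [server_groups[rack][0] for rack in other_racks[:num_replicas]]


racks = {
    "rack1": ["s1", "s2", "s3"],
    "rack2": ["s4", "s5", "s6"],
    "rack3": ["s7", "s8", "s9"],
}
replicas = place_replicas("s2", racks, num_replicas=2)
print(replicas)  # both replicas avoid rack1, where the active copy lives
```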
Couchbase Server provides statistics at multiple levels throughout the cluster, used for regular monitoring, capacity planning, and identifying performance characteristics.
Email alerts can be enabled so you are notified when a significant error occurs on your Couchbase Server cluster.
Monitoring & Alerting
Part II – Disaster Recovery
Cross Datacenter Replication (XDCR)
Unidirectional Replication
Hot spare / Disaster Recovery
Development/Testing copies
Bidirectional Replication
Datacenter Locality
Multiple Active Masters
Cross Datacenter Replication (XDCR) using DCP
Continuously replicates data from a source cluster to remote clusters, which may be spread across geographies
Supports unidirectional and bidirectional operation
Application can read and write from both clusters (active – active replication)
Automatically handles node addition and removal
Simplified Administration via Admin UI, REST, and CLI
(New in 3.0) Pause and resume XDCR replication
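Bidirectional (active-active) replication needs deterministic conflict resolution so that both clusters converge on the same winner regardless of the direction a conflicting pair travels. Couchbase resolves conflicts on document revision metadata ("most updates wins"); the sketch below approximates that by comparing (revision, CAS) tuples, a simplification of the server's actual tiebreak chain.

```python
def resolve_conflict(local_doc, remote_doc):
    """'Most updates wins': the document with the higher revision number
    wins; CAS breaks ties. Illustrative, not the server's exact rules."""
    local_key = (local_doc["rev"], local_doc["cas"])
    remote_key = (remote_doc["rev"], remote_doc["cas"])
    return local_doc if local_key >= remote_key else remote_doc


a = {"rev": 3, "cas": 100, "value": "written in DC-east"}
b = {"rev": 5, "cas": 90, "value": "written in DC-west"}
winner = resolve_conflict(a, b)
print(winner["value"])  # the more-updated revision wins
```

The key property is symmetry: `resolve_conflict(a, b)` and `resolve_conflict(b, a)` pick the same document, so both clusters end up identical after replication settles.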
Cross Datacenter Replication (XDCR) – Memory based using DCP
[Diagram: write operation with XDCR]
The app server writes a document into the managed cache; the replication queue replicates it memory-to-memory within the cluster, the disk queue persists it to disk, and (new in 3.0) the XDCR queue streams it memory-to-memory to the remote cluster.
Up to 4x better XDCR latency between clusters from 2.5 to 3.0
Backup & Restore – Oops Case
The cbbackup tool provides backup for a running cluster: the entire cluster across all buckets, a single node across all buckets, or a single node for a single bucket; it supports remote or local access
(New in 3.0) Incremental backups, differential or cumulative: back up only the data that changed since the last backup, minimizing resource and time consumption and enabling more frequent backups
Restore the cluster to a point in time from a differential or cumulative backup
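The difference between the backup modes is easiest to see in a toy model: a full backup copies everything, a differential captures only what changed since the previous backup, and restore replays the full backup plus each increment in order. This is a conceptual simulation, not the cbbackup tool itself.

```python
def full_backup(bucket):
    # Full backup: a complete copy of the bucket
    return dict(bucket)


def differential_backup(bucket, last_backup):
    # Differential: only items changed since the previous backup
    return {k: v for k, v in bucket.items() if last_backup.get(k) != v}


def restore(full, increments):
    # Point-in-time restore: replay the full backup, then each increment
    state = dict(full)
    for inc in increments:
        state.update(inc)
    return state


bucket = {"a": 1, "b": 2}
base = full_backup(bucket)
bucket["a"] = 10            # mutations after the full backup
bucket["c"] = 3
diff1 = differential_backup(bucket, base)
restored = restore(base, [diff1])
print(restored == bucket)
```

A cumulative backup would instead diff against the last *full* backup, so restoring needs only the full backup plus the latest cumulative increment, trading larger increments for a simpler restore chain.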
Backup & Restore
Demo !!!
Visit – Alex Ma (Deep-dive into XDCR & Rack-Zone Awareness)
Visit – Cihan (Couchbase on Azure)
Visit – Kirk (Tuning Couchbase Server)
Related Talks
DOWNLOAD COUCHBASE SERVER 3.0
www.couchbase.com/download
& give us feedback…
Q & A
Anil Kumar
Product Management, Couchbase
anil@couchbase.com
@anilkumar1129