CouchConf Israel 2013: Couchbase Server in Production

Couchbase Server 2.0 in Production
Perry Krug, Sr. Solutions Architect

TRANSCRIPT

Page 1: CouchConf Israel 2013_Couchbase Server in Production

Couchbase Server 2.0 in Production
Perry Krug, Sr. Solutions Architect

Page 2

Typical Couchbase production environment

Application users

Load Balancer

Application Servers

Servers

Page 3

We’ll focus on App-Couchbase interaction …

Application users

Load Balancer

Application Servers

Servers

Page 4

… at each step of the application lifecycle

Dev/Test Size Deploy Monitor Manage

Page 5

KEY CONCEPTS

Page 6

Couchbase Single Node Architecture

Data Manager:
• Data access ports: 11210 / 11211
• Object-managed Cache
• Storage Engine
• Query Engine (port 8092, Query API, over http)

Cluster Manager (Erlang/OTP):
• Replication, Rebalance, Shard State Manager
• REST management API / Web UI (port 8091, Admin Console)

Page 7

Couchbase Single Node

[Diagram: a document written by the App Server lands in the Couchbase Server node's Managed Cache, then fans out to three queues: the Disk Queue (draining to Disk), the Replication Queue (to other nodes), and the XDCR Queue (to other clusters). The View engine picks up documents from disk for indexing.]

Page 8

Couchbase deployment

[Diagram: multiple Web Applications, each embedding the Couchbase Client Library, connect to a cluster of Couchbase Servers. Three kinds of traffic are shown: data flow (client to server), cluster management, and replication flow between servers.]

Page 9

COUCHBASE SERVER CLUSTER

Couchbase in a Cluster (user-configured replica count = 1)

App Servers 1 and 2 each run the Couchbase Client Library, which holds the cluster map; reads, writes, updates, and queries go straight to the server holding each document.

SERVER 1 – Active: Doc 5, Doc 2, Doc 9; Replica: Doc 3, Doc 1, Doc 7
SERVER 2 – Active: Doc 3, Doc 1, Doc 8; Replica: Doc 6, Doc 4, Doc 9
SERVER 3 – Active: Doc 4, Doc 6, Doc 7; Replica: Doc 2, Doc 5, Doc 8
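The cluster map is what lets a smart client reach the right node directly. A minimal sketch of the idea: hash the key to a vBucket, then look the vBucket up in the map. Couchbase hashes keys with CRC32 over 1024 vBuckets, but the exact bit manipulation and the server list below are simplified/illustrative, not the production algorithm.

```python
# Sketch: key -> vBucket -> server lookup (simplified; not the exact
# production hashing). The cluster map is normally fetched from the
# cluster and refreshed on rebalance/failover.
import zlib

NUM_VBUCKETS = 1024
SERVERS = ["server1", "server2", "server3"]           # illustrative nodes
vbucket_map = [SERVERS[i % len(SERVERS)] for i in range(NUM_VBUCKETS)]

def server_for_key(key: str) -> str:
    vb = zlib.crc32(key.encode()) % NUM_VBUCKETS      # deterministic hash
    return vbucket_map[vb]

# every client computes the same mapping, so no proxy hop is needed
print(server_for_key("Doc 5"))
```

Because the mapping is deterministic and shared, any client reaches the owning node in one hop, and only the map needs updating when topology changes.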

Page 10

NODE AND CLUSTER SIZING

Dev-Test Size Deploy Monitor Manage

Page 11

Size Couchbase Server

Sizing == performance:
• Serve reads out of RAM
• Enough I/O for writes and disk operations
• Mitigate inevitable failures

Reading Data: the application server asks "Give me document A" and the server returns it ("Here is document A") from RAM.
Writing Data: the application server sends "Please store document A" and the server acknowledges ("OK, I stored document A").

Page 12

Scaling out permits matching of aggregate flow rates so queues do not grow.

[Diagram: multiple application servers talking over the network to multiple Couchbase servers.]

Page 13

How many nodes?

5 Key Factors determine the number of nodes needed:
1) RAM
2) Disk
3) CPU
4) Network
5) Data Distribution/Safety

[Diagram: application users → web application servers → Couchbase servers]

Page 14

RAM sizing

1) Total RAM:
• Managed document cache:
  • Working set
  • Metadata
  • Active + replicas
• Index caching (I/O buffer)

Keep the working set in RAM for best read performance.

[Reading Data diagram: document A served straight from RAM]
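As a rough sketch of how these factors combine into a RAM estimate: metadata for every key stays resident, only the working set of values must be cached, and replicas multiply both. The 150-byte metadata figure, the headroom factor, and the formula itself are assumptions for illustration, not the official sizing method.

```python
def cluster_ram_gb(num_docs, avg_doc_kb, working_set_ratio,
                   replicas=1, metadata_bytes=150, headroom=0.85):
    """Rough cluster RAM estimate (sketch; constants are assumptions).

    Metadata for every document stays in RAM; only the working set of
    document values needs to be cached for fast reads. Replicas double
    the footprint (with one replica), and headroom models the cache's
    high-water mark.
    """
    meta = num_docs * metadata_bytes                       # all keys' metadata
    values = num_docs * working_set_ratio * avg_doc_kb * 1024
    total = (meta + values) * (1 + replicas) / headroom
    return total / 1024**3

# e.g. 10M docs of 4 KB, 33% working set, one replica
print(round(cluster_ram_gb(10_000_000, 4, 0.33), 1))
```

Plugging in the working-set ratios from the next slide shows how dramatically the application's access pattern changes the RAM bill.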

Page 15

Working set ratio depends on your application

• Late-stage social game: many users no longer active; few logged in at any given time. working/total set = .01
• Ad network: any cookie can show up at any time. working/total set = 1
• Business application: users logged in during the day; the day moves around the globe. working/total set = .33

Page 16

RAM sizing – Working set managed cache

As memory usage grows, some cached data will be removed from RAM to make space:
• Active and replica data share RAM
• Threshold-based (NRU, favoring active data)
• Only cleanly persisted data can be "ejected"
• Only data values can be "ejected", which means RAM can fill up with metadata

Page 17

RAM Sizing – View/Index cache (disk I/O)

• File system cache availability for the index has a big impact on performance
• Test runs based on 10 million items with a 16GB bucket quota and 4GB/8GB of system RAM available for indexes
• Results show that doubling system cache availability:
  – cuts query latency in half
  – increases throughput by 50%
• Leave RAM free with quotas

Page 18

Disk sizing: Space and I/O

2) Disk

I/O:
• Sustained write rate
• Rebalance capacity
• Backups
• XDCR
• Compaction

Space:
• Total dataset (active + replicas + indexes)
• Append-only

Writing Data: the application server sends "Please store document A"; the server acknowledges once it is stored.

Page 19

Disk sizing: I/O

Impacting disk I/O needed:
• Peak write load
• Sustained write load
• Compaction
• XDCR
• Views/indexing

Configurable paths/partitions for data and indexes allow separation of space and I/O.

Page 20

Disk sizing: Space

Impacting amount of disk space needed:
• Total data set
• Indexes
• Overhead for compaction (~3x): both data and indexes are "append-only"

Configurable paths/partitions for data and indexes allow separation of space and I/O.
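The space bullets reduce to simple arithmetic. This sketch uses the ~3x compaction overhead from the slide; the assumption that replicas apply to data but not indexes, and the formula shape itself, are illustrative rather than an official sizing rule.

```python
def disk_space_gb(data_gb, index_gb, replicas=1, compaction_overhead=3.0):
    """Rough disk-space budget (sketch; ~3x overhead per the slides).

    Append-only files grow until compaction rewrites them, so budget
    roughly 3x the live dataset for both data and index files.
    """
    live = data_gb * (1 + replicas) + index_gb   # active + replicas + indexes
    return live * compaction_overhead

# 100 GB of active data, 20 GB of indexes, one replica
print(disk_space_gb(100, 20))
```

The headline point survives the rough constants: provision several times the live dataset, or compaction will run out of room.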

Page 21

Disk sizing: Impact of Views on I/O and Space

• Number of Design Documents
  • Extra space for each DD
  • Extra I/O to process each DD
  • Segregate views by DD
• Complexity of Views (I/O)
• Amount of view output (space)
  • Emit as little as possible
  • Doc ID automatically included
• Use Development views and extrapolate

Page 22

Disk sizing: Append-only

• Append-only file format puts all new/updated/deleted items at the end of the on-disk file
  – Better performance and reliability
  – No more fragmentation!
• This can leave invalidated data in the "back" of the file
• Need to compact data

Page 23

Disk compaction

Initial file layout:  Doc A | Doc B | Doc C
Update some data:     Doc A | Doc B | Doc C | Doc A' | Doc D | Doc B' | Doc A''
After compaction:     Doc C | Doc B' | Doc A'' | Doc D
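The layout above can be simulated in a few lines. This is purely illustrative of the append-then-compact idea, not Couchstore's actual on-disk format.

```python
# Minimal sketch of an append-only "file" with compaction.

def append(log, key, value):
    log.append((key, value))      # updates are appended, never overwritten

def compact(log):
    latest = {}
    for key, value in log:        # last write wins
        latest[key] = value
    # rewrite the file keeping only the newest version of each doc
    return list(latest.items())

log = []
for k, v in [("A", 1), ("B", 1), ("C", 1), ("A", 2), ("D", 1), ("B", 2), ("A", 3)]:
    append(log, k, v)

print(len(log))        # 7 entries on "disk" before compaction
log = compact(log)
print(sorted(log))     # only the latest A'', B', C and D survive
```

The invalidated versions (A, A', B) are the "~3x" space overhead the sizing slide budgets for between compaction runs.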

Page 24

Disk compaction

• Compaction happens automatically:
  – Settings for "threshold" of stale data
  – Settings for time of day
  – Split by data and index files
  – Per-bucket or global
• Reduces size of on-disk files – data files AND index files
• Temporarily increased disk I/O and CPU, but no downtime!

Page 25

CPU sizing

3) CPU
• Disk writing
• Views/compaction/XDCR
• RAM r/w performance not impacted

1.8 used VERY little CPU, and under the same workloads 2.0 should not be much different; the new 2.0 features, however, will require more CPU.

Page 26

Network sizing

4) Network
• Client traffic (reads + writes)
• Replication (multiplies writes)
• Rebalancing
• XDCR

Page 27

Consistent low latency with varying doc sizes

Consistently low latencies (in microseconds) for varying document sizes under a mixed workload.

Page 28

Data Distribution

5) Data Distribution / Safety (assuming one replica):
• 1 node = BAD
• 2 nodes = …better…
• 3+ nodes = BEST!

Note: many applications will need more than 3 nodes.

Servers fail; be prepared. The more nodes, the less impact a failure will have.
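A quick way to see why more nodes mean less impact: with data spread evenly (a simplifying assumption), one failed node takes roughly 1/N of the active data offline until its replicas are promoted.

```python
# Sketch: share of active data affected by a single node failure,
# assuming an even data distribution across the cluster.

def failure_impact_pct(num_nodes):
    return 100.0 / num_nodes

for n in (1, 2, 3, 10):
    print(f"{n} node(s): ~{failure_impact_pct(n):.1f}% of active data affected")
```

At 1 node everything is down and there is nowhere to keep a replica; at 10 nodes a failure touches only ~10% of the data, and the remaining nodes can absorb the promoted replicas.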

Page 29

How many nodes? (recap)

New 2.0 features will affect sizing requirements:
• Views/Indexing/Querying
• XDCR
• Append-only file format

5 Key Factors still determine the number of nodes needed:
1) RAM
2) Disk
3) CPU
4) Network
5) Data Distribution

[Diagram: application users → web application servers → Couchbase servers]

Page 30

MONITORING

Dev-Test Size Deploy Monitor Manage

Page 31

Key resources: RAM, Disk, Network, CPU

[Diagram: application servers connect over the network to several Couchbase servers; on each server, RAM, disk, and network are the resources to watch.]

Page 32

Monitoring

Once in production, the heart of operations is monitoring:
• RAM usage
• Disk space and I/O: write queues / read activity / indexing
• Network bandwidth, replication queues
• CPU usage
• Data distribution (balance, replicas)

Page 33

Monitoring

IMMENSE amount of information available

• Real-time traffic graphs

• REST API accessible

• Per bucket, per node and aggregate statistics

• Application and inter-node traffic

• RAM <-> Disk

• Inter-system timing

Page 34

Key Stats to Monitor

• Working set doesn’t fit in RAM

–Cache miss rate / disk fetches

• Disk I/O not keeping up

–Disk Write queue size

• Internal replication lag

– TAP queues

• Indexing not keeping up

• XDCR lag
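These conditions can be checked automatically from a bucket-stats snapshot (available per bucket over the REST API on port 8091). The stat names below (ep_queue_size, ep_bg_fetched, cmd_get, the TAP backfill counter) follow Couchbase's stats naming but are assumptions here, as are the thresholds; verify both against your version before alerting on them.

```python
# Sketch: flag the slide's key conditions given a bucket-stats snapshot.
# Stat names and thresholds are assumptions; check your version's REST API.

def check_stats(stats):
    warnings = []
    gets = stats.get("cmd_get", 0)
    # cache misses force background fetches from disk
    if gets and stats.get("ep_bg_fetched", 0) / gets > 0.05:
        warnings.append("working set may not fit in RAM (cache misses)")
    # a growing disk write queue means disk I/O is not keeping up
    if stats.get("ep_queue_size", 0) > 1_000_000:
        warnings.append("disk write queue growing: disk I/O not keeping up")
    # items still waiting in TAP queues indicate replication lag
    if stats.get("ep_tap_queue_backfillremaining", 0) > 0:
        warnings.append("internal replication (TAP queues) lagging")
    return warnings

sample = {"cmd_get": 10_000, "ep_bg_fetched": 2_000, "ep_queue_size": 5_000_000}
for w in check_stats(sample):
    print(w)
```

In practice a script like this would poll the REST stats endpoint on a schedule and page on sustained, not momentary, threshold breaches.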

Page 35

Page 36

MANAGEMENT AND MAINTENANCE

Dev-Test Size Deploy Monitor Manage

Page 37

Management/Maintenance

• Scaling

• Upgrading/Scheduled maintenance

• Backup/Restore

• Dealing with Failures

Page 38

Scaling

Couchbase Scales out Linearly:

Need more RAM? Add nodes…

Need more Disk IO or space? Add nodes…

Couchbase also makes it easy to scale up by swapping larger nodes for smaller ones without any disruption

Page 39

Couchbase + Cisco + Solarflare

[Chart: operations per second vs. number of servers in the cluster]

High throughput with 1.4 GB/sec data transfer rate using 4 servers.
Linear throughput scalability.

Page 40

Additional benchmark details

• Cluster of 8 nodes running Couchbase Server 1.8.0
• One server used as the client to run the workload
• Workload was Couchbase's streaming load generator
• GET and SET operations performed in a 70:30 ratio

Test System and Parameters
• Couchbase Server 1.8.0
• Cisco Nexus 5548UP Switch
• Solarflare SFN5122F 10 Gigabit Ethernet Enhanced Small Form-Factor Pluggable (SFP+) server adapters
• Solarflare OpenOnload
• Servers: nine Cisco UCS C200 M2 High-Density Rack Servers with Intel Xeon X5670 six-core 2.93-GHz CPUs, running Red Hat Enterprise Linux (RHEL) 5.5 x86 64-bit, with 100-GB RAM and four 2-TB hard drives

Page 41

Upgrade

Upgrade an existing Couchbase Server 1.8 cluster to Couchbase Server 2.0:
1. Add nodes of the new version, rebalance…
2. Remove nodes of the old version, rebalance…
3. Done!

No disruption. Generally useful for software upgrades, hardware refreshes, and planned maintenance.

Page 42

Easy to Maintain Couchbase

• Use remove+rebalance on “malfunctioning” node:

– Protects data distribution and “safety”

– Replicas recreated

– Best to “swap” with new node to maintain capacity and move minimal amount of data

Page 43

Backup

[Diagram: "cbbackup" pulls data over the network from each server in the cluster into local data files.]

Page 44

Restore

"cbrestore" is used to restore backed-up data files into a live cluster, either the same cluster or a different one.

Page 45

Failures Happen!

Hardware. Network. Bugs.

Page 46

Easy to manage failures with Couchbase

• Failover (automatic or manual):
  – Replica data and indexes are promoted for immediate access
  – Replicas are not recreated
  – Do NOT fail over a healthy node
  – Perform a rebalance after returning the cluster to full or greater capacity
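Manual failover can be driven through the REST API on port 8091. The /controller/failOver endpoint and its otpNode parameter follow Couchbase's documented REST API, but verify them against your version; the hostnames below are placeholders, and a real call needs admin credentials.

```python
# Sketch: building (not sending) a manual-failover REST request.
from urllib import parse, request

def build_failover_request(host: str, otp_node: str) -> request.Request:
    url = f"http://{host}:8091/controller/failOver"
    data = parse.urlencode({"otpNode": otp_node}).encode()
    # a real call would also attach an Authorization header (admin user)
    return request.Request(url, data=data, method="POST")

req = build_failover_request("cb-node1", "ns_1@cb-node3")
print(req.get_full_url())   # the request is built but deliberately not sent
```

Per the slide, this is for an unhealthy node only: replicas are promoted but not recreated, so follow the failover with a rebalance once capacity is restored.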

Page 47

Fail Over

[Diagram: a five-node COUCHBASE SERVER CLUSTER serving App Servers 1 and 2, each running the Couchbase Client Library with a cluster map. Every server holds active docs and replica docs. When one server fails, the replicas of its documents on the surviving nodes are promoted to active and the cluster map is updated, so clients keep reading and writing without interruption.]

Page 48

Conclusion

Dev/Test Size Deploy Monitor Manage

Page 49

Want more?

Lots of details and best practices in our documentation:

http://www.couchbase.com/docs/

Page 50

QUESTIONS?

[email protected] · @PERRYKRUG