bcndevcon 2013: using Cassandra and Zookeeper to build a distributed, high performance system
DESCRIPTION
Slides from my presentation at BCN Dev Con 2013.
TRANSCRIPT
Using Zookeeper and Cassandra to build a distributed, high performance system
Galo Navarro@srvaroa - [email protected]
BcnDevCon13
About me
Background: backend & architecture in high traffic systems
Current: software engineer @ Midokura
A talk about databases
Takeaways
New tech signals a different emphasis on solving each problem
Solutions are not exclusive: you can combine them.
Go beyond artificial SQL-NoSQL antagonisms:
We share some fundamental problems:
● Latency, availability, durability..
● True today, 20y ago, 3000y ago
Midonet
Distributed network virtualization system
Context, dataset, requirements
https://www.midokura.com/
Virtualization
Computational resources on demand
[Diagram: a grid of VMs provisioned on demand: “the cloud”]
[Diagram: VMs attached to vSwitches and a vRouter form a virtual network connected to the internet]
Midokura's use case
Each client that rents VMs on the datacentre wants to network them as if they were their own physical resources (e.g.: same L2 domain, private addresses, isolation..)
MidoNet allows the owner of the datacentre to provide that service
Virtual network topology
Metrics, audit logs, monitoring
Virtual network state
Routing tables:
destination IP    gateway
192.168.0.0/16    192.168.0.12
66.82.1.0/16      66.82.1.1
0.0.0.0/32        10.0.2.1
ARP table:
IP              MAC
192.168.1.23    aa:bb:cc:dd:ee:ff
192.168.1.11    11:22:33:44:55:66
[Diagram: a vRouter and vSwitches connecting the virtual network to the internet]
Dataset
Usage
A daemon captures packets sent from the VMs on each physical host.
On a new packet, it loads a view of the virtual topology from a (distributed) data store.
[Diagram: daemons on each host load the virtual topology from the data store]
Usage
The daemon simulates the packet's trip through the virtual network until it reaches a destination VM, and identifies that VM's host.
It then instructs the kernel to route similar packets via a tunnel.
[Diagram: subsequent packets flow host-to-host through a tunnel, without further simulation]
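That loop lends itself to a tiny sketch. The following toy Java program is purely illustrative: every type and name in it is hypothetical, invented to mirror the description above, not MidoNet's actual API.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PacketHandlerSketch {
    // Flow rules already installed in the kernel: match -> tunnel.
    static final Map<String, String> kernelFlows = new ConcurrentHashMap<>();

    // Called when the kernel has no flow for a packet and hands it to us.
    static void onPacket(String srcIp, String dstIp) {
        Object topology = loadVirtualTopology();     // from the data store
        String destHost = simulate(topology, dstIp); // walk the virtual devices
        // Install a flow so similar packets bypass the daemon entirely.
        kernelFlows.put(srcIp + "->" + dstIp, "tunnel-to-" + destHost);
    }

    static Object loadVirtualTopology() { return new Object(); }           // stub
    static String simulate(Object topo, String dstIp) { return "host42"; } // stub

    public static void main(String[] args) {
        onPacket("10.0.0.1", "10.0.0.2");
        System.out.println(kernelFlows); // {10.0.0.1->10.0.0.2=tunnel-to-host42}
    }
}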
Midonet architecture
[Diagram: hosts and a storage cluster communicating over an IP bus, managed through an API]
Constraints
CAP: Consistency is negotiable; Availability and Partition tolerance are critical.
What happens if our service doesn't handle network partitions, faulty masters, GC pauses, latency, lag, locks..?
- Not just N users unable to see their profiles
- But infrastructure failure in the entire datacentre
Midokura's use case
Coming to “NoSQL” not from “Big Data”
But looking for a specific mix of:
● Availability
● Fault tolerance
● Performance
● Durability
● Low operational cost
How are Cassandra and Zookeeper useful?
Virtual Network State
Assorted data
Metrics
https://cassandra.apache.org/
Cassandra elevator pitch
A massively scalable open source NoSQL database
Supports large amounts of structured, semi-structured, and unstructured data (key-value)
Across multiple data centers
Performance, availability, linear scalability, with no SPOF (single point of failure)
Cassandra architecture
[Diagram: clients connect to a peer-to-peer ring of nodes spanning DC1, DC2 and DC3]
● P2P
● No privileged nodes
● Unified view
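A hedged illustration of "no privileged nodes" (the talk doesn't name a client library; this sketch assumes the DataStax Java driver): any live node works as a contact point, and the driver learns the rest of the ring from it.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Host;

public class ConnectAnywhere {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
                .addContactPoint("10.0.0.1") // any node; none is special
                .build();
        cluster.connect(); // triggers connection and ring discovery
        // The driver discovered the whole ring from that single peer.
        for (Host h : cluster.getMetadata().getAllHosts()) {
            System.out.println("peer: " + h.getAddress());
        }
        cluster.close();
    }
}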
Fault tolerance
Replication Factor = 3
Consistency Level = QUORUM
[Diagram: write(x) succeeds with "ok" from enough replicas even though one faulty node FAILs: QUORUM needs 2 of 3]
Fault tolerance
Hinted handoff: the coordinator holds the data until the faulty replica recovers.
[Diagram: w(x) is acked; a later read(x) at RF = 3, CL = QUORUM returns x from the healthy replicas]
Fault tolerance
[Diagram: read(x); two replicas return x, a disagreeing one returns a stale x']
Consistency
The coordinator will wait until CL is possible across replicas (or fail). CL can also be 1, 2, ALL..
RF = 3
CL = QUORUM
read_repair
Consistency
A repair order is issued to the disagreeing node to reconcile its local copy.
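A sketch of these knobs, assuming the DataStax Java driver and CQL (keyspace and table names are mine): RF is fixed per keyspace, while CL is chosen per request.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class QuorumExample {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build();
        Session session = cluster.connect();

        // RF = 3 is a property of the keyspace.
        session.execute("CREATE KEYSPACE IF NOT EXISTS demo WITH replication = "
                + "{'class': 'SimpleStrategy', 'replication_factor': 3}");
        session.execute("CREATE TABLE IF NOT EXISTS demo.kv "
                + "(key text PRIMARY KEY, value text)");

        // CL = QUORUM is a property of the request: wait for 2 of 3 replicas.
        SimpleStatement read =
                new SimpleStatement("SELECT value FROM demo.kv WHERE key = 'x'");
        read.setConsistencyLevel(ConsistencyLevel.QUORUM);
        Row row = session.execute(read).one();
        System.out.println(row == null ? "miss" : row.getString("value"));
        cluster.close();
    }
}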
Multi DC
RF = 6
CL = LOCAL_QUORUM (quorum inside the local DC)
CL = EACH_QUORUM (quorum on each DC)
[Diagram: a client request served by DC 1's replicas; LOCAL_QUORUM minimizes expensive cross-DC network trips]
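In CQL the multi-DC layout above maps to NetworkTopologyStrategy (a sketch; the DC names must match the cluster's snitch configuration):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class MultiDcExample {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build();
        Session session = cluster.connect();

        // RF = 6, split as 3 replicas per datacentre.
        session.execute("CREATE KEYSPACE IF NOT EXISTS multidc WITH replication = "
                + "{'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 3}");
        session.execute("CREATE TABLE IF NOT EXISTS multidc.kv "
                + "(key text PRIMARY KEY, value text)");

        // LOCAL_QUORUM: 2 acks from the 3 local replicas, no cross-DC trip.
        SimpleStatement write = new SimpleStatement(
                "INSERT INTO multidc.kv (key, value) VALUES ('x', '1')");
        write.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
        session.execute(write);
        cluster.close();
    }
}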
Latency + Throughput: W
[Diagram: write(key, value) appends to the commit log (disk) and updates the memtable (memory), then acks; memtables flush to immutable, indexed sstables and the commit log is cleaned]
Minimize disk access: immutability
- Data on disk doesn't change
- Saves IO sync locks
- Requires async compaction
Latency + Throughput: R
[Diagram: read(key) checks the memtable (memory) first; caches and bloom filters limit which indexed sstables on disk must be consulted before returning X]
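The write and read paths above amount to a log-structured merge scheme. A toy sketch (nothing here is Cassandra's code; real sstables live on disk and are skipped via bloom filters):

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

public class LsmSketch {
    final List<String> commitLog = new ArrayList<>();        // sequential, append-only
    NavigableMap<String, String> memtable = new TreeMap<>(); // sorted, in memory
    final Deque<NavigableMap<String, String>> sstables = new ArrayDeque<>();

    void write(String key, String value) {
        commitLog.add(key + "=" + value);  // durability first: sequential IO
        memtable.put(key, value);          // no in-place disk update, no IO locks
        if (memtable.size() >= 4) flush(); // tiny threshold for the demo
    }

    void flush() {
        // The flushed table is immutable from here on; merging duplicate
        // keys across tables is deferred to async compaction.
        sstables.addFirst(Collections.unmodifiableNavigableMap(memtable));
        memtable = new TreeMap<>();        // the commit log segment can be cleaned
    }

    String read(String key) {
        String v = memtable.get(key);      // memory first
        if (v != null) return v;
        for (NavigableMap<String, String> t : sstables) { // newest to oldest
            v = t.get(key);
            if (v != null) return v;
        }
        return null;
    }

    public static void main(String[] args) {
        LsmSketch db = new LsmSketch();
        for (int i = 0; i < 6; i++) db.write("k" + i, "v" + i);
        System.out.println(db.read("k1") + " " + db.read("k5")); // v1 v5
    }
}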
Flexible data model
user[“1”] = { name: Julius, email: [email protected], state: stabbed }
user[“2”] = { name: Marcus, email: [email protected] }
Simpler schema changes: row 1 gained a “state” column, row 2 never had one.
Flexible (good in growth mode)
NAT[“192.168.1.2:80:10.1.1.1:923”] = { ip = “192.12.3.11”, port = “455”, ttl = .... }
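A sketch of that flexibility through CQL, again assuming the DataStax Java driver (the demo keyspace and the mapping of the slide's rows are mine): adding the "state" column is an online metadata change, with no table rewrite.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class SchemaGrowth {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build();
        Session session = cluster.connect();

        session.execute("CREATE KEYSPACE IF NOT EXISTS demo WITH replication = "
                + "{'class': 'SimpleStrategy', 'replication_factor': 3}");
        session.execute("CREATE TABLE IF NOT EXISTS demo.users "
                + "(id text PRIMARY KEY, name text, email text)");
        session.execute("INSERT INTO demo.users (id, name, email) "
                + "VALUES ('2', 'Marcus', '[email protected]')");

        // Grow the schema in place; old rows simply lack the new column.
        session.execute("ALTER TABLE demo.users ADD state text");
        session.execute("INSERT INTO demo.users (id, name, email, state) "
                + "VALUES ('1', 'Julius', '[email protected]', 'stabbed')");
        cluster.close();
    }
}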
Time series
● Column names are stored physically sorted
● Wide rows enable ordering + efficient filtering
● Pack together data that will be queried together
Events (bad) <- applies SQL approach
event[id] = {device=1, time=t1, val=1}
event[id] = {device=1, time=t2, val=2}
Events (better)
event[device1] = { {time=t2, val=2}, {time=t1, val=3} .. }
event[device2] = { {time=t3, val=3}, {time=t4, val=4} .. }
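The "better" layout maps to a CQL table with a clustering key (a sketch, assuming the DataStax Java driver; names are mine): one partition per device, cells physically sorted by time, so a time range on one device is a single sequential slice.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class TimeSeries {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build();
        Session session = cluster.connect();

        session.execute("CREATE KEYSPACE IF NOT EXISTS demo WITH replication = "
                + "{'class': 'SimpleStrategy', 'replication_factor': 3}");
        // Partition key = device; clustering key = time (stored sorted).
        session.execute("CREATE TABLE IF NOT EXISTS demo.events "
                + "(device int, time timestamp, val int, "
                + "PRIMARY KEY (device, time)) "
                + "WITH CLUSTERING ORDER BY (time DESC)");

        // All recent points for one device: one partition, ordered scan.
        for (Row r : session.execute("SELECT time, val FROM demo.events "
                + "WHERE device = 1 AND time >= '2013-11-01' LIMIT 100")) {
            System.out.println(r.getDate("time") + " -> " + r.getInt("val"));
        }
        cluster.close();
    }
}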
Things to watch
● Data model is highly conditioned by your queries
vs. SQL's one model for many possible queries
● Relearn performance tuning
GC, caches, IO patterns, repairs.. understanding internals is as important as in SQL
● Counter-intuitive internals
E.g.: expired data doesn't get deleted immediately (not even “soon”)
● ...
Things to watch
Know well how your clients handle failover, and tune for your use case (a config sketch follows this list):
E.g.: when processing a packet we want low latency and no failures, so:
● How long is a timeout?
● Retry on a different node, or fail fast?
● How to distinguish node failure from a transient latency spike?
● How many nodes must be up to satisfy CL?
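A hedged sketch of that tuning with the DataStax Java driver (assumed, as before; the values are illustrative, not recommendations):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.SocketOptions;
import com.datastax.driver.core.policies.FallthroughRetryPolicy;

public class FailFastClient {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
                .addContactPoint("10.0.0.1")
                // How long is a timeout? Short, since packets can't wait.
                .withSocketOptions(new SocketOptions().setReadTimeoutMillis(500))
                // Fail fast instead of silently retrying elsewhere: the
                // caller decides what a failure means for the packet.
                .withRetryPolicy(FallthroughRetryPolicy.INSTANCE)
                .build();
        cluster.connect();
        cluster.close();
    }
}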
Watch data changes
Service discovery
Coordination
https://zookeeper.apache.org/
Zookeeper
“Because coordinating distributed systems is a Zoo”
Zookeeper
● High availability
● Performance (in memory, r > w)
In memory: limits dataset size (backed by disk)
● Reliable delivery
If a node sees an update, all will eventually
● Total & causal order
- Data is delivered in the same order it is sent
- A message m is delivered only after all messages sent before m have been delivered
Zookeeper architecture
[Diagram: 1. a client sends an update to the leader (L); 2. the leader sends a proposal to the followers; 3. the followers ack; 4. the leader commits]
● Paxos variant
● Ordered messages
● Atomic broadcasts
● Leader is not a SPOF: a new one is elected upon failure
ZK Watchers
/midonet
  /bridges
    /A
      /ports
        /1 = [.., peer = bridgeC/ports/79, .. ]
        /2 = [.., peer = routerX/ports/53, .. ]
    /B
      /ports
        /79 = [.., peer = bridgeA/ports/1, .. ]
  /routers
    ...
A change to a port notifies the subscribers watching its peer.
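A minimal watcher sketch with the plain ZooKeeper Java client (the paths follow the tree above; the callback body is mine). ZK watches are one-shot, so the watcher re-arms itself on every notification:

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class PortWatcher {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30000, e -> {});
        watchPort(zk, "/midonet/bridges/A/ports/1");
        Thread.sleep(Long.MAX_VALUE); // keep the demo alive
    }

    static void watchPort(ZooKeeper zk, String path) throws Exception {
        Watcher w = new Watcher() {
            public void process(WatchedEvent event) {
                try {
                    // Re-read the port config and re-arm the watch.
                    byte[] data = zk.getData(path, this, null);
                    System.out.println(path + " changed: " + new String(data));
                } catch (Exception ex) {
                    ex.printStackTrace();
                }
            }
        };
        zk.getData(path, w, null); // the initial read sets the first watch
    }
}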
ZK Watchers
[Diagram: a topology change (cut and add a new device) produces binding changes on hosts A, B and C, triggering "update A!", "update B!", "update C!" notifications]
Important: we want to notify each node of relevant changes only!
Remember the scale!
Service discovery (WIP)
[Diagram: service nodes n1, n2, n3 register as znodes /nodes/n1, /nodes/n2, /nodes/n3; clients discover them by listing /nodes]
Ephemeral nodes: if the session that created it dies, the node disappears
[Diagram: when a service node dies, its ephemeral znode disappears and discovering clients are notified it is down]
Distributed service nodes vs. clients: clients must know the ZK cluster (static), but not the service nodes (dynamic).
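A sketch of the register/discover pattern with the plain ZooKeeper Java client (the paths follow the slide; the address payload is illustrative):

import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class Discovery {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30000, e -> {});

        // The parent znode is persistent: it outlives any session.
        if (zk.exists("/nodes", false) == null) {
            zk.create("/nodes", new byte[0],
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Register: an ephemeral znode vanishes when this session dies,
        // so membership and liveness are the same thing.
        zk.create("/nodes/n1", "10.0.0.1:9999".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // Discover: list the members and leave a watch that fires once
        // on the next join/leave, telling clients a node is up or down.
        List<String> live = zk.getChildren("/nodes",
                event -> System.out.println("membership changed: " + event));
        System.out.println("live nodes: " + live);
    }
}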
Q ? A : Thank you!