bcndevcon 2013: using Cassandra and Zookeeper to build a distributed, high performance system
DESCRIPTION
Slides from my presentation at BCN Dev Con 2013.
TRANSCRIPT
Using Zookeeper and Cassandra to build a distributed, high performance system
Galo Navarro@srvaroa - [email protected]
BcnDevCon13
About me
Background: backend & architecture in high traffic systems
Current: software engineer @ Midokura
A talk about databases
Takeaways
New tech signals a different emphasis on solving each problem
Solutions are not exclusive: you can combine them.
Go beyond artificial SQL-NoSQL antagonisms:
We share some fundamental problems:
● Latency, availability, durability..
● True today, 20y ago, 3000y ago
Midonet
Distributed network virtualization system
Context, dataset, requirements
https://www.midokura.com/
Virtualization
Computational resources on demand
[Diagram: a grid of VMs provisioned on demand: “the cloud”]
[Diagram: VMs attached to vSwitches and a vRouter form a virtual network connected to the internet]
Midokura's use case
Each client that rents VMs on the datacentre wants to network them as if they were their own physical resources (e.g.: same L2 domain, private addresses, isolation..)
MidoNet allows the owner of the datacentre to provide that service
Virtual network topology
Metrics, audit logs, monitoring
Virtual network state
Routing tables:
destination IP    gateway
192.168.0.0/16    192.168.0.12
66.82.1.0/16      66.82.1.1
0.0.0.0/32        10.0.2.1
ARP table:
IP              MAC
192.168.1.23    aa:bb:cc:dd:ee:ff
192.168.1.11    11:22:33:44:55:66
[Diagram: a vRouter and vSwitches connecting the virtual network to the internet]
Dataset
Usage
A daemon captures packets sent from the VMs on each physical host.
On a new packet, it loads a view of the virtual topology from a (distributed) data store.
[Diagram: daemons on each host load the virtual topology from the data store]
Usage
The daemon simulates the packet's trip through the virtual network until it reaches a destination VM, and identifies that VM's host.
It then instructs the kernel to route similar packets via a tunnel.
[Diagram: subsequent packets flow host-to-host through a tunnel, without further simulation]
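That loop lends itself to a tiny sketch. The following toy Java program is purely illustrative: every type and name in it is hypothetical, invented to mirror the description above, not MidoNet's actual API.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PacketHandlerSketch {
    // Flow rules already installed in the kernel: match -> tunnel.
    static final Map<String, String> kernelFlows = new ConcurrentHashMap<>();

    // Called when the kernel has no flow for a packet and hands it to us.
    static void onPacket(String srcIp, String dstIp) {
        Object topology = loadVirtualTopology();     // from the data store
        String destHost = simulate(topology, dstIp); // walk the virtual devices
        // Install a flow so similar packets bypass the daemon entirely.
        kernelFlows.put(srcIp + "->" + dstIp, "tunnel-to-" + destHost);
    }

    static Object loadVirtualTopology() { return new Object(); }           // stub
    static String simulate(Object topo, String dstIp) { return "host42"; } // stub

    public static void main(String[] args) {
        onPacket("10.0.0.1", "10.0.0.2");
        System.out.println(kernelFlows); // {10.0.0.1->10.0.0.2=tunnel-to-host42}
    }
}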
Midonet architecture
[Diagram: hosts and a storage cluster communicating over an IP bus, managed through an API]
Constraints
CAP: Consistency is negotiable; Availability and Partition tolerance are critical.
What happens if our service doesn't handle network partitions, faulty masters, GC pauses, latency, lag, locks..?
- Not just N users unable to see their profiles
- But infrastructure failure in the entire datacentre
Midokura's use case
Coming to “NoSQL” not from “Big Data”
But looking for a specific mix of:
● Availability
● Fault tolerance
● Performance
● Durability
● Low operational cost
How are Cassandra and Zookeeper useful?
Virtual Network State
Assorted data
Metrics
https://cassandra.apache.org/
Cassandra elevator pitch
A massively scalable open source NoSQL database
Supports large amounts of structured, semi-structured, and unstructured data (key-value)
Across multiple data centers
Performance, availability, linear scalability, with no SPOF (single point of failure)
Cassandra architecture
[Diagram: clients connect to a peer-to-peer ring of nodes spanning DC1, DC2 and DC3]
● P2P
● No privileged nodes
● Unified view
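A hedged illustration of "no privileged nodes" (the talk doesn't name a client library; this sketch assumes the DataStax Java driver): any live node works as a contact point, and the driver learns the rest of the ring from it.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Host;

public class ConnectAnywhere {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
                .addContactPoint("10.0.0.1") // any node; none is special
                .build();
        cluster.connect(); // triggers connection and ring discovery
        // The driver discovered the whole ring from that single peer.
        for (Host h : cluster.getMetadata().getAllHosts()) {
            System.out.println("peer: " + h.getAddress());
        }
        cluster.close();
    }
}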
Fault tolerance
Replication Factor = 3
Consistency Level = QUORUM
[Diagram: write(x) succeeds with "ok" from enough replicas even though one faulty node FAILs: QUORUM needs 2 of 3]
Fault tolerance
Hinted handoff: the coordinator holds the data until the faulty replica recovers.
[Diagram: w(x) is acked; a later read(x) at RF = 3, CL = QUORUM returns x from the healthy replicas]
Fault tolerance
[Diagram: read(x); two replicas return x, a disagreeing one returns a stale x']
Consistency
The coordinator will wait until CL is possible across replicas (or fail). CL can also be 1, 2, ALL..
RF = 3
CL = QUORUM
read_repair
Consistency
A repair order is issued to the disagreeing node to reconcile its local copy.
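A sketch of these knobs, assuming the DataStax Java driver and CQL (keyspace and table names are mine): RF is fixed per keyspace, while CL is chosen per request.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class QuorumExample {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build();
        Session session = cluster.connect();

        // RF = 3 is a property of the keyspace.
        session.execute("CREATE KEYSPACE IF NOT EXISTS demo WITH replication = "
                + "{'class': 'SimpleStrategy', 'replication_factor': 3}");
        session.execute("CREATE TABLE IF NOT EXISTS demo.kv "
                + "(key text PRIMARY KEY, value text)");

        // CL = QUORUM is a property of the request: wait for 2 of 3 replicas.
        SimpleStatement read =
                new SimpleStatement("SELECT value FROM demo.kv WHERE key = 'x'");
        read.setConsistencyLevel(ConsistencyLevel.QUORUM);
        Row row = session.execute(read).one();
        System.out.println(row == null ? "miss" : row.getString("value"));
        cluster.close();
    }
}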
Multi DC
RF = 6
CL = LOCAL_QUORUM (quorum inside the local DC)
CL = EACH_QUORUM (quorum on each DC)
[Diagram: a client request served by DC 1's replicas; LOCAL_QUORUM minimizes expensive cross-DC network trips]
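In CQL the multi-DC layout above maps to NetworkTopologyStrategy (a sketch; the DC names must match the cluster's snitch configuration):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class MultiDcExample {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build();
        Session session = cluster.connect();

        // RF = 6, split as 3 replicas per datacentre.
        session.execute("CREATE KEYSPACE IF NOT EXISTS multidc WITH replication = "
                + "{'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 3}");
        session.execute("CREATE TABLE IF NOT EXISTS multidc.kv "
                + "(key text PRIMARY KEY, value text)");

        // LOCAL_QUORUM: 2 acks from the 3 local replicas, no cross-DC trip.
        SimpleStatement write = new SimpleStatement(
                "INSERT INTO multidc.kv (key, value) VALUES ('x', '1')");
        write.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
        session.execute(write);
        cluster.close();
    }
}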
Latency + Throughput: W
[Diagram: write(key, value) appends to the commit log (disk) and updates the memtable (memory), then acks; memtables flush to immutable, indexed sstables and the commit log is cleaned]
Minimize disk access: immutability
- Data on disk doesn't change
- Saves IO sync locks
- Requires async compaction
Latency + Throughput: R
[Diagram: read(key) checks the memtable (memory) first; caches and bloom filters limit which indexed sstables on disk must be consulted before returning X]
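The write and read paths above amount to a log-structured merge scheme. A toy sketch (nothing here is Cassandra's code; real sstables live on disk and are skipped via bloom filters):

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

public class LsmSketch {
    final List<String> commitLog = new ArrayList<>();        // sequential, append-only
    NavigableMap<String, String> memtable = new TreeMap<>(); // sorted, in memory
    final Deque<NavigableMap<String, String>> sstables = new ArrayDeque<>();

    void write(String key, String value) {
        commitLog.add(key + "=" + value);  // durability first: sequential IO
        memtable.put(key, value);          // no in-place disk update, no IO locks
        if (memtable.size() >= 4) flush(); // tiny threshold for the demo
    }

    void flush() {
        // The flushed table is immutable from here on; merging duplicate
        // keys across tables is deferred to async compaction.
        sstables.addFirst(Collections.unmodifiableNavigableMap(memtable));
        memtable = new TreeMap<>();        // the commit log segment can be cleaned
    }

    String read(String key) {
        String v = memtable.get(key);      // memory first
        if (v != null) return v;
        for (NavigableMap<String, String> t : sstables) { // newest to oldest
            v = t.get(key);
            if (v != null) return v;
        }
        return null;
    }

    public static void main(String[] args) {
        LsmSketch db = new LsmSketch();
        for (int i = 0; i < 6; i++) db.write("k" + i, "v" + i);
        System.out.println(db.read("k1") + " " + db.read("k5")); // v1 v5
    }
}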
Flexible data model
user[“1”] = { name: Julius, email: [email protected], state: stabbed }
user[“2”] = { name: Marcus, email: [email protected] }
Simpler schema changes: row 1 gained a “state” column, row 2 never had one.
Flexible (good in growth mode)
NAT[“192.168.1.2:80:10.1.1.1:923”] = { ip = “192.12.3.11”, port = “455”, ttl = .... }
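A sketch of that flexibility through CQL, again assuming the DataStax Java driver (the demo keyspace and the mapping of the slide's rows are mine): adding the "state" column is an online metadata change, with no table rewrite.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class SchemaGrowth {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build();
        Session session = cluster.connect();

        session.execute("CREATE KEYSPACE IF NOT EXISTS demo WITH replication = "
                + "{'class': 'SimpleStrategy', 'replication_factor': 3}");
        session.execute("CREATE TABLE IF NOT EXISTS demo.users "
                + "(id text PRIMARY KEY, name text, email text)");
        session.execute("INSERT INTO demo.users (id, name, email) "
                + "VALUES ('2', 'Marcus', '[email protected]')");

        // Grow the schema in place; old rows simply lack the new column.
        session.execute("ALTER TABLE demo.users ADD state text");
        session.execute("INSERT INTO demo.users (id, name, email, state) "
                + "VALUES ('1', 'Julius', '[email protected]', 'stabbed')");
        cluster.close();
    }
}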
Time series
● Column names are stored physically sorted
● Wide rows enable ordering + efficient filtering
● Pack together data that will be queried together
Events (bad) <- applies SQL approach
event[id] = {device=1, time=t1, val=1}
event[id] = {device=1, time=t2, val=2}
Events (better)
event[device1] = { {time=t2, val=2}, {time=t1, val=3} .. }
event[device2] = { {time=t3, val=3}, {time=t4, val=4} .. }
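The "better" layout maps to a CQL table with a clustering key (a sketch, assuming the DataStax Java driver; names are mine): one partition per device, cells physically sorted by time, so a time range on one device is a single sequential slice.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class TimeSeries {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build();
        Session session = cluster.connect();

        session.execute("CREATE KEYSPACE IF NOT EXISTS demo WITH replication = "
                + "{'class': 'SimpleStrategy', 'replication_factor': 3}");
        // Partition key = device; clustering key = time (stored sorted).
        session.execute("CREATE TABLE IF NOT EXISTS demo.events "
                + "(device int, time timestamp, val int, "
                + "PRIMARY KEY (device, time)) "
                + "WITH CLUSTERING ORDER BY (time DESC)");

        // All recent points for one device: one partition, ordered scan.
        for (Row r : session.execute("SELECT time, val FROM demo.events "
                + "WHERE device = 1 AND time >= '2013-11-01' LIMIT 100")) {
            System.out.println(r.getDate("time") + " -> " + r.getInt("val"));
        }
        cluster.close();
    }
}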
Things to watch
● Data model is highly conditioned by your queries
vs. SQL's one model for many possible queries
● Relearn performance tuning
GC, caches, IO patterns, repairs.. understanding internals is as important as in SQL
● Counter-intuitive internals
E.g.: expired data doesn't get deleted immediately (not even “soon”)
● ...
Things to watch
Know well how your clients handle failover, and tune for your use case (a config sketch follows this list):
E.g.: when processing a packet we want low latency and no failures, so:
● How long is a timeout?
● Retry on a different node, or fail fast?
● How to distinguish node failure from a transient latency spike?
● How many nodes must be up to satisfy CL?
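A hedged sketch of that tuning with the DataStax Java driver (assumed, as before; the values are illustrative, not recommendations):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.SocketOptions;
import com.datastax.driver.core.policies.FallthroughRetryPolicy;

public class FailFastClient {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
                .addContactPoint("10.0.0.1")
                // How long is a timeout? Short, since packets can't wait.
                .withSocketOptions(new SocketOptions().setReadTimeoutMillis(500))
                // Fail fast instead of silently retrying elsewhere: the
                // caller decides what a failure means for the packet.
                .withRetryPolicy(FallthroughRetryPolicy.INSTANCE)
                .build();
        cluster.connect();
        cluster.close();
    }
}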
Watch data changes
Service discovery
Coordination
https://zookeeper.apache.org/
Zookeeper
“Because coordinating distributed systems is a Zoo”
Zookeeper
● High availability
● Performance (in memory, r > w)
In memory: limits dataset size (backed by disk)
● Reliable delivery
If a node sees an update, all will eventually
● Total & causal order
- Data is delivered in the same order it is sent
- A message m is delivered only after all messages sent before m have been delivered
Zookeeper architecture
[Diagram: 1. a client sends an update to the leader (L); 2. the leader sends a proposal to the followers; 3. the followers ack; 4. the leader commits]
● Paxos variant
● Ordered messages
● Atomic broadcasts
● Leader is not a SPOF: a new one is elected upon failure
ZK Watchers
/midonet
  /bridges
    /A
      /ports
        /1 = [.., peer = bridgeC/ports/79, .. ]
        /2 = [.., peer = routerX/ports/53, .. ]
    /B
      /ports
        /79 = [.., peer = bridgeA/ports/1, .. ]
  /routers
    ...
A change to a port notifies the subscribers watching its peer.
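A minimal watcher sketch with the plain ZooKeeper Java client (the paths follow the tree above; the callback body is mine). ZK watches are one-shot, so the watcher re-arms itself on every notification:

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class PortWatcher {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30000, e -> {});
        watchPort(zk, "/midonet/bridges/A/ports/1");
        Thread.sleep(Long.MAX_VALUE); // keep the demo alive
    }

    static void watchPort(ZooKeeper zk, String path) throws Exception {
        Watcher w = new Watcher() {
            public void process(WatchedEvent event) {
                try {
                    // Re-read the port config and re-arm the watch.
                    byte[] data = zk.getData(path, this, null);
                    System.out.println(path + " changed: " + new String(data));
                } catch (Exception ex) {
                    ex.printStackTrace();
                }
            }
        };
        zk.getData(path, w, null); // the initial read sets the first watch
    }
}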
ZK Watchers
[Diagram: a topology change (cut and add a new device) produces binding changes on hosts A, B and C, triggering "update A!", "update B!", "update C!" notifications]
Important: we want to notify each node of relevant changes only!
Remember the scale!
Service discovery (WIP)
[Diagram: service nodes n1, n2, n3 register as znodes /nodes/n1, /nodes/n2, /nodes/n3; clients discover them by listing /nodes]
Ephemeral nodes: if the session that created it dies, the node disappears
[Diagram: when a service node dies, its ephemeral znode disappears and discovering clients are notified it is down]
Distributed service nodes vs. clients: clients must know the ZK cluster (static), but not the service nodes (dynamic).
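A sketch of the register/discover pattern with the plain ZooKeeper Java client (the paths follow the slide; the address payload is illustrative):

import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class Discovery {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30000, e -> {});

        // The parent znode is persistent: it outlives any session.
        if (zk.exists("/nodes", false) == null) {
            zk.create("/nodes", new byte[0],
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Register: an ephemeral znode vanishes when this session dies,
        // so membership and liveness are the same thing.
        zk.create("/nodes/n1", "10.0.0.1:9999".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // Discover: list the members and leave a watch that fires once
        // on the next join/leave, telling clients a node is up or down.
        List<String> live = zk.getChildren("/nodes",
                event -> System.out.println("membership changed: " + event));
        System.out.println("live nodes: " + live);
    }
}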
Q ? A : Thank you!