NYC* Jonathan Ellis keynote: "Cassandra 1.2 + 2.0"
DESCRIPTION
Jonathan Ellis, Apache Cassandra Project Chair & DataStax Co-Founder, presents Apache Cassandra 1.2 + 2.0.
TRANSCRIPT
Cassandra 1.2 (and 2.0)
Jonathan Ellis
Project Chair, Apache Cassandra
CTO, DataStax
@spyced
©2012 DataStax
• Massively scalable
• High performance
• Reliable/Available
VLDB benchmark (RWS)
Endpoint benchmark (RW)
1.2
• Concurrent schema changes
• Virtual nodes
• “Fat node” support
• JBOD improvements
• Off-heap bloom filters, compression metadata
• Parallel leveled compaction
• Atomic batches
• CQL3
• Collections
• Data dictionary
• Tracing
Concurrent Schema Changes
Cassandra Cluster
Client
CREATE TABLE X;...
DROP TABLE X;
Client
CREATE TABLE Y;...
DROP TABLE Y;
Virtual nodes
[Figure: ring without vnodes, split into six token ranges (A through F); ring with vnodes, split into sixteen smaller ranges (A through P)]
Node Rebuild without vnodes
[Figure: ring without vnodes (ranges A through F) and Nodes 1 through 6, each storing three ranges]
Node Rebuild with vnodes
[Figure: ring with vnodes (ranges A through P) and Nodes 1 through 6, each storing many small ranges drawn from all around the ring]
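The two rebuild figures contrast how many peers can help rebuild a lost node. The following is a minimal Python sketch of that idea, not Cassandra's actual placement code. It assumes a simplified ring in which each range is replicated on the next distinct nodes clockwise, uses 12 nodes (rather than the six in the figure) so the contrast is easier to see, and picks 256 random tokens per node as the vnode setting. All function names are hypothetical.

```python
import random

def build_ring(num_nodes, tokens_per_node, seed=0):
    """Return a sorted list of (token, node) pairs.

    With tokens_per_node == 1 the tokens are evenly spaced (ring without
    vnodes); otherwise each node gets random tokens (a stand-in for vnodes).
    """
    rng = random.Random(seed)
    ring = []
    for node in range(num_nodes):
        if tokens_per_node == 1:
            ring.append((node * 1000, node))
        else:
            for _ in range(tokens_per_node):
                ring.append((rng.randrange(10**9), node))
    return sorted(ring)

def replicas_for(ring, index, rf):
    """Walk clockwise from ring[index] and collect rf distinct nodes."""
    owners = []
    i = index
    while len(owners) < rf:
        node = ring[i % len(ring)][1]
        if node not in owners:
            owners.append(node)
        i += 1
    return owners

def rebuild_sources(ring, failed_node, rf=3):
    """Nodes sharing at least one replicated range with the failed node."""
    sources = set()
    for index in range(len(ring)):
        owners = replicas_for(ring, index, rf)
        if failed_node in owners:
            sources.update(n for n in owners if n != failed_node)
    return sources

classic = rebuild_sources(build_ring(12, 1), failed_node=0)
vnodes = rebuild_sources(build_ring(12, 256), failed_node=0)
print(len(classic), "candidate source nodes without vnodes")
print(len(vnodes), "candidate source nodes with vnodes")
```

Under these assumptions only the failed node's ring neighbors can supply its data without vnodes, while with vnodes essentially every other node holds some of it, so the rebuild load can be spread across the cluster.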
JBOD support
[Figure: one Cassandra instance using four disks, HDD1 through HDD4; in the second frame HDD4 has failed (X)]
On-Heap/Off-Heap
Java Process
JVM
On-Heap (Java Heap): managed by GC
Off-Heap (Native Memory): not managed by GC
Moving O(n) structures off-heap
• Row (partition) bloom filter
  • 1-2GB per billion rows
• Compression metadata
  • ~20GB per TB compressed data
• 1.2 targets 5-10TB of data per machine
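The bloom filter figure on this slide can be sanity-checked with the textbook sizing formula for an optimally configured Bloom filter, m = -n ln(p) / (ln 2)^2 bits for n keys at false-positive rate p. The sketch below is a back-of-the-envelope check in Python. It assumes the textbook formula rather than Cassandra's own sizing code, and the two false-positive rates are illustrative choices, not values from the talk.

```python
import math

def bloom_filter_bytes(num_keys, false_positive_rate):
    """Bytes needed by an optimally sized Bloom filter (textbook formula)."""
    bits = -num_keys * math.log(false_positive_rate) / (math.log(2) ** 2)
    return bits / 8

for p in (0.01, 0.001):
    gb = bloom_filter_bytes(1_000_000_000, p) / 1e9
    print(f"false-positive rate {p}: {gb:.2f} GB per billion rows")
```

Both rates land between 1 and 2 GB per billion rows, in line with the slide, and the size grows linearly with the row count, which is why the slide calls it an O(n) structure.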
Batches
[Figure: a client sends a batch to a coordinator node, which forwards the writes to three partition replicas; in the final frame a failure (X) interrupts the batch]
Atomic batches
[Figure: the same client, coordinator node, and three partition replicas, plus a batchlog node; the final frames mark a failure (X)]
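The atomic batch figures add a batchlog node to the plain batch path. The following Python sketch simulates one plausible reading of that flow: the coordinator stores the whole batch on the batchlog node before touching any replica, and the batchlog node replays the batch if the coordinator never confirms completion. Every class and function name here is hypothetical, and the sketch ignores networking, timeouts, and replication of the batchlog itself.

```python
class Replica:
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value  # idempotent, so replaying a write is harmless


class BatchlogNode:
    def __init__(self):
        self.pending = {}

    def store(self, batch_id, mutations):
        self.pending[batch_id] = list(mutations)

    def remove(self, batch_id):
        self.pending.pop(batch_id, None)

    def replay(self, replicas):
        """Re-apply every batch that was never confirmed as finished."""
        for batch_id, mutations in list(self.pending.items()):
            for replica_name, key, value in mutations:
                replicas[replica_name].apply(key, value)
            self.remove(batch_id)


class CoordinatorCrash(Exception):
    pass


def write_atomic_batch(batch_id, mutations, replicas, batchlog, crash_after=None):
    batchlog.store(batch_id, mutations)            # 1. record the whole batch first
    for count, (replica_name, key, value) in enumerate(mutations):
        if crash_after is not None and count == crash_after:
            raise CoordinatorCrash(batch_id)       # simulated failure mid-batch
        replicas[replica_name].apply(key, value)   # 2. apply each mutation
    batchlog.remove(batch_id)                      # 3. batch finished, forget it


replicas = {"r1": Replica(), "r2": Replica(), "r3": Replica()}
batchlog = BatchlogNode()
batch = [("r1", "a", 1), ("r2", "b", 2), ("r3", "c", 3)]

try:
    write_atomic_batch("batch-1", batch, replicas, batchlog, crash_after=1)
except CoordinatorCrash:
    pass

# State right after the crash: only the first mutation was applied.
partial = {name: dict(replica.data) for name, replica in replicas.items()}

batchlog.replay(replicas)  # the batchlog node finishes the job
print(partial)
print({name: replica.data for name, replica in replicas.items()})
```

In this model "atomic" means that a batch that was accepted is eventually applied in full, not that readers are isolated from a half-applied batch while it is in flight.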
CQL: You got SQL in my NoSQL!
CREATE TABLE users (
  id uuid PRIMARY KEY,
  name text,
  state text,
  birth_date int
);
CREATE INDEX ON users(state);
SELECT * FROM users WHERE state='Texas' AND birth_date > 1950;
Strictly “realtime” focused
• No joins
• No subqueries
• No aggregation functions* or GROUP BY
• Strictly limited ORDER BY
songs
a3e64f8f... title: La Grange artist: ZZ Top album: Tres Hombres
8a172618... title: Moving in Stereo artist: Fu Manchu album: We Must Obey
2b09185b... title: Outside Woman Blues artist: Back Door Slam album: Roll Away
create column family songs
with key_validation_class = UUIDType
and comparator = UTF8Type -- cell names are strings
and column_metadata = [
  {column_name: title, validation_class: UTF8Type},
  {column_name: album, validation_class: UTF8Type},
  {column_name: artist, validation_class: UTF8Type},
  {column_name: data, validation_class: BytesType}
];
id          | title               | artist         | album
a3e64f8f... | La Grange           | ZZ Top         | Tres Hombres
8a172618... | Moving in Stereo    | Fu Manchu      | We Must Obey
2b09185b... | Outside Woman Blues | Back Door Slam | Roll Away
CREATE TABLE songs (
  id uuid PRIMARY KEY,
  title text,
  artist text,
  album text,
  data blob
);
song_tags
a3e64f8f... blues: 1973:
8a172618... covers: 2003:
create column family song_tags
with key_validation_class = UUIDType
and comparator = UTF8Type;
id          | tag_name
a3e64f8f... | blues
a3e64f8f... | 1973
8a172618... | covers
8a172618... | 2003
CREATE TABLE song_tags (
  id uuid,
  tag_name text,
  PRIMARY KEY (id, tag_name)
);
playlists
62c36092... | La Grange,ZZ Top,Tres Hombres: a3e64f8f... | Moving in S...,Fu Manchu,We Must O...: 8a172618... | Outside Wo...,Back Door ...,Roll Away: 2b09185b...
create column family playlists
with key_validation_class = UUIDType
and comparator = 'CompositeType(UTF8Type, UTF8Type, UTF8Type)'
and default_validation_class = UUIDType;
id          | title            | artist         | album        | song_id
62c36092... | La Grange        | ZZ Top         | Tres Hombres | a3e64f8f...
62c36092... | Moving in Stereo | Fu Manchu      | We Must Obey | 8a172618...
62c36092... | Outside Wo...    | Back Door Slam | Roll Away    | 2b09185b...
CREATE TABLE playlists (
  id uuid,
  title text,
  album text,
  artist text,
  song_id uuid,
  PRIMARY KEY (id, title, album, artist)
);
Collections
id          | title               | artist         | album        | tags
a3e64f8f... | La Grange           | ZZ Top         | Tres Hombres | {blues, 1973}
8a172618... | Moving in Stereo    | Fu Manchu      | We Must Obey | {covers, 2003}
2b09185b... | Outside Woman Blues | Back Door Slam | Roll Away    |
CREATE TABLE songs (
  id uuid PRIMARY KEY,
  title text,
  artist text,
  album text,
  tags set<text>,
  data blob
);
Data dictionary
cqlsh:system> SELECT * FROM schema_keyspaces;
 keyspace_name | durable_writes | strategy_class | strategy_options
---------------+----------------+----------------+----------------------------
 keyspace1     | True           | SimpleStrategy | {"replication_factor":"1"}
 system        | True           | LocalStrategy  | {}
 system_traces | True           | SimpleStrategy | {"replication_factor":"1"}
cqlsh:system> SELECT * FROM schema_columnfamilies WHERE keyspace_name='keyspace1' AND columnfamily_name='test';
cqlsh:system> SELECT * FROM schema_columns WHERE keyspace_name='keyspace1' AND columnfamily_name='test';
Data dictionary
cqlsh:system> SELECT * FROM local;
 key | bootstrapped | cluster_name | cql_version | data_center | gossip_generation | partitioner | rack | release_version | ring_id | thrift_version | tokens | truncated_at
-------+--------------+--------------+-------------+-------------+-------------------+---------------------------------------------+-------+----------------------+--------------------------------------+----------------+--------+--------------
 local | COMPLETED | test | 3.0.0 | datacenter1 | 1352846064 | org.apache.cassandra.dht.Murmur3Partitioner | rack1 | 1.2.0-beta2-SNAPSHOT | 224c55d5-21b4-42b0-8969-afc0cc04e812 | 19.35.0 | {0} | null
Data dictionary
cqlsh:system> SELECT * FROM peers LIMIT 1;
 peer | data_center | rack | release_version | ring_id | rpc_address | schema_version | tokens
-----------+-------------+-------+----------------------+--------------------------------------+-------------+--------------------------------------+-----------------------
 127.0.0.3 | datacenter1 | rack1 | 1.2.0-beta2-SNAPSHOT | f6782327-ef8e-41cf-87b9-2edc287b1ffe | 127.0.0.3 | 915ed888-ddd0-3448-860c-582f4eea1bc6 | {6148914691236517204}
Request tracing
cqlsh:foo> INSERT INTO bar (i, j) VALUES (6, 2);
Tracing session: 4ad36250-1eb4-11e2-0000-fe8ebeead9f9
 activity                            | timestamp    | source    | source_elapsed
-------------------------------------+--------------+-----------+----------------
 Determining replicas for mutation   | 00:02:37,015 | 127.0.0.1 | 540
 Sending message to /127.0.0.2       | 00:02:37,015 | 127.0.0.1 | 779
 Message received from /127.0.0.1    | 00:02:37,016 | 127.0.0.2 | 63
 Applying mutation                   | 00:02:37,016 | 127.0.0.2 | 220
 Acquiring switchLock                | 00:02:37,016 | 127.0.0.2 | 250
 Appending to commitlog              | 00:02:37,016 | 127.0.0.2 | 277
 Adding to memtable                  | 00:02:37,016 | 127.0.0.2 | 378
 Enqueuing response to /127.0.0.1    | 00:02:37,016 | 127.0.0.2 | 710
 Sending message to /127.0.0.1       | 00:02:37,016 | 127.0.0.2 | 888
 Message received from /127.0.0.2    | 00:02:37,017 | 127.0.0.1 | 2334
 Processing response from /127.0.0.2 | 00:02:37,017 | 127.0.0.1 | 2550
Tracing an antipattern
CREATE TABLE queues (
  id text,
  created_at timeuuid,
  value blob,
  PRIMARY KEY (id, created_at)
);
id      | created_at | value
myqueue | 3092e86f   | 9b0450d30de9
myqueue | 0867f47c   | fc7aee5f6a66
myqueue | 5fc74be0   | 668fdb3a2196
cqlsh:foo> SELECT * FROM queues WHERE id = 'myqueue' ORDER BY created_at LIMIT 1;
Tracing session: 4ad36250-1eb4-11e2-0000-fe8ebeead9f9
 activity                                 | timestamp    | source    | source_elapsed
------------------------------------------+--------------+-----------+---------------
 execute_cql3_query                       | 19:31:05,650 | 127.0.0.1 | 0
 Sending message to /127.0.0.3            | 19:31:05,651 | 127.0.0.1 | 541
 Message received from /127.0.0.1         | 19:31:05,651 | 127.0.0.3 | 39
 Executing single-partition query         | 19:31:05,652 | 127.0.0.3 | 943
 Acquiring sstable references             | 19:31:05,652 | 127.0.0.3 | 973
 Merging memtable contents                | 19:31:05,652 | 127.0.0.3 | 1020
 Merging data from memtables and sstables | 19:31:05,652 | 127.0.0.3 | 1081
 Read 1 live cells and 100000 tombstoned  | 19:31:05,686 | 127.0.0.3 | 35072
 Enqueuing response to /127.0.0.1         | 19:31:05,687 | 127.0.0.3 | 35220
 Sending message to /127.0.0.1            | 19:31:05,687 | 127.0.0.3 | 35314
 Message received from /127.0.0.3         | 19:31:05,687 | 127.0.0.1 | 36908
 Processing response from /127.0.0.3      | 19:31:05,688 | 127.0.0.1 | 37650
 Request complete                         | 19:31:05,688 | 127.0.0.1 | 38047
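The trace above spends almost all of its time on the step that reads 1 live cell and 100000 tombstoned ones. A toy Python model shows why a queue stored in one partition behaves this way. It assumes a partition is a list of cells sorted by created_at in which a delete only marks the cell as a tombstone, and that tombstones stay in place until compaction purges them. The read_head function and the tuple layout are hypothetical.

```python
def read_head(partition):
    """Return (first live row, number of tombstones scanned to reach it)."""
    tombstones_scanned = 0
    for created_at, value, deleted in partition:  # cells are sorted by created_at
        if deleted:
            tombstones_scanned += 1               # must be read and skipped
            continue
        return (created_at, value), tombstones_scanned
    return None, tombstones_scanned


# A queue that has already consumed (deleted) 100000 items and has one left.
partition = [(i, b"old", True) for i in range(100_000)]
partition.append((100_000, b"next", False))

row, scanned = read_head(partition)
print(row, scanned)
```

Because consumers always delete from the front and read from the front, every "LIMIT 1" read has to walk the whole run of deleted entries first, so the cost grows with the number of items already consumed rather than with the number still waiting.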
2.0
• Eager retries
• Improved compaction
• Triggers
• CAS (Compare-and-set)
• More-efficient repair
Eager retries
[Figure: a client talks to a coordinator, which reaches three nodes labeled 40% busy, 90% busy, and 30% busy]
Improved compaction
• Specialized strategy for append-only with TTL
• Can we do any better for a general-purpose workload?
Triggers
CREATE TRIGGER foo
BEFORE UPDATE
ON users
EXECUTE '/var/lib/cassandra/triggers/send_registration_email.jar'
class MyTrigger implements ITrigger
{
    public Collection<RowMutation> revise(ByteBuffer key, ColumnFamily update)
    {
        ...
    }
}
CAS
Session 1
SELECT * FROM users WHERE username = 'jbellis'
[empty resultset]
INSERT INTO users (...) VALUES ('jbellis', ...)
Session 2
SELECT * FROM users WHERE username = 'jbellis'
[empty resultset]
INSERT INTO users (...) VALUES ('jbellis', ...)
CAS
• Locking does not solve this problem
• 2PC does not solve this problem
• Locking + 2PC does not solve this problem
Paxos!
Open questions
UPDATE USERS
SET email = '[email protected]', ...
WHERE username = 'jbellis'
IF email = '[email protected]'
• What do we call it?
  • Conditional write guarantee?
  • Atomic conditional updates?
  • Lightweight transactions?
• What syntax do we use for CQL?
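The two-session race and the proposed IF clause can be illustrated with a small in-memory model. This Python sketch only shows the semantics a compare-and-set write is meant to provide. It runs in one process, so the check and the write are trivially a single step, and it says nothing about how replicas would agree on the outcome (the deck points to Paxos for that). The Users class, its method names, and the example.com addresses are all hypothetical.

```python
class Users:
    def __init__(self):
        self.rows = {}

    def select(self, username):
        return self.rows.get(username)

    def insert(self, username, email):
        self.rows[username] = email  # unconditional: last write wins

    def insert_if_not_exists(self, username, email):
        """Check and write as one step; True only for the session that wins."""
        if username in self.rows:
            return False
        self.rows[username] = email
        return True

    def update_if(self, username, expected_email, new_email):
        """Write only if the current value still matches what the caller saw."""
        if self.rows.get(username) != expected_email:
            return False
        self.rows[username] = new_email
        return True


# Check-then-insert: both sessions see an empty result, then both insert.
users = Users()
seen_by_1 = users.select("jbellis")
seen_by_2 = users.select("jbellis")
users.insert("jbellis", "session1@example.com")
users.insert("jbellis", "session2@example.com")  # silently overwrites session 1
after_race = users.select("jbellis")

# Conditional insert: exactly one session is told it succeeded.
users = Users()
won_1 = users.insert_if_not_exists("jbellis", "session1@example.com")
won_2 = users.insert_if_not_exists("jbellis", "session2@example.com")

# Conditional update: a stale expectation is rejected.
stale = users.update_if("jbellis", "session2@example.com", "new@example.com")
fresh = users.update_if("jbellis", "session1@example.com", "new@example.com")
print(after_race, won_1, won_2, stale, fresh)
```

The point of the conditional forms is that the losing session learns it lost, instead of both sessions believing they registered the username.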
More-efficient repair
Consequences
• Repair won’t replace missing data due to hardware failure by default
• Add --include-previously-repaired to force old-style full validation