Big Data Grows Up - A (re)introduction to Cassandra

Robbie Strickland

Upload: rastrick

Posted on 15-Jan-2015


DESCRIPTION

For the last several years Cassandra has been the heavyweight in the NoSQL space. But its massive scalability was accompanied by a bare-bones feature set, a substantial learning curve, and a Thrift-based RPC mechanism that left newbies bewildered by a sea of potential client libraries, all with their own fragmented semantics. Over the last year that’s all changed, culminating in the recently unveiled Cassandra 2.0. In this talk I’ll bring you up to speed on Cassandra Query Language, cursors, the new native libraries, lightweight transactions, virtual nodes, and loads of other new goodies. Whether you’re completely new to Cassandra or a seasoned veteran who wants the latest scoop, this talk has something for you.

TRANSCRIPT

Page 1: Big Data Grows Up - A (re)introduction to Cassandra

Big Data Grows Up
A (re)introduction to Cassandra

Robbie Strickland

Page 2: Big Data Grows Up - A (re)introduction to Cassandra

Who am I?

Robbie Strickland
Software Development Manager
The Weather Channel

[email protected]
@dont_use_twitter

Page 3: Big Data Grows Up - A (re)introduction to Cassandra

Who am I?

● Cassandra user/contributor since 2010
● … it was at release 0.5 back then
● 4 years? Oracle DBAs aren’t impressed
● Done lots of dumb stuff with Cassandra
● … and some really awesome stuff too

Page 4: Big Data Grows Up - A (re)introduction to Cassandra

Cassandra in 2010

Page 5: Big Data Grows Up - A (re)introduction to Cassandra

Cassandra in 2010

Page 6: Big Data Grows Up - A (re)introduction to Cassandra

Cassandra in 2014

Page 7: Big Data Grows Up - A (re)introduction to Cassandra

Why Cassandra?

It’s fast:

● No locks
● Tunable consistency
● Sequential R/W
● Decentralized

Page 8: Big Data Grows Up - A (re)introduction to Cassandra

Why Cassandra?

It scales (linearly):

● Multi data center
● No SPOF
● DHT
● Hadoop integration

Page 9: Big Data Grows Up - A (re)introduction to Cassandra

Why Cassandra?

It’s fault tolerant:

● Automatic replication
● Masterless
● Failed nodes replaced with ease

Page 10: Big Data Grows Up - A (re)introduction to Cassandra

What’s different?

… a lot in the last year (ish)

Page 11: Big Data Grows Up - A (re)introduction to Cassandra

What’s new?

● Virtual nodes
● O(n) data moved off-heap
● CQL3 (and defining schemas)
● Native protocol/driver
● Collections
● Lightweight transactions
● Compaction throttling that actually works

Page 12: Big Data Grows Up - A (re)introduction to Cassandra

What’s gone?

● Manual token management
● Supercolumns
● Thrift (if you use the native driver)
● Directly managing storage rows

Page 13: Big Data Grows Up - A (re)introduction to Cassandra

What’s still the same?

● Still not an RDBMS
● Still no joins (see above)
● Still no ad-hoc queries (see above again)
● Still requires a denormalized data model (^^)
● Still need to know what the heck you’re doing

Page 14: Big Data Grows Up - A (re)introduction to Cassandra

Linear scalability without the migraine

Token Management

Page 15: Big Data Grows Up - A (re)introduction to Cassandra

The old way

● 1 token per node
● Assigned manually
● Adding nodes == reassignment of all tokens
● Node rebuild heavily taxes a few nodes

[Figure: cluster with no vnodes; a ring of six nodes, A through F]

Page 16: Big Data Grows Up - A (re)introduction to Cassandra

… enter Vnodes

● n tokens per node
● Assigned magically
● Adding nodes == painless
● Node rebuild distributed across many nodes

[Figure: cluster with vnodes; a ring of nodes A through N]

Page 17: Big Data Grows Up - A (re)introduction to Cassandra

Node rebuild without Vnodes

Page 18: Big Data Grows Up - A (re)introduction to Cassandra

Node rebuild with Vnodes
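
The difference between the two rebuild pictures can be sketched with a toy hash ring. This is an illustration only, not Cassandra's actual streaming logic: the `token`, `build_ring`, and `rebuild_sources` helpers are hypothetical, MD5 stands in for the partitioner, and the "source" for each range is assumed to be simply the next distinct node clockwise.

```python
import hashlib

def token(label: str) -> int:
    """Map a label to a ring position (toy stand-in for a partitioner)."""
    return int(hashlib.md5(label.encode()).hexdigest(), 16)

def build_ring(nodes, tokens_per_node):
    """Return the ring as a sorted list of (token, node) pairs."""
    return sorted(
        (token(f"{node}-{i}"), node)
        for node in nodes
        for i in range(tokens_per_node)
    )

def rebuild_sources(ring, failed):
    """Nodes that stream data to `failed` in this toy model: for every
    token the failed node owns, the next distinct node clockwise."""
    sources = set()
    size = len(ring)
    for idx, (_, owner) in enumerate(ring):
        if owner != failed:
            continue
        step = 1
        while ring[(idx + step) % size][1] == failed:
            step += 1
        sources.add(ring[(idx + step) % size][1])
    return sources

nodes = ["A", "B", "C", "D", "E", "F"]
single = rebuild_sources(build_ring(nodes, 1), "A")    # the old way
vnodes = rebuild_sources(build_ring(nodes, 64), "A")   # n tokens per node
print("1 token per node   ->", sorted(single))
print("64 tokens per node ->", sorted(vnodes))
```

With one token per node, the rebuild leans on a single neighbor; with many tokens per node, the same rebuild draws from several peers at once, which is the point of the two slides above.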

Page 19: Big Data Grows Up - A (re)introduction to Cassandra

because the JVM sometimes sucks

Going Off-heap

Page 20: Big Data Grows Up - A (re)introduction to Cassandra

Why go off-heap

● GC overhead
● JVM no good with big heap sizes
● GC overhead
● GC overhead
● GC overhead

Page 21: Big Data Grows Up - A (re)introduction to Cassandra

O(n) data structures

● Row cache
● Bloom filters
● Compression offsets
● Partition summary

… all these are moved off-heap

Page 22: Big Data Grows Up - A (re)introduction to Cassandra

New memory allocation

Native memory:

● Row cache
● Bloom filters
● Compression offsets
● Partition summary

JVM heap:

● Partition key cache

Page 23: Big Data Grows Up - A (re)introduction to Cassandra

Or, how to build a killer data store without a crappy interface

Death of a (Thrift) Salesman

Page 24: Big Data Grows Up - A (re)introduction to Cassandra

Reasons not to ditch Thrift

● Lots of client libraries still use it
● You finally got it installed
● You didn’t know there was another choice
● It sucks less than many alternatives

Page 25: Big Data Grows Up - A (re)introduction to Cassandra

… in spite of all those benefits, you really should ditch Thrift because:

● It requires your entire result set to fit into RAM on both client and server

● The native protocol is better, faster, and supports all the new features

● Thrift-based client libraries are always a step behind

● It’s going away eventually

Page 26: Big Data Grows Up - A (re)introduction to Cassandra

… and did I mention ...

It requires your entire result set to fit into RAM

on both client and server!!!

Page 27: Big Data Grows Up - A (re)introduction to Cassandra

Requesting too much data

Page 28: Big Data Grows Up - A (re)introduction to Cassandra

really catchy tag line here

Going Native

Page 29: Big Data Grows Up - A (re)introduction to Cassandra

Native protocol

● It’s binary, making it lighter weight
● It supports cursors (FTW!)
● It supports prepared statements
● Cluster awareness built-in
● Either synchronous or asynchronous ops
● Only supports CQL-based operations
● Can be used side-by-side with Thrift
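
Why cursors matter can be shown with a small paging sketch. Everything here is hypothetical: `ROWS` stands in for a server-side result set, `fetch_page` for one round trip, and the integer offset for the opaque paging state a real driver would carry. The idea is only that the client holds one page at a time instead of the whole result set.

```python
from typing import Iterator, List

ROWS = [f"row-{i}" for i in range(10)]  # hypothetical server-side result set
page_sizes = []                          # rows returned by each round trip

def fetch_page(start: int, page_size: int) -> List[str]:
    """Stand-in for one round trip: returns at most page_size rows."""
    page = ROWS[start:start + page_size]
    page_sizes.append(len(page))
    return page

def paged_rows(page_size: int) -> Iterator[str]:
    """Yield rows one page at a time until the server returns an empty page."""
    start = 0
    while True:
        page = fetch_page(start, page_size)
        if not page:
            return
        yield from page
        start += len(page)

rows = list(paged_rows(page_size=3))
print(len(rows), "rows fetched; largest page held in memory:", max(page_sizes))
```

The Thrift-style alternative is equivalent to a single `fetch_page(0, len(ROWS))`, which is exactly the "entire result set in RAM" problem from the earlier slides.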

Page 30: Big Data Grows Up - A (re)introduction to Cassandra

Native drivers

from DataStax:

● Java
● C#
● Python

… other community supported drivers available

Page 31: Big Data Grows Up - A (re)introduction to Cassandra

Native query example

import com.datastax.driver.core.Cluster

// Build the cluster and open a session before preparing statements
val cluster = Cluster.builder().addContactPoints(host1, host2, host3).build()
val session = cluster.connect()
val insert = session.prepare("INSERT INTO myKsp.myTable (myKey, col1, col2) VALUES (?,?,?)")
val select = session.prepare("SELECT * FROM myKsp.myTable WHERE myKey = ?")
session.execute(insert.bind(myKey, col1, col2))
val result = session.execute(select.bind(myKey))

Page 32: Big Data Grows Up - A (re)introduction to Cassandra

Or, how to make Cassandra more awesome while simultaneously irritating early adopters

Wait, was that SQL?!!

Page 33: Big Data Grows Up - A (re)introduction to Cassandra

Introducing CQL3

● Because the first two attempts sucked
● Stands for “Cassandra Query Language”
● Looks a heck of a lot like SQL
● … but isn’t
● Substantially lowers the learning curve
● … but also makes it easier to screw up
● An abstraction over the storage rows

Page 34: Big Data Grows Up - A (re)introduction to Cassandra

Storage rows

[default@unknown] create keyspace Library;
[default@unknown] use Library;
[default@Library] create column family Books
... with comparator=UTF8Type
... and key_validation_class=UTF8Type
... and default_validation_class=UTF8Type;
[default@Library] set Books['Patriot Games']['author'] = 'Tom Clancy';
[default@Library] set Books['Patriot Games']['year'] = '1987';
[default@Library] list Books;

RowKey: Patriot Games
=> (name=author, value=Tom Clancy, timestamp=1393102991499000)
=> (name=year, value=1987, timestamp=1393103015955000)

Page 35: Big Data Grows Up - A (re)introduction to Cassandra

Storage rows - composites

[default@Library] create column family Authors
... with key_validation_class=UTF8Type
... and comparator='CompositeType(LongType,UTF8Type,UTF8Type)'
... and default_validation_class=UTF8Type;
[default@Library] set Authors['Tom Clancy']['1987:Patriot Games:publisher'] = 'Putnam';
[default@Library] set Authors['Tom Clancy']['1987:Patriot Games:ISBN'] = '0-399-13241-4';
[default@Library] set Authors['Tom Clancy']['1993:Without Remorse:publisher'] = 'Putnam';
[default@Library] set Authors['Tom Clancy']['1993:Without Remorse:ISBN'] = '0-399-13825-0';
[default@Library] list Authors;

RowKey: Tom Clancy
=> (name=1987:Patriot Games:ISBN, value=0-399-13241-4, timestamp=1393104011458000)
=> (name=1987:Patriot Games:publisher, value=Putnam, timestamp=1393103948577000)
=> (name=1993:Without Remorse:ISBN, value=0-399-13825-0, timestamp=1393104109214000)
=> (name=1993:Without Remorse:publisher, value=Putnam, timestamp=1393104083773000)

Page 36: Big Data Grows Up - A (re)introduction to Cassandra

CQL - simple intro

cqlsh> CREATE KEYSPACE Library WITH REPLICATION = {'class':'SimpleStrategy', 'replication_factor':1};
cqlsh> use Library;
cqlsh:library> CREATE TABLE Books (
           ... title varchar,
           ... author varchar,
           ... year int,
           ... PRIMARY KEY (title)
           ... );
cqlsh:library> INSERT INTO Books (title, author, year) VALUES ('Patriot Games', 'Tom Clancy', 1987);
cqlsh:library> INSERT INTO Books (title, author, year) VALUES ('Without Remorse', 'Tom Clancy', 1993);

Page 37: Big Data Grows Up - A (re)introduction to Cassandra

CQL - simple intro

Storage rows:

Page 38: Big Data Grows Up - A (re)introduction to Cassandra

CQL - composite key

CREATE TABLE Authors (
  name varchar,
  year int,
  title varchar,
  publisher varchar,
  ISBN varchar,
  PRIMARY KEY (name, year, title)
)

Page 39: Big Data Grows Up - A (re)introduction to Cassandra

CQL - composite key

Storage rows:
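
As a rough sketch of what that abstraction does, the function below flattens CQL-style rows from the Authors table into storage-row cells named like the composite columns in the earlier Thrift example. `to_storage_rows` is a hypothetical helper, and the model is simplified on purpose: it ignores timestamps, comparator ordering, and any extra marker cells the real storage engine writes.

```python
def to_storage_rows(rows, partition_key, clustering_keys):
    """Flatten CQL-style rows into {row_key: {composite_column_name: value}}."""
    storage = {}
    for row in rows:
        cells = storage.setdefault(row[partition_key], {})
        # Clustering values become the prefix of every cell name in the row
        prefix = ":".join(str(row[col]) for col in clustering_keys)
        for column, value in row.items():
            if column == partition_key or column in clustering_keys:
                continue
            cells[f"{prefix}:{column}"] = value
    return storage

authors = [
    {"name": "Tom Clancy", "year": 1987, "title": "Patriot Games",
     "publisher": "Putnam", "ISBN": "0-399-13241-4"},
    {"name": "Tom Clancy", "year": 1993, "title": "Without Remorse",
     "publisher": "Putnam", "ISBN": "0-399-13825-0"},
]

storage = to_storage_rows(authors, "name", ["year", "title"])
for row_key, cells in storage.items():
    print("RowKey:", row_key)
    for name, value in cells.items():
        print(f"=> (name={name}, value={value})")
```

Two CQL rows collapse into one wide storage row keyed by the partition key, with the clustering columns folded into the cell names.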

Page 40: Big Data Grows Up - A (re)introduction to Cassandra

Keys and Filters

● Ad hoc queries are NOT supported
● Query by key
● Key must include all potential filter columns
● Must include partition key in filter
● Subsequent filters must be in order
● Only last filter can be a range
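
The filter rules above can be written down as a small checker. This is a sketch of the rules as stated on the slide, not of Cassandra's query planner: `check_filters` is a hypothetical function, filters are reduced to "eq" or "range", and secondary indexes are left out.

```python
def check_filters(partition_key, clustering_keys, filters):
    """filters maps column name -> "eq" or "range". Returns (ok, reason)."""
    if filters.get(partition_key) != "eq":
        return False, "partition key must be filtered by equality"
    key_columns = [partition_key] + list(clustering_keys)
    for col in filters:
        if col not in key_columns:
            return False, f"{col} is not part of the key"
    seen_gap = False
    seen_range = False
    for col in clustering_keys:
        if col not in filters:
            seen_gap = True          # later clustering columns may not be filtered
            continue
        if seen_gap:
            return False, f"{col} is filtered but an earlier key column is not"
        if seen_range:
            return False, f"{col} follows a range filter"
        if filters[col] == "range":
            seen_range = True
    return True, "ok"

# Authors table: PRIMARY KEY (name, year, title)
pk, ck = "name", ["year", "title"]
print(check_filters(pk, ck, {"name": "eq", "year": "range"}))
print(check_filters(pk, ck, {"name": "eq", "title": "eq"}))
print(check_filters(pk, ck, {"year": "eq"}))
```

Running candidate queries through a check like this while designing a table is a cheap way to confirm the key really includes all potential filter columns.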

Page 41: Big Data Grows Up - A (re)introduction to Cassandra

Example - Books table

CREATE TABLE Books (
  title varchar,
  author varchar,
  year int,
  PRIMARY KEY (title)
)

Page 42: Big Data Grows Up - A (re)introduction to Cassandra

Example - Books table

CREATE TABLE Books (
  title varchar,
  author varchar,
  year int,
  PRIMARY KEY (author, title)
)

Page 43: Big Data Grows Up - A (re)introduction to Cassandra

Example - Books table

CREATE TABLE Books (
  title varchar,
  author varchar,
  year int,
  PRIMARY KEY (author, year)
)

Page 44: Big Data Grows Up - A (re)introduction to Cassandra

Example - Books table

CREATE TABLE Books (
  title varchar,
  author varchar,
  year int,
  PRIMARY KEY (year, author)
)

Page 45: Big Data Grows Up - A (re)introduction to Cassandra

Secondary Indexes

● Allows query-by-value
● CREATE INDEX myIdx ON myTable (myCol)
● Works well on low cardinality fields
● Won’t scale for high cardinality fields
● Don’t overuse it -- not a quick fix for a bad data model

Page 46: Big Data Grows Up - A (re)introduction to Cassandra

Example - Books table

CREATE TABLE Books (
  title varchar,
  author varchar,
  year int,
  PRIMARY KEY (author)
);
CREATE INDEX Books_year ON Books(year);

Page 47: Big Data Grows Up - A (re)introduction to Cassandra

Composite Partition Keys

● PRIMARY KEY((year, author), title)
● Creates a more granular shard key
● Can be useful to make certain queries more efficient, or to better distribute data
● Updates sharing a partition key are atomic and isolated

Page 48: Big Data Grows Up - A (re)introduction to Cassandra

Example - Books table

CREATE TABLE Books (
  title varchar,
  author varchar,
  year int,
  PRIMARY KEY ((year, author), title)
)

Page 49: Big Data Grows Up - A (re)introduction to Cassandra

Example - Books table

CREATE TABLE Books (
  title varchar,
  author varchar,
  year int,
  PRIMARY KEY (year, author, title)
)
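
The difference between those two keys is which columns decide the partition. A minimal sketch, assuming a hypothetical `partitions` helper that groups rows by their partition columns; the Stephen King row is added here for illustration and is not in the original slides.

```python
from collections import defaultdict

def partitions(rows, partition_columns):
    """Group book titles by the tuple of partition-key column values."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in partition_columns)
        groups[key].append(row["title"])
    return dict(groups)

books = [
    {"title": "Patriot Games", "author": "Tom Clancy", "year": 1987},
    {"title": "Without Remorse", "author": "Tom Clancy", "year": 1993},
    {"title": "Misery", "author": "Stephen King", "year": 1987},  # illustrative extra row
]

by_year = partitions(books, ["year"])                    # PRIMARY KEY (year, author, title)
by_year_author = partitions(books, ["year", "author"])   # PRIMARY KEY ((year, author), title)
print(len(by_year), "partitions:", by_year)
print(len(by_year_author), "partitions:", by_year_author)
```

With `(year, author, title)` every book from a given year lands in one partition; with `((year, author), title)` each year/author pair gets its own, which spreads data more evenly but means a query must supply both values.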

Page 50: Big Data Grows Up - A (re)introduction to Cassandra

denormalization done well

Collections

Page 51: Big Data Grows Up - A (re)introduction to Cassandra

Supported types

● Sets - ordered naturally
● Lists - ordered by index
● Maps - key/value pairs

Page 52: Big Data Grows Up - A (re)introduction to Cassandra

Caveats

● Max 64k items in a collection
● Max 64k size per item
● Collections are read in their entirety, so keep them small

Page 53: Big Data Grows Up - A (re)introduction to Cassandra

Sets

Page 54: Big Data Grows Up - A (re)introduction to Cassandra

Sets

Set name

Item value

Page 55: Big Data Grows Up - A (re)introduction to Cassandra

Lists

Page 56: Big Data Grows Up - A (re)introduction to Cassandra

Lists

List name Ordering meta data

List item value

Page 57: Big Data Grows Up - A (re)introduction to Cassandra

Maps

Page 58: Big Data Grows Up - A (re)introduction to Cassandra

Maps

Map name

Key Value

Page 59: Big Data Grows Up - A (re)introduction to Cassandra

(tracing on)

TRON

Page 60: Big Data Grows Up - A (re)introduction to Cassandra

Using tracing

● In cqlsh, “tracing on”
● … enjoy!

Page 61: Big Data Grows Up - A (re)introduction to Cassandra

Example

1393126200000

Page 62: Big Data Grows Up - A (re)introduction to Cassandra

Antipattern

CREATE TABLE WorkQueue (
  name varchar,
  time bigint,
  workItem varchar,
  PRIMARY KEY (name, time)
);

… do a bunch of inserts ...

SELECT * FROM WorkQueue WHERE name='ToDo' ORDER BY time ASC;
DELETE FROM WorkQueue WHERE name='ToDo' AND time=[some_time];

Page 63: Big Data Grows Up - A (re)introduction to Cassandra

Antipattern - enqueue

Page 64: Big Data Grows Up - A (re)introduction to Cassandra

Antipattern - dequeue

Page 65: Big Data Grows Up - A (re)introduction to Cassandra

Antipattern

20k tombstones!! 13ms of 17ms spent reading tombstones
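
Why the queue pattern degrades can be sketched with a toy partition in which a delete leaves a marker behind instead of removing the cell. `QueuePartition` is a hypothetical class, not Cassandra's storage engine; it assumes tombstones stay in place for the whole run, as they would until they become eligible for removal at compaction.

```python
class QueuePartition:
    """One partition of the WorkQueue table, with cells ordered by time."""

    def __init__(self):
        self.cells = []  # (time, work_item); work_item None marks a tombstone

    def enqueue(self, time, item):
        self.cells.append((time, item))
        self.cells.sort(key=lambda cell: cell[0])

    def dequeue(self):
        """Read the oldest live item, then delete it.
        Returns (item, tombstones_scanned_before_finding_it)."""
        scanned = 0
        for idx, (time, item) in enumerate(self.cells):
            if item is None:
                scanned += 1          # the read still has to step over this
                continue
            self.cells[idx] = (time, None)  # the delete writes a tombstone
            return item, scanned
        return None, scanned

queue = QueuePartition()
for t in range(100):
    queue.enqueue(t, f"job-{t}")

scans = [queue.dequeue()[1] for _ in range(100)]
print("tombstones scanned by first dequeue:", scans[0])
print("tombstones scanned by last dequeue: ", scans[-1])
```

Every dequeue reads from the head of the partition, so each one steps over all the tombstones left by the dequeues before it; the cost of a read grows with the amount of work already done.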

Page 66: Big Data Grows Up - A (re)introduction to Cassandra

(no it’s not ACID)

Lightweight Transactions

Page 67: Big Data Grows Up - A (re)introduction to Cassandra

Primer

● Supports basic Compare-and-Set ops
● Provides linearizable consistency
● … aka serial isolation
● Uses “Paxos light” under the hood
● Still expensive -- four round trips!
● For most cases quorum reads/writes will be sufficient

Page 68: Big Data Grows Up - A (re)introduction to Cassandra

Usage

INSERT INTO Users (login, name)
VALUES ('rs_atl', 'Robbie Strickland')
IF NOT EXISTS;

UPDATE Users
SET password='super_secure_password'
WHERE login='rs_atl'
IF reset_token='some_reset_token';
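
The semantics of those two statements can be sketched as compare-and-set on an in-memory table. This shows only what the caller observes, not how Cassandra achieves it: the hypothetical `Users` class uses a local lock where Cassandra runs Paxos across replicas, and the boolean return value plays the role of the applied flag in the result.

```python
import threading

class Users:
    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()  # stand-in for Paxos; single process only

    def insert_if_not_exists(self, login, **columns):
        """INSERT ... IF NOT EXISTS: applied only when the row is absent."""
        with self._lock:
            if login in self._rows:
                return False
            self._rows[login] = dict(columns)
            return True

    def update_if(self, login, expected, **changes):
        """UPDATE ... IF col = value: applied only when every condition holds."""
        with self._lock:
            row = self._rows.get(login)
            if row is None or any(row.get(k) != v for k, v in expected.items()):
                return False
            row.update(changes)
            return True

    def get(self, login):
        return dict(self._rows.get(login, {}))

users = Users()
first = users.insert_if_not_exists("rs_atl", name="Robbie Strickland",
                                   reset_token="some_reset_token")
second = users.insert_if_not_exists("rs_atl", name="Someone Else")
wrong = users.update_if("rs_atl", {"reset_token": "bad_token"},
                        password="hacked")
right = users.update_if("rs_atl", {"reset_token": "some_reset_token"},
                        password="super_secure_password")
print(first, second, wrong, right)
```

The second insert and the update with the wrong token are rejected without changing the row, which is the check-then-write guarantee that plain quorum writes cannot give.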

Page 69: Big Data Grows Up - A (re)introduction to Cassandra

Other cool stuff

● Triggers (experimental)
● Batching multiple requests
● Leveled compaction
● Configuration via CQL
● Gossip-based rack/DC configuration

Page 70: Big Data Grows Up - A (re)introduction to Cassandra

Thank you!

Robbie Strickland
Software Development Manager
The Weather Channel

[email protected]
@dont_use_twitter