cassandra explained

50
Cassandra Explained Disruptive Code September 22, 2010 Eric Evans [email protected] @jericevans http://blog.sym-link.com

Upload: eric-evans

Post on 15-Jan-2015

119.246 views

Category:

Technology


3 download

DESCRIPTION

 

TRANSCRIPT

Page 1: Cassandra Explained

Cassandra Explained

Disruptive CodeSeptember 22, 2010

Eric [email protected]

@jericevanshttp://blog.sym-link.com

Page 2: Cassandra Explained

Outline

● Background● Description● API(s)● Code Samples

Page 3: Cassandra Explained

Background

Page 4: Cassandra Explained

The Digital Universe

Page 5: Cassandra Explained

Consolidation

Page 6: Cassandra Explained

Old Guard

Page 7: Cassandra Explained

Vertical Scaling Sucks

Page 8: Cassandra Explained

Influential Papers

● BigTable● Strong consistency● Sparse map data model● GFS, Chubby, et al

● Dynamo● O(1) distributed hash table (DHT)● BASE (aka eventual consistency)● Client tunable consistency/availability

Page 9: Cassandra Explained

NoSQL

● HBase● MongoDB● Riak● Voldemort● Neo4J● Cassandra

● Hypertable● HyperGraphDB● Memcached● Tokyo Cabinet● Redis● CouchDB

Page 10: Cassandra Explained

NoSQL Big data

● HBase● MongoDB● Riak● Voldemort● Neo4J● Cassandra

● Hypertable● HyperGraphDB● Memcached● Tokyo Cabinet● Redis● CouchDB

Page 11: Cassandra Explained

Bigtable / Dynamo

Bigtable● HBase● Hypertable

Dynamo● Riak● Voldemort

Cassandra ??

Page 12: Cassandra Explained

Dynamo-Bigtable Lovechild

Page 13: Cassandra Explained

CAP Theorem “Pick Two”

● CP● Bigtable● Hypertable● HBase

● AP● Dynamo● Voldemort● Cassandra

Page 14: Cassandra Explained

CAP Theorem “Pick Two”

● Consistency● Availability● Partition Tolerance

Page 15: Cassandra Explained

Description

Page 16: Cassandra Explained

Properties

● Symmetric● No single point of failure● Linearly scalable● Ease of administration

● Flexible partitioning, replica placement● Automated provisioning● High availability (eventual consistency)

Page 17: Cassandra Explained

P2P Routing

Page 18: Cassandra Explained

P2P Routing

Page 19: Cassandra Explained

Partitioning

● Random● 128bit namespace, (MD5)● Good distribution

● Order Preserving● Tokens determine namespace● Natural order (lexicographical)● Range / cover queries

● Yours ??

Page 20: Cassandra Explained

Replica Placement

● SimpleSnitch● Default● N-1 successive nodes

● RackInferringSnitch● Infers DC/rack from IP

● PropertyFileSnitch● Configured w/ a properties file

Page 21: Cassandra Explained

Bootstrap

Page 22: Cassandra Explained

Bootstrap

Page 23: Cassandra Explained

Bootstrap

Page 24: Cassandra Explained

Choosing Consistency

Read

Level Description

ZERO N/A

ANY N/A

ONE 1 replica

QUORUM (N / 2) +1

ALL All replicas

Write

Level Description

ZERO Hail Mary

ANY 1 replica (HH)

ONE 1 replica

QUORUM (N / 2) +1

ALL All replicas

R + W > N

Page 25: Cassandra Explained

Quorum ((N/2) + 1)

Page 26: Cassandra Explained

Quorum ((N/2) + 1)

Page 27: Cassandra Explained

Data Model

Page 28: Cassandra Explained

Overview

● Keyspace● Uppermost namespace● Typically one per application

● ColumnFamily● Associates records of a similar kind● Record-level Atomicity● Indexed

● Column● Basic unit of storage

Page 29: Cassandra Explained

Sparse Table

Page 30: Cassandra Explained

Column

● name● byte[]

● Queried against (predicates)● Determines sort order

● value● byte[]

● Opaque to Cassandra

● timestamp● long

● Conflict resolution (Last Write Wins)

Page 31: Cassandra Explained

Column Comparators

● Bytes● UTF8● TimeUUID● Long● LexicalUUID● Composite (third-party)

http://github.com/edanuff/CassandraCompositeType

Page 32: Cassandra Explained

API

Page 33: Cassandra Explained

RPC

THRIFT

Page 34: Cassandra Explained

RPC

THRIFT

Page 35: Cassandra Explained

RPC

AVRO

Page 36: Cassandra Explained

RPC

AVRO

Page 37: Cassandra Explained

Idiomatic Client Libraries

● Pelops, Hector (Java)● Pycassa (Python)● Cassandra (Ruby)● Others …

http://wiki.apache.org/cassandra/ClientOptions

Page 38: Cassandra Explained

Code Samples

Page 39: Cassandra Explained

Pycassa – Python Client API

# creating a connectionfrom pycassa import connecthosts = ['host1:9160', 'host2:9160']client = connect('Keyspace1', hosts)

# creating a column family instancefrom pycassa import ColumnFamilycf = ColumnFamily(client, “Standard1”)

# reading/writing a columncf.insert('key1', {'name': 'value'})print cf.get('key1')['name']

1. http://github.com/vomjom/pycassa

Page 40: Cassandra Explained

Address Book – Setup

# conf/cassandra.yamlkeyspaces: - name: AddressBook column_families: - name: Addresses compare_with: BytesType rows_cached: 10000 keys_cached: 50 comment: 'No comment'

Page 41: Cassandra Explained

Adding an entry

key = uuid()

columns = { 'first': 'Eric', 'last': 'Evans', 'email': '[email protected]', 'city': 'San Antonio', 'zip': 78250}

addresses.insert(key, columns)

Page 42: Cassandra Explained

Fetching a record

# fetching the record by keyrecord = addresses.get(key)

# accessing columns by namezipcode = record['zip']city = record['city']

Page 43: Cassandra Explained

Indexing (manual)

# conf/cassandra.yamlkeyspaces: - name: AddressBook column_families: - name: Addresses compare_with: BytesType rows_cached: 10000 keys_cached: 50 comment: 'No comment' - name: ByCity compare_with: UTF8Type

Page 44: Cassandra Explained

Updating the index

key = uuid()

columns = { 'first': 'Eric', 'last': 'Evans', 'email': '[email protected]', 'city': 'San Antonio', 'zip': 78250}

addresses.insert(key, columns)byCity.insert('San Antonio', {key: ''})

Page 45: Cassandra Explained

Indexing (auto)

# conf/cassandra.yamlkeyspaces: - name: AddressBook column_families: - name: Addresses compare_with: BytesType rows_cached: 10000 keys_cached: 50 comment: 'No comment' column_metadata: - name: city index_type: KEYS

Page 46: Cassandra Explained

Querying the Index

from pycassa.index import create_index_expressionfrom pycassa.index import create_index_clause

e = create_index_expression('city', 'San Antonio')clause = create_index_clause([e])

results = address.get_indexed_slices(clause)

for (key, columns) in results.items(): print “%(first)s %(last)s” % columns

Page 47: Cassandra Explained

Timeseries

# conf/cassandra.yamlkeyspaces: - name: Sites column_families: - name: Stats compare_with: LongType

Page 48: Cassandra Explained

Logging values

# time as long (milliseconds since epoch)tstmp = long(time() * 1e6)

stats.insert('org.apache', {tstmp: value})

Page 49: Cassandra Explained

Slicing

begin = long(start * 1e6)

stats.get_range('org.apache', column_start=begin)

end = long((start + 86400) * 1e6)

stats.get_range(start='org.apache', finish='org.debian', column_start=begin, column_finish=end)

Page 50: Cassandra Explained

Questions?