Learning Cassandra
Posted on 15-Jan-2015
Learning Cassandra
Dave Gardner (@davegardnerisme)
What I’m going to cover
• How to NoSQL
• Cassandra basics (Dynamo and Big Table)
• How to use the data model in real life
How to NoSQL
1. Find a data store that doesn’t use SQL
2. Anything
3. Cram all the things into it
4. Triumphantly blog this success
5. Complain a month later when it bursts into flames
http://www.slideshare.net/rbranson/how-do-i-cassandra/4
Choosing NoSQL
“NoSQL DBs trade off traditional features to better support new and emerging use cases”
http://www.slideshare.net/argv0/riak-use-cases-dissecting-the-solutions-to-hard-problems
Choosing Cassandra: Tradeoffs
More widely used, tested and documented software
(MySQL: first open-source release in 1998)
For a relatively immature product
(Cassandra: first open-sourced in 2008)
Choosing Cassandra: Tradeoffs
Ad-hoc querying
(SQL: JOIN, GROUP BY, HAVING, ORDER BY)
For a rich data model with limited ad-hoc querying ability
(Cassandra makes you denormalise)
Choosing NoSQL
“they say … I can’t decide between this project and this project even though they look nothing like each other. And the fact that you can’t decide indicates that you don’t actually have a problem that requires them.”
Benjamin Black – NoSQL Tapes (at 30:15)
http://nosqltapes.com/video/benjamin-black-on-nosql-cloud-computing-and-fast_ip
What do we get in return?
Proven horizontal scalability
Cassandra scales reads and writes linearly as new nodes are added
Netflix benchmark: linear scaling
http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
What do we get in return?
High availability
Cassandra is fault-resistant with tunable consistency levels
What do we get in return?
Low latency, solid performance
Cassandra has very good write performance
http://blog.cubrid.org/dev-platform/nosql-benchmarking/
Performance benchmark *
* add a pinch of salt
What do we get in return?
Operational simplicity
Homogenous cluster, no “master” node, no SPOF
What do we get in return?
Rich data model
Cassandra is more than simple key-value – columns, composites, counters, secondary indexes
How to NoSQL version 2
Learn about each solution
• What tradeoffs are you making?
• How is it designed?
• What algorithms does it use?
http://www.alberton.info/nosql_databases_what_when_why_phpuk2011.html
Amazon Dynamo + Google Big Table
Consistent hashing
Vector clocks *
Gossip protocol
Hinted handoff
Read repair
http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
Columnar data model
SSTable storage
Append-only commit log
Memtable
Compaction
http://labs.google.com/papers/bigtable-osdi06.pdf
* not in Cassandra
The dynamo paper
[Diagram: a ring of six nodes (#1 to #6) with a client; tokens are integers from 0 to 2^127]
The dynamo paper
[Diagram: the client sends its request to one node, which acts as coordinator; consistent hashing determines which nodes on the ring own the data]
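The ring above can be sketched in code. This is a toy illustration of consistent hashing with MD5 tokens, in the spirit of Cassandra’s RandomPartitioner, not the real implementation; the node names and key are made up.

```python
import hashlib
from bisect import bisect_right

# Token space is the ring of integers 0 .. 2^127, as on the slide.
RING_SIZE = 2 ** 127

def token(key: str) -> int:
    """Hash a row key to a position on the ring (MD5, like RandomPartitioner)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % RING_SIZE

# Six nodes, each owning one token on the ring (made-up names).
node_tokens = sorted(token(f"node-{i}") for i in range(6))

def owner(key: str) -> int:
    """A key belongs to the first node whose token is >= the key's token,
    wrapping around the ring."""
    t = token(key)
    i = bisect_right(node_tokens, t) % len(node_tokens)
    return node_tokens[i]

print(owner("f97be9cc-5255-4578-8813-76701c0945bd"))
```

Because only the hash position matters, adding a node moves a bounded slice of keys rather than rehashing everything.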
Consistency levels
How many replicas must respond to declare success?
Consistency levels: read operations
ONE: first response
QUORUM: N/2 + 1 replicas
LOCAL_QUORUM: N/2 + 1 replicas in local data centre
EACH_QUORUM: N/2 + 1 replicas in each data centre
ALL: all replicas
http://wiki.apache.org/cassandra/API#Read
Consistency levels: write operations
ANY: one node, including hinted handoff
ONE: one node
QUORUM: N/2 + 1 replicas
LOCAL_QUORUM: N/2 + 1 replicas in local data centre
EACH_QUORUM: N/2 + 1 replicas in each data centre
ALL: all replicas
http://wiki.apache.org/cassandra/API#Write
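The N/2 + 1 rule above is just a simple majority. A minimal sketch (an illustration, not Cassandra source code) of how many replica acknowledgements each level waits for:

```python
# Illustrative only: replica acks required per consistency level,
# for a replication factor of n.
def responses_required(level: str, n: int) -> int:
    quorum = n // 2 + 1  # simple majority of replicas
    return {"ONE": 1, "QUORUM": quorum, "ALL": n}[level]

# With RF = 3: ONE waits for 1 ack, QUORUM for 2, ALL for 3.
for level in ("ONE", "QUORUM", "ALL"):
    print(level, responses_required(level, 3))
```

With RF = 3, QUORUM reads plus QUORUM writes (2 + 2 > 3) guarantee the read sees the latest write.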
The dynamo paper
RF = 3, CL = ONE
[Diagram: the coordinator sends the write to three replicas and acknowledges the client after the first response]
The dynamo paper
RF = 3, CL = QUORUM
[Diagram: the coordinator waits for two of the three replicas to respond before acknowledging the client]
The dynamo paper
RF = 3, CL = ONE
[Diagram: one replica is down, so the coordinator stores a hint and replays it when the node returns (hinted handoff)]
The dynamo paper
RF = 3, CL = ONE
[Diagram: on read, the coordinator detects stale replicas and updates them in the background (read repair)]
The big table paper
• Sparse “columnar” data model
• SSTable disk storage
• Append-only commit log
• Memtable (buffer and sort)
• Immutable SSTable files
• Compaction
http://labs.google.com/papers/bigtable-osdi06.pdf
http://www.slideshare.net/geminimobile/bigtable-4820829
The big table paper
[Diagram: a column is a name and value, plus a timestamp]
The big table paper
[Diagram: many columns side by side; we can have millions of columns (theoretically up to 2 billion)]
The big table paper
[Diagram: a row is a row key plus its columns]
The big table paper
Column Family
[Diagram: many rows, each a row key with its columns; we can have billions of rows]
The big table paper
[Diagram: a write is appended to the commit log on disk and to the memtable in memory; the memtable is flushed to an immutable SSTable on a time/size trigger]
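The write path above can be sketched as a toy log-structured store (an illustration of the Big Table design, not Cassandra’s code; the class name and flush threshold are made up):

```python
# Toy sketch of the Big Table write path: appends go to a commit log
# and an in-memory memtable; a full memtable is flushed to an
# immutable, sorted SSTable.
class ToyStore:
    def __init__(self, flush_at=4):
        self.commit_log = []   # append-only (stands in for the disk log)
        self.memtable = {}     # in-memory buffer
        self.sstables = []     # list of immutable sorted tables
        self.flush_at = flush_at

    def write(self, key, value):
        self.commit_log.append((key, value))  # durability first
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_at:
            self._flush()

    def _flush(self):
        # memtable contents are written out sorted, then the buffer is emptied
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

store = ToyStore(flush_at=2)
store.write("badger", 1)
store.write("zebra", 2)   # second write triggers a flush
print(store.sstables)     # [[('badger', 1), ('zebra', 2)]]
```

Because SSTables are immutable, reads may consult several of them; compaction later merges them back into fewer files.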
Data model basics: conflict resolution
Per-column timestamp-based conflict resolution
http://cassandra.apache.org/
{ column: foo, value: bar, timestamp: 1000}
{ column: foo, value: zing, timestamp: 1001}
(zing wins: it has the bigger timestamp)
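The rule is last-write-wins per column. A minimal sketch of the resolution step (illustrative, not Cassandra’s implementation):

```python
# Per-column, timestamp-based conflict resolution:
# of two versions of the same column, the bigger timestamp wins.
def resolve(a, b):
    return a if a["timestamp"] >= b["timestamp"] else b

v1 = {"column": "foo", "value": "bar", "timestamp": 1000}
v2 = {"column": "foo", "value": "zing", "timestamp": 1001}
print(resolve(v1, v2)["value"])  # zing
```

Timestamps are supplied by clients, which is why clock skew between writers can silently pick the “wrong” winner.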
Data model basics: column ordering
Columns ordered at time of writing, according to Column Family schema
http://cassandra.apache.org/
{ column: zebra, value: foo, timestamp: 1000}
{ column: badger, value: foo, timestamp: 1001}
Data model basics: column ordering
Columns ordered at time of writing, according to Column Family schema
http://cassandra.apache.org/
{ badger: foo, zebra: foo}
with AsciiType column schema
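The reordering above can be sketched in a few lines (a toy model: a Python dict sorted by key stands in for an AsciiType-ordered row):

```python
# Columns within a row are kept sorted by column name; with an
# AsciiType comparator that is plain byte/ASCII ordering.
row = {}
row["zebra"] = "foo"
row["badger"] = "foo"
print(dict(sorted(row.items())))  # {'badger': 'foo', 'zebra': 'foo'}
```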
Key point
Each “query” can be answered from a single slice of disk
(once compaction has finished)
Data modeling – 1000ft introduction
• Start from your queries and work backwards
• Denormalise in the application (store data more than once)
http://www.slideshare.net/mattdennis/cassandra-data-modeling
http://blip.tv/datastax/data-modeling-workshop-5496906
Pattern 1: not using the value
Storing that user X is in bucket Y
Row key: f97be9cc-5255-457…
Column name: foo
Value: 1
https://github.com/davegardnerisme/we-have-your-kidneys/blob/master/www/add.php#L53-58
we don’t really care about this
Pattern 1: not using the value
Q: is user X in bucket foo?
f97be9cc-5255-4578-8813-76701c0945bd → bar: 1, foo: 1
06a6f1b0-fcf2-41d9-8949-fe2d416bde8e → baz: 1, zoo: 1
503778bc-246f-4041-ac5a-fd944176b26d → aaa: 1
A: single column fetch
Pattern 1: not using the value
Q: which buckets is user X in?
f97be9cc-5255-4578-8813-76701c0945bd → bar: 1, foo: 1
06a6f1b0-fcf2-41d9-8949-fe2d416bde8e → baz: 1, zoo: 1
503778bc-246f-4041-ac5a-fd944176b26d → aaa: 1
A: column slice fetch
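Modelled as plain data structures, pattern 1 looks like this (a toy in-memory stand-in for the column family, with the slide’s example UUIDs):

```python
# Wide-row model: row key = user UUID, column name = bucket,
# value unused (set to 1).
users = {
    "f97be9cc-5255-4578-8813-76701c0945bd": {"bar": 1, "foo": 1},
    "06a6f1b0-fcf2-41d9-8949-fe2d416bde8e": {"baz": 1, "zoo": 1},
}

user = "f97be9cc-5255-4578-8813-76701c0945bd"

# Q: is user X in bucket foo?  -> single column fetch
print("foo" in users[user])    # True

# Q: which buckets is user X in?  -> column slice fetch
print(sorted(users[user]))     # ['bar', 'foo']
```

Both questions are answered from one row, which is exactly what makes the reads cheap.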
Pattern 1: not using the value
We could also use expiring columns to automatically delete columns N seconds after insertion
UPDATE users USING TTL = 3600
SET 'foo' = 1
WHERE KEY = 'f97be9cc-5255-4578-8813-76701c0945bd'
Pattern 2: counters
Real-time analytics to count clicks/impressions of ads in hourly buckets
Row key: 1
Column name: 2011103015-click
Value: 34
https://github.com/davegardnerisme/we-have-your-kidneys/blob/master/www/adClick.php
Pattern 2: counters
Increment by 1 using CQL
UPDATE ads
SET '2011103015-impression' = '2011103015-impression' + 1
WHERE KEY = '1'
Pattern 2: counters
Q: how many clicks/impressions for ad 1 over a time range?
Row key: 1
2011103015-click: 1
2011103015-impression: 3434
2011103016-click: 12
2011103016-impression: 5411
2011103017-click: 2
2011103017-impression: 345
A: column slice fetch, between column X and Y
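Because column names are ordered, the range query maps onto a contiguous slice. A toy sketch of that slice (dicts stand in for the counter row; the helper name is made up):

```python
# One row per ad, one counter column per hour and event type.
ad_counters = {
    "2011103015-click": 1, "2011103015-impression": 3434,
    "2011103016-click": 12, "2011103016-impression": 5411,
    "2011103017-click": 2, "2011103017-impression": 345,
}

def slice_between(row, start, end):
    """Columns are name-ordered, so a time range is a contiguous slice."""
    return {k: v for k, v in sorted(row.items()) if start <= k <= end}

# All counters for hours 2011103015 and 2011103016
# ('~' sorts after '-' in ASCII, closing the range).
print(slice_between(ad_counters, "2011103015", "2011103016~"))
```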
Pattern 3: time series
Store canonical reference of impressions and clicks
Row key: 20111030
Column name: <time UUID>
Value: {json}
http://rubyscale.com/2011/basic-time-series-with-cassandra/
Cassandra can order columns by time
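Generating such a column name is straightforward: a version-1 (time-based) UUID encodes its creation timestamp, so ordering by column name orders events chronologically. A sketch (the row key and JSON payload are the slide’s example, not a fixed schema):

```python
import uuid

# Pattern 3: one row per day, one column per event.
row_key = "20111030"
column_name = uuid.uuid1()  # time UUID: version-1, encodes a timestamp
value = '{"ad": 1, "event": "click"}'

print(row_key, column_name, value)
```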
Pattern 4: object properties as columns
Store user properties such as name, email, etc.
Row key: f97be9cc-5255-457…
Column name: name
Value: Bob Foo-Bar
http://www.wehaveyourkidneys.com/adPerformance.php?ad=1
Anti-pattern 1: read-before-write
Avoid reading a value back just to modify and rewrite it; instead store properties as independent columns and mutate them individually
(see pattern 4)
Anti-pattern 2: super columns
Friends don’t let friends use super columns.
http://rubyscale.com/2010/beware-the-supercolumn-its-a-trap-for-the-unwary/
Anti-pattern 3: OPP
The Order Preserving Partitioner unbalances your load and makes your life harder
http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/
Recap: Data modeling
• Think about the queries, work backwards
• Don’t overuse single rows; try to spread the load
• Don’t use super columns
• Ask on IRC! #cassandra
There’s more: Brisk
Integrated Hadoop distribution (without HDFS installed). Run Hive and Pig queries directly against Cassandra
DataStax offer this functionality in their “Enterprise” product
http://www.datastax.com/products/enterprise
Hive: SQL-like interface to Hadoop
CREATE EXTERNAL TABLE tempUsers (userUuid string, segmentId string, value string)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES (
  "cassandra.columns.mapping" = ":key,:column,:value",
  "cassandra.cf.name" = "users"
);
SELECT segmentId, count(1) AS total
FROM tempUsers
GROUP BY segmentId
ORDER BY total DESC;
In conclusion
Cassandra is founded on sound design principles
In conclusion
The data model is incredibly powerful
In conclusion
CQL and a new breed of clients are making it easier to use
In conclusion
Hadoop integration means we can analyse data directly from a Cassandra cluster
In conclusion
There is a strong community and multiple companies offering professional support
Thanks
Learn more about Cassandra
meetup.com/Cassandra-London
Sample ad-targeting project on GitHub
https://github.com/davegardnerisme/we-have-your-kidneys
Watch videos from Cassandra SF 2011
http://www.datastax.com/events/cassandrasf2011/presentations
Looking for a job?