Introduction to NoSQL and Cassandra
Intro to NoSQL, Cassandra and Hector that I gave at Globant Laminar in Buenos Aires, Argentina, on Dec 13th, 2012.
About me
Present: RelateIQ (Data Processing and Scalability)
Hector committer
Past: DataStax (The Cassandra Company)
Cassandra/Hadoop distribution (former Brisk)
Cassandra FS
CQL connection pool
Cassandra contributions
Trends: “NoSQL”
2011
2012
What is “NoSQL” ?
Systems able to store and retrieve large quantities of data with little or no information about the relationships between the data items.
Generally they don't have a SQL-like language for data manipulation, and their schema is more relaxed than that of traditional RDBMS systems.
Full ACID is often not guaranteed.
Brewer's CAP theorem
Consistency: all replicas agree on the same value
Availability: always get an answer from a replica
Partition Tolerance: the system works even if replicas can't talk
You can have 2 of these
CAP Classification
[Diagram: systems classified by Consistency, Availability and Partition Tolerance]
Types
- Relational
- Key-Value stores
- Columnar (column-oriented)
- Graph databases
- Document
What's eventual consistency?
It is a promise that eventually, in the absence of new writes, all replicas that are responsible for a data item will agree on the same version.
How eventual is eventual?
- Write to 1 replica and read from 1 replica of a total of 3
- Write to 2 replicas and read from 2 replicas of a total of 3
Why is it good?
Because, by contacting fewer replicas, read and write operations complete more quickly, lowering latency.
Cassandra is a distributed, fault-tolerant, scalable, column-oriented data store with tunable consistency.
Cassandra has C A P
But C is tunable
What is Apache Cassandra?
Key Concepts
Multi-Master, Multi-DC
Linearly scalable
Integrated Caching
Performs well with Larger-than-memory Datasets
Tunable consistency
Idempotent (client clock)
Schema Optional
No ACID transactions, No Locking
Generally complements another system or systems
(Not intended to be one-size-fits-all)
You should always use the right tool for the right job
Speaking Cassandra
Data Model
“4-Dimensional Hash Table”
A Keyspace contains a collection of Column Families
(Controls replication)
A Column Family contains Rows
A Row has a key, and each row has columns
(No need to define the columns beforehand)
Each column has a name, a value and a timestamp
(TTL is optional)
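The "4-dimensional hash table" description above can be pictured as nested maps. The following is only a sketch of that mental model, not Cassandra's storage code; the class and method names (`DataModelSketch`, `put`, `get`) are hypothetical, and TTL is left out.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical illustration of the data model as nested maps.
final class DataModelSketch {
    // A column carries a value and a timestamp; its name is the key of the innermost map.
    static final class Column {
        final String value;
        final long timestamp;

        Column(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    // keyspace -> column family -> row key -> column name (kept sorted) -> column
    final Map<String, Map<String, Map<String, SortedMap<String, Column>>>> keyspaces = new HashMap<>();

    // Columns are created on the fly; nothing has to be declared beforehand.
    void put(String keyspace, String columnFamily, String rowKey,
             String columnName, String value, long timestamp) {
        keyspaces.computeIfAbsent(keyspace, k -> new HashMap<>())
                 .computeIfAbsent(columnFamily, k -> new HashMap<>())
                 .computeIfAbsent(rowKey, k -> new TreeMap<>())
                 .put(columnName, new Column(value, timestamp));
    }

    // Returns the column value, or null if any level of the lookup is missing.
    String get(String keyspace, String columnFamily, String rowKey, String columnName) {
        Map<String, Map<String, SortedMap<String, Column>>> families = keyspaces.get(keyspace);
        if (families == null) return null;
        Map<String, SortedMap<String, Column>> rows = families.get(columnFamily);
        if (rows == null) return null;
        SortedMap<String, Column> columns = rows.get(rowKey);
        if (columns == null) return null;
        Column column = columns.get(columnName);
        return column == null ? null : column.value;
    }
}
```

The innermost map is sorted by column name, which mirrors the idea that columns within a row are kept in comparator order.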
Data Model – (RDBMS)
Keyspace (Schema)
Column Family (CF) (table)
Row (row)
Column (column*) → may not be present in all rows
Data Model – Column Family
Static Column Family
- Model my object data
Dynamic Column Family
- Precalculated / Prematerialized query results
Nothing stopping you from mixing them!
Data Model – Static Column Family
Data Model – Dynamic CF
stats for a specific date
Data Model – Dynamic CF
- Timeline of tweets by a user
- Timeline of tweets by all of the people a user is following
- List of comments sorted by score
- List of friends grouped by state
- Metrics for a time bucket
...
Let's store “foo”
...
But if that node is down?
...
Let's store “foo” in 3 nodes.
This is the Replication Factor (N)
...
Now we need to know which nodes the key was written to, so we can read it later
...
The Initial Token specifies the upper value of the key range each node is responsible for
#1 <= 'd'
#2 <= 'k'
#3 <= 'p'
#4 <= 'u'
#5 <= 'z'
For example, node #2 is responsible for keys 'e' through 'k'
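The token ranges above can be looked up with a sorted map. This is a minimal sketch, assuming plain string comparison of keys and tokens; `TokenRingSketch`, `addNode` and `ownerOf` are hypothetical names, not Cassandra APIs.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical ring: each node owns the keys up to and including its token.
final class TokenRingSketch {
    // token (inclusive upper bound of the range) -> node name
    private final TreeMap<String, String> ring = new TreeMap<>();

    void addNode(String token, String node) {
        ring.put(token, node);
    }

    // The owner is the node with the smallest token >= key.
    // Keys beyond the largest token wrap around to the first node.
    // Assumes at least one node has been added.
    String ownerOf(String key) {
        Map.Entry<String, String> entry = ring.ceilingEntry(key);
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }
}
```

With the five tokens from the slide ('d', 'k', 'p', 'u', 'z'), a key such as "foo" sorts after 'd' and before 'k', so it belongs to node #2.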
...
Gossip is the protocol Cassandra uses to exchange information between the nodes in the cluster (a.k.a. the Ring)
For example, which nodes own the key “foo”
[Diagram: a client reads 'foo'; the key falls in the 'e' to 'k' range owned by node #2]
...
A Partitioner is used to transform the key. “foo1” and “foo2” may end up in different nodes
The most commonly used is the Random Partitioner
“foo1” → md5(“foo1”) → “A99A0B....”
[Diagram: 'foo1' and 'foo2' land on different nodes of the five-node ring]
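A hash-based partitioner like the one described above can be sketched as follows. This assumes the token is the MD5 digest of the key read as a big integer and made non-negative; `Md5TokenSketch` and `token` are hypothetical names, and the details of Cassandra's own Random Partitioner may differ.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: turns a row key into a numeric token via MD5.
final class Md5TokenSketch {
    static BigInteger token(String key) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(key.getBytes(StandardCharsets.UTF_8));
            // Assumption: interpret the 16-byte digest as a signed integer and take its absolute value.
            return new BigInteger(digest).abs();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

Because the hash scatters similar keys, "foo1" and "foo2" get unrelated tokens and usually land on different nodes, which spreads load evenly at the cost of ordered range scans over keys.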
...
A Replica Placement Strategy determines which nodes contain replicas
Simple Strategy places them clockwise
[Diagram: three copies of 'foo1' on consecutive nodes of the five-node ring]
Network Topology Strategy places them in different DCs
[Diagram: two five-node rings with DC1:3 DC2:1, three copies of 'foo1' in DC1 and one in DC2]
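The clockwise placement of the Simple Strategy can be sketched by walking the ring from the key's owner. This is an illustration only, assuming one token per node and string tokens; `SimpleStrategySketch` and `replicasFor` are hypothetical names. Datacenter-aware placement is not covered here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical clockwise replica placement on a token ring (token -> node).
final class SimpleStrategySketch {
    static List<String> replicasFor(TreeMap<String, String> ring, String key, int replicationFactor) {
        List<String> replicas = new ArrayList<>();
        if (ring.isEmpty()) return replicas;
        // Never ask for more replicas than there are nodes.
        int wanted = Math.min(replicationFactor, ring.size());
        // Start at the first token >= key, or wrap to the smallest token.
        String token = ring.ceilingKey(key);
        if (token == null) token = ring.firstKey();
        while (replicas.size() < wanted) {
            replicas.add(ring.get(token));
            // Move clockwise to the next token, wrapping at the end of the ring.
            token = ring.higherKey(token);
            if (token == null) token = ring.firstKey();
        }
        return replicas;
    }
}
```

With tokens 'd', 'k', 'p', 'u', 'z' mapped to nodes #1 to #5 and a replication factor of 3, the key "foo" is placed on #2, #3 and #4.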
...
Consistency Level determines how many replicas to contact
CL = 1
[Diagram: client and five-node ring with three replicas of 'foo1']
CL = QUORUM
[Diagram: client and five-node ring with three replicas of 'foo1']
Consistency For Writes
ANY
ONE
TWO
THREE
QUORUM
LOCAL_QUORUM
EACH_QUORUM
ALL
Consistency For Reads
ONE
TWO
THREE
QUORUM
LOCAL_QUORUM
EACH_QUORUM
ALL
Consistency In Math Terms
Cassandra guarantees strong consistency if
(nodes_written + nodes_read) > replication_factor
R + W > N
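The R + W > N rule is easy to check in code. A minimal sketch, with hypothetical names (`ConsistencyMath`, `quorum`, `isStronglyConsistent`), assuming a quorum means a strict majority of the replicas:

```java
// Hypothetical helper for the consistency arithmetic.
final class ConsistencyMath {
    // A quorum is a strict majority of the replicas.
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // Reads overlap with writes on at least one replica when R + W > N.
    static boolean isStronglyConsistent(int nodesWritten, int nodesRead, int replicationFactor) {
        return nodesWritten + nodesRead > replicationFactor;
    }
}
```

With N = 3, writing and reading at QUORUM gives W = 2 and R = 2, so 2 + 2 > 3 holds. Writing and reading at ONE gives 1 + 1 = 2, which is not greater than 3, so a read may miss the latest write.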
Back to the example...
CL = QUORUM
[Diagram: client and five-node ring with three replicas of 'foo1']
...
But what if node #3 is down?
The coordinator node will store a hint and will replay that mutation when the down node comes back up.
This is known as Hinted Handoff
...
Node #5 will replay the hint to node #3 when it comes back online
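The store-and-replay idea behind Hinted Handoff can be sketched with in-memory maps. This is a toy model under simplifying assumptions (one coordinator, no hint expiry, no network); every name in it (`HintedHandoffSketch`, `write`, `nodeUp`, `localRead`) is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical coordinator that keeps hints for replicas that are down.
final class HintedHandoffSketch {
    final Map<String, Map<String, String>> replicas = new HashMap<>(); // node -> (key -> value)
    final Set<String> down = new HashSet<>();                          // nodes currently unreachable
    final Map<String, List<String[]>> hints = new HashMap<>();         // target node -> pending {key, value}

    // Write to every target replica; store a hint for each one that is down.
    void write(List<String> targets, String key, String value) {
        for (String node : targets) {
            if (down.contains(node)) {
                hints.computeIfAbsent(node, n -> new ArrayList<>()).add(new String[] {key, value});
            } else {
                replicas.computeIfAbsent(node, n -> new HashMap<>()).put(key, value);
            }
        }
    }

    // When a node comes back, replay the mutations it missed.
    void nodeUp(String node) {
        down.remove(node);
        List<String[]> pending = hints.remove(node);
        if (pending == null) return;
        for (String[] mutation : pending) {
            replicas.computeIfAbsent(node, n -> new HashMap<>()).put(mutation[0], mutation[1]);
        }
    }

    // What a single node holds locally for a key, or null if it has nothing.
    String localRead(String node, String key) {
        Map<String, String> data = replicas.get(node);
        return data == null ? null : data.get(key);
    }
}
```

Note that the hints live only on the node that stored them, which is why losing that node before the replay leaves the recovering replica out of date.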
...
And if node #5 dies before sending the hints to node #3?
...
If using QUORUM, node #4 will request 'foo' from all the replicas
...
If the results received do not match, a Read Repair process is performed in the background
...
And the missing or out-of-date value is pushed to the out-of-date node, #3 in this case
'foo' != ''
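The Read Repair step described above amounts to picking the newest version among the replicas and pushing it to the ones that lag behind. A simplified, synchronous sketch follows (the slides describe the repair as a background process); `ReadRepairSketch`, `Versioned` and `readAndRepair` are hypothetical names.

```java
import java.util.Map;

// Hypothetical read path that repairs stale or missing replica copies.
final class ReadRepairSketch {
    static final class Versioned {
        final String value;
        final long timestamp;

        Versioned(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    // replicaCopies maps node -> that node's copy of the column (null if the node has none).
    // Returns the newest copy and overwrites every missing or older copy with it.
    static Versioned readAndRepair(Map<String, Versioned> replicaCopies) {
        Versioned newest = null;
        for (Versioned copy : replicaCopies.values()) {
            if (copy != null && (newest == null || copy.timestamp > newest.timestamp)) {
                newest = copy;
            }
        }
        if (newest == null) return null; // no replica has the value
        for (Map.Entry<String, Versioned> entry : replicaCopies.entrySet()) {
            Versioned copy = entry.getValue();
            if (copy == null || copy.timestamp < newest.timestamp) {
                entry.setValue(newest); // push the up-to-date value to the lagging node
            }
        }
        return newest;
    }
}
```

In the scenario from the slides, node #3 holds nothing for the key, so the comparison fails and #3 receives the value held by the other replicas.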
...
The last feature to achieve consistency is the Anti Entropy Service (AES)
It should be run periodically as part of cluster maintenance, or after a node has been down
Recap Consistency Features
Read Repair
Anti Entropy Service (AES)
Hinted Handoff
Scaling
[Ring diagram: five nodes with tokens “e”, “j”, “o”, “t” and “z”; a new node “?” joins and takes token “g”]
Nodetool move ?
Want 2x performance ?!
Add 2x nodes
'No downtime' included!
[Ring diagram: the ring grows from five nodes (“e”, “j”, “o”, “t”, “z”) to ten by adding “b”, “g”, “l”, “q” and “v”]
With RF = 3 we could lose
[Ring diagram: the ten-node ring with three nodes marked X as lost, then a fourth marked “X ?”]
Vs others
Recap
Replication Factor
Tokens
Gossip
Partitioner
Replica Placement
Consistency
Hinted Handoff
Read Repair
AES
Clustering
Performance
Reads on par with writes
Scalability
Internals
Read and Write path
Storage - SSTable
- SSTables are sorted
- Immutable (“Merge on read”)
- Newest timestamp wins
Storage – Compaction
Merges SSTables together into larger SSTables
Removes Tombstones
Rebuilds primary and secondary indexes
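The merge step of compaction can be sketched as combining sorted maps, keeping the newest cell per key and dropping tombstones. This is a simplification: it ignores indexes and on-disk formats, and it drops every tombstone immediately, whereas Cassandra keeps tombstones for a grace period first. `CompactionSketch`, `Cell` and `compact` are hypothetical names.

```java
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical compaction: merge sorted tables, newest timestamp wins, tombstones are dropped.
final class CompactionSketch {
    static final class Cell {
        final String value;   // null marks a tombstone (a recorded delete)
        final long timestamp;

        Cell(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    static SortedMap<String, Cell> compact(List<SortedMap<String, Cell>> sstables) {
        SortedMap<String, Cell> merged = new TreeMap<>();
        for (SortedMap<String, Cell> sstable : sstables) {
            for (Map.Entry<String, Cell> entry : sstable.entrySet()) {
                // Keep whichever cell has the higher timestamp for this key.
                merged.merge(entry.getKey(), entry.getValue(),
                        (current, candidate) -> candidate.timestamp > current.timestamp ? candidate : current);
            }
        }
        // Simplification: remove tombstones right away instead of waiting for a grace period.
        merged.values().removeIf(cell -> cell.value == null);
        return merged;
    }
}
```

The input tables are never modified, which matches the immutability of SSTables: the result is a new, larger table that replaces the old ones.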
Storage – Compaction
Two types:
- Size-tiered compaction
- Leveled compaction
Storage – Compaction
Size-tiered compaction
- Performance not guaranteed
- Row may be spread across many SSTables
- Waste of space
- Good for write-heavy ops
- Rows are written once
- 100% more space than SSTables
Storage – Compaction
Leveled compaction
- Grouped into levels
- No overlapping within a level
- Each level is ten times as large as the previous one
- 90% of reads satisfied with 1 SSTable
- Twice as much I/O
Recap
SSTable
Memtable
Row Cache
Compaction
Before - 48 Cassandra on m2.4xlarge. 36 EVcache on m2.xlarge
After - 12 Cassandra on hi1.4xlarge
SSDs and caching
API Operations
Five general categories
Retrieving
Write/Update/Remove (all the same op!)
Increment counters
Meta Information
Schema Manipulation
CQL Execution
Insertion/Deletion => Mutation
Again: Every mutation is an insert!
- Merge on read
- SSTables are immutable
- Highest timestamp wins
CQL
INSERT INTO Hollywood.NerdMovies (user_uuid, fan) VALUES ('cfd66ccc-d857-4e90-b1e5-df98a3d40cd6', 'johndoe') USING CONSISTENCY LOCAL_QUORUM AND TTL 86400;
Hadoop
Using a Client
- Hector
http://hector-client.org
- Astyanax
https://github.com/Netflix/astyanax
- Pelops
https://github.com/s7/scale7-pelops
Using a Client → Hector
- Most popular Java client
- In use at very large installations
- A number of tools and utilities built on top
- Very active community
- MIT Licensed
Features
- High Level API
- Failover behavior
- High-performance connection pool
- JMX counters for management
- Discoverability of new nodes
- Automatic retry of downed hosts
- Suspension of nodes after several timeouts
- Load Balancing: Configurable and extensible
- Locking (Beta)
Hector's Architecture
vs JDBC
Hector is operation-oriented
Whereas
JDBC is connection-oriented
API Abstractions
Thrift
Mutator
Templates
ColumnFamilyTemplate
Familiar, type-safe approach
- based on template-method design pattern
- generic: ColumnFamilyTemplate<K,N>
(K is the key type, N the column name type)
ColumnFamilyTemplate template = new ThriftColumnFamilyTemplate(keyspaceName, columnFamilyName, StringSerializer.get(), StringSerializer.get());
*** (no generics for clarity)
ColumnFamilyTemplate
new ThriftColumnFamilyTemplate(keyspaceName,
    columnFamilyName,
    StringSerializer.get(),   // Key Format
    StringSerializer.get());  // Column Name Format
Column Name Format
- Cassandra calls this a “comparator”
- Remember: defines column order in on-disk format
ColumnFamilyTemplate
ColumnFamilyResult<String, String> res = cft.queryColumns("patricioe");
String value = res.getString("email");
Date startDate = res.getDate("DateOfBirth");
Key Format
Column Name Format
ColumnFamilyTemplate
ColumnFamilyUpdater updater = template.createUpdater("pato");
updater.setString("companyName", "Relateiq");
updater.addKey("sabina");
updater.setString("companyName", "Globant");
template.update(updater);
Inserting data with ColumnFamilyUpdater
ColumnFamilyTemplate
template.deleteColumn("zznate", "notNeededStuff");
template.deleteColumn("zznate", "somethingElse");
template.deleteColumn("patricioe", "aDifferentColumnName");
...
template.deleteRow("someuser");
template.executeBatch();
Deleting Data with ColumnFamilyTemplate
Integrating with existing patterns
Hector Object Mapper -> Apache Gora
https://github.com/hector-client/hector/tree/master/object-mapper
Hector JPA*:
https://github.com/riptano/hector-jpa
Spring IOC
CQL: JDBC Driver and Pool in 1.0!
JdbcTemplate FTW!
Development Resources
Hector Documentation (http://hector-client.org)
Cassandra Unit
https://github.com/jsevellec/cassandra-unit
Cassandra Maven Plugin
http://mojo.codehaus.org/cassandra-maven-plugin/
CCM localhost cassandra cluster
https://github.com/pcmanus/ccm
OpsCenter
http://www.datastax.com/products/opscenter
Cassandra AMIs
https://github.com/riptano/CassandraClusterAMI
Want to contribute?
git clone git@github.com:hector-client/hector.git
Summary
- Take advantage of strengths
- Idempotence and asynchronicity are your friends
- If it's not in the API, you are probably doing it wrong
- Seek death is still possible if you model incorrectly
- Try denormalizing (append-only model ?)
Patricio Echagüe
[email protected]
@patricioe
Credits
Nate McCall
Aaron Morton (http://thelastpickle.com)
Datastax (http://www.datastax.com)
http://www.slideshare.net/mikiobraun/cassandra-an-introduction
Additional Resources
DataStax Documentation: http://www.datastax.com/docs
Apache Cassandra project wiki: http://wiki.apache.org/cassandra/
“The Dynamo Paper”: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
P. Helland. Building on Quicksand: http://arxiv.org/pdf/0909.1788
P. Helland. Life Beyond Distributed Transactions: http://www.ics.uci.edu/~cs223/papers/cidr07p15.pdf
S. Anand. “Netflix's Transition to High-Availability Storage Systems”: http://media.amazonwebservices.com/Netflix_Transition_to_a_Key_v3.pdf
“The Megastore Paper”: http://research.google.com/pubs/archive/36971.pdf