nosql and cassandra

Post on 17-May-2015

5.942 Views

Category:

Technology

5 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Introduction to NOSQLAnd Cassandra

@rantav @outbrain

SQL is good

• Rich language• Easy to use and integrate• Rich toolset • Many vendors

• The promise: ACIDo Atomicityo Consistencyo Isolationo Durability

SQL Rules

BUT

The Challenge: Modern web apps

• Internet-scale data size• High read-write rates• Frequent schema changes

• "social" apps - not bankso They don't need the same

 level of ACID 

SCALING

Scaling Solutions - Replication

Scales Reads

Scaling Solutions - Sharding

Scales also Writes

Brewer's CAP Theorem: 

You can only choose two

CAP

Availability + Partition Tolerance (no Consistency)

Existing NOSQL Solutions

Taxonomy of NOSQL data stores

• Document Orientedo CouchDB, MongoDB, Lotus Notes, SimpleDB, Orient

• Key-Valueo Voldemort, Dynamo, Riak (sort of), Redis, Tokyo 

• Columno Cassandra, HBase, BigTable

• Graph Databaseso  Neo4J, FlockDB, DEX, AlegroGraph

http://en.wikipedia.org/wiki/NoSQL

 

• Developed at facebook

• Follows the BigTable Data Model - column oriented

• Follows the Dynamo Eventual Consistency model

• Opensourced at Apache

• Implemented in Java

N/R/W

• N - Number of replicas (nodes) for any data item

• W - Number or nodes a write operation blocks on

• R - Number of nodes a read operation blocks on

CONSISTENCY DOWN TO EARTH

N/R/W - Typical Values

• W=1 => Block until first node written successfully• W=N => Block until all nodes written successfully• W=0 => Async writes

• R=1 => Block until the first node returns an answer• R=N => Block until all nodes return an answer• R=0 => Doesn't make sense

• QUORUM:o R = N/2+1o W = N/2+1o => Fully consistent

Data Model - Forget SQL

Do you know SQL?

Data Model - Vocabulary

• Keyspace – like namespace for unique keys.

• Column Family – very much like a table… but not quite.

• Key – a key that represent row (of columns)

• Column – representation of value with:o Column nameo Valueo Timestamp

• Super Column – Column that holds list of columns inside

Data Model - Columns

struct Column {   1: required binary name,   2: optional binary value,   3: optional i64 timestamp,   4: optional i32 ttl,}

JSON-ish notation:{  "name":      "emailAddress",  "value":     "foo@bar.com",  "timestamp": 123456789 }

Data Model - Column Family

• Similar to SQL tables• Has many columns• Has many rows

Data Model - Rows

• Primary key for objects• All keys are arbitrary length binaries

Users:                                 CF    ran:                               ROW        emailAddress: foo@bar.com,     COLUMN        webSite: http://bar.com        COLUMN    f.rat:                             ROW        emailAddress: f.rat@mouse.com  COLUMNStats:                                 CF    ran:                               ROW        visits: 243                    COLUMN

Data Model - Songs example

Songs:     Meir Ariel:         Shir Keev: 6:13,         Tikva: 4:11,        Erol: 6:17        Suetz: 5:30        Dr Hitchakmut: 3:30    Mashina:        Rakevet Layla: 3:02        Optikai: 5:40

Data Model - Super Columns

• Columns whose values are lists of columns

Data Model - Super Columns

Songs:     Meir Ariel:        Shirey Hag:            Shir Keev: 6:13,             Tikva: 4:11,            Erol: 6:17        Vegluy Eynaim:             Suetz: 5:30            Dr Hitchakmut: 3:30    Mashina:        ...

The API - Read

getget_sliceget_countmultigetmultiget_sliceget_ranage_slicesget_indexed_slices

The True API

get(keyspace, key, column_path, consistency)get_slice(ks, key, column_parent, predicate, consistency)multiget(ks, keys, column_path, consistency)multiget_slice(ks, keys, column_parent, predicate, consistency)...

The API - Write

insertaddremoveremove_counterbatch_mutate

The API - Meta

describe_schema_versionsdescribe_keyspacesdescribe_cluster_namedescribe_versiondescribe_ringdescribe_partitionerdescribe_snitch

The API - DDL

system_add_column_familysystem_drop_column_familysystem_add_keyspacesystem_drop_keyspacesystem_update_keyspacesystem_update_column_family

The API - CQL

execute_cql_query

cqlsh> SELECT key, state FROM users;

cqlsh> INSERT INTO users (key, full_name, birth_date, state) VALUES ('bsanderson', 'Brandon Sanderson', 1975, 'UT');

Consistency Model

• N - per keyspace• R - per each read requests• W - per each write request

Consistency Model

Cassandra defines:

enum ConsistencyLevel {    ONE,    QUORUM,    LOCAL_QUORUM,    EACH_QUORUM,    ALL,    ANY,    TWO,    THREE,}

Java Code

TTransport tr = new TSocket("localhost", 9160); TProtocol proto = new TBinaryProtocol(tr); Cassandra.Client client = new Cassandra.Client(proto); tr.open(); 

String key_user_id = "1"; 

long timestamp = System.currentTimeMillis(); client.insert("Keyspace1",               key_user_id,               new ColumnPath("Standard1",                              null,                             "name".getBytes("UTF-8")),               "Chris Goffinet".getBytes("UTF-8"),              timestamp,               ConsistencyLevel.ONE); 

Java Client - Hectorhttp://github.com/rantav/hector

• The de-facto java client for cassandra

• Encapsulates thrift• Adds JMX (Monitoring)• Connection pooling• Failover• Open-sourced at github and has a growing

community of developers and users.

Java Client - Hector - cont

 /**   * Insert a new value keyed by key   *   * @param key   Key for the value   * @param value the String value to insert   */  public void insert(final String key, final String value) {    Mutator m = createMutator(keyspaceOperator);    m.insert(key,              CF_NAME,              createColumn(COLUMN_NAME, value));  }

Java Client - Hector - cont

  /**   * Get a string value.   *   * @return The string value; null if no value exists for the given key.   */  public String get(final String key) throws HectorException {    ColumnQuery<String, String> q = createColumnQuery(keyspaceOperator, serializer, serializer);    Result<HColumn<String, String>> r = q.setKey(key).        setName(COLUMN_NAME).        setColumnFamily(CF_NAME).        execute();    HColumn<String, String> c = r.get();    return c == null ? null : c.getValue();  }

Extra

If you're not snoring yet...

Sorting

Columns are sorted by their type • BytesType • UTF8Type• AsciiType• LongType• LexicalUUIDType• TimeUUIDType

Rows are sorted by their Partitioner• RandomPartitioner• OrderPreservingPartitioner• CollatingOrderPreservingPartitioner

Thrift

Cross-language protocolCompiles to: C++, Java, PHP, Ruby, Erlang, Perl, ...

struct UserProfile {     1: i32    uid,     2: string name,     3: string blurb } 

service UserStorage {     void         store(1: UserProfile user),    UserProfile  retrieve(1: i32 uid) }

Thrift

Generating sources:

thrift --gen java cassandra.thrift

thrift -- gen py cassandra.thrift

Internals

Agenda

• Background and history• Architectural Layers• Transport: Thrift• Write Path (and sstables, memtables)• Read Path• Compactions• Bloom Filters• Gossip• Deletions• More...

Required Reading ;-)

BigTable http://labs.google.com/papers/bigtable.html

Dynamo http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html

From Dynamo:

• Symmetric p2p architecture• Gossip based discovery and error detection• Distributed key-value store

o Pluggable partitioning o Pluggable topology discovery

• Eventual consistent and Tunable per operation 

From BigTable

• Sparse Column oriented sparse array• SSTable disk storage

o Append-only commit logo Memtable (buffering and sorting)o Immutable sstable fileso Compactionso High write performance 

Architecture Layers

Cluster ManagementMessaging service Gossip Failure detection Cluster state Partitioner Replication 

Single HostCommit log Memtable SSTable Indexes Compaction 

Write Path

Writing

Writing

Writing

Writing

Memtables

• In-memory representation of recently written data• When the table is full, it's sorted and then flushed to disk ->

sstable

SSTables

Sorted Strings Tables• Immutable• On-disk• Sorted by a string key• In-memory index of elements• Binary search (in memory) to find element location• Bloom filter to reduce number of unneeded binary searches.

Write Properties

• No Locks in the critical path• Always available to writes, even if there are failures.

• No reads• No seeks • Fast • Atomic within a Row

Read Path

Reads

Reading

Reading

Reading

Reading

Bloom Filters

• Space efficient probabilistic data structure• Test whether an element is a member of a set• Allow false positive, but not false negative • k hash functions• Union and intersection are implemented as bitwise OR, AND

Read Properteis

• Read multiple SSTables • Slower than writes (but still fast) • Seeks can be mitigated with more RAM• Uses probabilistic bloom filters to reduce lookups.• Extensive optional caching

o Key Cacheo Row Cacheo Excellent monitoring

Compactions

 

Compactions

• Merge keys • Combine columns • Discard tombstones• Use bloom filters bitwise OR operation

Gossip

• p2p• Enables seamless nodes addition.• Rebalancing of keys• Fast detection of nodes that goes down.• Every node knows about all others - no

master.

Deletions

• Deletion marker (tombstone) necessary to suppress data in older SSTables, until compaction 

• Read repair complicates things a little • Eventually consistent complicates things more • Solution: configurable delay before tombstone GC, after

which tombstones are not repaired

Extra Long list of subjects

• SEDA (Staged Events Driven Architecture)• Anti Entropy and Merkle Trees• Hinted Handoff• repair on read

SEDA

• Mutate• Stream• Gossip• Response• Anti Entropy• Load Balance• Migration 

Anti Entropy and Merkle Trees

Hinted Handoff

References

• http://horicky.blogspot.com/2009/11/nosql-patterns.html• http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon

-dynamo-sosp2007.pdf• http://labs.google.com/papers/bigtable.html• http://bret.appspot.com/entry/how-friendfeed-uses-mysql• http://www.julianbrowne.com/article/viewer/brewers-cap-the

orem• http://www.allthingsdistributed.com/2008/12/eventually_cons

istent.html• http://wiki.apache.org/cassandra/DataModel• http://incubator.apache.org/thrift/• http://www.eecs.harvard.edu/~mdw/papers/quals-seda.pdf

top related