VICTORIA UNIVERSITY OF WELLINGTON
Te Whare Wānanga o te Ūpoko o te Ika a Māui
SWEN 432
Advanced Database Design and Implementation
Cassandra
Architecture
Lecturer: Dr. Pavle Mogin
Cassandra, the Prophet
Plan for Cassandra Architecture
• Prologue
• Internode Communication
• Data Distribution and Replication
• Partitioners
• Snitches
• CREATE KEYSPACE command
Prologue
• The Cassandra documentation describes the architecture in terms of:
– Data centres,
– Clusters,
– Internode communication,
– Data distribution and replication,
– Snitches, and
– Keyspaces
• Important structural information is held in two configuration files:
– cassandra.yaml, and
– cassandra-topology.properties
Data Centre and Cluster
• Cluster
– A group of networked nodes used to implement at least one database
– A cluster contains nodes belonging to one or more data centres
– It can span physical locations
• Data Centre is a term with two meanings:
– General meaning: A collection of related nodes placed in racks
– Data Centre of a cluster: A grouping of a number of nodes belonging to a
data centre configured together for replication purposes
– If a cluster spans more than one data centre
• Replication is set by data centres
• The same data is written in all data centres
– Using separate data centres allows:
• Dedicating each data centre for different processing tasks
• Satisfying requests from a data centre close to the client, and
• Improving availability for the same level of consistency
• A rack is a container for storing a number of nodes that share:
– Power supply, cooling system, and network connection
Data Centres and a Cluster
[Figure: A cluster spanning two data centres, dc1 and dc2, each holding its nodes in several racks (rack1, rack2, …).]
Internode Communication
• Cassandra uses the Gossip communication protocol, in
which nodes periodically exchange state information
about themselves and other nodes they know about
• The gossip process runs every second and exchanges
messages with up to three other nodes in the cluster
• Gossip messages are versioned:
– During a gossip exchange, older information is overwritten with the most current state for a particular node
Configuring Gossip Settings and Seeds
• When a node first starts up, it looks at its cassandra.yaml configuration file to determine:
– The name of the Cassandra cluster it belongs to,
– Which nodes (called seeds) to contact to obtain information about
the other nodes in the cluster, and
– Other parameters for determining port and range information
• All these parameters are set by an administrator and
have to be the same for all nodes of a cluster
• In cassandra.yaml, the property seed_provider holds a comma-delimited list of seed hosts (IP addresses)
• In multiple data-centre clusters, the seeds list should
include at least one node from each data centre
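• As an illustration, the relevant cassandra.yaml entries might look as follows (a sketch; the cluster name and IP addresses are made up, while seed_provider and SimpleSeedProvider are the stock property names):

    # excerpt from cassandra.yaml (illustrative values)
    cluster_name: 'SWEN432Cluster'
    seed_provider:
      - class_name: org.apache.cassandra.locator.SimpleSeedProvider
        parameters:
          # comma-delimited seed IP addresses; in a multiple data-centre
          # cluster, include at least one node from each data centre
          - seeds: "10.0.1.1,10.0.2.1"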
Node Failure – Dynamic Snitch
• During gossiping, every node maintains a sliding
window of inter-arrival times of gossip messages from
other nodes in the cluster:
– That mechanism is called the dynamic snitch
• If a node does not respond during an adjustable time
period, it is either down or experiencing transient
problems
• Coordinator nodes use this information to avoid (if
possible) routing client requests to unreachable
nodes, or nodes that are performing poorly
• The live nodes periodically try to re-establish contact
with failed nodes to see if they are back up
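• For reference, the dynamic snitch behaviour is tunable in cassandra.yaml; a sketch with what are, to the best of my knowledge, the usual default values (property names from the stock configuration file):

    # how often node scores are updated, and how often they are reset
    dynamic_snitch_update_interval_in_ms: 100
    dynamic_snitch_reset_interval_in_ms: 600000
    # how much worse a node must perform before requests avoid it
    dynamic_snitch_badness_threshold: 0.1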
Data Distribution and Partitioning
• Data distribution is done by Consistent Hashing
• Partitioning:
– Nodes of a cluster form a consistent partitioning ring,
– A partitioner determines how data is distributed across the nodes in the cluster,
– Basically, a partitioner is a function for deriving a token representing a row from its partition key,
– Cassandra's default partitioner is a hash function named Murmur3Partitioner,
– This hashing function creates a 64-bit hash value of the partition key,
– The possible range of hash values is from -2^63 to +2^63 - 1
• Ring nodes are also called endpoints because each
node occupies the last token of a range
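• The tokens the partitioner derives can be inspected with the CQL token() function; a brief example against a hypothetical table (the keyspace, table, and column names are made up):

    -- id is assumed to be the partition key of the made-up table users
    SELECT id, token(id) FROM my_keyspace.users LIMIT 3;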
Hashing Values in a Cluster
[Figure: Cluster ring with four physical nodes A, B, C, and D (only physical nodes assumed). The token range from -2^63 = -9223372036854775808 to 2^63 - 1 = 9223372036854775807 is split into four equal sub-ranges, one per node; each node's initial_token is the last token of its sub-range.]
Virtual Nodes
• Virtual nodes are used to:
– Balance the workload,
– Add or remove nodes in an easy way, and
– Rebuild a dead node faster
• For each physical node of a cluster, the administrator specifies the number of virtual nodes in its cassandra.yaml configuration file
• The property num_tokens defines the number of tokens randomly assigned to this node on the ring:
– One token for each virtual node
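• A sketch of the corresponding cassandra.yaml entry (num_tokens is the stock property name; the value shown is merely a commonly used one):

    # number of virtual nodes (token positions) this node takes on the ring
    num_tokens: 256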
Data Replication
• Cassandra supports a number of replication strategies:
– Simple Strategy,
– Network Topology Strategy,
– Gossiping Network Topology Strategy,
– Amazon Web Services strategies (EC2 and EC2 Multi Region)
• We consider the Simple Strategy and the Network Topology Strategy only
• Simple Strategy is used for single data centres only:
– Places the first replica on a node determined by the partitioner
– Additional replicas are placed on the next nodes clockwise in the ring without considering topology (rack or data centre location)
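• As a quick illustration, a test keyspace using this strategy might be declared as follows (the keyspace name is made up):

    CREATE KEYSPACE IF NOT EXISTS demo_ks
    WITH REPLICATION = { 'class' : 'SimpleStrategy',
                         'replication_factor' : 3 };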
Network Topology Strategy
• Network Topology Strategy (NTS) should be used for
clusters deployed across multiple data centres
• NTS is:
– Data centre aware, and
– Rack aware
• This strategy specifies the number of replicas in each
data centre
• NTS places replicas in the same data centre by walking the ring clockwise until reaching the first node in another rack:
– NTS attempts to place replicas on distinct racks because nodes in the same rack (or similar physical grouping) often fail at the same time due to power, cooling, or network issues
Snitches
• A snitch determines which data centres and racks
cluster nodes belong to
• They inform Cassandra about the network topology, so that requests are routed efficiently, and allow Cassandra to distribute replicas by grouping machines into data centres and racks
• Cassandra supports:
– SimpleSnitch,
– PropertyFileSnitch,
– Others
• SimpleSnitch:
– The SimpleSnitch is the default snitch
– It is used for single data-centre deployments only
– It does not recognize data centre or rack information
Property File Snitch
• This snitch determines proximity by rack and data
centre
• It uses the network details located in the cassandra-topology.properties file
• An administrator can use the cassandra-topology.properties file to define data centre names and the distribution of nodes among racks
• Data centre names have to match the names of data centres in the keyspace definition
• Every node in the cluster should be assigned to a rack, and this allocation has to be exactly the same in the cassandra-topology.properties file of each node in the cluster
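• A sketch of such a file (the IP addresses, data centre names, and rack names are made up; the ip=dc:rack line format is the one the file uses):

    # node IP = data centre : rack
    10.0.1.1=dc1:rack1
    10.0.1.2=dc1:rack2
    10.0.2.1=dc2:rack1
    10.0.2.2=dc2:rack2
    # data centre and rack assumed for unknown nodes
    default=dc1:rack1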
Configuring Data Centre Replication
• Primary considerations when configuring replicas in a multi data-centre cluster are:
– Being able to satisfy reads locally, without incurring cross data-centre latency, and
– Failure scenarios
• Locality of reads is achieved by:
– Choosing a data centre closest to users, and
– Storing the same data in each data centre (so users can connect to the closest data centre)
• Failure scenarios depend on the replication factor
• In a multi data-centre cluster:
– The global replication factor n_g is the sum of the local replication factors
– The global quorum is q_g = floor(n_g / 2) + 1
Failure Scenario Examples
• Consider the worst-case design
• The replication factor n = 2 in each of dc1 and dc2:
– In the case of a single node failure, the guarantee is:
• Eventual consistency locally, and strong consistency globally (since q_g = 3)
– In the case of a two node failure, the guarantee is:
• Eventual consistency globally
• The replication factor n = 3 in each of dc1 and dc2:
– In the case of a single node failure, the guarantee is:
• Strong consistency locally and globally (since q_g = 4)
– In the case of a two node failure, the guarantee is:
• Eventual consistency locally, and strong consistency globally (since q_g = 4)
– In the case of a three node failure, the guarantee is:
• Eventual consistency globally
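• A small arithmetic check of these scenarios (a Python sketch, not part of Cassandra):

    # global quorum q_g = floor(n_g / 2) + 1, where n_g is the sum of
    # the local replication factors
    def global_quorum(local_rfs):
        n_g = sum(local_rfs)
        return n_g // 2 + 1

    # n = 2 per data centre: n_g = 4, q_g = 3, so one node down still
    # leaves three replicas and strong consistency globally
    assert global_quorum([2, 2]) == 3
    # n = 3 per data centre: n_g = 6, q_g = 4
    assert global_quorum([3, 3]) == 4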
About Cluster Nodes and Racks (1)
• Nodes of a cluster are allocated to r racks within a data centre having m nodes by a database administrator
• A rack holds a number of nodes that share:
– Power supply, cooling system, and network connection
– If any of these fails, all the nodes in the rack fail
• Distribution of nodes within racks has a strong influence on database consistency, availability, and load balancing
• Placing all cluster nodes of a data centre in the same rack:
– Considerably degrades database availability because a rack failure brings all nodes down, but
– Offers the greatest node proximity, which slightly improves availability by providing the shortest latency of internode communication within a data centre
• The other extreme, placing each cluster node into a different rack (all nodes in different racks), leads to:
– The greatest availability, and
– The longest latency of internode communication
About Cluster Nodes and Racks (2)
• The node placement in racks trade-off prompts:
– Use of more than one rack for storing nodes of a cluster in a data centre, and
– Avoiding placing nodes that store replicas of the same data in the same rack
• The last requirement asks for the following relationship between the number of racks r used to hold nodes of a cluster and the replication factor n:
r ≥ n
• So, each node of a replica set can be placed in a different rack
About Cluster Nodes and Racks (3)
• A database administrator allocates nodes of a cluster to racks having in mind the inequality r ≥ n, and the following algorithm that Cassandra uses to assign nodes to replica sets in a multi data-centre cluster (a sketch of it follows this list):
– The first node of a replica set is the primary node (the node that owns the token range that is going to be stored on the replica set),
– Each next node of the replica set is the closest neighbour of the current node in the clockwise direction on the ring that resides in the next rack with regard to the current rack,
• If the current rack is the last rack, the next rack is the first rack
– If there is no next node that resides in the next rack, then the next replica node is the first, physically next node in the clockwise direction on the ring
• The algorithm above does not depend on the actual state of nodes (up or down)
• If all nodes are placed in a single rack, then there are no cluster nodes in a different rack, so all replicas end up in that rack
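• A simplified Python sketch of that walk (assuming one token per node and a single data centre; this is an illustration, not Cassandra's actual code):

    def place_replicas(ring, rack_of, primary, n):
        """ring: node ids in token order; rack_of: node id -> rack name;
        primary: index of the primary node in ring; n: replication factor."""
        replicas = [ring[primary]]
        used_racks = {rack_of[ring[primary]]}
        i = primary
        # walk clockwise, taking the next node that sits in a new rack
        for _ in range(len(ring) - 1):
            if len(replicas) == n:
                break
            i = (i + 1) % len(ring)
            if rack_of[ring[i]] not in used_racks:
                replicas.append(ring[i])
                used_racks.add(rack_of[ring[i]])
        # fallback: no unused rack left, take the physically next nodes
        i = primary
        while len(replicas) < n:
            i = (i + 1) % len(ring)
            if ring[i] not in replicas:
                replicas.append(ring[i])
        return replicas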
About Cluster Nodes and Racks (4)
• Each node that Cassandra assigns to a replica set then comes from a different rack, and thus the availability requirement is satisfied
• But that assignment may vary depending on the initial
allocation of nodes to racks
• The following is an allocation algorithm that results in an even workload distribution, if the replication factor n divides the number of nodes m (n | m); a round-robin sketch of it follows below:
– Let node n_i, i ∈ {1, 2, …, m}, be allocated to rack r_j, j ∈ {1, 2, …, r}; then node n_(i+1) is allocated to rack r_p, where p = j + 1 for j = 1, 2, …, r − 1, and p = 1 for j = r
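• A direct transcription of that round-robin rule (Python sketch):

    def allocate(m, r):
        # node i (1..m) goes to rack ((i - 1) mod r) + 1
        return {i: (i - 1) % r + 1 for i in range(1, m + 1)}

    # 6 nodes over 3 racks: each rack holds two nodes, an even load
    assert allocate(6, 3) == {1: 1, 2: 2, 3: 3, 4: 1, 5: 2, 6: 3}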
Examples
Example
• Assume:
– A twelve node cluster has been deployed using two data centres, containing six nodes each
– Each physical node has only one virtual node
– Both data centres have the same replication factor
– The availability requirements are:
• Strong consistency globally for 100% of data when two nodes are down
• Strong consistency globally for 100% of data when two racks are down
• What should be the minimal local replication factor?
• Over how many racks should the cluster be deployed to achieve the availability requirements?
Answer
Data Centre 1: Rack 1 holds nodes 1 and 4, Rack 2 holds nodes 2 and 5, Rack 3 holds nodes 3 and 6
Data Centre 2: Rack 1 holds nodes 7 and 10, Rack 2 holds nodes 8 and 11, Rack 3 holds nodes 9 and 12
Local replication factors n_1 = 3 and n_2 = 3
Global replication factor n = 6
Number of racks r_1 = 3 and r_2 = 3
Local quorums q_1 = 2 and q_2 = 2
Global quorum q = 4
Assume a given data object has been stored on nodes n_4, n_5, and n_6 in dc1 and n_10, n_11, and n_12 in dc2
CREATE KEYSPACE
• The replication strategy and the replication configuration are defined within the CREATE KEYSPACE CQL command
CREATE KEYSPACE
IF NOT EXISTS keyspace_name
WITH REPLICATION = map AND
DURABLE_WRITES = ( true | false )
• The map is used to declare:
– The replica placement strategy class (either Simple or Network Topology), and
– The replication configuration (factor)
The map of CREATE KEYSPACE
• The two different types of keyspaces:
{ 'class' : 'SimpleStrategy',
  'replication_factor' : <integer> };

{ 'class' : 'NetworkTopologyStrategy'
  [, '<data center>' : <integer>,
     '<data center>' : <integer>, … ] };
• The SimpleStrategy is used for evaluating and testing Cassandra:
– It is correlated with the SimpleSnitch
• The NetworkTopologyStrategy should be used for production or for use with mixed workloads:
– It is correlated with the PropertyFileSnitch that uses information in the cassandra-topology.properties file
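• For example, a keyspace with three replicas in dc1 and two in dc2 could be declared as follows (the keyspace and data centre names are illustrative):

    CREATE KEYSPACE IF NOT EXISTS my_ks
    WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy',
                         'dc1' : 3,
                         'dc2' : 2 };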
Configuring NTS
• Before creating a keyspace for use with multiple data centres, the cluster has to be configured to use a network-aware snitch:
– The cassandra.yaml configuration file of each node has to be configured before starting the cluster
– One of the settings to check and (possibly) change is
endpoint_snitch: PropertyFileSnitch
• Data centre names and rack information have to be checked and possibly changed in the cassandra-topology.properties file of each node:
– Cassandra uses dc_i, i = 1, …, as the default data centre names
– Data centre names in the .properties file and in the map of CREATE KEYSPACE have to match exactly, otherwise Cassandra will fail to find a node and to complete a write request
– Allocation of nodes to racks has to be made before starting the cluster
Setting Durable_Writes
• DURABLE_WRITES is an option whose default is true
• When set to false, data written to the keyspace bypasses the commit log:
– There is a risk of losing data
– Do not use it on a keyspace using the SimpleStrategy
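• An illustration of the option (the keyspace and data centre names are made up):

    CREATE KEYSPACE IF NOT EXISTS sensor_cache
    WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'dc1' : 3 }
    AND DURABLE_WRITES = false;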
Summary (1)
• A data centre is a collection of nodes configured for
replication purposes
• A cluster contains nodes from one or more data
centres
• Internode communication is accomplished through
gossiping
• The dynamic snitch is a mechanism for detecting
poorly performing or failing nodes
• Data distribution is done by Consistent Hashing:
– Partitioning is performed by deriving a token from the partition key of a row and storing the row on a node assigned to the first greater token on the consistent hashing ring
Summary (2)
• Data replication is performed according to one of the replication strategies:
– Simple Strategy,
– Network Topology Strategy
• A snitch contains information about the network topology:
– Each snitch corresponds to one replication strategy and vice versa
• The replication strategy and the replication configuration (factor) are declared within a keyspace declaration