VICTORIA UNIVERSITY OF WELLINGTON
Te Whare Wānanga o te Ūpoko o te Ika a Māui
SWEN 432
Advanced Database Design and Implementation
Cassandra
Architecture
Lecturer: Dr. Pavle Mogin
Cassandra, the Prophet
Plan for Cassandra Architecture
• Prologue
• Internode Communication
• Data Distribution and Replication
• Partitioners
• Snitches
• CREATE KEYSPACE command
Prologue
• The Cassandra documentation describes the architecture in terms of:
– Data centres,
– Clusters,
– Internode communication,
– Data distribution and replication,
– Snitches, and
– Keyspaces
• Important structural information is held in two configuration files:
– cassandra.yaml, and
– cassandra-topology.properties
Data Centre and Cluster
• Cluster
– A group of networked nodes used to implement at least one database
– A cluster contains nodes belonging to one or more data centres
– It can span physical locations
• Data Centre is a term with two meanings:
– General meaning: A collection of related nodes placed in racks
– Data Centre of a cluster: A grouping of a number of nodes belonging to a
data centre configured together for replication purposes
– If a cluster spans more than one data centre
• Replication is set by data centres
• The same data is written in all data centres
– Using separate data centres allows:
• Dedicating each data centre for different processing tasks
• Satisfying requests from a data centre close to the client, and
• Improving availability for the same level of consistency
• A rack is a container for storing a number of nodes that share:
– Power supply, cooling system, and network connection
Data Centres and a Cluster
[Figure: A cluster spanning two data centres, dc1 and dc2, each holding its nodes in several racks (rack1, rack2, …).]
Internode Communication
• Cassandra uses the Gossip communication protocol, in
which nodes periodically exchange state information
about themselves and other nodes they know about
• The gossip process runs every second and exchanges
messages with up to three other nodes in the cluster
• Gossip messages are versioned:
– During a gossip exchange, older information is overwritten with the most current state for a particular node
Configuring Gossip Settings and Seeds
• When a node first starts up, it looks at its cassandra.yaml configuration file to determine:
– The name of the Cassandra cluster it belongs to,
– Which nodes (called seeds) to contact to obtain information about
the other nodes in the cluster, and
– Other parameters for determining port and range information
• All these parameters are set by an administrator and
have to be the same for all nodes of a cluster
• In cassandra.yaml, the property seed_provider holds a comma-delimited list of seed hosts (IP addresses)
• In multiple data-centre clusters, the seeds list should
include at least one node from each data centre
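• As an illustration, the relevant cassandra.yaml entries might look as follows (a sketch; the cluster name and IP addresses are made up, while seed_provider and SimpleSeedProvider are the stock property names):

    # excerpt from cassandra.yaml (illustrative values)
    cluster_name: 'SWEN432Cluster'
    seed_provider:
      - class_name: org.apache.cassandra.locator.SimpleSeedProvider
        parameters:
          # comma-delimited seed IP addresses; in a multiple data-centre
          # cluster, include at least one node from each data centre
          - seeds: "10.0.1.1,10.0.2.1"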
Node Failure – Dynamic Snitch
• During gossiping, every node maintains a sliding
window of inter-arrival times of gossip messages from
other nodes in the cluster:
– That mechanism is called the dynamic snitch
• If a node does not respond during an adjustable time
period, it is either down or experiencing transient
problems
• Coordinator nodes use this information to avoid (if
possible) routing client requests to unreachable
nodes, or nodes that are performing poorly
• The live nodes periodically try to re-establish contact
with failed nodes to see if they are back up
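• For reference, the dynamic snitch behaviour is tunable in cassandra.yaml; a sketch with what are, to the best of my knowledge, the usual default values (property names from the stock configuration file):

    # how often node scores are updated, and how often they are reset
    dynamic_snitch_update_interval_in_ms: 100
    dynamic_snitch_reset_interval_in_ms: 600000
    # how much worse a node must perform before requests avoid it
    dynamic_snitch_badness_threshold: 0.1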
Data Distribution and Partitioning
• Data distribution is done by Consistent Hashing
• Partitioning:
– Nodes of a cluster form a consistent partitioning ring,
– A partitioner determines how data is distributed across the nodes in the cluster,
– Basically, a partitioner is a function for deriving a token representing a row from its partition key,
– Cassandra's default partitioner is a hash function named Murmur3Partitioner,
– This hashing function creates a 64-bit hash value of the partition key,
– The possible range of hash values is from -2^63 to +2^63 - 1
• Ring nodes are also called endpoints because each
node occupies the last token of a range
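• The tokens the partitioner derives can be inspected with the CQL token() function; a brief example against a hypothetical table (the keyspace, table, and column names are made up):

    -- id is assumed to be the partition key of the made-up table users
    SELECT id, token(id) FROM my_keyspace.users LIMIT 3;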
Hashing Values in a Cluster
[Figure: Cluster ring with four physical nodes A, B, C, and D (only physical nodes assumed). The token range from -2^63 = -9223372036854775808 to 2^63 - 1 = 9223372036854775807 is split into four equal sub-ranges, one per node; each node's initial_token is the last token of its sub-range.]
Virtual Nodes
• Virtual nodes are used to:
– Balance the workload,
– Add or remove nodes in an easy way, and
– Rebuild a dead node faster
• For each physical node of a cluster, the administrator specifies the number of virtual nodes in its cassandra.yaml configuration file
• The property num_tokens defines the number of tokens randomly assigned to this node on the ring:
– One token for each virtual node
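• A sketch of the corresponding cassandra.yaml entry (num_tokens is the stock property name; the value shown is merely a commonly used one):

    # number of virtual nodes (token positions) this node takes on the ring
    num_tokens: 256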
Data Replication
• Cassandra supports a number of replication strategies:
– Simple Strategy,
– Network Topology Strategy,
– Gossiping Network Topology Strategy,
– Amazon Web Services strategies (EC2 and EC2 Multi Region)
• We consider the Simple Strategy and the Network Topology Strategy only
• Simple Strategy is used for single data centres only:
– Places the first replica on a node determined by the partitioner
– Additional replicas are placed on the next nodes clockwise in the ring without considering topology (rack or data centre location)
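• As a quick illustration, a test keyspace using this strategy might be declared as follows (the keyspace name is made up):

    CREATE KEYSPACE IF NOT EXISTS demo_ks
    WITH REPLICATION = { 'class' : 'SimpleStrategy',
                         'replication_factor' : 3 };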
Network Topology Strategy
• Network Topology Strategy (NTS) should be used for
clusters deployed across multiple data centres
• NTS is:
– Data centre aware, and
– Rack aware
• This strategy specifies the number of replicas in each
data centre
• NTS places replicas in the same data centre by walking the ring clockwise until reaching the first node in another rack:
– NTS attempts to place replicas on distinct racks because nodes in the same rack (or similar physical grouping) often fail at the same time due to power, cooling, or network issues
Snitches
• A snitch determines which data centres and racks
cluster nodes belong to
• They inform Cassandra about the network topology, so that requests are routed efficiently, and allow Cassandra to distribute replicas by grouping machines into data centres and racks
• Cassandra supports:
– SimpleSnitch,
– PropertyFileSnitch,
– Others
• SimpleSnitch:
– The SimpleSnitch is the default snitch
– It is used for single data-centre deployments only
– It does not recognize data centre or rack information
Property File Snitch
• This snitch determines proximity by rack and data
centre
• It uses the network details located in the cassandra-topology.properties file
• An administrator can use the cassandra-topology.properties file to define data centre names and the distribution of nodes among racks
• Data centre names have to match the names of data centres in the keyspace definition
• Every node in the cluster should be assigned to a rack, and this allocation has to be exactly the same in the cassandra-topology.properties file of each node in the cluster
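• A sketch of such a file (the IP addresses, data centre names, and rack names are made up; the ip=dc:rack line format is the one the file uses):

    # node IP = data centre : rack
    10.0.1.1=dc1:rack1
    10.0.1.2=dc1:rack2
    10.0.2.1=dc2:rack1
    10.0.2.2=dc2:rack2
    # data centre and rack assumed for unknown nodes
    default=dc1:rack1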
Configuring Data Centre Replication
• Primary considerations when configuring replicas in a multi data-centre cluster are:
– Being able to satisfy reads locally, without incurring cross data-centre latency, and
– Failure scenarios
• Locality of reads is achieved by:
– Choosing a data centre closest to users, and
– Storing the same data in each data centre (so users can connect to the closest data centre)
• Failure scenarios depend on the replication factor
• In a multi data-centre cluster:
– The global replication factor n_g is the sum of the local replication factors
– The global quorum is q_g = floor(n_g / 2) + 1
Failure Scenario Examples
• Consider the worst-case design
• The replication factor n = 2 in each of dc1 and dc2:
– In the case of a single node failure, the guarantee is:
• Eventual consistency locally, and strong consistency globally (since q_g = 3)
– In the case of a two node failure, the guarantee is:
• Eventual consistency globally
• The replication factor n = 3 in each of dc1 and dc2:
– In the case of a single node failure, the guarantee is:
• Strong consistency locally and globally (since q_g = 4)
– In the case of a two node failure, the guarantee is:
• Eventual consistency locally, and strong consistency globally (since q_g = 4)
– In the case of a three node failure, the guarantee is:
• Eventual consistency globally
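• A small arithmetic check of these scenarios (a Python sketch, not part of Cassandra):

    # global quorum q_g = floor(n_g / 2) + 1, where n_g is the sum of
    # the local replication factors
    def global_quorum(local_rfs):
        n_g = sum(local_rfs)
        return n_g // 2 + 1

    # n = 2 per data centre: n_g = 4, q_g = 3, so one node down still
    # leaves three replicas and strong consistency globally
    assert global_quorum([2, 2]) == 3
    # n = 3 per data centre: n_g = 6, q_g = 4
    assert global_quorum([3, 3]) == 4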
About Cluster Nodes and Racks (1)
• Nodes of a cluster are allocated to r racks within a data centre having m nodes by a database administrator
• A rack holds a number of nodes that share:
– Power supply, cooling system, and network connection
– If any of these fails, all the nodes in the rack fail
• Distribution of nodes within racks has a strong influence on database consistency, availability, and load balancing
• Placing all cluster nodes of a data centre in the same rack:
– Considerably degrades database availability because a rack failure brings all nodes down, but
– Offers the greatest node proximity, which slightly improves availability by providing the shortest latency of internode communication within a data centre
• The other extreme, placing each cluster node into a different rack (all nodes in different racks), leads to:
– The greatest availability, and
– The longest latency of internode communication
About Cluster Nodes and Racks (2)
• The node placement in racks trade-off prompts:
– Use of more than one rack for storing nodes of a cluster in a data centre, and
– Avoiding placing nodes that store replicas of the same data in the same rack
• The last requirement asks for the following relationship between the number of racks r used to hold nodes of a cluster and the replication factor n:
r ≥ n
• So, each node of a replica set can be placed in a different rack
About Cluster Nodes and Racks (3)
• A database administrator allocates nodes of a cluster to racks having in mind the inequality r ≥ n, and the following algorithm that Cassandra uses to assign nodes to replica sets in a multi data-centre cluster (a sketch of it follows this list):
– The first node of a replica set is the primary node (the node that owns the token range that is going to be stored on the replica set),
– Each next node of the replica set is the closest neighbour of the current node in the clockwise direction on the ring that resides in the next rack with regard to the current rack,
• If the current rack is the last rack, the next rack is the first rack
– If there is no next node that resides in the next rack, then the next replica node is the first, physically next node in the clockwise direction on the ring
• The algorithm above does not depend on the actual state of nodes (up or down)
• If all nodes are placed in a single rack, then there are no cluster nodes in a different rack, so all replicas end up in that rack
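• A simplified Python sketch of that walk (assuming one token per node and a single data centre; this is an illustration, not Cassandra's actual code):

    def place_replicas(ring, rack_of, primary, n):
        """ring: node ids in token order; rack_of: node id -> rack name;
        primary: index of the primary node in ring; n: replication factor."""
        replicas = [ring[primary]]
        used_racks = {rack_of[ring[primary]]}
        i = primary
        # walk clockwise, taking the next node that sits in a new rack
        for _ in range(len(ring) - 1):
            if len(replicas) == n:
                break
            i = (i + 1) % len(ring)
            if rack_of[ring[i]] not in used_racks:
                replicas.append(ring[i])
                used_racks.add(rack_of[ring[i]])
        # fallback: no unused rack left, take the physically next nodes
        i = primary
        while len(replicas) < n:
            i = (i + 1) % len(ring)
            if ring[i] not in replicas:
                replicas.append(ring[i])
        return replicas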
About Cluster Nodes and Racks (4)
• Each node that Cassandra assigns to a replica set then comes from a different rack, and thus the availability requirement is satisfied
• But that assignment may vary depending on the initial
allocation of nodes to racks
• The following is an allocation algorithm that results in an even workload distribution, if the replication factor n divides the number of nodes m (n | m); a round-robin sketch of it follows below:
– Let node n_i, i ∈ {1, 2, …, m}, be allocated to rack r_j, j ∈ {1, 2, …, r}; then node n_(i+1) is allocated to rack r_p, where p = j + 1 for j = 1, 2, …, r − 1, and p = 1 for j = r
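• A direct transcription of that round-robin rule (Python sketch):

    def allocate(m, r):
        # node i (1..m) goes to rack ((i - 1) mod r) + 1
        return {i: (i - 1) % r + 1 for i in range(1, m + 1)}

    # 6 nodes over 3 racks: each rack holds two nodes, an even load
    assert allocate(6, 3) == {1: 1, 2: 2, 3: 3, 4: 1, 5: 2, 6: 3}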
Examples
Example
• Assume:
– A twelve node cluster has been deployed using two data centres, containing six nodes each
– Each physical node has only one virtual node
– Both data centres have the same replication factor
– The availability requirements are:
• Strong consistency globally for 100% of data when two nodes are down
• Strong consistency globally for 100% of data when two racks are down
• What should be the minimal local replication factor?
• Over how many racks should the cluster be deployed to achieve the availability requirements?
Answer
Data Centre 1: Rack 1 holds nodes 1 and 4, Rack 2 holds nodes 2 and 5, Rack 3 holds nodes 3 and 6
Data Centre 2: Rack 1 holds nodes 7 and 10, Rack 2 holds nodes 8 and 11, Rack 3 holds nodes 9 and 12
Local replication factors n_1 = 3 and n_2 = 3
Global replication factor n = 6
Number of racks r_1 = 3 and r_2 = 3
Local quorums q_1 = 2 and q_2 = 2
Global quorum q = 4
Assume a given data object has been stored on nodes n_4, n_5, and n_6 in dc1 and n_10, n_11, and n_12 in dc2
CREATE KEYSPACE
• The replication strategy and the replication configuration are defined within the CREATE KEYSPACE CQL command
CREATE KEYSPACE
IF NOT EXISTS keyspace_name
WITH REPLICATION = map AND
DURABLE_WRITES = ( true | false )
• The map is used to declare:
– The replica placement strategy class (either Simple or Network Topology), and
– The replication configuration (factor)
The map of CREATE KEYSPACE
• The two different types of keyspaces:
{ 'class' : 'SimpleStrategy',
  'replication_factor' : <integer> };

{ 'class' : 'NetworkTopologyStrategy'
  [, '<data center>' : <integer>,
     '<data center>' : <integer>, … ] };
• The SimpleStrategy is used for evaluating and testing Cassandra:
– It is correlated with the SimpleSnitch
• The NetworkTopologyStrategy should be used for production or for use with mixed workloads:
– It is correlated with the PropertyFileSnitch that uses information in the cassandra-topology.properties file
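• For example, a keyspace with three replicas in dc1 and two in dc2 could be declared as follows (the keyspace and data centre names are illustrative):

    CREATE KEYSPACE IF NOT EXISTS my_ks
    WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy',
                         'dc1' : 3,
                         'dc2' : 2 };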
Configuring NTS
• Before creating a keyspace for use with multiple data centres, the cluster has to be configured to use a network-aware snitch:
– The cassandra.yaml configuration file of each node has to be configured before starting the cluster
– One of the settings to check and (possibly) change is
endpoint_snitch: PropertyFileSnitch
• Data centre names and rack information have to be checked and possibly changed in the cassandra-topology.properties file of each node:
– Cassandra uses dc_i, i = 1, …, as the default data centre names
– Data centre names in the .properties file and in the map of CREATE KEYSPACE have to match exactly, otherwise Cassandra will fail to find a node and to complete a write request
– Allocation of nodes to racks has to be made before starting the cluster
Setting Durable_Writes
• DURABLE_WRITES is an option whose default is true
• When set to false, data written to the keyspace bypasses the commit log:
– There is a risk of losing data
– Do not use it on a keyspace using the SimpleStrategy
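• An illustration of the option (the keyspace and data centre names are made up):

    CREATE KEYSPACE IF NOT EXISTS sensor_cache
    WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'dc1' : 3 }
    AND DURABLE_WRITES = false;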
Summary (1)
• A data centre is a collection of nodes configured for
replication purposes
• A cluster contains nodes from one or more data
centres
• Internode communication is accomplished through
gossiping
• The dynamic snitch is a mechanism for detecting
poorly performing or failing nodes
• Data distribution is done by Consistent Hashing:
– Partitioning is performed by deriving a token from the partition key of a row and storing the row on a node assigned to the first greater token on the consistent hashing ring
Summary (2)
• Data replication is performed according to one of the replication strategies:
– Simple Strategy,
– Network Topology Strategy
• A snitch contains information about the network topology:
– Each snitch corresponds to one replication strategy and vice versa
• The replication strategy and the replication configuration (factor) are declared within a keyspace declaration