2-Node Clustering: Active-Standby Deployment
www.opendaylight.org

2-Node Deployment Topology: Active-Standby Requirements

Requirements
• Configuration of the Primary controller in the cluster (Must)
• The Primary controller services the Northbound IP address; the Secondary takes over the Northbound IP upon failover (Must)
• Configuration of whether the configured Primary controller reasserts leadership on failover and recovery (Must)
• Configuration of the merge strategy on failover and recovery (Want)
• The Primary controller is master of all devices and leader of all shards (Must)
  • Initial config (design to allow for alternatives – multi-shard / multiple device masters)
• Single-node operation allowed (access to datastore on non-quorum) (Want)

Failure of Primary, Scenario 1: Master Stays Offline

Failover Sequence
1. Secondary controller becomes master of all devices and leader of all shards

Failure of Primary, Scenario 2: Primary Comes Back Online

Recovery Sequence
1. Controller A comes back online and its data is replaced by all of Controller B’s data
2. For Re-assert leadership configuration:
   1. (ON) Controller A becomes master of all devices and leader of all shards
   2. (OFF) Controller B stays master of all devices and maintains leadership of all shards
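
The recovery sequence for Scenario 2 (Primary comes back online) can be sketched in a few lines. This is only an illustration of the two steps and the ON/OFF re-assert setting; the `Controller` class, the `leads_cluster` flag, and `recover_primary` are hypothetical names and not part of any ODL API.

```python
from dataclasses import dataclass, field


@dataclass
class Controller:
    name: str
    data: dict = field(default_factory=dict)
    leads_cluster: bool = False  # master of all devices and leader of all shards


def recover_primary(primary, secondary, reassert_leadership):
    """Apply the recovery sequence once the failed Primary is back online."""
    # Step 1: the returning Primary's data is replaced by the Secondary's data.
    primary.data = dict(secondary.data)
    # Step 2: the re-assert setting decides who leads after recovery.
    primary.leads_cluster = reassert_leadership
    secondary.leads_cluster = not reassert_leadership
    return primary if reassert_leadership else secondary


controller_a = Controller("A", {"flows": 1})
controller_b = Controller("B", {"flows": 2}, leads_cluster=True)  # B took over
leader = recover_primary(controller_a, controller_b, reassert_leadership=True)
print(leader.name, controller_a.data)  # prints: A {'flows': 2}
```

With `reassert_leadership=False` the same call leaves Controller B in charge, while Controller A still receives B's data.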

Network Partition, Scenario 1: During Network Partition

Failover Sequence
1. Controller A becomes master of devices in its network segment and leader of all shards
2. Controller B becomes master of devices in its network segment and leader of all shards

Network Partition, Scenario 2: Network Partition Recovers

Recovery Sequence
1. Merge data according to pluggable merge strategy
   (Default: Secondary’s data replaced with Primary’s data.)
2. For Re-assert leadership configuration:
   1. (ON) Controller A becomes master of all devices and leader of all shards again.
   2. (OFF) Controller B becomes master of all devices and leader of all shards again.
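
A pluggable merge strategy for partition recovery could be modelled as a plain function that takes both datasets and returns the merged result. The sketch below assumes datastore contents can be represented as dictionaries; `primary_overrides` mirrors the stated default, while `union_prefer_primary` is a hypothetical alternative included only to show the plug-in point.

```python
def primary_overrides(primary_data, secondary_data):
    """Default from the slide: Secondary's data is replaced with Primary's."""
    return dict(primary_data)


def union_prefer_primary(primary_data, secondary_data):
    """Hypothetical alternative: keep keys from both sides, Primary wins conflicts."""
    merged = dict(secondary_data)
    merged.update(primary_data)
    return merged


def heal_partition(primary_data, secondary_data, strategy=primary_overrides):
    """Run the chosen strategy and give both controllers the merged data."""
    merged = strategy(primary_data, secondary_data)
    return dict(merged), dict(merged)


primary = {"a": 1, "b": 2}
secondary = {"b": 3, "c": 4}
print(heal_partition(primary, secondary))
print(heal_partition(primary, secondary, union_prefer_primary))
```

Keeping the strategy a single callable means the leadership step that follows the merge does not need to know which strategy ran.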

No-Op Failures: Failures That Do Not Result in Any Role Changes

Scenarios
1. Secondary controller failure.
2. Any single link failure.
3. Secondary controller loses network connectivity (but device connections to Primary maintained)

Cluster Configuration Options: Global & Granular Configuration

Global
1. Cluster Leader (aka “Primary”)
   1. Allow this to be changed on a live system, e.g. for maintenance.
   2. Assigned (2-Node Case), Elected (Larger Cluster Case)
2. Cluster Leader Northbound IP
3. Reassert Leadership on Failover and Recovery
4. Network Partition Detection Algorithm (pluggable)
5. Global Overrides of Per Device/Group and Per Shard items (below)

Per Device / Group
6. Master / Slave

Per Shard
7. Shard Leader (Shard Placement Strategy – pluggable)
8. Shard Data Merge (Shard Merge Strategy – pluggable)

HA Deployment Scenarios: Simplified Global HA Settings

Can we abstract configurations into admin-defined deployment scenarios?
E.g. the admin configures 2-Node (Active-Standby): this means the Primary controller is master of all devices and leader of all shards. Conflicting configurations are overridden by the deployment scenario.
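
One way to picture a deployment scenario overriding conflicting settings is a preset that is forced on top of the granular configuration. The scenario name, the setting keys, and `effective_config` below are all hypothetical; the sketch only shows the override rule, not a real ODL configuration schema.

```python
# Hypothetical preset: in 2-node active-standby the Primary holds every role.
SCENARIOS = {
    "2-node-active-standby": {
        "device_master": "primary",
        "shard_leader": "primary",
    },
}


def effective_config(granular, scenario=None):
    """Return the granular settings with the scenario's values forced on top.

    Also returns the keys whose granular value conflicted and was overridden.
    """
    config = dict(granular)
    overridden = []
    for key, value in SCENARIOS.get(scenario, {}).items():
        if key in config and config[key] != value:
            overridden.append(key)
        config[key] = value
    return config, overridden


granular = {
    "device_master": "secondary",   # conflicts with the scenario
    "shard_leader": "primary",
    "reassert_leadership": True,    # not touched by the scenario
}
config, overridden = effective_config(granular, "2-node-active-standby")
print(config["device_master"], overridden)  # prints: primary ['device_master']
```

Returning the list of overridden keys gives the admin a place to hang a warning when a granular setting is silently replaced.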

Implementation Dependencies: Potential Changes to Other ODL Projects

Clustering:
1. Refactoring of Raft Actor vs. 2-Node Raft Actor code.
2. Define Cluster Leader
3. Define Northbound Cluster Leader IP Alias

OpenFlow Plugin:
4. OpenFlow Master/Slave Roles
5. Grouping of Master/Slave Roles (aka “Regions”)

System:
6. Be able to SUSPEND the Secondary controller to support Standby mode.

Open Issues: Follow-up Design Discussion Topics

TBD:
1. Is the Master/Slave definition too tied to OpenFlow? (Generalize?)
   Should device ownership/mastership be implemented by the OF Plugin?
2. How to define the Northbound Cluster Leader IP in a platform-independent way? (Linux/Mac OS X: IP Alias, Windows: Possible)
   Gratuitous ARP on leader change.
3. When both controllers are active in the Network Partition scenario, which controller “owns” the Northbound Cluster Leader IP?
4. Define controller-wide SUSPEND behavior (how?)
5. On failure, the Primary controller should be elected (in the 2-node case, the Secondary is the only option to be elected)
6. How (and whether) to detect management plane failure? (Heartbeat timeout >> worst-case GC?)

Implementation (DRAFT)

Change Summary

Cluster Primary: (OF Master & Shard Leader)

Northbound IP Address
• (Config) Define Northbound IP Alias Address
• (Logic) <Pluggable> Northbound IP Alias Implementation (Platform Dependent)

Behavior
• (Config / Logic) <Pluggable> Define Default Primary Controller
  1. Assigned (Configuration) – Default for 2-Node
  2. Calculated (Election Algorithm)
• Redefine Default Primary Controller on Running Clustering
• (Logic) Control OF Master Role
• (Logic) Control Datastore Shards
  • Global Config (Overridden)
  • Shard Placement (On Primary)
  • <Pluggable> Leadership Determination
    • Match OF Master – Default for 2-Node
    • Election Based (With Influence)

Change Summary (Continued)

Cluster Primary: (OF Master & Shard Leader)

Behavior (Continued)

Network Partition & Failure Detection
• (Config / Logic) <Pluggable> Detection Algorithm – Default: Akka Clustering Alg.

Failover
• (Config / Logic) <Pluggable> Secondary Controller Behavior
  • (Logic) Suspend (Dependent APP, Datastore, etc.)
  • (Logic) Resume (Become Primary) (OF Mastership, Shards Leader, Non-Quorum Datastore Access)

Failback
• (Logic) <Pluggable> Data Merge Strategy – Default: Current Primary Overrides Secondary
• (Config) Primary Re-Asserts Leadership on Failback (OF Master & Shard Leader Roles – After Merge)

Dependencies

1. Southbound Device Ownership & Roles
2. System Suspend Behavior
   How to Enforce System-Wide Suspend When Desired? (Config Subsystem? OSGI?)
3. Config Subsystem
4. Resolving APP Data Notifications?

Measure Failover Times
• No Data Exchange
• Various Data Exchange Cases (Sizes)

RAFT/Sharding Changes (DRAFT)

(Current) Shard Design

ShardManager is an actor that does the following:
• Creates all local shard replicas on a given cluster node and maintains the shard information
• Monitors the cluster members and their status, and stores their addresses
• Finds local shards

Shard is an actor (an instance of RaftActor) that represents a sub-tree within the data store:
• Uses an in-memory data store
• Handles requests from three-phase commit cohorts
• Handles the data change listener requests and notifies the listeners upon state change
• Is responsible for data replication among the shard (data sub-tree) replicas

Shard uses RaftActorBehavior for two tasks:
• Leader election for a given shard
• Data replication

RaftActorBehavior can be in any one of the following roles at any given point in time:
• Leader
• Follower
• Candidate

(Current) Shard Class Diagram

AbstractRaftActorBehavior
  #context : RaftActorContext
  #leaderId : String
  #requestVote(sender:ActorRef, requestVote:RequestVote)
  #handleRequestVoteReply(sender:ActorRef, requestVoteReply:RequestVoteReply)
  #handleAppendEntries(sender:ActorRef, appendEntries:AppendEntries)
  #handleAppendEntriesReply(sender:ActorRef, appendEntriesReply:AppendEntriesReply)
  +handleMessage(sender:ActorRef, message:Object)
  #stopElection()
  #scheduleElection(interval:FiniteDuration)

Follower
  -memberName
  #handleAppendEntries(sender:ActorRef, appendEntries:AppendEntries)
  #handleMessage(sender:ActorRef, originalMessage:Object)
  -handleInstallSnapshot()
  #scheduleElection(interval:FiniteDuration)

Candidate
  -voteCount : int
  -votesRequired : int
  +handleMessage(sender:ActorRef, originalMessage:Object)
  -startNewTerm()
  #handleRequestVoteReply(sender:ActorRef, requestVoteReply:RequestVoteReply)

Leader
  followers : set<String>
  +handleMessage(sender:ActorRef, originalMessage:Object)
  -replicate(replicate:Replicate)
  -sendHeartBeat()
  -installSnapShotIfNeeded()
  -handleInstallSnapshotReply(reply:InstallSnapshotReply)
  -sendAppendEntries()

Raft Actor
  #context : RaftActorContext
  -currentBehavior : RaftActorBehavior
  +onReceiveRecover(message : Object)
  +onReceiveCommand(message : Object)
  #onLeaderChanged()
  -switchBehavior(state : RaftState)

<<interface>> RaftActorBehavior
  handleMessage(sender:ActorRef, message:Object)
  state()
  getLeaderId()

(association multiplicity shown in the diagram: 1 to 1)

Shard
  -configParams : ConfigParams
  -store : InMemoryDOMDataStore
  -name : ShardIdentifier
  -dataStoreContext : DataStoreContext
  -schemaContext : SchemaContext
  +onReceiveRecover(message:Object)
  +onReceiveCommand(message:Object)
  +commit(sender:ActorRef, serialized:Object)

(Proposed) Shard Design

Intent
• Support a two-node cluster by separating shard data replication from leader election
• Elect one of the ODL nodes as “master” and mark it as “Leader” for all the shards
• Make leader election pluggable
• The current Raft leader election logic should still work for a 3-node deployment

Design Idea
• Minimize the impact on “ShardManager” and “Shard”
• Separate the ‘leader election’ and ‘data replication’ logic within the ‘RaftActorBehavior’ classes
• Create two separate abstract classes and interfaces for ‘leader election’ and ‘data replication’
• The Shard actor will contain a reference to a ‘RaftReplicationActorBehavior’ instance (currentBehavior)
• ‘RaftReplicationActorBehavior’ will contain a reference to an ‘ElectionActorBehavior’ instance
• Both ‘RaftReplicationActorBehavior’ and ‘ElectionActorBehavior’ instances will be in one of the following roles at any given point in time:
  • Leader
  • Follower
  • Candidate
• “RaftReplicationActorBehavior” will update its “ElectionActorBehavior” instance based on the message received. The message could be sent either by one of the “ElectionActorBehavior” instances or by a module that implements the “2-node cluster” logic.
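
The proposed split can be sketched as a Shard that holds one replication behavior, which in turn holds one election behavior. This is a simplified Python illustration, not the ODL Java code: the class names shorten the proposed ones, the message format is invented, and the replication behavior changes its own role in place instead of being replaced by a new role instance.

```python
from enum import Enum


class RaftState(Enum):
    CANDIDATE = "Candidate"
    FOLLOWER = "Follower"
    LEADER = "Leader"


class ElectionBehavior:
    """Stands in for the proposed ElectionActorBehavior role instances."""

    def __init__(self, state):
        self.state = state


class ReplicationBehavior:
    """Stands in for the proposed RaftReplicationActorBehavior role instances."""

    def __init__(self, state):
        self.state = state
        self.election = ElectionBehavior(state)

    def switch_election_behavior(self, state):
        self.election = ElectionBehavior(state)

    def handle_message(self, message):
        # Hypothetical message: sent by an election behavior or by a module
        # that implements the 2-node cluster logic.
        if message.get("type") == "role_change":
            new_state = RaftState[message["role"]]
            self.state = new_state
            self.switch_election_behavior(new_state)
        return self.state


class Shard:
    def __init__(self):
        # Shards start out as candidates for both election and replication.
        self.current_behavior = ReplicationBehavior(RaftState.CANDIDATE)

    def on_receive_command(self, message):
        return self.current_behavior.handle_message(message)


shard = Shard()
state = shard.on_receive_command({"type": "role_change", "role": "LEADER"})
print(state.value, shard.current_behavior.election.state.value)  # prints: Leader Leader
```

Because the role change arrives as an ordinary message, the same Shard code serves both a Raft election and an externally assigned 2-node role.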

(Proposed) Shard Class Diagram

Shard
  -configParams : ConfigParams
  -store : InMemoryDOMDataStore
  -name : ShardIdentifier
  -dataStoreContext : DataStoreContext
  -schemaContext : SchemaContext
  +onReceiveRecover(message:Object)
  +onReceiveCommand(message:Object)
  +commit(sender:ActorRef, serialized:Object)

Raft Actor
  #context : RaftActorContext
  -currentBehavior : RaftActorBehavior
  +onReceiveRecover(message : Object)
  +onReceiveCommand(message : Object)
  #onLeaderChanged()
  -switchBehavior(state : RaftState)

AbstractRaftReplicationActorBehavior
  #context : RaftActorContext
  #handleAppendEntries(sender:ActorRef, appendEntries:AppendEntries)
  #handleAppendEntriesReply(sender:ActorRef, appendEntriesReply:AppendEntriesReply)
  +handleMessage(sender:ActorRef, message:Object)
  #applyLogToStateMachine(index:long)

AbstractElectionActorBehavior
  #context : RaftActorContext
  #leaderId : String
  #requestVote(sender:ActorRef, requestVote:RequestVote)
  #handleRequestVoteReply(sender:ActorRef, requestVoteReply:RequestVoteReply)
  +handleMessage(sender:ActorRef, message:Object)
  #stopElection()
  #scheduleElection(interval:FiniteDuration)
  #currentTerm()
  #voteFor()

(association multiplicity shown in the diagram: 1 to 1)

<<interface>> ElectionActorBehavior
  handleMessage(sender:ActorRef, message:Object)
  state()
  getLeaderId()

<<interface>> RaftReplicationActorBehavior
  handleMessage(sender:ActorRef, message:Object)
  +switchElectionBehavior(:RaftState)
  getLeaderId()

Leader (election behavior)
  followers : set<String>
  heartbeatSchedule : Cancellable
  #handleRequestVoteReply(:ActorRef, :RequestVoteReply)
  -sendHeartBeat()

Candidate (election behavior)
  -voteCount : int
  +handleMessage(sender:ActorRef, originalMessage:Object)
  -startNewTerm()
  #handleRequestVoteReply(:ActorRef, :RequestVoteReply)

Follower (election behavior)
  #handleRequestVoteReply(:ActorRef, :RequestVoteReply)
  #scheduleElection(interval:FiniteDuration)

Leader (replication behavior)
  -minReplicationLog : int
  #handleAppendEntries(sender:ActorRef, appendEntries:AppendEntries)
  -handleInstallSnapshotReply
  -sendAppendEntries()
  -installSnapshotIfNeeded()
  +sendSnapshotChunk(:ActorSelection, :String)

Candidate (replication behavior)
  -startNewTerm()
  #handleAppendEntriesReply(ActorRef, AppendEntriesReply)

Follower (replication behavior)
  snapshotChunksCollected : ByteString
  +handleInstallSnapShot(:ActorRef, :InstallSnapshot)
  #handleAppendEntriesReply(:ActorRef, :AppendEntriesReply)
  #handleAppendEntries(:ActorRef, :AppendEntries)

(association multiplicity shown in the diagram: 1 to 1)

2-Node Cluster Workflow

Method-1: Run the 2-node cluster protocol outside of ODL
• An external cluster protocol decides which node is ‘master’ and which node is ‘standby’. Once the master election is complete, the master sends node roles and node membership information to all the ODL instances.
• The ‘Cluster module’ within ODL defines the ‘cluster node’ model and provides REST APIs to configure the cluster information by modifying the *.conf files.
• The ‘Cluster module’ will send RAFT messages about cluster information (membership & shard RAFT state) to all the other cluster members.
• ‘ShardActors’ on both cluster nodes will handle these messages, instantiate the corresponding “replication Behavior” & “election Behavior” role instances, and switch to the new roles.
• The Northbound virtual IP is OS dependent and out of scope here.
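
For Method-1, the step from externally decided node roles to per-shard RAFT state can be sketched as a simple mapping. The node names, shard names, and `roles_to_shard_states` are hypothetical, and mapping ‘standby’ to ‘Follower’ is an assumption; the slides only state that the master is marked Leader for all shards.

```python
def roles_to_shard_states(node_roles, shards_by_node):
    """Map externally assigned node roles to a RAFT state for each local shard."""
    # Assumption: master -> Leader for every shard, standby -> Follower.
    state_for_role = {"master": "Leader", "standby": "Follower"}
    return {
        node: {shard: state_for_role[node_roles[node]] for shard in shards}
        for node, shards in shards_by_node.items()
    }


node_roles = {"odl-1": "master", "odl-2": "standby"}
shards_by_node = {
    "odl-1": ["inventory", "topology"],
    "odl-2": ["inventory", "topology"],
}
shard_states = roles_to_shard_states(node_roles, shards_by_node)
print(shard_states["odl-1"])  # prints: {'inventory': 'Leader', 'topology': 'Leader'}
```

Each entry in the result corresponds to one role-change message that a ShardActor would handle by switching its behavior instances.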

Reference diagram for Method-2
[Figure. Labels: Cluster protocol – Primary path; 1a. Switch to controller connectivity state polling; 1b. Cluster protocol – Secondary path]

2-Node Cluster Workflow

Method-2: Run the cluster protocol within ODL
• The ‘Cluster Module’ within each ODL instance talks to the other ODL instance and elects the ‘master’ and ‘standby’ nodes.
• If the cluster protocol times out, a node will check other factors (probably cross-checking with connected OpenFlow switches for ‘primary’ controller information, or using an alternative path) for the new master election.
• The ‘Cluster module’ will send RAFT messages about ‘cluster information’ (membership & shard RAFT state) to all the other cluster members.
• ‘ShardActors’ on both cluster nodes will handle these messages, instantiate the corresponding “replication Behavior” & “election Behavior” role instances, and switch to the new roles.
• The Northbound virtual IP is OS dependent and out of scope here.
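
The Method-2 timeout check could look roughly like the following. The slides leave the "other factors" open, so the rule used here (take over only when fewer than half of the connected switches still see the primary) is purely an assumption for illustration, as are the function and parameter names.

```python
def standby_should_take_over(heartbeat_timed_out, switches_seeing_primary, switches_total):
    """Decide whether the standby should start a new master election for itself."""
    if not heartbeat_timed_out:
        return False  # cluster protocol is healthy: no role change
    if switches_total == 0:
        return False  # assumption: with no switch evidence, stay standby
    # Assumed rule: the primary is considered gone only if most switches lost it too.
    return switches_seeing_primary * 2 < switches_total


# Peer timed out, but 3 of 4 switches still see the primary: likely a link failure.
print(standby_should_take_over(True, 3, 4))  # prints: False
# Peer timed out and no switch sees the primary any more.
print(standby_should_take_over(True, 0, 4))  # prints: True
```

Cross-checking with the switches is what lets a node tell a dead peer apart from a broken controller-to-controller link.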

3-Node Cluster Workflow

• Shard Manager will create the local shards based on the shard configuration.
• Each shard will start off as a ‘candidate’ both for role election and for ‘data replication’ messages, by instantiating the ‘ElectionBehavior’ and ‘ReplicationBehavior’ classes in the ‘Candidate’ role.
• A candidate node will start sending ‘requestForVote’ messages to the other members.
• The leader is elected based on the ‘Raft leader election’ algorithm, and each elected shard will set its state to ‘Leader’ by switching its ‘ElectionBehavior’ & ‘ReplicationBehavior’ instances to ‘Leader’.
• When the remaining candidates receive the leader assertion messages, they will move to the ‘Follower’ state by switching their ‘ElectionBehavior’ & ‘ReplicationBehavior’ instances to ‘Follower’.
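
The majority rule behind the 3-node workflow can be shown with a heavily simplified vote count. This sketch ignores terms, logs, and timeouts that real Raft uses; `run_election`, the member names, and the `grants_vote` callback are hypothetical.

```python
def run_election(candidate, members, grants_vote):
    """Return True if the candidate collects a majority of the cluster's votes."""
    votes = 1  # a candidate votes for itself
    for member in members:
        if member != candidate and grants_vote(member):
            votes += 1
    return votes > len(members) // 2


members = ["member-1", "member-2", "member-3"]
states = {member: "Candidate" for member in members}

# member-2 grants its vote; member-3 does not answer.
if run_election("member-1", members, lambda member: member == "member-2"):
    states = {m: ("Leader" if m == "member-1" else "Follower") for m in members}

print(states)
```

The same function shows why the 2-node case needs separate handling: with two members and the peer down, `run_election("member-1", ["member-1", "member-2"], lambda member: False)` returns `False`, because one vote is not a majority of two.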

(Working Proposal) ConsensusStrategy

Provide Hooks to Influence Key RAFT Decisions (Shard Leader Election / Data Replication)
https://git.opendaylight.org/gerrit/#/c/12588/
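
To make the idea of a ConsensusStrategy hook concrete, here is one possible shape for it. This is not the API of the Gerrit change referenced above; the method names `votes_required` and `replicas_required`, both strategy classes, and `wins_election` are invented for illustration.

```python
from abc import ABC, abstractmethod


class ConsensusStrategy(ABC):
    """Hypothetical hook consulted at the two key RAFT decision points."""

    @abstractmethod
    def votes_required(self, cluster_size):
        """Votes a candidate needs to become shard leader."""

    @abstractmethod
    def replicas_required(self, cluster_size):
        """Copies of a log entry needed before it counts as committed."""


class MajorityStrategy(ConsensusStrategy):
    """Standard Raft behavior: a strict majority for both decisions."""

    def votes_required(self, cluster_size):
        return cluster_size // 2 + 1

    def replicas_required(self, cluster_size):
        return cluster_size // 2 + 1


class SingleNodeStrategy(ConsensusStrategy):
    """Hypothetical 2-node variant: a lone surviving node may lead and commit."""

    def votes_required(self, cluster_size):
        return 1

    def replicas_required(self, cluster_size):
        return 1


def wins_election(strategy, votes, cluster_size):
    return votes >= strategy.votes_required(cluster_size)


print(wins_election(MajorityStrategy(), votes=1, cluster_size=2))    # prints: False
print(wins_election(SingleNodeStrategy(), votes=1, cluster_size=2))  # prints: True
```

Keeping the thresholds behind an interface leaves the Raft code path unchanged for 3-node clusters while allowing non-quorum operation in the 2-node case.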

Config Changes (DRAFT)

(Current) Config

Config Files (Karaf: /config/initial)
• Read Once on Startup (Default Settings For New Modules)
• (sal-clustering-commons) Hosts Akka & Config Subsystem Reader/Resolver/Validator
• Currently No Config Subsystem Config Properties Defined?

Akka/Cluster Config: (akka.conf)
• Akka-Specific Settings (actorspaces data/rpc, mailbox, logging, serializers, etc.)
• Cluster Config (IPs, names, network parameters)

Shard Config: (modules.conf, module-shards.conf)
• Shard Name / Namespace
• Sharding Strategies
• Replication (# and Location)
• Default Config

(Proposal) Config

Intent
• Continue to Keep Config Outside of Shard/RAFT/DistributedDatastore Code
• Provide Sensible Defaults and Validate Settings When Possible
• Error/Warn on Any Changes That Are Not Allowed On a Running System
• Provide REST Config Access (where appropriate)

Design Idea
• Host Configuration Settings in Config Subsystem
• Investigate Using Karaf Cellar To Distribute Common Cluster-Wide Config
• Move Current Config Processing (org.opendaylight.controller.cluster.common.actor) to existing sal-clustering-config?
• Akka-Specific Config:
  • Make Most of Existing akka.conf File as Default Settings
  • Separate Cluster Member Config (see Cluster Config)
  • Options:
    • Provide Specific Named APIs, e.g. setTCPPort()
    • Allow Akka <type,value> Config To Be Set Directly
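
The two Akka-specific options above (named APIs versus setting Akka values directly) can be compared side by side in a small sketch. The `AkkaConfig` class and its methods are hypothetical; `akka.remote.netty.tcp.port` is used as an example of a classic Akka remoting key and is assumed here to be the setting a `setTCPPort()` call would target.

```python
class AkkaConfig:
    """Hypothetical wrapper around Akka settings loaded from default values."""

    def __init__(self, defaults):
        self._settings = dict(defaults)

    def set_tcp_port(self, port):
        """Option 1: a specific named API that can validate its argument."""
        if not 0 < port < 65536:
            raise ValueError(f"invalid TCP port: {port}")
        self._settings["akka.remote.netty.tcp.port"] = port

    def set_raw(self, key, value):
        """Option 2: set any Akka key directly, with no validation."""
        self._settings[key] = value

    def get(self, key):
        return self._settings[key]


akka_config = AkkaConfig({"akka.loglevel": "INFO"})
akka_config.set_tcp_port(2550)
akka_config.set_raw("akka.loglevel", "DEBUG")
print(akka_config.get("akka.remote.netty.tcp.port"), akka_config.get("akka.loglevel"))
# prints: 2550 DEBUG
```

Named APIs support the "validate settings when possible" intent, while direct key/value access covers settings nobody anticipated; the two can coexist.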

(Proposal) Config Design Idea (Continued)

Cluster Config:
• Provide a Single Point For Configuring A Cluster
  • Feeds Back to Akka-Specific Settings, etc.
• Define Northbound Cluster IP Config (alias)

Shard Config:
• Define Shard Config (Name / Namespace / Sharding Strategy)
• Will NOT Support Changing Running Shard For Now

‘Other’ Config:
• 2-Node:
  • Designate Cluster’s Primary Node or Election Algorithm (dynamic)
  • Failback to Primary Node (dynamic)
• Strategies (Influence These in RAFT) – Separate Bundles?
  • Election
  • Consensus

Northbound IP Alias (DRAFT)