www.opendaylight.org 2-node clustering active-standby deployment


2-Node Clustering
Active-Standby Deployment


Requirements
• Configuration of Primary controller in cluster (Must)
• Primary controller services the Northbound IP address; a Secondary takes over the NB IP upon failover (Must)
• Configuration of whether, on failover & recovery, the configured Primary controller reasserts leadership (Must)
• Configuration of merge strategy on failover & recovery (Want)
• Primary controller is master of all devices and is leader of all shards (Must)
  • Initial Config (design to allow for alternatives – multi-shard / multiple device masters)
• Single node operation allowed (access to datastore on non-quorum) (Want)

2-Node Deployment Topology
Active-Standby Requirements


Failover Sequence
1. Secondary controller becomes master of all devices and leader of all shards

Failure of Primary
Scenario 1: Master Stays Offline


Recovery Sequence
1. Controller A comes back online and its data is replaced by all of Controller B’s data
2. For Re-assert leadership configuration:
   1. (ON) Controller A becomes master of all devices and leader of all shards
   2. (OFF) Controller B stays master of all devices and maintains leadership of all shards

Failure of Primary
Scenario 2: Primary Comes Back Online
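The two failure-of-Primary scenarios reduce to a small decision on the re-assert flag. A minimal sketch in Python, assuming hypothetical names (`Roles`, `roles_after_failover`, `roles_after_primary_recovery`) that are not part of any ODL API:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Roles:
    device_master: str  # controller that is master of all devices
    shard_leader: str   # controller that is leader of all shards


def roles_after_failover(secondary: str) -> Roles:
    """Scenario 1: the Primary is offline, so the Secondary takes every role."""
    return Roles(device_master=secondary, shard_leader=secondary)


def roles_after_primary_recovery(primary: str, secondary: str,
                                 reassert_leadership: bool) -> Roles:
    """Scenario 2: the Primary is back and its data has already been replaced
    by the Secondary's data; only the role assignment is decided here."""
    owner = primary if reassert_leadership else secondary
    return Roles(device_master=owner, shard_leader=owner)


on = roles_after_primary_recovery("A", "B", reassert_leadership=True)
off = roles_after_primary_recovery("A", "B", reassert_leadership=False)
```

With the flag ON the roles return to Controller A; with it OFF they stay on Controller B, matching the recovery sequence on the slide.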


Failover Sequence
1. Controller A becomes master of devices in its network segment and leader of all shards
2. Controller B becomes master of devices in its network segment and leader of all shards

Network Partition
Scenario 1: During Network Partition


Recovery Sequence
1. Merge data according to pluggable merge strategy (Default: Secondary’s data replaced with Primary’s data.)
2. For Re-assert leadership configuration:
   1. (ON) Controller A becomes master of all devices and leader of all shards again
   2. (OFF) Controller B becomes master of all devices and leader of all shards again

Network Partition
Scenario 2: Network Partition Recovers
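One way to picture the pluggable merge strategy is as a function passed into the recovery step. The sketch below is an illustration only: `primary_overrides` encodes the default stated on the slide, while `union_primary_wins` and all other names are hypothetical and not taken from ODL.

```python
from typing import Callable, Dict

Data = Dict[str, str]
# A merge strategy takes (primary_data, secondary_data) and returns the merged view.
MergeStrategy = Callable[[Data, Data], Data]


def primary_overrides(primary_data: Data, secondary_data: Data) -> Data:
    """Default named on the slide: the Secondary's data is replaced by the Primary's."""
    return dict(primary_data)


def union_primary_wins(primary_data: Data, secondary_data: Data) -> Data:
    """Hypothetical alternative: keep keys from both sides, Primary wins conflicts."""
    return {**secondary_data, **primary_data}


def recover_from_partition(primary_data: Data, secondary_data: Data,
                           merge: MergeStrategy = primary_overrides) -> Data:
    """Return the data both controllers hold once the partition has healed."""
    return merge(primary_data, secondary_data)


a = {"flow-1": "drop", "flow-2": "forward"}   # written on Controller A's side
b = {"flow-2": "flood", "flow-3": "forward"}  # written on Controller B's side
default_result = recover_from_partition(a, b)
union_result = recover_from_partition(a, b, merge=union_primary_wins)
```

Under the default, anything written only on the Secondary's side during the partition (here `flow-3`) is lost, which is the trade-off a different strategy would have to address.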


Scenarios
1. Secondary controller failure.
2. Any single link failure.
3. Secondary controller loses network connectivity (but device connections to Primary maintained)

No-Op Failures
Failures That Do Not Result in Any Role Changes


Global
1. Cluster Leader (aka “Primary”)
   1. Allow this to be changed on live system, e.g. maintenance.
   2. Assigned (2-Node Case), Elected (Larger Cluster Case)
2. Cluster Leader Northbound IP
3. Reassert Leadership on Failover and Recovery
4. Network Partition Detection Alg. (pluggable)
5. Global Overrides of Per Device/Group and Per Shard items (below)

Per Device / Group
6. Master / Slave

Per Shard
7. Shard Leader (Shard Placement Strategy – pluggable)
8. Shard Data Merge (Shard Merge Strategy – pluggable)

Cluster Configuration Options
Global & Granular Configuration


Can we Abstract Configurations to Admin-Defined Deployment Scenarios?
e.g. Admin Configures 2-Node (Active-Standby):
This means the Primary controller is master of all devices and leader of all shards. Conflicting configurations are overridden by the deployment scenario.

HA Deployment Scenarios
Simplified Global HA Settings
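A rough sketch of how a deployment scenario could override granular settings. The field names loosely follow the global, per-device and per-shard options listed earlier, but `ClusterConfig` and `apply_active_standby` are hypothetical and do not correspond to an existing ODL config model.

```python
from dataclasses import dataclass, field, replace
from typing import Dict


@dataclass(frozen=True)
class ClusterConfig:
    primary: str                      # cluster leader (aka "Primary")
    northbound_ip: str                # cluster leader Northbound IP
    reassert_leadership: bool = True  # re-assert on failover and recovery
    device_masters: Dict[str, str] = field(default_factory=dict)  # device -> controller
    shard_leaders: Dict[str, str] = field(default_factory=dict)   # shard -> controller


def apply_active_standby(cfg: ClusterConfig) -> ClusterConfig:
    """Override per-device and per-shard settings so the Primary owns everything."""
    return replace(
        cfg,
        device_masters={dev: cfg.primary for dev in cfg.device_masters},
        shard_leaders={shard: cfg.primary for shard in cfg.shard_leaders},
    )


granular = ClusterConfig(
    primary="controller-a",
    northbound_ip="192.0.2.10",  # documentation-range address, purely illustrative
    device_masters={"switch-1": "controller-a", "switch-2": "controller-b"},
    shard_leaders={"inventory": "controller-b", "topology": "controller-a"},
)
effective = apply_active_standby(granular)
```

The original granular settings stay untouched, so an admin could switch scenarios later without losing them.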


Clustering:
1. Refactoring of Raft Actor vs. 2-Node Raft Actor code.
2. Define Cluster Leader
3. Define Northbound Cluster Leader IP Alias

OpenFlow Plugin:
4. OpenFlow Master/Slave Roles
5. Grouping of Master/Slave Roles (aka “Regions”)

System:
6. Be Able to SUSPEND the Secondary controller to support Standby mode.

Implementation Dependencies
Potential Changes to Other ODL Projects


TBD:
1. Is Master/Slave definition too tied to OpenFlow? (Generalize?)
   • Should device ownership/mastership be implemented by OF Plugin?
2. How to define Northbound Cluster Leader IP in a platform-independent way? (Linux/Mac OS X: IP Alias, Windows: Possible)
   • Gratuitous ARP on Leader Change.
3. When both Controllers are active in the Network Partition scenario, which controller “owns” the Northbound Cluster Leader IP?
4. Define Controller-Wide SUSPEND behavior (how?)
5. On failure, a new Primary controller should be elected (in the 2-node case the Secondary is the only option to be elected)
6. How/Need to detect management plane failure? (Heartbeat timeout >> worst-case GC pause?)

Open Issues
Follow-up Design Discussion Topics


Implementation (DRAFT)


Cluster Primary: (OF Master & Shard Leader)

Northbound IP Address
• (Config) Define Northbound IP Alias Address
• (Logic) <Pluggable> Northbound IP Alias Implementation (Platform Dependent)

Behavior
• (Config / Logic) <Pluggable> Define Default Primary Controller
  1. Assigned (Configuration) – Default for 2-Node
  2. Calculated (Election Algorithm)
• Redefine Default Primary Controller on Running Cluster
• (Logic) Control OF Master Role
• (Logic) Control Datastore Shards
  • Global Config (Overridden)
  • Shard Placement (On Primary)
  • <Pluggable> Leadership Determination
    • Match OF Master – Default for 2-Node
    • Election Based (With Influence)

Change Summary


Cluster Primary: (OF Master & Shard Leader)

Behavior (Continued)

Network Partition & Failure Detection
• (Config / Logic) <Pluggable> Detection Algorithm – Default: Akka Clustering Alg.

Failover
• (Config / Logic) <Pluggable> Secondary Controller Behavior
  • (Logic) Suspend (Dependent APP, Datastore, etc.)
  • (Logic) Resume (Become Primary) (OF Mastership, Shards Leader, Non-Quorum Datastore Access)

Failback
• (Logic) <Pluggable> Data Merge Strategy – Default: Current Primary Overrides Secondary
• (Config) Primary Re-Asserts Leadership on Failback (OF Master & Shard Leader Roles – After Merge)

Change Summary (Continued)


1. Southbound
   • Device Ownership & Roles
2. System Suspend Behavior
   • How to Enforce System-Wide Suspend When Desired? (Config Subsystem? OSGi?)
3. Config Subsystem
4. Resolving APP Data Notifications?

Measure Failover Times
• No Data Exchange
• Various Data Exchange Cases (Sizes)

Dependencies


RAFT/Sharding Changes (DRAFT)


(Current) Shard Design
• ShardManager is an actor that does the following:
  • Creates all local shard replicas on a given cluster node and maintains the shard information
  • Monitors the cluster members and their status, and stores their addresses
  • Finds local shards
• Shard is an actor (instance of RaftActor) which represents a sub-tree within the data store
  • Uses an in-memory data store
  • Handles requests from three-phase commit cohorts
  • Handles the data change listener requests and notifies the listeners upon state change
  • Responsible for data replication among the shard (data sub-tree) replicas
• Shard uses RaftActorBehavior for two tasks
  • Leader election for a given shard
  • Data replication
• RaftActorBehavior can be in any of the following roles at any given point in time
  • Leader
  • Follower
  • Candidate
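The three roles and the moves between them can be sketched as a small state machine. The transition table follows the standard Raft algorithm; `RaftActorSketch` is a hypothetical stand-in that only loosely mirrors RaftActor's currentBehavior and switchBehavior, not the actual ODL class.

```python
from enum import Enum


class RaftState(Enum):
    FOLLOWER = "Follower"
    CANDIDATE = "Candidate"
    LEADER = "Leader"


# Role changes permitted by the standard Raft algorithm.
ALLOWED = {
    # election timeout expires
    RaftState.FOLLOWER: {RaftState.CANDIDATE},
    # new election round, won the vote, or discovered a leader / higher term
    RaftState.CANDIDATE: {RaftState.CANDIDATE, RaftState.LEADER, RaftState.FOLLOWER},
    # discovered a higher term
    RaftState.LEADER: {RaftState.FOLLOWER},
}


class RaftActorSketch:
    """Holds exactly one current role at any point in time."""

    def __init__(self, initial: RaftState) -> None:
        self.state = initial

    def switch_behavior(self, new_state: RaftState) -> None:
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"{self.state.value} -> {new_state.value} is not a Raft transition")
        self.state = new_state


actor = RaftActorSketch(RaftState.FOLLOWER)
actor.switch_behavior(RaftState.CANDIDATE)
actor.switch_behavior(RaftState.LEADER)
```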


(Current) Shard Class Diagram

AbstractRaftActorBehavior
  #context : RaftActorContext
  #leaderId : String
  #requestVote(sender:ActorRef, requestVote:RequestVote)
  #handleRequestVoteReply(sender:ActorRef, requestVoteReply:RequestVoteReply)
  #handleAppendEntries(sender:ActorRef, appendEntries:AppendEntries)
  #handleAppendEntriesReply(sender:ActorRef, appendEntriesReply:AppendEntriesReply)
  +handleMessage(sender:ActorRef, message:Object)
  #stopElection()
  #scheduleElection(interval:FiniteDuration)

Follower
  -memberName
  #handleAppendEntries(sender:ActorRef, appendEntries:AppendEntries)
  #handleMessage(sender:ActorRef, originalMessage:Object)
  -handleInstallSnapshot()
  #scheduleElection(interval:FiniteDuration)

Candidate
  -voteCount : int
  -votesRequired : int
  +handleMessage(sender:ActorRef, originalMessage:Object)
  -startNewTerm()
  #handleRequestVoteReply(sender:ActorRef, requestVoteReply:RequestVoteReply)

Leader
  followers : set<String>
  +handleMessage(sender:ActorRef, originalMessage:Object)
  -replicate(replicate:Replicate)
  -sendHeartBeat()
  -installSnapShotIfNeeded()
  -handleInstallSnapshotReply(reply:InstallSnapshotReply)
  -sendAppendEntries()

Raft Actor
  #context : RaftActorContext
  -currentBehavior : RaftActorBehavior
  +onReceiveRecover(message:Object)
  +onReceiveCommand(message:Object)
  #onLeaderChanged()
  -switchBehavior(state:RaftState)

<<interface>> RaftActorBehavior
  handleMessage(sender:ActorRef, message:Object)
  state()
  getLeaderId()

Association multiplicity: 1 to 1

Shard
  -configParams : ConfigParams
  -store : InMemoryDOMDataStore
  -name : ShardIdentifier
  -dataStoreContext : DataStoreContext
  -schemaContext : SchemaContext
  +onReceiveRecover(message:Object)
  +onReceiveCommand(message:Object)
  +commit(sender:ActorRef, serialized:Object)


(Proposed) Shard Design

Intent
• Support a two-node cluster by separating shard data replication from leader election
• Elect one of the ODL nodes as “master” and mark it as “Leader” for all the shards
• Make leader election pluggable
• Current Raft leader election logic should work for 3-node deployment

Design Idea
• Minimize the impact on “ShardManager” and “Shard”
• Separate the ‘leader election’ and ‘data replication’ logic within the ‘RaftActorBehavior’ classes
• Create two separate abstract classes and interfaces for ‘leader election’ and ‘data replication’
• Shard actor will contain a reference to a ‘RaftReplicationActorBehavior’ instance (currentBehavior)
• ‘RaftReplicationActorBehavior’ will contain a reference to an ‘ElectionActorBehavior’ instance
• Both ‘RaftReplicationActorBehavior’ and ‘ElectionActorBehavior’ instances will be in one of the following roles at any given point in time
  • Leader
  • Follower
  • Candidate
• “RaftReplicationActorBehavior” will update its “ElectionActorBehavior” instance based on the message received. The message could be sent either by one of the “ElectionActorBehavior” instances or by a module that implements the “2-node cluster” logic.
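The reference chain in this design idea (shard, then replication behavior, then election behavior) can be sketched as follows. All classes are simplified, hypothetical stand-ins written in Python for illustration; they are not the proposed Java classes, and the role-change handling is an assumption about how the pieces might fit together.

```python
from enum import Enum


class RaftState(Enum):
    LEADER = "Leader"
    FOLLOWER = "Follower"
    CANDIDATE = "Candidate"


class ElectionBehavior:
    """Stand-in for an ElectionActorBehavior instance in one role."""

    def __init__(self, state: RaftState) -> None:
        self.state = state


class ReplicationBehavior:
    """Stand-in for a RaftReplicationActorBehavior instance; it holds the
    reference to the election behavior, as the design idea describes."""

    def __init__(self, state: RaftState) -> None:
        self.state = state
        self.election = ElectionBehavior(state)


class ShardSketch:
    """Stand-in for the Shard actor, which only sees the replication behavior."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.current_behavior = ReplicationBehavior(RaftState.CANDIDATE)
        self.last_role_source = None

    def on_role_message(self, new_state: RaftState, source: str) -> None:
        # The message may come from an election behavior or from a module
        # implementing the 2-node cluster logic; the shard reacts identically.
        self.current_behavior = ReplicationBehavior(new_state)
        self.last_role_source = source


shard = ShardSketch("inventory")
shard.on_role_message(RaftState.LEADER, source="two-node-cluster-module")
```

Because the shard does not care who sent the role message, the election logic can be swapped (Raft voting for 3 nodes, assignment for 2 nodes) without touching replication.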


(Proposed) Shard Class Diagram

Shard
  -configParams : ConfigParams
  -store : InMemoryDOMDataStore
  -name : ShardIdentifier
  -dataStoreContext : DataStoreContext
  -schemaContext : SchemaContext
  +onReceiveRecover(message:Object)
  +onReceiveCommand(message:Object)
  +commit(sender:ActorRef, serialized:Object)

Raft Actor
  #context : RaftActorContext
  -currentBehavior : RaftActorBehavior
  +onReceiveRecover(message:Object)
  +onReceiveCommand(message:Object)
  #onLeaderChanged()
  -switchBehavior(state:RaftState)

AbstractRaftReplicationActorBehavior
  #context : RaftActorContext
  #handleAppendEntries(sender:ActorRef, appendEntries:AppendEntries)
  #handleAppendEntriesReply(sender:ActorRef, appendEntriesReply:AppendEntriesReply)
  +handleMessage(sender:ActorRef, message:Object)
  #applyLogToStateMachine(index:long)

AbstractElectionActorBehavior
  #context : RaftActorContext
  #leaderId : String
  #requestVote(sender:ActorRef, requestVote:RequestVote)
  #handleRequestVoteReply(sender:ActorRef, requestVoteReply:RequestVoteReply)
  +handleMessage(sender:ActorRef, message:Object)
  #stopElection()
  #scheduleElection(interval:FiniteDuration)
  #currentTerm()
  #voteFor()

Association multiplicity: 1 to 1

<<interface>> ElectionActorBehavior
  handleMessage(sender:ActorRef, message:Object)
  state()
  getLeaderId()

<<interface>> RaftReplicationActorBehavior
  handleMessage(sender:ActorRef, message:Object)
  +switchElectionBehavior(:RaftState)
  getLeaderId()

Leader (election behavior)
  followers : set<String>
  heartbeatSchedule : Cancellable
  #handleRequestVoteReply(:ActorRef, :RequestVoteReply)
  -sendHeartBeat()

Candidate (election behavior)
  -voteCount : int
  +handleMessage(sender:ActorRef, originalMessage:Object)
  -startNewTerm()
  #handleRequestVoteReply(:ActorRef, :RequestVoteReply)

Follower (election behavior)
  #handleRequestVoteReply(:ActorRef, :RequestVoteReply)
  #scheduleElection(interval:FiniteDuration)

Leader (replication behavior)
  -minReplicationLog : int
  #handleAppendEntries(sender:ActorRef, appendEntries:AppendEntries)
  -handleInstallSnapshotReply
  -sendAppendEntries()
  -installSnapshotIfNeeded()
  +sendSnapshotChunk(:ActorSelection, :String)

Candidate (replication behavior)
  -startNewTerm()
  #handleAppendEntriesReply(:ActorRef, :AppendEntriesReply)

Follower (replication behavior)
  snapshotChunksCollected : ByteString
  +handleInstallSnapShot(:ActorRef, :InstallSnapshot)
  #handleAppendEntriesReply(:ActorRef, :AppendEntriesReply)
  #handleAppendEntries(:ActorRef, :AppendEntries)

Association multiplicity: 1 to 1


Method-1: Run 2-node cluster protocol outside of ODL
• External cluster protocol decides which node is ‘master’ and which node is ‘standby’. Once the master election is complete, the master sends node roles and node membership information to all the ODL instances.
• ‘Cluster module’ within ODL defines the ‘cluster node’ model and provides REST APIs to configure the cluster information by modifying the *.conf files.
• ‘Cluster module’ will send RAFT messages to all other cluster members about cluster information – membership & shard RAFT state.
• ‘ShardActors’ in both cluster nodes will handle these messages, instantiate the corresponding “replication Behavior” & “election Behavior” role instances, and switch to the new roles.
• Northbound virtual IP is OS dependent and out of scope here.

2-node cluster work flow
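Once the master is known, the cluster module's job in this workflow amounts to telling every shard replica which role to take. A minimal sketch, assuming a hypothetical helper `assign_shard_roles` and made-up member and shard names; the real message format is not specified in these slides.

```python
from typing import Dict, List


def assign_shard_roles(master: str, members: List[str],
                       shards: List[str]) -> Dict[str, Dict[str, str]]:
    """Map every member to the role each of its shard replicas should take:
    'Leader' on the elected master, 'Follower' everywhere else."""
    if master not in members:
        raise ValueError(f"master {master!r} is not a cluster member")
    return {
        member: {shard: ("Leader" if member == master else "Follower")
                 for shard in shards}
        for member in members
    }


roles = assign_shard_roles("odl-1", ["odl-1", "odl-2"], ["default", "inventory"])
```

Each ShardActor would then switch its replication and election behaviors to the role it finds under its own member name.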


Reference diagram for Method-2
[Diagram labels: Cluster protocol – primary path; 1a. Switch-to-controller connectivity state polling; 1b. Cluster protocol – secondary path]


Method-2: Run cluster protocol within ODL
• ‘Cluster module’ within each ODL instance talks to the other ODL instance and elects the ‘master’ and ‘standby’ nodes.
• If the cluster times out, a node will check other factors (probably cross-check with connected OpenFlow switches for ‘primary’ controller information, or use an alternative path) for new master election.
• ‘Cluster module’ will send RAFT messages to all other cluster members about ‘cluster information’ – membership & shard RAFT state.
• ‘ShardActors’ in both cluster nodes will handle these messages, instantiate the corresponding “replication Behavior” & “election Behavior” role instances, and switch to the new roles.
• Northbound virtual IP is OS dependent and out of scope here.

2-node cluster work flow


• Shard Manager will create the local shards based on the shard configuration.
• Each shard will start off as ‘candidate’ for role election as well as for ‘data replication’ messages, by instantiating the ‘ElectionBehavior’ and ‘ReplicationBehavior’ classes in the ‘Candidate’ role.
• The candidate node will start sending ‘requestForVote’ messages to other members.
• A leader is elected based on the ‘Raft leader election’ algorithm, and the elected shard will set its state to ‘Leader’ by switching its ‘ElectionBehavior’ & ‘ReplicationBehavior’ instances to ‘Leader’.
• When the remaining candidates receive the leader assertion messages, they will move to the ‘Follower’ state by switching their ‘ElectionBehavior’ & ‘ReplicationBehavior’ instances to ‘Follower’.

3-node Cluster work flow
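The election step in this workflow rests on Raft's majority rule. A short sketch of that rule (function names are illustrative) also shows one reason the 2-node case is treated separately in this deck:

```python
def votes_required(cluster_size: int) -> int:
    """Strict majority of the configured cluster members."""
    return cluster_size // 2 + 1


def wins_election(votes_granted: int, cluster_size: int) -> bool:
    """votes_granted includes the candidate's vote for itself."""
    return votes_granted >= votes_required(cluster_size)


# 3-node cluster with one node down: candidate plus one peer is still a majority.
three_node_ok = wins_election(votes_granted=2, cluster_size=3)

# 2-node cluster with one node down: the survivor alone cannot reach a majority,
# so a plain Raft election can never complete.
two_node_ok = wins_election(votes_granted=1, cluster_size=2)
```

This is consistent with the earlier slides, where the leader is assigned in the 2-node case and elected in larger clusters.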


Provide Hooks to Influence Key RAFT Decisions (Shard Leader Election / Data Replication)

https://git.opendaylight.org/gerrit/#/c/12588/

(Working Proposal) ConsensusStrategy



Config Changes (DRAFT)


(Current) Config
• Config Files (Karaf: /config/initial)
  • Read Once on Startup (Default Settings For New Modules)
  • (sal-clustering-commons) Hosts Akka & Config Subsystem Reader/Resolver/Validator
  • Currently No Config Subsystem Config Properties Defined?
• Akka/Cluster Config: (akka.conf)
  • Akka-Specific Settings (actorspaces data/rpc, mailbox, logging, serializers, etc.)
  • Cluster Config (IPs, names, network parameters)
• Shard Config: (modules.conf, module-shards.conf)
  • Shard Name / Namespace
  • Sharding Strategies
  • Replication (# and Location)
  • Default Config


(Proposal) Config

Intent
• Continue to Keep Config Outside of Shard/RAFT/DistributedDatastore Code
• Provide Sensible Defaults and Validate Settings When Possible
  • Error/Warn on Any Changes That Are Not Allowed On a Running System
• Provide REST Config Access (where appropriate)

Design Idea
• Host Configuration Settings in Config Subsystem
  • Investigate Using Karaf Cellar To Distribute Common Cluster-Wide Config
  • Move Current Config Processing (org.opendaylight.controller.cluster.common.actor) to existing sal-clustering-config?
• Akka-Specific Config:
  • Make Most of Existing akka.conf File as Default Settings
  • Separate Cluster Member Config (see Cluster Config)
  • Options:
    • Provide Specific Named APIs, e.g. setTCPPort()
    • Allow Akka <type,value> Config To Be Set Directly


(Proposal) Config

Design Idea (Continued)
• Cluster Config:
  • Provide a Single Point For Configuring A Cluster
    • Feeds Back to Akka-Specific Settings, etc.
  • Define Northbound Cluster IP Config (alias)
• Shard Config:
  • Define Shard Config (Name / Namespace / Sharding Strategy)
  • Will NOT Support Changing Running Shard For Now
• ‘Other’ Config:
  • 2-Node:
    • Designate Cluster’s Primary Node or Election Algorithm (dynamic)
    • Failback to Primary Node (dynamic)
  • Strategies (Influence These in RAFT) – Separate Bundles?
    • Election
    • Consensus


Northbound IP Alias (DRAFT)
