![Page 1: HA, SRX Cluster & Redundancy Groups](https://reader033.vdocuments.site/reader033/viewer/2022042607/5561ece6d8b42a9d068b53c2/html5/thumbnails/1.jpg)
HA, SRX Cluster & Redundancy Groups
Prepared By:
Kashif Latif
Muhammad Bilal
![Page 2: HA, SRX Cluster & Redundancy Groups](https://reader033.vdocuments.site/reader033/viewer/2022042607/5561ece6d8b42a9d068b53c2/html5/thumbnails/2.jpg)
High Availability Cluster
High-availability clusters (also known as HA clusters or failover clusters) are groups of computers that support server applications that can be reliably utilized with a minimum of downtime.
They operate by harnessing redundant computers in groups or clusters that provide continued service when system components fail.
![Page 3: HA, SRX Cluster & Redundancy Groups](https://reader033.vdocuments.site/reader033/viewer/2022042607/5561ece6d8b42a9d068b53c2/html5/thumbnails/3.jpg)
Uses of HA Cluster
HA clusters are often used for:
1. Critical databases
2. File sharing on a network
3. Business applications
4. Customer services such as electronic commerce websites
![Page 4: HA, SRX Cluster & Redundancy Groups](https://reader033.vdocuments.site/reader033/viewer/2022042607/5561ece6d8b42a9d068b53c2/html5/thumbnails/4.jpg)
Cluster Monitoring
HA clusters usually use a private heartbeat network connection to monitor the health and status of each node in the cluster.
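As a rough illustration of heartbeat monitoring, the sketch below tracks the last heartbeat received from each node and treats a node as unhealthy once no heartbeat has arrived within a timeout. The class name `HeartbeatMonitor`, the 3-second timeout, and the node names are assumptions made for this example; they do not come from any particular cluster product.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class HeartbeatMonitor:
    """Hypothetical helper: remembers when each node last sent a heartbeat."""

    timeout_s: float = 3.0  # assumed timeout, chosen only for illustration
    last_seen: Dict[str, float] = field(default_factory=dict)

    def record(self, node: str, now: Optional[float] = None) -> None:
        """Store the arrival time of a heartbeat from `node`."""
        self.last_seen[node] = time.monotonic() if now is None else now

    def is_alive(self, node: str, now: Optional[float] = None) -> bool:
        """A node counts as healthy if its last heartbeat is within the timeout."""
        now = time.monotonic() if now is None else now
        seen = self.last_seen.get(node)
        return seen is not None and (now - seen) <= self.timeout_s
```

A node that has never sent a heartbeat is reported as not alive, which is usually the safer default for a monitor of this kind.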
![Page 5: HA, SRX Cluster & Redundancy Groups](https://reader033.vdocuments.site/reader033/viewer/2022042607/5561ece6d8b42a9d068b53c2/html5/thumbnails/5.jpg)
Application Design Requirements
In order to run in a high-availability cluster environment, an application must satisfy at least the following technical requirements:
There must be a relatively easy way to start, stop, force-stop, and check the status of the application.
The application must be able to use shared storage.
![Page 6: HA, SRX Cluster & Redundancy Groups](https://reader033.vdocuments.site/reader033/viewer/2022042607/5561ece6d8b42a9d068b53c2/html5/thumbnails/6.jpg)
Cont…
The application must be able to restart on another node at the last state before failure, using the saved state from the shared storage.
The application must not corrupt data if it crashes or restarts from the saved state.
![Page 7: HA, SRX Cluster & Redundancy Groups](https://reader033.vdocuments.site/reader033/viewer/2022042607/5561ece6d8b42a9d068b53c2/html5/thumbnails/7.jpg)
Node Configurations
HA cluster configurations can sometimes be categorized into one of the following models:
1. Active/active
2. Active/passive
3. N+1
4. N+M
5. N-to-1
6. N-to-N
![Page 8: HA, SRX Cluster & Redundancy Groups](https://reader033.vdocuments.site/reader033/viewer/2022042607/5561ece6d8b42a9d068b53c2/html5/thumbnails/8.jpg)
Active/active
Traffic intended for the failed node is either passed onto an existing node or load balanced across the remaining nodes.
This is usually only possible when the nodes utilize a homogeneous software configuration.
![Page 9: HA, SRX Cluster & Redundancy Groups](https://reader033.vdocuments.site/reader033/viewer/2022042607/5561ece6d8b42a9d068b53c2/html5/thumbnails/9.jpg)
Active/passive
Provides a fully redundant instance of each node, which is only brought online when its associated primary node fails.
This configuration typically requires the most extra hardware.
![Page 10: HA, SRX Cluster & Redundancy Groups](https://reader033.vdocuments.site/reader033/viewer/2022042607/5561ece6d8b42a9d068b53c2/html5/thumbnails/10.jpg)
Node Reliability
HA clusters usually utilize all available techniques to make the individual systems and shared infrastructure as reliable as possible. These include:
1. Disk mirroring
2. Redundant network
3. Redundant storage area network
4. Redundant electrical power
5. Redundant power supply units
![Page 11: HA, SRX Cluster & Redundancy Groups](https://reader033.vdocuments.site/reader033/viewer/2022042607/5561ece6d8b42a9d068b53c2/html5/thumbnails/11.jpg)
Failover Strategies
Systems that handle failures in distributed computing have different strategies to cure a failure. For instance, the Apache Cassandra API Hector defines three ways to configure a failover:
1. FAIL_FAST: The attempt fails if the first node cannot be reached.
2. ON_FAIL_TRY_ONE_NEXT_AVAILABLE: Tries one more host before giving up.
3. ON_FAIL_TRY_ALL_AVAILABLE: Tries all existing nodes before giving up.
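The three strategies differ only in how many nodes a client is willing to try. The sketch below expresses that idea in Python; it is not the API's own code. The names `FailoverPolicy`, `max_attempts`, and `call_with_failover`, and the use of `ConnectionError` to signal an unreachable node, are assumptions made for this example.

```python
from enum import Enum
from typing import Callable, Optional, Sequence


class FailoverPolicy(Enum):
    FAIL_FAST = "fail_fast"
    ON_FAIL_TRY_ONE_NEXT_AVAILABLE = "try_one_next"
    ON_FAIL_TRY_ALL_AVAILABLE = "try_all"


def max_attempts(policy: FailoverPolicy, node_count: int) -> int:
    """How many nodes the policy allows the client to try."""
    if policy is FailoverPolicy.FAIL_FAST:
        return min(1, node_count)
    if policy is FailoverPolicy.ON_FAIL_TRY_ONE_NEXT_AVAILABLE:
        return min(2, node_count)
    return node_count


def call_with_failover(
    nodes: Sequence[str],
    request: Callable[[str], str],
    policy: FailoverPolicy,
) -> str:
    """Send `request` to nodes in order until one succeeds or the policy gives up."""
    last_error: Optional[Exception] = None
    for node in nodes[: max_attempts(policy, len(nodes))]:
        try:
            return request(node)
        except ConnectionError as exc:  # assumed signal for "node unreachable"
            last_error = exc
    raise ConnectionError("all attempted nodes failed") from last_error
```

With nodes `["a", "b", "c"]` where only `c` is reachable, only the try-all policy succeeds; the other two give up after one and two attempts respectively.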
![Page 12: HA, SRX Cluster & Redundancy Groups](https://reader033.vdocuments.site/reader033/viewer/2022042607/5561ece6d8b42a9d068b53c2/html5/thumbnails/12.jpg)
What is SRX Cluster…?
An SRX cluster provides network node redundancy by grouping a pair of the same kind of supported SRX Series or J Series devices into a cluster.
The devices must be running the same version of Junos OS.
![Page 13: HA, SRX Cluster & Redundancy Groups](https://reader033.vdocuments.site/reader033/viewer/2022042607/5561ece6d8b42a9d068b53c2/html5/thumbnails/13.jpg)
SRX Cluster Example
![Page 14: HA, SRX Cluster & Redundancy Groups](https://reader033.vdocuments.site/reader033/viewer/2022042607/5561ece6d8b42a9d068b53c2/html5/thumbnails/14.jpg)
SRX Plane
The SRX has separated planes. Depending on the SRX platform architecture, the separation varies from separate processes running on separate cores to completely physically differentiated subsystems.
1. Control Plane
2. Data Plane
![Page 15: HA, SRX Cluster & Redundancy Groups](https://reader033.vdocuments.site/reader033/viewer/2022042607/5561ece6d8b42a9d068b53c2/html5/thumbnails/15.jpg)
Control Plane
The control plane is used in HA to synchronize the kernel state between the two Routing Engines (REs).
It also provides a path between the two devices to send hello messages between them.
The two devices’ control planes talk to each other over a control link. This link is reserved for control plane communication.
![Page 16: HA, SRX Cluster & Redundancy Groups](https://reader033.vdocuments.site/reader033/viewer/2022042607/5561ece6d8b42a9d068b53c2/html5/thumbnails/16.jpg)
Cont…
The control plane is always in an active/backup state. This means only one RE can be the master over the cluster’s configuration and state.
This ensures that there is only one ultimate truth over the state of the cluster. If the primary RE fails, the secondary takes over for it.
Creating an active/active control plane makes synchronization more difficult because many checks would need to be put in place to validate which RE is right.
![Page 17: HA, SRX Cluster & Redundancy Groups](https://reader033.vdocuments.site/reader033/viewer/2022042607/5561ece6d8b42a9d068b53c2/html5/thumbnails/17.jpg)
Control Plane States
![Page 18: HA, SRX Cluster & Redundancy Groups](https://reader033.vdocuments.site/reader033/viewer/2022042607/5561ece6d8b42a9d068b53c2/html5/thumbnails/18.jpg)
Data Plane
The data plane’s responsibility in the SRX is to pass data and process it based on the administrator’s configuration.
All session and service states are maintained on the data plane.
The REs and/or control plane are not responsible for maintaining state.
![Page 19: HA, SRX Cluster & Redundancy Groups](https://reader033.vdocuments.site/reader033/viewer/2022042607/5561ece6d8b42a9d068b53c2/html5/thumbnails/19.jpg)
Responsibilities of Data Plane
The data plane has a few responsibilities when it comes to HA implementation.
First and foremost is state synchronization. The state of sessions and services is shared between the two devices.
Sessions are the state of the current set of traffic that is going through the SRX, and services are other items such as:
1. VPN
2. IPS
3. ALGs
![Page 20: HA, SRX Cluster & Redundancy Groups](https://reader033.vdocuments.site/reader033/viewer/2022042607/5561ece6d8b42a9d068b53c2/html5/thumbnails/20.jpg)
Chassis Cluster
An SRX cluster implements a concept called chassis cluster. A chassis cluster takes the two SRX devices and represents them as a single device.
The interfaces are numbered so that the count starts on the first chassis and continues through to the end of the second chassis.
![Page 21: HA, SRX Cluster & Redundancy Groups](https://reader033.vdocuments.site/reader033/viewer/2022042607/5561ece6d8b42a9d068b53c2/html5/thumbnails/21.jpg)
Chassis Cluster Numbering
![Page 22: HA, SRX Cluster & Redundancy Groups](https://reader033.vdocuments.site/reader033/viewer/2022042607/5561ece6d8b42a9d068b53c2/html5/thumbnails/22.jpg)
Chassis Cluster Functionality
1. Resilient system architecture, with a single active control plane for the entire cluster and multiple Packet Forwarding Engines. This architecture presents a single device view of the cluster.
2. Synchronization of configuration and dynamic runtime states between nodes within a cluster.
3. Monitoring of physical interfaces, and failover if the failure parameters cross a configured threshold.
![Page 23: HA, SRX Cluster & Redundancy Groups](https://reader033.vdocuments.site/reader033/viewer/2022042607/5561ece6d8b42a9d068b53c2/html5/thumbnails/23.jpg)
States of Cluster
The different states that a cluster can be in at any given instant are as follows:
1. Hold
2. Primary
3. Secondary-Hold
4. Secondary
5. Ineligible
6. Disabled
A state transition can be triggered by events such as interface monitoring, SPU monitoring, failures, and manual failovers.
![Page 24: HA, SRX Cluster & Redundancy Groups](https://reader033.vdocuments.site/reader033/viewer/2022042607/5561ece6d8b42a9d068b53c2/html5/thumbnails/24.jpg)
Chassis Cluster Formation
To form a chassis cluster, a pair of the same kind of supported SRX Series devices or J Series devices are combined to act as a single system that enforces the same overall security.
You can deploy up to 15 chassis clusters in a Layer 2 domain.
![Page 25: HA, SRX Cluster & Redundancy Groups](https://reader033.vdocuments.site/reader033/viewer/2022042607/5561ece6d8b42a9d068b53c2/html5/thumbnails/25.jpg)
Identification of Clusters
Clusters and nodes are identified in the following way:
A cluster is identified by a cluster ID (cluster-id) specified as a number from 1 through 15.
A cluster node is identified by a node ID (node) specified as a number from 0 to 1.
![Page 26: HA, SRX Cluster & Redundancy Groups](https://reader033.vdocuments.site/reader033/viewer/2022042607/5561ece6d8b42a9d068b53c2/html5/thumbnails/26.jpg)
Redundancy Groups
A redundancy group is an abstract construct that includes and manages a collection of objects. A redundancy group contains objects on both nodes.
A redundancy group is primary on one node and backup on the other at any time.
We can create up to 128 redundancy groups.
![Page 27: HA, SRX Cluster & Redundancy Groups](https://reader033.vdocuments.site/reader033/viewer/2022042607/5561ece6d8b42a9d068b53c2/html5/thumbnails/27.jpg)
Example of Redundancy Groups
![Page 28: HA, SRX Cluster & Redundancy Groups](https://reader033.vdocuments.site/reader033/viewer/2022042607/5561ece6d8b42a9d068b53c2/html5/thumbnails/28.jpg)
Primacy of Redundancy Group
Three things determine the primacy of a redundancy group:
1. The priority configured for the node
2. The node ID (in case of tied priorities)
3. The order in which the node comes up.
If a lower priority node comes up first, then it will assume the primacy for a redundancy group (and will stay as primary if preempt is not enabled).
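The interaction of priority, node ID, boot order, and preempt can be sketched as a small election function. This is an illustrative model only, not Junos code. It assumes that a higher priority value wins and that the lower node ID wins a tie; the function name `elect_primary` and its parameters are hypothetical.

```python
from typing import Dict, Optional


def elect_primary(
    priorities: Dict[int, int],
    current_primary: Optional[int] = None,
    preempt: bool = False,
) -> int:
    """Pick the primary node for one redundancy group.

    `priorities` maps node ID -> configured priority for the nodes that are up.
    Assumptions: higher priority wins; on a tie the lower node ID wins.
    """
    if not priorities:
        raise ValueError("no node is up")
    best = max(priorities, key=lambda node: (priorities[node], -node))
    if current_primary in priorities:
        # Whoever came up first keeps primacy unless preempt is enabled
        # and another node has a strictly higher priority.
        if not preempt or priorities[current_primary] >= priorities[best]:
            return current_primary
    return best
```

For example, if node 1 (priority 100) boots alone it becomes primary; when node 0 (priority 200) joins later, node 1 stays primary unless `preempt=True`.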
![Page 29: HA, SRX Cluster & Redundancy Groups](https://reader033.vdocuments.site/reader033/viewer/2022042607/5561ece6d8b42a9d068b53c2/html5/thumbnails/29.jpg)
Redundancy Group Monitoring
A redundancy group can automatically fail over to the other node. To do this, it monitors the following components of the chassis cluster:
1. Interface Monitoring
2. IP Address Monitoring
3. Monitoring of Global-Level Objects
   i. SPU Monitoring
   ii. Flowd Monitoring
   iii. Cold-Sync Monitoring
![Page 30: HA, SRX Cluster & Redundancy Groups](https://reader033.vdocuments.site/reader033/viewer/2022042607/5561ece6d8b42a9d068b53c2/html5/thumbnails/30.jpg)
Chassis Cluster Redundancy Group Failover
A redundancy group is a collection of objects that fail over as a group. Each redundancy group monitors a set of objects (physical interfaces), and each monitored object is assigned a weight.
Each redundancy group has an initial threshold of 255.
![Page 31: HA, SRX Cluster & Redundancy Groups](https://reader033.vdocuments.site/reader033/viewer/2022042607/5561ece6d8b42a9d068b53c2/html5/thumbnails/31.jpg)
Cont…
When a monitored object fails, the weight of the object is subtracted from the threshold value of the redundancy group.
When the threshold value reaches zero, the redundancy group fails over to the other node. As a result, all the objects associated with the redundancy group fail over as well.
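The weight arithmetic can be checked with a few lines of Python. The initial threshold of 255 comes from the slides; the interface names and weights below are made-up values for illustration, and the helper names are hypothetical.

```python
from typing import Dict, Iterable

INITIAL_THRESHOLD = 255  # initial threshold of every redundancy group


def remaining_threshold(weights: Dict[str, int], failed: Iterable[str]) -> int:
    """Subtract the weight of each failed monitored object from the threshold."""
    return INITIAL_THRESHOLD - sum(weights[name] for name in failed)


def should_fail_over(weights: Dict[str, int], failed: Iterable[str]) -> bool:
    """The group fails over once the threshold has dropped to zero (or below)."""
    return remaining_threshold(weights, failed) <= 0


# Hypothetical monitored interfaces and weights.
weights = {"ge-0/0/1": 255, "ge-0/0/2": 128, "ge-0/0/3": 128}
```

With these weights, losing `ge-0/0/2` alone leaves a threshold of 127 and no failover, while losing `ge-0/0/1` alone, or both of the 128-weight links, triggers one.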
![Page 32: HA, SRX Cluster & Redundancy Groups](https://reader033.vdocuments.site/reader033/viewer/2022042607/5561ece6d8b42a9d068b53c2/html5/thumbnails/32.jpg)
Cont…
Because back-to-back redundancy group failovers that occur too quickly can cause a cluster to exhibit unpredictable behavior, a dampening time between failovers is needed.
The default dampening time is 300 seconds (5 minutes) for redundancy group 0 and is configurable up to 1800 seconds with the hold-down-interval statement.
![Page 33: HA, SRX Cluster & Redundancy Groups](https://reader033.vdocuments.site/reader033/viewer/2022042607/5561ece6d8b42a9d068b53c2/html5/thumbnails/33.jpg)
Cont…
Redundancy groups x (redundancy groups numbered 1 through 128) have a default dampening time of 1 second, with a range of 0 through 1800 seconds.
The hold-down interval affects manual failovers, as well as automatic failovers associated with monitoring failures.
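A minimal sketch of the dampening behaviour follows, assuming the defaults quoted above (300 seconds for redundancy group 0, 1 second for the others) and an upper bound of 1800 seconds. The class `HoldDownTimer` is a hypothetical model for this example, not a Junos interface.

```python
from typing import Optional


class HoldDownTimer:
    """Hypothetical model of the hold-down interval for one redundancy group."""

    MAX_HOLD_DOWN_S = 1800

    def __init__(self, redundancy_group: int, hold_down_s: Optional[int] = None):
        default = 300 if redundancy_group == 0 else 1
        self.hold_down_s = default if hold_down_s is None else hold_down_s
        if not 0 <= self.hold_down_s <= self.MAX_HOLD_DOWN_S:
            raise ValueError("hold-down interval must be within 0..1800 seconds")
        self._last_failover: Optional[float] = None

    def try_failover(self, now: float) -> bool:
        """Allow a failover only if the hold-down interval has elapsed."""
        if (
            self._last_failover is not None
            and now - self._last_failover < self.hold_down_s
        ):
            return False  # still dampened: manual or automatic, the failover waits
        self._last_failover = now
        return True
```

The same check applies whether the failover was requested manually or caused by a monitoring failure, which mirrors the statement above.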
![Page 34: HA, SRX Cluster & Redundancy Groups](https://reader033.vdocuments.site/reader033/viewer/2022042607/5561ece6d8b42a9d068b53c2/html5/thumbnails/34.jpg)
Chassis Cluster Redundancy Group Manual Failover
You can initiate a redundancy group x failover manually. A manual failover applies until a failback event occurs.
You can also initiate a redundancy group 0 failover manually if you want to change the primary node for redundancy group 0.
![Page 35: HA, SRX Cluster & Redundancy Groups](https://reader033.vdocuments.site/reader033/viewer/2022042607/5561ece6d8b42a9d068b53c2/html5/thumbnails/35.jpg)
State Transition Cases
There are three transition cases:
1. Reboot case—The node in the secondary-hold state transitions to the primary state; the other node goes dead (inactive).
![Page 36: HA, SRX Cluster & Redundancy Groups](https://reader033.vdocuments.site/reader033/viewer/2022042607/5561ece6d8b42a9d068b53c2/html5/thumbnails/36.jpg)
Cont…
2. Control link failure case—The node in the secondary-hold state transitions to the ineligible state and then to a disabled state; the other node transitions to the primary state.
3. Fabric link failure case—The node in the secondary-hold state transitions directly to the disabled state.
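The three cases can be summarized as a lookup table. The sketch below restates the list above as data; the dictionary keys and the helper `final_state` are hypothetical names, and `None` marks an outcome the slides do not state.

```python
from typing import Dict, List, Optional, Tuple

# case -> (states the secondary-hold node passes through, state of the other node)
TRANSITION_CASES: Dict[str, Tuple[List[str], Optional[str]]] = {
    "reboot": (["secondary-hold", "primary"], "dead (inactive)"),
    "control-link-failure": (["secondary-hold", "ineligible", "disabled"], "primary"),
    "fabric-link-failure": (["secondary-hold", "disabled"], None),  # other node not stated
}


def final_state(case: str) -> str:
    """Return the state the secondary-hold node ends up in for a given case."""
    path, _other_node = TRANSITION_CASES[case]
    return path[-1]
```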
![Page 37: HA, SRX Cluster & Redundancy Groups](https://reader033.vdocuments.site/reader033/viewer/2022042607/5561ece6d8b42a9d068b53c2/html5/thumbnails/37.jpg)
SNMP Failover Traps
Chassis clustering supports SNMP traps, which are triggered whenever there is a redundancy group failover.
The trap message can help you troubleshoot failovers. It contains the following information:
1. The cluster ID and node ID
2. The reason for the failover
3. The redundancy group that is involved in the failover
4. The redundancy group’s previous state and current state
![Page 38: HA, SRX Cluster & Redundancy Groups](https://reader033.vdocuments.site/reader033/viewer/2022042607/5561ece6d8b42a9d068b53c2/html5/thumbnails/38.jpg)
Chassis Cluster Interfaces
A network device doesn’t help a network without participating in traffic processing.
An SRX can use two different interface types to process traffic:
1. Reth Interface
2. Local Interface
![Page 39: HA, SRX Cluster & Redundancy Groups](https://reader033.vdocuments.site/reader033/viewer/2022042607/5561ece6d8b42a9d068b53c2/html5/thumbnails/39.jpg)
Reth Interface
A Reth (redundant Ethernet interface) is a Junos aggregate Ethernet interface with special properties compared to a traditional aggregate Ethernet interface.
The Reth allows the administrator to add one or more child links per chassis.
![Page 40: HA, SRX Cluster & Redundancy Groups](https://reader033.vdocuments.site/reader033/viewer/2022042607/5561ece6d8b42a9d068b53c2/html5/thumbnails/40.jpg)
Reth MAC Address
The MAC address for the Reth is based on a combination of the cluster ID and the Reth number.
![Page 41: HA, SRX Cluster & Redundancy Groups](https://reader033.vdocuments.site/reader033/viewer/2022042607/5561ece6d8b42a9d068b53c2/html5/thumbnails/41.jpg)
Cont…
In the figure the first four of the six bytes are fixed. They do not change between cluster deployments.
The last two bytes vary based on the cluster ID and the Reth index.
![Page 42: HA, SRX Cluster & Redundancy Groups](https://reader033.vdocuments.site/reader033/viewer/2022042607/5561ece6d8b42a9d068b53c2/html5/thumbnails/42.jpg)
Local Interface
A local interface is an interface that is configured local to a specific node.
It is configured the same way as an interface on a standalone device.
![Page 43: HA, SRX Cluster & Redundancy Groups](https://reader033.vdocuments.site/reader033/viewer/2022042607/5561ece6d8b42a9d068b53c2/html5/thumbnails/43.jpg)
Cont…
The significance of a local interface in an SRX cluster is that it does not have a backup interface on the other chassis, meaning that it is part of neither a Reth nor a redundancy group.
If this interface were to fail, its IP address would not fail over to the other node.
![Page 44: HA, SRX Cluster & Redundancy Groups](https://reader033.vdocuments.site/reader033/viewer/2022042607/5561ece6d8b42a9d068b53c2/html5/thumbnails/44.jpg)
Troubleshooting the Cluster
There are various methods that show the administrator how to troubleshoot a chassis cluster:
1. Identify the Cluster Status
2. Checking Interfaces
3. Verifying the Data Plane
4. Core Dumps
5. The Dreaded Priority Zero
![Page 45: HA, SRX Cluster & Redundancy Groups](https://reader033.vdocuments.site/reader033/viewer/2022042607/5561ece6d8b42a9d068b53c2/html5/thumbnails/45.jpg)
Thank You
From:
Kashif Latif
Muhammad Bilal