Case Study: Deterministic Failover in a Dual-Supervisor System

In a dual-supervisor (active/standby) architecture, measuring health isn’t always as straightforward as monitoring a heartbeat. A supervisor may be considered alive yet be isolated from the rest of the cluster, unable to communicate with the nodes that it manages. This case study explores the implementation of interface-state-aware election logic in such a system, designed to ensure that a leader is elected only when it is in a position to manage the nodes, not merely because it is reachable.

The election protocol used here is simple: it expands on the bully algorithm by taking multiple criteria into account. Adding an interface-aware component to this protocol allows us to build a more resilient system.

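The rest of this article covers the silent-failure problem, the existing election protocol, the network monitoring service, the damping of interface flapping, and the operator veto system.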

Silent Failure and “Deaf” Leadership

The cluster fabric manages its nodes via specific aggregate interfaces (most notably the data-plane aggregate). An election protocol may compare several components to choose the more suitable leader of the two supervisors. However, if it has no visibility into whether a supervisor is connected to the data nodes, the system can enter a “silent failure” mode in which the active supervisor’s heartbeat remains healthy while it is unable to reach the cluster that it manages.

A supervisor that has lost its connection to the nodes it is meant to manage would remain in power, allowing routing protocols to fail silently, until a network operator stepped in to issue a manual failover. It is, for all intents and purposes, “deaf” to the state of the cluster.

We identified this problem as an opportunity to optimise cluster availability by tracking network events and electing a leader accordingly. Without a leadership election that included these criteria, we ran the risk of having an unresponsive leader. While such a leader remained in power, new configuration was not pushed to the cluster or, worse, the cluster was completely unavailable.

A diagram showing a dual supervisor system. Both supervisors are connected to each other via a connection with an active heartbeat, and to the cluster through another connection. The connection between supervisor 2 (active) and the clustered nodes is broken

Analysis of the Existing System: The Election Protocol

The existing election protocol was a simple one, based on the bully algorithm. Rather than comparing process IDs as in the traditional bully algorithm, it compared a series of metrics in sequence. These metrics ranged from an administrative override, to the freshness of DB updates, to node IDs (as in the traditional bully algorithm).

Under this protocol, failover was normally expected only when an operator requested it or when a supervisor restarted, but an election could be triggered by any node sending a message requesting leadership. An election could also be triggered if the standby received no updates from the leader for a given period of time, or when a new supervisor joined the cluster. When a new supervisor joined, the cluster would usually remain stable, which avoided a continuously restarting node causing constant changes in leadership.

I added a new component to the election algorithm that tracked the status of the required interfaces: REQUIRED_AVAILABLE. This component counted how many of the interfaces required for healthy operation were currently available.
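As a rough illustration of how such a count could be derived, here is a minimal sketch. The class name `RequiredAvailable`, the `State` enum, and the interface names used with it are hypothetical, and treating an interface with no known state as unavailable is an assumption rather than something the original system is known to do.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical helper that derives the REQUIRED_AVAILABLE value for one supervisor.
class RequiredAvailable {
    enum State { UP, DOWN }

    // Counts how many of the required interfaces are currently UP.
    // Assumption: an interface with no recorded state counts as unavailable.
    static int count(Set<String> required, Map<String, State> current) {
        int available = 0;
        for (String name : required) {
            if (current.get(name) == State.UP) {
                available++;
            }
        }
        return available;
    }
}
```

Interfaces that are not in the required set do not contribute to the count, so an unrelated interface being UP cannot mask a failed required one.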

The main point of contention when adding this component was where to place it in the hierarchy:

  • Option A (before administrative overrides): Automated health takes precedence. However, if an operator forced a failover to a specific supervisor for debugging, the system might “ignore” the command if the interfaces were down.
  • Option B (after administrative overrides): Human intent takes precedence. If an operator rigs an election, the system obeys, even if it means that the active supervisor is less healthy.

I chose Option B. Prioritising human intent ensures that the system remains predictable during manual debugging. By giving this new component less weight than an administrative override, I maintained backwards compatibility while trusting the human operator to understand the state of their network when issuing commands.

Election Component                              Supervisor A    Supervisor B
Is allowed to be Leader
Admin rigged in favour of this node
How many required interfaces are in UP state
Node ID (tiebreaker)                            1               2
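The ordering above can be sketched as a comparison that moves to the next component only when the previous one ties. This is a simplified illustration, not the actual implementation: the `Candidate` and `Election` classes are hypothetical, other components such as DB freshness are left out, and letting the higher node ID win the tiebreak (as in the traditional bully algorithm) is an assumption.

```java
// Hypothetical snapshot of the values one supervisor brings to an election.
class Candidate {
    final int nodeId;
    final boolean allowedToLead;
    final boolean adminRigged;
    final int requiredAvailable; // required interfaces currently in UP state

    Candidate(int nodeId, boolean allowedToLead, boolean adminRigged, int requiredAvailable) {
        this.nodeId = nodeId;
        this.allowedToLead = allowedToLead;
        this.adminRigged = adminRigged;
        this.requiredAvailable = requiredAvailable;
    }
}

class Election {
    // Positive result: a is the better candidate. Negative: b is.
    // A later component is consulted only when all earlier ones tie.
    static int compare(Candidate a, Candidate b) {
        int c = Boolean.compare(a.allowedToLead, b.allowedToLead);
        if (c != 0) return c;
        // Option B: the administrative override outranks interface health.
        c = Boolean.compare(a.adminRigged, b.adminRigged);
        if (c != 0) return c;
        c = Integer.compare(a.requiredAvailable, b.requiredAvailable);
        if (c != 0) return c;
        // Assumed tiebreak direction: the higher node ID wins.
        return Integer.compare(a.nodeId, b.nodeId);
    }

    static Candidate elect(Candidate a, Candidate b) {
        return compare(a, b) >= 0 ? a : b;
    }
}
```

With this ordering, a supervisor that an operator has rigged to win still wins when it has fewer required interfaces up, which is the behaviour Option B asks for.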

Building a Thread-Safe Network Monitoring Service

In this monitoring system, we make heavy use of the observer pattern, registering listeners and notifying them of interface changes at different levels throughout the system. The monitoring system is a core component of this project: it reacts to network changes, reports real-time changes to interface states, and reports when the “damped” state of an interface has changed.

A diagram showing how the interface monitoring system works. The ip-monitor process sends information to the Network Change Listener, which then fetches the interface state from an Interface State Service. It notifies the State Aggregator of this state. The State Aggregator determines whether this state change is actionable before sending it to the Cluster Controller. Once there the Cluster Controller notifies the ElectionListener to call the election.

Using pre-existing infrastructure as a baseline, I developed a monitoring layer that translated low-level kernel network events, received from the Linux ip-monitor process together with the Linux interface states, into actionable signals that could be sent throughout the rest of the system.

Listeners would register with the reporter module for the interface grouping that they cared about, and the network monitor would trigger each of them when a single interface in that grouping changed or when the status of the entire aggregate changed.

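interface Listener {
     enum State { UP, DOWN } // assumed representation of an interface state
     // Called when a single interface in the registered grouping changes state.
     void singleInterfaceChanged(String intfname, State oldState, State newState);
     // Called when the status of the entire aggregate changes.
     void aggregateInterfaceChanged(State oldState, State newState);
}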

I added a cluster controller that listened for notifications from this reporter module. As well as enforcing whether the feature was enabled, the cluster controller received notifications of interface changes and informed any downstream listeners of the relevant changes.
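The registration and notification flow described above might look something like the following sketch. The `ReporterModule` class and its method names are hypothetical, as are the `String` and `State` parameter types. Using `ConcurrentHashMap` with `CopyOnWriteArrayList` is one common way to let listeners register while notifications are in flight, and is not necessarily what the original system used.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

interface Listener {
    enum State { UP, DOWN }
    void singleInterfaceChanged(String intfname, State oldState, State newState);
    void aggregateInterfaceChanged(State oldState, State newState);
}

// Hypothetical reporter module: listeners register per interface grouping.
class ReporterModule {
    // Grouping name -> listeners. Both collections are safe for concurrent
    // registration and iteration, so no explicit locking is needed here.
    private final Map<String, List<Listener>> listenersByGroup = new ConcurrentHashMap<>();

    void register(String group, Listener listener) {
        listenersByGroup.computeIfAbsent(group, g -> new CopyOnWriteArrayList<>()).add(listener);
    }

    // Notifies the listeners of one grouping that a member interface changed.
    void reportSingle(String group, String intfname, Listener.State oldState, Listener.State newState) {
        for (Listener l : listenersByGroup.getOrDefault(group, List.of())) {
            l.singleInterfaceChanged(intfname, oldState, newState);
        }
    }

    // Notifies the listeners of one grouping that the aggregate status changed.
    void reportAggregate(String group, Listener.State oldState, Listener.State newState) {
        for (Listener l : listenersByGroup.getOrDefault(group, List.of())) {
            l.aggregateInterfaceChanged(oldState, newState);
        }
    }
}
```

A listener registered for one grouping is never called for changes in another, which keeps the cluster controller from reacting to interfaces that it does not depend on.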

Normalising Interface Flapping

Interface states may change frequently for a variety of reasons, including unstable physical connections, hardware issues, and high network load. This is commonly referred to as flapping. If we were to call an election every time an interface changed state, we’d end up with an overwhelming number of unnecessary elections, and our cluster could end up constantly switching supervisors.

To prevent these kinds of election storms, any listeners that needed to act on the state changes (such as those that called an election) would only be notified when the “derived state” was updated. The system used asymmetric damping:

  • UP to DOWN: notify listeners immediately, to prioritise keeping a healthy supervisor as the leader.
  • DOWN to UP: the listener would not be notified until the cluster controller had determined that the interface had been down for a minimum safe buffer of 30 seconds.

A diagram showing how the interface monitoring system works. The IPMonitoringService notifies the Reporter Module of raw interface state changes. The Reporter Module decides whether to wait 30 seconds since the last transition to down (if a DOWN to UP transition) or notify the Cluster Controller immediately. Once the Cluster Controller is notified an election is called.
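The asymmetric damping described above can be sketched as a small state machine. This is an illustrative approximation only: the `AsymmetricDamper` class, its method names, and the choice to pass the current time in explicitly (which keeps the example deterministic) are assumptions. The sketch also assumes the interface starts in the UP state, and it is not thread-safe on its own.

```java
import java.util.function.Consumer;

// Hypothetical damper for a single interface or aggregate.
class AsymmetricDamper {
    enum State { UP, DOWN }

    static final long HOLD_DOWN_MILLIS = 30_000;

    private final Consumer<State> listener; // notified of derived-state changes only
    private State raw = State.UP;           // latest raw state (assumed to start UP)
    private State derived = State.UP;       // damped state that listeners see
    private long lastDownAt;                // time of the last transition to DOWN

    AsymmetricDamper(Consumer<State> listener) {
        this.listener = listener;
    }

    // Feed every raw state change here, with the current time in milliseconds.
    void onRawChange(State newState, long now) {
        if (newState == State.DOWN && raw != State.DOWN) {
            lastDownAt = now; // every transition to DOWN restarts the hold-down
        }
        raw = newState;
        evaluate(now);
    }

    // Call periodically (for example from a timer) so that a held UP is released.
    void tick(long now) {
        evaluate(now);
    }

    private void evaluate(long now) {
        boolean holding = raw == State.UP && derived == State.DOWN
                && now - lastDownAt < HOLD_DOWN_MILLIS;
        if (holding) {
            return; // inside the hold-down window: keep reporting DOWN
        }
        if (raw != derived) {
            derived = raw; // DOWN passes through immediately, UP only after the window
            listener.accept(derived);
        }
    }
}
```

An interface that flaps several times inside the window therefore produces one DOWN notification and, once it has stayed up long enough, one UP notification, instead of one election per raw event.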

Respecting the Human Operator with a Veto System

One of the highest priorities is operational stability, especially when the cluster state must remain unchanged for a period of time, such as during a maintenance window. No elections could happen in this state, but interfaces would still need to be monitored.

I designed and implemented a veto map for the cluster controller, through which a system component (such as a maintenance mode trigger) could temporarily pause automated failovers. Once the feature was no longer paused, an election could be triggered to ensure we got to the appropriate state.

In addition, if the human operator wanted to rig the election in favour of a certain supervisor, that choice was given priority, as described above.

A diagram showing an AND gate with three inputs: an interface state change, the feature flag, and no maintenance. Only when all three hold is an election triggered.

The trade-off in both of these cases is that the system prioritises predictability over availability. In the former, it is better to allow the veto than to risk changing supervisor when an end user expects their cluster to be stable. In the latter, not respecting the operator’s wishes could cause confusion and make it appear that there was a bug in the system.
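To make the gating concrete, here is a minimal sketch of a veto map sitting in front of the election trigger. The `ElectionGate` class, its method names, and the owner and reason strings are all hypothetical. Remembering a state change that arrived while a veto was in place, and acting on it once the last veto is removed, is one possible reading of the behaviour described above and should be treated as an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical gate in the cluster controller. An election is triggered only
// when a state change has occurred, the feature is enabled, and no veto is held.
class ElectionGate {
    private final Map<String, String> vetoes = new HashMap<>(); // owner -> reason
    private final Runnable triggerElection;
    private boolean featureEnabled = true;
    private boolean electionPending;

    ElectionGate(Runnable triggerElection) {
        this.triggerElection = triggerElection;
    }

    synchronized void setFeatureEnabled(boolean enabled) {
        featureEnabled = enabled;
    }

    // A component such as a maintenance mode trigger pauses automated failover.
    synchronized void addVeto(String owner, String reason) {
        vetoes.put(owner, reason);
    }

    // Removing the last veto releases any election that was held back.
    synchronized void removeVeto(String owner) {
        vetoes.remove(owner);
        maybeTrigger();
    }

    // Called when the damped (derived) interface state changes.
    synchronized void onDerivedStateChange() {
        if (!featureEnabled) {
            return; // feature off: interface changes never cause elections
        }
        electionPending = true;
        maybeTrigger();
    }

    private void maybeTrigger() {
        if (electionPending && featureEnabled && vetoes.isEmpty()) {
            electionPending = false;
            triggerElection.run();
        }
    }
}
```

Keying the map by owner means that several components can hold vetoes independently, and automated failover resumes only when all of them have released theirs.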

Conclusion

By the end of this project the cloud fabric had an election system that accounts for the ground truth of the network. Silent failures are reduced because the active supervisor stays in power only if it actually has a line of communication with its cluster.

This makes the supervisor more aware of its place within the overall system, and allows us to maintain a more available system overall. It is worth noting, however, that it doesn’t account for a situation where the interfaces on both supervisors are down. Working around that would still require manual intervention.
