The Cost of Silence

What Cold War-era nuclear bombing protocols can teach us about designing modern systems, and the time-bombs ticking in our code

A Catastrophic Atomic Tale

In the depths of the Cold War, following a credible threat of an incoming Soviet nuclear attack, a group of bombers was dispatched to strike military and political targets (such as Moscow) within the Soviet Union. While they were en route, the Strategic Air Command learned that the “credible threat” had in fact been a false alarm and scrambled to radio the bombers and have them turn around. Many of the aircrews complied, following the order to the letter. Others did not receive the broadcast and continued to their targets.

The aircrews who did not receive the updated order were simply following the status quo. They had not received a signal to stop, and under the SAC’s terrifyingly simple protocol they were to continue to the target unless they received a clear and authenticated “abort”. They would continue the mission and drop the payload as ordered.

The terrifying simplicity of Fail-Open. Without a signal to stop, the status quo is catastrophic.

This scenario fortunately never occurred, but it was a real risk under the SAC protocols of the 1950s. The fail-open protocol meant that silence was indistinguishable from “proceed as planned”. It was a simple, terrifying, but highly available system. Upon receiving a warning of an incoming bomber strike, nuclear bombers would be scrambled towards targets in the Soviet Union with a single directive: proceed to the target unless given a clear “abort” order. If a radio failed or the frequency was jammed, the mission and its cataclysmic consequences would be inevitable. Aircrews who failed to receive such an order would not be going rogue. They would simply be following their orders as previously laid out.

By the late 1950s, the SAC realised that a non-deterministic state could have drastic consequences. The protocol was refined to one of positive control, or “fail-closed”. Rather than continuing to the target unless otherwise instructed, the bombers would enter a holding pattern close to the target and wait for an authenticated “GO” order. If no order was received, they would return to base. The default had shifted from catastrophe to safety.

Positive Control. Shifting the default from proceeding to waiting for confirmation (a time-leased holding pattern).

While the majority of us are not working on software where the risk of failure could result in nuclear armageddon, the fundamental architectural challenge remains the same. In the world of networks and distributed software engineering, “radio failure” is a daily reality. A node may lose connection to its cluster or to its control plane. A DNS race condition may break the connections in our global network and render our services unreachable. When (not if) this happens, the resilience of our system depends on whether that node treats silence as meaning the last order remains valid, or whether it aborts safely.

Two Generals, One Non-Deterministic State

The core of the dilemma faced by the SAC is a well-worn concept in computer science called the “Two Generals’ Problem”. In this thought experiment, two generals can only defeat an enemy if they attack together. However, to coordinate they must send messages over an unreliable channel, in this case a valley controlled by the enemy. Neither can ever be certain whether a message was lost or whether their counterpart received it.

The SAC could send an “abort” signal, but unless it received an “ack” it wouldn’t know whether the message had been lost and the bombers were still on their way to Moscow, or whether the bombers had already turned around. It was betting the mission, and the world, on a perfect connection that doesn’t exist even with today’s technology.

Positive Control Turns Silence into a Signal

Positive Control sidestepped this problem. It could not make the channel reliable, but it stopped treating silence as “no news” or “no change” and acknowledged that silence itself was a signal.

To put this in software terms, the SAC moved from an event-based system, where an abort signal could be pushed to stop the last order, to one where permission to continue was time-locked. In distributed systems we call this leasing.
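The difference between the two protocols fits in a few lines. The sketch below is illustrative only: the names `Order`, `Action`, `fail_open` and `positive_control` are invented for this example, and a lost or jammed message is modelled as `None`.

```python
from enum import Enum
from typing import Optional


class Order(Enum):
    GO = "go"
    ABORT = "abort"


class Action(Enum):
    PROCEED = "proceed"
    TURN_BACK = "turn back"


def fail_open(received: Optional[Order]) -> Action:
    """Event-based protocol: proceed unless an ABORT arrives."""
    # Silence (None) is indistinguishable from "proceed as planned".
    return Action.TURN_BACK if received is Order.ABORT else Action.PROCEED


def positive_control(received: Optional[Order]) -> Action:
    """Fail-closed protocol: proceed only on an explicit GO."""
    # Silence (None) is treated as a signal to stand down.
    return Action.PROCEED if received is Order.GO else Action.TURN_BACK


# The same radio failure produces opposite outcomes under the two protocols.
for name, protocol in [("fail-open", fail_open), ("positive control", positive_control)]:
    print(f"{name}: silence -> {protocol(None).value}")
```

Both functions agree whenever a message actually arrives. They only disagree about what silence means, which is the one case we cannot prevent.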

Time-locked authority. Node Bravo becomes a 'Zombie' the moment its lease expires without a heartbeat.

Under a leasing or time-to-live (TTL) system, a node may only act if it is the leader. It remains the leader only while it holds the lease, which expires after a time period. In order to remain the leader it must continuously renew this lease before the timer reaches zero. If the network fails, the lease expires and the node ceases all writes and actions.
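As a rough sketch of that rule, assume a hypothetical `Lease` object that a worker consults before every write. The class names, the 10-second TTL and the injected clock are all assumptions made for illustration. A real system would obtain and renew the lease through a coordination service rather than a local object.

```python
import time
from typing import Callable


class Lease:
    """A time-limited grant of authority (illustrative, not a real library API)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at = clock() + ttl_seconds

    def renew(self) -> None:
        """Called when a heartbeat is acknowledged; pushes the expiry forward."""
        self._expires_at = self._clock() + self._ttl

    def is_valid(self) -> bool:
        return self._clock() < self._expires_at


class Worker:
    def __init__(self, lease: Lease):
        self.lease = lease
        self.writes = []

    def write(self, value: str) -> bool:
        # Fail closed: check authority immediately before acting.
        if not self.lease.is_valid():
            return False
        self.writes.append(value)
        return True


# A simulated clock keeps the example instant and deterministic.
now = [0.0]
worker = Worker(Lease(ttl_seconds=10.0, clock=lambda: now[0]))

print(worker.write("a"))  # True: the lease is fresh
now[0] = 8.0
worker.lease.renew()      # heartbeat acknowledged at t=8, expiry moves to t=18
now[0] = 17.0
print(worker.write("b"))  # True: still inside the renewed lease
now[0] = 30.0             # partition: no renewal has succeeded since t=8
print(worker.write("c"))  # False: the lease has expired, so the worker refuses
```

The worker never needs to be told to stop. Losing contact is enough, because the absence of a renewal is what revokes its authority.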

An example of this in action would be a task-based scheduler. We have a queue of tasks, which are assigned one by one to various worker nodes. If the control plane loses contact with the node responsible for a given task, or some timeout is exceeded without the node acknowledging completion of the task, then it may assume that the node has failed without completing its task, and assign it to a different node.

A node that continues to act when other parts of the system believe it to be dead is commonly referred to as a “zombie”. In a fail-open (event-based) system, this zombie node may continue to work on the task indefinitely, despite the network partition. When the control plane assigns the task to another node, you now have two nodes working simultaneously on the same task. In many distributed data systems, this can lead to catastrophic data corruption or an inconsistent state. It’s a classic split-brain scenario.

In the lease-based system, the node and the control plane agree on a safety buffer. The control plane will only reassign once the node’s lease has definitively expired, and the node is programmed to “turn the plane around” when its lease hits 0. Provided the two clocks run at close to the same rate and the node checks its lease immediately before it acts, no two nodes will be working on the same task simultaneously.
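The arithmetic behind the safety buffer can be sketched as follows. The helper names and the figures (a 10-second lease, a 2-second buffer) are assumptions for illustration. Both timestamps are placed on one shared timeline to keep the example readable, whereas in practice each side measures elapsed time on its own clock.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LeaseTerms:
    ttl: float            # seconds of authority granted per renewal
    safety_buffer: float  # extra seconds the control plane waits (assumed drift bound)


def worker_deadline(renewal_sent_at: float, terms: LeaseTerms) -> float:
    """The worker counts from when it SENT the renewal, the earliest possible start."""
    return renewal_sent_at + terms.ttl


def earliest_reassignment(renewal_received_at: float, terms: LeaseTerms) -> float:
    """The control plane counts from when it RECEIVED the renewal, plus the buffer."""
    return renewal_received_at + terms.ttl + terms.safety_buffer


terms = LeaseTerms(ttl=10.0, safety_buffer=2.0)

# One renewal: sent at t=100.0 and received half a second later.
stop_at = worker_deadline(100.0, terms)
reassign_at = earliest_reassignment(100.5, terms)

print(f"worker stops acting at t={stop_at}")
print(f"control plane may reassign at t={reassign_at}")
print(f"gap with no owner: {reassign_at - stop_at} seconds")
```

Each side errs in the safe direction: the worker gives up early and the control plane takes over late. The gap between the two is time in which nobody owns the task, which is the availability we pay for the guarantee.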

The Zombie of AWS US-EAST-1: When Leases Fail

Lease-related race conditions, and the resulting divergence of state, have been responsible for outages such as the AWS outage of October 2025. In this case an automated DNS management system tasked two different DNS enactors with updating the DNS records. One had difficulty applying an updated plan, necessitating multiple retries.

When this job took too long, the DNS planner service (the “General”) sent a new DNS plan to a different node. The last step in each enactor’s job was to clean up old state, i.e. any state dated before its own. As the faster enactor began this cleanup, the slower enactor finally applied its now-outdated plan. The cleanup then deleted that plan as stale, leaving an empty record.

A three-lane timeline showing Enactor 1, a DNS Record, and Enactor 2. It shows Enactor 2 writing 'Plan B.' Just after it shows the slower Enactor 1 writing 'Plan A' and overwriting the record. After that it shows Enactor 1 performing a 'Cleanup' operation, which deletes the record entirely, leaving it empty. — A slow enactor arriving late overwrites fresh state, and then the 'cleanup' removes the very data the system needs.

The system fell victim to a gap between deciding to act and completing an action. Both enactors believed that they held the lease and so neither was at fault. An expiring lease could be considered an “Abort”. What was missing here was an authenticated “Go” order. In distributed systems we call this a fencing token. A fencing or version token, atomically updated when the faster enactor wrote its DNS config, would mean that the slower enactor would back off because the stored configuration was newer than the one it intended to write. Like a bomber arriving hours late and dropping a payload after a ceasefire had been signed, the node acted on authority that it didn’t realise had already expired.
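A minimal sketch of a fenced write follows, assuming a hypothetical in-memory `FencedRecord`. This is not how the AWS system is built. It only shows the version check happening at the point of write, using the token values 101 and 102 as stand-ins for the two plans.

```python
import threading
from typing import Optional, Tuple


class FencedRecord:
    """A record that rejects writes whose token is not newer than the stored one."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._token = 0
        self._value: Optional[str] = None

    def write(self, token: int, value: str) -> bool:
        # The check and the write happen under one lock, so they are atomic.
        with self._lock:
            if token <= self._token:
                return False  # stale (or duplicate) authority: the writer backs off
            self._token, self._value = token, value
            return True

    def read(self) -> Tuple[int, Optional[str]]:
        with self._lock:
            return self._token, self._value


record = FencedRecord()
print(record.write(102, "Plan B"))  # True: the faster enactor lands first
print(record.write(101, "Plan A"))  # False: the slower enactor is rejected
print(record.read())                # (102, 'Plan B')
```

The important design choice is that the record itself enforces the ordering. The late writer does not need to know its authority has lapsed, because the storage layer refuses it regardless.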

The same three-lane timeline as the previous figure, but the DNS Record now includes a 'Token' or version column. When Enactor 2 writes Plan B, the stored token becomes 102. When the slow Enactor 1 arrives with Token 101, its write hits a 'Version Check' shield and is rejected because 101 is less than 102. The subsequent cleanup does not erase a valid record. — Fencing tokens as the 'Authenticated GO.' By checking the version at the point of write, we ensure late bombers can't change the course of history.

Because DynamoDB (the service handling lease renewal) was unreachable, the other AWS services couldn’t renew their leases and followed their own fail-closed protocol. They ceased operations. This was the correct behaviour, but it shows that while a fail-closed design protects your data, it does so at the expense of availability during recovery.

Trade-offs and a Shrinking Decision Window

No system is perfect, and the cost of this safety is the sacrifice of availability. For the SAC, the bombers turning back due to a faulty radio meant that the mission was a failure. The bombers would not strike their target, all due to a small technical glitch. In software, a networking blip means that the system stalls until the task is reassigned.

In distributed systems, this is the classic CAP theorem trade-off (Consistency, Availability, Partition Tolerance) in action. You are prioritising consistency, where no two nodes will perform the same action, over availability, where the task must keep moving forward no matter what.

Designing a safety buffer is, therefore, crucial. In the 1950s, SAC had to decide how far the bombers would go (or how close to the target) before they treated silence as a signal to abort. Their timeout was measured in miles. In the modern backend these timeouts are measured in milliseconds. The length of this safety buffer is one of the most critical decisions we make in a TTL lease-based system. Too short a buffer and we interrupt the system unnecessarily, introducing more churn. Too long a buffer and the entire system must wait for the lease to expire before it can continue.
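To make the trade-off concrete, here is a back-of-the-envelope sketch. The helper names, the 1-second heartbeat and the 0.5-second buffer are assumptions for illustration, and network latency is ignored.

```python
import math


def worst_case_failover(ttl: float, safety_buffer: float) -> float:
    """Longest a task can sit idle if its worker dies just after renewing."""
    return ttl + safety_buffer


def tolerated_missed_heartbeats(ttl: float, heartbeat_interval: float) -> int:
    """Consecutive renewals that can be lost before the lease lapses."""
    # Renewal attempts that fall strictly before expiry, counted from the last
    # successful renewal. At least one of them must succeed to keep the lease.
    attempts_before_expiry = max(0, math.ceil(ttl / heartbeat_interval) - 1)
    return max(0, attempts_before_expiry - 1)


HEARTBEAT = 1.0  # seconds between renewal attempts (assumed)
BUFFER = 0.5     # seconds of safety buffer (assumed)

for ttl in (2.0, 10.0, 60.0):
    missed = tolerated_missed_heartbeats(ttl, HEARTBEAT)
    stall = worst_case_failover(ttl, BUFFER)
    print(f"ttl={ttl:>4}s  survives {missed} lost heartbeats  worst-case stall {stall}s")
```

Under these assumptions, a 2-second lease fails over quickly but drops its owner on a single lost heartbeat, while a 60-second lease shrugs off a long blip and leaves the task stalled for a minute when the worker really is dead.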

Similarly, external factors can put pressure on the safety buffer. AWS’s customers want highly available but consistent services, and therefore recovery time must be minimised as much as possible while ensuring data integrity. With advances in hypersonic missiles, the decision window in the world of nuclear deterrence also shrinks, from hours to minutes. Politicians and generals facing that pressure may also consider changing their system to one that optimises for availability over consistency.

Just as a shrinking window pressures a general to favour speed, a business’s Recovery Time Objective (RTO) is limited by lease lengths and so pressures an engineer to shorten lease times. Both are gambles against the reliability of the system.

Conclusion

As engineers we strive to build perfect networks, and software that won’t fail. I’m positive the engineers of the 1950s thought the same. The takeaway here, however, is that networks will always find a way to fail. We can rigorously test them and cover edge cases, but we can never guarantee zero failures. True resilience is not found in a perfect radio. It’s found in a process designed to handle the situation when that radio fails. As the world pushes for faster responses, we also need to consider what our systems do when that radio goes silent.

Published 2026.04.10