The Design of Everyday Software
System Reliability starts with the interface
I recently spent 20 minutes attempting to configure my new “simple” router. There was no progress bar, no error messages, just a blinking multi-coloured LED which the helpful manual informed me could mean one of several states depending on the hue and frequency of the pulse. Simultaneously the companion app would occasionally fail with useful error messages like “could not find the network”.
As I pondered the mystery of how to configure this router, it occurred to me that too often we think of systems only in terms of latency and throughput. We think of failure states as a node going down or an interface flapping. What we often omit is the most vital and most volatile part of the system: the human operating it.
In his book, “The Design of Everyday Things”, Don Norman talks about signifiers and affordances, the subtle clues that tell you to pull a handle or push a plate. In the world of software engineering, particularly in high-stakes networking and complex backend systems, these are the abstractions we build: our APIs, our CLIs, our function names, our error codes, and our logging messages.
If a user has to read a 400-page manual to understand why a command failed, or scratch their head trying to figure out just what “error code 15” means, then the system hasn’t just failed the user. It has failed as a piece of engineering. It has become fundamentally more prone to bugs through accidental misuse.
The Myth of the “Expert” user
It can be very easy to shrug our shoulders and excuse such failures because our users are “experts”. We assume that a senior SRE needs neither a clear UI nor an intuitive CLI. They’ll figure it out because they’re an expert. However, if that same SRE is woken at 3am to debug a network outage, they are losing a significant portion of their cognitive capacity to stress and exhaustion. Their expertise comes second to their human physiology.
If your CLI flags are inconsistent or your API returns a generic “500 internal server error” without context, then you are not helping to solve the crisis. Every generic error message increases the Mean Time to Recovery (MTTR). A good design, whether of the processes we follow or the tools we use, allows this stressed SRE to rely on what Daniel Kahneman describes as System 1 thinking: the fast, intuitive, and unconscious mode. A bad design forces them into the slower, analytical System 2 thinking. While not the main point of this article, every second spent deciphering a CLI flag is a second not spent diagnosing the root cause. This has a real monetary cost to a business!
Even outside of disaster response, an intuitive system makes it easier for developers new to your organisation to onboard and become productive, rather than puzzling over an error message caused by an invalid input they did not know they had provided.
Feedback loops in our CLI
A well-designed CLI is a vital tool for a systems engineer, yet it is often an afterthought. To be useful, our CLI must provide feedback loops that tell the user what the system is doing. These are simple, and you’ll probably consider them common sense, but it’s remarkable how many systems lack them.
- Visibility of system status: If a command takes more than 500ms, provide a progress bar. Silence is itself a signal, and here it is the enemy of confidence. It leaves the operator wondering whether the system is stuck in an infinite loop or waiting for a resource that will never be freed. What would you do to figure out which one it is and escape this Gulf of Evaluation? strace? tcpdump? Coffee and hope that it resolves itself?
- Guardrails for destruction: Let’s be honest with ourselves. No matter how much experience we have, we’ve had a moment where we felt a chill go through us after we deleted something we shouldn’t have. That’s why any destructive action (for example, a delete in a database or a shutdown on a port) should provide “affordances” for safety, such as an atomic rollback mechanism or a dry-run flag.
- Concise errors: Something went wrong. Ok, what was it? Did the database fail? Was my input incorrect? We must treat error output as a primary interface, not an afterthought. A manual must not be necessary.
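The three feedback loops above can be sketched in a small Go command. This is an illustration under assumptions rather than a reference implementation: the `portctl` name, the `--port` and `--dry-run` flags, the step names, and the message wording are all hypothetical, and the steps are only printed, not performed.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"io"
	"os"
)

// options is the validated input for a hypothetical "shut down a port" command.
type options struct {
	Port   string
	DryRun bool
}

// parseArgs turns raw arguments into options, or returns an error that
// says what is missing and what to type instead.
func parseArgs(args []string) (options, error) {
	fs := flag.NewFlagSet("portctl", flag.ContinueOnError)
	port := fs.String("port", "", "port to shut down, e.g. et1")
	dryRun := fs.Bool("dry-run", false, "describe the change without making it")
	if err := fs.Parse(args); err != nil {
		return options{}, err
	}
	if *port == "" {
		return options{}, errors.New("no port given: pass --port <name>, for example --port et1")
	}
	return options{Port: *port, DryRun: *dryRun}, nil
}

// execute reports every step on out so the operator is never left with silence.
func execute(o options, out io.Writer) {
	if o.DryRun {
		// Guardrail: describe the destructive change and stop.
		fmt.Fprintf(out, "dry run: would shut down port %s; nothing was changed\n", o.Port)
		return
	}
	// Placeholder steps: a real tool would do the work between these messages.
	steps := []string{"draining traffic", "disabling port", "verifying link state"}
	for i, step := range steps {
		fmt.Fprintf(out, "[%d/%d] %s on %s\n", i+1, len(steps), step, o.Port)
	}
	fmt.Fprintf(out, "done: port %s is shut down\n", o.Port)
}

func main() {
	// Fixed arguments keep the sketch runnable; a real tool would pass os.Args[1:].
	opts, err := parseArgs([]string{"--port", "et1", "--dry-run"})
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	execute(opts, os.Stdout)
}
```

Keeping parsing separate from execution means the dry-run path and the real path share the same validated input, so what the dry run describes is what the real run would attempt.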
The API as a Developer Experience
Our internal APIs are no less important than a UI for an end consumer to configure their router. Here, the “user” is often your colleagues, both present and future. When we write a library or abstraction layer for other programmers to use, we are designing an experience. A well-crafted class or clean function is not just about “clean code”, it is about providing clear affordances to our fellow developers.
Norman emphasises that once a pattern is established, it must continue to be adhered to. Every programming language has its own idioms: returning error values in Go, throwing exceptions in Java. Similarly, many large codebases will have their own style guides. On top of that we all have expectations for how our software should behave.
When you write code that ignores these patterns, you are creating a “Norman door”. A developer sees a function and expects it to behave a certain way, but encounters a different behaviour, or worse they encounter an unexpected side-effect.
We once built a REST client that we assumed was dormant and unused, until our support team notified us that a very large client (a company whose name you certainly know) had discovered that no matter what they sent to it, it always returned 200 OK. This false signifier (a success rather than an error) wasn’t just a bug. It was a contractual violation.
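To make the failure mode concrete, here is a minimal sketch of a handler that refuses to report success for a request it could not process. It is shown on the server side as an assumption about where such a bug would live; the `/ports` route, the `portRequest` payload, and the error wording are hypothetical. Only the standard `net/http`, `net/http/httptest`, and `encoding/json` packages are used.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// portRequest is a hypothetical payload for this sketch.
type portRequest struct {
	Port string `json:"port"`
}

// handlePort answers 200 only when the request was understood and accepted.
func handlePort(w http.ResponseWriter, r *http.Request) {
	var req portRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, "request body is not valid JSON: "+err.Error(), http.StatusBadRequest)
		return
	}
	if req.Port == "" {
		http.Error(w, `missing required field "port"`, http.StatusBadRequest)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(map[string]string{"port": req.Port, "state": "accepted"})
}

// statusFor sends body to the handler in-process and returns the status code.
func statusFor(body string) int {
	req := httptest.NewRequest(http.MethodPost, "/ports", strings.NewReader(body))
	rec := httptest.NewRecorder()
	handlePort(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(statusFor(`{"port":"et1"}`)) // 200: accepted request
	fmt.Println(statusFor(`{}`))             // 400: missing field
	fmt.Println(statusFor(`not json`))       // 400: malformed body
}
```

A check like `statusFor` is cheap to keep in a test suite, and it would have caught an endpoint that answers 200 OK to everything.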
If one part of your API uses camelCase while another uses snake_case or if a GET /status returns a JSON object in one service and a string in another, you are creating cognitive debt. These small taxes, accumulated over a large platform, lead to user-error outages, which are actually design-error outages.
This is particularly important in the era of AI-assisted programming and “vibe-coding”. We are entering an era where the barrier to producing functional code has never been lower. However, while an LLM can generate code that satisfies a compiler and passes every unit test, it has no understanding of a developer’s conceptual model. It will never feel the pain of a missing dry-run flag. It will never experience being awoken at 3am to debug an issue in production. In other words it has no need for affordances, signifiers, or predictable mappings. With guidance it can accommodate these constraints, but it needs a software designer to do so.
Constraining Invalid States
We must design our APIs so that the “happy path” is the easiest to follow, minimal boilerplate is required, and invalid states are difficult (or impossible) to reach. We do this by coding defensively and, where possible, moving runtime checks into the type system itself.
Consider a function that configures a network interface. A fragile design allows a user to pass nonsensical combinations of arguments:
// Fragile: what do 0, 1, or 2 mean?
// An LLM might "hallucinate" that 3 is 'Disabled'. The compiler accepts it,
// and the mistake only surfaces at runtime.
func SetPortState(portID string, state int) {
	// switch state { ... default: panic("unknown state") }
}

// Or the string version (typo-prone). This is an alternative signature,
// not a second function in the same package:
func SetPortState(portID string, state string) {
	// If I pass "Maintence" (typo), the system fails at runtime.
}
A more resilient design uses semantic constraints and type checking to ensure that the user can only provide data that makes sense:
type PortState int

const (
	StateEnabled PortState = iota
	StateDisabled
	StateTesting
	StateMaintenance
)

// Resilient: the compiler rejects a string or an int variable.
// The named constants signal the 4 valid states. An untyped constant such
// as 7 still converts to PortState, so out-of-range values need a runtime check.
func SetPortState(portID string, state PortState) {
	// Business logic...
}

// Usage:
SetPortState("et1", StateMaintenance) // Signifiers in action.
If additional data checking is required it should be done as soon as possible, stopping any potential for the process to partially complete.
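One way to do that checking up front, sketched under assumptions: the `Valid` method, the `stateCount` sentinel, the `applied` map, and the error wording are hypothetical additions to the `PortState` type above. The check also covers a gap in the type-based approach, since Go converts an untyped constant such as `7` to `PortState` without complaint.

```go
package main

import (
	"errors"
	"fmt"
)

type PortState int

const (
	StateEnabled PortState = iota
	StateDisabled
	StateTesting
	StateMaintenance
	stateCount // hypothetical sentinel: one past the last valid state
)

// Valid reports whether s is one of the declared states.
func (s PortState) Valid() bool {
	return s >= StateEnabled && s < stateCount
}

// applied records successful changes so the sketch can show that a
// rejected call leaves nothing behind.
var applied = map[string]PortState{}

// SetPortState validates every input before touching any state.
func SetPortState(portID string, state PortState) error {
	if portID == "" {
		return errors.New("portID is empty: pass an interface name such as \"et1\"")
	}
	if !state.Valid() {
		return fmt.Errorf("unknown port state %d: valid values are 0 to %d", state, stateCount-1)
	}
	applied[portID] = state // side effects happen only after validation
	return nil
}

func main() {
	fmt.Println(SetPortState("et1", StateMaintenance)) // <nil>
	fmt.Println(SetPortState("et2", 7))                // compiles, but is rejected at runtime
}
```

Because both checks run before the map is written, a rejected call cannot leave the port half-configured.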
By designing our software in this way we’re building a model that eases the amount of system 2 thinking required by those who come after, and building in constraints that even a machine can’t ignore.
The Principle of Least Astonishment
When calling a function, the developer must have a reasonable conceptual model of what it will do. As designers, we support that model by:
- Creating predictable mappings: If a developer calls a function “GetStatus”, it should not trigger a network call that mutates database state.
- Avoiding hidden side effects: If a GET request purges a shared cache or updates a record as a “bonus”, without communicating that it may do so, then you have broken the developer’s mental model. This mental model should be treated with the same respect as error codes and schema constraints. When you violate it, you’ve hidden the handle and pushing the door has somehow opened all the windows.
Just as a physical object should explain itself through its shape, a function should explain itself through its name and its signature.
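A minimal sketch of the naming rule, with hypothetical names throughout (`StatusStore`, `RefreshStatus`, the `probe` callback, and the `writes` counter): the read is a pure lookup, and the operation that changes state says so in its name and returns an error.

```go
package main

import (
	"errors"
	"fmt"
)

// StatusStore holds the last known status per port.
type StatusStore struct {
	statuses map[string]string
	writes   int // counts mutations, to make side effects visible in the sketch
}

func NewStatusStore() *StatusStore {
	return &StatusStore{statuses: map[string]string{}}
}

// GetStatus only reads. It never changes the store.
func (s *StatusStore) GetStatus(port string) (string, bool) {
	st, ok := s.statuses[port]
	return st, ok
}

// RefreshStatus changes state, and its name and error result say so.
// probe stands in for whatever queries the device.
func (s *StatusStore) RefreshStatus(port string, probe func(string) (string, error)) error {
	if probe == nil {
		return errors.New("probe is nil: pass a function that queries the device")
	}
	st, err := probe(port)
	if err != nil {
		return fmt.Errorf("refreshing status of %s: %w", port, err)
	}
	s.statuses[port] = st
	s.writes++
	return nil
}

func main() {
	store := NewStatusStore()
	_, known := store.GetStatus("et1")
	fmt.Println(known, store.writes) // false 0: the read changed nothing

	_ = store.RefreshStatus("et1", func(string) (string, error) { return "up", nil })
	st, _ := store.GetStatus("et1")
	fmt.Println(st, store.writes) // up 1: the named mutation is the only write
}
```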
Education through Error messages
The most important feedback for a developer is a well-crafted error or log message. The common “invalid input at line 32” is a design failure. It tells us nothing about the underlying failure.
An error message is an opportunity to teach the user how the system works and how to recover. An appropriate error message should explain three things:
- What happened? (The link-state is down)
- Why did it happen? (The upstream peer failed its heartbeat)
- How do I fix it? (Check the cabling on interface 1/1 or verify the network protocol configurations)
When we provide information like that we give the operator agency. We reduce support calls, avoid regressions, and make it possible for a colleague to write tests that lead to a more resilient system.
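The three questions can be carried by the error value itself. The sketch below assumes a hypothetical `OpError` type, a hypothetical `ErrHeartbeat` sentinel, and a `checkLink` function that always fails for illustration; the message text reuses the link-state example above.

```go
package main

import (
	"errors"
	"fmt"
)

// OpError is a hypothetical error type that answers the three questions.
type OpError struct {
	What string // what happened
	Why  string // why it happened
	Fix  string // how to recover
	Err  error  // underlying cause, if any
}

func (e *OpError) Error() string {
	return fmt.Sprintf("%s: %s. To fix: %s", e.What, e.Why, e.Fix)
}

// Unwrap lets callers use errors.Is and errors.As on the cause.
func (e *OpError) Unwrap() error { return e.Err }

// ErrHeartbeat is a hypothetical sentinel for a missed peer heartbeat.
var ErrHeartbeat = errors.New("heartbeat timeout")

// checkLink always fails here so the sketch can show the message.
func checkLink(iface string) error {
	return &OpError{
		What: "link-state on " + iface + " is down",
		Why:  "the upstream peer failed its heartbeat",
		Fix:  "check the cabling on interface " + iface + " or verify the protocol configuration",
		Err:  ErrHeartbeat,
	}
}

func main() {
	err := checkLink("1/1")
	fmt.Println(err)
	fmt.Println(errors.Is(err, ErrHeartbeat)) // true: code can still branch on the cause
}
```

The operator reads one sentence with a next step, while calling code can still match on the wrapped cause instead of parsing the text.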
This might seem obvious but unhelpful error messages are all around us. It’s not just a blinking light on a router. I recently encountered the login page of a system. It was a multi-tenant system and I wanted to log in to a new tenant (after my lease on another expired) to test some code I wrote. I followed the login flow and was met with an error message “Error! An active session already exists. Please log out of your existing session before logging in here”. This failure in discoverability left me stumped. The feedback was clear (there is a session on another tenant) but provided no signifier for how to act. There was no logout button (the page showed me a login screen) and no hint of which tenant my “active session” was associated with. When I, a supposed “expert” technical user, asked around I was told to use incognito mode or delete my browser cookie.
This isn’t dissimilar to the blinking LED on my router. It clearly identifies state but with no affordance for fixing it.
By simply adding a “Logout of all sessions” link, we transformed the error into one with a clear action for resolving it.
| Design Principle | Physical World Example | Software Engineering Equivalent |
|---|---|---|
| Affordance | A chair "affords" sitting; a handle "affords" pulling. | A public method Start() or a --dry-run flag. |
| Signifier | A "Push" plate on a door. | Descriptive function names like SetAdminState. |
| Constraint | A key that only fits one way into a lock. | Strong Typing and private interface methods. |
| Feedback | The "click" of a light switch being toggled. | Progress bars and meaningful 201 Created responses. |
| Mapping | Stovetop knobs arranged to match the burners. | Consistent naming (camelCase) and idiomatic error handling. |
| Conceptual Model | A thermostat dial for "Warmer" vs "Colder." | An API that hides complex DB logic behind a simple GetStatus() call. |
Optimising the human cycle
In conclusion, we spend our careers optimising for CPU cycles and memory utilisation. What we often overlook is the overhead incurred by the human cycle. The time it takes for a human to understand, debug, and fix a problem has a cost to our organisation.
By applying the principles of physical design to our virtual infrastructure, we’re not talking about making things “prettier”. We are building more resilient systems. When our systems are intuitive, the gap between what the user believes will happen and the machine’s reality narrows. It is in that gap that bugs often hide, and it’s our job to close it.