Case Study: Scaling CLI Testing with Record-and-Replay Proxy

Python REST API Docker Mocking Test Engineering DX

One of the greatest challenges we face as engineers is designing deterministic testing infrastructure. It was no different when I was tasked with designing and implementing a testing framework for a REST-backed CLI. Because the CLI talks to its backing system through a REST API, we could decouple the CLI from the physical server by placing a proxy layer between the CLI process and a backing virtual machine. The proxy records real REST traffic, and the tests later replay it from a mock. This gives us tests that are based on a real-world system, are deterministic, and run in seconds, rather than taking several minutes per test as in a typical integration test infrastructure that must wait for the virtual machine to come online.

The end solution consists of two Python programs with some shared code, both of which can run in a container (such as Docker). This allows the framework to run in any development environment without requiring many dependencies to be installed locally. The end product can be broken down into the following components:

Recorder script
Proxy REST server
Runner script (to run the generated tests)
A mock REST server (to mock REST responses)
Individual test files

The framework is designed with a plugin architecture, allowing product-specific test suites to provide configuration (rules) and tests that run in a centralised testing library, thus leveraging shared infrastructure while maintaining their own domain-specific logic.

The details of how various challenges were solved can be found in the following sections:

Building a recording layer: defining a test structure and “schema” for individual tests
Building the runner layer: faking responses to a CLI
Reducing hardware dependency: regenerating tests using REST mocks
Managing non-determinism and volatility using a rules framework

Overview of the CLI and architecture

The CLI we were testing was a Python process that parsed CLI input and matched it to a corresponding REST request. It sent this REST request to the main process to set data in the database, execute an RPC, or fetch a value to show to the operator.

A diagram of the CLI process where user input is fed into a CLI parser. Once parsed, the input is matched with an associated REST request (or set of REST requests), which is then sent to the server. On the server end, it is processed into a database request, and finally the result is returned as a REST response.

Without appropriate unit tests, you could accidentally delete a line in a Python string somewhere and end up with a “show controller” command that omits the reason why the controller couldn’t be contacted.

While this can be covered with adequate integration testing, these kinds of tests are often slower. If each test requires a minimum 5-minute startup time, uses valuable server resources, and needs a “clean” environment to run in, then the system rapidly becomes unwieldy and costly for our backing infrastructure. Running several such tests on each pull request means that the cost scales with the product of the two (O(n × m), where n represents the number of tests run per PR and m represents the number of PRs), and changes could take a long time to be approved by our CI/CD systems.

Building a Recording Layer: defining a test structure and “schema” for individual tests

Recording a test can be seen as having three main components.

The REST proxy that would write to the test file
The CLI wrapper that would form a thin wrapper around a running CLI process, taking input commands and writing them to the test file
The backing REST server

Given what we were testing, we decided a simple JSON structure would suffice. A single test would have a series of JSON objects, each holding the CLI command entered, the CLI output, and the REST interaction(s) that the command triggered. A recorded test is therefore made up of the following components:

The test file, unique to this test
The initial configuration for the device being tested (used when automatically updating a test)
A textual representation of the database for the product being tested. This was shared among tests and was updated whenever a test was recorded/updated
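As an illustration of the per-test JSON structure, here is a minimal sketch of what one recorded step might look like. The key names (`steps`, `command`, `output`, `rest`), the `controller/status` path, and the output text are all assumptions made for this example, not the actual schema.

```python
import json

# Hypothetical layout: one entry per CLI command, holding the command,
# the CLI output, and the REST traffic that the command triggered.
test_file = {
    "steps": [
        {
            "command": "show controller",
            "output": "Controller unreachable: connection refused\n",
            "rest": [
                {
                    "method": "GET",
                    "path": "controller/status",  # illustrative path
                    "request_body": None,
                    "status": 200,
                    "response_body": {
                        "reachable": False,
                        "reason": "connection refused",
                    },
                }
            ],
        }
    ]
}


def validate_step(step):
    """Check that a step carries the three parts a recorded step needs."""
    missing = {"command", "output", "rest"} - step.keys()
    if missing:
        raise ValueError(f"step is missing keys: {sorted(missing)}")
    for interaction in step["rest"]:
        if interaction["method"] not in {"GET", "POST", "PUT", "PATCH", "DELETE"}:
            raise ValueError(f"unexpected method: {interaction['method']}")


for step in test_file["steps"]:
    validate_step(step)

# What would be written to disk as the test file.
serialized = json.dumps(test_file, indent=2)
```

Keeping the command, its output, and its REST traffic together in one object means a reviewer can read a test file top to bottom as a transcript of the session.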

A developer wouldn’t always be recording a new test. They may be re-recording or updating an existing one. In these cases the input should be automated, with the CLI commands read from the test file, fed into the CLI process, and the results recorded just as if a new test were being recorded. When re-recording, we need to make sure the backing VM is in the same starting state as the VM used to record the original test. This configuration is kept in the stored configuration file.

Recording a CLI test. A REST proxy sits between the CLI process and the backing REST service to capture all requests and responses — A wrapper around the CLI handles starting up and stopping the running CLI, and the REST server used for recording this test. It also processes input, either from the user via the command line or by reading from an existing test file
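The record-through behaviour of the proxy can be sketched without any networking. In this hedged example, `forward` is a hypothetical callable standing in for the real HTTP call to the backing REST server, and the recorded dictionary keys are illustrative names rather than the real file format.

```python
def make_recording_proxy(forward, recorded):
    """Wrap `forward` so every request/response pair is appended to `recorded`.

    `forward` is a hypothetical callable (method, path, body) -> (status, body)
    standing in for the real HTTP call to the backing REST server.
    """
    def proxy(method, path, body=None):
        status, response_body = forward(method, path, body)
        recorded.append({
            "method": method,
            "path": path,
            "request_body": body,
            "status": status,
            "response_body": response_body,
        })
        # The CLI sees exactly what the backing server returned.
        return status, response_body

    return proxy


def fake_backend(method, path, body):
    """Stand-in for the backing server, for illustration only."""
    if (method, path) == ("GET", "controller/status"):
        return 200, {"reachable": False, "reason": "connection refused"}
    return 404, {"error": "not found"}


recorded = []
proxy = make_recording_proxy(fake_backend, recorded)
status, payload = proxy("GET", "controller/status")
```

Because the proxy passes responses through untouched, the CLI behaves as it would against the real server while the test file fills up as a side effect.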

Building the Runner Layer: Faking responses to a CLI

Running a test involved a few simple steps:

Read all commands and outputs from the test file
Read all REST requests/responses from the test file
Load the REST requests into a mock REST server with the associated responses
Start a wrapped CLI where the REST address is the mock REST server
Run each command against the CLI and check that the output is as expected

Running a CLI test. A CLI wrapper sends commands to a CLI process. A mock REST server responds to the CLI using responses stored in the test file. The CLI test process gets commands from the test file and sends them to the CLI. The mock REST server matches requests to the stored responses and returns them to the CLI. The testing framework compares the output of the CLI to the output stored in the test file.
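The replay loop can be sketched roughly as follows. This is a simplified illustration, not the real runner: `run_command` is a hypothetical callable standing in for the wrapped CLI process, the test file layout is assumed, and the canned responses are a plain dictionary with no versioning.

```python
def run_test(test_file, run_command):
    """Replay each recorded command and collect any output mismatches.

    `run_command` is a hypothetical stand-in for the wrapped CLI: it takes a
    command string plus the canned responses and returns the CLI output.
    """
    # Gather the canned REST responses, keyed by (method, path).
    canned = {}
    for step in test_file["steps"]:
        for interaction in step["rest"]:
            key = (interaction["method"], interaction["path"])
            canned[key] = interaction["response_body"]

    # Run every command and compare the output with what was recorded.
    failures = []
    for step in test_file["steps"]:
        actual = run_command(step["command"], canned)
        if actual != step["output"]:
            failures.append((step["command"], step["output"], actual))
    return failures


# Assumed test file layout, for illustration only.
test_file = {
    "steps": [
        {
            "command": "show controller",
            "output": "Controller unreachable: connection refused\n",
            "rest": [
                {
                    "method": "GET",
                    "path": "controller/status",
                    "response_body": {"reason": "connection refused"},
                }
            ],
        }
    ]
}


def fake_cli(command, canned):
    """Toy CLI that renders one command from a canned REST response."""
    if command == "show controller":
        status = canned[("GET", "controller/status")]
        return f"Controller unreachable: {status['reason']}\n"
    return "unknown command\n"


failures = run_test(test_file, fake_cli)
```

Returning the list of mismatches, rather than stopping at the first one, lets a single run report every command whose output drifted.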

We can’t guarantee the strict order in which the CLI sends its REST requests, so simply replaying the responses in recorded order is not an option. At the same time, if the same “show” command is issued twice, once before a mutation event and once after, each invocation must receive the output that matches the state at that point.

To address this problem I implemented a versioning system where the version number within the mock server would increment every time a mutation event was received. When a REST request came in, the mock REST server would check for a matching response at the current version number. If no response was found at that number then it would get the highest version available that was less than the current version number.

In addition, identical GET requests/responses could be deduplicated in our mock REST server, but I had to take care not to deduplicate the mutation requests. Two POSTs could look identical but, depending on the current state of the machine, produce different results.

The versioning flowchart when getting a response from the REST server. If an entry for a given REST request doesn't exist at that version then previous versions are tried until there are no earlier versions, at which point the server returns an error
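The versioned lookup and the deduplication of reads can be sketched together. This is a minimal illustration under some assumptions: requests are keyed only by method and path (request bodies are ignored), any POST, PUT, PATCH, or DELETE counts as a mutation, and the class and method names are invented for the example.

```python
MUTATING = {"POST", "PUT", "PATCH", "DELETE"}
_MISSING = object()  # sentinel: distinguishes "no entry" from a None response


class VersionedMock:
    def __init__(self, interactions):
        """Load (method, path, response) tuples in the order they were recorded."""
        self.version = 0
        self._store = {}  # (method, path) -> {version: response}
        recorded_version = 0
        for method, path, response in interactions:
            key = (method, path)
            if method in MUTATING:
                # Every mutation gets its own version, so it is never deduplicated.
                recorded_version += 1
            elif self._peek(key, recorded_version) == response:
                # An identical read is already reachable from this version.
                continue
            self._store.setdefault(key, {})[recorded_version] = response

    def _peek(self, key, version):
        """Return the response at `version`, or at the highest earlier version."""
        versions = self._store.get(key, {})
        eligible = [v for v in versions if v <= version]
        return versions[max(eligible)] if eligible else _MISSING

    def handle(self, method, path):
        """Serve a request during replay, bumping the version on mutations."""
        if method in MUTATING:
            self.version += 1
        response = self._peek((method, path), self.version)
        if response is _MISSING:
            raise LookupError(f"no recorded response for {method} {path}")
        return response


mock = VersionedMock([
    ("GET", "system/info", {"model": "x1"}),
    ("GET", "users", []),
    ("POST", "users", {"created": "alice"}),
    ("GET", "users", ["alice"]),
    ("GET", "system/info", {"model": "x1"}),  # duplicate read, not stored again
])
```

In this sketch, a read that never changes is stored once at version 0 and found by falling back from any later version, while the read of `users` has one entry on each side of the POST.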

Reducing Hardware Dependency: Regenerating Tests using REST mocks

When recording a brand new test we would always need a live version of the product under test. There would be no avoiding using VM resources for that.

However, if someone is just updating some CLI commands without changing the REST API then there is no need for them to wait for a VM. In these cases the REST server would be mocked, sending responses from the test file and writing to another test file that would later replace the original. This reduces the time to regenerate a test from minutes to mere seconds.

We can also repurpose the mock REST server used when running the tests for this exact purpose, serving canned responses to the REST requests made. If nothing had changed at all then this regeneration should result in an empty diff.
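One way to check the "empty diff" property is to compare the original and regenerated test files as normalised JSON text. The sketch below uses the standard library's `difflib`; the function name and the test file contents are illustrative assumptions.

```python
import difflib
import json


def diff_test_files(original, regenerated):
    """Return unified-diff lines between two test files (empty if identical)."""
    before = json.dumps(original, indent=2, sort_keys=True).splitlines()
    after = json.dumps(regenerated, indent=2, sort_keys=True).splitlines()
    return list(
        difflib.unified_diff(before, after, "original", "regenerated", lineterm="")
    )


# Illustrative test file contents.
original = {"steps": [{"command": "show controller", "output": "Controller: ok"}]}
unchanged = json.loads(json.dumps(original))  # regenerated with no CLI changes
reworded = {"steps": [{"command": "show controller", "output": "Controller status: ok"}]}

no_change = diff_test_files(original, unchanged)
cli_change = diff_test_files(original, reworded)
```

Sorting keys before diffing keeps the comparison stable even if the regenerated file happens to serialise its fields in a different order.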

Managing Non-Determinism and Volatility using a Rules Framework

Many REST requests and responses would include highly volatile fields, such as a date/time or a build number. In many cases the real values of these fields are not necessary to provide adequate test coverage. With multiple developers submitting changes, particularly to the initial schema files for a given product, these fields are a frequent source of merge conflicts.

I devised a “rules framework” to combat this. The goal was to strip out certain fields, or reset them to default values, when certain conditions were met. You would define a rule such as:

Rule(POST, "foo/bar").when_data_is("baz").set_field("content/name", "John")

This rule means that if a POST request for foo/bar comes in to the mock REST server during the test, and the data attached to the POST request is “baz” then the response returned from the mock or proxy REST server would have the name field in the content portion of the response equal to “John” regardless of what was actually returned from the API.

Rules could apply to any portion of the REST request and change what was written for any part of the REST response.

The application of these rules means that changeable fields such as build-id could be changed to default values (or simply deleted) and therefore would not continuously show up in diffs.
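To make the fluent rule syntax concrete, here is a minimal sketch of how such a `Rule` class might be built. Only the `Rule(...).when_data_is(...).set_field(...)` call shape comes from the example above; the `matches` and `apply` methods, and the choice to treat `content/name` as a slash-separated path into nested dictionaries, are assumptions for illustration.

```python
import copy

POST = "POST"


class Rule:
    def __init__(self, method, path):
        self.method = method
        self.path = path
        self._match_data = False
        self._data = None
        self._fields = []  # list of (key path, replacement value)

    def when_data_is(self, data):
        """Only match requests whose attached data equals `data`."""
        self._match_data = True
        self._data = data
        return self

    def set_field(self, field_path, value):
        """Overwrite a slash-separated field in the response with `value`."""
        self._fields.append((field_path.split("/"), value))
        return self

    def matches(self, method, path, data=None):
        if (method, path) != (self.method, self.path):
            return False
        return not self._match_data or data == self._data

    def apply(self, response):
        """Return a rewritten copy of `response`; the input is left untouched."""
        result = copy.deepcopy(response)
        for keys, value in self._fields:
            parent = result
            for key in keys[:-1]:
                parent = parent.setdefault(key, {})
            parent[keys[-1]] = value
        return result


rule = Rule(POST, "foo/bar").when_data_is("baz").set_field("content/name", "John")
response = {"content": {"name": "Alice", "id": 7}}
rewritten = rule.apply(response) if rule.matches(POST, "foo/bar", "baz") else response
```

Each builder method returns `self`, which is what allows the calls to be chained. Deleting a field instead of overwriting it could be added in the same style.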

Each product under test may need to define its own rules. I addressed this by setting up a loader or plugin system with three rules files:

The global_rules file, imported into the recorder/runner file
The rules.py file provided by the product under test. Because this file might be defined outside of the paths known by the core CLI testing library, it needed to be loaded dynamically using importlib from the paths provided by the product under test.
The rules.py file inside the test directory itself
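Loading a rules.py file from an arbitrary path can be done with `importlib.util`. The sketch below is one possible approach rather than the framework's actual loader; in particular, the convention that each rules file exposes a module-level `RULES` list is an assumption, and the rule tuple is a placeholder.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_rules(path, module_name):
    """Import a rules file from an arbitrary path and return its RULES list.

    The module-level name `RULES` is an assumed convention for this sketch.
    """
    spec = importlib.util.spec_from_file_location(module_name, str(path))
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load rules from {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the file's top-level code
    return list(getattr(module, "RULES", []))


# Simulate a product directory that lives outside the testing library.
with tempfile.TemporaryDirectory() as product_dir:
    rules_path = Path(product_dir) / "rules.py"
    rules_path.write_text('RULES = [("GET", "system/info", "build-id", "0")]\n')
    product_rules = load_rules(rules_path, "product_rules")
```

Loading by file location avoids having to put every product directory on `sys.path`, and giving each file its own module name keeps several rules.py files from shadowing one another.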

I applied these rules following a pattern where the specific overrode the general. If a rule conflicts with another rule, then test rules have the highest priority, followed by product rules, and then global rules. Rules that do not conflict are all applied, whichever level they come from.
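The precedence order can be expressed as a layered merge. In this hedged sketch, two rules are assumed to conflict when they target the same method, path, and field; the `FieldRule` type and the example values are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldRule:
    method: str
    path: str
    field: str
    value: str

    @property
    def target(self):
        # Assumption: rules conflict when they share method, path, and field.
        return (self.method, self.path, self.field)


def resolve_rules(global_rules, product_rules, test_rules):
    """Merge the three layers; later (more specific) layers win on conflict."""
    resolved = {}
    for layer in (global_rules, product_rules, test_rules):
        for rule in layer:
            resolved[rule.target] = rule
    return list(resolved.values())


global_rules = [
    FieldRule("GET", "system/info", "build-id", "0"),
    FieldRule("GET", "system/info", "timestamp", "1970-01-01T00:00:00Z"),
]
product_rules = [FieldRule("GET", "system/info", "build-id", "dev")]
test_rules = [FieldRule("GET", "system/info", "build-id", "test")]

resolved = resolve_rules(global_rules, product_rules, test_rules)
```

Here the `build-id` rule from the test directory replaces both the product and global versions, while the global `timestamp` rule survives because nothing more specific targets it.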

Conclusion

With this project our CLI becomes a lot more reliable. All new CLI commands can now have an automated test, without requiring lengthy integration test runtimes and without our automated testing systems putting a strain on our VM resources.

Published 2025.11.04