Pretty much all developers would agree that functional tests are essential for building high-quality code, but benchmarks are often overlooked as a means of verifying non-functional requirements like performance. While monitoring metrics for your app or service in production is second nature, not everyone reaches for a benchmark when they want to understand the performance of a system. And that’s a missed opportunity.
This article outlines what benchmarks are, why they’re helpful, and the different kinds available.
An introduction to benchmarks
Benchmarks are a special kind of test used to measure the performance of a piece of software. Unlike unit tests, benchmarks return numerical values rather than a binary pass/fail, which makes them slightly more challenging to work with than functional tests. We don’t often think about it, but there are two parts to all testing:
- Provoking some behaviour of the software you’re testing
- Measuring that behaviour
For functional tests, the second part is usually pretty trivial: the test provides some input and you check that you get some expected output, i.e. you’re testing for equality, inequality, or some other relation that can be answered with yes or no. Tests that validate the consistency of a database, or that a sequence of events doesn’t violate some property of your system, are like playing on hard mode, but even there the measurement is usually binary – did this sequence of events cause a violation, yes or no?
With benchmarks, measuring behaviour is decidedly non-trivial, because the results are drawn from a spectrum of values instead of a binary set. Instead of “PASS”, a benchmark might return “200ms” or “40MB”.
This is why benchmark results are usually compared against something else (after all, that’s what “benchmark” means). Maybe that’s a previous version of the same software, or a competing/alternative product. Comparing two numbers against each other gets us back to that binary answer and makes things easier to reason about.
Note: I said easier not easy.
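As a sketch of that comparison, here’s what “two numbers back to a binary answer” can look like in Python. The 5% regression threshold and the timings are illustrative assumptions, not values from any real project:

```python
def is_regression(baseline_ms: float, current_ms: float,
                  threshold: float = 0.05) -> bool:
    """Turn two benchmark numbers back into a binary answer.

    Returns True if the current result is more than `threshold`
    (5% by default) slower than the baseline.
    """
    return current_ms > baseline_ms * (1 + threshold)

# Illustrative numbers: a previous run took 200 ms, the new run 230 ms.
print(is_regression(200.0, 230.0))  # a 15% slowdown -> True
print(is_regression(200.0, 205.0))  # within the 5% tolerance -> False
```

Real comparisons need more care than this (multiple runs, variance, outliers), but the shape is the same: measure, compare, decide.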
Why do you need a benchmark anyhow?
This might sound like an obvious question but hear me out. Knowing exactly why you need a benchmark will guide which type of benchmark you ultimately end up writing and also how many benchmarks you need to write. It will also influence what you compare the results against.
You need a benchmark for a reason. And you should keep that reason clear in your mind. Ultimately all performance work can be linked to a business goal; that could be gaining a competitive advantage by optimising a common user flow, limiting your app to no more than 2GiB of memory (resource budget), or reducing cloud costs by minimising disk usage.
For example, if you’ve noticed that a lot of time is spent in some Java string handling function then you’ll probably want to write a benchmark for that code. Even if you ultimately optimise the string handling code, or eliminate it altogether, the benchmark will make sure that you actually are speeding things up. Here you’re comparing the benchmark against itself because you’re optimising the code under test.
On the other hand, if you’re creating a brand new service and have latency Service Level Objectives (SLOs) from the Product team or your manager, you’ll need a benchmark that provokes behaviour in a much larger section of the code because you want to measure the performance an end-user will experience. You’re comparing the benchmark results against some predetermined values, your SLO.
To help understand when you need benchmarks, performance expert Francesco Nigro likes to use the analogy of designing a car. You can begin by creating lab experiments or driving on a test track (benchmarks) but at some point you need to move beyond that and test on real roads (production). While driving on real roads, unexpected things you observe in the car’s behaviour might suggest you go back to the lab where you have more control over the environment. The point is that “writing benchmarks” isn’t something you do once and then forget about. Benchmarks are a tool, and you should turn to them whenever you need to investigate performance.
Your reason for needing a benchmark informs a bunch of things, not least of which is the type of benchmark you want to write.
Types of benchmarks
Functional tests are usually separated into roughly three buckets: unit tests, integration tests, and end-to-end (or system) tests. The scope of the thing under test increases as you move from one to the next. Admittedly, the boundaries are a little fuzzy and there’s no agreed-upon distinction between them, but they give you a rough guideline for thinking about the different types of tests you can write.
Analogously, we can lump benchmarks into three buckets: microbenchmarks, benchmarks, and end-to-end benchmarks.
Like unit tests, microbenchmarks cover the least surface area in your code. They’re small tests which target a specific and limited part of your software. Typical microbenchmarks might measure the time it takes to execute a few instructions for zeroing a buffer, or the time a single function takes to find a character in a string. Almost certainly these benchmarks are not doing I/O (accessing disks or the network).
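A minimal microbenchmark can be sketched with Python’s stdlib `timeit`. The workload here (finding a character in a string) and the iteration count are illustrative; dedicated harnesses like JMH or pytest-benchmark handle warm-up and statistics for you:

```python
import timeit

haystack = "x" * 1024 + "!"  # the target character is at the very end

# Run the snippet many times in a tight loop -- exactly the kind of
# synthetic workload microbenchmarks use -- and report per-call time.
iterations = 100_000
total = timeit.timeit(lambda: haystack.find("!"), number=iterations)
print(f"str.find: {total / iterations * 1e9:.0f} ns per call")
```

Note how little of a real application this exercises: one method, one input, in a loop.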
Next we have regular benchmarks. This type of test usually covers many functions and may also include some I/O such as accessing a file on the local disk. Examples include a benchmark to measure the time it takes for a database query engine to parse a query, create a query plan and execute it, and a benchmark that measures the number of requests per second an individual HTTP microservice can handle.
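A sketch of this middle tier, in Python: the benchmark below spans several functions and includes real local disk I/O by writing a CSV file and then timing the whole read-and-parse path. The file size and format are arbitrary choices for illustration:

```python
import csv
import tempfile
import time

# Set up a data file on local disk (the I/O component of the benchmark).
rows = [("id", "value")] + [(str(i), str(i * i)) for i in range(10_000)]
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 newline="") as f:
    csv.writer(f).writerows(rows)
    path = f.name

# Time the whole read-and-parse path, not a single function.
start = time.perf_counter()
with open(path, newline="") as f:
    parsed = list(csv.DictReader(f))
elapsed = time.perf_counter() - start

print(f"parsed {len(parsed)} rows in {elapsed * 1e3:.1f} ms")
```

Because disk caches, file systems, and allocator behaviour all come into play, results at this level are noisier than a microbenchmark’s but closer to what a user-visible operation costs.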
And lastly we have end-to-end benchmarks. These are as close as you can get to the real-world user experience and typically include some kind of client and server components. The client is analogous to whatever code your users execute to interact with your service, so that might be an SDK or a command-line tool. The server is all the pieces required for you to provide a service to your users, for example load balancers, HTTP servers, databases, event streaming platforms, etc.
There’s an interesting dynamic between the scope of your test and how realistic or representative of a user experience the results are.
Microbenchmarks test critical pieces of code for your software but the way that code is tested is totally unlike anything your users would do. Microbenchmarks routinely execute a short sequence of instructions in a loop thousands of times and it’s extremely unlikely that your code would do this during normal operation. Because of this, we say that microbenchmarks are synthetic or unrealistic workloads. That doesn’t mean they’re not useful, only that you should reserve microbenchmarks for strategic pieces of your code that are executed on the hot path – the code that gets executed most frequently.
If your app or service spends a lot of time parsing strings, microbenchmarks for the string parsing code are essential. If, on the other hand, your app spends most of its time placing tasks on queues, you’ll want a microbenchmark to measure that instead.
What you’re probably starting to realise from these examples is that you should not write a microbenchmark until you know which small parts of your code are critical. And the only way to know that is by profiling it. Additionally, when you run your benchmarks you should profile and monitor your application again while the benchmark is running to ensure that the benchmark stresses your code the way you intend. This is known as active benchmarking and we’ll have more to say about this in a later post.
Benchmarks fill the gap between completely synthetic microbenchmarks and real-world end-to-end tests. Think of tests for measuring the time a packet takes to travel through all the packet handling code of a microservice which could include the code to pull a packet out of a buffer, figure out what type of packet it is and place it on a packet-type-specific queue.
End-to-end tests are intended to be accurate representations of the way your users interact with your software. If your app requires multiple microservices running on multiple Kubernetes nodes, then that’s what you’ll need to spin up for your end-to-end test (or use your production environment). Also included in this category are scale tests and stress tests, which aim to measure peak performance.
These tests typically use the same client or SDK that your end users do. Because they’re larger and harder to set up, most developers run end-to-end tests far less frequently than other benchmark types. Still, they’re a handy backstop to understand what kind of performance you can expect your users to experience.
While the first two types of benchmarks usually live alongside the code for your app, end-to-end tests might use specialised benchmarking tools that allow you to write a workload specification. Examples include YCSB, Gatling, and Locust.
Conclusion
Benchmarks are an invaluable way of validating performance and should be part of every developer’s toolbox. While every developer understands the value of unit testing, benchmarks are often overlooked but provide the same kind of assurance about the quality of your code. Benchmarks come in all sizes and you should pick the right one for your situation.


