Software testing

Last modified on August 27, 2019 by Alex

This is my collection of notes on various software testing methods. It is not intended to be either complete or super well-sourced.

Let us first define testing in its most general form. We have a system under test that we probe with some stimuli and observe its outputs. Observations can include anything that can be discerned and ascribed a meaning significant to the purpose of testing — including output values, state, temporal properties of the execution, heat dissipated during execution, power consumed and any other measurable attributes of the execution.

Testing attempts to falsify the behavior of the system with respect to its specification. A specification is a set of properties that specify whether a sequence of stimuli and the resulting observations is correct or not.

In order to automate testing, we need an oracle - a procedure that determines what the correct behaviour of a system should be for all inputs with which we wish to subject the system under test. You can think of an oracle as automated specification, specification as a decision procedure.

One of the difficult problems in testing is the lack of good testing oracles that can be used in automated testing. For example, consider testing a speech-to-text translation systme. We can test it on individual pairs of ⟨auidoclip, text⟩, but each such pair has a non-trivial cost to obtain. A few advanced testing techniques (property testing, metamorphic testing, derived oracles, pseudo-oracles, implicit oracles) attempt to partially solve this problem.

The ground truth oracle, an oracle that always gives the “right answer”. We rarely if ever have access to a ground truth oracle for a problem. Given a ground truth oracle G, we can define a sound oracle (also known as partially correct oracle) T as an oracle such that ∀ σ. (T σ)↓ ⟹ T σ = G σ: for every input σ such that T σ terminates and gives an answer, T is correct. A complete oracle is defined as ∀ σ. (T σ)↓.

Specified oracles judge all behavioural aspects of a system with respect to a given formal specification. An implicit oracle refers to the detection of ‘obvious’ faults such as a program crash. Derived oracles involve artefacts from which an oracle may be derived – for example a previous version of the system. A pseudo-oracle is an alternative version of the program produced independently, e.g. by a different programming team or written in an entirely different programming language.

Unit tests

Unit testing is perhaps the least interesting method of testing. A unit test is just a piece of code that exercises a particular feature or behavior of the code and validates the results against explicitly specified results. While tedious to write, they are indispensable for the initial development of any piece of software.

Integration tests exercise larger pieces of code or multiple independent modules (modules, applications, services) together.

There is little to be said about either unit testing or integration testing in general.

Property testing

Functions often have easy to describe properties, such as

  • Non-negativity ∀ x. f x ≥ 0.
  • Commutativity ∀ x y. f x y = f y x.
  • Associativity ∀ x y z. f (f x y) z = f x (f y z).
  • Injectivity, ∀ x y. f x = f y ⟹ x = y.
  • Various symmetries w.r.t. other operations:
    • ∀ x. f x = f (1 - x).
    • ∀ x. f (reversed x) = f x.
    • ∀ g x. f (map g x) = map g (f x).
    • ∀ x y. f (x + y) = (f x) + (g y).
  • Being constant along a specific axis ∀ x y. f(0, x) = f(0, y).
  • Even unit tests can be considered “function properties”, f(10) = 1.
  • Etc.

Such properties may be falsified by enumeration of a subset of possible inputs. For example, in order to check that ∀ x. f x ≥ 0 is not satisfied by some f, we can test this property on a large number of inputs {x₁, x₂, ... xᵢ} ⊆ X. Since (∀ x. f x ≥ 0) ⟹ (f x₁ ≥ 0) ⋀ (f x₂ ≥ 0) ⋀ ... ⋀ (f xᵢ ≥ 0), if we find any xᵢ such that ¬(f xᵢ ≥ 0), we will prove that ¬(∀ x. f x ≥ 0) or in other words, falsify ∀ x. f x ≥ 0.

Failing to falsify a property is not the same as proving a property! The input space of a function is often infinite, so any attempt to falsify a property will explore only a small subset of all possible inputs. There could easily be some inputs that would falsify a property, but we just failed to find them.

Finding good inputs to test properties on is not easy, especially in the black-box case of testing a system without having any insight into its behavior. There are some techniques, such as coverage-guided fuzzing and concolic testing that attempt to partially alleviate this problem in gray-box and white-box cases, where we have partial or complete access to the system’s internal implementation. Note that usually they are used as their own methods of testing rather than being applied to property testing.

Since the number of inputs we can test our functions on is limited, it is crucial to fine-tune the input generators to cover most “unusual” inputs. Additionally, once we have found a counter-example to a property, it is useful to be able to minimize while preserving the failure condition. We’ll come back to both random input generation and test case minimization later.

Both regression testing and metamorphic testing can both be thought a form of property testing. You can think of the former as a “symmetry” in time and the latter as a symmetry w.r.t. the input.

Regression testing

The idea behind regression testing is that an old version of a system can be used as a derived testing oracle for the new version. If we have an old version f₁ and a new version f₂, we can check that:

  • The new system produces the same outputs as the old system, ∀ x. f₁ x = f₂ x.
  • The new system produces the same outputs as the old system on all inputs but some, and for every input x such that f₁ x ≠ f₂ x, f₁ x produces “better” results than f₂ x, ∀ x. (f₁ x = f₂ x) ⋁ (better (f₂ x) (f₁ x)). Usually the better predicate is evaluated manually.
  • The new system runs equally fast or faster than the old system.
  • The new system consumes the same amount of memory as the old system or less.

N-version testing

If we have multiple implementations of the same function, we can test one implementation against the others, which is called n-version testing. A particularly useful example of this is implementing the same functionality with and without algorithmic or low-level implementations and then testing them one against the other:

We can also test our code against someone else’s implementation of the same or similar functionality:

Round-trip properties

Sometimes the function under test has an inverse or some sort of “pseudo-inverse”, such that composition on the left or right gives us an identity function. For example, consider the problem of testing a base64 encoder encode : Array Byte -> String. If we write a function that decodes base64, decode: String -> Option (Array Byte), we can state that:

Metamorphic testing

The prototypical example for metamorphic testing is that of a program that computes the sine function: What is the exact value of sin(12)? Is an observed output of −0.5365 correct?

Manually evaluating sin(12) is non-trivial, but we can use its symmetry, ∀ x. sin(x) = sin(π − x), and test sin(12) = sin(π − 12) without knowing the concrete values of either sine calculation. This is an example of a metamorphic relation: an input transformation t that can be used to generate new test cases from existing test data, and an output relation p, that compares the outputs produced by a pair of test cases:

If we have multiple metamorphic transformations {t₁, t₂, t₃, ... tₐ}, we can also check that

This might suggest thinking of metamorphic relations as a group of symmetries T of f w.r.t. some property p.

We can arrive at a different generalization of metamorphic testing if we consider n-ary relations p. A metamorphic relation for a function f is expressed as a relation among a series of function inputs {x₁, x₂, ..., xᵢ} (with i > 1), and their corresponding output values {f(x₁), f(x₂), ..., f(xᵢ)}.

Implicit oracles

Sometimes, when testing a complex system, you might struggle to come up with simple properties that its outputs must satisfy. There are, however, certain properties that most systems should satisfy such as:

  • Successfully terminate on all valid inputs.
  • Use a finite amount of memory for all reasonably sized inputs.
  • Not have any race conditions.
  • Be deterministic.
  • Not leak memory.

The first property alone, which may be expressed in Scala as:

is often enough to unearth a pleothora of bugs. Take any sufficiently complex program that reads files and generate a lot of random files and you are bound to discover bugs, buffer overflows, null pointer exceptions, memory leaks, etc.

C/C++ has a number of tools, called sanitizers that allow you to check for some common undesirable behaviors of programs, that rely entirely on C/C++-specific implicit oracles: undefined behavior, memory leaks, buffer overflows, data races, and uninitialized memory.

Generating test inputs

Gen monad and Arbitrary typeclass

QuickCheck paper defines three important “types”:

Generating functions



Test case reduction


Extrapolating from failing test cases


Testing the property set

Consider any function f: X ⟶ Y and a set of properties 𝓟 it should satisfy, such as ∀ x. f(-x) = f(x) or f(10) = 2. Property testing is all about falsifiying one of p ∈ 𝓟 by enumerating a large number of possible inputs, trying to find a solution to ∃ x. ¬p(f, x) for some property p ∈ 𝓟.

There are, however, other questions we could ask ourselves, such as:

  1. Does 𝓟 completely determine f? Is there no other function that satisfies 𝓟?
  2. Is 𝓟 minimal? Is there a smaller set of properties 𝓡 such any function that satisfies 𝓟 satisfies 𝓡? Is there a property that is redundant and implied by other properties?
  3. If 𝓟 does not completely determine f, can we find some larger set of properties that does? Or, can we add one more property p such that p does not follow from 𝓟, but f satisfies p?
  4. Given an f, can we find a set of non-trivial properties that f satisfies?

Here the focus is no longer on f, but on the set of properties 𝓟.

Property set reduction

FitSpec explores the first two questions.

The basic idea behind FitSpec’s approach to (1) is quite simple. Take a function f: X ⟶ Y that satisfies 𝓟 and make a new function

which is called black-box mutant testing - creating new instances of a system that exhibit aberrant behavior.

If g also satisfies 𝓟 for some x₁ and y₁ and f x₁ ≠ y₁, then 𝓟 does not completely describe f! This is an effective procedure to falsify a statement “𝓟 completely determines f”, but as was the case with property testing, failing to falsify is not the same as proving.

Testing (2) is a little bit more difficult. Conceptually, property p implies q, p ⟹ q iff

or in other words, if for every function f and input x, whenever the first property p f x is satisfied, the second property q f x is also satisfied.

If p, q ∈ 𝓟 and both are satisfied by f, ∀ x. p f x = ⊤ and ∀ x. q f x = ⊤, then ∀ x. p f x ⟹ q f x is always true, since both sides of the implication are true! This means that varying x alone is insufficient to determine whether one property implies the other.

That’s where f[x₁ → y₁] comes in. We can attempt to falsify ∀ x x₁ y₁. p f[x₁ → y₁] x ⟹ q f[x₁ → y₁] x, and failing to falsify it on a large enough number of inputs would give us some evidence that p ⟹ q. Once we gather enough evidence, we can suggest a human to manually verify these implications and reduce the property set by removing redundant properties.

Whitebox mutant testing and test suite minimization

Above we saw how FitSpec uses black-box mutant testing to find redundant properties and discover property set incompleteness. There are other approaches to mutant testing that modify the source code of a function f directly, modifying one or more expression subtrees. This white-box mutant testing is often used to minimize and evaluate test suites. Conceptually, it is the same procedure as used in FitSpec, but this time applied to white-box mutant testing and unit tests rather than property tests.

Given a test suite 𝓟, we take our function f and white-box mutate it, e.g. change f(x) = if x == 1 then 2 else 3 to f₁(x) = if x <= 1 then 2 else 3. Doing this repeatedly and checking such f₁ against the test suite, we can calculate the number of mutants the test suite “killed” - the number of functions that didn’t pass the test suite. This provides a measure of test suite effectiveness and completeness.

Predicate mining

QuickSpec and Speculate attempt to answer this question.


You have read about all of the wonderful testing techniques above, and you might have noticed that tests are fundamentally about attempting to falsify some property P. One key insight that I would like to stress (once again) is that failure to falsify P is not a proof of P. Tests alone can not normally prove that your system is correct (unless they are exhaustive, which is almost never the case).

In 1972, Dijkstra remarked that “program testing can be used to show the presence of bugs, but never their absence”.

Types on the other hand allow you to prove that some property P holds. But one of the problems with types is that complex properties are very difficult to express in types and have formally verified. Most mainstream programming languages have insufficiently powerful type systems to express even very mundane properties such as ∀ x. f x ≥ 0! They are, however, great for proving basic correctness properties, such as “this program returns an integer, and not a string”.

There are other benefits to types that are not relevant to testing, such as typeclasses, type-driven typeclass derivation, and advanced abstractions.

Design by contract

If tests are insufficiently exhaustive to prove correctness, and type systems are insufficiently powerful to express valid constraints, what else can we do? Well, if we want certain properties to hold and we can not statically verify them, we can still check them at runtime.

Quoting from Wikipedia: Design by contract prescribes that software designers should define formal, precise and verifiable interface specifications for software components, which extend the ordinary definition of abstract data types with preconditions, postconditions and invariants. DbC considers these contracts to be so crucial to software correctness that they should be part of the design process. In effect, DbC advocates writing the assertions first.

While some languages like Eiffel have explicit support pre-, post-, and invariant conditions built into the language, almost every language has assertions of one form or another - boolean expression that are placed at certain points in a program to check its behaviour at runtime.

What is great about DbC is that assertions can serve both as documentation and as implicit oracles for the code’s behavior. This means that there is a certain synergy between other modes of testing and DbC - adding contracts may make your tests more effective at uncovering bugs.

In a way, DbC is dynamic typing for the dependent part of our type system. For example, let’s say you take a list l and an index i into that list and you want to split the list on that index:

The type signature above is not precise enough to completely determine the function, but we can use dependent types for a more precise definition:

Or, if we don’t have dependent types, we can add assertions:

You could argue that this is messy, non-compositional, and would cause runtime failures. And to an extent, I agree. Unfortunately, not every language supports dependent types, and writing code using dependent types can be a serious undertaking.


Test adequacy criterion

Some simple criteria:

  1. Every reachable statement should be executed in at least one test.
  2. Every feasible control-flow path should be followed in at least one test.

The first two apply mostly to imperative languages.


Fuzzing is one effective way of finding implicit anomalies. The main idea is to generate random (or “fuzz”) inputs and attack the system to find those anomalies.

Concolic testing