Beyond Code Coverage - Mutation Testing

Code coverage is a metric most of us track and manage. The basic idea is that when you run the tests, the tooling records which lines and branches are executed. At the end, it reports the percentage of lines that were executed during the test run. If a line of code was never executed, it was never tested. A typical report might look something like this:

Code Coverage Results

You can see the line coverage and branch coverage in that graphic.

Typically, we consider 80% or above to be acceptable coverage, and 95% or above to be "great" coverage.

Lies, Damn Lies, and Statistics

However, coverage numbers are not always enough. Here's an example class and test case; let's see where the problem is.
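
A minimal sketch, with illustrative class and test names (these are assumptions, not the original listing):

    // Calculator.java
    public class Calculator {

        // Adds two integers together
        public int add(int a, int b) {
            return a + b;
        }
    }

    // CalculatorTest.java
    import static org.junit.jupiter.api.Assertions.assertEquals;
    import org.junit.jupiter.api.Test;

    public class CalculatorTest {

        @Test
        public void testAdd() {
            Calculator calculator = new Calculator();
            // Only exercises the "zero plus zero" case
            assertEquals(0, calculator.add(0, 0));
        }
    }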

So here we have a calculator, and we're testing the add method. We've tested that if we add two zeroes together, we get zero as a result. A standard code coverage report would call this 100% covered. So where's the problem?

Codifying Behavior

What happens if somebody accidentally changes the addition to subtraction in the original method? Something like this:
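
Continuing the illustrative sketch from above, the mutated method might read:

    public int add(int a, int b) {
        // Accidentally flipped from addition to subtraction
        return a - b;
    }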

The test case will still pass, and we will still have 100% coverage, but the code is now plainly wrong.

But it can get more insidious than this. How about this example:
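
One plausible shape for it (again an illustrative sketch, not the original listing) is a calculator that parses string inputs using a radix set in the constructor:

    // Calculator.java
    public class Calculator {

        private final int radix;

        public Calculator() {
            // Executed by any test that builds a Calculator, so it counts
            // as covered -- but no test ever checks this value directly
            this.radix = 10;
        }

        public int add(String a, String b) {
            return Integer.parseInt(a, radix) + Integer.parseInt(b, radix);
        }
    }

    // CalculatorTest.java
    import static org.junit.jupiter.api.Assertions.assertEquals;
    import org.junit.jupiter.api.Test;

    public class CalculatorTest {

        @Test
        public void testAdd() {
            Calculator calculator = new Calculator();
            // "0" parses to 0 in any radix, so this passes no matter
            // what the constructor sets
            assertEquals(0, calculator.add("0", "0"));
        }
    }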

The lines in the constructor count toward code coverage, even though we never validate their effect. What happens if somebody changes the radix to 2? The tests will still pass, and the app will be completely wrong!

While additional tests can address these specific gaps, how do we know whether our coverage is truly comprehensive?

Mutations

Imagine that we took the code above and did a couple things:

  • Changed all the constants to MAX_INT -- including radix
  • Changed addition to subtraction
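
Applied to the illustrative calculator above, those two changes might read:

    public class Calculator {

        private final int radix;

        public Calculator() {
            // Constant mutated to the largest possible int
            this.radix = Integer.MAX_VALUE;
        }

        public int add(String a, String b) {
            // Addition mutated to subtraction
            return Integer.parseInt(a, radix) - Integer.parseInt(b, radix);
        }
    }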

Now, these should be breaking changes. But what happens if you make those changes and your tests still pass? That means you're not testing that behavior well enough! If your tests fail or throw an error, then you are testing it.

This is the basic idea behind mutation testing. By manipulating the bytecode, we can apply small transformations (mutations) to the code and re-run the test suite. We can then count the mutations that caused a test failure and the mutations that did not -- we call those "killed" and "survived", respectively.

Your goal, then, is to maximize the "killed" stat and minimize the "survived" stat. We have a Maven Site Docs Example, and I have wired in the pitest library, which provides the mutation framework. The output looks a little something like this:

pitest output 1

Those red lines are places where the mutations survived -- and the green ones are where mutations were killed. Highlighting or clicking the number on the left shows this portion:

pitest mutations applied

Mutations Available

Each language and framework is unique, but here are some examples of before and after:
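
For instance, pitest's conditional and equality mutators flip comparisons. A sketch with made-up code:

    // Before
    if (balance >= price) {
        completePurchase();
    }

    // After a conditional boundary mutation
    if (balance > price) {
        completePurchase();
    }

    // Before
    return count == 0;

    // After a negated conditional mutation
    return count != 0;

A test that never exercises the boundary value, or never asserts on the returned result, will let these mutants survive.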

Of course, this isn't limited to just equality mutations. We can also do stuff like this:
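
A sketch of pitest's void method call removal, with made-up names:

    // Before
    public void register(User user) {
        audit.recordSignup(user);   // void call with a side effect
        repository.save(user);
    }

    // After the void method call mutation
    public void register(User user) {
        // audit.recordSignup(user) has been removed by the mutator
        repository.save(user);
    }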

In this case, we are removing a method call that does not return a value. If your tests still pass after this, why does that method call exist?

You should check out the full list of mutations available in pitest.

Conclusion

This method of testing is obviously more comprehensive, but it comes at a cost: time. Applying all of these mutations and re-running the tests can take significantly longer, since each mutation requires its own test run to determine whether it was killed. Obviously, the larger your code base, the more mutations there are, and the more test runs are needed. You can tweak things like the number of threads used to run the suite, the set of mutations enabled, and the classes targeted. Except for the thread count, each of these tweaks reduces the overall comprehensiveness of the testing.

Now you know another tool for quality. You can see how we wired it up in our Maven Site Docs example, and how we integrated it into the Maven lifecycle.
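
As a rough sketch of that wiring (the version number and package names here are illustrative, not the exact values from the example project), the pitest plugin goes in the build section of the pom:

    <plugin>
        <groupId>org.pitest</groupId>
        <artifactId>pitest-maven</artifactId>
        <version>1.15.2</version>
        <configuration>
            <!-- Number of threads used to run the mutated test suites -->
            <threads>4</threads>
            <!-- Limit mutation to your own packages -->
            <targetClasses>
                <param>com.example.*</param>
            </targetClasses>
            <targetTests>
                <param>com.example.*</param>
            </targetTests>
        </configuration>
    </plugin>

From there, running mvn org.pitest:pitest-maven:mutationCoverage produces a report like the one shown above, and binding the mutationCoverage goal to a phase such as test ties it into the Maven lifecycle. If the tests are JUnit 5, the plugin also needs the pitest-junit5-plugin declared as a dependency.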