Sunday, June 06, 2021

The 2021 Python Language Summit: Fuzzing and Testing Python With Properties

At the 2021 Python Language Summit, Zac Hatfield-Dodds gave a presentation about fuzzing and testing with Python properties. This presentation tied in with the one he gave at the 2020 Python Language Summit.

Zac Hatfield-Dodds

What Is Testing?

For the purposes of this talk, he defined testing as the art and science of running code and then checking if it did what it was supposed to do. He added that, although assertions, type checkers, linters, and code review are good, they are not testing.

There are two general reasons why we might have tests:

  1. For correctness:
    1. The goal is to validate the software and determine that there are no bugs.
    2. Nondeterminism is acceptable.
    3. Finding any fault is a success.
  2. For software engineering (programming, over time, in teams):
    1. The goal is to validate changes or detect regressions.
    2. Nondeterminism is bad.
    3. Bugs should only be in the diff.

When these two reasons for testing aren't distinguished, there can be miscommunications.

What Is Property-Based Testing?

There are many types of tests:

  • Unit tests
  • Integration tests
  • Snapshot tests
  • Parameterized tests
  • Fuzz tests
  • Property-based tests
  • Stateful model tests

The speaker then walked the summit attendees through an example, going from traditional unit tests to parameterized tests and then showing how those lead into property-based tests.

Imagine that you needed to test the sorted() builtin. With a traditional set of unit tests, you can write a bunch of cases with the expected inputs and outputs:
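
(The code from the slides isn't reproduced in this post; the following is a minimal sketch of what such handwritten cases might look like.)

    def test_sorted_small_cases():
        # A handful of handwritten input/output pairs.
        assert sorted([]) == []
        assert sorted([1]) == [1]
        assert sorted([3, 1, 2]) == [1, 2, 3]
        assert sorted([1, 1, 1]) == [1, 1, 1]
        assert sorted(["b", "a", "c"]) == ["a", "b", "c"]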


If you want to avoid repeating yourself, you can write a list of inputs and outputs:
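
(A sketch of one way to do that, assuming pytest; the table of cases is illustrative rather than taken from the talk.)

    import pytest

    CASES = [
        ([], []),
        ([1], [1]),
        ([3, 1, 2], [1, 2, 3]),
        ([1, 1, 1], [1, 1, 1]),
        (["b", "a", "c"], ["a", "b", "c"]),
    ]

    @pytest.mark.parametrize("data, expected", CASES)
    def test_sorted_parametrized(data, expected):
        assert sorted(data) == expected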


If you don't have a known good result, then you can still write tests using only the input argument. One option would be to compare to another reference implementation:
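
(A sketch, using a deliberately simple insertion sort written just for the test as the reference; the reference used in the talk may have differed.)

    import pytest

    INPUTS = [[], [1], [3, 1, 2], [1, 1, 1], [2, 1, 2, 1]]

    def insertion_sort(items):
        # Slow but obviously correct reference implementation.
        result = []
        for item in items:
            for i, other in enumerate(result):
                if item < other:
                    result.insert(i, item)
                    break
            else:
                result.append(item)
        return result

    @pytest.mark.parametrize("data", INPUTS)
    def test_sorted_matches_reference(data):
        assert sorted(data) == insertion_sort(data)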

However, comparing with another reference implementation might not be an option, so you could just test if the output seems to be right:
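
(A sketch of that check, reusing pytest and the hypothetical INPUTS list from the previous sketch: the obvious property is that the output is in ascending order.)

    # (pytest and the INPUTS list are as in the sketch above)
    @pytest.mark.parametrize("data", INPUTS)
    def test_sorted_output_is_ordered(data):
        result = sorted(data)
        # Each element should be less than or equal to its successor.
        assert all(a <= b for a, b in zip(result, result[1:]))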

In order to improve on this test, you might want to add another property that you can test. You could check that the length of the output is the same as the length of the input and that you have the same set of elements:
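
(A sketch of that stronger test, again reusing the setup from the earlier sketches.)

    # (pytest and the INPUTS list are as in the sketch above)
    @pytest.mark.parametrize("data", INPUTS)
    def test_sorted_same_length_and_elements(data):
        result = sorted(data)
        assert all(a <= b for a, b in zip(result, result[1:]))
        assert len(result) == len(data)
        assert set(result) == set(data)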

However, this test would still pass on the incorrect result sorted([1, 2, 1]) -> [1, 2, 2]. A brute-force approach using itertools.permutations() would catch that bug, at the cost of factorial runtime:
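
(A sketch of that brute-force check; it is only feasible for short inputs.)

    import itertools

    # (pytest and the INPUTS list are as in the sketch above)
    @pytest.mark.parametrize("data", INPUTS)
    def test_sorted_is_a_permutation(data):
        result = sorted(data)
        assert all(a <= b for a, b in zip(result, result[1:]))
        # The output must literally be one of the permutations of the input.
        assert tuple(result) in set(itertools.permutations(data))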


But the best solution is collections.Counter():
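
(A sketch of the Counter-based version: the input and output must contain the same elements with the same multiplicities, checked in linear time.)

    from collections import Counter

    # (pytest and the INPUTS list are as in the sketch above)
    @pytest.mark.parametrize("data", INPUTS)
    def test_sorted_counter(data):
        result = sorted(data)
        assert all(a <= b for a, b in zip(result, result[1:]))
        # Same elements, same counts, with no factorial blow-up.
        assert Counter(result) == Counter(data)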

This last test uses property-based testing.

Instead of having a specific list of inputs, you could use Hypothesis:
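
(A sketch of the Hypothesis version: instead of enumerating inputs by hand, lists of floats are generated for you.)

    from collections import Counter
    from hypothesis import given, strategies as st

    @given(st.lists(st.floats()))
    def test_sorted_properties(data):
        result = sorted(data)
        assert all(a <= b for a, b in zip(result, result[1:]))
        assert Counter(result) == Counter(data)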

That test will fail because NaN compares unequal to itself, so any list containing NaN will appear not to be in sorted order. It would therefore be good to specify how NaN elements should be ordered by the sorting algorithm:
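
(The slide isn't reproduced here; two possible ways to pin the behavior down are sketched below, and both the test names and the choice of ordering are assumptions rather than the talk's example.)

    import math
    from hypothesis import given, strategies as st

    # Option 1: decide that the contract simply excludes NaN.
    @given(st.lists(st.floats(allow_nan=False)))
    def test_sorted_without_nan(data):
        result = sorted(data)
        assert all(a <= b for a, b in zip(result, result[1:]))

    # Option 2: specify that NaN sorts after everything else, via a key.
    @given(st.lists(st.floats()))
    def test_sorted_nan_last(data):
        result = sorted(data, key=lambda x: (math.isnan(x), x))
        numbers = [x for x in result if not math.isnan(x)]
        assert all(a <= b for a, b in zip(numbers, numbers[1:]))
        assert all(math.isnan(x) for x in result[len(numbers):])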

He said that one of the big advantages of using something like Hypothesis rather than a list of handwritten examples is that it will raise conceptual issues that you may not have thought through yourself.

In summary, property-based testing lets you:

  • Generate input data that you might not have thought of yourself
  • Check that the result isn't wrong, even without the right answer
  • Discover bugs in your understanding rather than just in your code

Often, you don't even need assertions in the test. Generating unusual input data is surprisingly effective. It can give you the sort of feedback you could get from real users, but you don't need to ship before getting the feedback.

A common concern is that randomized testing might make things flaky: how do you deal with nondeterminism? Hypothesis has been working on this for years, so it has solid answers to these kinds of questions:
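
(For example, failing examples are saved to a local example database and replayed first on later runs, and a test can be made fully deterministic through Hypothesis's settings API; the test below is a sketch.)

    from hypothesis import given, settings, strategies as st

    # derandomize=True makes the generated examples a deterministic function of
    # the test itself; print_blob=True prints a blob that reproduces any failure.
    @settings(derandomize=True, print_blob=True)
    @given(st.lists(st.integers()))
    def test_sorted_deterministic(data):
        result = sorted(data)
        assert all(a <= b for a, b in zip(result, result[1:]))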

If that's not enough, then you also have other options:
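
(Sketched below are two such options that exist in Hypothesis, not necessarily the ones shown on the slide: pinning the random seed for a test, and replaying one exact failure.)

    from hypothesis import given, seed, strategies as st

    # Pin the PRNG seed for this test; the pytest plugin also accepts a
    # --hypothesis-seed=<int> command-line option.
    @seed(20210606)
    @given(st.lists(st.integers()))
    def test_sorted_is_idempotent(data):
        assert sorted(sorted(data)) == sorted(data)

    # When a failure is found with print_blob enabled, Hypothesis prints a
    # @reproduce_failure(...) decorator that replays exactly that example.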

The Hypothesis database is a collection of files on disk that represent the various examples. Since it's a key-value store, it's easy to implement your own custom one:
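
(A sketch of both sides of that: configuring the on-disk default explicitly, and implementing a toy custom backend. The profile name, directory path, and DictExampleDatabase class are illustrative.)

    from hypothesis import settings
    from hypothesis.database import DirectoryBasedExampleDatabase, ExampleDatabase

    # The default behavior: examples stored as small files in a local directory.
    settings.register_profile(
        "local", database=DirectoryBasedExampleDatabase(".hypothesis/examples")
    )
    settings.load_profile("local")

    # The interface is a small key-value store, so a custom backend only needs
    # save/fetch/delete (here backed by an in-memory dict, purely to illustrate).
    class DictExampleDatabase(ExampleDatabase):
        def __init__(self):
            self.data = {}

        def save(self, key, value):
            self.data.setdefault(key, set()).add(value)

        def fetch(self, key):
            yield from self.data.get(key, set())

        def delete(self, key, value):
            self.data.get(key, set()).discard(value)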

In this example, you have a local database on disk. You can also have a shared network database on something like Redis, for example.

Coverage-guided fuzzing takes this to the next level:
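
(The idea is that a coverage-guided fuzzer, like those run on OSS-Fuzz, feeds byte buffers into the same property test and uses coverage feedback to explore more code paths. Below is a sketch of the wiring, assuming Google's Atheris fuzzer is installed.)

    import sys

    import atheris
    from hypothesis import given, strategies as st

    @given(st.lists(st.integers()))
    def test_sorted_properties(data):
        result = sorted(data)
        assert all(a <= b for a, b in zip(result, result[1:]))

    if __name__ == "__main__":
        # Every @given test exposes a fuzz_one_input hook that accepts a byte
        # buffer, so an external fuzzer can drive the property test directly.
        atheris.instrument_all()
        atheris.Setup(sys.argv, test_sorted_properties.hypothesis.fuzz_one_input)
        atheris.Fuzz()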

What's New?

At the 2020 Python Language Summit, when he said that we would find more bugs if we used property-based testing for CPython and the standard library, the response was positive, but then not much happened. Since then, Paul Ganssle has opened a PR on CPython to add some Hypothesis tests for the zoneinfo library. Zac Hatfield-Dodds said that CPython is doing very well on unit testing and has a strong focus on regressions but that it would be quite valuable to add some of the tools that have been developed for testing for correctness.

These tools don't only find existing bugs. They're good at finding regressions where someone checked in new code with what turned out to be inadequate test coverage.

Beyond the rate at which preexisting bugs get found and fixed, there is an ongoing rate at which newly introduced bugs are detected by fuzzing instead of lingering for a long time.


What's Next?

There is a three-step plan:

  1. Merge Paul Ganssle's PR or come up with an alternative proposal to get Hypothesis into CPython's CI in order to unblock ongoing incremental work
  2. Merge some tests
  3. Run them in CI and on OSS-Fuzz

Interested parties can see and engage with the follow-ups to this work on the Python Steering Council's issue tracker.