In the last year I've attended talks by Marshall Clow and Chandler Carruth on C++ tooling and caught the fuzzing bug from them. This post is an attempt to show how to use this fun and productive technique to find problems in your own code.
The basic idea behind fuzzing is to try massive numbers of random inputs to code in order to trigger a vulnerability. You create a testbench for the code of interest, pair it with a fuzzing engine that generates random data, and launch it on some server somewhere. Hours, days, or weeks later - if your testbench is solid - it comes back with a set of inputs that cause the code to crash. This process can be accelerated by coverage instrumentation, which steers the fuzzer toward unexplored code paths, and by sanitizers, which turn latent memory errors into immediate, diagnosable crashes.
This post will demonstrate the use of coverage-driven fuzzers with sanitizing, applied to an open source JSON parsing library.
For my test case I selected an open-source library I'm familiar with: json_spirit. It provides a very simple interface to parsing and generating JSON:
auto v = json::construct(s);  // parse and construct recursive variant
std::cout << v << "\n";       // print it back out as JSON
The advantages of using this library as a test case are its very simple interface and my existing familiarity with the codebase.
Fuzzers treat your code as a black box they are trying to exercise. They supply random input strings, and observe what code paths are executed. It is up to the user to construct a meaningful set of code inputs from random data with a test driver. In my chosen example, this is as simple as initializing a string using the supplied data. For a templating engine, you may choose to regard a portion of the input as the template and another portion as the model data. More complex examples can require even more interpretation.
In addition to constructing input data from the random string, you may also need to filter out strings that represent inputs you want to exclude from consideration. In the case of json_spirit, non-ASCII characters outside of double quotes are not handled, and the library does not (yet) handle invalid UTF-8 within double quotes. I therefore filter out such cases with some code of my own and a little help from Boost.Locale.
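The actual filter relies on Boost.Locale to validate UTF-8; as a self-contained illustration, here is a hypothetical sketch of just the first rule - rejecting non-ASCII bytes outside of double quotes. The function name and the backslash-escape handling are my own assumptions, and the UTF-8 validation inside quotes is deliberately omitted.

```cpp
#include <string>

// Hypothetical filter sketch: reject any input containing non-ASCII bytes
// outside of double-quoted strings. (The real code also validates UTF-8
// inside quotes, using Boost.Locale; that part is not shown here.)
bool acceptable_input(const std::string& s) {
    bool in_quotes = false;
    bool escaped = false;
    for (unsigned char c : s) {
        if (escaped) {              // character after a backslash: skip it
            escaped = false;
            continue;
        }
        if (in_quotes) {
            if (c == '\\')      escaped = true;
            else if (c == '"')  in_quotes = false;
        } else {
            if (c == '"')       in_quotes = true;
            else if (c > 0x7F)  return false;   // non-ASCII outside quotes
        }
    }
    return true;
}
```

Filtering like this keeps the fuzzer from wasting time on inputs the library is documented not to handle, so the crashes that remain are genuine bugs.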
libFuzzer can be checked out from LLVM's Subversion repository and built using their directions. You supply a test driver as a function called
LLVMFuzzerTestOneInput with C linkage. The result is a standalone program that exercises the code inside that function. It uses some Clang compiler-supplied instrumentation, via the
-fsanitize-coverage option, to monitor which paths are exercised, so gcc is not an option.
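A minimal driver for this library might look like the following sketch. The signature and C linkage are libFuzzer's requirements; the parsing call is an assumption based on the interface shown earlier, left as a placeholder comment so the sketch stands alone.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// The entry point libFuzzer calls once per generated input. It must have
// C linkage and exactly this signature.
extern "C" int LLVMFuzzerTestOneInput(const uint8_t* data, size_t size) {
    if (size == 0) return 0;
    std::string s(reinterpret_cast<const char*>(data), size);
    // Hypothetical placeholder: hand the string to the code under test,
    // e.g. auto v = json::construct(s); -- crashes or sanitizer reports
    // triggered inside that call are exactly what we are fishing for.
    return 0;   // libFuzzer reserves non-zero return values
}
```

Note that the driver itself should never crash on well-formed fuzzer input; any crash should come from the code under test, or the results are noise.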
AFL is a standalone tool that uses binary rewriting to instrument the code being tested. It supplies wrapper compilers that call either Clang or gcc as necessary. The test driver is written as a main program that takes the random string from standard input, which means each run is a separate process. However, if you use Clang, there is a special fast mode that instruments your code as a compiler pass, rather than a final object code rewrite. This means the instrumentation itself can be optimized, producing faster binaries.
An infrequently used, but potentially powerful, type of fuzzing engine is based on symbolic execution. In the last few years there have been significant advances in SMT solvers, upon which this technology relies. Many symbolic execution engines are proprietary tools, but I've heard positive things about Klee and hope to try it out someday.
I created the build flow in CMake in my own fork of json_spirit. From the CMake command line users can specify which sanitizer to use (address or memory) with the
SANITIZER option. We choose the test driver based on whether the selected compiler (from the standard
CMAKE_CXX_COMPILER option) appears to be one of the AFL wrappers. If not,
LLVM_ROOT points us to the location of the libFuzzer code and we use the function-based driver.
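Assuming the CMake options described above, a hypothetical AFL build-and-run session might look like the following; the seeds and findings directory names and the fuzz_driver target name are my own placeholders, not part of the actual build flow.

```shell
mkdir build && cd build
# Selecting AFL's LLVM-mode wrapper compiler makes the CMake flow
# pick the AFL (stdin-based) test driver
cmake -DCMAKE_CXX_COMPILER=afl-clang-fast++ -DSANITIZER=address ..
make
# -i: directory of seed inputs; -o: where AFL records its findings
afl-fuzz -i seeds -o findings ./fuzz_driver
```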
Building with the memory sanitizer presents some unique challenges. This sanitizer tries to find uses of uninitialized memory, and accordingly must track the state of values throughout their lifetime. It intercepts calls to the C library for this purpose. Every other library used must be compiled with
-fsanitize=memory to ensure no initialization is missed. This includes the C++ standard library. Even libFuzzer (if used) must be compiled this way. In the case of json_spirit, the libraries Boost.Locale and its dependency ICU need to be built separately with memory sanitizing enabled. Users supply paths to these libraries with another pair of command-line options.
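To illustrate the requirement, a hypothetical link line for an msan build might look like this; the path variables are placeholders standing in for the separately instrumented Boost.Locale and ICU builds.

```shell
# Every object and library in the final link must carry msan
# instrumentation, including the C++ standard library (an msan-built
# libc++ here). Paths are placeholders for the instrumented builds.
clang++ -fsanitize=memory -fsanitize-memory-track-origins -stdlib=libc++ \
  -I"$MSAN_BOOST/include" -L"$MSAN_BOOST/lib" -L"$MSAN_ICU/lib" \
  driver.cpp -lboost_locale -licuuc -o fuzz_driver
```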
Users of gcc have very limited options. libFuzzer requires a Clang-only compile switch, and gcc doesn't have a memory sanitizer at this time, so the only supported choice is AFL with address sanitizing.
Fuzzing json_spirit has so far found only a single bug, in Boost.Spirit, where an inappropriate check for ASCII characters produces an assertion failure. It may be that more running time is required to reach more paths in the code. I also suspect that using C++ strings and other higher-level abstractions (streams, variants etc.) tends to reduce the sort of bugs found in C-style code where pointer arithmetic, fixed-size buffers, memcpy etc. are common.
Going forward my default fuzzing approach will probably be AFL in fast Clang mode with address sanitizing. AFL is more mature and has more sophisticated mutation algorithms, and though its one-process-per-test approach is slower, the special Clang support compensates. Address sanitizing seems much faster than memory sanitizing, and you can always re-run all the interesting (unique path) test cases afterwards with msan turned on instead.
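For example, a hypothetical replay of the saved test cases under an msan-instrumented driver - AFL stores its unique-path inputs under the output directory's queue subdirectory - might look like:

```shell
# Re-run every unique-path input AFL found, this time under a driver
# built with -fsanitize=memory (fuzz_driver_msan is a placeholder name)
for f in findings/queue/id*; do
  ./fuzz_driver_msan < "$f"
done
```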
If you'd like to run fuzzing on your own code using this infrastructure, I suggest starting from the CMake flow in my fork, writing a test driver that constructs your code's inputs from the fuzzer's random string, and adding filters for any inputs you want to exclude from consideration.
I hope this proves helpful to someone.