Fuzzing has its limitations. Formal methods can help you overcome them.
In the first post of this series, we introduced the software verification technique called fuzzing. We described what it is, who uses it and why, the tools involved, and the benefits of fuzzing.
In this post, we’ll examine the limitations of fuzzing. We’ll then show how most of these can be overcome by pairing a good fuzzer with formal-methods-based analysis tools.
This post is Part 2 of a 3-part series derived from TrustInSoft’s new guide to fuzzing for cybersecurity entitled “Fuzzing and Beyond.” To obtain a FREE copy, CLICK HERE.
Of course, one’s definition of “meaningful” will depend upon one’s use case and objective. Many fuzzers do not solve the difficult problem of generating meaningful inputs, but simply generate a lot of them hoping that some will be interesting. The idea is that—since modern computers are so fast that it takes only a very small fraction of a second to run the code on the inputs—you might as well generate plenty of inputs and see which ones cause something interesting to happen.
As mentioned in our previous post, more sophisticated grey-box fuzzers will refine and improve their input seeds to explore promising areas and find more vulnerabilities. Nonetheless, fuzzing is an iterative process: any instance of the PUT (Program Under Test) can only be tested with one fuzz input at a time, and complex input profiles can present billions upon billions of possible test cases.
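To make the generate-and-run loop concrete, here is a minimal fuzz harness sketch in C. It uses the libFuzzer-style entry point `LLVMFuzzerTestOneInput`; the `parse_message` routine under test is a hypothetical stand-in for whatever code you actually want to fuzz.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical routine under test: accepts a 4-byte "FUZZ" header
 * followed by an arbitrary payload; rejects anything else. */
static int parse_message(const uint8_t *data, size_t size) {
    if (size < 4) return -1;                    /* too short: reject */
    if (data[0] != 'F' || data[1] != 'U' ||
        data[2] != 'Z' || data[3] != 'Z') return -1;
    return 0;                                   /* header accepted */
}

/* libFuzzer-style entry point: the fuzzing engine calls this once per
 * generated input, millions of times per campaign, as fast as the
 * machine allows. */
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
    parse_message(data, size);
    return 0;   /* the return value is reserved; always return 0 */
}
```

Each iteration of the loop exercises exactly one input, which is why raw execution speed matters so much to fuzzing throughput.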
For example, a 256-bit input offers 2^256 (more than 1.15 × 10^77) possible values. Even with fast computers and parallel processing, fuzzing an entire input space of that size would be prohibitive in terms of both time and cost.
Fuzzing engines generate fuzz inputs in a semi-random fashion. That means even the best fuzzing algorithms will produce a significant number of redundant inputs and run many redundant tests.
Embedded software is typically designed to run on specific embedded hardware, such as an ARM RISC processor.
Good fuzz testing requires that you run as many fuzz inputs as possible. To make that cost-effective, those inputs need to be applied as quickly as possible. The power of fuzzing is measured in thousands of executions per second; 8,000 executions per second is better than 6,000. Due to this need for speed, fuzz testing is typically performed during unit testing in a host environment.
While finding and eliminating vulnerabilities in a host environment is good, if you only test in that environment you’ll have no chance of detecting vulnerabilities that occur only in the target architecture.
In other words, there are some bugs that will only occur when your code is running on your big-endian target architecture. If you fuzz test only on your little-endian desktop environment, you’ll have no hope of finding them.
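A classic example of such a bug is byte-order-dependent parsing. The sketch below (illustrative, not from the guide) shows a length-field reader that happens to produce the right answer on a little-endian desktop but the wrong one on a big-endian target, next to a portable rewrite.

```c
#include <stdint.h>
#include <string.h>

/* Buggy version: reinterprets the first two bytes of the buffer
 * through the HOST's native byte order.  It only "works" when the
 * host's endianness happens to match the wire format. */
uint16_t read_len_native(const uint8_t *buf) {
    uint16_t len;
    memcpy(&len, buf, sizeof len);   /* host byte order */
    return len;
}

/* Portable version: assembles the value explicitly from the bytes
 * (little-endian wire format), independent of the host architecture. */
uint16_t read_len_portable(const uint8_t *buf) {
    return (uint16_t)(buf[0] | (buf[1] << 8));
}
```

Fuzzing only on a little-endian desktop would never flag `read_len_native`, because both versions agree there; on a big-endian target they diverge.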
Since fuzzers generate fuzz inputs semi-randomly with lots of redundancy, they can’t possibly generate all the possible values that make up your code’s input space. As mentioned, for a non-trivial program, there are too many possible input combinations. Billions upon billions of them.
Also, fuzzers are not intended to detect all bugs resulting from their inputs. The main purpose of a fuzzer is to generate sets of semi-random inputs. Secondary purposes of grey-box and white-box fuzzers include the measurement of code coverage, and the optimization of input sets to maximize code coverage while minimizing the number of inputs/executions required to achieve that coverage. Fuzzers will indicate interesting behavior like program crashes caused by specific inputs, but they are not designed to detect every bug that might be lurking in your code.
In short, while it can help prove a program is incorrect, fuzzing cannot prove a program is correct.
Fuzzing for cybersecurity relies on the use of what is called a bug oracle. Researchers have defined a bug oracle as "a program, perhaps as part of a fuzzer, that determines whether a given execution of the PUT violates a specific security policy."[1]
Since fuzzing generates arbitrary inputs in an automated manner, we do not know a priori how the target software is supposed to behave for those inputs. Is it expected to reject them as invalid? Is it expected to accept them as valid and produce a particular response to them? Either way, the fuzzer doesn’t know.
The fuzzer only knows if a given input allows the program to execute or causes an execution failure (a crash). If the program executes, the fuzzer provides no indication of whether the program’s response was correct or not. If the user is to recognize whether a given input results in behavior worthy of further human investigation, another component must be present. That component is the oracle.
The oracle tells the user if the target software appears to be behaving incorrectly for a given input.
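The difference between a crash-only oracle and a stronger one can be sketched in a few lines of C. The routine under test (`sanitize`) and its policy check are hypothetical examples: the oracle encodes a property the fuzzer itself cannot know, namely that no control characters survive sanitization.

```c
#include <stddef.h>

/* Hypothetical routine under test: replaces ASCII control characters
 * with '?' so the output is safe to log. */
static void sanitize(char *s, size_t n) {
    for (size_t i = 0; i < n; i++)
        if ((unsigned char)s[i] < 0x20) s[i] = '?';
}

/* Oracle: checks the security policy after each execution.  A
 * crash-only oracle would miss a sanitize() that runs to completion
 * but silently leaves control characters behind; this check catches
 * that incorrect-but-non-crashing behavior.  Returns 1 if the policy
 * holds, 0 if it was violated. */
static int oracle_no_control_chars(const char *s, size_t n) {
    for (size_t i = 0; i < n; i++)
        if ((unsigned char)s[i] < 0x20) return 0;  /* violation */
    return 1;
}
```

In a real harness, the oracle's verdict (not just "did it crash?") decides whether an input is flagged for human investigation.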
In the case of C/C++ and other memory-unsafe languages, the frequently used default oracles just described are only partial. They do not check for all of the possible types of undefined behavior that hackers might exploit; there is a vast array of these, and some are more subtle than others. Furthermore, the results computed for a given input may vary depending on the memory layout defined at compilation.
Therefore, not finding anything interesting during the execution of the (instrumented) binary during fuzzing does not mean that nothing interesting will happen during the execution of the (uninstrumented) binary after deployment.
Classic fuzzing will greatly expand your exploration of the total input space of your target program. You will likely find a lot of bugs that you would not normally find during a normal testing campaign.
Due to the limitations just described, however, classic fuzzing will always be a partial solution, for three reasons. First, you cannot explore the entire input space due to its size. Second, you will only find bugs that occur in the specific memory allocation dictated by the compilation performed by the fuzzer. Finally, you will only find bugs your instrumentation and analysis tools are designed to detect.
To overcome these limitations, you need a complete oracle rather than a partial one.
To be clear, an undefined behavior is not the same as either an implementation-defined behavior or an unspecified behavior.[2]
Undefined behavior is a C/C++ language concept (the C standard catalogues instances of it in Annex J.2): it covers code constructs for which the standard places no requirements on the compiler. In other words, there is zero guarantee that such code will behave the same in different contexts, in particular across the toolchain (and toolchain settings) used to generate the executable and the environment in which the executable runs. Undefined behaviors therefore frequently cause seemingly random crashes or random program behavior, and these problems are often very difficult to detect under standard laboratory testing conditions.
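Signed integer overflow is one of the best-known undefined behaviors, and it illustrates why UB is so treacherous: the compiler is allowed to assume it never happens. The sketch below contrasts a broken overflow check with a well-defined rewrite.

```c
#include <limits.h>

/* BROKEN: if x == INT_MAX, the addition x + 1 overflows a signed int,
 * which is undefined behavior.  Because UB "cannot happen" as far as
 * the optimizer is concerned, the compiler may legally conclude that
 * x + 1 < x is always false and delete the check entirely. */
int will_increment_overflow_broken(int x) {
    return x + 1 < x;   /* UB when x == INT_MAX */
}

/* Well-defined rewrite: compare against the limit before adding, so
 * no overflow ever occurs and the result is guaranteed on every
 * compiler, at every optimization level. */
int will_increment_overflow(int x) {
    return x == INT_MAX;
}
```

The broken version may even appear to "work" in a test build and then vanish under optimization, which is exactly the kind of context-dependent behavior described above.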
A great first step toward eliminating all undefined behavior in your code is to fuzz your code with the help of TrustInSoft Analyzer. Running fuzz inputs through TrustInSoft Analyzer allows you to formally verify the elimination of all bugs the Analyzer finds.
TrustInSoft Analyzer is a hybrid code analyzer that combines advanced static and dynamic analysis techniques with formal methods to mathematically guarantee C/C++ code quality, reliability, security, and safety. It is designed to detect ALL undefined behaviors on any execution path, in combination with a fuzzer or any test drivers, with no false alarms.
TrustInSoft Analyzer can guarantee your fuzz testing results are valid for any compiler, any chosen set of compiler options, and any memory layout. In short, TrustInSoft Analyzer is a complete and perfect bug oracle for C/C++ code. It optimizes the fuzzing process.
Fuzzing with TrustInSoft Analyzer is fast and efficient because the tool integrates easily with AFL, one of the most popular coverage-based grey-box fuzzers. We like fuzzing with AFL for a number of reasons. Of these, three stand above the rest.
First, AFL is chainable to other tools. As the creator’s documentation states, “The fuzzer generates superior, compact test corpora that can serve as a seed for more specialized, slower, or labor-intensive processes and testing frameworks. It is also capable of on-the-fly corpus synchronization with any other software.” [3]
Second, AFL employs an efficient source code instrumentation to record the edge coverage of each execution of the program being tested and the coarse hit counts for each edge. It uses this information not only to generate seed files for new fuzz inputs but also in the implementation of a unique deduplication scheme that optimizes code coverage using a minimum set of inputs.[4]
Third, AFL uses a heuristic Evolutionary Algorithm (EA) to refine its seed pool based on branch coverage. This helps improve the odds that newly generated fuzz inputs help to increase coverage.[5]
All of these features help shorten the fuzzing cycle and save you time.
After running AFL to generate a set of test cases (input files), you just load those test cases along with your code into TrustInSoft Analyzer and run an analysis in the Analyzer’s interpreter mode. This is a simple operation—as easy as ordering a compilation of your code.
Find more vulnerabilities. TrustInSoft Analyzer can detect undefined behaviors that typically remain unnoticed throughout standard unit and integration tests. You will be 100% sure you have no undefined behaviors for the specific entry points defined by your test (fuzz) inputs.
No false alarms. When running analysis on discrete inputs, TrustInSoft Analyzer generates no false alarms. All alarms raised correspond to real vulnerabilities, thanks to our use of formal methods. You’ll waste no time investigating false positives.
Better validation of embedded code. TrustInSoft Analyzer provides target emulation for embedded hardware platforms. Target emulation allows you to test your embedded code in an environment that closely resembles your target architecture. It helps you find bugs in embedded code that unit testing in a host environment cannot possibly reveal.
Out of the box, TrustInSoft Analyzer supports a number of common target platforms, including 32-bit ARM, 64-bit ARM, PowerPC, RISC-V, and x86. If your hardware is more exotic, the emulator can easily be configured by adjusting a series of parameters. Everything that can change from one target platform to another is configurable in the Analyzer.
In the final installment of this series, we’ll look at how to go beyond fuzzing to formally prove that you have eliminated every last vulnerability from your most cybersecurity-critical applications.
To download your FREE copy, CLICK HERE.
[1] Manès, V., et al., The Art, Science, and Engineering of Fuzzing: A Survey, IEEE, October 2019.
[2] Gruevski, P., Falsehoods programmers believe about undefined behavior, predr.ag/blog, November 2022.
[3] Zalewski, M., American fuzzy lop.
[4] Manès, V., et al., The Art, Science, and Engineering of Fuzzing: A Survey, IEEE, October 2019.
[5] Ibid.