Optimizing Fuzzing to Eliminate More Vulnerabilities
May 23, 2023
Fuzzing has its limitations. Formal methods can help you overcome them.
In the first post of this series, we introduced the software verification technique called fuzzing. We described what it is, who uses it and why, the tools involved, and the benefits of fuzzing.
In this post, we’ll examine the limitations of fuzzing. We’ll then show how most of these can be overcome by pairing a good fuzzer with formal-methods-based analysis tools.
This post is Part 2 of a 3-part series derived from TrustInSoft’s new guide to fuzzing for cybersecurity entitled “Fuzzing and Beyond.” To obtain a FREE copy, CLICK HERE.
Limitations of Fuzzing
While fuzzing offers several advantages, it is not without its limitations. It would be a mistake to view fuzzing as the be-all and end-all of software security, and it’s important to know what those limitations are to evaluate the role fuzzing should play in the verification of a given application. We will examine several of those limitations here.
Finding meaningful inputs can take some time
Fuzzing generally requires a preliminary phase of running the fuzzer to “find” meaningful inputs. This can take quite a long time. In addition, you need to re-run fuzzers regularly as your code changes.
Of course, one’s definition of “meaningful” will depend upon one’s use case and objective. Many fuzzers do not solve the difficult problem of generating meaningful inputs, but simply generate a lot of them hoping that some will be interesting. The idea is that—since modern computers are so fast that it takes only a very small fraction of a second to run the code on the inputs—you might as well generate plenty of inputs and see which ones cause something interesting to happen.
As mentioned in our previous post, more sophisticated grey-box fuzzers will refine and improve their input seeds to explore promising areas and try to find more vulnerabilities. Nonetheless, fuzzing is an iterative process. Any instance of the PUT (Program Under Test) can only be tested for one fuzz input at a time, and complex input profiles can present billions upon billions of possible test cases.
For example, a 256-bit input file offers 2^256 (more than 1.15 × 10^77) possible values. Even with fast computers and parallel processing, fuzzing an entire input space of that size would be prohibitive in terms of both time and cost.
You can’t really control the inputs selected by the fuzzer
Fuzzing engines generate fuzz inputs in a semi-random fashion. That means even the best fuzzing algorithms will produce a significant number of redundant inputs and run many redundant tests.
What you can do—thanks to the instrumentation inserted by the fuzzer’s compiler—is measure code coverage and the relevance of individual inputs based on the coverage they produce.
Coverage-based grey-box fuzzers can adjust their seed pools to reduce redundant tests. In general, however, it is very difficult to generate inputs that reach all the parts of the source code. Coverage tends to remain well below 100%.
Fuzz testing may be insufficient for testing embedded code
Embedded software is typically designed to run on specific embedded hardware, such as an ARM RISC processor.
Good fuzz testing requires that you run as many fuzz inputs as possible. To make that cost-effective, those inputs need to be applied as quickly as possible. The power of fuzzing is measured in thousands of executions per second; 8,000 executions per second is better than 6,000. Due to this need for speed, fuzz testing is typically performed during unit testing in a host environment.
While finding and eliminating vulnerabilities in a host environment is good, if you only test in that environment you’ll have no chance of detecting vulnerabilities that occur only in the target architecture.
In other words, there are some bugs that will only occur when your code is running on your big-endian target architecture. If you fuzz test only on your little-endian desktop environment, you’ll have no hope of finding them.
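To make the point concrete, here is a minimal sketch (handle_packet is a hypothetical function, not taken from the guide) of a handler whose behavior depends on byte order. The same fuzz input takes different branches on a little-endian desktop and a big-endian target, so a path that misbehaves on the target may never be exercised during host-side fuzzing.

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical packet handler (illustration only). The 32-bit length
       field is reinterpreted in host byte order rather than converted
       from network order, so the bounds check below evaluates differently
       on little-endian and big-endian machines for the same input. */
    int handle_packet(const uint8_t *pkt, size_t n, uint8_t *out, size_t out_cap)
    {
        uint32_t len;
        if (n < 4)
            return -1;
        memcpy(&len, pkt, sizeof len);        /* host byte order, not ntohl() */
        if (len <= out_cap && len <= n - 4) { /* result differs on LE vs. BE */
            memcpy(out, pkt + 4, len);
            return (int)len;
        }
        return -1;
    }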
No amount of fuzzing will catch all undefined behavior
Since fuzzers generate fuzz inputs semi-randomly with lots of redundancy, they can’t possibly generate all the possible values that make up your code’s input space. As mentioned, for a non-trivial program, there are too many possible input combinations. Billions upon billions of them.
Also, fuzzers are not intended to detect all bugs resulting from their inputs. The main purpose of a fuzzer is to generate sets of semi-random inputs. Secondary purposes of grey-box and white-box fuzzers include the measurement of code coverage, and the optimization of input sets to maximize code coverage while minimizing the number of inputs/executions required to achieve that coverage. Fuzzers will indicate interesting behavior like program crashes caused by specific inputs, but they are not designed to detect every bug that might be lurking in your code.
In short, while it can help prove a program is incorrect, fuzzing cannot prove a program is correct.
Fuzzing assumes an all-knowing bug oracle
Fuzzing for cybersecurity relies on the use of what is called a bug oracle. Researchers have defined a bug oracle as "a program, perhaps as part of a fuzzer, that determines whether a given execution of the PUT violates a specific security policy."[1]
Why the bug oracle is important
Since fuzzing generates arbitrary inputs in an automated manner, we do not know a priori how the target software is supposed to behave for those inputs. Is it expected to reject them as invalid? Is it expected to accept them as valid and produce a particular response to them? Either way, the fuzzer doesn’t know.
The fuzzer only knows if a given input allows the program to execute or causes an execution failure (a crash). If the program executes, the fuzzer provides no indication of whether the program’s response was correct or not. If the user is to recognize whether a given input results in behavior worthy of further human investigation, another component must be present. That component is the oracle.
The oracle tells the user if the target software appears to be behaving incorrectly for a given input.
Typical oracle implementations
One way to realize an oracle would be to write internal self-checks, called assertions, in the target software. Any input that causes one of these internal checks to fail is interesting and worthy of investigation. Unfortunately, writing assertions for all possible security policies would be extremely time-consuming.
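As an illustration (a minimal sketch using a hypothetical decode_header function), a single assertion encoding one small piece of the security policy turns a silently wrong result into a failure the fuzzer can observe:

    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    #define MAX_PAYLOAD 1024u

    /* Hypothetical parser with an internal self-check acting as a partial
       oracle: any fuzz input whose decoded length exceeds the declared
       maximum aborts the run, so the fuzzer flags that input as
       interesting even though the program would not otherwise crash. */
    size_t decode_header(const uint8_t *buf, size_t n)
    {
        size_t payload_len = 0;
        if (n >= 2)
            payload_len = ((size_t)buf[0] << 8) | buf[1];
        assert(payload_len <= MAX_PAYLOAD);   /* self-check encoding the policy */
        return payload_len;
    }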
A second option is to rely on another implementation of the exact same functionality as a reference. One could then compare the results computed by the target software with those computed by the reference implementation. This practice is known as differential testing.
For differential testing to be effective, however, either the reference implementation must be of very high quality, or one must be prepared to find and fix bugs in the reference as well as in the target; teams using differential testing routinely rediscover that the reference has bugs of its own. This, too, is very time-consuming: in most cases it roughly doubles the effort, since both the target and the reference must be developed and debugged.
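In practice, the comparison can be wrapped in a small harness like the sketch below (target_parse and reference_parse are hypothetical stand-ins for the two implementations); any input on which they disagree is reported to the fuzzer as a finding.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical implementations of the same functionality. */
    int target_parse(const uint8_t *in, size_t n, uint8_t *out, size_t cap);
    int reference_parse(const uint8_t *in, size_t n, uint8_t *out, size_t cap);

    /* Differential oracle: run both implementations on the same fuzz
       input and abort on any divergence, so the fuzzer records the input
       that exposed the disagreement. */
    void differential_check(const uint8_t *in, size_t n)
    {
        uint8_t out_a[4096], out_b[4096];
        int ra = target_parse(in, n, out_a, sizeof out_a);
        int rb = reference_parse(in, n, out_b, sizeof out_b);

        if (ra != rb || (ra > 0 && memcmp(out_a, out_b, (size_t)ra) != 0)) {
            fprintf(stderr, "divergence on a %zu-byte input\n", n);
            abort();
        }
    }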
Alternatively, one could hope that any incorrect behavior of the target software in response to a particular input will result in a recognizable failure, such as a crash or, in a memory-safe language like Ada, an uncaught exception. While such recognizable failures are an improvement over silent undefined behavior, relying on them still falls short of the ideal solution. This is particularly true in memory-unsafe languages like C/C++, where many undefined behaviors never produce a visible failure at all.
Limitations of typical oracle implementations
In the case of C/C++ and other memory-unsafe languages, the frequently-used default oracles just described are only partial. They do not check for all of the possible types of undefined behavior that hackers might exploit; there is a vast array of these, and some are more subtle than others. Furthermore, the results computed for a given input may vary depending on the memory layout defined at compilation.
In memory-unsafe languages, a given memory layout may cause a defect to result in an undefined behavior or not, or cause an undefined behavior to result in a crash or not. In consequence, one can’t be sure that the result obtained during testing will be the same as the result for the same input obtained after deployment.
One way to mitigate the problem would be to make sure the tested binary is exactly the one intended for deployment and that all memory is allocated statically.
However, to detect more undefined behaviors, many users of fuzzing also let a sanitizer instrument the code at compilation time with automatically inserted additional checks. In this case, the memory layout of the binary executed during fuzzing differs from the memory layout of the uninstrumented binary intended for deployment.
Therefore, not finding anything interesting during the execution of the (instrumented) binary during fuzzing does not mean that nothing interesting will happen during the execution of the (uninstrumented) binary after deployment.
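A simplified sketch (not taken from the guide) of why the layout matters: the off-by-one write below may land in allocator slack and pass unnoticed, corrupt an adjacent allocation, or crash, depending entirely on the heap layout of that particular run. A sanitizer-instrumented build will catch it, but that instrumented binary no longer has the layout of the binary you ship.

    #include <stdlib.h>
    #include <string.h>

    /* Off-by-one heap write: the buffer holds `len` bytes but the code
       also writes a terminating NUL at index `len`. Whether that extra
       byte lands in allocator slack, corrupts a neighboring allocation,
       or crashes depends on the memory layout of this particular build
       and run, so an uninstrumented test may pass while the deployed
       binary misbehaves. */
    char *copy_text(const char *src, size_t len)
    {
        char *dst = malloc(len);          /* one byte too small */
        if (dst == NULL)
            return NULL;
        memcpy(dst, src, len);
        dst[len] = '\0';                  /* out-of-bounds write */
        return dst;
    }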
Classic fuzzing is always a partial solution
Classic fuzzing will greatly expand your exploration of the total input space of your target program. You will likely find many bugs that a conventional testing campaign would miss.
Due to the limitations just described, however, classic fuzzing will always be a partial solution, for three reasons. First, you cannot explore the entire input space due to its size. Second, you will only find bugs that occur in the specific memory allocation dictated by the compilation performed by the fuzzer. Finally, you will only find bugs your instrumentation and analysis tools are designed to detect.
Because the compilation and memory layout of your deployed binary (1) are likely to differ from what you explored in your fuzzing environment and (2) may be affected by the order in which applications were loaded on the target hardware, your code will still be vulnerable. Hackers may be using black-box fuzzing to penetrate it on specific hardware platforms.
To overcome these limitations, you need a complete oracle rather than a partial one.
The perfect bug oracle
To ensure your code is safe from the exploits of malicious hackers, it is essential that you detect and eliminate all undefined behavior (buffer overflows, uninitialized variables, invalid pointer usage, signed overflows, division by zero, and so on) in every execution path of your program.
To be clear, an undefined behavior is not the same as either an implementation-defined behavior or an unspecified behavior.[2]
Undefined behavior is a C/C++ language concept; the C standard lists the undefined behaviors in its Annex J.2. These are code constructs for which the standard imposes no requirements on how the compiler must implement them, so there is no guarantee the code will behave the same in different contexts, in particular with a different toolchain (or different toolchain settings) used to generate the executable, or in a different environment in which the executable runs. Undefined behaviors therefore frequently cause seemingly random crashes or erratic program behavior, and these problems are often very difficult to detect under standard laboratory testing conditions.
Undefined behavior is also very dangerous because it is a major attack vector against C/C++ code. Hackers exploit undefined behavior to remotely gain control of the software and achieve arbitrary code execution.
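For illustration (simplified examples, not taken from the guide), here are two undefined behaviors that typically do not crash, which is precisely why a crash-based oracle has no way to notice them:

    /* Signed overflow: for len > INT_MAX / 4 the multiplication is
       undefined, yet the program usually keeps running with a value the
       standard does not define (often negative), silently corrupting
       later size checks. */
    int payload_bytes(int len)
    {
        return len * 4;
    }

    /* Use of an uninitialized variable: when n <= 0, `total` is never
       written before it is returned, again without any visible failure. */
    int sum(const int *tbl, int n)
    {
        int total;
        for (int i = 0; i < n; i++)
            total = (i == 0) ? tbl[i] : total + tbl[i];
        return total;
    }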
A great first step toward eliminating all undefined behavior in your code is to fuzz your code with the help of TrustInSoft Analyzer. Running fuzz inputs through TrustInSoft Analyzer allows you to formally verify the elimination of all bugs the Analyzer finds.
TrustInSoft Analyzer
TrustInSoft Analyzer is a hybrid code analyzer that combines advanced static and dynamic analysis techniques with formal methods to mathematically guarantee C/C++ code quality, reliability, security, and safety. It is designed to detect ALL undefined behavior on any execution path exercised, whether by a fuzzer or by any other test driver, with no false alarms.
TrustInSoft Analyzer can guarantee your fuzz testing results are valid for any compiler, any chosen set of compiler options, and any memory layout. In short, TrustInSoft Analyzer is a complete and perfect bug oracle for C/C++ code. It optimizes the fuzzing process.
Why formal verification is superior to classic fuzzing
While fuzzers generate many more tests than conventional testing, they typically do not detect vulnerabilities that don't cause the PUT to crash. What's more, a fuzzer will only explore the execution paths of one specific code compilation and memory layout.
In contrast, TrustInSoft Analyzer’s mathematical analysis of the PUT using formal methods guarantees that every undefined behavior on every execution path explored by the provided fuzz inputs will be detected for every possible compilation and every possible memory layout.
Fuzzing with TrustInSoft Analyzer and AFL
Fuzzing with TrustInSoft Analyzer is fast and efficient because the tool integrates easily with AFL, one of the most popular coverage-based grey-box fuzzers. We like fuzzing with AFL for a number of reasons. Of these, three stand above the rest.
First, AFL is chainable to other tools. As the creator’s documentation states, “The fuzzer generates superior, compact test corpora that can serve as a seed for more specialized, slower, or labor-intensive processes and testing frameworks. It is also capable of on-the-fly corpus synchronization with any other software.” [3]
Second, AFL employs efficient source-code instrumentation to record the edge coverage of each execution of the program being tested, along with coarse hit counts for each edge. It uses this information not only to generate seed files for new fuzz inputs but also to implement a unique deduplication scheme that optimizes code coverage using a minimum set of inputs.[4]
Third, AFL uses a heuristic Evolutionary Algorithm (EA) to refine its seed pool based on branch coverage. This helps improve the odds that newly generated fuzz inputs help to increase coverage.[5]
All of these features help shorten the fuzzing cycle and save you time.
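The harness side can stay very small. The sketch below (parse_message is a hypothetical entry point for the code under test) reads one input file and hands its contents to the parser. Compiled with AFL's compiler wrapper (afl-gcc or afl-clang), it can be fuzzed with a command along the lines of afl-fuzz -i seeds -o findings -- ./harness @@, and the resulting corpus can then be replayed elsewhere.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical entry point of the code under test. */
    int parse_message(const uint8_t *data, size_t size);

    /* Minimal AFL-style harness: read the file named on the command line
       (AFL substitutes @@ with the path of the current fuzz input) and
       pass its contents to the parser. */
    int main(int argc, char **argv)
    {
        if (argc < 2)
            return 1;

        FILE *f = fopen(argv[1], "rb");
        if (f == NULL)
            return 1;

        fseek(f, 0, SEEK_END);
        long size = ftell(f);
        rewind(f);

        if (size > 0) {
            uint8_t *buf = malloc((size_t)size);
            if (buf != NULL && fread(buf, 1, (size_t)size, f) == (size_t)size)
                parse_message(buf, (size_t)size);
            free(buf);
        }
        fclose(f);
        return 0;
    }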
After running AFL to generate a set of test cases (input files), you simply load those test cases along with your code into TrustInSoft Analyzer and run an analysis in the Analyzer's interpreter mode. This is a simple operation, no harder than launching a compilation of your code.
For each test case, TrustInSoft Analyzer will detect whether or not the execution depends upon the memory layout. If for some memory layout that input will cause an undefined behavior, the tool will detect that as well and generate a warning to that effect.
Benefits of fuzzing with TrustInSoft Analyzer
Fuzzing with TrustInSoft Analyzer produces a number of important benefits.
Find more vulnerabilities. TrustInSoft Analyzer can detect undefined behaviors that typically remain unnoticed throughout standard unit and integration tests. You will be 100% sure you have no undefined behaviors for the specific entry points defined by your test (fuzz) inputs.
No false alarms. When running analysis on discrete inputs, TrustInSoft Analyzer generates no false alarms. All alarms raised correspond to real vulnerabilities, thanks to our use of formal methods. You’ll waste no time investigating false positives.
Better validation of embedded code. TrustInSoft Analyzer provides target emulation for embedded hardware platforms. Target emulation allows you to test your embedded code in an environment that closely resembles your target architecture. It helps you find bugs in embedded code that unit testing in a host environment cannot possibly reveal.
Out of the box, TrustInSoft Analyzer supports a number of common target platforms, including 32-bit ARM, 64-bit ARM, PowerPC, RISC-V, and x86. If your hardware is more exotic, the emulator can easily be configured by adjusting a series of parameters. Everything that can change from one target platform to another is configurable in the Analyzer.
In our next post…
In the final installment of this series, we’ll look at how to go beyond fuzzing to formally prove that you have eliminated every last vulnerability from your most cybersecurity-critical applications.
For additional information, see our white paper
If you find this post useful, our new white paper, Fuzzing and Beyond, contains still more information on fuzzing, including a case study on how fuzzing with TrustInSoft Analyzer uncovered 13 previously undiscovered bugs in a popular network protocol analyzer and IP packet sniffer.
References
[1] Manès, V., et al, The Art, Science, and Engineering of Fuzzing: A Survey, IEEE, October 2019.
[2] Gruevski, P., Falsehoods programmers believe about undefined behavior, predr.ag/blog, November 2022.
[3] Zalewski, M., American fuzzy lop.
[4] Manès, V., et al, The Art, Science, and Engineering of Fuzzing: A Survey, IEEE, October 2019.
[5] Ibid.