Proactive Measures in Software Development to Avoid Critical Errors
July 23, 2024
The recent CrowdStrike outage highlights the importance of using abstract interpretation to mathematically guarantee bug-free software and ensure reliability.
Key Points:
- Understanding Memory Access Violations: The recent software update for CrowdStrike Falcon triggered a memory access violation bug, resulting in the Blue Screen of Death (BSOD) during system startup, showcasing the dangers of such software errors.
- Challenges of Undefined Behaviors in C/C++: Undefined behaviors, like buffer overflows and NULL pointer dereferencing, can cause inconsistent software performance, making bugs hard to detect through conventional testing methods alone.
- Ensuring Error-Free Software: Advanced techniques like formal methods and tools such as TrustInSoft Analyzer can provide mathematical guarantees against undefined behaviors, ensuring software reliability and security from the start.
What happened?
Recently, a routine software update for CrowdStrike triggered a memory access violation bug in Falcon’s code. This update, which was rolled out to numerous Windows machines, caused a significant error during system startup, resulting in a Blue Screen of Death (BSOD). The code of Falcon is not publicly accessible, so it cannot be easily inspected to determine all the details concerning this bug. However, the available data (e.g., stack trace dumps produced when a system crashes) indicate that the bug was a memory access violation – one of the most common and dangerous kinds of software errors.
Normally, Windows can handle such errors by terminating the problematic program. However, because Falcon runs with kernel-level privileges and is initiated early during startup, the operating system was not able to manage the error safely, which resulted in a system crash.
How did this happen?
A memory access violation occurs when a program tries to read from or write to an address in memory that it should not access. There are several ways a bug in the program’s source code can cause such a violation.
The C/C++ language standards define how a program should behave when compiled and executed on any computer. However, these standards also contain gaps, leaving the behavior of some C/C++ code undefined. One of the simplest examples is division by zero: when a C/C++ program divides any number by zero, the behavior is undefined, which means that anything can happen when such a program is compiled and executed. In practice, division by zero usually causes a critical error, and the program aborts execution.
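For illustration, here is a minimal sketch of such a program (hypothetical code, not drawn from any real project):

```c
#include <stdio.h>

int main(void) {
    int numerator = 10;
    int divisor = 0;

    /* Integer division by zero is undefined behavior in C/C++.
       On many platforms this aborts the program with a fatal signal,
       but the standard allows any outcome whatsoever. */
    printf("%d\n", numerator / divisor);
    return 0;
}
```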
Different ways exist for a bug to cause an invalid memory access, but for C/C++ programs, most of them result from undefined behavior. The most infamous of these is the dreaded buffer overflow (CWE-121: Stack-based Buffer Overflow). The C/C++ standards define exactly what must be done to correctly read or write a given location of a memory buffer: writing inside the buffer is a well-defined operation, while writing outside the buffer is undefined behavior.
This is an example of a very common programming mistake, the so-called “off-by-one”:
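In C, the mistake can look like this (a minimal sketch matching the description below):

```c
int main(void) {
    int i;
    char t[42];            /* valid indices are t[0] through t[41] */

    for (i = 0; i <= 42; i++) {
        t[i] = 0;          /* when i == 42, this writes one element past the end of t */
    }
    return 0;
}
```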
The buffer t has a size of 42. It begins at t[0] and ends at t[41]. In this for loop, the index i takes values from 0 to 42. What happens when the program writes to t[0], t[1], …, and t[41] is well-defined. But when the program writes to t[42], the behavior is undefined.
You may be asking yourself: does executing this code cause a memory access violation? Not necessarily! In our example, it may depend on the memory layout. If the variable i happens to be positioned in memory just after the buffer t, then accessing t[42] may result in the same behavior as just accessing i. And, since the program is allowed to access i, this is not a memory access violation at all.
This is one of the important reasons why undefined behaviors are so dangerous. They are not always synonymous with errors. The same program containing an undefined behavior may execute without any issues on one computer and then perform invalid memory access and cause a BSOD on another one.
Without reading the source code, it is impossible to be completely sure of the root cause behind the memory access violation triggered by CrowdStrike Falcon. However, educated guesses seem to point to an undefined behavior: dereferencing a NULL pointer.
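In C, a NULL pointer dereference can be as simple as the following (a generic sketch, not code from Falcon):

```c
#include <stddef.h>

int read_first(const int *values) {
    /* If values is NULL, this dereference is undefined behavior.
       On most systems it traps as a memory access violation; in
       kernel-mode code, such a fault can bring down the whole machine. */
    return values[0];
}

int main(void) {
    const int *p = NULL;   /* e.g., a lookup that failed and returned NULL */
    return read_first(p);
}
```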
How could this be avoided?
There are many ways to detect memory access violations at different stages of software development.
The most widely used and simplest method is extensive testing of the program. This entails compiling the code and then running it in different test scenarios, checking if it gives the expected output for each input. There are two main problems with this approach.
- The coverage of testing is incomplete. As Edsger W. Dijkstra famously said, “Program testing can be used to show the presence of bugs, but never to show their absence!”. In other words, it is impossible to test every scenario.
- Testing is prone to the “works on my machine” problem. As we have seen, undefined behaviors do not always cause errors. They may go undetected during testing and rear their ugly head only in specific conditions once the software is deployed.
Other methods include more advanced techniques, such as the use of sanitizers, memory debugging tools, fuzzers, and static analyzers.
- Sanitizers (e.g., UBSan and ASan) and memory debugging tools (e.g., Valgrind) make testing more powerful by instrumenting the compiled programs with additional checks and/or testing them in a special harness. The aim is to detect all potentially dangerous operations, not only those that explicitly cause an error (a conceptual sketch of this kind of instrumentation follows the list below).
- Fuzzers (e.g., American Fuzzy Lop) are used to automatically generate test suites in a smart way that greatly augments the test coverage.
- Static analyzers (e.g., Cppcheck) operate directly on the source code level, usually using various pattern-matching techniques to detect dangerous constructions.
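To make the sanitizer idea more concrete, the sketch below writes out by hand roughly the kind of check a sanitizer would insert automatically at compile time. It is a simplified illustration only: real sanitizers such as ASan rely on shadow memory and redzones rather than explicit size parameters, and the helper name checked_store is purely hypothetical.

```c
#include <stdio.h>
#include <stdlib.h>

/* A bounds-checked store standing in for the check a sanitizer inserts
   before each memory access. */
static void checked_store(char *buf, size_t size, size_t index, char value) {
    if (index >= size) {
        fprintf(stderr, "out-of-bounds write: index %zu, buffer size %zu\n",
                index, size);
        abort();           /* fail loudly instead of silently corrupting memory */
    }
    buf[index] = value;
}

int main(void) {
    char t[42];
    for (size_t i = 0; i <= 42; i++) {
        checked_store(t, sizeof t, i, 0);   /* aborts when i reaches 42 */
    }
    return 0;
}
```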
To ensure that faulty software does not get deployed and that every single version of code is checked, all these methods are usually coupled with Continuous Integration solutions (e.g., Jenkins).
But how could this be avoided for sure?
None of the aforementioned techniques can guarantee the absence of errors. In the most extreme cases, where such guarantees are required, we need to go to the next level and use formal methods, applying mathematical reasoning to prove without doubt that the code contains no undefined behavior that could trigger memory access violations. One such method is called Abstract Interpretation. It allows for analyzing all the possible executions of a given program to ensure that none of them can lead to undefined behavior.
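To give a flavor of the approach, an abstract-interpretation-based analyzer computes, for every program point, an over-approximation of all the values a variable can take across every possible execution. The sketch below illustrates the idea with interval reasoning written out as comments; it is a simplified illustration of the technique, not the output of any particular tool, and the function fill is hypothetical.

```c
char t[42];

void fill(unsigned n) {
    /* Abstractly, n can be anything on entry: n in [0, UINT_MAX]. */
    if (n > 42)
        n = 42;                 /* after the branches merge: n in [0, 42] */

    for (unsigned i = 0; i < n; i++) {
        /* Inside the loop body, i < n <= 42, so i in [0, 41].
           Every value in that interval is a valid index into t,
           proving that no execution of fill writes out of bounds. */
        t[i] = 0;
    }
}
```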
In the case of critical code, where the consequences of failure could be catastrophic (as with software embedded in an aircraft), the most powerful methods are deployed to verify the absence of bugs. Industry standards require more than just looking for errors; they demand strong guarantees that such errors are absent from the code. Only solutions based on formal methods make it possible to provide such a guarantee and prove that software is free of errors.
This is where tools like TrustInSoft Analyzer, based on Abstract Interpretation and several other formal methods, show their value. TrustInSoft Analyzer can be used to ensure that a given program can never perform an invalid memory access, whatever its context of execution. Only by relying on formal methods and operating on the source code level can such guarantees be achieved.
So, what now?
The CrowdStrike case highlights how costly low-level memory access bugs can be. The consequences here were not as severe as they would have been if, for example, an airplane’s flight computer had crashed in the middle of a flight. Luckily, no airplane fell out of the sky because several airlines’ Windows servers crashed. Still, the world has seen how enormous an upheaval a single bug can cause, and how significant its economic impact can be.