GCC always assumes aligned pointer accesses

TL;DR

This post shows that modern optimizing C compilers assume that the source code they are compiling respects memory alignment constraints, even if the target architecture imposes no such restriction. This can lead to compiled programs not behaving as the programmer intended.

C: The Language Formerly Known as Portable Assembly

Some people will tell you that C is a portable assembly language, in which each line of source code is individually translated into one or a few assembly instructions. In this worldview, the C compiler does not do anything very smart, rather it only handles for you the mapping of subexpressions to the registers and instructions of various incompatible target ISAs. This worldview is no longer accurate, for reasons discussed in this blog post and in previous posts in the same vein, but it used to be true several decades ago. See for instance this answer by StackOverflow’s Jerry Coffin, which ascribes the notion of “C as a portable assembly language” to the first edition of the book “The C Programming Language” by Brian Kernighan and Dennis Ritchie, the creator of C himself. This early history of C as portable assembly language is also visible in this post to comp.std.c in 1990 retracing the problem that the volatile qualifier solved and its introduction by the ANSI standardization committee in the 1989 C standard.

The shift of the C language from “portable assembly” to “high-level programming language without the safety of high-level programming languages” should be kept in mind when writing C code from scratch in 2020, or when maintaining legacy C code.

Misaligned Memory Accesses

The x86-64 and AArch64 architectures are popular at the moment, and their respective 32-bit predecessors are still widely used; the POWER architecture is getting old, but, renamed Power ISA, it continues to receive updates; RISC-V is popular in some circles, and so on. The C language is old enough to remember a time when the popular architectures were different and, unlike the popular architectures of today, did not all allow misaligned accesses. Alpha and SPARC are two examples of once-popular instruction sets in which misaligned memory accesses were forbidden (a processor exception would happen if such an access were attempted).

The C standards, having to accommodate both target architectures where misaligned accesses worked and target architectures where these violently interrupted the program, applied their universal solution: they classified misaligned access as an undefined behavior. This allowed execution to continue normally, with the results that you would expect if you had been writing x86-64 assembly, as well as execution to stop abruptly, which is what you would expect if you had been writing SPARC assembly. This in turn led to the appearance of C programs that, consciously or not, were only written for architectures that allow misaligned memory accesses. In fact, if you ask early owners of Alpha or SPARC computers, there were many of these. For users of these architectures, the problem was that C programs, written to access memory in misaligned ways, worked as intended on the numerous processors that allowed this, but crashed when compiled and run on their computers. This is now a forgotten problem, but a well-documented one nevertheless. It is covered for instance in the blog post On misaligned memory accesses (2006).

Something funny happened to Pavel Zemtsov in 2016. He was writing C++ code that involved misaligned memory accesses, as one tends to do without even noticing when targeting x86-64, when he suddenly realized that he was programming an architecture that forbade misaligned accesses after all! GCC had automatically vectorized the loop he was writing, and in doing so, had assumed that all uint32_t accesses were to addresses aligned to 32-bit boundaries. The summary of that blog post, A bug story: data alignment on x86, is that GCC used vector instructions recently added to the x86-64 ISA that have alignment requirements. The humor came from the fact that the source code looked very ordinary: nothing suggested the use of modern vector instructions; the source did not call intrinsics or anything like that. But making sophisticated transformations is what modern, optimizing C compilers do.

A Clever GCC Optimization

The present blog post brings bad, and as far as I know, previously undocumented news. Even if you really are targeting an instruction set without any memory access instruction that requires alignment, GCC still applies some sophisticated optimizations that assume aligned pointers. Consider the following function:

int h(int *p, int *q){
  *p = 1;
  *q = 1;
  return *p;
}

GCC optimizes this function to make it always return 1:

h:
        movl    $1, (%rdi)
        movl    $1, (%rsi)
        movl    $1, %eax
        ret

GCC generates code for a function h that always returns 1: that’s the movl $1, %eax part of the assembly listing, in which the value 1 is hard-coded. In this function, GCC reasons that if p and q are the same, then the second assignment writes 1 to *p, and if they are distinct, then what is read from *p is exactly what was written by the first assignment, that is, 1. It seems that either way, the function can only return 1. The reasoning is sound if p and q are both aligned to an address that is a multiple of sizeof(int), but it is wrong if either pointer is misaligned, as in the calling context below:

#include <stdlib.h>
#include <assert.h>

void f(void) {
  char *t = malloc(1 + sizeof(int));
  if (!t) abort();
  int *fp = (int*)t;
  int *fq = (int*)(t+1);
  int r = h(fp, fq);
  assert(r == *fp);
  ...
}

Strictly speaking, the function f invokes Undefined Behavior when it computes fq. The result t of the call to malloc is guaranteed to be aligned for int, but t+1 is not aligned for int. A very old computer that uses non-uniform pointer representations could crash on the conversion to int* of t+1. The words “if the resulting pointer is not correctly aligned for the referenced type, the behavior is undefined” in the C11 standard make the conversion Undefined Behavior.

The programmer, knowing that the program is to be executed on the little-endian x86-64 architecture, which has a uniform pointer representation and handles misaligned accesses, might expect r to be set to 257 and the assertion to be true: the store *q = 1; overwrites the second byte of *fp, so that the four bytes of *fp are 01 01 00 00, which a little-endian architecture reads as 0x101, that is, 257. Instead, r is set to 1, but *fp still evaluates to 257, so that the assertion is evaluated as false at runtime.

Note that this discussion is not about strict aliasing. The description of strict-aliasing rules in the C standard is rather vague, and it could be interpreted as meaning that you are not allowed to use the second byte of an int to read the first byte of an int. However, even with the compilation option -fno-strict-aliasing, which tells GCC to use a forgiving memory model without strict-aliasing restrictions, GCC still optimizes the function h in the same way.

How to express misaligned memory accesses

If you are writing C code in 2020 that would benefit from reading four bytes at once even if these may not be aligned to a multiple-of-4 address, and if this C code is intended to be compiled with a modern C compiler, you should use memcpy. On target architectures that allow misaligned memory accesses, a modern C compiler can easily translate the memcpy from a temporary int variable into the single assembly instruction that you would have written if you were directly writing in assembly. As a bonus, the same code will function on architectures that do not allow misaligned memory accesses; on these, the compiler will automatically generate either a longer sequence of instructions or an actual call to memcpy.
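
Here is what the reading direction can look like (a minimal sketch; the helper read_u32_unaligned is my name, not an existing API):

#include <stdint.h>
#include <string.h>

/* Read a 32-bit value from a possibly misaligned address. Optimizing
   compilers targeting x86-64 turn the memcpy into a single mov. */
uint32_t read_u32_unaligned(const void *p) {
  uint32_t v;
  memcpy(&v, p, sizeof v);  /* no misaligned lvalue access involved */
  return v;
}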

For instance, if the function h is intended to accept a misaligned pointer q, it can be changed as follows:

#include <string.h>

int h(int *p, int *q){
  *p = 1;
  int one = 1;
  memcpy(q, &one, sizeof *q);  /* a 4-byte store that does not assume alignment */
  return *p;
}

Note: if you take this route, you might want to go the extra mile and avoid building misaligned pointers altogether, which, as we said earlier, is Undefined Behavior in itself. The above example leaves the argument q as a pointer to int in order to focus on the important change with respect to the original, but if the second argument can be misaligned, it would be better to declare it as a pointer to char, so as not to force the construction of a misaligned pointer to int when the function is called.

When targeting x86-64, GCC produces the code we intended but did not get with the original function h:

h:
        movl    $1, (%rdi)
        movl    $1, (%rsi)
        movl    (%rdi), %eax
        ret

This assembly code reads back the contents pointed to by the first argument (%rdi) in order to return it (in %eax), because the compiler is aware that accessing memory through the second argument (%rsi) may have changed it. The call to memcpy in the source code is free: it is not translated to a function call, and not even to a single unnecessary instruction.

By contrast, when generating assembly code for an architecture for which the direct memory access would not produce the intended result, such as armv5, GCC, having no shorter and more expedient version, keeps the call to memcpy. The armv5 ISA is a bit quirky, as RISC ISAs tend to be, so that on a processor implementing it, the misaligned memory access may not crash, but in any case it would not produce the same result as the call to memcpy, hence GCC not replacing the latter with the former. This shows that GCC, while translating the statement *q = 1; in the original function h, assumes the int access to be aligned when it chooses to translate it to the single “store” instruction str r3, [r1].

Does a lot of existing C code do this?

It is difficult to tell how much existing code does this. The optimization described in this post is only implemented in GCC, and it is focused enough that you wouldn’t expect it to be applied very often (just wait until C code is increasingly compiled with Link-Time Optimization, though). At a time when the popular ISAs mostly provide the expected behavior on misaligned accesses, the crime of misaligned memory access is seldom reported, but only for lack of witnesses.

One data point is the fast compression library LZO, which contains the following lines, intended to do the right thing on each of several possible target platforms:

#if (LZO_ARCH_ALPHA)
#  define LZO_OPT_AVOID_UINT_INDEX          1
#elif (LZO_ARCH_AMD64)
#  define LZO_OPT_AVOID_INT_INDEX           1
#  define LZO_OPT_AVOID_UINT_INDEX          1
#  ifndef LZO_OPT_UNALIGNED16
#  define LZO_OPT_UNALIGNED16               1
#  endif
#  ifndef LZO_OPT_UNALIGNED32
#  define LZO_OPT_UNALIGNED32               1
#  endif
#  ifndef LZO_OPT_UNALIGNED64
#  define LZO_OPT_UNALIGNED64               1
#  endif
#elif (LZO_ARCH_ARM)
#  if defined(__ARM_FEATURE_UNALIGNED)
#   if ((__ARM_FEATURE_UNALIGNED)+0)
#    ifndef LZO_OPT_UNALIGNED16
#    define LZO_OPT_UNALIGNED16             1
#    endif
#    ifndef LZO_OPT_UNALIGNED32
#    define LZO_OPT_UNALIGNED32             1
#    endif
#   endif
#  elif 1 && (LZO_ARCH_ARM_THUMB2)
 
...
 
#if (LZO_OPT_UNALIGNED32)
LZO_COMPILE_TIME_ASSERT_HEADER(sizeof(*(lzo_memops_TU4p)0)==4)
#define LZO_MEMOPS_COPY4(dd,ss) \
    * (lzo_memops_TU4p) (lzo_memops_TU0p) (dd) = * (const lzo_memops_TU4p) (const lzo_memops_TU0p) (ss)
#elif defined(lzo_memops_tcheck__)
#define LZO_MEMOPS_COPY4(dd,ss) \
    LZO_BLOCK_BEGIN if (lzo_memops_tcheck__(lzo_memops_TU4,4,1)) { \
        * (lzo_memops_TU4p) (lzo_memops_TU0p) (dd) = * (const lzo_memops_TU4p) (const lzo_memops_TU0p) (ss); \
    } else { LZO_MEMOPS_MOVE4(dd,ss); } LZO_BLOCK_END
#else
#define LZO_MEMOPS_COPY4(dd,ss) LZO_MEMOPS_MOVE4(dd,ss)
#endif

You will have noticed that the code wants to perform misaligned accesses, but is written to detect at compile-time platforms where misaligned accesses would fail, and in these cases, to avoid them.

The issue, pointed out in Pavel Zemtsov’s 2016 post and in the present one, is that the above #ifdef nest, when it picks the simple uint32_t lvalue access for the x86-64 target architecture (since x86-64 processors allow these memory accesses), exposes the code to mistranslation by the compiler, whether for loop-vectorization or for other optimization purposes.

Conclusion

This blog post is an extended version of a ticket I opened on GCC’s Bugzilla, where I suggested as a new feature a command-line option to make GCC not assume that memory accesses are aligned. The reaction of the GCC developers confirms that this behavior is here to stay. Clang developers, if you asked them, would reserve the right to make Clang behave in the same way, especially now that GCC has broached the topic. The conclusion is that as a modern C developer, you should always avoid misaligned memory accesses, and use memcpy to or from a temporary variable instead. The memcpy call will be free as long as the code is compiled with an optimizing compiler for a platform that allows misaligned memory accesses.

Acknowledgments

The following people have contributed context information to this post, participated in a discussion that led to the discovery being written up, improved the wording of some sentences, or had their joke stolen in this post: hikari, Alexander Monakov, David Monniaux, Markus F.X.J. Oberhumer, John Regehr, Miod Vallat, Ashley Zupkus.

Printing a null pointer with %s is undefined behavior

The C standard makes it undefined to pass anything other than a pointer to a null-terminated string as the second argument of a printf("%s", …) call. However, most libcs kindly print the string (null) if a null pointer is passed as argument, and some developers have made it a habit, when generating debug messages, to pass pointers that can be either null pointers or point to a valid string directly to printf("%s", …). These developers are relying on the kindness of the underlying libc implementation. “It’s undefined because it could behave differently on another platform”, these developers think, “but I know how printf works on my platform.”

Digging into stdio implementations

Most calls to printf worldwide end up interpreted by one of four independent implementations:

  • Glibc provides its own printf implementation. This is the printf that ends up being called, say, if you compile a C program on a mainstream GNU/Linux distribution such as Ubuntu.
  • On many Unices, the implementation of printf comes from “UCB stdio” and was originally written by Steve Summit, but later heavily modified by Keith Bostic who gave it its current form.
  • On other Unices such as Solaris, the implementation is descended from the AT&T codebase.
  • Musl is a libc implementation from scratch, initiated and maintained by Rich Felker.

In their most recent incarnations, all four of these implementations are kind with respect to null pointers for %s. Consider the program below:

#include <stdio.h>

int main(void) {
  printf("%s|%.3s|\n", (char*)0, (char*)0);
}

This program, when linked with Glibc, kindly prints (null)||, whereas the other three aforementioned printf implementations kindly print (null)|(nu|. The reason for the disparity is that Glibc’s printf tries to be sophisticated about the substitution, whereas musl, UCB stdio, and Solaris are straightforward about it. You may have noticed in the previously linked commit from 1988 by Keith Bostic that printf was already handling null pointers before the commit, but that it used to print them as (NULL) or (NU.

What does this C program do?

#include <stdio.h>

int main(int argc, char*argv[]) {
  switch (argc) {
  case 0: return 0;
  case 1:
    printf("argc is 1, expect to see '(null)'\n");
  default:
    printf("%s\n", argv[1]);
  }
}

Given the reassuring results of all the digging in the first half of this post, one may expect, when compiling and executing the above program without command-line arguments, that it prints:

argc is 1, expect to see '(null)'
(null)

This is not what happens for me when compiling and linking with Glibc on my faithful Ubuntu GNU/Linux distribution:

$ cat > printarg.c
#include <stdio.h>

int main(int argc, char*argv[]) {
  switch (argc) {
  case 0: return 0;
  case 1:
    printf("argc is 1, expect to see '(null)'\n");
  default:
    printf("%s\n", argv[1]);
  }
}
$ gcc printarg.c && ./a.out 
argc is 1, expect to see '(null)'
Segmentation fault

You should be able to reproduce this at home, and this Compiler Explorer link and snippet contain some vital clues as to how this happens:

main:                                   # @main
        testl   %edi, %edi
        je      .LBB0_4
        pushq   %rbx
        movq    %rsi, %rbx
        cmpl    $1, %edi
        jne     .LBB0_3
        movl    $.Lstr, %edi
        callq   puts
.LBB0_3:
        movq    8(%rbx), %rdi
        callq   puts
        popq    %rbx
.LBB0_4:
        xorl    %eax, %eax
        retq
.Lstr:
        .asciz  "argc is 1, expect to see '(null)'"

Because, in the small example in this section, the printf calls match a pattern that can be realized with the leaner, faster puts function, GCC and Clang substituted the latter for the former. The first substitution is harmless. By applying the second one, though, the compilers assumed we did not intend to print a null pointer, because puts does not support this at all. But as the title of this post says, we shouldn’t be surprised: passing a null pointer to printf for %s was undefined behavior all along.

T∙Snippet, which, in a sense, contains a fifth printf implementation largely written from scratch, warns that this program invokes undefined behavior.

Conceivable consequences and conclusion

If a developer does not expect a pointer to be null, but thinks that it might still happen and decides to be robust about it, they might write:

printf("[parsing] got string:\n");
printf("%s\n", s);
if (s == NULL) {
  // handle the situation gracefully
  ...
}

(The developer wants to log that the null pointer happened, and expects printf to handle it.)
In this case, the null pointer, which the developer does not know how to cause and thus cannot test, will lead to a segmentation fault despite the developer’s attempt to handle it.
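
A version that remains robust whatever the compiler does, because it never passes a possibly-null pointer for %s, is sketched below (keeping the hypothetical variable s from the snippet above):

printf("[parsing] got string:\n");
printf("%s\n", s ? s : "(null)");  /* never passes a null pointer for %s */
if (s == NULL) {
  // handle the situation gracefully
}

Even if the compiler turns the second printf into a puts call, the argument can no longer be null, so the substitution is harmless.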

The moral of this note is once again that undefined behavior should be avoided as much as possible, because even if a libc has been kind to you once, the C compiler may not be.

Annex

Acknowledgments: Miod Vallat largely contributed to this post.
Post-scriptum: if you pass a pointer to printf("%s\n", …), then after transforming the call into a call to puts, GCC will assume that the pointer is not null, since “you” passed it to puts. Generally this should not make the problem worse, since the program will crash at the puts call anyway, exceptions notwithstanding.

Outlining the language C programs should be written in

Rich Felker, maintainer of the musl libc, recently tweeted to ask:

Anyone want to guess what gcc generates for a==b?0:a-b ? Any ideas why?

The answer to the first question is that when a and b are two pointers to char, GCC generates a sequence of several instructions, including a “conditional move” instruction, for the source-level construct a==b?0:a-b:

movq    %rdi, %rax
movl    $0, %edx
subq    %rsi, %rax
cmpq    %rsi, %rdi
cmove   %rdx, %rax

Of these five instructions, just the first and the third would be sufficient to produce the same result. On the architecture targeted by GCC here, applying the SUB instruction to two pointers that satisfy a==b can only produce a result of zero. In the case where a is a pointer one past the end of some object and b a pointer to the beginning of another object, the standard allows a==b to be true or false, but in this case the source-level construct a-b is undefined, and any value computed by SUB is a valid result too.

One way to make GCC generate only the first and the third instructions would be to write a-b. Unfortunately, the C standard leaves this expression undefined when a and b are both null pointers, whereas a==b is defined (and is true). If GCC documented that it allows a null pointer to be subtracted from another (and that the result is 0), this would be a safe alternative when targeting GCC. As far as I know, however, GCC does not document this behavior. Since GCC does not document it and the C standard leaves it undefined, GCC could implement a new optimization tomorrow that improves its SPECint score by 0.5% while breaking programs relying on this expression.
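
To make the trade-off concrete, here is the guarded subtraction packaged as a function (a sketch; the name ptr_diff is mine):

#include <stddef.h>

/* The guarded difference from Rich Felker's tweet, as a function.
   Defined even when a and b are both null pointers; the guard is what
   makes GCC emit the comparison and conditional move shown above. */
ptrdiff_t ptr_diff(char *a, char *b) {
  return a == b ? 0 : a - b;
}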

The situation is similar to that of implementing rotate in C in 2013. Shifting a uint32_t by 32-n is undefined if n can be 0, but, at the time, Clang would only optimize to the efficient ROL instruction the pattern (x << n) | (x >> (32-n)), to the exclusion of n ? (x << n) | (x >> (32-n)) : x or (x << n) | (x >> (-n & 31)). Sometimes, in the transition from C-as-it-was-written-in-1990 to modern C, the compiler forces developers to write circumlocutions to protect themselves from UB-based optimizations, and then punishes them for the circumlocutions.
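
For reference, the rotate idiom without undefined behavior looks like this (a sketch following the pattern from John Regehr’s write-up; rotl32 is my name for it):

#include <stdint.h>

/* Rotate left by n. Masking keeps both shift amounts in [0,31], so there
   is no undefined shift even when n is 0 or n is 32 or more. Modern Clang
   and GCC recognize this pattern and emit a single rotate instruction. */
uint32_t rotl32(uint32_t x, unsigned n) {
  n &= 31;
  return (x << n) | (x >> (-n & 31));
}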

Writing a sound static analyzer for C implies, first and foremost, formalizing the language that the analyzer will allow programs to use. “Just using the standard as reference” is illusory: in several places, the C standard is ambiguous, or even self-contradictory, enough that what matters is the consensual interpretation that compiler authors have arrived at. Sometimes the situation may be too obscure to even have been discussed. A question I asked myself in 2011 is somewhat related to Rich’s: when a is a null pointer and b is a valid pointer, a < b is left undefined by the standard, but compilers currently produce code that makes this expression evaluate to 1, and existing code relies on this. At the time, we chose to warn for pointer comparisons and to allow the subtraction of null pointers. Maybe the latter allowance is a decision that should be revisited now, since there exists at least one person in the world, Rich, who cares about this case.

Mandatory “no sane compiler would optimize this” post-scriptum

LLVM assuming that, since x << n appears in the source code, n must be less than 32 has already caused unintended behavior.

In order to break programs that rely on a-b when both a and b are null pointers, a compiler would only need to propagate the information that a and b aren't null pointers from any pointer subtraction involving them. This is as easy as assuming that a pointer is not null if it was passed to memcpy, even if that was with a size that was not known to be nonzero, and GCC already does this.

Differences between the B method and Frama-C in Formal Methods

My interest was piqued by a question on the questions-and-answers site Quora (the irritating one that tries to get you to register by limiting the number of answers you can view in a month otherwise; no, now you’re thinking of Medium; I mean the one that was doing this from the beginning). The question was about the differences between the B method and Frama-C. The existing answer at the time made some good points, but I thought the phrase “First you are supposed to write or find some C source codes” in that answer was backwards. So I wrote my own answer there.

An old quirky libksba bug

The libksba library, used by GnuPG, provides functions for parsing X.509 cryptographic certificates. I was testing libksba with TIS Interpreter a little over a year ago. One of the bugs I found then illustrates a point I would like to make now.

The bug

Consider this function, as it was present in libksba two years ago:

const char *
ksba_cert_get_digest_algo (ksba_cert_t cert)
{
  gpg_error_t err;
  AsnNode n;
  char *algo;
  size_t nread;

  if (!cert)
    return NULL;  /* Ooops (can't set cert->last_error :-().  */

  if (!cert->initialized)
    {
       cert->last_error = gpg_error (GPG_ERR_NO_DATA);
       return NULL;
    }

  if (cert->cache.digest_algo)
    return cert->cache.digest_algo;

  n = _ksba_asn_find_node (cert->root, "Certificate.signatureAlgorithm");
  if (!n || n->off == -1)
    err = gpg_error (GPG_ERR_UNKNOWN_ALGORITHM);
  else
    err = _ksba_parse_algorithm_identifier (cert->image + n->off,
                                            n->nhdr + n->len, &nread, &algo);
  if (err)
    cert->last_error = err;
  else
    cert->cache.digest_algo = algo;

  return algo;
}

The source code above contains the bug. The bug can almost be found by inspection of the function, inferring contents of types and behavior of callees from the way they are used. I have only removed commented-out code from the original. If you think that you have identified the bug I found one year ago, but that it may depend how the function ksba_cert_get_digest_algo is used, you may be on the right track. Here is how ksba_cert_get_digest_algo is invoked in tests/cert-basic.c:

  oid = ksba_cert_get_digest_algo (cert);
  s = get_oid_desc (oid);
  printf ("  hash algo.: %s%s%s%s\n",
          oid?oid:"(null)", s?" (":"",s?s:"",s?")":"");

If on the other hand, even with this clue, you are still finding the entire function ksba_cert_get_digest_algo too tedious to review, here is a synthetic view that makes that bug stand out:

const char *
ksba_cert_get_digest_algo (ksba_cert_t cert)
{
  …
  char *algo;
  …
  n = _ksba_asn_find_node (cert->root, "Certificate.signatureAlgorithm");
  if (!n || n->off == -1)
    err = gpg_error (GPG_ERR_UNKNOWN_ALGORITHM);
  else
    err = _ksba_parse_algorithm_identifier (cert->image + n->off,
                                            n->nhdr + n->len, &nread, &algo);
  if (err)
    cert->last_error = err;
  else
    cert->cache.digest_algo = algo;

  return algo;
}

The bug is that the automatic variable algo is only initialized on the path where _ksba_parse_algorithm_identifier is called. When _ksba_asn_find_node succeeds, &algo is passed as an argument to the function _ksba_parse_algorithm_identifier (without looking at it, we can assume that that function does initialize algo in all cases). The fact remains that when _ksba_asn_find_node fails, because of the condition !n || n->off == -1 being true, the variable algo remains uninitialized, and its “value” is returned as the result of the function.
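
A minimal fix, not necessarily the one that was eventually applied in the commit mentioned in the epilog, is to initialize the variable so that the failure path returns a null pointer, which the caller in cert-basic already handles with oid?oid:"(null)":

  char *algo = NULL;  /* NULL until parsed; error paths then return NULL instead of garbage */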

The anecdote

The funny thing about this bug is that it is very easy to produce inputs for. I was using afl, an efficient and easy-to-use fuzzer, to generate testcases. This bug was the second one I found, right after I set up afl and TIS Interpreter to work together on libksba. It turned up just after afl generated an input that demonstrated a crash caused by an unrelated first bug. Whenever the target software crashes, afl generates a README.txt file to invite the user to report their success to afl’s author. This is smart timing: when using a fuzzer to find bugs, producing an input that causes a direct crash is one of the best possible outcomes. afl’s README file is thus a subtle declaration of victory: you only see it when afl has done a good job.

The README, a pure ASCII file intended for human consumption, and of all inputs the one least pretending to be an X.509 certificate, also happens to trigger the uninitialized-variable-use bug described in the first part of this post. Let those who would not have used the shell command for i in findings_dir/crashes/* ; do run_in_tis_interpreter.sh $i ; done throw the first stone.

No great coincidence happened here, nor was the generated README.txt file designed to look fiendishly similar to an X.509 certificate. Any brute-force, purely random fuzzer would have instantly produced a file that triggered the bug shown. In good time, afl would have found one too—and, importantly, it would have recognized that such an input caused a different execution path to be followed inside libksba’s cert-basic test program than less obviously incorrect certificates, and would have set it apart for later inspection.

Epilog

The crash initially found by afl was fixed in a timely manner by Werner Koch in commit a7eed17, and the uninitialized-variable-use bug described in this post was fixed in commit 3f74c2c.

I never formally reported any of afl’s numerous findings to afl’s author, despite the instructions in the input file. Well, this post can be my report: all the bugs reported by me in libksba in 2016 were found using afl and TIS Interpreter together. This post is but an example to illustrate how well these two tools work together.

afl had been used on libksba before (for instance this 2014 vulnerability is listed as having been found using afl). But although the uninitialized-variable-use bug is very shallow and can be evidenced by throwing any README file at cert-basic, the bug was probably missed because it did not cause a crash. Barring weird optimizations, when executing a binary not instrumented to detect the use of uninitialized variables, the variable’s value is whatever happened to be on the stack. The stack is full of valid addresses: addresses of variables and buffers that have previously served, but also addresses of call sites to return to. Reading from any of these addresses, it is likely that a zero will be found before the end of the page (code contains plenty of null bytes because of how constants in instructions are typically encoded, and data contains plenty of null bytes too, of course). This means that running cert-basic directly does not reveal, via a crash, that something wrong happened. All that happens is that instructions or (possibly secret) data are printed as if they were a string. Since there are no a priori expectations for what cert-basic should print when passed an invalid file, this is difficult to notice.

One solution, of course, is to have used MemorySanitizer (MSan for short) to compile libksba’s cert-basic. One might speculate that the bug was not found earlier because MSan’s instrumentation is incompatible with ASan’s, and that users of afl, if they use a sanitizer at all, use ASan in order to find the most dangerous memory bugs.

TIS Interpreter takes more time to run a test program than it takes to compile and execute this test program with MSan or ASan instrumentation, but TIS Interpreter finds all the problems that these two sanitizers find, and more.

afl can produce minimal test suites that exercise all the different execution paths it was able to identify in the target program. This feature enables TIS Interpreter to be used harmoniously together with it.

If the only way to use the two together were to run TIS Interpreter on every input that afl generated and discarded, then the collaboration between the two tools would be fruitless. The interesting inputs would be lost in a sea of uninteresting ones, and TIS Interpreter’s comparatively slow speed would make it seem like it was not finding anything. But since afl can produce minimal test suites, covering a lot of different behaviors of a typical parser with a few hundred testcases, these few hundred testcases should definitely be run in the best detector of undefined behavior that exists, even if that detector uses up a few seconds per testcase. And, for C programs, the best detector of undefined behavior in terms of detected undefined behaviors is arguably TIS Interpreter.

Auditing zlib

zlib is a venerable multi-purpose compression library, first released in May 1995. The first time I installed GNU/Linux was in late 1995, so to me zlib has somehow always existed. Although now that I think about it, the zlib source code was distributed in the Unix compress format: had I been just a little bit more perceptive, I might have inferred that zlib had not existed forever, as a consequence of something else having come before it.

Together with New York-based company Trail of Bits and under the auspices of Mozilla’s Secure Open Source program, TrustInSoft has completed an automated audit of zlib. Subtle issues were found, and fixed by zlib co-author Mark Adler.

Dan Guido makes a great argument for taking advantage of all that software tools can offer for auditing software. One of the tools used in this audit is tis-interpreter, to identify C undefined behaviors along the execution traces generated by Trail of Bits’s CRS.

How do you report bugs that you alone can see?

Do you remember the TV show The Invaders? It was enormously popular in France, much more than in the US where it originated. It tells the story of one David Vincent, who alone sees that currently working C programs have a serious defect and are at risk of getting translated to flawed binaries by evil C compilers.

David Vincent

I misremember. In the series, David Vincent is alone aware that Earth is being infiltrated by extra-terrestrial beings of human appearance.

Strict aliasing as some know it

Many will know “strict aliasing” as a set of rules that make it forbidden to pass the same address twice to the following C function:

int f(int *p, float *q) {
  *p = 1;
  *q = 2.0;
  return *p;
}

Indeed, the compiler confidently generates assembly code that assumes that the arguments are not some int variable’s address &x and (float*)&x:

f:                                      # @f
        movl    $1, (%rdi)
        movl    $1073741824, (%rsi)     # imm = 0x40000000
        movl    $1, %eax
        retq

This is in accordance with the rules described in the C standard. The function f in itself is fine and can be used validly with arguments that do not alias, but passing arguments that make *q write to an int variable causes undefined behavior.
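
Concretely, the kind of call that the generated code does not support would look like this (a sketch; do not do this):

#include <stdio.h>

int f(int *p, float *q);  /* the function above */

int main(void) {
  int x = 0;
  int r = f(&x, (float *)&x);  /* undefined: f's *q store writes to an int object */
  printf("%d\n", r);           /* the code shown above returns 1 regardless */
}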

Strict aliasing as implemented in modern optimizing compilers

Not so many may know that compilers also assume that p and q do not alias in function g:

struct object { int id; };

struct thing { int id; int thing_stuff; };

int g(struct object *p, struct thing *q) {
  p -> id = 1;
  q -> id = 2;
  return p -> id;
}

The function g always returns 1, according to Clang and GCC:

g:                                      # @g
        movl    $1, (%rdi)
        movl    $2, (%rsi)
        movl    $1, %eax
        retq

It is not clear that the standard allows them to do that (*). Not long ago, GCC optimized the following version too, showing that any pointer computed as the address of a struct member was assumed not to point into any other struct type, even when the lvalue used locally did not show any trace of this.

int g(struct object *p, struct thing *q) {
  int *int_addr1 = & p -> id;
  int *int_addr2 = & q -> id;
  *int_addr1 = 1;
  *int_addr2 = 2;
  return *int_addr1;
}

This function is also compiled as always returning 1:

g:
        movl    $1, (%rdi)
        movl    $1, %eax
        movl    $2, (%rsi)
        ret

(*) Here is an argument demonstrating that passing the same address for both arguments of g does not break strict aliasing rules, and that the compiler should therefore produce code that works in this case. I understand the author of the comment to be close to the matter at hand, being the implementer of the first type-based alias analysis in GCC.

How to report bugs

So we have been working on a pretty neat thing, and it is now just working well enough to give its first results. The thing is a very early version of a static analyzer that detects violations of strict aliasing as described in the standard, such as the first example above, and as actually implemented by compilers, such as the subsequent examples.

The results are diagnostics of strict aliasing violations in widely used open-source software. How would you go about reporting these to the maintainers of the software?

It seems important to report them: type-based alias analyses have in the past broken programs that their authors expected to work, and in fact they break them again every time a Gentoo user recompiles the Linux kernel without the option -fno-strict-aliasing. It is possible that these optimizations will not become more programmer-hostile than they are now (fingers crossed), and one may think that if they did not break a particular program (that violates the rules), they never will. But compiler implementers are hard at work on inter-procedural and link-time optimizations, all of which will make information available that wasn’t before, and allow strict-aliasing optimizations to fire where they didn’t.

In the particular case of the bug being reported, we are quite sure of the analyzer’s findings, but the analyzer is a bit too experimental to release yet. Not that this would necessarily help: these alien-detecting sunglasses may, to a proud developer, seem like sunglasses with aliens painted on. The analyzer is the first of its kind, too, so there is little hope of confirming the findings with another comparable alien detector.

Pointer conversion misuses of the kind of the first example are easy to recognize: a float is being converted to an int containing its representation, or vice-versa, and one only needs to convince the maintainer of the software that there are better, if more verbose, ways to do this. On the other hand, the assumption that structs can be used in the way shown in the example above is rooted even deeper in the C programming ways. Not only will it be harder to convince developers that what the code is doing is dangerous, but since any genericity at all in C is only obtained through pointer conversions, it is not easy to show which among them are invalid, without seeming to appeal to authority. Note how in the program below, the pointer conversion takes place where the function g is invoked, which isn’t the place where the strict aliasing violation takes place (it takes place inside g). Just looking at the function main, it is not obvious that the pointer conversion is one that leads to a strict aliasing violation, as opposed to one that is the only way not to implement twenty different qsort functions.

#include <stdio.h>

struct object { int id; };

struct thing { int id; int thing_stuff; };

int g(struct object *p, struct thing *q) {
  p -> id = 1;
  q -> id = 2;
  return p -> id;
}

int main(void) {
  struct thing saucer;
  g((struct object*)&saucer, &saucer);
}

I have been reporting subtle errors, like use of uninitialized memory, by offering a patch that should not change the program’s behavior if the program wasn’t using uninitialized memory, and that makes it evident that it does. Here is one recent example. Perhaps the same approach can be used here. That is, for reporting a problem in the program above, one might offer the patch below, and hope that the maintainer is not already having a bad day before receiving the bug report.

$ diff -u orig.c instrumented.c
--- orig.c	2016-06-25 20:40:38.819893396 +0200
+++ instrumented.c	2016-06-25 20:35:25.253912642 +0200
@@ -6,7 +6,9 @@
 
 int g(struct object *p, struct thing *q) {
   p -> id = 1;
+  printf("writing 2  to  %p as thing\n", (void*) & q -> id);
   q -> id = 2;
+  printf("reading %d from %p as object\n", p -> id, (void*) & p -> id);
   return p -> id;
 }
 
$ ./a.out 
writing 2  to  0x7ffe3e7ebbd0 as thing
reading 2 from 0x7ffe3e7ebbd0 as object

Bear in mind that in real code, the two sites may be in different functions, or even different files. The patch above is pointing out the obvious only because the strict aliasing violation is obvious in this 6-line example.

Will this be convincing enough? I will let you know…

This blog post owes to the Cerberus project for pointing out the actual situation with modern C compilers and structs, John Regehr for writing a summary of all the ways strict aliasing-based optimizations break programs and proofreading this post, Loïc Runarvot for implementing the analysis prototype. Alexander Cherepanov suggested the use of the verb “to fire” to describe C compiler optimizations being triggered.

Trap representations and padding bits

The C programming language does not hide from you how the values you manipulate are represented. One consequence is that when padding happens, its presence may have observable effects in carelessly crafted programs. Padding is well-known to appear between members of a struct, and also possibly after the last member of a struct. The remaining space in a union when the active member is not the widest one is also considered padding. A C programmer that only cares for usual x86 platforms might be excused for thinking that, for them, this is it. As for trap representations, these may be believed to be reserved for weird hardware that use one’s complement or explicit parity bits.

Padding in structs and unions

Naïve attempts at making an x86-64 compiler take advantage of the unspecified nature of struct padding fail, as in example functions f, g and h, but the function i, provided by Alexander Cherepanov, shows that the padding of a struct x does not have to be copied along with the rest of the struct after the two consecutive struct assignments y = x; z = y;:

#include <string.h>

int i(void)
{
  struct { char c; int i; } x, y, z;
  memset(&x, 1, sizeof x);
  memset(&z, 0, sizeof z);

  y = x;
  z = y;

  return *((char *)&z + 1);
}

The function i is optimized to return 0;, meaning that the memory dedicated to x was not copied to z in its entirety.

These occurrences of padding can be prevented, when programming for a specific architecture or a couple of architectures (*), with interstitial character-array members, as sketched below. If, in a new project, you are implementing treatments so sophisticated that they require you to define a struct, C may be the wrong language to use in 2016. However, if you are going to do it anyway, you might want to insert these explicit interstitial members yourself: nothing the compiler can do with these unused bytes is worse than what the compiler can do with padding. Clang has a warning to help with this.
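
For instance, assuming a target where int is 4 bytes with 4-byte alignment, the struct from function i could be declared with its padding made explicit (a sketch for that specific ABI; the member name pad is mine):

struct with_explicit_padding {
  char c;
  char pad[3];  /* was invisible compiler padding; now ordinary bytes under our control */
  int i;
};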

Consequences

Since padding does not have to be copied in a struct assignment, and since any struct member assignment can, according to the standard, set padding to unspecified values, memcmp is in general the wrong way to compare structs that have padding. Sending an entire struct wholesale to an interlocutor across the network (by passing its address to write) can leak information that was not intended to get out.
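
For a struct like the one in function i, a comparison written member by member avoids reading padding altogether (a sketch; struct s and s_equal are illustrative names):

struct s { char c; int i; };

/* Member-wise equality: padding bytes, whose contents are unspecified,
   are never examined, contrary to memcmp(&a, &b, sizeof a). */
int s_equal(const struct s *a, const struct s *b) {
  return a->c == b->c && a->i == b->i;
}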

The rest of this post discusses padding and trap representations in scalar types, which we show that our exemplar C programmer for usual x86 platforms might encounter after all. Also, padding in scalar types cannot be eliminated with just additional members, so the problem, if rarely noticed, is in some ways more annoying than struct padding.

Padding in scalar types

It would be easy to assume that in such an ordinary architecture as x86, padding only happens because of structs and unions, as opposed to scalar types. The x86 architecture is not that ordinary: its first historical floating-point type occupied 80 bits; many compilation platforms still make that floating-point type available as long double, for some reason as a type of 12 or 16 total bytes (respectively in 32-bit and 64-bit mode), including respectively 2 and 6 bytes of padding. Padding may or may not be modified when the value is assigned, as shown in another example from Alexander Cherepanov:

#include <string.h>

int f(void)
{
  long double x, y;
  memset(&x, 0, sizeof x);
  memset(&y, -1, sizeof y);

  y = x;
  return ((unsigned char *)&y)[10];
}

The function f above is compiled as if it were return 255;, although the entire memory assigned to x was set to 0 before x was copied to y with y = x;.

Trap representations

Trap representations are a particular case of padding bits. A symptom of padding bits is that a type t has fewer than 2^(CHAR_BIT ✕ sizeof(t)) distinct values. In the case of trap representations, some of the bit patterns are considered erroneous for type t. Accessing such an erroneous representation with an lvalue of type t is undefined behavior.

The latest draft of the C11 standard contains this footnote (**):

53) Some combinations of padding bits might generate trap representations, for example, if one padding bit is a parity bit. Regardless, no arithmetic operation on valid values can generate a trap representation other than as part of an exceptional condition such as an overflow, and this cannot occur with unsigned types. All other combinations of padding bits are alternative object representations of the value specified by the value bits.

This footnote is in the context of the representation of integer types, which are made of value bits, padding bits, and in the case of signed integers, one sign bit (6.2.6.2).

“I have no access to no parity bits,” the ordinary programmer might think. “Even if redundant internal representations are part of the implementation of my high-performance computer, the interface presented to me is that of ordinary memory, where all bits are value bits (with sometimes a sign bit).”

In fact, as implemented by GCC and Clang, the _Bool type has two values and 254 trap representations. This is visible in the following example:

int f(_Bool *b) {
  if (*b)
    return 1;
  else
    return 0;
}

The function f in this example, as compiled, returns 123 when passed the address of a byte that contains 123. This value does not correspond to either of the two possible execution paths of the function. Undefined behavior is the only excuse the compiler has for generating code with this behavior, and this means that GCC and Clang, for the sake of this optimization, choose to interpret bytes containing a value other than 0 or 1 as trap representations for the type _Bool.
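
One way to observe this at home is to link the function above, compiled separately so that it is not inlined, with a caller along these lines (a sketch; the undefined behavior is the very point being demonstrated):

#include <stdio.h>

int f(_Bool *b);  /* the function above, compiled in a separate translation unit */

int main(void) {
  unsigned char byte = 123;           /* neither 0 nor 1 */
  printf("%d\n", f((_Bool *)&byte));  /* undefined behavior; observed to print 123 */
}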


This post owes to Alexander Cherepanov’s examples, John Regehr’s encouragements and Miod Vallat’s remarks.

Footnotes:

(*) I would have recommended using fixed-width integer types from stdint.h, such as uint32_t, to make the layout fixed, but that does not solve the problem with floating-point types or pointer types.

(**) The latest C11 draft contains two very similar footnotes 53 and 54 for which it is not clear that both were intended to be present in the final revision.

A non-exhaustive list of ways C compilers break for objects larger than PTRDIFF_MAX bytes

One well-documented way in which a program may break in presence of objects of more than PTRDIFF_MAX bytes is by doing a pointer subtraction that results in an overflow:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
  char *p = malloc(0x81000000);
  if (!p) abort();
  char *q = p + 0x81000000;
  printf("q - p: %td\n", q - p);
}

On the Ubuntu Linux version that I have conveniently at hand, this program, compiled to 32-bit code, produces:

$ gcc -m32 t.c && ./a.out 
q - p: -2130706432

Strange results are acceptable because the overflow in pointer subtractions is undefined behavior:

When two pointers are subtracted, both shall point to elements of the same array object, or one past the last element of the array object; the result is the difference of the subscripts of the two array elements. The size of the result is implementation-defined, and its type (a signed integer type) is ptrdiff_t defined in the <stddef.h> header. If the result is not representable in an object of that type, the behavior is undefined.

Do not stop reading if you already know all this! Things are going to get interesting in a minute!

Incidentally, when I tried this program on OS X 10.9 with clang -m32 t.c && ./a.out, I obtained a warning that malloc had failed, and my program aborted without doing the dangerous q - p pointer subtraction. In effect, a malloc function that rejects allocations of more than half the address space goes a long way towards preventing surprising behaviors. This is interesting because at least as late as OS X 10.5, I routinely allocated 2.5GiB from 32-bit processes. On the OS X platform, the decision was courageously made to break some 32-bit applications that used to work in order to prevent some vaguely worrying problems. But then again, preserving application compatibility was never the foremost preoccupation of the Mac platform.

Even if you already knew all this, keep reading.

Subtraction of pointers to wider types

On a 32-bit platform

GCC and Clang’s behavior on the following example is not excused by clause 6.5.6p9:

#include <stdlib.h>

int main(void) {  
  short *p, *q;

  p = malloc(0x70000000 * sizeof *p); // note: product does not overflow
  q = p + 0x70000000; // note: stays in-bounds
  return q - p; // note: does not overflow ptrdiff_t
}

In this program, q - p evaluates to -268435456. Wait, what?

The example can be made more striking by making the expected result of the pointer subtraction smaller:

#include <stdlib.h>

typedef char (*arrayptr)[0x70000000];

int main(void) {  
  arrayptr p, q;

  p = malloc(2 * sizeof *p);
  q = p + 2;
  return q - p;
}

Both Clang and GCC make this program return 0 when it’s very obvious that the correct result is 2. As a psychological experiment, if one shows these related examples to C programmers, the average subject might reluctantly let themselves be convinced that -268435456 is 0x70000000, but they are quite certain that 2 should not be 0, and express more surprise at the second example.

The answer to the riddle—but not really an excuse in itself for Clang and GCC’s behavior—is contained in the following decomposition of pointer subtraction into elementary steps. The “sarl” instruction implements a signed operation that does not produce the correct result when the value in %eax should be interpreted as a number above 2^31. When the element size is not a power of two, the compiler uses a division instruction instead, but the problem remains the same: the sequence generated by the compiler implements a signed division that doesn’t work above 2^31.
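
In C terms, the generated 32-bit code amounts to the following (a sketch of what the instructions compute, with the values from the first example in the comments):

#include <stddef.h>

/* What the generated code computes for q - p with short *q, *p on a
   32-bit target: a 32-bit subtraction, then "sarl", an arithmetic
   (signed) right shift, standing in for division by sizeof(short). */
ptrdiff_t decomposed_subtraction(short *q, short *p) {
  int byte_diff = (int)q - (int)p;  /* 0xE0000000 here, read as -536870912 */
  return byte_diff >> 1;            /* signed halving: -268435456 */
}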

Partially excusing GCC’s behavior, it is somewhat documented (in replies to bug reports) that 32-bit GCC does not work with malloc functions that succeed for requests of more than 2GiB of memory. As long as the malloc calls fail in the examples above, GCC is not doing anything wrong, because the program invokes undefined behavior as soon as p + … is computed. (It is illegal to do pointer arithmetic from a null pointer.)

And indeed, if we re-introduce the malloc success check that I surreptitiously removed from the initial example, GCC does the right thing. “The right thing” being making a call to malloc, which, as semi-documented, should fail for the requested size in order to be compatible with GCC.
It’s a shame that 32-bit Glibc’s malloc does not systematically fail for such requests, but that is not GCC’s problem. It is the problem of the programmer who links together code generated by GCC and code from Glibc (note that Glibc is not intended to be compiled with a compiler other than GCC, which is unfortunate, considering).

Clang, the most honey badger-like of compilers, don’t give a shit.

64-bit platforms

64-bit solves all these problems, because a malloc call requesting half the address space always fails, right?
It’s a shame that there was a difficult trade-off on 32-bit platforms, where the problem was fully realized only after some applications had gotten used to requesting 2.5GiB and getting away with it, but as even the smallest computers are moving to 64-bit, we can consider this problem solved by Moore’s law? 2^63-1 bytes ought to be enough for anybody?

Not really. The malloc call in the program below succeeds (the program would return 0 if the malloc invocation failed), despite Clang having the same half-the-address-space limitation for 64-bit as for 32-bit.

#include <stdlib.h>

typedef char (*arrayptr)[0x700000000000000];

int main(void) {  
  arrayptr p, q;

  p = malloc(32 * sizeof *p);
  if (!p) return 0;
  q = p + 32;
  return q - p;
}

The program not returning 0 indicates that malloc succeeded. The program returning a nonzero value other than 32 indicates that Clang is confused.

Comparisons of pointers

One might consider it an acceptable trade-off to let malloc allocate 2.5GiB on 32-bit platforms, if the developer knows that they never subsequently subtract pointers to the allocated zone. Not as char* and not as pointers to another type, since compilers are broken for these subtractions too, even when C11’s clause 6.5.6p9 perfectly defines the result.

Unfortunately, an optimizing compiler’s code appears to be littered with optimizations that assume object sizes are less than PTRDIFF_MAX. As an example, we all know that X86 Clang and GCC implement pointer comparison as an assembly unsigned comparison (“cmpl” is an agnostic comparison instruction, but here it is followed by “cmoval”, where the “a”, for “above”, indicates conditional move based on an unsigned comparison). This means that it is allowed to compare pointers inside a putative 2.5GiB array on a 32-bit platform, right?

Not all pointer comparisons.

Because of this optimization.
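
The shape of the code involved can be reconstructed along the following lines (an assumption on my part, not the original example; the names offs1 and offs2 are taken from the next paragraph):

/* GCC simplifies the comparison of p + offs1 and p + offs2 into a
   comparison of offs1 and offs2 alone—and, as the "cmovgl" discussed
   below shows, a signed one. */
const char *relation(char *p, unsigned offs1, unsigned offs2) {
  return p + offs1 > p + offs2 ? "greater" : "less or equal";
}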

Note the “g” in “cmovgl”. This “g” stands for “greater”: a mnemotechnic(?) indicator that a signed comparison decides which string is returned. Determining what happens with offs1=0 and offs2=0x81000000, two values representable in a 32-bit size_t that, with a lenient malloc, can also be valid offsets into the same array, is left as an exercise for the reader.

Conclusion

It is impossible to tell when the compiler will apply this or any other optimization that assumes that a pointer difference is always representable as a ptrdiff_t. Therefore, the only safe path on 32-bit platforms is to refrain from allocating more than 2GiB in a single malloc call; it might be a good idea if Glibc’s malloc were changed to implement this limitation. Compilers could do much, much better than they currently do. Nothing would prevent malloc(0x81000000) from causing a warning (I suggest “malloc call will always return NULL” or something equally direct). Beyond the friendliness of warnings, Clang’s elimination of malloc calls that, because of Clang’s own limits, cannot succeed, as if they were calls that cannot fail, indicates a disturbing misunderstanding of classical logic.

Acknowledgements

Alexander Cherepanov and John Regehr provided most of the examples in this post, with additional remarks coming from Daniel Micay and Rich Felker. Matt Godbolt made the absolute best platform on which to investigate these examples and show them to the world.