15.03.2015

When in doubt, express intent, and leave the rest to the compiler

Expressing intent in your C code


Expressing intent, illustrated with an example

I sometimes get asked what stylistic choices a C developer could make that would help their code work well with static analysis or formal verification. I usually do not have a good answer—the reader of this blog, or of the previous one, may have categorized me as a detail-oriented person; when thus solicited, my first instinct is to look for concrete examples, but stylistic choices in programming are full of trade-offs. I might provide an answer such as “Some verification tools may have difficulties with goto, so you could try to avoid goto as much as possible. Of course, the analyzer I work on has no difficulty with goto; it is only that other techniques exist for which goto is an inconvenience. But of course, if you are implementing error-handling in a C function that allocates resources, then you should use goto, because there is not really a better way, unless…” This is about the time I notice that whoever asked has gone off to get themselves another cocktail and a better discussion group to mingle with.

 

The general, non-detailed answer that I should give, and I am writing it down here for future reference, is that when in doubt, the developer should choose to express clearly the intent of their program.

 

Imagine that, while reviewing unfamiliar code, you arrive at the snippet below, except that instead of being isolated inside a function, it is three lines of code among another hundred that you are trying to make sense of.

#include <stdint.h>

void * f2(int c, void * p1, void * p2) {
  void * r = p1;
  uintptr_t mask = -c;
  r = (void*)(((uintptr_t)r & mask) | ((uintptr_t)p2 & ~mask));
  return r;
}

What does the above code do?

 

The answer is that it is nonsensical. It mixes some bits from a pointer p1 with bits from a pointer p2. Perhaps we can assume that p1 and p2 point to blocks that are aligned to the same large power of two in memory, and this code is simply copying the offset of p2 inside its own block to the same offset inside the block that p1 points into? That would depend on the value of c. You look at the code above the snippet and painfully infer that c can only contain 0 or 1. At least that explains it: when c is 1, mask is all ones and r keeps the bits of p1; when c is 0, mask is zero and r receives the bits of p2. In other words, the code, for c containing 0 or 1, is intended to be equivalent to the one below:

void * f1(int c, void * p1, void * p2) {
  void * r = p1;
  if (!c) r = p2;
  return r;
}

The developer’s idea may be that the f2 version is faster, because it doesn’t contain any expensive conditional branch. Let us take a look:

$ clang -O3 -S -fomit-frame-pointer -fno-asynchronous-unwind-tables cmov.c
$ cat cmov.s
…
_f2:                                    ## @f2
## BB#0:
	negl	%edi
	movslq	%edi, %rax
	andq	%rax, %rsi
	notq	%rax
	andq	%rdx, %rax
	orq	%rsi, %rax
	retq

The compiled code for f2 is beautiful: a straight sequence of arithmetic instructions that modern processors can fetch and decode well in advance to sustain a rhythm of one instruction per cycle or more. But not much more in this case, as most instructions depend immediately on the result of the preceding instruction. At least the code doesn’t contain any conditional branches that might get mis-predicted and inflict a penalty of several tens of cycles.

 

What about f1?

_f1:                                    ## @f1
## BB#0:
	testl	%edi, %edi
	cmoveq	%rdx, %rsi
	movq	%rsi, %rax
	retq

The compiled code for f1 does not contain any expensive conditional branch either! It uses the “conditional move” instruction. Coming just after the testl %edi, %edi instruction, the cmoveq %rdx, %rsi instruction has the same meaning as if (%edi == 0) %rsi = %rdx; would have in C, but the execution flow does not depend on %edi, allowing efficient fetching, decoding and execution at one instruction per cycle.

 

When producing IA-32 code, you might need a command-line option to tell the compiler to emit this instruction, as it has not existed in every IA-32 processor in history. The CMOV instruction was only introduced with the Pentium Pro processor, in 1995, so you might only encounter a processor that does not support it in some kind of computing museum. Say “hi!” to the Missile Command arcade game for me.

 

Here, my compiler produces 64-bit (x86-64) code by default, and every x86-64 processor has the CMOV instruction, so I do not even need to think about whether I care enough about 20-year-old, CMOV-less processors to write f2 by hand for their benefit. The dilemma is solved for me by such an x86-64 processor not existing at all.

 

In the compilation of f2, the compiler needs to produce code that works for all values of c, because it has not been provided with the information that c only ever contains 0 or 1. If we do provide the compiler with that information, it can do a better job of compiling the function to fast code:

void * f3(int c, void * p1, void * p2) {
  void * r = p1;
  c = !!c; // force c to be 0 or 1 only
  uintptr_t mask = -c;
  r = (void*)(((uintptr_t)r & mask) | ((uintptr_t)p2 & ~mask));
  return r;
}
$ clang -O3 -S -fomit-frame-pointer -fno-asynchronous-unwind-tables cmov.c
$ cat cmov.s
…
_f3:                                    ## @f3
## BB#0:
	testl	%edi, %edi
	cmoveq	%rdx, %rsi
	movq	%rsi, %rax
	retq

This is the same code as for f1! The pattern under discussion was actually common enough that the authors of Clang have had to invent an optimization to detect that the code implements a crude branchless conditional move and compile it to an efficient one. The optimization only works if the compiler can also infer that the variable c can only contain 0 or 1, though.

Are you trying to write constant-time code?

One, perhaps the only, justification for the pattern above is that the developer is trying to write code that executes in constant time. Some architectures do not have a “conditional move” instruction and it may be important that the snippet executes in constant time even on these architectures.

Parenthesis: the world is not only x86-64 and post-1995 IA-32, after all. There is also ARM. Though I hear that ARM’s 32-bit instruction set beats everyone else at who has more conditional instructions.

It is a leap of faith to write if (!c) r = p2; and hope that the compiler will generate a constant-time CMOV instruction for it. In fact, a naïve compiler, or one aiming for an unusual architecture, might generate an assembly conditional branch instruction just for the !c part!

Parenthesis: hoping that the compiler generates a CMOV instruction assumes that CMOV can be relied on to execute in constant time. This has been true historically (CMOV is really the hardware version of (((uintptr_t)r & mask) | ((uintptr_t)p2 & ~mask))), and can be expected to remain true for register operands (there is no point in making an instruction with low latency have variable execution time), but it could cease to be true in future generations of processors when one of the operands is a memory location.

It is a slightly narrower leap of faith to write r = (void*)(((uintptr_t)r & mask) | ((uintptr_t)p2 & ~mask)); and hope that the compiler will not generate code whose execution time depends on c. But as the example of f3 above shows, there is no limit to the sophistication of the transformations a compiler may apply. Pray that the compiler does not find a non-constant-time sequence of instructions that is faster than what you have written (or that it merely thinks is faster than what you have written). Or, if you are an atheist, revisit Pascal’s wager: you have nothing to lose, and you stand to win a chance of having a chance that the compiled code is constant-time!

 

The only strictly safe solution is not to write the code that has to be constant-time in C at all, because C gives the programmer no control over execution time. A softer variant of the safe solution is to proof-read the assembly code generated by the compiler, perhaps in an automated fashion, grepping for any instruction whose execution time varies. In the most recent generations of Intel processors, the execution time of integer division depends on its inputs, so that instruction should be avoided on principle, too.
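
As a deliberately crude illustration of such an automated check (my own sketch, not a vetted list of variable-time instructions), one can compile to assembly exactly as before and grep the output for jump and division mnemonics. The pattern below also flags harmless unconditional jmp instructions, so every match still deserves a human look:

$ clang -O3 -S -fomit-frame-pointer -fno-asynchronous-unwind-tables cmov.c
$ grep -nE '^[[:space:]]*(j[a-z]+|i?div)' cmov.s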

Express intent!

Coming back to the initial exhortation, what do I expect from the developer?

 

  • if their intention is to assign either p1 or p2 to r as fast as possible, they should write if (!c) r = p2;, which any compiler worth its salt can transform to the most elegant sequence of instructions for the target architecture.
  • if their intention is to assign either p1 or p2 to r in constant time, they should write r = constant_time_select(c, p1, p2);, which expresses exactly what they are trying to do. The name constant_time_select can refer to a macro, or an inline function, or even a normal function (we said we were aiming for constant time, not speed; if the function is not inlined, it will be easier to check that it was compiled to constant-time assembly code). A sketch of one possible implementation follows this list.
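
Here is that sketch: an inline function reusing the bit-mask trick from f2. The name and the choice of an inline function are merely an illustration, and the usual caveat applies: only an inspection of the generated assembly can tell whether a given compiler kept the result constant-time.

#include <stdint.h>

/* Sketch only: select p1 when c is nonzero and p2 when c is zero, without a
   branch. The compiled code still has to be checked for constant time. */
static inline void * constant_time_select(int c, void * p1, void * p2) {
  uintptr_t mask = -(uintptr_t)!!c; /* all ones when c is nonzero, zero otherwise */
  return (void*)(((uintptr_t)p1 & mask) | ((uintptr_t)p2 & ~mask));
}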

 

This matters to anyone who has to process the code. It matters to the human reader. It matters to C static analyzers loosely based on the principles of abstract interpretation. Our first reaction to the code early in this post was that it is nonsensical. This is the correct reaction for arbitrary values of c. An analyzer working its way through the code in which the snippet is embedded may not have inferred that c is always 0 or 1 here. Even if the analyzer has inferred this information, unless it contains special pattern-detection for the idiom at hand, it will attempt to extract the meaning of (void*)(((uintptr_t)r & mask) | ((uintptr_t)p2 & ~mask)) for general values of c, p1 and p2, because it gives meaning to programs by applying a small set of general rules designed to describe all cases. The analyzer will then, quite correctly, just as we did, arrive at the conclusion that the code is nonsense. If a function (or even a macro) with an explanatory name is used instead, then the operator can easily enough, for the purpose of functional verification, substitute the function’s functional, if-based equivalent for its implementation, as sketched below.
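
For instance (a hypothetical arrangement of my own, reusing the constant_time_select name from the sketch above together with a made-up FOR_ANALYSIS macro), the build handed to the analyzer can select the obvious if-based model, while the deployed build keeps the branchless implementation:

#include <stdint.h>

#ifdef FOR_ANALYSIS
/* Model given to the analyzer: same values, meaning obvious to general rules. */
static inline void * constant_time_select(int c, void * p1, void * p2) {
  return c ? p1 : p2;
}
#else
/* Branchless implementation used in the deployed build. */
static inline void * constant_time_select(int c, void * p1, void * p2) {
  uintptr_t mask = -(uintptr_t)!!c;
  return (void*)(((uintptr_t)p1 & mask) | ((uintptr_t)p2 & ~mask));
}
#endif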

 

It matters to other types of analyzers, too. Consider the popular testcase generator american fuzzy lop. afl measures, and aims at maximizing, coverage. With the branchless version based on bit manipulations, afl cannot know that there are two cases to cover. If the if-based equivalent is used, afl will know that there are two separate behaviors to test here, and can aim at producing testcases that exercise both.

Conclusion

I will leave the safety of my detail-oriented habits, go out on a limb, and state that expressing intent in code is good for everyone.

 

Acknowledgments: Ben Laurie, Stephen Canon and John Regehr provided some of the parenthetical notes in this post.

 
