The last time Hackerfall tried to access this page, it returned a not found error. A cached version of the page is below, or click here to continue anyway

ImperialViolet - Signed overflow in C

(These are notes that I wrote up from a discussion with one of Google's LLVM experts, Chandler Carruth. He was generous enough to let me pick his brains for a while and anything intelligent in this blog post is due to him. If anything's messed up, then that would be my doing.)

The fact that C and C++ specify that overflow of signed numbers is undefined behaviour is a common source of complaint, at least from security-minded people like me. Specifically, we worry that sensible code like the following will be transformed, in light of this rule, into unsafe code:

// x, y and length are ints.
if (x + y 

This code worries about overflow and tries to guard against it. Integer overflow is one of the OWASP standard vulnerability categories and so that code would appear to be commendable for guarding against it.

But, if signed overflow is undefined behaviour then the expression x + y can be assumed by the compiler not to overflow! Therefore a C or C++ compiler can delete this check and, possibly, introduce a security bug. At first glance, this seems insane.

The benefits of not defining signed overflow

Lots of code indexes arrays using ints. A for loop, iterating over an array using int i is second only to Hello world as a standard example of C/C++ code. Imagine such code that accesses fields in an array of structs:

struct Point {
  float x, y, z;

// points is an array of Points.
int sum = 0;
for (int i = 0; i 

Given that the size of a Point is 12 bytes, the reference to the z members of points is turned into points + (12*i + 8). On Intel chips (at least) it's possible to do all that computation in the addressing instruction: the instruction can take a scale, an offset and a displacement.

However, that computation is generally done in a 64-bit environment these days. If signed overflow were defined, then 12*i + 8 would have to have correct overflow behaviour for 32-bit values and so this trick couldn't be done. In that case, the additional lea instructions cause a measurable increase in code size, thus a decrease in instruction-cache and TLB hits, and thus a decrease in performance. Measurements in Google's code-base show that this is a significant problem.

Now, perhaps you don't believe that's a sufficiently good reason to disallow signed overflow, but I'm happier knowing that at least there is a real reason. Another anecdote from Google is that, in the bulk of C/C++ code, cases where signed overflow is a bug vastly outnumber cases where it was being handled correctly.

Left shifting

An unexpectedly related desiderata is left shifting of signed numbers. Traditionally this has been undefined because the C/C++ standards were not willing to require two's-complement behaviour. However, now that two's-complement is ubiquitous, there's another problem:

Programmers often interchangeably write x << 2 or x * 4 (or likewise for other powers of two). However, if left-shift were to be fully defined then they wouldn't be the same; the shift would have defined overflow semantics but overflowing multiplication would not. This is sufficiently surprising and subtle to be quite undesirable.

There are no clear answers here. Having undefined signed overflow is an important optimisation. Having it be undefined only when pointers are involved in an expression is unsatisfying and makes trivial looking code changes have unexpectedly large effects. That all seems to imply that, in order to be consistent, left shifting should also have undefined overflow—not a happy conclusion.

It's possible that we could rescue left-shifting of signed numbers by saying that overflow from signed multiplication is defined, but overflow from signed addition is not. That certainly isn't satisfying, but nor are any of the other options. We have not yet done the work to investigate all the consequences of this, however.

Continue reading on