A few weeks ago I experimented with CompCert, a C compiler from INRIA, written largely in Coq, with chunks in OCaml; this allows the Coq parts of CompCert to be formally verified (see below for more on this).

Now I have no need of a guarantee that my compiler is bug-free, but to the extent that translating my code into the subset of C supported by CompCert reduces the bug count rather than increasing it, it's a win. I'm basically using CompCert as a lint tool, but it's fun and instructive anyway. The real-world scenario which makes any of this interesting is therefore: you have a C codebase, you suspect a bug in your compiler, and you want to know how hard it would be to maintain that codebase such that it compiles with a compiler believed to be bug-free.

For many years I have maintained a codebase of 40K lines of fairly odd C that implements a computer game I used to run in the 1990s, and which predates modern conveniences that might otherwise have been used: sqlite, pcre, libevent, reliable IP stacks on NeXTSTEP, ANSI C, free C++ compilers, free Erlang, and so on. The code is also unusual in shunning the use of struct, malloc and pointer arithmetic. For almost twenty years I've kept it up to date with the C toolchains on a number of OSes, as a way of keeping an eye on what the cool kids are breaking.

So:

Firstly, the codebase needs to be able to cope with multiple compilers; gcc and LLVM's clang are close to drop-in replacements for each other from the perspective of the Makefile. Not so CompCert: it doesn't accept -Wall -Werror as options, as they're effectively on by default. CompCert isn't going to want to know about any code that doesn't pass gcc -Wall -Werror, but there are a few things LLVM thinks it's OK to warn you about that CompCert is cool with, which feels like LLVM is wasting my time. Getting the build system and revision control happy about parameterisable compiler options has to happen first.

I was forced to change all the remaining instances of conflation of integer widths. Anyone who's done arithmetic in OCaml will recognise this as one of the house microfascisms of INRIA, but it's a deep issue: a lot of corner cases depend on your installation of the header files and libraries and so on. In my case, function prototypes are culled into a .h file automatically with cproto, which by default widens the integer parameters of K&R-style C functions, following the default argument promotions:

void my_function(i)
short i;
{
...
}

is output as

void my_function(int i);

which gcc and LLVM tolerate, but CompCert doesn't. There were a couple of other legitimate "Well Don't Do That Then" moments that I won't tax you with. Effectively one's forced to get all the prototypes and headers and includes exactly right. This showed up a bug: a variable which was supposed to be declared extern wasn't, and was separately allocated from the global it was supposed to represent.
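To make the shape of that bug concrete, here's a minimal sketch of the discipline CompCert ends up enforcing; the file and variable names are hypothetical, not from the actual codebase. The header carries the extern declaration, exactly one .c file carries the definition, and every file touching the global includes the header, so the compiler can spot a mismatch:

/* game.h -- declaration only */
extern int player_count;

/* state.c -- the one and only definition */
#include "game.h"
int player_count = 0;

/* command.c -- refers to the shared global via the header */
#include "game.h"
void reset_players(void) { player_count = 0; }

Drop the extern, as had happened here, and you can end up with a second, separately allocated object of the same name instead of a reference to the shared one.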

The more formal treatment of integer widths also meant fixing a lot of sprintf format strings.
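Concretely, once a variable's width changes, every format string that prints it has to change too. A minimal sketch with made-up names:

#include <stdio.h>

void report(long total, short slot)
{
    char buf[64];

    /* Passing a long where plain %d expects an int is undefined
       behaviour, so widening a variable means revisiting the
       length modifiers wherever it is printed: */
    sprintf(buf, "total %ld in slot %hd", total, slot);
}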

The next thing I had to fix was the idiom

char *messages[] = { "...", "...", "...", NULL };
int x = sizeof(messages) / sizeof(messages[0]);

CompCert insists on the length of messages[] being explicitly specified, which means this technique isn't allowed.
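One workaround, sketched here with hypothetical contents, is to keep the NULL sentinel and count the entries at run time instead of asking sizeof to do it at compile time:

#include <stddef.h>

static char *messages[] = { "first", "second", "third", NULL };

/* Walk to the sentinel rather than dividing sizeofs. */
static int count_entries(char **table)
{
    int n = 0;
    while (table[n] != NULL)
        n++;
    return n;
}

The other fix, of course, is simply to write the array length explicitly and keep it in sync by hand.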

The harder stuff was signal() and stdarg; basically, CompCert supports an anaemic subset of C, and doesn't let you define stdarg functions yourself, though it provides the sprintf() clique of functions. Since wrapping sprintf() is about the only thing varargs is used for in C, this turns out not to be a problem, but I had originally bet that parts of the codebase were outside the CompCert C dialect and would need to be shunted into libraries.
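For the record, a sketch of the pattern that trips this restriction, with hypothetical names: a logging wrapper that is itself a variadic function. Calling the sprintf() family is fine; defining your own consumer of a va_list is what falls outside the dialect, so anything like this would have to be shunted into a separately compiled library or unpicked:

#include <stdarg.h>
#include <stdio.h>

/* Calling variadic library functions is allowed; defining
   one yourself is what the dialect rules out. */
void log_msg(const char *fmt, ...)
{
    va_list ap;
    va_start(ap, fmt);
    vfprintf(stderr, fmt, ap);
    va_end(ap);
}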

My own adventure in CompCert land basically amounted to learning new stylistic restrictions in C. Reading around what people have been doing with CompCert, I came across a few interesting articles, and from this chap I learnt about concolic testing, which is another technique I have no use for but am glad to have spent time learning about.