The tale of a strange PHP bug

April 2, 2008

This article about PHP’s zvals inspired me to tell the story of a bug related to this data structure.

A PHP portability problem

A few days ago, we hit a strange bug that seemed to occur only in our production environment. A PHP5 script, run periodically by cron, computes a value and inserts it into a data cache, but when the web servers tried to read this value from the cache, it wasn’t found. Stranger still, another cron script using the same cache package had its result inserted and read back successfully.

I was assigned this bug and tried to reproduce it on my local development machine, which runs its own cache. Sure enough, both scripts worked fine there.

Thinking this might be a cache configuration problem, I connected to the test server that we use before going into production and tried to reproduce the bug there. Again, both scripts worked fine against that cache.

Unable to reproduce this locally, we examined the cache on the production servers, and found out that the web servers were not looking for the data where the cron scripts had put it…

A “split” cache storage

Our cache uses a key → value model, but the data is spread over several storage units. To find out which unit holds the data for a key, the key is passed to this function:

function getStorageUnit($key) {
    return abs(crc32($key) % NUMBER_OF_CACHE_UNITS);
}

With the key “userPageData” and a total of 10 cache storage units, crc32 returns 1495715863, and 1495715863 % 10 = 3. This means the data for this key is actually stored in the “cache_3” unit.
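The mapping from key to storage unit can be sketched in C. This is only an illustration, not our cache package: crc32_ieee and unit_for_key are hypothetical names, the bitwise loop is the standard CRC-32 (reflected polynomial 0xEDB88320), which is assumed here to match PHP's crc32(), and the checksum is kept unsigned throughout.

```c
#include <stdint.h>

/* Standard bitwise CRC-32 (reflected polynomial 0xEDB88320).
   Assumption: this is the same checksum PHP's crc32() computes. */
static uint32_t crc32_ieee(const char *s)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (; *s != '\0'; s++) {
        crc ^= (unsigned char)*s;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Hypothetical helper: pick one of `units` storage units for a key.
   The checksum stays unsigned, so the result is always in [0, units). */
static unsigned unit_for_key(const char *key, unsigned units)
{
    return crc32_ieee(key) % units;
}
```

With 10 units, unit_for_key("userPageData", 10) plays the role of getStorageUnit above, and the data would live in the unit named after the returned number.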

While “userPageData” was working fine, the key “homePageData” wasn’t:

On my 32-bit development machine, its data was stored in cache unit 1, and the web code was successfully fetching the data from the same unit. On the 64-bit test machine, the cache number was 5. (The raw CRC32 is 0xB40B0545, or 3020621125, which a signed 32-bit integer can only hold as -1274346171.)

In production, the cron scripts were running on 32-bit machines, inserting the data into cache unit #1, while the 64-bit web front-ends were computing the unit differently and trying to fetch the value from cache unit #5.
This behavior is documented: the PHP manual for crc32() warns that many checksums come out negative on 32-bit platforms, while they are always positive on 64-bit ones.
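The split can be reproduced on a single machine with a small C sketch. It is an illustration only: as_long32, as_long64 and storage_unit are hypothetical helpers that model how a 32-bit and a 64-bit signed long would see the same raw CRC, then apply the abs(... % ...) formula from getStorageUnit.

```c
#include <stdint.h>
#include <stdlib.h>

/* How a 32-bit signed long sees the raw CRC: values with the top bit
   set wrap around to negative numbers (two's complement). */
static int64_t as_long32(uint32_t crc)
{
    return (crc & 0x80000000u) ? (int64_t)crc - 4294967296LL : (int64_t)crc;
}

/* How a 64-bit signed long sees the same CRC: always non-negative. */
static int64_t as_long64(uint32_t crc)
{
    return (int64_t)crc;
}

/* Mirrors abs(crc32($key) % NUMBER_OF_CACHE_UNITS). As in PHP, the C99
   remainder takes the sign of the dividend, so it can be negative. */
static int storage_unit(int64_t crc_as_seen, int units)
{
    return (int)llabs(crc_as_seen % units);
}
```

For the raw CRC 0xB40B0545 and 10 units, the 32-bit view gives unit 1 and the 64-bit view gives unit 5. For 1495715863, whose top bit is clear, both views give unit 3.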

Finding an explanation in struct zval

When PHP computes values, it uses an internal structure called zval. A zval can hold values of different types and sizes, and the size of some of its fields depends on the platform, which is where the portability problems come from.
note – All C code © Zend.

This is how PHP stores values (Zend/zend.h):

typedef union _zvalue_value {
        long lval;                       /* long value */
        double dval;                     /* double value */
        struct {
                char *val;
                int len;
        } str;
        HashTable *ht;                   /* hash table value */
        zend_object_value obj;
} zvalue_value;

struct _zval_struct {
        /* Variable information */
        zvalue_value value;      /* value */
        zend_uint refcount;
        zend_uchar type;         /* active type */
        zend_uchar is_ref;
};

And this is how a CRC32 value is written to a new zval in the PHP function (ext/standard/crc32.c):

register php_uint32 crc;
/* ... actually computing the value here ... */
RETVAL_LONG(crc^0xFFFFFFFF);

RETVAL_LONG is a macro that sets return_value->value.lval, thus writing a uint32 into a signed long. On our i686 build, a long is 32 bits, and its leftmost bit is the sign of the value it holds, so any CRC with the top bit set comes out negative. On our x86_64 build, a long is 64 bits: one sign bit, 31 zeros, and then the 32 bits of the CRC value, so the result is always positive. A fix exists for this function:

$crc = abs(crc32($string));
if ($crc & 0x80000000) {
    $crc ^= 0xffffffff;
    $crc += 1;
}

This works on 32-bit and 64-bit Intel, but I’m not sure it works as expected on other platforms. Also, I wouldn’t be surprised if this problem were present in other functions as well.
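To check the fix without a second machine, here is a minimal C sketch of the same steps applied to a 64-bit integer. normalize_crc is a hypothetical name, and the two inputs in the note below are simply the 32-bit and 64-bit views of the raw CRC 0xB40B0545. It models the arithmetic only, not PHP's integer-to-float overflow rules.

```c
#include <stdint.h>
#include <stdlib.h>

/* Same steps as the PHP fix: take the absolute value, then, if bit 31
   is still set (64-bit view of a "negative" CRC), flip the low 32 bits
   and add one to get the magnitude of the 32-bit signed value. */
static int64_t normalize_crc(int64_t crc)
{
    crc = llabs(crc);
    if (crc & 0x80000000LL) {
        crc ^= 0xffffffffLL;
        crc += 1;
    }
    return crc;
}
```

normalize_crc(-1274346171) (the 32-bit view) and normalize_crc(3020621125) (the 64-bit view) both return 1274346171, so both kinds of machine would end up choosing the same storage unit.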