Hard bugs come in many forms, and every experienced programmer has a few war stories in their pocket. Twenty years on, my bloodiest war story has been replaced a few times as new, even bloodier ones were written. This is only the most recent to make the cut.
It’s not my fault, I swear!
The C++ code in question came about over a number of years at the hands of many programmers. It wasn’t my fault, but I’m not completely clean of it either. It was in a content-loading system that everyone knew was terrible, but no one had taken the initiative to fix. I’m saying this so when you read this and say, “But if only you had written your code better, you never would have been in this mess in the first place,” I can reply that you’re right, but we don’t all live in a clean, perfect world, and you should get down in the muck with the rest of us.
Symptoms
This code ran fine on all iOS devices (the iPhone 5 was the latest at the time) and on almost all Android devices. The only two devices that had problems were the recently released Galaxy S4 and the HTC One. Something about those devices led us down a dark path where the code would occasionally hang sometime during the initial content-load sequence. Oh, and it only happened in optimized release builds – you didn’t think this would be THAT easy, did you?
Initial Investigations
Diving right in: there’s no debugger available, so it’s logging output only. The first observation was that this hang wasn’t actually a hang at all. The main thread was still happily rocketing around its little track servicing the data-loading jobs. Strangely, once the bug was happening, there weren’t any more jobs to service, but the main thread was convinced that there was more work to do.
As it happens, the last job to be serviced always seemed to be data decompression. This particular code had a single worker thread dedicated to decompression. Fixed-size buffers were read from storage and passed to the decompressor. The resulting decompressed data was then passed back to the main thread in another buffer, along with the number of bytes it contained.
The most annoying part of this bug quickly became apparent. As I added logging statements, the bug would sometimes vanish entirely. Adding output inside a certain window of the thread-handling code would cause the bug to vanish.
The inter-thread communication looked something like this:
Main thread
ReadBuffer( mInputBuffer, bufferSize );
mOutputBuffer = GetAvailableBuffer();
mTotalDecompressed = 0;
mDecompressionDone = false;
signalDecompressionThread();
// Go do other stuff
Decompression thread
IterateForever
{
    waitForSignal();
    mDecompressedBytes = decompressData( mInputBuffer, mOutputBuffer );
    mDecompressionDone = true;
}
Later, back on the Main thread
// Do lots of job processing
if ( mDecompressionDone )
{
    // The decompression job is done, so process the results
    mTotalDecompressed += mDecompressedBytes;
    if ( mTotalDecompressed == mExpectedDecompressedBytes )
        FetchNextDecompressionJob();
}
The primary gate is the mDecompressionDone flag, and we rely on mDecompressedBytes to move things along. Super-janky, right? But it should work, right? The bug seemed to imply that the sum of all the mDecompressedBytes values didn’t equal the expected total. Unfortunately, logging the values showed they were always spot on – even when the hang happened. Weird. A little too weird.
I remembered seeing Bruce Dawson give a talk about the perils of lock-free programming. The primary take-away was that the actual order of reads and writes can be changed by the CPU. So for example, even if you think the logic looks good, your data may not be completely consistent between cores. If that was happening here, the window to see the problem would be exceptionally small, and my usual trick of using a sleep to widen the window would be ineffective. I had no shortage of ways to make the problem go away, but I still needed a way to prove what the problem actually was.
Enter the Network Side-Car
Logging statements are actually pretty expensive relative to the timescales we’re talking about here, so that was right out. I needed a way to log information without invoking the normal logging machinery. Making matters worse, this was also only happening on a mobile phone. My solution was to create a special low-overhead logging object that sat on a network port and waited for a connection.
Bingo!
The formerly-correct mDecompressedBytes was returning zero just before the hang. To prove my case, I added a little more code to detect the zero and then wait until the value became non-zero. And success! This was suddenly a clear-cut case where the memory writes to mDecompressedBytes and then mDecompressionDone were in flight at the same time from the decompression thread, and were being observed from the main thread in the reverse order.
The fix is really quite simple once you know what’s going on. A memory write barrier ensures that all pending memory writes before the barrier complete before any of the writes after it. Inserting one between the two stores does the trick here. Slightly less exotically, you can make mDecompressionDone an atomic and use a release store paired with an acquire load on the main thread, which implies the necessary barriers.
So why didn’t this problem show up earlier? After all, the problematic code had been around for years and had actually shipped in multiple products. It’s hard to say with complete certainty, but I can explain some of it. x86 CPUs generally don’t reorder stores with other stores, so the PC and Mac ports were likely immune. The iOS and Android devices we were using were all ARM CPUs, but until the Galaxy S4 and the HTC One, all the devices were only dual-core. Further, I’ve seen some anecdotal evidence that while Android phones were often multicore, some of them restricted all the threads of a single application to a single core, which would prevent this issue. Perhaps the quad-core processors lifted that restriction.
So why didn’t this show up on other Android devices or the plethora of iOS devices? It turns out the nature of the data becomes important. The compressed data we were dealing with was image data, and the two problematic phones were the first to have super-high 400+ dpi pixel densities. We were using higher-resolution content on those phones, and the fixed-size decompression input buffer was no longer big enough to hold an entire compressed image. Because each decompression now required multiple jobs to complete, the bookkeeping logic was suddenly far more susceptible to the memory-write-ordering problem that had been there all along.