How I Debugged a Bad Race Condition

The last week and a half I’ve spent searching for race conditions in my database code. A race condition is a bug where two (or more) processes interact in an unforseen way causing bad things to happen.

The symptoms of this bug where:

  1. Could occur at any time, usually after 2+ hours of testing.
  2. Appeared to change data: My tests read and write a test string of 1 character repeated many times, sometimes the string would have garbage in the middle like:


  3. If I reread the data after the error the problem was gone.
  4. Sometimes the system hardlocked. All potential debugging information would be lost.
  5. Enabling debugging made the problem so rare I couldn’t produce it in a reasonable time.

I assumed the read error was from forgetting to lock something, I also assumed the hardlocks were an unrelated problem. I began by isolating the simplest test that still exhibited either problem. This consumed days of time, but I learnt hardlocks happened in any tests, read errors only in the two integration tests.

The integration tests ran 32 threads and performed thousands of random selected operations: 0) Make and store data, 1) Replace data with new data, 2) Delete data (which by design moved other data around). After testing I found that both replace and delete had problems, delete producing one earlier, but I couldn’t tell if the two operations were experiencing the same problem.

The First Bug was a Deadlock

By day four I was googling race conditions and deadlocks. On a whim, I googled “SBCL deadlocks” and found: Programming with signal handling in mind:

6.4.2 More deadlocks

If functions A and B directly or indirectly lock mutexes M and N, they should do so in the same order to avoid deadlocks.

A less trivial scenario is where there is only one lock involved but it is acquired in a WITHOUT-GCING in thread A, and outside of WITHOUT-GCING in thread B. If thread A has entered WITHOUT-GCING but thread B has the lock when the gc hits, then A cannot leave WITHOUT-GCING because it is waiting for the lock the already suspended thread B has. From this scenario one can easily derive the rule: in a WITHOUT-GCING form (or pseudo atomic for that matter) never wait for another thread that’s not in WITHOUT-GCING.

I had the less trivial scenario in many tests. I often use sb-ext:string-to-octets, sb-sys:with-pinned-objects, and sb-sys:vector-sap to convert a string to a utf8 byte array pointer that I can copy into the database. However, the definition of sb-sys:with-pinned-objects is just:

(def!macro with-pinned-objects ((&rest objects) &body body)
  (declare (ignore objects)) 
  `(without-gcing ,@body))

I modified the code to pin objects at the last possible moment after all other locks are acquired, and the deadlocks have gone away. The phantom read issue still occurred.

I Finally Stumble Across the Race Condition

I reviewed my collected error messages and noticed each phantom read occurred at exactly byte 564. I knew this was significant but my code didn’t contain 564, not as a magic-number nor a constant. It was not until next morning that I found a variable length on-disk structure that serialized as 1 or more blocks of 142 unsigned 32 bit integers; that would be 568 byte blocks, which was close enough to merit inspection.

My serialization is pretty simple: the first four bytes store the length (number of bytes) of the data, this data is written across a number of *data-size* binary blocks, and *origin-data* points to the start of the current block. The current block will always be in the page buffer. The routine ‘next’ moves *origin-data* to the next block. Next will return a new buffer, and release the old one, if the next block is on another page.

To read, I see the length in the first 4 bytes and make a big byte array. Then I use reposition 4 to set *origin-data* past the saved length. I read (min length (- *data-length* 4)) bytes and that handles the first block. Subsequent blocks are more data, but there is a trick: After calling next I may be at the first block of the next page or *data-size* bytes ahead. In the latter case I’m 4 bytes ahead and must reposition -4 to align. I decide which case I’m in by comparing nbuffer to buffer, if they are equal I’m still on the same page.

(labels ((read-len ()
           (let ((length (sap-ref-32 *origin-data* 0)))
             (make-array length :element-type '(unsigned-byte 8))))
         (do-read (buffer dest offset len)
           (if (> len 0)
               (progn (with-pinned-objects (dest)
                        (system-area-ub8-copy *origin-data* 0 (vector-sap dest) offset (min len *data-size*)))
                      (do-read (next buffer) dest (+ offset *data-size*) (- len *data-size*)))
         (reposition (n) (setf *origin-data* (sap+ *origin-data* n))))
  (defun readin (buffer)
    (let ((bytes (read-len)))
      (values bytes
              (if (zerop (length bytes))
                  (next buffer)
                    (reposition 4)
                    (let ((nbuffer (do-read buffer 0 (min (length bytes) (- *data-size* 4)))))
                      (when (eq nbuffer buffer) (reposition -4))
                      (let ((already (- *data-size* n)))
                        (do-read nbuffer already (- (length bytes) already))))))))))

Where is the problem? Not here, it’s actually in the definition of ‘start-of-next’ used by ‘next’. Can you spot it?

(defun start-of-next (buffer)
  (let ((next-sector (link buffer)))
    (if (zerop next-sector)
        (progn (release buffer) nil)
        (progn (release buffer)
               (let ((nbuffer (acquire next-sector)))
                 (setf *origin-data* (first-block nbuffer))

(defun next (buffer)
  (if (past-end buffer (origin-data))
      (start-of-next buffer)
      (incf *origin-data* *data-size*)))

The problem is releasing the old buffer before acquiring the next one. There is a slim chance the newly acquired next buffer will reuse the space of the just released old buffer. Thus in this case buffer will equal nbuffer, even though we are at the start of the next block. This caused an erroneously reposition -4 before the second block read. That took me over a week to find.

About these ads

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s


Get every new post delivered to your Inbox.

Join 34 other followers