Be careful with spin locks for synchronization
There is a temptation when writing high-performance multithreaded code to reach for simple spin locks, as they can be very fast when there is little contention.
Example spin lock
#include <atomic>

struct spin_lock {
    void lock(void) {
        for (;;) {
            // Optimistically assume the lock is free and try to take it.
            if (!lock_.exchange(true, std::memory_order_acquire)) {
                break;
            }
            // If we failed to get the lock, spin on a plain read so we
            // don't create lots of cache coherency traffic on the CPU.
            while (lock_.load(std::memory_order_relaxed));
            // The lock looks free again, so loop back and retry the exchange.
        }
    }
    void unlock(void) {
        lock_.store(false, std::memory_order_release);
    }
private:
    std::atomic<bool> lock_ = false;
};
The problem
The problem with using a spin lock like this for synchronization is that the OS scheduler is completely unaware of it: it has no idea one thread is waiting on another.
Here is some example code to help demonstrate the problem.
#include <cstdint>
#include <thread>

spin_lock lock;

// Does some work on shared state guarded by the lock (defined elsewhere).
void DoWorkUsingValuesProtectedByLock(void);

void foo(void)
{
    for (int32_t i = 0; i < 1000; i++) {
        lock.lock();
        DoWorkUsingValuesProtectedByLock();
        lock.unlock();
    }
}

int main()
{
    std::thread t0(foo);
    std::thread t1(foo);
    t0.join();
    t1.join();
    return 0;
}
Let's assume t0 gets the lock first and starts doing some work. Let's also assume this work does not complete before the thread uses up its time slice from the scheduler.
Now t0 is paused but still holds the lock. From the scheduler's point of view, however, t1 is not blocked, so the scheduler gives t1 a time slice.
t1 wakes up and spends the entire time slice (typically 5-16 ms) spinning away, doing nothing useful.
There is also nothing stopping the scheduler from running t1 multiple times before it schedules t0 again, potentially adding up to hundreds of milliseconds of wasted spinning.
Add additional threads into the mix and you can see how this can very quickly turn into mass busy spinning lasting multiple seconds.
This is especially problematic in real-time software that has an upper bound on how long each task can take.
Not obvious to the casual observer
The nasty thing about this problem is that while you are busy spinning waiting for the lock, your program looks like it is hard at work, since CPU usage is high. But it is not actually doing anything useful.
For example, you could have a multithreaded program that processes some data, taking about 10 minutes to complete with high CPU usage throughout the run, when potentially 5 of those minutes were spent spinning on the lock.
Simply swapping the "fast" spin lock for an OS mutex could make the program complete much more quickly.
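A minimal sketch of such a swap, keeping the lock()/unlock() interface of the spin_lock above (the name mutex_lock is just illustrative):

#include <mutex>

// Drop-in replacement for spin_lock, backed by an OS mutex.
// Under contention the calling thread blocks in the kernel
// instead of burning its time slice spinning.
struct mutex_lock {
    void lock(void) { mtx_.lock(); }
    void unlock(void) { mtx_.unlock(); }
private:
    std::mutex mtx_;
};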
Not on my machine
The code might run without issue on the machine where it was developed and tested, but once deployed to production, or to end users with different hardware and schedulers, chaos may ensue.
Exceptions
Spin locks can be useful in setups where you can pin threads to specific CPU cores and know that the thread holding the lock will not be preempted by another thread's execution.
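As an illustration, on Linux a std::thread can be pinned to a core with pthread_setaffinity_np. This is only a sketch with error handling omitted, and the helper name pin_to_core is made up for the example:

#include <pthread.h>
#include <thread>

// Pin a std::thread to a specific CPU core (Linux-specific).
void pin_to_core(std::thread& t, int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(t.native_handle(), sizeof(cpu_set_t), &set);
}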
The solution
The solution is to use OS synchronization primitives so the OS scheduler can observe the blocking dependency.
Going back to the example above where t0 holds the lock at the end of its time slice: the scheduler can now see that t1 is blocked by t0, so it won't even attempt to schedule t1 while t0 holds the lock.
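For instance, the earlier foo() rewritten against std::mutex might look like this sketch, using std::lock_guard so the mutex is also released if the protected work throws:

#include <cstdint>
#include <mutex>

std::mutex mtx;

void foo(void)
{
    for (int32_t i = 0; i < 1000; i++) {
        // Blocks in the kernel if another thread holds the mutex,
        // so the scheduler will not hand us time slices just to spin.
        std::lock_guard<std::mutex> guard(mtx);
        DoWorkUsingValuesProtectedByLock();
    }
}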
Have your cake and eat it too
Windows provides CRITICAL_SECTION as a locking primitive that can be initialized with InitializeCriticalSectionAndSpinCount, giving you a lock that spins briefly and then falls back to a wait the scheduler can observe.
This gives you the best of both worlds: a fast lock when contention is low, without the potential pathological busy spinning of a simple spin lock.
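A sketch of what that can look like (the function names and the spin count of 4000 are illustrative; tune the count for your workload):

#include <windows.h>

CRITICAL_SECTION cs;

void init(void)
{
    // Spin up to the given count before falling back to a
    // kernel wait that the scheduler can observe.
    InitializeCriticalSectionAndSpinCount(&cs, 4000);
}

void worker(void)
{
    EnterCriticalSection(&cs);
    DoWorkUsingValuesProtectedByLock();
    LeaveCriticalSection(&cs);
}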