I've added a yield() call after the open, hinting to the scheduler that the thread opening the gate may be rescheduled, allowing the other threads to run. On the system I'm testing on it doesn't seem to make a difference, but it does seem like the right thing to do, and it does not seem to hurt.
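For reference, spingate.h itself isn't shown here, but the shape of the gate I'm describing is roughly this: an atomic bool that wait() polls with an acquire load and open() clears with a release store, plus the new yield() hint. Treat this as a sketch of the idea rather than the exact header; the flag's sense and initial state here are assumptions.

#include <atomic>
#include <thread>

class SpinGate {
    std::atomic<bool> closed_;

public:
    SpinGate() : closed_(true) {}   // assume the gate starts closed

    void wait() {
        // Spin until the gate opens; the acquire pairs with the release in open().
        while (closed_.load(std::memory_order_acquire)) {
        }
    }

    void open() {
        closed_.store(false, std::memory_order_release);
        std::this_thread::yield();  // hint that the waiting threads may now run
    }
};

The timing test that exercises it: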
#include "spingate.h"
#include <vector>
#include <chrono>
#include <thread>
#include <iostream>
int main()
{
std::vector<std::thread> workers;
SpinGate gate;
using time_point = std::chrono::time_point<std::chrono::high_resolution_clock>;
time_point t1;
auto threadCount = std::thread::hardware_concurrency();
std::vector<time_point> times;
times.resize(threadCount);
for (size_t n = 0; n < threadCount; ++n) {
workers.emplace_back([&gate, t1, ×, n]{
gate.wait();
time_point t2 = std::chrono::high_resolution_clock::now();
times[n] = t2;
});
}
std::cout << "Open the gate in 1 second: " << std::endl;
using namespace std::chrono_literals;
std::this_thread::sleep_for(1s);
t1 = std::chrono::high_resolution_clock::now();
gate.open();
for (auto& thr : workers) {
thr.join();
}
int threadNum = 0;
for (auto& time: times) {
auto diff = std::chrono::duration_cast<std::chrono::nanoseconds>(time - t1);
std::cout << "Thread " << threadNum++ << " waited " << diff.count() << "ns\n";
}
}
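(Building the test needs the thread library linked in, and C++14 for the 1s literal; with gcc or clang that's something along the lines of g++ -std=c++14 -pthread plus whatever the source file is named.)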
I'd originally had the body of the threads just spitting out to std::cout that they were running, with the lack of output before the gate opened, plus the overlapping output afterwards, serving as evidence that the gate was working. That looked like:
for (std::size_t n = 0; n < std::thread::hardware_concurrency(); ++n) {
    workers.emplace_back([&gate, n]{
        gate.wait();
        std::cout << "Output from gated thread " << n << std::endl;
    });
}
The gate is captured by reference in the thread lambda, and the thread number by value; when run, overlapping gibberish is printed to the console as soon as open() is called.
But then I became curious about how long the spin actually lasted. Particularly since the guarantees for atomics with release-acquire semantics, or even sequentially consistent ones, are only that once a change is visible, the changes made before it are also visible. How quickly the change becomes visible, and what it costs to make the happened-before writes available, is really a function of the underlying hardware. I'd already observed better overlapping execution using the gate, as opposed to just letting the threads run, so for my initial purpose of making contention more likely I was satisfied. Visibility, on my lightly loaded system, seems to be in the range of a few hundred to a couple of thousand nanoseconds, which is fairly good.
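To spell out what that guarantee means, here is a tiny publish/observe sketch, separate from the spingate code: the writer stores an ordinary payload and then does a release store to a flag, and any reader whose acquire load observes the flag is guaranteed to also observe the payload.

#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                  // ordinary, non-atomic data
std::atomic<bool> ready{false};

void writer() {
    payload = 42;                 // written before the release store
    ready.store(true, std::memory_order_release);
}

void reader() {
    while (!ready.load(std::memory_order_acquire)) {
        // spin until the flag is visible
    }
    assert(payload == 42);        // the earlier write is guaranteed to be visible too
}

int main() {
    std::thread w(writer);
    std::thread r(reader);
    w.join();
    r.join();
}

How long that spin loop runs before the store shows up is the part the standard leaves to the hardware, and it's the part the timing test is poking at.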
Checking how long it took to start let me do two things. First, play with the new-ish chrono library. Second, check that the release-acquire synchronization is working the way I expect. The lambdas that the threads run capture the start time by reference. The start time is set just before the gate is opened, and well after the threads have started running. The spin gate's synchronization ensures that if the state change caused by open() is visible, the setting of the start time is also visible.
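That is also why it would be safe to fold the subtraction into the gated threads themselves, reading t1 through the reference capture. A variation on the loop above, as a sketch rather than the version I actually ran, would be:

std::vector<std::chrono::nanoseconds> diffs(threadCount);
for (size_t n = 0; n < threadCount; ++n) {
    workers.emplace_back([&gate, &t1, &diffs, n]{
        gate.wait();
        // The acquire in wait() synchronizes with the release in open(),
        // so the write to t1 made just before open() is visible here.
        auto t2 = std::chrono::high_resolution_clock::now();
        diffs[n] = std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1);
    });
}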
Here is one set of results from running the spingate test:
Open the gate in 1 second:
Thread 0 waited 821ns
Thread 1 waited 14490ns
Thread 2 waited 521ns
Thread 3 waited 817ns