I've added a yield() call after the open, hinting to the scheduler that the thread opening the gate may be rescheduled, allowing the other threads to run. On the system I'm testing on it doesn't seem to make a difference, but it does seem like the right thing to do, and it does not seem to hurt.
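For reference, spingate.h itself isn't shown here, but the shape of the gate I'm describing is roughly this: an atomic bool that wait() polls with an acquire load and open() clears with a release store, plus the new yield() hint. Treat this as a sketch of the idea rather than the exact header; the flag's sense and initial state here are assumptions.

#include <atomic>
#include <thread>

class SpinGate {
    std::atomic<bool> closed_;

public:
    SpinGate() : closed_(true) {}   // assume the gate starts closed

    void wait() {
        // Spin until the gate opens; the acquire pairs with the release in open().
        while (closed_.load(std::memory_order_acquire)) {
        }
    }

    void open() {
        closed_.store(false, std::memory_order_release);
        std::this_thread::yield();  // hint that the waiting threads may now run
    }
};

The timing test that exercises it: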
#include "spingate.h"
#include <vector>
#include <chrono>
#include <thread>
#include <iostream>
int main()
{
std::vector<std::thread> workers;
SpinGate gate;
using time_point = std::chrono::time_point<std::chrono::high_resolution_clock>;
time_point t1;
auto threadCount = std::thread::hardware_concurrency();
std::vector<time_point> times;
times.resize(threadCount);
for (size_t n = 0; n < threadCount; ++n) {
workers.emplace_back([&gate, t1, ×, n]{
gate.wait();
time_point t2 = std::chrono::high_resolution_clock::now();
times[n] = t2;
});
}
std::cout << "Open the gate in 1 second: " << std::endl;
using namespace std::chrono_literals;
std::this_thread::sleep_for(1s);
t1 = std::chrono::high_resolution_clock::now();
gate.open();
for (auto& thr : workers) {
thr.join();
}
int threadNum = 0;
for (auto& time: times) {
auto diff = std::chrono::duration_cast<std::chrono::nanoseconds>(time - t1);
std::cout << "Thread " << threadNum++ << " waited " << diff.count() << "ns\n";
}
}
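(Building the test needs the thread library linked in, and C++14 for the 1s literal; with gcc or clang that's something along the lines of g++ -std=c++14 -pthread plus whatever the source file is named.)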
I'd originally had the body of the threads just spitting out to std::cout that they were running, with the lack of output before the gate opened, plus the overlapping output afterwards, serving as evidence that the gate was working. That looked like:
for (std::size_t n = 0; n < std::thread::hardware_concurrency(); ++n) {
    workers.emplace_back([&gate, n]{
        gate.wait();
        std::cout << "Output from gated thread " << n << std::endl;
    });
}
The gate is captured by reference in the thread lambda, and the thread number by value; when run, overlapping gibberish is printed to the console as soon as open() is called.
But then I became curious about how long the spin actually lasted. Particularly since the guarantees for atomics with release-acquire semantics, or even sequentially consistent ones, are only that once a change is visible, the changes made before it are also visible. How quickly the change becomes visible, and what it costs to make the happened-before writes available, is really a function of the underlying hardware. I'd already observed better overlapping execution using the gate, as opposed to just letting the threads run, so for my initial purpose of making contention more likely I was satisfied. Visibility, on my lightly loaded system, seems to be in the range of a few hundred to a couple of thousand nanoseconds, which is fairly good.
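To spell out what that guarantee means, here is a tiny publish/observe sketch, separate from the spingate code: the writer stores an ordinary payload and then does a release store to a flag, and any reader whose acquire load observes the flag is guaranteed to also observe the payload.

#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                  // ordinary, non-atomic data
std::atomic<bool> ready{false};

void writer() {
    payload = 42;                 // written before the release store
    ready.store(true, std::memory_order_release);
}

void reader() {
    while (!ready.load(std::memory_order_acquire)) {
        // spin until the flag is visible
    }
    assert(payload == 42);        // the earlier write is guaranteed to be visible too
}

int main() {
    std::thread w(writer);
    std::thread r(reader);
    w.join();
    r.join();
}

How long that spin loop runs before the store shows up is the part the standard leaves to the hardware, and it's the part the timing test is poking at.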
Checking how long it took to start let me do two things. First, play with the new-ish chrono library. Second, check that the release-acquire synchronization is working the way I expect. The lambdas that the threads run capture the start time by reference. The start time is set just before the gate is opened, and well after the threads have started running. The spin gate's synchronization ensures that if the state change caused by open() is visible, the setting of the start time is also visible.
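That is also why it would be safe to fold the subtraction into the gated threads themselves, reading t1 through the reference capture. A variation on the loop above, as a sketch rather than the version I actually ran, would be:

std::vector<std::chrono::nanoseconds> diffs(threadCount);
for (size_t n = 0; n < threadCount; ++n) {
    workers.emplace_back([&gate, &t1, &diffs, n]{
        gate.wait();
        // The acquire in wait() synchronizes with the release in open(),
        // so the write to t1 made just before open() is visible here.
        auto t2 = std::chrono::high_resolution_clock::now();
        diffs[n] = std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1);
    });
}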
Here is one set of results from running the spingate test:
Open the gate in 1 second:
Thread 0 waited 821ns
Thread 1 waited 14490ns
Thread 2 waited 521ns
Thread 3 waited 817ns