Lock-Free Work-Stealing Queue
Lock-free data structures are very handy for concurrent programs, especially as the number of cores increases. The benefit of such structures is that they never acquire a lock. Instead they do a form of spin loop: a local copy of the data is taken, modified, and then exchanged or updated by an atomic processor operation (like "compare and swap" or "fetch and add") only when the global value and the local copy still match; otherwise the whole procedure is repeated. Almost every lock-free data structure follows this pattern, so you may ask: where is the benefit in that? Well, the chances are that the race never triggers a retry and everything finishes in a single loop cycle, making the operation very fast, instead of always acquiring and holding a lock that other threads would just queue up behind.
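To illustrate the pattern, here is a minimal sketch of a lock-free counter (all names are my own, not from the original post). The original is .NET, where this would use `Interlocked.CompareExchange`; the same shape is shown here with Java's `java.util.concurrent.atomic` equivalent: take a local copy, compute on it, and publish with a compare-and-swap, retrying only if another thread got in first.

```java
import java.util.concurrent.atomic.AtomicLong;

public class CasLoop {
    final AtomicLong value = new AtomicLong();

    // Lock-free add: read a local copy, compute on it, publish via CAS.
    public long addAndGetLockFree(long delta) {
        while (true) {
            long local = value.get();        // local copy of the global value
            long updated = local + delta;    // compute on the copy
            // Publish only if nobody changed the value since we read it;
            // otherwise loop and retry with a fresh copy (the "spin").
            if (value.compareAndSet(local, updated)) {
                return updated;
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        CasLoop c = new CasLoop();
        Thread[] threads = new Thread[4];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < 10_000; j++) c.addAndGetLockFree(1);
            });
            threads[i].start();
        }
        for (Thread t : threads) t.join();
        System.out.println(c.value.get()); // prints 40000: no lost updates
    }
}
```

In the common uncontended case the CAS succeeds on the first try, so the loop body runs exactly once.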
These data structures are very hard to implement correctly, as we need to reason about other threads, where they are and what they are doing. To make matters worse, such structures can suffer from the ABA problem: a memory location is read twice, but between the reads some other thread introduces side effects by changing the values at that location (reusing it) and thus corrupting the entire state, while the suspended thread still holds the old memory address and assumes the data is unchanged. Luckily for us, in .NET the ABA problem does not exist in this form, as a memory location can only be reused once there are no references to it.
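A small illustration of why this holds in a garbage-collected runtime (a sketch of my own, shown in Java, where the same guarantee applies): CAS on a reference compares identities, not values, and as long as we hold a reference to the old node its address cannot be recycled into a new node, so a swapped-out-and-back value cannot fool the CAS.

```java
import java.util.concurrent.atomic.AtomicReference;

public class AbaDemo {
    static class Node {
        final int value;
        Node(int v) { value = v; }
    }

    public static void main(String[] args) {
        Node a1 = new Node(1);
        Node a2 = new Node(1);                 // equal value, distinct object
        AtomicReference<Node> top = new AtomicReference<>(a1);

        top.set(a2);                           // another thread swapped the node

        // CAS compares references, not values: a1 != a2, so this fails.
        // While we still hold a1, the GC cannot reuse its address, so a
        // "different node at the same address" can never appear.
        boolean swapped = top.compareAndSet(a1, new Node(2));
        System.out.println(swapped);           // prints false
    }
}
```

In languages with manual memory management, `a1`'s freed memory could be reused for a new node at the same address, and the CAS would then wrongly succeed; that is the ABA problem the paragraph above describes.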
Work Stealing

So where does work stealing come into play here? A work-stealing queue is simply a queue where the main thread can dequeue from the head while stealing threads dequeue elements from the tail. In a standard locking queue this already helps a lot: the stealing thread does not participate in the lock when dequeueing if the head and tail are far apart, and a lock is only needed for extra safety when they get close. So almost by design this queue is supposed to be nearly lock free. But what do we gain when our queue uses no locks at all and just does a form of spin loop? Much the same benefit: we do not participate in head dequeueing, making it a lot less likely that our operation will last more than one loop cycle. There is one difference, however: we may slow down enqueueing, as we will be racing with the enqueueing threads, but the enqueueing thread will not slow down our own stealing operation.
So let's go into the code. Again, most of the explanation of how it works is done through comments in the code, as I just feel that's the better way :)
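The post's original C# implementation is not reproduced here, so as an illustrative stand-in (not the author's code, and with all names my own) here is a Java sketch of the closely related Chase-Lev work-stealing deque. Note the layout differs slightly from the description above: the owner pushes and pops at one end ("bottom") while thieves steal from the other ("top"), so steals race only with other thieves, and with the owner only when a single element remains. This sketch uses a fixed-size power-of-two ring buffer and does not grow.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReferenceArray;

// Chase-Lev-style work-stealing deque: owner works at the bottom,
// thieves CAS the top. Fixed capacity; a real implementation would grow.
public class WorkStealingDeque<T> {
    private final AtomicReferenceArray<T> buffer;
    private final int mask;
    private final AtomicLong top = new AtomicLong(); // steal end
    private volatile long bottom = 0;                // owner end

    public WorkStealingDeque(int capacityPowerOfTwo) {
        buffer = new AtomicReferenceArray<>(capacityPowerOfTwo);
        mask = capacityPowerOfTwo - 1;
    }

    // Owner only: enqueue at the bottom. Single writer, so no CAS needed.
    public void push(T item) {
        long b = bottom;
        if (b - top.get() >= buffer.length())
            throw new IllegalStateException("deque full (sketch does not grow)");
        buffer.set((int) (b & mask), item);
        bottom = b + 1; // volatile write publishes the item to thieves
    }

    // Owner only: dequeue from the bottom (most recently pushed first).
    public T pop() {
        long b = bottom - 1;
        bottom = b;                                 // tentatively claim the slot
        long t = top.get();
        if (b < t) { bottom = t; return null; }     // deque was already empty
        T item = buffer.get((int) (b & mask));
        if (b > t) return item;                     // more than one element left
        // Exactly one element: settle the race with thieves via CAS on top.
        if (!top.compareAndSet(t, t + 1)) item = null; // a thief won the race
        bottom = t + 1;
        return item;
    }

    // Any thread: steal the oldest element from the top.
    public T steal() {
        while (true) {
            long t = top.get();
            long b = bottom;
            if (t >= b) return null;                // empty
            T item = buffer.get((int) (t & mask));
            if (top.compareAndSet(t, t + 1)) return item;
            // Lost the race with another thief (or the owner); retry.
        }
    }
}
```

Single-threaded sanity check of the orientation: after `push(1); push(2); push(3)`, a `steal()` returns 1 (the oldest), while the owner's `pop()` returns 3, then 2. The happy path for both the owner and a thief is a single loop iteration; spinning happens only under an actual race.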
Performance

Let's do some performance tests for the solution; please note that these tests were conducted on a four-physical-core i7 system. The tests only do a basic enqueue/dequeue run, as it is impossible to compare work-stealing performance in a general way; that should be tested individually depending on the use case of such a structure. The test code simply inserts into and removes from the queue from multiple threads at the same time; the initial count gives us a correctness indicator, as it needs to be the same as when we started our program.
Conclusion

Now, if those results seem somewhat strange to you, let me explain. The performance benefits will only be visible on a system with a high core/processor count, as only then do processors need to synchronize while waiting on a lock acquired by a thread on another processor; on low-core systems you will be better off with a standard locking (work-stealing) queue.
Update On Performance

As correctly pointed out by Sławomir in the comments section, the performance test is actually wrong, for a couple of reasons. The first is that the ThreadPool uses work-stealing algorithms itself, so it will run on NumOfCores * 4 threads instead of the provided number (this is especially true if the tests were performed on .NET 4.0); besides, the work-stealing algorithm will obscure the results as threads steal queued tasks. The correct solution to the problem is to use plain simple threads, so the test code is changed to reflect that:
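The updated C# test code is not reproduced here; the following is a hedged Java sketch of the same idea (my own stand-in, with `ConcurrentLinkedQueue` standing in for the queue under test): raw `Thread` objects instead of a thread pool, so no hidden work stealing or Cores * 4 scheduling skews the run, plus the correctness check on the final count.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicLong;

// Plain-thread enqueue/dequeue benchmark harness (illustrative sketch).
public class PlainThreadTest {
    static long run(int enqThreads, int deqThreads, int itemsPerThread)
            throws InterruptedException {
        final long total = (long) enqThreads * itemsPerThread;
        final ConcurrentLinkedQueue<Integer> queue = new ConcurrentLinkedQueue<>();
        final AtomicLong removed = new AtomicLong();

        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < enqThreads; i++)
            threads.add(new Thread(() -> {
                for (int j = 0; j < itemsPerThread; j++) queue.offer(j);
            }));
        for (int i = 0; i < deqThreads; i++)
            threads.add(new Thread(() -> {
                // Spin until every enqueued item has been removed by someone.
                while (removed.get() < total)
                    if (queue.poll() != null) removed.incrementAndGet();
            }));

        long start = System.nanoTime();
        for (Thread t : threads) t.start();
        for (Thread t : threads) t.join();
        System.out.println("elapsed ms: " + (System.nanoTime() - start) / 1_000_000);

        // Correctness indicator: everything enqueued must have been dequeued.
        return removed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("removed: " + run(10, 10, 10_000)); // removed: 100000
    }
}
```

Each dequeue thread only counts its own successful `poll()` calls, so the final sum equals the enqueued total exactly when no items were lost or double-removed.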
For 1k and 10k items:
For 10k items and varying number of dequeue threads:
Threads | 10d 10e | 20d 10e | 40d 10e | 80d 10e | 100d 10e | 200d 10e | 400d 10e | 800d 10e | 1000d 10e
For 10k items and varying number of enqueue threads:
Threads | 10d 10e | 10d 20e | 10d 40e | 10d 80e | 10d 100e | 10d 200e | 10d 400e | 10d 800e | 10d 1000e
Now we do see that the lock-free queue generally outperforms the locking version of the queue, but again, on a low-core machine the gain is very small.