Sunday, February 2, 2014

On Deleting Comments

This time we are going into social territory, with a rant mixed in for good measure.

I usually comment on someone else's blog when the article/post is missing details that could be pointed out, or when the facts, or part of them, are dead wrong. I keep a friendly tone, so it's not to be taken as an attack or anything, and while doing so I also try to highlight the points that are correct or that I liked. My comments end up being quite big, so I always copy them to Notepad or another tool in case the session expires or something bad happens; once the comment gets posted I remove the file (as it's no longer needed).

And then after some time I see that the comment is gone …

People who run tech blogs and try to sound like experts delete comments that are hard for them, or that expose problems that would undermine their expert status. This sort of thing has happened to me a few times, and the statistics suggest that the problem is growing: last year I could count 3-4 instances like that, last month alone I got 2. One of my friends actually had every comment deleted from a blog, on every post, and that's just shocking.

I'm very sorry but deleting comments that are not considered spam is just fucking disgusting.

If, by deleting comments that are not easy to tackle, you want to retain your expert status, hide some stupid posts, or rewrite reality, then you, sir, are a fool! And sooner or later you will be exposed.

There is another problem here: if people put real work into a comment, trying to reach out to someone, and that comment gets deleted, it's a total discouragement, making them never want to comment again. That would be a total waste, as those comments are often meaningful, correct, or expand the article itself, making it more useful.

Rewriting reality is another interesting problem. Someone starts a blog as a beginner; later he's considered an expert, so he goes and edits the old posts, rewriting them without any notice to make them look much better than they actually were, or deleting them entirely. I have written some really stupid shit here and you can still look it up, and I will never ever remove or alter it. First of all, my past posts carry some useful information and show where I started versus where I am now. Second, I encourage everyone to comment, correct wrong information, and fill the gaps. In order to get good and write meaningful stuff, one needs to start somewhere and possibly write crap along the way (though this is not always the case).

From now on I'm going to not only save my comments but also take screenshots after posting them, so that should they disappear, I can post about it here.

If you've had comments deleted, or seen someone rewriting reality, share your stories here.

Saturday, January 25, 2014

String Hazards - Custom Intern Pool

This article is a continuation of this post; you'll be missing much of the context if you haven't read it, so I advise you to do so (I'll wait, go and read it :P).

As I promised, we are going to fix the InternPool and try to roll our own. There are multiple angles from which we can approach this problem: on-heap or off-heap allocation, unsafe code, or our own more advanced string type. For this article we will stick with the simplest solution (and move to the more advanced ones as we go), as most people don't want to use custom advanced strings; that's kind of a last resort, or something a system needs to adopt from the get-go.

So the first thing: if we want to use existing strings (and avoid unsafe code or pointer stuff), we need to find some way to pool them. Strings are immutable, and every single operation returns a different reference, which would seem to prevent any form of string pool. This is actually quite simple to get around, since string arguments are still copied by reference (they are classes, after all), so in order to pool a string one just needs to take its reference and store it.
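A quick sketch of that observation (variable names are mine, just for illustration): assignment copies only the reference, while a runtime string operation hands back a brand-new reference even when the characters are identical:

```csharp
using System;

class ReferenceDemo
{
    static void Main()
    {
        string a = "pooled";
        string b = a;                       // assignment copies the reference only
        Console.WriteLine(object.ReferenceEquals(a, b));    // True

        string half = "pool";
        string c = half + "ed";             // runtime concat: same characters, new object
        Console.WriteLine(a == c);                          // True  (value equality)
        Console.WriteLine(object.ReferenceEquals(a, c));    // False (different reference)
    }
}
```

The pool below exploits exactly this: it keeps the first reference it sees for a given value and hands it back for every equal string that comes later.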

So the simplest intern pool could look like this:

    public static class InternPool
    {
        // Hashtable on purpose: the WinDbg session below inspects its
        // bucket array on the LOH.
        private static readonly Hashtable pool = new Hashtable();

        public static string Intern(string str)
        {
            return (string)InnerIntern(str);
        }

        private static object InnerIntern(string str)
        {
            // an equal string was interned before: return the stored reference,
            // the caller's instance becomes garbage (collected in gen0)
            object interned = pool[str];
            if (interned != null)
                return interned;

            // first time we see this value: store and return the reference
            pool.Add(str, str);
            return str;
        }
    }


Now this will work as intended, meaning that if we pass a string that was already interned, we will receive the previously interned reference and the current one will be dropped (it will be collected in gen0; this is neither bad nor good, we just have to be aware of it).
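To make that behavior concrete, here is a small self-contained demo (the pool class is repeated so the sample compiles on its own; the harness is mine):

```csharp
using System;
using System.Collections;

static class InternPool
{
    private static readonly Hashtable pool = new Hashtable();

    public static string Intern(string str)
    {
        // hand back the stored reference for an equal value, if any
        object interned = pool[str];
        if (interned != null)
            return (string)interned;
        pool.Add(str, str);
        return str;
    }
}

class PoolDemo
{
    static void Main()
    {
        // two distinct string instances with equal contents
        string first  = new string('x', 5);
        string second = new string('x', 5);
        Console.WriteLine(object.ReferenceEquals(first, second));  // False

        // after pooling both, only the first reference survives;
        // the second instance becomes unreachable and dies in gen0
        string a = InternPool.Intern(first);
        string b = InternPool.Intern(second);
        Console.WriteLine(object.ReferenceEquals(a, b));       // True
        Console.WriteLine(object.ReferenceEquals(a, first));   // True
    }
}
```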

Now, to see whether this InternPool fragments the LOH, we need to run the following code and fire up WinDbg to see what's happening in the system.

            for (int i = 0; i < 100000; i++)
            {
                var x = InternPool.Intern("THIS STRING IS POOLED!" + i);
                Console.WriteLine(x);
            }
            Console.ReadKey();
Like in the previous blog post, we will stop in the middle of the loop (i = 50K).

First thing we need to do is get the location of the LOH:

            0:006> !eeheap -gc

            Number of GC Heaps: 1
            generation 0 starts at 0x00000000035adb38
            generation 1 starts at 0x00000000034a4490
            generation 2 starts at 0x00000000032b1000
            ephemeral segment allocation context: none
                    segment            begin        allocated            size
            00000000032b0000 00000000032b1000  0000000003885fe8 0x00000000005d4fe8(6115304)
            Large object heap starts at 0x00000000132b1000
                    segment            begin        allocated            size
            00000000132b0000 00000000132b1000  000000001369bf78 0x00000000003eaf78(4108152)

Now we need to dump its contents:

            0:006> !dumpheap 0x00000000132b1000
                    Address               MT    Size
            00000000132b1000 0000000002ed8200       24 Free
            00000000132b1018 000007fd4bd05a10    8192   
            00000000132b3018 0000000002ed8200       24 Free
            00000000132b3030 000007fd4bd05a10    1056   
            00000000132b3450 0000000002ed8200    7136 Free
            00000000132b5030 000007fd4bd05a10    8192   
            00000000132b7030 0000000002ed8200  1910952 Free
            00000000134898d8 000007fd4bd1f740  2172576   
            total 8 objects
            Statistics:
                          MT    Count    TotalSize Class Name
            000007fd4bd05a10        3        17440 System.Object[]
            0000000002ed8200        4      1918136      Free
            000007fd4bd1f740        1      2172576 System.Collections.Hashtable+bucket[]
            Total 8 objects

We can see that there aren't any small objects flying around this time, and there is one huge object (2,172,576 bytes): the hashtable bucket array. This is understandable, since we are putting the references to strings there, and the built-in hashtable can expand quite heavily, not to mention rehashing etc. A badly utilized hashtable can do almost as much damage as the fragmenting string InternPool, but we should still be OK if the memory is not pinned: once the hashtable decides to expand, the only reference to the old bucket array will be collected over time (this is still a big no-no in a high frequency environment, since we can allocate faster than we free).

Let's inspect the roots of that object:

            0:006> !gcroot 00000000134898d8
            Note: Roots found on stacks may be false positives. Run "!help gcroot" for
            more info.
            Scan Thread 0 OSTHread 3488
            RSP:103e938:Root:00000000032d6720(System.Collections.Hashtable)->
            00000000134898d8(System.Collections.Hashtable+bucket[])
            RSP:103ea38:Root:00000000032d6720(System.Collections.Hashtable)->
            00000000134898d8(System.Collections.Hashtable+bucket[])
            RSP:103ea58:Root:00000000032d6720(System.Collections.Hashtable)->
            00000000134898d8(System.Collections.Hashtable+bucket[])
            Scan Thread 2 OSTHread 6f6c
            DOMAIN(00000000010BE960):HANDLE(Pinned):217f8:Root:00000000132b1018(System.Object[])->
            00000000032d6720(System.Collections.Hashtable)


We see that there are three roots to our hashtable bucket array, and they are not pinned, which is good; the only thing that's pinned is the hashtable reference itself. Now why is that? Since the hashtable is a static reference, the GC will pin the reference in memory no matter what.

Finally let's check how many buckets are in the hashtable (just for fun):

            0:006> !dumpobj 00000000134898d8
            Name: System.Collections.Hashtable+bucket[]
            MethodTable: 000007fd4bd1f740
            EEClass: 000007fd4b9c14c8
            Size: 2172576(0x2126a0) bytes
            Array: Rank 1, Number of elements 90523, Type VALUETYPE
            Element Type: System.Collections.Hashtable+bucket
            Fields:
            None

For 50K strings we occupy 90K slots; that's a very poor distribution, and we could most certainly improve this value.

The above code could use a lot of improvements, like replacing the Hashtable with a custom double hashing data structure to reduce the number of buckets and prevent expansion and rehashing. The fundamental question here is: will this form of string pooling help reduce string allocations? No, it might help a bit, just like the original intern pool helped a bit if you were smart about using it (excluding the nasty fragmentation issues). To gain something from a string pool one would need to wrap it around every single string operation and give the ability to pick whether we want to pool or not. Furthermore, some string operations would need to be completely redone to be pointer/index based, and the pool itself would need to store references as pointer addresses in a buffer so we could benefit from aligning them; the next article will explore just that.
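As a rough sketch of the "custom double hashing" idea (the class name, capacity, and hash mixing are my own choices, not a drop-in replacement): a fixed-size, prime-length table probed with a second hash avoids Hashtable's expansion and rehashing entirely, since the slot array is allocated exactly once.

```csharp
using System;

// Minimal open-addressing intern pool using double hashing.
// The capacity is fixed (a prime), so the slot array is allocated
// once up front and never rehashed or expanded.
sealed class DoubleHashPool
{
    private readonly string[] slots;
    private readonly int capacity;

    public DoubleHashPool(int capacity = 131071)   // a prime, picked arbitrarily
    {
        this.capacity = capacity;
        slots = new string[capacity];
    }

    public string Intern(string str)
    {
        int hash = str.GetHashCode() & 0x7FFFFFFF;
        int h1 = hash % capacity;
        // the step must be in [1, capacity-1]; with a prime capacity
        // any such step visits every slot exactly once
        int h2 = 1 + hash % (capacity - 1);

        for (int i = 0, probe = h1; i < capacity; i++, probe = (probe + h2) % capacity)
        {
            string existing = slots[probe];
            if (existing == null) { slots[probe] = str; return str; }
            if (existing == str) return existing;   // value equality: reuse the stored reference
        }
        throw new InvalidOperationException("pool is full");
    }
}
```

With this layout the slot count is fixed up front, so the 50K-strings-in-90K-buckets situation from the dump above cannot occur, at the price of a hard capacity limit.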

But the takeaway is this: if you've been using the InternPool for your string ops with success, but suffered from fragmentation issues, then this is a good replacement (consider changing the hashing structure, though).

Tuesday, January 21, 2014

String Hazards

This post is the result of this blog post and the great confusion about strings in its comment section.

Strings are a wonderful thing in .NET: very easy to use, expressive, with tons of operation methods one can use to handle them. But with such great flexibility comes great responsibility, and a few hazards. String allocation is expensive and can lead to great overhead and increased memory pressure, forcing many garbage collections (usually in gen0). This isn't a huge deal if we are creating standard applications without big data or high allocation frequency, but it starts to be a problem once the application grows, or if the application is high frequency from the start.

Doing something about it usually means throwing most of the benefits of strings down the drain and redoing all of the standard string operations to reduce allocations. That has limited uses, however, so another good idea is to create a string pool that reuses memory by having a large contiguous buffer (either on-heap or off-heap if you like hardcore stuff). Actually, there already is a string pool in .NET: it pools strings using a copy-on-write technique (meaning that strings different from the one in the intern pool get a new reference). To make use of pool allocation we need to call string.Intern, or just use literal strings (literal strings are strings hard coded in the program, like var s = "MYSTRING").
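A small check of the literal rule (string.IsInterned returns null when a string's value is not in the pool, or the pooled reference when it is):

```csharp
using System;

class LiteralDemo
{
    static void Main()
    {
        // literals are interned automatically
        string literal = "MYSTRING";
        Console.WriteLine(object.ReferenceEquals(literal, string.IsInterned(literal)));  // True

        // a runtime-built string with the same contents is a separate,
        // non-interned instance...
        string built = new string('M', 1) + "YSTRING";
        Console.WriteLine(object.ReferenceEquals(built, string.IsInterned(built)));      // False

        // ...but string.Intern maps it onto the pooled copy
        Console.WriteLine(object.ReferenceEquals(literal, string.Intern(built)));        // True
    }
}
```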

The intern pool can therefore cause potential memory leaks, which are not leaks in reality: the intern pool, once allocated, is never released and can only expand, and this behavior is expected. The pool is allocated on the Large Object Heap and all of its records are pinned in memory, which is understandable. But the pool has one serious design issue, and I'm not sure if this is a bug or just a bad design choice. Since the pool can expand when needed, and it's always pinned, it can cause heap fragmentation: other objects can be allocated and then freed, but the next block of the intern pool cannot, so it will fragment the Large Object Heap, since this heap cannot be compacted (in 4.5.1 it can). The very strange thing here is that the free blocks between intern blocks are too small to be considered LOH-allocated objects, so this might be a bug rather than a design choice; but even so, expanding the pool on the LOH as pinned memory would still fragment the heap.

Intern Pool String

 

To prove what I'm saying let's take an example:
 
            for (int i = 0; i < 100000; i++)
            {
                var x = string.Intern("THIS STRING IS POOLED!" + i);
                Console.WriteLine(x);
            }
            Console.ReadKey();

This simple code exposes the problem with the intern pool. Let's fire up WinDbg, attach to the sample process, and break when 'i' is around 50K. Let's now see where the LOH starts.


Now that we have the address of the LOH, we can dump its contents.


(This heap is huge so the pic is just showing a small fragment of it)

As we can see, this smallish program generated lots of LOH allocations; most of the allocated and free blocks are constant in size, which is likely due to the constant size of the string.

Now let's inspect one of the allocated blocks.


We can see here that this is an object array with no fields defined, so it looks like a reference or handle to an actual string or set of strings. To inspect it further we would need to dump asm instructions, but analyzing those is beyond the scope of this article.

In order to prove that this memory is indeed pinned, we need to check its GCHandle.


Indeed it is pinned, so it's not movable or freeable in any way (unless we acquired the handle and freed it ourselves, but that looks more like a hacking attempt than a solution).

Non Intern Pool String

 

Now, just to be sure, let's modify the code a bit so it will not use the intern table, and then check what the LOH looks like (I'm going to skip some instructions and go straight to the LOH).

 
            for (int i = 0; i < 100000; i++)
            {
                var x = "THIS STRING IS POOLED!" + i;
                //var x = string.Intern("THIS STRING IS POOLED!" + i);
                Console.WriteLine(x);
            }
            Console.ReadKey();

Earlier I said that string literals are automatically interned, and since we are using a literal here, it should go to the intern table, right? Wrong: since we are concatenating a literal with a non-literal string, we get a third, non-literal string, so there will be no interning.
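This is easy to verify without a debugger (string.IsInterned returns null when the value is absent from the pool):

```csharp
using System;

class ConcatDemo
{
    static void Main()
    {
        int i = 42;
        string x = "THIS STRING IS POOLED!" + i;   // literal + non-literal: runtime concat

        // the concatenation result was never interned
        Console.WriteLine(string.IsInterned(x) == null);                          // True

        // only the literal half lives in the intern table
        Console.WriteLine(string.IsInterned("THIS STRING IS POOLED!") != null);   // True
    }
}
```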



Now we can see that there are almost no objects on the LOH, which was expected. There are still some small allocated blocks, but since interning isn't disabled, these just represent the preallocated first block of the intern table and some other internal CLR structures.

Conclusion

 

So that confirms that by using the intern table we can very quickly fragment the LOH and get in trouble. What can be done, then? The best possible solution would be to roll your own intern pool: do something similar, but the memory block would need to be contiguous, and should we expand into another block, we would need to ensure that this memory is not pinned and can be freed by the GC. Another solution would be to skip the managed heap entirely and allocate memory on the OS heap, or just request a block of virtual memory (using VirtualAllocEx), but that would be unsafe code all the way and it's not easy, so for starters I would stick to the managed heap.

In future articles I will propose a solution to the string intern pool problem and create a custom pool that's preallocated on the LOH and does not fragment memory, and I will explore the possibilities and consequences of such a pool (flexibility suffers a bit).

There are many more string hazards regarding costly allocations and non-optimal design choices for high frequency uses; hell, even the Intern operation has its problems (for example, object.ReferenceEquals("str" + 1, "str" + 1) will return false), but that again is outside the scope of this article, and we will leave it for the future.

Sunday, December 29, 2013

Hello SCPM

First of all I would like to say that this blog is not dead; I'm planning to write here more often and to publish some of the things that I wrote a while ago but never found the time to finish and publish.

Now onwards with the current topic:

SCPM

All of my previous posts and concurrency interests have led to the creation of a concurrency framework that can manage work and create fibers if needed, where a single unit of work is represented by the notion of an immutable computation (that's the end goal). The framework has very few features so far, as its core needs to be perfect in terms of performance and scalability; only then can we start adding features.

How this is different from TPL ?



Well, it's very similar, as the core idea is the same, but SCPM (yeah, that's the name of the framework) will be more flexible in terms of what it can do and how. For example, right now we can schedule computations on fibers (currently I don't like how it's done), but the general idea is that the system, using the various metrics we have right now, will decide how to schedule computations.

Performance-wise, SCPM works a bit faster than TPL when scheduling against a thread pool, but performance measurements will be left for a different post.

The project can be found here.

The roadmap is something along those lines:

1. Make sure that everything in the core is bug free, and blazing fast.
2. Implement continuations and add support for Long Running Computations.
3. Computation chains, and auto decision making (Thread, Fiber, Long Running).

Wednesday, August 22, 2012

Lock Free And SpinWait MSDN Example

If you have read about the new features in 4.0, then you have probably stumbled upon the SpinWait structure, an MSDN article, and the example code it provides. Just for the sake of clarity, and for history should the article change in the future, I present you the code:

public class LockFreeStack<T>
{
    private volatile Node m_head;

    private class Node { public Node Next; public T Value; }

    public void Push(T item)
    {
        var spin = new SpinWait();
        Node node = new Node { Value = item }, head;
        while (true)
        {
            head = m_head;
            node.Next = head;
            if (Interlocked.CompareExchange(ref m_head, node, head) == head) break;
            spin.SpinOnce();
        }
    }

    public bool TryPop(out T result)
    {
        result = default(T);
        var spin = new SpinWait();

        Node head;
        while (true)
        {
            head = m_head;
            if (head == null) return false;
            if (Interlocked.CompareExchange(ref m_head, head.Next, head) == head)
            {
                result = head.Value;
                return true;
            }
            spin.SpinOnce();
        }
    }
}

The example presents a lock free stack that uses the new SpinWait to issue a SpinOnce() operation after a failed CAS (Interlocked.CompareExchange). This is supposedly done to improve the performance of the code by spin waiting, but in my honest opinion there are a couple of problems with this solution. In order to see them, first we need to see how SpinWait actually works.
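Before digging into SpinWait, here is the stack in action under contention (the class is the MSDN code from above, repeated verbatim so the sample compiles on its own; the multi-threaded harness is mine):

```csharp
using System;
using System.Threading;

public class LockFreeStack<T>
{
    private volatile Node m_head;
    private class Node { public Node Next; public T Value; }

    public void Push(T item)
    {
        var spin = new SpinWait();
        Node node = new Node { Value = item }, head;
        while (true)
        {
            head = m_head;
            node.Next = head;
            if (Interlocked.CompareExchange(ref m_head, node, head) == head) break;
            spin.SpinOnce();
        }
    }

    public bool TryPop(out T result)
    {
        result = default(T);
        var spin = new SpinWait();
        Node head;
        while (true)
        {
            head = m_head;
            if (head == null) return false;
            if (Interlocked.CompareExchange(ref m_head, head.Next, head) == head)
            {
                result = head.Value;
                return true;
            }
            spin.SpinOnce();
        }
    }
}

class StackDemo
{
    static void Main()
    {
        var stack = new LockFreeStack<int>();

        // 4 threads x 10000 pushes, all racing on the same head
        var threads = new Thread[4];
        for (int t = 0; t < threads.Length; t++)
        {
            threads[t] = new Thread(() =>
            {
                for (int i = 0; i < 10000; i++) stack.Push(i);
            });
            threads[t].Start();
        }
        foreach (var t in threads) t.Join();

        int popped = 0, value;
        while (stack.TryPop(out value)) popped++;
        Console.WriteLine(popped);   // 40000: nothing lost, nothing duplicated
    }
}
```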

Here is the code that does all of the work:

public bool NextSpinWillYield
{
    get
    {
        return this.m_count > 10 || PlatformHelper.IsSingleProcessor;
    }
}
/// <summary>Performs a single spin.</summary>
public void SpinOnce()
{
    if (this.NextSpinWillYield)
    {
        CdsSyncEtwBCLProvider.Log.SpinWait_NextSpinWillYield();
        int num = (this.m_count >= 10) ? (this.m_count - 10) : this.m_count;
        if (num % 20 == 19)
        {
            Thread.Sleep(1);
        }
        else
        {
            if (num % 5 == 4)
            {
                Thread.Sleep(0);
            }
            else
            {
                Thread.Yield();
            }
        }
    }
    else
    {
        Thread.SpinWait(4 << this.m_count);
    }
    this.m_count = ((this.m_count == 2147483647) ? 10 : (this.m_count + 1));
}

SpinOnce uses a counter to determine whether it needs to yield; on each method call this counter gets incremented. While the counter is at most ten (10), Thread.SpinWait is issued with 4 multiplied by 2 to the power of the counter, so the maximum spin is 4096 iterations, which is a lot. Once the counter is greater than ten (10), a number of context switching methods are invoked depending on the counter value: Thread.Yield every time, except every fifth call (Thread.Sleep(0)) and every 20th (Thread.Sleep(1)). This means that after ten (10) iterations SpinWait will issue a light context switch to another thread on the same processor (!), then try to switch to a thread with the same or higher priority, and finally switch to a thread of any priority.
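Plugging in counter values makes the schedule concrete. This sketch mirrors the decompiled logic above on a multi-core box (the Describe helper is my own naming, not part of the BCL):

```csharp
using System;

class SpinSchedule
{
    // what SpinOnce would do on call number `count` (multi-core machine assumed)
    public static string Describe(int count)
    {
        if (count > 10)   // NextSpinWillYield kicks in
        {
            int num = count >= 10 ? count - 10 : count;
            if (num % 20 == 19) return "Thread.Sleep(1)";
            if (num % 5 == 4)   return "Thread.Sleep(0)";
            return "Thread.Yield()";
        }
        // busy spin: 4 << count iterations, i.e. 4, 8, 16, ..., up to 4096
        return "Thread.SpinWait(" + (4 << count) + ")";
    }

    static void Main()
    {
        for (int count = 0; count <= 14; count++)
            Console.WriteLine("call {0,2}: {1}", count, Describe(count));
    }
}
```

The first eleven calls (counter 0 through 10) busy-spin for an exponentially growing number of iterations, ending at 4096; only after that do the yields and sleeps start.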

Lock free data structures, to be performant, need to be wait free or limit the number of CAS failures. This requires building the structure in such a way that, should we miss the CAS instruction and leave the underlying data structure in an inconsistent or wrong state, some other thread will set it right again. In the context of our example this is not the case, as we only have a single and very simple CAS instruction present, which may have a potentially high failure rate. So to deal with that we might just let the thread spin, as it is in our best interest to let the thread finish the loop as fast as possible, or we could issue a Yield to let another thread waiting for its time slice continue. Yielding should have a positive effect under high CAS failure rates, as those mean that threads overlap heavily and step on each other's boots; issuing a Yield should decrease the overlapping and, as a result, decrease the CAS failure rate.

SpinWait

So what's wrong with SpinWait's SpinOnce in this context? The problem is that for up to ten (10) failed CASes we issue a Thread.SpinWait that gets expensive fast. Another thing is that we are already spinning in the loop, so there is very little value in spinning again for an increasing number of iterations. The only possible benefit is that we might put the processor into active mode (provided we aren't there already due to the high load) and thus increase performance, but this still won't give us much (it might if the number of threads isn't too high), as we are in a high contention loop (lots of spinning threads). Another possible benefit is that each thread that failed a CAS spins longer than its predecessor, so racing threads have a chance to align properly and finish the operation; but in reality the penalty of waiting is too high, and even without the spin wait the threads should align. In my performance tests, the code without spin waits ran faster on average than the original code; the SpinWait solution was on par with the commented-out version in a couple of tests, but was generally worse.

Proposed Solution

A better solution than the SpinWait structure would be a failed-CAS counter that counts failed CAS operations and, upon reaching a certain threshold, issues a Thread.Yield every time, giving other threads a chance to finish their loops. For reference, let's change the Push() method.

private volatile Node m_head;
private const int failedCASthreshold = 2;

private class Node { public Node Next; public T Value; }

public void Push(T item)
{
    int failedCAS = 0;

    //var spin = new SpinWait();
    Node node = new Node { Value = item }, head;
    while (true)
    {
        head = m_head;
        node.Next = head;
        if (Interlocked.CompareExchange(ref m_head, node, head) == head) break;

        if (++failedCAS > failedCASthreshold)
            Thread.Yield();
    }
}

Conclusion

So does this mean the MSDN example code is now ready for production use? Actually, no. It just means that SpinWait was presented in the wrong context while a better solution existed, and this article was an attempt to explain what's wrong with it, to prevent developers from using it in such a context.

Monday, August 13, 2012

Lock Free Work Stealing Queue


Lock free data structures are very handy for concurrent programs, especially as the number of cores increases. The benefit of such data structures is that they never acquire a lock; instead they do a form of spin lock, where a local copy of the data is stored and then exchanged or updated by a processor's atomic operation (like compare-and-swap or fetch-and-add) only when the global value and the local copy match; otherwise we repeat the whole procedure. Almost every lock free data structure follows this pattern, so you may ask: where's the benefit in that? Well, chances are that the local data race will never issue a spin and everything will finish in just one loop cycle, making the process very fast, instead of always acquiring and holding a lock where other threads would just queue up.
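The pattern described above, in its smallest possible form (a lock-free counter; names are mine, purely illustrative):

```csharp
using System;
using System.Threading;

class CasLoopDemo
{
    public static int counter;

    // the lock free pattern: read a local copy, compute the new value off it,
    // and commit with CAS only if nobody changed the global in between;
    // otherwise loop and try again.
    public static void Increment()
    {
        while (true)
        {
            int local = counter;                 // local copy
            int updated = local + 1;             // compute using the copy
            if (Interlocked.CompareExchange(ref counter, updated, local) == local)
                return;                          // commit succeeded
            // somebody won the race: spin and retry
        }
    }

    static void Main()
    {
        var threads = new Thread[4];
        for (int t = 0; t < threads.Length; t++)
        {
            threads[t] = new Thread(() => { for (int i = 0; i < 100000; i++) Increment(); });
            threads[t].Start();
        }
        foreach (var t in threads) t.Join();
        Console.WriteLine(counter);   // 400000: no lost updates
    }
}
```

Most of the time the CAS succeeds on the first try, so the common path is a single loop iteration with no lock held at all.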

These types of data structures are very hard to implement correctly, as we need to think about other threads, where they are and what they are doing. To make things even harder, these structures can suffer from the ABA problem, where a memory location is read twice, but between reads some other thread introduces side effects by changing the values in that location (reusing the location), thus corrupting the entire state, while the switched-out thread just holds the memory address, thinking the data is still the same. Luckily for us, in .NET the ABA problem does not exist, as a memory location will only get reused when there are no references to it.

Work Stealing

So where does work stealing come into play here? Well, a work stealing queue is simply a queue where the main thread dequeues from the head, but the stealing threads dequeue elements from the tail. In a standard locking queue this helps a lot, as the stealing thread does not participate in the lock when dequeuing, provided the head and tail are far away from each other; if they are close, a lock can be taken for extra safety. So almost by design this queue is supposed to be nearly lock free. But what do we gain when our queue uses no locks at all and just does a form of spin lock? Well, almost the same benefits: we do not participate in head dequeuing, making it a lot less likely that our operation will last more than one loop cycle. There is one difference, though: we may slow down the enqueue operation, as we will be racing with the enqueuing threads, but the enqueuing thread will not slow down our own stealing operation.

So let's go into the code. Again, most of the explanation of how it works is done through comments in the code, as I just feel that's the better way :)

/// <summary>
/// Represents a hybrid queue that uses non blocking (spinning) techniques to achieve thread safety.
/// </summary>
/// <remarks>
/// The queue is hybrid due to its ability to dequeue the last node, which makes it
/// perfect for a certain set of algorithms, like work stealing.
/// </remarks>
/// <typeparam name="T">the element type.</typeparam>
public class LockFreeWorkStealingQueue<T> : IEnumerable<T>
{
    /// <summary>
    /// Internal node class for the use of internal double linked list structure.
    /// </summary>
    private class Node
    {
        public T val;
        public volatile Node next;
        public volatile Node prev;
        public int id;
    }

    /*
    * You may ask yourself why volatile is here. The
    * main reason is that when we don't do explicit locking
    * we don't get the memory barrier safety, so instructions
    * might get reordered.
    *
    * NOTE: having volatile fields here is just a workaround,
    * as we take a performance hit; instead we should put memory
    * barriers only where they are actually needed!
    */
    private volatile Node head;
    private volatile Node tail;

    /// <summary>
    /// Initializes a new instance of a hybrid queue.
    /// </summary>
    public LockFreeWorkStealingQueue()
    {
        // head and tail start out pointing at the same sentinel node.
        head = new Node();
        tail = head;
    }

    /// <summary>
    /// Gets the unsafe count (a count that will not necessarily reflect the actual value).
    /// </summary>
    /// <remarks>
    /// This property is very handy when trying to issue a steal depending on a certain window.
    /// </remarks>
    public int UnsafeCount
    {
        get
        {
            return tail.id - head.id;
        }
    }

    /// <summary>
    /// Gets the count.
    /// </summary>
    public int Count
    {
        get
        {
            int count = 0;
            EvaluateCount((x) => false, out count);
            return count;
        }
    }

    /// <summary>
    /// Starts counting nodes until a certain condition has been met.
    /// </summary>
    /// <param name="value">the condition.</param>
    /// <returns>a value indicating whether the condition was met.</returns>
    public bool EvaluateCount(Predicate<int> value)
    {
        int count = 0;
        return EvaluateCount(value, out count);
    }

    /// <summary>
    /// Starts counting nodes until a certain condition has been met.
    /// </summary>
    /// <param name="value">the condition.</param>
    /// <param name="actualCount">the actual counted number of elements.</param>
    /// <returns>a value indicating whether the condition was met.</returns>
    private bool EvaluateCount(Predicate<int> value, out int actualCount)
    {
        int count = 0;
        for (Node current = head.next;
            current != null; current = current.next)
        {
            count++;

            if (value(count))
            {
                actualCount = count;
                return true;
            }
        }
        actualCount = count;
        return false;
    }

    /// <summary>
    /// Gets a value indicating whether the queue is empty.
    /// </summary>
    public bool IsEmpty
    {
        get { return head.next == null; }
    }

    /// <summary>
    /// Gets the tail.
    /// </summary>
    /// <remarks>
    /// In order to achieve correctness we need to keep track of the tail;
    /// accessing tail.next will not do, as some other thread might have just moved it,
    /// so in order to catch the tail we need to do a subtle form of spin lock
    /// that uses the compare-and-swap atomic instruction (Interlocked.CompareExchange)
    /// to advance the tail if it has been moved.
    /// </remarks>
    /// <returns>Tail.</returns>
    private Node GetTail()
    {
        Node localTail = tail;
        Node localNext = localTail.next;

        //if some other thread moved the tail we need to set it to the right position.
        while (localNext != null)
        {
            //set the tail.
            Interlocked.CompareExchange(ref tail, localNext, localTail);
            localTail = tail;
            localNext = localTail.next;
        }

        return tail;
    }

    /// <summary>
    /// Attempts to reset the counter id.
    /// </summary>
    private void TryResetCounter()
    {
        if (tail.id >= Int16.MaxValue)
        {
            int res = (tail.id - head.id);
            head.id = 0;
            tail.id = res;
        }
    }

    /// <summary>
    /// Puts a new item on the Queue.
    /// </summary>
    /// <param name="obj">The value to be queued.</param>
    public void Enqueue(T obj)
    {
        Node localTail = null;
        Node newNode = new Node();
        newNode.val = obj;

        TryResetCounter();

        do
        {
            //get the tail.
            localTail = GetTail();

            //TODO: This should be atomic.
            newNode.next = localTail.next;
            newNode.id = localTail.id + 1;
            newNode.prev = localTail;
        }
        // if the exchange result isn't null, then some other
        // thread interfered with our plans (sic!) and we need to
        // start over.
        while (Interlocked.CompareExchange(
            ref localTail.next, newNode, null) != null);
        // if we finally are at the tail and we are the same,
        // then we switch the values to the new node, phew! :)
        Interlocked.CompareExchange(ref tail, newNode, localTail);
    }

    /// <summary>
    /// Gets the first element in the queue.
    /// </summary>
    /// <returns>Head element.</returns>
    public T Dequeue()
    {
        // keep spinning until we catch the proper head.
        while (true)
        {
            Node localHead = head;
            Node localNext = localHead.next;
            Node localTail = tail;

            // if the queue is empty then return the default for that
            // typeparam.
            if (localNext == null)
            {
                return default(T);
            }
            else if (localHead == localTail)
            {
                // our tail is lagging behind so we need to swing it.
                Interlocked.CompareExchange(ref tail, localHead, localTail);
            }
            else
            {
                localNext.prev = localHead.prev;

                // if no other thread changed the head then we are good to
                // go and we can return the local value;
                if (Interlocked.CompareExchange(
                    ref head, localNext, localHead) == localHead)
                {
                    return localNext.val;
                }
            }
        }
    }

    /*
    * Instead of a sexy name like 'Steal' I like to name it by its function,
    * meaning we are dequeuing the LAST item! :P
    */

    /// <summary>
    /// Gets the last element in the queue.
    /// </summary>
    /// <returns>old tail element.</returns>
    public T DequeueLast()
    {
        Node localTail;
        Node localPrev;
        Node swapNode = new Node();

        do
        {
            //get the tail.
            localTail = GetTail();
            localPrev = localTail.prev;

            if (localPrev == null)
                return default(T);
            else if (localPrev.prev == null)
                return default(T);
            else if (localPrev.prev == head)
                return default(T);
            else if (localTail == null)
                return default(T);

            // Set the swap node values that will exchange the element
            // in a sense that it will skip right through it.
            swapNode.next = localTail.next;
            swapNode.prev = localPrev.prev;
            swapNode.val = localPrev.val;
            swapNode.id = localPrev.id;
        }
        // In order for this to be actually *thread safe* we need to subscribe ourselves
        // to the same logic as the enqueue and create a blockade by setting the next value
        // of the tail!
        while (Interlocked.CompareExchange(ref localTail.next, localTail, null) != null);


        // do a double exchange; if we get interrupted in between we should still be fine,
        // as all we need to do after the first exchange is to swing the prev element to
        // point at the correct tail.
        Interlocked.CompareExchange(ref tail, swapNode, tail);
        Interlocked.CompareExchange(ref tail.prev.next, swapNode, tail.prev.next);

        return localTail.val;
    }

    /// <summary>
    /// Tries to peek the next value in the queue without
    /// getting it out.
    /// </summary>
    /// <param name="value">the output value.</param>
    /// <returns>the value indicating that there are still values to be peeked.</returns>
    public bool TryPeek(out T value)
    {
        Node currentNode = head.next;

        if (currentNode == null)
        {
            value = default(T);
            return false;
        }
        else
        {
            value = currentNode.val;
            return true;
        }
    }

    /// <summary>
    /// Gets the enumerator.
    /// </summary>
    /// <returns>enumerator.</returns>
    public IEnumerator<T> GetEnumerator()
    {
        Node currentNode = head.next;
        Node localTail = GetTail();

        while (currentNode != null)
        {
            yield return currentNode.val;

            if (currentNode == localTail)
                break;

            currentNode = currentNode.next;
        }
    }

    /// <summary>
    /// Gets the enumerator.
    /// </summary>
    /// <returns>enumerator.</returns>
    IEnumerator IEnumerable.GetEnumerator()
    {
        return ((IEnumerable<T>)this).GetEnumerator();
    }
}
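For completeness, here's a quick single-threaded sanity check of the public surface of the queue above. This is my own sketch, assuming the `LockFreeWorkStealingQueue<T>` class compiles as listed; the output comments reflect a single-threaded run:

```csharp
using System;

class QueueUsageDemo
{
    static void Main()
    {
        // Assumes the LockFreeWorkStealingQueue<T> class from this post.
        var q = new LockFreeWorkStealingQueue<int>();

        Console.WriteLine(q.IsEmpty);    // True

        q.Enqueue(1);
        q.Enqueue(2);
        q.Enqueue(3);

        int peeked;
        q.TryPeek(out peeked);           // peeked == 1; the item stays in the queue

        Console.WriteLine(q.Dequeue());  // 1 (taken from the head)

        foreach (int item in q)          // enumerates the remaining items: 2 3
            Console.Write(item + " ");
    }
}
```

With no contention none of the CAS loops retry, so this just exercises the happy path; the multi-threaded behavior is what the benchmark below stresses.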

Performance

Let's do some performance tests for the solution. Please note that these tests were conducted on a four-physical-core i7 system. The tests will only do a basic enqueue/dequeue test, as it's impossible to compare the performance of work stealing in a generic way; that should be tested individually depending on the use case of such a structure. The test code simply inserts into and removes from the queue from multiple threads at the same time; the initial count gives us a correctness indicator, as it needs to be the same as when we started our program.

    static ManualResetEvent ev = new ManualResetEvent(false);
    static int enqTh = 100; //enqueue threads
    static int deqTh = 100; //dequeue threads
    static int items = 100000; //items to deq/enq
    static int count = 0; //countdown latch


    static void LockFreeTest()
    {
        Console.WriteLine("Lock Free Test");

        LockFreeWorkStealingQueue<int> q = new LockFreeWorkStealingQueue<int>();

        //insert some initial items.
        for (int i = 0; i < items * enqTh; i++ ) q.Enqueue(i);

        Stopwatch w = new Stopwatch();
        w.Start();

        for (int i = 0; i < enqTh; i++)
        {
            ThreadPool.QueueUserWorkItem((x) =>
                { for (int n = 0; n < items; n++) q.Enqueue(i); count++; if (count == enqTh + deqTh) ev.Set(); });
        }

        for (int i = 0; i < deqTh; i++)
        {
            ThreadPool.QueueUserWorkItem((x) =>
            { for (int n = 0; n < items; n++) q.Dequeue(); count++; if (count == enqTh + deqTh) ev.Set(); });
        }

        ev.WaitOne();
        w.Stop();

        Console.WriteLine("Count: {0}" ,q.Count);
        Console.WriteLine("Time it took: {0}", w.ElapsedMilliseconds);
    }

    static void LockingTest()
    {
        Console.WriteLine("Locking Test");

        Queue<int> q = new Queue<int>();

        //insert some initial items.
        for (int i = 0; i < items * enqTh; i++) q.Enqueue(i);

        Stopwatch w = new Stopwatch();
        w.Start();

        for (int i = 0; i < enqTh; i++)
        {
            ThreadPool.QueueUserWorkItem((x) =>
            { for (int n = 0; n < items; n++) lock(q) q.Enqueue(i); count++; if (count == enqTh + deqTh) ev.Set(); });
        }

        for (int i = 0; i < deqTh; i++)
        {
            ThreadPool.QueueUserWorkItem((x) =>
            { for (int n = 0; n < items; n++) lock(q) q.Dequeue(); count++; if (count == enqTh + deqTh) ev.Set(); });
        }

        ev.WaitOne();
        w.Stop();

        Console.WriteLine("Count: {0}", q.Count);
        Console.WriteLine("Time it took: {0}", w.ElapsedMilliseconds);
    }


Lock free queue took:



Standard queue took:


Conclusion

Now, if those results seem somewhat strange to you, let me explain. Performance benefits will only be visible on a system with a high core/processor count, as only then do processors need to stay in sync while waiting on a lock acquired by a thread on another processor; on low-core systems you will be better off with a standard locking (work stealing) queue.


Update On Performance

As correctly pointed out by Sławomir in the comments section, the performance test is actually wrong for a couple of reasons. The first is that the ThreadPool uses work stealing algorithms itself, so it will run on NumOfCores * 4 threads instead of the provided number; this is especially true if the tests were performed on .NET 4.0. Besides, the work stealing algorithm will obscure the results, as threads will steal queued tasks. The correct solution to the problem is to use plain simple threads, so the test code is changed to reflect that:

    static ManualResetEvent ev = new ManualResetEvent(false);
    static int enqTh = 1000; //enqueue threads
    static int deqTh = 1000; //dequeue threads
    static int enqItems = 10000;
    static int deqItems = 10000;
    static int count = 0; //countdown latch
    static int startItems = 10000000;
    static object locker = new object();

    static void LockFreeTest()
    {
        Console.WriteLine("Lock Free Test");

        LockFreeWorkStealingQueue<int> q = new LockFreeWorkStealingQueue<int>();

        //insert some initial items.
        for (int i = 0; i < startItems; i++) q.Enqueue(i);

        Thread[] enqThreads = new Thread[enqTh];
        Thread[] deqThreads = new Thread[deqTh];

        for (int i = 0; i < enqTh; i++)
        {
            enqThreads[i] = new Thread(() =>
                {
                    for (int n = 0; n < enqItems; n++) q.Enqueue(i); count++; if (count == enqTh + deqTh) ev.Set();
                });
        }

        for (int i = 0; i < deqTh; i++)
        {
            deqThreads[i] = new Thread(() =>
                {
                    for (int n = 0; n < deqItems; n++) q.Dequeue(); count++; if (count == enqTh + deqTh) ev.Set();
                });
        }

        Stopwatch w = new Stopwatch();
        w.Start();

        Array.ForEach(enqThreads, (x) => x.Start());
        Array.ForEach(deqThreads, (x) => x.Start());
        Array.ForEach(enqThreads, (x) => x.Join());
        Array.ForEach(deqThreads, (x) => x.Join());

        ev.WaitOne();
        w.Stop();

        Console.WriteLine("Count: {0}" ,q.Count);
        Console.WriteLine("Time it took: {0}", w.ElapsedMilliseconds);
    }

    static void LockingTest()
    {
        Console.WriteLine("Locking Test");

        Queue<int> q = new Queue<int>();

        //insert some initial items.
        for (int i = 0; i < startItems; i++) q.Enqueue(i);

        Thread[] enqThreads = new Thread[enqTh];
        Thread[] deqThreads = new Thread[deqTh];

        for (int i = 0; i < enqTh; i++)
        {
            enqThreads[i] = new Thread(() =>
            {
                for (int n = 0; n < enqItems; n++) lock (locker) q.Enqueue(i); count++; if (count == enqTh + deqTh) ev.Set();
            });
        }

        for (int i = 0; i < deqTh; i++)
        {
            deqThreads[i] = new Thread(() =>
            {
                for (int n = 0; n < deqItems; n++) lock(locker) q.Dequeue(); count++; if (count == enqTh + deqTh) ev.Set();
            });
        }

        Stopwatch w = new Stopwatch();
        w.Start();

        Array.ForEach(enqThreads, (x) => x.Start());
        Array.ForEach(deqThreads, (x) => x.Start());
        Array.ForEach(enqThreads, (x) => x.Join());
        Array.ForEach(deqThreads, (x) => x.Join());

        ev.WaitOne();
        w.Stop();

        Console.WriteLine("Count: {0}", q.Count);
        Console.WriteLine("Time it took: {0}", w.ElapsedMilliseconds);
    }

I extended the test procedure and also tested scenarios with high enq low deq ratio, and the other way around as it is rare for the most systems to have the same number of enq and deq at any given point in time, usually it will vary. I provided a table with results for given number of items and threads for given test scenario along with a graph showing the performance. Let's look at the performance of both solutions when we have the same amount of dequeue and enqueue threads:

For 1k and 10k items:

Threads            10    20    40    80    100   200   400   800    1000
LockFree (1000)    142   210   337   743   999   1885  3304  3500   3594
Locking  (1000)    96    135   483   1037  1043  1628  2600  2638   2760
LockFree (10000)   254   499   828   1907  2327  4922  9940  18637  23549
Locking  (10000)   257   481   890   2386  2888  5005  9505  17900  22900




As we can see, the performance of the lock free solution on my machine (4 physical cores) is almost on par with the locking queue, and there is a pattern here where the lock free solution outperforms the locking code (by a hair) when the number of threads is between 40 and 100. For better results from the lock free code we would need to redo this test on a multi-core machine (8 to 10 physical cores), but unfortunately I have no access to such a machine (sic!) :(. Now let's test the performance where we have varying numbers of enqueue and dequeue threads.

For 10k items and varying number of dequeue threads:


Threads            10d/10e  20d/10e  40d/10e  80d/10e  100d/10e  200d/10e  400d/10e  800d/10e  1000d/10e
Locking  (10000)   234      419      532      1210     1414      2364      4876      10257     13067
LockFree (10000)   190      393      524      958      1181      2391      4528      9141      10878




For 10k items and varying number of enqueue threads:

Threads            10d/10e  10d/20e  10d/40e  10d/80e  10d/100e  10d/200e  10d/400e  10d/800e  10d/1000e
Locking  (10000)   201      429      797      1369     1408      2703      5063      9916      12917
LockFree (10000)   168      416      587      1184     1300      2426      4867      10645     13267




Now we do see that the lock free queue generally outperforms the locking version of the queue, but again, on a low-core machine the gain is very small.

Update On Conclusion

So should we bother with a lock free solution? If we do have a very loaded system that runs on multiple cores, then yes! Besides, this implementation is a work stealing queue whose primary role is to boost performance on a loaded system by letting lightly loaded threads steal work from other queues. The primary question here, however, is whether we should go for a lock based work stealing queue or go lock free all the way. Well, in lock based work stealing we would participate in the enqueue lock upon stealing; in the lock free version we would still participate in the same pseudo lock, but when the enqueue rate is small at that moment there is a big chance that we will not spin, and as a result the whole steal operation will be faster than acquiring the lock and waiting on it (especially as the number of cores goes up). Another thing is that very short locks in .NET are implemented internally as spin locks, thus they do not switch to the Operating System, so the performance of such short locks is very good (I will try to shed more light on that topic in future posts). So the answer to the question is: go lock free if the production system uses a high-core machine, and go lock based if you cannot spare the cores.
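As a side note on that last point, .NET also exposes user-mode spinning explicitly through System.Threading.SpinLock, which avoids the kernel transition for very short critical sections. A minimal sketch of the Enter/Exit pattern (my own example, not from the queue code above):

```csharp
using System;
using System.Threading;

class SpinLockDemo
{
    static SpinLock sl = new SpinLock();
    static int counter = 0;

    static void Main()
    {
        var threads = new Thread[4];
        for (int t = 0; t < 4; t++)
        {
            threads[t] = new Thread(() =>
            {
                for (int i = 0; i < 100000; i++)
                {
                    bool taken = false;
                    try
                    {
                        // Spins in user mode instead of blocking in the kernel,
                        // which pays off when the critical section is tiny.
                        sl.Enter(ref taken);
                        counter++;
                    }
                    finally
                    {
                        if (taken) sl.Exit();
                    }
                }
            });
        }
        foreach (var th in threads) th.Start();
        foreach (var th in threads) th.Join();
        Console.WriteLine(counter); // 400000
    }
}
```

Like a regular lock this still serializes the threads, so it only wins over Monitor when contention windows are short; under long holds the spinning just burns CPU.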