Saturday, January 25, 2014

String Hazards - Custom Intern Pool

String Hazards - Custom Intern Pool

This article is a continuation of this post, you are missing much of the context if you haven't read it so I advise you to do it (I'll wait go and read it :P)

As I promised we are going to fix the InternPool and try to roll our own. There are multiple angles that we can approach this problem like OnHeap allocation and OffHeap we can use unsafe code or our own more advanced strings. For this article we will stick with the most simple solution (and move to the more advanced ones as we go) as most people don't want to use custom advanced strings, this is kinda the last resort or we need to start an application/system with them from the get go.

So the first thing if we want to use existing strings (and don't do unsafe code of pointer stuff) we need to find some way to make them mutable again since string are immutable every single operation will return a different reference thus preventing us from doing any form of a string pool. This is actually quite simple to do since string arguments are still copied by reference (they are classes no?) so in order to pool a string one needs to take it's reference and store it.

So the simplest intern pool could look like this:

    public static class InternPool
        static InternPool()

        private static readonly Hashtable pool = new Hashtable();

        public static  string Intern(string str)
            return (string)InnerIntern(str);

        private static object InnerIntern(string str)
            if (pool.ContainsKey(str))
                return pool[str];
                object tmp = str;
                pool.Add(str, tmp);
                return tmp;

Now this will work as intended, meaning that if we pass a string that was already Interned then we will receive only the previously Interned reference and the current one will be dumped (it will be collected in gen0 this is nor bad or good we just have to be aware of that).

Now to see if we this InternPool fragments the LOH we need need to run the following code and fire up WinDbg to see what's happening in the system.

            for (int i = 0; i < 100000; i++)
                var x = InternPool.Intern("THIS STRING IS POOLED!" + i);
Like in the previous blog post we will stop in the middle of a loop (i = 50K).

First thing we need to do is to get the location on LOH:

            0:006> !eeheap -gc

            Number of GC Heaps: 1
            generation 0 starts at 0x00000000035adb38
            generation 1 starts at 0x00000000034a4490
            generation 2 starts at 0x00000000032b1000
            ephemeral segment allocation context: none
                    segment            begin        allocated            size
            00000000032b0000 00000000032b1000  0000000003885fe8 0x00000000005d4fe8(6115304)
            Large object heap starts at 0x00000000132b1000
                    segment            begin        allocated            size
            00000000132b0000 00000000132b1000  000000001369bf78 0x00000000003eaf78(4108152)

Now we need to dump it's contents:

            0:006> !dumpheap 0x00000000132b1000
                    Address               MT    Size
            00000000132b1000 0000000002ed8200       24 Free
            00000000132b1018 000007fd4bd05a10    8192   
            00000000132b3018 0000000002ed8200       24 Free
            00000000132b3030 000007fd4bd05a10    1056   
            00000000132b3450 0000000002ed8200    7136 Free
            00000000132b5030 000007fd4bd05a10    8192   
            00000000132b7030 0000000002ed8200  1910952 Free
            00000000134898d8 000007fd4bd1f740  2172576   
            total 8 objects
                          MT    Count    TotalSize Class Name
            000007fd4bd05a10        3        17440 System.Object[]
            0000000002ed8200        4      1918136      Free
            000007fd4bd1f740        1      2172576 System.Collections.Hashtable+bucket[]
            Total 8 objects

We can see that there aren't any small objects flying around this time and there is one huge reference (2172576 bytes) that's the hashtable bucket array. This is understandable since we are putting the references to strings there and the build in hashtable can expand quite heavily not to mention rehashing etc. A badly utilized hashtable can do almost as much damage as fragmenting string InternPool but we should be still ok if the memory is not pinned thus once the hashtable decides to expand the only reference will be collected over time (this is still a big no-no in high frequency environment since we can allocate faster then we are freeing).

Let's inspect the roots of an object:

            0:006> !gcroot 00000000134898d8
            Note: Roots found on stacks may be false positives. Run "!help gcroot" for
            more info.
            Scan Thread 0 OSTHread 3488
            Scan Thread 2 OSTHread 6f6c

We see that there are three roots to our hash table bucket arrays and they are not pinned so this is good the only thing that's pinned is the hashtable reference itself, now why is that ? Well since the hashtable is a static reference therefor GC will pin the reference in memory no matter what.

Finally let's check how many buckets are in the hashtable (just for fun):

            0:006> !dumpobj 00000000134898d8
            Name: System.Collections.Hashtable+bucket[]
            MethodTable: 000007fd4bd1f740
            EEClass: 000007fd4b9c14c8
            Size: 2172576(0x2126a0) bytes
            Array: Rank 1, Number of elements 90523, Type VALUETYPE
            Element Type: System.Collections.Hashtable+bucket

For 50K strings we occupy 90K slot's thats very poor distribution, and most certainly we could improve this value.

The above code could use a lot of improvements like replacing hashtable with custom double hashing data structure to reduce the number of buckets and prevent expansion and rehashing. The fundamental question here is will this form of string pooling help to reduce allocations of strings ? Well No, it might help a bit, just like the original intern pool simply helped a bit if you were smart in using it (excluding the nasty fragmentation issues). To gain something from String Pool one would need to wrap this it around every single string operation and give the ability to pick if we want to pool or not. Further more some string operations would need to be completely redone to be pointer/index based and the pool itself would need to store references as pointer addresses in a buffer so we could benefit from aligning them and the next article will explore just that.

But the takeaway is this if you've been using InternPool for your string ops with success but suffered form fragmentation issues then this is a good replacement (consider changing the hashing structure though).


Wojtek (szogun1987) said...

I think that ordered array would be much better solution than Hashtable. I haven't make any test but I think that reading from structure like that should be much faster writing to it would be much slower in pessimistic case.

Bartosz Adamczewski said...

@Wojtek, that all depends on the hashtable used, on average the hash table has O(1) performance, using a good hash and smart bucket distribution would give a very good pessimistic time as well.

On the other hand you might use an array with great success to do Intern Pool and just do a binary search to find the reference O(logN) but in reality the best possible thing one might do is to use a Radix/Prefix Tree it's both space and time optimized and in this case it has an added bonus as either we will be using strings or pointers for that matter the data will be aligned.

The implementation is a bit trickier though.

The point of this article was actually not to provide the best ever DS to store references but to show how it could be done and what actually happens :) (I might do a post with more optimized structure).

Harley Watts said...

custom software development