String Hazards - Custom Intern Pool
This article is a continuation of this post; you will be missing a lot of context if you haven't read it, so I advise you to do so first (I'll wait, go and read it :P). As promised, we are going to fix the InternPool and try to roll our own. There are multiple angles from which we can approach this problem, such as on-heap and off-heap allocation, unsafe code, or our own more advanced string types. For this article we will stick with the simplest solution (and move to more advanced ones as we go), since most people don't want to use custom advanced strings; those are either a last resort or something you commit to from the very start of an application or system.
The first thing to note is that if we want to keep using existing strings (and avoid unsafe code or pointer tricks), we need some way to reuse them: since strings are immutable, every single operation returns a different reference, which prevents us from building any form of string pool on top of them directly. The fix is actually quite simple. String arguments are still passed by reference (they are classes, after all), so to pool a string one just needs to take its reference and store it.
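Before the pool itself, a quick illustration of that point (my own minimal sketch, not from the original post): two equal strings built at runtime are separate objects, so value equality holds while reference equality does not; that is exactly the gap a pool fills by handing back the first reference it saw.

using System;

class ReferenceDemo
{
    static void Main()
    {
        int i = 42;
        string a = "POOLED!" + i;   // built at runtime
        string b = "POOLED!" + i;   // equal content, but a separate allocation

        Console.WriteLine(a == b);                        // True  - value equality
        Console.WriteLine(object.ReferenceEquals(a, b));  // False - two distinct heap objects
    }
}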
So the simplest intern pool could look like this:
using System.Collections;

public static class InternPool
{
    static InternPool()
    { }

    // Keyed by string value; the stored value is the first reference we saw.
    private static readonly Hashtable pool = new Hashtable();

    public static string Intern(string str)
    {
        return (string)InnerIntern(str);
    }

    private static object InnerIntern(string str)
    {
        if (pool.ContainsKey(str))
            return pool[str];       // equal value already pooled: hand back the stored reference
        else
        {
            object tmp = str;
            pool.Add(str, tmp);     // first sighting: remember this reference
            return tmp;
        }
    }
}
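A quick sanity check (my own usage sketch, not from the original post): two calls with equal but separately allocated inputs should come back as one and the same reference.

var first = InternPool.Intern("THIS STRING IS POOLED!" + 1);
var second = InternPool.Intern("THIS STRING IS POOLED!" + 1);
Console.WriteLine(object.ReferenceEquals(first, second)); // True: both point at the pooled instance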
Now, to see whether this InternPool fragments the LOH, we need to run the following code and fire up WinDbg to see what's happening in the process.
for (int i = 0; i < 100000; i++)
{
    var x = InternPool.Intern("THIS STRING IS POOLED!" + i);
    Console.WriteLine(x);
}
Console.ReadKey();
The first thing we need to do is get the location of the LOH:
0:006> !eeheap -gc
Number of GC Heaps: 1
generation 0 starts at 0x00000000035adb38
generation 1 starts at 0x00000000034a4490
generation 2 starts at 0x00000000032b1000
ephemeral segment allocation context: none
segment begin allocated size
00000000032b0000 00000000032b1000 0000000003885fe8 0x00000000005d4fe8(6115304)
Large object heap starts at 0x00000000132b1000
segment begin allocated size
00000000132b0000 00000000132b1000 000000001369bf78 0x00000000003eaf78(4108152)
Now we need to dump its contents:
0:006> !dumpheap 0x00000000132b1000
Address MT Size
00000000132b1000 0000000002ed8200 24 Free
00000000132b1018 000007fd4bd05a10 8192
00000000132b3018 0000000002ed8200 24 Free
00000000132b3030 000007fd4bd05a10 1056
00000000132b3450 0000000002ed8200 7136 Free
00000000132b5030 000007fd4bd05a10 8192
00000000132b7030 0000000002ed8200 1910952 Free
00000000134898d8 000007fd4bd1f740 2172576
total 8 objects
Statistics:
MT Count TotalSize Class Name
000007fd4bd05a10 3 17440 System.Object[]
0000000002ed8200 4 1918136 Free
000007fd4bd1f740 1 2172576 System.Collections.Hashtable+bucket[]
Total 8 objects
We can see that there aren't any small objects flying around this time, and there is one huge allocation (2,172,576 bytes): the hashtable's bucket array. This is understandable, since we are putting the string references there and the built-in hashtable can expand quite heavily, not to mention rehashing, etc. A badly utilized hashtable can do almost as much damage as the fragmenting string InternPool, but we should still be OK as long as the memory is not pinned: once the hashtable decides to expand, the old bucket array loses its only reference and gets collected over time (this is still a big no-no in a high-frequency environment, since we can allocate faster than we free).
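One cheap mitigation, assuming you can estimate an upper bound for the number of pooled strings, is to pre-size the Hashtable so the bucket array is allocated once up front instead of repeatedly expanding on the LOH (the capacity below is an assumption, not a value measured above):

// Sketch only: allocate the bucket array once.
// 200000 is an assumed upper bound on the number of pooled strings.
private static readonly Hashtable pool = new Hashtable(200000);

This doesn't fix the bucket distribution, but it turns a stream of ever-larger LOH allocations into a single up-front one.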
Let's inspect the roots of an object:
0:006> !gcroot 00000000134898d8
Note: Roots found on stacks may be false positives. Run "!help gcroot" for
more info.
Scan Thread 0 OSTHread 3488
RSP:103e938:Root:00000000032d6720(System.Collections.Hashtable)->
00000000134898d8(System.Collections.Hashtable+bucket[])
RSP:103ea38:Root:00000000032d6720(System.Collections.Hashtable)->
00000000134898d8(System.Collections.Hashtable+bucket[])
RSP:103ea58:Root:00000000032d6720(System.Collections.Hashtable)->
00000000134898d8(System.Collections.Hashtable+bucket[])
Scan Thread 2 OSTHread 6f6c
DOMAIN(00000000010BE960):HANDLE(Pinned):217f8:Root:00000000132b1018(System.Object[])->
00000000032d6720(System.Collections.Hashtable)
Finally let's check how many buckets are in the hashtable (just for fun):
0:006> !dumpobj 00000000134898d8
Name: System.Collections.Hashtable+bucket[]
MethodTable: 000007fd4bd1f740
EEClass: 000007fd4b9c14c8
Size: 2172576(0x2126a0) bytes
Array: Rank 1, Number of elements 90523, Type VALUETYPE
Element Type: System.Collections.Hashtable+bucket
Fields:
None
For 50K strings we occupy 90K slots; that's a very poor distribution, and we could most certainly improve on it.
The code above could use a lot of improvements, such as replacing the Hashtable with a custom double-hashing data structure to reduce the number of buckets and prevent expansion and rehashing. The fundamental question is: will this form of string pooling help to reduce string allocations? No, not really. It might help a bit, just like the original intern pool helped a bit if you were smart about using it (nasty fragmentation issues aside). To gain something substantial from a string pool, one would need to wrap it around every single string operation and give callers the ability to choose whether to pool or not. Furthermore, some string operations would need to be completely redone to be pointer/index based, and the pool itself would need to store references as pointer addresses in a buffer so we could benefit from aligning them; the next article will explore exactly that.
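To make the double-hashing idea a bit more concrete, here is a rough sketch of a fixed-size, open-addressed pool. Everything in it (the table size, the probe step, the missing locking) is my own assumption for illustration, not code from this article and not a drop-in replacement for the version above.

using System;

// Sketch: a fixed-size, open-addressed table using double hashing,
// so the single slot array is allocated once and never rehashed.
public static class FixedInternPool
{
    private const int Size = 196613;                  // a prime, assumed big enough for the workload
    private static readonly string[] slots = new string[Size];

    public static string Intern(string str)
    {
        if (str == null) return null;

        int hash = str.GetHashCode() & 0x7FFFFFFF;    // force non-negative
        int index = hash % Size;
        // The probe step must be in [1, Size - 1]; Size is prime, so every
        // step visits all slots before repeating.
        int step = 1 + (hash % (Size - 1));

        for (int probe = 0; probe < Size; probe++)
        {
            string existing = slots[index];
            if (existing == null)
            {
                slots[index] = str;                   // first sighting: keep this reference
                return str;
            }
            if (string.Equals(existing, str))
                return existing;                      // already pooled: hand back the stored reference

            index = (index + step) % Size;
        }

        // Table full: fall back to the caller's instance rather than throwing.
        // (No locking here; a real pool would need synchronization.)
        return str;
    }
}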
But the takeaway is this: if you've been using the InternPool for your string operations with success but suffered from fragmentation issues, then this is a good replacement (consider changing the hashing structure, though).
2 comments:
I think that an ordered array would be a much better solution than a Hashtable. I haven't run any tests, but I think that reading from a structure like that should be much faster, while writing to it would be much slower in the pessimistic case.
@Wojtek, that all depends on the hashtable used; on average a hash table has O(1) performance, and using a good hash with smart bucket distribution gives a very good pessimistic time as well.
On the other hand, you might use an array with great success for an intern pool and just do a binary search to find the reference in O(log N) (see the sketch below). In reality, though, the best possible thing one might do is use a radix/prefix tree: it's optimized for both space and time, and in this case it has an added bonus, since whether we store strings or pointers the data will be aligned.
The implementation is a bit trickier, though.
The point of this article was actually not to provide the best ever data structure to store references, but to show how it could be done and what actually happens :) (I might do a post with a more optimized structure).
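For completeness, a rough sketch of the sorted-array variant mentioned in the reply above (the class name and the use of List<string> are my own choices, and there is no thread safety): reads are O(log N) binary searches, while writes pay the O(N) shift that makes the pessimistic case slow.

using System;
using System.Collections.Generic;

// Sketch: a sorted list of pooled strings with binary-search lookups.
public static class SortedInternPool
{
    private static readonly List<string> entries = new List<string>();

    public static string Intern(string str)
    {
        if (str == null) return null;

        int index = entries.BinarySearch(str, StringComparer.Ordinal);
        if (index >= 0)
            return entries[index];        // already pooled: return the stored reference

        entries.Insert(~index, str);      // keep the list sorted; this shift is the slow part
        return str;
    }
}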