Tuesday, January 21, 2014

String Hazards



This post was prompted by this blog and the great confusion about strings in its comment section.

Strings are a wonderful thing in .NET: they are easy to use, expressive, and come with tons of methods for handling them. But with such great flexibility comes great responsibility and a few hazards. String allocation is expensive and can lead to considerable overhead and increased memory pressure, forcing many garbage collections (usually in gen0). This isn't a huge deal in standard applications that don't handle big data or allocate at high frequency, but it starts to be a problem once the application grows, or if the application is high frequency from the start.

Doing something about it usually means throwing most of the benefits of strings down the drain and reimplementing the standard string operations to reduce allocations. That has limited uses, however, so another good idea is to create a string pool that reuses memory by keeping a large contiguous buffer (either on the heap or off the heap, if you like hardcore stuff). But there actually is a string pool in .NET already, and it pools strings using a copy-on-write technique (meaning that a string that differs from the one in the intern pool gets a new reference). To make use of pool allocation we need to call string.Intern or just use literal strings (literal strings are strings that are hard-coded in the program, like var s = "MYSTRING").
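As a minimal sketch of the pooling behavior described above (the class and method names are made up for illustration), the following compares a literal, a freshly built copy with the same contents, and the result of passing that copy through string.Intern. On a typical JIT-compiled build I would expect it to print False and then True, because the copy is a separate object while string.Intern hands back the pooled instance.

```csharp
using System;

static class InternDemo
{
    // Returns true when both arguments are the very same object on the heap.
    public static bool SameInstance(string a, string b)
    {
        return object.ReferenceEquals(a, b);
    }

    static void Main()
    {
        string literal = "MYSTRING";                          // literal string from the program text
        string built = new string("MYSTRING".ToCharArray());  // new object with equal contents
        string pooled = string.Intern(built);                 // looks the contents up in the intern pool

        Console.WriteLine(SameInstance(literal, built));   // compares the literal with the fresh copy
        Console.WriteLine(SameInstance(literal, pooled));  // compares the literal with the pooled reference
    }
}
```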

The intern pool can therefore cause what look like memory leaks. They are not leaks in reality: once allocated, the intern pool is never released and can only expand, so this behavior is rather expected. The pool is allocated on the Large Object Heap and all of its records are pinned in memory, which is understandable, but the pool has one serious design issue, and I'm not sure if it is a bug or just a bad design choice. Since the pool can expand when needed and is always pinned, it can cause heap fragmentation: other objects can be allocated around it and then freed, but the next block of the intern pool cannot, so the Large Object Heap fragments, since this heap cannot be compacted (in 4.5.1 it can). The very strange thing here is that the free blocks between intern blocks are too small to be considered LOH-allocated objects, so it might be a bug rather than a design choice. Still, expanding the pool in the LOH as pinned memory would fragment the heap either way.
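For reference, this is roughly how LOH compaction can be requested on .NET 4.5.1 and later. It is only a sketch (the class and method names are mine), and it does not solve the problem described above, because compaction cannot move pinned objects.

```csharp
using System;
using System.Runtime;

static class LohCompaction
{
    // Asks the GC to compact the Large Object Heap during the next
    // blocking generation 2 collection, then triggers that collection.
    public static void CompactLargeObjectHeapOnce()
    {
        GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
        GC.Collect(); // full blocking collection; pinned objects stay where they are
    }

    static void Main()
    {
        CompactLargeObjectHeapOnce();
        // Shows the current compaction mode after the collection has run.
        Console.WriteLine(GCSettings.LargeObjectHeapCompactionMode);
    }
}
```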

Intern Pool String

 

To prove what I'm saying let's take an example:
 
            for (int i = 0; i < 100000; i++)
            {
                var x = string.Intern("THIS STRING IS POOLED!" + i);
                Console.WriteLine(x);
            }
            Console.ReadKey();

This simple code exposes the problem with the intern pool. Let's fire up WinDbg, attach to the sample process, and break when 'i' is around 50K. Let's now see where the LOH starts.


Now that we have the address of the LOH we can dump its contents.


(This heap is huge, so the picture shows only a small fragment of it.)

As we can see, this smallish program generated lots of LOH allocations. Most of the allocated and free blocks are constant in size, which is likely due to the constant size of the string.

Now let's inspect one of the allocated blocks.


We can see here that this is an object array with no fields defined, so it looks like a reference or a handle to an actual string or set of strings. To inspect it further we would need to dump the asm instructions, but analyzing those is beyond the scope of this article.

To prove that this memory is indeed pinned we need to check its GCHandle.


Indeed it is pinned, so it's not movable or freeable in any way (unless we acquired the handle and freed it, but that looks more like a hacking attempt than a solution).

Non Intern Pool String

 

Now, just to be sure, let's modify the code a bit so that it does not use the intern table, and then check what the LOH looks like (I'm going to skip some instructions and go straight to the LOH).

 
            for (int i = 0; i < 100000; i++)
            {
                var x = "THIS STRING IS POOLED!" + i;
                //var x = string.Intern("THIS STRING IS POOLED!" + i);
                Console.WriteLine(x);
            }
            Console.ReadKey();

Earlier I said that only string literals are automatically interned, and since we are using a literal here it should go to the intern table, right? Well, wrong: since we are concatenating a literal with a non-literal value, this gives us a third, non-literal string, so there will be no interning.
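One way to check this from code rather than from the debugger is string.IsInterned, which returns null when no string with the given contents is in the pool. The sketch below (helper names are hypothetical) builds a string at run time, checks it, interns it, and checks again. I would expect False followed by True.

```csharp
using System;

static class ConcatDemo
{
    // Returns true if a string with the same contents is already in the intern pool.
    public static bool IsPooled(string s)
    {
        return string.IsInterned(s) != null;
    }

    static void Main()
    {
        int i = 42;
        string x = "THIS STRING IS POOLED!" + i; // concatenated at run time, so not a literal

        Console.WriteLine(IsPooled(x)); // checked before interning
        string.Intern(x);               // explicitly adds the contents to the pool
        Console.WriteLine(IsPooled(x)); // checked after interning
    }
}
```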



Now we can see that there are almost no objects in the LOH, and this was expected. There are still some small allocated blocks, but since interning itself isn't disabled, these just represent the preallocated first block of the intern table and some other internal CLR structures.

Conclusion

 

So that confirms that by using the intern table we can very quickly fragment the LOH and get into trouble. What can be done, then? The best possible solution would be to roll your own intern pool: do something similar, but keep the memory block contiguous, and should we need to expand to another block, ensure that this memory is not pinned and can be freed by the GC. Another solution would be to skip the managed heap entirely and allocate memory on the OS heap, or just request a block of virtual memory (using VirtualAllocEx), but that would be unsafe code all the way and it's not easy, so for starters I would stick to the managed heap.
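To make the "roll your own pool" idea a bit more concrete, here is a deliberately simplified sketch. It is not the contiguous-buffer design described above: it swaps the buffer for a plain Dictionary, so it only illustrates the property that matters for fragmentation, namely that the pooled strings are ordinary unpinned objects that the GC can reclaim once the pool is cleared or dropped. SimpleStringPool is a hypothetical type, not part of .NET, and it is not thread-safe.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified pool: ordinary managed objects, nothing pinned.
sealed class SimpleStringPool
{
    private readonly Dictionary<string, string> _pool =
        new Dictionary<string, string>(StringComparer.Ordinal);

    public int Count { get { return _pool.Count; } }

    // Returns the pooled instance with the same contents, adding 'value' if it is new.
    public string GetOrAdd(string value)
    {
        if (value == null) throw new ArgumentNullException("value");

        string existing;
        if (_pool.TryGetValue(value, out existing)) return existing;

        _pool.Add(value, value);
        return value;
    }

    // Drops all references so the GC is free to collect the pooled strings.
    public void Clear()
    {
        _pool.Clear();
    }
}

static class Program
{
    static void Main()
    {
        var pool = new SimpleStringPool();
        string a = pool.GetOrAdd("THIS STRING IS POOLED!" + 1);
        string b = pool.GetOrAdd("THIS STRING IS POOLED!" + 1);

        Console.WriteLine(object.ReferenceEquals(a, b)); // True: the second call returns the first instance
        pool.Clear();
    }
}
```

Unlike the built-in intern table, a pool like this has a lifetime we control, at the price of one extra allocation per distinct string and a dictionary lookup per call.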

In future articles I will propose a solution to the string intern pool problem: a custom pool that is preallocated on the LOH and does not fragment memory. I will also explore the possibilities and consequences of such a pool (flexibility suffers a bit).

There are many more string hazards regarding costly allocations and non-optimal design choices for high-frequency uses, and hell, even the intern operation has its problems (for example, object.ReferenceEquals("str" + 1, "str" + 1) will return false), but that again is outside the scope of this article, and we will leave them for the future.

4 comments:

Anonymous said...

Why is your first example causing LOH fragmentation? It seems that the only objects you're allocating on the LOH are intern strings so no blocks are supposed to be freed and I would expect there to be contiguous blocks belonging to the string pool and no fragmentation.

Bartosz Adamczewski said...

@Anonymous As you can see, this is not the case here. You are right that this sort of fragmentation should not happen in this example, but as the intern blocks expand, some of the EE (Execution Engine) internal structures get deallocated, leaving empty space behind. Most notably this would be EEStringData, which needs to be recreated when adding a new interned string.

The other thing is that interned strings use m_pStringLiteralMap, which is a hash map whose buckets are kept in a linked list, so the entire memory is not allocated at the start, but it seems that it actually reserves some free space for new buckets.

Another thing is that some internal structures don't create GC handles or even notify any type of allocator that they do, so it might be that the GC simply does not know that the free space is not actually free.

And last but not least, the hashtable can be rehashed, and the rehashing allocation code in the CLR is a bit ... well, funky, so the pointer to the new bucket table is offset. This might be a bug, or the hash bucket array may simply reserve some space for buckets that have not been created yet.

Bartosz Adamczewski said...

@Anonymous,

To be 100% sure what happens one would need to debug the CLR (which I might do some day). Still, I don't actually think that this is some kind of bug, as it would have been fixed some time ago, but it hasn't been.

Anonymous said...

@Bartosz Thanks for the reply, makes sense.