Allocation is cheap… until it is not

We often hear that object allocation is “cheap” in .NET. I fully support this statement, as long as we remember its most important continuation: allocation is cheap, but allocating a lot of objects will hit you back sooner or later, because the garbage collector will kick in and start messing around. Thus, the fewer allocations, the better.

However, I would like to add a few words about the “allocation is cheap” part itself. It is true to a large extent because the typical object allocation path is indeed really fast. Most often the so-called bump-a-pointer technique is used. It consists of the following simple steps:

  • it uses the so-called allocation pointer as the address of the newly created object
  • it advances the allocation pointer by the requested size (so the next object will be created there)

For example, when running in Workstation GC mode, the following assembly code will be executed (in Server GC mode it is almost the same, but additionally INLINE_GETTHREAD will be used to get the current thread’s Thread Local Storage):
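
In essence, a fast-path helper such as JIT_TrialAllocSFastSP does the following, shown here as a C++ sketch rather than the original assembly. The variable and function names are mine and purely illustrative, not the actual CoreCLR identifiers:

    #include <cstdint>
    #include <cstddef>

    // A chunk of already-zeroed memory standing in for the gen0 allocation area.
    static uint8_t g_zeroed_region[1024];

    // State of the allocation context (set up and refilled by the runtime; simplified here).
    static uint8_t* alloc_ptr   = g_zeroed_region;                            // next object goes here
    static uint8_t* alloc_limit = g_zeroed_region + sizeof(g_zeroed_region);  // end of the zeroed region

    // Stub for the slow path - conceptually the jump to JIT_NEW.
    static void* SlowPathAllocate(void* methodTable, size_t size) { (void)methodTable; (void)size; return nullptr; }

    static void* FastAllocate(void* methodTable, size_t size)
    {
        uint8_t* result = alloc_ptr;
        if (result + size <= alloc_limit)                     // does the requested size fit below the limit?
        {
            alloc_ptr = result + size;                        // bump the pointer; the next object starts there
            *reinterpret_cast<void**>(result) = methodTable;  // an object begins with its method table pointer
            return result;
        }
        return SlowPathAllocate(methodTable, size);           // otherwise fall back to the slow path (JIT_NEW)
    }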

Besides the allocation pointer, .NET also manages an allocation limit – the boundary of the allocation context up to which memory has already been zeroed. This allows such memory to be used instantly, without the overhead of zeroing it while an object is being created. Zeroing it in advance has the additional advantage of warming the CPU cache before our application code accesses it.
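
To make the roles of those two values concrete, here is a small, purely illustrative C++ sketch of an allocation context being refilled. The struct mirrors the idea of the runtime’s allocation context (with its alloc_ptr and alloc_limit fields), but the refill routine and its name are my own invention:

    #include <cstdint>
    #include <cstring>
    #include <cstddef>

    // Illustrative allocation context; the real one lives inside the runtime.
    struct AllocContext
    {
        uint8_t* alloc_ptr;    // where the next object will be placed
        uint8_t* alloc_limit;  // everything up to this address is already zeroed
    };

    // Hypothetical refill routine: when the context gets a fresh chunk of segment memory,
    // the chunk is cleared once and alloc_limit marks the end of the zeroed region.
    static void RefillAllocContext(AllocContext& ctx, uint8_t* chunk, size_t chunkSize)
    {
        std::memset(chunk, 0, chunkSize);   // zero the whole chunk up front (and touch its cache lines)
        ctx.alloc_ptr   = chunk;
        ctx.alloc_limit = chunk + chunkSize;
    }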

The bump-a-pointer technique checks whether the requested size fits within the allocation limit. If not, it falls back to the slower allocation path, which we can see as the jump to JIT_NEW in the code above. It is the necessity of abandoning this fast path that makes the “allocation is cheap” phrase not always true. The slow path is realized as a quite complex state machine which tries to find a place of the required size. We can see it in the gc_heap::allocate_small and gc_heap::allocate_large methods, used for the Small Object Heap and the Large Object Heap respectively.

How complex is the slow path? Below I’ve illustrated the state machine for the allocate_small method. It starts in the a_state_start state when the fast allocation described above fails. This state unconditionally changes into a_state_try_fit, which calls the gc_heap::soh_try_fit() method. And so the whole story begins…

allocate_small
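
To give a feeling of its shape in code, the machine is essentially a loop over a state value with a big switch inside. The sketch below is a drastically reduced C++ approximation: it keeps only a few states, collapses the various GC-triggering states into a single made-up a_state_trigger_gc, and uses invented stubs for the helpers; the real allocate_small has many more states and transitions:

    #include <cstddef>

    // A drastically reduced approximation of the allocate_small state machine.
    enum allocation_state
    {
        a_state_start,
        a_state_try_fit,
        a_state_trigger_gc,
        a_state_can_allocate,
        a_state_cant_allocate
    };

    static bool soh_try_fit(size_t size) { (void)size; return true; }  // stub; sketched separately below
    static bool try_trigger_gc()         { return false; }             // stub for "trigger a garbage collection"

    static bool allocate_small(size_t size)
    {
        allocation_state state = a_state_start;
        while (true)
        {
            switch (state)
            {
            case a_state_start:
                state = a_state_try_fit;                       // unconditional transition
                break;
            case a_state_try_fit:
                state = soh_try_fit(size) ? a_state_can_allocate
                                          : a_state_trigger_gc;
                break;
            case a_state_trigger_gc:
                // In the real machine this is where a garbage collection may be triggered,
                // after which fitting the allocation is retried.
                state = try_trigger_gc() ? a_state_try_fit
                                         : a_state_cant_allocate;
                break;
            case a_state_can_allocate:
                return true;                                   // the allocation can be satisfied
            case a_state_cant_allocate:
                return false;                                  // out of memory will be reported
            }
        }
    }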

The names of the states, conditions and methods called are quite self-descriptive. The soh_try_fit method is itself also non-trivial: it first tries to use the free-item list to find an appropriate free space of the required size. It then tries to consume already committed segment memory. If that fails, it commits more from the reserved memory:

soh_try_fit
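
In pseudo-C++ the order of those attempts could be sketched like this; the three helper functions are only stand-ins for the real free-list, segment-end and commit logic, and their names are approximations rather than the exact CoreCLR ones:

    #include <cstddef>

    // Stand-ins for the three strategies; the real implementations live in gc.cpp.
    static bool fit_from_free_list(size_t size)        { (void)size; return false; }  // search the free-item list
    static bool fit_at_segment_end(size_t size)        { (void)size; return false; }  // use already committed memory
    static bool commit_more_from_reserved(size_t size) { (void)size; return false; }  // commit reserved segment memory

    static bool soh_try_fit(size_t size)
    {
        // 1. Try to reuse a gap recorded on the free-item list.
        if (fit_from_free_list(size))
            return true;

        // 2. Try to carve the object out of memory already committed at the end of the segment.
        if (fit_at_segment_end(size))
            return true;

        // 3. Otherwise, try to commit more memory from the reserved part of the segment.
        return commit_more_from_reserved(size);
    }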

Describing all the logic here is not my point. What I would like to point out is this: memory allocation is ALMOST ALWAYS cheap, but… it may SOMETIMES be surprisingly complicated. Executing the illustrated state machine has its own cost. Obviously, in the worst case, it will trigger garbage collection, which is the reason why it is best not to allocate at all in the first place. But even without a GC, we can spend a while in the allocator. Thus, after digging into those methods, making code less allocatey makes even more sense to me than before.

PS. There are several “allocation helpers” like the JIT_TrialAllocSFastSP listed above. The JIT (based on the allocated object’s properties and the runtime environment) will choose one of them. If you are interested, here’s a summary:

gcgethelper

6 comments

  1. Does .NET do scope testing at compile time (or even run time) to determine which objects should be allocated on the stack vs. which should be allocated on the heap? One facet of Golang that makes it fast is that it only allocates on the heap if it needs to, hence its GC runs are inherently smaller. Does .NET have a similar strategy? RETN/RETF (even at 35 cycles when stalled) is much, much cheaper than GC.

    Also, is the .NET GC multi-threaded as well as concurrent (or is it stop-the-world)?

    1. .NET has “in place” allocated types (e.g. structs) which are allocated with their owner, on the stack, etc., and only get a heap allocation of their own when boxed, which accomplishes the same feature explicitly rather than implicitly.

  2. “Zeroing it in advance has an important advantage of warming CPU cache before accessing from our application code”

    Cache warming would be beneficial if the zeroes were read, but most of the time they aren’t; the memory is just overwritten in the ctor. Is it really an advantage in that case?

  3. I just learned about the fetch-on-write-miss cache policy; I guess you are right and the cache warm-up is an advantage even if the data is overwritten.
