Hi all! I am thrilled to announce that after more than two years of intensive book writing, it is finally available for preorder! Its about 800 pages are solely dedicated to the topic of .NET memory management and its Garbage Collector. With many, many internal workings of all this. I believe, personally, that there is currently no single book or even finite set of articles online that give so comprehensive insight into this topic.


announce

As a person who sincerely loves .NET and related performance topics – and spent quite a lot of time diagnosing various .NET memory-related issues – I’ve just needed to write such book. And as it covers all recent changes in .NET Core 2.1 (including Span, Memory or pipelines), I believe there is no better time to publish such book!

Let me give you an excerpt from the introduction of the book, which should explain my intentions when writing it:

Continue reading

This blog post was written as part of the preparations while writing the book about .NET, which will be announced in a few weeks. If you want to be informed about its publication and receive auxiliary materials, feel free to subscribe to my newsletter. Many thanks to Stephen Toub that helped in reviewing this code.

Async programming becomes more and more popular. While being very convenient in use, from performance perspective there are scenarios where regular Task-returning async methods have one serious drawback: they need to allocate a new Task to represent the operation (and its result). Such heap-allocated Task is unavoidable in the truly asynchronous path of execution because async continuations are not guaranteed to be executed on the same thread – thus the operation must persist on the heap, not on the stack.

However, there are cases where async operations may complete synchronously (because of really fast meeting some conditions). It would be nice to avoid heap-allocating Task in such case, created just to pass the result of an operation. Exactly for such purpose ValueTask type was introduced in .NET Core 2.0 (and the corresponding AsyncValueTaskMethodBuilder handling the underlying state machine). Initially, it was a struct made as a discriminated union, which could take one of two possible values:

  • ready to use the result (if the operation completed successfully synchronously)
  • a normal Task which may be awaited on (if the operation become truly asynchronous)

In other words, ValueTask helps in handling synchronous path of async method execution. Thanks to being a struct (which will be allocated on the stack or enregistered into CPU register) synchronous result of the operation can be returned without heap allocations. And only in case of an asynchronous path, a new Task will be heap-allocated eventually by underlying machinery:

But what if we were able to not heap-allocate Task even in case of an asynchronous path? This may be useful in very, very high-performance code, avoiding async-related allocations at all. As already said, something on the heap must represent our async operation because there is no thread affinity. But why not use heap-allocated, pooled objects for that, reused between successive async operations?

Indeed, since .NET Core 2.1, ValueTask can also wrap an object implementing IValueTaskSource interface. Such an object can be pooled and reused to minimize allocations. It represents our operation and underlying AsyncValueTaskMethodBuilder is aware of it, calling appropriate methods of IValueTaskSource interface. Here is how the previous example looks with the help of custom IValueTaskSource implementation described in this post:

Thanks to pooling, even on the asynchronous path there will be no allocations (at least most often, if our pool is used efficiently). We can await such method in a regular way and underlying machinery will take care of it.

Note. If you would like to hear more about Task, ValueTask and IValueTaskSource again, in similar but other words, please look at great Task, Async Await, ValueTask, IValueTaskSource and how to keep your sanity in modern .NET world post by Szymon Kulec.

Implementation details

Although the rationale behind IValueTaskSource seems to be clear, as well as its usage presented above, implementing it is not trivial. When implementing IValueTaskSource interface, we must implement three following methods:
* GetResult – called only once, when the async state machine needs to obtain the result of the operation
* GetStatus – called by the async state machine to check the status of the operation
* OnCompleted – called by the async state machine when wrapping ValueTask has been awaited. We should remember here the continuation to be called when the operation completes (but if it already has been completed, we should call the continuation immediately)

Seems easy, right? Read on to see if it really is! All source code described here is available in my PooledValueTaskSource repository on GitHub. There are quite many comments in the code but this post explains most relevant parts as well. Do not be also surprised with many diagnostic Console.Write in this code – it serves to illustrate the internal working of this class in prepared example program (also available in the repository).

In my custom implementation, I use object pooling based on ObjectPool class based on the internal class taken from Roslyn source code and a little refactored (with renaming mostly) – I’ve omitted it here for brevity as not so relevant. From our perspective here there are obvious Rent and Return methods, period.

In my implementation, I am also mostly based on code from AwaitableSocketAsyncEventArgs in System.Net.Sockets.Socket
and AsyncIOOperation in ASP.NET IIS Integration code. What I’ve tried to do is to provide Minimal Valuable Product that is correct and working (stripping as much as possible from the mentioned code).

My custom IValueTaskSource represents an operation that returns a string that is being read from the provided file. Obviously, one would probably like to introduce a more generic class with generic result type and action being provided as a lambda expression. However, to not clutter such example too much, I’ve decided to prepare it in such “hardcoded”, specific scenario. Feel free to contribute more generic versions!

Let’s start from fields that FileReadingPooledValueTaskSource contains:

The most important fields of FileReadingPooledValueTaskSource include:

  • Action< object > continuation – it represents a continuation to be executed when our operation ends
  • string result – it keeps the result of our operation (in case of success)
  • Exception exception – it keeps an Exception instance that happened during executing our operation (in case of failure)
  • short token – current token value given to a ValueTask and then verified against the value it passes back to us. This is not meant to be a completely reliable mechanism, doesn’t require additional synchronization, etc. It’s purely a best effort attempt to catch misuse, including awaiting for a value task twice and after it’s already being reused by someone else
  • object state – state internally used by asynchronous machinery
  • static readonly Action<object> CallbackCompleted – sentinel object used to indicate that the operation has completed prior to OnCompleted being called

Let’s now look at each of IValueTaskSource method implementation. GetResult is quite easy – it will be called only once by underlying state machine when we inform that our operation has completed (by GetStatus method explained soon). Thus, we need to reset the object state (to be reusable), return it to the pool and return the result (or throw an exception in case of failure):

GetStatus is called by the state machine to check the current status of our operation. In my case, I assume it is completed if result is no more null. Depends on the exception field, it is then succeeded or faulted:

The most complex is OnCompleted method implementation. It is being called by the underlying state machine if wrapped ValueTask is being awaited. Two scenarios may happen here that must be handled:

  • if an operation has not yet completed – we store the provided continuation to be executed once the operation is completed
  • if an operation has already completed – in such case our internal continuation should be already set to CallbackCompleted value. If it so, we simply invoke the continuation here

Please note how much code is dedicated to properly get the context of the continuation (with respect to provided ValueTaskSourceOnCompletedFlags):

This concludes implementing IValueTaskSource methods but we need to add two more crucial pieces into this puzzle: a method that starts our operation and method that is called when an operation completes asynchronously.

The first one, named by my as simple as RunAsync (called in our example at the beginning of the article) is responsible for executing the main work:

I’ve implemented here simulation of some operation that may both complete immediately (synchronously) and asynchronously. In case of asynchronous path, returned ValueTask pass the result so we avoid allocations again. In case of asynchronous case the key is to return ValueTask that wraps… ourselves. It may be then awaited on, while we also started asynchronous processing (simulated by thread pool work in our case).

When asynchronous operation finishes, NotifyAsyncWorkCompletion method will be called (remember – in the real-world scenario this would be some callback registered in asynchronous IO or other low-level API). The responsibility of this method is simple:

  • it stores result and/or exception
  • if the operation is not yet awaited (in such case this.continuation will be null) – it only sets continuation to CallbackCompleted. Continuation will be executed in OnCompleted method when ValueTask will be awaited
  • if the operation is already awaited (in such case this.continuation contains awaited continuation) – it executes continuation in the appropriate context (which again is quite a complex process)

 

Above-mentioned clearing and returning to the pool is implemented in ResetAndReleaseOperation (yes I know, SRP is dying here, refactor!). The only field we cannot clear is token, which is solely dedicated to detecting incorrect re-usage of those objects:

And… only such little code is necessary to avoid heap-allocating in case of async operations!

Remarks

  • I do not claim that my code in current form is ideal. Quite opposite, I still expect it to be by far ideal! It serves as an illustration and base for further development. Please, feel invited to comment and to contribute to making it better!
  • Current repository is oversimplified – due to the work on the book, I do not have time to reorganize it properly (especially include comprehensive unit tests). Again, feel free to contribute!

 

zerogclead

A few months ago I wrote an article about Zero GC in .NET Core 2.0. This proof of concept was based on a preview version of .NET Core 2.0 in which a possibility to plug in custom garbage collector has been added. Such “standalone GC”, as it was named, required custom CoreCLR compilation because it was not enabled by default. Quite a lot of other tweaks were necessary to make this working – especially including required headers from CoreCLR code was very cumbersome.

However upcoming .NET Core 2.1 contains many improvements in that field so I’ve decided to write follow up post. I’ve also answered one of the questions bothering me for a long time (well, at least started answering…) – how would real usage of Zero GC like in the context of ASP.NET Core application?

.NET Core 2.1 changes

Here is a short summary of most important changes. I’ve updated CoreCLR.Zero repository to reflect them.

  • first of all, as previously mentioned, now standalone GC is pluggable by default so no custom CoreCLR is required. We will be able to plug our custom GC just by setting a single environment variable:
  • as standalone GC matured, documentation in CoreCLR appeared
  • a great improvement is that code between library implementing standalone GC and CoreCLR has been greatly decoupled. Now it is possible to include only a few files directly from CoreCLR code to have things compiled:

    Previously I had to create my own headers with some of the declarations from CoreCLR copy-pasted which was obviously not maintanable and cumbersome.
  • loading path has been refactored slightly. InitializeGarbageCollector inside CoreCLR calls GCHeapUtilities::LoadAndInitialize() with the following code inside:

    Inside LoadAndInitializeGC there is a brand new functionality – verification of GC/EE interface version match. It checks whether version used by standalone GC library (returned by GC_VersionInfo function) matches the runtime version – major version must match and minor version must be equal or higher. Additionaly, GC initialization function has been renamed to GC_Initialize.
  • core logic of my the poor man’s allocator remained the same so please refer to the original article for details

ASP.NET Core 2.1 integration

As this CoreCLR feature has matured, I’ve decided do use standard .NET CLI instead of CoreRun.exe. This allowed me to easily test the question bothering me for a long time – how even the simplest ASP.NET Core application will consume memory without garbage collection? .NET Core 2.1 is still in preview so I’ve just used Latest Daily Build of .NET CLI to create WebApi project:

I’ve modified Controller a little to do something more dynamic that just returning two string literals:

Additionally, I’ve disabled Server GC which is enabled by default. Obviously setting GC mode does not make sense as there is no GC at all, right? However, Server GC crashes runtime because GC JIT_WriteBarrier_SVR64 is being used which requires valid card table address – and there are no card tables either 🙂

Then we simply compile and run, remembering about the environment variable:

Everything should be running fine so… congratulations! We’ve just run ASP.NET Core application on .NET Core with standalone GC plugged in which is doing nothing but allocating.

Benchmarks

I’ve created the same WebApi via regular .NET Core 2.0 CLI for reference. Then via SuperBenchmarker I’ve started simple load test: 10 concurrent users making 100 000 requests in total with 10 ms delay between each request.

.NET Core 2.1 with Zero GC:

zerogcaspnetcore02

.NET Core 2.0:

zerogcaspnetcore03

As we can see classic GC from .NET Core was able to process slightly more requests (357.8 requests/second) comparing to version with Zero GC plugged in. It does not surprise me at all because my version uses the most primitive allocation based on calloc. I’m quite surprised that Zero GC is doing so well after all. However, this is not so interesting because I assume that replacing calloc with a simple bump a pointer allocation would improve performance noticeably.

What is interesting is the memory usage over time. As you can see in the chart below, after a minute of such test, the process using Zero GC takes around 1 GB of memory. This is… quite a lot. Not sure yet how to interpret this. Version with regular GC ended with a stable 120 MB size. Both started from fresh run.

zerogcaspnetcore

This would mean that each REST WebApi requests triggers around 55 kB of allocations. Any comments will be appreciated here…

Update 30.01.2018: After debugging allocations during single ASP.NET requests, most of them comes from RouterMiddleware. This is no surprise as currently this application does almost nothing but routing… I’ve uploaded sample log of such single request which seems to be minimal (others are allocating some buffers from time to time). It consumes around 7 kB of memory.

We can often hear that allocation of objects is “cheap” in .NET. I fully support this sentence because the most important part is its continuation – allocation is cheap but allocating a lot of objects will hit you back as sooner or later garbage collector will kick in and start messing around. Thus, the fewer allocations, the better.

However, I would like to add a few words about “allocation is cheap” itself. This is true to some extent because the typical path of objects allocation is indeed really fast. So-called bump a pointer technique is most often used. It consists of the following simple steps:

  • it uses so-called allocation pointer as an address of a newly created object
  • it increases allocation pointer by the requested size (so next object will be created there

Continue reading

ThreeDotNetos

I would like to announce with pleasure the initiative of Three Dot Netos. I am very excited because the preparations have been going on for several months. And here it is finally. I can officially and publicly announce it!

But… what?

We get in the car and start on the road through Poland. 5 cities, day by day. Every evening a different city and other people but the same topics – .NET performance, .NET internals and other advanced .NET themes. Hell of a ride for your brain! There will be no mercy. If you’re bored with sessions on .NET at other conferences, now you should be happy! Of course, this is not meant to be an empty talk. All topics discussed will be practical. But we will not repeat again the same boring “reference types are on the heap and value types are on the stack“. Oh no no! Detailed agenda will be announced in a few weeks. But be sure it will be interesting.

In the first edition, we will speak Polish. But who knows what the future will bring. However, if you know some Polish guys – tell them about us! Spread the word, this will always be helpful for planning further initiatives. Please 🙂

Continue reading

Tune screenshot

I would like to present you a new tool I’ve started to work on recently. I’ve called it The Ultimate .NET Experiment (Tune) as its purpose is to learn .NET internals and performance tuning by experiments with C# code. As it is currently in very early 0.2 version, it can be treated as Proof Of Concept with many, many features still missing. But it is usable enough to have some fun with it already.

The main way of working with this tool is as follows:

  • write a sample, valid C# script which contains at least one class with public method taking a single string parameter. It will be executed by hitting Run button. This script can contain as many additional methods and classes as you wish. Just remember that first public method from the first public class will be executed (with single parameter taken from the input box below the script). You may also choose whether you want to build in Debug or Release mode (note: currently it is only x64 bit compilation).
  • after clicking Run button, the script will be compiled and executed. Additionally, it will be decompiled both to IL (Intermediate Language) and assembly code in the corresponding tab.
  • all the time Tune is running (including time during script execution) a graph with GC data is being drawn. It shows information about generation sizes and GC occurrences (as vertical lines with the number below showing which generation has been triggered).

Continue reading

Zero Garbage Collector on .NET Core

Starting from .NET Core 2.0 coupling between Garbage Collector and the Execution Engine itself have been loosened. Prior to this version, the Garbage Collector code was pretty much tangled with the rest of the CoreCLR code. However, Local GC initiative in version 2.0 is already mature enough to start using it. The purpose of the exercise we are going to do is to prepare Zero Garbage Collector that replaces the default one.

Zero Garbage Collector is the simplest possible implementation that in fact does almost nothing. It only allows you to allocate objects, because this is obviously required by the Execution Engine. Created objects are never automatically deleted and theoretically, no longer needed memory is never reclaimed. Why one would be interested in such a simple GC implementation? There are at least two reasons:Continue reading

Trace Compass

.NET Core on Linux is still very fresh in 2017. First production deployments are just beginning to emerge. Consequently, development on this platform is only beginning to show up. There is a lack of knowledge and good practices related to virtually every aspect of the existence of this environment. One of them is monitoring and diagnostic aspect. How can we monitor and analyze the health of our application?

The easiest way of getting tracing data is by using official perfcollect bash script and then using Perfview on Windows to analyze this recorded data. This approach has some drawbacks. The main one is are fairly limited analysis results available in PerfView. The second, less burdensome, is the need for Windows to… analyze Linux data. Recently Sasha Goldstein has created a lot of valuable material on this subject and I invite you to review the list posted at the end of this post.

I would like to present another diagnostic option here. This is using the free Eclipse Trace Compass tool. Continue reading

Programmer

…or what type are you?

I have been in the business for some years and have worked with a couple of companies, small and large ones. I have met quite a large number of developers, have been in a couple of teams. Maybe it’s not a so big experience, yet it somehow gives me a chance to share with my retrospectives and analysis with you. Let me share with you some of my… characterological thoughts. We are gifted with various temperaments, interests, opinions and other things that come to our mind. We are different privately, yet we are different at work. I have noticed some groups, let’s call them types of people, that our development society can be divided into. This is my attempt of such classification based on my own experience. If you feel I have missed some species unknown to me, please leave it in in the comments!Continue reading

Note: This is a first entry of a new series about Microsoft .NET CLR internals. I encourage you to ask – the most interesting questions will become a similar posts in the future! This one was inspired by Angelika Piątkowska.

How does Object.GetType() really work?

Extending this question: How an object knows what type it is? Is it the compiler or the runtime that knows that? How this is related to the CLR? As C# is statically and strongly typed, maybe the method call GetType() does not really exist, and the compiler can replace it with the appropriate result already at compile time?Continue reading