War Story – The Mystery Of The Very Long GC Pauses in .NET Windows Service

As part of my consultancy job, I have the pleasure of helping various customers with problems that could be collectively described as GC-related (or memory-related in general). One day Tamir Dresher from Clarizen (incidentally, the author of Rx.NET in Action) contacted me with an extremely interesting message (emphasis mine):

We are experiencing a phenomenon of GC duration of 15 minutes in our backend servers. (…) Do you think we can have a session with you and perhaps you’ll have ideas on how to find the root cause?

15 minutes! That’s an eternity! When we see something like this, one thought comes to mind – something really serious must be happening there! As most such problems can nowadays be diagnosed remotely, after signing NDAs we could go straight into attacking the problem. Clarizen provided a very well-prepared, concise summary of their architecture and current findings.

Their internal monitoring tool, based on Kibana, indeed showed the GC introducing seemingly endless pauses into a .NET-based Windows Service. They completely froze its “job” processing – jobs being represented by incoming messages from RabbitMQ. Below is an example of a single day of monitoring, showing the number of various jobs (different bar colours) processed over time. Clearly, every few hours there is a mysterious gap – representing “no processing” periods of up to 15 minutes!

Job processing pauses during the day.

Engineers from Clarizen correlated those pauses with the % Time in GC counter hitting 100% for minutes at a time:

% Time in GC measurement during a day.

The process size peaked at 25–30 GB, so while not exactly small, it was certainly not the reason for such gigantic pauses. The observed pauses could not be correlated with any other event or metric, including generation sizes or overall CPU usage. Only that mystic % Time in GC hitting 100% for minutes!

Unfortunately, the problem could not be reproduced ad hoc and, more interestingly, it was happening on only one of the server clusters and no others. It smelled like an environment/infrastructure problem!

Investigation

Even so, the first thing I wanted to verify was which GC mode those processes were running in. A Windows Service is treated as a desktop application and, as such, runs Workstation GC by default. I’ve seen high inefficiencies in data processing caused by exactly that, because a Windows Service much more often works like a request-processing “server”. But it turned out that Clarizen had already tried that path: everything was set up properly, and the “best” results were observed with Background Server GC (while with Workstation flavours the pauses hit even half an hour!).
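For reference, on the .NET Framework the GC flavour of such a service is selected in its application config file. A minimal sketch of the Background Server GC setup (the runtime section will typically contain other settings as well):

```xml
<!-- MyService.exe.config – a sketch of enabling Background Server GC -->
<configuration>
  <runtime>
    <!-- A Windows Service defaults to Workstation GC; opt in to Server GC explicitly -->
    <gcServer enabled="true"/>
    <!-- Background (concurrent) GC is the default flavour; shown here only for explicitness -->
    <gcConcurrent enabled="true"/>
  </runtime>
</configuration>
```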

It was clearly high time we got a deeper insight into the GC’s inner workings. So… PerfView to the rescue! We started with a typical overall view of the GC by running an ETW GCOnly session for hours and waiting for the problem to occur:
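Such a lightweight, GC-events-only session is cheap enough to leave running for hours. A sketch of the PerfView command line (file name and duration are arbitrary here):

```
:: Collect only GC events, without the GUI;
:: stop automatically after 4 hours at the latest.
PerfView.exe collect gc-session.etl /GCCollectOnly /AcceptEula /nogui /MaxCollectSec:14400
```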

The recorded session confirmed the mysterious pauses, but made them even more unexplainable than before:

Very long GC pauses visible (pause time in milliseconds) in the GCStats report from PerfView

As we can see, there is a series of very slow GCs, some even 80 seconds long! An ephemeral GC (of generation 0 or 1) taking that much time is abnormal, for sure. The question was: what was causing it?!

Other events only confirmed this nonsense. Like a 9-second-long, unexplained wait at the very beginning of the GC – between managed thread suspension and the start of the GC itself:

Some ETW GC-related events in the Events view from PerfView

Moreover, from the MarkWithType ETW events we could see that most of that ridiculous gap time was spent in the Mark phase of the GC:

MarkWithType ETW events showing very long marking time during the Mark phase

Looking at those events for various HeapNum and Type values reveals delays of up to several seconds each! And they eventually sum up to the 40–80-second-long overall GC times. But as we can also note, the promoted size is not huge, so those long times could not be explained by traversing abnormally large heaps (and the whole process heap was not that big after all, as mentioned earlier).

I decided to go even deeper into the GC analysis and capture its CPU sampling profile (following the path described by Maoni Stephens in her You Should Never See This Callstack in Production blog post):
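For such CPU sampling, a default PerfView collection (which includes CPU sample events) suffices; a circular buffer keeps the trace bounded while waiting for the pause. A sketch, with arbitrary file name and buffer size:

```
:: A default PerfView collection includes CPU sampling;
:: /CircularMB keeps only the most recent data, so the session can run indefinitely.
PerfView.exe collect cpu-session.etl /AcceptEula /nogui /CircularMB:2048
```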

Unfortunately, for no clear reason, we were unable to get such measurements, even after a few tries. And to be honest, I expected to see there only a confirmation of the huge time spent in the mark-related functions of the GC.

The next step would have been to take a memory dump of the frozen service during one of the mysterious GC pauses. And it indeed happened, but… not on my side 🙂

Solution

Fortunately, at the same time, Dmitry Rudakov, one of the Clarizen engineers, was able to identify the issue by digging into various memory dumps taken during the mysterious pauses. Moreover, the problem seemed to be getting worse, as process crashes were now happening as well. From his great analysis, the following findings paved the way to identifying the root cause:

1) He was able to identify something that looked like a deadlock inside the GC’s mark phase code – all GC threads were waiting on the SVR::t_join::join method from within SVR::gc_heap::mark_phase (one of them in SVR::gc_heap::scan_dependent_handles called from it). As Dmitry suspected, this may have been caused by some kind of heap corruption. That such corruption indeed occurred was confirmed by running the !VerifyHeap SOS command on several of the memory dumps taken.
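For the record, such a heap-consistency check on a memory dump can be sketched as a short WinDbg/SOS session (the SOS version loaded must match the runtime captured in the dump):

```
$$ Load SOS matching the CLR loaded in the dump
.loadby sos clr
$$ Dump managed stacks of all threads – look for GC threads stuck in mark_phase
~*e !clrstack
$$ Walk the whole managed heap and report any corruption found
!VerifyHeap
```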

2) In one of the crash dumps, an Access Violation exception was caught. It happened directly inside a GC thread, within the SVR::gc_heap::background_mark_simple1 method.

Thus, both abnormal behaviours somehow led back to the marking phase of the GC (which confirms our earlier findings). At this point Dmitry identified two possible causes of such heap corruption:

1) Oracle.DataAccess.dll used with some ORM seems to cause such problems (and it was indeed used in the diagnosed service). Similar cases can be found online; they all more or less clearly connect the problem with this library and some bug related to the Background GC.

2) Antivirus software injecting its modules into the job server process. The suspicion arose from an Access Violation, observed once, caused by a module injected by it (obfuscated here as SomeAntivirusDLL).

With those suspects identified, the follow-up action plan was quite clear: disable Background GC on one machine, disable the antivirus software on another, and observe the results.

Disabling Background GC

Disabling the concurrent flavour of the GC (by setting the gcConcurrent flag to false in the config) helped, but not entirely. The problem seemed to disappear, but the non-concurrent GC was just too inefficient for the service’s typical workload. The observed % Time in GC and processing times were not acceptable. Still, it was at least a confirmation that Background GC had something to do with this bug.
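The non-concurrent variant tested here corresponds to a single switch in the application config (a sketch):

```xml
<configuration>
  <runtime>
    <!-- Server GC stays on, but the concurrent (Background) flavour is switched off -->
    <gcServer enabled="true"/>
    <gcConcurrent enabled="false"/>
  </runtime>
</configuration>
```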

Service behaviour with Background GC disabled (marked region).

Disabling antivirus software

Guess from the graph below where this change was applied! Clearly, from that day on the issue was gone. In fact, only completely uninstalling the antivirus was a proper fix – excluding just the .NET assemblies was not enough. Please note that this fix worked even though Background Server GC was still running.

Service behaviour with antivirus software disabled.

At this point the issue was considered fixed. After uninstalling this particular antivirus software from all machines, the mysterious long GC pauses disappeared.

Conclusion

The GC is a very interesting beast. It operates on the vast majority of the process memory. This makes it very fragile to any memory corruption that may happen within the process. In other words, it is not the GC that causes Access Violations or other mysterious side effects – it is the fact that it operates so intensively on the process memory. The more complex concurrent GC flavour only intensifies this fragility, as it is much easier for concurrently running threads to expose an invalid memory state.

This is exactly the case with the bug considered here. It was not a .NET GC misconfiguration, nor a .NET GC bug, nor a bug in the application code. Injected modules of the antivirus software were causing some subtle memory corruption, and the GC was just an innocent victim of it. Many hypotheses could be proposed as to why it manifested in this particular way. The marking phase of the GC traverses the entire managed heap, so maybe that subtle corruption of the heap state pushed it into very long processing loops. A much deeper analysis would probably provide a direct answer to this mystery. Obviously, once a solution had been found, the time investment in a deeper investigation was unjustified.

But this is where a very interesting aspect of GC-related issues comes in – solving them is very often a mix of expertise, intuition and… a bit of luck. Very rarely does the analysis run straight from one planned point to another. Kudos goes to Dmitry Rudakov, who eventually solved the issue with exactly such a mix of ingredients!

I hope this article gives you some insight into how you can attack your own long GC pause issues, should you run into some in the future. See you!

Disclaimer: I have deliberately obfuscated all details that could point to the specific antivirus software used here. I do not believe it contains such a major bug manifesting everywhere, so it would be unfair to blame it in general. I rather believe it is a subtle combination of environment, software version, maybe even the .NET Framework version used, that manifests as such a bug in this particular scenario. Rather than blaming a particular antivirus product, the purpose of this article is to show how you can attack such a problem and… what an interesting solution may eventually be found. In particular – one completely unrelated to our own software.
