Analyzing runtime CoreCLR events from Linux – Trace Compass


.NET Core on Linux is still very fresh in 2017, and the first production deployments are only beginning to emerge. Consequently, tooling around this platform is only beginning to show up, and there is a lack of knowledge and good practices for virtually every aspect of running in this environment. One of those aspects is monitoring and diagnostics. How can we monitor and analyze the health of our application?

The easiest way of getting tracing data is to use the official perfcollect bash script and then use PerfView on Windows to analyze the recorded data. This approach has some drawbacks. The main one is the fairly limited analysis capabilities available in PerfView. The second, less burdensome, is the need for a Windows machine to… analyze Linux data. Recently Sasha Goldstein has created a lot of valuable material on this subject, and I invite you to review the list posted at the end of this post.

I would like to present another diagnostic option here: the free Eclipse Trace Compass tool. As its main page describes it:

Eclipse Trace Compass is an open source application for viewing and analyzing any type of logs or traces. Its goal is to provide views, graphs, metrics, and more to help extract useful information from traces, in a way that is more user-friendly and informative than huge text dumps.

Among the various supported formats, the most important for us is CTF (Common Trace Format), in which events are generated by the LTTng mechanism used by CoreCLR. Trace Compass looks like a mix of the PerfView and Windows Performance Analyzer tools – if you have had contact with them, you may guess what I mean. It is powerful and allows us to do great things. But unfortunately, like the two programs mentioned, it has a very steep learning curve. The extensive configuration options make it hard to know where to start when you run it for the first time. It took me some time to adjust this tool to perform a sensible analysis of the data generated by CoreCLR, and I want to share the results with you.

Note: If you do not know how to start monitoring your .NET Core application, follow the official CoreCLR instructions. It is not complicated. You should get the perfcollect script from the CoreCLR GitHub repository. Then you only need to execute:
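The command below follows the official perfcollect instructions (note that the script must first be made executable):

```shell
# make the downloaded script executable, then install tracing prerequisites
chmod +x perfcollect
sudo ./perfcollect install
```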

which will install the perf_event and LTTng tools on your Linux machine. Then, to start a tracing session:
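Again per the official perfcollect instructions (sampleTrace is just an example session name):

```shell
# the traced .NET Core process must run with tracing enabled:
#   export COMPlus_EnableEventLog=1
sudo ./perfcollect collect sampleTrace
# ... exercise your application, then press Ctrl+C to stop collecting
```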

After stopping the session, you will end up with a ZIP file containing the recorded data. You can of course control the LTTng session manually; if you want to do that, I refer you to the above-mentioned materials.

Getting Trace Compass

You can download Trace Compass binaries for Linux, Windows and OS X from the page mentioned above. You can obviously compile it from source code if you wish. The tool has quite extensive documentation, and if you decide to use it more often than just for this article, I strongly recommend at least skimming it.

Opening trace file

Assuming you have a perfcollect recording (or you may use a sample one prepared by me (25 MB)), please unzip it to some folder. The LTTng data we are interested in is in the lttngTrace subfolder, more specifically in a path following the schema lttngTrace\auto-20170801-103533\ust\uid\1000\64-bit. To open it in Trace Compass, select File -> Open Trace… and select the metadata file. The default view (see Figure 1) includes two main parts: a list of all events (a tab named 64-bit in the case of our sample file) and a histogram of event instances over time.


Figure 1. The default view of LTTng trace

We can take a moment to look at the events tab where, as you can see, each event comes with its accompanying fields (including the generic context._vpid and context._vtid – respectively, the process ID and the thread ID from which the event was generated). You can search and filter that view by typing in the first row. The histogram, on the other hand, only shows the number of events over time and in that sense is not very helpful. We can close it, along with the other tabs: Control, Control Flow, Resources, Properties and Bookmarks. After that, we should end up with only the Project Explorer, Statistics and tracing tabs. Such a view, however, is not particularly useful yet, and this is where the complex customization process begins.

For now we will just load the coreclr_analyses.xml file containing ready-made analyses I prepared, and then I will explain one by one how they were created and what they show. To do that, it is best to close the current trace by selecting Clear from the context menu under Tracing -> Traces in the Project Explorer tab. Download the coreclr_analyses.xml file from my repository and store it somewhere. Then select Manage XML analyses… from the same context menu. In the window that appears, select Import and point to the file you just downloaded. Then open the same trace once again. Three new views should be visible under the Tracing -> Traces -> 64-bit -> Views item (see Figure 2).


Figure 2. Three new custom views of LTTng trace

By expanding any of the new custom views, you will see additional available views. You can double-click any of them to add it to the main view.

All these views are based on a Trace Compass feature called data-driven analysis. It allows you to specify, in various ways, how sequences of events should be interpreted by providing special XML files.

CoreCLR.GC.collections

Let’s start with the simplest custom view. It is based on a simple pattern matching garbage collection start and end events. Each such pair generates a so-called “segment”, which is understood simply as a time interval with a name and optional attributes. Such an analysis is carried out in Trace Compass with a finite state machine (FSM) describing the transitions of interest (reactions to subsequent events) and the related actions. Below is a brief structure of such an analysis (for simplicity I have removed the part matching the start and end of the same GC):
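As a rough sketch (element names follow the Trace Compass XML pattern schema; the ids, state names and queried attributes here are illustrative – the exact XML in coreclr_analyses.xml differs in details):

```xml
<pattern version="0" id="coreclr.gc.collections">
  <head>
    <label value="CoreCLR.GC.collections"/>
  </head>
  <patternHandler>
    <!-- action fired when a GC start/end pair is matched:
         emit a segment named after the collected generation -->
    <action id="gc_segment">
      <segment>
        <segType>
          <segName>
            <stateValue type="query">
              <stateAttribute type="constant" value="#CurrentScenario"/>
              <stateAttribute type="constant" value="Depth"/>
            </stateValue>
          </segName>
        </segType>
      </segment>
    </action>
    <fsm id="gc" initial="wait_start">
      <state id="wait_start">
        <transition event="DotNETRuntime:GCStart_V2" target="wait_end"/>
      </state>
      <state id="wait_end">
        <transition event="DotNETRuntime:GCEnd_V1"
                    target="gc_done" action="gc_segment"/>
      </state>
      <final id="gc_done"/>
    </fsm>
  </patternHandler>
</pattern>
```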

The name of each segment corresponds to the generation on which the GC was performed (the segName section in the above description). The views generated by this analysis therefore include a list of all garbage collections per generation and their statistics (see Figures 3 and 4) – the segment duration is called latency.


Figure 3. Statistics of all GCs during recorded trace – name indicates generation.


Figure 4. List of all GCs during recorded trace – including additional parameters like Type and Reason.

This means that in our sample trace there were, for example, six GCs in generation 2, and they took almost 10 ms on average. Type and Reason are additional recorded fields which come from the GCStart_V2 event (V2 is not yet documented, but the fields used are the same as in the GCStart_V1 event).

CoreCLR.threads.state

This is by far the most complex custom view I have made so far. It uses yet another powerful Trace Compass feature that creates Gantt-like diagrams from XML-based data-driven analyses. In my example, you can open it by double-clicking CoreCLR.threads.state.view under the CoreCLR.threads.state view. Just to give an overview of the underlying FSM, here is the beginning of its definition:
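In condensed, illustrative form (the real definition in coreclr_analyses.xml is considerably longer; the Status value and ids here are placeholders):

```xml
<pattern version="0" id="coreclr.threads.state">
  <patternHandler>
    <!-- a scenario is bound to a single thread: this condition compares the
         scenario's stored ThreadID with the vtid of the incoming event -->
    <test id="thread_condition">
      <if>
        <condition>
          <stateValue type="query">
            <stateAttribute type="constant" value="#CurrentScenario"/>
            <stateAttribute type="constant" value="ThreadID"/>
          </stateValue>
          <stateValue type="eventField" value="context._vtid"/>
        </condition>
      </if>
    </test>
    <!-- reaction: set the scenario's numeric Status field -->
    <action id="status_suspended">
      <stateChange>
        <stateAttribute type="constant" value="#CurrentScenario"/>
        <stateAttribute type="constant" value="Status"/>
        <stateValue type="int" value="3"/>
      </stateChange>
    </action>
    <fsm id="threads">
      <state id="running">
        <transition event="DotNETRuntime:GCSuspendEEBegin_V1"
                    target="suspended" action="status_suspended"/>
        <!-- … further transitions and states elided … -->
      </state>
    </fsm>
  </patternHandler>
</pattern>
```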

This fairly complex state machine responds to individual CoreCLR (mostly GC-related) events by changing the state of one of the so-called “scenarios”. In this case, a scenario corresponds to a single thread, thanks to the thread_condition condition. In other words, an event most often changes the state of only one selected thread, assigned to a given scenario. This is not the case for some events, like GCSuspendEEBegin_V1, which impact all current managed threads. The actions associated with each of these events (reactions) primarily change the Status field of a given scenario, which is simply a numerical value. If you are interested in what the entire state machine I designed looks like, I invite you to look at the coreclr_analyses.xml file.

The Status field of each scenario is then interpreted by the timeGraphView component, listed below:
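In sketch form (the definedValue numbers, colors and attribute paths are illustrative, not the exact contents of coreclr_analyses.xml), it maps the numeric Status onto named, colored states and creates one entry per thread:

```xml
<timeGraphView id="coreclr.threads.state.view">
  <head>
    <analysis id="coreclr.threads.state"/>
    <label value="CoreCLR.threads.state.view"/>
  </head>
  <!-- map numeric Status values to named, colored states -->
  <definedValue name="Running"   value="1" color="#118811"/>
  <definedValue name="Suspended" value="3" color="#BB1111"/>
  <definedValue name="GC"        value="5" color="#1111BB"/>
  <!-- one line per thread: named by its ThreadID,
       colored by the current Status value -->
  <entry path="Threads/*">
    <display type="constant" value="Status"/>
    <name type="self"/>
  </entry>
</timeGraphView>
```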

This component visualizes each scenario in a separate line, which gives us one line per thread, colored according to the thread’s current state and named according to its ThreadID. This allows for a nice overview of the application state (see Figure 5). In particular, after zooming in, it shows us nice details of a single GC run (see Figure 6).


Figure 5. Threads overall view.


Figure 6. Single background GC, creating concurrent GC thread.

In the above examples, we see details of a single generation 2 garbage collection which triggered the creation of a GC thread for the non-concurrent run of a background GC.

CoreCLR.GC.generations.ranges

The last option is creating XY graphs based on data provided by events. It is of course especially tempting to visualize all kinds of measurable metrics, such as the sizes of generations and the like. One event is especially useful here – DotNETRuntime:GCGenerationRange, generated for each generation at the end of each GC run (see Figure 7).


Figure 7. DotNETRuntime:GCGenerationRange events emitted at the end of GC run.

We can consume its Generation, RangeUsedLength and RangeReservedLength fields to visualize generation sizes. Such an analysis is based on a simpler mechanism and does not require a separate FSM. It is just an event handler reacting to a particular event:
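In sketch form, assuming the Trace Compass XML state provider schema (the id and attribute paths are illustrative), the handler stores the used length of each generation in the state system, from which an xyView then plots one series per generation:

```xml
<stateProvider version="0" id="coreclr.gc.generations.ranges">
  <eventHandler eventName="DotNETRuntime:GCGenerationRange">
    <stateChange>
      <!-- store RangeUsedLength under Generations/<Generation> -->
      <stateAttribute type="constant" value="Generations"/>
      <stateAttribute type="eventField" value="Generation"/>
      <stateValue type="eventField" value="RangeUsedLength"/>
    </stateChange>
  </eventHandler>
</stateProvider>
```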

We obtain a graphical visualization of the size of the generations over time, which can be very useful in an analysis (see Figure 8).


Figure 8. XY visualization of generation sizes in time.

Note: There is yet another very interesting event, DotNETRuntime:GCHeapStats_V1, but unfortunately its payload is currently interpreted as a byte array, so it is not possible to consume it. I am working on a development branch of Trace Compass, trying to use the incoming scripting feature to overcome this issue.

The final results

All this allows us to customize Trace Compass for a fairly convenient analysis of the collected traces. Of course, there is still a lot to do, but such an analysis already supports some preliminary conclusions: how frequently and why GC runs occur, and how memory consumption changes over time. Reviewing the list of events may give you an idea of the details.


Figure 9. CoreCLR analysis different views altogether.

Note: To speed up the display, configure the event list filter to exclude the very frequent GCSampledObjectAllocationHigh event.

Note #2: The sample analysis file coreclr_analyses.xml probably still requires some enhancements and may display wrong results for server GC.

Currently, I am working on extending the statistical analysis. In particular, it would be very helpful to consume allocation statistics. I encourage you to experiment and to help me with that!

CoreCLR on Linux materials by Sasha Goldstein:
