Testing shows the presence of errors in a product, but “cannot prove that there are no defects” – you probably know that quote. I remember so many hours spent on debugging those little, mean bugs hidding deeply in the code edge cases. But what’s worse, I remember even more hours trying to understand and reproduce an error that happens only in production environment. Here’s the first top 5 most popular issues I’ve met during last years:
- app hangs due to deadlocks (in the app or external library)
- memory issues like memory leaks, long GC pauses or high CPU usage due to the GC
- swallowed exception preventing some logic, with no logs available
- threading issues like thread-pool starvation
- intermittent errors due to the resources shortage, like running out of sockets or file handles
BTW. And what’s yours top 5?
As a part of my consultancy job, I have a pleasure to help various customers with problems that could be described collectively as GC-related (or memory-related in general). One day Tamir Dresher from Clarizen company (BTW, an author of Rx.NET in Action) contacted me with such an extremely interesting message (emphasis mine):
We are experiencing a phenomenon of GC duration of 15 minutes in our backend servers. (…) Do you think we can have a session with you and perhaps you’ll have ideas on how to find the root cause?
15 minutes! That’s an infinity! If we see something like this, one thought comes to mind – something really serious must be happening there! As nowadays most of such problems may be diagnosed remotely, after signing NDAs we could go straight into attacking the problem. Clarizen has provided a very well-prepared and concise summary of their architecture and current findings.Continue reading