HACKER Q&A
📣 beatthatflight

Keeping track of random stack traces


Every so often in a piece of software we're testing, you get a random crash. If you're unlucky it's a nice race condition: it hasn't happened in 50 runs in a row, then boom, it happens again.

In terms of tracking it for the future, what suggestions do people have? I can backlog it as a bug, but it's not going to be easily searchable. A dev could pick it up, but without a method to reproduce, it's not easily fixed in a sprint either.

And it's also hard to know whether it ever gets fixed!

But it's still a crash, and I personally hate not documenting crashes, no matter how rare. I'd just like a better way to manage them.


  👤 wallstprog Accepted Answer ✓
Pretty low-tech, but what we do is to create an md5 hash of the whole stack trace as a single string. Before hashing, we munge some bits to help make similar stack traces hash to the same value:

- remove file/line#

- omit the bottom (top) frame, which can be different between environments

- convert certain constructs to a common format (e.g., "unknown module" (clang) to "???" (valgrind))

- translate "func@@GLIBC_version" => "func"

This works well enough in practice for our purposes (identifying regressions and suppressing specific reports from valgrind/AddressSanitizer).

We also maintain an xref between the md5 hash and the full stack trace.
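
A rough Python sketch of the whole scheme (the regexes here are placeholders; the real patterns depend on your trace format):

    import hashlib
    import re

    def normalize_frame(frame: str) -> str:
        """Munge a frame so equivalent traces hash to the same value."""
        frame = re.sub(r"\(\S+:\d+\)", "", frame)        # drop file/line#
        frame = frame.replace("unknown module", "???")   # clang -> valgrind spelling
        frame = re.sub(r"@@GLIBC_[\w.]+", "", frame)     # func@@GLIBC_version -> func
        return frame.strip()

    def trace_fingerprint(trace: str) -> str:
        frames = trace.splitlines()[:-1]  # omit the bottom frame; it varies by environment
        joined = "\n".join(normalize_frame(f) for f in frames)
        return hashlib.md5(joined.encode()).hexdigest()

    # the xref from hash back to one full example of the trace
    xref: dict[str, str] = {}

    def record(trace: str) -> str:
        digest = trace_fingerprint(trace)
        xref.setdefault(digest, trace)
        return digest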


👤 mceachen
You didn't specify where your software is running. I'm building software that my users install and run on their own hardware.

I send errors to Sentry.io when the error contains a novel stacktrace for the user and the user hasn't disabled error reporting. I also send recent log messages and some other info, like the OS and hardware architecture (so I can reproduce it on my end). [1]

PhotoStructure uses a SHA of the stacktrace to discriminate between different errors. This certainly can group different problems together, but in practice those problems are related.

Only sending novel stacktraces prevents a user from clogging up my Sentry dashboard, and from wasting my users' bandwidth. PhotoStructure imports huge libraries, and before I added this squelching, I could have a single user send tens of thousands of reports (when the "error" turned out to be an ignorable warning due to the camera they were using writing metadata that was malformed but still parseable).
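
PhotoStructure runs on Node, but here's roughly what that squelching looks like, sketched with Sentry's Python SDK and its before_send hook (not our actual code, and the in-memory seen-set is a stand-in for whatever persistent store you'd really use):

    import hashlib
    import traceback

    import sentry_sdk

    _seen: set[str] = set()  # stand-in: persist this across runs in a real app

    def before_send(event, hint):
        """Only let an event through if its stack trace is novel."""
        exc_info = hint.get("exc_info")
        if exc_info:
            trace = "".join(traceback.format_exception(*exc_info))
            digest = hashlib.sha256(trace.encode()).hexdigest()
            if digest in _seen:
                return None  # returning None tells Sentry to discard the event
            _seen.add(digest)
        return event

    sentry_sdk.init(
        dsn="https://examplekey@o0.ingest.sentry.io/0",  # placeholder DSN
        before_send=before_send,
    )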

If you're building a SaaS, and you own the hardware your software is running on, just send all errors to Sentry.

Sentry does a good job in helping me triage new errors, marking when errors crop back up, and highlighting which build seems to have introduced a novel error.

Keep in mind that the stacktrace may not be relevant if that section of code or the upstream code is modified. I use automatic upgrading on all platforms to keep things consistent.

[1] https://photostructure.com/faq/error-reports/


👤 babygoat
Sentry is great for this.

👤 nitwit005
At a previous company there was a home-built service: a database of unhandled Java exceptions. It attempted to generate a hash value for each exception so that you could see how often they were happening and graph them over time.

Highly imperfect, of course, and it created separate entries for some exceptions that included random numbers in their messages. But it did put pressure on people to clean them up.
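
The usual workaround for the random-numbers problem is to mask digits out of the message before hashing; a hypothetical sketch:

    import hashlib
    import re

    def exception_fingerprint(exc_type: str, message: str, frames: list[str]) -> str:
        """Hash an exception so that "timeout after 3012ms" and
        "timeout after 2998ms" land in the same bucket."""
        masked = re.sub(r"\d+", "N", message)  # 3012 -> N
        payload = "\n".join([exc_type, masked, *frames])
        return hashlib.sha256(payload.encode()).hexdigest()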


👤 fierarul
A little automation can help you here. Errors could be auto-transformed into bugs somewhere, and duplicates just add +1s or votes, etc. How you detect duplicates depends on your configuration, but it should be doable (e.g., use the stacktrace SHA as a tag).
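
For example, assuming a generic REST-style tracker (these endpoints are made up, not any real tracker's API):

    import hashlib

    import requests

    TRACKER = "https://tracker.example.com/api"  # hypothetical tracker API

    def file_or_bump(stack_trace: str) -> None:
        """File a bug for a new trace, or +1 the existing one."""
        tag = "trace-" + hashlib.sha256(stack_trace.encode()).hexdigest()[:12]
        hits = requests.get(f"{TRACKER}/issues", params={"tag": tag}).json()
        if hits:
            requests.post(f"{TRACKER}/issues/{hits[0]['id']}/votes")  # duplicate: +1
        else:
            requests.post(f"{TRACKER}/issues", json={
                "title": f"Random crash {tag}",
                "body": stack_trace,
                "tags": [tag],
            })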

👤 dev_north_east
I feel you.

> I can backlog it as a bug, but it's not going to be easily searchable.

In my experience, I've marked it as a bug, commented with the stack trace, and marked it as U. Then when it arises again, hopefully someone searches for a part of the stack trace and gets lucky, or, more often than not, I (or others) will hear of the crash and relay the bug info. The bug gets updated with any new info, and life continues until it crashes again... Not perfect by any means. I'd love to hear how others deal with this.


👤 RabbitmqGuy
I have an upcoming product in this space. It basically lets you send errors and their stack traces to Datadog. You can then search, aggregate, filter, etc., your stack traces.

You can email me if this interests you (email is in my bio)


👤 drewg123
I'd suggest looking at backtrace.io. It may be overkill for what you want to do, but one thing it does really well is log stack traces.