For example, I might have a notification system where users are reporting missed messages. They give concrete examples, and I can verify that, given the state of the system, the notification should have been sent, but evidently it wasn't. This involves multiple systems, changing state, and many potential failure points. It is a legacy system written by others, so I have an incomplete mental model of it, and I obviously can't reproduce the error.
I'm an experienced software developer, but whenever I encounter an error like the above, my steps include:

* Backtrack from where something is supposed to happen, looking for potential logical bugs.
* Add more logging.
* Repeat.
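To make the "add more logging" step concrete, here is a minimal sketch of the kind of instrumentation I mean. Everything in it is hypothetical: the `Event` fields, the `should_notify` function, and the skip reasons are invented for illustration and do not come from the real system, which is far less tidy. The idea is simply to log every branch of a send decision together with an identifier that can be matched against a user's report.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("notifications")


@dataclass
class Event:
    # Hypothetical fields, for illustration only.
    event_id: str
    user_id: str
    user_opted_in: bool
    channel_available: bool


def should_notify(event: Event) -> bool:
    """Decide whether to send, logging every branch with the event id."""
    if not event.user_opted_in:
        logger.info("skip event=%s user=%s reason=opted_out",
                    event.event_id, event.user_id)
        return False
    if not event.channel_available:
        logger.info("skip event=%s user=%s reason=no_channel",
                    event.event_id, event.user_id)
        return False
    logger.info("send event=%s user=%s", event.event_id, event.user_id)
    return True
```

Searching the logs for a reported event id then shows which branch, if any, that event actually hit.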
It's not very efficient! I'm tempted to throw the system out and replace it, but we all know what kind of a trap that is.
So how do engineers here attack a problem like this? Are there activities you do, e.g. building flow diagrams, that help guide you through the process?