Anyway, the business was seasonal and entering spring (our big sales time), and I see a flood of errors coming in on the DEV environment.
I silenced them because I figured someone was working on it. Nope. Our customers were working that checkout process and the payment gateway was failing hard.
Yeah, so 12 hours later I get a fire drill call from the VP of Finance asking why his morning reports show no sales.
I was too green to realize my mistake right away, and it took another 2 hours of banging my head on things before I saw the glaring mistake.
Impact: 25 to 30k, but maybe less, as customer service made lots of follow-up calls to close sales.
Love your customer service people. They are the ones who humanize your mistakes.
Software can create a lot of value when hooked up to directly optimise a real-world decision process for a large project, but it can also destroy a fair bit of that value.
[1] Regression tests were expressed as file-diff tests, i.e. diffing output against an expected "golden" output file. This meant the tests were very non-specific and failed any time anyone changed anything that altered the output even slightly. Maintaining the regression test suite was a lot of busywork for engineers. The test suite did not encode the properties or invariants that the software was meant to maintain, so after changing the decision engine or one of the pre- or post-processing passes, engineers had to make a judgement call on whether the new output was "expected" or not and then, if they decided the new output was "good", copy the "actual" output across to become the new "expected" golden output.
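For contrast, here is a rough Python sketch of what an invariant-style check could look like instead of a golden-file diff. Everything in it is hypothetical: `assign_jobs` is a toy stand-in for a decision engine, and the two invariants (every job assigned once, no machine over capacity) are made up for illustration, not taken from the system described above.

```python
from typing import Dict, List


def assign_jobs(job_hours: Dict[str, int], machines: List[str],
                capacity: int) -> Dict[str, str]:
    """Toy decision engine (hypothetical): longest job first, least-loaded machine."""
    load = {m: 0 for m in machines}
    assignment: Dict[str, str] = {}
    for job, hours in sorted(job_hours.items(), key=lambda kv: -kv[1]):
        target = min(machines, key=lambda m: load[m])
        if load[target] + hours > capacity:
            raise ValueError(f"no machine has room for {job}")
        load[target] += hours
        assignment[job] = target
    return assignment


def check_invariants(job_hours: Dict[str, int], machines: List[str],
                     capacity: int, assignment: Dict[str, str]) -> List[str]:
    """Return the violated invariants; an empty list means the output is acceptable."""
    problems: List[str] = []
    # Invariant 1: the set of assigned jobs is exactly the set of input jobs.
    if set(assignment) != set(job_hours):
        problems.append("not every job is assigned exactly once")
    # Invariant 2: no machine is loaded beyond its capacity.
    load = {m: 0 for m in machines}
    for job, machine in assignment.items():
        if machine not in load:
            problems.append(f"{job} assigned to unknown machine {machine}")
            continue
        load[machine] += job_hours.get(job, 0)
    for machine, hours in load.items():
        if hours > capacity:
            problems.append(f"{machine} over capacity: {hours} > {capacity}")
    return problems


jobs = {"a": 5, "b": 3, "c": 3, "d": 2}
machines = ["m1", "m2"]
result = assign_jobs(jobs, machines, capacity=8)
violations = check_invariants(jobs, machines, 8, result)
```

A check like this keeps passing when a change reshuffles which job lands on which machine, and only fails when a stated property breaks, so nobody has to eyeball a diff and decide whether the new output is "good".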
I hit a wall where, in order to go further, I had to do a major redesign. This redesign could have come in the form of incremental refactoring or a total rewrite. I sought discussion on rewrites vs refactors to help me decide which path to take, and initially went with the veterans' advice to do an incremental refactor.
I started out with the refactor, but since the project was relatively small, not in production, and developed solely by me, I figured screw it - maybe a rewrite will be faster.
But this rewrite caused me so many problems. The veterans were right - there's so much knowledge and already-accounted-for edge-cases & bug-fixes in that old code. Bugs and cases that I'd apparently forgotten about. Maybe better documentation would have helped the rewrite, but I still think an incremental refactor would have been more productive.
The rewrite took way longer than anticipated and the redesign, although better, wasn't worth the time, and it was incredibly inefficient re-discovering all the little problems that the old code accounted for.
One day I was assigned to look at it. I discovered that the import feature worked by spawning multiple subprocesses to process the import in parallel (PHP app). In one place, a calculation determined how many CPU cores were available on the host and then divided that number in half to decide how many processes to spawn; the division was added later to avoid overburdening the machine. In another place there was the same code to determine the number of processes, but without the division. That second value was used to decide how to split the file for processing, so the two numbers didn't match. The new math had simply not been applied in both places.
I found the issue and fixed it by adding `/ 2`. Three characters whose absence had probably cost the company thousands of hours of relatively highly paid employees manually correcting the data from the broken import process. In addition, this was supposed to be a new product replacing a supposedly unmaintainable legacy product, and the broken import process really hamstrung the migration.
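The more durable fix for this kind of bug is to compute the number in one place only. A minimal sketch of that idea, in Python rather than the app's PHP; `worker_count`, `split_rows` and `plan_import` are hypothetical names, not the product's actual code:

```python
import os
from typing import List, Optional


def worker_count(cpu_cores: Optional[int] = None) -> int:
    """Single source of truth: half the cores, but never fewer than one worker."""
    cores = cpu_cores if cpu_cores is not None else (os.cpu_count() or 1)
    return max(1, cores // 2)


def split_rows(rows: List[str], parts: int) -> List[List[str]]:
    """Split rows into `parts` contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(rows), parts)
    chunks: List[List[str]] = []
    start = 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        chunks.append(rows[start:start + size])
        start += size
    return chunks


def plan_import(rows: List[str], cpu_cores: Optional[int] = None) -> List[List[str]]:
    """Return one chunk per worker, using the same count for splitting and spawning."""
    n = worker_count(cpu_cores)    # computed once...
    chunks = split_rows(rows, n)   # ...used to split the file...
    assert len(chunks) == n        # ...and one process would be spawned per chunk
    return chunks


rows = [f"row-{i}" for i in range(10)]
chunks = plan_import(rows, cpu_cores=8)
```

With both the split and the spawn reading from `worker_count`, a later tweak to the formula can't be applied in one place and forgotten in the other.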
The whole product was a mess, but that really illustrated how much of a mess it was to me. A lot of it was caused by a couple of individuals who were incredibly productive at writing sloppy and repetitive code that frequently made problems worse rather than better.
The way this worked out on the boards meant that the "I have messages for you" long distance calls would happen twice instead of once for each board, and where there was message traffic traveling both ways the system was building new batches on top of the remnant messages, so the next call would also have a remnant message, requiring another call to clear the queue.
We nailed that after a couple weeks, but it cost dozens of board owners larger outbound call fees than they had planned on that month.
To be expected tho when you hire a junior and give them no senior oversight
Users had been using the app for free for a week before I realized what was going on lol, probably lost just $50 in revenue but it did teach me a lesson