What is your silliest and costliest (time or money) programming mistake?

Question

simplecto · Accepted Answer

I once mishandled the API keys and endpoints for dev and production environments, effectively sending dev alerts as prod and prod alerts as dev.
The business was seasonal and entering spring (our big sales time), and I saw a flood of errors coming in on the DEV environment.
I silenced them because I figured someone was working on it. Nope. Our customers were working that checkout process and the payment gateway was failing hard.
Yeah, so 12 hours later I get a fire drill call from the VP of finance asking why his morning reports show no sales.
I was too green to realize my mistake right away and took another 2 hours of bumping my head on things before seeing the glaring mistake.
Impact: 25 to 30k but maybe less as customer service made lots of follow up calls to close sales.
Love your customer service people. They are the ones who humanize your mistakes.
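One cheap guard against this kind of mix-up is a startup check that refuses to run when the key or endpoint does not match the declared environment. A minimal sketch, assuming a naming convention the original post does not describe: API keys prefixed with the environment name and endpoints containing it as a hostname label. `AlertConfig`, `validate`, and the example URLs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertConfig:
    environment: str  # "dev" or "prod"
    endpoint: str     # hypothetical: hostname contains ".dev." or ".prod."
    api_key: str      # hypothetical: key starts with "dev_" or "prod_"


def validate(config: AlertConfig) -> None:
    """Raise ValueError if the key or endpoint belongs to another environment."""
    if config.environment not in ("dev", "prod"):
        raise ValueError(f"unknown environment: {config.environment!r}")
    if not config.api_key.startswith(f"{config.environment}_"):
        raise ValueError("API key does not match the declared environment")
    if f".{config.environment}." not in config.endpoint:
        raise ValueError("endpoint does not match the declared environment")


# Passes silently: everything agrees on "prod".
validate(AlertConfig("prod", "https://alerts.prod.example.com/v1", "prod_abc123"))
```

Calling `validate` once at process start turns a swapped key into a failed deploy instead of a day of misrouted alerts.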

shoo · Answer

Software product used by client to optimise decisions for large scale construction project, to minimise construction cost (think: https://en.wikipedia.org/wiki/Facility_location ). Regression introduced into some of the algorithms used to make subtle but important optimisation decisions. Regression wasn't caught by QA process, review process or regression test suite[1]. From memory the defect ran in production > 1 year before anyone noticed and was used by client to make a bunch of sub-optimal construction decisions. Cost to client: estimated in seven figure range.
Software can create a lot of value when hooked up to directly optimise a real world decision process for a large project, but it can also destroy a fair bit of that value too.
[1] Regression tests were expressed as file-diff tests, i.e. diff the output against an expected "golden" output file. This meant the tests were very non-specific and failed any time anyone changed anything that altered the output slightly. Maintaining the regression test suite was a lot of busywork for engineers, and the suite did not encode the properties or invariants that the software was meant to maintain. After a change to the decision engine or one of the pre- or post-processing passes, engineers had to make a judgement call on whether the new output was "expected" or not and then, if they decided the new output was "good", copy the "actual" output across to become the new "expected" golden output.
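For contrast, here is roughly what a test that encodes an invariant can look like, as opposed to a golden-file diff. This is a toy sketch, not the product described above: `solve` is a brute-force stand-in for the decision engine, and the cost model (fixed cost per open site plus each customer's distance to the nearest open site) is an assumption chosen to resemble facility location.

```python
from itertools import combinations


def total_cost(open_sites, fixed_cost, dist):
    """Fixed cost of open sites plus each customer's distance to its nearest open site."""
    return sum(fixed_cost[s] for s in open_sites) + sum(
        min(row[s] for s in open_sites) for row in dist
    )


def solve(fixed_cost, dist):
    """Toy stand-in for the decision engine: try every non-empty subset of sites."""
    sites = range(len(fixed_cost))
    candidates = (
        c for r in range(1, len(fixed_cost) + 1) for c in combinations(sites, r)
    )
    return min(candidates, key=lambda c: total_cost(c, fixed_cost, dist))


def check_invariants(solution, fixed_cost, dist):
    """Properties that must hold regardless of how the output is formatted."""
    assert len(solution) > 0, "at least one site must be open"
    assert all(0 <= s < len(fixed_cost) for s in solution), "unknown site index"
    cost = total_cost(solution, fixed_cost, dist)
    # The chosen plan must never be worse than any trivial single-site plan.
    for s in range(len(fixed_cost)):
        assert cost <= total_cost((s,), fixed_cost, dist), "worse than a trivial plan"


fixed_cost = [10, 10, 1]          # cost to open each site
dist = [[0, 5, 9], [5, 0, 9]]     # dist[customer][site]
check_invariants(solve(fixed_cost, dist), fixed_cost, dist)
```

A check like this keeps passing when output formatting changes, and fails when the engine starts returning plans that are worse than an obvious baseline, which is the kind of regression a diff against a golden file leaves to human judgement.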

VoodooJuJu · Answer

Doing a rewrite.
I hit a wall where, in order to go further, I had to do a major redesign. This redesign could have come in the form of incremental refactoring or a total rewrite. I sought discussion on rewrites vs refactors to help me decide which path to take, and initially went with the veterans' advice to do an incremental refactor.
I started out with the refactor, but since the project was relatively small, not in production, and developed solely by me, I figured screw it - maybe a rewrite will be faster.
But this rewrite caused me so many problems. The veterans were right - there's so much knowledge and already-accounted-for edge-cases & bug-fixes in that old code. Bugs and cases that I'd apparently forgotten about. Maybe better documentation would have helped the rewrite, but I still think an incremental refactor would have been more productive.
The rewrite took way longer than anticipated and the redesign, although better, wasn't worth the time, and it was incredibly inefficient re-discovering all the little problems that the old code accounted for.

frompdx · Answer

This was not a mistake I made, but it was one that I fixed. A product I worked on had a CSV import feature with an option to run the import in parallel to speed things up. For a while this worked fine. Then one day imports using the parallelization feature started only importing a portion of the CSV and generally leaving a mess that had to be manually cleaned up. No one seemed to know what was wrong.
One day I was assigned to look at it. I discovered that the import feature worked by spawning multiple subprocesses to process the import in parallel (PHP app). There was a calculation to determine how many CPU cores were available on the host, and some math added later divided that number in half to decide how many processes to spawn, to avoid overburdening the machine. In another place there was the same code to determine the number of processes, but without the division, causing a mismatch. That second value was used to determine how to split the file for processing. In other words, the new math for determining how many processes to spawn had not been applied in both places.
I found the issue and fixed it by adding `/ 2`. Three characters whose absence had probably cost the company thousands of hours of having relatively highly paid employees manually correct the data from the broken import process. In addition, this was supposed to be a new product replacing a supposedly unmaintainable legacy product, and the broken import process really hamstrung the migration process.
The whole product was a mess, but that really illustrated how much of a mess it was to me. A lot of it was caused by a couple of individuals who were incredibly productive at writing sloppy and repetitive code that frequently made problems worse rather than better.
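The underlying fix for this class of bug is to compute the worker count in exactly one place and have both the splitter and the spawner consume that value. A minimal sketch in Python rather than the PHP of the real app; `worker_count`, `split_rows`, and `plan_import` are hypothetical names, and the half-the-cores rule is taken from the description above.

```python
import os


def worker_count() -> int:
    """Half the available cores, at least 1. The only place this is computed."""
    return max(1, (os.cpu_count() or 1) // 2)


def split_rows(rows, parts):
    """Split rows into `parts` contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(rows), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + base + (1 if i < extra else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks


def plan_import(rows, workers=None):
    """Return one chunk per worker; both numbers come from the same value."""
    workers = worker_count() if workers is None else workers
    chunks = split_rows(rows, workers)
    # Guard against the mismatch described above: chunks and workers must agree.
    assert len(chunks) == workers
    return chunks


chunks = plan_import(list(range(10)), workers=4)
```

With the count passed around as a value instead of recomputed, a later tweak to the formula cannot leave the file split for more processes than are actually spawned.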

h2odragon · Answer

Long ago, I had written some software that passed messages among BBS systems, a 'mail tosser' type thing where some messages would be for this board, and others might be passing on to another. Because of a silly mistake, when there was more than one message in a bundle, the last wasn't processed.
The way this worked out on the boards meant that the "I have messages for you" long distance calls would happen twice instead of once for each board, and where there was message traffic traveling both ways the system was building new batches on top of the remnant messages, so the next call would also leave a remnant message requiring another call to clear the queue.
We nailed that after a couple weeks, but it cost dozens of board owners larger outbound call fees than they had planned on that month.

NZ_Matt · Answer

Used an event queue for sending emails and stupidly set it to retry continuously when the event failed. The events were failing, but only after successfully sending the email, so hundreds of customers received 20 copies of the same email before I realised what was up.
To be expected tho when you hire a junior and give them no senior oversight.
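The usual defence here is to make the handler idempotent (remember which event IDs have already been sent) and to cap the number of retries. A rough sketch, not the original system: `EmailHandler` and `process_with_retries` are hypothetical names, and the in-memory set stands in for whatever durable store a real queue consumer would need.

```python
class EmailHandler:
    """Sends each event's email at most once, keyed by event ID."""

    def __init__(self, send):
        self._send = send          # callable(recipient, body)
        self._sent_ids = set()     # assumption: a durable store in real use

    def handle(self, event_id, recipient, body):
        """Return True if an email was sent, False if this event was already handled."""
        if event_id in self._sent_ids:
            return False
        self._send(recipient, body)
        self._sent_ids.add(event_id)
        return True


def process_with_retries(handler, event, post_send_step, max_attempts=3):
    """Retry the whole event a bounded number of times; return the attempt that succeeded."""
    for attempt in range(1, max_attempts + 1):
        try:
            handler.handle(*event)
            post_send_step()       # the part that was failing after the send
            return attempt
        except Exception:
            if attempt == max_attempts:
                raise              # give up instead of retrying forever
```

Even if the post-send step keeps failing, the retry re-enters `handle`, finds the event ID already recorded, and skips the send, so the customer gets one email rather than one per retry.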

soulchild37 · Answer

Was doing some tests on the premium feature of my iOS app, then after the tests I forgot to put the paywall and buy button back on the app UI, and submitted to the App Store.
Users had been using the app for free for one week before I realized what was going on lol, probably lost just $50 in revenue but it did teach me a lesson.