At another company I took a minor marketing landing page offline when a massive burst of traffic came in (it made some expensive API calls, so I thought it was a DDoS attack). Turns out it was legitimate traffic: the marketing team had done a big ad spend to generate more leads and didn't tell me. All those expensive leads got 404s.
Maybe some database indexing changes that performed a lot worse for lots of users and had to be reverted. Certainly deploying some protocol incompatibilities, either inadvertently or out of sequence.
One surprising one was using composite primary keys for a misc table, then realizing that some downstream Go service was getting { "id": [1, 2], ... } from the upstream Ruby one. We needed to validate the schema on write rather than waiting for consumers to fail to parse.
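To make that failure mode concrete, here's a rough Go sketch (the Record struct and validateOnWrite helper are hypothetical, not the actual services): the consumer only discovers the composite key at parse time, whereas a write-time check against the expected shape would have rejected the payload before it was ever published.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Record is what the downstream Go service expects: a scalar id.
type Record struct {
	ID   int64  `json:"id"`
	Name string `json:"name"`
}

// validateOnWrite is a toy producer-side check: decode into a generic map
// and reject anything where "id" isn't a scalar number, before publishing.
func validateOnWrite(payload []byte) error {
	var generic map[string]any
	if err := json.Unmarshal(payload, &generic); err != nil {
		return fmt.Errorf("invalid JSON: %w", err)
	}
	if _, ok := generic["id"].(float64); !ok { // JSON numbers decode to float64
		return fmt.Errorf("id must be a scalar number, got %T", generic["id"])
	}
	return nil
}

func main() {
	// What the upstream Ruby service emitted for the composite-key table.
	payload := []byte(`{"id": [1, 2], "name": "misc"}`)

	// Consumer side: the problem only surfaces at parse time, long after the write.
	var r Record
	fmt.Println("consumer parse error:", json.Unmarshal(payload, &r))

	// Producer side: the same problem caught before it ever leaves the writer.
	fmt.Println("write-time validation:", validateOnWrite(payload))
}
```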
Disaster recovery stories are much more interesting, like Hollywood blockbusters. One of my faves is un-f*ing an OS/2 HPFS partition on the west coast over the phone using DOS Norton Utilities 'nu'. Luckily the client was IBM and they had lots of identically configured machines, so we just blasted the central drive shape definitions (in specific sectors at the start and middle of the drive) over from a neighbouring machine and ran checkdisk with the "recover anything that looks like a valid HPFS structure" option.
I wasn’t on that team when that issue came in. But lo and behold, a senior dev told me, “just so you know, you can’t assume a rudder only rotates 90 degrees”. As he told me the story, I put two and two together that it was the same cruise ship I watched on TV.
Luckily, you always have manual failovers, and it was an easy fix. But it did leave some egg on my face :).