I'm looking for good examples of how easy it is to forget or miss a certificate expiry when people leave, teams re-org, systems change, etc.
Here's mine, for example: I created a cert with a 5-year expiry for my startup's primary DB. My thinking: in 5 years we'll have millions of users and a whole team of smart people working on the DB. We did have millions of users, and the smart team, but no one thought to check the cert. We had to re-learn the signing process during the outage.
Any good horror stories to share? Monitoring best practices?
I've also seen expired certs on obscure government endpoints used to validate information about people. The most challenging part was not technical but finding someone who could fix it and cared enough to treat it with any urgency. The only way to monitor those was with internal monitoring tools: in most cases the endpoints are firewall-restricted, so any tool someone builds would have to be self-hosted. We would notify them ahead of time. Then notify them again. Then try to call someone. The bigger challenge with these was not expiration but chain validation: they would not install intermediate certs and would then expect us to install their certs on our tens of thousands of servers as a workaround. Nobody would believe me if I said which agency had this problem most often, but this will be our little secret.
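For anyone who wants a starting point for the self-hosted kind of check, here is a rough sketch in Python using only the standard library. It is not what we actually ran. The names `days_left`, `check`, and `WARN_DAYS`, along with the 30-day threshold, are made up for illustration. It connects with the default trust store, so a missing intermediate should surface as a verification failure rather than a silent pass, and an already-expired cert fails the same way.

```python
import socket
import ssl
from datetime import datetime, timezone

WARN_DAYS = 30  # assumed alert threshold, not from the thread; tune to taste


def days_left(not_after: str, now: datetime) -> float:
    """Days between `now` and a cert's notAfter string, e.g. 'Jan 11 00:00:00 2030 GMT'."""
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - now).total_seconds() / 86400


def check(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return a one-line status for a TLS endpoint: OK, WARN, or FAIL."""
    # The default context verifies the chain and hostname against the system trust store.
    ctx = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                not_after = tls.getpeercert()["notAfter"]
    except ssl.SSLCertVerificationError as exc:
        # Covers expired certs and broken chains (e.g. a missing intermediate).
        return f"FAIL {host}: {exc.verify_message}"
    except OSError as exc:
        # DNS failures, refused connections, timeouts, other TLS errors.
        return f"FAIL {host}: {exc}"
    remaining = days_left(not_after, datetime.now(timezone.utc))
    status = "WARN" if remaining < WARN_DAYS else "OK"
    return f"{status} {host}: {remaining:.0f} days left"
```

Run `check("some-host")` from cron on a box inside the firewall and send anything that is not OK to a channel that a team owns, not a single person's inbox, since the person is the part that tends to leave first.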