So, my question to the group is - who do you hold (or who should be held) accountable when cloud costs spike unexpectedly: the engineers who write the code, the platform team that manages the infrastructure, or the product managers who set the requirements? (My current answer is a mix of the platform team and the engineers, but we're still trying to formalize the accountability model.)
Each application team should be able to view the total cost of running their service - and thus be held accountable for reducing costs when necessary.
Without data you are running blind. Cost optimization cannot be solved by a standalone team - it has to be owned by everyone.
Source: Personal experience reducing cloud costs in a slightly smaller team.
You can do click-ops in the UI (which then generates YAML in a repo), or you can write the YAML files in your repo yourself. These YAML files define the owner - the team entity or individual the cost originates from. It's a mostly automated process.
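For reference, this sounds like Backstage's standard catalog format (Backstage comes up below); a minimal sketch of such a file, with made-up service and team names:

```yaml
# catalog-info.yaml - registers the service and its owning team with Backstage
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: checkout-service          # hypothetical service name
  description: Handles checkout and payment flows
spec:
  type: service
  owner: team-payments            # costs roll up to this team entity
  lifecycle: experimental         # see the lifecycle note below
```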
Since every resource an application uses is known, anomalies can be tracked down and attributed. So, for example, if someone starts serving big files from anywhere other than the CDN and blows up egress costs, the source and root cause are easy to identify.
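A rough sketch of how that attribution query can look on AWS with boto3, assuming costs carry an `owner` cost allocation tag (the tag key and dates are assumptions):

```python
import boto3

# Cost Explorer client; requires cost allocation tags to be activated
ce = boto3.client("ce")

# Daily unblended cost for a week, grouped by the "owner" tag, so a
# sudden egress spike shows up under the team that caused it.
response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-05-01", "End": "2024-05-08"},
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "owner"}],
)

for day in response["ResultsByTime"]:
    for group in day["Groups"]:
        tag = group["Keys"][0]  # e.g. "owner$team-payments"
        cost = group["Metrics"]["UnblendedCost"]["Amount"]
        print(day["TimePeriod"]["Start"], tag, cost)
```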
Backstage has a "lifecycle" tag for the resources you spin up (experimental is the default). Anything you spin up that isn't tagged as production gets auto-deleted after a period of time (you get a bunch of emails about it beforehand). That cleans up experiments or other test infra that people have forgotten about.
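A minimal sketch of that kind of reaper, assuming EC2 instances carry the lifecycle value as a `lifecycle` resource tag (the tag key and grace period are assumptions, and a real version would send the warning emails first):

```python
import boto3
from datetime import datetime, timedelta, timezone

ec2 = boto3.client("ec2")
GRACE_PERIOD = timedelta(days=14)  # assumed retention for non-production infra

# Find running instances explicitly tagged as experimental
reservations = ec2.describe_instances(
    Filters=[
        {"Name": "tag:lifecycle", "Values": ["experimental"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)["Reservations"]

expired = [
    inst["InstanceId"]
    for res in reservations
    for inst in res["Instances"]
    if datetime.now(timezone.utc) - inst["LaunchTime"] > GRACE_PERIOD
]

if expired:
    # A real version would have already warned the owners by email
    ec2.terminate_instances(InstanceIds=expired)
```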
Our bill is so big that no single engineer can significantly move the needle, but we do have people (both Arch and FinOps) going around looking at costs to identify what appears to be inefficient spend. We're also quite happy to tell AWS we want something zero-rated or discounted if we don't like the cost of it. At a certain size, the account team from the cloud provider is somewhat on your side in negotiations.
Generally, architecture reviews and engineering peer reviews should catch designs that cost a lot, but the most common cause of inefficient engineering is that time matters more than money. Then six months later someone looks at the cost and asks "Why the hell are we doing search that way?" and gets "Because you said you didn't have any time to change the API, so we just made it work this stupid way."
Any new engineer can join and find 5 ways of saving more than their annual salary within a week. But corralling all the teams to actually change the code? That takes leverage.
I've worked for a few orgs where quality and testing were "everyone's" responsibility, and it ultimately led to everyone pushing the work off their plates and lots of it simply not getting done. Why? We could collectively borrow against the future, and "everyone" being responsible meant that nobody could be held accountable, since any attempt to assign blame devolved into debating fractions of responsibility.
It also encouraged those with other incentives, like product, to lean heavily on that and ship more features instead of doing reliable tech work, figuring the debt would be someone else's problem down the road.
People have this naive idea that those who are given responsibility will step up. Some do, but the rest often take the far easier path of externalizing problems, and frankly most jobs reward that, since they don't measure externalities well.
I would have it so that the platform team is responsible for identifying cost issues and engineering is responsible for fixing them. I'm not sure either team would have the skills needed to prevent such things from happening in the first place, so canary deployments might be the way to go if this is a substantial risk in your domain.
In our org, teams are given a fixed budget per microservice, and if spend exceeds that budget, you need to find the money from another service.
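On AWS, that kind of per-service budget maps onto the Budgets API; a hedged sketch, assuming each service's spend is isolated by a `service` cost allocation tag (the tag key, amounts, account id, and alert address are all made up):

```python
import boto3

budgets = boto3.client("budgets")

budgets.create_budget(
    AccountId="123456789012",  # placeholder account id
    Budget={
        "BudgetName": "checkout-service-monthly",
        "BudgetType": "COST",
        "TimeUnit": "MONTHLY",
        "BudgetLimit": {"Amount": "500", "Unit": "USD"},
        # tag filter format: "user:" prefix + tag key, "$" separator, tag value
        "CostFilters": {"TagKeyValue": ["user:service$checkout-service"]},
    },
    NotificationsWithSubscribers=[
        {
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 80.0,  # alert at 80% of the budget
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [
                {"SubscriptionType": "EMAIL", "Address": "team@example.com"}
            ],
        }
    ],
)
```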
Depending on how you provision infra, there are tools you can use to estimate the cost of a change at the PR level before it goes live.
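Infracost is one such tool for Terraform; the basic flow looks roughly like this (paths and filenames are illustrative):

```sh
# On the main branch: snapshot current costs for the Terraform in this repo
infracost breakdown --path . --format json --out-file baseline.json

# On the PR branch: show the cost delta against that baseline
infracost diff --path . --compare-to baseline.json
```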
EVERYONE. Like with everything on Earth it's everyone's responsibility. It all adds up.
The setup where everything is in isolation and silos really doesn't work - when nobody has the full picture, nothing is ever optimal.
What talonx said is what we do.
And it's essential that you check in regularly, rather than just giving them a dashboard and saying "ok, you go look here and tell me if anything is amiss" - because they will never look.