So, my question to the group is - who do you hold (or who should be held) accountable when cloud costs spike unexpectedly: the engineers who write the code, the platform team that manages the infrastructure, or the product managers who set the requirements? (My current answer is a mix of the platform team and the engineers, but we're still trying to formalize the accountability model.)
Each application team should be able to view the total cost of running their service - and thus be held accountable for reducing costs when necessary.
Without data you are running blind. Cost optimization cannot be solved by a standalone team - it has to be owned by everyone.
Source: Personal experience reducing cloud costs in a slightly smaller team.
You can do click-ops in the UI (which then generates YAML in a repo), or you can write the YAML files in your repo yourself. These YAML files define the owner - the team entity or individual the cost originates from. It's a mostly automated process.
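For reference, this sounds like Backstage's standard catalog format (Backstage comes up below); a minimal sketch of such a file, with made-up service and team names:

```yaml
# catalog-info.yaml - registers the service and its owning team with Backstage
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: checkout-service          # hypothetical service name
  description: Handles checkout and payment flows
spec:
  type: service
  owner: team-payments            # costs roll up to this team entity
  lifecycle: experimental         # see the lifecycle note below
```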
Since every resource an application uses is known, anomalies can be tracked down and attributed. So, for example, if someone starts serving big files from anywhere other than the CDN and blows up egress costs, the source and root cause are easy to identify.
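A rough sketch of how that attribution query can look on AWS with boto3, assuming costs carry an `owner` cost allocation tag (the tag key and dates are assumptions):

```python
import boto3

# Cost Explorer client; requires cost allocation tags to be activated
ce = boto3.client("ce")

# Daily unblended cost for a week, grouped by the "owner" tag, so a
# sudden egress spike shows up under the team that caused it.
response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-05-01", "End": "2024-05-08"},
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "owner"}],
)

for day in response["ResultsByTime"]:
    for group in day["Groups"]:
        tag = group["Keys"][0]  # e.g. "owner$team-payments"
        cost = group["Metrics"]["UnblendedCost"]["Amount"]
        print(day["TimePeriod"]["Start"], tag, cost)
```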
Backstage has a "lifecycle" tag for the resources you spin up (experimental is the default). Anything you spin up that isn't tagged as production gets auto-deleted after a period of time (you get a bunch of emails about it beforehand). That cleans up experiments or other test infra that people have forgotten about.
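A minimal sketch of that kind of reaper, assuming EC2 instances carry the lifecycle value as a `lifecycle` resource tag (the tag key and grace period are assumptions, and a real version would send the warning emails first):

```python
import boto3
from datetime import datetime, timedelta, timezone

ec2 = boto3.client("ec2")
GRACE_PERIOD = timedelta(days=14)  # assumed retention for non-production infra

# Find running instances explicitly tagged as experimental
reservations = ec2.describe_instances(
    Filters=[
        {"Name": "tag:lifecycle", "Values": ["experimental"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)["Reservations"]

expired = [
    inst["InstanceId"]
    for res in reservations
    for inst in res["Instances"]
    if datetime.now(timezone.utc) - inst["LaunchTime"] > GRACE_PERIOD
]

if expired:
    # A real version would have already warned the owners by email
    ec2.terminate_instances(InstanceIds=expired)
```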
Our bill is so big that no single engineer can significantly move the needle, but we do have people (both Arch and FinOps) going around looking at costs to identify what appears to be inefficient spend. We're also quite happy to tell AWS we want something zero-rated or discounted if we don't like the cost of it. At a certain size, the account team from the cloud provider is somewhat on your side in negotiations.
Generally, architecture reviews and engineering peer reviews should catch designs that cost a lot, but the most common cause of inefficient engineering is that time matters more than money. Then six months later someone looks at the cost and asks "Why the hell are we doing search that way?" and gets "Because you said you didn't have any time to change the API, so we just made it work this stupid way."
Any new engineer can join and find 5 ways of saving more than their annual salary within a week. But corralling all the teams to actually change the code? That takes leverage.
I've worked for a few orgs where quality and testing were "everyone's" responsibility, and it ultimately led to everyone pushing the work off their plates and lots of it simply not getting done. Why? We could collectively borrow against the future, and "everyone" being responsible meant that nobody could be held accountable, since any attempt to assign blame devolved into debating fractions of responsibility.
It also encouraged those with other incentives, like product, to lean heavily on that and ship more features instead of doing reliable tech work, figuring the debt would be someone else's problem down the road.
People have this naive idea that those who are given responsibility will step up. Some do, but the rest often take the far easier path of externalizing problems, and frankly most jobs reward that, since they don't measure externalities well.
I would have it so that the platform team is responsible for identifying cost issues and engineering is responsible for fixing them. I'm not sure either team would have the skills needed to prevent such things from happening in the first place, so canary deployments might be the way to go if this is a substantial risk in your domain.
In our org, teams are given a fixed budget per microservice, and if spend exceeds that budget, you need to find the money from another service.
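On AWS, that kind of per-service budget maps onto the Budgets API; a hedged sketch, assuming each service's spend is isolated by a `service` cost allocation tag (the tag key, amounts, account id, and alert address are all made up):

```python
import boto3

budgets = boto3.client("budgets")

budgets.create_budget(
    AccountId="123456789012",  # placeholder account id
    Budget={
        "BudgetName": "checkout-service-monthly",
        "BudgetType": "COST",
        "TimeUnit": "MONTHLY",
        "BudgetLimit": {"Amount": "500", "Unit": "USD"},
        # tag filter format: "user:" prefix + tag key, "$" separator, tag value
        "CostFilters": {"TagKeyValue": ["user:service$checkout-service"]},
    },
    NotificationsWithSubscribers=[
        {
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 80.0,  # alert at 80% of the budget
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [
                {"SubscriptionType": "EMAIL", "Address": "team@example.com"}
            ],
        }
    ],
)
```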
Depending on how you provision infra, there are tools you can use to estimate the cost of a change at the PR level before it goes live.
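Infracost is one such tool for Terraform; the basic flow looks roughly like this (paths and filenames are illustrative):

```sh
# On the main branch: snapshot current costs for the Terraform in this repo
infracost breakdown --path . --format json --out-file baseline.json

# On the PR branch: show the cost delta against that baseline
infracost diff --path . --compare-to baseline.json
```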
EVERYONE. Like with everything on Earth it's everyone's responsibility. It all adds up.
The setup where everything is in isolation and silos really doesn't work - when nobody has the full picture, nothing is ever optimal.
What talonx said is what we do.
And it's essential that you check in regularly, rather than just giving them a dashboard and saying "ok, you go look here and tell me if anything is amiss" - because they will never look.