HACKER Q&A
📣 scapecast

What does your company do to reduce cloud cost?


And is cloud cost actually really a problem in the first place? I know that everyone keeps saying "we need to reduce our cloud spend". But from the few anecdotes that I have, reducing cloud spend is only like a #3 or #4 priority at best.

Prio #1 and #2 always seem to be shipping more features and having a more robust delivery pipeline, and that's where all engineering efforts go.

If your company has actively reduced cloud cost, how did you do it, and how did you address the practices / processes that led to overspending in the first place?


  👤 chris_armstrong Accepted Answer ✓
Managing cloud cost is much easier if you target it from the beginning.

Having worked in this space, the things that work are mostly centred on making it easier to attribute the source of costs, proactively exploring your bill on multiple dimensions, thinking about the problem early in the lifecycle, and looking for opportunities to delete or turn off unused resources.

The more successful practices that will deliver quickly:

* tagging everything to be able to understand who owns what and what it is used for

* separating workloads into different accounts (might be easier than tagging and crudely achieves the same objective: attribution)

* turning off unused instances at night in dev environments

* automating the removal of incorrectly tagged resources in non-prod environments

* automating image and snapshot cleanup

* avoiding right sizing (it’s trickier than you think and the effort is better spent rearchitecting for more cloud native usage based resource types like serverless)

* exploring your bill and slicing it by service type, account, department and other dimensions you’ve tagged, paying attention to size and trends

* paying attention to network paths (AWS is notorious for expensive outbound and it’s worth understanding how it works and how to architect around it)

* using templates to deploy resources (Terraform/CloudFormation/ARM) instead of the console, so they're easier to tear down

* setting retention limits on other expensive storage resources (hello CloudWatch Logs)

* being careful with high level services that leverage others under the hood, especially when they’re immature (eg Control Tower + constantly tearing down resources will blow out AWS Config charges)

* using cost alarms

* exploring spot instances, reserved instances or savings plans for EC2
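The "turn off unused instances at night" item is easy to automate. A minimal sketch of the decision logic, assuming tag names like `env` and `keep-alive` (conventions you'd pick yourself, not AWS standards) and an 8pm-6am off-hours window:

```python
from datetime import time

# Hypothetical sketch: decide which dev instances to stop overnight.
# The "env" and "keep-alive" tag names are assumptions, not AWS conventions.

def instances_to_stop(instances, now):
    """instances: list of dicts with 'id' and 'tags'; now: a datetime.time."""
    off_hours = now >= time(20, 0) or now < time(6, 0)  # 8pm-6am window
    if not off_hours:
        return []
    return [
        i["id"]
        for i in instances
        if i["tags"].get("env") == "dev"
        and i["tags"].get("keep-alive") != "true"
    ]
```

In a real setup this would run on a schedule (e.g. EventBridge triggering a Lambda) and pass the result to `ec2.stop_instances(InstanceIds=...)`; keeping the selection logic pure makes it trivial to test before pointing it at a live account.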


👤 ggeorgovassilis
> is cloud cost actually really a problem in the first place

Yes. Prices are calibrated for mature IT markets; in emerging markets it's often cheaper to operate one's own data centre and throw people at problems, e.g. dev salaries are cheap enough that it makes sense to spend money on optimising software rather than on infrastructure. With inflation and supply chain issues, even more so.

> Prio #1 and #2 always seem to be shipping more features and having a more robust delivery pipeline

I think that's either very big companies, which buy agility by throwing money at the cloud, or VC-backed startups, which try to prove a point at all costs.


👤 netsectoday
The processes that led to the overspending in the first place are most likely the pricing models you agreed to. There are a million ways to manage this spend, but at the end of the day you are now held hostage by your hosting company. They have dark-pattern pricing models and have tricked you into thinking it's impossible to manage your infrastructure at a lower level.

Go buy cloud resources from companies that sell what you need: cpu, memory, storage, and network capacity. You just want to pay for the hardware, shelf space, and network/energy usage with a reasonable profit margin to the data-center owner. Don't buy resources from companies selling you a bag full of shit marketing terms they made up so they can meter your usage to death with some product.

Building your business on premium cloud services in partnership with the richest men on earth is probably why your spend is getting out of control.


👤 bdcravens
(assuming AWS; YMMV on other vendors)

Use spot pricing everywhere possible. Anything you can't put in spot, put in reserved instances.

If your spend supports it, going with an AWS partner like Mission can save you about 5%. They place your account under their org, and you pay them. You lose some visibility in the billing interface, but gain some additional advantages in terms of reporting etc.

Think in terms of batches. Most of the time a single Lambda is overpowered for the single unit of work it's performing. Batching that work up and executing it in parallel in an "old school" manner is many times cheaper. This isn't just a Lambda thing: it's very "cloudy" to want to split units of execution across individual resources, but that means you're under-utilizing them.
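A back-of-the-envelope sketch of the batching point, using published Lambda list prices (roughly $0.20 per million requests and $0.0000166667 per GB-second in us-east-1 at time of writing; verify before relying on them). The workload numbers are made up for illustration:

```python
# Rough Lambda cost model: per-request fee plus GB-seconds of compute.
# Prices are approximate us-east-1 list prices; check current pricing.
REQ_PRICE = 0.20 / 1_000_000      # $ per request
GBS_PRICE = 0.0000166667          # $ per GB-second

def lambda_cost(invocations, mem_gb, seconds_per_invocation):
    requests = invocations * REQ_PRICE
    compute = invocations * mem_gb * seconds_per_invocation * GBS_PRICE
    return requests + compute

# 1M items, one item per invocation, 100 ms each at 128 MB:
per_item = lambda_cost(1_000_000, 0.128, 0.1)

# Same 1M items batched 1,000 per invocation; per-item time drops to
# 60 ms once startup/IO overhead is amortized (illustrative number):
batched = lambda_cost(1_000, 0.128, 1_000 * 0.06)
```

Most of the saving comes from amortizing the per-request fee and per-invocation overhead across the batch, which is exactly the "under-utilizing" point above.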


👤 rozenmd
It's fairly common with serverless functions - folks will start a project on serverless, expecting very little traffic.

Eventually, they get lucky, and their Lambda functions are running non-stop, incurring a fairly large premium over a normal constantly running VM.

I managed to do this with my side project (AWS bill in the several-hundred-dollars-per-month range); I wrote about it here: https://onlineornot.com/on-moving-million-uptime-checks-onto...
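To make the "premium over a constantly running VM" concrete, here's a hedged break-even sketch. The Lambda prices are approximate us-east-1 list prices, and the $15/month VM figure is an illustrative stand-in for a small always-on instance, not a quote:

```python
SECONDS_PER_MONTH = 30 * 24 * 3600   # ~2.6M seconds
REQ_PRICE = 0.20 / 1_000_000         # $/request (approximate list price)
GBS_PRICE = 0.0000166667             # $/GB-second (approximate list price)

def monthly_lambda_cost(req_per_sec, mem_gb=0.128, dur_s=0.1):
    """Steady-state monthly cost of a Lambda handling req_per_sec."""
    invocations = req_per_sec * SECONDS_PER_MONTH
    return invocations * (REQ_PRICE + mem_gb * dur_s * GBS_PRICE)

VM_MONTHLY = 15.0  # illustrative always-on small VM
```

At trickle traffic (~1 req/s) the Lambda is far cheaper than the VM; once the functions are "running non-stop" (say 100 req/s with these numbers), it costs several times the VM, which is the trap described above.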


👤 feydaykyn
We switched to Kubernetes with the basic autoscaler, which saved us money, particularly when there's less traffic.

We've also moved away from AWS's ElastiCache offering and now manage our own Redis directly. We've discovered that our backend is very sensitive to cache misses, and that ElastiCache is expensive but very reliable.

Currently we're testing Karpenter to automate scaling with 'spot' instances.

We're working with dev teams to remind them that performance is a requirement; that will help us reduce RAM/CPU allocations. Since RAM is capped, they have no choice but to take it into account if they want to avoid emergencies. It's a very motivating carrot! (just trolling!)

In the pipe:

- move some workloads to ARM (databases, cache, Argo, Grafana, etc.) and, generally speaking, better match instance types to the workload

- our testing environment is too close to production, which is great for testing but quite expensive; for instance, we're going to move to a single Postgres cluster with multiple databases inside, instead of one cluster per test account

- more monitoring to know where we can reduce CPU/RAM requests

- market our results to the business side of the company, so we can get more resources for cost reduction

- simplify our Docker images, which carry a lot of legacy code, to be able to scale up faster; that will let us scale down more

- invest more time combing through AWS cost-reduction programs to enable some more
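The "more monitoring to know where we can reduce cpu/ram requests" step is often just a percentile over observed usage plus headroom. A minimal sketch; the 95th percentile and 20% headroom are illustrative choices, not a Kubernetes standard:

```python
import math

def recommend_memory_request(samples_mb, percentile=0.95, headroom=1.2):
    """Pick a container memory request from observed usage samples (MB).
    The percentile/headroom defaults are illustrative, not a k8s rule."""
    s = sorted(samples_mb)
    idx = min(len(s) - 1, math.ceil(percentile * len(s)) - 1)
    return round(s[idx] * headroom)
```

Feeding this a few days of per-pod usage from your metrics stack gives a defensible request value; the capped-RAM "motivating carrot" above then keeps teams honest about staying under it.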

That's it for now!


👤 dekleinewolf
We don't do any cloud engineering ;)

👤 dublin
Best advice: Be very aware of what you are paying for, how much you're using it, and WHY. I used to sell for the largest AWS services provider, and most companies with out-of-control cloud spending slipped into the slough of cloud despond by using services that seemed cheap in development, but ate their face at scale.

Storage is a big one: S3 is not nearly as cheap as you think, but there are alternatives, both inside and outside the AWS ecosystem. The big gotcha (for any cloud, not just AWS) is relying on managed cloud services: they make development MUCH easier, but they're a lock-in every bit as serious as the old mainframe lock-ins of the 60s and 70s. If you rely on very many cloud services (especially the nice, easy, high-leverage ones, unfortunately), your prices can skyrocket, and your switching costs may become prohibitive.


👤 josh_fyi
If this is a reasonable interpretation of "What does your company do to reduce cloud cost?" then the answer is:

The company I work at offers (in GCP and AWS) reserved-instance style discounts but without commitment; spot-instance management; cost analytics tools; and anomaly alerts. https://www.doit-intl.com/flexsave/


👤 mbaris
Cost savings were actually a pretty high priority for us this year, especially since we had to focus on profitability.

  - Migrated from Splunk to ChaosSearch
  - Tagged every AWS resource and created a cost dashboard for each team
  - Utilized spot instances in more places
  - Reduced retention limits for CloudWatch
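The per-team cost dashboard in that list boils down to grouping billing line items by a tag. A minimal sketch; the field names here are illustrative, not the actual Cost Explorer schema:

```python
from collections import defaultdict

def costs_by_team(line_items, tag_key="team"):
    """line_items: dicts with 'cost' and 'tags'. Untagged spend is
    bucketed under 'untagged' so it stays visible on the dashboard.
    Field names are illustrative, not the Cost Explorer schema."""
    totals = defaultdict(float)
    for item in line_items:
        totals[item["tags"].get(tag_key, "untagged")] += item["cost"]
    return dict(totals)
```

Keeping an explicit "untagged" bucket is the useful trick: it turns tagging gaps into a visible line on every team's dashboard instead of silently hiding spend.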

👤 asfarley
Running a small business, I made the mistake of not paying close enough attention to this. It is only a few extra thousand, but this is a lot for a small business owner.

I have realized I can run most things on a small instance if I'm careful about Rails' RAM usage.