The more I learn, the more I believe cloud is the only competitive solution today, even for sensitive industries like banking or medical.
I honestly fail to see any good reason not to use the cloud anymore, at least for business. Cost-wise, security-wise, whatever-wise.
What's a good reason to stick to on-prem today for new projects? To be clear, this is not some troll question. I'm curious: am I missing something?
I'm the CTO of a moderately sized gaming community, Hypixel Minecraft, which operates about 700 rented dedicated machines to service 70k-100k concurrent players. We push about 4 PB/mo in egress bandwidth, something along the lines of 32 Gbps at the 95th percentile. The big cloud providers have repeatedly quoted us an order of magnitude more than our entire fleet's cost... JUST in bandwidth. Even if we bring our own ISPs and cross-connect to use only the cloud's compute capacity, they still charge stupidly high rates to egress to our carriers.
Even if bandwidth were completely free, at any timescale beyond 1-2 years, purchasing your own hardware, leasing to own, or even just renting will be cheaper.
Cloud is great if your workload is variable and erratic and you're unable to reasonably commit to year+ terms, or if your team is so small that you don't have the resources to manage infrastructure yourself, but at a team size of >10 your sysadmins running on bare metal will pay their own salaries in cloud savings.
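The bandwidth argument above can be sketched numerically. The 4 PB/mo and 32 Gbps figures come from the comment; the per-GB cloud rate and per-Mbps transit rate are illustrative assumptions, not real quotes.

```python
# Back-of-envelope egress cost comparison (illustrative rates, not quotes).
# The comment cites ~4 PB/mo of egress at ~32 Gbps 95th percentile.

PB_IN_GB = 1_000_000                  # decimal GB per PB, as bandwidth is billed

egress_gb_per_month = 4 * PB_IN_GB    # ~4 PB/mo from the comment above

# Hypothetical cloud rate: large-volume tiers often land around $0.05/GB.
cloud_rate_per_gb = 0.05
cloud_cost = egress_gb_per_month * cloud_rate_per_gb

# Hypothetical transit rate: wholesale IP transit is typically billed per
# Mbps at the 95th percentile; $0.50/Mbps/mo is a plausible ballpark.
p95_mbps = 32_000                     # 32 Gbps
transit_rate_per_mbps = 0.50
transit_cost = p95_mbps * transit_rate_per_mbps

print(f"cloud egress : ${cloud_cost:>10,.0f}/mo")   # $   200,000/mo
print(f"95th transit : ${transit_cost:>10,.0f}/mo") # $    16,000/mo
print(f"multiplier   : {cloud_cost / transit_cost:.1f}x")
```

Under these assumed rates the per-GB cloud bill comes out roughly 12x the flat-rate transit bill, which is consistent with the "order of magnitude" quote in the comment.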
We manage upwards of 30 different compute clusters (many listed here: https://hpc.llnl.gov/hardware/platforms). You can read about the machine slated to hit the floor in 2022/2023 here: https://www.llnl.gov/news/llnl-and-hpe-partner-amd-el-capita....
All the machines are highly utilized, and they have fast Infiniband/OmniPath networks that you simply cannot get in the cloud. For our workloads on "commodity" x86_64/no-GPU clusters, we pay 1/3 or less the cost of what you'd pay for equivalent cloud nodes, and for the really high end systems like Sierra, with NVIDIA GPUs and Power9's, we pay far less than that over the life of the machine.
The way machines are procured here is different from what smaller shops might be used to. For example, the El Capitan machine mentioned above was procured via the CORAL-2 collaboration with 2 other national labs (ANL and ORNL). We write a 100+ page statement of work describing what the machine must do, and we release a set of benchmarks characterizing our workload. Vendors submit proposals for how they could meet our requirements, along with performance numbers and test results for the benchmarks. Then we pick the best proposal. We do something similar with LANL and SNL for the so-called commodity clusters (see https://hpc.llnl.gov/cts-2-rfi for the latest one). As part of these processes, we learn a lot about what vendors are planning to offer 5 years out, so we're not picking off the shelf stuff -- we're getting large volumes of the latest hardware.
In addition to the cost savings from running on-prem, it's our job to stay on the bleeding edge, and I'm not sure how we would do that without working with vendors through these procurements and running our own systems.
I've worked almost entirely for companies that run services in various cloud infrastructures - Azure/Heroku/Aws/GCP/Other.
I recently started a tiny 1 man dev shop in my spare time. Given my experience with cloud services it seemed like a no brainer to throw something up in the cloud and run with it.
Except after a few months I realized I'm in an industry that's not going to see drastic and unplanned demand (I'm not selling ads, and I don't need to drive eyeballs to my site to generate revenue).
So while in theory the scaling aspect of the cloud sounds nice, the reality was simple - I was overpaying for EVERYTHING.
I reduced costs by nearly 90% by throwing several of my old personal machines at the problem and hosting things myself.
So long story short - Cost. I'm happy to exchange some scaling and some uptime in favor of cutting costs. Backups are still offsite, so if my place burns I'm just out on uptime. The product supports offline, so while no one is thrilled if I lose power, my customers can still use the product.
Basically - cost, Cost, COST. I have sunk costs in old hardware, it's dumb to rent an asset I already own.
There might well be a point where I scale to where the cloud makes sense. That day is not today.
Simply, _cost_
Our compute servers crunch numbers and data at > 80% util.
Our servers are optimized for the work we have.
They run 24/7 picking jobs from queue. Cloud burst is often irrelevant here.
They deal with Terabytes or even Petabytes of moving data. I’d cry paying for bandwidth costs if charged €/GB.
Sysadmin(yours truly) would be needed even if it were to be run in the cloud.
We run our machines beyond 4 years if they are still good at purpose.
We control the infra and data. So, a little more peace and self-reliance.
No surprise bills because some bot pounded on an S3 dataset.
Our heavy users are connected to the machines at a single hop :-) No need to go across WAN for work.
cost: https://news.ycombinator.com/item?id=23098576
cost: https://news.ycombinator.com/item?id=23097812
cost: https://news.ycombinator.com/item?id=23098658
abilities / guarantees: https://news.ycombinator.com/item?id=23097213
cost: https://news.ycombinator.com/item?id=23090325
cost: https://news.ycombinator.com/item?id=23097737
threat model: https://news.ycombinator.com/item?id=23098612
cost: https://news.ycombinator.com/item?id=23097896
cost: https://news.ycombinator.com/item?id=23098297
cost: https://news.ycombinator.com/item?id=23097215
That's just the in-order top comments I'm seeing right now. (please do read and upvote them / others too, they're widely varying in their details and are interesting)
The answer's the same as it has always been. Cloud is more expensive, unless you're small enough to not pay for a sysadmin, or need to swing between truly extreme scale differences. And a few exceptions for other reasons.
At this number of servers we can still host websites that have millions of users (but not tens of millions). They are not exotic servers either. In fact, by now they are, on average, around 11 years old, and cost anywhere from $2k to $8k at the time of purchase. Some are as old as 19 years. Hell, when we bought some of them - with 32GB of memory each - AWS had no concept of "high memory" instances and you had to completely pay out your ass for a 32GB server, despite RAM being fairly cheap at the time.
We have no dedicated hardware person. Between myself and the CTO, we average maybe a day per month thinking about or managing the hardware. If we need something special setup that we have no experience in, we have a person we know that we contract, and he walks us through how and why he set it up as he did. We've used him twice in the last 13 years.
The last time one of us had to visit the colocation center was months ago. The last time one of us had to go there in an emergency was years ago. It's a 5 minute drive from each of our homes.
So, why exactly should we use the cloud? We have servers we already paid for. We rent 3 cabinets - I don't recall the exact cost, but I think it's around $1k per month. We spend practically no time managing them. In our time being hosted in a colo center - the past 19 years - we've had a total of 3 outages that were the fault of our colo center. They all lasted on the order of minutes.
One great example. We were paying $45k/yr for a hosted MS Dynamics GP solution. For $26k we brought it in house with only a $4k/yr maintenance fee. We bought a rackmount Dell, put on VMWare, have an app VM and a DB VM. My team can handle basic maintenance. In the past 11 months we haven't had to touch that server once. We have an automated backup, pulls VMs out daily and sends them off to Backblaze. Even if we need to call our GP partner for some specialized problem, it's not $45k/yr in consulting costs.
We had a bunch of Azure servers for Active Directory and a few other things. When I came in 2 years ago I set up new on-prem DC VMs and killed our absurd Azure monthly bill; we were saving money by month three. A meteor could take out Atlanta and the DCs at our satellite offices would handle the load just fine until we restored from backups, and we'd STILL save money. We've had MORE uptime and reliability since then too.
If I have a server go down, we have staff to get on it immediately, no toll free number to dial, no web chat to a level 1 person in India, etc.
Our EMR is hosted, because that's big enough that I want to pay someone to be in control over it, and someone to blame. However, there have been many times where I'm frustrated with how they handle problems, and jumping from one EMR to another is not easy. And in the end they're all bad anyway. Sometimes I DO wish we were self hosted.
The Cloud is just someone else's computer. If they're running those machines more cheaply than you are, they're cutting out some cost. The question is, do you need what they're cutting?
Also, sheer cost. Literally everyone I know in my particular part of the industry uses Hetzner boxes. For what I do, it’s orders of magnitude cheaper than AWS.
We don't do on-prem but we do make heavy use of colo. The thought of cloud growth and DC space consolidation some day pushing out traditional flat rate providers absolutely terrifies me.
At some point those cloud premiums will trickle down through the supply chain, and eventually it could become hard to find reasonably priced colo space because the big guys with huge cash-flush pockets are buying up any available space with a significant premium attached. I don't know if this is ever likely, but growth of cloud could conceivably put pressure on available physical colo space.
Similar deal with Internet peering. There may be a critical point after which cloud vendors, through their sheer size will be able to change how these agreements are structured for everyone.
Cost. On-prem is roughly on par in the average case, in my experience, but we've got many cases where we've optimized against hardware configurations that are significantly cheaper to create on-prem. And sunk costs are real. It's much easier to get approval for instances that don't add to the bottom line. For that matter, we try to run our on-prem at close to 100% utilization, which keeps costs well below cloud. If I've got bursty loads, those can go to the cloud.
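The utilization point generalizes: on-prem costs the same whether the machines are busy or idle, while (well-managed) cloud spend scales with usage, so there is a break-even utilization. A minimal sketch, with made-up prices:

```python
# Sketch: at what utilization does on-prem beat on-demand cloud?
# Both prices below are illustrative assumptions, not quotes.

HOURS_PER_MONTH = 730

cloud_rate_per_hour = 0.40   # hypothetical on-demand instance rate
onprem_monthly = 120.0       # hypothetical amortized hardware + colo + power

def monthly_cloud_cost(utilization: float) -> float:
    """Cloud cost if instances only run while there is work to do."""
    return cloud_rate_per_hour * HOURS_PER_MONTH * utilization

# On-prem is a fixed cost; cloud crosses it at this utilization level.
breakeven = onprem_monthly / (cloud_rate_per_hour * HOURS_PER_MONTH)
print(f"break-even utilization: {breakeven:.0%}")

# At the near-100% utilization the comment targets, cloud costs ~2.4x:
print(f"cloud at 100%: ${monthly_cloud_cost(1.0):,.0f} "
      f"vs on-prem ${onprem_monthly:,.0f}")
```

Below the break-even point, the cloud's pay-per-use model wins; above it, the fixed-cost hardware does, which is exactly why bursty loads are the part worth sending to the cloud.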
Lock-in. I don't trust any of the big cloud providers not to jack my rates up. I don't trust my engineers not to make use of proprietary APIs that get me stuck there.
Related to cost, but also its own issue, data transfer. Both latency and throughput. Yeah, it's buzzwordy, but the edge is a thing. I have many clients where getting processing in the same location where the data is being generated saves ungodly amounts of money in bandwidth, or where it wouldn't even been feasible to transfer the data off-site. Financial sector clients also tend to appreciate shaving off milliseconds.
Also, regulatory compliance. And, let's be honest, corporate and actual politics.
Inertia.
Trust.
Risk.
Interoperability with existing systems.
Few decisions about where to stick your compute and storage are trivial; few times is one answer always right. But there are many, many factors to consider, and they may not be the obvious ones that make the decision for you.
My team and I run the servers for a number of very big videogames. For a high-CPU workload, if you look around at static on-prem hosting and actually do some real performance benchmarking, you will find that cloud machines - though convenient - generally cost at least 2x as much per unit of performance. Not only that, but cloud will absolutely gouge you on egress bandwidth - leading to a cost multiplier that's closer to 4x, depending on the balance between compute and outbound bandwidth.
That's not to say we don't use the cloud - in fact we use it extensively.
Since you have to pay for static capacity 24/7 - even when your regional players are asleep and the machines are idle - there are some gains to be had by using the right blend of static/elastic: don't plan to cover peaks with 100% static, and spin up the elastic machines when your static capacity is fully consumed. This holds true for anything that results in more usage - a busy weekend, an in-game event, a new piece of downloadable content, etc. It's also a great way to deal with not knowing exactly how many players are going to show up on day 1.
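The static/elastic split described above can be sketched as a small optimization: given a daily demand curve, pick the always-on baseline that minimizes total cost. The demand profile and both hourly rates here are made-up for illustration.

```python
# Sketch of choosing a static/elastic split for a daily demand curve.
# The rates and the demand profile are illustrative assumptions.

def blended_cost(demand, static, static_rate=1.0, elastic_rate=2.5):
    """Cost of covering `demand` (servers needed per hour) with `static`
    always-on servers, plus on-demand burst capacity for the overflow.
    Elastic capacity costs more per hour but is only paid for when used."""
    static_cost = static * static_rate * len(demand)
    burst_cost = sum(max(d - static, 0) for d in demand) * elastic_rate
    return static_cost + burst_cost

# Toy 24-hour profile: quiet overnight, evening peak.
demand = [20] * 8 + [60] * 8 + [100] * 6 + [40] * 2

# Covering the peak entirely with static capacity vs the best blend:
all_static = blended_cost(demand, static=100)
best = min(range(0, 101), key=lambda s: blended_cost(demand, s))
print(f"100% static: {all_static:.0f}, "
      f"best split at {best} static: {blended_cost(demand, best):.0f}")
```

With these toy numbers the optimum sizes static capacity for the shoulder load, not the peak: an extra static server only pays off if it displaces enough expensive burst-hours, which is the "don't cover peaks with 100% static" rule in miniature.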
Regarding latency, we have machines in many smaller datacenters around the world. We can generally get players far closer to one of our machines than to AWS/GCP/Azure, resulting in better in-game ping, which is super important to us. This will change over time as more and more cloud DCs spring up, but for now we're pretty happy with the blend.
And there are clients that demand it.
And researchers, in general, like to do totally wacky things, and it's often easier/cheaper to let us if you have physical access.
Also, my clients aren't software development firms. They are banks and factories. They buy a software based on features and we figure out how to make it work, and most of the vendors in this space are doing on-prem non-saas products. A few do all their stuff in IAAS or colo but a lot of these places are single-rack operations and they really don't care as long as it all works.
A lot of people in small/midsize banks feel like they are being left out. They go to conferences and hear about all the cool stuff in the industry but the established players are not bringing that to them. If you can stomach the regulatory overhead, someone with drive could replace finastra/fiserv/jackhenry. Or get purchased by them and get turned into yet another forever-maintenancemode graveyard app.
Started with a cluster of Raspberry Pis and expanded onto an old desktop. Primarily did this for cost (the Raspberry Pis alone were more powerful than a GCP $35/mo instance). Everything was fine until I needed GPUs and more traffic than those Pis could handle. So I expanded by including cloud instances in my Docker Swarm cluster (tidbit: using Traefik and WireGuard).
So half on-prem half in the cloud. Honestly just scared GCP might one day cancel my account and I'll lose all my data unless I meet their demands (has happened in the past) so that half on-prem stores most of the data.
I also have some other APIs hosted in the same way (eg. website thumbnail generation API), for the very low traffic I have and no chance of getting burst traffic I think the use case of a VPS or dedicated server is perfect.
In the cloud, at least the way it's generally used, cost control is reactive: You get a bill from AWS every month, and if you're lucky you'll be able to attribute the costs to different projects.
This is both a strength and a weakness: on-premise assets will end up at much higher utilisation, because people will be keen to share servers and dodge the bureaucracy and costs of adding more. But if you consider isolation a virtue, you might prefer having 100 CPUs spread across 100 SQL DBs instead of 50 CPUs across two mega-databases.
About a year ago, I was in a meeting with my new CEO (who had acquired my company). My side of the business had kept hardware in-house, his was in AWS. We had broadly similar businesses in the same industry and with the same kind of customers.
My side of the business needed to upgrade our 5+ year old hardware. The quote came to $100K; the CEO freaked out. I asked him how much he spent on AWS?
The answer was that they spent $30K per month on AWS.
The kicker is that we managed 10x as many customers as they did, our devops team was half the size, and we were rolling out continuous deployment while they were still struggling to automate upgrades. Our deployment environment is also far less complicated than theirs because there isn't a complex infrastructure stack sitting in front of our deployment stack.
There was literally no dimension on which AWS was better than our on-prem deployment, and as far as I was able to tell before I quit, the only reason they used AWS was because everyone else was doing it.
What really gets me is that most cloud providers promise scalability but offer no guard-rails - for example, diagnosing performance issues in RDS. The goal for most cloud providers is to ride the line between your time cost and their service charges. Sure, you can reduce RDS spend, but you'll have to spend a week to do it - so bust out the calculator or just sign the checks. No one will stop you from creating a single point of failure - but they'd happily charge consulting fees to fix it. There is a conflict of interest - they profit from poor design.
In my opinion, the internet is missing a platform that encourages developers to build things in a reproducible way. Develop and host at home until you get your first customers, then move to a hosting provider down the line. Today, this most appeals to AI/ML startups - they're painfully aware of their idle GPUs in their gaming desktops and their insane bill from Major Cloud Provider. It also appeals to engineers who just want to host a blog or a wedding website, etc.
This is a tooling problem that I'm convinced can be solved. We need a ubiquitous, open-source, cloud-like platform that developers can use to get started on day 1, hosting from home if desired. That software platform should not have to change when the company needs increased reliability or better air conditioning for their servers. Whether it's a WordPress blog or a Minecraft server or a petabyte SQL database, the vendor should be a secondary choice to making things.
My observations from working with, and in the "cloud":
The "cloud" does benefit from its scale in many ways. It has more engineers to improve, fix, watch, and page. It has more resources to handle spikes, whales, and demand. Almost everything is scale tested and the actual physical limits are known. It is downright impressive to see what kind of traffic the cloud can handle.
Everything in the "cloud" is abstracted which increases complexity. Knowledgeable engineers are few and far between. As an engineer you assume something will break, and with every deployment you hope that you have the right metrics in place and alarms on the right metrics.
The "cloud" is best suited for whales. From special pricing to resource provisioning, they get the best. The rest is trickled down.
Most services are cost-centers. Very few can actually pay for the team and the cost of its dependencies.
It's insane how much VC money is spent building whatever the latest trend of application architecture is. Very few actually hit their utilization projections.
- Cost. It's vastly cheaper to run your own infra (like, 10-100x -- really!). The reason to run in cloud is not to save money, it's to shift from capex to opex and artificially couple client acquisition to expenditure in a way that juices your sheets for VCs.
- Principle. You can't do business in the cloud without paying people who also work to assemble lists of citizens to hand over to fascist governments.
- Control. Cloud providers will happily turn your systems off if asked by the government, a higher-up VP, or a sufficiently large partner.
EDIT: I should add. Cloud is great for something -- moving very fast with minimal staffing. That said, unless you get large enough to renegotiate you will get wedged into a cost deadend where your costs would be vastly reduced by going in-house, but you cannot afford to do so in the short term. Particularly for the HN audience, take care to notice who your accelerator is directing you to use for cloud services -- they are typically co-invested.
If you can feasibly run workloads onpremise or colo and have a warm failover to AWS you could probably have the best of all worlds.
1. IaaS - Which I mainly define as the raw programmable resources provided by "hypercloud" providers (AWS, GCP, Azure). Yes, it seems that using an IaaS provider with a VPC can provide many benefits over traditional on-prem data centers (racking & stacking, dual power supply, physical security, elasticity, programmability, locations etc).
2. SaaS - I lump all of the other applications by the hundreds of thousands of vendors into this category. I find it hard to trust these vendors the same way that I trust IaaS providers and am much more cautious of using these applications (vs OSS or "on-prem software" versions of these apps). They just don't have the same level of security controls in place as the largest IaaS providers can & do (plus the data is structured in a way that is more easily analyzed, consumed by prying eyes).
Windows server licenses on AWS and GCP are hundreds of times more expensive at our scale. Incidentally we actually do have some cloud infra and we like it, but the licensing cost is half the total price of the instance itself.
In fact, you might not know this, but games are relatively low margin, and we have accidentally risked the company's financial safety by moving into the cloud.
__Modern__ servers are really awesome and I totally recommend them. You can do a ton remotely.
We ran a number of modelling jobs, basically CPU intensive tasks that would run for minutes to hours. Investing in on-prem computers (mostly workstations, some servers), we got very solid performance, very predictable costs and no ops issues. Renting beefy machines in the cloud is very expensive and unless you get crafty (spot and/or intelligent deployment), it will be prohibitive for many. Looking at AMD's offering these days, you can get sustained on-prem perf for a few dollars.
Three details of note: 1) We didn't need bursty perf (or only very infrequently) - had this been a need, the cloud would make a lot more sense, at least in a hybrid deployment. 2) We didn't do much networking (I'm in a different company now and we work with a lot of storage on S3, so on-prem wouldn't be feasible for us). 3) We didn't need to work remotely much; it was all at the office.
Obviously, it was a very specific scenario, but given how small the company was, we couldn't afford people to manage the whole cloud deployment/security/scaling etc. and beefy workstations was a much simpler and more affordable endeavour.
If that's not true, it turns out it's quite expensive to run things in the cloud. If your workload is crunching numbers 24/7 at 100% cpu, it's better to buy the cpu than to rent it.
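The buy-vs-rent claim for a 24/7 workload reduces to a payback-period calculation. All the prices below are illustrative assumptions; the point is the shape of the math, not the specific figures.

```python
# Rent-vs-buy sketch for a 24/7, 100%-CPU workload (illustrative numbers).

hours_per_month = 730

cloud_per_hour = 1.00       # hypothetical comparable on-demand instance rate
cloud_monthly = cloud_per_hour * hours_per_month   # ~$730/mo, forever

server_price = 4_000        # hypothetical one-time hardware cost
onprem_monthly = 150        # hypothetical colo + power + amortized spares

# Months until the purchase pays for itself versus renting:
payback_months = server_price / (cloud_monthly - onprem_monthly)
print(f"payback: {payback_months:.1f} months")     # ~6.9 months

# Three-year totals under the same assumptions:
months = 36
print(f"cloud  3yr: ${cloud_monthly * months:,.0f}")
print(f"onprem 3yr: ${server_price + onprem_monthly * months:,.0f}")
```

With these toy numbers the hardware pays for itself in well under a year, and the three-year totals differ by roughly 3x, which matches the thread's repeated "buy the CPU, don't rent it" conclusion for sustained compute.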
It hurts sometimes, given we were fully colocated about 4 years back, and I know how much hardware that could buy us every month.
However, with serverless infra we can pivot quickly.
Since we're still in the beta stage, with a few large, early access partnerships, and an unfinished roadmap, we don't know where the bottlenecks will be.
For example, we depended heavily on CloudSearch, until it sucked for our use case, so we shifted to Elasticsearch, and ran both clusters simultaneously until we were fully off of CS. If we were to do that on-prem, we'd have to order a lot more hardware (or squeeze in new ES cluster VMs across heavy utilization nodes).
With AWS, a few minutes to launch a new ES cluster, dev time to migrate the data, followed by a few clicks to kill the CloudSearch cluster.
Cloud = lower upfront, higher long term, but no ceiling. On-prem = higher upfront, lower long term, but ceiling.
1) Bandwidth. I routinely saturate my plebian developer gigabit NIC links for half an hour, an hour, longer - and the servers slurp down even worse. In an AAA studio I am but one of hundreds of such workers. Getting a general-purpose internet connection that handles that kind of bandwidth to your heavily customized office is often just not really possible. If you're lucky, your office is at least in the same metro area as a relevant datacenter. If you're really lucky, you can maybe build a custom fiber or microwave link without prohibitive cost. But with those kinds of geographical limitations, you're not so much relying on the general internet as you are expanding your LAN to include a specific datacenter / zone of "the cloud" at that point.
2) Security. These servers are often completely disconnected from the internet, on a completely separate network, to help isolate them and reduce data exfiltration when some idiot installs malware-laden warez, despite clear corporate policy threatening to fire you if you so much as even think about installing bootleg software. Exceptions - where the servers do have internet access - are often recent, regrettable, and being reconsidered - because of, or perhaps despite, draconian whitelisting policies and other attempts at implementing defense in depth.
3) Customizability. Gamedev means devkits with strict NDAs and physical security requirements, and a motley assortment of phone hardware, that you want accessible to your build servers for automatic unit/integration testing. Oddball OS/driver/hardware may also be useful for such testing. Sure, if you can track down the right parties, you might be able to have your lawyers convince their lawyers to let you move said hardware into a datacenter, expand the IP whitelists, etc... but at that point all you've really done is made it harder to borrow a specific popular-but-discontinued phone model from the build farm for local debugging when it's the only one reproducing a specific crash when you want to debug and lack proper remote debug tooling.
...there are some inroads on the phone farms (AWS Device Farm, Xamarin Test Cloud) but I'm unaware of farms of varied desktop hardware or devkits. Maybe they exist and just need better marketing?
I have some surplus "old" server hardware from one such gamedev job. Multiple 8 Gbit links on all of them. The "new" replacement hardware is often still noticeably bottlenecked for many operations.
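The bandwidth point above is easy to make concrete: bulk asset syncs at common link speeds take long enough that office uplinks stay saturated for hours. The build size and the efficiency factor below are illustrative assumptions.

```python
# How long bulk game-asset syncs take at various link speeds -- this is
# why a gigabit office uplink stays saturated (sizes are illustrative).

def transfer_hours(size_gb: float, link_gbps: float,
                   efficiency: float = 0.9) -> float:
    """Wall-clock hours to move size_gb over a link_gbps link.
    `efficiency` roughly accounts for protocol overhead (an assumption)."""
    gigabits = size_gb * 8
    return gigabits / (link_gbps * efficiency) / 3600

build_gb = 400  # a plausible AAA build with symbols and assets
for gbps in (1, 8, 40):
    print(f"{gbps:>2} Gbps: {transfer_hours(build_gb, gbps):.2f} h")
```

At 1 Gbps a single 400 GB sync is about an hour of saturated link; multiply by hundreds of workers and the "expand your LAN into the datacenter" framing starts to look like the only workable option.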
Awesome bare metal is a new repo created by Alex Ellis that tracks a lot of the projects: https://github.com/alexellis/awesome-baremetal
Also we (Packet) just open sourced Tinkerbell, our bare metal provisioning engine: https://www.packet.com/blog/open-sourcing-tinkerbell/
2. Bought a handful of $700 24-core Xeons on eBay 2 years ago for 24/7 data crunching. Equivalent cloud cost was over $3,000/mo. On-prem paid off within a month!
3. Nutanix is nice. Awesome performance for the price and almost no maintenance. Got 300+ VDI desktops and 50+ VMs with 1ms latency.
The cloud sucks for training AI models. It's just insanely overpriced in a way that no "Total Cost of Ownership" analysis is going make look good.
Every decent AI startup––including OpenAI––has made significant investments in on-premise GPU clusters for training models. You can buy consumer-grade NVIDIA hardware for a fraction of the price that AWS pays for data center-grade GPUs.
For us in particular, the payback on a $36k on-prem GPU cluster is about 3-4 months. Everything after that point saves us ~$10k / month. It's not even close.
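The payback figures in that comment are internally consistent, which a one-liner shows. The $36k cluster cost is from the comment; the cloud GPU spend and on-prem running cost below are hypothetical numbers chosen to illustrate how the ~$10k/mo savings could decompose.

```python
# Rough payback math behind the numbers above. The $36k figure is from
# the comment; the two monthly figures are illustrative assumptions.

cluster_cost = 36_000
cloud_gpu_monthly = 12_000   # hypothetical equivalent cloud GPU spend
onprem_monthly = 2_000       # hypothetical power + colo + maintenance

monthly_savings = cloud_gpu_monthly - onprem_monthly   # ~$10k/mo
payback_months = cluster_cost / monthly_savings
print(f"saves ${monthly_savings:,}/mo, "
      f"payback in {payback_months:.1f} months")
```

A 3.6-month payback sits right inside the quoted 3-4 month range, after which the savings are pure margin.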
When I was at AWS, I tried to point this fact out to the leadership––to no avail. It simply seemed like a problem they didn't care about.
My only question is why isn't there a p2p virtualization layer that lets people with this on-prem GPU hardware rent out their spare capacity?
Now the hard part is turning those cost advantages into operational improvements instead of deficiencies.
All the best cloud providers are from the US and as a european company with clients in european government and healthcare we are often not morally or legally allowed to use a foreign provider.
The sad thing is that this is an ongoing battle between people on a municipal level who believe they can save money in clouds, and morally wiser people who are constantly having to put the brakes on those migration projects.
- unwillingness to cede control of the critical parts of our software infrastructure to a third party.
- given our small team size and our technical debt load we are not currently able to re-architect to make our software cloud-ready/resilient.
- true cost estimates feel daunting to calculate, whereas on-prem costs are fairly easy to calculate.
But the real reason we're not deeper in the cloud is that our business types insist on turn-of-the-century, server-based software from the usual big vendors, and all the things that integrate with them need to use 20th-century integration patterns, so for us, migrating to the cloud (in stages at least) would have the drawbacks of all options without the benefits. It's only where we have cloud-native stuff that can sneak in under the radar for stand-alone greenfield projects, or where we convince the business types that they can replace the Oracles and PeopleSofts with cloud-first alternatives, that things will really change.
I worked for some French and other European companies, with IP and sensitive information, and US business competitors. Under US law, US companies may have to let the US government spy on their customers (even non-US customers, even in non-US locations), so this can be a problem for strategic sectors, like defense.
In that case, sensitive information is required to be hosted in-country by a company of that country, under that country's law.
Of course, it's not against "cloud" in general... only against US cloud providers (and Chinese, and...)
For my day job, it is privacy and legal constraints. I work for the government and all manner of things need to be signed off on to move to cloud. We could probably make it work, but in government, the hassle of doing so is so large that it is not going to happen for a long time.
In my contract project, it is a massive competitive advantage. I won't go into too many details, but customers in this particular area are very pleased that we do not use a cloud provider and instead host it somewhat on-premise. I don't see a large privacy advantage over using the cloud, but the people buying the service do simply because they are paranoid about the data and every single one of them could personally get in a lot of trouble for losing the data.
Not my project, but intensive computing requirements can be much more cheaply filled by on-premise equipment (especially if you don't pay for electricity), so my university does most of its AI and crypto research on-premise.
I try to remain objective, there are some pros to AWS, but I still much prefer my on prem setup. It was way cheaper, and deployments were way faster.
The number and frequency of outages in Azure are crazy. They happen non-stop all year around. You get meaningless RCAs but it never seems to get better, and if it did, you'd have no way of knowing.
Compare this with doing stuff internally - you can hire staff, or train staff, and get better. In the long run, outsourcing and trusting other companies to invest in "getting better" doesn't end very well. Just because they moved their overall metrics from 99.9 to 99.91 doesn't mean it will help your use case.
- Reliability
Their UIs change every day, there's no end to end documentation on how things work. There's no way to keep up.
- Support
Azure's own support staff are atrocious. You have to repeatedly bang your head against the wall for days to get anyone who even knows the basic stuff from their own documentation.
But it's also difficult to find your own people to do the setup. Sure, lots of people can do it, but because it's new they have little experience and end up not knowing much, unable to answer questions, and building garbage on the cloud platform. There's no cloud seniority; it hasn't been around long enough.
- Security
Cloud providers have or can get access and sometimes use it.
- Management
I've seen too many last minute "we made a change and now everything will be broken unless you immediately do something" notifications to be happy about.
- Cost
It's ridiculously expensive above a certain scale, and that scale is not very big. I don't know if it's because people overbuild, or because you're being nickel-and-dimed, or if you're just paying so many times above normal for enterprise hardware and redundancy. It's still expensive.
Yes, owning (and licensing) your own is expensive too.
For smaller projects and tiny companies, totally fine! It's even great!
- Maturity
People can't manage cloud tools properly. This doesn't help with costs above.
PS: I don't think any other cloud service is better.
Bonus points for small stuff like RADIUS for wifi and stuff. Groups charging $5/user for junk like that is absolutely awful with a high number of staff.
With a staff of 100, a single box with a bunch of hard drives is two months worth of cloud and SaaS.
TCO needs to come down by like at least 100x before I consider going server-less.
Besides, we do have things like our own S3, k8s and other cloud-ish utilities running so we do not miss out that much, I guess.
There is room for both cloud and on-prem to exist. This endless drive by industry to push everyone to cloud infrastructure and SaaS will, in my humble opinion, end up looking exactly like having the whole supply chain come from the East during a pandemic.
The economics of it look great in a lot of use cases, but putting our whole company at the mercy of a few providers sounds terrible to me. Even more so when I see posts on HN about folks getting locked out of their accounts with little notice.
It does not take much to bring our modern cloud to a grinding halt. For example, a mistake by a mostly unheard-of ISP led to a massive outage less than a year ago (1).
It was amazing to see the interconnections turn into cascading issues. One ISP goofs, one or two major providers have issues, and the trickle-down effect was such that even services that thought they were immune from cloud issues realized they rely on a 3rd party that relies on a different 3rd party that uses Cloudflare or AWS.
So, even though I think the cloud is (usually) secure, stable, resilient, etc... I still advocate for its use in moderation and for 2 main use cases.
1 - elastic demands. Those non-critical systems that add some value or make work easier. Things we could do without for several days and not hurt the business much.
2 - DR / Backup / redundancy. We have a robust 2 data center / DR fail over system. Adding cloud components to that seems reasonable to me.
(1)https://slate.com/technology/2019/06/verizon-dqe-outage-inte...
Edit: Spelling and clarity
Edit2: New reasons to stay on prem are happening all the time. https://www.bleepingcomputer.com/news/security/microsofts-gi...
There are plenty of companies that run their infrastructure to keep their data secure and accessible.
It's not the type of companies that blog about their infra or are popular on HN. Banks and financial institutions, telcos, airlines, energy production, civil infrastructure.
Critical infrastructure need to survive events larger than a datacenter outage. FAANGs don't protect customers from legal threats, large political changes, terrorist attacks, war.
We currently have double digit petabytes of data stored in our own data centres, but we're moving it to S3 because we have far better things to do with our engineers than replacing drives all day, and engineering plus hardware is more expensive than S3 Deep Archive - but that wasn't the case until Deep Archive came out.
We put out hundreds of petabytes of bandwidth, and AWS is horribly expensive at first glance, but at that scale you negotiate private pricing that brings it within spitting distance of using a colo or Linode/Hetzner/OVH - the distance is small enough that the advantages of AWS outweigh it, and it allows us to run our business at known and predictable margins.
Besides variability (most of our servers are shut down nights, weekends, and when not required), op-ex vs cap-ex, and spikes in load (100x to 1000x baseline when tickets open), there's also the advantage of not needing ops engineers and being able to handle infrastructure with code. If you have a lot of ops people and don't need any of the advantages, you have lots of money lying around that you can use on cap-ex, you have a predictable load pattern, and you've done a clear cost-benefit analysis to determine that building your own is cheaper, you should totally do that. It doesn't matter what others are doing.
There are still some computers on site due to equipment being tied to it, telephony stuff, etc.
My last company was looking at "moving to the cloud", with the idea that its data centers were too expensive, but found out that the cloud solutions would be even more expensive, despite possible discounts due to the size. They still invested in it because some Australian customers wanted data to be located there.
Everything is on-site for a couple of reasons (50 servers). Mainly because, as a manufacturing company, machines on the shop floor need to talk to the servers. This brings up issues of security (do you really want to put a 15-year-old CNC machine 'on the internet'?). Also, if our internet connection has issues, we still need to build parts.
The other big part of it is the mindset of management and the existing system, which was built to run locally, does Amazon offer cloud hosted Access and Visual Basic workers?
> am I missing something?
I'd want more background on what you mean by "at least for business". What kind of business? Obviously IaaS providers like Digital Ocean and Linode are a type of business that would not use other clouds. Dropbox and Backblaze would probably never use something like S3 either. And there are legitimate use cases outside of tech with needs in specific teams for low-latency compute, or where it's otherwise cost- and time-prohibitive to shuttle terabytes of data to the cloud and back (3D rendering, TV news rooms, etc). If you're talking about general business systems that can be represented by a website or app with a CRUD API, then most of that probably doesn't require on-prem. But that's not the only reason businesses buy servers.
We started out with Emvi [1] on Kubernetes at Google Cloud as it was the "fancy thing to use". I like Kubernetes, but we paid about 250€/month just to run some web servers and two REST APIs. Which is way too much considering that we're still working on the product and pivoting right now, so we don't have a lot of traffic.
We then moved to a different cloud provider (Hetzner) and hosted Kubernetes on VMs. Our costs went down to about 50€ just because of that. And after I got tired of managing Kubernetes and all the complexity that comes along with it, we now just use docker-compose on a single (more powerful) VM, which reduced our cost even further to about 20€/month and _increased_ performance, as we have less networking overhead.
My recommendation is to start out as simple as possible. Probably just a single server, but keep scaling in mind while developing the system. We can still easily scale Emvi on different hardware and move it around as we like. We still use Google Cloud for backups (together with Hetzner's own backup system).
2. Everything we do in house is small enough that the costs of running it on our own machines is far less than the costs of working out how to manage it on a cloud service AND deal with the possibility of that cloud service being unavailable. Simply running a program on a hosted or local server is far far simpler than anything I've seen in the cloud domain, and can easily achieve three nines with next to no effort.
Most things which 'really need' cloud hosting seem to be irrelevant bullshit like Facebook (who run their own infrastructure) or vendor-run workflows layered over distributed systems which don't really need a vendor to function (like GitHub/Git or GMail/email).
I'm trying to think of a counterexample which I'd actually miss if it were to collapse, but failing.
Reasoning:
* We know how much compute is needed.
* We know how much the new servers can compute.
* We have the ability to load balance to AWS or Digital Ocean or another service as needed.
* This move provides a 10x speed improvement to our services AND reduces costs by 70%.
For reference, I had to call the ISP (AT&T) and they agreed to let me host my current service. It's relatively low bandwidth, but has high compute requirements.
We also run some registry data which we consider mission critical as a repository. We could run the live state off-prem, but we'd always have to be multi-site to ensure data integrity. We're not a bank, but like a bank or a land and titles administration office, registry implies stewardship in trust. That imposes constraints on "where" and "why".
Take central registry and the HSM/related out of the equation, if I was building from scratch I'd build to pub/sub, event-sourcing, async and in-the-cloud for everything I could.
private cloud. If you don't control your own data and logic, why are you in the loop?
AWS and co’s GPU-enabled servers are exceedingly expensive. Most of the GPU models on those machines are also very old. We pay maybe 1/3 or less to maintain these machines and train models in-house vs paying AWS.
Mind you, we use AWS for plenty of stuff...
All servers are on premises. Not allowed to have a laptop. No access to emails/data outside of the office. No USB drives, printing documents, etc.
Reason? Protect IP. From who? Mostly Huawei.
Good and bad: When I walk out the door... I switch off. The bad is that working from home isn't really an option, although they have accommodated somewhat for this pandemic.
Regarding cost, well, it depends. We try to help customers move to cloud hosting if it's cheaper for them. It almost always will be if they take advantage of the features provided by the cloud providers. If you just view AWS, for instance, as VMware in the cloud, then we can normally host the virtual machines for you cheaper and provide better service.
You have to realize that many companies aren't developing software that's ready for cloud deployment. You can move it to an EC2 instance, but that's not taking advantage of the feature set Amazon provides, and it will be expensive, and support may not be what you expect. You can't just call up Amazon and demand that they fix your specific issue.
Then learn A LOT more and start with mainframes and their reliability.
Neither of us has any experience with the cloud, whereas we have a lot of Microsoft experience. We still rely on OEM licenses of Office, because Office 365 would be 3x or more expensive. We have a range of Office 2019, 2016, and 2013 OEM, and we get audited by them nearly every year.
We use LastPass, Dropbox and Github, but only the basic features, and LastPass was an addition last year after someone got into our network through a weak username/password.
In our main location, we have three ESX boxes, running several virtual servers, and then we have a physical server for our domain controller, file sharing and DHCP, DNS in other locations. We also switched to a physical server for our new ERP application server, which hasn't yet been rolled out.
Projects like upgrading our ERP version can take months, but we have a local consulting team, with a specialist in our particular ERP solution, as well as a Server and Network specialist, and we also have a very close relationship with our ISP, who provides WAN troubleshooting.
Our IT budget is small relative to our company revenue, so most cloud proposals would raise our costs manyfold. We continue to use more services like Github and Lastpass, and we both wear multiple hats.
I'm a developer, plus in-house app support, email support, HR systems support, ERP support, and PC setup, and I run our data synchronization operation while my boss runs EDI. I do a lot of PowerShell and Task Scheduler, but I've gotten familiar with bash through Git Bash.
A retail company may decide that the best place to put up a new branch is coincidentally (though there might be a correlation) at the edge of what the available ISPs currently cover. They might have to make a deal to get an ISP to extend their area to where the store is going to be. However, because of lack of competing ISP options on the part of the retailer, and the lack of clients in the retailer's area on the part of the ISP, that service is probably not going to be all that reliable.
Also, that retail company may experience a big rise in sales after a natural disaster occurs, when communications (phone/cell/internet) are mostly down for the area. One tends not to think about stuff like that until it happens at least once.
It's very important for the ERP/POS systems to be as operational as possible even when the internet is down.
There is one way around it: Mounting the cloud server as a network drive (some providers do this by default, but OneDrive is not one of them, neither is Dropbox).
I don't know of a way of mounting OneDrive as a virtual drive; I would be interested to know.
It sounds stupid, but the above was a real life scenario.
[1] Only if the files are closed. Excel can change the path if you have the file open, but it can't change it to multiple options across different PCs. But as I have mentioned before, Excel doesn't seem to document all of its more subtle features.
For under $50K, I have 4 machines with an aggregate 1TB RAM, 48 cores, 1 pricey GPU, 16TB of fast SSD, 40TB of HDD, and InfiniBand at 56Gb/sec. Rent on the cabinet is less than $1K/mo. It's going to cost me about $20K in labor to migrate.
So the nominal break-even point is six months, but the real kicker is that this is effectively 10-30x the raw power of what I was getting on the cloud. I can offer a quantitatively different set of calculations.
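As a sanity check on that six-month figure, here's a back-of-envelope sketch in Python; the prior monthly cloud bill is my assumption, the rest are the numbers above:

```python
# Break-even check for the colo move described above.
# Hardware + migration are one-time; cabinet rent is monthly.
hardware = 50_000         # 4 machines, from the figures above (USD)
migration_labor = 20_000  # one-time migration labor cost (USD)
cabinet_rent = 1_000      # cabinet rent per month (USD)
cloud_bill = 13_000       # assumed prior monthly cloud spend (my assumption)

def months_to_break_even(upfront, monthly_colo, monthly_cloud):
    """Months until cumulative savings cover the upfront spend."""
    saved_per_month = monthly_cloud - monthly_colo
    months = 0
    while upfront > saved_per_month * months:
        months += 1
    return months

print(months_to_break_even(hardware + migration_labor, cabinet_rent, cloud_bill))  # → 6
```

With a prior cloud bill anywhere in that ballpark, the payback lands right around the six months the parent mentions; everything after that is pure savings plus the 10-30x extra horsepower.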
It also simplifies a bunch of stuff: 1. No APIs to read blob data - just good old files on a ZFS share. 2. No need to optimize memory consumption. 3. No need for docker/k8s/etc to spin up under load - just have a cluster sitting there.
There are downsides, but colo beats the cloud for certain problems.
Procurement is a nightmare, especially when your vendor is having problems with yields (thanks, Intel!), and the ability to scale up and down without going through a hardware procurement process saves us millions of dollars a year.
We avoid the lock-in by running on basic services on multiple cloud providers and building on top of those agnostically.
Spend is in the millions per month between the cloud providers, but the discounts are steep. We've essentially had to build our own global CDN, and the costs are better than paying the CDN services and better than running our own hardware and staffing all those locales.
It's a no brainer. We'll continue to operate mixed infrastructure for quite some time as certain things make sense in certain places.
The advantages of running things in a cloud are clear - and as an infrastructure team we have challenges around managing physical assets at scale. However, given the cost of cloud providers, it's clear that eventually we would have to pull data back into a datacenter to survive.
Co-location costs are fixed, and it's actually easy to make a phenomenal deal nowadays given the pressure these companies are under.
The real trick of it all is that regardless of running on-prem or in the cloud, we need to run as if everything is cloud native. We run Kubernetes, Docker, and as much as possible automate things to the point that running one of something is the same as running a million of it.
One other point I'll make: the true value of cloud isn't in IaaS. Renting VMs from anyone is relatively expensive compared to the cost of buying a server and maintaining it yourself for a number of years. The true value of the cloud comes when you can architect your solution to utilize the various services the cloud providers offer -- RDS/DynamoDB, CDN, Lambda, API Gateway, etc. -- so that you can scale quickly when you need to.
Is there something like that I could use on my own hardware? I just want to do a fresh Linux install, install this one package, and start pushing code from elsewhere, no other configuration or setup necessary. If it can accept multiple repos, one server process each, all the better. I know things like Docker and Kubernetes exist but what I want is absolute minimal setup and maintenance.
Does such a thing exist?
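One minimal way to approximate that push-to-deploy flow on a fresh Linux box is a bare git repo plus a `post-receive` hook. Here's a sketch; all paths, the branch name, and the systemd unit are made-up placeholders, not a real tool's layout:

```python
#!/usr/bin/env python3
# Sketch of a git post-receive hook that deploys on push, in the spirit of
# Heroku's `git push` flow. Repeat per repo for multiple apps.
import subprocess
import sys

REPO = "/srv/git/myapp.git"    # bare repo created with `git init --bare` (placeholder)
WORK_TREE = "/srv/www/myapp"   # directory the app actually runs from (placeholder)
SERVICE = "myapp.service"      # systemd unit to restart after checkout (placeholder)

def pushed_branches(stdin_lines):
    # post-receive receives lines of the form "<old-sha> <new-sha> <refname>"
    return [line.split()[2].rsplit("/", 1)[-1]
            for line in stdin_lines if line.strip()]

def deploy(stdin_lines):
    # Only redeploy when 'main' was among the pushed branches.
    if "main" in pushed_branches(stdin_lines):
        subprocess.run(["git", f"--work-tree={WORK_TREE}",
                        f"--git-dir={REPO}", "checkout", "-f", "main"],
                       check=True)
        subprocess.run(["systemctl", "restart", SERVICE], check=True)

# In the actual hook file (REPO/hooks/post-receive, marked executable):
# deploy(sys.stdin.read().splitlines())
```

From your laptop it's then just `git remote add prod user@box:/srv/git/myapp.git` and `git push prod main`. Not a packaged product, but it's close to zero setup and maintenance once the hook is in place.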
AWS turned out to be 5-10 times more expensive; what's worse, our developers are spending more than half their time working around braindead AWS design decisions and bugs.
A disaster any way you look at it.
There are good reasons to choose AWS, but they're never technical. (Maybe you don't want to deal with cross-departmental communications, or you can't hire people into a sysadmin role for some reason, or maybe you want to hide hosting in operational expenses instead of capital, etc.)
Because my application involves live video transcoding I'm fairly demanding on CPU time, which is something that's hard to get (reliably) from a downmarket VPS operation (even DO or what have you) and costly from a cloud provider. On the other hand, dual 8 core Xeons don't cost very much when they're almost a decade old and they more than handle the job.
There are a few fairly reputable vendors for used servers out there, e.g. Unix Surplus, and they're probably cheaper than you think. I wouldn't trust used equipment with a business-critical workload but honestly it's more reliable than an EC2 instance in terms of lifetime-before-unscheduled-termination, and since I spend my workday doing "cloud-scale" or whatever I have minimal interest in doing it in my off-time, where I prefer to stick to an "old fashioned" approach of keeping my pets fed and groomed.
And, honestly, new equipment is probably cheaper than you think. Dealing with a Dell account rep is a monumental pain but the prices actually aren't that crazy. Last time I purchased over $100k in equipment (in a professional context, my hobbies haven't gotten that far yet) I was able to get a lot for it - and that's well less than burdened cost for one engineer.
Most dedicated servers come with unmetered bandwidth so not only is it cheap to serve large files but your bandwidth costs won't suddenly explode because of heavy usage or a DDoS attack.
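To see why that matters, here's a rough sketch comparing cloud egress at list price to a flat-rate dedicated box; both rates are my rough assumptions for illustration, not quotes:

```python
# Rough egress comparison: metered cloud bandwidth vs a flat dedicated server.
cloud_rate_per_gb = 0.09   # USD/GB, a typical first-tier list price (assumption)
dedicated_monthly = 100    # USD/mo for a dedicated box w/ unmetered port (assumption)

def cloud_egress_cost(tb_per_month):
    """Monthly cloud egress bill at list price for a given TB of transfer."""
    return tb_per_month * 1000 * cloud_rate_per_gb

# Even a modest 50 TB/mo of egress dwarfs the entire dedicated server bill,
# and a DDoS or a viral file can multiply it overnight.
print(cloud_egress_cost(50))  # roughly $4,500/mo at list price
```

The asymmetry is the point: with the dedicated box a traffic spike saturates your port at worst, while on metered cloud egress the same spike shows up directly on the invoice.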
Our company provides on-premise ERP systems to small (we’re talking at most 20 person companies) wholesale distributors.
Pre-COVID, I was pushing for a cloud solution to our product and pivoting our company towards that model. We’re at a hybrid approach when COVID hit.
What ends up happening with an on-premise/hybrid cloud model is we end up doing a lot of the sysadmin/IT support work for our customers just to get our ERP working. This includes getting ahold of static IP addresses (and absolving responsibility), configuring servers/OSes, and several other things along the same vein that’s wholly irrelevant to the actual ERP like inventory management and accounting.
Long story short, these customers of ours end up expecting us to maintain their on-premise server without actually paying for help or being knowledgeable about how it all works. We keep pitching them the cloud, but they're not willing to pay us a recurring fee, even though it actually saves the headaches of answering the question "whose responsibility is it to make sure this server keeps running?"
I think a lot of these answers here are dealing with large-scale products and services where the amount of data and capital costs is so massive it makes sense to start hiring your own admins solely to maintain servers. For these small mom-and-pop shops who are looking for automation, the cloud is still the way to go.
If by cloud you mean a public cloud like Google, Amazon, or Microsoft, then forget about it; not with these companies piping data directly to U.S. intelligence.
Getting Started --> Definitely go w/ CSPs. No need to worry about infra.
Pre Product Market Fit + Steady Growth --> On Premise, because CSPs might be expensive until you find a consistently profitable business.
Pre Product Market Fit + HyperGrowth --> CSPs, since you won't be able to keep up [we never got to this stage]
Product Market Fit w/ Sustainable Good Margins --> CSPs, pay to remove the headache [we never got to this stage]
Side Note: w/ GPUs, CSPs rarely make sense
As a result, I run a power-hungry Dell r610 with 24 cores and 48GB of ram with 20+ services on it for many different aspects of my company. All the critical stuff runs on DigitalOcean / Vultr, but the 20+ non-critical services like demo apps, CI/CD, cron workers, archiving, etc. run for <$200/yr in my closet.
I have 3 smallish VMs for a build server + managed SQL. It costs $500/mo. It doesn't make sense. Having my own VMs on ESXi makes everything very different - most of the time these VMs do nothing, but you want them to be performant from time to time, and there are plenty of resources available because all the other VMs are mostly idle too.
In the cloud, they are billed as if they were 100% loaded all the time.
I am not really satisfied with the latencies and the insane price for egress traffic. I just can't do backups daily, since that could cost a whopping $500/mo just for the traffic. This is just insane; I can't see how it could scale for the B2C market. For B2B it might work really well, though, since revenue per customer is much higher.
We are not moving to our own DC, but we keep the realtime stuff in the cloud, and anything that is not essential is being moved elsewhere. A bonus is that you need off-site backups anyway, in case the cloud vendor just bans you and deletes all your data.
Startups might move fast and iterate, but if you don't have your own servers you end up constantly throttling your usage, because costs can grow fast, effectively reducing your delivery capacity.
1. Physical control over data is still a premium for many professional investors. As a hedge fund CIO told me recently when I asked her why she was so anti-cloud migration, "I want our data to live on our servers in our building behind our security guard and I want our lawyer, not AWS's, to read the subpoena if the SEC ever comes for us."
2. There are a lot of niche ERP- and CRM-adjacent platforms out there -- e.g., medical imaging software -- where the best providers are still on-prem focused, so customers in that space are waiting for the software to catch up before they switch.
3. A lot of people still fundamentally don't trust the security of the cloud. And I'd say this distrust isn't of the tinfoil hat, "I don't believe SSL really works" variety that existed a decade ago. Instead it's, "we'd have to transition to a completely different SysAdmin environment and we'd probably fuck up an integration and inadvertently cause some kind of horrendous breach".
Contrary to popular belief, it does not in the slightest save you a sysadmin (most just end up unknowingly giving the task to their developers). And contrary to popular belief, the perf/price ratio is atrocious compared to just buying servers.
For some of the loads I had been doing the math for, I could rent a colo and buy a new beefy server every year with money to spare for the yearly cost of something approximating the performance in AWS...
Building iOS apps requires macOS, and even though there are some well-known "Mac hosting" services, none of them are actual cloud services similar to DigitalOcean, Azure, AWS, etc.
So it is much less expensive, and actually easier to scale and configure, to host the Macs on-prem.
(Off the record: if it is for internal use only, you can even stick in a few hackintoshes for high performance.)
1. We do "security"-critical stuff (affecting the security of people, not just data, if breached).
2. Aside from certain kinds of breaches, the requirements on performance and reliability are lowish (short outages of a few minutes are not a big problem; even outages of half a day or so can be coped with).
3. Slightly paranoid founders, with a good amount of mistrust of any cloud company.
4. Founders and a tech lead who have experience in some areas but thoroughly underestimate the (time) cost of managing servers themselves, and how _kinda_ hard it is to do that securely by yourself (wrt. long-term DDoS and similar).
So was it a good reason? Probably not. But we still went with it.
As a side note, while we did not use the cloud, _we didn't physically manage servers either_. Instead we had some dedicated hardware in a compute center in Germany which they did trust. So no "physical" management, securing, etc. needed, and some DDoS and network protection by default. Still, we probably could have had it easier without losing anything.
On the other hand, if you have dedicated server hardware in some trusted "localish" compute center, it's not _that_ bad to manage either.
Most people here seem to point out cost and utilization. I would like to offer another perspective: security.
I worked in both of these industries: finance ("banking", not crypto or PG) and medical (within a major hospital network). The security requirement, both from practical and legal perspectives, cannot be overstated. In many situations, the data cannot leave an on-prem air-gapped server network, let alone touch a cloud service.
It cost us more to have on-prem servers, as we need dedicated real estate and an engineering team to maintain them. Moreover, the initial capital expenditure is high -- designing and implementing a proper server room/data center with climate control, power wiring, and a compliant fire extinguishing system is not trivial.
Hardware is super cheap:
- A 40-slot rack, with gigabit fiber, dual power, and a handful of public IP addresses, costs on average 10000€/y.
- A reconditioned server on eBay with 16 cores and 96GB of RAM costs 500€ (I've never seen one break in 3 years).
- A brand new Dell PowerEdge with a 32-core AMD EPYC and 64GB of RAM will cost 3000€.
- Storage is super cheap: 500GB of SSD costs 80€ (consumer stuff is perfectly fine as long as you plan wisely between redundancy and careful load), and rotational disks are even cheaper. I've never seen a rotational disk break.
Once bought, all of this is yours forever, not for a single month. You can pack very remarkable densities in a rack and have MUCH more infrastructure estate at disposal than you would ever afford on AWS.
The flip side of the coin is that you need operations expertise. If it's always you, then OK (although you won't be doing much more than babysitting the datacenter). Otherwise, if you need to hire a dedicated person, people are the most expensive resource, and that should definitely be added to the cost of operations.
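For a feel of what those prices amortize to, here's a quick sketch; the 3-year service life is my assumption, the other figures are from the list above:

```python
# Amortized monthly cost of one refurbished server, using the figures above.
rack_per_year = 10_000  # 40-slot rack, fiber, dual power (EUR/year)
slots = 40
server_price = 500      # refurbished 16-core / 96 GB box (EUR, one-time)
years = 3               # assumed service life; the boxes tend to outlast it

def monthly_cost():
    rack_share = rack_per_year / slots / 12     # this server's slice of the rack
    amortized_hw = server_price / (years * 12)  # hardware spread over its life
    return rack_share + amortized_hw

print(round(monthly_cost(), 2))  # → 34.72
```

So roughly 35€/month all-in for 16 cores and 96GB of RAM, which is the kind of density-per-euro the parent means by "remarkable" - the ops salary, not the metal, is where the real money goes.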
We started with 2 white-box PC's as servers, 2 mirrored RAID1 drives in each. We added a 3rd PC we built ourselves: total nightmare. The motherboard had a bug where, when using both IDE channels, it overwrote the wrong drive. We nearly lost our entire business. Putting both drives on the same IDE channel fixed it, but that's dangerous for RAID1.
A few years in, we needed an upgrade and bought 5 identical SuperMicro 2U servers with hardware RAID1 for around $10K. Those things were beasts: rock solid, fast, and gave us plenty of capacity. We split our services across machines with DNS and the 5 machines were in a local LAN to talk to each other for access to the database server. The machines' serial ports were wired together in a daisy-chain so we could get direct console access to any machine that failed, and we had watchdog cards installed on each so that if one ever locked up, it automagically rebooted itself. When I left in 2005, we were handling 100's of request/s, every page dynamically generated from a search engine or database.
Of course it took effort to set all this up. But the nice thing is, you control and understand _everything_. Some big company doesn't just do things to you, you have no idea what is happening, and they're not talking. And if things do go south, you can very quickly figure out why, because you built it.
The biggest mistake we made was in the first few years, where we used those crappy white-box PCs. Sure, we saved a couple thousand dollars, but we had the money and it was a terrible trade. Night and day difference between those and real servers.
You have to have a good recovery plan for when equipment X's power supply fails, but when deployment is all automated it's very easy to overcome swapping bare metal, and easy to drill (practice) during off hours.
This makes it much easier to meet regulatory compliance: either statutory or regulations your org has created internally (e.g. financial controls in-org, working with vulnerable people or children, working with controlled substances, working with sensitive intellectual property.)
Simply being able to say you can pull the plug on something and do forensic analysis of the storage on a device is an important thing to say to stakeholders (carers, carers families, pupil parents.)
I’m so grateful to be living in the modern age when “cloud” software exists[2], but I don’t have to be in the cloud to use it.
The downside: you need trained staff, and it's completely inappropriate if you need serious bandwidth or power, or have to support a round-the-clock business (which we do not, because out here on the long tail we work in a single city, so we still have things like evenings, weekends, and holidays for maintenance!)
— [1] Premise vs premises is one of those "oh isn't the English language awful" distinctions. A premise is the logical foundation for some other theory ("the premise for my laziness is that because the sky is grey it will probably rain, so I'm not going to paint the house"), whereas the premise_s_ means physical real estate property ("this is my freshly painted house: I welcome you onto the premises").
[2] Ansible, Ubiquiti, arm SBCs like raspberry pi, Docker, LXC, IPv6 (making global routing for more tractable, IPv4 for the public and as an endpoint to get on the VPN.)
We recently moved one rack to a different DC in the same city and used DigitalOcean droplets to avoid downtime. Services running on Linux were migrated without high availability (e.g. no pgsql replication, no redis cluster, a single elasticsearch node...), and we just turned off the Windows VMs completely due to licensing issues and no need to have them running at night.
The price of this setup was almost 4x higher than what we pay for colo. Our servers are usually <5 years old Supermicro; we run on OpenStack and Ceph (S3, rbd) and provide VPNaaS to our clients.
AWS/GCP/Azure was out of the question due to cost. We considered moving the Windows servers to Azure, with the same result - the cost of running Windows Server (AD, Terminal, Tomcat) + MS SQL was many times higher than the price of colo per month. It is bizarre that for the Azure expenses you could buy the server running those VMs approximately every 3 months (Xeon Platinum, 512GB RAM).
I keep asking him about why they still use on premises equipment and it boils down to:
* Cost for training / transitioning + sunk cost fallacy
* Perceived security risk (right or wrong)
* IT is mostly invisible and currently "works" with the current arrangement, so why change?
Come back in 15 years? It still works. Is that possible with cloud, even over short periods like 2 years? No.
Will it ever be possible? No.
That's the primary reason for me. I can use cloud only for stuff that is nice but not mandatory for the service to work, like a status page.
Plus, work is more enjoyable than using somebody else's stuff.
One secondary factor is that our load has only monotonically increased, and it's way cheaper to keep 10-15% overprovisioned than to be on burst pricing with 50%+ constant load.
But the simplest math is: we have >100 storage servers that are 2U, 26x2TB flash, 256GB RAM, 36 cores. They cost $18k once, which we finance at pretty low interest over 36 months (and they really last longer than that). Factor in $200-400/mo to host each, depending (I think it's more like $200, but it doesn't matter for the cloud math).
That same server would be many thousands of dollars per month on any cloud we've seen - probably $4-6k/mo, depending on the type of EBS-ish storage attached, or with the dedicated server 'alternative' they are moving to offer (and Oracle sort of launched with).
It'd be cheaper but still >2x as expensive on Packet, IBM dedicated, OVH, Hetzner, or Leaseweb (OVH and Hetzner probably being the cheapest).
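The arithmetic behind that gap, as a quick sketch; the $300 hosting and $5k cloud figures are midpoints of the ranges above, so treat them as rough assumptions:

```python
# Monthly cost of one storage server (figures from above) vs a cloud estimate.
server_price = 18_000  # 2U, 26x2TB flash, 256 GB RAM, 36 cores (USD, one-time)
finance_months = 36    # financed term (interest ignored for simplicity)
hosting = 300          # USD/mo, midpoint of the $200-400 range above (assumption)
cloud_quote = 5_000    # USD/mo, midpoint of the assumed $4-6k cloud estimate

own = server_price / finance_months + hosting
print(own, cloud_quote / own)  # → 800.0 6.25
```

Roughly $800/mo per server owned vs an estimated $5k/mo rented: a 6x+ multiple even before the hardware outlives its financing term and drops to hosting-only cost.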
Three other factors for us:
1) Bandwidth would be outrageous on cloud, but probably not as outrageously high as the servers themselves, given that our outbound is just our SaaS portal/API usage
2) We'd still need a cabinet with router/switch infra to peer with a couple dozen customers that run networks (other SaaS/digital natives and SPs that want to send infrastructure telemetry via direct network interconnect).
3) We've had 5-6 ops folks for 3 of the 6 years, 3-4 for the couple of years before that. Going forward, as we double we'll probably +1. It is my belief that we'd need more people in ops, or at least an eng+ops mix, if we used public cloud. But in any case, the amount of time we spend adding to and debugging our infra is really, really small, and the benefit of knowing how servers and switching gear fail is huge for debugging (or not having to debug).
All that said - we do run private SaaS clusters, and 100% of them are on bare metal, even though we could run on cloud. Once we do the TCO, no one yet has wanted to go cloud for an always-on data-intensive footprint like ours.
Good luck with your journey, whichever way you go!
And happy to discuss more, email in profile
I still have to build physical networks occasionally (ex: we are building a small manufacturing facility in a very specific niche that's required to be onsite for compliance reasons), but the scale is so small that I can get away with a lot of open source components (pfSense Netgates are great) and not have to use things that are obnoxious to deal with (if I never have to deploy Cisco anything ever again, I won't be upset).
Recently though, I've been working on some distributed systems type projects which would allow these servers to be put in different physical locations (and power grids), and still continue to operate as a cohesive whole. This type of technology definitely increases my confidence in them being able to reliably host servers. I wouldn't want to be reliant on the cloud for large scale services though, from my understanding you can get some crazy cost savings by colocating some physical servers (especially for large data storage requirements).
I think if you choose cloud hosting that costs about the same as renting a dedicated server plus setting up virtualization yourself, then it's a fair choice (you can check on https://www.vpsbenchmarks.com/ or similar)
Another sweet configuration is dedicated servers with Kubernetes: good user experience for developers, easy to set up and maintain, easy to scale up/down
Our environment is a mixture of in-house developed apps and COTS. Until recently, our major COTS vendors didn't have cloud solutions. Now they have cloud solutions but they're far too costly for us to afford. So we need to keep them in-house and continue to employ the staff to support it.
Our in-house apps integrate the COTS systems. Our newer apps are mostly in the cloud. But the older ones are in technologies that need to stay where the database server is, which is in our server room for the reason stated in the last paragraph. Rewriting the apps isn't on our radar due to new work coming in.
Historically, outsource vs. in-source seems to ebb and flow. The clear path is usually muddied when new technologies come out to reduce cost on one side or the other.
This is for many reasons. The one that comes back to me now is that the file sizes are HUGE, because resolution is very high, so bandwidth is a major concern. Editors and colorists need rapid feedback on their work, which demands beefy workstations connected directly with high bandwidth connections to the source files. Doing something like this over a long distance network (even if the storage was free) would be prohibitively expensive, and sometimes literally impossible.
So the workloads are basically the antithesis of what cloud storage is optimized for: "random reads of typically short length, big append-only writes". The big production houses (LucasArts famously) are also incredibly secretive about their source material, and like to use physical access as a proxy for digital access.
It leads to some seemingly strange (to me as a cloud SWE guy) decisions. He pretty much exclusively purchases top of the line equipment (hard drives/ssds), and keeps minimal if any backups for most projects because there simply isn’t any room. It’s a recipe for disastrous data loss, and apparently it’s something that happens quite often to this day. It’s just extremely prohibitively expensive to do version control for movie development.
I don’t know to what extent cloud technologies can solve for this domain. I asked him if Netflix was innovating in this area, since they’re so famously invested in AWS, but he said that they mostly contracted out the production stuff, and only managed the distribution, which makes sense. The contractors don’t touch the cloud at all, for the most part.
Again most of this is secondhand, I’d be curious to hear more details or reports from other people in the movie industry.
In every other case, you are paying for the same hardware you could buy yourself, plus the cloud provider's IT staff, plus your own IT staff which you likely need anyways to figure out how to deal with the cloud provider, and then the cloud provider's profit margin, which is sizeable.
Not running our infrastructure in the cloud is part of our value proposition.
Our customers depend on us to detect and alert them when their services go down. We _have_ to be operational when the cloud providers are not, otherwise we aren’t providing our customer with a valuable service.
Another reason we don’t run in the cloud is because we store a substantial amount of data that is ever increasing. It’s cheaper to run our own SAN in a data center than to store and query it in the cloud.
The final reason is that our workloads aren't elastic. Our CPUs are never idle. In that type of use case, it's cheaper to own the hardware.
It had both compute and storage (NetApp). It had two twin sites in two different datacenters. The infra in each site consisted basically of six compute servers (24c/48t, 128GB RAM) and NetApp storage (two NetApp heads per site + disk shelves).
Such hardware has basically paid for itself across its seven or eight years of life, and having one of the sites in the building meant negligible latency.
The workload was mostly fixed, and the user base was relatively small (~1000 concurrent users, using the services almost 24/7).
It really checks all of the boxes, does all it is supposed to do and in a cheap manner.
We have a pool of 15 build servers for our CI. They run basically 100% CPU during office hours and transfer terabytes of data every day. They have no real requirements for backup, reliability, etc., but they need to be fast. If I run a pricing calculator for hosting those in the cloud, it's ridiculous. We are moving source and CI to cloud, but we'll probably keep all the build machines on-prem for the foreseeable future.
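As a rough sketch of why the pricing calculator looks ridiculous for a build farm like this, the model below compares 15 always-provisioned, CPU-heavy builders with heavy data transfer against owned hardware amortized over 3 years. Every price here is my own assumption (a 36-vCPU-class on-demand rate, $0.09/GB transfer, an $8k box), not a real quote.

```python
BUILDERS = 15
HOURS = 730                        # builders stay provisioned all month

cloud_instance = 1.53 * HOURS      # assumed on-demand $/hr, 36-vCPU class
cloud_egress = 2_000 * 0.09 * 30   # ~2 TB/day at an assumed $0.09/GB
cloud_monthly = BUILDERS * cloud_instance + cloud_egress

# Owned box: assumed $8k over 36 months, plus ~$60/mo power/hosting each
owned_monthly = BUILDERS * (8_000 / 36 + 60)

print(f"cloud: ${cloud_monthly:,.0f}/mo")
print(f"owned: ${owned_monthly:,.0f}/mo")
```

With these assumptions the cloud bill lands around 5x the owned-hardware cost, and the data-transfer line item alone exceeds the entire owned fleet's monthly cost, which matches the pattern in the comment.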
For customer-facing servers the calculation is completely different. More traffic means more business. Reliability, scalability, backups, and so on are important.
Many retail banks, asset management companies, or high-security companies refuse to use any public cloud.
They want to have a strict and traceable list of people who have physical access to their hardware.
This is in order to control any risk of a data leak [1].
In practice they generally use on-premise installations. They rent space in a computer center and own a private cage there, monitored with multiple cameras, meaning they know exactly who touches their hardware and can enforce security clearance for them.
Chronological order:
1. E-commerce, low volume (1000-5000RPM), very high value conversions, highly localized trade.
We built an on-prem stack using HashiCorp tools here. This place had on-prem stuff already in place, the usual vendor-driven crap: expensive hypervisor, expensive SPOF-y SAN, unreliable network. Anyway, my platform team (4-5 guys) built a silo on commodity hardware to run the new version of the site. This was a few years back, but the power you get from cheap hardware these days is astounding. With 6 basic servers in two DCs, stuffed with off-the-shelf SSDs, we could run the site and the dev teams no problem. Much less downtime compared to the expensive hyperconverged blade crap we started on, at basically no cost. There's a simplicity that wins out using actual network cables and 1U boxes... LXC is awesome btw! Using "legacy" VMware, EMC, HP etc. for non-essential on-prem? Cloud is tempting!
2. Very high volume (billions of requests per day), global network. AWS. Team tasked with improving on-demand scalability. We implemented Kubernetes on AWS and it really showed what it's about! After 6-7 months of struggle with k8s < 1.12, things turned around when it hit 1.12/1.13-ish and we got it to act how we wanted. Sort of, at least. Cloud is just a no-brainer for this type of round-the-clock, elastic workload. You'd need many millions up-front to even begin building something matching "the cloud" here. Lots of work spent tweaking cost, though. At this scale, cloud cost management is what you do.
3. Upstart dev shop. No RPM outside dev (yet). Azure. About 30 devs building cool stuff. Azure sucks as IaaS; they want you to PaaS, that's for sure. The cloud decision had already been made when I joined. Do you need cloud for this? No. Are there benefits? Some. Do they outweigh the cost? Hardly. In the end it will depend on how and where your product drives revenue. We pay for a small local dev datacenter quarterly, which I find annoying.
Just some quick thoughts off the top of my head (on the phone so excuse everything).
Happy to discuss further!
We're concerned about corporate espionage and infiltration, so we can't trust our servers being out of our sight. Most people don't have the code on their physical machines either; I'm a cocky breath of fresh air in that regard, in that I prefer my stuff to run locally instead of on the (slow, underpowered) VMs. I trust Apple's encryption a lot.
Services that we will continue to run on-premises (as an exception to that rule) are some machine learning training clusters (where we need a constant, high-level amount of GPU and cloud provider pricing for GPU machines is very far off the mark of what you can build/run yourself) and some file shares for our production facilities where very large raster files are created, manipulated, sent to production equipment, and then shortly afterwards deleted.
Most everything else is going to the cloud (including most of our ML deployed model use cases).
Imagine a company in Europe that decides to host its files on Alibaba Cloud in the US.
Imagine the US Department of State hosting its files with Google.
Imagine an energy company working on new reactor tech, ...
Imagine a Certificate Authority which has an offline store of root certificates which need to come online to sync twice a day.
Imagine cases where you need a hardware HSM.
Then there is also cost, as others have pointed out. AWS's cost structure is so complex [2] that whole business models [1] have sprung up to help you optimize the price tags or reduce the risk of huge bills. That's right: you need a commercial agreement with another partner that has nothing to do with your cloud, just to work around aggressive pricing. The guy who started this ~2 years ago has grown to 40+ people (organically), is based in Switzerland, and is still hiring even in this recession. It should give you an idea of how broken the cloud is.
Lastly, there is also lock-in. All the hours you have to sit down and learn how AWS IAM works are wasted once you decide to move to another cloud. The cost of learning the third-party API is incurred by you, not the cloud vendor. For people who think lock-in isn't much of a problem: remember that your whole hiring strategy will be aligned to whatever cloud vendor you're using (look at job descriptions that already filter based on AWS or GCP experience). Lock-in is so bad that for a business it is close to the rule of real estate (location, location, location), only it's to the advantage of the cloud vendor, not you as the customer.
[1] optimyze.cloud
[2] "I have just tried to pull the official EC2 price list via JSON API, and the JSON file is 1.3gb" https://twitter.com/halvarflake/status/1258161778770542594
I've personally been wondering whether that's wise, because financial data and the handling of many banking processes are a bank's core business. It makes sense that a bank should be in control of that. And it needs to obey tons of strict banking data regulations. But apparently modern cloud services are able to provide all of that.
We need random access to about 50TB of files, and quite a decent number of VMs.
For storage, on-prem vs cloud: buying was cheaper after 3(!) months.
For VMs (some of them could be containerized, though): 1 year.
It was cheaper to buy a decent second-hand server, slap in SSDs, and just install a decent hypervisor. Those costs also include the server room, power usage, admins, etc.
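A break-even point that fast is plausible for random-access storage at this scale. Here's a hedged back-of-envelope model; the $12k second-hand hardware price, $100/TB/mo block-storage rate, and $400/mo running costs are my assumptions chosen to illustrate the shape of the math, not the poster's actual figures.

```python
TB = 50                       # random-access working set from the comment
cloud_per_tb_month = 100.0    # assumed block-storage-class $/TB/mo
hardware_cost = 12_000        # assumed second-hand server + SSDs
onprem_running = 400          # assumed power, room, admin share per month

months = 0
cloud_total = onprem_total = 0.0
# Advance month by month until cumulative cloud spend passes
# the hardware purchase plus cumulative on-prem running costs.
while onprem_total + hardware_cost > cloud_total:
    months += 1
    cloud_total += TB * cloud_per_tb_month
    onprem_total += onprem_running

print(f"break-even after {months} months")
```

With random access ruling out cold object-storage tiers, the per-TB cloud rate is high enough that the purchase pays for itself within a quarter, consistent with the "3(!) months" above.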
We do use cloud backups for the most important stuff.
Cloud is cheaper if your business is something user-based, as in you might need to scale it, hard.
If you aren't doing anything like that it is absurdly expensive.
- lock-in, all the hyperscalers want to sell you value-add services that make it hard or impossible to move away.
- concentration risk, hyperscale providers are a well-understood target for malign actors. It's true they are better protected than most.
- complexity, given how little time the hyperscalers have been operating in comparison with corporate IT, they have created huge technical debt in the race to match features.
Redundant power, redundant internet connections, and a few racks of Dell servers and gigabit switches. Why did I mention Dell? They just don't seem to die. We used HP for a few years but had a few bad experiences.
A common thread of a lot of the replies to this post is network traffic costs. If one of the cloud providers can figure out a way to dramatically (and I mean at least 10x) reduce their network transfer pricing, then I think we'll see a second wave of companies adopting their services.
If your computing needs vary over time, then provisioning on-prem for peak load will mean that some of your resources will be idle at non-peak times. It may be cheaper to use cloud resources in cases like these, since you only need to pay for the extra capacity when you need it.
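The trade-off in the paragraph above can be sketched with a toy model: on-prem pays for peak capacity around the clock, cloud pays per unit-hour actually used. The $300/unit/mo and $1/unit-hour prices, and the two load shapes, are illustrative assumptions only.

```python
def onprem_cost(peak_units: int, unit_month_cost: float) -> float:
    return peak_units * unit_month_cost           # sized for peak, paid 24/7

def cloud_cost(usage_by_hour: list, unit_hour_cost: float) -> float:
    return sum(usage_by_hour) * unit_hour_cost    # pay only for what runs

HOURS = 730
# Bursty workload: baseline 10 units, spiking to 100 units for 30 hours/month
bursty = [10] * (HOURS - 30) + [100] * 30
# Flat workload: a constant 50 units
flat = [50] * HOURS

for name, load in [("bursty", bursty), ("flat", flat)]:
    print(name,
          f"on-prem ${onprem_cost(max(load), 300):,.0f}",
          f"cloud ${cloud_cost(load, 1.0):,.0f}")
```

Under these assumptions the bursty workload is far cheaper in the cloud (paying for peak on-prem triples the bill), while the flat workload flips the other way, which is exactly the "variable vs constant load" divide running through this thread.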
https://github.com/krisk84/retinanet_for_redaction_with_deep...
I haven't analyzed the TCO yet but the bandwidth costs alone from hosting my curated model of 100GB in the cloud (Azure blob) have greatly exceeded my total operation spend from downloading the entire dataset and running training. By an order of magnitude.
In general, I would say any noncritical system I would host on-prem.
(At least funded) startups should start with the cloud as speed to completion is key, but can later optimize for cost.
Elasticity of the cloud is also great, dealing with peak demands dynamically without having to purchase hardware.
I'd suggest larger companies use at least two cloud vendors to add resilience (when MS Teams went down, so did Slack; I was told they both use MSFT's cloud).
The cost benefits are huge, and since our app is mostly a normal web app, we don't need that many fancy cloud things. And I don't see us needing them in the future.
I really don't understand why a company doing similar things would want to go the cloud route. It's so damn expensive, and it's not always easy to use and set up.
I run a couple of on-premise xeon gold machines with 96 gb ram and 40+ cores on each. Their total purchase cost was the monthly cost of renting them on the cloud. Also, you will never get the full benefit of the servers you use unless they are dedicated instances with no virtualization layer.
Some equipment is very latency sensitive -- I'm talking microseconds, not milliseconds.
More generic tasks need easy access to that specialist equipment (much of which doesn't quite grasp the concept of security)
Given that we therefore have to run a secure network across hundreds of sites on multiple continents, adding a couple of machines running xen adds very little to the overhead.
It’s kind of like outsourcing. If you don’t know what you are doing, cost goes up and quality goes down.
https://forrestbrazeal.com/2020/01/05/code-wise-cloud-foolis...
For example, building a Ceph-based software-defined storage system with croit.io for S3 comes in at 1/10 to 1/5 of the AWS price in TCO. The same goes for any other product in the cloud.
If you only need the resources for a short time, up to 6 months, go to the cloud. If you plan to have them longer than 6 months, go to colocation.
Cost and security are important, but they may not be most important. In a business, the scarcest commodity is FOCUS. By outsourcing anything that isn't core to your product, you can excel at what differentiates you.
Using AWS / Azure / Google Cloud (even using datacenters from your own country) implies that the US government can access your data at will.
As soon as you treat sensitive information, especially related to non-US governments, this becomes a blocking factor.
If you suck and don’t understand costs, or don’t automate, or spend a lot of time eating steak with your Cisco team, you’ll save money... at first.
Though we have adopted something close to an "Edge Computing" solution... I guess it comes down to "Why not both?" :)
I think it also depends on your definition of "server"
We've never used cloud services and we do not want to use it.
Some say it's a matter of cost, but you know what? For a dual-node server (hot standby) we were quoted 120K € + 50K € just for configuration fees.
The company I work for actually develops and hosts an AWS clone for the Linux Foundation, but with very specific requirements. They have special needs that requires baremetal machines and "real" networking between them across 6+ NICs per server.
1. Replace capital expenditure of in-house infrastructure + staff with OpEx that can be dialled down
2. Get to benefit from the economies of scale that the cloud vendors get (those Intel CPUs get a lot cheaper when purchased in the 1000s)
3. Get to leverage big shiny new tech like Big Data and AI that's 'advertised' as 'out-of-the-box'
My only concern really is that the big cloud players are all fighting for dominance and market share. What happens in the next 5-10 years time when they start raising prices? Different kind of lock-in - customers won't have the expertise in-house to migrate stuff back.
Why? Because our (German) clients don't trust US cloud providers.
But also: legalities. Most cloud providers have very unclear rules about what exactly happens should you be in breach. For this reason, our business prefers to retain most of the control.
We're not 100% on-prem, but AWS, Google Cloud, and Azure are the worst examples of third-party hosting: unpredictable and with complicated billing. We're considering alternatives to the big 3 for "cloud hosting".
We can't store our data outside of the company (or even worse: outside of the EU)
That is changing right now, though.
I'm not a fan. In short:
Cost. CapEx and depreciation vs. OpEx. The numbers look amazing for ~3 years until the credits and discounts wear off. Then it's just high OpEx costs forever. Meanwhile I can depreciate my $10k server over time and get some cash back in taxes; plus it's paid for after a couple years -- $0 OpEx outside of licenses, and CentOS has no license cost.
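The depreciation argument above can be sketched numerically: a straight-line write-off of the $10k server yields a tax shield for its depreciable life, after which the effective on-prem cost drops to roughly zero, while cloud OpEx continues indefinitely. The 21% tax rate and $400/mo cloud-equivalent price are assumptions for illustration.

```python
server_cost = 10_000
depreciable_years = 5
tax_rate = 0.21                        # assumed corporate tax rate

annual_depreciation = server_cost / depreciable_years   # straight-line
annual_tax_shield = annual_depreciation * tax_rate      # "cash back in taxes"

cloud_opex = 400 * 12                  # assumed cloud-equivalent, per year

for year in range(1, 8):
    # After the depreciation schedule ends, on-prem cost is ~$0
    # (ignoring licenses, power, and eventual replacement for simplicity).
    onprem = (annual_depreciation - annual_tax_shield
              if year <= depreciable_years else 0)
    print(f"year {year}: on-prem effective ${onprem:,.0f}, cloud ${cloud_opex:,}")
```

The crossover is the point: the cloud numbers look fine for the first few discounted years, but the on-prem line trends toward zero while the OpEx line never does.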
Once you have significant presence in someone's cloud, they're not going to just lower costs either -- they've got you now. What in American capitalism circa 2020 makes you think they won't find a way to nickel-and-dime you to death?
It's not going to reduce headcount, either. Instead of 14 devops/sysadmins, now I have 14 cloud admins, sitting pretty with their Azure or GCP certs. Automation is what's going to reduce those headcounts and costs, and Ansible+Jenkins+Kubernetes works fine just with VMware, Docker, and Cisco on-prem.
Trust. Google Cloud just had a 12-hour outage -- I first read about it here on HN. AWS and Azure have had plenty of outages too... usually they're just not as open as Google is about it. You also have to trust that they won't get back-doored like what happened to NordVPN's providers, and that they're not secretly MITM'ing everything or duplicating your data. We (and some of our clients) compete with some of the cloud provider companies and their subsidiaries, and we know for a fact that they will investigate and siphon any data that could give them an advantage.
Purpose. We just don't need hyper-scalable architecture. We've got a (mostly) fixed number of users in a fixed number of locations, with needs that are fairly easy to estimate / build for. Outside of a handful of sales & financial processing purposes, we will never scale up or down in any dramatic fashion. And for the one-off cases, we can either make it work with VMware, or outsource it to the software provider's SaaS cloud offering.
If we were doing e-commerce -- absolutely. Some sort of Android app? Sure, AWS or Azure would be great. But it's a lot of risk and cost with no benefit for the Enterprise orgs that can afford their own stuff.
The essence of IT is to apply technology to solve a business problem. Otherwise, why would the business spend the money? The IT solution might be crazy/stupid/complex, but if it works, many businesses simply adopt it and move on. Now, move that crazy/stupid/complex process to the cloud and surprise: it is very, very expensive. So, yes, the cloud is better, but only for some things. And until legacy applications are rewritten, on-premise will exist.
One final insight. The cloud costs more. It has been engineered to be so, both from a profitability standpoint (Amazon is a for-profit company) and because the cloud has decomposed the infrastructure of IT into functional subcomponents, each of which costs money. When I was younger, the challenge for IT was explaining to management the ROI of new servers, expanded networking, additional technology. We never quite got it right and often had it completely wrong. That was because we lacked the ability to account for, track, and manage the actual costs of an on-premise operation. Accounting had one view, operations had another view, and management had no idea, really, why they were spending millions a year and could not get their business goals accomplished. The cloud changed all of that. You can do almost anything in the cloud, for a price. And I will humbly submit that the cost of the cloud, minus the aforementioned profitability, is what on-premise organizations should have been spending all along. Anyone reading this who has spent time in a legacy environment knows that it is basically a futile exercise of keeping the plates spinning. On-premise failed because it could not get management to understand the value of in-house IT.
As I said, the costs are the same. A gallon of water weighs what it weighs regardless of location. It will be interesting to see, I predict the pendulum will swing back.
We don't generally trust cloud providers to meet our requirements for:
* uptime (network and machine - both because we are good at reliability [and we're willing to spend extra on it] and because we have lots of fancy redundant infrastructure that we can't rely on from cloud companies)
* latency (this is a big one)
* security, to some degree
* if something crazy is happening, that's when we need hardware, and that's when hardware is hard to get. Consider how Azure was running out of space during the last few months. It would have cost us an insane amount of money if we couldn't grow our data centers during Corona! We probably have at least 20-30% free hot capacity in our datacenters, so we can grow quickly.
We also have a number of machines with specs that would be hard to get e.g. on AWS.
We have some machines on external cloud services, but probably less than 1% of our deployed boxes.
We move a lot of bandwidth internally (tens of terabytes a day at least, hundreds some days), and I'm not sure we could do that cheaply on AWS (maybe you could).
We do use
It's also very, very expensive to have a 96-vCPU VM on Amazon!
https://www.businessinsider.com/bank-of-americas-350-million...
The problem is a single point of failure: many businesses need to be independent, and having data stored in the cloud is a bad idea overall because it produces single-point-of-failure issues. Consider if we ever got a really nasty solar wind and the electric grid went down; the more we rely on the internet and centralize infrastructure into electric devices, the more it becomes a costly point of failure.
While many see redundancy as "waste" in terms of dollars, notice that our bodies have many billions of redundant cells and that's what makes us resilient as a species, we can take a licking and keep on ticking.
Trusting your data to outside sources is generally a bad idea any day of the week. You always want to have backups and data available in case of disaster, mishap, etc.
It's like no one has learned from this epidemic yet. Notice that our economic philosophy didn't plan for viral infections and has forced our capitalist society to make serious adjustments. Helping people is anathema to liberals and conservatives / republicans / democrats, so for COVID to come along and actually force co-operation was a bit tragically humorous.
As a general rule, you need redundancy if you want to survive; behaving as if the cloud is almighty is a bad idea. I'm not sold on "software as a service" or any of that nonsense. It's just there to lull you into a false sense of security.
You always need to plan for the worst-case scenario, for survivability reasons.