(Hard-won bit of experience: k8s and Redis really don't like each other if Redis 1. is configured to load from disk, and 2. your memory limit for the Redis container is somewhat tightly bounded. At least from the k8s controller's perspective, Redis apparently uses ~400% of its steady-state memory while reading the AOF tail of an RDB file, which gets the container stuck in an OOM-kill loop until you come along and temporarily de-bound its memory.)
However, we're considering switching back to k8s for stateful components, with a different approach: allocating single-node node-pools with taints that map 1:1 to each stateful component, effectively making these more like "k8s-managed VMs" than "k8s-managed containers." The point would be to get away from the need to manage the VMs ourselves, giving them over to GKE, while still retaining the assumptions of VM isolation (e.g. not having/needing memory limits, because the single pod is the only tenant of the VM anyway.)
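A rough sketch of what such a "k8s-managed VM" pod might look like, assuming the node pool was created with a hypothetical label workload=redis and taint dedicated=redis:NoSchedule (both names made up for illustration):

```yaml
# Hypothetical sketch: pin one stateful pod to its own single-node pool.
# Assumes the pool carries the label `workload: redis` and the taint
# `dedicated=redis:NoSchedule` (names are placeholders).
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis
spec:
  serviceName: redis
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      nodeSelector:
        workload: redis          # only land on the dedicated pool
      tolerations:
        - key: dedicated
          operator: Equal
          value: redis
          effect: NoSchedule     # nothing else tolerates the taint
      containers:
        - name: redis
          image: redis:7
          resources:
            requests:
              memory: 4Gi        # for scheduling/bin-packing only
            # no memory limit: the pod is the node's only tenant, so it can
            # use whatever the VM has, including the load-from-disk spike
```

With no limit set, the only memory ceiling is the node itself, which is exactly the VM-isolation assumption described above.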
On the other hand, once everyone on the team has experience building such a system from scratch, then deploying k8s and using it somehow becomes straightforward.
It's almost as if we need to learn how a tool works before being able to use it effectively.
Anyways, what we (actually didn't) replace it with:
- Don't let your devs learn about k8s on the job.
- Let them run side-projects on your internal cluster.
- Give them a small allowance to run their stuff on your network and learn how to do that safely.
- Give your devs time to code review each other's internally-hosted side-projects-that-use-k8s.
- Reap the benefits of a team that has learnt the ins and outs of k8s without messing up your products.
Note: this isn't everyone's end game but I suspect it's realistic for a lot of people.
I would like to go back to cleanly divided, architected IaaS and Ansible. It was fast, extremely reliable, cheaper to run, had a much lower cognitive load and a million fewer footguns. Possibly more important: not everything can be wedged into containers cleanly, despite the promises.
We ended up writing our own control plane that uses NATS as a message bus. We are in the process of open sourcing it here: https://github.com/drifting-in-space/spawner
From the Operations side, Kubernetes is scary. It's easy to screw things up and you can definitely run into problems. I understand why folks who work mostly on that side of the house are put off by the complexity of Kubernetes.
However, from the application side of things, our developers have been THRILLED with Kubernetes. For most developers my company provides a nice paved-road experience with minimal customization required. For advanced use cases, we allow developers to use the Kubernetes API (along with ArgoCD + Gatekeeper policies) as a break-glass type of approach. Istio gives the infra team the ability to move services between clusters and make policy changes easily. It also allows us to make use of Knative, although I think the Istio requirement is no longer there.
That said, you should be using managed Kubernetes wherever possible and not running your own clusters. That's where trouble lurks.
The result of the migration was that there is little underlying infrastructure to maintain, and ongoing operational costs were lowered by 50% year over year. The CTO and I liked the setup so much, we started converting another large client of theirs. I followed up with them at the beginning of 2022 to see how things were going, and they still love it. There is so little maintenance, and now they have more time to focus on what they do best: software!
Other options on the horizon that I'm testing include utilizing AWS Copilot with ECS/Fargate, and/or Copilot with Amazon App Runner.
If a team were to start with no legacy and no complexity and there isn't going to be multi-team/multi-owner/shared-services I could see them using something else. But that applies to anything.
These days I'm a huge fan of CDK and Pipelines style deployments. I prefer to treat my compute layer as a swappable component which I'll change as and when I need to. I tend to lean towards serverless offerings which take care of the internal scaling details if I can while still giving me a traditional "instance", and if I can't then I'll go for the next best managed offering.
I've yet to see an example where internal tooling doesn't become a mess over time, and K8S requires a ton of work to keep things sensible.
I remember managing hundreds of virtual machines in datacenters & cloud, using Ansible and a myriad of other tooling.
It's nice when you're at a small scale and you don't have a lot of people making changes, but over time as it grows the pain grows with it unless you've enforced a consistent cattle model.
The longer VMs live with custom changes/code and updates, the more brittle they can become. Part of the cattle model is that you can recreate/rebuild when changing code so things stay consistent. The drift from infrastructure as code can be scary otherwise.
With the cattle model you need pipelines in place to build new VM images for infrastructure updates (Packer etc.), and multiple APIs to hit (easier in the cloud) to upload images and roll them out in a non-damaging way (HA deployments, rollouts, dealing with load balancers). It's certainly a non-trivial amount of work.
With Kubernetes, a lot of this tooling comes out of the box. You've got autoscaling, load balancing, health-checks, limits/requests, failure mitigation, service mesh options. On top of that it's served in a strict semi-consistent way. Good luck replicating that with virtual machines without a lot of tooling and effort.
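To make "comes out of the box" concrete: health checks and autoscaling are just a bit of configuration rather than custom tooling. A rough sketch, where the names, image and thresholds are placeholders rather than anything from a real setup:

```yaml
# Illustrative only: a Deployment with health checks plus a CPU-based autoscaler.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: example/web:1.0     # placeholder image
          ports:
            - containerPort: 8080
          readinessProbe:            # taken out of rotation until ready
            httpGet:
              path: /healthz
              port: 8080
          livenessProbe:             # restarted if it wedges
            httpGet:
              path: /healthz
              port: 8080
          resources:
            requests:
              cpu: 250m              # the autoscaler below scales relative to this request
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70     # % of the CPU request, averaged across pods
```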
If you can learn the Kubernetes tooling it can do a lot for you. However, I agree that not all setups need it; a lot of the time small setups never grow, and that's OK: a few virtual machines aren't that big of a deal.
We still use virtual machines for workloads that aren't container friendly, and to be honest these days I abhor it, even with pipelines in place.
Replaced with Linux servers and SSH.
Have done a lot of work with k8s in the past. Not the right tool for my startup.
There are still use cases where k8s wins, but Nomad handles state a bit better and is easier to reason about from scratch.
- Taking random .yml configs from The Internet™ to install an Nginx Ingress with automatic LetsEncrypt certs felt not-exactly-great. It's no better than piping curl to bash, except the potential impact is not that your computer is dead, but that the entirety of prod goes down.
- Because of this, upgrades of Kubernetes are a pain. The DigitalOcean admin panel will complain about problems in 'our' configs that aren't actually OUR configs. We don't know how to fix that, or whether ignoring the warnings and upgrading will break our production apps.
- Upgrades of Kubernetes itself aren't actually zero downtime, and we couldn't figure out how to do that (even after investing a significant amount of research time).
- We were using only a tiny subset of the functionality in Kubernetes. Specifically we wanted high-availability application servers (2+ pods in parallel) with zero-downtime deployments, connecting to a DO managed PostgreSQL instance, with a webserver that does SSL termination in front of it (a sketch of roughly what that boils down to follows this list).
- Setting up deployments from a GitLab CI/CD pipeline was pretty hard, and it turned out the functionality for managing a Kubernetes cluster from GitLab was not really done with our use case in mind (I think?).
- It would be bad enough if DigitalOcean shit the bed, but the biggest problem was that we couldn't reliably recognize if something was a problem caused by us, or by DO. Try explaining that one to your customers.
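For reference, the "tiny subset" mentioned in the list above boils down to not much more than this kind of manifest (a rough sketch with placeholder names, not our actual config):

```yaml
# Sketch of the whole requirement: 2+ replicas with zero-downtime rollouts.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0      # never drop below 2 ready pods during a deploy
      maxSurge: 1            # bring a new pod up before removing an old one
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
    spec:
      containers:
        - name: app
          image: registry.example.com/app:latest   # placeholder
          ports:
            - containerPort: 3000
          readinessProbe:    # the rollout only proceeds once this passes
            httpGet:
              path: /health
              port: 3000
          env:
            - name: DATABASE_URL   # points at the DO managed PostgreSQL
              valueFrom:
                secretKeyRef:
                  name: app-secrets
                  key: database-url
```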
Summarizing: it was just too complex and fragile, even once you wrap your head around what the hell a Pod, a Deployment, an Ingress and Ingress Controller, and all of the other Kubernetes lingo actually means. I suspect you need a dedicated infra person who knows their stuff to make this work, so it could very well make sense for larger companies, but for our situation it was overkill. We were not intellectually in control of this setup, and I do not feel comfortable running production workloads (systems used by 20k high-school students, mission-critical applications used by logistical companies) on something we couldn't quite grasp.
We went to a much simpler setup on Fly.io, and have been happy since. It's a shame they seem to be too young of a company to really be super reliable, but I suspect this is only a matter of time. In terms of feature set, it's all we need.
With this approach to hosting and deployment, I think Kubernetes' main advantage is that it opens the door to new kinds of infrastructure businesses, not that it makes hosting a website any easier.
I am slowly moving towards using Hashicorp's Nomad running on Fedora CoreOS using the Podman and QEMU drivers. I rolled out Nomad at work for internal projects and it lets me get things done quickly without living in a total YAML hellscape.
1: https://docs.fedoraproject.org/en-US/fedora-coreos/getting-s...
We are a small team of 5 infrastructure engineers and previously managed 200+ libvirt VMs running on bare-metal HA hypervisors in a GlusterFS storage pool (software agency, different customer application services). We started to migrate to GKE in 2017 and finished within a year or so.
I know many associate k8s with a YAML mess, but this is actually our most favourite part of it. We are able to describe a whole customer project in this format and it's not something we have to maintain in-house (Ansible). As long as you don't try to be smart (templating/helm, operator dependence), it works out pretty well; prefer plain manifests and extend them with your own validation scripts.
Nevertheless, if you have no 24/7 operations, stay the hell away from bare-metal - go managed.
My company has clients who usually have very simple requirements. A Python/Django app server and a database. Sometimes there will be another background service or two (memcached or equivalent etc).
The most complex site we had was the above but with some Postgres replication clients.
We use docker and docker-compose. We've used ansible in the past as well as fabric and other simple solutions.
We've had a couple of devs try and convince us that we should be using Kubernetes and I counter with "it's overkill for what we need". Am I wrong?
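For a sense of scale, the whole stack for a typical client fits in a compose file roughly like this (service names, images and credentials are placeholders):

```yaml
# Illustrative docker-compose.yml for the usual Django + Postgres setup.
version: "3.8"
services:
  web:
    build: .
    command: gunicorn myproject.wsgi:application --bind 0.0.0.0:8000  # placeholder project module
    ports:
      - "8000:8000"
    environment:
      DATABASE_URL: postgres://app:app@db:5432/app
    depends_on:
      - db
      - cache
  db:
    image: postgres:14
    environment:
      POSTGRES_USER: app
      POSTGRES_PASSWORD: app     # real deployments use env files/secrets
      POSTGRES_DB: app
    volumes:
      - pgdata:/var/lib/postgresql/data
  cache:
    image: memcached:1.6
volumes:
  pgdata:
```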
For my new projects nowadays, I'm pushing mainly serverless approaches using AWS Lambdas (behind API Gateways for stuff that needs to be reachable by HTTP).
I think this shifts the complexity from managing Kubernetes and its accompanying ten-thousand-yaml-files to infrastructure-as-code and the complexities of dealing with AWS. And I happen to prefer the latter, even though it's not better by any enormous margin.
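To illustrate where the complexity moves to: a single Lambda behind an API Gateway is only a handful of lines of IaC. A minimal sketch using AWS SAM (just one of several IaC options; all names and paths here are placeholders):

```yaml
# Minimal AWS SAM template: one Lambda fronted by an API Gateway HTTP API.
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Resources:
  HelloFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: app.handler       # placeholder module/function
      Runtime: python3.12
      CodeUri: ./src             # placeholder path to the function code
      Events:
        HttpApi:
          Type: HttpApi          # creates the API Gateway route in front of the function
          Properties:
            Path: /hello
            Method: get
```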
For the few things that need to be always-online, or 3rd party self-hosted apps, I'm still on Kubernetes, or pure Docker if possible.
I went from on-metal K8s clusters, which were a complete PITA and required a full team to manage, to using EKS which has been everything K8s should be... easy peasy.
We don't really have a use-case for Boundary but it looks pretty neat as well if you do.
Was on k8s for years and I don't miss it one bit.
While there definitely is some complexity once you get serious and set everything up properly with Raft, federation, Connect, CAs, proxies, ACLs, proper secrets lifecycles... I find it's worth it. With the current assumption that HC will keep improving and existing bugs and edge cases will be ironed out.
Also: our main reason to adopt Kubernetes was to stay cloud-agnostic, but we soon realized that this is as unrealistic as writing a complex app's SQL in a vendor-independent way.
Instead, we decided to embrace our cloud (AWS) by using their CDK tooling and leveraging their features as much as possible. If we ever need to switch to another cloud we will bear the cost then, but for now it is clearly YAGNI.
Same goes for heroku/digital ocean app services. Even elastic beanstalk. If you are large enough that you need to manage your own k8s cluster, that is one thing, but I would encourage you to look at your needs from a usage and compute perspective long before you start solutionizing with trendy technologies.
I am trying to move to k3s, but it is just too complex to run anything, and the problem of exposing services to the internet is still not solved.
What I want is to declare that this service should live under this domain and this IP. For that you still need to configure your (bare-metal) load balancer manually, set up certificates, etc. I am writing a tool to automate this, but it's been a pain.
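For the record, this is roughly the declaration I'd like to be sufficient on its own. In practice it only does anything once an ingress controller, cert-manager and something like MetalLB (or a hand-configured load balancer) are already in place, which is exactly the part that stays manual on bare metal. Hostnames and names below are placeholders:

```yaml
# What I want to write and be done with: "this service, under this domain, with TLS".
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt   # assumes cert-manager is installed
spec:
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: myapp
                port:
                  number: 80
  tls:
    - hosts:
        - app.example.com
      secretName: myapp-tls
  # The external IP still has to come from whatever sits in front of the
  # ingress controller (MetalLB, or the manually configured load balancer).
```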
1. Is it really so complicated?
2. Is that complexity incidental or essential?
3. Could we get away with a simpler set of abstractions for 90% of applications?
But it doesn't get brought up as an option very often, because Docker basically FUDed themselves by having two things called Swarm and then loudly killing the older one, making everyone think it no longer exists.
Didn't pass my BS test.
I am glad that people are moving on to something that will exhaust their creative juices on something ... pointless instead of focusing on delivering value for their customers :)
More people using the brand new tech - less competition :)
To be honest, even with the technical overhead it'll probably solve a lot of problems for us from a workflow perspective. We've (the engineers) been arguing for more component-level testing for years (as opposed to the all-up E2E testing we're required to do now, which typically turns into component-level testing anyway), and containerizing everything is a good excuse to push it into reality. It'll also make deployments a lot easier (just roll back to X image if there's a problem). Right now we have tens of thousands of lines of hand-written deployment scripts that manage everything and have to be maintained, intimate knowledge of how they work is often limited to whoever wrote them (many of whom are no longer with the company), and if there's a problem you have to do surgery on the environment. Kubernetes will give us a unified deployment architecture with problems you can google.
We have about 100 devs in multiple teams. Kubernetes provides a great level of standardization and transparency - a completely different experience from VMs, where the admin team had too much ability to cut corners and build technology debt. People would riot if they had to go back to those days.
A few warnings:
- It takes some resources. Maybe this can be mitigated with k3s or similar, but I don't have first-hand knowledge here.
- It requires some time to learn and configure properly. If your entire team is 3 people and you are on a limited budget, it's probably not a good idea.
- Adopt some tools (Helm?) and standardize deployments where possible. Bare k8s is a bit too much for daily work.
- Read good practices and don't try to be smarter, at least until you really know what you are doing. Misconfigured limits may really burn you at the least convenient moment (see the snippet below).
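On that last point about limits, this is the shape of thing to get right early; the numbers below are placeholders, not recommendations:

```yaml
# Illustrative container spec fragment; values are placeholders.
containers:
  - name: app
    image: example/app:1.0
    resources:
      requests:            # what the scheduler reserves for the pod
        cpu: 250m
        memory: 512Mi
      limits:
        memory: 1Gi        # the OOM-kill line: leave headroom above observed peaks
        # CPU limit intentionally omitted here to avoid throttling (a common choice);
        # setting requests far below real usage, or memory limits at steady-state,
        # is the classic way to get burned at the least convenient moment.
```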
And then a couple weeks ago, I was tasked with standing up a new Ansible AWX server, which is now done via a k8s operator. It was an exquisitely painful experience. This is potentially a bad example because I'm pretty sure IBM's plan with AWX is now to make me suffer, but through that entire process, k8s just felt like extreme overkill.
I'm pretty sure that's going to be the last time I use k8s. I know it makes sense for some use cases, but it just doesn't feel intuitive to me in any way. And although it may seem more efficient, I absolutely dread having to troubleshoot any problems down the road.
I'm probably not the target audience, but thought I'd leave a comment for fun anyway.
Do I recommend Kubernetes to other people/companies though? Absolutely not! The learning curve is incredibly steep, and it really does take investment into understanding how it works.
But to anyone who is looking to use Kubernetes, I highly recommend https://helm.sh since it actually makes templating deployments significantly easier.
But in reality, I think I developed an allergic reaction to complexity and hype. I took some metrics: things like the time taken, steps involved and happiness generated by my current build/release stages, then compared them to k8s.
In conclusion, struggling to learn k8s forced me to find joy in the simplicity - knowing that one day (that will never come), I can just hire someone to do this... "It's only a problem when it's a problem".
For now, I have a lovely bash script that is triggered on Github releases (using Actions), which uses doctl to do the following:
1) Create a new server from my baseline image
2) Run the setup steps as defined in the Dockerfile, although it doesn't use Docker (it just makes sense to keep the configuration I used to have)
3) Copy the built-and-tested version of the repository to the new server
4) Run any post-deployment scripts, like database migrations, whatever
5) Move the reserved IP to the new server
It takes about a minute from me clicking "new release" in Github to seeing the changes hit production. If there's a problem, I move the reserved IP back. Load balancers, database clusters, etc... they're all set up manually because "it's only a problem when it's a problem".
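Roughly what that workflow looks like; the doctl invocations are from memory and simplified, so treat the exact flags, names and secrets as approximations rather than a copy-paste recipe:

```yaml
# Sketch of the release workflow (simplified; image, size, region and secrets are placeholders).
name: deploy
on:
  release:
    types: [published]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: digitalocean/action-doctl@v2
        with:
          token: ${{ secrets.DIGITALOCEAN_ACCESS_TOKEN }}
      - name: Create server from baseline image
        run: |
          DROPLET_ID=$(doctl compute droplet create "app-${{ github.run_id }}" \
            --image my-baseline-snapshot --size s-2vcpu-4gb --region ams3 \
            --ssh-keys "${{ secrets.DO_SSH_KEY_ID }}" --wait --format ID --no-header)
          echo "DROPLET_ID=$DROPLET_ID" >> "$GITHUB_ENV"
          echo "DROPLET_IP=$(doctl compute droplet get "$DROPLET_ID" \
            --format PublicIPv4 --no-header)" >> "$GITHUB_ENV"
      - name: Copy build and run post-deploy scripts
        # assumes SSH access to the droplet is already set up on the runner
        run: |
          scp -r ./build "deploy@$DROPLET_IP:/srv/app"
          ssh "deploy@$DROPLET_IP" "/srv/app/post_deploy.sh"   # migrations etc.
      - name: Move reserved IP to the new server
        # reserved-ip-action was previously called floating-ip-action in older doctl
        run: doctl compute reserved-ip-action assign "${{ secrets.RESERVED_IP }}" "$DROPLET_ID"
```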
Kuberneeties only ever generated problems for me.
We feel the same about Docker. People have no idea what's running when they download a docker image. Stuff can be buried deep deep within an operating system image. Security should be simple, transparent, and minimal so it can be reviewed easily. Reviewing a docker image is impossible. I'm convinced the correct place for isolation is systemd. This guy wrote a great starter for hardening the crap out of your services: https://docs.arbitrary.ch/security/systemd.html Systemd offers a bridge too with nspawn if you're not ready to undertake ultra minimal hardening of services.
Scaling is a "sexy" problem to have, though, and software "engineers" love to think that their SaaS product with 100 users is going to take Google-scale workloads; thus what could be done in a LAMP stack on a single DO server is inflated into a fantasy that will never come to fruition.
I have spent the past couple of weeks working with kustomize, since I do not like helm, and while it gets the job done I think Tanka would be better.
We are on GKE which makes things a lot easier and I personally would not choose to run my own cluster.
(disclaimer: I worked at Hashi for 4 months in 2020, but not related to Nomad)
I still run KEDA at home for managing plex, home assistant, some game servers and other projects of my own. But being the only one using the cluster is a different use case from getting RBAC, ingress and management set up correctly for a production cluster, IMO. I've never had sole responsibility for or permission over a cluster before, so it was a daunting step I decided not to take for my own sake.
People talk about it being incredibly complex, and honestly I don't see it. Yeah there's a layer of jargon you have to dive into, but it all makes sense once you start building something with it. By far the most complex pieces for us are the integration points with AWS (we're using EKS.) The examples/docs available are just not that great.
We run most of our app on Google App Engine explicitly to avoid devops work. However, we have a stateless-but-memory-hungry image manipulation service that was just too expensive on GAE. We migrated that service to k8s on Digital Ocean.
It was a disaster. I mean, it worked, but suddenly we were spending a lot of time learning k8s and fussing with k8s and it slowed down feature development. K8s is a time sink. So we migrated the service to Digital Ocean App Platform and velocity returned to normal.
I'm not wholly thrilled with DO App Platform. It has some maturity issues, and while it's cheaper than GAE, RAM is still more expensive than Elastic Beanstalk (which charges you more or less the EC2 VM cost). So we'll probably move it there someday.
I would stick, no matter the company size, with IaaS + $Deploymenttool (Ansible or so) and Docker, get comfortable with that, and only then, when everything works as intended, make the switch to k8s.
Creating a scalable system is complicated within AWS account limits.
All we really want is to shove Docker containers behind a load balancer and not worry about having to manage yet another system.
At work, we're currently trying to migrate our stack to k8s. Why? Because our startup is getting bigger and bigger and our current platform sucks, but our products are becoming a lot more complex as time goes on. We benched a few platforms and landed on EKS + ArgoCD + Vault. Works really well.
For personal projects I roll with just docker.
I'm bothered by the minimum requirements of k8s; I want to deploy on $5 machines.
We came from Ansible managed deployments of vanilla docker with nginx as single node ingress with another load balancer on top of that.
Worked fine, but HA for containers that are only allowed to exist once in the stack was one thing that caused us headaches.
Then, we had a workshop for Rancher RKE. Looked promising at the start, but operating it became a headache as we didn't have enough people in the project team to maintain it. Certificates expiring was an issue and the fact that you actually kinda had to baby-sit the cluster was a turn off.
We killed the switch to Kubernetes and moved back to Ansible + nginx + Docker.
In the meantime we were toying around with Docker Swarm for smaller scale deployments and inhouse infrastructure. We didn't find anything to not like and are currently moving into that direction.
How we do things in Swarm:
1. Monitoring using an updated Swarmprom stack (https://github.com/neuroforgede/swarmsible/tree/master/envir...)
2. Graphical Insights into the Cluster / Debugging -> Portainer
3. Ingress: Traefik together with tecnativa/docker-socket-proxy so that Traefik does not have to run on the managers (see the sketch after this list)
4. Container Autoscaling: did not need it yet for our internal installations as well as our customer deployments on bare metal, but we would go for a solution based on prometheus metrics, similar to https://github.com/UnclePhil/ascaler
5. Hardware Autoscaling: We would build a custom script for this based on prometheus that automatically orders servers of Hetzner using their hcloud-cli
6. Volumes: Hetzner Cloud Plugin, see https://github.com/costela/docker-volume-hetzner - Looking forward to CSI support though.
7. Load Balancer + SSL: in front of the Swarm using our Cloud Provider
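As a sketch of item 3, the per-service wiring is just labels on the Swarm service; something like the following, assuming Traefik is already running with its Swarm provider pointed at the socket proxy (image, hostname and network names are placeholders):

```yaml
# Illustrative Swarm stack fragment for one app behind Traefik.
version: "3.8"
services:
  myapp:
    image: registry.example.com/myapp:1.2.3   # placeholder
    networks:
      - ingress-net
    deploy:
      replicas: 2
      labels:   # in Swarm mode, Traefik reads labels from deploy.labels
        - traefik.enable=true
        - traefik.http.routers.myapp.rule=Host(`myapp.example.com`)
        - traefik.http.services.myapp.loadbalancer.server.port=8080
networks:
  ingress-net:
    external: true   # overlay network shared with the Traefik service
```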
Reasons that we would dabble in k8s again:
1. A lot of projects are k8s only (see OpenFaaS for example)
2. Finer grained control for User permissions
3. Service Mesh to introduce service accounts without having to go through a custom proxy
Honestly - I even use it personally for my self-hosted stuff at this point. The learning curve is... steep. But once you come out the other side, it's a great tool.
- Ansible to provision such VMs
- Docker to start/stop containers on such VMs
It feels like a breeze of fresh air!
Now building fully with serverless
Therefore you either can't find anyone, or, more likely, you hire less capable DevOps engineers.
The solution is to not use k8s as a startup. The less a DevOps engineer can shoot themselves in the foot the better.
It all scales and works fine. There are one or two problems here and there, but not enough for me to consider that it does not work.
The two philosophies at megacorp here seem to be "I built it from the ground up to target X service" where X service is usually amazon serverless or something, and "I built it in docker containers but I don't know about the cloud".
The former is a conscious decision, and we (Architecture) have a serious, sit down discussion with them about what it actually means to be fully cloud native for that particular service. This discussion ranges from cost analysis, to things like "is your application actually built correctly to do this", to "you're not going to have access to onprem resources if you do this", even asking them simply "why".
A lot of the time, when the teams realize they're going to be on the hook for the cost alone, they back out, and a lot of teams try to do it because "we don't understand K8s". Well, it doesn't get much better in Cloud Run either, folks, because you're trading K8s YAML for Terraform or CloudFormation.
Where it has been successful is for teams which own APIs which only get called once a month, or very low traffic APIs. I hate to say it boils down to cost, but a lot of the time it really does boil down to cost.
Additionally we've seen a weird boomerang effect as clouds offer K8s clusters which are simply priced per pod rather than per worker node (like GKE Autopilot). A lot of teams which straddled the middle of "low traffic but not low enough to really migrate" have found they're quite happy in GKE Autopilot. They use autoscalers to provide surge protection, but they just use Autopilot with 1 or 2 pods running and it keeps the costs down. That also means we can migrate them to beefier clusters in a heartbeat if they get the Hug of Death or something from HN. ;)
The second use case I discussed gets railroaded into our K8s clusters we built ourselves because we can typically get them to use our templates which provide ingresses and service meshes and the developers don't have to think about it too much, and the Devops team is comfortable with the technologies. While it means that there's a bit of "rubber stamping" and potential waste, it's allowed us to use K8s and the nice features it provides without having to invest too much in thinking about it for an individual application.
Personal: I use docker-compose on VMs.
For context, I ran a DevOps team for the last 4 years that managed two products on AWS - one on EKS and one on ECS. I was mostly managing by the time we had k8s in our stack, so I didn't get to interact with it much directly, but I know infrastructure generally (and I know ECS inside and out, unfortunately). For that infrastructure, we had a whole team managing the ECS deployment. We managed the EKS infrastructure with the equivalent of one DevOps engineer's time or less for years. It was only when it started scaling to millions of users that we needed to give it more time and attention. Both infrastructures (ECS and EKS) were pretty complex, with multiple services that needed their own configuration and handling.
I left that company a few months back to try to build my own thing and I just finished building out the alpha infrastructure for it on Kubernetes. I can now safely say, as an infrastructure engineer, Kubernetes is an absolute joy to work with compared to lesser abstractions. At least, when someone else is managing the control plane for you. It has exactly the right abstractions, with the right defaults, and it behaves basically exactly as it should.
Yes, it's complicated. Yes, there are a lot of moving pieces. Yes, there are hard problems. That's just the reality of software infrastructure. That's not kubernetes, those are just the problems of infrastructure. There's a whole set of problems kubernetes is working to solve in addition to those. Remove kubernetes and you still have those problems, but then you also have the whole set of problems Kubernetes solves as well.
I think what's really happening with this whole "Kubernetes is too complicated" thing is that a lot of teams expect to be able to use it like Heroku. That's not what it is. Or they try to build out infrastructures with javascript/php/python/etc engineers. You wouldn't try to have a team of front-end engineers build your REST backend. It's not reasonable to expect JavaScript engineers to know how to build and operate an infrastructure - at least not without dedicating themselves to learning the tooling and the space full time for a while. Think of it from the perspective of a frontend engineer learning Python and Django to build out a REST backend, and then multiply the complexity by 4. That's just infrastructure, regardless of what you're using.
If you just need to run and scale a container fast and simple, with maybe a single database - then sure the PaaS providers might fit your needs for a while. But eventually the trade off is going to be cost and limitations. You'll eventually need a piece of infrastructure they don't provide.
TL;DR Kubernetes isn't the problem here. Infrastructure work is just plain complicated. If you want multiple services, high availability, reliability, scalability, security, and performance, it's just complicated and hard. Don't short change it. Dedicate someone to learning it or hire someone who knows it.