- Is it the networking model, which is simple from the consumption standpoint but has too many moving parts in its implementation?
- Is it the storage model, CSI and friends?
- Is it the bunch of controller loops doing their own things with nothing that gives a "wholesome" picture to identify the root cause?
For me personally, the first and foremost thing on my mind is the networking details. They are "automatically generated" by each CNI solution in slightly different ways and with different constructs (iptables, virtual bridges, routing daemons, eBPF, etc. etc.), and because they are generated, it is not uncommon to find hundreds of iptables rules and chains (and/or similar configuration) on a single node.
Being automated, these solutions generate tons of components/configuration, and in case of trouble, even someone with mastery of them would take some time to hop through all the components (virtual interfaces, virtual bridges, iptables chains and rules, ipvs entries, etc.) to identify what's causing the trouble. Essentially, one pretty much has to be a network engineer, because besides the underlying/physical network (or the virtual one, I mean cloud VPCs), k8s pulls in its very own network (pod network, cluster network) implemented in the software/configuration layer, which has to be fully understood to be maintained.
God forbid the CNI solution hits some edge case, or some other misconfiguration makes it keep generating inadequate or misconfigured rules/routes etc., resulting in a broken "software-defined network" I cannot identify in time on a production system. That is my nightmare, and I don't know how to reduce that risk.
What's your Kubernetes nightmare?
We have a few rules:
1. Read a good intro book cover-to-cover before trying to understand it.
2. Pay a cloud vendor to supply a working, managed Kubernetes cluster.
3. Prefer fewer larger clusters with namespaces (and node pools if needed) to lots of tiny clusters.
4. Don't get clever with Kubernetes networking. In fact, touch it as little as possible and hope really hard it continues to work.
This is enough to handle 10-50 servers with occasional spikes above 300. It's not perfect, but then again, once you have that many machines, pretty much every solution requires some occasional care and feeding.
My personal Kubernetes nightmare is having to build a cluster from scratch on bare metal.
- Compute
- Deployment
- CI
- Networking
- Storage
- Policies
Imagine running a microservices solution without an orchestration solution - how many people would it take to administer the servers, the storage, the network, the policies, etc. And with Kubernetes, you get maybe a couple of teams if you're lucky. This is the power and the leverage of the platform.
But also, imagine in that environment, how many things can go wrong, and the amount of expertise that you need to properly debug them. You still need that amount of expertise, because all of that complexity is still in place (or at least most of it is) - if your physical disks are throwing errors, you need someone who knows how to debug and replace that. Not hard. But then you have Ceph above that, and Rook above that (or whatever storage solution you use). And then you've got the deployment that has to make the PVC successfully. And it's like that for everything. Every problem has the potential to be a full stack problem for any one of half a dozen stacks.
It's a lot.
1. Latch-up states. It's very, very easy for something to go wrong, blow up a whole deployment, and lose all the pods; a health check failure, for example. Most application frameworks have some sort of request queuing, and the health checks sit in the same queue, so any upstream issue gives you health check failures and flapping. Of course the autoscaler goes fucking bonkers in the middle of that. The only thing you can do is drop traffic at the network edge and wait for it to pull itself together.
2. No one knows how to fix it if anything major goes wrong. Even cloud providers. It's so large and complicated that no one independently has enough knowledge to actually fix it. For example, I suffered through months of weird network issues where pods would come up without network. To this day no one knows why that happened or can explain it. No amount of debugging and reverse engineering resulted in even a single step forward, so the only outcome was "replace the whole cluster".
Don't get me wrong, I still like it but I wouldn't want to run it with little expertise at hand. It's not something I would trust someone to run without production experience, which is difficult because there are very few people out there who are battle hardened past trivial home deployments and tiny little stacks.
1. We built our own custom build system, because there is no CI that can do actual DAGs (maybe a few can): a custom Kubernetes operator that parses Jsonnet files to create hundreds of CRDs and pods to achieve extreme parallelization. EKS was $144/mo (now $72), with no info on master node types. Using watch endpoints with hundreds of pods did not scale well; they had to bump the master node instances up to c5.18xlarge, at the same managed price. But figuring out that this scale-up was what was needed took days. One c5.18xlarge is about $2k/month, and EKS runs at least 3 for HA, so it's a horror story for them. Then again, we also ran hundreds of worker nodes, which might offset some of that.
2. Similar to CI, we allowed devs to deploy all microservices (~80) from any branch so that they could port-forward and use them. All of them had Ingress endpoints. After days of headaches and frustration, it turned out nginx ingress generates megabytes of configuration whenever a new deployment occurs, forks a new subprocess with the new config, and kills the other connections. When that happens often, it takes 30GB of memory with 50 developers using it (~4000 pods), and it frequently dies and restarts. Similar story for Prometheus and kube-state-metrics; they do not like short-lived containers and hog memory.
- Maintaining 200+ clusters for 10 small applications
- Cloud bills
- Autoscaling never working well
- Trying to untangle Terraform state without taking down Prod
My #2 (probably partially caused by #1) is the lack of attention paid to RBAC in vendor-supplied manifests. Multiple times I've found that the vendor's YAML binds some controller's service account to a ClusterRole giving access to all secrets in the cluster, when it only really needs to read one. After filing a GitHub issue it seems that I'm the first to even notice, even on popular projects that have been around for years.
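For comparison, scoping that access down is only a handful of lines. A minimal sketch of the pattern (all names here are hypothetical, not from any particular vendor's chart): a namespaced Role granting read access to the one secret, instead of a ClusterRole over every secret in the cluster.

    # Namespaced Role limited to a single named secret
    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      name: read-controller-secret
      namespace: vendor-system
    rules:
      - apiGroups: [""]
        resources: ["secrets"]
        resourceNames: ["vendor-controller-credentials"]  # the one secret it needs
        verbs: ["get"]
    ---
    # Bind it to the controller's service account
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: read-controller-secret
      namespace: vendor-system
    subjects:
      - kind: ServiceAccount
        name: vendor-controller
        namespace: vendor-system
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: Role
      name: read-controller-secret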
But then upgrading can be very risky: if you have any problem at all, unless you understand the Helm chart you can rarely simply downgrade/uninstall, and you could have caused a fatal problem. For a cluster, the resilience is meaningless if you make a change that blocks access to all services.
Other issues relate to dependencies and breaking changes, which can be subtle and not easy to discover, like the fact that some old resource uses a v1beta API type which becomes deprecated.
I think once it is working, Kubernetes is very reliable for me but it is when making infrastructure changes that things can go south very quickly. Updating deployments etc. is fine.
So we have a few Spring Boot based webapps which were running (along with PgSQL) on a shared AWS t2.medium instance; we migrated these to a GKE cluster with a node pool of e2-standard-2 instances. The nodes are on a private network and don't have public IPs. The services are exposed via a Load Balancer based Ingress (with SSL). Even after allocating one core and 2GB RAM to PgSQL, the API calls from the GKE applications are perceptibly slower than on the shared AWS t2.medium based deployment. We tried giving generous CPU and RAM to the applications; it still didn't improve the response time. Since these are the very first applications being moved to this cluster, there isn't much else running on it.
Not sure what's causing the slowness. Have any of you experienced something like this in GKE?
Inherited a web site and hosting from another studio. They set up a PHP site in a Docker container inside a VPS. They don't use microservices; it's one monolithic container. They didn't set up any way to get logs out of the thing. They don't use docker compose to build an image; they get a console on the container and use it like a VPS.
They literally just use it to add another layer of containerisation on their VPS.
You already need to understand Linux to use Docker or Kubernetes. If you don't use microservices or need horizontal scaling, it's just more to learn: an extra layer of complexity that's super fragile and a nightmare to debug.
It has such a niche use case, but everyone uses it where it's not useful because it's trendy. They want to put on their CV that they have used Docker/Kubernetes; they don't have to write that it wasn't necessary and caused issues.
I suddenly wake up, covered in cold sweat. My heart is pumping so hard.
I take out my phone. I search the internet. Kubernetes still reigns, no simpler approach made it.
The end.
We thought it was an application issue, but it was actually on the database side: the timestamp of each message was using the local time of the MongoDB instance, and between different instances the time was different. We realized that the Kubernetes nodes had trouble connecting to the NTP server, due to a rule in a random firewall.
When we fixed it, all the messages were in the right order.
Running software at scale is my nightmare.
Shared FS between nodes, autoscaling volume claim sizes, autoscaling volume claim IOPS, and measuring storage utilisation (e.g. IOPS) per pod/node/PV.
How have I solved it? I haven't, and I know it will be a key part of cost control for us in about 12 months.
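For what it's worth, the one part of this with a first-class knob is growing PVC sizes: the StorageClass has to opt in. A sketch, where the provisioner and parameters are assumptions for an AWS EBS CSI setup:

    # StorageClass that permits PVC resizing; provisioner/parameters assume
    # the AWS EBS CSI driver and are illustrative only.
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: expandable-gp3
    provisioner: ebs.csi.aws.com
    parameters:
      type: gp3
    allowVolumeExpansion: true

With that set, editing spec.resources.requests.storage on a bound PVC triggers an (often online) expansion; shrinking is still not supported, and true autoscaling of size or IOPS needs some external operator watching usage.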
Fast deploy:
I'm trying to get a test cluster up in less than half an hour. With the DAG for building it all, I'm getting a failure rate of 30% unless I leave in arbitrary waits and extra steps. I've also only automated about 25% of our stuff, so I expect it will take longer.
It seems that if you stick to simple configs, a setup hosted for you, etc., basically the happy path, then people have had really good experiences with k8s. Those people can't understand how one could be inept enough _not_ to figure it out.
On the other hand, you'll also hear a lot of complaints about the difficulty of self-managed clusters, and attempting certain less popular or more complicated configs (or what have you). These people can't understand what benefit introducing such an insane amount of complexity could bring.
The second has mostly been my experience. I've tried now maybe a handful of times to create a cluster and get it running something on my home lab. At first I could rarely get it "up", but now I can usually get it to the point where I'd want to include storage or whatnot, and that's where I've been failing lately. Either way, I've never gotten it stable enough to warrant actual usage from me.
I like the idea of k8s; it seems like the natural next step of computing abstractions. I'm just not sure if "it's it", or if it's stable/reliable/evolved enough for people who don't need it now to invest in it yet.
The biggest nightmare for me is networking, simply because I'm not trained in networking. I know the basics to become a senior sysadmin but it's not natural to me. So mix in kubernetes and it becomes even more abstract.
The documentation at the main Kubernetes site is poor, and is being deprecated, but not in favour of anything new.
Kubernetes itself has always been fine, just like a bare network or bare OS has been, but when you start stacking stuff built by other people (especially when the stuff isn't of the best quality) it just goes downhill from there.
Perhaps the actual nightmare is inadequate quality control... but that's not really specific to packaging or shared components in Kubernetes.
Long story short, a node crashed, and when it came back up, the pods wouldn't start. We spent a couple of days trying to figure it out, but nothing was working. This was in production, so we made the choice to rebuild the entire cluster on a newer version. We still had other nodes running, and were scaled enough that there was no complete downtime, but we were maxing out the CPU and some connections were getting dropped.
My two biggest gripes:
- Loss of visibility, especially related to inspecting network data as it moves from LB to pod.
- Half-baked tooling around the ecosystem, although this does seem to be slowly improving
My two biggest likes:
- I genuinely save a bunch of time with it at this point (it still occasionally sucker punches me)
- I can take the experience from my day job and self-host quite a large number of useful applications at home on old hardware.
Endpoint security software, just because it adds some policy that usually isn't written by the team trying to run the application, and applies that policy in sometimes non-obvious ways. Even when you think it's turned off, sometimes it isn't, and the vendor will leave kernel modules running and partial configurations behind.
Red Hat was more a result of the stability policy for the kernel, and often running much older kernels than other distributions. We had lots of problems with the more modern kernel features used by Kubernetes, which we had to track down and often link to known fixes. One customer even replaced their kernels so they wouldn't have as many issues. This may be less and less the case with newer Red Hat releases, and I also have no reason to believe OpenShift suffers in the same way... just that I've spent a large amount of time troubleshooting this.
I run my own clusters and it just works.
Sure, I have to ignore a lot of crap in the setup phase; there are so many products out there I don't want to pay for. The nightmare may come from some devop installing a bunch of Helm charts without configuring things properly.
Scaling down to a minimal cluster is a real concern: I would like to run k8s for some micro project that literally runs on a $5 VPS, but it's too heavy for that.
1. Managing etcd nodes -- Reconciliation is a patient waiting game; try to rush it and you'll lose your cluster.
2. Kubernetes networking -- It is nearly impossible to trace packets coming through an LB into a Kubernetes pod without a very deep understanding of the different networking layers and CNIs. A lot can go wrong here.
3. Running persistent volumes in Kubernetes -- This can range from outright unstable and dangerous to merely annoying; at the very best you'll intermittently lose access to services due to volume claims being detached/reattached. I would highly recommend avoiding this.
4. Running "sticky" services -- StatefulSets let you run enumerated services with sticky sessions, but my experience with any sticky service is that it tends to be somewhat volatile, as Kubernetes really loves to move workloads at its convenience. I've found StatefulSets to be a red flag when considering whether something belongs in Kubernetes.
But my main pain points are around Kubernetes and all the hidden stuff.
Kubernetes alone is not enough; you need Terraform or Helm (or both) to have something manageable and deployable by a team. When things error or don't behave the way you expected, it all becomes so complicated and cryptic that you're sometimes better off deleting an entire resource than understanding the underlying issue.
Some things, like dependencies between resources (e.g. a Deployment depending on a ConfigMap: updating the ConfigMap won't restart the Deployment), make things a lot more complicated than you expect.
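The usual workaround, if you template with Helm, is to hash the config into a pod-template annotation so a ConfigMap change rolls the Deployment. A sketch (the template path is hypothetical and chart-specific):

    # Helm template fragment: any change to the rendered ConfigMap changes
    # the checksum, which changes the pod template, which triggers a rollout.
    kind: Deployment
    spec:
      template:
        metadata:
          annotations:
            checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}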
There is too much vendor-specific stuff necessary to make a Kube cluster work for you to expect a single Terraform setup that is multi-cloud, etc.
My comfort zone is where Kubernetes works fine and I don't have to touch it, or only update trivial stuff.
Problems started when we tried to push too many things into the clusters. Databases, and especially Elasticsearch with Kibana collecting metrics from the cluster, ended up killing performance.
So it's like everything: some cases are great for K8s, some are terrible. This, plus the complex abstractions, makes it not that developer friendly, but overall it does a good job of running services and letting them scale without having to worry too much about hardware.
OpenStack, Pivotal Cloud Foundry, internal compute platform
So far I think the nightmare problem is people trying to run it and CNCF software (Prometheus, various operators) with only a cursory understanding of how it works (me included)
It's easy to shoot yourself in the foot (oops, forgot requests on a resource-intensive, high-replica-count deploy and hosed cluster autoscaling)
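The foot-gun is that requests are what both the scheduler and the cluster autoscaler reason about; omit them and every replica looks free. A minimal container fragment, with made-up numbers:

    # Without requests, the autoscaler has no signal that N replicas of this
    # need real capacity; these values are illustrative only.
    resources:
      requests:
        cpu: 500m
        memory: 512Mi
      limits:
        memory: 1Gi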
The deprecation lifecycle, and running ingress controllers in an automatic scaling group.
The first isn't as much of an issue if you have a (partially) dedicated team for managing your clusters, but can be prohibitively expensive (effort / time-wise) for smaller organisations.
The second highlights a bigger problem in K8s in general. I'll have to give a little background first:
If you run an Nginx ingress controller on a node that's part of an ASG — i.e. a group where nodes can disappear, or increase in number — you will see service disruption for a small percentage of your requests every time a scaling event occurs. This is caused by a misalignment between the timeout values of your load balancer and Nginx, which cannot be fixed:
* https://github.com/kubernetes/ingress-nginx/issues/6281
* https://github.com/kubernetes/ingress-nginx/issues/6791
* https://github.com/kubernetes/ingress-nginx/issues/7175
The fix is to only run the controllers on nodes that reside in a separate, statically sized group, and to perform updates to them out of hours when necessary :|
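Concretely, that workaround looks something like pinning the controller pods to the static group (the label and taint names here are assumptions):

    # Fragment of the ingress controller Deployment's pod template: schedule
    # only onto the statically sized node group and tolerate its taint.
    spec:
      template:
        spec:
          nodeSelector:
            node-group: ingress-static        # hypothetical node label
          tolerations:
            - key: dedicated                  # hypothetical taint on those nodes
              operator: Equal
              value: ingress
              effect: NoSchedule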
I'll leave you to decide on whether that's a fix or not, but the larger point it highlights is how _theoretically_ everything's great in K8s, but the headaches introduced by the complexity often make it not worth it.
Another example is pod disruption budgets. These are needed because the behaviour of K8s when instructed to shut down a node is, well, to shut down the node. Seems reasonable, until you realise that it doesn't handle moving the workloads off that node _first_. No, at some point later, the scheduler realises the pods aren't running and schedules them somewhere else. So you use a combination of PDBs to tell K8s that it must keep n pods of this deployment running at all times, and distribution rules to tell it pods must run on different nodes. This solution falls apart when you have pods that should only ever have a single instance running.
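For reference, that combination looks roughly like this (names and numbers are hypothetical):

    # PDB: voluntary disruptions (e.g. node drains) must leave >= 2 pods up.
    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: web-pdb
    spec:
      minAvailable: 2
      selector:
        matchLabels:
          app: web
    ---
    # Deployment with replicas spread across nodes, so a single node going
    # away can't violate the budget on its own.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: web
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: web
      template:
        metadata:
          labels:
            app: web
        spec:
          topologySpreadConstraints:
            - maxSkew: 1
              topologyKey: kubernetes.io/hostname
              whenUnsatisfiable: DoNotSchedule
              labelSelector:
                matchLabels:
                  app: web
          containers:
            - name: web
              image: example/web:1.0   # hypothetical image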
Even typing out the words makes me want to lie down in a dark room.
And then perhaps the proper handling of persistent disk.
Kubernetes Failure Stories: a compiled list of links to public failure stories related to Kubernetes, with the most recent publications on top.
Out of the things I have touched, it's load-balanced ingress (when running on premises). So yeah, it's networking.
- Migration of all our customer workloads from PSP (PodSecurityPolicy) to Gatekeeper.
In short, the ingress may route traffic to a pod after it has been killed. The solution is that when a pod gets a SIGTERM signal, it should mark itself not ready, wait for some amount of time, and only then shut down (see e.g. https://deepsource.io/blog/zero-downtime-deployment/). I've heard arguments for this behavior, but those aren't the trade-offs I would make.
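The usual way that waiting gets wired in is a preStop hook, so the old pod keeps serving while the endpoint removal propagates. A sketch with an arbitrary delay (assumes the image ships a sleep binary; names are hypothetical):

    # Pod-spec fragment: the preStop sleep gives ingress/endpoints time to
    # stop routing here before SIGTERM reaches the app; 15s is arbitrary.
    spec:
      terminationGracePeriodSeconds: 30
      containers:
        - name: app
          image: example/app:1.0       # hypothetical image
          lifecycle:
            preStop:
              exec:
                command: ["sleep", "15"]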
Suppose that you work in an org that successfully ships software in a variety of ways - as regular packaged software that runs on an OS directly (e.g. a .jar that expects a certain JDK version in the VM), or maybe even uses containers sometimes, be it with Nomad, Swarm or something else.
And then a project comes along that needs Kubernetes, because someone else made that choice for you (in some orgs it might be a requirement from the side of clients, others might want to be able to claim that their software runs on Kubernetes, and in other cases some dev might be padding their CV before leaving), and now you need to deal with the consequences.
But here's the thing - if the organization doesn't have enough buy-in into Kubernetes, it's as if you're starting everything from 0, especially if paying some cloud vendor to give you a managed cluster isn't in the cards, be it because of data storage requirements (even for dev environments), other compliance reasons or even just corporate policy.
So, I might be given a single VM on a server, with 8 GB of RAM for launching 4 or so Java/.NET services, as that is a decent amount of resources for doing things the old way. But now, I need to fit a whole Kubernetes cluster in there, which in most configurations eats resources like there's no tomorrow. Oh, and the colleagues also don't have too much experience working with Kubernetes, so some sort of a helpful UI might be nice to have, except that the org uses RPM distros and there are no resources for an install of OpenShift on that VM.
But how much can I even do with that amount of resources, then? Well, I did manage to get K3s (a certified K8s distro by Rancher) up and running, though my hopes of connecting it with the actual Rancher tool (https://rancher.com/) to act as a good web UI didn't succeed. Mostly because of some weirdness with the cgroups support and Rancher running as a Docker container in many cases, which just kind of broke. I did get Portainer (https://www.portainer.io/) up and running instead, but back then I think there were certain problems with the UI, as it's still very much in active development and gradually receives lots of updates. I might have just gone with Kubernetes dashboard, but admittedly the whole login thing isn't quite as intuitive as the alternatives.
That said, everything kind of broke down for a bit when I needed to set up the ingress. What if you have a wildcard certificate along the lines of *.something.else.org.com and want it used for all of your apps? Back in the day, you'd just set up Nginx or Apache as your reverse proxy and let it worry about SSL/TLS termination. That duty is now taken over by Kubernetes, except that by default K3s comes with Traefik as its ingress controller of choice, and the documentation isn't exactly stellar.
So to get this sort of configuration up and running, I needed to think about a HelmChartConfig for Traefik, a ConfigMap which references the secrets, a TLSStore to contain them, as well as creating the actual TLS secrets themselves from the appropriate files on the file system. It still feels a bit odd, and it would probably be an utter mess to get particular certificates up and running for some other paths, plus Let's Encrypt for others yet. In short, what previously would have been those very same files living on the file system and a few (dozen?) lines of reverse proxy configuration is now a distributed mess of abstractions and actions which certainly takes some getting used to.
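Roughly what that dance ends up looking like, as far as I can tell (Traefik v2 CRDs as shipped with K3s; names and namespace are illustrative):

    # The wildcard cert/key pair, loaded from the files on disk, e.g. with:
    #   kubectl create secret tls wildcard-tls --cert=tls.crt --key=tls.key
    apiVersion: v1
    kind: Secret
    metadata:
      name: wildcard-tls
      namespace: default
    type: kubernetes.io/tls
    data:
      tls.crt: <base64-encoded certificate>
      tls.key: <base64-encoded key>
    ---
    # Traefik treats the TLSStore named "default" as the default certificate
    # store for ingress routes that don't name their own cert.
    apiVersion: traefik.containo.us/v1alpha1
    kind: TLSStore
    metadata:
      name: default
      namespace: default
    spec:
      defaultCertificate:
        secretName: wildcard-tls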
Oh, and Portainer sometimes just gets confused and fails to figure out how to properly setup the routes, though I do have to say that at least MetalLB does its job nicely.
And then? Well, we can't just ship manifests directly; we also need Helm charts! But of course, in addition to writing those and setting up the CI for packaging them, you also need something running to store them, as well as any Docker images that you build. In lieu of going through all of the red tape to set that up on shared infrastructure (which would need cleanup policies, access controls, and lots of planning so things don't break for other parties using it), I instead crammed an instance of Nexus/Artifactory/Harbor/... onto that very same server, with the very same resource limits, with deadlines still looming over my head.
But that's not all, for software isn't developed in a vacuum. Throw in all of the regular issues with developing software, like not being 100% clear on each of the configuration values the apps need (because developers are fallible, of course), changes to what they want to use, problems with DB initialization (of course, still needing an instance of PostgreSQL/MariaDB running on that very same server, which for whatever reason might get used as a shared DB), and so on.
In short, you take a process that already has pain points in most orgs and make it needlessly more complex. There are tangible benefits to using Kubernetes: once you find a setup that works (personally, Ubuntu LTS or a similar distro, a full Rancher install, maybe K3s as the underlying cluster or RKE/K3s/k0s on separate nodes, with Nginx for ingress, or a 100% separately managed ingress), it's great, and the standardization is almost like a superpower (as long as you don't go crazy with CRDs). Yet you need to pay a certain cost up front.
What could be done to alleviate some of the pain points?
In short, I think that:
- expect to need a lot more resources than previously: always have a separate node for managing your cluster and put any sort of tooling on it as well (like Portainer/Rancher), but run your app workloads on other nodes (K3s or k0s can still be fairly undemanding resource-wise for the most part)
- don't shy away from tools like Portainer/Rancher/Lens to make the learning curve more shallow; inspect the YAML they generate and familiarize yourself with the low-level stuff as necessary, while still having an easy-to-understand overview of everything
- don't forget about needing somewhere to store Helm charts and container images, be it another node or a cloud offering of some sort
- if you can, just go for the cloud, but even if managed K8s is not in the cards for you, still strive at least for some sort of self-service approach for the inevitable reinstalls
- speaking of which, treat your clusters as *almost* disposable: have all of the instructions for preparing them somewhere, ideally as an executable script (maybe use Ansible)
- don't stray too far from what you get out of the box, and look in the direction of the most tried and tested solutions, like an Nginx ingress (Traefik with K3s should *technically* have the better integration, but the lack of proper docs works against it; you'll probably want a cookbook of sorts)
- also, manage your expectations: getting things up and running will probably take a long time and will be a serious aspect of development that cannot be overlooked; no, you won't have a cluster up and running on-prem with everything you need in 2 days
- ideally, have a proper DevOps team, or at least a group of people who'll spearhead information sharing and create knowledge bases or templates so it's easier in the future
So, in summary, it can be a nightmare if you have unrealistic expectations or an unrealistic view of how Kubernetes might solve all of your problems, without an understanding of the tradeoffs it requires. I still think that Nomad/Swarm/Compose might work better for many smaller projects/teams out there, but the benefits of Kubernetes are also hard to argue against. If you manage to get that far, though, and only then.
The abstractions we have available to build and run distributed systems may have improved, but they still suck in the grand scheme of things. My personal nightmare is that nothing better comes along soon.
> - Is it the networking model that is simple from the consumption standpoint but has too many moving parts for it to be implemented?
Many poor sysadmins before us have tried to implement Neutron (OpenStack Networking Service) with OvS or a bunch of half-assed vendor SDNs. Or LBaaS with HAProxy.
> - Is it the storage model, CSI and friends?
I mean, the most popular CSI for running on-premise is rook.io, which is just wrapping Ceph. Ceph is just as hard to run as ever, and a lot of that is justified by the inherent complexity of providing high performance multi-tenant storage.
> - Is it the bunch of controller loops doing their own things with nothing that gives a "wholesome" picture to identify the root cause?
Partially. One advantage the approach has is that it's conceptually simple and consistent, and it feels easy to compose complex behavior. The problem is that Kubernetes enforces very little structure, even for basics like object ownership. The result is unbounded complexity. A lack of tooling (e.g. time-travel debugging for control loops) makes debugging complex interactions next to impossible. This is also not surprising: control loops are a very hard problem, and even simple systems can spiral (or oscillate) out of control very quickly. Control theory is hard. David Anderson has a pretty good treatise on the matter: https://blog.dave.tf/post/new-kubernetes/
Compared to OpenStack, Kubernetes uses a conceptually much simpler model (control loops + CRDs) and does a much better job at enforcing API consistency. Kubernetes is locally simple and consistent, but globally brittle.
The downside is that it needs much more composition of control loops to do meaningful work, and that leads to exploding complexity because you have a bunch of uncoordinated actors (control loops) each acting on partial state (a subset of CRDs).
The implementation model of an OpenStack service, on the other hand, is much simpler, because each service uses straightforward "workflows" operating on a much bigger picture of global state, e.g. Neutron owning the entire network layer. This makes composition less of a source of brittleness, not that OpenStack doesn't have its fair share of that as well. Workflows are, however, much more brittle locally, because they cannot reconcile themselves when things go wrong.
They make up a tiny minority but they are loud and often nasty within their Twitter echo chamber and they are excellent at getting companies and people to do their bidding out of fear of reprisal.
It’s supposed to be the “most welcoming community” but it only takes you stepping out of line on the outrage du jour to get a proverbial face full of spittle and chased out of town with pitchforks.
I’m posting this with a throwaway obviously because I’m not trying to lose my job or get doxxed. Which would 100% happen if I posted under my real name.