- Is it the networking model, which is simple from the consumption standpoint but has too many moving parts in its implementation?
- Is it the storage model, CSI and friends?
- Is it the bunch of controller loops doing their own things with nothing that gives a "wholesome" picture to identify the root cause?
For me personally, the first and foremost thing on my mind is the networking details. They are "automatically generated" by each CNI solution in slightly different ways and with different constructs (iptables, virtual bridges, routing daemons, eBPF, etc. etc.), and because they are generated, it is not uncommon to find hundreds of iptables rules and chains (and/or similar configuration) on a single node.
Being automated, these solutions generate tons of components/configuration, and in case of trouble, even someone with mastery of them would take some time to hop through all the components (virtual interfaces, virtual bridges, iptables chains and rules, ipvs entries, etc.) to identify what's causing the trouble. Essentially, one pretty much has to be a network engineer, because besides the underlying/physical network (or the virtual one, I mean cloud VPCs), k8s pulls in its very own network (pod network, cluster network) implemented in the software/configuration layer, which has to be fully understood to be maintained.
God forbid the CNI solution hits some edge case, or some other misconfiguration makes it keep generating inadequate or misconfigured rules/routes etc., resulting in a broken "software-defined network" I cannot identify in time on a production system. That is my nightmare, and I don't know how to reduce that risk.
What's your Kubernetes nightmare?
We have a few rules:
1. Read a good intro book cover-to-cover before trying to understand it.
2. Pay a cloud vendor to supply a working, managed Kubernetes cluster.
3. Prefer fewer larger clusters with namespaces (and node pools if needed) to lots of tiny clusters.
4. Don't get clever with Kubernetes networking. In fact, touch it as little as possible and hope really hard it continues to work.
This is enough to handle 10-50 servers with occasional spikes above 300. It's not perfect, but then again, once you have that many machines, pretty much every solution requires some occasional care and feeding.
My personal Kubernetes nightmare is having to build a cluster from scratch on bare metal.
- Compute
- Deployment
- CI
- Networking
- Storage
- Policies
Imagine running a microservices solution without an orchestration solution - how many people would it take to administer the servers, the storage, the network, the policies, etc. And with Kubernetes, you get maybe a couple of teams if you're lucky. This is the power and the leverage of the platform.
But also, imagine in that environment, how many things can go wrong, and the amount of expertise that you need to properly debug them. You still need that amount of expertise, because all of that complexity is still in place (or at least most of it is) - if your physical disks are throwing errors, you need someone who knows how to debug and replace that. Not hard. But then you have Ceph above that, and Rook above that (or whatever storage solution you use). And then you've got the deployment that has to make the PVC successfully. And it's like that for everything. Every problem has the potential to be a full stack problem for any one of half a dozen stacks.
It's a lot.
1. Latch-up states. It's very, very easy for something to go wrong, blow up a whole deployment, and lose all the pods; a health check failure, for example. Most application frameworks have some sort of request queuing, and the health checks sit in the same queue, so any upstream issue gives you health check failures and flapping. Of course the autoscaler goes fucking bonkers in the middle of that. The only thing you can do is drop traffic at the network edge and wait for it to pull itself together.
2. No one knows how to fix it if anything major goes wrong. Even cloud providers. It's so large and complicated that no one independently has enough knowledge to actually fix it. For example, I suffered through months of weird network issues where pods would come up without network. To this day no one knows why that happened or can explain it. No amount of debugging and reverse engineering resulted in even a single step forward, so the only outcome was "replace the whole cluster".
Don't get me wrong, I still like it but I wouldn't want to run it with little expertise at hand. It's not something I would trust someone to run without production experience, which is difficult because there are very few people out there who are battle hardened past trivial home deployments and tiny little stacks.
1. We built our own custom build system, because there is no CI that can do actual DAGs (maybe a few can): a custom Kubernetes operator that parses Jsonnet files to create hundreds of CRDs and pods to achieve extreme parallelization. EKS was $144/mo (now $72), with no info on master node types. Using watch endpoints with hundreds of pods did not scale well; they had to bump the master node instances up to c5.18xlarge, at the same managed price. But figuring out that this scale-up was what was needed took days. One c5.18xlarge is about $2k/month, and EKS runs at least 3 for HA, so it's a horror story for them. Then again, we also ran hundreds of worker nodes, which might offset some of that.
2. Similar to CI, we allowed devs to deploy all microservices (~80) from any branch so that they could port-forward and use them. All of them had Ingress endpoints. After days of headaches and frustration, it turned out nginx ingress generates megabytes of configuration whenever a new deployment occurs, forks a new subprocess with the new config, and kills the other connections. When that happens often, it takes 30GB of memory with 50 developers using it (~4000 pods), and it frequently dies and restarts. Similar story for Prometheus and kube-state-metrics; they do not like short-lived containers and hog memory.
- Maintaining 200+ clusters for 10 small applications
- Cloud bills
- Autoscaling never working well
- Trying to untangle Terraform state without taking down Prod
My #2 (probably partially caused by #1) is the lack of attention paid to RBAC in vendor-supplied manifests. Multiple times I've found that the vendor's YAML binds some controller's service account to a ClusterRole giving access to all secrets in the cluster, when it only really needs to read one. After filing a GitHub issue it seems that I'm the first to even notice, even on popular projects that have been around for years.
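For comparison, scoping that access down is only a handful of lines. A minimal sketch of the pattern (all names here are hypothetical, not from any particular vendor's chart): a namespaced Role granting read access to the one secret, instead of a ClusterRole over every secret in the cluster.

    # Namespaced Role limited to a single named secret
    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      name: read-controller-secret
      namespace: vendor-system
    rules:
      - apiGroups: [""]
        resources: ["secrets"]
        resourceNames: ["vendor-controller-credentials"]  # the one secret it needs
        verbs: ["get"]
    ---
    # Bind it to the controller's service account
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: read-controller-secret
      namespace: vendor-system
    subjects:
      - kind: ServiceAccount
        name: vendor-controller
        namespace: vendor-system
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: Role
      name: read-controller-secret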
But then upgrading can be very risky: if you have any problem at all, unless you understand the Helm chart you can rarely simply downgrade/uninstall, and you could have caused a fatal problem. For a cluster, the resilience is meaningless if you make a change that blocks access to all services.
Other issues relate to dependencies and breaking changes, which can be subtle and not easy to discover, like the fact that some old resource uses a v1beta API type which becomes deprecated.
I think once it is working, Kubernetes is very reliable for me but it is when making infrastructure changes that things can go south very quickly. Updating deployments etc. is fine.
So we have a few Spring Boot based webapps which were running (along with PgSQL) on a shared AWS t2.medium instance; we migrated these to a GKE cluster with a node pool of e2-standard-2 instances. The nodes are on a private network and don't have public IPs. The services are exposed via a Load Balancer based Ingress (with SSL). Even after allocating one core and 2GB RAM to PgSQL, the API calls from the GKE applications are perceptibly slower than on the shared AWS t2.medium based deployment. We tried giving generous CPU and RAM to the applications; it still didn't improve the response time. Since these are the very first applications being moved to this cluster, there isn't much else running on it.
Not sure what's causing the slowness. Have any of you experienced something like this in GKE?
Inherited a web site and hosting from another studio. They set up a PHP site in a Docker container inside a VPS. They don't use microservices; it's one monolithic container. They didn't set up any way to get logs out of the thing. They don't use docker compose to build an image; they get a console on the container and use it like a VPS.
They literally just use it to add another layer of containerisation on their VPS.
You already need to understand Linux to use Docker or Kubernetes. If you don't use microservices or need horizontal scaling, it's just more to learn: an extra layer of complexity that's super fragile and a nightmare to debug.
It has such a niche use case, but everyone uses it where it's not useful because it's trendy. They want to put on their CV that they have used Docker/Kubernetes; they don't have to write that it wasn't necessary and caused issues.
I suddenly wake up, covered in cold sweat. My heart is pumping so hard.
I take out my phone. I search the internet. Kubernetes still reigns, no simpler approach made it.
The end.
We thought it was an application issue, but it was actually on the database side: the timestamp of each message was using the local time of the MongoDB instance, and between different instances the time was different. We realized that the Kubernetes nodes had trouble connecting to the NTP server, due to a rule in a random firewall.
When we fixed it, all the messages were in the right order.
Running software at scale is my nightmare.
Shared FS between nodes, autoscaling volume claim sizes, autoscaling volume claim IOPS, and measuring storage utilisation (e.g. IOPS) per pod/node/PV.
How have I solved it? I haven't, and I know it will be a key part of cost control for us in about 12 months.
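For what it's worth, the one part of this with a first-class knob is growing PVC sizes: the StorageClass has to opt in. A sketch, where the provisioner and parameters are assumptions for an AWS EBS CSI setup:

    # StorageClass that permits PVC resizing; provisioner/parameters assume
    # the AWS EBS CSI driver and are illustrative only.
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: expandable-gp3
    provisioner: ebs.csi.aws.com
    parameters:
      type: gp3
    allowVolumeExpansion: true

With that set, editing spec.resources.requests.storage on a bound PVC triggers an (often online) expansion; shrinking is still not supported, and true autoscaling of size or IOPS needs some external operator watching usage.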
Fast deploy:
I'm trying to get a test cluster up in less than half an hour. With the DAG for building it all, I'm getting a failure rate of 30% unless I leave in arbitrary waits and extra steps. I've also only automated about 25% of our stuff, so I expect it will take longer.
It seems that if you stick to simple configs, a setup hosted for you, etc., basically the happy path, then people have had really good experiences with k8s. Those people can't understand how one could be inept enough _not_ to figure it out.
On the other hand, you'll also hear a lot of complaints about the difficulty of self-managed clusters, and attempting certain less popular or more complicated configs (or what have you). These people can't understand what benefit introducing such an insane amount of complexity could bring.
The second has mostly been my experience. I've tried now maybe a handful of times to create a cluster and get it running something on my home lab. At first I could rarely get it "up", but now I can usually get it to the point where I'd want to include storage or whatnot, and that's where I've been failing lately. Either way, I've never gotten it stable enough to warrant actual usage from me.
I like the idea of k8s; it seems like the natural next step of computing abstractions. I'm just not sure if "it's it", or if it's stable/reliable/evolved enough for people who don't need it now to invest in it yet.
The biggest nightmare for me is networking, simply because I'm not trained in networking. I know the basics to become a senior sysadmin but it's not natural to me. So mix in kubernetes and it becomes even more abstract.
The documentation at the main Kubernetes site is poor, and is being deprecated, but not in favour of anything new.
Kubernetes itself has always been fine, just like a bare network or bare OS has been, but when you start stacking stuff built by other people (especially when the stuff isn't of the best quality) it just goes downhill from there.
Perhaps the actual nightmare is inadequate quality control... but that's not really specific to packaging or shared components in Kubernetes.
Long story short, a node crashed, and when it came back up, the pods wouldn't start. We spent a couple of days trying to figure it out, but nothing was working. This was in production, so we made the choice to rebuild the entire cluster on a newer version. We still had other nodes running, and were scaled enough that there was no complete downtime, but we were maxing out the CPU and some connections were getting dropped.
My two biggest gripes:
- Loss of visibility, especially related to inspecting network data as it moves from LB to pod.
- Half-baked tooling around the ecosystem, although this does seem to be slowly improving
My two biggest likes:
- I genuinely save a bunch of time with it at this point (it still occasionally sucker punches me)
- I can take the experience from my day job and self-host quite a large number of useful applications at home on old hardware.
Endpoint security software, just because it adds some policy that usually isn't written by the team trying to run the application, and applies that policy in sometimes non-obvious ways. Even when you think it's turned off, sometimes it isn't, and the vendor will leave kernel modules running and partial configurations behind.
Red Hat was more a result of the stability policy for the kernel, and often running much older kernels than other distributions. We had lots of problems with the more modern kernel features used by Kubernetes, which we had to track down and often link to known fixes. One customer even replaced their kernels so they wouldn't have as many issues. This may be less and less the case with newer Red Hat releases, and I also have no reason to believe OpenShift suffers in the same way... just that I've spent a large amount of time troubleshooting this.
I run my own clusters and it just works.
Sure, I have to ignore a lot of crap in the setup phase; there are so many products out there I don't want to pay for. The nightmare may come from some devop installing a bunch of Helm charts without configuring things properly.
Scaling down to a minimal cluster is a real concern: I would like to run k8s for some micro project that literally runs on a $5 VPS, but it's too heavy for that.
1. Managing etcd nodes -- Reconciliation is a patient waiting game; try to rush it and you'll lose your cluster.
2. Kubernetes networking -- It is nearly impossible to trace packets coming through an LB into a Kubernetes pod without a very deep understanding of the different networking layers and CNIs. A lot can go wrong here.
3. Running persistent volumes in Kubernetes -- This can range from outright unstable and dangerous to merely annoying; at the very best you'll intermittently lose access to services due to volume claims being detached/reattached. I would highly recommend avoiding this.
4. Running "sticky" services -- StatefulSets let you run enumerated services with sticky sessions, but my experience with any sticky service is that it tends to be somewhat volatile, as Kubernetes really loves to move workloads at its convenience. I've found StatefulSets to be a red flag when considering whether something belongs in Kubernetes.
But my main pain points are around Kubernetes and all the hidden stuff.
Kubernetes alone is not enough; you need Terraform or Helm (or both) to have something manageable and deployable by a team. When things error or don't behave the way you expected, it all becomes so complicated and cryptic that you're sometimes better off deleting an entire resource than understanding the underlying issue.
Some things, like dependencies between resources (e.g. a Deployment depending on a ConfigMap: updating the ConfigMap won't restart the Deployment), make things a lot more complicated than you expect.
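The usual workaround, if you template with Helm, is to hash the config into a pod-template annotation so a ConfigMap change rolls the Deployment. A sketch (the template path is hypothetical and chart-specific):

    # Helm template fragment: any change to the rendered ConfigMap changes
    # the checksum, which changes the pod template, which triggers a rollout.
    kind: Deployment
    spec:
      template:
        metadata:
          annotations:
            checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}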
There is too much vendor-specific stuff necessary to make a Kube cluster work for you to expect a single Terraform setup that is multi-cloud, etc.
My comfort zone is where Kubernetes works fine and I don't have to touch it, or only update trivial stuff.
Problems started when we tried to push too many things into the clusters. Databases, and especially Elasticsearch with Kibana collecting metrics from the cluster, ended up killing performance.
So it's like everything: some cases are great for K8s, some are terrible. This, plus the complex abstractions, makes it not that developer friendly, but overall it does a good job of running services and letting them scale without having to worry too much about hardware.
OpenStack, Pivotal Cloud Foundry, internal compute platform
So far I think the nightmare problem is people trying to run it and CNCF software (Prometheus, various operators) with only a cursory understanding of how it works (me included)
It's easy to shoot yourself in the foot (oops, forgot requests on a resource-intensive, high-replica-count deploy and hosed cluster autoscaling)
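The foot-gun is that requests are what both the scheduler and the cluster autoscaler reason about; omit them and every replica looks free. A minimal container fragment, with made-up numbers:

    # Without requests, the autoscaler has no signal that N replicas of this
    # need real capacity; these values are illustrative only.
    resources:
      requests:
        cpu: 500m
        memory: 512Mi
      limits:
        memory: 1Gi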
The deprecation lifecycle, and running ingress controllers in an automatic scaling group.
The first isn't as much of an issue if you have a (partially) dedicated team for managing your clusters, but can be prohibitively expensive (effort / time-wise) for smaller organisations.
The second highlights a bigger problem in K8s in general. I'll have to give a little background first:
If you run an Nginx ingress controller on a node that's part of an ASG — i.e. a group where nodes can disappear, or increase in number — you will see service disruption for a small percentage of your requests every time a scaling event occurs. This is caused by a misalignment between the timeout values of your load balancer and Nginx, which cannot be fixed:
* https://github.com/kubernetes/ingress-nginx/issues/6281
* https://github.com/kubernetes/ingress-nginx/issues/6791
* https://github.com/kubernetes/ingress-nginx/issues/7175
The fix is to only run the controllers on nodes that reside in a separate, statically sized group, and to perform updates to them out of hours when necessary :|
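Concretely, that workaround looks something like pinning the controller pods to the static group (the label and taint names here are assumptions):

    # Fragment of the ingress controller Deployment's pod template: schedule
    # only onto the statically sized node group and tolerate its taint.
    spec:
      template:
        spec:
          nodeSelector:
            node-group: ingress-static        # hypothetical node label
          tolerations:
            - key: dedicated                  # hypothetical taint on those nodes
              operator: Equal
              value: ingress
              effect: NoSchedule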
I'll leave you to decide on whether that's a fix or not, but the larger point it highlights is how _theoretically_ everything's great in K8s, but the headaches introduced by the complexity often make it not worth it.
Another example is pod disruption budgets. These are needed because the behaviour of K8s when instructed to shut down a node is, well, to shut down the node. Seems reasonable, until you realise that it doesn't handle moving the workloads off that node _first_. No, at some point later, the scheduler realises the pods aren't running and schedules them somewhere else. So you use a combination of PDBs to tell K8s that it must keep n pods of this deployment running at all times, and distribution rules to tell it pods must run on different nodes. This solution falls apart when you have pods that should only ever have a single instance running.
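For reference, that combination looks roughly like this (names and numbers are hypothetical):

    # PDB: voluntary disruptions (e.g. node drains) must leave >= 2 pods up.
    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: web-pdb
    spec:
      minAvailable: 2
      selector:
        matchLabels:
          app: web
    ---
    # Deployment with replicas spread across nodes, so a single node going
    # away can't violate the budget on its own.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: web
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: web
      template:
        metadata:
          labels:
            app: web
        spec:
          topologySpreadConstraints:
            - maxSkew: 1
              topologyKey: kubernetes.io/hostname
              whenUnsatisfiable: DoNotSchedule
              labelSelector:
                matchLabels:
                  app: web
          containers:
            - name: web
              image: example/web:1.0   # hypothetical image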
Even typing out the words makes me want to lie down in a dark room.
And then perhaps the proper handling of persistent disk.
Kubernetes Failure Stories: a compiled list of links to public failure stories related to Kubernetes, with the most recent publications on top.
Out of the things I have touched, it's load-balanced ingress (when running on premises). So yeah, it's networking.
- Migration of all our customer workloads from PSP (PodSecurityPolicy) to Gatekeeper.
In short, the ingress may route traffic to a pod after it has been killed. The solution is that when a pod gets a SIGTERM signal, it should mark itself not ready, wait for some amount of time, and only then shut down (see e.g. https://deepsource.io/blog/zero-downtime-deployment/). I've heard arguments for this behavior, but those aren't the trade-offs I would make.
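The usual way that waiting gets wired in is a preStop hook, so the old pod keeps serving while the endpoint removal propagates. A sketch with an arbitrary delay (assumes the image ships a sleep binary; names are hypothetical):

    # Pod-spec fragment: the preStop sleep gives ingress/endpoints time to
    # stop routing here before SIGTERM reaches the app; 15s is arbitrary.
    spec:
      terminationGracePeriodSeconds: 30
      containers:
        - name: app
          image: example/app:1.0       # hypothetical image
          lifecycle:
            preStop:
              exec:
                command: ["sleep", "15"]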
Suppose that you work in an org that successfully ships software in a variety of ways - as regular packaged software that runs on an OS directly (e.g. a .jar that expects a certain JDK version in the VM), or maybe even uses containers sometimes, be it with Nomad, Swarm or something else.
And then a project comes along that needs Kubernetes, because someone else made that choice for you (in some orgs it might be a requirement from the side of clients, others might want to be able to claim that their software runs on Kubernetes, and in other cases some dev might be padding their CV before leaving), and now you need to deal with the consequences.
But here's the thing - if the organization doesn't have enough buy-in into Kubernetes, it's as if you're starting everything from 0, especially if paying some cloud vendor to give you a managed cluster isn't in the cards, be it because of data storage requirements (even for dev environments), other compliance reasons or even just corporate policy.
So, I might be given a single VM on a server, with 8 GB of RAM for launching 4 or so Java/.NET services, as that is a decent amount of resources for doing things the old way. But now, I need to fit a whole Kubernetes cluster in there, which in most configurations eats resources like there's no tomorrow. Oh, and the colleagues also don't have too much experience working with Kubernetes, so some sort of a helpful UI might be nice to have, except that the org uses RPM distros and there are no resources for an install of OpenShift on that VM.
But how much can I even do with that amount of resources, then? Well, I did manage to get K3s (a certified K8s distro by Rancher) up and running, though my hopes of connecting it with the actual Rancher tool (https://rancher.com/) to act as a good web UI didn't succeed. Mostly because of some weirdness with the cgroups support and Rancher running as a Docker container in many cases, which just kind of broke. I did get Portainer (https://www.portainer.io/) up and running instead, but back then I think there were certain problems with the UI, as it's still very much in active development and gradually receives lots of updates. I might have just gone with Kubernetes dashboard, but admittedly the whole login thing isn't quite as intuitive as the alternatives.
That said, everything kind of broke down for a bit when I needed to set up the ingress. What if you have a wildcard certificate along the lines of *.something.else.org.com and want it used for all of your apps? Back in the day, you'd just set up Nginx or Apache as your reverse proxy and let it worry about SSL/TLS termination. That duty is now taken over by Kubernetes, except that by default K3s comes with Traefik as its ingress controller of choice, and the documentation isn't exactly stellar.
So to get this sort of configuration up and running, I needed to think about a HelmChartConfig for Traefik, a ConfigMap which references the secrets, a TLSStore to contain them, as well as creating the actual TLS secrets themselves from the appropriate files on the file system. It still feels a bit odd, and it would probably be an utter mess to get particular certificates up and running for some other paths, plus Let's Encrypt for others yet. In short, what previously would have been those very same files living on the file system and a few (dozen?) lines of reverse proxy configuration is now a distributed mess of abstractions and actions which certainly takes some getting used to.
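Roughly what that dance ends up looking like, as far as I can tell (Traefik v2 CRDs as shipped with K3s; names and namespace are illustrative):

    # The wildcard cert/key pair, loaded from the files on disk, e.g. with:
    #   kubectl create secret tls wildcard-tls --cert=tls.crt --key=tls.key
    apiVersion: v1
    kind: Secret
    metadata:
      name: wildcard-tls
      namespace: default
    type: kubernetes.io/tls
    data:
      tls.crt: <base64-encoded certificate>
      tls.key: <base64-encoded key>
    ---
    # Traefik treats the TLSStore named "default" as the default certificate
    # store for ingress routes that don't name their own cert.
    apiVersion: traefik.containo.us/v1alpha1
    kind: TLSStore
    metadata:
      name: default
      namespace: default
    spec:
      defaultCertificate:
        secretName: wildcard-tls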
Oh, and Portainer sometimes just gets confused and fails to figure out how to properly setup the routes, though I do have to say that at least MetalLB does its job nicely.
And then? Well, we can't just ship manifests directly; we also need Helm charts! But of course, in addition to writing those and setting up the CI for packaging them, you also need something running to store them, as well as any Docker images that you build. In lieu of going through all of the red tape to set that up on shared infrastructure (which would need cleanup policies, access controls, and lots of planning so things don't break for other parties using it), I instead crammed an instance of Nexus/Artifactory/Harbor/... onto that very same server, with the very same resource limits, with deadlines still looming over my head.
But that's not all, for software isn't developed in a vacuum. Throw in all of the regular issues with developing software, like not being 100% clear on each of the configuration values the apps need (because developers are fallible, of course), changes to what they want to use, problems with DB initialization (of course, still needing an instance of PostgreSQL/MariaDB running on that very same server, which for whatever reason might get used as a shared DB), and so on.
In short, you take a process that already has pain points in most orgs and make it needlessly more complex. There are tangible benefits to using Kubernetes: once you find a setup that works (personally, Ubuntu LTS or a similar distro, a full Rancher install, maybe K3s as the underlying cluster or RKE/K3s/k0s on separate nodes, with Nginx for ingress, or a 100% separately managed ingress), it's great, and the standardization is almost like a superpower (as long as you don't go crazy with CRDs). Yet you need to pay a certain cost up front.
What could be done to alleviate some of the pain points?
In short, I think that:
- expect to need a lot more resources than previously: always have a separate node for managing your cluster and put any sort of tooling on it as well (like Portainer/Rancher), but run your app workloads on other nodes (K3s or k0s can still be fairly undemanding resource-wise for the most part)
- don't shy away from tools like Portainer/Rancher/Lens to make the learning curve more shallow; inspect the YAML they generate and familiarize yourself with the low-level stuff as necessary, while still having an easy-to-understand overview of everything
- don't forget about needing somewhere to store Helm charts and container images, be it another node or a cloud offering of some sort
- if you can, just go for the cloud, but even if managed K8s is not in the cards for you, still strive at least for some sort of self-service approach for the inevitable reinstalls
- speaking of which, treat your clusters as *almost* disposable: have all of the instructions for preparing them somewhere, ideally as an executable script (maybe use Ansible)
- don't stray too far from what you get out of the box, and look in the direction of the most tried and tested solutions, like an Nginx ingress (Traefik with K3s should *technically* have the better integration, but the lack of proper docs works against it; you'll probably want a cookbook of sorts)
- also, manage your expectations: getting things up and running will probably take a long time and will be a serious aspect of development that cannot be overlooked; no, you won't have a cluster up and running on-prem with everything you need in 2 days
- ideally, have a proper DevOps team, or at least a group of people who'll spearhead information sharing and create knowledge bases or templates so it's easier in the future
So, in summary, it can be a nightmare if you have unrealistic expectations or an unrealistic view of how Kubernetes might solve all of your problems, without an understanding of the tradeoffs it requires. I still think that Nomad/Swarm/Compose might work better for many smaller projects/teams out there, but the benefits of Kubernetes are also hard to argue against. If you manage to get that far, though, and only then.
The abstractions we have available to build and run distributed systems may have improved, but they still suck in the grand scheme of things. My personal nightmare is that nothing better comes along soon.
> - Is it the networking model that is simple from the consumption standpoint but has too many moving parts for it to be implemented?
Many poor sysadmins before us have tried to implement Neutron (OpenStack Networking Service) with OvS or a bunch of half-assed vendor SDNs. Or LBaaS with HAProxy.
> - Is it the storage model, CSI and friends?
I mean, the most popular CSI for running on-premise is rook.io, which is just wrapping Ceph. Ceph is just as hard to run as ever, and a lot of that is justified by the inherent complexity of providing high performance multi-tenant storage.
> - Is it the bunch of controller loops doing their own things with nothing that gives a "wholesome" picture to identify the root cause?
Partially. One advantage the approach has is that it's conceptually simple and consistent, and it feels easy to compose complex behavior. The problem is that Kubernetes enforces very little structure, even for basics like object ownership. The result is unbounded complexity. A lack of tooling (e.g. time-travel debugging for control loops) makes debugging complex interactions next to impossible. This is also not surprising: control loops are a very hard problem, and even simple systems can spiral (or oscillate) out of control very quickly. Control theory is hard. David Anderson has a pretty good treatise on the matter: https://blog.dave.tf/post/new-kubernetes/
Compared to OpenStack, Kubernetes uses a conceptually much simpler model (control loops + CRDs) and does a much better job at enforcing API consistency. Kubernetes is locally simple and consistent, but globally brittle.
The downside is that it needs much more composition of control loops to do meaningful work, and that leads to exploding complexity because you have a bunch of uncoordinated actors (control loops) each acting on partial state (a subset of CRDs).
The implementation model of an OpenStack service, on the other hand, is much simpler, because each service uses straightforward "workflows" operating on a much bigger picture of global state, e.g. Neutron owning the entire network layer. This makes composition less of a source of brittleness, not that OpenStack doesn't have its fair share of that as well. Workflows are, however, much more brittle locally, because they cannot reconcile themselves when things go wrong.
They make up a tiny minority but they are loud and often nasty within their Twitter echo chamber and they are excellent at getting companies and people to do their bidding out of fear of reprisal.
It’s supposed to be the “most welcoming community” but it only takes you stepping out of line on the outrage du jour to get a proverbial face full of spittle and chased out of town with pitchforks.
I’m posting this with a throwaway obviously because I’m not trying to lose my job or get doxxed. Which would 100% happen if I posted under my real name.