HACKER Q&A
📣 fiddlerINT

Do's/don'ts of working with Kubernetes you learned through experience?


We are rolling out Kubernetes to production next month and I'm interested to hear from people who have already made that step.


  👤 nameless912 Accepted Answer ✓
Can I be honest?

As someone who hopped on the K8s bandwagon back in the early days (circa early 2017), _do not_ go into production with Kubernetes if you're still asking this question.

Just a few of the issues I've run into over the past 2 1/2 years or so:

- Kubernetes DNS flaking out completely

- Kubernetes DNS flaking out occasionally (for ~5 percent of queries)

- Giving out too many permissions, causing pods to be deleted without a clear reason why, often taking down production traffic or logging with it

- Giving out too few permissions, making our deployment infrastructure depend on a few lynchpins rather than sharing the production burden

- Probably a dozen different logging aggregation systems, none of which strike a balance between speed and CPU cost

- Probably a half-dozen different service meshes, all of which suck (with the exception of linkerd, which is actually quite good)

- Teams with bad sanitization practices leaking credentials all over the place

- Running Vault in Kubernetes (really, don't ever do this)

- Disks becoming unattached from their pods for no discernible reason, only to be re-attached minutes later with no explanation

- At least one major production outage on every single Kubernetes-based system I've built that can be directly attributed to Kubernetes

- Etcd failovers

- Etcd replication failures

- Privilege escalation due to an unsecured Jenkins builder causing credential exfiltration (this one was _super_ fun to fix)

Kubernetes is a powerful tool, and I've helped run some massive (1,000+ nodes, 5,000+ pods across 3 AZs) systems based on K8s, but it took me a solid year of experimenting and tinkering to feel even remotely comfortable putting anything based on K8s into production. If you haven't run into any "major" issues, you're going to very soon. I can only wish you good luck.


👤 wikibob
See the excellent collection of Kubernetes incident reviews at https://k8s.af

👤 CameronBarre
Do keep a GitOps folder/repository to keep your cluster in sync with expectations; do not let ad-hoc edits become the norm.

Use tools like kustomize to reduce the proliferation of duplicate k8s resource files.
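
A minimal sketch of what that can look like (names and paths here are made up, and older kustomize versions spell the overlay reference as bases: instead of resources:):

    # base/kustomization.yaml -- shared definition of the app
    apiVersion: kustomize.config.k8s.io/v1beta1
    kind: Kustomization
    resources:
      - deployment.yaml
      - service.yaml
    commonLabels:
      app: my-api                  # applied to every listed resource

    # overlays/production/kustomization.yaml -- per-environment tweaks
    apiVersion: kustomize.config.k8s.io/v1beta1
    kind: Kustomization
    resources:
      - ../../base
    patchesStrategicMerge:
      - replica-count.yaml         # e.g. bump replicas for production only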

Do make sure you are using liveness and readiness probes.

Definitely take care to specify resource requests and limits.
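
To make the last two tips concrete, a hedged fragment of a Deployment's container spec might look like this (the app name, image, paths, ports, and numbers are placeholders to tune for your workload):

    containers:
      - name: my-api
        image: registry.example.com/my-api:1.2.3
        ports:
          - containerPort: 8080
        livenessProbe:             # restart the container if this fails
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 10
        readinessProbe:            # stop routing traffic until this passes
          httpGet:
            path: /ready
            port: 8080
          periodSeconds: 5
        resources:
          requests:                # what the scheduler reserves for the pod
            cpu: 100m
            memory: 128Mi
          limits:                  # hard caps enforced at runtime
            cpu: 500m
            memory: 256Mi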

Do use annotations to control provider resources (load balancers, disks, etc.), rather than manually tweaking resources that are auto-generated from bare k8s manifests with no annotations.
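
As one illustration, on GKE you can ask for an internal load balancer with a Service annotation instead of hand-editing the load balancer it generates (the exact annotation key varies by provider and version, so treat this as a sketch):

    apiVersion: v1
    kind: Service
    metadata:
      name: my-api
      annotations:
        cloud.google.com/load-balancer-type: "Internal"   # provider-specific
    spec:
      type: LoadBalancer
      selector:
        app: my-api
      ports:
        - port: 80
          targetPort: 8080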

Aggregate your logs.


👤 moon2
- Deleting or bulk-changing something? Always pass the --record flag, so you can refer back to what you changed using kubectl rollout history.

- If you're planning on using GKE, you'll have to expose your apps using Ingress (this way you can use GCP's L7 load balancing with HTTPS). However, this architecture has many limits (e.g. a hard limit of 1,000 forwarding rules per project, each Ingress creates a forwarding rule, and a k8s Ingress can't refer to another namespace), so make sure you use namespaces wisely. A minimal Ingress sketch follows this list.

- Try to learn and teach people on your team about requests and limits. If you don't use them carefully, you'll end up wasting a lot of resources. Also, make sure you have Prometheus and Grafana set up to give you some visibility.

- Set up Heptio's Velero; it's a lifesaver, especially when running in a managed environment where you have no access to etcd. It can be used to back up your whole cluster and migrate workloads between clusters. If, for some reason, you end up deleting a cluster by mistake, it will be easier to recover its workloads using Velero.
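
To illustrate the Ingress point above, a minimal manifest might look like the following (host, names, and ports are placeholders; this uses the current networking.k8s.io/v1 API, while older clusters used extensions/v1beta1):

    # On GKE this provisions an L7 HTTP(S) load balancer and a forwarding rule
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: my-api
    spec:
      rules:
        - host: api.example.com
          http:
            paths:
              - path: /
                pathType: Prefix
                backend:
                  service:
                    name: my-api     # must be in the same namespace as the Ingress
                    port:
                      number: 80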


👤 sergiotapia
We're moving away from Kubernetes, to Aptible.

If you're asking these kinds of questions, you shouldn't be using Kubernetes.

If you are going to use it, be ready to have an engineer on your team be full-time devops, or be ready to hire someone who knows k8s. It'll be around $110k to $140k.

But really, don't use it. The gospel you hear is from engineers who already invested their careers in it. Buyer beware.


👤 longcommonname
Don't give everybody prod access, but give enough people prod access.

Use namespaces and logically bounded clusters. Get your monitoring, tracing, and a dashboard to visualize it all figured out now.
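
One hedged way to strike that balance is namespaced RBAC: bind a team's group to the built-in edit role in its own namespace rather than handing out cluster-admin (the group and namespace names below are made up):

    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: payments-team-edit
      namespace: payments             # prod access, but only to this namespace
    subjects:
      - kind: Group
        name: payments-team           # hypothetical group from your IdP
        apiGroup: rbac.authorization.k8s.io
    roleRef:
      kind: ClusterRole
      name: edit                      # built-in role: edit most namespaced objects
      apiGroup: rbac.authorization.k8s.io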


👤 rochacon
Managed or custom deploy? What is the size of the cluster and team that will be using it?

Kubernetes is configurable as hell, so your environment matters a lot for what’s a must-have and what’s a nice-to-have.

If not managed, make sure you go through all of the components’ flags and configure things like reserved resources, forbidding hostPath usage, pod security policies (do not allow root), etc.
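
As a sketch of that era’s mechanism, a restrictive PodSecurityPolicy could look like the following (note that PSP was deprecated and removed in Kubernetes 1.25 in favor of Pod Security admission, so take this as illustrative rather than a recommendation):

    apiVersion: policy/v1beta1
    kind: PodSecurityPolicy
    metadata:
      name: restricted
    spec:
      privileged: false
      allowPrivilegeEscalation: false
      runAsUser:
        rule: MustRunAsNonRoot        # forbid containers running as root
      seLinux:
        rule: RunAsAny
      supplementalGroups:
        rule: RunAsAny
      fsGroup:
        rule: RunAsAny
      volumes:                        # hostPath deliberately not in this list
        - configMap
        - secret
        - emptyDir
        - persistentVolumeClaim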

Also, avoid service meshes until you fully understand how to use “vanilla” Kubernetes; don’t add this complexity from day 1, because debugging cluster issues can get a lot harder.


👤 charlieegan3
We experienced an issue with a validating webhook controller configured to validate (way) more than needed - I wrote it up here: https://blog.jetstack.io/blog/gke-webhook-outage. It's on https://k8s.af - a great place for k8s-related postmortems.
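
One hedged takeaway from incidents like that: scope webhooks as narrowly as you can and think hard about failurePolicy. A sketch (all names here are hypothetical):

    apiVersion: admissionregistration.k8s.io/v1
    kind: ValidatingWebhookConfiguration
    metadata:
      name: ingress-validator
    webhooks:
      - name: ingress.validator.example.com
        failurePolicy: Ignore          # Fail blocks every matching request if the backend is down
        namespaceSelector:             # only opted-in namespaces
          matchLabels:
            ingress-validation: enabled
        rules:                         # only Ingress create/update, nothing else
          - apiGroups: ["networking.k8s.io"]
            apiVersions: ["v1"]
            operations: ["CREATE", "UPDATE"]
            resources: ["ingresses"]
        clientConfig:
          service:
            name: ingress-validator
            namespace: infra
            path: /validate
        admissionReviewVersions: ["v1"]
        sideEffects: None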

👤 kasey_junk
Have a very clear business case of why you are using it.

Rolling out K8s should not be the goal. It’s a toolset, and an expensive, bleeding-edge one. It’s also very much geared toward operators, not developers, so you’ll likely need to build guide rails on top of it.

There are lots of good reasons to use K8s but make sure you know why you are.


👤 yellow_lead
If your application requires high availability, make sure you are setting pod disruption budgets and have some special behavior for when Kubernetes sends SIGTERM to an app/pod at shutdown. For some of our applications, we have logic to finish all in-flight requests after SIGTERM is received, so that none are dropped.
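
For concreteness, a PodDisruptionBudget might look like this (numbers and labels are placeholders; on clusters older than 1.21 the API group is policy/v1beta1). Pair it with a terminationGracePeriodSeconds long enough for in-flight requests to drain after SIGTERM, before Kubernetes follows up with SIGKILL:

    # Keep at least 2 replicas up during voluntary disruptions (node drains, upgrades)
    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: my-api-pdb
    spec:
      minAvailable: 2
      selector:
        matchLabels:
          app: my-api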

👤 anon284271
Don't use Kubernetes.

👤 eeZah7Ux
DON'T: use it.

👤 iamnothere123
DON'T USE IT !!!!!