What's the right way to scale K8s for GPU workloads?

Question

I'm running a relatively simple app right now that makes use of EKS clusters for GPU workloads. Right now everything's pretty simple and working great, despite all of the warnings about k8s being complicated.However, everything I have is currently scaled up and down manually. Looking into HPA it doesn't seem like it's a tool built for GPU tasks where each pod can only handle at most one unit of work at a time. I have both async GPU workers and a small flask API that uses GPUs. These are currently all using g4dn.xlarge EC2s. Essentially I want a scaling scheme where the number of nodes approximates the number of concurrent requests, up to a maximum.I did look into simpler solutions like Replicate etc. but found they would be too limiting and a little bit over-simplified for us.Edit: I know I should probably use k8s Jobs for the async worker stuff, but when I looked into it there seemed to be more operational complexity than I can handle. I'm a solo dev and set up workers that poll a Postgres job queue so they can manage themselves instead of there being yet another service to schedule and maintain k8s Jobs.

gtirloni · Accepted Answer

It seems you want something like KEDA (https://keda.sh)