However, everything I have is currently scaled up and down manually. Looking into HPA, it doesn't seem built for GPU tasks where each pod can handle at most one unit of work at a time. I have both async GPU workers and a small Flask API that uses GPUs, all currently running on g4dn.xlarge EC2s. Essentially I want a scaling scheme where the number of nodes approximates the number of concurrent requests, up to a maximum.
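For what it's worth, the closest off-the-shelf fit I've found for "one pod per in-flight job" is KEDA with its postgresql scaler, which scales a Deployment on a SQL count rather than CPU. This is an untested sketch, not something I'm running; the Deployment name, table/column names, and the `PG_CONN` env var are all placeholders:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: gpu-worker
spec:
  scaleTargetRef:
    name: gpu-worker          # name of the worker Deployment (assumed)
  minReplicaCount: 0
  maxReplicaCount: 10         # hard cap, matching "up to a maximum"
  triggers:
    - type: postgresql
      metadata:
        connectionFromEnv: PG_CONN   # Postgres connection string from env
        # one replica per queued/running job (schema is hypothetical)
        query: "SELECT count(*) FROM jobs WHERE status IN ('queued', 'running')"
        targetQueryValue: "1"
```

Since each pod requests one GPU and g4dn.xlarge has one GPU, the cluster autoscaler (or Karpenter) would then add roughly one node per pending pod, which gets node count ≈ concurrent requests.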
I did look into simpler solutions like Replicate, but found them too limiting and a bit oversimplified for our use case.
Edit: I know I should probably use k8s Jobs for the async worker stuff, but when I looked into it there seemed to be more operational complexity than I can handle. I'm a solo dev, so I set up workers that poll a Postgres job queue and manage themselves, rather than running yet another service to schedule and maintain k8s Jobs.
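In case it helps anyone, the core of the self-managing worker pattern is a claim query using `FOR UPDATE SKIP LOCKED`, so concurrent workers never grab the same row. This is a minimal sketch, not my actual code; the `jobs` table, its columns, and the callable-injection style are all assumptions for illustration:

```python
import time

# Hypothetical claim query: atomically take one queued job.
# SKIP LOCKED makes rows another worker has locked invisible,
# so two workers can't claim the same job.
CLAIM_SQL = """
UPDATE jobs
SET status = 'running', started_at = now()
WHERE id = (
    SELECT id FROM jobs
    WHERE status = 'queued'
    ORDER BY created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id, payload;
"""

def run_worker(claim, handle, poll_interval=2.0, max_iterations=None):
    """Poll-claim-process loop.

    `claim` returns one job (e.g. by executing CLAIM_SQL) or None if the
    queue is empty; `handle` processes a single job. Both are injected as
    callables so the loop itself is testable without a database.
    """
    iterations = 0
    while max_iterations is None or iterations < max_iterations:
        job = claim()
        if job is None:
            time.sleep(poll_interval)  # queue empty; back off before re-polling
        else:
            handle(job)
        iterations += 1
```

Each worker just runs this loop forever; crashed workers leave their row in `running`, which a periodic sweep (or a `started_at` timeout check) can re-queue.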