I have a jobs system backed by a queue, and I'm trying to work out how it should behave as I scale.
The problem is that each job can take 2 minutes, but during peak times a new job might be added every minute.
If the job system goes FIFO, it performs terribly for the latest users, who have to wait 10 minutes or longer.
But if I pick jobs randomly or go LIFO, some users never get a result, or wait something crazy like 6 hours.
Where can I read about optimising this kind of problem, or about useful approaches? I have analysis paralysis and second-guess every change to the default strategy, like switching from FIFO to random order within the last 10 minutes (most recent jobs).
The most obvious way to scale in the scenario you describe is to process the queue asynchronously, in parallel. By analogy, if a grocery store has a long checkout queue, they put more checkers to work. So you would add listeners (workers) to the queue until jobs are processed as fast as they come in.
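As a minimal sketch of that "more checkers" idea, here's what a pool of worker threads pulling from one shared queue might look like in Python. The job IDs and the 2-second sleep are just stand-ins for your real jobs and their ~2 minutes of work, and NUM_WORKERS is something you'd tune:

    # Minimal sketch: several workers consuming one shared queue concurrently.
    import queue
    import threading
    import time

    job_queue: "queue.Queue[int]" = queue.Queue()

    def worker(worker_id: int) -> None:
        while True:
            job_id = job_queue.get()      # blocks until a job is available
            try:
                time.sleep(2)             # stand-in for the ~2 minutes of real work
                print(f"worker {worker_id} finished job {job_id}")
            finally:
                job_queue.task_done()

    # Start enough workers that jobs are consumed at least as fast as they arrive.
    NUM_WORKERS = 4
    for i in range(NUM_WORKERS):
        threading.Thread(target=worker, args=(i,), daemon=True).start()

    # Producer side: jobs arriving over time (once a minute during your peak).
    for job_id in range(8):
        job_queue.put(job_id)

    job_queue.join()  # wait until every queued job has been processed

Back-of-the-envelope from your numbers: with 2-minute jobs arriving once a minute at peak, you need at least 2 workers just to keep the queue from growing, and a bit more headroom to drain any backlog.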
If you can't process the queue in parallel, you have to speed up the job processing so that the time to finish a job drops below the mean time between new jobs entering the queue. Maybe adding hardware capacity (vertical scaling) will work; there's no way for me to tell from your description. Or maybe each job can be broken down into subtasks that you can run in parallel.
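If the jobs do decompose, a sketch along these lines might apply. split_job and process_chunk are purely hypothetical placeholders for however your jobs actually split; the point is only that independent chunks can run on separate cores and be recombined at the end:

    # Sketch: split one long job into independent chunks and run them in parallel.
    from concurrent.futures import ProcessPoolExecutor

    def split_job(job: dict) -> list:
        # Hypothetical: divide the job's input into 4 independent chunks.
        return [job["data"][i::4] for i in range(4)]

    def process_chunk(chunk: list) -> list:
        # Hypothetical: the expensive per-chunk work.
        return [x * 2 for x in chunk]

    def run_job(job: dict) -> list:
        chunks = split_job(job)
        with ProcessPoolExecutor(max_workers=4) as pool:
            results = pool.map(process_chunk, chunks)
        # Recombine the partial results into one final result.
        return [item for part in results for item in part]

    if __name__ == "__main__":
        print(run_job({"data": list(range(20))}))

Whether this helps depends entirely on whether the 2 minutes of work is actually parallelisable, which only you can judge.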
Fiddling with the ordering of the queue doesn't seem like the answer. You need to remove the obstacle that causes jobs to sit in the queue, and that obstacle appears to be the long job processing time, not the queueing algorithm.