A researcher needs to run a data pipeline on several dozen subjects, each with ~5GB of data. By "pipeline" I mean the data passes through one executable, then another, then maybe a third. Basically there's a DAG for the processing, but that's not the important part.
Obviously I can't just run them all at once; they'd eat up all the RAM and gum up everything. I could do something with Python multiprocessing, spin up 5-10 workers, and have them slowly chew through the workload, but then we get to a second use case:
A second researcher comes along and wants to run a different data pipeline on a different set of several dozen subjects!
I would want to add their tasks to the queue.
I've looked at SLURM, but it looks really complicated to set up. We can run all of this on a large workstation with lots of RAM, i.e. a single large node is fine for the volume of requests we expect; I'm just not sure how to manage all the jobs. I'm tempted to write my own code to reserve RAM and CPU cores and have each job sleep on that check until it passes (rough sketch below), but I feel like there's got to be something out there that fills this niche without me writing everything from scratch.
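Just to make that concrete, here's roughly the kind of thing I mean. The executable names, RAM numbers, and core counts are all made up:

```
# Rough sketch: a shared RAM/core "budget" that each job claims before
# launching its subprocess and returns when done. Names/numbers are placeholders.
import multiprocessing as mp
import subprocess
import time

TOTAL_RAM_GB = 256
TOTAL_CORES = 32

def run_job(cmd, ram_gb, cores, lock, free_ram, free_cores):
    # Sleep until enough RAM and cores are free, then claim them.
    while True:
        with lock:
            if free_ram.value >= ram_gb and free_cores.value >= cores:
                free_ram.value -= ram_gb
                free_cores.value -= cores
                break
        time.sleep(30)
    try:
        subprocess.run(cmd, check=True)
    finally:
        with lock:
            free_ram.value += ram_gb
            free_cores.value += cores

if __name__ == "__main__":
    lock = mp.Lock()
    free_ram = mp.Value("i", TOTAL_RAM_GB)
    free_cores = mp.Value("i", TOTAL_CORES)
    procs = [
        mp.Process(
            target=run_job,
            args=(["step1_exe", f"/data/subj{i:02d}"], 8, 2,
                  lock, free_ram, free_cores),
        )
        for i in range(1, 41)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```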
I've also looked at snakemake, and we will probably use it for the DAG portion. It can "reserve" resources, but only within a single run; when the second researcher comes in with her dataset, there's no way to tell snakemake about it.
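For reference, the per-run reservation I'm talking about looks something like this (a sketch; rule names, paths, and numbers are placeholders):

```
# Snakefile (sketch) -- rule name, paths, and numbers are placeholders
rule step1:
    input:
        "raw/{subject}.dat"
    output:
        "work/{subject}.step1"
    threads: 2
    resources:
        mem_mb=8000
    shell:
        "step1_exe {input} > {output}"
```

Running that with something like `snakemake --cores 16 --resources mem_mb=200000` keeps the jobs of that one invocation under the cap, but a second, separate snakemake invocation (the other researcher's) knows nothing about the first one's reservations.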
I hope this is an appropriate forum to ask this, and I appreciate any insight you fine folks might provide.
1. Language-specific distributed task library
For example, in Python, Celery is a pretty popular task system. If you (the dev) are the one writing all the code and running the workflows, it might work well for you: you write the core functions, and it handles the queuing, workers, and concurrency limits with a little configuration.
* https://github.com/celery/celery
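A minimal sketch of what that might look like for your case (broker URL, executable names, and paths are placeholders):

```
# tasks.py -- minimal sketch; broker URL, executables, and paths are placeholders
import subprocess
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")

@app.task
def run_pipeline(subject_dir):
    # Chain the pipeline executables for one subject.
    subprocess.run(["step1_exe", subject_dir], check=True)
    subprocess.run(["step2_exe", subject_dir], check=True)
```

You'd start a worker with a fixed concurrency so only a handful of subjects run at once (e.g. `celery -A tasks worker --concurrency=5`), and anyone can enqueue work with `run_pipeline.delay("/data/subj01")`; both researchers' jobs just land in the same queue.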
Or lower level:
* https://github.com/dask/dask
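With dask you'd size a local cluster to the machine and map subjects over it; a sketch using dask.distributed's LocalCluster (worker count, memory limit, and paths are placeholders):

```
# Sketch using dask.distributed's LocalCluster on a single big workstation;
# worker count and per-worker memory limit are placeholders.
from dask.distributed import Client, LocalCluster

def run_pipeline(subject_dir):
    # Call the pipeline executables here (e.g. via subprocess).
    ...

if __name__ == "__main__":
    cluster = LocalCluster(n_workers=8, threads_per_worker=1,
                           memory_limit="24GB")  # limit is per worker
    client = Client(cluster)
    subjects = [f"/data/subj{i:02d}" for i in range(1, 41)]
    futures = client.map(run_pipeline, subjects)
    client.gather(futures)
```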
2. DAG Workflow systems
There are also whole systems built for what you're describing. They've gotten especially popular in the MLOps and data engineering world. A common one is Airflow:
* https://github.com/apache/airflow
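A DAG there is just Python; a minimal sketch (import paths and the `schedule` argument vary a bit across Airflow versions, and the IDs, dates, and commands here are placeholders):

```
# Sketch of an Airflow DAG; ids, dates, and commands are placeholders.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="subject_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule=None,      # trigger manually per batch of subjects
    catchup=False,
) as dag:
    step1 = BashOperator(task_id="step1", bash_command="step1_exe /data/subj01")
    step2 = BashOperator(task_id="step2", bash_command="step2_exe /data/subj01")
    step1 >> step2
```

Airflow also has pools and per-DAG concurrency limits for capping how many tasks run at once, which maps onto your single-workstation constraint.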
If the cost of operations starts to become an issue, you can move pieces onto your own infrastructure.