However, I often run into these issues:
1. Some experiments fail due to runtime errors, and tmux lets them fail silently
2. Some experiments cause a GPU to run out of memory, and I have to dig through many tmux sessions to find and re-run that experiment
3. If many GPUs are close to full, I have to fall back to running experiments sequentially, waiting for experiment_i to finish before starting experiment_i+1
4. When running different experiments, I have to manually estimate how much GPU memory each one will consume before I can distribute them across the GPUs
5. When doing a particularly tedious task (e.g. a hyper-parameter search), there are often on the order of a hundred experiments, which becomes extremely difficult to maintain manually with tmux
Ideally, a perfect solution for this workflow would be a tool that could 1) profile memory consumption for a set of experiments, 2) automatically deploy experiments onto a cluster of GPUs, 3) re-run, queue, or re-assign experiments to other GPUs as needed, and 4) send notifications and keep track of all experiment progress.
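To make it concrete, here is a minimal sketch of the kind of dispatcher I have in mind (assuming `nvidia-smi` is available; the `train.py` commands and memory numbers are just placeholders): it polls free GPU memory, launches queued commands on GPUs with enough headroom, and re-queues anything that exits with an error.

```python
import os
import shlex
import subprocess
import time


def free_memory_mib():
    """Return the free memory in MiB for each visible GPU, via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.free",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return [int(line) for line in out.strip().splitlines()]


# Each entry: (shell command, rough memory estimate in MiB).
# The estimates are still made by hand here -- automating that is
# exactly the part I haven't figured out.
queue = [
    ("python train.py --lr 1e-3", 8000),
    ("python train.py --lr 3e-4", 8000),
    ("python train.py --lr 1e-4", 8000),
]
running = []  # (Popen handle, (command, estimate), gpu index)

while queue or running:
    # Reap finished jobs; put failed ones back on the queue instead of
    # letting them fail silently.
    still_running = []
    for proc, job, gpu in running:
        code = proc.poll()
        if code is None:
            still_running.append((proc, job, gpu))
        elif code != 0:
            print(f"GPU {gpu}: '{job[0]}' exited with {code}, re-queuing")
            queue.append(job)
        else:
            print(f"GPU {gpu}: '{job[0]}' finished")
    running = still_running

    # Launch queued jobs on any idle GPU that reports enough free memory
    # (one job per GPU at a time, to keep the sketch simple).
    busy = {gpu for _, _, gpu in running}
    for gpu, mib in enumerate(free_memory_mib()):
        if not queue or gpu in busy:
            continue
        cmd, need = queue[0]
        if mib >= need:
            job = queue.pop(0)
            env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu)}
            proc = subprocess.Popen(shlex.split(job[0]), env=env)
            running.append((proc, job, gpu))
            print(f"GPU {gpu}: launched '{job[0]}'")

    time.sleep(30)  # poll interval
```

Even this rough version still relies on hand-made memory estimates and has no progress tracking or notifications, which is why I'm hoping a proper tool already exists.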
I currently know of tools like PyTorch Lightning (which only works with PyTorch and requires significant code restructuring) and Weights & Biases (which only covers experiment logging and progress tracking), but I have yet to find anything lightweight and flexible enough to handle all of these requirements.
What's the best way to manage experiments like this?
I've seen plenty of experiment management tools being advertised, but every time I looked at them they were either very limited, or required significant restructuring of my code or my workflow.
I'd like to hear about whatever solution you find because I agree, this does get tedious and painful sometimes.