However, I often run into these issues:
1. Some experiments fail due to runtime errors, and tmux lets them fail silently
2. Some experiments cause a GPU to run out of memory, and I have to dig through many tmux sessions to find and re-run that experiment
3. If many GPUs are close to full, I have to fall back to running experiments sequentially, waiting for experiment_i to finish before starting experiment_i+1
4. When running different experiments, I have to manually estimate how much GPU memory each one will consume before I can distribute them across the GPUs
5. When doing a particularly tedious task (e.g. a hyper-parameter search), there are often on the order of a hundred experiments, which becomes extremely difficult to maintain manually with tmux
Ideally, a perfect solution for this workflow would be a tool that could 1) profile memory consumption for a set of experiments, 2) automatically deploy experiments onto a cluster of GPUs, 3) re-run, queue, or re-assign experiments to other GPUs as needed, and 4) send notifications and keep track of all experiment progress.
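To make it concrete, here is a minimal sketch of the kind of dispatcher I have in mind (assuming `nvidia-smi` is available; the `train.py` commands and memory numbers are just placeholders): it polls free GPU memory, launches queued commands on GPUs with enough headroom, and re-queues anything that exits with an error.

```python
import os
import shlex
import subprocess
import time


def free_memory_mib():
    """Return the free memory in MiB for each visible GPU, via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.free",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return [int(line) for line in out.strip().splitlines()]


# Each entry: (shell command, rough memory estimate in MiB).
# The estimates are still made by hand here -- automating that is
# exactly the part I haven't figured out.
queue = [
    ("python train.py --lr 1e-3", 8000),
    ("python train.py --lr 3e-4", 8000),
    ("python train.py --lr 1e-4", 8000),
]
running = []  # (Popen handle, (command, estimate), gpu index)

while queue or running:
    # Reap finished jobs; put failed ones back on the queue instead of
    # letting them fail silently.
    still_running = []
    for proc, job, gpu in running:
        code = proc.poll()
        if code is None:
            still_running.append((proc, job, gpu))
        elif code != 0:
            print(f"GPU {gpu}: '{job[0]}' exited with {code}, re-queuing")
            queue.append(job)
        else:
            print(f"GPU {gpu}: '{job[0]}' finished")
    running = still_running

    # Launch queued jobs on any idle GPU that reports enough free memory
    # (one job per GPU at a time, to keep the sketch simple).
    busy = {gpu for _, _, gpu in running}
    for gpu, mib in enumerate(free_memory_mib()):
        if not queue or gpu in busy:
            continue
        cmd, need = queue[0]
        if mib >= need:
            job = queue.pop(0)
            env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu)}
            proc = subprocess.Popen(shlex.split(job[0]), env=env)
            running.append((proc, job, gpu))
            print(f"GPU {gpu}: launched '{job[0]}'")

    time.sleep(30)  # poll interval
```

Even this rough version still relies on hand-made memory estimates and has no progress tracking or notifications, which is why I'm hoping a proper tool already exists.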
I currently know of tools like PyTorch Lightning (which only works with PyTorch and requires significant code restructuring) and Weights & Biases (which only covers experiment logging and progress tracking), but I have yet to find anything lightweight and flexible enough to handle all of these requirements.
What's the best way to manage experiments like this?
I've seen plenty of experiment management tools being advertised, but every time I looked at them they were either very limited, or required significant restructuring of my code or my workflow.
I'd like to hear about whatever solution you find because I agree, this does get tedious and painful sometimes.