The team found DVC very clunky (now I need git AND s3 AND dvc).
What best practices and patterns have you seen work or have you implemented yourself?
Don’t store your data in git; store your training code there and your data in S3. You can also add metadata to the bucket so you know what’s in there and how it was generated.
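A rough sketch of the metadata idea, in case it helps. S3 lets you attach user-defined string metadata to each object at upload time. The field names (`git-commit`, `generated-by`, and so on) and both helper functions are made up for illustration, and the upload step assumes `boto3` is installed and credentials are configured.

```python
from datetime import datetime, timezone


def build_s3_metadata(git_commit: str, command: str, description: str) -> dict[str, str]:
    """Build user-defined S3 metadata describing how a dataset was produced.

    The keys are hypothetical; pick whatever your team agrees on.
    S3 metadata values must be strings, so everything is stringified here.
    """
    return {
        "git-commit": git_commit,      # commit of the code that produced the data
        "generated-by": command,       # command line used to generate it
        "description": description,    # short human-readable note
        "created-at": datetime.now(timezone.utc).isoformat(),
    }


def upload_with_metadata(path: str, bucket: str, key: str, metadata: dict[str, str]) -> None:
    """Upload a local file to S3 with the metadata attached to the object."""
    import boto3  # third-party; imported lazily so the helper above works without it

    s3 = boto3.client("s3")
    s3.upload_file(path, bucket, key, ExtraArgs={"Metadata": metadata})
```

Usage would look something like `upload_with_metadata("train.parquet", "my-bucket", "datasets/train.parquet", build_s3_metadata("abc1234", "python make_dataset.py", "toy dataset"))`, where the bucket and key are placeholders.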
Website: https://oxen.ai
Dev Docs: https://docs.oxen.ai
GitHub: https://github.com/Oxen-AI/oxen-release
Feel free to reach out on the repo issues if you run into anything!
Models are then stored in an S3 bucket. But since the IDs are unique, they can be exchanged and cached and copied with next to no risk of confusion.
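The comment doesn't say how the IDs are generated. One common way to get IDs that are safe to exchange, cache, and copy is to hash the file contents, which is what this sketch assumes. The function names and the `models` prefix are hypothetical.

```python
import hashlib
from pathlib import Path


def content_id(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def s3_key_for(path: str, prefix: str = "models") -> str:
    """Build an object key from the content ID, keeping the file extension.

    Assumption: identical files map to the same key, so re-uploading or
    copying a model between buckets or caches cannot create a conflicting entry.
    """
    return f"{prefix}/{content_id(path)}{Path(path).suffix}"
```

With this scheme the key changes whenever the model bytes change, so a cached copy under a given ID never goes stale.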
But perhaps the best solution is to just use something like MLflow or WandB, which handles this for you if you use the API correctly!
v1
v2
v2_ v3_final FINAL_final
In response, I created a command-line tool that addresses these issues[0]. To reduce the comparison to an analogy: Dud : DVC :: Flask : Django. I have a longer comparison in the README[1].
[0]: https://github.com/kevin-hanselman/dud
[1]: https://github.com/kevin-hanselman/dud/blob/main/README.md#m...
We trained the whole team to:
- version the analysis/code with git
- save the data to the bucket s3://

We normally work with zipped parquet files and model binaries, and we try to keep them together in the path mentioned. It's super easy and simple, has very few dependencies, and allows for rerunning the code with the data. If someone deviates from this standard, we always request a change to keep it tidy. Keeping track of data is the same as keeping a clean git tree: it requires practice, a standard, and constant supervision from all. This has saved my butt many times, such as when I had to rerun an analysis done over a year ago, or take over for a colleague who got sick.
Look for something with good algorithms. XetHub worked very well for me, and Oxen looks like a good alternative. git-xet has a very nice feature that allows you to mount a repo over the network [0].
https://www.anchorpoint.app/blog/version-control-using-git-a...