HACKER Q&A
📣 skadamat

How do your ML teams version datasets and models?


Git worked until we hit a few gigabytes. S3 scales super well, but version control, documentation, and change management aren't great (we just did lots of "v1" or "vsep28_2023" names).

DVC felt very clunky to the team (now I need git AND s3 AND dvc).

What best practices and patterns have you seen work or have you implemented yourself?



👤 janalsncm
We have a task name, major version, description, and commit hash. So the model name will be something like my_task_v852_pairwise_refactor_0123ab. Ugly, but it works.

Don’t store your data in git; store your training code there and your data in S3. And you can add metadata to the bucket so you know what’s in there / how it was generated.
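
For example, S3 lets you attach per-object metadata at upload time. A minimal sketch with boto3 (the bucket, key, and metadata fields here are illustrative):

    import boto3

    s3 = boto3.client("s3")
    s3.upload_file(
        "train.parquet",
        "my-ml-datasets",  # hypothetical bucket
        "my_task/v852/train.parquet",
        ExtraArgs={
            "Metadata": {
                "git-commit": "0123ab",
                "generated-by": "build_pairwise_dataset.py",  # hypothetical script
            }
        },
    )

    # Later: check what's in there / how it was generated.
    head = s3.head_object(Bucket="my-ml-datasets", Key="my_task/v852/train.parquet")
    print(head["Metadata"])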


👤 gschoeni
We have been working on an open source tool called "Oxen" that aims to tackle this problem! Would love for you to kick the tires and see if it works for your use case. We have a free version of the CLI, Python library, and server on GitHub, and a free hosted version you can kick around at Oxen.ai.

Website: https://oxen.ai

Dev Docs: https://docs.oxen.ai

GitHub: https://github.com/Oxen-AI/oxen-release

Feel free to reach out on the repo issues if you run into anything!


👤 AJRF
MLFlow

👤 speedgoose
Have you used git or git LFS to store the large files?


👤 plonk
Models that actually get deployed get a random GUID. Our docs tell us which is which (release date, intended use, etc.)

Models are then stored in an S3 bucket. But since the IDs are unique, they can be exchanged and cached and copied with next to no risk of confusion.
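
A minimal sketch of that scheme (the bucket name and layout are illustrative):

    import uuid
    import boto3

    model_id = str(uuid.uuid4())  # globally unique, so copies/caches can't collide

    s3 = boto3.client("s3")
    s3.upload_file("model.bin", "my-models", f"models/{model_id}.bin")  # hypothetical bucket

    # The docs (not the filename) record release date, intended use, etc.,
    # keyed by model_id.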


👤 snovv_crash
CSV file in git with paths to all of the files, all the training settings, and the path to the training artifacts (snapshots, loss stats, etc.). The training artifacts get filled in by CI when you commit. Files can be anywhere; for us it was a NAS, due to PII in the data we were training on, so "someone else's computer" AKA the cloud wasn't an option.

👤 prashp
Git LFS

👤 zxexz
I think a decent solution is coming up with a system for storing the models, datasets, checkpoints, etc. in S3, and storing the metadata, references, etc. in a well-structured Postgres table (schema versioning, audit logs, etc., with snapshots). Also, embed the metadata in the model/dataset itself, in a way that you could always reconstruct the database from the artifacts (in Arrow and Parquet files, you can embed arbitrary metadata at the file level and the field level).
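
For instance, Parquet supports arbitrary key/value metadata at the schema level; a minimal sketch with pyarrow (the metadata keys here are illustrative):

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"feature": [1, 2, 3], "label": [0, 1, 0]})

    # Merge custom keys into the schema-level metadata, preserving
    # whatever is already there (e.g. pandas metadata).
    existing = table.schema.metadata or {}
    table = table.replace_schema_metadata(
        {**existing, b"dataset_version": b"v3", b"git_commit": b"0123abc"}
    )
    pq.write_table(table, "train.parquet")

    # Later: reconstruct catalog entries without scanning the data pages.
    meta = pq.read_schema("train.parquet").metadata
    print(meta[b"dataset_version"])  # b'v3'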

But perhaps the best solution is to just use something like MLflow or W&B that handles this for you, if you use the API correctly!


👤 warkdarrior
I use five version tags, after that I just rename the dataset.

v1

v2

v2_

v3_final

FINAL_final


👤 kvnhn
I've used DVC in the past and generally liked its approach. That said, I wholeheartedly agree that it's clunky. It does a lot of things implicitly, which can make it hard to reason about. It was also extremely slow for medium-sized datasets (low 10s of GBs).

In response, I created a command-line tool that addresses these issues[0]. To reduce the comparison to an analogy: Dud : DVC :: Flask : Django. I have a longer comparison in the README[1].

[0]: https://github.com/kevin-hanselman/dud

[1]: https://github.com/kevin-hanselman/dud/blob/main/README.md#m...


👤 smfjaw
MLflow solves most of these issues for models. I haven't used it in relation to data versioning, but it solves most model versioning and deployment management things I can think of.


👤 cuteboy19
Haphazardly, with commit# + timestamp of training

👤 thegginthesky
Process, git and S3.

We trained the whole team to:

- version the analysis/code with git

- save the data to the bucket s3:///

- we wrote a small script to get the commit ID, build this path, and use boto3 to both access and save the data

We normally work with zipped parquet files and model binaries, and we try to keep them together in the path mentioned.

It's super easy and simple, has very few dependencies, and allows for rerunning the code with the data. If someone deviates from this standard, we will always request a change to keep it tidy.
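
A minimal sketch of that commit-keyed layout (the bucket and prefix are stand-ins, since the real path above is elided):

    import subprocess
    import boto3

    def commit_id() -> str:
        return subprocess.check_output(
            ["git", "rev-parse", "--short", "HEAD"], text=True
        ).strip()

    def s3_key(filename: str) -> str:
        return f"runs/{commit_id()}/{filename}"  # hypothetical prefix

    s3 = boto3.client("s3")
    BUCKET = "my-team-artifacts"  # hypothetical bucket

    # Save the zipped parquet data and model binaries under the commit.
    s3.upload_file("data.parquet.gz", BUCKET, s3_key("data.parquet.gz"))
    s3.upload_file("model.bin", BUCKET, s3_key("model.bin"))

    # Rerunning an old analysis: check out the commit, then pull its data.
    s3.download_file(BUCKET, s3_key("data.parquet.gz"), "data.parquet.gz")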

Keeping track of data is the same as keeping a clean git tree: it requires practice, a standard, and constant supervision from everyone.

This saved my butt many times, such as when I had to rerun an analysis done over a year ago, or take over for a colleague who got sick.


👤 john-shaffer
DVC is very slow because it stores and writes data twice, and the default of dozens of concurrent downloads causes resource starvation. They finally improved uploads in 3.0, but downloads and storage are still much worse than a simple "aws s3 cp". You can improve download performance somewhat by passing a reasonable value for -j. Storage can be improved by nuking .dvc/cache. There's no way to skip writing all data twice, though.

Look for something with good algorithms. XetHub worked very well for me, and Oxen looks like a good alternative. git-xet has a very nice feature that allows you to mount a repo over the network [0].

[0] https://about.xethub.com/blog/mount-part-1


👤 pjfin123
I put the metadata in a JSON file and then store the datasets as a zip archive on an Nginx server.

👤 nofitty376
W&B artifacts for days

👤 wingman-jr
For a side project of image classification, I use a simple folder system where the images and metadata are both files, with a hash of the image acting as a key/filename, e.g. 123.img and 123.metadata. This gives file independence. Then, as needed, I compile a CSV of all the image-to-metadata mappings and version that. It works because I view the images as immutable, which is not true for some datasets. On a local SSD, it has scaled to >300K images.

Professionally, I've used something similar but with S3 storage for images and a Postgres database for the metadata. This scales up better beyond a single physical machine for team interaction, of course.

I'd be curious how others have handled data costs as the datasets grow. The professional dataset got into the terabytes of S3 storage, and it gets a bit more frustrating when you want to move data but are looking at thousands of dollars in projected egress costs... and that's with S3, let alone a more expensive service. In many ways S3 is so much better than a hard drive, but it's hard not to compare to the relative cost of local storage when the gap gets big enough.
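
A minimal sketch of the hash-keyed folder scheme (the extensions follow the comment; the folder path is illustrative):

    import hashlib
    import json
    from pathlib import Path

    STORE = Path("dataset")  # hypothetical local folder

    def add_image(image_bytes: bytes, metadata: dict) -> str:
        # The content hash is the key, so identical images dedupe and
        # entries stay immutable.
        key = hashlib.sha256(image_bytes).hexdigest()
        (STORE / f"{key}.img").write_bytes(image_bytes)
        (STORE / f"{key}.metadata").write_text(json.dumps(metadata))
        return key

    # Compiling the versionable CSV is then just a directory walk over
    # the *.metadata files.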

👤 m_niedoba
Here is a tutorial on how to use Git LFS with Azure DevOps for game dev, but the same principle applies to ML: it's about versioning large data. Azure DevOps does not charge for storage, yet.

https://www.anchorpoint.app/blog/version-control-using-git-a...