How do you version your data?

Question

I'm working at a company that processes data through multiple distinct stages, and I'm struggling to figure out what tooling to use for versioning and maintaining an auditable history of changes. I'd be interested to hear firsthand experiences of how you version production data, use it safely during testing or experimentation, and maintain audit trails.

toomuchtodo · Accepted Answer

Potentially useful threads for your consideration.
Data Version Control - https://news.ycombinator.com/item?id=41888937 - Oct 2024 (52 comments)
Data Version Control - https://news.ycombinator.com/item?id=33047634 - Oct 2022 (59 comments)
Oxen.ai: Fast Unstructured Data Version Control - https://news.ycombinator.com/item?id=34831547 - Feb 2023 (63 comments)
Show HN: Oxen.ai – Fast Unstructured Data Version Control - https://news.ycombinator.com/item?id=34825056 - Feb 2023 (5 comments)
Ask HN: How do you version your data? - https://news.ycombinator.com/item?id=13683539 - Feb 2017 (55 comments)
With regard to tooling, https://github.com/pachyderm/pachyderm may satisfy this use case.

gschoeni · Answer

We're working on Oxen.ai, which is an open-source CLI and server with Python bindings as well. It's optimized for ML/AI workloads but works with any type of data, and we see usage from game companies, bio, aerospace, etc.
Feel free to check it out here: https://github.com/Oxen-AI/oxen-release
Or a hub you can host data on (we have public and private repos, or private VPC deployments): https://oxen.ai
The CLI mirrors git, so it's easy to learn. It also has some interesting built-in tooling for diffing datasets and working on them remotely without downloading a full copy of the data.
Happy to answer any other questions!

mathi0750 · Answer

I just released (1 minute ago) a blog post that goes through the most popular data version control options and compares them. Hopefully this clarifies which is the best solution for you. Here's the post: https://www.oxen.ai/blog/the-best-ai-data-version-control-to...

bpf120 · Answer

Check out www.dolthub.com

thenaturalist · Answer

Another one is projectnessie.org