I'd be interested to hear first-hand experiences of how you version production data, use it safely during testing or experimentation, and maintain audit trails.
Data Version Control - https://news.ycombinator.com/item?id=41888937 - Oct 2024 (52 comments)
Data Version Control - https://news.ycombinator.com/item?id=33047634 - Oct 2022 (59 comments)
Oxen.ai: Fast Unstructured Data Version Control - https://news.ycombinator.com/item?id=34831547 - Feb 2023 (63 comments)
Show HN: Oxen.ai – Fast Unstructured Data Version Control - https://news.ycombinator.com/item?id=34825056 - Feb 2023 (5 comments)
Ask HN: How do you version your data? - https://news.ycombinator.com/item?id=13683539 - Feb 2017 (55 comments)
With regard to tooling, https://github.com/pachyderm/pachyderm may satisfy this use case.
Feel free to check it out here: https://github.com/Oxen-AI/oxen-release
There is also a hub you can host data on (we have public and private repos, or private VPC deployments): https://oxen.ai
The CLI mirrors git, so it's easy to learn. It also has some interesting built-in tooling for diffing datasets and for working on them remotely without downloading a full copy of the data.
Happy to answer any other questions!
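For anyone wondering what "diffing datasets" means in practice, here is a rough sketch of a row-level diff between two CSV snapshots keyed on an ID column. This is only an illustration of the general idea, not how Oxen or any other tool mentioned in this thread implements it; the function name `diff_rows`, the `id` key column, and the sample data are all made up for the example.

```python
import csv
import io


def diff_rows(old_csv: str, new_csv: str, key: str):
    """Compare two CSV snapshots and return (added, removed, changed) keys.

    Hypothetical helper for illustration; real tools work on files or
    content hashes rather than in-memory strings.
    """
    # Index every row of each snapshot by its key column.
    old = {row[key]: row for row in csv.DictReader(io.StringIO(old_csv))}
    new = {row[key]: row for row in csv.DictReader(io.StringIO(new_csv))}

    added = sorted(new.keys() - old.keys())      # keys only in the new version
    removed = sorted(old.keys() - new.keys())    # keys only in the old version
    # Keys present in both versions whose row contents differ.
    changed = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return added, removed, changed


# Two made-up versions of the same small dataset.
v1 = "id,label\n1,cat\n2,dog\n3,bird\n"
v2 = "id,label\n1,cat\n2,wolf\n4,fish\n"

added, removed, changed = diff_rows(v1, v2, key="id")
print("added:", added, "removed:", removed, "changed:", changed)
```

Keying on a stable ID is what lets this report a changed row rather than an unrelated remove plus add; without such a column you would have to fall back to hashing whole rows.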