Most things are settled, but we expect to collect a LOT of data that will be labeled and/or auto-labeled (on the order of 100 million video clips).
We will be training multiple models for different tasks from that data and we need a good system to organize it.
Does anybody have any tips or experience with that kind of thing? We can use any on-premises or cloud solution.
Specifically, we would need:
* Data ingestion pipeline (data will come from field personnel)
* Data versioning
* Being able to define datasets that are a subset of the whole collected data
* Inexpensive storage (e.g. S3 or similar)
* Branching/Merging for maintaining production training data sets
* Metadata storage and query capabilities ...
* User interface for less tech-savvy people (e.g. just a git-like command line is fine for engineers but not for field personnel who are not in IT)
I know of tools like https://dvc.org/ but they a) are just layers on top of git, b) break apart on huge datasets without a folder hierarchy (git tree objects just don't work for linear lists of items), c) are only usable by IT personnel, and d) require checking out at least a part of the dataset.
Our datasets would be 100,000,000 x 100 MB = 10 PB of raw data. Training data should be delivered to training nodes over the network; we just can't have a full checkout of that data.
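To make the "subset without a checkout" requirement concrete, here is a minimal sketch, not tied to any particular tool: a dataset is just a manifest of clip records, a subset is a filter over that manifest, and a version id is a hash over the sorted (clip id, content hash) pairs. The `Clip` fields, the function names and the `s3://example-bucket/...` URIs are all hypothetical; the last lines re-check the 10 PB figure, assuming decimal units.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    """One manifest entry; all field names are hypothetical."""
    clip_id: str
    uri: str                 # where the raw clip lives, e.g. an object-store URI
    sha256: str              # content hash of the raw clip
    size_bytes: int
    labels: tuple[str, ...]


def select(clips, required_label):
    """Return the clips carrying a given label, sorted by clip id."""
    matching = (c for c in clips if required_label in c.labels)
    return sorted(matching, key=lambda c: c.clip_id)


def manifest_version(clips):
    """Hash the sorted (clip_id, sha256) pairs so the id ignores input order."""
    pairs = [[c.clip_id, c.sha256] for c in sorted(clips, key=lambda c: c.clip_id)]
    return hashlib.sha256(json.dumps(pairs).encode("utf-8")).hexdigest()


def total_bytes(n_clips, clip_bytes):
    """Raw storage needed for n_clips clips of clip_bytes each."""
    return n_clips * clip_bytes


# 100,000,000 clips x 100 MB each, using decimal units (1 MB = 10**6 bytes)
RAW_BYTES = total_bytes(100_000_000, 100 * 10**6)
RAW_PB = RAW_BYTES / 10**15
```

A training node would then only need the manifest (a few hundred bytes per clip) and could fetch the listed URIs on demand, rather than checking out the raw data.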
https://github.com/Oxen-AI/oxen-release#-oxen
Going down your list of requirements, Oxen has:
* Data versioning, similar paradigm to git, but built from the ground up for large ML datasets
* Inexpensive storage, comparable pricing to S3
* Branching/Merging for maintaining production training data sets
* Metadata storage and query capabilities: works with many structured data types, and there are APIs for querying.
* User interface for less tech-savvy people: we are building out a hub at https://www.oxen.ai to enable this.
* Being able to define datasets that are a subset of the whole collected data (is this a similar requirement to querying?)
* Data ingestion pipeline - engineers would have to hook into APIs or CLI tools right now.
Feel free to check it out and leave any feedback on the GitHub repo!
https://github.com/dolthub/dolt
And that has a user-friendly UI in DoltHub:
You wouldn't store the images themselves in Dolt; those would likely be links to S3, but all the labels and surrounding metadata could be stored in Dolt.
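As a rough illustration of that split (blobs in object storage, labels and metadata in a SQL table), here is a small sketch. It uses Python's built-in `sqlite3` purely as a stand-in so the example runs anywhere; Dolt itself is accessed as a MySQL-compatible database, and nothing below uses Dolt's versioning features. The table name, column names and `s3://example-bucket/...` URIs are made up for the example.

```python
import sqlite3

# In-memory database as a stand-in for the metadata store.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE clips (
        clip_id    TEXT PRIMARY KEY,
        s3_uri     TEXT NOT NULL,   -- link to the raw clip, not the clip itself
        label      TEXT,
        labeled_by TEXT             -- e.g. 'human' or 'auto'
    )
    """
)
conn.executemany(
    "INSERT INTO clips VALUES (?, ?, ?, ?)",
    [
        ("a1", "s3://example-bucket/clips/a1.mp4", "pedestrian", "human"),
        ("a2", "s3://example-bucket/clips/a2.mp4", "vehicle", "auto"),
        ("a3", "s3://example-bucket/clips/a3.mp4", "pedestrian", "auto"),
    ],
)


def clips_for_label(conn, label):
    """Return (clip_id, s3_uri) pairs for one label, ordered by clip id."""
    cursor = conn.execute(
        "SELECT clip_id, s3_uri FROM clips WHERE label = ? ORDER BY clip_id",
        (label,),
    )
    return cursor.fetchall()
```

A training dataset is then the result of a query such as `clips_for_label(conn, "pedestrian")`, and only the returned URIs need to be fetched from storage.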
DISCLAIMER: I'm the CEO of DoltHub so this is self-promotion.