Is widespread use of non-commercial datasets an open secret in startups?
I'm asking about more than just language models. Setting foundation models aside, I still see small companies with very specific goals, and even niche offshoots at bigger companies, that need data sets to build performant, custom models. Collecting that data can be a major hurdle, yet some companies succeed anyway with no visible data collection effort. This is true for language, vision, etc.
I suspect that many of these companies bootstrap from pretrained models, a surprising number of which carry non-commercial licenses or were trained on non-commercially licensed data sets.
So is it an open secret that companies just suck up whatever they can get their hands on anyway? Or is the legal landscape simply still that grey?