Corporate data, research, books, blogs: any tokens the kid trains itself on will "feel" right in its stomach, neither "too heavy" nor "too light" for their semantic mass. Whatever the rest of the internet might offer in the future (comment sections) is so predictable that training on it would be a duplication of effort the next gen of AI won't waste any RAM on.
- the builders are well aware of the situation
- they are not training on the full internet; they are actually training on less data than before, because a filtered subset produces better models (a minimal filtering sketch follows this list)
- training involves much more than text scraped from the internet; textbooks are a great addition to the training set. Multi-modal data, especially video, is expected to give models better world understanding. I suspect this will unlock the household robot
- they now have all the actual interactions with the LLM (and the feedback on them) to add to training, which is far more relevant and direct training data (see the preference-pair sketch below)
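
To make the "filtered subset" point concrete, here is a minimal sketch of what corpus filtering can look like: cheap heuristics plus exact deduplication, with a stand-in quality scorer. The thresholds and the `quality_score` function are hypothetical stand-ins; real pipelines use learned classifiers and much more elaborate dedup, but the shape is the same.

```python
# Sketch of corpus filtering: keep documents that pass cheap heuristics
# and a quality score, drop exact duplicates. Thresholds and the quality
# model are hypothetical placeholders, not any lab's actual pipeline.
import hashlib

def quality_score(text: str) -> float:
    # Stand-in for a learned quality classifier (hypothetical).
    # Here: fraction of alphabetic/whitespace characters, a crude proxy.
    if not text:
        return 0.0
    return sum(c.isalpha() or c.isspace() for c in text) / len(text)

def filter_corpus(docs, min_len=200, min_quality=0.8):
    seen = set()
    for doc in docs:
        text = doc.strip()
        if len(text) < min_len:               # too short to carry signal
            continue
        digest = hashlib.sha256(text.encode()).hexdigest()
        if digest in seen:                    # exact duplicate
            continue
        seen.add(digest)
        if quality_score(text) < min_quality:  # low-quality boilerplate
            continue
        yield text

# Usage: corpus = list(filter_corpus(raw_dump.split("\n\n")))
```

The counterintuitive bit is that `filter_corpus` throws tokens away on purpose: less data in, better model out.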
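And for the last bullet, a sketch of how logged interactions plus feedback become training data: thumbs-up and thumbs-down responses to the same prompt get paired into (chosen, rejected) examples, the shape consumed by DPO-style preference training. The log format and field names here are assumptions for illustration only.

```python
# Sketch: turning logged chat interactions into preference pairs.
# Log schema ({"prompt", "response", "rating"}) is hypothetical.
from collections import defaultdict

def to_preference_pairs(logs):
    """logs: iterable of {"prompt": str, "response": str, "rating": +1/-1}"""
    by_prompt = defaultdict(lambda: {"good": [], "bad": []})
    for entry in logs:
        bucket = "good" if entry["rating"] > 0 else "bad"
        by_prompt[entry["prompt"]][bucket].append(entry["response"])
    for prompt, buckets in by_prompt.items():
        # Pair each upvoted response with each downvoted one for the
        # same prompt: (prompt, chosen, rejected).
        for chosen in buckets["good"]:
            for rejected in buckets["bad"]:
                yield {"prompt": prompt, "chosen": chosen, "rejected": rejected}

logs = [
    {"prompt": "explain DNS", "response": "DNS maps names to IPs...", "rating": 1},
    {"prompt": "explain DNS", "response": "idk google it", "rating": -1},
]
print(list(to_preference_pairs(logs)))
```

Unlike scraped text, this data carries a direct signal about what users actually wanted, which is why it's so much more valuable per token.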