- The output can be much better with richer input (multiple text blocks, images, video, audio, and so on).
- The collective training dataset is much better when video, audio, still images, and text are labeled and associated with one another.
- I suspect previous models can be used to augment human datasets without too much risk of "AI inbreeding." For example, imagine asking an LLM to reword a text page, or an image model to generate a variant of an image (a rough sketch follows below). Now picture this in a multimedia dataset, with (for instance) a model labeling video more accurately, or producing image variations with associated audio/text as inputs.
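
To make the augmentation idea concrete, here is a minimal sketch, assuming the Hugging Face `transformers` library with `google/flan-t5-small` as a stand-in paraphraser; the model choice and prompt are illustrative assumptions, not a recommendation:

```python
# Minimal sketch: use an existing model to augment a (text) dataset
# with paraphrased variants while keeping the human originals.
# Assumes the Hugging Face `transformers` library; the model name and
# prompt are stand-ins chosen for illustration only.
from transformers import pipeline

# Small instruction-tuned model used here purely as a stand-in paraphraser.
paraphraser = pipeline("text2text-generation", model="google/flan-t5-small")

def augment_captions(captions, variants_per_caption=2):
    """Return the original captions plus model-generated rewordings.

    Keeping the originals alongside the variants (and capping how many
    variants each original spawns) is one simple guard against the
    "AI inbreeding" problem: synthetic text never replaces the
    human-written source, it only supplements it.
    """
    augmented = []
    for caption in captions:
        augmented.append({"text": caption, "source": "human"})
        for _ in range(variants_per_caption):
            out = paraphraser(
                f"Rewrite this sentence in different words: {caption}",
                max_new_tokens=60,
                do_sample=True,
            )
            augmented.append({"text": out[0]["generated_text"], "source": "model"})
    return augmented

if __name__ == "__main__":
    captions = ["A dog catches a frisbee on the beach at sunset."]
    for row in augment_captions(captions):
        print(row["source"], "->", row["text"])
    # The same loop generalizes to other modalities: swap the paraphraser
    # for an image-variation or captioning model and attach the output to
    # the same record so text/image/audio stay associated.
```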
There is also the very slim possibility of emergence, that is, the idea that combining many simple, random systems can produce an intelligent one. Given how fast computer systems have become, there might be something there, but that's more wishful thinking than anything else.