Given a large enough dataset of MP3 files, would it be possible to predict the next millisecond of audio from the previous milliseconds and generate songs that way? Will we generate videos by predicting the next best frame?
Is there any technical reason we couldn't collect first-person audio and video with the cameras and microphone on a Quest Pro and generate what the next few minutes of our lives might look like?
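For intuition, here's a minimal sketch of that "predict the next sample from the previous ones" idea, using plain least-squares linear prediction instead of a neural network. Everything here (function names, the toy sine signal, the order of 32) is hypothetical illustration; real systems like AudioLM predict learned discrete tokens with a Transformer rather than fitting a linear filter to raw samples.

```python
# Hypothetical sketch: autoregressive next-sample prediction via linear
# prediction. Not how AudioLM works internally; same core idea, though.
import numpy as np

def fit_predictor(signal: np.ndarray, order: int = 32) -> np.ndarray:
    """Fit coefficients so that signal[t] ~ coeffs @ signal[t-order:t]."""
    # Sliding windows of past samples (the context)...
    X = np.stack([signal[i:i + order] for i in range(len(signal) - order)])
    # ...and the sample that immediately follows each window (the target).
    y = signal[order:]
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

def generate(seed: np.ndarray, coeffs: np.ndarray, n_samples: int) -> np.ndarray:
    """Autoregressively extend `seed`, one sample at a time."""
    out = list(seed)
    for _ in range(n_samples):
        context = np.array(out[-len(coeffs):])
        out.append(float(coeffs @ context))  # feed each prediction back in
    return np.array(out)

# Toy demo: learn to continue a 440 Hz tone sampled at 16 kHz.
sr = 16_000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
coeffs = fit_predictor(tone, order=32)
continuation = generate(tone[:320], coeffs, n_samples=sr // 10)
```

A linear filter can only continue simple, stationary signals like this tone; the open question in the parent comment is exactly whether scaling the predictor up (and feeding it enough data) gets you from "continue a sine wave" to "continue a song, or your morning".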
Not milliseconds, but AudioLM [1] already does this with just seconds of context, for speech (and piano). The results are already very convincing (to me).
[1] https://google-research.github.io/seanet/audiolm/examples/