Training neural nets on MIDI files seems to let them pick up these structures (more easily than unpicking them from raw audio), but MIDI is a lossy format that misses the unique instrument sounds, the nuances of a performance, and the human voice entirely.
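Roughly, the appeal of MIDI is that it's already a discrete symbolic sequence. A minimal sketch of turning a MIDI file into note tokens for a sequence model, assuming the pretty_midi library and a made-up (pitch, duration-bucket) token scheme:

    # Sketch: convert a MIDI file into a flat token sequence for a language-model-style net.
    # Assumes pretty_midi; the (program, pitch, duration-bucket) tokens are illustrative only.
    import pretty_midi

    pm = pretty_midi.PrettyMIDI("song.mid")  # hypothetical input file
    tokens = []
    for inst in pm.instruments:
        if inst.is_drum:
            continue
        for note in sorted(inst.notes, key=lambda n: n.start):
            dur_bucket = min(int((note.end - note.start) / 0.125), 15)  # 16 coarse duration buckets
            tokens.append((inst.program, note.pitch, dur_bucket))
    # 'tokens' is now a symbolic sequence: the structure is explicit, but timbre and expression are gone.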
The Bon Jovi / Chopin crossover here isn't particularly musically interesting, but mainly it doesn't fit the prompt because MuseNet has no idea what Bon Jovi's voice or guitar tone sounds like: https://openai.com/blog/musenet/ And yes, it's also unimpressive because a toy script could add drums, bass, and variations to the piano hook...
By analogy, I guess it's like training an AI image generator on polygons and textures. You'd end up with coherent geometry and the ability to view the asset from different perspectives, which might be useful for generating video as well as static assets, but the images would look like late-90s computer games rather than photographs, and wouldn't be hard to produce procedurally in other ways.
Training on raw audio could pick up the nuances in tone and timing, but would lose the separation between instruments and notes in the mix that lets the music be rearranged. I guess an idealised training set would include track-by-track raw audio plus notation: but good luck getting the music industry to license that as a training set...
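For what it's worth, a paired example in that idealised set might look something like this (a hypothetical sketch; no real dataset I know of provides it):

    # Hypothetical shape of one training example pairing per-track audio with its notation.
    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class StemExample:
        instrument: str        # e.g. "lead vocal", "rhythm guitar"
        audio: np.ndarray      # raw waveform for this stem only, shape (num_samples,)
        sample_rate: int
        midi_path: str         # aligned symbolic notation for the same stem

    # A song would then be a list of StemExample, so a model could learn
    # both the symbolic structure and the timbre that realises it.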
https://magenta.tensorflow.org/perceiver-ar
Here is an older project that generated audio from lyrics:
https://openai.com/blog/jukebox/
Music has two key difficulties: the long, linear context makes it hard for most models to generate coherent works and reason about long-term, higher-level structure, and the global structure being learned is much more a function of human preference than of physical constraints.
That said, I don't think the gap is all that sizeable. Jukebox is easily comparable in quality to the image synthesis from arbitrary text of its time. Perceiver AR is not as impressive as DALL-E 2, sure, but DALL-E 2 is new, and it's plenty of evidence that music synthesis will continue to work.
Except we haven't had as much development in AI music yet. It's still at the "DALL-E mini" equivalent stage.
You can literally just play three chords and have a pop song. Play the same string, hell, the same note, on an electric guitar and bass and you have a metal song, as long as you get a good rhythm. The bar for algorithmic music is low.
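To illustrate how low that bar is, here's a throwaway sketch (using pretty_midi, with an arbitrary C-F-G progression) that writes a "pop song" backbone to a MIDI file:

    # Toy example: loop a three-chord progression (C, F, G) as block chords and write it out as MIDI.
    # pretty_midi is assumed; every musical choice here is arbitrary.
    import pretty_midi

    CHORDS = [[60, 64, 67], [65, 69, 72], [67, 71, 74]]  # C major, F major, G major (MIDI pitches)
    BEAT = 0.5  # 0.5 s per chord, i.e. one chord per beat at 120 bpm

    pm = pretty_midi.PrettyMIDI()
    piano = pretty_midi.Instrument(program=0)  # acoustic grand
    t = 0.0
    for _ in range(8):  # eight times round the progression
        for chord in CHORDS:
            for pitch in chord:
                piano.notes.append(pretty_midi.Note(velocity=90, pitch=pitch, start=t, end=t + BEAT))
            t += BEAT
    pm.instruments.append(piano)
    pm.write("three_chords.mid")  # hypothetical output path

Adding a bass line and a drum pattern on top of this is a few more loops, which is the point: the structural skeleton is trivial to generate; it's the sound and the performance that are hard.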
Except AI doesn't use algorithms. It imitates patterns that it sees, and probably doesn't get the pattern as well as someone who knows a little music theory.
Would DALL-E 2's work be remarkable if you saw it on ArtStation? No, it's only remarkable because of who made it.
My gut says it's mostly the first bit, though. We probably just have a lower entertainment threshold for randomized images.
This is true even when no machines are involved. Music at the same standard as your average fan art on the internet is nothing anyone would listen to. Even average consumers are simply more discerning when it comes to music: almost everyone you ask on the street has a favorite musician and can tell you a few things about different genres, but how many people have a favorite painter and can tell you something about art? It's just a cultural thing.
Although maybe it would work to generate at a very low sample rate and then upscale that...
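Something like this, conceptually (a rough sketch with scipy; the "upscaling" step would really be a learned super-resolution model, not naive resampling):

    # Sketch of the low-sample-rate idea: work at 4 kHz, then bring it back up to 44.1 kHz.
    # Here the "upscaler" is plain polyphase resampling; in practice it would be a learned model.
    import numpy as np
    from scipy.signal import resample_poly

    sr_high, sr_low = 44100, 4000
    t = np.linspace(0, 2.0, 2 * sr_high, endpoint=False)
    audio = 0.5 * np.sin(2 * np.pi * 440 * t)          # stand-in for real generated audio

    lowres = resample_poly(audio, sr_low, sr_high)     # cheap, short context for the generative model
    restored = resample_poly(lowres, sr_high, sr_low)  # where a neural upsampler would go
    print(len(lowres), len(restored))                  # 8000 vs 88200 samples for 2 seconds of audio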
Launching soon.