Training neural nets on MIDI files seems to let them pick up these structures (more easily than unpicking them from raw audio), but MIDI is a lossy format that misses the unique instrument sounds, the nuances of a performance, and the human voice entirely.
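Roughly, the appeal of MIDI is that it's already a discrete symbolic sequence. A minimal sketch of turning a MIDI file into note tokens for a sequence model, assuming the pretty_midi library and a made-up (pitch, duration-bucket) token scheme:

    # Sketch: convert a MIDI file into a flat token sequence for a language-model-style net.
    # Assumes pretty_midi; the (program, pitch, duration-bucket) tokens are illustrative only.
    import pretty_midi

    pm = pretty_midi.PrettyMIDI("song.mid")  # hypothetical input file
    tokens = []
    for inst in pm.instruments:
        if inst.is_drum:
            continue
        for note in sorted(inst.notes, key=lambda n: n.start):
            dur_bucket = min(int((note.end - note.start) / 0.125), 15)  # 16 coarse duration buckets
            tokens.append((inst.program, note.pitch, dur_bucket))
    # 'tokens' is now a symbolic sequence: the structure is explicit, but timbre and expression are gone.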
The Bon Jovi / Chopin crossover here isn't particularly musically interesting, but mainly it doesn't fit the prompt because MuseNet has no idea what Bon Jovi's voice or guitar tone sounds like: https://openai.com/blog/musenet/ And yes, it's also unimpressive because a toy script could add drums, bass, and variations to the piano hook...
By analogy, I guess it's like training an AI image generator on polygons and textures. You'd end up with coherent geometry and the ability to view the asset from different perspectives, which might be useful for generating video as well as static assets, but the images would look like late-90s computer games rather than photographs, and wouldn't be hard to produce procedurally in other ways.
Training on raw audio could pick up the nuances in tone and timing, but would lose the separation between instruments and notes in the mix that lets the music be rearranged. I guess an idealised training set would include track-by-track raw audio plus notation: but good luck getting the music industry to license that as a training set...
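For what it's worth, a paired example in that idealised set might look something like this (a hypothetical sketch; no real dataset I know of provides it):

    # Hypothetical shape of one training example pairing per-track audio with its notation.
    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class StemExample:
        instrument: str        # e.g. "lead vocal", "rhythm guitar"
        audio: np.ndarray      # raw waveform for this stem only, shape (num_samples,)
        sample_rate: int
        midi_path: str         # aligned symbolic notation for the same stem

    # A song would then be a list of StemExample, so a model could learn
    # both the symbolic structure and the timbre that realises it.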
https://magenta.tensorflow.org/perceiver-ar
Here is an older project that generated audio from lyrics:
https://openai.com/blog/jukebox/
Music has two key difficulties: the long, linear context makes it hard for most models to generate coherent works and reason about long-term, higher-level structure, and the global structure being learned is much more a function of human preference than of physical constraints.
That said, I don't think the gap is all that sizeable. Jukebox is easily comparable in quality to the image synthesis from arbitrary text of its time. Perceiver AR is not as impressive as DALL-E 2, sure, but DALL-E 2 is new, and it's plenty of evidence that music synthesis will continue to work.
Except we haven't had as much development in AI music yet. It's still at the "DALL-E mini" equivalent stage.
You can literally just play three chords and have a pop song. Play the same string, hell, the same note, on an electric guitar and bass and you have a metal song, as long as you get a good rhythm. The bar for algorithmic music is low.
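To illustrate how low that bar is, here's a throwaway sketch (using pretty_midi, with an arbitrary C-F-G progression) that writes a "pop song" backbone to a MIDI file:

    # Toy example: loop a three-chord progression (C, F, G) as block chords and write it out as MIDI.
    # pretty_midi is assumed; every musical choice here is arbitrary.
    import pretty_midi

    CHORDS = [[60, 64, 67], [65, 69, 72], [67, 71, 74]]  # C major, F major, G major (MIDI pitches)
    BEAT = 0.5  # 0.5 s per chord, i.e. one chord per beat at 120 bpm

    pm = pretty_midi.PrettyMIDI()
    piano = pretty_midi.Instrument(program=0)  # acoustic grand
    t = 0.0
    for _ in range(8):  # eight times round the progression
        for chord in CHORDS:
            for pitch in chord:
                piano.notes.append(pretty_midi.Note(velocity=90, pitch=pitch, start=t, end=t + BEAT))
            t += BEAT
    pm.instruments.append(piano)
    pm.write("three_chords.mid")  # hypothetical output path

Adding a bass line and a drum pattern on top of this is a few more loops, which is the point: the structural skeleton is trivial to generate; it's the sound and the performance that are hard.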
Except AI doesn't use algorithms. It imitates patterns that it sees, and probably doesn't get the pattern as well as someone who knows a little music theory.
Would DALL-E 2's work be remarkable if you saw it on ArtStation? No, it's only remarkable because of who made it.
My gut says it's mostly the first bit, though. We probably just have a lower entertainment threshold for randomized images.
This is true even when no machines are involved. Music at the same standard as your average fan art on the internet is nothing anyone would listen to. Even average consumers are simply more discerning when it comes to music: almost everyone you ask on the street has a favorite musician and can tell you a few things about different genres, but how many people have a favorite painter and can tell you something about art? It's just a cultural thing.
Although maybe it would work to generate at a very low sample rate and then upscale that...
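Something like this, conceptually (a rough sketch with scipy; the "upscaling" step would really be a learned super-resolution model, not naive resampling):

    # Sketch of the low-sample-rate idea: work at 4 kHz, then bring it back up to 44.1 kHz.
    # Here the "upscaler" is plain polyphase resampling; in practice it would be a learned model.
    import numpy as np
    from scipy.signal import resample_poly

    sr_high, sr_low = 44100, 4000
    t = np.linspace(0, 2.0, 2 * sr_high, endpoint=False)
    audio = 0.5 * np.sin(2 * np.pi * 440 * t)          # stand-in for real generated audio

    lowres = resample_poly(audio, sr_low, sr_high)     # cheap, short context for the generative model
    restored = resample_poly(lowres, sr_high, sr_low)  # where a neural upsampler would go
    print(len(lowres), len(restored))                  # 8000 vs 88200 samples for 2 seconds of audio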
Launching soon.