I'm sure this barely scratches the surface. What's your weirdest fantasy involving ChatGPT or Midjourney?
The section on conditional diffusion mentioned in passing that the model's attention blocks allow the diffusion process to be conditioned on other modalities (i.e., other kinds of data).
Text is the modality we're all familiar with by now thanks to DALL-E, DreamStudio, etc. But I had the following ideas for combining other modalities:
1) Audio => image synthesis: Speaking / meowing / singing / grunting to the model to generate images, either for fun or for more serious stuff like helping the hearing-impaired.
2) Video => image synthesis: From audio+video of one type (like cartoons), generate images of a different type (like real-life).
3) Signal => image synthesis: Use a heartbeat or some other physiological signal to generate images.
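
To make the conditioning idea a bit more concrete, here is a rough sketch of a single cross-attention step, where image latents act as queries and the conditioning tokens supply keys and values. It's a toy under loud assumptions: the weights are random rather than trained, the sizes are made up, and `audio_tokens` stands in for the output of some hypothetical audio encoder that isn't shown. The point is only that nothing in the attention math cares whether the conditioning tokens came from text, audio, video frames, or a heartbeat trace, as long as something turns them into a sequence of vectors.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    """Toy stand-in for learned weights or encoder outputs (random, NOT trained)."""
    return [[rng.gauss(0.0, 1.0) / math.sqrt(rows) for _ in range(cols)]
            for _ in range(rows)]

def cross_attention(latents, cond, w_q, w_k, w_v):
    """Image latents (queries) attend over conditioning tokens (keys/values)."""
    q = matmul(latents, w_q)              # (n_latent, d_attn)
    k = matmul(cond, w_k)                 # (n_cond, d_attn)
    v = matmul(cond, w_v)                 # (n_cond, d_attn)
    scale = 1.0 / math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scores = matmul(q, k_t)               # (n_latent, n_cond)
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(weights, v), weights

rng = random.Random(0)
d_latent, d_cond, d_attn = 8, 6, 4            # made-up sizes for illustration
latents = random_matrix(16, d_latent, rng)    # 16 image-latent positions
audio_tokens = random_matrix(5, d_cond, rng)  # 5 hypothetical audio-frame embeddings
w_q = random_matrix(d_latent, d_attn, rng)
w_k = random_matrix(d_cond, d_attn, rng)
w_v = random_matrix(d_cond, d_attn, rng)

out, weights = cross_attention(latents, audio_tokens, w_q, w_k, w_v)
```

Swapping modalities in this sketch only means swapping what produces the conditioning tokens. The hard part in practice would be the encoder and the paired training data, neither of which this toy addresses.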