Even more interesting would be video of someone using it that way.
I assume it's just a matter of having enough compute.
Stable Diffusion runs on under 10 GB of VRAM on consumer GPUs, generating images at 512x512 pixels in a few seconds. This will allow researchers, and soon the public, to run the model under a range of conditions, democratizing image generation. We look forward to the open ecosystem that will emerge around this and further models to truly explore the boundaries of latent space.
The model was trained on our 4,000 A100 Ezra-1 AI ultracluster over the last month as the first of a series of models exploring this and other approaches.
We have been testing the model at scale with over 10,000 beta testers who are creating 1.7 million images a day.
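
For a sense of scale, here is a quick back-of-the-envelope calculation on the beta numbers quoted above. It is only a rough sketch: it assumes the load is spread evenly across testers and across the day, which the announcement does not claim, and the function names are made up for illustration.

```python
# Rough arithmetic on the quoted beta figures.
# Assumption: images are spread evenly across testers and across 24 hours.

SECONDS_PER_DAY = 24 * 60 * 60


def images_per_tester_per_day(images_per_day: float, testers: int) -> float:
    """Average number of images each tester creates per day."""
    return images_per_day / testers


def images_per_second(images_per_day: float) -> float:
    """Average generation rate across the whole beta."""
    return images_per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    daily_images = 1_700_000  # "1.7 million images a day"
    testers = 10_000          # "over 10,000 beta testers" (lower bound)

    print(images_per_tester_per_day(daily_images, testers))
    print(round(images_per_second(daily_images), 1))
```

Under those assumptions that works out to about 170 images per tester per day, or just under 20 images per second overall. Since the tester count is given as "over 10,000", the per-tester figure is an upper bound.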