It was able to produce very good images based on its training data, and it is such a simple network.
My question is: why is all the extra complexity in today's transformer-based text-to-image models needed? Wouldn't scaling this simple approach out work equally well?
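For context, here is a minimal sketch of the kind of setup I mean: a tiny fully connected decoder with one learnable latent per training image, trained to reconstruct a small training set. This is my own simplified assumption of the rough idea, not the actual code (that is in the gist linked below); the sizes and stand-in data are hypothetical.

```python
# Minimal sketch (assumed setup, not the gist's exact code): a small MLP
# decoder that maps a per-image latent vector to pixels and is trained to
# reconstruct a handful of training images. A network this simple can
# memorize a small training set almost perfectly.
import torch
import torch.nn as nn

torch.manual_seed(0)

NUM_IMAGES, LATENT_DIM, IMG_SIZE = 8, 32, 32  # hypothetical sizes

# Stand-in "training set": random images. Swap in real data if desired.
images = torch.rand(NUM_IMAGES, 3 * IMG_SIZE * IMG_SIZE)

# One learnable latent code per training image (an embedding lookup).
latents = nn.Embedding(NUM_IMAGES, LATENT_DIM)

decoder = nn.Sequential(
    nn.Linear(LATENT_DIM, 256),
    nn.ReLU(),
    nn.Linear(256, 3 * IMG_SIZE * IMG_SIZE),
    nn.Sigmoid(),  # pixel values in [0, 1]
)

optimizer = torch.optim.Adam(
    list(latents.parameters()) + list(decoder.parameters()), lr=1e-2
)
loss_fn = nn.MSELoss()

idx = torch.arange(NUM_IMAGES)
for step in range(2000):
    optimizer.zero_grad()
    recon = decoder(latents(idx))
    loss = loss_fn(recon, images)
    loss.backward()
    optimizer.step()

# Loss drops toward zero: the network has memorized the training images.
print(f"final reconstruction loss: {loss.item():.6f}")
```

The catch, as I understand it, is that reproducing training images is pure memorization; the question is whether simply scaling such a network up would also get you generalization to novel prompts, or whether that's what the extra machinery in modern models buys.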
Code: https://gist.github.com/freakynit/1118403ad80448ee0313ba6c879f8688
Generated image: https://imgur.com/LCHDBhI