It was able to produce very good images based on its training data, and it is such a simple network.
My question is: why is all the extra complexity in today's transformer-based text-to-image models needed? Wouldn't scaling this simple approach out work equally well?
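For context, here is a minimal sketch of the kind of setup I mean: a tiny fully connected decoder with one learnable latent per training image, trained to reconstruct a small training set. This is my own simplified assumption of the rough idea, not the actual code (that is in the gist linked below); the sizes and stand-in data are hypothetical.

```python
# Minimal sketch (assumed setup, not the gist's exact code): a small MLP
# decoder that maps a per-image latent vector to pixels and is trained to
# reconstruct a handful of training images. A network this simple can
# memorize a small training set almost perfectly.
import torch
import torch.nn as nn

torch.manual_seed(0)

NUM_IMAGES, LATENT_DIM, IMG_SIZE = 8, 32, 32  # hypothetical sizes

# Stand-in "training set": random images. Swap in real data if desired.
images = torch.rand(NUM_IMAGES, 3 * IMG_SIZE * IMG_SIZE)

# One learnable latent code per training image (an embedding lookup).
latents = nn.Embedding(NUM_IMAGES, LATENT_DIM)

decoder = nn.Sequential(
    nn.Linear(LATENT_DIM, 256),
    nn.ReLU(),
    nn.Linear(256, 3 * IMG_SIZE * IMG_SIZE),
    nn.Sigmoid(),  # pixel values in [0, 1]
)

optimizer = torch.optim.Adam(
    list(latents.parameters()) + list(decoder.parameters()), lr=1e-2
)
loss_fn = nn.MSELoss()

idx = torch.arange(NUM_IMAGES)
for step in range(2000):
    optimizer.zero_grad()
    recon = decoder(latents(idx))
    loss = loss_fn(recon, images)
    loss.backward()
    optimizer.step()

# Loss drops toward zero: the network has memorized the training images.
print(f"final reconstruction loss: {loss.item():.6f}")
```

The catch, as I understand it, is that reproducing training images is pure memorization; the question is whether simply scaling such a network up would also get you generalization to novel prompts, or whether that's what the extra machinery in modern models buys.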
Code: https://gist.github.com/freakynit/1118403ad80448ee0313ba6c879f8688
Generated image: https://imgur.com/LCHDBhI