As I understand it, there are two GPT-4 models: one that is text only and one that is multimodal, with image input. That's what I mean by image-GPT-4.
GPT-4 has demonstrated that it is significantly better than previous models at drawing with TikZ or canvas. It can draw rudimentary pictures of things like animals or food, and it can modify these pictures on request. It was not clear to me from the "Sparks of AGI" paper whether they used image-GPT-4 or text-only GPT-4 when they tested this.
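To make concrete what I mean by "rudimentary pictures": a prompt like "draw a cat in TikZ" typically yields a handful of circles, lines, and arcs. The snippet below is my own hypothetical illustration of roughly what such output looks like, not an example taken from the paper.

```latex
% Hypothetical sketch of the kind of rudimentary TikZ drawing meant here
% (a cat face from basic shapes); my own illustration, not model output.
\documentclass[tikz]{standalone}
\begin{document}
\begin{tikzpicture}
  % head
  \draw (0,0) circle (1);
  % ears: two triangles on top of the head
  \draw (-0.9,0.5) -- (-0.6,1.4) -- (-0.2,0.95);
  \draw (0.9,0.5)  -- (0.6,1.4)  -- (0.2,0.95);
  % eyes
  \fill (-0.35,0.2) circle (0.08);
  \fill (0.35,0.2)  circle (0.08);
  % nose and mouth
  \fill (0,-0.1) circle (0.06);
  \draw (0,-0.1) -- (0,-0.35);
  \draw (0,-0.35) arc (180:360:0.15);
  \draw (0,-0.35) arc (0:-180:0.15);
\end{tikzpicture}
\end{document}
```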
My hypothesis is that image-GPT-4 is much, much better at this task, because learning to understand images requires learning some understanding of 2D geometry. If this knowledge carries over to a text-only domain in the form of a better understanding of drawing, then I would see this as a great demonstration of the ability of SGD to really "integrate knowledge": to find weights that somehow represent the underlying world model behind the data, which ought to be the most efficient way to reproduce the data.
Furthermore, I hypothesize that image-GPT-4's TikZ/canvas drawing capabilities improve considerably further if one lets it look at its own output and iterate.
I hope you find these questions interesting and, if no one here can answer them, perhaps pass them on to someone else who might know.