Historically, the ancestor of Linux was developed in an environment where terminal standardization usually stripped out the marketable graphics features. Sound was limited to a single beep, triggered with ^G (the ASCII BEL character).
If standardizing a real-time, character-encoded graphics protocol seems overdue, look at the Tektronix 4104 and its descendants.
That said, typesetting was important early in UNIX's history. Even though the interface was command-line, UNIX implemented roff, and it might not have succeeded without the ability to format documents through roff. Word processing was the first application for UNIX outside of the Labs.
You have a chicken-and-egg problem: you need to get people to want both a different shell and a different terminal emulator at the same time.
Textual mode fits well with all the tools. You can't grep an image sensibly.
In practical terms, a graphical shell can only give you what its programmer envisaged as possible and then put on a menu for you.
That can't cover every single imaginable eventuality. If you don't believe that, try making a menu on paper that covers everything possible that a textual shell can do.
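
To put a rough number on that, here is a minimal sketch that counts how many distinct pipelines a textual shell allows. It rests on simplifying assumptions that are mine, not the original poster's: only n commands exist, arguments and options are ignored, and commands are chained with pipes up to a fixed length. The function name count_pipelines is hypothetical.

```python
def count_pipelines(num_commands: int, max_length: int) -> int:
    """Count ordered command sequences of length 1..max_length.

    Assumption: every command can be piped into every other command,
    and arguments, options and redirections are ignored entirely.
    """
    return sum(num_commands ** k for k in range(1, max_length + 1))


if __name__ == "__main__":
    # Even a tiny toolbox grows quickly once commands can be combined.
    for n in (10, 50, 100):
        print(f"{n} commands, pipelines up to 3 long: {count_pipelines(n, 3)}")
```

With 100 commands and pipelines of at most three stages, that is already over a million menu entries, before any arguments are considered.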