Was Unicode designed well? If it were designed from scratch today, with no legacy considerations, would the ideal design look like the current design? What would you change?
Being extremely ignorant of the problem space, the first thing I would consider for the chopping block would be combining characters. Just make every character a precomposed character (one code point), so there's no need for normalization. I'm curious if such a scheme could fit every code point into 32 bits, though. Would this be feasible?
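Just as a back-of-the-envelope illustration (the counts below are made up for the sake of the arithmetic, not real Unicode figures), a precompose-everything scheme blows past 32 bits very quickly once marks can stack:

    # Illustrative only: assumed counts, not actual Unicode statistics.
    base_letters = 1_500      # assumed number of base letters worth composing
    combining_marks = 300     # assumed number of combining marks
    max_stacked_marks = 3     # assumed cap on marks stacked on one letter

    # Count every base letter with 0..max_stacked_marks marks attached.
    total = sum(base_letters * combining_marks ** k
                for k in range(max_stacked_marks + 1))

    print(f"{total:,} precomposed characters")    # ~40.6 billion
    print(f"fits in 32 bits? {total < 2 ** 32}")  # False

And real scripts don't even have a hard cap on how many marks can stack, which is where the next reply picks up.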
It’s like giving me a list of numbers and asking me to “combine” them. What does that mean? Do I sum them up, or concatenate them, or something else entirely? A lot of string reversal solutions are “incorrect” because there isn’t even a correct question in the first place.
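As a quick illustration of how the question itself goes wrong, here is a plain-Python sketch of reversing by code point:

    # A naive code point reversal moves combining marks onto the wrong base letter.
    s = 'ae\u0301b'       # 'a', 'é' (as 'e' + combining acute), 'b' -- displays as "aéb"
    naive = s[::-1]       # reverses code points, not grapheme clusters
    print(naive)          # the acute accent now sits on the 'b'
    print(ascii(naive))   # 'b\u0301ea'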
Even with an infinitely large code space, doing away with combining marks and encoding everything as precomposed would be impossible because you cannot have a definitive list of every single combination of letters and diacritics that may mean something to someone. If Unicode had been the first digital character set ever created, it would not contain a single precomposed code point because they are utterly impractical. As such, normalisation – or at least the canonical reordering part of it – is always going to be a necessity.
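A minimal sketch of why the canonical-reordering part matters, using Python's standard unicodedata module (the q-with-two-dots example is chosen precisely because it has no precomposed form):

    import unicodedata

    # Same marks typed in a different order; only canonical reordering makes them compare equal.
    a = 'q\u0307\u0323'   # 'q' + COMBINING DOT ABOVE + COMBINING DOT BELOW
    b = 'q\u0323\u0307'   # 'q' + COMBINING DOT BELOW + COMBINING DOT ABOVE

    print(a == b)                                                              # False
    print(unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b))  # True
    print([unicodedata.combining(c) for c in a])  # [0, 230, 220] -> marks sorted by combining class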
For Unicode, a design-from-scratch version would remove duplicate legacy code points. Why have “é” both as a single code point and as “e” plus a combining character?
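A minimal Python sketch of that duplication, using the standard unicodedata module:

    import unicodedata

    precomposed = '\u00e9'   # 'é' as the single legacy code point U+00E9
    decomposed = 'e\u0301'   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

    print(precomposed == decomposed)                                # False: different code point sequences
    print(unicodedata.normalize('NFC', decomposed) == precomposed)  # True: NFC composes to the legacy form
    print(unicodedata.normalize('NFD', precomposed) == decomposed)  # True: NFD decomposes it again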
It also wouldn’t have any of the deprecated characters (https://en.wikipedia.org/wiki/Unicode_character_property#Dep...)
I also would remove the few special flag code points (https://home.unicode.org/the-past-and-future-of-flag-emoji/)
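For context, flags are already built from pairs of the 26 regional indicator code points rather than one code point per country; a quick Python illustration:

    import unicodedata

    # A flag emoji is two regional indicator symbols that renderers combine.
    us_flag = '\U0001F1FA\U0001F1F8'

    print(len(us_flag))  # 2 code points, rendered as one flag
    print([unicodedata.name(c) for c in us_flag])
    # ['REGIONAL INDICATOR SYMBOL LETTER U', 'REGIONAL INDICATOR SYMBOL LETTER S']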
If “design from scratch” also means “drop the goal of encompassing old character encodings”, more code points probably could go. Why are DOS box characters in Unicode, while Atari/PET, etc, ones aren’t, for example?
Finally, I would look into making it easier to retrieve character class from a code point (the ‘these code points are digits, these are combining marks, etc’ tables are a bit of a wart, and getting rid of them could be useful in small embedded devices).
I doubt a solution exists there that is future-proof against extensions of Unicode and doesn't blow up memory use, though, and I'm not sure any embedded device too small to host those tables could actually use that information anyway.
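For a sense of what those tables provide, here is a minimal sketch using Python's standard unicodedata module, which ships exactly the property data being discussed:

    import unicodedata

    # Character-class lookups come straight from the Unicode Character Database tables.
    print(unicodedata.category('7'))        # 'Nd' -> decimal digit
    print(unicodedata.category('A'))        # 'Lu' -> uppercase letter
    print(unicodedata.category('\u0301'))   # 'Mn' -> nonspacing (combining) mark
    print(unicodedata.combining('\u0301'))  # 230  -> canonical combining class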
Where I would make the change isn't Unicode itself but the APIs. All of the problems you're talking about basically come down to legacy language design where people think they're working with grapheme clusters but are really working with code points. Making that distinction explicit in the tools would help, similar to how Python 3 forced you to think about whether you wanted encoded binary data or a decoded string. But there's so much history around this that it's hard to do without a lot of griping from people who don't want to update decades of habit.
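A minimal sketch of that code point vs. grapheme cluster mismatch (the grapheme split assumes the third-party regex package, whose \X pattern matches extended grapheme clusters):

    import unicodedata
    import regex  # third-party package; \X matches extended grapheme clusters

    s = 'e\u0301'  # 'e' + COMBINING ACUTE ACCENT, displayed as one character

    print(len(s))                                # 2 code points
    print(len(regex.findall(r'\X', s)))          # 1 grapheme cluster
    print(len(unicodedata.normalize('NFC', s)))  # 1 code point after composing, though NFC
                                                 # can't collapse every cluster (e.g. emoji sequences)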
That has nothing to do with Unicode.
It's to do with whether it's little- or big-endian.
Unicode is as fixed and secure as ASCII was in its era.