HACKER Q&A
📣 indil

Is Unicode Designed Badly?


The more I learn about Unicode, the more complicated it gets. It was rather shocking to learn that the presence of combining characters makes most "reverse a string" programming solutions incorrect, and that strings need to be normalized to compare them. The whole thing seems so much more complicated than it should be, but perhaps that's just the nature of the problem?
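
For example, here's a rough Python sketch of both problems (using only the standard library's unicodedata module):

    import unicodedata

    s = "cafe\u0301"           # "café" written as e + combining acute accent
    print(s[::-1])             # naive reversal detaches the accent from its base letter

    precomposed = "caf\u00e9"  # "café" written with the precomposed "é"
    print(s == precomposed)    # False, even though both render identically
    print(unicodedata.normalize("NFC", s) ==
          unicodedata.normalize("NFC", precomposed))  # True after normalization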

Was Unicode designed well? If it were designed from scratch today, with no legacy considerations, would the ideal design look like the current design? What would you change?

Being extremely ignorant of the problem space, the first thing I would consider for the chopping block would be combining characters. Just make every character a precomposed character (one code point), so there's no need for normalization. I'm curious if such a scheme could fit every code point into 32 bits, though. Would this be feasible?


  👤 Charlotte_Buff Accepted Answer ✓
Reversing a string is a useless operation in the real world. Its only application is padding out interview questions. “How to reverse a string” is also an incredibly vague question. What do you actually want me to do? Reverse code points, or code units, or grapheme clusters, or make it look like it’s written backwards? It doesn’t even make sense as a concept in most of the world’s writing systems.

It’s like giving me a list of numbers and asking me to “combine” them. What does that mean? Do I sum them up, or concatenate them, or something else entirely? A lot of string reversal solutions are “incorrect” because there isn’t even a correct question in the first place.
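
To make that concrete, here is a rough Python sketch of three different answers to "reverse this string". (The grapheme handling below is a simplification that just keeps combining marks attached to their base character, not full UAX #29 segmentation.)

    import unicodedata

    s = "cafe\u0301"  # "café" with a combining accent

    reversed_code_units = s.encode("utf-8")[::-1]  # reversed bytes: no longer even valid UTF-8
    reversed_code_points = s[::-1]                 # the accent ends up detached from the "e"

    # Approximate grapheme clusters: a base character plus any combining marks that follow it.
    clusters, current = [], ""
    for ch in s:
        if unicodedata.combining(ch) and current:
            current += ch
        else:
            if current:
                clusters.append(current)
            current = ch
    clusters.append(current)
    reversed_graphemes = "".join(reversed(clusters))  # "éfac": the accent stays on the "e"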

Even with an infinitely large code space, doing away with combining marks and encoding everything as precomposed would be impossible because you cannot have a definitive list of every single combination of letters and diacritics that may mean something to someone. If Unicode had been the first digital character set ever created, it would not contain a single precomposed code point because they are utterly impractical. As such, normalisation – or at least the canonical reordering part of it – is always going to be a necessity.
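
A quick Python illustration of why the canonical reordering step matters: the same two marks in a different order compare unequal until they are put into canonical order.

    import unicodedata

    a = "q\u0307\u0323"  # q + combining dot above + combining dot below
    b = "q\u0323\u0307"  # q + combining dot below + combining dot above: renders the same

    print(a == b)                           # False
    print(unicodedata.normalize("NFD", a) ==
          unicodedata.normalize("NFD", b))  # True: marks are sorted by combining class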


👤 Someone
Combining characters have their issues (https://en.wikipedia.org/wiki/Zalgo_text), but making string reversal trickier isn’t one of them. “Reversing” is an extremely atypical thing to do with text. I think only programming exercises and palindrome searchers do it. Why would your data structure make that easy to do?

For Unicode, a “design from scratch” design would remove duplicate legacy code points. Why have “é” both as a single code point and as “e” plus a combining character?

It also wouldn’t have any of the deprecated characters (https://en.wikipedia.org/wiki/Unicode_character_property#Dep...)

I also would remove the few special flag code points (https://home.unicode.org/the-past-and-future-of-flag-emoji/)

If “design from scratch” also means “drop the goal of encompassing old character encodings”, more code points probably could go. Why are DOS box characters in Unicode, while Atari/PET, etc, ones aren’t, for example?

Finally, I would look into making it easier to retrieve character class from a code point (the ‘these code points are digits, these are combining marks, etc’ tables are a bit of a wart, and getting rid of them could be useful in small embedded devices).

I doubt a solution exists there that is future-proof against extension of Unicode and doesn't blow up memory use, though, and I'm not sure any embedded device too small to host those tables could actually use that info anyway.
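
For reference, this is the kind of per-code-point lookup I mean; today it has to go through the property tables, e.g. via Python's unicodedata module:

    import unicodedata

    print(unicodedata.category("7"))        # 'Nd': decimal digit
    print(unicodedata.category("\u00e9"))   # 'Ll': lowercase letter
    print(unicodedata.category("\u0301"))   # 'Mn': non-spacing (combining) mark
    print(unicodedata.combining("\u0301"))  # 230: canonical combining class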


👤 acdha
I think it would look a lot like UTF-8 with some of the legacy parts removed (e.g. drop the precomposed characters that duplicate combining-character combinations). One thing to remember is that there are a LOT of edge cases in the world, and you're looking at a huge number of permutations when characters combine multiple accents, or things like emoji that use skin tone modifiers precisely to avoid encoding every permutation. I'm not sure whether that would fit in a 32-bit code point, but I would also consider what it would do to file and network sizes — there are real costs to making almost every document substantially larger, and while we have more headroom than we used to, I'd still be surprised if that didn't result in noticeable performance regressions.
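
As a rough illustration of the size cost (Python, with a made-up mixed-script sample string):

    text = "Hello, world! Привет, мир! こんにちは 👋"

    # UTF-8 uses 1 byte per ASCII character and up to 4 for everything else;
    # UTF-32 uses a fixed 4 bytes per code point (plus a BOM with Python's codec).
    for enc in ("utf-8", "utf-16", "utf-32"):
        print(enc, len(text.encode(enc)), "bytes")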

Where I would make the change isn't Unicode itself but the APIs. All of the problems you're talking about basically come down to legacy language design where people think they're working with grapheme clusters but are really working with code points. Making that more explicit in the tooling would be good, similar to how Python 3 forced you to think about whether you wanted encoded binary data or a decoded string, but there's so much history here that it's hard to do without getting a lot of griping from people who don't want to update decades of habit.
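
A concrete example of the code-point/grapheme mismatch in a typical string API (Python shown, but most languages behave similarly):

    flag = "\U0001f1e8\U0001f1e6"    # Canadian flag: two regional-indicator code points
    thumbs = "\U0001f44d\U0001f3fd"  # thumbs up + skin-tone modifier
    accented = "e\u0301"             # "é" as e + combining accent

    # 2 2 2: len() counts code points, not what the user perceives as characters
    print(len(flag), len(thumbs), len(accented))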


👤 lysergia
Every computer science problem eventually ends with The Unicode Problem and its various agendas. Personally I avoid Unicode in my editor and use ASCII at all times. If I have to deal with Unicode, I escape it into the relevant ASCII equivalent and normalize things like emoji to ASCII. This avoids various headaches down the line, since Unicode support isn't consistent across devices, and keeping everything in ASCII is a saner way to approach that.
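
A rough sketch of the kind of escaping I mean (Python; the details vary by project):

    import unicodedata

    def to_ascii(s):
        # Decompose accented letters, drop the combining marks, then
        # backslash-escape anything (e.g. emoji) that still isn't ASCII.
        decomposed = unicodedata.normalize("NFKD", s)
        stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
        return stripped.encode("ascii", "backslashreplace").decode("ascii")

    print(to_ascii("café \U0001f600"))  # cafe \U0001f600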

👤 mardiyah
"..characters makes most "reverse a string"..."

has nothing to do with Unicode.

It's to do with whether the encoding is little-endian or big-endian.

Unicode is as fixed and secure as ASCII was in its era.