UTF-16 vs UTF-32

Shawn Steele Shawn.Steele at microsoft.com
Mon May 16 17:42:44 PDT 2011


It's clear why we want to support the full Unicode range, but it's less clear to me why UTF-32 would be desirable internally.  (Sure, it'd be nice for conversion types.)

What UTF-32 has that UTF-16 doesn't is the ability to walk a string without accidentally chopping up a surrogate pair.  However, in practice, stepping over surrogates is pretty much the least of the problems with walking a string.  Combining characters and the like cause numerous typographical shapes/glyphs to be represented by more than one Unicode code point, even in UTF-32.  We don't see that in Latin so much, especially in NFC, but in some scripts most characters require multiple code points.
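
To make that concrete, here is a rough sketch that counts the same strings both ways.  The helper names (countCodeUnits, countCodePoints) and the sample strings are made up for illustration, and it assumes an engine with the string iterator and String.prototype.normalize, which are later additions to the language than what this thread is discussing.

```javascript
// Length in UTF-16 code units (what String.prototype.length reports).
function countCodeUnits(s) {
  return s.length;
}

// Length in Unicode code points (what a UTF-32 string would report).
// The string iterator used by Array.from steps over a surrogate pair as one item.
function countCodePoints(s) {
  return Array.from(s).length;
}

const gClef = "\uD834\uDD1E";  // U+1D11E MUSICAL SYMBOL G CLEF: one code point
const eAcute = "e\u0301";      // "e" + COMBINING ACUTE ACCENT: one visible shape
const qDots = "q\u0323\u0307"; // "q" + dot below + dot above: no precomposed form

console.log(countCodeUnits(gClef), countCodePoints(gClef));   // 2 1
console.log(countCodeUnits(eAcute), countCodePoints(eAcute)); // 2 2
console.log(countCodePoints(eAcute.normalize("NFC")));        // 1
console.log(countCodePoints(qDots.normalize("NFC")));         // 3
```

Counting by code point fixes the first case only.  NFC happens to rescue the accented "e", but the last string is still three code points for one shape on screen, in UTF-32 just as in UTF-16.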

In other words, if I'm trying to find "safe" places to break a string, append text, or many other operations, then UTF-16 is no more complicated than UTF-32, even when considering surrogates.
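
As a sketch of what a "safe break" check might look like over UTF-16: isSafeBreak below is a hypothetical helper, not anything in ES, and it is a deliberate simplification that only looks at surrogate pairs and combining marks (real grapheme cluster segmentation has more rules than this).  It assumes an engine with codePointAt, String.fromCodePoint and Unicode property escapes in regular expressions.

```javascript
function isHighSurrogate(unit) {
  return unit >= 0xD800 && unit <= 0xDBFF;
}

function isLowSurrogate(unit) {
  return unit >= 0xDC00 && unit <= 0xDFFF;
}

// Returns true if cutting `s` just before code unit index `i` looks safe.
// Hypothetical helper: checks surrogate pairs and combining marks only.
function isSafeBreak(s, i) {
  if (i <= 0 || i >= s.length) {
    return true; // the ends of the string are always fine
  }
  // The UTF-16-specific part: do not cut between the halves of a pair.
  if (isHighSurrogate(s.charCodeAt(i - 1)) && isLowSurrogate(s.charCodeAt(i))) {
    return false;
  }
  // Needed in any encoding: do not cut in front of a combining mark.
  const next = String.fromCodePoint(s.codePointAt(i));
  return !/^\p{M}$/u.test(next);
}

console.log(isSafeBreak("a\uD834\uDD1Eb", 2)); // false, inside the surrogate pair
console.log(isSafeBreak("e\u0301x", 1));       // false, before the combining accent
console.log(isSafeBreak("e\u0301x", 2));       // true
```

The surrogate test is one comparison.  The combining mark test is the part that needs Unicode data, and a UTF-32 string would need it too.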

UTF-32 would, though, cause a huge amount of ambiguity about what happens to all of those UTF-16 sequences that currently sort of work, even though they really shouldn't, because ES is nominally UCS-2.
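
A small illustration of the sort of thing that "sort of works" today.  The variable names are arbitrary, and the codePointAt call and "\u{...}" escape assume a newer engine than the ones current as of this thread; the snippet only shows present UTF-16 behaviour and does not claim what a UTF-32 design would do with it.

```javascript
// Two strings that each hold half of a surrogate pair.
const high = "\uD834";
const low = "\uDD1E";

// Concatenating the halves yields a string that renders as one character.
const joined = high + low;
console.log(joined.length);                     // 2
console.log(joined.codePointAt(0) === 0x1D11E); // true

// Slicing by code unit hands back a lone surrogate without complaint.
const clef = "\u{1D11E}";
console.log(clef.slice(0, 1) === high);         // true
console.log(clef.charCodeAt(1) === 0xDD1E);     // true
```

With UTF-32 internally, someone would have to decide what high, low, high + low and clef.slice(0, 1) each mean.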

-Shawn


More information about the es-discuss mailing list