UTF-16 vs UTF-32
allen at wirfs-brock.com
Mon May 16 18:02:19 PDT 2011
On May 16, 2011, at 5:42 PM, Shawn Steele wrote:
> It's clear why we want to support the full Unicode range, but it's less clear to me why UTF-32 would be desirable internally. (Sure, it'd be nice for conversion types).
> What UTF-32 has that UTF-16 doesn't is the ability to walk a string without accidentally chopping up a surrogate pair. However, in practice, stepping over surrogates is pretty much the least of the problems with walking a string. Combining characters and the like cause numerous typographical shapes/glyphs to be represented by more than one Unicode code point, even in UTF-32. We don't see that in Latin so much, especially in NFC, but in some scripts most characters require multiple code points.
> In other words, if I'm trying to find "safe" places to break a string, append text, or many other operations, then UTF-16 is no more complicated than UTF-32, even when considering surrogates.
> UTF-32 would, though, cause a huge amount of ambiguity about what happens to all of those UTF-16 sequences that currently sort of work, even though they really shouldn't because ES is nominally UCS-2.
One reason is that none of the built-in string methods understand surrogate pairs. If you want to do any string processing that recognizes such pairs, you have to either handle them as multi-character sequences or do your own character-by-character processing.
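To make that concrete, here is a rough sketch of what "do your own character-by-character processing" can look like. The helper name toCodePoints is made up for illustration and is not a built-in; it only relies on length and charCodeAt, both of which count 16-bit code units rather than characters.

```javascript
// U+1D11E MUSICAL SYMBOL G CLEF, stored as the surrogate pair D834 DD1E.
const clef = "\uD834\uDD1E";
const sample = "a" + clef + "b";

// Hypothetical helper (not a built-in): split a string into code points
// by hand, pairing each high surrogate with the low surrogate after it.
function toCodePoints(str) {
  const out = [];
  for (let i = 0; i < str.length; i++) {
    const hi = str.charCodeAt(i);
    if (hi >= 0xD800 && hi <= 0xDBFF && i + 1 < str.length) {
      const lo = str.charCodeAt(i + 1);
      if (lo >= 0xDC00 && lo <= 0xDFFF) {
        // Combine the pair into one supplementary code point.
        out.push((hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000);
        i++; // skip the low surrogate we just consumed
        continue;
      }
    }
    // BMP code unit or unpaired surrogate: passed through unchanged.
    out.push(hi);
  }
  return out;
}

// The built-ins see four "characters" and will happily cut the pair in half.
const builtinLength = sample.length;          // counts code units
const choppedPrefix = sample.substring(0, 2); // ends in a lone high surrogate
const codePoints = toCodePoints(sample);      // three code points
```

With this, sample.length reports 4 and substring(0, 2) ends in an unpaired D834, while toCodePoints(sample) yields three values, the middle one being 0x1D11E. Unpaired surrogates are deliberately passed through rather than rejected, since existing programs may depend on them.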
More information about the es-discuss