UTF-16 vs UTF-32

Allen Wirfs-Brock allen at wirfs-brock.com
Mon May 16 18:02:19 PDT 2011

On May 16, 2011, at 5:42 PM, Shawn Steele wrote:

> It's clear why we want to support the full Unicode range, but it's less clear to me why UTF-32 would be desirable internally.  (Sure, it'd be nice for conversion types).
> What UTF-32 has that UTF-16 doesn't is the ability to walk a string without accidentally chopping up a surrogate pair.  However, in practice, stepping over surrogates is pretty much the least of the problems with walking a string.  Combining characters and the like cause numerous typographical shapes/glyphs to be represented by more than one Unicode codepoint, even in UTF-32.  We don't see that in Latin so much, especially in NFC, but in some scripts most characters require multiple code points.  
> In other words, if I'm trying to find "safe" places to break a string, append text, or many other operations, then UTF-16 is no more complicated than UTF-32, even when considering surrogates.
> UTF-32 would cause a huge amount of ambiguity though about what happens to all of those UTF-16 sequences that currently sort-of work even though they shouldn't really because ES is nominally UCS-2.
> -Shawn

One reason is that none of the built-in string methods understand surrogate pairs. If you want to do any string processing that recognizes such pairs, you have to either handle them as multi-character sequences or do your own character-by-character processing.
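
To make that concrete, here is a rough sketch of the kind of hand-written processing I mean. It assumes strings are sequences of 16-bit code units, and it uses only length and charCodeAt. The toCodePoints helper is a hypothetical name for illustration, not a built-in or a proposed API.

```javascript
// Hypothetical helper (not a built-in): walk a string by code unit and
// combine each well-formed surrogate pair into a single code point.
function toCodePoints(str) {
  var result = [];
  for (var i = 0; i < str.length; i++) {
    var unit = str.charCodeAt(i);
    // A high surrogate followed by a low surrogate encodes one
    // supplementary character.
    if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < str.length) {
      var next = str.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        result.push((unit - 0xD800) * 0x400 + (next - 0xDC00) + 0x10000);
        i++; // skip the low surrogate that was just consumed
        continue;
      }
    }
    // BMP character or unpaired surrogate: keep the code unit as-is.
    result.push(unit);
  }
  return result;
}

// "a", U+1D11E (MUSICAL SYMBOL G CLEF), "b"
var s = "a\uD834\uDD1Eb";

var unitCount = s.length;            // 4: the built-in counts code units
var secondUnit = s.charCodeAt(1);    // 0xD834: only half of the pair
var points = toCodePoints(s);        // 3 entries; points[1] is 0x1D11E
```

The built-ins report four "characters" and hand back half a pair at index 1, while the manual loop sees three code points. Unpaired surrogates are passed through unchanged here, which is only one of several possible policies.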

