UTF-16 vs UTF-32
jat at google.com
Mon May 16 18:07:05 PDT 2011
On Mon, May 16, 2011 at 8:42 PM, Shawn Steele <Shawn.Steele at microsoft.com> wrote:
> It's clear why we want to support the full Unicode range, but it's less
> clear to me why UTF-32 would be desirable internally. (Sure, it'd be nice
> for conversion types).
> What UTF-32 has that UTF-16 doesn't is the ability to walk a string without
> accidentally chopping up a surrogate pair. However, in practice, stepping
> over surrogates is pretty much the least of the problems with walking a
> string. Combining characters and the like cause numerous typographical
> shapes/glyphs to be represented by more than one Unicode codepoint, even in
> UTF-32. We don't see that in Latin so much, especially in NFC, but in some
> scripts most characters require multiple code points.
> In other words, if I'm trying to find "safe" places to break a string,
> append text, or many other operations, then UTF-16 is no more complicated
> than UTF-32, even when considering surrogates.
> UTF-32 would cause a huge amount of ambiguity though about what happens to
> all of those UTF-16 sequences that currently sort-of work even though they
> shouldn't really because ES is nominally UCS-2.
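To make the quoted point about combining characters concrete, here is a minimal sketch in Python (chosen only for illustration; the helper names utf16_units and utf32_units are hypothetical and not from any library). It counts code units in each encoding and shows that a single visible character can still take several code points, so UTF-32 alone does not give "one unit per character".

```python
import unicodedata


def utf16_units(s: str) -> int:
    """Number of 16-bit code units needed to encode s as UTF-16."""
    return len(s.encode("utf-16-le")) // 2


def utf32_units(s: str) -> int:
    """Number of 32-bit code units (i.e. code points) in s."""
    return len(s.encode("utf-32-le")) // 4


# "e with acute" in composed form (NFC) and decomposed form (NFD).
composed = "\u00e9"
decomposed = unicodedata.normalize("NFD", composed)  # "e" + U+0301

# Devanagari NA + vowel sign I: one visible syllable, two code points,
# and no precomposed form exists, so NFC cannot merge them.
syllable = unicodedata.normalize("NFC", "\u0928\u093f")

# A non-BMP character: one code point, but a surrogate pair in UTF-16.
non_bmp = "\U00020000"
```

Under these assumptions, utf32_units(decomposed) and utf32_units(syllable) are both 2 even though each renders as one character, while utf16_units(non_bmp) is 2 and utf32_units(non_bmp) is 1. Only the last case is a difference between the two encodings.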
Personally, I think UTF-16 is more prone to error than either UTF-8 or
UTF-32 -- in UTF-32 there is a one-to-one correspondence between code units
and code points, while in UTF-8 it is obvious you have to deal with
multi-byte encodings. With UTF-16, most developers only run into BMP
characters and just assume that there is a one-to-one correspondence
between chars and characters. Then, when their code runs into non-BMP
characters, they hit problems such as a field whose size is restricted to a
number of chars no longer being long enough. The problems arise
infrequently, which means many developers assume the problem doesn't exist.
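
A rough sketch of the field-size problem, in Python purely for illustration. The helper names utf16_units and truncate_utf16 are hypothetical, and the "field" is assumed to be limited to a fixed number of 16-bit chars. The example string has three characters but needs four chars, and cutting it at an arbitrary char boundary can split a surrogate pair.

```python
def utf16_units(s: str) -> int:
    """Number of 16-bit code units needed to encode s as UTF-16."""
    return len(s.encode("utf-16-le")) // 2


def truncate_utf16(s: str, max_units: int) -> str:
    """Longest prefix of s that fits in max_units UTF-16 code units
    without splitting a surrogate pair."""
    kept = []
    used = 0
    for ch in s:
        # Code points above U+FFFF need a surrogate pair (two units).
        need = 2 if ord(ch) > 0xFFFF else 1
        if used + need > max_units:
            break
        kept.append(ch)
        used += need
    return "".join(kept)


# "a", then U+20000 (a CJK Extension B ideograph, non-BMP), then "b".
name = "a\U00020000b"


def naive_cut_is_valid(s: str, max_units: int) -> bool:
    """Cut the UTF-16 encoding at a raw unit boundary and report
    whether the result is still well-formed UTF-16."""
    raw = s.encode("utf-16-le")[: 2 * max_units]
    try:
        raw.decode("utf-16-le")
        return True
    except UnicodeDecodeError:
        return False
```

Here len(name) is 3 but utf16_units(name) is 4, so a field sized for three chars is too short. naive_cut_is_valid(name, 2) is False because the cut lands between the two surrogates, whereas truncate_utf16(name, 2) backs off and keeps only "a".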
John A. Tamplin
Software Engineer (GWT), Google