UTF-16 vs UTF-32
John Tamplin
jat at google.com
Mon May 16 18:07:05 PDT 2011
On Mon, May 16, 2011 at 8:42 PM, Shawn Steele <Shawn.Steele at microsoft.com> wrote:
> It's clear why we want to support the full Unicode range, but it's less
> clear to me why UTF-32 would be desirable internally. (Sure, it'd be nice
> for conversion types).
>
> What UTF-32 has that UTF-16 doesn't is the ability to walk a string without
> accidentally chopping up a surrogate pair. However, in practice, stepping
> over surrogates is pretty much the least of the problems with walking a
> string. Combining characters and the like cause numerous typographical
> shapes/glyphs to be represented by more than one Unicode codepoint, even in
> UTF-32. We don't see that in Latin so much, especially in NFC, but in some
> scripts most characters require multiple code points.
>
> In other words, if I'm trying to find "safe" places to break a string,
> append text, or many other operations, then UTF-16 is no more complicated
> than UTF-32, even when considering surrogates.
>
> UTF-32 would cause a huge amount of ambiguity though about what happens to
> all of those UTF-16 sequences that currently sort-of work even though they
> shouldn't really because ES is nominally UCS-2.
>
Personally, I think UTF-16 is more prone to error than either UTF-8 or
UTF-32. In UTF-32 there is a one-to-one correspondence between code units
and code points, while in UTF-8 it is obvious you have to deal with
multi-byte sequences. With UTF-16, most developers only ever run into BMP
characters and just assume that there is a one-to-one correspondence between
chars and characters. Then, when their code meets non-BMP characters, things
break: for example, a field whose size was restricted to some number of
chars is no longer long enough. The problems arise infrequently, which means
many developers assume the problem doesn't exist.
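
To make the field-size problem concrete, here is a minimal sketch. It assumes a modern ECMAScript engine (ES2015 or later, which postdates this thread), and the helper names codePointLength and truncateUnits are hypothetical, made up for illustration only. It shows one non-BMP character taking two chars, and a length limit counted in chars cutting a surrogate pair in half unless the code checks for it.

```javascript
// U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP, so UTF-16 stores it
// as a surrogate pair: high surrogate 0xD834 followed by low surrogate 0xDD1E.
const clef = "\uD834\uDD1E";
const name = "ab" + clef;

// Hypothetical helper: count code points rather than UTF-16 code units.
// Array.from uses the string iterator, which walks by code point.
function codePointLength(s) {
  return Array.from(s).length;
}

// Hypothetical helper: keep at most maxUnits code units, but never leave
// a high surrogate dangling at the end of the result.
function truncateUnits(s, maxUnits) {
  if (s.length <= maxUnits) return s;
  let end = maxUnits;
  const last = s.charCodeAt(end - 1);
  const next = s.charCodeAt(end);
  const lastIsHigh = last >= 0xD800 && last <= 0xDBFF;
  const nextIsLow = next >= 0xDC00 && next <= 0xDFFF;
  // The cut falls inside a pair, so drop the high surrogate as well.
  if (lastIsHigh && nextIsLow) end -= 1;
  return s.slice(0, end);
}

console.log(name.length);                                  // 4 chars
console.log(codePointLength(name));                        // 3 characters
console.log(name.slice(0, 3).charCodeAt(2).toString(16));  // d834, a lone high surrogate
console.log(truncateUnits(name, 3));                       // "ab"
```

Note that this only handles surrogates. It does nothing about combining sequences, which is Shawn's point above: counting code points is still not the same as counting what a user sees as characters.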
--
John A. Tamplin
Software Engineer (GWT), Google