UTF-16 vs UTF-32

John Tamplin jat at google.com
Mon May 16 18:07:05 PDT 2011


On Mon, May 16, 2011 at 8:42 PM, Shawn Steele <Shawn.Steele at microsoft.com> wrote:

> It's clear why we want to support the full Unicode range, but it's less
> clear to me why UTF-32 would be desirable internally.  (Sure, it'd be nice
> for conversion types).
>
> What UTF-32 has that UTF-16 doesn't is the ability to walk a string without
> accidentally chopping up a surrogate pair.  However, in practice, stepping
> over surrogates is pretty much the least of the problems with walking a
> string.  Combining characters and the like cause numerous typographical
> shapes/glyphs to be represented by more than one Unicode codepoint, even in
> UTF-32.  We don't see that in Latin so much, especially in NFC, but in some
> scripts most characters require multiple code points.
>
> In other words, if I'm trying to find "safe" places to break a string,
> append text, or many other operations, then UTF-16 is no more complicated
> than UTF-32, even when considering surrogates.
>
> UTF-32 would cause a huge amount of ambiguity though about what happens to
> all of those UTF-16 sequences that currently sort-of work even though they
> shouldn't really because ES is nominally UCS-2.
>

Personally, I think UTF-16 is more prone to error than either UTF-8 or UTF-32.
In UTF-32 there is a one-to-one correspondence between code units and code
points, while in UTF-8 it is obvious that you have to deal with multi-byte
encodings.  With UTF-16, most developers only run into BMP characters and
just assume that there is a one-to-one correspondence between chars and
characters.  Then, when their code runs into non-BMP characters, they hit
problems such as a field whose size is restricted to a number of chars no
longer being long enough.  The problems arise infrequently, which means many
developers assume the problem doesn't exist.
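
To make the field-length problem concrete, here is a minimal sketch in
TypeScript.  The helper names (isHighSurrogate, isLowSurrogate,
codePointLength, truncateCodeUnits) are hypothetical and exist only for this
illustration; they are not part of any proposal or library.  The sketch
relies only on String.prototype.charCodeAt, length, and slice, and it
assumes the limit is expressed in UTF-16 code units.

```typescript
// Hypothetical helpers for illustration only.

function isHighSurrogate(unit: number): boolean {
  return unit >= 0xd800 && unit <= 0xdbff;
}

function isLowSurrogate(unit: number): boolean {
  return unit >= 0xdc00 && unit <= 0xdfff;
}

// Count code points by treating each well-formed surrogate pair as one.
function codePointLength(s: string): number {
  let count = 0;
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (isHighSurrogate(unit) && i + 1 < s.length &&
        isLowSurrogate(s.charCodeAt(i + 1))) {
      i++; // skip the low half of the pair
    }
    count++;
  }
  return count;
}

// Truncate to at most maxUnits code units without splitting a surrogate pair.
function truncateCodeUnits(s: string, maxUnits: number): string {
  let end = Math.max(0, maxUnits);
  if (s.length <= end) return s;
  // If the cut would land between the two halves of a pair, back up one unit.
  if (end > 0 && isHighSurrogate(s.charCodeAt(end - 1)) &&
      isLowSurrogate(s.charCodeAt(end))) {
    end--;
  }
  return s.slice(0, end);
}

const bmpOnly = "abcd";
const nonBmp = "ab\uD834\uDD1Ed"; // contains U+1D11E, outside the BMP

console.log(bmpOnly.length, codePointLength(bmpOnly)); // 4 4
console.log(nonBmp.length, codePointLength(nonBmp));   // 5 4
console.log(truncateCodeUnits(nonBmp, 3));             // "ab"
```

Both strings look like four characters to a user, but the second needs five
chars, so a field limited to four chars rejects it.  A naive
nonBmp.substring(0, 3) would also leave a lone high surrogate at the end,
which is why the truncation helper backs up by one unit.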

-- 
John A. Tamplin
Software Engineer (GWT), Google

